UBC Theses and Dissertations

Large-scale optimization algorithms for missing data completion and inverse problems. Da Silva, Curt, 2017.

Full Text

Large-scale optimization algorithms for missing data completion and inverse problems

by Curt Da Silva
B.Sc., The University of British Columbia, 2011

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Faculty of Graduate and Postdoctoral Studies (Mathematics)

The University of British Columbia (Vancouver)
September 2017
© Curt Da Silva, 2017

Abstract

Inverse problems are an important class of problems found in many areas of science and engineering. In these problems, one aims to estimate unknown parameters of a physical system through indirect multi-experiment measurements. Inverse problems arise in a number of fields including seismology, medical imaging, and astronomy, among others.

An important aspect of inverse problems is the quality of the acquired data itself. Real-world data acquisition restrictions, such as time and budget constraints, often result in measured data with missing entries. Many inversion algorithms assume that the input data is fully sampled and relatively noise-free, and produce poor results when these assumptions are violated. Given the multidimensional nature of real-world data, we propose a new low-rank optimization method on the smooth manifold of Hierarchical Tucker tensors. Tensors that exhibit this low-rank structure can be recovered by solving this non-convex program in an efficient manner. We successfully interpolate realistically sized seismic data volumes using this approach.

If our low-rank tensor is corrupted with non-Gaussian noise, the resulting optimization program can be formulated as a convex-composite problem. This class of problems involves minimizing a non-smooth but convex objective composed with a nonlinear smooth mapping. In this thesis, we develop a level set method for solving composite-convex problems and prove that the resulting subproblems converge linearly. We demonstrate that this method is competitive when applied to examples in noisy tensor completion, analysis-based compressed sensing, audio declipping, total-variation deblurring and denoising, and one-bit compressed sensing.

With respect to solving the inverse problem itself, we introduce a new software design framework that manages the cognitive complexity of the various components involved. Our framework is modular by design, which enables us to easily integrate and replace components such as linear solvers, finite difference stencils, preconditioners, and parallelization schemes. As a result, a researcher using this framework can formulate her algorithms with respect to high-level components such as objective functions and Hessian operators. We showcase the ease with which one can prototype such algorithms on a 2D test problem and, with little code modification, apply the same method to large-scale 3D problems.

Lay Summary

Inverse problems are a class of important problems in science and engineering applications, wherein we measure the response of a physical system and want to infer the intrinsic parameters of the system that produced those measurements. For instance, medical imaging attempts to reconstruct an image of the body's internal structure given electric field data measured along the skin. The measured data itself must be of sufficiently high quality and coverage in order to estimate these parameters accurately. In this thesis, we develop methods for completing data that has not been fully collected due to time or budget constraints.
We also study a class of optimization problems that can handle input data contaminated with large, noisy outliers. The final topic presented in this thesis involves designing software for solving these inverse problems in an efficient, flexible, and scalable manner.

Preface

This thesis consists of my original research, conducted at the Department of Mathematics at the University of British Columbia, Vancouver, Canada, under the supervision of Professor Felix Herrmann as part of the Seismic Laboratory for Imaging and Modelling (SLIM). The following chapters contain previously published or submitted work for which I was the principal investigator and author. I formulated this research program, developed the resulting theory and numerical software, and wrote the entirety of the articles below, subject to minor editorial suggestions from my supervisor. Tristan van Leeuwen gave some valuable input on the presentation of Chapter 5 and Aleksandr Aravkin gave helpful feedback on an early version of Chapter 4.

A version of Chapter 2 was published in [82], which significantly extends earlier conference proceedings in [79, 81]. A version of Chapter 5 has been submitted for publication as [83], which expands upon the conference proceedings in [80], and is currently under peer review.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Algorithms
Acknowledgements
1 Introduction
  1.1 Missing Data Completion
  1.2 Convex Composite Optimization
  1.3 Inverse Problems
  1.4 Thesis Outline and Overview
2 Low-rank Tensor Completion
  2.1 Introduction
  2.2 Previous Work
  2.3 Contributions and Outline
  2.4 Notation
    2.4.1 Matricization
    2.4.2 Multilinear product
    2.4.3 Tensor-tensor contraction
  2.5 Smooth Manifold Geometry of the Hierarchical Tucker Format
    2.5.1 Hierarchical Tucker Format
    2.5.2 Quotient Manifold Geometry
  2.6 Riemannian Geometry of the HT Format
    2.6.1 Riemannian metric
    2.6.2 Riemannian gradient
    2.6.3 Tensor Completion Objective and Gradient
    2.6.4 Objective function
    2.6.5 Riemannian gradient
  2.7 Optimization
    2.7.1 Reorthogonalization as a retraction
    2.7.2 Vector transport
    2.7.3 Smooth optimization methods
    2.7.4 First order methods
    2.7.5 Line search
    2.7.6 Gauss-Newton Method
    2.7.7 Regularization
    2.7.8 Convergence analysis
  2.8 Numerical Examples
    2.8.1 Seismic data
    2.8.2 Single reflector data
    2.8.3 Convergence speed
    2.8.4 Performance
    2.8.5 Synthetic BG Compass data
    2.8.6 Effect of varying regularizer strength
  2.9 Discussion
  2.10 Conclusion
3 The Moreau Envelope and the Polyak-Lojasiewicz Inequality
  3.1 Introduction
  3.2 Notation
  3.3 Lemmas
4 A level set, variable projection approach for convex composite optimization
  4.1 Introduction
  4.2 Technique
  4.3 Theory
  4.4 Numerical Examples
    4.4.1 Cosparsity
    4.4.2 Robust Tensor PCA / Completion
    4.4.3 One-bit Compressed Sensing
  4.5 Discussion
  4.6 Conclusion
  4.7 Acknowledgements
5 A Unified 2D/3D Large Scale Software Environment for Nonlinear Inverse Problems
  5.1 Introduction
    5.1.1 Our Contributions
  5.2 Preamble: Theory and Notation
  5.3 From Inverse Problems to Software Design
    5.3.1 Extensions
  5.4 Multi-level Recursive Preconditioner for the Helmholtz Equation
  5.5 Numerical Examples
    5.5.1 Validation
    5.5.2 Full Waveform Inversion
    5.5.3 Sparsity Promoting Seismic Imaging
    5.5.4 Electromagnetic Conductivity Inversion
    5.5.5 Stochastic Full Waveform Inversion
  5.6 Discussion
  5.7 Conclusion
6 Conclusion
Bibliography
Appendices
Appendix A  A 'User Friendly' Guide to Basic Inverse Problems

List of Tables

Table 2.1 Comparison between geomCG implementations on a 200 x 200 x 200 random Gaussian tensor with multilinear rank 40 and subsampling factor of 0.1, both run for 20 iterations. Quantities are relative to the respective underlying solution tensor.
Table 2.2 Reconstruction results for single reflector data, missing points: mean SNR over 5 random training sets. Values are SNR (dB) and time (in seconds) in parentheses.
Table 2.3 Reconstruction results for single reflector data, missing receivers: mean SNR over 5 random training test sets. Values are SNR (dB) and time (in seconds) in parentheses.
Table 2.4 HT recovery results on the BG data set, randomly missing receivers. Starred quantities are computed with regularization.
Table 2.5 HT parameters for each data set and the corresponding SNR of the HT-SVD approximation of each data set. The 12.3 Hz data is of much higher rank than the other two data sets and thus is much more difficult to recover.
Table 4.1 Cosparse recovery results.
Table 4.2 Audio declipping produces much better results with p = 0 compared to p = 1, but the computational times become daunting as the ℓ0-norm increases.
Table 4.3 Summary of recovery results.
Table 4.4 Huber recovery performance versus the Huber parameter.
Table 4.5 m = 2000, n = 1000, k = 100, sparse signal.
Table 4.6 m = 2000, n = 1000, k = 100, compressible signal.
Table 4.7 m = 500, n = 1000, k = 100, sparse signal.
Table 4.8 m = 500, n = 1000, k = 100, compressible signal.
Table 5.1 Quantities of interest for PDE-constrained optimization.
Table 5.2 Number of PDEs (per frequency/source) for each optimization quantity of interest.
Table 5.3 Preconditioner performance as a function of varying points per wavelength and number of wavelengths. Values are the number of outer FGMRES iterations; in parentheses are the number of grid points (including the PML) and the overall computational time (in seconds), respectively.
Table 5.4 Memory usage for a constant-velocity problem as the number of points per wavelength increases.
Table 5.5 Adjoint test results for a single instance of randomly generated vectors x, y, truncated to four digits for spacing reasons. The linear systems involved are solved to a tolerance of 10^-10.

List of Figures

Figure 2.1 Complete dimension tree for {1, 2, 3, 4, 5, 6}.
Figure 2.2 Visualizing the Hierarchical Tucker format for a 4D tensor X of size n1 x n2 x n3 x n4.
Figure 2.3 Forward and adjoint depictions of the Gramian mapping differential.
Figure 2.4 Dimension tree for seismic data.
Figure 2.5 Reconstruction results for 90% missing points, best results for geomCG and HTOpt.
Figure 2.6 Reconstruction results for sampled receiver coordinates, best results for geomCG and HTOpt. Top row: 90% missing receivers. Bottom row: 70% missing receivers.
Figure 2.7 Convergence speed of various optimization methods.
Figure 2.8 Dense and sparse objective, gradient performance.
Figure 2.9 HT interpolation results on the BG data set with 75% missing receivers at 4.68 Hz; figures are shown for fixed source coordinates.
Figure 2.10 HT interpolation results on the BG data set with 75% missing receivers at 7.34 Hz; figures are shown for fixed source coordinates.
Figure 2.11 HT interpolation results on the BG data set with 75% missing receivers at 12.3 Hz; figures are shown for fixed source coordinates.
Figure 2.12 Regularization reduces some of the spurious artifacts and reduces overfitting in the case where there is very little data. 4.86 Hz data, 90% missing receivers.
Figure 2.13 Recovery SNR versus the log10 of the regularization parameter for 4.68 Hz data, 90% missing receivers.
Figure 4.1 A cartoon depiction of the level set method for convex-composite optimization.
Figure 4.2 True and subsampled signal, 50% receivers removed.
Figure 4.3 Recovery of a common source gather (fixed source coordinates). Displayed values are SNR in dB.
Figure 4.4 TV deblurred image in the noise-free case.
Figure 4.5 TV deblurred image in the noisy case.
Figure 4.6 Declipping the "Glockenspiel" audio file. The first three seconds are shown.
Figure 4.7 Recovery of a common source gather.
Figure 5.1 Software hierarchy.
Figure 5.2 Data distributed over the joint (source, frequency) indices.
Figure 5.3 ML-GMRES preconditioner. The coarse-level problem (relative to the finest grid spacing) is preconditioned recursively with the same method as the fine-scale problem.
Figure 5.4 Numerical Taylor error for a 3D reference model.
Figure 5.5 Analytic and numerical solutions for the 2D Helmholtz equation for a single source. The difference is displayed on a colorbar 100x smaller than the solutions. Top row is the real part, bottom row is the imaginary part.
Figure 5.6 Analytic and numerical solutions for the 3D Helmholtz equation (depicted as a 2D slice) for a single source. The difference is displayed on a colorbar 10x smaller than the solutions. Top row is the real part, bottom row is the imaginary part.
Figure 5.7 Analytic and numerical solutions for the 2.5D Helmholtz system for a generated data volume with 100 sources, 100 receivers, and 100 y-wavenumbers. The 2.5D data took 136 s to generate and the 3D data took 8200 s, both on a single machine with no data parallelization. Top row: real part, bottom row: imaginary part.
Figure 5.8 True (left), initial (middle), and inverted (right) models.
Figure 5.9 Sparse seismic imaging: full data least-squares inversion versus linearized Bregman with randomized subsampling.
Figure 5.10 Inversion results when changing the PDE model from the Helmholtz to the Poisson equation.
Figure 5.11 True (left), initial (middle), and inverted (right) models.
Figure 5.12 Relative model error as a function of the number of randomized subproblems solved.
Figure 5.13 True (left) and initial (right) models.
Figure 5.14 True model (left), initial model (middle), inverted model (right) for a variety of fixed coordinate slices.
Figure 5.15 True model (left), initial model (middle), inverted model (right) for a variety of fixed z coordinate slices.

List of Algorithms

Algorithm 2.1 The Riemannian gradient ∇_R f at a point x = (U_t, B_t) ∈ M.
Algorithm 2.2 Objective and Riemannian gradient for separable objectives.
Algorithm 2.3 QR-based orthogonalization [Alg. 3, 109].
Algorithm 2.4 General Nonlinear Conjugate Gradient method for minimizing a function f over H.
Algorithm 2.5 The inverse Gauss-Newton Hessian applied to a vector.
Algorithm 2.6 The Gramian mapping differential DG[B_t].
Algorithm 2.7 The adjoint Gramian mapping differential DG*[G_t].
Algorithm 4.1 The VELVET algorithm for solving problem (4.1).
Algorithm 5.1 Standard multigrid V-cycle.
Algorithm 5.2 Linearized Bregman with per-iteration randomized subsampling.

Acknowledgements

I would first and foremost like to thank my supervisor, Dr. Felix Herrmann, who initially convinced me to join the SLIM group over the course of coffee and by regaling me with all of the interesting problems that his group tackled. Collectively, the group has tackled a large number of them, but more seem to keep springing up. Alas, such is the course of research. You have been an avid and enthusiastic supervisor, always in pursuit of solving more interesting yet relevant problems, and that mentality is definitely infectious. Thank you for the constant doses of perspective; it's easy to miss the big picture sometimes and you've taught me to always keep it in mind.

Thank you to my talented and insightful colleagues with whom I've worked closely, Rajiv Kumar and Zhilong Fang; you are all very talented and, more importantly, hard working.
I know that you will all go far in life. Thank you as well to the former postdoctoral fellows in the group, Tristan van Leeuwen and Aleksandr Aravkin, who helped me develop my understanding of optimization during their time with SLIM and continued to provide valuable input on my work even after moving on to other ventures.

Thanks especially to my mother, father, and sister, who have always encouraged me to pursue my education and gave me every opportunity to do so through their hard work. This one's for you!

To my wonderful girlfriend Andrea, you have been so supportive and loving to me throughout my time in graduate school. Now it's time for more adventures together!

To all of my friends at Instant Theatre and elsewhere in the Vancouver comedy scene, you kept me sane during this whole process.

To my wonderful friends and family

Chapter 1
Introduction

Inverse problems are ubiquitous in many science and engineering applications. Inverse problems aim to estimate an unknown quantity of interest (e.g., sound speed of the earth, tissue conductivity) from indirect measurements on the surface or boundary of a region. These problems arise in a large number of fields, including seismology [261, 243, 244], medical imaging [71, 18, 102, 42], structural engineering [138], chemistry [33], radar imaging [35], quantum scattering theory [62], astronomy [78], and image processing [26, 100], among others.

Inverse problems are described by a forward modelling operator, denoted F, which maps parameters m to predicted data y = F(m). Encoded in F is the entirety of our underlying assumptions about the physics of our model and the geometry of our acquisition. This mapping is typically nonlinear and differentiable. The goal of an inverse problem is to find parameters m* that generate data that are as close as possible to our measured data y_meas, i.e.,

    F(m*) ≈ y_meas.

More precisely, we often compute m* as the solution to the optimization problem

    min_m ϕ(F(m), y_meas)

where ϕ is the measure of misfit between our predicted and observed data. Solving an inverse problem provides us with a noninvasive means to "see" into a hidden region of space. The estimated parameters then form the image of our region of interest, whether that be a conductivity map of a potentially cancerous tissue or a velocity model of the subsurface of the Earth. For a comprehensive review of computational methods for inverse problems, we refer the interested reader to [262].

An important aspect of such problems is the measured data itself, which is often required to be fully sampled and relatively noise-free in order for inversion algorithms to produce meaningful results. These data volumes are often multidimensional, depending on several spatial or angular coordinates as well as time. As a result, sampling, say, N points in each of the d dimensions leads to storage costs on the order of N^d points, an exponential dependence on the number of dimensions. This is the so-called curse of dimensionality, which creates onerous storage and computational costs when processing these volumes. Insisting that the data be fully acquired in all dimensions incurs enormous time and budgetary costs as practitioners aim to reconstruct these volumes through classical Whittaker-Nyquist-Kotelnikov-Shannon sampling [226, 264, 181]. Broadly speaking, in order to reconstruct a one-dimensional signal with bandwidth W Hertz, one must acquire at least 2W samples per unit interval.
Although mathematically elegant, this sampling paradigm is incredibly restrictive for reconstructing realistic data volumes due to their high dimensionality. As a result, fully sampling a data volume is an arduous task that we will instead avoid in favour of measuring far fewer samples of the data volume.

Real-world signals often possess much more structure than being merely band-limited, which can be exploited for recovering the signal from subsampled measurements. For instance, many natural images have been shown empirically to be sparse (i.e., to possess a small number of non-zero coefficients relative to the ambient dimensionality) in the wavelet basis [99]. The field of compressed sensing (CS) aims to exploit the sparsity of the signal (in a particular basis) in order to acquire samples at sub-Nyquist rates. A practitioner samples the signal at a rate commensurate with the sparsity of the signal rather than the dimensionality of the ambient space and, from this subsampled data, one reconstructs the signal by solving an associated optimization program. There have been a large number of successful applications of CS to recover sparse signals from subsampled measurements, including medical imaging [177, 68], radar imaging [123], and geophysics [129, 125, 130]. The attractive theoretical and numerical results of CS have motivated researchers to consider various extensions to other low-dimensional structures embedded in high-dimensional spaces. The notion of low rank for matrices is directly analogous to sparsity for one-dimensional signals, whereby one imposes a sparsity constraint on the singular values of a given matrix. Many similar theoretical guarantees can be proved for recovering low-rank matrices from a subset of missing entries, the so-called matrix completion problem, and a variety of successful techniques have arisen for solving these problems [52, 49, 12]. Extending these notions to higher-dimensional tensors proves challenging, as one no longer has the benefit of a unique singular value decomposition (SVD) for tensors. Given that a large amount of numerical and theoretical work for matrix completion is predicated on the decomposition of a matrix into its constituent SVD components, even generalizing the definition of a low-rank tensor is non-trivial.

This question of whether one should interpolate the data or use some other technique to solve the corresponding inverse problem is open, in general, despite the working assumption by practitioners that having more data leads to more robust results. The authors in [218] show that insisting that the data be fully sampled can be relaxed in some inverse problem scenarios involving DC-resistivity. The work of [6] explores using a regularized least-squares approach to solving a financial inverse problem compared to data completion. The authors also consider the problem of uncertainty in the locations of the measured data, and similarly in [7]. For the linearized seismic imaging problem, [189] demonstrates that using standard inversion algorithms with missing data produces exceptionally poor results, leading to a large imprint of the gap in source/receiver coverage in the inverted model, i.e., a large so-called acquisition footprint. Thus far, there are few definitive answers as to whether one should complete missing data prior to inversion for general problems.
In this thesis, we will proceed as if the process of interpolating data, whether to produce a fully sampled volume for inversion or for its own sake, is a worthwhile endeavour.

The presence of noise in acquired signals induces a corresponding noise model in the resulting optimization program. Due to its simplicity, the most common assumption made is that the noise obeys a multivariate Gaussian distribution, often with an identity covariance matrix, which results in minimizing an ℓ2-norm difference between the observed and predicted data. Although desirable from a practical point of view, since the data misfit is smooth and easy to minimize, this noise model is often not realistic. In particular, the ℓ2 norm is particularly brittle in the presence of gross outliers, where a small subset of the signal of interest is corrupted by high-magnitude values, e.g., dead pixels in a video stream or a few receivers in an array failing in a seismic experiment [143]. Introducing a penalty that is more robust to such outliers, such as the ℓ1 norm [266], can alleviate these issues, as it penalizes large residual values far less than the corresponding ℓ2 penalty. Solving such data recovery problems in this case, however, becomes algorithmically challenging due to the non-smoothness of the resulting data misfit and the resulting constraints that do not possess simple and efficient projection operators. Even in the ideal noise-free case, promoting sparsity or low rank in a given optimization program from first principles is an involved procedure, as it involves non-smooth optimization.

Assuming that we have successfully reconstructed a reasonable approximation of our fully sampled data volume, we focus our attention on solving the resulting inverse problem itself, which is an interesting software design challenge. There are a large number of research areas in which one must be well-versed in order to solve inverse problems, but software design is often not among them. As such, many academic researchers have designed codes that are mathematically correct, but scale poorly in terms of memory usage or computational complexity to realistically sized problems. In industrial settings, the large computational costs of solving such problems often result in code that is exceedingly fast after many programmer-hours spent hand-tuning higher-level code. This process produces software that lacks flexibility and is not easily modifiable, and in many cases may not even satisfy essential mathematical properties such as the computed gradient being the true gradient of the objective functional.

The main purpose of this thesis is to develop new techniques for the problems of tensor completion, composite-convex optimization, and software design for inverse problems. Our developments in tensor completion will allow us to recover low-rank tensors with missing entries by exploiting the geometric structure of the particular tensor format we consider. These algorithms will allow us to efficiently complete large data volumes. For our composite-convex work, we propose a novel method for solving this class of optimization programs that will enable us to solve robust tensor completion problems, allowing us to denoise and interpolate input data that has been corrupted by high-amplitude noise. This class of optimization programs is also very general, which will allow us to solve similar problems such as one-bit compressed sensing and audio declipping.
The software design that we propose in this thesis helps manage the complexity of constructing a framework for solving these problems through a hierarchical design. Our approach bridges the gap between purely performance-oriented codes and the mathematical objects that they represent, allowing us to solve large-scale inverse problems on a distributed cluster in a straightforward manner. Although these are seemingly disparate topics at first glance, they all comprise important elements of the entire process, from start to finish, of solving inverse problems in real-world contexts. One of the primary focuses of this work is developing algorithms and mathematical software that scale effectively to large-scale problems.

1.1 Missing Data Completion

The field of compressed sensing has revolutionized the field of signal acquisition and reconstruction since the initial papers were published in the mid-2000s [57, 58, 90], wherein the authors studied the reconstruction of sparse signals from a sub-Nyquist number of random measurements. This work was insightful not merely for the application of ℓ1-based signal reconstruction, which had been observed to succeed empirically in the geophysical literature since the 1970s [74, 245], but also for the subsequent rigor used in theoretically proving that this method reconstructs the signal with high probability under certain conditions on the sampling operator. This work also generated significant interest for practitioners by showing that one could forgo solving the original sparsity-promoting program, which is NP-hard, and instead seek the solution of its convex relaxation, which can be solved in polynomial time. Under particular conditions on the sampling and measurement operators, these two solutions are identical.

Generalizing such ideas to the recovery of matrix-valued signals, i.e., two-dimensional signals versus the one-dimensional vectors considered in CS, leads to the notion of matrix completion. One of the most well-known instances of matrix completion is the Netflix problem [27], which aims to fill in a user-movie matrix X*, where the entries X*_{i,j} correspond to the rating, between 1 and 5 stars, that the ith user gave the jth movie. Given that any given Netflix user watches a small number of movies relative to the total number of movies available, this matrix has a large number of missing entries. Netflix originally offered a $1 million prize to the researchers who developed an algorithm that would improve on the accuracy of their internal recommender system by 10%. One of the techniques that spurred a large amount of researcher interest, although it ultimately did not win the prize, was the assumption that the underlying, unknown matrix X was low-rank [158]. That is to say, given the singular value decomposition X = U Σ V^T, with U^T U = I_m, V^T V = I_n, and Σ = diag(σ) with σ = (σ_1, σ_2, ..., σ_p) the vector of singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0 and p = min(m, n), the rank r, i.e., the integer such that σ_r > σ_{r+1} = ... = σ_p = 0, satisfies r ≪ p. We denote σ_max(X) = σ_1 and σ_min(X) = σ_p. Under this low-rank assumption, one can estimate X by solving the following optimization problem

    min_X ‖X‖_*   such that   P_Ω X = b.      (1.1)

Here P_Ω X is the matrix satisfying

    (P_Ω X)_{i,j} = X_{i,j} if (i, j) ∈ Ω, and 0 otherwise,

where Ω = {(i, j) : (i, j) is known} and b = P_Ω X*, for simplicity.
The nuclear norm ‖X‖_* = Σ_{i=1}^p σ_i is the sum of the singular values of X, i.e., ‖X‖_* = ‖σ‖_1. Under appropriate conditions on the sampling operator, specifically that Ω is chosen uniformly at random with |Ω| ≥ C r (m + n) log^2(n) for some constant C, and on the incoherence of the matrix X* itself, the solution to the convex problem (1.1) is exactly X* [213]. Many other applications of matrix completion have arisen in geophysics [270, 164, 9], genomics [50], and phase retrieval [60], to name a few.

Problem (1.1) is a convex program and, as such, any local minimizer is a global minimizer. Algorithms that attempt to solve (1.1) via nuclear norm thresholding, such as in [49], scale poorly as the ambient dimensions m, n become large, owing to the necessity of computing per-iteration SVDs. Techniques that circumvent this bottleneck, and thus scale much more reasonably, involve representing X in terms of its low-rank factors X = L R^T [214, 12] or in terms of its SVD factors X = U Σ V^T, with the accompanying orthogonality requirements [260]. In this case, the number of parameters is drastically reduced compared to the ambient space and the resulting operations do not involve computing SVDs of large matrices.
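To make the contrast concrete, the following sketch completes a matrix by gradient descent on the low-rank factors L and R of the model X = L R^T, so that no SVDs are ever formed. This is a toy Python/NumPy illustration under our own assumptions (the function name, random initialization, fixed step size, and iteration count are all illustrative choices), not the implementation used in this thesis.

import numpy as np

def complete_lowrank(shape, omega, b, rank, iters=2000, step=1e-2, seed=0):
    """Fit X = L @ R.T to the observed entries b at indices omega = (rows, cols)
    by plain gradient descent on the factors; no SVDs are computed."""
    m, n = shape
    rng = np.random.default_rng(seed)
    L = 0.1 * rng.standard_normal((m, rank))
    R = 0.1 * rng.standard_normal((n, rank))
    rows, cols = omega
    for _ in range(iters):
        # residual of the current model on the observed entries only
        resid = np.sum(L[rows] * R[cols], axis=1) - b
        gL = np.zeros_like(L)
        gR = np.zeros_like(R)
        np.add.at(gL, rows, resid[:, None] * R[cols])
        np.add.at(gR, cols, resid[:, None] * L[rows])
        L -= step * gL
        R -= step * gR
    return L @ R.T

In practice the rank, step size, and stopping criterion would be tuned to the problem at hand, or the factors would be updated by alternating least squares rather than plain gradient steps.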
Attempts to generalize the theoretical and numerical machinery from matrix completion to completing multidimensional tensors are fraught with a number of complications. The rank of a matrix X is defined in terms of its singular value decomposition, which does not possess a unique extension in more than two dimensions. Attempts to generalize a particular aspect of the SVD result in differing tensor formats, each of which has its own advantages and disadvantages. Requiring that a tensor X be written as a sum of outer products of rank-one tensors leads to the so-called Candecomp/Parafac (CP) decomposition [119], which leads to a low number of effective parameters, but the notion of tensor rank is difficult to analyze for generic tensors. The Tucker format [134, 250] generalizes the CP format, removing some of its theoretical difficulties at the expense of restoring the exponential dependence of the number of parameters on the dimension. The Hierarchical Tucker format [115] overcomes the shortcomings of these other formats by recursively splitting subsets of dimensions. As we shall see, this format possesses a large amount of structure that we can exploit for completing low-rank tensors with missing entries. Specifically, as shown in [253], the set of such tensors is a smooth manifold, which is a smooth, nonlinear set that locally behaves like standard Euclidean space R^n. We formulate algorithms on this tensor manifold in a similar manner to those specified on various matrix manifolds [1].

1.2 Convex Composite Optimization

Consider the following data-fitting problem

    y = c(x) + n

where c captures the linear or nonlinear mapping of parameters x to data c(x) and n is unknown noise. When n is impulsive, that is to say, it has high amplitudes but is spatially sparse, n is often modelled as an independently and identically distributed random variable with a Laplace probability distribution, where the probability density function for the ith component of n, denoted n_i, satisfies

    p(n_i) ∝ e^{-|n_i|}.

The maximum likelihood estimator for x is the solution to the problem

    max_x e^{-‖c(x) - y‖_1}

or, equivalently, taking the negative logarithm of the above expression,

    min_x ‖c(x) - y‖_1.      (1.2)

This idea was developed in [247] in the context of the robust LASSO estimator, which is equivalent to using a Laplacian prior on the residuals. When c is a linear mapping, c ∈ R^{m×n}, this is the famous least absolute deviation regression problem [206]. Minimizing the ℓ1 norm can be written as a linear program [24], for which many methods have been developed [176]. Solving (1.2) is computationally challenging when the matrix c contains a large number of rows or columns, and various techniques have been proposed to reduce the dimensionality of the problem through randomized subsampling [269, 75]. In particular, although the problem is convex, the ℓ1 norm is not smooth, and straightforward subgradient methods converge at a sublinear rate [40, 232], which is far too slow for realistic problems.

Problem (1.2) is a particular instance of convex composite optimization, a general class of problems of the form

    min_x h(c(x))

where h is a convex, typically non-smooth function and c is a linear or smooth nonlinear mapping. We will develop algorithms for this class of problems further in this thesis.

1.3 Inverse Problems

Inverse problems can generally be considered as data-fitting problems, wherein we aim to find unknown parameters m that minimize our measure of deviation, typically a norm, between the true and predicted data. We consider the following multi-experiment system

    y_i = F_i(m) + ε_i,   i = 1, 2, ..., n_s,

where y_i ∈ C^{n_r} is the measured data corresponding to the ith source, F_i(m) is the predicted data for the ith experiment from the parameters m, ε_i is the noise in the ith source experiment, and n_s is the number of sources employed. We assume that the total number of source experiments is large, so that n_s ≫ 1, and the index i can be a multi-index of varying dimensionality, depending on the experimental setup considered. The goal of this problem is to estimate model parameters m* such that

    m* = argmin_m Σ_{i=1}^{n_s} ϕ(F_i(m), y_i)      (1.3)

where σ is related to the noise level and ϕ(x, y) is a misfit function between inputs x and y, i.e., ϕ(x, y) ≥ 0 with ϕ(x, y) = 0 ↔ x = y, and ϕ(x, y) = ϕ(y, x). If the noise vectors ε_i are independently and identically distributed (i.i.d.) with probability density function ρ(z), then setting ϕ(x, y) = -log ρ(|x - y|) yields the interpretation of (1.3) as an instance of maximum likelihood estimation, which is seen by taking the exponential of the negative objective in (1.3). The classical ℓ2-norm misfit arises from assuming a zero-mean Gaussian model on the noise vectors, whereas an ℓ1 misfit corresponds to assuming a Laplacian distribution. More robust options are also available when the noise statistics are only partially known, such as the Student's t misfit [10] and the Huber norm [114].
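As a concrete instance of such a robust misfit, the sketch below evaluates the Huber penalty and its gradient for a residual vector r = F_i(m) - y_i. It is a minimal Python/NumPy illustration assuming real-valued residuals; the function name and the threshold parameter eps are our own choices rather than notation from this thesis.

import numpy as np

def huber_misfit(r, eps=1.0):
    """Huber penalty applied elementwise to the residual r: quadratic for
    |r| <= eps and linear beyond, so gross outliers are penalized far less
    than under the least-squares misfit. Returns (value, gradient)."""
    small = np.abs(r) <= eps
    value = np.sum(np.where(small, 0.5 * r**2, eps * (np.abs(r) - 0.5 * eps)))
    grad = np.where(small, r, eps * np.sign(r))
    return value, grad

Because the penalty grows only linearly outside the threshold, a handful of corrupted measurements contributes far less to the objective, and to its gradient, than it would under the ℓ2 misfit.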
The challenges of implementing algorithms that solve problem (1.3) and its variants are related to usability and performance. In general, a researcher is typically interested in applying high-level algorithms with queries to the abstract structures present in (1.3). For example, one might be interested in applying an algorithm such as stochastic gradient descent [271], which relies on having algorithm-driven oracle access to the gradients with respect to the jth sample, namely being able to compute ∇_m ϕ(F_j(m), y_j) for arbitrary j. In this case, the details of, say, parallel data distribution, the discretization of the specific PDE describing this system, or the linear solvers used to compute solutions of the PDE are all irrelevant to the data scientist who wants to operate on this abstract level, despite being necessary for the software to produce correct results. If, on the other hand, another researcher develops a new Krylov method or preconditioner for the PDE system, she would like to easily integrate such changes into the overall inversion framework. Researchers should be able to "plug and play" with various components of the system without large amounts of code modification or duplication.

Simultaneously, the high computational costs of these problems merit a software framework that uses efficient operations related to solving the PDEs. For realistic industry-sized 3D problems, even storing the model vector m becomes onerous, on the order of O(10^9) points. With a large number of sources, say O(10^6), a data volume can easily require storage and computation of O(10^15) points, rendering algorithms that are designed for small 2D problems inadequate. Direct solvers become largely impractical as the fill-in becomes memory-intensive, even on a large cluster [191]. Krylov methods become the linear solution method of choice as a result, but these are challenging for the Helmholtz equation in particular, as the indefinite nature of the PDE system makes preconditioning a necessity.
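The separation of concerns described above can be sketched in a few lines: the loop below only ever touches an abstract per-source gradient oracle, while everything about PDE discretization, linear solvers, and parallel data distribution stays hidden behind that call. This is a hypothetical Python/NumPy illustration with names of our own choosing, not the interface of the framework developed in Chapter 5.

import numpy as np

def stochastic_gradient_descent(m0, misfit_grad, n_src, step=1e-2, iters=100, seed=0):
    """Toy stochastic gradient loop for problem (1.3). misfit_grad(m, j) is an
    oracle returning the gradient of phi(F_j(m), y_j) with respect to m."""
    rng = np.random.default_rng(seed)
    m = m0.copy()
    for _ in range(iters):
        j = rng.integers(n_src)           # pick one source experiment at random
        m = m - step * misfit_grad(m, j)  # update using that single-source gradient
    return m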
Various exampleswill show the resulting tightness of these bounds.We propose a new method for solving composite convex optimization problemsin Chapter 4. This class of problems is quite general, encompassing robust principalcomponent analysis, TV-regularization, and one-bit compressed sensing, to name afew examples. Our method uses a level-set approach, first explored in the SPGu1algorithm [255], coupled with the variable projection method [14] to speed up con-vergence. As a result, we do not have to resort to optimizing non-smooth functions,which have slow convergence rates for large scale problems, and instead exploit fastsolvers for solving the resulting smooth subproblems. In this chapter, we provelinear convergence of gradient descent applied to these subproblems under somegeneral conditions by using the recent analysis of the Polyak-Lojasiewicz inequality.Coupled with a natural superlinear convergence of the outer iterations through theuse of the secant method, our algorithm performs very competitively on a numberof large scale problems. We showcase this technique on a variety of seismic interpo-lation problems, including noisy tensor completion, co-sparsity signal recovery, aswell as other examples in image and audio signal processing. This method is ableto efficiently handle both convex and non-convex problems.In Chapter 5, we consider the computational aspects of solving the actual in-verse problem itself. Assuming that our efforts from Chapters 2 and 4 have beensuccessful in providing us with a high-quality representation of our fully-sampleddata set as input, we focus on designing a software environment that enables us, asresearchers, to quickly prototype high-level algorithms to solve these inverse prob-lems. Our software framework deconstructs various aspects of the overall problem into hierarchically-tiered modular components. In this fashion, our software environ-ment is quite flexible, whereby one can easily swap out PDE stencils, precondition-ers, linear solution methods, parallelization schemes, or even the PDE itself. Thebyproduct of making modularity a priority in our design is that integrating highperformance routines into our code becomes straightforward, rather than being anafterthought as in many academic codebases. When dealing with 2D problems, weuse efficient sparse-matrix routines for multiplication and division while for 3D prob-lems, where the system matrix is much too large to be stored explicitly, we employmulti-threaded C-based matrix-vector products that construct the coefficients on-the-fly. The entire framework is agnostic to the underlying dimensionality of theproblem, however, which makes applying algorithms from small test problems torealistically-sized problems a matter of changing a few lines of code. We demon-strate the effectiveness of this design by implementing a number of algorithms on2D and 3D seismic problems and an electromagnetic inversion problem. The result-ing code for each example reflects the mathematics of the underlying problem andreduces the cognitive load on the user.9Chapter 2Low-rank Tensor Completion2.1 IntroductionAcquiring a multidimensional signal from a real-world experiment can be a costlyaffair. When the signal of interest is a discretized continuous signal, there can be anumber of constraints, physical or otherwise, that limit our ability to ideally sampleit. 
For instance, in the seismic case, the tensor of interest is a multidimensional wavefield in the earth's subsurface sampled at an array of receivers located at the surface. In real-world seismic experiments, budgetary constraints or environmental obstructions can limit both the total amount of time available for data acquisition as well as the number and placement of active sources and receivers. Since seismic processing, among other domains, relies on having fully sampled data for drawing accurate inferences, tensor completion is an important technique for a variety of scientific fields that acquire multidimensional data.

We consider the problem of interpolating a d-dimensional tensor from samples of its entries. That is, we aim to solve

    min_{X ∈ H} (1/2) ‖P_Ω X - b‖_2^2,      (2.1)

where P_Ω is a linear operator P_Ω : R^{n_1 × n_2 × ... × n_d} → R^m, b ∈ R^m is our subsampled data satisfying b = P_Ω X* for some "solution" tensor X*, and H is a specific class of low-rank tensors to be specified later. Under the assumption that X* is well approximated by an element in H, our goal is to recover X* by solving (2.1). For concreteness, we concern ourselves with the case when P_Ω is a restriction operator that samples the elements at the multi-indices specified by Ω, i.e.,

    (P_Ω^* P_Ω X)_{i_1, i_2, ..., i_d} = X_{i_1, i_2, ..., i_d} if (i_1, i_2, ..., i_d) ∈ Ω, and 0 otherwise,

and Ω ⊂ [n_1] × [n_2] × ... × [n_d] is the so-called sampling set, where [n] = {1, ..., n}. In the above equation, we suppose that |Ω| = m ≪ n_1 n_2 ... n_d, so that P_Ω is a subsampling operator.

Unlike the matrix case, there is no unique notion of rank for tensors, as we shall see in Section 2.2. There are multiple tensor formats that generalize a particular notion of separability from the matrix case, i.e., there is no unique extension of the SVD to tensors. Although each tensor format can lead to compressible representations of their respective class of low-rank signals, the truncation of a general signal to one of these formats requires access to the fully sampled tensor X (or at the very least query-based access to the tensor in order to achieve reasonable accuracy [22]). This is primarily due to the use of truncated SVDs acting on various reshapings of the tensor. As in matrix completion, randomized missing entries change the behavior of the singular values and vectors of these matricizations and hence of the final approximation. In this chapter, we consider the class of Hierarchical Tucker (abbreviated HT) tensors as our low-rank tensors of interest. The set of all such tensors is a smooth, embedded submanifold of R^{n_1 × n_2 × ... × n_d}, first studied in [253], which we equip with a Riemannian metric. Using this Riemannian structure, we can construct optimization algorithms in order to solve (2.1) for d-dimensional tensors. We will also study some of the effects of higher-dimensional sampling and extend ideas from compressive sensing and matrix completion to the HT tensor case for our specific seismic examples.
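For intuition, the restriction operator P_Ω and its adjoint are simple to express with NumPy's fancy indexing. The sketch below is an illustrative toy with our own function names, not the code used in the numerical experiments of this chapter.

import numpy as np

def sample(X, omega):
    """P_Omega: extract the entries of the tensor X at the multi-indices omega,
    given as a tuple of index arrays (one array per dimension)."""
    return X[omega]

def sample_adjoint(b, omega, shape):
    """Adjoint of P_Omega: place the samples b back into a zero tensor."""
    Xout = np.zeros(shape)
    Xout[omega] = b
    return Xout

# Example: observe 100 distinct random entries of a 10 x 20 x 30 tensor.
rng = np.random.default_rng(0)
shape = (10, 20, 30)
flat = rng.choice(np.prod(shape), size=100, replace=False)
omega = np.unravel_index(flat, shape)
X = rng.standard_normal(shape)
b = sample(X, omega)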
2.2 Previous Work

To provide the reader with some context on tensor representations, let us briefly detail some of the available structured tensor formats. We refer to [152, 157, 111] for a series of comprehensive overviews of structured tensor formats, and in particular to [116] for an algebraic and functional analytic point of view on the subject. In what follows, we let N = max_{i=1,...,d} n_i be the maximum individual dimension size, N^d := Π_{i=1}^d n_i denote the dimension of the ambient space R^{n_1 × n_2 × ... × n_d}, and, for each tensor format discussed, K denotes the maximum of all of the rank parameters associated to that format.

The so-called Candecomp/Parafac (CP) decomposition is a very straightforward application of the separation of variables technique. Very much like the SVD of a matrix, one stipulates that, for a tensor D ∈ R^{n_1 × n_2 × ... × n_d}, one can write it as

    D ≈ Σ_{i=1}^K f_i^{(1)} ⊗ f_i^{(2)} ⊗ ... ⊗ f_i^{(d)}

where ⊗ is the Kronecker product and f_i^{(j)} ∈ R^{n_j}. In addition to its straightforward construction, the CP decomposition of rank K only requires dNK parameters versus the N^d of the full tensor, and tensor-tensor operations can be performed efficiently on the underlying factors rather than the full tensors themselves (see [19] for a comprehensive set of MATLAB tools).

Unfortunately, despite the parsimoniousness of the CP construction, the approximation of an arbitrary (full) tensor by CP tensors has both theoretical and numerical difficulties. In particular, the set of all CP tensors of rank at most K is not closed, and thus a best rank-K approximation is difficult to compute in many cases [85]. Despite this shortcoming, various authors have proposed iterative and non-iterative algorithms in the CP format for approximating full tensors [157] as well as interpolating tensors with missing data, such as the Alternating Least Squares approach (a block Gauss-Seidel type method) proposed alongside the CP format in [61] and [118], with convergence analysis in [252], and a nonlinear least-squares optimization scheme in [3]. The authors in [268] extended the Alternating Least Squares analysis to ensure that it converges globally to a stationary point of a block-convex model, which encompasses a variety of matrix and tensor completion models including the CP format.

The CP format is a specific case of the more general Tucker format, which aims to write a tensor D as a multilinear product

    D ≈ U_1 ×_1 U_2 ×_2 ... U_d ×_d C

where C ∈ R^{k_1 × k_2 × ... × k_d} is the so-called core tensor and the matrices U_j ∈ R^{n_j × k_j}, j = 1, ..., d, are the factors of the decomposition. Here we use the notation of the multilinear product, that is, U_i ×_i C indicates that C is multiplied by U_i in dimension i, e.g., see [85, 84]. We will elaborate on this construction in Section 2.4.2. The CP format follows from this formulation when the core tensor is diagonal, i.e., C_{i_1, i_2, ..., i_d} = C_{i_1, i_1, ..., i_1} δ_{i_1, i_2, ..., i_d}, where δ_{i_1, i_2, ..., i_d} = 1 when i_1 = i_2 = ... = i_d and 0 otherwise.
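As a quick illustration of the parameter savings, the following Python/NumPy sketch (toy code of our own, with illustrative names) assembles a dense tensor from its CP factors, where each factor matrix holds one column per rank-one term:

import numpy as np

def cp_to_full(factors):
    """Assemble a dense tensor from CP factors: factors[j] has shape (n_j, K),
    and term i of the rank-K sum is the outer product of the i-th columns.
    The factors hold d*N*K numbers versus N**d for the dense array."""
    d = len(factors)
    dims = [chr(ord('a') + j) for j in range(d)]              # one letter per mode
    spec = ','.join(s + 'z' for s in dims) + '->' + ''.join(dims)
    return np.einsum(spec, *factors)

# A 10 x 10 x 10 tensor (1000 entries) of CP rank 3, built from 90 parameters.
rng = np.random.default_rng(0)
factors = [rng.standard_normal((10, 3)) for _ in range(3)]
D = cp_to_full(factors)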
The Tucker format enjoys many benefits in terms of approximation properties over its CP counterpart. Namely, the set of all Tucker tensors of at most multilinear rank k = (k_1, k_2, ..., k_d) is closed and, as a result, every tensor D has a best at-most-multilinear-rank-k Tucker approximation. A near-optimal approximation can be computed efficiently by means of the Higher Order SVD [84]. For the tensor completion problem, the authors in [105] consider the problem of recovering a Tucker tensor with missing entries using the Douglas-Rachford splitting technique, which decouples interpolation and regularization by nuclear norm penalization of different matricizations of the tensor into subproblems that are then solved via a particular proximal mapping. An application of this approach to seismic data is detailed in [159] for the interpolation problem and in [160] for denoising. Depending on the size and ranks of the tensor to be recovered, there are theoretical and numerical indications that this approach is no better than penalizing the nuclear norm of a single matricization (see [196] for a theoretical justification in the Gaussian measurement case, as well as [234] for an experimental demonstration of this effect). Some preliminary results on theoretical guarantees for recovering low-rank Tucker tensors from subsampled measurements are given in [140], for pointwise measurements and a suitable, tensor-based incoherence condition, and in [185], which considers a nuclear norm penalty on the matricization of the first d/2 modes of X as opposed to a sum of nuclear norms of each of its d modes, as is typically considered.

Aside from convex relaxations of the tensor rank minimization problem, the authors in [161] develop an alternative manifold-based approach to Tucker tensor optimization, similar to our considerations for the Hierarchical Tucker case, and subsequently complete such tensors with missing entries. In this case, each evaluation of the objective and Riemannian gradient requires O(d(N + |Ω|)K^d + dK^{d+1}) operations, whereas our method only requires O(dNK^2 + d|Ω|K^3 + dK^4) operations. As a result of using the Hierarchical Tucker format instead of the Tucker format, our method scales much better as d, N, and K grow. The differential geometric considerations for the Tucker format were first analyzed in [174]. It is from here that the phrase Dirac-Frenkel variational principle arises in the context of dynamical systems, which corresponds to the Riemannian gradient vanishing at the optimum value of the objective in an optimization context.

Hierarchical Tucker (HT) tensors were originally introduced in [115, 109], with the subclass of Tensor Train (TT) tensors developed independently in [194, 192]. These so-called tensor network decompositions have also been previously explored in the quantum physics community; see, for instance, [117]. TT tensors are often considered over HT tensors owing to their explicit, non-recursive formulation and relative ease of implementation for numerical methods (see, for instance, [193]), although many of the ideas developed for TT tensors extend to the HT tensor case. Previous work on completing tensors in the Tensor Train format includes [110, 136], wherein the authors use an alternating least-squares approach for the tensor completion problem. The derivation of the smooth manifold structure of the set of TT tensors can be found in [137]. This work builds upon the manifold structure of Hierarchical Tucker tensors studied in [253]. The authors in [97] have considered the manifold geometry of tensor networks in Banach spaces, but we will not employ such general machinery here.

Owing to its extremely efficient storage requirements (which are linear in the dimension d as opposed to exponential in d), the Hierarchical Tucker format has enjoyed a recent surge in popularity for parametrizing high-dimensional problems. The hTucker toolbox [162] contains a suite of MATLAB tools for working with tensors in the HT format, including efficient vector space operations, matrix-tensor and tensor-tensor products, and truncations of full arrays to HT format. This truncation, the so-called Hierarchical SVD developed in [109], allows one to approximate a full tensor in HT format with a near-optimal approximation error.
Even thoughthe authors in [22] develop a HT truncation method that does not need access toevery entry of the tensor in order to form the HT approximation, their approachrequires algorithm-driven access to the entries, which does not apply for the seismicexamples we consider below. A HT approach for solving dynamical systems is out-lined in [175], which considers similar manifold structure as in this article appliedin a different context. The authors in [210] also consider the smooth manifold prop-13Chapter 2. Low-rank Tensor Completionerties of HT tensors to construct tensor completion algorithms using a HierarchicalSVD-based approach. As we shall see, since their methods rely on computing SVDsof large matrices, they will have difficulty scaling to tensors with large mode sizesc , unlike the methods discussed below.2.3 Contributions and OutlineIn this chapter, we extend the primarily theoretical results of [253] to practical algo-rithms for solving optimization algorithms on the HT manifold. In Section (2.5.1),we introduce the Hierarchical Tucker format, wherein we restate some of the resultsof [253] to provide context for the Riemannian metric we introduce on the quotientmanifold in Section (2.6). Equipped with this metric, we can now develop optimiza-tion methods on the HT manifold in Section (2.7) that are fast and SVD-free. Forlarge-scale, high-dimensional problems, the computational costs of SVDs are pro-hibitive and affect the scalability of tensor completion methods such as [105]. Sincewe are using the HT manifold rather than the Tucker manifold, we avoid an exponen-tial dependence on the internal rank parameters as in [161]. We initially proposedthe idea for a Riemannian metric on the HT manifold in the conference proceedings[81] and in this chapter, we have subsequently improved upon these results to reducethe overall computational overhead and speed up the convergence of the algorithmby using our Gauss-Newton method. In Section (2.7.7), we exploit the structureof HT tensors to regularize different matricizations of the tensor without havingto compute SVDs of these matricizations, lessening the effects of overfitting whenthere are very few samples available. We conclude by demonstrating the effective-ness of our techniques on interpolating various seismic data volumes with missingdata points in all dimensions as well as missing receivers, which is more realistic.Our numerical results are similar to those presented previously in [81], but muchmore extensive and include our regularization and Gauss-Newton based methods.In this paper, we also compare our method to a reference implementation of [161]and achieve very reasonable results for our seismic data volumes.We note that the algorithmic results here generalize readily to complex tensorcompletion Cn1×n2×···×nd and more general subsampling operators eΩ.2.4 Notation2.4.1 MatricizationWe consider y−dimensional tensors X of size n1×n2×· · ·×ny. t = (t1P t2P O O O P tk) ⊂{1P O O O P y} selects a subset of y dimensions and we denote tx := {1P O O O P y} \ t itscomplement. We let the matricization of a tensor X along the modes t ⊂ {1P O O O P y}be the matrix m(t) such that the indices in t are vectorized along the rows and the14Chapter 2. 
Low-rank Tensor Completionindices in tx are vectorized along the columns, i.e., if we set s = tx, thenm(t) ∈ R(nx1nx2 :::nxk )×(ns1ns2 :::nsd−k )(m(t))(ix1 ;:::;ixk );(is1 ;:::;isd−k ) := Xi1;:::;id OWe also use the notation (·)(t) for the dematricization operation, i.e., (m(t))(t) = X,which reshapes the matricized version of X along modes t back to its full tensorform.2.4.2 Multilinear productA natural operation to consider on tensors is that of the multilinear product [84,253, 109, 157].Definition 2.1: Given a y−tensor X of size n1 × n2 × · · · × ny and matricesVi ∈ Rmi×ni , the multilinear product of {Vi}yi=1 with X, is the m1 ×m2 × OOO×mytensor Y = V1 ×1 V2 ×2 O O O Vy ×yX, is defined in terms of the matricizations of Yasn (i) = Vim(i)Viy ⊗Viy−1 ⊗ O O O Vii+1 ⊗Vii−1 · · · ⊗Vi1 P i = 1P 2P O O O P yOIn terms of indices, this definition is equivalent toYi1;:::;id =n1∑j1=1(V1)i1;j1n2∑j2=1(V2)i2;j2 · · ·nd−1∑jd−1=1(Vy−1)id−1;jd−1nd∑jd=1(Vy)id;jdXj1;j2;:::;jdConceptually, we are applying each operator Vi to dimension i of the tensor X,keeping all other coordinates fixed. For example, when VPmPW are matrices ofappropriate sizes, the quantity VmWi can be written as VmWi = V ×1 W ×2 m.We remark in this instance that the ordering of the unfoldings matters and that thisparticular choice is compatible with the standard kronecker product. We refer to[157] for more details.The standard Euclidean inner product between two y−dimensional tensors mand n can be defined in terms of the standard Euclidean product for vectors, byletting〈XPY〉 := vec(X)i vec(Y)where vec(X) := m(1;2;:::;y) is the usual vectorization operator. This inner productinduces a norm ‖X‖2 on the set of all y−dimensional tensors in the usual way, i.e.,‖X‖2 =√〈XPX〉.Here we state several properties of the multilinear product, which are straight-forward to prove.Proposition 2.2: Let {Vi}yi=1, {Wi}yi=1 be collections of linear operators andXPY be tensors, all of appropriate sizes, so that the multilinear products below arewell-defined. Then we have the following:15Chapter 2. Low-rank Tensor Completion1. (V1 ×1 O O O Vy×y) ◦ (W1 ×1 O O O Wy ×y X) = (V1W1)×1 O O O (VyWy)×y X [85]2. 〈V1 ×1 O O O Vy ×y XP W1 ×1 O O O Wy ×y Y〉 = 〈(Wi1 V1)×1 O O O (Wiy Vy)×y XPY〉2.4.3 Tensor-tensor contractionAnother natural operation to consider between two tensors is tensor-tensor con-traction, a generalization of matrix-matrix multiplication. We define tensor-tensorcontraction in terms of tensors of the same dimension for ease of presentation [94].Definition 2.3 : Given a y−tensor X of size n1 × · · · × ny and a y−tensor Yof size m1 × · · · ×my, select sP t ⊂ {1P O O O P y} such that |s| = |t| and nsi = mti fori = 1P O O O P |s|. 
The tensor-tensor contraction of X and Y along modes sP t, denoted〈XPY〉(s;t), is defined as (2y− (|s|+ |t|))−tensor o of size (nsc Pmtc), satisfyingZ = 〈XPY〉(s;t) = (m(sc)n (t))(sc);(tc)OTensor tensor contraction over modes s and t merely sums over the dimensionsspecified by sP t in X and Y respectively, leaving the dimensions sx and tx free.The inner product 〈XPY〉 is a special case of tensor-tensor contraction whens = t = {1P O O O P y}.We also make use of the fact that when the index sets sP t are sP t = [y] \ i withX, Y, and Vi are appropriately sized for i = 1P O O O P y, then〈V1 ×1 V2 ×2 O O O Vy ×y XPY〉[y]\i;[y]\i =Vi〈V1 ×1 V2 ×2 O O O Vi−1 ×i−1 Vi+1 ×i+1 O O O Vy ×y XPY〉([y]\i;[y]\i)(2.2)i.e., applying Vi to dimension i commutes with contracting tensors over every di-mension except the ith one.2.5 Smooth Manifold Geometry of theHierarchical Tucker FormatIn this section, we review the definition of the Hierarchical Tucker format as well asprevious results [253] in the smooth manifold geometry of this format. We extendthese results in the next section by introducing a Riemannian metric on the spaceof HT parameters and subsequently derive the associated Riemannian gradient withrespect to this metric. A reader familiar with the results in [253] can glance overthis section quickly for a few instances of notation and move on to Section (2.6).16Chapter 2. Low-rank Tensor Completion2.5.1 Hierarchical Tucker FormatThe standard definition of the Hierarchical Tucker format relies on the notion ofa dimension tree, chosen apriori, which specifies the format [109]. Intuitively, thedimension tree specifies which groups of dimensions are “separated” from othergroups of dimensions, where “separation” is used in a similar sense to the SVD intwo dimensions.Definition 2.4 A dimension tree i is a non-trivial binary tree such that• the root, troot, has the label troot = {1P 2P O O O P y}• for every t ̸∈ a, where a is the set of leaves of i , the labels of its left and rightchildren, tlP tr, form a partition of the label for t, i.e., tl ∪ tr = t and tl ∩ tr = ∅.We set c(i ) := i \ a. An example of a dimension tree when y = 6 is given inFigure 2.1.{1, 2, 3, 4, 5, 6} = troot{1, 2, 3, 4} = t{1, 2} = tl{1} {2}{3, 4} = tr{3} {4}{5, 6}{5} {6}Figure 2.1 Complete dimension tree for {1P 2P 3P 4P 5P 6}.Remark For the following derivations, we take the point of view that each quan-tity with a subscript (·)t is associated to the node t ∈ i . By the definition of adimension tree, for each t ∈ i , there is a corresponding subset of {1P O O O P y} associ-ated to t. If our HT tensor has dimensions n1 × n2 × OOO × ny, we let nt =∏i∈t niand, when t ∈ c(i ), nt satisfies nt = ntpntr .Definition 2.5 Given a dimension tree i and a vector of hierarchical ranks(kt)t∈i with kt ∈ Z+, ktroot = 1, a tensor X ∈ Rn1×n2×:::×nd can be written in theHierarchical Tucker format if there exist parameters x = ((jt)t∈LP (Bt)t∈c(i )) suchthat ϕ(x) = X, wherevec(ϕ(x)) = jtp ×1 jtr ×2 Wtroot t = trootjt = (jtp ×1 jtr ×2 Bt)(1;2) t ∈ c(i ) \ troot(2.3)17Chapter 2. Low-rank Tensor Completionwhere jt ∈ Rnx×kx∗ , the set of full-rank nt × kt matrices, for t ∈ a and Bt ∈Rkxp×kxr×kx∗ , the set of 3-tensors of full multilinear rank, i.e.,rank(W(1)t ) = ktp P rank(W(2)t ) = ktr P rank(W(3)t ) = ktOThis technical definition can be visualized as shown in Figure 2.2. 
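To complement the figure, the following MATLAB sketch expands a randomly generated set of HT parameters for a 4D tensor and the balanced tree {{1,2},{3,4}} into the full array, i.e., it evaluates φ(x) via the recursion (2.3). All sizes and ranks are hypothetical, and the code is only an illustration of the definition, not the HTOpt implementation.

    % Minimal sketch of phi(x) for d = 4 and the dimension tree {{1,2},{3,4}}.
    n  = [10 11 12 13];
    k1 = 3; k2 = 3; k3 = 3; k4 = 3; k12 = 4; k34 = 4;       % k_troot = 1
    U1 = orth(randn(n(1),k1)); U2 = orth(randn(n(2),k2));    % leaf frames U_t
    U3 = orth(randn(n(3),k3)); U4 = orth(randn(n(4),k4));
    B12 = randn(k1,k2,k12); B34 = randn(k3,k4,k34);          % transfer tensors B_t
    Broot = randn(k12,k34);                                  % root transfer matrix
    % U_t = (U_tl x_1 U_tr x_2 B_t)^(1,2) = (U_tr kron U_tl) * B_t^(1,2)
    U12 = kron(U2,U1) * reshape(B12, k1*k2, k12);            % (n1*n2) x k12
    U34 = kron(U4,U3) * reshape(B34, k3*k4, k34);            % (n3*n4) x k34
    % at the root, vec(phi(x)) corresponds to the matricization U12 * Broot * U34'
    X = reshape(U12 * Broot * U34', n);                      % full n1 x n2 x n3 x n4 array
    % sanity check: the (1,2) matricization of X has rank at most k12
    fprintf('rank of X^(1,2): %d\n', rank(reshape(X, n(1)*n(2), n(3)*n(4))));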
If we have a four-dimensional tensor X, we first matricize the first two dimensions to form the matrixm(1;2), which can then be written in the form of a quasi-singular value decomposition.The insight of the Hierarchical Tucker format is that these quasi-singular vectorsj12 contain information on the subspaces generated by dimensions 1 and dimensions2 of the tensor. As such, we can reshape j12 in to a n1 × n2 × k12 tensor that canfurther be decomposed in this multilinear fashion. We apply a similar splitting toj34.X(1,2)n1n2n3n4=n1n2U12k12BT1234 UT34n3n4k34U12n1n2k12! U12n1n2 k12! U1UT2n1k1 n2k2B121Figure 2.2 Visualizing the Hierarchical Tucker format for a 4D tensor X of sizen1 × n2 × n3 × n4We say the parameters x = (jtPBt) are inOrthogonal Hierarchical Tucker (OHT)format if, in addition to the above construction, we also havejit jt = Ikx for t ∈ a(W(1;2)t )iW(1;2)t = Ikx for t ∈ c(i ) \ troot(2.4)We have made a slight modification of the definition of the HT format comparedto [253] for ease of presentation. When y = 2, our construction is the same as thesubspace decomposition introduced in [183] for low-rank matrices, but our approachis not limited to this case.Owing to the recursive construction in (2.3), the intermediate matrices jt fort ∈ c(i ) do not need to be stored. Instead, specifying jt for t ∈ a and Btfor t ∈ c(i ) determines X = ϕ(x) completely. Therefore, the overall number ofparameters x = ((jt)t∈LP (Bt)t∈c(i )) is bounded above by ycK + (y− 2)K3 +K2,18Chapter 2. Low-rank Tensor Completionwhere c = maxi=1;:::;y ni and K = maxt∈i kt. When y ≥ 4 and K ≪ c , thisquantity is much less than the cy parameters typically needed to represent X.Definition 2.6 The hierarchical rank of a tensor X ∈ Rn1×n2×:::×nd corre-sponding to a dimension tree i is the vector k = (kt)t∈i where ktroot = 1 andfor t ∈ i \ troot,kt = rank(m(t))OWe consider the set of Hierarchical Tucker tensors of fixed rank k = (kt)t∈i }, thatis,H = {X ∈ Rn1×n2×:::×nd | rank(m(t)) = kt for t ∈ i \ troot}OWe consider general HT tensors in the sequel, but we implement our algorithms withOHT parameters. In addition to significantly simplifying the resulting notation, thisrestriction allows us to avoid cumbersome and unnecessary matrix inversions, in par-ticular for the resulting subspace projections in future sections. Moreover, the or-thogonal format yields more accurate computations of inner products and subspaceprojections in finite arithmetic. This restriction does not reduce the expressibility ofthe HT format, however, since for any non orthogonalized parameters x such thatX = ϕ(x), there exists orthogonalized parameters x′ with X = ϕ(x′) [Alg 3., 109].We use the grouping x = (jtPBt) to denote ((jt)t∈LP (Bt)t∈c(i )), as these areour independent variables of interest in this case. In order to avoid cumbersomenotation, we also suppress the dependence on (iPk) in the following, and presumea fixed dimension tree i and hierarchical ranks k.2.5.2 Quotient Manifold GeometryIn the interest of keeping this thesis self-contained, we briefly summarize the keyresults of [253] that we will use for the following sections.Given the full-rank constraints above, the space of possible HT parameters, withx = (jtPBt), is written asM =×t∈LRnx×kx∗ × ×t∈c(i )Rkxp×kxr×kx∗ OM is an open submanifold of R∑x∈L nxkx+∑x∈N(T ) kxpkxrkx with corresponding tangentspaceTxM =×t∈LRnx×kx × ×t∈c(i )Rkxp×kxr×kx OLet ϕ :M→H be the parameter to tensor map in (2.3). 
Then for eachX ∈ H, thenthere is an inherent ambiguity in its representation by parameters x, specifically ifX = ϕ(x) = ϕ(y) for distinct parameters x and y, then these parameters are relatedin the following manner. Let G be the Lie groupG = {(Vt)t∈i : Vt ∈ Ga(kt)P t ̸= trootP Vtroot = 1}O (2.5)19Chapter 2. Low-rank Tensor Completionwhere Ga(p) is the matrix group of invertible p × p matrices and the group actionof component-wise multiplication.Let  be the group action :M×G →M(xPA) := ((jtP Wt)P (Vt)) 7→ x(A) := (jtVtP V−1tp ×1 V−1tr ×2 Vit ×3 Bt)O(2.6)Then ϕ(x) = ϕ(y) if and only if there exists a unique A = (Vt)t∈i ∈ G such thatx = V(y) [Prop. 3, 253]. Therefore these are the only types of ambiguities we mustconsider in this format.It follows that the orbit of x,G x = {A(x) : A ∈ G}Pis the set of all parameters that map to the same tensor X = ϕ(x) under ϕ. Thisinduces an equivalence relation on the set of parameters M,x ∼ y if and only if y ∈ G xOIf we let M/G be the corresponding quotient space of equivalence classes and . :M→M/G denote the quotient map, then pushing ϕ down through . results in aninjective functionϕˆ :M/G → Hwhose image is all of H, and hence is an isomorphism (in fact, a diffeomorphism).The vertical space, VxM, is the subspace of TxM that is tangent to .−1(x).That is, yxv = (jvt P Bvt ) ∈ VxM when it is of the form [Eq. 26, 253]jvt = jtYt for t ∈ aBvt = Yt ×3 Bt −Ytp ×1 Bt −Ytr ×2 Bt for t ∈ c(i ) ∪ trootWvtroot = −YtpWtroot −WtrootYitr for t = trootwhere Yt ∈ Rkx×kx . A straightforward computation shows that Yϕ(x)VxM ≡ 0,and therefore for every yxv ∈ VxM, ϕ(x) = ϕ(x+ yxv) to first order in yxv. Froman optimization point of view, moving from the point x to x + yxv, for small yxv,will not change the current tensor ϕ(x) and therefore for any search direction p,we must filter out the corresponding component in VxM in order to compute thegradient correctly. We accomplish this by projecting on to a horizontal space, whichis any complementary subspace to VxM. One such choice is [Eq. 26, 253],HxM = {(jht P Bht ) :{(jht )ijt = 0kx for t ∈ a(W(1;2)t )i (jitrjtr ⊗ jitp jtp)W(1;2)t = 0kx for t ∈ c(i ) \ troot}O (2.7)20Chapter 2. Low-rank Tensor CompletionNote that there is no restriction on Whtroot , which is a matrix.This choice has the convenient property that HxM is invariant under the actionof , i.e., [Prop. 5, 253]Y(xPA)[HxMP 0] = Hx(A)MP (2.8)which we shall exploit for our upcoming discussion of a Riemannian metric.The horizontal space HxM allows us to uniquely represent abstract tangentvectors in i.(x)M/G with concrete vectors in HxM.2.6 Riemannian Geometry of the HT FormatIn this section, we introduce a Riemannian metric on the parameter space M thatwill allow us to use parameters x as representations for their equivalence class .(x)in a well-defined manner when performing numerical optimization.2.6.1 Riemannian metricSince each distinct equivalence class .(x) is uniquely identified with each distinctvalue of ϕ(x), the quotient manifold M RG is really our manifold of interest for thepurpose of computations—i.e, we would like to formulate our optimization problemover the equivalence classes .(x). By introducing a Riemannian metric on M thatrespects its quotient structure, we can formulate concrete optimization algorithmsin terms of the HT parameters without being affected by the non-uniqueness of theformat—i.e., by optimizing over parameters x while implicitly performing optimiza-tion over equivalence classes .(x). 
Below, we explain how to explicitly constructthis Riemannian metric for the HT format.Let x = (jtP Bt)P x = (ktP Ct) ∈ TxM be tangent vectors at the pointx = (jtPBt) ∈M. We let bt = jit jt. Then we define the inner product gx(·P ·) atx asgx(xP x) :=∑t∈Ltr((bt)−1jit kt)+∑t∈b(i )\troot〈BtP (btp)×1 (btr)×2 (bt)−1 ×3 Ct〉+ tr(b(troot)rWitrootb(troot)pXtroot)O(2.9)By the full-rank conditions on jt and Bt at each node, by definition of the HTformat, each bt, for t ∈ i , is symmetric positive definite and varies smoothly withx = (jtPBt). As a result, gx is a smooth, symmetric positive definite, bilinearform on TxM, i.e., a Riemannian metric. Note that when x is in OHT, as in futuresections, gx reduces to the standard Euclidean product on the parameter space TxM,making it straightforward to compute in this case.21Chapter 2. Low-rank Tensor CompletionProposition 2.7 On the Riemannian manifold (MP g),  defined in (2.6) actsisometrically on M, i.e., for every A ∈ G, xP x ∈ HxMgx(xP x) = gA(x)(∗xP ∗x)where ∗ is the push-forward map, ∗v = Y(xPA)[v].Proof Let x = (jtPBt) ∈My = (ktPCt) = A(x)= (jtVtP V−1tp×1 V−1tr ×2 Vit ×3 Bt)for A ∈ G.If we write x = (jtP Bt), x = (ktP Ct) for xP x ∈ HxM, then, by (2.8), itfollows thaty = ∗x = (jtVtP V−1tp ×1 V−1tr ×2 Vit ×3 Bt)and similarly for y.We will compare each component of the sum of (2.9) term by term. For ease ofpresentation, we only consider interior nodes t ∈ c(i ) \ troot, as leaf nodes and theroot node are handled in an analogous manner.For t ̸∈ a ∪ troot, let gBt be the component of y at the node t, i.e.,gBt = V−1tp ×1 V−1tr ×2 Vit ×3 Btand similarly for gCt.The above inner product, evaluated at x, between Bt and Ct, evaluated at x,can be written asvec(Bt)i (b−1t ⊗btr ⊗btp)vec(Ct)Pand similarly for the inner product between  eBt and  eCt at y.By setting b˜t := k it kt, a quick computation shows that b˜t = Vit jit jtVt =Vit btVt. Therefore, we have that the inner product between gBt and gCt at y isvec(gBt)i (b˜−1t ⊗ b˜tr ⊗ b˜tp)vec(gCt)= vec(Bt)i (Vt ⊗V−itr ⊗V−itp )((V−1t b−1t V−it )⊗ (VitrbtrVtr)⊗ (VitpbtpVtp))(Vit ⊗V−1tr ⊗V−1tp )vec(Ct)= vec(Bt)i (b−1t ⊗btr ⊗btp)vec(Ct)OTherefore, this term in the inner product is invariant under the group action  andby adding the terms for each t ∈ i , we obtain thatgx(xP x) = gA(x)(∗xP ∗x)O22Chapter 2. Low-rank Tensor CompletionAs we are interested in carrying out our optimization using the HT parameters x asproxies for their equivalence classes .(x), this proposition states that if we measureinner products between two tangent vectors at the point x, we obtain the same resultas if we had measured the inner product between two tangent vectors transformedby A at the point A(x). In this sense, once we have a unique association of tangentvectors inM/G with a subspace of TxM, we can use the actual representatives, theparameters x, instead of the abstract equivalence class .(x), in a well-defined wayduring our optimization.This shows that M/G, endowed with the Riemannian metricg.(x)(P ) := gx(hx P hx )where hP hx are the horizontal lifts at x of P , respectively, is a Riemannian quotientmanifold of M [Sec. 3.6.2, 1]In summary, by using this Riemannian metric and restricting our optimization toonly consider horizontal tangent vectors, we can implicitly formulate our algorithmson the abstract quotient space by working with the concrete HT parameters. 
Below,we will derive the Riemannian gradient in this context.Remark It should be noted that although the horizontal space (2.7) is complemen-tary to the vertical space (2.7), it is demonstrably not perpendicular to VxM underthe Riemannian metric (2.9). Choosing a horizontal space which is perpendicular toVxM under the standard Euclidean product (i.e., (2.9) when x is orthogonalized)is beyond the scope of this work. Suffice to say, it can be done, as a generalizationof the approach outlined in [183], resulting in a series of symmetry conditions onvarious multi-way combinations of parameters. The resulting projection operatorsinvolve solving a number of coupled Lyapunov equations, increasing with the depthof i . It remains to be seen whether such equations can be solved efficiently when yis large. We will not dwell on this point here, as we will not be needing orthogonalprojections for our computations in the following. The authors in [148, 149] considera similar approach for the Tucker tensor format.We restrict ourselves to orthogonal parameters from this point forward, for thereasons stated previously. In order to not overburden ourselves with notation, weuse the notation M to refer to orthogonal parameters and the corresponding groupacting on M as G for the remainder of this chapter. In particular, when restrictingto orthogonal HT parameters, the general linear group Ga(kt) in (2.5) is replaced byd(kt), the group of orthogonal kt × kt matrices. The expression for the horizontalspace (2.7) and the Riemannian metric are also simplified since, for orthogonalparameters, jit jt = Ikx for all t ∈ i \ troot.23Chapter 2. Low-rank Tensor Completion2.6.2 Riemannian gradientThe problem we are interested in solving isminx∈Mf(ϕ(x))for a smooth objective function f : Rn1×n2×:::×nd → R. We write fˆ :M→ R, wherefˆ(x) = f(ϕ(x)).We need to derive expressions for the Riemannian gradient to update the HT pa-rameters as part of local optimization procedures. Therefore, our primary quantityof interest is the Riemannian gradient of fˆ .Definition 2.8 [Sec. 3.6, 1] Given a smooth scalar function fˆ on a Riemannianmanifold N , the Riemannian gradient of fˆ at x ∈ N , denoted ∇Rfˆ(x), is the uniqueelement of TxN which satisfiesgx(∇Rfˆ(x)P ) = Yfˆ(x) [] ∀ ∈ TxNwith respect to the Riemannian metric gx(·P ·).Our manifold of interest in this case is N = M/G, with the correspondinghorizontal space HxM in lieu of the abstract tangent space i.(x)M/G. Therefore,in the above equation, we can consider the horizontal lift h of the tangent vector and instead writegx(∇Rfˆ(x)P h) = Yfˆ(x)[h]OOur derivation is similar to that of [Sec 6.2.2, 253], except our derivations are morestreamlined and cheaper computationally since we reduce the operations performedat the interior nodes t ∈ c(i ). By a slight abuse of notation in this section,we denote variational quantities associated to node t as ot ∈ Rnxpnxr×kx and letZt ∈ Rnxp×nxr×kx where (ot)(1;2) = Zt is the reshaping of ot in to a 3−tensor.The Riemannian gradient will be denoted (jtP Bt) and a general horizontal vectorwill be denoted by (ktP Ct).When x = (jtPBt) is orthogonalized, we use 〈·P ·〉 to denote the Euclidean innerproduct. 
By the chain rule, we have that, for any  = (ktP Ct) ∈ HxM,Yfˆ(x)[] = Yf(ϕ(x))[Yϕ(x)[]]= 〈∇ϕ(x)f(ϕ(x))P Yϕ(x)[]〉OThen each tensor Vt ∈ Rnxp×nxr×kx , with ktroot = Yϕ(x)[], satisfies the recursionVt = ktp ×1 jtr ×2 Bt + jtp ×1 ktr ×2 Bt + jtp ×1 jtr ×2 CtP (2.10)for matrices ktp ∈ Rnxp×kxp , ktr ∈ Rnxr×kxr and tensor Ct ∈ Rkxp×kxr×kx satisfying[Lemma 2, 253]k itp jtp = 0 kitr jtr = 0 (X(1;2)t )iW(1;2)t = 0O (2.11)24Chapter 2. Low-rank Tensor CompletionThe third orthogonality condition is omitted when t = troot.Owing to this recursive structure, we compute 〈UtP Vt〉, where Ut is thecomponent of the Riemannian gradient at the current node and recursively extractthe components of the Riemannian gradient associated to the children, i.e., jtp P jtr ,and Bt. Here we let jtroot = ∇ϕ(x)f(ϕ(x)) be the Euclidean gradient of f(ϕ(x))at ϕ(x), reshaped into a matrix of size n(troot)p × n(troot)r .We setjtp = e⊥jxp〈jitr ×2 UtPBt〉(2;3);(2;3)jtr = e⊥jxr〈jitp ×1 UtPBt〉(1;3);(1;3)Bt = jitp×1 jitr ×2 Utfollowed by a projection of B(1;2)t on to span(W(1;2)t )⊥ if t ∈ c(i ) \ troot. Hereejx = (Ikx −jtjit ) is the usual projection on to span(jt)⊥. After a straightforwardcalculation, it follows that 〈UtP Vt〉 is equal to〈jtp P ktp〉+ 〈jtr P ktr〉+ 〈BtP Ct〉and jtp P jtr , and Bt satisfy (2.11). Their recursively decomposed factors willtherefore be in the horizontal space HxM.Bt is the component of the Riemannian gradient at node t. If tl is a leaf node,then we have extracted the component of the Riemannian gradient associated to tl,namely jtp . Otherwise, we set Utp = (jtp)(1;2) and apply the above recursion.We make the same considerations for the right children.In the above computations, the multilinear product operators are never formedexplicitly and instead each operator is applied to various reshapings of the matrixor tensor of interest, see [93] for a reference Matlab implementation.We make the following observations in order to minimize the number of compu-tations performed on intermediate tensors, which can be much larger than dim(M).In computing the termse⊥jxp 〈jitr ×2 UtPBt〉(2;3);(2;3)Pwe have that Ut = (e⊥jxj˜t)(1;2) for a matrix j˜t ∈ Rnxpnxr×kx . Using (2.2), theabove expression can be written as〈e⊥jxp ×1 jitr ×2 (e⊥jxj˜t)(1;2)PBt〉(2;3);(2;3)O (2.12)25Chapter 2. Low-rank Tensor CompletionWe note that in the above, e⊥jxp ×1 jitr ×2 (e⊥jxj˜t)(1;2) = (jitr ⊗ e⊥jxpe⊥jxj˜t)(1;2),and the operator applied to j˜t satisfiesjitr ⊗ e⊥jxpe⊥jx = jitr ⊗ e⊥jxp (Inx − jtjit )= jitr ⊗ e⊥jxp (Inx − jtr ⊗ jtpW(1;2)t (W(1;2)t )ijitr ⊗ jitp )= jitr ⊗ e⊥jxp OThis means that, using (2.2), we can write (2.12) ase⊥jxp 〈jitr ×2 (j˜t)(1;2)PBt〉(2;3);(2;3)i.e., we do not have to apply e⊥jx to the matrix j˜t at the parent node of t. Applyingthis observation recursively and to the other terms in the Riemannian gradient, wemerely need to orthogonally project the resulting extracted parameters (jtP Bt)on to HxM after applying the formula (2.12) without applying the intermediate op-erators e⊥jx , reducing the overall computational costs. 
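As a concrete instance of these formulas, and of the general procedure summarized in Algorithm (2.1) below, the MATLAB sketch below carries out the root-level extraction for a 4D tensor with the balanced tree {{1,2},{3,4}} and orthogonalized parameters; the sizes, ranks, and the stand-in Euclidean gradient are hypothetical. The same pattern then repeats at the node {1,2} with δU_12 playing the role of the root matricization.

    % Sketch of the root-level Riemannian gradient extraction; illustration only.
    n = [10 11 12 13]; k1 = 3; k12 = 4;
    U1 = orth(randn(n(1),k1)); U2 = orth(randn(n(2),k1));
    U3 = orth(randn(n(3),k1)); U4 = orth(randn(n(4),k1));
    B12 = reshape(orth(randn(k1*k1,k12)), k1,k1,k12);   % orthogonalized (OHT) transfer tensors
    B34 = reshape(orth(randn(k1*k1,k12)), k1,k1,k12);
    Broot = randn(k12,k12);
    U12 = kron(U2,U1)*reshape(B12,k1*k1,k12);  U34 = kron(U4,U3)*reshape(B34,k1*k1,k12);
    Z    = randn(n);                               % stand-in for the Euclidean gradient
    Zmat = reshape(Z, n(1)*n(2), n(3)*n(4));       % gradient matricized at the root
    % root-level components of the Riemannian gradient (before recursing to the children)
    dU12   = Zmat  * U34 * Broot';  dU12 = dU12 - U12*(U12'*dU12);   % project onto span(U12)^perp
    dU34   = Zmat' * U12 * Broot;   dU34 = dU34 - U34*(U34'*dU34);   % project onto span(U34)^perp
    dBroot = U12' * Zmat * U34;                                      % no constraint at the root
    % dU12, reshaped to n1 x n2 x k12, now plays the role of the gradient at node {1,2}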
We summarize our algorithmfor computing the Riemannian gradient in Algorithm (2.1).Algorithm 2.1 The Riemannian gradient ∇Rf at a point x = (jtPBt) ∈MInput x = (jtPBt) parameter representation of the current point.ComputeX = ϕ(x) and ∇Xf(X), the Euclidean gradient of f , a n1×O O O ny tensor.jtroot ← (∇Xf(X))(1;2)for t ∈ c(i ), visiting parents before their children doUt ← (jt)(1;2)jtp ← 〈jitr ×2 UtPBt〉(2;3);(2;3), jtr ← 〈jitp ×1 UtPBt〉(1;3);(1;3)Bt ← jitp ×1 jitr ×2 Utif t ̸= troot thenBt ← (e⊥B(1;2)x(Wt)(1;2))(1;2)for t ∈ ajt ← e⊥jxjtOutput ∇Rf ← (jtP Bt)This algorithm is computing the operator Yϕ(x)∗ : Rn1×n2×:::×nd → HxMapplied to the Euclidean gradient ∇ϕ(x)f(ϕ(x)). The forward operator Yϕ(x) :HxM→ Tϕ(x)H ⊂ Rn1×n2×:::×nd can be computed using a component-wise orthog-onal projection eHx : TxM→HxM followed by applying (2.10) recursively.2.6.3 Tensor Completion Objective and GradientIn this section, we specialize the computation of the objective and Riemanniangradient in the HT format to the case where the Euclidean gradient of the objectivefunction is sparse, in particular for tensor completion. This will allow us to scale ourmethod to high dimensions in a straightforward fashion as opposed to the inherently26Chapter 2. Low-rank Tensor Completiondense considerations in Algorithm (2.1). Here for simplicity, we suppose that ourdimension tree i is complete, that is a full binary tree up to level depth(i )− 1 andall of the leaves at level depth(i ) are on the leftmost side of i , as in Figure 2.1. Thiswill ease the exposition as well as allow for a more efficient implementation comparedto a noncomplete tree. In what follows below, we denote i = (i1P i2P O O O P iy), with1 ≤ ij ≤ nj for all 1 ≤ j ≤ y, and let it be the subindices of i indexed by t ∈ i .We consider a separable, smooth objective function on the HT manifold,fˆ(x) = f(ϕ(x)) =∑i∈Ωfi(ϕ(x)i)P (2.13)where fi : R→ R is a smooth, single variable function. For the least-squares tensorcompletion problem, fi(v) = 12(v− wi)2.In this section, we also use the Matlab notation for indexing in to matrices, i.e.,V(mPn) is the (mPn)th entry of V, and similarly for tensors. Let K = maxt∈i kt.2.6.4 Objective functionWith this notation in mind, we write each entry of eΩϕ(x), indexed by i ∈ Ω, as(eΩϕ(x))(i) =kxp∑rp=1kxr∑rr=1(jtp)(itp P rl) · (jtr)(itr P rr) ·Wtroot(rlP rr)P where t = troot OEach entry of jtp P jtr can be computed by applying the recursive formula (2.3), i.e.,jt(itP r) =kxp∑rp=1kxr∑rr=1(jtp)(itp P rl) · (jtr)(itr P rr) ·Bt(rlP rrP r)with the substitutions of t→ tlP tr as appropriate.At each node t ∈ i , we perform at most K3 operations and therefore the compu-tation of eΩϕ(x) requires at most 2|Ω|yK3 operations. The least squares objective,12‖eΩϕ(x)− w‖22, can be computed in |Ω| operations.2.6.5 Riemannian gradientThe Riemannian gradient is more involved, notation-wise, to derive explicitly com-pared to the objective, so in the interest of brevity we only concentrate on therecursion for computing j1 below.We let Z = ∇ϕ(x)f(ϕ(x)) denote the Euclidean gradient of f(X) evaluated atX = ϕ(x), which has nonzero entries Z(i) indexed by i ∈ Ω. By expanding out (2.12),for each i ∈ Ω, jtp evaluated at the root node with coordinates itp P rl for rl =27Chapter 2. 
Low-rank Tensor Completion1P O O O P ktp isjtp(itp P rl) =∑i=(ixp ;ixr )∈ΩZ(i)kxr∑rr=1jtr(itr P rr)Wtroot(rlP rr)P where t = troot OFor each t ∈ c(i ) ∪ troot, we let gjtp denote the length ktp vector, which dependson it, satisfying, for each i ∈ ΩP rl = 1P OOOP ktp ,(gjtp)(itP rtp) = kxr∑rr=1kx∑rx=1jtr(itr P rr)Bt(rlP rrP rt)gjt(itP rt)OThis recursive construction above, as well as similar considerations for the rightchildren for each node, yields Algorithm (2.2). For each node t ∈ i \troot, we perform3|Ω|K3 operations and at the root where we perform 3|Ω|K2 operations. The overallcomputation of the Riemannian gradient requires at most 6y|Ω|K3 operations, wheni is complete, and a negligible d(yK) additional storage to store the vectors gjtfor each fixed i ∈ Ω. The computations above are followed by componentwiseorthogonal projection on to HxM, which requires d(y(cK2+K4)) operations andare dominated by the d(y|Ω|K3) time complexity when |Ω| is large.Therefore for large |Ω|, each evaluation of the objective, with or without theRiemannian gradient, requires d(y|Ω|K3) operations. Since f(X) can be written asa sum of component functions fi, each depending only on the entry X(i), we canparallelize the computation of f(X) and the Riemannian gradient in the followingway. For a system setup with p processors, we first partition the sampling set Ωin to p disjoint subsets, with the subset, Ωj , being assigned to the jth processor.The objective (and possibly the Riemannian gradient) are computed at each proces-sor independently using the set Ωj and the results are added together afterwards.This “embarrassingly parallel” structure, as referred to in the parallel computingliterature, allows us to scale these algorithms to large problems in a distributedenvironment.In certain situations, when say |Ω| = pcy for some p ∈ [10−3P 1] and y is suffi-ciently small, say y = 4P 5, it may be more efficient from a computer hardware pointof view to use the dense linear algebra formulation in Algorithm (2.1) together withan efficient dense linear algebra library such as BLAS, rather than Algorithm (2.2).The dense formulation requires d(cyK) operations when i is a balanced tree, whichmay be smaller than the d(y|Ω|K3) operations needed in this case.Remark: By comparison, the gradient in the Tucker tensor completion case[161] requires\ d(y(|Ω|+c)Ky+Ky+1) operations, which scales much more poorlywhen y ≥ 4 compared to using Algorithm (2.2). This discrepancy is a result of thestructural differences between Tucker and Hierarchical Tucker tensors, the latterof which allows one to exploit additional low-rank behaviour of the core tensorcompared to the Tucker format.28Chapter 2. 
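The entrywise recursion described above, and summarized in Algorithm (2.2) below, is illustrated by the following MATLAB sketch, which evaluates P_Ω φ(x) one sampled index at a time for the 4D balanced tree without ever forming the full tensor. The sizes, ranks, and sampling set are hypothetical, and the code mirrors the recursion rather than the HTOpt implementation.

    % Sketch: evaluate (P_Omega phi(x))(i) entry by entry, d = 4, tree {{1,2},{3,4}}.
    n = [10 11 12 13]; k1 = 3; k12 = 4;
    U = {orth(randn(n(1),k1)), orth(randn(n(2),k1)), ...
         orth(randn(n(3),k1)), orth(randn(n(4),k1))};
    B12 = reshape(orth(randn(k1*k1,k12)), k1,k1,k12);
    B34 = reshape(orth(randn(k1*k1,k12)), k1,k1,k12);
    Broot = randn(k12,k12);
    B12m = reshape(B12,k1*k1,k12);  B34m = reshape(B34,k1*k1,k12);
    Omega = [randi(n(1),500,1) randi(n(2),500,1) randi(n(3),500,1) randi(n(4),500,1)];
    vals = zeros(size(Omega,1),1);
    for m = 1:size(Omega,1)
        i = Omega(m,:);
        % leaf-to-root recursion (2.3), restricted to the single multi-index i
        u12 = B12m' * kron(U{2}(i(2),:), U{1}(i(1),:))';   % row i_{12} of U_12, as a k12 x 1 vector
        u34 = B34m' * kron(U{4}(i(4),:), U{3}(i(3),:))';   % row i_{34} of U_34
        vals(m) = u12' * Broot * u34;                      % phi(x) evaluated at index i
    end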
Low-rank Tensor CompletionAlgorithm 2.2 Objective and Riemannian gradient for separable objectivesInput: x = (jtPBt) parameter representation of the current pointfx ← 0, jtP Bt ← 0, gjt ← 0for i ∈ Ωfor t ∈ c(i ), visiting children before their parentsfor z = 1P 2P O O O P ktjt(itP z)←∑kpw=1∑kry=1(jtp)(itp P w) · (jtr)(itr P y) ·Bt(wP yP z)fx ← fx + fi(jtroot(i))^jtroot ← ∇fi(jtroot(i))for t ∈ c(i ), visiting parents before their childrenfor w = 1P O O O P ktp , y = 1P O O O P ktr , z = 1P O O O P ktBt(wP yP z)← Bt(wP yP z) +gjt(z) · (jtp)(itp P w) · (jtr)(itr P y)for w = 1P O O O P ktpgjtp(w)←∑kxry=1∑kxz=1(jtr)(itr P y) ·Bt(wP yP z) ·gjt(z)for y = 1P O O O P ktrgjtr(y)←∑kxpw=1∑kxz=1(jtp)(itp P w) ·Bt(wP yP z) ·gjt(z)for t ∈ afor z = 1P 2P O O O P ktjt(itP z)← jt(itP z) +gjt(z)Project (jtP Bt) componentwise on to HxMOutput: f(x)← fx, ∇Rf(x)← (jtP Bt)2.7 Optimization2.7.1 Reorthogonalization as a retractionAs is standard in manifold optimization, we employ a retraction on the tangent bun-dle in lieu of the exponential mapping, the former being much less computationallydemanding compared to the latter.Definition 2.9: A retraction on a manifold N is a smooth mapping R fromthe tangent bundle T N onto N with the following properties: Let Rx denote therestriction of R to TxN .• Rx(0x) = x, where 0x denotes the zero element of TxN• With the canonical identification T0xTxN ≃ TxN , Rx satisfiesYRx(0x) = idTxNwhere idTxN denotes the identity mapping on TxN (Local rigidity condition).A retraction approximates the action of the exponential mapping to first orderand hence much of the analysis for algorithms utilizing the exponential mapping29Chapter 2. Low-rank Tensor Completioncan also be carried over by the usual modifications to those using retractions. Formore details, we refer the reader to [1].A computationally feasible retraction on the HT parameters is given byQR-based reorthogonalization. The QR-based orthogonalization of (potentiallynonorthogonal) parameters x = (jtPBt), denoted fg(x), is given in Algorithm (2.3).Alternative retractions, such as those based on the SVD, can also be considered ina similar manner but we do not explore those options here. Since we are implicitlyperforming optimization on the quotient spaceM/G, the choice of retraction shouldnot significantly affect algorithmic performance.Proposition 2.10 Given x ∈M,  ∈ TxM, let QR(x) be the QR-based orthog-onalization defined in Algorithm (2.3). Then Rx() := QR(x+ ) is a retraction onM.Proof We first explicitly define the orthogonal parameter space and its corre-sponding tangent space.Below, let lkxp ;kxr ;kx be the closed submanifold of Rkxp×kxr×kx∗ , the set of3−tensors with full multilinear rank, such thatW ∈lkxp ;kxr ;kx is orthonormal alongmodes 1 and 2, i.e., (l (1;2))il (1;2) = Ikx and let St(ntP kt) be the nt × kt Stiefelmanifold of n× kt matrices with orthonormal columns.Our orthogonal parameter space M is thenM =×t∈LSt(ntP kt)× ×t̸∈L∪trootlkxp ;kxr ;kx × Rk(troot)p×k(troot)r∗with corresponding tangent space at x = (jtPBt) ∈MTxM =×t∈LTjxSt(ntP kt)× ×t̸∈L∪trootTBxlkxp ;kxr ;kx × Rk(troot)p×k(troot)r ONote that Tn St(nP p) = {n Ω + n ⊥K : Ωi = −Ω ∈ Rp×pPK ∈ R(n−p)×p}. We omitan explicit description of TBxlkxp ;kxr ;kx for brevity.It is easy to see that the first point in the definition of a retraction is satisfied,since for m ∈ St(nP p), qf(m) = mLet x = (jtPBt) ∈M and  = (jtP Bt) ∈ TxM. 
To avoid notational overload,we use the slight abuse of notation that Wt := W(1;2)t for t ̸= troot.Let s ∈ [0P t) 7→ x(s) be a curve in the parameter space M with x(0) = x andx′(0) =  and x(s) = (jt(s)P Wt(s)) and x′(s) = (jt(s)P Wt(s)).Then we have that, in Kronecker form,YRx(0x)[] =8><>:yys qf(x(s)t)s=0if t ∈ ayys qf((gtr(s)⊗gtp(s))(x(s)t)s=0if t ̸∈ troot ∪ayys(gtr(s)⊗gtp(s))(x(s)t)s=0if t = troot30Chapter 2. Low-rank Tensor CompletionThe fact that YRx(0x)[]t = jt for t ∈ a follows from Example 8.1.5 in [1].To compute YRx(0x)[]t for t ̸∈ a ∪ troot, we first note the formula from [1]Y qf(n )[j ] = qf(n )/skew(qf(n )ij(qf(n )in )−1)+(I−qf(n ) qf(n )i )j(qf(n )in )−1(2.14)where n ∈ Rn×k∗ , j ∈ in Rn×k∗ ≃ Rn×k and qf(n ) is the Q-factor of the QR-decomposition of n .Therefore, if we set o(s) = (gtr(s)⊗gtp(s))(x(s)t), where gt(s) is the g-factorof the QR-decomposition of the matrix associated to node t, we haveo ′(0) = [(g′tr(0)⊗ Ikp) + (Ikr ⊗g′tp(0)]Wt + WtAs a result of the discussion in Example 8.1.5 in [1], since gt(0) = Ikx we have thatg′t(0) ={/ji (jit jt) for t ∈ a/ji (Wit Wt) for t ̸∈ a ∪ trootwhere /ji (V) is the projection onto the upper triangular term of the unique decom-position of a matrix into the sum of a skew-symmetric term and an upper triangularterm.Since jt ∈ St(ntP kt) and Wt ∈ St(ktpktr P kt), since when m ∈ ht(nP k),imht(nP k) = {mΩ+m⊥K : Ω = −Ωi }Pthen mi m is skew symmetric, for any tangent vector m, which implies that/ji (mi m) is zero.It follows that g′t(0) = 0 for all t ∈ i \ troot, and thereforeo ′(0) = Wtfrom which we immediately obtainYRx(0x)[]t = Wt for t ̸∈ a ∪ trootA similar approach holds when t = troot, and therefore, Rx() is a retraction on M.As before for the Riemannian metric, we can treat the retractions on the HTparameter space as implicitly being retractions on the quotient space as outlinedbelow.Since Rx() is a retraction on the parameter spaceM, and our horizontal spaceis invariant under the Lie group action, by the discussion in [4.1.2 1], we have thefollowingProposition 2.11 The mappingeR.(m)(z) = .(Rm(o))31Chapter 2. Low-rank Tensor Completionis a retraction on M/G, where Rm(o) is the QR retraction previously defined onM, . : M→M/G is the projection operator, and o is the horizontal lift at m ofthe tangent vector z at .(m).Algorithm 2.3 QR-based orthogonalization [Alg. 3 109]Input HT parameters x = (jtPBt)Output y = (ktPCt) orthogonalized parameters such that ϕ(x) = ϕ(y)for t ∈ aftgt = jt, where ft is orthogonal and gt is upper triangular with positive diag-onal elementskt ← ftfor t ∈ i \ (a ∪ troot), visiting children before their parentsot ← gtp ×1 gtr ×2 Btftgt = o(1;2)t , where ft is orthogonal and gt is upper triangularCt ← (ft)(1;2)Xtroot ← g(troot)pWtrootgi(troot)r2.7.2 Vector transportThe notion of vector transport allows us to relax the isometry constraints of paralleltransport between tangent spaces at differing points on a manifold and decreasecomputational complexity without sacrificing the convergence quality of our CG-based method. Since our parameter space M is a subset of Euclidean space, givena point x ∈ M and a horizontal vector x ∈ HxM, we take our vector transportTx;x : HxM→HRx(x)M of the vector x ∈ HxM to beTx;xx := e hRx(x)xwhere e hx is the component-wise projection onto the horizontal space at x, givenin (2.7). This mapping is well defined on M/G since HxM is invariant under ,and induces a vector transport on the quotient space [Sec. 
8.1.4, 1].2.7.3 Smooth optimization methodsNow that we have established the necessary components for manifold optimization,we present a number of concrete optimization algorithms for solvingminx∈Mf(ϕ(x))that are suitable for large-scale problems.32Chapter 2. Low-rank Tensor Completion2.7.4 First order methodsGiven the expressions for the Riemannian gradient and retraction, it is straightfor-ward to implement the classical Steepest Descent algorithm with an Armijo linesearch on this Riemannian manifold, specialized from the general Riemannian mani-fold case [1] to the HT manifold in Algorithm (2.4). This algorithm consists of com-puting the Riemannian gradient, followed by a line search, HT parameter update,and a reorthogonalization. We describe the various components of Algorithm (2.4)below.Here gi denotes the Riemannian gradient at iteration i of the algorithm, pi isthe search direction for the optimization method, and i is the step length.We choose the Polak-Ribiere approachi =〈giP gi − Txi−1;i−1pi−1(gi−1)〉〈gi−1P gi−1〉to compute the CG-parameter i, so that the search direction pi satisfiespi = −gi + iTxi−1;i−1pi−1pi−1and p1 = −g1.2.7.5 Line searchAs for any gradient based optimization scheme, we need a suitable initial step sizeand a computationally efficient line search. Following [230], we use a variation ofthe limited-minimization line search approach to set the initial step length based onthe previous search direction and gradient that are vector transported to the currentpoint—i.e, we havesi = Txi−1;i−1pi−1i−1pi−1yi = gi − Txi−1;i−1pi−1gi−1OIn this context, si is the manifold analogue for the Euclidean difference betweeniterates, xi− xi−1 and yi is the manifold analogue for the difference of gradients be-tween iterates, gi−gi−1, which are standard optimization quantities in optimizationalgorithms set in Rn.Our initial step size for the direction pi is given as0 = −gii piR(ai‖pi‖22)where ai = yii siR‖si‖22 is the estimate of the Lipschitz constant for the gradient [Eq.16, 231].In this context, computing the gradient is much more expensive than evaluatingthe objective. For this reason, we use a simple Armijo-type back-/forward-trackingapproach that only involves function evaluations and seeks to minimize the 1Y33Chapter 2. Low-rank Tensor CompletionAlgorithm 2.4 General Nonlinear Conjugate Gradient method for minimizing afunction f over HInput Initial guess x0 = (jtPBt), 0 Q  ≤ 1 sufficient decrease parame-ter for the Armijo line search, 0 Q  Q 1 step size decrease parameter,  S0 CG restart parameter.p−1 ← 0i← 0for i = 0P 1P 2P O O O until convergenceXi ← ϕ(xi)fi ← f(Xi)gi ← ∇Rfˆ(xi) (Riemannian gradient of fˆ(x) at xi)si ← Txi−1;i−1pi−1i−1pi−1 (Vector transport the previous search direction)yi ← gi − Txi−1;i−1pi−1gi−1ai ← yii siR‖si‖2 (Lipschitz constant estimate)pi ← −gi + iTxi−1;i−1pi−1pi−1if 〈piP gi〉 S − (Restart CG direction)pi = −giif yii si S 0← −gii piR(ai‖pi‖22)else← i−1Find m ∈ Z such that i = m andf(xi + ipi)− fi ≤ igii pif(xi + ipi) ≤ min{f(xi + ipi)P f(xk + i−1pi)}(Find a quasi-optimal minimizer)xi+1 ←Rxi(ipi) (Reorthogonalize)i← i+ 1function f(x + pi) quasi-optimally, i.e., to find m ∈ Z such that  = m0 for S 0f(xi + pi)− f(xi) ≤ gii pif(xi + pi) ≤ min{f(xi + pi)P f(xi + −1pi)}(2.15)so  ≈ ∗ = argmin f(xi + pi) in the sense that increasing or decreasing  by afactor of  will increase f(xi+pi). 
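A bare-bones version of this back-/forward-tracking search, written for a generic objective handle and Euclidean iterates rather than the HT-specific quantities above, is sketched below. The function and argument names are placeholders, and the iteration caps are a safeguard added only for the sketch.

    function alpha = quasiopt_linesearch(fobj, x, p, gTp, alpha0, c1, beta)
    % Armijo back-/forward-tracking sketch: find alpha = beta^m * alpha0 satisfying
    % sufficient decrease and quasi-optimal in the sense of (2.15).
    % fobj: objective handle, x: current point, p: search direction, gTp: g'*p < 0,
    % alpha0: initial step, 0 < c1 <= 1, 0 < beta < 1.  All names are illustrative.
    f0 = fobj(x); alpha = alpha0;
    for m = 1:50                                   % backtrack until sufficient decrease
        if fobj(x + alpha*p) <= f0 + c1*alpha*gTp, break; end
        alpha = beta*alpha;
    end
    for m = 1:50                                   % forward-track while a larger step is still better
        a_up = alpha/beta;
        if fobj(x + a_up*p) <= f0 + c1*a_up*gTp && fobj(x + a_up*p) < fobj(x + alpha*p)
            alpha = a_up;
        else
            break;
        end
    end
    end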
After the first few iterations of our optimizationprocedure, we observe empirically that our line search only involves two or threeadditional function evaluations to verify the second inequality in (2.15), i.e., ourinitial step length 0 is quasi-optimal.Because ϕ(Rx()) = ϕ(x+ ) for any x ∈ M and horizontal vector , whereRx is the QR retraction, Armijo linesearches do not require reorthogonalizationat the intermediate steps, which further reduces computational costs. The vectorx+  is only orthogonalized once an appropriate step length  is found.34Chapter 2. Low-rank Tensor Completion2.7.6 Gauss-Newton MethodBecause of the least-squares structure of our tensor completion problem (2.1), wecan approximate the Hessian by the Gauss-Newton HessianHGc := Yϕ∗(x)Yϕ(x) : HxM→HxMPwhich arises from linearizing the objective functionminx∈M12‖ϕ(x)− w‖22around the current point x.Note that we do not use the “true” Gauss-Newton Hessian for the tensor com-pletion objective (2.1), which is Yϕ∗(x)e ∗ΩeΩYϕ(x), since for even moderate sub-sampling ratios, e ∗ΩeΩ is close to the zero operator and this Hessian is very poorlyconditioned as a result. Moreover, as we shall see, this formulation will allow us tosimplify the application of HGc and its inverse.Since Yϕ(x) : HxM → Tϕ(x)H is an isomorphism, it is easy to see that HGcis symmetric and positive definite on HxM. The solution to the Gauss-Newtonequation is thenHGc = −∇Rf(x)for  ∈ HxM.We can simplify the computation of HGc by exploiting the recursive structureof Yϕ∗(x) and Yϕ(x), thereby avoiding intermediate vectors of size Rn1×n2×:::×ndin the process. At the root note, by (2.12), we have thatj ′tp = e⊥jxp〈jitr ×2 Yϕ(x)[]PBt〉(2;3);(2;3)Pj ′tr = e⊥jxr〈jitp ×1 Yϕ(x)[]PBt〉(1;3);(1;3)PW′t = jitpYϕ(x)[]jtrwhereYϕ(x)[] = jtp ×1 jtr ×2 Wt + jtp ×1 jtr ×2 Wt + jtp ×1 jtr ×2 WtP t = troot OIn the above expression, Yϕ(x) is horizontal, so that for each t ∈ i \ troot, jtis perpendicular to jt (2.11). A straightforward computation simplifies the aboveexpression toj ′tp = jtpGtp P j′tr = jtrGtr P W′t = Wtroot Owhere Gtp = WtrootWitroot and Gtr = WitrootWtroot .This expression gives us the components of the horizontal vector j ′tp P j ′tr sentto the left and right children, respectively, as well as the horizontal vector W′t.35Chapter 2. Low-rank Tensor CompletionWe proceed recursively by considering a node t ∈ c(i ) ∪ troot and let jtGt bethe contribution from the parent node of t. By applying the adjoint partial deriva-tives, followed by an orthogonal projection on to HxM, we arrive at a simplifiedform for the Gauss-Newton Hessiane⊥jxp 〈jitr ×2 jtGtPBt〉(2;3);(2;3) = 〈jtp ×1 Gt ×3 BtPBt〉(2;3);(2;3):= jtpGtp Pe⊥jxr 〈jitp ×1 jtGtPBt〉(1;3);(1;3) = jtrGtr Pe⊥B(1;2)x(jitp ×1 jitr ×2 Gt ×3 jt) = Gt ×3 BtOIn these expressions, the matrices Gt are the Gramian matrices associated to theHT format, initially introduced in [109] and used for truncation of a general tensorto the HT format as in [248]. They satisfy, for x = (jtPBt) ∈M, Gtroot = 1 andGtp = 〈Gt ×3 BtPBt〉(2;3);(2;3)P Gtr = 〈Gt ×3 BtPBt〉(1;3);(1;3)P (2.16)i.e., the same recursion as Gt in the above derivations. 
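The recursion (2.16) is easy to check numerically. The MATLAB sketch below computes the Gramians for a hypothetical 4D example with the balanced tree {{1,2},{3,4}} and orthogonalized parameters, and verifies the singular-value property quoted in (2.17) below; it is an illustration of the definitions, not the HTOpt routine.

    % Sketch: Gramian recursion for d = 4 and the tree {{1,2},{3,4}}; illustrative sizes only.
    n = [10 11 12 13]; k1 = 3; k12 = 4;
    U1 = orth(randn(n(1),k1)); U2 = orth(randn(n(2),k1));
    U3 = orth(randn(n(3),k1)); U4 = orth(randn(n(4),k1));
    B12 = reshape(orth(randn(k1*k1,k12)), k1,k1,k12);
    B34 = reshape(orth(randn(k1*k1,k12)), k1,k1,k12);
    Broot = randn(k12,k12);
    U12 = kron(U2,U1)*reshape(B12,k1*k1,k12);  U34 = kron(U4,U3)*reshape(B34,k1*k1,k12);
    X = reshape(U12*Broot*U34', n);
    % Gramian recursion, root -> leaves (G_troot = 1, B_troot a matrix)
    G12 = Broot*Broot';   G34 = Broot'*Broot;
    G1  = reshape(B12,k1,k1*k12) * kron(G12, eye(k1)) * reshape(B12,k1,k1*k12)';
    % check: eigenvalues of G_t are the squared singular values of X^(t)
    e12 = sort(eig((G12+G12')/2),'descend');  s12 = svd(reshape(X, n(1)*n(2), [])).^2;  % symmetrize
    e1  = sort(eig((G1 +G1' )/2),'descend');  s1  = svd(reshape(X, n(1), [])).^2;       % against rounding
    fprintf('(1,2) mismatch: %.2e,  (1) mismatch: %.2e\n', ...
            norm(e12 - s12(1:k12)), norm(e1 - s1(1:k1)));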
Each Gt is a kt×kt symmetricpositive definite matrix (owing to the full rank constraints of the HT format) andalso satisfiesj(Gt) = j(m(t))2 (2.17)where j(V) is the jth eigenvalue of the matrix V and j(V) is the jth singularvalue of V.Assuming that each Gt is well conditioned, applying the inverse of HGc followsdirectly, summarized in Algorithm (2.5).Algorithm 2.5 The inverse Gauss-Newton Hessian applied to a vectorInput Current point x = (jtPBt), horizontal vector  = (jtP Bt)Compute (Gt)t∈i using (2.16)for t ∈ i \ trootif t ∈ agjt ← jtG−1telsegBt ← G−1t ×3 BtH−1Gc ← (gjtPgBt)For the case where our solution HT tensor exhibits quickly-decaying singularvalues of the matricizations, as is typically the assumption on the underlying tensor,the Gauss-Newton Hessian becomes poorly conditioned as the iterates converge tothe solution, owing to (2.17). This can be remedied byintroducing a small ϵ S 0and applying (Gt + ϵI)−1 instead of G−1t in Algorithm 2.5 or by applying H−1Gcby applying HGc in a truncated PCG method. For efficiency purposes, we findthe former option preferable. Alternatively, we can also avoid ill-conditioning viaregularization, as we will see in the next section.36Chapter 2. Low-rank Tensor CompletionRemark: We note that applying the inverse Gauss-Newton Hessian to a tan-gent vector is akin to ensuring that the projection on to the horizontal space isorthogonal, as in [6.2.2, 253]. Using this method, however, is much faster than thepreviously proposed method, because applying Algorithm 2.5 only involves matrix-matrix operations on the small parameters, as opposed to operations on much largerintermediate matrices that live in the spaces between the full tensor space and theparameter space.2.7.7 RegularizationIn the tensor completion case, when there is little data available, interpolating onthe HT manifold is susceptible to overfitting if one chooses the ranks (kt)t∈i for theinterpolated tensor too high. In that case, one can converge to solutions in null(eΩ)that try leave the current manifold, associated to the ranks (kt)t∈i , to anothernearby manifold corresponding to higher ranks. This can lead to degraded resultsin practice, as the actual ranks for the solution tensor are almost always unknown.One can use cross-validation techniques to estimate the proper internal ranks of thetensor, but we still need to ensure that the solution tensor has the predicted ranksfor this approach to be successful – i.e., the iterates x must stay away from theboundary of H.To avoid our HT iterates converging to the manifold boundary, we introduce aregularization term on the singular values of the HT tensor ϕ(x) = X. To accomplishthis, we exploit the hierarchical structure of X and specifically the property of theGramian matrices Gt in 2.17 to ensure that all matricizations of X remain well-conditioned without having to perform SVDs on each matricization m(t). The latterapproach would be prohibitively expensive when y or c are even moderately large.Instead, we penalize the growth of the Frobenius norm of m(t) and (m(t))†, whichindirectly controls the largest and smallest singular values of m(t). We implementthis regularization via the Gramian matrices in the following way.From (2.17), it follows that tr(Gt) = ‖Gt‖∗ = ‖m(t)‖2F and likewise tr(G−1t ) =‖G−1t ‖∗ = ‖(m(t))†‖2F .Our regularizer is thenH((Bt′)t′∈i ) =∑t∈itr(Gt) + tr(G−1t )OA straightforward calculation shows that for A ∈ G(Gt)t∈i;x = (Vit GtVt)t∈i;A(x)for Vt orthogonal. 
Therefore, our regularizer g is well-defined on the quotientmanifold in the sense that it is −invariant on the parameter space M. This isthe same regularization term considered in [161], used for (theoretically) preventingthe iterate from approaching the boundary of H. In our case, we can leverage the37Chapter 2. Low-rank Tensor Completionstructure of the Gramian matrices to implement this regularizer in a computationallyfeasible way.Since in the definition of the Gramian matrices (2.16), Gt is computed recursivelyvia tensor-tensor contractions (which are smooth operations), it follows that themapping g : (Bt)t∈c(i ) → (Gt)t∈i is smooth. In order to compute its derivatives,we consider a node t ∈ i \ troot and consider the variations of its left and rightchildren, i.e.,Gtr =UGtrUBtBt +UGtrUGtGtP Gtp =UGtpUBtBt +UGtpUGtGtO (2.18)We can take the adjoint of this recursive formulation, and thus obtain the gradientof g, if we compute the adjoint partial derivatives in (2.18) as well as taking theadjoint of the recursion itself. To visualize this process, we consider the relationshipbetween input variables and output variables in the recursion as a series of smalldirected graphs, shown in Figure 2.3.Forward Gramian derivative map Adjoint Gramian derivative mapFigure 2.3 Forward and adjoint depictions of the Gramian mapping differentialThese graphs can be understood in the context of Algorithmic Differentiation,whereby the forward mode of this derivative map propagates variables up the treeand the adjoint mode propagates variables down the tree and adds (accumulates)the contributions of the relevant variables.Since we only consider tangent vectors Wt that are in the horizontal space at x,each extracted component is projected on to (W(1;2)t )⊥. We summarize our resultsin the following algorithms.Algorithm 2.6 Yg[Bt]Input Current point x = (jtPBt), horizontal vector yx = (jtP Bt).Compute (Gt)t∈i using (2.16)Gtroot ← 0for t ∈ c(i ), visiting parents before childrenGtp ← 〈Gt ×3 BtPBt〉(2;3);(2;3) + 2〈Gt ×3 BtPBt〉(2;3);(2;3)Gtr ← 〈Gt ×3 BtPBt〉(1;3);(1;3) + 2〈Gt ×3 BtPBt〉(1;3);(1;3)Output Yg[Bt]← (Gt)t∈i38Chapter 2. Low-rank Tensor CompletionAlgorithm 2.7 Yg∗[Gt]Input Current point x = (jtPBt), Gramian variations (Gt)t∈i , Gtroot = 0.Compute (Gt)t∈i using (2.16)for t ∈ igGt ← Gtfor t ∈ c(i ), visiting children before parentsBt ← (gGtp + gGtpi )×1 Gt ×3 Bt + (gGtr + gGtri )×2 Gt ×3 Btif t ̸= trootBt ← (e⊥B(1;2)B(1;2)t )(1;2)gGt ← Gt + 〈fGtp ×1 BtPBt〉(1;2);(1;2) + 〈gGtr ×2 BtPBt〉(1;2);(1;2)Output Yg∗[Gt]← (Bt)t∈iApplying Algorithm (2.7) to the gradient of H(Bt),∇H(Bt) = (kt(Ikx − h−2t )k it )Pwhere Gt = kthtk it is the eigenvalue decomposition of Gt, yields the Riemanniangradient of the regularizer. Note that here, we avoid having to compute SVDsof any matricizations of the full data ϕ(x), resulting in a method which is muchfaster than other tensor completion methods that require the SVDs on tensors inRn1×n2×:::×nd [105]. Note that the cost of computing this regularizer H(Bt) and itsgradient are almost negligible compared to the cost of computing the objective andits Riemannian gradient.Finally, we should also note that the use of this regularizer is not designed toimprove the recovery quality of problem instances with a relatively large amountof data and is useful primarily in the case where there is very little data so as toprevent overfitting, as we shall see in the numerical results section.2.7.8 Convergence analysisOur analysis here follows from similar considerations in [Sec. 
3.6 , 161].Theorem 2.12 Let {xi} be an infinite sequence of iterates, with xi generated atiteration i, generated from (2.4) for the Gramian-regularized objective with  S 0f(x) =12‖eΩϕ(x)− w‖22 + 2∑t∈i\troottr(Gt(x)) + tr(G−1t (x))OThen limi→∞ ‖∇Rf(xi)‖ = 0.Proof To show convergence, we merely need to show that the iterates remainin a sequentially compact set, since any accumulation point of {xi} is a criticalpoint of f , by [Theorem 4.3.1, 1]. But this follows because by construction, since39Chapter 2. Low-rank Tensor Completionf(xi) ≤ f(x0) := X2 for all i. Letting Xi := ϕ(xi)12‖eΩϕ(xi)− w‖22 + 2∑t∈i\troottr(Gt(xi)) + tr(G−1t (xi)) =12‖eΩXi − w‖22 + 2∑t∈i\troot‖m(t)i ‖2F + ‖(m(t)i )†‖2F ≤ X2This shows, in particular, that2∑t∈i\troot‖m(t)i ‖2F ≤ X2 2∑t∈i\troot‖(m(t)i )†‖2F ≤ X2and therefore we have upper and lower bounds on the maximum and minimumsingular values of m(t)imax(m(t)i ) ≤ ‖m(t)i ‖F ≤ XR −1min(m(t)i ) ≤ ‖(m(t)i )†‖F ≤ XRand therefore the iterates mi stay within the compact setC = {X ∈ H : min(m(t)k ) ≥ RXP max(m(t)k ) ≤ XRP t ∈ i \ troot}OOne can show, as a modification of the proof in [253], that ϕˆ : M/G → H is ahomeomorphism on to its image, so that ϕˆ−1(X) is compact in M/G.We let ‖ · ‖i be the natural metric on M, which is to sayy(xP y) = ‖x− y‖i =∑t∈L‖jt − kt‖F +∑t∈c(i )‖Bt −Ct‖F Owhere x = (jtPBt), y = (ktPCt) ∈M OA metric on M/G that generates the topology on the quotient space, isy˜(.(x)P .(y)) = infA;B∈Gy(A(x)P B(y))O (2.19)Note that this pseudo-metric is a metric which generates the topology on M/G by[Theorem 2.1, 45] since {A}A∈G is a group of isometries acting onM and the orbitsof the action are closed by [Lemma 1, 253]. Note that this metric is equivalent toy˜(.(x)P .(y)) = infA∈Gy(xP A(y))P (2.20)which is well-defined and equal to (2.19) since ‖A(x)− B(y)‖i = ‖x− A−1B(y)‖iand APB vary over G.Let {xi} be a sequence in .−1(ϕˆ−1(X)). By the compactness of ϕ−1(X) inM/G,we have that there is some set of HT parameters y ∈ ϕ−1(X) such that, without loss40Chapter 2. Low-rank Tensor Completionof generality,y˜(.(xi)P .(y))→ 0 as i→∞OThen, by the characterization (2.20), since G is compact, there exists a sequenceAi ⊂ G such thaty(xiP Ai(y))→ 0Also by the compactness of G, there exists a subsequence {Ain} that converges toA ∈ G, so y(Ain (y)P A(y)) converges to 0 by continuity of the Lie group action .It then follows thaty(xin P A(y)) ≤ y(xin P Ain (y)) + y(Ain (y)P A(y))→ 0 as j →∞And so any sequence {xi} ∈ .−1(ϕˆ−1(X)) has a convergent subsequence and so.−1(ϕˆ−1(X)) is sequentially compact in M. Therefore since the sequence xk gen-erated by Algorithm (2.4) remains in .−1(ϕˆ−1(X)) for all i, a subsequence of xiconverges to some x ∈ .−1(ϕˆ−1(X)), and so x is a critical point of f .Although we have only shown convergence to a limit point where the gradientvanishes, the authors in [225] consider the convergence of general line search methodsthat operate on the affine variety of rank at most k matrices. It would be aninteresting extension of these results to the tensor case, but we leave this for futurework. 
We also remark that this proof holds for the Gauss-Newton method as well, since the results in [Theorem 4.3.1, 1] simply require that the search direction has a negative inner product with the gradient in the limit, which can be shown in this case.

2.8 Numerical Examples

To address the challenges of large-scale tensor completion problems, as encountered in exploration seismology, we implemented the approach outlined in this chapter in a highly optimized parallel Matlab toolbox entitled HTOpt (available at http://www.math.ubc.ca/~curtd/software.html for academic use). Contrary to the HT toolbox [162], whose primary function is performing operations on known HT tensors, our toolbox is designed to solve optimization problems in the HT format such as the seismic tensor completion problem. Our package includes the general optimization on HT manifolds detailed in Algorithm (2.1) as well as the sparsity-exploiting objective and Riemannian gradient in Algorithm (2.2), implemented in Matlab. We also include a parallel implementation using the Parallel Matlab toolbox for both of these algorithms. All of the following experiments were run on a single IBM x3550 workstation with 2 quad-core Intel 2.6GHz processors and 16GB of RAM running Linux 2.6.18.

In this section, we compare our Gauss-Newton method with the interpolation scheme detailed in [161], denoted geomCG, for interpolating tensors with missing entries on the Tucker manifold. We have implemented a completely Matlab-based version of geomCG, which does not take advantage of the sparsity of the residual when computing the objective and Riemannian gradient, but uses Matlab's internal calls to LAPACK libraries to compute matrix-matrix products and is much more efficient for this problem. To demonstrate this speedup, we compare our method to the reference mex implementation of geomCG from [161] on a randomly generated 200 × 200 × 200 tensor with multilinear rank 40 in each dimension, a training set comprised of 10% of the data points chosen randomly, and run for 20 iterations. The results of this comparison are shown in Table 2.1. We restrict our comparison to 3D tensors since the implementation of [161] available on the author's website is strictly for 3D tensors.

Method           Time (s)   Training error   Test error    Difference
Original [161]   4701       1.091 · 10^-3    2.792 · 10^-3  N/A
Ours             96.1       1.129 · 10^-3    2.384 · 10^-3  7.956 · 10^-4

Table 2.1 Comparison between geomCG implementations on a 200 × 200 × 200 random Gaussian tensor with multilinear rank 40 and subsampling factor 0.1, both run for 20 iterations. Quantities are relative to the respective underlying solution tensor.

Since we take advantage of dense, multithreaded linear algebra routines, we find that our Matlab implementation is significantly faster than the mex code of [161] when K ≥ 20 and |Ω| is a significant fraction of N^d, as is the case in the examples below.

In the examples considered below, we consider fixed multilinear ranks for each problem. There have been rank-increasing strategies proposed for matrix completion in, among other places, [254, 184, 32] and, as previously mentioned, in [161], but we leave the question of how to incorporate these heuristics into the HT optimization framework for future research.

2.8.1 Seismic data

We briefly summarize the structure of seismic data in this section. Seismic data is typically collected via a boat equipped with an airgun and, for our purposes, a 2D array of receivers positioned on the ocean floor.
2.8.1 Seismic data

We briefly summarize the structure of seismic data in this section. Seismic data is typically collected via a boat equipped with an airgun and, for our purposes, a 2D array of receivers positioned on the ocean floor. The boat periodically fires a pressure wave into the earth, which reflects off of subterranean discontinuities and produces a returning wave that is measured at the receiver array. The resulting data volume is five-dimensional, with two spatial source coordinates, denoted x_src, y_src, two receiver coordinates, denoted x_rec, y_rec, and time. For these experiments, we take a Fourier transform along the time axis and extract a single 4D volume by fixing a frequency; we let D denote the resulting frequency slice with dimensions n_src × n_src × n_rec × n_rec.

From a practical point of view, the acquisition of seismic data from a physical system only allows us to subsample receiver coordinates, i.e., Ω = [n_src] × [n_src] × I for some I ⊂ [n_rec] × [n_rec] with |I| < n_rec², rather than the standard tensor completion approach, which assumes that Ω ⊂ [n_src] × [n_src] × [n_rec] × [n_rec] is random and unstructured. As a result, we use the dimension tree in Figure 2.4 for completing seismic data.

Figure 2.4 Dimension tree for seismic data: the root {x_src, x_rec, y_src, y_rec} splits into {x_src, x_rec}, with leaves {x_src} and {x_rec}, and {y_src, y_rec}, with leaves {y_src} and {y_rec}.

With this choice, the fully sampled data D has quickly decaying singular values in each matricization D^(t) and is therefore represented well in the HT format. Additionally, the subsampled data P_Ω D has increased singular values in all matricizations, and is poorly represented as an HT tensor with fixed ranks k as a result. We examine this effect empirically in [81] and note that this data organization is used in [87] to promote low rank of the solution operators of the wave equation with respect to source and receiver coordinates. In a noise-free environment, seismic data volumes are modelled as the restriction of Green's functions of the wave equation to the acquisition surface, which explains why this particular coordinate grouping decreases the rank of seismic data. Moreover, since we restrict our attention to relatively low frequency data, these volumes are relatively smooth, resulting in quickly decaying singular values [224].

Although this approach is limited to considerations of seismic data, for larger dimensions or different domains, potentially the method of [21] can choose an appropriate dimension tree automatically. In the next section, we also include the case where Ω ⊂ [n_src] × [n_src] × [n_rec] × [n_rec], i.e., the "missing points" scenario, to demonstrate the added difficulty of the "missing receivers" case described above.

2.8.2 Single reflector data

For this data set, we generate data from a very simple seismic model consisting of two horizontal layers with a moderate difference in wavespeed and density between them. We generate this data with n_src = n_rec = 50 and extract a frequency slice at 4.21 Hz, rescaled to have unit norm.

We consider the two sampling scenarios discussed in the previous section: we remove random points from the tensor, with results shown in Figure 2.5, and we remove random receivers from the tensor, with results shown in Figure 2.6. Here geomCG(r_leaf)-c denotes the Tucker interpolation algorithm with rank r_leaf in each mode and c rank continuation steps, i.e., the approach proposed in [161].
We also let HT(r_leaf, r_xsrc,xrec) denote the HT interpolation method with rank r_leaf as in the Tucker interpolation and rank r_xsrc,xrec as the internal rank for the dimension tree. As is customary in the seismic literature, we measure recovery quality in terms of SNR, namely

SNR(X, D) = −20 log₁₀( ‖X_{Ω^c} − D_{Ω^c}‖ / ‖D_{Ω^c}‖ ) dB,

where X is our interpolated signal, D is our reference solution, and Ω^c = [n_src] × [n_src] × [n_rec] × [n_rec] \ Ω. As we can see in Figure 2.5, the HT formulation is able to take advantage of the low-rank separability of the seismic volume to produce a much higher quality solution than that of the Tucker tensor completion. The rank continuation scheme does not seem to improve the recovery quality of the Tucker solution to the same degree as in [161], although it does seem to mitigate some of the overfitting errors for geomCG(30). We display slices for fixed source coordinates and varying receiver coordinates in Figure 2.5 for randomly missing points and in Figure 2.6 for randomly missing receivers. We summarize our results in Tables 2.2 and 2.3. By exploiting the low-rank structure of the HT format compared to the Tucker format, we are able to achieve much better results than Tucker tensor completion, especially for the realistic case of missing receiver samples.

In all instances for these experiments, the HT tensor completion outperforms the conventional Tucker approach both in terms of recovery quality and recovery speed. We note that geomCG does not scale as well computationally as our HT algorithm for d > 3, as the complexity analysis in [161] predicts. As such, we only consider the HT interpolation in the next sections, where we will solve the tensor completion problem for much larger data volumes.

Figure 2.5 Reconstruction results for 90% missing points, best results for geomCG and HTOpt. Panels: true data; input data b = P_Ω X*; HT(30,80), SNR 30.5 dB; geomCG(20)-0, SNR 29.4 dB.

Figure 2.6 Reconstruction results for subsampled receiver coordinates, best results for geomCG and HTOpt. Top row: 90% missing receivers, HT(20,20) with SNR 7.04 dB versus geomCG(20)-0 with SNR −1.92 dB. Bottom row: 70% missing receivers, HT(20,20) with SNR 20.4 dB versus geomCG(30)-5 with SNR 16.8 dB.

Percentage Training Data   10 %          30 %          50 %
geomCG(20)-0               28.5 (1023)   30.5 (397)    30.7 (340)
geomCG(30)-0               -6.7 (1848)   21.8 (3621)   31.5 (2321)
geomCG(30)-5               16.1 (492)    13.8 (397)    15.5 (269)
HTOpt(20,60)               30.1 (83)     30.4 (59)     30.4 (57)
HTOpt(20,80)               30.3 (121)    30.8 (75)     30.8 (53)
HTOpt(30,80)               31.6 (196)    32.9 (133)    33.1 (114)

Table 2.2 Reconstruction results for single reflector data, missing points, mean SNR over 5 random training sets. Values are SNR (dB) and time (in seconds) in parentheses.

Percentage Training Data   10 %          30 %          50 %
geomCG(20)-0               -5.1 (899)    9.9 (898)     18.5 (891)
geomCG(30)-5               -3.6 (1796)   -4.7 (1834)   6.1 (1802)
HTOpt(20,20)               6.1 (111)     19.8 (101)    20.1 (66)
HTOpt(30,20)               2.8 (117)     18.1 (109)    19.8 (94)
HTOpt(30,40)               0.0 (130)     13.4 (126)    21.6 (108)

Table 2.3 Reconstruction results for single reflector data, missing receivers, mean SNR over 5 random training sets. Values are SNR (dB) and time (in seconds) in parentheses.
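A small helper consistent with the SNR definition above may make the evaluation protocol concrete. This is an illustrative sketch rather than part of the HTOpt interface; X is the interpolated tensor, D the reference volume, and Omega a logical mask of observed entries.

    % Recovery SNR on the held-out (unobserved) entries, following the SNR
    % definition above (illustrative helper only).
    function snr = recovery_snr(X, D, Omega)
        test     = ~Omega(:);                      % complement of the training set
        residual = X(test) - D(test);
        snr      = -20 * log10(norm(residual) / norm(D(test)));
    end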
2.8.3 Convergence speed

We demonstrate the superior convergence of the Gauss-Newton method compared to the Steepest Descent and Conjugate Gradient methods when used to complete the simple, single reflector data volume with 50% missing receivers and all ranks set to 20. We start each method with the same initial point and allow each to run for at most 1000 iterations. As we can see in Figure 2.7, the Gauss-Newton method converges much faster than the other two while simultaneously having per-iteration computational costs that are comparable to the other methods.

Figure 2.7 Convergence speed of various optimization methods (objective value minus minimum versus iteration, for SD, CG-PRP, and GN).

2.8.4 Performance

We investigate the empirical performance scaling of our approach as N, d, K, and |Ω| increase, as well as the number of processors for the parallel case, in Figure 2.8. Here we denote the use of Algorithm (2.1) as the "dense" case and Algorithm (2.2) as the "sparse" case. We run our optimization code in Steepest Descent mode with a single iteration for the line search, and average the running time over 10 iterations and 5 random problem instances. Our empirical performance results agree very closely with the theoretical complexity estimates, which are O(N^d K) for the dense case and O(|Ω| d K³) for the sparse case. Our parallel implementation for the sparse case scales very close to the theoretical time O(1 / # processors).

Figure 2.8 Dense and sparse objective and gradient performance. Panels: fixed K, d, varying N with |Ω| = 1000N (dense versus O(N^d), sparse versus |Ω|); fixed N, d, |Ω|, varying K (dense versus O(K), sparse versus O(K³)); fixed N, K, |Ω|, varying d (sparse versus O(d)); fixed N, K, d, Ω, varying number of processors (sparse versus O(1/p)).

2.8.5 Synthetic BG Compass data

This data set was provided to us by the BG Group company and consists of 5D data generated from an unknown synthetic model. Here n_src = 68 and n_rec = 401, and we extract frequency slices at 4.86 Hz, 7.34 Hz, and 12.3 Hz. On physical grounds, we expect a slower decay of the singular values at higher frequencies, and thus the problem is much more difficult at 12.3 Hz compared to 4.86 Hz.

At these frequencies, the data has relatively low spatial frequency content in the receiver coordinates, and thus we subsample the receivers by a factor of 2 to n_rec = 201, for the purposes of speeding up the overall computation and ensuring that the intermediate vectors in the optimization are able to fit in memory. Our overall data volume has dimensions D ∈ R^{68×68×201×201}.

We randomly remove varying amounts of receivers from this reduced data volume and interpolate using 50 iterations of the GN method discussed earlier.
We display several recovered slices for fixed source coordinates and varying receiver coordinates (so-called common source gathers in seismic terminology) in Figures 2.9, 2.10, and 2.11. We summarize our recovery results for tensor completion on these data sets from missing receivers in Table 2.4 and the various recovery parameters we use in Table 2.5. When the subsampling rate is extremely high (90% missing receivers in these examples), the recovery can suffer from overfitting issues, which leads to spurious artifacts in the recovered volume and lower SNRs overall. Using the Gramian-based regularization method discussed earlier, we can mitigate some of those artifacts and boost recovered SNRs, as seen in Figure 2.12.

Frequency   % Missing   Train SNR (dB)   Test SNR (dB)   Runtime (s)
4.86 Hz     25%         21.2             21              4033
            50%         21.3             20.9            4169
            75%         21.5             19.9            4333
            90%         19.9             10.4            4679
            90%*        20.8*            13.0*           5043
7.34 Hz     25%         17.3             17.0            4875
            50%         17.4             16.9            4860
            75%         17.7             16.5            5422
            90%         16.6             9.82            4582
            90%*        16.6*            10.5*           4947
12.3 Hz     25%         14.9             14.2            5950
            50%         15.2             13.8            7083
            75%         15.8             9.9             7387
            90%         13.9             5.39            4578
            90%*        14*              6.5*            4966

Table 2.4 HT recovery results on the BG data set, randomly missing receivers. Starred quantities are computed with regularization.

Figure 2.9 HT interpolation results on the BG data set with 75% missing receivers at 4.86 Hz, shown for fixed source coordinates: true data, subsampled data, interpolated data with SNR 20 dB, and difference (axes are receiver x and receiver y).

Figure 2.10 HT interpolation results on the BG data set with 75% missing receivers at 7.34 Hz, shown for fixed source coordinates: true data, subsampled data, interpolated data with SNR 17.7 dB, and difference.

Figure 2.11 HT interpolation results on the BG data set with 75% missing receivers at 12.3 Hz, shown for fixed source coordinates: true data, subsampled data, interpolated data with SNR 11.2 dB, and difference.
Figure 2.12 Regularization reduces some of the spurious artifacts and reduces overfitting in the case where there is very little data: true data, subsampled data, no regularization with SNR 10.4 dB, and regularization with SNR 13.9 dB. 4.86 Hz data, 90% missing receivers.

Frequency   r_xsrc,xrec   r_xsrc   r_xrec   HT-SVD SNR (dB)
4.86 Hz     150           68       120      21.1
7.34 Hz     200           68       120      17.0
12.3 Hz     250           68       150      13.9

Table 2.5 HT parameters for each data set and the corresponding SNR of the HT-SVD approximation of each data set. The 12.3 Hz data is of much higher rank than the other two data sets and thus is much more difficult to recover.

2.8.6 Effect of varying regularizer strength

In this section, we examine the effect of varying the λ parameter based on the scenario described in Figure 2.12. That is to say, we use the same data volume D as in the previous section with 90% of the receiver points removed and recover this volume using 50 iterations of the Gauss-Newton method for λ = 0, 10⁻¹⁵, …, 10⁻¹⁰. We plot the resulting test SNRs in Figure 2.13 and indicate λ = 0 on the log-scale graph by −16, since 10⁻¹⁶ is effectively at machine precision. For this problem with very little data, choosing an appropriate λ parameter can mitigate some of the overfitting errors; in our experiments, we found that a parameter range from 10⁻¹² to 10⁻¹⁰, i.e., small but nonzero values, improved our recovery SNRs. Higher values of λ than those shown here resulted in underfitting of the data and significantly worse SNRs and were omitted. In practice, one would employ a cross-validation scheme to choose an appropriate λ.

Figure 2.13 Recovery SNR versus log₁₀(λ) for the 4.86 Hz data, 90% missing receivers.
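A hold-out cross-validation loop of the kind alluded to above could be sketched as follows. This is only an illustrative outline under stated assumptions: fit_htensor is a hypothetical placeholder for any routine that returns an interpolated volume given training entries and a regularizer weight; it is not an HTOpt function, and the 20% hold-out fraction is arbitrary.

    % Sketch of hold-out cross-validation over the regularizer strength lambda.
    lambdas = 10.^(-15:-10);                        % candidate regularizer strengths
    holdout = Omega & (rand(size(Omega)) < 0.2);    % hold out 20% of the training entries
    train   = Omega & ~holdout;
    snr_val = zeros(size(lambdas));
    for j = 1:numel(lambdas)
        Xj         = fit_htensor(D, train, lambdas(j));   % hypothetical solver call
        r          = Xj(holdout) - D(holdout);
        snr_val(j) = -20 * log10(norm(r) / norm(D(holdout)));
    end
    [~, jbest] = max(snr_val);
    lambda     = lambdas(jbest);    % then re-fit on all of Omega with this value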
2.9 Discussion

For the seismic data examples considered in this chapter, we have assumed a particular sampling scheme, i.e., receiver-only sampling, which is a much more feasible acquisition scenario than the traditional pointwise sampling in all coordinates simultaneously. Other application areas besides seismology may have further insights into appropriate and feasible sampling schemes for these data volumes.

It is no accident that the frequencies of the data in this section are quite low: low frequency seismic data has a much faster decay of singular values in multiple dimensions compared to higher frequency data [16]. As the frequency of the data increases, potentially a multidimensional extension of the hierarchical partitioning scheme introduced in [163] could be employed to reconstruct the data. Given that we are ultimately interested in using this completed data volume for solving an inverse problem, a multi-frequency inversion scheme modeled by the Helmholtz equation, for example, is typically only performed in a frequency band from, say, 3-15 Hz. If we are only interested in frequencies in this range, the HT method can perform well.

One important concern for utilizing these methods in practice is how to adequately estimate the HT rank parameters in each dimension, given data with missing entries. Cross-validation is one viable, albeit expensive, strategy for exploring the space of HT ranks, given the computational costs associated with interpolating the large data volume. Alternatively, one may simply stipulate that the ranks cannot exceed a fixed fraction of the spatial sampling of their corresponding dimensions, arising from the need to keep the computational costs under control.

2.10 Conclusion

In this chapter we have developed the algorithmic components to solve optimization problems on the manifold of fixed-rank Hierarchical Tucker tensors. By exploiting this manifold structure, we solve the tensor completion problem where the tensors of interest exhibit low-rank behavior. Our algorithm is computationally efficient because we mostly rely on operations on the small HT parameter space. The manifold optimization itself guarantees that we do not run into convergence issues, which arise when we ignore the quotient structure of the HT format. Our application of this framework to seismic examples confirms the validity of our new approach and outperforms existing Tucker-based approaches for large data volumes. To stabilize the recovery for high subsampling ratios, we introduced an additional regularization term that exploits properties of the Gramian matrices without the need to compute SVDs in the ambient space.

While the method clearly performs well on large-scale problems, there are still a number of theoretical questions regarding the performance of this approach. In particular, the generalization of matrix completion recovery guarantees to the HT format remains an open problem. As in many alternative approaches to matrix and tensor completion, the selection of the rank parameters and regularization parameters remains challenging, both theoretically and from a practical point of view. However, this chapter clearly illustrates that the HT format is a viable option to represent and complete high-dimensional data volumes in a computationally feasible manner.

Chapter 3

The Moreau Envelope and the Polyak-Lojasiewicz Inequality

3.1 Introduction

In this chapter, we study C¹ convex functions that satisfy the Polyak-Lojasiewicz (PL) inequality, that is to say, functions f : Rⁿ → R satisfying

½‖∇f(x)‖₂² ≥ μ(f(x) − min f)  ∀x ∈ X,  (3.1)

where μ > 0 is a constant and X ⊂ Rⁿ. We denote the set of all C¹ convex functions satisfying (3.1) as PL(μ, X) and simply write PL(μ) when X = Rⁿ. As shown in [147], the PL-inequality is in some sense the weakest condition one can impose on a function in order to ensure that gradient descent converges at a linear rate to a global minimizer. This condition also implies that stationary points are global minimizers, although it does not imply that there are unique minimizers, as is the case with a strongly convex function. Strongly convex functions satisfy (3.1), but it is possible for convex but not strongly convex functions, or even some non-convex functions, to satisfy this inequality. In this chapter, we consider the behaviour of functions satisfying (3.1) under various transformations, including composition with smooth maps and under the Moreau envelope. Lemma 3.4 proved in this chapter will be used in Chapter 4, while the other lemmas are more general, and all of the stated results are novel.
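To make the discussion above concrete, a short numerical illustration (not part of the original text) is the standard example of a function that is convex and PL but not strongly convex: a least-squares objective with a rank-deficient matrix. Gradient descent with step size 1/L still exhibits a geometric decrease of the optimality gap. All names below are illustrative.

    % f(x) = 0.5*||A*x - b||^2 with a wide A is convex but not strongly convex,
    % yet gradient descent converges linearly, consistent with the PL inequality.
    rng(0);
    A = randn(50, 100);                 % wide matrix: f is not strongly convex
    b = randn(50, 1);
    L = norm(A)^2;                      % Lipschitz constant of the gradient
    f = @(x) 0.5 * norm(A*x - b)^2;
    fmin = 0.5 * norm(b - A*(pinv(A)*b))^2;   % ~0 here, since A has full row rank
    x   = zeros(100, 1);
    gap = zeros(200, 1);
    for k = 1:200
        x      = x - (1/L) * (A' * (A*x - b));   % gradient step with step size 1/L
        gap(k) = f(x) - fmin;
    end
    % semilogy(gap) shows a straight line, i.e. linear (geometric) convergence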
3.2 Notation

The notation introduced in this section will encompass this chapter as well as Chapter 4.

We use ⟨x, y⟩ to denote the Euclidean inner product between vectors x and y. ℓ_p norms are denoted as ‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p} for 1 ≤ p < ∞ and ‖x‖_∞ = max_{i=1,…,n} |x_i|. The ℓ_0 pseudonorm counts the number of nonzero elements of a vector, i.e., ‖x‖_0 = |{i : x_i ≠ 0}|. Mixed ℓ_{p,q} norms operate on matrices and satisfy

‖X‖_{p,q} = ( ∑_i ( ( ∑_j |X_{i,j}|^q )^{1/q} )^p )^{1/p}

when p, q ∈ [1, ∞), and similarly for p or q = ∞.

The minimum value of a real-valued function f : Rⁿ → R is denoted either as min_x f(x) or min f when convenient. The set of minimizers is denoted argmin f = argmin_x f(x) = {x : f(x) = min f}. When f is convex, argmin f is a convex set. The domain of an extended-real valued function f : Rⁿ → R ∪ {∞} is the set dom f = {x : f(x) < ∞}. A proper convex function is one such that dom f ≠ ∅. The convex hull of a set X is the set of all convex combinations of elements of X, conv X := {∑_{i=1}^p α_i x_i : ∑_{i=1}^p α_i = 1, α_i ≥ 0, x_i ∈ X}. The interior of a set X is defined as int X := ∪_{Y ⊂ X, Y open} Y.

A mapping G : Rⁿ → Rᵐ is Lipschitz-continuous with constant L when it satisfies

‖G(x) − G(y)‖₂ ≤ L‖x − y‖₂  ∀ x, y.

The indicator function of a set X, denoted δ_X(x) or δ(x | X), is defined as

δ_X(x) = 0 if x ∈ X, and ∞ otherwise.

The epigraph of a function f(x), denoted epi f, is the set epi f = {(x, α) : f(x) ≤ α}. For a closed, convex set X ⊂ Rⁿ, the distance function

d_X(x) = min_z ‖z − x‖₂  s.t. z ∈ X,

which measures the distance from the vector x to the set X, is a convex function. We have that d_X(x) = 0 if and only if x ∈ X. For x ∉ X, there is always a unique closest point in X to x, which we denote as P_X(x), and which satisfies

P_X(x) = argmin_z ‖z − x‖₂  s.t. z ∈ X.

Moreover, the squared distance function ½ d_X²(x) is convex and differentiable with

∇ ½ d_X²(x) = x − P_X(x).

The Moreau envelope of a convex function f(x) with parameter λ is defined as

M_λ f(z) = min_x (1/(2λ)) ‖x − z‖₂² + f(x).

We let prox_{λf}(z) = argmin_x (1/(2λ)) ‖x − z‖₂² + f(x) be the corresponding proximal operator. Note that ½ d_X²(x) = M_1 δ_X(x). Moreau envelopes are well studied in the literature, and we recall some basic facts here.

Facts 3.1
1. M_λ f is convex and C¹ whenever f is convex [Theorem 2.26, 216]
2. ∇M_λ f(x) = (1/λ)(x − prox_{λf}(x))
3. The gradient of M_λ f is λ⁻¹-Lipschitz continuous
4. argmin M_λ f = argmin f and min M_λ f = min f
5. x ∈ argmin f if and only if prox_{λf}(x) = x
6. M_λ f is strongly convex if and only if f is strongly convex [Lemma 2.19, 203]

From [12.24, 25], we have that

M_λ f(x) = f(prox_{λf} x) + (1/(2λ)) ‖x − prox_{λf} x‖₂².  (3.2)

A convex function h(x) is a gauge if it satisfies h(x) = γ(x | X) = inf{λ ≥ 0 : x ∈ λX}, where X is a convex set containing the origin, and so h(0) = 0. Examples of such functions are norms, in particular atomic norms [67]

‖x‖_A := inf{λ ≥ 0 : x ∈ λ conv(A)},  (3.3)

where A is a set of atoms that induce a low-complexity structure of interest. (3.3) induces a norm when A is a symmetric set, i.e., A = −A, and 0 ∈ int(conv A). For instance, if A consists of n-length vectors with a single nonzero element, then ‖x‖_A is the ℓ_1 norm, the convex relaxation of the ℓ_0 norm.

The corresponding polar gauge h°(z) is defined as

h°(z) := inf{t > 0 : t h(x) ≥ ⟨x, z⟩  ∀x}.

When h(z) = ‖z‖_p is a norm, the polar gauge is the corresponding dual norm

h°(z) = ‖z‖_q,

where 1/p + 1/q = 1.

3.3 Lemmas

Lemma 3.2 If f ∈ PL(μ, K) for a compact set K with argmin f ⊂ K, and g : Rᵐ → Rⁿ is a C¹ mapping satisfying

min_{x ∈ K} σ_min(∇g(x)) = σ > 0,

then the composition k(x) = f(g(x)) satisfies the PL-inequality with constant μσ² on K.
When g(x) = Ax − b is affine, for some matrix A and constant vector b, σ = σ_min(A).

Proof We have that

½‖∇k(x)‖₂² = ½‖∇g(x)^T ∇f(g(x))‖₂² ≥ (σ²/2)‖∇f(g(x))‖₂² ≥ σ²μ(f(g(x)) − min f) = σ²μ(k(x) − min f)

whenever x ∈ K. Note that min f ≤ min k = min_x f(g(x)) by definition, so −min f ≥ −min k, and it follows that

½‖∇k(x)‖₂² ≥ σ²μ(k(x) − min k),

as desired.

The PL-constant of f is important for the analysis of steepest descent methods used to compute min f, as analyzed in [147]. The larger the μ, the faster the (worst-case) rate of convergence. Thus, it is interesting to study whether certain transformations of f that preserve the set argmin f either improve or degrade this constant. For the Moreau envelope, the following lemmas demonstrate that there is an upper limit to the PL-constant of the Moreau envelope of a function, and that the convex functions that realize this upper limit are indicator functions. This will imply that the distance function ½ d_X²(x) has a PL-constant of 1.

Lemma 3.3 If M_λ f ∈ PL(μ), then μ ≤ 1/λ.

Proof We rewrite (3.2) as

M_λ f(x) − min f = f(prox_{λf} x) − min f + (λ/2)‖∇M_λ f(x)‖₂² ≥ (λ/2)‖∇M_λ f(x)‖₂²,

since f(prox_{λf} x) − min f ≥ 0. Since M_λ f ∈ PL(μ), then

(λ/2)‖∇M_λ f(x)‖₂² ≤ M_λ f(x) − min f ≤ (1/(2μ))‖∇M_λ f(x)‖₂²,

which implies that μ ≤ 1/λ.

Lemma 3.4 Given the Moreau envelope M_λ f of a closed, proper convex function f, M_λ f ∈ PL(1/λ) if and only if f = δ_X + d for some convex set X and some constant d.

Proof The "if" direction follows by computing, supposing d = 0 without loss of generality,

½‖∇M_λ δ_X(x)‖₂² = ½‖λ⁻¹(x − P_X(x))‖₂² = (1/λ) M_λ δ_X(x).

Suppose that M_λ f satisfies the PL-inequality with μ = 1/λ. Then for any x ∈ dom f,

½‖∇M_λ f(x)‖₂² ≥ (1/λ)(M_λ f(x) − min f).

Here we note that min M_λ f = min f. So by (3.2),

½‖∇M_λ f(x)‖₂² = (1/(2λ²))‖x − prox_{λf} x‖₂² = (1/λ)(M_λ f(x) − f(prox_{λf} x)).

And therefore

(1/λ)(M_λ f(x) − f(prox_{λf} x)) ≥ (1/λ)(M_λ f(x) − min f).

Rearranging the above inequality, it follows that f(prox_{λf} x) ≤ min f, and therefore f(prox_{λf} x) = min f, which implies that prox_{λf} x ∈ argmin f.

Since f(prox_{λf} x) = min f, we must have that prox_{λf} x = x, which is equivalent to x ∈ argmin f. Since x was arbitrary, it follows that dom f = argmin f, and so f is constant on its domain. As f is convex, this set, which we denote X, is convex, and it follows that f = δ_X + d, where d = min f.

Lemma 3.5 If f ∈ PL(μ), then its Moreau envelope satisfies M_λ f ∈ PL(μ/(1 + λμ)).

Proof Note that since the proximal point prox_{λf}(x) = argmin_y (1/(2λ))‖x − y‖₂² + f(y), the optimality conditions of this program imply that x − prox_{λf} x = λ∇f(prox_{λf} x). Therefore we have that

½‖∇M_λ f(x)‖₂² = ½‖λ⁻¹(x − prox_{λf} x)‖₂² = ½‖∇f(prox_{λf} x)‖₂² ≥ μ(f(prox_{λf} x) − min f) = μ(M_λ f(x) − (1/(2λ))‖x − prox_{λf} x‖₂² − min f),

where the second inequality is the PL-inequality for f and the last line follows from (3.2). Rearranging yields

½‖∇M_λ f(x)‖₂² + (μ/(2λ))‖x − prox_{λf} x‖₂² = (1 + λμ) · ½‖∇M_λ f(x)‖₂² ≥ μ(M_λ f(x) − min f),

and the theorem follows from noting that min f = min M_λ f.

In the goal of having as large a μ as possible, the previous lemma demonstrates that the Moreau envelope can only decrease the PL constant of the original function, resulting in slower convergence of gradient descent in the worst-case scenario. There exist functions that realize this worst-case scenario, showing that the aforementioned bounds are tight.

Example For the simple quadratic function f(x) = (μ/2)‖x‖₂², it is easy to show that f ∈ PL(μ) and that this is the largest PL-constant this function can have.
The corresponding Moreau envelope M_λ f is

M_λ f(x) = ½ · (μ/(1 + λμ)) ‖x‖₂²,

which has a maximal PL-constant of μ/(1 + λμ), which is strictly less than 1/λ, as per Lemma 3.3.

In this context, one can ask how much one can "improve" a function, imbuing f with the PL-inequality through its Moreau envelope, if f does not satisfy the PL-inequality originally. In general, this might be difficult to analyze for general convex functions. Our previous analysis deals with the case when f is an indicator function, and so we consider specific computable examples below.

The Huber function, ρ_κ(x), is the Moreau envelope of the absolute value function,

ρ_κ(x) = min_y (1/(2κ))(x − y)² + |y| = (1/(2κ)) x² if |x| ≤ κ, and |x| − κ/2 if |x| > κ.  (3.4)

The vector Huber norm, which by abuse of notation we also write ρ_κ(x), is defined as ρ_κ(x) = ∑_{i=1}^n ρ_κ(x_i), and satisfies ρ_κ(x) = min_y (1/(2κ))‖y − x‖₂² + ‖y‖₁. It is often used in regression contexts [114, 180, 179] to smooth out the singularity associated with the ℓ₁ norm yet retain the linear growth for large residuals, which is related to a smaller sensitivity to large outlying values. The Huber norm is not strongly convex, since ‖x‖₁ is not strongly convex.

Lemma 3.6 For any 0 < μ ≤ 1/κ, ρ_κ(x) ∈ PL(μ, X), where

X = { x : ‖x‖_∞ ≤ κ/2 + 1/(2μ) }.

Proof We consider the one-dimensional Huber norm, as the multi-dimensional case follows a similar argument. Note that

ρ_κ′(x) = x/κ if |x| ≤ κ, and sign(x) if |x| > κ,

where sign(x) = x/|x| if x ≠ 0, and 0 if x = 0.

Fix any 0 < μ ≤ 1/κ. In the region |x| ≤ κ, it follows that

½(ρ_κ′(x))² = (1/κ) ρ_κ(x) ≥ μ ρ_κ(x).

When κ ≤ |x| ≤ κ/2 + 1/(2μ), we have that

½(ρ_κ′(x))² = ½,

and since |x| ≤ κ/2 + 1/(2μ),

μ ρ_κ(x) = μ(|x| − κ/2) ≤ ½.

To ensure that κ ≤ κ/2 + 1/(2μ), it must hold that μ ≤ 1/κ. Note that we can take μ = 1/κ in this case without contradicting Lemma 3.4, since the set of points on which the PL-inequality holds for ρ_κ(x) is not the entirety of dom(ρ_κ). Decreasing the parameter μ increases the region of validity of this bound. The interested reader can verify that this is the largest set on which (3.1) holds for ρ_κ(x).
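As a quick numerical sanity check of (3.4) (not part of the original text), the Huber function can be evaluated either through its Moreau-envelope definition, using soft-thresholding as the proximal map of |·|, or through the closed form; the two agree to machine precision. The value κ = 0.5 below is an arbitrary illustrative choice.

    % Check that the Moreau envelope of |.| matches the closed-form Huber (3.4).
    kappa = 0.5;
    x     = linspace(-3, 3, 601);
    y     = sign(x) .* max(abs(x) - kappa, 0);               % prox of kappa*|.|
    huber_env    = (1/(2*kappa)) * (x - y).^2 + abs(y);      % envelope definition
    huber_closed = (abs(x) <= kappa) .* (x.^2 / (2*kappa)) ...
                 + (abs(x) >  kappa) .* (abs(x) - kappa/2);  % closed form (3.4)
    max(abs(huber_env - huber_closed))                       % ~ 1e-16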
Chapter 4

A level set, variable projection approach for convex composite optimization

4.1 Introduction

Although least-squares formulations are elegant in their mathematical simplicity and computational efficiency, real-world noise often does not follow a Gaussian distribution. Instead, measurements acquired from a physical experiment can be contaminated with large outliers. Consider the robust tensor completion problem, where our signal, which is a low-rank tensor, has been subsampled and a percentage of the remaining samples has been corrupted by high amplitude noise. As mentioned previously in Section 1.2, the maximum likelihood estimator is the solution to the following optimization problem

min_{X ∈ H} ‖A(X) − b‖₁.

Here H is our chosen class of low-rank tensors and A is the subsampling operator. We consider the class of Hierarchical Tucker tensors, as developed in Chapter 2, which allows us to write this program as

min_{x ∈ M} ‖A(φ(x)) − b‖₁.

This problem instance is a particular case of the more general class of convex composite problems, an important class of non-convex optimization programs of the form

min_{x ∈ X} h(c(x)).  (4.1)

Here h : Rᵐ → [−∞, ∞] is a (typically) non-smooth, proper function with closed, bounded level sets, c : Rⁿ → Rᵐ is a C² mapping, and X is a closed convex set, often Rⁿ. Despite being non-smooth, h is often quite structured, typically convex, and often polyhedral. Here we assume that we can project onto the level sets of h easily, i.e., we can efficiently compute

P_{h ≤ τ}(y) = argmin_x ‖x − y‖₂  s.t. h(x) ≤ τ

for a given pair (y, τ). For theoretical reasons, we also assume that c is coercive, i.e., lim_{‖x‖→∞} ‖c(x)‖ = ∞. This problem formulation encompasses many important applications such as hinge-loss support vector machines [121, 267, 227] and nonlinear least squares [107, 151, 88]. Problems of the form

min_x f(x) + g(x),

where f is smooth and g is non-smooth but convex and "prox-friendly", i.e., min_z ½‖z − x‖₂² + g(z) is easily computable, are often referred to as convex-composite problems in the literature [43], but we use the nomenclature additive composite problems here to distinguish this particular case. Problems of this form can be written in the convex-composite form (4.1) with c(x) = [f(x); x] and h(f, x) = f + g(x). See [167] and the references contained therein for more examples.

Despite the ubiquity of this problem formulation, there have been relatively few numerical examples tackling problems of the form (4.1) directly. Many approaches have been developed that sidestep the issue of the non-smoothness of h through the use of the Alternating Direction Method of Multipliers (ADMM) [59, 229], which decouples the problem into a series of simpler subproblems. The convergence of ADMM is not well understood for general nonlinear operators, and one must solve a large number of nonlinear least-squares problems as a result. From a practical point of view, there are a number of auxiliary parameters that affect the convergence of such methods, and their dependence on these parameters is not well understood in general, although some effort has been made to understand specific cases in [106]. One could also smooth the function h with a Moreau envelope or similar mapping, but the optimal smoothness parameter can depend on constants that are unknown in general [91].

One technique for solving (4.1) is the Gauss-Newton method [43, 44, 169, 265]. At every iteration, the Gauss-Newton method linearizes the function c around the current point x_k and repeatedly chooses a step direction satisfying

d_k ∈ argmin_{‖d‖ ≤ Δ} h(c(x_k) + ∇c(x_k)d)

for some constant Δ > 0. Practically speaking, solving these subproblems is extremely challenging for large-scale problems, owing to the norm constraint on d and the non-smooth objective h. Another approach, first proposed in [167] and studied further in [168, 139], is to use a proximal method to solve the linearized problem, i.e., by choosing a search direction d_k satisfying

d_k = argmin_d h(c(x_k) + ∇c(x_k)d) + (β/2)‖d‖₂².  (4.2)

Although this subproblem is strongly convex in d, it is still nonsmooth, and an explicit representation for d_k does not usually exist, unlike for proximal methods for additive problems. For instance, using the mirror descent method to solve (4.2) yields a convergence rate of O(1/i) after i iterations, which is far too slow for large problems [144].

The aforementioned approaches tackle the original optimization problem directly, which can be computationally challenging for non-smooth h. One alternative method to deal with such optimization programs is the SPGL1 algorithm introduced in [255]. The SPGL1 approach solves the Basis Pursuit problem

min_x ‖x‖₁  s.t. Ax = b

using a level-set method.
Instead of solving the BP problem directly, which has a non-smooth objective and a constraint that is difficult to satisfy for general linear operators A, SPGL1 inverts the objective and constraints, yielding a series of related problems

v(τ) = min_x ½‖Ax − b‖₂²  s.t. ‖x‖₁ ≤ τ  (4.3)

for a specifically chosen sequence of scalar parameters τ. Here v(τ) is referred to as the value function of the original problem. In this case, the smallest τ > 0 such that v(τ) = 0 yields a solution x that solves the original BP problem. The value function is much easier to evaluate than the original problem, as one solves an optimization problem with a smooth objective and a constraint that is simple to project onto. The authors use a Newton root-finding method to update τ, exploiting the relationship between v′(τ) and the dual of (4.3). This idea of "flipping" the optimization problem to convert a non-smooth objective into a simple constraint, along with much of the same machinery as SPGL1, will allow us to solve the more general problem (4.1). These ideas are further explored in [17] for a particular class of convex programs.

4.2 Technique

We reformulate (4.1) by introducing the variable z = c(x) in order to decouple the non-smooth function h from the smooth mapping c, which gives us the equivalent problem

min_{x,z} h(z)  s.t. c(x) − z = 0.  (4.4)

Using the same SPGL1-type approach as previously, we flip the objective and the constraints to consider the associated value function of this problem, which is

φ(τ) = min_{x,z} ½‖c(x) − z‖₂²  such that h(z) ≤ τ.  (4.5)

As previously, the smallest τ that results in φ(τ*) = 0, denoted τ*, is the optimal value of (4.4), and the resulting solution x of (4.5) solves the original problem as well. If our current τ is such that φ(τ) > 0, then we are super-optimal and infeasible for the original problem, i.e., ‖c(x) − z‖₂ > 0 and h(z) < min_x h(c(x)).

To ease notation, we let L_τ := {z : h(z) ≤ τ}.

In order to evaluate (4.5), we could compute φ(τ) by means of a projected gradient method in the joint variables (x, z), but this may converge slowly. On the other hand, we can exploit the fact that the constraints act only on the z variable to speed up convergence with the variable projection (VP) approach [150, 107, 14]. The VP method considers the program φ(τ), at a fixed parameter x, as a function of z only. In this case, given our assumptions on h, minimizing the program with respect to z,

z(x) = P_{L_τ}(c(x)) = argmin_z g(x, z), where g(x, z) = ½‖c(x) − z‖₂²  such that z ∈ L_τ,

is easily computable, by assumption. Plugging this expression into g(x, z) yields the reduced objective

g̃(x) = g(x, z(x)) = ½‖c(x) − P_{L_τ}(c(x))‖₂² = ½ d²(c(x), L_τ).  (4.6)

Since L_τ is closed and convex by assumption, ½ d²(y, L_τ) is differentiable, and therefore so is g̃(x), with ∇g̃(x) = ∇c(x)^T (c(x) − z(x)). This agrees with the variable projection interpretation, as in [13]. Minimizing the reduced objective g̃(x) has been numerically shown to converge much faster and towards a much higher quality local minimum (in the non-convex case) compared to minimizing the joint objective g(x, z) [13]. Another interpretation of solving (4.5) is that we are looking for the point of minimal distance between the level set L_τ and the image of the mapping c, i.e., c(Rⁿ) = {c(x) : c(x) ∈ dom(h)}. The τ* that solves the original problem is the scalar such that d(c(Rⁿ), L_{τ*}) = 0, i.e., we enlarge the level set L_τ by increasing
the τ parameter until the point where c(Rⁿ) ∩ L_τ ≠ ∅, but c(Rⁿ) ∩ L_{τ′} = ∅ for all τ′ < τ. A pictorial representation of this situation is shown in Figure 4.1.

Figure 4.1 A cartoon depiction of the level set method for convex-composite optimization: the level set {z : h(z) ≤ τ} is enlarged by increasing τ until it touches the image of c, at which point v(τ*) = 0.

We can also handle convex constraints on the variable x, such as a bound on ‖x‖₁, by adding them to the constraints in (4.6). The resulting variable-projected problem is simply a nonlinear least-squares problem with constraints.

In order to update the parameter τ in the most general case, we use the secant method for finding the root of φ(τ) = 0, i.e.,

τ_{k+1} = τ_k − φ(τ_k) (τ_k − τ_{k−1}) / (φ(τ_k) − φ(τ_{k−1})).  (4.7)

The secant method is known to converge superlinearly for sufficiently smooth functions with simple roots [89].

We can also employ Newton's method, which converges quadratically, when we have a simple expression for v′(τ), as we will see in the following section.

If h is non-convex but has a simple projection, e.g., the ℓ₀ pseudonorm, then we simply use the secant method to update τ. In the case of ℓ₀, since τ is integer-valued, we perform rounding after each secant update. Algorithm (4.1) outlines the full method, which we call the VELVET algorithm (conVex compositE LeVel sEt wiTh variable projection).

Algorithm 4.1 The VELVET algorithm for solving problem (4.1)
Input: initial point x̃, number of allowed τ updates T
Let f(τ) = min_x ½ d²_{h(·)≤τ}(c(x)) and q(τ) = argmin_x ½ d²_{h(·)≤τ}(c(x))
If h(z) is a gauge,
  Newton ← true
  τ_0 ← 0
  f_1 ← f(τ_0), x_1 ← q(τ_0)
Otherwise
  Newton ← false
  τ_0 ← 0, τ_1 ← an initial estimate
  f_0 ← f(τ_0), x_0 ← q(τ_0)
  f_1 ← f(τ_1), x_1 ← q(τ_1)
Endif
for k = 1, 2, …, T,
  If Newton
    τ_{k+1} = τ_k − f_k / f′_k, with f′_k given by (4.11)
  Otherwise
    τ_{k+1} = τ_k − f_k (τ_k − τ_{k−1}) / (f_k − f_{k−1})
  Endif
  f_{k+1} = f(τ_{k+1}), x_{k+1} = q(τ_{k+1})
end
Return x_{T+1}
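A minimal sketch of the secant branch of Algorithm 4.1 may be helpful before turning to the theory. This is an illustrative outline only, not a released implementation: subproblem(tau) is a hypothetical placeholder for any routine that (approximately) evaluates the value function and returns its minimizer, for example by minimizing an objective of the form in Listing 4.1 with a projected gradient or quasi-Newton solver; tau1, T, and tol are assumed user inputs.

    % Sketch of the secant branch of Algorithm 4.1 (illustrative only).
    function x = velvet_secant(subproblem, tau1, T, tol)
        tau_prev = 0;    [f_prev, ~] = subproblem(tau_prev);
        tau      = tau1; [f, x]      = subproblem(tau);
        for k = 1:T
            if f < tol || abs(f - f_prev) < eps
                break;                                   % phi(tau) ~ 0: (near-)feasible
            end
            tau_new  = tau - f * (tau - tau_prev) / (f - f_prev);   % secant update (4.7)
            tau_prev = tau;  f_prev = f;
            tau      = max(tau_new, 0);                  % keep the level-set parameter nonnegative
            [f, x]   = subproblem(tau);
        end
    end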
4.3 Theory

We consider the convergence estimates when evaluating the subproblems

min_x ½ d²_{L_τ}(c(x)).  (4.8)

To ease the notation, let z(x) = P_{L_τ}(c(x)) and D(y) = ½ d²_{L_τ}(y), so that the subproblem objective can be written as p = D ∘ c. We consider algorithms defined on the set X = {x : p(x) ≤ p(x_0)} for some fixed point x_0.

Lemma 4.1 If c is a continuous, coercive mapping and the level set L_τ is bounded, then X is compact.

Proof Suppose X is not bounded. Then there is a sequence x_n ∈ X with ‖x_n‖₂ → ∞, and hence ‖c(x_n)‖₂ → ∞, since c is coercive. It follows that there is a k such that ‖c(x_k)‖₂ > sup_n ‖z(x_n)‖₂ + √(2 p(x_0)), since L_τ is bounded by assumption and z(x_n) ∈ L_τ. This would imply that

‖c(x_k) − z(x_k)‖₂ ≥ | ‖c(x_k)‖ − ‖z(x_k)‖ | = ‖c(x_k)‖ − ‖z(x_k)‖ > √(2 p(x_0))

for some k, contradicting the fact that x_n ∈ X for all n. Therefore X is bounded. To see that X is closed, consider x_n ∈ X with x_n → x. Then c(x_n) → c(x), since c is continuous, and since c(x_n) ∈ {z : D(z) ≤ D(c(x_0))} and D is continuous, and hence lower semi-continuous, this set is closed. Therefore

c(x) ∈ {z : D(z) ≤ D(c(x_0))},

which is a restatement of x ∈ X.

Recently, the authors in [147] proposed a unified convergence analysis framework for minimizing smooth convex functions f(x) with an L-Lipschitz continuous gradient that satisfy the Polyak-Lojasiewicz (PL) inequality

½‖∇f(x)‖₂² ≥ μ(f(x) − min f)  ∀x,

where μ > 0. Strongly convex functions satisfy the PL-inequality with the same constant, but the reverse implication does not necessarily hold. In our case, the distance function ½ d²_{L_τ}(x) is not strongly convex: since ½ d²_{L_τ} = M_1 δ_{L_τ} and δ_{L_τ} is not strongly convex, neither is ½ d²_{L_τ}(x), by [Lemma 2.19, 203]. We will still manage to obtain worst-case linear convergence for gradient descent due to the PL-inequality, as we shall see.

The benefit of considering the PL-inequality is that the convergence proof for gradient descent becomes drastically simplified, which we reproduce here for the reader's convenience.

Proposition 4.2 [147] If f(x) is a C¹ function with an L-Lipschitz continuous gradient, f satisfies the PL-inequality with constant μ, and argmin_x f(x) ≠ ∅, then applying the gradient descent method with step size 1/L,

x_{k+1} = x_k − (1/L)∇f(x_k),

has a global linear convergence rate

f(x_k) − f* ≤ (1 − μ/L)^k (f(x_0) − f*).

Proof A function f(x) with an L-Lipschitz continuous gradient satisfies

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖₂²  ∀ y, x.

Plugging in y = x_{k+1} = x_k − (1/L)∇f(x_k) and x = x_k gives the estimate

f(x_{k+1}) − f(x_k) ≤ ∇f(x_k)^T (x_{k+1} − x_k) + (L/2)‖x_{k+1} − x_k‖₂² ≤ −(1/(2L))‖∇f(x_k)‖₂²,

and the PL-inequality yields

f(x_{k+1}) − f(x_k) ≤ −(μ/L)(f(x_k) − min f).

Rearranging and subtracting min f from both sides gives

f(x_{k+1}) − min f ≤ (1 − μ/L)(f(x_k) − min f),

and applying this result recursively gives the linear convergence rate.

Lemma 4.3 Suppose that f is a C¹ function with an L-Lipschitz continuous gradient, g is a C¹ Lipschitz-continuous mapping with a ρ-Lipschitz continuous gradient, and let

β = max_{x∈K} ‖∇f(g(x))‖₂,  γ = max_{x∈K} ‖∇g(x)‖₂

for a compact set K. Note that β, γ < ∞, since K is compact and the mappings are continuous. Then k(x) = f(g(x)) has a (βρ + γ²L)-Lipschitz continuous gradient on K. If g(x) = Ax − b is affine, then k(x) = f(g(x)) has an L‖A‖₂²-Lipschitz continuous gradient.

Proof

‖∇(f ∘ g)(x) − ∇(f ∘ g)(y)‖₂ = ‖Dg(x)^T ∇f(g(x)) − Dg(y)^T ∇f(g(y))‖₂
= ‖(Dg(x) − Dg(y))^T ∇f(g(x)) + Dg(y)^T (∇f(g(x)) − ∇f(g(y)))‖₂
≤ (βρ + γ²L)‖x − y‖₂.

The affine case follows a similar derivation.

Proposition 4.4 Let x_0 be an initial point belonging to a neighbourhood of argmin p which does not contain any local but non-global minimizers. If c is a C¹ Lipschitz-continuous mapping with a ρ-Lipschitz continuous gradient, we let

β := max_{x∈X} ‖c(x) − P_{L_τ}(c(x))‖₂,  γ := max_{x∈X} ‖∇c(x)‖₂

for X = {x : p(x) ≤ p(x_0)}, which is compact. Suppose that

σ := min_{x∈X} σ_min(∇c(x)) > 0;

then the gradient descent iteration with constant step size 1/(βρ + γ²) converges linearly, i.e.,

p(x_k) − min p ≤ (1 − σ²/(βρ + γ²))^k (p(x_0) − min p).

If c(x) = Ax − b is affine, we can remove the restriction on the initial point x_0, and this convergence rate reduces to

p(x_k) − min p ≤ (1 − 1/κ²(A))^k (p(x_0) − min p),

where κ(A) is the condition number of A.

Proof The squared distance function D(x) satisfies the PL-inequality with μ = 1 by Lemma 3.4. Also note that D(x) has a 1-Lipschitz gradient, since D(x) = M_1 δ_{L_τ}. For the composition, ∇D(c(x)) = c(x) − P_{L_τ}(c(x)), so this gradient is bounded by ‖∇D(c(x))‖₂ ≤ β < ∞, where the latter inequality arises from the fact that X is compact and c is continuous. Applying Lemma 4.3, the gradient of the composite function p(x) = D(c(x)) is (βρ + γ²)-Lipschitz continuous, and p satisfies the PL-inequality with constant σ² by Lemma 3.2, and the theorem follows.

In order to compute v′(τ), we consider the case when h is a gauge, so that the variational machinery of [15] can be applied in the following considerations. If h(z) is a gauge, we can upgrade the secant method to Newton's method as follows. First note that, for a gauge γ(z | X) and the coordinate projection operator π
: (x, z) ↦ z, the composition Γ(x, z) = γ(· | X) ∘ π(x, z) is a gauge,

Γ(x, z) = γ((x, z) | Rⁿ × X),

with corresponding polar gauge

Γ°(x, z) = γ((x, z) | Rⁿ × X°),

where X° = {v : ⟨v, x⟩ ≤ 1 ∀x ∈ X} is the polar of X [Theorem 14.5, 217]. This can also be written as Γ°(x, z) = h°(z).

Theorem 4.5 Suppose that h is a gauge and c satisfies the hypotheses of Proposition 4.4. Then v_lin(τ) = v(τ), where

v_lin(τ) = min_x ½ d²_{L_τ}(c(x*) + ∇c(x*)(x − x*))  (4.9)

and x* ∈ argmin_x ½ d²_{L_τ}(c(x)). If it also holds that

v_lin(τ + Δτ) + O(|Δτ|²) = v(τ + Δτ)  (4.10)

for all Δτ sufficiently small, then

v′(τ) = −h°(z − c(x)),  (4.11)

where (x, z) = argmin_{x,z} ½‖c(x) − z‖₂² + δ_{L_τ}(z). Condition (4.10) always holds when c(x) = Ax − b is affine.

Proof First note that if x* ∈ argmin_x ½ d²_{L_τ}(c(x)), then x* satisfies

∇c(x*)^T (c(x*) − P_{L_τ}(c(x*))) = 0.

Computing the gradient of the objective function in the linearized problem (4.9) yields

∇c(x*)^T (c(x*) + ∇c(x*)(x − x*) − P_{L_τ}(c(x*) + ∇c(x*)(x − x*))),

and this gradient vanishes when x = x*. Since the objective satisfies the PL-inequality, by [147] x* is a global minimizer of (4.9), and therefore v_lin(τ) = v(τ).

If we also have that v_lin(τ + Δτ) + O(|Δτ|²) = v(τ + Δτ), then the difference quotient

(v(τ + Δτ) − v(τ)) / Δτ = (v_lin(τ + Δτ) + O(|Δτ|²) − v_lin(τ)) / Δτ → v′_lin(τ)

as Δτ → 0.

We can write

f(x, z, τ) = ½‖ [∇c(x*)  −I] [x; z] − (∇c(x*)x* − c(x*)) ‖₂² + δ((x, z, τ) | epi Γ),

where v_lin(τ) = min_{x,z} f(x, z, τ). It follows from [Theorem 6.2, 15] that if

(x̂, ẑ) ∈ argmin_{x,z} ½‖c(x*) + ∇c(x*)(x − x*) − z‖₂² + δ_{L_τ}(z),

we must have that

v′_lin(τ) = −h°(ẑ − (c(x*) + ∇c(x*)(x̂ − x*))) = −h°(ẑ − c(x*)),

where the second equality follows from the fact that

x* ∈ argmin_x ½ d²_{L_τ}(c(x*) + ∇c(x*)(x − x*)).

Note that we have ẑ = P_{L_τ}(c(x*)) in this case, completing the proof.

4.4 Numerical Examples

4.4.1 Cosparsity

The classical setup for compressed sensing aims to recover a signal x ∈ Rⁿ from a measurement vector y = Ax ∈ Rᵐ, where A ∈ R^{m×n} and m ≪ n. In order to ensure successful recovery, the signal x is assumed to be sparse with s nonzero entries, although the location of these entries is unknown. The strongest known theoretical guarantees for recovery are for random matrices; in particular, when A has Gaussian i.i.d. entries, m = O(s log(n/s)) samples are sufficient to recover x with high probability [58, 53]. If x is s-sparse in some dictionary D ∈ R^{n×p}, i.e., x = Dα where α ∈ Rᵖ has s nonzeros, with A being our measurement operator, we can recover x through solving the optimization program

min_α ‖α‖_p  such that ADα = y,  (4.12)

where p = 0 is the non-convex ℓ₀ pseudonorm and p = 1 is its convex relaxation. This is the synthesis model of compressed sensing studied in [90, 178, 57]. An alternative signal model is the cosparsity model [186], which stipulates that Ωx is sparse for some dictionary Ω ∈ R^{p×n}. A zero value in Ωx stipulates that the signal x is perpendicular to the corresponding row of Ω, and hence in totality x lies in a corresponding union of subspaces. These two models are equivalent only when the dictionary Ω is an orthogonal basis, but in general they produce different solutions. If Ωx is k-sparse, x is referred to as ℓ = p − k cosparse, and we are interested in having this quantity be as large as possible.

The associated optimization problem is

min_x ‖Ωx‖_p  such that ‖Ax − y‖_q ≤ ε,  (4.13)

where p = 0 is the original nonconvex problem, p = 1 is its corresponding convexification, and q ∈ [1, ∞] specifies a norm for the data misfit.
The choice of p will have a significant impact on our solution, as we shall see.

We recast this problem in our composite convex framework by introducing the variables r = Ax − y and z = Ωx, which gives

min_x ‖z‖_p  such that ‖r‖_q ≤ ε, Ax − y − r = 0, Ωx − z = 0.

The value function is therefore

v(τ) = min_{x,z,r} ½‖Ax − y − r‖₂² + ½‖Ωx − z‖₂²  such that ‖r‖_q ≤ ε, ‖z‖_p ≤ τ.

Projecting out (z, r) yields

z(x) = P_{‖·‖_p ≤ τ}(Ωx),  r(x) = P_{‖·‖_q ≤ ε}(Ax − y),

which gives the final problem that we solve,

v(τ) = min_x ½‖Ax − y − r(x)‖₂² + ½‖Ωx − z(x)‖₂².  (4.14)

A reference MATLAB implementation for evaluating the objective and gradient in (4.14) is shown in Listing 4.1 for the noiseless case where ε = 0. Note that to solve the ℓ₀ minimization problem, one can merely replace the ℓ₁ projection with the ℓ₀ projection, i.e., hard thresholding.

    function [f,g] = cosparsity_obj(A,b,x,Omega,tau)
        r = A*x - b;
        z = Omega*x;
        y = NormL1_project(z,tau);
        z = z - y;
        f = 0.5*norm(r)^2 + 0.5*norm(z)^2;
        if nargout >= 2
            g = A'*r + Omega'*z;
        end
    end

Listing 4.1 MATLAB code for computing the VELVET objective and gradient for the cosparsity subproblem

Since the mapping c from (4.1) is linear in this case, by Theorem 4.5 we have that

v′(τ) = −‖P_{‖·‖_p ≤ τ}(Ωx) − Ωx‖_{p*},

where ‖·‖_{p*} is the dual norm to ‖·‖_p.

Chambolle-Pock

We compare our method to the Chambolle-Pock method [64], which solves the problem

min_x F(Kx) + G(x),

where F and G are two convex, typically non-smooth functions, via the following iterations:

y_{n+1} ← prox_{σF*}(y_n + σK x̄_n)
x_{n+1} ← prox_{ωG}(x_n − ωK^T y_{n+1})
x̄_{n+1} ← x_{n+1} + θ(x_{n+1} − x_n).

Note that the ℓ₁ analysis problem (4.13) fits into this framework with F = ‖·‖₁, K = Ω ∈ R^{m×n}, and G(x) = δ_{‖A·−y‖₂ ≤ ε}(x). The parameters σ, ω should be such that σω = 1/‖K‖², which is equal to 1 for a tight frame operator. Note here that F*(u) = δ_{‖·‖_∞ ≤ 1}(u), so prox_{σF*}(u) is merely projection onto [−1, 1]ᵐ. In the noise-free case, ε = 0; since A restricts the values of the vector at specified indices Γ, the prox_{ωG} term becomes

(prox_{ωG}(u))_i = y_i if i ∈ Γ, and u_i otherwise.

Split Bregman method

We also compare our method to the split Bregman method of [48], which solves the ℓ₁ analysis problem via the following iterations:

x_{k+1} = argmin_x (μ/2)‖Ax − y + c_k‖₂² + (λ/2)‖Ωx − z_k + f_k‖₂²
z_{k+1} = argmin_z ‖z‖₁ + (λ/2)‖z − Ωx_{k+1} − f_k‖₂²
f_{k+1} = f_k + γ_f(Ωx_{k+1} − z_{k+1})
c_{k+1} = c_k + γ_c(Ax_{k+1} − y)

In what follows, Ω is the curvelet operator [54], which is not an explicit matrix. As such, we use LSQR [198] to solve the least-squares problem for x_{k+1} with a relative tolerance of 10⁻³. We run this algorithm for 200 iterations and use an early stopping criterion to quit if the relative residual in the subproblem for x_{k+1} is approximately one. Curvelets form a tight frame, i.e., Ω^T Ω = I, and LSQR terminates within a relatively small number of iterations.
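Listing 4.1 relies on a projection onto the ℓ₁ ball (there called NormL1_project). One standard way to implement such a projection is the sort-based scheme of Duchi et al. (2008); the sketch below is illustrative only and is not the routine actually used by the thesis software.

    % Projection onto {y : ||y||_1 <= tau}, i.e. argmin_y ||y - z||_2 s.t. ||y||_1 <= tau.
    function y = l1ball_project(z, tau)
        if norm(z, 1) <= tau
            y = z;                       % already feasible
            return;
        end
        u     = sort(abs(z), 'descend');
        sv    = cumsum(u);
        rho   = find(u > (sv - tau) ./ (1:numel(u))', 1, 'last');
        theta = (sv(rho) - tau) / rho;
        y     = sign(z) .* max(abs(z) - theta, 0);   % soft-threshold by theta
    end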
Seismic Data Interpolation

We test this approach on a realistic seismic data set by first randomly removing 50% of the traces from a single shot with 256 receivers and 1024 time samples. We compare the synthesis formulation in (4.12), solved with SPGL1 with a curvelet sparsity dictionary, with the analysis formulation in (4.13) for both p = 0 and p = 1. We also use the Chambolle-Pock method previously discussed, run for 2000 iterations, as well as the Split Bregman algorithm run for 200 iterations. For the p = 0 case, we also consider the GAP method [186], which directly tackles the non-convex analysis problem. The data are shown in Figure 4.2 and the results are shown in Figure 4.3 and Table 4.1. As expected, the ℓ₀ methods perform much better than the ℓ₁ methods, although the corresponding theoretical guarantees are less developed in the former case compared to the latter. Our proposed VELVET algorithm outperforms the GAP method on this small example, both in terms of recovery time and quality, and is much simpler to implement.

Figure 4.2 True and subsampled signal, 50% of the receivers removed (trace number versus time in seconds).

Figure 4.3 Recovery of a common source gather (fixed source coordinates). Displayed values are SNR in dB: synthesis ℓ₁, 14.9 dB; analysis ℓ₁ (ours), 15 dB; analysis ℓ₁ (SB), 14.3 dB; GAP ℓ₀, 23.4 dB; analysis ℓ₀, 23.7 dB; analysis ℓ₁ (CP), 10.0 dB.

Method                  Test SNR (dB)   Time (s)
Synthesis ℓ₁ (SPGL1)    14.9            112
Analysis ℓ₁ (Ours)      15.0            77.2
Analysis ℓ₁ (CP)        10.0            460
Analysis ℓ₁ (SB)        14.3            188
Analysis ℓ₀ (Ours)      23.7            77.4
Analysis ℓ₀ (GAP)       23.4            117

Table 4.1 Cosparse recovery results

Robust Total Variation Deblurring

A classic inverse problem is deblurring, where one seeks to estimate a sharp image from a blurred, noisy version. We model this scenario as

y = Hx + n,

where H is the blurring operator, x is the true image, n is the additive noise, and y is the noisy blurred image. H is often modeled as a convolution with a Gaussian (or similar) kernel, which exhibits quickly decaying singular values. As such, simple least-squares inversion is numerically insufficient to estimate the true signal, since the components of the noise in the subspace generated by the small singular values of H become greatly amplified in the reconstruction, as shown in Figure 4.5. Even in the noiseless case, the small singular values, which often correspond to the details of the image, are difficult for a Krylov method to invert, resulting in a low-fidelity reconstruction.

One popular approach to regularize this problem is to use robust total-variation regularization [219], which corresponds to solving

min_x ‖x‖_TV  such that ‖Hx − y‖₁ ≤ ε,  (4.15)

where ε is an estimate of the noise level of n, ‖n‖₁, and

‖x‖_TV = ∑_i ∑_j √( (x_{i+1,j} − x_{i,j})² + (x_{i,j+1} − x_{i,j})² )

is the isotropic TV (pseudo) norm. We can equivalently write ‖x‖_TV = ‖Dx‖_{1,2}, where D is a finite difference matrix stacking the horizontal and vertical derivative matrices, and ‖X‖_{1,2} is the mixed ℓ_{1,2} norm, i.e., the sum of the column norms of the matrix X. The TV norm promotes cartoon-like structures in images, since the ℓ₁ component of the norm tends to set the gradient of the image to zero at a large number of points. This problem is a straightforward instance of the analysis program in (4.13), where p = (1, 2) and q = 1, and so we employ the same code as previously.
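For orientation, the pieces used above can be sketched in a few lines of MATLAB. This is illustrative only and is not the thesis implementation; it assumes the standard cameraman.tif test image shipped with the Image Processing Toolbox is available, computes forward-difference derivatives, and evaluates the isotropic TV value.

    % Forward differences and the isotropic TV value of an image (illustrative).
    X  = double(imread('cameraman.tif'));
    Dv = X([2:end end], :) - X;          % differences along the first (vertical) dimension
    Dh = X(:, [2:end end]) - X;          % differences along the second (horizontal) dimension
    G  = sqrt(Dv.^2 + Dh.^2);            % per-pixel gradient magnitude
    tv = sum(G(:));                      % isotropic TV value, ||X||_TV
    % Projecting the stacked gradient onto {||.||_{1,2} <= tau} reduces to an
    % l1-ball projection of the vector of per-pixel magnitudes G(:), followed by
    % rescaling each pixel's gradient pair accordingly.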
To validate our composite convex framework, we deblur the standard "Cameraman" image with the following parameters. The blurring matrix is a convolution with a Gaussian kernel of width 9 and standard deviation 4, and we update the τ parameter at most 20 times, with the results displayed in Figure 4.4.

Figure 4.4 TV deblurred image in the noise-free case: true image; blurred image, SNR = 15.0 dB; TV regularized image, SNR = 30.0 dB, T = 88.2 s.

Compared to handling (4.13) directly, methods that solve the penalized formulation of the TV problem (4.15),

min_x ‖Hx − y‖₁ + λ‖x‖_TV,  (4.16)

through the Alternating Direction Method of Multipliers (ADMM) or other such splitting techniques [66, 65] must have an accurate estimate of the λ parameter. There is little theoretical insight into choosing this parameter a priori, and the optimal λ parameter must be estimated on a per-image basis. When coupled with the other parameters one must choose in order to ensure that the ADMM method converges sufficiently quickly, the number of parameters that must be known becomes a high-dimensional search space. Our method, on the other hand, does not have such extraneous parameters to estimate, due to the inherent secant/Newton method for updating τ.

We compare our approach to noisy TV-based deblurring with those that solve (4.16), such as [242, 23, 66]. The results in the aforementioned articles all involve a degree of hand-tuning of the λ parameter for each input image, which is untenable when applying such techniques to a given image, in particular when the true image is unknown. We compare our method to the approach of [242], as the software provided by the other references proved too fragile to make the necessary modifications to construct a baseline comparison. Applying both our method and the method of [242] to deblur and denoise the Cameraman image results in the images in Figure 4.5. In this example, the same blurring kernel has been applied, with the addition of high-amplitude but sparse noise to 5% of the pixels in the image.

In order to fairly compare this approach to our own, we search over a range of 50 λ parameters spaced logarithmically in [10⁻², 10²]. Unfortunately, the dependence of the SNR on the λ parameter and the premature stopping criteria employed by these algorithms do not allow us to simply halt our increase of λ when the SNR has reached its true peak. As such, we stop increasing λ at iteration i when SNR(x_i) < 0.8 · max_{j=1,…,i} SNR(x_j). All of the other parameters are set to their default values. Even despite this optimal λ choice, the resulting image is very cartoonish and not particularly sharp, which is consistent with the results in [242]. In reality, if we did not have the reference image available, it would be exceedingly difficult to choose λ optimally in this fashion for these methods, and we would have to rely on the ad-hoc "eyeball norm". Our method, on the other hand, does not require the choice of this parameter and reaches a satisfactory solution of its own accord, albeit in a slightly longer computational time due to the need to solve our subproblems relatively accurately. The least-squares solution, as predicted, performs poorly, as the noise is amplified through the inversion of the subspace corresponding to the small singular values of H, although one could add damping or terminate the Krylov method earlier in order to mitigate some of these effects.

Figure 4.5 TV deblurred image in the noisy case: blurred noisy image, SNR = 10.6 dB; least-squares solution, SNR = −6.74 dB, T = 1.34 s; convex composite solution, SNR = 28.1 dB, T = 134 s; solution of [242], SNR = 18.2 dB, T = 55.4 s.
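The corruption model used in the noisy experiment above can be outlined as follows. This is only an illustrative sketch: the exact amplitude distribution of the sparse noise used in the thesis experiments is not specified here, so the magnitudes below are assumptions, and the Image Processing Toolbox functions fspecial and imfilter are used for the blur.

    % Blur the image and corrupt 5% of the pixels with high-amplitude sparse noise
    % (noise magnitudes here are illustrative assumptions).
    X    = double(imread('cameraman.tif')) / 255;
    H    = fspecial('gaussian', 9, 4);          % width-9, sigma-4 Gaussian kernel
    Y    = imfilter(X, H, 'symmetric');         % blurred image
    mask = rand(size(Y)) < 0.05;                % 5% of the pixels
    Y(mask) = Y(mask) + sign(randn(nnz(mask),1)) .* (0.5 + 0.5*rand(nnz(mask),1));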
Audio Declipping

Another promising application of cosparsity-based regularization is declipping. Clipping, or magnitude saturation, occurs when the magnitude of a signal is greater than the range of values the acquisition device is able to distinguish. For audio signals in particular, the perception of clipping may range from hearing “clicks and pops” to hearing additive noise, whereby the discontinuities introduced by the clipping result in a large number of apparent harmonics. We consider the true audio signal represented as a vector x* ∈ R^n, where n is the number of time samples, and its clipped version y ∈ R^n obtained from the hard clipping operator applied componentwise,

y_i = x*_i             if |x*_i| ≤ θ,
y_i = θ · sign(x*_i)   otherwise.

Here θ > 0 is the clipping level. This observation model is idealized yet nevertheless convenient for distinguishing the clipped parts of the signal from the non-clipped samples by means of the highest amplitude threshold present. We split the index set [n] = {1, 2, …, n} into a partition of three sets, [n] = I_t ∪ I_+ ∪ I_−. Here I_t denotes the unclipped portions of the signal, I_+ are the indices of the signal clipped to the value +θ, and similarly for I_−. Taken together, these three index sets induce constraints on any candidate estimate x of the true signal via the inequalities

x_i = y_i  for i ∈ I_t,
x_i ≥ y_i  for i ∈ I_+,
x_i ≤ y_i  for i ∈ I_−.

We denote the set of signals x ∈ R^n satisfying the above inequalities as S(y). Merely satisfying these constraints is clearly not sufficient to reconstruct the original signal. As such, various regularization techniques have been previously proposed to declip or restore the original signal, including minimizing higher order derivative energy [120], sparsity [4, 233, 155], and cosparsity [154, 153]. The ℓp-cosparse model, as in the seismic case, attempts to solve the problem

min_x ‖Ωx‖_p  s.t.  x ∈ S(y),    (4.17)

where p = 0 or p = 1. The analogous sparsity (synthesis) model is

min_x ‖x‖_p  s.t.  Dx ∈ S(y).

The pointwise constraints are much more challenging to satisfy in this case, and so we focus our efforts on the cosparsity case.

The analysis dictionary we employ for natural audio signals is the Gabor transform (i.e., the short time Fourier transform) [209], composed of time-windowed complex exponentials with varying temporal and frequency shifts. Despite its redundancy, the Gabor transform Ω is a tight frame, and so Ω^H Ω = I. Our implementation is provided by [201]. We can solve problem (4.17) via the VELVET method, where the value function is written as

v(τ) = min_x (1/2)‖Ωx − z_τ(x)‖_2²  s.t.  x ∈ S(y),

and z_τ(x) = P_{‖·‖_p ≤ τ}(Ωx).
Here the constraints S(y) are pointwise, which leads to the projector

P_{S(y)}(ŷ)_i = y_i            for i ∈ I_t,
P_{S(y)}(ŷ)_i = max(ŷ_i, θ)    for i ∈ I_+,
P_{S(y)}(ŷ)_i = min(ŷ_i, −θ)   for i ∈ I_−.

(A code sketch of this projector is given at the end of this section.) We use the projected quasi-Newton method of [223] to evaluate v(τ) and the secant method to update the τ parameter.

Our signals are normalized to have magnitude 1 and clipped to values ranging from 0.05 ≤ θ ≤ 0.9. To assess the quality of our recovery using VELVET, we use the signal-to-distortion ratio [154]

SDR(z) = −20 log10( ‖x*_{I_+∪I_−} − z_{I_+∪I_−}‖_2 / ‖x*_{I_+∪I_−}‖_2 ),

which measures the deviation from the true signal on I_+ ∪ I_−, the set of indices where the signal has been clipped.

The results from aggressively clipping the reference “Glockenspiel” signal to θ = 0.05 are shown in Figure 4.6, resulting in 74% of the signal values being clipped. Despite this aggressive threshold, ℓ0-based cosparsity interpolation is able to recover a fairly reasonable estimate of the true signal, whereas the ℓ1 convex relaxation performs very poorly, hardly improving the input SDR whatsoever. The theoretical underpinnings behind this large gap in performance between ℓ0- and ℓ1-based cosparsity are an interesting open topic for future research. Intuitively, as our sampling operator decreases the amplitudes of the signal to lie in a given range, the corresponding inner products with the Gabor frame elements are reduced in magnitude as well. Therefore the ℓ1 norm of the coefficients of the clipped signal is smaller than that of the original signal, which implies that our original signal is no longer a solution to the optimization problem in this case. Compared to the performance of algorithms that attempt to solve (4.17) with p = 0, such as CoDec-HT [153] or A-SPADE [154], our method has a natural mechanism for increasing the sparsity level k. Increasing the sparsity by a fixed amount, as in the aforementioned algorithms, means that such a method cannot terminate until it has run a number of iterations proportional to the true sparsity of the signal, which may be large. On the other hand, the secant method employed by our VELVET algorithm makes much faster progress towards the true sparsity automatically. In these examples, for instance, we update the τ parameter at most 10 times. Although ℓ0 analysis performs significantly better than the corresponding ℓ1 analysis reconstruction, the computational time for the former is significantly longer as τ increases, as shown in Table 4.2. Intuitively, this observation corroborates the fact that ℓ0-minimization is an NP-hard problem even in the simplest case [187].

Method        Number of parameter updates   SDR (dB)   Time (s)
Analysis ℓ1   8                             2.36       33
Analysis ℓ1   10                            2.36       45
Analysis ℓ0   8                             12.7       180
Analysis ℓ0   10                            14.6       377
Table 4.2 Audio declipping produces much better results with p = 0 compared to p = 1, but the computational times become daunting as the ℓ0-norm level increases.

Figure 4.6 Declipping the “Glockenspiel” audio file; the first three seconds are shown (panels: True Signal; Clipped Signal, SDR = 2.04 dB; Analysis ℓ1, SDR = 2.36 dB; Analysis ℓ0, SDR = 14.6 dB).
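As a concrete illustration of the pointwise projector P_{S(y)} defined above, the following is a minimal MATLAB sketch. The function name and argument layout are our own, and the index sets are assumed to be precomputed logical masks; the routine actually used with the projected quasi-Newton solver may differ.

function x = proj_clip_constraints(xhat, y, It, Ip, Im, theta)
% Project a candidate signal xhat onto the declipping constraint set S(y):
% equality on the unclipped samples, one-sided bounds on the clipped ones.
  x     = xhat;
  x(It) = y(It);                  % unclipped samples are fixed to the data
  x(Ip) = max(xhat(Ip),  theta);  % samples clipped at +theta must remain >= theta
  x(Im) = min(xhat(Im), -theta);  % samples clipped at -theta must remain <= -theta
end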
4.4.2 Robust Tensor PCA / Completion

In this section, we solve the robust tensor completion problem introduced in Section 4.1,

min_{x ∈ M} ‖A(φ(x)) − w‖_1.    (4.18)

The associated value function for (4.18) is

v(τ) = min_{x ∈ M} (1/2)‖A(φ(x)) − w − z_τ(x)‖_2²,

with

z_τ(x) = argmin_z (1/2)‖A(φ(x)) − w − z‖_2²  such that  ‖z‖_1 ≤ τ.

We only have to ‘plug in’ the soft-thresholded residual z_τ(x) when computing the full data residual, and the rest of the code uses the standard HT optimization framework. Note that z_0(x) = 0 for all x, so v(0) corresponds to the standard HT tensor completion problem. Each subproblem associated to an evaluation of v(τ) is warm-started with the previous parameter estimate x and converges quickly, due to the Gauss-Newton method used in evaluating v. Although the mapping φ is nonlinear, we note that the ℓ1 norm is a gauge and we use the corresponding formula

v′(τ) = −‖A(φ(x)) − w − z_τ(x)‖_∞,

which appears to hold numerically in our tests. Note that the overall problem in (4.18) is non-convex, since the image of the mapping x ↦ φ(x) is the non-convex manifold of Hierarchical Tucker tensors.

To validate this approach, we interpolate a single frequency slice of data generated from the BG Compass model with 68 x 68 sources and 201 x 201 receivers. We remove 75% of the receivers randomly and, to 5% of the remaining receivers, add high amplitude noise with energy equal to that of the remaining signal. We perform 5 updates of τ, i.e., (4.7), starting from τ = 0, and use a relative function stopping tolerance of 0.0005 when computing v(τ) and a maximum inner iteration count of 20. We compare this approach to standard HT completion without the ℓ1 penalty, i.e., merely computing v(0). The results are displayed in Figure 4.7. We also compare this result to the Huber loss function (3.4), i.e., we solve

min_x Σ_{i=1}^{n} ρ_δ( A(φ(x))_i − w_i ),

where ρ_δ is the Huber function with threshold δ, for a variety of values of δ. For the Huber loss, the best performing threshold is similar in terms of computational time and quality to the ℓ1 result, although the ℓ1 approach still performs slightly better in terms of recovery quality. Knowledge of this best-case threshold is not available a priori, however, and departure from this optimal parameter quickly degrades the test SNR, as shown in Table 4.4. The one-norm minimization in this case does not have any additional hyperparameters to estimate compared to the Huber loss function, yet still produces high quality results.

Figure 4.7 Recovery of a common source gather (panels: True Data; Input Data, SNR = 0 dB; ℓ1, SNR = 16.8 dB; ℓ2, SNR = 8.8 dB; Huber - best δ, SNR = 16.7 dB; Difference ℓ1; Difference without ℓ1; Difference Huber - best δ).

              Test SNR (dB)   Time (s)
With ℓ1       16.2            1072
Without ℓ1    7.68            632
Huber, best δ 15.9            1003
Table 4.3 Summary of recovery results
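As an aside to the results summarized in Table 4.3, the soft-thresholded residual z_τ(x) used in the value function above is simply the Euclidean projection of the current data residual A(φ(x)) − w onto the ℓ1 ball of radius τ. A minimal MATLAB sketch of that projection (the standard sort-based O(n log n) scheme; the function name is ours and this is not necessarily the routine used in our experiments) is:

function z = proj_l1_ball(r, tau)
% Euclidean projection of r onto the set { z : ||z||_1 <= tau }.
  v = r(:);
  if norm(v,1) <= tau, z = r; return; end   % already feasible
  u   = sort(abs(v),'descend');             % sorted magnitudes
  cs  = cumsum(u);
  k   = find(u.*(1:numel(u))' > cs - tau, 1, 'last');
  eta = (cs(k) - tau)/k;                    % soft-threshold level
  z   = reshape(sign(v).*max(abs(v) - eta, 0), size(r));
end

In this notation, z_τ(x) = proj_l1_ball(A(φ(x)) − w, τ).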
Huber δ    Test SNR (dB)   Time (s)
10⁻⁶       10.1            1578
5·10⁻⁶     13.4            1657
10⁻⁵       15.5            1594
5·10⁻⁵     15.9            1003
10⁻⁴       14.7            926
5·10⁻⁴     8.32            928
10⁻³       7.68            984.6
Table 4.4 Huber recovery performance versus the threshold δ

4.4.3 One-bit Compressed Sensing

The standard model of compressed sensing described in Section 4.4.1 assumes infinite-bit precision in the measurement vector. Practitioners often neglect the differences between the infinite precision model and its standard 64-bit floating point numerical discretization. When the measurement device quantizes the measurements to a precision other than 64 bits, this precision discrepancy can introduce unintended errors in the recovered signal if not properly accounted for. The authors in [39] consider the extreme case of quantization, where the magnitude information of w is discarded and one only has access to the signs of the measurements, sign(w). Unlike in the classical compressed sensing regime, one-bit quantization results in an inherent loss of information. Roughly speaking, if one stipulates that a point lies on a number of hyperplanes, as in standard CS, then with a sufficient number of these constraints there will be at most one point that satisfies them. On the other hand, if one stipulates that a point lies in a number of halfspaces associated with these hyperplanes, the set of points that satisfy these constraints, irrespective of their number, will have nonzero volume. As such, in the one-bit CS paradigm, there is inherent ambiguity introduced by the quantization process and one can merely ask for an approximation to the true x rather than recovering it exactly. More precisely, from [Lemma 1, 141], if x ∈ ∪_{i=1}^{L} S_i lies in a union of L subspaces S_i, each of dimension K, and we acquire b ≥ 2K one-bit measurements in the vector w, then w contains at most K log₂(2eb/K) + log₂(L) information bits and consequently the error of any reconstruction decoder satisfies ε_opt ≥ Ω(K/b). When K ≪ b, the error of any one-bit decoder can therefore be no better than Ω(1/b).

The work of [202] proves that when V is i.i.d. Gaussian, we can recover an approximation of x by solving the following optimization program,

min_x ‖x‖_p  such that  sign(Vx) = y,  ‖Vx‖_1 = m,

when p = 1, although we can consider the case p = 0 as well. As we are merely measuring the signs of Vx, we cannot distinguish between x and αx for any α > 0 and thus cannot determine the norm of x from the measurements. The normalization ‖Vx‖_1 = m precludes x from being zero, but m can be set to any other positive constant in this case.

Rewriting sign(Vx) = y as Vx ⊙ y ≥ 0 was suggested in [202] as being an equivalent constraint, with the latter being easier to express as constraints for a linear program. It should be noted that this equivalence is not entirely correct, as it is the condition Vx ⊙ y > 0 that results in the correct signs for the recovered signal. This results in a constraint set which is open, however, and there is no guarantee that the resulting optimization program has a solution. By closing the constraint as Vx ⊙ y ≥ 0, we expand the constraint set to include points that may not exactly fit the data, but the resulting program has a solution and we can compute it in polynomial time. With this point in mind, we move to solve this program with the convex composite method.
The constraint Vx ⊙ y ≥ 0 is equivalent to ‖max(−y ⊙ Vx, 0)‖_1 = 0, which results in the equivalent optimization problem

min_{x,r} ‖x‖_1  such that  ‖max(r, 0)‖_1 = 0,  ‖r‖_1 = m,  r = −y ⊙ Vx.

The resulting value function reads as

v(τ) = min_{x,r} (1/2)‖−y ⊙ Vx − r‖_2²  such that  ‖max(r, 0)‖_1 = 0,  ‖r‖_1 = m,  ‖x‖_1 ≤ τ,

and the variable-projected r(x) is

r(x) = argmin_r (1/2)‖−y ⊙ Vx − r‖_2²  such that  ‖max(r, 0)‖_1 = 0,  ‖r‖_1 = m,

which is equivalent to

r(x) = argmin_r (1/2)‖−y ⊙ Vx − r‖_2²  such that  r ≤ 0,  ‖r‖_1 = m.

We solve this projection by the simple O(n log n) algorithm in [92] for positively constrained ℓ1 minimization, with reversed signs.

In what follows, we use three measures of quality when assessing the success or failure of a recovered signal. The Hamming distance [141] between two vectors in the Boolean cube v, w ∈ {−1, 1}^m is defined as

d_H(v, w) := (1/m) Σ_{i=1}^{m} v_i ⊕ w_i,

where v ⊕ w is the XOR operation between v, w, with v ⊕ w = 0 if v = w and 1 otherwise, when v, w ∈ {−1, 1}. This distance satisfies d_H ∈ [0, 1] and is used to compare d_H(sign(Vx̂), y) at an estimated solution x̂. Also introduced in [141] is a distance measure in signal space, defined as

d_S(x, y) := (1/π) arccos⟨x, y⟩

for vectors x, y with ‖x‖_2 = ‖y‖_2 = 1, which satisfies d_S ∈ [0, 1]. We consider the inverse of this distance, 1/d_S(x, y), as a score for a candidate solution (i.e., higher is better). We also use the traditional ℓ2 distance to the true signal, ‖x − x_true‖_2, as a quality metric merely for presentational purposes, as the ℓ2 difference can be mapped to d_S through the polarization identity.

We compare this approach to the BIHT algorithm of [141], which involves an iterative gradient descent procedure followed by thresholding onto the top k components of the signal. Fixing n = 1000 and k = 100, we consider two classes of signals: sparse signals, which have k nonzero coefficients with N(0, 1) entries, and compressible signals, which are k-sparse and satisfy x_(i) ∝ i⁻², where i indexes the nonzero coefficients of x. We consider the subsampling case with m = 500 as well as the supersampling case with m = 2000, and we average each recovery experiment over 100 trials. The BIHT algorithm is run for at most 15000 iterations and is terminated early when the number of sign mismatches between the predicted and observed data is zero.

The BIHT algorithm significantly outperforms the composite-convex approach when the signal is purely sparse, as shown in Tables 4.5 and 4.7. This strong performance is very much dependent on knowledge of the true sparsity k, however, and when this sparsity level is unknown or difficult to estimate, the performance of the estimator drops significantly, as when we use the parameter 4k instead of k for this algorithm. As our method does not assume knowledge of the underlying sparsity, it is unaffected by this parameter, although it is significantly less reliable than when k is known. This strong dependence of the performance of BIHT on k also manifests itself when considering the compressible signal case. Tables 4.6 and 4.8 show the decreased performance of BIHT in this case compared to the improved performance of our approach for compressible signals in both the oversampling and undersampling scenarios. In most instances, the ℓ0 norm outperforms the corresponding ℓ1
The sparsity levelthat we employ here is rather high relative to the ambient dimension and is thusa challenging case from a compressed sensing point of view. The expected errorbehaves as [141]ϵ = d(√kmlog(mnk))which is uninformative in the subsampled case and around 0O7 in the oversampledcase, which is pessimistic given the experiments below.‖x− xtrue‖2 yH(sign(Vx)P y) yh(xP xtrue)−1CC - u0 2O34 · 10−1 2O15 · 10−2 13O53CC - u1 2O37 · 10−1 5O43 · 10−2 13O30BIHT - k 1O49 · 10−1 0 21O62BIHT - 4k 4O56 · 10−1 0 6O85Table 4.5 m = 2000P n = 1000P k = 100, sparse signal‖x− xtrue‖2 yH(sign(Vx)P y) yh(xP xtrue)−1CC - u0 4O04 · 10−2 5O91 · 10−3 88O34CC - u1 4O81 · 10−2 1O33 · 10−2 72O05BIHT - k 1O54 · 10−1 0 20O65BIHT - 4k 4O54 · 10−1 0 6O87Table 4.6 m = 2000P n = 1000P k = 100, compressible signal‖x− xtrue‖2 yH(sign(Vx)P y) yh(xP xtrue)−1CC - u0 7O68 · 10−1 5O5 · 10−2 4O01CC - u1 7O34 · 10−1 1O3 · 10−1 4O20BIHT - k 8O12 · 10−1 0 3O77BIHT - 4k 9O62 · 10−1 0 3O13Table 4.7 m = 500P n = 1000P k = 100, sparse signal87Chapter 4. A level set, variable projection approach for convex compositeoptimization‖x− xtrue‖2 yH(sign(Vx)P y) yh(xP xtrue)−1CC - u0 8O57 · 10−2 8O5 · 10−3 42O10CC - u1 9O78 · 10−2 1O55 · 10−2 35O85BIHT - k 6O92 · 10−1 0 4O46BIHT - 4k 9O44 · 10−1 0 3O20Table 4.8 m = 500P n = 1000P k = 100, compressible signal4.5 DiscussionIn our developments, we have shown that this level set . Although the experimentalresults are promising, we have not entirely eliminated the possible negative sideeffects from solving the general, non-convex problem. Indeed, although the outersecant method will always converge reasonably (assuming sufficient smoothness onthe value function), evaluating the value function itself may result in iterates beingtrapped in a local, but not global minimum, in particular if we warm-start the it-erations with the solution from the previous problem. We have not observed thispathological case in our experiments, but the observant reader may readily con-struct such pathological examples based on the geometric considerations shown inFigure 4.1. For problems where x is linear, this approach performs well in practice.Clearly there is a theoretical gap between the gradient descent method analyzed inthis chapter and the projected LBFGS method used to evaluate the value function.To the best of our knowledge, this is the first algorithm for convex composite min-imization that has been experimentally validated on a wide variety of medium tolarge-scale problem instances.4.6 ConclusionIn this chapter, we have proposed a new technique for solving composite-convexoptimization problems. This class of problems is quite general and encompasses anumber of important applications such as robust tensor principal component analy-sis and cosparsity-based compressed sensing. Instead of dealing with the non-smoothproblem in its original form, which would result in slow convergence for large-scaleproblems, the original problem induces a corresponding value function, which weaim to evaluate. The value function is a function of the level set parameter, whichaims to increase this scalar through a Newton or secant root finding scheme un-til the level set and the image of the inner, smooth mapping coincide. At thispoint, the original problem is solved. Evaluating the value function is much sim-pler than minimizing the original problem due to the smoothness of the objectivefunction. 
We have proved that a simple gradient descent scheme converges linearly when evaluating these subproblems, using analysis related to the Polyak-Łojasiewicz inequality, although other more practically appealing methods such as LBFGS are available. Applying the level-set approach to a variety of signal reconstruction and classification problems has shown that this method is competitive for solving this class of problems. We are even able to handle non-convex problems without the tuning of additional hyperparameters. One limitation of this approach is that the subproblems have to be solved relatively accurately in order for the secant method to update the τ parameter. The authors in [17] study the performance of secant updates when inexact upper and lower bounds are available for v(τ) from partially solving the subproblem and from duality, respectively. It is in this line of reasoning that we may be able to develop further insights into incorporating inexactness into this method in future work.

It is also important to note that the theoretical derivations for computing the value function derivatives when the inner mapping is nonlinear are thus far incomplete. Although the straightforward formula appears to hold empirically, it is difficult to prove this fact for general nonlinear mappings. This is an important application for nonlinear models such as rank-k parametrizations of low rank matrices, M = AAᵀ for A ∈ R^{n×k}, and smooth parametrizations of higher order tensors such as the Hierarchical Tucker mapping. We note that, owing to the use of first-order methods to evaluate v(τ), these techniques are most appropriately applied to problems that are well-conditioned. The constants in the convergence bound in Proposition 4.4 degrade significantly as the conditioning of the problem worsens. Potentially, accelerated first order methods such as Nesterov's method [190] could be used to ease the dependence on the square of the condition number of the inner mapping (in the linear case), but we leave this as a topic for future research.

4.7 Acknowledgements

We would like to thank Aleksandr Aravkin for his helpful feedback on an early version of this work.

Chapter 5

A Unified 2D/3D Large Scale Software Environment for Nonlinear Inverse Problems

5.1 Introduction

Solving large scale inverse problems is a challenging endeavour for a number of reasons, not least of which is the sheer volume of prerequisite knowledge required. Developing numerical methods for inverse problems involves the intersection of a number of fields, in particular numerical linear algebra, nonlinear non-convex optimization, and numerical partial differential equations, as well as the particular area of physics or biology the problem is modelled after, among others. As a result, many software packages aim for a completely general approach, implementing a large number of these components in various sub-modules interfaced in a hierarchical way. There is often a danger with approaches that increase the cognitive load on the user, forcing them to keep the conceptual understanding of many components of the software in their minds at once. This high cognitive load can result in prolonging the initial setup time of a new researcher, delaying the time at which they are actually productive while they attempt to comprehend how the code behaves. Moreover, adhering to a software design model that does not make intuitive sense can disincentivize modifications and improvements to the codebase.
In an ideal world, a researcher with a general knowledge of the subject area should be able to sit in front of a well-designed software package and easily associate the underlying mathematics with the code they are presented with. If a researcher is interested in prototyping high level algorithms, she is not necessarily interested in having to deal with the minutiae of compiling a large number of software packages, manually managing memory, or writing low level code in C or Fortran in order to implement, for example, a simple stochastic optimization algorithm. Researchers are at their best when actually performing research, and software should be designed to facilitate that process as easily as possible.

Academic software environments for inverse problems are not necessarily geared towards high performance, making use of explicit modeling matrices or direct solvers for 3D problems. Given the enormous computational demands of solving such problems, industrial codes focus on the performance-critical aspects of the problem and are often written in a low-level language without a focus on proper design. These software engineering decisions result in code that is hard to understand, maintain, and improve. Fortran veterans who have been immersed in the same software environment for many years are perfectly happy to squeeze as much performance out of their code as possible, but cannot easily integrate higher-level algorithms into an existing codebase. As a result of this disparity, the translation of higher-level academic research ideas to high-performance industrial codes can be lost, which inhibits the uptake of new academic ideas in industry and vice-versa.

One of the primary examples in this work is the seismic inverse problem and variants thereof, which are notable in particular for their large computational requirements and industrial applications. Seismic inverse problems aim to reconstruct an image of the subsurface of the earth from multi-experiment measurements conducted on the surface. An array of pressure guns injects a pressure differential into the water layer, which in turn generates a wave that travels to the ocean floor. These waves propagate into the earth itself and reflect off of various discontinuities, before traveling back to the surface to be measured at an array of receivers. Our goal in this problem, as well as in many other boundary-value problems, is to reconstruct the coefficients of the model (i.e., the wave equation in the time domain or the Helmholtz equation in the frequency domain) that describes this physical system, such that the waves generated by our model agree with those in our measured data.

The difficulty in solving industrial-scale inverse problems arises from the various constraints imposed by solving a real-world problem. Acquired data can be noisy, lack full coverage, and, in the seismic case, can miss low and high frequencies [246] as a result of equipment and environmental constraints. Particularly in the seismic case, missing low frequencies result in a highly-oscillatory objective function with multiple local minima, requiring a practitioner to estimate an accurate starting model, while missing high frequencies result in a loss of detail [261]. Realistically sized problems involve the propagation of hundreds of wavelengths in geophysical [112] and earthquake settings [146], where wave phenomena require a minimum number of points per wavelength to be modeled meaningfully [135].
These constraints can lead to large models, and the resulting system matrices become too large to store explicitly, let alone invert with direct methods.

Our goal in this work is to outline a software design approach to solving partial differential equation (PDE) constrained optimization problems that allows users to operate with the high-level components of the problem, such as objective function evaluations, gradients, and Hessians, irrespective of the underlying PDE or dimensionality. With this approach, a practitioner can design and prototype inversion algorithms on a complex 2D problem and, with minimal code changes, apply these same algorithms to a large scale 3D problem. The key approach in this instance is to structure the code in a hierarchical and modular fashion, whereby each module is responsible for its own tasks and the entire system structures the dependencies between modules in a tiered fashion. In this way, the entire codebase becomes much easier to test, optimize, and understand. Moreover, a researcher who is primarily concerned with the high level ‘building blocks’ of an inversion framework can simply work with these units in a standalone fashion and rely on default configurations for the lower level components. By using a proper amount of information hiding through abstraction, users of this code can delve as deeply into the code architecture as they are interested in. We also aim to make our ‘code look like the math’ as much as possible, which will help our own development as well as that of future researchers, and reduce the cognitive load required for a researcher to start performing research.

There are a number of existing software frameworks for solving inverse problems, with varying goals in mind. The work of [241] provides a C++ framework built upon the abstract Rice Vector Library [197] for time domain modeling and inversion, which respects the underlying Hilbert spaces where each vector lives by automatically keeping track of units and grid spacings, among other things. The low-level nature of the language it is built in exposes too many low level constructs at various levels of its hierarchy, making integrating changes into the framework cumbersome. The Seiscope toolbox [182] implements high level optimization algorithms in the low-level Fortran language, with the intent to interface into existing modeling and derivative codes using reverse communication. As we highlight below, this is not necessarily a beneficial strategy and merely obfuscates the codebase, as low level languages should be the domain of computationally-intensive code rather than high-level algorithms. The Trilinos project [124] is a large collection of packages written in C++ by a number of domain-specific experts, but requires an involved installation procedure, has no straightforward entrance point for PDE-constrained optimization, and is not suitable for easy prototyping. Implementing a modelling framework in PETSc [20], as in [156], let alone an inversion framework, exposes too many of the unnecessary details at each level of the hierarchy, given that PETSc is written in C. The Devito framework [166] offers a high-level symbolic Python interface to generate highly optimized stencil-based C code for time domain modelling problems, with extensions to inversion. This is promising work that effectively delineates the high-level mathematics from the low-level computations. We follow a similar philosophy in this work.
This work builds upon ideas in [256], which was a first attempt to implement a high-level inversion framework in Matlab.

Software frameworks in the finite-element regime have been successfully applied to optimal control and other PDE-constrained optimization problems. The Dolfin framework [98] employs a high-level description of the PDE-constrained problem written in the UFL language for specifying finite elements, which is subsequently compiled into lower level finite element codes, with the relevant adjoint equations derived and solved automatically for the objective and gradient. For the geophysical examples, the finite element method does not lend itself as easily as finite differences to applying a perfectly-matched layer to the problem, although some progress has been made on this front, i.e., see [73]. In general, finite difference methods are significantly easier to implement, especially in a matrix-free manner, than finite element methods, although the latter have a much stronger convergence theory. Moreover, for systems with oscillatory solutions such as the Helmholtz equation, applying the standard 7-point stencil to the problem is inadvisable due to the large amount of numerical dispersion introduced, resulting in a system matrix with a large number of unknowns. This fact, along with the indefiniteness of the underlying matrix, makes it very challenging to solve with standard Krylov methods. More involved approaches are needed to adequately discretize such equations, see, e.g., [251, 191, 69]. The SIMPEG package [76] is designed in a similar spirit to the considerations in this work, but does not fully abstract away unnecessary components from the user and is not designed with large-scale computations in mind, as it lacks inherent parallelism. The Jinv package [220] is written in Julia in a similar spirit to this work, with an emphasis on finite element discretizations using the parallel MUMPS solver [8] for computing the fields, and is parallelized over the number of source experiments.

When considering the performance-understandability spectrum for designing inverse problem software, it is useful to consider Amdahl's law [199]. Roughly speaking, Amdahl's law states that in speeding up a region of code, through parallelization or other optimizations, the speedup of the overall program will always be limited by the remainder of the program that does not benefit from the speedup. For instance, speeding up a region where the program spends 50% of its time by a factor of 10 will only speed up the overall program by a maximum factor of 1/(0.5 + 0.5/10) ≈ 1.8. For any PDE-based inverse problem, the majority of the computational time is spent solving the PDEs themselves. Given Amdahl's law and a limited budget of researcher time, this would imply that there is virtually no performance benefit in writing both the ‘heavy lifting’ portions of the code as well as the auxiliary operations in a low level language, which can impair readability and obscure the role of the individual component operations in the larger framework. Rather, one should aim to use the strengths of a high level language to express mathematical ideas cleanly in code, and exploit the efficiency of a low level language, at the proper instance, to speed up primitive operations such as a multi-threaded matrix-vector product. These considerations are also necessary to manage the complexity of the code and ensure that it functions properly.
Researcher time, along with computational time, is valuable and we should aim to preserve productivity by designing these systems with these goals in mind.

It is for this reason that we choose to use Matlab to implement our parallel inversion framework, as it offers the best balance between access to performance-critical languages such as C and Fortran, while allowing for sufficient abstractions to keep our code concise and loyal to the underlying mathematics. A pure Fortran implementation, for instance, would be significantly more difficult to develop and understand from an outsider's perspective and would not offer enough flexibility for our purposes. Python would also potentially be an option for implementing this framework. At the time of the inception of this work, we found that the relatively new scientific computing language Julia [29] was in too undeveloped a state to facilitate all of the abstractions we needed; this may no longer be the case as of this writing.

5.1.1 Our Contributions

Using a hierarchical approach to our software design, we implement a framework for solving inverse problems that is flexible, comprehensible, efficient, scalable, and consistent. The flexibility arises from our design decisions, which allow a researcher to swap components (parallelization schemes, linear solvers, preconditioners, discretization schemes, etc.) in and out to suit her needs and the needs of her local computational environment. Our design balances efficiency and understandability through the use of object oriented programming, abstracting away the low-level mechanisms of the computationally intensive components through the use of the SPOT framework [93]. The SPOT methodology allows us to abstract away function calls as matrix-vector multiplications in Matlab, the so-called matrix-free approach. By abstracting away the lower level details, our code is clean and resembles the underlying mathematics. This abstraction also allows us to swap between using explicit, sparse matrix algebra for 2D problems and efficient, multi-threaded matrix-vector multiplications for 3D problems. The overall codebase is then agnostic to the dimensionality of m, which encourages code reuse when applying new algorithms to large scale problems. Our hierarchical design also decouples parallel data distribution from computation, allowing us to run the same algorithm as easily on a small 2D problem using a laptop as on a large 3D problem using a cluster. We also include unit tests, described in Section 5.5.1, that demonstrate that our code accurately reflects the underlying mathematics. We call this package WAVEFORM (softWAre enVironmEnt For nOnlinear inveRse probleMs), which can be obtained at https://github.com/slimgroup/WAVEFORM.

In this work, we also propose a new multigrid-based preconditioner for the 3D Helmholtz equation that only requires matrix-vector products with the system matrix at various levels of discretization, i.e., is matrix-free at each multigrid level, and employs standard Krylov solvers as smoothers; a simplified two-level sketch of this idea is shown below.
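To convey the flavour of such a scheme, the following is a minimal two-level, matrix-free MATLAB sketch: restriction/prolongation handles, an approximate coarse solve, and a few GMRES iterations used purely as smoothers. It is only an illustration of the structure, not the actual multi-level preconditioner described later in this chapter, and all function and variable names here are our own.

function x = two_level_prec(b, Hf, Hc, R, P, nsmooth)
% One two-level cycle applied to Hf*x = b.
% Hf, Hc : function handles applying the fine/coarse system matrices
% R, P   : function handles for restriction and prolongation
  x = smooth(Hf, b, zeros(size(b)), nsmooth);   % pre-smoothing
  r = b - Hf(x);                                % fine-level residual
  [e,~] = gmres(Hc, R(r), [], 1e-1, 50);        % rough coarse-level solve
  x = x + P(e);                                 % coarse-grid correction
  x = smooth(Hf, b, x, nsmooth);                % post-smoothing
end

function x = smooth(H, b, x0, nit)
% A handful of GMRES iterations from the initial guess x0, used as a smoother.
  [x,~] = gmres(H, b, [], [], nit, [], [], x0);
end

In practice, a handle such as @(r) two_level_prec(r, Hf, Hc, R, P, 2) would be supplied as a (flexible) preconditioner to an outer FGMRES iteration, since the preconditioner itself is iterative and therefore varies from one application to the next.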
This preconditioner allows us to operate on realistically sized 3D seismic problems without exorbitant memory costs. Our numerical experiments demonstrate the ease with which we can apply high level algorithms to solving the PDE-based parameter estimation problem and its variants, while still having full flexibility to swap modeling code components in and out as we choose.

5.2 Preamble: Theory and Notation

To ensure that this chapter is sufficiently self-contained, we outline the basic structure and derivations of our inverse problem of interest. Lowercase letters such as x, y, z denote vectors and uppercase letters such as A, B, C denote matrices or linear operators of appropriate size. To distinguish between continuous objects and their discretized counterparts, with a slight abuse of notation, we make the spatial coordinates explicit for the continuous objects, i.e., sampling u(x, y, z) on a uniform spatial grid results in the vector u. Vectors u can depend on parameter vectors m, which we indicate with u(m). The adjoint operator of a linear mapping x ↦ Ax is denoted A*, and the conjugate (Hermitian) transpose of a complex matrix B is denoted B^H. If B is a real-valued matrix, this is the standard matrix transpose.

Our model inverse problem is the multi-source parameter estimation problem. Given our data d_{i,j} depending on the ith source and jth frequency, our measurement operator P_r, and the linear partial differential equation H(m)u(m) = q depending on the model parameter m, find the model m that minimizes the misfit between the predicted and observed data, i.e.,

min_{m, u_{i,j}}  Σ_{i=1}^{N_s} Σ_{j=1}^{N_f}  φ( P_r u_{i,j}, d_{i,j} )
subject to  H_j(m) u_{i,j} = q_{i,j}.

Here φ(s, t) is a smooth misfit function between the inputs s and t, often the least-squares objective φ(s, t) = (1/2)‖s − t‖_2², although other more robust penalties are possible, see e.g., [11, 10]. The indices i and j vary over the number of sources N_s and the number of frequencies N_f, respectively. For the purposes of notational simplicity, we will drop this dependence when the context permits. Note that our design is specific to boundary-value problems rather than time-domain problems, which have different computational and storage challenges.

A well known instance of this setup is the full waveform inversion problem in exploration seismology, which involves discretizing the constant-density Helmholtz equation

(∇² + m(x, y, z)) u(x, y, z) = w(ω) δ(x − x_s) δ(y − y_s) δ(z − z_s),
lim_{r→∞} r ( ∂u/∂r − i√m u ) = 0,    (5.1)

where ∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z² is the Laplacian, m(x, y, z) = ω²/v²(x, y, z) is the wavenumber, ω is the angular frequency, v(x, y, z) is the gridded velocity, w(ω) is the per-frequency source weight, (x_s, y_s, z_s) are the spatial coordinates of the source, and the second line denotes the Sommerfeld radiation condition [235] with r = sqrt(x² + y² + z²). In this case, H(m) is any finite difference, finite element, or finite volume discretization of (5.1). Other examples of such problems include electrical impedance tomography, which uses a simplified form of Maxwell's equations that can be reduced to Poisson's equation [71, 34, 5], X-ray tomography, which can be thought of as the inverse problem corresponding to a transport equation, and synthetic aperture radar, which uses the wave equation [188]. For realistically sized industrial
problems of this nature, in particular for the seismic case, the size of the model vector m is often O(10⁹) and there can be O(10⁵) sources, which prevents the full storage of the fields. Full-space methods, which use a Newton iteration on the associated Lagrangian system, such as in [30, 208], are infeasible, as a large number of fields would have to be stored and updated in memory. As a result, these large scale inverse problems are typically solved by eliminating the constraint and reformulating the problem in an unconstrained or reduced form as

min_m f(m) := Σ_{i=1}^{N_s} Σ_{j=1}^{N_f} φ( P_r H_j(m)^{-1} q_{i,j}, d_{i,j} ).    (5.2)

We assume that our continuous PDEs are posed on a rectangular domain Ω with zero Dirichlet boundary conditions. For PDE problems that require sponge or perfectly matched layers, we extend Ω to Ω′ ⊃ Ω, and vectors defined on Ω are extended to Ω′ by extension in the normal direction. In this extended domain, for the acoustic Helmholtz equation, for instance, we solve

(∂²_x + ∂²_y + ∂²_z + m(x, y, z)) u(x, y, z) = δ(x − s_x) δ(y − s_y) δ(z − s_z)   for (x, y, z) ∈ Ω,
(∂̃²_x + ∂̃²_y + ∂̃²_z + m(x, y, z)) u(x, y, z) = 0                               for (x, y, z) ∈ Ω′ \ Ω,

where ∂̃_x = (1/γ(x)) ∂_x for appropriately chosen univariate PML damping functions γ(x), and similarly for y, z. This results in solutions u that decay exponentially for x ∈ Ω′ \ Ω. We refer the reader to [28, 72, 122] for more details.

We assume that the source functions q_{i,j}(x) are localized in space around the points {x_{s_i}}, i = 1, …, N_s, which make up our source grid. The measurement or sampling operator P_r samples function values defined on Ω′ at the set of receiver locations {x_{r_k}}, k = 1, …, N_r. In the most general case, the points x_{r_k} can vary per-source (i.e., when the location of the measurement device depends on the source device), but we will not consider this case here. In either case, the source grid can be independent from the receiver grid.

While the PDE itself is linear, the mapping that predicts data, F(m) := m ↦ P_r H(m)^{-1} q, the so-called forward modelling operator, is known to be highly nonlinear and oscillatory in the case of the high frequency Helmholtz equation [240], which corresponds to propagating anywhere between 50-1000 wavelengths for realistic models of interest. Without loss of generality, we will consider the Helmholtz equation as our prototypical model in the sequel. The level of formalism, however, will ultimately be related to parameter estimation problems that make use of real-world data, which is inherently band-limited. We will therefore not be overly concerned with convergence as the mesh- or element-size tends to zero, as the informative capability of our acquired data is only valid up to a certain resolution dictated by the resolution of our measurement devices. We will focus solely on the discretized formulation of the problem from here on.

We can compute relevant derivatives of the objective function with straightforward, albeit cumbersome, matrix calculus. Consider the state equation u(m) = H(m)^{-1} q. Using the chain rule, we can derive a straightforward expression for the directional derivative Du(m)[δm],

Du(m)[δm] = −H(m)^{-1} DH(m)[δm] u(m).    (5.3)

Here DH(m)[δm] is the directional derivative of the mapping m ↦ H(m), which is assumed to be smooth. To emphasize the dependence on the linear argument, we let T denote the linear operator defined by T δm = DH(m)[δm] u(m), which outputs a vector in model space.
Note that T = T(m, u(m)), but we drop this dependence for notational simplicity. We let T* denote the adjoint of the linear mapping δm ↦ T δm. The associated adjoint mapping of (5.3) is therefore

Du(m)[·]* y = −T* H(m)^{-H} y.

The (forward) action of the Jacobian J of the forward modelling operator F(m) = P_r u(m) is therefore given by J δm = P_r Du(m)[δm].

We also derive explicit expressions for the Jacobian adjoint, Gauss-Newton Hessian, and full Hessian matrix-vector products, as outlined in Table 5.1, although we leave the details for Appendix A. For even medium sized 2D problems, it is computationally infeasible to store these matrices explicitly and therefore we only have access to matrix-vector products. The number of matrix-vector products for each quantity is outlined in Table 5.2; these counts are per-source and per-frequency. By adhering to the principle of ‘the code should reflect the math’, once we have the relevant formula from Table 5.1, the resulting implementation will be as simple as copying and pasting these formulas into our code in a straightforward fashion, which results in little to no computational overhead, as we shall see in the next section.

u(m)                   H(m)^{-1} q
Du(m)[δm]              −H(m)^{-1} T δm
Du(m)[·]* y            −T* H(m)^{-H} y
F(m)                   P_r u(m)
J δm := DF(m)[δm]      P_r Du(m)[δm]
T δm                   DH(m)[δm] u(m), e.g., (A.2)
T* z                   PDE-dependent, e.g., (A.3)
DT*[δm, δu] z          PDE-dependent, e.g., (A.4)
v(m)                   −H(m)^{-H} P_r^T ∇φ
Dv(m)[δm]              H(m)^{-H} ( −DH(m)[δm] v(m) − P_r^T ∇²φ(P_r u)[ P_r Du(m)[δm] ] )
H_GN δm := J^H J δm    T* H(m)^{-H} P_r^T P_r H(m)^{-1} T δm
∇f(m)                  T* v(m)
∇²f(m)[δm]             DT*[δm, Du(m)[δm]] v(m) + T* Dv(m)[δm]
Table 5.1 Quantities of interest for PDE-constrained optimization.

Quantity          # PDEs
f(m)              1
f(m), ∇f(m)       2
H_GN δm           3
∇²f(m)[δm]        4
Table 5.2 Number of PDEs (per frequency/source) for each optimization quantity of interest

5.3 From Inverse Problems to Software Design

Given the large number of summands in (5.2) and the high cost of solving each PDE u_{i,j} = H_j(m)^{-1} q_{i,j}, there are a number of techniques one can employ to reduce the per-iteration costs and to increase the convergence speed to a local minimum, given a fixed computational budget. Stochastic gradient techniques [37, 36] aim to reduce the per-iteration costs of optimization algorithms by treating the sum as an expectation and approximating the average by a random subset. The batching strategy aims to use small subsets of the data in earlier iterations, making significant progress towards the solution for lower cost. Only in later iterations does the algorithm require a larger number of per-iteration sources to ensure convergence [103]. In a distributed parallel environment, the code should also employ batch sizes that are commensurate with the available parallel resources. One might also seek to solve the PDEs to a lower tolerance at the initial stages, as in [257], and increase the tolerance as the iterations progress, an analogous notion to batching. By exploiting curvature information through minimizing a quadratic model of f(m) with the Gauss-Newton method, one can further enhance the convergence speed of the outer algorithm.

In order to employ these high-level algorithmic techniques and facilitate code reuse, our optimization method should be a black-box, in the sense that it is completely oblivious to the underlying structure of the inverse problem, calling a user-defined function that returns an objective value, gradient, and Gauss-Newton Hessian operator; a sketch of what consuming such an interface might look like is given below.
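For illustration, the following MATLAB fragment sketches how an outer algorithm might consume such a black-box interface. The constructor name misfit_setup, its arguments, and the returned function handle signature are assumptions made for this sketch, not the literal WAVEFORM API.

% Hypothetical construction of a black-box misfit from observed data Dobs,
% problem geometry 'model', and a struct of modelling/solver options 'params'.
fh = misfit_setup(Dobs, model, params);      % assumed to return @(m) -> [f, g, H_GN]

m = m0;
for k = 1:10
    [f, g, H] = fh(m);                       % objective, gradient, GN-Hessian operator
    % Solve the Gauss-Newton subproblem with a few CG iterations; the Hessian
    % behaves like a matrix, so it only needs to supply matrix-vector products.
    dm = pcg(@(x) H*x, -g, 1e-2, 20);
    m  = m + dm;                             % (a line search would be used in practice)
end

Because the returned Hessian behaves like a matrix, the same loop runs unchanged whether the underlying problem is a small 2D example or a large distributed 3D one.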
Our framework should be flexible enough so that, for 2D problems, we can afford to store the sparse matrix H(m) and utilize the resulting efficient sparse linear algebra tools for inverting this matrix, while for large-scale 3D problems, we can only compute matrix-vector products with H(m), with coefficients constructed on-the-fly. Likewise, inverting H(m) should only employ Krylov methods that use these primitives, such as FGMRES [221]. These restrictions are very realistic for the seismic inverse problem, given the large number of model and data points involved as well as the limited memory available on each node of a distributed computational environment. In the event that a new robust preconditioner is developed for the PDE system matrix, we should be able to easily swap out one algorithm for another, without touching the outer optimization method. Likewise, if researchers develop a much more efficient stencil for discretizing the PDE, develop a new misfit objective [11], or add model-side constraints [200], we would like to easily integrate such changes into the framework with minimal code changes. Our code should also expose an interface to allow a user or algorithm to perform source/frequency subsampling from arbitrarily chosen indices.

Figure 5.1 Software Hierarchy.

We decouple the various components of the inverse problem context by using an appropriate software hierarchy, which manages complexity level-by-level and allows us to test components individually to ensure their correctness and efficiency, as shown in Figure 5.1. Each level of the hierarchy is responsible for a specific set of procedural requirements and defers the details of lower level computations to lower levels in the hierarchy.

At the topmost level, our user-facing functions are responsible for constructing a misfit function suitable for black-box optimization routines such as LBFGS or Newton-type methods [171, 272, 113]. This procedure consists of

• handling subsampling of sources/frequencies for the distributed data volume
• coarsening the model, if required (for low frequency 3D problems)
• constructing the function interface that returns the objective, gradient, and requested Hessian at the current point.

For the Helmholtz case in particular, coarsening the model allows us to keep the number of degrees of freedom to a minimum, in line with the requirements of the discretization. In order to accommodate stochastic optimization algorithms, we provide an option for a batch mode interface, which allows the objective function to take as input both a model vector and a set of source-frequency indices. This option allows either the user or the outer stochastic algorithm to dynamically specify which sources and frequencies should be computed at a given time (a sketch of such a loop is given below). This auxiliary information is passed along to lower layers of the hierarchy.
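As an illustration of the batch-mode interface just described, a stochastic outer loop might look like the following MATLAB sketch. The handle obj and its (m, idx) calling convention, as well as the placeholder variables nsrc, nfreq, m0, maxit, and step, are assumptions for this example rather than the literal WAVEFORM signature.

% obj(m, idx) is assumed to return the misfit and gradient restricted to the
% source/frequency indices in idx; nsrc*nfreq is the total number of experiments.
nexp  = nsrc * nfreq;
batch = 4;                                  % small batches in the early iterations
m     = m0;
for k = 1:maxit
    idx    = randperm(nexp, batch);         % random source/frequency subset
    [f, g] = obj(m, idx);                   % objective/gradient over the batch only
    m      = m - step * g;                  % plain stochastic gradient step
    batch  = min(2*batch, nexp);            % grow the batch as iterations progress
end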
Additionally, we provide methods to construct the Jacobian, Gauss-Newton Hessian, and full Hessian as SPOT operators. For instance, when writing Hess*v as in a linear solver, Matlab implicitly calls the lower level functions that actually solve the PDEs to compute this product, all the while never forming the matrix explicitly. Information hiding in this fashion allows us to use existing linear solver codes written in Matlab (Conjugate Gradient, MINRES, etc.) to solve the resulting systems. For the parameter inversion problem, the user can choose to have either the Gauss-Newton or the full Hessian operator returned at the current point.

Lower down the hierarchy, we have our PDEfunc layer. This function is responsible for computing the quantities of interest, i.e., the objective value, gradient, Hessian or Gauss-Newton Hessian matrix-vector product, forward modelling operator, or linearized forward modelling operator and its adjoint. At this stage in the hierarchy, we are concerned with ‘assembling’ the relevant quantities from PDE solutions into their proper configurations, rather than with how exactly to obtain such solutions, i.e., we implement the formulas in Table 5.1. Here the PDE system matrix is a SPOT operator that has methods for performing matrix-vector products and matrix-vector divisions, and which contains information about the particular stencil to use as well as which linear solvers and preconditioners to call. When dealing with 2D problems, this SPOT operator is merely a shallow wrapper around a sparse matrix object and contains its sparse factorization, which helps speed up solving PDEs with multiple right hand sides. In order to discretize the delta source function, we use Kaiser-windowed sinc interpolation [133]. At this level in the hierarchy, our code is not sensitive to the discretization or even to the particular PDE we are solving, as we demonstrate in Section 5.5.4 with a finite volume discretization of the Poisson equation. To illustrate how closely our code resembles the underlying mathematics, we include a short snippet of code from our PDEfunc in Listing 5.1.

In our parallel distribution layer, we compute the result of PDEfunc in an embarrassingly parallel manner by distributing over sources and frequencies and summing the results computed across the various Matlab workers. This distribution scheme uses Matlab's Parallel Toolbox, which is capable of Single Program Multiple Data (SPMD) computations that allow us to call PDEfunc identically on different subsets of source/frequency indices. The data is distributed as in Figure 5.2. The results of the local worker computations are then summed together (for the objective, gradient, adjoint forward modelling, and Hessian-vector products) or assembled into a distributed vector (forward modelling, linearized forward modelling). Despite the ease of performing these parallel computations, the parallelism of Matlab is not fault-tolerant, in that if a single worker or process crashes at any point while computing a local value with PDEfunc, the parent calling function aborts as well. In a large-scale computing environment, this sort of behaviour is unreliable, and we therefore recommend swapping out this ‘always-on’ approach for a more resilient model such as a map-reduce paradigm. One possible implementation workaround is to measure the elapsed time of each worker, assuming that the work is distributed evenly.
% Set up interpolation operators
% Source grid -> Computational Grid
Ps = opInterp('sinc',model.xsrc,xt,model.ysrc,yt,model.zsrc,zt);
% Computational grid -> Receiver grid
Pr = opInterp('sinc',model.xrec,xt,model.yrec,yt,model.zrec,zt)';
% Sum along sources dimension
sum_srcs = @(x) to_phys*sum(real(x),2);
% Get Helmholtz operator, computational grid struct, its derivative
[H,comp_grid,T,DT_adj] = discrete_pde_system(v,model,freq(k),params);
U = H \ Q;
switch func
  case OBJ
    [phi,dphi] = misfit(Pr*U,getData(Dobs,data_idx),current_src_idx,freq_idx);
    f = f + phi;
    if nargout >= 2
        V = H' \ (-Pr'*dphi);
        g = g + sum_srcs(T(U)'*V);
    end

  case FORW_MODEL
    output(:,data_idx) = Pr*U;

  case JACOB_FORW
    dm = to_comp*vec(input);
    dU = H\(-T(U)*dm);
    output(:,data_idx) = Pr*dU;

  case JACOB_ADJ
    V = H'\(-Pr'*input(:,data_idx));
    output = output + sum_srcs(T(U)'*V);

  case HESS_GN
    dm = to_comp*vec(input);
    dU = H\(-T(U)*dm);
    dU = H'\(-Pr'*Pr*dU);
    output = output + sum_srcs(T(U)'*dU);

  case HESS
    dm = to_comp*vec(input);
    [~,dphi,d2phi] = misfit(Pr*U,getData(Dobs,data_idx),current_src_idx,freq_idx);
    dU = H\(-T(U)*dm);
    V  = H'\(-Pr'*dphi);
    dV = H'\(-T(V)*dm - Pr'*reshape(d2phi*vec(Pr*dU),nrec,size(U,2)));
    output = output + sum_srcs(DT_adj(U,dm,dU)*V + T(U)'*dV);
end
Listing 5.1 Excerpt from the code of PDEfunc.

In the event that a worker times out, one can simply omit the results of that worker and sum together the remaining function and gradient values. In some sense, one can account for random machine failure or disconnection by considering it another source of stochasticity in an outer-level stochastic optimization algorithm.

Figure 5.2 Data distributed over the joint (source, frequency) indices.

Further down the hierarchy, we construct the system matrix for the PDE in the following manner. As input, we require the current parameters m, information about the geometry of the problem (current frequency, grid points and spacing, options for the number of PML points), as well as performance and linear solver options (number of threads, which solver/preconditioner to use, etc.). This function also extends the medium parameters to the PML-extended domain, if required. The resulting outputs are a SPOT operator for the Helmholtz matrix, which has suitable routines for performing matrix-vector multiplications and divisions, a struct detailing the geometry of the PML-extended problem, and the mappings T and DT* from Table 5.1. It is in this function that we also construct the user-specified preconditioner for inverting the Helmholtz (or other system) matrix.

The actual operator that is returned by this method is a SPOT operator that performs matrix-vector products and matrix-vector divisions with the underlying matrix, which may take a variety of forms. In the 2D regime, we can afford to explicitly construct the sparse matrix, and we utilize the efficient sparse linear algebra routines in Matlab for solving the resulting linear system. In this case, we implement the 9-point optimal discretization of [70], which fixes the problems associated with the 9-point discretization of [142], and use the sparse LU decomposition built into Matlab for inverting the system. These factors are computed at the initialization of the system matrix and are reused across multiple sources.
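To emphasize that downstream code is agnostic to how this operator is represented, the following short sketch shows the same interface driving both regimes; q, r, and the outputs are placeholder vectors, and the exact behaviour of each operation depends on the stencil, solver, and preconditioner chosen in params (see the PDEopts and LinSolveOpts objects described below).

% The same constructor returns an abstract operator in 2D and 3D alike.
[H,comp_grid,T,DT_adj] = discrete_pde_system(v,model,freq(k),params);

u = H  \ q;    % forward solve: sparse LU factors in 2D, preconditioned Krylov in 3D
w = H' \ r;    % adjoint solve through the same interface
y = H  * u;    % matrix-vector product: explicit sparse matrix in 2D, stencil-based in 3D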
In the 3D regime, we implement a stencil-based matrix-vector product (i.e., one with coefficients constructed on the fly) written in C++, using the compact 27-point stencil of [191] along with its adjoint and derivative. The stencil-based approach allows us to avoid having to keep, in this case, 27 additional vectors of the size of the PML-extended model in memory. This implementation is multi-threaded along the z-axis using OpenMP and is the only 'low-level' component in this software framework geared towards high performance. Since this primitive operation is used throughout the inverse problem framework, in particular for the iterative matrix-vector division, any performance improvements made to this code will propagate throughout the entire codebase. Likewise, if a 'better' discretization of the PDE becomes available, it can be easily integrated into the software framework by swapping out the existing function at this stage, without modifying the rest of the codebase. When it is constructed, this SPOT operator also contains all of the auxiliary information necessary for performing matrix-vector products (either the matrix itself, for 2D, or the PML-extended model vector and geometry information, for 3D) as well as for matrix-vector divisions (options for the linear solver, preconditioner function handle, etc.).

We allow the user access to multiplication, division, and other matrix operations like Jacobi and Kaczmarz sweeps through a unified interface, irrespective of whether the underlying matrix is represented explicitly or implicitly via function calls. For the explicit matrix case, we have implemented such basic operations in Matlab. When the matrix is represented implicitly, these operations are provided by a standard function handle or a 'FuncObj' object, the latter of which mirrors the Functor paradigm in C++. In the Matlab case, the 'FuncObj' stores a reference to a function handle, as well as a list of arguments. These arguments can be partially specialized upon construction, whereby only some of the arguments are specified at first. Later on, when the object is invoked, the remainder of the arguments are passed along to the function. In a sense, this implements the same functionality as anonymous functions in Matlab, but without the variable references to the surrounding workspace, which can be costly when passing around vectors of size n^3, and without the variable scoping issues inherent in Matlab. We use this construction to specify the various functions that implement the specific operations for the specific stencils; a minimal sketch of the pattern is given below. Through this delegation pattern, we present a unified interface to the PDE system matrix irrespective of the underlying stencil or even the PDE in question.

When writing u = H\q for this abstract matrix object H, Matlab calls the 'linsolve' function, which delegates the task of solving the linear system to a user-specified solver. This function sets up all of the necessary preamble for solving the linear system with a particular method and preconditioner. Certain methods, such as the row-based CGMN method introduced in [31] and applied to seismic problems in [257], require initial setup, which is performed here.
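The following is a minimal sketch of the partial-application idea behind 'FuncObj', assuming a standalone class file; the class layout and the example function helmholtz_mvp are illustrative, not the framework's actual implementation.

classdef FuncObj
    % stores a function handle together with pre-bound ("partially specialized") arguments
    properties
        f      % underlying function handle
        args   % cell array of arguments fixed at construction time
    end
    methods
        function obj = FuncObj(f, args)
            obj.f = f;
            obj.args = args;
        end
        function varargout = apply(obj, varargin)
            % late arguments are appended to the pre-bound ones
            [varargout{1:nargout}] = obj.f(obj.args{:}, varargin{:});
        end
    end
end

A stencil-specific multiply could then be bound once, e.g. mvp = FuncObj(@helmholtz_mvp, {coef, comp_grid}), and applied later as y = mvp.apply(x), without capturing the entire surrounding workspace the way an anonymous function would.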
This construction allows us to easily reuse the idea of 'solving a linear system with a specific method' in setting up the multi-level preconditioner of the next section, which reduces the overall code complexity, since the multigrid smoothers themselves can be described in this way.

In order to manage the myriad of options available for computing with these functions, we distinguish between two classes of options and use two separate objects to manage them. 'LinSolveOpts' is responsible for containing information about solving a given linear system, i.e., which solver to use, how many inner/outer iterations to perform, the relative residual tolerance, and the preconditioner. 'PDEopts' contains all of the other information that is pertinent to PDEfunc (e.g., how many fields to compute at a time, the type of interpolation operators to use for sources/receivers, etc.) as well as propagating the options available for the PDE discretization (e.g., which stencil to use, the number of PML points to extend the model, etc.).

5.3.1 Extensions

This framework is flexible enough for us to integrate additional models beyond the standard adjoint-state problem. We outline two such extensions below.

Penalty Method

Given the highly oscillatory nature of the forward modelling operator, due to the presence of the inverse Helmholtz matrix in H(m)^{-1}q, one can also consider relaxing the exact constraint H(m)u(m) = q into an unconstrained and penalized form of the problem. This is the so-called Waveform Reconstruction Inversion approach [259], which results in the following problem, for a least-squares data-misfit penalty,

\min_m \; \tfrac{1}{2}\|P_r\, u(m) - d\|_2^2 + \tfrac{\lambda^2}{2}\|H(m)\,u(m) - q\|_2^2   (5.4)

where u(m) solves the least-squares system

u(m) = \arg\min_u \; \tfrac{1}{2}\|P_r\, u - d\|_2^2 + \tfrac{\lambda^2}{2}\|H(m)\,u - q\|_2^2.   (5.5)

The notion of variable projection [14] underlies this method, whereby the field u is projected out of the problem by solving (5.5) for each fixed m. We perform the same derivations for problem (5.4) in Appendix A. Given the close relationship between the two methods, the penalty method formulation is integrated into the same function as our FWI code, with a simple flag to change between the two computational modes.

2.5D

For a 3D velocity model that is invariant with respect to one dimension, i.e., v(x,y,z) = h(x,z) for all y, so that m(x,y,z) = \omega^2 g(x,z) with g(x,z) = 1/h^2(x,z), we can take a Fourier transform in the y-coordinate of (5.1) to obtain

(\partial_x^2 + \partial_z^2 + \omega^2 g(x,z) - k_y^2)\,\hat{u}(x,k_y,z) = w(\omega)\,\delta(x - x_s)\,\delta(z - z_s)\, e^{-i k_y y_s}.

This so-called 2.5D modeling/inversion allows us to mimic the physical behaviour of 3D wavefield propagation (e.g., 1/r amplitude decay instead of the 1/r^{1/2} decay in 2D, point sources instead of line sources, etc.) without the full computational burden of solving the 3D Helmholtz equation [236]. We solve a series of 2D problems instead, as follows.

Multiplying both sides by e^{i k_y y_s} and setting \tilde{u}_{k_y}(x,z) = e^{i k_y y_s}\,\hat{u}(x,k_y,z), we have that, for each fixed k_y, \tilde{u}_{k_y} is the solution of H(k_y)\,\tilde{u}_{k_y} = w(\omega)\,\delta(x - x_s)\,\delta(z - z_s), where H(k_y) = \partial_x^2 + \partial_z^2 + \omega^2 g(x,z) - k_y^2.

We can recover u(x,y,z) by writing

u(x,y,z) = \tfrac{1}{2\pi}\int_{-\infty}^{\infty} \hat{u}(x,k_y,z)\, e^{i k_y y}\, dk_y
         = \tfrac{1}{2\pi}\int_{-\infty}^{\infty} \tilde{u}_{k_y}(x,z)\, e^{i k_y (y - y_s)}\, dk_y
         = \tfrac{1}{\pi}\int_{0}^{\infty} \tilde{u}_{k_y}(x,z)\, \cos(k_y (y - y_s))\, dk_y
         = \tfrac{1}{\pi}\int_{0}^{k_{\mathrm{nyq}}} \tilde{u}_{k_y}(x,z)\, \cos(k_y (y - y_s))\, dk_y.

Here the third line follows from the symmetry between k_y and -k_y, and the fourth line restricts the integral to the Nyquist frequency given by the sampling in y, namely k_{nyq} = \pi/\Delta y [236]. One can further restrict the range of frequencies to [0, p \cdot k_c], where k_c = \omega / \min_{x,z} v(x,z) is the so-called critical frequency and p \ge 1, p \approx 1. In this case, waves corresponding to a frequency much higher than k_c do not contribute significantly to the solution. By evaluating this integral with, say, a Gauss-Legendre quadrature, the resulting wavefield can be expressed as

u(x,y,z) = \sum_{i=1}^{N} w_i\, \tilde{u}_{k_y^i}(x,z),

which is a sum of 2D wavefields. This translates into operations such as computing the Jacobian or Hessian having the same sum structure, and allows us to incorporate 2.5D modeling and inversion easily into the resulting software framework.
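As an illustration of how little machinery this requires on top of the 2D kernel, the following is a minimal sketch of the quadrature loop for a single out-of-plane offset y; helm2d_solve and lgwt (Gauss-Legendre nodes/weights) are illustrative placeholder names rather than functions of the framework.

% assemble a 2.5D wavefield from 2D solves via Gauss-Legendre quadrature on [0, k_nyq]
nq = 100;                                      % number of y-wavenumbers (quadrature nodes)
[ky, w] = lgwt(nq, 0, k_nyq);                  % nodes and weights (hypothetical helper)
u25 = 0;
for i = 1:nq
    u_ky = helm2d_solve(g, omega, ky(i), q);   % 2D Helmholtz solve for this wavenumber
    u25  = u25 + (1/pi) * w(i) * cos(ky(i) * (y - ys)) * u_ky;
end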
5.4 Multi-level Recursive Preconditioner for the Helmholtz Equation

Owing to the PML layer and the indefiniteness of the underlying PDE system for sufficiently high frequencies, the Helmholtz system matrix is complex-valued, non-Hermitian, and indefinite, and is therefore challenging to solve using Krylov methods without adequate preconditioning. There have been a variety of ideas proposed to precondition the Helmholtz system, including multigrid methods [239], methods inverting the shifted Laplacian system [215, 205, 96], sweeping preconditioners [95, 172, 207, 173], domain decomposition methods [237, 238, 38, 104], and Kaczmarz sweeps [258], among others. These methods have varying degrees of applicability in this framework. Some methods, such as the sweeping preconditioners, rely on having explicit representations of the Helmholtz matrix and on keeping track of dense LU factors on multiple subdomains. Their memory requirements are quite steep as a result, they require a significant amount of bookkeeping to program correctly, and their large setup time is prohibitive in inversion algorithms where the velocity is being updated. Many of these existing methods also make stringent demands on the number of points per wavelength n_ppw needed to succeed, in the realm of 10 to 15, which results in prohibitively large system matrices for the purposes of inversion. As the standard 7-point discretization requires a high n_ppw in order to adequately resolve the phases of the solutions [191], this results in a large computational burden as the system size increases.

Our aim in this section is to develop a preconditioner that is scalable, in that it uses the multi-threading resources on a given computational node, easy to program, matrix-free, without requiring access to the entries of the system matrix, and that leads to a reasonably low number of outer Krylov iterations. When working in a computational environment where there is a nonzero probability of node failure, it is more important to distribute a larger number of sources to multiple nodes than to use multiple nodes to solve a single problem.

We follow the development of the preconditioners in [51, 165], which use a multigrid approach to preconditioning the Helmholtz equation. The preconditioner in [51] approximates a solution to (5.2) with a two-level preconditioner. Specifically, we start with a standard multigrid V-cycle [41] as described in Algorithm 5.1. The smoothing operator aims to reduce the amplitude of the low frequency components of the solution error, while the restriction and prolongation operators transfer residuals and corrections between the fine and coarse grids. For an extensive overview of the multigrid method, we refer the reader to [41].
We use linear interpolation as the prolongation operator and its adjoint as the restriction operator, although other choices are possible [86].

Algorithm 5.1 Standard multigrid V-cycle
V-cycle to solve H_f x_f = b_f on the fine scale
Input: current estimate of the fine-scale solution x_f
  Smooth the current solution using a particular smoother (Jacobi, CG, etc.) to produce x̃_f
  Compute the residual r_f = b_f − H_f x̃_f
  Restrict the residual and right hand side to the coarse level: r_c = R r_f, b_c = R b_f
  Approximately solve the coarse-level system H_c x_c = r_c
  Interpolate the coarse-scale solution to the fine grid and add it back: x̃_f ← x̃_f + P x_c
  Smooth the current solution using a particular smoother (Jacobi, CG, etc.) to produce x̃_f
Output: x̃_f

In the original work, the coarse-scale problem is solved with GMRES preconditioned by the approximate inverse of the shifted Laplacian system, which is applied via another V-cycle procedure. In this case, we note that the coarse-scale problem is merely an instance of the original problem and, since we apply the V-cycle of Algorithm 5.1 to precondition the original problem, we can recursively apply the same method to precondition the coarse-scale problem as well. This process typically cannot be iterated beyond two levels because of the minimum grid-points-per-wavelength sampling requirements for wave phenomena [77].

Unfortunately, as-is, the method of [51] was only designed with the standard 7-point stencil in mind, rather than the more robust 27-point stencil of [191]. The work of [165] demonstrates that a straightforward application of the previous preconditioner to the new stencil fails to converge. The authors attempt to extend these ideas to this new stencil by replacing the Jacobi iterations with the CGMN algorithm [31], which acts as a smoother in its own right through the use of Kaczmarz sweeps [145]. For realistically sized problems, this method performs poorly, as the CGMN method is inherently sequential and attempts to parallelize it result in degraded convergence performance [31], in particular when implemented in a stencil-based environment. As such, we propose to reuse the existing fast matrix-vector kernels we have developed thus far and set our smoother to be GMRES [222] with an identity preconditioner. Our coarse-level solver is FGMRES [221], which allows us to use our nonlinear, iteration-varying preconditioner. Compared to our reference stencil-based Kaczmarz sweep implementation in C, a single-threaded matrix-vector multiplication is 30 times faster. In the context of preconditioning linear systems, this means that unless the Kaczmarz sweeps converge roughly 30 times faster than the GMRES-based smoothers, we should stick to the faster kernel for our problems. Although the computational complexity of a Kaczmarz sweep is similar to that of a matrix-vector product, the performance of the sweeps is cache-bound. In order to speed up this smoother, one could use the CARP-CG algorithm [108], which parallelizes the Kaczmarz sweeps. We also experimentally observed that using a shifted Laplacian preconditioner on the second level, as in [51, 165], caused an increase in the number of outer iterations and slowed down convergence. As such, we have replaced preconditioning the shifted Laplacian system with preconditioning the Helmholtz system itself, solved with FGMRES, which results in much faster convergence.

A diagram of the full algorithm, which we denote ML-GMRES, is depicted in Figure 5.3.
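A minimal sketch of one V-cycle of Algorithm 5.1 is given below; the function handles (stencil mat-vec Hf, smoother, grid transfers R and P, coarse solver) are illustrative placeholders, not the framework's interface.

function x = vcycle(Hf, bf, x, smooth, R, P, coarse_solve)
    % one two-level cycle: pre-smooth, coarse-grid correction, post-smooth
    x  = smooth(Hf, bf, x);        % pre-smoothing (e.g. a few GMRES iterations)
    r  = bf - Hf(x);               % fine-level residual
    ec = coarse_solve(R(r));       % restrict the residual and approximately solve the coarse problem
    x  = x + P(ec);                % prolong the correction back to the fine grid
    x  = smooth(Hf, bf, x);        % post-smoothing
end

Wrapped as a function handle, such a cycle can be passed directly as the preconditioner argument of a Krylov routine, e.g. gmres(Afun, q, 5, 1e-6, maxit, @(r) vcycle(Afun, r, zeros(size(r)), smooth, R, P, coarse_solve)); the framework's own outer solver is FGMRES so that the inner GMRES smoothers may vary from iteration to iteration.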
This algorithm is matrix-free, in that we do not have to construct any of the intermediate matrices explicitly and instead merely compute matrix-vector products, which allows us to apply this method to large systems.

[Figure 5.3: schematic of the preconditioner across discretization spacings h, 2h, and 4h, with GMRES pre- and post-smoothers on each level and an FGMRES coarse solver that is itself preconditioned by GMRES.]
Figure 5.3 ML-GMRES preconditioner. The coarse-level problem (relative to the finest grid spacing) is preconditioned recursively with the same method as the fine-scale problem.

We experimentally study the convergence behaviour of this preconditioner on a 3D constant velocity model and leave a theoretical study of ML-GMRES for future work. By fixing the preconditioner memory vector to be (k_{s,o}, k_{s,i}, k_{c,o}, k_{c,i}) = (3, 5, 3, 5), we can study the performance of the preconditioner as we vary the number of wavelengths in each direction, denoted n_λ, as well as the number of points per wavelength, denoted n_ppw. For a fixed domain, as the former quantity increases the frequency increases, while as the latter quantity increases the grid spacing decreases. As an aside, we prefer to parametrize our problems in this manner, as these quantities are independent of the scaling of the domain, velocity, and frequency, which are often obscured in preconditioner examples in the literature. These two quantities, n_λ and n_ppw, on the other hand, are explicitly given parameters and are comparable across experiments. We pad our model with a number of PML points equal to one wavelength on each side. Using 5 threads per matrix-vector multiply, we use FGMRES with 5 inner iterations as our outer solver and solve the system to a relative residual of 10^-6. The results are displayed in Table 5.3.

n_λ \ n_ppw |         6         |        8         |        10
     5      | 2  (43^3, 4.6)    | 2 (57^3, 6.9)    | 2 (71^3, 10.8)
    10      | 3  (73^3, 28.9)   | 2 (97^3, 38.8)   | 2 (121^3, 76.8)
    25      | 8  (161^3, 809)   | 3 (217^3, 615)   | 3 (271^3, 1009.8)
    40      | 11 (253^3, 4545)  | 3 (337^3, 2795)  | 3 (421^3, 4747)
    50      | 15 (311^3, 11789) | 3 (417^3, 5741)  | 3 (521^3, 10373)

Table 5.3 Preconditioner performance as a function of the number of wavelengths (rows) and points per wavelength (columns). Values are the number of outer FGMRES iterations. In parentheses are the number of grid points (including the PML) and the overall computational time (in seconds), respectively.

We upper bound the memory usage of ML-GMRES as follows. We note that the FGMRES(k_o, k_i) and GMRES(k_o, k_i) solvers, with k_o outer iterations and k_i inner iterations, store 2k_i + 6 vectors and k_i + 5 vectors, respectively. In solving the discretized H(m)u = q with N = n^3 complex-valued unknowns, we require (2k_i + 6)N memory for the outer FGMRES solver and 2N for additional vectors in the V-cycle, in addition to (k_{s,i} + 5)N memory for the level-1 smoother, (2k_{c,i} + 6)(N/8) memory for the level-2 outer solver, (k_{s,i} + 5)(N/8) memory for the level-2 smoother, and (k_{c,i} + 5)(N/64) memory for the level-3 solver. The total peak memory of the entire solver is therefore

(2k_i + 6)N + \max\Big( (k_{s,i} + 5)N,\; (2k_{c,i} + 6)\tfrac{N}{8} + \max\big( (k_{s,i} + 5)\tfrac{N}{8},\; (2k_{c,i} + 6)\tfrac{N}{64} \big) \Big).

Using the same memory settings as before, our preconditioner requires at most 26 vectors to be stored. Although the memory requirements of our preconditioner are seemingly large, they are approximately the same cost as storing all of the 27-point stencil coefficients of H(m) explicitly, and much smaller than the LU factors in (Table 1, [191]).
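As a check of the 26-vector figure quoted above, substituting the settings used in these experiments (k_i = k_{s,i} = k_{c,i} = 5) into the bound gives

(2·5 + 6)N + max( (5+5)N, (2·5+6)N/8 + max( (5+5)N/8, (2·5+6)N/64 ) )
    = 16N + max( 10N, 2N + max(1.25N, 0.25N) )
    = 16N + max( 10N, 3.25N )
    = 26N,

i.e., 26 vectors of size N.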
The memory requirements of this preconditioner can of course be reduced by reducing k_i, k_{s,i}, or k_{c,i}, at the expense of an increased computational time per wavefield solve. To compute the real-world memory usage of this preconditioner, we consider a constant model problem with 10 wavelengths and a varying number of points per wavelength, so that the total number of points, including the PML region, is fixed. In short, we are not changing the effective difficulty of the problem, but merely increasing the oversampling factor, to ensure that the convergence remains the same with increasing model size. The results in Table 5.4 indicate that ML-GMRES is performing slightly better than expected from a memory point of view and that the computational time scaling is as expected.

Grid size | Time (s) | Peak memory (GB) | Number of vectors of size N
  128^3   |    46    |      0.61        |  20N
  256^3   |   213    |      5.8         |  23N
  512^3   |  1899    |     48.3         |  24N

Table 5.4 Memory usage for a constant-velocity problem as the number of points per wavelength increases.

Despite the strong empirical performance of this preconditioner, we note that performance tends to degrade when n_ppw < 8, even though the 27-point compact stencil is rated for n_ppw = 4 in [191]. Intuitively this makes sense, as on the coarsest grid n_ppw < 2, which is under the Nyquist limit and leads to stagnating convergence. In our experience, this discrepancy did not disappear when using the shifted Laplacian preconditioner on the coarsest level. It remains to be seen how multigrid methods for the Helmholtz equation with n_ppw < 8 will be developed in future research.

5.5 Numerical Examples

5.5.1 Validation

To verify the validity of this implementation, specifically that the code as implemented reflects the underlying mathematics, we ensure that the following tests pass. The Taylor error test stipulates that, for a sufficiently smooth multivariate function f(m) and an arbitrary perturbation \delta m, we have that

f(m + h\,\delta m) - f(m) = O(h)
f(m + h\,\delta m) - f(m) - h\langle \nabla f(m), \delta m\rangle = O(h^2)
f(m + h\,\delta m) - f(m) - h\langle \nabla f(m), \delta m\rangle - \tfrac{h^2}{2}\langle \delta m, \nabla^2 f(m)\,\delta m\rangle = O(h^3).

We verify this behaviour numerically for the least-squares misfit f(m), a constant 3D velocity model m, and a random perturbation \delta m, solving for our fields to a very high precision, in this case to a relative residual of 10^-10. As shown in Figure 5.4, our numerical codes indeed pass this test, up until the point where h becomes so small that the numerical error in the field solutions dominates the other sources of error.

[Figure 5.4: log-log plot of the zeroth-, first-, and second-order Taylor errors as a function of h, with O(h), O(h^2), and O(h^3) reference lines.]
Figure 5.4 Numerical Taylor error for a 3D reference model.

The adjoint test requires us to verify that, for the functions implementing the forward and adjoint matrix-vector products of a linear operator V, we have, numerically,

\langle Vx, y\rangle = \langle x, V^H y\rangle,

for all vectors x, y of appropriate length. It suffices to test this equality for randomly generated vectors x and y, made complex if necessary. Owing to the presence of the PML extension operator for the Helmholtz equation, we set x and y, if necessary, to be zero-filled in the PML-extension domain and along the boundary of the original domain.
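A minimal sketch of this dot-product test for a SPOT-style operator V (variable names are illustrative) is:

x = randn(size(V,2),1) + 1i*randn(size(V,2),1);   % random complex test vectors
y = randn(size(V,1),1) + 1i*randn(size(V,1),1);
lhs = y' * (V * x);                               % <Vx, y>
rhs = (V' * y)' * x;                              % <x, V^H y>
rel_diff = abs(lhs - rhs) / max(abs(lhs), abs(rhs));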
We display the results in Table 5.5.

           |        ⟨Vx, y⟩          |       ⟨x, V^H y⟩        | Relative difference
Helmholtz  | 6.5755 − 5.2209i · 10^0 | 6.5755 − 5.2209i · 10^0 | 2.9004 · 10^-15
Jacobian   | 1.0748 · 10^-1          | 1.0748 · 10^-1          | 1.9973 · 10^-9
Hessian    | −1.6465 · 10^-2         | −1.646 · 10^-2          | 1.0478 · 10^-10

Table 5.5 Adjoint test results for a single instance of randomly generated vectors x, y, truncated to four digits for spacing reasons. The linear systems involved are solved to a tolerance of 10^-10.

We also compare the computed solutions in a homogeneous medium to the corresponding analytic solutions in 2D and 3D, i.e.,

G(x, x_s) = -\tfrac{i}{4} H_0(\kappa \|x - x_s\|_2)              in 2D,
G(x, x_s) = \frac{e^{i\kappa \|x - x_s\|_2}}{4\pi \|x - x_s\|_2}  in 3D,

where \kappa = 2\pi f / v_0 is the wavenumber, f is the frequency in Hz, v_0 is the velocity in m/s, and H_0 is the Bessel function of the third kind (the Hankel function). Figures 5.5, 5.6, and 5.7 show the analytic and numerical results of computing solutions with the 2D, 3D, and 2.5D kernels, respectively. Here we see that the overall phase dispersion is low for the computed solutions, although we do incur a somewhat larger error around the source region, as expected. The inclusion of the PML also prevents any visible artificial reflections from entering the solutions, as we can see from the (magnified) error plots.

[Figure 5.5: panels showing the analytic solution, computed solution, and difference (x100) on a 2km x 2km domain.]
Figure 5.5 Analytic and numerical solutions for the 2D Helmholtz equation for a single source. The difference is displayed on a colorbar 100x smaller than the solutions. Top row is the real part, bottom row is the imaginary part.

[Figure 5.6: panels showing the analytic solution, computed solution, and difference (x10), plotted as a 2D slice.]
Figure 5.6 Analytic and numerical solutions for the 3D Helmholtz equation (depicted as a 2D slice) for a single source. The difference is displayed on a colorbar 10x smaller than the solutions. Top row is the real part, bottom row is the imaginary part.

[Figure 5.7: source-receiver panels comparing the analytic, 2.5D, and 3D solutions and their differences.]
Figure 5.7 Analytic and numerical solutions for the 2.5D Helmholtz system for a generated data volume with 100 sources, 100 receivers, and 100 y-wavenumbers. The 2.5D data took 136s to generate and the 3D data took 8200s, both on a single machine with no data parallelization. Top row: real part, bottom row: imaginary part.

5.5.2 Full Waveform Inversion

To demonstrate the effectiveness of our software environment, we perform a simple 2D FWI experiment on a 2D slice of the 3D BG Compass model. We generate data on this 2km x 4.5km model (discretized on a 10m grid) from 3Hz to 18Hz (in 1Hz increments) using our Helmholtz modeling kernel. Employing a frequency continuation strategy allows us to mitigate convergence issues associated with local minima [261]. That is to say, we partition the entire frequency spectrum 3, 4, ..., 18 into overlapping subsets, select a subset on which to invert the model, and use the resulting model estimate as a warm start for the next frequency band. In our case, we use frequency bands of size 4 with an overlap of 2 between bands. At each stage of the algorithm, we invert the model using 20 iterations of a box-constrained LBFGS algorithm from [223]. An excerpt from the full script that produces this example is shown in Listing 5.2. The results of this algorithm are shown in Figure 5.8. As we are using a band with a low starting frequency (3Hz in this case), FWI is expected to perform well, which we see that it does.
Although this is an idealized experimental setup, our framework allows for the possibility of testing out algorithms that make use of higher starting frequencies, or that use frequency extrapolation as in [263, 170], with minimal code changes.

% Set the initial model (a smoothed version of the true model)
mest = m0;
% Loop over subsets of frequencies
for j=1:size(freq_partition,1)
    % Extract the current frequencies at this batch
    fbatch = freq_partition(j,:);
    % Select only sources at this frequency batch
    srcfreqmask = false(nsrc,nfreq);
    srcfreqmask(:,fbatch) = true;
    params.srcfreqmask = srcfreqmask;
    % Construct objective function for these frequencies
    obj = misfit_setup(mest,Q,Dobs,model,params);
    % Call the box constrained LBFGS method
    mest = minConf_TMP(obj,mest,mlo,mhi,opts);
end

Listing 5.2 Excerpt from the script that produces this example.

[Figure 5.8: three panels on a 4.5km x 2km domain.]
Figure 5.8 True (left), initial (middle), and inverted (right) models.

5.5.3 Sparsity Promoting Seismic Imaging

The seismic imaging problem aims to reconstruct a high resolution reflectivity map \delta m of the subsurface, given some smooth background model m_0, by inverting the (overdetermined) Jacobian system

J(m_0)\,\delta m \approx \delta d.

In this simple example, \delta d is the image of the true perturbation under the Jacobian. Attempting to tackle the least-squares system directly,

\min_{\delta m} \sum_{i_s \in S} \sum_{i_f \in F} \| J_{i_s, i_f}\, \delta m - \delta d_{i_s, i_f} \|_2^2,

where S indexes the sources and F indexes the frequencies, is computationally daunting due to the large number of sources and frequencies used. One straightforward approach is to randomly subsample sources and frequencies, i.e., choose S' ⊂ S and F' ⊂ F and solve

\min_{\delta m} \sum_{i_s \in S'} \sum_{i_f \in F'} \| J_{i_s, i_f}\, \delta m - \delta d_{i_s, i_f} \|_2^2,

but the Jacobian can become rank deficient in this case, despite it being full rank normally. One solution to this problem is to use sparse regularization coupled with randomized subsampling in order to force the iterates to head towards the true perturbation while still reducing the per-iteration costs. There have been a number of instances of incorporating sparsity of a seismic image in the Curvelet domain [55, 56], in particular [131, 127, 249].
We use the Linearized Bregman method [195, 47, 46], which solves

\min_x\; \lambda \|x\|_1 + \tfrac{1}{2}\|x\|_2^2 \quad \text{s.t.}\quad Ax = b.

Coupled with random subsampling as in [132], the iterations are shown in Algorithm 5.2, a variant of which is used in [63]. Here S_\lambda(x) = \mathrm{sign}(x)\max(0, |x| - \lambda) is the componentwise soft-thresholding operator. In Listing 5.3, the reader will note the close adherence of our code to Algorithm 5.2, aside from some minor bookkeeping code and the pre- and post-multiplication by the curvelet transform to ensure the sparsity of the signal. In this example, we place 300 equispaced sources and 400 receivers at the top of the model and generate data for 40 frequencies from 3-12 Hz. At each iteration of the algorithm, we randomly select 30 sources and 10 frequencies (corresponding to the 10 parallel workers we use) and set the number of iterations so that we perform 10 effective passes through the entire data. Compared to the image estimate obtained from an equivalent (in terms of the number of PDEs solved) method solving the least-squares problem with the LSMR method [101], the randomly subsampled method has made significantly more progress towards the true solution, as shown in Figure 5.9.

Algorithm 5.2 Linearized Bregman with per-iteration randomized subsampling
for k = 1, 2, ..., T
  Draw a random subset of indices I_k
  z_{k+1} ← z_k − t_k A_{I_k}^T (A_{I_k} x_k − b_{I_k})
  x_{k+1} ← S_λ(z_{k+1})

for k=1:T
    % Draw a random subset of sources and frequencies
    Is = rand_subset(nsrc,round(0.2*nsrc));
    If = rand_subset(nfreq,parpool_size());
    % Mask the objective to the sources/frequencies drawn
    srcfreqmask = false(nsrc,nfreq);
    srcfreqmask(Is,If) = true;
    params.srcfreqmask = srcfreqmask;
    % Construct the subsampled Jacobian operator + data
    A = oppDF(m0,Q,model,params);
    b = distributed_subsample_data(b_full,Is,If);
    % Linearized Bregman algorithm
    r = A*x-b;
    ATr = A'*r;
    t = norm(r)^2/norm(ATr)^2;
    z = z - t*ATr;
    x = C'*softThreshold(C*z,lambda);
end

Listing 5.3 Excerpt from the code that produces this example.

[Figure 5.9: three image panels titled "True image", "Full data inversion", and "Linearized Bregman inversion".]
Figure 5.9 Sparse seismic imaging: full data least-squares inversion versus linearized Bregman with randomized subsampling.

5.5.4 Electromagnetic Conductivity Inversion

To demonstrate the modular and flexible nature of our software framework, we replace the finite difference Helmholtz PDE with a simple finite volume discretization of the variable coefficient Poisson equation

\nabla \cdot (\sigma \nabla u) = q,

where \sigma is the spatially-varying conductivity coefficient. We discretize the field on the vertices of a regular rectangular mesh and \sigma at the cell centers, which results in the system matrix, denoted H(\sigma), written abstractly as

H(\sigma) = \sum_{i=1}^{5} V_i\, \mathrm{diag}(W_i \sigma)

for constant matrices V_i, W_i. This form allows us to easily derive expressions for T : \delta\sigma \mapsto \partial H(\sigma)[\delta\sigma]\,u and T^* as

T\,\delta\sigma = \sum_{i=1}^{5} V_i\, \mathrm{diag}(W_i\, \delta\sigma)\, u,
T^* z = \sum_{i=1}^{5} W_i^H\, \mathrm{diag}(u)\, V_i^H z.

With these expressions in hand, we can simply slot the finite volume discretization and the corresponding directional derivative functions into our framework without modifying any other code.
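A minimal sketch of these two mappings, with the constant matrices stored in cell arrays V and W and the current field u (illustrative names), could read:

function dHu = T_apply(dsigma, u, V, W)
    % applies dH(sigma)[dsigma] * u = sum_i V_i * diag(W_i * dsigma) * u
    dHu = zeros(size(u));
    for i = 1:numel(V)
        dHu = dHu + V{i} * ((W{i} * dsigma) .* u);
    end
end

function dsigma = T_adjoint(z, u, V, W)
    % adjoint mapping: sum_i W_i^H * diag(u) * V_i^H * z
    dsigma = 0;
    for i = 1:numel(W)
        dsigma = dsigma + W{i}' * (u .* (V{i}' * z));
    end
end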
Consider a simple constant conductivity model containing a square anomaly with a 20% difference compared to the background, encompassing a region that is 5km x 5km with a grid spacing of 10m, as depicted in Figure 5.10. We place 100 equally spaced sources and receivers at depths z = 400m and z = 1600m, respectively. The pointwise constraints we use are the true model for z < 400m and z > 1600m and, for the region in between, we set \sigma_{\min} = \min_x \sigma(x) and \sigma_{\max} = 2\max_x \sigma(x). Our initial model is a constant model with the correct background conductivity, shown in Figure 5.10. Given a current estimate of the conductivity \sigma_k, we minimize a quadratic model of the objective subject to bound constraints, i.e.,

\sigma_{k+1} = \arg\min_{\sigma}\; \langle \sigma - \sigma_k, g_k\rangle + \tfrac{1}{2}\langle \sigma - \sigma_k, H_k(\sigma - \sigma_k)\rangle \quad \text{s.t.}\quad \sigma_{\min} \le \sigma \le \sigma_{\max},

where g_k and H_k are the gradient and Gauss-Newton Hessian, respectively. We solve 5 of these subproblems, using 5 objective/gradient evaluations for each subproblem with the bound constrained LBFGS method of [223]. As we do not impose any constraints on the model itself and the PDE itself is smoothing, we recover a very smooth version of the true model, shown in Figure 5.10, with the attendant code shown in Listing 5.4. This discretization and optimization setup is by no means the optimal method for inverting such conductivity models; we merely outline how straightforward it is to incorporate different PDEs into our framework. Techniques such as total variation regularization [2] can be incorporated into this framework by merely modifying our objective function.

obj = misfit_setup(sigma0,Q,D,model,params);
sigma = sigma0;
for i=1:5
    % Evaluate objective
    [f,g,h] = obj(sigma);
    % Construct quadratic model
    q = @(x) quadratic_model(g,h,x,sigma);
    % Minimize quadratic model, subject to pointwise constraints
    sigma = minConf_TMP(q,sigma,sigma_min,sigma_max,opts);
end

Listing 5.4 Excerpt from the script that produces this example.

[Figure 5.10: three panels on a 5km x 5km domain titled "True Model", "Initial Model", and "Inverted Model".]
Figure 5.10 Inversion results when changing the PDE model from the Helmholtz to the Poisson equation.

5.5.5 Stochastic Full Waveform Inversion

Our software design makes it relatively straightforward to apply the same inversion algorithm to both a 2D and a 3D problem in turn, while changing very little in the code itself. We consider the following algorithm for inversion, which allows us to handle the large number of sources and model parameters, as well as the necessity of pointwise bound constraints [128] on the intermediate models. Our problem has the form

\min_m\; \frac{1}{N_s}\sum_{i=1}^{N_s} f_i(m) \quad \text{s.t.}\quad m \in \mathcal{C},

where m is our model parameter (velocity or slowness), f_i(m) = \tfrac{1}{2}\|P_r H(m)^{-1} q_i - d_i\|_2^2 is the least-squares misfit for source i, and \mathcal{C} is our convex constraint set, which is \mathcal{C} = \{m : m_{LB} \le m \le m_{UB}\} in this case. When we have p parallel processes and N_s \gg p sources, in order to efficiently make progress toward the solution of the above problem, we stochastically subsample the objective and approximately minimize the resulting problem, i.e., at the kth iteration we solve

m_k = \arg\min_m\; \frac{1}{|I_k|}\sum_{i \in I_k} f_i(m) \quad \text{s.t.}\quad m \in \mathcal{C},

for I_k \subset \{1, \ldots, N_s\} drawn uniformly at random and |I_k| = p \ll N_s. We use the approximate solution m_k as a warm start for the next problem and repeat this process a total of T times.
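A minimal sketch of this outer loop, using the same helper functions as Listings 5.2 and 5.4 and omitting the per-subproblem work limits discussed next, could look as follows (the source bookkeeping is illustrative):

m = m0;
for k = 1:T
    % draw |I_k| = p sources uniformly at random and keep all frequencies
    Is = randperm(nsrc, p);
    srcfreqmask = false(nsrc, nfreq);
    srcfreqmask(Is, :) = true;
    params.srcfreqmask = srcfreqmask;
    % subsampled objective, warm-started bound-constrained solve
    obj = misfit_setup(m, Q, Dobs, model, params);
    m = minConf_TMP(obj, m, mLB, mUB, opts);
end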
Given that our basic unit of work in these problems is computing the solution of a PDE, we limit the number of iterations for each subproblem so that each subproblem can be evaluated at a constant multiple of the cost of evaluating the objective and gradient with the full data, i.e., r_sub⌈N_s/p⌉ iterations for a constant r_sub. If an iteration stagnates, i.e., if the line search fails, we increase the size of |I_k| by a fixed amount. This algorithm is similar in spirit to the one proposed in [257].

2D Model — BG Compass

We apply the above algorithm to the BG Compass model, with the same geometry and source/receiver configuration as outlined in Section 5.5.2. We limit the total number of passes over the entire data to be equal to 50% of those used in the previous example, with 10 re-randomization steps and r_sub = 2. Our results are shown in Figure 5.11. Despite the smaller number of overall PDEs solved, the algorithm converges to a solution that is qualitatively hard to distinguish from Figure 5.8. The overall model error as a function of the number of subproblems solved is depicted in Figure 5.12. The model error stagnates as the number of iterations in a given frequency batch rises and continues to decrease when a new frequency batch is drawn.

[Figure 5.11: three panels on a 4.5km x 2km domain.]
Figure 5.11 True (left), initial (middle), and inverted (right) models.

[Figure 5.12: relative model error (0.03 to 0.1) plotted against the number of subproblems solved (0 to 35).]
Figure 5.12 Relative model error as a function of the number of randomized subproblems solved.

3D Model - Overthrust

We apply the aforementioned algorithm to the SEG/EAGE Overthrust model, which spans 20km x 20km x 5km and is discretized on a 50m x 50m x 50m grid with a 500m deep water layer and minimum and maximum velocities of 1500m/s and 6000m/s. The ocean floor is covered with a 50 x 50 grid of sources, each with 400m spacing, and a 396 x 396 grid of receivers, each with 50m spacing. The frequencies we use are in the range of 3-5.5Hz with 0.25Hz sampling, corresponding to 4s of data in the time domain, and we invert a single frequency at a time. The number of wavelengths in the x, y, z directions varies from (40, 40, 10) to (73, 73, 18), respectively, with the number of points per wavelength varying from 9.8 at the lowest frequency to 5.3 at the highest frequency. The model is clamped to be equal to the true model in the water layer and is otherwise allowed to vary between the maximum and minimum velocity values. In practical problems, some care must be taken to ensure that discretization discrepancies between the boundary of the water and the true earth model are minimized. Our initial model is a significantly smoothed version of the true model and we limit the number of stochastic redraws to T = 3, so that we are performing the same amount of work as evaluating the objective and gradient three times with the full data. The number of unknown parameters is 14,880,000 and the fields are inverted on a grid with 39,859,200 points, owing to the PML layers in each direction. We use 100 nodes of the Yemoja cluster, with 4 Matlab workers each, for this computation. Each node has 128GB of RAM and a 20-core Intel processor.
We use 5 threads for each matrix-vector product and subsample sources so that each Matlab worker solves a single PDE at a time (i.e., |I_k| = 400 in the above formulation), and we set the effective number of passes through the data for each subproblem to one, i.e., r_sub = 1. Despite the limited number of passes through the data, our code is able to make substantial progress towards the true model, as shown in Figures 5.14 and 5.15. Unlike in the 2D case, we are limited in our ability to invert higher frequencies in a reasonable amount of time, and therefore our inverted results do not fully resolve the fine features, especially in the deeper parts of the model.

Figure 5.13 True (left) and initial (right) models.

[Figure 5.14: depth slices at x = 12500m and y = 10000m.]
Figure 5.14 True model (left), initial model (middle), and inverted model (right) for a variety of fixed coordinate slices.

[Figure 5.15: horizontal slices at z = 750m, z = 1000m, and z = 2000m.]
Figure 5.15 True model (left), initial model (middle), and inverted model (right) for a variety of fixed z coordinate slices.

5.6 Discussion

There are a number of problem-specific constraints that have motivated our design decisions thus far. Computing solutions to the Helmholtz equation in particular is challenging due to the indefiniteness of the underlying system for even moderate frequencies and the sampling requirements of a given finite difference stencil. These challenges preclude one from simply employing the same sparse-matrix techniques used in 2D for the 3D case. Direct methods that store even a partial LU decomposition of the system matrix are infeasible from a memory perspective, unless one allows for using multiple nodes to compute the PDE solutions. Even in that case, one can run into resiliency issues in computational environments where nodes have a non-zero probability of failure over the lifetime of the computation. Regardless of the dimensionality, our unified interface for multiplying and dividing the Helmholtz system with a vector abstracts away the implementation-specific details of the underlying matrix, while still allowing for high performance. These design choices give us the ability to scale from simple 2D problems on 4 cores to realistically sized 3D problems running on 2000 cores with minimal engineering effort. The acoustic Helmholtz equation itself is a model resulting from a number of simplifying assumptions made on the physical properties of the earth, such as constant density and ignoring elastic effects [228]. These assumptions are made in the interest of reducing the computational complexity of solving the PDEs, but they simultaneously reduce the explanatory power of the model.
If one has a sufficiently powerful computational environment, these simplifying assumptions can be removed and the resulting PDE solution kernels can be integrated into this software system.

Although our choice of Matlab has enabled us to succinctly design and prototype our code, there have been a few stumbling blocks as a result of this language choice. There is an onerous licensing issue for the Parallel Toolbox, which makes scaling to a large number of workers costly. The Parallel Toolbox is built on MPI, which is simply insufficient for performing large scale computations in an environment that can be subject to disconnections and node failures, and it cannot be swapped out for another parallelization scheme within Matlab itself. In an environment where one does not have full control over the computational hardware, such as on Amazon's cloud computing services, this paradigm is untenable. For interfacing with C or Fortran, there is a large learning curve for compiling MEX files, which are specially constructed C files that can be called from within Matlab. Matlab offers its Matlab Coder product that allows one to, in principle, compile any function into a MEX file and thus reap potential performance benefits. The compilation process has its limits in terms of functionality, however, and cannot, for instance, compile our framework easily.

Thankfully, there have been a few efforts to help alleviate these issues. The relatively new numerical computing language Julia has substantially matured in the last few years, making it a viable competitor to Matlab. Julia aims to bridge the gap between an interpreted and a compiled language, the former being easier to prototype in and the latter being much faster, by offering a just-in-time compiler, which balances ease of use and performance. Julia is open source, whereas Matlab is decidedly not, allows for much more fine-grained control over parallelization, and has other attractive features such as built-in interfacing to C, strong (although flexible) typing, and a large, active package ecosystem. We aim to reimplement the framework described in this paper in Julia in the future. The Devito framework [166] is another such approach to balancing high-level mathematics and low-level performance in PDE-constrained problems, specifically through the compilation of symbolic Python to C on the fly. Devito incurs some initial setup time to process the symbolic expressions of the PDEs and compile them into runnable C binaries, but this overhead is negligible compared to the cost of time-stepping solutions and only has to be performed once for a PDE with fixed parameters. This is one possible option to speed up matrix-vector products, or to allow for a user-specified order of accuracy at runtime, if the relevant complex-valued extensions can be written. We leave this as an option for future work.

5.7 Conclusion

The designs outlined in this paper make for an inversion framework that successfully encapsulates the inner workings of the construction, solution, and recombination of solutions of PDE systems. As a result, the high-level interfaces exposed to the user allow a researcher to easily construct algorithms dealing with the outer structure of the problem, such as stochastic subsampling or Newton-type methods, rather than being hindered by the complexities of, for example, solving linear systems or distributing computation.
This hierarchical and modular approach allows us to delineate the various components associated with these computations in a straightforward and demonstrably correct way, without sacrificing performance. Moreover, we have demonstrated that this design allows us to easily swap in different PDE stencils, or even different PDEs, while still keeping the outer, high-level interfaces intact. This design allows us to apply a large number of high-level algorithms to large-scale problems with minimal effort.

With this design, we believe that we have struck the right balance between readability and performance and, by exposing the right amount of information at each level of the software hierarchy, researchers should be able to use this codebase as a starting point for developing future inversion algorithms.

Chapter 6

Conclusion

Inverse problems are an important class of problems that enable us to indirectly estimate the parameters of a region from surface measurements. The acquired data is high-dimensional and often has missing entries due to practical constraints. There is a large theoretical and practical interest in having fully sampled and noise-free data without the need to acquire all of the samples of the volume in the field. As such, developing methods to interpolate large multidimensional volumes when the acquired data is subsampled and noisy is of great importance. Once one has an accurate estimate of the fully sampled data volume, one must come to terms with the enormous complexity and large number of components that must be dealt with (from a software perspective) when solving the inverse problem itself. In this thesis, we have proposed techniques that are particular to the three subtopics of missing data completion for low rank tensors, convex composite optimization, and inverse problem software design. In what follows, we discuss our contributions to these topics as well as various open questions that remain.

Manifold Tensor Completion. Chapter 2 introduced the smooth manifold of Hierarchical Tucker tensors, following the theoretical developments of [253]. By exploiting the quotient geometry inherent in HT tensors, along with the description of the horizontal space, we introduced a Riemannian metric on this manifold that is invariant under the group action. As a result, this metric extends to the underlying quotient manifold. From a numerical point of view, we can compute inner products using the concrete HT parameters in a way that respects the group action, and thus we are implicitly computing these quantities on the abstract manifold of equivalence classes in a well-defined manner. Equipped with this Riemannian metric, we derived expressions for the Riemannian gradient that can be computed efficiently using multilinear products, and considered a sparse-data version of these expressions that avoids having to form the full data volume. We also specified other important components of solving optimization problems on the HT manifold by deriving a retraction, a vector transport, and a line search method. In order to speed up the convergence of the resulting optimization algorithms, we derived simplified expressions for the Gauss-Newton Hessian that can be computed through the associated Gramian matrices, which encode the singular values of the full tensor. These matrices, and their derivatives, can be used to impose regularization on missing data interpolation problems in high subsampling regimes by keeping the algorithm iterates away from the boundary of the manifold.
Unlike in the Tucker case, this regularization is explicitly available for algorithmic use, as well as for theoretical considerations. Using this construction, we prove that our algorithms converge to a stationary point, which is typically sufficient when analyzing nonconvex problems. In practice, these methods appear to recover low rank tensors successfully, and we demonstrate the effectiveness of this approach on a number of large-scale seismic interpolation examples. Our methods are efficient and can recover volumes with high subsampling ratios.

There are a number of open questions pertaining to tensor completion that are important both practically and theoretically. The necessity of working with this nonconvex but low-rank parametrization of this class of tensors arises from the fact that explicitly storing and computing the full volumes is simply not feasible. As a result, unlike in nuclear-norm minimization problems where the matrix variable is stored explicitly, we must have some estimate of the underlying ranks of the HT format in order for it to estimate the unknown tensor accurately. Cross-validation to estimate these ranks becomes computationally daunting, given that we have a vector of parameters over which to iterate. For tensors arising from the discretization of a multivariate function, the ability to decompose the tensor into low-rank factors can be qualitatively linked to the smoothness of the function itself [224]. These bounds are not tight and are unfortunately uninformative when handling a given tensor. It remains to be seen how to determine these ranks in a rigorous manner a priori. As for the tensor completion problem itself, very little has been rigorously shown in the literature to the effect that the solution of the nonconvex optimization program is close to the true unknown tensor. Results have been shown when the sampling operator is subgaussian [211] or consists of partial Fourier measurements [212], but being able to prove that the tensor can be recovered under pointwise sampling, as in the matrix case, remains an open problem.

Convex Composite Optimization. Chapter 3 is a precursor to Chapter 4, wherein we prove results about smooth convex functions satisfying the Polyak-Lojasiewicz inequality. Such functions are in some sense the most general class of convex functions for which steepest descent iterations converge linearly towards a global minimizer. Although strongly convex functions satisfy the PL inequality, they are not the only ones that do so, and in this chapter we examine the effect of various transformations of a function on its PL constant. We study the behaviour of the PL constant under composition with a smooth mapping. We also prove that Moreau regularization can only decrease the PL constant of a given function, and that there is a natural upper bound for the PL constant of the Moreau envelope of a function, irrespective of the function itself. The functions that realize this upper bound are indicator functions of convex sets. This in turn shows that distance functions satisfy the PL inequality despite not being strongly convex, which we use in the following chapter. We also examined the PL behaviour of the well-known Huber function, which is the Moreau envelope of the ℓ1 norm.
Although the original function ‖x‖₁ does not satisfy the PL inequality, the Huber function does, and we characterize the region on which this inequality holds.

We study the solution of convex composite optimization programs in Chapter 4. These are a general class of (usually nonconvex) problems given by minimizing a non-smooth but convex function composed with a smooth inner function. The non-smoothness of the outer function prevents methods that solve the problem directly from scaling, since such methods are inherently saddled with solving non-smooth subproblems. Non-smooth convex optimization has sublinear convergence in the worst case, which is far too slow a rate when the problem sizes become large. Instead, we propose to use a level set method that switches the role of the objective and constraints, resulting in smooth subproblems with simple constraints. By studying the associated value function, which parametrizes the tradeoff between increasing the objective to its minimum value and satisfying the constraints, we employ a secant or Newton method to update this parameter and converge towards a solution. As the secant method converges superlinearly and the Newton method converges quadratically, we have to update our cooling parameter only a small number of times. Despite the fact that our subproblems are non-convex and the distance function itself lacks strong convexity, if the nonlinear mapping is sufficiently well-behaved, we prove that steepest descent converges linearly to a solution. We apply this method to a variety of convex and nonconvex problems, many of which can be construed as analysis problems (in the sense of compressed sensing). This method is successful at interpolating seismic data, performing total variation deblurring, declipping audio signals, and performing robust low-rank tensor completion, also known as robust tensor principal component analysis. In particular for robust completion, the resulting codes are very simple to implement, as they merely involve a minor change in the objective function. This fact allows us to derive robust tensor completion from the code used in Chapter 2 with little modification. Although we are able to handle nonconvex ℓ0 problems, which are NP-complete, we show that as the number of parameters increases, the associated computational times increase drastically, as one would expect when solving such problems. Our method is computationally attractive as we do not introduce additional hyperparameters that must be empirically estimated from the unknown signal.

Despite the success of our approach for transform-based problems, it is not altogether clear how one could solve these subproblems in a stochastic manner, when the inner mapping and h are sufficiently separable. One may have to forgo the variable projection altogether and formulate algorithms on the joint (x, z) variables. The large amount of existing research on stochastic optimization methods, which has demonstrated the effectiveness of these methods when the dimensionality and number of data points are high, suggests that this avenue would be fruitful for solving realistically sized problems. Additionally, what conditions should one impose on the nonlinear mapping so that the resulting subproblems v(τ) satisfy the conditions of Theorem 4.5? Namely, that v(τ + ∆τ) = v_lin(τ + ∆τ) + O(|∆τ|²), so that the expression for v′(τ) is the one given by the theorem.
Intuitively, the level set constraint h(z) ≤ τ only acts on the z variable, and so we would expect this expression for v′(τ) to hold irrespective of whether the inner mapping is linear or nonlinear. Numerically, this expression for v′(τ) appears to hold when compared to the finite difference approximation v′(τ) ≈ (v(τ + h) − v(τ))/h for the nonconvex robust tensor completion problem, and so it may be the case that the Hierarchical Tucker mapping ϕ is sufficiently close to its linearization for the theorem to hold. It is also interesting to ask how one could extend the PL-inequality convergence analysis to Newton or quasi-Newton methods, and how the resulting theory would manifest for C² convex functions.

Inverse Problem Software Design. In Chapter 5, we consider solving large scale inverse problems from a software design perspective. Given the enormous engineering challenges involved in solving such problems, in addition to the large number of scientific fields one must be versed in in order to tackle them, previous efforts to design such frameworks have either been well designed or high performance, but rarely both. This chapter showcases a new organization for solving inverse problems that decouples the components into discrete modules that are responsible for specific tasks and organized in a hierarchical fashion. As a result, one has flexibility in replacing a baseline implementation of each component with an implementation that suits a particular user's needs. This separation of responsibilities also enables us to interface with low-level C code when needed for matrix-vector products while keeping the rest of the code in high-level Matlab, which makes it significantly easier to comprehend, maintain, and extend. Several extensions of this approach are provided, to 2.5D inversion and to so-called waveform reconstruction inversion, which can be integrated into the entire framework with minimal effort. Some advantages afforded by this design are that the code is able to solve problems irrespective of their dimensionality, switching automatically between efficient sparse matrix algebra for 2D problems, where the memory requirements are less daunting, and efficient stencil-based Krylov methods for 3D problems. We also introduce a new preconditioner for Helmholtz systems, based on previous recursive multigrid-based preconditioners, which empirically requires a constant number of iterations if the number of points per wavelength of the discretized problem is sufficiently high. As we form all of our matrix-vector products implicitly, our preconditioner requires little memory, on the order of storing the sparse system matrix explicitly. It is also able to take advantage of local parallel resources, rather than requiring internode parallelism, in order to scale effectively to distributed systems with occasional node failures. Our codebase passes several necessary unit tests to ensure that the code as written reflects the underlying mathematics of the problem. We showcase our framework on a variety of nonlinear and linear inversion problems, with accompanying scripts that closely mirror the associated high level algorithms. Our framework is able to handle randomized subsampling, more complex algorithms such as Linearized Bregman, inverse problems with different associated PDEs, as well as large scale 3D problems.

In future work, we plan to extend this framework to cloud-based computing, at which point we will have to forgo a Matlab implementation and move towards a tractable language such as Julia.
An open algorithmic question that remains is: given the Nyquist sampling criterion of two points per wavelength for oscillatory PDE solutions, is it possible to design a multigrid preconditioner for the Helmholtz equation that employs two coarsened levels? Coarsening to two levels, as opposed to one, allows us to greatly reduce the computational cost of inverting the lower-level systems, but imposes a sampling requirement of eight points per wavelength on the finest level, which is too stringent. The 3D stencil we implement from [191] has low dispersion down to four points per wavelength, but this minimum threshold is somewhat wasted on the high sampling at the finest level.

Bibliography

[1] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton Univ Press, 2008. → pages 6, 23, 24, 30, 31, 32, 33, 39, 41
[2] A Abubaker and Peter M Van Den Berg. Total variation as a multiplicative constraint for solving inverse problems. IEEE Transactions on Image Processing, 10(9):1384–1392, 2001. → pages 116
[3] Evrim Acar, Daniel M Dunlavy, and Tamara G Kolda. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2):67–86, 2011. → pages 12
[4] Amir Adler, Valentin Emiya, Maria G Jafari, Michael Elad, Rémi Gribonval, and Mark D Plumbley. Audio inpainting. IEEE Transactions on Audio, Speech, and Language Processing, 20(3):922–932, 2012. → pages 79
[5] Andy Adler, Romina Gaburro, and William Lionheart. Electrical impedance tomography. In Handbook of Mathematical Methods in Imaging, pages 599–654. Springer, 2011. → pages 95
[6] Vinicius Albani, Uri M. Ascher, Xu Yang, and Jorge P. Zubelli. Data driven recovery of local volatility surfaces. Inverse Problems and Imaging, 11(5):2–2, 2017. ISSN 1930-8337. doi: 10.3934/ipi.2017038. → pages 2
[7] Vinicius Albani, Uri M Ascher, and Jorge P Zubelli. Local volatility models in commodity markets and online calibration. Journal of Computational Finance, To Appear. → pages 3
[8] Patrick R Amestoy, Iain S Duff, and J-Y L'excellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Computer methods in applied mechanics and engineering, 184(2):501–520, 2000. → pages 93
[9] A. Y Aravkin, R. Kumar, H. Mansour, B. Recht, and F. J Herrmann. A robust SVD-free approach to matrix completion, with applications to interpolation of large scale data. Technical report, University of British Columbia, February 2013. → pages 5
[10] Aleksandr Aravkin, Tristan Van Leeuwen, and Felix Herrmann. Robust full-waveform inversion using the Student's t-distribution. In SEG Technical Program Expanded Abstracts 2011, pages 2669–2673. Society of Exploration Geophysicists, 2011. → pages 7, 95
[11] Aleksandr Aravkin, Michael P Friedlander, Felix J Herrmann, and Tristan Van Leeuwen. Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming, pages 1–25, 2012. → pages 95, 98
[12] Aleksandr Aravkin, Rajiv Kumar, Hassan Mansour, Ben Recht, and Felix J Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM Journal on Scientific Computing, 36(5):S237–S266, 2014. → pages 2, 5
[13] Aleksandr Aravkin, Dmitriy Drusvyatskiy, and Tristan van Leeuwen. Variable projection without smoothness. arXiv preprint arXiv:1601.05011, 2016. → pages 65
[14] Aleksandr Y Aravkin and Tristan Van Leeuwen. Estimating nuisance parameters in inverse problems. Inverse Problems, 28(11):115016, 2012.
→pages 9, 65, 104[15] Aleksandr Y Aravkin, James V Burke, and Michael P Friedlander.Variational properties of value functions. SIAM Journal on optimization, 23(3):1689–1717, 2013. → pages 70, 71[16] Aleksandr Y. Aravkin, Rajiv Kumar, Hassan Mansour, Ben Recht, andFelix J. Herrmann. Fast methods for denoising matrix completionformulations, with applications to robust seismic data interpolation. SIAMJournal on Scientific Computing, 36(5):S237–S266, 10 2014. doi:10.1137/130919210. URL https://www.slim.eos.ubc.ca/Publications/Public/Journals/SIAMJournalOnScientificComputing/2014/aravkin2014SISCfmd/aravkin2014SISCfmd.pdf. (SISC). → pages 53[17] Aleksandr Y Aravkin, James V Burke, Dmitriy Drusvyatskiy, Michael PFriedlander, and Scott Roy. Level-set methods for convex optimization.arXiv preprint arXiv:1602.01506, 2016. → pages 64, 89[18] Simon R Arridge. Optical tomography in medical imaging. Inverse problems,15(2):R41, 1999. → pages 1[19] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.5.http://www.sandia.gov/~tgkolda/TensorToolbox/, Accessed January 1,2012. → pages 12[20] Satish Balay, J Brown, Kris Buschelman, Victor Eijkhout, W Gropp,D Kaushik, M Knepley, L Curfman McInnes, B Smith, and Hong Zhang.PETSc users manual revision 3.3. Computer Science Division, ArgonneNational Laboratory, Argonne, IL, 2012. → pages 92130Bibliography[21] Jonas Ballani and Lars Grasedyck. Tree adaptive approximation in thehierarchical tensor format. Preprint, 141, 2013. → pages 43[22] Jonas Ballani, Lars Grasedyck, and Melanie Kluge. Black box approximationof tensors in Hierarchical Tucker format. Linear Algebra and its Applications,438(2):639 – 657, 2013. ISSN 0024-3795. doi: 10.1016/j.laa.2011.08.010.Tensors and Multilinear Algebra. → pages 11, 13[23] Alvaro Barbero and Suvrit Sra. Modular proximal optimization formultidimensional total-variation regularization. arXiv preprintarXiv:1411.0589, 2014. → pages 77[24] Gilbert Bassett Jr and Roger Koenker. Asymptotic theory of least absoluteerror regression. Journal of the American Statistical Association, 73(363):618–622, 1978. → pages 6[25] Heinz H Bauschke and Patrick L Combettes. Convex analysis and monotoneoperator theory in Hilbert spaces. Springer Science & Business Media, 2011.→ pages 57[26] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009. → pages 1[27] James Bennett, Stan Lanning, et al. The Netflix prize. In Proceedings ofKDD cup and workshop, volume 2007, page 35. New York, NY, USA, 2007.→ pages 4[28] Jean-Pierre Berenger. A perfectly matched layer for the absorption ofelectromagnetic waves. Journal of computational physics, 114(2):185–200,1994. → pages 96[29] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: Afresh approach to numerical computing. arXiv preprint arXiv:1411.1607,2014. → pages 94[30] George Biros and Omar Ghattas. Parallel Lagrange–Newton–Krylov–Schurmethods for PDE-constrained optimization. part i: The Krylov–Schur solver.SIAM Journal on Scientific Computing, 27(2):687–713, 2005. → pages 96[31] Åke Björck and Tommy Elfving. Accelerated projection methods forcomputing pseudoinverse solutions of systems of linear equations. BITNumerical Mathematics, 19(2):145–163, 1979. → pages 107[32] J.D. Blanchard, J. Tanner, and Ke Wei. Conjugate gradient iterative hardthresholding: Observed noise stability for compressed sensing. SignalProcessing, IEEE Transactions on, 63(2):528–537, Jan 2015. ISSN1053-587X. 
doi: 10.1109/TSP.2014.2379665. → pages 42131Bibliography[33] Hans Georg Bock. Numerical treatment of inverse problems in chemicalreaction kinetics. In Modelling of chemical reaction systems, pages 102–125.Springer, 1981. → pages 1[34] Liliana Borcea. Electrical impedance tomography. Inverse problems, 18(6):R99, 2002. → pages 95[35] Brett Borden. Mathematical problems in radar inverse scattering. InverseProblems, 18(1):R1, 2001. → pages 1[36] Léon Bottou. Online learning and stochastic approximations. On-linelearning in neural networks, 17(9):142, 1998. → pages 98[37] Léon Bottou. Large-scale machine learning with stochastic gradient descent.In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. →pages 98[38] Yassine Boubendir, Xavier Antoine, and Christophe Geuzaine. Aquasi-optimal non-overlapping domain decomposition algorithm for theHelmholtz equation. Journal of Computational Physics, 231(2):262–280,2012. → pages 105[39] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. InInformation Sciences and Systems, 2008. CISS 2008. 42nd AnnualConference on, pages 16–21. IEEE, 2008. → pages 84[40] Stephen Boyd, Lin Xiao, and Almir Mutapcic. Subgradient methods. 2003.→ pages 6[41] William L Briggs, Van Emden Henson, and Steve F McCormick. A multigridtutorial. SIAM, 2000. → pages 106[42] Brian H Brown. Electrical impedance tomography (EIT): a review. Journalof medical engineering & technology, 27(3):97–108, 2003. → pages 1[43] James V Burke. Second order necessary and sufficient conditions for convexcomposite NDO. Mathematical Programming, 38(3):287–302, 1987. → pages63[44] James V Burke and Michael C Ferris. A Gauss—Newton method for convexcomposite optimization. Mathematical Programming, 71(2):179–194, 1995.→ pages 63[45] Francesca Cagliari, Barbara Di Fabio, and Claudia Landi. The naturalpseudo-distance as a quotient pseudo-metric, and applications. AMS Acta,Universita di Bologna, 3499, 2012. → pages 40[46] Jian-Feng Cai, Stanley Osher, and Zuowei Shen. Convergence of thelinearized Bregman iteration for L1-norm minimization. Mathematics ofComputation, 78(268):2127–2136, 2009. → pages 114132Bibliography[47] Jian-Feng Cai, Stanley Osher, and Zuowei Shen. Linearized Bregmaniterations for compressed sensing. Mathematics of Computation, 78(267):1515–1536, 2009. → pages 114[48] Jian-Feng Cai, Stanley Osher, and Zuowei Shen. Split Bregman methods andframe based image restoration. Multiscale modeling & simulation, 8(2):337–369, 2009. → pages 74[49] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular valuethresholding algorithm for matrix completion. SIAM Journal onOptimization, 20(4):1956–1982, 2010. → pages 2, 5[50] Tianxi Cai, T Tony Cai, and Anru Zhang. Structured matrix completionwith applications to genomic data integration. arXiv preprintarXiv:1504.01823, 2015. → pages 5[51] Henri Calandra, Serge Gratton, Xavier Pinel, and Xavier Vasseur. Animproved two-grid preconditioner for the solution of three-dimensionalHelmholtz problems in heterogeneous media. Numerical Linear Algebra withApplications, 20(4):663–688, 2013. ISSN 1099-1506. doi: 10.1002/nla.1860.→ pages 106, 107[52] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convexoptimization. Communications of the ACM, 55(6):111–119, 2012. → pages 2[53] Emmanuel Candes, Mark Rudelson, Terence Tao, and Roman Vershynin.Error correction via linear programming. In Foundations of ComputerScience, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages668–681. IEEE, 2005. 
→ pages 72[54] Emmanuel Candes, Laurent Demanet, David Donoho, and Lexing Ying. Fastdiscrete curvelet transforms. Multiscale Modeling & Simulation, 5(3):861–899, 2006. → pages 74[55] Emmanuel J Candes and David L Donoho. Curvelets: A surprisinglyeffective nonadaptive representation for objects with edges. Technical report,DTIC Document, 2000. → pages 114[56] Emmanuel J Candès and David L Donoho. New tight frames of curveletsand optimal representations of objects with piecewise C2 singularities.Communications on pure and applied mathematics, 57(2):219–266, 2004. →pages 114[57] Emmanuel J Candes and Terence Tao. Decoding by linear programming.IEEE transactions on information theory, 51(12):4203–4215, 2005. → pages4, 72[58] Emmanuel J Candes and Terence Tao. Near-optimal signal recovery fromrandom projections: Universal encoding strategies? IEEE transactions oninformation theory, 52(12):5406–5425, 2006. → pages 4, 72133Bibliography[59] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robustprincipal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.→ pages 63[60] Emmanuel J Candes, Yonina C Eldar, Thomas Strohmer, and VladislavVoroninski. Phase retrieval via matrix completion. SIAM review, 57(2):225–251, 2015. → pages 5[61] J. Carroll and J. Chang. Analysis of individual differences inmultidimensional scaling via an N-way generalization of ”Eckart-Young”decomposition. Psychometrika, 35:283–319, 1970. ISSN 0033-3123.10.1007/BF02310791. → pages 12[62] Khosrow Chadan and Pierre C Sabatier. Inverse problems in quantumscattering theory. Springer Science & Business Media, 2012. → pages 1[63] Xintao Chai, Mengmeng Yang, Philipp Witte, Rongrong Wang, ZhilongFang, and Felix Herrmann. A linearized Bregman method for compressivewaveform inversion. In SEG Technical Program Expanded Abstracts 2016,pages 1449–1454. Society of Exploration Geophysicists, 2016. → pages 114[64] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithmfor convex problems with applications to imaging. Journal of MathematicalImaging and Vision, 40(1):120–145, 2011. → pages 73[65] Raymond H Chan, Min Tao, and Xiaoming Yuan. Constrained totalvariation deblurring models and fast algorithms based on alternatingdirection method of multipliers. SIAM Journal on imaging Sciences, 6(1):680–697, 2013. → pages 77[66] Stanley H Chan, Ramsin Khoshabeh, Kristofor B Gibson, Philip E Gill, andTruong Q Nguyen. An augmented Lagrangian method for total variationvideo restoration. IEEE Transactions on Image Processing, 20(11):3097–3111, 2011. → pages 77[67] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan SWillsky. The convex geometry of linear inverse problems. Foundations ofComputational mathematics, 12(6):805–849, 2012. → pages 57[68] Guang-Hong Chen, Jie Tang, and Shuai Leng. Prior image constrainedcompressed sensing (PICCS): a method to accurately reconstruct dynamicCT images from highly undersampled projection data sets. Medical physics,35(2):660–663, 2008. → pages 2[69] J-B Chen. A 27-point scheme for a 3D frequency-domain scalar waveequation based on an average-derivative method. Geophysical Prospecting, 62(2):258–277, 2014. → pages 93134Bibliography[70] Zhongying Chen, Dongsheng Cheng, Wei Feng, and Tingting Wu. Anoptimal 9-point finite difference scheme for the Helmholtz equation withPML. International Journal of Numerical Analysis & Modeling, 10(2), 2013.→ pages 102[71] Margaret Cheney, David Isaacson, and Jonathan C Newell. Electricalimpedance tomography. SIAM review, 41(1):85–101, 1999. 
→ pages 1, 95[72] WC Chew, JM Jin, and E Michielssen. Complex coordinate stretching as ageneralized absorbing boundary condition. Microwave and OpticalTechnology Letters, 15(6):363–369, 1997. → pages 96[73] Radu Cimpeanu, Anton Martinsson, and Matthias Heil. A parameter-freeperfectly matched layer formulation for the finite-element-based solution ofthe Helmholtz equation. Journal of Computational Physics, 296:329–347,2015. → pages 93[74] Jon F Claerbout and Francis Muir. Robust modeling with erratic data.Geophysics, 38(5):826–844, 1973. → pages 4[75] Kenneth L Clarkson and David P Woodruff. Sketching for M-estimators: Aunified approach to robust regression. In Proceedings of the Twenty-SixthAnnual ACM-SIAM Symposium on Discrete Algorithms, pages 921–939.Society for Industrial and Applied Mathematics, 2015. → pages 6[76] Rowan Cockett, Seogi Kang, Lindsey J Heagy, Adam Pidlisecky, andDouglas W Oldenburg. Simpeg: An open source framework for simulationand gradient based parameter estimation in geophysical applications.Computers & Geosciences, 85:142–154, 2015. → pages 93[77] Gary Cohen. Higher-order numerical methods for transient wave equations.Springer Science & Business Media, 2013. → pages 106[78] Ian Craig and John Brown. Inverse problems in astronomy. 1986. → pages 1[79] Curt Da Silva and F Herrmann. Hierarchical Tucker tensor optimization -applications to 4D seismic data interpolation. In 75th EAGE Conference &Exhibition incorporating SPE EUROPEC 2013, 2013. → pages iv[80] Curt Da Silva and Felix Herrmann. A unified 2D/3D software environmentfor large-scale time-harmonic full-waveform inversion. In SEG TechnicalProgram Expanded Abstracts 2016, pages 1169–1173. Society of ExplorationGeophysicists, 2016. → pages iv[81] Curt Da Silva and Felix J. Herrmann. Hierarchical Tucker tensoroptimization - applications to tensor completion. In 10th internationalconference on Sampling Theory and Applications (SampTA 2013), pages384–387, Bremen, Germany, July 2013. → pages iv, 14, 43135Bibliography[82] Curt Da Silva and Felix J Herrmann. Optimization on the HierarchicalTucker manifold–applications to tensor completion. Linear Algebra and itsApplications, 481:131–173, 2015. → pages iv[83] Curt Da Silva and Felix J. Herrmann. A unified 2D/3D softwareenvironment for large scale nonlinear inverse problems. Technical report,University of British Columbia, 2017. → pages iv[84] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singularvalue decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000. → pages 12, 15[85] V. De Silva and L.H. Lim. Tensor rank and the ill-posedness of the bestlow-rank approximation problem. SIAM Journal on Matrix Analysis andApplications, 30(3):1084–1127, 2008. → pages 12, 16[86] Paulus Maria De Zeeuw. Matrix-dependent prolongations and restrictions ina blackbox multigrid solver. Journal of computational and appliedmathematics, 33(1):1–27, 1990. → pages 106[87] Laurent Demanet. Curvelets, Wave Atoms, and Wave Equations. PhD thesis,California Institute of Technology, 2006. → pages 43[88] John E Dennis Jr, David M Gay, and Roy E Walsh. An adaptive nonlinearleast-squares algorithm. ACM Transactions on Mathematical Software(TOMS), 7(3):348–368, 1981. → pages 63[89] Pedro Díez. A note on the convergence of the secant method for simple andmultiple roots. Applied Mathematics Letters, 16(8):1211–1215, 2003. →pages 66[90] David L Donoho. Compressed sensing. IEEE Transactions on informationtheory, 52(4):1289–1306, 2006. 
→ pages 4, 72[91] D Drusvyatskiy and C Paquette. Efficiency of minimizing compositions ofconvex functions and smooth maps. Optimization Online, 2016. → pages 63[92] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra.Efficient projections onto the L1-ball for learning in high dimensions. InProceedings of the 25th international conference on Machine learning, pages272–279. ACM, 2008. → pages 86[93] M. P. Friedlander E. van den Berg. Spot - a linear-operator toolbox.”http://www.cs.ubc.ca/labs/scl/spot/”, Accessed April 1, 2014. → pages 25,94[94] Lars Eldén and Berkant Savas. A Newton-Grassmann method for computingthe best multilinear rank-(r_1, r_2, r_3) approximation of a tensor. SIAMJournal on Matrix Analysis and applications, 31(2):248–271, 2009. → pages16136Bibliography[95] Björn Engquist and Lexing Ying. Sweeping preconditioner for the Helmholtzequation: moving perfectly matched layers. Multiscale Modeling &Simulation, 9(2):686–710, 2011. → pages 105[96] Yogi A Erlangga, Cornelis Vuik, and Cornelis Willebrordus Oosterlee. On aclass of preconditioners for solving the Helmholtz equation. AppliedNumerical Mathematics, 50(3):409–425, 2004. → pages 105[97] Antonio Falcó, Wolfgang Hackbusch, Anthony Nouy, et al. Geometricstructures in tensor representations (release 2). 2014. → pages 13[98] Patrick E Farrell, David A Ham, Simon W Funke, and Marie E Rognes.Automated derivation of the adjoint of high-level transient finite elementprograms. SIAM Journal on Scientific Computing, 35(4):C369–C393, 2013.→ pages 92[99] David J Field. Wavelets, vision and the statistics of natural scenes.Philosophical Transactions of the Royal Society of London A: Mathematical,Physical and Engineering Sciences, 357(1760):2527–2542, 1999. → pages 2[100] Mário AT Figueiredo, Robert D Nowak, and Stephen J Wright. Gradientprojection for sparse reconstruction: Application to compressed sensing andother inverse problems. IEEE Journal of selected topics in signal processing,1(4):586–597, 2007. → pages 1[101] David Chin-Lung Fong and Michael Saunders. LSMR: An iterativealgorithm for sparse least-squares problems. SIAM Journal on ScientificComputing, 33(5):2950–2971, 2011. → pages 114[102] Inéz Frerichs. Electrical impedance tomography (EIT) in applications relatedto lung and ventilation: a review of experimental and clinical activities.Physiological measurement, 21(2):R1, 2000. → pages 1[103] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochasticmethods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012. → pages 98[104] Martin J Gander and Hui Zhang. Domain decomposition methods for theHelmholtz equation: a numerical investigation. In Domain DecompositionMethods in Science and Engineering XX, pages 215–222. Springer, 2013. →pages 105[105] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-ranktensor recovery via convex optimization. Inverse Problems, 27(2):025010,2011. → pages 12, 14, 39[106] Euhanna Ghadimi, André Teixeira, Iman Shames, and Mikael Johansson.Optimal parameter selection for the alternating direction method ofmultipliers (ADMM): quadratic problems. IEEE Transactions on AutomaticControl, 60(3):644–658, 2015. → pages 63137Bibliography[107] Gene Golub and Victor Pereyra. Separable nonlinear least squares: thevariable projection method and its applications. Inverse problems, 19(2):R1,2003. → pages 63, 65[108] Dan Gordon and Rachel Gordon. 
CARP-CG: A robust and efficient parallelsolver for linear systems, applied to strongly convection dominated PDEs.Parallel Computing, 36(9):495 – 515, 2010. ISSN 0167-8191. doi:10.1016/j.parco.2010.05.004. → pages 107[109] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAMJournal on Matrix Analysis and Applications, 31(4):2029–2054, 2010. →pages xii, 13, 15, 17, 19, 32, 36[110] Lars Grasedyck, Melanie Kluge, and Sebastian Krämer. Alternatingdirections fitting (ADF) of hierarchical low rank tensors. Preprint, 149, 2013.→ pages 13[111] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature surveyof low-rank tensor approximation techniques. GAMM-Mitteilungen, 36(1):53–78, 2013. → pages 11[112] Samuel H Gray, John Etgen, Joe Dellinger, and Dan Whitmore. Seismicmigration problems and solutions. Geophysics, 66(5):1622–1640, 2001. →pages 91[113] L Grippo, F Lampariello, and S Lucidi. A truncated Newton method withnonmonotone line search for unconstrained optimization. Journal ofOptimization Theory and Applications, 60(3):401–419, 1989. → pages 99[114] Antoine Guitton and William W Symes. Robust inversion of seismic datausing the Huber norm. Geophysics, 68(4):1310–1319, 2003. → pages 7, 60[115] W. Hackbusch and S. Kühn. A new scheme for the tensor representation.Journal of Fourier Analysis and Applications, 15(5):706–722, 2009. → pages5, 8, 13[116] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus,volume 42. Springer, 2012. → pages 11[117] Jutho Haegeman, Tobias J Osborne, and Frank Verstraete. Post-matrixproduct state methods: To tangent space and beyond. Physical Review B, 88(7):075133, 2013. → pages 13[118] R.A. Harshman. Foundations of the parafac procedure: models andconditions for an ”explanatory” multimodal factor analysis. UCLA WorkingPapers in Phonetics, 16:1–84, 1970. → pages 12[119] Richard A Harshman. Foundations of the parafac procedure: Models andconditions for an” explanatory” multi-modal factor analysis. 1970. → pages 5138Bibliography[120] Mark J Harvilla and Richard M Stern. Least squares signal declipping forrobust speech recognition. 2014. → pages 79[121] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entireregularization path for the support vector machine. Journal of MachineLearning Research, 5(Oct):1391–1415, 2004. → pages 63[122] Frank D Hastings, John B Schneider, and Shira L Broschat. Application ofthe perfectly matched layer (PML) absorbing boundary condition to elasticwave propagation. The Journal of the Acoustical Society of America, 100(5):3061–3069, 1996. → pages 96[123] Matthew A Herman and Thomas Strohmer. High-resolution radar viacompressed sensing. IEEE transactions on signal processing, 57(6):2275–2284, 2009. → pages 2[124] Michael A Heroux, Roscoe A Bartlett, Vicki E Howle, Robert J Hoekstra,Jonathan J Hu, Tamara G Kolda, Richard B Lehoucq, Kevin R Long,Roger P Pawlowski, Eric T Phipps, et al. An overview of the trilinos project.ACM Transactions on Mathematical Software (TOMS), 31(3):397–423, 2005.→ pages 92[125] Felix J Herrmann and Gilles Hennenfent. Non-parametric seismic datarecovery with curvelet frames. Geophysical Journal International, 173(1):233–248, 2008. → pages 2[126] Felix J. Herrmann and Xiang Li. Efficient least-squares migration withsparsity promotion. In EAGE Annual Conference Proceedings. EAGE,EAGE, 05 2011. → pages 152[127] Felix J Herrmann and Xiang Li. Efficient least-squares imaging with sparsitypromotion and compressive sensing. Geophysical prospecting, 60(4):696–712,2012. 
→ pages 114[128] Felix J. Herrmann and Bas Peters. Constraints versus penalties foredge-preserving full-waveform inversion. In SEG Workshop on Where are weheading with FWI; Dallas, 10 2016. (SEG Workshop, Dallas). → pages 117[129] Felix J. Herrmann, Deli Wang, Gilles Hennenfent, and Peyman P.Moghaddam. Seismic data processing with curvelets: a multiscale andnonlinear approach. In SEG Technical Program Expanded Abstracts,volume 26, pages 2220–2224. SEG, SEG, 2007. doi: 10.1190/1.2792927. →pages 2[130] Felix J Herrmann, Deli Wang, Gilles Hennenfent, and Peyman PMoghaddam. Curvelet-based seismic data processing: A multiscale andnonlinear approach. Geophysics, 73(1):A1–A5, 2007. → pages 2139Bibliography[131] Felix J Herrmann, Peyman Moghaddam, and Christiaan C Stolk.Sparsity-and continuity-promoting seismic image recovery with curveletframes. Applied and Computational Harmonic Analysis, 24(2):150–173, 2008.→ pages 114[132] FJ Herrmann, N Tu, and E Esser. Fast “online” migration with compressivesensing. 77th Annual International Conference and Exhibition, EAGE, 2015.→ pages 114[133] Graham J Hicks. Arbitrary source and receiver positioning infinite-difference schemes using Kaiser windowed sinc functions. Geophysics,67(1):156–165, 2002. → pages 100[134] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum ofproducts. Studies in Applied Mathematics, 6(1-4):164–189, 1927. → pages 5[135] Olav Holberg. Computational aspects of the choice of operator and samplinginterval for numerical differentiation in large-scale simulation of wavephenomena. Geophysical prospecting, 35(6):629–655, 1987. → pages 91[136] Sebastian Holtz, Thorsten Rohwedder, and Reinhold Schneider. Thealternating linear scheme for tensor optimization in the tensor train format.SIAM Journal on Scientific Computing, 34(2):A683–A713, 2012. → pages 13[137] Sebastian Holtz, Thorsten Rohwedder, and Reinhold Schneider. Onmanifolds of tensors of fixed TT-rank. Numerische Mathematik, 120(4):701–731, 2012. ISSN 0029-599X. doi: 10.1007/s00211-011-0419-7. → pages13[138] L Hu, L Huang, and ZR Lu. Crack identification of beam structures usinghomotopy continuation algorithm. Inverse Problems in Science andEngineering, 25(2):169–187, 2017. → pages 1[139] Yaohua Hu, Chong Li, and Xiaoqi Yang. On convergence rates of linearizedproximal algorithms for convex composite optimization with applications.SIAM Journal on Optimization, 26(2):1207–1235, 2016. → pages 64[140] Bo Huang, Cun Mu, Donald Goldfarb, and John Wright. Provable low-ranktensor recovery. 2014. → pages 12[141] Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard GBaraniuk. Robust 1-bit compressive sensing via binary stable embeddings ofsparse vectors. IEEE Transactions on Information Theory, 59(4):2082–2102,2013. → pages 84, 86, 87[142] Churl-Hyun Jo, Changsoo Shin, and Jung Hee Suh. An optimal 9-point,finite-difference, frequency-space, 2-D scalar wave extrapolator. Geophysics,61(2):529–537, 1996. → pages 102140Bibliography[143] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002. →pages 3[144] Anatoli Juditsky, Arkadi Nemirovski, et al. First order methods fornonsmooth convex large-scale optimization, i: general purpose methods.Optimization for Machine Learning, pages 121–148. → pages 64[145] Stefan Kaczmarz. Angenäherte auflösung von systemen linearer gleichungen.Bulletin International de l’Académie Polonaise des Sciences et des Lettres,35:355–357, 1937. → pages 107[146] Hiroo Kanamori. Quantification of earthquakes. 
Nature, 271:411–414, 1978.→ pages 91[147] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence ofgradient and proximal-gradient methods under the Polyak-Lojasiewiczcondition. In Joint European Conference on Machine Learning andKnowledge Discovery in Databases, pages 795–811. Springer, 2016. → pages8, 55, 58, 68, 71[148] Hiroyuki Kasai and Bamdev Mishra. Riemannian preconditioning for tensorcompletion. arXiv preprint arXiv:1506.02159, 2015. → pages 23[149] Hiroyuki Kasai and Bamdev Mishra. Low-rank tensor completion: aRiemannian manifold preconditioning approach. In International Conferenceon Machine Learning, pages 1012–1021, 2016. → pages 23[150] Linda Kaufman. A variable projection method for solving separablenonlinear least squares problems. BIT Numerical Mathematics, 15(1):49–57,1975. ISSN 1572-9125. doi: 10.1007/BF01932995. → pages 65[151] Steven M Kay. Statistical signal processing. Estimation Theory, 1, 1993. →pages 63[152] Boris N. Khoromskij. Tensors-structured numerical methods in scientificcomputing: Survey on recent advances. Chemometrics and IntelligentLaboratory Systems, 110(1):1 – 19, 2012. ISSN 0169-7439. doi:10.1016/j.chemolab.2011.09.001. → pages 11[153] Srđan Kitić, Nancy Bertin, and Rémi Gribonval. Audio declipping bycosparse hard thresholding. In iTwist-2nd international-Traveling Workshopon Interactions between Sparse models and Technology, 2014. → pages 79, 80[154] Srđan Kitić, Nancy Bertin, and Rémi Gribonval. Sparsity and cosparsity foraudio declipping: a flexible non-convex approach. In InternationalConference on Latent Variable Analysis and Signal Separation, pages243–250. Springer, 2015. → pages 79, 80141Bibliography[155] Srdjan Kitic, Laurent Jacques, Nilesh Madhu, Michael Peter Hopwood, AnnSpriet, and Christophe De Vleeschouwer. Consistent iterative hardthresholding for signal declipping. In Acoustics, Speech and Signal Processing(ICASSP), 2013 IEEE International Conference on, pages 5939–5943. IEEE,2013. → pages 79[156] Matthew G Knepley, Richard F Katz, and Barry Smith. Developing ageodynamics simulator with PETSc. In Numerical Solution of PartialDifferential Equations on Parallel Computers, pages 413–438. Springer, 2006.→ pages 92[157] Tamara G Kolda and Brett W Bader. Tensor decompositions andapplications. SIAM review, 51(3):455–500, 2009. → pages 11, 12, 15[158] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorizationtechniques for recommender systems. Computer, 42(8), 2009. → pages 5[159] N. Kreimer and M.D. Sacchi. Tensor completion via nuclear normminimization for 5D seismic data reconstruction. In SEG Technical ProgramExpanded Abstracts 2012, pages 1–5. Society of Exploration Geophysicists,2012. → pages 12[160] Nadia Kreimer and Mauricio D Sacchi. A tensor higher-order singular valuedecomposition for prestack seismic data noise reduction and interpolation.Geophysics, 77(3):V113–V122, 2012. → pages 12[161] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensorcompletion by Riemannian optimization. Technical Report 2, 2014. → pages13, 14, 28, 37, 39, 41, 42, 44[162] Daniel Kressner and Christine Tobler. Algorithm 941: Htucker—a Matlabtoolbox for tensors in Hierarchical Tucker format. ACM Trans. Math. Softw.,40(3):22:1–22:22, April 2014. ISSN 0098-3500. doi: 10.1145/2538688. →pages 13, 41[163] Rajiv Kumar, Hassan Mansour, Felix J Herrmann, and Aleksandr Y Aravkin.Reconstruction of seismic wavefields via low-rank matrix factorization in thehierarchical-separable matrix representation. 
In SEG Technical ProgramExpanded Abstracts 2013, pages 3628–3633. Society of ExplorationGeophysicists, 2013. → pages 54[164] Rajiv Kumar, Curt Da Silva, Okan Akalin, Aleksandr Y Aravkin, HassanMansour, Benjamin Recht, and Felix J Herrmann. Efficient matrixcompletion for seismic data reconstruction. Geophysics, 80(5):V97–V114,2015. → pages 5[165] Rafael Lago and Felix J. Herrmann. Towards a robust geometric multigridscheme for Helmholtz equation. Technical Report TR-EOAS-2015-3, UBC,01 2015. → pages 106, 107142Bibliography[166] Michael Lange, Navjot Kukreja, Mathias Louboutin, Fabio Luporini, FelippeVieira, Vincenzo Pandolfo, Paulius Velesko, Paulius Kazakas, and GerardGorman. Devito: towards a generic finite difference DSL using symbolicpython. arXiv preprint arXiv:1609.03361, 2016. → pages 92, 122[167] Adrian S Lewis and Stephen J Wright. A proximal method for compositeminimization. Mathematical Programming, pages 1–46, 2008. → pages 63[168] Adrian S Lewis and Stephen J Wright. A proximal method for compositeminimization. Mathematical Programming, 158(1-2):501–546, 2016. → pages64[169] Chong Li and Xinghua Wang. On convergence of the Gauss-Newton methodfor convex composite optimization. Mathematical programming, 91(2):349–356, 2002. → pages 63[170] Yunyue Elita Li and Laurent Demanet. Full-waveform inversion withextrapolated low-frequency data. GEOPHYSICS, 81(6):R339–R348, 2016.doi: 10.1190/geo2016-0038.1. → pages 113[171] Dong C Liu and Jorge Nocedal. On the limited memory BFGS method forlarge scale optimization. Mathematical programming, 45(1):503–528, 1989.→ pages 99[172] Fei Liu and Lexing Ying. Recursive sweeping preconditioner for the 3DHelmholtz equation. arXiv preprint arXiv:1502.07266, 2015. → pages 105[173] Fei Liu and Lexing Ying. Additive sweeping preconditioner for theHelmholtz equation. Multiscale Modeling & Simulation, 14(2):799–822, 2016.→ pages 105[174] Christian Lubich. From quantum to classical molecular dynamics: reducedmodels and numerical analysis. European Mathematical Society, 2008. →pages 13[175] Christian Lubich, Thorsten Rohwedder, Reinhold Schneider, and BartVandereycken. Dynamical Approximation by Hierarchical Tucker andTensor-Train Tensors. SIAM Journal on Matrix Analysis and Applications,34(2):470–494, April 2013. → pages 13[176] David G Luenberger. Introduction to linear and nonlinear programming,volume 28. Addison-Wesley Reading, MA, 1973. → pages 6[177] Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: Theapplication of compressed sensing for rapid MR imaging. Magneticresonance in medicine, 58(6):1182–1195, 2007. → pages 2[178] Michael Lustig, David L Donoho, Juan M Santos, and John M Pauly.Compressed sensing MRI. IEEE signal processing magazine, 25(2):72–82,2008. → pages 72143Bibliography[179] Kaj Madsen and Hans Bruun Nielsen. Finite alogorithms for robust linearregression. BIT Numerical Mathematics, 30(4):682–699, 1990. → pages 60[180] Olvi L Mangasarian and David R. Musicant. Robust linear and supportvector regression. IEEE Transactions on Pattern Analysis and MachineIntelligence, 22(9):950–955, 2000. → pages 60[181] Robert Marks. Introduction to Shannon sampling and interpolation theory.Springer Science & Business Media, 2012. → pages 2[182] Ludovic Métivier and Romain Brossier. The seiscope optimization toolbox: alarge-scale nonlinear optimization library based on reverse communication.Geophysics, 81(2):F11–F25, 2016. → pages 92[183] B. Mishra and R. Sepulchre. R3MC: A Riemannian three-factor algorithmfor low-rank matrix completion. 
In 53rd IEEE Conference on Decision andControl, 2014. → pages 18, 23[184] Bamdev Mishra, Gilles Meyer, Francis Bach, and Rodolphe Sepulchre.Low-rank optimization with trace norm penalty. SIAM Journal onOptimization, 23(4):2124–2149, 2013. → pages 42[185] Cun Mu, Bo Huang, John Wright, and Donald Goldfarb. Square deal: Lowerbounds and improved relaxations for tensor recovery. arXiv preprintarXiv:1307.5870, 2013. → pages 13[186] Sangnam Nam, Mike E Davies, Michael Elad, and Rémi Gribonval. Thecosparse analysis model and algorithms. Applied and ComputationalHarmonic Analysis, 34(1):30–56, 2013. → pages 72, 74[187] Balas Kausik Natarajan. Sparse approximate solutions to linear systems.SIAM journal on computing, 24(2):227–234, 1995. → pages 81[188] Frank Natterer. Imaging and inverse problems of partial differentialequations. Technical report, Institut fur Numerische und AngewandteMathematik, 2006. → pages 95[189] Tamas Nemeth, Chengjun Wu, and Gerard T Schuster. Least-squaresmigration of incomplete reflection data. Geophysics, 64(1):208–221, 1999. →pages 3[190] Yurii Nesterov. Introductory lectures on convex optimization: A basic course,volume 87. Springer Science & Business Media, 2013. → pages 89[191] Stéphane Operto, Jean Virieux, Patrick Amestoy, Jean-Yves L’Excellent, LucGiraud, and Hafedh Ben Hadj Ali. 3D finite-difference frequency-domainmodeling of visco-acoustic wave propagation using a massively parallel directsolver: A feasibility study. GEOPHYSICS, 72(5):SM195–SM211, 2007. doi:10.1190/1.2759835. → pages 8, 93, 102, 106, 107, 108, 109, 128144Bibliography[192] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on ScientificComputing, 33(5):2295–2317, 2011. → pages 13[193] Ivan V Oseledets and SV Dolgov. Solution of linear systems and matrixinversion in the TT-format. SIAM Journal on Scientific Computing, 34(5):A2718–A2739, 2012. → pages 13[194] Ivan V Oseledets and Eugene E Tyrtyshnikov. Breaking the curse ofdimensionality, or how to use SVD in many dimensions. SIAM Journal onScientific Computing, 31(5):3744–3759, 2009. → pages 13[195] Stanley Osher, Yu Mao, Bin Dong, and Wotao Yin. Fast linearized Bregmaniteration for compressive sensing and sparse denoising. arXiv preprintarXiv:1104.0262, 2011. → pages 114[196] Samet Oymak, Amin Jalali, Maryam Fazel, Yonina C Eldar, and BabakHassibi. Simultaneously structured models with application to sparse andlow-rank matrices. arXiv preprint arXiv:1212.3753, 2012. → pages 12[197] Anthony D. Padula, Shannon D. Scott, and William W. Symes. A softwareframework for abstract expression of coordinate-free linear algebra andoptimization algorithms. ACM Trans. Math. Softw., 36(2):8:1–8:36, April2009. ISSN 0098-3500. doi: 10.1145/1499096.1499097. → pages 92[198] Christopher C Paige and Michael A Saunders. LSQR: An algorithm forsparse linear equations and sparse least squares. 1982. → pages 74[199] David A Patterson. Computer architecture: a quantitative approach.Elsevier, 2011. → pages 93[200] Bas Peters and Felix J. Herrmann. Constraints versus penalties foredge-preserving full-waveform inversion. Revision 1 submitted to TheLeading Edge on November 13, 2016., 2016. → pages 98[201] Gabriel Peyré. The numerical tours of signal processing. Computing inScience & Engineering, 13(4):94–97, 2011. → pages 80[202] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linearprogramming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013. → pages 84, 85[203] Chayne Planiden and Xianfu Wang. 
Strongly convex functions, moreauenvelopes, and the generic nature of convex functions with strong minimizers.SIAM Journal on Optimization, 26(2):1341–1364, 2016. → pages 57, 68[204] R-E Plessix. A review of the adjoint-state method for computing thegradient of a functional with geophysical applications. Geophysical JournalInternational, 167(2):495–503, 2006. → pages 152145Bibliography[205] R-E Plessix. A Helmholtz iterative solver for 3D seismic-imaging problems.Geophysics, 72(5):SM185–SM194, 2007. → pages 105[206] David Pollard. Asymptotics for least absolute deviation regressionestimators. Econometric Theory, 7(02):186–199, 1991. → pages 6[207] Jack Poulson, Bjorn Engquist, Siwei Li, and Lexing Ying. A parallelsweeping preconditioner for heterogeneous 3D Helmholtz equations. SIAMJournal on Scientific Computing, 35(3):C194–C212, 2013. → pages 105[208] Ernesto E Prudencio, Richard Byrd, and Xiao-Chuan Cai. Parallel full spaceSQP Lagrange–Newton–Krylov–Schwarz algorithms for pde-constrainedoptimization problems. SIAM Journal on Scientific Computing, 27(4):1305–1328, 2006. → pages 96[209] Shie Qian and Dapang Chen. Discrete gabor transform. IEEE transactionson signal processing, 41(7):2429–2438, 1993. → pages 80[210] Holger Rauhut, Reinhold Schneider, and Zeljka Stojanac. Tensor completionin hierarchical tensor representations. arXiv preprint arXiv:1404.3905, 2014.→ pages 13[211] Holger Rauhut, Reinhold Schneider, and Željka Stojanac. Tensor completionin hierarchical tensor representations. In Compressed Sensing and itsApplications, pages 419–450. Springer, 2015. → pages 125[212] Holger Rauhut, Reinhold Schneider, and Željka Stojanac. Low rank tensorrecovery via iterative hard thresholding. Linear Algebra and its Applications,523:220–262, 2017. → pages 125[213] Benjamin Recht. A simpler approach to matrix completion. Journal ofMachine Learning Research, 12(Dec):3413–3430, 2011. → pages 5[214] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithmsfor large-scale matrix completion. Mathematical Programming Computation,5(2):201–226, 2013. → pages 5[215] CD Riyanti, A Kononov, Yogi A Erlangga, Cornelis Vuik, Cornelis WOosterlee, R-E Plessix, and Wim A Mulder. A parallel multigrid-basedpreconditioner for the 3D heterogeneous high-frequency Helmholtz equation.Journal of Computational physics, 224(1):431–448, 2007. → pages 105[216] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317.Springer Science & Business Media, 2009. → pages 57[217] Ralph Tyrrell Rockafeller. Convex Analysis. Princeton University Press,1970. → pages 70[218] Farbod Roosta-Khorasani, Kees Van Den Doel, and Uri Ascher. Datacompletion and stochastic algorithms for pde inversion problems with manymeasurements. Electron. Trans. Numer. Anal, 42:177–196, 2014. → pages 2146Bibliography[219] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variationbased noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992. → pages 76[220] Lars Ruthotto, Eran Treister, and Eldad Haber. jInv–a flexible Julia packagefor PDE parameter estimation. arXiv preprint arXiv:1606.07399, 2016. →pages 93[221] Youcef Saad. A flexible inner-outer preconditioned GMRES algorithm.SIAM Journal on Scientific Computing, 14(2):461–469, 1993. → pages 98,107[222] Youcef Saad and Martin H Schultz. GMRES: A generalized minimal residualalgorithm for solving nonsymmetric linear systems. SIAM Journal onscientific and statistical computing, 7(3):856–869, 1986. 
→ pages 107[223] Mark W Schmidt, Ewout Van Den Berg, Michael P Friedlander, andKevin P Murphy. Optimizing costly functions with simple constraints: Alimited-memory projected quasi-Newton algorithm. In AISTATS, volume 5,pages 456–463, 2009. → pages 80, 111, 116[224] Reinhold Schneider and André Uschmajew. Approximation rates for thehierarchical tensor format in periodic Sobolev spaces. Journal of Complexity,30(2):56–71, 2014. → pages 43, 125[225] Reinhold Schneider and André Uschmajew. Convergence results forprojected line-search methods on varieties of low-rank matrices vialojasiewicz inequality. arXiv preprint arXiv:1402.5284, 2014. → pages 41[226] Claude Elwood Shannon. Communication in the presence of noise.Proceedings of the IRE, 37(1):10–21, 1949. → pages 2[227] John Shawe-Taylor and Shiliang Sun. A review of optimizationmethodologies in support vector machines. Neurocomputing, 74(17):3609–3618, 2011. → pages 63[228] Peter M Shearer. Introduction to seismology. Cambridge University Press,2009. → pages 122[229] Chao Shen, Tsung-Hui Chang, Kun-Yu Wang, Zhengding Qiu, andChong-Yung Chi. Distributed robust multicell coordinated beamformingwith imperfect csi: An ADMM approach. IEEE Transactions on signalprocessing, 60(6):2988–3003, 2012. → pages 63[230] Z J Shi and J Shen. New inexact line search method for unconstrainedoptimization. Journal of Optimization Theory and Applications, 127(2):425–446, November 2005. → pages 33147Bibliography[231] Zhen-Jun Shi. Convergence of line search methods for unconstrainedoptimization. Applied Mathematics and Computation, 157(2):393–405, 2004.→ pages 33[232] Naum Zuselevich Shor. Minimization methods for non-differentiablefunctions, volume 3. Springer Science & Business Media, 2012. → pages 6[233] Kai Siedenburg, Matthieu Kowalski, and Monika Dorfler. Audio declippingwith social sparsity. In Acoustics, Speech and Signal Processing (ICASSP),2014 IEEE International Conference on, pages 1577–1581. IEEE, 2014. →pages 79[234] M. Signoretto, R. Van de Plas, B. De Moor, and J. AK Suykens. Tensorversus matrix completion: a comparison with application to spectral data.Signal Processing Letters, IEEE, 18(7):403–406, 2011. → pages 12[235] Arnold Sommerfeld. Partial differential equations in physics, volume 1.Academic press, 1949. → pages 95[236] Zhong-Min Song and Paul R Williamson. Frequency-domain acoustic-wavemodeling and inversion of crosshole data: Part i 2.5D modeling method.Geophysics, 60(3):784–795, 1995. → pages 104, 105[237] Christiaan C Stolk. A rapidly converging domain decomposition method forthe Helmholtz equation. Journal of Computational Physics, 241:240–252,2013. → pages 105[238] Christiaan C Stolk. An improved sweeping domain decompositionpreconditioner for the Helmholtz equation. Advances in ComputationalMathematics, pages 1–32, 2016. → pages 105[239] Christiaan C Stolk, Mostak Ahmed, and Samir Kumar Bhowmik. Amultigrid method for the Helmholtz equation with optimized coarse gridcorrections. SIAM Journal on Scientific Computing, 36(6):A2819–A2841,2014. → pages 105[240] Dong Sun and William W Symes. Waveform inversion via nonlineardifferential semblance optimization. In 75th EAGE Conference &Exhibition-Workshops, 2013. → pages 96[241] William W. Symes, Dong Sun, and Marco Enriquez. From modelling toinversion: designing a well-adapted simulator. Geophysical Prospecting, 59(5):814–833, 2011. ISSN 1365-2478. doi: 10.1111/j.1365-2478.2011.00977.x.→ pages 92[242] Min Tao and Junfeng Yang. 
Alternating direction algorithms for totalvariation deconvolution in image reconstruction. Optimization Online, 2009.→ pages 77, 78148Bibliography[243] Albert Tarantola. Inverse problem theory and methods for model parameterestimation. SIAM, 2005. → pages 1[244] Albert Tarantola and Bernard Valette. Generalized nonlinear inverseproblems solved using the least squares criterion. Reviews of Geophysics, 20(2):219–232, 1982. → pages 1[245] Howard L Taylor, Stephen C Banks, and John F McCoy. Deconvolution withthe L1 norm. Geophysics, 44(1):39–52, 1979. → pages 4[246] Fons Ten Kroode, Steffen Bergler, Cees Corsten, Jan Willem de Maag, FlorisStrijbos, and Henk Tijhof. Broadband seismic data—the importance of lowfrequencies. Geophysics, 78(2):WA3–WA14, 2013. → pages 91[247] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journalof the Royal Statistical Society. Series B (Methodological), pages 267–288,1996. → pages 6[248] C. Tobler. Low Rank Tensor Methods for Linear Systems and EigenvalueProblems. PhD thesis, ETH Zürich, 2012. → pages 36[249] Ning Tu and Felix J Herrmann. Fast imaging with surface-related multiplesby sparse inversion. Geophysical Journal International, 201(1):304–317, 2015.→ pages 114[250] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis.Psychometrika, 31(3):279–311, 1966. → pages 5[251] Eli Turkel, Dan Gordon, Rachel Gordon, and Semyon Tsynkov. Compact 2Dand 3D sixth order schemes for the Helmholtz equation with variable wavenumber. Journal of Computational Physics, 232(1):272–287, 2013. → pages93[252] A. Uschmajew. Local convergence of the alternating least squares algorithmfor canonical tensor approximation. SIAM Journal on Matrix Analysis andApplications, 33(2):639–652, 2012. → pages 12[253] A. Uschmajew and B. Vandereycken. The geometry of algorithms usinghierarchical tensors. Linear Algebra and its Applications, 439(1):133–166,July 2013. → pages 6, 8, 11, 13, 14, 15, 16, 18, 19, 20, 21, 24, 37, 40, 124[254] A. Uschmajew and B. Vandereycken. Line-search methods and rank increaseon low-rank matrix varieties. In Proceedings of the 2014 InternationalSymposium on Nonlinear Theory and its Applications, 2014. → pages 42[255] E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basispursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912,2008. doi: 10.1137/080714488. → pages 9, 64149Bibliography[256] Tristan van Leeuwen. A parallel matrix-free framework for frequency-domainseismic modelling, imaging and inversion in matlab. Technical report,University of British Columbia, 2012. → pages 92[257] Tristan van Leeuwen and Felix J. Herrmann. 3D frequency-domain seismicinversion with controlled sloppiness. SIAM Journal on Scientific Computing,36(5):S192–S217, 10 2014. doi: 10.1137/130918629. (SISC). → pages 98, 103,118[258] Tristan van Leeuwen, Dan Gordon, Rachel Gordon, and Felix J Herrmann.Preconditioning the Helmholtz equation via row-projections. In 74th EAGEConference and Exhibition incorporating EUROPEC 2012, 2012. → pages105[259] Tristan van Leeuwen, Felix J. Herrmann, and Bas Peters. A new take onFWI: wavefield reconstruction inversion. In EAGE Annual ConferenceProceedings, 06 2014. doi: 10.3997/2214-4609.20140703. → pages 104[260] Bart Vandereycken. Low-rank matrix completion by Riemannianoptimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013. →pages 5[261] Jean Virieux and Stéphane Operto. An overview of full-waveform inversionin exploration geophysics. Geophysics, 74(6):WCC1–WCC26, 2009. 
→ pages 1, 91, 111
[262] Curtis R Vogel. Computational methods for inverse problems. SIAM, 2002. → pages 1
[263] Rongrong Wang and Felix Herrmann. Frequency down extrapolation with TV norm minimization. In SEG Technical Program Expanded Abstracts 2016, pages 1380–1384. Society of Exploration Geophysicists, 2016. → pages 113
[264] Edmund Taylor Whittaker. XVIII.—On the functions which are represented by the expansions of the interpolation-theory. Proceedings of the Royal Society of Edinburgh, 35:181–194, 1915. → pages 2
[265] Robert S Womersley. Local properties of algorithms for minimizing nonsmooth composite functions. Mathematical Programming, 32(1):69–89, 1985. → pages 63
[266] John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems, pages 2080–2088, 2009. → pages 3
[267] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009. → pages 63
[268] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013. → pages 12
[269] Jiyan Yang, Xiangrui Meng, and Michael Mahoney. Quantile regression for large-scale applications. In Proceedings of The 30th International Conference on Machine Learning, pages 881–887, 2013. → pages 6
[270] Yi Yang, Jianwei Ma, and Stanley Osher. Seismic data reconstruction via matrix completion. UCLA CAM Report, pages 12–14, 2012. → pages 5
[271] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning, page 116. ACM, 2004. → pages 7
[272] Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997. → pages 99

Appendices

Appendix A

A 'User Friendly' Guide to Basic Inverse Problems

For our framework, not only do we want to solve (5.2) directly, but we also want to allow researchers to explore other subproblems associated with the primary problem, such as the linearized problem [126] and the full-Newton trust region subproblem. As such, we are not only interested in the objective function and gradient, but also in other intermediate quantities based on differentiating the state equation H(m)u(m) = q. A standard approach to deriving these quantities is the adjoint-state approach, described for instance in [204], but the results we outline below are slightly more elementary and do not make use of Lagrange multipliers.

Rather than focusing on differentiating the expressions in (5.2) directly, we find it useful to consider directional derivatives of various quantities and their relationships to the gradient.
That is to say, for a sufficiently smooth function f(m) that can be scalar, vector, or matrix-valued, the directional derivative in the direction δm, denoted Df(m)[δm], is a linear function of δm that satisfies

Df(m)[δm] = lim_{t→0} ( f(m + t δm) − f(m) ) / t.

The most important part of the directional derivative, for the purposes of the following calculations, is that Df(m)[δm] is the same mathematical object as f(m), i.e., if f(m) is a matrix, so is Df(m)[δm]. For a given inner product ⟨·, ·⟩, the gradient of f(m), denoted ∇f(m), is the unique vector that satisfies

⟨∇f(m), δm⟩ = Df(m)[δm]   for all δm.

If we are not terribly worried about specifying the exact vector spaces in which these objects live, and treat them as we would expect them to behave (i.e., satisfying product and chain rules, commuting with matrix transposes, complex conjugation, linear operators, etc.), the resulting derivations become much more manageable.

Starting from the baseline expression for the misfit f(m) and differentiating using the chain rule, we have that

f(m) = ϕ(P_r H(m)^{-1} q, y) = ϕ(P_r u(m), y)
Df(m)[δm] = Dϕ(r(m), y)[ Dr(m)[δm] ]
r(m) = P_r u(m)
Dr(m)[δm] = P_r Du(m)[δm].

In order to determine the expression for Du(m)[δm], we differentiate the state equation,

H(m) u(m) = q,

in the direction δm using the product rule to obtain

DH(m)[δm] u(m) + H(m) Du(m)[δm] = 0
Du(m)[δm] = H(m)^{-1} ( −DH(m)[δm] u(m) ).        (A.1)

For the forward modelling operator F(m), DF(m)[δm] is the Jacobian, or so-called linearized Born modelling operator in geophysical parlance. Since, for any linear operator V, its transpose satisfies

⟨V x, y⟩ = ⟨x, V^T y⟩

for any appropriately sized vectors x, y, in order to determine the transpose of Dr(m)[δm] we merely take the inner product with an arbitrary vector y and "isolate" the vector δm on one side of the inner product. The other side is then the expression for the adjoint operator. In this case, we have that

⟨DF(m)[δm], y⟩ = ⟨P_r H(m)^{-1} ( −DH(m)[δm] u(m) ), y⟩
= ⟨H(m)^{-1} ( −DH(m)[δm] u(m) ), P_r^T y⟩
= ⟨DH(m)[δm] u(m), −H(m)^{-H} P_r^T y⟩
= ⟨T δm, −H(m)^{-H} P_r^T y⟩
= ⟨δm, T^* ( −H(m)^{-H} P_r^T y )⟩.

In order to completely specify the adjoint of DF(m)[δm], we need to specify the adjoint of the operator T, defined by T δm = DH(m)[δm] u(m), acting on a vector. This expression is particular to the form of the PDE with which we are working. For instance, the constant-density acoustic Helmholtz equation

∇²u(x) + ω² m(x) u(x) = q(x),

discretized with finite differences using particular matrices A and B for the Laplacian and identity operators, respectively, yields the linear system

A u + ω² B diag(m) u = q.

Therefore, we have the expression

T δm := DH(m)[δm] u(m)
= ω² B diag(δm) u(m)
= ω² B diag(u(m)) δm,        (A.2)

whose adjoint is clearly

T^* = ω² diag(u(m)) B^H        (A.3)

with directional derivative

DT^*[δm, δu] = ω² diag(δu) B^H.        (A.4)

Our final expression for DF(m)[·]^* y is therefore

DF(m)[·]^* y = ω² diag(u(m)) B^H ( −H(m)^{-H} P_r^T y ).

Setting y = ∇ϕ(P_r u(m)) and v(m) = −H(m)^{-H} P_r^T y yields the familiar expression for the gradient of f(m),

∇f(m) = ω² diag(u(m)) B^H H(m)^{-H} P_r^T ( −∇ϕ(P_r u(m)) ).
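As a deliberately simplified illustration of this expression, the following MATLAB sketch assembles f(m) and ∇f(m) for the least-squares misfit ϕ(d, y) = ½‖d − y‖², using dense factorizations purely for readability and the matrices A, B, and P_r introduced above. The conjugation of u(m) in the adjoint is written out explicitly, since the wavefields are complex-valued; the function name and calling sequence are placeholders and not the interface of the framework described in Chapter 5.

% Adjoint-state misfit and gradient for f(m) = 0.5*||Pr*u(m) - y||^2 with
% H(m)*u(m) = q and H(m) = A + omega^2*B*diag(m), following (A.1)-(A.3).
% A, B (discrete Laplacian and identity), the restriction Pr, the source q,
% the data y, and the angular frequency omega are assumed given.
function [f, g] = misfit_and_gradient(m, A, B, Pr, q, y, omega)
    H = A + omega^2 * B * diag(m);   % discrete Helmholtz operator
    u = H \ q;                       % forward wavefield u(m)
    r = Pr*u - y;                    % data residual; gradient of phi at Pr*u
    f = 0.5*norm(r)^2;
    v = H' \ (Pr' * (-r));           % adjoint wavefield v = -H^{-H} Pr^T r
    % gradient: omega^2 * diag(conj(u)) * B^H * v, with the real part taken
    % since the model m is real-valued (conjugation made explicit here)
    g = real(omega^2 * conj(u) .* (B' * v));
end

In practice the dense backslash solves above would be replaced by the factorizations or preconditioned Krylov solvers discussed in Chapter 5; the structure of the computation, one forward solve and one adjoint solve per source, is unchanged.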
This sort of methodology can be used to symbolically differentiate more complicated expressions as well as to compute higher order derivatives of f(m). Let us write ∇f(m) more abstractly as

∇f(m) = T(m, u)^* v(m),

which allows us to compute the Hessian as

∇²f(m) δm = DT(m, u)^*[δm, Du[δm]] v(m) + T(m, u)^* Dv(m)[δm].

Here, Du[δm] is given in (A.1) and DT(m, u)^*[δm, δu] is given in (A.4) for the Helmholtz equation. We compute Dv(m)[δm] by differentiating v(m) as

Dv(m)[δm] = −H(m)^{-H} DH(m)[δm]^H v(m) − H(m)^{-H} P_r^T ( ∇²ϕ[ P_r Du(m)[δm] ] )
= −H(m)^{-H} ( DH(m)[δm]^H v(m) + P_r^T ( ∇²ϕ[ P_r Du(m)[δm] ] ) ),

which completes our derivation of the Hessian-vector product.

A.1 Derivative expressions for Waveform Reconstruction Inversion

From (5.5), we have that the augmented wavefield u(m) solves the least-squares system

min_u ‖ [ P_r ; λ H(m) ] u − [ y ; λ q ] ‖_2²,

i.e., u(m) solves the normal equations

( P_r^T P_r + λ² H(m)^H H(m) ) u(m) = P_r^T y + λ² H(m)^H q.

For the objective ϕ(m, u) = (1/2) ‖P_r u − y‖_2² + (λ²/2) ‖H(m) u − q‖_2², the corresponding WRI objective is f(m) = ϕ(m, u(m)). Owing to the variable projection structure of this objective, the expression for ∇_m f(m), by [14], is

∇_m f(m) = ∇_m ϕ(m, u(m)) = λ² T(m, u(m))^* ( H(m) u(m) − q ),

which is identical to the original adjoint-state formulation, except evaluated at the wavefield u(m). The Hessian-vector product is therefore

∇²_m f(m) δm = λ² ( DT(m, u(m))^*[δm, Du(m)[δm]] ( H(m) u(m) − q ) + T(m, u(m))^* ( DH(m)[δm] u(m) + H(m) Du(m)[δm] ) ).

As previously, the expressions for DH(m)[δm] and DT(m, u)^*[δm, δu] are implementation specific. It remains to derive an explicit expression for Du(m)[δm].

Let G(m) = P_r^T P_r + λ² H(m)^H H(m) and r(m) = P_r^T y + λ² H(m)^H q, so that the above normal equations read G(m) u(m) = r(m). We differentiate this equation in the direction δm to obtain

DG(m)[δm] u(m) + G(m) Du(m)[δm] = Dr(m)[δm],
Du(m)[δm] = G(m)^{-1} ( Dr(m)[δm] − DG(m)[δm] u(m) ).

Since DG(m)[δm] = λ² ( DH(m)[δm]^H H(m) + H(m)^H DH(m)[δm] ) and Dr(m)[δm] = λ² DH(m)[δm]^H q, we have that

Du(m)[δm] = λ² G(m)^{-1} ( −H(m)^H DH(m)[δm] u(m) + DH(m)[δm]^H ( q − H(m) u(m) ) ).
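The expressions derived in this appendix lend themselves to simple numerical verification, in the spirit of the unit tests mentioned in Chapter 5: for a random direction δm and a routine returning a misfit value together with its gradient, the first-order Taylor error f(m + h δm) − f(m) − h⟨∇f(m), δm⟩ should decay like O(h²) as h → 0. The MATLAB sketch below is illustrative only; the handle fg and its calling sequence are assumptions for this example, not part of the codebase described in this thesis.

% Gradient (Taylor) test: assuming fg(m) returns [f, g] with g = grad f(m),
% the first-order error should shrink like O(h^2), while the zeroth-order
% error |f(m + h*dm) - f(m)| only shrinks like O(h).
function gradient_test(fg, m)
    dm = randn(size(m));  dm = dm / norm(dm);   % random unit direction
    [f0, g0] = fg(m);
    for h = 10.^(-1:-1:-6)
        fh = fg(m + h*dm);
        e0 = abs(fh - f0);                          % zeroth-order error, O(h)
        e1 = abs(fh - f0 - h*(g0(:)'*dm(:)));       % first-order error, O(h^2)
        fprintf('h = %7.1e   zeroth-order %9.3e   first-order %9.3e\n', h, e0, e1);
    end
end

The same check applies verbatim to the WRI objective of Section A.1, since only the routine behind fg changes.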
