Dimensionality Reduction for Solving Large Scale Inverse Problems withPDE Constraints through Simultaneous Source MethodbyMichelle LiuB. Sc Geophysics, The University of British Columbia, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Mathematics)The University of British Columbia(Vancouver)January 2018cMichelle Liu, 2018AbstractOptimization problems with PDE constraints arise in many applications of science and engineering. ThePDEs often describe physical systems and are characterized by a set of parameters, such as the materialproperties of the system, or velocity of the wave traveling through the earth’s subsurface. Usually theparameters cannot be directly measured and can only be inferred by solving inverse problems. Solvinglarge scale inverse problems usually require prohibitively large number of PDE solves. As a result, variousdimensionality reduction methods have been proposed to reduce the number of PDE solves. This workbuilds on a type of dimensionality method called the Simultaneous Source (SS) method. This methodrelies on random sampling and the stochastic trace estimator to reduce the number of PDE solves to amore manageable size. One of the limitations of the SS method is it can only be applied to data sets with nomissing entries. Unfortunately, data sets with missing entries are common in practice, such as in geophysicalapplications.In this thesis, we propose a new coupled optimization problem that extend the SS method to data withmissing entries. The block coordinate descent method (BCDM) is used to break the coupled problem intotwo subproblems. The first subproblem optimizes over the data and fills the missing entries by the rankminimization technique. The second subproblem minimizes over the model parameter and uses SS methodto reduce the number of PDE solves into a more manageable size. We test our method in the context of afull-waveform inversion (FWI) problem and present our numerical results in which we are able to reducethe number of PDE solves by 98% with less than 2% difference between the model using 1 PDE solve andthe model using full PDE solves.iiLay SummaryThe main computational bottleneck for solving large scale optimization problems with PDE constraints issolving for the PDEs. The Simultaneous Source (SS) method has been used to reduce the number of PDEsolves in order to reduce the computational cost. However, one of the limitation of this method is it canonly be applied to data sets with no missing entries. In this thesis, we propose a new coupled optimizationproblem to overcome this limitation.iiiPrefaceThis thesis contains my original research conducted under the supervision of Dr Eldad Haber during myM.Sc studies at the University of British Columbia. The idea of coupled inverse problem presented inchapter 4 came from conversations with Dr Haber. The derivations, code implementation, and numericalexperiments were carried out on my own with some support and suggestions from Dr Haber.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.1 Inverse Problem Definitions and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Approach for Solving Large Scale Inverse Problems with PDE Constraints . . . . . . . . . . 42.2.1 Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Dimensionality Reduction through Simultaneous Source Method . . . . . . . . . . . . . . . . 73.1 Simultaneous Source with Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 73.1.1 Some Important Properties of Monte Carlo Approximation and SAA Method . . . . 93.1.2 Choosing an Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Simultaneous Source Method with Deterministic Sampling . . . . . . . . . . . . . . . . . . 104 The SS Method for Data with Missing Entries . . . . . . . . . . . . . . . . . . . . . . . . . . 124.1 Data Completion through Rank Minimization Method . . . . . . . . . . . . . . . . . . . . . 134.1.1 Low-Rank Matrix Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134.1.2 Low Rank Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144.1.3 Low rank minimization and convex relaxation . . . . . . . . . . . . . . . . . . . . . 144.1.4 Incoherent matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15v4.2 Coupled Inverse Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174.2.1 Importance of the Coupled term in (4.7) . . . . . . . . . . . . . . . . . . . . . . . . 174.2.2 Block Coordinate Descent Method (BCDM) . . . . . . . . . . . . . . . . . . . . . 194.3 Error Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3.1 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.3.2 Computing the Gap in (4.17) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Full Wave-Form Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.1 Experimental Setting and Numerical Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.1.1 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285.2 Tuning the parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295.3 Low Rank Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.4.1 Boxed Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.4.2 Marmousi Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48A Stochastic Trace Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50viList of TablesTable 5.1 Summary of results for 25% entries missing data . . . . . . . . . . . . . . . . . . . . . . 34Table 5.2 Summary of results for 50% entries missing data . . . . . . . . . . . . . . . . . . . . . . 36Table 5.3 Summary of results for 75% entries missing data . . . . . . . . . . . . . . . . . . . . . . 38Table 5.4 Summary of results for 85% entries missing data . . . . . . . . . . . . . . . . . . . . . . 40Table 5.5 Summary of results for 90% entries missing data . . . . . . . . . . . . . . . . . . . . . . 42Table 5.6 Marmusi Model: 30% missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44viiList of FiguresFigure 3.1 Fast decaying singular values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Figure 4.1 The plane cuts at DFull = D⇤Full along the m axis. Image generated from [1].We willemphasize that purpose of this figure is to only provide us with some intuition of whatmight be wrong with the First Approach. The graph is not the the actual graph ofy(m,DFull). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18Figure 4.2 Recovered model from data with 50% missing entries. 1PDE solve were used per misfitto recover the models. misfit = 12a2 || MaskDFull Dobs ||F +12 || P0H1(m)QDFull ||F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19Figure 4.3 D⇤Full represents the recovered data for 50%missing entries, andm⇤ was computed using1PDE solved at every iteration (see section 5.4.1 for complete results). The confidenceintervals have some associated probability of capturing the true mean y(m⇤,D⇤Full) (rep-resented in blue star in the graph). The intervals in the plot all successfully captured thetrue mean, except for the 68% confidence interval at L= 5e5. This is no surprise if wethat a 68% confidence interval means that there is a 68% chance the interval containsthe true mean, and 32% chance that it will not. Notice that as the sample size increases,the confidence interval decreases as expected. . . . . . . . . . . . . . . . . . . . . . . . 25Figure 4.4 The histogram on the right hand side shows a sampling distribution for yˆL(m⇤,D⇤Full).The histogram consists of 5000 sample points. Sample size L=200. The QQ plot isapproximately linear which indicates the sampling distribution is approximately normal. 26Figure 4.5 The histogram on the right hand side shows a sampling distribution for yˆ(m⇤,D⇤Full).The distribution consists of 5000 sample points. The sample size is L=5. The QQplot is not linear. This suggests that the sampling distribution is not normal. This isre-confirmed by the histogram, which is slightly skewed. Thus, the chosen sample sizeis too small . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26Figure 5.1 Marine acquisition [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 5.2 figure 1a: The air gun (represented by the yellow star) is fired at every 20 km as itis moving towards the left of the page. Reflected wave gets recorded by a stream ofreceivers (red dots) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Figure 5.3 The singular values for Dtrue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31viiiFigure 5.4 True box model and data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 5.5 Recovered model from data with 25% missing entries. . . . . . . . . . . . . . . . . . . 33Figure 5.6 DDeterministic in the above title represents the recovered data by solving the full 60 PDEsfor one misfit. Similarly, D1PDE represents 1 PDE solved for one misfit. . . . . . . . . . 33Figure 5.7 The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F . Pleasebe aware that the graphs are converging to zero as the number of iterations increase.This is not the same as saying the misfit has reached zero. . . . . . . . . . . . . . . . . 34Figure 5.8 Recovered model from data with 50% missing entries. . . . . . . . . . . . . . . . . . . 35Figure 5.9 D60PDE in the above title represents the recovered data by solving the full 60 PDEs forone misfit. Similarly, D1PDE represents the recovered data by solving 1 PDE solved forone misfit. The same definition applies to D10PDE. . . . . . . . . . . . . . . . . . . . . . 35Figure 5.10 The misfit in the plot was computed in the main (outer) iteration. We defined the misfitas: misfit = 12a2 || MaskDFull Dobs ||F + 12 || P0H1(m)QDFull ||F . Please beaware that the graphs are converging to some constant value as the number of iterationsincrease. This is not the same as saying the graphs has reached a constant value. . . . . 36Figure 5.11 Recovered model from 75% missing entries data. . . . . . . . . . . . . . . . . . . . . . 37Figure 5.12 Recovered Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37Figure 5.13 The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F . . . . . 38Figure 5.14 Recovered model from 85% missing entries data. . . . . . . . . . . . . . . . . . . . . . 39Figure 5.15 Recovered Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39Figure 5.16 The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F . . . . . 40Figure 5.17 Recovered model from 90% missing entries data. . . . . . . . . . . . . . . . . . . . . . 41Figure 5.18 Recovered Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Figure 5.19 The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F . . . . . 42Figure 5.20 We define%Data difference= DtrueDDetermisiticDtrue ⇤100% and%Model Difference= mtruemDeterministicmDeterministic ⇤100%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43Figure 5.21 Marmusi Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44ixAcknowledgmentsFirst and foremost, I would like to thank my advisor Dr Eldad Haber for providing me the opportunity tojoin his research group, and for giving me such a fun problem to work on. Given how much math terrifiedand hurt my brain in elementary school, I don’t think anyone including myself would have believed thatI would go onto university to pursue a higher degree in mathematics. This degree would not have beenpossible without the support of many great great people throughout my academic career. Special thanks tomy high school physics teacher Gordon Wiegele and math teacher Michael Partington.Thanks to my officemates Luz, Patrick, Daniel, Sanna, and Wenke for their daily warm morning wel-come and for creating such a pleasant work environment. Many thanks to Rajiv, Eran, Gary, and Uri forhelpful discussions on my work.Last but not least, I would like to thank my wonderful family and friends outside of UBC for theirendless encouragement and support.xChapter 1IntroductionPhysical systems are often represented by mathematical models. The mathematical models contains a setof model parameters (or model) that characterize the systems and their behavior. The model parametersusually cannot be directly measure but they can be inferred by solving inverse problems. Large scale inverseproblems involving PDEs arise in many applications of science and engineering and require a large numberof measurements in order to obtain a reasonable reconstructions of the model. For example, seismic data areoften acquired by oil and gas industries in attempt to get subsurface information of the Earth by inferring thevelocity model. In marine acquisition, seismic data are obtained by a ship towing a compressed air gun (i.e.a seismic source) and a stream of receivers. The air gun is usually fired at a regular interval, and producesacoustic wave that propagate deep into the ocean floor. Seismic data is collected when part of the wave getsreflected from beneath the sea floor and back to the surface where it gets recorded by the receivers. Fullwave form inversion is performed on the data to obtain the velocity model (i.e. the model). The modelcan then be used to provide subsurface image of the Earth. Seismic acquisitions usually involve manyshots (or experiments) in order to get a credible reconstruction image of the Earth’s subsurface. Each shotcorresponds to one PDE solve. Solving PDEs can be computationally intensive. The computational cost andmemory increase linearly with the number of experiments. Thus, the computational cost for solving largescale inverse problems is high.Indeed, an efficient algorithm would be one that can reduce the number of PDE solved and can stillgive a credible reconstructions of the model at a lower cost [19]. The use of the Simultaneous Source (SS)method have demonstrated some success in reducing the number of PDE solve (see [19][14] [13]). Themethod is a type of dimensionality reduction method that relies on random sampling and the stochastic traceestimator to reduce the number of PDE solves to a more manageable size.One of the limitations of this method is that it can only be applied to data set with no missing entries.When a data set has missing entries, the simultaneous source term is lost in the mathematical formulationof the problem. Unfortunately, data with missing measurements are common in practice. In the exampleprovided above, seismic data with missing entries can happen when experiments are not allowed to beperformed at certain regions during the acquisition due to environmental reasons. As a result, the data willhave some missing measurements which corresponds to regions where data could not be collected.One way to overcome this limitation is to fill in those missing entries. Various approaches have been1proposed (see [13], [21], [20], [16]). Nuclear norm minimization technique is one of the most cited and wellstudied data completion technique to date. One of the reason is because it has demonstrated great success inrecovering full data set from a small subset of the data. In this thesis,, we extend the SS method to data withmissing entries by filling those missing entries using nuclear norm minimization technique. We propose anew coupled optimization problem that allows us to apply the SS method to data with missing entries. Thecoupled problem can be broken down into two optimization subproblems. The first subproblem fills the databy minimizing over the data set. The second subproblem uses the simultaneous source method to reducethe PDE solves to a more manageable size. We tests our method in the context of full wave form inversionproblem in which we are able to reduce 60 PDEs solves to 1 PDE for data up to 75% missing entries.The outline of this thesis is as follow. In chapter 2, we review the mathematical framework for inverseproblems with PDE constraints. In chapter 3, we introduce the SS method. In chapter 4, we propose a newalgorithm that extends the SS method to data with missing entries. In chapter 5, we test our method in thecontext of full-wave form inversion and show our results.2Chapter 2Inverse ProblemsInverse problems arise in many applications of science and engineering. People in this area of disciplineare likely to have encountered and solved inverse problems at least once during their practice. For example,curve fitting is a popular inverse problem. Given a set of data point and a curve, a set of coefficients (i.e.the model) that gives the best fit is computed by solving an inverse problem. This chapter is organized asfollow. In section 2.1, we review some important terminology in inverse theory. In section 2.2, we introduceinverse problems with PDE constraints.2.1 Inverse Problem Definitions and TerminologyConsider the following mathematical model:F(m) = d (2.1)where F represent the forward operator, which often takes the form of one or more equations that relate themodel to the data. For example, a forward operator can be a wave equation that relates the measured traveltime (i.e. data) to the wave velocity (i.e. the model). The problem in (2.1) is a forward problem if we aregiven a model m and are asked to predict the data d. On the other hand, (2.1) is an inverse problem if weare given d and are asked to compute the model m. Unlike in a forward problem, the solution of an inverseproblem is not unique (i.e. there are more than onem that gives the same d).It might be worth pointing out at this point that there are some inconsistency in terminology betweenmathematicians and scientists [3]. Mathematicians usually refer to F(m)= d as the mathematical model (orjust the model) and to m as the parameter. On the other hand, scientists often refer to F as the forwardoperator and to m as the model. We will be adopting the scientist terminology from this point forward.A model m2 Rlm consists of model parameters. These model parameters can be parameters (such asseismic velocity, density, resistivity) that characterize the system, or they can be coefficients such as incurve fitting. Since measured data d always comes in discrete form in practice, the numerical recoveredmodel will also be discrete.32.2 Approach for Solving Large Scale Inverse Problems with PDEConstraintsBy large scale or large dimensional problem, we mean a large data set or large number of experiments thatresulted in a large data set. The goal of solving an inverse problem is to recover the model of the data. This isoften accomplished by solving an optimization problem where we find a model parameterm that minimizesthe least square misfit between the observed and predicted data within some tolerance. i.e.argminmNÂi=1|| Fi(m)di ||22 (2.2)where Fi represents a forward operator. The chosen norm in (2.2) is problem dependent. For most inverseproblems, l2-norm is a popular choice. The reason is due to maximum likelihood estimation (MLE), whichsays the model m that maximizes the likelihood of getting the data set {di}Ni=1 as an outcome is the onethat solves the least square problem. Throughout this entire thesis, we will be restricting ourselves to thefollowing PDE-constrained inverse problem:argminu1,...,uN,m12NNÂi=1|| Piuidi ||22subject to ci(ui,m) :=H(m)uiqi = 0, i= 1, . . . ,N.(2.3)where m 2 Rlm represents the discretized model, qi 2 Rlq represents the ith source that emitted the fieldui 2 Rlu , Pi is the matrix that maps the discretized field ui to the location of where the data di 2 Rl wascollected, H is a squared matrix discretizing PDE plus boundary conditions. We assume it is possible tocompute the field uj given m. With this assumption, (2.3) can be written in the form of (2.2) as follows:argminmNÂi=1|| PiH1(m)qidi ||2 (2.4)where Fi(m) = PiH1(m)qi is called the forward problem. Since inverse problem is often ill-posed, we adda regularizational function R(m) to (2.4):argminmNÂi=1|| diPiH1(m)qi ||2 +aR(m), (2.5)If P= Pi 8 i, then (2.5) can be expressed in Frobenius norm:argminm|| PH1(m)QD ||2F +aR(m), (2.6)where D2 Rl⇥N , Q2 Rl⇥N . a is a regularization parameter and it can be determined using either crossvalidation method, or by trial and error. Throughout this entire thesis, we will use the word source andexperiment interchangeably.42.2.1 Iterative MethodsSuppose we want to find an m that minimizes the misfit f(m) =|| DF(m) ||2F in (2.6), where F(m) =PH1(m)Q. Given the sensitivity matrix,J(m) = dF(m)dmwe compute the gradientg= —f(m) = JT(Fi(m)D),and approximate the (Gauss Newton) Hessian —2f(m):—2f(m) = JTJ+(F(m)D)dJT(m)dm⇡ S= JTJ.ConsiderSk dmk =gk (2.7)andmk+1 =mk+akdmk, (2.8)where dmk is a search direction, and gk = —f(mk). ak in (2.8) represents a step size and can be computedusing Armijo backtracking line search. Since we want to minimize the misfit, we require dmk to be a descentdirection (i.e. dmkgk < 0). This happens when S1k is symmetric positive definite (SPD). When S1k is SPD,gTkdmk = gTkS1k gk < 0. Different S1k corresponds to different iterative solver. Gauss Newton iterativemethod is the method of our choice. This is because Sk = JTJ is semi-definite and so the search direction isguaranteed to be a descent direction at every iteration .Computing the search direction dk requires solving the gauss newton linear system in (2.7). Performingexplicit matrix-vector product (i.e. (JTJ)gk) at every iteration is numerically expensive and not practical forlarge scale problems. Instead, we want to perform the product implicitly in order to reduce the computationalcost and storage. Since Sk is SPD in (2.7), the method of choice is the preconditioned conjugate gradient(PCG) method. In this method, we do not compute the sensitivity matrix J explicitly. Rather we workwith only matrix-vector products z= Jgk and its adjoint JTz. Detail and theoretical discussion of conjugategradient method can be found in [18][24]. We used the Laplacian as our preconditioner.The Gauss Newton method mentioned above is summarized in the following algorithm [12]:If NCG denotes the number of outer Gauss Newton iterations, NPCG the average number of PCG itera-tions, and there are N sources, then the work estimate for the algorithm above is:work estimate= (2NPCG+2)⇥NCG⇥Nsources5Algorithm 1 Gauss Newton1. Initialize m02. while not converge3. Compute f(m) and —fm(m) (this requires solving 2 forward problems)4. Solve the system in (2.7) using PCG. (This costs 2NPCG solutions of the forward problem becausewe have to compute J and JT)5. Setm m+akdmk6. end whileIndeed, the greater the number of sources, the greater the work estimate. This demonstrates the value ofusing the SS method. If the SS method can be applied to reduce Nsources to K sources (where K <<Nsources),then the work estimate would greatly be reduced.6Chapter 3Dimensionality Reduction throughSimultaneous Source MethodSolving an inverse problem of the form (2.2) for an enormous data set is not practical because it requires alarge number of PDE solved, which is numerically expensive to perform. The demand for a credible recoveryof a model from a large data set has driven the need to develop methods to reduce the computational cost. Ifthe following equivalent assumptions hold:(A1) All experiments are measured at the same locations (i.e. Pi = P 8 i).(A2) The recorded data D has no missing measurements (i.e. the matrix D has no missing entries).then the SS method can be applied to problem (2.6) to reduce the large data set into a more manageablesize. The advantage of this approach is fewer PDE solved and a credible reconstruction of the model at alower cost. The SS method always involve either random sampling or deterministic sampling [19]. If weuse random sampling, then problem (2.2) will turn into a stochastic programming problem. On the otherhand, we can avoid stochastic problem by using deterministic sampling. Both of these sampling methodsare presented in this chapter.3.1 Simultaneous Source with Random SamplingIn this section, we demonstrate how the SS method with random sampling can be used to reduce the numberof PDE solved and approximate the misfit in (2.2). To begin, consider the following deterministic misfit:NÂi=112|| PiH(m)1qidi ||2 (3.1)If Pi = P 8 i, then (3.1) can be expressed in Frobenius norm as follow:f(m) := 12|| PH1(m)QD ||2F (3.2)whereQ= [q1, ...qi, ...,qN], D= [d1, ...,di, ...,dN] and qi,di 2Rl . The key step to stochastic approximationof the misfit is the following stochastic trace estimator:7Tr(BTB) = || B ||2F = E(wTBTBw) = E(|| Bw ||2) (3.3)where tr(BTB) denotes the trace of matrixBTB, E denotes the expectation, and I2RN⇥N denotes the identitymatrix, w 2 RN . The components w1, ...,wN of the random vector w has to be identically and independentlydrawn from any distribution with zero mean and unit covariance so that:E(wwT) = I. (3.4)Please refer to Appendix A for justification of why these conditions are necessary for (3.3) to be true. LetB= PH1(m)QD; then solving the deterministic misfit in (3.1) is equivalent to solving the followingexpectation:E(|| Bw ||2) = E(|| (PH(m)1QD)w ||2) (3.5)Observe in (3.5) that:PH(m)1(Qw) = PH(m)1(NÂi=1wiqi) (3.6)ÂNi=1wiqi is called the simultaneous source term. Because of this term, instead of having to compute NPDEs like we had to in (3.1), we now only have to perform 1 PDE solved. Moreover, by the virtual ofthe identity in (3.3), solving the deterministic optimization problem in (2.5) is equivalent to solving thefollowing stochastic programming problem:argminmE(|| (PH(m)1QD)w ||2)+aR(m). (3.7)Since it is computationally intractable to precisely compute the expectation in (3.7), we approximate it usingMonte Carlo sampling method:f(m) = E(wTBTBw)⇡ fˆK(m) := 1KKÂj=1|| Bwj ||22=1KKÂj=1|| PH(m)1(Qwj)Dwj ||2 (3.8)where wj 2 RN is any random vector that satisfies (3.4). We emphasize that fˆK(m) is a sample average thatapproximates the expectation in (3.5). So the solution to (3.7) can be approximated by the solution to thefollowing problem:argminm12KKÂj=1(PH(m)1(Qwj)Dwj)2+aR(m) (3.9)Notice instead of having to solve N PDEs like we had to in (2.5), we now only have to compute K PDEs.Indeed, unless K << N we are no better off than solving the original problem. The problem in (4.9) isa stochastic average approximation (SAA) problem. The key advantage of solving SAA problem is thatit can be solved using any descent method. We could have solved the stochastic programming problemusing stochastic approximation (SA) technique instead of SAA. In SA method, a single realization wj isused to compute the gradient —mJ(m,wj). The negative of the gradient is taken to be the descent direction.8A step is taken in that direction and the mk at every iteration gets averaged with the previous. A moredetail of this algorithm can be found in [14]. The attractive feature of using this method is it only uses onesingle realization at each iteration. This implies that it requires only a single PDE solved at every iteration.Personally, I prefer SAA over SA because it gives more credible reconstruction of the model. This could bedue to the fact that SA method requires more careful parameter tuning than SAA [12].3.1.1 Some Important Properties of Monte Carlo Approximation and SAA MethodMonte Carlo and SAA method has the following properties that will be useful to us later on:(A1) fˆK(m) is unbiased estimator of f(m). i.e. E(fˆK(m)) = f(m). This can easily be seen as follow.Since each random vector wj has the same probability distribution as w, we have [23]f(m) = E(|| Bw ||22) = E(|| Bwj ||22) (3.10)Thus,E(fˆK(m)) = E(1KKÂj=1|| Bwj ||22)=1KKÂj=1E(|| Bwj ||22) used linear and scaler properties of expectation=1KKÂj=1f(m) used (3.10)= f(m)(A2) If w1, ...,wK are some i.i.d random sample drawn from some distribution, then by the Law of LargeNumber (LLN) [22], fˆK(m) converges w.p 1 to f(m) as K ! •. We say fˆK(m) is a consistentestimator of f(m).(A3) For sufficiently large K, fˆK(m) has approximately a normal distribution with mean f(m) and variances =var(f(m,w)). This follows from the Central Limit theorem (CLT).(A4) var(fˆK) = 1Kvar(f(m,w))). In other words, the variance of fˆK(m) decreases asvar(f(m,w))K ; Thus ifvar(f(m,w)) is small, then we may only require a few independent i.d.d samples of {w1, ...,wK} toestimate fˆK(m) satisfactorily [17].3.1.2 Choosing an EstimatorRecall wj in (3.6) are i.i.d random vectors with distribution chosen to satisfy E(wTw) = I. The two mostcommonly used distributions are the Gaussian and the Rademacher distribution. Different distributions willgive rise to different estimators. For example, if the distribution is chosen from a Rademacher distribution,9then the estimator is called the Hutchinson trace estimator. On the other hand, if the distribution is chosenfrom the normal distribution, then the estimator is called the Gaussian trace estimator. It was proven in[15] that the Hutchinson estimator has the least variance. For comparison, the variance for the Hutchinsonestimator [4][15] isvar(wTAz) = 2(|| A ||2F KÂi=1A2ii),and the variance for the Gaussian trace estimator isvar(wTAz) = 2 || A ||2F ,For this reason, we used Rademacher estimator in our numerical experiments. There are other estimatorsthat we could have chosen from; however, they are suboptimal in the sense that their variance is higher [14].3.2 Simultaneous Source Method with Deterministic SamplingThe deterministic approach abandons stochastic optimization and selectswj deterministically [21]. This isaccomplished by following a simple methodology discussed in [21]. LetD= USVT, where S is the diagonalmatrix of singular values in decreasing order, U and V are orthogonal matrices. Since V is orthogonal, wehave VT = V1. So DV= US. For this type of sampling method to work, we need fast decaying singularvalues like the one shown in figure 3.1 below. This will allow us to truncate the first K singular values andstill get a good approximation of the entire data set. Let VK represent the first K columns of V so that weuse only the first K columns of US and QV. The problem in (2.6) can be approximated by:minimizationm12K|| PTH1(m)(QVK)DVK) ||2F +aR(m) (3.11)Notice the condition E(VKVTK) = I does not apply here because there is nothing random about VK.Figure 3.1: Fast decaying singular values10Disadvantage of this method1. This method is restricted to fast decaying singular values.2. Having to compute the SVD of a large matrix at every iteration can be very costly.Advantage of this method: No sampling error because we are not drawing any random samples.11Chapter 4The SS Method for Data with MissingEntriesIt is common to have observed data with some missing entries in practice. To name a few, consider thefollowing applications:• Seismic acquisition: Seismic acquisition is expensive. As a result, experiment is performed and datais collected only around small region. Extrapolation is usually performed on these small data set toinfer the larger region. The region that we are extrapolating can be seen in the data matrix as missingentries that we are trying to recover.• Marine acquisition with broken receivers: In marine acquisition, a ship fires an air gun at a regularinterval. The shot generates acoustic waves that propagate into the ocean floor and some waves getreflected back to the water surface where they are recorded by a stream of receivers. If some of thereceivers die during the acquisition, then some measurements will not get recorded. This missingmeasurements can be viewed as data with missing entries. Interpolation is usually performed torecover these missing entries.• Netflix problem [10]: The nextflix problem aims to predict a very large ratings matrix X, where theentries Xi, j represents the rating that the ith user gave the jth movie. Since not every customer haswatched all movies from Netflix, only a small fraction of this matrix is known. However, if one makesthe assumption that most customers rate the movies based on a few parameters (i.e. actor/actress, year,genre) of the movie, then the matrix should be approximately low rank. Thus, the missing entries inthe matrix can be recovered by using matrix completion method.The SS method discussed in the previous chapters rely on the assumptions that the observed data hasno missing entries. When the data has missing entries, we lose the simultaneous source term. To illustrate,suppose matrix D have missing entries. This can be mathematically expressed by multiplying the termPTH1(m)Q in (2.7) by a mask matrixM, where12Mi j =(1, if (i, j) 2W0, otherwise(4.1)and W is a sampled subset of entries Di j from the complete set of entries of D. Our optimization problemtakes the form:argminm12|| M (PTH1(m)Q)wjDwj ||2F +R(m) (4.2)W is a subset of the complete set of entries in D. Notice we no longer have the simultaneous source termin (4.2) because we can no longer take the matrix vector product Qwj. Instead, we have to computeM (PTH1(m)Q) first before we can multiply it with wj. ComputingM (PTH1(m)Q) will be compu-tationally expensive for large N. Our contribution will be to extend the method of simultaneous sources todata with missing entries. In Section 4.1, we discuss matrix completion method. In Section 4.2, we proposea new coupled optimization problem that allows us to apply the SS method to data with missing entries. InSection 4.2 we perform an error estimation of the approximated misfit by computing the difference betweenthe approximated misfit and the deterministic misfit.4.1 Data Completion through Rank Minimization MethodRecall we could not use the stochastic program in (4.2) because of the mask function, which arises whenwe have measured data with missing entries. There are a few approaches that we can take to fill in thosemissing entries. We could try classic interpolation methods such as polynomial or spline interpolation. Oneof the drawbacks with this approach is the method usually gives a bad recovery of the true data when thedata matrix has over 50% missing entries. Conversely, low rank matrix completion allows us to recover thefull matrix from a small subset of its entries provided the matrix is low rank. Low-rank matrix completionmethod has the following attractive features:• In theory, it allows us to use a small subset of the data to precisely recover all of those missing entriesto a very high accuracy.• It involves minimizing over a convex function, which implies there is one global minimum solution.• It is one of the most highly cited method for data completion methods because it can be applied tomany interdisciplinary field.Given the popularity and attractive feature of this method, we decided to try to fill our missing data entriesusing low-rank matrix completion technique.4.1.1 Low-Rank Matrix CompletionRecovering a data matrix with missing entries to high accuracy is generally impossible; however, if theunknown matrix exhibits low rank structure (i.e. has few non zeros or rapidly decaying singular values) andis incoherent, then an accurate recovery of the data is possible by matrix completion [7]. Recovering low13rank matrices from a subset of its entries is known as a matrix completion problem [6]. The method requiresthat 1) the matrix is low rank, and 2) The matrix is incoherent (i.e. not ”spiky”) [9] [8]. We now discusswhat these two requirements mean and why they are necessary assumptions in the method.4.1.2 Low Rank MatricesLet A be an N1⇥N2 matrix with rank r. Matrix A is low rank if r << min(N1,N2). When this is the case,then A can be approximated by a small number of r singular vectors:A⇡rÂj=1s jujv0j (4.3)where s1,s2, ...,sr are the singular values, and u1,u2, ...,ur 2 RN1 and v1,v2, ...,vr 2 RN2 are the singularvectors. For example, consider the following 4⇥4 matrix.Matrix completion requires that the matrix is low rank because when the matrix is low rank, one can geta good recovery of the matrix from a small subset of its entries.4.1.3 Low rank minimization and convex relaxationIf D is a low rank incoherent matrix, then it is possible that we can recover D from a small subset of itsentries by solving:minimizeXrank(D)subject to Xi j = Di j, (i, j) 2W.(4.4)where the rank of D equals the rank of matrix X and W is the subset of the complete set of entries of D.One of the major drawback in solving (4.4) it is an NP hard problem. Moreover, all existing algorithms thatcan provide exact solutions are time doubly exponential in the dimension of the matrix both theory and inpractice [8][7]. Thus, a popular alternative to (4.4) is the convex relaxation [8][7]:minimizeX|| D ||⇤subject to Xi j = Di j, (i, j) 2W.(4.5)While the rank function counts the number of nonvanishing singular values, the nuclear norm sums theiramplitude (i.e. || D ||⇤= Ânj=1s j(D)) [8]. One interpretation made by Candes and Recht’s of these results isthat the rank minimization problem in (4.4) and the convex problem in (4.5) are equivalent in the sense thatthey have the exact same unique solution [8][6]. In the paper [8], Candes and Rechts proved that most lowrank incoherent matrices can be recovered exactly from most small sample subset of the the entries of thematrix. They proved that this can be done by solving (4.5) provided that the number of samples m satisfiedm Cn6/5r logn, where C represents some numerical constant that needs to be computed, r represents therank of the matrix, n represents the size of an n⇥n matrix.144.1.4 Incoherent matricesThe low rank assumption alone is not a sufficient condition to perform matrix completion. To illustrate,consider the following n⇥n rank 1 matrix [7] :W=266666664w11 w12 ... w1n1 w1n0 0 0 0 00 0 0 0 0...............0 0 0 0 0377777775W is clearly low rank. Indeed, it would be impossible to recoverW without sampling all of its entries in thefirst row. Even if we were to sample 90% of the entries inW at random, there is still a very high probabilitythat we will miss some of the entries in the first row. In order to recover W, we will have to sample everyentry in the first row. In sum, if a matrix is low rank and has a row (or a column) that has no relationship tothe other rows (or columns) in the sense that it is approximately orthogonal, then we would need to sampleall the entries in that row in order to recover the matrix. Similarly, although Z is low rank, the matrix cannotbe recovered unless we sample all of its entries. Thus, matrix completion would fail because the goal ofmatrix completion is to recover all of the missing entries from a small subset of its entries. In sum, we havejust emphasized through examples that low rank assumption is a necessary but not a sufficient condition formatrix completion. This leads us to the next necessary condition for matrix completion, which is the lowrank matrices has to also be incoherent.Z=266666664z11 0 ... 0 00 0 0 0 00 0 0 0 0...............0 0 0 0 0377777775Z and W are both incoherent matrices. Coherent matrices are matrices that have most of their mass con-centrated in a small number of elements [9]. Sparse matrices are an example of coherent matrices. The twomatricesW, and Z defined above are coherence matrices and indeed, they are sparse. We would miss mostof the mass of the coherent matrices if were to sample uniformly and randomly, just like we would with thetwo matrices discussed above. As a result, any recovery method that permits small subsampling would fail.Having said that, an incoherent matrices have their mass spread out, which makes the elements more likelyto get observed or sampled.How to check if a matrix is incoherent? One approach to test whether a matrix is incoherent is to take theSVD of the matrix and look at its singular vectors. If the matrix is incoherent, then the singular vectors of thematrix should be sufficiently spread out. On the other hand, if the matrix is not incoherent, then the singularvectors would be “spiky” [8] [7]. By spiky, we mean most of the mass of the vector are concentrated in a15small number of elements.To illustrate, recall the SVD of a low rank matrix of rank r is given by:A⇡rÂj=1s jujv0j (4.6)where s1,s2, ...,sr are the singular values, and u1,u2, ...,ur 2 RN1 and v1,v2, ...,vr 2 RN2 are the singularvectors. Consider the following low rank, coherent 4⇥4 matrix:Z=2666641 0 0 00 0 0 00 0 0 00 0 0 0377775The singular vectors for Z are:uzj ,vzj 2 {2666641000377775 ,2666640010377775 ,2666640100377775 ,2666640001377775}Notice the mass of the singular vectors are concentrated into one elements (i.e. there are only one non zeroentries in the singular vectors), and this makes the vectors “spiky”. In contrast, consider the following lowrank coherent matrix:C=2666641 1 1 11 1 1 11 1 1 11 1 1 1377775The singular vectors for C are:vcj ,ucj 2 {2666640.50.50.50.5377775 ,2666640.86600.28870.28870.2887377775 ,26666400.81650.40820.4082377775 ,266664000.70710.7071377775}.As expected, the mass in the singular vectors of C are much more spread out than Z.164.2 Coupled Inverse ProblemSuppose the true data Dtrue exhibits low rank structure and is incoherent. Recall we could not use theSS method for data with missing entries. To over come this limitation, our strategy was to recover thosemissing entries by performing matrix completion, which consists of minimizing a nuclear norm. LetDobs=MDtrue, where M is the mask matrix defined above. Let DFull represent the recovered datawith no missing entries. We propose the following coupled optimization problem which allows us to use theSS method on data with missing entries.argminDFull,mG := a1 || DFull ||⇤ +a22 ||MDFullDobs ||2F +12|| PA(m)1(QW)DFullW ||2F +a3R(m),(4.7)where1. a1 is a parameter that controls how much the nuclear norm promotes the low rank structure in DFull.This parameter is necessary because we might have a data matrix that might not be entirely low rank.2. a2 is a parameter that prevents the third term in (4.7) from “over powering” the other terms by at leasta magnitude. We find this happens in our numerical example when the number of columnsW exceedshalf of the number of columns of Q.3. Because the problem is ill-pose, we add a regularizational parameter a3 to the equation.4. The first two term is the matrix completion step. The third term in (4.7) is the coupling term thatcouples m and DFull together.Notice the simultaneous source term QW is introduced back into the equation. To see the beauty of ournew approach, we break the couple problem into two subproblems using Block coordinate descent method(BCDM).4.2.1 Importance of the Coupled term in (4.7)A natural question that one might have at this point is: why is the coupled term ||PA(m)1(QW)DFullW ||2Fin (4.7) necessary? In other words, why not solve the problem by first completing the data set: i.e.D†Full =minimizeDFull|| DFull ||⇤subject to DFull,i j = Dobs,i j, (i, j) 2W.(4.8)and then use D†Full to recover the model by solving (4.9):m† = argminm12KKÂj=1(PH(m)1(Qwj)D†Fullwj)2+aR(m) (4.9)17We will refer to this approach as the First Approach (i.e. Disjoint Problem). A simple answer would bebecause the solutionm⇤ and D⇤Full obtained this way will usually not give us a global minimum for the misfitfunction: y(m,DFull) = a22 ||MDFullDobs ||2F + 12 || PA(m)1QDFull ||2F . To get a rough intuition ofwhat we mean by this, consider the following figure:Figure 4.1: The plane cuts at DFull = D⇤Full along the m axis. Image generated from [1].We will em-phasize that purpose of this figure is to only provide us with some intuition of what might bewrong with the First Approach. The graph is not the the actual graph of y(m,DFull).When we solve (4.8), we get a solution D⇤Full which trace out a plane in the figure. When we sub in this fixedvalue D⇤Full into (4.9), we recover a model m that is not the global minimum of y(m,DFull). To alleviatethis problem, we introduce the couple term in the optimization problem in (4.7).In addition to the above, figure 4.2 below shows the recovered model using Equation 4.7 and the re-covered model using the First Approach. Equation 4.7 gives a better recovery of the model than the firstapproach because the misfit is approaching a global minimum. In contrast, the misfit for the First Approachis converging to a local minimum.18Figure 4.2: Recovered model from data with 50% missing entries. 1PDE solve were used per misfit torecover the models. misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F .4.2.2 Block Coordinate Descent Method (BCDM)We use BCDM to solve (4.7). BCDM splits the coupled minimization problem into two subproblems. Eachsubproblem minimizes w.r.t one of the parameters m or DFull. The goal of subproblem 1 is to recover themissing entries in the data by minimizing over nuclear norm. The approximated DFull has no missing en-tries, and so we can use the SS method discussed in chapter to reconstruct the model of the data using fewernumber of PDE solves in subproblem 2.Subproblem 1 (Block Coordinate update for DFull): m is held fixed and we optimize over DFull.D⇤Full = argminDFullf1 := a1 || DFull ||⇤ +a22 ||MDFullDobs||2F +12|| (PH(m)1QDFull)W ||2F(4.10)Remark: DFull is an an approximation of the the observed data (i.e. Dobs) and it has no missing entries.Subproblem 2 (Block Coordinate update form): DFull is held fixed and we minimize overm.m⇤ = argminmf2 :=12|| (PH(m)1QDFull)W ||2F +a3R(m) (4.11)Since DFull has no missing entries, we can use the SS method in (4.9) to approximate the solution in(4.11).19Algorithm 2 BCUM for coupled optimization problem1. Initialize m and DFull2. while G(m, DFull)< tol3. DefineW 2 Rl⇥K either stochastically or deterministically as discussed in chapter 3.4. Hold DFull fixed and solve m⇤ = argminmf2(m):• while not converge• compute f2(m) and —mf2(m)• solve (JTJ+ gR00(m⇤))dm⇤ =—mf2(m⇤) using PCG method. See algorithm 3.• compute step size s2 using any line search method• m=m+ s2 ⇤dm• test for convergence• end while5. Hold m fixed and solve D⇤Full = argminDFullf1(DFull).• while not converge• compute f1(DFull) and search direction dk =—DFullf1(DFull)• compute step size s1 using any line search method• DFull = DFull+ s1 ⇤dk• test for convergence• end while6. Compute G(m,DFull)7. end whileAlgorithm 3 PCG for Ax = b[12]1. Choose a preconditioning matrixM2. intialize x= 0,r= b,z=M1r,p= z3. while not converge• q= Ap• s1 = rTkzk/pTq• x= x+ s1p• rk+1 = rk s1q• zk+1 =M1rk+1• b = zTk+1(rk+1 rk)/(zTkrk)• zk = zk+1, rk = rk+1• p= zk+bp• test for convergence4. end while20Descent step for m—mf = JT(F(m)D)+ 12—mR(m)Sm = —2mf ⇡ JmTJm(4.12)where f(m) =|| F(m)D ||2F and F(m) = PH1(m)Q.Descent step for DFull—DFullf(DFull) = a1U sign(S)VT+M (MDDobs)+(PH1QW+DW)WT,where [U,S,V] = SVD(DFull)(4.13)Remark Please note that subproblem 2 can be computed using proximal gradient method. Although themethod requires using full PDE solves, we find the method still converges much faster than using gradientdescent method. See [6] for further detail.4.3 Error EstimationIn this section, we address the following question: Without having to compute the full PDEs, how well doesy(m⇤,D⇤Full) approximate y(m†,D†)? Define the following misfit:y(m,DFull) :=a22||MDFullDobs ||2F +12|| PA(m)1QDFull ||2F (4.14)and the approximated misfit:yˆK(m,DFull) :=a22||MDFullDobs ||2F +12|| PA(m)1(QW) (DFullW) ||2F . (4.15)The following methodology was adapted from [23] [14]. For any m and DFull, the following inequalityholdsyˆK(m,DFull) minm,DFullyˆK(m,DFull)E(yˆK(m,DFull)) E( minm,DFullyˆK(m,DFull)) E1minm,DFullE(yˆK(m,DFull)) E( minm,DFullyˆK(m,DFull)) E2minm,DFully(m,DFull) E( minm,DFullyˆK(m,DFull)) E3y(m⇤,D⇤Full) y(m†,DFull†) E4(4.16)In E1, we took the expectation on both sides. In E2, we used the fact that E(yˆK(m,DFull))minE(yˆK(m,DFull).In E3, we made use of yˆK(m,DFull) is an unbiased estimator of y(m,DFull). See 3.1 for further detail. Forfurther detail of how E( minm,DFullyˆK(m,DFull)) = y(m†,DFull†) in E4, refer to [14]. We can quantify how well21y(m⇤,D⇤Full) approximates y(m†,D†true) by computing the following gap:gap= y(m⇤,D⇤Full)y(m†,D†true), (4.17)wherem†,D†Full = argminDFull,~m|| DFull ||⇤ +a22 ||MDFullDobs ||2F +12|| PTA(m)1QDFull ||2F +a3R(m),(4.18)andm⇤,D⇤Full = argminDFull,~m|| DFull ||⇤ +a22 ||MDFullDobs ||2F +12|| PTA(m)1(QW)DFullW ||2F +a3R(m).(4.19)We will emphasize that the solution {m†,D†Full} involves the full PDE solved and {m⇤,D⇤Full} involvesK < N PDE solved. Recall the goal of solving the minimization problem in (4.18) is so that we can obtaina credible model of the observed data. To determine how well this model predicts the observed data, wecompute the misfit in (4.14). A small misfit signifies a good approximation of the predicted data to theobserved data. However, computing the full PDE solved in (4.18) is computationally expensive. Thus, werecover the model by solving (4.19) instead. Now back to the original question: how well does y(m⇤,D⇤Full)approximate y(m†,D†true)? This question can be answered by computing (4.17). However, we want tocompute (4.17) without having to directly compute y . This is because solving y directly requires the fullPDE solved, which we are assuming is too expensive to perform. Instead of computing y directly, we caninstead construct a confidence interval that has some associated probability of capturing y .4.3.1 Confidence IntervalLet w1, ...,wK represent a sample of K vectors and wl[1,K] represent the lth sample of the K vectors. Further-more, let yˆK(m,DFull,wl[1,K]) denote the lth sample of the K vectors used for approximating y(m,DFull).Different sample w[1,K] will give a slightly different solution of m⇤ and DFull⇤ in (4.19). This is called vari-ation due to sampling, or more commonly sampling error. These type of errors always exist when samplingis involved. The sampling error associated with a given sample mean yˆK(m,DFull,wl[1,K]) is given by a con-fidence interval. A confidence interval is an interval that has a pre-specified probability of “capturing” thetrue mean (i.e. E(yˆK(m,DFull))) (stat506). Constructing such an interval requires that we know the sam-pling distribution of the sample mean yˆK(m⇤,D⇤Full,wl[1,K]). Suppose the sampling distribution of the samplemean follows a normal distribution with mean E(yˆK(m⇤,D⇤Full)) and variances2(m⇤,D⇤Full)N . The confidenceinterval for the true mean E(yˆK(m⇤,D⇤Full)) is:yˆK(m⇤,D⇤Full,wl[1,K]) z a2sˆ(m⇤,D⇤Full)pK< E(yˆK(m⇤,D⇤Full))< yˆK(m⇤,D⇤Full)+ z a2sˆ(m⇤,D⇤Full)pK⇤(4.20)22andsˆ2(m⇤,D⇤Full,wl[1,K]) :=1K1KÂj=1(G j(m,DFull;wlj) yˆK(m⇤,D⇤Full;wl[1,K]))2 (4.21)is the sample variance estimate of s2(m⇤,D⇤Full), and G j(m⇤,d⇤j ;wlj) = (PH1(m⇤)(qjwlj)d⇤j wlj). za/2 :=F1(1 a2 ), and F(·) represents the cumulative distributive function (CDF) of the standard normal distri-bution [23]. For example, z0.025 = 1.96 corresponds to a 95% confidence interval. This means that if wewere to construct many intervals corresponding to different yˆK(m⇤,D⇤Full,wl[1,K]), 95% of the intervals thatwe constructed will make an interval that captures the true mean (i.e. E(yˆK(m⇤,D⇤Full))) and 5% of theintervals will not.The assumption of normality in (4.20) is a crucial one because the interval is valid only when the sam-pling distribution of yˆK(m⇤,D⇤Full,wl[1,K]) is normal. Fortunately, the Central Limit theorem (CLT) assertsthat for sufficiently large enough sample size K, the sampling distribution of yˆK(m⇤,D⇤Full,wl[1,K]) will fol-low a normal distribution with mean E(yˆK(m⇤,D⇤Full)) and variance s2(m⇤,D⇤Full,wl[1,K]). Thus, givensufficiently large sample size K, we can always use (4.20) to try to “capture” the true mean with some asso-ciated probability of success.4.3.2 Computing the Gap in (4.17)Solving (4.17) requires computing the determistic misfity(m,DFull). Assume we cannot computey(m,DFull)exactly because it is too expensive to compute all the PDEs. We can compute the ys in (4.17) by construct-ing a confidence interval that captures y(m⇤,D⇤Full) and y(m†,D†Full).Constructing a confidence for y(m⇤,D⇤Full)Recall the confidence interval tries to capture the expected value of the sample mean. Fixm⇤ and D⇤Full. Letyˆ⇤L :=1LLÂl=1yˆK(m⇤,D⇤Full;wl[1,K]). (4.22)We claim that the expected value of the sample mean yˆ⇤L equals y(m⇤,D⇤Full).Proof.E(1LLÂl=1yˆK(m⇤,D⇤Dfull;wl[1,K])) =1LLÂl=1E(yˆK(m⇤,D⇤Dfull;wl[1,K]))=1LLÂl=1y(m⇤,D⇤Full)= y(m⇤,D⇤Full)23Assume the sampling distribution for yˆL follows a normal distribution. Then a 100(1a)% confidenceinterval for y(m⇤,D⇤Full) is given by:yˆ⇤Lza/2sˆ⇤pL y(m⇤,D⇤Full) yˆ⇤L +za/2sˆ⇤pL, (4.23)where(sˆ†)2 = 1L(L1)LÂl=1(yˆK(m⇤,D⇤Full,wl[1,K]) yˆ⇤L)2Constructing a Confidence for y(m†,D†Full)Lety†T =1TTÂj=1yˆK(m⇤j ,D⇤Full,j,wj[1,K]) (4.24)The mean of the sample mean y†T equals y(m†,D†Full). i.e.E( 1TTÂj=1yˆK(m⇤j ,D⇤Dfull,j;wj[1,K])) =1TTÂj=1(E( minm,DFullyˆK(m,DFull,wj[1,K])))=1TTÂl=1y(m†,D†Full) see ref. [14]= y(m†,D†Full)Assume the sampling distribution fory†T is normal. Then a 100(1a)%confidence interval fory(m†,D†Full)is given by:yˆ†T za/2sˆ†pT y(m†,D†Full) yˆ†T +za/2sˆ†pT, (4.25)where(sˆ⇤)2 = 1T (T 1)TÂl=1(yˆK(m⇤,D⇤Full,wl[1,K]) yˆ†T)2Using (4.23) and (4.25), we can compute the gap in (4.17) to obtain an estimate on the quality of ourapproximate misfit compared to the true (i.e. deterministic) misfit.Choosing a Sample SizeRecall that the normal confidence interval discussed above requires that the sampling distribution of thesample mean is normal. The CLT says that there will always exists a sufficiently large sample size that willgive a normal sampling distribution of the sample mean. Thus, we choose a sample size that gives us a24normal sampling distribution. The advantage of a large sample size is it gives a lower variance. As a result,we get a shorter confidence interval which signifies more certainty in our estimate. On the other hand, thedisadvantage of having a large sample size it generally requires more computational time. Figure belowillustrates the relationship between sample size L and the CI for yˆ(m⇤,D⇤Full).Figure 4.3: D⇤Full represents the recovered data for 50% missing entries, and m⇤ was computed using1PDE solved at every iteration (see section 5.4.1 for complete results). The confidence intervalshave some associated probability of capturing the true mean y(m⇤,D⇤Full) (represented in bluestar in the graph). The intervals in the plot all successfully captured the true mean, except forthe 68% confidence interval at L= 5e5. This is no surprise if we that a 68% confidence intervalmeans that there is a 68% chance the interval contains the true mean, and 32% chance that it willnot. Notice that as the sample size increases, the confidence interval decreases as expected.How to tell if sample size chosen is too smallIf the sample size chosen too small, then the sampling distribution of the sample mean will not be normal.To check if the distribution is normal, we can make a histogram or QQ plot. If the QQ plot of a set of samplemean is approximately linear, then the distribution of the sample mean is normally distributed. See figure4.4 and figure 4.5. On the other hand, if L = 200, then we get a normal sampling distribution as shown inthe figure below.25Figure 4.4: The histogram on the right hand side shows a sampling distribution for yˆL(m⇤,D⇤Full). Thehistogram consists of 5000 sample points. Sample size L=200. The QQ plot is approximatelylinear which indicates the sampling distribution is approximately normal.Figure 4.5: The histogram on the right hand side shows a sampling distribution for yˆ(m⇤,D⇤Full). Thedistribution consists of 5000 sample points. The sample size is L=5. The QQ plot is not linear.This suggests that the sampling distribution is not normal. This is re-confirmed by the histogram,which is slightly skewed. Thus, the chosen sample size is too small26Chapter 5Full Wave-Form InversionSeismic waves are waves of energy caused by Earth’s vibration. The vibration is often caused by the suddenrelease of energy from below the ground. They are commonly generated by Earthquakes and subsurfaceexplosions. These waves travel through the Earth and bring to the surface information about the undergroundproperties of the planet that cannot be directly measured. For this reasons, seismic data are collected andused in full waveform inversion (FWI) to predict the velocity model of the data. This velocity model canbe used to get an image of the subsurface of the earth. FWI is essentially a PDE-constrained optimizationproblem where by making an educated guess of the initial subsurface parameters (i.e. the model m), wecan obtain the predicted model of the data m by minimizing the least square misfit between the predicteddata and the observed seismic data [5]. There are two fundamental approaches to FWI, time domain andfrequency domain. We will be working in the frequency domain. The outline of this chapter is as follow. InSection 5.1 we present the experimental setup of our numerical example. In Section 5.3 we justify our lowrank data assumption by illustrating how measurements can result in low rank data in practice. In Section5.4 we present our results and discussion.5.1 Experimental Setting and Numerical SetupConsider an offshore seismic acquisition where a ship tows a compressed air gun (i.e. a seismic source) anda stream of hydrophones (i.e. receivers). The air gun fires at a regular interval, say at every 20 km. The airgun uses compressed air to produce sound wave that propagates deep into the ocean floor. When the wavehits the rock layers beneath the sea floor, some gets reflected back to the water surface where it gets recordedby the hydrophones (see fig 5.1).Without loss of generality, assume the wave is time invariant and work with the following Helmholtzequation:Du+w2(m (1+ ig)u= q u 2W+Neumann boundary conditions:—u= 0(5.1)where u represents the wave field emitted by source q and g 0 is an attenuation parameter that preventsreflections from the boundary through artificial absorbing layer (called perfectly matched layer). The ab-27Figure 5.1: Marine acquisition [2]sorbing layer dissipates any wave that reflects from it; Thus, the Neumann boundary condition can be set tozero [11].5.1.1 DiscretizationThe discretized form of (5.1) is:Hu= (L +w2diag(m (1+ i~g)))u= q (5.2)whereL = Iy⌦Dx+Dy⌦ Ix is the 2D discretized laplacian using finite difference method.Dx = Dy =1h2266666666641 11 2 11 2 1. . . . . . . . .1 2 11 137777777775,Ix and Iy are identity matrices, and u is the solution to the PDE. The matrix P projects the the wavefield uon to the location of where the data d 2 Rl was collected. Thus, the predicted data is given by:d= PH1(m)qNotice that one source q requires solving one PDE in (5.1). Thus, the more sources there are, the morePDE equations will have to solved. In practice, there are s >> 1 sources. Assume all measurements are28performed at the same location (i.e. Pi = P 8 i), then the forward problem that gives the predicted data for ssources is:D= PH1(m)Q (5.3)where Q= [q1, ...,qs] 2 Rlq⇥s, and D= [d1, ...,ds] 2 Rl⇥s. We will be looking at two models: 1) the boxedmodel and 2) the Marmousi model. To get the data DTrue, we plug the model into (5.3). To generatea synthetic observed data, we compute Dobs =MDTrue, where M is the masking matrix that randomlyremoves the columns of the true data. The missing columns represents broken sources or receivers duringacquisitions.We recover the model from the data with missing columns by solving the problem in (4.7).Results can be found in section 5.4.5.2 Tuning the parametersThe following are the general approach we used to tune our parameters.Choosing grid size hFor our numerical example, we let 1 wavelength span 20 grid points. This corresonds to h⇥h= 0.05km⇥0.05km.Choosing the wavefield velocityFor the boxed model, we chose the initial velocity to be the surface velocity, which is 1e-4 m/s.Setting the angular frequency w in the Helmholtz equationGiven that we know the velocity and wavelength of the wave field, we can compute its frequency:f =vl=v10⇤hand subsequently its angular frequency :w = 2p fAttenuation function g = g(A,q)The attenuation parameter in (5.1) is defined as: g(A,q) = A(eqX1 + eqX2 + eqX3) where X1, X2 and X3represents the left, right, and bottom boundary. The amplitude A represents the intensity of the absorptionat the boundary, and q is inversely proportional to the thickness of the absorbing boundary. In other words,the greater the q , the thinner the boundary. Conversely, the smaller the q , the thinner the boundary. Wewant to choose q s.t. the thickness of the absorbing boundary approximately equals l so that the wave getsabsorbed. If A= 0, then g = 0. In this case, nothing gets absorbed and we get some reflections from theboundary.295.3 Low Rank DataRecall that in order to perform matrix completion on our data, the data has to be either low rank or approxi-mately low rank. A low rank matrix can be loosely defined as when many of its rows are linearly dependentof each other. Let the rows of the observed matrix Dobs represent the sources, and the columns represent thereceivers. The entries of the data di j can then be viewed as the measurement made at the jth receiver dueto the ith source. Two rows would be approximately linearly dependent if the receivers make very similarmeasurements from two different experiments (i.e. shots). To illustrate how this can happen in practice,consider figure 5.2 where we assume all experiments are measured by the same set of receivers:Figure 5.2: figure 1a: The air gun (represented by the yellow star) is fired at every 20 km as it ismoving towards the left of the page. Reflected wave gets recorded by a stream of receivers (reddots)In figure 2, the ship fires the air gun (i.e. the source) at every 20 km. i.e. q1 represents experiment performedat 20 km, q2 represents the second experiment performed at 40 km etc. Observe that at 60 km, 80 km and 100km, the receivers should record very similar wave field because the geological features beneath the sea floorin which the wave is reflected from are very similar. Thus, we would expect the data d5,d4,d3 collected fromexperiment q5,q4 and q3 respectively to be very similar, and data d1,d2 to be different because the geologicalfeatures beneath are different. When we put these data in to a matrix i.e. Dobs = [d1,d2,d3,d4,d5,d6], thematrix should be approximately low rank because most the rows are approximately linearly dependent.To verify whether Dtrue has low rank structure, we plot the singular values of Dtrue. Although thesingular values are decaying, it does not resemble the L shape (see figure ??) that a true low rank matrixwould exhibit. Thus, we say our data is approximately low rank. Because our data is low rank, there isno theorem that would guarantee that we can get an exact recovery of the matrix from a few sample of itsentries. However, we may still get a satisfying recovery. One way to know for sure if matrix completionmethod will give us a reasonable recovery from subset the entries of DTrue is to simply try it out.30Figure 5.3: The singular values for Dtrue.5.4 Numerical ResultsWe now present our numerical results for a) the boxed model, and b) the marmousi model.5.4.1 Boxed Model ResultsThe true boxed model and data that we will recover are the following:31Figure 5.4: True box model and data.In what follows, we will recover the model and data by solving the coupled inverse problem that wepropose in previous chapter. We will examine data with 25%, 50% 75%, 85% and 90% missing data entriesby uniformly and randomly removing the columns of the data (which represents missing sources). Thepurpose of doing so is to: 1) determine at what % missing entries the method break down, and 2) get an ideaof how well the model obtained from SAA method approximates the model from deterministic method.32Recovered Model and Data from 25% missing entries DataFigure 5.5: Recovered model from data with 25% missing entries.Figure 5.6: DDeterministic in the above title represents the recovered data by solving the full 60 PDEsfor one misfit. Similarly, D1PDE represents 1 PDE solved for one misfit.33Figure 5.7: The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F . Please be awarethat the graphs are converging to zero as the number of iterations increase. This is not the sameas saying the misfit has reached zero.Table 5.1: Summary of results for 25% entries missing dataNumber of PDE solved per misfit # of iterations % diff =| m60PDEmapproxm60PDE |⇥100% % diff =|m60PDEDapproxD60PDE|⇥100%Deterministic: Full 60 PDE solved 60 —— ——SAA: 1 PDE solved 24 1.11 0.82934Recovered Model and Data from 50% missing entries Data -Still need to upload new datafor 50 percent missing DAta entries. 1PDE solved still running on clusterFigure 5.8: Recovered model from data with 50% missing entries.Figure 5.9: D60PDE in the above title represents the recovered data by solving the full 60 PDEs for onemisfit. Similarly, D1PDE represents the recovered data by solving 1 PDE solved for one misfit.The same definition applies to D10PDE.35Figure 5.10: The misfit in the plot was computed in the main (outer) iteration. We defined the misfitas: misfit = 12a2 || MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F . Please be aware thatthe graphs are converging to some constant value as the number of iterations increase. This isnot the same as saying the graphs has reached a constant value.Table 5.2: Summary of results for 50% entries missing dataNumber of PDE solved per misfit # of iterations % diff =| m60PDEmapproxm60PDE |⇥100% % diff =|D60PDEDapproxD60PDE|⇥100%Deterministic: Full 60 PDE solved 22 —— ——SAA: 1 PDE solved 38 1.36 1.47SAA: 10 PDE solved 22 1.11 1.1636Recovered Model and Data from 75% missing entries DataFigure 5.11: Recovered model from 75% missing entries data. .Figure 5.12: Recovered Data37Figure 5.13: The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F .Table 5.3: Summary of results for 75% entries missing dataNumber of PDE solved per misfit # of iterations % diff =| m60PDEmapproxm60PDE |⇥100% % diff =|D60PDEDapproxD60PDE|⇥100%Deterministic: Full 60 PDE solved 60 —— ——SAA: 20 PDE solved 130 1.6 1.71SAA: 50 PDE solved 84 0.709 0.94538Recovered Model and Data from 85% missing entries DataFigure 5.14: Recovered model from 85% missing entries data. .Figure 5.15: Recovered Data39Figure 5.16: The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F .Table 5.4: Summary of results for 85% entries missing dataNumber of PDE solved per misfit # of iterations % diff =| m60PDEmapproxm60PDE |⇥100% % diff =|D60PDEDapproxD60PDE|⇥100%Deterministic: Full 60 PDE solved 35 —— ——SAA: 10 PDE solved 61 6.68 8.1540Recovered Model and Data from 90% missing entries DataFigure 5.17: Recovered model from 90% missing entries data. .Figure 5.18: Recovered Data41Figure 5.19: The misfit in the plot was computed in the main (i.e. outer) iteration. We defined themisfit as: misfit = 12a2 ||MaskDFullDobs ||F + 12 || P0H1(m)QDFull ||F .Table 5.5: Summary of results for 90% entries missing dataNumber of PDE solved per misfit # of iterations % diff =| m60PDEmapproxm60PDE |⇥100% % diff =|D60PDEDapproxD60PDE|⇥100%Deterministic: Full 60 PDE solved 67 —— ——SAA: 60 PDE solved 91 1.1 6.26The two plots below tries to demonstrate how the quality of the recovered data might affect the qualityof the recovered model. The plot shows that when the % data difference is high (i.e. the recovered data ismight not be a very good approximation of the true data), the % model difference is also high. For example,the recovery of the data is poor when we try to recover 90% of the missing data. As a result, we also get apoor recovery of the actual model.42Figure 5.20: We define %Data difference = DtrueDDetermisiticDtrue ⇤ 100% and %Model Difference =mtruemDeterministicmDeterministic ⇤100%.DiscussionRecall that one of the goals of using the SS method is so that we can recover a credible model solving fewerPDEs. Tables above show that the recovered models by performing1 PDE solved per misfit gives a resultsvery similar to performing the full PDE solved. More specifically, the difference between the model for1PDE solve and full PDE solved is less than 2%. A reasonable recovery of the model using deterministicand SAA method good upto 75% missing entries. We begin to get a poor recovery of the data when we tryto recover from 85% of missing data. As a result, we also get a poor recovery of the model and vice versa.435.4.2 Marmousi ModelFigure 5.21: Marmusi ModelTable 5.6: Marmusi Model: 30% missing dataMethod # of iterations % difference =| m60PDEmapproxm60PDE |⇥100%Full PDE solved (60 PDEs) per misfit 176 01 PDE solved per misfit 250 2.1125 PDE solved per misfit 392 2.16TSVD 40 54.6Discussion Table 5.1 shows the percent error using 1 PDE solved is similar to 25 PDE solved. Thisseems to suggest that it is suffice to approximate the true misfit (i.e. the misfit obtained from performing fullPDE solved) using 1 PDE solved. In other words, using more than 1 PDE solved is not going improve the44approximation of deterministic model by much. This observation can also be seen in figure 5.21 where themodels recovered from performing 1 PDE solved and 25 PDE solved look nearly indistinguishable from thefull PDE solved.If we had more time, we would increase our grid size by a factor of 3, and use multiple frequencies inour numerical approximation to get a better approximate of the true model. Using multiple frequencies willallow us to capture some of the geological features in the model that are otherwise missed when using justone frequency. Due to time constrained, this task is left as our future work.TSVD gave the worse approximation. We initially suspect it might have to do with how we chose ourregularization parameters a1, a2, and a3. After much fiddling with these parameters, we believe the TSVDmethod does not work for our particular set of data because the singular values are not decaying fast enough(see section 3.2).5.5 Future Work1. Computing the star norm ||DFull ||⇤ in (4.7) is expensive for large scale data because it requires solvingthe SVD of DFull. The computational costs of computing the SVD of an N⇥N matrix is O(N3). Wecan approximate the nuclear norm and avoid computing the SVDs by instead solving the followingnuclear norm factorization problem [16] [25] :|| DFull ||⇤ =minimizeL,R12(|| L ||2F + || R ||2F)subject to X= LRT(5.4)Computing || L ||2F and || R ||2F should be significantly cheaper than computing the SVD of a largematrix. Thus, we can avoid solving the SVD of the nuclear norm in (4.7) by solving the followingproblem:minimizem,L,R,DFulla12(|| L ||2F + || R ||2F)+a42|| DFullLRT ||2F +a22||MDFullDobs ||2F+12|| (PH(m)1QDFull)W ||2F +a3R(m)Future work will consists of comparing the computational run time and the quality of the recoveredmodel using this approach with the coupled problem in (4.7).2. Recall matrix completion is possible under the condition that the data is low rank (or approximatelylow rank). Having said that, the approach in (4.7) would fail if the data was full rank. One possiblesolution would be to find a transformation that maps the matrix into a different domain that promoteslow rank structure, and recover the missing entries in the new domain. For FWI problem, Kumar etal [? ] used the midpoint-offset transformations to map the data in the source receiver domain toa midpoint-offset domain. This type of transformation is equivalent to a 45deg rotation of the datawhere the diagonal entries in the source-receiver domain are mapped to a single column in the new45domain to promote faster decaying singular values. If our seismic data is high rank, we could trysolving (4.7) in the mid-point offset domain. This may or may not work. We leave this to our futurework.46Chapter 6ConclusionSolving large scale nonlinear inverse problems using equation (5) can be computationally expensive. Bylarge scale, we mean large data set. For many applications that has PDE constraints, such as in seismic andDC resistivity problems, large amounts of data are required to obtain a credible reconstruction of the model.Thus, there has been a driven need to to solve these large scale problems efficiently and at a reduced cost.Since the computational bottleneck for solving PDE constraints inverse problem is computing the PDEs,an efficient algorithm would be one that can reduce the number of PDE solved and still give a crediblereconstruction. In chapter 3, a special type of dimensionality reduction technique called the simultaneoussource (SS) method was introduced. This method consists of sampling large dimensional data to create alower one. The consequence of this is fewer PDE solved. One limitation of the SS method is it is onlyapplicable to data with no missing entries. In chapter 4, we overcome this limitation by introducing a newcoupled optimization that allows us to apply the SS method to data with missing entries. We tested ourmethod in the context of FWI in Chapter 5. The model recovered using full 60 PDE solved differed fromthe model recovered using 1 PDE solve by 2.1%.47Bibliography[1] Geogebra-free onliine geometry tool.https://www.math10.com/en/geometry/geogebra/geogebra.html. Accessed: 2017-08-30. ! pagesviii, 18[2] U.S. Environmental Protection Agency. Marine Seismic Methods, 2016. URL:https://archive.epa.gov/esd/archive-geophysics/web/html/marine seismic methods.html. Last visitedon 2017/08/01. ! pages viii, 28[3] Rick Aster, Brian Borchers, and Cliff Thurber. Parameter estimation and inverse problems (secondedition), 2005. ! pages 3[4] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicitsymmetric positive semi-definite matrix. Journal of the ACM (JACM), 58(2):8, 2011. ! pages 10[5] T. Van. Leeuwen X. Li Z. Fang B. Peters, B. Smithyman. Full-Waveform Inversion, 2015 (accessedJune 2017). ! pages 27[6] Jian-Feng Cai, Emmanuel J Cande`s, and Zuowei Shen. A singular value thresholding algorithm formatrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010. ! pages 14, 21[7] Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE,98(6):925–936, 2010. ! pages 13, 14, 15[8] Emmanuel J Cande`s and Benjamin Recht. Exact matrix completion via convex optimization.Foundations of Computational mathematics, 9(6):717, 2009. ! pages 14, 15[9] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Coherent matrix completion.In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference onMachine Learning, volume 32 of Proceedings of Machine Learning Research, pages 674–682,Bejing, China, 22–24 Jun 2014. PMLR. ! pages 14, 15[10] Curt Da Silva. Large-scale optimization algorithms for missing data completion and inverseproblems. PhD thesis, University of British Columbia, 2017. ! pages 12[11] Eldad Haber. Mini Course on PDE-Constrained Optimization, 2011. URL:http://www.pdeopt2011.unibas.ch. Last visited on 2017/07/01. ! pages 28[12] Eldad Haber. Computational methods in geophysical electromagnetics. SIAM, 2014. ! pages 5, 9,20[13] Eldad Haber and Mathias Chung. Simultaneous source for non-uniform data variance and missingdata. arXiv preprint arXiv:1404.5254, 2014. ! pages 1, 248[14] Eldad Haber, Matthias Chung, and Felix Herrmann. An effective method for parameter estimationwith pde constraints with multiple right-hand sides. SIAM Journal on Optimization, 22(3):739–757,2012. ! pages 1, 9, 10, 21, 24[15] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplaciansmoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.! pages 10, 50[16] Rajiv Kumar, Curt Da Silva, Okan Akalin, Aleksandr Y Aravkin, Hassan Mansour, Benjamin Recht,and Felix J Herrmann. Efficient matrix completion for seismic data reconstruction. Geophysics,80(5):V97–V114, 2015. ! pages 2, 45[17] David JC MacKay. Introduction to monte carlo methods. NATO ASI SERIES D BEHAVIOURAL ANDSOCIAL SCIENCES, 89:175–204, 1998. ! pages 9[18] Jorge Nocedal and Stephen J Wright. Conjugate gradient methods. Numerical optimization, pages101–134, 2006. ! pages 5[19] Farbod Roosta-Khorasani. Randomized algorithms for solving large scale nonlinear least squaresproblems. PhD thesis, University of British Columbia, 2015. ! pages 1, 7[20] Farbod Roosta-Khorasani, Kees Van Den Doel, and Uri Ascher. Data completion and stochasticalgorithms for pde inversion problems with many measurements. Electron. Trans. Numer. Anal,42:177–196, 2014. ! pages 2[21] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. Stochastic algorithms for inverseproblems involving pdes and many measurements. SIAM Journal on Scientific Computing,36(5):S3–S22, 2014. ! pages 2, 10[22] Alexander Shapiro. Monte carlo sampling methods. Handbooks in operations research andmanagement science, 10:353–425, 2003. ! pages 9[23] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyn´ski. Lectures on stochasticprogramming: modeling and theory. SIAM, 2009. ! pages 9, 21, 23[24] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without theagonizing pain, 1994. ! pages 5[25] Nathan Srebro. Learning with matrix factorizations. 2004. ! pages 4549Appendix AStochastic Trace EstimatorResult Let A be any N ⇥N symmetric matrix and let u = [u1, ...,un]T be a vector of n independent andidentically distributed random sample drawn from a distribution with zero mean and unit variance s2 = I.ThenE(uTAu) = TrAi.e. uTAu is an unbiased estimator of TrA [15].ProofAssume the random sample u is drawn from a distribution with zero mean and unit variance. When this isthe cases2(u) = E(uuT) = I (A.1)So:E(uTAu) = E(Tr(uTAu))= E(Tr(AuuT)) trace of a product is invariant under cyclic permutation.= Tr(E(AuuT)) trace and expectation are both linear operators and they commute.= Tr A(E(uuT)) A is not a random variable.= TrA used (A.1)50
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Dimensionality reduction for solving large scale inverse...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Dimensionality reduction for solving large scale inverse problems with PDE constraints through simultaneous… Liu, Michelle 2018
pdf
Page Metadata
Item Metadata
Title | Dimensionality reduction for solving large scale inverse problems with PDE constraints through simultaneous source method |
Creator |
Liu, Michelle |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | Optimization problems with PDE constraints arise in many applications of science and engineering. The PDEs often describe physical systems and are characterized by a set of parameters, such as the material properties of the system, or velocity of the wave traveling through the earth’s subsurface. Usually the parameters cannot be directly measured and can only be inferred by solving inverse problems. Solving large scale inverse problems usually require prohibitively large number of PDE solves. As a result, various dimensionality reduction methods have been proposed to reduce the number of PDE solves. This work builds on a type of dimensionality method called the Simultaneous Source (SS) method. This method relies on random sampling and the stochastic trace estimator to reduce the number of PDE solves to a more manageable size. One of the limitations of the SS method is it can only be applied to data sets with no missing entries. Unfortunately, data sets with missing entries are common in practice, such as in geophysical applications. In this thesis, we propose a new coupled optimization problem that extend the SS method to data with missing entries. The block coordinate descent method (BCDM) is used to break the coupled problem into two subproblems. The first subproblem optimizes over the data and fills the missing entries by the rank minimization technique. The second subproblem minimizes over the model parameter and uses SS method to reduce the number of PDE solves into a more manageable size. We test our method in the context of a full-waveform inversion (FWI) problem and present our numerical results in which we are able to reduce the number of PDE solves by 98% with less than 2% difference between the model using 1 PDE solve and the model using full PDE solves. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-01-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0363096 |
URI | http://hdl.handle.net/2429/64406 |
Degree |
Master of Science - MSc |
Program |
Mathematics |
Affiliation |
Science, Faculty of Mathematics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2018_february_liu_michelle.pdf [ 5.39MB ]
- Metadata
- JSON: 24-1.0363096.json
- JSON-LD: 24-1.0363096-ld.json
- RDF/XML (Pretty): 24-1.0363096-rdf.xml
- RDF/JSON: 24-1.0363096-rdf.json
- Turtle: 24-1.0363096-turtle.txt
- N-Triples: 24-1.0363096-rdf-ntriples.txt
- Original Record: 24-1.0363096-source.json
- Full Text
- 24-1.0363096-fulltext.txt
- Citation
- 24-1.0363096.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0363096/manifest