Randomized Algorithms for Solving Large Scale Nonlinear Least Squares Problems

by

Farbod Roosta-Khorasani

B.Sc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2015

© Farbod Roosta-Khorasani 2015

Abstract

This thesis presents key contributions towards devising highly efficient stochastic reconstruction algorithms for solving large scale inverse problems, where a large data set is available and the underlying physical system is complex, e.g., modeled by partial differential equations (PDEs).

We begin by developing stochastic and deterministic dimensionality reduction methods to transform the original large dimensional data set into one with much smaller dimensions, for which the computations are more manageable. We then incorporate such methods in our efficient stochastic reconstruction algorithms.

In the presence of corrupted or missing data, many such dimensionality reduction methods cannot be used efficiently. To alleviate this issue, in the context of PDE inverse problems, we develop and mathematically justify new techniques for replacing (or filling) the corrupted (or missing) parts of the data set. Our data replacement/completion methods are motivated by theory in Sobolev spaces, regarding the properties of weak solutions along the domain boundary.

All of the stochastic dimensionality reduction techniques can be reformulated as Monte-Carlo (MC) methods for estimating the trace of a symmetric positive semi-definite (SPSD) matrix.
In the next part of the present thesis, we present a probabilistic analysis of such randomized trace estimators and prove various computable and informative conditions on the sample size required for such Monte-Carlo methods to achieve a prescribed probabilistic relative accuracy.

Although computationally efficient, a major drawback of any (randomized) approximation algorithm is the introduction of “uncertainty” in the overall procedure, which could cast doubt on the credibility of the obtained results. The last part of this thesis consists of uncertainty quantification of the stochastic steps of our approximation algorithms presented earlier. As a result, we present highly efficient variants of our original algorithms where the degree of uncertainty can easily be quantified and adjusted, if needed.

The uncertainty quantification presented in the last part of the thesis is an application of our novel results regarding the maximal and minimal tail probabilities of non-negative linear combinations of gamma random variables, which can be considered independently of the rest of this thesis.

Preface

Parts of this thesis have been published in four co-authored papers plus a submitted (and arXived) one. As a rule, the main contributor appears first in the author lists of these papers. My supervisor has actively participated in the writing phase of all publications.

Chapter 3 (including parts of Chapter 2) has been published as [119], and a version of Chapter 4 appeared as [118]. These papers significantly expand the line of research begun in [46] in several directions. I am responsible for several of the new ideas and for the theory in [118], as well as for the entire implementation (which was written from scratch based in part on a previous effort by Kees van den Doel) and for carrying out all the numerical tests.

Chapter 5 has been published as [116], while a paper described in Chapters 6 and 7 has appeared as [117].
I conceived the ideas that have led to this part of the research work, and these have been refined through discussions with my collaborators. I formulated and proved all the theorems in [116] and [117], and have also implemented and carried out all the numerical examples in these papers.

Chapter 8 corresponds to [19]. My contributions consist of discussions, editing, and preparation of some of the numerical examples.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Notation
Acknowledgments
Dedication
1 Introduction
  1.1 Large Scale Data Fitting Problems
    1.1.1 Assumptions on the Forward Operator
    1.1.2 A Practical Example
    1.1.3 Assumptions on the Noise
  1.2 Least Squares Formulation & Optimization
    1.2.1 Generalized Least Squares Formulation
  1.3 Thesis Overview and Outline
2 Dimensionality Reduction
  2.1 Stochastic Approximation to Misfit
    2.1.1 Selecting a Sampling Method
    2.1.2 Approximation with Generalized Noise Assumption
  2.2 Deterministic Approximation to Data
  2.3 GN Iteration on the Approximate Function
3 Stochastic Reconstruction Algorithms
  3.1 Two Additional Reasons for Unbiased Estimators
    3.1.1 Cross Validation
    3.1.2 Stopping Criterion and Uncertainty Check
  3.2 Adaptive Selection of Sample Size
    3.2.1 Sample Size Selection Using Uncertainty Checks
    3.2.2 Adaptive Selection of Sample Size Using Cross Validation
  3.3 Numerical Experiments
    3.3.1 The EIT/DC Resistivity Inverse Problem
    3.3.2 Numerical Experiments Setup
    3.3.3 Numerical Experiments Comparing Eight Method Variants
  3.4 Conclusions
4 Data Completion
  4.1 Stochastic Algorithms for Solving the Inverse Problem
    4.1.1 Algorithm Variants
    4.1.2 General Algorithm
    4.1.3 The DC Resistivity Inverse Problem
  4.2 Data Completion
    4.2.1 Discontinuities in Conductivity Are Away from Common Measurement Domain
    4.2.2 Discontinuities in Conductivity Extend All the Way to Common Measurement Domain
    4.2.3 Determining the Regularization Parameter
    4.2.4 Point Sources and Boundaries with Corners
  4.3 Numerical Experiments
  4.4 Conclusions
5 Matrix Trace Estimation
  5.1 Introduction
  5.2 Hutchinson Estimator Bounds
    5.2.1 Improving the Bound in [22]
    5.2.2 A Matrix-Dependent Bound
  5.3 Gaussian Estimator Bounds
    5.3.1 Sufficient Bounds
    5.3.2 A Necessary Bound
  5.4 Random Unit Vector Bounds, with and without Replacement, for General Square Matrices
  5.5 Numerical Examples
  5.6 Conclusions
6 Extremal Probabilities of Linear Combinations of Gamma Random Variables
  6.1 Lemmas
  6.2 Proofs of Theorems 6.1 and 6.2
7 Uncertainty Quantification of Stochastic Reconstruction Algorithms
  7.1 Tight Conditions on Sample Size for Gaussian MC Trace Estimators
  7.2 Quantifying the Uncertainty in Randomized Algorithms
    7.2.1 Cross Validation Step with Quantified Uncertainty
    7.2.2 Uncertainty Check with Quantified Uncertainty and Efficient Stopping Criterion
    7.2.3 Algorithm
  7.3 Numerical Experiments
  7.4 Conclusions
8 Algorithms That Satisfy a Stopping Criterion, Probably
  8.1 Introduction
  8.2 Case Studies
    8.2.1 Stopping Criterion in Initial Value ODE Solvers
    8.2.2 Stopping Criterion in Iterative Methods for Linear Systems
    8.2.3 Data Fitting and Inverse Problems
  8.3 Probabilistic Relaxation of a Stopping Criterion
    8.3.1 TV and Stochastic Methods
  8.4 Conclusions
9 Summary and Future Work
  9.1 Summary
  9.2 Future Work
    9.2.1 Other Applications
    9.2.2 Quasi-Monte Carlo and Matrix Trace Estimation
    9.2.3 Randomized/Deterministic Preconditioners
Bibliography
Appendix
A Implementation Details
  A.1 Discretizing the Forward Problem
  A.2 Taking Advantage of Additional A Priori Information
  A.3 Stabilized Gauss-Newton
  A.4 Matlab Code
  A.5 Implementation of Total Variation Functional

List of Tables

3.1 Work in terms of number of PDE solves for Examples 3.1–3.4. The “Vanilla” count is independent of the algorithms described in Section 3.2.
4.1 Algorithm and work in terms of number of PDE solves, comparing RS against data completion using Gaussian SS.
7.1 Example (E.1). Work in terms of number of PDE solves for all variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The “vanilla” count is also given, as a reference.
7.2 Example (E.2). Work in terms of number of PDE solves for all variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The “vanilla” count is also given, as a reference.
8.1 Iteration counts required to satisfy (8.5) for the Poisson problem with tolerance $\rho = 10^{-7}$ and different mesh sizes s.

List of Figures

2.1 The singular values of the data used in Example 3.2 of Section 3.3.
3.1 Example 3.4 – initial guess for the level set method.
3.2 Example 3.1 – reconstructed log conductivity using Algorithm 1 and the four methods of Section 2.1.1.
3.3 Example 3.1 – reconstructed log conductivity using Algorithm 2 and the four methods of Section 2.1.1.
3.4 Data misfit vs. PDE count for Example 1.
3.5 Example 3.2 – reconstructed log conductivity using Algorithm 1 and the four methods of Section 2.1.1.
3.6 Example 3.2 – reconstructed log conductivity using Algorithm 2 and the four methods of Section 2.1.1.
3.7 Data misfit vs. PDE count for Example 3.2.
3.8 True Model for Examples 3.3 and 3.4. The left panel shows 2D equi-distant slices in the z direction from top to bottom, the right panel depicts the 3D volume.
3.9 Example 3.3 – reconstructed log conductivity for the 3D model using Algorithm 1 and (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD.
3.10 Example 3.3 – reconstructed log conductivity for the 3D model using Algorithm 2 and (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD.
3.11 Example 3.4 – reconstructed log conductivity for the 3D model using the level set method with Algorithm 1 and with (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD.
3.12 Example 3.4 – reconstructed log conductivity for the 3D model using the level set method with Algorithm 2 and with (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD.
3.13 Data misfit vs. PDE count for Example 3.3.
3.14 Data misfit vs. PDE count for Example 4.
4.1 Completion using the regularization (4.2), for an experiment taken from Example 4.3 where 50% of the data requires completion and the noise level is 5%. Observe that even in the presence of significant noise, the data completion formulation (4.2) achieves a good quality field reconstruction.
4.2 Completion using the regularization (4.6), for an experiment taken from Example 4.2 where 50% of the data requires completion and the noise level is 5%. Discontinuities in the conductivity extend to the measurement domain and their effect on the field profile along the boundary can be clearly observed. Despite the large amount of noise, data completion formulation (4.6) achieves a good reconstruction.
4.3 Example 4.1 – reconstructed log conductivity with 25% data missing and 5% noise. Regularization (4.6) has been used to complete the data.
4.4 Example 4.2 – reconstructed log conductivity with 50% data missing and 5% noise. Regularization (4.6) has been used to complete the data.
4.5 Example 4.3 – reconstructed log conductivity with 50% data missing and 5% noise. Regularization (4.2) has been used to complete the data.
4.6 Data misfit vs. PDE count for Examples 1, 2 and 3.
4.7 True Model for Example 4.4.
4.8 Example 4.4 – reconstructed log conductivity for the 3D model with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 50% of data missing. Regularization (4.6) has been used to complete the data.
4.9 Example 4.5 – reconstructed log conductivity for the 3D model using the level set method with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 30% of data missing. Regularization (4.6) has been used to complete the data.
4.10 Data misfit vs. PDE count for Example 4.5.
4.11 Example 4.6 – reconstructed log conductivity for the 3D model using the level set method with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 50% of data missing. Regularization (4.6) has been used to complete the data.
4.12 True Model for Example 4.7.
4.13 Example 4.7 – reconstructed log conductivity for the 3D model with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 70% data missing. Regularization (4.2) has been used to complete the data.
5.1 Necessary bound for the Gaussian estimator: (a) the log-scale of n according to (5.14) as a function of r = rank(A): larger ranks yield smaller necessary sample size. For very low rank matrices, the necessary bound grows significantly: for s = 1000 and r ≤ 30, necessarily n > s and the Gaussian method is practically useless; (b) tightness of the necessary bound demonstrated by an actual run as described for Example 5.4 in Section 5.5 where A has all eigenvalues equal.
5.2 The behaviour of the bounds (5.18) and (5.19) with respect to the factor $K = K_U$ for s = 1000 and ε = δ = 0.05.
The bound for $U_2$ is much more resilient to the distribution of the diagonal values than that of $U_1$. For very small values of $K_U$, there is no major difference between the bounds.
5.3 Example 5.1. For the matrix of all 1s with s = 10,000, the plot depicts the numbers of samples in 100 trials required to satisfy the relative tolerance ε = .05, sorted by increasing n. The average n for both Hutchinson and Gauss estimators was around 50, while for the uniform unit vector estimator always n = 1. Only the best 90 results (i.e., lowest resulting values of n) are shown for reasons of scaling. Clearly, the unit vector method is superior here.
5.4 Example 5.2. For the rank-1 matrix arising from a rapidly-decaying vector with s = 1000, this log-log plot depicts the actual sample size n required for (5.2) to hold with ε = δ = 0.2, vs. various values of θ. In the legend, “Unit” refers to the random sampling method without replacement.
5.5 Example 5.3. A dense SPSD matrix A is constructed using Matlab’s randn. Here s = 1000, r = 200, tr(A) = 1, $K_G$ = 0.0105, $K_H$ = 8.4669 and $K_U$ = 0.8553. The method convergence plots in (a) are for ε = δ = .05.
5.6 Example 5.4. The behaviour of the Gaussian method with respect to rank and $K_G$. We set ε = δ = .05 and display the necessary condition (5.14) as well.
5.7 Example 5.5. A sparse matrix (d = 0.1) is formed using sprandn. Here r = 50, $K_G$ = 0.0342, $K_H$ = 15977.194 and $K_U$ = 4.8350.
5.8 Example 5.5. A sparse matrix (d = 0.1) is formed using sprand. Here r = 50, $K_G$ = 0.0919, $K_H$ = 11624.58 and $K_U$ = 3.8823.
5.9 Example 5.5. A very sparse matrix (d = 0.01) is formed using sprandn. Here r = 50, $K_G$ = 0.1186, $K_H$ = 8851.8 and $K_U$ = 103.9593.
5.10 Example 5.5. A very sparse matrix (d = 0.01) is formed using sprand.
Here r = 50, $K_G$ = 0.1290, $K_H$ = 1611.34 and $K_U$ = 64.1707.
7.1 The curves of $P^{-}_{\varepsilon,r}(n)$ and $P^{+}_{\varepsilon,r}(n)$, defined in (7.5) and (7.8), for ε = 0.1 and r = 1: (a) $P^{-}_{\varepsilon,r}(n)$ decreases monotonically for all n ≥ 1; (b) $P^{+}_{\varepsilon,r}(n)$ increases monotonically only for $n \ge n_0$, where $n_0 > 1$: according to Theorem 7.2, $n_0 = 100$ is safe, and this value does not disagree with the plot.
7.2 Comparing, as a function of δ, the sample size obtained from (7.4) and denoted by “tight”, with that of (7.3) and denoted by “loose”, for ε = 0.1 and 0.01 ≤ δ ≤ 0.3: (a) sufficient sample size, n, for (7.2a), (b) ratio of sufficient sample size obtained from (7.3) over that of (7.4). When δ is relaxed, our new bound is tighter than the older one by an order of magnitude.
7.3 Comparing, as a function of δ, the sample size obtained from (7.7) and denoted by “tight”, with that of (7.3) and denoted by “loose”, for ε = 0.1 and 0.01 ≤ δ ≤ 0.3: (a) sufficient sample size, n, for (7.2b), (b) ratio of sufficient sample size obtained from (7.3) over that of (7.7). When δ is relaxed, our new bound is tighter than the older one by an order of magnitude.
7.4 Example (E.1). Plots of log-conductivity: (a) True model; (b) Vanilla recovery with s = 3,969; (c) Vanilla recovery with s = 49. The vanilla recovery using only 49 measurement sets is clearly inferior, showing that a large number of measurement sets can be crucial for better reconstructions.
7.5 Example (E.1). Plots of log-conductivity of the recovered model using the 8 variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The quality of reconstructions is generally comparable to that of plain vanilla with s = 3,969 and across variants.
7.6 Example (E.2).
Plots of log-conductivity: (a) True model; (b) Vanilla recovery with s = 3,969; (c) Vanilla recovery with s = 49. The vanilla recovery using only 49 measurement sets is clearly inferior, showing that a large number of measurement sets can be crucial for better reconstructions.
7.7 Example (E.2). Plots of log-conductivity of the recovered model using the 8 variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The quality of the reconstructions is generally comparable across variants and to that of plain vanilla with s = 3,969.
7.8 Example (E.2). Growth of the fitting sample size, $n_k$, as a function of the iteration k, upon using cross validation strategies (7.14) and (7.16). The graph shows the fitting sample size growth for variants (ii) and (iv) of Algorithm 4, as well as their counterparts, namely, variants (vi) and (viii). Observe that for variants (ii) and (iv) where (7.14) is used, the fitting sample size grows at a more aggressive rate than for variants (vi) and (viii) where (7.16) is used.
8.1 Adiabatic invariant approximations obtained using Matlab’s package ode45 with default tolerances (solid blue) and stricter tolerances (dashed magenta).
8.2 Relative residuals and step sizes for solving the model Poisson problem using LSD on a 15 × 15 mesh. The red line in (b) is the forward Euler stability limit.
8.3 Plots of log-conductivity: (a) True model; (b) Vanilla recovery with s = 3,969; (c) Vanilla recovery with s = 49; (d) Monte Carlo recovery with s = 3,969. The vanilla recovery using only 49 measurement sets is clearly inferior, showing that a large number of measurement sets can be crucial for better reconstructions. The recovery using our algorithm, however, is comparable in quality to Vanilla with the same s.
The quantifier values used in our algorithm were: $(\varepsilon_c, \delta_c) = (0.05, 0.3)$, $(\varepsilon_u, \delta_u) = (0.1, 0.3)$ and $(\varepsilon_t, \delta_t) = (0.1, 0.1)$.

List of Acronyms

LS/NLS: Least Squares/Nonlinear Least Squares
SS: Simultaneous Sources
RS: Random Subset
PDE: Partial Differential Equation
TSVD: Truncated Singular Value Decomposition
ML/MAP: Maximum Likelihood/Maximum a Posteriori
GN: Gauss-Newton
PCG: Preconditioned Conjugate Gradient
DOT: Diffuse Optical Tomography
QPAT: Quantitative Photo-Acoustic Tomography
DC Resistivity: Direct Current resistivity
EIT: Electrical Impedance Tomography
i.i.d.: Independent and Identically Distributed
PDF: Probability Density Function
CDF: Cumulative Distribution Function
MC: Monte-Carlo
QMC: Quasi MC
SPSD: Symmetric Positive Semi-Definite

Notation

Bold lower case letters: vectors
Regular upper case letters: matrices or scalar random variables
$A^T$: matrix transpose
tr(A): matrix trace
$\mathcal{N}(0, \Sigma)$: normal distribution with mean 0 and covariance matrix Σ
I: identity matrix
$\mathbb{R}^{m \times n}$: m × n real matrix
$e_k$: kth column of identity matrix
1: vector of all 1’s
exp: exponential function
E: Expectation
Var: Variance
Pr: Probability
X ∼ Gamma(α, β): gamma distributed r.v. parametrized by shape α and rate β
$f_X$: probability density function of r.v. X
$F_X$: cumulative distribution function of r.v. X
$\|v\|_2$, $\|A\|_2$: vector or matrix two norm
$\|A\|_F$: matrix Frobenius norm
∇: gradient
∇· : divergence
∆: Laplacian
$\Delta_S$: Laplace-Beltrami operator
$\delta(x - x_0)$: Dirac delta distribution centered at $x_0$
Ω: PDE domain
∂Ω: boundary of the PDE domain
$\bar{\Gamma}$: closure of Γ w.r.t. subspace topology
$X^{\circ}$: interior of the set X w.r.t. subspace topology
$C^k$: space of k times continuously differentiable functions
$C^{\beta}$: space of Hölder continuous functions with the exponent β
$L^p$: space of $L^p$ functions
$\|f\|_{L^p}$: $L^p$ norm of the function f
Lip: space of Lipschitz functions
$H^k$: Sobolev space of $H^k$ functions
u: solution of the discretized PDE
q: discretized input to the PDE (i.e., right hand side)
$d_i$: measurement vector for the ith experiment
$\eta_i$: noise incurred at the ith experiment
$f_i$: forward operator for the ith experiment
$J_i$: Jacobian of $f_i$
$P_i$: projection matrix for the ith experiment
φ: misfit
$\hat{\phi}$: approximated misfit
s: total number of experiments or dimension of SPSD matrix in Chapters 5 and 7
k: iteration counter for GN algorithm
$n_k$: sample size at kth iteration for GN algorithm
n: sample size for MC trace estimators or total number of i.i.d. gamma r.v.’s in Chapter 6
r: maximum PCG iterations or rank of the matrix
m: model to be recovered
δm: update direction used in each GN iteration
ρ: stopping criterion tolerance
ε: trace estimation relative accuracy tolerance
δ: probability of failure of ε-accurate trace estimation
α: Tikhonov regularization parameter

Acknowledgments

Starting from the first year of grad school, I have had the honor of being supervised by one of the greatest numerical analysts of our time, Prof. Uri Ascher. Without his guidance, insight and, most importantly, patience, this thesis would not have been possible. On a personal level, he helped my family and myself in times of hardship and I am forever indebted to him.

I would also like to give special thanks to my knowledgeable and brilliant supervisory committee members, Prof. Chen Greif and Prof. Eldad Haber. They have always been my greatest supporters, and have never hesitated to graciously offer their help and expert advice along the way. I will always be grateful for their kindness.

I am also thankful to Prof. Gábor J. Székely, Dr. Adriano De Cezaro, Prof. Michael Friedlander, Dr. Kees van den Doel, and Dr. Ives Macêdo for all their help and many fruitful discussions.

I am also grateful to all the CS administrative staff, especially Joyce Poon, for ever so diligently taking care of all the paperwork throughout the years and making the process run smoothly.

A special thanks to my wonderful friends and colleagues Yariv Dror Mizrahi, Kai Rothauge and Iain Moyles for creating many good memories during these years.

Finally and most importantly, I would like to thank my biggest cheerleader, my wife, Jill. She never stopped believing in me, even when I was the closest to giving up. She stood by my side from day one and endured all the pain, anxiety, and frustration that is involved in finishing a PhD, right along with me. Jill, I love you!

To
My Queen, Jill,
&
My Princess, Rosa.

Chapter 1

Introduction

Inverse problems arise often in many applications in science and engineering. The term “inverse problem” is generally understood as the problem of finding a specific physical property, or properties, of the medium under investigation, using indirect measurements. This is a highly important field of applied mathematics and scientific computing, as to a great extent, it forms the backbone of modern science and engineering. Examples of inverse problems can be found in various fields within medical imaging (e.g., [10, 12, 28, 100, 136]) and several areas of geophysics including mineral and oil exploration (e.g., [20, 35, 102, 120]). For many of these problems, in theory, having many measurements is crucial for obtaining credible reconstructions of the sought physical property, i.e., the model. For others where there is no theory, it is a widely accepted working assumption that having more data can only help (at worst not hurt) the quality of the recovered model. As a consequence, there has been an exponential growth in the ability to acquire large amounts of measurements (i.e., many data sets) in short periods of time.
The availability of “big data”, in turn, has given rise to some new and rather serious challenges regarding the potentially high computational cost of solving such large scale inverse problems. As the ability to gather larger amounts of data increases, the need to devise algorithms that solve such problems efficiently becomes more important. This is where randomized algorithms have shown great success in reducing the computational costs of solving such large scale problems. More specifically, dimensionality reduction algorithms transform the original large dimensional problem into a smaller problem to which effective solution methods can be applied. The challenge is to devise methods which yield credible reconstructions but at much lower costs. The main purpose of this thesis is to propose, study and analyze various such highly efficient reconstruction algorithms in the context of large scale least squares problems. Henceforth, the terms “model” and “parameter function” are used interchangeably to refer to the sought physical property or properties of the medium under investigation.

1.1 Large Scale Data Fitting Problems

Inverse problems can often be regarded as data fitting problems where the objective is to recover an unknown parameter function such that the misfit (i.e., the distance, in some norm, between predicted and observed data) is within a desirable tolerance, which is mostly dictated by some prior knowledge of the measurement noise.

Generally speaking (and possibly after discretization of the continuous problem), consider the system

$$ d_i = f_i(m) + \eta_i, \quad i = 1, 2, \ldots, s, \tag{1.1} $$

where $d_i \in \mathbb{R}^{l}$ is the measured data obtained in the ith experiment, $f_i = f_i(m)$ is the known forward operator (or data predictor) for the ith experiment arising from the underlying physical system, $m \in \mathbb{R}^{l_m}$ is the sought-after parameter vector¹, and $\eta_i$ is the noise incurred in the ith experiment.
The total number of experiments, or the size of the data sets, is assumed large: $s \gg 1$; this is what is implied by “large scale” or “large dimensional problem”. The goal of data fitting is to find (or infer) the unknown model, $m$, from the measurements $d_i$, $i = 1, 2, \ldots, s$, such that

$$\sum_{i=1}^{s} \| f_i(m) - d_i \| \le \rho,$$

where $\rho$ is usually related to noise, and the chosen norm can be problem-dependent. Generally, this problem can be ill-posed. Various approaches, including different regularization techniques, have been proposed to alleviate this ill-posedness; see, e.g., [9, 52, 135]. Most regularization methods consist of incorporating some a priori information on $m$. Such information may be in the form of expected physical properties of the model in terms, for example, of constraints on the size, value or the smoothness.

In the presence of large amounts of measurements, i.e., $s \gg 1$, and when computing $f_i$, for each $i$, is expensive, the mere evaluation of the misfit function may become computationally prohibitive. As such, any reconstruction algorithm involving (1.1) becomes intractable. The goal of this thesis is to devise reconstruction methods to alleviate this problem and recover a credible model, $m$, efficiently.

¹ The parameter vector $m$ often arises from a parameter function in several space variables projected onto a discrete grid and reshaped into a vector.

1.1.1 Assumptions on the Forward Operator

In this thesis, we consider a special class of data fitting problems where the forward operators, $f_i$ in (1.1), satisfy the following assumptions.

(A.1) The forward operators, $f_i$, have the form

$$f_i(m) = f(m, q_i), \quad i = 1, \ldots, s, \tag{1.2}$$

where $q_i$ is the input in the $i$th experiment. In other words, the $i$th measurement, $d_i$, is made after injecting the $i$th input (or source) $q_i$ into the system.
Thus, for an input $q_i$, $f(m, q_i)$ predicts the $i$th measurement, given the underlying model $m$.

(A.2) For all sources, we have $q_i \in \mathbb{R}^{l_q}$, $\forall i$, and $f$ is linear in $q$, i.e., $f(m, w_1 q_1 + w_2 q_2) = w_1 f(m, q_1) + w_2 f(m, q_2)$. Alternatively, we write $f(m, q) = G(m) q$, where $G \in \mathbb{R}^{l \times l_q}$ is a matrix that depends, potentially non-linearly, on the sought $m$.

(A.3) Evaluating $f(m, q_i)$ for each input, $q_i$, is computationally expensive and is, in fact, the bottleneck of computations.

1.1.2 A Practical Example

An important class of inverse problems for which Assumptions (A.1) - (A.3) are often valid is that of large scale partial differential equation (PDE) inverse problems with many measurements. Such nonlinear parameter function estimation problems involving PDE constraints arise often in science and engineering. The main objective in solving such inverse problems is to find a specific model which appears as part of the underlying PDE. For several instances of these PDE-constrained inverse problems, large amounts of measurements are gathered in order to obtain reasonable and credible reconstructions of the sought model. Examples of such problems include electromagnetic data inversion in mining exploration (e.g., [48, 69, 108, 110]), seismic data inversion in oil exploration (e.g., [56, 81, 115]), diffuse optical tomography (DOT) (e.g., [11, 29]), quantitative photo-acoustic tomography (QPAT) (e.g., [63, 140]), direct current (DC) resistivity (e.g., [46, 71, 72, 111, 128]), and electrical impedance tomography (EIT) (e.g., [33, 39, 47]). For such applications, it has been suggested that many well-placed experiments yield practical advantage in order to obtain reconstructions of acceptable quality.

Mathematical Model for Forward Operators

In the class of PDE-constrained inverse problems, upon discretization of the continuous problem, the sought model, $m$, is a discretization of the function $m(x)$ in two or three space dimensions.
Furthermore, the forward operator involves an approximate solution of a PDE, or more generally, a system of PDEs. We write this in discretized form as

$$L(m) u_i = q_i, \quad i = 1, \ldots, s, \tag{1.3}$$

where $u_i \in \mathbb{R}^{l_u}$ is the $i$th field, $q_i \in \mathbb{R}^{l_u}$ is the $i$th source, and $L$ is a square matrix discretizing the PDE plus appropriate side conditions. Furthermore, there are given projection matrices $P_i$ such that

$$f_i(m) = f(m, q_i) = P_i u_i = P_i L^{-1}(m) q_i \tag{1.4}$$

predicts the $i$th data set. In other words, the matrix $P_i$ projects the field, $u_i$, onto the locations in the domain where the $i$th measurements are made. Note that the notation (1.3) reflects an assumption of linearity in $u$ but not in $m$. Assumptions (A.1) & (A.3) can be justified for the forward operator (1.4). However, if the $P_i$'s are different for each $i$, then the linearity assumption (A.2) does not hold. On the other hand, if the locations where the measurements are made do not change from one experiment to another, i.e., $P = P_i$, $\forall i$, then we get

$$f(m, q_i) = P L^{-1}(m) q_i, \tag{1.5}$$

and the linearity assumption (A.2) of $f(m, q)$ in $q$ is satisfied. It should be noted that, under certain circumstances, if the $P_i$'s are different across experiments, there are methods to transform the existing data set into one where all sources share the same receivers. Different such methods are discussed in [70, 96] as well as Chapter 4 of this thesis.

In the sequel, the cost of any reconstruction algorithm used on a PDE-constrained inverse problem is measured by the total count of PDE solves, $L(m)^{-1} q$, as solving this linear system for each $q$ is assumed to be the bottleneck of the computations.

1.1.3 Assumptions on the Noise

The methods and algorithms presented in this thesis are developed under one of the following assumptions on the noise.
In what follows $\mathcal{N}$ denotes the normal distribution.

(N.1) The noise is independent and identically distributed (i.i.d.) as $\eta_i \sim \mathcal{N}(0, \Sigma)$, $\forall i$, where $\Sigma \in \mathbb{R}^{l \times l}$ is the symmetric positive definite covariance matrix.

(N.2) The noise is independent but not necessarily identically distributed, satisfying instead $\eta_i \sim \mathcal{N}(0, \sigma_i^2 I)$, $i = 1, 2, \ldots, s$, where $\sigma_i > 0$ are the standard deviations.

Henceforth, for notational simplicity, most of the algorithms and methods are presented for the special case of Assumption (N.1) with $\Sigma = \sigma I$. However, all of these methods and algorithms can be readily extended to the more general cases in a completely straightforward manner.

1.2 Least Squares Formulation & Optimization

If we may assume that the noise satisfies² Assumption (N.1) with $\Sigma = \sigma I$, the standard maximum likelihood (ML) approach, [123], leads to minimizing the ordinary LS misfit function

$$\phi(m) := \sum_{i=1}^{s} \| f(m, q_i) - d_i \|_2^2 = \| F(m) - D \|_F^2, \tag{1.6}$$

where $F(m)$ and $D$ are $l \times s$ matrices whose $i$th columns are, respectively, $f(m, q_i)$ and $d_i$, and $\| \cdot \|_F$ stands for the Frobenius norm. Hence, we obtain a misfit function for which the data fitting can be done in the $\ell_2$ sense. However, since the above inverse problem is typically ill-posed, a regularization functional, $R(m)$, is often added to the above objective, thus minimizing instead

$$\phi_{R,\alpha}(m) := \phi(m) + \alpha R(m), \tag{1.7}$$

where $\alpha$ is a regularization parameter [9, 52, 135]. In general, this regularization term can be chosen using a priori knowledge of the desired model. The objective functional (1.7) coincides with the maximum a posteriori (MAP) formulation, [123]. Injection of the regularization (i.e., a priori knowledge), $R(m)$, on the sought-after model can also be done by formulating the problem as

$$\min_{m} R(m) \quad \text{s.t.} \quad \phi(m) \le \rho, \tag{1.8}$$

where $\rho$ acts as the regularization parameter³. Note that the “meaning” of the regularization parameter $\rho$ in (1.8) is more intuitive than that of $\alpha$ in (1.7), as $\rho$ usually relates to noise and the maximum discrepancy between the measured and the predicted data. As such, determining $\rho$ could be easier than determining $\alpha$. Implicit regularization also exists, in which there is no explicit term $R(m)$ in the objective [77, 78, 113, 114, 131, 133]. Various optimization techniques can be used on the (regularized) objective to decrease the value of the above misfit, (1.6), to a desired level (determined, e.g., by a given tolerance which depends on the noise level), thus recovering the sought-after model.

² For notational simplicity, we do not distinguish between a random vector (e.g., noise) and its realization, as they are clear within the context in which they are used.

³ Though for the rest of this thesis we will not consider algorithms for solving the constrained problem (1.8), the discussions regarding stopping criteria in the following chapters are directly relevant to any such algorithm.

Let us suppose for now that the forward operators $f(m, q_i)$, each involving a PDE solution, are given as in (1.5): see Appendix A and Section 3.3 for a specific instance, used for our numerical experiments. Next, consider the problem of reducing the value of the misfit function $\phi(m)$ defined in (1.6) (what follows can be easily extended to the regularized objective function $\phi_{R,\alpha}(m)$ defined in (1.7)). With the sensitivity matrices

$$J_i(m) = \frac{\partial f_i}{\partial m}, \quad i = 1, \ldots, s, \tag{1.9}$$

we have the gradient

$$\nabla \phi(m) = 2 \sum_{i=1}^{s} J_i^T \left( f_i(m) - d_i \right). \tag{1.10}$$

An iterative method such as modified Gauss-Newton (GN), L-BFGS, or nonlinear conjugate gradient ([41, 57, 109]) is typically designed to decrease the value of the objective function using repeated calculations of the gradient. Although the methods and issues under consideration here do not require a specific optimization method, we employ variants of the GN method throughout this thesis, thus achieving a context in which to focus our attention on the new aspects of this work and enabling comparison to past efforts.
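The gradient formula (1.10) can be checked numerically on a small stand-in problem. The sketch below is only an illustration under stated assumptions: the operator $L(m) = L_0 + \mathrm{diag}(m)$, the sizes, and all function names are hypothetical choices made for this example (they are not the discretization of Appendix A), and dense solves replace whatever PDE solver would be used in practice. For this particular $L(m)$, the sensitivity matrix is $J_i = -P L^{-1}(m)\,\mathrm{diag}(u_i)$.

```python
import numpy as np

def forward(m, L0, P, Q):
    """F(m): column i is f(m, q_i) = P L(m)^{-1} q_i, with the toy L(m) = L0 + diag(m)."""
    return P @ np.linalg.solve(L0 + np.diag(m), Q)

def misfit(m, L0, P, Q, D):
    """Ordinary LS misfit (1.6): squared Frobenius norm of F(m) - D."""
    return np.sum((forward(m, L0, P, Q) - D) ** 2)

def gradient(m, L0, P, Q, D):
    """Gradient (1.10): 2 * sum_i J_i^T (f_i(m) - d_i)."""
    L = L0 + np.diag(m)
    U = np.linalg.solve(L, Q)          # fields u_i, one column per source
    R = P @ U - D                      # residuals f_i(m) - d_i
    g = np.zeros_like(m)
    for i in range(Q.shape[1]):
        # For L(m) = L0 + diag(m): J_i = d f_i / d m = -P L^{-1} diag(u_i)
        J_i = -P @ np.linalg.solve(L, np.diag(U[:, i]))
        g += 2.0 * J_i.T @ R[:, i]
    return g

def finite_difference_gradient(m, L0, P, Q, D, h=1e-6):
    """Central differences, used only to check the analytic gradient."""
    g = np.zeros_like(m)
    for k in range(m.size):
        e = np.zeros_like(m)
        e[k] = h
        g[k] = (misfit(m + e, L0, P, Q, D) - misfit(m - e, L0, P, Q, D)) / (2 * h)
    return g

# Hypothetical sizes: l_u = 8 unknowns, l = 5 receivers, s = 6 sources.
rng = np.random.default_rng(0)
lu, l, s = 8, 5, 6
L0 = 2.0 * np.eye(lu) - np.eye(lu, k=1) - np.eye(lu, k=-1)
P = np.eye(lu)[:l]                     # same receivers for every source, as in (1.5)
Q = rng.standard_normal((lu, s))
D = forward(rng.uniform(0.5, 1.5, lu), L0, P, Q)
m = np.ones(lu)
g_analytic = gradient(m, L0, P, Q, D)
g_numeric = finite_difference_gradient(m, L0, P, Q, D)
print(np.max(np.abs(g_analytic - g_numeric)))
```

Note that forming each $J_i$ explicitly, as done here for clarity, would not be practical at scale; it is used only because the toy problem is tiny.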
In particular, the way in which the GN method is modified is important more generally; see Appendix A.3.

The GN iteration for (1.6) (or (1.7)) at the $k$th iteration with the current iterate $m = m_k$ calculates the correction as the solution of the linear system

$$\Big( \sum_{i=1}^{s} J_i^T J_i \Big) \delta m = -\nabla_m \phi, \tag{1.11a}$$

followed by the update

$$m_{k+1} = m_k + \alpha_k \delta m. \tag{1.11b}$$

Here the step length, $\alpha_k$, $0 < \alpha_k \le 1$, is determined by a weak line search (using, say, the Armijo algorithm starting with $\alpha_k = 1$) ensuring sufficient decrease in $\phi(m_{k+1})$ as compared to $\phi(m_k)$.

Several nontrivial modifications are required to adapt this prototype method for our purposes, and these are described in context in Appendix A.3, resulting in a method we refer to as stabilized GN. This method replaces the solution of (1.11a) by $r$ preconditioned conjugate gradient (PCG) inner iterations, which costs $2r$ solutions of the forward problem per iteration, for a moderate integer value $r$. Thus, if $K$ outer iterations are required to obtain an acceptable solution, then the total work estimate (in terms of the number of PDE solves) is approximated from below by

$$\text{Work Estimate} = 2(r + 1) K s. \tag{1.12}$$

This indicates that for $s \gg 1$, the computational costs can be rather prohibitive. In this thesis, we design and propose algorithms for lowering the above “Work Estimate” by reducing the size of the data set $s$ used in each of the $K$ iterations.

Note that an alternative method to GN such as L-BFGS would require only $r = 1$ in (1.12). However, the number of such iterations would be significantly higher. This point again does not affect the issues addressed here and is not pursued further.

1.2.1 Generalized Least Squares Formulation

Our assumption regarding the noise distribution leading to the ordinary LS misfit function (1.6), although standard, is quite simplistic. Under the more general assumptions (N.1) or (N.2) on the noise, described in Section 1.1.3, we can extend the ordinary LS misfit (1.6) to obtain generalized LS formulations.
More specifically, under Assumption (N.1), the ML approach leads to minimizing the $\ell_2$ misfit function

$$\phi^{(1)}(m) := \sum_{i=1}^{s} \| C^{-1} ( f(m, q_i) - d_i ) \|_2^2 = \| C^{-1} ( F(m) - D ) \|_F^2, \tag{1.13}$$

where $C \in \mathbb{R}^{l \times l}$ is any invertible matrix such that $\Sigma = C C^T$ (e.g., $C$ can be the Cholesky factor of $\Sigma$). The matrices $F$ and $D$ are as in (1.6).

Similarly, under Assumption (N.2), the ML approach yields the weighted LS misfit function

$$\phi^{(2)}(m) := \sum_{i=1}^{s} \frac{1}{\sigma_i^2} \| f(m, q_i) - d_i \|_2^2 = \| ( F(m) - D ) C^{-1} \|_F^2, \tag{1.14}$$

where $C \in \mathbb{R}^{s \times s}$ denotes the diagonal matrix whose $i$th diagonal element is $\sigma_i$.

Although the methods and algorithms in this thesis are developed using the simple misfit (1.6), they can be applied almost verbatim to the more general misfits (1.13) and (1.14).

1.3 Thesis Overview and Outline

This thesis is organized into nine chapters. Following the present introductory chapter, in Chapter 2, we review dimensionality reduction methods, both stochastic and deterministic, to transform the original high dimensional problem into a smaller one of manageable size. This is done either by approximating the misfit function (in the stochastic case) or by approximating the data matrix (in the deterministic case). The common denominator in many of these dimensionality reduction methods is that they form fewer experiments by some combination of the original experiments, called simultaneous sources (SS). This smaller and newly formed set of experiments is then used in optimization iterations. The method of SS is only applicable when the linearity assumption (A.2) is justified. Under this assumption, the stochastic variants of SS methods provide accurate approximations to the misfit. However, in the absence of the linearity assumption (A.2), an alternative, more general and yet less accurate, approximation method named random subset (RS) can be used and will also be discussed in Chapter 2.
Part of this chapter is taken from Roosta-Khorasani, van den Doel and Ascher [119].

Efficient, practical and stochastic reconstruction algorithms based on these dimensionality reduction methods are presented in Chapter 3. Such dimensionality reduction methods always involve (random) sampling of the original measurements and, as the iterations progress, this sample size might be required to grow. For these algorithms, novel stochastic mechanisms for controlling the growth of the number of such samples are proposed and justified. Our algorithms employ some variants of the stabilized GN method, though other iterative methods can easily be incorporated as well. In addition to using such approximation methods in each GN iteration, we identify and justify two different purposes for using these approximations in our algorithm. Furthermore, we show that these different purposes may well require different estimation methods. We show that if the linearity assumption (A.2) is justified, the reconstruction algorithms based on the SS methods are significantly more efficient than their counterpart using the RS method. The comparison among different variants and the overall efficacy of these reconstruction algorithms are demonstrated in the context of the famous DC resistivity problem. We present in detail our methods for solving such inverse problems. These methods involve incorporation of a priori information such as piecewise smoothness, bounds on the sought conductivity surface, or even a piecewise constant solution. This chapter has appeared in [119].

Reconstruction algorithms based on the efficient SS methods, presented in Chapter 3, are only applicable if the linearity assumption (A.2) is valid. In situations where Assumption (A.2) is violated, such as missing or highly corrupted data, among all algorithmic variants described in Chapter 3, only the one based on the RS method can be used.
However, as shown in Chapter 3, an algorithm employing the RS method requires more evaluations of the computationally expensive forward operators, $f_i$, in order to obtain a credible reconstruction. Luckily, under certain circumstances, it is possible to transform the problem, by constructing a new set of measurements, for which Assumption (A.2) is restored and thus the SS algorithms presented in Chapter 3 can be used. Such transformations, described in detail in Chapter 4, are done by means of an approximation using an appropriately restricted gradient or Laplacian regularization, filling in the missing (or replacing the corrupted) data. Our data completion/replacement methods are motivated by theory in Sobolev spaces regarding the properties of weak solutions along the domain boundary. Results using the method of SS with the newly formed data set are then compared to those obtained by the more general but slower RS method, which requires no modifications. This chapter has appeared as Roosta-Khorasani, van den Doel and Ascher [118].

All of our randomized reconstruction algorithms presented in this thesis rely heavily upon some fundamental aspects such as dimensionality reduction methods, discussed in Chapter 2. This, within the context of LS formulations, amounts to randomized algorithms for estimating the trace of an implicit matrix using Monte Carlo (MC) methods. Chapter 5 presents a comprehensive study of the theory of MC implicit matrix trace estimators. Such a method approximates the trace of an SPSD matrix $A$ by an average of $n$ expressions of the form $w^T (A w)$, with random vectors $w$ drawn from an appropriate distribution. In Chapter 5, we prove, discuss and experiment with bounds on the number of realizations $n$ required in order to guarantee a probabilistic bound on the relative error of the trace estimation upon employing Rademacher (Hutchinson), Gaussian and uniform unit vector (with and without replacement) probability distributions, discussed in Section 2.1.1.
In total, one necessary and six sufficient bounds are proved, improving upon and extending similar estimates obtained in the seminal work of Avron and Toledo [22] in several dimensions. We first improve their bound on $n$ for the Hutchinson method, dropping a term that relates to $\mathrm{rank}(A)$ (hence proving a conjecture in [22]) and making the bound comparable with that for the Gaussian estimator. We further prove new sufficient bounds for the Hutchinson, Gaussian and the unit vector estimators, as well as a necessary bound for the Gaussian estimator, which depend more specifically on properties of the matrix $A$. As such, they may suggest for what type of matrices one distribution or another provides a particularly effective or relatively ineffective stochastic estimation method. This chapter has appeared as Roosta-Khorasani and Ascher [116].

Chapter 6 is a precursor of Chapter 7. Specifically, the theorems proved in Chapter 7 are applications of more general and novel results regarding extremal tail probabilities (i.e., maximum and minimum of the tail probabilities) of linear combinations of gamma distributed random variables, which are presented and proved in Chapter 6. Many distributions, such as chi-squared of arbitrary degree, exponential, and Erlang, are special instances of the gamma distribution. As such, these results have a wide range of applications in statistics, engineering, insurance, actuarial science and reliability. These results have appeared as Roosta-Khorasani, Székely and Ascher [117] and can be considered independently of the rest of this thesis.

The main advantage of the efficient randomized reconstruction algorithms presented in Chapter 3 is the reduction of computational costs. However, a major drawback of any such algorithm is the introduction of “uncertainty” in the overall procedure. The presence of uncertainty in the approximation steps could cast doubt on the credibility of the obtained results.
Hence, it may be useful to have means which allow one to adjust the cost and accuracy of such algorithms in a quantifiable way, and find a balance that is suitable to particular objectives and computational resources. In Chapter 7, eight variants of the randomized algorithms in Chapter 3 are presented where the uncertainties in the major stochastic steps are quantified. This is done by incorporating conditions similar to those presented in Chapter 5 in our stochastic algorithms. However, the sufficient bounds derived in Chapter 5 are typically not tight enough to be practically useful. As such, in Chapter 7 and for the special case of the Gaussian trace estimator, we prove tight necessary and sufficient conditions on the sample size for MC trace estimators. We show that these conditions are practically computable and yield small sample sizes, and hence, all variants of our proposed algorithm with uncertainty quantification are very practical and highly efficient. This chapter has appeared in [117].

The discussion regarding the probabilistic stopping criterion in Chapter 7 leads us to observe that issues discussed there can also arise in several other domains of numerical computations. Namely, in practical applications a precise value for a tolerance used in the related stopping criterion is rarely known; rather, only some possibly vague idea of the desired quality of the numerical approximation is at hand. There are situations where treating such a tolerance as a “holy” constant can result in erroneous conclusions regarding the relative performance of different algorithms or the produced outcome of one such algorithm. Highlighting such situations and finding ways to alleviate these issues are important. This is taken up in Chapter 8, where we discuss three case studies from different areas of numerical computation, where uncertainty in the error tolerance value is revealed in different ways.
Within the context of large scale problems considered in this thesis, we then concentrate on a probabilistic relaxation of the given tolerance. A version of this chapter has been submitted for publication as Ascher and Roosta-Khorasani [19].

Each of Chapters 3, 4, 5, and 7 of this thesis includes a summary, conclusions and future work section related to that specific line of research or project. In Chapter 9, an overall summary is given and a few directions regarding possible future research, not mentioned in earlier chapters, are presented.

This thesis contains an appendix as well. In Appendix A, certain implementation details are given which are used throughout the thesis. Such details include discretization of the EIT/DC resistivity problem in two and three dimensions, injection of a priori knowledge on the sought parameter function via transformation functions in the original PDE, the overall discussion of a (stabilized) GN algorithm for minimization of the least squares objective, a short Matlab code which is employed in Chapter 7 to compute the Monte-Carlo sample sizes used in matrix trace estimators, and finally the details of implementation and discretization of the total variation functional used in several numerical examples in this thesis.

Chapter 2
Dimensionality Reduction

Inverse problems of the form (1.1) for which the forward operators satisfy Assumptions (A.1) & (A.3) can be very expensive to solve numerically. This is so especially when $s \gg 1$ and many experiments, involving different combinations of sources and receivers, are employed in order to obtain reconstructions of acceptable quality. For example, the mere evaluation of the misfit function (the distance between predicted and observed data), $\phi(m)$ in (1.6), requires evaluation of all $f(m, q_i)$, $i = 1, \ldots, s$.
In this chapter, we develop and assess dimensionality reduction methods, both stochastic and deterministic, to replace the original large data set by a smaller set of potentially modified measurements for which the computations are more manageable. Such dimensionality reduction methods always involve random or deterministic sampling of the experiments. In this chapter various such sampling techniques are discussed.

In problems where, in addition to (A.1) & (A.3), Assumption (A.2) also holds, efficient⁴ dimensionality reduction methods consisting of stochastically or deterministically combining the experiments can be employed. In the stochastic case, this yields an unbiased estimator (i.e., approximation) of the misfit function. However, in the deterministic case, experiments are approximated by projecting the original data set onto a smaller space where a newly formed and smaller set of experiments captures the essence of the original data set. Since in both of these approaches the approximation is done through the mixing of the experiments, the resulting method, originating from the geophysics community, is generally named the method of simultaneous sources (SS) [25, 76].

⁴ In the rest of this thesis, “efficiency” is measured with respect to the total number of evaluations of the computationally expensive forward operator, $f(m, q)$. For example, in PDE inverse problems, the efficiency is measured with respect to the number of PDE solves.

However, in situations where Assumption (A.2) is violated, the SS method is no longer applicable. In such scenarios, an alternative approximation method can be used which essentially boils down to selecting, uniformly at random, a subset of experiments, and this selection is done without any mixing [46]. Such a method, in what follows, is called a random subset (RS) method.
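A minimal sketch of the random subset idea, assuming the residual of experiment $i$, $f(m, q_i) - d_i$, is available through a hypothetical callback `residual_col(i)`: only the $n$ selected experiments are evaluated, and the sum of their squared residual norms is scaled by $s/n$ so that its expectation equals the full misfit (1.6). The function name and the test data are illustrative only.

```python
import numpy as np

def rs_misfit(residual_col, s, n, rng, replace=False):
    """Random subset estimate of the misfit (1.6) from n of the s experiments.

    residual_col(i) is a hypothetical callback returning f(m, q_i) - d_i;
    it is called only for the n selected indices, so the cost is n forward
    evaluations instead of s.
    """
    idx = rng.choice(s, size=n, replace=replace)   # uniform selection, no mixing
    return (s / n) * sum(np.sum(residual_col(i) ** 2) for i in idx)

# Stand-in residual matrix B = F(m) - D with l = 4 rows and s = 50 columns.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 50))
exact = np.sum(B ** 2)
one_estimate = rs_misfit(lambda i: B[:, i], s=50, n=10, rng=rng)
print(exact, one_estimate)
```

When all $s$ experiments are drawn without replacement, the estimate coincides with the exact misfit, which gives a simple sanity check.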
It will be shown that the RS method also provides an unbiased estimator of the misfit and can be applied in a wider variety of situations, compared to the SS method.

2.1 Stochastic Approximation to Misfit

Randomized algorithms that rely on efficiently approximating the misfit function $\phi(m)$ have been proposed and studied in [8, 22, 46, 71, 105, 134]. In effect, they draw upon estimating the trace of an implicit⁵ SPSD matrix. To see this, consider the misfit (1.6) and let $B = B(m) := F(m) - D$. It can be shown that

$$\phi(m) = \| B \|_F^2 = \mathrm{tr}(B^T B) = \mathbb{E}\big( \| B w \|_2^2 \big), \tag{2.1}$$

where $w$ is a random vector drawn from any distribution satisfying

$$\mathbb{E}( w w^T ) = I, \tag{2.2}$$

$\mathrm{tr}(A)$ denotes the trace of the matrix $A$, $\mathbb{E}$ denotes the expectation and $I \in \mathbb{R}^{s \times s}$ is the identity matrix. Hence, approximating the misfit function $\phi(m)$ in (1.6) is equivalent to approximating the corresponding matrix trace (or equivalently, approximating the above expectation). The standard approach for doing this is based on a Monte-Carlo method, where one generates $n$ random vector realizations, $w_j$, from any such suitable probability distribution and computes the empirical mean

$$\hat{\phi}(m, n) := \frac{1}{n} \sum_{j=1}^{n} \| B(m) w_j \|_2^2 \approx \phi(m). \tag{2.3}$$

Note that $\hat{\phi}(m, n)$ is an unbiased estimator of $\phi(m)$, as we have $\phi(m) = \mathbb{E}( \hat{\phi}(m, n) )$. Under Assumptions (A.1)-(A.3), if $n \ll s$ then this procedure yields a very efficient algorithm for approximating the misfit (1.6), because

$$\sum_{i=1}^{s} f(m, q_i) w_i = f\Big( m, \sum_{i=1}^{s} q_i w_i \Big), \tag{2.4}$$

which can be computed with a single evaluation of $f$ per realization of the random vector $w = (w_1, \ldots, w_s)^T$.

⁵ By “implicit matrix” we mean that the matrix of interest is not available explicitly: only information in the form of matrix-vector products for any appropriate vector is available.

In practice, one can choose any distribution for which (2.2) is satisfied.
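The estimator (2.3) combined with the linearity property (2.4) can be sketched as follows. This is an illustration under assumptions, not the implementation used in the thesis: the toy operator $f(m, q) = P (L_0 + \mathrm{diag}(m))^{-1} q$, the sizes and the function names are hypothetical, and i.i.d. $\pm 1$ weights are used as one distribution satisfying (2.2). Each weight vector $w_j$ costs one forward solve with the mixed source $Q w_j$, where the columns of $Q$ are the sources $q_i$.

```python
import numpy as np

def forward(m, L0, P, q):
    """f(m, q) = P L(m)^{-1} q for the toy L(m) = L0 + diag(m); q may hold several sources as columns."""
    return P @ np.linalg.solve(L0 + np.diag(m), q)

def ss_misfit(m, L0, P, Q, D, n, rng):
    """Monte-Carlo estimate (2.3) using n simultaneous random sources."""
    s = Q.shape[1]
    W = rng.choice([-1.0, 1.0], size=(s, n))     # columns w_j with E[w w^T] = I
    # Linearity (2.4): F(m) w_j = f(m, Q w_j), so n solves replace s solves.
    BW = forward(m, L0, P, Q @ W) - D @ W        # columns are B(m) w_j
    return np.sum(BW ** 2) / n

# Hypothetical sizes: l_u = 8, l = 5 receivers, s = 200 experiments.
rng = np.random.default_rng(1)
lu, l, s = 8, 5, 200
L0 = 2.0 * np.eye(lu) - np.eye(lu, k=1) - np.eye(lu, k=-1)
P = np.eye(lu)[:l]
Q = rng.standard_normal((lu, s))
D = forward(rng.uniform(0.5, 1.5, lu), L0, P, Q) + 0.01 * rng.standard_normal((l, s))
m = np.ones(lu)

exact = np.sum((forward(m, L0, P, Q) - D) ** 2)              # needs all s solves
estimates = [ss_misfit(m, L0, P, Q, D, n=10, rng=rng) for _ in range(400)]
print(exact, np.mean(estimates))
```

A single estimate with $n = 10$ can deviate noticeably from the exact value; the average over repeated draws is printed only to illustrate unbiasedness.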
Some popular choices of distributions for $w$ are described in detail in Section 2.1.1.

2.1.1 Selecting a Sampling Method

There are a few possible choices of probability distributions for $w$, among which the most popular ones are as follows.

(i) The Rademacher distribution [83], where the components of $w$ are independent and identically distributed (i.i.d.) with $\Pr(w_i = 1) = \Pr(w_i = -1) = \frac{1}{2}$ (referred to in what follows as the Hutchinson estimator, in deference to [22, 86]).

(ii) The standard normal distribution, $\mathcal{N}(0, I)$, is another possible choice and is henceforth referred to as the Gaussian estimator.

(iii) The unit vector distribution (in deference to [22]). Here, the vectors $w_j$ in (2.3) are uniformly drawn from the columns of the scaled identity matrix, $\sqrt{s}\, I$. Drawing these vectors can be done with or without replacement. Such an estimator is called the random subset method.

Distributions (i) and (ii) give rise to popular methods of simultaneous random sources [53, 71, 81, 90, 115, 126]. The methods of SS, when the linearity assumption (A.2) holds, yield very efficient estimators, as shown in (2.4). It can also be easily shown that, for a given sample size $n$, the variance of the Hutchinson estimator is smaller than that of the Gaussian estimator. However, relying solely on variance analysis can be misleading in determining the relative merit of each of these estimators; this is discussed in more detail in Chapter 5.

For an approximation using the unit vector distribution (iii), the linearity assumption (A.2) is no longer necessary: it boils down to selecting a random subset of the given experiments at each iteration, rather than their weighted combination. Within the context of reconstruction algorithms for inverse problems, this estimator was first introduced in [46]. In the absence of Assumption (A.2), such an RS estimator is the only one that can be applied.
However, as will be shown in Chapters 3 and 4, when the methods of SS apply, they provide a much more efficient and accurate⁶ approximation to the misfit, compared to the RS method.

⁶ A less efficient estimator is one for which more realizations of $w$ are required to achieve a desirable accuracy with the same likelihood. A less accurate estimator is one which, given the same sample size, is less likely to achieve a desirable accuracy.

The objective is to be able to generate as few realizations of $w$ as possible for achieving acceptable approximations to the misfit function. Estimates on how large $n$ must be, for a given distribution, to achieve a prescribed accuracy in a probabilistic sense are derived in Chapter 5.

2.1.2 Approximation with Generalized Noise Assumption

The stochastic approximation methods described in Section 2.1 can be similarly applied to the more general misfit functions, described in Section 1.2.1, under the noise assumptions (N.1) or (N.2). More specifically, the Monte-Carlo approximation, $\hat{\phi}^{(1)}(m, n)$, of $\phi^{(1)}(m)$ in (1.13) is precisely as in (2.3) but with $B(m) := C^{-1}( F(m) - D )$. Similarly, with $B(m) = ( F(m) - D ) C^{-1}$, we can again apply (2.3) to obtain a similar Monte-Carlo approximation, $\hat{\phi}^{(2)}(m, n)$, of $\phi^{(2)}(m)$ in (1.14).

Now, if $n \ll s$ then the unbiased estimators $\hat{\phi}^{(1)}(m, n)$ and $\hat{\phi}^{(2)}(m, n)$ are obtained with a similar efficiency as $\hat{\phi}(m, n)$. In the sequel, for notational simplicity, we just concentrate on $\phi(m)$ and $\hat{\phi}(m, n)$, but all the results hold almost verbatim also for (1.13) and (1.14).

2.2 Deterministic Approximation to Data

An alternative to stochastically approximating the misfit is to abandon randomization altogether, and instead select the mixing weights deterministically. Deterministic approaches for reducing the size of the original large data set have been proposed in [62, 68], which in effect are data compression approaches. These compression schemes remove redundancy in data, not through eliminating redundant data, but instead through some mixing of redundant data. A similar deterministic SS method to compress the data may be obtained upon applying truncated singular value decomposition (TSVD) to the data re-cast as the $l \times s$ matrix $D$ in (1.6), where in our context we have $s \gg l$. More specifically, as a pre-processing step, one can calculate the SVD as $D = U \Sigma V^T$, where $U \in \mathbb{R}^{l \times l}$ and $V \in \mathbb{R}^{s \times l}$ have orthonormal columns and $\Sigma \in \mathbb{R}^{l \times l}$ is the diagonal matrix of singular values. Now, one can effectively obtain an approximation to the original $D$ as $\hat{D} = D \hat{V} \in \mathbb{R}^{l \times n}$, where $\hat{V}$ is a matrix consisting of the first $n$ columns of $V$. As such, we can replace the original misfit with

$$\tilde{\phi}(m, n) := \frac{1}{n} \sum_{j=1}^{n} \| B(m) v_j \|_2^2, \tag{2.5}$$

where $v_j$ is the $j$th column of $V$. It should be noted that unlike $\hat{\phi}(m, n)$ in (2.3), the new misfit $\tilde{\phi}(m, n)$ is not an unbiased estimator of the original misfit, $\phi(m)$, as here $D$ is approximated and not $\phi(m)$.

If $n$ is large enough, this approach should bring out the essence of what is in the data, especially when the current iterate is far from the solution of the inverse problem. This approach can also be seen as denoising the original data, as it involves removing the components corresponding to small singular values. A plot of the singular values for a typical experiment (in the context of a DC resistivity problem) is depicted in Figure 2.1. The quick drop in the singular values suggests that just a few singular vectors (the first columns of the orthogonal matrix $U$) represent the entire data well.

Figure 2.1: The singular values of the data used in Example 3.2 of Section 3.3.

This simple method is suitable when both dimensions of the data matrix $D$ are not too large. The SVD is performed only once prior to the inversion computations.
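The TSVD pre-processing step can be sketched with NumPy as below. The synthetic data matrix with rapidly decaying singular values, the sizes, and the function names are assumptions made for illustration; they are not the data of Example 3.2.

```python
import numpy as np

def tsvd_weights(D, n):
    """Return the first n right singular vectors of the l x s matrix D and all singular values."""
    U, sing, Vt = np.linalg.svd(D, full_matrices=False)   # D = U diag(sing) Vt
    return Vt[:n].T, sing                                  # V_hat has shape (s, n)

def compressed_misfit(F, D, V_hat):
    """Deterministic misfit (2.5): (1/n) * sum_j ||(F - D) v_j||_2^2."""
    n = V_hat.shape[1]
    return np.sum(((F - D) @ V_hat) ** 2) / n

# Synthetic l x s data with singular values 1, 0.1, 0.01, ... (illustrative only).
rng = np.random.default_rng(2)
l, s, n = 6, 40, 2
U0, _ = np.linalg.qr(rng.standard_normal((l, l)))
V0, _ = np.linalg.qr(rng.standard_normal((s, l)))
D = U0 @ np.diag(10.0 ** -np.arange(l)) @ V0.T

V_hat, sing = tsvd_weights(D, n)
D_hat = D @ V_hat                       # compressed data, shape (l, n)
# Mapping back with V_hat^T gives the rank-n truncation of D; its error is
# the root sum of squares of the discarded singular values.
err = np.linalg.norm(D - D_hat @ V_hat.T, "fro")
print(D_hat.shape, err, np.sqrt(np.sum(sing[n:] ** 2)))
```

The two printed error values agree because $D \hat{V} \hat{V}^T$ is exactly the truncated SVD of $D$.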
Then, in the $k$th iteration of an optimization algorithm (such as stabilized GN in this thesis), the first few columns of $V$, corresponding to the largest singular values, provide fixed and deterministic weights for this SS method. Methods for choosing the number of such columns are discussed in Chapter 3.

2.3 GN Iteration on the Approximate Function

For the approximations (2.3) (or (2.5)), it is easy to return to a form like (1.6) and define sensitivity matrices $\hat{J}_i = \hat{J}_i(m, n)$ and gradient $\nabla_m \hat{\phi} = \nabla_m \hat{\phi}(m, n)$ analogously to (1.9) and (1.10), respectively. The GN iteration for (2.3) (or (2.5)) at a current iterate $m = m_k$ with $n_k$ random weight vectors $w_j$ in (2.3) (or deterministic weights $v_j$ in (2.5)) calculates the correction as the solution of the linear system

$$\Big( \sum_{i=1}^{n_k} \hat{J}_i^T \hat{J}_i \Big) \delta m = -\nabla_m \hat{\phi}, \tag{2.6a}$$

followed by the update

$$m_{k+1} = m_k + \alpha_k \delta m. \tag{2.6b}$$

Here, as in (1.11b), the step length, $\alpha_k$, $0 < \alpha_k \le 1$, is determined by a weak line search, ensuring sufficient decrease in the approximation $\hat{\phi}(m_{k+1}, n)$ as compared to $\hat{\phi}(m_k, n)$.

Again, applying stabilized GN, as described in Appendix A.3, we see that, for $K$ outer GN iterations, the total work estimate (in terms of the number of forward operator simulations) is approximated from below by

$$\text{Work Estimate} = 2(r + 1) \sum_{k=1}^{K} n_k, \tag{2.7}$$

which indicates how important it is to keep $n_k$ small; see [46]. Comparing (2.7) with (1.12) shows that if $n_k \ll s$, $\forall k$, then the computational complexity is greatly reduced.

In Chapter 3, stochastic reconstruction algorithms are proposed which heavily rely on the dimensionality reduction methods presented in Chapter 2. We also present randomized methods for controlling the sample size $n_k$ used in these algorithms.

Chapter 3
Stochastic Reconstruction Algorithms

In this chapter, we present our stochastic algorithms for approximately minimizing (1.6) or (1.7), and discuss their novel elements.
Here, we continue to make Assumptions (A.1) - (A.3); relaxation of the linearity assumption (A.2) is done in Chapter 4. All these algorithms rely heavily on the dimensionality reduction techniques and sampling methods described in the previous chapter. Under Assumption (A.2), we have, as described in Chapter 2, four methods for sampling the original data set, which may be fused and compared.

As discussed earlier, the GN iteration (1.11) is computationally prohibitive. Consequently, as an alternative, one can consider the GN iteration (2.6) performed on the modified objective. If $n_k \ll s$, then these iterations can be performed more efficiently. In what follows, we assume for simplicity that the iterations are performed on the approximation (2.3) of the misfit (1.6) using dynamic regularization (or iterative regularization [46, 78, 132]) where the regularization is performed implicitly. We then incorporate the deterministic approximation (2.5) as well. Extension of the resulting algorithms to the case (1.7) is straightforward. Hence, the update direction, $\delta m_k$, is calculated using the approximate misfit, $\hat{\phi}(m_k, n_k)$, defined in (2.3), where $n_k$ is the sample size used for this approximation in the $k$th iteration. However, since the iterations are performed on the modified objective function, the value of the original misfit might not necessarily be reduced. As such, any recovered model might not fit the original data appropriately. Thus, in each iteration, we need to check or assess whether the value of the original objective is also decreased using this new iterate. The challenge is to do this, as well as to check for termination of the iteration process, with a minimal number of evaluations of the prohibitively expensive original misfit function (1.6).

The papers cited in Section 2.1.1 appear to assume one purpose for the approximate evaluation of the misfit function $\phi(m)$, and that is solely in (1.11a).
In contrast, in Section 3.1, we identify two additional purposes for this task, and furthermore we show that these different purposes may well require different estimation methods. An additional fourth purpose will be introduced in Chapter 4 and further modified in Chapter 7.

The question of selecting the sample size $n_k$ is addressed in Section 3.2. We propose two new algorithms which allow $n_k$ to be very small for small $k$, and potentially significantly increase as the iteration approaches a solution. Algorithm 1 in Section 3.2.1 has the advantage of being simple, and it generates an exponentially increasing sequence of $n_k$ values. Algorithm 2 in Section 3.2.2 uses cross validation in a manner similar to but not the same as that proposed in [46], and it generates a potentially more moderately increasing sequence of $n_k$ values. The latter algorithm is particularly useful when $s$ is “too large” in the sense that even near a satisfactory solution for the given inverse problem, far fewer than $s$ experiments are required to satisfy the given error tolerances, a situation we qualitatively refer to as embarrassing redundancy. Within the context of these two algorithms, we compare the resulting weighting methods of Section 2.1.1 against the more generally applicable random subset method proposed in [46], and find that the three simultaneous sources methods are roughly comparable and are better than the random subset method by a factor of roughly 2 or more.

The computational work in Section 3.3 is done in the context of a DC resistivity problem. This is a simpler forward problem than low-frequency Maxwell’s equations, and yet it reflects a similar spirit and general behaviour, allowing us to concentrate on the issues in focus here. A description of the implementation details is given in Appendices A.1, A.2, and A.3.

3.1 Two Additional Reasons for Unbiased Estimators

As described in Chapter 2, the original expensive misfit can be replaced by a computationally cheaper one, either stochastically or deterministically. One purpose of forming such a modified objective function is to use it in the iterations (2.6). Here we identify and justify two additional reasons for which a stochastic approximate misfit (i.e., an unbiased estimator) is used. A fourth purpose will be introduced in Chapter 4 and further modified in Chapter 7.

3.1.1 Cross Validation

It is desirable that after every iteration of any optimization method (such as GN), the value of the misfit (1.6) (or the regularized objective (1.7)) decreases (perhaps sufficiently). Mechanisms such as line search are used to enforce this property. More specifically, it is desired that at the $k$th iteration and after the update, we get

$$\phi(m_{k+1}) \le \kappa \phi(m_k), \tag{3.1}$$

for some $\kappa \le 1$, which indicates sufficient decrease in the misfit (or the objective $\phi_{R,\alpha}$ in the case of (1.7)). Unfortunately, as argued before, such a test using the evaluation of the entire misfit is computationally prohibitive. However, since $\hat{\phi}(m_{k+1}, n_k)$ is an unbiased estimator of $\phi(m_{k+1})$ with $n_k \ll s$, we can approximate the assessment of the updated iterate in terms of sufficient decrease in the objective function using a control set of random combinations of measurements. More specifically, at the $k$th iteration with the new iterate $m_{k+1}$, we test whether the condition

$$\hat{\phi}(m_{k+1}, n_k) \le \kappa \hat{\phi}(m_k, n_k) \tag{3.2}$$

(cf. (2.3)) holds for some $\kappa \le 1$. The condition (3.2) is an independent, unbiased indicator of (3.1), and the success of (3.2) is an indicator that (3.1) is likely to be satisfied as well. However, for now, the test (3.2) is only left as a heuristic indicator of (3.1). As such, for the rest of this chapter, the sample size $n_k$ used in (3.2) is chosen heuristically, but in Chapter 7 we will make this choice mathematically rigorous, where the uncertainty in the test (3.2) is quantified.
For example, we will develop tools to assess the probability of the success of (3.1), given the success of (3.2).

3.1.2 Stopping Criterion and Uncertainty Check

The usual stopping criterion for terminating the iterative process for data fitting (cf. Section 1.1) is to check, after the update in the $k$th iteration, whether

$$\phi(m_{k+1}) \le \rho, \tag{3.3}$$

for a given tolerance $\rho$, with $\phi(m_{k+1})$ not being much smaller than $\rho$. This is done either to avoid under-fitting/over-fitting of the noise, or as part of an explicit constraint such as in (1.8). For instance, consider the simplest case where for all experiments there is a Gaussian noise distribution for which the (same) standard deviation $\sigma$ is known. Thus $D = D^* + \sigma N$, where $D^* = F(m^*)$, with $N$ an $l \times s$ matrix of i.i.d. standard Gaussian entries. We wish to terminate the algorithm when (1.6) falls below some multiple $\eta \gtrsim 1$ of the noise level squared, i.e. $\sigma^2 \|N\|_F^2$. Since the noise is not known, following the celebrated Morozov discrepancy principle [52, 91, 107, 135], we replace $\|N\|_F^2$ by its expected value, $sl$, obtaining

$$\rho = \eta \sigma^2 s l.$$

Unfortunately, however, the mere calculation of $\phi(m_{k+1})$ requires $s$ evaluations of the computationally expensive forward operators. We therefore wish to perform this check as rarely as possible. Fortunately, as discussed before, we have in $\hat{\phi}(m_{k+1}, n_k)$ a good, unbiased estimator of $\phi(m_{k+1})$ with $n_k \ll s$. Thus, in the course of an iteration we can perform the relatively inexpensive uncertainty check of whether

$$\hat{\phi}(m_{k+1}, n_k) \le \rho. \tag{3.4}$$

This is like the stopping criterion, but in expectation. If (3.4) is satisfied, it is an indication that (3.3) is likely to be satisfied as well, so we check the expensive (3.3) only then.
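To make the uncertainty check concrete, the following minimal sketch computes the discrepancy-principle tolerance $\rho = \eta \sigma^2 s l$ and evaluates a Hutchinson-type estimate of the misfit. Everything here is a toy assumption: the sizes, the noise level, the helper names, and above all the residual `B`, which is formed explicitly as pure noise to mimic an iterate that already fits the data. In a real inversion the product `B @ W` would instead come from $n_k$ forward solves with mixed sources.

```python
import numpy as np

def hutchinson_weights(s, n, rng):
    """s x n matrix of i.i.d. Rademacher (+1 / -1) entries."""
    return rng.choice([-1.0, 1.0], size=(s, n))

def estimated_misfit(B, W):
    """Unbiased Monte-Carlo estimate (1/n) * ||B W||_F^2 of ||B||_F^2."""
    return np.linalg.norm(B @ W, "fro") ** 2 / W.shape[1]

def discrepancy_tolerance(eta, sigma, s, l):
    """Tolerance rho = eta * sigma^2 * s * l from the discrepancy principle."""
    return eta * sigma**2 * s * l

rng = np.random.default_rng(1)
l, s, sigma, eta = 50, 400, 0.1, 1.2

# Hypothetical residual F(m) - D at an iterate that fits the data down to the noise.
B = sigma * rng.standard_normal((l, s))

rho = discrepancy_tolerance(eta, sigma, s, l)
W = hutchinson_weights(s, 50, rng)           # n_k = 50 << s mixed experiments
phi_hat = estimated_misfit(B, W)             # cheap estimate used in (3.4)
phi_full = np.linalg.norm(B, "fro") ** 2     # expensive quantity used in (3.3)

# Only evaluate the full misfit when the cheap check succeeds.
uncertainty_check = phi_hat <= rho
stop = uncertainty_check and (phi_full <= rho)
```

The short-circuit in the last line mirrors the intent of the text: the full misfit is consulted only after the cheap estimate has fallen below the tolerance.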
Similarly to the condition (3.2), for the rest of this chapter, the sample size $n_k$ used in (3.4) is chosen heuristically, but its selection is made mathematically rigorous in Chapter 7.

Note that, for the uncertainty check and cross validation steps, since we want an unbiased estimator of the objective, the approximation must be constructed stochastically, as described in Section 2.1, and not deterministically.

3.2 Adaptive Selection of Sample Size

In this section we describe two algorithms for determining the sample size $n_k$ in the $k$th stabilized GN iteration. Algorithm 1 adapts $n_k$ in a brute force manner. Algorithm 2 uses a cross validation technique to avoid situations in which $n_k$ grows too rapidly or becomes larger than necessary.

3.2.1 Sample Size Selection Using Uncertainty Checks

While the management strategy of $n_k$ in this algorithm is simply to increase it so long as (3.3) is not met, its novelty lies in the fusion of different strategies for selecting the weight matrices at different stages of each iteration. Our algorithm consists of three main steps: (i) data fitting: a stabilized GN outer iteration (2.6); (ii) uncertainty check: a check for condition (3.4); and (iii) depending on the outcome of the uncertainty check, either a sample size adjustment or a stopping criterion check for termination.

Algorithm 1 Solve inverse problem using uncertainty check
Given: sources $Q = [q_1 \, q_2 \cdots q_s]$, measurements $D = [d_1 \, d_2 \cdots d_s]$, stopping criterion level $\rho$ (i.e. the desired misfit) and initial guess $m_0$.
Initialize: $m = m_0$, $n_0 = 1$.
for $k = 0, 1, 2, \ldots$ until termination do
    - Choose $n_k$ weight vectors stochastically (or deterministically) as described in Section 2.1 (or Section 2.2).
    - Fitting: Perform one stabilized GN iteration approximating (2.6), with $n = n_k$.
    - Choose $n_k$ weight vectors stochastically as described in Section 2.1.
    - Uncertainty Check: Compute (3.4) using $m_{k+1}$ and the above $n_k$ weight vectors.
    if Uncertainty Check holds then
        - Stopping Criterion: Compute (3.3) with $m_{k+1}$. Terminate if it holds.
    else
        - Sample Size Increase: Increase $n_{k+1}$, for example set $n_{k+1} = \min(2 n_k, s)$.
    end if
end for

The exponential growth of the sample size in Algorithm 1 can be theoretically appealing, as such a schedule (unlike keeping $n_k$ fixed) enables the general convergence theory of [60]. However, in cases where there is embarrassing redundancy in the set of experiments, it may not be desirable for the sample size to grow so rapidly and in an unchecked manner, as we could end up using far more experiments than are actually needed. Some mechanism is required to control the growth of the sample size, and one such mechanism is proposed next.

3.2.2 Adaptive Selection of Sample Size Using Cross Validation

For monitoring the growth of $n_k$ more closely, one strategy is to compare the objective function $\phi$ at the current iterate to its value at the previous iterate, effectively checking the test (3.1), and to increase the sample size if there is no sufficient decrease. Unfortunately, evaluating the test (3.1) exactly defeats the purpose (in Section 3.3 the total cost of the reconstruction algorithm is typically a small multiple of just one evaluation of $\phi$). Fortunately, however, using the cross validation test (3.2), described in Section 3.1.1, we can get a handle on how the objective function is likely to behave.
In other words, the role of the cross validation step within an iteration is to assess whether the true objective function at the current iterate has (sufficiently) decreased compared to the previous one. If this test fails, we deem that the current sample size is not sufficiently large to yield an update that decreases the original objective, and the fitting step needs to be repeated using a larger sample size. A method of this sort, based on “cross validation”, is proposed in [46] together with a Random Subset method. Here we generalize and adapt this technique in the present context.

Thus, the following algorithm involves the steps of Algorithm 1, with an additional check for a sufficient decrease in the estimate (2.3) using another, independently selected weight matrix. Only if this test is violated do we increase the sample size.

Algorithm 2 Solve inverse problem using uncertainty check and cross validation
Given: sources $Q = [q_1 \, q_2 \cdots q_s]$, measurements $D = [d_1 \, d_2 \cdots d_s]$, stopping criterion level $\rho$ (i.e. the desired misfit) and initial guess $m_0$.
Initialize: $m = m_0$, $n_0 = 1$.
for $k = 0, 1, 2, \ldots$ until termination do
    - Choose $n_k$ weight vectors stochastically (or deterministically) as described in Section 2.1 (or Section 2.2).
    - Fitting: Perform one stabilized GN iteration approximating (2.6), with $n = n_k$.
    - Choose $n_k$ weight vectors stochastically as described in Section 2.1.
    if $\hat{\phi}(m_{k+1}, n_k) \le \kappa \hat{\phi}(m_k, n_k)$, i.e., Cross Validation is satisfied then
        - Uncertainty Check: Compute (3.4) using $m_{k+1}$ and the above $n_k$ weight vectors.
        if Uncertainty Check holds then
            - Stopping Criterion: Compute (3.3) with $m_{k+1}$. Terminate if it holds.
        end if
    else
        - Sample Size Increase: Increase $n_{k+1}$, for example set $n_{k+1} = \min(2 n_k, s)$.
    end if
end for

Note that our use of the term “cross validation” does not necessarily coincide with its usual meaning in statistics. But the procedure retains the sense of a control set and this name is convenient.
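The control flow of Algorithm 2 can be sketched as a small driver that takes the expensive pieces as callables. This is a hedged illustration only: the function name `algorithm2_sketch` and the callables `gn_step`, `draw_weights`, `est_misfit` and `full_misfit` are hypothetical placeholders for the stabilized GN step, the weight sampling of Section 2.1, the estimator (2.3) and the full misfit (1.6). The sketch also assumes that the new iterate is kept even when cross validation fails, which the listing above does not spell out.

```python
def algorithm2_sketch(m0, s, rho, gn_step, draw_weights, est_misfit,
                      full_misfit, kappa=1.0, max_iter=50):
    """Sample-size control with cross validation and uncertainty check.

    Returns the final model and the list of sample sizes n_k that were used.
    """
    m, n = m0, 1
    n_history = []
    for _ in range(max_iter):
        n_history.append(n)
        # Fitting: one (stabilized) GN step on the approximate misfit.
        m_new = gn_step(m, draw_weights(n))
        # Independent control set of n weight vectors for both checks.
        W_ctl = draw_weights(n)
        phi_old = est_misfit(m, W_ctl)
        phi_new = est_misfit(m_new, W_ctl)
        if phi_new <= kappa * phi_old:
            # Cross validation passed: cheap uncertainty check first,
            # expensive stopping criterion only if that succeeds.
            if phi_new <= rho and full_misfit(m_new) <= rho:
                return m_new, n_history
        else:
            # Cross validation failed: enlarge the sample, capped at s.
            n = min(2 * n, s)
        m = m_new  # assumption: the iterate is kept in either branch
    return m, n_history
```

Dropping the cross validation branch, so that the sample size doubles whenever the uncertainty check fails, would give the simpler schedule of Algorithm 1.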
The performance of Algorithm 2 is not automatically better than that of Algorithm 1. Indeed, it is possible to generate examples where cross validation is not necessary, as the computations in Section 3.3 demonstrate. However, it provides an important safety mechanism.

3.3 Numerical Experiments

3.3.1 The EIT/DC Resistivity Inverse Problem

Our experiments are performed in the context of solving the EIT/DC resistivity problem (e.g., [33, 39, 46, 47, 71, 72, 111, 128]). We have made this choice since exploiting many data sets currently appears to be particularly popular in exploration geophysics, and our examples, in this thesis, can be viewed as mimicking a DC resistivity setup. Note that the PDE model for EIT is identical to that of DC resistivity and the main difference is in the experimental setup.

Consider a linear PDE of the form

$$\nabla \cdot (\mu(x) \nabla u) = q(x), \quad x \in \Omega, \tag{3.5a}$$

where $\Omega \subset \mathbb{R}^d$, $d = 2$ or $3$, and $\mu$ is a conductivity function which may be rough (e.g., discontinuous) but is bounded away from 0: there is a constant $\mu_0 > 0$ such that $\mu(x) \ge \mu_0$, $\forall x \in \Omega$. This elliptic PDE is subject to the homogeneous Neumann boundary conditions

$$\frac{\partial u}{\partial n} = 0, \quad x \in \partial\Omega. \tag{3.5b}$$

For $\Omega$, we will consider a unit square or a unit cube. The inverse problem is to recover $\mu$ in $\Omega$ from sets of measurements of $u$ on the domain’s boundary for different sources $q$. This is a notoriously difficult problem in practice, so it may be useful to inject some a priori information on $\mu$, when such is available, via a parametrization of $\mu(x)$ in terms of $m(x)$ using an appropriate transfer function $\psi$ as $\mu(x) = \psi(m(x))$. For example, $\psi$ can be chosen so as to ensure that the conductivity stays positive and bounded away from 0, as well as to incorporate bounds, which are often known in practice, on the sought conductivity function.
Some possible choices of the function $\psi$ are described in Appendix A.2.

3.3.2 Numerical Experiments Setup

The experimental setting we use is as follows: for each experiment $i$ there is a positive unit point source at $x_1^i$ and a negative sink at $x_2^i$, where $x_1^i$ and $x_2^i$ denote two locations on the boundary $\partial\Omega$. Hence in (3.5) we must consider sources of the form $q_i(x) = \delta(x - x_1^i) - \delta(x - x_2^i)$, i.e., a difference of two $\delta$-functions.

For our experiments in 2D, when we place a source on the left boundary, we place the corresponding sink on the right boundary in every possible combination. Hence, having $p$ locations on the left boundary for the source would result in $s = p^2$ experiments. The receivers are located at the top and bottom boundaries. No source or receiver is placed at the corners.

In 3D we use an arrangement whereby four boreholes are located at the four edges of the cube, and source and sink pairs are put at opposing boreholes in every combination, except that there are no sources at the points of intersection of the boreholes and the surface, i.e., at the top four corners, since these four nodes are part of the surface where data values are gathered.

In the sequel we generate data $d_i$ by using a chosen true model (or ground truth) and a source-receiver configuration as described above. Since the field $u$ from (3.5) is only determined up to a constant, only voltage differences are meaningful. Hence we subtract for each $i$ the average of the boundary potential values from all field values at the locations where data is measured. As a result each row of the projection matrix $P$ has zero sum. This is followed by peppering these values with additive Gaussian noise to create the data $d_i$ used in our experiments.
Specifically, for an additive noise of 3%, say, denoting the “clean data” $l \times s$ matrix by $D^*$, we reshape this matrix into a vector $d^*$ of length $sl$, calculate the standard deviation $sd = 0.03 \|d^*\| / \sqrt{sl}$, and define $D = D^* + sd * \mathrm{randn}(l, s)$ using Matlab’s random generator function randn.

For all numerical experiments, the “true field” is calculated on a grid that is twice as fine as the one used to reconstruct the model. For the 2D examples, the reconstruction is done on a uniform grid of size $64^2$ with $s = 961$ experiments in the setup described above, and we used $\eta = 1.2$. For our 3D examples, the reconstruction is done on a uniform grid of size $17^3$ with $s = 512$ experiments, and we set $\eta = 1.5$.

In Section 3.3.3 below, for the first three examples we use the transfer function (A.5) with $\mu_{\max} = 1.2 \max \mu(x)$ and $\mu_{\min} = 0.83 \min \mu(x)$. In the ensuing calculations we then “forget” what the exact $\mu(x)$ is. Further, we set the PCG iteration limit to $r = 20$, and the PCG tolerance to $10^{-3}$. The initial guess is $m_0 = 0$. Our last example is carried out using the level set method (A.6). Here we can set $r = 5$, significantly lower than above. The initial guess for the level set examples is displayed in Figure 3.1.

Figure 3.1: Example 3.4, initial guess for the level set method.

In addition to displaying the log conductivities (i.e., $\log(\mu)$) for each reconstruction, we also show the log-log plot of the misfit on the entire data (i.e. $\|F(m) - D\|_F$) vs. PDE count. A table of total PDE counts (not including what extra is required for the plots) for each method is displayed.
In this table, as a point of reference, we also include the total PDE count using the “plain vanilla” stabilized Gauss-Newton method which employs the entire set of experiments at every iteration.

We emphasize that, much as the rows in the work-unit table are easier to examine in order to determine which method is more efficient, it is important to also consult the corresponding data misfit plots, especially when the comparison is between relatively close quantities. This is so because one evaluation of the stopping criterion consumes a significant fraction of the total PDE count in each case, so an extra check that can randomly occur for a given experiment in one method and not another may affect the work table far more than the misfit figures. In particular, the performance of the Hutchinson vs. Gauss estimators was found to be comparable in almost all experiments below.

Finally, before we turn to the numerical results let us comment on the expected general quality of such reconstructions. The quantifiers “good” and “acceptable” are relative concepts here. Our 3D experiments mimic DC geophysics surveys, where a reconstruction is considered good and acceptable if it generally looks like the true model, even remotely so. This is very different from the meaning of similar quantifiers in image denoising, for instance.

3.3.3 Numerical Experiments Comparing Eight Method Variants

In each of the four examples below we apply Algorithm 1 and Algorithm 2 with $\kappa = 1$; smaller values of $\kappa$ would result in more aggressive increases of the sample size between one stabilized GN iteration and the next.

Furthermore, for convenience of cross reference, we gather all resulting eight work counts in Table 3.1 below. The corresponding entries of this table should be read together with the misfit plots for each example, though.

Example   Alg   Vanilla    Rand. Sub.   Hutch.   Gauss.   TSVD
3.1       1     86,490     3,788        1,561    1,431    2,239
          2                3,190        2,279    1,618    2,295
3.2       1     128,774    5,961        3,293    3,535    3,507
          2                3,921        2,762    2,247    2,985
3.3       1     36,864     6,266        1,166    1,176    1,882
          2                11,983       3,049    2,121    2,991
3.4       1     45,056     1,498        1,370    978      1,560
          2                2,264        1,239    896      1,656

Table 3.1: Work in terms of number of PDE solves for Examples 3.1–3.4. The “Vanilla” count is independent of the algorithms described in Section 3.2.

Example 3.1. In this example, we place two target objects of conductivity $\mu_I = 1$ in a background of conductivity $\mu_{II} = 0.1$, and 3% noise is added to the data: see Figure 3.2(a). The reconstructions in Figures 3.2 and 3.3 are comparable.

Figure 3.2: Example 3.1, reconstructed log conductivity using Algorithm 1 and the four methods of Section 2.1.1. Panels: (a) True model, (b) Rand. Sub., (c) Gaussian, (d) Hutchinson, (e) TSVD.

Figure 3.3: Example 3.1, reconstructed log conductivity using Algorithm 2 and the four methods of Section 2.1.1. Panels: (a) True model, (b) Rand. Sub., (c) Gaussian, (d) Hutchinson, (e) TSVD.

From Table 3.1 we see that all our methods offer vast improvements over the plain Vanilla method. Furthermore, the Random Subset method reduces the objective (i.e., misfit) function at a slower rate, requiring roughly twice as many PDE solves compared to the other methods of Section 2.1.1. Consulting also Figure 3.4, observe in addition that although the final PDE count for TSVD is slightly larger than for Hutchinson and Gaussian, it reduces the misfit at a faster, though comparable, rate. In fact, if we were to stop the iterations at higher noise tolerances then the TSVD method would have outperformed all others. In repeated similar tests, we have observed that the performance of Hutchinson and Gaussian is comparable.

Finally, comparing the first two rows of Table 3.1 and the subplots of Figure 3.4, it is clear that the performance of Algorithms 1 and 2 is almost the same.

Figure 3.4: Data misfit vs. PDE count for Example 3.1. Panels: (a) Algorithm 1, (b) Algorithm 2.

Example 3.2. For this example, we merely swap the conductivities of the previous one, see Figure 3.5(a), and add the lower amount of 1% noise to the “exact data”. The reconstruction results in Figures 3.5 and 3.6 are comparable. The performance indicators are gathered in Table 3.1 and Figure 3.7.

Figure 3.5: Example 3.2, reconstructed log conductivity using Algorithm 1 and the four methods of Section 2.1.1. Panels: (a) True model, (b) Rand. Sub., (c) Gaussian, (d) Hutchinson, (e) TSVD.

Figure 3.6: Example 3.2, reconstructed log conductivity using Algorithm 2 and the four methods of Section 2.1.1. Panels: (a) True model, (b) Rand. Sub., (c) Gaussian, (d) Hutchinson, (e) TSVD.

Figure 3.7: Data misfit vs. PDE count for Example 3.2. Panels: (a) Algorithm 1, (b) Algorithm 2.

Note that since in this example the noise is reduced compared to the previous one, more PDE solves are required. Similar observations to all those made for Example 3.1 apply here as well, except that using the cross validation algorithm results in a notable reduction in PDE solves.

Figure 3.8: True Model for Examples 3.3 and 3.4. The left panel shows 2D equi-distant slices in the z direction from top to bottom, the right panel depicts the 3D volume.

Example 3.3. In this 3D example, we place a target object of conductivity $\mu_I = 1$ in a background with conductivity $\mu_{II} = 0.1$. See Figure 3.8, whose caption also explains what other plots for 3D runs depict.
A 2% noise is added to the “exact” data.

Figure 3.9: Example 3.3, reconstructed log conductivity for the 3D model using Algorithm 1 and (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD. In each pair, the first panel shows slices and the second the 3D view.

The reconstruction quality for all eight method variants, see Figures 3.9 and 3.10, appears less clean than in our other examples; however, the methods are comparable in this regard, which allows us to concentrate on their comparative efficiency. It should be noted that no attempt was made here to “beautify” these results by post-processing, a practice not unheard of for hard geophysical inverse problems. Better reconstructions are obtained in the next example, which employs more a priori information and higher contrast.

In cases where more experiments are needed, the differences among the sampling methods are even more pronounced. This 3D example is one such case. All of the methods (excluding Vanilla) ended up using half of the experiments (i.e., $n_k \approx 0.5 s$) before termination. Clearly, the Random Subset method is far outperformed by the other three, see Table 3.1 and Figure 3.13. This is one example where Algorithm 1 achieves reconstructions of similar quality but more cheaply than Algorithm 2. This is so because in this case there is little embarrassing redundancy, i.e., larger sample sizes are needed to achieve the desired misfit, hence growing the sample size at a faster rate leads to an efficient algorithm. The sample size using cross validation grows more slowly, and relatively many GN iterations are performed using small sample sizes where each iteration decreases the misfit only slightly. These added iterations result in a larger total PDE solve count.

Example 3.4.
This one is the same as Example 3.3, except that we assume that additional prior information is given, namely, that the sought model consists of piecewise constant regions with conductivity values $\mu_I$ and $\mu_{II}$. This mimics a common situation in practice. So we reconstruct using the level set method (A.6), which significantly improves the quality of the reconstructions: compare Figures 3.11 and 3.12 to Figures 3.9 and 3.10.

Here we observe less difference among the various methods. Specifically, in repeated experiments, the Random Subset method is no longer clearly the worst, see Table 3.1 and Figure 3.14. The numbers in the last row of Table 3.1 might be deceiving at first glance, as Random Subset seems to be worse than the rest; however, the graph of the misfit in Figure 3.14 reflects a more complete story. At some point in between the final PDE counts for Hutchinson and TSVD, the Random Subset misfit falls below the desired tolerance; however, the uncertainty check at that iterate results in a “false negative” which in turn does not trigger the stopping criterion. This demonstrates the importance of having a very good and reliable trace estimator in the uncertainty check. For all our eight algorithm variants and in all of our examples, we used the Hutchinson trace estimator for this purpose, as it has the smallest variance. And yet, one wrong estimate could result in additional, unnecessary GN iterations, leading to more PDE solves. False positives, on the other hand, trigger an unnecessary stopping criterion evaluation, which results in more PDE solves to calculate the misfit on the entire data set. For this example it was also observed that typically the Gaussian method outperforms Hutchinson by a factor of roughly 1.5.

Figure 3.10: Example 3.3, reconstructed log conductivity for the 3D model using Algorithm 2 and (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD. In each pair, the first panel shows slices and the second the 3D view.

Figure 3.11: Example 3.4, reconstructed log conductivity for the 3D model using the level set method with Algorithm 1 and with (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD. In each pair, the first panel shows slices and the second the 3D view.

Figure 3.12: Example 3.4, reconstructed log conductivity for the 3D model using the level set method with Algorithm 2 and with (a,b) Random Subset, (c,d) Gaussian, (e,f) Hutchinson, and (g,h) TSVD. In each pair, the first panel shows slices and the second the 3D view.

Figure 3.13: Data misfit vs. PDE count for Example 3.3. Panels: (a) Algorithm 1, (b) Algorithm 2.

Figure 3.14: Data misfit vs. PDE count for Example 3.4. Panels: (a) Algorithm 1, (b) Algorithm 2.

3.4 Conclusions

In this chapter we have developed and compared several highly efficient stochastic algorithms for the solution of inverse problems involving computationally expensive forward operators described in Section 1.1 in the presence of many measurements or experiments $s$. Two algorithms for controlling the size $n_k \le s$ of the data set in the $k$th stabilized GN iteration have been proposed and tested. For each, four methods of sampling the original data set, three stochastic and one deterministic, discussed in Chapter 2, can be used, making for a total of eight algorithm variants.
Our algorithms are known to converge under suitable circumstances because they satisfy the general conditions in [36, 60]. The numerical experiments are done specifically in the context of DC resistivity.

It is important to emphasize that any of these algorithms is far better than a straightforward utilization of all experiments at each GN iteration. This is clearly borne out in Table 3.1. Note further that in order to facilitate a fair comparison we chose a fixed number of PCG inner iterations, ignoring the adaptive Algorithm 1 of [46], even though that algorithm can impact performance significantly. We also utilized for the sake of fair comparison a rather rigid (and expensive) stopping criterion; this will be eased off in future chapters. Further, we use the Hutchinson estimator for the uncertainty check in all methods, thus making them all stochastic. In particular, TSVD may not be used in (3.4) because it does not lead to an unbiased estimator for the objective function $\phi$.

Inverse problems with many measurements arise in different applications which may have very different solution sensitivity to changes in the data (e.g., the full waveform inversion, although having other big difficulties in its solution process, is far less sensitive in this sense than DC resistivity). But in any case, it is an accepted working assumption that more data can only help and not hurt the conditioning of the problem being solved. This then gives rise to the question whether our model reduction techniques may worsen the conditioning of the given problem. We have not observed any such effect in our experiments (and our “Vanilla” reconstructions in Section 3.3 are never better, or sharper, than the other, cheaper ones).
In a sense it could be argued that a good model reduction algorithm actually covers approximately the same grounds as the full data problem, so it achieves a similar level of solution sensitivity to data.

As demonstrated in Examples 3.2 and 3.3, neither Algorithm 1 nor Algorithm 2 is always better than the other, and they often both perform well. Their relative performance depends on circumstances that can occasionally be distinguished before committing to calculations. Specifically, if there are relatively few data sets, as in Example 3.3, then Algorithm 1 is preferable, being both simpler and occasionally faster. On the other hand, if $s$ is very large, the data having been massively calculated without much regard to experimental design considerations (as is often the case in geophysical exploration applications), then this may naturally lead to a case of embarrassing redundancy, and caution alone dictates using Algorithm 2.

The three methods of simultaneous sources, namely, Hutchinson, Gaussian and TSVD, are comparable (ignoring the cost of SVD computation), and no definitive answer can be given as to which is better for the model reduction. Further, especially when the level set method may not be used, we have found the methods of simultaneous sources to be consistently more efficient than the Random Subset method of [46], roughly by a factor of two or more. However, as mentioned before, SS methods can only be applied when the linearity assumption (A.2) is justified. In the absence of Assumption (A.2), one is restricted to the less efficient RS method. Within the context of the PDE constrained inverse problem (1.4), this means that the projection matrices $P_i$ depend on $i$. That, in turn, raises the question whether the linearity assumption (A.2) can somehow be relaxed, thus allowing use of the faster methods of SS.
This is the subject of Chapter 4.

Chapter 4
Data Completion

In Chapter 3, for the case where Assumptions (A.1) - (A.3) are valid, different methods of simultaneous sources are obtained by using different algorithms for this model and data reduction process. There, we have discussed and compared three such methods: (i) a Hutchinson random sampling, (ii) a Gaussian random sampling, and (iii) the deterministic truncated singular value decomposition (TSVD). We have found that, upon applying these methods, their performance was roughly comparable (although for just estimating the misfit function by (2.3), only the stochastic methods work well).

However, in situations where Assumption (A.2) is violated, none of the SS methods apply. Such situations arise, for example, when parts of measurements are missing or data is partially corrupted. In these cases, the random subset method can still be considered, where a random subset of the original experiments is selected at each iteration $k$, as the application of this method does not require the linearity assumption (A.2). However, as it was shown in Chapter 3, its performance is generally worse than the methods of simultaneous sources, roughly by a factor between 1 and 4, and on average about 2.⁷ It is, in fact, possible to construct examples where a reconstruction algorithm using the RS method performs remarkably worse (much more than a factor of 4) than a similar SS based algorithm.

This brings us to the quest of the present Chapter, namely, to seek methods for the general case where Assumption (A.2) does not hold, which are as efficient as the simultaneous sources methods. The tool employed for this is to “fill in missing or replace corrupted data”, thus restoring the linearity assumption (A.2). More specifically, the problem is transformed such that the original forward operators, $f(m, q_i)$, are extended to ones which are linear in $q$.
For example, in PDE constrained inverse problems with the forward operators defined as in (1.4), i.e., where $P_i$ does depend on $i$, the goal is to replace $P_i$, for each $i$, by a common projection matrix $P$ to the union of all receiver locations, $i = 1, \ldots, s$, effectively transforming the problem into one with forward operators of the form (1.5). For the rest of this chapter, we only consider the case of data completion, but application to data replacement is almost identical.

⁷ The relative efficiency factor further increases if a less conservative criterion is used for algorithm termination; see Section 4.3.

The prospect of such data completion, like that of casting a set of false teeth based on a few genuine ones, is not necessarily appealing, but is often necessary for reasons of computational efficiency. Moreover, applied mathematicians do a virtual data completion automatically when considering a Dirichlet-to-Neumann map, for instance, because such maps assume knowledge of the field $u$ (see, e.g., (3.5)) or its normal derivative on the entire spatial domain boundary, or at least on a partial but continuous segment of it. Such knowledge of noiseless data at uncountably many locations is never available in practice, where receivers are discretely located and some noise, including data measurement noise, is unavoidable. On the other hand, it can be argued that any practical data completion must inherently destroy some of the “integrity” of the statistical modeling underlying, for instance, the choice of iteration stopping criterion, because the resulting “generated noise” at the false points is not statistically independent of the genuine ones where data was collected.

Indeed, the problem of proper data completion is far from being a trivial one, and its inherent difficulties are often overlooked by practitioners.
In this chapter we consider this problem in the context of the DC-resistivity problem (Section 4.1.3), with the sources and receivers for each data set located at segments of the boundary $\partial\Omega$ of the domain on which the forward PDE is defined. Forward operators are as defined in (1.4). Our data completion approach is to approximate or interpolate the given data directly in smooth segments of the boundary, while taking advantage of prior knowledge as to how the fields $u_i$ must behave there. We emphasize that the sole purpose of our data completion algorithms is to allow the set of receivers to be shared among all experiments. This can be very different from traditional data completion efforts that have sought to obtain extended data throughout the physical domain’s boundary or even in the entire physical domain. Our “statistical crime” with respect to noise independence is thus far smaller, although still existent.

We have tested several regularized approximations on the set of examples of Section 4.3, including several DCT [92], wavelet [101] and curvelet [49] approximations (for which we had hoped to leverage the recent advances in compressive sensing and sparse $\ell_1$ methods [50, 58]) as well as straightforward piecewise linear data interpolation. However, the latter is well-known not to be robust against noise, while the former methods are not suitable in the present context, as they are not built to best take advantage of the known solution properties. The methods which proved winners in the experimentation ultimately use a Tikhonov-type regularization in the context of our approximation, penalizing the discretized $L_2$ integral norm of the gradient or Laplacian of the fields restricted to the boundary segment surface. They are further described and theoretically justified in Section 4.2, providing a rare instance where theory correctly predicts and closely justifies the best practical methods.
We believe that this approach applies to a more general class of PDE-based inverse problems.

In Section 4.1 we describe the inverse problem and the algorithm variants used for its solution. Several aspects arise with the prospect of data completion: which data (the original or the completed) to use for carrying out the iteration, which data for controlling the iterative process, what stopping criterion to use, and more. These aspects are addressed in Section 4.1.1. The resulting algorithm, based on Algorithm 2 of Chapter 3, is given in Section 4.1.2. The specific EIT/DC resistivity inverse problem described in Section 4.1.3 then leads to the data completion methods developed and proved in Section 4.2.

In Section 4.3 we apply the algorithm variants developed in the two previous sections to solve test problems with different receiver locations. The purpose is to investigate whether the SS algorithms based on completed data achieve results of similar quality at a cheaper price, as compared to the RS method applied to the original data. Overall, very encouraging results are obtained even when the original data receiver sets are rather sparse. Conclusions are offered in Section 4.4.

4.1 Stochastic Algorithms for Solving the Inverse Problem

The first two subsections below apply more generally than the third subsection. The latter settles on one application and leads naturally to Section 4.2.

Let us recall the acronyms for random subset (RS) and simultaneous sources (SS), used repeatedly in this section.

4.1.1 Algorithm Variants

To compare the performance of our model recovery methods with completed data, $\tilde{D}$, against corresponding ones with the original data, $D$, we use the framework of Algorithm 2 of Chapter 3. This algorithm consists of two stages within each GN iteration. The first stage produces a stabilized GN iterate, for which we use data denoted by $\hat{D}$.
The second involves assessment of this iterate in terms of improvement and algorithm termination, using data $\bar{D}$. This second stage consists of evaluations of (2.3), in addition to (1.6). We consider three variants:

(i) $\hat{D} = D$, $\bar{D} = D$;
(ii) $\hat{D} = \tilde{D}$, $\bar{D} = \tilde{D}$;
(iii) $\hat{D} = \tilde{D}$, $\bar{D} = D$.

Note that only the RS method can be used in variant (i), whereas any of the SS methods as well as the RS method can be employed in variant (ii). In variant (iii) we can use a more accurate SS method for the stabilized GN stage and an RS method for the convergence checking stage, with the potential advantage that the evaluations of (2.3) do not use our “invented data”. However, the disadvantage is that RS is potentially less suitable than Gaussian or Hutchinson precisely for tasks such as those in this second stage; see Chapter 3.

A major source of computational expense is the algorithm stopping criterion, which in Chapter 3 was taken to be (3.3), namely
$$\phi(m_{k+1}) \le \rho,$$
for a specified tolerance $\rho$. In Chapter 3, we deliberately employed this criterion in order to be able to make fair comparisons among different methods. However, the evaluation of $\phi$ for this purpose is very expensive when $s$ is large, and in practice $\rho$ is hardly ever known in a rigid sense. In any case, this evaluation should be carried out as rarely as possible. In Chapter 3, we addressed this by proposing a safety check, called “uncertainty check”, which uses (2.3) as an unbiased estimator of $\phi(m)$ with $n_k \ll s$ realizations of a random vector from one of the distributions described in Section 2.1.1. Thus, in the course of an iteration we can perform the relatively inexpensive uncertainty check (3.4), namely
$$\hat{\phi}(m_{k+1}, n_k) \le \rho.$$
This is like the stopping criterion, but in expectation.
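A minimal sketch of the idea behind the uncertainty check follows. It assumes that the estimator (2.3) has the standard trace-estimator form $\hat{\phi}(m, n) = \frac{1}{n} \sum_{j=1}^{n} \|(F(m) - D) w_j\|_2^2$ (an assumption, since (2.3) is not restated in this section), and it uses a small hypothetical residual matrix `R` in place of $F(m) - D$; no PDE is solved and all function names are illustrative only.

```python
import itertools
import random

def misfit(R):
    """Exact misfit: squared Frobenius norm of the residual matrix R (rows: receivers, columns: experiments)."""
    return sum(x * x for row in R for x in row)

def apply_weights(R, w):
    """Combine the columns (experiments) of R with the weights w, i.e. compute R @ w."""
    return [sum(r * wj for r, wj in zip(row, w)) for row in R]

def estimate_misfit(R, weights):
    """Average of ||R w||^2 over the given weight vectors (assumed form of (2.3))."""
    total = 0.0
    for w in weights:
        total += sum(y * y for y in apply_weights(R, w))
    return total / len(weights)

def rademacher(s, n, rng):
    """n random vectors of length s with independent +/-1 entries (Hutchinson weights)."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(s)] for _ in range(n)]

# Hypothetical residual: 3 receivers (rows), s = 4 experiments (columns).
R = [[0.5, -1.0, 0.25, 2.0],
     [1.5, 0.0, -0.75, 1.0],
     [-2.0, 0.5, 1.0, 0.0]]

rng = random.Random(0)
phi = misfit(R)                                       # needs all s experiments
phi_hat = estimate_misfit(R, rademacher(4, 2, rng))   # cheap estimate with n_k = 2

# Unbiasedness, checked exactly: averaging over all 2^s sign vectors recovers phi.
all_signs = [list(w) for w in itertools.product((-1.0, 1.0), repeat=4)]
phi_mean = estimate_misfit(R, all_signs)
```

Each weight vector corresponds to one combined (simultaneous) source, so `phi_hat` costs two forward evaluations here instead of four; the price is that a single estimate can fall above or below `phi`.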
If (3.4) is satisfied, it is an indication that (3.3) is likely to be satisfied as well, so we check the expensive (3.3) only then.

In the present chapter, we propose an alternative heuristic method of replacing (3.3) with another uncertainty check evaluation as in (3.4) with $t_k$ realizations of the Rademacher random vector (NB the Hutchinson estimator has smaller variance than Gaussian). The sample size $t_k$ can be heuristically set as
$$t_k = \min(s, \max(t_0, n_k)), \tag{4.1}$$
where $t_0 > 1$ is some preset minimal sample size for this purpose. Thus, for each algorithm variant (i), (ii) or (iii), we consider two stopping criteria, namely,

(a) the hard (3.3), and
(b) the more relaxed (3.4)+(4.1).

When using the original data $D$ in the second stage of our general algorithm, as in variants (i) and (iii) above, since the linearity assumption (A.2) does not hold in the setting considered here, for efficiency reasons, one is restricted to the RS method as an unbiased estimator. However, when the completed data is used and, as a result, the linearity assumption (A.2) is restored, we can freely use the stochastic SS methods and leverage their rather better accuracy in order to estimate the true misfit $\phi(m)$. This is indeed an important advantage of data completion methods.

However, when using the completed data $\tilde{D}$ in the second stage of our general algorithm, as in variant (ii), an issue arises: when the data is completed, the given tolerance $\rho$ loses its meaning and we need to take into account the effect of the additional data to calculate a new tolerance. Our proposed heuristic approach is to replace $\rho$ with a new tolerance $\rho := (1 + c)\rho$, where $c$ is the percentage of the data that needs to be completed expressed as a fraction. For example, if 30% of data is to be completed then we set $\rho := 1.3\rho$. Since the completed data after using (4.2) or (4.6) is smoothed and denoised, we only need to add a small fraction of the initial tolerance to get the new one, and in our experience, $1 + c$ is deemed to be a satisfactory factor. We experiment with this less rigid stopping criterion in Section 4.3.

4.1.2 General Algorithm

Our general algorithm utilizes a stabilized Gauss-Newton (GN) method (see Section A.3 and [46]), where each iteration consists of two stages as described in Section 4.1.1. In addition to combining the elements described above, this algorithm also provides a schedule for selecting the sample size $n_k$ in the $k$th stabilized GN iteration. In Algorithm 3, variants (i), (ii) and (iii), and criteria (a) and (b), are as specified in Section 4.1.1.

Algorithm 3 Solve inverse problem using variant (i), (ii) or (iii), cross validation, and stopping criterion (a) or (b)
Given: sources $Q$, measurements $\hat{D}$, measurements $\bar{D}$, stopping tolerance $\rho$, decrease factor $\kappa < 1$, and initial guess $m_0$.
Initialize: $m = m_0$, $n_0 = 1$.
for $k = 0, 1, 2, \ldots$ until termination do
  - Choose $n_k$ weight vectors stochastically as described in Section 2.1.
  - Fitting: Perform one stabilized GN iteration, based on $\hat{D}$ and the above weights, on (2.3).
  - Choose two independent sets of $n_k$ weight vectors stochastically as described in Section 2.1.
  if $\hat{\phi}(m_{k+1}, n_k) \le \kappa \hat{\phi}(m_k, n_k)$, based on $\bar{D}$, using the above two sets of weights for each side of the inequality (i.e., Cross Validation holds) then
    - Choose $n_k$ weight vectors stochastically as described in Section 2.1.
    - Uncertainty Check: Compute (2.3) based on $\bar{D}$ using $m_{k+1}$ and the above weights.
    if (3.4) holds then
      - Stopping Criterion:
      if Option (a) selected and (3.3) holds then
        terminate; otherwise set $n_{k+1} = n_k$.
      else
        Set $t_k = \min(s, \max(t_0, n_k))$.
        Choose $t_k$ weight vectors stochastically as described in Section 2.1. Terminate if (3.4) holds using $\bar{D}$; otherwise set $n_{k+1} = n_k$.
      end if
    end if
  else
    - Sample Size Increase: for example, set $n_{k+1} = \min(2 n_k, s)$.
  end if
end for
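The sample-size bookkeeping used in Algorithm 3 can be sketched in a few lines. The function names below are hypothetical, the sequence of cross-validation outcomes is made up purely for illustration, and the doubling rule is only the example increase mentioned in the algorithm; the values $s = 961$ and $t_0 = 100$ are those quoted for the 2D experiments in Section 4.3.

```python
def stopping_sample_size(s, t0, n_k):
    """Sample size t_k from (4.1): at least t0, at least n_k, never more than s."""
    return min(s, max(t0, n_k))

def adjusted_tolerance(rho, completed_fraction):
    """Heuristic tolerance (1 + c) * rho when a fraction c of the data was completed."""
    return (1.0 + completed_fraction) * rho

def next_sample_size(n_k, s, cross_validation_ok):
    """Keep n_k if cross validation held; otherwise double it, capped at s."""
    return n_k if cross_validation_ok else min(2 * n_k, s)

s, t0 = 961, 100
sizes = [1]                                   # n_0 = 1
for passed in [False, False, True, False]:    # hypothetical cross-validation outcomes
    sizes.append(next_sample_size(sizes[-1], s, passed))

t_k = stopping_sample_size(s, t0, sizes[-1])  # sample size for the relaxed check (b)
rho_new = adjusted_tolerance(1.0, 0.30)       # 30% of the data completed
```

With these made-up outcomes the sample size grows as 1, 2, 4, 4, 8, so the relaxed stopping check would still use $t_k = t_0 = 100$ realizations, which is why criterion (b) remains far cheaper than evaluating the misfit on all $s$ experiments.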
4.1.3 The DC Resistivity Inverse Problem

For the forward problem, we consider the DC resistivity problem with a linear PDE of the form described in Section 3.3.1. In our numerical examples we again consider the simple domain $\Omega \subset \mathbb{R}^d$ to be the unit square or unit cube, and the sources $q$ to be the differences of $\delta$-functions; see details in Section 3.3. Since the receivers (where data values are measured) lie in $\partial\Omega$, in our data completion algorithms we approximate data along one of four edges in the 2D case or within one of six square faces in the 3D case. The setting of our experiments, which follows that used in Chapter 3, is more typical of DC resistivity than of the EIT problem.

For the inverse problem we introduce additional a priori information, when such is available, via a point-wise parametrization of $\mu(x)$ in terms of $m(x)$. For details of this, as well as the PDE discretization and the stabilized GN iteration used, we refer to Chapter 3, Appendix A and [46] and references therein.

4.2 Data Completion

Let $\Lambda_i \subset \partial\Omega$ denote the point set of receiver locations for the $i$th experiment. Our goal here is to extend the data for each experiment to the union $\Lambda = \bigcup_i \Lambda_i \subseteq \partial\Omega$, the common measurement domain. To achieve this, we choose a suitable boundary patch $\Gamma \subseteq \partial\Omega$, such that $\Lambda \subset \bar{\Gamma}$, where $\bar{\Gamma}$ denotes the closure of $\Gamma$ with respect to the boundary subspace topology. For example, one can choose $\Gamma$ to be the interior of the convex hull (on $\partial\Omega$) of $\Lambda$. We also assume that $\Gamma$ can be selected such that it is a simply connected open set. For each experiment $i$, we then construct an extension function $v_i$ on $\bar{\Gamma}$ which approximates the measured data on $\Lambda_i$. The extension method can be viewed as an inverse problem, and we select a regularization based on knowledge of the function space that $v_i$ (which represents the restriction of potential $u_i$ to $\Gamma$) should live in. Once $v_i$ is constructed, the extended data, $\tilde{d}_i$, is obtained by restricting $v_i$ to $\Lambda$, denoted in what follows by $v_i^{\Lambda}$.
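As a rough one-dimensional illustration of this extend-then-restrict step, the sketch below fits a smooth function on a uniform mesh of a boundary edge to noisy data given at only some mesh nodes, and then reads off values at all nodes. It is a sketch under stated assumptions, not the implementation used in the experiments: the helper name `complete_data`, the unscaled finite-difference penalty (standing in for a discretized surface gradient or Laplacian), the fixed value of `lam`, and the synthetic field are all hypothetical choices.

```python
import numpy as np

def complete_data(d_i, idx_i, n, lam, order=2):
    """Fit v on an n-point uniform mesh to data d_i given at mesh indices idx_i.

    Minimizes 0.5 * ||P v - d_i||^2 + lam * ||L v||^2, where P picks the
    measured entries and L is a first (order=1) or second (order=2)
    difference matrix; mesh-size scaling is absorbed into lam.
    """
    P = np.zeros((len(idx_i), n))
    P[np.arange(len(idx_i)), idx_i] = 1.0
    L = np.diff(np.eye(n), n=order, axis=0)    # (n - order) x n difference matrix
    # Normal equations of the objective; the 2 comes from the 1/2 on the data term.
    A = P.T @ P + 2.0 * lam * (L.T @ L)
    return np.linalg.solve(A, P.T @ d_i)

n = 41                                         # mesh nodes on the boundary patch
x = np.linspace(0.0, 1.0, n)
true_field = np.sin(np.pi * x)                 # hypothetical smooth field on the patch
idx_i = np.arange(0, n, 2)                     # receivers of experiment i: every other node
rng = np.random.default_rng(0)
d_i = true_field[idx_i] + 0.01 * rng.standard_normal(idx_i.size)

v_i = complete_data(d_i, idx_i, n, lam=1.0)    # extension function on the mesh
idx_union = np.arange(n)                       # union of receiver locations (here: all nodes)
d_tilde_i = v_i[idx_union]                     # completed data: v_i restricted to the union
max_err = np.max(np.abs(d_tilde_i - true_field))
```

Here roughly half of the entries of `d_tilde_i` are "invented" values at nodes where experiment `i` had no receiver. Passing `order=1` penalizes first differences instead, which imposes less smoothness on the fitted function.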
Specifically, for the receiver location $x_j \in \Lambda$, we set $[\tilde{d}_i]_j = v_i(x_j)$, where $[\tilde{d}_i]_j$ denotes the $j$th component of vector $\tilde{d}_i$ corresponding to $x_j$. Below we show that the trace of potential $u_i$ to the boundary is indeed continuous, thus point values of the extension function $v_i$ make sense.

In practice, the conductivity $\mu(x)$ in (3.5a) is often piecewise smooth with finite jump discontinuities. As such one is faced with two scenarios leading to two approximation methods for finding $v_i$: (a) the discontinuities are some distance away from $\Gamma$; and (b) the discontinuities extend all the way to $\Gamma$. These cases result in a different a priori smoothness of the field $v_i$ on $\Gamma$. Hence, in this section we treat these cases separately and propose an appropriate data completion algorithm for each.

Consider the problem (3.5). In what follows we assume that $\Omega$ is a bounded open domain and $\partial\Omega$ is Lipschitz. Furthermore, we assume that $\mu$ is continuous on a finite number of disjoint subdomains, $\Omega_j \subset \Omega$, such that $\bigcup_{j=1}^{N} \overline{\Omega}_j = \overline{\Omega}$ and $\partial\Omega_j \cap \Omega \in C^{2,\alpha}$, for some $0 < \alpha \le 1$, i.e., $\mu \in C^2(\Omega_j)$, $j = 1, \ldots, N$.⁸ Moreover, assume that $q \in L^\infty(\Omega)$ and $q \in \mathrm{Lip}(\Omega_j \cap \Omega)$, i.e., it is Lipschitz continuous in each subdomain; this assumption will be slightly weakened in Subsection 4.2.4.

Under these assumptions and for the Dirichlet problem with a $C^2(\partial\Omega)$ boundary condition, there is a constant $\gamma$, $0 < \gamma \le 1$, such that $u \in C^{2,\gamma}(\Omega_j)$ [88, Theorem 4.1]. In [98, Corollary 7.3], it is also shown that the solution on the entire domain is Hölder continuous, i.e., $u \in C^\beta(\Omega)$ for some $\beta$, $0 < \beta \le 1$. Note that the mentioned theorems are stated for the Dirichlet problem, and here we assume a homogeneous Neumann boundary condition. However, in this case we have infinite smoothness in the normal direction at the boundary, i.e., $C^\infty$ Neumann condition, and no additional complications arise; see for example [127].
So the results stated above would still hold for (3.5).

4.2.1 Discontinuities in Conductivity Are Away from Common Measurement Domain

This scenario corresponds to the case where the boundary patch $\Gamma$ can be chosen such that $\Gamma \subset (\partial\Omega_j \cap \partial\Omega)$ for some $j$. Then we can expect a rather smooth field at $\Gamma$; precisely, $u \in C^{2,\gamma}(\Gamma)$. Thus, $u$ belongs to the Sobolev space $H^2(\Gamma)$, and we can impose this knowledge in our continuous completion formulation. For the $i$th experiment, we define our data completion function $v_i \in H^2(\Gamma) \cap C(\Gamma)$ as
$$v_i = \arg\min_{v} \; \frac{1}{2} \left\| v^{\Lambda_i} - d_i \right\|_2^2 + \lambda \left\| \Delta_S v \right\|_{L_2(\Gamma)}^2, \tag{4.2}$$
where $\Delta_S$ is the Laplace-Beltrami operator ([89, 124]) for the Laplacian on the boundary surface and $v^{\Lambda_i}$ is the restriction of the continuous function $v$ to the point set $\Lambda_i$. The regularization parameter $\lambda$ depends on the amount of noise in our data; see Section 4.2.3.

⁸ $\overline{X}$ denotes the closure of $X$ with respect to the appropriate topology.

We next discretize (4.2) using a mesh on $\Gamma$ as specified in Section 4.3, and solve the resulting linear least squares problem using standard techniques.

Figure 4.1 shows an example of such data completion. The true field and the measured data correspond to an experiment described in Example 4.3 of Section 4.3. We only plot the profile of the field along the top boundary of the 2D domain. As can be observed, the approximation process imposes smoothness which results in an excellent completion of the missing data, despite the presence of noise at a fairly high level.

Figure 4.1: Completion using the regularization (4.2), for an experiment taken from Example 4.3 where 50% of the data requires completion and the noise level is 5%.
Observe that even in the presence of significant noise, the data completion formulation (4.2) achieves a good quality field reconstruction.

We hasten to point out that the results in Figure 4.1, as well as those in Figure 4.2 below, pertain to differences in field values, i.e., the solutions of the forward problem $u_i$, and not those in the inverse problem solution shown, e.g., in Figure 4.5. The good quality approximations in Figures 4.1 and 4.2 generally form a necessary but not sufficient condition for success in the inverse problem solution.

4.2.2 Discontinuities in Conductivity Extend All the Way to Common Measurement Domain

This situation corresponds to the case in which $\Gamma$ can only be chosen such that it intersects more than just one of the $(\partial\Omega \cap \partial\Omega_j)$’s. More precisely, assume that there is an index set $\mathcal{J} \subseteq \{1, 2, \ldots, N\}$ with $|\mathcal{J}| = K \ge 2$ such that $\{\Gamma \cap (\partial\Omega \cap \partial\Omega_j)^\circ,\ j \in \mathcal{J}\}$ forms a set of disjoint subsets of $\Gamma$ such that $\Gamma = \bigcup_{j \in \mathcal{J}} \Gamma \cap (\partial\Omega \cap \partial\Omega_j)^\circ$, where $X^\circ$ denotes the interior of the set $X$, and that the interior is with respect to the subspace topology on $\partial\Omega$. In such a case $u$, restricted to $\Gamma$, is no longer necessarily in $H^2(\Gamma)$. Hence, the smoothing term in (4.2) is no longer valid, as $\|\Delta_S u\|_{L_2(\Gamma)}$ might be undefined or infinite. However, as described above, we know that the solution is piecewise smooth and overall continuous, i.e., $u \in C^{2,\gamma}(\Omega_j)$ and $u \in C^\beta(\Omega)$. The following theorem shows that the smoothness on $\Gamma$ is not completely gone: we may lose one degree of regularity at worst.

Theorem 4.1. Let $U$ and $\{U_j \mid j = 1, 2, \ldots, K\}$ be open and bounded sets such that the $U_j$ are pairwise disjoint and $\overline{U} = \bigcup_{j=1}^{K} \overline{U}_j$. Further, let $u \in C(\overline{U}) \cap H^1(U_j)\ \forall j$. Then $u \in H^1(U)$.

Proof. It is easily seen that since $u \in C(\overline{U})$ and $U$ is bounded, then $u \in L_2(U)$. Now, let $\phi \in C_0^\infty(U)$ be a test function and denote $\partial_i \equiv \frac{\partial}{\partial x_i}$.
Using the assumptions that the $U_j$’s form a partition of $U$, $u$ is continuous in $\overline{U}$, $\phi$ is compactly supported inside $U$, and the fact that the $\partial U_j$’s have measure zero, we obtain
$$\begin{aligned}
\int_{U} u\,\partial_i \phi &= \int_{\overline{U}} u\,\partial_i \phi \\
&= \int_{\cup_{j=1}^{K} \overline{U}_j} u\,\partial_i \phi \\
&= \int_{(\cup_{j=1}^{K} U_j) \cup (\cup_{j=1}^{K} \partial U_j)} u\,\partial_i \phi \\
&= \int_{\cup_{j=1}^{K} U_j} u\,\partial_i \phi \\
&= \sum_{j=1}^{K} \int_{U_j} u\,\partial_i \phi \\
&= \sum_{j=1}^{K} \int_{\partial U_j} u\,\phi\,\nu_i^j - \sum_{j=1}^{K} \int_{U_j} \partial_i u\,\phi,
\end{aligned}$$
where $\nu_i^j$ is the $i$th component of the outward unit surface normal to $\partial U_j$. Since $u \in H^1(U_j)\ \forall j$, the second part of the rightmost expression makes sense. Now, for two surfaces $\partial U_m$ and $\partial U_n$ such that $\partial U_m \cap \partial U_n \ne \emptyset$, we have $\nu_i^m(x) = -\nu_i^n(x)\ \forall x \in \partial U_m \cap \partial U_n$. This fact, and noting in addition that $\phi$ is compactly supported inside $U$, makes the first term in the right hand side vanish, i.e.,
$$\sum_{j=1}^{K} \int_{\partial U_j} u\,\phi\,\nu_i^j = 0.$$
We can now define the weak derivative of $u$ with respect to $x_i$ to be
$$v(x) = \sum_{j=1}^{K} \partial_i u\, \mathcal{X}_{U_j}, \tag{4.3}$$
where $\mathcal{X}_{U_j}$ denotes the characteristic function of the set $U_j$. This yields
$$\int_{U} u\,\partial_i \phi = -\int_{U} v\,\phi. \tag{4.4}$$
Also
$$\|v\|_{L_2(U)} \le \sum_{j=1}^{K} \|\partial_i u\|_{L_2(U_j)} < \infty, \tag{4.5}$$
and thus we conclude that $u \in H^1(U)$.

If the assumptions stated at the beginning of this section hold then we can expect a field $u \in H^1(\Gamma) \cap C(\bar{\Gamma})$. This is obtained by invoking Theorem 4.1 with $U = \Gamma$ and $U_j = \Gamma \cap (\partial\Omega \cap \partial\Omega_j)^\circ$ for all $j \in \mathcal{J}$.

Now we can formulate the data completion method as
$$v_i = \arg\min_{v} \; \frac{1}{2} \left\| v^{\Lambda_i} - d_i \right\|_2^2 + \lambda \left\| \nabla_S v \right\|_{L_2(\Gamma)}^2, \tag{4.6}$$
where $v^{\Lambda_i}$ and $\lambda$ are as in Section 4.2.1.

Figure 4.2 shows an example of data completion using the formulation (4.6), depicting the profile of $v_i$ along the top boundary. The field in this example is continuous and only piecewise smooth. The approximation process imposes less smoothness along the boundary as compared to (4.2), and this results in an excellent completion of the missing data, despite a nontrivial level of noise.

Figure 4.2: Completion using the regularization (4.6), for an experiment taken from Example 4.2 where 50% of the data requires completion and the noise level is 5%.
Discontinuities in the conductivity extend to the measurement domain and their effect on the field profile along the boundary can be clearly observed. Despite the large amount of noise, data completion formulation (4.6) achieves a good reconstruction.

To carry out our data completion strategy, the problems (4.2) or (4.6) are discretized. This is followed by a straightforward linear least squares technique, which can be carried out very efficiently. Moreover, this is a preprocessing stage performed once, which is completed before the algorithm for solving the nonlinear inverse problem commences. Also, as the data completion for each experiment can be carried out independently of others, the preprocessing stage can be done in parallel if needed. Furthermore, the length of the vector of unknowns $v_i$ is relatively small compared to those of $u_i$ because only the boundary is involved. All in all the amount of work involved in the data completion step is dramatically less than one full evaluation of the misfit function (1.6).

4.2.3 Determining the Regularization Parameter

Let us write the discretization of (4.2) or (4.6) as
$$\min_{v} \; \frac{1}{2} \left\| \hat{P}_i v - d_i \right\|_2^2 + \lambda \left\| L v \right\|_2^2, \tag{4.7}$$
where $L$ is the discretization of the surface gradient or Laplacian operator, $v$ is a vector whose length is the size of the discretized $\Gamma$, $\hat{P}_i$ is the projection matrix from the discretization of $\Gamma$ to $\Lambda_i$, and $d_i$ is the $i$th original measurement vector.

Determining $\lambda$ in this context is a textbook problem; see, e.g., [135]. Viewing it as a parameter, we have a linear least squares problem for $v$ in (4.7), whose solution can be denoted $v(\lambda)$ as
$$v_i(\lambda) = \left( \hat{P}_i^T \hat{P}_i + \lambda L^T L \right)^{-1} \hat{P}_i^T d_i,$$
with a constant factor of 2 absorbed into $\lambda$. Now, in the simplest case, which we assume in our experiments, the noise level for the $i$th experiment, $\eta_i$, is known, so one can use the discrepancy principle to pick $\lambda$ such that
$$\left\| \hat{P}_i v(\lambda) - d_i \right\|_2^2 \le \eta_i. \tag{4.8}$$
Numerically, this is done by setting equality in (4.8) and solving the resulting nonlinear equation for $\lambda$ using a standard root finding technique.

If the noise level is not known, one can use the generalized cross validation (GCV) method ([65]) or the L-curve method ([79]). For example, the GCV function can be written as
$$\mathrm{GCV}(\lambda) = \frac{\left\| \hat{P}_i v(\lambda) - d_i \right\|_2^2}{\left[ \mathrm{tr}\left( I - \hat{P}_i \left( \hat{P}_i^T \hat{P}_i + \lambda L^T L \right)^{-1} \hat{P}_i^T \right) \right]^2},$$
where tr denotes the standard matrix trace. Now the best $\lambda$ is the minimizer of $\mathrm{GCV}(\lambda)$. We need not dwell on this longer here.

4.2.4 Point Sources and Boundaries with Corners

In the numerical examples of Section 4.3, as in Section 3.3 and following [46], we use delta function combinations as the sources $q_i(x)$, in a manner that is typical in exploration geophysics (namely, DC resistivity as well as low-frequency electromagnetic experiments), less so in EIT. However, these are clearly not honest $L^\infty$ functions. Moreover, our domains $\Omega$ are a square or a cube and as such they have corners.

However, the theory developed above, and the data completion methods that it generates, can be extended to our experimental setting because we have control over the experimental setup. The desired effect is obtained by simply separating the location of each source from any of the receivers, and avoiding domain corners altogether.

Thus, consider in (3.5a) a source function of the form
$$q(x) = \hat{q}(x) + \delta(x - x^*) - \delta(x - x^{**}),$$
where $\hat{q}$ satisfies the assumptions previously made on $q$. Then we select $x^*$ and $x^{**}$ such that there are two open balls $B(x^*, r)$ and $B(x^{**}, r)$ of radius $r > 0$ each and centered at $x^*$ and $x^{**}$, respectively, where (i) no domain corner belongs to $B(x^*, r) \cup B(x^{**}, r)$, and (ii) $(B(x^*, r) \cup B(x^{**}, r)) \cap \Gamma$ is empty. Now, in our elliptic PDE problem the lower smoothness effect of either a domain corner or a delta function is local!
In particular, the contribution of the point source to the flux $\mu \nabla u$ is the integral of $\delta(x - x^*) - \delta(x - x^{**})$, and this is smooth outside the union of the two balls.

4.3 Numerical Experiments

The PDE problem used in our experiments is described in Sections 4.1.3 and 3.3. The experimental setting is also very similar to that in Section 3.3.2. Here again, in 2D, the receivers are located at the top and bottom boundaries (except the corners). As such, the completion steps (4.2) or (4.6) are carried out separately for the top and bottom 1D manifold of boundaries. In 3D, the receivers are placed on the top surface, hence completion is done on the corresponding 2D manifold.

For all of our numerical experiments, the “true field” is calculated on a grid that is twice as fine as the one used to reconstruct the model. For the 2D examples, the reconstruction is done on a uniform grid of size $129^2$ with $s = 961$ experiments in the setup described above. For the 3D examples, we set $s = 512$ and employ a uniform grid of size $33^3$, except for Example 4.5 where the grid size is $17^3$.

In the numerical examples considered below, we use true models with piecewise constant levels, with the conductivities bounded away from 0. For further discussion of such models within the context of EIT, see [64].

Numerical examples are presented for both cases described in Sections 4.2.1 and 4.2.2. For all of our numerical examples except Examples 4.5 and 4.6, we use the transfer function (A.5) with $\mu_{\max} = 1.2 \max \mu(x)$, and $\mu_{\min} = \frac{1}{1.2} \min \mu(x)$. In the ensuing calculations we then “forget” what the exact $\mu(x)$ is. Further, in the stabilized GN iteration we employ preconditioned conjugate gradient (PCG) inner iterations, setting as described in Section A.3 the PCG iteration limit to $r = 20$, and the PCG tolerance to $10^{-3}$. The initial guess is $m_0 = 0$. Examples 4.5 and 4.6 are carried out using the level set method (A.6). Here we can set $r = 5$, significantly lower than above.
The initial guess for the level set example is a cube with rounded corners inside $\Omega$ as in Figure 3.1.

For Examples 4.1, 4.2, 4.3 and 4.5, in addition to displaying the log conductivities (i.e., $\log(\mu)$) for each reconstruction, we also show the log-log plot of misfit on the entire data (i.e., $\|F(m) - D\|_F$) vs. PDE count. A table of total PDE counts (not including what extra is required for the plots) for each method is displayed. In order to simulate the situation where sources do not share the same receivers, we first generate the data fully on the entire domain of measurement and then knock out at random some percentage of the generated data. This setting roughly corresponds to an EMG experiment with faulty receivers.

For each example, we use Algorithm 3 with one of the variants (i), (ii) or (iii) paired with one of the stopping criteria (a) or (b). For instance, when using variant (ii) with the soft stopping criterion (b), we denote the resulting algorithm by (ii, b). For the relaxed stopping rule (b) we (conservatively) set $t_0 = 100$ in (4.1). A computation using RS applied to the original data, using variant (i, x), is compared to one using SS applied to the completed data through variant (ii, x) or (iii, x), where x stands for a or b.

For convenience of cross reference, we gather all resulting seven algorithm comparisons and corresponding work counts in Table 4.1 below. For Examples 4.1, 4.2, 4.3 and 4.5, the corresponding entries of this table should be read together with the misfit plots for each example.

Example   Algorithm            Random Subset   Data Completion
4.1       (i, a) | (iii, a)    3,647           1,716
4.2       (i, a) | (iii, a)    6,279           1,754
4.3       (i, a) | (iii, a)    3,887           1,704
4.4       (i, b) | (ii, b)     4,004           579
4.5       (i, a) | (iii, a)    3,671           935
4.6       (i, b) | (ii, b)     1,016           390
4.7       (i, b) | (ii, b)     4,847           1,217

Table 4.1: Algorithm and work in terms of number of PDE solves, comparing RS against data completion using Gaussian SS.

Example 4.1.
In this example, we place two target objects of conductivity $\mu_I = 1$ in a background of conductivity $\mu_{II} = 0.1$, and 5% noise is added to the data as described above. Also, 25% of the data requires completion. The discontinuities in the conductivity are touching the measurement domain, so we use (4.6) to complete the data. The hard stopping criterion (a) is employed, and iteration control is done using the original data, i.e., variants (i, a) and (iii, a) are compared: see the first entry of Table 4.1 and Figure 4.6(a).

Figure 4.3: Example 4.1 – reconstructed log conductivity with 25% data missing and 5% noise: (a) true model, (b) Random Subset, (c) Data Completion. Regularization (4.6) has been used to complete the data.

The corresponding reconstructions are depicted in Figure 4.3. It can be seen that roughly the same quality reconstruction is obtained using the data completion method at less than half the price.

Example 4.2. This example is the same as Example 4.1, except that 50% of the data is missing and requires completion. The same algorithm variants as in Example 4.1 are compared. The reconstructions are depicted in Figure 4.4, and comparative computational results are recorded in Table 4.1 and Figure 4.6(b).

Figure 4.4: Example 4.2 – reconstructed log conductivity with 50% data missing and 5% noise: (a) true model, (b) Random Subset, (c) Data Completion. Regularization (4.6) has been used to complete the data.

Similar observations to those in Example 4.1 generally apply here as well, despite the smaller amount of original data.

Example 4.3. This is the same as Example 4.2 in terms of noise and the amount of missing data, except that the discontinuities in the conductivity are some distance away from the common measurement domain, so we use (4.2) to complete the data.
The same algorithm variants as in the previous two examples are compared, thus isolating the effect of a smoother data approximant.

(a) True model (b) Random Subset (c) Data Completion
Figure 4.5: Example 4.3 – reconstructed log conductivity with 50% data missing and 5% noise. Regularization (4.2) has been used to complete the data.

Results are recorded in Figure 4.5, the third entry of Table 4.1 and Figure 4.6(c).

(a) Example 4.1 (b) Example 4.2 (c) Example 4.3
Figure 4.6: Data misfit vs. PDE count for Examples 4.1, 4.2 and 4.3.

Figures 4.3, 4.4 and 4.5, in conjunction with Figure 4.6 as well as Table 4.1, reflect the superiority of the SS method combined with data completion over the RS method with the original data. From the first three entries of Table 4.1, we see that the SS reconstruction with completed data can be done more efficiently by a factor of more than two. The quality of reconstruction is also very good. Note that the graph of the misfit for Data Completion lies mostly under that of Random Subset. This means that, given a fixed number of PDE solves, we obtain a lower (thus better) misfit for the former than for the latter.

Next, we consider examples in 3D.

Example 4.4. In this example, the discontinuities in the true, piecewise constant conductivity extend all the way to the common measurement domain, see Figure 4.7. We therefore use (4.6) to complete the data. The target object has the conductivity $\mu_I = 1$ in a background with conductivity $\mu_{II} = 0.1$. We add 2% noise and knock out 50% of the data. Furthermore, we consider the relaxed stopping criterion (b). With the original data (hence using RS), the variant (i, b) is employed, and this is compared against the variant (ii, b) with SS applied to the completed data.
For the latter case, the stopping tolerance is adjusted as discussed in Section 4.1.1.

Figure 4.7: True Model for Example 4.4.

(a) RS slices (b) 3D view (c) DC slices (d) 3D view
Figure 4.8: Example 4.4 – reconstructed log conductivity for the 3D model with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 50% of data missing. Regularization (4.6) has been used to complete the data.

Reconstruction results are depicted in Figure 4.8, and work estimates are gathered in the 4th entry of Table 4.1. It can be seen that the results using data completion, obtained at about 1/7th the cost, are comparable to those obtained with RS applied to the original data.

Example 4.5. The underlying model in this example is the same as that in Example 4.4 except that, since we intend to plot the misfit on the entire data at every GN iteration, we decrease the reconstruction mesh resolution to $17^3$. Also, 30% of the data requires completion, and we use the level set transfer function (A.6) to reconstruct the model. With the original data, we use the variant (i, a), while the variant (iii, a) is used with the completed data. The reconstruction results are recorded in Figure 4.9, and performance indicators appear in Figure 4.10 as well as Table 4.1.

(a) RS slices (b) 3D view (c) DC slices (d) 3D view
Figure 4.9: Example 4.5 – reconstructed log conductivity for the 3D model using the level set method with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 30% of data missing. Regularization (4.6) has been used to complete the data.

Figure 4.10: Data misfit vs. PDE count for Example 4.5.

The algorithm proposed here produces a better reconstruction than RS on the original data. A relative efficiency observation can be made from Table 4.1, where a factor of roughly 4 is revealed.

Example 4.6. This is exactly the same as Example 4.4, except that we use the level set transfer function (A.6) to reconstruct the model.
The same variants of Algorithm 1 as in Example 4.4 are employed.

(a) RS slices (b) 3D view (c) DC slices (d) 3D view
Figure 4.11: Example 4.6 – reconstructed log conductivity for the 3D model using the level set method with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 50% of data missing. Regularization (4.6) has been used to complete the data.

It is evident from Figure 4.11 that employing the level set formulation allows a significantly better quality reconstruction than in Example 4.4. This is expected, as much stronger assumptions on the true model are utilized. It was shown in [131] as well as Chapter 3 that using level set functions can greatly reduce the total amount of work, and this is observed here as well.

Whereas in all previous examples convergence of the modified GN iterations from a zero initial guess was fast and uneventful, typically requiring fewer than 10 iterations, the level set result of this example depends on $m_0$ in a more erratic manner. This reflects the underlying uncertainty of the inversion, with the initial guess $m_0$ playing the role of a prior.

It can be clearly seen from the results of Examples 4.4, 4.5 and 4.6 that Algorithm 1 does a great job recovering the model using the completed data plus the SS method as compared to RS with the original data. This is so both in terms of total work and the quality of the recovered model. Note that for all reconstructions, the conductive object placed deeper than the ones closer to the surface is not recovered well. This is due to the fact that we only measure on the surface and the information coming from this deep conductive object is majorized by that coming from the objects closer to the surface.

Example 4.7. In this 3D example, we examine the performance of our data completion approach for more severe cases of missing data.
For this example, we place a target object of conductivity $\mu_I = 1$ in a background with conductivity $\mu_{II} = 0.1$, see Figure 4.12, and 2% noise is added to the “exact” data. Then we knock out 70% of the data and use (4.2) to complete it. The algorithm variants employed are the same as in Examples 4.4 and 4.6.

Figure 4.12: True Model for Example 4.7.

Results are gathered in Figure 4.13 as well as Table 4.1. The data completion plus simultaneous sources algorithm again does well, with an efficiency factor $\approx 4$.

(a) RS slices (b) 3D view (c) DC slices (d) 3D view
Figure 4.13: Example 4.7 – reconstructed log conductivity for the 3D model with (a,b) Random Subset, (c,d) Data Completion for the case of 2% noise and 70% data missing. Regularization (4.2) has been used to complete the data.

4.4 Conclusions

This chapter is a sequel to Chapter 3, in which we studied the case where the linearity assumption (A.2) holds. In the context of PDE constrained inverse problems, this translates to the case where sources share the same receivers. Here we have focused on the case that arises more often in practice, namely where the linearity assumption (A.2) is violated. Such scenarios arise, for example, where there are parts of data missing or heavily corrupted. For PDE constrained inverse problems, this case corresponds to the situation where, unlike Chapter 3, sources do not share the same receivers. In this chapter, we assumed that the experimental setting is “suitable” enough to allow for the use of our proposed data completion techniques based on appropriate regularization. Our data completion methods are motivated by theory in Sobolev spaces [54] regarding the properties of weak solutions along the domain boundary. The resulting completed data allows an efficient use of the methods developed in Chapter 3 as well as utilization of a relaxed stopping criterion. Our approach shows great success in cases of moderate data completion, say up to 60-70%.
In such cases we have demonstrated that, utilizing some variant of Algorithm 3, an execution speedup factor of at least 2 and often much more can be achieved while obtaining excellent reconstructions.

It needs to be emphasized that a blind employment of some interpolation/approximation method would not take into account available a priori information about the sought signal. In contrast, the method developed in this chapter, while being very simple, is in fact built upon such a priori information, and is theoretically justified.

Note that with the methods of Section 4.2 we have also replaced the original data with new, approximate data. Alternatively we could keep the original data, and just add the missing data sampled from $v_i$ at appropriate locations. The potential advantage of doing this is that fewer changes are made to the original problem, so it would seem plausible that the data extension will produce results that are close to the more expensive inversion without using the simultaneous sources method, at least when there are only a few missing receivers. However, we found in practice that this method yields similar or worse reconstructions for moderate or large amounts of missing data as compared to the methods of Section 4.2.

For severe cases of missing data, say 80% or more, we do not recommend data completion in the present context as a safe approach. With so much completion the bias in the completed field could overwhelm the given observed data, and the recovered model may not be correct. In such cases, one can use the RS method applied to the original data. A good initial guess for this method may still be obtained with the SS method applied to the completed data.
Thus, one can always start with the most daring variant (ii, b) of Algorithm 3, and add a more conservative run of variant (i, b) on top if necessary.

If the forward problem is very diffusive and has a strong smoothing effect, as is the case for the DC-resistivity and EIT problems, then data completion can be attempted using a (hopefully) good guess of the sought model $m$ by solving the forward problem and evaluating the solution wherever necessary [70]. The rationale here is that even relatively large changes in $m(x)$ produce only small changes in the fields $u_i(x)$. However, such a prior might prove dominant, hence risky, and the data produced in this way, unlike the original data, no longer have natural high frequency noise components. Indeed, a potential advantage of this approach is in using the difference between the original measured data and the calculated prior field at the same locations for estimating the noise level for a subsequent application of the Morozov discrepancy principle [52, 135].

In this chapter we have focused on data completion, using whenever possible the same computational setting as in Chapter 3, which is our base reference. Other approaches to reduce the overall computational costs are certainly possible. These include adapting the number of inner PCG iterations in the modified GN outer iteration (see [46]) and adaptive gridding for $m(x)$ (see, e.g., [72] and references therein). Such techniques are essentially independent of the focus here. At the same time, they can be incorporated or fused together with our stochastic algorithms, further improving efficiency: effective ways for doing this form a topic for future research.

The specific data completion techniques proposed in this chapter have been justified and used in our model DC resistivity problem. However, the overall idea can be extended to other PDE based inverse problems as well by studying the properties of the solution of the forward problem.
One first needs to see what the PDE solutions are expected to behave like on the measurement domain, for example on a portion of the boundary, and then impose this prior knowledge in the form of an appropriate regularizer in the data completion formulation. Following that, the rest can be similar to our approach here. Investigating such extensions to other PDE models is a subject for future studies.

Chapter 5
Matrix Trace Estimation

As shown in Section 2.1, stochastic approximations to the misfit are closely related to Monte-Carlo estimations of the trace of the corresponding implicit matrix. So far in this thesis, all these estimators have been used rather heuristically, and no attempt has been made to understand them better mathematically. In this chapter, we present a rigorous mathematical analysis of Monte-Carlo methods for the estimation of the trace, $\operatorname{tr}(A)$, of an implicitly given matrix $A$ whose information is only available through matrix-vector products. Such a method approximates the trace by an average of $n$ expressions of the form $w^T(Aw)$, with random vectors $w$ drawn from an appropriate distribution.
We prove, discuss and experiment with bounds on the number of realizations $n$ required in order to guarantee a probabilistic bound on the relative error of the trace estimation upon employing Rademacher (Hutchinson), Gaussian and uniform unit vector (with and without replacement) probability distributions, discussed in Section 2.1.1.

In total, one necessary and six sufficient bounds are proved, improving upon and extending similar estimates obtained in the seminal work of Avron and Toledo [22] in several dimensions. We first improve their bound on $n$ for the Hutchinson method, dropping a term that relates to $\operatorname{rank}(A)$ and making the bound comparable with that for the Gaussian estimator.

We further prove new sufficient bounds for the Hutchinson, Gaussian and the unit vector estimators, as well as a necessary bound for the Gaussian estimator, which depend more specifically on properties of the matrix $A$. As such they may suggest for what type of matrices one distribution or another provides a particularly effective or relatively ineffective stochastic estimation method.

We hope that the novel results in this chapter will correct some existing misconceptions regarding the relative performance of different estimators, which have resulted from an unsatisfactory state of the theory. The theory developed in the present chapter sheds light on several questions which had remained open for some time. Using these results, practitioners can better choose appropriate estimators for their applications.

5.1 Introduction

The need to estimate the trace of an implicit square matrix is of fundamental importance [126] and arises in many applications; see for instance [5, 6, 21–23, 43, 46, 66, 71, 86, 104, 121, 134, 138] and references therein. By “implicit” we mean that the matrix of interest is not available explicitly: only probes in the form of matrix-vector products for any appropriate vector are available.
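To make the notion of an implicit matrix concrete, the averaging of probes $w^T(Aw)$ described above can be sketched as follows. This is only an illustrative sketch using the Python standard library; the function name `mc_trace`, the small matrix `B` and the helper `matvec` are hypothetical and do not come from the thesis. The operator $A = B^T B$ is never formed and is touched only through matrix-vector products.

```python
import random

def mc_trace(matvec, s, n, dist="rademacher", rng=None):
    """Average n probes w^T (A w); A is accessed only through matvec."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n):
        if dist == "rademacher":
            # Hutchinson probes: entries are +1 or -1 with probability 1/2 each
            w = [rng.choice((-1.0, 1.0)) for _ in range(s)]
        elif dist == "gaussian":
            # Gaussian probes: i.i.d. standard normal entries
            w = [rng.gauss(0.0, 1.0) for _ in range(s)]
        else:
            raise ValueError("dist must be 'rademacher' or 'gaussian'")
        Aw = matvec(w)
        total += sum(wi * yi for wi, yi in zip(w, Aw))
    return total / n

# Hypothetical implicit SPSD operator A = B^T B (illustration only).
B = [[2.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]

def matvec(w):
    """Return (B^T B) w without ever forming B^T B."""
    Bw = [sum(B[r][k] * w[k] for k in range(3)) for r in range(2)]
    return [sum(B[r][k] * Bw[r] for r in range(2)) for k in range(3)]

estimate_h = mc_trace(matvec, s=3, n=2000, dist="rademacher")
estimate_g = mc_trace(matvec, s=3, n=2000, dist="gaussian")
```

Here the exact trace is the squared Frobenius norm of `B`, namely 7, so both estimates should land close to that value for a sample size of this order.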
The standard approach for estimating the trace of such a matrix $A \in \mathbb{R}^{s \times s}$ is based on a Monte-Carlo method, where one generates $n$ random vector realizations $w_i$ from a suitable probability distribution $D$ and computes
\[
\operatorname{tr}_D^n(A) := \frac{1}{n} \sum_{i=1}^{n} w_i^T A w_i. \tag{5.1}
\]
For the popular case where $A$ is symmetric positive semi-definite (SPSD), the original method for estimating its trace is due to Hutchinson [86] and uses the Rademacher distribution for $D$.

Until the work by Avron and Toledo [22], the main analysis and comparison of such methods was based on the variance of one sample. It is known that compared to other methods the Hutchinson method has the smallest variance, and as such it has been extensively used in many applications. In [22] so-called $(\varepsilon, \delta)$ bounds are derived in which, using Chernoff-like analysis, a lower bound is obtained on the number of samples required to achieve a probabilistically guaranteed relative error of the estimated trace. More specifically, for a given pair $(\varepsilon, \delta)$ of small (say, $< 1$) positive values and an appropriate probability distribution $D$, a lower bound on $n$ is provided such that
\[
\Pr\left( \left| \operatorname{tr}_D^n(A) - \operatorname{tr}(A) \right| \le \varepsilon \operatorname{tr}(A) \right) \ge 1 - \delta. \tag{5.2}
\]
These authors further suggest that minimum-variance estimators may not be practically best, and conclude based on their analysis that the method with the best bound is the one using the Gaussian distribution. Let us denote
\[
c = c(\varepsilon, \delta) := \varepsilon^{-2} \ln(2/\delta), \tag{5.3a}
\]
\[
r = \operatorname{rank}(A). \tag{5.3b}
\]
Then [22] showed that, provided $A$ is real SPSD, (5.2) holds for the Hutchinson method if $n \ge 6(c + \varepsilon^{-2} \ln r)$ and for the Gaussian distribution if $n \ge 20c$.

In the present chapter we continue to consider the same objective as in [22], and our first task is to improve on these bounds. Specifically, in Theorems 5.1 and 5.3, respectively, we show that (5.2) holds for the Hutchinson method if
\[
n \ge 6c(\varepsilon, \delta), \tag{5.4}
\]
and for the Gaussian distribution if
\[
n \ge 8c(\varepsilon, \delta).
\]
(5.5)

The bound (5.4) removes a previous factor involving the rank of the matrix $A$, conjectured in [22] to be indeed redundant. Note that these two bounds are astoundingly simple and general: they hold for any SPSD matrix, regardless of size or any other matrix property. Thus, we cannot expect them to be tight in practice for many specific instances of $A$ that arise in applications. However, as was recently shown in [137], these two bounds are asymptotically tight.

Although practically useful, the bounds on $n$ given in (5.4) and (5.5) do not provide insight into how different types of matrices are handled with each probability distribution. Our next contribution is to provide different bounds for the Gaussian and Hutchinson trace estimators which, though generally not computable for implicit matrices, do shed light on this question. Furthermore, for the Gaussian estimator we prove a practically useful necessary lower bound on $n$, for a given pair $(\varepsilon, \delta)$.

A third probability distribution we consider was called the unit vector distribution in [22]. Here, the vectors $w_i$ in (5.1) are uniformly drawn from the columns of a scaled identity matrix, $\sqrt{s}\,I$, and $A$ need not be SPSD. Such a distribution is used in obtaining the random subset method discussed in Chapter 2. We slightly generalize the bound in [22], obtained for the case where the sampling is done with replacement. Our bound, although not as simply computed as (5.4) or (5.5), can be useful in determining which types of matrices this distribution works best on. We then give a tighter bound for the case where the sampling is done without replacement, suggesting that when the difference between the bounds is significant (which happens when $n$ is large), a uniform random sampling of unit vectors without replacement may be a more advisable distribution to estimate the trace with.

This chapter is organized as follows.
Section 5.2 gives two bounds for the Hutchinson method as advertised above, namely the improved bound (5.4) and a more involved but potentially more informative bound. Section 5.3 deals likewise with the Gaussian method and adds also a necessary lower bound, while Section 5.4 is devoted to the unit vector sampling methods. In Section 5.5 we give some numerical examples verifying that the trends predicted by the theory are indeed realized. Conclusions and further thoughts are gathered in Section 5.6.

In what follows we use the notation $\operatorname{tr}_H^n(A)$, $\operatorname{tr}_G^n(A)$, $\operatorname{tr}_{U_1}^n(A)$, and $\operatorname{tr}_{U_2}^n(A)$ to refer, respectively, to the trace estimators using Hutchinson, Gaussian, and uniform unit vector with and without replacement, in lieu of the generic notation $\operatorname{tr}_D^n(A)$ in (5.1) and (5.2). We also denote, for any given random vector of size $s$, $w_i = (w_{i1}, w_{i2}, \ldots, w_{is})^T$. We restrict attention to real-valued matrices, although extensions to complex-valued ones are possible, and employ the 2-norm by default.

5.2 Hutchinson Estimator Bounds

In this section we consider the Hutchinson trace estimator, $\operatorname{tr}_H^n(A)$, obtained by setting $D = H$ in (5.1), where the components of the random vectors $w_i$ are i.i.d. Rademacher random variables (i.e., $\Pr(w_{ij} = 1) = \Pr(w_{ij} = -1) = \tfrac{1}{2}$).

5.2.1 Improving the Bound in [22]

Theorem 5.1. Let $A$ be an $s \times s$ SPSD matrix. Given a pair $(\varepsilon, \delta)$, the inequality (5.2) holds with $D = H$ if $n$ satisfies (5.4).

Proof. Since $A$ is SPSD, it can be diagonalized by a unitary similarity transformation as $A = U^T \Lambda U$. Consider $n$ random vectors $w_i$, $i = 1, \ldots, n$, whose components are i.i.d. and drawn from the Rademacher distribution, and define $z_i = U w_i$ for each.
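We have
\[
\begin{aligned}
\Pr\left( \operatorname{tr}_H^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right)
&= \Pr\left( \frac{1}{n} \sum_{i=1}^{n} w_i^T A w_i \le (1-\varepsilon) \operatorname{tr}(A) \right) \\
&= \Pr\left( \frac{1}{n} \sum_{i=1}^{n} z_i^T \Lambda z_i \le (1-\varepsilon) \operatorname{tr}(A) \right) \\
&= \Pr\left( \sum_{i=1}^{n} \sum_{j=1}^{r} \lambda_j z_{ij}^2 \le n(1-\varepsilon) \operatorname{tr}(A) \right) \\
&= \Pr\left( \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)} \sum_{i=1}^{n} z_{ij}^2 \le n(1-\varepsilon) \right) \\
&\le \exp\{tn(1-\varepsilon)\}\, E\left( \exp\left\{ \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)} \sum_{i=1}^{n} -t z_{ij}^2 \right\} \right),
\end{aligned}
\]
where the last inequality holds for any $t > 0$ by Markov's inequality.

Next, using the convexity of the exp function and the linearity of expectation, we obtain
\[
\begin{aligned}
E\left( \exp\left\{ \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)} \sum_{i=1}^{n} -t z_{ij}^2 \right\} \right)
&\le \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)}\, E\left( \exp\left\{ \sum_{i=1}^{n} -t z_{ij}^2 \right\} \right) \\
&= \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)}\, E\left( \prod_{i=1}^{n} \exp\{-t z_{ij}^2\} \right) \\
&= \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)} \prod_{i=1}^{n} E\left( \exp\{-t z_{ij}^2\} \right),
\end{aligned}
\]
where the last equality holds since, for a given $j$, the $z_{ij}$'s are independent with respect to $i$.

Now, we want to have that
\[
\exp\{tn(1-\varepsilon)\} \prod_{i=1}^{n} E\left( \exp\{-t z_{ij}^2\} \right) \le \delta/2.
\]
For this we make use of the inequalities at the end of the proof of Lemma 5.1 of [2]. Following inequalities (15)–(19) in [2] and letting $t = \varepsilon/(2(1+\varepsilon))$, we get
\[
\exp\{tn(1-\varepsilon)\} \prod_{i=1}^{n} E\left( \exp\{-t z_{ij}^2\} \right) < \exp\left\{ -\frac{n}{2} \left( \frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3} \right) \right\}.
\]
Next, if $n$ satisfies (5.4) then $\exp\{-\frac{n}{2}(\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3})\} < \delta/2$, and thus it follows that
\[
\Pr\left( \operatorname{tr}_H^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right) < \delta/2.
\]
By a similar argument, making use of inequalities (11)–(14) in [2] with the same $t$ as above, we also obtain, with the same bound for $n$, that
\[
\Pr\left( \operatorname{tr}_H^n(A) \ge (1+\varepsilon) \operatorname{tr}(A) \right) \le \delta/2.
\]
So finally using the union bound yields the desired result.

It can be seen that (5.4) is the same bound as the one in [22] with the important exception that the factor $r = \operatorname{rank}(A)$ does not appear in the bound. Furthermore, the same bound on $n$ holds for any SPSD matrix.

5.2.2 A Matrix-Dependent Bound

Here we derive another bound for the Hutchinson trace estimator which may shed light as to what type of matrices the Hutchinson method is best suited for.

For $k, j = 1, \ldots, s$, let us denote by $a_{k,j}$ the $(k, j)$th element of $A$ and by $a_j$ its $j$th column.

Theorem 5.2. Let $A$ be an $s \times s$ symmetric positive semi-definite matrix, and define
\[
K_H^j := \frac{\|a_j\|^2 - a_{j,j}^2}{a_{j,j}^2} = \sum_{k \ne j} a_{k,j}^2 \,/\, a_{j,j}^2, \qquad K_H := \max_j K_H^j. \tag{5.6}
\]
Given a pair of positive small values $(\varepsilon, \delta)$, the inequality (5.2) holds with $D = H$ if
\[
n > 2 K_H\, c(\varepsilon, \delta).
\]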
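(5.7)

Proof. Elementary linear algebra implies that since $A$ is SPSD, $a_{j,j} \ge 0$ for each $j$. Furthermore, if $a_{j,j} = 0$ then the $j$th row and column of $A$ identically vanish, so we may assume below that $a_{j,j} > 0$ for all $j = 1, \ldots, s$. Note that
\[
\operatorname{tr}_H^n(A) - \operatorname{tr}(A) = \frac{1}{n} \sum_{j=1}^{s} \sum_{i=1}^{n} \sum_{\substack{k=1 \\ k \ne j}}^{s} a_{j,k} w_{ij} w_{ik}.
\]
Hence
\[
\begin{aligned}
\Pr\left( \operatorname{tr}_H^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right)
&= \Pr\left( \sum_{j=1}^{s} \sum_{i=1}^{n} \sum_{\substack{k=1 \\ k \ne j}}^{s} -a_{j,k} w_{ij} w_{ik} \ge n\varepsilon \operatorname{tr}(A) \right) \\
&= \Pr\left( \sum_{j=1}^{s} \frac{a_{j,j}}{\operatorname{tr}(A)} \sum_{i=1}^{n} \sum_{\substack{k=1 \\ k \ne j}}^{s} -\frac{a_{j,k}}{a_{j,j}} w_{ij} w_{ik} \ge n\varepsilon \right) \\
&\le \exp\{-tn\varepsilon\}\, E\left( \exp\left\{ \sum_{j=1}^{s} \frac{a_{j,j}}{\operatorname{tr}(A)} \sum_{i=1}^{n} \sum_{\substack{k=1 \\ k \ne j}}^{s} -\frac{a_{j,k}\, t}{a_{j,j}} w_{ij} w_{ik} \right\} \right),
\end{aligned}
\]
where the last inequality is again obtained for any $t > 0$ by using Markov's inequality. Now, again using the convexity of the exp function and the linearity of expectation, we obtain
\[
\begin{aligned}
\Pr\left( \operatorname{tr}_H^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right)
&\le \exp\{-tn\varepsilon\} \sum_{j=1}^{s} \frac{a_{j,j}}{\operatorname{tr}(A)}\, E\left( \exp\left\{ \sum_{i=1}^{n} \sum_{\substack{k=1 \\ k \ne j}}^{s} -\frac{a_{j,k}\, t}{a_{j,j}} w_{ij} w_{ik} \right\} \right) \\
&= \exp\{-tn\varepsilon\} \sum_{j=1}^{s} \frac{a_{j,j}}{\operatorname{tr}(A)} \prod_{i=1}^{n} E\left( \exp\left\{ \sum_{\substack{k=1 \\ k \ne j}}^{s} -\frac{a_{j,k}\, t}{a_{j,j}} w_{ij} w_{ik} \right\} \right)
\end{aligned}
\]
by independence of $w_{ij} w_{ik}$ with respect to the index $i$.

Next, note that
\[
E\left( \exp\left\{ \sum_{\substack{k=1 \\ k \ne j}}^{s} \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right) = E\left( \exp\left\{ \sum_{\substack{k=1 \\ k \ne j}}^{s} -\frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right).
\]
Furthermore, since $\Pr(w_{ij} = -1) = \Pr(w_{ij} = 1) = \tfrac{1}{2}$, and using the law of total expectation, we have
\[
E\left( \exp\left\{ \sum_{\substack{k=1 \\ k \ne j}}^{s} -\frac{a_{j,k}\, t}{a_{j,j}} w_{ij} w_{ik} \right\} \right) = E\left( \exp\left\{ \sum_{\substack{k=1 \\ k \ne j}}^{s} \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right) = \prod_{\substack{k=1 \\ k \ne j}}^{s} E\left( \exp\left\{ \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right),
\]
so
\[
\Pr\left( \operatorname{tr}_H^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right) \le \exp\{-tn\varepsilon\} \sum_{j=1}^{s} \frac{a_{j,j}}{\operatorname{tr}(A)} \prod_{i=1}^{n} \prod_{\substack{k=1 \\ k \ne j}}^{s} E\left( \exp\left\{ \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right).
\]
We want to have the right hand side expression bounded by $\delta/2$.

Applying Hoeffding's lemma we get
\[
E\left( \exp\left\{ \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right) \le \exp\left\{ \frac{a_{j,k}^2 t^2}{2 a_{j,j}^2} \right\},
\]
hence
\[
\exp\{-tn\varepsilon\} \prod_{i=1}^{n} \prod_{\substack{k=1 \\ k \ne j}}^{s} E\left( \exp\left\{ \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right) \le \exp\{-tn\varepsilon + K_H^j n t^2/2\} \tag{5.8a}
\]
\[
\le \exp\{-tn\varepsilon + K_H n t^2/2\}. \tag{5.8b}
\]
The choice $t = \varepsilon/K_H$ minimizes the right hand side. Now if (5.7) holds then
\[
\exp(-tn\varepsilon) \prod_{i=1}^{n} \prod_{\substack{k=1 \\ k \ne j}}^{s} E\left( \exp\left\{ \frac{a_{j,k}\, t}{a_{j,j}} w_{ik} \right\} \right) \le \delta/2,
\]
hence we have
\[
\Pr\left( \operatorname{tr}_H^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right) \le \delta/2.
\]
Similarly, we obtain that
\[
\Pr\left( \operatorname{tr}_H^n(A) \ge (1+\varepsilon) \operatorname{tr}(A) \right) \le \delta/2,
\]
and using the union bound finally gives the desired result.

Comparing (5.7) to (5.4), it is clear that the bound of the present subsection is only worthy of consideration if $K_H < 3$.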
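As a small illustration of this comparison, the sketch below evaluates $K_H$ from (5.6) for an explicitly stored matrix and returns the two Hutchinson thresholds, $6c$ from (5.4) and $2K_H c$ from (5.7). The function names and the example matrix are hypothetical choices made here for illustration, and the code assumes the matrix is SPSD with strictly positive diagonal entries, as in the proof above.

```python
import math

def c(eps, delta):
    """c(eps, delta) = eps^-2 * ln(2/delta), as in (5.3a)."""
    return math.log(2.0 / delta) / eps ** 2

def k_h(A):
    """K_H from (5.6): max over columns j of sum_{k != j} a_kj^2 / a_jj^2.

    Assumes every diagonal entry a_jj is strictly positive.
    """
    s = len(A)
    return max(
        sum(A[k][j] ** 2 for k in range(s) if k != j) / A[j][j] ** 2
        for j in range(s)
    )

def hutchinson_thresholds(A, eps, delta):
    """Return (6c, 2*K_H*c): (5.4) needs n >= 6c, (5.7) needs n > 2*K_H*c."""
    general = 6.0 * c(eps, delta)
    matrix_dependent = 2.0 * k_h(A) * c(eps, delta)
    return general, matrix_dependent

# Hypothetical SPSD example (strictly diagonally dominant, positive diagonal).
A_example = [[4.0, 1.0, 0.0],
             [1.0, 3.0, 1.0],
             [0.0, 1.0, 2.0]]

general, matrix_dependent = hutchinson_thresholds(A_example, eps=0.1, delta=0.05)
```

For this example the column ratios are 1/16, 2/9 and 1/4, so $K_H = 1/4$ and the matrix-dependent threshold is smaller than the general one by a factor of 12.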
Note that Theorem 5.2 emphasizes the relative $\ell_2$ energy of the off-diagonals: the matrix does not necessarily have to be diagonally dominant (i.e., where a similar relationship holds in the $\ell_1$ norm) for the bound on $n$ to be moderate. Furthermore, a matrix need not be “nearly” diagonal for this method to require small sample size. In fact a matrix can have off-diagonal elements of significant size that are far away from the main diagonal without automatically affecting the performance of the Hutchinson method. However, note also that our bound can be pessimistic, especially if the average value or the mode of $K_H^j$ in (5.6) is far lower than its maximum, $K_H$. This can be seen in the above proof where the estimate (5.8b) is obtained from (5.8a). Simulations in Section 5.5 show that the Hutchinson method can be a very efficient estimator even in the presence of large outliers, so long as the bulk of the distribution is concentrated near small values.

The case $K_H = 0$ corresponds to a diagonal matrix, for which the Hutchinson method yields the trace with one shot, $n = 1$. In agreement with the bound (5.7), we expect the actual required $n$ to grow when a sequence of otherwise similar matrices $A$ is envisioned in which $K_H$ grows away from 0, as the energy in the off-diagonal elements grows relatively to that in the diagonal elements.

5.3 Gaussian Estimator Bounds

In this section we consider the Gaussian trace estimator, $\operatorname{tr}_G^n(A)$, obtained by setting $D = G$ in (5.1), where the components of the random vectors $w_i$ are i.i.d. standard normal random variables. We give two sufficient and one necessary lower bounds for the number of Gaussian samples required to achieve an $(\varepsilon, \delta)$ trace estimate. The first sufficient bound (5.5) improves the result in [22] by a factor of 2.5. Our bound is only worse than (5.4) by a fraction, and it is an upper limit of the potentially more informative (if less available) bound (5.10), which relates to the properties of the matrix $A$.
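The bound (5.10) provides an indication as to what matrices may be suitable candidates for the Gaussian method. Then we present a practically computable, necessary bound for the sample size $n$.

5.3.1 Sufficient Bounds

The proof of the following theorem closely follows the approach in [22].

Theorem 5.3. Let $A$ be an $s \times s$ SPSD matrix and denote its eigenvalues by $\lambda_1, \ldots, \lambda_s$. Further, define
\[
K_G^j := \frac{\lambda_j}{\operatorname{tr}(A)}, \qquad K_G := \max_j K_G^j = \frac{\|A\|}{\operatorname{tr}(A)}. \tag{5.9}
\]
Then, given a pair of positive small values $(\varepsilon, \delta)$, the inequality (5.2) holds with $D = G$ provided that (5.5) holds. This estimate also holds provided that
\[
n > 8 K_G\, c(\varepsilon, \delta). \tag{5.10}
\]
Proof. Since $A$ is SPSD, we have $\|A\| \le \operatorname{tr}(A)$, so if (5.5) holds then so does (5.10). We next concentrate on proving the result assuming the tighter bound (5.10) on the actual $n$ required in a given instance.

Writing as in the previous section $A = U^T \Lambda U$, consider $n$ random vectors $w_i$, $i = 1, \ldots, n$, whose components are i.i.d. and drawn from the normal distribution, and define $z_i = U w_i$. Since $U$ is orthogonal, the elements $z_{ij}$ of $z_i$ are i.i.d. Gaussian random variables. We have as before,
\[
\begin{aligned}
\Pr\left( \operatorname{tr}_G^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right)
&= \Pr\left( \sum_{i=1}^{n} \sum_{j=1}^{r} \lambda_j z_{ij}^2 \le n(1-\varepsilon) \operatorname{tr}(A) \right) \\
&\le \exp\{tn(1-\varepsilon) \operatorname{tr}(A)\}\, E\left( \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{r} -t \lambda_j z_{ij}^2 \right\} \right) \\
&\le \exp\{tn(1-\varepsilon) \operatorname{tr}(A)\} \prod_{i=1}^{n} \prod_{j=1}^{r} E\left( \exp\{-t \lambda_j z_{ij}^2\} \right).
\end{aligned}
\]
Here $z_{ij}^2$ is a $\chi^2$ random variable of degree 1 (see [106]), and hence, from its moment generating function, we have
\[
E\left( \exp\{-t \lambda_j z_{ij}^2\} \right) = (1 + 2\lambda_j t)^{-\frac{1}{2}}.
\]
This yields the bound
\[
\Pr\left( \operatorname{tr}_G^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right) \le \exp\{tn(1-\varepsilon) \operatorname{tr}(A)\} \prod_{j=1}^{r} (1 + 2\lambda_j t)^{-\frac{n}{2}}.
\]
Next, it is easy to prove by elementary calculus that given any $0 < \alpha < 1$, the following holds for all $0 \le x \le \frac{1-\alpha}{\alpha}$:
\[
\ln(1 + x) \ge \alpha x. \tag{5.11}
\]
Setting $\alpha = 1 - \varepsilon/2$, then by (5.11) and for all $t \le (1-\alpha)/(2\alpha\|A\|)$, we have that $(1 + 2\lambda_j t) > \exp\{2\alpha\lambda_j t\}$, so
\[
\begin{aligned}
\Pr\left( \operatorname{tr}_G^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right)
&\le \exp\{tn(1-\varepsilon) \operatorname{tr}(A)\} \prod_{j=1}^{r} \exp(-n\alpha\lambda_j t) \\
&= \exp\{tn(1-\varepsilon-\alpha) \operatorname{tr}(A)\}.
\end{aligned}
\]
We want the latter right hand side to be bounded by $\delta/2$, i.e., we want to have
\[
n \ge \frac{\ln(2/\delta)}{(\alpha - (1-\varepsilon)) \operatorname{tr}(A)\, t} = \frac{2\varepsilon\, c(\varepsilon, \delta)}{\operatorname{tr}(A)\, t},
\]
where $t \le (1-\alpha)/(2\alpha\|A\|)$.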
Now, setting
\[
t = (1-\alpha)/(2\alpha\|A\|) = \varepsilon/(2(2-\varepsilon)\|A\|),
\]
we obtain
\[
n \ge 4(2-\varepsilon)\, c(\varepsilon, \delta)\, K_G,
\]
so if (5.10) holds then
\[
\Pr\left( \operatorname{tr}_G^n(A) \le (1-\varepsilon) \operatorname{tr}(A) \right) \le \delta/2.
\]
Using a similar argument we also obtain
\[
\Pr\left( \operatorname{tr}_G^n(A) \ge (1+\varepsilon) \operatorname{tr}(A) \right) \le \delta/2,
\]
and subsequently the union bound yields the desired result.

The matrix-dependent bound (5.10), proved to be sufficient in Theorem 5.3, provides additional information over (5.5) about the type of matrices for which the Gaussian estimator is (probabilistically) guaranteed to require only a small sample size: if the eigenvalues of an SPSD matrix are distributed such that the ratio $\|A\|/\operatorname{tr}(A)$ is small (e.g., if they are all of approximately the same size), then the Gaussian estimator bound requires a small number of realizations. This observation is reaffirmed by looking at the variance of this estimator, namely $2\|A\|_F^2$. It is easy to show that among all the matrices with a fixed trace and rank, those with equal eigenvalues have the smallest Frobenius norm.

It is easy to see that the stable rank (see [130] and references therein) of any real rectangular matrix $C$ which satisfies $A = C^T C$ equals $1/K_G$. Thus, the bound constant in (5.10) is inversely proportional to this stable rank, suggesting that estimating the trace using the Gaussian distribution may become inefficient if the stable rank of the matrix is low. Furthermore, the ratio
\[
\mathrm{eRank} := 1/K_G = \operatorname{tr}(A)/\|A\|
\]
is known as the effective rank of the matrix (see [51]), which is, similar to stable rank, a continuous relaxation and a stable quantity compared with the usual rank. Using the concept of effective rank, we can establish a connection between efficiency of the Gaussian estimator and the effective rank of matrices: Theorem 5.3 indicates that the true sample size, i.e., the minimum sample size for which (5.2) holds, is in fact in $O(1/\mathrm{eRank})$.
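The dependence of (5.10) on the spectrum can be illustrated with a short sketch. It assumes, purely for illustration, that the eigenvalues of the SPSD matrix are available as a list, which is generally not the case for an implicit matrix; the function names and the two example spectra are hypothetical.

```python
import math

def c(eps, delta):
    """c(eps, delta) = eps^-2 * ln(2/delta), as in (5.3a)."""
    return math.log(2.0 / delta) / eps ** 2

def k_g(eigs):
    """K_G = ||A|| / tr(A) from (5.9), given the eigenvalues of an SPSD matrix."""
    return max(eigs) / sum(eigs)

def effective_rank(eigs):
    """eRank = tr(A) / ||A|| = 1 / K_G."""
    return sum(eigs) / max(eigs)

def gaussian_thresholds(eigs, eps, delta):
    """Return (8c, 8*K_G*c): (5.5) needs n >= 8c, (5.10) needs n > 8*K_G*c."""
    return 8.0 * c(eps, delta), 8.0 * k_g(eigs) * c(eps, delta)

# Two hypothetical spectra with the same number of nonzero eigenvalues.
flat = [1.0] * 50                 # equal eigenvalues: eRank = 50
spiky = [50.0] + [1.0] * 49       # one dominant eigenvalue: eRank = 99/50

flat_bounds = gaussian_thresholds(flat, eps=0.1, delta=0.05)
spiky_bounds = gaussian_thresholds(spiky, eps=0.1, delta=0.05)
```

With the flat spectrum the matrix-dependent threshold is 50 times smaller than $8c$, whereas for the spiky spectrum it is only about half of $8c$, in line with the role of the effective rank discussed above.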
Hence as the effective rank of a matrix grows larger, it becomes easier (i.e., smaller sample size is required) to estimate its trace, with the same probabilistic accuracy. Theorem 5.5 in Section 5.3.2 below establishes a different relationship between the inefficiency of the Gaussian estimator and a rank of a matrix.

As an example of an application of the above results, let us consider finding the minimum number of samples required to compute the rank of a projection matrix using the Gaussian estimator [22, 26]. Recall that a projection matrix is SPSD with only 0 and 1 eigenvalues. Compared to the derivation in [22], here we use Theorem 5.3 directly to obtain a similar bound with a slightly better constant.

Corollary 5.4. Let $A$ be an $s \times s$ projection matrix with rank $r > 0$, and denote the rounding of any real scalar $x$ to the nearest integer by $\operatorname{round}(x)$. Then, given a positive small value $\delta$, the estimate
\[
\Pr\left( \operatorname{round}(\operatorname{tr}_G^n(A)) \ne r \right) \le \delta \tag{5.12a}
\]
holds if
\[
n \ge 8\, r \ln(2/\delta). \tag{5.12b}
\]
Proof. The result immediately follows using Theorem 5.3 upon setting $\varepsilon = 1/r$, $\|A\| = 1$ and $\operatorname{tr}(A) = r$.

5.3.2 A Necessary Bound

Below we provide a rank-dependent, almost tight necessary condition for the minimum sample size required to obtain (5.2). This bound is easily computable in case that $r = \operatorname{rank}(A)$ is known.

Before we proceed, recall the definition of the regularized Gamma functions
\[
P(\alpha, \beta) := \frac{\gamma(\alpha, \beta)}{\Gamma(\alpha)}, \qquad Q(\alpha, \beta) := \frac{\Gamma(\alpha, \beta)}{\Gamma(\alpha)},
\]
where $\gamma(\alpha, \beta)$, $\Gamma(\alpha, \beta)$ and $\Gamma(\alpha)$ are, respectively, the lower incomplete, the upper incomplete and the complete Gamma functions, see [1]. We also have that $\Gamma(\alpha) = \Gamma(\alpha, \beta) + \gamma(\alpha, \beta)$. Further, define
\[
\Phi_\theta(x) := P\left( \frac{x}{2}, \frac{\tau(1-\theta)x}{2} \right) + Q\left( \frac{x}{2}, \frac{\tau(1+\theta)x}{2} \right), \tag{5.13a}
\]
where
\[
\tau = \frac{\ln(1+\theta) - \ln(1-\theta)}{2\theta}. \tag{5.13b}
\]
Theorem 5.5. Let $A$ be a rank-$r$ SPSD $s \times s$ matrix, and let $(\varepsilon, \delta)$ be a tolerance pair. If the inequality (5.2) with $D = G$ holds for some $n$, then necessarily
\[
\Phi_\varepsilon(nr) \le \delta. \tag{5.14}
\]
Proof.
As in the proof of Theorem 5.3 we have
\[
\begin{aligned}
\Pr\left( \left| \operatorname{tr}_G^n(A) - \operatorname{tr}(A) \right| \le \varepsilon \operatorname{tr}(A) \right)
&= \Pr\left( \left| \sum_{i=1}^{n} \sum_{j=1}^{r} \lambda_j z_{ij}^2 - n \operatorname{tr}(A) \right| \le \varepsilon n \operatorname{tr}(A) \right) \\
&= \Pr\left( (1-\varepsilon) \le \sum_{i=1}^{n} \sum_{j=1}^{r} \frac{\lambda_j}{\operatorname{tr}(A)\, n} z_{ij}^2 \le (1+\varepsilon) \right).
\end{aligned}
\]
Next, applying Theorem 3 of [129] gives
\[
\Pr\left( \left| \operatorname{tr}_G^n(A) - \operatorname{tr}(A) \right| \le \varepsilon \operatorname{tr}(A) \right) \le \Pr\left( c(1-\varepsilon) \le \frac{1}{nr} \mathcal{X}_{nr}^2 \le c(1+\varepsilon) \right),
\]
where $\mathcal{X}_M^2$ denotes a chi-squared random variable of degree $M$ with the cumulative distribution function (CDF)
\[
\mathrm{CDF}_{\mathcal{X}_M^2}(x) = \Pr\left( \mathcal{X}_M^2 \le x \right) = \frac{\gamma\left( \frac{M}{2}, \frac{x}{2} \right)}{\Gamma\left( \frac{M}{2} \right)}.
\]
A further straightforward manipulation yields the stated result.

Using the condition (5.14), we can establish a connection between inefficiency of the Gaussian estimator and the rank of matrices: Theorem 5.5 indicates that the true sample size, i.e., the minimum sample size for which (5.2) holds, is in fact in $\Omega(1/r)$. Hence, as the rank of a matrix becomes smaller, it becomes harder (i.e., a larger sample size is necessarily required) to estimate its trace, with the same probabilistic accuracy.

Having a computable necessary condition is practically useful: given a pair of fixed sample size $n$ and error tolerance $\varepsilon$, the failure probability $\delta$ cannot be smaller than $\delta_0 = \Phi_\varepsilon(nr)$.

Since our sufficient bounds are not tight, it is not possible to make a direct comparison between the Hutchinson and Gaussian methods based on them. However, using this necessary condition can help for certain matrices. Consider a low rank matrix with a rather small $K_H$ in (5.7). For such a matrix and a given pair $(\varepsilon, \delta)$, the condition (5.14) will probabilistically necessitate a rather large $n$, while (5.7) may give a much smaller sufficient bound for $n$. In this situation, using Theorem 5.5, the Hutchinson method is indeed guaranteed to require a smaller sample size than the Gaussian method.

The condition in Theorem 5.5 is almost tight in the following sense. Note that in (5.13b), $\tau \approx 1$ for $\theta = \varepsilon$ sufficiently small.
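The size of the deviation of $\tau$ from 1 can be read off from the standard power series of the inverse hyperbolic tangent; the following short expansion is added here as a worked check and is not part of the original derivation:

```latex
\[
\tau = \frac{\ln(1+\theta) - \ln(1-\theta)}{2\theta}
     = \frac{\operatorname{artanh}(\theta)}{\theta}
     = 1 + \frac{\theta^2}{3} + \frac{\theta^4}{5} + \cdots,
\qquad 0 < \theta < 1 .
\]
```

For instance, $\theta = 0.1$ gives $\tau \approx 1.0034$, so $\tau$ differs from 1 only at second order in $\theta$.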
So, $1 - \Phi_\varepsilon(nr)$ would be very close to
\[
\Pr\big((1-\varepsilon) \le \operatorname{tr}_G^n(A^*) \le (1+\varepsilon)\big),
\]
where $A^*$ is an SPD matrix of the same rank as $A$ whose eigenvalues are all equal to $1/r$. Next note that the condition (5.14) should hold for all matrices of the same rank; hence it is almost tight. Figures 5.1 and 5.4 demonstrate this effect.

Notice that for a very low rank matrix and a reasonable pair $(\varepsilon, \delta)$, the necessary $n$ given by (5.14) could be even larger than the matrix size $s$, i.e., $n \ge s$, rendering the Gaussian method useless for such instances; see Figure 5.1.

[Figure 5.1, panels: (a) $\varepsilon = \delta = 0.02$; (b) $n = 10{,}000$, $\varepsilon = \delta = 0.1$]
Figure 5.1: Necessary bound for the Gaussian estimator: (a) the log-scale of $n$ according to (5.14) as a function of $r = \mathrm{rank}(A)$: larger ranks yield smaller necessary sample size. For very low rank matrices, the necessary bound grows significantly: for $s = 1000$ and $r \le 30$, necessarily $n > s$ and the Gaussian method is practically useless; (b) tightness of the necessary bound demonstrated by an actual run as described for Example 5.4 in Section 5.5 where $A$ has all eigenvalues equal.

Both of the conditions given in (5.5) and (5.14) are sharpened in Chapter 7, where tight (i.e., exact for some class of matrices) necessary and sufficient conditions are derived.

5.4 Random Unit Vector Bounds, with and without Replacement, for General Square Matrices

An alternative to the Hutchinson and Gaussian estimators is to draw the vectors $w_i$ from among the $s$ columns of the scaled identity matrix $\sqrt{s}\,I$, i.e., we use a random subset of the vectors forming the scaled identity matrix. Note that if $w_i$ is the $i$th (scaled) unit vector then $w_i^T A w_i = s\, a_{ii}$. Hence the trace can be recovered in $n = s$ deterministic steps upon setting in (5.1) $i = j$, $j = 1, 2, \ldots, s$.
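The sampling just described can be illustrated with a minimal sketch, written here in Python with only the standard library. The function name `unit_vector_trace_estimate` and the small test matrices are hypothetical choices made for illustration, not taken from the thesis. The sketch assumes the estimator in (5.1) is the sample average of $w_i^T A w_i$, and uses the fact that each sample equals $s$ times a diagonal entry when $w_i = \sqrt{s}\, e_i$.

```python
import random


def unit_vector_trace_estimate(A, n, replace, rng):
    """Average of n samples w_i^T A w_i with w_i = sqrt(s) * e_i (illustrative helper)."""
    s = len(A)
    if replace:
        # U1-style sampling: indices drawn independently, repetitions allowed
        idx = [rng.randrange(s) for _ in range(n)]
    else:
        # U2-style sampling: n distinct indices, requires n <= s
        idx = rng.sample(range(s), n)
    # For a scaled unit vector, w_i^T A w_i = s * a_ii, so only the diagonal is touched
    return sum(s * A[i][i] for i in idx) / n


rng = random.Random(0)
s = 6
# Hypothetical test matrix: diagonal 1, 3, 5, 7, 9, 11 and constant off-diagonal 0.5
A = [[float(2 * i + 1) if i == j else 0.5 for j in range(s)] for i in range(s)]
exact_trace = sum(A[i][i] for i in range(s))

# n = s draws without replacement visit every diagonal entry exactly once
full_sweep = unit_vector_trace_estimate(A, s, replace=False, rng=rng)

# For the matrix of all 1s every diagonal entry is equal, so one sample suffices
ones = [[1.0] * s for _ in range(s)]
one_sample = unit_vector_trace_estimate(ones, 1, replace=True, rng=rng)
```

With fewer than $s$ samples the estimate is generally inexact, and its quality depends on how much the diagonal entries differ from one another.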
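However, our hope is that for some matrices a good approximation for the trace can be recovered in $n \ll s$ such steps, with $w_i$'s drawn as mentioned above.

There are typically two ways one can go about drawing such samples: with or without replacement. The first of these has been studied in [22]. However, in view of the exact procedure, we may expect to occasionally require smaller sample sizes by using the strategy of sampling without replacement. In this section we make this intuitive observation more rigorous.

In what follows, $U_1$ and $U_2$ refer to the uniform distribution of unit vectors with and without replacement, respectively. We first find expressions for the mean and variance of both strategies, obtaining a smaller variance for $U_2$.

Lemma 5.6. Let $A$ be an $s \times s$ matrix and let $n$ denote the sample size. Then
\[
\mathrm{E}\big(\operatorname{tr}_{U_1}^n(A)\big) = \mathrm{E}\big(\operatorname{tr}_{U_2}^n(A)\big) = \operatorname{tr}(A), \tag{5.15a}
\]
\[
\mathrm{Var}\big(\operatorname{tr}_{U_1}^n(A)\big) = \frac{1}{n}\Big(s \sum_{j=1}^{s} a_{jj}^2 - \operatorname{tr}(A)^2\Big), \tag{5.15b}
\]
\[
\mathrm{Var}\big(\operatorname{tr}_{U_2}^n(A)\big) = \frac{(s-n)}{n(s-1)}\Big(s \sum_{j=1}^{s} a_{jj}^2 - \operatorname{tr}(A)^2\Big), \quad n \le s. \tag{5.15c}
\]

Proof. The results for $U_1$ are proved in [22]. Let us next concentrate on $U_2$, and group the randomly selected unit vectors into an $s \times n$ matrix $W$. Then
\[
\mathrm{E}\big(\operatorname{tr}_{U_2}^n(A)\big) = \frac{1}{n}\mathrm{E}\big(\operatorname{tr}(W^T A W)\big) = \frac{1}{n}\mathrm{E}\big(\operatorname{tr}(A\, W W^T)\big) = \frac{1}{n}\operatorname{tr}\big(A\, \mathrm{E}(W W^T)\big).
\]
Let $y_{ij}$ denote the $(i,j)$th element of the random matrix $W W^T$. Clearly, $y_{ij} = 0$ if $i \neq j$. It is also easily seen that $y_{ii}$ can only take on the values $0$ or $s$. We have
\[
\mathrm{E}(y_{ii}) = s \Pr(y_{ii} = s) = s\, \frac{\binom{s-1}{n-1}}{\binom{s}{n}} = n,
\]
so $\mathrm{E}(W W^T) = n \cdot I$, where $I$ stands for the identity matrix. This, in turn, gives $\mathrm{E}\big(\operatorname{tr}_{U_2}^n(A)\big) = \operatorname{tr}(A)$.

For the variance, we first calculate
\[
\begin{aligned}
\mathrm{E}\Big[\big(\operatorname{tr}_{U_2}^n(A)\big)^2\Big]
&= \frac{1}{n^2}\, \mathrm{E}\Big[\sum_{i=1}^{n}\sum_{j=1}^{n} \big(w_i^T A w_i\big)\big(w_j^T A w_j\big)\Big] \\
&= \frac{1}{n^2}\Big(\sum_{i=1}^{n} \mathrm{E}\Big[\big(w_i^T A w_i\big)^2\Big] + \sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n} \mathrm{E}\Big[\big(w_i^T A w_i\big)\big(w_j^T A w_j\big)\Big]\Big)
\end{aligned} \tag{5.16}
\]
Let $e_j$ denote the $j$th column of the scaled identity matrix, $\sqrt{s}\,I$. Using the law of total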
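expectation (i.e., the tower rule), we have for any two random vectors $w_i$ and $w_j$ with $i \neq j$,
\[
\begin{aligned}
\mathrm{E}\Big[\big(w_i^T A w_i\big)\big(w_j^T A w_j\big)\Big]
&= \sum_{k=1}^{s} \mathrm{E}\Big[\big(w_i^T A w_i\big)\big(w_j^T A w_j\big) \,\Big|\, w_i = e_k\Big] \cdot \Pr(w_i = e_k) \\
&= \sum_{k=1}^{s} s\, a_{kk} \cdot \mathrm{E}\Big[\big(w_j^T A w_j\big) \,\Big|\, w_i = e_k\Big] \cdot \frac{1}{s} \\
&= \sum_{k=1}^{s} a_{kk} \sum_{\substack{l=1 \\ l \neq k}}^{s} \mathrm{E}\Big[\big(w_j^T A w_j\big) \,\Big|\, w_j = e_l\Big] \cdot \Pr(w_j = e_l \mid w_i = e_k) \\
&= \sum_{k=1}^{s} a_{kk} \sum_{\substack{l=1 \\ l \neq k}}^{s} s\, a_{ll}\, \frac{1}{s-1} \\
&= \frac{s}{s-1} \sum_{k=1}^{s}\sum_{\substack{l=1 \\ k \neq l}}^{s} a_{kk}\, a_{ll} \\
&= \frac{s}{s-1}\Big(\operatorname{tr}(A)^2 - \sum_{j=1}^{s} a_{jj}^2\Big).
\end{aligned}
\]
Substituting this in (5.16) gives
\[
\mathrm{E}\Big[\big(\operatorname{tr}_{U_2}^n(A)\big)^2\Big] = \frac{1}{n^2}\Big(s\, n \sum_{j=1}^{s} a_{jj}^2 + \frac{s\, n (n-1)}{s-1}\Big(\operatorname{tr}(A)^2 - \sum_{j=1}^{s} a_{jj}^2\Big)\Big).
\]
Next, the variance is
\[
\mathrm{Var}\big(\operatorname{tr}_{U_2}^n(A)\big) = \mathrm{E}\Big[\big(\operatorname{tr}_{U_2}^n(A)\big)^2\Big] - \Big[\mathrm{E}\big(\operatorname{tr}_{U_2}^n(A)\big)\Big]^2,
\]
which gives (5.15c).

Note that $\mathrm{Var}\big(\operatorname{tr}_{U_2}^n(A)\big) = \frac{s-n}{s-1}\, \mathrm{Var}\big(\operatorname{tr}_{U_1}^n(A)\big)$. The difference in variance between these sampling strategies is small for $n \ll s$, and they coincide if $n = 1$. Moreover, in case that the diagonal entries of the matrix are all equal, the variance for both sampling strategies vanishes.

We now turn to the analysis of the sample size required to ensure (5.2) and find a slight improvement over the bound given in [22] for $U_1$. A similar analysis for the case of sampling without replacement shows that the latter may generally be a somewhat better strategy.

Theorem 5.7. Let $A$ be a real $s \times s$ matrix, and denote
\[
K_U^{(i,j)} = \frac{s}{\operatorname{tr}(A)}\, |a_{ii} - a_{jj}|, \qquad K_U = \max_{\substack{1 \le i,j \le s \\ i \neq j}} K_U^{(i,j)}. \tag{5.17}
\]
Given a pair of positive small values $(\varepsilon, \delta)$, the inequality (5.2) holds with $D = U_1$ if
\[
n > \frac{K_U^2}{2}\, c(\varepsilon, \delta) \equiv F, \tag{5.18}
\]
and with $D = U_2$ if
\[
n \ge \frac{s+1}{1 + \frac{s-1}{F}}. \tag{5.19}
\]

Proof. This proof is refreshingly short. Note first that every sample of these estimators takes on a Rayleigh value in $[s \min_j a_{jj},\; s \max_j a_{jj}]$.

The proof of (5.18), for the case with replacement, uses Hoeffding's inequality in exactly the same way as the corresponding theorem in [22].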
We obtain directly that if (5.18) is satisfied then (5.2) holds with $D = U_1$.

For the case without replacement we use Serfling's inequality [125] to obtain
\[
\Pr\big(|\operatorname{tr}_{U_2}^n(A) - \operatorname{tr}(A)| \ge \varepsilon \operatorname{tr}(A)\big) \le 2 \exp\Big\{\frac{-2 n \varepsilon^2}{(1 - f_{n-1})\, K_U^2}\Big\},
\]
where $f_n$ is the sampling fraction defined as
\[
f_n = \frac{n-1}{s-1}.
\]
Now, for the inequality (5.2) to hold, we need
\[
2 \exp\Big\{\frac{-2 n \varepsilon^2}{(1 - f_{n-1})\, K_U^2}\Big\} \le \delta,
\]
so we require that
\[
\frac{n}{1 - f_{n-1}} \ge F.
\]
The stated result (5.19) is obtained following some straightforward algebraic manipulation.

Looking at the bounds (5.18) for $U_1$ and (5.19) for $U_2$ and observing the expression (5.17) for $K_U$, one can gain insight as to the type of matrices which are handled efficiently using this estimator: this would be the case if the diagonal elements of the matrix all have similar values. In the extreme case where they are all the same, we only need one sample. The corresponding expression in [22] does not reflect this result.

An illustration of the relative behaviour of the two bounds is given in Figure 5.2.

Figure 5.2: The behaviour of the bounds (5.18) and (5.19) with respect to the factor $K = K_U$ for $s = 1000$ and $\varepsilon = \delta = 0.05$. The bound for $U_2$ is much more resilient to the distribution of the diagonal values than that of $U_1$. For very small values of $K_U$, there is no major difference between the bounds.

5.5 Numerical Examples

In this section we experiment with several examples, comparing the performance of different methods with regards to various matrix properties and verifying that the bounds obtained in our theorems indeed agree with the numerical experiments.

Example 5.1. In this example we do not consider $\delta$ at all. Rather, we check numerically for various values of $\varepsilon$ what value of $n$ is required to achieve a result respecting this relative tolerance. We have calculated maximum and average values for $n$ over 100 trials for several special examples, verifying numerically the following considerations.

Figure 5.3: Example 5.1.
For the matrix of all 1s with $s = 10{,}000$, the plot depicts the numbers of samples in 100 trials required to satisfy the relative tolerance $\varepsilon = .05$, sorted by increasing $n$. The average $n$ for both Hutchinson and Gauss estimators was around 50, while for the uniform unit vector estimator always $n = 1$. Only the best 90 results (i.e., lowest resulting values of $n$) are shown for reasons of scaling. Clearly, the unit vector method is superior here.

• The matrix of all 1s (in Matlab, A=ones(s,s)) has been considered in [22]. Here $\operatorname{tr}(A) = s$, $K_H = s - 1$, and a very large $n$ is often required if $\varepsilon$ is small for both Hutchinson and Gauss methods. For the unit vector method, however, $K_U = 0$ in (5.17), so the latter method converges in one iteration, $n = 1$. This fact yields an example where the unit vector estimator is far better than either Hutchinson or Gaussian estimators; see Figure 5.3.

• Another extreme example, where this time it is the Hutchinson estimator which requires only one sample whereas the other methods may require many more, is the case of a diagonal matrix $A$. For a diagonal matrix, $K_H = 0$, and the result follows from Theorem 5.2.

• If $A$ is a multiple of the identity then, since $K_U = K_H = 0$, only the Gaussian estimator from among the methods considered requires more than one sample; thus, it is worst.

• Examples where the unit vector estimator is consistently (and significantly) worst are obtained by defining $A = Q^T D Q$ for a diagonal matrix $D$ with different positive elements which are of the same order of magnitude and a nontrivial orthogonal matrix $Q$.

• We have not been able to come up with a simple example of the above sort where the Gaussian estimator shines over both others, although we have seen many occasions in practice where it slightly outperforms the Hutchinson estimator with both being significantly better than the unit vector estimators.

Example 5.2. Consider the matrix $A = x x^T / \|x\|^2$, where $x \in \mathbb{R}^s$, and for some $\theta > 0$, $x_j = \exp(-j\theta)$, $1 \le j \le s$.
This extends the example of all 1s of Figure 5.3 (for which $\theta = 0$) to instances with rapidly decaying elements.

It is easy to verify that
\[
\begin{aligned}
&\operatorname{tr}(A) = 1, \qquad r = 1, \qquad K_G = 1, \\
&K_H^j = \|x\|^2 x_j^{-2} - 1, \qquad K_H = \|x\|^2 x_s^{-2} - 1, \\
&K_U^{(i,j)} = \frac{s}{\|x\|^2}\, |x_i^2 - x_j^2|, \qquad K_U = \frac{s}{\|x\|^2}\, (x_1^2 - x_s^2), \\
&\|x\|^2 = \frac{\exp(-2\theta) - \exp(-2(s+1)\theta)}{1 - \exp(-2\theta)}.
\end{aligned}
\]

Figure 5.4: Example 5.2. For the rank-1 matrix arising from a rapidly-decaying vector with $s = 1000$, this log-log plot depicts the actual sample size $n$ required for (5.2) to hold with $\varepsilon = \delta = 0.2$, vs. various values of $\theta$. In the legend, "Unit" refers to the random sampling method without replacement.

Figure 5.4 displays the "actual sample size" $n$ for a particular pair $(\varepsilon, \delta)$ as a function of $\theta$ for the three distributions. The values $n$ were obtained by running the code 100 times for each $\theta$ to calculate the empirical probability of success.

In this example the distribution of $K_H^j$ values gets progressively worse with heavier tail values as $\theta$ gets larger. However, recall that this matters in terms of the sufficient bounds (5.4) and (5.7) only so long as $K_H < 3$. Here the crossover point happens roughly when $\theta \sim 1/(2s)$. Indeed, for large values of $\theta$ the required sample size actually drops when using the Hutchinson method: Theorem 5.2, being only a sufficient condition, merely distinguishes types of matrices for which Hutchinson is expected to be efficient, while making no claim regarding those matrices for which it is an inefficient estimator.

On the other hand, Theorem 5.5 clearly distinguishes the types of matrices for which the Gaussian method is expected to be inefficient, because its condition is necessary rather than sufficient. Note that $n$ (the red curve in Figure 5.4) does not change much as a function of $\theta$, which agrees with the fact that the matrix rank stays fixed and low at $r = 1$.

The unit vector estimator, unlike Hutchinson, deteriorates steadily as $\theta$ is increased, because this estimator ignores off-diagonal elements.
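The closed-form quantities listed for Example 5.2 can be tabulated directly. The following Python sketch (standard library only) is an illustration under stated assumptions: the function name `example_5_2_quantities`, the small size `s = 50` and the chosen values of `theta` are hypothetical and were picked only to keep the arithmetic well inside floating-point range. It evaluates $\|x\|^2$ both by direct summation and by the geometric-series formula, and then $K_H$ and $K_U$ from the expressions given in the example.

```python
import math


def example_5_2_quantities(s, theta):
    """Return (||x||^2 by summation, ||x||^2 closed form, K_H, K_U) for x_j = exp(-j*theta)."""
    x_sq = [math.exp(-2.0 * j * theta) for j in range(1, s + 1)]  # x_j^2, j = 1..s
    norm_sq = sum(x_sq)
    q = math.exp(-2.0 * theta)
    # Geometric-series form of ||x||^2 quoted in the example
    norm_sq_closed = (q - math.exp(-2.0 * (s + 1) * theta)) / (1.0 - q)
    k_h = norm_sq / x_sq[-1] - 1.0              # K_H = ||x||^2 x_s^{-2} - 1
    k_u = s / norm_sq * (x_sq[0] - x_sq[-1])    # K_U = (s / ||x||^2)(x_1^2 - x_s^2)
    return norm_sq, norm_sq_closed, k_h, k_u


s = 50  # illustrative size only; the example in the text uses s = 1000
thetas = (0.001, 0.01, 0.1)
table = {theta: example_5_2_quantities(s, theta) for theta in thetas}
for theta in thetas:
    norm_sq, norm_sq_closed, k_h, k_u = table[theta]
    print(f"theta={theta:<6} ||x||^2={norm_sq:.6f} K_H={k_h:.4g} K_U={k_u:.4g}")
```

In this small setting both $K_H$ and $K_U$ grow with $\theta$, which is the direction of change discussed for the Hutchinson and unit vector indicators in this example.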
However, for small enough values of $\theta$ the $K_U^{(i,j)}$'s are spread tightly near zero, and the unit vector method, as predicted by Theorem 5.7, requires a very small sample size.

For Examples 5.3 and 5.5 below, given $(\varepsilon, \delta)$, we plot the probability of success, i.e., $\Pr\big(|\operatorname{tr}_D^n(A) - \operatorname{tr}(A)| \le \varepsilon \operatorname{tr}(A)\big)$ for increasing values of $n$, starting from $n = 1$. We stop when for a given $n$, the probability of success is greater than or equal to $1 - \delta$. In order to evaluate this for each $n$, we run the experiments 500 times and calculate the empirical probability.

In the figures below, 'With Rep.' and 'Without Rep.' refer to uniform unit sampling with and without replacement, respectively. In all cases, by default, $\varepsilon = \delta = .05$. We also provide distribution plots of the quantities $K_H^j$, $K_G^j$ and $K_U^{(i,j)}$ appearing in (5.6), (5.9) and (5.17), respectively. These quantities are indicators for the performance of the Hutchinson, Gaussian and unit vector estimators, respectively, as evidenced not only by Theorems 5.2, 5.3 and 5.7, but also in Examples 5.1 and 5.2, and by the fact that the performance of the Gaussian and unit vector estimators is not affected by the energy of the off-diagonal matrix elements.

Example 5.3 (Data fitting with many experiments). A major source of applications where trace estimation is central arises in problems involving least squares data fitting with many experiments (cf. Chapter 1). In its simplest, linear form, we look for $\mathbf{m} \in \mathbb{R}^{m}$ so that the misfit function
\[
\phi(\mathbf{m}) = \sum_{i=1}^{s} \|J_i \mathbf{m} - d_i\|^2, \tag{5.20a}
\]
for given data sets $d_i \in \mathbb{R}^{l}$ and sensitivity matrices $J_i$, is either minimized or reduced below some tolerance level. The $l \times m$ matrices $J_i$ are very expensive to calculate and store, so this is avoided altogether, but evaluating $J_i \mathbf{m}$ for any suitable vector $\mathbf{m}$ is manageable. Moreover, $s$ is large. Next, writing (5.20a) using the Frobenius norm as
\[
\phi(\mathbf{m}) = \|C\|_F^2, \tag{5.20b}
\]
where $C$ is $l \times s$ with the $j$th column $C_j = J_j \mathbf{m} - d_j$, and defining the SPSD matrix $A = C^T C$, we have $\phi(\mathbf{m}) = \operatorname{tr}(A(\mathbf{m}))$.
(5.20c)

Cheap estimates of the misfit function $\phi(\mathbf{m})$ are then sought by approximating the trace in (5.20c) using only $n$ (rather than $s$) linear combinations of the columns of $C$, which naturally leads to expressions of the form (5.1). Hutchinson and Gaussian estimators in a similar or more complex context were considered in [71, 134, 138].

Drawing the $w_i$ as random unit vectors instead is a method proposed in [46] and compared to others in Chapter 3, where it is called "random subset": this latter method can have efficiency advantages that are beyond the scope of the presentation here. Typically, $l \ll s$, and thus the matrix $A$ is dense and often has low rank.

Furthermore, the signs of the entries in $C$ can be, at least to some extent, considered random. Hence we consider below matrices $A = C^T C$, where the entries of $C$ are Gaussian random variables obtained using the Matlab command C = randn(l,s). We use $l = 200$ and hence the rank is, almost surely, $r = 200$.

[Figure 5.5, panels: (a) Convergence Rate; (b) $K_H^j$ distribution; (c) $K_U^{(i,j)}$ distribution; (d) Eigenvalue distribution]
Figure 5.5: Example 5.3. A dense SPSD matrix $A$ is constructed using Matlab's randn. Here $s = 1000$, $r = 200$, $\operatorname{tr}(A) = 1$, $K_G = 0.0105$, $K_H = 8.4669$ and $K_U = 0.8553$. The method convergence plots in (a) are for $\varepsilon = \delta = .05$.

It can be seen from Figure 5.5(a) that the Hutchinson and the Gaussian methods perform similarly here. The sample size required by both unit vector estimators is approximately twice that of the Gaussian and Hutchinson methods. This relative behaviour agrees with our observations in the context of actual application as described above, see Chapter 3. From Figure 5.5(d), the eigenvalue distribution of the matrix is not very badly skewed, which helps the Gaussian method perform relatively well for this sort of matrix. On the other hand, by Figure 5.5(b) the relative $\ell_2$ energies of the off-diagonals are far from being small, which is not favourable for the Hutchinson method.
These two properties, in combination, result in the similar performance of the Hutchinson and Gaussian methods despite the relatively low rank. The contrast between $K_U^{(i,j)}$'s is not too large according to Figure 5.5(c), hence a relatively decent performance of both unit vector (or, random sampling) methods is observed. There is no reason to insist on avoiding repetition here either.

Example 5.4 (Effect of rank and $K_G$ on the Gaussian estimator). In this example we plot the actual sample size $n$ required for (5.2) to hold. In order to evaluate (5.2), we repeat the experiments 500 times and calculate the empirical probability. In all experiments, the sample sizes predicted by (5.4) and (5.5) were so pessimistic compared with the true $n$ that we simply did not include them in the plots.

[Figure 5.6, panels: (a) sprandn, $s = 5{,}000$; (b) diagonal, $s = 10{,}000$]
Figure 5.6: Example 5.4. The behaviour of the Gaussian method with respect to rank and $K_G$. We set $\varepsilon = \delta = .05$ and display the necessary condition (5.14) as well.

In order to concentrate only on rank and $K_G$ variation, we make sure that in all experiments $K_H \ll 1$. For the results displayed in Figure 5.6(a), where $r$ is varied for each of two values of $K_G$, this is achieved by playing with Matlab's normal random generator function sprandn. For Figure 5.6(b), where $K_G$ is varied for each of two values of $r$, diagonal matrices are utilized: we start with a uniform distribution of the eigenvalues and gradually make this distribution more skewed, resulting in an increased $K_G$. The low $K_H$ values cause the Hutchinson method to look very good, but that is not our focus here.

It can be clearly seen from Figure 5.6(a) that as the matrix rank gets lower, the sample size required for the Gaussian method grows significantly. For a given rank, the matrix with a smaller $K_G$ requires smaller sample size.
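The rank effect described here can be reproduced qualitatively with a small simulation. The sketch below is in Python (standard library only) and is not the code used for the thesis experiments: the function name `gaussian_success_probability`, the 200 trials (the text uses 500), and the values `n = 20`, `eps = 0.2` are assumptions chosen to keep the run short. It relies on the representation used in the proof of Theorem 5.5, namely that for an SPSD matrix with nonzero eigenvalues $\lambda_j$ each Gaussian sample is distributed as $\sum_j \lambda_j z_j^2$ with independent standard normal $z_j$, so only the eigenvalues are needed.

```python
import random


def gaussian_success_probability(eigs, n, eps, trials, rng):
    """Empirical Pr(|tr_G^n(A) - tr(A)| <= eps * tr(A)) for a matrix with eigenvalues eigs."""
    trace = sum(eigs)
    hits = 0
    for _ in range(trials):
        est = 0.0
        for _ in range(n):
            # One Gaussian sample w^T A w, simulated as sum_j lambda_j * z_j^2
            est += sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigs)
        est /= n
        if abs(est - trace) <= eps * trace:
            hits += 1
    return hits / trials


rng = random.Random(1)
n, eps, trials = 20, 0.2, 200  # illustrative values, not those of the thesis
# Two matrices with trace 1 and equal nonzero eigenvalues, of rank 2 and rank 50
p_low_rank = gaussian_success_probability([1.0 / 2] * 2, n, eps, trials, rng)
p_high_rank = gaussian_success_probability([1.0 / 50] * 50, n, eps, trials, rng)
print(p_low_rank, p_high_rank)
```

For the same $n$ and $\varepsilon$, the lower-rank matrix succeeds noticeably less often, so a larger sample size would be needed to reach the same success probability.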
From Figure 5.6(b) it can also be seen that for a fixed rank, the matrix with more skewed $K_G^j$'s distribution (marked here by a larger $K_G$) requires a larger sample size.

Example 5.5 (Method performance for different matrix properties). Next we consider a much more general setting than that in Example 5.4, and compare the performance of different methods with respect to various matrix properties. The matrix $A$ is constructed as in Example 5.3, except that also a uniform distribution is used. Furthermore, a parameter $d$ controlling denseness of the created matrix is utilized. This is achieved in Matlab using the commands C=sprandn(l,s,d) or C=sprand(l,s,d). By changing $l$ and $d$ we can change the matrix properties $K_H$, $K_G$ and $K_U$ while keeping the rank $r$ fixed across experiments. We maintain $s = 1000$, $\operatorname{tr}(A) = 1$ and $\varepsilon = \delta = .05$ throughout. In particular, the four figures related to this example are comparable to Figure 5.5 but for a lower rank.

[Figure 5.7, panels: (a) Convergence Rate; (b) $K_H^j$ distribution ($K_H^j \le 100$); (c) $K_U^{(i,j)}$ distribution; (d) Eigenvalue distribution]
Figure 5.7: Example 5.5. A sparse matrix ($d = 0.1$) is formed using sprandn. Here $r = 50$, $K_G = 0.0342$, $K_H = 15977.194$ and $K_U = 4.8350$.

[Figure 5.8, panels: (a) Convergence Rate; (b) $K_H^j$ distribution ($K_H^j \le 100$); (c) $K_U^{(i,j)}$ distribution; (d) Eigenvalue distribution]
Figure 5.8: Example 5.5. A sparse matrix ($d = 0.1$) is formed using sprand. Here $r = 50$, $K_G = 0.0919$, $K_H = 11624.58$ and $K_U = 3.8823$.

By comparing Figures 5.7 and 5.8, as well as 5.9 and 5.10, we can see how not only the values of $K_H$, $K_G$ and $K_U$, but also the distribution of the quantities they maximize matters. Note how the performance of both unit vector strategies is negatively affected with increasing average values of $K_U^{(i,j)}$'s.
From the eigenvalue (or $K_G^j$) distribution of the matrix, it can also be seen that the Gaussian estimator is heavily affected by the skewness of the distribution of the eigenvalues (or $K_G^j$'s): given the same $r$ and $s$, as this eigenvalue distribution becomes increasingly uneven, the Gaussian method requires larger sample size.

Note that comparing the performance of the methods on different matrices solely based on their values $K_H$, $K_G$ or $K_U$ can be misleading. This can be seen for instance by considering the performance of the Hutchinson method in Figures 5.7, 5.8, 5.9 and 5.10 and comparing their respective $K_H^j$ distributions as well as $K_H$ values. Indeed, none of our 6 sufficient bounds can be guaranteed to be generally tight. As remarked also earlier, this is an artifact of the generality of the proved results.

[Figure 5.9, panels: (a) Convergence Rate; (b) $K_H^j$ distribution ($K_H^j \le 50$); (c) $K_U^{(i,j)}$ distribution; (d) Eigenvalue distribution]
Figure 5.9: Example 5.5. A very sparse matrix ($d = 0.01$) is formed using sprandn. Here $r = 50$, $K_G = 0.1186$, $K_H = 8851.8$ and $K_U = 103.9593$.

Note also that rank and eigenvalue distribution of a matrix have no direct effect on the performance of the Hutchinson method: by Figures 5.9 and 5.10 it appears to only depend on the $K_H^j$ distribution. In these figures, one can observe that the Gaussian method is heavily affected by the low rank and the skewness of the eigenvalues. Thus, if the distribution of $K_H^j$'s is favourable to the Hutchinson method and yet the eigenvalue distribution is rather skewed, we can expect a significant difference between the performance of the Gaussian and Hutchinson methods.

5.6 Conclusions

In this chapter, we have proved six sufficient bounds for the minimum sample size $n$ required to reach, with probability $1 - \delta$, an approximation for $\operatorname{tr}(A)$ to within a relative tolerance $\varepsilon$. Two such bounds apply to each of the three estimators considered in Sections 5.2, 5.3 and 5.4, respectively.
In Section 5.3 we have also proved a necessary bound for the Gaussian estimator. These bounds have all been verified numerically through many examples, some of which are summarized in Section 5.5.

[Figure 5.10, panels: (a) Convergence Rate; (b) $K_H^j$ distribution ($K_H^j \le 50$); (c) $K_U^{(i,j)}$ distribution; (d) Eigenvalue distribution]
Figure 5.10: Example 5.5. A very sparse matrix ($d = 0.01$) is formed using sprand. Here $r = 50$, $K_G = 0.1290$, $K_H = 1611.34$ and $K_U = 64.1707$.

Two of these bounds, namely, (5.4) for Hutchinson and (5.5) for Gaussian, are immediately computable without knowing anything else about the SPSD matrix $A$. In particular, they are independent of the matrix size $s$. As such they may be very pessimistic. And yet, in some applications (for instance, in exploration geophysics) where $s$ can be very large and $\varepsilon$ need not be very small due to uncertainty, these bounds may indeed provide the comforting assurance that $n \ll s$ suffices (say, $s$ is in the millions and $n$ in the thousands). Generally, these two bounds have the same quality.

The underlying objective in this work, which is to seek a small $n$ satisfying (5.2), is a natural one for many applications and follows that of other works. But when it comes to comparing different methods, it is by no means the only performance indicator. For example, variance can also be considered as a ground to compare different methods. However, one needs to exercise caution to avoid basing the entire comparison solely on variance: it is possible to generate examples where a linear combination of $\chi^2$ random variables has smaller variance, yet higher tail probability.

The lower bound (5.14) that is available only for the Gaussian estimator may allow better prediction of the actual required $n$, in cases where the rank $r$ is known. At the same time it also implies that the Gaussian estimator can be inferior in cases where $r$ is small.
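To make the use of the lower bound (5.14) concrete, the following Python sketch (standard library only) evaluates $\Phi_\varepsilon(nr)$ from (5.13) and scans for the first $n$ at which the necessary condition is met. It is an illustration under stated assumptions: the helper names are hypothetical, the regularized lower incomplete gamma function is computed with the standard power series rather than a library routine, and the series is only intended for the moderate arguments used here. The value returned is a necessary sample size, not a sufficient one.

```python
import math


def reg_lower_gamma(a, x, tol=1e-14, max_terms=100000):
    """Regularized lower incomplete gamma P(a, x) via its power series (illustrative)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a          # k = 0 term of sum_k x^k / (a (a+1) ... (a+k))
    total = term
    k = 0
    while abs(term) > tol * abs(total) and k < max_terms:
        k += 1
        term *= x / (a + k)
        total += term
    log_prefactor = a * math.log(x) - x - math.lgamma(a)
    return min(1.0, total * math.exp(log_prefactor))


def necessary_failure_probability(eps, n, r):
    """Phi_eps(n r) from (5.13): a floor on the failure probability for rank r."""
    tau = (math.log(1.0 + eps) - math.log(1.0 - eps)) / (2.0 * eps)
    a = n * r / 2.0
    lower = reg_lower_gamma(a, tau * (1.0 - eps) * a)        # P(x/2, tau(1-eps)x/2)
    upper = 1.0 - reg_lower_gamma(a, tau * (1.0 + eps) * a)  # Q(x/2, tau(1+eps)x/2)
    return lower + upper


def smallest_necessary_n(eps, delta, r, n_max=10000):
    """First n in 1..n_max with Phi_eps(n r) <= delta, found by linear scan; None if absent."""
    for n in range(1, n_max + 1):
        if necessary_failure_probability(eps, n, r) <= delta:
            return n
    return None


floor_rank_1 = necessary_failure_probability(0.1, 100, 1)
floor_rank_20 = necessary_failure_probability(0.1, 100, 20)
n_rank_1 = smallest_necessary_n(0.2, 0.1, 1)
n_rank_10 = smallest_necessary_n(0.2, 0.1, 10)
print(floor_rank_1, floor_rank_20, n_rank_1, n_rank_10)
```

For a fixed $n$ and $\varepsilon$ the floor on the failure probability is much higher at rank 1 than at rank 20, and the necessary $n$ shrinks as the rank grows, in line with the discussion of Theorem 5.5.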
The Hutchinson estimator does not enjoy a similar theory, but empirically does not suffer from the same disadvantage either.

The matrix-dependent quantities $K_H$, $K_G$ and $K_U$, defined in (5.6), (5.9) and (5.17), respectively, are not easily computable for any given implicit matrix $A$. However, the results of Theorems 5.2, 5.3 and 5.7 that depend on them can be more indicative than the general bounds. In particular, examples where one method is clearly better than the others can be isolated in this way. At the same time, the sufficient conditions in Theorems 5.2, 5.3 and 5.7, merely distinguish the types of matrices for which the respective methods are expected to be efficient, and make no claims regarding those matrices for which they are inefficient estimators. This is in direct contrast with the necessary condition in Theorem 5.5.

It is certainly possible in some cases for the required $n$ to go over $s$. In this connection, it is important to always remember the deterministic method which obtains $\operatorname{tr}(A)$ in $s$ applications of unit vectors: if $n$ grows above $s$ in a particular stochastic setting then it may be best to abandon ship and choose the safe, deterministic way.

Chapter 6

Extremal Probabilities of Linear Combinations of Gamma Random Variables

This chapter prepares us for Chapter 7, in the sense that two pivotal results, given in Theorems 6.1 and 6.2 below, are subsequently used there. However, the development here is significantly more general than what is needed for Chapter 7, and the results are novel. Hence we describe them in a separate chapter, as they can be considered independently of the rest of this thesis.

The gamma distribution forms an important family of distributions, and gamma random variables (r.v's) appear in many practical applications. For example, linear combinations (i.e., convolutions) of independent gamma r.v's often naturally arise in many applications in statistics, engineering, insurance, actuarial science and reliability.
As such, in the literature, there has been extensive study of the stochastic properties of gamma r.v's and their convolutions. For examples of such theoretical studies as well as applications see [7, 30–32, 44, 61, 93–95, 99, 103, 129, 139, 141, 142] and references therein.

In what follows, let $X \sim \mathrm{Gamma}(\alpha, \beta)$ denote a gamma distributed random variable (r.v) parametrized by shape $\alpha$ and rate $\beta$ parameters with the probability density function (PDF)
\[
f(x) = \begin{cases} \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, & x \ge 0, \\[1ex] 0, & x < 0. \end{cases} \tag{6.1}
\]

An important stochastic property of gamma r.v's is the monotonicity of the regularized gamma function (see Section 5.3.2), i.e., the cumulative distribution function (CDF) of a gamma r.v, with respect to different shape $\alpha$ and rate $\beta$ parameters. Theorem 6.1 gives conditions which allow one to obtain certain important monotonicity results for the regularized gamma function (cf. (5.13)).

Theorem 6.1 (Monotonicity of cumulative distribution function of gamma r.v). Given parameters $0 < \alpha_1 < \alpha_2$, let $X_i \sim \mathrm{Gamma}(\alpha_i, \alpha_i)$, $i = 1, 2$, be independent r.v's, and define
\[
\Delta(x) := \Pr(X_2 < x) - \Pr(X_1 < x).
\]
Then we have that

(i) there is a unique point $x(\alpha_1, \alpha_2)$ such that $\Delta(x) < 0$ for $0 < x < x(\alpha_1, \alpha_2)$ and $\Delta(x) > 0$ for $x > x(\alpha_1, \alpha_2)$,

(ii) $1 \le x(\alpha_1, \alpha_2) \le \dfrac{2\sqrt{\alpha_1(\alpha_2 - \alpha_1)} + 1}{2\sqrt{\alpha_1(\alpha_2 - \alpha_1)}}$.

Another important stochastic property is that of the maximum and minimum of tail probabilities of linear combinations of i.i.d gamma r.v's. More specifically, let $X_i \sim \mathrm{Gamma}(\alpha, \beta)$ for $i = 1, 2, \ldots, n$, be $n$ i.i.d gamma r.v's. Consider the following non-negative linear combinations of such r.v's
\[
\sum_{i=1}^{n} \lambda_i X_i,
\]
where $\lambda_i \ge 0$, $i = 1, 2, \ldots, n$, are real numbers. The goal is to find conditions allowing one to determine the maximum and minimum of the tail probability
\[
\Pr\Big(\sum_{i=1}^{n} \lambda_i X_i < x\Big),
\]
with respect to the mixing weights $\lambda_i$, $i = 1, \ldots, n$, for various values of $x$. Theorem 6.2 describes these conditions.

Theorem 6.2 (Extremal probabilities of linear combination of gamma r.v's).
Given shape and rate parameters $\alpha, \beta > 0$, let $X_i \sim \mathrm{Gamma}(\alpha, \beta)$, $i = 1, 2, \ldots, n$, be i.i.d gamma r.v's, and define
\[
\Theta := \Big\{\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n)^T \;\Big|\; \lambda_i \ge 0 \;\forall i, \; \sum_{i=1}^{n} \lambda_i = 1\Big\}.
\]
Then we have
\[
m_n(x) := \min_{\lambda \in \Theta} \Pr\Big(\sum_{i=1}^{n} \lambda_i X_i < x\Big) = \begin{cases} \Pr\Big(\dfrac{1}{n}\sum_{i=1}^{n} X_i < x\Big), & x < \dfrac{\alpha}{\beta}, \\[2ex] \Pr(X_1 < x), & x > \dfrac{2\alpha + 1}{2\beta}, \end{cases}
\]
\[
M_n(x) := \max_{\lambda \in \Theta} \Pr\Big(\sum_{i=1}^{n} \lambda_i X_i < x\Big) = \begin{cases} \Pr(X_1 < x), & x < \dfrac{\alpha}{\beta}, \\[2ex] \Pr\Big(\dfrac{1}{n}\sum_{i=1}^{n} X_i < x\Big), & x > \dfrac{2\alpha + 1}{2\beta}. \end{cases}
\]

Results similar to Theorem 6.2 were obtained in [129] for the special case where the $X_i$'s are chi-squared r.v's of degree 1 (corresponding to $\alpha = \beta = 1/2$). Theorem 6.2 extends those results to arbitrary gamma random variables, including chi-squared of arbitrary degree, exponential, Erlang, etc.

In what follows, for a gamma r.v $X$, we use the notation $f_X$ for its PDF and $F_X$ for its CDF.

The objective in the proof of Theorem 6.2 is to find the extrema (with respect to $\lambda \in \Theta$) of the CDF of r.v $\sum_{i=1}^{n} \lambda_i X_i$. This is mainly achieved by perturbation arguments, employing a key identity which is derived using Laplace transforms. Using our perturbation arguments with this identity and employing Lemma 6.4, we obtain that at any extremum, we must have either $\lambda_1, \lambda_2 > 0$ and $\lambda_3 = \cdots = \lambda_n = 0$ or for some $i \le n$ we must get $\lambda_1 = \cdots = \lambda_i > 0$ and $\lambda_{i+1} = \cdots = \lambda_n = 0$. (Note that this latter case covers the "corners" as well.) In the former case, Lemma 6.5 is used to distinguish between the minima and maxima for different values of $x$. These results along with Theorem 6.1 are then used to prove Theorem 6.2.

Three lemmas are used in the proofs of our two theorems. Lemma 6.3 describes some properties of the PDF of non-negative linear combinations of arbitrary gamma r.v's, such as analyticity and vanishing derivatives at zero. Lemma 6.4 describes the monotonicity property of the mode of the PDF of non-negative linear combinations of a particular set of gamma r.v's, which is useful for the proof of Theorem 6.2.
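Lemma 6.5 gives some properties regarding the mode of the PDF of convex combinations of two particular gamma r.v's, which is used in proving Theorem 6.1 and Theorem 6.2.

6.1 Lemmas

We next state and prove the lemmas summarized above.

Lemma 6.3 (Generalization of [129, Lemma A]). Let $X_i \sim \mathrm{Gamma}(\alpha_i, \beta_i)$, $i = 1, 2, \ldots, n$, be independent r.v's, where $\alpha_i, \beta_i > 0 \;\forall i$. Define $Y_n := \sum_{i=1}^{n} \lambda_i X_i$ for $\lambda_i > 0$, $\forall i$ and $\rho_j := \sum_{i=1}^{j} \alpha_i$. Then for the PDF of $Y_n$, $f_{Y_n}$, we have

(i) $f_{Y_n} > 0$, $\forall x > 0$,

(ii) $f_{Y_n}$ is analytic on $\mathbb{R}^{+} = \{x \mid x > 0\}$,

(iii) $f_{Y_n}^{(k)}(0) = 0$, if $0 \le k < \rho_n - 1$, where $f_{Y_n}^{(k)}$ denotes the $k$th derivative of $f_{Y_n}$.

Proof. The proof is done by induction on $n$. For $n = 2$ we have
\[
\begin{aligned}
f_{Y_2}(x) &= \int_{0}^{\infty} f_{\lambda_1 X_1}(y)\, f_{\lambda_2 X_2}(x - y)\, dy \\
&= \int_{0}^{x} \frac{(\beta_1/\lambda_1)^{\alpha_1}}{\Gamma(\alpha_1)}\, y^{\alpha_1 - 1} e^{-\frac{\beta_1 y}{\lambda_1}}\, \frac{(\beta_2/\lambda_2)^{\alpha_2}}{\Gamma(\alpha_2)}\, (x - y)^{\alpha_2 - 1} e^{-\frac{\beta_2 (x - y)}{\lambda_2}}\, dy \\
&= \frac{(\beta_1/\lambda_1)^{\alpha_1} (\beta_2/\lambda_2)^{\alpha_2}}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \int_{0}^{x} y^{\alpha_1 - 1} (x - y)^{\alpha_2 - 1} e^{-\frac{\beta_1 y}{\lambda_1} - \frac{\beta_2 (x - y)}{\lambda_2}}\, dy.
\end{aligned}
\]
Now the change of variable $y \to x \cos^2\theta_1$ would yield
\[
f_{Y_2}(x) = 2\, \frac{(\beta_1/\lambda_1)^{\alpha_1} (\beta_2/\lambda_2)^{\alpha_2}}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, x^{(\alpha_1 + \alpha_2 - 1)} \int_{0}^{\frac{\pi}{2}} (\cos\theta_1)^{2\alpha_1 - 1} (\sin\theta_1)^{2\alpha_2 - 1} e^{-x\left(\frac{\beta_1 \cos^2\theta_1}{\lambda_1} + \frac{\beta_2 \sin^2\theta_1}{\lambda_2}\right)}\, d\theta_1.
\]
By induction on $n$, one can show that for arbitrary $n \ge 2$
\[
f_{Y_n}(x) = 2^{n-1} \Big(\prod_{i=1}^{n} \frac{(\beta_i/\lambda_i)^{\alpha_i}}{\Gamma(\alpha_i)}\Big)\, x^{\rho_n - 1} \int_{D_{n-1}} P_n(\Theta_{n-1})\, Q_n(\Theta_{n-1})\, e^{-x R_n(\Theta_{n-1})}\, d\Theta_{n-1}, \tag{6.2a}
\]
where
\[
P_n(\Theta_{n-1}) := \prod_{j=1}^{n-1} (\cos\theta_j)^{2\rho_j - 1}, \tag{6.2b}
\]
\[
Q_n(\Theta_{n-1}) := \prod_{j=1}^{n-1} (\sin\theta_j)^{2\alpha_{j+1} - 1}, \tag{6.2c}
\]
the function $R_n(\Theta_{n-1})$ satisfies the following recurrence relation
\[
R_n(\Theta_{n-1}) := \cos^2\theta_{n-1}\, R_{n-1}(\Theta_{n-2}) + \beta_n \lambda_n^{-1} \sin^2\theta_{n-1}, \quad \forall n \ge 2, \tag{6.2d}
\]
\[
R_1(\Theta_0) := \beta_1/\lambda_1, \tag{6.2e}
\]
and $d\Theta_{n-1}$ denotes the $n-1$ dimensional Lebesgue measure with the domain of integration
\[
D_{n-1} := (0, \pi/2) \times (0, \pi/2) \times \cdots \times (0, \pi/2) = (0, \pi/2)^{n-1} \subset \mathbb{R}^{n-1}. \tag{6.2f}
\]
Now the claims in Lemma 6.3 follow from (6.2).

Lemma 6.4 (Generalization of [129, Lemma 1]). Let $X_i \sim \mathrm{Gamma}(\alpha_i, \alpha)$, $i = 1, 2, \ldots, n$, be independent r.v's, where $\alpha_i > 0 \;\forall i$ and $\alpha > 0$. Also let $\psi \sim \mathrm{Gamma}(1, \alpha)$ be another r.v independent of all $X_i$'s. If $\sum_{i=1}^{n} \alpha_i > 1$, then the mode, $\bar{x}(\lambda)$, of the r.v $W(\lambda) = Y + \lambda\psi$ is strictly increasing in $\lambda > 0$, where $Y = \sum_{i=1}^{n} \lambda_i X_i$ with $\lambda_i > 0$, $\forall i$.

Proof.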
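The proof is almost identical to that of Lemma 1 in [129] and we give it here for completeness. By Lemma 6.3, $\bar{x}(\lambda) > 0$ for $\lambda \ge 0$. By the unimodality of $W(\lambda)$, for any $\lambda > \lambda_0 > 0$, it is enough to show that
\[
J\big(\lambda, \bar{x}(\lambda_0)\big) := \Big[\frac{d^2}{dx^2} \Pr\big(W(\lambda) \le x\big)\Big]_{x = \bar{x}(\lambda_0)} > 0. \tag{6.3}
\]
Note that $J\big(\lambda_0, \bar{x}(\lambda_0)\big) = 0$ and since $\sum_{i=1}^{n} \alpha_i > 1$, by Lemma 6.3(iii), $f_Y(0) = 0$. So we have
\[
\begin{aligned}
J\big(\lambda, \bar{x}(\lambda_0)\big) &= \Big[\frac{d}{dx} \int_{0}^{x} f_Y(x - z)\, \frac{\alpha}{\lambda}\, e^{-\frac{\alpha}{\lambda} z}\, dz\Big]_{x = \bar{x}(\lambda_0)} \\
&= \Big[\int_{0}^{x} \frac{d}{dx} f_Y(x - z)\, \frac{\alpha}{\lambda}\, e^{-\frac{\alpha}{\lambda} z}\, dz\Big]_{x = \bar{x}(\lambda_0)} \\
&= \int_{0}^{\bar{x}(\lambda_0)} f'_Y(z)\, \frac{\alpha}{\lambda}\, e^{-\frac{\alpha}{\lambda}(\bar{x}(\lambda_0) - z)}\, dz.
\end{aligned}
\]
Therefore,
\[
\int_{0}^{\bar{x}(\lambda_0)} f'_Y(z)\, e^{\frac{\alpha z}{\lambda_0}}\, dz = \frac{\lambda_0}{\alpha}\, e^{\frac{\alpha \bar{x}(\lambda_0)}{\lambda_0}}\, J\big(\lambda_0, \bar{x}(\lambda_0)\big) = 0.
\]
Thus for $\lambda > \lambda_0 > 0$, we have
\[
\begin{aligned}
\frac{\lambda}{\alpha}\, e^{\frac{\alpha \bar{x}(\lambda_0)}{\lambda}}\, J\big(\lambda, \bar{x}(\lambda_0)\big) &= \int_{0}^{\bar{x}(\lambda_0)} f'_Y(z)\, e^{\frac{\alpha z}{\lambda}}\, dz \\
&= \int_{0}^{\bar{x}(\lambda_0)} \Big(f'_Y(z)\, e^{\frac{\alpha z}{\lambda}} - f'_Y(z)\, e^{\frac{\alpha z}{\lambda_0}}\, e^{\alpha \bar{x}(0)\left(\frac{1}{\lambda} - \frac{1}{\lambda_0}\right)}\Big)\, dz \\
&= \int_{0}^{\bar{x}(\lambda_0)} f'_Y(z) \Big(e^{\frac{\alpha z}{\lambda}} - e^{\frac{\alpha z}{\lambda_0} + \alpha \bar{x}(0)\left(\frac{1}{\lambda} - \frac{1}{\lambda_0}\right)}\Big)\, dz \\
&= \int_{0}^{\bar{x}(\lambda_0)} f'_Y(z) \Big(e^{\frac{\alpha z}{\lambda}} - e^{\frac{\alpha z}{\lambda} + \Phi(z, \bar{x}(0))}\Big)\, dz,
\end{aligned}
\]
where $\bar{x}(0) > 0$ is the mode of r.v $Y$ and
\[
\Phi\big(z, \bar{x}(0)\big) := \alpha\big(z - \bar{x}(0)\big)\Big(\frac{1}{\lambda_0} - \frac{1}{\lambda}\Big).
\]
Now if $z < \bar{x}(0)$ then $\Phi\big(z, \bar{x}(0)\big) < 0$ and $f'_Y(z) > 0$ so we get $J\big(\lambda, \bar{x}(\lambda_0)\big) > 0$. Similarly if $z > \bar{x}(0)$ then $\Phi\big(z, \bar{x}(0)\big) > 0$ and $f'_Y(z) < 0$ and again we have $J\big(\lambda, \bar{x}(\lambda_0)\big) > 0$.

Lemma 6.5 (Generalization of [129, Lemma 2]). For some $\alpha_2 \ge \alpha_1 > 0$, let $\xi_1 \sim \mathrm{Gamma}(1 + \alpha_1, \alpha_1)$ and $\xi_2 \sim \mathrm{Gamma}(1 + \alpha_2, \alpha_2)$ be independent gamma r.v's. Also let $\bar{x} = \bar{x}(\lambda)$ denote the mode of the r.v $\xi(\lambda) = \lambda \xi_1 + (1 - \lambda)\xi_2$ for $0 \le \lambda \le 1$. Then

(i) for a given $\lambda$, $\bar{x}(\lambda)$ is unique,

(ii) $1 \le \bar{x}(\lambda) \le \dfrac{2\sqrt{\alpha_1 \alpha_2} + 1}{2\sqrt{\alpha_1 \alpha_2}}$, $\forall\, 0 \le \lambda \le 1$, with $\bar{x}(0) = \bar{x}(1) = 1$ and, in case of $\alpha_i = \alpha_j = \alpha$, $\bar{x}(\tfrac{1}{2}) = \dfrac{2\alpha + 1}{2\alpha}$, otherwise the inequalities are strict, and

(iii) there is a $\lambda^* \in \Big(\dfrac{\sqrt{\alpha_1}}{\sqrt{\alpha_2} + \sqrt{\alpha_1}}, 1\Big)$ such that the mode $\bar{x}(\lambda)$ is a strictly increasing function of $\lambda$ on $(0, \lambda^*)$ and it is a strictly decreasing function on $(\lambda^*, 1)$ and, for $\alpha_1 = \alpha_2$, we have $\lambda^* = \tfrac{1}{2}$.

Proof. Uniqueness claim (i) has already been proven in [129, Theorem 4]. We prove (iii) since (ii) is implied from within the proof. For $0 < \lambda < 1$, the PDF of $\xi(\lambda)$ can be written as
\[
f_{\xi(\lambda)}(x) = \int_{0}^{x} f_{\lambda \xi_1}(y)\, f_{(1-\lambda)\xi_2}(x - y)\, dy.
\]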
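Since $f_{\lambda\xi_1}(0) = f_{(1-\lambda)\xi_2}(0) = 0$ we have
\[
\begin{aligned}
\frac{\partial}{\partial x} f_{\xi(\lambda)}(x) &= \int_0^x f_{\lambda\xi_1}(y)\, \frac{\partial}{\partial x} f_{(1-\lambda)\xi_2}(x - y)\, dy \\
&= -\int_0^x f_{\lambda\xi_1}(y)\, \frac{\partial}{\partial y} f_{(1-\lambda)\xi_2}(x - y)\, dy \\
&= \int_0^x \frac{\partial}{\partial y}\left( f_{\lambda\xi_1}(y) \right) f_{(1-\lambda)\xi_2}(x - y)\, dy,
\end{aligned}
\]
where for the second equality we use the fact that $\frac{\partial}{\partial x} f(x - y) = -\frac{\partial}{\partial y} f(x - y)$, and for the third equality we used integration by parts. Let $\alpha = \alpha_1$ and $\alpha_2 = c\alpha$ for some $c \ge 1$. So now we have
\[
\begin{aligned}
\frac{\partial}{\partial x} f_{\xi(\lambda)}(x) &= \frac{\left(\frac{\alpha}{\lambda}\right)^{1+\alpha} \left(\frac{c\alpha}{1-\lambda}\right)^{1+\alpha c}}{\Gamma(1 + \alpha)\Gamma(1 + \alpha c)} \int_0^x \frac{\partial \left( y^{\alpha} e^{-\alpha y/\lambda} \right)}{\partial y}\, (x - y)^{\alpha c}\, e^{-\frac{c\alpha (x - y)}{1-\lambda}}\, dy \\
&= \frac{\alpha^{2+\alpha} (c\alpha)^{1+c\alpha}}{\Gamma(1 + \alpha)\Gamma(1 + c\alpha)}\, \lambda^{-2-\alpha} (1 - \lambda)^{-1-\alpha c}\, e^{-\frac{c\alpha x}{1-\lambda}} \int_0^x (\lambda - y)\, y^{\alpha - 1} (x - y)^{\alpha c}\, e^{-\alpha y \left(\frac{1}{\lambda} - \frac{c}{1-\lambda}\right)}\, dy \\
&= C(x, \lambda)\, A(x, \lambda),
\end{aligned}
\]
where
\[
C(x, \lambda) := \frac{\alpha^{2+\alpha} (c\alpha)^{1+c\alpha}}{\Gamma(1 + \alpha)\Gamma(1 + c\alpha)}\, \lambda^{-2-\alpha} (1 - \lambda)^{-1-\alpha c}\, e^{-\frac{c\alpha x}{1-\lambda}},
\]
\[
A(x, \lambda) := \int_0^x (\lambda - y)\, y^{\alpha - 1} (x - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy,
\qquad
\phi(\lambda) := \alpha \left( \frac{1}{\lambda} - \frac{c}{1 - \lambda} \right).
\]
Now if $\bar{x}$ is the mode of $\xi(\lambda)$, then we have
\[
\frac{\partial}{\partial x} f_{\xi(\lambda)}(\bar{x}) = C(\bar{x}, \lambda)\, A(\bar{x}, \lambda) = 0,
\]
which implies that $A(\bar{x}, \lambda) = 0$ since $C(\bar{x}, \lambda) > 0$. Let us define the linear functional $L : G \to \mathbb{R}$, where $G = \{ g : (0, \bar{x}) \to \mathbb{R} \mid \int_0^{\bar{x}} g(y)\, y^{\alpha - 1}\, dy < \infty \}$, as
\[
L(g) := \int_0^{\bar{x}} g(y)\, y^{\alpha - 1} (\bar{x} - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy.
\]
We have
\[
\begin{aligned}
\frac{\partial}{\partial \lambda} A(x, \lambda) &= \int_0^x \left[ 1 - \phi'(\lambda)\, y (\lambda - y) \right] y^{\alpha - 1} (x - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy \\
&= \int_0^x \left[ 1 - \lambda\phi'(\lambda)\, y + \phi'(\lambda)\, y^2 \right] y^{\alpha - 1} (x - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy,
\end{aligned}
\]
so
\[
\left[ \frac{\partial}{\partial \lambda} A(x, \lambda) \right]_{x = \bar{x}} = L\left( 1 - \lambda\phi'(\lambda) f + \phi'(\lambda) f^2 \right), \tag{6.4}
\]
where $f \in G$ is such that $f(y) = y$. On the other hand, since $A(\bar{x}, \lambda) = 0$, we get
\[
\begin{aligned}
L(\lambda) = L(f) &= \int_0^{\bar{x}} y^{\alpha} (\bar{x} - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy \\
&= \int_0^{\bar{x}} y^{\alpha} e^{-\phi(\lambda) y}\, d\left( -\frac{(\bar{x} - y)^{\alpha c + 1}}{\alpha c + 1} \right) \\
&= (\alpha c + 1)^{-1} \int_0^{\bar{x}} (\bar{x} - y)^{\alpha c + 1}\, d\left( y^{\alpha} e^{-\phi(\lambda) y} \right) \\
&= (\alpha c + 1)^{-1} \int_0^{\bar{x}} (\bar{x} - y)^{\alpha c + 1} \left( \alpha y^{\alpha - 1} e^{-\phi(\lambda) y} - \phi(\lambda)\, y^{\alpha} e^{-\phi(\lambda) y} \right) dy \\
&= (\alpha c + 1)^{-1} \int_0^{\bar{x}} (\bar{x} - y) \left( \alpha - \phi(\lambda) y \right) y^{\alpha - 1} (\bar{x} - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy \\
&= (\alpha c + 1)^{-1} L\left( (\bar{x} - f)(\alpha - \phi(\lambda) f) \right) \\
&= (\alpha c + 1)^{-1} L\left( \alpha\bar{x} - \alpha f - \phi(\lambda)\bar{x} f + \phi(\lambda) f^2 \right),
\end{aligned}
\]
where the second integral is Lebesgue-Stieltjes, and the third integral follows from Lebesgue-Stieltjes integration by parts. So, for $\lambda \in (0, \tfrac{1}{c+1}) \cup (\tfrac{1}{c+1}, 1)$, we get
\[
\begin{aligned}
L(f^2) &= \frac{1}{\phi(\lambda)} \left[ (\alpha c + 1) L(f) - L\left( \alpha\bar{x} - \alpha f - \phi(\lambda)\bar{x} f \right) \right] \\
&= \frac{1}{\phi(\lambda)} \left[ L\left( (\alpha c + 1) f - \frac{\alpha\bar{x}}{\lambda} f + \alpha f + \phi(\lambda)\bar{x} f \right) \right] \\
&= \frac{1}{\phi(\lambda)} \left[ \left( (\alpha c + 1) + \alpha + \phi(\lambda)\bar{x} - \frac{\alpha\bar{x}}{\lambda} \right) L(f) \right] \\
&= \frac{1}{\phi(\lambda)} \left[ \left( (\alpha + \alpha c + 1) + \left( \phi(\lambda) - \frac{\alpha}{\lambda} \right) \bar{x} \right) L(f) \right] \\
&= \frac{1}{\phi(\lambda)} \left[ \left( (1 + c)\alpha + 1 - \frac{c\alpha\bar{x}}{1 - \lambda} \right) L(f) \right],
\end{aligned}
\]
where we used the fact that $L(\alpha\bar{x}) = \frac{\alpha\bar{x}}{\lambda} L(\lambda) = \frac{\alpha\bar{x}}{\lambda} L(f)$. 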
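Now substituting $L(f^2)$ in (6.4), and using $L(1) = \frac{1}{\lambda} L(\lambda) = \frac{1}{\lambda} L(f)$, yields
\[
\begin{aligned}
\left[ \frac{\partial}{\partial \lambda} A(x, \lambda) \right]_{x = \bar{x}} &= L\left( \frac{1}{\lambda} f - \lambda\phi'(\lambda) f + \phi'(\lambda) f^2 \right) \\
&= \left( \frac{1}{\lambda} - \lambda\phi'(\lambda) + \frac{\phi'(\lambda)}{\phi(\lambda)} \left[ (1 + c)\alpha + 1 - \frac{c\alpha\bar{x}}{1 - \lambda} \right] \right) L(f) \\
&= \left( \frac{1}{\lambda} - \phi'(\lambda) \left( \lambda - \frac{1}{\phi(\lambda)} \left[ (1 + c)\alpha + 1 - \frac{c\alpha\bar{x}}{1 - \lambda} \right] \right) \right) L(f) \\
&= \left( \frac{1}{\lambda} - \frac{\phi'(\lambda)}{\phi(\lambda)} \left( \lambda\phi(\lambda) - (1 + c)\alpha - 1 + \frac{c\alpha\bar{x}}{1 - \lambda} \right) \right) L(f),
\end{aligned}
\]
which after some tedious but routine computations gives
\[
\left[ \frac{\partial}{\partial \lambda} A(x, \lambda) \right]_{x = \bar{x}} = R(\lambda)\, \frac{\bar{x} - \Phi(\lambda)}{1 - (c + 1)\lambda}, \qquad \lambda \in \left( 0, \frac{1}{1 + c} \right) \cup \left( \frac{1}{1 + c}, 1 \right),
\]
where $R(\lambda) > 0$ for all $0 < \lambda < 1$, and
\[
\Phi(\lambda) := \frac{\alpha + (1 - 2\alpha)\lambda + (\alpha - 1 + \alpha c)\lambda^2}{\alpha \left( (c + 1)\lambda^2 - 2\lambda + 1 \right)}.
\]
Since
\[
\frac{d\Phi(\lambda)}{d\lambda} = \frac{(1 - c)\lambda^2 - 2\lambda + 1}{\alpha \left( (c + 1)\lambda^2 - 2\lambda + 1 \right)^2},
\]
we have that
\[
\frac{d\Phi(\lambda)}{d\lambda} = 0 \quad \text{at} \quad \lambda = \frac{1}{1 + \sqrt{c}}.
\]
Note that the other root, $1/(1 - \sqrt{c})$, falls outside of $(0, 1)$ for any $c \ge 1$. It can readily be seen that $\Phi(\lambda)$ is increasing on $0 < \lambda < \frac{1}{1 + \sqrt{c}}$ and decreasing on $\frac{1}{1 + \sqrt{c}} < \lambda < 1$, and so
\[
1 \le \Phi(\lambda) \le \frac{2\alpha\sqrt{c} + 1}{2\alpha\sqrt{c}}, \qquad \forall\, 0 \le \lambda \le 1.
\]
The differentiability of $\bar{x}(\lambda)$ with respect to $\lambda$ follows from the implicit function theorem:
\[
\frac{d\bar{x}(\lambda)}{d\lambda} = -\frac{\frac{\partial}{\partial \lambda} A(\bar{x}, \lambda)}{\frac{\partial}{\partial \bar{x}} A(\bar{x}, \lambda)},
\]
and for that we need to show that $\frac{\partial A(\bar{x}, \lambda)}{\partial \bar{x}} \neq 0$ for all $0 < \lambda < 1$. If we assume the contrary for some $\lambda$, we get
\[
\alpha c\, A(\bar{x}, \lambda) = \alpha c \int_0^{\bar{x}} (\lambda - y)\, y^{\alpha - 1} (\bar{x} - y)^{\alpha c}\, e^{-\phi(\lambda) y}\, dy = 0,
\]
\[
(\bar{x} - \lambda)\, \frac{\partial}{\partial \bar{x}} A(\bar{x}, \lambda) = \alpha c \int_0^{\bar{x}} (\lambda - y)(\bar{x} - \lambda)\, y^{\alpha - 1} (\bar{x} - y)^{\alpha c - 1}\, e^{-\phi(\lambda) y}\, dy = 0,
\]
which is impossible since the integrand in the first equality is strictly larger than the one in the second equality: we can see this by looking at the two cases $0 < y < \lambda$ and $\lambda < y < \bar{x}$. From this we can also note that $\frac{\partial}{\partial \bar{x}} A(\bar{x}, \lambda) < 0$ for all $0 < \lambda < 1$. To see this, first consider the case $\bar{x} > \lambda$, and it follows directly as above that $\frac{\partial}{\partial \bar{x}} A(\bar{x}, \lambda) < [\alpha c/(\bar{x} - \lambda)]\, A(\bar{x}, \lambda) = 0$. Now assume that $\bar{x} \le \lambda$; since the integrand in the first equality is then strictly positive for all $0 < y < \bar{x}$, we would have $A(\bar{x}, \lambda) > 0$, which is impossible. So we get
\[
\frac{d\bar{x}(\lambda)}{d\lambda} = S(\lambda)\, \frac{\bar{x} - \Phi(\lambda)}{1 - (c + 1)\lambda}, \qquad \lambda \in [0, 1], \tag{6.5}
\]
where $S(\lambda) > 0$ for all $0 < \lambda < 1$. We also defined $\frac{d\bar{x}(\lambda)}{d\lambda}$ for $\lambda = 0, 1, \tfrac{1}{2}$ using l'Hôpital's rule (with one-sided differentiability for $\lambda = 0, 1$). 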
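As a numerical sanity check on the closed-form properties of $\Phi$ used in this proof (a sketch only, not part of the argument), the following Python snippet evaluates $\Phi$ on a grid. The function name `Phi` and the values $\alpha = 0.7$, $c = 3$ are illustrative assumptions, not taken from the thesis.

```python
import math

def Phi(lam, alpha, c):
    """Phi(lambda) from the proof of Lemma 6.5, with alpha = alpha_1 and c = alpha_2 / alpha_1 >= 1."""
    num = alpha + (1.0 - 2.0 * alpha) * lam + (alpha - 1.0 + alpha * c) * lam ** 2
    den = alpha * ((c + 1.0) * lam ** 2 - 2.0 * lam + 1.0)  # positive: discriminant is -4c < 0
    return num / den

alpha, c = 0.7, 3.0                      # illustrative values only
lam_peak = 1.0 / (1.0 + math.sqrt(c))    # stationary point of Phi inside (0, 1)
upper = (2.0 * alpha * math.sqrt(c) + 1.0) / (2.0 * alpha * math.sqrt(c))

grid = [i / 1000.0 for i in range(1001)]
values = [Phi(lam, alpha, c) for lam in grid]

print("Phi(0), Phi(1):", Phi(0.0, alpha, c), Phi(1.0, alpha, c))
print("Phi at 1/(1+sqrt(c)) vs claimed upper bound:", Phi(lam_peak, alpha, c), upper)
print("min and max of Phi on the grid:", min(values), max(values))
```

The lower bound can also be seen directly: the numerator of $\Phi(\lambda) - 1$ simplifies to $\lambda(1 - \lambda)$, which is non-negative on $[0, 1]$.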
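It is easy to see that
\[
\bar{x}(0) = \bar{x}(1) = \Phi(0) = \Phi(1) = 1,
\qquad
\bar{x}\left( \frac{1}{c + 1} \right) = \Phi\left( \frac{1}{c + 1} \right) = \frac{(c + 1)\alpha + 1}{(c + 1)\alpha}.
\]
Next we show that $\bar{x}$ is strictly increasing on $(0, \tfrac{1}{c+1})$. We first show that on this interval we must have $\bar{x}(\lambda) \ge \Phi(\lambda)$; otherwise there must exist a $\hat{\lambda} \in (0, \tfrac{1}{c+1})$ such that $\bar{x}(\hat{\lambda}) < \Phi(\hat{\lambda})$. But this contradicts $\bar{x}(\tfrac{1}{c+1}) = \Phi(\tfrac{1}{c+1})$ by (6.5), the increasing property of $\Phi$ and the continuity of $\bar{x}$. So $\bar{x}$ is non-decreasing on $(0, \tfrac{1}{c+1})$. We must also have that $\bar{x}(\lambda) > \Phi(\lambda)$ for $\lambda \in (0, \tfrac{1}{c+1})$; otherwise, if there is a $\hat{\lambda} \in (0, \tfrac{1}{c+1})$ such that $\bar{x}(\hat{\lambda}) = \Phi(\hat{\lambda})$, then, by (6.5), it must be a saddle point of $\bar{x}$. But since $\Phi$ is strictly increasing and $\bar{x}$ is non-decreasing on this interval, this would imply that for an arbitrarily small $\varepsilon$ we must have $\bar{x}(\hat{\lambda} + \varepsilon) < \Phi(\hat{\lambda} + \varepsilon)$, which would contradict the non-decreasing property of $\bar{x}$ on this interval by (6.5). The same reasoning shows that we must have $\bar{x}(\lambda) < \Phi(\lambda)$ on $(\tfrac{1}{c+1}, \lambda^*)$ (i.e. $\bar{x}$ is strictly increasing on $(\tfrac{1}{c+1}, \lambda^*)$) and $\bar{x}(\lambda) > \Phi(\lambda)$ on $(\lambda^*, 1)$ (i.e. $\bar{x}$ is strictly decreasing on $(\lambda^*, 1)$).

Now we show that $\lambda^* \ge \frac{1}{1 + \sqrt{c}}$. For $c = 1$ we have $\frac{1}{c+1} = \frac{1}{\sqrt{c} + 1}$, hence $\lambda^* = \tfrac{1}{2}$. For $c > 1$, since $\bar{x}(\lambda)$ is increasing for $0 < \lambda < \lambda^*$, decreasing for $\lambda^* < \lambda < 1$, and $\bar{x}(\lambda^*) = \Phi(\lambda^*)$, (6.5) implies that $\lambda^*$ is where the maximum of $\bar{x}(\lambda)$ occurs. Now if we assume that $\lambda^* < \frac{1}{1 + \sqrt{c}}$, since $\Phi$ is increasing on $(0, \frac{1}{1 + \sqrt{c}})$, this would contradict $\bar{x}(\lambda) > \Phi(\lambda)$ on $(\lambda^*, 1)$. Lemma 6.5 is proved.

6.2 Proofs of Theorems 6.1 and 6.2

We now give the detailed proofs for our main theorems stated at the beginning of this chapter and used in Chapter 7.

Proof of Theorem 6.1

For proving (i), we first show that $\Delta(x) = 0$ at exactly one point on $\mathbb{R}^+ = \{x \mid x > 0\}$, denoted by $x(\alpha_1, \alpha_2)$. Since $\alpha_2 > \alpha_1$, let $\alpha_2 = \alpha_1 + c$ for some $c > 0$. We have
\[
\begin{aligned}
\frac{d\Delta(x)}{dx} &= C(\alpha_2)\, x^{\alpha_2 - 1} e^{-\alpha_2 x} - C(\alpha_1)\, x^{\alpha_1 - 1} e^{-\alpha_1 x} \\
&= C(\alpha_2)\, x^{\alpha_1 - 1} e^{-\alpha_1 x} \left( x^{c} e^{-c x} - \frac{C(\alpha_1)}{C(\alpha_2)} \right),
\end{aligned}
\]
where $C(\alpha) = \alpha^{\alpha}/\Gamma(\alpha)$. 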
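The constant $C(\alpha_1)/C(\alpha_2)$ cannot be larger than $x^{c} e^{-c x}$ for all $x \in \mathbb{R}^+$; otherwise $d\Delta(x)/dx$ would be negative for all $x \in \mathbb{R}^+$, and this is impossible since $\Delta(0) = \Delta(\infty) = 0$. The function $x^{c} e^{-c x}$ is increasing on $(0, 1)$ and decreasing on $(1, \infty)$, and since $C(\alpha_1)/C(\alpha_2)$ is constant, there must exist an interval $(a, b)$ containing $x = 1$ such that $d\Delta(x)/dx > 0$ for $x \in (a, b)$ and $d\Delta(x)/dx < 0$ for $x \in (0, a) \cup (b, \infty)$. Now since $\Delta(x)$ is continuous and $\Delta(0) = \Delta(\infty) = 0$, there must exist a unique $x(\alpha_1, \alpha_2) \in (0, \infty)$ such that $\Delta(x)$ crosses zero (i.e., $\Delta(x) = 0$ at the unique point $x(\alpha_1, \alpha_2)$) and that $\Delta(x) < 0$ for $0 < x < x(\alpha_1, \alpha_2)$ and $\Delta(x) > 0$ for $x > x(\alpha_1, \alpha_2)$.

We now prove (ii). The desired inequality is equivalent to
\[
\Delta(x) < 0, \quad \forall x < 1,
\]
and
\[
\Delta(x) > 0, \quad \forall x > \frac{2\sqrt{\alpha_1(\alpha_2 - \alpha_1)} + 1}{2\sqrt{\alpha_1(\alpha_2 - \alpha_1)}}.
\]
Without loss of generality consider $\alpha = \alpha_1$ and $\alpha_2 = (1 + c)\alpha$, for $c = (\alpha_2 - \alpha)/\alpha$. Define $\tilde{X} \sim \mathrm{Gamma}(c\alpha, c\alpha)$ and let
\[
Y(t) := t X_1 + (1 - t)\tilde{X}.
\]
Note that
\[
Y(1) = X_1 \quad \text{and} \quad Y\left( \frac{1}{1 + c} \right) = X_2,
\]
so it suffices to show that the CDF of $Y(t)$ is increasing in $t \in [\tfrac{1}{1+c}, 1]$ for $x < 1$ and decreasing for $x > (2\alpha\sqrt{c} + 1)/(2\alpha\sqrt{c})$. Now, we take the Laplace transform of $Y(t)$ as
\[
\mathcal{L}[Y(t)](z) = \left( 1 + \frac{t z}{\alpha} \right)^{-\alpha} \left( 1 + \frac{(1 - t) z}{c\alpha} \right)^{-c\alpha}, \qquad \mathrm{Re}(z) > \max\{ -\alpha/t,\, -c\alpha/(1 - t) \}.
\]
The Laplace transform of $F_Y$ is
\[
\mathcal{L}[F_Y](z) = \int_0^{\infty} e^{-z x} F_Y(x)\, dx = \frac{1}{z} \int_0^{\infty} e^{-z x}\, dF_Y(x) = \frac{1}{z}\, \mathcal{L}[Y](z).
\]
Note that in the second equality we applied integration by parts and the fact that $F_Y(0) = 0$. Defining
\[
J(z) := \mathcal{L}[F_Y](z)
\]
and differentiating with respect to $t$ gives
\[
\begin{aligned}
\frac{dJ}{dt} &= J\, \frac{d}{dt}\left( \ln J \right) \\
&= J\, \frac{d}{dt} \left( -\ln(z) - \alpha \ln\left( 1 + \frac{t z}{\alpha} \right) - c\alpha \ln\left( 1 + \frac{(1 - t) z}{c\alpha} \right) \right) \\
&= \frac{z^2}{c\alpha}\, J\, \left( (1 + c) t - 1 \right) \left( 1 + \frac{t z}{\alpha} \right)^{-1} \left( 1 + \frac{(1 - t) z}{c\alpha} \right)^{-1}.
\end{aligned}
\]
Taking the inverse transform yields
\[
\frac{d}{dt} \Pr\left( Y(t) \le x \right) = \frac{(1 + c) t - 1}{c\alpha}\, \frac{d^2}{dx^2} \Pr\left( Y(t) + t\psi_1 + \frac{1 - t}{c}\psi_2 < x \right),
\]
where $\psi_i \sim \mathrm{Gamma}(1, \alpha)$, $i = 1, 2$, are i.i.d gamma r.v's which are also independent of $X_1$ and $X_2$. Now applying Lemma 6.5 yields the desired results. 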
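□

Proof of Theorem 6.2

It is enough to prove the theorem for the special case where $\alpha = \beta$; the general statement then follows from the scaling properties of gamma r.v's. Introduce the random variable
\[
Y := \sum_{i=1}^{n} \lambda_i X_i
\]
with CDF $F_Y(x) = \Pr(Y < x)$. As in the proof of Theorem 6.1, define
\[
J(z) := \mathcal{L}[F_Y](z) = \frac{1}{z}\, \mathcal{L}[Y](z),
\]
where $\mathcal{L}[F_Y]$ and $\mathcal{L}[Y]$ denote the Laplace transforms of $F_Y$ and $Y$, respectively, and
\[
\mathcal{L}[Y](z) = \prod_{i=1}^{n} \left( 1 + \lambda_i z/\alpha \right)^{-\alpha}, \qquad \mathrm{Re}(z) > -\alpha/\lambda_i, \quad i = 1, 2, \ldots, n.
\]
Now consider a vector $\lambda \in \Theta$ for which $\lambda_i \lambda_j \neq 0$ for some $i \neq j$. We keep all $\lambda_k$, $k \neq i, j$, fixed and vary $\lambda_i$ and $\lambda_j$ under the condition that $\lambda_i + \lambda_j = \text{const}$. We may assume without loss of generality that $i = 1$ and $j = 2$. Vectors for which $\lambda_i = 1$ for some $i$, i.e. the “corners” of $\Theta$, are considered at the end of this proof. Differentiating $J$, we get
\[
\begin{aligned}
\frac{dJ}{d\lambda_1} &= J\, \frac{d}{d\lambda_1}(\ln J) = J\, \frac{d}{d\lambda_1} \left( -\ln(z) - \alpha \sum_{i=1}^{n} \ln\left( 1 + \frac{\lambda_i z}{\alpha} \right) \right) \\
&= J\, \frac{z^2}{\alpha}\, \frac{\lambda_1 - \lambda_2}{\left( 1 + \frac{\lambda_1 z}{\alpha} \right)\left( 1 + \frac{\lambda_2 z}{\alpha} \right)} \\
&= \frac{1}{\alpha} (\lambda_1 - \lambda_2)\, z\, \mathcal{L}[\lambda_1\psi_1](z)\, \mathcal{L}[\lambda_2\psi_2](z)\, \mathcal{L}[Y](z),
\end{aligned} \tag{6.6}
\]
where $\psi_i \sim \mathrm{Gamma}(1, \alpha)$, $i = 1, 2$, are i.i.d gamma r.v's which are also independent of all $X_i$'s. Letting
\[
W(\lambda) := Y + \lambda_1\psi_1 + \lambda\psi_2
\]
with the CDF $F_{W(\lambda)}(x)$, it can be shown that since $\lambda_1\lambda_2 \neq 0$, then by Lemma 6.3(iii), $F_{W(\lambda)}(0) = F'_{W(\lambda)}(0) = 0$, $\forall \lambda \ge 0$. Defining
\[
L(Y, \lambda, x) := F''_{W(\lambda)} = \frac{d^2}{dx^2} \Pr\left( W(\lambda) < x \right) = \frac{d^2}{dx^2} \Pr\left( Y + \lambda_1\psi_1 + \lambda\psi_2 < x \right) \tag{6.7}
\]
and noting that
\[
\mathcal{L}[W(\lambda)](z) = \mathcal{L}[\lambda_1\psi_1](z)\, \mathcal{L}[\lambda\psi_2](z)\, \mathcal{L}[Y](z),
\]
we get
\[
\begin{aligned}
\mathcal{L}[L(Y, \lambda, \cdot)](z) &= \int_0^{\infty} e^{-z x} L(Y, \lambda, x)\, dx = \int_0^{\infty} e^{-z x} F''_{W(\lambda)}(x)\, dx \\
&= z \int_0^{\infty} e^{-z x} F'_{W(\lambda)}(x)\, dx = z^2 \int_0^{\infty} e^{-z x} F_{W(\lambda)}(x)\, dx \\
&= z \int_0^{\infty} e^{-z x}\, dF_{W(\lambda)}(x) = z\, \mathcal{L}[W(\lambda)](z) \\
&= z\, \mathcal{L}[\lambda_1\psi_1](z)\, \mathcal{L}[\lambda\psi_2](z)\, \mathcal{L}[Y](z).
\end{aligned}
\]
Inverting (6.6) yields
\[
\frac{dF_Y(x)}{d\lambda_1} = \frac{1}{\alpha} (\lambda_1 - \lambda_2)\, L(Y, \lambda_2, x). \tag{6.8}
\]
So a necessary condition for the extremum of $F_Y(x)$ is either $\lambda_1\lambda_2(\lambda_1 - \lambda_2) = 0$ or $L(Y, \lambda_2, x) = 0$. Since $\lambda_1\lambda_2 \neq 0$, by Lemma 6.3 the PDF, $f_{W(\lambda)}(x)$, of the linear form $W(\lambda) = Y + \lambda_1\psi_1 + \lambda\psi_2$, for $\lambda > 0$, is differentiable everywhere and $f_{W(\lambda)}(0) = 0$. 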
In addition, on the positive half-line, $f'_{W(\lambda)}(x) = 0$ holds at a unique point because $f_{W(\lambda)}(x)$ is a unimodal analytic function (its graph contains no line segment). The unimodality of $f_{W(\lambda)}(x)$ was already proven for all gamma random variables in [129, Theorem 4].

Now we can prove that, for any $x > 0$, if $F_Y(x)$ has an extremum then the nonzero $\lambda_i$'s can take at most two different values. Suppose that $\lambda_1\lambda_2(\lambda_1 - \lambda_2) \neq 0$; then by (6.8) we have $L(Y, \lambda_2, x) = 0$. Now we show that, for every $\lambda_j \neq 0$, (6.8) implies that $\lambda_j = \lambda_1$ or $\lambda_j = \lambda_2$. For this, we assume the contrary, that $\lambda_j \neq \lambda_1$ and $\lambda_j \neq \lambda_2$, and by using the same reasoning that led to (6.8), we can show that
\[
L(Y, \lambda_2, x) = L(Y, \lambda_j, x) = 0
\]
for every such $\lambda_j \neq 0$, i.e. the point $x > 0$ is simultaneously the mode of the PDF of $W(\lambda_2)$ and of $W(\lambda_j)$, which contradicts Lemma 6.4. So we get that $\lambda_j = \lambda_1$ or $\lambda_j = \lambda_2$. Thus the extrema of $F_Y(x)$ are taken for some $\lambda_1 = \lambda_2 = \ldots = \lambda_k$, $\lambda_{k+1} = \lambda_{k+2} = \ldots = \lambda_{k+m}$, and $\lambda_{k+m+1} = \lambda_{k+m+2} = \ldots = \lambda_n = 0$, where $k + m \le n$, i.e.,
\[
\mathrm{extremum}\ \Pr\left( \sum_{i=1}^{n} \lambda_i X_i \le x \right) = \mathrm{extremum}\ \Pr\left( \frac{\lambda}{k} \sum_{i=1}^{k} X_i + \frac{1 - \lambda}{m} \sum_{i=k+1}^{k+m} X_i \le x \right).
\]
Here without loss of generality we can assume $k \ge m \ge 1$. Now the same reasoning as in the end of the proof of [129, Theorem 1] shows that an extremum is taken either at $k = m = 1$, or at $\lambda_1 = \lambda_2 = \ldots = \lambda_{k+m}$. In the former case, by Lemma 6.5, for any $x \in (0, 1) \cup (\tfrac{2\alpha + 1}{2\alpha}, \infty)$, the extremum can only be taken at $\lambda \in \{0, \tfrac{1}{2}, 1\}$. However, for any $x \in [1, \tfrac{2\alpha + 1}{2\alpha}]$, in addition to $\lambda \in \{0, \tfrac{1}{2}, 1\}$, the extremum can be achieved for some $\lambda^*$ such that $x = \bar{x}(\lambda^*)$, where $\bar{x}(\lambda)$ denotes the mode of the distribution of $\lambda X_1 + (1 - \lambda) X_2 + \lambda\psi_1 + (1 - \lambda)\psi_2$. But for such $\lambda^*$ and $x$, using (6.8) and Lemma 6.5(iii) with $\alpha_1 = \alpha_2 = \alpha$, one can show that $\Pr(\lambda X_1 + (1 - \lambda) X_2 \le x)$ achieves a local maximum. Now including the case where $\lambda_1 = 1$ mentioned earlier in the proof,
we get
\[
m_n(x) = \min_{1 \le d \le n} \Pr\left( \frac{1}{d} \sum_{i=1}^{d} X_i < x \right), \qquad \forall x > 0,
\]
\[
M_n(x) = \max_{1 \le d \le n} \Pr\left( \frac{1}{d} \sum_{i=1}^{d} X_i < x \right), \qquad \forall x \in (0, 1) \cup \left( \frac{2\alpha + 1}{2\alpha}, \infty \right),
\]
where $m_n(x)$ and $M_n(x)$ are defined in the statement of Theorem 6.2 in Section 7.1. Now applying Theorem 6.1 by considering the collection $\alpha_i = i\alpha$, $i = 1, 2, \ldots, n$, would yield the desired results. □

Chapter 7

Uncertainty Quantification of Stochastic Reconstruction Algorithms

In the present chapter, we continue to consider the stochastic algorithms, presented in Chapters 3 and 4, for efficiently solving the class of large scale non-linear least squares (NLS) problems described in Chapter 1. We will continue to make Assumptions (A.1) - (A.3) (Assumption (A.2) can be, if necessary, restored by employing similar techniques as in Chapter 4). In Chapters 3 and 4, practical and randomized reconstruction algorithms were discussed and their efficiency was demonstrated by various numerical examples. However, all randomized steps in these algorithms were left to heuristics and as such the amount of uncertainty in each stochastic step remained unchecked. One advantage of leaving these steps heuristic is the great simplicity in the design and high efficiency in the performance of such algorithms. However, the mere existence of uncertainty in the overall procedure can cast doubt on the credibility of the reconstructions. In many applications, one might be willing to sacrifice the simplicity and even compromise slightly on the efficiency in order to have a handle on the amount of uncertainty in the algorithm. Hence, it may be desirable to have means which allow one to adjust the cost and accuracy of such algorithms in a quantifiable way, and find a balance that is suitable to particular objectives and computational resources. Here, we propose eight variants of Algorithm 2 where the uncertainties in the major stochastic steps are quantified (adjustment of Algorithms 1 and 3 in a similar way is straightforward). 
Quantifying the uncertainty in these stochastic steps, again, involves approximating the NLS objective function using Monte-Carlo (MC) methods as discussed in Section 2.1. There, it was shown that such approximation is, in fact, equivalent to estimating the trace of the corresponding SPSD matrices. In Chapter 5, these estimators were analyzed and conditions on the MC sample size (which translates to cost) to satisfy the prescribed probabilistic relative accuracy were given. However, these conditions, though asymptotically tight, are pessimistic and are typically not sufficiently tight to be practically useful. On the other hand, as discussed in Chapter 3, the objective is to be able to generate as few random samples as possible for achieving acceptable approximations to the objective function. Hence, in the present chapter, and for the case of the Gaussian estimator, we prove tight necessary and sufficient conditions on the MC sample size and we show that these conditions are practically computable and yield small sample sizes. They are then incorporated in our stochastic algorithm to quantify the uncertainty in each randomized step. The bounds we use are applications of the main results of Chapter 6 presented in Theorems 6.1 and 6.2.

This chapter is organized as follows. In Section 7.1, we develop and state theorems regarding the tight tail bounds promised above. In Section 7.2 we present our stochastic algorithm variants for approximately minimizing (1.6) or (1.7) and discuss their novel elements. Subsequently in Section 7.3, the efficiency of the proposed algorithm variants is numerically demonstrated. This is followed by conclusions and further thoughts in Section 7.4.

7.1 Tight Conditions on Sample Size for Gaussian MC Trace Estimators

Let the matrix $A = B^T B \in \mathbb{R}^{s \times s}$ be implicit SPSD, and denote its trace by $\mathrm{tr}(A)$. As described in Chapter 5, the Gaussian Monte-Carlo estimator of $\mathrm{tr}(A)$ is defined by (cf. 
(5.1) with $D = G$)
\[
\mathrm{tr}^n_G(A) := \frac{1}{n} \sum_{i=1}^{n} w_i^T A w_i, \tag{7.1}
\]
where $w_i \in \mathbb{R}^s \sim \mathcal{N}(0, I)$.

Now, given a pair of small positive real numbers $(\varepsilon, \delta)$, consider finding an appropriate sample size $n$ such that
\[
\Pr\left( \mathrm{tr}^n_G(A) \ge (1 - \varepsilon)\,\mathrm{tr}(A) \right) \ge 1 - \delta, \tag{7.2a}
\]
\[
\Pr\left( \mathrm{tr}^n_G(A) \le (1 + \varepsilon)\,\mathrm{tr}(A) \right) \ge 1 - \delta. \tag{7.2b}
\]
In Chapter 5 we showed that the inequalities (7.2) hold if
\[
n > 8c, \quad \text{where } c = c(\varepsilon, \delta) = \varepsilon^{-2} \ln(1/\delta). \tag{7.3}
\]
However, this bound on $n$ can be rather pessimistic and yields sample sizes which may not be practically appealing. Theorems 7.1 and 7.2 and Corollary 7.3 below provide tighter and hopefully more useful bounds on $n$. For the proof of these, we make use of Theorems 6.1 and 6.2 of Chapter 6.

Let us define
\[
Q(n) := \frac{1}{n} Q_n,
\]
where $Q_n \sim \chi^2_n$ denotes a chi-squared r.v of degree $n$. Note that $Q(n) \sim \mathrm{Gamma}(n/2, n/2)$, i.e., a gamma r.v parametrized by shape $\alpha = n/2$ and rate $\beta = n/2$, with PDF given as (6.1). In case of several i.i.d gamma r.v's of this sort, we refer to the $j$th r.v by $Q_j(n)$.

Theorem 7.1 (Necessary and sufficient condition for (7.2a)). Given an SPSD matrix $A$ of rank $r$ and tolerances $(\varepsilon, \delta)$ as above, the following hold:
(i) Sufficient condition: there exists some integer $n_0 \ge 1$ such that
\[
\Pr\left( Q(n_0) < (1 - \varepsilon) \right) \le \delta. \tag{7.4}
\]
Furthermore, (7.2a) holds for all $n \ge n_0$.
(ii) Necessary condition: if (7.2a) holds for some $n_0 \ge 1$, then for all $n \ge n_0$
\[
P^-_{\varepsilon, r}(n) := \Pr\left( Q(nr) < (1 - \varepsilon) \right) \le \delta. \tag{7.5}
\]
(iii) Tightness: if the $r$ positive eigenvalues of $A$ are all equal (NB this always happens if $r = 1$), then there is a positive integer $n_0$ satisfying (7.5), such that (7.2a) holds iff $n \ge n_0$.

Proof. Since $A$ is SPSD, it can be diagonalized by a unitary similarity transformation as $A = U^T \Lambda U$, where $\Lambda$ is the diagonal matrix of eigenvalues sorted in non-increasing order. Consider $n$ random vectors $w_i$, $i = 1, \ldots$
, $n$, whose components are i.i.d and drawn from the standard normal distribution, and define $z_i = U w_i$ for each $i$. Note that since $U$ is unitary, the entries of $z_i$ are i.i.d standard normal variables, like the entries of $w_i$. We have
\[
\begin{aligned}
\frac{\mathrm{tr}^n_G(A)}{\mathrm{tr}(A)} &= \frac{1}{n\,\mathrm{tr}(A)} \sum_{i=1}^{n} w_i^T A w_i = \frac{1}{n\,\mathrm{tr}(A)} \sum_{i=1}^{n} z_i^T \Lambda z_i = \frac{1}{n\,\mathrm{tr}(A)} \sum_{i=1}^{n} \sum_{j=1}^{r} \lambda_j z_{ij}^2 \\
&= \sum_{j=1}^{r} \frac{\lambda_j}{n\,\mathrm{tr}(A)} \sum_{i=1}^{n} z_{ij}^2 = \sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)}\, Q_j(n),
\end{aligned}
\]
where the $\lambda_j$'s appearing in the sums are the positive eigenvalues of $A$. Now, noting that
\[
\sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)} = 1,
\]
Theorem 6.2 yields
\[
\Pr\left( \sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)}\, Q_j(n) \le (1 - \varepsilon) \right) \le \Pr\left( Q(n) \le (1 - \varepsilon) \right) = P^-_{\varepsilon, 1}(n), \tag{7.6a}
\]
\[
\Pr\left( \sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)}\, Q_j(n) \le (1 - \varepsilon) \right) \ge \Pr\left( Q(nr) \le (1 - \varepsilon) \right) = P^-_{\varepsilon, r}(n). \tag{7.6b}
\]
In addition, for any given $r > 0$ and $\varepsilon > 0$, the function $P^-_{\varepsilon, r}(n)$ is monotonically decreasing on integers $n \ge 1$. This can be seen by Theorem 6.1 using the sequence $\alpha_i = (n_0 + (i - 1)) r/2$, $i \ge 1$. The claims now easily follow by combining (7.6) and this decreasing property.

Theorem 7.2 (Necessary and sufficient condition for (7.2b)). Given an SPSD matrix $A$ of rank $r$ and tolerances $(\varepsilon, \delta)$ as above, the following hold:
(i) Sufficient condition: if the inequality
\[
\Pr\left( Q(n_0) \le (1 + \varepsilon) \right) \ge 1 - \delta \tag{7.7}
\]
is satisfied for some $n_0 > \varepsilon^{-1}$, then (7.2b) holds with $n = n_0$. Furthermore, there is always an $n_0 > \varepsilon^{-2}$ such that (7.7) is satisfied and, for such $n_0$, it follows that (7.2b) holds for all $n \ge n_0$.
(ii) Necessary condition: if (7.2b) holds for some $n_0 > \varepsilon^{-1}$, then
\[
P^+_{\varepsilon, r}(n) := \Pr\left( Q(nr) \le (1 + \varepsilon) \right) \ge 1 - \delta, \tag{7.8}
\]
with $n = n_0$. Furthermore, if $n_0 > \varepsilon^{-2} r^{-2}$, then (7.8) holds for all $n \ge n_0$.
(iii) Tightness: if the $r$ positive eigenvalues of $A$ are all equal, then there is a smallest $n_0 > \varepsilon^{-2} r^{-2}$ satisfying (7.8) such that for any $n \ge n_0$, (7.2b) holds, and for any $\varepsilon^{-2} r^{-2} < n < n_0$, (7.2b) does not hold. If $\delta$ is small enough so that (7.8) does not hold for any $n \le \varepsilon^{-2} r^{-2}$, then $n_0$ is both necessary and sufficient for (7.2b).

Proof. 
The same unitary diagonalization argument as in the proof of Theorem 7.1 shows that
\[
\Pr\left( \mathrm{tr}^n_G(A) < (1 + \varepsilon)\,\mathrm{tr}(A) \right) = \Pr\left( \sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)}\, Q_j(n) < (1 + \varepsilon) \right).
\]
Now we see that if $n > \varepsilon^{-1}$, Theorem 6.2 with $\alpha = n/2$ yields
\[
\Pr\left( \sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)}\, Q_j(n) \le (1 + \varepsilon) \right) \ge \Pr\left( Q(n) \le (1 + \varepsilon) \right) = P^+_{\varepsilon, 1}(n), \tag{7.9a}
\]
\[
\Pr\left( \sum_{j=1}^{r} \frac{\lambda_j}{\mathrm{tr}(A)}\, Q_j(n) \le (1 + \varepsilon) \right) \le \Pr\left( Q(nr) \le (1 + \varepsilon) \right) = P^+_{\varepsilon, r}(n). \tag{7.9b}
\]
In addition, for any given $r > 0$ and $\varepsilon > 0$, the function $P^+_{\varepsilon, r}(n)$ is monotonically increasing on integers $n > \varepsilon^{-2} r^{-2}$. This can be seen by Theorem 6.1 using the sequence $\alpha_i = (n_0 + (i - 1)) r/2$, $i \ge 1$. The claims now easily follow by combining (7.9) and this increasing property.

Figure 7.1: The curves of $P^-_{\varepsilon, r}(n)$ and $P^+_{\varepsilon, r}(n)$, defined in (7.5) and (7.8), for $\varepsilon = 0.1$ and $r = 1$: (a) $P^-_{\varepsilon, r}(n)$ decreases monotonically for all $n \ge 1$; (b) $P^+_{\varepsilon, r}(n)$ increases monotonically only for $n \ge n_0$, where $n_0 > 1$: according to Theorem 7.2, $n_0 = 100$ is safe, and this value does not disagree with the plot.

Remarks:
(i) Part (iii) of Theorem 7.2 states that if $\delta$ is not small enough, then $n_0$ might not be a necessary and sufficient sample size for the special matrices mentioned there, i.e., matrices with $\lambda_1 = \lambda_2 = \cdots = \lambda_r$. This can be seen from Figure 7.1(b): for $r = 1$, $\varepsilon = 0.1$, if $\delta = 0.33$, say, there is an integer $10 < n \le 100$ such that (7.2b) holds, so $n = 101$ is no longer a necessary sample size (although it is still sufficient).
(ii) Simulations show that the sufficient sample size obtained using Theorems 7.1 and 7.2 amounts to bounds of the form $O\left( c(\varepsilon, \delta)\, g(\delta) \right)$, where $g(\delta) < 1$ is a decreasing function of $\delta$ and $c(\varepsilon, \delta)$ is as defined in (7.3). As such, for larger values of $\delta$, i.e., when larger uncertainty is allowed, one can obtain significantly smaller sample sizes than the one predicted by (7.3); see Figures 7.2 and 7.3. 
In other words, the difference between the above tighter conditions and (7.3) is increasingly more prominent as $\delta$ gets larger.
(iii) Note that the results in Theorems 7.1 and 7.2 are independent of the size of the matrix. In fact, the first items (i) in both theorems do not require any a priori knowledge about the matrix, other than it being SPSD. In order to compute the necessary sample sizes, though, one is required to also know the rank of the matrix.
(iv) The conditions in our theorems, despite their potentially ominous look, are actually simple to compute. Appendix A.4 contains a short Matlab code which calculates these necessary or sufficient sample sizes to satisfy the probabilistic accuracy guarantees (7.2), given a pair $(\varepsilon, \delta)$ (and the matrix rank $r$ in case of necessary sample sizes). This code was used for generating Figures 7.2 and 7.3.

Figure 7.2: Comparing, as a function of $\delta$, the sample size obtained from (7.4) and denoted by “tight”, with that of (7.3) and denoted by “loose”, for $\varepsilon = 0.1$ and $0.01 \le \delta \le 0.3$: (a) sufficient sample size, $n$, for (7.2a), (b) ratio of sufficient sample size obtained from (7.3) over that of (7.4). When $\delta$ is relaxed, our new bound is tighter than the older one by an order of magnitude.

Combining Theorems 7.1 and 7.2, we can easily state conditions on the sample size $n$ for which the condition
\[
\Pr\left( \left| \mathrm{tr}^n_G(A) - \mathrm{tr}(A) \right| \le \varepsilon\,\mathrm{tr}(A) \right) \ge 1 - \delta \tag{7.10}
\]
holds. We have the following immediate corollary:

Corollary 7.3 (Necessary and sufficient condition for (7.10)). Given an SPSD matrix $A$ of rank $r$ and tolerances $(\varepsilon, \delta)$ as above, the following hold:
(i) Sufficient condition: if the inequality
\[
\Pr\left( (1 - \varepsilon) \le Q(n_0) \le (1 + \varepsilon) \right) \ge 1 - \delta \tag{7.11}
\]
is satisfied for some $n_0 > \varepsilon^{-1}$, then (7.10) holds with $n = n_0$. Furthermore, there is always an $n_0 > \varepsilon^{-2}$ such that (7.11) is satisfied and, for such $n_0$, it follows that (7.10) holds for all $n \ge n_0$.
(ii) Necessary condition: if (7.10) holds for some $n_0 > \varepsilon^{-1}$, then
\[
\Pr\left( (1 - \varepsilon) \le Q(nr) \le (1 + \varepsilon) \right) \ge 1 - \delta, \tag{7.12}
\]
with $n = n_0$. Furthermore, if $n_0 > \varepsilon^{-2} r^{-2}$, then (7.12) holds for all $n \ge n_0$.
(iii) Tightness: if the $r$ positive eigenvalues of $A$ are all equal, then there is a smallest $n_0 > \varepsilon^{-2} r^{-2}$ satisfying (7.12) such that for any $n \ge n_0$, (7.10) holds, and for any $\varepsilon^{-2} r^{-2} < n < n_0$, (7.10) does not hold. If $\delta$ is small enough so that (7.12) does not hold for any $n \le \varepsilon^{-2} r^{-2}$, then $n_0$ is both necessary and sufficient for (7.10).

Figure 7.3: Comparing, as a function of $\delta$, the sample size obtained from (7.7) and denoted by “tight”, with that of (7.3) and denoted by “loose”, for $\varepsilon = 0.1$ and $0.01 \le \delta \le 0.3$: (a) sufficient sample size, $n$, for (7.2b), (b) ratio of sufficient sample size obtained from (7.3) over that of (7.7). When $\delta$ is relaxed, our new bound is tighter than the older one by an order of magnitude.

Remark: The necessary condition in Corollary 7.3(ii) is only valid for $n > \varepsilon^{-1}$ (this is a consequence of the condition (7.12) being tight, as shown in part (iii)). In Section 5.3.2, an “almost tight” necessary condition is given that works for all $n \ge 1$.

7.2 Quantifying the Uncertainty in Randomized Algorithms

As described in Section 1.2, consider the problem of decreasing the value of the original objective (1.6) to a desired level (e.g., satisfying a given tolerance) to recover the sought model, $m$. Namely, consider an iterative method such as modified Gauss-Newton (GN), using sensitivity matrices $J_i(m) = \partial f(m, q_i)/\partial m$, $i = 1, \ldots$
, $s$, and the gradient
\[
\nabla\phi(m) = 2 \sum_{i=1}^{s} J_i^T(m) \left( f(m, q_i) - d_i \right).
\]
As in Chapter 3, what is special in our context here is that the update direction, $\delta m_k$, is calculated using the approximate misfit, $\widehat{\phi}(m_k, n_k)$, defined as described in (2.3) ($n_k$ is the sample size used for this approximation in the $k$th iteration). However, since the ultimate goal is to fit the original data, we need to assess whether the value of the original objective is also decreased using this new iterate. The challenge is to do this as well as check for termination of the iteration process with a minimal number of evaluations of the prohibitively expensive original misfit function $\phi$.

In this section, we extend the algorithms introduced in Chapters 3 and 4. Variants of modified stochastic steps in the original algorithms are presented, and using Theorems 7.1 and 7.2, the uncertainties in these steps are quantified. More specifically, in Algorithm 2 introduced in Chapter 3, following a stabilized GN iteration on the approximated objective function using the approximated misfit, the iterate is updated, and some (or all) of the following steps are performed:
(i) cross validation (see Section 3.1.1) – approximate assessment of this iterate in terms of sufficient decrease in the objective function using a control set of random combinations of measurements. More specifically, at the $k$th iteration with the new iterate $m_{k+1}$, we test whether the condition (3.2), namely
\[
\widehat{\phi}(m_{k+1}, n_k) \le \kappa\, \widehat{\phi}(m_k, n_k)
\]
(cf. 
(2.3)) holds for some $\kappa \le 1$, employing an independent set of weight vectors used in both approximations of $\phi$;
(ii) uncertainty check (see Section 3.1.2) – upon success of cross validation, an inexpensive plausible termination test is performed where, given a tolerance $\rho$, we check for the condition (3.4), namely
\[
\widehat{\phi}(m_{k+1}, n_k) \le \rho
\]
using a fresh set of random weight vectors; and
(iii) stopping criterion (see Section 3.1.2) – upon success of the uncertainty check, an additional independent and potentially more rigorous termination test against the given tolerance $\rho$ is performed (possibly using the original misfit function).

The role of the cross validation step within an iteration is to assess whether the true objective function at the current iterate has (sufficiently) decreased compared to the previous one. If this test fails, we deem that the current sample size is not sufficiently large to yield an update that decreases the original objective, and the fitting step needs to be repeated using a larger sample size, see [46]. In Chapter 3, this step was used heuristically, so the amount of uncertainty in such validation of the current iterate was not quantified. Consequently, there was no handle on the amount of false positives/negatives in such approximate evaluations (e.g., a sample size could be deemed too small while the stabilized GN iteration has in fact produced an acceptable iterate). In addition, in Chapter 3 the sample size for the uncertainty check was heuristically chosen. So this step was also performed with no control over the amount of uncertainty.

For the stopping criterion step in Chapter 3 as well as [46], the objective function was accurately evaluated using all $s$ experiments, which is clearly a very expensive choice for an algorithm termination check. This was a judicious decision made in order to be able to have a fairer comparison of the new and different methods proposed there. 
Replacement of this termination criterion by another independent heuristic “uncertainty check” is experimented with in Chapter 4.

In this section, we address the issues of quantifying the uncertainty in the validation, uncertainty check and stopping criterion steps within a nonlinear iteration. In what follows we continue to assume, for simplicity, that the iterations are performed on the objective (1.6) using dynamic regularization (or iterative regularization [45, 78, 132]) where the regularization is performed implicitly. Extension to the case (1.7) is straightforward. Throughout, we assume to be given a pair of positive and small probabilistic tolerance numbers, $(\varepsilon, \delta)$.

7.2.1 Cross Validation Step with Quantified Uncertainty

The condition (3.2) is an independent, unbiased indicator of (3.1), which indicates sufficient decrease in the objective. If (3.2) is satisfied then the current sample size, $n_k$, is considered sufficiently large to capture the original misfit well enough to produce a valid iterate, and the algorithm continues using the same sample size. Otherwise, the sample size is deemed insufficient and is increased. Using Theorems 7.1 and 7.2, we can now remove the heuristic characteristic as to when this sample size increase has been performed hitherto, and present two variants of (3.2) where the uncertainties in the validation step are quantified.

Assume we have a sample size $n_c$ such that
\[
\Pr\left( \widehat{\phi}(m_k, n_c) \le (1 + \varepsilon)\, \phi(m_k) \right) \ge 1 - \delta, \tag{7.13a}
\]
\[
\Pr\left( \widehat{\phi}(m_{k+1}, n_c) \ge (1 - \varepsilon)\, \phi(m_{k+1}) \right) \ge 1 - \delta. \tag{7.13b}
\]
If in the procedure outlined above, after obtaining the updated iterate $m_{k+1}$, we verify that
\[
\widehat{\phi}(m_{k+1}, n_c) \le \kappa \left( \frac{1 - \varepsilon}{1 + \varepsilon} \right) \widehat{\phi}(m_k, n_c), \tag{7.14}
\]
then it follows from (7.13) that $\phi(m_{k+1}) \le \kappa\, \phi(m_k)$ with a probability of, at least, $(1 - \delta)^2$. In other words, success of (7.14) indicates that the updated iterate decreases the value of the original misfit (1.6) with a probability of, at least, $(1 - \delta)^2$.
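For completeness, the chain of inequalities behind this claim can be written out as below. This is a sketch: each step holds on the event named next to it, and the stated probability $(1 - \delta)^2$ corresponds to treating the two events in (7.13) as independent. Without that assumption, a union bound still gives probability at least $1 - 2\delta$.

\[
\begin{aligned}
\phi(m_{k+1}) &\le \frac{\widehat{\phi}(m_{k+1}, n_c)}{1 - \varepsilon} && \text{on the event in (7.13b)} \\
&\le \frac{\kappa}{1 + \varepsilon}\, \widehat{\phi}(m_k, n_c) && \text{by the check (7.14)} \\
&\le \kappa\, \phi(m_k) && \text{on the event in (7.13a)}.
\end{aligned}
\]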
Alternatively, suppose that we have
\[
\Pr\left( \widehat{\phi}(m_k, n_c) \ge (1 - \varepsilon)\, \phi(m_k) \right) \ge 1 - \delta, \tag{7.15a}
\]
\[
\Pr\left( \widehat{\phi}(m_{k+1}, n_c) \le (1 + \varepsilon)\, \phi(m_{k+1}) \right) \ge 1 - \delta. \tag{7.15b}
\]
Now, if instead of (7.14) we check whether or not
\[
\widehat{\phi}(m_{k+1}, n_c) \le \kappa \left( \frac{1 + \varepsilon}{1 - \varepsilon} \right) \widehat{\phi}(m_k, n_c), \tag{7.16}
\]
then it follows from (7.15) that if the condition (7.16) is not satisfied, then $\phi(m_{k+1}) > \kappa\, \phi(m_k)$ with a probability of, at least, $(1 - \delta)^2$. In other words, failure of (7.16) indicates that the updated iterate results in an insufficient decrease in the original misfit (1.6) with a probability of, at least, $(1 - \delta)^2$.

We can replace (3.2) with either of the conditions (7.14) or (7.16) and use the conditions (7.4) or (7.7) to calculate the cross validation sample size, $n_c$. If the relevant check (7.14) or (7.16) fails, we deem the sample size used in the fitting step, $n_k$, to be too small to produce an iterate which decreases the original misfit (1.6), and consequently consider increasing the sample size, $n_k$. Note that since $\frac{1 - \varepsilon}{1 + \varepsilon} < 1 < \frac{1 + \varepsilon}{1 - \varepsilon}$, the condition (7.14) results in a more aggressive strategy for increasing the sample size used in the fitting step than the condition (7.16). Figure 7.8 in Section 7.3 demonstrates this within the context of an application.

Remarks:
(i) Larger values of $\varepsilon$ result in a more aggressive (or relaxed) descent requirement by the condition (7.14) (or (7.16)).
(ii) As the iterations progress and we get closer to the solution, the decrease in the original objective could be less than what is imposed by (7.14). As a result, if $\varepsilon$ is too large, we might never successfully pass the cross validation test. One useful strategy to alleviate this is to start with a larger $\varepsilon$, decreasing it as we get closer to the solution. A similar strategy can be adopted for the case when the condition (7.16) is used as a cross validation: as the iterations get closer to the solution, one can make the condition (7.16) less relaxed by decreasing $\varepsilon$.
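The two cross validation variants differ only in the factor multiplying $\kappa$. The following Python sketch makes that concrete. The function names and the numerical values are hypothetical and chosen for illustration; the approximate misfit values $\widehat{\phi}(m_k, n_c)$ and $\widehat{\phi}(m_{k+1}, n_c)$ are assumed to have been computed elsewhere.

```python
def passes_cross_validation_7_14(phi_hat_new, phi_hat_old, kappa, eps):
    """Check (7.14): success suggests phi(m_{k+1}) <= kappa * phi(m_k)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return phi_hat_new <= kappa * (1.0 - eps) / (1.0 + eps) * phi_hat_old


def passes_cross_validation_7_16(phi_hat_new, phi_hat_old, kappa, eps):
    """Check (7.16): failure suggests phi(m_{k+1}) > kappa * phi(m_k)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return phi_hat_new <= kappa * (1.0 + eps) / (1.0 - eps) * phi_hat_old


# Illustrative numbers only: the same pair of misfit estimates can fail the
# stricter test (7.14) while passing the more relaxed test (7.16).
phi_hat_old, phi_hat_new = 100.0, 90.0
kappa, eps = 1.0, 0.1
strict_ok = passes_cross_validation_7_14(phi_hat_new, phi_hat_old, kappa, eps)
relaxed_ok = passes_cross_validation_7_16(phi_hat_new, phi_hat_old, kappa, eps)
print("(7.14) passed:", strict_ok)
print("(7.16) passed:", relaxed_ok)
```

In a loop over iterations, a failed check would trigger an increase of the fitting sample size $n_k$, which is why (7.14) leads to the more aggressive growth of $n_k$.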
7.2.2 Uncertainty Check with Quantified Uncertainty and Efficient Stopping Criterion

The usual test for terminating the iterative process is to check for condition (3.3), namely
\[
\phi(m_{k+1}) \le \rho,
\]
for a given tolerance $\rho$. However, this can be very expensive in our current context; see Section 7.3 and Tables 7.1 and 7.2 for examples of a scenario where one misfit evaluation using the entire data set can be as expensive as the entire cost of an efficient, complete algorithm. In addition, if the exact value of the tolerance $\rho$ is not known (which is usually the case in practice), one should be able to reflect such uncertainty in the stopping criterion and perform a softer version of (3.3). Hence, it could be useful to have an algorithm which allows one to adjust the cost and accuracy of such an evaluation in a quantifiable way, and find the balance that is suitable to particular objectives and computational resources.

Regardless of the issues of cost and accuracy, this evaluation should be carried out as rarely as possible and only when deemed timely. In Chapter 3, we addressed this by employing an “uncertainty check” (3.4) as described earlier in this section, heuristically. Using Theorems 7.1 and 7.2, we now devise variants of (3.4) with quantifiable uncertainty. Subsequently, again using Theorems 7.1 and 7.2, we present a much cheaper stopping criterion than (3.3) which, at the same time, reflects our uncertainty in the given tolerance.

Assume that we have a sample size $n_u$ such that
\[
\Pr\left( \widehat{\phi}(m_{k+1}, n_u) \ge (1 - \varepsilon)\, \phi(m_{k+1}) \right) \ge 1 - \delta. \tag{7.17}
\]
If the updated iterate, $m_{k+1}$, successfully passes the cross validation test, then we check for
\[
\widehat{\phi}(m_{k+1}, n_u) \le (1 - \varepsilon)\, \rho. \tag{7.18}
\]
If this holds too, then it follows from (7.17) that $\phi(m_{k+1}) \le \rho$ with a probability of, at least, $(1 - \delta)$. In other words, success of (7.18) indicates that the misfit is likely to be below the
tolerance with a probability of at least $(1-\delta)$.

Alternatively, suppose that

\[ \Pr\big(\hat{\phi}(m_{k+1}, n_u) \le (1+\varepsilon)\,\phi(m_{k+1})\big) \ge 1-\delta, \tag{7.19} \]

and that, instead of (7.18), we check for

\[ \hat{\phi}(m_{k+1}, n_u) \le (1+\varepsilon)\,\rho. \tag{7.20} \]

Then it follows from (7.19) that if the condition (7.20) is not satisfied, then $\phi(m_{k+1}) > \rho$ with a probability of at least $(1-\delta)$. In other words, failure of (7.20) indicates that, using the updated iterate, the misfit is likely to be still above the desired tolerance with a probability of at least $(1-\delta)$.

We can replace (3.4) with the condition (7.18) (or (7.20)) and use the condition (7.4) (or (7.7)) to calculate the uncertainty check sample size, $n_u$. If the test (7.18) (or (7.20)) fails, then we skip the stopping criterion check and continue iterating. Note that since $(1-\varepsilon) < 1 < (1+\varepsilon)$, the condition (7.18) results in fewer false positives than the condition (7.20). On the other hand, the condition (7.20) is expected to result in fewer false negatives than the condition (7.18). The choice between the two alternatives depends on one's requirements, resources and the application at hand.

The stopping criterion step can be performed in the same way as the uncertainty check but potentially with higher certainty in the outcome. In other words, for the stopping criterion we can choose a smaller $\delta$, resulting in a larger sample size $n_t$ satisfying $n_t > n_u$, and check for satisfaction of either

\[ \hat{\phi}(m_{k+1}, n_t) \le (1-\varepsilon)\,\rho, \tag{7.21a} \]

or

\[ \hat{\phi}(m_{k+1}, n_t) \le (1+\varepsilon)\,\rho. \tag{7.21b} \]

Clearly the condition (7.21b) is softer than (7.21a): a successful (7.21b) is only necessary and not sufficient for concluding that (3.3) holds with the prescribed probability.

In practice, when the value of the stopping criterion threshold, $\rho$, is not exactly known (it is often crudely estimated using the measurements), one can reflect such uncertainty in $\rho$ by choosing an appropriately large $\delta$.
Smaller values of $\delta$ reflect a higher certainty in $\rho$ and a more rigid stopping criterion.

Remarks:

(i) If $\varepsilon$ is large then, using (7.21a), one might run the risk of over-fitting. Similarly, using (7.21b) with large $\varepsilon$, there is a risk of under-fitting. Thus, appropriate values of $\varepsilon$ need to be considered in accordance with the application and one's computational resources and experience.

(ii) The same issues regarding large $\varepsilon$ arise when employing the uncertainty check condition (7.18) (or (7.20)): large $\varepsilon$ might increase the frequency of false negatives (or positives).

7.2.3 Algorithm

We now present an extension of Algorithm 2 for approximately solving NLS formulations of (1.6) or (1.7). By performing cross validation, uncertainty check and stopping criterion as described in Section 7.2.1 and Section 7.2.2, we can devise 8 variants of Algorithm 4 below. Depending on the application, the variant of choice can be selected appropriately. More specifically, cross validation, uncertainty check and stopping criterion can, respectively, be chosen to be one of the following combinations (referring to their equation numbers):

(i)   (7.14 - 7.18 - 7.21a)    (ii)   (7.14 - 7.18 - 7.21b)
(iii) (7.14 - 7.20 - 7.21a)    (iv)   (7.14 - 7.20 - 7.21b)
(v)   (7.16 - 7.18 - 7.21a)    (vi)   (7.16 - 7.18 - 7.21b)
(vii) (7.16 - 7.20 - 7.21a)    (viii) (7.16 - 7.20 - 7.21b)

Remarks:

(i) The sample size, $n_k$, used in the fitting step of Algorithm 4 could in principle be determined by Corollary 7.3, using a pair of tolerances $(\varepsilon_f, \delta_f)$. If cross validation (7.14) (or (7.16)) fails, the tolerance pair $(\varepsilon_f, \delta_f)$ is reduced to obtain, in the next iteration, a larger fitting sample size, $n_{k+1}$. This would give a sample size which yields a quantifiable approximation with a desired relative accuracy.
However, in the presence of all the added safety steps described in this section, we have found in practice that Algorithm 4 is capable of producing a satisfying recovery, even with a significantly smaller $n_k$ than the one predicted by Corollary 7.3. Thus, the “how” of the fitting sample size increase is left to heuristics (as opposed to its “when”, which is quantified as described in Section 7.2.1).

(ii) In the algorithm below, we only consider fixed values (i.e., independent of $k$) for $\varepsilon$ and $\delta$. One can easily modify Algorithm 4 to incorporate non-stationary values which adapt to the iteration process, as mentioned in the closing remark of Section 7.2.1.

In Algorithm 4, when we draw vectors $w_i$ for some purpose, we always draw them independently from the standard normal distribution.

7.3 Numerical Experiments

In this section, we numerically demonstrate the efficacy of Algorithm 4 by applying it to the important class of problems described in Section 1.1.2: large scale PDE constrained inverse problems with many measurements. We show below the capability of our method by applying it to such examples in the context of the DC resistivity/EIT problem (see Section 3.3.1), as in Chapters 3 and 4 as well as [46, 70, 71, 111].

We consider the forward operators as defined in (1.5) where the linearity assumption (A.2) is satisfied (i.e., the locations where data are measured do not change from one experiment to another, so that $P = P_i$ for all $i$). Hence, we can use Algorithm 4 to efficiently recover $m$ and be quantifiably confident in the recovered model. If the $P_i$'s are different across experiments, it might be possible to use methods such as the ones introduced in Chapter 4 or [70] to extend the existing data set to one where all sources share the same receivers.
Using these methods (when they apply!), one can effectively restore the linearity assumption (A.2) and transform the problem (1.4) to (1.5), for which Algorithm 4 can be employed.

Considering the inverse problem with the PDE model (3.5), below we give two examples, each having a piecewise constant “exact solution”, or “true model”, used to synthesize data:

Algorithm 4 Solve NLS formulation of (1.6) (or (1.7)) using uncertainty check, cross validation and cheap stopping criterion

Given: sources $q_i$, $i = 1, \ldots, s$, measurements $d_i$, $i = 1, \ldots, s$, stopping criterion level $\rho$, objective function sufficient decrease factor $\kappa \le 1$, pairs of small numbers $(\varepsilon_c, \delta_c)$, $(\varepsilon_u, \delta_u)$, $(\varepsilon_t, \delta_t)$, and initial guess $m_0$.
Initialize:
  - $m = m_0$, $n_0 = 1$
  - Calculate the cross validation sample size, $n_c$, as described in Section 7.2.1 with $(\varepsilon_c, \delta_c)$.
  - Calculate the sample sizes for uncertainty check, $n_u$, and stopping criterion, $n_t$, as described in Section 7.2.2 with $(\varepsilon_u, \delta_u)$ and $(\varepsilon_t, \delta_t)$, respectively.
for $k = 0, 1, 2, \ldots$ until termination do
  Fitting:
    - Draw $w_i$, $i = 1, \ldots, n_k$.
    - Approximate the misfit term and potentially its gradient in (1.6) or (1.7) using (2.3) with the above weights and $n = n_k$.
    - Find an update for the objective function using the approximated misfit (2.3).
  Cross Validation:
    - Draw $w_i$, $i = 1, \ldots, n_c$.
    if (7.14) (or (7.16)) holds then
      Uncertainty Check:
        - Draw $w_i$, $i = 1, \ldots, n_u$.
        if (7.18) (or (7.20)) holds then
          Stopping Criterion:
            - Draw $w_i$, $i = 1, \ldots, n_t$.
            if (7.21a) (or (7.21b)) holds then
              - Terminate
            end if
        end if
      - Set $n_{k+1} = n_k$.
    else
      - Sample Size Increase: for example, set $n_{k+1} = \min(2 n_k, s)$.
    end if
end for
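The control flow of Algorithm 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: `fit_step` and `phi_hat` are hypothetical user-supplied callables (one fitting update and one misfit estimate for given weight vectors), the sample sizes `n_c`, `n_u`, `n_t` are taken as inputs rather than computed from (7.4) or (7.7), the update from the fitting step is accepted whether or not cross validation passes, and condition (7.14) is assumed to use the factor $(1-\varepsilon)/(1+\varepsilon)$, as suggested by the inequality remark in Section 7.2.1.

```python
import random


def algorithm4_sketch(fit_step, phi_hat, m0, s, rho, kappa,
                      eps_c, eps_u, eps_t, n_c, n_u, n_t,
                      cv="7.14", uc="7.18", stop="7.21a",
                      max_iter=100, seed=0):
    """Control-flow sketch of Algorithm 4 (hypothetical helper names).

    fit_step(m, W) -> updated model, using the list W of weight vectors.
    phi_hat(m, W)  -> misfit estimate at m, using the weight vectors in W.
    Returns (model, iterations used, final fitting sample size).
    """
    rng = random.Random(seed)

    def draw(n):
        # n independent standard normal weight vectors of length s
        return [[rng.gauss(0.0, 1.0) for _ in range(s)] for _ in range(n)]

    # Factors selecting one of the 8 variants (i)-(viii).
    # ASSUMPTION: (7.14) uses (1 - eps) / (1 + eps).
    cv_factor = (1 - eps_c) / (1 + eps_c) if cv == "7.14" else (1 + eps_c) / (1 - eps_c)
    uc_factor = (1 - eps_u) if uc == "7.18" else (1 + eps_u)
    stop_factor = (1 - eps_t) if stop == "7.21a" else (1 + eps_t)

    m, n_k = m0, 1
    for k in range(max_iter):
        # Fitting: one update based on n_k random combinations of the data.
        m_new = fit_step(m, draw(n_k))

        # Cross validation: same draw for the old and the new iterate.
        W_c = draw(n_c)
        if phi_hat(m_new, W_c) <= kappa * cv_factor * phi_hat(m, W_c):
            # Uncertainty check, then the stopping criterion proper.
            if phi_hat(m_new, draw(n_u)) <= uc_factor * rho:
                if phi_hat(m_new, draw(n_t)) <= stop_factor * rho:
                    return m_new, k + 1, n_k
            # n_{k+1} = n_k: sample size unchanged.
        else:
            # Sample size increase.
            n_k = min(2 * n_k, s)
        m = m_new
    return m, max_iter, n_k
```

For a quick sanity check one can pass toy callables, for instance a `fit_step` that halves a scalar model and a `phi_hat` that returns its square; a `fit_step` that never improves the model drives `n_k` up to `s`.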
(E.1) in our simpler model a target object with conductivity $\mu_t = 1$ has been placed in a background medium with conductivity $\mu_b = 0.1$ (see Figure 7.4(a)); and

(E.2) in a slightly more complex setting a conductive object with conductivity $\mu_c = 0.01$, as well as a resistive one with conductivity $\mu_r = 1$, have been placed in a background medium with conductivity $\mu_b = 0.1$ (see Figure 7.6(a)). Note that the recovery of the model in Example (E.2) is more challenging than Example (E.1) since here the dynamic range of the conductivity is much larger.

Details of the numerical setup for the following examples are given in Section 3.3.2.

Example (E.1)

We carry out the 8 variants of Algorithm 4 for the parameter values $(\varepsilon_c, \delta_c) = (0.05, 0.3)$, $(\varepsilon_u, \delta_u) = (0.1, 0.3)$, $(\varepsilon_t, \delta_t) = (0.1, 0.1)$, and $\kappa = 1$. The resulting total count of PDE solves, which is the main computational cost of the iterative solution of such inverse problems, is reported in Tables 7.1 and 7.2. As a point of reference, we also include the total PDE count using the “plain vanilla” stabilized Gauss-Newton method which employs the entire set of $s$ experiments at every iteration and misfit estimation task. The recovered conductivities are displayed in Figures 7.5 and 7.7, demonstrating that employing Algorithm 4 can drastically reduce the total work while obtaining equally acceptable reconstructions.

Vanilla    (i)     (ii)    (iii)   (iv)    (v)     (vi)    (vii)   (viii)
436,590    4,058   4,028   3,764   3,282   4,597   3,850   3,734   3,321

Table 7.1: Example (E.1). Work in terms of number of PDE solves for all variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The “vanilla” count is also given, as a reference.

For the calculations displayed here we have employed dynamical regularization [45, 132].
In this method there is no explicit regularization term $R(m)$ in (1.7) and the regularization is done implicitly and iteratively.

Figure 7.4: Example (E.1). Plots of log-conductivity: (a) True model; (b) Vanilla recovery with $s = 3{,}969$; (c) Vanilla recovery with $s = 49$. The vanilla recovery using only 49 measurement sets is clearly inferior, showing that a large number of measurement sets can be crucial for better reconstructions.

Figure 7.5: Example (E.1). Plots of log-conductivity of the recovered model using the 8 variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The quality of reconstructions is generally comparable to that of plain vanilla with $s = 3{,}969$ and across variants.

The quality of reconstructions obtained by the various variants in Figure 7.5 is comparable to that of the “vanilla” with $s = 3{,}969$ in Figure 7.4(b). In contrast, employing only $s = 49$ data sets corresponding to similar experiments distributed over a coarser grid yields an inferior reconstruction in Figure 7.4(c). The cost of this latter run is 5,684 PDE solves, which is more expensive than our randomized algorithms for the much larger $s$. Furthermore, comparing Figures 7.4(b) and 7.5 to Figures 4.3 and 4.4 in Chapter 4, which show similar results for $s = 961$ data sets, we again see a relative improvement in reconstruction quality. All of this goes to show that a large number of measurements $s$ can be crucial for better reconstructions. Thus, it is not the case that one can dispense with a large portion of the measurements and still expect the same quality reconstructions. Hence, it is indeed useful to have algorithms such as Algorithms 1, 2, 3, or 4 that, while taking advantage of the entire available data, can efficiently carry out the computations and yet obtain credible reconstructions.
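As a small arithmetic aside, the work counts of Table 7.1 can be turned into speedup factors relative to the vanilla run. The snippet below only restates the numbers reported above; the variable names are ours.

```python
# PDE solve counts copied from Table 7.1 (Example (E.1)).
vanilla_full = 436590          # vanilla Gauss-Newton, s = 3,969
vanilla_coarse = 5684          # vanilla Gauss-Newton, s = 49
variants = {
    "i": 4058, "ii": 4028, "iii": 3764, "iv": 3282,
    "v": 4597, "vi": 3850, "vii": 3734, "viii": 3321,
}

# Ratio of vanilla work to the work of each variant of Algorithm 4.
speedups = {name: vanilla_full / cost for name, cost in variants.items()}

for name, ratio in speedups.items():
    print(f"variant ({name}): about {ratio:.0f} times fewer PDE solves")

# Every variant on the full data set is also cheaper than vanilla with s = 49.
cheaper_than_coarse = all(cost < vanilla_coarse for cost in variants.values())
print("all variants cheaper than the s = 49 vanilla run:", cheaper_than_coarse)
```

The ratios lie roughly between 95 and 133, which is the sense in which the savings amount to about two orders of magnitude.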
We have resisted the temptation to make comparisons between values of $\phi(m_{k+1})$ and $\hat{\phi}(m_{k+1})$ for various iterates. There are two major reasons for that. The first is that $\hat{\phi}$ values in bounds such as (7.14), (7.16), (7.18), (7.20) and (7.21) are different and are always compared against tolerances in context that are based on noise estimates. In addition, the sample sizes that we used for uncertainty check and stopping criteria, since they are given by Theorems 7.1 and 7.2, already determine how far the estimated misfit is from the true misfit. The second (and more important) reason is that in such a highly diffusive forward problem as DC resistivity, misfit values are typically far closer to one another than the resulting reconstructed models $m$ are. A good misfit is merely a necessary condition, which can fall significantly short of being sufficient, for a good reconstruction; see [69] and Chapter 4.

Example (E.2)

Here we have imposed prior knowledge on the “discontinuous” model in the form of total variation (TV) regularization [34, 38, 47]. Specifically, $R(m)$ in (1.7) is the discretization of the TV functional $\int_\Omega |\nabla m(x)|$. For implementation details of the TV functional see Appendix A.5. For each recovery, the regularization parameter, $\alpha$, has been chosen by trial and error within the range $[10^{-6}, 10^{-3}]$ to visually yield the best quality recovery.

Vanilla    (i)     (ii)    (iii)   (iv)    (v)     (vi)    (vii)   (viii)
476,280    5,631   5,057   5,011   3,990   6,364   4,618   4,344   4,195

Table 7.2: Example (E.2). Work in terms of number of PDE solves for all variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The “vanilla” count is also given, as a reference.

Table 7.2 and Figures 7.6 and 7.7 tell a similar story as in Example (E.1). The quality of reconstructions with $s = 3{,}969$ by the various variants, displayed in Figure 7.7, is comparable to that of the “vanilla” version in Figure 7.6(b), yet is obtained at only a fraction (about 1%) of the cost.
The “vanilla” solution for $s = 49$, displayed in Figure 7.6(c), costs 5,978 PDE solves, which again is a higher cost for an inferior reconstruction compared to our Algorithm 4.

It is clear from Tables 7.1 and 7.2 that for most of these examples, variants (i)–(iv), which use the more aggressive cross validation (7.14), are at least as efficient as their respective counterparts, namely, variants (v)–(viii), which use (7.16). This suggests that, sometimes, a more aggressive sample size increase strategy may be a better option; see also the numerical examples in Chapter 3. Notice that for all variants, the entire cost of the algorithm is comparable to one single evaluation of the misfit function $\phi(m)$ using the full data set!

Figure 7.6: Example (E.2). Plots of log-conductivity: (a) True model; (b) Vanilla recovery with $s = 3{,}969$; (c) Vanilla recovery with $s = 49$. The vanilla recovery using only 49 measurement sets is clearly inferior, showing that a large number of measurement sets can be crucial for better reconstructions.

Figure 7.7: Example (E.2). Plots of log-conductivity of the recovered model using the 8 variants of Algorithm 4, described in Section 7.2.3 and indicated here by (i)–(viii). The quality of reconstructions is generally comparable to each other and that of plain vanilla with $s = 3{,}969$.

Figure 7.8: Example (E.2). Growth of the fitting sample size, $n_k$, as a function of the iteration $k$, upon using cross validation strategies (7.14) and (7.16). The graph shows the fitting sample size growth for variants (ii) and (iv) of Algorithm 4, as well as their counterparts, namely, variants (vi) and (viii). Observe that for variants (ii) and (iv), where (7.14) is used, the fitting sample size grows at a more aggressive rate than for variants (vi) and (viii), where (7.16) is used.

7.4 Conclusions

In this chapter, we have proved tight necessary and sufficient conditions for the sample size, $n$, required to reach, with a probability of at least $1-\delta$, (one-sided) approximations, using the Gaussian estimator, for $\mathrm{tr}(A)$ to within a relative tolerance $\varepsilon$. All of the sufficient conditions are computable in practice and do not assume any a priori knowledge about the matrix. If the rank of the matrix is known then the necessary bounds can also be computed in practice.

Subsequently, using these conditions, we have presented eight variants of a general-purpose algorithm for solving an important class of large scale non-linear least squares problems. These algorithms can be viewed as an extended version of those in Chapters 3 and 4, where the uncertainty in most of the stochastic steps is quantified. Such uncertainty quantification allows one to have better control over the behavior of the algorithm and have more confidence in the recovered solution. The resulting algorithm is presented in Section 7.2.3.

Furthermore, we have demonstrated the performance of our algorithm using an important class of problems which arise often in practice, namely, PDE inverse problems with many measurements. By examining our algorithm in the context of the DC resistivity problem as an instance of such a class of problems, we have shown that Algorithm 4 can recover solutions with remarkable efficiency. This efficiency is comparable to similar heuristic algorithms proposed in Chapters 3 and 4. The added advantage here is that with the uncertainty being quantified, the user can have more confidence in the approximate solution obtained by our algorithms.

Tables 7.1 and 7.2 show the amount of work (in PDE solves) of the 8 variants of our algorithm. Compared to a similar algorithm which uses the entire data set, an efficiency improvement by two orders of magnitude is observed.
For most of the examples considered, the same tables also show that the more aggressive cross validation strategy (7.14) is at least as efficient as the more relaxed strategy (7.16). A thorough comparison of the behavior of these cross validation strategies (and all of the variants, in general) on different examples and model problems is left for future work.

Chapter 8

Algorithms That Satisfy a Stopping Criterion, Probably

Iterative numerical algorithms are typically equipped with a stopping criterion, where the iteration process is terminated when some error or misfit measure is deemed to be below a given tolerance. This is a useful setting for comparing algorithm performance, among other purposes. However, in practical applications a precise value for such a tolerance is rarely known; rather, only some possibly vague idea of the desired quality of the numerical approximation is at hand. We discuss three case studies from different areas of numerical computation, where uncertainty in the error tolerance value and in the stopping criterion is revealed in different ways. This leads us to think of approaches to relax the notion of exactly satisfying a tolerance value.

We then concentrate on a probabilistic relaxation of the given tolerance. Relaxing the notion of an error tolerance in such a way allows the development of theory towards an uncertainty quantification of Monte Carlo methods (e.g., [2, 22, 84, 87, 138]). For example, this allows derivation of proven bounds on the sample size of certain Monte Carlo methods, as in Chapters 5 and 7. Such error relaxation was introduced in Chapter 7 and was incorporated in Algorithm 4. We show that Algorithm 4 becomes more efficient in a controlled way as the uncertainty in the tolerance increases, and we demonstrate this in the context of a class of inverse problems discussed in Section 1.1.2.

8.1 Introduction

A typical iterative algorithm in numerical analysis and scientific computing requires a stopping criterion.
Such an algorithm involves a sequence of generated iterates or steps, an error tolerance, and a method to compute (or estimate) some quantity related to the error. If this error quantity is below the tolerance then the iterative procedure is stopped and success is declared.

The actual manner in which the error in an iterate is estimated can vary all the way from being rather complex to being as simple as the normed difference between two consecutive iterates. Further, the “tolerance” may actually be a set of values involving combinations of absolute and relative error tolerances. There are several fine points to this, often application-dependent, that are typically incorporated in mathematical software packages (see for instance Matlab's various packages for solving ordinary differential equation (ODE) or optimization problems). That makes some authors of introductory texts devote significant attention to the issue, while others attempt to ignore it as much as possible (cf. [14, 40, 80]). Let us choose here the middle way of considering a stopping criterion in a general form

\[ \text{error\_estimate}(k) \le \rho, \tag{8.1} \]

where $k$ is the iteration or step counter, and $\rho > 0$ is the tolerance, assumed given.

But now we ask, is $\rho$ really given?! Related to this, we can also ask, to what extent is the stopping criterion adequate?

• The numerical analyst would certainly like $\rho$ to be given. That is because their job is to invent new algorithms, prove various assertions regarding convergence, stability, efficiency, and so on, and compare the new algorithm to other known ones for a similar task. For the latter aspect, a rigid deterministic tolerance for a trustworthy error estimate is indispensable.

Indeed, in research areas such as image processing where criteria of the form (8.1) do not seem to capture certain essential features and the “eye norm” rules, a good comparison between competing algorithms can be far more delicate.
Moreover, accurate comparisons of algorithms that require stochastic input can be tricky in terms of reproducing claimed experimental results.

• On the other hand, a practitioner who is the customer of numerical algorithms, applying them in the context of some complicated practical application that needs to be solved, will more often than not find it very hard to justify a particular choice of a precise value for $\rho$ in (8.1).

Our first task in what follows is to convince the reader that often in practice there is a significant uncertainty in the actual selection of a meaningful value for the error tolerance $\rho$, a value that must be satisfied. Furthermore, numerical analysts are also subconsciously aware of this fact of life, even though in most numerical analysis papers such a value is simply given, if at all, in the numerical examples section. Three typical yet different classes of problems and methods are considered in Section 8.2.

Once we are all convinced that there is usually a considerable uncertainty in the value of $\rho$ (hence, we only know it “probably”), the next question is what to do with this notion. The answer varies, depending on the particular application and the situation at hand. In some cases, such as that of Section 8.2.1, the effective advice is to be more cautious, as mishaps can happen. In others, such as that of Section 8.2.2, we are simply led to acknowledge that the value of $\rho$ may come from thin air (though one then concentrates on other aspects). But there are yet other classes of applications and algorithms, such as in Section 8.2.3, for which it makes sense to attempt to quantify the uncertainty in the error tolerance $\rho$ using a probabilistic framework. We are not proposing here to propagate an entire probability distribution for $\rho$: that would be excessive in most situations.
But we do show, by studying an instance extended to a wide class of problems, that employing such a framework can be practical and profitable.

Following Section 8.2.3 we therefore consider in Section 8.3 a particular manner of relaxing the notion of a deterministic error tolerance, introduced in Chapter 7, by allowing an estimate such as (8.1) to hold only within some given probability. Some numerical examples are given to illustrate these ideas. Conclusions and some additional general comments are offered in Section 8.4.

8.2 Case Studies

In this section we consider three classes of problems and associated algorithms, in an attempt to highlight the use of different tests of the form (8.1) and in particular the implied level of uncertainty in the choice of $\rho$.

8.2.1 Stopping Criterion in Initial Value ODE Solvers

Using a case study, we show in this section that numerical analysts, too, can be quick to not consider $\rho$ as a “holy constant”: we adapt to weaker conditions in different ways, depending on the situation and the advantage to be gained in relaxing the notion of an error tolerance.

Let us consider an initial value ODE system in “time” $t$, written as

\[ \frac{du}{dt} = f(t, u), \quad 0 \le t \le b, \tag{8.2a} \]
\[ u(0) = v_0, \tag{8.2b} \]

with $v_0$ a given initial value vector. A typical adaptive algorithm proceeds to generate pairs $(t_i, v_i)$, $i = 0, 1, 2, \ldots, N$, in $N$ consecutive steps, thus forming a mesh $\pi$ such that

\[ \pi : 0 = t_0 < t_1 < \cdots < t_{N-1} < t_N = b, \]

and $v_i \approx u(t_i)$, $i = 1, \ldots, N$.

Denoting the numerical solution on the mesh $\pi$ by $v_\pi$, and the restriction of the exact ODE solution to this mesh by $u_\pi$, there are two general approaches for controlling the error in such an approximation.

• Given a tolerance value $\rho$, keep estimating the global error and refining the mesh (i.e., the gamut of step sizes) until roughly

\[ \|v_\pi - u_\pi\|_\infty \le \rho. \tag{8.3} \]

Details of such methods can be found, for instance, in [17, 37, 74, 82].

In (8.3) we could replace the absolute tolerance by a combination of absolute and relative tolerances, perhaps even different ones for different ODE equations. But that aspect is not what we concentrate on in this chapter.

• However, most general-purpose ODE codes estimate a local error measure for (8.1) instead, and refine the step size locally. Such a procedure advances one step at a time, and estimates the next step size using local information related to the local truncation error, or simply the difference between two approximate solutions for the next time level, one of which is presumed to be significantly more accurate than the other (see footnote 9). For details see [17, 74, 75] and many references therein. In particular, the popular Matlab codes ode45 and ode23s use such a local error control.

The reason for employing local error control is that this allows for developing a much cheaper and yet more sensitive adaptive procedure, an advantage that cannot be had, for instance, for general boundary value ODE problems; see, e.g., [16].

But does this always produce sensible results?! The answer to this question is negative. A simple example to the contrary is the problem

\[ \frac{du}{dt} = 100\,(u - \sin t) + \cos t, \quad u(0) = 0, \quad b = 1. \]

Local truncation (or discretization) errors for this unstable initial value ODE propagate like $\exp(100t)$, a fact that is not reflected in the local behaviour of the exact solution $u(t) = \sin t$ on which the local error control is based. Thus, we may have a large error $\|v_\pi - u_\pi\|_\infty$ even if the local error estimate is bounded by $\rho$ for a small value of $\rho$.

Local error control can be dangerous even for a stable ODE system

Still one can ask, are we safe with local error control in case that we know that our ODE problem is stable? Here, by “safe” we mean that the global error will not be much larger than the local truncation error in scaled form.
The answer to this more subtle question turns out to be negative as well. The essential point is that the global error consists of an accumulation of contributions of local errors from previous time steps. If the ODE problem is asymptotically stable (typically, because it describes a damped motion) then local error contributions die away as time increases, often exponentially fast, so at some fixed time only the most recent local error contributions dominate in the sum of contributions that forms the global error. However, if the initial value ODE problem is merely marginally stable (which is the case for Hamiltonian systems) then local error contributions propagate undamped, and their accumulation over many time steps can therefore be significantly larger than just one or a few such errors (see footnote 10).

Footnote 9: Recall that the local truncation error at some time $t = t_i$ is the amount by which the exact solution $u_\pi$ fails to satisfy the scheme that defines $v_\pi$ at this point. Furthermore, if at $t_i$, using the known $v_i$ and a guess for $t_{i+1}$, we apply one step of two different Runge-Kutta methods of orders 4 and 5, say, then the difference of the two results at $t_{i+1}$ gives an estimate for the error in the lower order method over this mesh subinterval.

For a simple concrete example, consider applying ode45 with default tolerances to find the linear oscillator with a slowly varying frequency that satisfies the following initial value ODE for $p(t)$:

\[ \frac{dq}{dt} = \lambda^2 p, \quad q(0) = 1, \]
\[ \frac{dp}{dt} = -(1+t)^2 q, \quad p(0) = 0. \]

Here $\lambda > 0$ is a given parameter. Thus, $u = (q, p)^T$ in the notation of (8.2). This is a Hamiltonian system, with the Hamiltonian function given by

\[ H(q, p, t) = \frac{1}{2}\left[ \big((1+t)\,q\big)^2 + (\lambda p)^2 \right]. \]

Now, since the ODE is not autonomous, the Hamiltonian is not constant in time. However, the adiabatic invariant

\[ J(q, p, t) = H(q, p, t)/(1+t) \]

(see, e.g., [18, 97]) is almost constant for large $\lambda$, satisfying

\[ [J(t) - J(0)]/J(0) = O(\lambda^{-1}) \]

over the interval $[0, 1]$.
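The quantities just defined are easy to experiment with. The following sketch integrates the system with a fixed-step classical fourth-order Runge-Kutta method and records the largest relative drift of $J$. It is an illustration only: it does not use ode45 or any adaptive local error control, and the function names and the step count are our own choices, picked so that the step is small compared with the oscillation period for $\lambda = 1000$.

```python
def rhs(t, q, p, lam):
    """Right hand side of dq/dt = lam^2 p, dp/dt = -(1 + t)^2 q."""
    return lam * lam * p, -((1.0 + t) ** 2) * q


def adiabatic_invariant(t, q, p, lam):
    """J = H / (1 + t) with H = 0.5 * (((1 + t) q)^2 + (lam p)^2)."""
    hamiltonian = 0.5 * (((1.0 + t) * q) ** 2 + (lam * p) ** 2)
    return hamiltonian / (1.0 + t)


def max_relative_drift(lam=1000.0, steps=50000):
    """Integrate on [0, 1] with fixed-step RK4; return max |J(t) - J(0)| / J(0)."""
    h = 1.0 / steps
    t, q, p = 0.0, 1.0, 0.0
    j0 = adiabatic_invariant(t, q, p, lam)
    drift = 0.0
    for i in range(steps):
        k1q, k1p = rhs(t, q, p, lam)
        k2q, k2p = rhs(t + 0.5 * h, q + 0.5 * h * k1q, p + 0.5 * h * k1p, lam)
        k3q, k3p = rhs(t + 0.5 * h, q + 0.5 * h * k2q, p + 0.5 * h * k2p, lam)
        k4q, k4p = rhs(t + h, q + h * k3q, p + h * k3p, lam)
        q += h * (k1q + 2.0 * k2q + 2.0 * k3q + k4q) / 6.0
        p += h * (k1p + 2.0 * k2p + 2.0 * k3p + k4p) / 6.0
        t = (i + 1) * h
        drift = max(drift, abs(adiabatic_invariant(t, q, p, lam) - j0) / j0)
    return drift


if __name__ == "__main__":
    print("max relative drift of J on [0, 1]:", max_relative_drift())
```

With a sufficiently fine step the computed drift can be compared directly against the $O(\lambda^{-1})$ condition on $J$ stated above.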
This condition means in particular that for $\lambda \gg 1$ and the initial values given above, $J(1) = J(0) + O(\lambda^{-1}) \approx J(0)$.

Footnote 10: The local error control basically seeks to equalize the magnitude of such local errors at different time steps.

Figure 8.1 depicts two curves approximating the adiabatic invariant for $\lambda = 1000$. Displayed are the calculated curve using ode45 with default tolerances (absolute=1.e-6, relative=1.e-3), as well as what is obtained upon using ode45 with the stricter relative tolerance RelTol=1.e-6. From the figure it is clear that when using the looser tolerance, the resulting approximation for $J(1)$ differs from $J(0)$ by far more than what $\lambda^{-1} = $ 1.e-3 and RelTol=1.e-3 would indicate, while the stricter tolerance gives a qualitatively correct result, using the “eye norm”. Annoyingly, the qualitatively incorrect result does not look like “noise”: while not being physical, it looks downright plausible, and hence could be misleading for an unsuspecting user. Adding to the pain is the fact that this occurs for default tolerance values, an option that a vast majority of users would automatically select.

Figure 8.1: Adiabatic invariant approximations obtained using Matlab's package ode45 with default tolerances (solid blue) and stricter tolerances (dashed magenta).

A similar observation holds when trying to approximate the phase portrait or other properties of an autonomous Hamiltonian ODE system over a long time interval using ode45 with default tolerances: this may produce qualitatively wrong results. See for instance Figures 16.12 and 16.13 in [14]: the Fermi-Pasta-Ulam problem solved there is described in detail in Chapter 1 of [73]. What we have just shown here is that the phenomenon can arise also for a very modest system of two linear ODEs that do not satisfy any exact invariant.

We hasten to add that the documentation of ode45 (or other such codes) does not propose to deliver anything like (8.3).
Rather, the tolerance is just a sort of a knob that is turned to control local error size. However, this does not explain the popularity of such codes despite their limited offers of assurance in terms of qualitatively correct results.

Our key point in the present section is the following: we propose that one reason for the popularity of ODE codes that use only local error control is that in applications one rarely knows a precise value for $\rho$ as used in (8.3) anyway. (Conversely, if such a global error tolerance value is known and is important then codes employing a global error control, and not ode45, should be used.) Opting for local error control over global error control can therefore be seen as one specific way of adjusting mathematical software in a deterministic sense to realistic uncertainties regarding the desired accuracy.

8.2.2 Stopping Criterion in Iterative Methods for Linear Systems

In this case study, extending basic textbook material, we argue not only that tolerance values used by numerical analysts are often determined solely for the purpose of the comparison of methods (rather than arising from an actual application), but also that this can have unexpected effects on such comparisons.

Consider the problem of finding $u$ satisfying

\[ A u = b, \tag{8.4} \]

where $A$ is a given $s \times s$ symmetric positive definite matrix such that one can efficiently carry out matrix-vector products $Av$ for any suitable vector $v$, but decomposing the matrix directly (and occasionally, even looking at its elements) is too inefficient and as such is “prohibited”. We relate to such a matrix as being given implicitly. The right hand side vector $b$ is given as well.

An iterative method for solving (8.4) generates a sequence of iterates $u_1, u_2, \ldots, u_k, \ldots$ for a given initial guess $u_0$. Denote by $r_k = b - A u_k$ the residual in the $k$th iterate.
The MINRES method, or its simpler version Orthomin(2), can be applied to reduce the residual norm so that

\[ \|r_k\|_2 \le \rho\,\|r_0\|_2 \tag{8.5} \]

in a number of iterations $k$ that in general is at worst $O(\sqrt{\kappa(A)})$, where $\kappa(A) = \|A\|_2 \|A^{-1}\|_2$ is the condition number of the matrix $A$. Below in Table 8.1 we refer to this method as MR. The more popular conjugate gradient (CG) method generally performs comparably in practice. We refer to [67] for the precise statements of convergence bounds and their proofs.

A well-known and simpler-looking family of gradient descent methods is given by

\[ u_{k+1} = u_k + \alpha_k r_k, \tag{8.6} \]

where the scalar $\alpha_k > 0$ is the step size. Such methods have recently come under intense scrutiny because of applications in stochastic programming and sparse solution recovery. Thus, it makes sense to evaluate and understand them in the simplest context of (8.4), even though it is commonly agreed that for the strict purpose of solving (8.4) iteratively, CG cannot be significantly beaten. Note that (8.6) can be viewed as forward Euler for the artificial time ODE

\[ \frac{du}{dt} = -A u + b, \tag{8.7} \]

with “time” step size $\alpha_k$. Next we consider two choices of this step size.

The steepest descent (SD) variant of (8.6) is obtained by the greedy (exact) line search for the function

\[ f(u) = \frac{1}{2} u^T A u - b^T u, \]

which gives

\[ \alpha_k = \alpha_k^{SD} = \frac{r_k^T r_k}{r_k^T A r_k} \equiv \frac{(r_k, r_k)}{(r_k, A r_k)} \equiv \frac{\|r_k\|_2^2}{\|r_k\|_A^2}. \]

However, SD is very slow, requiring $k$ in (8.5) to be proportional to $\kappa(A)$; see, e.g., [3] (and footnote 11).

Footnote 11: The precise statement of error bounds for CG and SD in terms of the error $e_k = u - u_k$ uses the $A$-norm, or “energy norm”, and reads

\[ \|e_k\|_A \le 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^k \|e_0\|_A, \quad \text{for CG}, \]
\[ \|e_k\|_A \le \left( \frac{\kappa(A) - 1}{\kappa(A) + 1} \right)^k \|e_0\|_A, \quad \text{for SD}. \]

See [67].

A more enigmatic choice in (8.6) is the lagged steepest descent (LSD) step size

\[ \alpha_k = \alpha_k^{LSD} = \frac{(r_{k-1}, r_{k-1})}{(r_{k-1}, A r_{k-1})}. \]

It was first proposed in [24] and practically used for instance in [27, 42]. To the best of our knowledge, there is no known a priori bound on how many iterations as a function of $\kappa(A)$ are
required to satisfy (8.5) with this method [24, 45, 59, 112].

We next compare these four methods in a typical fashion for a typical PDE example, where we consider the model Poisson problem

$$-\Delta u = 1, \qquad 0 < x, y < 1,$$

subject to homogeneous Dirichlet BC, and discretized by the usual 5-point difference scheme on a $\sqrt{s} \times \sqrt{s}$ uniform mesh. Denote the reshaped vector of mesh unknowns by $u \in \mathbb{R}^s$. The largest eigenvalue of the resulting matrix $A$ in (8.4) is $\lambda_{\max} = 4h^{-2}(1 + \cos(\pi h))$, and the smallest is $\lambda_{\min} = 4h^{-2}(1 - \cos(\pi h))$, where $h = 1/(\sqrt{s} + 1)$. Hence by Taylor expansion of $\cos(\pi h)$, for $h \ll 1$ the condition number is essentially proportional to $s$:

$$\kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}} \approx \left(\frac{2}{\pi}\right)^2 s.$$

In Table 8.1 we list iteration counts required to satisfy (8.5) with $\rho = 10^{-7}$, starting with $u_0 = 0$.

    s        MR     CG     SD        LSD
    7^2      9      9      196       45
    15^2     26     26     820       91
    31^2     54     55     3,337     261
    63^2     107    109    13,427    632
    127^2    212    216    53,800    1,249

Table 8.1: Iteration counts required to satisfy (8.5) for the Poisson problem with tolerance $\rho = 10^{-7}$ and different mesh sizes $s$.

But now, returning to the topic of the present chapter, we ask, why insist on $\rho = 10^{-7}$? Indeed, the usual observation that one draws from the columns of values for MR, CG and SD in a table such as Table 8.1, is that the first two grow like $\sqrt{\kappa(A)} \propto \sqrt{s}$ while the latter grows like $\kappa(A) \propto s$. The value of $\rho$, so long as it is not too large, does not matter at all!

Figure 8.2: Relative residuals and step sizes for solving the model Poisson problem using LSD on a 15 × 15 mesh: (a) residuals; (b) step sizes. The red line in (b) is the forward Euler stability limit.

And yet, this is not quite the case for the LSD iteration counts. These do not grow in the same orderly fashion as the others, even though they are far better (in the sense of being significantly smaller) than those for SD. Indeed, this method is chaotic [45], and the residual norm decreases rather non-monotonically; see Figure 8.2(a).
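A comparison of this kind can be sketched in a few lines. The following minimal Python sketch is an illustration only and is not the code used to produce Table 8.1: it applies the 5-point matrix $A$ of the model Poisson problem in a matrix-free way and counts the iterations that CG, SD and LSD need to satisfy (8.5). The function names, the use of the SD step for the very first LSD iteration (where no lagged residual exists yet) and the iteration caps are assumptions of this sketch; MR is omitted.

```python
import math


def apply_A(v, n):
    """Return A v for the 5-point Laplacian on an n x n interior mesh, h = 1/(n+1)."""
    h2inv = float((n + 1) ** 2)
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            k = i * n + j
            acc = 4.0 * v[k]
            if i > 0:
                acc -= v[k - n]
            if i < n - 1:
                acc -= v[k + n]
            if j > 0:
                acc -= v[k - 1]
            if j < n - 1:
                acc -= v[k + 1]
            out[k] = h2inv * acc
    return out


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gradient_descent(n, rho, lagged, max_iter=200000):
    """Iterate (8.6) with the SD step (lagged=False) or the LSD step (lagged=True)."""
    u = [0.0] * (n * n)              # initial guess u_0 = 0
    r = [1.0] * (n * n)              # r_0 = b - A u_0 with b = 1
    target = rho * math.sqrt(dot(r, r))
    alpha_prev = None                # SD step size computed at the previous iterate
    for k in range(1, max_iter + 1):
        Ar = apply_A(r, n)
        alpha_sd = dot(r, r) / dot(r, Ar)
        # LSD reuses the previous SD step; the first step falls back to SD (assumption).
        alpha = alpha_prev if (lagged and alpha_prev is not None) else alpha_sd
        alpha_prev = alpha_sd
        u = [ui + alpha * ri for ui, ri in zip(u, r)]
        r = [ri - alpha * ari for ri, ari in zip(r, Ar)]
        if math.sqrt(dot(r, r)) <= target:   # first k for which (8.5) holds
            return k, u
    return max_iter, u


def conjugate_gradient(n, rho, max_iter=10000):
    """Standard CG for A u = b with b = 1 and u_0 = 0, stopped by (8.5)."""
    u = [0.0] * (n * n)
    r = [1.0] * (n * n)
    p = r[:]
    rr = dot(r, r)
    target = rho * math.sqrt(rr)
    for k in range(1, max_iter + 1):
        Ap = apply_A(p, n)
        alpha = rr / dot(p, Ap)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= target:
            return k, u
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return max_iter, u


def iteration_counts(n, rho=1e-7):
    """Iteration counts for an n x n mesh, i.e. s = n^2 unknowns."""
    return {
        "CG": conjugate_gradient(n, rho)[0],
        "SD": gradient_descent(n, rho, lagged=False)[0],
        "LSD": gradient_descent(n, rho, lagged=True)[0],
    }
```

Calling `iteration_counts(7)` gives the counts for $s = 7^2$. Counts obtained this way need not reproduce Table 8.1 exactly, since implementation details may differ. Each loop returns at the first $k$ for which (8.5) holds, which matters for LSD because its residual norm need not decrease monotonically.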
Thus, the iteration counts in Table 8.1 correspond to the iteration number $k = k^*$ where the rough-looking relative residual norm first records a value below the tolerance $\rho$. Unlike the other three methods, here the particular value of the tolerance, picked out of nowhere, does play an unwanted role in the relative values, as a function of $s$, or $\kappa(A)$, of the listed iteration counts.

8.2.3 Data Fitting and Inverse Problems

In the previous two case studies we have encountered cases where the intuitive use of an error tolerance within a stopping criterion could differ widely (and wildly) from the notion that is embodied in (8.1) for the consumer of numerical analysts’ products. We next consider a family of problems where the value of $\rho$ in a particular criterion (8.1) is more directly relevant.

Suppose we are given observed data $d \in \mathbb{R}^l$ and a forward operator $f_i(m)$, $i = 1, \ldots, l$, which provides predicted data for each instance of a distributed parameter function $m$. The (unknown) function $m$ is defined in some domain $\Omega$ in physical space and possibly time. We are particularly interested here in problems where $f$ involves the solution $u$ in $\Omega$ of some linear PDE system, sampled in some way at the points where the observed data are provided; see Section 1.1.2. Further, for a given mesh $\pi$ discretizing $\Omega$, we consider a corresponding discretization (i.e., nodal representation) of $m$ and $u$, as well as the differential operator. Reshaping these mesh functions into vectors we can write the resulting approximation of the forward operator as (1.5),
namely

$$f(m, q) = Pu = P L^{-1}(m) q, \qquad (8.8)$$

where the right hand side vector $q$ is commonly referred to as a source, $L$ is a square matrix discretizing the PDE operator plus appropriate side conditions, $u = L^{-1}(m) q$ is the field (i.e., the PDE solution, here an interim quantity), and $P$ is a projection matrix that projects the field to the locations where the data values $d$ are given.

This setup is typical in the thriving research area of inverse problems; see, e.g., [52, 135]. A specific example is provided in Section 8.3.

The inverse problem is to find $m$ such that the predicted and observed data agree to within noise $\eta$: ideally,

$$d = f(m, q) + \eta. \qquad (8.9)$$

To obtain such a model $m$ that satisfies (8.9) we need to estimate the misfit function $\phi(m)$, i.e., the normed difference between observed data $d$ and predicted data $f(m)$. An iterative algorithm is then designed to sufficiently reduce this misfit function. But which norm should we use to define the misfit function?

It is customary to conveniently assume that the noise satisfies $\eta \sim N(0, \sigma^2 I)$, i.e., that the noise is normally distributed with a scaled identity for the covariance matrix, where $\sigma$ is the standard deviation. Then the maximum likelihood (ML) data misfit function is simply the squared $\ell_2$-norm¹²

$$\phi(m) = \|f(m) - d\|_2^2. \qquad (8.10)$$

In this case, the celebrated Morozov discrepancy principle yields the stopping criterion

$$\phi(m) \le \rho, \quad \text{where } \rho = \sigma^2 l, \qquad (8.11)$$

see, e.g., [52, 91, 107]. So, here is a class of problems where we do have a meaningful and directly usable tolerance value!

¹² For a more general symmetric positive definite covariance matrix $\Sigma$, such that $\eta \sim N(0, \Sigma)$, we get weighted least squares, or an “energy norm”, with the weight matrix $\Sigma^{-1}$ for $\phi$. But let’s not go there in this chapter.

Assuming that a known tolerance $\rho$ must be satisfied as in (8.11) is often too rigid in practice, because realistic data do not quite satisfy the assumptions that have led to (8.11) and (8.10).
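For reference, a stopping rule of the form (8.11) can be sketched as follows. This is a minimal illustration under stated assumptions, not a method from this thesis: the forward operator is a hypothetical toy (a symmetric three-point smoothing stencil standing in for $f$), the iteration is a plain Landweber (gradient descent) step, and the names `forward`, `misfit`, `fit_until_discrepancy` and `make_data` are invented for the example.

```python
import math
import random


def forward(m):
    """Toy forward operator (assumption): three-point smoothing, zero boundary values."""
    n = len(m)
    out = []
    for i in range(n):
        left = m[i - 1] if i > 0 else 0.0
        right = m[i + 1] if i < n - 1 else 0.0
        out.append(0.25 * left + 0.5 * m[i] + 0.25 * right)
    return out


def misfit(m, d):
    """Squared l2 data misfit phi(m), as in (8.10)."""
    return sum((f - di) ** 2 for f, di in zip(forward(m), d))


def fit_until_discrepancy(d, sigma, max_iter=10000):
    """Iterate until phi(m) <= rho = sigma^2 * l, the criterion (8.11)."""
    rho = sigma ** 2 * len(d)
    m = [0.0] * len(d)
    for k in range(max_iter + 1):
        phi = misfit(m, d)
        if phi <= rho:
            return m, k, phi
        r = [di - f for di, f in zip(d, forward(m))]
        # The toy operator is symmetric with norm below 1, so applying it to the
        # residual gives a Landweber step with unit step size.
        m = [mi + gi for mi, gi in zip(m, forward(r))]
    return m, max_iter, misfit(m, d)


def make_data(l=20, sigma=0.01, seed=0):
    """Synthetic data: a smooth bump pushed through the toy operator, plus noise."""
    rng = random.Random(seed)
    m_true = [math.exp(-((i - l / 2) / 3.0) ** 2) for i in range(l)]
    return [f + rng.gauss(0.0, sigma) for f in forward(m_true)]
```

For example, `fit_until_discrepancy(make_data(20, 0.01), 0.01)` returns the iterate, the iteration index at which (8.11) was first met, and the corresponding misfit. The rule only makes sense if the assumed $\sigma$ matches the noise actually present in the data.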
Well-known techniques such as L-curve and GCV (see, e.g., [65, 79, 135]) are specifically designed to handle more general and practical cases where (8.11) cannot be used or justified. Also, if (8.11) is used then a typical algorithm would try to find $m$ such that $\phi(m)$ is (smaller but) not much smaller than $\rho$, because having $\phi(m)$ too small would correspond to fitting the noise, an effect one wants to avoid. The latter argument and practice do not follow from (8.11).

Moreover, keeping the misfit function $\phi(m)$ in check does not necessarily imply a quality reconstruction (i.e., an acceptable approximation $m$ for the “true solution” $m^*$, which can be an elusive notion in itself). However, $\phi(m)$, and not direct approximations of $\|m^* - m\|$, is what one typically has to work with.¹³ So any additional a priori information is often incorporated through some regularization.

¹³ The situation here is different from that in Section 8.2.1, where the choice of local error criterion over a global one was made based on convenience and efficiency considerations. Here, although controlling $\phi(m)$ is merely a necessary and not sufficient condition for obtaining a quality reconstruction $m$, it is usually all we have to work with.

Still, despite all the cautious comments in the preceding two paragraphs, we have in (8.11) in a sense a more meaningful practical expression for stopping an iterative algorithm than hitherto.

Typically there is a need to regularize the inverse problem, and often this is done by adding a regularization term to (8.10). Thus, one attempts to approximately solve the Tikhonov-type problem

$$\min_m \ \phi(m) + \alpha R(m),$$

where $R(m) \ge 0$ is a prior (we are thinking of some norm or semi-norm of $m$), and $\alpha \ge 0$ is a regularization parameter.

A fourth case study is the one that this thesis concentrates on, namely, the extension of Case Study 8.2.3 to problems with many data sets to which the additional approximation using Monte-Carlo sampling is applied. Of course, our uncertainty in the error criterion and
specifically the error tolerance, if anything, increases even further here. On the other hand, unlike in the previous case studies where we only call for increased alertness and additional caution regarding the error tolerance, here we have the framework to quantify uncertainty and as such we can obtain more efficient algorithms for problems with more such uncertainty. Satisfying the tolerance only probably thus leads to cheaper computations in a disciplined manner.

8.3 Probabilistic Relaxation of a Stopping Criterion

The previous section details three different case studies which highlight the fact of life that in applications an error tolerance for stopping an algorithm is rarely known with absolute certainty. Thus, we can say that such a tolerance is only “probably” known. Yet in some situations, it is also possible to assign it a more precise meaning in terms of statistical probability. This holds true for the problems considered in this thesis, namely extensions of Case Study 8.2.3 to problems with many data sets. Thus, one can consider a way to relax (8.1), which is more systematic and also allows for further theoretical developments. Specifically, we consider satisfying a tolerance in a probabilistic sense, as proposed in Section 7.2.2.

Thus, according to (7.2), in the check for termination of our iterative algorithm at the next iterate $m_{k+1}$, we consider replacing the condition (3.3), namely

$$\phi(m_{k+1}) \le \rho$$

by either (7.21a) or (7.21b), namely

$$\hat{\phi}(m_{k+1}, n_t) \le (1 - \varepsilon)\rho, \quad \text{or}$$
$$\hat{\phi}(m_{k+1}, n_t) \le (1 + \varepsilon)\rho,$$

for a suitable $n = n_t$ that is governed by Theorems 7.1 or 7.2 with a prescribed pair $(\varepsilon, \delta)$. If (7.21a) holds, then it follows with a probability of at least $(1 - \delta)$ that (3.3) holds. On the other hand, if (7.21b) does not hold, then we can conclude with a probability of at least $(1 - \delta)$ that (3.3) is not satisfied. In other words, unlike (7.21a), a successful (7.21b) is only necessary
and not sufficient for concluding that (3.3) holds with the prescribed probability $1 - \delta$.

What are the connections among these three parameters, $\rho$, $\delta$ and $\varepsilon$? The parameter $\rho$ is the deterministic but not necessarily too trustworthy error tolerance appearing in (3.3), much like the tolerance in Section 8.2.1. Next, we can reflect the uncertainty in the value of $\rho$ by choosing an appropriately large $\delta$ ($\le 1$). Smaller values of $\delta$ reflect a higher certainty in $\rho$ and a more rigid stopping criterion (translating into using a larger $n_t$). For instance, success of (7.21a) is equivalent to making a statement on the probability that a positive “test” result will be a “true” positive. This is formally given by the conditional probability statement

$$\Pr\left(\phi(m_{k+1}) \le \rho \ \middle|\ \hat{\phi}(m_{k+1}, n_t) \le (1 - \varepsilon)\rho\right) \ge 1 - \delta.$$

Note that, once the condition in this statement is given, the rest only involves $\rho$ and $\delta$. So the tolerance $\rho$ is augmented by the probability parameter $\delta$. The third parameter $\varepsilon$ governs the false positives/negatives (i.e., the probability that the test will yield a positive/negative result, if in fact (3.3) is false/true), where a false positive is given by

$$\Pr\left(\hat{\phi}(m_{k+1}, n_t) \le (1 - \varepsilon)\rho \ \middle|\ \phi(m_{k+1}) > \rho\right),$$

while a false negative is

$$\Pr\left(\hat{\phi}(m_{k+1}, n_t) > (1 - \varepsilon)\rho \ \middle|\ \phi(m_{k+1}) \le \rho\right).$$

Such a probabilistic stopping criterion is incorporated in Algorithm 4 in Chapter 7 and, there, various numerical examples are given to illustrate these ideas on a concrete application. Here, employing Algorithm 4 again, we give some more examples with the same setup as that of Example (E.2) in Chapter 7, but instead of TV, we use dynamical regularization. Note again that the large dynamical range of the conductivities, together with the fact that the data is available only on less than half of the boundary, contributes to the difficulty in obtaining good quality reconstructions. The term “Vanilla” refers to using all $s$ available data sets for each task during the algorithm.
This costs 527,877 PDE solves¹⁴ for $s = 3{,}969$ (b) and 5,733 PDE solves for $s = 49$ (c). However, the quality of reconstruction using the smaller number of data sets is clearly inferior. On the other hand, using our algorithm yields a recovery (d) that is comparable to Vanilla but at the cost of only 5,142 PDE solves. The latter cost is about 1% that of Vanilla and is comparable in order of magnitude to that of evaluating $\phi(m)$ once!

¹⁴ Fortunately, the matrix $L$ does not depend on $i$ in (1.4). Hence, if the problem is small enough that a direct method can be used to construct $G$, i.e., perform one LU decomposition at each iteration $k$, then the task of solving half a million PDEs just for comparison's sake becomes less daunting.

Figure 8.3: Plots of log-conductivity: (a) True model; (b) Vanilla recovery with $s = 3{,}969$; (c) Vanilla recovery with $s = 49$; (d) Monte Carlo recovery with $s = 3{,}969$. The vanilla recovery using only 49 measurement sets is clearly inferior, showing that a large number of measurement sets can be crucial for better reconstructions. The recovery using our algorithm, however, is comparable in quality to Vanilla with the same $s$. The quantifier values used in our algorithm were: $(\varepsilon_c, \delta_c) = (0.05, 0.3)$, $(\varepsilon_u, \delta_u) = (0.1, 0.3)$ and $(\varepsilon_t, \delta_t) = (0.1, 0.1)$.

8.3.1 TV and Stochastic Methods

This section is not directly related to the main theme of this chapter, but it arises from the present discussion and should have merit on its own (in addition to being mercifully short).

The specific example considered above is used also in Chapter 7, except that the objective function there includes a total variation (TV) regularization. This represents usage of additional a priori information (namely, that the true model is discontinuous with otherwise flat regions), whereas here an implicit $\ell_2$-based regularization has been employed without such knowledge regarding the true solution. The results in Figures 7.6(b) and 7.7(vi) there correspond to our Figures 8.3(b) and 8.3(d), respectively, and as expected, they look sharper in Chapter 7. On the other hand, a comparative glance at Figure 7.6(c) there vs the present Figure 8.3(c) reveals that the $\ell_1$-based technique can be inferior to the $\ell_2$-based one, even for recovering a piecewise constant solution! Essentially, even for this special solution form TV shines only with sufficiently good data, and here “sufficiently good” translates to “many data sets”. This intuitively obvious observation does not appear to be as well-known today as it used to be [47].

8.4 Conclusions

Mathematical software packages typically offer a default option for the error tolerances used in their implementation. Users often select this default option without much further thinking, at times almost automatically. This in itself suggests that practical occasions where the practitioner does not really have a good hold of a precise tolerance value are abundant. However, since it is often convenient to assume having such a value, and convenience may breed complacency, surprises may arise. We have considered in Section 8.2 three case studies which highlight various aspects of this uncertainty in a tolerance value for a stopping criterion.

Recognizing that there can often be a significant uncertainty regarding the actual tolerance value and the stopping criterion, we have subsequently considered the relaxation of the setting into a probabilistic one, and demonstrated its benefit in the context of large scale problems considered in this thesis. The environment defined by probabilistic relative accuracy, such as (7.2), although well-known in other research areas, is relatively new (but not entirely untried) in the numerical analysis community.
It allows, among other benefits, specifying an amount of trust in a given tolerance using two parameters that can be tuned, as well as the development of bounds on the sample size of certain Monte Carlo methods. In Section 8.3, following Chapter 7, we have applied this setting in the context of a particular inverse problem involving the solution of many PDEs, and we have obtained some uncertainty quantification for a rather efficient algorithm solving a large scale problem.

There are several aspects of our topic that remain untouched in this chapter. For instance, there is no discussion of the varying nature of the error quantity that is being measured (which strongly differs across the subsections of Section 8.2, from solution error through residual error through data misfit error for an ill-posed problem to stochastic quantities that relate even less closely to the solution error). Also, we have not mentioned that complex algorithms often involve sub-tasks such as solving a linear system of equations iteratively, or employing generalized cross validation (GCV) to obtain a tolerance value, or invoking some nonlinear optimization routine, which themselves require some stopping criterion: thus, several occurrences of tolerances in one solution algorithm are common. In the probabilistic sections, we have made the choice of concentrating on bounding the sample size $n$ and not, for example, on minimizing the variance as in [86].

What we have done here is to highlight an often ignored yet rather fundamental issue from different angles. Subsequently, we have pointed at and demonstrated a promising approach (or direction of thought) that is not currently common in the scientific computing community.

Chapter 9
Summary and Future Work

Efficiently solving large scale non-linear inverse problems of the form described in Section 1.1 is indeed a challenging problem.
Large scale, within the context we aimed to study in this thesis, implies that we are given a very large number of measurement vectors, i.e., $s \gg 1$. For many instances of such problems, there are theoretical reasons for requiring large amounts of data for obtaining any credible reconstruction. For many others, it is an accepted working assumption that having more data can only help and not hurt the conditioning of the problem being solved. As such, methods for efficiently solving such problems are highly sought after in practice. In this thesis, we have proposed highly efficient randomized reconstruction algorithms for solving such problems. We have also demonstrated both the efficacy and the efficiency of the proposed algorithms in the context of an important class of such problems, namely PDE inverse problems with many measurements. As a specific instance, we used the famous and notoriously difficult DC resistivity problem.

Each chapter of this thesis contains conclusions and future research directions specific to that particular line of research; here we present an overall summary and some more topics for future research, not mentioned earlier.

9.1 Summary

In Chapter 2, various dimensionality reduction (i.e., approximation) methods were presented to deal with computational challenges arising from evaluating the misfit (1.6). All these methods consist of sampling the large dimensional data and creating a new set of lower dimensional data for which computations can be done more efficiently. Such sampling can be done either (i) stochastically or (ii) deterministically. We showed that stochastic sampling results in an unbiased estimator of the original misfit (1.6), and as such, the misfit itself is approximated. The main examples of appropriate distributions for stochastic estimation were discussed: Rademacher, Gaussian and Random Subset.
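The stochastic sampling summarized above can be sketched as follows. This is a minimal illustration and not the thesis implementation: the misfit is assumed here to be the squared Frobenius norm of a residual matrix $B$ with one column per measurement vector, the Random Subset distribution is assumed to draw scaled unit vectors uniformly at random, and the names `sample_vector` and `estimate_misfit` are hypothetical. Each of the three distributions satisfies $E[ww^T] = I$, which is what makes the average of $\|Bw\|_2^2$ an unbiased estimator of $\|B\|_F^2 = \mathrm{tr}(B^T B)$.

```python
import math
import random


def sample_vector(kind, s, rng):
    """Draw a random vector w of length s with E[w w^T] = I."""
    if kind == "rademacher":
        return [rng.choice((-1.0, 1.0)) for _ in range(s)]
    if kind == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(s)]
    if kind == "subset":
        # Scaled unit vector: picks one column of B uniformly at random (assumption).
        w = [0.0] * s
        w[rng.randrange(s)] = math.sqrt(s)
        return w
    raise ValueError("unknown distribution: " + kind)


def estimate_misfit(B, n, kind, rng):
    """Average ||B w||_2^2 over n samples; B is a list of rows with s columns."""
    s = len(B[0])
    total = 0.0
    for _ in range(n):
        w = sample_vector(kind, s, rng)
        Bw = [sum(bij * wj for bij, wj in zip(row, w)) for row in B]
        total += sum(x * x for x in Bw)
    return total / n
```

For instance, `estimate_misfit(B, 100, "gaussian", random.Random(0))` uses 100 sample vectors in place of all $s$ columns. When the columns of $B$ are mutually orthogonal, the Rademacher estimate is exact for any $n$, because $B^T B$ is then diagonal.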
In cases where the underlying forward operators satisfy Assumptions (A.1) - (A.3), we showed that the stochastic methods using the Rademacher or Gaussian distribution result in the method of simultaneous sources (SS) and indeed yield very efficient approximations (recall our definition of efficiency, in footnotes 4 and 6, in Chapter 2). In situations where Assumption (A.2) is violated, however, the method of random subset (RS) is the only applicable estimator. On the other hand, our proposed method for deterministic sampling is based on TSVD approximation of the matrix consisting of all measurement vectors. As such, unlike the stochastic methods which approximate the misfit, the TSVD approach approximates the data matrix, $D$ in (1.6). If Assumptions (A.1) - (A.3) hold, such a deterministic method, similar to the stochastic ones, yields another instance of SS methods.

In Chapter 3, continuing to make Assumptions (A.1) - (A.3), we developed and compared several highly efficient stochastic iterative reconstruction algorithms for approximately solving the (regularized) NLS formulation of the aforementioned large scale data fitting problems. All these iterative algorithms involve employing the dimensionality reduction techniques discussed in Chapter 2. As such, at iteration $k$ of our algorithms, the original $s$ measurement vectors are sampled and a new, yet smaller, set of $n_k$ measurement vectors with $n_k \ll s$ is formed. Two reconstruction algorithms for controlling the size $n_k$ of the data set in the $k$th iteration have been proposed and tested. We identified and justified three different purposes for such dimensionality reduction methods within various steps of our proposed algorithms, namely fitting, cross validation and uncertainty check. Using the four methods of sampling the data (i.e., three stochastic and one deterministic introduced in Chapter 2), our two algorithms make for a total of eight algorithm variants.
All of our algorithms are known to converge under suitable circumstances because they satisfy the general conditions in [36, 60].

Chapter 4 is a sequel to Chapter 3 in which we relax the linearity assumption (A.2) and propose methods that, where applicable, transform the problem into one where the linearity assumption (A.2) is restored. Hence, efficient dimensionality reduction methods introduced in Chapter 2 can be applied. In Chapter 4, we focus on a particular case where the linearity assumption (A.2) is violated due to missing or corrupted data. Such situations arise often in practice, and hence it is desirable to have methods that make it possible to apply variants of the efficient reconstruction algorithms presented in Chapter 3. The transformation methods presented in Chapter 4 involve completion/replacement of the missing/corrupted portion of the data. These methods are presented in the context of EIT/DC resistivity as an important class of PDE inverse problems; however, we believe that similar ideas can be applied in many more instances of such problems. Our data completion/replacement methods are motivated by theory in Sobolev spaces, regarding the properties of weak solutions along the domain boundary. Our methods prove to be capable of effectively reconstructing the data in the presence of large amounts of missing or corrupted data. Variants of efficient reconstruction algorithms, presented in Chapter 3, are proposed and numerically verified. In addition to completion/replacement methods, a heuristic and efficient alternative to the rigid stopping criterion (3.3) is given.

All of our proposed randomized dimensionality reduction methods rely heavily upon the fundamental concept of estimating the trace of an implicit symmetric positive semi-definite matrix using Monte Carlo methods. As such, the questions of accuracy and efficiency of our stochastic approximation methods are tied to those of such trace estimators.
In Chapter 5 this task is visited, and the accuracy and efficiency of the randomized trace estimators are analyzed using a suitable and intuitive probabilistic framework. Under such a probabilistic framework, one seeks conditions on the sample size required for these Monte-Carlo methods to probabilistically guarantee the estimate's desired relative accuracy. In that chapter, conditions for all the distributions discussed in Section 2.1.1 are derived. In addition to practically computable conditions, we also provide some uncomputable, yet informative, conditions which shed light on questions regarding the type of matrices a particular distribution is best/least suited for. Part of the theory presented in Chapter 5 is, subsequently, further improved in Chapter 7.

Chapter 6 is a precursor of Chapter 7. Specifically, the improvements in theoretical studies of MC trace estimators presented in Chapter 7 are applications of more general results regarding the extremal probabilities (i.e., maxima and minima of the tail probabilities) of non-negative linear combinations (i.e., convolutions) of gamma random variables, which are proved in Chapter 6. In addition, in Chapter 6, we prove results regarding the monotonicity of the regularized gamma function. All these results are very general and have many applications in economics, actuarial science, insurance, reliability and engineering.

The main advantage of any efficient (randomized) approximation algorithm is the reduction of computational costs. However, a major drawback of any such algorithm is the introduction of “uncertainty” in the overall procedure. The presence of uncertainty in the approximation steps could cast doubt on the credibility of the obtained results. Hence, it may be useful to have means which allow one to adjust the cost and accuracy of such algorithms in a quantifiable way, and find a balance that is suitable to particular objectives and computational resources.
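One such means, the probabilistic stopping test built on (7.21a) and (7.21b) in Section 8.3, can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_misfit` is a hypothetical user-supplied callable that returns one random sample whose expectation is the misfit $\phi(m)$, and the sample size `n` is simply passed in by the caller rather than derived from Theorems 7.1 or 7.2 for a prescribed pair $(\varepsilon, \delta)$.

```python
def probabilistic_stop(sample_misfit, n, rho, eps):
    """Compare the Monte Carlo estimate phi_hat with (1 - eps) * rho and (1 + eps) * rho."""
    phi_hat = sum(sample_misfit() for _ in range(n)) / n
    if phi_hat <= (1.0 - eps) * rho:
        # (7.21a) holds: the tolerance is deemed satisfied.
        return "stop", phi_hat
    if phi_hat > (1.0 + eps) * rho:
        # (7.21b) fails: the tolerance is deemed not satisfied.
        return "continue", phi_hat
    # Between the two thresholds the test is inconclusive.
    return "undecided", phi_hat
```

How the inconclusive case is handled (for instance by continuing to iterate or by drawing more samples) is a design choice that this sketch leaves open. The probabilities attached to the first two outcomes depend on choosing `n` as prescribed by the theory, which the sketch does not do.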
Chapter 7 offers the uncertainty quantification of the major stochastic steps of our reconstruction algorithms presented in Chapters 3 and 4. Such steps include the fitting, uncertainty check, cross validation and stopping criterion. This results in highly efficient variants of our original algorithms where the degree of uncertainty can easily be quantified and adjusted, if needed. Using the resulting algorithm, one could, in a quantifiable way, obtain a desirable balance between the amount of uncertainty and the computational complexity of the reconstruction algorithm. In order to achieve this, we make use of a probabilistic analysis similar to that in Chapter 5. However, the conditions presented in Chapter 5 are typically not sufficiently tight to be useful in many practical situations. In Chapter 7, using the results of Chapter 6, we improve upon some of the theory presented in Chapter 5. Specifically, in Chapter 7, we prove tight bounds for tail probabilities for such Monte-Carlo approximations employing the standard normal distribution. These tail bounds are then used to obtain necessary and sufficient bounds on the required sample size, and we demonstrate that these bounds can be practically small and computable. Numerical examples demonstrate the efficiency of our proposed uncertainty-quantified algorithm.

Numerical algorithms are typically equipped with a stopping criterion where the calculation is terminated when some error or misfit measure is deemed to be below a given tolerance. However, in practice such a tolerance is rarely known; rather, only some possibly vague idea of the desired quality of the numerical approximation is available. In Chapter 8, we discuss several case studies, from different areas of numerical analysis, where a rigid interpretation of error criterion and tolerance may result in erroneous outcomes and conclusions.
We discuss, for instance, fast codes for initial value ODEs and DAEs, which heavily rely on the underlying philosophy that satisfying a tolerance for the global error is too rigid a task; such codes proceed to control just the local error. Another instance of soft error control is when a large, complicated model for fitting data is reduced, say by a Monte Carlo sampling method as in previous chapters. If calculating the misfit norm is in itself very expensive then the option of satisfying the stopping criterion only in a probabilistic sense naturally arises. This leads one to think of devising approaches, where they are possible, to relax the notion of exactly satisfying a tolerance value. In Chapter 8, we discuss this in the context of large scale PDE inverse problems described in Section 1.1.2. Such probabilistic relaxation of the given tolerance in this context allows, for instance, for the use of the proven bounds in Chapters 5 and 7.

This thesis also includes an appendix. In Appendix A, certain implementation details are given which are used throughout the thesis. Such details include discretization of the EIT/DC resistivity problem in two and three dimensions, injection of a priori knowledge on the sought parameter function via transformation functions in the original PDE, an overall discussion of the (stabilized) Gauss-Newton algorithm for minimization of a least squares objective, a short Matlab code which is used in Chapter 7 to compute the Monte-Carlo sample sizes employed in matrix trace estimators, and finally the details of implementation and discretization of the total variation functional used in some of the numerical examples in this thesis.

9.2 Future Work

At the end of each of Chapters 3, 4, 5, and 7, some general directions for future research, related to the specific topics presented in the respective chapter, are discussed.
In this section, we present a few more directions and ideas for further research, which arose as a result of the work in this thesis. Time constraints did not allow for their full investigation in the present study and they are left for future work. Some of these ideas are presented in their most general form, while others are described more specifically.

9.2.1 Other Applications

The success of randomized approximation algorithms has only been thoroughly evaluated in a handful of applications. However, it is widely believed that the application range of such stochastic algorithms can be extended. There are many more important medical and geophysical applications where the study of efficient randomized approximation algorithms requires more concentrated effort. Such applications include large scale seismic data inversion in oil exploration and medical imaging such as quantitative photoacoustic tomography, among many others. For many of these applications, one typically makes a large number of measurements and, hence, the model recovery is computationally very challenging. In addition, there are unique challenges that arise as a result of the nature of each individual application. Within the context of approximation algorithms, these challenges need to be individually investigated. They might call for anything from a simple modification of the algorithms in this thesis to devising completely new approaches.

9.2.2 Quasi-Monte Carlo and Matrix Trace Estimation

As shown in this thesis, within the context of large scale non-linear least squares problems, efficiency in estimating the objective function (or the trace of the corresponding implicit matrix) directly translates to efficiency in solving such large scale problems. In this thesis, it was shown that, for such problems, naive randomized approximation techniques using simple Monte-Carlo methods can have great success in designing highly efficient algorithms.
Such Monte-Carlo methods for estimating the trace of an implicit matrix were thoroughly studied in Chapters 5 and 7. As shown, the analysis is based on a probabilistic framework for which, given two small tolerances $(\varepsilon, \delta)$, one obtains sufficient conditions on sample size in order to guarantee that the probability of the relative accuracy being below $\varepsilon$ is more than $1 - \delta$. However, it has been shown in [137] that using simple Monte-Carlo methods, the true sample size grows like $\Omega(\varepsilon^{-2})$. As such, for scenarios where an accurate estimation is required, such algorithms might be completely inefficient and computationally expensive. And yet, it might be possible to improve the bound $\Omega(\varepsilon^{-2})$ through the application of Quasi-Monte Carlo (QMC) methods, where careful design of a sequence of correlated samples yields more accurate approximations, at lower costs. The application of such QMC methods for efficiently solving large scale inverse problems has not been studied extensively in the literature. Hence, the analysis and the practical implementation of such new algorithms is an interesting topic for future research.

9.2.3 Randomized/Deterministic Preconditioners

In Chapter 5, it was shown that the “skewness” of the eigenvalue distribution of a symmetric positive semi-definite matrix greatly affects the performance and efficiency of the Gaussian trace estimation. In other words, estimating the trace of a matrix whose eigenvalues are similar can be done more efficiently (i.e., with smaller sample size) than that for which the discrepancy between the eigenvalues is large (i.e., eigenvalues are more skewed). A question arises whether it is possible to find a randomized preconditioning scheme to balance the skewed eigenvalues of a matrix while preserving the value of the trace. In other words, one may seek to find a random matrix $P$ such that $PAP^T$ has a more balanced eigenvalue distribution than $A$, yet we have $\mathrm{tr}(A) = \mathrm{tr}(PAP^T)$ (or $\mathrm{tr}(A) = E[\mathrm{tr}(PAP^T)]$).
Alternatively, one could look at deterministic constructions such as the following formulation

$$\min_{P \in \mathcal{P}} \|PAP^T\|_2^2 \quad \text{s.t.} \quad \mathrm{tr}(PAP^T) = \mathrm{tr}(A),$$

where $\mathcal{P}$ is an appropriate space of non-orthogonal matrices. Minimizing $\|PAP^T\|_2^2$ translates to minimizing the largest eigenvalue of $PAP^T$, which, given the constraint on the sum of eigenvalues, forces the eigenvalue distribution to be less skewed. If such a preconditioner exists, it can, in addition, be adopted for preconditioning matrices in linear system solvers.

Bibliography

[1] M. Abramowitz. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications, Incorporated, 1974.
[2] D. Achlioptas. Database-friendly random projections. In ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 01, volume 20, pages 274–281, 2001.
[3] H. Akaike. On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. Tokyo, 11:1–16, 1959.
[4] A. Alessandrini and S. Vessella. Lipschitz stability for the inverse conductivity problem. Adv. Appl. Math., 35:207–241, 2005.
[5] A. Alexanderian, N. Petra, G. Stadler, and O. Ghattas. A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized ℓ0-sparsification. arXiv:1308.4084, 2013.
[6] A. Alexanderian, N. Petra, G. Stadler, and O. Ghattas. A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems. arXiv:1410.5899, 2014.
[7] L. Amiri, B. Khaledi, and F. J. Samaniego. On skewness and dispersion among convolutions of independent gamma random variables. Probability in the Engineering and Informational Sciences, 25(01):55–69, 2011.
[8] A. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen. Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming, 134(1):101–125, 2012.
[9] G. Archer and D. M. Titterington.
On some Bayesian/regularization methods for image restoration. Image Processing, IEEE Transactions on, 4(7):989–995, 1995.
[10] S. R. Arridge. Optical tomography in medical imaging. Inverse Problems, 15(2):R41, 1999.
[11] S. R. Arridge. Optical tomography in medical imaging. Inverse Problems, 15(2):R41, 1999.
[12] S. R. Arridge and J. C. Hebden. Optical imaging in medicine: II. Modelling and reconstruction. Physics in Medicine and Biology, 42(5):841, 1997.
[13] U. Ascher. Numerical Methods for Evolutionary Differential Equations. SIAM, Philadelphia, PA, 2008.
[14] U. Ascher and C. Greif. First Course in Numerical Methods. Computational Science and Engineering. SIAM, 2011.
[15] U. Ascher and E. Haber. A multigrid method for distributed parameter estimation problems. J. ETNA, 18:1–18, 2003.
[16] U. Ascher, R. Mattheij, and R. Russell. Numerical Solution of Boundary Value Problems for Ordinary Differential Equations. Classics. SIAM, 1995.
[17] U. Ascher and L. Petzold. Computer Methods for Ordinary Differential and Differential-Algebraic Equations. SIAM, 1998.
[18] U. Ascher and S. Reich. The midpoint scheme and variants for Hamiltonian systems: advantages and pitfalls. SIAM J. Scient. Comput., 21:1045–1065, 1999.
[19] U. Ascher and F. Roosta-Khorasani. Algorithms that satisfy a stopping criterion, probably. 2014. Preprint, arXiv:1408.5946.
[20] R. C. Aster, B. Borchers, and C. H. Thurber. Parameter estimation and inverse problems. Academic Press, 2013.
[21] H. Avron. Counting triangles in large graphs using randomized matrix trace estimation. Workshop on Large-scale Data Mining: Theory and Applications, 2010.
[22] H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. JACM, 58(2), 2011. Article 8.
[23] Z. Bai, M. Fahey, and G. Golub. Some large scale matrix computation problems. J. Comput. Appl. Math., 74:71–89, 1996.
[24] J. Barzilai and J. Borwein. Two point step size gradient methods. IMA J. Num.
Anal., 8:141–148, 1988.
[25] C. J. Beasley. A new look at marine simultaneous sources. The Leading Edge, 27(7):914–917, 2008.
[26] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Appl. Numer. Math., 57:1214–1229, 2007.
[27] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Scient. Comput., 31(2):890–912, 2008.
[28] M. Bertero and P. Boccacci. Introduction to inverse problems in imaging. CRC Press, 2010.
[29] D. A. Boas, D. H. Brooks, E. L. Miller, C. A. DiMarzio, M. Kilmer, R. J. Gaudette, and Q. Zhang. Imaging the body with diffuse optical tomography. Signal Processing Magazine, IEEE, 18(6):57–75, 2001.
[30] M. E. Bock, P. Diaconis, F. W. Huffer, and M. D. Perlman. Inequalities for linear combinations of gamma random variables. Canadian Journal of Statistics, 15:387–395, 1987.
[31] P. J. Boland, E. El-Neweihi, and F. Proschan. Schur properties of convolutions of exponential and geometric random variables. Journal of Multivariate Analysis, 48(1):157–167, 1994.
[32] J. Bon and E. Păltănea. Ordering properties of convolutions of exponential random variables. Lifetime Data Analysis, 5(2):185–192, 1999.
[33] L. Borcea, J. G. Berryman, and G. C. Papanicolaou. High-contrast impedance tomography. Inverse Problems, 12:835–858, 1996.
[34] A. Borsic, B. M. Graham, A. Adler, and W. R. Lionheart. Total variation regularization in electrical impedance tomography. 2007.
[35] C. Bunks, F. M. Saleck, S. Zaleski, and G. Chavent. Multiscale seismic waveform inversion. Geophysics, 60(5):1457–1473, 1995.
[36] R. Byrd, G. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optimization, 21(3):977–995, 2011.
[37] Y. Cao and L. Petzold. A posteriori error estimation and global error control for ordinary differential equations by the adjoint method. SIAM J. Scient. Comput., 26:359–374, 2004.
[38] T. Chan and X. Tai.
Level set and total variation regularization for elliptic inverse problems with discontinuous coefficients. J. Comp. Phys., 193:40–66, 2003.
[39] M. Cheney, D. Isaacson, and J. C. Newell. Electrical impedance tomography. SIAM Review, 41:85–101, 1999.
[40] G. Dahlquist and A. Bjorck. Numerical Methods. Prentice-Hall, 1974.
[41] Y. Dai. Nonlinear conjugate gradient methods. Wiley Encyclopedia of Operations Research and Management Science, 2011.
[42] Y. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Math., 100:21–47, 2005.
[43] C. A. Deledalle, S. Vaiter, G. Peyré, and J. M. Fadili. Stein unbiased gradient estimator of the risk (SUGAR) for multiple parameter selection. arXiv:1405.1164, 2014.
[44] P. Diaconis and M. D. Perlman. Bounds for tail probabilities of weighted sums of independent gamma random variables. Lecture Notes-Monograph Series, pages 147–166, 1990.
[45] K. van den Doel and U. Ascher. The chaotic nature of faster gradient descent methods. J. Scient. Comput., 48, 2011. DOI: 10.1007/s10915-011-9521-3.
[46] K. van den Doel and U. Ascher. Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements. SIAM J. Scient. Comput., 34, 2012. DOI: 10.1137/110826692.
[47] K. van den Doel, U. Ascher, and E. Haber. The lost honour of ℓ2-based regularization. Radon Series in Computational and Applied Math, 2013. M. Cullen, M. Freitag, S. Kindermann and R. Scheinchl (Eds).
[48] O. Dorn, E. L. Miller, and C. M. Rappaport. A shape reconstruction method for electromagnetic tomography using adjoint fields and level sets. Inverse Problems, 16:1119–1156, 2000.
[49] E. Candes, L. Demanet, D. Donoho, and L. Ying. Fast discrete curvelet transforms. Multiscale Modeling & Simulation, 5(3):861–899, 2006.
[50] M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
[51] Y. C.
Eldar and G. Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.
[52] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, Dordrecht, 1996.
[53] J. Krebs et al. Iterative inversion of data from simultaneous geophysical sources. http://www.faqs.org/patents/app/20100018718, 28/01/2010.
[54] L. C. Evans. Partial differential equations. 1998.
[55] C. Farquharson and D. Oldenburg. Non-linear inversion using general measures of data misfit and model structure. Geophysics J., 134:213–227, 1998.
[56] A. Fichtner. Full Seismic Waveform Modeling and Inversion. Springer, 2011.
[57] R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
[58] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.
[59] A. Friedlander, J. Martinez, B. Molina, and M. Raydan. Gradient method with retard and generalizations. SIAM J. Num. Anal., 36:275–289, 1999.
[60] M. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scient. Comput., 34(3), 2012.
[61] E. Furman and Z. Landsman. Tail variance premium with applications for elliptical portfolio of risks. Astin Bulletin, 36(2):433, 2006.
[62] F. Gao, A. Atle, P. Williamson, et al. Full waveform inversion using deterministic source encoding. In 2010 SEG Annual Meeting. Society of Exploration Geophysicists, 2010.
[63] H. Gao, S. Osher, and H. Zhao. Quantitative photoacoustic tomography. In Mathematical Modeling in Biomedical Imaging II, pages 131–158. Springer, 2012.
[64] M. Gehrea, T. Kluth, A. Lipponen, B. Jin, A. Seppaenenb, J. Kaipio, and P. Maass. Sparsity reconstruction in electrical impedance tomography: An experimental evaluation. J. Comput. Appl. Math., 236:2126–2136, 2012.
[65] G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
[66] G. H. Golub, M. Heath, and G. Wahba.
Generalized cross validation as a method for choosing a good ridge parameter. Technometrics, 21:215–223, 1979.
[67] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, 1997.
[68] T. M. Habashy, A. Abubakar, G. Pan, A. Belani, et al. Full-waveform seismic inversion using the source-receiver compression approach. In 2010 SEG Annual Meeting. Society of Exploration Geophysicists, 2010.
[69] E. Haber, U. Ascher, and D. Oldenburg. Inversion of 3D electromagnetic data in frequency and time domain using an inexact all-at-once approach. Geophysics, 69:1216–1228, 2004.
[70] E. Haber and M. Chung. Simultaneous source for non-uniform data variance and missing data. 2012. Submitted.
[71] E. Haber, M. Chung, and F. Herrmann. An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM J. Optimization, 22:739–757, 2012.
[72] E. Haber, S. Heldmann, and U. Ascher. Adaptive finite volume method for distributed non-smooth parameter identification. Inverse Problems, 23:1659–1676, 2007.
[73] E. Hairer, C. Lubich, and G. Wanner. Geometric Numerical Integration. Springer, 2002.
[74] E. Hairer, S. Norsett, and G. Wanner. Solving Ordinary Differential Equations I. Springer, 1993.
[75] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Springer, 1996.
[76] G. Hampson, J. Stefani, and F. Herkenhoff. Acquisition using simultaneous sources. The Leading Edge, 27(7):918–923, 2008.
[77] M. Hanke. Regularizing properties of a truncated Newton-CG algorithm for nonlinear inverse problems. Numer. Funct. Anal. Optim., 18:971–993, 1997.
[78] P. C. Hansen. Rank-Deficient and Discrete Ill-Posed Problems. SIAM, 1998.
[79] P. C. Hansen. The L-curve and its use in the numerical treatment of inverse problems. IMM, Department of Mathematical Modelling, Technical University of Denmark, 1999.
[80] M. Heath. Scientific Computing, An Introductory Survey. McGraw-Hill, 2002. 2nd Ed.
[81] F. Herrmann, Y. Erlangga, and T. Lin.
Compressive simultaneous full-waveform simulation. Geophysics, 74:A35, 2009.
[82] D. J. Higham. Global error versus tolerance for explicit Runge-Kutta methods. IMA J. Numer. Anal., 11:457–480, 1991.
[83] P. Hitczenko and S. Kwapień. On the Rademacher series. In Probability in Banach Spaces, 9, pages 31–36. Springer, 1994.
[84] J. Holodnak and I. Ipsen. Randomized approximation of the Gram matrix: Exact computation and probabilistic bounds. SIAM J. Matrix Anal. Applic., 2014. To appear.
[85] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[86] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. J. Comm. Stat. Simul., 19:433–450, 1990.
[87] I. Ipsen and T. Wentworth. The effect of coherence on sampling from matrices with orthonormal columns, and preconditioned least squares problems. SIAM J. Matrix Anal. Applic., 2014. To appear.
[88] V. Isakov. Inverse Problems for Partial Differential Equations. Springer; 2nd edition, 2006.
[89] J. Jost. Riemannian geometry and geometric analysis, volume 42005. Springer, 2008.
[90] A. Juditsky, G. Lan, A. Nemirovski, and A. Shapiro. Stochastic approximation approach to stochastic programming. SIAM J. Optimization, 19(4):1574–1609, 2009.
[91] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Springer, 2005.
[92] S. A. Khayam. The discrete cosine transform (DCT): theory and application. Michigan State University, 2003.
[93] S. Kochar and M. Xu. On the right spread order of convolutions of heterogeneous exponential random variables. Journal of Multivariate Analysis, 101(1):165–176, 2010.
[94] S. Kochar and M. Xu. The tail behavior of the convolutions of gamma random variables. Journal of Statistical Planning and Inference, 141(1):418–428, 2011.
[95] S. Kochar and M. Xu. Some unified results on comparing linear combinations of independent gamma random variables.
Probability in the Engineering and Informational Sciences, 26(03):393–404, 2012.
[96] R. Kumar, C. Da Silva, O. Akalin, A. Y. Aravkin, H. Mansour, B. Recht, and F. J. Herrmann. Efficient matrix completion for seismic data reconstruction. Submitted to Geophysics on August 8, 2014.
[97] B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge, 2004.
[98] Y. Y. Li and M. Vogelius (communicated by R. V. Kohn). Gradient estimates for solutions to divergence form elliptic equations with discontinuous coefficients. Arch. Rational Mech. Anal., 153:91–151, 2000.
[99] S. Lihong and Z. Xinsheng. Stochastic comparisons of order statistics from gamma distributions. Journal of Multivariate Analysis, 93(1):112–121, 2005.
[100] A. K. Louis. Medical imaging: state of the art and future development. Inverse Problems, 8(5):709, 1992.
[101] S. Mallat. A wavelet tour of signal processing. Academic Press, 1999.
[102] W. Menke. Geophysical data analysis: discrete inverse theory. Academic Press, 2012.
[103] M. L. Merkle and Petrović. On Schur-convexity of some distribution functions. Publications de l'Institut Mathématique, 56(76):111–118, 1994.
[104] Y. Michel. Diagnostics on the cost-function in variational assimilations for meteorological models. Nonlinear Processes in Geophysics, 21(1):187–199, 2014.
[105] P. Moghaddam and F. Herrmann. Randomized full-waveform inversion: a dimensionality reduction approach. In SEG Technical Program Expanded Abstracts, volume 29, pages 977–982, 2010.
[106] A. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics. McGraw-Hill; 3rd edition, 1974.
[107] V. A. Morozov. Methods for Solving Incorrectly Posed Problems. Springer, 1984.
[108] G. A. Newman and D. L. Alumbaugh. Frequency-domain modelling of airborne electromagnetic responses using staggered finite differences. Geophys. Prospecting, 43:1021–1042, 1995.
[109] J. Nocedal and S. Wright. Numerical Optimization. New York: Springer, 1999.
[110] D.
Oldenburg, E. Haber, and R. Shekhtman. 3D inversion of multi-source time domain electromagnetic data. J. Geophysics, 2013. To appear.
[111] A. Pidlisecky, E. Haber, and R. Knight. RESINVM3D: A MATLAB 3D Resistivity Inversion Package. Geophysics, 72(2):H1–H10, 2007.
[112] M. Raydan. Convergence Properties of the Barzilai and Borwein Gradient Method. PhD thesis, Rice University, Houston, Texas, 1991.
[113] A. Rieder. Inexact Newton regularization using conjugate gradients as inner iteration. SIAM J. Numer. Anal., 43:604–622, 2005.
[114] A. Rieder and A. Lechleiter. Towards a general convergence theory for inexact Newton regularizations. Numer. Math., 114(3):521–548, 2010.
[115] J. Rohmberg, R. Neelamani, C. Krohn, J. Krebs, M. Deffenbaugh, and J. Anderson. Efficient seismic forward modeling and acquisition using simultaneous random sources and sparsity. Geophysics, 75(6):WB15–WB27, 2010.
[116] F. Roosta-Khorasani and U. Ascher. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 2014. DOI: 10.1007/s10208-014-9220-1.
[117] F. Roosta-Khorasani, G. Székely, and U. Ascher. Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA Journal on Uncertainty Quantification, 3(1):61–90, 2015. DOI: 10.1137/14096311X.
[118] F. Roosta-Khorasani, K. van den Doel, and U. Ascher. Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electronic Transactions on Numerical Analysis, 42:177–196, 2014.
[119] F. Roosta-Khorasani, K. van den Doel, and U. Ascher. Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM J. Scientific Computing, 36(5):S3–S22, 2014.
[120] B. H. Russell. Introduction to seismic inversion methods, volume 2. Society of Exploration Geophysicists, 1988.
[121] A. K. Saibaba and P. K. Kitanidis.
Uncertainty quantification in geostatistical approach to inverse problems. arXiv:1404.1263, 2014.
[122] G. Sapiro. Geometric Partial Differential Equations and Image Analysis. Cambridge, 2001.
[123] L. L. Scharf. Statistical signal processing, volume 98. Addison-Wesley Reading, MA, 1991.
[124] F. Schmidt. The Laplace-Beltrami-operator on Riemannian manifolds. Technical Report, Computer Vision Group, Technische Universität München.
[125] R. J. Serfling. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, 2:39–48, 1974.
[126] A. Shapiro, D. Dentcheva, and D. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. Philadelphia: SIAM, 2009.
[127] S. Shkoller. Lecture Notes on Partial Differential Equations. Department of Mathematics, University of California, Davis, June 2012.
[128] N. C. Smith and K. Vozoff. Two dimensional DC resistivity inversion for dipole dipole data. IEEE Trans. on geoscience and remote sensing, GE 22:21–28, 1984.
[129] G. J. Székely and N. K. Bakirov. Extremal probabilities for Gaussian quadratic forms. Probab. Theory Related Fields, 126:184–202, 2003.
[130] J. Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. SODA, pages 978–986, 2009. SIAM.
[131] K. van den Doel and U. M. Ascher. On level set regularization for highly ill-posed distributed parameter estimation problems. J. Comp. Phys., 216:707–723, 2006.
[132] K. van den Doel and U. M. Ascher. Dynamic level set regularization for large distributed parameter estimation problems. Inverse Problems, 23:1271–1288, 2007.
[133] K. van den Doel and U. M. Ascher. Dynamic regularization, level set shape optimization, and computed myography. Control and Optimization with Differential-Algebraic Constraints, 23:315, 2012.
[134] T. van Leeuwen, S. Aravkin, and F. Herrmann. Seismic waveform inversion by stochastic optimization. Hindawi Intl. J. Geophysics, 2011:doi:10.1155/2011/689041, 2012.
[135] C. Vogel.
Computational methods for inverse problem. SIAM, Philadelphia, 2002.
[136] W. Rundell and H. W. Engl. Inverse problems in medical imaging and nondestructive testing. Springer-Verlag New York, Inc., 1997.
[137] K. Wimmer, Y. Wu, and P. Zhang. Optimal query complexity for estimating the trace of a matrix. arXiv preprint arXiv:1405.7112, 2014.
[138] J. Young and D. Ridzal. An application of random projection to parameter estimation in partial differential equations. SIAM J. Scient. Comput., 34:A2344–A2365, 2012.
[139] Y. Yu. Some stochastic inequalities for weighted sums. Bernoulli, 17(3):1044–1053, 2011.
[140] Z. Yuan and H. Jiang. Quantitative photoacoustic tomography: Recovery of optical absorption coefficient maps of heterogeneous media. Applied Physics Letters, 88(23):231101, 2006.
[141] P. Zhao. Some new results on convolutions of heterogeneous gamma random variables. Journal of Multivariate Analysis, 102(5):958–976, 2011.
[142] P. Zhao and N. Balakrishnan. Mean residual life order of convolutions of heterogeneous exponential random variables. Journal of Multivariate Analysis, 100(8):1792–1801, 2009.

Appendix A
Implementation Details

Here we describe the forward problem that yields the operators $f_i(m)$ of (1.4), and provide some details on the stabilized GN iteration used in our numerical experiments.
We also provide details of the discretization of the EIT/DC resistivity problem in two and three dimensions, the injection of a priori knowledge on the sought parameter function via transformation functions in the original PDE, a short Matlab code which is used in Chapter 7 to compute the Monte-Carlo sample sizes used in matrix trace estimators, and finally the details of the implementation and discretization of the total variation functional used in numerical examples in this thesis.

There is nothing strictly new here, and yet some of the details are both tricky and very important for the success of an efficient code for computing reasonable reconstructions for this highly ill-posed problem. It is therefore convenient to gather all these details in one place for further reference.

A.1 Discretizing the Forward Problem

The PDE (3.5) is discretized on a staggered grid as described in [15] and in Section 3.1 of [13]. The domain is divided into uniform cells of side length $h$, and a cell-nodal discretization is employed, where the field unknowns $u_{i,j}$ (or $u_{i,j,k}$ in 3D) are perceived to be located at the cell corners (which are cell centers of the dual grid) while $\mu_{i+1/2,j+1/2}$ values are at cell centers (cell corners of the dual grid). For the finite volume derivation, the PDE (3.5a) is written first as

$$\mathbf{j} = \mu(x)\nabla u, \quad x \in \Omega, \tag{A.1a}$$
$$\nabla \cdot \mathbf{j} = q(x), \quad x \in \Omega, \tag{A.1b}$$

and then both first order PDEs are integrated prior to discretization. A subsequent, standard removal of the constant null-space then results in the discretized problem (1.3).

In 2D, let us assume here for notational simplicity that the source $q$ is a $\delta$-function centred at a point in the finite volume cell $(i^*, j^*)$. The actual sources used in our experiments are combinations of such functions, as detailed in Section 3.3.2, and the discretization described below is correspondingly generalized in a straightforward manner.
We write for the flux, $v = (v^x, v^y)^T$, expressions such as $u_x = \mu^{-1} v^x$ at the eastern cell face $x = x_{i+1/2}$. Setting

$$\mu^{-1}(x_{i+1/2}, y) \approx \mu^{-1}_{i+1/2,j} = \frac{1}{2}\left(\mu^{-1}_{i,j} + \mu^{-1}_{i+1,j}\right),$$
$$v^x_{i+1/2,j} = h^{-1}\int_{y_{j-1/2}}^{y_{j+1/2}} v^x(x_{i+1/2}, y)\,dy,$$

and integrating yields

$$v^x_{i+1/2,j} = \mu_{i+1/2,j}\,\frac{u_{i+1,j} - u_{i,j}}{h}.$$

Similar expressions are obtained at the other three faces of the cell. Then integrating (A.1b) over the cell yields

$$\left[\mu_{i+1/2,j}(u_{i+1,j} - u_{i,j}) - \mu_{i-1/2,j}(u_{i,j} - u_{i-1,j}) + \mu_{i,j+1/2}(u_{i,j+1} - u_{i,j}) - \mu_{i,j-1/2}(u_{i,j} - u_{i,j-1})\right] = \begin{cases} 1 & \text{if } i = i^* \text{ and } j = j^*, \\ 0 & \text{otherwise}, \end{cases} \quad 1 \le i, j \le 1/h. \tag{A.2}$$

Repeating the process in 3D (with an obvious notational extension for the source location) yields the formula

$$h\left[\mu_{i+1/2,j,k}(u_{i+1,j,k} - u_{i,j,k}) - \mu_{i-1/2,j,k}(u_{i,j,k} - u_{i-1,j,k}) + \mu_{i,j+1/2,k}(u_{i,j+1,k} - u_{i,j,k}) - \mu_{i,j-1/2,k}(u_{i,j,k} - u_{i,j-1,k}) + \mu_{i,j,k+1/2}(u_{i,j,k+1} - u_{i,j,k}) - \mu_{i,j,k-1/2}(u_{i,j,k} - u_{i,j,k-1})\right] = \begin{cases} 1 & \text{if } i = i^* \text{ and } j = j^* \text{ and } k = k^*, \\ 0 & \text{otherwise}, \end{cases} \quad 1 \le i, j, k \le 1/h, \tag{A.3}$$

where, e.g.,

$$\mu^{-1}_{i+1/2,j,k} = \frac{1}{4}\left(\mu^{-1}_{i+1/2,j+1/2,k+1/2} + \mu^{-1}_{i+1/2,j+1/2,k-1/2} + \mu^{-1}_{i+1/2,j-1/2,k+1/2} + \mu^{-1}_{i+1/2,j-1/2,k-1/2}\right).$$

The derivation is entirely parallel to the 2D case, although note the extra factor $h$ multiplying the left hand side in (A.3), which arises due to the special nature of the source $q$.

The boundary conditions are discretized by applying, say, (A.2) at $i = 1$ and utilizing (3.5b) to set $u_{0,j} = u_{2,j}$ and $\mu_{-1/2,j} = \mu_{1/2,j}$. Combining all this results in a linear system of the form (1.3) which is positive semi-definite and has a constant null space, as does the PDE problem (3.5). This null-space is removed using standard techniques.

The method employed for solving the resulting linear system does not directly affect our considerations in this thesis. For the sake of completeness, however, let us add that, given the large number of right hand sides in problems such as (1.3) that must be solved, a direct method which involves one Cholesky decomposition followed by forward and backward substitution for each right hand side is highly recommended.
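The structure of the 2D stencil (A.2) can be sketched in a few lines. This is a simplified illustration and not the code used in the thesis: the function name is hypothetical, $\mu$ is assumed to be sampled at the same index set as $u$, face values are obtained by harmonic averaging of the two neighbouring values, and the homogeneous Neumann condition is imposed by simply dropping fluxes through the boundary instead of the ghost-point treatment described above. The sign is chosen so that the assembled matrix is positive semi-definite.

```python
def assemble_fv_operator(mu):
    """Dense five-point finite volume matrix on an n-by-n grid.

    mu[i][j] > 0 is the coefficient at grid point (i, j). Each interior face
    between two neighbouring points contributes a flux with the harmonic
    average of the two coefficients; faces on the boundary carry zero flux.
    """
    n = len(mu)
    size = n * n
    A = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            p = i * n + j
            # Visit the "east" and "north" neighbours so each face is seen once.
            for ii, jj in ((i + 1, j), (i, j + 1)):
                if ii >= n or jj >= n:
                    continue  # zero flux through the domain boundary
                mu_face = 2.0 / (1.0 / mu[i][j] + 1.0 / mu[ii][jj])
                q = ii * n + jj
                A[p][p] += mu_face
                A[q][q] += mu_face
                A[p][q] -= mu_face
                A[q][p] -= mu_face
    return A


# Small hypothetical coefficient field on a 3-by-3 grid.
mu = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
A = assemble_fv_operator(mu)

# Every row sums to zero, so constant vectors lie in the null space.
row_sums = [sum(row) for row in A]
print(max(abs(s) for s in row_sums))
```

Because each face contributes a symmetric 2-by-2 block, the assembled matrix is symmetric with zero row sums, which mirrors the constant null space of the continuous Neumann problem noted in the text.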
If the program runs out of memory (on our system this happens in 3D for $h = 2^{-6}$) then we use a preconditioned conjugate gradient method with an incomplete Cholesky decomposition for a preconditioner.

A.2 Taking Advantage of Additional A Priori Information

In general, we wish to recover $\mu(x)$ based on measurements of the field $u(x)$ such that (3.5) approximately holds. Note that, since the measurements are made only at relatively few locations, e.g., the domain's boundary rather than every point in its interior, the matrices $P_i$ (whether or not they are all equal) all have significantly more columns than rows. Moreover, this inverse problem is notoriously ill-posed and difficult in practice, especially for cases where $\mu$ has large-magnitude gradients. Below we introduce additional a priori information, when such is available, via a parametrization of $\mu(x)$ in terms of $m(x)$ (see also [47]). To this end let us define the transfer function

$$\psi(\tau) = \psi(\tau; \theta, \alpha_1, \alpha_2) = \alpha \tanh\left(\frac{\tau}{\alpha\theta}\right) + \frac{\alpha_1 + \alpha_2}{2}, \quad \alpha = \frac{\alpha_2 - \alpha_1}{2}. \tag{A.4}$$

This maps the real line into the interval $(\alpha_1, \alpha_2)$ with the maximum slope $\theta^{-1}$ attained at $\tau = 0$.

1. In practice, often there are reasonably tight bounds available, say $\mu_{\min}$ and $\mu_{\max}$, such that $\mu_{\min} \le \mu(x) \le \mu_{\max}$. Such information may be enforced using (A.4) by defining

$$\mu(x) = \psi(m(x)), \quad \text{with } \psi(\tau) = \psi(\tau; 1, \mu_{\min}, \mu_{\max}). \tag{A.5}$$

2. Occasionally it is reasonable to assume that the sought conductivity function $\mu(x)$ takes only one of two values, $\mu_I$ or $\mu_{II}$, at each $x$. Viewing one of these as a background value, the problem is that of shape optimization. Such an assumption greatly stabilizes the inverse problem [4]. In [46, 131, 132] we considered an approximate level set function representation for the present problem. We write $\mu(x) = \lim_{h \to 0} \mu(x; h)$, where

$$\mu(x; h) = \psi(m(x); h, \mu_I, \mu_{II}). \tag{A.6}$$

The function $\psi(\tau; h)$ depends on the resolution, or grid width $h$.
It is a scaled and mollified version of the Heaviside step function, and its derivative magnitude is at most $O(|\mu_I - \mu_{II}|/h)$. Thus, as $h \to 0$ the sought function $m(x)$ satisfying

$$\nabla \cdot (\psi(m(x))\nabla u_i) = q_i, \quad i = 1, \ldots, s, \tag{A.7}$$
$$\left.\frac{\partial u_i}{\partial n}\right|_{\partial\Omega} = 0,$$

has bounded first derivatives, whereas $\mu(x)$ is generally discontinuous.

Establishing the relationship between $\mu$ and $m$ completes the definition of the forward operators $f_i(m)$ by (1.4).

A.3 Stabilized Gauss-Newton

Here we briefly describe the modifications made to the GN method (1.11), turning it into the stabilized GN method used in our experiments. The matrix at the left hand side of (1.11a) is singular in the usual case where $l < l_m$, and therefore this linear system requires regularization. Furthermore, $\delta m$ also requires smoothing, because there is nothing in (1.11) to prevent it from forming a non-smooth grid function. These regularization tasks are achieved by applying only a small number of PCG iterations towards the solution of (1.11a), see [131, 133]. This dynamic regularization (or iterative regularization [78]) is also very efficient, and results in a stabilized GN iteration. An adaptive algorithm for determining a good number of such inner iterations is proposed in [46]. However, here we opt to keep this number fixed at $r$ PCG iterations independently of $n$, in order to be able to compare other aspects of our algorithms more fairly. Further, the task of penalizing excessive non-smoothness in the correction $\delta m$ is achieved by choosing as the preconditioner a discrete Laplacian with homogeneous Neumann boundary conditions. This corresponds to a penalty on $\int |\nabla m(x)|^2$ (i.e., least squares but not total variation).

The modified GN iteration described above is our outer iteration, and the entire regularization method is called dynamical regularization [77, 78, 113, 114, 131, 133]. The essential cost in terms of forward operator simulations comes through (1.11a) from multiplying $J_i$ or $J_i^T$ by a vector.
Each such multiplication costs one forward operator simulation, hence $2rs$ simulations for the left hand side of (1.11a) (or $2rn_k$ in case of (2.6a)). The evaluation of the gradient costs another $2s$ forward operator evaluations per outer iteration. Considering $K$ GN outer iterations, this gives the work under-estimate formula (1.12). This still neglects, for clarity, the additional line search costs, although the additional forward operator simulations necessitated for determining $\alpha_k$ in (1.11b) have of course been counted and included in the work tallies reported in all the tables in this thesis.

A.4 Matlab Code

Here we provide a short Matlab code, promised in Section 7.1, to calculate the necessary or sufficient sample sizes to satisfy the probabilistic accuracy guarantees (7.2) for an SPSD matrix using the Gaussian trace estimator. This code can be easily modified to be used for (7.10) as well.

    function [N1,N2] = getSampleSizes(epsilon,delta,maxN,r)
    % INPUT:
    % @ epsilon: Accuracy of the estimation.
    % @ delta: Uncertainty of the estimation.
    % @ maxN: Maximum allowable sample size.
    % @ r: Rank of the matrix (use r = 1 for obtaining the sufficient sample sizes).
    % OUTPUT:
    % @ N1: The sufficient (or necessary) sample size for (7.2a).
    % @ N2: The sufficient (or necessary) sample size for (7.2b).
    % N1 (or N2) is empty if no sample size up to maxN satisfies the condition.
    Ns = 1:1:maxN;
    P1 = gammainc(Ns*r*(1-epsilon)/2,Ns*r/2);
    I1 = find(P1 <= delta,1,'first');
    N1 = Ns(I1); % Necessary/sufficient sample size obtained for (7.2a).
    Ns = (floor(1/epsilon)+1):1:maxN;
    P2 = gammainc(Ns*r*(1+epsilon)/2,Ns*r/2);
    I2 = find(P2 >= 1-delta,1,'first');
    N2 = Ns(I2); % Necessary/sufficient sample size obtained for (7.2b).
    end

A.5 Implementation of Total Variation Functional

For the Total Variation (TV) regularization, $R(m)$ in (1.7) is the discretization of the TV functional

$$TV(m) = \int_\Omega |\nabla m(x)|,$$

where $\Omega$ is the domain under investigation.
Consider a 2D square domain, which is divided into uniform cells of side length $h$, resulting in $N^2$ cells. Let $(x_i, x_j)$ be the center of the cell $(i, j)$. A usual discretization of this TV integral, using this mesh, is obtained as

$$TV(m) \approx \sum_{i,j=1}^{N} h^2 \left( \left|\frac{\partial m(x_i, x_j)}{\partial x}\right| + \left|\frac{\partial m(x_i, x_j)}{\partial y}\right| \right),$$

where $\partial m(x_i, x_j)/\partial x$ is the value of $\partial m/\partial x$ at the center of the cell $(i, j)$. The standard approach for obtaining $|\partial m(x_i, x_j)/\partial x|$ is by "averaging the square of the differences" among cell values. More specifically, for $i, j = 1, \cdots, N$, letting $m_{i,j}$ denote the grid value of $m(x)$ at cell $(i, j)$, we get

$$\left|\frac{\partial m(x_i, x_j)}{\partial x}\right| \approx \frac{1}{h}\sqrt{\frac{(m_{i+1,j} - m_{i,j})^2 + (m_{i,j} - m_{i-1,j})^2}{2}}, \tag{A.8a}$$
$$\left|\frac{\partial m(x_i, x_j)}{\partial y}\right| \approx \frac{1}{h}\sqrt{\frac{(m_{i,j+1} - m_{i,j})^2 + (m_{i,j} - m_{i,j-1})^2}{2}}. \tag{A.8b}$$

Next, we form the vector $m$ consisting of the $m_{i,j}$ values. Let $D_x$ and $D_y$ be the matrices that implement the difference operations in (A.8) in the $x$ and $y$ directions, respectively. Similarly, let $A_x$ and $A_y$ be the matrices that implement the averaging operations in (A.8) in the $x$ and $y$ directions, respectively. Now we can write (A.8) in vectorized form as

$$R(m) = \mathbf{1}^T \sqrt{A_x (D_x m)^2 + A_y (D_y m)^2},$$

where $\mathbf{1}$ is a vector of 1's, and the square and absolute value are taken pointwise.

One can introduce differentiability to $R(m)$ by

$$R_\varepsilon(m) = \mathbf{1}^T \sqrt{A_x (D_x m)^2 + A_y (D_y m)^2 + \varepsilon \mathbf{1}},$$

for some $\varepsilon \ll 1$. An alternative is to use the Huber switching function [55, 85, 122]. Extension of the above procedure to 3D is straightforward.
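The smoothed functional $R_\varepsilon(m)$ can be sketched directly on a 2D array of cell values, without forming the matrices $D_x$, $D_y$, $A_x$, $A_y$ explicitly. The following is a minimal illustration rather than the thesis implementation: the function name is hypothetical, the result is scaled by the cell area $h^2$ as in the discretized integral, and differences across the domain boundary are assumed to be zero, a choice the text does not specify.

```python
import math


def smoothed_tv(m, h, eps):
    """Smoothed total variation of cell values m[i][j] on an N-by-N grid.

    Squared forward and backward differences are averaged in each direction,
    following (A.8), and eps is added under the square root for smoothness.
    Differences that would reach outside the grid are taken to be zero.
    """
    n = len(m)
    total = 0.0
    for i in range(n):
        for j in range(n):
            fx = m[i + 1][j] - m[i][j] if i + 1 < n else 0.0
            bx = m[i][j] - m[i - 1][j] if i > 0 else 0.0
            fy = m[i][j + 1] - m[i][j] if j + 1 < n else 0.0
            by = m[i][j] - m[i][j - 1] if j > 0 else 0.0
            gx = (fx * fx + bx * bx) / (2.0 * h * h)  # averaged (dm/dx)^2
            gy = (fy * fy + by * by) / (2.0 * h * h)  # averaged (dm/dy)^2
            total += h * h * math.sqrt(gx + gy + eps)
    return total


N = 8
h = 1.0 / N
constant = [[3.0] * N for _ in range(N)]
step = [[0.0 if i < N // 2 else 1.0 for _ in range(N)] for i in range(N)]

tv_constant = smoothed_tv(constant, h, 1e-8)
tv_step = smoothed_tv(step, h, 0.0)
print(tv_constant, tv_step)
```

For a constant field only the $\varepsilon$ term contributes, so the value is close to zero, while a unit jump across the middle of the domain gives a value of order one that does not grow as the grid is refined.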
Item Metadata
Title | Randomized algorithms for solving large scale nonlinear least squares problems |
Creator |
Roosta-Khorasani, Farbod |
Publisher | University of British Columbia |
Date Issued | 2015 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-04-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0166123 |
URI | http://hdl.handle.net/2429/52663 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2015-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
Aggregated Source Repository | DSpace |