Developing Learned Regularization for Geophysical Inversions by Cliad Hewson B.Sc.(hon.), The University of British Columbia. 2001 A THESIS SUBMITTED IN PARTIAL F U L F I L M E N T OF T H E REQUIREMENTS FOR T H E D E G R E E OF MASTER OF SCIENCE hi The Faculty of Graduate Studies (Department of Earth and Ocean Sciences) We accept this thesis as conforming to the required standard T H E UNIVERSITY OF BRITISH COLUMBIA April 21. 2004 © Chad Hewson, 2004 Library Authorization In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Name of Author (please print) Date (dd/mm/yyyy) Title of Thesis: Degree: ' ' ^ c -Department of Year: The University of British Columbia Vancouver, BC Canada Abstract ii Abstract Geophysical inverse problems can be posed as the minimization of an objective function where one term {4>d) is a data misfit function and another (4>m) is a model regularization. In current practice, <j>m is posed as a mathematical operator that potentially includes limited prior information on the model, m. This research focusses on the specification of learned forms of <pm from information on the model contained in a training set, Mj\ This is accomplished via three routes: probabilistic, deterministic (model based) and the Haber-Tenorio (HT) algorithm. In order to adopt a pure probabilistic method for finding a learned <f>m, equivalence between Gibbs distributions and Markov random fields is established. As a result, the prior probability of any given model is reduced to the interactions of cells in a local neigh-bourhood. Gibbs potentials are defined to represent these interactions. The case of the multivariate Gaussian is used due to its expressible form of normalization. <pm is parame-terized by a set of coefficients, 0, and the recovery of these parameters is obtained via an optimization method given M p . For non-Gaussian distributions 0 is recovered via Markov chain Monte Carlo (MCMC) sampling techniques and a strategy to compare different forms of <f>m is introduced and developed. The model based deterministic route revolves around independent identically distributed (i.i.d.) assumptions on some filter property of the model, z = /(m). Given samples of z, two methods of expressing its corresponding <j>m are developed. The first requires the expression of a generic distribution to which all the samples of z are assumed to belong. Methodology to translate z into usable data and recover the corresponding 4>m is developed. Although there are ramifications of the statistical assumptions, this method is shown to translate significant information on z into 4>m. Specifically, the shape of the 4>m functional is maintained and, as a result, the deterministic 4>m performs well in geophysical inversions. This method is compared with the parametrization of the generalized Gaussian (p-norm) for z. Agreement between the generic (j)m and generalized Gaussian helps validate the specific choice of norm in the probabilistic route. The HT algorithm is based around the notion that the geophysical forward operator should help determine the form of cf>m. The strategy of Haber and Tenorio [16] is introduced and an algorithm for the recovery of 9 is developed. Two examples are used to show a case where the HT algorithm is advantageous and one where it does not differ significantly from the probabilistic route. Finally, a methodology to invert geophysical data with generic learned regularization is developed and a simple example is shown. For this example, the generic deterministic method is shown to transfer the most information from the training set to the recovered model. Difficulties with extremely non-linear objective functions due to learned regulariza-tion are discussed and research into more effective search algorithms is suggested. Contents iii Contents Abstract ii Contents iii List of Figures vi List of Tables viii Acknowledgements ix 1 Introduction 1 1.1 The Forward Problem 1 1.2 The Inverse Problem 2 1.2.1 Non-Uniqueness 2 1.2.2 Noisy Data in Inverse Problems 4 1.2.3 Regularizing the Inverse Problem 5 1.3 Typical Regularization Operators 5 1.3.1 The Smallest Model 5 1.3.2 The Flattest Model 6 1.3.3 Combining Regularization 7 1.3.4 Incorporating a Reference Model 8 1.4 Choosing the Norm 9 1.5 Incorporating Prior Knowledge 10 1.5.1 Current Practices 11 1.5.2 Correcting Mathematical Deficiencies 13 1.6 Learned Regularization 14 2 Probabilistic Regularization 16 2.1 Introduction 16 2.2 Probability Theory in Regularization 16 2.2.1 Bayes Theorem and Prior Distributions 16 2.2.2 Specifying the Likelihood and Prior Distributions 18 2.2.3 Neighbourhoods, Cliques and Markov Random Fields 18 2.3 Learning Probabilistic Regularization Functionals 20 2.3.1 Step (a): Obtain Prior Information and Create Training Set 20 2.3.2 Step (b): Quantify Training Set in Gibbs Distribution 21 2.3.3 Step (c): Probabilistic Inversion to Recover <f>m 24 2.4 Search Algorithms 25 2.4.1 The Newton Step 25 2.4.2 Constrained Optimization: Positivity 26 Contents iv 2.5 Synthetic Examples 28 2.5.1 Generating Synthetic Models 28 2.5.2 Recovering the a values 29 2.6 Covariance Model Selection 31 2.6.1 Theory 32 2.6.2 Markov Chains . . . 33 2.6.3 M C M C sampling: The Metropolis-Hastings Algorithm 33 2.6.4 M C M C sampling: Gibbs sampler 36 2.6.5 Implementation 36 2.6.6 Numerical example 39 2.7 Heterogeneity in a 40 2.8 Regularization in the O(M) Inversion 41 2.8.1 Approach 42 2.8.2 Cross-validation 43 2.8.3 Example 45 2.9 Non-Gaussian Distributions 45 2.9.1 Normalization in the p-norm 45 2.9.2 Approximating the Normalization 46 2.9.3 Examples 47 2.9.4 Comparing p-distributions 49 2.10 Summary 49 3 Deterministic Regularization 51 , 3.1 Introduction 51 3.2 Theory 52 3.3 Recovering R(z) as an Unknown Function 53 3.3.1 Binning Strategies 53 3.3.2 The Forward Model 54 3.3.3 The Effect of the Normalization 55 3.3.4 Assigning Data Errors 56 3.3.5 Regularizing r 57 3.3.6 Solving for r • • • 57 3.3.7 Testing the Algorithm 58 3.3.8 Synthetic Example 62 3.4 Parameterizing R(z) as a p-Norm . 65 3.4.1 Numerical results 66 3.5 Summary 67 4 The HT algorithm 69 4.1 Background • 69 4.2 Theory 69 4.2.1 Basics 69 4.2.2 Algorithm Construction 70 4.3 Testing the HT Algorithm 72 4.3.1 Cross-well Seismic tomography 72 4.3.2 The 2-D Gravity Experiment 74 4.4 Summary 77 Contents v 5 Comparing Regularization Functionals 78 5.1 Concept 78 5.2 Inversion with Generic Regularization 78 5.3 Forms of R(m) 80 5.3.1 Multivariate Gaussian 80 5.3.2 The p-norm 82 5.3.3 Generic R(z) 83 5.3.4 The Hybrid Global-Local Search Algorithm 84 5.4 Examples 85 5.4.1 Training set 4: Box in Homogenous Halfspace 85 5.5 Summary 92 6 Summary and Future Work 93 Bibliography 95 List of Figures vi List of Figures 1.1 A Typical Inversion Problem 2 1.2 Inversion results with different regularization. (a) True model, (b) Smallest model, (c) Flattest horizontal model, (d) Flattest vertical model, (e) Ray-paths used to acquire data 8 1.3 Inversion results with combined regularization. (a) True model, (b) Result with (as, ax, az) = (IO - 3,1,1) and mref = 0, (c) First reference model, (d) Result with (as,ax,az) — (10~3,1,1) and m.ref from (c), (e) Second reference model, (f) Result with {as,ax,az) = (10 - 3,1,1) and m r e y from (e) 9 1.4 Comparison of L2 and LI regularization functionals 10 1.5 Recovered models using different model norms, (a) true model, (b) L2 norm solution, (c) LI norm solution 11 1.6 Flowchart outlining increasing sophistication for regularization in geophysi-cal inversion 12 2.1 Example of neighbourhood in a mesh for one cell (black). Its first order neighbourhood consists of the dark grey cells. The second order neighbour-hood consists of the dark and light grey cells 19 2.2 Flowchart of complete inversion process 20 2.3 Examples of possible prior models 22 2.4 y^cjjoo°° f ° r (a) increasing K value, (b) varying as, K=1000, (c) varying ax, K=1000 and (d) varying az, K=1000 29 2.5 (a) Resulting C matrix for C = {WjWs)-1, (b) C = ( W j W x ) - \ (c) C = (w?wz)-\(d)c = (wTwa + wZwx + wTwt)-1 30 2.6 Variation of covariance matrix with changing as. The variance of the model parameters decreases with increasing as 30 2.7 Flowchart outlining covariance model selection algorithm via Gibbs sampling 35 2.8 Different possibilities for p{6i\C~l) 37 2.9 Markov chains for particular 6 vector for (a) C f 1 , (b) C f 1 and (c) C f 1 , . 39 2.10 Results of O(M) parameter alpha inversion, (a) ax recovered weights, (b) az recovered weights 42 2.11 Results of O(M) parameter alpha inversion with regularization. (a) Cross-validation values, (b) ax recovered weights, (c) az recovered weights . . . . 44 2.12 M C M C sampling results for different p values, (a) ax, (b) az, (c) as . . . . 47 2.13 (a) Example of training set (ax, az, as) = (10,1,0.01), (b) covariance matrix, C2, pertaining to training set 48 3.1 Definition of dual grid for recovery of R(z): w is the data grid and r-j denotes the i t h model cell 53 List of Figures vii 3.2 Comparison of different error methods with true errors, e represents the error value 59 3.3 Data for varying N: (a) N=20, ||e||2 = 0.313, (b) N=40, ||e||2 = 0.322, (c) N=32, ||e||2 = 0.318 '. 59 3.4 Resulting models for data in Figure 3.3 60 3.5 Resulting models for varying II value 61 3.6 (a)Ll sample histogram compared to true distribution, (b) Resulting data from samples, (c) Recovered and true models 62 3.7 Data and errors for (a) zi, (b) z2 and (c) Z3 63 3.8 Recovered and true R{zi) for (a) zi , (b) z 2 , (c) Z 3 and (d) resulting total Q Ri(zi) for deterministic regularization compared to true probabilistic value. 64 i=l 3.9 Recovered (functional and ^distribution) and true R(zi) for (a) zi, (b) z2 and (c) Z3 67 4.1 Recovered models for H T algorithm and multivariate Gaussian for set TS2: (a) mi , (b)m2, (c)m3, (d)m4, (e)m5 74 4.2 Recovered models for H T algorithm and smallest model for set TS3: (a)mi, (b) m2, (c)m3, (d)m4, (e)m5. 76 5.1 Recovered R{z\) for TS4 using multivariate and deterministic methods. . . 87 5.2 Four models obtained using a hybrid search approach with optimal p-norm regularization. Iteration 28 was the final result 88 5.3 Four models obtained using a hybrid search approach with generic R(z) regularization. Iteration 23 was the final result 89 5.4 (a)True model, (b)-(f) Recovered models using five learned regularization functionals from TS4 91 List of Tables viii List of Tables 2.1 Results of several 3-parameter a inversions. The asterisk indicates that the program ran to the maximum number of iterations 31 2.2 Comparison between optimization results and statistics of Markov chains. . 40 2.3 Statistics of Markov chains for different p-distributions 48 3.1 Results for optimal p-distributions 66 4.1 Results of HT algorithm and multivariate Gaussian for TS2 73 4.2 Results of H T algorithm for gravity example. Values of a shown are multi-plied by 103. Fi indicates how many cells in layer i contained block values. 75 4.3 Recovered and Expected a ratios. . 76 5.1 Statistics of Markov chains for different p-distributions for TS4. The asterisk indicates results are from the optimization code 86 5.2 Optimal p-norm parameters for TS4 87 5.3 Numerical summary of hybrid search for m using optimal p-norm regular-ization 88 5.4 Numerical summary of hybrid search for m using generic R(z) regularization. 90 5.5 Numerical evaluation of recovered models. . 90 Acknowledgements ix Acknowledgements This research is the culmination of several years of study to which many people contributed both directly and indirectly. Firstly, my supervisor, Dr. Doug Oldenburg, deserves much credit for his wisdom and vision that he conveyed to me over the last few years. Through his guidance, I have grown as both a person and scientist and I thank him for that. In addition, the researchers and students at the Geophysical Inversion Facility (UBC-GIF) deserve recognition for their insights. Finally, my family (Brian, Donna and Tricia) and close friends (Mariela, Alastair and Pat) were vital due to their constant support of me as well as their provision of balance between academic and non-academic life. Chapter 1. Introduction 1 Chapter 1 Introduction In order to collect information about the composition of the Earth below the subsurface, geophysicists make use of remote sensing methods. Sometimes natural signals can be mon-itored or measured (e.g. gravitational or magnetic fields) and, in other cases, signals are measured due to artificial sources (e.g. high frequency electromagnetics). No matter what geophysical survey is conducted, the result is data that are directly dependent on the sub-surface of the Earth. The geophysicist is then faced with the task of interpretation of these data. Often, this interpretation will result in an inference of subsurface physical property. For example, a gravity survey will result in a model of Earth density while an electro-magnetic survey may result in one of electrical conductivity. Thus, geophysical surveys are usually chosen based on desirable physical property information. The most important point, however, is that there exists a mathematical link between the measured quantity (data) and physical property distribution (model). Figure 1.1 illustrates this link. 1.1 The Forward Problem Being able to calculate data values from a particular model is referred to as the forward problem. Often the forward problem is represented as d = F[m] (1.1) where F{-] is called the forward operator. When the forward problem is linear, it may simply be written as d = Gm where G contains kernel information. It is easier to consider a linear problem because the problem is reduced to a simple matrix-vector multiplication. G is defined as size N by M such that M model parameters yield ./V data. However, it is the rank of G that is more pertinent. For instance, if (after an eigenvalue decomposition of G) there are only K independent basis vectors, then there are only K independent data. When more independent data are acquired, there is more information about the model obtained. Ideally, one would want to acquire M independent data values. This would mean that the entire model space would be sampled. Figure 1.1 illustrates this concept. On the left is a graphical representation of model space. Within model space there are M degrees of freedom. Suppose a geophysical experiment is run where N (N < M) data are acquired. Since N < M it is apparent that the whole of model space has not been sampled. As a result, model space is divided into two categories: mA is the activated portion of model space and mUA is the ,un-activated (or annihilator) portion of model space. Since there are N kernels, we can represent the transition from model space to data space as d = F[m^]. However, the kernels never sampled the un-activated model space. Thus, i ?[mC M] = 0. The term annihilator is appropriate for this portion of the model space because any amount of these model parameters can be added to a model without affecting the value of d. Having a viable forward problem is vital in geophysics. It is used widely in survey design and in the interpretation stages of the experiment. It is often good practice to design models Chapter 1. Introduction 2 Figure 1.1: A Typical Inversion Problem representing different scenarios so as to ensure that the chosen survey will adequately sample this area of model space. After data have been collected, forward modelling can be used by the geophysicist to try to compose a model that fits the observed data well. Finally, it is a vital tool in the inversion process where it is used to calculate predicted data for all proposed models. 1.2 The Inverse Problem Above, the route from model space to data space in Figure 1.1 was introduced. Naturally, the opposite application is desirable; recover the model from the data. After performing a survey, the geophysicist is left with a set of numbers that relate to the subsurface. Although there are many methods of interpretation, inversion is by far the most sophisticated. 1.2.1 Non-Uniqueness If we had access to M independent data values in a linear problem, inversion would be trivial. The solution could be obtained by m = G _ 1 d. However, this is rarely the case. Due to the desired resolution in most problems, the Earth has to be divided into many pa-rameters. Often, in a mineral exploration scale, this number can exceed 100000. Regardless of the survey, it is unlikely that we could obtain 100000 independent data. As a result, in general, geophysical inversion problems are under-determined (N « M) such that there is an inherent non-uniqueness in the problem. One can think of non-uniqueness as being Chapter 1. Introduction 3 influenced by the annihilator portion of model space. If this portion of model space can assume any value, then there is an infinite number of models that satisfy the problem. To illustrate this example, consider the following number problem where N — M. d = 0.0893 0.4070 -0.1645 0.2477 0.6481 0.8115 0.4263 1.4621 0.4826 m (1.2) If m = [6,2,3]r, then d = [0.8565,5.2166,6.9296]T. Since G in this case is square, one would think that the solution could be m = G~^d. However, a closer look at G reveals that this is not the case. Consider the singular value decomposition of G such that G = USVT where U contains the data basis vectors, S contains the eigenvalues and V contains the model basis vectors. Performing the SVD, G can be expressed as follows. -0 .1628 -0 .5156 -0 .8412 0.5539 -0 .8165 -0 .7533 -0 .4082 0.3545 0.4082 1.8828 0 0 0 0.5898 0 -0.2660 0.0239 -0.8659 0.4334 -0.4236 - 0 . 9 0 0 9 0.9637 -0 .2498 -0 .0946 (1.3) Immediately, one should notice that there are only 2 non-zero eigenvalues. Thus, only the first two eigenvectors are required to reconstruct G. -0 .1628 -0 .5156 -0 .8412 0.5539 -0 .7533 0.3545 1.8828 0 0 0.5898 -0.2660 0.0239 -0.8659 0.4334 -0.4236 - 0 . 9 0 0 9 In general terms, it can be written as G = UpSpVpT (1.4) (1.5) where p refers to the number of non-zero eigenvalues. Replacing G in (1.2) with (1.5), the following solution for m can be obtained. d = UpSpVpTm C/Jd = SpVpTm (1.6) Vjm VpS~lU^d = VpVpTm m c = VpVpm is defined as the pseudo-inverse solution (VpVp — I if model space was completely spanned). For the particular values of d and m defined above, the pseudo-inverse solution is mc = [1.1830,3.2484,3.4731]T. Thus, the pseudo-inverse solution is not very accurate. Since only p basis vectors are used then the remaining vector must be an annihilator for the problem. The product of the third model basis vector (v^) and G should be zero. 0.0893 0.2477 0.4263 0.4070 0.6481 1.4621 -0.1645 0.8115 0.4826 0.9637 0.5378 x 10~1 6 -0.2498 = -0.6939 x IO" 1 6 (1.7) -0.0946 -0.3469 x 10~1 6 These values are very near machine precision and are essentially zero. As a result, we can represent the solution as d = G(mc + nv 3) (1.8) where n is any real number. Thus, there are infinitely many solutions to the problem and the problem is non-unique. Chapter 1. Introduction 4 1.2.2 N o i s y D a t a i n Inverse P r o b l e m s One thing about the relation in (1.8) is that regardless of the choice of n, each model completely fits the data. This is desirable if the data are perfectly accurate. However, in most geophysical inverse problems this is not the case. Firstly, inaccuracies in the measuring device must be considered. Secondly, there may be secondary geophysical signals from natural phenomena, man-made structures, and other bodies that are not of interest. Regardless of what the sources are, noise is a real problem in geophysical data. As a result, when modelling data, one should attempt to account for it. The easiest way to account for the noise in geophysical data is to model it with a statistical distribution. The most common way is with a Gaussian distribution. Defining the noise random variable as e, its distribution can be written as p(e) = ^e~£ (1.9) where a is the standard deviation of the noise. When inverse modelling, the goal is to produce a model, m, where d = Gm. However, since there is noise in the data, it is more prudent to find models that have a residual relating to the noise. e = d - G m (1.10) Since there are N realizations of the noise that are assumed to be indepedent, the probability of the noise vector will be the multiplication of each distribution. N 2 p{€) = ^e~&^ (1.11) Thus, to get an estimate of the model, we need a maximum likelihood estimate of the noise by re-parameterizing it in terms of the model. . 1 ^ (dj - (Gm),) 2 mvnm fa = - > 5 ( L 1 2 ) i=i 1 This is defined as the data misfit. It can be rewritten in a matrix-vector form as 4>d = ±\\Wd(d-Gm)\\2 (1.13) where contains ^ elements along its diagonal. The solution calculated via (1.12) will still reproduce the data exactly (when N < M). In order to find a model that does not reproduce the data exactly, a desired value for fa must be specified. Statistically, any n quantity yi = where Xi is distributed normally is distributed as chi-squared (x2)-i=l This x 2 distribution must have n degrees of freedom such that E[y] = n. Looking back at (1.12) fa is the sum of the squared, normally distributed e. As a result, its expected value is as follows. E[fa\ = XN (1.14) Statistically, the misfit must equal the number of data, N, such that A = 1. However, this largely depends on the accuracy of the data errors as well as the distribution of the noise. As a result, A is included if the user feels that the data must be fit more or less closely. Regardless, this is a constraint in the inverse problem - fa = XN. Thus, it seems natural that we must define some other criterion to guide the solution. Chapter 1. Introduction 5 1.2.3 Regularizing the Inverse Problem Although an inverse problem should be very data driven, the presence of noise causes one to be hesitant about reproducing the data exactly. Furthermore, the under-determined nature of geophysical inverse problems makes them mathematically challenging when misfit is the only criterion. As a result, some form of regularization is required in the problem. Earlier in the chapter, non-uniqueness was introduced and explained. In fact, (1.8) showed that there were an infinite number of solutions that could satisfy an N = 2 (one data was a linear combination of the others), M = 3 problem. Thus, it seems natural to try to steer the solution towards a desirable area of model space. By this, it is meant that solutions should be chosen from models that exhibit particular character. This limits the number of possible choices. Typically, the regularization can be expressed as <t>m=\\\Wmm\\2 (1.15)' where Wm is some discretized operator (standard choices will be discussed in the next section). <j>m is referred to as the model norm. Thus, the goal is to find a model that has a minimum value of the model norm subject to fitting (1.13) to a certain degree. The best way of completing this task is to combine (1.13) and (1.15) in a model objective function, 4>. <!> = 4>d + (3cprn (i.i6) 8 is referred to as the regularization parameter and it controls the balance between the model norm and data misfit. For every value of 8 there is a solution. The problem is posed as a minimization problem. minm (j> — fa + B(j>m s.t. <pd = XN (1-17) The goal is to find the value of 3 where the objective function has a minimum with a desirable misfit value. For large values of 8, the misfit will be quite high as the emphasis is being put on the regularization. Conversely, smaller values of 3 yield small misfit values. There are several methods currently available to estimate an appropriate value of 8 [12]. However, the simplest is to make use of (1.14) and line search for the optimal value of 8. 1.3 Typical Regularization Operators The last item remaining in the definition of an inverse problem is the form of the regular-ization. This is perhaps the most difficult question because one has to provide information about what type of model is desirable. Often there is no information about the subsur-face other than geophysical data. As a result, regularization in inverse problems to date has focussed on generic properties of geology. This has provided many useful results with-out adding any specific geologic information [11]. In this section, the standard choices for regularization are introduced and explained. 1.3.1 The Smallest Model Until now, inversion regularization has depended on broad geological concepts. Sometimes these concepts are wholly conceived in a geological framework. However, in this case, part Chapter 1. Introduction 6 of the motivation for including smallness in regularization is mathematical. Smallness is exactly as it sounds - it is a term that enforces smaller values of model parameters to be chosen. Consider a two-dimensional problem. The most general way of incorporating smallness into a model is to define the norm (or size) of the model. (f)s — ^JJ m(x,z)2dxdz (1-18) Here, an L2 norm is used to describe smallness. This is the easiest norm to use mathemat-ically. Discretizing the above integral, it can be approximated as follows. 1 M <ps ~ -^mJAxjAzj (1.19) Rearranging the above into a norm formation and assuming the discretization error is negligible, the following expression is obtained. <j>s = ±\\Wsm\\2 (1.20) WS is the smallness operator and has ^AXJAZJ elements along its diagonal. Since the regularization acts as a penalty function, smaller values of <f>s will be encouraged. The other important point about (ps is that WS is diagonal. Thus, if 4>m = fis, and (3 is large, the inverse problem will be tractable as WS has a stable inverse. The limitation of only using 4>s is that values of the model parameters themselves are regularized. Thus, the minimum norm solution will be recovered without any regard to structure or adjacent parameter interaction. Using 4>s as regularization may be more appropriate for slightly under-determined or over-determined problems. Figure 1.2 shows results of using different regularization with the same data set. Figure 1.2(e) displays the tomographic raypaths used in the data acquisition. The seismic tomography problem will be introduced later. In this problem, there were 16 data and 400 model parameters. Figure 1.2(a) shows the true model that is representative of a fault separating two geologic bodies overlying a halfspace. Figure 1.2(b) shows the result using 4>m = 4>s.. Although this model fits the data to the specified degree, it is clear that the recovered model is not appropriate. One can see, however, that many of the model parameters are at or near zero. Thus, the regularization has the desired effect. 1.3.2 The Flattest Model The next type of standard regularization is derived from an assumption about geologic structure. The assumption is that the desired model will not have many large jumps in adjacent model parameters. That is, the derivative of the model in either direction should be as small as possible. This will minimize structure and promote gradual change in model parameter value. Again, this is expressible in an L2-norm sense as (the effects of an L2 norm will be explored later) for the horizontal derivative and *' = \ j j ^ d x d z {1.22) Chapter 1. Introduction 7 for the vertical derivative. Again, after discretizing the integrals, they can be re-written in a vector norm. 4>x = ^ l l ^ m H 2 4>z = ±\\Wzm\\2 For a 3 cell by 3 cell mesh, Wx and Wz are Wx = dz dx Wz = dx dz -1 1 0 0 0 0 0 0 0 " 0 -1 1 0 0 0 0 0 0 0 0 0 - 1 1 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 0 0 0 0 0 - 1 1 0 0 0 0 0 0 0 0 -1 1 _ -1 0 0 1 0 0 0 0 0 " 0 - 1 0 0 1 0 0 0 0 0 0 -1 p 0 1 0 0 0 0 0 0 -1 0 0 1 0 0 0 0 0 0 -1 0 0 1 0 0 0 0 0 0 -1 0 0 1 (1.23) (1.24) (1.25) (1.26) assuming dx and dz are constant for every cell. Otherwise, each row has to be multiplied by a cell-centered dimensional ratio. Figure 1.2(c) shows the result when the regularization is 4>x. The model is very flat in the lateral direction and the interface between the overlying layers and the basement is imaged well. Figure 1.2(d) shows the result when the regularization is <f>z. Due to the data, the fault boundary is not imaged well. However, the lateral boundary is now flattened and the model has bigger, coarser jumps in the lateral direction. 1.3.3 C o m b i n i n g R e g u l a r i z a t i o n Figure 1.2 illustrates how different regularization functions yield completely different so-lutions. Each solution has desirable characteristics so it is a natural extension to try to combine the results. The easiest way to do this is by noting that (1.20), (1.23) and (1.24) have similar forms. As a result, we may combine regularization functions (<ps + 4>x + 4>z) as follows. <j>m = as(f>s + ax<j)x + az(f>z = ^mT(asW^Ws + axW^Wx + azW^Wz)m (1.27) = | | | F F m m | | 2 (1.28) The purpose of the a coefficients is to allow the user to control the relative weight of each property in the inversion. In current practice, there is no robust way of determining these coefficients. Often a set of defaults, such as ( a s , a x , a 2 ) = (10 - 3,1,1), is used or they are scaled to represent the spatial size of model space. Figure 1.3(b) shows the result of this type of inversion. It is clear that this does an adequate job of reproducing the true model in Figure 1.3(a) but that the values of the parameters are not correct, nor are the boundaries imaged strikingly well. Chapter I. Introduction S Figure 1.2: Inversion results with different regularization. (a) True model, (b) Smallest model, (c) Flattest horizontal model, (d) Flattest vertical model, (e) Raypaths used to acquire data 1.3.4 Incorporating a Reference Model The last part of the standard regularization procedure involves incorporating a reference model. By this, it is meant that a model is specified that the recovered model should try to resemble. Until this point, everything has been referenced to a halfspace of zeros. That is, we were searching for a smallest model when referencing every parameter to zero. Evidently, it is possible that a more likely value for a parameter is 1 or 2 and that parameter's size should be measured as the distance from this value. Equation (1.28) can be amended simply to include a reference model. (/>m = ^\\Wm(m-mref)\\2 (1.29) How a reference model should be formulated is still to be discussed. However, Figure 1.3 presents two examples of reference models used. The first, in Figure 1.3(c), is a model where information about the halfspace interface is well known. The result, in Figure 1.3(d), has a much clearer halfspace boundary than Figure 1.3(b). Figure 1.3(e) is a reference model where the fault boundary is known well. The inversion result, in Figure 1.3(f), recovers this boundary as well as the parameter values. However, it only improves the result in the Chapter 1. Introduction 9 10 15 position (m) 10 15 position (m) Figure 1.3: Inversion results with combined regularization. (a) True model, (b) Result with (as,ax,az) — (10~3,1,1) and Hire/ = 0, (c) First reference model, (d) Result with (as,ax,az) = (10~3,1,1) and m r e / from (c), (e) Second reference model, (f) Result with (as,ax,az) — (10-3,1,1) and Hire/ from (e) upper half of the model where the reference model was correct. Resolution of the halfspace interface has been lost. Thus, as shown by this example, a reference model can have a profound effect on the inversion result. 1.4 Choosing the N o r m A l l of the theory presented to this point has been based on choosing L2 norms to represent the regularization and the misfit. This was done because L2 norms are mathematically "nice" in that they are quadratic and easily expressible. However, the choice of this norm is completely arbitrary. In order to be considered a norm, a function must comply with the following rules. 1. /(x) = 0-^>x = 0 2. / ( x + y ) < / ( x ) + /(y) 3. /(ex) = |c|/(x) Chapter 1. Introduction 10 - 4 - 3 - 2 - 1 0 1 2 3 4 Figure 1.4: Comparison of L2 and L l regularization functionals These properties are defined as positivity, sub-additivity and homogeneity. The most com-mon norm is the Lp norm. (1.30) In geophysical inverse problems, focus is put primarily on the L l and squared L2 norm. Figure 1.4 compares a L l and squared L2 norm over a range of argument values. The y-axis can be interpreted as a penalty value that coincides with a model value on the x-axis. Both norms are centered on m = 0. The important point to notice is how quickly the penalty increases on the L2 norm compared to the L l . As a result, a solution involving L2 norms will try to keep all values closer to its minimum. This is not the case for the L l norm. There is a slow and gradual increase in penalty function. Thus, it may still promote more model values around its minimum but would not be adverse to a few larger parameters when the data require it. These are fundamental differences between the two norms. The L2 norm tends to smooth while the L l norm produces sparse solutions. Figure 1.5 shows results of using both the L2 norm (Figure 1.5(b)) and the L l norm (Figure 1.5(c)). It is easy to see that the L l norm has much sharper boundaries and is able to image the fault discontinuity better. Because of the nature of the true model, the L l norm produces a superior result. 1.5 I n c o r p o r a t i n g P r i o r K n o w l e d g e At this point, the general inverse problem has been defined. The recovered model depends heavily on the definition of the data misfit, fa, and the model norm, 4>m. Within each measure, several quantities are user-defined. This includes things such as data errors, a coefficients and a reference model. Thus, the inversion depends heavily on prior information no matter how specific it is. This section looks at current practices for quantifying and specifying this information and then introduces new techniques that will be developed in later chapters. Chapter 1. Introduction 11 5 10 15 20 position (m) position (m) position (m) Figure 1.5: Recovered models using different model norms, (a) true model, (b) L2 norm solution, (c) L I norm solution 1.5.1 Current Practices Often geological information is available in an area where geophysical data were collected. This information can have many different forms. It can be a surface map of rock type and other prominent geological features. It could potentially involve some physical property measurements of the rocks. At its most useful level, geological information may be presented as a drill core log from a nearby location. It is also possible that the information is more conceptual - a geologist may favour an ideal model of the geology even if he cannot provide definitive physical information to prove it. A l l of this information could be considered useful geological information. However, prior information need not be only geological. In fact, often multiple geophys-ical surveys are conducted in an area. As a result, a geophysicist can use results from one survey to help constrain another. There is also the possibility of geochemical information being available. This can also provide prior information for a geophysicist to include in the inversion process. Even though there is often an abundance of information available, the most difficult task can be translating it into a useful form. This is the biggest limitation on prior information and is often a stumbling point. As a result, current practices often under-use geophysical information. Figure 1.6 outlines several degrees of prior information use in a flowchart. At Chapter 1. Introduction 12 L E V E L O N E : L E V E L T W O : -"Inversion as a black box" -Defaults used for all parameters -<pm defined in terms o f mathematical quantities -Smallness, flatness included and considered -"Mathematical Adaptation" -Details of tpm altered -a's specified uniquely, m „ f halfspace values changed L E V E L FOUR: -"Norm Adaptation" -<pm amended norm-wise -LI norms replace L2 for blocky solutions L E V E L T H R E E : -"Geologic Information" -cpm amended with geologic information considered - m „ f developed -a's specified uniquely (with geologic information considered) BEYOND L E V E L FOUR: -Prior information collected, organized and translated -Used as constraints for coefficient, norm optimization etc. Figure 1.6: Flowchart outlining increasing sophistication for regularization in geophysical inversion the most basic level, many inversions are performed as a "black box" operation. In this case, the user may not understand the concept of inversion. Often the inversion code will have a set of defaults for user-defined parameters. These defaults are chosen because they work well most of the time. At this level, the regularization is purely mathematical. As a result, the inversion can produce a usable image with minimal prior knowledge incorporation. The second level of incorporating prior information into inversion still has a mathe-matical flavour but the user is informed on the concepts used to develop regularization. Here, <f>m is composed of mathematical operators that represent certain features. In (1.27), it was shown that regularization typically includes smallness and flatness. Without any hard geological evidence, a user may wish to only include smallness or flatness or introduce new mathematical norms such as smoothness (second derivative). Although this type of information does not come from hard evidence, a geoscientist may have reason to define 4>m in this way. After reviewing several inversions, the user may also adapt the reference model to a more appropriate base level but will not have any information to create a more sophisticated reference model. The third type of regularization definition directly incorporates geologic information. This is an extension of level two in that the a coefficients and mref are defined. The a coefficients may be weighted due to geological structure or geochemical analysis. However, these values are guessed (from a relative point of view) and are not optimal. The reference model is created from prior geologic information. It can be as simple as a halfspace if the information is not very detailed. Conversely, it can be extremely detailed if there is an abundance of drill and physical property information. It should be noted that all the prior Chapter 1. Introduction 13 information at this level must be represented by the a coefficients and the reference model. This is not always an easy task. The fourth regularization level is the modification of level three by changing the norms. This level is often reached after several inversions have been conducted with L2 norms. At this point, the geophysicist may feel that the smooth nature of the L2 norm is not appropriate. This often results in the definition of the L l norm that was introduced in the previous section. Levels one to four of Figure 1.6 outline most current practices in regularizing geophysical inversion. What follows is a discussion of potential extensions to these practices that go "beyond level four". 1.5.2 C o r r e c t i n g M a t h e m a t i c a l Deficiencies Often the mathematics of a geophysical survey can cause problems when the design of ,a regularization functional treats every cell (parameter) in the same way. This is especially a problem with potential fields where cells further away from the receiver have relatively less signal effect than those closer. For instance, consider the measurement of magnetic field. M di = 'Y^gijKj (1.31) J'=I Like any other linear relation, the data is the sum of the kernels (gij) for every cell multiplied by its susceptibility value, Kj. However, the kernels for each cell vary as follows. 9H « 4" (1-32) Tij is the distance between the data location and the model cell. Thus, a cell further from the source has less effect on the value of the data than a closer cell of the same K value. For it to have the same effect, the value of K for the cell furthest away must be larger. Since in a default regularization functional smaller model parameters are encouraged, the recovered model will opt for a solution with a lot of structure near surface. (This illustrates the equivalence of potential fields - a small near surface body can have the same signal as a larger, deeper body.) As a result, every cell was not given equal chance of containing non-zero K. Thus, the regularization must account for this mathematical deficiency in the forward operator [22]. Consider the model norm presented in (1.28). To include a correction for the forward operator, it is rewritten as follows. 4>m = \\\WmZm\\2 (1.33) ZT Z contains the deficiency correction for each cell along its diagonal (and m is a vector of K values). In the magnetics problem (surface data), the deficiency correction would be Wj = ^ where z is the depth of a cell (the depth issue is of more concern than a general distance issue). Thus, Z would be defined as follows. Z = diag(0/7f,..., ^uT^) (1.34) After applying the deficiency correction, the smallness penalty value for a cell is i—- as 3 opposed to Kj5xj5zj. Essentially, this cell can now have a n value of z?Kj in order to have Chapter 1. Introduction 14 the same penalty. Thus, all cells are equally likely of containing structure. This method can be applied to any problem that has this kind of deficiency. This theory is at the root of the idea that regularization in geophysical inversion should be linked to the forward operator. 1.6 Learned Regularization Learned regularization functionals are defined as optimal model norms given a set of prior in-formation on the physical property in question. Currently, as explained in previous sections, methods to include prior information are limited. As a result, inversions are performed with regularization that does not promote optimal model structure. This research focusses on developing techniques to specify more informative regularizations. This is attacked in three different ways varying from a purely probabilistic approach to a purely deterministic ap-proach. Each method is a viable tool to express regularization functionals that have been learned from prior knowledge. Because these regularization functionals are more specific, the level of detail and complexity of the functional increases. As a result, complementary issues such as optimization algorithms and search methods must become more sophisti-cated. The goal of this paper is not to augment these aspects of inversion theory. Rather, the goal is to investigate how learned regularization functionals can be generated. Chapters 2-4 outline the learned regularization methods. Application of the procedures is presented in Chapter 5. A generic geophysical model consisting of an anomalous body embedded in a uniform space is considered. Learned Regularization I: Probabilistic Regularization When thinking about regularization in a probabilistic sense, the natural extension is to introduce the idea of distributions (or probability density functions). A distribution of a random variable is a summary of the likelihood of its possible values. For instance, a perfect coin would be best described by a binomial distribution with p, the probability of it obtaining one value, equalling 0.5. This simple example is easy to grasp due to the fact that there are only two possible outcomes. In geophysical problems, we are often faced with predicting the most likely outcome for one parameter that physically has an unlimited range. Thus, we must consider a continuous distribution, p(x), where the following properties must hold. 1. 0>p{x)<l Vz oo 2. / p(x)dx = 1 —oo To summarize, there must be some probability (regardless of how small) that a particular value can occur that when summed over all possible values must equal 1. In order to pose geophysical regularization in terms of a probabilistic variable, we must propose a form for the distribution. The simplest way to do this is via a Gibbs distribution. p(m) = i e - * f i ( m ' (1.35) It is up to the scientist to determine desirable forms of R(m) and potential values of T. This can be as simple as using a multivariate Gaussian (Chapter 2). However, it is possible to define any form as long as the normalization value (or partition function), Z, can be Chapter 1. Introduction 15 calculated. What makes the regularization learned is that R(m) is usually defined para-metrically. That is, R = i?(m; 0) where 0 is a vector of parameters that have optimal values given a set of possible realizations of m. This set is defined as the training model set, M y = {m-}, where the individual training models were obtained from some informa-tion unrelated to the distribution or particular inversion. For instance, a geologist may have ideas of several possible Earth models that can be translated into physical property information and used to define the regularization. The key to this process is that a priori information is classified and becomes extremely useful in the inversion process. Learned Regularization II: Deterministic Model Based Regularization The model based deterministic approach to regularization differs from the probabilistic one in that we do not strictly follow the rules of probability. Instead, we assume that we have access to significant prior information on a filtered version of the model. This is represented as z. A key point is that we simply choose an appropriate norm for z. This could take any form that we choose. For instance, we may wish it to be represented by a p-norm or some generic norm. Since we have prior information on the variable z, we are able to parameterize the particular norm and recover the best fitting parameters. This approach is more practical in many ways because the connection between the model and z is ignored. Chapter 3 investigates its origins and presents its strengths and weaknesses. Learned Regularization III: Deterministic Survey Dependent Regularization In Section 1.5.2, the idea that deficiencies in the forward operator should be corrected for in the regularization was introduced. This was accomplished by analyzing the kernels that govern the forward model and counteracting any inherent decay or deficiency. However, in a paper developed by Haber and Tenorio [16], deficiencies in the forward operator are accounted for in a learned regularization framework. The idea centres around creating synthetic data from the training model set - Vt = F[M.t]. Then, by using typical inversion methodology with 4>m = 4>m(6), one can invert Vt to obtain a set of inverted models, {mi}. The goal is to find the vector 6 that produces inverted models that are closest to their progenitors in M t . That is, min e iP = l\\mi(d)-mti\\2 . (1.36) The method is contingent on the form of the regularization being specified. As a result, optimal parameters for a specific type of regularization are sought. Chapter 4 details the formulation of this approach and explores several synthetic examples. Chapter 2. Probabilistic Regularization 16 Chapter 2 Probabil is t ic Regularizat ion 2.1 Introduction In this chapter a probabilistic framework to represent geophysical prior information is in-troduced and developed. Specifically, we seek to represent the prior distribution of the geophysical model, m, as p(m) = I e-^'m) where II is a normalization and 6 is a vector of parameters. Through probability theory, we seek a method of reducing the specification of R to interactions between neighbouring cells. The latter part of the chapter concentrates on using information in the form of training models to recover optimal values of 9. Depending on the specification of R, methods to attack this problem vary from routine optimization in a maximum likelihood framework to sampling techniques in Markov chain Monte Carlo (MCMC) theory. In the field of image processing there has been significant work done to learn prior distributions from statistics of observed images ([29],[28]). These advanced techniques make use of Gibbs distributions and take a maximum entropy approach and are based on selecting appropriate filters of m from a bank of possibilities. In this chapter, a simpler approach to representing prior information is sought. Upon conclusion of this chapter, the reader will be familiar with the statistical in-corporation of geophysical information into learned regularization functions. In addition, methodology for the selection of appropriate forms of R is introduced and developed. 2.2 Probability Theory in Regularization In this section, the laws of probability as they apply to regularization are discussed, z is assumed to be a filtered quantity of the training set such that z = /(m). 2.2.1 Bayes T h e o r e m and P r i o r D i s t r i b u t i o n s In probability theory, Bayes theorem is a relation between the conditional distributions of two variables. p ( m | d ) = P i ^ m ) ( 2 1 ) If we wish to condition on z as well as m, (2.1) is amended as follows. p(m|d,z) = ^ d ' m ; H Z ^ Z ) (2.2) p(d,z) Above, we refer to p(d|m, z) as the likelihood, p(m, z) as the prior distribution and p(m|d, z) as the posterior distribution. Chapter 2. Probabilistic Regularization 17 The main question remaining in (2.2) is how to expand p(m, z). Using conditional probabilities the prior can be decomposed. p(m, z) = p(z|m)p(m) = P(m|z)p(z) (2.3) However, we are still required to define either p(m|z) or p(z|m). In order to do this, we must know what the definition of z is. For simplicity, we assume that the relation between z and m is linear such that z = Fm where F is a linear operator. One possible example of F would be a derivative operator. z = Vm (2.4) V is a discretized (un-riormalized) first order derivative. 1 0 0 -1 1 0 0 -1 1 0 0 0 0 0 0 0 0 0 0 -1 1 (2.5) By writing out the form of the matrix we can make two conclusions. • z is not independent of m. • The resulting set {zi} are not independent. The first conclusion is a result of the definition that z = Fm and is always true. The second depends on the form of F. Returning to the problem of defining our prior, (2.3) requires the specification of a conditional and marginal distribution. We assume that we have obtained information on z. Thus, we wish the prior distribution to be p(z) and, referring to the second equation in (2.3), we are required to specify p(m|z). This is an interesting issue if considered carefully. We assume that F is a matrix where rows(F) < colurnns(F). For example, if F — V and m contains M elements, z would contain M — 1 elements. Thus, the values of z are certain given m and p(z|m) is simply a series of points in probability space. Let us consider the reverse process. Given a vector z, p(m|z) is an M-dimensional distribution given M - l constraints (and no boundary conditions). Without additional prior information on m, this distribution must cover RM and is a uniform distribution. As a result, (2.3) is the equivalent of writing the following. p(m,z)=p(z) (2.6) Therefore, we can write the following Bayesian proportionality referring to (2.2). p(m|d, z) oc p(d|m, z)p(z) (2.7) The problem has been reduced to the definition of a prior distribution on z alone. The next task is to specify the types of distributions in (2.7). Chapter 2. Probabilistic Regularization 18 2.2.2 Specifying the Likelihood and Prior Distributions In order to demonstrate the complexities of using a probabilistic approach in our inversion, we must begin by specifying individual distributions. For instance, the Gaussian (or L2) distribution has the advantage that it is a smooth distribution that carries an inherent sensitivity to extreme values. We shall define our likelihood as follows. p(d|m, z) = J - e - ^ - ^ H l ^ - ' I d - F H ) ( 2 . 8 ) TLL is a normalization constant that ensures unit area under the distribution. The reader should realize that the form of the likelihood can be anything - the Gaussian representation is by far the most common. (Note: The definition of a Gaussian likelihood is akin to assuming the presence of Gaussian noise in the data.) The definition of the prior, p(z), is the most important. This will depend on the information made available by a geologist or some other scientist and we must define this distribution very generally. The easiest way to do this is by defining a Gibbs distribution. p{z) = ~ e - R ^ (2.9) The notion of Gibbs distributions comes from thermodynamics where R(z) is an energy function. (It is interesting to note that probability distributions take their exponential form due to thermodynamic constraints. For any two states, the total energy must be additive but probability must be multiplicative.) We can define this energy function in any form (as long as it is finite). However, it is often more useful to define the energy in terms of local interactions of parameters. If this occurs, we have defined a Gibbs distribution as a Markov random field [2] and henceforth, we must consider the notion of neighbourhoods and cliques (see [21] for derivation and reference therein). 2.2.3 Neighbourhoods, Cliques and Markov Random Fields An important part of understanding Gibbs distributions is being able to define a neigh-bourhood. Consider a 2-D mesh or lattice. Every cell i has at least 2 cells that directly border it. These cells comprise its first order neighbourhood. Thus for the entire lattice, S, a neighbourhood is defined as N = {A^Vi G S} where Ni is a set of sites neighbouring point i. There are two criteria that a neighbourhood must obey. a) A site cannot neighbour itself: i ^ N b) The neighbourhood relation is mutual: i G N# <-> i' £ Ni Neighbourhoods on meshes and lattices have different orders. For example, the first order neighbourhood for a cell in the interior of a rectangular mesh consists only of the four cells that directly border it. However, when considering the second order neighbourhood, those cells that diagonally border the interior cell are included. Thus, the second order neighbourhood includes eight cells. This is illustrated in Figure 2.1. Cliques are groups of cells within a neighbourhood that are mutually neighbours. Thus, when considering a first order neighbourhood, the cliques for cell i can be defined as c = {ci,...,c 5}, ci = {i\i G S}, Ck = {(M')K' £ N,i £ S}, A: = 2... 5. Figure 2.1 helps to concrete this notion visually. Consider the black cell as cell i. The locations i' are the Chapter 2. Probabilistic Regularization 19 Figure 2.1: Example of neighbourhood in a mesh for one cell (black). Its first order neigh-bourhood consists of the dark grey cells. The second order neighbourhood consists of the dark and light grey cells. dark grey cells. Thus, the cliques for cell i are defined by a function of cell i alone as well as interactions with the cells at locations i'. If the neighbourhood was enlarged to second order, there would be 3 and 4 cell interaction cliques. Utilizing the definition of cliques and neighbourhoods we can decompose our energy function into Gibbs potentials. R(z) = Y/Vc(z) (2.10) c The subscript, c is meant to imply summation over all the orders of cliques present in the neighbourhood. The function Vc(z) is referred to as a Gibbs potential. The form of Vc can vary as the size of the clique changes. This decomposition of R(z) gives the problem a local foundation and allows equivalence to Markov Random Fields to be drawn. Given any lattice or set, S, a random field is defined as the labelling of a particular value to every cell. A random field that is defined by local interactions is called a Markov Random Field (MRF). z is a M R F if and only if a) P(z) > 0, Vz (Positivity) b) P ( 2 i | z s _ [ j ] ) = P(zi\zNi) (Markovianity) Chapter 2. Probabilistic Regularization 20 (a) Obtain prior information and create training set Preliminary Inversion (b) Quantify training set in Gibbs distribution (c) Regularization inversion process (probabilistic or deterministic) Main Inversion (d)Rcgularization specified (e) Field data measurements obtained (OGcophysical Inversion (g) Geophysical mode) estimation Figure 2.2: Flowchart of complete inversion process where S — [i] denotes all values of z except Zi, and Z J V 4 are the values of z at the sites neighbouring i. This formulation creates enormous flexibility within prior specification. At its root, p(z) is specified by the definition of a neighbourhood, the resulting cliques and the specification of the Gibbs potentials. By inserting (2.10) into (2.9), the following relation is obtained. p(z) = — e - E c ^ W (2.11) r i p We have been successful in changing the prior from a general Gibbs distribution into one that depends on local characteristics. This is an important result. Given a possibility for m , we require a method of assigning a probability. Before establishing the relationship between the Gibbs distribution and Markov random fields it was unclear how to proceed. Now, we are able to use relationships between neighbouring cells to assess probability and define the prior. Once the prior has been specified, we may return to Bayes Theorem in (2.2) and concentrate on estimating m . By using the theory in this section, this estimate of m is obtained via a purely probabilistic route. We now turn our attention to finding a form of V c ( z ) from prior information in a learned manner. 2.3 Learning Probabilistic Regularization Functionals Figure 2.2 illustrates the learned regularization inversion process. There are two inver-sion steps outlined. In this section, the preliminary inversion step is documented. This involves the recovery of a regularization functional from prior information. What follows is a discussion of steps (a)-(c). 2.3.1 Step (a): Obtain Prior Information and Create Training Set Prior information on our physical property model, m , could come from many different sources. For instance, a geologist could have a conceptual model of rock units or fault boundaries. There may be some geochemical data that suggests a certain mineral com-position. Regardless, we assume that this information can be translated into information on the physical property in question. This information could be provided in a couple of different ways. Chapter 2. Probabilistic Regularization 21 a) A model, mref, is provided as a mean of our distribution. b) Several models, M t = {mi},i = 1,...,K, are provided in a training set and are supposed to be samples from p(m). If we consider option (a) carefully, it is apparent that we are not provided with enough information to define the distribution wholly. The minimum number of moments needed to define a distribution is two (mean and variance for a Gaussian). Thus, we would be faced with the task of prescribing another moment independently. However, we could easily combine option (a) with information provided in option (b) to resolve this issue. The route involving moment prescription essentially reduces our problem to a parametrization of p(m). The simplest example, a Gaussian, is shown below. If we had samples that we hypothesized were from the above distribution we could recover the elements of the covariance matrix, C _ 1 . The only problem with this approach is that we must prescribe the form of the distribution beforehand. The form of C~l dictates how large the neighbourhood and clique systems are. It is apparent that our prior information must come in the form of (b). In this case, we have been given several models that a geologist has deemed probable. Figure 2.3 shows four examples of these types of models. In this example, the scientist has decided that we are most likely looking for a box model (within a halfspace). He has assigned relative model values to the box and the halfspace. Although these models are similar in type, there are differences in location and model values. Therefore, we must know what properties of m (recall z — /(m)) the training set represents. This is an issue that needs to be resolved by the geophysicist and we will address it in the next section. By establishing how the information is obtained, we can move on to the quantification step. 2.3.2 Step (b): Quantify Training Set in Gibbs Distribution Equation (2.11) was developed as the most general way to express our prior distribution. Once we have obtained information such as that in Figure 2.3, we must be able to quantify it in some manner. To do this in a tractable manner, we must assume that all of our models within Mf are independent samples from p(m) (in this case, our training set contains information on m such that z = m in (2.11)). Thus, we can safely propose that the joint probability of the entire set is simply the multiplication of all the individual probabilities. p(m) = e - 5 ( m - m r e / ) n G \TC 1 (m-m r i .y ) (2.12) K 1 ( U » )) (2.13) i=l Taking the logarithm of the previous equation we obtain the following. log(p(Mt)) = - E E V^mi) + Klog{YlP) (2.14) The above is the basis of the maximum likelihood estimator (MLE). We wish to maximize the probability so we can simply transform this into a minimization problem with the Chapter 2. Probabilistic Regularization 22 Figure 2.3: Examples of possible prior models following objective function. ^ = E (E ^ ("»i)J + ki°9{KP) (2-15) In order to find the minimum of (2.15), two items must be specified. First, we must decide on the form of I4(m'). Second, we must be able to define Up. Regardless, (2.15) is a general expression using training models to quantify a prior distribution. Because of the multivariate nature of the distribution, the tractable choices for Vc (m) are limited. If we wish to optimize (2.15), the normalization must be expressible. As a result, we choose the simplest case: the multivariate Gaussian. To specify the multivariate Gaussian distribution we need to define E Vc{m) = m T C _ 1 m (2.16) c in order to fully specify (2.15). C"1 is the inverse covariance matrix. Within this matrix, the size of the neighbourhood and corresponding cliques are defined. Note that the reference model is omitted but can be inserted at any point (replace m with m — m r e y). In most geophysical inverse problems, desirable features of the recovered model include anomalous bodies and relative changes in physical property value. As a result, it is rea-sonable to limit the neighbourhood of m to first order (see Figure 2.1). Thus, for one cell Chapter 2. Probabilistic Regularization 23 m.k the clique potentials would include a term for mk alone as well as one for its difference with neighbouring cells. In the above equation, S is included as the distance between locations k and k' in order to normalize the derivative, k' is a neighbouring cell, ak and afc' a r e the parameters of interest that act as weights within the different potentials. For instance, a relatively high aki value will promote smaller derivative values in an inversion (increases penalty in regularization). The value of ak> could be allowed to vary for every difference or be held constant in some fashion. Both cases will be examined in this chapter. For simplicity, the first case involves holding ak constant for every cell as well as holding ak' constant for horizontal differences and vertical differences respectively. The resulting a vector is defined as a = (as,ax,az) for a 2-D problem. Inserting these concepts into (2.16), the Gibbs potentials are defined for the multivariate Gaussian. £ Vc(m) = ^mT{akWjWs + axW^Wx + azWjWz)m (2.18) c The operators Ws,x,z a r e &s defined in Chapter 1. It is now easy to see the relation between (2.16) and (2.18).'Thus, C~l = akWjWs + axWjWx + azWjWz (2.19) is the inverse covariance matrix and for the purpose of optimization we rewrite it as C~1(a). It was stated that two tasks must be completed before recovering our distributions. Firstly, we had to define our Gibbs potentials. This has been accomplished in (2.18). Secondly, the corresponding normalization must be defined (lip). Since we have defined our distribution as a multivariate Gaussian, l ip is expressible. n ~ = (2n)f \det(C(a))\$ (2.20) Now we can fully define (2.15) and obtain a M L E . 1 K i> = -J2™iC(a)-1mi + Klog((2n)f \det(C(a))\$) 1 K = -E«nrC(a)- 1 W + K%((27r )T )+K^( |de t (C(a ) ) | i ) i—l = \ £ m f C ( a ) - 1 m i + ™ log (2TT) + | log (\det(C(a))\) (2.21) 1 ^ ~ , , i K M K / • 1 \ = - £ m f C(a)-^ + — log (2.) + ^ ^ {ldet{C(a)-^) = \ J2 mfCia)-1^ + ™ log (2TT) - | log (|ciei(C(a)- 1)|) = Vk(mk) + ^2 Vktk'{mk, mk>) (2.17) Chapter 2. Probabilistic Regularization 24 The Gibbs distribution for this simple example has been defined and, as a result, the objective function for our preliminary inversion is completely tractable. Next, the inversion process must be investigated. 2.3.3 Step (c): Probabil ist ic Inversion to Recover 4>m First, we shall write a formal minimization statement. 1 K K mine ^ - E 1 " ^ " ) " 1 " 1 ' - ^ logddetCCCa)-1)!) (2.22) i=l However we decide to represent C(o:) _ 1 does not change how the minimization problem is approached. It is clear that we must somehow deal with the derivative of a (logged) determinant (the last term of (2.22)). Fortunately, there is an analytic approach to this problem. The corresponding theorem by Hanche-Olsen [17] is outlined below. Theorem 1. For any n-by-n matrix $(£) , the determinant can be expressed as a multi-linear function of its rows such as det($j = / ( & , . . . , (2.23) where fa refers to a particular row of <&. Thus, by successively holding all rows constant except one, the determinant can be expressed as a linear function of each of its arguments. Applying the derivative operator to this notion produces = f(fa, fa, ••• , M + f(fa, fa,: • • , 4>n) + • • • + f(fa,fa, 4>n) (2.24) where fa symbolizes the derivative of the first row. Consider the first term of (2.24) if &(t) is the identity matrix. The result will be simply fa\ and, the complete determinant can be expressed as d ( d e t ( ^ W ) ) = t r Q c e ( . ( t ) ) dt when &(t) = ti. In order to develop a formula for non-identity matrices we can use determinant properties. For instance, if A is a constant invertible matrix det(A$) = det(A)det($) = 1 det$ . (2.26) det^yi ) Using (2.25) and (2.26) together, the following result holds. d e t ( A ) d ( d e t ^ * ^ = trace(A$(i)) when A$(i) = I (2.27) If $ is invertible, we can substitute A = into (2.27) and rearrange the final result. d ( d e t ^ W ) ) = det($(i))trace($-1(i)6(i)) (2.28) Finally, the above can be re-written into the following. ^W) = traoe(*- 1 («)*(0) (2.29) dt Chapter 2. Probabilistic Regularization 25 This theorem proves to be extremely useful in our derivation. Recalling (2.22), it is clear that (2.29) provides us with an exact expression for the derivative we wish to take. Thus, by substituting C~Y for $ we can continue with the minimization. dip rdC-1 K („dC~l\ >„ o n . * = - 2 £ ^ - ^ - m , - -trace ( C — j (2.30) gj is the derivative of our objective function and is required for any search algorithm we may impose. To this point, the probabilistic inversion has been generally outlined. The only point remaining is to deal with the specific form for C - 1 and to tune the search algorithm accordingly. 2.4 Search Algorithms In any optimization problem, a search algorithm is required to find the minimum Of the objective function. The complexity of this algorithm often reflects the complexity of the problem - the number of parameters and linearity of objective function are two important issues to consider. In the problem of recovering weighting coefficients, we can have as few as two parameters. However, we may also have to deal with 0(M) parameters. In addition, as shown in (2.30), there is inherent non-linearity in the problem (the normalization constant in (2.20) ensures that). By incorporating second order information, efficiency of the search can be improved. In this section, a Newton search approach is developed. As we shall see, incorporating simple bounds in this algorithm produces the gradient projection algorithm. 2.4.1 The Newton Step The Newton step is based around incorporating curvature information to direct the search. Expanding the objective function about a location of a, we obtain the following. ip(ct + p) = 4>(a) + g r p + ^pTH(a)p + H.O.T (2.31) H is the second derivative matrix and is called the Hessian (Htj — .) and p is a perturbation vector. Neglecting the higher order terms (H.O.T), we search for a vector p that minimizes (see [13]) *(p) = t K a ) + g T p + i p r t f ( a ) p (2-32) The solution to V p * = 0 is P = - # - 1 g (2.33) The updated a vector is ak+i = ak + up (2.34) where ui 6 (0,1). If ^ is a good representation of then w = 1. That is, the Newton step is optimal. If * is a poor approximation, then a line search to find u> is required. We are guaranteed that for sufficiently small w, i/>(a +wp) < ip(ot) (i.e. p is a descent direction). Chapter 2. Probabilistic Regularization 26 Taking the derivative of (2.30), the expression for the Hessian can be obtained. ^ = - ! i T O C e ( S ^ ) (235) This expression is very difficult to obtain analytically because we lack a direct expression for C. Thus, the easiest approach is a numerical one. Because the Hessian is symmetric in this problem, the numerical approach is not extremely expensive. Thus, we calculate the derivative of C using finite differences. dC_ ^ C(ak + Sak) - C(ak) dak 5ak 5ak is some small perturbation. Using (2.36) to substitute into (2.35) results in a good approximation to the Hessian and an efficient way of calculating the step length. Thus, the algorithm below is used in all problems that solve the system in (2.33). 1. Set ip0 = tp{ak). 2. Set UJ = 1. 3. Evaluate i\) — tp(ak + wp). 4. If tp < tpo return; else reduce OJ (e.g.a; = ^); goto 3. end; The above algorithm is used every time the system in (2.33) is solved. What remains to be discussed is when to terminate the search for a. In theory, the termination would be when the gradient, g, of the updated model, ak+i, equals zero. However, in practice this is difficult to obtain so we set our termination criterion as Mi < e (2.37) where go — ||go|| and e is some small number. In addition, we check the size of the step and terminate the loop if it is below a threshold (tol). I K + i -afclb < tol (2.38) 2.4.2 Constrained Optimization: Positivity For the covariance matrix, C ( a ) , we require a > 0. Thus, there is a need to ensure that the solution recovered via our algorithm in Section 2.4.1 incorporates this constraint. Two methods of accomplishing this are proposed. Chapter 2. Probabilistic Regularization 27 Avoidance W h e n dea l ing w i t h p rob lems w i t h few parameters , we m a y be bet ter off e m p l o y i n g a p o l i c y of avoidance. T h a t is , we can devote a re la t ive ly s m a l l amoun t of c o m p u t i n g power to checking the step length generated b y (2.33). Nega t ive values of a c a n be avoided b y re-adjus t ing the m a x i m u m l ine search value, u>. T h e m a x i m u m a l lowable size o f to, w m a x is OL ' umo,x = m i n { l , m a x ( p | ) : pt < 0} (2.39) \Pi I . R e p l a c i n g OJ — 1 w i t h comax i n Sec t i on 2.4.1 ensures negat ive values are avoided . The Gradient Project ion Algor i thm A l t h o u g h avoidance is s imple a n d easy to use, we m a y require a more robus t m e t h o d (convergence guaranteed) for p rob lems w i t h more parameters or those where the search p a t h is diff icul t . T h e gradient p ro jec t ion a l g o r i t h m ( G P A ) deals w i t h s imple bounds o n parameters a n d defines an acceptable space of solut ions , <S [20]. ^• = {{t lit <2-40> L is our lower b o u n d . I n this p r o b l e m , we m a y set L to some s m a l l number close to zero and let S be a l l real pos i t ive values greater or equa l to L. T h u s , we a p p l y th is to our desc r ip t ion o f the u p d a t e d m o d e l f rom (2.34). a k + l = P(ak+up) (2.41) T h e gradient p ro jec t ion a l g o r i t h m also searches for an appropr ia te co. However , th i s is done s l igh t ly differently by se t t ing UJ — 3l where 3 is some number i n (0,1). T h u s , we c a n express ak+i{t) as the proposed m o d e l a n d search for a value of t tha t exh ib i t s sufficient decrease. ij>{ak+1(t)) - ^{ak) < -jt\\ak - a f e + 1 (*) | | 2 (2.42) 7 is t y p i c a l l y some s m a l l pa ramete r (~ 10 - 4 ) . T h e a l g o r i t h m begins by tes t ing t = 0 and increments t u n t i l the c o n d i t i o n i n (2.42) is satisfied. O n c e the c o n d i t i o n is met , t h e n the a l g o r i t h m tests for t e r m i n a t i o n . If t e r m i n a t i o n fails, t hen the u p d a t e d m o d e l is used to ca lcula te the new p e r t u r b a t i o n a n d we renew our search for t. T e r m i n a t i o n c a n be tested several ways . W e e m p l o y the suggest ion o f K e l l e y [20] i n ou r a l g o r i t h m a n d define t e r m i n a t i o n as follows. Ilotfc - ak+1{t = 0)|| <Ta + Tr\\a0 - a0(t = 0)|| (2.43) ra and rr are absolute a n d re la t ive error coefficients tha t c a n be tuned . T h e mos t often used values were ra — 0 a n d r r = 10~5. T h e advantage of the a l g o r i t h m is tha t we do not have to speculate o n the n u m e r i c a l decrease i n the gradient . W e do, however, have to check each parameter for i t s p o s i t i o n re la t ive t o the preferred space. T h i s c o u l d be cos t ly for p rob lems of higher d imens ions . Chapter 2. Probabilistic Regularization 28 2 .5 Synthetic Examples 2.5.1 Gene ra t i ng Synthe t ic M o d e l s To test the algorithm synthetic models must be used. In order to do this, (2.12) is consid-ered. This distribution is simply a multivariate Gaussian with mean m r e / and covariance C. If we redefine 8m = m — mref, the distribution takes the following form. p(6m) = l-e-¥mTc' lSm (2.44) Further, if the covariance matrix is positive definite (this requires that as > 0) a Cholesky decomposition such that C~l = LTL can be performed. Thus, by defining m = LSm the distribution takes the following form. p(m) = - L e - 5 * T l h (2.45) Now, the distribution is defined in terms of rh with distribution i h ~ N(0,1). It is easy create samples of rh. To create a model of M dimensions with mean mref and covari-ance C one simply has to create an M-dimensional i h vector and perform the following transformation. m = L - 1 r h + m r e / (2.46) The natural question when considering (2.46) is how to specify the reference model. In the synthetic case, it is not really a big issue. One can specify any reference model and the resulting models will tend to average to its value. However, this distribution is, in fact, the regularization used in the geophysical inversion. In cases where we do not have to generate the training models from a known a vector, we may choose our reference model to be the mean of the training set. mref = Mt (2.47) The choice of reference model impacts the objective function and should be considered carefully. Since the synthetic approach is used to test and validate the algorithm, it is wise to ensure that the synthetic training set actually is produced by the intended covariance matrix. The best way to do this is to produce the approximate covariance matrix of the training set. 1 K Cij = jT^i E K W - } - Mi) (2-48) fc=i K is the number of training models and rnf^ refers to the kth sample of m*. If the models are actually representative of the true covariance matrix then c should converge to C. To quantify the fit, the infinity norm ^jjcjjjj,00 w a s u s e d - That is, the matrix is rearranged into a vector and the largest value is used as a norm. In order to smooth the curves, the average of this norm was taken over 5 different training sets. Like any variance, the fit is improved as K increases. Figure 2.4(a) shows the fit as a function of K. In Figure 2.4(a), (as,ax,az) = (10~3,1,1) and the number of samples (training models) is varied from 10 to 10000. As dictated by the Law of Large Numbers, the sample mean will converge to the true mean as K —> oo. The same follows for the covariance matrix. This is illustrated in Figure 2.4 as the value of the norm decreases as K increases. By the time K = 1000, Chapter 2. Probabilistic Regularization 29 (a) 10" s 15*10" o 10 * * * * 10 10 10 10 (c) 0.115 0.11 =80.105+ o 5 0.1 o M: 0.095 0.09 * * * * * 0.085 10 10 a value 10 10' (b) 0.14 0.12 8 O 0.1 .8 0.08 0.06 0.04 10 O * * * * ^ * 10 10 oc value 10' (d) 0.12 0.115 , 0.11 ^0.105 8 o o.i —0.095 0.09 0.085 10 * * * * 10 10 a value 10* Figure 2.4: ^yjjj^ 0 0 for (a) increasing K value, (b) varying aa, K=1000, (c) varying ax, K=1000 and (d) varying az, K=1000. the largest difference value is only about 0.05 (or 5 percent). One can conclude that the algorithm will perform better as K —> oo. Although algorithm performance is better with higher K, it is wise to investigate the effects of the a parameters on covariance fit. Figures 2.4(b)-(d) illustrate this. In this example, one a parameter is varied and the other two are held constant at one. There is a smaller error with smaller as and larger ax and az. However, the variation is relatively small over six order of magnitudes of a values. It is reasonable to conclude that the a values are not a major source of error when considering covariance fit. The number of samples, K, is the most influential parameter. Figure 2.5 shows what the covariance matrix looks like if it was only constructed with Ws, Wx or Wz as well as a combination of the three. Formulations without Ws (Figure 2.5 (b) and (c)) result in large parameter variances while those with the presence of Ws (Figure 2.5 (a) and (d)) have significantly reduced variances. Figure 2.6 demonstrates how these variances change with increasing as. The variances are reduced as as increases. Given the analysis of this section, we can be assured that our training sets are adequate representations of the true covariance models for the upcoming synthetic examples. 2.5.2 Recovering the a values Given the conclusions of the previous section, it is wise to test the algorithm with numerical experiments. In these examples, specific values of the a's are used. The only thing that is varied is the number of samples used. In order to compare results, the standard L2 norm Chapter 2. Probabilistic Regularization 30 parameter # parameter # Figure 2.5: (a) Resulting C matrix for C = {WjW„)-x, (b) C = ( W j W x ) - \ (c) C = [WjWz)-\ (d) C = (WjWa + wjwx + WjWz)~l Figure 2.6: Variation of covariance matrix with changing as. The variance of the model parameters decreases with increasing as. Chapter 2. Probabilistic Regularization 31 true o f e d rytrue " a ; cged rJrue error iter K le-6 1.89e-6 1 0.9634 1 0.9285 0.8891 15 10 le-6 1.04e-6 1 0.9749 1 1.0482 0.0669 13 100 le-6 9.4e-7 1 1.0016 1 1.0103 0.0609 13 1000 le-3 l.le-3 1 0.9139 1 1.0419 0.1336 11 100 1 0.9637 1 0.9477 1 1.296 0.1444 7 100 le3 1.0328e3 1 0.0393 1 0.0088 1.3807 50* 100 le6 1.0147e6 1 4.78e-22 1 6.32e-22 1.4143 50* 100 1 1.0463 le3 1.0093e3 1 0.9890 0.0484 13 100 1 1.0432 le-3 1.31e-7 1 1.0089 1.0008 50* 100 1 1.1189 1 0.9623 le3 1.0162e3 0.1258 13 100 1 1.0293 1 1.0275 le-3 1.5078e-18 1.0008 50* 100 Table 2.1: Results of several 3-parameter a inversions. The asterisk indicates that the program ran to the maximum number of iterations. of atru^Trured is used. A 10 by 10 mesh of 1 meter square cells is applied. Results are summarized in Table 2.1. The first three runs of the program illustrate what Figure 2.4(a) showed. Clearly the error reduces when K increases. However, there is not a significant improvement from K=100 to K=1000. The rest of the examples illustrate the results of Figure 2.4(b)-(d). Larger errors occur when the program terminates at the maximum number of iterations (although it was ensured that the objective function was unchanging for several iterations). This occurs when as has high values or when ax or az has lower values. It should be noted that the algorithm does well in fitting the high values of as but fails to fit the corresponding values of ax or az. This makes sense as the covariance matrix is so heavily dominated by as that it can be approximated C _ 1 c± ctsWjWs and the problem can be reduced to a one parameter estimation. Another observation that one can make from Table 2.1 is that the problem is much more sensitive to as than to any other parameter. Looking at the first two columns it is clear the as was never poorly fit. Rather, when the error was high it was because of extremely poor ax or az fits. Thus, there may be cases where reduced parameterizations of the covariance would be satisfactory. In any case, given an adequate number of training models developed from reasonable a values, the algorithm is successful in recovering the a values within reasonable error. 2.6 Covariance Model Selection In (2.10) it was stated that the regularization could be expressed in terms of a summation of Gibbs potentials. Eventually a practical choice of those Gibbs potentials has to be made. This was expressed in (2.16) and was a reasonable way of representing information contained in training models. However, there is no pure theory behind the choice of regularization (or covariance). The scientist simply has to make a choice. This is a common problem in many disciplines. Statisticians, for example, must decide what type of model to use to represent their data before calculating statistics on the parameters of that model [1]. Thus, in order to be complete, we must investigate the implications of other representations of the inverse covariance matrix. The most practical way to validate a particular choice of inverse covariance matrix is to Chapter 2. Probabilistic Regularization 32 compare it against other possibilities. Thus, we consider several possible choices of inverse covariance, Cj1 — Cj1(6j). We assume that every choice of covariance is parameterized by a vector Oj. To put this in context, we may choose 0\ = (ax,az,as). Every covariance matrix will be included in the corresponding likelihood . PW' C>) = l / " i y T C j " ' W y (2-49) |det(C,-)| where we define y = m — mref as the training data and n as a multiplicative factor that does not depend on m or Cj1. If we are given L possibilities for Cj1, then the interesting statistic would be P{Cjl\y) where P is used for a numerical probability and p denotes the distribution. Given L covariance models the following would be true. L Y/P(C~1\y) = i (2.50) j=l We would be interested in the conditional that had the highest numerical probability. This covariance model, given all other possibilities, would be the most likely choice to represent the data. This problem can be attacked by a Bayesian model selection approach. There are many examples and papers on this subject ([18],[19],[6]) but this section was developed based on theory from Carlin and Chib [7]. 2.6.1 T h e o r y Before developing the statistical theory there are a few items that should be addressed. • The training data, y, is only dependent on the particular choice of covariance model, CJ1. That is, any conditional p(y|0,Cj 1) where 6 = {0\,... ,0i) is the set of all parameter vectors can be expressed p(y\0j, Cj1). • Given a particular covariance model, CJ1, the priors of Oi are proper and conditionally independent. Specifically, p(8j\Cj1) and p(9i\Cj1) (i ^ j) exist with the latter being referred to as a pseudoprior. L • Each choice of Cj1 has a particular prior probability, TTJ = P(C~ 1 ) , such that ^ TTJ = j=i 1. Although there will often be no case to make the values of TTJ unequal, it is important to recognize that this information is included in the formulation. To proceed with the theory, the joint distribution is expressed using conditional theory. L p{y,0,Cjl) =p(y|%,Cr1){J]p(cJ i|C-1)}Tr j (2.51) i=l In order to calculate quantities in (2.50) the conditionals of 0j and Cj1 need to be expressed. (This will become clear when sampling methods are discussed in the next section.) Thus, for Oi, there are the cases when i = j and i ^ j. PiW^Ci ,y) oc { p ^ - i ' ) (2-52) Chapter 2. Probabilistic Regularization 33 When i = j, the data (and likelihood) are included in the solution. Otherwise, the condi-tional is represented by the pseudoprior. Thus, for any proposed covariance model, each Oi is generated from the appropriate distribution. If these distributions are simple and well-known, the generations can be obtained easily. Otherwise, sampling algorithms are useful. Once we have obtained a sample of each (9; for a given model j , the next step is to generate a new model indicator from the conditional for C~l. Piyi^c-'nfipmcr1)}^ p{C-l\6,y) = : (2.53) Ep(y l^ , c f c - 1 ) { n ^ | c f c _ 1 )K fc=l i=l The above equation is a conditional expression where the denominator marginalizes p(0, y). Sampling from the above distribution will result in a vector of covariance model indicators (j). Essentially, if there are L possibilities for the covariance then over N samples of (2.53), we will obtain t samples of model j such that represents the posterior probability of model j. What remains an unanswered question is how these samples will be obtained from (2.52) and (2.53). Although the regularization is represented by a Gaussian in (2.49), one should recognize that this is a Gaussian in y. We are interested in samples of 9j and, as a result, the distribution becomes extremely complex. Thus, two sampling algorithms are utilized to implement model selection theory. 2.6.2 M a r k o v C h a i n s In order to accomplish Markov Chain Monte Carlo (MCMC) sampling, one must be aware of the properties of a Markov chain [21]. A Markov chain is generated by sampling from a distribution known as a transition kernel. Thus, given a current location (t) of a parameter, x, the next location is drawn as X ^ t + 1 ^ ~ p(x\xl). Thus, the new (or proposed) location of x only depends on its current location (it is independent on all other past locations). This is known as Markovianity. Once we have drawn a large number of samples from the transition kernel, Markov chains should be representative of their stationary distribution. This distribution is invariant with time (length of chain) and will always have the same probability from one specific location to another. This stationary distribution is unique if the Markov chain is irreducible. This means that any state can be reached in a finite number of moves from any other state. One can interpret this as the non-existence of zero transition probabilities. Markov chains are also generally aperiodic. This means that there is no regularity in returning to a particular state. Thus, under certain conditions, a Markov chain will generate samples from its stationary distribution given the transition kernel. This is useful if one knows the transition kernel. As we shall see, in most simulation problems, this is not the case. 2.6.3 M C M C sampl ing : T h e M e t r o p o l i s - H a s t i n g s A l g o r i t h m In statistical analysis, one is often faced with the problem of evaluating samples from a particular target distribution, n(x). A good example of this is (2.52) and (2.53). We are Chapter 2. Probabilistic Regularization 34 able to explicitly write the distribution but are not able to produce samples. This is where Markov chain theory can be employed. The goal of M C M C sampling is to construct a Markov chain that has a stationary distribution equal to the target distribution (ir(x)) but where the transition kernel is unknown. This can be done several ways but this research concentrates on the Metropolis-Hastings (MH) algorithm [9]. . The M H algorithm depends on the user specifying a candidate generating density. This density, q(y\x^), is usually easy to draw a sample, y, from given the current state of the chain, x^. If this density satisfies the following condition, then we are guaranteed that ir(x) is the stationary density of the chain. ir(x)q(y\x) = n(y)q(x\y) (2.55) The above condition is referred to as reversibility. If our choice of q satisfies this condition, then our problem is solved. We can simply generate samples from q to recreate ir. However, this is an unlikely occurrence. In fact, it is more likely that (2.55) is an inequality. As a result, the equation can be balanced by a series of coefficients. ir(x)q(y\x)a(y\x) = ir(y)q{x\y)a(x\y) (2.56) q(x\y)a(x\y) is the updated transition kernel and a(x\y) is defined as the probability of moving from y to x. If (2.55) is an inequality, then the probability of moving from y to x (or vice versa, depending on the direction of inequality) would be quite low. If this is the case, then it is prudent to set a(x\y) at its upper limit of 1. Rearranging (2.56), we obtain the solution for probability of a move. a(y\x) = min n(y)i(x\y) l [ix{x)q{y\xY (2.57) Thus, since a(y\x) is the probability of moving from x to y, we should accept the move with that probability. If it is not accepted, the current state is used as the value for the current time in the chain. As a result, it is possible to remain at the same location for a period of time. The advantage to the M H algorithm is that (2.57) includes a ratio of target densities evaluated at two different locations. This means that any normalizing constant that does not depend on the actual locations of the parameters need not be included as they will cancel out. This turns out to be extremely advantageous in many cases where the calculation of the normalization is intractable. The choice of the candidate generating density, q, is one that is left to the user. However, there are a few special cases that are worthwhile to note. Firstly, if q is symmetric in that q(x\y) = q(y\x) (2.57) reduces to the following. cx(y\x) — min |_7r(x)' (2.58) One example of this is called the random walk [24]. This proposal uses a candidate that is a perturbation from the current state. .y = x + r Z (2.59) Z is taken to be a sample from a JV(0,1) density and r is a parameter that can be used to change the variance of the perturbation until the Markov chain exhibits sufficient acceptance Chapter 2. Probabilistic Regularization 35 Propose covariance models, Cj"' , and set k = 0 , j ( 0 ) = l M H 1 : e i < k + " (i=j) generated u s i n g M - H algorithm Figure 2.7: Flowchart outlining covariance model selection algorithm via Gibbs sampling and mixing. Statisticians usually look for an acceptance rate of about 50 percent and a chain that exhibits faster convergence to the target density (less time to "burn in"). In truth, any candidate density is acceptable - some may perform better than others. It is easier to tune the choice of q to the type of problem. Higher dimensional problems may require a more specific q. A typical M H algorithm may look like the following. 1. Set £ ( 0 ) = x0 2. For t=l...N 3. Propose y from q(y\x). 4. Calculate a{y\x) from (2.57) 5. Generate u ~ (7(0,1) 6. If a(y\x) > u set a;(t+1) = y else set = x « 7. t=t+l Goto 3 if t < N Otherwise terminate and output x Thus, the M H algorithm is extremely easy to implement and effective in problems of rela-tively low dimension with parameters that have a near diagonal covariance matrix. Chapter 2. Probabilistic Regularization 36 2.6.4 M C M C s a m p l i n g : G i b b s sampler The Gibbs sampler is really a special case of the M H algorithm where the candidate density is the product of the full conditional distributions [8]. The simplest way to look at it is if there is a two parameter joint distribution, p(x,y). Writing it in terms of conditionals, we obtain p(x,y) — p(x\y)p(y) = p(y\x)p(x). Thus, if we generate proposals for x from p{x\y) and proposals for y from p(y\x), then one can see that reversibility is satisfied and that each of these proposals will eventually converge to the stationary distributions of p(x) and p(y) respectively. To illustrate implementation of the Gibbs sampler, we assume that we have some joint density of k parameters: p(y\,.. • ,yk)- Suppose we are interested in finding samples of the parameters but are unable to sample the joint distribution inherently. However, suppose that we are able to sample from the conditional distributions: p(yi\ys), where S contains all indices except i. Thus, given k initial values, yf\ new values are drawn as follows. y[1] ~ P(yi|yf,---,J/£ 0 )) (2.60) (1) / 1 (1) (0) (0)s y2 ~ pmwi',2/3 ,---,yk ) yi1] ~ p ( y 3 | y i 1 ) , y 2 1 ) >2 / 4 0 ) ' - - - ' y i 0 ) ) yk] ~ p{yk\yx\y{2\• • • , y k l i ) All values are drawn using the most current values of the other parameters. After a long run of the Gibbs sampler, each set of variables should converge to their marginal distributions, p(yi). The Gibbs sampler is effective if the conditionals can be sampled from. If not, applications of the general M H algorithm are often required inside the Gibbs sampler to produce samples from the desired distribution. In the problem of covariance model selection, this is exactly the case. 2.6.5 I m p l e m e n t a t i o n Now that the theory for Markov chains, M H algorithm and the Gibbs sampler have been introduced, it is interesting to note that Section 2.6.1 is, in fact, a big Gibbs sampler. The desired outcome is (2.54). This distribution is conditional on the data, y. Thus, we need not produce a conditional for the data from the joint distribution in (2.51). Instead, as shown in the previous section, we must express conditionals for all other parameters (all di and Cj1). This was accomplished in (2.52) and (2.53). However, sampling from these distributions is not trivial. As a result, we must implement a M H algorithm to sample (2.52) when i = j and (2.53). Figure 2.7 illustrates the sequence of events that takes place in the model selection algorithm. The interesting part of the algorithm is that we are able to sample 9j\y,C~l using the M H algorithm without considering the current values of di i ^ j. Thus, the strategy is to implement K M H algorithms (MH1 in Figure 2.7) to obtain samples of each f?j-|y, Cj1. These algorithms are run for a sufficient length of time whereas there are enough samples of 9j to run the overlying Gibbs sampler. Thus, whenever (2.53) is sampled, we can produce a value of 9j from MH1 as well as all the #j's from their pseudopriors (to be discussed below). Sampling (2.53) also requires an application of the MH algorithm. This is done by proposing a move from one value of j to another (the values of potential j's are all given equal probability) and calculating a ratio of the numerator in (2.53). Then, typical M H algorithm theory is used to accept or reject the proposed j. Chapter 2. Probabilistic Regularization 37 0 . 1 6 i 1 1 1 r - 2 0 - 1 5 - I O - 5 O 5 I O 1 5 2 0 Parameter value Figure 2.8: Different possibilities for p(0»|C,- *) The key to the whole application of the MH algorithm for j is that every time a ratio is calculated that a new set of fVs be used. This is, after all, the foundation of the Gibbs sampler that governs the whole process. After a long run of the Gibbs sampler, we can simply count the number of samples oij that equal 1 and use (2.54) to calculate P ( C f 1 |y). The last thing that needs to be thought about are the pseudopriors. These will often change depending on the type of models that are proposed. The main question is whether we would expect the values of the parameters for 61 be different if j = 1 or j = 2. If so, then one must be careful about how to specify these distributions. In the case of covariance model selection, we have so little information about the absolute values of these parameters that it is appropriate to use the same pseudoprior in all cases. The only information that we wish to include in all cases is that these parameters are likely positive. Thus, a reasonable prior for all 8i\Mj is a lognormal distribution. Figure 2.8 presents several possibilities for the pseudopriors. Several of them (lognormal, uniform, Rayleigh, gamma) can incorporate the positivity on the parameters of 0j. As the figure illustrates, these distributions can also be tuned to look quite similar (all of the plots were generated with a particular set of parameters similar to a mean and variance in the normal distribution) such that any of them may be adequate. The only significant difference is in the uniform distribution in that it assigns equal probability to a given range and in the normal density in that it is symmetric around its mean and allows a non-zero probability for all values. The lognormal density was chosen because we can often expect values to be closer to zero but we do not want to penalize significantly larger values too harshly. However, to evaluate a choice of pseudoprior it is always wise to use several different possibilities and compare the results. The last issue in the implementation section deals with efficiency. As pointed out in the derivation of the M H algorithm, we only need a ratio of likelihoods such that the normalizing constant can be ignored. Let us recall (2.49). This equation is our likelihood for the covariance model selection problem. It is involved in both the distribution to sample 0i and the distribution to sample j. When sampling 6i, we only need to know it up to a Chapter 2. Probabilistic Regularization 38 multiplicative factor, n. This factor will be the same for each proposal of Oi. However, we must keep the |det(C)| term because it is different for each location of Oi. In order to improve efficiency, we can use determinant properties to reduce the number of operations required. We can express C - 1 easily so it makes sense to take the determinant of this matrix rather than its inverse. As a result, (2.49) is rewritten as follows. • ^ y l ^ . C j O c x I d e t ^ r 1 ) ! ^ - ! ^ 7 1 ^ (2.61) When sampling j, we must also use (2.61) to calculate the likelihood. This is because of the overall presence of Cj in the normalization which is the state that we are sampling. This means that we must be able to express (or calculate, by sampling) the normalization of the likelihood. Although this is not a problem for Gaussian distributions, it is an issue for other types. The advantage of the way that this algorithm is implemented is that we actually obtain samples from the conditional p{0j\Cjl ,y). This is accomplished due to the MH1 step (see Figure 2.7) where each Oj is sampled given the corresponding Cj1. One should note that if MH1 was not necessary (i.e. we are able to produce a sample Oj given any j easily) we would not produce the desired conditional but rather p{0j\y). Thus, we would have a distribution of Oj given all possible covariance-models. Using the MH1 step allows us to obtain the distribution that our optimization algorithm earlier in the chapter (Section 2.5.2) obtained via a M A P (maximum a posteriori) estimate. From this distribution, we can obtain the mean, median, variance or any other relevant statistic. Thus, the M H algorithm is more powerful in this context but it is worthwhile to compare its results with the optimization routine to ensure consistency. The last issue in implementation is the values of the likelihood. Looking at (2.61), it is apparent that each training datum is a vector. However, in the context of training model, we actually have K vectors of training data. Thus, (2.61) is modified as follows. K p(yi,...,yK\ej,Cj)cx\det(C-1)\Te~'1*£yIC} ^ (2.62) One can imagine that as K gets larger the value of.log(p) will get quite large in amplitude. If we are comparing two different covariance models, the numerical value of this likelihood could be significantly different. Thus, regardless of what values we obtain for the priors we are most likely to get an acceptance ratio of either zero or one. This will cause the algorithm to be completely absorbed by one covariance model and we will obtain probabilities near 100 percent for that model. To fix this problem, we include a normalization in the problem. The best way to do this is to calculate the probability ratio as follows. p{yi,---,yKW2,C2) L\ and L2 are the log-likelihoods (log(p(yi, • . . , YK\9J,CJ))) and K is the number of training models. Thus, is an average log-likelihood for covariance model 1 and the ratio should still represent a good comparison between the two models. The advantage is that the exponential produces a sensible result that is in the same range as the prior distribution ratios. Chapter 2. Probabilistic Regularization 39 sample # x 1 Figure 2.9: Markov chains for particular 6 vector for (a) C1 1 , (b) C 2 1 and (c) C 3 1 , 2.6.6 Numerical example One hundred training models were generated from the covariance model with (ax, az, as) = (10,1,0.01). This training set is referred to as TS1. In order to test the covariance model selection algorithm, three covariance models were proposed. C f 1 = axWxvWx+azWjWz + asWjWs (2.64) C^1 = axWxrWx + \.hW?Wz + asWjWs C 3 - 1 " axWxrWx + asWjWs The idea behind using these three models is that the first is the correct parametrization, the second is slightly incorrect and the last lacks an entire free parameter. Thus, the outcomes are predictable (although we can never obtain a predicted numerical outcome) and the algorithm can be evaluated qualitatively. En route to the desired outcome we must sample each covariance model's 9 vector (9 — (ax,az,cts) for C f 1 and 9 — (ax,ct3) for C^1 and C^1). This is analogous to a sampling of (2.52) when i = j. These results are obtained via the M H algorithm and are extremely useful because they give the distribution of each a value for a particular covariance model. Figure 2.9 illustrates the results for each covariance model. Figure 2.9(a) is the result for CT . This chain was generated by a random walk proposal (T = 0.005) that had an acceptance rate of 35 percent. Shown is the entire chain for every a parameter (non-acceptances result in staying at the same location). Figure 2.9(b) is the result for C ^ T 1 and Figure 2.9(c) is the result for C^1. Both used a random walk with r — 0.003 and resulted in acceptance rates of 49 and 77 percent respectively. For all the Markov chains, it is clear that convergence to the stationary distribution occurs after 40000 iterations for ax and much earlier for the other a parameters. In order to evaluate the results of these chains, they were compared against the optimization routine. Only values after 40000 samples were used to calculate parameter statistics. Results are summarized in Table 2.2. Chapter 2. Probabilistic Regularization 40 All of the results of the optimization code fall within two standard deviations (cr) of the corresponding means of the Markov chains. In fact, all results except for ax from C ^ 1 are within one standard deviation. Running these chains on M A T L A B version 6 release 13 on a 2.41 GHz processor with 1 Gb R A M , it takes an average of about 7 seconds to produce 1000 samples. In order to obtain 30000 samples from the posterior, one should run a total of 70000 samples using this random walk method which' results in a runtime of approximately eight minutes. Once samples of the 9 vectors are obtained, the second M H algorithm is implemented to obtain samples of j from Cj1. This is the sampling of (2.53). The pseudopriors were defined to be lognormal with log(mean)=l and a — 4. This choice of pseudoprior will incorporate positivity in the a parameters. Thirty thousand samples of j were generated for every comparison. This is an abundance of samples to ensure the results reflect the correct balance between the two models. The first comparison was between C^1 and C^"1. The models were assumed to be equally likely such that TT\ = 7 r 2 = 0.5. The results were that P(C7f1|y) = 0.796 and PCC^"1 |y) = 0.204. Thus, the evidence is quite favourable towards the correct covariance model but it is not entirely overwhelming as Cj1 has the correct structure but has az fixed at a value just off of its true value. The second comparison was between C j - 1 and Cj1. Originally, the models were assumed to be equally likely but only samples of j = 1 were produced. In order to ensure that this was not simply because of numerical difficulties (relative likelihood magnitudes, for example), the prior model probabilities were adjusted to 7Ti = 0.0005 and 7T3 = 0.9995. The results using these probabilities were P ( C f *|y) = 0.9685 and P(C 2 ~ 1 |y) = 0.0315 which is overwhelming evidence in favour of C f 1 . Thus, it seems that the correct parametrization of C - 1 is an important issue. The algorithm takes about 9 seconds to produce 1000 samples of j. As a result, this algorithm took about 4.5 minutes to produce 30000 samples of j. The total runtime of the algorithm is approximately 12.5 minutes to produce all samples. However, this was being extremely generous with the number of samples needed to produce the result. 2.7 Heterogeneity in a In Section 2.3.2, it was assumed that the a coefficients in the covariance matrix of the multi-variate Gaussian (see (2.19)) were constant for every difference. When our training models contain more detailed information, such as information on location of possible anomalous bodies or larger geologic structure, we cannot adopt constant values for ax and az. In fact, Model Parameter Optimization code M C mean value M C a 10.0392 9.9093 0.1566 0.9730 0.9795 0.0360 as 0.0108 . 0.0115 0.0016 9.4755 9.3574 0.1016 as 0.0101 0.0108 0.0015 Q-x 11.6097 11.5745 0.1517 as 0.0951 0.0955 0.0042 Table 2.2: Comparison between optimization results and statistics of Markov chains. Chapter 2. Probabilistic Regularization 42 N horizontal difference 2.8.1 Approach The main problem with Figure 2.10 is that it is not very "flat". By this, we would be referring to the large jumps in parameter values. Thus, the change in adjacent a parameters is included in the regularization functional. The values of the parameters themselves are not appropriate for regularization, however. If this was the case, we would have to specify a reference to measure the values against (again, not specifying one would be like referencing to zero) and we do not have the information to do this. Thus, to first order, a flattest model is most desirable. The training models (which act as data) can dictate any deviation from this model. Thus, let W° be defined as a first order difference matrix on the a grid and the resulting regularization can be defined as Ra = \\WQa\f (2.68) where Wa= G £ £ (2.69) where G was defined in (2.66) and is defined for either the ax or az grid and O are zero padding matrices of the appropriate size. As in any inversion problem with regularization, Chapter 2. Probabilistic Regularization 43 the objective function (2.22) must be amended. IK min a ^ = \ ^mJCia)-1^ - *log ( ^ ( C t a ) - 1 ) ! ) + ^Ra (2.70) i=l The gradient (2.30) must also be changed to reflect the new objective function. 1 ^ K / dC~1 \ 9i• = 2 E m ^ ^ - m * - 2 t r a C C ( C ^ o ~ )+ ftWZW°ah (2-71) i = l 3 \ 3 / Finally, the Hessian in (2.35) has to be updated with (2.36) still being used to approximate the first derivative. • " " - ^ ( S ^ ) ^ ' (2'72) Equations (2.70), (2.71) and (2.72) provide all the tools necessary to be able to perform a regularized preliminary inversion. The only issue that remains unaddressed is the regularization parameter, (3. This pa-rameter controls the relative weight of data and regularization in the final model. It is a constant and for every value of (3 there is a particular solution. The problem is in knowing which value of (3 is appropriate. For instance, as (3 —• oo, ip-+Ra and the solution will tend to the null of the regularization operator (in this case, a constant value). In contrast, as (3 —> 0, ipR —> 4>NR (non-regularized) from Section 2.7 and the solution would approach that of Figure 2.10. Neither solution is appropriate so a method of finding a j3 £ (0, oo) is required. 2.8.2 Cross-validation In cross-validation techniques [15], one usually has specific data to deal with. That is, the data can be expressed as some function of the recovered model. The theory of cross-validation states that if a recovered model is appropriate it should be able to accurately predict extra data. Thus, consider a system of N data, d, where the goal is to recover M parameters, x. Next, consider that there are N subsets of d of size N-1 (i.e. a subset is the entire data set minus one value). Thus, if we were to perform N inversions with these N subsets (defining k as the index excluded from d), we would obtain N recovered models, Xfc. From each of these models, we could attempt to predict the excluded (kth) data. <fk = Fk[x.k] (2.73) As a result, the quantity of interest will be the squared difference between dk and dpk. N CV = Y^{dk-dpkf (2.74) fc=i CV is the cross-validation quantity for the objective function with one constant value of (3. Thus, if we calculated (2.74) for several values of (3, we would obtain CV(/?).' In theory, there should be some value of (3 that minimizes CV([3) and this will be the most appropriate choice of regularization parameter. Chapter 2. Probabilistic Regularization 44 (a) 175 0170 ro > > 1 6 5 k Cross-validation seems like an appropriate method to use to determine the value of the regularization parameter. However, considering the problem at hand, there are some problematic differences. First, there is no clear relationship between data and model. Here, our "data" are really the training models we are presented with. The "model" is the vector of Q coefficients we seek to recover. However, when considering (2.74), it is really the misfit part of a typical objective function that is being minimized. Thus, there is a clear analogy to be made when considering (2.70). The first term is a "data" influenced term. The last term is a "model" influenced term. Thus, the C V criterion is defined as follows. 1 K y K CVa = - ^ m f C t a ) - 1 ! ! ! , - -^-log(\det{C{<x)-l)\) (2.75) i=i We would not be able to predict a training model (there is no relationship m — F\c*\). As a result, the training set is divided into two subsets: MT = nij, i = 1... KT and Mv = m ,^ i — I... Ky where KT is the number of models in the training set and Ky is the number of models in the validation set. Once a range of CV{f3) values have been calculated, we simply choose 0 = min CV((3). Chapter 2. Probabilistic Regularization 45 2.8.3 Example In order to check to effectiveness of the regularization, the experiment performed in Section 2.7 was repeated with regularization. The results are shown in Figure 2.11. Several (3 values between 10~2 and 104 were tested. Figure 2.11(a) shows the C V values for each of these (3 values and shows that there is a clear minimum around a /3 value of 20. Figure 2.11(b),(c) show the results for the ax and az weights with (3 — 21.5443. Values of a are much more constant. The mean value of ax is 9.9543 but with a standard deviation of 0.00338. az has a mean value of 0.9926 and standard deviation of 0.1094. Each mean is closer to the true value than the non-regularized case and the standard deviations are much smaller than before. 2.9 Non-Gaussian Distributions Until this point, all of the theory in this chapter has focussed on the specification of weight-ing parameters for a given Gaussian distribution. The Gaussian distribution is useful be-cause we have complete knowledge of its multivariate.normalization constant. However, we may wish to specify information in a non-Gaussian form. To do this, we would have to know the normalization constant in order to perform the optimization. However, in this section, the Markov chain theory developed in Section 2.6 is used to overcome this difficulty. 2.9.1 Normalization in the p-norm Considering one parameter, the distribution of the p-norm (Generalized Gaussian) can be expressed as follows [27]. PP(X) = TTTTTe (2.76) T represents the Gamma function and / i and cr are defined in (3.35) and (3.34). Setting p = 2 we obtain the single variable Gaussian. If we assume that we have M independent samples of x and let p — 2, the normalization would equal 1 (27r)f aM whereas if we do not assume independence it would equal 1 (2n)f |det(C2)|* where C 2 is the covariance matrix for the 2-norm. Thus, in the transition from the single to multi-variate framework, the normalization changes from a aM term to a |det(C)|5 term. In a multi-variate independent framework the p = 1 normalization would be 1 Wo™' Thus, following the Gaussian example, we conjecture that the dependent multi-variate L l norm might look like 1 2M |det(Ci)| Chapter 2. Probabilistic Regularization 46 if Ci could be written explicitly. In fact, this can be said for any value of p as the single variate normalization in (2.76) only depends on CT (which has no-dependence on p). In order to specify the distribution for a given value of p, we seek a reasonable approximation to Cp. 2.9.2 Approximating the Normalization With the conjecture given above, the multivariate form of the p = 1 version of the Gaussian distribution in 2.44 is as follows. P U ' ~ 2M |det(Ci)| 1 ' The only problem is that because we are no longer dealing with 2-norms, the exponent cannot be represented as (m—mref)TC~1(m—mref). Thus, the assumption about the form of the regularization cannot be realized. This means that any optimization algorithm will be unsuccessful in finding the values of a. However, if one considers the theory developed on covariance model selection in Section 2.6 there, may be a way to circumvent this difficulty. Referring to Figure 2.7, the first application of the M H algorithm (MH1) produced samples of the particular 0 vector given its covariance model. In the case of the p = 1 distribution, this would be samples of ax, ctz and as. The advantage of the M H algorithm is that in deciding whether to accept a candidate for 0 a ratio of the likelihoods is used (see (2.58)). Thus, the normalization is still important but it is only the relative difference between two normalizations that matters. As a result, reasonable approximations to C\ can be used. Due to the fact that the distribution of interest is a p-norm, the p = 2 case is used as an approximation for the normalization. The reason for this is simple. Consider a point in parameter space 0(°\ Now consider some perturbation from that point, 0^ — 0^ + d0. We can assume that the value of the determinant of the Gaussian covariance (p = 2) and the p — 1 covariance differs by some factor at the point 0^. |det(C 2(0 ( o )))|5 = 7 ( ° ) | d e t ( C i ( 0 ( o ) ) ) | (2.78) Thus, if we are calculating a ratio of determinant values at points 0^ and 0^\ we would obtain the following. |det(C 2(fl(°)))|* 7 ( 0 ) | d e t ( d ( g ( ° ) ) ) | |det(C 2(0(D))|5 7 ^ |det(Ci(0(D))| ( ' ' If ^ | is close to one, then using a Gaussian approximation to the normalization of a p = 1 distribution would be valid. Recalling the theory from Section 2.6, a random walk algorithm was applied in the M H algorithm. This uses a d0 = TZ where Z is a sample from a Ar(0,1) distribution. The largest value of r used in Section 2.6 was 0.005. Thus, 95 percent of the perturbations from 0^ would be within 0.01 units. Because the proposed radius is extremely close around the current point, the assumption that 2 ^ 1 (2.80) is valid and the Gaussian normalization can be used to sample the 0 vector from any given p distribution. The following approximation is used for p(y|#i, C\) where y = m — m r e / . p(y\0P,Cp) a | d e t ( C - i ) | ^ - ? ^ l ^ l P + ^ l ^ y l ' ' + ^ l ^ P ) (2.81) Now the theory in Section 2.6 can be applied to obtain samples of 0P. Chapter 2. Probabilistic Regularization 17 x 10 Figure 2.12: M C M C sampling results for different p values, (a) ax, (b) az, (c) a8 2.9.3 Examples Using the same training models from the examples in Section 2.6.6 where they were gen-erated from a Gaussian (p = 2) distribution with (ax, az, as) = (10,1,0.01), several p-distributious from p = 1 to p = 2 were sampled. The results of these samplings are shown in Figure 2.12. In all samplings, the value of r was set to 0.004 and 60000 samples were drawn. This resulted in acceptance rates of all six chains around 40 percent. Each chain took about six minutes to run. All chains were estimated to have converged to the stationary distribution after 40000 samples. The statistics of each chain is shown in Table 2.3 (data for p = 2.2 is shown for use in Section 2.9.4). As p decreases from 2 to 1, ax has a steady decrease from approximately 10 to 1.5. The same sort of trend is seen in az. Does this make sense? In order to answer this, one must refer to Figure 1.4. This illustrates the difference between a p = 1 and p — 2 norm. If ax is high, then the goal is to restrict the value of the model parameter (or derivative) to have values closer to its mean. Thus, we obtain solutions that are more smoothed out from the mean value. The essence of the p = 1 norm is different though. The relative penalty of values away from the mean is less (on average) such that the odd -value substantially away from the mean is permitted (blocky solutions result). Thus, the relative weight for that norm should be less than its p = 2 counterpart. This reasoning does not, however, transfer to as. There seems to be no particular trend to the recovered means of ae Chapter 2. Probabilistic Regularization 48 2 4 Figure 2.13: (a) Example of training set (ax,az,as) = (10,1,0.01), (b) covariance matrix, C2, pertaining to training set in Table 2.3. The only thing that is certain is that as is higher for values of p < 2. This is most likely due to the fact that the reference model was a halfspace of 1 for these synthetic training models. Figure 2.13(a) shows a typical training model and Figure 2.13(b) shows the covariance matrix for the training set. Thus, 95 percent of the models will have values within approximately 2y/lA as 2.4 units from the reference model of 1. This means we can expect values between -1.4 and 3.4. The only problem is that the penalty of the p = 1 norm is higher for values less than one and smaller for values greater than one. Thus, it is difficult to predict the appropriate variation of as with p. Although there is no exact method to confirm the result of the M C M C sampling for other values of p, there is optimism on its validity for a couple of reasons. First, the variation of the Q parameters makes sense when considering the norms. Second, there is convergence to the p = 2 solution as the value of p approaches 2. Lastly, there is solid convergence of the Markov Chain for every value of p. This would indicate that the stationary distribution was sampled well. All of this evidence is substantial enough to validate the non-Gaussian norm results. p-value « i 0~x oT2 o-z as crs Tip) 1.0 1.5813 0.0310 0.4541 0.0157 0.0127 0.0016 0.0211 1.2 2.4527 0.0451 0.5669 0.0193 0.0131 0.0017 0.0195 1.4 3.6802 0.0754 0.6796 0.0244 0.0133 0.0017 0.0212 1.6 5.3030 0.0740 0.7822 0.0284 0.0133 0.0017 0.0148 1.8 7.4405 0.0994 0.8824 0.0294 0.0131 0.0017 0.0138 2.0 10.0058 0.1096 0.9735 0.0407 0.0115 0.0016 0.0116 2.2 12.9992 0.1879 1.0863 0.0381 0.0056 7.5922e-4 0.0147 Table 2.3: Statistics of Markov chains for different p-distributions. Chapter 2. Probabilistic Regularization 49 2.9.4 Comparing p-distributions Deciding to which p-distribution each set of training models most likely belongs is an important and challenging problem. The difficulty again stems from the fact that we cannot explicitly define the normalization for the general multivariate p-distribution. In the previous section, this was overcome by the fact that we were taking small steps in a and that for a given value of p, the normalization was not too different from a to a + da. The recovered Markov chains (in Figure 2.12) turn out to have good convergence and small standard deviation. There are two methods that have potential to help decide on the optimal value of p be decided. The first method relies on the methodology that was introduced in the previous section. Consider the problem of comparing two p distributions, p i and p 2 . For a constant value of 9, (2.78) showed that the normalization for a non-Gaussian distribution can be related to that for a Gaussian. Thus, assuming that the values of 9 are constant for each value of p, the following expression follows. |rfet(CPl(fli))| = 72 , . |^(C P 2 ( (9 2 ) ) | 7 1 1 • } Since the values of 9 for different p-values have small standard deviations (see Table 2.3), it can be assumed that they are constant and that ratio of ^ is relatively unchanging. However, the original method of relating the normalization back to a Gaussian may have increasing inaccuracies as the value of p strays from 2. This was not an issue when p was constant and 9 was changing because the ratio in (2.79) would cancel out the inaccuracy. However, for changing p there is no guarantee that this is true. Unfortunately, attempts to institute such methods were unsuccessful. It does warrant further research into approx-imation of normalization and it seems that M C M C techniques have the potential to solve this problem. The second technique of comparing p-distributions takes a more practical approach. Referring to results summarized in Table 2.3, it seems reasonable to use the error informa-tion to help decide which value of p is most desirable. However, it is important to realize that absolute values of a are not important. It is relative differences that provide the most information. For instance, the ratio of | i is about 3 for p = 1 and 10 for p = 2. Thus, this ratio is about 3 times larger for p = 2 than p = 1. As a result, ax and az are about 3 times larger. Thus, the error propagates with the change in the ratio. The goal is to find the value of p where the most models are contained around a mean. That is, we would want to find the distribution that has the minimum relative error level. TM = W w ( 2 - 8 3 ) II«(P)I|2 In summary, the value of p that minimizes T(p) is the distribution that best fits the training models. As shown in Table 2.3, for models derived from a multivariate Gaussian, p = 2 minimizes T(p). Data for p = 2.2 is included to show that T(p) increases in both directions from p = 2. • 2.10 Summary Chapter 2 has introduced probabilistic methods for developing learned regularization. The important points are as follows. Chapter 2. Probabilistic Regularization 50 • If information is given on a filter quantity, z, of the model, m, one can develop a regularization for z. This assumes that the relation between z and m is known and p(m|z) is non-informative. • The most general way to express probabilistic regularization is via a Gibbs distribu-tion. Equivalence to Markov random fields allows the regularization to be expressed in terms of interactions of cells within a neighbourhood. The size of the neighbourhood is variable and was limited to first order in Chapter 2. • Interactions between cells can be represented as Gibbs potentials summed over clique orders. The simplest way to represent this was to assume constancy in the coefficients such that a = ( a x , a z , a s ) . • In order to recover estimates of a, prior information in the form of training models needs to be provided (in a training set M T ) . • Given M y , the coefficients can be recovered via an optimization method. As a result, the multivariate Gaussian was adopted due to its analytic multivariate normaliza-tion. The coefficients were searched for using a Newton approach. The values of the coefficients were recovered well using the optimization approach. • Covariance model selection theory was introduced such that different forms of C"1 could be compared. Metropolis-Hastings theory was used to recover the a parameters. These values compared well to the results from the optimization. A Gibbs sampling method was developed to calculate the probability of each different covariance model, CJ\ • The a coefficients were generalized where each difference had a unique a value. The number of a values to be recovered using an optimization technique increased to the point where regularization was required. Cross-validation was used to select an appropriate 3 and a synthetic example was shown. • a coefficients for non-Gaussian distributions were successfully recovered using M C M C theory and an approximation for the normalization. Methods that have potential to compare results between two distributions of different p-values were explained. Chapter 3. Deterministic Regularization 51 Chapter 3 Deterministic Regularization 3.1 Introduction In Chapter 2, the model distribution was represented as p(m) = I e - * M = Le-mm) ( 3 1 } where R(8, m ) had a defined functional form but parameters 6 needed to be provided so that R could be completely specified. For example, . R{0;m) = ax\\Wxm\\r + az\\Wzm\\P + a8\\W8m\\P 6 = {p,ax,az,as) The procedure was as follows. • Given a form of R(0,m) such as (3.2), a set of training models that are random realizations of (3.1) were generated. • Given these training models, compute 8 via maximum likelihood. This requires the solution of a non-linear inverse problem. The solution to the inverse problem is difficult because of the normalization (II). LT can be difficult to calculate as the size of the problem increases. There was also no closed-form expression for values of p other than 2. This resulted in the development of M C M C theory in Chapter 2. In this chapter, the information about the model is characterized in terms of filter functions, Z j = /j(rrij). For instance, the regularization functional in (3.2) involves filters fa ~ 3x' f2 = Sz a n d / 3 = -^ Writing the regularization functional in terms of the filter parameters yields R(zu z 2 , z 3) = -Ri(zi) + R2{z2) + R3{z3) (3.3) where z i = Wjin, and R\ = ax\\ • \\p. The z,'s are not independent. As a result, breaking the regularization down as in (3.3) was not advantageous in Chapter 2 so everything was left in terms of p ( m ) . In this chapter, we proceed by ignoring the mutual dependence of the filter parameters. By doing this we can use information about the i t h filter parameter applied to the training models to estimate Ri(zi). Because of the success in Chapter 2, we are able to evaluate the consequences of our assumptions and decide whether, or under what circumstances the deterministic route is valid. Chapter 3. Deterministic Regularization 52 P(zi,- • , Z Q ) P(zi,. • > Z Q ) P(zi,• . , Z Q ) 3.2 Theory Given several niters, z;, i = 1,..., Q, the goal is to represent the regularization operator, i?(zi,... , Z Q ) , as follows. R(ZI,...,ZQ) = R1(Z1)+R2(Z2) + . . . + RQ{ZQ) (3.4) Assuming the filters are independent, the joint distribution can be broken down into the product of individual distributions. J _ -{RI(ZI)+R2{Z2)+...+RQ{XQ)) n> e _ J _ e - f i i ( z i ) e - / ? 2 ( z 2 ) _ _ _ e ^ o ( z Q ) £3 na 6 n 2 e - - n Q 6 Finally, this can be expressed more generally. p (z i , . . . , z Q ) = p ( z i ) p ( z 2 ) : . - P (ZQ) (3.6) Thus, by assuming that each data set is given independently of the others, we can write the regularization in terms of the multiplication of the individual distributions of Z j . Once independence has been assumed, we can continue by considering an individual p{zi). This is still a multivariate distribution and can cause trouble if we do not know its normalization. Thus, it is much wiser to think about a further simplification. Given a training set Z i , consider its elements Zj, j = 1... Y. Consider the case where each ZJ is independent of each other. Thus, we can break down p(z;) as follows. j ^ p ( z t ) = -fi-Pi{zi)P2{z2). • .py(zY) (3.7) Each Zi is assumed to come from its own distribution. This is an independence assumption between the z/s. Furthermore, since we wish to recover a generic distribution i?(z,), we must assume that the zfs are all samples from R(zi). Thus, we assume that the Zj samples are identically distributed. Assuming that the values of Z j are independent and identically distributed (i.i.d) means that we have Y samples of the distribution, p(zj). This allows us to proceed with an optimization-type method to determine the best regularization (or parameters, given a regularization) given a set of samples z». Then, if this can be repeated Q times, the regularization can be fully defined by (3.4) and (3.5). Making the i.i.d assumption between the elements of Z j allows the one-dimensional distribution to be written as follows. p(zj) = I e -^ ' ) (3.8) In the following analysis we will look at each filter function separately. We can therefore drop all subscripts in (3.8) and consider a filter function z and its distribution to be p(z) = ±e-RW (3.9) Chapter 3. Deterministic Regularization 53 1/2 T | 2 - i / 2 1^1 I I I I I I 1^1 ri r 9 TJN-1/2 TJN+1/2 I ' M Figure 3.1: Definition of dual grid for recovery of R(z): rj is the data grid and denotes the ith model cell. The problem that remains is the specification of the functional, R(z). We take two approaches to recovering an appropriate R(z). In the next section, we assume it is a generic function and we recover values of the function via an inverse problem. The second approach parameterizes R(z) as a p-norm and seeks to recover optimal values of parameters. 3 . 3 Recovering R(z) as an Unknown Function Given M<, the individual values of the filter z are computed. The goal is to represent this as p(z) = Ie-*« The data will be obtained from a histogram of z. A datum is related to the distribution via .N di= J p(z)dz = ^ J e-^dz i = l . where (%_!,rji+i) define the bin limits for the ith histogram element. Given N data, and some estimate of their errors, the goal of the inverse problem will be to estimate R(z). 3.3.1 Binning Strategies We are provided with a vector of samples, z, of z. Let us assume that z is Y elements long. The goal will be to specify a set of bins that contain a certain number of elements of z. The coordinates of the bins in 2-space is rj- The bin boundaries are referred to as (T^_ i ,?7 i + i ) so that the bin center can be referred to with an integer subscript. This can be seen in Figure 3.1. Assume that the number of samples in each bin is fcj. There are two practical methods of determining the values of k{. Chapter 3. Deterministic Regularization 54 The first binning method involves keeping the values of ki constant. This is a valid method when dealing with samples that are changing in value (i.e. a Gaussian with non-zero standard deviation). In this case, the value of k may be calculated via k = sY ( 3 . 1 0 ) where 0 < s < 1 . The value of s can be changed to obtain a reasonable value for N. If k is not a whole number, often the last bin will contain a relatively smaller number of samples. Using this strategy, the values of 77 are determined using the samples within a bin. m and nN, 1 are equal to the smallest and largest values of z. The rest of the values 2 J V -t- 2 are determined using the samples in adjacent bins. For example, 772+1 would be the mean of the largest value of z in the second bin and smallest z in the third. When the samples do not change continuously (i.e. z is dominated by a small numerical range), the first binning strategy is not appropriate. In this case, it is better to prescribe V a priori and bin the samples accordingly. How small the bins are will play a large role in how detailed R(z) will look. This is something that should be kept in mind when specifying V 3.3.2 The Forward Model Regardless of how the binning is done, we obtain N bins with limits 77 and ki, i = 1... N, samples within each bin. Thus, the data are calculated as follows. di = I ( 3 - 1 1 ) We have numerical quantities for the data. These values are probabilities and referring to ( 3 . 9 ) we can relate them to the governing distribution. di = ^ J e~R^dz ( 3 . 1 2 ) In order to complete the forward model, a method where given values of R{z) can be translated into appropriate di values is required. This means the discretization of ( 3 . 1 2 ) and the definition of a grid for R(z) as shown in Figure 3 . 1 . The functional J R is represented as discrete cells, r. The accuracy of the recovered R(z) is dependent upon the interval width of r. This width is made sufficiently small so that the discretization error is negligible. Discretizing ( 3 . 1 2 ) the following is obtained. 1 M di = -YJBike-r» ( 3 . 1 3 ) fe=i M is the number of unknowns (r). D is a matrix that contains the appropriate width of each cell rk that corresponds to the particular datum, di. Figure 3 . 1 illustrates this concept. In this example, the first row of B would have the width of the cells rk entered in the first nine columns. The rest of the values in that row would be zero. Thus, the first datum only has contributions from r\ to rg. Before declaring ( 3 . 1 3 ) the final forward modelling relation, there are a couple of items to investigate. The first stems from the basics of probability distributions. It is well known Chapter 3. Deterministic Regularization 55 that the total probability over the variable z must equal one. We assume that this can be well-approximated by using the range (ni,r]N+i) and present the condition in discretized form. We are adding an extra datum so it is necessary to re-define N — N + 1. 1 M dN = l = -Y,BNke-rk (3.14) The second issue to be discussed is the normalization, II. It can be represented as the following integral. oo 11= J e~R^dz (3.15) —oo Essentially, this expression is akin to the expression in (3.14). If we hadn't specified d^, then we could recover II as a parameter. However, if this is investigated further there is an inherent non-uniqueness present when considering the normalization, n. 3.3.3 T h e Effect of the N o r m a l i z a t i o n Consider the integral forms of (3.13) and (3.14). (3.16) - h i -R^dz The normalization can be rewritten as 1_ n : and (3.16) can be rewritten. e - i n n / e-{R(z)+\uXl)dz (3.17) / "4 e-{R{z)+mn)dz Let the regularization in the recovery of R(z) be denoted (j>m(R). Let m(z) = R(z) + lull. Because n is a constant, it does not affect the regularization. <t>m(m) = (f>m{R) If the value of n is changed, the model, m(z), is shifted by a constant. Thus, we are free to choose any value of n we wish. In this chapter, we set m(z) = R{z) (II = 0). However, Chapter 3. Deterministic Regularization 56 recognizing the fact that we wish to use R(z) as a measure of the size of z, we adopt the following constraint. R{z) > 0 • (3.18) Once R is recovered, we simply shift it by a constant. The optimal value of IT, IT*, is IT = e€ (3.19) where £ = min R(z). A numerical example is shown later in this chapter. 3.3.4 Assigning Data Errors Now that the probability data have been quantified, their errors must be considered. In order to assign these errors, a couple of questions must be considered. 1. Are the data noisy? This may be counter-intuitive when considering the source of most geophysical noise. In that sense, the data are not noisy since we have derived them synthetically and have added no noise. However, in the truest sense of inaccuracy, the data are noisy. The source of the noise is the finite number of samples we have drawn. This can be illustrated by considering the law of large numbers. Consider a finite set of samples, {Xn}, drawn from the distribution p(x). The summation of these trials can be pre-n sented as a random variable, Sn — -^i- Thus, using Chebyshev's inequality we can i=l present P(|^-H><)<4 . n ne* for any e > 0. p. (mean) and a (error) are the parameters of p(x). Thus, the uncer-tainty of the data is inversely proportional to the number of samples. However, we would need an infinite number of samples to obtain perfectly accurate data. 2. How can we model the uncertainties? It is not possible to completely gauge the uncertainties without knowing the dis-tribution the data were sampled from. However, for any given n, we can assume that the data errors, a, can be modelled normally as iV(/x, r). As n increases, p, 0 and r —> 0. The question that remains is whether we can determine appropriate data errors from this information. The first option for specifying uncertainties has been used in many applications. Errors are simply taken as a percentage of the data point. The data in this problem do not have a large dynamic range so we need not worry about adding a base level error. The danger in this method is that a bad estimate of error can lead to an unknown target misfit. For instance, if we under-estimate the errors on the data consistently, our target misfit (see (1.12) and (1.14)) is actually greater than N. As a result, we would end up over-fitting the data. This method often adds simplicity in assigning errors. The second method for assigning errors relies on the fact that we are given multiple realizations of z. In the case where several training models were developed and z is Chapter 3. Deterministic Regularization 57 designated to be some filter of each model, there are actually several versions of the vector z. Thus, we may assume that there are j = 1,..., K versions of z. Each of these sets of z can be sorted and put into bins just as the total data set was earlier. Thus, a vector of probabilities, p^\ for each set can be created that is an analogy to d and the error for each bin can be calculated. 1 K CT<=X^l5>^~di)2' i = h-..,N-l (3.20) i=i AT — 1 is used to give an unbiased estimator of variance. These errors will be more reflective of the given samples and will place higher errors on bins with larger variances in z value. Because the bins are originally designed such that they contain almost equal probability, we need, not worry about increased errors due to insufficient bin sample size. The two different methods of quantifying error will be assessed later in the chapter when synthetic examples are shown. The error assigned to the last datum is 0.001. This is to ensure that the total area under the distribution will be very close to 1. 3.3.5 Regularizing r In order to invert for the values rk, there must be regularization specified on in order to make the problem tractable. This is done by imposing smoothness between the r^s. The reason that this is valid is that the r^ 's are reasonably close together in z-space and we are not interested in small scale differences in R(z). It is the larger features that are important. The regularization can be specified as <Pr = d | W r r | (3.21) where ||-|| is the L2 norm and Wr is a discretized finite difference second derivative operator. It is a M — 2 by M matrix that is built as follows. Wr = ± 0 2 1 -2 1 0 0 . . 0 0 0 0 1 -2 1 0 . . 0 0 0 0 0 1 -2 1 . . 0 0 0 0 0 0 0 0 . . 1 -2 1 (3.22) 5 is the z distance between centers of the cells rk which will assumed to constant in this case. 3.3.6 Solving for r The previous two sections have developed a forward model (and hence a misfit functional as designated by (1.12)) and regularization for r. The solution for r is posed as the minimum of the objective function r* = minr 4> = <j)d-\- 0<f>r (3.23) Chapter 3. Deterministic Regularization 58 where 4>d = l\\Wd(d-F(r))\\2, (3.24) F(r) is the data given r and Wd = diag(i) represents the data errors. The solution to (3.23) is approached as described in Section 2.4.1. As a result, the first and second derivatives of the objective function are required. Implicit to this is the calculation of the sensitivities. Jlk = | * = _ * t e - * (3.25) ork n The gradient of the objective function is expressed as follows. g = -JTWjWd(d - F(r)) + dWjWrr (3.26) As in Section 2.4.1, the second derivative is referred to as the Hessian. However, for this problem a quasi-Newton method is employed where the Hessian is approximated. This is due to the fact that J — J(r) and another derivative results in a rank 3 tensor. Thus, the contribution from this tensor is ignored and the Hessian is approximated as follows. HK JTWjWdJ + pWlWm (3.27) Finally, the perturbation in r is calculated and the algorithm in Section 2.4.1 is employed. <Jr = - # _ 1 g (3.28) Note that in Section 2.4.1 the perturbation was defined as p but that would certainly be inappropriate in this section. The algorithm in Section 2.4.1 is run for a constant value of the tradeoff parameter, (3. This parameter controls the relative weight of the data misfit and regularization and has to adjusted until our misfit reaches an expected value (usually equal to N - see (1.14)). It is wise to start with a large value of (3 such that </> « pfa and the problem is almost quadratic. The solution will be obtained quickly and can be used as a starting solution for the next value of (3. A cooling scheme is used for (3 where j3^ = UJ(3^ and UJ < 1. For a detailed discussion of particulars to non-linear programming [10],[25] and [14] provide an excellent overview of all relevant aspects. 3.3.7 Testing the Algorithm Before filtering TS1 (see Section 2.6.6) to create z samples to run the algorithm, a set of samples from a known distribution should be used. Thus, 1000 samples from the following normal distribution was used. This is a A R ( 1 0 , 1 0 ) distribution. Thus, r should resemble ^z ^ and n should equal I O V ^ T T « 2 5 . 0 7 . The first thing to evaluate is the data errors. The samples from ( 3 . 2 9 ) were used to create data in twenty bins. Then these samples were assigned errors based on either a percentage or multiple data set method outlined in Section 3 .3 .4 . The results are shown in Figure 3 .2 . The norm- of the errors, ||e||2, was used to evaluate the size of the errors. The true errors were calculated using ( 3 . 2 9 ) and | | e ' r u e | | 2 = 0 . 0 2 5 4 . We seek an error Chapter 3. Deterministic Regularization 59 0.065 0.06 oP-055 CD >0.05 ro ro ^ . 0 4 5 0.04 15% errors: ||e||2 = 0.0335 0 20 bin centre 10% errors: ||e||2 = 0.0224 0.065 0.06 .P-055 ro >0.05 2 ro ^ . 0 4 5 0.04 -20 0 20 40 bin centre Multiple data set errors: ||e||2 = 0.0313 0 20 bin centre -20 0 20 bin centre 40 Figure 3.2: Comparison of different error methods with true errors, e represents the error value (a) 0.06 „ 0.055 3 ro » 0.05 ro ro •° 0.045 0.04 -J L_ -> 1 =F -15 -10 (b) 0.04 | 0.03 g ro 1 0.02 0.01 -20 (c) 0.06 5 10 15 bin centre 20 25 30 35 I II in i M i f H M i -10 10 bin centre 20 30 40 S 0.04 0.02 ^ i I i i i i i M i i i i ' i ~ i i M iimn i i -20 -10 10 bin centre 20 30 40 Figure 3.3: Data for varying N: (a) N=20, ||e||2 = 0.313, (b) N=40, ||e||2 - 0.322, (c) N=32, ||e||2 = 0.318 Chapter 3. Deterministic Regularization 60 Figure 3.4: Resulting models for data in Figure 3.3 assignment method that produces an error level near this value. Looking at Figure 3.2, it is apparent that the correct percentage should be between 10 and 15 percent. The multiple data set errors produce a total norm that is close to the true norm without any parameters being specified. These errors were created by breaking the total samples into 10 sets of 100 samples. It seems that without any prior knowledge of error distribution, it is reasonable to use the multiple data set method. The second thing that should be investigated is the effect of bin size (number of data). Several recoveries of r were done with varying numbers of data. In order to make the results meaningful, the value of LT was set to the true value. The first comparison was when the number of data were doubled. The base data set is shown in Figure 3.3(a). The multiple data set method was used to quantify errors for all data. The number of data was doubled and is shown in Figure 3.3(b). The resulting models are shown in the left panel of Figure 3.4. It is apparent that doubling the data results in a better fit to the true model. However, one can see that the fit is only significantly better for values of z away from the mean. Thus, if the number of data at the bins near the edge of the distribution was increased, the result may be better without completely doubling the number of data. This data set is shown in Figure 3.3(c). The outer two bins on each side were divided into 4 bins each resulting in 32 data. The resulting model is shown in the right panel of Figure 3.4. There is a slight improvement over the base data set but it is clear that doubling the data recovered the best model of r. In addition to making inference on the data, Figure 3.4 shows the first results of the inversion algorithm. With a limited number of samples, the algorithm is successful in recovering a model for r that has the same character as the true model. Aside from small differences at the flanks of the distribution as well as with the value at the mean, the inversion can be considered a success. What Chapter 3. Deterministic Regularization 61 z Figure 3.5: Resulting models for varying II value remains is a discussion of the normalization, II. Figure 3.5 shows several models with varying II value used as input. The true model is shown as a dotted line. As shown in Section 3.3.3, the value of II has no effect on the character of the solution. It is possible to prove this numerically by considering (3.19). Referring to (3.15), it is apparent that an invariant model m(z) = R(z) + \n(U) exists. This can be proven by looking at a few values of the models in Figure 3.5. r1 r 5 r 6 r 1 0 r 1 5 r 2 0 -25 . r 6 r 3 0 Calculated value 6.9383 5.3288 4.6357 4.2302 3.9425 3.7194 3.5371 Invariant value 6.9383 6.9383 6.9383 6.9383 6.9383 6.9383 6.9383 The superscript indicates the value of II and the subscript is the indicator for the model parameter. All models have equal invariant value and are linear shifts of each other. As a result, it is possible to predict a solution for any value of II given the solution for another. However, we still choose a solution that adheres to our norm condition in (3.18). It is desirable to have a zero value in our norm. In the case of the example of Figure 3.5, n* = 26.2366 which is close to its true value of about 25.07. Before declaring the algorithm successful, it is wise to check the results for another synthetic result. Thus, we look at the generalized Gaussian. Specifically, of interest is the LI norm which looks like the following in distribution form. p(z) = ie - W (3.30) In order to find the correct form of R(z) = \z\, samples of (3.30) are needed. Thus, the MH algorithm from Chapter 2 is used to sample the above. The results of 10000 samples Chapter 3. Deterministic Regularization 62 are shown in Figure 3.6(a). These samples were used to create a histogram that matches the true distribution well. Thus, we can be sure that these samples are from (3.30). The resulting data and errors are shown in Figure 3.6(b) and the final model for r is shown in Figure 3.6(c). The recovered model is an excellent representation of the true model. The value of FT* for this result is 2.4316. This slight difference from the true value of 2 can be interpreted as due to a finite number of samples of z and hence small differences between the histogram values and true curve in 3.6(a). 3.3.8 Synthetic Example The training set TS1 was used as data for a synthetic example using deterministic regu-larization. Due to the assumptions that were made in the development of the theory, it is unlikely that we can recover exactly the same regularization functional that was introduced in Chapter 2. This data set consists of 100 training models (K = 100) and we are presented with three (Q = 3) filtered versions of the data. z j = Wxm z 2 = Wzm (3.31) z 3 = Wsm Chapter 3. Deterministic Regularization 63 (a) 0.03 r I=F3> = r 0.028 0.0241 -1 ' (b) 0.032 0.03 h 0.028 0.026 h 0.024 (c) 0.04 § 0.03 ro S 0.02 h 0.01 -0.8 -0.6 -0.4 -0.2 -1.5 -2 0 bin centre 0.2 0.4 0.6 0.8 -0.5 0 bin centre 0 1 bin centre 0.5 1.5 Figure 3.7: Data and errors for (a) zi , (b) z 2 and (c) Z3 The data and resulting multiple data set errors are shown in Figure 3.7. Recalling (ax,az,as) = (10,1,0.01), one can see the larger variances in z 2 and Z 3 that result. There are 37 data for z\ and z 2 and 41 data for 2 3 due to the nature of the filters. One can see that as the corresponding value of a decreases, the range of values for z% increases. This is because ^ behaves like a variance. Referring to Section 2.6.6, a true regularization functional can be inferred for each z. j^true = 5(m-m r e/) 7 WjW x(m - m r e /) R\rue = - ( m - m r e / ) r T 4 ^ ( m - m r e / ) (3.32) ptrue — (m - mreffWjWsim - m r e /) The algorithm was run three separate times and a recovered Ri{zi) was obtained for each data set. The results of Figure 3.8 are important. The probabilistic regularization was dominated by the ax term because the value of ax was at least one order of magnitude larger than its counterparts. Physically, as shown with the ax parameter in Figure 2.6, this means that the variance of z\ will be quite small. That is, the horizontal derivative will have a small range and over K training models this range will be adequately sampled. Thus, the samples for z\ are truly representative of its part of the regularization and the resulting deterministic regularization approximates the probabilistic value well in Figure Chapter 3. Deterministic Regularization 64 (d) 1000 800 c 600 o 400 200 + Deterministic • Probabilistic x + + 20 40 60 model number 80 100 Figure 3.8: Recovered and true R(zi) for (a) z i , (b) z 2 , (c) z 3 and (d) resulting total Q Ri(zi) for deterministic regularization compared to true probabilistic value. 3.8(a). However, because the value of ctz is one order of magnitude less than ax, and as is three order of magnitudes less, the resulting samples of 22 and 23 are poorly sampled towards the edges of the distribution. This is because we have the same number of samples to obtain the same quality data and the sampling is done with the values of the other two a coefficients in mind. As a result, we do not see values of z2 and 23 two standard deviations away from the mean very frequently. The results from z2 and 23 are shown in Figure 3.8(b) and (c) respectively. The deterministic result for z2 is a poor fit away from the mean. The result for 23 is substantially worse. Thus, the conclusion must be that without considering the samples of z\ it is difficult to reproduce the appropriate regularization for z2 and 23. Figure 3.8(d) shows the resulting regularization value for T S l using both the recovered deterministic regularization and the true probabilistic regularization. The probabilistic regularization has a mean value of 50.1090 and a standard deviation of 6.9837. The deterministic value has a mean of 394.9820 and a standard deviation of 127.7597. Models that have extremely high deterministic values have extreme values of z2 and 23. There are only a few of these models which support the idea that the flanks of these distributions are under-sampled. Figure 3.8 can be considered a worst case scenario for deterministic regularization. This is because we are using training models that are inherent Chapter 3. Deterministic Regularization 65 to a multivariate probabilistic regularization. As a result, the effect of the independence assumptions made earlier in this chapter is quite profound. However, there are a few things that need to be considered before evaluating this method as a whole. • The true effect of the assumptions cannot be evaluated until we compare geophysical inversion results in a later chapter. This stems from the fact that Figure 3.8 is a worst case scenario. In reality, we will not know the form of the regularization and will have no method to evaluate the probabilistic or deterministic results. In Chapter 2, although the theory is robust, there is still a choice for the type of distribution that is appropriate (Gaussian or non-Gaussian) that one must make. • There are positive results in Figure 3.8. Firstly, Figure 3.8(a) is an excellent represen-tation of the strongest part of the regularization (z\ = Wxm). Thus, the deterministic regularization will try to penalize variation in zi . Relatively, the penalty for z 2 and Z3 is correct. There is less penalty for a given value of z for z\ than 22 and even less for 23. Although the absolute values are incorrect, the overall goal of the regularization is maintained. • The recovered curves are most definitely more quadratic looking than absolute value. This will be shown numerically in the next section. This is encouraging because if these results can be quantified beforehand, we may have a better idea of what type of distribution to define for the probabilistic path. 3 . 4 Parameterizing R(z) as a p-Norm In many cases where the training information is not completely detailed, it may suffice to specify the type of norm and recover best fitting parameters. In this section, this is done by defining the general distribution for a p-norm (Generalized Gaussian) [27]. IT is an appropriate estimate of deviation and given the value of p and a set of samples, z, the most likely value can be expressed as follows. The symbol u denotes the mean of the distribution. Given a set of samples, it can be calculated as usual. /„(*) = (3.33) (3.34) (3.35) i= i Thus, the only unknown in (3.33) for a given sample set is-the appropriate value of p. If there are Y samples of z the multi-variate independent p distribution looks like (3.36) Chapter 3. Deterministic Regularization 66 and the negative log-likelihood is as follows. LM=^h«-M-^£h (3-37) In order to find the optimal value of p, an initial value is specified and the search is done in the opposite direction of the gradient (—<?). Because it is only a one-parameter search, the steepest descent algorithm is employed and the gradient is calculated via finite difference. = L(z,p + 6p)-L(z,p) op The value of 5p is taken to be small (w IO - 3 ) and the resulting perturbation is chosen to be a maximum of 0.1 units in the downhill direction. „<*-!) = p (*>_ i £ _£_ (3.39) LO is some constant equal to or less than one. The highest value of LO that satisfies L(z,p^-k+1^) < L(z,p^) is desired. For this simple algorithm, the initial guess is LU = 1 and it is reduced by a factor of two until the previous condition is satisfied. Once the value of L(z,p^) has remained significantly unchanged for several iterations, the algorithm is terminated and the optimal values of p and o are output. 3.4.1 Numerical results The training set TS1 was again used to produce results for the p-distributions. As before, there are three sets of samples as outlined in (3.31). The results are summarized in Table 3.1. Just like the functional case from the previous section, the fit for z\ is very close to the true value. This is expected for reasons outlined earlier. What is encouraging is that the results for all three sets of samples are reasonably close to the true value of p. The major difference is manifested in the calculation of the a (deviation) parameter. This parameter is much too small in both the z2 and z$ results (as a result of the incomplete sampling discussed earlier) and this leads to an unfavourable comparison between the coefficient of the p-norm and the true coefficient (^-). The graphical comparison of the results is shown in Figure 3.9. As expected, the p-norm solution is close to the results of the general functional from the previous section. Thus, the combination of the two methods leads to confidence that both algorithms are producing sensible results. When the samples can be represented by a simple p-distribution, we will see agreement between the two deterministic methods. Otherwise, the function R(z) has more flexibility in creating more complex distributions such as combinations of p-distributions or otherwise. In this case, we would be quite certain from the deterministic analysis that T S l can be represented by a p-distribution where p P a l ai 2 Z\ 1.9759 0.2901 -0.0048 5.8350 5 Z2 1.8058 0.4821 0.0039 2.0679 0.5 Z3 1.7190 1.0246 0.9320 0.5579 0.005 Table 3.1: Results for optimal p-distributions. Chapter 3. Deterministic Regularization 67 -4 -2 0 2 4 Figure 3.9: Recovered (functional and p-distribution) and true R(zi) for (a) z i , (b) z 2 and (c) z 3 is close to 2 in all sample sets. This information is of extreme value. The coefficients, although not correct in amplitude, provide relative information on the importance of each Zi in the regularization. Overall, one would have to agree that the information obtained from the deterministic approach has quantified a meaningful regularization that is surely more accurate than an ad hoc R(z). Furthermore, the results of the deterministic approach can be used to help the user gauge the type of probabilistic regularization. functions that should be attacked using the M C M C theory in Section 2.6. After seeing the results in Figure 3.9, the multivariate Gaussian seems like a valid regularization form and the true values of the coefficients could be obtained easily. In summary it is not the absolute quantitative results that are of interest when obtaining deterministic regularization functions. It is the form of the functional and their relative changes that provide valuable information that, when combined with the probabilistic approach, can help the best form of R(z) be found. 3.5 Summary In this chapter, the deterministic route to learned regularization was developed. Two meth-ods of accomplishing this were explored. The first, the generic R(z) method, is the most flexible. Given a filtered quantity, z, of the model, m, a generic distribution was recovered for z. In order to accomplish this, i.i.d assumptions had to be made on z. Several synthetic Chapter 3. Deterministic Regularization 68 examples were shown and the implication of the assumptions were discussed. To comple-ment this method, z was used to recover the optimal value of p in a p-norm. Comparison of results from the p-norm and generic R(z) revealed two items. Firstly, although both ap-proaches failed at recovering the width (spread) of all true functions, they were successful at recovering the shape. Second, agreement between the p-norm and R{z) indicates that a p-norm parametrization is valid. This information could then be used to help choose the type of norm for a probabilistic approach. Chapter 4. The HT algorithm 69 Chapter 4 The H T algorithm 4.1 Background When inverting geophysical data, regularization for the appropriate physical property must be developed. Probabilistically, this regularization should be appropriate for all surveys involving this physical property. After all, the type of survey should not influence the prior distribution. However, in practice, does this reasoning make sense? To answer this question, we need to refer back to Section 1.5.2. In order to counteract the natural decay of the magnetic kernels the regularization had to be rewritten as (1.33). This allowed every cell to have the same chance of producing the measured secondary field. In this case, the regularization had to be amended according to the distance between cell and receiver. If we performed two magnetic surveys - one ground survey and one aero-magnetic survey - the form of (pm would certainly change. Thus, in the case of potential fields, the regularization should be dependent on the forward operator. In this chapter, this idea is expanded and developed and compared to the probabilistic result. This theory was developed by Haber and Tenorio [16] and is summarized in this chapter. It is referred to as the H T algorithm. 4.2 Theory 4.2.1 Basics Just as in the probabilistic case, it is assumed that a set of training models, M t , is available. The HT algorithm takes advantage of forward modelling. Specifically, given M t one can create a set of training data, Vt = F[M 4 ] where T>t = {d\} i = 1... K. Thus, given K sets of data, the procedure is as usual: define a regularization (and an objective function) and invert the data. However, as before we do not know the exact form of the regularization. Thus, we simply propose a possible regularization that seems reasonable and appropriate. As a result, the regularization is defined as <Pm = <Pm(Q) where 0 is a vector of parameters that we wish to optimize given the set M^. The objective function is thus defined as 0 i = | | | C ? m i - d i | | 2 + 1^(1114,0) (4.1) where the forward modelling is assumed to be linear such that d* = Gm\. We will wish to compare this algorithm to results obtained in the previous chapter. Thus, the regularization is expanded as 4>{mi, 6) = \\Wm{9){mi - m r e / ) | | 2 (4.2) and the solution to (4.1) is written as follows. m, = (GTG + pW^WmrHG7^ + 6W^Wmmref) (4.3) Chapter 4. The HT algorithm 70 As a result, we have K recovered models all using a particular 0 vector. The goal is to find the 6 that produces recovered models that are closest to their "parents" in M t . This can be summarized by the following objective function. 1 2 (4.4) 4.2.2 Algorithm Construction Considering that we wish to find the minimum of (4.4) subject to (4.3), this problem can be looked at as a constrained minimization. That is, we can pose it as below. K 1 min e ^ = £ d|mi(0) - m\||2 (4.5) t=i s.t. rn, =. (GTG + f3W^Wmr1(GTdi + pwlWmmref) The way the solution of (4.5) has to be approached largely depends on the form of Wm(6). If the dimension of 6 is large then one may need to include regularization for 6. This would also make conventional ways to solve the problem inefficient as the number of iterations that would be required would be large. However, our goal is that of comparison to the theory in the previous chapter so it is written as following. Ng wZlWm = Y,9kWkTWk (4.6) fc=i N$ is the number of parameters we choose to include in the regularization. If N$ is rela-tively small, we need not worry about regularizing the expression in (4.5). Thus, we can concentrate on its solution. There are two ways that the solution to (4.5) is approached. The first is the simplest and sometimes more effective method. If the inversions of each data set (equation 4.3) are efficient, then for every configuration of 6 one can simply compute ip. Then, one value, 9k, is perturbed and ip is recalculated. This manual changing of the 6 vector can seem laborious but if one has a good idea of the location of the minimum, it can be quite effective. The second method is the more rigorous approach and the one that will be concentrated on. It is an automated Newton search approach [25] that requires the expression of the first and second derivative of (4.5). In order to simplify the expressions, we write A = GTG + 0W^Wm and rewrite (4.3). Ami = GTdi + 0WlWmmref (4:7) For any search method, an expression for the gradient is vital (it indicates the direction of the minimum). In order to take the derivative of (4.5), the chain rule is utilized. b\/> d6k dip dvcii dvcii d9k K 9k = i=l Chapter 4. The HT algorithm 71 In order to fully express the gradient, the last term in (4.8) must be derived. Differentiating (4.7) with respect to 9k w e obtain the following relation. dAm% = d{GTdi + 8-WZWmmref) d9k 89k Ai9i+Wk^ = p~mr^ (4-9) Rearranging the above we obtain the expression for the derivative ^ = -/3A-1(WkTWk(mi-mref)) (4.10) where we have used J^- = BW^Wk from (4.7). Thus, the gradient is expressed as below. K 9k = - / 3 £ ( m i " m\)TA-\WkWk{mi - mref)) (4.11) i=l The gradient values are un-normalized. In order to get a good estimate of the size that the perturbation should be, the Hessian (curvature information) is calculated such that the perturbation finds the minimum of an appropriate quadratic problem based on the particular parameter location. After several iterations, the result converges to the minimum of the non-quadratic problem. This requires the calculation of the second derivative of (4.5). Differentiating (4.8), H* = Wr^-eej - b T k + { m i ~ m i ) d9M ( 4 - 1 2 ) In order to complete the definition, the second derivative of mi must be expressed. Starting at (4.9), this can be accomplished. d(A^) d{BW^Wkmj) = d((3WkTWkmre}) dBj d9j dOj . f92m, dAdmi ^ T T r T x T T dmi ' , H . A ^ + ^ ^ + « ^ W = ° ^ Rearranging the above and taking the derivatives we obtain the second derivative. As a result, (4.14) along with (4.10) are sufficient to define the Hessian in (4.12). For simplicity, it is easier to leave the Hessian in this form and refer to the definitions of the derivatives. The Hessian is only useful in a search algorithm if it is positive definite. In this problem, there is no guarantee that this condition is met. As a result, the matrix must be checked for positive-definiteness and amended if needed. Let A be the smallest eigenvalue of H. If A is negative, the Hessian is expressed as follows. H = H + {\\\ + e)I (4.15) Chapter 4. The HT algorithm 72 e is some small value to ensure the smallest eigenvalue of H is greater than zero. If A was greater than zero originally, then H = H and the definition in (4.12) will suffice. The perturbation in the 9 vector can now be expressed. 59 = -H-Xg (4.16) However, since this is a quadratic approximation to the solution, it is prudent to line search the perturbation. 0(n+l) = 0(n) + w 5 0 ( n ) (4 1 ? ) n is the current iteration, and to is the line search factor that is chosen to augment the perturbation. The value of 0 in (4.5) is set to be relatively small. A typical value of 0 is obtained by inverting noisy synthetic data created from one of the models in the training set. This data is inverted via the minimization of (4.1) with a default form of </>TO (e.g. equation (1.27)). Since the HT algorithm requires inversion of data that are perfectly accurate, we would want to fit our models to a much closer degree than the noisy case. Thus, it is prudent to set 0 to & value several orders of magnitude less than the value obtained in the noisy case. The absolute value is not of particular interest. It is only important that the value of 0 is approximately correct. The approach shown in this section is valid for a parameterized regularization with few unknowns. Haber and Tenorio [16] generalized the problem to recovering a functional form of 9. The solution to this type of problem is complicated and the reader is referred to the relevant paper for details. 4.3 Testing the H T Algorithm In order to test the H T algorithm, a forward problem needs to be defined. The difficulty in evaluating the results is two-fold. Firstly, it is not known what the true solution should be even in a synthetic case. Consider the training set TS1. Certainly, it is clear that the values of 9 in (4.6) are known when considering the governing distribution. It has been shown in Chapter 2 that given TS1 and a proper form for its distribution, the correct values of the coefficients can be recovered. However, would we expect the coefficients to remain the same in the H T algorithm when we apply different forward operators (G)? This section strives to answer this problem via the introduction of two geophysical methods. 4.3.1 Cross-well Seismic tomography The 2D cross-well tomography experiment is one of the simplest geophysical problems. Thus, it is an ideal problem to use when evaluating different forms of regularization. Seismic energy is propagated from one borehole to another. ns sources are ignited in one borehole while nr receivers record the time, t, it takes for the seismic energy to arrive. The time is dependent on the seismic velocity, v, of the Earth on the path, £, between the source and receiver. If refraction of energy due to velocity contrast is ignored, the forward model can be expressed easily. t = f -^jrdl J v(i) = j m{t)<U (4.18) Chapter 4. The HT algorithm 73 a | | m ; ru e _ m r M | | O l Otz a., a. a . . i = l i=2 i=3 i=4 i=5 H T mGauss 0.6079 0.0836 0.8659 0.0999 0.2405 0.0021 0.7020 0.8371 2.5271 40.1043 38.0618 38.2191 8.3148 8.3185 12.4855 12.4940 16.7282 16.6446 33.4661 33.5698 Table 4.1: Results of HT algorithm and multivariate Gaussian for TS2 m is the inverse velocity, or slowness. By recovering values of m instead of v, the problem becomes linear. If the Earth is discretized into M cells of constant slowness, the integral can be approximated by a summation. M ti = ^2Gikmk (4.19) Gik contains the length of the ith seismic ray in the kth cell. This expression can be further generalized so as to agree with notation in the Introduction. d = Gm (4.20) d is the vector of time data and m is the vector of slowness values. As a result, the tomography problem is the equivalent of the minimization of (4.1). Because the elements of G are simply lengths of ray paths (there is no geometric depen-dence), the tomography problem is an ideal method of evaluating regularization approaches. However, this is not the case for the HT algorithm. The theory of the HT algorithm is based on accounting for deficiencies in the mathematical expression of the forward operator. When dealing with the tomography problem, this is less of a concern and the results are less impressive. Consider a 10 cell by 10 cell 2D mesh with 7 equally spaced (in depth) shot-points on the left of the mesh being received by 7 receivers (at the same depth points) on the right of the mesh. The total number of data is 49. A training set of a box of constant slowness in a homogenous halfspace is considered. The dimensions of the block are determined probabilistically with every configuration equally likely. The block is given a mean value of 10 ^ with a standard deviation of .0.5 ^ and the halfspace has a mean value of 1 ^ with a standard deviation of 0.1 Fifty models were created with 45 of them constituting the training set TS2. The remaining 5 models are used to evaluate the results. The HT algorithm and the multivariate Gaussian method of Chapter 2 were applied to TS2. The form of the regularization was assumed to be of the standard form shown in (1.27). The results are summarized in Table 4.1. Figure 4.1 shows the true models and the recovered models for the evaluation set. Numerically, the results for the recovery of the a coefficients are quite different. However, relative differences between coefficients are more important due to Q in (4.1). That is, any scaling of the a's yields the same scaling of (j>m(m.i,9) and this can be absorbed into Q. The ratio of ax to az does not vary much between regularization methods. There is significant difference when considering as. Figure 4.1 shows that differences between the results for the two methods are indistinguishable to the eye. Numerically, these differences are presented in Table 4.1 and slight improvements are noted in four of five models of the evaluation set when using HT coefficients. These improvements are slight and the conclusion must be that the HT algorithm provides no significant improvement in the tomography example. In order to illustrate an occasion where the HT algorithm is an especially valuable approach, the geophysical problem must be changed so that the survey is deficient in its ability to delineate the model structure. Chapter 4. The HT algorithm 71 True model Optimal Gaussian a HT algorithm a Figure 4.1: Recovered models for H T algorithm and multivariate Gaussian for set TS2: (a)mi, (b)m 2, (c)m3, (d)m^, (e)m5. 4.3.2 The 2-D Gravity Experiment Considering a general 2-D density function, p(x,z), the vertical component of the gravita-tional acceleration is expressed as ' • " • " - ^ / / ( x - t y + V - ^ ) ' ^ (4'21) z' x' where 7 is the gravitational constant, (x, z) are the observation coordinates and (x1, z') axe the source (density) coordinates [5j. It is possible to discretize the relation for y to the summation over a volume. There are two important points to realize. Firstly, the gravitational forward mode] is linear in density such that d = Gin (4.22) where d is a vector of gravitational accelerations (data) and m contains density values of the rectangular cells. If the problem is 2-D, G contains gravity kernels that decay as the distance, r, from source to observation increases. (4.23) Chapter 4. The HT algorithm 75 This is a problem when modelling gravity data to reproduce the density of the Earth. The fact that the kernels decay means that both the size and value of an anomalously dense region of the Earth must increase as its location in the Earth deepens in order for the identical gravity readings to be produced at the surface. Inherent to most inversion algorithms is the preference that the size of the model must be minimum in some way. As a result, when modelling gravity data, the recovered model tends to locate most of its anomalous density at or near the surface. In order to combat this problem, a weighting factor is often included in the regularization [23]. The corresponding depth weighting would be TTTTyf (4-24) (z + Zo) where ZQ is a constant and rj ~ 0(1). Using the H T algorithm, a parameter set (zo,r)) could have been recovered. In this section, a discrete representation is adopted where the a coefficients are chosen to represent a function like (4.24). The choice of regularization parametrization is vital to the use of the H T algorithm. In this experiment, the goal is to see if the H T algorithm can help constrain the depth location of a buried block. As a result, a new training set is developed. This training set has the same dimensions as TS2. The only difference is that the value of the block is held constant at 1 and the background is set to 0. Forty five of these training models constitute the training set while five are used for evaluation. The dimension of each cell is 1 meter square. This training set is referred to as TS3. The form of the regularization is chosen such that the parameters vary with depth. In order to keep the size of the problem small, each layer is assigned a constant value of as. As a result, the full version of the smallness weighting matrix, Ws, is broken up into ten matrices, Wk, each of which regularizes one row of model cells. Equation (4.6) can be expressed as follows. 10 WlWm = Y . a i W ? W i t 4 ' 2 5 ) fe=l There are a couple of issues that should be considered before evaluating the results. Firstly, the ability of the H T algorithm to produce appropriate weighting coefficients depends com-pletely on the makeup of the training set. If the training set consisted only of models near the surface, the results would not reflect the depth decay of the gravity kernels. Although this is certainly not the case, the set TS3 may not have an ideal distribution of models. Furthermore, the algorithm would certainly produce more favourable results as the num-ber of training models increased. The second issue that is worth considering is the form of the decay of the kernels. As the depth, Az, to a layer i increases, the ratio A^+' be-comes smaller. Thus, the difference in the expected values of a» and aj+i is smaller. These differences may be harder to recover for the H T algorithm. « 1 « 2 QJ3 tt4 "5 Oil ag ag aio 0.7563 0.1413 0.0598 0.0438 0.0311 0.0226 0.0241 0.0268 0.0658 0.0386 F i F2 Fs F4 F5 Fe F7 F8 F9 Fio 9 47 51 86 106 126 105 79 54 25 Table 4.2: Results of HT algorithm for gravity example. Values of a shown are multiplied by 103. Ft indicates how many cells in layer i contained block values. Chapter 4. Tho HT algorithm 76 True model Smallest model HT coefficients Figure 4.2: Recovered models for H T algorithm and smallest model for set TS3: (a)nii, (b)m 2, (c)m 3, (d)m 4, (e)m5. (12 ai an a- a<j a m Recovered 5.3511 2.3C19 1.3678 1.4084 1.3751 0.9372 0.8998 0.4074 1.7019 Expected 2 1.5 1.3333 1.25 1.2 1.1667 1.143 1.125 l.mi Table 4.3: Recovered and Expected a ratios. .Numerical results from the gravity H T algorithm example are shown in Table 4.2. The values of a decrease steadily for the first six layers. At that point, the algorithm appears to have difficulty recovering valid a ratios. Shown in the bottom part of Table 4.2 are the number of cells in each layer that contain block values. Because the blocks are allowed to vary in size (this is necessary to simulate real possibilities), the middle layers are sampled most frequently. It is at the seventh layer where the frequency begins to decrease. This helps to explain the a values below this point. Eleven data points were used at a height of 0.5 meters above the top layer. Thus, one can expect to see relative difference between the a coefficients based on relative distance to the observation level from the layers. This data is displayed in Table 4.3. The expected values are obtained by ratioing depths of adjacent layers with respect to the observation level. The conclusion is again that the results are most accurate for the least sparse layers. The five-model evaluation set was used to create data and recover models using both Chapter 4. The HT algorithm 77 (4.25) and a smallest model (W^Wm = WjWa). The results are shown in Figure 4.2. The H T algorithm has improved results significantly and, in most cases, locates the appropriate depth of the block. Only in the final case does the algorithm misplace the block in depth due to improper a values at depth. 4.4 Summary This chapter introduced the H T algorithm. Theory was developed to learn regularization functionals by incorporating information in the forward operator, G. An algorithm was introduced to apply the H T approach if the dimension of the unknown parameter vector, Ng, is small. When Ng is larger, it was stated that regularization on the 6 vector is required and the solution to the problem should be attacked by more advanced methods explained in [16]. Two examples were used to qualify the need for the H T algorithm. The first utilized the cross-well tomography example and it was shown that in the absence of significant deficiencies in G the H T algorithm does not improve the results significantly. To contrast, the second example used a 2-D gravity experiment. Due to the inherent depth decay of the gravity kernels, it was shown that the H T algorithm can be used to learn depth weighting using a simple parametrization of the regularization. The inverted results were much improved compared to a smallest model result. Chapter 5. Comparing Regularization Functionals 78 Chapter 5 Compar ing Regularizat ion Functionals 5.1 Concept In the previous three chapters, the theory for developing learned regularization functionals from prior knowledge has been developed and simple examples have illustrated results for each. Thus, it is possible to develop these functionals for any giving set of training data. What is lacking is a method to evaluate the results. That is, each functional is the most appropriate representation of the training models given a particular method but it is unclear how to make comparisons between the different functionals. Referring to the introduction in Chapter 1, the goal of developing learned regularization was to produce better geophysical models. Thus, the evaluation of the learned regularization functionals should be relative to the resulting geophysical model from an inversion. As a result, a method to invert geophysical data with generic regularization is needed. This will be developed in the next section. 5.2 Inversion with Generic Regularization The geophysical problem in question is assumed to be linear for simplicity. Non-linear forward models result in more difficult solutions of the objective function. As a result, the data misfit is assumed to be the same as (1.13) but is rewritten here for convenience. Thus, d is the geophysical data in question, m the corresponding physical property dis-tribution of the Earth and G the forward modelling matrix such that d = Gm. The regularization part of the geophysical inversion is written as follows. Ri is the appropriate regularization operator for z<. In the case of the probabilistic and HT algorithm method it is likely that there is only one type of regularization (Q — 1) and that z = m. The generic form is more versatile and can help include the deterministic methods where the regularization operators are different for different filters of the model. To evaluate the numerical value of Rife), the equation 4>d = \\\Wd(d-Gm)\\2 (5.1) Q (5.2) i= i Y (5.3) Chapter 5. Comparing Regularization Functionals 79 is used where Y is the total number of elements in Z j , ZJ is an element of z» and ri(zj) is the value of R4 at zj. Following (1.17), the optimization can be posed as a minimization problem. m i n m <p = (j)d + 0R(m) (5.4) The problem can be solved with a search algorithm using the Newton step. As a result, the first and second derivatives of the regularization must be calculated. The first derivative is approached as follows. Q OR dm i = i dzj dRj dm dzi Considering the summation Rife) • = n(zi) + n(z2) + the derivative of Ri with respect to Z j is as follows. ri(zY) (5.5) (5.6) dRj dzi / d r j ( Z l ) \ ' ^ z 7 ~ * dr,(z 2 ) d z 2 dr j (z Y ) dzy (5.7) Thus, the full derivative may be expressed. Q ( * 1 \ dR dz. dm 2—1 i = i dm dzj. dr;(z 2 ) d z 2 drj(zy) dzy (5.8) Taking another derivative of R with respect to m, the following expression is obtained. d2R dm2 Q A E dZj dm i = i d2n(z2) V To simplify the expressions let OJi ii 0 d 2r j(z Y-) dz 2 . / drj(zi) \ ' dz i > drT(z2) d z 2 dr j (z Y ) d z Y \ dzjT dm d2Zj dm'2 / drj(zi) \ ' dz i » drj(z 2) d z 2 drj(zy) dzy (5.9) (5.10) and Sli = / d 2 r i ( Z l ) 0 0 d 2 r j (z 2 ) (5.11) Chapter 5. Comparing Regularization Functionals 80 Now the full derivative of (5.4) can be expressed. ^ dz, E = GTWjWd{Gui-d) + B^^u>i (5.12) ^ dzi „ d z i T d2Zi The Hessian would be expressed as follows. " = G T l ^ G + " £ £ " < i f + ( 5 - 1 3 ) i = i For simplicity, it is reasonable to assume that z is connected to m via some filter. Zi=.Fim (5.14) In this case, Fi is a filter that could be a finite difference first derivative operator or some-thing similar. Then the term in (5.13) is zero. Thus, we can further simplify (5.12) and (5.13). Q g = GTWjWd(Gm-d) + 8^F{'uJi (5.15) i = i Q H = GTWjWdG + 8YJFl^Fi (5-16) t = i Finally, as usual, the perturbation is calculated. Sm = -H~1s (5.17) The same steps used in the algorithm in Section 2.4.1 are used here to find the optimal solution for m given G and R(m). 5.3 Forms of R(m) In this chapter, an example is used to compare the results using three different forms of R(m). Here, the particulars of inversion using these different forms of R are addressed. 5.3.1 Multivariate Gaussian If the regularization is posed as R(m) = i(m - mref)T(axW^Wx + azWjWz + a ^ > , ) ( m - m r e / ) or a function with a similar composition, the geophysical inverse problem is linear (given a linear forward model). However, to be consistent, the theory of the previous section may still be employed. In this case, three filters of m can be defined following (5.14). zi = Wx{m-mref) z 2 = W z ( m - m r e / ) (5.18) Z3 = Ws{m- m r e / ) Chapter 5. Comparing Regularization Functionals 81 As a result, the regularization functionals can be written as R(zi) = ^ocxz\ R(z2) = \uzz\ (5.19) •R( z 3) = where the values of R(zi) are quantified using (5.3). To follow the theory of the previous section, there are several derivatives that must be expressed. Firstly, the relation between the filter quantity, z, and the model, m, needs to be expressed. ^1 = WT dm x p- = Wf (5.20) dm ^1 = WT dm s Following (5.5), the derivative of the regularization operator can be expressed. dR dm a x W * Z l + a*WTZ2 + aaWaTx3 = {axWfWx + azWjWz + a s W j W , ) ( m - m r e /) Following (5.9), the second derivative can be expressed. (5.21) d2R _ dz\dz\T dz2dz2T dz^dz^ dm2 x dm dm z dm dm s dmdm (5.22) - *zWlWx + azWjWz + asWjWs Because the derivatives can be expressed analytically, the gradient (5.12) and Hessian (5.13) also have cleaner forms. g = GTWjWd(Gm - d) + (3^ d m (5.23) H = GTWjWdG + B-^ Since the problem is linear (in both 4>d and R(m)), only one one iteration of the non-linear code is required to obtain the minimum of (5.4) for a given B. The multivariate Gaussian approach is the most well-behaved approach for a few reasons. Above all, the use of a Gaussian incorporates a quadratic function into a quadratic approximation search algorithm. In addition, quadratic functions are easy to deal with due to the presence of continuous derivatives. The limitation of the multivariate Gaussian approach, however, is that the type of norm is constant. Like the HT algorithm, the user must impose a type of regularization on the problem. Optimizing the weights simply produces the best fitting Gaussian but we have no guarantee of its appropriateness. The use of an optimized Gaussian on a non-Gaussian distributed training will be illustrated later in this chapter. Chapter 5. Comparing Regularization Functionals 82 5.3.2 The p-norm In the case of the regularization being a p-norm, the regularization is evident from the distribution in (3.33). Y i2(zO = a ^ | z i - M r . (5.24) i The coefficient a is used to represent either the term or some arbitrary weight. Noting (5.3), the expression for ri(zj) is equal to the right hand side of (5.24). The derivatives of r are calculated at Zj ^ p. These expressions are valid for Zj = p when p > 2. When p < 2, a numerical threshold is applied such that values of z are never evaluated at Zj = p. This is explained later in this section. The first derivative is expressed below. = pa\zj - u\p~lsign{zj - p), Zj ^ p = p a \ Z ] - p \ ^ ^ v z^p ( S - 2 5 ) \Zj - p\ = pa(zj - p)\zj - p\p~2, ZjT^H As a result, the derivative of R can be expressed according to (5.8). The second derivative of r is required by (5.9). ^ = pa\zj - p r 2 +p(p - 2)a\Zj - ^^LZJ^-, Zj # p j I j Ml = Pa\zj ~ M l p _ 2 + P(p - 2)a|2j - p\p'2, Zj^p ('5'26^ = p(p - l)a\Zj - p\p-2, Zj ^ p Without specifying a specific form of z, (5.25) and (5.26) are sufficient for carrying out the theory in Section 5.2. There are a couple of issues to be discussed when using specific values of p. The most apparent item is that the second derivative vanishes when p = 1. This makes the quadratic search approaches ineffective as there is no way to normalize the derivative. The solution can be obtained for a value of p close to 1 (RS 1.05) and it is satisfactorily close tb the desired solution. The second item concerns values of p < 2. In this case, both the gradient and the second derivative of r are divided by the value of Zj (to some power). When p < 2 where p = |p — 2| dr Zj - — OC dzj \Zj\P and d2r 1 dz) \Zj\V Thus, there must be some care taken when values of Zj are close to zero. If the value is below some tolerance, the values of the gradient and Hessian will become unrealistically high. This problem is solved by introducing a tolerance level, e, for the value of the Zj. If Zj < e, the value is set such that Zj — e. The value of e is dependent on the problem as well as the relation of z to m. For instance, if z is a gradient of m, the value of e would be set to a value of the Zj that is insignificant in size. For example, if the training set consisted of a box of value 10 in a halfspace of value 1, the value of e might be set to I O - 3 . This is an effective way of stabilizing the inversion while recovering models of appropriate character. Chapter 5. Comparing Regularization Functionals 83 5.3.3 Generic R(z) The theory in Section 5.2 is almost sufficient to invert a data set using a generic regulariza-tion functional. The remaining aspect is the calculation of the derivatives of R with respect to z. Recalling Chapter 3, the generic regularization functional, R(z), is defined by a set of points r at locations z r. The best way to utilize this information is to interpolate a cubic spline curve. Nr r(z)=Ylaj9j{z ~ Zj) + P{z) (5-27) j = i r(z) is the value of the regularization at location z, Nr is the length of the vector r, aj is a spline coefficient, gj is a cubic spline basis functions, Zj is a value of z r and p(z) is a degree 1 polynomial (p(z) = bi +1)2z). The theory is derived from a radial basis function approach outlined in Billings et al [3]. For a cubic spline interpolation, the basis functions are written as follows. g ( z - z j ) = \z-zj\3 (5.28) There are three conditions that must be satisfied in the interpolation problem. r{zj) = rj Nr E a ^ = 0 j=i (5-29) Nr j = i As a result, the interpolation system can be written as follows. g(zi - zi) g{zi - Z2) g(zi-zNr) 1 Zl ai n g{z2 - zi) g{z2 - Z2) -+ g(z2-zNr) 1 Z2 a.2 T2 1 I i . i 1 y i i 9{ZNT --zi) g{zNr -~Z2) g(zNr-zNr) 1 ZNr a.Nr TNr l 1 -* ' 1 0 0 bi 0 Zl Z2 —> ZNr 0 0 . b2 0 (5.30) A simple inversion of the matrix on the left side of (5.30) produces values of the spline and polynomial coefficients. For cases when the matrix is ill-conditioned, the reader is referred to Billings et al [4]. Once the coefficients are recovered, any value of the function R(z) can be recovered via (5.27). The derivatives of the function R(z) can be recovered by differentiating (5.27). This requires the expression of the derivatives of gj. 9j = 9" = (5.31) Chapter 5. Comparing Regularization Functionals 84 The derivatives of R follow suit from (5.27). dr{z) dz Zj) + b2 d2r(z) dz2 (5.32) Zj) Given a particular form of z, (5.31) and (5.32) are sufficient to define the gradients in (5.8) and (5.9). In turn, the system in (5.17) can be solved. The generic form of R{z) is attractive due to its versatility. Firstly, as shown in Chapter 3, the shape of the norm is usually recovered well regardless of any failure to recover values of coefficients. This allows unusual forms of R(z) to be recovered. For instance, we may expect to see R(z) as a combinations of two well known norms. Currently, there is no way of including this type of information in any other learned fashion. Other methods require specification of the norm before parametrization. In addition, the discretization of r allows one to recover detail that varies over a certain length scale. In many problems larger scale variations of z may be important such that R(z) takes a smoother form. Unfortunately, complexity often accompanies versatility. As we shall see later in the chapter, R(z) can take complex forms that may include concaveness or discontinuities of derivatives. Inversion algorithms typically use a local minimum search approach. The incorporation of a complicated R(z) may produce multiple minima. In addition, the system in (5.17) can become increasingly difficult to solve. As a result, many iterations may be required to produce a solution with minimum model norm. For these reasons, a hybrid global-local search approach is used. 5.3.4 T h e H y b r i d Globa l -Loca l Search A l g o r i t h m When minimizing the recovered model, m r e c , usually has a lower value of model norm than the true model. In synthetic examples a check of (5.33) is a standard way of evaluating a result. However, when using methods that optimize the norm rather than the coefficients (p-norm, generic R(z)) (5.33) may be closer to an equality. In the specific case of generic R(z) functions, i a t r u e may have a minimum value of jR(m) if ratrue is accurately represented by the training set. As a result, ensuring that R(mrec) is as low as possible is of utmost importance. Learned regularization functionals may cause the objective function to become quite non-linear. Thus, there is the possibility of haying the solution become trapped in a local minimum. Currently, search algorithms in this research utilize a Newton approach (cur-vature information). This may not be the most efficient way of searching in a complex objective function. This is a difficult problem and needs to be studied in detail in the future. Some other approaches involve cooling the solution from a well-behaved regular-ization (L2 norm) to a more complicated regularization. The assumption is that the two solutions lie near each other in objective function space. In this research, we focus on a small example and use a hybrid global-local search algorithm. <j) = <f>d + 0R{m) R(mrec)<R(mtrue) (5.33) Chapter 5. Comparing Regularization Functionals 85 Global methods [26] are inefficient for problems with a large number of unknowns. Therefore, a local algorithm is used to simulate a global search. Given a form of R(m) and data, d, the local algorithm produces a recovered model, m ^ , and a value of the tradeoff parameter, This model will have a value of misfit, (f)^, near the target misfit, <j)d. Thus, the objective function will initially be ^ ) = 0 ( 1 ) + / ? ( i ) i ? ( m ( i ) ) . The algorithm is then restarted, but with a starting model of and the initial value of P = cd^ where c > 1 is a constant 10). The initial value of the objective function will then be 4>start =+ 3R(mW) and the goal is to attempt to reduce the value of (f)start. It is possible that en route to termination of this search the value of fa increases while R(m) decreases. The goal is to have the value of the objective function for the next generation of model decrease such that This decrease could manifest itself in a decrease of the model norm, i?(m), if ~ 0(2\ Thus, the algorithm attempts to find a model that fits the data to the specified degree but minimizes the value of the objective function, </>(m). This simulates a global approach because of the many different starting points, m ' ' l Once the algorithm begins to produce several generations of models without significant changes (m.( l + 1) ~ such that </>(m(l+1))~</>(m^)) the search is terminated. The final model is therefore taken as the generation with minimum (/>(m.W) (and likely minimum R(m^)). 5.4 Examples In order to compare all of the regularization methods, a synthetic example is proposed. The minimum of (5.1) is sought. The example uses a training set (TS4) of a box in a halfspace model. The seismic tomography problem (Section 4.3.1) is used as a forward model in order to evaluate effectiveness of the varied R(m). The H T algorithm is not applied to TS4 because a simple parametrization of the regularization such as (4.6) is not useful for seismic tomography. A more complex example using tomography is presented in [16]. TS4 is used to compare results from the multivariate Gaussian approach in Chapter 2 (and other values of p), and generic R(z) methods and parameterized p-norms in Chapter 3. 5.4.1 Training set 4 : Box in Homogenous Halfspace Training set 4 was chosen to be a version of the box-in-halfspace models developed in Chapter 4 (TS2 and TS3). The reasons for choosing this type of model are twofold. Firstly, it is unknown what distribution these types of models are inherent to. Secondly, this type of model (with edges) is important in geophysical inversion since conventional regularization (Gaussian) methods tend to smooth out edges. The details of TS4 are as follows. The grid was chosen to be a 2-D 20 cell by 20 cell mesh with 1 meter square cells. The value of the background was modelled by a Gaussian as 1 ± 0 . 2 5 ^ . The value of the box was also modelled by a Gaussian as 1 0 ± 2 ^ . The location and size of the box are determined by uniform distribution. As a result, any configuration Chapter 5. Comparing Regularization Functionals 86 is possible within the mesh. One hundred of these models were created. One of the models was chosen at random to be the true model and the rest constituted the training set TS4. Four of these models were shown in Figure 2.3. The tomography experiment consisted of 14 sources and receivers separated equally in depth at the edges of the mesh. The resulting number of data, N, was 196. Gaussian noise was added to the data and a 5 percent error was assumed. Since this is a synthetic problem, the true misfit is calculated using (1.13) to be about 364. As a result, all recovered models in this experiment fit the data within a small tolerance of the true misfit. In any case where a reference model needs to be specified, mref was set to zero. R(m) as a multivariate distribution The first approach to obtaining a form of the regularization was by using the probabilistic approach from Chapter 2. Using a form of covariance such as •. C - 1 =Q.T||VK xm||P + Q , | |W z m | |P + a s | |W sm||P (5.34) when p—2 the solution can be obtained from an optimization approach. The covariance model selection theory in Section 2.6 can be applied for any value of p. Thus, the a coefficients are sought for p = 1,1.25,1.5,1.75 and 2. The optimization code can be used as a check for the results at p = 2. Table 5.1 summarizes the results. The values of ax and az are relatively close for every value of p which is expected for a training set such as TS4. The value of p = 1.5 has a minimum value of T(p) and is considered the best choice of p by this measure. Plots of a x | |W xm||p for p = 1,1.5 and 2 are shown in Figure 5.1. The resulting models are shown in Figure 5.4. Analysis is reserved for the end of this section. There was some consideration of the proper parametrization of the covariance matrix. Results in Table 5.1 indicate that as is not an important parameter in the problem. Thus, the other option is to only regularize one derivative (let ax = 0 or az — 0). Covariance model selection theory from Section 2.6 was used to test these possibilities. Covariance models with one derivative were extremely improbable given TS4. It should be noted that the theory of Section 2.6 allows other types of regularization operators to be included. However, it is extremely problem dependent. The last amendment to this type of regularization could be to employ the theory in Section 2.7. However, as noted in that section, our training models would have to be detailed in terms of location so that the a coefficients would be more efficiently expressed for each model cell. This is certainly not appropriate in TS4. p-value 0z o7s <rs Tip) 1.0 0.9328 0.0175 1.1767 0.0141 8.9703e-4 1.19776-4 0.0150 1.25 0.6669 0.0120 0.8510 0.0110 8.4179e-4 1.1250e-4 0.0152 1.5 0.4628 0.0044 0.5912 0.0061 6.9892e-4 9.3449e-5 0.0100 1.75 0.3103 0.0043 0.3950 0.0055 5.2790e-4 7.1042e-4 0.0139 2.0 0.2075 0.0017 0.2580 0.0033 3.6839e-4 4.8568e-5 0.0112 2.0* 0.2050 - 0.2573 - 5.3180e-4 - -Table 5.1: Statistics of Markov chains for different p-distributions for TS4. The asterisk indicates results are from the optimization code Chapter 5. Comparing Regularization Functionals 87 601 1 1 1 1 1 i i Figure 5.1: Recovered R'z\) for TS4 using multivariate and deterministic methods. R(m) as a p - n o r m In this approach, two derivative filters of the training set are defined. Z i = W T m (5.35) z 2 = Wzm It is assumed that the regularization can he defined as the sum of the individual functions. R(m) = Rtizi) + R2(z2) (5.36) The form of Ri is assumed to be a p-norm according to the distribution in (3.33). (5.37) P a Z'2 0.2847 0.2939 0.0286 0.0260 Table 5.2: Optimal p-norm parameters for TS4 Chapter 5. Comparing Regufarization Functionals SS Given zi and za, optimal values for (pi,rri) and (^2,^2) were determined and are summa-rized in Table 5.2. The values of p are relatively small compared to the value of p — 1.5 that was determined in the previous section. This is because the z vectors are modelled independently and are overly sensitive to the number of zero gradients in the training set. This is evident from the example models shown in Figure 2.3. The resulting p-norm for zj is shown in Figure 5.1. It is clear that values of zx away from zero are penalized heavily. It is worthwhile to note the differences in R{zi) for this value of p versus larger values. Below p = 1, concavity is introduced into the objective function which (as summarized in Section 5.3.4) requires advanced search techniques. Figure 5.2 illustrates three solutions obtained enroute to the final solution at iteration 28. Iter 8 <i>d R(m) 1 2.03o-l 4.20e2 6.1 le2 11 2.16e-l 3.59e2 5.55e2 20 3.40e-l 3.67e2 5.34e2 28 1.53e-l 3.77e2 5.34e2 Table 5.3: Numerical summary of hybrid search for m using optimal p-norm regularization. Chapter 5. Comparing Regularization Functionals 89 Iteration #1 Iteration #7 Iteration #15 Iteration #23 Figure 5.3: Four models obtained using a hybrid search approach with generic R(z) regu-larization. Iteration 23 was the final result. Table 5.3 smmnarizes these results numerically. The true model has a model norm of 160.4. The hybrid algorithm was able to reduce the model norm from 420 to 377 while maintaining a relatively constant misfit. However, it was not able to find a model with a model norm closer to the true value. Tliis is mostly because the smooth background was not recovered well (as shown in Figure 5.2). Visual inspection of the progress made from iteration 1 to 28 is striking. After the first iteration, we could not be sure of the value of the block or its edges. After the hybrid algorithm is applied, these features are much clearer. R(m) as Generic R(z) Functions Once again, we use two filtered quantities, z i and z 2 , which are expressed in (5.35). The theory of Section 3.3 is applied and two functions Ri and R2 result such that the regulariza-tion is expressed by (5.36). The resulting functions, R\ and R2, are similar for training set TS1. R\{z\) is shown in Figure (5.1). There are two striking features about R\{z\). Firstly, the curve is quite smooth and looks quadratic near the minimum of zero. Secondly, the curve has relatively lower penalty values for larger values of Z] and includes local minima around -8 and 8. This represents the more frequent derivatives due to the contrast between the halfspace and the box in TS4. lt is important to compare this function to the results of the multivariate case and the optimal />-norm. The generic R(z) method recovers points Chapter 5. Comparing Regularization Functionals 90 Iter 3 <f>d R(m) 1 1.42e-l 3.12e2 1.54e3 7 2.24e-l 3.38e2 9.57e2 15 3.62e-l 3.42e2 5.99e2 23 3.60e-l 3.66e2 4.65e2 Table 5.4: Numerical summary of hybrid search for m using generic R(z) regularization. on a function so that the type of norm must never be specified. This allows local features (such as the minima) to be included in the regularization. This can never be included in the other cases - only a best fit to the training set is recovered. In summary, by looking at the regularization curves of Figure 5.1, the information content of the generic R{z) curve is clearly superior to the others. The generic R(z) includes concaveness as well as local minima. As a result, the hybrid algorithm was applied. Numerical results are summarized in Table 5.4 and corresponding models are shown in Figure 5.3. Compared to results from the optimal p-norm, the hybrid algorithm had a major influence on the results. The model norm was reduced to less than one-third of its original value over 23 iterations. The true model norm was 248.2. Visually, the improvement from iteration 1 to 23 is dramatic. At the termination of the hybrid algorithm, the flat background is reproduced well, negative values are eliminated, and a box of constant value that almost has rectangular shape has been imaged. Analysis of Results The recovered models for all three methods of learned regularization are shown in Figure 5.4(b)-(f). The true model is shown in Figure 5.4(a). The box has a slowness value of 12.2761 The halfspace has a value of 1.3573 Numerical comparisons to the true model are shown in Table 5.5. Recovered models using the multivariate analysis are shown in Figures 5.4(b)-(d). Vi-sually, the best model is recovered via p = 1. This is confirmed by the numerical results from Table 5.5. The second best model is obtained with p = 1.5 followed by the result with p = 2. Thus, one could conclude that the analysis using T(p) earlier in the section was incorrect. However, it could be that T(p) is a good measure for finding the optimal distri-bution and individual models or subsets of training models can vary in distribution. Until the M C M C theory of Section 2.9.4 can be finalized, measures such as T(p) are the best option. Regardless, with optimized values of coefficients, all three multivariate approaches produced valuable recovered models. The Gaussian approach suffers from smoothing of edges while the multivariate p = 1 solution recovers edges well but has difficulty repro-ducing the smooth halfspace (strong negative values appear at the bottom of the model). R{z) method | |m f r u e - m r e c | | 2 Multivariate p = 2 54.9691 Multivariate p = 1.5 50.2288 Multivariate p = 1 37.9153 Optimal p-norm 54.6092 Generic R{z) 22.2679 Table 5.5: Numerical evaluation of recovered models. Chapter 5. Comparing Regularization Functionals 91 True model Multivariate (p=2) 5 10 15 20 5 10 15 20 x x Figure 5.4: (a)True model, (b)-(f) Recovered models using five learned regularization func-tionals from TS4. The p = 1.5 model is a compromise between the other two but still smooths the edges out excessively. The recovered model using the optimal p-norm regularization is shown in Figure 5.4(f). By comparing the generic R(z) function in Figure 5.1 with the p-norm curve, it is difficult to be optimistic about the potential of the p-norm. Because both methods work with individual filters, z, one would expect the curves to be similar if a p-norm parametrization was valid. In tills case, the fact that the curves are different may suggest p-norm regularization is not optimal. However, the value of p is best-fitting and the result is reasonable. At the terminus of the hybrid algorithm, the block slowness value, location, edges and size have been recovered adequately. In addition, the value of the background is around the correct value although the smoothness has been distorted. Numerically, the p-norm does a better job than the multivariate 2-norm but is worse than the other multivariate cases. The recovered model using the generic R(z) regularization is shown in Figure 5.1(e). It was stated earlier that the generic R(z) function in Figure 5.1 contained the most informa-tion. As such, it follows that its recovered model is superior. Visually, this method does the best job of recovering the slowness value of the block and the background, the fact that the two values are constant, and the location and edges of the block. Numerically, the fit to the true model is almost twice as good as the nearest competitor. There are a few cells on the right hand side of the recovered model that seem to bleed into slightly elevated background Chapter 5. Comparing Regularization Functionals 92 values to the right of the block. It is this area that contributes to the misfit to the true model. It is also impressive that there is a natural absence of negative values. With this example in mind, the conclusion must be that, despite the i.i.d. assumptions that govern this method, the generic R(z) method is most capable of transferring information from the training set to the inversion. 5.5 Summary In this Chapter, methodology to incorporate a regularization R(m) into geophysical in-versions was developed. In addition, specific mathematical and implementation details for the cases of the multivariate Gaussian, p-norms and generic R{z) were discussed. It was noted that in most cases learned regularization will make the objective function extremely non-linear. As a result, conventional search methods may prove ineffective. However, us-ing a small simple synthetic example, a global-local search method was successfully used with learned regularization. Inversions were run for several learned regularization function-als. For a box model, it was shown that the generic R{z) function was most capable of translating prior knowledge into a useful form for the inversion. Chapter 6. Summary and Future Work 93 Chapter 6 S u m m a r y a n d F u t u r e W o r k The goal of this thesis was to develop learned regularization functionals for geophysical inversions. As a result, three methods have been developed that translate prior information (in the form of training models) into informative regularization. First, in Chapter 2, the regularization was learned in a probabilistic framework. This was accomplished by specifying the regularization as a Gibbs distribution and utilizing in-teractions of model cells in a local neighbourhood. The Gibbs distribution was expanded as a multivariate Gaussian and regularization was parameterized by a set of coefficients, 6, in an inverse covariance matrix, C~ l{6). Given training information in My, a sim-ple optimization routine was used for the recovery of the coefficients and synthetic results demonstrated the accuracy of the recovered values. In order to compare different param-eterizations of C _ 1, MCMC theory was employed. When the governing distribution is non-Gaussian, MCMC theory was developed to recover corresponding parameter values. The result of Chapter 2 was that methods to create, evaluate and compare learned proba-bilistic regularization functions now exist. Chapter 3 was a model based deterministic approach to learned regularization based around i.i.d assumptions on a filter quantity, z , of the model, m . Two methods - the generic R{z) approach and parameterized p-norms - were proposed as possible forms of learned regularization. It was shown that for one filter quantity it is possible to recover the true regularization. However, when several filter quantities exist, there is difficulty in recovering all of the true regularization functions exactly. Rather, the width (or spread) of the functionals can be incorrect but the shape seems to be preserved. Several examples of both methods were explored and a comparison between the two methods proved useful. Specifically, the result of the generic R(z) compared to the p-norm can show if a standard type of regularization (such as multivariate Gaussian) would be appropriate in the proba-bilistic methodology. If this is not the case, then specification of the type of norm would be inappropriate and, thus, the generic R(z) method was shown to transfer the most prior information into the final recovered model. Chapter 4 introduced the HT algorithm which is a method of including information in the forward problem in the regularization. This method was shown to be valuable when the geophysical forward operator, G, has deficiencies. Theory and a simple algorithm were outlined such that the learned regularization was parameterized as in the probabilistic case. Two examples were presented. One example used the seismic tomography problem and it was shown that the HT algorithm, in the absence of any significant deficiencies in G, produces similar results to a learned probabilistic result. The second example was a 2 D gravity problem and it was shown that with an informative training set, the HT algorithm can be used to learn depth weighting. Thus, the HT algorithm is a superior approach when the geophysical problem has deficiencies that cause recovered models to be biased. Chapter 5 introduced the theory required to invert geophysical data with a generic learned regularization. Discussion of optimization problems was presented. A method to find the optimal recovered model was shown and results were presented for a simple box Chapter 6. Summary and Future Work 94 model. Despite the assumptions of the model based deterministic case, the generic R(z) regularization produced the superior recovered model. Through the work presented in this thesis, groundwork has been established for de-veloping learned regularization functionals. As a result, there is substantial future work that could be carried out. In Chapter 2, numerical issues must be addressed. Specifically, when performing optimizations or sampling Markov chains, there is the need to calculate det(C _ 1 ) . When the dimension of the problem increases, this is an expensive step and an efficient approximation would be useful. Furthermore, currently the Hessian matrix from Section 2.4.1 is calculated by finite difference which is also an expensive process. The work on covariance model selection will be useful in the future if more regularization filters be-sides Wx, Wz and Ws are developed. If the neighbourhood of a cell is increased to second order, operators that represent the interactions of diagonally bordering cells could be de-veloped. Eventually, this could follow the ideas put forth in [29] where a bank of filters could be established and M C M C theory could be used to select the optimal combination given the prior information. The last item for future work in Chapter 2 concerns Section 2.9.4. Comparison of two different p-distributions would require an efficient normaliza-tion approximation for non-Gaussian distributions and would result in an optimum type of distribution being chosen. Other future work concerning learned regularization is as follows. From Chapter 3, further investigation of i.i.d assumptions is required. Specifically, it would be useful to establish conditions where these assumptions are reasonable as well as conditions where they are not. When models are more complicated, i.i.d assumptions play a larger role as in Figure 3.8. In Chapter 4, an algorithm for more complex types of regularization should be devel-oped. This would help emphasize the flexibility and usefulness of the H T algorithm. Lastly, referring to Chapter 5, a more intensive investigation of learned regularization functions in the objective function (cf>) is required. This would involve checking non-linearity of <f> as a function of regularization method. Depending on the results, it may be possible to develop search algorithms that are more appropriate. It is reasonable that this algorithm would have a global flavour but it must be efficient for large number of parameters. Fur-thermore, searching the objective function can potentially be guided by information in the training set. This could be manifested by using the training set (and data) to develop an informative starting model or to target a particular value of the model norm, R(m). Bibliography 95 Bibliography [1] E.J. Bedrick, R. Christensen, and W. Johnson. Bayesian binomial regression: Predict-ing survivial at a trauma center. The American Statistician, 51:211-218, 1997. [2] J. Besag. Spatial interation and the statistical analysis of lattice systems (with discus-sion). Journal of the Royal Statistical Society. Series R. Methodological, 36:192-236, 1974. [3] S.D. Billings, R.K. Beatson, and G.N. Newsam. Interpolation of geophysical data by continuous global surfaces. Geophysics, 67:1810-1822, 2002. [4] S.D. Billings, G.N. Newsam, and R.K. Beatson. Smooth fitting of geophysical data using continuous global surfaces. Geophysics, 67:1823-1834, 2002. [5] R. Blakely. Potential Theory in Gravity & Magnetic Applications. Cambridge Univer-sity Press, 1996. [6] G.L Bretthorst. An Introduction to Model Selection Using Probability Theory as Logic (in Maximum Entropy and Bayesian Methods). 1996. [7] B. Carlin and S. Chib. Bayesian model choice via markov chain monte carlo methods. Journal of the Royal Statistical Society. Series B. Methodological, 57:473-484, 1995. [8] G. Casella and E. George. Explaining the gibbs sampler. The American Statistician, 46:167-174, 1998. [9] S. Chib and E. Greenberg. Understanding the metropolis-hastings algorithm. The American Statistician, 49:327-335, 1995. [10] E.Haber. Numerical Strategies for the Solution of Inverse Problems. PhD thesis, The University of British Columbia, 1997. [11] C.G. Farquharson and D.W. Oldenburg. Nonlinear inversion using general measures of data misfit and model structure. Geophysical Journal International, 134:213-227, 1998. [12] C.G. Farquharson and D.W. Oldenburg. A comparison of automatic techniques for estimating the regularization parameter in non-linear inverse problems. Geophysical Journal International, 156:411, 2002. [13] P. Gill , W. Murray, and M . H . Wright. Practical Optimization. Academic Press, 1981. [14] E. Haber, U . Ascher, and D.W. Oldenburg. On optimization techniques for solving nonlinear inverse problems. Inverse Problems, 16:1263-1280, 2000. Bibliography 96 [15] E. Haber and D.W. Oldenburg. A gcv based method for nonlinear ill-posed problems. Computational Geosciences, 4:41-63, 2000. [16] E. Haber and L. Tenorio. Learning regularization functionals. Inverse Problems, 19:611-626, 2003. [17] H. Hanche-Olsen. The derivative of a determinant. Technical report, Dept. of Mathe-matical Sciences, Norwegian University of Science and Technology, 1997. [18] J.A. Hoeting, D. Madigan, A. Raftery, and C T . Volinsky. Bayesian model averaging. Statistical Science, 14:382-417, 1995. [19] R.E. Kass and A . E . Raftery. Bayes factors. Journal of the American Statistical Asso-ciation, 90:773-795, 1995. [20] C T . Kelley. Iterative Methods for Optimization. Society for Industrial and Applied Mathematics, 1999. [21] S.Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, 1995. [22] Y. Li and D.W. Oldenburg. 3d inversion of magnetic data. Geophysics, 61:394-408, 1996. [23] Y. Li and D.W. Oldenburg. 3d inversion of gravity data. Geophysics, 63:109-119, 1998. [24] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E . Teller. Equa-tion of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1092, 1953. [25] J. Nocedal and S. Wright. Numerical Optimization. Springer, 1999. [26] M . Sambridge. Exploring multidimensional landscapes without a map. Inverse Prob-lems, 14:427-440, 1998. [27] A. Tarantola. Inverse Problem Theory. Elsevier, 1987. [28] S .C Zhu and D. Mumford. Prior learning and gibbs reaction-diffusion. IEEE Trans-actions on Pattern Analysis and MAchine Intelligence, 19:1236-1250, 1997. [29] S.C. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (frame): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27:107-126, 1998.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Developing learned regularization for geophysical inversions
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Developing learned regularization for geophysical inversions Hewson, Chad 2004
pdf
Page Metadata
Item Metadata
Title | Developing learned regularization for geophysical inversions |
Creator |
Hewson, Chad |
Date Issued | 2004 |
Description | Geophysical inverse problems can be posed as the minimization of an objective function where one term (ϕ[sub d]) is a data misfit function and another (ϕ[sub m]) is a model regularization. In current practice, ϕ[sub m] is posed as a mathematical operator that potentially includes limited prior information on the model, m. This research focusses on the specification of learned forms of |
Extent | 13462533 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-11-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0103851 |
URI | http://hdl.handle.net/2429/15256 |
Degree |
Master of Science - MSc |
Program |
Geophysics |
Affiliation |
Science, Faculty of Earth, Ocean and Atmospheric Sciences, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2004-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2004-0230.pdf [ 12.84MB ]
- Metadata
- JSON: 831-1.0103851.json
- JSON-LD: 831-1.0103851-ld.json
- RDF/XML (Pretty): 831-1.0103851-rdf.xml
- RDF/JSON: 831-1.0103851-rdf.json
- Turtle: 831-1.0103851-turtle.txt
- N-Triples: 831-1.0103851-rdf-ntriples.txt
- Original Record: 831-1.0103851-source.json
- Full Text
- 831-1.0103851-fulltext.txt
- Citation
- 831-1.0103851.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0103851/manifest