BAYESIAN MULTIVARIATE INTERPOLATION WITH MISSING DATA AND ITS APPLICATIONS by WEIMIN SUN B.Sc., University of Electric Science and Technology of China, 1983 M.Sc., University of Georgia, 1990 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES Department of Statistics We accept this thesis as conforming to the equired tandard THE UNIVERSITY OF BRITISH COLUMBIA December 1994 ©Weimin Sun, 1994 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. (Signature) Department of The University of British Columbia Vancouver, Canada Date DE-6 (2188) N. I Abstract This thesis develops Bayesian multivariate interpolation theories when there are: (i) data missing-by-design; (ii) randomly missing data; (iii) monotone missing data patterns. Case (i) is fully discussed both theoretically and empirically. A predictive distribu tion yields a Bayesian interpolator with associated standard deviation, a simultaneous interpolation region, and a hyperparameter estimation algorithm. These results are de scribed in detail. The method is applied to interpolating data from Southern Ontario Pollution. An optimal redesign of a current network is proposed. A cross-validation study is conducted to judge the performance of our method. The method is compared with a Co-kriging approach to which the method is meant to be an alternate. Case (ii) is briefly discussed. An approximation of a matrix T-distribution by a normal distribution is explored for obtaining an approximate predictive distribution. Based on the approximate distribution, an approximate Bayesian interpolator and an approach for estimating hyperparameters by the EM algorithm are described. Case (iii) is only touched on. Only an iterative predictive distribution is derived. Further study is needed for finding ways of estimating hyperparameters. 11 Contents Abstract ii Table of Contents iii List of Tables v List of Figures vi Acknowledgements ix 1 Introduction 1 2 Bayesian Multivariate Interpolation with Missing Data 13 2.1 Introduction 13 2.2 Bayesian Interpolation 14 2.3 Matrix T-Distribution 18 2.4 Interpolation with Data Missing-by-design 25 2.4.1 Predictive Distributions and Interpolation 26 2.4.2 Estimation of Hyperparameters 31 2.4.3 EM Algorithm 38 2.5 Interpolation with Randomly Missing Data 40 2.6 Interpolation with Monotone Missing Data Patterns 46 111 2.7 3 4 5 6 Proofs of Theorems, Corollaries and Lemmas Application to Air Pollution Data 49 66 3.1 Application to Monthly Pollution Data 68 3.2 Application to Daily Pollution Data 76 Application to Environmental Monitoring Network Redesign 84 4.1 Theory of Network Redesign with Entropy 85 4.2 Network Redesign with Data Missing-by-design 88 Cross-validation Study 90 5.1 Simultaneous Interpolation Versus Univariate Interpolation 90 5.2 Trends in Adjusted Mean Squared Prediction Error (AMSPE) 93 5.3 Comparison with CoKriging 95 Concluding Remarks and Future Studies 104 Bibliography 108 Appendix A 116 Appendix B 121 Appendix C 153 iv List of Tables 3.1 Pollutants Measured at Each Gauged Site, Where a, b, d And e Represent , SO 2 NO , 03 And SO 4 2 Respectively 3.2 The Estimated Between-pollutants-hypercovariance Matrix of the Logtransformed, Summer Monthly Data 3.3 75 The Estimated Between-pollutants-hypercovariance Matrix of the Log transformed Daily Summer Pollution Levels in Southern Ontario 3.5 72 Correlations Between Residuals of Log-Transformed, Observed and Esti mated Pollution Levels at Gauged Sites 3.4 69 79 Correlations Between the Residual of Log-transformed, Summer Daily Observed and Estimated pollutants at Gauged Sites 81 5.1 Latitudes and Longitudes of the Gauged Sites 101 5.2 MSPEs by Both Methods 102 V List of Figures 3.1 Locations of gauged sites in Southern Ontario plotted with Census Subdi vision boundaries, where monthly pollution levels are observed and Sites 3, 29 (outliers) are not plotted 3.2 153 Locations of selected sites in Southern Ontario plotted with Census Subdi vision boundaries, where monthly interpolated pollution levels are needed. 154 3.3 Plots for monthly observed and fitted, log-transformed levels of 03 in ppb, 2 and 504 in g/m , NO 2 SO , at Gauged Site 5 3 3.4 155 Normal quantile-quantile plots for original and log-transformed monthly levels of SO 3 at Gauged Site 4 4 in pg/m 3.5 156 Plots for autocorrelation and partial autocorrelation of monthly, logtransformed levels of SO 3 at Gauged Site 4 4 in ug/m 157 3.6 A rough checkerboard obtained in the SG step 158 3.7 A smoother checkerboard obtained in the SG step 158 3.8 Scatter plot of observed covariances vs predicted covariances obtained by the GS approach 3.9 159 Means of monthly levels of 03 in ppb, in summers of 1983 -‘ 1988 at gauged sites in Southern Ontario plotted with CSD boundaries 3.10 Means of monthly levels of 03 in ppb, in summers of 1983 selected sites in Southern Ontario plotted with CSD boundaries vi 160 1988 at 160 3.11 Scatter plots for residuals of monthly observed pollutant levels against residuals of interpolated levels at the log-scale in winter and summer respectively, where levels of 03 are in ppb, SO 2 and SO 4 in ,ug/m . 161 3 , NO 2 3.12 Pollutant-wise scatter plots for residuals of monthly observed pollutant levels against residuals of interpolated levels at the log-scale in winter and summer respectively, where levels of 03 are in ppb; SO 2 and SO , NO 2 4 in g/m 3 162 3.13 Boxplots for predicted, observed and residual levels of log-transformed, monthly concentrations of 03 in ppb, SO 2 and 304 in /Lg/in , 3 , NO 2 respectively. 163 3.14 Locations of gauged sites in Southern Ontario plotted with Census Sub division boundaries, where daily pollution levels are observed and Sites 3, 26 (outliers) are not plotted 164 3.15 Normal quantile-quantile plots for original and log-transformed daily lev els of SO 3 at Gauged Site 1 4 in tg/m 165 3.16 Plots for autocorrelation and partial autocorrelation of daily, log-transformed levels of 03 in ppb at Gauged Site 6, before an AR(1) transformation is taken 166 3.17 Plots for autocorrelation and partial autocorrelation of daily, log-transformed levels of 03 in ppb at Gauged Site 6, after an AR(1) transformation is taken 166 5.1 Plot for trends in AMSPE 167 5.2 MSPEs obtained by Hass’s interpolator and the Bayesian interpolator with original acid rain data 5.3 168 MSPEs obtained by Hass’s interpolator and the Bayesian interpolator with log-transformed acid rain data vii 168 5.4 Boxplots of observed nitrate levels with/out log-transformation at 35 sites in US 169 vu’ Acknowledgements I would like to thank Prof. James Zidek for his excellent guidance, for his fundamen tal influence on my “statistical thinking” and for providing me with a opportunity to work on a practical problem. I would like to thank Dr. Nhu Le for the many fruitful discussiolls I had with him. I am indebted to Dr. Rick Burnett for his comments and for providing me with the data used in my investigation. I thank Dr. Tim Hass; only with his generous help was the comparison of the method in this thesis with his own method feasible. I thank Mr. Hongbin Zhang, our systems analyst, for his support. I thank Mr. Rick White for providing part of the C codes I needed for my analysis. I thank the members of my Supervisory Committee for their help. Very special thanks go to Ms. Chun Zhang, the McConnells and the Veecks for their substantial help. Finally, I would like to thank the University and its Department of Statistics along with Health Canada for their financial support during my graduate studies. ix Chapter 1 Introduction A spatial data set consists of a collection of measurements or observations on one or more attributes taken at specified locations. At a fixed time, those data are observed over a restricted geographical or other region. Many models and theories have been established for spatial data analyses. The following several paragraphs give a brief overview of this newly established field. Cressie (1991a) defined a general spatial model. Let s e R° be a generic sampling site in d-dimensional Euclidean space and X(s) the response vector at s. As s varies over the index set D C Rd, {X(s) : s D} represents a multivariate random field (or alternatively, process). According to the form of D, spatial data analyses can be classified into four types. When D is a fixed, non-degenerate convex subset of Rd and X(s) is a random vector at site s D, such analysis is called geostatistical data analysis. When D is a fixed collection of countably many points of Rd and X(s) is a random vector at site S E D, it is called lattice data analysis. When D is a random point process and X(s) is a random vector at site s D, it is called point pattern data analysis. When D is a random point process and X(s) is a random set, it is called object data analysis. Cressie (1991a) gives a comprehensive survey of spatial data analyses. 1 An earlier form of spatial data occurred as a data map centuries ago (Halley 1686). Recently spatial data have been seen in many applications. For example, the grade of ore deposit is measured at various sites spread over a constrained geographical area. In terms of the general spatial model, D is the constrained geographical area, d is two, s is a two dimensional vector of the longitude and latitude of a site and X(s) is the measured ore deposit grade at location s. Another example is the remotely sensed yield of wheat on earth at a fixed time. In the ore grade example, the ore grade at a site is likely similar to the grade at a nearby site. However, it will be less similar to that at a faraway site, provided these two sites are not on the same ore deposition ridge. It means an underlying relationship among the ore deposit grades does exist. This underlying relationship is called “spatial correlation”. Spatial correlation plays an important role in spatial inference. While the ore deposition grade does not change over time, observations like humidity, temperature and wind within a region are different from time to time. So they have to be measured not only across sites , but also over time. Such data are called spatial-temporal data. Another example of spatial-temporal data comes from air pollution monitoring. There, air pollution levels, e.g. sulphate or nitrate levels, are observed at different sites and regularly, say hourly, for a segment of time duration. Among the four types of spatial analyses, geostatistics is most relevant to the topic of this dissertation. Geostatistics was established in the early 1980s as a hybrid discipline of mining engineering, geology, mathematics and statistics. Its study began with Math eron’s early 1960’s papers on Krigiug, a name given by Matheron after D. 0. Krige, a South African mining engineer who developed empirical methods to determine true ore grade distribution based on sampled ore grades. In Kriging, the model of a ran dom process X(s) generally consists of two terms: a trend and an error. The trend 2 term catches large-scale variation of X(s) and is deterministic. The error term reflects small-scale variation of X(s) and is a random process. The Kriging approach gives a “best linear unbiased estimator” (BLUE) of the unknown ore-grade at sites in the pre diction region using ore-grade samples from neighboring sites by exploring correlations among ore-grades at different sites. Later, in other applications, many different forms of Kriging are developed. Examples are ordinary Kriging; simple Kriging; universal Kriging; Bayesian Kriging; robust Kriging and Co-Kriging. These Kriging methods also give best linear predictors. Thus Kriging has become synonymous with optimal spatial linear prediction. Before exploring the family of Kriging methods further, we define some basic concepts. In Kriging, the spatial correlation is expressed in terms of the Variogram, defined as Var(X(si) = — Var(X(si) X(s ) 2 ), sl,s2 E D. When E(X(si) — — X(s ) 2 ) = 0, E(X(si) — )) 2 X(s X(s ) 2 ). The equation says that if the mean function of a random field is a constant, the variogram and expected mean squared difference are the same. Sometimes, the variogram has other names. For example, it is called Dispersion in Sampson and Guttorp (1992) (SG hereafter). Another important concept in Kriging is intrinsic stationarity. An intrinsically statiollary process, X(s), has: (i)a constant mean D; (ii) Var(X(si) for all s — X(s ) 2 ) = 1 2r(s — s ) 2 , where r(.) is a real, non-negative function in Rd. In Cressie (1991a) 2r(.) is called variogram and r(.) semi-variogram. The concept of variogram implicitly implies intrinsic stationarity. If (ii) is replaced by Cov(X(si), X(s )) 2 = C(si—s2), X(s) is a second-order stationary process. The function C(.) is called a covariogram. If further, 2r(si C Si — s2 — ) 2 s = 2r(H Si — s2 II) (C(si — s2) = 2r(.) (C(.)) is isotropic. The cross-variogram and cross-covariogram between the random fields X(s) and X ( 3 s) have similar definitions. Intrinsic stationarity is a strictly weaker assumption than second-order stationarity. 3 From 2r(h) = 2(0(0) intrinsic stationarity. — 0(h)), we can prove that second-order stationarity implies However, the opposite is not true. For example, a Brownian motion is an intrinsically stationary process but not a second-order stationary process. In many applications, the variogram (covariogram) is unknown. One has to estimate it. There are both parametric and nonparametric approaches for estimating variogram within a constrained region. Usually a parametric approach involves two steps. First, lags, h , 1 ..., h, are chosen and sample variograms (covariograms) at these lags are estimated using observed data. Second, a proper, often an isotropic, variogram (co variogram) model is chosen and the model parameters are determined by fitting it to sample variograms (covariograms). Matheron (1962) proposes a natural way for the computation of a sample variogram (covariogram) by the method of moments. His es timator is unbiased but not robust. Cressie and Hawkins (1980) propose two robust sample variogram estimators. For choosing a proper variogram (covariogram) model, precautions are needed. For example, a variogram model should satisfy the condition ally negative-definiteness condition. That is, vector a E Rk Si, . . . , sk E 3 2y(s aa , s) < 0 for any real D and any integer k. Similarly, a covariogram model must satisfy the positive-definiteness condition. That is, any real vector a E R’ — Si,. . . , sk E 3 C(s aa — s) 0 for D and any integer k. There are other considera tions too. For example, a variogram model should be able to reflect the sill, observed data may present a sill being defined as a variogram’s non-zero limit at lag zero. Many isotropic variogram models have been proposed. Examples of such models are the ra tional quadratic model: 7 (h) (the Gaussian model) 7 (h) = = 2 a {1 2 a h — 2 /(1+ exp(— h h 2)}, 2) ,h E R’ (Schoenberg 1938); h E Rd; the linear model; the exponential model and the spherical model (Journel and Huijbregts 1978). A covari ogram model is chosen similarly. A natural model fitting criterion is “least squares”. Other possible criteria include maximum likelihood, restricted maximum likelihood and 4 minimum norm quadratic. When observed data do not confirm a stationary or isotropic condition, measures are taken to make data stationary or isotropic. For example, Hass (1990a, 1990b, 1992, 1993) adopts a moving window approach so that all the observa tions inside a circular window are approximately isotropic. SG describe a nonparametric approach to estimating a spatial dispersion matrix when the observed data are not stationary. Here a dispersion matrix has the same meaning as a variogram matrix except that no stationarity assumption is implicitly implied. Their method takes two steps. First, with the “nonmetric multidimensional scaling” (MDS) algorithm (see Mardia, Kent and Bibby 1979), a two-dimensional representation of sampling sites is sought. In this two dimensional Euclidean space, called the D—plane, a monotonic function, g, of the distance between any two points approximates the spatial dispersion between the two points. As a counterpart of the D-plane, the geographical coordinates plane of the sampling sites is called the G splines, f, — plane. Second, thinplate are found to provide smooth mappings of the G-plane representation into the D-plane representation. Then, the composition of estimator of var(Z(xa) — f and g yields a nonparametric Z(x ) 6 ) at any two geographic locations Xa and xb. Other nonparametric estimation methods for a spatial covariance matrix are discussed , for example, in Loader and Switzer (1991), Leonard and Hsu (1992), Le and Zidek (1992) and Pilz (1991). While both adopting Bayesian approaches, Le and Zidek (1992) take conjugate priors, inverted Wishart, on the spatial covariance matrix while Leonard and Hsu (1992) propose a class of prior distributions, other than the inverted Wishart. With the above preliminaries, we can now summarize ordinary Kriging as follows. As sume a model, X(s) = u + 6(s), s E D, t E R, where D is a convex subset in Rd, 6(.) is a stochastic process and u is an unknown, constant scalar. Kriging searches for an optimal linear predictor of any unobserved value X(so) within a family of linear func 5 tions of X(s ), i 1 1,. = . . n. Therefore the predictor takes the form p(X; , X(s) under the restriction, E(p(X; so)) A = 1. The restriction makes p(X; = = S) SO) = Z unbiased, since E(X(so)). = Suppose X(s) is an intrinsically stationary process. By minimizing 2 (X(so) optimal choices of = (Xi,. — . . X(s)) + 2m(1 , — A) = 0, (1.1) )) and m have the forms, )_i 7 r t (7+11 _1 = and m where y (j,j)th = (‘y(so = — — Si),.. element is y(sj — , . 7(So —(1 — ))t, 5 )/(ltp_1 7 ltp_1 ) , an n x 1 matrix and 1’ is an n x n matrix whose sd). When X(s) is second-order stationary, the above optimal solution can be expressed in terms of covariograms. The solutions for ) and rn are obtained by replacing ) with c, where c = (C(so si),. — . . , G(SO — sn)), F with = (G(s — si)) and m with -m. If the variogram or covariogram fullction is known, ordinary Kriging stops here. If unknown, it is estimated with either a parametric or nonparametric method. When t has a more complicated form, other Kriging approaches are developed. In simple Kriging (Matheron 1971), the model is X(s) = it(s) + 5s), where u(.) is known. The best linear predictor is sought within the family p(X; so) to an uribiasedness restriction. By minimizing E(X(so) i} and k, the optimal solution is, k (7 = (C(so, s ),. 1 . . , C(so, s))t and = = JL(so) — (C(s, s )). 3 6 — = lX(s) + k subject p(X; so)) 2 over i ijt(s) and i = = {i, Cti where Universal Kriging is introduced when a more general form of form is X(s) (s)i3_ + 6(s), where 3 f_ 1 = vector of parameters, —(•), j 3 f 1,. = . is assumed. That general , = /3)t R’ is an unknown ,p+l are known functions and 6(.) is a zero mean . intrinsically stationary process with variogram 27(.). The form of a best linear unbiased predictor is p(X; so) )X(s), subject to, )X is an n x (p + 1) matrix, its (,j)t1 x, ) = = \1,. element being 1 f_ ( s), and x If the variogram is known, solutions i = 1,. . . = . . , .A,,), where X (fo(so), , . . . f(so)). , n are easily derived. If unknown, it needs to be estimated. One problem occurs when one estimates the variogram. Since in universal Kriging X(s) is not intrinsically stationary, the sample variograms computed with observed data, X(s ), i 1 1 = . . . , n by the formulas of Matheron (1962), and Gressi and Hawkins (1980) are biased. To obtain unbiased variogram estimators, the residuals of X(s) must be known; but they are unknown, since /3 is generally unknown. To bypass this dilemma, Neuman and Jacobson (1984; see also Haas 1993) propose an iterative method starting with ordinary least square (o.1.s.). Note that the model for the vector Xt = (X(si), ..., X(sj) can be rewritten in a matrix form, X n x (p + 1) matrix with its (,j)t1 element, i = 1,... ,ri, j = Z/3 + 6, where Z is an 0,1,... ,p being f(s); = /3 is regression coeficients and S is a stochatic process vector. The iterative procedure is as follows. First, estimate X — /3 by /3i = Z/3 i 0 . Second, update /3 by residuals, X — gis 3 / X and obtain Z 1 (ZtZ) = (Zt_1Z)_1Z_1_1X based on residuals, and E with the updated gis. Repeat the above procedure until it converges. 3 Z/ In previous Kriging approaches, an estimate of X(so) is computed using information on the same process X(s ) i 1 = 1,. . . , n. In some applications, additional information is available. In these cases, realizations of other correlated random processes are observed. CoKriging was developed to bring the additional information into the BLUE predictor. More specifically, if the observed data set is X with (j,j)th element being 3 X ( s), i = = 1,... ,n, 7 (X(s ) 1 ,. j = . . , X(s))t, an n x k matrix 1,..., k, one needs to predict (s)) 3 Xi(so) using X. Suppose E(X C(s,u), s,u E D, where = ,,j ([Li,. ..,[jk)t /1 = 1,. . . ,k, s E D and cov(X(s), X(u)) and C(s,u) is a k x k matrix. The best (so) takes the form p 1 linear CoKriging predictor of X (Z; so) 1 with 11 Z- A 1 = 1, _i Ai = 0, j = 2,.. . , = = k. The remaining steps are the same as those for other Kriging methods. Applications of the CoKriging method in environmetrics, soil science, and hydrology can be found in Haas (1993) , Yate and Warrick (1987), as well as Ahmed and de Marsily (1987). Haas (1993) proposes a Moving Window Regression Residual CoKriging (MWRRCK) method for predicting pollution carried in rainfall. There, the observed information includes: (i) wet deposition of pollutants monitored at 200 sampling sites in US; (ii) observations of precipitation at over 6000 sites of National Weather Service (NWS) network. We outline MWRRCK method below. First, a moving circular window centered at a prediction site is selected to achieve local isotropy. A radius of the circular window is chosen so that the total number of monitoring sites inside the window will not be less than a predetermined value. This number is set to make sample variograms reasonably accurate. Second, the spatial trend surface (respectively spatial covariance matrix) in the window is removed (respectively estimated) through an iterative o.l.s-g.l.s-procedure. Third, a regular CoKriging method is applied to the residual process for estimating the residual at the prediction site. The final prediction is a sum of the predicted trend and the predicted residual. In Hass’s theory, care has been taken to make the covariance matrix within a window positive definite. However, the covariance matrix between windows need not be positive definite. The problem arises when two prediction sites are geographically close and one of the two sites falls inside the circular window of the other. Because different fitted 8 semivariogram and cross-semivariogram models are obtained for each window, it can happen that the covariance matrix between these two sites is negative. Although Kriging is an appealingly simple method, it tends to understate uncertainties of predicted values, since model uncertainty is not included. One simple example given in Handcock and Stein (1993) shows how much effect the model uncertainty has on the confidence interval (CI) of a predicted value. There, the above authors showed that if the uncertainty of the unknown spatial covariance is not included, the 95% CI of a predicted elevation based on measured elevations is (699, 707). When it is included by using a Bayesian approach, the 95% CI becomes (694, 713). This example clearly supports the use of Bayesian approach to Kriging. Such Kriging are described as Bayesian. Bayesian Kriging is relatively new. Not much work has been done in geostatistics. For a general model X(s) = n(s) + 6(s), s E D, a Bayesian approach can be adopted by assuming it(s) to be a random process that is independent of 6(.). More specially, p(s) f /9_ ( 1 s), _ s e D ,where j = 0,... ,p are random variables (see Nather 1985, Kitanidis 1986, and, Omre and Halvorsen 1989). For example, Omre and Halvarsen (1989) describe a version of Bayesian Kriging that puts prior on the mean function only. Their method appears to be a direct extension of traditional Bayesian linear model to case of spatial data. An empirical Bayesian predictor is obtained if the parameters of the prior are estimated from the data and substituted into the Bayesian predictor of X(so). Similarly, one can assume the covariogram varies in the space of all positive-definite functions PD = {C(s, ) : s, 1 tt € D} and put a (prior) distribution on PD. Or one can assume a structural model on C(.), C(s, i; 0) and put a prior on 0. For the latter case, the predictive density is f(X(so) I X) =10 f(X(so) I X, 0)f(0 I X)d0. One example of such an approach is proposed by Handcock and Stein (1993). In Hand 9 cock and Wallis (1994), the method is applied to model meteorological fields in an attempt to assess the global warming problem from a statistical point of view. One competitor to Kriging is the smoothing spline (Wahba 1990a, 1990b). The spline method can be briefly summarized as follows. For observed data, X, i assumed model is X surements, f() = f(s) + e, where s, i is a smooth function and e, i = 1,.. = 1,. . . , . , = 1,. . . n, the n are the locations of mea n, are independently, identically distributed (i.i.d.) errors. For each value of y, the smoothing function, f(s), is estimated by minimizing - over all f with continuous first 2+ f(s)} f[f”(x)]2dx and squared-integerable second derivatives. The smooth parameter ,\ is chosen by “generalized cross-validation” (Wahba 1990b). The inter 0 is taken as f(so). Oehlert (1993) shows one example polated value at any location s which combines a smoothing spline technique with a Bayesian approach. There, Oehlert proposes a multiple time series model for data that have both temporal and spatial correlations. The smoothing spline is used when the mean and trend are extended from the rectangles where there are the monitoring sites to rectangles with no moni tors. Many empirical comparisons have suggested that the interpolation performances of spline methods and Kriging methods are similar (see Laslett (1994) for references to these comparisons). Laslett (1994) demonstrates that in certain cases, Kriging surpasses splines by a big margin. When the observed data have a spatial-temporal form, Kriging faces another rival, a “hierarchal Bayesian time series approach” developed by Le and Zidek (1992) (LZ hereafter). LZ assume independent responses over time, a common unknown spatial covariance structure at each fixed time point and conjugate priors on both trend param eters and the spatial covariance. As an alternative to Kriging, LZ’s method has many 10 advantages. For example, it incorporates the model uncertainty into its interpolator. It uses incoming information to dynamically update estimation and gives a predictive distribution that enables construction of a simultaneous band for interpolated values. In its original form, the LZ method does not allow such data. In this thesis, we extend the LZ method to include missing data. We develop and explain the extension in the context of interpolating air pollution. Let’s first describe this context. Assume there are s sites scattering over a constrained region. At each site, k response values are measured. Examples of such measurements are air pollution levels, like sulphate or nitrate levels. Among the s sites, s,, sites are ungauged sites, where there are no observations but air pollution levels are needed. The other g 8 sites are gauged; there are observations. In the case of missing data, some values are missing at the gauged sites. The missing data patterns discussed in the next chapter are called randomly missing, missing-by-design and monotone missing. Here, we use the term “randomly missing” to mean its probability of being missing is not related to its value (see Little and Rubin 1987). The term “monotone missing” is also from Little and Rubin (1987). The meanings of the three missing data patterns are explained in the next chapter. Let X denote the complete, random response vector for all sites (both gauged and ungauged) at time t, where the first k elements represent pollution levels for k pollutants at site one, the second k elements for site two and so on. Thus, X, is an sk-dimensional vector. The inferential objective treated in this thesis is to interpolate unobserved pollution levels at ungauged sites using incomplete observations at gauged sites. As in LZ, we assume a linear, Multivariate Gaussian model and conjugate priors on its parameters. Interpolation with missing data consists of two parts. First, by fixing hyperparameters, we find a predictive distribution and its posterior mean along with 11 a standard error. Second, we develop an estimation procedure for hyperparameters. Further, in two steps we estimate the hyper-covariance matrix of X,. In step one, we adopt an EM algorithm to estimate the hypercovariance matrix at gauged sites. In step two, we apply the OS approach to extend this hyper-covariance matrix to all sites. This thesis consists of six chapters. In Chapter 2, we describe three different interpo lation theories depending on the missing data patterns. The patterns are: (i) missingby-design; (ii) randomly missing; (iii) monotone. There, we fully develop the theory of interpolation with data missing-by-design; we briefly discuss the theory of interpolation with randomly missing data and only touch the theory of interpolation with monotone missing data. In Chapter 3, we apply the theory of interpolation with data missingby-design to Southern Ontario pollution data; we implement it with S and C programs and carry out residual analysis. In Chapter 4, by combining the theory developed in Chapter 3 and the theory developed in Caselton, Kan and Zidek (1992), we show how to apply our results to an optimal network redesign problem. In Chapter 5, we compare the theory of “interpolation with data missing-by-design” with the general theory of LZ and also with Hass’s CoKriging method. In Chapter 6, we drawn conclusions and list some future research topics. All figures referred in Chapter 3 and Chapter 5 are listed in Appendix C. We attach examples of S and C programs in Appendix A and B. 12 Chapter 2 Bayesian Multivariate Interpolation with Missing Data 2.1 Introduction The problem of interpolating a spatial field has received a lot of attention. Kriging offers a well-known solution, but it has deficiencies. In particular, it overstates the precision of its interpolator because it fails to reflect the model uncertainty (LZ, Brown, Le and Zidek 1994a, hereafter BLZ). To avoid these deficiencies, LZ propose a hierarchical Bayesian alternative to Kriging. The LZ method takes a Bayesian time series approach with a Gaussian linear model and conjugate priors for the model parameters. Such a Bayesian approach incorporates model uncertainty and yields a heavier-than-normal tailed posterior distribution, the multivariate t. LZ’s Bayesian alternative also has the advantage of dynamically updating the predictor as more observations come in. LZ developed their theory in a univariate case where at each of s sites, only one air pollutant is monitored. BLZ extend the above LZ Bayesian interpolation theory to the multivariate case. At each site k air pollutants are monitored. BLZ adapt the original LZ Bayesian theory by stacking rows of the s x lc response matrix into an sk xl vector, where each row represents 13 k measurements at a gauged site. To reduce the total number of hyperparameters in the prior distributions, BLZ assume a Kronecker product structure on the hypercovariance matrix. Then BLZ describe an algorithm for hyperparameter estimation. But the BLZ theory has an important restriction. It does not permit missing values at gauged sites. In this chapter, theories of Bayesian multivariate interpolation with different patterns of missing data are discussed. These patterns are: (i) missing-by-design ; (ii) randomly missing and (iii) monotone. In all three cases, we follow the hierarchical Bayesian ap proach of LZ. In the case of data missing-by-design, the proposed Bayesian interpolation method is an alternative to Co-Kriging. In a Bayesian analysis, a predictive distribu tion plays a key role and we need to derive that predictive distribution of the unknown pollution levels at ungauged sites given the observed at gauged sites. Section 2 gives a brief review of the LZ theory. Section 3 serves as a technical preparation for the following sections. There, we define a matrix T-distribution, which plays a pivotal role in our inference, and explore its normal approximation. Section 4 spells out the theory of Bayesian multivariate interpolation with data missing-by-design. Section 5 gives a brief discussion to interpolation theory with randomly missing data. There, we describe an approximate predictive distribution and estimation of hyperparameters. Section 6 is about interpolation theory with monotone missing data patterns. There, only a recursive predictive distribution is described. We put the proofs of most theorems, lemmas and corollaries appearing in this Chapter, in the last section. 2.2 Bayesian Interpolation In this section, the LZ Bayesian interpolation theory is briefly summarized. Let X be an s-dimensional random vector at time t, where the first s elements, denoted by X, correspond to the unobserved responses at s ungauged sites. The remaining 14 g 8 elements, denoted by X, correspond to the observed responses at s gauged sites. Assume: Z, where Zt indejndent B, (Bz, ) 8 N (2.1) is an h-dimensional vector of known covariates at time t and B is an s x h matrix of regression coefficients, B= The priors of the unknown parameters B, are taken as conjugates of the normal model, B B°, F , h(B° , 3 N ® F), _l(,5*). 3 W 6” (2.2) (2.3) Let At denote a matrix transpose of A. Since X is partitioned into X and X, E and B are partitioned correspondingly as (11 12 \21 22 E ) and i 1 (B B=i Define S = (X 2 B C A z)(X 2 E - = CA, = = Note D is the set of all observed data. 15 - With a straightforward calculation, the posterior distributions of B and Z are easily found. Lemma 2.1 gives their forms. Lemma 2.1 The posterior distributions of B and are: B D,B°,F 2 22 D, 22, 5* D, , 6* + n 22 W( 6* 1I2, l47(1I2, — 6), , 3 N 1 ( 1 39 12 2 ., where = }, 2 { 1 , 12 ,r 12 2 12 T = B* = — E 22 = ‘22 +S+ = ‘c’ ‘c’—i 22 ZJi2L v’ c’ v—iv 22 L21 L.Ii2ZI LJfl ( ) T12 B° + EØF = = 139 X Sg [(z) 2 (E — B)Et, ,Z Ø(EF), 2 r 22 (E )] F’(A + F’), 2 (E — B°) ( t A’ + Fj(B 2 712 — B°), ,+ 22 ‘‘i2 LZ show that the predictive distribution of a future unknown realization Xf consists of two multivariate t distributions. These distributions are given in Theorem 2.1. For com pleteness, the definition of a multivariate t-distribution is repeated here. A multivariate t-distribution, denoted as f(x) tr(jt, H H, 9), is defined to have a density function, I- [ + (x — ) H t (x 16 — )] +r) (2.4) Theorem 2.1 Let the posterior distribution of B and The predictive distribution of X hyperparameters B°, , ((x))t, (X)t), = be defined as in Lemma 2.1. given covariate vector Zf and the 6, F, consists of X D 9 (a(i) + b, t c — 22 d and X X = D t (ao + 12 (x — a(i)), c + (a(i) — ’(a(l) t x) — 12 x) q), where = * 1* + — — = q g 5 + 1, s + 1, — (ao’\ J=Bzf, a=I )J 1 \a( 2 (E b c = d B)Ezf, — 1 + 4F zf, 1 = zEF z 1 f. With the predictive distributions, the Bayesian predictor is simplely taken as the pos terior mean. Corollary 2.1 Given the predictive distributions of Theorem 2.1 and means of X) and X are = 112 E(X) = E(X) 17 1 + qi a b, 2 = 2 + b. a Zf, the predictive 2.3 Matrix T-Distribution A representation of normal distribution given as a lemma in Lindley (1972) is summa rized here. For any given constant matrices A, B, assume X ) 1 N(AY,E Y N(B, and 2). If two normal distributions are independent, Lindley’s result asserts X AB + N(O, ). t 2 +A The above fact will be repeatedly used in the sequel without explicit mention. Besides the normal distribution, the matrix T-distribution plays a pivotal role in the theories developed in this chapter. Its definition and some properties are discussed next. Definition 2.1 A random matrix T : p x q is said to follow the matrix T-distribution, if its density is expressible as where > 1 + tQ’t F t p(rn-qQp I = 0, Qqxq > 0, m > p + q — and F() = l)f()f( — ). . . = F( (2.5) 1, KPFq k[m,p,q] Im, ((m T’q + — fi — \ m) ). In the above theorem, the notation, “P > 0” means that P is positive-definite. An alternative form of f(t) is = Q (m-p) p q k[rn,p,q] 18 +t p 1 m (2.6) Using the notation of Dickey (1967), Press(1982) by T -‘ T(P, Q, 0, m) or more generally T + C “ , we express the matrix T-distribution T(F, Q, C, m), where C is a constant matrix. Lemma 2.2 When q = 1 and Q is a scalar, T(F, Q, 0, m) is equivalent to a multivariate t(0, c’,m_p). Proof: By Equation (2.6), f(t) (Q + tt[ph]1t)m [(m — ) + t (QP -1 ) 1 rn—p By (2.5) and (2.6), it is easy to see that t T r’. ,P’,O,m). 1 T(Q (2.7) In Dickey (1967), a representation of a matrix T is given. The result is copied here. Lemma 2.3 Suppose that 7 U , , F,m Q > 0 ,m > p + q and T = — W(P,m—q), Xpxq Q ‘‘ N(0,I®Q), P >0, 1 and that X, U are independent. Let T be a random p x q matrix (u_4)tx. Then T has the distribution given in (p.5). In the Lemma, 0 denotes the Kronecker product. A direct application of the above representation yields the mean of a matrix T. Corollary 2.2 If T ‘- T(P, Q, 0, m), the mean of T + C is C, where C is any constant matrix. Proof: By Lemma 2.3, E(T) = E([U1 ) t E(X) = 0.1 Below, some properties of a matrix T distribution given in Press (1982) are listed without proof. Partition T as 1 IT T=( 19 T: being a p x q matrix, i = 1, 2, and conformably, 21 \P 11 of dimensions for a matrix P P1 122 x p. Let P 211 = 22 P — 21 P 12 P’P 0 be positive > definite. Lemma 2.4 Suppose T T(P,Q,0,m). Then: 1. conditionally T 1 2. marginally 3. 0 7’2 2 T = 2 t ‘—‘ T(Pn, , Q, 0, m 211 T(F T(P,CQCi,0,m) where — , m); t 2 , —PPi t 2 Q + tPii p); Opxr = 1 and C TC 1 is any q x r matrix. Dawid (1981) replaces the notation T(P, Q, 0, m) with T(8, F, Q). His notation differs from that of Dickey in the choice of the “degrees of freedom” parameter, that is, rn—p = +q—i. 6 Dawid (1981) has another representation of a matrix T-distribution T that can be defined as the marginal distribution of Tpxq with T and W(I, m — p). In a general form, if Tpxq I E T(I, ‘q, 0, rn) ‘s-’ N(0, I, 0 E) N(0, P’ 0 E) and >Z Q W’(Q,m—p) then T where F, Q T(P, Q, 0, m), (2.8) are invertible symmetric square matrices. Proof: f(t P,Q) = ff(t I P,)f( f —(m++1) Q)dS e_4tT[_1 t Pt (t )] dZ pt IQ+t The last step is true by integrating with respect to a partial Wishart density function.I 20 Definition 2.2 Let be a matrix with column vectors, a , 1 ..., a,,, define: 1 a vec(A) = a Let tr(A) denote the trace of matrix A, which is the sum of diagonal elements of A. Here are some useful facts: Lemma 2.5 When the dimensions are proper: 1. vec(A + B) = vec(A) + vec(B); 2. (Bt 0 A) vec(Z) vec(AZB); 3. tr(AB) = tr(BA) ,4. tr(AB) = ) t (vec(A v ec(B); ) 5. (AOB)(COD)=(AC)Ø(BD); =AØB (AØB) ; 6. t 7. (A ® B)-’ = A’ ® B’; 8. if X is a random matrix, E(vec(X)) = vec(E(X)). The proof is straightforward and omitted. Definition 2.3 Assume Y = (Yj) is an m x q random matrix, elements of Y are independently, identically distributed as N(O, 1), M : n x p, A : p x q, B : n x m are constant matrices and . t X=M+BYA Then X follows a matrix normal distribution, i.e. X 21 NXP(M, (BBt) ® (AAt)). The density function of a matrix normal X is f(x) where W = = BBt >0, V = V W 4 (2r) e_tr[W_1_M)V_1(_M)t1, i’ (2.9) AAt >0. By Lemma 2.5, [vec(X = — [vec(X tr[(X = M)t]t(W 0 V)’vec(X — — vec[V’(X ] t M) (X 1 M)V tr[W’(X — M)tWh] W’] t M) M)V’(X — — M)t — — M)t]. Thus, the following fact is proved. Lemma 2.6 T’Vhen the dimensions are proper and W, V are invertible, symmetric ma trices, the following is true, [vec(X — M)t]t(W 0 V) vec(X 1 — t M) = tr[W’(X — M)V’(X — ]. t M) By a direct application of Lemma 2.6 to Equation (2.9), an equivalent relation between multivariate normality and matrix normality is established. Lemma 2.7 Assume X be an n x p random matrix, then X N(M, W 0 V) if and only if vec(X ) t ‘ ), W ® V). t N(vec(M As one can see, the covariance matrix of a matrix normal distribution is a Kronecker product of two matrices. Sometimes, for notational simplicity, a notation Xpxq V) +M due to Dawid (1981) replaces Xpxq ‘ N(Mpxq,Wpxq 0 Vqxq). ‘ .A1(W, With this new notation, some facts about the matrix normal distribution are given without proof: 22 Lemma 2.8 If Xpxq f(W,V)+M, 1. t V(V,W)+M X ; Jf(CWCt, DtVD), where C, D are two nonrandom matrices with 2. CXD proper dimensions. Next, a normal approximation of a matrix T is derived. Suppose T follows a matrix T distribution Tpxq ‘‘ Q, 0, 6). T(P, T* where Let = = 6 Define a scaled matrix T* u’ g 1 = —r (-_) = 6T. By Lemma 2.3, x, (2.10) means equal in distribution. T* is a matrix analogue of the univariate t . 8 — q. By the definition of a Wishart distribution, there are random vectors, Y 1 , p dimensional N(0, F), such that UtU.YtYt. (2.11) By the multivariate strong law of large numbers (SLLN), ) t vec(U vec(yiyjt) E(vec(YiY ) t 1 ) a.s. as where, a.s. represents convergence almost surely with respect to Note that 1 E(vec(Y ) t Y ) vec(E(Y ) t Y 1 ) = t U —* Since when p is fixed, tS’ P —f a.s. = (). 1 fy vec(P). Hence, as —* oc oc is equivalent to 6 —* or 6 —* oc. oc. Applying Slutsky’s theorem, we have t U T*=[-] —* P X = .A/(P , Q) in 23 distribution, as —* oc. Thus the following theorem is proved. Theorem 2.2 WIzen n, p are fixed, as T* —* J/(P’, Q) cc in distribution. Theorem 2.2 is an extension of a similar asymptotic result in the univariate case, that is t 5 —* N(O, 1) as S Lemma 2.9 Let U —* cc. The fact has been noted in Dickey (1967). W(P,’9). Then ( where in the lemma, __,d - p) (2P, F), ,as cc, denotes convergence in distribution. The above lemma is applied in the proof of the following theorem that gives an asymp totic distribution of a matrix T-distribution in higher order. Theorem 2.3 Let T* be defined by (2.10), and F > 0, - where U Jf(P,F ) 1 , V X)--U 4 P * V, as Corollary 2.3 When z9 is big, T*(1+=Y*)X* Q, > —* 0. Then cc, .A/(P’,Q) and given the hyperparameters F, independent of V. where given F, Q U, V are independent and U(P , 1 Q), 24 Q, U is Interpolation with Data Missing-by-design 2.4 In this section, the theory of Bayesian spatial interpolation with data missing-by-design is established for the multivariate case in which there are s,. ungauged sites, s gauged sites and k monitored pollutants at each site. Suppose, now, that observations at gauged sites are pooled from different air pollution monitoring network. Since by design each gauged site does not monitor the same subset of pollutants, some pollutants, concerned by us, may not be monitored at a site. Hence, these pollutants are missing because of design. Therefore, we call these “missing” val ues missing-by-design. More specifically, Let X = ((X)t, (x(l))t), where (X )kl 9 represents the response vector at ungauged sites at time t and (X )sgkx1 represents the 1 response vector at gauged sites at time t. After a proper rearrangement of its elements, is further partitioned into (X’) , and 11 that respectively correspond (X)(sgk_1)x1 to the vector of missing-by-design pollutants and to the vector of observed pollutants at gauged sites. Since the same 1 pollutants are missing during the whole monitoring pe riod and they comprise 1 columns in matrix X, they are sometimes referred as missing columns in the sequel. Like LZ, a normal model for the conditional sampling distribution is adopted. In terms of a matrix normal distribution, the model can be written as, X where X = Z, B, - kX(BZ, E ® Ia), 3 N ,. .,X) 1 (X kXfl is the response matrix; Z 3 . = , 1 (Z (2.12) ..., Z)h is the matrix of covariates; B= sk,1 9 / is the coefficient matrix; . sk,h 3 / skxh is the unknown spatial covariance matrix of X and I is an 25 n x n identity matrix. The conjugate priors of B I B°, E, F , B are, kh(B°, 3 N D ® F’) (2.13) and F,6* W(,6*). (2.14) The random field interpolation theory developed in this section has two steps. In the first step, while the hyperparameters B°, F, 1’ and 5’ are assumed to be fixed, predictive distributions derived. In the second step, estimation procedure of hyperparameters is discussed. 2.4.1 Predictive Distributions and Interpolation In this subsection, all hyperparameters B°, F, and 6 fixed and are suppressed in the derivations. The estimation of these hyperparameters is left to the next subsection. If indices of missing columns in ZgJ let R 1 = , 1 (r . . ., are i , 1 r) and R 2 is an .s k-dimensional vector with 9 th ..., = i1 and indices of observed columns are (r , + 1 . . ., ,) where r, 3 r = = 1,..., .gk, element being one and the remainder being zero. Thus R 1 and R 2 “mark” the position of missing columns. Now let R that (X’) t j = ,R 1 (R ). Observe 2 1 consists of the missing columns. Similarly (X2)t (X(l))tR = 2 (X(1))tR consists of the observed columns. Because vector X is partitioned into three parts, B, E, B° and 4 are partitioned accordingly. For example, ( where oo and (11) are oo ) 1 Eo( E(i)o E() k 9 .sJc x .sk, sgk x .s is partitioned as matrices respectively; Rt(ll)R partitioned as t (11) R— R — 21 ” 1 E 2 22 E ) = (R(ll)Rl R(ll)R ’ 2 \ R(ll)Rl RE(ll)R ) 2 ’ 26 is further where Et = 11 and En are 1 X 1, (sgk — 1) (sgk X — 1) matrices respectively and in general if (E,E,E)t, E( ) means (E,E)t. The partitions of 1 B, B° have a meaning 11 Further, denote E analogous to that of ). /T t (11) = (11) = with ‘I’rn being 1 x 1 and 22 JJ /rt T \ W ( v11 l 2 k\l1 \ ) 22 ‘P being (s k 9 ( = D ’&’(11)1U2 1 1L kR(ll)Rl 2 R(ll)R 1) x (sgk — DtT D ’1’(ll)ILl 1 tL — 1) . Let 112 “P — “121’22121 ’J! for use in the sequel. Let, 12 ‘IJ and / ritr,o _ — \ t 1 / o\ (1) o \ _1 B t R 2 °(i)J \ — 1 B°2 When a mean squared loss function is taken, the Bayesian interpolator is simply E(X° ). To find the posterior mean, one needs the predictive distribution function, f(X° 2 X I ). That predictive distribution function plays a pivotal role in our Bayesian interpo 2 X lation theory. As an indirect approach, since E(X° ) 2 X = E(E(X° ) 2 X’,X one can instead find the predictive distributions of f(X° X(’)) and f(X’ ), 1 X ). Those 2 X two predictive distributions are given in the following two theorems. Theorem 2.4 x° = ‘-‘ T( ) 1 , Z+ 0 B ZtF_ + Z +1 j)(X(’) 1 O(l)( (x(1) — — t Bl)Z) B(°l)Z), 6* + n). From Corollary 2.2 we obtain the next corollary. Corollary 2.4 E(X° = x(1)) = BZ + 27 — Brl)Z). — B°l)Z), Theorem 2.5 X’ ZtF_ + (x Z 2 T(’, I + 1 2 = x X 2 RB(°)Z + ‘y(x 2 — — RB°l)Z), 6* )Z)t 1 RB l 2 (x — — sk + n). Corollary 2.5 E (x’ I = 2) = 2 RBrl)Z + 7(x — )Z). 1 RB° With the above two corollaries, the Bayesian interpolator is found. Corollary 2.6 E(X° X 2 = ) 2 x = I1’(X o(l)R Z+2 0 B — RB°l)Z). The above corollary shows that the interpolated value E(X° X ) depends only on the 2 observed vector, X, provided the hyperparameters are fixed. Later it will be seen that our estimator of the hyperparameters depends on all observed values X , therefore the 2 interpolated values, E(X° X ), do depend on all observed values. 2 Although it is difficult to find an analytical form of Var(X° X ), it is possible to find 2 a closed form for Var(X° ). The following theorem gives that closed form. Before 2 X the variance formula is presented, an useful fact, needed in the derivation of variance formula, is given. 28 Lemma 2.10 Let S be an invertible matrix (a I At ” 1 A 2 A I h122J \12 where a is a scalar and A 22 is an n 1 by n — — A1=( Then (b — B’B 1 B ) 2 2 = 1 matrix. A 1 is denoted as ::)• a. Theorem 2.6 Fort=1,...,n, ) 2 Var(X° X = m2 A + 2 I 11 [A where ) 2 ,A 1 (A cj = = Io(l)’I’(J’flR = 1 + ZFZ + (X m = 8* — (o(l)j)Rl, 2 ?o(l)’I’(j)R ) ; — — sk Zt); 0 2 B 1 + 1. Obviously when there are no missing columns, the variance is o (l)ct/(m 1 — 2). When there are missing columns, the variance is roughly increased by a factor Ai’I’i Act/(m— 12 2). As indicated earlier, if the predictive distribution of f(X° X ) is found, the Bayesian 2 interpolator can easily be obtained. That approach is pursued here as a way of checking the formulas for the interpolator and its variance. Note that if hyperparameters are given, by Model (2.12)-(2.14) and Lemma 2.20 of Section 2.6: fyO\ ( I X} ‘kt I I -‘--‘0 ) 2 B / v’ ( ‘-OO RE(l)o 29 £-20(1)1t2 2 R(ll)R ((Boo (B 0 ) 2 \B ( E RlO (\RE(,)o OO N ‘ J 2 RE(,,)R (( Eo(l)R2 ‘\ ® F-’ 2 ) 1 ( 0 ” R \ oo ) 2 R(ll)R 22 } * i — By following the same way as the proof of Theorem 2.4, the predictive distribution of ) is easily obtained. 2 f(X° X Theorem 2.7 X° X 2 where Ct, T (,ct,Bo°Zt + o(l)R (X 22 m are defined in Theorem 2.6 and oI2 = oo — — 0 Zt),m + sk) 2 B o(i)R2 ‘I’ R(,)o. From the above predictive distribution, the same Bayesian interpolator as that of Corol lary 2.6 is obtained. By Lemma 2.2 and Press (1982), it is easy to see that the variance is Var(Xt° I X) = m— 20I2 (2.15) It is not difficult to prove that Equation (2.15) is equivalent to the variance formula of Theorem 2.6. Lemma 2.11 If A, is defined in Theorem 2.6 and oI2 is defined in Theorem 2.7, the following equation is true. A ‘OI2 — At iTi 11l2111 + “oI(l)• That verifies the variance formula of Theorem 2.6. With the variance, only a pairwise confidence interval can be derived. With the posterior distribution of X° , we can derive a simultaneous region. The next lemma is 2 X Theorem 3.2.12 of Muirhead (1982 p913) and it is copied here. 30 Lemma 2.12 If A is W(,6*). So 5* is a positive integer, > any p x 1 random vector distributed independently of A with P(A YtEY/YtA_lY is p = 1, and Y is 0) = 0 then and is independent of Y. By that lemma, the distribution of the quadratic form of matrix T is easily found. T(I,1,0,6*), Theorem 2.8 IfT 1 (6* — T t p)T ‘-.- Let I = (X 1 22 Z + o(l)R 0 B — B Z 0 2 ), where I is the Bayesian interpolator of X° when there are missing-by-design data. Then a simultaneous region of X° is given in Theorem 2.9. Theorem 2.9 The 1 Bayesian interpolator — a, where 0 < a < 1, simultaneously posterior region of the is a hyper-ellipsoid, defined by the set {X : (X — ) ( t X — ) <b 2 X } , where, b — sk — and Ct 2.4.2 F1_a,suk,3*_l_suk+1 6*_lsk+1 * Ct * is defined in Theorem 2.6. Estimation of Hyperparameters In the previous subsection, the hyperparameters , 8, F, B° are assumed to be known. Often, they are unknown. To finish the interpolation procedure, they must be specified in some way. In a complete Bayesian hierarchical approach, another level of priors is usually laid on these hyperparameters. An alternative approach, suitable when the parameters are not sensitive to their prior specification, entails the use of empirical 31 Bayesian method. In an empirical method, the hyperparameters are estimated with the observed data. In BLZ, 1 and 6 are estimated through a maximum likelihood estimator (MLE) and B°, F’ are assumed to be zero. While it is difficult to estimate all of hyperparameters, , 6, B°, F’ by MLE, in this subsection we propose two unbiased estimators for the estimation of B°, F—’ and apply the MLE method to estimate the MLE of , , 6. For the moment, 6’ is discussed. The two unbiased estimators of B°, F’ are described at the end of this subsection. Even with B°, F’ being fixed, the direct maximization of the marginal distribution 2 f(X I , 6*) is difficult. With the help of EM algorithm proposed by Dempster, Laird and Rubin (1977), one can instead maximize the distribution of f(X ,, 2 22 B E 2 , 6) (Chen 1979). Let the complete data set be x and assume only part of x is observed. That observed part is denoted as y. If the sampling density f(x = )/a(q5) t b(x)exp(qt(x) is from a regular exponential-family, the p iteration of EM algorithm consists of: (E-step) estimating the complete data sufficient statistics, t(x), by finding (P) = E(t(x) y, q));and (M-step) determining q(P+1) as the solution of the equation E(t(x) It is proved, in C.F.J. Wu(1983), that the limit point q) = does maximize the marginal likelihood function l(q I 1). To apply the above EM algorithm, we need to find the sufficient statistics of F and 6*. Following the approach of LZ, let 2 B = )tZt(ZZt)1 2 (X 32 and 2 S = (X(I 2 ( Z ) t . -’Z)X ZZ — Anderson (1984 p291) shows that given B(,) and 2 B(,) B 52 I - N(RB(l), 22 )—’), t ® (ZZ k_1(22, 9 WS (11) n (2.16) h) — (2.17) , S 2 and that B 2 are independent. Since, , B(,), E() B°,), F’, 2 f(X = 2 f(X B(,), (ii))f((ii) 2 f(S (fl) ,, I 6) 6*)f(B 6*)f(E(,) I F’, B(,), (,,))f(B(i) B(°,), ,, F-’), where the last step is true, because of Equations (2.16) and (2.17). When B° and F’ are assumed to be known, the above likelihood function of 6* is proportional to f((ii) (11) and (,,),6*). Because, by Lemma 2.20 of Section 2.6, 6* — sk). Therefore, the likelihood function is proportional to, , B(,), 2 f(X i) (ii), 6*) oc f(Eii) CC CJ 1 6*) (H), (—sk) 6+sgk+1—suk 2 2 exp [_tr((l,)E))] (2.18) where sgk(6*suk)/2(8gk1)sgk/4 = 2 33 — sk — i + 1) With the likelihood Equation (2.18), the pair the sufficient statistics of ) log is readily identified as 6 (Chen 1979). To reduce number of parameters that need to be estimated, BLZ adopt a Kronecker structure, = A®, where A the between-sites-hypercovariance matrix and pollutants-hypercovariance matrix. Since (11) = Ag 0 , between where Ag is the hypercovari ance matrix between the gauged sites, by the likelihood function (2.18), only Ag and could be estimated. Hence, the estimation of c1 takes two steps. In Step one, Ag, and 8* are estimated by MLE through an EM algorithm. In Step two, the SG method is applied to extend Ag to A. Expectation Step At E-step, the posterior expectations of and log I are of interest. The next lemma, due to Chen (1979, Theorem 2.1, p237), is used in the derivation of the expectations of these sufficient statistics. The lemma is copied without proof. , Lemma 2.13 Given hyperparameters x( ) 1 ) = 6, B° and F, 8*1) ( t 7 ? 6 \\ _ _6*) d where, ,T 71 ’O(1)P(fl), 4 ‘ , t A=ZZ 0(1) ) 1 B( S(i) (11) = (11) = x’ (I + S(i) + (E = x(l)Zt, = — — Z (ZZ 1 Z) ()((1) ), ) t Brl))(A’ + F’) (E(i) 34 — , t Btl)) d Note that ‘J/ = = (6* + n — sk h)’j) + 6?74?7 + — R(ll)R. W(A’,n), then given A, n, Lemma 2.14 If W E(log W ) = n 1 A * and where ‘P(.) represents a digamma function. A digamma function is defined to be the derivative of a log-gamma function. With the above lemma, the following lemma is true. The expectation of is, Theorem 2.10 E E’ (11) 2 x — — B (6* ( (8 — — (6* sk)’IJ t suk)(*) — sk)I,* 1 d “ ) t R where = 2 G = I — (ZZ t MZ ) ’ *_ — S = Zt(ZZt)_ Z 1 , = (6* — B, T T—l ‘“1222 2+t M 1 MG ((ZZ 2 G ) ’ +F )’G, 1 = ‘22 1 d — sk + fl — 1 — + 5, h)’ + (6* 35 — I17l* + lT!. t suk)(,l*) The expectation of log I is, Lemma 2.15 E(iog I (11) , 2 X , Sgkl — ) = —sgkiog(2) k—i—h— .+1 2 — — + 1_ — ) + log 1I2 +log 22 I. Maximization Step When the expectations of sufficient statistics are found, at M-step, the current values of the hyperparameters are updated with the values that maximize the “likelihood” function. Here, the “likelihood” function is obtained by plugging the expectations of the sufficient statistics into Equation (2.18). However maximizing the “likelihood” function over = Ag 0 is not easy. When there is more than one parameter involved in the maximization step, the following lemma leads to an iterated procedure. An advantage of the iterated procedure is that at each iterated step, only the maximization of a function over a single-parameter is required. That is generally easier than maximizing over several parameters simultaneously. Lemma 2.16 If a function g(x, y) is upper bounded, the iteration procedure described below leads to the pair (xo, yo) that maximizes g(,) locally. At the pt iteration, update the current value with a value that maximizes the function g(x, y(”) y(P)) and denote the updated x value as xl); update the current value (°) with a value that maximizes the function g(x(’),y x(’)) and denote the updated value y as (‘). (xo,yo) is the limit of (x(’), y(P)). Proof: g(x,y) is bounded up and g(x,y) g(x’’,y) 36 Besides the above lemma, another lemma is used in the maximizing procedure. The lemma is Lemma 3.2.2 of Anderson (1984 p6 ). It is copied for convenience. 2 Lemma 2.17 If > 0, the maximum of f(G) — —flog G —tr(G D), 1 with respect to positive definite matrices G exists, occurs at G = (1/n)D, and has the value f[(1/n)D] —_p*n*log(n)—n*log D —p*n. With Lemma 2.16, an iterated procedure is adopted to find maximize Equation (2.18). Since Equation (2.18) over Rt(ll)R = I(ii) i) 1 F( = A 9 0 Q, 6* that and R is orthogonal, maximizing is equivalent to maximizing the same equation over When 5 is fixed, the log-likelihood of Equation (2.18) is proportional to L(Ag, c) = (S = — (6* sk)log —tr — sk)klog A’ ((11))) _(6* — sk)s 9 —tr [(A 90 Again, an iterated procedure is applied to maximize L(Ag, ft). When Q is fixed, rewrite tr (411)) Q). By Lemma 2.17, Ag = k(6* 9 as tr(A = s(6* Similarly, when Ag is fixed, — — j). 9 sk)Q’ maximizes L(A ) 11 sk)G, where tr(( = tr(QG), maximizes L(Ag, 1). When 4() is fixed, 6* is updated by maximizing Equation (2.18) over 6*. By taking the first derivative of Equation (2.18) with respect to 6*, it becomes sgk — klog(2) 9 s — log ) 11 E( I +log I — 37 6* k 2 1 = 0. (2.19) are fixed, finding When A , 9 t that maximizes Equation (2.18) is equivalent to finding a solution in Equation (2.19). For showing the existence of a solution of Equation (2.19), j with E(log replace log Ei) and use the relationship log M,(ll),6*) in Lemma 2.15 to Equation (2.19) II log J) log t R’P(ll)R 112 ‘P 22 ‘P +log , Equation (2.19) becomes, 6—sk—z+1 i=1 1=1+1 —log ‘P221 +log I ‘P22=0 or equivalently — sk— h 1) — ‘P(6 — sk— i + 1)] i=l+1 . —log’P 22 =log Since the gamma function is convex, the digamma function is monotonically increasing. Thus, the left side of the above equation is always positive whenever n it goes to zero when t — h > increases to infinity. The right side of the equation is positive. Therefore, a unique solution does exist. 2.4.3 EM Algorithm By summarizing the above discussion, our EM algorithm becomes: E-STEP. Given the current values of ‘P( ), 6*, 11 E ‘ (11) t2) M — — R ( _(6*(6* k)’P ’P t sk)(*) — — (6* — sk)’P,* 1 d ) t R where 1 d = (6* 0 and — sk + fl — 1— h) + (6* 38 — 7 W t sUk)(7l*) * , + lP and II E(log M(,,),6*) = 6 +n—sk—l—h _i+ M-STEP: Given the current values of 6 — klog(2) 9 —s ) +log 1I2 — sk— i + +log I; I , find the MLE of and log = 9 0 l, A 6* by repeating the following steps until it converges. Step 1: Given the current A) and Q(’), represent tr[(A ® = s(6* — as tr(1(P)G) and set sk)G; Step 2. Given current = k(6* and — P+1), represent 1 tr[(AO’ ) )j)] as ir(A’)Q) and set . 1 sk)Q Step 3. Given the current estimate 6* by solving the following equation: {(6*+n_suk_h_i+1)(6*_suk_i+1)] j=1+1 = log I 22 —log 22 I. (2.20) Estimation of B° and F . 1 The estimation of the hyperparameters B° and F are based on two unbiased estima tors. Suppose that the k air pollutants are labeled from 1 to k. The measurements of the same air pollutant at different sites are usually similar but the measurements among different pollutants are not similar. Based on this fact, we assume that B° has an exchangeability structure, that is, the hypermean coefficients of all gauged sites are the same and equal to u, i 39 = 1,. . . , k. th air pollutant at Since, I 2 X BZ + N(O, a Ia), then the following lemma is true. Lemma 2.18 Given Be,) and E(), B 2 and S 2 are independent. Proof: ,S 2 f(E 2 J f(E, J f(E Br,), (ii)) — = 2 f(E = 52 I B( ) 1 , Br,), (n))f(B(,) Br,), E(,,)) dB(,) B(i), (,,))f(B(,) B(°,), (,,f(S 2 I 2 Br,), E(,l))f(S 11) dB(,) I (11)). The next theorem gives unbiased estimators of B 0 and F . Let 1 12= isgk—L k—l 9 fs wherej E {1,.. ,k}, v 1,.. .,sgkl. Obviously . (/ j 3 t = (ZZt)_lZtX andj marks the pollutant type of X. Let %i be the means of all J3 with = v 1,... , k. Therefore, under the exchange is the estimator of uz. Let, ability assumption, = n — h — 2 k1 (v 9 S — %iv)t($iv — — — where /3 — = a a 2 v=1 . 2 E Theorem 2.11 Given X 2 and the exchangeability of B°, unbiased estimators of u, i 2.5 (ZZ ) t ’ = , i = 1,.. . , k and F’ are 1,..., k and F—’ respectively. Interpolation with Randomly Missing Data We refer the term randomly missing to the same meaning as missing at random defined mathematically by Rubin (1976), Little and Rubin (1987). More precisely, let Y denote 40 the data that would occur in the absence of missing values. Further, we assume Y = 0 represents the observed values and Ymis, missing values. Assume (Yobs, Ymis), where Y f(Y 0) denotes the joint probability density of Yobs, Ymis. We define for each component of Y a missing data indicator, taking value 1 if the component is observed and 0 if missing. For example, if Y ), an (n x K) matrix of n observations measured for 1 (Y K variables, define the response indicator R = (R), such that 1Itj — — f 1 if Yij observed, 0 if Yij missing. When we treat R as a random variable and specify the joint distribution of R and Y, the conditional distribution R given Y indexed by an unknown parameter i/ is called the distribution for the missing data mechanism. Rubin (1976) defines the missing data to be missing at random when the distribution of the missing data mechanism does not depend on the missing value Ymis, that is, if f(R I Yobs,Ymis,i/)) = f(R Y ,i,b). 3 b 0 Little and Rubin (1987) point out that when data are missing at random, likelihood based inferences can ignore the missing data mechanism. In other words, likelihood based inferences can only base on the marginal distribution, f(Yob 5 Let M 1 represent the randomly missing subset of data from subset of data from x’ 0). and M 2 the observed Let 1* denote the total number of elements in M . Again, the 1 predictive distribution f(X° ) leads to a Bayesian interpolator. Given the hyper 2 M parameters, when there are data missing-by-design or no missing data, the predictive distribution follows a matrix T-distribution or a multivariate t-distribution. With ran domly missing data, will the predictive distribution f(X° ) still follow some form 2 M of t-distribution? The following simple example gives a negative answer. 41 Consider a simple case, where there are two gauged sites and n = 2. Let T denote the random matrix at the gauged sites. That is, 22 T (X 1 I Y1,Y2 = y’\ X 2 1I \Y2 where the upper case characters represent the missing values and lower case characters represent the observed values. Assume, T -‘.A[(I , x 2 ), E W (12x2,6+2). Then, II+T T t = where a 1 = X+y+1 2 +y 1 y X +y X+y+1 1 y X 2 = )X + 1 X (—2yiy (X+ 1)X+ 2 = 4 + (aix + 2 ; a ) 2 (X? + 1), a = , a X 2 y 1 —y 3 = y + yy+1+ X 4 + y + y?y + 1 + X? and a Note, =y+y+y+X+i4 a Thus -± 2 4 + 2 II+TtT_=[(aiX ] )+a since ,yi,y f(Xj,x ) j°° 2 =1 J—oo 2 dx f(y1,y2) ,yi,y Lf(xi,x ) 2 dx L . 2 f(T)dx 42 = 3 a 2 — Use (2.5) to substitute for f(T), and obtain 1 f(X = =yl,Y ) LlI+TtT2 y IY’ 2 2 dx -± / ±i 4 xa 2 4 cxa 2 2 2+ 1 (aix a 1+ 4 a fX’ I oo 2) t 2 1+— 4 a a’J 2 dx -± 2 dt -± 2 X 4 a 1 a 2 [i+x] 2 The last step is completed by integrating with respect to a partial t density. Consider the special case, 1 f(X 1 f(X I Ui = Y2 = = ‘,Y2 1. Then = (X +2)(1 +X), 1) 1, 1) is clearly not a t-distribution. Therefore f(X ,X 1 2 multivariate t-distribution. If it did, its marginal f(X 1 I y’, y2) y, y2) does not follow a would follow a univariate t-distribution. Since the search for an exact predictive distribution proves difficult, an alternative ap proach is to derive an approximate predictive distribution. Note that given hyperparam eters, ) 2 , 1 f(X°,M M follows a matrix T-distribution. By Theorem 2.2, f(X°,X(’)) is approximately normal and then by Lemma 2.7, f(vec(X°), vec(Mi), vec(M )) is approxi 2 mately multivariate normal. General normal theory implies that f(vec(X°) 2 vec(M ) ) is approximately normal. Theorem 2.3 shows that the order of approximation is O((6*)_). The theorem below states an approximate predictive distribution of f(vec(X°) I 2 vec(M ) ). Let R* — k 0 (Iskxs 0 - where R = ) and Rvec((X(l))t) 2 R 11 (R 0 = vec(Mi), Rvec((X(l))t) 43 = vec(M ) 2 . Theorem 2.12 f(vec((X°) ) t t ° 0 vec((B ) Z) -f- vec(M ) 2 ) is approximately normal with mean (M ExoM ) 2 M (vec(M2) — Rvec((B(°l)Z) ) t ) and covariance — (x0M X0M ) M 2 , EM where Exx = [(Z F t ’Z + I) 0 ]/(* + n); 0) Zxx XOX0 = (‘sujcxs 2 = EXOM I o) k_1*)x(S 9 (0(3 k _1j’\ 0) Exx j J 2 = M (0 R)Exx (2) If the loss function is “mean squared error”, the approximately Bayesian interpolator is simply the mean of the above approximate predictive distribution. Following the same approach of the previous section, an empirical Bayesian method is obtained. In other words, all the hyperparameters are estimated using the observations. Since no unbiased estimators of B°, F—’ has yet been found, the estimation of these hyperparameters will not be discussed here. In applications, they will be specified as plausible values. Hence, in the following discussion, these two hyperparameters are assumed to be known. Again, the estimation of hyperparameters requires two steps. In step one, an EM algorithm is applied to estimate 6, 9 0 ft In step two, the A = SG method is applied to extend A 9 so as to obtain A. In the EM algorithm, the likelihood function is f(X(’) 6). By the same argument as that of the proof of Theorem 2.12, f(vec((X)t) I ,(ii)) 44 ). t + vec((B,)Z) So by Equation (2.9), f(X(’) 1 _bk Ag 11), 6) is approximately proportional to (6* ç — sk + where A and a* = 6* — a*(X — B(°l)Z)(Z F t ’Z + I)’(X 1 — )Z)t 1 B .sk + n. 1 For the E-step, E(M I ) is given by the following lemma. 2 M Lemma 2.19 Given hyperparameters, f(vec(Mi) vec(M ) 2 , 6) is approximately normal with mean )+ t f4vec((Brl)Z) 2M 1 M 2 ) 2 (vec(M — ) t Rvec((B(°l)Z) and covariance M 1 EM — EM EM , 1 2 M where = [(ii) = F’Z + I)]/(6* t a (Z REx(l)x(l)Rj,i,j = — sk + n); 1,2. The M-step is similar to that in the previous section. Its discussion will lot be repeated here. The EM algorithm is summarized below. E-STEP: Given the current values of E(vec(Mi) I 2 vec(M ) ) = 6*, B), Rvec(B ) 1 Z) + 45 E MlM ( 2 M vec(M2) Ij — Rvec(Brl)Z)) M-STEP: Given the current values of vec(Mi), update (11) = Ag 0 Q, 6K by repeating the following steps until it converges. Step 1: Given the current (A’)() and tr((f’)(”)G) and set (1_1)(7), represent (c_1)(P+1) = tr[((A;’)P 0 (1— )())A] 1 as hsgG’; Step 2: (1)(i+1), Given the current (A’)() and and set (A;’)(’) = represent tr[((A)PO (Q_1)(P+1)) A] as hkQ’. Step 3: Given the current a* = = , sgkh/tr[ The estimated 6 is max{sk, a* 2.6 Ag 0 — update a* by 1 + i) F t B(l)Z)(Z (X’ 1 — ] t B(l)Z) — — Interpolation with Monotone Missing Data Pat terns Let X be generally partitioned as ((X) , (X) t ,. t (Xflt), i = 0,. . . , \ (i+i)i tively. Note that , . (Xfl’) and (X)t = ,..., t ((X) r. E is partitioned correspondingly as I In the above, . L4([i+1][j+1]) 1’ z=O,1,...,r—1. J are the covariance matrices of X and = j = o,... >D. A reparameterization is recursively taken as , 1 {P ([i+i][i+i])}, 46 0,.. . , r — 1, , r respec where Pj = {ici+i’ i(i+1)}=o and i(i+l)E([+l][+ll), iI(i+1) Ti(i+l) B is partitioned as Bt = (B, B,. = ii . — Tj(j+l)([+lJ[i+lj)Tjt(j+l), = 0,. . . , r — 1. , B) accordingly. . The lemma below gives the intuition behind the reparameterization. In the lemma, ‘I is partitioned in the same fashion as Lemma 2.20 If has the prior distribution of (2.14), the following are true: is independent of 1’,,j 1. 2. [i+1][i+1fl TV’(([j+i][+i]), ‘- = — 0, 1,. . . , i. sk + l(i+1)), where l(i+1) is the dimension of 11 E 3. i W_l( ( 1 +l),6* — sk + l()), where i(i) is the dimension of X and — 4. (+ 1 E ) Tq1) )0 ( 1 E This result is called Generalized Inverted Wishart Distribution by Brown, Le and Zidek (1994b) (see Caselton, Kan and Zidek (1990), LZ for its earlier forms). Suppose (X))t = (U, G), i = 1,. . . , n, where U! : 1 x d is missing, G : 1 x (sgk is observed and X’ is the matrix of response vectors at the gauged sites. 1 d 2 d ... — d) When d,, the pattern of the missing data {U!} 1 resembles an upside down staircase. An example of such missing data occurs when at gauged Site 1 the pollutants are measured up to time T , at gauged Site 2 up to T 1 , 2 and when T 1 < T 2 < ... . .., at gauged Site n up to T,-, < T; the missing data of [X(1)]t have the shape of staircase. Such a missing pattern is called “monotone”; in Little and Rubin (1987). The following theorem gives a recursively predictive distribution. 47 Theorem 2.13 Suppose X, t = 1,... ,n follow the model of (2.12) and B, E follow the prior of (2.13), (2.14). Then, given all hyperparameters, {G } 1 ,D Uj a, 6 + T(( ) 2 ( 1 )’, j — j s)k, 1,. .. , where: fl — — (Bfl*Z a a = 1 + (G — = lVki fi=1J (B( ) 2 )ZJ + 22 )(( 2 l( ))Gj; 22 ) 2 ( 1 )) (( 22 ) 2 (B ))_1(Gj )*Zj)t( — (Bi))* (B ) 1 )* jj y(1)j.-1 (1)) ) 1 ) + (B( 1 B — ))*Zj) + ZZ; 2 (B — (B)* : d x h, (B ) 2 )*: (s k 9 d) x h; — ((2)) ) 1 B( Si = CA7’; — = C, A’ = F ( 1 A’ + F)’; ii) = = (11) ( (2)1 + Si + ) 1 (E — = F = (I — ) 1 (AT’ + F’)(E t B°l)) : d x d, 22) : (sgk — — ) 3 d X (sgk — di). (22) For the case of a monotone missing data pattern, finding an EM algorithm estimation procedure for hyperparameters is difficult. As a make-shift measure, we can treat it as a randomly missing data problem; the EM algorithm proposed in the previous section may then be applied. 48 2.7 Proofs of Theorems, Corollaries and Lemmas Proof of Lemma 2.9 Proof: By (2.11), U 1983 p305) 1’’ and }çy t = implies t Cov(vec(YY ) ) = i.i.d. N(O, F). Proposition 8.3 (ii) (cf. Eaton ) t Cov(YY 2P 0 P. With the multivariate = central limit theorem, (vec(U) vec(P))(2P, F). Since vec(U ) and Ut are only notationally different and a matrix normal distribution t is distributionally equivalent to a multivariate normal distribution, by the definition of convergence in distribution (Billingsley 1968), it is easy to see that - P)--A1(2P, F).I Proof of Theorem 2.3 Since - FX) - (T* = and x) F + (() _i) p-x 1 v1(T*_P ()x) (x = - = By Lemma 2.9: (- (ç) -) (ç) (1+ () () -) () () ‘ as 49 By LLN and Slutsky Theorem, as —* o 1 ()r4) —1 (1+ a.s.; —* —* P 2 a.s.; —+1. So by Slusky theorem (T* - ()x) F- Ar(2P,P)F’F4Ar(I,Q) or (T* Given F, Q, () - - x) , 1 F’)A1(F Q). the two normal distributions are independent, since U and X are indepen dent. Further, since 6 )[()r+i] —+ The above is true, since = 6 — 0 as q and q is fixed. Reapply Slusky theorem. The conclusion is followed. I Proof of Theorem 2.4 Given B°, F, , 6*, (2.12), (2.13) imply and BI > B°+N(0,®F’). 50 —* co. In turn, EB0Z+N(O,®P) X where P = Z F’Z + I. So t Xt ( Z + t N(O,P®). B0) Equation (2.14) and (2.8) imply, t X — , , 0, 1 T(P (B’) Z t = By Bartlett’s decomposition, where = (I \0 and + n). I) A= ‘ 0 (oI(1) \ 0 (11) = o(1)i). Now apply Lemma 2.4 part 3 to get t (X — (B0) Z ) ) t _l ( ‘-‘ T(P_l,A, 0, 6* + n). Then by (2.7) ( )x=(t1 t2)T((rj1) )(BO) where X° = t 1 and X’ = t . By Lemma 2.4 2 tl — — °Z + 1 0 B BZ 7 ? ) (° I t 2 -‘ ), P + (t 1 T( 2 — j)(t t Bl)Z) 2 — )Z), 1 B 0,6* +n). The theorem is proved by reapplying the same lemma. I Proof of Theorem 2.5 By (2.12)-(2.14) and Lemma 2.20 t (X(’)) I (B Z ) 1 t ) + N(0, P ® 51 W (ii), 6* — k) where P = Z F t ’Z + Ij)(fl. Thus t ) 1 (X — t (Bl)Z) ‘-- T(P’, (11), 0, sk + n). — Apply Lemma 2.4 part 3: R t (X’) — (B(0 ) 1 Z)tR T(P’, R (ll)R, 0, t — .sk + n). The remaining is the same as in the proof of Theorem 2.4. I Proof of Corollary 2.6 E(X° X ) 2 = E(E(X° = )(E(XU I X 1 BZ + O(l) ) 2 = Z+ 0 B The last equation is true since I ) 2 X [R E 1 (x1 o(1)(i) = RRt I B’l)Z) — X 2 )+R 2 X 2 R 1 R R +R , X’ = — Brl)Z]. ) and X 1 RX( 2 = By Corollary 2.5 E(X’ ) 2 X = Z + o(l)jj){Rl (RB(°l)Z + y(X 0 B 2 X 2 +R = — — RBrl)Z)) Bl)Z} BZ + — R ’ 1 yR — I)Bl)Z -y)X 1 R } 2+2 + (R = BZ + — R R 2 )Brl)Z )X 1 R } 2+2 + (R = BZ + = BZ + O(l)R 2 (X 222 )(X R +2 The third last step is true because RRt = — since 52 = RBl)Z) RBl)Z). 1 +R R 2 Rt(ll)R — 1’(11) = I and the last step is true implies 2 Rt(ll)R = 1’22) So R = /‘12 = 111 + 22 1 R 12 ’I’ 2 R . I122) That is R2 22 ) 2 +R — — = )(Rl + R 1 7 ).I 2 Proof of Lemma 2.10 By using Bartlett decomposition, it is easy to see that b = (a 12 B 22 B For notational simplicity, let d b — B 12 blBt 22 12 = d’ — = = ‘—1. 12 A A_lAt 12? 22 — = Ab; 12 —A A’ + 2 A’A b A 12 A’ 22 = a — AA 1 A 2 2 which is a scalar. Then A’d’(A’ 1 A 2 + = = [d 2 d d’[l — — 12 1 d A J 2 12 + )’A A 22 (A 1212i1 2 A t 12 + )’A A d 22 (A For showing the lemma, note that 2 AA 1 A 2 is a scalar and 12 A 12 A — A 1 a ( A 2 ) Ai A i+2 (Ai a’Ai A ) A = 12 [a A a’ = 12 + 2 A a’d A a’(Ai ) Ai A = A(A + ). 2 a’Ai d 22 12 2 A — 12 + t ] A 1 2 A’A 2 (Ai 12 aA 22 12)“ Equivalently, 22 2 A 1 (A 1 A d+ 22 2) 53 = A. 12 a’A (2.21) Then, 1 — (A 1 A d 22 )’A 1 A 2+2 2 = 1 — 2 A’A 12 a’A = a’d. Combined with (2.21), the lemma is proved. I Proof of Theorem 2.6 Given all the hyperparameters, ) 2 Var(X° X = )) X 1 ) 2 Var(E(Xt° X( x( ) 1 ) +E[Var(Xt° If t = X ] 2 . n, by (2.7), Lemma 2.4, Theorem 2.4 and Lemma 2.10, it is obvious that T(’I 0 (l),gt,BO Z t+ x° I x(’) — B(Ol)Z),6* + 1), (2.22) where gi For t € {1,. . . ,n = — 1 + ZF’Z + (X’ — Brl)Zt)t j)(Xt — B(°l)Zt). 1}, let C be the orthogonal matrix such that By Lemma 2.4 (III) and Theorem 2.4, x°c I x(’) — ZC 1 T( ) 1 , I. + CtZtF_ Bl)ZC)tj)(X(l)C B Z 0 C + o(l)’)(X(’)C — — Bl)ZC), B(°l)ZC), + n). CtZtF_ Z C is Note that the last column of ZC is Z, the last diagonal element of 1 ZF’Z and the last column of X’C is So it shows that (2.22) is true for any t. By Lemma 2.2 and Press (1982 p128), E(X I X(’)) = BZ + (I)o(l)(ll)(X 54 — Bl)Zt) and Var(X g = I m+l—2 oI(1). Further, — B(°l)Zt) = )(J!j)RRt(XU 1 ( 0 B(°l)Zt) — = IX’ ,A 1 (A ) 2 — — = (X’ 1 A — B’Z BZ) (X 2 BZ) + A — BZ). By Theorem 2.5 and a similar argument as the case of X° X IX (2.23) —BZt),m+l). So I Var(E(X° ) 2 X = Var(o(l) )(Xt = AiVar(X’ )A 2 X — — ) 2 Bl)Zt) X A iT At J1lCtVlI2f1i m—2 The last equation is true, because of Lemma 2.2 and Press (1982 p128). X}, let us start with E[Var(X° Rt)R= ( (I -r ‘\0 I) - ( \ 0 (I \o 1 , 0 -r So E — = Brl)Zt)tJj)(Xl) E [(RtX1) — Bl)Zt) x] — — RtBl)Zt) I x] 0 -E X - - I BZ 0 ((Xe’ —B’Zt)—r(X? BZ) 2 X—BZ E{[(X BZ) r(X BZ)] [(X t — ) = — —r(X — BZ)] — — I X} + (X 55 — — B’Z) (X t BZ) — BZ). To find Let Y = i1’[(X’ — Z) 0 1 B — r(X Z)]. By (2.23) and Lemma 2.4, 0 2 B — Y X T(I,ct,O,m+ 1). Lemma 2.2 and Equation 2.4 imply f(Y = y I X) C?(Ct = y) t +y - 2 7r2F() Then E(YtY)=E(YtY+ct)_ct — — = p (m —2 + l)ct m 2 o (m_2+1) p — (!!iza) C(Ct + ,, y) m—2-H 2 dy — Ct m-2+l Ct—Ct. m—2 That implies — = m—2+l Ct m—2 Bl)Zt)tj)(Xl) — Ct + (X And E[Var(X I X’) XJ = — Bl)Zt) — — 1 m + 1— [Ct 2 + B°Z). m—2+l Ct m 2 (l) 1 Ct]O — Ct = m—2 oI(1). Finally, Var(Xt X) = = 112 c 1 A ’I’ + m—2 m—2 Ct 1T, + oI(1)].I 1 [A A 112 m—2 Proof of Lemma 2.11 By Rt(ll)R = ‘IJ, the following are true: 11 ( 2 )R = R (12 = 112 + , R 222 R (2.24) 122) = R + 21 11 RI’ ’I’ + 12 2 R ’J! + R. 1 R 22 1 2 R 56 (2.25) By (2.24) and (2.25) t 11 2 ‘P’R ( )R 22 2 (11) ’1 2 R t ) ’T’ + 12 1 (R 12 2 ’ R ( 2 )I’ ’ 1 T1 R 2 + 22 ’I’I’ 1 R 21 12 + = — ’P 1 R 11 — = Hence, R 112 RI! = (11) — R(ll) P 2 (ll)R which implies t-1 (11) (ll)RlIhlI R 2 l — (I)(fl) — — 1 2 R R 2 . So = O(1)(i j 1 ) o(1) — or 112 I 1 A ’ = —[o = oI(1) + — + — OI2• It proves that 112 + oI(1) W 1 A = OI2’ Proof of Theorem 2.8: By Lemma 2.3, there exists U T = (U_)tY. By Lemma 2.12 , — W(Ipxp, 5* r’.I 1) and Y/p t Y = N(O, I> 0 1) such that is independent of YtY ‘‘ YtU-1Y T t p)T — Yty I ytU_1y/(*_p) F ,..I Proof of Theorem 2.9: By Equation (2.7), Lemma 2.4 and Theorem 2.7, I 2(XO 012 — / T(I,1,O,6*_l+1). Ct2 57 x• Therefore, Then by Theorem 2.8, (6* — 1— sk + 1)(X .sk 0t—l(xO t/ 012 — * — O) F k 5 , ct 6—l—sk+1 So, 1 — a = 012\ — t < — b).I Proof of Lemma 2.13 By Bartlett’s decomposition, = TAT, where 0 (°I(1) 0 and r = 0(1)(1’1). T= , )j 11 E( Ii o \ ) Thus = / I 0\ = / — /1 0 1>:0:1) —r\ ) i) (1) — )+ 11 E( Given , Tt ) 1 T) 6”, Lemma 2.1 implies I x(’) I x(’) WSk((l), 6*) 1 WS k 9 (!j), 6* + n — sk — h) and (1)’ 89 k 3 N( k)(, )( 01 0 E ) 1 ( So = X) = 6oI(1) )E(r 1 E ( I x(’)) = 58 0I(1)) x(’)) and ,)T t + T (6* + n Note that, — sk t + E(E(T — = )T 1 ( Let Y E)r. t El)T = (E)T) t T x(’)) OI(1)) )T_1. x(’)). Obviously by definition of a Wishart distribution, Y E , 0 1x 1 N(sk)(s k 9 )(O, ‘skxsk 0 Then, yty oI(1), x’ WS k 9 (j), .sk). - Thus E(E(ll) + E,)T t T xc’)) = (6* + n — sk — + sUk’j) x(’)) = (6* + n — sk — h)j) + 6q( )77 + sUkF(ll).I 1 Proof of Theorem 2.10 Note that since, R X t ’ B(,) t R Z, B, E Br,), E(ii)R t R , RtB(l)Z + N(O, R Z(,i)R 0 Ia), t Brl) + N(o, Rt(,,)R ® t R F (11),6* W(R ( t ll)R,6* — sk) and , 2 x 6*) = RE((R E t (,,)R)’ The theorem follows from Lemma 2.13. I Proof of Lemma 2.15 59 , 2 X 6*)1?t. Since , 2 II X E(log 6) = E(log I R E(ll)R t = E(log( I 22 E 1I2 , 2 X I) , 6) , 2 X 6*) Note that 112 I X E , 2 (11), 6* 112 ’ 1 W , (’P 6* — sk) and that , 2 X (11), 6* r, W_ , 2 ( 1 2 6* + n — sk — 1— h). Now apply Lemma 2.14 to finish the proof. I Proof of Theorem 2.11 1 fixed. By (2.16), In the following proof, we assume F E(3 I1))f, v1,...,sgkl. So E(/IBl))= i=1,...,k. To prove the other half of the Theorem, by (2.17) and Theorem 3.2.8 of Muirhead 1982 ), 93 (p at S a Jv Jv 2 t 3 22 a a L “JV 2 implying E( a a 2 S 3 j a 22 aE = )’ 22 , 2 x = n — 1 h —2 (2.26) and — jt( — i) X 2 = x ,B 2 ), 1 22] = Var(Ba 2 = x X , B° 2 ), 1 By (2.16) and the fact that aE aj is a scalar, 22 = x , B, 2 22 “ t ) 1 N([B , 2 R ]t ((ZZ ) ’ + a). 22 ) 1 F a 60 22)• Hence, Var(Baj I = 2 ,B(°,), ) x 22 E = )’). t a)(F’ + (ZZ 22 (a (2.27) 2 are independent. If we replace B(,) with , B(,), 22 are fixed, S 2 x 2 and B When X 2 B(°,), they are still independent (Lemma 2.18). Therefore, given X 2 = , B(°,) and , 2 x 22 Z by Lemma 2.18, E(fr_1) h —2 sgk 1 = k—1 9 s — — a S a 2 E — aj)}’E[($ 22 (aE v)t(u — =, 1 ) t (ZZ 1 sgk k—l 9 s — (F-’ + (ZZ )’) t 1 — (ZZ ) t ’ = Proof of Theorem 2.12 By (2.7), (2.8) and (2.12)-(2.14), given the hyperparameters X , B°Z, 6* + Z F ’Z + T(’, t 6’, B°, F, ii). By Theorem 2.2, X where a = — B°Z 1 = 1 a(a’(X — B°Z)) approx, ‘-‘-‘ a _1 Z F t ’Z + I), 6* + n. Then by Lemma 2.7, vec(X) approx. 4 f(ZtF1Z + I, a J — ). t ) + vec((B°Z) So, vec(X t (R*) ) = / vec[(X°) \ } t vec(M,) v ec(M ) 2 ( J approx. vec[(BZ) ] t Rvec[(Brl)Z)t] + (R*)ta_Ar(ZtF_1Z + I, Rvec[(B,)Z)t]) 61 ). — v)j By the general normal theory, the theorem is proved. I Proof of Lemma 2.19 By the same argument of the proof of Theorem 2.12, one has vec(Xl))aocalAf(Z F t _lZ + I, where a 1 = 8* — + sk + n. So, (vec(Mi)) ) 2 vec(M aiRtJ\1(ZtF_1Z + I, vec[(B(° R ) 1 ] Z) . +t By the general normal theory, the lemma is proved. I Proof of Theorem 2.13 For notational simplicity, the superscript Let X, j = 1,... ,n, be partitioned into (B, B, B) and where P = j is suppressed in the following derivation. ((X5)1X k 3 ), are partitioned confirmablely. ), 2 {1I(2), Ti( , ((G)1x(sgk_dj) 3 (UJ)lxd ). B = is reparameterizedas {oI(1), To(i), f*} ) is also reparameterized 1 )}. B 22 E( ) represents {B, B}. B( 1 as = 1 B — )B( 2 Tl( ) , ) 2 b( = ). 2 B( Note, f (U {G},D) ),f*)f(bl,b( Dj,bl,b( ) ,F* Dj)dbldb( )dF* 2 Jf(U,{G} 2 Jf(U,G .f(b, b( ) 2 ),F*)f({G}+l 2 Dj,bl,b( F D )f(f* 3 To find the distribution of f(b , b( 1 ) 2 B(i) ) dF* 2 D) db 1 db( D, f*), note by lemma (2.1) D, f B + N(O, (ii)) 62 (2.28) where B ) 1 ) + (B(i) 1 B = — ))Et, E 1 B(0 En 0 F, F* = = (I — E)F-’. By Lemma 2.7, vec(Bl)) D f — 0 F). ) + N(O, t vec((Bl)) By a general normal theory, )),I* 2 vec(B) Dj,vec(B vec( (B)t) + .1V(O, + vec(B )vec(B )) vec(B))vec(B 2 (v ec(B )) 2 — v ec( t 2 (B ) ) 2)) vec(B)Ivec(Bt 2 ) )) ) + (ri(2) 0 I)(vec(B t vec((B) )) 2 vec((B + )(B( Tl( ) 2 — — t 2 vec((B ) ) + N(O, ))t) + N(O, 2 B 1I(2) 1I(2) 0 F*) 0 F*). In the above derivation, it is assumed that (F*)_l exists and the last step is true because of Lemma 2.5. Again by Lemma 2.7, 1 B ) 2 I B( , D 3’ f B+ )(B( Tl( ) 2 _B*(2)) + N(O, 1I(2) 0 F*), it implies 3’ — — 1 (B d where b*1 — — — ‘-‘l )B( 2 —Tl( ) ) b + N(O, I 1I(2) )B). Since the distribution of b 2 rl( 1 0 F*), ) does not depend on b( 2 b( ), b 2 1 ) are independent. So 2 and b( , b( 1 f(b ) D, f*) 2 = 1 f(b = f(bi ) D, f*) 2 D, F*)f(b( I , 1I(2), ))f(b( rl( ) 2 I ). 3 D Further since f(U, G I ,2 1 b b( ) , f*) oc f(U I t , b( 1 ), f*)f(G 2 G, b 63 , b( 1 b ), f*) 2 (2.29) and UJ I ), p 2 , b( 1 G, b Z + 1 [B = [b Z + 1 TI( ) 2 (G Tl( ) 2 Gj]t t ) 2 B( Z)] + N(0, — + N(0, 1I(2)) ( 1 E ) 2 ). 1 So f(U,, G ,2 1 b b( ) , F) cx f(U ), ( 2 G, b , Ti( 1 11 ) 2 )f(G b( ) 2 , (22)). (2.30) Also, by Model (2.12)-(2.14), D, b ,2 1 b( ) , f*) = f({G} 1 f({G}t=3+ t I (22), b( ) 2 ). (2.31) By (2.28)-(2.31) and Lemma 2.20 f(UJ cx I {G},D) I t f(UtjI’Yj, , Ti( 1 b ), 2 I ( 1 E ) 2 )f(bl 1 D, Ti( ), 1I(2)) 2 f(r ) 2 ( 1 ,( 11 I D) db E ) 2 12 drl( 2 ) dl!( ) f f({G J b, E( ) 2 ))f(b( 22 if i=j cx I (22), ) 22 D)f(E( ) 22 D) db( ) d( 2 f(UtjI’j’ b , Ti( 1 ), 1 2 i ) 2 ( )f(bi D, Ti( ), 2 2 ( 1 E ) ) 1 f(71(2), )( 2 D,) db 11 d ) 2 . 1 drl( >1I(2) For summary, we have, U bF , G, 1 D, Ti( ), 2 Z + 1 b 1I(2) Tl( ) 2 G + N(0, =d 1 + N(0, b 1I(2) ( 1 E ) 2 ); 1 0 F), (2.32) (2.33) where = B — )B( 2 Tl( ) = B ° + (B 1 1 — B)E — )[B 2 Tl( ) + ) 2 (E( — B ) 2 )Et]. Since, by Lemma 2.1, W(ii), + (j — 1) — sk). Then by Lemma 2.20, ) 2 Ti( D, 1I(2) “ N(l( ) 2 ), 1I(2) 0 64 (2.34) (i 1 W ) 2 ( , 6* D 1I(2) — sk + (j 1)). — By (2.32)-(2.34), U G, 2 ( 1 E ) 1 1 + N(O, 2 a )a ( 1 i ) , where 1 = a BZ 1 — 1(2)2) B*(2) Z3 + 1(2) (22) G 3 and 2 a = 1 + ZF*Z + (G — t ) 2 B F Zj) )(G B’ ) 2 Zj). — Combined with Equation (2.8) and (2.35), U U G ) 2 T( , al,6* ,a 65 — sk +j).I (2.35) Chapter 3 Application to Air Pollution Data In this chapter, the theory of interpolation for data missing-by-design developed in Chapter 2 is applied to obtain interpolated data. The daily maximum hourly levels of nitrogen dioxide 2 (NO ) , ozone (03) , sulphur dioxide ) and the daily mean levels of nitrate 3 2 (SO (NO ) , sulfate (SO ) were recorded from 4 January 1 of 1983 to December 31 of 1988 in Ontario and its surrounding areas. These data come from several monitoring networks in the province, including the Environment Air Quality Monitoring Network (OME), Air Pollution in Ontario Study (APIOS), the Canadian Acid and Participation Monitoring Network (CAPMON) (see Burnett et al (1992) for more description of the data set). Also available are some climatic data: daily maximum, minimum and mean temperature; daily maximum, minimum and mean humidity, and, mean pressure and mean wind speed, measured at other locations. In total, there are 37 different monitoring locations (sites), but not all sites monitor all of the five air pollutants. In the application below, we assume that the variation caused by networks to the observations at 37 sites is negligible. Therefore we can pool the observed pollution levels without worrying about that variation. In general, there are two kinds of air pollutants: (i) a primary pollutant, which is 66 directly emitted by identifiable sources; (ii) a secondary pollutant, which is produced by chemical reactions within the atmosphere between pollutants and other constituents. NO 03 and 504 are secondary pollutants. , , 3 2 2 is a primary pollutant while NO SO 2 is produced by burning fuels containing sulphur. Its level depends on local emission SO sources, like burning fuel oil or smelting. The secondary pollutants studied here are all produced by oxidation of primary pollu tants. This oxidation is driven by ultra-violet radiation from sunlight and comprises chemical reactions that are temperature dependent. Since the chemical reactions pro ceed while the polluted air is being adverted by winds, secondary pollutants are generally more widespread than primary pollutants. We thus refer to secondary pollutants as re gional. Because of temperature dependence of the governing chemical reaction, NO , 2 3 and 03 are high in early afternoon and midsummer, low overnight and in winter. NO The oxidation of 802 to 504 is dominated by photochemical processes in dry, warm atmospheres. Monthly pollution data are interpolated down to the level of a Public Health Unit (PHU), the daily pollution data down to the level of a census subdivision (CSD). Both interpolation problems originate from environmental health studies not discussed in this thesis. In both cases, the relevance of the assumptions in Chapter 2 is investigated and the interpolated residuals are checked. In Section 3.1, our interpolation theory is applied to monthly data and in Section 3.2, to daily data. 67 3.1 Application to Monthly Pollution Data A monthly pollution level is simply computed as the mean of the observed daily levels in that month. Hence, a time series of observed monthly mean levels consists of 72 values. Since some of these time series contain excessively many randomly missing values, to control the quality of data all time series are screened and those with more than onethird of its values missing are deleted. Thus, the number of gauged sites is reduced from 37 to 31. Here, we assume that the probability of a time series having more than one-third of its values missing is not related to those values. Therefore, such a strategy for dropping time series from our study does not cause bias. The locations of the remaining 31 sites, except two outlying sites, are plotted in Figure 3.1. The whole of the Province of Ontario divides into thirty-seven PHUs or districts (Duddek et al 1994). A PHU resembles a Census Division, the difference being marginal disagreements in boundaries. Some PHUs, for example, are aggregates of two Census Divisions. Figure 3.2 displays the locations of some approximate centroids of these PHU’s in Southern Ontario. Our ungauged sites consist of the 37 approximate centroids. Heilce, the total number of gauged sites, g, 5 is 31 and that of ungauged sites, .s, is 37. For the monthly interpolation, four air pollutants are included. They are NO , SO 2 , 4 03 and SO . Table 3.1 lists the observed pollutants at each of gauged sites, where a, b, 2 d and e represent NO , SO 2 , 03 and SO 4 2 respectively. From the table, we can see that at all gauged sites, there are 64 observed time series aild 60 missing time series. In the sequel, the term “time series” is replaced with “column”. Among the 64 observed columils, about two percent are missing. The missing data include the randomly missing data aild those censored from below. Here, censored below means that the actual air pollution level is below the detection limit of a monitor. 68 Table 3.1: Pollutants Measured at Each Gauged Site, Where a, b, d And e Represent , SO 2 NO 2 Respectively. , 03 And SO 4 Sites 1 Pollutants b Sites 13 ade Pollutants Sites Pollutants 25 de 2 b 3 4 b b 14 15 ade 26 ade 27 16 b 28 abde 29 ade ade de b 9 10 20 d 21 ade ade 5 abde 6 d 7 be 8 ade 17 18 b 19 ade 31 30 e d 11 ade 12 de 22 23 24 d e ade de In statistics, the procedure for filling in missing data is called imputation. There are many existing imputation methods (c.f. Sarndel 1992, Little and Rubin 1987). Examples of these methods are: overall mean imputation; class mean imputation; hot-deck and cold-deck imputation; random overall imputation; random imputation within classes; sequential hot-deck imputation; distance function matching; regression imputation; EM algorithm based imputation and multiple imputation. The imputation methods mentioned above, produce a single imputed value for each missing value except for the multiple imputation approach. Many authors used these methods. Afifi et al. (1961) suggested filling in the missing observations for each variable by that variable’s mean. Buck (1960) and Stein et al. (1991) discussed imputing the missing values by regression models. Komungoma (1992) adopted a modified regression strategy. Johnson et al. (1994) combined a stepwise linear regression and a time series model to fill in missing values. Miller (1994) employed a nearest-neighbor hot-deck imputation procedure to fill in missing data. While the multiple imputation is appealing, it has a disadvantage of requiring more work for data handling and computation of estimates. Moreover, the multiple imputation method was designed for survey sampling. It is not suitable for a spatial application. 69 We may apply Theorem 2.12 to impute our missing data. However, that method has not been well tested yet. Further more, Theorem 2.12 bases on the normal approximation to a matrix T. As showed by the “empirical coverage percentage” at the end of this section, in the current scenario the matrix T cannot be replaced by its normal approximation. For our monthly data, with such a small proportion of missing data, the choice of a procedure for filling in missing values is not critical. We filled in a value missing from jth column in the th month with the mean of the observed values of the same column in the month of other years. In case, all six measurements of the same month of the 1983 to 1988 are missing, the grand mean of the observed values of the column is used. However, with our data set, no such a case obtained. The above ad hoc filled-in method allows us to preserve the periodic property of the columns shown in Figure 3.3, where ozone is seen to have a strong periodic pattern. Again, because only a small percentage of the data are missing, our ad hoc fill-in method will cause negligible bias. Our theory of Bayesian multivariate interpolation with data missing-by-design is devel oped under two important assumptions. One assumption is that the detrended residuals should spatially follow a multivariate normal distribution. The other is that each resid ual time series is a white noise process; in other words, the residuals are temporally independent. Checking multivariate normality is not easy since we lose hundreds of degrees of freedom to parameter estimation. In this chapter, only univariate marginal normality is checked. Temporal independence is checked with autocorrelation and par tial autocorrelation plots. The normal quantile-quantile plots of the detrended residuals of the original observed data do not show straight lines, suggesting the observed data be transformed. With a logarithmic transformation of the observed data, the residuals appear to be marginally 70 normal. A typical example of a normal q-q plot of the original and log-transformed data is shown in Figure 3.4. Autocorrelation and partial autocorrelation plots of the detrended residuals of log-transformed data are shown in Figure 3.5. The correlation plots do not show evidence of temporal correlation. By repeating the above initial data analysis for observations at all gauged sites, we are led to conclude that the logtransformed data do meet the assumptions of our interpolation theory. In the sequel, unless specify otherwise, we always refer observations on their log-scale. For determining the linear and seasonal trends of our Gaussian model, time series plots are made. Based on the plots, Z is taken to be 1, t, cos(W), .sin(W) ,where t Here t = on, until t 1 represents the January of 1983, t = = = 1,. . . 72. 2 represents February of 1983 and so 72 which represents December of 1988. The coefficients of the linear and seasonal trends are estimated with ordinary least squares. Figure 3.3 displays the time series plots and their least squares fitted curves for the four observed pollutants at Site 5. The fitted curve of log(O3) is far better than that of the other three because of its periodicity. The strong yearly pattern of ozone is partially explained by the fact that the creation of ozone is highly related to solar radiation. For the environmental health study referred to earlier, the one which required interpola tion of monthly means, the air pollution level in summer is of special interest. Therefore the pollution data of winter and summer are interpolated separately. In the following, only the interpolation for summer data is described. The procedure for winter data is similar. Here, the summer of a year is defined to be from May 1 to August 31 and “win ter” the remainder of the year. Thus there are 24 values (24 months) in each summer column. The purpose of the analysis is to interpolate monthly NO , SO 2 , 03 and SO 4 2 levels in the summers of 1983 1988 at 37 ungauged sites. 71 Table 3.2: The Estimated Between-pollutants-hypercovariance Matrix of the Logtransformed, Summer Monthly Data. N02 S04 03 S02 N02 S04 1.00000000 -0.28541295 0.03476434 0.13645127 -0.2854130 1.0000000 0.7940206 -0.3379370 03 0.03476434 0.79402064 1.00000000 -0.14573424 S02 0.1364513 -0.3379370 -0.1457342 1.0000000 The interpolation procedure follows the following order: first, the unbiased estimators of F—’ and Br,) are computed; second, the EM algorithm for the estimation of and Ag, Q is invoked; third, the SG method is applied to extend Ag to A; then, with the exchange able assumption on B°, Br,) is extended to B°; finally, with all the hyperparameters estimated, the interpolated values are computed by the Bayesian interpolator. We use S macros to carry out the interpolation and call a C program to perform the EM algorithm. An example of the S macros and C program are attached in the Appendices. By running S macros and the C program, we get the estimated prior degrees of freedom to be 610; the estimated between-pollutants-hypercorrelation matrix, which is listed in Table 3.2 and the estimated hyper-covariances of NO , SO 2 , 03 and SO 4 2 are 0.6582724, 1.6263357, 0.2166192, 1.8503741, respectively. The biggest positive correlation among the four pollutants occurs between 03 and SO . Since both 03 and SO 4 4 are regional air pollutants and both are related to sunlight, a higher correlation between 03 and SO 4 is expected. Meanwhile 802 is a primary pollutant, it has a lower correlation with the other pollutants. The result of the SG step is summarized in Figures 3.6 ‘-- 3.8. The right hand plot of Figure 3.6 is a twisted 30 by 30 checkerboard in the D-plane. The original 30 by 30 checkerboard is in the 0-plane. The coordinates of its lower left corner are the 72 minimum latitude and longitude of gauged sites. The coordinates of its upper right corner are the maximum latitude and longitude of gauged sites. The left hand plot of Figure 3.6 shows an exponential fit between dispersions and the D-plane distances (refer to Chapter 1 for a brief summary of the SG method). The smoothness of the twisted checkerboard is controlled by parameter ). A smoother checkerboard in D-plane is achieved by sacrificing the fit between the dispersions and D-plane distances. Figure 3.7 shows a smoother checkerboard in the D-plane but a rougher fit between the dispersions and D-plane distances when the smoothing parameter value is increased from 0 to 2500. A linear pattern in Figure 3.8 shows that the estimated covariance and the observed covariance are conformable. By applying Corollary 2.6 and using the above estimated hyperparameter values, after SG step it is straightforward to compute the interpolated, summer monthly air pollution levels at ungauged sites over six years. As a way of checking the interpolated values, the overall average ozone levels the in summers of 1983 to 1988 at gauged sites are plotted in Figure 3.9. Those of interpolated ozone levels at ungauged sites are plotted in Figure 3.10. The two plots affirm our interpolation procedure. When a high mean 03 level is observed at a gauged site, our interpolator gives a high 03 values at nearby ungauged sites. Corresponding results obtain for lower observed 03 levels. Another way of checking the interpolation procedure is to look at the correlation be tween the observed and estimated data by cross-validation (CV hereafter). CV is a procedure which deletes observed datum one at a time and estimates these datum from the remaining data as if that datum were never observed. It is a popular diagnostic tool. In our CV study, we deleted one gauged site at a time and interpolated pollutant levels at that same site using observed levels at other sites. To avoid spuriously high computed correlations between the estimated and observed levels of pollutants, we first 73 removed the trends from both the estimated and observed columns, and then calculated the correlations among the residuals. The correlations between the detrended, estimated and observed levels of each pollutant aggregating across sites and over time are: Summer Winter N02 0.243 0.242 S04 0.494 0.438 03 0.534 0.429 502 0.238 0.200. The correlations between the estimated and observed levels at each gauged site and for each observed pollutant are given in Table 3.3. In Table 3.3, the correlations of SO 4 and 03 in both summer and winter are generally higher than those of 502 with other pollutants. In other words, the predictions of SO 4 and 03 are more accurate than those of 302. Figure 3.11 displays the plot residuals of log-transformed, monthly observed and estimated pollutant levels in both summer and winter. Figure 3.12 shows the scatteplots of log-transformed observed pollutants against estimated pollutant levels for each pollutant in summer and willter, respectively. In the plots a linear pattern means accurate interpolation. The plots confirm conclusions suggested by the tables; these results are consistent with the fact that 03 and SO 4 are regional pollutants. It is easy to predict them with the observed data from other sites. In contrast SO 2 is a local pollutant and so more difficult to predict. Can a simpler to use, normal distribution be substituted for the multivariate T pre dictive distribution? That might naively seem possible since the univariate normal approximates its longer tailed relative very well. 74 However, our results suggest this Table 3.3: Correlations Between Residuals of Log-Transformed, Observed and Estimated Pollution Levels at Gauged Sites. Sites NO2 1 2 3 4 5 6 7 8 9 10 11 12 0.39 summer S04 03 0.96 winter S04 03 0.81 0.96 0.87 0.85 0.74 0.93 0.82 0.81 0.75 0.92 0.90 0.42 502 0.57 0.81 0.76 0.57 0.61 0.54 0.11 0.13 0.66 0.93 16 0.90 0.91 0.53 0.69 0.40 0.14 0.74 0.58 0.52 0.75 0.32 0.59 0.57 0.30 0.04 0.69 0.79 0.33 0.24 0.55 0.18 0.59 0.72 0.38 0.40 0.40 0.87 17 18 0.66 19 0.63 0.36 0.77 0.72 0.80 0.61 0.67 0.80 0.63 0.80 0.78 0.25 0.85 0.68 0.75 0.34 0.57 0.47 0.63 0.49 0.86 0.78 0.68 0.52 0.56 0.65 0.81 0.24 0.44 0.47 0.44 0.48 0.74 25 26 0.71 0.80 0.82 27 0.55 0.77 0.66 0.87 0.83 30 31 SO2 0.78 0.41 0.66 0.65 0.11 28 29 0.56 0.97 0.87 14 15 0.70 0.74 0.71 0.85 0.75 0.88 23 24 0.67 0.87 0.44 21 22 0.09 0.67 13 20 N02 0.97 0.25 0.59 0.40 0.56 0.26 0.23 0.50 0.49 0.49 0.82 0.78 0.62 75 0.10 0.23 0.52 substitution cannot be recommended without additional study. Our initial impression comes from an evaluation we did of the empirical coverage percentage of three-standarddeviation confidence intervals (CI). If the predictive distribution were normal, all the three-standard-deviation CIs would include the true values about 100 percent of the time. As the percentages by pollutants presented below indicate, this high coverage probability is not achieved here. The heavier tailed predictive matrix T distribution seems to be required. By pollutants the percentages are, Summer Winter N02 1000!. 94.27. S04 1007. 99.27. 03 98.67. 98.87. S02 94.57. 1007. The unbiasedness of residuals is also checked. In the top plot of Figure 3.13 are the boxplots of the prediction errors for four pollutants, these being defined as the difference between the predicted and observed values. Except for SO , the mean prediction errors 2 for the other three pollutants are almost zero. In other words, the predictor is unbiased. In the same figure, the other two plots indicate that observed values and the boxplots of the predicted values have similar patterns except that the predicted values have bigger variances. 3.2 Application to Daily Pollution Data For one of the environmental health studies mentioned above we needed to interpolate daily air pollution levels down to the centroids of CSD’s. In Southern Ontario, 733 such 76 centroids are chosen and all five pollutants NO , NO 2 , 03, SO 3 2 and SO , are included. 4 The general interpolation procedure and many intermediate results are similar to those for the monthly data. Thus, in the following, only the results which differ from those in Section 1 are discussed. Again, only the interpolation of summer air pollution levels at the CSD centroids is discussed. The observed data come from daily measurements at 37 sites. By removing the time series (columns) where there are excessively many missing data, the number of sites is reduced from 37 to 27 and the number of observed columns to 55. Then the number of missing columns is 80 27 (= * 5 — 55). Figure 3.14 displays the locations of the 27 gauged sites, except for two outlying sites. As in Section 3.1, an ad hoc method is used to fill-in the missing data. However, in this section, a different ad hoc method is used. Since in the daily observed columns, eleven percent are missing, a more delicate approach is needed. The new ad hoc method replys on multivariate normal theory. With the new method, missing data are filled in by xiii = where X is the missing value for the (j = . . + jth ij, pollutant (i = 1,. . . , 5) at j gauged site ,27) at time t; B is estimated by the ordinary least squares method using observed values for the jth pollutant at the ih gauged site and jj is the estimated residual. The estimated residual is computed by using well known normal theory. That is, E(X Y = y) = where X and Y are jointly normal, E(X) + xY’y(y — is the covariance matrix between random variables U and V and E(U) is the mean of U. By letting Y represent the set of all observed data at time t, replacing X with X, and taking both E(X) and E(Y) to be 77 zero, we apply the above formula to calculate . However, the formula is not directly applicable, since the joint covariance matrix of X is unknown. A method of moments is used to estimate the covariance matrix in a pairwise manner. In other words, the covariance of Ximt and Xhkt is estimated by Zt(Ximt — Xlm)(Xhkt — Xhk), where f is the total number of observed pairs of Ximt and Xhkt. The estimated covariance matrix has to be checked for positive definiteness. definite, it cannot be used to obtain If the estimated matrix is not positive If so, other imputation methods mentioned in the previous section would need to be investigated. However, in this application, the estimated matrix is indeed positive definite. Compared with the method used in Section 1, the advantage of new method is that it brings less autocorrelation into each time series when the missing values are fill-in. We assume such a filled-in procedure will not cause serious bias, because of the small percentage of missing data. Figure 3.15 shows that a logarithmic transformation of daily data is also necessary. Since the time series plots of the daily summer data do not have obviously periodic patterns, the linear trend of the Model (2.12) is instead taken to be the grand mean effect; time t effect; weekday effect (from Monday to Thursday); monthly effect (May, June, July); yearly effect (from 1983 to 1987) and the mean daily temperature. The mean daily temperature is the mean of observed daily mean temperatures in Southern Ontario. Such an arrangement is due to the fact that Model (2.12) only allows covariates with spatially equal measurements. Other variables, for instances, the daily mean humidity, mean pressure etc. are investigated too. They are not included in the final model because their ordinary-least-squares fit coefficients are not significantly different from zero. We checked the autocorrelation and partial autocorrelation plots of daily, detrended summer observations. Figure 3.16 shows an example of such a plot. We find that 78 Table 3.4: The Estimated Between-pollutants-hypercovariance Matrix of the Logtransformed Daily Summer Pollution Levels in Southern Ontario. N02 S04 N03 03 S02 N02 $04 N03 03 S02 1.0000000 0.1224039 0.2435792 0.1131423 0.2085270 0.1224039 1.0000000 0.4118659 0.2668008 0.1240847 0.24357919 0.41186592 1.00000000 0.24745489 0.11314233 0.26680085 0.24745489 1.00000000 0.09569043 0.20852698 0.12408473 0.05080721 0.09569043 0.05080721 1.00000000 they are lag-i correlated. For removing the lag-i correlation of the residuals, an AR(1) transformation of the observed values is applied, that is, t at = 2,. . . , 738, where gauged site. qjj = — for is estimated based on the observed values of i’ pollutant Such a transformation removes the lag-i correlation among the residuals. For comparison, the autocorrelation and partial autocorrelation plots of the AR(1)-transformed residuals of 03 at Gauged Site 6 are shown in Figure 3.17. When the normal and independence assumptions are satisfied, the EM algorithm can be applied to the residuals to estimate the hyperparameters. The between-pollutants hypercorrelation matrix is given in Table 3.4 and the hypervariances of the N02, S04, N03, 03, S02 are 0.8093218, 1.8325046, 1.3923836, 0.4061250, 1.9200310, respectively. The estimated number of degrees of freedom is 4365 for nearest integer. A big number reflects a lot of prior information about the covariance matrix E. That large number stems from the fact that there are only 27 gauged sites with 55 observed columns while 733 ungauged sites with 3665 columns missing-by-design need to be interpolated. The SG and interpolation steps are similar to those of the monthly data application above. One remaining problem needs to be solved. Since the interpolation is based on the AR(1)-transformed data, the interpolated values are in the form of AR(1)transformed residuals. To obtain the true interpolated residuals, the following fact 79 is must be used. Lemma 3.1 If Uk — cUk_1 = Vk and q < 1, then Uk k=2 fl_kVk provided n is big enough. Proof: Since -çU 2 U = 1 V Un_i U — — c/;Ufl_2 = V_ 1 1 = V, çbU multiply both sides of the first equation by q , the next equation by 2 , etc and then add all equations. It becomes, Uk — ‘Ui n_kVk = When n is big enough, k=2 is almost zero. I To apply the above fact, the ‘s at ungauged sites are needed. These values are not available. Let us assume the ‘s are the same for the same pollutant at all sites. Then in all, only 5 different coefficients are needed. The five coefficients can be estimated by ordinary linear regression subject to the above assumption. By checking the ob served data, we know that the assumption is valid. Now with the new assumption, the interpolation procedure is repeated. 80 Table 3.5: Correlations Between the Residual of Log-transformed, Summer Daily Ob served and Estimated pollutants at Gauged Sites. Sites S04 NO3 1 2 0.68 0.56 AR(1)_Summer S04 N03 03 0.63 0.52 0.79 0.65 0.76 0.63 3 4 0.56 0.72 0.55 0.64 0.48 0.50 0.70 0.62 0.45 0.63 0.48 N02 5 0.44 6 7 8 9 10 03 0.67 S02 0.12 NO2 0.44 0.60 0.69 0.65 0.49 0.49 0.65 0.10 0.87 0.64 0.50 0.51 0.10 0.88 0.90 0.55 0.55 11 12 13 0.82 0.82 0.83 0.83 0.18 0.78 0.72 0.17 0.78 14 0.41 0.75 0.38 0.73 0.73 16 0.69 17 18 19 20 21 22 0.74 0.68 0.50 0.58 0.70 0.53 0.63 0.63 0.71 0.86 0.68 0.78 0.84 0.54 23 0.50 0.67 0.59 0.68 0.52 0.63 0.62 0.02 0.45 25 26 0.34 0.68 0.65 0.60 0.85 0.79 0.69 0.86 0.88 0.56 0.82 0.83 0.79 24 27 0.71 0.88 0.82 0.82 0.12 0.69 0.90 15 S02 0.03 0.78 0.44 0.04 0.58 0.64 0.61 0.32 0.52 0.15 81 0.05 0.54 0.14 As before, a CV study of the residuals for both the observed and predicted values is carried out. The pollutant-wise correlations between the residuals of observed and predicted summer daily pollution levels at gauged sites are: Summer AR(1) Sununer N02 0.49128973 0.48457402 S04 0.66564685 0.63188994 N03 0.58376676 0.56927091 03 0.78495848 0.78477716 S02 0.08807141 0.08978648. In the above table, the predicted values used for computing the correlations in Column one are obtained using the original observed values and the predicted values, and for those in Column two obtained using the AR(l)-transformed data. From the same table, it can be seen that the correlations of the same pollutant in both columns are close. We interpret this result positively, as saying that our interpolation procedure is robust to the assumption of temporal independence. In other words, in terms of the above correlations, by taking an AR(l) transformation, the prediction of observed values has not been improved. From the same table, the correlation for SO 2 is much lower than those of the other pollutants. This is explained as follows. First, from Table 3.4, we see that the hypercorrelations of SO 2 with other pollutants are very low. This fact indicates that by including other pollutants, the predictions on SO 2 levels are not improved much. Second, from Table 3.5 it is seen that 502 was only observed at only 5 gauged sites. So in the CV analysis, the prediction of SO 2 levels at one of its gauged sites are based only on the observations at the other four sites gauged for SO . For monthly data, 2 the problems associated with SO 2 do not exist. Therefore, the computed correlations between the predicted and observed 802 values are higher. 82 The correlations of the observed and estimated residuals at each gauged site for each pollutant are given in Table 3.5. From Table 3.5, it can be seen that it is easier to predict 03 and SO , because they are regional pollutants and more difficult to SO 4 , 2 because it is a local pollutant. 83 Chapter 4 Application to Environmental Monitoring Network Redesign Another application of the “Bayesian interpolation with missing-by-design data” theory developed in Chapter 2 is to the problem of redesigning an environmental monitoring network. The term “redesigning” means adding or deleting sites from a current existing monitoring network. Guttorp, Le, Sampson and Zidek (1992), Caselton, Kan and Zidek (1992), Wu and Zidek (1992) have discussed the above redesign problem. These authors derived their optimally redesigning strategy based on following reasoning. Maintaining and collecting data from an environmental monitoring network is quite expensive, there fore the network is set up and maintained by a nation wide institute. Thus the collected data may be used by different users for different purposes. This fact implies that an optimal redesign of an environmental monitoring network should be based on certain common and fundamental purposes of the users of monitoring networks. They choose “reducing uncertainty about some aspect of the world, regardless of how that need may be expressed in any particular situation” (Guttorp, Le, Sampson and Zidek 1992) as the redesign goal. They developed a general network redesign theory by combing the entropy optimal criteria and the Bayesian paradigm. 84 In Section 1, the general theory of network redesign in univariate case, which is dis cussed in Guttorp, Le, Sampson and Zidek (1992), is summarized. In Section 2, it is demonstrated that how our theory can be applied to the redesign problem. 4.1 Theory of Network Redesign with Entropy In this section, the theory of network redesign with the entropy is summarized in the univariate case, which means that there is one pollutant at each site. Although the theory deals only with an augmentation of a network, the reduction of a network is handled in a similar fashion and the conclusion is similar. Suppose that currently there are g gauged sites in a environmental monitoring net work and it is planned to add u 1 more sites to the network in future. These u 1 sites are chosen among u candidate sites. Call all the u candidate sites “ungauged sites”. Let Xf represent a future realization at gauged and ungauged sites. Decom pose Xinto((X) , (X)t), where X is a realization at gauged sites and X7 at un t gauged. By properly rearranging the coordinates, (Xy)t is further decomposed into ((Xrd)t, (xrem))t). is the response vector at ungauged sites that will be added into the network and xrm) is the vector at sites that are not chosen. We assume the same Gaussian linear model of (2.12) -.‘ (2.14). As to a measurement of uncertainty, one natural choice is the entropy, which is defined as H(X) = E [_log] where h(X) is a reference density, h(x) makes H(X) be invariant to scale transformation of X. From the definition, it is easy to see that H(X, Y) = H(Y) + H(X Y), provided 85 (X)h 1 h ( Y). h(X, Y) = 2 Let D be the set of all observed data at gauged sites in the past, 0 = (, B) be the set of parameters in our Guassian model. Then, given D, the total uncertainty of a future realization Xf and unknown parameter, 0, is H(Xf, 0 D). Since, H(Xf, 0 D) where G = = H(U G, 0, D) + H(0 I G, D) + H(G D), (X)t), U = m xr H(U G,0,D) = E[—log(f(U I G,0,D)/h (U)) 11 , t ((X) H(8 G,D) H(G I D) = = (4.1) and E[—log(f(0 2 G,D)/h ( 0)) D], E[—log(f(G D)/h (G)) 12 In the above, it is assumed that h(X,0) = I Dj. 2 ( 1 h ( X)h 0) and h (X) 1 = (U)h 1 h (G). 12 1 Note that adding or deleting any site to or from the current network will not change the total uncertainty H(Xf, 0 D). When in future, the response vector at gauged sites and added sites is observed and if the measurement errors are negligible, the uncertainty represented by H(G D) becomes known. By Equation (4.1), one can see that a fixed total present uncertainty is decomposed into two parts, one part will become absolutely known by taking observation in future and the other part is still unknown. Therefore minimizing the future uncertainty by taking additional sites is equivalent to minimizing the unknown future uncertainty, is in turn equivalent to maximizing the uncertainty of what will be known in future. So the problem is equivalent to adding sites to maximize H(GID). For finding expression, H(G D), the entropy of a multivariate t-distribution needs to be computed, since f(G D) consists of two multivariate t-distributions. Now assume, for a random vector Y, I E 86 (,6), 1 W; by (2.8), t(O,(6* —g + 1),6* —g + 1). I Note that H(Y, E , 6) = H(Y 6’) + H(E , 6) H(E Y To find the multivariate t-distribution’s entropy, H(Y H(Y I , 6’, ), H(> multivariate normal , I , 6*) and H(E , Y, 6* and E I Y, , , , , 6*) + H(Y , 6w). 6’), we only need compute 6*) respectively. Since Y E, , 6* is 6* are inverse Wishart. By using the result presented in Caselton, Kan and Zidek (1992), the entropies of the multivariate normal and the inverse Wishart. Thus, after a straightforward calculation, the entropy of a multivariate t-distribution is, H(Y ,6*) *), 6 +c(g, log = (4.2) where c(g, 6*) is a function of g and 6* only. Since by Theorem 2.1, the distributions of D and Xd X, D are multivariate t-distributions and H(G I D) = H(X I D) + H(X I X,D). By applying (4.2), it is easy to see that H(G D) Where ddg = log addIg , 22, c, d, 1, g). 22 +c( is the residual covariance matrix of Xa conditional on defined in Lemma 2.1. Because only the term log 22 S is related to a choice of the newly added sites, maximizing H(G I D) is the same as maximizing 87 and I. 4.2 Network Redesign with Data Missing-by-design The result presented in the previous section can be easily extended to the multivariate case by assuming k pollutants at each site. The value of (Faddlg comes from (F that is estimated with the method described in BLZ. Similarly the result can be extended to the case when there is missing-by-design data. To justify this extension, one needs to notice that, given hyperparameters, () B = (R*) R t * R*B I R*Xt T47_l((R*) ( t FR*, 6* — 1) N(R*BO, t (R*) E R* ® F-’), where = (1skxsk \ 0 0 and R 2 is defined the same as that in Section 2.4.1 of Chapter 2. Then applying the above model and a similar argument as in the previous section, an optimal criteria for redesign of a current network is maximizing log addI2 1 , where ’addl2 T ‘ is the conditional hypercovariance matrix of the future realization at added pollutants and sites given the observed pollutants at gauged sites. The matrix ‘I’addl2 can be obtained from matrix (F that is estimated in Chapter 2. In an application, the added sites need not monitor all air pollutants. This relaxation may be useful when the optimal network redesign is required for multiple networks that were set to monitor different pollutants. A hypothetical example is that there are three networks, labeled 1, 2, 3. Network 1 measures ozone only, network 2 nitrate only and network 3 both pollutants. The above discussed optimal redesign enables us, say, to optimally add one site to network 1 and another site to network 2, in terms of maximally reducing uncertainty of a future realization in these three networks. 88 As an example of implementing the above discussion, the monthly air pollution data in Chapter 3 is used. Thus, there are 31 gauged sites and their latitudes range from 42.246 to 49.800 and longitudes range from 74.736 to 94.400. Suppose in future two sites will be added to these monitoring networks in Southern Ontario. One added site will monitor ozone only and the other nitrate only. Further suppose the possible site locations for the two would-be added sites are constrained to the grids of 10 by 10 checkerboard with the latitudes and longitudes of the four corners being (42.246, 74.736), (42.246, 94.400), (49.800, 74.736) and (49.800, 94.400). By taking the estimated 1 from Section 3.1 and applying the above optimal redesign criteria, the following answer is reached. We should add the measuring-ozone only site at latitude 43.92467 and longitude 85.66044 and add the site measuring-nitrate-only site at latitude 44.76400 and longitude 87.84533. However, further simulation study on the sensitivity of need be done. 89 to the locations of added sites Chapter 5 Cross-validation Study In this chapter, we describe three CV studies designed to judge performance of the newly developed theory of interpolation with data missing-by-design. The first study is to justify the necessity of developing a new theory, in situation where the LZ method could be applied to solve the same problem. In the second study, an artificial example is made for studying the trend of adjusted mean squared predicted error(AMSPE), where, starting from a complete data set, columns are deleted one-by-one and the AMSPE computed. The third is a comparison of our theory against Hass’s CoKriging method, since both methods can handle data missing-by-design. 5.1 Simultaneous Interpolation Versus Univariate Interpolation By interpolating one pollutant at a time, one can apply LZ theory to the Southern Ontario pollution interpolation problem. Why then is a new theory needed when an old theory is available? The answer lies in the difference of two methods. With the LZ method, only partial data is used for the interpolatioll, while the ew method includes all available data in the procedure. Take the monthly pollution data in Southern Olltario, for example. When ozone levels are interpolated at ungauged sites by the LZ theory, only 90 the observed 03 levels at gauged sites are included in the analysis. The new method uses all the observed values of NO , SO 2 , 03 and SO 4 . For distinguishing between 2 these two methods, that of LZ will be called univariate interpolation and that of the new method, simultaneous interpolation. One way of showing the superiority of the simultaneous interpolation over the univariate interpolation is to prove that its interpolator leads to a smaller mean square error. That fact is shown below. Theorem 5.1 Let X , 1’ be any two random vectors and X a random variable. Then 0 E(X — E(X Xo, 2 )) <E(X 0 Y — E(X I (5.1) Proof: Observe that E(X — E(X = 2 E(X = ) 2 E(X = ) 2 E(X = ) 2 E(X — 2XE(X X (X X 2 )) 0 )+E 0 — — - 2E(XE(X 0 (X 2 X ) ) + E(E I Xe)) 2E(E(XE(X I X) X E(E ( X I 0 )) + 2 0 X ) ) E(E ( 2 X Similarly, E(X - E(X 2 ,Y X ) 0 ) = ) 2 E(X - E(E ( 2 X Further, by Jensen’s inequality, for any random variable Z, ) 2 E(Z > . 2 (E(Z)) Apply it, E(E ( 2 X 0 ,Y X ) ) = E(E(E ( 2 X 0 ,Y X ) E([E(E(X 0 )] 0 X ) ,Y 2 X ) = E(E ( 2 X 91 I So equivalently, E(X — E(X I 0 2 , X ) Y ) E(X — E(X 2 ) 0 K . ) Returning to the ozone example, we take X 0 to be the observed levels of 03 at gauged sites, Y 0 the observed levels of the other pollutants and X, the unobserved pollution level at an ungauged site. Then the univariate Bayesian interpolator is E(X X ) and 0 the simultaneous interpolator E(X X , Ye). When the model is correctly specified and 0 the hyperparameters are known, Theorem 5.1 implies that the simultaneous interpolator does no worse than the univariate interpolator. The following CV study supports the above claim empirically. Again, monthly air pollu tion data in Southern Ontario is used. At each gauged site, the observed pollutants are deleted as if they were not observed. Then both univariate and simultaneous Bayesian interpolators are applied to obtain the predicted values of the “deleted” values based on the data at the other gauged sites. When the predicted values by both methods are computed for all 31 gauged sites, the mean squared predicted error (MSPE) is calcu lated for the univariate interpolator and the simultaneous interpolator respectively. The results for the monthly summer data and monthly winter data are listed below. Simultaneous summer winter Univariate summer winter N02 0.18543849 0.14342632 0.2829322 0.1292677 S04 0.13848447 0.21311221 1.274841 0.7310782 03 0.04369236 0.05382523 0.12536 0.2367643 S02 0.62173407 0.28098323 0.758635 0.4344104 The values confirm our theoretical result, except the case of NO 2 in winter, where the 92 MSPE of the univariate interpolator is smaller than that of the simultaneous interpola tor. One interesting point is worth mentioning here. The above numbers show that the relative reduction in the MSPE’s achieved by simultaneous interpolation over univariate interpolation, is much higher for 804 and 03 than for 802. For 504 and O3 the relative reduction is from 300% to 900%. For 302, the reduction is under 50%. This is because 504 , 0 are regional pollutants and 802 is a local pollutant. Intuitively, a regional air pollutant intuitively has a higher correlation with the other pollutants than a local pollutant, as indicated by the estimated between-pollutants-hypercorrelation in Section 3.1. By including the other correlated pollutants in the analysis, we would expect to enhance the interpolation. For a local pollutant, since it has little or no correlation with other pollutants, the inclusion of additional pollutants in the analysis will not improve the interpolator as much. Therefore, we can conclude that the interpolator with data missing-by-design does better than that of LZ on regional pollutants. It does not do much better than LZ on local pollutants. The conclusion has theoretical support: if X 0 are X are independent equality in Equation (5.1) obtains. 5.2 Trends in Adjusted Mean Squared Prediction Error (AMSPE) We now check the performance of the Bayesian interpolator (2.6) in terms of the AMSPE. Here the adjusted prediction error is defined to be (Xpred — Xobs)fstd(Xpred), that is, the difference between the predicted value and the observed value divided by the standard deviation of the predicted value. AMSPE is the mean of all squared adjusted predicted errors. We prefer AMSPE over the mean squared error (MSE) because different pollu tants have different units of measurement, hence different variability. Therefore, mean prediction errors of different pollutants must be normalized to make them comparable. 93 The data set for our study comes from the monthly Southern Ontario pollutant data used earlier. Among 31 gauged sites 13 sites where NO , 03 and SO 2 2 are all observed, are chosen. So the original data set consists of thirty-nine columns (three observed pollutants at each of thirteen sites). In our study, a randomly chosen column at a randomly chosen site among the 13 sites is deleted. A CV study is performed on the remaining data set that has data missing-by-design. An AMSPE is computed. Next, in a similar way, a randomly chosen column from the remaining data is deleted and the CV process is repeated on the new data set, which has two columns less than the complete data set. So a new AMSPE is computed. By repeating this and plotting AMSPEs, a trend in AMSPE is obtained. One example of AMSPE trends is shown in Figure 5.1. There are twenty-four AMSPE’s in the plot. The first AMSPE is computed after the column of observed NO 2 at Site 2 is deleted from the original data set, the second AMSPE while the previously chosen column and the column of observed SO 2 at Site 3 are deleted. By repeating the same procedure, the 24 AMSPE is computed after 24 randomly chosen columns are deleted. Below we show the order of the deleted columns. Order: 1 2 Pollutant: Site: 2 Order: 13 3 4 5 6 N02 S02 N02 03 7 03 8 9 502 03 10 03 11 12 N02 S02 N02 S02 3 4 1 7 4 10 6 12 7 10 12 14 15 16 17 18 19 20 21 22 23 24 Pollutant: Site: 3 N02 03 2 11 03 9 S02 S02 S02 N02 S02 N02 S02 03 6 1 5 5 94 9 11 8 8 S02 The AMSPE trend plot in Figure 5.1 shows a generally increasing pattern with some bumps along the way. The incremental changes in AMSPE are not very dramatic. This perhaps indicates good performance of the simultaneous interpolator. However, the result must be interpreted cautiously, since the AMSPE is not a robust index. When the standard deviation of a predicted value is relatively small, it could explode the corresponding adjusted prediction error so much that an AMSPE value will be dominated by a particular term. If a common term plays a dominant role among all the AMSPE values, the trend will be a flat line. Such a plot does not imply superiority of the simultaneous interpolator. 5.3 Comparison with CoKriging Both Bayesian multivariate interpolation with data missing-by-design (referred to below as the new method) and Hass’s CoKriging method can handle data missing-by-design. Here are some direct contrasts of the two methods. The new method gives a predictive distribution, therefore a simultaneous interpolation region, while Hass’s method only gives the interpolated value and its standard deviation. The new method easily handles any number of variables (pollutants), ignoring for the limitation of computer capacity, while Hass’s method only allows two variables in order to retain mathematical tractabil ity. The new method includes all available data in the interpolation process, while Hass’s method only uses only partial data. However, when sk is big, Hass’s method has a sub stantial computing advantage over the new method, since the convergence of the EM algorithm used in the new method is slow. A detailed comparison follows. Generally speaking, our simulation study shows that Hass’s method enjoys computational advantages over the new method when there are 95 many sites. When there are not enough sites, Hass’s method encounters difficulty. Since in each moving window, Hass’s method requires a minimal number of sites in the window for the purpose of variogram estimation. When gauged sites are widely spread out or there are not enough sites, that requirement cannot be met. This affects the accuracy of the estimated variogram. On the other hand, since the new method involves a lot of matrix operations, when there are too many sites, the dimensions of the matrix can be so big that computation becomes excessively slow. In the case of a small number of gauged sites, the new method works fine. Another difference between Hass’s method and the new method is in the models they use. While Hass’s method models only a “snapshot” of a spatial trend, the new method incorporates the temporal trend for each pollutant at each site. Therefore, the new method can handle spatial-temporal data, but may have problems with spatial data alone. In the case of spatial data alone, it is hard to model a temporal trend with one datum for each observed pollutant at each gauged site. Because of the complexity of the cross-variogram formulas, Hass’s method cannot take more than two variables (e.g., two air pollutants) in practice and it is hard to expand beyond that. The new method can handle any number of variables (pollutants). As shown in Chapter 2, a simultaneous prediction region is derivable when a predictive distribution is available. Hass’s method only gives a pairwise confidence interval for the interpolated value. A fundamental distinction between the two methods lies in their Bayesian against non Bayesian distinction. We do not take up that issue here. For an empirical comparison of the two methods, we use a real data set. The data come from selected sites in the National Acidic Deposition (NADP) Network and the 96 National Trends Network (NTN) in United States. Refer to Wu and Zidek (1992), Watson and Olsen (1984) for more details. Forty-five gauged sites, the latitudes and longitudes of which are in Table 5.1, are randomly chosen. Since both sulfate and nitrate are regional pollutants, they should be highly correlated. That guess is confirmed by an analysis of the data, where the estimated hypercorrelation between sulfate and nitrate is higher than 0.70. The high correlation enables both methods to fully demonstrate their strength and weakness, making their comparison more reliable. At the forty-five sites, both sulfate and nitrate are observed monthly from 1983 to 1986. In terms of Hass’s method, one pollutant is treated as a response variable and the other as a covariate. The covariate is included in the analysis to improve the interpolation precision. We arbitrarily chose nitrate as the response variable and sulfate as the covariate. To agree with the form of a CV study provided by Dr. Tim Hass, both observed levels of nitrate and sulfate at the first 35 sites of Table 5.1 are used for our study and only the monthly sulfate observations at the remaining 10 sites are used for the study. The number, 35, is chosen because according to Hass’s method, in each moving window there must be a minimum of 30 gauged sites. For each month of the four years, a CV study of Hass’s method is done and the MSPE is computed for the original and log-transformed data respectively. The same study is done for the new method. The MSPE, for both methods on the two cases are listed in Table 5.2. In terms of MSPE, with the original data, the new method beats Hass’s method in 32 out of 48 months and in terms of the grand mean of MSPE, the new method also wins by 0.4437143 against 1.25524. Figure 5.2 displays the monthly MSPE’s of both methods on the original data. In the figure, “Bayesian” means the new method and CoKriging means Hass’s method. With the log-transformed data, the ranking is reversed. The new method loses to Hass’s method in 39 out of 48 months and 0.3813128 against 0.2347863 of the mean MSPE. See Figure 5.3 for a graphical comparison of the MSPE’s at each 97 month on the log-transformed data. By checking the MSPE’s of the new method used on the log-transformed data at each site, we found that among 35 MSPE’s, the MSPE value at Site 35 (i.e, Site 075a) is much higher than those at other sites. At Site 35, the MSPE value is 7.48, while at other sites, all the MSPE values are below 0.493 (except that at Site 21). Observations at Site 35 are unnaturally compressed into a small interval near zero, making their logarithm extraordinarily large in magnitude. Thus in Table 5.2, we observe MSPE’s for transformed data larger than those for the original because of this outlying site. As well, the poor performance at Site 35 makes the monthly MSPE values of the new method applied to the log-transformed data systematically higher than those of Hass’s method. Figure 5.4 shows that the boxplot of log-transformed nitrate levels at Site 35 is well below other boxplots as noted above. Recall, in Chapter 2, we assumed that the hyperparameter B° has an exchangeable structure. From our simulation study, we know that if that assumption is violated, the proposed new method will not do well. For example, if B° were taken to be identical over sites for each pollutant while the actual data shows that it is not true, the performance of our interpolator would be poor. So one explanation of the poor performance of the new method on the log-transformed data is that the exchangeability assumption on B° is violated because of Site 35. That is, the actual hyperparameter B° at that site is not the same as that of the same pollutant at other sites. To check this conjecture, Sites 21, 35, 36, 44, 46, where either observed sulfate levels or nitrate levels are unusual, are removed from the data set and five new sites are added in (the codes are 076a, 077a, 078a, 160a, 161a). CV studies of both Hass’s method and the new method are then carried out on the new log-transformed data. In terms of MSPE, over 48 months the new method beats Hass’s method in 35 out of 48 months and in terms of the mean 98 MSPE, the new method gains an edge by 0.2087363 against 0.3676921. These results support our conjecture. However, the diagnosis of the exchangeability assumption is difficult. The following theory offers a way to make a rough check. Note, by Model (2.12)-(2.14) and Equation (2.16), ° 2 B ,t t ) 0 2 N((B ((ZZ ) ’ + F-’) 0 22). Equation (2.8) implies, 2 B B r’- T(’1’, (ZZt)_1 + F’, B, 6* By Lemma 2.4, the marginal distribution of /9r (i = 1,. — . . sk , .Sgk — 1 + h). — 1), the row of B , 2 has distribution T(d, t (ZZ ) ’ + F’, 3° 6* where 3° is the rth — sk + h + 1), row of B. To define d, let D= (f” f::) Cr, 2 = 2 where Cr is an orthogonal matrix that exchanges the r column of J! with its last column, D,, is (s k 9 — 1— 1) by (sgk — 1— 1) and d 22 is 1 by 1. Then d is d 22 — , D,,’ D, 2 D . 2 By applying Theorem 2.8, the following theorem is proved. Theorem 5.2 F = 6* —sk+ 1($ [(zzt)1+F1]( _) Then by the above theorem, the p-value for 1 — [1 — /335 PF(F 99 > o)t k÷1. 3 Fh,ss_ at Site 35 can be computed as 35 ) 8 b 0 F , ] (5.2) because /335 is the extreme value among the 35 r5 3 / by plugging in the estimated hyperparameters and j = 1,. . . , ,, can be computed 0 F 35. 3 into Equation (5.2). With the above p-value formula, unless PF(F > ) 03 is extremely small, it is very unlikely we F would reject the exchangeability assumption at Site 35. Therefore, the power of the above test is low. An application of the above test to our data shows that we failed to reject it. However, the boxplots of Figure 5.4 make us believe that the observed levels of nitrate at Site 35 are abnormal. Theoretically, Hass’s method is optimal only when the processes of pollutants are mul tivariate normal with a flat prior on the parameters. That explains why the MSPE values of Hass’s method drop a lot when it is applied to the log-transformed data. (See Table 5.2.) A log-transformation made the observed levels of the pollutants follow a multivariate normal more closely, although it is not necessarily the most appropriate transformation. The new method is also sensitive to the normality assumption. From Table 5.2, it can be seen that the MSPE values were reduced in 31 out of 48 months when the new method was applied to the log-transformed data. 100 Table 5.1: Latitudes and Longitudes of the Gauged Sites. Sites Lat. Long. 004a OlOa 36.1006 94.1733 40.3644 105.56 Olla 39.1011 105.092 012a 40.8064 104.754 017a 33.1778 84.4061 020a 40.0533 021a 41.7011 88.3719 87.9953 022a 37.71 89.2689 023a 37.4356 88.6719 024a 41.8414 88.8511 025a 41 .6325 87.0878 028a 35.6644 83.5903 029a 37.1989 108.491 030a 45.4897 69.6644 031a 45.5611 84.6783 032a 42.4103 85.3928 033a 44.2244 85.8186 034a 47.5311 93.4686 035a 44.2372 95.3006 036a 32.3344 88.745 037a 48.5103 113.996 038a 41.1531 96.4928 039a 43.9431 71.7033 040a 041a 42.7339 76.6597 42.2994 79.3964 046a 047a 43.5261 75.9472 42.1061 77.5356 049a 36.1278 77.175 051a 35.6967 052a 35.0239 80,6228 78.2792 053a 35.7286 78.6811 055a 40.3553 83.0661 056a 39.7928 81.5311 058a 40.78 81 .9253 059a 44.3869 123.623 061a 44.2231 122.242 063a 41.5978 78.7678 064a 40.6589 77.9361 065b 40.7883 77.9464 068a 36.0717 112.153 070a 29.3019 103.177 071a 96.92 073a 28.8453 37.335 074a 47.86 123.933 075a 39.0897 79.6622 101 80.5578 Table 5.2: MSPEs by Both Methods. month 1983.1 2 3 4 5 6 7 8 9 10 11 12 1984.1 2 3 4 5 6 7 8 9 10 11 12 1985.1 2 3 4 5 6 7 8 original_data LCoKriging new method I 3.28392 5.652717 0.950273 0.617414 0.232695 0.914407 0.668226 1.347384 0.300603 0.521691 0.238963 0.320864 3.209043 0.817165 0.469469 0.517674 0.143388 1.500481 0.987726 0.457944 0.210432 0.499223 0.808873 0.254086 1.876528 1.821478 0.398158 2.397513 0.570281 1.89153 0.509808 0.758525 0.52072 0.366484 0.321812 0.396189 0.509551 0.881811 0.786001 0.592036 0.543329 0.375799 0.25169 0.332461 0.402137 0.539348 0.18821 0.288131 0.334989 0.787056 0.775919 0.60891 0.354206 0.502362 0.234341 0.356278 0.238328 0.36958 0.244937 0.430129 0.307239 0.37827 0.51783 0.733556 102 log-transformed_data CoKriging new method 0.341134 0.549418 0.165156 0.115926 0.083971 0.183823 0.290934 0.263396 0.092434 0.166289 0.079822 0.095796 0.339717 0.234923 0.079713 0.28535 0.047676 0.23976 0.34463 0.208633 0.082369 0.163754 0.146061 0.089412 0.51466 0.181693 0.13868 0.326478 0.13937 0.405738 0.153515 0.215295 0.323611 0.357877 0.303732 0.27441 0.320196 0.407721 0.405173 0.183026 0.332301 0.226903 0.394561 0.389354 0.392407 0.453388 0.394719 0.781333 0.253829 0.400371 0.340348 0.606851 0.290468 0.427758 0.405709 0.501983 0.265733 0.427747 0.267305 0.326975 0.26641 0.461921 0.297702 0.545632 month original_data CoKriging new method log-transformed_data 1985.9 10 11 12 1986.1 2 3 4 5 6 7 8 9 10 11 12 0.302321 1.559379 0.24711 9.080794 6.937748 1.717417 1.376672 0.410942 0.240855 0.270805 0.377201 0.664169 0.492622 0.363747 0.199882 0.861362 0.310728 0.194151 0.281423 CoKriging 1_new method 0.200713 0.58109 0.314374 0.291196 0.188169 0.265927 1.434542 0.395802 0.685439 0.346247 0.29135 0.261732 0.217498 0.315148 0.116953 0.5226 0.069861 0.380813 0.09048 0.36987 0.08781 0.598339 0.226513 0.28267 0.14715 0.43719 0.118633 0.417164 0.109205 0.307664 0.205528 0.502112 Mean 1.25524 0.4437143 0.234786 0.470564 0.484305 0.165849 0.585204 0.553377 0.32262 0.194057 0.32283 0.568083 0.755552 0.637667 0.411857 0.570382 103 0.381312 Chapter 6 Concluding Remarks and Future Studies In this thesis, the theories of Bayesian multivariate interpolation with missing data are discussed. The proposed interpolation theories have two main characteristics, a hierarchical Bayesian method and a multivariate method. Some obvious advantages of a Bayesian method: with a Bayesian paradigm, the prior information is easily incorporated into an interpolation procedure, if such prior informa tion is indeed available; with a Bayesian approach, one can include the model uncertainty into the confidence interval of interpolated values, while traditional CoKriging fails to do so. Thus, the confidence intervals computed with traditional CoKriging methods are narrower they should be. With a hierarchical Bayesian approach, the specification of the spatial covariance structure can be pushed up to a higher level in the hierarchy. That is, the covariance structure can be specified at Stage 2 of the modeling. Under this setup, if there is any mistake in the specification of the covariance structure, future observations can modify the wrongly specified structure. Eventually if enough data are observed, the mistake will be corrected. Therefore, Bayesian modeling provides robustness. An other benefit of the Bayesian approach is that it provides predictive distribution, while 104 CoKriging only provides an interpolated value with its standard deviation. By knowing the predictive distribution, one can do more things, e.g., constructing a simultaneous confidence region, creating random numbers for a simulation study and so on. Another characteristic of the theories in this thesis is their multivariate nature. The general theory of BLZ also yields a multivariate interpolator. However when there are missing data, as when data are missing-by-design, with the BLZ theory, interpolation can only be done pollutant-by-pollutant. That reduces the method to a univariate method. The advantage of a multivariate approach is that it allows the interpolation to be carried out by including all the available information. Theorem 5.1 of Chapter 3 concludes that in terms of mean squared error, when hyperparameters are known, a Bayesian interpolator based on all available information performs at least as well as a Bayesian interpolator based on partial information. With that theorem, the extension of the general BLZ theory to the new theory is intuitively justified. As an empirical check, a CV study in Chapter 5 shows that the quantitative gains of the new method over the LZ method are significant. The theories developed in this thesis are by no means complete. There is much room for further study. For example, a general estimation method of the unknown hyperpa rameters is needed for the theory of interpolation with monotone missing data. Other future research topics are listed below. In Chapter 5, while the theory of interpolation with data missing-by-design is compared with Hass’s CoKriging theory, we pointed out that if the observed data is a spatial data set only, the theory in this thesis is not directly applicable. That is because the model of our Bayesian interpolation consists of a temporal trend, which can not be estimated by a single datum. Therefore, hierarchical Bayesian CoKriging is of interest. Here hierarchal 105 Bayesian CoKriging means putting a prior on the unknown coefficients of the trend and spatial covariance matrix. One possible approach is as follows. When there are many sites involved, the dimensions of the spatial covariance matrix can be very big. To reduce the number of random variables in the spatial covariance matrix, one may divide all related sites into m clusters, with n sites in each cluster. When the means are assumed to be zero, a model of the random process at the jth cluster is set to be, X=R1+L, where the random variable R reflects the correlations of Cluster i with other clusters, 1 is a 1 x n, vector of l’s and L reflects the local spatial correlation at cluster i. Let R represents a m x 1 random vector that takes care of the between-clusters spatial correlation. Conjugate priors of inverted Wishart distribution are assumed on Var(R) and Var(L), i = 1,. . . , m. Details of interpolation theory remain to be filled in. A Gaussian model is assumed for all interpolation theories discussed in this thesis. In some applications, the observed data may not be Gaussian and they cannot be transformed to be Gaussian. An interpolation theory with non-Gaussian data is needed. With a Gaussian model, the interpolator with data missing-by-design has an analytical form. With a general distribution model, an interpolator may not have closed form. In that case, some Bayesian computing tools, like Gibbs’ importance sampling may need to be brought in for a numerical answer. The prior distribution of the unknown spatial covariance matrix, , in this thesis, is taken to be an inverted Wishart, a conjugate distribution of the Gaussian model. Since many authors criticize the Wishart distribution (c.f. Press 1982, Brown, Le and Zidek 1993b), an extension of the inverted Wishart may be needed. On possible choice, which keeps the conjugate property of the priors, therefore, retains mathematical tractability, and at the same time, provides more parameters to offset one deficiency of the Wishart, 106 is proposed in Brown, Le and Zidek (1993b). The proposed prior is called a Generalized Inverted Wishart Distribution. However, the following is worth mentioning. Based on our application and simulation studies in Chapter 3 and 5, our method works well with the inverted Wishart prior. In Chapter 3, we see that the daily air pollution levels in Southern Ontario show sig nificant lag-i correlation. There, an AR(i) transformed is adopted to remove the cor relation. A better method is to extend the interpolation theory of LZ to include a lag-i correlated time series, or more generally, to remove the assumption of temporal independence. That extension will have immediate applications. From the same Southern Ontario study, we see that among the observed data, some are censored below. In Chapter 3, the censored data are simply treated as missing and filled with an ad hoc method. Therefore, a new theory that can handle the censored data will be needed for applications. A last remark goes to the optimal redesign of a current monitoring network. In Chapter 4, an example demonstrates how to add two sites to a current monitoring network. An interesting question may be raised. Instead of adding a new monitoring site, that measures ozone only, an alternative approach would put an ozone monitor at one of the currently gauged sites where ozone is not monitored. The question is which is optimal? While this can be done with the same entropy approach given in Chapter 4, there are some problems. Normally it is true that the cost of maintaining a separate monitoring site will cost more than putting a meter at a current operating site. The entropy approach does not take that into consideration. Therefore, to solve such a problem, we have to bring additional factors like cost into the objective function. 107 Bibliography Afifi, A. A. and Elashoff, R. M. (1961). “Missing Observations in Multivariate Statis tics.” J. Amer. Statist. Assoc. Vol. 61, 595-604. Ahmed, S. and de Marsily, G., (1987). “Comparison of Geostatistical Methods for Estimating Transmissivity Using Data on Transmissivity and Specific Capacity”. Water Resources Research, 23, 1717-1737. Anderson, T.W., (1984). “An Introduction to Multivariate Statistical Analysis”. New York: Wiley. Berndtsson, R., (1988). “Temporal Variability in Spatial Correlation of Daily Rainfall”. Water Resources Research, Vol. 24, 1511-1517. Billingsley, P., (1968). “Convergence of Probability Measures.” New York: Wiley. Brown, P.J., Le, N. D. and Zidek, J.V. (1994a). “Multivariate Spatial Interpolation and Exposure to Air Pollutants” . Canadian Journal of Statistics. To appear. Brown P.J., Le, N. D. and Zidek, J.V. (1994b). “Inference for A Covariance Matrix”. Aspects of Uncertainty: A Tribute to D.V. Lindley. Ed. A.F.M. Smith and P.R. Freeman. John Wiley & Sons. Buck, S. F. (1960). “A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer.” J. of the Royal Statistical Society, Ser. B, 22, 302-307. Burnett, R. T., Dales R. E., Rainenne M. D. and Krewski D. (1992). “The Relationship Between Hospital Admissions and Ambient Air Pollution in Ontario, Canada: A Preliminary Report”. Unpublished Report. Caselton, W.F., Kan, L. and Zidek, J. V. (1992) “Quality Data Networks that Minimize Entropy”. Statistics in the Environmental and Earth Sciences. Eds. P. Guttorp and Walden. Griffin, London. 108 Chen, C.F., (1979). “Bayesian Inference for a Normal Dispersion Matrix and its Ap plication to Stochastic Multiple Regression Analysis”. J. Roy. Statist. Soc., B, 41, 235-248. Cressie, N. and Hawkins, D.M.(1980). “Robust Estimation of the Variogram, I.” J. of the International Association for Mathematical Geology, 12, 115-125. Cressie, N. (1986). “Kriging Nonstationary Data”. JASA, 81, 625-634. Cressie, N. (1989). “Geostatistics”. The American Statistician, 43, 197-202. Cressie, N., (1991a). “Statistics for Spatial Data”. New York: Wiley. Cressie, N., (1991b). “Modeling Growth with Random Sets.” In Spatial Statistics and Imaging (Proceedings of the AMS-IMS-SIAM Joint Summer Research Con ference), A. Possolo, ed. Institute of Mathematical Statistics, Hayward, CA. Dawid, A.P., (1978). “Extendibility of Spherical Matrix Distribution”. J. Mult. Anal. 8, 559-66. Dawid, A.P., (1981). “Some Matrix—variate Distribution Theory”. Biometrika, 68, 265-74. Dempster, A.P., Laird, N.M. and Rubin, D.B., (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion)”. J. Roy. Statist. Soc B, 39, 1-38. Delhomme, J.P. (1978). “Kriging in the Hydrosciences” Advances in Water Resources, 1, 251-266. Dickey, J.M, (1967). “Matricvariate Generalizations of the Multivariate t Distribution and the Inverted Multivariate t—distribution”. Ann. Math. Statist. 38, 511-8. Dickey, J.M., Lindley, D.V. and Press, S.J. (1985). “Bayesian Estimation of the Disper sion Matrix of a Multivariate Normal Distribution A, 14:1019-1034. 109 “. Communication in Statistics Duddek, C., Le N. D., Sun W., White R., Wong H., Zidek J.V. (P1) (1994). “Assessing the Impact of Ambient Air Pollution on Hospital Admissions Using Interpolated Exposure Estimates in Both Space and Time: Final Report to Health Canada under DSS Contract h4078-3-C059/01-SS”. Unpublished Report. Dunnett, C. W. and Sobel, (1954) . “A Bivariate Generalization of Student’s t distribution with Tables for Certain Special Cases”. JASA, 50, 1096-1121 Eaton, M.L., (1983). “Multivariate Statistics: A Vector Space Approach”. Wiley, New York. Guttorp, P., Le N. D., Sampson P.D. and Zidek, J.V., (1993) “Using Entropy in the Redesign of an Environmental Monitoring Network”. Multivariate Environmental Statistics; edited by G.P. Patil and C.R. Rao. North Holland/Elsevier Science, NY. Harrison, P. J. and Stevens, C. F.(1976) “Bayesian Forecasting” (with discussion) J. of the Royal Statistical Society, Ser. B, .38, 205-247. Hass, T. C., (1993). “Cokriging Variables That Are First and Second Order Nonsta tionary “. Paper presented at the annual meeting of the Statistical Society of Canada, Wolfville, Nova Scotia, June, 1993. Hass, T. C., (1992). “Redesigning Continental-Scale Monitoring Networks”. Atmo spheric Environment Vol.26A, No.18, 3323-3333. Hass, T. C., (1990a). “Lognormal and Moving Window Methods of Estimating Acid Deposition”. J. Amer. Statist. Assoc. 85(412), 950-963. Hass, T. C., (1990b). “Kringing and Automated Variogram Modeling within a Moving Window”. Atmospheric Environment Vol.24A, No.7, 1759-1769. Halley, E. (1686). “An Historical Account of the Trade Winds, and Monsoons Observ able in the Seas between and near the Tropics: With an Attempt to Assign the Physical Cause of Said Winds”. Philosophical Transactions 183, 153-168. 110 Handcock, M. S. and Stein, M. L.(1993) “A Bayesian Analysis of Kriging” Technomet rics, Vol.35, No.4, 403-410. Handcock, M. S. and Wallis, J. R.(1994) “An Approach to Statistical Spatial-Temporal Modeling of Meteorological Fields” JASA, Vol.89, No.426, 368-390. Huijibregts, C. J. and Matheron, G. (1971) “Universal Kriging (An Optimal Method for Estimating and Contouring in Trend Surface Analysis).” In Proceedings of Ninth International Symposium on Techniques for Decision-Making and Metal lurgy, Social Volume, 12, 159-169. Johnson, T., Capel, J., McCoy, M. and Warnasch, J. (1994) “Estimation of Carbon Monoxide Exposures and Associated Carboxyhemoglobin Levels Experienced by Residents of Toronto, Ontario Using a Probabilistic Version of NEM.” Unpublished Report. Journel, A. G. (1980) “The Lognormal Approach to Predicting Local Distributions of Selective Mining Unit Grades.” Journal of the International Association for Mathematical Geology, Vol.12, 285-303. Kannan, D. (1979). “An Introduction to Stochastic Processes “. New York: North Holland. Kitanidis, P. K.(1986) “Parameter Uncertainty in Estimation of Spatial Functions: Bayesian Analysis.” Water Resources Research,22,499-507. Komungoma, S. (1992) “Assessment of the Quality for the NADP/NTN Data Based on Their Predictability.” M.Sc. Thesis, Department of Statistics, University of British Columbia, Vancouver, Canada. Laslett, G. M.(1994) “Kriging and Splines: An Empirical Comparison of Their Predic tive Performance in Some Applications.” JASA,Vol.89, No.426, 391-409. Le, N. D. and Zidek, J.V. (1992). “Interpolation with Uncertain Spatial Covariance: A Bayesian Alternative to Kriging”. J. Mult. Anal, 43, 351-74. 111 Leonard, T. and Hsu, S.J.H. (1992). “Bayesian Inference for a Covariance Matrix”. Annals of Statistics , 20, 1669-1696. Lindley, D. V. and Smith, A.F.M. (1972). “Bayes Estimates for the Linear Model”. J. Roy. Statist. Soc. B, 34, 1-32. Little, R. J. A. and Rubin, D.B. (1987). “Statistical Analysis with Missing Data”. John Wiley & Sons. Loader, C. and Switzer, P. (1992). “Spatial Covariance Estimation for Monitoring Data”. In Statistics in the Environmental and Earth Sciences, eds. A. T. Walden and P. Guttorp, London: Edward Arnold, pp.52-7O. Mardia, K. V., Kent, J.T. and Bibby, J.M. (1979). “Multivariate Analysis”. Academic Press, New York. Mardia, K.V. and Marshall, R.J. (1984). “Maximum Likelihood Estimation of Models for Residual Covariance in Spatial Regression”. Biometrika Matheron, C. (1962). , 71, 135-146. “Traite de Geostatistique, Tome I.”Memoires du Bureau de Recherches Geologiques et Minieres, No. 14. Editions Technip, Paris. Matheron, G. (1969). Mathematique , “Le Kriegeage Universel.” Cahiers du Centre de Morphologic No. 1. Fontainebleau, France. Matheron, 0. (1971). “The Theory of Regionalized Variables and Its Applications”. Cahiers du Centre de Morphologic Mathematique, No. 5. Fontainebleau, France. Miller, A. A. and Sager, T. W. (1994). “Site Redundancy in Urban Ozone Monitoring.” J. Air é1 Waste Manage. Assoc., 44, 1097-1102. Muirhead, R. J. (1982). “Aspects of Multivariate Statistical Theory”. John Wiley, New York. Nather, W. (1985). “Effective Observations of Random Fields.” Teubner-Texte zur Mathematik, Band 72. Teubner, Leigzig. 112 Neuman S. P. and Jacobson, E.A. (1984). “Analysis of Nonintrinsic Spatial Variability Be Residual Kriging with Application to Regional Groundwater Levels”. J. of the International Association for Mathematical Geology 16, 499-521. Olea, R.A. (1984). “Sampling Design Optimization for Spatial Functions “. Journal of Geophysical Research, 79, 695-702. Oehlert, G. W. (1993). “Regional Trends in Sulfate Wet Deposition”. JASA, 88, 390.3599. Omre, H. (1987). “Bayesian Kriging—Merging Observations and Qualified Guesses in Kriging “. Mathematical Geology, 19, 25-39. Omre, H. and Halvorsen, K.B. (1989a). Universal Kriging “. “the Bayesian Bridge between Simple and Mathematical Geology, 21, 767-786. Omre, H. and Halvorsen, KB. (1989b). “A Bayesian Approach to Kriging “. In Proceedings of the Third International Geostatistics Congress I ed. M. Armstrong, Dordrecht: Academic Publishers, pp. 49-68. Pilz, J., (1991). “Bayesian Estimation and Experimental Design in Linear Regression Models.” John Wiley, New York. . New York: Holt, Rinehart & Winston Press, S. J., (1982). “Applied Multivariate Analysis-Using Bayesian and Frequentist Methods of Inference”. Holt, Rinehart & Winston, New York. Rubin, D. B. (1976). “Comparing Regressions When Some Predictor Variables Are Missing”. Technometrics 18, 201-206. Sampson, P. and Guttorp, P., (1992). “Nonparametric Estimation of Nonstationary Spatial Covariance Structure”. J. Amer. Statist. Assoc. Vol.87 No. 417, 108-119. Sarndel, C. E. and Swensson, B. and Wretman, J. (1992). “Model Assisted Survey Sampling”. Springer-Verlag, New York, Inc. Stein, A. and Corsten, L.C.A., (1991). “Universal Kriging and Cokriging as a Regres 113 sion Procedure”. Biometrics 47, 575-587. Sten, M. L., Shen, X. and Styer, P.E. (1991). “Applications of a Simple Regression Model to Acid Rain Data.” Technical Report No. 276. Department of Statistics, University of Chicago, Chicago, IL., USA. Taylor, A. E. and Lay, D.C. (1980). “Introduction to Functional Analysis”. Wiley, New York. Tanner, M.A. and Wong, W.H., (1987). “The Calculation of Posterior Distribution by Data Augmentation”. JASA 82, 528-550. Wahba, 0., (1990a). “Comment on Cressie,” The American Statistician, 44, 255-256. Wahba, 0., (1990b). “Spline Models for Observational Data,” Philadelphia: Society for Industrial and Applied Mathematics. West, M. and Harrison, P.J., (1986). “Monitoring and Adaptation in Bayesian Fore casting Models”. JASA 81, 741-750. Woodbury, A.D. (1989). “Bayesian Updating Revisited,”. Mathematical Geology 21, 285-308. Wu, C.F.J., (1983). “On the Convergence Properties of the EM Algorithm,” Ann. Statist. 11, 95-103. Wu, S. and Zidek, J.V. (1992). “An Entropy Based Review of Selected NADP/NTN Network Sits for 1983-1986”. Atmospheric Environment. Yates, S. and Warrick, A.W. (1987). “Estimating Soil Water Content Using Cokriging.” Soil Science Society of America Journal,51, 23-30. Zellner, A. (1971). “An Introduction to Bayesian Inference in Econometrics.” New York: John Wiley. Watson, C.R. and Olsen A. R. (1984). “Acid Deposition System (ADS) for Statis 114 tical Reporting—System Design and User’s Code Manual”. U.S. Environmental Protection Agency. 115 Appendix A: In this Appendix, the S function, iwm(m,cn,p,flag=1,sl,M2,misindx, Tint), and its S help file is listed. The iwm S function is designed to perform spatial interpolation and estimation of Ag, , with an EM algorithm that is described at the end of Chapter 2. “iwm” calls a C program, the source file of which is listed in Appendix B. iwm S funtion help file: Returns a list of estimated hypercovariance matrices and prior degrees of freedom. USAGE: iwm(m, cn, p, flag = 1, si, M2, mi.sindx, Tint) ARGUMENTS: m: number of measured or non-measured air pollutants at a site cn: when flag=O, cn is the known degrees of freedom in the prior distribution, “inverse Wishart”, of the unknown spatial covariance matrix. When flag=i, cn is unknown and the input is an initial value. That value must be greater than m times the total number of sites, p+si. p: number of gauged sites. flag: a value indicates whether cn is known or not. Si: number of ungauged sites. M2: a response matrix. The rows are the observations at gauged sites over the time. The columns are “snapshot” air pollutant levels observed at gauged sites. misindx: an index vector of missing columns and observed columns. Tint: a p column matrix with each column being a mean of all columns at a gauged site. In iwm, Tint is used for getting an initial estimation of the between-sites hypercovariance matrix Ti VALUE: 116 Ti: the estimated between—gauged-sites-hypercovariance matrix. B 1: the estimated between-air-pollutants-hypercovariance matrix. cn: if flag=O, cn is the same as the input value. If fiag=1, cn is the estimated degrees of freedom. C OBJECT FILE: The C objective file “Phill.o” is supposed to be under your S working directory that normally is at position 1 of your S search list. Use S function “searchO” to check it. If that is not true, you may need to customize “iwm” function to your own version by replacing the following S commands in your copy of iwm function, filename_search() [1] filename_paste(c(filename,”/Phill.o”) ,collapse=”) dyn. load(filename) with dyn.load(”pathname/Phill .0”) where “pathname” is the complete pathname under which “Phill.o” locates. DETAILS: IWM, standing for “Interpolation With Data Missing-by-design”, implements the theory of interpolation with data missing-by-design. The theory is developed under a normal distribution model when there are data missing-by-design. The other hyperparameters B°, F—’ are estimated by unbiased estimators. The output Ti of iwm is used later as an input for the Sampson and Guttorp nonparametric approach for extending Ti to including all sites. The structure of iwm is that: first, iwm does some initial data manipulation and second, it calls “Phiil.o”. It is in the C program where the EM algorithm of the interpolation theory is implemented. As a warning, iwm adopts an EM algorithm, which is notoriously slow. Sometime when the dimensions of the observed data matrix are big, it could take hours, even days to finish. So first of all, be patient! Secondly, either submit it as a batch job or lower the precision value at line 70 of file “Phiii.c”. When you change the source code file, be 117 sure that you recompile the file with unix command “cc -c Phill.c” and copy the new object file to your S working directory. The current value is set to be 0.0005. You may set it to a different number. Also you can try various transformations of the data matrix to make it close to normal. By doing those, the convergence is hopefully faster. EXAMPLES: Consider an artificial example for an explanation of how to use “iwm”. Suppose there are four Sites 1, 2, 3 and 4. Site 1 is an ungauged site (sl=1), Sites 2, 3 and 4 are gauged sites (p=3). Three air pollutants: 504, 03, NO 3 are measured at all sites. Only SO 4 is measured at Site 2, 03 at Site 3, and, SO , 03 and NO 4 3 at Site 4. Now label 1 to 3 for SO , 03, NO 4 3 at Site 2; 4 to 6 for SO 3 at Site 3 and 7 to 9 for 504, 03, , O3 NO 4 3 at Site 4. Since the columns of air pollutants labeled with 2, 3, 4, 6 are missed, NO and the columns labeled with 1, 5, 7, 8, 9 are observed, misindx=(2,3,4,6,1,5,7,8,9). If further, the observed data matrix is: Site #2 Site #3 Site #4 4 SO 03 t=1 1.2 3.4 1.6 4.0 2.7 t=2 1.7 3.1 1.1 3.9 2.5 , 4 5O , 3 NO 0 then M2 = (1.2 \1.7 3.4 4.0 3.1 1.6 1.1 /1,2 I \ 1.7 3.4 3.1 2.667 2.5 3.9 2.7 2.5 and Tint = where the third column of Tint is the mean of the columns 4,5,6 of M2. Macros of IWM: function(m,cn,p,flagl,sl,M2,misindx,Tint) { f_dim(M2) [1] 118 1_p*m—dim(M2) [2] tmlO_c(1 :f) Z_c(rep(1,f),tmlO,cos(2*pi*tmlO/12),sin(2*pi*tmlO/12)) k_length(Z) If Z_matrix(Z,k,f ,byrowT) Bhat_t (M2)7.*%t (Z)%*°hsolve(Z°h*%t (Z)) MuO_apply(Bhat ,2 ,mean) 7*7M2-Bhat * Z M %*7 7 2 7 S_t (M2) 0 tmpi_as integer(p*m-1) F1_matrix(O,k,k) for (i in 1:tmpi) { a.rep(-1/tmpi ,tnipi) a[i] _1+a[i] a_matrix(a, tmpi,1) F1_F1+t (Bhat) 7.*7.a%*7.t (a) °h*’/.Bhat/c (t (a) *7,S7.*7.a) } F1_ (n+f-k-2)*F1/tmpi 1)7 * %matrix(MuO , 1 ,k) B02_matrix(1 ,tmpi , 0 tmp_Bhat-B02 Ssim_S+tmp%*Y.solve(F1) %*Y.t (tmp) B1_diag(rep(1 ,m)) Bhat2_t (TintYh*’ht t so1ve(Zh*Y 0 (Z)Y.*Y (Z)) 0 (Tint)7 int-Bhat2%*Y.Z%*7.Tin *Yt T1_t T nO filename_search() [1] filename_paste(c(filename,”IPhill.o),collapse””) dyn. load (filename) rvalues_ .CQ’Phill”, as.double(T1), as.double((B1)), as.integer(misindx), as.double((Ssim)),as.integer(flag), as.integer(n), as.integer(f), as.integer(p), as.integer(m), as.integer(k), as.integer(l), as.double(cn), as.integer(sl)) 119 list (Tlrvalues[[1]] ,Blrvalues[[2]] ,degree=rvalues[[12]]) } 120 Appendix B: Source file of C program Phill.o: #include <math.h> #include <stdio .h> void Phill(T1, Bi, misindxl, Ssiml ,flagl,nl,fl ,pl ,ml ,kl ,ll ,cnl, sli) mt *flagl, *nl,*fl,*pl,*ml,*kl,*ll, *misindxl, *sll; double *T1, *B1, *Ssiml, *cnl; { mt flag, n,f,p,m,k,l,sl; double **T, **B, **Ssim,cn; double **,**matrixQ; double ma,tO,tl,t2,detQ,snQ,t3,t4; mt i,j,*ivectorQ, *misindx, tmpi2; void maxlQ, 1Q,f_matQ, detlogO; /* flag = 0 degrees of freedom is known (ie. m is constant in the paper) = 1 estimate the df as well n = number of observations in the past f = number of observations in the future 1 = number of missing columns p = number of (gauged) stations s1 number of (ungauged) stations m = number of ions k = number of coefficients in the linear model (i.e. 4 in the paper) cn= initial estimate of m; 121 that is, if flag =0 then df =cn else if flagl then use the algorithm in the paper with cn as initial value SsimS”{”} misindx = the index matrix for the missing columns and the not missing cls. *1 /* T is Lambda, B is Omega *1 flagflag1; nflnl; f=*f 1; p*pl; m*ml; k=*kl; l=*l1; cn*cnl; s1*sll; Tmatrix(p,p); Bmatrix(m,m); misindxivector(m*p); matrix(m*p ,m*p); Ssimmatrix(p*m-l,p*m-l); /*******************/ for (j0;j<m*p;++j) misindx[j]=*(misindxl+j); for (i0; i<p*m—l;+-i-i) for (j0;j<p*m—l;++j) Ssim[i] [j]*(Ssiml+i*(p*m—1)+j); for (i0;i<p;++i) for (j0;j<p;++j) T[i] [j]*(T1+i*p+j); 122 for (i0;i<m;++i) for (j0;j<m;++j) B[i] [j]*(B1+i*m+j); printf(”The program will finish when ma < O.0005\n”); { while (1) 1* printf (“step 1\n”); */ 1(sigma, misindx, T, B, cn, p,f,l, m,n,k,Ssim,sl); 1* printf(”step 2\n”); */ { while (1) tldet(T,p); t2det(B,m); if if ( ( /* fabs(tl) < le—50) t11.O; fabs(t2) < le—50) t21.O; *1 printf(”step 3\n”); maxl(cn,T,B,,p,m,sl); 1* *1 printf (“step 4\n”); t3=det(T,p); t4det(B,m); if if ( ( fabs(t3) < le—50) t31O; fabs(t4) < le—50) t4=1O; if ((fabs(log(tl)—log(t3))<le-4)&& (fabs(log(t2)—log(t4))<le-4)) break; } macn; tmpi2p*m; /* printf (“step 5\n”); */ detlog(T,B,&t2,&t3,p,m,l,misindx, Ssim); /* printf (“step 6\n”); if (flag>O) 1* */ cn=sn(cn,n,p,m,sl,f,k,l,t2,t3); printf(”step 7\n”); */ 123 ma=fabs(ma-cn)/cn; printf(”ma is h1f\n”,ma); if (ma<O.0005) break; } for (i0;i<p;++i) for (j0;j<p;-i-+j) *(T1+i*p+j)T[i] [j] for (i0;i<m;++i) for (j0;j<m;++j) *(B1+i*m-I-j)=B[i] [j] *cnlcn; } /**********************************************/ void maxl(c,T,B,L,p,m,sl) double **T,**B,**L,c; mt p,m,sl; { mt i,j,k,l; double xO,xl,detQ; void invert_matrixQ; for (k0;k<p;-f+k) for (l=O;l<p;++l) for (i0,T[k] [l]0.O;i<m;++i) for (j0;j<m;++j) T[k] [l]+L[m*k+i] [m*l+j]*B[i] U]! (c— (double) (sl*m))/(double)m; invert_matrix(T,p); for (k=O;k<m;-i-+k) for (10;l<m;++l) for (i0,B[k][1]0.O;i<p;++i) 124 for (j0;j<p;-f+j) B Ek] [1] +=L [in*i+k] [m*j+l] *T [i] Ci] / (c—(double)(sl*m))/(double)p; invert_matrix(B,m); } /**********************************************/ double sn(cl,n,p,m,sl,f,k,1,t2,t3) double cl,t2,t3; mt n,p,m,sl,f,k,l; { double kl,k2,c2,funclQ,trigarnmaQ; mt i, tmpi; { while (1) tmpip*m-l; kl=funcl(cl,n,p,m,sl,f,k,l,t2,t3); c2c1; for (il,k20.0;i<p*m;++i) k2=k2+trigamma((cl+(double) (n+f-i—k—sl*m))/2 .0) —trigainma((cl—(double)(i+sl*m))/2.0); k2k2/2.0; c1c2-kl/k2; 1* printf(”kl,k2,cl,c2 is %1f 7 1f °hlf °hlf\n”,kl,k2,cl,c2); 0 if (fabs(kl)<1.e-08) break; if (cl<(double) (m*(p+sl))) { if (c2/2>(double)(m*(p+sl)+1)) c1c2/2; else ci (double) (m*(p+si)+i); } } 125 *1 return ci; } double funcl(c,n,p,m,sl,f,k,l,t2,t3) double c,t2,t3; mt p,m,n,k,si,f,l; { double tl,digamma() ,**matrix() ,de-tQ; mt i; void f_matQ; for (i1,t10.0;i<p*m;-i-+i) titi+digamma((c+(double) (n+f—i—k—si*m)) /2.0) —digamma((c— (double) (i+si*m))/2 .0); /* printf(”ti,t2,t3 °hlf °hlf 7 1f\n”,ti,t2,t3); 0 return ti—t2+t3; } void detlog(T,B,t2,t3,p,m,l,mismndx,ssim) double **T, **B, *t2,*t3,**ssim; mt p,m,l, *misindx; { double **Psi22, **tmi, **tm2, **R, **matrixQ,detQ double **Phiii, **Psi; mt tmpi2,tmpi,i,j ,ii,i2,i3,tpi; tmpip*m-l; tmpi2p*m; Rmatrix(p*m, p*m); Phiiimatrix(tmpi2, tmpi2); 126 for (i10;±1<tmp±2;++i1) for (i20;±2<tmpi2;++i2) R[±1] [±21=0.0; for (i0; ±<p*m; i++) { tpimis±ndx[i] —1; R[tp±] [±1=1.0; } for (i0;i<tmp±2;++i) for (j0;j<tmp±2;++j) Ph±11[i] [j]T[±/m] [j/m]*B[±°hm] [j°hm]; tmlmatr±x(tmp±2, tmp±2); tm2=matr±x(tmp±2, tmp±2); for (±1=O;±1<tmp±2;++±1) for (±2=0; ±2<tmp±2 ; tml[±1] [±2]R[±2] [±1]; for (±1=0;±1<tmp±2;++±1) for (±2=0; ±2<tmp±2 ; for (±3=0,tm2[±1] [±2]=0.0;±3<tmp±2;++±3) tm2 [±1] [±2] =tm2 [±1] [±2] +tml [±1] [±3] *ph±11 [±3] [±2] f_mat (Ph±11 ,tmp±2); f_mat (tml ,tmp±2); Ps±matr±x(tmp±2, tmp±2); for (±1=0;±1<tmp±2;+i-±1) for (±2=0,Ps±[±1][±2]=0.0;±2<tmp±2;++±2) for (±3=0; ±3<tmp±2 ; ++±3) Ps±[±1] [±2]Ps±[±1] [±2]+tm2[±1] [±3]*R[±3] [±2]; f_mat (tm2,tmp±2); f_mat(R, tmp±2); Ps±22matr±x(tmp±, tmp±); for (±1=0; ±1<tmp±; ±1++) for (±2=0; ±2<tmp±; ±2++) { 127 Psi22 [ii] [i2] =Psi [il+l] [i2+l] +ssim[il] [i2]; } *t2=log(det(Psi22,tmpi)); for (i10; il<tmpi; il++) { for (i2O; i2<tmpi; i2++) Psi22[il] [i2]=Psi[il+l] [i2+l] } *t3=log(det(Psi22,tmpi)); f_mat(Psi, tmpi2); f_mat(Psi22, tmpi); } /******i*****************************************************/ #include <stdio .h> #include <math.h> /*****************************************************************/ /* /* 1* 1* /* 1* This module gives E(\_{(11)} and E(log I\J(ii)}I I I M_{2}, \Phi, \cn{*}) m_{2}, \Phi, \cn{*}) for the systematic pattern with whole column missing *1 is \Sigma..{(11)} log is log I\Sigma_{(11)} *1 I 1* phi is the hyperparameter \Phi_{(11)}= 1* *1 \Lambda_{g} \otimes \Omega 1* cn is the degrees of freedom, another hyperparameter 1* /* s2 is number of gauged stations, si is number of ungauged *1 stations 1* 1 is number of missing columns, misindx is the index of 1* 1* missing columns k is number of responses at each station 1* n is number time spots in the past, f is that in the future 128 *1 */ /* h is number of covariates void 1(sigma, misindx, Lambda, Omega, delta, s2,f,l, k,n,h,Ssim,sl) double mt **, **Lambda, **Omega, **Ssim,delta; 1, k,h,n, s2,f, *misindx,sl; { double **Psillstar, **Psill, **Psil2, **Psi22, **matrixQ, **Psi; double **R, **Rt, **tml, **tm2, **tm3, **tm4,**tm5, **1; double **Psi22in, **Phi22, **Psillstarin, **Phill, cont,contl; double **etal2, **etal2star; mt ii, i2, i3,tmpi, void mat_inverseQ, f_matQ; tmpi2, tpi; FILE *out; tmpis2*k -l Rmatrix(s2*k, s2*k); for (i10;il<s2*k;++il) for (i20;i2<s2*k;++i2) R[il] [i2]0.O; for (i10; il<s2*k; il++) { tpi=misindxEil]—1; R[tpi] [il]=1.O; } tmp i 2 = s2 *k; Phillmatrix(tmpi2, tmpi2); Psimatrix(tmpi2, tmpi2); for (i10;il<tmpi2;++il) for (i2=O; i2<tmpi2;++i2) Phill[il] Ei2]LambdaEil/k] [i2/k]*Omega[il%k] [i2%k]; 129 tmlmatrix(tmpi2, tmpi2); tm2matrix(tmpi2, tmpi2); for (i10;il<tmpi2;++il) for (i20;i2<tmpi2;-H-i2) tml[il] [i2]R[i2] [ii]; for (il=O;il<tmpi2;++il) for (i20; i2<tmpi2 ; ++i2) for (i30,tm2[il][i2]0.O;i3<tmpi2;++i3) tm2[il] [i2]=tm2[il] [i2]+tml[il] [i3]*Phill[i3] [i2] for (110;il<tmpi2;++il) for (i20,Psi[il][i2]0.O;i2<tmpi2;++i2) for (i3O; i3<tmpi2 ;++i3) Psi[il] [i2]=Psi[il] [i2]+tm2[il] [i3]*R[i3] [i2] f_mat (tml, tmpi2); f_mat(tm2, tmpi2); Psillmatrix(1,1); for (i10; il<1; il++) for (i2=O; i2<1; i2++) { Psill[il] [i2]Psi[il] [i2] } Psil2matrix(1 ,tmpi); for (i10; il<1; il-H-) for (i2O; i2<tmpi; i2++) { Psil2 [ii] [i2]Psi[il] [i2+1] 3Psi22matrix(tmpi,tmpi); for (il=O; il<tmpi; il++) for (i2=O; i2<tmpi; i2++) { Psi22[il] [i2]Psi[il+1] [i2+1] 3etal2matrix(1,tmpi); 130 Psi22inmatrix(tmpi, tmpi); mat_inverse(Psi22, tmpi, Psi22in); f_mat(Psi22, tmpi); Psi22matrix(tmpi, tmpi); for (i10; il<tmpi; il++) for (i20; i2<tmpi; i2++) { Psi22[il] [i2]Psi[il+1] [i2+1] } f_mat (Psi ,tmpi2); for (i10;il<1;++il) for (i20 ; i2<tmpi ; ++i2) for (i30,etal2[il] [i2]0.O;i3<tmpi;++i3) etal2[il] [i2]etal2[il] [i2]+Psil2[il] [i3]*Psi22in[i3] [i2]; tml=matrix(tmpi, 1); for (i10;il<tmpi;++il) for (i2O ; i2<1 ; ++i2) tml[il] [i2]Psil2[i2] [ii]; tm2matrix(1, 1); for (i10;il<1;++il) for (i20;i2<1;++i2) for (i30,tm2[il][i2]=O.O;i3<tmpi;+-t-i3) tm2[il] [i2]tm2[il] [i2]+etal2[il] [i3]*tml[i3] [i2J Psillstar=iuatrix(1,1); for (i10; il<1; il++) for (i20; i2<1; i2++) Psillstar[il] [i2]Psill[il] [i2]—tm2[il] [i2] Psillstarinmatrix(1,1); /*outfopen(”jmik”,”w”); */ /*for (i10;il<1;il++) {*/ 1* 1* for (i20;i2<1;++i2) {*/ fprintf(out,”°h5.21f “,Psillstar[il][i2]);*/ 131 1* if (ilY.1O==9) fprintf(out,”\n”); }*/ /*fprintf(”\n\n”); }*/ /*fclose(out) ;*/ /*exjt(0) ;*/ if (1 mat_inverse(Psillstar, 1, Psilistarin); 0) if (1 ‘=0) { f_mat (Psillstar,l); f_mat (Psill,l); f_mat(tm2, 1); f_mat(Psil2,1); } f_mat(tml, tmpi); contdelta— (double) (sl*k); 1=matrix(tmpi2,tmpi2); for (i10; il<1; il++) for (i20; i2<l; i2++) 1[il] [i2]cont*Psillstarin[il] [i2]; tmlmatrix(l,tmpi); for (i10;il<l;++il) for (i2=0;i2<tmpi;++i2) for (i30,tml[il][i2]=0.0;i3<l;++i3) tml [ii] [i2] =tml [ii] [i2] +Psillstarin [ii] [i3] *etal2 [i3] [i2] for (i10; il<1; il-H-) for (i20; i2<tmpi; i2++) { 1[il] [i2+l]—1.0*cont*tml[il] [i2] l[i2+l] [ii] sigmal [ii] [i2+l]; } contldelta-(double) (sl*k-n-f+l-t-h); tm2matrix(tmpi , 1); for (i10;il<tmpi;++il) for (i20;i2<l;+-i-i2) 132 tm2[il] [i2]=etal2[i2] [ii]; tm3=matrix(tmpi ,tmpi); for (il=O;il<tmpi;++il) for (i2=O;i2<tmpi;++i2) for (i30,tm3[il][i2]0.O;i3<1;++i3) tm3[il] [i2]tm3[il] [i2]+tm2[il] [i3]*tml[i3] [i2J if (1 ‘=0) f_mat(tml,1); f_mat(tm2, tmpi); tm4=matrix(tmpi,tmpi); for (il=O;il<tmpi;++il) for (i2=0; i2<tmpi ; ++i2) tm4 [ii] [i2] Psi22 [ii] [i2] +Ssim [ii] [i2] tm5matrix(tmpi,tmpi); mat_inverse(tm4, tmpi, tm5); tml=matrix(tmpi,tmpi); for (il=0; il<tmpi; il++) for (i20; i2<tmpi; i2++) { 1[il+1] [i2+1]contl*tm5[il] [i2]+cont*tm3 [ii] [i2] +(double)1*Psi22in[il] [i2]; f_mat (tml ,tmpi); f_mat(tm4, tmpi); f_mat (Phill ,tmpi2); f_mat(tm5, tmpi); f_mat(Psi22, tmpi); f_mat(tm3, tmpi); f_mat(Psi22in, tmpi); if (1 =0) { f_mat (etal2,1); f_mat (Psillstarin,1); } 133 tml=matrix(tnpi2,tmpi2); Rt=matrix(tmpi2, tmpi2); for (il=O;il<tmpi2;++il) for (i2=O; i2<tmpi2 ; ++i2) Rt[il] [i2]R[i2] [ii]; for (il=O;il<s2*k;++il) for (i20;i2<s2*k;++i2) { tml[il] [i2]0.O; for (i30;i3<s2*k;++i3) tml[il] [i2]= tml[il] [i2]+1[il] [j3]*Rt[j3] [i2] } for (i10;il<s2*k;++il) for (i20;i2<s2*k;++i2) { [ii] [i2]0.O; for (i30;i3<s2*k;++i3) [ii] [i2]= [ii] [i2]+R[il] [j3]*tml[j3] [i2] } f_mat(tml, tmpi2); f_mat(Rt, tmpi2); f_nat (R,tmpi2); f_mat(1,tmpi2); } /**********************************************************/ #include <math.h> #define SIGN(a,b) ((b)<O.O 7 -fabs(a) void ord_mat(nvr,eig,mat) mt nvr; double *eig, **nat; { 134 fabs(a)) mt i=O,j=O,k=O; double tmp=O.O; for (i=0;j<nvr;i++) for (ji;j<nvr;j++) if (eig[j]<eig[i]) { tmpeig[j]; eig[j]eig[i] eig[i] tmp; for (k0;k<nvr;k++) tmpmat[k] { [ji; iuat[k] [j]matEk] [ii; matEk] [i]tmp; } } } void tred2(a,n,d,e) double **a,d[] ,e[]; mt n; { mt 10,k0,j0,i0; double scale0 .0,hhO .0,h=0 .0,g0 .0,f0 .0; for (i=n—1;i>1;i——) { 1i—1; hscale=0.0; if (1 > 0) { for (k0 ; k<1 ; k++) scale if (scale fabs(a[i][k]); + == 0.0) 135 e[i]a[i] [1]; else { for (k0;k<1;k++) { a[i] [k] / scale; h +za[i] [k]*a[i] [k]; } fa[i] [1]; g = f>O.O ? —sqrt(h) : sqrt(h); e [ii scale*g; h -= f*g; a[i] [l]f—g; f0.O; for (j0;j<1;j++) { a[j] [i]aEi] [j]/h; g0.O; for (k0;k<j ;k++) g +a[j] [k]*a[i] [k]; for (kj+1 ;k<1;k++) g +a[k] [j]*a[i] [k]; e[j]g/h; f + e[j]*a[i][j]; } hh±/(h+h); for (jO;j<l;j++) { fa[i] [j] e[j]=g=e[j]-hh*f; for (kO;k<j ;k++) a[j] [k] —= (f*e[k]+g*a[i] [k]); } } } else 136 e[i]a[i] [1]; d[i]h; } d [0] =0 0; . e [0] =0 0; for (i0;i<n;i++) { 1i—1; if (d[i]) { for (j0;j<1;j++) { g0.0; for (k0 ;k<1 ;k++) g +=a[i] [k]*a[k] [j]; for (k0 ;k<1 ;k++) a[k] [j] —= g*a[k] [i]; } } d[i]a[i] [ii; a[i] [i]1.O; for (jO;j<1;j++) a[i] [j]=a[j] [i]0.0; } } / *********************************************************/ void tqli(d,e,n,z) double d[] ,e[] ,**z; mt n; { mt m0,10,iter0,i0,k0; double s0.0,r0.0,p0.0,g0.0,f0.0,ddO.0,c0.0,b0.0; void nrerrorQ; 137 for (i1;i<n;i++) e[i—1]e[i] e[n—1]0.O; for (l=iter=O;l<n;1++) do { { for (ml;m<n-1;m++) dd=f abs (d Em] ) +f abs (d [m+1] if (fabs(e[m])+dd == ) dd) break; } if (m { 1) if ((iter++) == 10000) nrerror(”Too many iterations in TQLI”); g=(d[l+1]—d[l])/(2.0*e[l]); rsqrt((g*g)+1.0); g=d [ml —dEl] +e El] / (g+SIGN (r , sc1.O; p . 0 O; for (im—1;i>1;i——) { f=s*e[j]; bc*e[i]; if (fabs(f) >= fabs(g)) cg/f; rsqrt((c*c)+1.0); e Ei+1] =f*r; c } (s1.0/r); *= else { sf/g; rsqrt((s*s)+1.0); eEi+1]g*r; s *= (c1.0/r); } 138 { gd[i+1] -p; rCd [i] —g)*s+2. O*c*b; ps*r; d[i+1]g+p; gc*r-b; for (k0;k<n;k++) { fz[k] [i+1]; z[k] [j+1]s*z[k] [i]+c*f; z[k] [i]c*z[k] [i]—s*f; } } d[l]d[l]—p; e[m]0.O; } } while Cm 1* 1); printf(”Number iterations in tqli = ‘/.d\n”,iter); */ } #include <malloc .h> #include <stdio .h> /*******************************************************/ void nrerror(error_text) char error.text[]; { fprintfCstderr,”Numerical Recipes run-time error.. fprintfCstderr,”°hs\n” ,errortext); fprintfCstderr,”. . .now exiting to system.. exit Ci) 3. 139 double *vector(nh) mt nh; { double *v; v(double *)calloc(nh,sizeof (double)); if (v) nrerror(”allocation failure in dvectorQ”); return v; } / *******************************************************/ mt mt *ivector(nh) nh; { mt *v; v=(int *)calloc(nh,sizeof(int)); if (!v) nrerror(”allocation failure in ivectorO”); return v; } /*******************************************************/ double **matrix(nrh,nch) mt nrh,nch; { mt i,j; double **m; m(double **) calloc(nrh,sizeof(double*)); if (!m) nrerror(”allocation failure 1 in matrixO”); for(i0; i<nrh; i++) { m[i]=(double *)calloc(nch,sizeof (double)); if (m[i]) nrerror(”allocation failure 2 in matrixQj; 140 } return m; } /******************************************************/ void f_mat (m,nrh) double **m; mt nrh; { mt i0; for(i=nrh—1;i>0;i——) free((char*) (m[i])); free((char*) (m)); } double **dmatrix(nrh,nch) mt nrh,nch; { mt i; double **m; m(double **) calloc(nch,sizeof(double*)); if (m) nrerror(”allocation failure 1 in dmatrixQ”); for(i=O;i<nch;i++) { m[i]=(double *)calloc(nrh,sizeof (double)); if (!m[iJ) nrerror(”allocation failure 2 in dmatrixQ”); } return m; /**********************************************************/ #include <math.h> double det(h,m) 141 double **h; mt m; { double *dv,*er,tmp,**temp; double *vector() ,**matrixQ; void tqli() ,tred2() ,f_matQ; mt i,j; temp=matrix(m,m); for (i0;i<m;++i) for (j0;j<m;++j) temp[i] [j]h[i] [j] ervector(m); dv=vector(m); tred2(temp,m,dv,er); tqli(dv,er,m,temp); for (i0,tmpl.O;i<m;-t--t-i) tmp*=dv[i]; free((char*)(dv)); free((char*)(er)); f_mat (temp,m); return tmp; } /**********************************************************/ void mat_inverse(h,m,invh) double **h, **invh; mt m; { double **hs ,*dv,*er,**lam; double *vector() ,**matrixO; void tqliO,tred2O,f_matQ,m_matO,nrerrorQ; 142 mt i,j; er=vector(m); lam=matrix(m,m); hs=matrix(m,m); dv=vector(m); tred2(h,m,dv,er); tqli(dv,er,m,h); for (i=O;i<m;++i) if (dv[i]’O.O) lam[i] [i]=1.O/dv[i] else nrerrorQ’trying to invert a singular matrix”); m_mat(hs,h,lam,m,m,m); for (i0;i<m;++i) for (j0;j<m;++j) lam[i] [j]=h[j] [ii; m_mat(invh,hs,lam,m,m,m); f_mat (hs,m); f_mat(lam,m); free((char*)(dv)); free((char*)(er)); } /**********************************************************/ void invert_matrix(h,m) double **h; mt m; { double **hs , *dv , *er , **lata; double *vector() ,**matrixQ; void tqli() ,tred2() ,f_mat() ,m_mat() ,nrerrorQ; mt i,j; 143 ervector(m); lam=matrix(m,m); hs=matrix(m,m); dvvector(m); tred2(h,m,dv,er); tqli(dv,er,m,h); for (i0;i<m;++i) if (dv[i] !=O.O) lam[i] [i]1.O/dv[i] else nrerror(”trying to invert a singular matrix”); m_mat(hs,h,lam,m,m,m); for (i0;i<m;++i) for (j=O;j<m;+-fj) lam[i] [j]=h[j] [ii; m_mat(h,hs,lam,m,m,m); f_mat (hs,m); f_mat(lam,m); free((char*)(dv)); free((char*)(er)); } /***********************************************************/ void m_mat(mO,ml,m2,n,p,m) double **mO , **mj. , **m2; mt n,p,m; { mt i,j,k; for (i0;i<n;++i) for (j0;j<m;+-i-j) for (k0,mO[i] Ej]=O.O;k<p;++k) 144 mO[i] [j]+m1[i] [k]*m2[k] U]; } /**********************************************************/ void s_mat(mO,ml,m2,n,m,k) double **mO , , **rn2; mt n,m,k; { i,j; mt if (k==o) for (i0;i<n;++i) for (j=O;j<m;-I--fj) mOLi] [j]m1[i] [j]—m2[i] [j]; else for (i0;i<n;++i) for (j0;j<m;++j) iuO[i] [j]m1[i] [j]-i-m2[i] [ii; } #include <math.h> double trigamma(z) double z; { double retval,rl,d,w,x,ww; mt il,i2,j; i2(int) (z); if (i2>20) { rlz; wwrl*rl; 145 x1—(O .2—1/(ww*7))/ww; d=1/z+(x/(z*3)+1)/(ww*2); } else { i220—i2; wz+i2; rlw; wwrl*rl; x1—(O .2—1/(ww*7))/ww; d=1/w+(x/(w*3)+1)/(ww*2); i1i2; for (j0;j<il;+-i-j) { rlw; d+1/(rl*rl); } 3. ret val=d; return retval; 3. double digamma(x) double x; { double y,retval,s3,s4,s5,dl,r,gammaQ; s31.0/12; s41 .0/120; s50 .003968253968; d1—0 .5772156649; retval0 .0; 146 yx; if (y<=O.O) return retval; if (y<1e—5) return (dl—1.O/y); while (y<8.5) { retvalretval-1 O/y; . .O; 1 yy+ } r=1 .O/y; retvalretval+log(y)-r/2; retval=retval—r* (s3—r* (s4-r*s5)); return retval; } double gamma(x) double x; { double gcs[24],pi,sq2pil,xmin,xmax,dxrel,y,sinpiy,gamiri; double r9lgmc() ,csevlQ,gamlinQ; mt ngcs,i,n,initsO; gcs[1] O.008571195590989331e0; gcs [2] = 0.00441538 1324841007e0; gcs [3] = 0. 05685043681599363e0; gcs [4] = —0.0042 19835396418561e0; gcs [5] = 0. 001326808181212460e0; gcs [6] = —0. 0001893024529798880e0; gcs [7] = 0 0000360692532744124e0; . —0. 0000060567619044608e0; gcs [8] gcs [9] = gcs [10] 0 0000010558295463022e0; . = —0. 0000001811967365542e0; 147 0. 0000000311772496471e0; gcs [11] —o gcs [12] = . 0000000053542196390e0; 0. 0000000009193275519e0; gcs [13] = gcs [14] = —0. 0000000001577941280e0; 0. 0000000000270798062e0; gcs [15] = gcs [16] = —0. 0000000000046468186e0; gcs [17] = 0. 0000000000007973350e0; —0.000000000000 1368078e0; gcs [18] gcs [19] = 0 0000000000000234731e0; . gcs [20] = —0. 0000000000000040274e0; gcs [21] = 0. 0000000000000006910e0; gcs [22] = —0. 0000000000000001185e0; gcs[23] 0.0000000000000000203e0; pi=3. 14159265358979324e0; sq2pilo 91893853320467274e0; . ngcs inits(gcs,23,5.9604645e—8); = gainlim(&xmin,Scxmax); dxrel y = sqrt(1.1920929e—6); fabs(x); = { if (y>1O.O) gamin = exp((y—0.5)*log(y)—y+sq2pil+r9lgmc(y)); { if (x<0.) sin(pi*y); sinpiy = gamm -pi/(y*sinpiy*gamm); = } } else { n = x/1; if (x<0.) n y = = n-i; x—n/1.0; 148 n n-i; = ganim = O.9375+csevl(2.*y-i. ,gcs,ngcs); if (n>0) for (i1;i<n;i++) gamm else { n = = (y+i/i.O)*ganun; -n; for (ii;i<n;i++) gamin = gamm/(x+(i-i)/i.O); return(ganuu); } gaml im(xmin , xmax) double *xmin, *xmax; { double alnsml,alnbig,xln,xold,log() ,fabsQ; mt i; alnsmllog(2. 9387359e-37); *xmjn= -alnsml; for (ii;i<10;i++) { xold= *xmin; xln=log(*xmin); *xmjn *xmjn- *xmjn* ( (*xmjn+O 5) *xln *xmin-0 2258+alnsml) / (*xmin*xln+0 5); . . if (fabs(*xmin—xold)<0 .005) break; *fljjfl - *xmjn+0.Oi; alnbiglog(i .70i4117e38); *xmaxalnbig; 149 for (i1;i<10;i++) { *xmax; xold xln=log(*xmax); *xmax— *xmax* ( (*xmax-0 .5) *xln— *xmax *xmax+0 9189—alnbig) / (*xmax*xln-0 .5); . if (fabs(*xmax-xold)<0 .005) break; *xmax-0 .01; *xmax *xmjn(*xmjn>1 .0- *xmax)? *xmin: 1.0- *xmax; } double r9lgmc(x) double x; { double algmcs[7],xbig,xmax,y,z,csevlQ,sqrtQ,expO,logQ; mt nalgm,initsO; algmcs [1] = 0. 166638948045186; algmcs [2] = —0 .0000138494817606; algmcs [3] = 0.0000000098108256; algmcs [4] = —0.0000000000180912; algmcs [5] = 0.0000000000000622; algmcs [6] = —0.0000000000000003; nalgminits(algmcs ,6,5. 9604645e—7); xbigl 0/sqrt (5. 9604645e—7); . ylog(1 .7014117e38/12.0); z= —log(12.0*2.9387359e—37); xmax(y<z)? exp(y) :exp(z); if (x<xbig) return(csevl(2.0*(10.0/x)*(10.0/x)— 1.0, algmcs ,nalgm)/x); else return(1.0/(12.0*x)); } 150 double csevl(x,cs ,n) mt n; double x,cs[201]; { double bO,bl,b2,twox; mt i,ni; b10.O; b00.O; twox2 Q*x; . for (i1;i<n;i++) { b2b1; blbO; nin+1-i; bOtwox*bl-b2+cs [ni]; return(O.5*(bO-b2)); } mt inits(os,nos,eta) mt nos; double os[201] ,eta; mt i,ii; double err,fabsQ; errO.O; iil; while (ii<=nos && err<eta) { inos+1—ii; errerr+fabs(os [i]); 151 return (i); } 152 Appendix C: Figures 31 30 (D 2 L() cl) 23 2 1 81 2 1 C) 4 -84 1 I I I -82 -80 -78 -76 -74 Longitude Figure 3.1: Locations of gauged sites in Southern Ontario plotted with Census Subdivi sion boundaries, where monthly pollution levels are observed and Sites 3, 29 (outliers) are not plotted. 153 * * CD * * ‘.0 * ci) 4* * * _I * * * * * * * * * C’) * * * * * * I -84 I -82 -80 -78 -76 -74 Longitude Figure 3.2: Locations of selected sites in Southern Ontario plotted with Census Subdi vision boundaries, where monthly interpolated pollution levels are needed. 154 The Oberved and Fitted Values at Site 5 q 0 z 20 40 60 Month(1=Jan. 1983) 0 0• 0 20 40 60 Month(l=Jan, 1983) q C., LI) 0 0 20 40 60 Month(1=Jan. 1983) 0 20 40 60 Month(1.Jan. 1983) Figure 3.3: Plots for monthly observed and fitted, log-transformed levels of 03 in ppb, 2 and SO , NO 2 SO , at Gauged Site 5. 3 4 in g/m 155 ‘......*.... —‘. .— a - -2 -1 0 1 2 Quantiles of Standard Normal 0 0 Cl) 0 U, 0• #— C .d’• .—.—— -2 -1 0 1 2 Quantiles of Standard Norma) Figure 3.4: Normal quantile-quantile plots for original and log-transformed monthly levels of SO 3 at Gauged Site 4. 4 in ug/m 156 Series: S04 0 u_ . Ii II I 5 0 10 15 Lag Series : S04 cJj 0 U 0 Ii I t 0 9 5 10 15 Lag Figure 3.5: Plots for autocorrelation and partial autocorrelation of monthly, log transformed levels of SO 3 at Gauged Site 4. 4 in g/m 157 0 OO9 = pqtu 009 8c I 0 0 OO- SVJd eoU4SIp 9UECI—rj JeOLJJEI 6U!LBOOUJS oufld OO 0001- 009 OO 0 009- • seeuipJoo euid—cj iiueuodx das U! d = pquiI v putqo paoqa pp ajoouxs i{ UT pUtqo pa’oqa3pp S! LUJf5O!JA Pel.111 :)c anj :9 t{flOJ eoluIea 6uqoou euiids 0 = eouSp euid—a 0091- 00 009 0 0001-— 0091- seeu!pJoo euid-cj 0001- 009 0 S! tJJJ6O[J’\ P&!d Iueuodx3 0 . . C C. U) ci 0 C . > 0 0 a) . . 0. L() . U) - 0 I. . . . . •. •• .• • . • •• • . • •. . e . . 0• . . . • I I I I 0 20 40 60 - 80 I 100 Estimated Covariances Figure 3.8: Scatter plot of observed covariances vs predicted covariances obtained by the GS approach. 159 31 4 U, a) -a- 8 4 4 ‘ 6 5 -84 -82 -80 -78 -76 -74 Longitude Figure 3.9: Means of monthly levels of 03 in ppb, in summers of 1983 sites in Southern Ontario plotted with CSD boundaries. 1988 at gauged Figure 3.10: Means of monthly levels of 03 in ppb, in summers of 1983 sites in Southern Ontario plotted with CSD boundaries. 1988 at selected 35 41 Co -a- 4 45 4 4 47 6 4 0 5 525 56 5 57 -84 -82 -80 -78 Longitude 160 -76 -74 I- a) 2 2 •• D C • •. (I) • 1 . • • •• • c_._._..• •. . — Cl) •. . •.4 . ? ,_ . • _., ,.1.. 0 • Cl) a) St. ••.I• • . • • • I . • •• a) . . I • . •• • . .. .1 I. %i S 0 0 a) •1 C . . . -2 —1 0 1 Residuals of Observations in Summer a) C .1 . • C • ‘: Cl) , •. . D •.I . 0 s• . • U) a) .• . . a) • : ; .•••...•‘.• —. rS • S • .m . . . . • - — . •. •a 0 • • . c.j . a) C .4-. -1.0 -0.5 0.0 0.5 1.0 1.5 Residuals of Observations in Winter Figure 3.11: Scatter plots for residuals of monthly observed pollutant levels against residuals of interpolated levels at the log-scale in winter and summer respectively, where levels of 03 are in ppb, SO 2 and SO , NO 2 . 3 4 in pg/m 161 0” z Oo 0 z 0 O0 0 0 # a 0 S Ec 0 E’ E C S -0.4 -0.6 -0.2 0.0 0.2 0.4 -0.2 -0.4 0.6 summer monthly est. log(N02) 0.0 0.2 winter monthly est. log(N02) Co co o :.‘ i •#• .•• . 0 .•..s•. .. . ‘I. C. 0 E S E C S -1 0 1 -1.0 0.0 -0.5 summer monthly eut. log(S04) 1.0 0.5 1.5 winter monthly est. log(S04) 8 2° o Cl 00 00 £ 0 £ S -0.4 -0.6 -0.2 0.0 0.2 0.4 0.6 0.6 -04 summer monthly cut. log(03) 02 00 02 0.6 winter monthly eut. log(03) 04 0 0 0 .1 oa a 0 0. t - 0 :,.. 0 -1.5 -1.0 S -0.5 0.0 0.5 1.0 1.5 -2 summer monthly cut. log(S02) -1 winter monthly cut. log(SID2) Figure 3.12: Pollutant-wise scatter plots for residuals of monthly observed pollutant levels against residuals of interpolated levels at the log-scale in winter and summer respectively, where levels of 03 are in ppb; SO 2 and 504 in tg/m , NO 2 . 3 162 E E 0 I 0) 1 2 3 4 polutants, 1=N02, 2=S04, 3=03, 4=S02 Co 0 0 1 2 3 4 pollutants, I=N02, 2=S04, 3=03, 4=S02 V E E 0) 0 0 2 3 4 pollutants, 1 =N02, 2=S04, 3=03, 4=S02 Figure 3.13: Boxplots for predicted, observed and residual levels of log-transformed, monthly concentrations of 03 in ppb, 802, NO 2 and 804 in 3 ug/m respectively. 1 , 163 27 Co. 2 LC) a) D 1 1 -J 1 2 1 9 410 1 Cs” I -84 -82 I I -80 -78 -76 -74 Longitude Figure 3.14: Locations of gauged sites in Southern Ontario plotted with Census Sub division boundaries, where daily pollution levels are observed and Sites 3, 26 (outliers) are not plotted. 164 • Cl, C 0 0 • . 21012 3 Quantiles of Standard Normal 0 C Cl, 0 0 0 ..___._..__.....I__..._._._._..__I__.____._I:_s__tøtI•P_•e•_•• ci) 0 -3 -2 -1 0 1 2 3 Quantiles of Standard Normal Figure 3.15: Normal quantile-quantile plots for original and log-transformed daily levels of SO 4 in [Lg/m 3 at Gauged Site 1. 165 Series : 03 0 U- C) 0 LiL:z:i::L1zzizz:zir.EzE.z.:mz 5 0 10 15 ii 20 25 Lag Series : 03 U .I.:h1Ih1h1 0 5 .II•I 10 15 ‘.....:.....:.....‘ 20 25 Lag Figure 3.16: Plots for autocorrelation and partial autocorrelation of daily, logtransformed levels of 03 in ppb at Gauged Site 6, before an AR(1) transformation is taken. Series : 03 U- C) - 0 0 0 5 10 15 20 25 Lag Series : 03 c? 0 5 10 15 20 25 Lag Figure 3.17: Plots for autocorrelation and partial autocorrelation of daily, logtransformed levels of 03 in ppb at Gauged Site 6, after an AR(1) transformation is taken. 166 . U) . . . w D_ C,, 0 ‘-: 1 5 15 10 Order of Deleted Columns Figure 5.1: Plot for trends in AMSPE. 167 20 l--Co-kriging MSPE 0--Bayesian MSPE . Co . ci) Cl, a) a) C Lt) Co - C) a) a) a) Cci C 0 ci) 0 Lii 0 C,) C\j - 0’ 0 10 20 30 40 Months, 1=Jan. 1983; 48=Dec. 1986 Figure 5.2: MSPEs obtained by Hass’s interpolator and the Bayesian interpolator with original acid rain data. Figure 5.3: MSPEs obtained by Hass’s interpolator and the Bayesian interpolator with log-transformed acid rain data. I--Cokriging MSPE 0--Bayesian MSPE C,) C.” a) C C) a) a) a) Co a) 0 0 CD 0 0 C & 0 C C ci) 00 O0 C C 0 -c 0 0 0 00 0 C 0 U-I 0 Cl) 0 C” 0 I Lt ‘I 0 I 10 IC TI H H 0 0 1I7 17 1 20 30 Months, 1=Jan. 1983; 48=Deo. 1986 168 C 40 0 U) C) > C) a) 1 I .4- 0 co C V CD C) U) .0 0 Cd a) 2: 0 1 2345678 sites from 1 to 35 ci) a)> C) C) 4-. 4- C V C) 0 U) C 01 0 : I... 4- 0 -D C) 2: C) U) 1 2345678 91112 3141 516 T122a72s3c81323s3435 .0 0 sites from 1 to 35 Figure 5.4: Boxplots of observed nitrate levels with/out log-transformation at 35 sites in US. 169
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Bayesian multivariate interpolation with missing data...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Bayesian multivariate interpolation with missing data and its applications Sun, Weimin 1994
pdf
Page Metadata
Item Metadata
Title | Bayesian multivariate interpolation with missing data and its applications |
Creator |
Sun, Weimin |
Date | 1994 |
Date Issued | 2009-06-11T19:15:20Z |
Description | This thesis develops Bayesian multivariate interpolation theories when there are: (i) data missing-by-design; (ii) randomly missing data; (iii) monotone missing data patterns. Case (i) is fully discussed both theoretically and empirically. A predictive distribution yields a Bayesian interpolator with associated standard deviation, a simultaneous interpolation region, and a hyperparameter estimation algorithm. These results are described in detail. The method is applied to interpolating data from Southern Ontario Pollution. An optimal redesign of a current network is proposed. A cross-validation study is conducted to judge the performance of our method. The method is compared with a Co-kriging approach to which the method is meant to be an alternate. Case (ii) is briefly discussed. An approximation of a matrix T-distribution by a normal distribution is explored for obtaining an approximate predictive distribution. Based on the approximate distribution, an approximate Bayesian interpolator and an approach for estimating hyperparameters by the EM algorithm are described. Case (iii) is only touched on. Only an iterative predictive distribution is derived. Further study is needed for finding ways of estimating hyperparameters. |
Extent | 2465492 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-06-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0088917 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1995-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/8976 |
Aggregated Source Repository | DSpace |
Download
- Media
- ubc_1995-983523.pdf [ 2.35MB ]
- [if-you-see-this-DO-NOT-CLICK]
- [if-you-see-this-DO-NOT-CLICK]
- Metadata
- JSON: 1.0088917.json
- JSON-LD: 1.0088917+ld.json
- RDF/XML (Pretty): 1.0088917.xml
- RDF/JSON: 1.0088917+rdf.json
- Turtle: 1.0088917+rdf-turtle.txt
- N-Triples: 1.0088917+rdf-ntriples.txt
- Original Record: 1.0088917 +original-record.json
- Full Text
- 1.0088917.txt
- Citation
- 1.0088917.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
United States | 12 | 5 |
Russia | 12 | 0 |
Austria | 8 | 0 |
France | 6 | 0 |
China | 5 | 26 |
Japan | 2 | 0 |
Canada | 2 | 0 |
Hong Kong | 1 | 0 |
Venezuela | 1 | 1 |
South Africa | 1 | 2 |
Nigeria | 1 | 0 |
Saudi Arabia | 1 | 0 |
Iran | 1 | 2 |
City | Views | Downloads |
---|---|---|
Unknown | 24 | 8 |
Penza | 6 | 0 |
Ashburn | 4 | 0 |
Saint Petersburg | 3 | 0 |
Beijing | 3 | 1 |
Vancouver | 2 | 0 |
Wilmington | 1 | 0 |
Jeddah | 1 | 0 |
Guangzhou | 1 | 0 |
Mashhad | 1 | 2 |
Shenzhen | 1 | 25 |
Mountain View | 1 | 0 |
Lawrence | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0088917/manifest