Bayesian multivariate interpolation with missing data and its applications. Sun, Weimin (1994).

Full Text

BAYESIAN MULTIVARIATE INTERPOLATION WITH MISSING DATA AND ITS APPLICATIONS

by

WEIMIN SUN

B.Sc., University of Electronic Science and Technology of China, 1983
M.Sc., University of Georgia, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
December 1994
© Weimin Sun, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

This thesis develops Bayesian multivariate interpolation theories for three patterns of missing data: (i) data missing-by-design; (ii) randomly missing data; (iii) monotone missing data patterns.

Case (i) is fully discussed both theoretically and empirically. A predictive distribution yields a Bayesian interpolator with an associated standard deviation, a simultaneous interpolation region, and a hyperparameter estimation algorithm. These results are described in detail. The method is applied to interpolating pollution data from Southern Ontario. An optimal redesign of a current network is proposed. A cross-validation study is conducted to judge the performance of our method, and the method is compared with a Co-kriging approach to which it is meant to be an alternative.

Case (ii) is briefly discussed. An approximation of a matrix T-distribution by a normal distribution is explored to obtain an approximate predictive distribution. Based on that approximate distribution, an approximate Bayesian interpolator and an approach for estimating hyperparameters by the EM algorithm are described.

Case (iii) is only touched on. Only an iterative predictive distribution is derived; further study is needed to find ways of estimating the hyperparameters.
Contents

Abstract  ii
Table of Contents  iii
List of Tables  v
List of Figures  vi
Acknowledgements  ix
1 Introduction  1
2 Bayesian Multivariate Interpolation with Missing Data  13
2.1 Introduction  13
2.2 Bayesian Interpolation  14
2.3 Matrix T-Distribution  18
2.4 Interpolation with Data Missing-by-design  25
2.4.1 Predictive Distributions and Interpolation  26
2.4.2 Estimation of Hyperparameters  31
2.4.3 EM Algorithm  38
2.5 Interpolation with Randomly Missing Data  40
2.6 Interpolation with Monotone Missing Data Patterns  46
2.7 Proofs of Theorems, Corollaries and Lemmas  49
3 Application to Air Pollution Data  66
3.1 Application to Monthly Pollution Data  68
3.2 Application to Daily Pollution Data  76
4 Application to Environmental Monitoring Network Redesign  84
4.1 Theory of Network Redesign with Entropy  85
4.2 Network Redesign with Data Missing-by-design  88
5 Cross-validation Study  90
5.1 Simultaneous Interpolation Versus Univariate Interpolation  90
5.2 Trends in Adjusted Mean Squared Prediction Error (AMSPE)  93
5.3 Comparison with CoKriging  95
6 Concluding Remarks and Future Studies  104
Bibliography  108
Appendix A  116
Appendix B  121
Appendix C  153

List of Tables

3.1 Pollutants measured at each gauged site, where a, b, d and e represent SO2, NO2, O3 and SO4, respectively  69
3.2 The estimated between-pollutants hypercovariance matrix of the log-transformed, summer monthly data  72
3.3 Correlations between residuals of log-transformed, observed and estimated pollution levels at gauged sites  75
3.4 The estimated between-pollutants hypercovariance matrix of the log-transformed daily summer pollution levels in Southern Ontario  79
3.5 Correlations between the residuals of log-transformed, summer daily observed and estimated pollutants at gauged sites  81
5.1 Latitudes and longitudes of the gauged sites  101
5.2 MSPEs by both methods  102

List of Figures

3.1 Locations of gauged sites in Southern Ontario plotted with Census Subdivision boundaries, where monthly pollution levels are observed and Sites 3, 29 (outliers) are not plotted  153
3.2 Locations of selected sites in Southern Ontario plotted with Census Subdivision boundaries, where monthly interpolated pollution levels are needed  154
3.3 Plots of monthly observed and fitted, log-transformed levels of O3 in ppb and SO2, NO2 and SO4 in µg/m³ at Gauged Site 5  155
3.4 Normal quantile-quantile plots of original and log-transformed monthly levels of SO4 in µg/m³ at Gauged Site 4  156
3.5 Plots of the autocorrelation and partial autocorrelation of monthly, log-transformed levels of SO4 in µg/m³ at Gauged Site 4  157
3.6 A rough checkerboard obtained in the SG step  158
3.7 A smoother checkerboard obtained in the SG step  158
3.8 Scatter plot of observed covariances vs predicted covariances obtained by the SG approach  159
3.9 Means of monthly levels of O3 in ppb, in the summers of 1983-1988, at gauged sites in Southern Ontario plotted with CSD boundaries  160
3.10 Means of monthly levels of O3 in ppb, in the summers of 1983-1988, at selected sites in Southern Ontario plotted with CSD boundaries  160
3.11 Scatter plots of residuals of monthly observed pollutant levels against residuals of interpolated levels on the log scale, in winter and summer respectively, where levels of O3 are in ppb and SO2, NO2 and SO4 in µg/m³  161
3.12 Pollutant-wise scatter plots of residuals of monthly observed pollutant levels against residuals of interpolated levels on the log scale, in winter and summer respectively, where levels of O3 are in ppb and SO2, NO2 and SO4 in µg/m³  162
3.13 Boxplots of predicted, observed and residual levels of log-transformed, monthly concentrations of O3 in ppb and SO2, NO2 and SO4 in µg/m³, respectively  163
3.14 Locations of gauged sites in Southern Ontario plotted with Census Subdivision boundaries, where daily pollution levels are observed and Sites 3, 26 (outliers) are not plotted  164
3.15 Normal quantile-quantile plots of original and log-transformed daily levels of SO4 in µg/m³ at Gauged Site 1  165
3.16 Plots of the autocorrelation and partial autocorrelation of daily, log-transformed levels of O3 in ppb at Gauged Site 6, before an AR(1) transformation is taken  166
3.17 Plots of the autocorrelation and partial autocorrelation of daily, log-transformed levels of O3 in ppb at Gauged Site 6, after an AR(1) transformation is taken  166
5.1 Plot of trends in AMSPE  167
5.2 MSPEs obtained by Haas's interpolator and the Bayesian interpolator with the original acid rain data  168
5.3 MSPEs obtained by Haas's interpolator and the Bayesian interpolator with the log-transformed acid rain data  168
5.4 Boxplots of observed nitrate levels, with and without log-transformation, at 35 sites in the US  169

Acknowledgements

I would like to thank Prof. James Zidek for his excellent guidance, for his fundamental influence on my "statistical thinking" and for providing me with an opportunity to work on a practical problem. I would like to thank Dr. Nhu Le for the many fruitful discussions I had with him. I am indebted to Dr. Rick Burnett for his comments and for providing me with the data used in my investigation. I thank Dr. Tim Haas; only with his generous help was the comparison of the method in this thesis with his own method feasible. I thank Mr. Hongbin Zhang, our systems analyst, for his support. I thank Mr. Rick White for providing part of the C code I needed for my analysis. I thank the members of my Supervisory Committee for their help. Very special thanks go to Ms. Chun Zhang, the McConnells and the Veecks for their substantial help. Finally, I would like to thank the University and its Department of Statistics, along with Health Canada, for their financial support during my graduate studies.

Chapter 1

Introduction

A spatial data set consists of a collection of measurements or observations on one or more attributes taken at specified locations. At a fixed time, those data are observed over a restricted geographical or other region. Many models and theories have been established for spatial data analysis. The following several paragraphs give a brief overview of this newly established field.

Cressie (1991a) defined a general spatial model. Let $s \in R^d$ be a generic sampling site in $d$-dimensional Euclidean space and $X(s)$ the response vector at $s$. As $s$ varies over the index set $D \subset R^d$, $\{X(s) : s \in D\}$ represents a multivariate random field (or, alternatively, process). According to the form of $D$, spatial data analyses can be classified into four types. When $D$ is a fixed, non-degenerate convex subset of $R^d$ and $X(s)$ is a random vector at site $s \in D$, such analysis is called geostatistical data analysis. When $D$ is a fixed collection of countably many points of $R^d$ and $X(s)$ is a random vector at site $s \in D$, it is called lattice data analysis.
When $D$ is a random point process and $X(s)$ is a random vector at site $s \in D$, it is called point pattern data analysis. When $D$ is a random point process and $X(s)$ is a random set, it is called object data analysis. Cressie (1991a) gives a comprehensive survey of spatial data analyses.

An early form of spatial data occurred as a data map centuries ago (Halley 1686). Recently, spatial data have been seen in many applications. For example, the grade of an ore deposit is measured at various sites spread over a constrained geographical area. In terms of the general spatial model, $D$ is the constrained geographical area, $d$ is two, $s$ is a two-dimensional vector of the longitude and latitude of a site and $X(s)$ is the measured ore deposit grade at location $s$. Another example is the remotely sensed yield of wheat on earth at a fixed time.

In the ore grade example, the ore grade at a site is likely similar to the grade at a nearby site. However, it will be less similar to that at a faraway site, provided these two sites are not on the same ore deposition ridge. This means an underlying relationship among the ore deposit grades does exist. This underlying relationship is called "spatial correlation". Spatial correlation plays an important role in spatial inference.

While the ore deposition grade does not change over time, observations like humidity, temperature and wind within a region differ from time to time. So they have to be measured not only across sites, but also over time. Such data are called spatial-temporal data. Another example of spatial-temporal data comes from air pollution monitoring. There, air pollution levels, e.g. sulphate or nitrate levels, are observed at different sites and regularly, say hourly, over a segment of time.

Among the four types of spatial analyses, geostatistics is most relevant to the topic of this dissertation. Geostatistics was established in the early 1980s as a hybrid discipline of mining engineering, geology, mathematics and statistics. Its study began with Matheron's early 1960s papers on Kriging, a name given by Matheron after D. G. Krige, a South African mining engineer who developed empirical methods to determine true ore grade distributions based on sampled ore grades. In Kriging, the model of a random process $X(s)$ generally consists of two terms: a trend and an error. The trend term captures the large-scale variation of $X(s)$ and is deterministic. The error term reflects the small-scale variation of $X(s)$ and is a random process. The Kriging approach gives a "best linear unbiased estimator" (BLUE) of the unknown ore grade at sites in the prediction region, using ore-grade samples from neighboring sites and exploiting the correlations among ore grades at different sites. Later, in other applications, many different forms of Kriging were developed. Examples are ordinary Kriging, simple Kriging, universal Kriging, Bayesian Kriging, robust Kriging and Co-Kriging. These Kriging methods also give best linear predictors. Thus Kriging has become synonymous with optimal spatial linear prediction. Before exploring the family of Kriging methods further, we define some basic concepts.

In Kriging, the spatial correlation is expressed in terms of the variogram, defined as $2\gamma(s_1, s_2) = \mathrm{Var}(X(s_1) - X(s_2))$, $s_1, s_2 \in D$. When $E(X(s_1) - X(s_2)) = 0$, $\mathrm{Var}(X(s_1) - X(s_2)) = E[(X(s_1) - X(s_2))^2]$. The equation says that if the mean function of a random field is constant, the variogram and the expected mean squared difference are the same.
Sometimes the variogram goes by other names; for example, it is called the dispersion in Sampson and Guttorp (1992) (SG hereafter). Another important concept in Kriging is intrinsic stationarity. An intrinsically stationary process $X(s)$ has: (i) a constant mean for all $s \in D$; (ii) $\mathrm{Var}(X(s_1) - X(s_2)) = 2\gamma(s_1 - s_2)$, where $\gamma(\cdot)$ is a real, non-negative function on $R^d$. In Cressie (1991a), $2\gamma(\cdot)$ is called the variogram and $\gamma(\cdot)$ the semi-variogram. The concept of a variogram implicitly assumes intrinsic stationarity. If (ii) is replaced by $\mathrm{Cov}(X(s_1), X(s_2)) = C(s_1 - s_2)$, $X(s)$ is a second-order stationary process and the function $C(\cdot)$ is called a covariogram. If, further, $2\gamma(s_1 - s_2) = 2\gamma(\|s_1 - s_2\|)$ (respectively $C(s_1 - s_2) = C(\|s_1 - s_2\|)$), then $2\gamma(\cdot)$ ($C(\cdot)$) is isotropic. The cross-variogram and cross-covariogram between two random fields $X_i(s)$ and $X_j(s)$ have similar definitions.

Intrinsic stationarity is a strictly weaker assumption than second-order stationarity. From $2\gamma(h) = 2(C(0) - C(h))$, we can prove that second-order stationarity implies intrinsic stationarity. However, the opposite is not true. For example, Brownian motion is an intrinsically stationary process but not a second-order stationary process.

In many applications, the variogram (covariogram) is unknown and has to be estimated. There are both parametric and nonparametric approaches for estimating a variogram within a constrained region. Usually a parametric approach involves two steps. First, lags $h_1, \ldots, h_n$ are chosen and sample variograms (covariograms) at these lags are estimated using the observed data. Second, a proper, often isotropic, variogram (covariogram) model is chosen and the model parameters are determined by fitting it to the sample variograms (covariograms). Matheron (1962) proposes a natural way to compute a sample variogram (covariogram) by the method of moments. His estimator is unbiased but not robust. Cressie and Hawkins (1980) propose two robust sample variogram estimators.

For choosing a proper variogram (covariogram) model, precautions are needed. For example, a variogram model should satisfy the conditional negative-definiteness condition: $\sum_{i=1}^{k}\sum_{j=1}^{k} a_i a_j\, 2\gamma(s_i - s_j) \le 0$ for any real vector $a \in R^k$ with $\sum_{i=1}^{k} a_i = 0$, any $s_1, \ldots, s_k \in D$ and any integer $k$. Similarly, a covariogram model must satisfy the positive-definiteness condition: $\sum_{i=1}^{k}\sum_{j=1}^{k} a_i a_j\, C(s_i - s_j) \ge 0$ for any real vector $a \in R^k$, any $s_1, \ldots, s_k \in D$ and any integer $k$. There are other considerations too. For example, a variogram model should be able to reflect the nugget effect that observed data may present, a nugget being defined as a variogram's non-zero limit at lag zero. Many isotropic variogram models have been proposed. Examples of such models are the rational quadratic model, $\gamma(h) = \sigma^2 \|h\|^2/(1 + \|h\|^2)$, $h \in R^d$ (Schoenberg 1938); the Gaussian model, $\gamma(h) = \sigma^2\{1 - \exp(-\|h\|^2)\}$, $h \in R^d$; the linear model; the exponential model and the spherical model (Journel and Huijbregts 1978). A covariogram model is chosen similarly. A natural model-fitting criterion is "least squares". Other possible criteria include maximum likelihood, restricted maximum likelihood and minimum norm quadratic. When the observed data do not conform to a stationarity or isotropy condition, measures are taken to make the data stationary or isotropic. For example, Haas (1990a, 1990b, 1992, 1993) adopts a moving window approach so that all the observations inside a circular window are approximately isotropic.
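The two-step parametric fitting just described can be made concrete. The sketch below, on synthetic data, computes Matheron's method-of-moments sample semi-variogram and fits an exponential model to it by least squares; the binning scheme, function names and parameter values are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def empirical_variogram(coords, values, n_bins=15):
    # Matheron's method-of-moments estimator: the sample semi-variogram at
    # lag h is half the average of (X(s_i) - X(s_j))^2 over site pairs
    # whose separation distance falls in the bin around h.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)        # each pair counted once
    dists, sqdiffs = d[iu], sq[iu]
    edges = np.linspace(0.0, dists.max(), n_bins + 1)
    lags, gammas = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (dists > lo) & (dists <= hi)
        if m.any():
            lags.append(dists[m].mean())
            gammas.append(0.5 * sqdiffs[m].mean())
    return np.array(lags), np.array(gammas)

def exponential_model(h, nugget, sill, range_):
    # One of the standard isotropic models mentioned above.
    return nugget + sill * (1.0 - np.exp(-h / range_))

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(80, 2))     # synthetic site locations
values = np.sin(coords[:, 0]) + 0.3 * rng.standard_normal(80)
lags, gammas = empirical_variogram(coords, values)
params, _ = curve_fit(exponential_model, lags, gammas, p0=[0.1, 1.0, 2.0])
```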
SG describe a nonparametric approach to estimating a spatial dispersion matrix when the observed data are not stationary. Here a dispersion matrix has the same meaning as a variogram matrix, except that no stationarity assumption is implicitly made. Their method takes two steps. First, with the "nonmetric multidimensional scaling" (MDS) algorithm (see Mardia, Kent and Bibby 1979), a two-dimensional representation of the sampling sites is sought. In this two-dimensional Euclidean space, called the D-plane, a monotonic function, $g$, of the distance between any two points approximates the spatial dispersion between the two points. As a counterpart of the D-plane, the plane of the geographical coordinates of the sampling sites is called the G-plane. Second, thin-plate splines, $f$, are found to provide smooth mappings of the G-plane representation into the D-plane representation. Then the composition of $f$ and $g$ yields a nonparametric estimator of $\mathrm{var}(Z(x_a) - Z(x_b))$ at any two geographic locations $x_a$ and $x_b$. Other nonparametric estimation methods for a spatial covariance matrix are discussed, for example, in Loader and Switzer (1991), Leonard and Hsu (1992), Le and Zidek (1992) and Pilz (1991). While both adopt Bayesian approaches, Le and Zidek (1992) take conjugate priors, inverted Wishart, on the spatial covariance matrix, while Leonard and Hsu (1992) propose a class of prior distributions other than the inverted Wishart.

With the above preliminaries, we can now summarize ordinary Kriging as follows. Assume a model $X(s) = \mu + \delta(s)$, $s \in D$, $\mu \in R$, where $D$ is a convex subset of $R^d$, $\delta(\cdot)$ is a stochastic process and $\mu$ is an unknown, constant scalar. Kriging searches for an optimal linear predictor of any unobserved value $X(s_0)$ within the family of linear functions of $X(s_i)$, $i = 1, \ldots, n$. The predictor therefore takes the form $p(X; s_0) = \sum_{i=1}^{n} \lambda_i X(s_i)$ under the restriction $\sum_{i=1}^{n} \lambda_i = 1$. The restriction makes $p(X; s_0)$ unbiased, since $E(p(X; s_0)) = E(X(s_0))$.

Suppose $X(s)$ is an intrinsically stationary process. By minimizing

$$E\Big(X(s_0) - \sum_{i=1}^{n}\lambda_i X(s_i)\Big)^2 + 2m\Big(1 - \sum_{i=1}^{n}\lambda_i\Big), \qquad (1.1)$$

the optimal choices of $\lambda = (\lambda_1, \ldots, \lambda_n)^t$ and $m$ have the forms

$$\lambda = \Gamma^{-1}\Big(\gamma + \mathbf{1}\,\frac{1 - \mathbf{1}^t\Gamma^{-1}\gamma}{\mathbf{1}^t\Gamma^{-1}\mathbf{1}}\Big)$$

and

$$m = -\,\frac{1 - \mathbf{1}^t\Gamma^{-1}\gamma}{\mathbf{1}^t\Gamma^{-1}\mathbf{1}},$$

where $\gamma = (\gamma(s_0 - s_1), \ldots, \gamma(s_0 - s_n))^t$ is an $n \times 1$ vector and $\Gamma$ is an $n \times n$ matrix whose $(i,j)$th element is $\gamma(s_i - s_j)$.

When $X(s)$ is second-order stationary, the above optimal solution can be expressed in terms of covariograms. The solutions for $\lambda$ and $m$ are obtained by replacing $\gamma$ with $c$, where $c = (C(s_0 - s_1), \ldots, C(s_0 - s_n))^t$, $\Gamma$ with $C = (C(s_i - s_j))$ and $m$ with $-m$. If the variogram or covariogram function is known, ordinary Kriging stops here. If unknown, it is estimated by either a parametric or a nonparametric method.

When $\mu$ has a more complicated form, other Kriging approaches are used. In simple Kriging (Matheron 1971), the model is $X(s) = \mu(s) + \delta(s)$, where $\mu(\cdot)$ is known. The best linear predictor is sought within the family $p(X; s_0) = \sum_{i=1}^{n}\lambda_i X(s_i) + k$, subject to an unbiasedness restriction. By minimizing $E(X(s_0) - p(X; s_0))^2$ over $\lambda_i$, $i = 1, \ldots, n$ and $k$, the optimal solution is $\lambda = C^{-1}c$ and $k = \mu(s_0) - \sum_{i=1}^{n}\lambda_i\mu(s_i)$, where $c = (C(s_0, s_1), \ldots, C(s_0, s_n))^t$ and $C = (C(s_i, s_j))$.

Universal Kriging is introduced when a more general form of $\mu$ is assumed. That general form is $X(s) = \sum_{j=1}^{p+1} f_{j-1}(s)\beta_{j-1} + \delta(s)$, where $\beta = (\beta_0, \ldots, \beta_p)^t \in R^{p+1}$ is an unknown vector of parameters, $f_{j-1}(\cdot)$, $j = 1, \ldots, p+1$, are known functions and $\delta(\cdot)$
is a zero mean, intrinsically stationary process with variogram $2\gamma(\cdot)$. The form of the best linear unbiased predictor is $p(X; s_0) = \sum_{i=1}^{n}\lambda_i X(s_i)$, subject to $X_f^t\lambda = x_f$, where $\lambda = (\lambda_1, \ldots, \lambda_n)^t$, $X_f$ is an $n \times (p+1)$ matrix whose $(i,j)$th element is $f_{j-1}(s_i)$, and $x_f = (f_0(s_0), \ldots, f_p(s_0))^t$. If the variogram is known, the solutions $\lambda_i$, $i = 1, \ldots, n$, are easily derived. If unknown, it needs to be estimated. A problem occurs when one estimates the variogram: since in universal Kriging $X(s)$ is not intrinsically stationary, the sample variograms computed from the observed data $X(s_i)$, $i = 1, \ldots, n$, by the formulas of Matheron (1962) and Cressie and Hawkins (1980) are biased. To obtain unbiased variogram estimators, the residuals of $X(s)$ must be known; but they are unknown, since $\beta$ is generally unknown. To bypass this dilemma, Neuman and Jacobson (1984; see also Haas 1993) propose an iterative method starting with ordinary least squares (o.l.s.). Note that the model for the vector $X^t = (X(s_1), \ldots, X(s_n))$ can be rewritten in matrix form, $X = Z\beta + \delta$, where $Z$ is an $n \times (p+1)$ matrix with $(i,j)$th element, $i = 1, \ldots, n$, $j = 0, 1, \ldots, p$, being $f_j(s_i)$; $\beta$ is the vector of regression coefficients and $\delta$ is a stochastic process vector. The iterative procedure is as follows. First, estimate $\beta$ by $\hat\beta_{ols} = (Z^tZ)^{-1}Z^tX$ and obtain the residuals $X - Z\hat\beta_{ols}$. Second, estimate $\hat\Sigma$ from the current residuals and update $\hat\beta$ by the generalized least squares (g.l.s.) estimator $\hat\beta_{gls} = (Z^t\hat\Sigma^{-1}Z)^{-1}Z^t\hat\Sigma^{-1}X$, updating the residuals $X - Z\hat\beta_{gls}$ accordingly. Repeat the above procedure until it converges.

In the previous Kriging approaches, an estimate of $X(s_0)$ is computed using information on the same process, $X(s_i)$, $i = 1, \ldots, n$. In some applications, additional information is available: realizations of other correlated random processes are observed. CoKriging was developed to bring this additional information into the BLUE predictor. More specifically, the observed data set is $X = (X(s_1), \ldots, X(s_n))^t$, an $n \times k$ matrix with $(i,j)$th element $X_j(s_i)$, $i = 1, \ldots, n$, $j = 1, \ldots, k$, and one needs to predict $X_1(s_0)$ using $X$. Suppose $E(X(s)) = \mu = (\mu_1, \ldots, \mu_k)^t$, $s \in D$, and $\mathrm{cov}(X(s), X(u)) = C(s, u)$, $s, u \in D$, where $C(s, u)$ is a $k \times k$ matrix. The best linear CoKriging predictor of $X_1(s_0)$ takes the form $p_1(X; s_0) = \sum_{i=1}^{n}\lambda_i^t X(s_i)$ with $\sum_{i=1}^{n}\lambda_{i1} = 1$ and $\sum_{i=1}^{n}\lambda_{ij} = 0$, $j = 2, \ldots, k$. The remaining steps are the same as those for the other Kriging methods.

Applications of the CoKriging method in environmetrics, soil science and hydrology can be found in Haas (1993), Yates and Warrick (1987), as well as Ahmed and de Marsily (1987).
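To make the Kriging systems above concrete, the following sketch solves the ordinary-Kriging equations that follow from Equation (1.1) in their usual bordered-matrix form. The exponential semivariogram and all inputs are hypothetical stand-ins, not data from the thesis.

```python
import numpy as np

def ordinary_kriging_weights(coords, s0, semivariogram):
    # Solve the system from Equation (1.1): the weights lambda and the
    # Lagrange multiplier m satisfy
    #     [Gamma 1; 1^t 0] [lambda; m] = [gamma; 1].
    n = len(coords)
    G = np.zeros((n + 1, n + 1))
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    G[:n, :n] = semivariogram(d)
    np.fill_diagonal(G[:n, :n], 0.0)   # gamma(0) = 0 by definition
    G[:n, n] = 1.0                     # column for the Lagrange multiplier
    G[n, :n] = 1.0                     # unbiasedness: weights sum to one
    rhs = np.append(semivariogram(np.linalg.norm(coords - s0, axis=1)), 1.0)
    sol = np.linalg.solve(G, rhs)
    return sol[:n], sol[n]             # weights lambda, multiplier m

# Usage with a hypothetical exponential semivariogram and synthetic data:
gamma = lambda h: 0.1 + 0.9 * (1.0 - np.exp(-h / 2.0))
rng = np.random.default_rng(2)
coords = rng.uniform(0, 10, size=(25, 2))
x_obs = rng.standard_normal(25)
w, m = ordinary_kriging_weights(coords, np.array([5.0, 5.0]), gamma)
x0_hat = w @ x_obs                     # the BLUE of X(s0); sum(w) equals 1
```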
Haas (1993) proposes a Moving Window Regression Residual CoKriging (MWRRCK) method for predicting pollution carried in rainfall. There, the observed information includes: (i) wet deposition of pollutants monitored at 200 sampling sites in the US; (ii) observations of precipitation at over 6000 sites of the National Weather Service (NWS) network. We outline the MWRRCK method below. First, a moving circular window centered at a prediction site is selected to achieve local isotropy. The radius of the circular window is chosen so that the total number of monitoring sites inside the window is not less than a predetermined value; this number is set to make the sample variograms reasonably accurate. Second, the spatial trend surface (respectively, spatial covariance matrix) in the window is removed (respectively, estimated) through an iterative o.l.s.-g.l.s. procedure. Third, a regular CoKriging method is applied to the residual process to estimate the residual at the prediction site. The final prediction is the sum of the predicted trend and the predicted residual.

In Haas's theory, care has been taken to make the covariance matrix within a window positive definite. However, the covariance matrix between windows need not be positive definite. The problem arises when two prediction sites are geographically close and one of the two sites falls inside the circular window of the other. Because a different fitted semivariogram and cross-semivariogram model is obtained for each window, it can happen that the covariance matrix between these two sites is negative.

Although Kriging is an appealingly simple method, it tends to understate the uncertainties of predicted values, since model uncertainty is not included. A simple example given in Handcock and Stein (1993) shows how much effect the model uncertainty has on the confidence interval (CI) of a predicted value. There, the authors showed that if the uncertainty of the unknown spatial covariance is not included, the 95% CI of a predicted elevation based on measured elevations is (699, 707). When it is included by using a Bayesian approach, the 95% CI becomes (694, 713). This example clearly supports the use of a Bayesian approach to Kriging. Such Kriging methods are described as Bayesian.

Bayesian Kriging is relatively new, and not much work has been done on it in geostatistics. For a general model $X(s) = \mu(s) + \delta(s)$, $s \in D$, a Bayesian approach can be adopted by assuming $\mu(s)$ to be a random process that is independent of $\delta(\cdot)$. More specially, $\mu(s) = \sum_{j=1}^{p+1} f_{j-1}(s)\beta_{j-1}$, $s \in D$, where the $\beta_j$, $j = 0, \ldots, p$, are random variables (see Nather 1985, Kitanidis 1986, and Omre and Halvorsen 1989). For example, Omre and Halvorsen (1989) describe a version of Bayesian Kriging that puts a prior on the mean function only. Their method appears to be a direct extension of the traditional Bayesian linear model to the case of spatial data. An empirical Bayesian predictor is obtained if the parameters of the prior are estimated from the data and substituted into the Bayesian predictor of $X(s_0)$. Similarly, one can assume the covariogram varies in the space of all positive-definite functions $PD = \{C(s, u) : s, u \in D\}$ and put a (prior) distribution on $PD$. Or one can assume a structural model on $C(\cdot)$, say $C(s, u; \theta)$, and put a prior on $\theta$. For the latter case, the predictive density is

$$f(X(s_0) \mid X) = \int_{\Theta} f(X(s_0) \mid X, \theta)\, f(\theta \mid X)\, d\theta.$$

One example of such an approach is proposed by Handcock and Stein (1993). In Handcock and Wallis (1994), the method is applied to model meteorological fields in an attempt to assess the global warming problem from a statistical point of view.

One competitor to Kriging is the smoothing spline (Wahba 1990a, 1990b). The spline method can be briefly summarized as follows. For observed data $X_i$, $i = 1, \ldots, n$, the assumed model is $X_i = f(s_i) + \epsilon_i$, where the $s_i$, $i = 1, \ldots, n$, are the locations of measurement, $f(\cdot)$ is a smooth function and the $\epsilon_i$, $i = 1, \ldots, n$, are independently, identically distributed (i.i.d.) errors. For each value of $\lambda$, the smoothing function $f_\lambda(s)$ is estimated by minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\{X_i - f(s_i)\}^2 + \lambda\int [f''(x)]^2\, dx$$

over all $f$ with continuous first and square-integrable second derivatives. The smoothing parameter $\lambda$ is chosen by "generalized cross-validation" (Wahba 1990b). The interpolated value at any location $s_0$ is taken as $\hat f_\lambda(s_0)$.
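The spline fit can be illustrated with a one-dimensional sketch on synthetic data. scipy's UnivariateSpline takes a smoothing factor rather than $\lambda$ directly, and a plain k-fold cross-validation stands in here for the generalized cross-validation of Wahba (1990b); all names and values below are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Noisy observations X_i = f(s_i) + e_i along a 1-D transect.
rng = np.random.default_rng(1)
s = np.sort(rng.uniform(0.0, 10.0, 60))
x = np.sin(s) + 0.3 * rng.standard_normal(60)

def cv_score(smooth, k=5):
    # Crude k-fold cross-validation score for the smoothing factor,
    # standing in for generalized cross-validation.
    folds = np.arange(len(s)) % k
    err = 0.0
    for j in range(k):
        tr, te = folds != j, folds == j
        fit = UnivariateSpline(s[tr], x[tr], s=smooth)
        err += np.mean((x[te] - fit(s[te])) ** 2)
    return err / k

grid = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
best = min(grid, key=cv_score)
f_hat = UnivariateSpline(s, x, s=best)
x0 = f_hat(3.7)       # interpolated value at an unmonitored location s0
```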
Oehlert (1993) gives one example that combines a smoothing spline technique with a Bayesian approach. There, Oehlert proposes a multiple time series model for data that have both temporal and spatial correlations. The smoothing spline is used when the mean and trend are extended from the rectangles containing the monitoring sites to rectangles with no monitors. Many empirical comparisons have suggested that the interpolation performances of spline methods and Kriging methods are similar (see Laslett (1994) for references to these comparisons). Laslett (1994) demonstrates that in certain cases Kriging surpasses splines by a big margin.

When the observed data have a spatial-temporal form, Kriging faces another rival, a "hierarchical Bayesian time series approach" developed by Le and Zidek (1992) (LZ hereafter). LZ assume independent responses over time, a common unknown spatial covariance structure at each fixed time point, and conjugate priors on both the trend parameters and the spatial covariance. As an alternative to Kriging, LZ's method has many advantages. For example, it incorporates the model uncertainty into its interpolator. It uses incoming information to dynamically update estimation, and it gives a predictive distribution that enables the construction of a simultaneous band for interpolated values. In its original form, however, the LZ method does not allow missing data. In this thesis, we extend the LZ method to include missing data. We develop and explain the extension in the context of interpolating air pollution.

Let us first describe this context. Assume there are $s$ sites scattered over a constrained region. At each site, $k$ response values are measured. Examples of such measurements are air pollution levels, like sulphate or nitrate levels. Among the $s$ sites, $s_u$ sites are ungauged: there are no observations there, but air pollution levels are needed. The other $s_g$ sites are gauged: there are observations. In the case of missing data, some values are missing at the gauged sites. The missing data patterns discussed in the next chapter are called randomly missing, missing-by-design and monotone missing. Here we use the term "randomly missing" to mean that the probability of a value being missing is not related to its value (see Little and Rubin 1987). The term "monotone missing" is also from Little and Rubin (1987). The meanings of the three missing data patterns are explained in the next chapter.

Let $X_t$ denote the complete, random response vector for all sites (both gauged and ungauged) at time $t$, where the first $k$ elements represent pollution levels for $k$ pollutants at site one, the second $k$ elements for site two, and so on. Thus $X_t$ is an $sk$-dimensional vector. The inferential objective treated in this thesis is to interpolate unobserved pollution levels at ungauged sites using incomplete observations at gauged sites. As in LZ, we assume a linear, multivariate Gaussian model and conjugate priors on its parameters. Interpolation with missing data consists of two parts. First, fixing the hyperparameters, we find a predictive distribution and its posterior mean along with a standard error. Second, we develop an estimation procedure for the hyperparameters. Further, in two steps we estimate the hypercovariance matrix of $X_t$. In step one, we adopt an EM algorithm to estimate the hypercovariance matrix at the gauged sites. In step two, we apply the SG approach to extend this hypercovariance matrix to all sites.

This thesis consists of six chapters.
In Chapter 2, we describe three different interpolation theories depending on the missing data patterns. The patterns are: (i) missing-by-design; (ii) randomly missing; (iii) monotone. There, we fully develop the theory of interpolation with data missing-by-design; we briefly discuss the theory of interpolation with randomly missing data and only touch on the theory of interpolation with monotone missing data. In Chapter 3, we apply the theory of interpolation with data missing-by-design to Southern Ontario pollution data; we implement it with S and C programs and carry out a residual analysis. In Chapter 4, by combining the theory developed in Chapter 3 and the theory developed in Caselton, Kan and Zidek (1992), we show how to apply our results to an optimal network redesign problem. In Chapter 5, we compare the theory of "interpolation with data missing-by-design" with the general theory of LZ and also with Haas's CoKriging method. In Chapter 6, we draw conclusions and list some future research topics. All figures referred to in Chapter 3 and Chapter 5 are collected in Appendix C. We attach examples of the S and C programs in Appendices A and B.

Chapter 2

Bayesian Multivariate Interpolation with Missing Data

2.1 Introduction

The problem of interpolating a spatial field has received a lot of attention. Kriging offers a well-known solution, but it has deficiencies. In particular, it overstates the precision of its interpolator because it fails to reflect the model uncertainty (LZ; Brown, Le and Zidek 1994a, hereafter BLZ). To avoid these deficiencies, LZ propose a hierarchical Bayesian alternative to Kriging. The LZ method takes a Bayesian time series approach with a Gaussian linear model and conjugate priors for the model parameters. Such a Bayesian approach incorporates model uncertainty and yields a heavier-than-normal tailed posterior distribution, the multivariate t. LZ's Bayesian alternative also has the advantage of dynamically updating the predictor as more observations come in. LZ developed their theory in the univariate case where, at each of $s$ sites, only one air pollutant is monitored.

BLZ extend the LZ Bayesian interpolation theory to the multivariate case, in which $k$ air pollutants are monitored at each site. BLZ adapt the original LZ Bayesian theory by stacking the rows of the $s \times k$ response matrix into an $sk \times 1$ vector, where each row represents the $k$ measurements at a site. To reduce the total number of hyperparameters in the prior distributions, BLZ assume a Kronecker product structure on the hypercovariance matrix. Then BLZ describe an algorithm for hyperparameter estimation. But the BLZ theory has an important restriction: it does not permit missing values at gauged sites.

In this chapter, theories of Bayesian multivariate interpolation with different patterns of missing data are discussed. These patterns are: (i) missing-by-design; (ii) randomly missing; and (iii) monotone. In all three cases, we follow the hierarchical Bayesian approach of LZ. In the case of data missing-by-design, the proposed Bayesian interpolation method is an alternative to Co-Kriging. In a Bayesian analysis, a predictive distribution plays a key role, and we need to derive the predictive distribution of the unknown pollution levels at ungauged sites given the observations at gauged sites.

Section 2 gives a brief review of the LZ theory. Section 3 serves as a technical preparation for the following sections.
There, we define a matrix T-distribution, which plays a pivotal role in our inference, and explore its normal approximation. Section 4 spells out the theory of Bayesian multivariate interpolation with data missing-by-design. Section 5 gives a brief discussion of interpolation theory with randomly missing data; there, we describe an approximate predictive distribution and the estimation of hyperparameters. Section 6 concerns interpolation theory with monotone missing data patterns; there, only a recursive predictive distribution is described. We put the proofs of most theorems, lemmas and corollaries appearing in this chapter in the last section.

2.2 Bayesian Interpolation

In this section, the LZ Bayesian interpolation theory is briefly summarized. Let $X_t$ be an $s$-dimensional random vector at time $t$, where the first $s_u$ elements, denoted by $X_t^{(u)}$, correspond to the unobserved responses at the $s_u$ ungauged sites. The remaining $s_g$ elements, denoted by $X_t^{(g)}$, correspond to the observed responses at the $s_g$ gauged sites. Assume:

$$X_t \mid z_t, B, \Sigma \stackrel{independent}{\sim} N_s(Bz_t, \Sigma), \qquad (2.1)$$

where $z_t$ is an $h$-dimensional vector of known covariates at time $t$ and $B$ is an $s \times h$ matrix of regression coefficients. The priors of the unknown parameters $B$, $\Sigma$ are taken as conjugates of the normal model,

$$B \mid B^0, \Sigma, F \sim N_{s \times h}(B^0, \Sigma \otimes F^{-1}), \qquad (2.2)$$

$$\Sigma \mid \Phi, \delta^* \sim W^{-1}(\Phi, \delta^*). \qquad (2.3)$$

Let $A^t$ denote the matrix transpose of $A$. Since $X_t$ is partitioned into $X_t^{(u)}$ and $X_t^{(g)}$, $\Sigma$ and $B$ are partitioned correspondingly as

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}.$$

Define $A = \sum_{t=1}^{n} z_t z_t^t$, $C = \sum_{t=1}^{n} X_t^{(g)} z_t^t$, $\hat B_2 = C A^{-1}$ and

$$S = \sum_{t=1}^{n}\big(X_t^{(g)} - \hat B_2 z_t\big)\big(X_t^{(g)} - \hat B_2 z_t\big)^t.$$

Note $D$ is the set of all observed data.

With a straightforward calculation, the posterior distributions of $B$ and $\Sigma$ are easily found. Lemma 2.1 gives their forms.

Lemma 2.1 The posterior distributions of $B$ and $\Sigma$ are:

$$B \mid D, B^0, F, \Sigma \sim N\big(B^*, \Sigma \otimes (A + F)^{-1}\big),$$
$$\Sigma_{22} \mid D, \Phi_{22}, \delta^* \sim W^{-1}\big(\tilde\Phi_{22}, \delta^* + n - s_u\big),$$
$$\Sigma_{1|2} \mid D \sim W^{-1}\big(\Phi_{1|2}, \delta^*\big),$$
$$\tau_{12} \mid D, \Sigma_{1|2} \sim N\big(\Phi_{12}\Phi_{22}^{-1}, \Sigma_{1|2} \otimes \tilde\Phi_{22}^{-1}\big),$$

where

$$\tau_{12} = \Sigma_{12}\Sigma_{22}^{-1}, \qquad \Phi_{1|2} = \Phi_{11} - \Phi_{12}\Phi_{22}^{-1}\Phi_{21},$$
$$\tilde\Phi_{22} = \Phi_{22} + S + (\hat B_2 - B_2^0)\big(A^{-1} + F^{-1}\big)^{-1}(\hat B_2 - B_2^0)^t,$$
$$B^* = B^0 + (\hat B - B^0)\,A\,(A + F)^{-1}.$$

LZ show that the predictive distribution of a future unknown realization $X_f$ consists of two multivariate t-distributions. These distributions are given in Theorem 2.1. For completeness, the definition of a multivariate t-distribution is repeated here. A multivariate t-distribution, denoted $X \sim t_r(\mu, H, \delta)$, is defined to have the density function

$$f(x) = \frac{\Gamma\big(\frac{\delta + r}{2}\big)}{\Gamma\big(\frac{\delta}{2}\big)\,\pi^{r/2}}\; |H|^{1/2}\,\big[1 + (x - \mu)^t H (x - \mu)\big]^{-\frac{\delta + r}{2}}. \qquad (2.4)$$

Theorem 2.1 Let the posterior distributions of $B$ and $\Sigma$ be as defined in Lemma 2.1. The predictive distribution of $X_f = ((X_f^{(u)})^t, (X_f^{(g)})^t)^t$, given the covariate vector $z_f$ and the hyperparameters $B^0$, $\Phi$, $\delta^*$, $F$, consists of

$$X_f^{(g)} \mid D \sim t_{s_g}\Big(a_{(g)} + b_2,\; \big(d\,\tilde\Phi_{22}\big)^{-1},\; \tilde\delta\Big)$$

and

$$X_f^{(u)} \mid X_f^{(g)} = x,\, D \sim t_{s_u}\Big(a_{(u)} + \Phi_{12}\Phi_{22}^{-1}\big(x - a_{(g)}\big),\; \Big[\big\{c + (a_{(g)} - x)^t\tilde\Phi_{22}^{-1}(a_{(g)} - x)\big\}\,\Phi_{1|2}\Big]^{-1},\; q\Big),$$

where

$$\tilde\delta = \delta^* + n - s + 1, \qquad q = \delta^* - s_u + 1,$$
$$a = \begin{pmatrix} a_{(u)} \\ a_{(g)} \end{pmatrix} = B^0 z_f, \qquad b_2 = (\hat B_2 - B_2^0)\,A\,(A + F)^{-1} z_f,$$
$$c = 1 + z_f^t F^{-1} z_f, \qquad d = 1 + z_f^t (A + F)^{-1} z_f.$$

With the predictive distributions, the Bayesian predictor is simply taken as the posterior mean.

Corollary 2.1 Given the predictive distributions of Theorem 2.1 and $z_f$, the predictive means of $X_f^{(u)}$ and $X_f^{(g)}$ are

$$E\big(X_f^{(u)}\big) = a_{(u)} + \Phi_{12}\Phi_{22}^{-1}\, b_2, \qquad E\big(X_f^{(g)}\big) = a_{(g)} + b_2.$$
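For reference, the density (2.4) as reconstructed here can be evaluated directly. The sketch below computes the log density of $t_r(\mu, H, \delta)$ with the normalizing constant written via log-gamma functions; the function name and inputs are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def mvt_logpdf(x, mu, H, delta):
    # log f(x) for the multivariate t of Equation (2.4):
    # f(x) = c(delta, r) |H|^{1/2} [1 + (x-mu)^t H (x-mu)]^{-(delta+r)/2},
    # with c(delta, r) = Gamma((delta+r)/2) / (Gamma(delta/2) * pi^{r/2}).
    r = len(mu)
    dev = x - mu
    return (gammaln((delta + r) / 2.0) - gammaln(delta / 2.0)
            - 0.5 * r * np.log(np.pi)
            + 0.5 * np.linalg.slogdet(H)[1]
            - 0.5 * (delta + r) * np.log1p(dev @ H @ dev))

print(mvt_logpdf(np.array([0.5, -0.2]), np.zeros(2), np.eye(2), 7.0))
```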
If two normal distributions are independent, Lindley’s result asserts X  AB + N(O,  ). t 2 +A  The above fact will be repeatedly used in the sequel without explicit mention. Besides the normal distribution, the matrix T-distribution plays a pivotal role in the theories developed in this chapter. Its definition and some properties are discussed next. Definition 2.1 A random matrix T : p x q is said to follow the matrix T-distribution,  if its density is expressible as  where  >  1 + tQ’t F t  p(rn-qQp  I  =  0, Qqxq > 0, m > p + q  —  and F()  =  l)f()f( —  ).  .  .  =  F(  (2.5)  1, KPFq  k[m,p,q]  Im,  ((m  T’q  +  —  fi  —  \  m)  ).  In the above theorem, the notation, “P > 0” means that P is positive-definite. An alternative form of f(t) is =  Q  (m-p)  p q  k[rn,p,q] 18  +t p  1 m  (2.6)  Using the notation of Dickey (1967), Press(1982) by T  -‘  T(P, Q, 0, m) or more generally T + C  “  ,  we express the matrix T-distribution T(F, Q, C, m), where C is a constant  matrix.  Lemma 2.2 When q = 1 and  Q  is a scalar, T(F,  Q, 0, m)  is equivalent to a multivariate  t(0, c’,m_p). Proof: By Equation (2.6), f(t)  (Q + tt[ph]1t)m [(m  —  )  +  t  (QP -1 ) 1 rn—p  By (2.5) and (2.6), it is easy to see that t T  r’.  ,P’,O,m). 1 T(Q  (2.7)  In Dickey (1967), a representation of a matrix T is given. The result is copied here. Lemma 2.3 Suppose that 7 U , , F,m  Q  >  0 ,m > p + q  and T  =  —  W(P,m—q), Xpxq  Q  ‘‘  N(0,I®Q), P >0,  1 and that X, U are independent. Let T be a random p x q matrix  (u_4)tx.  Then T has the distribution given in (p.5).  In the Lemma, 0 denotes the Kronecker product. A direct application of the above representation yields the mean of a matrix T. Corollary 2.2 If T  ‘-  T(P, Q, 0, m), the mean of T + C is C, where C is any constant  matrix. Proof: By Lemma 2.3, E(T)  =  E([U1 ) t E(X)  = 0.1  Below, some properties of a matrix T distribution given in Press (1982) are listed without proof. Partition T as 1 IT T=( 19  T: being a p x q matrix, i  =  1, 2, and conformably,  21 \P 11 of dimensions for a matrix P  P1  122  x p. Let P 211  =  22 P  —  21 P 12 P’P  0 be positive  >  definite. Lemma 2.4 Suppose T  T(P,Q,0,m). Then:  1. conditionally T 1 2. marginally 3. 0  7’2  2 T  =  2 t  ‘—‘  T(Pn,  , Q, 0, m 211 T(F  T(P,CQCi,0,m) where  —  , m); t 2 , —PPi t 2 Q + tPii  p);  Opxr =  1 and C TC 1 is any q x r matrix.  Dawid (1981) replaces the notation T(P, Q, 0, m) with T(8, F,  Q).  His notation differs  from that of Dickey in the choice of the “degrees of freedom” parameter, that is, rn—p  =  +q—i. 6 Dawid (1981) has another representation of a matrix T-distribution T that can be defined as the marginal distribution of Tpxq with T and  W(I, m  —  p). In a general form, if Tpxq  I  E  T(I, ‘q, 0, rn) ‘s-’  N(0, I, 0 E)  N(0, P’ 0 E) and >Z  Q  W’(Q,m—p) then T where F,  Q  T(P,  Q, 0, m),  (2.8)  are invertible symmetric square matrices.  Proof: f(t P,Q)  =  ff(t I P,)f(  f  —(m++1)  Q)dS e_4tT[_1  t Pt (t  )] dZ  pt IQ+t The last step is true by integrating with respect to a partial Wishart density function.I  20  Definition 2.2 Let  be a matrix with column vectors, a , 1  ...,  a,,, define:  1 a vec(A)  =  a Let tr(A) denote the trace of matrix A, which is the sum of diagonal elements of A. Here are some useful facts: Lemma 2.5 When the dimensions are proper:  1. vec(A + B)  =  vec(A) + vec(B);  2. (Bt 0 A) vec(Z)  vec(AZB);  3. tr(AB)  =  tr(BA)  ,4. tr(AB)  =  ) t (vec(A v ec(B); )  5. (AOB)(COD)=(AC)Ø(BD); =AØB (AØB) ; 6. t 7. (A ® B)-’  =  A’ ® B’;  8. 
Definition 2.2 Let $A$ be a matrix with column vectors $a_1, \ldots, a_n$; define

$$\mathrm{vec}(A) = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}.$$

Let $\mathrm{tr}(A)$ denote the trace of the matrix $A$, which is the sum of the diagonal elements of $A$. Here are some useful facts:

Lemma 2.5 When the dimensions are proper:

1. $\mathrm{vec}(A + B) = \mathrm{vec}(A) + \mathrm{vec}(B)$;
2. $(B^t \otimes A)\,\mathrm{vec}(Z) = \mathrm{vec}(AZB)$;
3. $\mathrm{tr}(AB) = \mathrm{tr}(BA)$;
4. $\mathrm{tr}(AB) = \big(\mathrm{vec}(A^t)\big)^t\,\mathrm{vec}(B)$;
5. $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$;
6. $(A \otimes B)^t = A^t \otimes B^t$;
7. $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$;
8. if $X$ is a random matrix, $E(\mathrm{vec}(X)) = \mathrm{vec}(E(X))$.

The proof is straightforward and omitted.

Definition 2.3 Assume $Y = (Y_{ij})$ is an $m \times q$ random matrix whose elements are independently, identically distributed as $N(0, 1)$; $M : n \times p$, $A : p \times q$, $B : n \times m$ are constant matrices; and

$$X = M + BYA^t.$$

Then $X$ follows a matrix normal distribution, i.e. $X \sim N_{n \times p}\big(M, (BB^t) \otimes (AA^t)\big)$.

The density function of a matrix normal $X$ is

$$f(x) = (2\pi)^{-\frac{np}{2}}\, |W|^{-\frac{p}{2}}\, |V|^{-\frac{n}{2}}\, e^{-\frac{1}{2}\mathrm{tr}\left[W^{-1}(x - M)V^{-1}(x - M)^t\right]}, \qquad (2.9)$$

where $W = BB^t > 0$ and $V = AA^t > 0$.

By Lemma 2.5,

$$\big[\mathrm{vec}((X - M)^t)\big]^t (W \otimes V)^{-1}\,\mathrm{vec}\big((X - M)^t\big) = \mathrm{tr}\big[W^{-1}(X - M)V^{-1}(X - M)^t\big].$$

Thus the following fact is proved.

Lemma 2.6 When the dimensions are proper and $W$, $V$ are invertible, symmetric matrices, the following is true:

$$\big[\mathrm{vec}((X - M)^t)\big]^t (W \otimes V)^{-1}\,\mathrm{vec}\big((X - M)^t\big) = \mathrm{tr}\big[W^{-1}(X - M)V^{-1}(X - M)^t\big].$$

By a direct application of Lemma 2.6 to Equation (2.9), an equivalent relation between multivariate normality and matrix normality is established.

Lemma 2.7 Let $X$ be an $n \times p$ random matrix. Then $X \sim N(M, W \otimes V)$ if and only if $\mathrm{vec}(X^t) \sim N\big(\mathrm{vec}(M^t), W \otimes V\big)$.

As one can see, the covariance matrix of a matrix normal distribution is a Kronecker product of two matrices. Sometimes, for notational simplicity, the notation $X_{p \times q} \sim \mathcal{N}(W, V) + M$, due to Dawid (1981), replaces $X_{p \times q} \sim N(M_{p \times q}, W_{p \times p} \otimes V_{q \times q})$. With this new notation, some facts about the matrix normal distribution are given without proof:

Lemma 2.8 If $X_{p \times q} \sim \mathcal{N}(W, V) + M$, then:

1. $X^t \sim \mathcal{N}(V, W) + M^t$;
2. $CXD \sim \mathcal{N}(CWC^t, D^tVD) + CMD$, where $C$, $D$ are two nonrandom matrices with proper dimensions.

Next, a normal approximation to a matrix T is derived. Suppose $T$ follows a matrix T-distribution, $T_{p \times q} \sim T(P, Q, 0, \delta)$. Define a scaled matrix $T^* = \sqrt{\delta}\, T$. $T^*$ is a matrix analogue of the univariate $t_\delta$. By Lemma 2.3,

$$T^* \stackrel{d}{=} \Big(\Big[\frac{U}{\delta}\Big]^{-\frac{1}{2}}\Big)^t X, \qquad (2.10)$$

where $\stackrel{d}{=}$ means equality in distribution.

Let $\vartheta = \delta - q$. By the definition of a Wishart distribution, there are $\vartheta$ independent $p$-dimensional random vectors $Y_i \sim N(0, P)$ such that

$$U \stackrel{d}{=} \sum_{i=1}^{\vartheta} Y_i Y_i^t. \qquad (2.11)$$

By the multivariate strong law of large numbers (SLLN),

$$\mathrm{vec}\Big(\frac{1}{\vartheta}\sum_{i=1}^{\vartheta} Y_i Y_i^t\Big) \to E\big(\mathrm{vec}(Y_1 Y_1^t)\big) \quad a.s., \quad \text{as } \vartheta \to \infty,$$

where a.s. represents convergence almost surely. Note that $E(\mathrm{vec}(Y_1 Y_1^t)) = \mathrm{vec}(E(Y_1 Y_1^t)) = \mathrm{vec}(P)$. Hence

$$\frac{U}{\vartheta} \to P \quad a.s., \quad \text{as } \vartheta \to \infty.$$

Since, when $p$ and $q$ are fixed, $\vartheta \to \infty$ is equivalent to $\delta \to \infty$, applying Slutsky's theorem we have

$$T^* = \Big(\Big[\frac{U}{\delta}\Big]^{-\frac{1}{2}}\Big)^t X \to P^{-\frac{1}{2}} X \sim \mathcal{N}\big(P^{-1}, Q\big) \quad \text{in distribution, as } \delta \to \infty.$$

Thus the following theorem is proved.

Theorem 2.2 When $p$, $q$ are fixed, as $\delta \to \infty$, $T^* \to \mathcal{N}(P^{-1}, Q)$ in distribution.

Theorem 2.2 is an extension of a similar asymptotic result in the univariate case, namely that $t_\delta \to N(0, 1)$ as $\delta \to \infty$. The fact has been noted in Dickey (1967).

Lemma 2.9 Let $U \sim W(P, \vartheta)$. Then

$$\sqrt{\vartheta}\,\Big(\frac{U}{\vartheta} - P\Big) \stackrel{d}{\to} \mathcal{N}(2P, P), \quad \text{as } \vartheta \to \infty,$$

where $\stackrel{d}{\to}$ denotes convergence in distribution.

The above lemma is applied in the proof of the following theorem, which gives a higher-order asymptotic expansion of a matrix T-distribution.

Theorem 2.3 Let $T^*$ be defined by (2.10), and let $P > 0$, $Q > 0$. Then

$$\sqrt{\vartheta}\,\big(T^* - X^*\big) \stackrel{d}{\to} -\tfrac{1}{2}\,P^{-1} V X^*, \quad \text{as } \vartheta \to \infty,$$

where $X^* \sim \mathcal{N}(P^{-1}, Q)$, $V \sim \mathcal{N}(2P, P)$ and, given the hyperparameters $P$, $Q$, $X^*$ is independent of $V$.

Corollary 2.3 When $\vartheta$ is big,

$$T^* \approx \Big(I - \frac{1}{2\sqrt{\vartheta}}\,P^{-1} V\Big) X^*,$$

where, given $P$, $Q$, the matrices $X^*$ and $V$ are independent and $X^* \sim \mathcal{N}(P^{-1}, Q)$.
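Theorem 2.2 can be checked by simulation. Assuming the vec convention of Lemma 2.7, the sketch below compares the Monte Carlo covariance of $\mathrm{vec}(T^{*t})$ with $P^{-1} \otimes Q$ for a large $\delta$; all settings are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import wishart

# Monte Carlo check of Theorem 2.2: T* = sqrt(delta)*T should approach the
# matrix normal N(P^{-1}, Q), whose vec(T*^t) covariance is kron(P^{-1}, Q).
rng = np.random.default_rng(4)
p, q, delta = 2, 2, 200
P = np.eye(p)
Q = np.array([[1.0, 0.3], [0.3, 1.0]])

def draw_T_star():
    U = wishart.rvs(df=delta - q, scale=P, random_state=rng)
    w, V = np.linalg.eigh(U / delta)
    inv_root = (V / np.sqrt(w)) @ V.T           # (U/delta)^{-1/2}
    X = rng.standard_normal((p, q)) @ np.linalg.cholesky(Q).T
    return inv_root.T @ X                        # Equation (2.10)

sample = np.stack([draw_T_star().ravel() for _ in range(20000)])
print(np.cov(sample, rowvar=False))              # approx kron(inv(P), Q)
print(np.kron(np.linalg.inv(P), Q))
```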
2.4 Interpolation with Data Missing-by-design

In this section, the theory of Bayesian spatial interpolation with data missing-by-design is established for the multivariate case in which there are $s_u$ ungauged sites, $s_g$ gauged sites and $k$ monitored pollutants at each site.

Suppose now that the observations at the gauged sites are pooled from different air pollution monitoring networks. Since, by design, the gauged sites do not all monitor the same subset of pollutants, some pollutants of concern to us may not be monitored at a given site. These pollutants are missing because of the design; we therefore call such "missing" values missing-by-design. More specifically, let $X_t = ((X_t^0)^t, (X_t^{(1)})^t)^t$, where $(X_t^0)_{s_u k \times 1}$ represents the response vector at the ungauged sites at time $t$ and $(X_t^{(1)})_{s_g k \times 1}$ represents the response vector at the gauged sites at time $t$. After a proper rearrangement of its elements, $X_t^{(1)}$ is further partitioned into $(X_t^1)_{l \times 1}$ and $(X_t^2)_{(s_g k - l) \times 1}$, which correspond, respectively, to the vector of missing-by-design pollutants and to the vector of observed pollutants at the gauged sites. Since the same $l$ pollutants are missing during the whole monitoring period, and they comprise $l$ columns of the matrix $X$, they are sometimes referred to as missing columns in the sequel.

Like LZ, a normal model for the conditional sampling distribution is adopted. In terms of a matrix normal distribution, the model can be written as

$$X \mid Z, B, \Sigma \sim N_{sk \times n}\big(BZ, \Sigma \otimes I_n\big), \qquad (2.12)$$

where $X = (X_1, \ldots, X_n)_{sk \times n}$ is the response matrix; $Z = (Z_1, \ldots, Z_n)_{h \times n}$ is the matrix of covariates; $B = (\beta_{ij})_{sk \times h}$ is the coefficient matrix; $\Sigma$ is the unknown spatial covariance matrix of the $X_t$; and $I_n$ is an $n \times n$ identity matrix. The conjugate priors of $\Sigma$, $B$ are

$$B \mid B^0, \Sigma, F \sim N_{sk \times h}\big(B^0, \Sigma \otimes F^{-1}\big) \qquad (2.13)$$

and

$$\Sigma \mid \Phi, \delta^* \sim W^{-1}(\Phi, \delta^*). \qquad (2.14)$$

The random field interpolation theory developed in this section has two steps. In the first step, with the hyperparameters $B^0$, $F$, $\Phi$ and $\delta^*$ assumed fixed, predictive distributions are derived. In the second step, an estimation procedure for the hyperparameters is discussed.

2.4.1 Predictive Distributions and Interpolation

In this subsection, the hyperparameters $B^0$, $F$, $\Phi$ and $\delta^*$ are all fixed and are suppressed in the derivations. The estimation of these hyperparameters is left to the next subsection.

If the indices of the missing columns in $X_t^{(1)}$ are $(r_1, \ldots, r_l)$ and the indices of the observed columns are $(r_{l+1}, \ldots, r_{s_g k})$, let $R_1 = (r_{\cdot 1}, \ldots, r_{\cdot l})$ and $R_2 = (r_{\cdot l+1}, \ldots, r_{\cdot s_g k})$, where $r_{\cdot j}$, $j = 1, \ldots, s_g k$, is an $s_g k$-dimensional vector whose $r_j$th element is one and whose remaining elements are zero. Thus $R_1$ and $R_2$ "mark" the positions of the missing and observed columns. Now let $R = (R_1, R_2)$. Observe that $(X^1)^t = (X^{(1)})^t R_1$ consists of the missing columns. Similarly, $(X^2)^t = (X^{(1)})^t R_2$ consists of the observed columns.

Because the vector $X_t$ is partitioned into three parts, $B$, $\Sigma$, $B^0$ and $\Phi$ are partitioned accordingly. For example,

$$\Sigma = \begin{pmatrix} \Sigma_{00} & \Sigma_{0(1)} \\ \Sigma_{(1)0} & \Sigma_{(11)} \end{pmatrix},$$

where $\Sigma_{00}$ and $\Sigma_{(11)}$ are $s_u k \times s_u k$ and $s_g k \times s_g k$ matrices, respectively; $R^t\Sigma_{(11)}R$ is further partitioned as

$$R^t\Sigma_{(11)}R = \begin{pmatrix} R_1^t\Sigma_{(11)}R_1 & R_1^t\Sigma_{(11)}R_2 \\ R_2^t\Sigma_{(11)}R_1 & R_2^t\Sigma_{(11)}R_2 \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

where $\Sigma_{11}$ and $\Sigma_{22}$ are $l \times l$ and $(s_g k - l) \times (s_g k - l)$ matrices, respectively; and, in general, if $\Sigma^t = (\Sigma_0^t, \Sigma_1^t, \Sigma_2^t)$, then $\Sigma_{(1)}$ means $(\Sigma_1^t, \Sigma_2^t)^t$. The partitions of $B$ and $B^0$ have meanings analogous to that of $\Sigma$.
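The marking matrices $R_1$, $R_2$ are just column selections from an identity matrix. A small sketch, with a hypothetical missing pattern:

```python
import numpy as np

# R1 and R2: columns of the (s_g k)-dimensional identity matrix selecting
# the missing-by-design and observed coordinates, respectively.
sgk = 6
missing = [1, 4]                                 # hypothetical missing indices
observed = [i for i in range(sgk) if i not in missing]

I = np.eye(sgk)
R1, R2 = I[:, missing], I[:, observed]
R = np.hstack([R1, R2])                          # an orthogonal permutation

x1 = np.arange(sgk, dtype=float)                 # a stand-in for X_t^{(1)}
assert np.allclose(R1.T @ x1, x1[missing])       # X^1: missing-by-design part
assert np.allclose(R2.T @ x1, x1[observed])      # X^2: observed part
Sigma_11 = R1.T @ np.eye(sgk) @ R1               # the R1^t Sigma_(11) R1 block
```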
Further, denote

$$\Psi_{(11)} = \begin{pmatrix} \Psi_{11} & \Psi_{12} \\ \Psi_{21} & \Psi_{22} \end{pmatrix} = \begin{pmatrix} R_1^t\Phi_{(11)}R_1 & R_1^t\Phi_{(11)}R_2 \\ R_2^t\Phi_{(11)}R_1 & R_2^t\Phi_{(11)}R_2 \end{pmatrix},$$

with $\Psi_{11}$ being $l \times l$ and $\Psi_{22}$ being $(s_g k - l) \times (s_g k - l)$. Let $\Psi_{1|2} = \Psi_{11} - \Psi_{12}\Psi_{22}^{-1}\Psi_{21}$ and $\tilde\gamma = \Psi_{12}\Psi_{22}^{-1}$ for use in the sequel. Let also

$$B_1^0 = R_1^t B_{(1)}^0 \qquad \text{and} \qquad B_2^0 = R_2^t B_{(1)}^0.$$

When a mean squared loss function is taken, the Bayesian interpolator is simply $E(X_t^0 \mid X^2)$. To find this posterior mean, one needs the predictive distribution function $f(X_t^0 \mid X^2)$; that predictive distribution plays a pivotal role in our Bayesian interpolation theory. As an indirect approach, since

$$E\big(X_t^0 \mid X^2\big) = E\Big(E\big(X_t^0 \mid X^{(1)}\big) \,\Big|\, X^2\Big),$$

one can instead find the predictive distributions $f(X^0 \mid X^{(1)})$ and $f(X^1 \mid X^2)$. Those two predictive distributions are given in the following two theorems.

Theorem 2.4

$$X^0 \mid X^{(1)} \sim T\Big(\Phi_{0|(1)}^{-1},\; I_n + Z^tF^{-1}Z + \big(X^{(1)} - B_{(1)}^0 Z\big)^t\Phi_{(11)}^{-1}\big(X^{(1)} - B_{(1)}^0 Z\big),\; B_0^0 Z + \Phi_{0(1)}\Phi_{(11)}^{-1}\big(X^{(1)} - B_{(1)}^0 Z\big),\; \delta^* + n\Big),$$

where $\Phi_{0|(1)} = \Phi_{00} - \Phi_{0(1)}\Phi_{(11)}^{-1}\Phi_{(1)0}$.

From Corollary 2.2 we obtain the next corollary.

Corollary 2.4

$$E\big(X^0 \mid X^{(1)} = x^{(1)}\big) = B_0^0 Z + \Phi_{0(1)}\Phi_{(11)}^{-1}\big(x^{(1)} - B_{(1)}^0 Z\big).$$

Theorem 2.5

$$X^1 \mid X^2 = x^2 \sim T\Big(\Psi_{1|2}^{-1},\; I_n + Z^tF^{-1}Z + \big(x^2 - R_2^tB_{(1)}^0 Z\big)^t\Psi_{22}^{-1}\big(x^2 - R_2^tB_{(1)}^0 Z\big),\; R_1^tB_{(1)}^0 Z + \tilde\gamma\big(x^2 - R_2^tB_{(1)}^0 Z\big),\; \delta^* - s_u k + n\Big).$$

Corollary 2.5

$$E\big(X^1 \mid X^2 = x^2\big) = R_1^tB_{(1)}^0 Z + \tilde\gamma\big(x^2 - R_2^tB_{(1)}^0 Z\big).$$

With the above two corollaries, the Bayesian interpolator is found.

Corollary 2.6

$$E\big(X^0 \mid X^2 = x^2\big) = B_0^0 Z + \Phi_{0(1)}R_2\Psi_{22}^{-1}\big(x^2 - R_2^tB_{(1)}^0 Z\big).$$

The above corollary shows that the interpolated value $E(X^0 \mid X^2)$ depends only on the observed vector $X^2$, provided the hyperparameters are fixed. Later it will be seen that our estimator of the hyperparameters depends on all observed values $X^2$; therefore the interpolated values $E(X^0 \mid X^2)$ do depend on all observed values.

Although it is difficult to find an analytical form of $\mathrm{Var}(X_t^0 \mid X^2)$ through the iterated-expectation approach, it is possible to find a closed form for $\mathrm{Var}(X_t^0 \mid X^2)$. The following theorem gives that closed form. Before the variance formula is presented, a useful fact, needed in the derivation of the variance formula, is given.

Lemma 2.10 Let $S$ be an invertible matrix,

$$S = \begin{pmatrix} a & A_{12} \\ A_{12}^t & A_{22} \end{pmatrix},$$

where $a$ is a scalar and $A_{22}$ is an $(n-1) \times (n-1)$ matrix. Denote

$$S^{-1} = \begin{pmatrix} b & B_{12} \\ B_{12}^t & B_{22} \end{pmatrix}.$$

Then $\big(b - B_{12}B_{22}^{-1}B_{12}^t\big)^{-1} = a$.

Theorem 2.6 For $t = 1, \ldots, n$,

$$\mathrm{Var}\big(X_t^0 \mid X^2\big) = \frac{c_t}{m - 2}\,\Big[A_1\Psi_{1|2}A_1^t + \Phi_{0|(1)}\Big],$$

where

$$(A_1, A_2) = \big(\Phi_{0(1)}\Phi_{(11)}^{-1}R_1,\; \Phi_{0(1)}\Phi_{(11)}^{-1}R_2\big),$$
$$c_t = 1 + z_t^tF^{-1}z_t + \big(X_t^2 - B_2^0 z_t\big)^t\Psi_{22}^{-1}\big(X_t^2 - B_2^0 z_t\big),$$
$$m = \delta^* - s_u k - l + 1.$$

Obviously, when there are no missing columns, the variance is $\Phi_{0|(1)}\,c_t/(m-2)$. When there are missing columns, the variance is roughly increased by the amount $A_1\Psi_{1|2}A_1^t\,c_t/(m-2)$.

As indicated earlier, if the predictive distribution $f(X^0 \mid X^2)$ is found, the Bayesian interpolator can easily be obtained. That approach is pursued here as a way of checking the formulas for the interpolator and its variance. Note that, if the hyperparameters are given, by Model (2.12)-(2.14) and Lemma 2.20 of Section 2.6, the joint distribution of $(X^0, X^2)$ again satisfies a model of the form (2.12)-(2.14), with the coefficient matrix $\big(B_0^t, (R_2^tB_{(1)})^t\big)^t$, the covariance matrix

$$\begin{pmatrix} \Sigma_{00} & \Sigma_{0(1)}R_2 \\ R_2^t\Sigma_{(1)0} & R_2^t\Sigma_{(11)}R_2 \end{pmatrix},$$

the corresponding blocks of $B^0$ as hypermean, and the inverted Wishart prior

$$\begin{pmatrix} \Sigma_{00} & \Sigma_{0(1)}R_2 \\ R_2^t\Sigma_{(1)0} & R_2^t\Sigma_{(11)}R_2 \end{pmatrix} \sim W^{-1}\left(\begin{pmatrix} \Phi_{00} & \Phi_{0(1)}R_2 \\ R_2^t\Phi_{(1)0} & R_2^t\Phi_{(11)}R_2 \end{pmatrix},\; \delta^* - l\right).$$

By following the same path as the proof of Theorem 2.4, the predictive distribution $f(X^0 \mid X^2)$ is easily obtained.

Theorem 2.7 For $t = 1, \ldots, n$,

$$X_t^0 \mid X^2 \sim T\Big(\Phi_{0|2}^{-1},\; c_t,\; B_0^0 z_t + \Phi_{0(1)}R_2\Psi_{22}^{-1}\big(X_t^2 - R_2^tB_{(1)}^0 z_t\big),\; m + s_u k\Big),$$

where $c_t$, $m$ are defined in Theorem 2.6 and $\Phi_{0|2} = \Phi_{00} - \Phi_{0(1)}R_2\Psi_{22}^{-1}R_2^t\Phi_{(1)0}$.

From the above predictive distribution, the same Bayesian interpolator as that of Corollary 2.6 is obtained.
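Corollary 2.6 is a one-line computation once the hyperparameters are partitioned. The sketch below assembles the interpolator from made-up hyperparameter values; every input is a hypothetical stand-in.

```python
import numpy as np

def bayes_interpolator(B0_0, B0_1, Phi_0_1, R2, Psi22, Z, x2):
    # Corollary 2.6:
    # E(X^0 | X^2) = B^0_0 Z + Phi_{0(1)} R2 Psi_22^{-1} (x^2 - R2^t B^0_{(1)} Z)
    resid = x2 - R2.T @ B0_1 @ Z                 # observed-column residuals
    return B0_0 @ Z + Phi_0_1 @ R2 @ np.linalg.solve(Psi22, resid)

rng = np.random.default_rng(5)
suk, sgk, h, n = 2, 4, 1, 3
A = rng.standard_normal((suk + sgk, suk + sgk))
Phi = A @ A.T + (suk + sgk) * np.eye(suk + sgk)  # a positive definite Phi
Phi_0_1 = Phi[:suk, suk:]
R2 = np.eye(sgk)[:, [0, 2, 3]]                   # column 1 missing-by-design
Psi22 = R2.T @ Phi[suk:, suk:] @ R2
B0_0, B0_1 = np.zeros((suk, h)), np.zeros((sgk, h))
Z = np.ones((h, n))
x2 = rng.standard_normal((sgk - 1, n))
print(bayes_interpolator(B0_0, B0_1, Phi_0_1, R2, Psi22, Z, x2))
```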
By Lemma 2.2 and Press (1982), it is easy to see that the variance is

$$\mathrm{Var}\big(X_t^0 \mid X^2\big) = \frac{c_t}{m - 2}\,\Phi_{0|2}. \qquad (2.15)$$

It is not difficult to prove that Equation (2.15) is equivalent to the variance formula of Theorem 2.6.

Lemma 2.11 If $A_1$ is defined as in Theorem 2.6 and $\Phi_{0|2}$ as in Theorem 2.7, the following equation is true:

$$\Phi_{0|2} = A_1\Psi_{1|2}A_1^t + \Phi_{0|(1)}.$$

That verifies the variance formula of Theorem 2.6.

With the variance, only a pairwise confidence interval can be derived. With the posterior distribution of $X^0 \mid X^2$, we can derive a simultaneous region. The next lemma is Theorem 3.2.12 of Muirhead (1982, p. 93) and is copied here.

Lemma 2.12 If $A \sim W(\Sigma, \delta^*)$, where $\delta^*$ is a positive integer, $\Sigma > 0$, $p \ge 1$, and $Y$ is any $p \times 1$ random vector distributed independently of $A$ with $P(Y = 0) = 0$, then $Y^t\Sigma^{-1}Y / Y^tA^{-1}Y \sim \chi^2_{\delta^* - p + 1}$ and is independent of $Y$.

By that lemma, the distribution of the quadratic form of a matrix T is easily found.

Theorem 2.8 If $T \sim T(I, 1, 0, \delta^*)$, then

$$\frac{\delta^* - p}{p}\; T^tT \sim F_{p,\, \delta^* - p}.$$

Let $I_t$ denote the $t$th column of $B_0^0 Z + \Phi_{0(1)}R_2\Psi_{22}^{-1}(X^2 - R_2^tB_{(1)}^0 Z)$, the Bayesian interpolator of $X_t^0$ when there are missing-by-design data. Then a simultaneous region for $X_t^0$ is given in Theorem 2.9.

Theorem 2.9 The $1 - \alpha$, where $0 < \alpha < 1$, simultaneous posterior region about the Bayesian interpolator is a hyper-ellipsoid, defined by the set

$$\big\{X_t^0 : (X_t^0 - I_t)^t\,\Phi_{0|2}^{-1}\,(X_t^0 - I_t) \le b^2\big\},$$

where

$$b^2 = \frac{s_u k\; F_{1-\alpha,\; s_u k,\; \delta^* - l - s_u k + 1}}{\delta^* - l - s_u k + 1}\; c_t$$

and $c_t$ is defined in Theorem 2.6.
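The bound of Theorem 2.9 only needs an F quantile. A sketch, assuming hypothetical values for the hyperparameters:

```python
import numpy as np
from scipy.stats import f as f_dist

def simultaneous_bound(alpha, suk, l, delta_star, c_t):
    # b^2 of Theorem 2.9:
    # b^2 = suk * F_{1-alpha, suk, delta*-l-suk+1} / (delta*-l-suk+1) * c_t
    df2 = delta_star - l - suk + 1
    return suk * f_dist.ppf(1.0 - alpha, suk, df2) / df2 * c_t

# e.g. a 95% simultaneous region with made-up hyperparameter values
print(simultaneous_bound(alpha=0.05, suk=4, l=2, delta_star=30, c_t=1.7))
```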
2.4.2 Estimation of Hyperparameters

In the previous subsection, the hyperparameters $\Phi$, $\delta^*$, $F$, $B^0$ are assumed to be known. Often they are unknown, and to complete the interpolation procedure they must be specified in some way. In a fully Bayesian hierarchical approach, another level of priors is usually laid on these hyperparameters. An alternative approach, suitable when the parameters are not sensitive to their prior specification, entails the use of the empirical Bayesian method, in which the hyperparameters are estimated from the observed data.

In BLZ, $\Phi$ and $\delta^*$ are estimated through maximum likelihood estimation (MLE), and $B^0$, $F^{-1}$ are assumed to be zero. While it is difficult to estimate all of the hyperparameters $\Phi$, $\delta^*$, $B^0$, $F^{-1}$ by MLE, in this subsection we propose two unbiased estimators for $B^0$, $F^{-1}$ and apply the MLE method to estimate $\Phi$, $\delta^*$. For the moment, the MLE of $\Phi$, $\delta^*$ is discussed; the two unbiased estimators of $B^0$, $F^{-1}$ are described at the end of this subsection.

Even with $B^0$, $F^{-1}$ fixed, the direct maximization of the marginal distribution $f(X^2 \mid \Phi, \delta^*)$ is difficult. With the help of the EM algorithm proposed by Dempster, Laird and Rubin (1977), one can instead maximize the distribution $f(X^2, B_{(1)}, \Sigma_{(11)} \mid \Phi, \delta^*)$ (Chen 1979). Let the complete data set be $x$ and assume only part of $x$ is observed; denote the observed part by $y$. If the sampling density

$$f(x \mid \phi) = b(x)\exp\big(\phi^t t(x)\big)/a(\phi)$$

is from a regular exponential family, the $p$th iteration of the EM algorithm consists of: (E-step) estimating the complete-data sufficient statistics $t(x)$ by finding $t^{(p)} = E\big(t(x) \mid y, \phi^{(p)}\big)$; and (M-step) determining $\phi^{(p+1)}$ as the solution of the equation $E(t(x) \mid \phi) = t^{(p)}$. It is proved in C. F. J. Wu (1983) that the limit point $\phi^{(\infty)}$ does maximize the marginal likelihood function $l(\phi \mid y)$.

To apply the above EM algorithm, we need to find the sufficient statistics of $\Phi$ and $\delta^*$. Following the approach of LZ, let

$$\hat B_2 = X^2 Z^t\big(ZZ^t\big)^{-1}$$

and

$$S^2 = X^2\big(I - Z^t(ZZ^t)^{-1}Z\big)\big(X^2\big)^t.$$

Anderson (1984, p. 291) shows that, given $B_{(1)}$ and $\Sigma_{(11)}$,

$$\hat B_2 \mid B_{(1)}, \Sigma_{(11)} \sim N\big(R_2^tB_{(1)},\; \Sigma_{22} \otimes (ZZ^t)^{-1}\big), \qquad (2.16)$$

$$S^2 \mid \Sigma_{(11)} \sim W_{s_g k - l}\big(\Sigma_{22},\; n - h\big), \qquad (2.17)$$

and that $\hat B_2$ and $S^2$ are independent. Since

$$f\big(X^2, B_{(1)}, \Sigma_{(11)} \mid B_{(1)}^0, F^{-1}, \Phi, \delta^*\big) = f\big(X^2 \mid B_{(1)}, \Sigma_{(11)}\big)\, f\big(B_{(1)} \mid B_{(1)}^0, \Sigma_{(11)}, F^{-1}\big)\, f\big(\Sigma_{(11)} \mid \Phi, \delta^*\big)$$
$$= f\big(\hat B_2 \mid B_{(1)}, \Sigma_{(11)}\big)\, f\big(S^2 \mid \Sigma_{(11)}\big)\, f\big(B_{(1)} \mid B_{(1)}^0, \Sigma_{(11)}, F^{-1}\big)\, f\big(\Sigma_{(11)} \mid \Phi, \delta^*\big),$$

where the last step is true because of Equations (2.16) and (2.17). When $B^0$ and $F^{-1}$ are assumed known, the above likelihood function of $\Phi$, $\delta^*$ is proportional to $f(\Sigma_{(11)} \mid \Phi_{(11)}, \delta^*)$, because, by Lemma 2.20 of Section 2.6,

$$\Sigma_{(11)} \mid \Phi_{(11)}, \delta^* \sim W^{-1}\big(\Phi_{(11)},\; \delta^* - s_u k\big).$$

Therefore, the likelihood function is proportional to

$$f\big(X^2, B_{(1)}, \Sigma_{(11)} \mid \Phi_{(11)}, \delta^*\big) \propto C_{\delta^*}\, \big|\Phi_{(11)}\big|^{\frac{\delta^* - s_u k}{2}}\, \big|\Sigma_{(11)}\big|^{-\frac{\delta^* + s_g k + 1 - s_u k}{2}}\, \exp\Big[-\frac{1}{2}\mathrm{tr}\big(\Phi_{(11)}\Sigma_{(11)}^{-1}\big)\Big], \qquad (2.18)$$

where

$$C_{\delta^*}^{-1} = 2^{s_g k(\delta^* - s_u k)/2}\, \pi^{s_g k(s_g k - 1)/4}\, \prod_{i=1}^{s_g k}\Gamma\Big(\frac{\delta^* - s_u k - i + 1}{2}\Big).$$

With the likelihood Equation (2.18), the pair $\big(\Sigma_{(11)}^{-1}, \log|\Sigma_{(11)}|\big)$ is readily identified as the sufficient statistics for $\Phi_{(11)}$, $\delta^*$ (Chen 1979).

To reduce the number of parameters that need to be estimated, BLZ adopt a Kronecker structure, $\Phi = \Lambda \otimes \Omega$, where $\Lambda$ is the between-sites hypercovariance matrix and $\Omega$ the between-pollutants hypercovariance matrix. Since $\Phi_{(11)} = \Lambda_g \otimes \Omega$, where $\Lambda_g$ is the hypercovariance matrix between the gauged sites, by the likelihood function (2.18) only $\Lambda_g$ and $\Omega$ can be estimated. Hence the estimation of $\Phi$ takes two steps. In step one, $\Lambda_g$, $\Omega$ and $\delta^*$ are estimated by MLE through an EM algorithm. In step two, the SG method is applied to extend $\Lambda_g$ to $\Lambda$.

Expectation Step

At the E-step, the posterior expectations of $\Sigma_{(11)}^{-1}$ and $\log|\Sigma_{(11)}|$ are of interest. The next lemma, due to Chen (1979, Theorem 2.1, p. 237), is used in the derivation of the expectations of these sufficient statistics. The lemma is copied without proof.

Lemma 2.13 Given the hyperparameters $\Phi$, $\delta^*$, $B^0$ and $F$,

$$\Sigma_{(11)} \mid X^{(1)} \sim W^{-1}\big(\tilde\Phi_{(11)},\; \delta^* - s_u k + n\big),$$

where

$$\eta = \Phi_{0(1)}\Phi_{(11)}^{-1}, \qquad A = ZZ^t, \qquad \hat E_{(1)} = X^{(1)}Z^t\big(ZZ^t\big)^{-1},$$
$$S_{(1)} = X^{(1)}\big(I - Z^t(ZZ^t)^{-1}Z\big)\big(X^{(1)}\big)^t,$$
$$\tilde\Phi_{(11)} = \Phi_{(11)} + S_{(1)} + \big(\hat E_{(1)} - B_{(1)}^0\big)\big(A^{-1} + F^{-1}\big)^{-1}\big(\hat E_{(1)} - B_{(1)}^0\big)^t.$$

Note that $\tilde\Psi_{(11)} = R^t\tilde\Phi_{(11)}R$ is partitioned like $\Psi_{(11)}$, with $\tilde\Psi_{1|2} = \tilde\Psi_{11} - \tilde\Psi_{12}\tilde\Psi_{22}^{-1}\tilde\Psi_{21}$.

Lemma 2.14 If $W \sim W_p(A^{-1}, n)$, then, given $A$ and $n$,

$$E\big(\log|W|\big) = \sum_{i=1}^{p}\Psi\Big(\frac{n - i + 1}{2}\Big) + p\log 2 - \log|A|,$$

where $\Psi(\cdot)$ represents the digamma function. The digamma function is defined to be the derivative of the log-gamma function. With the above lemma, the following results hold.

Theorem 2.10 The expectation of $\Sigma_{(11)}^{-1}$ is

$$E\big(\Sigma_{(11)}^{-1} \mid X^2, \Phi, \delta^*\big) = R\begin{pmatrix} (\delta^* - s_u k)\,\tilde\Psi_{1|2}^{-1} & -(\delta^* - s_u k)\,\tilde\Psi_{1|2}^{-1}\eta^* \\ -(\delta^* - s_u k)\,(\eta^*)^t\tilde\Psi_{1|2}^{-1} & (\delta^* - s_u k)\,(\eta^*)^t\tilde\Psi_{1|2}^{-1}\eta^* + d_1 \end{pmatrix} R^t,$$

where

$$\eta^* = \tilde\Psi_{12}\tilde\Psi_{22}^{-1} \qquad \text{and} \qquad d_1 = \big(\delta^* - s_u k + n - l - h\big)\,\tilde\Psi_{22}^{-1}.$$

The expectation of $\log|\Sigma_{(11)}|$ is given next.

Lemma 2.15

$$E\big(\log|\Sigma_{(11)}| \mid X^2, \Phi_{(11)}, \delta^*\big) = -s_g k\log 2 - \sum_{i=1}^{l}\Psi\Big(\frac{\delta^* + n - s_u k - l - h - i + 1}{2}\Big) - \sum_{i=l+1}^{s_g k}\Psi\Big(\frac{\delta^* + n - s_u k - h - i + 1}{2}\Big) + \log\big|\tilde\Psi_{1|2}\big| + \log\big|\tilde\Psi_{22}\big|.$$

Maximization Step

When the expectations of the sufficient statistics are found, at the M-step the current values of the hyperparameters are updated with the values that maximize the "likelihood" function. Here, the "likelihood" function is obtained by plugging the expectations of the sufficient statistics into Equation (2.18). However, maximizing the "likelihood" function over $\Phi_{(11)} = \Lambda_g \otimes \Omega$ is not easy. When there is more than one parameter involved in the maximization step, the following lemma leads to an iterated procedure. An advantage of the iterated procedure is that, at each iterated step, only the maximization of a function over a single parameter is required.
Lemma 2.16 If a function g(x, y) is bounded above, the iteration procedure described below leads to a pair (x_0, y_0) that maximizes g(·,·) locally. At the pth iteration, update the current value x^(p) with a value that maximizes the function g(x, y^(p)) and denote the updated x value by x^(p+1); update the current value y^(p) with a value that maximizes the function g(x^(p+1), y) and denote the updated y value by y^(p+1). Then (x_0, y_0) is the limit of (x^(p), y^(p)).

Proof: g(x, y) is bounded above and

  g(x^(p), y^(p)) ≤ g(x^(p+1), y^(p)) ≤ g(x^(p+1), y^(p+1)),

so the sequence g(x^(p), y^(p)) is nondecreasing and bounded above, hence convergent, and its limit is a local maximum of g. ∎

Besides the above lemma, another lemma is used in the maximizing procedure. The lemma is Lemma 3.2.2 of Anderson (1984, p62). It is copied for convenience.

Lemma 2.17 If D > 0, the maximum of

  f(G) = -n log|G| - tr(G^{-1} D),

with respect to positive definite p x p matrices G, exists, occurs at G = (1/n)D, and has the value

  f[(1/n)D] = p n log(n) - n log|D| - p n.

With Lemma 2.16, an iterated procedure is adopted to find Φ_(11) = A_g ⊗ Ω and δ* that maximize Equation (2.18). Since R^t Φ_(11) R = Ψ_(11) and R is orthogonal, maximizing Equation (2.18) over Φ_(11) is equivalent to maximizing the same equation over Ψ_(11).

When δ* is fixed, the log-likelihood of Equation (2.18) is proportional to

  L(A_g, Ω) = (δ* - s_u k) log|Φ_(11)| - tr(Φ_(11) E(Σ_(11)^{-1}))
            = (δ* - s_u k) k log|A_g| + (δ* - s_u k) s_g log|Ω| - tr[(A_g ⊗ Ω) E(Σ_(11)^{-1})].

Again, an iterated procedure is applied to maximize L(A_g, Ω). When Ω is fixed, rewrite tr[(A_g ⊗ Ω) E(Σ_(11)^{-1})] as tr(A_g Q̃). By Lemma 2.17, Â_g = k(δ* - s_u k) Q̃^{-1} maximizes L(A_g, Ω). Similarly, when A_g is fixed, Ω̂ = s_g(δ* - s_u k) G̃^{-1}, where tr[(A_g ⊗ Ω) E(Σ_(11)^{-1})] = tr(Ω G̃), maximizes L(A_g, Ω).

When Φ_(11) is fixed, δ* is updated by maximizing Equation (2.18) over δ*. Taking the first derivative of the logarithm of Equation (2.18) with respect to δ* and setting it to zero gives

  s_g k log 2 + Σ_{i=1}^{s_g k} ψ((δ* - s_u k - i + 1)/2) + E(log|Σ_(11)|) - log|Φ_(11)| = 0.   (2.19)

When A_g, Ω are fixed, finding δ* that maximizes Equation (2.18) is equivalent to finding a solution of Equation (2.19). To show the existence of a solution of Equation (2.19), replace log|Σ_(11)| with E(log|Σ_(11)| | X^(2), Ψ_(11), δ*) of Lemma 2.15 and use the relationship

  log|Φ_(11)| = log|R^t Φ_(11) R| = log|Ψ_{1|2}| + log|Ψ_22|;

Equation (2.19) then becomes

  Σ_{i=l+1}^{s_g k} [ψ((δ* + n - s_u k - h - i + 1)/2) - ψ((δ* - s_u k - i + 1)/2)] = log|Ψ̃_22| - log|Ψ_22|.

Since the gamma function is log-convex, the digamma function is monotonically increasing. Thus the left side of the above equation is always positive whenever n - h > 0, and it decreases to zero as δ* increases to infinity. The right side of the equation is positive and does not depend on δ*. Therefore, a unique solution does exist.

2.4.3 EM Algorithm

By summarizing the above discussion, our EM algorithm becomes:

E-STEP. Given the current values of Ψ_(11) and δ*, compute

  E(Σ_(11)^{-1} | X^(2), Ψ_(11), δ*) = R ( (δ*-s_u k) Ψ_{1|2}^{-1}            -(δ*-s_u k) Ψ_{1|2}^{-1} η*
                                           -(δ*-s_u k) (η*)^t Ψ_{1|2}^{-1}     d_1 ) R^t,

where

  d_1 = (δ* + n - s_u k - l - h) Ψ̃_22^{-1} + (δ* - s_u k)(η*)^t Ψ_{1|2}^{-1} η* + l Ψ_22^{-1},

and

  E(log|Σ_(11)| | X^(2), Ψ_(11), δ*) = -s_g k log 2 - Σ_{i=1}^{l} ψ((δ* - s_u k - i + 1)/2)
    - Σ_{i=1}^{s_g k - l} ψ((δ* + n - s_u k - l - h - i + 1)/2) + log|Ψ_{1|2}| + log|Ψ̃_22|.

M-STEP. Given the current values of E(Σ_(11)^{-1}) and E(log|Σ_(11)|), find the MLE of Φ_(11) = A_g ⊗ Ω and δ* by repeating the following steps until convergence.

Step 1. Given the current A_g^(p) and Ω^(p), represent tr[(A_g^(p) ⊗ Ω) E(Σ_(11)^{-1})] as tr(Ω G̃^(p)) and set Ω^(p+1) = s_g(δ* - s_u k)(G̃^(p))^{-1}.

Step 2. Given the current Ω^(p+1), represent tr[(A_g ⊗ Ω^(p+1)) E(Σ_(11)^{-1})] as tr(A_g Q̃^(p)) and set A_g^(p+1) = k(δ* - s_u k)(Q̃^(p))^{-1}.

Step 3. Given the current Φ_(11) = A_g ⊗ Ω, update δ* by solving the following equation:

  Σ_{i=l+1}^{s_g k} [ψ((δ* + n - s_u k - h - i + 1)/2) - ψ((δ* - s_u k - i + 1)/2)] = log|Ψ̃_22| - log|Ψ_22|.   (2.20)
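Step 3 reduces to one-dimensional root finding. The sketch below, again illustrative Python rather than the thesis's C program, brackets and solves Equation (2.20) with a standard root finder; the /2 arguments follow the Wishart digamma convention of Lemma 2.15, and the determinant inputs are assumed to come from the E-step.

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_delta(sgk, l, suk, n, h, logdet_Psi22_tilde, logdet_Psi22):
    """M-step Step 3: solve (2.20) for delta*.  The left side is a sum of
    digamma differences, monotone decreasing to zero in delta*; the right
    side log|Psi22~| - log|Psi22| is positive, so a unique root exists."""
    rhs = logdet_Psi22_tilde - logdet_Psi22
    def lhs(d):
        i = np.arange(l + 1, sgk + 1)
        return np.sum(digamma((d + n - suk - h - i + 1) / 2.0)
                      - digamma((d - suk - i + 1) / 2.0)) - rhs
    # bracket: lhs -> +inf near the lower admissible bound, -> -rhs < 0 at infinity
    lo, hi = suk + sgk + 1e-6, 1e8
    return brentq(lhs, lo, hi)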
Estimation of B^0 and F^{-1}

The estimation of the hyperparameters B^0 and F^{-1} is based on two unbiased estimators. Suppose that the k air pollutants are labeled from 1 to k. The measurements of the same air pollutant at different sites are usually similar, but the measurements of different pollutants are not. Based on this fact, we assume that B^0 has an exchangeability structure; that is, the hypermean coefficients of the ith air pollutant at all gauged sites are the same and equal to u_i, i = 1, ..., k.

Since

  X^(2) | B_2, Σ_22 = B_2 Z + N(0, Σ_22 ⊗ I_n),

the following lemma is true.

Lemma 2.18 Given B_(1)^0 and Σ_(11), B̂_2 and S_2 are independent.

Proof:

  f(B̂_2, S_2 | B_(1)^0, Σ_(11)) = ∫ f(B̂_2, S_2 | B_(1), Σ_(11)) f(B_(1) | B_(1)^0, Σ_(11)) dB_(1)
    = ∫ f(B̂_2 | B_(1), Σ_(11)) f(S_2 | Σ_(11)) f(B_(1) | B_(1)^0, Σ_(11)) dB_(1)
    = f(B̂_2 | B_(1)^0, Σ_(11)) f(S_2 | Σ_(11)). ∎

The next theorem gives unbiased estimators of B^0 and F^{-1}. Let β̂_v = (Z Z^t)^{-1} Z X_v^t, v = 1, ..., s_g k - l, where X_v is the vth row of X^(2), and let j_v ∈ {1, ..., k} mark the pollutant type of X_v.

Let μ̂_i be the mean of all β̂_v with j_v = i, i = 1, ..., k. Under the exchangeability assumption, μ̂_i is the estimator of u_i. Let

  F̂^{-1} = [(n - h - 2)/(s_g k - l)] Σ_{v=1}^{s_g k - l} (β̂_v - μ̂_{j_v})(β̂_v - μ̂_{j_v})^t / (a_v^t S_2 a_v) - (Z Z^t)^{-1},

where a_v denotes the (s_g k - l)-dimensional unit vector whose vth entry is one.

Theorem 2.11 Given X^(2) and the exchangeability of B^0, μ̂_i, i = 1, ..., k, and F̂^{-1} are unbiased estimators of u_i, i = 1, ..., k, and F^{-1}, respectively.

2.5 Interpolation with Randomly Missing Data

We use the term randomly missing in the same sense as missing at random, defined mathematically by Rubin (1976) and Little and Rubin (1987). More precisely, let Y denote the data that would occur in the absence of missing values. Further, we assume Y = (Y_obs, Y_mis), where Y_obs represents the observed values and Y_mis the missing values. Let f(Y | θ) denote the joint probability density of (Y_obs, Y_mis). We define for each component of Y a missing-data indicator, taking value 1 if the component is observed and 0 if missing. For example, if Y = (Y_ij) is an (n x K) matrix of n observations measured on K variables, define the response indicator R = (R_ij) such that

  R_ij = 1 if Y_ij is observed, and R_ij = 0 if Y_ij is missing.

When we treat R as a random variable and specify the joint distribution of R and Y, the conditional distribution of R given Y, indexed by an unknown parameter ψ, is called the distribution of the missing-data mechanism. Rubin (1976) defines the missing data to be missing at random when the distribution of the missing-data mechanism does not depend on the missing values Y_mis, that is, if

  f(R | Y_obs, Y_mis, ψ) = f(R | Y_obs, ψ).

Little and Rubin (1987) point out that when data are missing at random, likelihood-based inferences can ignore the missing-data mechanism. In other words, likelihood-based inferences can be based on the marginal distribution f(Y_obs | θ) alone.

Let M_1 represent the randomly missing subset of data from X^(1) and M_2 the observed subset of data from X^(1). Let l* denote the total number of elements in M_1. Again, the predictive distribution f(X^0 | M_2) leads to a Bayesian interpolator. Given the hyperparameters, when there are data missing-by-design or no missing data, the predictive distribution follows a matrix T-distribution or a multivariate t-distribution. With randomly missing data, will the predictive distribution f(X^0 | M_2) still follow some form of t-distribution? The following simple example gives a negative answer.

Consider a simple case, where there are two gauged sites and n = 2.
Let T denote the  random matrix at the gauged sites. That is, 22 T  (X 1 I  Y1,Y2 =  y’\  X 2 1I  \Y2  where the upper case characters represent the missing values and lower case characters represent the observed values. Assume, T  -‘.A[(I , x 2 ),  E  W  (12x2,6+2).  Then,  II+T T t =  where a 1  =  X+y+1 2 +y 1 y X +y X+y+1 1 y X 2  =  )X + 1 X (—2yiy (X+ 1)X+ 2  =  4 + (aix + 2 ; a )  2 (X? + 1), a  =  , a X 2 y 1 —y 3  = y  + yy+1+ X  4 + y + y?y + 1 + X? and a  Note,  =y+y+y+X+i4 a  Thus  -± 2  4 + 2 II+TtT_=[(aiX ] )+a  since ,yi,y f(Xj,x ) j°° 2  =1  J—oo  2 dx  f(y1,y2)  ,yi,y Lf(xi,x ) 2 dx L . 2 f(T)dx  42  =  3 a  2 —  Use (2.5) to substitute for f(T), and obtain 1 f(X  = =yl,Y ) LlI+TtT2 y IY’ 2  2 dx -±  /  ±i  4 xa  2  4 cxa  2  2  2+ 1 (aix a 1+ 4 a  fX’  I  oo  2)  t 2 1+— 4 a  a’J  2 dx  -± 2  dt  -± 2 X 4 a 1 a 2  [i+x]  2  The last step is completed by integrating with respect to a partial t density. Consider the special case, 1 f(X 1 f(X  I  Ui  = Y2 =  =  ‘,Y2  1. Then  =  (X +2)(1 +X),  1)  1, 1) is clearly not a t-distribution. Therefore f(X ,X 1 2  multivariate t-distribution. If it did, its marginal f(X 1  I  y’, y2)  y, y2) does not follow a would follow a univariate  t-distribution. Since the search for an exact predictive distribution proves difficult, an alternative ap proach is to derive an approximate predictive distribution. Note that given hyperparam eters, ) 2 , 1 f(X°,M M follows a matrix T-distribution. By Theorem 2.2, f(X°,X(’))  is  approximately normal and then by Lemma 2.7, f(vec(X°), vec(Mi), vec(M )) is approxi 2 mately multivariate normal. General normal theory implies that f(vec(X°) 2 vec(M ) ) is approximately normal. Theorem 2.3 shows that the order of approximation is O((6*)_). The theorem below states an approximate predictive distribution of f(vec(X°) I 2 vec(M ) ). Let R*  —  k 0 (Iskxs  0  -  where R  =  ) and Rvec((X(l))t) 2 R 11 (R  0  =  vec(Mi), Rvec((X(l))t) 43  =  vec(M ) 2 .  Theorem 2.12 f(vec((X°) ) t t ° 0 vec((B ) Z) -f-  vec(M ) 2 ) is approximately normal with mean  (M ExoM ) 2 M  (vec(M2)  —  Rvec((B(°l)Z) ) t )  and covariance —  (x0M X0M ) M 2 , EM  where  Exx  =  [(Z F t ’Z + I) 0 ]/(* + n);  0) Zxx  XOX0 = (‘sujcxs  2 = EXOM  I  o)  k_1*)x(S 9 (0(3 k _1j’\  0) Exx j  J 2 = M  (0  R)Exx (2)  If the loss function is “mean squared error”, the approximately Bayesian interpolator is simply the mean of the above approximate predictive distribution. Following the same approach of the previous section, an empirical Bayesian method is obtained. In other words, all the hyperparameters are estimated using the observations. Since no unbiased estimators of B°, F—’ has yet been found, the estimation of these hyperparameters will not be discussed here. In applications, they will be specified as plausible values.  Hence, in the following discussion, these two hyperparameters are  assumed to be known. Again, the estimation of hyperparameters requires two steps. In step one, an EM algorithm is applied to estimate 6,  9 0 ft In step two, the A  =  SG method is applied to extend A 9 so as to obtain A. In the EM algorithm, the likelihood function is f(X(’)  6). By the same argument  as that of the proof of Theorem 2.12, f(vec((X)t)  I  ,(ii))  44  ). t + vec((B,)Z)  So by Equation (2.9), f(X(’) 1 _bk  Ag  11),  6) is approximately proportional to  (6*  ç  —  sk +  where A and a*  =  6*  —  a*(X  —  B(°l)Z)(Z F t ’Z + I)’(X 1  —  )Z)t 1 B  .sk + n.  1 For the E-step, E(M  I  ) is given by the following lemma. 
2 M  Lemma 2.19 Given hyperparameters, f(vec(Mi)  vec(M ) 2 ,  6) is approximately  normal with mean )+ t f4vec((Brl)Z)  2M 1 M 2  ) 2 (vec(M  —  ) t Rvec((B(°l)Z)  and covariance M 1 EM  —  EM EM , 1 2 M  where = [(ii)  =  F’Z + I)]/(6* t a (Z REx(l)x(l)Rj,i,j  =  —  sk + n);  1,2.  The M-step is similar to that in the previous section. Its discussion will  lot  be repeated  here. The EM algorithm is summarized below. E-STEP: Given the current values of E(vec(Mi) I 2 vec(M ) )  =  6*, B),  Rvec(B ) 1 Z) + 45  E MlM ( 2 M vec(M2) Ij  —  Rvec(Brl)Z))  M-STEP: Given the current values of vec(Mi), update  (11) =  Ag 0 Q,  6K  by repeating the  following steps until it converges. Step 1: Given the current (A’)() and tr((f’)(”)G) and set  (1_1)(7), represent  (c_1)(P+1) =  tr[((A;’)P 0 (1— )())A] 1  as  hsgG’;  Step 2: (1)(i+1),  Given the current (A’)() and  and set (A;’)(’)  =  represent tr[((A)PO  (Q_1)(P+1))  A]  as  hkQ’.  Step 3: Given the current a*  =  =  ,  sgkh/tr[  The estimated 6 is max{sk, a*  2.6  Ag 0  —  update a* by 1 + i) F t B(l)Z)(Z (X’ 1  —  ] t B(l)Z)  — —  Interpolation with Monotone Missing Data Pat terns  Let X be generally partitioned as ((X) , (X) t ,. t (Xflt), i  =  0,.  .  .  ,  \ (i+i)i  tively. Note that  ,  .  (Xfl’) and (X)t  =  ,..., t ((X)  r. E is partitioned correspondingly as  I In the above,  .  L4([i+1][j+1])  1’  z=O,1,...,r—1.  J  are the covariance matrices of X and =  j  =  o,...  >D. A reparameterization is recursively taken as , 1 {P  ([i+i][i+i])},  46  0,..  .  ,  r  —  1,  ,  r respec  where Pj  =  {ici+i’ i(i+1)}=o  and  i(i+l)E([+l][+ll), iI(i+1)  Ti(i+l)  B is partitioned as Bt  =  (B, B,.  = ii  .  —  Tj(j+l)([+lJ[i+lj)Tjt(j+l),  =  0,.  .  .  ,  r  —  1.  , B) accordingly.  .  The lemma below gives the intuition behind the reparameterization. In the lemma, ‘I is partitioned in the same fashion as Lemma 2.20 If  has the prior distribution of (2.14), the following are true: is independent of 1’,,j  1. 2.  [i+1][i+1fl  TV’(([j+i][+i]),  ‘-  =  —  0, 1,.  .  .  , i.  sk + l(i+1)), where l(i+1) is the dimension  of 11 E 3. i  W_l( ( 1 +l),6*  —  sk + l()), where i(i) is the dimension of X and  —  4.  (+ 1 E )  Tq1)  )0 ( 1 E  This result is called Generalized Inverted Wishart Distribution by Brown, Le and Zidek (1994b) (see Caselton, Kan and Zidek (1990), LZ for its earlier forms). Suppose (X))t  =  (U, G), i  =  1,.  .  .  , n, where U! : 1 x d is missing, G : 1 x (sgk  is observed and X’ is the matrix of response vectors at the gauged sites.  1 d  2 d  ...  —  d)  When  d,, the pattern of the missing data {U!} 1 resembles an upside down  staircase. An example of such missing data occurs when at gauged Site 1 the pollutants are measured up to time T , at gauged Site 2 up to T 1 , 2 and when T 1 < T 2  <  ...  .  ..,  at gauged Site n up to T,-,  < T; the missing data of [X(1)]t have the shape of staircase.  Such a missing pattern is called “monotone”; in Little and Rubin (1987). The following theorem gives a recursively predictive distribution. 47  Theorem 2.13 Suppose X, t  =  1,... ,n follow the model of (2.12) and B, E follow  the prior of (2.13), (2.14). Then, given all hyperparameters, {G } 1 ,D  Uj  a, 6 +  T(( ) 2 ( 1 )’,  j  —  j  s)k,  1,.  ..  
,  where: fl  —  —  (Bfl*Z  a a  =  1 + (G  —  =  lVki  fi=1J  (B( ) 2 )ZJ + 22 )(( 2 l( ))Gj;  22 ) 2 ( 1 )) ((  22 ) 2 (B ))_1(Gj )*Zj)t(  —  (Bi))* (B ) 1 )*  jj y(1)j.-1  (1))  ) 1 ) + (B( 1 B  —  ))*Zj) + ZZ; 2 (B  —  (B)* : d x h,  (B ) 2 )*: (s k 9  d) x h;  —  ((2))  ) 1 B(  Si  =  CA7’;  —  =  C,  A’  =  F ( 1 A’ + F)’; ii)  =  = (11)  (  (2)1  + Si +  ) 1 (E  —  =  F  =  (I  —  ) 1 (AT’ + F’)(E t B°l))  : d x d,  22)  : (sgk  —  —  ) 3 d  X  (sgk  —  di).  (22)  For the case of a monotone missing data pattern, finding an EM algorithm estimation procedure for hyperparameters is difficult. As a make-shift measure, we can treat it as a randomly missing data problem; the EM algorithm proposed in the previous section may then be applied.  48  2.7  Proofs of Theorems, Corollaries and Lemmas  Proof of Lemma 2.9 Proof: By (2.11), U 1983  p305)  1’’ and }çy t  =  implies t Cov(vec(YY ) )  =  i.i.d.  N(O, F). Proposition 8.3 (ii) (cf. Eaton  ) t Cov(YY  2P 0 P. With the multivariate  =  central limit theorem, (vec(U)  vec(P))(2P, F).  Since vec(U ) and Ut are only notationally different and a matrix normal distribution t is distributionally equivalent to a multivariate normal distribution, by the definition of convergence in distribution (Billingsley 1968), it is easy to see that  -  P)--A1(2P, F).I  Proof of Theorem 2.3 Since  -  FX)  -  (T*  =  and  x)  F  +  (()  _i) p-x  1  v1(T*_P  ()x) (x  = -  =  By Lemma 2.9:  (-  (ç)  -) (ç) (1+  () () -) () () ‘  as 49  By LLN and Slutsky Theorem, as  —*  o  1  ()r4)  —1  (1+  a.s.;  —*  —* P  2  a.s.;  —+1. So by Slusky theorem (T*  -  ()x)  F-  Ar(2P,P)F’F4Ar(I,Q)  or (T* Given F,  Q,  ()  -  -  x)  , 1 F’)A1(F  Q).  the two normal distributions are independent, since U and X are indepen  dent. Further, since 6  )[()r+i]  —+  The above is true, since  =  6  —  0  as  q and q is fixed.  Reapply Slusky theorem. The conclusion is followed. I Proof of Theorem 2.4 Given B°, F,  ,  6*, (2.12), (2.13) imply  and  BI > B°+N(0,®F’). 50  —*  co.  In turn, EB0Z+N(O,®P)  X  where P = Z F’Z + I. So t Xt  ( Z + t N(O,P®). B0)  Equation (2.14) and (2.8) imply, t X  —  , , 0, 1 T(P  (B’) Z t  =  By Bartlett’s decomposition,  where  = (I \0  and  + n).  I)  A= ‘  0  (oI(1)  \  0  (11)  = o(1)i). Now apply Lemma 2.4 part 3 to get t (X  —  (B0) Z ) ) t _l (  ‘-‘  T(P_l,A, 0, 6* + n).  Then by (2.7)  (  )x=(t1  t2)T((rj1)  )(BO)  where X° = t 1 and X’ = t . By Lemma 2.4 2 tl  —  —  °Z + 1 0 B BZ 7 ? ) (° I t 2  -‘  ), P + (t 1 T( 2  —  j)(t t Bl)Z) 2  —  )Z), 1 B  0,6* +n). The theorem is proved by reapplying the same lemma. I Proof of Theorem 2.5 By (2.12)-(2.14) and Lemma 2.20 t (X(’))  I  (B Z ) 1 t ) + N(0, P ® 51  W  (ii),  6*  —  k)  where P  =  Z F t ’Z + Ij)(fl. Thus  t ) 1 (X  —  t (Bl)Z)  ‘--  T(P’,  (11),  0,  sk + n).  —  Apply Lemma 2.4 part 3: R t (X’)  —  (B(0 ) 1 Z)tR  T(P’, R (ll)R, 0, t  —  .sk + n).  The remaining is the same as in the proof of Theorem 2.4. I Proof of Corollary 2.6 E(X° X ) 2  =  E(E(X°  =  )(E(XU I X 1 BZ + O(l) ) 2  =  Z+ 0 B  The last equation is true since I  ) 2 X  [R E 1 (x1  o(1)(i) =  RRt  I  B’l)Z)  —  X 2 )+R 2 X  2 R 1 R R +R , X’  =  —  Brl)Z].  ) and X 1 RX( 2  =  By Corollary 2.5 E(X’  ) 2 X  =  Z + o(l)jj){Rl (RB(°l)Z + y(X 0 B 2 X 2 +R  =  —  —  RBrl)Z))  Bl)Z}  BZ +  —  R ’ 1 yR  —  I)Bl)Z  -y)X 1 R } 2+2 + (R =  BZ +  —  R R 2 )Brl)Z  )X 1 R } 2+2 + (R =  BZ +  =  BZ + O(l)R 2 (X 222  )(X R +2  The third last step is true because RRt  =  —  since  52  =  RBl)Z)  RBl)Z).  
1 +R R 2  Rt(ll)R  —  1’(11)  =  I and the last step is true  implies 2 Rt(ll)R =  1’22)  So R  =  /‘12 =  111 + 22 1 R 12 ’I’ 2 R .  I122)  That is R2  22  ) 2 +R  — —  =  )(Rl + R 1 7 ).I 2  Proof of Lemma 2.10  By using Bartlett decomposition, it is easy to see that b  =  (a 12 B  22 B For notational simplicity, let d b  —  B 12 blBt 22 12  =  d’  —  =  =  ‘—1. 12 A A_lAt 12? 22  —  =  Ab; 12 —A  A’ + 2 A’A b A 12 A’ 22 =  a  —  AA 1 A 2 2 which is a scalar. Then  A’d’(A’ 1 A 2 + =  =  [d 2 d d’[l  —  —  12 1 d A J 2 12 + )’A A 22 (A 1212i1 2 A t 12 + )’A A d 22 (A  For showing the lemma, note that 2 AA 1 A 2 is a scalar and 12 A  12 A  —  A 1 a ( A 2 ) Ai A i+2 (Ai a’Ai A ) A  =  12 [a A a’  =  12 + 2 A a’d A a’(Ai ) Ai A  =  A(A + ). 2 a’Ai d 22 12 2 A  —  12 + t ] A 1 2 A’A 2 (Ai 12 aA 22 12)“  Equivalently, 22 2 A 1 (A 1 A d+ 22 2) 53  =  A. 12 a’A  (2.21)  Then, 1  —  (A 1 A d 22 )’A 1 A 2+2 2  =  1  —  2 A’A 12 a’A  =  a’d.  Combined with (2.21), the lemma is proved. I Proof of Theorem 2.6 Given all the hyperparameters, ) 2 Var(X° X  =  )) X 1 ) 2 Var(E(Xt° X(  x( ) 1 )  +E[Var(Xt° If t  =  X ] 2 .  n, by (2.7), Lemma 2.4, Theorem 2.4 and Lemma 2.10, it is obvious that T(’I 0 (l),gt,BO Z t+  x° I x(’)  —  B(Ol)Z),6* + 1),  (2.22)  where gi For t € {1,.  .  .  ,n  =  —  1 + ZF’Z + (X’  —  Brl)Zt)t  j)(Xt  —  B(°l)Zt).  1}, let C be the orthogonal matrix such that  By Lemma 2.4 (III) and Theorem 2.4,  x°c  I x(’) —  ZC 1 T( ) 1 , I. + CtZtF_  Bl)ZC)tj)(X(l)C  B Z 0 C + o(l)’)(X(’)C  —  —  Bl)ZC),  B(°l)ZC),  + n).  CtZtF_ Z C is Note that the last column of ZC is Z, the last diagonal element of 1 ZF’Z and the last column of X’C is  So it shows that (2.22) is true for any t.  By Lemma 2.2 and Press (1982 p128), E(X I X(’))  =  BZ + (I)o(l)(ll)(X 54  —  Bl)Zt)  and Var(X  g  =  I  m+l—2  oI(1).  Further, —  B(°l)Zt)  =  )(J!j)RRt(XU 1 ( 0  B(°l)Zt)  —  =  IX’  ,A 1 (A ) 2  —  —  =  (X’ 1 A  —  B’Z BZ)  (X 2 BZ) + A  —  BZ).  By Theorem 2.5 and a similar argument as the case of X° X  IX  (2.23)  —BZt),m+l).  So  I  Var(E(X°  ) 2 X  =  Var(o(l)  )(Xt  =  AiVar(X’  )A 2 X  —  —  ) 2 Bl)Zt) X  A iT At J1lCtVlI2f1i  m—2 The last equation is true, because of Lemma 2.2 and Press (1982 p128). X}, let us start with  E[Var(X°  Rt)R=  (  (I  -r  ‘\0  I)  -  ( \  0 (I \o 1 ,  0  -r  So E  —  =  Brl)Zt)tJj)(Xl)  E [(RtX1)  —  Bl)Zt)  x]  —  —  RtBl)Zt)  I x]  0  -E  X  -  -  I  BZ  0  ((Xe’ —B’Zt)—r(X? BZ) 2 X—BZ E{[(X BZ) r(X BZ)] [(X t —  )  =  —  —r(X  —  BZ)]  —  —  I X} + (X 55  —  —  B’Z)  (X t BZ)  —  BZ).  To find  Let Y  =  i1’[(X’  —  Z) 0 1 B  —  r(X  Z)]. By (2.23) and Lemma 2.4, 0 2 B  —  Y X  T(I,ct,O,m+ 1).  Lemma 2.2 and Equation 2.4 imply f(Y  =  y I X)  C?(Ct  =  y) t +y  -  2  7r2F()  Then E(YtY)=E(YtY+ct)_ct —  —  =  p  (m —2 + l)ct m 2 o  (m_2+1)  p  —  (!!iza)  C(Ct  +  ,,  y)  m—2-H 2  dy  —  Ct  m-2+l Ct—Ct. m—2  That implies —  =  m—2+l Ct m—2  Bl)Zt)tj)(Xl)  —  Ct  + (X  And E[Var(X I X’) XJ  =  —  Bl)Zt)  — —  1 m + 1—  [Ct 2  +  B°Z).  m—2+l Ct m 2  (l) 1 Ct]O  —  Ct =  m—2 oI(1).  Finally, Var(Xt X)  =  =  112 c 1 A ’I’ + m—2 m—2 Ct 1T, + oI(1)].I 1 [A A 112 m—2  Proof of Lemma 2.11 By Rt(ll)R  =  ‘IJ, the following are true: 11 ( 2 )R  =  R  (12 =  112 + , R 222 R  (2.24)  122)  =  R + 21 11 RI’ ’I’ + 12 2 R ’J! + R. 1 R 22 1 2 R 56  (2.25)  By (2.24) and (2.25) t 11 2 ‘P’R ( )R 22 2 (11)  ’1 2 R t ) ’T’ + 12 1 (R 12 2 ’ R ( 2 )I’ ’ 1 T1 R 2 + 22 ’I’I’ 1 R 21 12 +  =  —  ’P 1 R 11  —  =  Hence,  R 112 RI!  
=  (11)  —  R(ll) P 2 (ll)R  which implies t-1 (11)  (ll)RlIhlI R 2 l  —  (I)(fl)  —  —  1 2 R R 2 .  So =  O(1)(i j 1 ) o(1)  —  or  112 I 1 A ’  =  —[o  =  oI(1)  +  —  +  —  OI2•  It proves that 112 + oI(1) W 1 A  =  OI2’  Proof of Theorem 2.8: By Lemma 2.3, there exists U T  =  (U_)tY. By Lemma 2.12 , —  W(Ipxp, 5*  r’.I  1) and  Y/p t Y  =  N(O, I> 0 1) such that  is independent of YtY  ‘‘  YtU-1Y  T t p)T  —  Yty  I ytU_1y/(*_p)  F ,..I  Proof of Theorem 2.9: By Equation (2.7), Lemma 2.4 and Theorem 2.7, I  2(XO 012  —  /  T(I,1,O,6*_l+1).  Ct2  57  x• Therefore,  Then by Theorem 2.8, (6*  —  1— sk + 1)(X .sk  0t—l(xO t/ 012  —  *  —  O)  F k 5 ,  ct  6—l—sk+1  So, 1  —  a  =  012\  —  t  <  —  b).I  Proof of Lemma 2.13  By Bartlett’s decomposition,  =  TAT, where 0  (°I(1)  0 and r  = 0(1)(1’1).  T=  ,  )j 11 E(  Ii o  \  )  Thus =  / I  0\  =  /  —  /1  0  1>:0:1)  —r\  )  i)  (1)  —  )+ 11 E( Given  ,  Tt ) 1 T)  6”, Lemma 2.1 implies  I x(’) I  x(’)  WSk((l), 6*) 1  WS k 9 (!j),  6* + n  —  sk  —  h)  and (1)’  89 k 3 N( k)(, )(  01 0 E ) 1 (  So =  X)  =  6oI(1)  )E(r 1 E (  I x(’))  =  58  0I(1))  x(’))  and ,)T t + T  (6* + n Note that,  —  sk  t + E(E(T  —  =  )T 1 (  Let Y  E)r. t El)T = (E)T) t T  x(’))  OI(1))  )T_1.  x(’)).  Obviously by definition  of a Wishart distribution, Y E , 0 1x 1  N(sk)(s k 9 )(O, ‘skxsk  0  Then, yty  oI(1), x’  WS k 9 (j), .sk).  -  Thus E(E(ll)  +  E,)T t T  xc’))  =  (6* + n  —  sk  —  + sUk’j) x(’)) =  (6* + n  —  sk  —  h)j) + 6q( )77 + sUkF(ll).I 1  Proof of Theorem 2.10 Note that since,  R X t ’ B(,) t R  Z, B, E Br,),  E(ii)R t R  ,  RtB(l)Z + N(O, R Z(,i)R 0 Ia), t Brl) + N(o, Rt(,,)R ® t R  F  (11),6*  W(R ( t ll)R,6*  —  sk)  and , 2 x  6*)  =  RE((R E t (,,)R)’  The theorem follows from Lemma 2.13. I Proof of Lemma 2.15  59  , 2 X  6*)1?t.  Since , 2 II X  E(log  6) = E(log I R E(ll)R t = E(log(  I  22 E  1I2  , 2 X  I)  ,  6)  , 2 X  6*)  Note that 112 I X E , 2  (11),  6*  112 ’ 1 W , (’P 6*  —  sk)  and that , 2 X  (11),  6* r, W_ , 2 ( 1 2 6* + n  —  sk  —  1— h).  Now apply Lemma 2.14 to finish the proof. I Proof of Theorem 2.11  1 fixed. By (2.16), In the following proof, we assume F E(3  I1))f,  v1,...,sgkl.  So E(/IBl))=  i=1,...,k.  To prove the other half of the Theorem, by (2.17) and Theorem 3.2.8 of Muirhead 1982 ), 93 (p  at S a Jv Jv 2 t 3 22 a a  L  “JV  2  implying E(  a a 2 S 3 j a 22 aE  = )’ 22 , 2 x  = n  —  1 h —2  (2.26)  and —  jt(  —  i) X 2 = x ,B 2 ), 1  22]  = Var(Ba  2 = x X , B° 2 ), 1  By (2.16) and the fact that aE aj is a scalar, 22 = x , B, 2  22  “  t ) 1 N([B , 2 R ]t ((ZZ ) ’ + a). 22 ) 1 F a  60  22)•  Hence, Var(Baj  I  =  2 ,B(°,), ) x 22 E  =  )’). t a)(F’ + (ZZ 22 (a  (2.27)  2 are independent. If we replace B(,) with , B(,), 22 are fixed, S 2 x 2 and B  When X 2  B(°,), they are still independent (Lemma 2.18). Therefore, given X 2  =  , B(°,) and , 2 x 22 Z  by Lemma 2.18, E(fr_1)  h —2 sgk 1  =  k—1 9 s  —  —  a S a 2  E  —  aj)}’E[($ 22 (aE  v)t(u —  =,  1 ) t (ZZ  1 sgk  k—l 9 s —  (F-’ + (ZZ )’) t  1  —  (ZZ ) t ’  =  Proof of Theorem 2.12  By (2.7), (2.8) and (2.12)-(2.14), given the hyperparameters X  ,  B°Z, 6* +  Z F ’Z + T(’, t  6’, B°, F,  ii).  By Theorem 2.2, X where a  =  —  B°Z  1  =  1  a(a’(X  —  B°Z))  approx, ‘-‘-‘  a  _1  Z F t ’Z + I),  6* + n. Then by Lemma 2.7, vec(X)  approx.  4 f(ZtF1Z + I, a J —  ). t ) + vec((B°Z)  So, vec(X t (R*) ) =  /  vec[(X°) \ } t vec(M,) v ec(M ) 2  ( J  approx.  vec[(BZ) ] t Rvec[(Brl)Z)t] + (R*)ta_Ar(ZtF_1Z + I, Rvec[(B,)Z)t]) 61  ).  
—  v)j  By the general normal theory, the theorem is proved. I Proof of Lemma 2.19 By the same argument of the proof of Theorem 2.12, one has vec(Xl))aocalAf(Z F t _lZ + I, where a 1  =  8*  —  +  sk + n. So,  (vec(Mi)) ) 2 vec(M  aiRtJ\1(ZtF_1Z + I,  vec[(B(° R ) 1 ] Z) . +t  By the general normal theory, the lemma is proved. I Proof of Theorem 2.13 For notational simplicity, the superscript Let X,  j  =  1,... ,n, be partitioned into  (B, B, B) and where P  =  j  is suppressed in the following derivation.  ((X5)1X k 3 ),  are partitioned confirmablely.  ), 2 {1I(2), Ti(  , ((G)1x(sgk_dj) 3 (UJ)lxd  ).  B  =  is reparameterizedas {oI(1), To(i), f*}  ) is also reparameterized 1 )}. B 22 E( ) represents {B, B}. B( 1  as  =  1 B  —  )B( 2 Tl( ) ,  ) 2 b(  =  ). 2 B(  Note,  f  (U {G},D) ),f*)f(bl,b( Dj,bl,b( ) ,F* Dj)dbldb( )dF* 2 Jf(U,{G} 2 Jf(U,G  .f(b, b( ) 2  ),F*)f({G}+l 2 Dj,bl,b( F D )f(f* 3  To find the distribution of f(b , b( 1 ) 2 B(i)  ) dF* 2 D) db 1 db(  D, f*), note by lemma (2.1)  D, f  B + N(O, (ii)) 62  (2.28)  where B ) 1  ) + (B(i) 1 B  =  —  ))Et, E 1 B(0  En 0 F, F*  =  =  (I  —  E)F-’. By Lemma  2.7, vec(Bl)) D f  —  0 F).  ) + N(O, t vec((Bl))  By a general normal theory, )),I* 2 vec(B) Dj,vec(B vec( (B)t)  + .1V(O,  +  vec(B )vec(B )) vec(B))vec(B 2  (v ec(B )) 2  —  v ec( t 2 (B ) )  2))  vec(B)Ivec(Bt 2 ) ))  ) + (ri(2) 0 I)(vec(B t vec((B) )) 2 vec((B +  )(B( Tl( ) 2  —  —  t 2 vec((B ) ) + N(O,  ))t) + N(O, 2 B  1I(2)  1I(2)  0 F*)  0 F*).  In the above derivation, it is assumed that (F*)_l exists and the last step is true because of Lemma 2.5. Again by Lemma 2.7, 1 B  ) 2 I B(  ,  D 3’ f  B+  )(B( Tl( ) 2  _B*(2)) + N(O, 1I(2) 0 F*),  it implies  3’  — —  1 (B d  where b*1  —  —  —  ‘-‘l  )B( 2 —Tl( ) )  b + N(O,  I  1I(2)  )B). Since the distribution of b 2 rl( 1  0 F*),  ) does not depend on b( 2 b( ), b 2 1  ) are independent. So 2 and b( , b( 1 f(b ) D, f*) 2  =  1 f(b  =  f(bi  ) D, f*) 2 D, F*)f(b(  I ,  1I(2),  ))f(b( rl( ) 2  I  ). 3 D  Further since f(U, G  I  ,2 1 b b( ) , f*) oc f(U  I  t , b( 1 ), f*)f(G 2 G, b  63  , b( 1 b ), f*) 2  (2.29)  and UJ  I  ), p 2 , b( 1 G, b  Z + 1 [B = [b Z + 1  TI( ) 2 (G  Tl( ) 2 Gj]t  t ) 2 B( Z)] + N(0,  —  + N(0,  1I(2))  ( 1 E ) 2 ). 1  So f(U,, G  ,2 1 b b( ) , F) cx f(U  ), ( 2 G, b , Ti( 1 11 ) 2 )f(G  b( ) 2 ,  (22)).  (2.30)  Also, by Model (2.12)-(2.14), D, b ,2 1 b( ) , f*) = f({G} 1  f({G}t=3+ t  I  (22),  b( ) 2 ).  (2.31)  By (2.28)-(2.31) and Lemma 2.20 f(UJ cx  I  {G},D) I t f(UtjI’Yj, , Ti( 1 b ), 2  I  ( 1 E ) 2 )f(bl 1  D, Ti( ), 1I(2)) 2  f(r ) 2 ( 1 ,( 11 I D) db E ) 2 12 drl( 2 ) dl!( )  f f({G J  b, E( ) 2 ))f(b( 22  if i=j  cx  I  (22),  ) 22 D)f(E(  ) 22 D) db( ) d( 2  f(UtjI’j’ b , Ti( 1 ), 1 2 i ) 2 ( )f(bi D, Ti( ), 2 2 ( 1 E ) ) 1  f(71(2),  )( 2 D,) db 11 d ) 2 . 1 drl(  >1I(2)  For summary, we have, U  bF , G, 1 D, Ti( ), 2  Z + 1 b 1I(2)  Tl( ) 2 G  + N(0,  =d 1 + N(0, b  1I(2)  ( 1 E ) 2 ); 1  0 F),  (2.32) (2.33)  where = B  —  )B( 2 Tl( )  = B ° + (B 1 1  —  B)E  —  )[B 2 Tl( )  +  ) 2 (E(  —  B ) 2 )Et].  Since, by Lemma 2.1, W(ii),  +  (j  —  1)  —  sk).  Then by Lemma 2.20, ) 2 Ti(  D,  1I(2)  “  N(l( ) 2 ), 1I(2) 0  64  (2.34)  (i 1 W ) 2 ( , 6*  D  1I(2)  —  sk +  (j  1)).  —  By (2.32)-(2.34), U  G, 2 ( 1 E ) 1  1 + N(O, 2 a )a ( 1 i ) ,  where 1 = a  BZ  1 —  1(2)2)  B*(2) Z3 +  1(2) (22) G 3  and 2 a  =  1 + ZF*Z + (G  —  t ) 2 B F Zj) )(G  B’ ) 2 Zj).  
—  Combined with Equation (2.8) and (2.35), U  U  G  ) 2 T( , al,6* ,a  65  —  sk +j).I  (2.35)  Chapter 3 Application to Air Pollution Data In this chapter, the theory of interpolation for data missing-by-design developed in Chapter 2 is applied to obtain interpolated data. The daily maximum hourly levels of nitrogen dioxide 2 (NO ) , ozone (03) , sulphur dioxide  ) and the daily mean levels of nitrate 3 2 (SO (NO ) , sulfate (SO ) were recorded from 4 January 1 of 1983 to December 31 of 1988 in Ontario and its surrounding areas. These data come from several monitoring networks in the province, including the Environment Air Quality Monitoring Network (OME), Air Pollution in Ontario Study (APIOS), the Canadian Acid and Participation Monitoring Network (CAPMON) (see Burnett et al (1992) for more description of the data set). Also available are some climatic data: daily maximum, minimum and mean temperature; daily maximum, minimum and mean humidity, and, mean pressure and mean wind speed, measured at other locations. In total, there are 37 different monitoring locations (sites), but not all sites monitor all of the five air pollutants. In the application below, we assume that the variation caused by networks to the observations at 37 sites is negligible. Therefore we can pool the observed pollution levels without worrying about that variation. In general, there are two kinds of air pollutants: (i) a primary pollutant, which is  66  directly emitted by identifiable sources; (ii) a secondary pollutant, which is produced by chemical reactions within the atmosphere between pollutants and other constituents.  NO 03 and 504 are secondary pollutants. , , 3 2 2 is a primary pollutant while NO SO 2 is produced by burning fuels containing sulphur. Its level depends on local emission SO sources, like burning fuel oil or smelting. The secondary pollutants studied here are all produced by oxidation of primary pollu tants. This oxidation is driven by ultra-violet radiation from sunlight and comprises chemical reactions that are temperature dependent. Since the chemical reactions pro ceed while the polluted air is being adverted by winds, secondary pollutants are generally more widespread than primary pollutants. We thus refer to secondary pollutants as re gional. Because of temperature dependence of the governing chemical reaction, NO , 2 3 and 03 are high in early afternoon and midsummer, low overnight and in winter. NO The oxidation of 802 to 504 is dominated by photochemical processes in dry, warm atmospheres. Monthly pollution data are interpolated down to the level of a Public Health Unit (PHU), the daily pollution data down to the level of a census subdivision (CSD). Both interpolation problems originate from environmental health studies not discussed in this thesis. In both cases, the relevance of the assumptions in Chapter 2 is investigated and the interpolated residuals are checked. In Section 3.1, our interpolation theory is applied to monthly data and in Section 3.2, to daily data.  67  3.1  Application to Monthly Pollution Data  A monthly pollution level is simply computed as the mean of the observed daily levels in that month. Hence, a time series of observed monthly mean levels consists of 72 values. Since some of these time series contain excessively many randomly missing values, to control the quality of data all time series are screened and those with more than onethird of its values missing are deleted. Thus, the number of gauged sites is reduced from 37 to 31. 
Here, we assume that the probability of a time series having more than one-third of its values missing is not related to those values. Therefore, such a strategy for dropping time series from our study does not cause bias. The locations of the remaining 31 sites, except two outlying sites, are plotted in Figure 3.1. The whole of the Province of Ontario divides into thirty-seven PHUs or districts (Duddek et al 1994). A PHU resembles a Census Division, the difference being marginal disagreements in boundaries. Some PHUs, for example, are aggregates of two Census Divisions.  Figure 3.2 displays the locations of some approximate centroids of these  PHU’s in Southern Ontario. Our ungauged sites consist of the 37 approximate centroids. Heilce, the total number of gauged sites,  g, 5  is 31 and that of ungauged sites, .s, is 37.  For the monthly interpolation, four air pollutants are included. They are NO , SO 2 , 4 03 and SO . Table 3.1 lists the observed pollutants at each of gauged sites, where a, b, 2 d and e represent NO , SO 2 , 03 and SO 4 2 respectively. From the table, we can see that at all gauged sites, there are 64 observed time series aild 60 missing time series. In the sequel, the term “time series” is replaced with “column”. Among the 64 observed columils, about two percent are missing. The missing data include the randomly missing data aild those censored from below. Here, censored below means that the actual air pollution level is below the detection limit of a monitor. 68  Table 3.1: Pollutants Measured at Each Gauged Site, Where a, b, d And e Represent , SO 2 NO 2 Respectively. , 03 And SO 4 Sites  1  Pollutants  b  Sites  13 ade  Pollutants Sites Pollutants  25 de  2 b  3  4  b  b  14  15  ade 26  ade 27  16 b 28  abde 29  ade  ade  de  b  9  10  20  d 21  ade  ade  5 abde  6 d  7 be  8 ade  17  18 b  19 ade 31  30 e  d  11 ade  12 de  22  23  24  d  e  ade  de  In statistics, the procedure for filling in missing data is called imputation. There are many existing imputation methods (c.f. Sarndel 1992, Little and Rubin 1987). Examples of these methods are: overall mean imputation; class mean imputation; hot-deck and cold-deck imputation; random overall imputation; random imputation within classes; sequential hot-deck imputation; distance function matching; regression imputation; EM algorithm based imputation and multiple imputation. The imputation methods mentioned above, produce a single imputed value for each missing value except for the multiple imputation approach. Many authors used these methods. Afifi et al. (1961) suggested filling in the missing observations for each variable by that variable’s mean. Buck (1960) and Stein et al. (1991) discussed imputing the missing values by regression models. Komungoma (1992) adopted a modified regression strategy. Johnson et al. (1994) combined a stepwise linear regression and a time series model to fill in missing values. Miller (1994) employed a nearest-neighbor hot-deck imputation procedure to fill in missing data. While the multiple imputation is appealing, it has a disadvantage of requiring more work for data handling and computation of estimates. Moreover, the multiple imputation method was designed for survey sampling. It is not suitable for a spatial application.  69  We may apply Theorem 2.12 to impute our missing data. However, that method has not been well tested yet. Further more, Theorem 2.12 bases on the normal approximation to a matrix T. 
This ad hoc fill-in method allows us to preserve the periodic behaviour of the columns shown in Figure 3.3, where ozone is seen to have a strong periodic pattern. Again, because only a small percentage of the data are missing, our ad hoc fill-in method will cause negligible bias.

Our theory of Bayesian multivariate interpolation with data missing-by-design is developed under two important assumptions. One assumption is that the detrended residuals should spatially follow a multivariate normal distribution. The other is that each residual time series is a white-noise process; in other words, the residuals are temporally independent. Checking multivariate normality is not easy since we lose hundreds of degrees of freedom to parameter estimation; in this chapter, only univariate marginal normality is checked. Temporal independence is checked with autocorrelation and partial autocorrelation plots.

The normal quantile-quantile plots of the detrended residuals of the original observed data do not show straight lines, suggesting the observed data be transformed. With a logarithmic transformation of the observed data, the residuals appear to be marginally normal. A typical example of a normal q-q plot of the original and log-transformed data is shown in Figure 3.4. Autocorrelation and partial autocorrelation plots of the detrended residuals of the log-transformed data are shown in Figure 3.5; the correlation plots do not show evidence of temporal correlation. By repeating the above initial data analysis for the observations at all gauged sites, we are led to conclude that the log-transformed data do meet the assumptions of our interpolation theory. In the sequel, unless specified otherwise, we always refer to observations on their log scale.

For determining the linear and seasonal trends of our Gaussian model, time series plots are made. Based on the plots, Z_t is taken to be (1, t, cos(2πt/12), sin(2πt/12))^t, where t = 1, ..., 72. Here t = 1 represents January of 1983, t = 2 represents February of 1983, and so on, until t = 72, which represents December of 1988. The coefficients of the linear and seasonal trends are estimated with ordinary least squares. Figure 3.3 displays the time series plots and their least-squares fitted curves for the four observed pollutants at Site 5. The fitted curve of log(O3) is far better than those of the other three because of its periodicity. The strong yearly pattern of ozone is partially explained by the fact that the creation of ozone is highly related to solar radiation.

For the environmental health study referred to earlier, the one which required interpolation of monthly means, the air pollution level in summer is of special interest. Therefore the pollution data of winter and summer are interpolated separately. In the following, only the interpolation for the summer data is described; the procedure for the winter data is similar. Here, the summer of a year is defined to be from May 1 to August 31 and "winter" the remainder of the year. Thus there are 24 values (four months in each of six years) in each summer column. The purpose of the analysis is to interpolate monthly NO2, SO2, O3 and SO4 levels in the summers of 1983 to 1988 at the 37 ungauged sites.

The interpolation procedure follows this order: first, the unbiased estimators of F^{-1} and B_(1)^0 are computed; second, the EM algorithm for the estimation of δ* and Φ_(11) = A_g ⊗ Ω is invoked; third, the SG method is applied to extend A_g to A; then, with the exchangeability assumption on B^0, B_(1)^0 is extended to B^0; finally, with all the hyperparameters estimated, the interpolated values are computed by the Bayesian interpolator. We use S macros to carry out the interpolation and call a C program to perform the EM algorithm. An example of the S macros and the C program is attached in the Appendices.

By running the S macros and the C program, we get an estimated prior degrees of freedom of 610; the estimated between-pollutants hypercorrelation matrix, listed in Table 3.2; and estimated hypervariances of NO2, SO4, O3 and SO2 of 0.6582724, 1.6263357, 0.2166192 and 1.8503741, respectively.

Table 3.2: The estimated between-pollutants hypercorrelation matrix of the log-transformed, summer monthly data.

         NO2          SO4          O3           SO2
  NO2    1.00000000  -0.28541295   0.03476434   0.13645127
  SO4   -0.28541295   1.00000000   0.79402064  -0.33793700
  O3     0.03476434   0.79402064   1.00000000  -0.14573424
  SO2    0.13645127  -0.33793700  -0.14573424   1.00000000

The biggest positive correlation among the four pollutants occurs between O3 and SO4. Since both O3 and SO4 are regional air pollutants and both are related to sunlight, a higher correlation between O3 and SO4 is expected. Meanwhile, SO2 is a primary pollutant, and it has lower correlations with the other pollutants.

The result of the SG step is summarized in Figures 3.6 to 3.8. The right-hand plot of Figure 3.6 is a twisted 30 by 30 checkerboard in the D-plane; the original 30 by 30 checkerboard is in the G-plane. The coordinates of its lower left corner are the minimum latitude and longitude of the gauged sites; the coordinates of its upper right corner are the maximum latitude and longitude of the gauged sites. The left-hand plot of Figure 3.6 shows an exponential fit between dispersions and the D-plane distances (refer to Chapter 1 for a brief summary of the SG method). The smoothness of the twisted checkerboard is controlled by the parameter λ. A smoother checkerboard in the D-plane is achieved by sacrificing the fit between the dispersions and D-plane distances: Figure 3.7 shows a smoother checkerboard in the D-plane but a rougher fit between the dispersions and D-plane distances when the smoothing parameter value is increased from 0 to 2500. The linear pattern in Figure 3.8 shows that the estimated covariances and the observed covariances are conformable.

By applying Corollary 2.6 with the above estimated hyperparameter values, it is straightforward after the SG step to compute the interpolated summer monthly air pollution levels at the ungauged sites over the six years. As a way of checking the interpolated values, the overall average ozone levels in the summers of 1983 to 1988 at the gauged sites are plotted in Figure 3.9, and those of the interpolated ozone levels at the ungauged sites in Figure 3.10. The two plots affirm our interpolation procedure: when a high mean O3 level is observed at a gauged site, our interpolator gives high O3 values at nearby ungauged sites, and corresponding results obtain for lower observed O3 levels.
Another way of checking the interpolation procedure is to look at the correlation between the observed and estimated data by cross-validation (CV hereafter). CV is a procedure which deletes one observed datum at a time and estimates that datum from the remaining data, as if it had never been observed; it is a popular diagnostic tool. In our CV study, we deleted one gauged site at a time and interpolated the pollutant levels at that site using the observed levels at the other sites. To avoid spuriously high computed correlations between the estimated and observed levels of pollutants, we first removed the trends from both the estimated and observed columns, and then calculated the correlations among the residuals. The correlations between the detrended, estimated and observed levels of each pollutant, aggregating across sites and over time, are:

          Summer   Winter
  NO2     0.243    0.242
  SO4     0.494    0.438
  O3      0.534    0.429
  SO2     0.238    0.200

The correlations between the estimated and observed levels at each gauged site and for each observed pollutant are given in Table 3.3. In Table 3.3, the correlations of SO4 and O3 in both summer and winter are generally higher than those of SO2; in other words, the predictions of SO4 and O3 are more accurate than those of SO2. Figure 3.11 displays the plots of the residuals of log-transformed, monthly observed and estimated pollutant levels in both summer and winter. Figure 3.12 shows the scatterplots of log-transformed observed pollutant levels against estimated levels for each pollutant in summer and winter, respectively. In the plots a linear pattern means accurate interpolation. The plots confirm the conclusions suggested by the tables; these results are consistent with the fact that O3 and SO4 are regional pollutants, which are easy to predict with the observed data from other sites, while SO2 is a local pollutant and so more difficult to predict.

Table 3.3: Correlations between residuals of log-transformed, observed and estimated pollution levels at gauged sites (columns: Sites; summer NO2, SO4, O3, SO2; winter NO2, SO4, O3, SO2).

  Sites NO2 1 2 3 4 5 6 7 8 9 10 11 12 0.39 summer S04 03 0.96 winter S04 03 0.81 0.96 0.87 0.85 0.74 0.93 0.82 0.81 0.75 0.92 0.90 0.42 502 0.57 0.81 0.76 0.57 0.61 0.54 0.11 0.13 0.66 0.93 16 0.90 0.91 0.53 0.69 0.40 0.14 0.74 0.58 0.52 0.75 0.32 0.59 0.57 0.30 0.04 0.69 0.79 0.33 0.24 0.55 0.18 0.59 0.72 0.38 0.40 0.40 0.87 17 18 0.66 19 0.63 0.36 0.77 0.72 0.80 0.61 0.67 0.80 0.63 0.80 0.78 0.25 0.85 0.68 0.75 0.34 0.57 0.47 0.63 0.49 0.86 0.78 0.68 0.52 0.56 0.65 0.81 0.24 0.44 0.47 0.44 0.48 0.74 25 26 0.71 0.80 0.82 27 0.55 0.77 0.66 0.87 0.83 30 31 SO2 0.78 0.41 0.66 0.65 0.11 28 29 0.56 0.97 0.87 14 15 0.70 0.74 0.71 0.85 0.75 0.88 23 24 0.67 0.87 0.44 21 22 0.09 0.67 13 20 N02 0.97 0.25 0.59 0.40 0.56 0.26 0.23 0.50 0.49 0.49 0.82 0.78 0.62

Can a simpler-to-use normal distribution be substituted for the matrix T predictive distribution? That might naively seem possible, since the univariate normal approximates its longer-tailed relative very well. However, our results suggest this substitution cannot be recommended without additional study. Our initial impression comes from an evaluation we did of the empirical coverage percentage of three-standard-deviation confidence intervals (CIs). If the predictive distribution were normal, the three-standard-deviation CIs would include the true values about 100 percent of the time. As the percentages by pollutant presented below indicate, this high coverage probability is not achieved here; the heavier-tailed predictive matrix T distribution seems to be required. By pollutant the percentages are:

          Summer   Winter
  NO2     100%     94.2%
  SO4     100%     99.2%
  O3      98.6%    98.8%
  SO2     94.5%    100%

The unbiasedness of the residuals is also checked. The top plot of Figure 3.13 shows boxplots of the prediction errors for the four pollutants, these being defined as the difference between the predicted and observed values. Except for SO2, the mean prediction errors of the other three pollutants are almost zero; in other words, the predictor is unbiased. In the same figure, the other two plots indicate that the observed values and the boxplots of the predicted values have similar patterns, except that the predicted values have bigger variances.
In other words, the covariance of Ximt and Xhkt is estimated by  Zt(Ximt  —  Xlm)(Xhkt  —  Xhk),  where  f is  the total number of observed pairs of Ximt and Xhkt. The estimated covariance matrix has to be checked for positive definiteness. definite, it cannot be used to obtain  If the estimated matrix is not positive  If so, other imputation methods mentioned  in the previous section would need to be investigated. However, in this application, the estimated matrix is indeed positive definite. Compared with the method used in Section 1, the advantage of new method is that it brings less autocorrelation into each time series when the missing values are fill-in. We assume such a filled-in procedure will not cause serious bias, because of the small percentage of missing data. Figure 3.15 shows that a logarithmic transformation of daily data is also necessary. Since the time series plots of the daily summer data do not have obviously periodic patterns, the linear trend of the Model (2.12) is instead taken to be the grand mean effect; time t effect; weekday effect (from Monday to Thursday); monthly effect (May, June, July); yearly effect (from 1983 to 1987) and the mean daily temperature. The mean daily temperature is the mean of observed daily mean temperatures in Southern Ontario. Such an arrangement is due to the fact that Model (2.12) only allows covariates with spatially equal measurements. Other variables, for instances, the daily mean humidity, mean pressure etc. are investigated too.  They are not included in the final model  because their ordinary-least-squares fit coefficients are not significantly different from zero. We checked the autocorrelation and partial autocorrelation plots of daily, detrended summer observations.  Figure 3.16 shows an example of such a plot. We find that  78  Table 3.4: The Estimated Between-pollutants-hypercovariance Matrix of the Logtransformed Daily Summer Pollution Levels in Southern Ontario.  N02 S04 N03 03 S02  N02  $04  N03  03  S02  1.0000000 0.1224039 0.2435792 0.1131423 0.2085270  0.1224039 1.0000000 0.4118659 0.2668008 0.1240847  0.24357919 0.41186592 1.00000000 0.24745489  0.11314233 0.26680085 0.24745489 1.00000000 0.09569043  0.20852698 0.12408473 0.05080721 0.09569043  0.05080721  1.00000000  they are lag-i correlated. For removing the lag-i correlation of the residuals, an AR(1) transformation of the observed values is applied, that is, t at  =  2,.  .  .  ,  738, where  gauged site.  qjj  =  —  for  is estimated based on the observed values of i’ pollutant  Such a transformation removes the lag-i correlation among the  residuals. For comparison, the autocorrelation and partial autocorrelation plots of the AR(1)-transformed residuals of 03 at Gauged Site 6 are shown in Figure 3.17. When the normal and independence assumptions are satisfied, the EM algorithm can be applied to the residuals to estimate the hyperparameters. The between-pollutants hypercorrelation matrix is given in Table 3.4 and the hypervariances of the N02, S04,  N03, 03, S02 are 0.8093218, 1.8325046, 1.3923836, 0.4061250, 1.9200310, respectively. The estimated number of degrees of freedom is 4365 for nearest integer. A big number reflects a lot of prior information about the covariance matrix E. That large number stems from the fact that there are only 27 gauged sites with 55 observed columns while 733 ungauged sites with 3665 columns missing-by-design need to be interpolated. The SG and interpolation steps are similar to those of the monthly data application above. 
One remaining problem needs to be solved. Since the interpolation is based on the AR(1)-transformed data, the interpolated values are in the form of AR(1)-transformed residuals. To obtain the true interpolated residuals, the following fact must be used.

Lemma 3.1 If U_k - φ U_{k-1} = V_k and |φ| < 1, then U_n ≈ Σ_{k=2}^{n} φ^{n-k} V_k, provided n is big enough.

Proof: Since

  U_n - φ U_{n-1} = V_n,
  U_{n-1} - φ U_{n-2} = V_{n-1},
  ...,
  U_2 - φ U_1 = V_2,

multiply both sides of the second equation by φ, the next equation by φ², etc., and then add all the equations. It becomes

  U_n - φ^{n-1} U_1 = Σ_{k=2}^{n} φ^{n-k} V_k.

When n is big enough, φ^{n-1} U_1 is almost zero. ∎

To apply the above fact, the φ's at the ungauged sites are needed; these values are not available. Let us assume the φ's are the same for the same pollutant at all sites. Then in all, only 5 different coefficients are needed. The five coefficients can be estimated by ordinary linear regression subject to the above assumption. By checking the observed data, we know that the assumption is reasonable. With this new assumption, the interpolation procedure is repeated.

Table 3.5: Correlations between the residuals of log-transformed, summer daily observed and estimated pollutants at gauged sites (columns: Sites; summer SO4, NO3, NO2, O3, SO2; AR(1) summer SO4, NO3, NO2, O3, SO2).

  Sites S04 NO3 1 2 0.68 0.56 AR(1)_Summer S04 N03 03 0.63 0.52 0.79 0.65 0.76 0.63 3 4 0.56 0.72 0.55 0.64 0.48 0.50 0.70 0.62 0.45 0.63 0.48 N02 5 0.44 6 7 8 9 10 03 0.67 S02 0.12 NO2 0.44 0.60 0.69 0.65 0.49 0.49 0.65 0.10 0.87 0.64 0.50 0.51 0.10 0.88 0.90 0.55 0.55 11 12 13 0.82 0.82 0.83 0.83 0.18 0.78 0.72 0.17 0.78 14 0.41 0.75 0.38 0.73 0.73 16 0.69 17 18 19 20 21 22 0.74 0.68 0.50 0.58 0.70 0.53 0.63 0.63 0.71 0.86 0.68 0.78 0.84 0.54 23 0.50 0.67 0.59 0.68 0.52 0.63 0.62 0.02 0.45 25 26 0.34 0.68 0.65 0.60 0.85 0.79 0.69 0.86 0.88 0.56 0.82 0.83 0.79 24 27 0.71 0.88 0.82 0.82 0.12 0.69 0.90 15 S02 0.03 0.78 0.44 0.04 0.58 0.64 0.61 0.32 0.52 0.15 0.05 0.54 0.14

As before, a CV study of the residuals for both the observed and predicted values is carried out. The pollutant-wise correlations between the residuals of observed and predicted summer daily pollution levels at the gauged sites are:

          Summer        AR(1) Summer
  NO2     0.49128973    0.48457402
  SO4     0.66564685    0.63188994
  NO3     0.58376676    0.56927091
  O3      0.78495848    0.78477716
  SO2     0.08807141    0.08978648

In the above table, the predicted values used for computing the correlations in Column one are obtained using the original observed values, and those in Column two are obtained using the AR(1)-transformed data. From the same table, it can be seen that the correlations of the same pollutant in both columns are close. We interpret this result positively, as saying that our interpolation procedure is robust to the assumption of temporal independence; in other words, in terms of the above correlations, the prediction of the observed values has not been improved by taking an AR(1) transformation.
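The back-substitution of Lemma 3.1 above can be computed as a one-pass recursion, which is numerically equivalent to the truncated geometric sum; a sketch, assuming the AR(1)-transformed residuals are held in a vector:

import numpy as np

def back_transform(V, phi):
    """Recover residuals U_t from AR(1)-transformed values
    V_t = U_t - phi*U_{t-1}, using Lemma 3.1: with |phi| < 1 and n large,
    U_n ~ sum_{k=2}^{n} phi^{n-k} V_k (the phi^{n-1} U_1 term is negligible)."""
    U = np.zeros_like(V, dtype=float)
    for t in range(1, len(V)):
        U[t] = phi * U[t - 1] + V[t]   # running form of the geometric sum
    return U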
From the same table, the correlation for SO2 is much lower than those of the other pollutants. This is explained as follows. First, from Table 3.4, the hypercorrelations of SO2 with the other pollutants are very low; this indicates that including the other pollutants does not much improve the predictions of SO2 levels. Second, from Table 3.5 it is seen that SO2 was observed at only 5 gauged sites, so in the CV analysis the prediction of SO2 levels at one of its gauged sites is based only on the observations at the other four sites gauged for SO2. For the monthly data, the problems associated with SO2 do not exist; therefore, the computed correlations between the predicted and observed SO2 values are higher there.

The correlations of the observed and estimated residuals at each gauged site for each pollutant are given in Table 3.5. From Table 3.5, it can be seen that it is easier to predict O3 and SO4, because they are regional pollutants, and more difficult to predict SO2, because it is a local pollutant.

Chapter 4

Application to Environmental Monitoring Network Redesign

Another application of the theory of Bayesian interpolation with data missing-by-design developed in Chapter 2 is to the problem of redesigning an environmental monitoring network. The term "redesigning" means adding sites to, or deleting sites from, a currently existing monitoring network. Guttorp, Le, Sampson and Zidek (1992), Caselton, Kan and Zidek (1992), and Wu and Zidek (1992) have discussed this redesign problem. These authors derived their optimal redesign strategy based on the following reasoning. Maintaining and collecting data from an environmental monitoring network is quite expensive; therefore such a network is typically set up and maintained by a nation-wide institute, and the collected data may be used by different users for different purposes. This fact implies that an optimal redesign of an environmental monitoring network should be based on certain common and fundamental purposes of the users of monitoring networks. They choose "reducing uncertainty about some aspect of the world, regardless of how that need may be expressed in any particular situation" (Guttorp, Le, Sampson and Zidek 1992) as the redesign goal, and develop a general network redesign theory by combining the entropy optimality criterion with the Bayesian paradigm.

In Section 4.1, the general theory of network redesign with entropy, as discussed in Guttorp, Le, Sampson and Zidek (1992), is summarized. In Section 4.2, it is demonstrated how our theory can be applied to the redesign problem.

4.1 Theory of Network Redesign with Entropy

In this section, the theory of network redesign with entropy is summarized in the univariate case, meaning that there is one pollutant at each site. Although the theory deals only with an augmentation of a network, the reduction of a network is handled in a similar fashion and the conclusion is similar.

Suppose that currently there are g gauged sites in an environmental monitoring network and it is planned to add u_1 more sites to the network in the future. These u_1 sites are chosen among u candidate sites; call all u candidate sites "ungauged sites". Let X^f represent a future realization at the gauged and ungauged sites. Decompose X^f into ((X_u^f)^t, (X_g^f)^t)^t, where X_g^f is the realization at the gauged sites and X_u^f at the ungauged sites. By properly rearranging the coordinates, (X_u^f)^t is further decomposed into ((X_u^(add))^t, (X_u^(rem))^t), where X_u^(add) is the response vector at the ungauged sites that will be added to the network and X_u^(rem) is the vector at the sites that are not chosen. We assume the same Gaussian linear model (2.12) to (2.14).

As a measurement of uncertainty, one natural choice is the entropy, which is defined as

  H(X) = E[-log(f(X)/h(X))],

where h(X) is a reference density; h(X) makes H(X) invariant to scale transformations of X. From the definition, it is easy to see that

  H(X, Y) = H(Y) + H(X | Y),  provided h_2(X, Y) = h_1(X) h_1(Y).

Let D be the set of all data observed at the gauged sites in the past, and let θ = (Σ, B) be the set of parameters in our Gaussian model. Then, given D, the total uncertainty about a future realization X^f and the unknown parameter θ is H(X^f, θ | D). We have

  H(X^f, θ | D) = H(U | G, θ, D) + H(θ | G, D) + H(G | D),   (4.1)

where G = ((X_u^(add))^t, (X_g^f)^t)^t, U = X_u^(rem), and

  H(U | G, θ, D) = E[-log(f(U | G, θ, D)/h_11(U)) | D],
  H(θ | G, D) = E[-log(f(θ | G, D)/h_2(θ)) | D],
  H(G | D) = E[-log(f(G | D)/h_12(G)) | D].

In the above, it is assumed that h(X, θ) = h_1(X) h_2(θ) and h_1(X) = h_11(U) h_12(G).

Note that adding or deleting any site to or from the current network will not change the total uncertainty H(X^f, θ | D). When, in the future, the response vector at the gauged and added sites is observed, and if the measurement errors are negligible, the uncertainty represented by H(G | D) becomes known. By Equation (4.1), one can see that the fixed total present uncertainty decomposes into two parts: one part will become completely known by taking observations in the future, and the other part remains unknown. Therefore, minimizing the future uncertainty by adding sites is equivalent to minimizing the unknown part, which is in turn equivalent to maximizing the uncertainty of what will become known in the future. So the problem is equivalent to adding sites to maximize H(G | D).

To find an expression for H(G | D), the entropy of a multivariate t-distribution needs to be computed, since f(G | D) consists of two multivariate t-distributions. Now assume, for a random vector Y,

  Y | Σ ~ N(0, Σ),  Σ ~ W^{-1}(Φ, δ*);

by (2.8),

  Y ~ t(0, Φ/(δ* - g + 1), δ* - g + 1).

Note that

  H(Y, Σ | Φ, δ*) = H(Y | Σ, Φ, δ*) + H(Σ | Φ, δ*) = H(Σ | Y, Φ, δ*) + H(Y | Φ, δ*).

To find the multivariate t-distribution's entropy H(Y | Φ, δ*), we therefore only need to compute H(Y | Σ, Φ, δ*), H(Σ | Φ, δ*) and H(Σ | Y, Φ, δ*). Since Y | Σ, Φ, δ* is multivariate normal, and Σ | Φ, δ* and Σ | Y, Φ, δ* are inverted Wisharts, the entropies follow from the results for the multivariate normal and the inverted Wishart presented in Caselton, Kan and Zidek (1992). After a straightforward calculation, the entropy of a multivariate t-distribution is

  H(Y | Φ, δ*) = (1/2) log|Φ| + c(g, δ*),   (4.2)

where c(g, δ*) is a function of g and δ* only.

Since, by Theorem 2.1, the distributions of X_g^f | D and X_u^(add) | X_g^f, D are multivariate t-distributions and

  H(G | D) = H(X_g^f | D) + H(X_u^(add) | X_g^f, D),

applying (4.2) shows that

  H(G | D) = (1/2) log|Φ̂_add|g| + c*,

where Φ̂_add|g is the residual covariance matrix of X_u^(add) conditional on X_g^f, as defined in Lemma 2.1, and c* collects terms that do not depend on the choice of added sites. Because only the term log|Φ̂_add|g| is related to a choice of the newly added sites, maximizing H(G | D) is the same as maximizing log|Φ̂_add|g|.

4.2 Network Redesign with Data Missing-by-design

The result presented in the previous section can easily be extended to the multivariate case by assuming k pollutants at each site. The value of Φ_add|g comes from the Φ that is estimated with the method described in BLZ. Similarly, the result can be extended to the case of data missing-by-design. To justify this extension, one needs to notice that, given the hyperparameters,

  (R*)^t Σ R* | Φ, δ* ~ W^{-1}((R*)^t Φ R*, δ* - l),
  (R*)^t B | Σ, B^0, F ~ N((R*)^t B^0, (R*)^t Σ R* ⊗ F^{-1}),

where

  R* = ( I_{s_u k x s_u k}   0
          0                  R_2 )

and R_2 is defined the same as that in Section 2.4.1 of Chapter 2. Then, applying the above model and a similar argument to that of the previous section, an optimal criterion for the redesign of a current network is to maximize log|Ψ_add|2|, where Ψ_add|2 is the conditional hypercovariance matrix of the future realization at the added pollutants and sites given the observed pollutants at the gauged sites.
Then applying the above model and a similar argument as in the previous section, an optimal criteria for redesign of a current network is maximizing log  addI2 1  , where  ’addl2 T ‘  is the conditional  hypercovariance matrix of the future realization at added pollutants and sites given the observed pollutants at gauged sites. The matrix  ‘I’addl2  can be obtained from matrix (F  that is estimated in Chapter 2. In an application, the added sites need not monitor all air pollutants. This relaxation may be useful when the optimal network redesign is required for multiple networks that were set to monitor different pollutants.  A hypothetical  example is that there are three networks, labeled 1, 2, 3. Network 1 measures ozone only, network 2 nitrate only and network 3 both pollutants. The above discussed optimal redesign enables us, say, to optimally add one site to network 1 and another site to network 2, in terms of maximally reducing uncertainty of a future realization in these three networks. 88  As an example of implementing the above discussion, the monthly air pollution data in Chapter 3 is used. Thus, there are 31 gauged sites and their latitudes range from 42.246 to 49.800 and longitudes range from 74.736 to 94.400. Suppose in future two sites will be added to these monitoring networks in Southern Ontario. One added site will monitor ozone only and the other nitrate only. Further suppose the possible site locations for the two would-be added sites are constrained to the grids of 10 by 10 checkerboard with the latitudes and longitudes of the four corners being (42.246, 74.736), (42.246, 94.400), (49.800, 74.736) and (49.800, 94.400). By taking the estimated 1 from Section 3.1 and applying the above optimal redesign criteria, the following answer is reached. We should add the measuring-ozone only site at latitude 43.92467 and longitude 85.66044 and add the site measuring-nitrate-only site at latitude 44.76400 and longitude 87.84533. However, further simulation study on the sensitivity of need be done.  89  to the locations of added sites  Chapter 5 Cross-validation Study In this chapter, we describe three CV studies designed to judge performance of the newly developed theory of interpolation with data missing-by-design. The first study is to justify the necessity of developing a new theory, in situation where the LZ method could be applied to solve the same problem. In the second study, an artificial example is made for studying the trend of adjusted mean squared predicted error(AMSPE), where, starting from a complete data set, columns are deleted one-by-one and the AMSPE computed. The third is a comparison of our theory against Hass’s CoKriging method, since both methods can handle data missing-by-design.  5.1  Simultaneous Interpolation Versus Univariate Interpolation  By interpolating one pollutant at a time, one can apply LZ theory to the Southern Ontario pollution interpolation problem. Why then is a new theory needed when an old theory is available? The answer lies in the difference of two methods. With the LZ method, only partial data is used for the interpolatioll, while the  ew  method includes all  available data in the procedure. Take the monthly pollution data in Southern Olltario, for example. When ozone levels are interpolated at ungauged sites by the LZ theory, only  90  the observed 03 levels at gauged sites are included in the analysis. The new method uses all the observed values of NO , SO 2 , 03 and SO 4 . 
For distinguishing between these two methods, that of LZ will be called univariate interpolation and that of the new method, simultaneous interpolation. One way of showing the superiority of simultaneous interpolation over univariate interpolation is to prove that its interpolator leads to a smaller mean squared error. That fact is shown below.

Theorem 5.1 Let $X_0$, $Y_0$ be any two random vectors and $X$ a random variable. Then

$$E(X - E(X \mid X_0, Y_0))^2 \le E(X - E(X \mid X_0))^2. \qquad (5.1)$$

Proof: Observe that

$$E(X - E(X \mid X_0))^2 = E(X^2) - 2E(X \, E(X \mid X_0)) + E(E^2(X \mid X_0)) = E(X^2) - E(E^2(X \mid X_0)),$$

since, by conditioning on $X_0$, $E(X \, E(X \mid X_0)) = E(E^2(X \mid X_0))$. Similarly,

$$E(X - E(X \mid X_0, Y_0))^2 = E(X^2) - E(E^2(X \mid X_0, Y_0)).$$

Further, by Jensen's inequality, for any random variable $Z$, $E(Z^2) \ge (E(Z))^2$. Applying it conditionally on $X_0$,

$$E(E^2(X \mid X_0, Y_0)) = E\big(E(E^2(X \mid X_0, Y_0) \mid X_0)\big) \ge E\big([E(E(X \mid X_0, Y_0) \mid X_0)]^2\big) = E(E^2(X \mid X_0)).$$

So, equivalently,

$$E(X - E(X \mid X_0, Y_0))^2 \le E(X - E(X \mid X_0))^2. \qquad \Box$$

Returning to the ozone example, we take $X_0$ to be the observed levels of O3 at the gauged sites, $Y_0$ the observed levels of the other pollutants, and $X$ the unobserved pollution level at an ungauged site. Then the univariate Bayesian interpolator is $E(X \mid X_0)$ and the simultaneous interpolator is $E(X \mid X_0, Y_0)$. When the model is correctly specified and the hyperparameters are known, Theorem 5.1 implies that the simultaneous interpolator does no worse than the univariate interpolator.

The following CV study supports the above claim empirically. Again, the monthly air pollution data in Southern Ontario are used. At each gauged site, the observed pollutants are deleted as if they were not observed. Then both the univariate and simultaneous Bayesian interpolators are applied to obtain predicted values of the "deleted" values based on the data at the other gauged sites. When the predicted values by both methods are computed for all 31 gauged sites, the mean squared prediction error (MSPE) is calculated for the univariate interpolator and the simultaneous interpolator respectively. The results for the monthly summer data and monthly winter data are listed below.

       Simultaneous              Univariate
       summer       winter       summer      winter
NO2    0.18543849   0.14342632   0.2829322   0.1292677
SO4    0.13848447   0.21311221   1.274841    0.7310782
O3     0.04369236   0.05382523   0.12536     0.2367643
SO2    0.62173407   0.28098323   0.758635    0.4344104

The values confirm our theoretical result, except in the case of NO2 in winter, where the MSPE of the univariate interpolator is smaller than that of the simultaneous interpolator. One interesting point is worth mentioning here. The above numbers show that the relative reduction in the MSPEs achieved by simultaneous interpolation over univariate interpolation is much higher for SO4 and O3 than for SO2. For SO4 and O3, the relative reduction is from 300% to 900%; for SO2, the reduction is under 50%. This is because SO4 and O3 are regional pollutants and SO2 is a local pollutant. Intuitively, a regional air pollutant has a higher correlation with the other pollutants than a local pollutant, as indicated by the estimated between-pollutants hypercorrelations in Section 3.1. By including the other correlated pollutants in the analysis, we would expect to enhance the interpolation. For a local pollutant, since it has little or no correlation with the other pollutants, the inclusion of additional pollutants in the analysis will not improve the interpolator as much. Therefore, we can conclude that the interpolator with data missing-by-design does better than that of LZ for regional pollutants; it does not do much better than LZ for local pollutants. The conclusion has theoretical support: if $X$ and $Y_0$ are independent, equality in Equation (5.1) obtains.

5.2 Trends in Adjusted Mean Squared Prediction Error (AMSPE)

We now check the performance of the Bayesian interpolator (2.6) in terms of the AMSPE. Here the adjusted prediction error is defined to be $(X_{pred} - X_{obs})/\mathrm{std}(X_{pred})$, that is, the difference between the predicted value and the observed value divided by the standard deviation of the predicted value. The AMSPE is the mean of all squared adjusted prediction errors; a small sketch of its computation is given below. We prefer the AMSPE over the mean squared error (MSE) because different pollutants have different units of measurement, hence different variability. Therefore, mean prediction errors of different pollutants must be normalized to make them comparable.
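The following is a minimal sketch of the AMSPE calculation just described, written in the style of the S code of Appendix A; the vector names x.pred, x.obs and sd.pred are illustrative assumptions, not objects produced by the iwm function.

    # Adjusted mean squared prediction error: each prediction error is
    # scaled by the standard deviation of the corresponding predicted
    # value before squaring, so that pollutants measured on different
    # scales contribute comparably.
    amspe <- function(x.pred, x.obs, sd.pred)
        mean(((x.pred - x.obs)/sd.pred)^2)

Applied after each column deletion in the study below, a call such as amspe(x.pred, x.obs, sd.pred) yields one of the successive AMSPE values whose trend is plotted in Figure 5.1.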
The new method gives a predictive distribution, therefore a simultaneous interpolation region, while Hass’s method only gives the interpolated value and its standard deviation. The new method easily handles any number of variables (pollutants), ignoring for the limitation of computer capacity, while Hass’s method only allows two variables in order to retain mathematical tractabil ity. The new method includes all available data in the interpolation process, while Hass’s method only uses only partial data. However, when sk is big, Hass’s method has a sub stantial computing advantage over the new method, since the convergence of the EM algorithm used in the new method is slow. A detailed comparison follows. Generally speaking, our simulation study shows that Hass’s method enjoys computational advantages over the new method when there are  95  many sites. When there are not enough sites, Hass’s method encounters difficulty. Since in each moving window, Hass’s method requires a minimal number of sites in the window for the purpose of variogram estimation. When gauged sites are widely spread out or there are not enough sites, that requirement cannot be met. This affects the accuracy of the estimated variogram. On the other hand, since the new method involves a lot of matrix operations, when there are too many sites, the dimensions of the matrix can be so big that computation becomes excessively slow. In the case of a small number of gauged sites, the new method works fine. Another difference between Hass’s method and the new method is in the models they use. While Hass’s method models only a “snapshot” of a spatial trend, the new method incorporates the temporal trend for each pollutant at each site. Therefore, the new method can handle spatial-temporal data, but may have problems with spatial data alone. In the case of spatial data alone, it is hard to model a temporal trend with one datum for each observed pollutant at each gauged site. Because of the complexity of the cross-variogram formulas, Hass’s method cannot take more than two variables (e.g., two air pollutants) in practice and it is hard to expand beyond that. The new method can handle any number of variables (pollutants). As shown in Chapter 2, a simultaneous prediction region is derivable when a predictive distribution is available. Hass’s method only gives a pairwise confidence interval for the interpolated value. A fundamental distinction between the two methods lies in their Bayesian against non Bayesian distinction. We do not take up that issue here. For an empirical comparison of the two methods, we use a real data set. The data come from selected sites in the National Acidic Deposition (NADP) Network and the  96  National Trends Network (NTN) in United States.  Refer to Wu and Zidek (1992),  Watson and Olsen (1984) for more details. Forty-five gauged sites, the latitudes and longitudes of which are in Table 5.1, are randomly chosen. Since both sulfate and nitrate are regional pollutants, they should be highly correlated. That guess is confirmed by an analysis of the data, where the estimated hypercorrelation between sulfate and nitrate is higher than 0.70. The high correlation enables both methods to fully demonstrate their strength and weakness, making their comparison more reliable. At the forty-five sites, both sulfate and nitrate are observed monthly from 1983 to 1986. In terms of Hass’s method, one pollutant is treated as a response variable and the other as a covariate. 
The covariate is included in the analysis to improve the interpolation precision. We arbitrarily chose nitrate as the response variable and sulfate as the covariate. To agree with the form of a CV study provided by Dr. Tim Hass, both observed levels of nitrate and sulfate at the first 35 sites of Table 5.1 are used for our study and only the monthly sulfate observations at the remaining 10 sites are used for the study. The number, 35, is chosen because according to Hass’s method, in each moving window there must be a minimum of 30 gauged sites. For each month of the four years, a CV study of Hass’s method is done and the MSPE is computed for the original and log-transformed data respectively. The same study is done for the new method. The MSPE, for both methods on the two cases are listed in Table 5.2. In terms of MSPE, with the original data, the new method beats Hass’s method in 32 out of 48 months and in terms of the grand mean of MSPE, the new method also wins by 0.4437143 against 1.25524. Figure 5.2 displays the monthly MSPE’s of both methods on the original data. In the figure, “Bayesian” means the new method and CoKriging means Hass’s method. With the log-transformed data, the ranking is reversed. The new method loses to Hass’s method in 39 out of 48 months and 0.3813128 against 0.2347863 of the mean MSPE. See Figure 5.3 for a graphical comparison of the MSPE’s at each 97  month on the log-transformed data. By checking the MSPE’s of the new method used on the log-transformed data at each site, we found that among 35 MSPE’s, the MSPE value at Site 35 (i.e, Site 075a) is much higher than those at other sites. At Site 35, the MSPE value is 7.48, while at other sites, all the MSPE values are below 0.493 (except that at Site 21). Observations at Site 35 are unnaturally compressed into a small interval near zero, making their logarithm extraordinarily large in magnitude. Thus in Table 5.2, we observe MSPE’s for transformed data larger than those for the original because of this outlying site. As well, the poor performance at Site 35 makes the monthly MSPE values of the new method applied to the log-transformed data systematically higher than those of Hass’s method. Figure 5.4 shows that the boxplot of log-transformed nitrate levels at Site 35 is well below other boxplots as noted above. Recall, in Chapter 2, we assumed that the hyperparameter B° has an exchangeable structure. From our simulation study, we know that if that assumption is violated, the proposed new method will not do well. For example, if B° were taken to be identical over sites for each pollutant while the actual data shows that it is not true, the performance of our interpolator would be poor. So one explanation of the poor performance of the new method on the log-transformed data is that the exchangeability assumption on B° is violated because of Site 35. That is, the actual hyperparameter B° at that site is not the same as that of the same pollutant at other sites. To check this conjecture, Sites 21, 35, 36, 44, 46, where either observed sulfate levels or nitrate levels are unusual, are removed from the data set and five new sites are added in (the codes are 076a, 077a, 078a, 160a, 161a). CV studies of both Hass’s method and the new method are then carried out on the new log-transformed data. In terms of MSPE, over 48 months the new method beats Hass’s method in 35 out of 48 months and in terms of the mean  98  MSPE, the new method gains an edge by 0.2087363 against 0.3676921. These results support our conjecture. 
However, the diagnosis of the exchangeability assumption is difficult. The following theory offers a way to make a rough check. Note, by Model (2.12)-(2.14) and Equation (2.16), ° 2 B  ,t t ) 0 2 N((B ((ZZ ) ’ + F-’) 0  22).  Equation (2.8) implies, 2 B B  r’-  T(’1’, (ZZt)_1 + F’, B, 6*  By Lemma 2.4, the marginal distribution of /9r (i = 1,.  —  .  .  sk ,  .Sgk  —  1 + h).  —  1), the  row of B , 2  has distribution T(d, t (ZZ ) ’ + F’, 3° 6* where 3° is the  rth  —  sk + h + 1),  row of B. To define d, let D=  (f” f::)  Cr, 2 = 2  where Cr is an orthogonal matrix that exchanges the r column of J! with its last column, D,, is (s k 9  —  1— 1) by (sgk  —  1— 1) and d 22 is 1 by 1. Then d is d 22  —  , D,,’ D, 2 D . 2  By applying Theorem 2.8, the following theorem is proved. Theorem 5.2 F  = 6* —sk+ 1($  [(zzt)1+F1]( _)  Then by the above theorem, the p-value for 1  —  [1  —  /335  PF(F 99  >  o)t  k÷1. 3 Fh,ss_  at Site 35 can be computed as 35 ) 8 b 0 F , ]  (5.2)  because  /335  is the extreme value among the 35  r5 3 /  by plugging in the estimated hyperparameters and  j  =  1,.  .  .  ,  ,, can be computed 0 F 35. 3  into Equation (5.2). With the  above p-value formula, unless PF(F > ) 03 is extremely small, it is very unlikely we F would reject the exchangeability assumption at Site 35. Therefore, the power of the above test is low. An application of the above test to our data shows that we failed to reject it. However, the boxplots of Figure 5.4 make us believe that the observed levels of nitrate at Site 35 are abnormal. Theoretically, Hass’s method is optimal only when the processes of pollutants are mul tivariate normal with a flat prior on the parameters. That explains why the MSPE values of Hass’s method drop a lot when it is applied to the log-transformed data. (See Table 5.2.) A log-transformation made the observed levels of the pollutants follow a multivariate normal more closely, although it is not necessarily the most appropriate transformation. The new method is also sensitive to the normality assumption. From Table 5.2, it can be seen that the MSPE values were reduced in 31 out of 48 months when the new method was applied to the log-transformed data.  100  Table 5.1: Latitudes and Longitudes of the Gauged Sites. Sites  Lat.  Long.  004a OlOa  36.1006  94.1733  40.3644  105.56  Olla  39.1011  105.092  012a  40.8064  104.754  017a  33.1778  84.4061  020a  40.0533  021a  41.7011  88.3719 87.9953  022a  37.71  89.2689  023a  37.4356  88.6719  024a  41.8414  88.8511  025a  41 .6325  87.0878  028a  35.6644  83.5903  029a  37.1989  108.491  030a  45.4897  69.6644  031a  45.5611  84.6783  032a  42.4103  85.3928  033a  44.2244  85.8186  034a  47.5311  93.4686  035a  44.2372  95.3006  036a  32.3344  88.745  037a  48.5103  113.996  038a  41.1531  96.4928  039a  43.9431  71.7033  040a 041a  42.7339  76.6597  42.2994  79.3964  046a 047a  43.5261  75.9472  42.1061  77.5356  049a  36.1278  77.175  051a  35.6967  052a  35.0239  80,6228 78.2792  053a  35.7286  78.6811  055a  40.3553  83.0661  056a  39.7928  81.5311  058a  40.78  81 .9253  059a  44.3869  123.623  061a  44.2231  122.242  063a  41.5978  78.7678  064a  40.6589  77.9361  065b  40.7883  77.9464  068a  36.0717  112.153  070a  29.3019  103.177  071a  96.92  073a  28.8453 37.335  074a  47.86  123.933  075a  39.0897  79.6622  101  80.5578  Table 5.2: MSPEs by Both Methods. 
month 1983.1 2 3 4 5 6 7 8 9 10 11 12 1984.1 2 3 4 5 6 7 8 9 10 11 12 1985.1 2 3 4 5 6 7 8  original_data LCoKriging new method I 3.28392 5.652717 0.950273 0.617414 0.232695 0.914407 0.668226 1.347384 0.300603 0.521691 0.238963 0.320864 3.209043 0.817165 0.469469 0.517674 0.143388 1.500481 0.987726 0.457944 0.210432 0.499223 0.808873 0.254086 1.876528 1.821478 0.398158 2.397513 0.570281 1.89153 0.509808 0.758525  0.52072 0.366484 0.321812 0.396189 0.509551 0.881811 0.786001 0.592036 0.543329 0.375799 0.25169 0.332461 0.402137 0.539348 0.18821 0.288131 0.334989 0.787056 0.775919 0.60891 0.354206 0.502362 0.234341 0.356278 0.238328 0.36958 0.244937 0.430129 0.307239 0.37827 0.51783 0.733556 102  log-transformed_data CoKriging new method 0.341134 0.549418 0.165156 0.115926 0.083971 0.183823 0.290934 0.263396 0.092434 0.166289 0.079822 0.095796 0.339717 0.234923 0.079713 0.28535 0.047676 0.23976 0.34463 0.208633 0.082369 0.163754 0.146061 0.089412 0.51466 0.181693 0.13868 0.326478 0.13937 0.405738 0.153515 0.215295  0.323611 0.357877 0.303732 0.27441 0.320196 0.407721 0.405173 0.183026 0.332301 0.226903 0.394561 0.389354 0.392407 0.453388 0.394719 0.781333 0.253829 0.400371 0.340348 0.606851 0.290468 0.427758 0.405709 0.501983 0.265733 0.427747 0.267305 0.326975 0.26641 0.461921 0.297702 0.545632  month  original_data CoKriging new method  log-transformed_data  1985.9 10 11 12 1986.1 2 3 4 5 6 7 8 9 10 11 12  0.302321 1.559379 0.24711 9.080794 6.937748 1.717417 1.376672 0.410942 0.240855 0.270805 0.377201 0.664169 0.492622 0.363747 0.199882 0.861362  0.310728 0.194151 0.281423  CoKriging 1_new method 0.200713 0.58109 0.314374 0.291196 0.188169 0.265927 1.434542 0.395802 0.685439 0.346247 0.29135 0.261732 0.217498 0.315148 0.116953 0.5226 0.069861 0.380813 0.09048 0.36987 0.08781 0.598339 0.226513 0.28267 0.14715 0.43719 0.118633 0.417164 0.109205 0.307664 0.205528 0.502112  Mean  1.25524  0.4437143  0.234786  0.470564 0.484305 0.165849 0.585204 0.553377 0.32262 0.194057 0.32283 0.568083 0.755552 0.637667 0.411857 0.570382  103  0.381312  Chapter 6 Concluding Remarks and Future Studies In this thesis, the theories of Bayesian multivariate interpolation with missing data are discussed. The proposed interpolation theories have two main characteristics, a hierarchical Bayesian method and a multivariate method. Some obvious advantages of a Bayesian method: with a Bayesian paradigm, the prior information is easily incorporated into an interpolation procedure, if such prior informa tion is indeed available; with a Bayesian approach, one can include the model uncertainty into the confidence interval of interpolated values, while traditional CoKriging fails to do so. Thus, the confidence intervals computed with traditional CoKriging methods are narrower they should be. With a hierarchical Bayesian approach, the specification of the spatial covariance structure can be pushed up to a higher level in the hierarchy. That is, the covariance structure can be specified at Stage 2 of the modeling. Under this setup, if there is any mistake in the specification of the covariance structure, future observations can modify the wrongly specified structure. Eventually if enough data are observed, the mistake will be corrected. Therefore, Bayesian modeling provides robustness. An other benefit of the Bayesian approach is that it provides predictive distribution, while  104  CoKriging only provides an interpolated value with its standard deviation. 
By knowing the predictive distribution, one can do more things, e.g., constructing a simultaneous confidence region, creating random numbers for a simulation study, and so on.

The other characteristic of the theories in this thesis is their multivariate nature. The general theory of BLZ also yields a multivariate interpolator. However, when there are missing data, as when data are missing-by-design, with the BLZ theory interpolation can only be done pollutant-by-pollutant, which reduces the method to a univariate method. The advantage of a multivariate approach is that it allows the interpolation to be carried out by including all the available information. Theorem 5.1 of Chapter 5 concludes that, in terms of mean squared error, when the hyperparameters are known, a Bayesian interpolator based on all available information performs at least as well as a Bayesian interpolator based on partial information. With that theorem, the extension of the general BLZ theory to the new theory is intuitively justified. As an empirical check, a CV study in Chapter 5 shows that the quantitative gains of the new method over the LZ method are significant.

The theories developed in this thesis are by no means complete. There is much room for further study. For example, a general estimation method for the unknown hyperparameters is needed for the theory of interpolation with monotone missing data patterns. Other future research topics are listed below.

In Chapter 5, while the theory of interpolation with data missing-by-design was compared with Hass's CoKriging theory, we pointed out that if the observed data form a spatial data set only, the theory in this thesis is not directly applicable. That is because the model of our Bayesian interpolation contains a temporal trend, which cannot be estimated from a single datum. Therefore, hierarchical Bayesian CoKriging is of interest; here, hierarchical Bayesian CoKriging means putting a prior on the unknown coefficients of the trend and on the spatial covariance matrix. One possible approach, sketched in code below, is as follows. When there are many sites involved, the dimensions of the spatial covariance matrix can be very big. To reduce the number of random variables in the spatial covariance matrix, one may divide all related sites into $m$ clusters, with $n_i$ sites in the $i$th cluster. When the means are assumed to be zero, a model of the random process at the $i$th cluster is set to be

$$X_i = R_i \mathbf{1} + L_i,$$

where the random variable $R_i$ reflects the correlations of Cluster $i$ with the other clusters, $\mathbf{1}$ is a $1 \times n_i$ vector of 1's, and $L_i$ reflects the local spatial correlation at Cluster $i$. Let $R$ represent the $m \times 1$ random vector that takes care of the between-clusters spatial correlation. Conjugate priors of inverted Wishart distributions are assumed on $\mathrm{Var}(R)$ and $\mathrm{Var}(L_i)$, $i = 1, \ldots, m$. The details of the interpolation theory remain to be filled in.

A Gaussian model is assumed for all the interpolation theories discussed in this thesis. In some applications, the observed data may not be Gaussian and may not be transformable to Gaussian, so an interpolation theory for non-Gaussian data is needed. With a Gaussian model, the interpolator with data missing-by-design has an analytical form. With a general distribution model, an interpolator may not have a closed form. In that case, some Bayesian computing tools, like Gibbs sampling or importance sampling, may need to be brought in for a numerical answer.
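To make the cluster decomposition above concrete, the following is a minimal sketch, in the style of the S code of Appendix A, of the covariance structure that the model $X_i = R_i \mathbf{1} + L_i$ induces; the function name and argument layout are illustrative assumptions, not part of the thesis's development.

    # Covariance of the stacked process (X_1,...,X_m): the (i,j) block is
    # Var(R)[i,j] times a block of ones, plus the local covariance VL[[i]]
    # on the diagonal blocks. "VR" is the assumed m x m covariance matrix
    # of R and "VL" an assumed list of the m local covariance matrices.
    cluster.cov <- function(VR, VL) {
        m <- nrow(VR)
        sizes <- unlist(lapply(VL, nrow))
        off <- cumsum(c(0, sizes))
        V <- matrix(0, sum(sizes), sum(sizes))
        for (i in 1:m)
            for (j in 1:m) {
                blk <- VR[i, j] * matrix(1, sizes[i], sizes[j])
                if (i == j) blk <- blk + VL[[i]]
                V[(off[i]+1):off[i+1], (off[j]+1):off[j+1]] <- blk
            }
        V
    }

Under this decomposition, only the $m \times m$ matrix $\mathrm{Var}(R)$ and the $m$ local matrices $\mathrm{Var}(L_i)$ need priors, instead of one covariance matrix over all sites at once, which is the dimension reduction the paragraph above is after.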
The prior distribution of the unknown spatial covariance matrix, $\Sigma$, in this thesis is taken to be an inverted Wishart, a conjugate distribution for the Gaussian model. Since many authors criticize the Wishart distribution (cf. Press 1982; Brown, Le and Zidek 1994b), an extension of the inverted Wishart may be needed. One possible choice, which keeps the conjugacy of the priors, and therefore retains mathematical tractability, while at the same time providing more parameters to offset one deficiency of the Wishart, is proposed in Brown, Le and Zidek (1994b). The proposed prior is called a Generalized Inverted Wishart distribution. However, the following is worth mentioning: based on our application and simulation studies in Chapters 3 and 5, our method works well with the inverted Wishart prior.

In Chapter 3, we saw that the daily air pollution levels in Southern Ontario show significant lag-1 correlation. There, an AR(1) transformation was adopted to remove the correlation. A better method would be to extend the interpolation theory of LZ to include lag-1 correlated time series or, more generally, to remove the assumption of temporal independence. That extension will have immediate applications.

From the same Southern Ontario study, we saw that among the observed data, some are censored below. In Chapter 3, the censored data are simply treated as missing and filled in with an ad hoc method. Therefore, a new theory that can handle censored data will be needed for applications.

A last remark goes to the optimal redesign of a current monitoring network. In Chapter 4, an example demonstrates how to add two sites to a current monitoring network. An interesting question may be raised: instead of adding a new monitoring site that measures ozone only, an alternative approach would put an ozone monitor at one of the currently gauged sites where ozone is not monitored. The question is, which is optimal? While this can be answered with the same entropy approach given in Chapter 4, there are some problems. Normally, maintaining a separate monitoring site costs more than putting a meter at a currently operating site. The entropy approach does not take that into consideration. Therefore, to solve such a problem, we have to bring additional factors, like cost, into the objective function.

Bibliography

Afifi, A. A. and Elashoff, R. M. (1961). "Missing Observations in Multivariate Statistics." J. Amer. Statist. Assoc., 61, 595-604.

Ahmed, S. and de Marsily, G. (1987). "Comparison of Geostatistical Methods for Estimating Transmissivity Using Data on Transmissivity and Specific Capacity". Water Resources Research, 23, 1717-1737.

Anderson, T.W. (1984). "An Introduction to Multivariate Statistical Analysis". New York: Wiley.

Berndtsson, R. (1988). "Temporal Variability in Spatial Correlation of Daily Rainfall". Water Resources Research, 24, 1511-1517.

Billingsley, P. (1968). "Convergence of Probability Measures." New York: Wiley.

Brown, P.J., Le, N. D. and Zidek, J.V. (1994a). "Multivariate Spatial Interpolation and Exposure to Air Pollutants". Canadian Journal of Statistics. To appear.

Brown, P.J., Le, N. D. and Zidek, J.V. (1994b). "Inference for A Covariance Matrix". Aspects of Uncertainty: A Tribute to D.V. Lindley. Eds. A.F.M. Smith and P.R. Freeman. John Wiley & Sons.

Buck, S. F. (1960). "A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer." J. of the Royal Statistical Society, Ser.
B, 22, 302-307.

Burnett, R. T., Dales, R. E., Rainenne, M. D. and Krewski, D. (1992). "The Relationship Between Hospital Admissions and Ambient Air Pollution in Ontario, Canada: A Preliminary Report". Unpublished Report.

Caselton, W.F., Kan, L. and Zidek, J. V. (1992). "Quality Data Networks that Minimize Entropy". Statistics in the Environmental and Earth Sciences. Eds. P. Guttorp and A. Walden. Griffin, London.

Chen, C.F. (1979). "Bayesian Inference for a Normal Dispersion Matrix and its Application to Stochastic Multiple Regression Analysis". J. Roy. Statist. Soc., B, 41, 235-248.

Cressie, N. and Hawkins, D.M. (1980). "Robust Estimation of the Variogram, I." J. of the International Association for Mathematical Geology, 12, 115-125.

Cressie, N. (1986). "Kriging Nonstationary Data". JASA, 81, 625-634.

Cressie, N. (1989). "Geostatistics". The American Statistician, 43, 197-202.

Cressie, N. (1991a). "Statistics for Spatial Data". New York: Wiley.

Cressie, N. (1991b). "Modeling Growth with Random Sets." In Spatial Statistics and Imaging (Proceedings of the AMS-IMS-SIAM Joint Summer Research Conference), A. Possolo, ed. Institute of Mathematical Statistics, Hayward, CA.

Dawid, A.P. (1978). "Extendibility of Spherical Matrix Distribution". J. Mult. Anal., 8, 559-66.

Dawid, A.P. (1981). "Some Matrix-variate Distribution Theory". Biometrika, 68, 265-74.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion)". J. Roy. Statist. Soc., B, 39, 1-38.

Delhomme, J.P. (1978). "Kriging in the Hydrosciences". Advances in Water Resources, 1, 251-266.

Dickey, J.M. (1967). "Matricvariate Generalizations of the Multivariate t Distribution and the Inverted Multivariate t-distribution". Ann. Math. Statist., 38, 511-8.

Dickey, J.M., Lindley, D.V. and Press, S.J. (1985). "Bayesian Estimation of the Dispersion Matrix of a Multivariate Normal Distribution". Communications in Statistics A, 14, 1019-1034.

Duddek, C., Le, N. D., Sun, W., White, R., Wong, H. and Zidek, J.V. (PI) (1994). "Assessing the Impact of Ambient Air Pollution on Hospital Admissions Using Interpolated Exposure Estimates in Both Space and Time: Final Report to Health Canada under DSS Contract h4078-3-C059/01-SS". Unpublished Report.

Dunnett, C. W. and Sobel, M. (1954). "A Bivariate Generalization of Student's t-distribution with Tables for Certain Special Cases". JASA, 50, 1096-1121.

Eaton, M.L. (1983). "Multivariate Statistics: A Vector Space Approach". Wiley, New York.

Guttorp, P., Le, N. D., Sampson, P.D. and Zidek, J.V. (1993). "Using Entropy in the Redesign of an Environmental Monitoring Network". Multivariate Environmental Statistics; edited by G.P. Patil and C.R. Rao. North Holland/Elsevier Science, NY.

Harrison, P. J. and Stevens, C. F. (1976). "Bayesian Forecasting" (with discussion). J. of the Royal Statistical Society, Ser. B, 38, 205-247.

Hass, T. C. (1993). "Cokriging Variables That Are First and Second Order Nonstationary". Paper presented at the annual meeting of the Statistical Society of Canada, Wolfville, Nova Scotia, June, 1993.

Hass, T. C. (1992). "Redesigning Continental-Scale Monitoring Networks". Atmospheric Environment, 26A(18), 3323-3333.

Hass, T. C. (1990a). "Lognormal and Moving Window Methods of Estimating Acid Deposition". J. Amer. Statist. Assoc., 85(412), 950-963.

Hass, T. C. (1990b). "Kriging and Automated Variogram Modeling within a Moving Window". Atmospheric Environment, 24A(7), 1759-1769.

Halley, E. (1686). "An Historical Account of the Trade Winds, and Monsoons Observable in the Seas between and near the Tropics: With an Attempt to Assign the Physical Cause of Said Winds". Philosophical Transactions, 183, 153-168.

Handcock, M. S. and Stein, M. L. (1993). "A Bayesian Analysis of Kriging". Technometrics, 35(4), 403-410.

Handcock, M. S. and Wallis, J. R. (1994). "An Approach to Statistical Spatial-Temporal Modeling of Meteorological Fields". JASA, 89(426), 368-390.

Huijbregts, C. J. and Matheron, G. (1971). "Universal Kriging (An Optimal Method for Estimating and Contouring in Trend Surface Analysis)." In Proceedings of Ninth International Symposium on Techniques for Decision-Making in the Mineral Industry, Canadian Institute of Mining and Metallurgy, Special Volume 12, 159-169.

Johnson, T., Capel, J., McCoy, M. and Warnasch, J. (1994). "Estimation of Carbon Monoxide Exposures and Associated Carboxyhemoglobin Levels Experienced by Residents of Toronto, Ontario Using a Probabilistic Version of NEM." Unpublished Report.

Journel, A. G. (1980). "The Lognormal Approach to Predicting Local Distributions of Selective Mining Unit Grades." Journal of the International Association for Mathematical Geology, 12, 285-303.

Kannan, D. (1979). "An Introduction to Stochastic Processes". New York: North Holland.

Kitanidis, P. K. (1986). "Parameter Uncertainty in Estimation of Spatial Functions: Bayesian Analysis." Water Resources Research, 22, 499-507.

Komungoma, S. (1992). "Assessment of the Quality of the NADP/NTN Data Based on Their Predictability." M.Sc. Thesis, Department of Statistics, University of British Columbia, Vancouver, Canada.

Laslett, G. M. (1994). "Kriging and Splines: An Empirical Comparison of Their Predictive Performance in Some Applications." JASA, 89(426), 391-409.

Le, N. D. and Zidek, J.V. (1992). "Interpolation with Uncertain Spatial Covariance: A Bayesian Alternative to Kriging". J. Mult. Anal., 43, 351-74.

Leonard, T. and Hsu, S.J.H. (1992). "Bayesian Inference for a Covariance Matrix". Annals of Statistics, 20, 1669-1696.

Lindley, D. V. and Smith, A.F.M. (1972). "Bayes Estimates for the Linear Model". J. Roy. Statist. Soc., B, 34, 1-32.

Little, R. J. A. and Rubin, D.B. (1987). "Statistical Analysis with Missing Data". John Wiley & Sons.

Loader, C. and Switzer, P. (1992). "Spatial Covariance Estimation for Monitoring Data". In Statistics in the Environmental and Earth Sciences, eds. A. T. Walden and P. Guttorp, London: Edward Arnold, pp. 52-70.

Mardia, K. V., Kent, J.T. and Bibby, J.M. (1979). "Multivariate Analysis". Academic Press, New York.

Mardia, K.V. and Marshall, R.J. (1984). "Maximum Likelihood Estimation of Models for Residual Covariance in Spatial Regression". Biometrika, 71, 135-146.

Matheron, G. (1962). "Traite de Geostatistique, Tome I." Memoires du Bureau de Recherches Geologiques et Minieres, No. 14. Editions Technip, Paris.

Matheron, G. (1969). "Le Krigeage Universel." Cahiers du Centre de Morphologie Mathematique, No. 1. Fontainebleau, France.

Matheron, G. (1971). "The Theory of Regionalized Variables and Its Applications". Cahiers du Centre de Morphologie Mathematique, No. 5. Fontainebleau, France.

Miller, A. A. and Sager, T. W. (1994). "Site Redundancy in Urban Ozone Monitoring." J. Air & Waste Manage. Assoc., 44, 1097-1102.

Muirhead, R. J. (1982). "Aspects of Multivariate Statistical Theory". John Wiley, New York.

Nather, W. (1985). "Effective Observations of Random Fields." Teubner-Texte zur Mathematik, Band 72. Teubner, Leipzig.

Neuman, S. P. and Jacobson, E.A. (1984). "Analysis of Nonintrinsic Spatial Variability by Residual Kriging with Application to Regional Groundwater Levels". J. of the International Association for Mathematical Geology, 16, 499-521.

Olea, R.A. (1984). "Sampling Design Optimization for Spatial Functions". Journal of Geophysical Research, 79, 695-702.

Oehlert, G. W. (1993). "Regional Trends in Sulfate Wet Deposition". JASA, 88, 390-399.

Omre, H. (1987). "Bayesian Kriging - Merging Observations and Qualified Guesses in Kriging". Mathematical Geology, 19, 25-39.

Omre, H. and Halvorsen, K.B. (1989a). "The Bayesian Bridge between Simple and Universal Kriging". Mathematical Geology, 21, 767-786.

Omre, H. and Halvorsen, K.B. (1989b). "A Bayesian Approach to Kriging". In Proceedings of the Third International Geostatistics Congress I, ed. M. Armstrong, Dordrecht: Academic Publishers, pp. 49-68.

Pilz, J. (1991). "Bayesian Estimation and Experimental Design in Linear Regression Models." John Wiley, New York.

Press, S. J. (1982). "Applied Multivariate Analysis - Using Bayesian and Frequentist Methods of Inference". Holt, Rinehart & Winston, New York.

Rubin, D. B. (1976). "Comparing Regressions When Some Predictor Variables Are Missing". Technometrics, 18, 201-206.

Sampson, P. and Guttorp, P. (1992). "Nonparametric Estimation of Nonstationary Spatial Covariance Structure". J. Amer. Statist. Assoc., 87(417), 108-119.

Särndal, C. E., Swensson, B. and Wretman, J. (1992). "Model Assisted Survey Sampling". Springer-Verlag, New York, Inc.

Stein, A. and Corsten, L.C.A. (1991). "Universal Kriging and Cokriging as a Regression Procedure". Biometrics, 47, 575-587.

Stein, M. L., Shen, X. and Styer, P.E. (1991). "Applications of a Simple Regression Model to Acid Rain Data." Technical Report No. 276, Department of Statistics, University of Chicago, Chicago, IL, USA.

Taylor, A. E. and Lay, D.C. (1980). "Introduction to Functional Analysis". Wiley, New York.

Tanner, M.A. and Wong, W.H. (1987). "The Calculation of Posterior Distribution by Data Augmentation". JASA, 82, 528-550.

Wahba, G. (1990a). "Comment on Cressie". The American Statistician, 44, 255-256.

Wahba, G. (1990b). "Spline Models for Observational Data". Philadelphia: Society for Industrial and Applied Mathematics.

West, M. and Harrison, P.J. (1986). "Monitoring and Adaptation in Bayesian Forecasting Models". JASA, 81, 741-750.

Woodbury, A.D. (1989). "Bayesian Updating Revisited". Mathematical Geology, 21, 285-308.

Wu, C.F.J. (1983). "On the Convergence Properties of the EM Algorithm". Ann. Statist., 11, 95-103.

Wu, S. and Zidek, J.V. (1992). "An Entropy Based Review of Selected NADP/NTN Network Sites for 1983-1986". Atmospheric Environment.

Yates, S. and Warrick, A.W. (1987). "Estimating Soil Water Content Using Cokriging." Soil Science Society of America Journal, 51, 23-30.

Zellner, A. (1971). "An Introduction to Bayesian Inference in Econometrics." New York: John Wiley.

Watson, C.R. and Olsen, A. R. (1984). "Acid Deposition System (ADS) for Statistical Reporting - System Design and User's Code Manual". U.S. Environmental Protection Agency.

Appendix A:

In this Appendix, the S function iwm(m, cn, p, flag=1, sl, M2, misindx, Tint) and its S help file are listed. The iwm S function is designed to perform spatial interpolation and estimation of the hypercovariance matrices Lambda_g and Omega with the EM algorithm that is described at the end of Chapter 2. "iwm" calls a C program, the source file of which is listed in Appendix B.
iwm S function help file:

Returns a list of estimated hypercovariance matrices and the prior degrees of freedom.

USAGE:
iwm(m, cn, p, flag = 1, sl, M2, misindx, Tint)

ARGUMENTS:
m: number of measured or non-measured air pollutants at a site.
cn: when flag=0, cn is the known degrees of freedom of the prior distribution, an inverted Wishart, of the unknown spatial covariance matrix. When flag=1, cn is unknown and the input is an initial value. That value must be greater than m times the total number of sites, p+sl.
p: number of gauged sites.
flag: a value indicating whether cn is known or not.
sl: number of ungauged sites.
M2: a response matrix. The rows are the observations at gauged sites over time. The columns are "snapshot" air pollutant levels observed at gauged sites.
misindx: an index vector of missing columns and observed columns.
Tint: a p-column matrix, each column of which is the mean of all columns at a gauged site. In iwm, Tint is used for getting an initial estimate of the between-sites hypercovariance matrix T1.

VALUE:
T1: the estimated between-gauged-sites hypercovariance matrix.
B1: the estimated between-air-pollutants hypercovariance matrix.
cn: if flag=0, cn is the same as the input value. If flag=1, cn is the estimated degrees of freedom.

C OBJECT FILE:
The C object file "Phill.o" is supposed to be under your S working directory, which normally is at position 1 of your S search list. Use the S function search() to check this. If that is not true, you may need to customize the iwm function to your own version by replacing the following S commands in your copy of the iwm function,

	filename_search()[1]
	filename_paste(c(filename,"/Phill.o"),collapse="")
	dyn.load(filename)

with

	dyn.load("pathname/Phill.o")

where "pathname" is the complete pathname under which "Phill.o" is located.

DETAILS:
IWM, standing for "Interpolation With data Missing-by-design", implements the theory of interpolation with data missing-by-design. The theory is developed under a normal distribution model when there are data missing-by-design. The other hyperparameters, B° and F⁻¹, are estimated by unbiased estimators. The output T1 of iwm is used later as an input for the Sampson and Guttorp nonparametric approach, which extends T1 to include all sites. The structure of iwm is as follows: first, iwm does some initial data manipulation, and second, it calls "Phill.o". It is in the C program that the EM algorithm of the interpolation theory is implemented.

As a warning, iwm adopts an EM algorithm, which is notoriously slow. Sometimes, when the dimensions of the observed data matrix are big, it can take hours, even days, to finish. So first of all, be patient! Secondly, either submit it as a batch job or lower the precision value at line 70 of file "Phill.c". When you change the source code file, be sure that you recompile the file with the unix command "cc -c Phill.c" and copy the new object file to your S working directory. The current value is set to be 0.0005. You may set it to a different number. Also, you can try various transformations of the data matrix to make it close to normal. By doing this, the convergence is hopefully faster.

EXAMPLES:
Consider an artificial example for an explanation of how to use "iwm". Suppose there are four Sites 1, 2, 3 and 4. Site 1 is an ungauged site (sl=1); Sites 2, 3 and 4 are gauged sites (p=3). Three air pollutants, SO4, O3 and NO3, are of interest at all sites. Only SO4 is measured at Site 2, O3 at Site 3, and SO4, O3 and NO3 at Site 4. Now label 1 to 3 for SO4, O3, NO3 at Site 2; 4 to 6 for SO4, O3, NO3 at Site 3; and 7 to 9 for SO4, O3, NO3 at Site 4. Since the columns of air pollutants labeled 2, 3, 4, 6 are missing, and the columns labeled 1, 5, 7, 8, 9 are observed, misindx=c(2,3,4,6,1,5,7,8,9). If, further, the observed data matrix is:

       Site #2   Site #3   Site #4
       SO4       O3        SO4   O3    NO3
t=1    1.2       3.4       1.6   4.0   2.7
t=2    1.7       3.1       1.1   3.9   2.5

then

	M2 = | 1.2  3.4  1.6  4.0  2.7 |
	     | 1.7  3.1  1.1  3.9  2.5 |

and

	Tint = | 1.2  3.4  2.667 |
	       | 1.7  3.1  2.5   |

where the third column of Tint is the mean of the columns of M2 observed at Site 4 (the third, fourth and fifth columns of M2).

Macros of IWM:

function(m, cn, p, flag=1, sl, M2, misindx, Tint)
{
	f_dim(M2)[1]
*1 /* T is Lambda, B is Omega  *1  flagflag1; nflnl; f=*f 1; p*pl; m*ml; k=*kl; l=*l1; cn*cnl; s1*sll; Tmatrix(p,p); Bmatrix(m,m); misindxivector(m*p); matrix(m*p ,m*p); Ssimmatrix(p*m-l,p*m-l); /*******************/  for (j0;j<m*p;++j) misindx[j]=*(misindxl+j); for (i0; i<p*m—l;+-i-i) for (j0;j<p*m—l;++j) Ssim[i] [j]*(Ssiml+i*(p*m—1)+j); for (i0;i<p;++i) for (j0;j<p;++j) T[i] [j]*(T1+i*p+j);  122  for (i0;i<m;++i) for (j0;j<m;++j) B[i] [j]*(B1+i*m+j);  printf(”The program will finish when ma < O.0005\n”);  {  while (1) 1*  printf (“step 1\n”);  */  1(sigma, misindx, T, B, cn, p,f,l, m,n,k,Ssim,sl);  1*  printf(”step 2\n”);  */  {  while (1)  tldet(T,p); t2det(B,m); if if  ( (  /*  fabs(tl) < le—50) t11.O; fabs(t2) < le—50) t21.O;  *1  printf(”step 3\n”);  maxl(cn,T,B,,p,m,sl);  1*  *1  printf (“step 4\n”);  t3=det(T,p); t4det(B,m); if if  ( (  fabs(t3) < le—50) t31O; fabs(t4) < le—50) t4=1O;  if ((fabs(log(tl)—log(t3))<le-4)&& (fabs(log(t2)—log(t4))<le-4)) break;  } macn; tmpi2p*m; /*  printf (“step 5\n”);  */  detlog(T,B,&t2,&t3,p,m,l,misindx, Ssim); /*  printf (“step 6\n”);  if (flag>O)  1*  */  cn=sn(cn,n,p,m,sl,f,k,l,t2,t3);  printf(”step 7\n”);  */  123  ma=fabs(ma-cn)/cn; printf(”ma is h1f\n”,ma); if (ma<O.0005) break;  } for (i0;i<p;++i) for (j0;j<p;-i-+j) *(T1+i*p+j)T[i]  [j]  for (i0;i<m;++i) for (j0;j<m;++j) *(B1+i*m-I-j)=B[i]  [j]  *cnlcn;  } /**********************************************/  void maxl(c,T,B,L,p,m,sl) double **T,**B,**L,c;  mt  p,m,sl;  {  mt  i,j,k,l;  double xO,xl,detQ; void invert_matrixQ;  for (k0;k<p;-f+k) for (l=O;l<p;++l) for (i0,T[k] [l]0.O;i<m;++i) for (j0;j<m;++j) T[k] [l]+L[m*k+i] [m*l+j]*B[i]  U]!  (c— (double) (sl*m))/(double)m; invert_matrix(T,p); for (k=O;k<m;-i-+k) for (10;l<m;++l) for (i0,B[k][1]0.O;i<p;++i)  124  for (j0;j<p;-f+j) B Ek] [1] +=L [in*i+k] [m*j+l] *T [i]  Ci] /  (c—(double)(sl*m))/(double)p; invert_matrix(B,m);  } /**********************************************/  double sn(cl,n,p,m,sl,f,k,1,t2,t3) double cl,t2,t3;  mt  n,p,m,sl,f,k,l;  { double kl,k2,c2,funclQ,trigarnmaQ;  mt  i, tmpi;  {  while (1)  tmpip*m-l; kl=funcl(cl,n,p,m,sl,f,k,l,t2,t3); c2c1; for (il,k20.0;i<p*m;++i) k2=k2+trigamma((cl+(double) (n+f-i—k—sl*m))/2 .0) —trigainma((cl—(double)(i+sl*m))/2.0); k2k2/2.0; c1c2-kl/k2;  1*  printf(”kl,k2,cl,c2  is %1f 7 1f °hlf °hlf\n”,kl,k2,cl,c2); 0  if (fabs(kl)<1.e-08) break; if (cl<(double) (m*(p+sl)))  { if (c2/2>(double)(m*(p+sl)+1)) c1c2/2; else ci  (double) (m*(p+si)+i);  } }  125  *1  return ci;  } double funcl(c,n,p,m,sl,f,k,l,t2,t3) double c,t2,t3;  mt  p,m,n,k,si,f,l;  { double tl,digamma() ,**matrix() ,de-tQ;  mt  i;  void f_matQ;  for (i1,t10.0;i<p*m;-i-+i) titi+digamma((c+(double) (n+f—i—k—si*m)) /2.0) —digamma((c— (double) (i+si*m))/2 .0); /*  printf(”ti,t2,t3 °hlf °hlf 7 1f\n”,ti,t2,t3); 0  return ti—t2+t3;  } void detlog(T,B,t2,t3,p,m,l,mismndx,ssim) double **T, **B, *t2,*t3,**ssim;  mt  p,m,l, *misindx;  { double **Psi22, **tmi, **tm2, **R, **matrixQ,detQ double **Phiii, **Psi;  mt  tmpi2,tmpi,i,j ,ii,i2,i3,tpi;  tmpip*m-l; tmpi2p*m; Rmatrix(p*m, p*m); Phiiimatrix(tmpi2, tmpi2);  126  for (i10;±1<tmp±2;++i1) for (i20;±2<tmpi2;++i2) R[±1] [±21=0.0; for (i0; ±<p*m; i++)  {  tpimis±ndx[i] —1; R[tp±] [±1=1.0;  } for (i0;i<tmp±2;++i) for (j0;j<tmp±2;++j) Ph±11[i] [j]T[±/m] [j/m]*B[±°hm] [j°hm]; tmlmatr±x(tmp±2, tmp±2); tm2=matr±x(tmp±2, tmp±2); for (±1=O;±1<tmp±2;++±1) for (±2=0; ±2<tmp±2 ;  tml[±1] [±2]R[±2] [±1]; for (±1=0;±1<tmp±2;++±1) for (±2=0; ±2<tmp±2 ; for (±3=0,tm2[±1] [±2]=0.0;±3<tmp±2;++±3) tm2 [±1] 
[±2] =tm2 [±1] [±2] +tml  [±1] [±3] *ph±11 [±3] [±2]  f_mat (Ph±11 ,tmp±2); f_mat (tml ,tmp±2); Ps±matr±x(tmp±2, tmp±2); for (±1=0;±1<tmp±2;+i-±1) for (±2=0,Ps±[±1][±2]=0.0;±2<tmp±2;++±2) for (±3=0; ±3<tmp±2 ; ++±3) Ps±[±1] [±2]Ps±[±1] [±2]+tm2[±1] [±3]*R[±3] [±2]; f_mat (tm2,tmp±2); f_mat(R, tmp±2); Ps±22matr±x(tmp±, tmp±); for (±1=0; ±1<tmp±; ±1++) for (±2=0; ±2<tmp±; ±2++)  {  127  Psi22 [ii] [i2] =Psi [il+l] [i2+l] +ssim[il] [i2];  } *t2=log(det(Psi22,tmpi)); for (i10; il<tmpi; il++)  {  for (i2O; i2<tmpi; i2++)  Psi22[il] [i2]=Psi[il+l] [i2+l]  } *t3=log(det(Psi22,tmpi)); f_mat(Psi, tmpi2); f_mat(Psi22, tmpi);  } /******i*****************************************************/  #include <stdio .h> #include <math.h> /*****************************************************************/  /*  /* 1* 1* /*  1*  This module gives E(\_{(11)} and E(log  I\J(ii)}I I  I  M_{2}, \Phi, \cn{*})  m_{2}, \Phi, \cn{*}) for the  systematic pattern with whole column missing  *1  is \Sigma..{(11)} log is log I\Sigma_{(11)}  *1  I  1* phi is the hyperparameter \Phi_{(11)}=  1*  *1  \Lambda_{g} \otimes \Omega  1* cn is the degrees of freedom, another hyperparameter  1* /*  s2 is number of gauged stations, si is number of ungauged  *1  stations  1* 1 is number of missing columns, misindx is the index of  1* 1*  missing columns k is number of responses at each station  1* n is number time spots in the past, f is that in the future  128  *1 */  /* h is number of covariates  void 1(sigma, misindx, Lambda, Omega, delta, s2,f,l, k,n,h,Ssim,sl) double  mt  **,  **Lambda, **Omega, **Ssim,delta;  1, k,h,n, s2,f, *misindx,sl;  { double **Psillstar, **Psill, **Psil2, **Psi22, **matrixQ, **Psi; double **R, **Rt, **tml, **tm2, **tm3, **tm4,**tm5, **1; double **Psi22in, **Phi22, **Psillstarin, **Phill, cont,contl; double **etal2, **etal2star;  mt  ii, i2, i3,tmpi,  void  mat_inverseQ, f_matQ;  tmpi2, tpi;  FILE *out;  tmpis2*k -l Rmatrix(s2*k, s2*k); for (i10;il<s2*k;++il) for (i20;i2<s2*k;++i2) R[il] [i2]0.O; for (i10; il<s2*k; il++)  {  tpi=misindxEil]—1; R[tpi] [il]=1.O;  } tmp i 2 = s2 *k; Phillmatrix(tmpi2, tmpi2); Psimatrix(tmpi2, tmpi2); for (i10;il<tmpi2;++il) for (i2=O; i2<tmpi2;++i2) Phill[il] Ei2]LambdaEil/k] [i2/k]*Omega[il%k] [i2%k];  129  tmlmatrix(tmpi2, tmpi2); tm2matrix(tmpi2, tmpi2); for (i10;il<tmpi2;++il) for (i20;i2<tmpi2;-H-i2) tml[il] [i2]R[i2] [ii]; for (il=O;il<tmpi2;++il) for (i20; i2<tmpi2 ; ++i2) for (i30,tm2[il][i2]0.O;i3<tmpi2;++i3) tm2[il] [i2]=tm2[il] [i2]+tml[il] [i3]*Phill[i3] [i2] for (110;il<tmpi2;++il) for (i20,Psi[il][i2]0.O;i2<tmpi2;++i2) for (i3O; i3<tmpi2 ;++i3) Psi[il] [i2]=Psi[il] [i2]+tm2[il] [i3]*R[i3] [i2] f_mat (tml, tmpi2); f_mat(tm2, tmpi2); Psillmatrix(1,1); for (i10; il<1; il++) for (i2=O; i2<1; i2++)  {  Psill[il] [i2]Psi[il] [i2]  } Psil2matrix(1 ,tmpi); for (i10; il<1; il-H-) for (i2O; i2<tmpi; i2++)  {  Psil2 [ii] [i2]Psi[il] [i2+1]  3Psi22matrix(tmpi,tmpi); for (il=O; il<tmpi; il++) for (i2=O; i2<tmpi; i2++)  {  Psi22[il] [i2]Psi[il+1] [i2+1]  3etal2matrix(1,tmpi);  130  Psi22inmatrix(tmpi, tmpi); mat_inverse(Psi22, tmpi, Psi22in); f_mat(Psi22, tmpi); Psi22matrix(tmpi, tmpi); for (i10; il<tmpi; il++) for (i20; i2<tmpi; i2++)  {  Psi22[il] [i2]Psi[il+1] [i2+1]  } f_mat (Psi ,tmpi2); for (i10;il<1;++il) for (i20 ; i2<tmpi ; ++i2) for (i30,etal2[il] [i2]0.O;i3<tmpi;++i3) etal2[il] [i2]etal2[il] [i2]+Psil2[il] [i3]*Psi22in[i3] [i2]; tml=matrix(tmpi, 1); for (i10;il<tmpi;++il) for (i2O ; i2<1 ; ++i2) tml[il] [i2]Psil2[i2] [ii];  tm2matrix(1, 1); for (i10;il<1;++il) for 
  tm2 = matrix(l, l);
  for (i1=0; i1<l; ++i1)
    for (i2=0; i2<l; ++i2)
      for (i3=0, tm2[i1][i2]=0.0; i3<tmpi; ++i3)
        tm2[i1][i2] = tm2[i1][i2] + eta12[i1][i3]*tm1[i3][i2];
  Psi11star = matrix(l, l);
  for (i1=0; i1<l; i1++)
    for (i2=0; i2<l; i2++)
      Psi11star[i1][i2] = Psi11[i1][i2] - tm2[i1][i2];

  Psi11starin = matrix(l, l);
  /* out = fopen("jmik", "w");                           */
  /* for (i1=0; i1<l; i1++) {                            */
  /*   for (i2=0; i2<l; ++i2) {                          */
  /*     fprintf(out, "%5.2lf ", Psi11star[i1][i2]);     */
  /*     if (i1%10 == 9) fprintf(out, "\n");             */
  /*   }                                                 */
  /*   fprintf(out, "\n\n");                             */
  /* }                                                   */
  /* fclose(out);                                        */
  /* exit(0);                                            */
  if (l != 0)
    mat_inverse(Psi11star, l, Psi11starin);
  if (l != 0) {
    f_mat(Psi11star, l);
    f_mat(Psi11, l);
    f_mat(tm2, l);
    f_mat(Psi12, l);
  }
  f_mat(tm1, tmpi);
  cont = delta - (double)(s1*k);
  sigma1 = matrix(tmpi2, tmpi2);
  for (i1=0; i1<l; i1++)
    for (i2=0; i2<l; i2++)
      sigma1[i1][i2] = cont*Psi11starin[i1][i2];
  tm1 = matrix(l, tmpi);
  for (i1=0; i1<l; ++i1)
    for (i2=0; i2<tmpi; ++i2)
      for (i3=0, tm1[i1][i2]=0.0; i3<l; ++i3)
        tm1[i1][i2] = tm1[i1][i2] + Psi11starin[i1][i3]*eta12[i3][i2];
  for (i1=0; i1<l; i1++)
    for (i2=0; i2<tmpi; i2++) {
      sigma1[i1][i2+l] = -1.0*cont*tm1[i1][i2];
      sigma1[i2+l][i1] = sigma1[i1][i2+l];
    }
  cont1 = delta - (double)(s1*k - n - f + l + h);
  tm2 = matrix(tmpi, l);
  for (i1=0; i1<tmpi; ++i1)
    for (i2=0; i2<l; ++i2)
      tm2[i1][i2] = eta12[i2][i1];
  tm3 = matrix(tmpi, tmpi);
  for (i1=0; i1<tmpi; ++i1)
    for (i2=0; i2<tmpi; ++i2)
      for (i3=0, tm3[i1][i2]=0.0; i3<l; ++i3)
        tm3[i1][i2] = tm3[i1][i2] + tm2[i1][i3]*tm1[i3][i2];
  if (l != 0)
    f_mat(tm1, l);
  f_mat(tm2, tmpi);
  tm4 = matrix(tmpi, tmpi);
  for (i1=0; i1<tmpi; ++i1)
    for (i2=0; i2<tmpi; ++i2)
      tm4[i1][i2] = Psi22[i1][i2] + Ssim[i1][i2];
  tm5 = matrix(tmpi, tmpi);
  mat_inverse(tm4, tmpi, tm5);
  tm1 = matrix(tmpi, tmpi);
  for (i1=0; i1<tmpi; i1++)
    for (i2=0; i2<tmpi; i2++) {
      sigma1[i1+l][i2+l] = cont1*tm5[i1][i2] + cont*tm3[i1][i2]
                           + (double)l*Psi22in[i1][i2];
    }
  f_mat(tm1, tmpi);
  f_mat(tm4, tmpi);
  f_mat(Phi11, tmpi2);
  f_mat(tm5, tmpi);
  f_mat(Psi22, tmpi);
  f_mat(tm3, tmpi);
  f_mat(Psi22in, tmpi);
  if (l != 0) {
    f_mat(eta12, l);
    f_mat(Psi11starin, l);
  }
  tm1 = matrix(tmpi2, tmpi2);
  Rt = matrix(tmpi2, tmpi2);
  for (i1=0; i1<tmpi2; ++i1)
    for (i2=0; i2<tmpi2; ++i2)
      Rt[i1][i2] = R[i2][i1];
  for (i1=0; i1<s2*k; ++i1)            /* sigma = R sigma1 R' undoes    */
    for (i2=0; i2<s2*k; ++i2) {        /* the permutation               */
      tm1[i1][i2] = 0.0;
      for (i3=0; i3<s2*k; ++i3)
        tm1[i1][i2] = tm1[i1][i2] + sigma1[i1][i3]*Rt[i3][i2];
    }
  for (i1=0; i1<s2*k; ++i1)
    for (i2=0; i2<s2*k; ++i2) {
      sigma[i1][i2] = 0.0;
      for (i3=0; i3<s2*k; ++i3)
        sigma[i1][i2] = sigma[i1][i2] + R[i1][i3]*tm1[i3][i2];
    }
  f_mat(tm1, tmpi2);
  f_mat(Rt, tmpi2);
  f_mat(R, tmpi2);
  f_mat(sigma1, tmpi2);
}

/*************************************************************/

#include <math.h>
#define SIGN(a,b) ((b) < 0.0 ? -fabs(a) : fabs(a))

void ord_mat(nvr, eig, mat)
int nvr;
double *eig, **mat;
{
  int i=0, j=0, k=0;
  double tmp=0.0;

  for (i=0; i<nvr; i++)                /* sort the eigenvalues, and the */
    for (j=i; j<nvr; j++)              /* matching eigenvector columns, */
      if (eig[j] < eig[i]) {           /* into ascending order          */
        tmp = eig[j];
        eig[j] = eig[i];
        eig[i] = tmp;
        for (k=0; k<nvr; k++) {
          tmp = mat[k][j];
          mat[k][j] = mat[k][i];
          mat[k][i] = tmp;
        }
      }
}
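/*************************************************************/
/* tred2() and tqli() below are the Householder reduction    */
/* and implicit-shift QL eigenvalue routines in the style of */
/* Numerical Recipes; applied in sequence they diagonalize a */
/* real symmetric matrix.  ord_mat() above sorts the         */
/* resulting eigenvalues into ascending order.               */
/*************************************************************/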
void tred2(a, n, d, e)
double **a, d[], e[];
int n;
{
  int l=0, k=0, j=0, i=0;
  double scale=0.0, hh=0.0, h=0.0, g=0.0, f=0.0;

  for (i=n-1; i>=1; i--) {
    l = i - 1;
    h = scale = 0.0;
    if (l > 0) {
      for (k=0; k<=l; k++)
        scale += fabs(a[i][k]);
      if (scale == 0.0)
        e[i] = a[i][l];
      else {
        for (k=0; k<=l; k++) {
          a[i][k] /= scale;
          h += a[i][k]*a[i][k];
        }
        f = a[i][l];
        g = f > 0.0 ? -sqrt(h) : sqrt(h);
        e[i] = scale*g;
        h -= f*g;
        a[i][l] = f - g;
        f = 0.0;
        for (j=0; j<=l; j++) {
          a[j][i] = a[i][j]/h;
          g = 0.0;
          for (k=0; k<=j; k++)
            g += a[j][k]*a[i][k];
          for (k=j+1; k<=l; k++)
            g += a[k][j]*a[i][k];
          e[j] = g/h;
          f += e[j]*a[i][j];
        }
        hh = f/(h+h);
        for (j=0; j<=l; j++) {
          f = a[i][j];
          e[j] = g = e[j] - hh*f;
          for (k=0; k<=j; k++)
            a[j][k] -= (f*e[k] + g*a[i][k]);
        }
      }
    } else
      e[i] = a[i][l];
    d[i] = h;
  }
  d[0] = 0.0;
  e[0] = 0.0;
  for (i=0; i<n; i++) {
    l = i - 1;
    if (d[i]) {
      for (j=0; j<=l; j++) {
        g = 0.0;
        for (k=0; k<=l; k++)
          g += a[i][k]*a[k][j];
        for (k=0; k<=l; k++)
          a[k][j] -= g*a[k][i];
      }
    }
    d[i] = a[i][i];
    a[i][i] = 1.0;
    for (j=0; j<=l; j++)
      a[i][j] = a[j][i] = 0.0;
  }
}

/*************************************************************/

void tqli(d, e, n, z)
double d[], e[], **z;
int n;
{
  int m=0, l=0, iter=0, i=0, k=0;
  double s=0.0, r=0.0, p=0.0, g=0.0, f=0.0, dd=0.0, c=0.0, b=0.0;
  void nrerror();

  for (i=1; i<n; i++)
    e[i-1] = e[i];
  e[n-1] = 0.0;
  for (l=iter=0; l<n; l++) {
    do {
      for (m=l; m<n-1; m++) {
        dd = fabs(d[m]) + fabs(d[m+1]);
        if (fabs(e[m]) + dd == dd) break;
      }
      if (m != l) {
        if ((iter++) == 10000)
          nrerror("Too many iterations in TQLI");
        g = (d[l+1] - d[l])/(2.0*e[l]);
        r = sqrt((g*g) + 1.0);
        g = d[m] - d[l] + e[l]/(g + SIGN(r, g));
        s = c = 1.0;
        p = 0.0;
        for (i=m-1; i>=l; i--) {
          f = s*e[i];
          b = c*e[i];
          if (fabs(f) >= fabs(g)) {
            c = g/f;
            r = sqrt((c*c) + 1.0);
            e[i+1] = f*r;
            c *= (s = 1.0/r);
          } else {
            s = f/g;
            r = sqrt((s*s) + 1.0);
            e[i+1] = g*r;
            s *= (c = 1.0/r);
          }
          g = d[i+1] - p;
          r = (d[i] - g)*s + 2.0*c*b;
          p = s*r;
          d[i+1] = g + p;
          g = c*r - b;
          for (k=0; k<n; k++) {
            f = z[k][i+1];
            z[k][i+1] = s*z[k][i] + c*f;
            z[k][i] = c*z[k][i] - s*f;
          }
        }
        d[l] = d[l] - p;
        e[l] = g;
        e[m] = 0.0;
      }
    } while (m != l);
  }
  /* printf("Number iterations in tqli = %d\n", iter); */
}

/*************************************************************/

#include <malloc.h>
#include <stdio.h>

/*************************************************************/

void nrerror(error_text)
char error_text[];
{
  fprintf(stderr, "Numerical Recipes run-time error...\n");
  fprintf(stderr, "%s\n", error_text);
  fprintf(stderr, "...now exiting to system...\n");
  exit(1);
}
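/*************************************************************/
/* Illustrative driver (a minimal sketch, not part of the    */
/* original listings): tred2() reduces a symmetric matrix to */
/* tridiagonal form and tqli() then finds its eigenvalues    */
/* and eigenvectors; det() and mat_inverse() below chain the */
/* two calls in exactly this way.  Assumes the allocation    */
/* utilities that follow (matrix(), vector(), f_mat()) are   */
/* linked in.                                                */

#include <stdio.h>
#include <malloc.h>

int main()
{
  double **a, *d, *e;
  double **matrix(), *vector();
  void tred2(), tqli(), f_mat();
  int i, n = 2;

  a = matrix(n, n);
  d = vector(n);
  e = vector(n);
  a[0][0] = 2.0; a[0][1] = 1.0;   /* symmetric test matrix        */
  a[1][0] = 1.0; a[1][1] = 2.0;
  tred2(a, n, d, e);              /* Householder reduction        */
  tqli(d, e, n, a);               /* eigenvalues in d, vectors in */
  for (i=0; i<n; i++)             /* the columns of a             */
    printf("eigenvalue %d = %f\n", i, d[i]);   /* expect 1 and 3  */
  f_mat(a, n);
  free((char *)(d));
  free((char *)(e));
  return 0;
}
/*************************************************************/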
double *vector(nh)
int nh;
{
  double *v;

  v = (double *)calloc(nh, sizeof(double));
  if (!v) nrerror("allocation failure in dvector()");
  return v;
}

/*************************************************************/

int *ivector(nh)
int nh;
{
  int *v;

  v = (int *)calloc(nh, sizeof(int));
  if (!v) nrerror("allocation failure in ivector()");
  return v;
}

/*************************************************************/

double **matrix(nrh, nch)
int nrh, nch;
{
  int i, j;
  double **m;

  m = (double **)calloc(nrh, sizeof(double *));
  if (!m) nrerror("allocation failure 1 in matrix()");
  for (i=0; i<nrh; i++) {
    m[i] = (double *)calloc(nch, sizeof(double));
    if (!m[i]) nrerror("allocation failure 2 in matrix()");
  }
  return m;
}

/*************************************************************/

void f_mat(m, nrh)
double **m;
int nrh;
{
  int i=0;

  for (i=nrh-1; i>=0; i--)
    free((char *)(m[i]));
  free((char *)(m));
}

/*************************************************************/

double **dmatrix(nrh, nch)
int nrh, nch;
{
  int i;
  double **m;

  m = (double **)calloc(nch, sizeof(double *));   /* note: allocates nch */
  if (!m) nrerror("allocation failure 1 in dmatrix()");  /* rows of nrh  */
  for (i=0; i<nch; i++) {                         /* entries each        */
    m[i] = (double *)calloc(nrh, sizeof(double));
    if (!m[i]) nrerror("allocation failure 2 in dmatrix()");
  }
  return m;
}

/*************************************************************/

#include <math.h>

double det(h, m)
double **h;
int m;
{
  double *dv, *er, tmp, **temp;
  double *vector(), **matrix();
  void tqli(), tred2(), f_mat();
  int i, j;

  temp = matrix(m, m);
  for (i=0; i<m; ++i)                  /* work on a copy: tred2/tqli    */
    for (j=0; j<m; ++j)                /* destroy their input           */
      temp[i][j] = h[i][j];
  er = vector(m);
  dv = vector(m);
  tred2(temp, m, dv, er);
  tqli(dv, er, m, temp);
  for (i=0, tmp=1.0; i<m; ++i)         /* determinant = product of the  */
    tmp *= dv[i];                      /* eigenvalues                   */
  free((char *)(dv));
  free((char *)(er));
  f_mat(temp, m);
  return tmp;
}

/*************************************************************/

void mat_inverse(h, m, invh)
double **h, **invh;
int m;
{
  double **hs, *dv, *er, **lam;
  double *vector(), **matrix();
  void tqli(), tred2(), f_mat(), m_mat(), nrerror();
  int i, j;

  er = vector(m);
  lam = matrix(m, m);
  hs = matrix(m, m);
  dv = vector(m);
  tred2(h, m, dv, er);
  tqli(dv, er, m, h);                  /* h now holds the eigenvectors  */
  for (i=0; i<m; ++i)
    if (dv[i] != 0.0)
      lam[i][i] = 1.0/dv[i];
    else
      nrerror("trying to invert a singular matrix");
  m_mat(hs, h, lam, m, m, m);          /* hs = V diag(1/lambda)         */
  for (i=0; i<m; ++i)
    for (j=0; j<m; ++j)
      lam[i][j] = h[j][i];             /* lam = V'                      */
  m_mat(invh, hs, lam, m, m, m);       /* invh = V diag(1/lambda) V'    */
  f_mat(hs, m);
  f_mat(lam, m);
  free((char *)(dv));
  free((char *)(er));
}

/*************************************************************/

void invert_matrix(h, m)
double **h;
int m;
{
  double **hs, *dv, *er, **lam;
  double *vector(), **matrix();
  void tqli(), tred2(), f_mat(), m_mat(), nrerror();
  int i, j;

  er = vector(m);
  lam = matrix(m, m);
  hs = matrix(m, m);
  dv = vector(m);
  tred2(h, m, dv, er);
  tqli(dv, er, m, h);
  for (i=0; i<m; ++i)
    if (dv[i] != 0.0)
      lam[i][i] = 1.0/dv[i];
    else
      nrerror("trying to invert a singular matrix");
  m_mat(hs, h, lam, m, m, m);
  for (i=0; i<m; ++i)
    for (j=0; j<m; ++j)
      lam[i][j] = h[j][i];
  m_mat(h, hs, lam, m, m, m);          /* in-place variant: result in h */
  f_mat(hs, m);
  f_mat(lam, m);
  free((char *)(dv));
  free((char *)(er));
}
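/*************************************************************/
/* det() copies its argument before calling tred2()/tqli()   */
/* and returns the product of the eigenvalues, so the input  */
/* matrix is preserved.  mat_inverse() and invert_matrix()   */
/* instead hand the input itself to tred2(), overwriting it  */
/* with eigenvectors while building V diag(1/lambda) V'.     */
/* Callers that still need the original matrix must refill   */
/* it after inversion, as sig1() above does with Psi22.      */
/* All three routines assume symmetric input.                */
/*************************************************************/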
void m_mat(m0, m1, m2, n, p, m)
double **m0, **m1, **m2;
int n, p, m;
{
  int i, j, k;

  for (i=0; i<n; ++i)                  /* m0 = m1 m2, with m1 (n x p)   */
    for (j=0; j<m; ++j)                /* and m2 (p x m)                */
      for (k=0, m0[i][j]=0.0; k<p; ++k)
        m0[i][j] += m1[i][k]*m2[k][j];
}

/*************************************************************/

void s_mat(m0, m1, m2, n, m, k)
double **m0, **m1, **m2;
int n, m, k;
{
  int i, j;

  if (k == 0)                          /* difference when k = 0,        */
    for (i=0; i<n; ++i)                /* sum otherwise                 */
      for (j=0; j<m; ++j)
        m0[i][j] = m1[i][j] - m2[i][j];
  else
    for (i=0; i<n; ++i)
      for (j=0; j<m; ++j)
        m0[i][j] = m1[i][j] + m2[i][j];
}

/*************************************************************/

#include <math.h>

double trigamma(z)
double z;
{
  double retval, r1, d, w, x, ww;
  int i1, i2, j;

  i2 = (int)(z);
  if (i2 > 20) {                       /* asymptotic expansion directly */
    r1 = z;
    ww = r1*r1;
    x = 1 - (0.2 - 1/(ww*7))/ww;
    d = 1/z + (x/(z*3) + 1)/(ww*2);
  } else {                             /* shift to w = z + i2 >= 20,    */
    i2 = 20 - i2;                      /* then recurse back down with   */
    w = z + i2;                        /* psi'(z) = psi'(z+1) + 1/z^2   */
    r1 = w;
    ww = r1*r1;
    x = 1 - (0.2 - 1/(ww*7))/ww;
    d = 1/w + (x/(w*3) + 1)/(ww*2);
    i1 = i2;
    for (j=0; j<i1; ++j) {
      w = w - 1;
      r1 = w;
      d += 1/(r1*r1);
    }
  }
  retval = d;
  return retval;
}

/*************************************************************/

double digamma(x)
double x;
{
  double y, retval, s3, s4, s5, d1, r, gamma();

  s3 = 1.0/12;
  s4 = 1.0/120;
  s5 = 0.003968253968;
  d1 = -0.5772156649;
  retval = 0.0;
  y = x;
  if (y <= 0.0)
    return retval;
  if (y < 1e-5)
    return (d1 - 1.0/y);
  while (y < 8.5) {                    /* psi(z) = psi(z+1) - 1/z       */
    retval = retval - 1.0/y;
    y = y + 1.0;
  }
  r = 1.0/y;
  retval = retval + log(y) - r/2;
  r = r*r;
  retval = retval - r*(s3 - r*(s4 - r*s5));
  return retval;
}

/*************************************************************/

double gamma(x)
double x;
{
  double gcs[24], pi, sq2pil, xmin, xmax, dxrel, y, sinpiy, gamm;
  double r9lgmc(), csevl();
  int ngcs, i, n, inits();

  gcs[1]  =  0.08571195590989331e0;
  gcs[2]  =  0.004415381324841007e0;
  gcs[3]  =  0.05685043681599363e0;
  gcs[4]  = -0.004219835396418561e0;
  gcs[5]  =  0.001326808181212460e0;
  gcs[6]  = -0.0001893024529798880e0;
  gcs[7]  =  0.0000360692532744124e0;
  gcs[8]  = -0.0000060567619044608e0;
  gcs[9]  =  0.0000010558295463022e0;
  gcs[10] = -0.0000001811967365542e0;
  gcs[11] =  0.0000000311772496471e0;
  gcs[12] = -0.0000000053542196390e0;
  gcs[13] =  0.0000000009193275519e0;
  gcs[14] = -0.0000000001577941280e0;
  gcs[15] =  0.0000000000270798062e0;
  gcs[16] = -0.0000000000046468186e0;
  gcs[17] =  0.0000000000007973350e0;
  gcs[18] = -0.0000000000001368078e0;
  gcs[19] =  0.0000000000000234731e0;
  gcs[20] = -0.0000000000000040274e0;
  gcs[21] =  0.0000000000000006910e0;
  gcs[22] = -0.0000000000000001185e0;
  gcs[23] =  0.0000000000000000203e0;
  pi = 3.14159265358979324e0;
  sq2pil = 0.91893853320467274e0;
  ngcs = inits(gcs, 23, 5.9604645e-8);
  gamlim(&xmin, &xmax);
  dxrel = sqrt(1.1920929e-6);
  y = fabs(x);
  if (y > 10.0) {
    gamm = exp((y - 0.5)*log(y) - y + sq2pil + r9lgmc(y));
    if (x < 0.) {
      sinpiy = sin(pi*y);
      gamm = -pi/(y*sinpiy*gamm);
    }
  } else {
    n = x/1;
    if (x < 0.) n = n - 1;
    y = x - n/1.0;
    n = n - 1;
    gamm = 0.9375 + csevl(2.*y - 1., gcs, ngcs);
    if (n > 0)
      for (i=1; i<=n; i++)
        gamm = (y + i/1.0)*gamm;
    else {
      n = -n;
      for (i=1; i<=n; i++)
        gamm = gamm/(x + (i - 1)/1.0);
    }
  }
  return (gamm);
}

/*************************************************************/

gamlim(xmin, xmax)
double *xmin, *xmax;
{
  double alnsml, alnbig, xln, xold, log(), fabs();
  int i;

  alnsml = log(2.9387359e-37);
  *xmin = -alnsml;
  for (i=1; i<=10; i++) {
    xold = *xmin;
    xln = log(*xmin);
    *xmin = *xmin - *xmin*((*xmin + 0.5)*xln - *xmin - 0.2258 + alnsml)
            /(*xmin*xln + 0.5);
    if (fabs(*xmin - xold) < 0.005) break;
  }
  *xmin = -*xmin + 0.01;
  alnbig = log(1.7014117e38);
  *xmax = alnbig;
  for (i=1; i<=10; i++) {
    xold = *xmax;
    xln = log(*xmax);
    *xmax = *xmax - *xmax*((*xmax - 0.5)*xln - *xmax + 0.9189 - alnbig)
            /(*xmax*xln - 0.5);
    if (fabs(*xmax - xold) < 0.005) break;
  }
  *xmax = *xmax - 0.01;
  *xmin = (*xmin > 1.0 - *xmax) ? *xmin : 1.0 - *xmax;
}
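/*************************************************************/
/* Illustrative check (a minimal sketch, not part of the     */
/* original listings): digamma() and trigamma() above push   */
/* small arguments into their asymptotic range with the      */
/* recurrences psi(z) = psi(z+1) - 1/z and                   */
/* psi'(z) = psi'(z+1) + 1/z^2, so the same identities give  */
/* a quick numerical sanity check.  Both differences printed */
/* below should be near zero.                                */

#include <stdio.h>

int main()
{
  double digamma(), trigamma();
  double z = 3.7;

  printf("digamma check:  %e\n", digamma(z + 1.0) - digamma(z) - 1.0/z);
  printf("trigamma check: %e\n", trigamma(z + 1.0) - trigamma(z) + 1.0/(z*z));
  return 0;
}
/*************************************************************/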
/*************************************************************/

double r9lgmc(x)
double x;
{
  double algmcs[7], xbig, xmax, y, z, csevl(), sqrt(), exp(), log();
  int nalgm, inits();

  algmcs[1] =  0.166638948045186;
  algmcs[2] = -0.0000138494817606;
  algmcs[3] =  0.0000000098108256;
  algmcs[4] = -0.0000000000180912;
  algmcs[5] =  0.0000000000000622;
  algmcs[6] = -0.0000000000000003;
  nalgm = inits(algmcs, 6, 5.9604645e-7);
  xbig = 1.0/sqrt(5.9604645e-7);
  y = log(1.7014117e38/12.0);
  z = -log(12.0*2.9387359e-37);
  xmax = (y < z) ? exp(y) : exp(z);
  if (x < xbig)
    return (csevl(2.0*(10.0/x)*(10.0/x) - 1.0, algmcs, nalgm)/x);
  else
    return (1.0/(12.0*x));
}

/*************************************************************/

double csevl(x, cs, n)
int n;
double x, cs[201];
{
  double b0, b1, b2, twox;
  int i, ni;

  b1 = 0.0;
  b0 = 0.0;
  b2 = 0.0;
  twox = 2.0*x;
  for (i=1; i<=n; i++) {
    b2 = b1;
    b1 = b0;
    ni = n + 1 - i;
    b0 = twox*b1 - b2 + cs[ni];
  }
  return (0.5*(b0 - b2));
}

/*************************************************************/

int inits(os, nos, eta)
int nos;
double os[201], eta;
{
  int i, ii;
  double err, fabs();

  err = 0.0;
  ii = 1;
  while (ii <= nos && err < eta) {
    i = nos + 1 - ii;
    err = err + fabs(os[i]);
    ii = ii + 1;
  }
  return (i);
}

Appendix C: Figures

Figure 3.1: Locations of gauged sites in Southern Ontario plotted with Census Subdivision boundaries, where monthly pollution levels are observed and Sites 3, 29 (outliers) are not plotted.

Figure 3.2: Locations of selected sites in Southern Ontario plotted with Census Subdivision boundaries, where monthly interpolated pollution levels are needed.

Figure 3.3: Plots for monthly observed and fitted, log-transformed levels of O3 in ppb and SO2, NO2 and SO4 in µg/m³, at Gauged Site 5.

Figure 3.4: Normal quantile-quantile plots for original and log-transformed monthly levels of SO4 in µg/m³ at Gauged Site 4.

Figure 3.5: Plots for autocorrelation and partial autocorrelation of monthly, log-transformed levels of SO4 in µg/m³ at Gauged Site 4.

Figure 3.6: A rough checkerboard obtained in the SG step.

Figure 3.7: A smoother checkerboard obtained in the SG step.

Figure 3.8: Scatter plot of observed covariances vs predicted covariances obtained by the GS approach.

Figure 3.9: Means of monthly levels of O3 in ppb, in summers of 1983-1988 at gauged sites in Southern Ontario plotted with CSD boundaries.

Figure 3.10: Means of monthly levels of O3 in ppb, in summers of 1983-1988 at selected sites in Southern Ontario plotted with CSD boundaries.
Figure 3.11: Scatter plots for residuals of monthly observed pollutant levels against residuals of interpolated levels at the log-scale in winter and summer respectively, where levels of O3 are in ppb; SO2, NO2 and SO4 in µg/m³.

Figure 3.12: Pollutant-wise scatter plots for residuals of monthly observed pollutant levels against residuals of interpolated levels at the log-scale in winter and summer respectively, where levels of O3 are in ppb; SO2, NO2 and SO4 in µg/m³.

Figure 3.13: Boxplots for predicted, observed and residual levels of log-transformed, monthly concentrations of O3 in ppb and SO2, NO2 and SO4 in µg/m³, respectively.

Figure 3.14: Locations of gauged sites in Southern Ontario plotted with Census Subdivision boundaries, where daily pollution levels are observed and Sites 3, 26 (outliers) are not plotted.

Figure 3.15: Normal quantile-quantile plots for original and log-transformed daily levels of SO4 in µg/m³ at Gauged Site 1.

Figure 3.16: Plots for autocorrelation and partial autocorrelation of daily, log-transformed levels of O3 in ppb at Gauged Site 6, before an AR(1) transformation is taken.

Figure 3.17: Plots for autocorrelation and partial autocorrelation of daily, log-transformed levels of O3 in ppb at Gauged Site 6, after an AR(1) transformation is taken.

Figure 5.1: Plot for trends in AMSPE (AMSPE against the order of deleted columns).
Figure 5.2: MSPEs obtained by Hass's interpolator and the Bayesian interpolator with original acid rain data (legend: 1--Co-kriging MSPE, 0--Bayesian MSPE; months 1 = Jan. 1983 to 48 = Dec. 1986).

Figure 5.3: MSPEs obtained by Hass's interpolator and the Bayesian interpolator with log-transformed acid rain data.

Figure 5.4: Boxplots of observed nitrate levels with/out log-transformation at 35 sites in US.
