Smoothing Parameter Selection When Errors are Correlated and Application to Ozone Data by Robert Jr S t - A u b i n B.Sc, Universite de Montreal , 1996 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F M a s t e r o f S c i e n c e in T H E F A C U L T Y O F G R A D U A T E S T U D I E S (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia October 1998 © Robert Jr S t - A u b i n , 1998 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my writ ten permission. Department of STA7fS~nCS> The University of British Columbia Vancouver, Canada Date < ? - / £ > - < ? f t DE-6 (2/88) Abstract Automat ic smoothing parameter selection methods for nonparametric regres-sion like cross-validation and generalized cross-validation are known to be severely affected by dependence in the regression errors. We proposed, in this work, to modify some of the ideas used in the cross-validation criterion in kernel regression with dependent errors and apply them to smoothing splines with model based penalty. Model based penalty smoothing permits us to keep the flexibility of the nonparametric methods while it also allows us to specify a favoured parametric model which can help improve on the estimate of the regression function. We consider the "modified cross-validation" (also known as Leave—21+1 out) and the "blockwise cross-validation" smoothing parameter selection tech-niques which were ini t ia l ly proposed by Hardle and V i e u (1992) and Wehrly and Hart (1988) respectively. These two smoothing parameter selection tech-niques take correlation into account and alleviate its effect on the regression function estimation. We use a simulation study to evaluate the performance of our two smoothing parameter selection techniques. We compare the results with a i i few commonly used parametric techniques. Our techniques are also applied to an air pollution data set where we estimate the underlying trend of daily and monthly ground ozone levels in southern Ontario. i i i Contents Abstract i i Contents iv List of Tables v i i List of Figures ix Acknowledgements xv Dedication xvi 1 Introduction 1 2 Overview of Existing Methods. 6 2.1 Kernel Estimators 8 2.1.1 Local Constant F i t t i n g 8 2.1.2 Local Polynomial F i t t i n g 11 2.1.3 Bias and Variance Tradeoff for Polynomial F i t t i n g . . . 13 2.1.4 Bandwidth Selection for Local Polynomial Regression . 18 iv 2.2 Splines 21 2.2.1 Regression Splines 22 2.2.2 Smoothing Splines 24 2.2.3 Some Theory on Lsplines 26 2.2.4 Smoothing Parameter Selection for Smoothing Splines . 31 3 Smoothing Parameter Selection When the Errors are Corre-lated 33 3.1 Effect of Dependence on the Smoothing Parameter 34 3.2 Smoothing Parameter Selection Methods for Dependent Obser-vations 36 3.2.1 Modif ied Cross-Validation 37 3.2.2 Partit ioned Cross-Validation 38 3.2.3 Blockwise Cross-Validation 39 3.2.4 T i m e Series Cross-Validation 41 3.2.5 P lug- in Method 43 4 Simulations 46 4.1 Description 46 4.2 Generation of Random Functions 50 4.3 Simulation Results 53 4.3.1 Positive Correlation 54 4.3.2 Negative Correlation 67 4.4 Conclusions 74 v 5 Application to Air Pollution Data 77 5.1 Introduction 77 5.2 A i r Pol lut ion Data 79 5.2.1 Da i ly Ozone Data . . 79 5.2.2 Monthly Ozone Data 84 5.3 Methods of Analysis 87 5.4 Results 91 5.4.1 Monthly 03 Levels 91 5.4.2 Da i ly 0 3 Levels 103 5.4.3 Conclusions 110 6 Conclusion and Discussion 130 Bibliography 134 Appendix A Distributions 144 Appendix B Monthly Data 161 Appendix C Daily Data 198 v i List of Tables 2.1 Asymptotic bias and variance of kernel regression smoothers in the case of random design. Taken from Fan (1992) 14 2.2 Effect of bandwidth selection on the local linear regression esti-mator 15 2.3 Differential Operators and their Parametric Family 27 4.1 Simulation Mean Results with (b = 0.5 55 4.2 Number of Parameters Used, (b = 0.5 63 4.3 Correlation between Number of Parameters Used, (f> = 0.5. . . 65 4.4 Simulation Mean Results with <f> = —0.5 67 4.5 Number of Parameters Used, 4> — —0.5 72 4.6 Correlation between Number of Parameters Used, <fi = —0.5. . 74 5.1 Monthly Levels of O3 Mean Results 93 5.2 Number of Parameters Used for Monthly Levels of O3 95 5.3 Number of Parameters used by each Method for the Monthly Data. ( a * represents a site that is studied in Chapter 5) . . . 98 v i i 5.4 Correlation Between Number of Parameters Used for Monthly Levels of 03 100 5.5 Square Root Transformed Daily Levels of Oz Mean Results. . . 104 5.6 Number of Parameters Used for Square Root Transformed Daily Levels of 0$ 105 5.7 Number of Parameters used for the Daily Data, ( a * represents a site that was studied in Chapter 5) 107 v i i i List of Figures 1.1 Estimated Mean Functions using CV and MCVi 4 4.1 Sample of Randomly Generated Curves. The solid line is the true curve 52 4.2 MSE and log(MSE) for NLS (left) and TS (right) when (j) = 0.5. 54 4.3 Estimated values of ui for (j) = 0.5 56 4.4 Fits of Parametric Methods on Simulated Curve 6 for <f> = 0.5 56 4.5 Fits on Simulated Curve 6 for 4> = 0.5 59 4.6 Fits on Simulated Curve 6 for (j) = 0.5 60 4.7 Fits on Simulated Curve 6 for (j> = 0.5 61 4.8 Behaviour of BCV. 64 4.9 Scaterplots of # of Parameters used by INd, MCV\ vs BCV. . 66 4.10 Estimated values of u> for 4> = —0.5 68 4.11 Fits on Simulated Curve 6 for <j> = —0.5 69 4.12 Fits on Simulated Curve 6 for <f) = —0.5 70 4.13 Fits on Simulated Curve 6 for (j) = —0.5 71 ix 5.1 Residuals of Original (top plot) and Square Root Transformed (bottom plot) Daily Levels of O3 at Site #4 82 5.2 Normal Quantile-Quantile Plots of Residuals for Original (top plot) and Square Root Transformed (bottom plot) Daily Levels of 0 3 at Site #4 83 5.3 Plot for Autocorrelation and Partial Autocorrelation of Square Root Transformed Daily O3 Levels at Site #4 85 5.4 Plot for Autocorrelation and Partial Autocorrelation of Monthly 0 3 Levels at Site #4 86 5.5 Residuals of Original (top plot) and Square Root Transformed (bottom plot) Monthly Levels of O3 at Site #4 88 5.6 Normal Quantile-Quantile Plots for Original and Square Root Transformed Monthly levels of O3 at Site #4 89 5 . 7 Ratio of u and iv = 2n/12 for Monthly Levels 0/O3 9 4 5.8 Behaviour of BCV for Site # 2 (top left), Site # 7 (top right) and Site # 1 7 (bottom) 9 7 5.9 Relationship between the Number of Parameters Used by IND and MCV2 9 9 5 .10 Ratio to and ui = 27r /365 for Square Root Transformed Daily Levels of 0 3 106 5 .11 Fit of the IND Method for the Square Root Transformed Daily Levels of 03 at Site #2 108 5 .12 Fits for Monthly Levels of 0 3 at Site #2 112 5 .13 Fits for Monthly Levels of 03 at Site #2 113 5 .14 Fits for Monthly Levels of 03 at Site #2 114 5 .15 Fits for Monthly Levels of 03 at Site #7. 1 1 5 5 .16 Fits for Monthly Levels of 03 at Site #7. 116 5 . 1 7 Fits for Monthly Levels of 03 at Site #7. 1 1 7 5 .18 Fits for Monthly Levels of 03 at Site #17. . 118 5 .19 Fits for Monthly Levels of 03 at Site #17. 119 5 .20 Fits for Monthly Levels of 03 at Site #17. 120 5.21 Fits for Square Root Transformed Daily Levels of 03 at Site #2. 121 5 .22 Fits for Square Root Transformed Daily Levels of 03 at Site #2. 122 5 .23 Fits for Square Root Transformed Daily Levels of 03 at Site #2. 123 5 .24 Fits for Square Root Transformed Daily Levels of 03 at Site #7. 124 5 .25 Fits for Square Root Transformed Daily Levels of 03 at Site #7. 125 5 .26 Fits for Square Root Transformed Daily Levels of 03 at Site #7. 126 5 . 2 7 Fits for Square Root Transformed Daily Levels of 03 at Site #17.127 5.28 Fits for Square Root Transformed Daily Levels of 03 at Site #17.128 5 .29 Fits for Square Root Transformed Daily Levels of 03 at Site #17.129 A . l MSE and log(MSE) for LS for (j) = 0 .5 145 A . 2 MSE and log (MSE) for NLS for <j> = 0 .5 146 A . 3 MSE and log(MSE) for TS for (j) = 0.5 1 4 7 A . 4 MSEandlog(MSE)forINDfor<f) = 0.5 148 A . 5 MSEandlog(MSE)forBCVfor(h = 0.5 149 A . 6 MSE and log(MSE) for MCVX for (j) = 0 .5 150 xi A . 7 MSE and log(MSE) for MCV2 for <j> = 0.5 151 A . 8 MSE and log(MSE) for MCV3 for (b = 0.5 152 A . 9 MSE and log(MSE) for LS for <f> = - 0 . 5 153 A.10 MSE and log(MSE) for NLS for <f> = - 0 . 5 154 A . l l MSE and log(MSE) for TS for <b = - 0 . 5 155 A.12 MSE and log(MSE) for IND for <b = - 0 . 5 156 A . 13 MSE and log(MSE) for BCV for <f> = - 0 . 5 157 A.14 MSE and log(MSE) for MCVX for <f> = - 0 . 5 158 A . 15 MSE and log(MSE) for MCV2 for <j> = - 0 . 5 159 A . 16 MSE and log(MSE) for MCV3 for <j> = - 0 . 5 160 B . l Fits for Monthly 03 Levels at Site #1 162 B.2 Fits for Monthly 0 3 Levels at Site #1 163 B.3 Fits for Monthly 03 Levels at Site #2 164 B.4 Fits for Monthly 03 Levels at Site #2. . . . 165 B.5 Fits for Monthly 03 Levels at Site #3 166 B.6 Fits for Monthly 03 Levels at Site #3 167 B.7 Fits for Monthly 03 Levels at Site #4 168 B.8 Fits for Monthly 03 Levels at Site #4 169 B.9 Fits for Monthly 03 Levels at Site #5 170 B.10 Fits for Monthly 03 Levels at Site #5 171 B . l l Fits for Monthly 03 Levels at Site #6 172 B.12 Fits for Monthly 03 Levels at Site #6 173 B.13 Fits for Monthly 03 Levels at Site #7 174 xii B . 1 4 Fits for Monthly 03 Levels at Site # 7 175 B . 1 5 Fits for Monthly 03 Levels at Site # 8 176 B . 1 6 Fits for Monthly 03 Levels at- Site # 8 1 7 7 B . 1 7 Fits for Monthly 03 Levels at Site # 9 178 B . 1 8 Fits for Monthly 03 Levels at Site # 9 179 B . 1 9 Fits for Monthly 03 Levels at Site # 1 0 180 B . 2 0 Fits for Monthly 03 Levels at Site # 1 0 181 B . 2 1 Fits for Monthly 03 Levels at Site # 1 1 182 B . 2 2 Fits for Monthly 03 Levels at Site # 1 1 183 B . 2 3 Fits for Monthly 03 Levels at Site # 1 2 184 B . 2 4 Fits for Monthly 03 Levels at Site # 1 2 185 B . 2 5 Fits for Monthly 03 Levels at Site # 1 3 186 B . 2 6 Fits for Monthly 03 Levels at Site # 1 3 1 8 7 B . 2 7 Fits for Monthly 03 Levels at Site # 1 4 188 B . 2 8 Fits for Monthly 03 Levels at Site # 1 4 189 B . 2 9 Fits for Monthly 03 Levels at Site # 1 5 190 B . 3 0 Fits for Monthly 03 Levels at Site # 1 5 191 B . 3 1 Fits for Monthly 03 Levels at Site # 1 6 192 B . 3 2 Fits for Monthly 03 Levels at Site # 1 6 193 B . 3 3 Fits for Monthly 03 Levels at Site # 1 7 194 B . 3 4 Fits for Monthly 03 Levels at Site # 1 7 195 B . 3 5 Fits for Monthly 03 Levels at Site # 1 8 196 B . 3 6 Fits for Monthly 03 Levels at Site # 1 8 . 1 9 7 xiii C . l Fits for Square Root Transformed Daily O s Levels at Site # 1 - 199 C . 2 Fits for Square Root Transformed Daily 0 3 Levels at Site # 2 . 2 0 0 C . 3 Fits for Square Root Transformed Daily 0 3 Levels at Site # 3 . 201 C . 4 Fits for Square Root Transformed Daily 0 3 Levels at Site # 4 . 2 0 2 C .5 Fits for Square Root Transformed Daily O3 Levels at Site #5. 2 0 3 C . 6 Fits for Square Root Transformed Daily o 3 Levels at Site # 6 . 2 0 4 C . 7 Fits for Square Root Transformed Daily 0 3 Levels at Site # 7 - 2 0 5 C . 8 Fits for Square Root Transformed Daily 0 3 Levels at Site # 8 . 2 0 6 C . 9 Fits for Square Root Transformed Daily 0 3 Levels at Site # 9 - 2 0 7 C I O Fits for Square Root Transformed Daily 0 3 Levels at Site # 1 0 . 2 0 8 C . l l Fits for Square Root Transformed Daily 0 3 Levels at Site # 1 1 . 2 0 9 C . 1 2 Fits for Square Root Transformed Daily 0 3 Levels at Site # 1 2 . 2 1 0 C . 1 3 Fits for Square Root Transformed Daily O3 Levels at Site # 1 3 . 211 C . 1 4 Fits for Square Root Transformed Daily 0 3 Levels at Site # 1 4 . 2 1 2 C.15 Fits for Square Root Transformed Daily o 3 Levels at Site # 1 5 . 2 1 3 C.16 Fits for Square Root Transformed Daily Levels at Site # 1 6 . 2 1 4 C . 1 7 Fits for Square Root Transformed Daily 0 3 Levels at Site # 1 7 . 2 1 5 C . 1 8 Fits for Square Root Transformed Daily 0 3 Levels at Site # 1 8 . 2 1 6 xiv Acknowledgements Foremost, I would like to thank my supervisor, Nancy Heckman, as I couldn't have chosen a better person to work wi th . Nancy was always wi l l ing to give advice and most importantly encouragement, al l of which were much needed. Secondly, I am grateful to J i m Zidek for approaching me wi th this project. I am also thankful to J i m for reading my thesis and for giving useful comments. A huge thank also to Harry Joe for his invaluable help wi th C program-ming and with many computational aspects of my thesis. Lastly, I would like to thank everyone in the department of Statistics at U B C for making my two years such a pleasant experience. ROBERT J R S T - A U B I N The University of British Columbia October 1998 xv ma famille, a la vie qui est souvent si belle et aux S&R.. xvi Chapter 1 Introduction Est imat ing the underlying signal, often called the regression function, of a set of noisy data is a very common task that statisticians encounter and it has two main purposes. First ly, it provides a way of analysing and presenting the relationship between the design variable and the response variable; sec-ondly, it allows the prediction of observations that have not yet been made. In parametric regression, the underlying signal is assumed to follow a given parametric form. For instance, in the case of linear regression, the regression function is assumed to belong to the family of linear functions. Oftentimes, nonparametric methods of estimation such as kernel smoothers or smoothing splines are desirable, because they are extremely flexible as they do not re-strict the model to a defined family of functions (see Green and Silverman (1994), Eubank (1988)). The use of an ini t ia l nonparametric estimate may very well suggest a parametric model (like a linear model or low order polyno-mial model) but it w i l l at least give the data a chance to express themselves 1 more freely. However, the use of nonparametric regression involves the choice of a smoothing parameter which controls the balance between goodness of fit to the data and smoothness of the regression function. This choice can be made subjectively but automated methods of selection are often preferred. Due to their dependence on the cyclic nature of solar radiation, tides, meteorological processes, economical activities, etc., many observed time se-ries exhibit a periodic pattern along with serial correlation. Let us consider ground-level ozone (O3) concentration. O3 concentration has a strong yearly cycle as it highly depends on solar radiation. It is also serially correlated. We would like to estimate the underlying trend of ground-level ozone concentra-t ion using a nonparametric regression technique as we want our method of estimation to be flexible. However, since we strongly suspect that the under-lying signal is periodic and of the form {sin, cos}, we would like to use that knowledge. Therefore, we w i l l use smoothing splines with model based penalty. This nonparametric method, which w i l l be described in Chapter 3, allows us to specify a default parametric family. Also , we w i l l need to choose a value of the smoothing parameter and, i n order to avoid subjectivity, we would like to use an automated selection method. M a n y automatic methods of choos-ing the smoothing parameter exist for the case where the regression errors (or noise) are assumed to be independent, namely v ia plug-in (see Ruppert , Sheather and Wand (1995)), cross-validation and generalized cross-validation (see Stone (1974) and Craven and Wahba (1979)). However, if the observa-tions are dependent, the smoothing parameter selectors designed for indepen-2 dent observations w i l l produce poor results. In fact, many authors such as Opsomer (1996), A l t m a n (1990) and C h u and Marron (1991a) showed, in the context of kernel regression, that the effect of ignoring even minor amounts of error correlation when selecting a smoothing parameter could potentially be severe. This effect is also present in smoothing spline regression, as a simulated example readily shows in Figure 1.1: the data are generated by a {sin, cos} trend (discussed in greater detail in Chapter 4) and are overlaid wi th errors that follow an A R ( 1 ) model wi th autoregressive coefficient cb — 0.5. Ordinary cross-validation, which performs extremely well when the errors are uncorre-c t e d , selects a smoothing parameter that is far too small and results in an overly "wiggly" estimated regression function. The other estimate displayed in Figure 1.1 uses a larger smoothing parameter selected automatically by the M C V i method studied in this thesis, and is much closer to the true mean function of the data. For kernel regression, many authors have tackled the problems asso-ciated with correlated regression errors and have proposed methods to solve them. Hart and Wehrly (1986) proved that, under certain conditions, correla-t ion may lead to an inconsistent estimator of the regression function. Several other authors have shown that correlation causes smoothing parameter se-lection techniques designed for independent observations to converge to the "wrong" smoothing parameter, and have proposed various methods to remedy the situation. C h i u (1989) and Hart (1991,1994) assumed a parametric form for the correlation structure and adjusted some smoothing parameter selection 3 O 20 40 60 80 100 Figure 1.1: Estimated Mean Functions using CV and MCV\. techniques. A l t m a n (1990) estimated both the regression and correlation func-tion nonparametrically and proposed a plug-in smoothing parameter method. Opsomer (1996) also proposed a plug-in method and it w i l l be discussed later in Chapter 3. C h u and Marron (1991a) bypassed the problem of estimating the correlation function and adjusted the cross-validation criterion to obtain an improved bandwidth selector. However, most authors have worked in a kernel estimation setting. We propose, in this thesis, to modify some of the ideas used in the cross-validation criterion in the kernel regression context, and to apply them to smoothing splines wi th model based penalty. Chapter 2 surveys the nonparametric meth-ods most commonly used for regression function estimation. Chapter 3 dis-cusses a few smoothing parameter selection techniques that account for corre-4 lation in the regression errors. In Chapter 4, we will analyse the performance of the developed smoothing parameter selection methods by applying them to a simulation study and in Chapter 5, using the same smoothing parame-ter selectors, we will analyse a data set of daily and monthly ground-level O3 concentrations. 5 C h a p t e r 2 O v e r v i e w of E x i s t i n g M e t h o d s . This chapter contains a brief summary of the key methods available for uni -variate nonparametric regression function estimation from independent data. The goal of these methods is estimation of the regression function without assuming any underlying parametric model. One method in particular, penal-ized regression wi th model-based penalties, w i l l be discussed in greater detail . However it is helpful to have a survey of methods before spending more time on any specific method. First of al l , in order to put smoothing problems into a statistical frame-work, let us describe the two most commonly used models. The first one is the "fixed design", where the ordinates of the data points are thought of as being deterministic values. These ordinates values are usually chosen by the experimenter, as in a designed experiment. Frequently, these points are taken to be equally spaced since it is what one might intuitively think of. The fixed design model w i l l be used in this chapter and throughout this thesis. For 6 reference, the second model is the "random design", where the data points are thought of as being realizations from a bivariate probability distribution. These ordinates are usually not chosen by the experimenter, and occur, for example, in sample surveys. Let the rc '^s have a fixed design satisfying the following regularity con-ditions. There exists a continuous positive density / on [0,1] wi th fXif(t)dt=-^- (2.1) Jo n + 1 One can show that there exists C\ > 0 and C2 such that C l < * i - < - ^ r (2.2) n + 1 n + 1 for al l n and al l x^, i — 1, . . . , n + 1 (take x0 — 0 and xn+\ = 1). The fixed design model is given by Yi = m(Xi) + Si (2.3) for i = 1, 2 , . . . , n, where m is the "mean" or regression function, the X j ' s are nonrandom design points satisfying (2.1), and the £j's are independent random variables with mean 0 and variance a2. The goal is to use Y\,..., Yn to estimate the regression function m or its rth derivative m^r\ The performance of an estimator rhr(x) of m^(x) is conveniently as-sessed v ia its Mean Squared Error (MSE) or its Mean Integrated Squared Error (MISE) defined respectively by MSE{x) = E [{mr(x) - m^{x)}2] , (2.4) 7 MISE = J MSE(x)u{x)dx (2.5) wi th co > 0, a weight function. When the MSE-cr i ter ion is used, it is under-stood that the main objective is to estimate the function m^(-) at the point x, and when the MISE-cr i ter ion is used the main goal is to recover the whole curve. It can be easily seen that the M S E has the following bias-variance decomposition: MSE(x) = [E{mr(x)} - m{r\x)^ + Var{mr(x)}. (2.6) One often refers to the term E{mr(x)} — mfJ\x) as the bias. We w i l l come back to the notion of bias-variance later in section 2.1.3. 2.1 Kernel Estimators 2.1.1 Local Constant Fitting Nadaraya- Watson Estimator Without any assumption on the form of the regression function m, a data point far away from x carries very litt le information about the actual value of m(x). Thus, a reasonable estimator for the mean function m is a running local average and is generally called the locally weighted average estimator. This estimator is a weighted average of the Yj's wi th weights depending on a function K and a positive number h. Let K be a given real-valued function to calculate the weights assigned to the observations. This function K is usually a probability density function symmetric about 0 and is often called a kernel 8 function. The following notation about kernels will be used from now on. The moments of K are denoted by Vj(K) = / sjK(s) ds. Gasser, Miiller and Mam-mitzsch (1985) defined if as a kernel of order p0, pi (po > 0,Pi > Po + 2) if Uj(K) = 0, j = 0,... ,p0 - 1, j = Po + 1, • • • ,Pi - 1, Vpo(K) ^ 0, and uPl(K) is finite and nonzero. Generally, po is taken to be the order of the derivative of the regression function m to be estimated while p\ — po controls the amplitude of the estimation bias. We say that if is a kernel of order p if it is a kernel of order p0 — 0,pi = p. Let h be a bandwidth (also called a smoothing parameter), which is a non-negative number controlling the width of the neighborhood in which averaging is performed. Then, the well-known Nadaraya-Watson (N-W from now on) kernel regression estimator is given by: YK( Xi~x) See Nadaraya (1964) and Watson (1964) for more details. Some commonly used kernel functions include the Gaussian kernel K(t) = (27r)~2 exp(—12/2) and the symmetric Beta family kernel where the subscript + denotes the positive part. The term Beta(a, j3) denotes the beta function, Beta(a,/3)= C xa~l (1 - xf~l Jo dx. 9 The beta function is related to the gamma function through the following identity: The choice of 7 = 0,1, 2, 3 leads respectively to the uniform, the Epanech-nikov, the biweight and the triweight kernel function. Marron and Nolan (1988) stated that the Beta family includes the most widely used kernel func-tions, the Gaussian kernel being the l imit as 7 —t 0 0 . It is easily shown that the N - W estimate is the solution to a natural weighted least squares problem, being n rhh(x) = argminp^2(Yi - (3)2Wi (2.8) 1=1 where the weights Wi are K((xi — x)/h). That is, the N - W estimator corre-sponds to locally approximating m(x) wi th a constant, weighting values of Y corresponding to X j ' s closer to x more heavily. Note that one can estimate the rth derivative of m by differentiating m(x) in (2.8), or one can use the asymptotically equivalent estimate (for equally spaced design) ^ ^ ^ ( ^ W s ^ M - <2-9> with K*(u) = (-l)^K^(u)/C where C is so that vr(K*) = r\. One can show that if K is of order p, then K* is a kernel of order r,r +p. 10 Gasser-Muller Estimator Another estimator using local constant approximation is the Gasser-Muller ( G - M from now on) estimator and is defined as n where < s ,_ i < Xi (a common choice being S j _ i = {x^i + xi)/2, wi th s0 and sn being the upper and lower l imits of the range of x, respectively). When restricted to x € [h, 1 — h], K(u) — 0 for u 0 [—1,1] and K integrable to 1, the weights u>i = h~l K ( ^ p ) du sum to 1, and hence, the G - M estimator is of the form (2.8). Often, estimators of this type are called convolution estimators. The G - M estimate can be viewed as a "continuous moving local average" as the weights h~l J^i_i K (j^j du are continuous functions of x, even if K isn't. O n the other hand, the weights used in the N - W estimator could be discontinuous. If the design points are equally spaced, then the N - W and the G - M estimators are very nearly the same, by the integral mean value theorem. However, when the design points are not equally spaced, or when the design points are i id random variables, there are substantial and important differences between these two estimators. These cases are not of main interest here and therefore w i l l not be covered in detail in this survey. 2.1.2 Local Polynomial Fitting The local constant Nadaraya-Watson estimator can be modified to an estima-tor that fits higher order local polynomials. This might be desirable since a 11 X — U h du\ Yi, (2.10) high order polynomial may better approximate a function locally. A lot of work has been done on local polynomial regression. It was studied by Stone (1977), Cleveland (1979) and Stone (1980,1982). Some more recent work includes Fan (1992,1993), Fan and Gijbels (1992) and Ruppert and Wand (1994). A detailed picture of the attractive properties of local polynomial fitting is given in these papers. Suppose that locally the regression function m can be approximated by P rnU)(r\ P "*(*) * £ — T 1 ^ = - *V (2-11) 3=0 •>' 3=0 for z in a neighborhood of x, using Taylor's expansion. Because we estimate m(z) locally by a polynomial model, we are inclined to use a locally weighted polynomial regression £{Yi - ! > ; ( * ; - m 2 K ( ^ ) (2.12) i = l j=0 where K(-) is a kernel function .as defined previously and h is the smoothing parameter. Let J3j, j = 0 , . . . ,p be the minimizer of (2.12). Then an estimator for m^(x) is mr(x) = r\J3r. (2.13) One might think that this is pretty similar to the tradit ional parametric approach where the regression function is globally modeled by a polynomial . However, to have an acceptable modeling bias, the degree p of the globally fitted polynomial is often required to be quite large. For example, for a simple function such as m(x) = s i n x + cosx, x € [0, 27r], parametric polynomial modeling requires a large p in order to get a low bias. This large degree p 12 introduces an over-parameterization, causing the estimated parameters to be highly variable. A consequence of this is a large variance of the estimated regression function. Local polynomial regression being a " local" method, it only requires a small degree p of the polynomial fitted locally. Usually a degree oip = r + 1 or p = r + 3 is used (where r represents the order of the derivative to be estimated). For example, using a polynomial of degree p = 1 leads to a very popular estimator m(x) called the local linear fit or the local linear regression smoother. In the case of fixed design, the N - W estimate, the G - M estimate and the local linear estimate all have the same asymptotic bias and variance expressions when x ^ 0 or 1. These are respectively Asympt.Bias = \h2m" (x)u2(K) + 0(-^=) + o(h2) (2.14) 2 V n / i Asympt.Variance = ° . .v0(K2) + O(—^-=). (2.15) nhj[x) n*nz However, in the case of a random design they perform differently. See Table 2.1 for the asymptotic expressions in the random design case. Compar-ison of these estimates is discussed in greater detail in Eubank (1988), Mi i l l e r (1988), Hardle (1990), Fan (1992) and Cheng, Fan and Marron (1997). 2.1.3 Bias and Variance Tradeoff for Polynomial Fitting A natural question is how do we proceed in the selection of the smoothing parameter. Choosing a "good" bandwidth is crucial to obtain a sensible curve 13 Table 2.1: Asymptotic bias and variance of kernel regression smoothers in the case of random design. Taken from Fan (1992). Method Bias Variance N ^ V (m"(:r) + 2-^0^jb~n K G - M m"(x)bn 1 . 5 K Local Linear m"(x)bn Vn where bn = \h2v2{K) and Vn = j ^ v ^ R 2 ) . estimate. A s mentioned earlier, the smoothing parameter is the value that determines the width of the neighborhood where the local estimate w i l l be computed. One can choose h "by eye" (pretty feasible when x € 1Z) which means that one chooses the width of the neighborhood by looking at plots of the estimate for different values of h. B y doing this, one can assure oneself that the resulting estimate of m is smooth enough while tracking the data pretty well. Also , the bandwidth can be chosen automatically using some well known techniques like cross-validation and plug-in methods that w i l l be introduced shortly. However, automatic techniques appear more desirable since they are data driven, and thus somewhat objective. We w i l l discuss automatic band-width selectors in Chapter 3. Nevertheless, even wi th automatic bandwidth choice, further tuning by the data analyst is often beneficial. Let us demonstrate the influence of bandwidth selection on the esti-mated regression function by two examples taken in the context of local linear regression (see Table 2.2). 14 Table 2.2: Effect of bandwidth selection on the local linear regression estimator. Small Bandwidth : h - » 0 Large Bandwidth : h —> oo Window contains only 1 Xi rhh(xi) = yi E(mh(xi)) = E(yi) = m(xi) Var(rhh(xi)) = Var(yi) = a2 -/» 0 as n —> oo Window contains al l Xi's rhh(xi) = a + fixi E(mh(xi)) # m(xi) Var{mh{Xl)) = a2{\ + —r 0 as n —T oo Consider a small bandwidth {h —> 0) and a very large one (h —> oo). Obviously taking h very close to 0 gives a window that only contains one Xi, which results in an estimate interpolating the data. Using h —> oo gives a window that contains al l the Xi's which is then equivalent to fitting a global linear model m(x) = a + f5x. A n estimate using a small bandwidth does not suffer from bias but does suffer from high variability. O n the contrary, an estimate with a large bandwidth is usually biased but has a low variance. The problem is then to find a compromise, i.e., a smoothing parameter so that both the bias and the variance are minimal . This problem is often referred to as the "Bias-Variance Tradeoff". Since this forms the core of many bandwidth selection criteria, it is essential to have a good understanding of the bias and variance of the estimators. We w i l l , in this section, present the asymptotic expressions for the bias and variance of the local polynomial estimate (degree p) of the r t h derivative of the regression function m. Let Np be the (p+l)x(p+l) matrix having (i, j) entry equal to ui+j_2{K) and let MrjP(u) be the same as Np, but with the (r + l )st column replaced by 15 ( 1 , « , . . . , « > ) r . Ruppert and Wand (1994) defined the kernel K{r!p)(u)=r\{\Mr>p\/\Np\}K(u), (2.16) where K(u) is a symmetric density. It can be shown that (—l) TK{r, P) is an order (r, s) kernel as defined by Gasser, Mul ler and Mammitzsch (1985) where s = p + 1 when p — r is odd and s = p + 2 when p — r is even. Kernels of that form are known to perform well when estimating the rth derivative of a regression function. The asymptotic expansions for the bias and variance of the estimator mr(x) are given in the next theorem. These results are taken from Ruppert and Wand (1994) and were derived for the random design case. However, similar results also hold for the fixed design case. We w i l l first introduce the original results for a general local pth degree polynomial fit and derivative r. First , let's assume the following. 1. The X j ' s are i id random observations following the density / . 2. The kernel K is a compactly supported, bounded kernel wi th U2(K) ^ 0. In addition, al l odd-order moments of K are equal to zero. 3. The point x is in the support of / . A t x, f is continuously differentiable and mPp+2^1 is continuous in a neighborhood of x. Also , fix) > 0 and a2 > 0. Theorem 1 Suppose that x is an interior point of the support of f and that 16 (1), (2) and (3) hold. Further, assume that h->0, nh2r+1 —y oo as n —r co. Then, for p — r odd, Biasirhrix^X-t,.. .,Xn} = up+1(K^p)) m (P+1)! ) hp-r+l + Op(k> ,p—r+1 (2.17) and, for p — r even, Bias{mr(x)\X1,...,Xn} = hv'r + op(h'-r+2). + {vp+2(K{r,p)) - r z / p + 1 ( i C ( r _ 1 > p ) ) } %^+l){x)f'(x) ' (p + l)\f(x) (2.18) In both cases .2 Var{mr(x)\Xu ...,Xn} = Mtfr*)) n h J + 1 f i l + °PM}- i2-™) It is clear from this Theorem that the case of p — r odd exhibits a much simpler bias expression. In fact, unlike the case of p — r even which depends on m ( p + 1 ) ( x ) , vnP+2\x) and f'(x)/f(x), it only depends on mSp+l\x). It can be shown that the bias of mr(x) at the boundary is of the same order as the interior bias when p — r is odd and is of higher degree when p — r is even. See Ruppert and Wand (1994) for more details on the subject. The results for the fixed design case are the same, and can be derived by techniques similar to those used by L i (1996). However, we need to assume the following along with assumptions 2 and 3 of Theorem 1: 1. The Xj ' s follow a fixed design wi th density / . 17 2. The first derivative of the kernel density K is continuous. 2.1.4 Bandwidth Selection for Local Polynomial Regres-sion Now that the concept of bias-variance tradeoff has been explained, it is time to introduce bandwidth selection techniques. The main topic of this work is smoothing parameter selection with smoothing splines for correlated errors. This chapter w i l l only survey a few commonly used methods in local polyno-mia l regression and smoothing splines when the errors are independent and Chapter 3 w i l l describe in greater detail the methods for correlated errors. A s mentioned earlier, the appearance of a local polynomial estimate depends strongly on the value of the bandwidth h. The two most popular automatic techniques are known as cross-validation ( C V ) and plug-in methods. Cross-validation uses the principle of "leave-one-out" prediction. The idea is to leave the data points out one at a time and to select the value of the bandwidth under which the removed data points are best predicted by the remaining data. The cross-validation criterion to be minimized is the following CV(h) = -J2(Yi-m^(xi))2 (2.20) n i=i where m W ( i j ) is the estimate, using bandwidth h, based on the data wi th Xi,Yi removed. The minimizer of CV, hey, is the cross-validation choice of the optimal bandwidth value. Unfortunately, studies have shown (see H a l l and Marron (1987), Park and Marron (1990)) that this technique's theoretical and 18 practical performances are disappointing. In particular, it leads to an estimate of h that suffers from high variance. Let us now describe the Direct Plug-in method of Ruppert , Sheather and Wand (1995). Consider the M I S E criterion as seen in (2.5) where w = /, the design density, and r = 0. Then, the bandwidth that minimizes (2.5), and thus asymptotically of (2.20), is the following (p+l ) (p ! ) 2 i * ( t f> 2 l 1 / ( 2 p + 3 ) h, opt (2-21) [2n^+1(K)Jm^)(u)2f(u) du\ using a local polynomial of odd degree p and a kernel K satisfying assumption 2 of Theorem 1, provided / m ( p + 1 ) (u)2 f (u) du is nonzero. The plug-in bandwidth selector requires an estimate of a2 and / m(jp+l') (u)2 f (u) du. For sake of simplicity, let us assume that we are estimating m v ia a local linear (p — 1) kernel regression method. Then, equation (2.21) simplifies to hopt — v0(K2)a2 1/5 (2.22) nvl{K)jm"{u)2f{u) du\ However, a2 and / m"(u)2f(u) du are unknown so we need to find an estimate of those two quantities in order to find hopt. There exists a relatively extensive literature on the estimation of o2 in the homoscedastic nonparametric regression context. Ruppert , Sheather and W a n d (1995) extended the idea of H a l l and M a r r o n (1990) based on a zero-degree least squares fit to a general pth degree setting. This leads to an estimate of the form: <T2(A) = v~x 'YJXi - rh(xi))2 (2.23) t=i 19 (2.24) (2.25) and X P i X is the nx(p + 1) design matrix 1 X\ — x (2.26) 1 *^ n x) W x = diag[K{(xi — x)/h}/h,..., K{(xn — x)/h}/h] and e,- is the (p + l ) x l vector having 1 in the jth entry and zero elsewhere. For more details on how to compute an estimate of a 2 , please refer to the above mentioned paper. In order to estimate hopt, one must also estimate the integral in the denominator of (2.22). We saw earlier in equation (2.13) that it is possible to find an estimate of rh"(x) when using local cubic polynomial regression. Using such an estimate, the bandwidth hi that minimises the MISE criterion for m''(x) is given by: where D is a function of a2 and m^. So now, a new bandwidth must be chosen. One can use the following steps in order to get the " R U L E O F T H U M B " estimate of hopt as proposed by Ruppert , Sheather and W a n d (1995). 1. F i n d the estimate a2 of a2, as, for instance, i n (2.23). hf = Dn-l'» (2.27) 20 2. Use a parametric family to estimate m by rhp. Take the 4 derivative to obtain rhp\ Estimate J[m^(x)]2 dx by J[rhp\x)]2 dx. 3. Use the estimates in 1 and 2 in the formula (2.27) to estimate hlpt. 4. Estimate m" using a local cubic polynomial and the estimate of hlpt. 5. Use 4 to estimate J m"(x)2f(x) dx and plug this estimate along with the estimate of a2 into (2.22) to get an estimate of hopt. For more details on plug-in bandwidth selection methods, please refer to Ruppert , Sheather and Wand (1995). 2.2 Splines We saw in the previous section that the use of local methods such as local polynomial fitting provided some added flexibility in estimating the regression function. In this section, a different way of enhancing the flexibility of the regression method w i l l be introduced. This method is known as the 'spline method'. Eubank (1988), Wahba (1990) and Green and Silverman (1994) constitute excellent reference books on spline smoothing. The function </> is called a B-spline of degree D defined on the inter-val [a, b] wi th knots t\,..., (a < t\ < • • • < < b) if (j) is a polyno-mial of degree D on the subintervals [a, ti], [tx, £2], • • •, b] and <f) has D — 1 continuous derivatives on [a,b]. We denote the family of these B-splines by BSp(ti, • • •, tL\ Z)).The number of parameters required to specify a B-spline is 21 L + D + l. It is possible to show that BSp(t\, • • •, tr,; D) is a linear space. This means that if (j>\ and (b2 € BSp(ti, • • •, t^; D) and a\ and a2 are real numbers, then a:i0i + a2(j)2 e BSp(tlt • • •, tL; D). We also say that (b is a univariate natural polynomial spline of de-gree D = 2M - 1 with knots ti,...,tL (a < t i < ••• < tL < b) if (b G BSp(h, •••,tL;D) and 0(M)(t) = 0 for t e [a, h] U [tL, b]. Let us demonstrate the previous definitions by a simple example where we use cubic splines (D = 3). A cubic B-spline is cubic on each of the subin-tervals and has a continuous second derivative. A natural cubic spline is also a cubic B-spline except that it is linear on the interval [a,ti] U [£z,,fr]. C u -bic splines are the most commonly used splines since polynomials of degree higher than three are too wiggly while lower degree polynomials are not flexible enough. Now that we have introduced the concept of splines, we will discuss the two most important methods of fitting splines to a model as in (2.3) where m is known to be a smooth function. These are respectively regression splines and smoothing splines. 2.2.1 Regression Splines In order to discuss regression splines, we need to define a basis for BSp(ti, • • •, tr, Let ... = t-i = t 0 = a < h < • • • < tL < b = t L + i = tL+2 = ... (2.28) 22 We define the B-spline basis for BSp(ti, • • • ,tL; D) by {N-DjD, • • •, NLjD} where the A ^ D ' S are defined recursively by K M = r ^ T N i ^ x ) + + U + D + l ~ X N ^ - i i x ) , (2.29) given NitD-i and i = —(D — 1 ) , . . . , L. We define 0/0 = 0, undefined/0 = 0 and Nifi(x) = I{x E M i + i ) } i = 0 , - - - I L - l NL,0{x) =I{xe [tL,b]}. (2.30) So, given D and a set of knots t _( jn _ i ) , . . . , <£ , , we have a basis of d i -mension D + L + 1 that defines a linear space that may be used as regression functions when wanting to fit a smooth function m. Thus, one can estimate m{x) as 2~2j 0jNjtE>(x). The spline method con-sists of finding the best spline estimate in a least squares problem. Let 9 be the least squares estimate. Then m(x) is estimated by rh(x) = 2~2j6jNjtr)(x). Obviously this method is very dependent on the number and on the location of the knots. A n automated method of knot selection is therefore desirable. Introduced by Smith (1982) and Breiman, Friedman, Olshen and Stone (1993) the procedure known as "knot deletion" is a popular automated procedure but w i l l not be discussed here. Instead, a related method which avoids knot selection and is more relevant to the topic of this thesis w i l l be discussed. This procedure relies on the smoothing spline approach, see W h i t -taker (1923), Wahba (1990) or Green and Silverman (1994). 23 2.2 .2 S m o o t h i n g S p l i n e s First , let us consider a simple least squares problem: n minm^{Yi - m(xi)}'2 . (2.31) i = l Obviously the solution of this rather naive least squares problem is any function m which w i l l interpolate the data. However this solution is not desirable in a statistical context. Not only does it produce a highly complex model (with the same dimension as the data set) but it results in an over-parametrisation which leads to a large variability of the estimated parameters. W h a t could be done in order to improve this approach? The main problem is that we did not penalize for the over-parametrisation or interpolation. A popular way to deal with this problem is to penalize v ia the roughness of the estimate. The roughness is measured by P(m) = J (Lm(x))2 dx, where L is a differential operator which w i l l be discussed in greater detail later. B y adding this penalty term to our simple least squares problem we now have the following minimizat ion problem: n m = minmY^{Yi-rn(xi)}2+ \P(m) (2.32) i=l where A is a non-negative real number called the smoothing parameter. So m is then, for a given value of A, the best compromise between smoothness and goodness of fit. Also , it is possible to show that the minimizer of (2.32) wi th P(m) = j(m^M\x))2 dx, is a natural spline of degree 2M — 1 wi th knots {xi}. In this case, one can show (see equation (2.2) of Wahba (1975)) that rn is linear in the 24 observations Fj , in the sense that there exists a weight function G ( s , t ) such that m(s) = n~l J2YiG{s,Xi). (2.33) t=i The weight function depends on the design points x\,...,xn and also on the smoothing parameter. The results in (2.33) also hold for the penalty functions described in section 2.2.3, equations (2.35) and (2.36). A s one can notice, (2.32) consists of two distinct parts. The first part imposes a penalty on the lack of fit as we saw in the above least squares prob-lem. The latter part, as mentioned earlier, adds a penalty on the roughness of the estimating function, thus relating to over-parametrisation. It becomes clear that, in (2.32), the value of the smoothing parameter A controls the complexity of the model. For example, let us consider a A of value 0. In that case we fall back to the minimizat ion problem of the form (2.31), which inevitably results in an estimate that interpolates the data. O n the other hand, the case where A is very large leads to an estimate of m converging to the least squares estimate restricted to the kernel of P, that is, the set of m with P(m) = 0. In the cubic spline case, when P(m) — J(m"(x))2 dx, the kernel of P is the family of lines. In fact, if the data F; lie on a curve in the kernel of P, then the estimated regression function m w i l l be exactly that curve. Thus, we can see that through its kernel, the penalty P is related to a favoured parametric model. Thus, for example, the parametric model m(x) = fi0 + B\X may be considered the favoured model for the penalty f(m"(x))2 dx. 25 It is important to note that this relationship is not very strong: nothing forbids the estimate m to be quite far from the favoured parametric model. So the choice of the roughness penalty may appear to have litt le importance on the estimation. However it was shown in Heckman and Ramsay (1996) that the choice of an appropriate penalty can lead to better estimates. In fact, as seen in Wahba (1990), the minimizer m of (2.32) follows this inequality: - J2 {E(m{xi)) - m{xi)f < XP{m) (2.34) 71 i=i where m is the true regression function. It is then clear that the closer m is to the kernel of the penalty, the smaller the bias of the resulting estimate w i l l be. Therefore, if the data analyst has some knowledge about the family of curves to which the true curve belongs, an appropriate choice of the penalty P w i l l help in reducing the bias of the estimate, thus giving a better estimate. 2.2.3 Some Theory on Lsplines Let us introduce the motivation behind the choice of a penalty along with some theory. We now know that through its kernel, the penalty is associated with what we called a favoured parametric model. We w i l l consider penalties that are based on differential operators. First , let us assume that the regression function m is in Hh, the Sobolev space of functions wi th k — 1 absolutely con-tinuous derivatives and square-integrable kth derivative. We w i l l consider the penalized least squares problem of the form (2.32) where m is the minimizer. 26 Specifically, let L be a k order linear differential operator fc-i L = Dk + wiDi (2.35) i=0 where D1 represents the ith derivative operator and the tOj's are continuous real-valued weight functions. In (2.32) we take the penalty to be P(m) = J (Lm)2 (2.36) and we want to find m that minimizes the following C ( m , w ) = ^ ( y i - m ( ^ ) ) 2 + A f[(Dk + Y/wiDi)m}2, (2-37) where w = (WQ, . . . , Wk-i)1 • We call the minimizer an Lspline. Choosing a penalty L is equivalent to choosing a favoured parametric model, i.e., the set of functions wi th Lm = 0. Here are a few examples of dif-ferential operators and their respective bases for the corresponding parametric families. Table 2.3: Differential Operators and their Parametric Family Operator L Parametric Family for kernel of L D2 D2 + 7D, 7 7^ 0 TJ 4 D3 + u2D, co#0 D4 + LJ2D2,CO#0 (D2-JD){D2 + LO2D) {1,*} {l,exp(-7t)} { 1 , M 2 , * 3 } {1, cos cot, sin cot} {1, t, cos tot, sin cot} {1, exp jt, cos cot, sin cot} Given L, one can easily find the corresponding basis functions in the case where the tOj's are assumed to be constant real numbers. For that case, 27 the following Theorem gives a simple recipe for calculating the basis functions for the space of m wi th Lm = 0. For a proof of this result, the reader is referred to the standard differential equation results of Coddington (1961). Theorem 2 Let us denote thep distinct roots of the polynomial xk+YliZo w i x l as r±,... ,rp and let m,j denote the multiplicity of root rj. Then the following functions oft form a basis for the kernel of the differential operator L: exp(rjt),texp(rjt),imj_1 expfa t ) j = 1 , . . . ,p. The other way to proceed is to specify a desired basis and then find the corresponding penalty L. For certain favoured parametric models, the differential operator L is pretty easy to calculate. Let us assume that the favoured parametric model is a linear combination of /xo, • • •, Hk-i £ TLk which are linearly independent with each having k continuous derivatives. Since there are k functions in the basis, an L of order k can be found, if the associated Wronskian matrix [W(£)]j j = A4+I^ (£) 1 S invertible for a l l t G [a, b]. One can easily find the weight functions i u 0 , . . . , Wk-i by solving o = (L»m = ^\t) + Y/wJ(t)^\t) j=0 which is equivalent to the following matrix form: (2.38) \ fill®) 28 that is ^ w0(t) ^ Wk-l{t) = - W ( t ) " 1 (2.39) Data analysts usually choose models by using the knowledge that they have of the data to be analysed. Note that in nonlinear regression, many of the parametric models are based on physical models, so the favoured parametric model could also be chosen that way. In our case for example, since ozone levels are known to have a periodic pattern, we would like to use a {1, sin, cos} model to estimate m. Using this basis made of three functions, it would then be easy, using the previous theory, to find a penalty L of order 3. In addition, it is important to consider the goal of the analyses: do we want to estimate the regression function m or one of its derivatives? In the first differential operator of order 2 or greater is suggested while in the latter case, an operator of order 4 or higher would perform better. It is also possible to define the effective number of parameters used (p\) by a smoothing spline. Here we w i l l use the definition given by Hastie and Tibshirani (1990) p A = trace S A > W (2.40) where S A , w is a hat matrix, defined as n (2.41) with G the weight function from (2.33). It can be shwon that S A , w is i n -dependent of Y = ( Y i , . . . ,Yn), and satisfies the following riiA = S A > W Y for 29 m = {rh\(xi),... ,rh\(xn))'. One can show that, for fixed w , p\ is a decreas-ing function of A wi th p\ = n when A = 0 and limA_>ooPA = ^ L , where UL is the dimension of the kernel of L. For example, in the case where we use the favoured parametric model {1, sin, cos}, we get = 3. Thus, a small smoothing parameter is equivalent to a large number of parameters and it re-sults in a "wiggly" estimate of m while a large smoothig parameter (a small number of parameters) yields a smoother estimate of the regression function. Note that if we choose a data-dependent smoothing parameter A, we then have rii^ = W Y where w now depends on Y via A. Nonetheless, the trace of S ^ w is st i l l used to define p\. In the case where w is estimated, the number of parameters used by an estimate is given by trace + length w . For more details on the theoretical aspects of smoothing splines wi th model based penalties, please refer to Heckman and Ramsay (1996). The Splus module Lspline, written by Ramsay and available on the Statl ib web site (http:// l ib.stat .cmu.edu), allows the data analyst to compute splines with model based penalties easily and quickly. Later on in this thesis, when we reach the segment of data simulations and data analysis, we w i l l use the method of splines with model-based penalty. The choice of the penalty along wi th the motivations w i l l be discussed in Chapter 4. However, before doing so, we w i l l introduce in the next section the different automated techniques of smoothing parameter selection in the case of spline smoothing when the observations are independent. In Chapter 3, we w i l l introduce techniques that are specifically designed for correlated 30 regression errors, the subject of this thesis. In Chapters 4 and 5, we w i l l apply two of these techniques to a simulation study and to a real data set. 2.2.4 Smoothing Parameter Selection for Smoothing Splines Spline smoothing has become a very popular method for estimating the regres-sion function in a nonparametric model. A s discussed earlier, the performance of the spline smoothing technique depends on the choice of the penalty but most importantly on the choice of the smoothing parameter. There exist many methods of selecting the smoothing parameter when the observations are as-sumed to be independent. Such methods include cross-validation ( C V ) and generalized cross-validation ( G C V ) among the most commmonly used. We w i l l now introduce the C V and G C V methods in the context of data that are assumed to be independent and then, in Chapter 3, w i l l present a few methods for the case of correlated data. Cross- Validation We saw in section 2.1.4 the idea of cross-validation for polynomial regression. It is basically the same idea that is used for spline smoothing. Let jr$ be the smoothing spline obtained from all the data except Yi using a smoothing parameter value of A. Then, the cross-validation choice of A is the value of A that minimizes the C V criterion 1 n 2 (2.42) 31 Now recall the hat matrix defined in equation (2.41). Craven and Wahba (1979) showed that equation (2.42) has an easier computational form: CV(X) - - V rc-^fo))2 (2 43) C V W - n h (1 - SU(\))* ' ( 2 " 4 3 ) where rh\ is the smoothing spline obtained using all data points. Generalized Cross- Validation Craven and Wahba (1979) also suggested another criterion which is based on C V . It is called generalized cross-validation ( G C V ) and is obtained by replacing Sa(X) in (2.43) by its average value, n~HrS(X). We then obtain the G C V criterion to be minimized GCV(X) = ISre-mAfo)) 2 (2.44) Theoretical arguments show that G C V should, asymptotically, give the best choice possible for the value of the smoothing parameter A in the sense of minimizing the average squared error at the design points. For more details on the computational aspect of G C V , please refer to Silverman (1984b). 32 C h a p t e r 3 S m o o t h i n g P a r a m e t e r Select ion W h e n the E r r o r s are C o r r e l a t e d Let us consider the following model: Yi = m(xi) +£i,i = l,...,n (3.1) where the x^s follow a fixed design with density / as defined earlier in (2.1). However, unlike previously assumed, the £;'s are now correlated, that is, they follow a covariance stationary process with unknown covariance func-tion jn(k) = Cov(£i,Si+k)- This formulation of the covariance function al -lows the autocorrellation to vary with both the distance between the de-sign points and the sample size. Here we define covariance stationary as Cov(ei,ej) = Cov(ei+k,ej+k) for any k > 0. For more details on such pro-cesses, see Brockwell and Davis (1987). We surveyed, in the previous chapter, some automated smoothing pa-33 rameter selectors which are known to perform well in the situation of inde-pendent observations. However, in the case of dependent observations, those smoothing parameter selectors w i l l not produce good smoothing parameters. In this chapter, we w i l l first analyse how correlation in the data affects the appropriate amount of smoothing. Then, we w i l l introduce four smooth-ing parameter selection methods for the case of dependent observations. These methods are intended to take into account the correlation and give smooth-ing parameters that are closer to the optimal smoothing parameter. The presented methods w i l l be: "modified cross-validation" ( M C V ) , "partitioned cross-validation" ( P C V ) , "blockwise cross-validation" ( B C V ) and "time series cross-validation" ( T S C V ) . For the sake of simplicity, we w i l l present them in a kernel estimation context, but they can very well be generalized to other smoothing methods, such as splines. 3.1 Effect of Dependence on the Smoothing Parameter Let us assume that we want to estimate the regression function m , which is assumed to be twice continuously differentiable and defined on [0,1]. Further-more, we would like to have a data-driven method that would allow us to chose an appropriate value for h, the smoothing parameter. We choose as our risk criterion the mean average squared error ( M A S E ) defined as MASE(h) = -J2 E(mh{xi) - m{Xl))2. (3.2) n i=i 34 Under model (3.1) and for the Gasser-Muller kernel regression estima-tor, Hart (1991) showed that MASE{h) „ sJ^m+wp* r m » { x f m dx, (3.3) nh 4 Jo where Vi(K) is as in section (2.1.1), i f is a kernel of order 4 and 5 is the spectrum of the process {Yi — m(xi)}, i.e., o o S(u) = 7„(0) + 2 53 "fn(i)cos(niu), 0 < u < l . (3.4) i = l We can see that the square of the estimator's bias (the second term of the right-hand side of (3.3)) is not affected by the correlation. However, the correlation does affect the variance and it does so in the same way that correlation affects the variance of a sample mean of a covariance stationary process. Indeed, Priestley (1981) showed that the sample mean (Y = z 3 i L i ^ V n ) 0 I * a process with well-defined and bounded spectrum satisfies Var(Y) ~ n One can show that the minimizer of (3.3) and, asymptotically, of (3.2) is • * ™ » I ' ' 5 . (3.5) hopt — nvl{K)ti{m»{u)¥f{u)du\ which is similar to equation (2.22) with a2 replaced by 5(0). So if 5(0) > 7n(0) = Var(ei), meaning that the correlation is predominantly positive, then hopt w i l l be larger than it would be in the case where the data are indepen-dent. Thus, when errors are predominantly positively correlated, data should be oversmoothed. O n the other hand, for 5(0) < 7 n (0), which is equiva-lent to predominant negative correlation, / i o p t would be smaller than in the 35 uncorrelated case. It is then obvious that correlation has a clear effect on the optimal smoothing parameter which leads us to consider techniques for smoothing parameter selection that take into account the fact that the data are correlated. For a more detailed description of the effect of dependence on smoothing parameter selection, the reader is refered to Hart and Wehrly (1986), C h i u (1989), Diggle and Hutchinson (1989) and Hart (1991). 3.2 Smoothing Parameter Selection Methods for Dependent Observations For dependent observations, C h u and Marron (1991a) showed that the cross-validated smoothing parameter does not converge to the optimal smoothing parameter, although it converges at the same rate as that given by Hardle, H a l l and Marron (1988) for-the case of independent observations. One can easily see why C V yields such a poor estimator of the regression function when errors are correlated. The cross-validation method tries to find a good predictor of Y{. The problem is that a good predictor of Yi is often a poor estimate of E(Yi) when the data are correlated. The best mean squared error predictor of the data point Yi is based on the remainder of the observations in the following way: E(Yi\Yu ... ,Yi-i,Yi+i,... ,Yn) = m{xi) + q(eu . . . , £ ; _ i , e i + i , . . . ,en) (3.6) for some function q. In the case of independent data, q = 0, and thus a good estimator of 36 E{Yi) is the same as a good predictor of Y,. However, this is not generally the case when the data are correlated. Cross-validation tends to select a smaller smoothing parameter when data are positively correlated because a kernel regression smoother with a small smoothing parameter usually does a good job of estimating the predictor of Yi which is not the same as E(Yi)-in the case of dependent observations. Obviously, an adjustment to cross-validation is needed in order to alleviate the effect of dependence. Here are a few methods that take into account the correlation when deciding on the amount of smoothness needed. 3.2.1 Modified Cross-Validation The modified cross-validation method ( M C V ) , also called the "leave-(2/+l)-out" method, was first introduced by Hardle and V i e u (1992) and further studied by V i e u and Hart (1990), Gyorfi , Hardle, Sarda and V i e u (1989) and C h u and Marron (1991a) who presented results of applications of the M C V in various settings, including mixing data. Let us now introduce the "leave-(2/+l)-out" smoothing parameter se-lector. For any / > 0 the smoothing parameter is obtained by minimizing the M C V criterion MCVl{h) = n- lJ2(mf{xi)-Yl)\ ' (3.7) i=l Here rfi£\xi) is the "leave-(2/+l)-out" estimate of m(xi), that is, the obser-vations (xi+j, Y{+j), —l<j<l are left out in constructing the estimate. Note that when / is taken to be zero, M C V is simply ordinary cross-validation. In 37 the case of the Nadaraya-Watson kernel estimator, rh^\xi) is defined by Ej:\j-i\>lK{Xi -Xj) The amount of dependence between m^(xi) and Yi is gradually reduced as the value of I is increased. C h u and Marron (1991a) showed that whenever I is moderately large, M C V produces a smoothing parameter that is nearly asymptotically unbiased for the optimal smoothing parameter for estimating m. Marron (1987) suggested an approach for the choice of I that is based on the minimizat ion of the mean squared error. The results of C h u and Marron (1991a) were proven in the context of kernel regression estimators. We implemented this smoothing parameter selector and util ized it in the context of smoothing splines wi th model-based penalty. The results w i l l be discussed in greater detail in the next chapter. 3.2.2 Partitioned Cross-Validation We have now seen one way of overcoming the effect of dependence on smooth-ing parameter selection. However other possibilities exist. One of them, pro-posed by Marron (1987) in the context of density estimation, is called "part i -tioned cross-validation" ( P C V ) . C h u and Marron (1991a) later used P C V for kernel regression estimation. Al though the rate of convergence of this smoothing parameter is faster than that for M C V , the resulting smoothing parameter is not very good. Indeed, C h u and Marron (1991a) proved that the asymptotic mean of this smoothing parameter is far from the optimal smoothing parameter. In fact, 38 they showed that the l imit ing distribution of the smoothing parameter is cen-tered at the optimal smoothing parameter for the case of independent obser-vations, which is different from the true optimal smoothing parameter. Thus, the resulting smoothing parameter suffers from asymptotic bias and leads to poor results in practice. The idea behind partitioned cross-validation is to split the observations into k subgroups. This is done by taking observations {1,1 + k, 1 + 2k,...} to build the first subgroup, {2,2 + k,2 + 2k,...} for the second subgroup and so on. Then, we calculate the ordinary cross-validation score CVoj(h) for the jth subgroup of observations separately, j = 1,2, ...,k. We then minimize the average of these k cross-validation scores to find the optimal partitioned cross-validated smoothing parameter. In the case where n is not a multiple of A;, we then take observations 1,2,... ,k • [n/k] and we discard the others. The notation [a] denotes the largest integer which is less than or equal to a. When k = 1, partitioned cross-validation is simply ordinary cross-validation. For correlated data, as long as the value of k is sufficiently large, the errors within each subgroup w i l l be essentially independent. 3.2.3 Blockwise Cross-Validation A quite common problem seen in applications is that of taking repeated mea-surements on the same experimental subject or unit. Such data are often used to make inferences about a population. A model for such a problem can be Yj(xij) = m{xij) + £j{xij), j = 1,..., M i = l,...,nj\ (3.9) 39 where m, assumed to be smooth, is the population mean and €j(xij); % = 1 , . . . , rij follow a causal A R M A process. This leads to having M independent samples of respective size rij of the same process with mean m. We do not assume that each sample is measured at the same time points nor do we assume that we have the same amount of data points for each sample. Therefore we use the notation Xij which corresponds to the ith time point for the jth sample. Also , it is assumed that the errors Sj between samples are independent. Now the question that arises is how to analyse such data. Based on the assumption that m is smooth, we can use nonparametric methods such as kernel smoothers or smoothing splines. But how do we decide on the amount of smoothing? The most popular technique of smoothing parameter selection, cross-validation, is ineffective for this case since, as shown at the beginning of this chapter, observations are correlated. Wehrly and Hart (1988) and Rice and Silverman (1991) proposed a version of C V based on the idea of predicting data of one of the M experimental units by using the other M — 1 units. It is done for each of the units and the M predicting errors are then averaged. The optimal smoothing parameter is the one that minimizes the mean prediction error. This method was called "unitwise" or "blockwise" cross-validation since one unit or block is left out and then predicted. Then, the blockwise cross-validation criterion is defined as (3.10) where N = Y^jLi^j and m3h(xij) is the estimate of m(xij) computed wi th all the data except for the jth unit : {Yj(x\j),... ,Yj(xnjj)}. The regression 40 function is then estimated by m/ i o p ( , where hopt minimizes BCV(h) and al l N observations are treated as a data set from one individual . We also implemented a modification of this method and we applied it to our simulation study (Chapter 4) and data set (Chapter 5). 3.2.4 Time Series Cross-Validation Based on the prequential method of Dawid (1984), "time series" cross-validation ( T S C V ) was proposed by Hart (1994). The prequential approach is based on the concept of choosing a parametric model that w i l l perform well in predict-ing future observations. It is similar to cross-validation but instead of using al l the data other than Yi, it uses only the previous data Y i , . . . , Y i _ i to predict. Since forecasting is so important in the time series context, prequential analy-sis becomes very appealing in that situation. So like the prequential approach, T S C V has roots in giving a good one-step ahead predictor. However, it can be modified to be more suitable for estimating the regression function. Details on how to modify this technique appear at the end of this section. Now, let us assume that we believe that the parametric probability model MQ would yield good prediction for the process {ei = Yi~m(xi)}. Here 6 represents the vector of the unknown parameters of the error model M$. Consider the following predictor Yi(h, 6) = rh}h{xi) + co(ii,eVi), (3.11) where mlh{xi) is an estimate of m(xi), based on a smoothing parameter h, that uses only Yl,..., Y{_x, ij = Yj - m3h{xj),j = 1 , . . . , i - 1 and cg(eu e^) is 41 the optimal mean squared error predictor of e$ given e\,..., Si-i and assuming that the probability model is M#. Since T S C V only uses data to the left of Xi, Hart (1994) suggests that one uses a boundary-type kernel when estimating m(xi). Then, the T S C V criterion is defined as T ( M ) = £ ( £ ( M ) - ty2- (3-12) t=2 The T S C V method consists of choosing h and 9 that minimize (3.12). However, wi th certain models for the errors such as autoregressive ones, it is possible to find a closed form, for each h, of the value 9(h) that minimizes T(h, 9) wi th respect to 9. In those cases, one obtains a curve T(h) = T(h, 9(h)) that can be analyzed in order to find the optimal time series cross-validated smoothing parameter, hrscv-This method yields a smoothing parameter that is evidently intended for a good one-step ahead predictor based on the past data. W i t h this estimate, one obtains, not only the predictors, but also estimates of the trend and the covariance function. However, if the mean function (trend) is the main interest of the data analyst, then one would want to have an estimate that uses data on both sides of X j . In that case, Hart (1994) proposed using a smoothing parameter "equivalent" to hrscv to estimate m. If the regression function is estimated using a kernel K, then the corrected smoothing parameter is CK,L • hrscv, where the term CK,L is „ _ \v0(K2)vl(L)\1'5 CK>L-{v0(L*)vUK)f (3"13) and L is the boundary-type kernel that was used in the estimation of m l (x j ) . 42 3.2.5 Plug-in Method Let us now introduce a plug-in method called Correlation Direct Plug-in ( C D P I ) developped by Opsomer (1996). This method is based on the Direct Plug-in method seen in section 2.1.4 but is intended for local polynomial regression when the errors are correlated. Ear ly work on the plug-in method in the context of correlated errors includes A l t m a n (1990). Opsomer (1996) developped this method for the random design case but the results also hold for the nonrandom fixed design case. Opsomer (1996) defined the covariance of the residuals as Cov{ei,Sj) = a2pn(Xi — Xj) where pn is a stationary, symmetric and continuous correlation function which can depend on the sample size. To relate pn to our jn, consider the case with Xi = i/n. Then, 7 n(&) = a2pn(k/n). In order to ensure convergence of the estimator, Opsomer imposed the following set of conditions on pn. 1. pn is differentiable, /n\pn(t — x)\ dt = 0 (1 ) , / npn(t — x)2 dt = o(l) for al l x, 2. 3£ > 0: J\Pn{t)\I(\t/h\>t)dt = o(f \pn(t)\dt), where h = hn ->• 0 and nh —>• oo as n —> oo This means that the dependence is short range, that is, correlation between observations sufficiently far apart can be ignored. Opsomer (1996) showed that the smoothing parameter that minimizes the mean average squared error of the estimate of the regression function is of the form (3.14) 43 where K is a kernel function satisfying assumption 2 of Theorem 1. ICn is called the integrated covariance function and % = E(m^(X)m^(X)). These two quantities are unknown. Therefore one needs to estimate them. The estimation of ICn, when the observations are equally spaced, can be done in one simple step by estimating the spectral density at 0. However, when the observations are not equally spaced, this method can not be directly applied. Opsomer (1996) proposed using a technique based on binning ap-proximations to increase the efficiency of local polynomial regression (see Fan and Marron (1994) and Turlach and Wand (1996)). A binned dataset with ap-proximately the same correlation structure as the original data is constructed and spectral theory is used to estimate ICn. The estimation of ICn also re-quires the separate estimation of the variance parameter cr2. Opsomer (1996) suggests that one uses a generalization of the variance estimator of Opsomer and Ruppert (1998) in order to get a2. A s an estimator of 022, #22 = n~l Y%=\{™>"{%i))2, is proposed wi th rh" the estimate of the the second derivative gotten from a local polynomial es-timator of degree p = 3. Opsomer (1996) showed that the asymptotically optimal smoothing parameter to estimate 622 is of the form (3.15) where i^( r ,3 ) is as defined in (2.16) and C 6 = { 1 if e24 < 0 I I if 024 > 0 Then, one simply needs, in order to get hopt, to follow these steps. 44 1. "Pilot" estimators of 924 and a2 are computed from a piecewise quartic fit, where the number of pieces is chosen by minimizing Mallows Cp; 2. a pilot estimator for ICn is computed and is used to obtain the nonpara-metric estimate ICn as described in section 3 of Opsomer (1996); 3. use #24 and ICn in (3.15) and use the estimated hlpt to find 622', 4. hgpt is obtained by plugging (922 and ICn in equation (3.14). 45 C h a p t e r 4 S imulat ions The main goal of this thesis is to estimate a regression function from data of the form (3.1), where the £j's are known to be time-correlated. In this chapter, we w i l l perform a simulation study where we w i l l use smoothing splines wi th model based penalties along with two of the smoothing parameter selectors described in the previous chapter. The motivation for this work comes from an environmental problem, namely ground-level ozone concentration. Al though we w i l l base our simulations on this problem, the techniques used here can very well be applied in other contexts. 4.1 Description W i t h this simulation study, we wish to show that combining model based pe-nalized smoothing wi th a smoothing parameter selection technique that takes into account correlation w i l l increase the accuracy of the estimation. Since 46 the purpose of this thesis is to apply the proposed methods to environmen-tal data analysis, we based our simulations on models motivated by our data set. We w i l l , in the next chapter, analyse a data set of daily and monthly ozone levels. Such data are assumed to follow a periodic {1, sin, cos} model while having correlated errors. For that reason, we use regression functions that are "similar" to the favoured parametric model ozone levels, namely m(x) = a.\ + a2 • cos cox + a3 • sin cox. We chose to = 8n, wi th x G [0,1], yielding four "cycles" in each data set. We implemented two methods of smoothing parameter selection: "Modif ied Cross-Validation" and "Blockwise Cross-Validation" that were described in sections (3.2.1) and (3.2.3) respec-tively. The optimal smoothing parameter for each of these two methods was obtained by using a Quasi-Newton algorithm to search for the location of MCVmin and BCVmin. In order to evaluate the performance of these meth-ods, we w i l l compare them with four commonly used techniques of analysis. Overall , we w i l l use six methods in our simulation study: L S : nonlinear least squares estimate based on the favoured parametric model assuming independent observations and to = 8n; N L S : nonlinear least squares estimate based on the favoured parametric model assuming independent observations and where to is estimated; T S : time series method assuming m(x) — ct\ + a2 • cos tox + a3 • sin tox where to = 87T is known. We also assume that the errors are correlated and follow a causal A R M A process. A more detailed description of that method w i l l follow shortly. 47 I N D : penalized least squares estimate based on the appropriate penalty L = D3 + UJ2D where co = 8ir is known and A is chosen by ordinary C V assuming independent observations; B C V : penalized least squares estimate based on the appropriate penalty L — D3 + OJ2D where u — 8ir is known and A is chosen by blockwise cross-validation as described in section 3.2.3. The blocks were chosen by splitt ing the simulated data set into 4 differents groups, one group for each cycle; M C V J : penalized least squares estimate based on the appropriate penalty L = D3 + OJ2D where UJ is known and A is chosen by modified cross-validation as described in section 3.2.1. The values of / used wi th M C V / are 1, 2 and 3 (remember that a value of / = 0 corresponds to ordinary C V , that is, I N D ) . Note that it is also possible to estimate the value of u in the last four methods. For penalized least squares used in I N D , B C V and M C V ; , Heckman and Ramsay (1996) describe four different approaches to estimating u when it is assumed to be unknown. Due to l imited time for computation, we only considered estimation of u in N L S . Now we w i l l present in greater detail the pseudo-code for the time series method (TS) that was just mentioned, for estimating 0 : 1 , 0 : 2 and a3 in the parametric model. 1. Perform an ordinary least squares analysis on the data to get the estimate 48 2. Given d ^ , construct the residuals {zf^} = {Yi — a[h^ — a2k^ • cos uxi — - ( * ) i 0:3 • sin LaXi\. 3. Check for serial correlation of {z\ k^}. If correlated, try fitt ing an autore-gressive model of order p to {Z^}, where p is determined by looking at the partial auto-correlation function. Then use the fitted model to * (fc) generate an estimated covariance matrix £ . For example, if an AR (1) model is fitted, then 1-ft where (f>i is the estimate of the autoregressive coefficient and a2 is the estimate of the variance. In the absence of serial correlation, S T O P . 4. Given d ^ and ,^ perform a weighted least squares analysis by mina ( Y — y^cx^)\t{k))-l{Y - XQ<*)) to obtain d ( f c + 1 ) . Here, Y and X are re-spectively the vector of {Yi} and the design matrix corresponding to the parametric model. 5. If | | d ( A : + 1 ) - d ( f c ) | | < 6, where 5 is the desired precision, S T O P . Otherwise, return to step 2 using d ^ f e + 1 ^ instead of d ^ . We used al l six methods of estimation on regression functions that were "similar" to our favoured model. These regression functions were obtained by generating random functions. The procedure used for generating these random functions w i l l be described in the next section. 49 We constructed 500 simulated data sets by generating the random func-tions Hi,..., pi5oo- For each simulated data set, j = 1 , . . . , 500, we generated 100 data points Yij = Hj{xi) + oEij i = l , . . . , 100 (4.1) where X{ = i /100, the e^'s follow an A R ( 1 ) process wi th autoregressive coef-ficient (j) = 0.5 and a is a positive real valued constant that was taken to be 15. (Note that we proceeded to the same simulations with cp = —0.5.) To quantify the results of each estimation method, we w i l l use the fol-lowing summary statistics: PSE: Predictive Squared Error ^ 100 P S E j = Too ^ { Y i j ~ ™ l i X i ) ) 2 j = 1 ; • • • ' 5 0 0 ( 4 , 2 ) where the raj(xi)'s are the values of the estimated regression functions at the time points {x^} ; MSE: Mean Squared Error 100 MSEj = — E K - ( x O - Hjixi))2 3 = 1,..., 500. (4.3) 4.2 Generation of Random Functions In order to verify the efficiency of our methods, we wish to generate random functions that are close to the favoured parametric model. That choice leads us to the following penalty, L = D3 4- UJ2D, where the favoured model is 50 spanned by the functions 1, cos cot, sin cot. Therefore, we want to generate random processes w(-) and /x(-) with L\i = 0 and However, for the case where w(-) is not constant, there does not exist a closed form solution of the differential equation L\x = 0. Thus, replace L/J, = 0 wi th Let us now introduce some useful notation. Let e(-) be a white noise process with standard deviation cre, b(t) = f0* e(s) ds be Brownian motion and B(t) = b(s) ds. Then, one can show that (4.4) and (4.5) are satisfied for w(t) = Lu+b(i) and fi(t) — C [a 0 + D~lcos (cot + B(t))], where D"1 denotes the first anti-derivative operator. C is an arbitrary constant chosen to obtain data with reasonable magnitude. Here, a0 is a normal random variable, independent of B(-), wi th mean zero and standard deviation aa and was chosen in order to avoid always having /i(0) = 0. In order to generate a curve fi, we need to generate an approxima-tion of B(-). We do this by generating 100 independent normal random vari-ables with mean zero and standard deviation ae and we use their partial sums to define b(-). Next, we approximate B(i/100), for i = 1 , . . . , 100, by nu-merical integration of b(-) using the trapezoidal rule wi th grid points j / 100 , j = 1 , . . . , i. Afterwards, we use the trapezoidal rule once more to approximate D~lcos (cot + B{t)). Figure 4.1 shows a sample of 3 of the 500 randomly generated curves. When generating those curves, we used co = 8ir, ae = 50, aa = 0.01, a — 15 E(w(t)) EE CO. (4.4) E(Lfj.) EE 0. (4.5) 51 a -o o -O 20 40 60 80 100 Figure 4.1: Sample of Randomly Generated Curves. The solid line is the true curve. and C = 1000. We chose these values in order to obtain curves that are close to the favoured parametric model while st i l l being different. The solid curve is the true curve and was generated using a0 = 0 and B(t) = 0, that is, the true curve is fx(t) = 1000 • D-lcos(ut) = 1000 • (sin(ut))/u. It is important to notice that since the simulated curves are not exactly parametric, this also means that some of these curves may not be periodic. Furthermore, in the event that a curve is roughly periodic, we are not assured that the period w i l l be exactly 87r. Figure 4.1 demonstrates the fact that not al l the randomly generated curves are periodic. In that figure, three curves, including the true curve, look periodic wi th four cycles while one curve is definitely not periodic and it has only three cycles, of unequal length. 52 4.3 Simulation Results The results of the different methods with regards to the M S E and the P S E for positive and negative correlation are given in Tables 4.1 and 4.4, respectively. Since the distributions of both the mean squared error and the predic-tive squared error are generally long-tailed, we also included summary statis-tics for the natural logarithm of the M S E and P S E . Figure 4.2 shows the distribution of the M S E and log(MSE) for the N L S and T S method when (j) — 0.5. This clearly demonstrates the skewness of the M S E score. Therefore, by taking the natural logarithm of the M S E , we diminish the influence of the skewness, thus showing the real behaviour of each estimating method. The histograms of the M S E and log(MSE) for the other methods are displayed in Appendix A . Al though in our tables we provided the results of the M S E and P S E for each method, we w i l l mainly focus on the M S E . Since the goal of our work is to to find a way of easily and efficiently estimating the regression function when the errors are correlated, we are more concerned about the per-formance of the methods with regards to the M S E . The P S E score, on the other hand, is not very meaningful in the context of this thesis since P S E helps us understand the behaviour of the methods in terms of predictive performance. The results of the P S E w i l l however be somewhat meaningful when we w i l l look at the number of parameters used by each technique. Indeed, an estimate that uses a large number of parameters w i l l try to interpolate the data, and hence have a low P S E score. O n the other hand, an estimate using a small number of parameters should have a large P S E score as it is not predicting 53 N L S T S I B ii III i l l i WmM 5 6 7 log(MSE) J . i t~...j 4 5 6 log(MSE) Figure 4.2: MSE and log(MSE) for NLS (left) and TS (right) when <f> = 0.5. the observations well. In the next two sections, we w i l l analyse the results of the simulation study for positive and negative correlation. We w i l l compare the log(MSE) results of the different methods v ia paired t-tests with a significance level of 5% to insure that the differences are significant. A paired t-test simply means that we apply the usual t-test to the differences of the log(MSE) scores for each pair of methods. 4.3.1 Positive Correlation In terms of the log(MSE) , al l of the smoothing spline methods wi th model based penalty ( IND, B C V and M C V / ) , which are nonparametric methods of 54 Table 4.1: Simulation Mean Results with (j) = 0.5. M e t h o d M S E ( S E ) P S E ( S E ) l o g ( M S E ) ( S E ) l o g ( P S E ) ( S E ) L S N L S T S 401.8 (16.1) 443.4 (20.6) 398.3 (15.8) 654.5 (16.8) 673.6 (21.3) 651.5 (16.6) 5.59 (0.04) 5.35 (0.06) 5.59 (0.04) 6.35 (0.02) 6.28 (0.03) 6.34 (0.02) I N D 168.3 (3.2) 109.2 (2.5) 5.01 (0.02) 4.54 (0.03) B C V M C V i M C V 2 M C V 3 344.4 (16.4) 111.8 (2.4) 103.6 (2.5) 97.8 (2.1) 530.8 (18.9) 172.6 (2.1) 193.2 (2.9) 440.2 (28.3) 5.32 (0.05) 4.60 (0.02) 4.51 (0.02) 4.47 (0.02) 5.98 (0.04) 5.11 (0.01) 5.22 (0.01) 5.61 (0.04) estimation, outperformed the parametric methods (LS, N L S and TS) . Part of the reason why L S , N L S and T S performed so poorly is due to the fact that we generated functions that were not exactly the favoured parametric model. Therefore, the parametric methods had a disadvantage right from the start since the parametric model that they were trying to fit was not exactly what the data were following. Comparing the three parametric methods, we see that N L S does signif-icantly better than both L S and T S . This can be explained by the fact that since it is estimating the period, N L S really has an advantage over the two other parametric methods. Figure 4.3 shows the histogram of the estimated values of u obtained by N L S . In fact, since our simulated regression functions were not exactly the favoured parametric model, we generated regression func-tions that either were far from being periodic, or were roughly periodic, but perhaps with a "period" different from the one of the favoured parametric model, i.e., 87r. Therefore, in the case where the curves were roughly periodic 55 a H — i — 20 —I— 40 —I— 60 Figure 4.3: Estimated values of UJ for (b — 0.5 Figure 4.4: Fits of Parametric Methods on Simulated Curve 6 for (j) = 56 but with a period different from 8TT, N L S was able to adapt more easily to the change in the period than T S and L S , hence performing better in terms of M S E . For curves that were far from periodic, the parametric methods gener-ally performed poorly as they were restricted to a periodic parametric model. Figure 4.4 shows the fits of all parametric methods for one simulated curve. It is obvious that this curve is not periodic and it has only three cycles, of dif-ferent durations. In this case, al l parametric methods performed very poorly at estimating the underlying trend. L S and T S did not perform well since the specified u of 87T forced them to give an estimate of the regression func-tion wi th four cycles and N L S yielded Co = 45.45 which resulted in an estimate wi th approximately seven cycles. For the penalized smoothing spline methods, we did not estimate the period. Indeed, its value was fixed at 87r. However, this w i l l not have an important effect since these methods are nonparametric, which means that they don't totally depend on the favoured parametric model. B y looking closely at Figure 4.2, we see that the distribution of log(MSE) of N L S is clearly bimodal . This bimodality is explained by the two features of the randomly generated curves mentioned above. Indeed, in the case where the curves are roughly periodic, N L S generally estimates the "period" quite accurately and yields a good estimate of the regression function. However, when the curves are far from periodic, N L S does not perform very well , hence yielding a higher M S E score which corresponds to the second mode in the M S E distribution. Since T S is taking the correlation into account, one would think that 57 this method would perform better than both L S and N L S which do not ac-count for the correlation. A s discussed previously, we know that T S does not perform better than N L S . However, T S does perform significantly better than L S . The mean results in Table 4.1 were rounded. Al though T S and L S appear to have the same rounded mean log(MSE) , the unrounded means are signifi-cantly different, by a paired t-test. The comparison between T S and L S is a good way of quantifying the effect of correlation on the estimation techniques. Indeed, the only difference between T S and L S is that T S takes correlation into account. Therefore, the fact that T S performs significantly better than L S indicates that accounting for correlation really helps improve the estimation. When comparing the log(MSE) results, we can see that penalized smooth-ing wi th ordinary cross-validation (IND) did substantially worse than al l the nonparametric methods that took correlation into account except for B C V . It is pretty obvious by looking at the log(MSE) values in Table 4.1 that I N D is really affected by the correlation structure in the regression errors. Figures 4.5-7 show the fits of the different methods on the same simulated curve as in Figure 4.4. In Figure 4.5, we can see that I N D yields an estimate of m that is too "wiggly" , that is, the optimal smoothing parameter obtained by ordinary cross-validation is too small . In comparison, looking at the two other plots in Figures 4.6 and 4.7, we can see that the fits of al l penalized methods are much smoother and, except for B C V (which we w i l l discuss later), they follow the true regression curve quite faithfully. In a majority of the simulated data sets, the smoothing parameters chosen by B C V and M C V j were bigger than 58 0 20 40 60 80 100 Figure 4.5: Fits on Simulated Curve 6 for (b = 0.5 59 Figure 4.6: Fits on Simulated Curve 6 for <p = 0.5 60 61 the one obtained by I N D . This supports the discussion in section 3.1, that, for positive correlation, ordinary cross-validation has a tendancy to interpolate the data thus yielding a smoothing parameter that is too small . The M C V / methods performed significantly best in terms of log(MSE) score. In fact, for M C V / , the log(MSE) is a decreasing function of / since the M C V / methods perform better and better as the value of I increases, hence the better performance of M C V 3 . This is due to the correlation structure as the bigger the value of /, the more M C V / accounts for correlation thus alleviating its effect more efficiently. Indeed, we know that the residuals follow an A R ( 1 ) model where the Cor(ej,e i + j) , for j = 1,2 and 3, is 0.50, 0.25 and 0.125 respectively. It is clear that observations adjacent in time are significantly correlated while observations three time points apart are weakly correlated. This is why M C V 2 and M C V 3 should perform best as they are el iminating more correlation than M C V i . In practice, this is what happened. Indeed, M C V 2 and M C V 3 respectively performed best for 28% and 44% of the simulated data sets, for a total of 72% of the cases. However, when estimating Cor(ej , Ei+3) and the order of the autoregressive model of the residuals, we could not conclude on a data-based rule of selection for / as there was no clear pattern. For example, in simulated data set #22, M C V i performs best even though the estimated autoregressive order is six while in site #10, M C V 3 performs best even though the estimated autoregressive order is one and Cor(ej , £1+3) is estimated as 0.05. A more intuitive way to compare the performance of the different esti-mators is to look at the number of parameters used by each method. Please 62 Table 4.2: Number of Parameters Used, (f> = 0.5. M e t h o d m e a n m e d i a n S . E . I N D 29.87 29.69 0.66 B C V 7.57 3.10 0.36 M C V i 14.79 13.91 0.25 M C V 2 12.62 12.15 0.21 M C V 3 11.50 11.33 0.18 refer back to section 2.2.3 for the definition of the effective number of param-eters used by an estimate. Table 4.2 summarizes the number of parameters used by al l penalized methods. The table does not include the parametric methods since they use a fixed number of parameters. Indeed, L S , N L S and T S always use three, four and three parameters respectively. I N D uses on average significantly more parameters than any of the penalized methods that take correlation into account. In fact, I N D uses more than twice the number of parameters than the other nonparametric methods. This explains why the estimate of m using I N D is usually not as smooth as that of the other meth-ods since I N D tries to interpolate the data. Indeed, I N D has the lowest mean log(PSE) which confirms that I N D tracks the data too closely. One can also notice, by looking at Table 4.2, that the number of parameters used by M C V / is a decreasing function of I. This shows that when the value of I increases, the smoothing parameter chosen by M C V ( increases, thus yielding an estimate of m that is closer to the favoured parametric family. Therefore, if the penalty is chosen correctly by the data analyst, the resulting estimate of m would be smoother and more precise. 63 The B C V method yields significantly the lowest effective number of parameters among al l nonparametric methods. However, as discussed ear-lier, B C V does not perform well in terms of M S E . In fact, in approximately 60% of the simulated data sets, B C V uses less than 3.3 parameters which is roughly equivalent to using an infinite smoothing parameter. Remember that a.smoothing spline with an infinite smoothing parameter lies in the kernel of the penalty, that is, it follows the favoured parametric model. Thus, in our case when B C V uses approximately three parameters, it becomes roughly the same method as L S . But why would B C V use so few parameters? Figure 4.8 4 Figure 4.8: Behaviour of BCV. helps us answer that question. The four vertical lines on the plot represent the boundaries of the four sections, or "cycles", used in the B C V method. B C V 64 Table 4.3: Correlation between Number of Parameters Used, 4> = 0.5. M E T H O D S B C V M C V i M C V 2 M C V 3 I N D -0.054 0.288 0.119 0.040 B C V - -0.046 -0.077 -0.099 M C V i - - 0.565 0.265 M C V 2 - - - 0.477 uses data in the last three sections to predict data in the first section. It uses data in the first, th ird and fourth section to predict data in the second section and so on. Obviously, when the data are far from periodic, that is, the four sections are different from each other (as is clearly the case in Figure 4.8), B C V w i l l oversmooth the data in order to predict the deleted section, using a large smoothing parameter or equivalently few parameters. This behaviour of B C V indicates that it is not doing very well. A s mentioned, B C V is over-smoothing the data when the four sections are quite different, thus leading to an estimate close to the parametric one. Note from Figure 4.6, B C V has fit a periodic model to the non-periodic data, but the amplitude is fairly small , and hence close to an average of the data. O n the other hand, if the data in the four sections are similar, B C V w i l l choose a small smoothing parameter and tend to interpolate the data. One might wonder if there is a correlation between the optimal smooth-ing parameters obtained by the different methods used. To investigate this question, we look at the correlation between the number of parameters used, as displayed in Table 4.3. There is a noticeable correlation between the number of parameters used by M C V / for its different values of I. In fact, for M C V / , 65 the correlation is stronger between two consecutive values of I. This is not surprising, since leaving out, say 2-2 + 1 = 5 points in the C V score should be similar to leaving out 2-3 + 1 = 7 points. Interestingly, the number of parame-Q Z • L • ! •5" I " ' 10 20 30 > . 40 > O 10 20 30 40 > o 5 10 20 30 40 # of Parameters used by BCV 10 20 30 40 # of Parameters used by BCV Figure 4.9: Scaterplots of # of Parameters used by INd, MCVi vs BCV. ters used by B C V is slightly negatively correlated wi th al l the other penalized methods. To explain the negative correlation between B C V and the other penalized methods one has to remember that B C V uses a small number of pa-rameters when the four groups in the data set are different. However, in that case, M C V j and I N D w i l l generally use a bigger number of parameters since the data appear to be further away from the default parametric model. The 66 correlation between B C V and the other methods is not very strong though, as can be seen by looking at Figure 4.9. 4.3.2 Negative Correlation Table 4.4: Simulation Mean Results with <f> = —0.5. M e t h o d M S E ( S E ) P S E ( S E ) l o g ( M S E ) ( S E ) l o g ( P S E ) ( S E ) L S N L S T S 380.8 (16.0) 394.1 (20.9) 379.4 (15.9) 672.5 (16.3) 682.0 (21.0) 671.1 (16.1) 5.48 (0.05) 4.15 (0.07) 5.48 (0.05) 6.39 (0.02) 6.31 (0.03) 6.39 (0.02) I N D 18.0 (0.4) 295.4 (2.5) 2.80 (0.02) 5.67 (0.01) B C V M C V i M C V 2 M C V 3 315.4 (17.1) 17.3 (0.3) 15.0 (0.3) 14.8 (0.3) 597.1 (17.7) 280.2 (2.4) 289.2 (2.5) 287.1 (2.5) 4.80 (0.07) 2.77 (0.02) 2.60 (0.02) 2.59 (0.02) 6.21 (0.03) 5.62 (0.01) 5.65 (0.01) 5.64 (0.01) For the case of negative correlation, once again L S , N L S and T S per-form very poorly compared to the other methods. N L S performs significantly better than T S and L S , and T S and L S cannot be significantly differentiated. The reason for the superior performance of N L S compared to the two other parametric methods can be explained as in the positive correlation case. (See Figure 4.10 for the histogram of the a)'s in the case of negative correlation.) In terms of log(MSE) , al l penalized methods, except for B C V , per-formed very well. Surprisingly, I N D , with log(MSE) st i l l significantly higher than M C V / , is much closer to the methods that take correlation into account than in the positive correlation case. Indeed, on average, the fits of a l l pe-nalized methods but B C V are very close to the true regression function and 67 8 S CM § -I 8 — i — 20 —I— 40 Figure 4.10: Estimated values of co for 4> = —0.5 they are very similar to each other (see Figures 4.11-13, which use the same simulated curve as Figures 4.5-7). Once again, as shown in Figure 4.11, the N L S method did not succeed in estimating the period and yielded an estimate of the regression function that was far from the true one. The M C V / methods for / = 2 and 3 were not significantly different and overall they performed best. The similar performance of M C V 2 and M C V 3 is due to the fact that the correlation structure of the residuals is not as strong as in the previous section. Here again, we are led to the same conclusions as in the previous section when trying to find a selection rule for an optimal value of / . A g a i n , just like in the previous section, B C V is the penalized method that performs the worst in terms of log(MSE) . The same conclusions as in the 68 69 71 Table 4.5: Number of Parameters Used, (j) = —0.5. Method mean median S.E. I N D 10.35 10.79 0.11 B C V 7.13 3.04 0.37 M C V i 16.42 16.12 0.19 M C V 2 11.77 12.27 0.12 M C V 3 12.60 12.91 0.12 positive correlation case arise here. Indeed, B C V s t i l l tends to oversmooth the data when the four groups are different, hence yielding a high MSE score. A s seen in Table 4.4, the effective number of parameters for M C V ; is no longer a decreasing function of I. M C V i uses the most parameters among all smoothing methods. M C V 2 uses on average significantly fewer parame-ters than M C V i , as one might expect, but also, due to the weaker correlation structure, uses significantly fewer parameters than M C V 3 . B C V uses signifi-cantly less parameters than any other nonparametric method as it generally oversmooths the data. The reason why most of the smoothing methods were roughly com-parable is that negative correlation, while affecting the optimal smoothing parameter, does not have the same impact on the estimated regression func-tion as positive correlation. I N D is st i l l influenced by the correlation structure. However, the effect of correlation does not have the same results on the esti-mate given by I N D . Indeed, by looking once again at Table 4.5, one can see that I N D yields on average a significantly smaller number of parameters used, that is, a bigger smoothing parameter, than MCV/ . Recall that I N D is equiv-72 alent to ordinary cross-validation. Thus, the fact that I N D leads to a bigger smoothing parameter when 0 = —0.5 is not surprising, as it agrees with what was mentioned in section 3.1. I N D oversmooths the data and this shows up in its mean log(MSE) as it is higher than the one for M C V / . However, we must not forget that we are now working in the context of smoothing splines wi th model based penalty. Remember section 2.2.2 where we introduced the notion of smoothing splines. We mentioned that the minimizer of (2.32) consists of two distinct parts: a penalty on the lack of fit and a penalty on the roughness of the estimating function. Therefore, by selecting a smoothing parameter bigger than the optimal one, I N D in fact leads to an estimate that is closer to the least squares estimate of the kernel of P. Since the simulated functions are similar to the favoured parametric model of the chosen penalty P , the effect of correlation is causing I N D to yield a smoothing parameter that results in an estimate of the regression function that is closer to the favoured parametric model. Since the influence of correlation on I N D is in some way alleviated by the appropriate choice of the penalty, it is not surprising to obtain high correlation between the number of parameters used by I N D and M C V / as these methods do very comparably. For the same reason as in the case where <b = 0.5, the number of parameters used by B C V is s t i l l negatively correlated wi th the other nonparametric methods. 73 Table 4.6: Correlation between Number of Parameters Used, cb = —0.5. M E T H O D S B C V M C V i M C V 2 M C V 3 I N D -0.275 0.417 0.916 0.831 B C V M C V i M C V 2 --0.082 -0.232 0.402 -0.262 0.494 0.771 4.4 Conclusions W i t h this simulation study, we saw that both positive and negative correlation have an effect on the selection of the smoothing parameter. We saw that M C V / always performs best no matter if the data are positively or negatively correlated. Obviously, the choice of / is of importance as it can help improve on the accuracy of the estimation. Indeed, if the chosen value of I is too small , M C V / w i l l not sufficiently alleviate the effect of the correlation. A n extreme example that illustrates this is the results obtained v ia I N D . Remember that I N D is in fact equivalent to M C V / with / = 0. We saw, in the simulation study, that I N D is affected by the correlation structure and has a tendancy to yield "wiggly" estimates in the case of positive correlation and to oversmooth estimates when the correlation is predominantly negative. O n the other hand, the effect of an excessively large value of / is not clear since we only performed the simulation study with the values of / = 1,2,3. We saw earlier that the effect of the correlation at lag three was st i l l present, therefore a value of I = 3 cannot be considered too large. The simulation study shows that the M S E decreases as the value of I increases and presumably an optimal value of I 74 would be one that takes a good amount of correlation into account, without leading to an estimate that oversmooths or interpolates the data. However, further reseach on the effect of a large / should be done before making any conclusion. We consider two possible "Rules of thumb" that one might use in order to choose the value of The first one simply uses a value of / = p where p is the estimated order of the autoregressive model of the residuals. However, this rule does not seem appropriate in the simulation study. Indeed, in the simulation study with <j> = 0.5, 97% of the estimated autoregressive models are of order three or more, thus one should choose a value of at least I = 3. However, in the simulation study, / = 1 and / = 2 outperform I = 3 in terms of log(MSE) for 56% of the simulated data sets. Alternatively, one might consider choosing I using estimates of Cor ( s j , Si+j). In that value of I = k is chosen, where k is the largest value of j for which Cor(ei , ei+j) < C, wi th C a constant chosen by the data analyst. For the simulation study, this rule works well when using the true correlations but fails when we use the estimated correlations. We have evidence that B C V w i l l perform poorly if the data are not pe-riodic with similar cycles. We also noticed that parametric techniques perform poorly compared to M C V j since they are not flexible enough to track al l the fine features of the data. Another important point that came out of this simulation study is that the effect for both types of correlation is different. In fact, the results indicate 75 that, as long as the penalty is correctly chosen, negative correlation w i l l not have as large an impact on the performance of I N D . It is s t i l l important to remember that, for both positive and negative correlation, M C V / did signif-icantly better than I N D . Hence there is an obvious advantage to accounting for correlation. However, a key point is to notice that unlike positive cor-relation, which highly affects the estimation methods even if the penalty is chosen appropriately, the effect of negative correlation may also be alleviated by correctly choosing the favoured parametric model. 76 C h a p t e r 5 A p p l i c a t i o n to A i r P o l l u t i o n D a t a 5.1 Introduction Ambient ozone pollution in urban areas constitutes a very important envi-ronmental problem. Whi le the decreasing stratospheric ozone layer may lead to increased instances of skin cancer, high ambient ozone intensity has been shown to cause damage to the human respiratory system as well as to agri-cultural crops and trees (Lefohn and Runeckles (1987), L ippmann (1989)). Ground-level ozone has been studied quite intensively recently and many data analyses involving a number of statistical methods have been used. For exam-ple Cox and C h u (1992) and Smith and Huang (1993) studied daily maximum ozone concentration using extreme value theory while Guttorp , Meir ing and Sampson (1994) used a space-time model to analyze ground-level ozone data. 77 In this thesis, we are interested in finding an efficient way to estimate the underlying periodic trend of ground-level ozone. Ozone measurements are correlated across time and between measuring sites (spatially), but here, we w i l l analyze each site individually, so we w i l l only consider correlation across time. In order to test the methods explained in the previous chapters, we w i l l apply them to a data set of daily and monthly ozone levels and w i l l compare the results wi th some well known and commonly used methods of analyses. The pollutant of interest to us, O 3 , is a secondary pollutant. There are two main kinds of air pollutants: primary and secondary pollutants. Sec-ondary pollutants are produced by chemical reactions between pollutants and other constituents while primary pollutants are emitted directly by identifiable sources. O3 is produced by oxidation of primary pollutants. This oxidation is driven by ultra-violet radiation coming from sunlight and comprises chemical reactions that are dependent on the temperature. Because the chemical re-actions proceed while the air is being carried by winds, secondary pollutants are generally more widespread than primary pollutants. Therefore, secondary pollutants are sometimes refered to as regional pollutants. Due to the depen-dence of the governing chemical reaction on temperature, 0 3 levels are high in early afternoon and midsummer and low overnight and in the winter. Also , ozone has a strong yearly pattern which is due to the fact that the creation of ozone is highly related to solar radiation. 78 5.2 Air Pollution Data The data set used for the analysis includes the daily maximum of hourly levels of nitrogen dioxide, ozone (O3), sulphur dioxide and the daily mean of nitrate and sulfate. These measurements were recorded in the southern portion of Ontario during the six year period from January 1 of 1983 to December 31 of 1988. Several monitoring networks in the province were used to collect the data, including the Environment A i r Qual i ty Monitor ing network ( O M E ) , A i r Pol lut ion in Ontario Study ( A P I O S ) , and the Canadian A c i d and Part ic ipation Monitor ing Network ( C A P M O N ) . The reader is referred to Burnett et al (1994) for a complete description of the data set. Furthermore, some climatic data like temperature, humidity, pressure and wind speed were also available. Over al l three monitoring networks, 37 monitoring locations (sites) are available, but not al l sites monitored al l five pollutants (24 sites monitored ozone levels). In the analysis to come, we assume that the variation caused by different networks is negligible, i.e., we consider that the three networks function in an equivalent manner and so we w i l l not include network as a factor in the analysis. However, since only the ozone data are of interest to us, from now on, we w i l l not consider the other four pollutants. 5.2.1 Daily Ozone Data The daily ozone data cover a period of six years. Thus, for every site where the ozone levels were monitored, a complete daily ozone level time series would be made of a total of 2192 observations. However, some of the daily levels time 79 series contain an excessive number of missing values. So, in order to control the quality of the data, al l time series with more than one-third of their values missing are not included in the analysis. We assume that the probability of a time series having more than one-third of missing values is not related to those values. Therefore, we are not causing any bias by adopting such a strategy. Following that stategy, the number of monitoring sites for ozone level was re-duced from 24 to 18. For those 18 sites, approximately eleven percent of the observations are missing. We assume that the missing data include randomly missing data and censored missing data, which occur when the actual air pol -lut ion level is below the detection l imit of a monitor. Al though penalized smoothing splines do not require equispaced data, the smoothing parameter selection technique M C V ; would not be clear wi th missing data points. In-deed, the value of I determines how many data points to leave out. Hence, if data points were missing, it would not be clear if we should leave out 21 + 1 data points or 21 + 1 days (or months). Therefore, in order to keep things simple, we use imputation to fi l l in missing data so that data points and days (months) have the same meaning. There exist many imputations methods like overall mean imputat ion, class mean imputation, hot-deck and cold-deck i m -putation, regression imputat ion, E M algorithm based imputation and multiple imputation. For more details on imputation methods, see Sarndel, Swensson and Wretman (1992) and Li t t le and R u b i n (1987). In this thesis, we used the technique of Sun (1994), which is based on multivariate normal theory. Sun analysed the same data set in a Bayesian interpolation setting. It is assumed 80 that this imputation procedure w i l l not cause serious bias since the percentage of missing data is small . Dai ly ozone levels show a strong serial correlation along with a periodic yearly pattern. However, the variances of the residuals may not be constant, which results in a violation of the assumptions of this thesis. Indeed, by ex-amining the top plot of Figure 5.1, we see that the plot of the residuals of the original data versus the fitted values at the arbitrary site #4 shows an obvious presence of heteroscedasticity. The fitted values were obtained using the B C V method. The lower plot in Figure 5.1 presents the residuals of the square root transformed data at the same site. There is an evident advantage to proceed to this tranformation as it gets r id of most of the heteroscedas-ticity in the residuals. Therefore, we are closer to satisfying the assumptions of homoscedasticity of the residuals. Also , the normal quantile-quantile plots shown in Figure 5.2 clearly demonstrate that the residuals obtained from the transformed data behave more normally than those obtained from the origi-nal data. A log transformation was also tried but it d id not give satisfying results. Therefore, from now on, we w i l l consider the square root transform of the daily O 3 levels. B y examining the plot of the autocorrelation function ( A C F ) of the residuals of the transformed data (see the top plot of Figure 5.3), we can see that there is s t i l l some evidence of seasonality, hence there may be heteroscedasticity in the residuals. However, this seasonality is pretty small and is negligible in the estimation of the regression function. The par-t ia l autocorrelation function ( P A C F ) plot of the residuals (see bottom plot of 81 3 4 5 6 7 8 9 Fitted Values Figure 5.1: Residuals of Original (top plot) and Square Root Transformed (bottom plot) Daily Levels o / 0 3 at Site #4. 82 -2 0 2 Quantiles of Standard Normal Figure 5.2: Normal Quantile-Quantile Plots of Residuals for Original (top plot) and Square Root Transformed (bottom plot) Daily Levels of 03 at Site #1 83 Figure 5.3), shows a significant serial correlation. In fact, the residuals of al l sites follow an A R model wi th orders between 2 and 71 days. The example in Figure 5.3 shows a correlation following an AR(45) model. 5.2.2 Monthly Ozone Data Monthly pollution levels are computed as the mean of observed daily levels i n that month. Hence, since our data set covers a period of six years, a complete monthly ozone level time series would consist of 72 data points. B y using the 18 sites that were kept for the daily data, we proceeded to compute the monthly data, using the original daily data. For those 18 sites, approximately two percent of observations ended up missing. However, since our monthly data have such a small proportion of missing data, the choice of an imputat ion method is in fact not crit ical as it might only cause negligible bias. So we filled in a value missing from the ith site in the jth month with the mean of the observed values of the same site in the jth month of the other years. This method allows us to preserve the periodic property as we know that ozone has a strong seasonal pattern. For more detail on the imputation of the data set, please refer to Sun (1994). For the monthly data, the fitted values used in computing the residuals were also obtained v ia the B C V method. Al though monthly ozone levels have a weaker correlation structure than the daily ozone levels, they have an obvious seasonal pattern. This is shown in figure 5.4 as the A C F plot shows a slight seasonal pattern while the P A C F plot suggests an A R model of order 9 for 84 Ser ies : O z o n e o 0 50 100 150 Lag Ser ies : O z o n e I . I . I 11 I „ , I 1 1 1 ll 1 'pni'liNT iri i i i i iniHiiii i 'ii ' i ' i i i ' i ' ' , , i | n i i | , | i | , i i r , , ' i 0 50 100 150 Lag Figure 5.3: Plot for Autocorrelation and Partial Autocorrelation of Square Root Transformed Daily 0 3 Levels at Site #4-85 Ser ies : O z o n e J I L "l r L a g Ser ies : O z o n e L a g Figure 5.4: Plot for Autocorrelation and Partial Autocorrelation of Monthly 0 3 Levels at Site #4-86 site #4. Overall , P A C F indicates that the order of the A R model for each of the sites varies from one to fifteen months while the A C F conserves the same periodic pattern. The seasonal pattern shown in the A C F plot could be evidence of heteroscedasticity in the residuals. However, as seen in Figure 5.5, the residuals of the monthly 0 3 levels are better distributed than the ones coming from the tranformed data. The same conclusion arises from looking at Figure 5.6, the normal quantile-quantile plots. Therefore, since the residuals of the untransformed data seem more homoscedastic, we w i l l carry out the analysis wi th the untransformed monthly 0 3 levels data. 5.3 Methods of Analysis To perform the analyses of the data set, we used al l six methods employed in the simulation study. We used the same penalty as in Chapter 4. Since ozone is known to have a strong seasonal pattern and a yearly cycle, we respectively assumed co = 27r/365 and co = 2n/12 for the period of the daily and monthly ozone levels. However, wi th the N L S method, the period was estimated. The histogram of the estimated a»'s for the daily and monthly data w i l l be presented shortly. Due to ozone's yearly cycle and to the fact that each time series is over a period of six years, the number of blocks chosen for the B C V method was six where each block represents a calendar year. We therefore assume that ozone levels on a given site are roughly independent from year to year, wi th the years being replicates. The monthly data were analyzed for al l 18 stations using each of the 87 40 50 Fitted Values Figure 5.5: Residuals of Original (top plot) and Square Root Transformed (bottom plot) Monthly Levels of 0 3 at Site #4-88 o co O 0 1 Q u a n t i l e s of S t a n d a r d N o r m a l ~l 1 -1 0 Q u a n t i l e s of S t a n d a r d N o r m a l Figure 5 . 6 : Normal Quantile-Quantile Plots for Original and Square Root Transformed Monthly levels 0 / O 3 at Site #4-89 methods. The daily data were analyzed for al l sites using N L S , L S , I N D and B C V . However, because of the large size of the daily data set (2192 data points per site), using M C V / is very time consuming. Therefore, due to time restrictions, we only applied M C V / to three randomly chosen sites for the values of / = 1 and I = p, where p is the estimated order of the A R model of the residuals of the chosen site. The randomly selected sites were #2, #7 and #17 and the residuals of these sites respectively follow an A R model of order 51, 41 and 37. Furthermore, due to computational restrictions of the memory usage by the Splus software, we could not apply the time series method to the daily ozone level data. Since we are analysing a real data set, the true underlying regression function is unknown, thus we can not compute the M S E summary statistic. Therefore, quantifying the performance of each of the methods, in terms of es-t imating the trend, is slightly more complex than wi th the simulations. How-ever, a good estimation method would be one that would not track the data too closely, hence not using too many parameters. Furthermore, an estimate using too few parameters would not be good either as it would oversmooth the data. The results of the analyses with regards to the P S E and the results concerning the number of parameters used by the different penalized methods are shown in the next two sections. However, the P S E scores are not really of interest for the purpose of this work. Remember that our goal is to estimate the underlying trend of the data. Whi le the P S E score quantifies the prediction 90 peformance of each estimator, it does not say anything about the abil ity of the methods to estimate the regression function. However, in general, a method that yields a small P S E score w i l l be using a large number of parameters as the estimate tries to interpolate the data. Thus, generally a small P S E score would mean an estimate that is too "wiggly" . 5.4 Results 5.4.1 Monthly 0 3 Levels Plots of the data and the fitted curves for sites #2, #7 and #17 are given at the end of this chapter as Figures 5.12-5.20. The plots of the fits for al l remaining sites are shown in Appendix B . Overall , M C V ; appears to be the method that performs the best. For the values of I = 1,2 and 3 considered in this work, there was no definite pattern indicating an optimal /. However, M C V ; alleviates more and more the effect of the correlation as the value of I increases. Therefore, a data set wi th a stronger correlation structure should be analysed with a larger value of I since the correlation structure has more influence on the estimation of the regression function. Therefore, choosing a value of I = p, where p is the estimate of the autoregressive order of the residuals, might be a good choice, that is, it might lead to an estimate that fits the data well. Indeed, we verified that rule with sites #2, #7 and #17, and for those three sites, the order of the autoregressive model is a good indication of the value of I to choose in order to get a good 91 curve. Here, "good" is denned as the fit that subjectively appears to track the data closely enough without interpolation. In the analysis of the monthly data, some values of / led to estimates us-ing very parameters, thus yielding estimates that were really close to the para-metric model. Therefore, due to their similarity with the parametric model, each of these estimates has roughly the same amplitude for al l six yearly cycles. For some sites, these estimates are not fitting the data well since, at certain years, the estimated regression function lies above the highest points of the data. A t the end of this section, we w i l l present detailed analyses of three sites where such a problem arise. B C V generally tracks the data too closely as it is driven by the s imi-larity in the yearly cycles of each site. I N D , as concluded in the simulation study, is highly influenced by the correlation structure and usually interpolates the data. The parametric methods, since they are restricted to a parametric family of models, are not flexible enough and they are unable to track the finer features of the data. Usually, L S , T S and N L S perform very similarly. However, in the case where the estimate of u> is not close to 27r/12, N L S differs quite substantially from the other two parametric methods. Figure B.7 in the Appendix shows such an example at site #4. Indeed, the estimated period at that site is 0.3928, hence yielding an estimate with five cycles. Because the period was not well estimated, the estimate given by N L S is very different than that of L S or T S . Now let us give summary information and a more detailed discussion of 92 P S E and the number of parameters used. First of a l l , a paired t-test similar to the one used in Chapter 4 showed that N L S had a significantly higher mean P S E score than al l the other methods as it is, on average, oversmoothing the data. We can also notice that the N L S Table 5.1: Monthly Levels of 03 Mean Results. Method PSE (SE) log(PSE) (SE) NLS LS TS 86.4 (21.3) 33.3 (3.2) 34.6 (3.1) 3.93 (0.25) 3.43 (0.10) 3.48 (0.09) IND 8.2 (1.3) 1.81 (0.23) B C V M C V i M C V 2 M C V 3 11.1 (2.4) 24.1 (2.7) 25.6 (2.3) 25.1 (1.8) 2.16 (0.16) 3.05 (0.14) 3.17 (0.10) 3.18 (0.07) method is far more variable than any of the other methods used. The reason for such a poor performance and high variability can be explained by the fact that for six sites out of eighteen, N L S did not estimate to very well. Indeed, Figure 5.7 shows clearly that for those six sites, the ratio of the estimated values of to and 27r/12 was around 0.8. This had the effect that N L S yielded, for those six sites, an estimate of the regression function with five cycles instead of six, thus the poor performance of N L S and its high variablility. Table 5.2 gives the average effective number of parameters used by the different penalized methods. This table does not include the parametric methods as these use a fixed number of parameters each time. T S and L S each use three parameters while N L S uses four parameters as it is estimating the 93 0.75 0.80 0.85 0.90 0.95 Figure 5.7: Ratio of to and co = 2n/12 for Monthly Levels of 0 3 . period co. The effective number of parameters used is very useful in quantifying the performance of the methods. Indeed, recall that a smooth estimate, close to the true regression function, should use a small number of parameters but not too small , while an estimate that would be too wiggly, hence trying to interpolate the data, would use a larger number of parameters. First of a l l , we can see that I N D and B C V use on average around 4 times more parameters than any other penalized method. This is confirmed by the average P S E of these two methods as they are significantly lower than any other methods used. For I N D , the fact that it uses too many parameters is explained by the effect of the correlation in the data. Indeed, the correlation causes the estimate given by I N D to quite often interpolate the data, hence yielding a low P S E score and equivalently, a high number of parameters used. 94 Table 5.2: Number of Parameters Used for Monthly Levels of O3. M e t h o d m e a n m e d i a n S . E . I N D 31.1 31.3 2.6 B C V 28.47 30.6 2.54 M C V i 7.9 5.8 1.6 M C V 2 5.8 4.4 0.7 M C V 3 6.3 5.2 0.8 The reason why B C V is using a high number of parameters is due to the nature of the monthly data. Indeed, the yearly cycles of sixteen out of the eighteen site are very regular, hence the six groups used in B C V , for a given site, are very similar. This has the effect that B C V , for those sixteen sites, uses a small smoothing parameter which is equivalent to a high number of parameters. For the two remaining sites #7 and #17, B C V uses a large smoothing parameter which yields an estimate using approximately three pa-rameters. Figure 5.8 shows, for sites #2, #7 and #17, the six sections that B C V uses to minimize its criterion. One can see that for site #2, the six sections appear to be very similar. Indeed, the beginning and the end of each section are very much at the same heights while the shape of the function in each section is pretty similar, that is, it is increasing in the first half of each section and then decreasing for the last second half. This explains why B C V uses, for site #2, 31.2 parameters and tends to interpolate the data. For sites #7 and #17, the reason why B C V uses so few parameters can be explained by looking at the top right hand-side plot and the lower plot of Figure 5.8. Indeed, for site #7, one can clearly see that within each of the six sections, 95 the function starts and ends at different heights and does not behave similarly within each section. Therefore, B C V uses a large smoothing parameter to ob-tain an "average" curve that best fits the six different curves. The data from site #17 are similar but not as extreme. The number of parameters used by B C V is more variable than the M C V ; methods. This is so because for most sites, B C V yields an estimate using approximately 30 parameters while for the other two sites, it uses only three parameters. Comparing the different values of I, M C V i uses significantly more pa-rameters than both M C V 2 and M C V 3 while the last two do not significantly differ. This agrees with what was seen in the simulation study, that is, the average number of parameters used by M C V i is larger than the averages used by M C V 2 and M C V 3 . Table 5.3 shows, for al l 18 sites, the number of parameters used by each method in addition to the estimated autoregressive order of the residuals cal-culated from the fitted values obtained with B C V . The orders were estimated using the ac/function in Splus. A s we can see, both I N D and B C V use a large number of parameters. The number of parameters used by M C V i is quite variable as it is more sensitive to the correlation structure. Indeed, since it is using a value of / = 1, it does not eliminate al l the effect of correlation. Also , it appears that the number of parameters used by M C V i tends to be higher than M C V 2 . Table 5.4 displays the correlation between the number of parameters used by the 5 penalized methods. The relationship is very similar to what 96 0 20 40 60 0 20 40 60 Site #2 Site #7 0 20 40 60 Site #17 Figure 5.8: Behaviour of BCV for Site #2 (top left), Site #7 (top right) and Site #17 (bottom). 97 Table 5.3: Number of Parameters used by each Method for the Monthly Data. ( a * represents a site that is studied in Chapter 5) S i t e A R o r d e r I N D B C V M C V i M C V 2 M C V 3 1 14 35.5 35.7 3.4 3.4 4.1 *2 12 29.8 31.1 10.9 10.6 10.1 3 14 44.0 27.4 4.1 4.9 7.9 4 15 32.5 35.7 9.3 9.0 12.2 5 13 59.2 30.1 31.0 3.9 4.2 6 14 28.0 29.4 3.9 3.5 3.4 *7 1 31.1 3.1 12.9 8.4 6.3 8 14 12.4 41.8 3.5 3.2 3.1 9 15 31.6 31.1 6.4 6.2 7.8 10 13 36.2 27.4 3.4 3.3 3.4 11 9 28.7 16.8 5.2 5.2 7.9 12 14 32.9 41.8 3.5 3.7 7.5 13 14 28.0 30.1 6.9 6.7 7.7 14 14 32.5 29.4 3.4 3.4 3.5 15 15 40.2 32.3 8.3 3.5 3.5 16 15 29.0 33.2 8.9 9.3 12.0 *17 1 14.8 3.0 13.8 12.6 3.1 18 13 13.9 33.2 3.9 3.7 3.1 was observed in the simulation study with the exception of the negative cor-relation between I N D and M C V 2 which is very different from what was seen in the simulation study. However, if we examine closely this relationship (see Figure 5.9), we see that at two sites, namely sites #5 and #17 which are represented by an asterix in Figure 5.9, I N D and M C V 2 respectively use an exceptionally high number of parameters. In fact, if we remove these two sites, the correlation goes from —0.268 to a positive value of 0.0468. Let us now look closely at three specific sites and discuss the perfo-mances of the different estimation methods. We w i l l use sites # 2,7 and 17 as 98 3 0 4 0 N u m b e r o f P a r a m e t e r s u s e d b y I N D 6 0 Figure 5.9: Relationship between the Number of Parameters Used by IND anc MCV2. 99 Table 5.4: Correlation Between Number of Parameters Used for Monthly Levels of03 . M E T H O D S B C V M C V i M C V 2 M C V 3 I N D 0.111 0.497 -0.268 0.105 B C V M C V i M C V 2 --0.290 -0.521 0.313 0.128 0.062 0.577 we want to look at the same sites that w i l l be analysed more intensively for the daily 0 3 levels data. Site #2 Figure 5.12 displays the fit of the three parametric methods. Whi le the three fits are very comparable, we can see that L S and N L S are practically identical. This is explained by the fact that the ratio of the estimated value of co obtained by N L S and 27r/12 is 0.997. Therefore, N L S and L S are basically the same method. However, in Figure 5.12 we see that the T S estimate has a slightly smaller amplitude than N L S and L S . That difference can be accounted for by the fact that T S takes the correlation into account while the two others methods do not. Figure 5.13 shows the fits of the penalized methods that take correlation into account. We see that the M C V / fits are vir tual ly identical for / = 1,2,3, and yield estimates with effective numbers of parameters of 10.9, 10.6 and 10.1 respectively. Since the A R model that fits the residuals has an order of 12 months, and in order to verify the rule of selection for /, we also show the 100 estimate of M C V i 2 . This fit is different from the other M C V ; fits as it uses 20.9 parameters and it appears to fit the data better. Indeed, when looking at Figure 5.13, we see that at the peak of the fourth year (around the 4 2 n d month), al l the M C V ; fits but M C V i 2 exceed the highest data point. Generally, M C V i 2 tracks the data more accurately while using a reasonable number of parameters. The other method that takes correlation into account, B C V , uses 31.2 parameters and tends to interpolate the data. This result is due to the fact that the cycles of the data at site #2 are very similar, hence B C V uses a small smoothing parameter to minimize its criterion. See Figure 5.8 for a representation of the cycles used by B C V . The most interesting plot is Figure 5.14 which shows the fits of I N D , N L S and M C V i 2 . There, it becomes clear that the parametric methods are handicapped compared to the penalized methods. Indeed, N L S has to keep the same amplitude for each cycle (year) and therefore does not yield a good estimate when the peaks and valleys are not constant from year to year. O n the other hand, I N D and M C V i 2 yield a comparable fit. However, we have more confidence in the fit of M C V 1 2 since it's taken correlation into account. Furthermore, M C V 1 2 yields an estimate that interpolates the data less than I N D . Site #7 Figure 5.15 shows the plots of the parametric methods at site #7 and leads us to the same conclusions as for site #2. A t site #7, the ratio of the estimated 101 co and 2n/12 is 1.006 which is once again very close to one. Therefore, here too, L S and N L S are basically the same method. Figure 5.16, which shows the fits of the penalized methods that take correlation into account, is a little bit more interesting than the correspoding figure for site #2. Indeed, this time, as it is clearly shown in Figure 5.16, the fits of the M C V ; methods are noticeably different. B C V used 3.01 parameters while M C V / , / = 1,2,3, use respectively 12.9, 8.4 and 6.3 parameters. This can be explained by the fact that the correlation for the residuals of site #7 follow an A R model of order one wi th an estimated autocorrelation coefficient of 0.63. This setting is very similar to our simulation study when 0 = 0.5, and we do obtain very similar results indeed. To see the behaviour of B C V for site #7, please refer back to Figure 5.8. The best fit among the M C V / methods appears to be M C V i as it is the only fit whose height does not exceed the data points at months 20 and 42. Indeed, it seems that M C V 2 and M C V 3 oversmooth the data. This agrees wi th the decision rule that leads to the choice oil = p. Indeed, since the correlation structure is of short range, that is A R ( 1 ) , M C V i takes into account a good amount of correlation. Figure 5.17 shows once again that M C V i performs very well, while I N D st i l l interpolates the data too much and T S is unable to discern peaks of different amplitudes due to its parametric nature. 102 S i t e #17 In Figure 5.18, we see once more that the parametric methods perform simi-larly. The ratio of the estimated u and 27r/12 is 1.003, which means that the period estimated by N L S was very close to 2n/12. In Figure 5.19, we can see that M C V 3 and B C V perform very s imi-larly. In fact, M C V 3 uses approximately 3.14 parameters while B C V uses 3.01 parameters. This means that those two methods behave very much like the parametric methods of Figure 5.18. B C V uses so few parameters since, as shown in Figure 5.8, the six cycles of the data are pretty different. The two remaining methods in Figure 5.19, M C V i and M C V 2 , use respectively 13.8 and 12.6 parameters and produce much better fits. The correlation for site #17 follows an A R ( 1 ) , hence this can explain why M C V 3 oversmooths the data while M C V i and M C V 2 seem to do much better. In Figure 5.20, we can see that I N D and M C V i are very similar. The fact that I N D does not try to interpolate the data is somehow surprising. However, this can be explained by the fact that the residuals at site #17, as mentioned above, are not highly correlated. Therefore, the effect of the correlation at that site is not very strong, hence the similarity in the estimates obtained by I N D and M C V i . 5.4.2 Daily 0 3 Levels Plots of the data and the fitted curves for sites #2, #7 and #17 are given at the end of this chapter as Figures 5.21-5.29. The plots of the fits for al l 103 remaining sites are shown in Appendix C . Table 5.5: Square Root Transformed Daily Levels of O 3 Mean Results. M e t h o d P S E ( S E ) l o g ( P S E ) ( S E ) L S N L S 2581.0 (89.5) 2568.1 (89.2) 7.85 (0.04) 7.84 (0.04) I N D 683.5 (40.1) 6.50 (0.06) B C V M C V i M C V p 2330.0 (83.4) 1996.3 (10.1) 2125.2 (22.3) 7.74 (0.04) 7.60 (0.01) 7.66 (0.01) Overall , it appears that using M C V p , wi th p the estimated order of the A R model of the residuals, results in smooth estimates that track the features of the data pretty efficiently and better than the other methods. Since the correlation structure is pretty strong, M C V i does not alleviate the correlation effect sufficiently, hence yielding estimates that are not smooth enough. B C V , as concluded in the simulation study and in the monthly data analysis, is too dependent on the cycles in the data and therefore is pretty unreliable. Final ly, I N D , which is highly influenced by the strong correlation structure in the residuals of the daily data, performs very poorly as it interpolates the data. The results of the daily data analysis with regards to the P S E are shown in Table 5.5 while the results concerning the number of parameters used by B C V , I N D , M C V i and M C V p are shown in Table 5.6. Recall that the summary statistics for M C V i and M C V p are based on only three sites: #2, #7 and #17. First of a l l , we can see that N L S and L S do not significantly perform 104 Table 5.6: Number of Parameters Used for Square Root Transformed Daily Levels of O 3 . M e t h o d m e a n m e d i a n S . E . I N D 884.6 875.1 19.1 B C V 29.24 29.34 3.60 M C V i 71.72 72.10 2.61 M C V p 26.37 26.10 3.18 differently. This can be explained by the fact that, as shown in Figure 5.10, the estimation of u> by N L S gave a)'s very close to the value used for L S , that is, UJ = 27r/365. Remember that in the case where CJ = 27r/365, L S and N L S w i l l yield exactly the same estimate as they are the same method. So since the ratio is close to one, the methods use a similar period and therefore perform quite comparably. A good way of assessing the performance of the penalized methods, I N D and B C V , is to look at the effective number of parameters that these methods use. I N D uses significantly more parameters than B C V . This is explained by the fact that I N D highly interpolates the data as it is strongly influenced by the correlation structure. Indeed, looking at the effective number of parame-ters used by I N D in Table 5.6, we see that, on average, I N D uses more than 880 parameters compared to slightly more than 29 for B C V . B C V uses sur-prisingly very few parameters. Remember that a daily ozone level time series has 2192 data points while each monthly time series only has 72 data points. Nevertheless, for the daily data, B C V uses approximately the same number of parameters as for the monthly data. This can be explained by the fact 105 0.995 1.000 1.005 1.010 1.015 1.020 Figure 5.10: Ratio Co and to = 27r/365 for Square Root Transformed Daily Levels of O 3 that with the monthly data, B C V tends to interpolate the data. O n the other hand, in the daily data, the yearly cycles in a given site are generally very different from each other as they have lots of noise that must be smoothed for prediction. This causes B C V to use a large smoothing parameter to minimize the B C V criterion, hence using very few parameters. Table 5.7 shows, for al l the 18 sites of the daily data, the number of parameters used by each method in addition to the estimated autoregressive order of the residuals. Let us now look more in depth at the fits of the differents methods when applied to the three randomly chosen sites. Figure 5.11 shows the fit of the I N D method at site #2. Since I N D performed similarly at al l sites, we w i l l 106 Table 5.7: Number of Parameters used for the Daily Data, ( a * represents a site that was studied in Chapter 5) S i t e A R o r d e r I N D B C V M C V i M C V p 1 58 801.9 35.9 - -*2 51 859.2 29.3 76.0 32.0 3 2 771.0 27.8 - -4 45 960.5 29.3 - -5 33 890.0 29.3 - -6 26 817.9 3.0 - -*7 41 894.6 67.9 67.1 21.0 8 30 903.7 29.4 - -9 32 964.2 24.4 - -10 71 802.3 35.9 - -11 71 783.7 41.6 - -12 34 840.9 29.3 - -13 71 855.5 35.9 - -14 35 961.2 29.4 - -15 60 1093.6 35.9 - -16 37 866.9 3.0 - -• 17 37 971.8 35.9 72.0 26.0 18 13 883.2 3.0 - -not show this method in the next figures since it only makes plots difficult to analyse. S i t e #2 For this site, remember that the correlation structure follows an AR(51) . Therefore, we computed the estimate of m v ia the M C V ; method using I = {1,51}. Figure 5.21 shows that N L S and L S fit the periodic portion of the data very well but do not account for the increase in the amplitude of the last two O 365 730 1095 1460 1825 2190 D a y s Figure 5.11: Fit of the IND Method for the Square Root Transformed Daily Levels of 03 at Site #2. peaks. N L S is clearly very similar to L S and this is supported by the fact that the ratio of the estimate period and 27r/365 is 0.9955. Since L S and N L S are restricted to a parametric model in the family of {sin, cos} functions, they cannot produce estimates with varying amplitudes. Therefore, they do not have the capacity to follow the peaks and the valleys of the data closely enough. In Figure 5.22, we can see that M C V i is pretty "wiggly"as it uses 76.0 parameters. O n the other hand, MCV51 and B C V behave similarly as the estimates use 32.0 and 29.3 parameters respectively. B o t h the MCV51 and B C V estimates are much smoother than the estimate using M C V i and they track the features of the data pretty well. MCV51 is smooth and uses fewer 108 parameters since it really alleviates the effect of correlation by using a value of I = 51. B C V uses few parameters since the huge amount of noise requires a lot of smoothing in order to obtain a good prediction. Comparing MCV51 and N L S (Figure 5.23) shows that M C V 5 1 , while being a smooth function, has the capacity to follow more closely the fine features of the data like fluctuations in the peaks and the valleys. This is especially obvious by looking at the peaks located approximately at days 915, 1640 and 2000. S i t e #7 Here, the correlation structure follows an AR(41) . Therefore, we computed the estimate of m v ia the M C V / method using I = {1,41}. A s seen in the others sites, L S and N L S present almost identical fits (see Figure 5.24). This is confirmed by the ratio of the estimated period and 2?r/365 which is 1.002. Figure 5.25 shows the fits of B C V , M C V x and MCV41 which use 67.9, 67.1 and 21.0 parameters, respectively. This explains why MCV41 is a lot smoother than the two other estimates. For reasons similar to those mentioned for site #2, MCV41 succeeds much better in diminishing the effect of the correlation on the estimate of the regression function. B C V is not as smooth since the six cycles for site #7 are more similar than in the previously analysed site. In Figure 5.26 we present the comparison between the fits of MCV41 and N L S . Whi le the two fits are pretty similar, once again, MCV41 offers more 109 flexibility and allows its estimate to follow the fluctuations in the amplitude of the peaks. S i t e #17 For site #17, the correlation structure follows an AR(37) so we computed the estimate of m v ia the M C V / method using / = {1, 37}. In Figure 5.27, we see that N L S and L S st i l l perform almost exactly the same. Indeed, the ratio of a) and 27r/365 is 1.007. In Figure 5.28, we see that, since M C V 3 7 uses 26.0 parameters, it is far smoother than M C V i which uses 72.0 parameters. B C V , using 35.9 parameters, is much smoother than M C V i but slightly more "bumpy" than MCV37. B y looking at Figure 5.29, we see that once again, even if N L S follows the main features of the data pretty well, it lacks flexibility compared to MCV37 in tracking the finer features of the data. 5.4 .3 C o n c l u s i o n s Overall , the method that performs best, both in the monthly and daily analyses is M C V / . For the three sites studied in both the monthly and the daily data, M C V p provided the best estimate of trend, wi th p being the estimate of the order of the autoregressive model of the residuals. Indeed, by using I = p, the estimate takes into account a good amount of correlation thus yielding a smooth estimate that st i l l tracks the finer features of the data very well. We can therefore easily and accurately estimate the ozone level trend. B C V 110 performs better in the daily data analysis, mainly due to the large amount of noise in that data that results in a smoother estimate. If the cycles are clearly defined, each with a lot of data points and a high degree of noise, B C V performs well but M C V p is st i l l preferred. Otherwise, as with the monthly data and the simulation study, B C V is pretty unreliable. I N D , as expected, performs very poorly for both the monthly and the daily data. Only when the correlation is negligeable does I N D yield acceptable results. The parametric methods follow the periodic nature of the data quite faithfully but lack flexibility to track some changes in the amplitudes of the peaks and valleys. In conclusion, the analysis of the ozone level data confirms what was seen in the simulation study, that is, that accounting for correlation v ia using M C V ; allows us to alleviate the effect of correlation while getting a very flexible estimate. 111 LS — TS — • NLS F\ A A A A A* /; V V -V W \ i i i i 1 1— 12 24 36 48 60 72 Months Figure 5.12: Fits for Monthly Levels of 0 3 Site #2. 112 MCV1 — MCV12 — MCV2 BCV — MCV3 Figure 5.13: Fits for Monthly Levels of 03 at Site #2. 113 IND — NLS — • MCV12 f it ii\ Vj «i\ It. :\m:'i \ ti i'l f ' . \ / .1 1\ \\ V / »-A il U il \ i W \ HI I;' If/ I:/ i i ' V I I I I 1 1 [— 0 12 24 36 48 60 72 Months Figure 5.14: Fits for Monthly Levels o / 0 3 at Site #2. 114 Figure 5.15: Fits for Monthly Levels 0 / O 3 at Site #7. 115 MCV1 — MCV3 — MCV2 — BCV Figure 5.17: Fits for Monthly Levels of 0 3 at Site #7. 117 Figure 5.18: Fits for Monthly Levels of 03 at Site #17. 118 MCV1 - - • MCV3 MCV2 — BCV H 1 1 1 1 1 r -0 12 24 36 48 60 72 Months Figure 5.19: Fits for Monthly Levels of 0 3 at Site #17. 119 IND - - • NLS MCV1 A i H I ii J , I /• i n \ i \\ L' ' ii '•' ^ f 1 '/ V J / ' I J , / « 11 l • i ? / / ii PI 11 ii11 i I I i'\ A , ' \ A //\\ // 1 / 4 , i t v i I U ' , ' \ I X F*. 1 ! M ' / i il ^ 1 I] \ [ it 1 i / 1 * ' ' •< *\ y \ I1* \ V V 14 \ I ,i 12 24 36 48 60 72 Months Figure 5.20: Fits for Monthly Levels of 0 3 at Sfte # i 7 . 120 121 122 124 o —I 1 1 1 1 1 -I— 0 365 730 1095 1460 1825 2190 Days Figure 5.25: Fits for Square Root Transformed Daily Levels o / 0 3 at Site #7. 125 Figure 5.26: Fits for Square Root Transformed Daily Levels of 0 3 at Site #7. 126 / H I I I ; 1 1 1 0 365 730 1095 1460 1825 2190 Days Figure 5.27: Fits for Square Root Transformed Daily Levels of 03 at Site #17. 127 I I I I : 1 1 1— 0 365 730 1095 1460 1825 2190 Days Figure 5.28: Fits for Square Root Transformed Daily Levels of 0$ at Site #17. 128 1 1 1 1 1 1 1— 0 365 730 1095 1460 1825 2190 Days : Figure 5.29: Fits for Square Root Transformed Daily Levels 0 / O 3 at Site #17. 12$ C h a p t e r 6 C o n c l u s i o n a n d D i s c u s s i o n The motivation of this work comes from an environmental problem. Indeed, estimating the underlying trend in air pollution data is complex as one has to consider the serial correlation of the time series while keeping in mind the usual seasonal behaviour of such data. Usual time series methods appear to take correlation into account but lack the flexibility of nonparametric methods in terms of tracking finer features of the data. In Chapter 3, we described five smoothing parameter selection tech-i niques used in nonparametric kernel regression which, when applied in situa-tions where the regression errors are correlated, are designed to perform more effectively than smoothing parameter selection techniques used in the case of independent errors. We modified two of these techniques, namely the "mod-ified cross-validation" ( M C V / ) and the "blockwise cross-validation" ( B C V ) methods, and used them in the context of smoothing splines wi th model based penalty. 130 The performance of these two smoothing parameter selection techniques was investigated, in Chapter 4, via a simulation study. Using simulated data sets to which correlated errors were added and for which the underlying func-tions were known, we could evaluate the extent to which the effect of the cor-relation was alleviated, that is, evaluate the accuracy with which the function was estimated in the presence of correlated errors. Our two techniques were compared to ordinary cross-validation and to three commonly used parametric methods. Overall the performance of MCV; in the simulations is superior to any other methods that we considered. Furthermore, for best results, an optimal value of I should be chosen. Indeed, an appropriate choice of / would lead to a smoothing parameter that efficiently alleviates the effect of correlation and yields an estimate of the regression function that is more accurate. On the other hand, if the value of / is chosen to be too small compared to the optimal one, then the obtained smoothing parameter would, for positively correlated errors, tend to yield an estimate that is more "wiggly", hence leading to in-terpolation. Conversely, the use of an excessively large smoothing parameter has not been studied in depth, thus we can not assess its effect in this work. We proposed two techniques to select I: the first one uses the order of the es-timated autoregressive model and the second one considers the magnitude of the estimates of Cor(t?j, Ei+j). However, to properly assess their performance, further work needs to be done. The BCV method did not perform very well. Indeed, by the way it 131 minimizes its criterion, B C V is highly influenced by the amount of noise in the data and by its periodicity. Also , one has to correctly choose the number of cycles for B C V . The choice of these cycles is usually made according to the seasonality of the observed natural process. However, the regularity of these cycles plays an important role in the performance of B C V . Indeed, in the case where the data are periodic with similar cycles, B C V w i l l select an excessively small smoothing parameter, thus yielding an estimate of the regression func-tion that is interpolating the data. Conversely, if the data are not periodic, B C V w i l l have a tendancy to oversmooth the data as it selects a smoothing parameter that is too large. In Chapter 5, we applied M C V / and B C V to daily and monthly ground level ozone concentration. Once again, M C V / performed well compared to the other methods of analysis. B C V , which was highly affected by the periodicity of the monthly data and by the high level of noise in the daily data, d id not yield satisfying results. For a few selected sites, we considered M C V p where p is the estimate of the autoregressive order of the residuals. For both data sets, the selection rule of I = p worked well. Indeed, the estimate obtained wi th M C V p appears, for al l selected sites, to give the best fit among al l methods used. A s usual, there is further work to be done. The smoothing parameter selection techniques were considered in the context of no covariate. However, the importance of covariates such as temperature, humidity, season, etc., is undeniable. Therefore, modifying our techniques to apply them to regression of 132 higher dimensional data might increase the accuracy of the regression function estimation. Furthermore, studying the effect of different error structures, such as those that are heteroscedastic and otherwise nonstationary, could be helpful to improve the understanding of the effect of correlation. Final ly, and as mentioned before, further investigation on the choice of I in M C V / is needed. We believe that the range and the magnitude of the correlation should be used to select an optimal value of /. 133 B i b l i o g r a p h y A l t m a n , N.S . (1990), "Kernel Smoothing of D a t a with Correlated Errors" , J. Amer. Statist. Assoc., 85, 749-759. A l t m a n , N.S . (1994), "Krige , Smooth, B o t h or Neither?" , Technical Report, Biometrics U n i t , Cornell University, Ithaca, New York. Breiman, L . , Friedman, J . H . , Olshen, R . A . and Stone, C . J . (1993), "CART: Classification and Regression Trees", Wadsworth, Belmont. Brockwell , P . J . and Davis, R . A . (1987), Time Series: Theory and Methods, Springer, New York. Burnett , R . T . , Dales, R . E . , Raizenne, M . E . , Krewski , D . , Summers, P . W . , Roberts, G . R . , Raad-Yound, M . , Dann, T . and Brook, J . (1994), "Effects of Low Ambient Levels of Ozone and Sulfates on the Frequency of Respiratory Admissions to Ontario Hospitals" , Environmental Research, 65, 172-194. Carrol , R . J . , Chen, R., George, E . I , L i , T . H . , Newton, H . J . , Schmiediche, H . and Wang, N . (1997), "Ozone Exposure and Population Density in Harris County, Texas", J. Amer. Statist. Assoc., 92, 392-415. 134 Cheng, M . Y . , Fan, J . and Marron , J.S. (1993), " O n Automat ic Boundary Corrections", Ann. Statist, 25, 1691-1708. C h i u , S.-T. (1989), "Bandwidth Selection for Kernel Est imation wi th Corre-lated Noise" , Statist. Probab. Lett, 8, 347-354. C h u , B . - K . and Marron , J.S. (1991), "Comparison of Two Bandwidth Selec-tors with Dependent Errors" , Ann. Statist, 19, 1906-1918. C h u , B . - K . and Marron , J.S. (1991), "Choosing a Kernel Regression Est ima-tor" , Statis. Sci., 6, 404-436. Cleveland, W . S . (1979), "Robust Local ly Weighted Regression and Smooth-ing Scatterplots", J. Amer. Statist. Assoc., 74, 829-836. Coddington, E . A . (1961), An Introduction to Ordinary Differential Equa-tions, New-Jersey, Prentice-Hall . Cox, W . M . and C h u , S - H . (1992), "Meteorology-Adjusted Ozone Trends in Urban Areas: A Probabil i ty Approach" , Atmospheric Environment., 27B, 425-434. Craven, P. and Wahba, G . (1979), "Smoothing Noisy D a t a with Spline Func-tions. Est imat ing the Correct Degree of Smoothing by the Method of Gener-alized Cross-validation", Numer. Math., 31, 377-403. Davis, J . M . , Sacks, J . , Saltzman, N . , Smith, R . L . and Styer, P. (1996), " A i r -borne Particulate Matter and Dai ly Morta l i ty in Birmingham, A l a b a m a " , Technical Report, 55, National Institute of Statistical Science. 135 Dawid , A . P . (1984), "Statistical Theory: the Prequential Approach (with discussion)", J. Royal Statist. Soc. A, 147, 278-292. Diggle, P . J . and Hutchinson, M . F . (1989), " O n Spline Smoothing W i t h A u -tocorrelated Errors" , Austral. J. Statist., 31, 166-182. Eubank, R . L . (1988), Spline Smoothing and Nonparametric Regression, M a r -cel Drekker, New York. Fan, J . (1992), "Design-adaptive Nonparametric Regression", J. Amer. Statist. Assoc., 87, 998-1004. Fan, J . (1993), "Local Linear Regression Smoothers and their M i n i m a x Effi-ciency", Ann. Statist., 21, 196-216. Fan, J . , and Gijbels, I. (1992), "Variable Bandwidth and Local Linear Re-gression Smoothers", Ann. Statist, 20 , 2008-2036. Fan, J . and Gijbels, I. (1994), "Data-driven Bandwidth Selection in Local Polynomial F i t t ing : Variable Bandwidth and Spatial A d a p t a t i o n " , J. Royal Statist. Soc. B, 57, 371-394. Fan, J . and Gijbels, I. (1996) Local Polynomial Modelling and Rs Applica-tions, Chapman and H a l l , New York. Fan, J . and Marron , J.S. (1994), "Fast Implementation of Nonparametric Curve Est imators" , J. of Computational and Graphical Statist., 3, 35-56. Gasser, T . , Kneip , A . and Kohler , W . (1991), " A Flexible and Fast Method for Automat ic Smoothing", J. Royal Statist. Soc. B, 86, 643-652. 136 Gasser, T . , Mul ler , H . G . and Mammitzsch, V . (1985), "Kernels for Nonpara-metric Curves Es t imat ion" , J. Royal Statist. Soc. B, 47, 238-252. Green, P . J . and Silverman, B . W . (1994), Nonparametric Regression and Generalized Linear Models. A Roughness Penalty Approach, Monographs on Statistics and Appl ied Probability, 58, Chapman and H a l l , London. Guttorp , P. , Meir ing, W . and Sampson, P . D . (1994), " A Space-Time Analysis of Ground-Level Ozone D a t a " , EnvironMetrics, 5, 241-254. Gyorfi , L . , Hardle, W . , Sarda, P. and V i e u , P. (1989), Nonparametric Curve Estimation from Time Series. Lecture Notes in Statist, 60, Springer, New York. H a l l , P. , Sheather, S.J., Jones, M . C . and Marron , J.S. (1991), " O n O p t i m a l Data-based Bandwidth Selection in Kernal Density Es t imat ion" , Biometrika, 78, 263-271. H a l l , P. and Marron , J.S. (1990), "Extent to W h i c h Least Squares Cross-Val idat ion Minimises Integrated Squared Error in Nonparametric Density Es t imat ion" , Probab. Theory Rel. Fields, 74, 567-581. H a l l , P. and Marron , J.S. (1990), " O n Variance Est imation In Nonparametric Regression", Biometrika, 77, 415-419. Hardle, W . (1990), Applied Nonparametric Regression, Cambridge Univ . Press. 137 Hardle, W . and Marron , J.S. (1985), " O p t i m a l Bandwidth Selection in Non-parametric Regression Function Est imat ion" , Ann. Statist., 13, 1465-1481. Hardle, W . and V i e u , P. (1992), "Kernel Regression Smoothing of T ime Se-ries", J. Time Ser. Anal., 13, 209-232. Hardle, W . , H a l l , P. and Marron , J.S. (1988), "How Far are Automat ica l ly Chosen Regression Smoothing Parameter from their O p t i m u m " , J. Amer. Statist, 83, 86-101. Hart , J . D . (1991), "Kernel Regression Est imation with T ime Series Errors" , J. Roy. Statist Soc. Ser. B, 53, 173-187. Hart , J . D . (1994), "Automated kernel Smoothing of Dependent D a t a by Us-ing T ime Series Cross-Val idat ion" , J. Roy. Statist. Soc. Ser. B, 56, 529-542. Hart , J . D . (1996), "Some Automated Methods of Smoothing Time-Dependent D a t a " , Nonparametric Statistics, 6, 115-142. Hart , J . D . and Wehrly, T . E . (1986), "Kernel Regression Using Repeated Mea-surements D a t a " , J. Amer. Statist. Assoc., 81 , 1080-1088. Hastie, T . J . and Tibshirani , R . J . (1990), Generalized Additive Models, Chap-man and H a l l , New York. Heckman, N . E . (1997), "The Theory and Appl icat ion of Penalized Least Squares Methods or Reproducing Kernel Hilbert Spaces Made Easy" , un-published manuscript. 138 Heckman, N . E . and Ramsay, J . O . (1996), "Penalized Regression with Model -Based Penalties", unpublished manuscript. Lefohn, A . S . and Runeckles, V . C . (1987), "Established Standards to Pro-tect Vegetation-Ozone Exposure/Dose Considerations", Atmospheric Envi-ronment, 21 , 561-568. L i , Xiaochun (1996), "Local Linear Regression Versus Backcalculation in Forecasting", P h . D . Thesis, Department of Statistics, University of Br i t i sh Columbia . L ippmann, M . (1989), "Health Effects of Ozone: A Cr i t i ca l Review", J. of the Air Waste Management Association, 39, 672-695. Li t t le , R . J . A . and R u b i n , D . B . (1987), Statistical Analysis with Missing Data,_ John Wi ley & Sons. Marron , J.S. (1987), "Partit ioned Cross-validation", Econometric Rev., 6, 271-284. Marron , J.S. and Nolan, D . (1988), "Canonical Kernels for Density Est ima-t i o n " , Statist. Prob. Lett, 7, 195-199. Mi i l ler , H . G . (1988), Nonparametric Analysis of Longitudinal Data, Lecture notes in Statist., 46, Springer, New York. Nadaraya, E . A . (1964), " O n Est imating Regression", Theory Prob. Appl., 9, 141-1421. 139 Nash, J . C . (1990), Compact Numerical Methods for Computers: Linear Al-gebra and Function Minimization, Hilger, New York. Opsomer, J . D . (1996), "Est imating an Unknown Function by Loca l Linear Regression when the Errors are Correlated", Proceedings of the Section on Statist. Computing, Amer . Statist. Assoc., Alexandria , V A , 102-108. Opsomer, J . D . and Ruppert , D . (1998), " A Ful ly Automated Bandwidth Selection Method for F i t t i n g Addit ive Models" , J. Amer. Statist. Assoc., 93, 605-619. Park, B . U . and Marron , J.S. (1990), "Comparison of Data-Driven Bandwidth Selectors (with discussion)", J. Amer. Statist. Assoc., 85, 66-72. Priestley, M . B . (1981), Spectral Analysis and Time Series, Academic Press, London. Ramsay, J . O . (1997), " A User's Guide to Smart Smoothing wi th L-Spl ine" , unpublished manuscript. Rice, J . (1984), "Bandwidth Choice for Nonparametric Regression", Ann. Statist, 12, 1215-1230. Rice, J . and Silverman, B . W . (1991), "Est imat ing the Mean and Covariance Structure Nonparametrically when the Data are Curves" , J. Royal Statist. Soc. B, 53, 233-243. 140 Ruppert , D . , Sheather, S.J. and Wand, M . P . (1995), " A n Effective Bandwidth Selector for Local Least Squares Regression", J. Amer. Statist. Assoc., 90, 1257-1270. Ruppert , D . and Wand, M . P . (1994), "Mult ivariate Local ly Weighted Least Squares Regression", Ann. Statist., 22, 1346-1370. Sarndel, C . E . , Swensson, B . and Wretman, J . (1992), "Model Assisted Survey Sampling" Springer-Verlag, New York. Silverman, B . W . (1984), "Spline Smoothing: The Equivalent Variable Kernel M e t h o d " , Ann. Statist., 12, 898-916. Silverman, B . W . (1984), " A Fast and Efficient Cross-Validation Method for Smoothing Parameter Choice in Spline Regression", J. Amer. Statist. Assoc., 79, 584-589. Silverman, B . W . (1985), "Some Aspects of the Spline Smoothing Approach to Nonparametric Regression Curve F i t t i n g " , J. Royal Statist. Soc. B, 47, 1-52. Silverman, B . W . (1986), Density Estimation for Statistics and Data Analysis, Chapman and H a l l , London. Simonoff, J.S. (1996), Smoothing methods in Statistics, Springer Series in Statistics, Springer-Verlag, New York. 141 Smith , R . L . and Huang, L-S . (1993), "Model ing High Threshold Exceedances of Urban Ozone", Technical Report, 6, National Institute of Statistical Sci-ences. Smith, P . L . (1982), "Curve F i t t i n g and Model l ing with Splines Using Statis-tical Variable Selection Methods" , N A S A , Langley Research Center, Hampla , V A , N A S A Report 166034. Stone, C . J . (1974), "Cross-validatory Choice and Assessment of Statistical Predictions (with Discussions)", J. Royal Statist. Soc. B, 3 6 , 111-147. Stone, C . J . (1977), "Consistent Nonparametric Regression", Ann. Statist., 5, 595-645. Stone, C . J . (1980), " O p t i m a l Rates of Convergence for Nonparametric Es t i -mators" , Ann. Statist, 8, 1348-1360. Stone, C . J . (1982), " O p t i m a l Global Rates of Convergence for Nonparametric Regression", Ann. Statist, 10, 1040-1053. Sun, Weimin (1994), "Bayesian Mult ivariate Interpolation w i t h Missing D a t a and its Appl icat ions" , P h . D . Thesis, Department of Statistics, University of Br i t i sh Columbia . Turlach, B . A . and W a n d , M . P . (1996), "Fast Computat ion of A u x i l i a r y Quan-tities in Local Polynomial Regression",J. of Computational and Graphical Statist, 5, 337-350. 142 V i e u , P. and Hart , J . (1990), "Data-driven Bandwidths Choice for Density Est imation Based on Dependent D a t a " , Ann. Statist., 18, 873-890. Wahba, G . (1975), "Smoothing Noisy Data wi th Spline Functions", Numer. Math., 24, 383-393. Wahba, G . (1990), Spline Models for Observational Data, S I A M . Wand, M . P . and Jones, M . C . (1995) Kernel Smoothing, Chapman and H a l l , London. Wang, Y . (1996), "Smoothing Spline Models with Correlated Random E r -rors", Technical Report, 966, Dept. of Statist., Univ . of Wisconsin. Watson, G.S . (1964), "Smooth Regression Analys is " , Sankhya Ser. A, 26, 359-372. Wehrly, T . E . and Hart , J . D . (1988), "Bandwidth Selection for Kernel Es t i -mators of Growth Curves wi th Correlated Errors" , Institute of Mathematical Statistics Bulletin, 17, 236. Whittaker , E . T . (1923), " O n a New Method of Graduat ion" , Proc. Edinburgh Math. Soc, 41 , 63-75. 143 Appendix A Distributions This Appendix contains the histograms of the distributions of M S E and log(MSE) for al l six methods used in the simulation study. 144 5 0 0 1 5 0 0 MSE 5 log(MSE) Figure A . l : MSE and log(MSE) for LS for (b = 0.5. 145 1 0 0 0 M S E 1 5 0 0 2 0 0 0 5 l o g ( M S E ) Figure A.2: MSE and log(MSE) for NLS for <p = 0.5. 146 2 0 0 3 0 0 4 0 0 M S E 3 4 l o g ( M S E ) Figure A.3: MSE and log(MSE) for TS for cb = 0.5. 147 100 200 M S E l o g ( M S E ) Figure A.4: MSE and log(MSE) for IND for <h = 0.5. 148 5 0 0 1 0 0 0 MSE 5 log(MSE) Figure A . 5 : MSE and log(MSE) for BCV for (b = 0 . 5 . 149 5 0 1 0 0 1 5 0 M S E 4 l o g ( M S E ) Figure A.6 : MSE and log(MSE) for MCVX for (j) = 0.5. 150 1 0 0 2 0 0 3 0 0 M S E 4 0 0 5 0 0 l o g ( M S E ) Figure A.7: MSE and log(MSE) for MCV2 for (j) = 0.5. 151 iiltK'i^.'lihXi! 5 0 1 0 0 1 5 0 M S E 2 5 0 Figure A.8: MSE and log(MSE) for MCV3 for <f> = 0.5. 152 5 0 0 1 0 0 0 M S E l o g ( M S E ) Figure A.9 : MSE and log(MSE) for LS for (h = -0 .5 . 153 8 5 0 0 1 0 0 0 M S E 4 5 l o g ( M S E ) Figure A.10: MSE and log(MSE) for NLS for (b = -0 .5 . 154 8 200 M S E 3 M S E Figure A . l l : MSE and log(MSE) for TS for 4> = - 0 . 5 . 155 1 0 2 0 3 0 4 0 M S E 5 0 6 0 liiliiill 111 Illlillli 1.5 2 . 0 2 . 5 3 . 0 l o g ( M S E ) 3 . 5 Figure A.12: MSE and log(MSE) for IND for <b = -0 .5 . 156 1 0 0 0 M S E 1 5 0 0 4 5 l o g ( M S E ) Figure A . 13: MSE and log (MSE) for BCV for <h = - 0 . 157 10 2 0 3 0 M S E 4 0 5 0 6 0 l i i i i i i i l i i l l iliiiiiii mSSSMmm 2 . 5 3 .0 l o g ( M S E ) 3 . 5 Figure A.14: MSE and log(MSE) for MCVX for <\> = -0 .5 . 158 I 1 1 1 1 1 1 0 10 2 0 3 0 4 0 5 0 6 0 M S E l o g ( M S E ) Figure A . 15: MSE and log (MSE) for MCV2 for <j) = - 0 159 MmmmmM i i 1 H H I l i l i i i i • I l l l l i B i a 0 10 2 0 3 0 4 0 5 0 6 0 M S E 1 2 3 4 l o g ( M S E ) Figure A.16: MSE and log (MSE) for MCV3 for (h = - 0 . 160 A p p e n d i x B M o n t h l y D a t a 161 I / 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V ~~1 I 1 1 1 1 ,— 0 12 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B . l : Fits for Monthly Q3 Levels at Site #1. 162 IND - - - TS MCV1 i 1 1 i 1 r 0 12 24 36 48 60 72 Months Figure B.2: Fits for Monthly 03 Levels at Site #1. 163 L S T S N L S 4 8 6 0 7 2 Months Figure B.3: Fits for Monthly (93 Levels at Site #2. 164 Figure B.4: Fits for Monthly 03 Levels at Site #2. 165 L S T S N L S I I 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V ' I I I 1 1 1— 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B . 5 : Fits for Monthly 0 3 Levels at Site #3. 166 Months Figure B.6: Fits for Monthly 0 3 Levels at Site #3. 167 Figure B.7: Fits for Monthly 03 Levels at Site # 4 . 168 Figure B.8: Fits for Monthly 0 3 Levels at Site #4. 169 170 Figure B.10: Fits for Monthly 03 Levels at Site #5. 171 L S T S N L S 0 12 2 4 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V Figure B . l l : Fits for Monthly 0 3 Levels at Site #6. 172 IND - - - TS MCV1 Figure B.12: Fits for Monthly 03 Levels at Site #6. 173 M o n t h s Figure B.13: Fits for Monthly 03 Levels at Site #7. 174 Figure B.14: Fits for Monthly 03 Levels at Site #7. 175 L S T S N L S M C V 1 M C V 3 M C V 2 B C V 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B.15: Fits for Monthly 0 3 Levels at Site #8. 176 IND - - • TS MCV1 / A A J \ 7 / 4 I J — i — 24 12 36 48 60 72 Months Figure B.16: Fits for Monthly 03 Levels at Site #8. 177 L S T S N L S M o n t h s M C V 1 M C V 3 M C V 2 B C V - i 1 1 1 1 1 1— 0 12 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B.17: Fits for Monthly 0 3 Levels at Site #9. 178 Figure B.18: Fits for Monthly 0 3 Levels at Site #9. 179 Figure B.19: Fits for Monthly 0 3 Levels at Site #10. 180 IND - - • TS MCV1 f; 1! « ' // III ill ft V . ( •I 1/ 4// 1 / / I 12 24 36 Months 48 60 72 Figure B.20: Fits for Monthly 03 Levels at Site #10. 181 L S T S N L S A !\ • r\ a A A ' r V // \ \j A i\ \/ 1 \ //. ' -\ / 7 " \ J " \ / \"' \ 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V Figure B.21: Fits for Monthly 0 3 Levels at Site #11. 182 o I-* o to o in o o co o CM Months Figure B.22: Fits for Monthly 0 3 Levels at Site #11. 183 - r ^ 1 1 1 1 1 1— 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V -i 1 1 1 1 1 — — r— 0 12 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B.23: Fits for Monthly Oz Levels at Site #12. 184 o O CO o i n o o CO o CM Figure B.24: Fits for Monthly 0 3 Levels at Site #12. 185 .1 I I 1 1 1 1 1 1— 6 12 2 4 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V -T 1 1 1 1 1 1— 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B .25: Fits for Monthly 0 3 Levels at Site #13. 186 IND - - • TS MCV1 Months Figure B.26: Fits for Monthly 03 Levels at Site #13. 187 -1 1 1 1 1 1 1— 0 12 2 4 3 6 4 8 6 0 7 2 M o n t h s M C V 1 M C V 3 M C V 2 B C V - i 1 i 1 1 1 1— 0 1 2 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B.27: Fits for Monthly 03 Levels at Site #14. 188 IND — TS MCV1 12 24 36 48 60 72 Months Figure B.28: Fits for Monthly 0 3 Levels at Site #14. 189 L S T S N L S 1 2 2 4 3 6 M o n t h s Figure B.29: Fits for Monthly 0 3 Levels at Site #15. 190 o CO o o o CO o CM Months Figure B.30: Fits for Monthly 0 3 Levels at Site #15. 191 Figure B.31: Fits for Monthly 03 Levels at Site #16. 192 IND - - - TS MCV1 11 ii Ii '//• il-ia A. / / / ' / / if-I \ // • \ // v./ A A i A / A ,\ \ i \ ft \ \ \ • «\ , / / V : Hi It .// ii / / w \ :A\. — i — 24 12 36 Months 48 60 72 Figure B.32: Fits for Monthly 03 Levels at Site #16. 193 M C V 1 M C V 3 M C V 2 B C V —i 1 1 1 1 1 1— 0 12 2 4 3 6 4 8 6 0 7 2 M o n t h s Figure B.33: Fits for Monthly 03 Levels at Site #17. 194 o CO o O o CO o CM Figure B.34: Fits for Monthly 03 Levels at Site #17. 195 Figure B.35: Fits for Monthly 0 3 Levels at Site #18. 196 IND - - • TS MCV1 '.4 I A / V \ 4 A i •v. — i — 12 — i — 72 24 36 Months 48 60 Figure B .36 : Fits for Monthly 03 Levels at Site #18. 197 A p p e n d i x C D a i l y D a t a 198 0 365 730 1095 1460 1825 2190 Days Figure C . l : Fits for Square Root Transformed Daily 0 3 Levels at Site #1. 199 200 201 202 203 Figure C .6: Fits for Square Root Transformed Daily O 3 Levels at Site # 6 . 204 205 206 LS BCV 207 208 Figure C . l l : Fits for Square Root Transformed Daily 03 Levels at Site #11. 209 210 211 212 LS BCV i i i 1 1 1 1— 0 365 730 1095 1460 1825 2190 Days Figure C.16: Fits for Square Root Transformed Daily 03 Levels at Site #16. 214 "1 I 1 1 1 1 1— 0 365 730 1095 1460 1825 2190 Days Figure C.17: Fits for Square Root Transformed Daily O 3 Levels at Site #17. 215 Figure C.18: Fits for Square Root Transformed Daily 03 Levels at Site #18.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Smoothing parameter selection when errors are correlated...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Smoothing parameter selection when errors are correlated and application to ozone data St-Aubin, Robert 1998
pdf
Page Metadata
Item Metadata
Title | Smoothing parameter selection when errors are correlated and application to ozone data |
Creator |
St-Aubin, Robert |
Date Issued | 1998 |
Description | Automatic smoothing parameter selection methods for nonparametric regression like cross-validation and generalized cross-validation are known to be severely affected by dependence in the regression errors. We proposed, in this work, to modify some of the ideas used in the cross-validation criterion in kernel regression with dependent errors and apply them to smoothing splines with model based penalty. Model based penalty smoothing permits us to keep the flexibility of the nonparametric methods while it also allows us to specify a favoured parametric model which can help improve on the estimate of the regression function. We consider the "modified cross-validation" (also known as Leave—21+1 out) and the "blockwise cross-validation" smoothing parameter selection techniques which were initially proposed by Hardle and Vieu (1992) and Wehrly and Hart (1988) respectively. These two smoothing parameter selection techniques take correlation into account and alleviate its effect on the regression function estimation. We use a simulation study to evaluate the performance of our two smoothing parameter selection techniques. We compare the results with a few commonly used parametric techniques. Our techniques are also applied to an air pollution data set where we estimate the underlying trend of daily and monthly ground ozone levels in southern Ontario. |
Extent | 7771762 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-05-26 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0088619 |
URI | http://hdl.handle.net/2429/8250 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1998-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1998-0601.pdf [ 7.41MB ]
- Metadata
- JSON: 831-1.0088619.json
- JSON-LD: 831-1.0088619-ld.json
- RDF/XML (Pretty): 831-1.0088619-rdf.xml
- RDF/JSON: 831-1.0088619-rdf.json
- Turtle: 831-1.0088619-turtle.txt
- N-Triples: 831-1.0088619-rdf-ntriples.txt
- Original Record: 831-1.0088619-source.json
- Full Text
- 831-1.0088619-fulltext.txt
- Citation
- 831-1.0088619.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0088619/manifest