LOCAL LINEAR REGRESSION VERSUS BACKCALCULATION IN FORECASTING by Xiaochun Li B.Sc, Tsinghua University, Beijing, China, 1986 M.Sc, Tsinghua University, Beijing, China, 1988 M.Sc, University of Saskatchewan, 1991 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES Department of Statistics We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA August 1996 Â©Xiaochun Li, 1996 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of StcffiS'TiC S The University of British Columbia Vancouver, Canada Date Olf % DE-6 (2/88) Abstract The local linear forecasting estimator is proposed in this thesis as an alternative technique to either parametric regression or the backcalculation approach in the context of forecasting for independent data. The asymptotic bias and variance of the local linear forecasting estimator are de rived and used to develop procedures for the estimation of the optimal bandwidth for forecasting. Both the theoretical and the computational aspects of these procedures are explored. Simulation study shows that a cross-validation procedure has the best performance in forecasting among four bandwidth estimation procedures under study. Simulations and statistical analyses show that the backcalculation approach is very vulnerable to violations of the assumptions underlying this approach and that its appli cation to AIDS data fails to achieve its two primary goals, to forecast the numbers of new AIDS cases and to estimate the historical HIV infection curve. To test the proposed forecasting estimator over parametric regression, both tech niques are applied to the Canadian AIDS data and the UK AIDS data. The results of the two examples expose the weakness of parametric regression and show that the proposed technique does better than parametric regression in forecasting. ii Contents Abstract ii Table of Contents iiList of Tables vi List of Figures viiAcknowledgements x 1 Introduction 1 1.1 The linear operator approach 4 1.2 The regression approach 7 1.2.1 Parametric extrapolation1.2.2 A local linear forecasting estimator 9 2 The linear operator approach 12 2.1 Some results on reproducing kernel Hilbert spaces 14 2.2 Application to the AIDS data: backcalculation 6 3 Asymptotics for the local linear regression estimator 22 iii 4 The choice of a smoothing parameter 39 4.1 The plug-in approach 40 4.1.1 Introduction 1 4.1.2 Plug-in approach using complete asymptotics for forecasting (CAMSE) 43 4.1.3 The plug-in approach using the finite sample variance for forecast ing (HAMSE) 44 4.1.4 Estimation of m" and <72 - an introduction 45 4.1.5 Estimation of m" and a2 for forecasting 6 4.2 Cross-validation 49 4.2.1 Background to the non-forecasting setting 44.2.2 CV for the linear operator approach 53 4.2.3 CV for the local regression approach in curve estimation 53 4.2.4 CV for forecasting: FCV 54 4.2.5 CV for forecasting, BCV 61 5 Simulation: local linear forecasting estimator 71 5.1 Motivation and data 75.2 Results and comments 3 6 The local method versus backcalculation 85 6.1 Comments on the methods 86.2 A small simulation study 7 6.3 Discussion of the methods and the simulation results 96 7 Data analysis 100 7.1 Data, methods and results 108 Concluding remarks 110 iv Bibliography 112 v List of Tables 5.1 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m\ with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value 5.2 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m2 with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value 5.3 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m3 with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value 6.1 Summary of forecasts by backcalculation and local linear regression for 100 realizations of the fictitious AIDS data 6.2 Summary of lower bounds of forecasts by backcalculation for 100 realiza tions of the fictitious AIDS data 7.1 Quarterly numbers of AIDS cases reported within six years of diagnosis in Canada, 79Q4-90Q1 vi 7.2 Monthly numbers of AIDS cases reported to the end of September 1987 in UK. For time periods May 1986 to September 1987, the first number is the reported and the second number in parenthesis is the estimate of number of unreported new AIDS cases 102 7.3 Forecasts (ms) for 8 quarters, 88Q2 to 90Q1, on Canadian AIDS data (79Q4-88Q1) by local linear forecasting estimator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by hy0rpt. The FCV score and h?Jpt are displayed together in parenthesis. The a2 estimated by the Rice estimator is 177 106 7.4 Forecasts (ms) for 9 months, January 1987 to September 1987, on UK AIDS data (January 1982-December 1986) by local linear forecasting esti mator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by hyJpt. The FCV score and hyJpt are displayed together in parenthesis. The a2 estimated by the Rice estimator is 14. . 107 List of Figures 1.1 The quarterly number of AIDS cases reported within six years of diagnosis from the fourth quarter of 1979 (79Q4) to the first quarter of 1990 (90Q1). One unit in Time is 1/42 with 42 being the total number of quarters from 79Q4 to 90Q1 inclusive 3 5.1 The median of 100 shifted FCV curves (dotted), the median of 100 CAMSE curves (dashed) and the true AMSE curve (solid) for m = mi,m,2 and m3 with n = 50 and An = 0.1 79 5.2 Pairs of CAMSE (dotted) and HAMSE (solid) curves for 9 simulated data sets for m = ra3, n = 50 and An = 0.1 80 5.3 The AMSE curve, the 25th and the 75th quantiles of shifted FCV and CAMSE respectively, for m = ra3, n = 100 and An = 0.1 81 5.4 The median of 100 shifted BCV curves (dotted line) and AMSE curve (solid line) for m â€” m3, n = 50 and An = 0.1 82 5.5 The AMSE curve (solid) and the mean of 100 shifted FCV curves (dot ted) for m = m3 with An = 0.1,0.2 and n = 50,100 respectively 84 6.1 The HIV infection function 7,1(s) = â€”200(s+l)(5â€”2) and the incubation function F, F(t) â€” 1 â€” e-' used in the generation of the fictitious AIDS data 89 viii 6.2 The sensitivity of I to the assumption of F in the fictitious AIDS data (<r = 0). True F(t) = 1 - e~\ long F(t) = 1 - e"*'1-1 and short F(t) = l-e-'/Â°-9 93 A 6.3 The sensitivity of / to the presence of the errors (er = 0.5) in the fictitious AIDS data when the true F is used. The first plot shows that J is recovered from the error-free AIDS data while the rest show that I is highly distorted by the presence of errors in the AIDS incidence data. The solid curve (-) in each plot is the true / and the dotted line (...) the backcalculated J 94 6.4 The variability of backcalculated / 95 7.1 Canadian AIDS data and UK AIDS data: forecasting beyond t = 1. . . . 104 7.2 Forecasts by the local linear method and by Poisson regression using the most recent data. Among three sets of forecasts by Poisson regression, the set with the smallest ASFE is plotted 109 ix Acknowledgement s I would like to thank my thesis supervisor Nancy Heckman for her guidance in the development of my thesis, for her excellent advice and support, and for being an influential role model throughout my Ph.D. program. I am also very grateful to the members of my thesis committee, Jean Meloche, Steve Marion, and in particular, John Petkau for their careful reading of the manuscript and valuable comments. In addition, I am very much indebted to Jim Ramsay for making my stay at McGill possible and for his inspiration in statistics. I would also like to acknowledge the help from Ping Yan from the Laboratory Centre for Disease Control, who provided me with the Canadian AIDS data, and the friendship and support of fellow graduate students and staff members in the Department of Statistics. Finally, I would like to thank the University and the Department of Statistics for the indispensible financial support I received throughout my Ph.D program. Chapter 1 Introduction Suppose bivariate data {(t;, Yi), i = 1,..., n] are observed at times 0 < ti < t2 < ... < 'tn < 1. Here the f;s are in [0,1], which may be normalized physical times. Thus any time beyond 1 is the "future". What can be said about the future path of this stream of data by studying the information available up to the present? Specifically, what is the "expected" value of Y at some t > 1? Consider the AIDS example. A new disease, acquired immunodeficiency syndrome (AIDS), was defined by the Center for Disease Control in the United States in 1982. AIDS is believed to be caused by the human immunodeficiency virus, or HIV. This disease spread rapidly in the United States: reported cases of AIDS increased from 295 in 1981 to nearly 42,000 in 1991 with reported deaths increasing from 126 to more than 30,000 in the same period. AIDS is now one of the major causes of death in the United States [6]. AIDS has spread and caused serious concern around the world. Even though the current data suggest a slowing down in the growth of the AIDS epidemic in the United States, Northern Europe, Canada and Australia, the spread of HIV infection is rampant in Africa. AIDS in South America and Asia seems to be at the onset of explosive spread [6]. 1 Uncertainty about AIDS has generated research involving epidemiologists and statis ticians about the relationship between HIV and AIDS, the transmission of the disease and the future course of AIDS incidence. Mechanisms for monitoring the epidemic have been set up around the world. In the United States, the number of new AIDS cases each month has been available from the CDC (Centers for Disease Control) since 1982 (earlier data from 1975 and prior to 1982 being combined to assure confidentiality [1]) while in Canada quarterly numbers have been recorded by the FCA (the Federal Centre for AIDS) since 1979 [23] and the LCDC (the Laboratory Centre for Disease Control). Figure 1.1 depicts the quarterly numbers of AIDS cases reported within six years of diagnosis in Canada. Those data are considered to be complete in reporting [23]. This thesis will focus on the problem of forecasting with complete or adjusted AIDS data, since adjusting AIDS data for under-reporting and reporting-delays would pose a separate research problem on its own. Note that the data shown here are not the cumulated numbers of AIDS cases. Each data point represents the number of new AIDS cases in the corresponding quarter. In this case, i; corresponds to the end of the ith quarter and Y{ observed at t,-, the number of new AIDS cases in the zth quarter (Â£,â€¢_!,Â£,â€¢]. One goal in AIDS research is forecasting the number of AIDS cases that will occur in future time periods. This number is important to health organizations and governments for estimating the future demand in health care and allocating funds accordingly. To summarize data and make forecasts based on observed data of the form {(ti,Yi),i = 1,... ,n}, various approaches exist in different theoretical frameworks: 1. the linear operator approach; 2. the regression approach; 3. the time series approach. 2 o o CO Canadian Data o o C\J o o O H i 1 1 ; r~ 0.2 0.4 0.6 0.8 0.0 1.0 Time (0.0=79Q3, 1.0=90Q1) Figure 1.1: The quarterly number of AIDS cases reported within six years of diagnosis from the fourth quarter of 1979 (79Q4) to the first quarter of 1990 (90Q1). One unit in Time is 1/42 with 42 being the total number of quarters from 79Q4 to 90Q1 inclusive. 3 This thesis focuses on the first two approaches, its main goal being the development of methodology for regression. Its application to AIDS data will be used as an example. 1.1 The linear operator approach Assume the i,s are in [0,1]. In the special case of equally spaced data, U = i/n, tn = 1 being the "present" time at which the last observation was made. Suppose (t, Y) has a certain probabilistic structure and E{g(Yi)\ti} = m(t,), g being a transformation function and m some smooth function. For a fixed design where the t,s are not random, E{g{Yi)\ti] means E{g(Yi)} with Yi observed at t = t{. The linear operator approach assumes that data can be expressed in a functional form, E{g(Yi)\ti} = Li(I) = where m is a smooth function and Li is a known linear operator on a certain function space where the unknown I resides. In this thesis, the case where X,- is the evaluation operator, i.e., L{I = I(ti), is considered as a problem in regression (approach 2) in which no additional information other than that in the data is available. The function I will have a specific meaning in applications. In the AIDS example, I is the rate of HIV infection and the YiS are the numbers of AIDS cases. The linear operator L,- for the AIDS data will be described later. The linear operator approach summarizes the data and enables forecasting by estimating I first. Then the operator Lj is applied to this estimated I to give an estimate of m (Â£,â€¢). This approach has the potential benefit of embedding prior knowledge of the structure of the data into the linear operator Li. The technique called backcalculation [3] which exemplifies this approach is illustrated below through its application to the AIDS data. The prevalent theory in AIDS research postulates that an AIDS patient was infected with HIV at a certain time s in the past which took some time to develop to the advanced stage of HIV infection and then to the time point of AIDS diagnosis, t. The 4 length of time between the point of HIV infection at s and the point of AIDS diagnosis at t is called the incubation time, whose distribution can be modeled by a time dependent distribution function. Estimates of F have been found from cohort studies. So F is usually assumed known. Based on the assumed relationship between HIV infection and AIDS diagnosis, the expected number of AIDS cases before t can be calculated by the formula, a(t) = iÂ£(number of AIDS cases diagnosed before time t) = f* I(s)F(t-s\s)ds, (1.1) Jo where I(s) is the expected number of new HIV infections at time instant s, s > 0. /(â€¢) is unknown and to be estimated. The convolution formula (1.1) uses an integral operator to link the HIV infections and the AIDS diagnoses. From (1.1), for equally spaced i,s m(ti) = E{Yi) = ^(number of AIDS cases in (t,_i,t,]) = a(tt) â€” a(2,_i) = I*' I(s)F(U - s\s)ds - f*"11(s)F(ti_1 - s\s)ds Jo Jo = Â£ I(s) (F(U - s\s) - F(ti - ^ - s\s)) ds (1.2) EE Li{T), (1.3where F{u\s) = 0, if u < s. The linear operators, i.e., the Lts so defined, model the dynamics of the two processes: HIV infection and the AIDS diagnosis [3]. In general, define m(t) = j* I(s) (F(t - s\s) - F(t - i - s\s)^j ds. (1.4) Thus to forecast the number of AIDS cases in a future quarter, say (1 +Aâ€ž â€”1/n, 1 + Aâ€ž], one must estimate m(t) at t = 1 + An. 5 The first step to forecasting is to estimate the HIV infection curve I from the Y;s, the numbers of AIDS cases. This is done by confining /(â€¢) to an appropriate function space Ti and minimizing a fitting criterion, e.g., /(â€¢) = argminlâ‚¬>l J2(Yi - m(ti))2 + A /* I"\t)dt (1.5) The parameter A in (1.5) is the smoothing parameter which can be chosen by cross-validation. More detailed discussion of cross-validation is provided in Chapter 2 and Chapter 4. The above procedure for estimating I from the Y{S is the so-called backcal culation [3]. In an appropriately chosen 7i, the minimizer I can be written as a linear combination of l,t and the Riesz representors of L{S [31]. For details, see Chapter 2. After I is obtained, m(l + An) can be predicted by using I in formula (1.4). However, for s close to 1, I(s) is usually not reliable. The information about I(s) is contained in the Ys observed after time s. Thus the data contain no information about I(s) for 5 > 1. Further, for s < 1, the closer a time point 5 is to 1 the less reliable the backcalculated I(s) is because there are fewer available data. A conservative approach is to be less ambitious and get a lower bound for m(l + An) instead. A useful lower bound can be found if the disease has a long incubation. For a short term forecast, a great proportion of new HIV infections arising in time period (1,1 + An] are still latent and thus will not contribute to the number of new AIDS cases occurring in (1 + An â€” 1/n, 1 + An], Therefore, the percentage of new AIDS cases contributed by the HIV infections that occur in (1,1 + Aâ€ž] may be negligible. From (1.4), we have m(l + An) > Â£ I(s) (F(l + An - s\s) - F(l + An - i - s|s)) ds = LBn, (1.6) which in effect takes I to be zero over (1,1+ Aâ€ž]. An estimate of LBn can be obtained by plugging I into the above formula for LBn, LBn = Â£ i(s) (^(1 + Aâ€ž - s\s) - F(l + An - ^ - s\sf) ds. (1.7) 6 More examples of the linear operator approach in science and engineering can be found in [29] and [30]. 1.2 The regression approach The regression approach assumes an unknown underlying regression function m for the data or the transformed data, i.e., E{g(Y)\U} â€” m{tt). It will be assumed hereafter that the data are properly transformed and E(Yi\t{) = m(t,) in the regression approach. The regression approach uses the data {Yi}â„¢ to estimate ra directly. This general approach includes parametric regression in which m depends on a global parametric form and local regression in which m is only assumed to be smooth. The next two sections contain details of both types of regression in the context of forecasting. 1.2.1 Parametric extrapolation This approach assumes that the trend of the data continues over a period of time in a postulated parametric form, for example, log(E(Yi)) = /So + /Mi- One fits this parametric model to the entire data set and then uses this fit to extrapolate to obtain the forecast. In 1986 the Public Health Service in the United States gave a forecast of 270,000 for the cumulative number of AIDS cases in the United States by the end of 1991 by extrapolating a polynomial model. The actual number of reported AIDS cases turned out to be 206,000 [6]. As noted by a few authors ([9],[20]), parametric extrapolation can provide useful short-term forecasts. However, there are a few criticisms of this approach: 1. Parametric extrapolation does not make use of any available information on the progression of the HIV infections to AIDS disease and thus may be less efficient 7 than backcalculation. 2. The parametric assumption is not verifiable and therefore is a considerable source of uncertainty. Thus parametric extrapolation is unable to give a long-term fore cast. As for the first criticism, it is not necessarily the case that a method that uses less data is less efficient than a method that uses more data. The discussion in Chapter 6 will show that the local linear forecasting estimator that uses only K)}â„¢, suggests better forecasts than backcalculation that uses {(i,-,!^)}" and additional data. As for the second criticism, so far as forecasting is concerned, future predictions of medium to long term pose a challenge for statisticians. If a parametric model is "correct" it will be able to give a long-term forecast. However, no models are "correct" since they are simplified representations of the real world. When a set of data is fitted to a specific parametric form, the fitted curve may be compromised to fit all the data well at the expense of local structure in the data. Especially of interest is the local structure near the present, t = 1. Forecasts based on the extrapolation from such a fitted curve are unlikely to be as good as forecasts based an extrapolation from a fitted curve of more recent data. As Healy and Tillet [20] write, "It doesn't seem entirely reasonable to give the early data ... a high degree of influence in forecasting the future." When fitting the data to a log-linear model, they superimposed subjectively-chosen weights decreasing into the past. This method should be expected to be an improvement over simple parametric fitting when "appropriate" weights are chosen. It might be desirable to do away with the parametric assumption and also to couple the fitting with a data-driven method for the choice of weights. The local regression method proposed in the next section does exactly this. 8 1.2.2 A local linear forecasting estimator In general, to estimate a function m at a certain point t, the approach of local linear regression assumes that m has a linear structure in a neighbourhood of t: m(x) = Po + Pi(x â€” t), and estimates flo, /?i by weighted regression. The weights assigned to the observations close to t are much bigger than the weights assigned to the rest of observations. That is, let the estimator of m(t) be defined as follows: m^t) = fa+ fo(t-t) = 0Oit, (1.8) where AM = argminJ2(Yi - A> ~ p\{U ~ t))2K(\-^-), (1.9) \ J with K(-) a kernel function that gives more weight to the observations close to t (defined by hn) than the rest of the observations, and hn the so-called bandwidth that determines the size of the neighbourhood of t. For example, if the kernel function is the indicator function over [â€”1,1], then K((U â€” t)/hn) = 1 for those t,s with \tj â€” t\ < hn and 0 otherwise. Since rhhn(t) is calculated from the Y{S, it is a random variable with statistical properties which are affected by the bandwidth hn and the kernel K. In practice the choice of K has little effect on rhhn{t) while the choice of hn has a large effect. Therefore we assume K is specified in Chapter 4 and describe data driven choice of hn. For now assume hn is given. Unless m is a straight line, rhhn(t) is in general a biased estimator of m(t), that is, Bias{rhhn(t)} = E{mhn(t)\ti,... ,tn} â€” m(t) ^ 0. Note that if the kernel function is the indicator function over [â€”1,1], the bandwidth hn reflects the number of Y{S used in the regression. A large hn means that a large number of T^-s is used in the regression and a small hn means otherwise. Therefore a large bandwidth 9 causes rhhn(t) to have a small variance and a small bandwidth causes rhhn(t) to have a large variance. For commonly used kernels, when hn is small, rhhn(t) will have a small bias but a large variance; when hn is large, rhhn(t) will have a large bias but a small variance. If hn is small enough, rhhn{t) will behave like an interpolant while if hn is large enough, m/ln(t) will become the familiar least squares fit. A balance between the bias and the variance is desirable and this balance is usually achieved by minimizing the mean squared error of rhhn{i), i.e., E{(rhhn(t) â€” m(t))2 \t\,... ,tn}. To predict mat 1 + An, let t = 1 + An in (1.9). Therefore, rh,hn{l + An) = /30,I+Aâ€ž-To avoid cumbersome calculations of the asymptotics later in Chapter 3, an algebraic substitution is employed to re-center the data at 1 instead of 1 + Aâ€ž. Note that Po,l+Aâ€ž 1+Aâ€ž / n V argminj^iY. - ft - ft ft - 1 - An))2K(U ~ ][~ A") 2 Let argminf2(Yi - (ft - ftAn) - p\{U - l))2K(U \ A"). fir, ft* = ft-ftAn, ft* = ft, K-firÂ±) = K{ti \ An). hr (1.10) (1.11) Then argminitiYi - ft* - ft(tt- - l))2TC, (1.12) \ Â«, / and m^â€ž(l + An) = ftx + ftflAn. Formulation (1.12) will be used hereafter with the superscript * dropped. The no tation m^n>i(l + An) instead of rhhn(l + An) will be used to denote the forecasting estimator that forecasts An ahead with data centered at t = 1 and with bandwidth hn. To summarize: 10 Definition 1.1 The estimator ra/iâ€ž,i(l + An) of m(l + An) is defined as mw(l + Aâ€ž) where argminf^iYi - A, - &(*,- - l))2tf(^-â€”^). (1.13) â€¢n The bandwidth /in can be chosen by plug-in or cross-validation approaches. The devel opment of those approaches for forecasting in Chapter 4 is the main contribution of this thesis. Results of the statistical properties derived in Chapter 3 serve as the cornerstone for all bandwidth estimation procedures. In the next chapter, theoretical results for the linear operator approach will be presented and applied to the AIDS data. 11 Chapter 2 The linear operator approach The linear operator approach will be discussed under the theoretical framework of a HUbert space. Standard results will be presented without proof. Proofs and other details can be found in [31]. Examples of the linear operator approach, details of computation and some asymptotic results can be found in [28]-[31]. The following theorem is central to the linear operator approach. Theorem 2.1 Assume the following conditions: 1. 7i is a HUbert space over the reals, 1Z, with inner product <, > and norm || â€¢ ||; 2. For each i G {1,... ,n}, Li : ri â€”* TZ, is a bounded linear functional; 3. P : Ti â€”> Til, is a projection operator with ri\ a subspace ofri; 4- The null space of P is TCo = span{<f>i,<f)no}, where the <j)jS are linearly inde pendent and n0 < 00; 5. !F(Y, L\(I),..., Ln(I)) : lZn xTZn â€”> 71 is a real function, where Y = (3/1,..., yâ€ž)'. 12 For given X > 0, if the minimizer I of F{Y,Lx{I),...,Ln{I)) + X\\P{I)\\2 (2-1) exists in Ti, it is of the form: no i = \Edsh + Hc&> (2-2) i=i i=i where the djS and CjS are real numbers, = P{r)j) and r]j is the Riesz representor of Lj, that is, Lj(I) =< rjj, I > for all I G Ti. Corollary 2.1 Assume the conditions in Theorem 2.1. In addition, assume the matri ces T â€” (Tij)nXno, Tij = Li((f)j) and E = (E,j)nXn, E,-j =< > are of full column rank, and that the WjS are positive. Then ^EÂ«-^))X + A||P(/)||2 (2.3) has a unique minimizer I in Ti and is as in (2.2) with d = {du...,dno)t = {TtA-xT)-xTtA-1Y, (2.4) c = (ci,...,cn)* = i4-1[/-r(r'A-1r)-1r*A-1]y, (2.5) where A = E + n\I. Remarks: 1. Theorem 2.1 greatly reduces the complexity of the task of finding the minimizer 7 in Ti (often of infinite dimension) to finding / in the finite dimensional subspace spanned by <j>x,..., <j>no-,i\-, â€¢ â€¢ â€¢, Â£Â«â€¢ Roughly speaking, a parametric model of span <f>i,..., (f>no determined by P, and the additional functions, Â£1,..., Â£nj allow us a more flexible fit to our n data points. 13 2. Theorem 2.1 is a very general result in that T can be a general function. However, we are only interested in J7 being â€”n^logijikelihood), in which case (2.1) is called a penalized likelihood. When the YJS are independent and normally distributed with mean Lj(I) and variance (2WJ)~1, J- is â€” re_1/o<jr(normal likelihood) and is the first term as in (2.3). We will assume (2.1) to be a penalized likelihood hereafter. The number A, known as the smoothing parameter, governs the relative importance of the goodness of fit of the data to that of the penalty P. If A is zero, the fit of the data is of the utmost importance and the penalty is ignored, with the result that I is the maximum likelihood estimator. On the other hand, a large value of A forces I close to the null space of P and results in an estimate close to a parametric fit. Any A between these two extremes reflects a compromise between the goodness of fit to the data and the form of I emphasized by the penalty. 3. An automatic, i.e., data-driven choice of the smoothing parameter A might be desired. Cross-validation is a technique commonly used for the estimation of an optimal A. Details of this method will be presented in Chapter 4. Though the Riesz representation theorem assures the existence of the rjjS, the represen tors of the LjS, it does not give the analytical forms of the r}jS. The theory of reproducing kernel Hilbert spaces (r.k.h.s.) makes it possible to calculate those Â£,-s analytically. This is important in applications. This chapter will present the relevant results of r.k.h.s. and then apply them to the case of backcalculation. 2.1 Some results on reproducing kernel Hilbert spaces The theory of r.k.h.s. is very powerful and very useful in the linear operator approach in general. 14 Definition 2.1 Let H be a Hilbert space of real-valued functions over [0,1], H={I: [0,1] ->ft}. Then 7i is a r.k.h.s. if and only if for all t G [0,1], Lt: H^Tl with Lt(I) = lit) is a bounded linear functional. Definition 2.2 IfTi is a r.k.h.s., then by the Riesz representation theorem, for any t, there exists a function Kt â‚¬ rt such that < J, Kt >= I(t), for all I eTi. Let K(s, t) = Kt(s) :. [0,1] x [0,1] -> K. (2.6) Then K is called the reproducing kernel ofri. As shown in the following lemma, in a r.k.h.s. it is very easy to use the reproducing kernel to calculate the Riesz representor of a bounded linear operator. Lemma 2.1 Assume that L is a bounded linear operator on a r.k.h.s., Ti. Its Riesz representor is n where n(t) = L(Kt). Proof: Let n be the Riesz representor of L in 7i. So for all I G rl, L(I) =<//,/>. In particular, for I = Kt, L(Kt) =< r),Kt >â€¢ By Definition 2.2, < rj,Kt > is also equal to n(t). So n{t) = L(Kt). â€¢ A commonly chosen r.k.h.s. is the class of smooth functions with square-integrable rth derivatives. Definition 2.3 W^l0*1] = {/ : [0,1] -> ft : /, 7(r-1) are absolutely continuous in [0,1] with JQ I^{u)2du < oo}. The inner product is defined as: < f,9 >= E/(i)(0)<7(i)(0) + t /w(u)5W(Â«)du. i=o Jo 15 Standard results say that this space is a r.k.h.s. and has a kernel of a known form. Lemma 2.2 WÂ£[0,1] is a r.k.h.s. with the reproducing kernel, *(M) = Â£^ + /oX (j^^\s-uy-\t-uy-idu, (2.7) where (s â€” u)+ = s â€” u, if s > u and 0 otherwise. 2.2 Application to the AIDS data: backcalculation Recall that in the linear operator approach in the AIDS example in Section 1.1, E(Yj), the expected number of new AIDS cases in the jth quarter, is assumed to be linked to an unknown function 7 by a linear operator Lj, Lj{I) = f' I(s)(F(tj - s\s) - F(tj - - - s\s))ds, (2.8) Jo n where I is the HIV infection function and F(-\s) is the distribution function of the incubation time from HIV infection to AIDS diagnosis at the instant s. F is assumed known so the operator Lj is known. Obviously Lj is a linear operator. In this application, the HIV infection function I will be assumed to be in H = WÂ£[0>1] with r = 2. This is a very minimal assumption on I because M^tO?!] 1S a very general class of functions. No specific assumptions are made on those func tions other than smoothness and integrability: 7 and V are absolutely continuous with Jo1 J"(u)2du < co. Recovering the historical HIV infection pattern I is an important goal in AIDS research. It is believed that I is a smooth curve. Therefore rough Is should be penalized. The quantity /Q1 I"(u)2du is a measure of the roughness of I so it can be used as the penalty term. 16 To put it statistically, finding a reasonably smooth I in W^O, 1] that fits the observed AIDS incidence data well is achieved by finding the I that minimizes F{Y,L1(I),...,Ln(I)) + \ I11'XuYdu, (2.9) JO where T is the negative of a loglikelihood function divided by n, I â‚¬ VF^O, 1] and A > 0. The smoothing parameter A places a control on the roughness of I relative to the goodness of fit of the Lj(I)s to the YjS. Minimizing the penalized likelihood means choosing an I in W%[0,1] with a tradeoff between the goodness of fit to the data and the roughness of I. This tradeoff is determined by the value of the smoothing parameter A. An optimal A for this tradeoff minimizes the prediction error, PE(\) = n-1Y,{Y;-mx(ti)}2 i=l where the 5^*s are hypothetical new observations at t,s, rh\(ti) = Li(I\) and 7A minimizes (2.9). This optimal A can be estimated by a standard technique called cross-validation. Further details on cross-validation will follow in Chapter 4. The minimizer I of (2.9) can be determined by using Theorem 2.1 and the theory of r.k.h.s. in Section 2.1. Theorem 2.2 If the minimizer I of (2.9) exists in W|[0,1], it takes the form: i(t) = Bo +iht+ (2-10) 3=1 where m = I Jo x / ,Mt)2 (sAt)3 st{s A t) - (a + i)^1- + (Fitj-s^-Fitj-^-s^ds, (2.11) and s At = s if s <t and t otherwise. 17 Theorem 2.2 enables one to find the minimizer in a finite dimensional subspace instead of in W|[0,1], a space of infinite dimension. The succeeding lemmas verify the conditions of Theorem 2.1 for its application to the AIDS data. So the ensemble of the following lemmas serves as the proof of Theorem 2.2. Lemma 2.2 in the previous section assures that Condition 1 of Theorem 2.1 is satis fied. [G\ 1] is a Hilbert space with the inner product <f,g>= /(0M0) + / W(0) + fX f"(u)9"(u)du, f,ge w2[o, 1], Jo and the norm induced by the above inner product, 11/2 7(0)2 + I'(0)2 + f1I"(u)2du Jo The lemma below checks Condition 2 of Theorem 2.1. Lemma 2.3 Suppose that for any s > 0, ^(-l-s) is a distribution function with F(u\s) = 0 if u < 0 and that F(u\s) is continuous in both u and s. For each j G {1,..., n}, the operator Lj : Lj(I) = jh I(s)(F(tj - s\s) - F{tj --- s\s))ds, I G W2 [0,1] Jo n is a bounded linear operator on W|[0,1]. Proof: Lj is well-defined since I(s)(F(tj â€” s\s) â€” F(tj â€” ^ â€” s\s)) is continuous in s and thus is integrable. The linearity of Lj is obvious. By the definition of the boundedness of a linear operator, Lj is bounded if there exists C, 0 < C < co, such that \Lj(I)\ < C\\I\\, (2.12) for all Ie W2[0,1]. 18 Since 0 < F(tj - s\s) - F(tj - J - s\s) < 1, it follows that |^(/)| = | I(s)(F(tj - s\s) - F(tj - - s\s))da\ Jo n < Â£j \I(s)\ds < Â£ \I(s)\ds < I(s)2ds^ 2 . (2.13) To bound /g |/(s)|<Â£s, first consider I(s) = 1(0) + s/'(0) + f" (U I"{v)dvdu. (2.14) Jo Jo Applying the inequality (a + b + c)2 < 3(a2 + b2 + c2), to (2.14) gives I(s)2 < 3(/(0)2 + s2I'{0)2 + (/'/" I"{v)dvdu)2). (2.15) Jo Jo By Schwarz's inequality, ( [' fU I"(v)dvdu)2 < ( [' [U I"(v)2dvdu)( [' fUl2dvdu), Jo Jo Jo Jo Jo Jo which is bounded by I"(v)2dv for 0 < s < 1. For 0 < s < 1, applying the relationship in (2.15) yields: I(s)2 < 3(/(0)2 + s2/'(0)2 + lX I"{v)2dv) Jo < 3(/(0)2 + 7'(0)2 + C I"{v)2dv) Jo = 3||/||2. (2.16) Using (2.16) in (2.13) gives \Lj(I)\ < ^1/2\\I\\- Therefore Lj is a bounded linear operator on W2[0,1]. â€¢ For Condition 3 of Theorem 2.1, the following lemma gives the projection operator P corresponding to the penalty term /0X I"(u)2du and the basis functions (f>jS that span the null space of P. , 19 Lemma 2.4 Let P : W2[0,1] - W22[0,1] : P(I)(t) = I(t) - 7(0) - l'(0)t. (2.17) T/iera P is a projection operator onto H\ with TiiÂ®rio = W|[0,1], where rio â€” span{l, t} is the null space of P and ||P(i")||2 = /o I"{ufdu. The next lemma uses the results of the r.k.h.s. to calculate the Riesz representor rjj of Lj in W^fO,!] and its projected image Â£j = P(r)j) for Lj as in (2.8) and P as in (2.17). Lemma 2.5 In W22[0,1], the projection of the Riesz representor of Lj is: w> - r x , x(sA*)2 (sAt)3 st(s At) â€” (s + t)^â€”Â±- + ^â€”t-(F(tj - s\s) - F(tj - ^ - s\s))ds. (2.18) Remark: The Â£jS are not splines (cubic polynomials) as in the trivial case when Lj(I) = I(tj). One can show that defined by (2.18) is an increasing function with Â£j(0) = 0, concave in [0, tj] and a straight line beyond tj. Proof: For r = 2, (2.7) gives the reproducing kernel as ~/ x / x / x(sAt)2 (s At)3 K(s,t) = l + st + st(s At)-(s + t)K n ' + v ' (2.19) 2 3 Recall that Kt(-) = K(-,t). By Lemma 2.1, the Riesz representor t]j of Lj evaluated at t is: - rjj(t) =< r,j,Kt >= Lj(Kt) = [*' Kt(s)(F(t}- - s\s) - F(tj --- s\s))ds. Jo n So &(*) = P(rn)(t) = rjj(t) - nj(0) - r]'j(0)t = Lj(Kt)-Lj(K0)-jt(Lj(Kt))\t=0t (2.20) 20 One can show that at Jo n Therefore = f\Kt{s) - 1 - st)(F(tj - s\s) - F(tj --- s\s))ds. â€¢ The results in this chapter will be used in Chapter 6. The next chapter will contain the derivation of the statistical properties of the local linear forecasting estimator rhhn,i(l + Aâ€ž) for the regression approach. 21 Chapter 3 Asymptotics for the local linear regression estimator This chapter will cover the asymptotic properties of m^ni<(i+An), as defined below based on data i^)}â„¢, for i,s random and t,s nonrandom respectively. These asymptotic results will help us understand the local linear forecasting estimator and choose an optimal bandwidth for the estimator. The results and conditions are different from those in non-forecasting regression problems (see, e.g. [8]). However, the proofs are similar. In non-forecasting regression problems, e.g., estimating m(t) by local regression for t in the domain of design points t,s, data on both sides of t are used in regression. There fore usually a kernel function K symmetric about 0 is used. Often in non-forecasting regression problems, K is assumed to be a density function. However, in forecasting data are centered at the boundary t = 1, so only the left part of the kernel function K is actually used because there are no data beyond t â€” 1. For this reason, the local linear forecasting estimator is defined by a kernel function K with a negative support. More-22 over, the local linear forecasting estimator does not require K to be a density function but a general function that satisfies Condition 3.1 below. Definition 3.1 Let rhhn,t{t') â€” /30i + P\,t{t' â€” t), where /30t and J3lt minimize X3i (Y]~ A)-P\{tiâ€”t))2K((tiâ€”t)/hn) with K a kernel function having a negative support. Note that the data are centered at t in Definition 3.1. The estimator rhhnj(t') forecasts t' â€” t ahead of t using data prior to t since the kernel function K has a negative support. The main application of the asymptotic results for rhhntt(t') will be to the case t = 1 and t' â€” t = An (see (1.13)). However, the more general asymptotic results for t and t' in a neighbourhood of 1 will be used in Chapter 4. From Definition 3.1, the asymptotic properties of rhhn,t[t + An) are completely de termined by those of 0t = ($ot, fiijY- Note that the superscript of a vector or a matrix indicates the action of taking the transpose of a vector or a matrix and should not be confused with the scalar t as either an argument of a function or a subscript of a variable. The asymptotic bias and variance of /3t will be presented first. For the case of random i,s, assume the following conditions: 1. Yi = m(ti) + e,-, where the e,s are independent with mean 0 and variance cr2; 2. t{S are iid with density /(â€¢) and are independent of the e,s; 3. the density function / is known and satisfies: / is bounded away from 0 and / is continuous on [0,1]; 4. K is square integrable on its compact support [â€”1,0] with u0 > 0, uQu2 â€” u\ > 0, (3.1) where u,- â€” ff j uxK(u)du\ let < = f\ tfK^Ydu-23 5. m" is continuous over [0,1 + An]; 6. hn â€”* 0 and nhn â€”> oo. Theorem 3.1 Let e = (1,1)'. Under the above assumptions, as n - m"(t) oo, _2_ ~hl E(B0tt\t) - m(t) [ hn (E0ht\t) - m'(tj) ) UQ UI Ui U2 = op(l)e, (3.2) and //t 0 w. \ \ nhnVar 0 /u I* /(*) uo ^1 V J Pot Â«0 Ui Ui U2 - oP(l)e, (3.3) Ui u2 uniformly in t G [an, 1] liminf an/hn > 1. Remarks: 1. Note that here "0n(i) = 0p((n/i)~1/2) uniformly for * <E [an, 1]" means that lim lim sup P((n/O1/2|0n(i)| > C) = 0, C_+oo n^oo tg[onil] which is weaker than lim lim P( sup {nhfl2\en{t)\ > C) = 0. Thus, for a fixed sample size, some values of 6n(t) could be far from 0. 2. It is not necessary to require that K have an one-sided support if a bandwidth is given because there are no data for t > 1 and thus any kernel will have an one sided support automatically. However, some bandwidth selection procedures (e.g. FCV in Chapter 4) need this requirement to have certain asymptotic properties (see Chapter 4). 24 The proof of Theorem 3.1 will be postponed until its corollaries are presented and proven. The results in Theorem 3.1 make calculations of the asymptotic bias and vari ance of rhhn,t(t + An) very easy. The corollary below follows easily from Theorem 3.1. Corollary 3.1 Assume Conditions 1, 2, 3, 4 and 5. In addition assume 6'. An = Dn-1^ and hn = Hn-1'5 for some D, H > 0. Let 8 = D/H. Let Bias(mhn,t(t + An)\t) = E(mhn<t(t + An)\t) - m(t + An). (3.4) As n â€”â€¢ oo, and 2 -Â£jBias(mhntt(t + An)\t) -m"(t) ( â€” U1U3 U0U3 â€” U1U2 82 + )/(u0u2 - ul) - 1 = op(l) nf{t)AnVar{mhntt(t + An)\t) -1 a28(l,8) U0 Ui Ui u2 \ "1 "2 / UQ UI Ui u2 oâ€ž(l) uniformly in t G [an, 1] with liminf an/hn > 1. Proof of Corollary 3.1: Taylor expansion of m{t + Aâ€ž) at t yields: (3.5) (3.6) m{t + An) = m(t) + Anm'(t) + ^m"(t) + o(A2n), uniformly for t G [0,1]. 2 A2 r, _2_ Al So â€”â€¢jBias(mkntt(t + An)\t) = -j [E(mhn]t(t + An)\t) - m(t + An)] (3.7) 25 A2 , n L 82hl E(mhntt(t + An)\t) - m(t) - Anm'(t) - ^m"(t) + o(A2) 1 E0o,\t) - m(t) N (1 I)A y82' 8'hi { hn (E0ht\t)-m'(t)) 1 E0o>t\t) - m(t) ^ - m"(t) + o(l) / J -m"(t) + o(l). (3.8) ^ K (E0u\t)-m'(t)) ) By result (3.2) of Theorem 3.1, as n â€”> oo this expression converges in probability uniformly to un ui 1 I Â«9 I m"(t) (ii)-"W U0 Ui Ui u2 = m"(t) ( u\ - U!U3 UQU3- UXU2 2> + 6* 8 So (3.5) is proved. Now consider the variance nf(t)AnVar(mhntt(t + An)\t) = nf(t)AnVar0Oit + PuAn\t) )/(u0u2 - ul) - 1 nf(t)AnVar { /i oX 0 fcK r/i oN = n/(t)Aâ€ž(M)Var { 0 hr ( \ *> / \ 1 0 1 nhnVar < Â» \ ^ 0 hn t 4 (3.9) (3.10) By result (3.3) of Theorem 3.1, as n â€”â€¢ oo the last expression converges in probability 26 to: 8{l,S)f{t) /(*) I UQ UI Ui u2 = *28(1,6) ) \U*1 U2 ) U0 UI Ui U2 U0 Ui Ui u2 \ *0 "1 \U*1 U2 ) ( V \8) U0 Ui Ui u2 uniformly in t. â€¢ Since / and m" are uniformly continuous on [0,1], Corollary 3.2 follows immediately from Corollary 3.1. Corollary 3.2 Assume the same conditions as in Corollary 3.1. As n â€”* oo, 2 2-Bias(mhntt(t + An)\t) -m O V P 6 )/(uÂ°W2 - ui) _ 1J = (3.11) nf(l)AnVar(mhn,t(t + An)\t) ( a2S(l,8) \X ( UQ Ui Ui U2 "5 Â«1 \Ul U2 ) \X( \ UQ Ui \Ui U2 J \6I = o,(l) (3.12) uniformly in t G [1 â€” pn, 1] with any sequence pn > 0, pn â€”â€¢ 0. Remarks: 1. As is usually done in regression problems, we consider the conditional bias and variance of our estimates. 2. Corollary 3.2 says that the dominant terms in both the asymptotic conditional bias and the variance of rhnntt(t + An) are equal to those of m/lnii(l + An) for t G [1 -/?â€ž,!]. 27 3. For a finite sample, An is given and hn will be chosen to minimize the asymptotic conditional mean squared error (AMSE) of m^n>1(l + An). Because of Condition 6', choosing hn is equivalent to choosing 6 = An/hn. Although minimizing the ac tual finite sample conditional MSE would be best, its form is far too complicated. 4. Based on the results in Corollary 3.1 one can write out the AMSE of m^nii(l-|-An) as the square of the asymptotic bias plus the asymptotic variance. Condition 6' sets a guideline on the magnitude of An, that is, on how far ahead one can forecast if the rate n-4/5 is to be achieved for the AMSE of m/lrJii(l + An). Stone [27] has shown that under appropriate regularity conditions, the optimal rate of AMSE for a nonparametric estimator of a twice differentiable function is n-4/5. If An grows faster (but still slowly enough so that An â€”> 0), for example, An/hn â€”> oo, a rate slower than n-4/5 can be achieved under appropriate conditions. This can be seen by the reasoning below. For An â€”> 0, the bias and variance of m^.nii(l + An) are Bias(rhhn<1(l + An)\t) Var(mhntl(l + An)\t) = E(mfcBll(l + An)-m(l + An)|*) = E (fa - m(l) + - m'(l)]|t) + O(Al), = Var(A,1i + &llAB|t) = Var0Otl\t) + 2AnCov0OAJ1A\t) + A2nVar0ltl\t). By the results in Theorem 3.1 for t = 1 and any An > 0 with A. n 0 Bias(mhntl(l + An)\t) Var(mhntl{l-[- An)\t) Op(h2n) + Op(hnAn) + 0(Al), Op(l/(nhn)) + Op(An/(nh2n)) + Op(A2J(nh3n) Op{l/(nhn)) (l + Op(An/hn) + Op(A2Jh2n)) . )) If AJh, â€¢n co, we have Bias{mhntl(l + An)\t) 0P(A2n), 28 Var(mhnA(l + An)\t) ~ Op(A2J(nh3n)). To minimize the AMSE, we need to make Bias2 and Var of the same magnitude, i.e., A* oc A2J{nh3n), or A2/*3 oc 1/n. If An/An -> oo, A2nh3n oc 1/n implies n/iÂ£ â€”â€¢ 0 and nAn5 â€”â€¢ oo. As a result, if An/hn â€”â€¢ oo under the conditions nh3 â€”> oo and A2/i3 oc 1/n, the AMSE of rhfcâ€ž,i(l + An) achieves the rate Op(An) (equivalently Op(An/(nh3)), which is slower than n-4/5 since nAhn â€”* oo. Recall that in Chapter 1.2.2, the algebraic substitution (1.11) is used to re-center the data. This recentering yields a simple dependence of AMSE on 8, thus making it easier to find the optimal bandwidth by finding the optimal 8. If the original formulation (1.9) is used with the data centered at 1 + An, then the estimate of m(l + An) is /3O,I+Aâ€ž- The formulae of the asymptotic bias and variance of A),i+AN wiii show clearly the advantage of centering data at 1 over centering at 1 + Aâ€ž. Under the same assumptions as in Corollary 3.1 but with the additional assumption that D < H, one can show that the asymptotic bias and variance satisfy _2_ h2 Bias(mhntl(l + An)\t) - m"(l + An) v2 - vxvz V0V2 - V{ 2 h2 Op(l), Â£(/WÂ» - m(l + An)\t) - m"(l + An)V* VlV* VQV2 â€” Vi (3.13) nhnf(l + An)Var(mhntl(l + An)\t) -I \X ( V0 Vi V\ v2 \ Vn V-> ( \V1 V2 J \ V0 V-L Vi V2 (3.14) where i>,- = /"f v{K(v)dv, v* = /~f v'K(v)2dv. Minimization of the resulting AMSE of fi01+An is computationally intensive since 8 in the upper limit of the integrals depends on hn. Therefore if AMSE(hn) is to be minimized over a grid 29 of values of hn, all the integrals have to be computed for each hn. If K is, say, the indicator function on [â€”1,0], these integrals can be found in closed form. If, however, K is a truncated normal density, then the intregrals must be evaluated numerically. There is another advantage of re-centering the data: after re-centering the data at 1, hn is the "real" bandwidth used in forecasting. If the data were instead centered at 1 + Aâ€ž, the effective bandwidth would be hn â€” An since there are no data beyond 1. Lemma 3.1 below is used in the proof of Theorem 3.1. Lemma 3.1 Suppose that g(-) and /(â€¢) are bounded over [0,1] and W(-) is square in tegrate over 1Z and that the i,s are iid with density /(â€¢). If nhn = nh â€”â€¢ oo, then as n â€”Â» oo, ^ Â£ W&j^MU) - I Â£ W(^)g(u)f(u)du = Op((nfc)-1/a) (3.15) uniformly for t 6 [0,1]. Proof: Let Z = (nh)-1 Â£?=1 W((t{ - t)/h)g(U). By Chebyshev's inequality, Z = E(Z) + Op((Var(Z)fl2). E(Z) = \E (Vc^Mto) = I Â£ W(^)g(u)f(u)du, Var(Z) = ^yÂ«r (V(*I^(tl)) * nh<W2^^) = \ C W\^-Â±)g\u)f(u)dulh. (3.16) nh Jo n Since g and / are bounded, and /0X W2((u â€” t)/h)du/h = f^^h\V2(s)ds < co, Var(Z) = 0(\), (3.17) nh 30 which does not depend on t. Proof of Theorem 3.1: Write â€¢ -f%- fiiiU - t))2K(^r-) = (Y- Aj3)'W(Y - A/3), (3.18) where Y = A 1 tx-t 1 tn-t P Vft/ and W = diag{K(p-Â±)). â€¢VTL The minimizer of (3.18) is: pt = (A'WAJ^A'^y, (3.19) provided that A'WA is invertible and positive-definite. So for t with A'WA invertible, E{Pt\t) = (A'WA^A'Wro where m = (m(ti),..., m(tn))*, and (3.20) Var(J3t\t) = <72(A'WA)-1A'W2A(A'WA) -l (3.21) Result (3.28) below shows that, when suitably normalized, A'WA converges in proba bility to a positive definite matrix. Since we are interested in convergence in probability of E{pt\t) and Var(Pt\t), it suffices to analyze (3.20) and (3.21). The asymptotic bias will be derived first. The Taylor expansion of m at t yields: m"(t) m(ti) = m(t) + m\t){U -t) + â€”^(U - *)2 + - *)2) 31 uniformly in ti with \U â€” t\ < hn. Only these t,s are considered because the support of K is [-1,0]. Therefore, for \U â€” t\ < hn m (U) = (l,U-t) ( m(t) ^ + ^Y^-qiK2 + h2no(qi) A,-/ m(t) N m'(t) + qihl(m"(t)/2 + o(l)) where A,- is the ith row of A and =(9i'92'---'^^((V) 'â€¢â€¢â€¢'(V)) Since Wi = if((*i â€” t)/hn) = 0 for â€” t\ > hn, so 'l 0 W m(i) y 0 hn j 'i oN V (KtWK)-xKtWqhn 2(m"(t)l2 + o(l)) \ ( AlW A] 0 Zin y 0 hn j ( = hn 2 1 0 o hn I -l Fort,i = l,2 1 0 y0hNJ -1 1 AtWq)(m"(t)/2 + o(l)). 1 0 0 /*â€ž nh. -A*WA y0hNJ 32 (3.22) (3.23) 1 1 hJ+t-2 nhn 1 1 (A'WA),,-â€¢+j-2 (3.25) nk. â€” â€¢ â€¢ â€¢ â€¢ â€¢ ' (3-24) which by Lemma 3.1, is equal to nn Jo \ nn J \ nn J Substituting s = (u â€” t)/hn yields (l-t)/hn t/hâ€ž For t e [aâ€ž,l] with liminf an/fcn > 1, [-t/hn, (1 - t)/hn] D [-1,0] = [-1,0] for n sufficiently large. So for n â€”> oo and t Â£ [aâ€ž, 1], expression (3.25) is equal to f si+i-2K(s)f(t + shn)ds + Op((nhn)-^2), (3.26) where Op{(nhn)~xl2) holds uniformly in t G [aâ€ž, 1]. The last expression converges (not only in probability) to r(l-t)/hâ€ž . . / K(s)st+>-2f(t + shn)ds + Op{{nhnyxl2). J-t/hn i: ,Â»"+J-2 K(s)f(t)ds = f(t)lli+j-2 uniformly in t Â£ [an, 1] because | /_Â° si+j-2K(s){f(t + shn) - f(t)}ds\ < sup \f(t + shn)-f(t)\ [Â° \si+i-2K(s)\ds te[oâ€ž,i] sup |/(r + s/iB) - f(t)\( f \s2^-2Usf'2{ fÂ° K(s)2ds) <=rnâ€ž.ii J-i J-i 1/2 <G[aâ€ž,l] > 0. (3.27) So 1 0 0 hn nh 1 A'WA 0 fcn -l /(*) Ui u2 (3.28) 33 uniformly in t G [an, 1], as n â€”+ oo. Similarly, for i = 1,2, 1 0 o hn \ -i nh. -A'Wq Ii = .l-Â±K(ijLzl) (h^l)'+1 nhn fr{ \ hn J \ hn J uniformly in i G [an,l]- This expression converges to /(i)u,+i uniformly in Â£ G [an,l], as n â€”> oo. Thus / -l 1 0 o K nhn A'Wq A /(*) / \ (3.29) By (3.23), (3.28) and (3.29), it follows that _2_ hi E(BQtt\t) - m{t) v hn (E01<t\t)-m'(tj) ( = m"(t) UQ UI J u2 \U3/ m"(t) I I \\ UQ UI -1 /(*) Ui u2 /(*) VÂ«3 J J (3.30) Ui u2 uniformly in t G [an, 1]. To calculate the asymptotic variance, first consider, nhnVar 11 u = nhno~2 1 0 0 hn 1 0 0 hn J ) (AlW AYX A'W2 A{A'W A) -l 0 h. 34 = (7 1 0 o K 1 0 0 K 1 o 0 hn \ nh, nhn -A*W2A -l 1 0 0 K -1 0 hn X 1 0 0 K (3.31) By calculations similar to those in the proof of (3.28) -l â€ž -i 0 K nh. -A'W2A /1 0 N 0 hn ( ux u2 j (3.32) Using results (3.28) and (3.32) in (3.31) gives // 1 Oil B0jt I , (nhn)Var ( /(*) 1 0 0 AB / V Ui Ui /(*) Â«0 Wl /(*) U0 Ui V J \ "1 / /(*) U0 Ui Ui u2 -1 + op(l) V Â«1 U2 J \ U0 Ui Ui U-i + oP(l), (3.33) Ui u2 uniformly for t â‚¬ [an, 1]. Under appropriate conditions, Theorem 3.1 and its corollaries hold for the fixed design where i,-s are pre-specified design points rather than random variables. The analogous theorem and corollaries for the fixed design are presented below. Theorem 3.2 For the fixed design case, assume the same conditions on m, K, hn: 1, 4, 5 and 6 as in the random design. In addition assume that K' is continuous, that t{ = 35 with 0 < ti < ... < tn = 1 and rU f: f ' f(t)dt = i/n, inf /(t)>0 Jo <e[o,i] (3.34) loii/i / continuous. Then as n â€”â€¢> oo, / EfA/j-rofi) ^ - m"(t) _2_ AH E(ft,t) - m(t) { hn (E01<t) - m'(t)) ) ( \ UQ UI Ui U2 o(l)e, (3.35) / 1 0 o K fit) \ UQ UI Ui U2 AM \ _1 / U0 Ui Ui u2 = o(l)e, (3.36) uniformly in t Â£ wifJi liminf an/hn > 1. The condition on the design points Â£,s in the theorem above is analogous to Condition 3 for the random design case. This condition serves as a guideline to the selection of the design points in an experiment. Corollary 3.3 Assume the same conditions as those in Theorem 3.2 except with Con dition 6 replaced by Condition 6' of Corollary 3.1. Then as n â€”â€¢ oo, sup jBias(rhhn,t(t + Aâ€ž))-nu\ / lU2 ~ "1^3 . U0U3-UiU2 ro(*H( 8* + 6 )/(u0u2 - u\) - 1^ = o(l), (3.37) and sup <6[aâ€ž,l] nf(t)AnVar(mhntt(t + An)) - a26(l, 6) I \ UQ UI -1 Ul U2 Un U, UQ UI Ui U2 o(l). (3.38) 36 Corollary 3.4 Assume the same conditions as in Corollary 3.3. As n â€”*â€¢ oo, sup te[wâ€ž,i] -^Bias(rhhntt(t + Aâ€ž))-(tu\ ~ "i"3 , u^-uiu2 2\ A (3.39) sup tG[l-Pn,l] nf(l)AnVar(mhnit(t + An)) - a2S(l, 6) ( \ UQ Ui V / Ui U2 \ _1 / ' \ UQ UI Ui U2 = o(l). (3.40) The proof of Theorem 3.2 for the fixed design case is along the same lines as for the random design case except that the following lemma is used instead of Lemma 3.1 when replacing a sum by an integral. The proofs of the corollaries of Theorem 3.2 are the same as those of the corollaries of Theorem 3.1. Lemma 3.2 Assume f,W and g' continuous on [0,1] and f satisfies (3.34)- Then 3 C > 0, such that C sup tâ‚¬[0,l] < n h2' (3.41) Remark: Note that the error term 0((nh2)~l) in Lemma 3.2 is smaller than the error term 0({nh)~ll2) in Lemma 3.1 for h of order n-1/5. The proof of Lemma 3.2 requires the result of Lemma 3.3. Lemma 3.3 If infte[0,i] f(t) > 0, and if 0 = i0 < h < ... < tn = 1, then 3 C2 > 0 such that supin \U â€” \ < C2/n. Proof: Let C2 = 1/inf<e[0)i] f(t). By the assumption on /, -= [*' f(u)du> .[*' inf f(u)du = C2-l{ti-ti-i). n Jti-i Jti-i <e[o,i] 37 Thus U â€” < C2/n, Vi,n. Proof of Lemma 3.2: l rKâ€ž,u-t f(u)du ,_1 L " J EE A + B. Consider the first term in (3.42). by assumption (3.34). The lemma will be proved if \B\ < C/(nh2) for some C > 0. Since /, W and g' are continuous over [0,1], there exists C\ > 0, such that (3.42) (3.43) wC^)g{u) - W^-j-l)g(ti) f(u) < d u â€” ti (3.44) Thus 1 A /â€¢*â€¢â€¢ 1*1 < TE/ C u â€” ti < 1 CxC22 c â€¢ (3.45) nh2 nh2 The results in this chapter will be used in the next chapter for the estimation of an optimal bandwidth for forecasting. 38 Chapter 4 The choice of a smoothing parameter The plug-in approach and the cross-validation approach are commonly used in choosing a bandwidth for a local regression estimator in the non-forecasting setting. Each approach seeks to minimize some measure of the discrepancy between the estimated and the true function and tries to estimate the bandwidth at which the minimum of such a measure occurs. There is abundant literature on the merits and limitations of both methods. For a comprehensive review, see [7], [12], [13] and [19]. The ideas of choosing a bandwidth for forecasting by both approaches will be developed in this chapter. The applicability of these approaches relies upon the fact that the data are independent. Any correlation structure in the data will affect any automatic selection of bandwidth (see, e.g. [16]). 39 4.1 The plug-in approach A plug-in bandwidth is an estimate of an "optimal bandwidth" in a certain sense. To estimate the function m at t by rhhn(t) (defined in 1.8), an optimal bandwidth hopt(t) may be defined to be the one that minimizes the mean squared error (MSE) of m/ln(t): hopt(t) = argminhnMSE(t, hn), where MSE(t,hn) = E(rhhn(t) - m(t)\t)2 in the random design and MSE(t,hn) = E(rhhn(t) â€” m(t))2 in the fixed design. Since the idea is the same for both designs, the selection of an optimal bandwidth will be described for the random design. The bandwidth hopt(t) is called a local bandwidth since MSE(t, hn) is a criterion of goodness of fit at a single point t. However, if m is believed to be reasonably smooth, a constant bandwidth for all t will suffice ([7]). In such a case, an optimal bandwidth hfpt may be defined to be the one that minimizes the sum of the mean squared errors: h^pt = argrninhnMSEG(hn) where MSEG{hn) = n'1 Â£?=1 E ((rhhn(ti) - m(ti))2\t). Since hÂ°pt will be used to esti mate m(t) for all t, it is called a global bandwidth. Another commonly used criterion for a global bandwidth is the integrated mean squared error, MISE{hn) = f1 MSE(t,hn)dt Jo = jT E ((rhhn(t) - m(t))2\t) dt; (4.1) a global bandwidth can be obtained by minimizing MISE(hn). MSEa(hn) may be viewed as a discretized version of MISE(hn) when the Â£,s are equally spaced over [0,1]. Note that all three criteria mentioned above involve the unknown function m and the data {(^i,^)}". So the optimal bandwidth depends on m and a2 and thus has to be estimated. The plug-in approach estimates the optimal bandwidth by minimizing an estimate of the asymptotic expression for MSE or MISE. 40 The criterion MSE(t, hn) will be used hereafter. The context of the plug-in approach that of non-forecasting and some relevant results will be presented first and then the generalization of the plug-in approach to the forecasting setting will be discussed. 4.1.1 Introduction In this section suppose Y{ = m(t,) + e,-, where the e,s are independent with mean 0 and variance cr2(Â£,). When the errors are homoscedastic, o"2(i,) = a1. Results for heteroscedastic errors in the non-forecasting problem will be presented because they encompass the special case of homoscedastic errors. In particular, a technique by Fan and Gijbels [7] of estimating m"(l) in the case of heteroscedastic errors will be described and used in forecasting. In general, to estimate wSv\t)/v\ = ft, v â€” 0,1,..., a polynomial of degree p (p > v) is fitted to the data with the following fitting criterion: A = argmin jjY{ - Â£ &â€¢(*,â€¢ - tyfKip^), (4.2) and m/,n'"'(t) is set to v\$vt. In curve estimation, the kernel K in (4.2) is usually taken to be symmetric around 0 over either the reals or [â€”1,1]. The performance of rhhn^\t) is assessed by its MSE: E j (rhhj^(t) â€” mM(t)J2 |t-j. According to the general results in [7], when p - v is odd, the asymptotic MSE (AMSE) of rhhn^\t) is Qi i2.2(p+i-</) , g2(*) /43\ ft+iMâ€ž + a" f(t)nhn+^ V-6* Here a EE (a0,ai,...,apy = diag(S~1S*S~1), (4.4) 6 EE (6o, 6P)* = S-1(sp+i,...,s2p+2)\ (4.5S = (5,-j) with sij EE si+j-2 = J ui+j-2K(u)du, (4.6) S* EE (,?.) with EE ,?+._A = Jui+i-2K2(u)du. (4.7) 41 The first term in (4.3) is the square of the asymptotic bias which depends on m(p+i)(2) = (p+l)!(5P+1) while the second term is the asymptotic variance which depends on cr2(t). For example, when m(t) is estimated by a local linear estimator, i.e., v = 0 and v = 1, the square of the asymptotic bias of rhhn(t) is P\b\hn (or m"(t)2blhn/4) and its asymptotic variance is a0a2(t)/(f(t)nhn). Formula (4.3) shows that if m(p+1)(i) ^ 0, a large bandwidth hn will create a large bias but a small variance, and a small bandwidth will yield a small bias but a large variance. Requiring both the bias and the variance to be small at the same time is setting conflicting goals for hn. Therefore a tradeoff between the bias and the variance is needed and this is achieved by choosing hn to minimize the asymptotic MSE of the estimator. From (4.3), the bandwidth that minimizes the AMSE of rhhâ€ž^\t) is: _ / (21/ + l)aâ€ža2{t) \ 5fe ^(t)-U(p+i-^)WiÂ«/(oJ ' ( ] For example, when m is to be estimated by a local line, i.e., v = 0 and p = 1, this locally optimal bandwidth is *-*>- (^)*-Gs^)*"^ Plugging this optimal bandwidth into the formula for the AMSE of rhhn(t), one can see that the rate of n-4/5 is achieved for its AMSE. Note that formula (4.8) depends on two unknowns, a2(*)and m(r+1\t) = /3p+1(p-rl)\. These unknowns need to be estimated and these estimates are then plugged into formula (4.8) to yield an estimate, hu,opt(t), of the optimal bandwidth hv<opt(t). For example, if the curve m (v = 0) is being estimated at t, to get an optimal bandwidth one has to estimate m"(t) and <r2(i) first. 42 4.1.2 Plug-in approach using complete asymptotics for fore casting (CAMSE) We only consider the case of homoscedastic errors, i.e., o~2(i) = a2. In forecasting, the criterion that the plug-in approach considers is also the mean squared error: MSE(hn) = MSE(l + An,hn) = E([mhntl(l + An)-m(l+ (4.10) Again, only the random design case will be presented since the ideas and the results are the same (under proper conditions) for both designs. Under Condition 6' of Corollary 3.3, for Aâ€ž given, the asymptotic MSE(hn) can be written in terms of 8 = An/hn, so the notation AMSE(8) is used instead. By Corollary 3.1 or 3.3, the AMSE of mhnA(l + An) is AASOVf^ m"(l)2 ( u\ - UXU3 U0U3-UiU2 2 V AMSE(8) = â€”I H-j5 + s )/(UoUi ~ ui) - 1 1 + 4 a28 n/(l)A, â€¢(1.*) UQ U\ UI U2 Un Â«i UQ UI Ui U2 (4.11) If the first term of AMSE(8) is a monotone and non-increasing function of 8, the above expression shows, in an asymptotic sense, that for a given An, a very large 8 (corresponding to a very small hn) will yield estimates with a small bias but a large variance and that a small 8 (corresponding to a large hn) will yield estimates with a small variance but a large bias. For a general kernel function K, the relationship between hn and the asymptotic bias is not as obvious as it is in the non-forecasting setting. In the forecasting setting, a trade-off between the asymptotic bias and the asymptotic variance can be achieved by choosing the hn = An/8 with 8 minimizing AMSE(8). Observe that as 8 â€”* 0 and +oo, AMSE(8) â€”* +oo. So a minimum of AMSE(8) in (0, +oo) is guaranteed. The minimum can be found by setting the first derivative of 43 AMSE to zero and choosing the positive root which gives the smallest AMSE. Whereas in curve estimation for the non-forecasting setting an explicit formula (4.8) for the optimal hn exists, no such formula exists in forecasting. Finding the optimal hn or equivalently 8 for forecasting, calls for solving a seventh degree polynomial equation (the one resulting from setting the first derivative of AMSE(8) to zero) for 6. An alternative method for minimizing AMSE is by a grid search. Note that like in curve estimation, estimating the optimal hn in forecasting requires the estimation of m" and a2. This issue will be discussed in Section 4.1.4. 4.1.3 The plug-in approach using the finite sample variance for forecasting (HAMSE) Recall that the discrepancy measure (4.10), the MSE of the forecasting estimator, is comprised of two parts: a bias component and a variance component. The bias com ponent involves the unknown function m but by (3.21) the variance component equals <72(1,Aâ€ž)(A*WA)*A*W2A(A*WA)(1, Aâ€ž)*, which does not involve m. The estimation of the variance parameter a2 is relatively easier than the estimation of m". Therefore we can directly use the exact variance expression rather than using its asymptotic expres sion [7]. The asymptotic variance describes the variability when n â€”> oo. Specifically, the following can be used in lieu of AMSE. Definition 4.1 The asymptotic MSE using the finite sample variance for the forecast ing estimator rhhn,\(l + An) is defined as: HAMSE(S) = ^An (<Â±^ + â„¢ +<T2(1, An)(AiVTA)tAtW2A(At WA)(1, An)*. (4.12) 44 Expression (4.12) uses only "half" of the asymptotic results of the bias and the variance of m/ltl)1(l + Aâ€ž) in that HAMSE is defined as the sum of the asymptotic bias and the finite sample variance of the forecasting estimator, hence the acronym "HAMSE". Recall that W = diag(K((U - l)/hn)) = diag(K(8(U - 1)/Aâ€ž)). The optimal 8, 8opt, that minimizes HAMSE(-) can be found by a grid search and the optimal bandwidth is set to be: hnAMSE,oPt = An/8opt. Note that both the plug-in approach using AMSE and the plug-in approach us ing HAMSE are based on the same discrepancy measure, the mean squared error of rhhâ€ž,i(l + An). The difference is that the latter approach uses the exact variance of ^/iâ€ž,i(l + An) in the variance component instead of its asymptotic expression. 4.1.4 Estimation of m" and a2 â€” an introduction The plug-in approach in the non-forecasting setting calls for the estimation of the second derivative ra" of the unknown function and the variance function a2. When homoscedastic errors are assumed, a2 can be estimated by the Rice estimator [25]. Definition 4.2 The Rice estimator of the variance a2 is defined as: ^ = E%^f. (4.13) Under certain conditions, the Rice estimator is n1/,2-consistent, which means ^^(a^^ â€” a2) â€”+ 0 in probability. Other estimators of a2 can be found in [11]. When heteroscedastic errors are assumed, Fan and Gijbels propose that o~2(t) can be estimated from residuals of the local polynomial regression [7]: 45 where ^ 1 ti-t ... (u-ty^ , W = diag(K((tj-t)/hn)), (4.15) yi tn-t ... {tn-ty j p is the degree of the polynomial and Y = (Yi,..., YnY = Ap/3t, (3t being the usual non-forecasting estimate as defined in (4.2). Clearly cr2{t) can be used in the cases of either homoscedastic or heteroscedastic errors. Note that the estimation of cr2(t) requires again a choice of a bandwidth which may differ from the bandwidth for curve estimation. The estimation of m" requires a higher order method. Recall that is estimated by locally fitting a pth degree polynomial with p > v. Ruppert and Wand [26] recom mend that p â€” v should be taken to be odd to obtain an accurate estimate of m^(t). So m"(t) should be estimated by locally fitting at least a cubic polynomial. Again the estimation of m" requires the choice of an optimal bandwidth, which is different from (usually bigger than) the bandwidth that is used for estimating the function m itself. To summarize the idea of the plug-in approach, to get an optimal hn for curve estimation one needs to estimate a2 and m". The estimation of a2 and m" requires two additional bandwidths whose optimal values depend on higher derivatives of m. To avoid this spiralling argument, one has to come up with an initial bandwidth for estimating derivatives and a2. Therefore in the plug-in approach, the problem of choosing a good initial bandwidth has attracted special attention ([26],[12],[7]). 4.1.5 Estimation of m" and a2 for forecasting The discussion will be restricted to the homoscedastic errors case. Recall that the forecasting estimator is defined as m/lnii(l + An) = y901 + ^jAn where fcx = argminj^iY - 80 - frfr - l))2K({ti - l)/hn). l 46 The objective is to choose an hn = An/8 such that AMSE(8) or HAMSE(8) achieves a minimum. Since both objective functions involve the unknowns a2 and m"(l), these need to be estimated first. Existing techniques for estimating both in curve fitting can be applied directly to forecasting. For estimating a2, aRice suffices because of the assumption of homoscedastic errors. Estimation of m"(l) is a much more complicated matter. As discussed earlier, m"(l) can be estimated by fitting a local cubic polynomial around t = 1. Let v = 2, p = 3 and t = 1 in (4.2), that is, let & = argmin J2(Yi - 1)'')2#(^), (4.16) i=l j=0 and let mâ€žnii"(l) = 2ft,i. Note that the kernel function K and the bandwidth hn used to estimate m"(l) can be different from those used for forecasting. Formula (4.8) gives the optimal bandwidth for estimating m"(l) as follows: = (4'17) which depends on a2 and one higher unknown derivative m^(l) ( = 4!/?4). Ordinarily the fourth derivative of a function m is difficult to estimate, partly because the optimal bandwidth depends on higher order derivatives of m. However, an idea due to Fan and Gijbels [7] eliminates the need to estimate m^4\l). To avoid estimating higher order derivatives when estimating m^ by a pth degree polynomial, these authors have introduced a statistic RSC, RSC(t, hn) = a2(t){l + (p + 1)V}, (4.18) where o2(i) is as defined in (4.14) and V is the first diagonal element of the matrix (A* WAp)~1ApW2Ap(ApWAp)"1. The motivation for using RSC is that its minimum reflects to some extent a trade-off between the bias and variance of the fit. 47 Fan and Gijbels [7] have shown that the optimal bandwidth hQ(t) that minimizes the asymptotic value of E(RSC(t, hn)\t) is: where Â£, _ S2p+2 â€” (Sp+l, â€¢ â€¢ . ,S2p+l)S~1(Sp+1, ... ,S2p+lY ^ 2Q^ and Sj is as defined in (4.6). Comparing (4.8) and (4.19) yields: Â» M ( (2J/ + 1) auCp\^ Applying this relationship to the estimation of m"(l), i.e., t = 1, v = 2, p â€” 3, leads to /*2,0P< = />2,oPt(l) = (f^|)9 W (4-22) The point is that h0(l) can be estimated from the sample by minimizing RSC, yielding an estimate of h2,0pt since the scaling factor in (4.22) depends on the kernel function only. Thus the optimal bandwidth for estimating m"(l) can be estimated from (4.22) using the information in the finite sample at hand. The following algorithm summarizes the steps to get the optimal bandwidth for estimating m"(l). Algorithm 1 : 1. Fit a local cubic polynomial centered at t â€” 1 as in (4-16) 1Â° tne data for each value of hn on a grid; 2. find h0(l) that minimizes RSC(l,hn) on that grid of hn; 3. get h2t0pt via (4-22); 4- estimate m"(l) by fitting a local cubic polynomial to the data using (4-16) with bandwidth h2,opt-48 4.2 Cross-validation 4.2.1 Background to the non-forecasting setting Cross-validation is a technique commonly used to choose a smoothing parameter. Most references in literature applies cross-validation in the context of curve fitting by splines or kernel estimators. However the idea of cross-validation is the same for both techniques and can be applied in other contexts, e.g., bandwidth selection in local linear regression. For a brief historical note on cross-validation, see [15] (pages 152-153). This section will present the motivation and results on cross-validation in the setting of curve fitting. A generic symbol A is used for the smoothing parameter. In the linear operator approach, A will be the parameter used in the penalty term. In the local regression approach, A will be hn, the bandwidth of the kernel. Let m\(-) denote the fit to the data Y{S with smoothing parameter A. Ideally, one would want to choose a A such that the prediction error is minimized. A criterion reflecting this objective is independent of the Y{S but have the same distribution as the i^s. To see the relationship to the criterion MSEG(X) used in the plug-in approach, note that (4.23) where the Y*s are new observations made at the design points Thus the l^*s are E(PE(\)\t) m(ti))(m(ti) - mA(*,â€¢)) i=i = a2 + MSEG(X). (4.24) 49 So minimizing the expected prediction error E(PE(X)\t) is equivalent to minimizing MSEG(\). Of course the 3^*s are unknown and so to implement this idea the Y*s have to be replaced by observables. Simply minimizing (4.23) with the Y*s replaced by the Y{S would usually result in choosing A = 0, the A that minimizes 2~Zi(^' â€” â„¢A(^,))2 and yields an estimate of m that interpolates the data. Thus the fitted curve would be very bumpy. This is a direct consequence of underestimating the prediction error PE(X) because rh\(-) is "closer" to the Y{S than to the Y*s since rk\(-) is fitted to the former. The idea of cross-validation is to correct this downward bias in the estimation of PE{\). Note that Y* is independent of rh\(ti) but Y{ is not. Cross-validation gets around this dependence by substituting mÂ£~^(2,-) for rh\(ti), where m^(-) is the curve fitted to the same data but with the zth data point (i,-, Yi) removed. As a result Yi and m^~x\ti) are independent. Definition 4.3 The cross-validation function is defined as CV(X) = -f2(Yi-m{-i\ti))\ (4.25) where for each i Â£ {1,... , n}, m\~%\-) is estimated by the same procedure as rh\(-) but with (ti,Yi) removed from the data set. The cross-validation choice of the smoothing parameter, \cv, minimizes CV. By taking the expectation, we see that CV(X) is approximately unbiased for PE{\). E{CV{\)\t) = -E(f^(Yi-m(ti))2 + 2it(Yi-MU))(â„¢(ti)-n \i=i i-i + Â±(m(U) - mi-^ti))^) t=i / = ^ + ^(EM*0-4"o(*.0)2l*) w <72 + MSEQ(\). (4.26) 50 Under suitable conditions, Xcv, the smoothing parameter chosen by cross-validation, can be shown to converge almost surely to the optimal smoothing parameter XPE, the minimizer of the prediction error. Hardle and Marron [14] have shown such a result for the Nadaraya-Watson estimator. On a practical level, cross-validation is a computationally intensive method. Since CV can rarely be minimized analytically, usually CV is minimized on a grid of A in a specified interval [A, A] . Therefore for each combination of grid point Xj and omitted design point t,, the entire curve fitting procedure has to be repeated. This amounts to n\ â€¢ n curve fits with n\ being the number of grid points for A. In some cases, a short-cut formula is available which expresses CV(X) in terms of quantities which can be evaluated directly from applying the curve fitting procedure once to the entire data set. With this formula, only one regression curve is fitted for each value of A. The following lemma gives the short-cut formula and conditions for its validity. Lemma 4.1 For each i Â£ {l,...,n}, let m^-^(-) be the fit with (ti,Yi) removed from the data set {(tj,Yj)}"; and let ^~^*(-) be the fit with (ti,Yi) replaced by (Â£,-, ra^-^(Â£,â€¢)) in the data set {(tjjYj)}â„¢. Suppose that rh\ = (rh\(ti),... ,rh\(tn)y = H\Y, with H\ not dependent on Y and with its ith diagonal element [H\]a ^ 1. If for each i, m^~^*(t,) = rn^Cl\ti) for any set ofYis, then Yi-m\ '(ti) = i_[H ] Â» and so (4.27) Remark: The short-cut version of the CV function (4.28) casts insight on the idea of cross-validation. As discussed earlier, n_1 Y%{Yi â€” rh\(ti))2 tends to underestimate 51 the prediction error. By scaling the iih term by 1/(1 â€” [H\]u)2, CV function tries to eliminate this bias. Proof: Since rh\ = H\Y and H\ does not depend on Y, the following holds: ^i_,>(*.-) = EWi + [^A]Â«mi"Â°(*0- (4-29) Using the assumption m^-^*(i;) = m^~^(t,) yields: Yi-= Yi-m^m(ti) = Yi- YlWijYj - [frA]Â«m^(t,-) 1 = Yi-mM + iHxMYi-m^iti)). (4.30) Therefore, ^ - m^iU) = (Yi - mA(*,))/(l - [Hx]n). Applying this relationship to the definition of CV(X) (4.25) gives the short-cut formula (4.28). â€¢ The above applies to the scenario of curve fitting in the domain of the design variable when a global smoothing parameter A is desired. If one wants to choose a local value of A depending on t, localization of the CV function is straightforward by incorporating a weight function into CV. For example, a CV function for a local smoothing parameter can be defined as CV(X, t) = n-1 J2(Yi - m^iUtfwiit), (4.31) l where iw,-(i) is a weight function that gives more weight to points near t than the rest. This weight function Wi(t) can depend on another smoothing parameter A*. The A = X(t) that minimizes CV(-,t) will determine the amount of smoothing around t. 52 4.2.2 CV for the linear operator approach The minimizer of the penalized likelihood, F(Y,Lu...,Ln) = n-1 fjXi - L^I))2 + \f I"{ufdu, 1 J for I â‚¬ W^O,1], gives an estimator satisfying the conditions of Lemma 4.1. So the short-cut formula holds for this case. For proof and details, see [31]. 4.2.3 CV for the local regression approach in curve estima tion The local polynomial regression estimator, rhhn(t) = /30t, of m(t) with /3t defined in (4.2), satisfies the conditions for the short-cut formula. The ith row of the H\ matrix in this case is [Ap],-[(AÂ£WAP)-1AÂ£W], where [Ap],- is the ith row of Ap, Ap and W are as defined in (4.15) with t â€” ti, i = 1,..., n. Obviously, this H\ matrix does not depend on Y. The lemma below verifies the other condition in Lemma 4.1 for the short-cut formula to hold. Lemma 4.2 For each i â‚¬ {1,..., n), let {wjtti}\ be a set of non-negative numbers with u)i,n > 0. Suppose that f3t. minimizes sfc-E #(*;-*.â€¢/) (4.32) \ 1=0 / and that fi\. ^ minimizes E (Y> - E m - *o')2 Â«>j,u + - E m - ti)1) j^i \ 1=0 J \ 1=0 ' = E fYi ~ E - ti)1) â„¢i,u + fc - ft)2 witti. (4.33) &i \ i=o ) v J Then /30<^ = /30t'^* for i G {1,... ,n}. Here J30^ and (3Qt^ are the first component of ^ and p[. ^ respectively. 53 Remark: The above lemma is the essence of the short-cut formula for regression with the fitting criterion E fa-!>(*;-')') i V ;=o / with rhhn(t) = {30. In particular the short-cut formula holds for local linear regression estimator (p = 1) and local constant regression estimator (p = 0, Nadaraya-Waston estimator), that is, when Wjyt = K((tj â€” t)/hn) with K being a kernel function. Proof: By the definition of /3t. , E fa - E ttf'to - **)')2 + (4? - 2 ^ < E fa - E - *0')2 + te? - /CV < E fa - E A(Af>(*i - *â€¢â€¢)')2 (4-34) \ /=o / since f}\ ^ minimizes (4.32). Therefore J3Qt^ â€” /?o,v â€¢ 1=1 4.2.4 CV for forecasting: FCV The following two sections contain discussion of ideas of cross-validation for forecasting based on different assumptions. Unlike the case of curve fitting in the data range where the Y{S can be used to "cross-validate" the estimates ihhn(ti) computed with bandwidth hn, the forecasting case has no "future" data beyond 1 to cross-validate m/lnii(l + An) for hn. One natural idea would be to leave out a portion of the most recent data, use the forecasting estimator and the rest of the data to estimate the left-out data and minimize some measure of the discrepancy between the left-out YiS and their estimated values to choose an optimal bandwidth hn for forecasting. This idea has the flavour of 54 > conventional cross-validation in that a statistical model is built on part of the data and then "validated" by the rest of the data. A formal definition of such a CV function for forecasting can be formed as below. Definition 4.4 The cross-validation function for forecasting is defined as: FCV(hn) = J2(â„¢^,U-Aâ€ž(U) - Y)2, (4.35) where S = {U : ti Â£ [1 â€” pAn, 1]} with p > 1 and \S\ = #2tS Â£ <S*. Remarks: 1. Note that rhhn,ti-Aâ€ž(ti) as defined in Definition 3.1 centers the data at i,- â€” An and predicts Aâ€ž ahead. For each left-out 2,-, the forecasting estimator m^niii_Aâ€ž(i,) uses the rest of the data prior to the time point 2,- â€” An to estimate m(2j). Then rhhn<ti-An(ti), the "forecast" at t = 2,-, is compared to Yi. 2. From Corollary 3.2 or 3.4, FCV(hn) could use more i,s, with Â£ [1 â€” pn, 1] for pn â€”> 0. But for computational convenience, FCV(hn) leaves out the part of the recent data with 2; â‚¬ [1 â€” pAn, 1]. 3. Independent work by Hart [17] uses a cross-validation idea, TSCV (time series cross-validation), similar to FCV in the context of forecasting. He assumes an AR(1) model for the e,s and forecasts one step, 1/n ahead, using a locally constant regression estimator. In contrast, we assume independent e,s and forecast Aâ€ž oc n-1/5 ahead. Hart and Yi [18] recently modified TSCV to OSCV (one-sided cross-validation) for the bandwidth selection for curve estimation by local linear regression. They showed that the optimal bandwidth estimated by OSCV is less variable than that estimated by the traditional CV as defined in (4.25). In both these papers ([17], [18]), the asymptotic mean squared error of the forecast has a simple form, B2hn + V/(nhn) where B and V are constants depending on K, m", 55 a2 and /, since the estimate is forecasting 1/n ahead. But AMSE(6) has a more complicated form than B2hn + V/(nhn) since we are forecasting further ahead. Recall that AMSE(6), the asymptotic mean squared error of m^â€žii(l +An) is used in the plug-in approaches to estimate the optimal bandwidth for forecasting. The following corollary of Theorem 3.2 stated for non-random i,s sheds some light on the relationship between FCV(hn) and AMSE{8). Corollary 4.1 Assume the same conditions as in Corollary 3.4- Then n4/5{E(FCV(hn)) -a2- AMSE{8)} = o(l). (4.36) Proof: Since rh-hn,ti-&n(ti) is independent of Yi, E(FCV(hn)) = |4| Â£ {E(â„¢Â»n,L-*M ~ m{ti)f + E(m(ti) - Y)2} = â€¢Â£[ | Â£ E(mhntt,_An(ti) - m(ti))2 + a2 j . (4.37) By Corollary 3.4, sup n4'5 [E [mhntU.An(ti) - m(ti)}2 - AMSE(6)) = o(l). As a result, nAl5{E{FCV{hn)) -a2- AMSE(8)} = o(l). â€¢. The lemma below will ensure that the stronger uniform results in Theorem 3.2 and Corollaries 3.3 and 3.4 hold for 2,-s random. Then by the same proof, the above corollary is true for 2,-s random. Lemma 4.3 Suppose that 0 = t0 < tx < ... < tn < tn+1 = 1 and that tx,...,tn are order statistics of a distribution with density function f. Suppose that W and W are 56 bounded in the support ofW, g' and f are continuous over [0,1] and infie[o,i] f(t) = a > 0. Then ifnhn â€”â€¢ oo, sup <e[o,i] 0 (4.38) in probability as n â€”> oo. The proof of the above lemma will use standard results on order statistics from the uniform distribution over [0,1] (see, e.g. [21]). Those results are stated in the lemma below without proof. Lemma 4.4 Let Â£/(,), i = 1,..., n be order statistics from the uniform distribution over [0,1]. Let C/(0) = 0, and U(n+i) = 1. Then *â€¢ E{U(i)) = ^, Var{l%) = for i = 1,.. 3. E(U{i) - */(,_!)) = E(U(1)) and E(U(i) - U^f = E^), for i = 1,..., n + 1. Proof of Lemma 4.3: Since 1 "+1 tâ€”t (n + l)hn hn n + l nhn Y Â«n [n + l)hn hn (4.39) we have sup te[o,i] sup te[o,i] n + l + tn+l â€” t (n + l)hn hn The assumptions on W', # and hn ensure that the last expression will go to 0 as n â€”â€¢ oo. Therefore this lemma is true if r^uT E W(^)g(U) ~Y [ W(^)(gf)(u)du (n + l)hn j hn hn Jo hn sup te[o,i] (4.40) 57 in probability as n â€”> oo. Consider the integral in (4.40): _L f1 w{u-zl)(gf){u)du tin JO tln = rE/ wCâ€”-){gf){u)du Hn jâ€”\ Jtiâ€”1 *in = rSr W{t\1)giti)mdu + fln )=1 -'t.-i T- E1 r HV^M - w&^Wi)] â€¢ In ,=i Jti-i L /ln /tn Plugging (4.41) in the left hand side of (4.40) gives: 1 n+1 t - â€” t 1 71+1 rU tâ€”t E w(\^)g(u) - f E / ^(^-M*0/(Â«)Â«* (re + l)/in j â€¢ ftn /in i=1 Jfi_i An (4.41) sup te[o,i] = sup \A(t)-B(t)\. te[o,i] By Chebyshev's inequality, for any e > 0, (4.42) P sup \A(t)-B(t)\ > e < \te[o,i] / Â£ (suP<6[0)1] |A(t)-5(i)|) < Â£ (supÂ«â‚¬[0fl] |A(t)l)+^(supte[0|1]|^(t)|) (4.43) Therefore the lemma is proved if the right hand side of (4.43) â€”> 0 as re â€”> oo. Consider E (supte[01] |A(2)|) first. Ait) = ^E^^-fsf w(^)^(to/(Â«)dÂ« (n + l)hn i hn hn iZ[ Jti-i hn = ^Â£V(^Mto^-f EV(^M*o/fi (4-44) hn i Â«n n + 1 ftn ftn which can be written in terms of the order statistics of a uniform distribution. Let U(i) = J fiu)du, i = 0, ...,n + l. 58 Then ..., ?/(â€ž) are order statistics of the uniform distribution over [0,1] and U(o) = 0, <7(â€ž+i) = 1. Using the facts 1 n + Y = EiUv-Uw), / f(u)du = U(i) - Â«7(,_i), t = l,...,n + l Jti-l in A(t) and then re-arranging the terms yield: -i n+l 4 f Mi) = rE^V^Kr^} t'=l â€¢j n+l J ^ n j=l n (4.45) Recall that U(0) = 0 ari(1 ^(n+i) = 1, so - f E w(^)^(to - ^.--i)} ,=2 = i E HHr^*-0 ~ w(^rM*0} {EUÂ«-x) ~U(i-^ -(4-46) Using the condition |(W#)'| < C in (4.46) yields: sup |A(t) | < ^E <G[0,1] "n ,=2 Because of the fact that EU(i-\) â€” U(i-i) (4.47) rU U(i) - r/(,-_i) = / /(Â«)<Zu > mif{t)(ti-ti-l) = a(ti-ti-1), tÂ£[0,l] (4.48) suPte[o,i] \ can ^e bounded by c n+l ah2 ""n ,=2 E ^(0 _ ^(Â«'-l) - ^(i-l) 59 Schwarz' inequality can be used to bound the expected value of the above by C n~^~^ / 2\ 1/2 E {E fa) " tf(.--i>) ) {VarWw)) Using fact 3 of Lemma 4.4 then yields sup \A(t)\) < -TJ {U(i)) E (^K^-i))) <6[0,1] / afln x 7 t=2 Again, by facts 1 and 2 of Lemma 4.4, the last expression equals C ( 2 V/2A/ i(n + l-i) \1/2 /2 ahl V(Â» + + 2),/ Â£ V(" + 1)2(" + 2) c ahl \(n + l){ Since n + as n â€”> oo, we have E sup = O \*â‚¬[0,1] / 0 under the condition that nhn â€”â€¢ oo. Now consider Â£ (sup1Â£[01] |i?(t)|), where 1 iln ,_i â€¢'ti-i L (in ftn (4.49) ^1/2 "+1 I _l_ Â± (J- (i _ n + 2)J (n + 2)V2+ l ^ Vn + 1 V n + 1/; 1/2' (4.50) (4.51) By the assumptions of the lemma, W and W are bounded in the support of W, and / and g' are continuous on [0,1]. So there exists C\ > 0, such that I u â€” ti /(Â«) < Cx hr (4.52) 60 Therefore, 1 ^ r*i sup \B(t)\ < T-E/ C-i â€” au ii_i â€” U du n+l "n i=l < Â§iE(^)-^-i))2, (4-53) "n " Â«=1 by (4.48). Finally, <?i" v 2 c?hnKn^ J(n + l)(n + 2) = (4'54) under the condition that nhn â€”> oo. So by (4.50) and (4.54), the right hand side of (4.43) -> 0. â€¢ Unfortunately, no short-cut formula for FCV has been identified, making the imple mentation of this criterion computationally intensive. The cross-validation criterion in the next section has a short-cut formula. 4.2.5 CV for forecasting, BCV FCV uses the data up to â€” An to forecast ra(Â£,). In contrast the "backcast" cross-validation BCV in this section will use data to "forecast" the past. To understand the definition of BCV, recall that rhhn,i(l + s) = /?01 + J3ltls estimates m(l + s) with data centered ait â€” 1. Suppose that m is really a line. Then if m^nii(l + s) is a good estimate 61 of m(l+s) for s G [â€”An, 0], rh-hn,i(l+s) is an equally good estimate of m(l+s) as it is for s G [0, An]. If m is not a line, we know by a Taylor expansion that m is approximately linear in a neighbourhood of 2 = 1. So we can fit a line locally with the data centered at 1 and estimate m(l + s) using /301 and to calculate mâ€žn)i(l + s). The accuracy of the fit is then assessed by the data in [1 â€” Aâ€ž, 1]. Since for s Â£ [â€”Aâ€ž, 0], m/,nii(l + 5) gives a "backcast" instead of a forecast, we call the following cross-validation criterion BCV. Definition 4.5 Let S\ = {ti : 2,- > 1 â€” An} and \S\\ = the number oftiS G S\. BCV{K) = -i- Â£ c(l _ 2.) (m,â€ž,1(-)(2,) - Y)2 , (4.55) w/*ere m/ln)1("')(2,) = + ^^(t,- - 1), @[ *J minimizes 52&i(Yj - Po - Pi(tj -l))2K((tj â€” l)/hn) and c(-) is a function that may depend on An but not on hn. In (4.55), we recommend c(-) to be chosen so that E(BCV(hn)) Â« AM5,E(<5) + constant, (4.56) where AMSE(6) is the asymptotic MSÂ£ of mfcB>1(l + An) and 8 = An/hn. BCV always centers the data at 2 = 1 when estimating m(2,), thus using data in [1 â€” hn, 1] but FCV in the last section centers the data at 2,- â€” Aâ€ž when estimating m(2,), using data in [2,- â€” An â€” hn,ti â€” Aâ€ž]. Suppose that a c(-) exists such that (4.56) holds. Then BCV behaves approximately like AMSE(8). The hope is that the bandwidth that minimizes BCV also approxi mately minimizes AMSE{8). The optimal bandwidth chosen by the BCV criterion is defined to be the one that minimizes BCV and will be sought by a grid search. Working with (4.55) directly is computationally intensive because for each hn on the search grid and for each data point (U,Yi) with 2,- > 1 â€” An, a regression needs to be computed. Fortunately, a short-cut formula holds for (4.55). This is shown in Lemma 4.6. 62 To choose a c(-) to satisfy (4.56), note that E{BCV{hn)) = ~ Â£ ^-^{(Aw^^^)-^-^)])'} = E c(l-t,-)^{(mw(-Â°(*0-m(*0)a} + ^2T^ E c(! " *0 l^l t.-eSi ~ IF, E c(! - {(mw(*0 - m(r,))2} + Â°-2|F7 Ec(!-^), (4-57) if \S\ |_1 ]Ct,-e5i c(lâ€” ^')E | (m/iâ€ž,i^~^('Â«) â€” m(^Â«')) | can be replaced with negligible error by ISil"1 Eues, c(l - *,)Â£ {(mw(t.-) - m(t,))2}. Thus to have (4.56), it suffices to find c(l â€” i,)s such that E c(l - *;)Â£ {(mfcâ€ž,i(*0 " MU))2} ~ AMSÂ£(Â£). (4.58) Since AMSE(6) = 0(n~4/5) by the asymptotic results in Chapter 3, we require the difference between the left and right sides of (4.58) to be o(n-4/5). The following Theorem is stated for fixed design with equally spaced i,s, U = i/n, for the choice of the c;s. Theorem 4.1 Assume the same conditions as in Corollary 3.3. Let k = [n â€” nAn], e = (1,...,1)' (a k x 1 vector), z = (zk,...,znY = (1 â€” ifc,..., 1 â€” Â£â€ž)* and c = (cjt,... ,câ€ž,Y â€” (c(zk),... ,c(zn)Y- Note that Si â€” {tk,...,tn}. If c(-) satisfies: n / VA.. 2 ^ cÂ«( â€”) = \Si\ (4.59) 63 where (z/An)*' = ((zjt/An)\(zn/An)*y, then Â£ - â„¢(*0)2 } - AMSE(6) = o{n-A'5). (4.60) Proof: Since the t,-s are equally spaced, we can write l0lt ueSi i=fc = B + y, where 1 " Pi I t=fc 1 n 1*1 & Theorem 3.2 tells us that E0Otl)-m{l) = E0hl)-m'(l) = Var0Otl) = Var0ltl) = 2 Â°iK + o(hn) 02hn + 0(/inJ 2 1 a2 1 1 <T2 I 22 + Â°W' where E = Ui u2 U0 Ui Ui u2 u2 \U3J uT, u: -l \ Ul U2 J UQ UI Ui U2 and Ui and u* are as defined in Condition 4 in Chapter 3. 64 So, for \t- 1| < Aâ€ž, Bias(mhn<1(t)) = E(mhnA(t) - m(t)) = E + Pltl(t - 1) - m(t)) = E{p0tl-m(l) + [Phl-m'(l)](t-l)} -[m(t) - m(l) - m'(l)(i - 1)] = j&if + fe2^-l)}m"(l) + o(^) -{m'W^ + Â«((*-i)2)} = ^{Mn-Mâ€ž(l-0-(l-02} + Â«(Aâ€ž), (4-61) I Var(mw(*)) = Var + ^(t - 1)) = Var(/?0)1) + 2(i - l)C<wfAi> /31)X) + (f - l)2Var0hl) _ ( 1 2(t-l) (t-1)2 \ a2 1 - feSll + ^^Sl2 + ^^S22J/(iT + 0W' (4.62) where o(A2) and o(l/(nhn)) do not depend on t as long as \t â€” 1| < An. By (4.61) and (4.62) and the fact that hn â€” An/<5, 1 " i=k ll E Â« - Mn(l " ti) " (1 - t,-)2 + 0(A2)}5 1*11 .t A4 " 15: \Si\k \*4 Zi_\ 2M2 / M2 b\ ~ 26i AJ ^ +UJ 62 \3 2&2 2; 65 m"(l) |<S\| \ V UJ Â£3 + UJ Â£2 +<*(i)3^'(i)-4 (4.63) and V = 1 " 1*11 t=fc 1 * f 1 2(1-*Q 2^cÂ« ^ â€”r-^ii r5â€”Li2 1*1 fa* lÂ»*n n/>2 , (1 ~ *Q2 nh3. , 1 xl ^ S22 + 0W/7(T) 1 " / 1 2A f Zi \si\t nhn nhl An WJ /(I) 1*1 Â« . t / z \ 2An VAâ€žy nhl 12 +ct(i)2iS22+o(i) (4.64) By (4.61) and (4.62), for t â€” 1 + An, the square of the bias and the variance of mâ€žâ€ž,i(l + Aâ€ž) are ^"(l)12 Bias2(mhntl(l + An)) = {b2X + 2hb2h3nAn + (b\ - 2h)h2nAl -2&2/>nA3+An + o(An)} m"(l) A b2 26162 &2 - 2&i Â» 1 64 + Â£3 + <52 and + 1+0(1) f 1 2A A2 1 1 a2 Var(^,(l + A.)) = |_Eâ€ž + ^E12 + + o(â€”) j w (4.65) (4.66) Conditions in (4.59) ensure that B - Bias2(mhnA(l + AB)) = o(An) 66 and V - Var(mhn<1(l + Aâ€ž)) = o(l/(nhn)). Since hni An oc n-1/5, we have B - Bias2(mhntl(l + An)) = o(n-4/5), and V - Var{mhn<l(l + Aâ€ž)) = o(n"4/5). Therefore the theorem is proven. â€¢ Remarks: 1. Since for each value of [nAn], c needs to be calculated in the implementation of BCV, one might be interested in the limiting forms of conditions in (4.59) when n â€”* oo and nAn, â€”* oo. Let c(t) = g(t/An). For equally spaced 2,-s, one can show that there exists a function g, e.g., a fourth degree polynomial, such that / g{u)du = 1 Jo I ug(u)du = â€”1 Jo / u2g(u)du = 1 Jo / u3g(u)du = â€”1 Jo f u4g(u)du = 1. (4.67) Jo 2. The conditions in (4.59) are sufficient for (4.60) but may not be necessary. For example, if m"(l) = 0 the last two conditions in (4.59) are not needed for (4.60) to hold. However, if m"(l) ^ 0 and all the coefficients of powers of 8 in (4.65) and (4.66) are non-zero, then the conditions in (4.59) are necessary for (4.60). 67 3. So far our discussion about BCV has been limited to simply providing guidelines for choosing c(-). It has not been shown that the hn minimizing BCV is close to hopt, the bandwidth that equals An/5opt with 8opt minimizing AMSE{8). Note that when |<?i| > 5, there may not be a unique vector c that satisfies (4.59). In the implementation of BCV, we will use the vector c that minimizes c'c subject to the conditions in (4.59). Lemma 4.5 Suppose that \S\\ > 5 and that the are distinct. Let X1 = and '(AJ'(AJ '(AJ '(AJ 6' EE (1^1, -15x1,1*1,-ISxU^I). (4-68) Then the minimizer of ctc subject to (4-59) is c â€” Xt(XXt)~1b. Proof: Let L(c,\) = ctc-\\Xc-b), (4.69) where A is a 5 by 1 vector. Setting the derivatives of L with respect to c and A to zero yields: ^ = 2c-Xt\ = Q=>c = ]-Xt\, (4.70) oc 2 ^ = Xc-b = 0=> Xc = b. (4.71Plugging c of (4.70) in to (4.71) gives Xc = \XX*\ = b. (4.72) Since tk,...,tn are distinct, X is of full row rank of 5. Therefore XXt is invertible. Inverting XX* in (4.72) results in A = 2(XXt)~1b. Plugging this A into (4.70) yields c = Xt(XXt)~1b, which may be either a minimizer or a maximizer. Since L is un bounded above, this c is a minimizer. â€¢ 68 Unlike FCV, BCV has the short-cut formula we established to enable one to calcu late only one regression for each hn on the search grid. Thus the process of estimating the optimal bandwidth hn is fast. We turn to that problem now. Lemma 4.6 BC V{hn) - â€”- 2^ c(l - ti)â€”â€” , where [Hhn]u = {1,U - 1) x toe ito colmn o/ [(A*WA) A W], /i ... iN Al = and W = diag(K(^-r-^)) (4.73) \ *i â€” 1 â€¢ â€¢ â€¢ tn-l J with K a positive kernel over its support. Proof: Recall that & = argmin Â£ (^ - ft - ft(t; - l))2 K^T^)-j=i 71 Fix i and let n Â£ â€¢ â€” 1 <*U = argmin (Yj - a0 - ax{t3- - U)) K(J-^â€”). Then hn = (4-74) The usual estimate of m(*t) using /Sj is ft>x + ftii(i; â€” 1) which is equal to a0tU, the usual estimate of m(U) based on ati. Thus, both parameterizations yield the same estimate of (m(ti),..., m(tn)y, namely Hhâ€žY. For the same fixed i, also let fiâ„¢ = argminifciYj-h-lhitj-ltfKi^) 69 + (^;i*')-/5o-^-i))2/v(\^), ti-i. a[. l) = argrain^ (Yj - ct0 â€” a^tj - t,))2 K("3 hr, cti"^* = ar grain ^ (5j - ao - ax(ij - *i))2 ^(^7 ) + (afc? - ao - axfc - t,))' ^f^' then From Lemma 4.2, cto~u* = ^o~u- Thus from Lemma 4.1, we have By the relationship between the ds and /3s, the short-cut formula holds. â€¢ In the next chapter, simulations will be carried out for the local linear forecasting estimator to compare the performance of the four methods of bandwidth estimation. 70 Chapter 5 Simulation: local linear forecasting estimator 5.1 Motivation and data In this chapter we will study the local linear forecasting estimator of m(l + An) for Aâ€ž = 0.1 and 0.2 on simulated data sets, Yi = m{ti) + e,- with e,-, i = 1,..., n iid normal variables ~ N(0,<T = 0.1), and equally spaced 2,-s, t,- = i/n. Data sets with sample sizes n = 50 and 100 will be generated from each of three ms, mi(t) = t, m2(t) = t2 and m3(t) 2-7l2(cosUvt) + 1), t < 1/2, (5.1) i5/2, t > 1/2. Methods of estimating an optimal bandwidth by using the plug-in procedures CAMSE, HAMSE from Sections 4.1.2 and 4.1.3 and the cross-validation procedures FCV and BCV from Sections 4.2.4 and 4.2.5 are applied to each data set. The estimators of m"(l) and a2 needed for the plug-in procedures are those in Section 4.1.5. The kernel 71 function K used in all four methods is the density function of the standard normal with support truncated to [â€”1,0]. For this kernel function, the asymptotic mean square error by formula (4.11) is m"(l)2 AMSE{8) = â€”^-An(0.1532668/62 + 0.9663585/6 + l)2 2 c +-^â€”(4.034252 + 12.16996 + 12.21121662). (5.2) The optimal bandwidth for forecasting is: hopt = An/6opt, where Sopt minimizes AMSE(S) The goal is to compare the estimates of m(l + Aâ€ž) and the estimates of optimal band-widths by all the bandwidth estimation procedures discussed in Chapter 4. The functions tp, p = 1 and 2, are chosen to study the sensitivity or the robust ness of plug-in procedures of bandwidth estimation to the violation of assumptions on the regression functions in the asymptotic analysis in Chapter 4. Recall that plug-in procedures require the estimation of m"(l) in order to estimate the optimal bandwidth for forecasting. In the estimation of m"(l), the fourth derivative of m at 1, m^4\l), is assumed to exist and be non-zero. Therefore, it is interesting to study the performance of the plug-in procedures when m^(l) is zero. In contrast, the function m3 is chosen because it has a non-zero fourth derivative at t = 1 and non-constant second derivative over [0,1]. Although m3' has a discontinuity point at t = 1/2, Corollary 4.1 still holds when the FCV procedure is applied to m3. Recall that Theorem 3.1 assumes that in Condition 5 m" is continuous over [0,1 + An] and that the results concerning the asymptotic bias and the variance of rhhn,t(t + Aâ€ž) hold uniformly for t 6 [an, 1] with liminf an/hn > 1. It can be easily shown that under the conditions of Theorem 3.1 but with Condition 5 replaced by the condition that m" is continuous over [ao, 1 + An] for some 0 < ao < 1, the conclusions in Theorem 3.1 hold uniformly for t (E [a0,1]. Therefore Corollary 4.1 holds since FCV uses estimates of m(ti)s with U > I â€” pAn and these t,s are in [a0,1] for n sufficiently large. It is also interesting to see how the magnitude of m"(l) will affect the estimated 72 optimal bandwidth and the resulting forecasts. If m"(l) = 0, the square of the asymp totic bias of rhkn(l + An) is of a higher order than An and thus the hn that minimizes the AMSE of mâ€žn(l + An) is of a different order from n-1/5. In particular, if ra is a line, the exact bias of m/in(l + An) is zero. Thus the AMSE in this case has only the variance term which equals the second term in (4.11), resulting in the asymptotically optimal bandwidth hn being co. It can be proved from (5.2) that a larger absolute value of ra"(l) will result in a smaller asymptotically optimal bandwidth for the kernel function used in this chapter. The three ras under study have m'((l) = 0, m2'(l) = 2 and ra3'(l) = 3.75. 5.2 Results and comments Tables 5.1-5.3 display the summary statistics of the An-ahead forecasts (An = 0.1,0.2) along with the values of hopt, and the estimates of optimal bandwidth for the forecasts by the four procedures applied to 100 simulated data sets, each for sample sizes n = 50 or 100 and the functions ra = mi, ra2 and m3. Since the estimates of the optimal bandwidth are quite variable, the median of the estimates of the optimal bandwidth for 100 simulations is a better indicator of the center of these estimates than the mean. The median of the estimates of the optimal bandwidth for 100 simulations is denoted as hf^}. In the tables, the standard deviation (s.d.) and the mean square error (m.s.e.) of m/iâ€ž,i(l + Aâ€ž) and hopt are estimated respectively as follows. Let 6\> generically represent either raâ€žnii(l + An) or hopt for the bth simulation, 6 = 1,..., 100, and 0 the true value, i.e., 9 = m(l+ An) or hopt. Then (s.d.)2 is Â£Â£Â°Â° (Â§b - fj2 /100 with I = E^i 4/100 and the m.s.e. is (0 â€” 0)2 +(s.d.)2. The last line of each table displays the true ra(l + An) and hopt. Results in Tables 5.1-5.3 reveal that, among the four procedures examined, FCV 73 Table 5.1: Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m\ with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value. m = mi, n = 50 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) Ag? MO rk (s.d.) (m.s.e.) *Â£? (B.d.) CAMSE 1.08 (0.19) (0.04) 0.30 (0.27) 1.17 (0.34) (0.12) 0.27 (0.26) HAMSE 1.08 (0.19) (0.04) 0.27 (0.27) 1.18 (0.35) (0.12) 0.24 (0.26) FCV 1.11 (0.07) (0.01) 0.51 (0.21) 1.20 (0.08) (0.01) 0.51 (0.12) BCV 1.10 (0.10) (0.01) 0.31 (0.24) 1.23 (0.18) (0.03) 0.19 (0.15) true value 1.10 0.00 oo 1.20 0.00 oo m â€” mi, n = 100 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) *g? (s.d.) m (s.d.) (m.s.e.) Ag? (s.d.) CAMSE 1.10 (0.08) (0.01) 0.33 (0.28) 1.21 (0.14) (0.02) 0.30 (0.25) HAMSE 1.10 (0.08) (0.01) 0.33 (0.24) 1.20 (0.14) (0.02) 0.30 (0.24) FCV 1.10 (0.05) (0.00) 0.51 (0.21) 1.20 (0.06) (0.00) 0.51 (0.10) BCV 1.09 (0.10) (0.01) 0.19 (0.22) 1.20 (0.18) (0.03) 0.18 (0.14) true value 1.10 0.00 oo 1.20 0.00 oo 74 Table 5.2: Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m2 with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value. ra = ra2, n = 50 method An = 0.1 An = 0.2 ra (s.d.) (m.s.e.) *S? (s.d.) ra (s.d.) (m.s.e.) hSg (s.d.) CAMSE 1.16 (0.24) (0.06) 0.24 (0.21) 1.33 (0.43) (0.20) 0.22 (0.20) HAMSE 1.16 (0.24) (0.06) 0.22 (0.20) 1.33 (0.43) (0.20) 0.21 (0.20) FCV 1.16 (0.09) (0.01) 0.29 (0.12) 1.33 (0.11) (0.02) 0.29 (0.09) BCV 1.13 (0.13) (0.02) 0.30 (0.27) 1.37 (0.25) (0.07) 0.20 (0.16) true value 1.21 0.01 0.34 1.44 0.02 0.32 ra = m2, n = 100 method An = 0.1 An = 0.2 ra (s.d.) (m.s.e.) fcg? (s-d.) ra (s.d.) (m.s.e.) *Â£? (s.d.) CAMSE 1.16 (0.12) (0.02) 0.24 (0.26) 1.34 (0.20) (0.05) 0.23 (0.22) HAMSE 1.17 (0.12) (0.02) 0.24 (0.21) 1.34 (0.21) (0.05) 0.22 (0.20) FCV 1.18 (0.08) (0.01) 0.26 (0.09) 1.36 (0.11) (0.02) 0.24 (0.08) BCV 1.15 (0.14) (0.02) 0.20 (0.28) 1.37 (0.20) (0.04) 0.20 (0.18) true value 1.21 0.01 0.30 1.44 0.02 0.27 75 Table 5.3: Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m3 with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value. m = ra3, n = 50 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) f>SS (s.d.) m (s.d.) (m.s.e.) AgJ (s.d.) CAMSE 1.21 (0.28) (0.08) 0.24 (0.21) 1.45 (0.51) (0.28) 0.23 (0.19) HAMSE 1.21 (0.28) (0.08) 0.22 (0.21) 1.45 (0.51) (0.28) 0.21 (0.19) FCV 1.19 (0.10) (0.02) 0.26 (0.09) 1.37 (0.12) (0.06) 0.25 (0.10) BCV 1.13 (0.16) (0.05) 0.30 (0.28) 1.39 (0.25) (0.10) 0.19 (0.17) true value 1.27 0.02 0.26 1.58 0.05 0.24 m = m3, n = 100 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) h%] (s.d.) m (s.d.) (m.s.e.) Ag? (s-d-) CAMSE 1.20 (0.20) (0.04) 0.24 (0.26) 1.43 (0.33) (0.13) 0.22 (0.22) HAMSE 1.20 (0.20) (0.04) 0.22 (0.24) 1.44 (0.33) (0.13) 0.20 (0.22) FCV 1.19 (0.09) (0.01) 0.21 (0.08) 1.42 (0.15) (0.05) 0.21 (0.09) BCV 1.16 (0.16) (0.04) 0.21 (0.29) 1.44 (0.21) (0.06) 0.19 (0.17) true value 1.27 0.01 0.22 1.58 0.04 0.21 76 produces the best forecasts in terms of the mean square error (m.s.e.) and that the m.s.e. of FCV is closest to the minimum of AMSE. The m.s.e. of forecasts given by the other cross-validation procedure, BCV, is smaller than the m.s.e. of forecasts given by the two plug-in procedures in most cases (eight cases out of twelve). However, BCV ties with the two plug-in procedures (in terms of the m.s.e.) in three cases and does worse in one case (m = mi, An = 0.2 and n = 100) where the m.s.e. of forecasts by BCV is 1% more than those of forecasts by plug-in procedures. Comparing the estimated mean and the estimated standard deviation of forecasts by all four procedures shows that forecasts by plug-in procedures have a bias comparable to the bias of forecasts by cross-validation procedures but tend to have a much larger variance, and that the two plug-in procedures produce almost identical results. Tables 5.1-5.3 also show that the m.s.e.s of forecasts by all procedures increase as |ra"(l)| increases, which is confirmed by examining plots in Figure 5.1 where the min imum of AMSE curve (solid line) increases as |m"(l)| increases from m^l) = 0 to m2'(l) = 2 and m3'(l) = 3.75. Moreover, the m.s.e.s of forecasts by all procedures are smaller for the larger sample size n = 100. This is expected since the convergence rate of AMSE of m/lnii(l + An) is n-4/5 and gets smaller when n gets larger. From Tables 5.1-5.3, we see that among all four procedures, the estimates of the optimal bandwidths calculated by FCV are the least variable and have medians agreeing with the true optimal bandwidths reasonably well except for m â€” mi. Other procedures may do well for some cases but badly in others. Recall that all four procedures are devised to estimate the optimal bandwidth by simulating the AMSE curve and that E(FCV) â€” a2 is approximately AMSE. To help us understand why FCV has the best performance, in the following figures, AMSE(6) will be plotted as a function of hn with hn = An/S. Moreover, the shifted FCV, FCV - a2, will be plotted instead of FCV. 77 The two plug-in procedures, CAMSE and HAMSE, produce almost identical sim ulated AMSE curves. For example, Figure 5.2 shows that the CAMSE and HAMSE curves are in good agreement for each of 9 simulated data sets for function mz with n = 50 and Aâ€ž = 0.1. The similarity of those curves explains why the CAMSE and HAMSE procedures produce very similar forecasts and estimates of optimal band-widths. From now on only the CAMSE curves are plotted for plug-in procedures. Of the four procedures, only FCV on average simulates AMSE curves reasonably well in all cases, which is not surprising in the light of Corollary 4.1. In contrast, as Figure 5.1 shows, the median of the CAMSE curves (dashed) agrees with the AMSE curve only for m = but the median of the FCV curves agrees with the AMSE curve for all three functions. This observation suggests that the condition of m^(l) ^ 0 is necessary for plug-in procedures to work. Furthermore, in all cases the AMSE curve simulated by FCV is less variable than the AMSE curve simulated by plug-in procedures. For example, Figure 5.3 of quantiles of the FCV curves and CAMSE curves shows that the AMSE curve simulated by FCV is less variable than the AMSE curve simulated by CAMSE for m = ra3. This observation probably explains why forecasts obtained through FCV are less variable than forecasts by plug-in procedures. The medians of simulated AMSE curves by BCV are too rough to resemble anything like the AMSE curve in all cases; see Figure 5.4 for example. This is probably due to the greatly oscillating values of the c,s, the weights used in BCV which minimize c*c subject to (4.59). For example, when n = 50 and An = 0.1, we have c< = (220.75,-753.77,632.54,492.46,-996.23,409.25). As Figure 5.1 shows, FCV approximates the asymptotic mean squared error very well for ra = rax and ra2 but not as well for ra = m3. It is easy to see why the shifted FCV(hn) under-estimates AMSE(6), that is, 78 0.1 0.2 0.3 0.4 0.5 0.6 0.7 I I / / / / / / I I I m=m2 / / / \ \\ \\ \\ v., \ ^~ . â€” y s / / / / 0.2 0.3 0.4 0.5 0.6 0.7 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Figure 5.1: The median of 100 shifted FCV curves (dotted), the median of 100 CAMSE curves (dashed) and the true AMSE curve (solid) for m = m1,m2 and m3 with n = 50 and An = 0.1. 79 Figure 5.2: Pairs of CAMSE (dotted) and HAMSE (solid) curves for 9 simulated data sets for m = ra3, n = 50 and An -= 0.1. 80 81 o T3 CD C 55 CD E > O CO u3 CO CO o CM O O O 0.1 0.2 0.3 0.4 0.5 0.6 0.7 h Figure 5.4: The median of 100 shifted BCV curves (dotted line) and AMSE curve (solid line) for m = m3, n = 50 and Aâ€ž â€” 0.1. 82 AMSE(An/hn), particularly for large values of hn for m = m3. Recall that the FCV score is an average of the estimates of the prediction errors for m(Â£,)s with U Â£ [1 â€” An, 1] and that by Corollary 3.1 or 3.3, AMSE of rhhn,ti-An(ti) depends on m"(ti â€” An), while AMSE(8) depends on m"(l). The second order derivatives of mi and m2 are constants, with m" = 0 and m'2' = 2. So for U Â£ [1 - An, 1], m"(U â€” An) is equal to m"(l). Thus, for mi and m2, one sees an agreement in shape between the shifted FCV scores and the AMSE curve. But for m3, m3(i) = 15Â£1/2/4 for t > 1/2, which is an increasing function of t. Therefore for ti Â£ [1 â€” An, 1], m3'(Â£t- - An) is less than m3(l). Even though as n â€”+ oo, the m"(ti â€” An)s converge to m"(l), for a finite sample, one would expect the shifted FCV scores to under-estimate the AMSE curve and that the discrepancy increases as hn increases. However, Figure 5.5 shows that for fixed An the discrepancy is smaller for larger sample size n, as would be expected. Based on our analysis and the above observations from our modest simulation study, we conclude that FCV has the best performance and thus is recommended. The next chapter will contain the comparison of the backcalculation method and the local linear method applied to a fictitious AIDS example. 83 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.1 0.2 0.3 0.4 0.5 h h Figure 5.5: The AMSE curve (solid) and the mean of 100 shifted FCV curves (dotted) for m â€” ra3 with An = 0.1,0.2 and n â€” 50,100 respectively. 84 Chapter 6 The local method versus backcalculat ion This chapter offers the arguments that the proposed local method should serve as an alternative to the backcalculation approach in forecasting. 6.1 Comments on the methods Backcalculation is used to achieve two goals: to estimate the historical HIV infection curve up to the present and to forecast the number of new AIDS cases using the esti mated HIV infection curve. Backcalculation assumes the following: 1. the adjusted AIDS incidence data are accurate; 2. the incubation distribution, F, is accurately modelled and does not vary across subpopulations represented in the counts of new cases. 85 In reality, the raw AIDS incidence data are affected by under-reporting and report ing delays. In some developing countries, AIDS data are highly incomplete. Though adjusting the raw AIDS data for under-reporting and reporting delays improves the quality of the data significantly, the adjusted AIDS data still contain errors depending on the imputation methods, the quality of the raw data and the random nature of the data. In short, accurate AIDS counts are not available. This is more of a problem for backcalculation (Section 6.2) than for the local linear forecasting estimator. Furthermore, knowledge about the "true" F is suspect. Knowing F is crucial to the backcalculation approach but F is very hard to estimate. Recall that F is the distribution of the incubation time from HIV infection to AIDS diagnosis. Usually external data other than the AIDS incidence data at hand are used in the procedure for estimating F. There are several problems with that procedure. First, exact infection times are known only for a few highly selected groups of individuals. So usually F is estimated from specific cohort(s) [3] and then plugged into the backcalculation model for a certain (or general) HIV infected population as the "true" incubation distribution. The applicability of F obtained from a cohort (a particular subpopulation) to a more general population is doubtful. In fact, Bacchetti et al [2] conclude from their analysis that the incubation distributions across the cohorts under study are quite different, resulting in very different estimates of HIV infection curves. Second, the changes of the definition of an AIDS case [1] (once in 1985 and then again in 1987 in the United States) further complicate the estimation of the incubation distribution. Third, under-reporting and reporting delays also affect the estimation of the incubation distribution. In contrast, the local linear forecasting estimator has only one goal: to forecast the number of new AIDS cases. It assumes that the expectation of the adjusted AIDS incidence data reflects the true level of the underlying AIDS epidemic, i.e., E(Yi\ti) â€” m(ti) (with m the true level of the underlying AIDS epidemic) and uses the AIDS counts 86 (adjusted when necessary) directly. The local linear forecasting estimator does not use any external data other than the AIDS data. One might think that backcalculation should provide good estimates since it employs both external data and the AIDS data. However, as shown in the simulation study described in the next section, it does not produce better forecasts than the local linear forecasting estimator that uses the AIDS data only. In addition, it will be shown that backcalculation can not be used to produce a good estimate of the HIV infection curve up to present. The next two sections will contain discussions of the performance of the two methods. 6.2 A small simulation study This simulation study has two objectives: 1. to compare the forecasts of the number of new AIDS cases by backcalculation and the local linear forecasting estimator, 2. to investigate how the violation of either of the two assumptions about the AIDS incidence data and the incubation distribution function F will affect the results of backcalculation. To achieve the two objectives, the performance of both approaches is investigated in the simplest set-up: Y{ = m(ti) + e,- where the e;s are independent normal random variables with standard deviation a: Although the assumption of independent normal errors may not be realistic for the AIDS count data, it serves to show the statistical weakness of the backcalculation approach and simplifies the computation. This fictitious example will use the simplest model for F: an incubation time distribution function which is independent of the infection time, i.e., F(t â€” s\s) = F(t â€” s). 87 The simulated data are Yi = m(ti) + e,-, i = 1,..., 20, where the e,s are i.i.d. normal variables with mean zero and variance 0.25, ti = i/20, m(Â£;) = JQ I(s)(F(ti â€” s) â€” F(U-i - s))ds, I(s) = -200(5 + l)(s - 2) and F(t) = 1 - e~*. One hundred simulated data sets are generated from the above model. Figure 6.1 shows the incubation function F and the HIV infection rate I. To attain the first objective, m(l + An), an estimate of m(l + Aâ€ž), is calculated by both approaches for each of the simulation data sets for An = 0.1 and An = 0.2. As explained in the last section, the "true" F should be taken with a grain of salt. It will be interesting to see how the forecasts are affected by the assumed form of F. Therefore for the backcalculation approach, m(l + An) is estimated under a few assumed forms of F, the true F: F(t) = 1 â€” e_t and wrong JPS: F(t) = 1 â€” e~f^ with ft = 4,1.1,0.9,0.5. Since the true ft is 1, ft = 4 or 0.5 is grossly wrong while ft = 1.1 or 0.9 is pretty close to the truth. The application of backcalculation with each F to the 100 simulated data sets yields 100 m(l + Aâ€ž)s. In backcalculation, the smoothing parameter A is chosen by cross-validation (Section 4.2.2). The application of the local linear forecasting estimator to the same data also yields 100 m(l + An)s, but not depending on the assumed form of F. Method FCV is used to choose the bandwidth for the local linear forecasting estimator because simulations in the last chapter suggest that FCV has the best performance among the four competing bandwidth estimating procedures. For each set of 100 m(l + An)s, the average, the standard deviation (s.d.) and the mean squared error (m.s.e.) are computed and displayed in Table 6.1. For this fictitious example, the mean squared errors can be calculated since the true value of m(l + Aâ€ž) is known. Results in Table 6.1 show that the local linear forecasting estimator always gives forecasts with smaller mean squared errors than backcalculation. Comparison of the averaged forecasts and standard deviations suggests that forecasts by backcalculation 88 Time 0.0 0.2 0.4 0.6 0.8 1.0 Time Figure 6.1: The HIV infection function I, I(s) = -200(5 + l)(s - 2) and the incubation function F, F(t) = 1 â€” e~* used in the generation of the fictitious AIDS data. 89 Table 6.1: Summary of forecasts by backcalculation and local linear regression for 100 realizations of the fictitious AIDS data. method assumed average of forecasts (s.d.) (m.s.e.) incubation F An = 0.1 An = 0.2 backcalculation 1 - e-*/4 14.27 (1.58) (2.52) 14.85 (3.64) ( 13.37) 1 - e-'l1-1 14.10 (2.55) (6.50) 14.23 (7.42) (55.13) 1 â€” e~f, (true) 14.45 (2.94) (8.75) 15.52 (7.60 ) (58.80) 1 - e-</Â°-9 14.31 (1.45) (2.14) 14.90 (3.24) ( 10.66) 1 - e-'/Â°-5 13.59 (8.10) (65.88) 12.49 (27.00) (733.04) local linear FCV NA 14.58 (0.62) (0.61) 15.51 (0.96) (1.95) true value 1 -e"< 14.11 14.50 are highly variable. For example, for An = 0.1, even when the "true" F is used in the backcalculation, the standard deviation (s.d.) of 100 forecasts by backcalculation is 2.94, which is much higher than the corresponding standard deviation of 0.62 for the local linear forecasting estimator. Results of backcalculation when the "true" F is used suggest that forecasts should not be made based on the extrapolated I. Estimates of the conservative lower bounds for the numbers of new AIDS cases, LBn of (1.6), are also calculated according to formula (1.7) for the backcalculation approach from the same simulated data. The average and the standard deviation of the 100 LBns 90 Table 6.2: Summary of lower bounds of forecasts by backcalculation for 100 realizations of the fictitious AIDS data. method assumed average of LBn (s.d.) incubation F An = 0.1 An = 0.2 backcalculation 1 - e-'/4 13.51 (0.89) 13.18 (0.87) 1 - e"*/1-1 12.89 (1.26) 11.78 (1.15) 1 â€” e_t, (true) 12.76 (1.21) 11.55 (1.09) 1 - e"*/0-9 12.71 (0.83) 11.38 (0.75) 1 _ g-t/0-5 11.87 (0.54) 9.72 (0.44) true lower bound 1 â€” e_t, (true) 12.69 11.48 true value 1 â€” e~f, (true) 14.11 14.50 for each F are summarized in Table 6.2. The true lower bounds given by formula (1.6) are calculated with F the "true" distribution used in the generation of the simulation data. The s.d.'s of the lower bounds in Table 6.2 are much smaller than the s.d.'s of forecasts based on extrapolated 7s in Table 6.1. This confirms that the problem shown in Table 6.1 is due to the extrapolation of I. To attain the second objective of studying the sensitivity of backcalculation to the two parameters: the assumed form of the incubation function F and the presence of noise in the YiS, each parameter is allowed to vary separately. First, estimates of I are backcalculated based on three similar Fs from the Y^s with no error. The three Fs are 91 F(t) = 1 - e~t/p with ft = 1.1,1 (true) and 0.9 (see Figure 6.2). A larger value of ft means a longer incubation period. The resulting I based on Fs with ft = 1.1 (long F), 1 (true F) and 0.9 (short F) are plotted together with the true I in Figure 6.2. Note that the true I is fully recovered when the true F is used in the backcalculation and when there are no errors in the YiS. To study the sensitivity of backcalculation to the presence of errors in the YiS, esti mates of I are backcalculated using the true F (the one used to generate the simulation data, ft = 1) for different realizations of the errors in the Y{S with a always equal to 0.5. The first plot of / in Figure 6.3 shows that I is fully recovered when there is no error in the YiS. The other three 7s in Figure 6.3 are backcalculated from three realizations of the YiS with errors. These three plots of I show that when there are errors in the l^s, the backcalculated Is are highly variable. To get a better idea of the variability of the backcalculated 7s when the YiS contain A A error, Figure 6.4 summarizes the 100 Is by plotting the averaged 7, the 25th quantile, the 75th quantile and the true I. The simulation results show that the results produced by the backcalculation ap proach are sensitive to the assumed form of the incubation distribution F and to the presence of errors in the YiS and that the forecast based on the extrapolated / is highly variable (even when the true F is used). In comparison, the local linear estimator gives better forecasts by the criterion of MSE. Furthermore, the local linear estimator is much faster to compute and therefore is less expensive. 92 Time Figure 6.2: The sensitivity of / to the assumption of F in the fictitious AIDS data (a = 0). True F(t) = 1 - e~\ long Fit) = 1 - e"*/1-1 and short F(t) = 1 - e"'/0'9. 93 Time Time Figure 6.3: The sensitivity of I to the presence of the errors (a â€” 0.5) in the fictitious AIDS data when the true F is used. The first plot shows that I is recovered from the error-free AIDS data while the rest show that / is highly distorted by the presence of errors in the AIDS incidence data. The solid curve (-) in each plot is the true I and the dotted line (...) the backcalculated /. 94 0.0 0.2 0.4 0.6 0.8 1.0 Time Figure 6.4: The variability of backcalculated I. 95 6.3 Discussion of the methods and the simulation results This section will offer theoretical insights on how the violation of either of the two assumptions on the AIDS incidence data and the incubation function F will make back-calculation perform poorly. The reason that backcalculation is highly sensitive to inaccuracies in both the AIDS incidence data and the incubation function F is that backcalculation is a mathematically ill-posed problem. Equation (1.4) m(t) = jT I{s) (F(t -s\s)-F(t-^- s|s)) ds. is a Volterra equation of the first kind [22]. For details on Volterra equations, see [22]. According to [22], a linear Volterra equation of the first kind is defined as /' K(t,s)I(s)ds = u(t), 0<t<T (6.1) Jo while a linear Volterra equation of the second kind is defined as I(t) - I* K(t,s)I{s)ds = g{t), 0<t<T. (6.2) Jo The purpose is to solve these equations for I. An equation is well-posed if a small change in parameters, e.g. in K or u in (6.1), or in K. or g in (6.2), causes only a small change in the solution I [22]. Equation (6.1) is not a well-posed problem. Volterra equations of the second kind have been studied more extensively and are understood better than those of the first kind. Roughly speaking they are relatively well-posed in contrast to those of the first kind. A Volterra equation of the first kind may be converted to one of the second kind by differentiating (6.1): K(t,t)I(t) + Â£ â„¢^ll(s)ds = u\t). (6.3) 96 If K(t,t) does not vanish in 0 < t < T, then dividing (6.3) by K(t,t) yields a standard Volterra equation of the second kind and can be solved by existing algorithms. The exact knowledge of u (consequently the exact knowledge of u') makes the equation (6.3) possibly well-posed. Under regularity conditions, a continuous solution of equation (6.1) exists uniquely and is the same as the continuous solution of equation (6.3). Even though Volterra equations of the second kind are relatively more tractable in theory, solving them numerically is known to be hard. However, in the backcalculation approach, K(s,t) = F(t - s\s) - F(t - l/n - s\s), (6.4) with F being the incubation function, so K(t,t) = 0 for any t. Thus (6.3) is again an equation of the first kind. According to [22], a Volterra equation of the first kind can be converted into one of the second kind if K and u are sufficiently differentiable. This result is not very useful in practice because "u" (e.g. the functional form of m in backcalculation) is the very unknown that is of interest to us. As mentioned earlier only the exact knowledge of "u" may help convert equations of the first kind to those of the well-posed second kind. Suppose that (1.4) can be converted into an equation of the second kind by differ entiating. Then one might propose to solve the original equation by instead solving the resulting equation. Since in reality m is unknown, one needs to estimate derivatives of m from the raw data. It is well known that accurate estimation of derivatives of m from the raw data is hard, since errors inherited from the original data propagate (in terms of mean squared error). Those errors would be further amplified in I due to the ill-posed nature of the problem. Penalizing 7, e.g., controlling the magnitude of $ I"(s)2ds when backcalculating 7, does not change the ill-posed nature of the problem. This is well manifested by the simulations in the last section. A small amount of error added to m can distort the resulting estimate of I completely. See Figure 6.3. Of course, all of above 97 discussions about Volterra equations assumes that u and K are known at all values of t, not just at the t,s. Smoothing the data {li}â„¢ prior to applying the backcalculation procedure does not help either. This can be explained by the ill-posed nature of the problem. The ill-posed nature of backcalculation is also manifested in the sensitivity of back-calculation to the assumed form of the incubation function F. See Figure 6.2. Figure 6.2 also exhibits a phenomenon which Brookmeyer [1] calls "nonidentifiabil-ity" of I and F: for the same set of "AIDS incidence" data if a long incubation period is assumed (e.g., ft = 1.1), backcalculation produces an inflated estimate of the infection rate in order to match the observed incidence. On the other hand, if a short incubation period is assumed (e.g., ft = 0.9) backcalculation underestimates the infection rates. Bacchetti [2] makes the same observation. Recall that one major goal of the backcalculation approach is to reconstruct (es timate) the HIV infection curve /(â€¢) from the AIDS incidence series. However, this nonidentifiability, accompanied by the unfortunate mathematical ill-posed nature of backcalculation, makes backcalculation unfit for the task of reconstructing the past to present HIV infection curve I. It is also important to note that backcalculation is limited to estimating only a proportion of HIV infections. Even though there is a strong correlation between HIV infection and AIDS disease, not every HIV infected person will develop AIDS in his lifetime. The proportion of people who are infected with HIV virus and eventually develop AIDS is unknown. This is another reason that backcalculation can not recover the whole picture of the historical HIV infection pattern. In short, one can not expect to get a reliable HIV infection curve from backcalculation unless the true F and accurate AIDS incidence data (after correction for under-reporting and reporting delays) are available. Other methods must be devised to estimate the HIV 98 infections. Brookmeyer [1] believes that greater improvements in reconstructing the HIV in fection rates may come from empirical data on recent infection rates rather than from smoothing procedures or parametric models for I using backcalculation. As the AIDS epidemic continues, more and more data on HIV seroprevalence are available from cross-sectional surveys (since the mid-1980s) and longitudinal cohort studies. Thus obtaining infection rates directly has become possible. Brookmeyer [4] [5] has proposed new ap proaches based on cross-sectional surveys and cohort studies in his recent research. Though the proposed local linear estimator does not attempt to estimate the HIV infections, it does a better job at forecasting than backcalculation. The local linear estimator will be applied to two real data sets from Canada and the United Kingdom in the next chapter. 99 Chapter 7 Data analysis In this chapter, forecasts using the local linear forecasting estimator and Healy and Tillett's method [20] will be compared to the corresponding true or corrected numbers of new AIDS cases. The two data sets at hand are the Canadian AIDS data, and the UK AIDS data used by Healy and Tillett [20]. For a given data set, a few recent observations will be left out and treated as "future", unknown observations. "Forecasts" by the two methods will be made using the rest of the data and compared to the left out data, actual values of these "future" observations. 7.1 Data, methods and results The Canadian AIDS data are provided by the Laboratory Centre for Disease Control. Although the national quarterly numbers of new AIDS cases are available from the fourth quarter of 1979 to the first quarter of 1996, quarterly numbers prior to the second quarter of 1991 are assumed to be free of reporting delays. This agrees with previous work [23] which considered reporting to have ceased within six years. Since correcting for reporting delays and under-reporting is beyond the scope of this thesis, the subset 100 of the Canadian AIDS cases that were diagnosed between the beginning of the fourth quarter of 1979 and the end of the first quarter of 1990 and reported within six years of diagnosis will be used. These data will be treated as being free of reporting delays and under-reporting, i.e, these data will be assumed to represent the true numbers of new AIDS cases. The Canadian data are displayed in Table 7.1 with quarters of a year labelled Q1-Q4. Table 7.1: Quarterly numbers of AIDS cases reported within six years of diagnosis in Canada, 79Q4-90Q1. 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 Ql NA 0 2 6 17 34 65 125 190 267 351 372 Q2 NA 4 2 4 16 36 82 147 223 254 317 NA Q3 NA 0 2 7 13 39 99 163 261 295 350 NA Q4 1 0 2 6 17 50 117 186 261 304 328 NA The UK data as shown in Table 7.2 are taken from Table 6 in [20] in which the monthly numbers of reported new AIDS cases and Healy and Tillett's estimates of number of unreported new AIDS cases are displayed. Healy and Tillett [20] imputed the estimates of numbers of unreported new AIDS cases based on the delay distribution estimated from the data from 1984 to 1986. Thus the corrected data are the sum of the reported number and the estimate of the number of unreported new AIDS cases. The Canadian data and the corrected UK data are plotted in Figure 7.1. To test our forecasting method, for the Canadian data, we consider 88Q1 as the present time 101 Table 7.2: Monthly numbers of AIDS cases reported to the end of September 1987 in UK. For time periods May 1986 to September 1987, the first number is the reported and the second number in parenthesis is the estimate of number of unreported new AIDS cases. 1982 1983 1984 1985 1986 1987 Jan 1 1 6 16 25 33(7.5) Feb 0 0 5 15 30 44(9.7) Mar 2 2 8 18 26 35(12.2) Apr 1 0 4 16 35 35(15.8) May 0 1 5 14 38(0.1) 25(19.8) Jun 0 1 6 12 27(0.4) 48(25.9) Jul 1 5 10 17 24(0.9) 27(32.8) Aug 1 4 14 24 29(1.4) 27(44.2) Sep 1 3 7 " 23 39(2.3) 14(66.1) Oct 1 0 14 25 34(3.3) NA Nov 1 5 8 22 38(4.3) NA Dec 2 7 16 21 46(5.8) NA 102 and for the UK data we consider December 1986 as the present time. In each plot, time has been rescaled so that the data with t in [0,1] are "known" and the data with t > 1 are "unknown" and will be predicted. For the Canadian data, data with t in [0,1] are quarterly numbers of new AIDS cases from the fourth quarter of 1979 (79Q4) to the first quarter of 1988 (88Q1) and are used to "forecast" the quarterly numbers of new AIDS cases from 88Q2 to 90Q1. For the UK data, data with t in [0,1] are monthly numbers of new AIDS cases from January 1982 to December 1986 and are used to "forecast" the monthly numbers of new AIDS cases from January 1987 to September 1987. Healy and Tillett [20] fit a log-linear model in t to the most recent two years data assuming the Yis to be Poisson counts, log(Â£(y,)) = Â«o + ai*, (7.1) and then extrapolate the fit to get forecasts. The local linear forecasting estimator with FCV, and Poisson regression using the most recent data will be applied to each data set to produce forecasts for specified time periods. Healy and Tillett [20] chose a time span of two years in the Poisson regression for the UK AIDS data. The choice of a time span of data on which Poisson regression is computed is equivalent to the choice of a bandwidth in local regression. To show that forecasts are highly dependent on the time span, that is, the bandwidth, Poisson regression will be computed using data from the most recent half year, one year and two years and then the fit will be extrapolated to yield forecasts. This choice of the spans of "most recent" data is arbitrary. The forecasts will then be compared to the corresponding true numbers of new AIDS cases for the Canadian data or to the corresponding corrected numbers of new AIDS cases for the UK data. Tables 7.3-7.4 display the "forecasts" by the local linear method and by Poisson re gression using the most recent data with the three time spans. The optimal bandwidths estimated by FCV are rescaled in years so that it is easy to see the "time span" that 103 < 8 S3 at 8 1 Time (0.0=7903,1.0=8801) Time (0.0=Dec 81,1.0=Dec 86) ure 7.1: Canadian AIDS data and UK AIDS data: forecasting beyond t 104 the local linear forecasting estimator uses. The average of the squared forecasting errors (ASFE), displayed in the last line of these tables, serves as a measure of the precision of forecasts by each method. For the local linear method, the FCV score and estimated bandwidth (in years) are also displayed. For the local linear forecast at t = 1 + An for each Aâ€ž, AMSE(8) with 8 = An/hopt can be used as a measure of accuracy of the forecast. However, AMSE is unknown. Recall that FCV â€” a2 is approximately AMSE, provided that errors in Ys are ho moscedastic. Since Figure 7.1 shows that the assumption of homoscedastic errors is reasonable for the Canadian data with t in [0,1] and for the UK data with t in [0.5,1.0], the Rice estimator can be applied to estimate a2 in each case. The Rice estimator as in (4.13), when applied to the Canadian data in [0,1] and the UK data in [0.5,1.0] yields 177 and 14, respectively. Although the distribution of FCV is unknown, FCV scores are displayed as an indication of forecasting errors. Table 7.3 shows that the local linear forecasting estimator with FCV gives the closest forecasts for the Canadian data, with the smallest ASFE, 716. Poisson regression from data of different time spans clearly gives very different forecasts, showing that the choice of a time span of most recent data is a practical problem. This sensitivity to the choice of the bandwidth is usual for smoothing methods. Poor performance of Poisson regression to the Canadian data also suggests the possibility that Y{S may not be Poisson and thus a specific assumption on the distribution of YiS may not be justifiable for different AIDS data sets. Table 7.4 shows that Poisson regression using data of the most recent year gives the closest forecasts, with the smallest ASFE, 62. However, Poisson regression using data of the most recent half year gives the worst result, with the largest ASFE, 2389. In comparison, the local linear forecasting estimator with FCV gives decent forecasts with ASFE, 129. Figure 7.2 shows the forecasts by the local linear method and the forecasts 105 Table 7.3: Forecasts (ms) for 8 quarters, 88Q2 to 90Q1, on Canadian AIDS data (79Q4-88Q1) by local linear forecasting estimator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by hyopt. The FCV score and hy0rpt are displayed together in parenthesis. The <r2 estimated by the Rice estimator is 177. Time Method Actual # m by LL rh by PR on data from most recent Y m (FCV,hy0rpt) 0.5 year 1.0 year 2.0 years 88Q2 293 (64, 3.2) 273 288 311 254 88Q3 315 (483, 2.4) 279 303 341 295 88Q4 309 (234, 0.9) 286 319 373 304 89Q1 349 (312, 1.0) 292 337 408 351 89Q2 370 (403, 1.2) 299 355 446 317 89Q3 346 (685, 0.9) 306 374 488 350 89Q4 359 (1816, 0.9) 313 394 534 328 90Q1 371 (2474, 0.9) 320 415 584 372 ASFE = Â£(m - y)2/8 716 1191 1222 17033 / 106 Table 7.4: Forecasts (ms) for 9 months, January 1987 to September 1987, on UK AIDS data (January 1982-December 1986) by local linear forecasting estimator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by hyJpt. The FCV score and hyJpt are displayed together in parenthesis. The a2 estimated by the Rice estimator is 14. Time Method Delay-rh by LL m by PR on data from most recent corrected # m (FCV,h%t) 0.5 year 1.0 year 2.0 years yc 87Jan 53 (54, 0.6) 58 47 47 41 87Feb 45 (60, 1.4) 66 49 49 54 87Mar 48 (57, 1.3) 75 52 51 47 87Apr 49 (53, 1.3) 85 55 54 51 87May 52 (44, 1.2) 96 58 57 45 87Jun 55 (27, 1.1) 109 61 60 74 87Jul 57 (20, 1.0) 124 64 63 60 87Aug 59 (46, 1.0) 141 67 66 71 87Sep 61 (68, 0.9) 160 71 69 80 ASFE â€” E(m - yc)2/9 129 2389 62 65 / 107 by the Poisson regression with the smallest ASFE. For the Canadian AIDS data, the Poisson regression with the smallest ASFE uses the data from the most recent half year; for the UK AIDS data the Poisson regression with the smallest ASFE uses the data from the most recent year. The plots in Figure 7.2 confirm our observation from Tables 7.3:7.4 that the local linear method gives very decent forecasts for both data sets. 108 I* E o C to P Time (0.0=7903.1.0=88Q1) S co a < S S Time (0.0=Dec 81,1.0=Dec 86) Figure 7.2: Forecasts by the local linear method and by Poisson regression using the most recent data. Among three sets of forecasts by Poisson regression, the set with the smallest ASFE is plotted. 109 Chapter 8 Concluding remarks In this thesis, the local linear forecasting estimator is proposed as an alternative to either backcalculation or parametric regression in the context of forecasting for AIDS data. The asymptotic theory of this estimator is developed and applied to the automatic implementation of this method, i.e., the data driven estimation of an optimal bandwidth for forecasting. For the estimation of an optimal bandwidth, two plug-in procedures and two cross-validation procedures are investigated theoretically and by simulations. Sim ulations clearly show the advantage of FCV over the two plug-in procedures, CAMSE and HAMSE, and the other cross-validation procedure, BCV. The simulation study of Chapter 6 shows that the local linear method provides better forecasts than the back-calculation approach and that backcalculation can not accomplish the two tasks, that is, to provide a reasonable estimate of the HIV infection curve and to provide forecasts of AIDS incidence. Analyses of the Canadian AIDS data and the UK AIDS data show that the local linear method forecasts well. In addition, these analyses expose the practical problem of choosing an appropriate size of a time span of most recent data for use in parametric regression. The theoretical results for the local linear forecasting estimator are derived based on 110 the assumption of independent data. This assumption is not entirely realistic and the statistical properties of the local linear forecasting estimator using correlated data and the estimation of an optimal bandwidth will be interesting topics for future research. In the literature of AIDS research, many authors (e.g., [1], [20] and [23]) assumed the numbers of AIDS cases to be Poisson. For example, Bacchetti [1] assumed that the numbers of people infected with HIV follow a nonhomogeneous Poisson process. Under appropriate assumptions, the process of numbers of AIDS cases can be shown to be Poisson. It would be interesting to simulate the HIV infection process and the incu bation distribution to get simulated AIDS counts and then to analyse the simulated AIDS counts by the backcalculation approach and the local linear forecasting estimator. Another interesting extension of the local linear forecasting estimator is to include co-variates in the regression, for example, sexual practices, age and gender, and to estimate the expected numbers of new AIDS cases among a group with specific covariate values. Moreover, construction of point-wise confidence intervals of the forecasts by the local linear method is an interesting problem. Ill Bibliography [1] Bacchetti, P., Segal, M.R., and Jewell, N.P. (1993), "Backcalculation of HIV Infec tion Rates," Statistical Science 8, 82-119. [2] Bacchetti, P., Segal, M.R., Hessol, N.A., and Jewell, N.P. (1993), "Different AIDS Incubation Periods and Their Impacts on Reconstructing Human Immunodeficiency Virus Epidemics and Projecting AIDS Incidence," Proc. Natl. Acad. Sci. USA 90, 2194-2196. [3] Brookmeyer, R. (1991), "Reconstruction and Future Trends of the AIDS Epidemic in the United States," Science 253, 37-42. [4] Brookmeyer, R., and Quinn, T.C. (1995), "Estimation of Current Human Immun odeficiency Virus Incidence Rates from a Cross-sectional Survey Using Early Diag nostic Tests," American Journal of Epidemiology 141, 166-172. [5] Brookmeyer, R., Quinn, T.C, Shepherd, M., et al (1995), "The AIDS Epidemic in India: A New Method for Estimating Current Human Immunodeficiency Virus (HIV) Incidence Rates," American Journal of Epidemiology 142, 709-713. [6] Cohen, P.T., Sande, M.A., and Bolberding, P.A. (Editors), The AIDS Knowledge Base (2nd Ed, 1994), Little, Brown and Company. 112 [7] Fan, J., and Gijbels, I. (1994), "Data-driven Bandwidth Selection in Local Polyno mial Fitting: Variable Bandwidth and Spatial Adaptation," JRSSB, to appear. [8] Fan, Jianqing (1992), "Design-adaptive Nonparametric Regression," JASA 87, 998-1004. [9] Gail, M.H., and Brookmeyer, R. (1988), "Methods for Projecting Course of Ac quired Immunodeficiency Syndrome Epidemic (Review)," Journal of the National Cancer Institute 80, 900-911. [10] Gail, M.H., and Rosenberg, P.S. (1991), "Perspectives on Using Backcalculation to Estimate HIV Prevalence and Project AIDS Incidence" manuscript. [11] Gasser, T., Sroka, L., and Jennen-Steinmetz, C. (1986), "Residual Variance and Residual Pattern in Nonlinear Regression, " Biometrika 73, 625-633. [12] Gasser, T., Kneip, A. and K5hler, W. (1991), "A Flexible and Fast Method for Automatic Smoothing," JASA 86, 643-652. [13] Hall, P., Sheather, S.J., Jones, M.C. and Marron, J.S. (1991), "On Optimal Data-based Bandwidth Selection in Kernal Density Estimation," Biometrika 78, 263-271. [14] Hardle, W., and Marron, J.S. (1985), "Optimal Bandwidth Selection in Nonpara metric Regression Function Estimation," The Annals of Statistics 13, 1465-1481. [15] Hardle, W. (1989), Applied Nonparametric Regression, SIAM Cambridge University Press. [16] Hart, J. D. and Wehrly, T. E. (1986), "Kernel Regression Estimation Using Re peated Measurements Data," JASA 81, 1080-1088. 113 [17] Hart, J. D. (1994), "Automated Kernel Smoothing of Dependent Data by Using Time Series Cross-validation," Journal of the Royal Statistical Society, Ser. B. 56, 529-542. [18] Hart, J. D. and Yi, S. (1996), "One-sided Cross-validation," manuscript. [19] Hastie, T. and Loader, C. (1993), "Local regression: Automatic Carpentry," Statist. Sci. 8, 120-143. [20] Healy, M. J. R., and Tillett, H. E. (1988), "Short-term Extrapolation of the AIDS Epidemic," JRSSA 151, 50-61. [21] Lehmann, E. L.'(1975), Nonparametrics: Statistical Methods Based on Ranks, Holden Day. [22] Linz, Peter (1985), Analytical and Numerical Methods for Volterra Equations, SI AM Philadelphia. [23] Marion, S. A., and Schechter M. T. (Dec, 1990), "Estimation of the Number of Persons in Canada Infected with Human Immunodeficiency Virus," A report com missioned by the Minister of National Health and Welfare. [24] Marion, S. A., and Schechter M. T. (Mar., 1992), "Human Immunodeficiency Virus Infection in Canada: An Updated Analysis Using Backcalculation," A report com missioned by the Minister of National Health and Welfare. [25] Rice, J. (1984), "Bandwidth Choice for Nonparametric Regression," The Annals of Statistics 12, 1215-1230. [26] Ruppert, D., Sheather, S.J., and Wand, M.R (1993), "An Effective Bandwidth Selector for Local Least Squares Regression," manuscript. 114 [27] Stone, C. J. (1982), "Optimal Global Rates of Convergence for Nonparametric Regression," The Annals of Statistics 10, 1040-1053. [28] Wahba, G. (1977), "Practical Approximate Solutions to Linear Operator Equations When the Data Are Noisy," SIAM J. Numer. Anal. Vol. 14, 651-667. [29] Wahba, G. (1979), "Smoothing and 111 posed Problems," Solution Methods for Integral Equations with Applications, Michael Golberg (Editor), Plenum Press, 183-194. [30] Wahba, G. (1982), "Constrained Regularization for 111 Posed Linear Operator Equa tions, with Application in Meteorology and Medicine," Statistical Decision Theory and Related Topics ///Vol. 2, Shanti S. Gupta and J. 0. Berger (Editors), 383-418. [31] Wahba, G. (1990), Spline Models for Observational Data, SIAM. 115
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Local linear regression versus backcalculation in forecasting
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Local linear regression versus backcalculation in forecasting Li, Xiaochun 1996
pdf
Page Metadata
Item Metadata
Title | Local linear regression versus backcalculation in forecasting |
Creator |
Li, Xiaochun |
Date | 1996 |
Date Issued | 2009-03-17T18:45:11Z |
Description | The local linear forecasting estimator is proposed in this thesis as an alternative technique to either parametric regression or the backcalculation approach in the context of forecasting for independent data. The asymptotic bias and variance of the local linear forecasting estimator are derived and used to develop procedures for the estimation of the optimal bandwidth for forecasting. Both the theoretical and the computational aspects of these procedures are explored. Simulation study shows that a cross-validation procedure has the best performance in forecasting among four bandwidth estimation procedures under study. Simulations and statistical analyses show that the backcalculation approach is very vulnerable to violations of the assumptions underlying this approach and that its application to AIDS data fails to achieve its two primary goals, to forecast the numbers of new AIDS cases and to estimate the historical HIV infection curve. To test the proposed forecasting estimator over parametric regression, both techniques are applied to the Canadian AIDS data and the UK AIDS data. The results of the two examples expose the weakness of parametric regression and show that the proposed technique does better than parametric regression in forecasting. |
Extent | 4410988 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-03-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0087838 |
URI | http://hdl.handle.net/2429/6155 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1996-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- ubc_1996-147800.pdf [ 4.21MB ]
- Metadata
- JSON: 1.0087838.json
- JSON-LD: 1.0087838+ld.json
- RDF/XML (Pretty): 1.0087838.xml
- RDF/JSON: 1.0087838+rdf.json
- Turtle: 1.0087838+rdf-turtle.txt
- N-Triples: 1.0087838+rdf-ntriples.txt
- Original Record: 1.0087838 +original-record.json
- Full Text
- 1.0087838.txt
- Citation
- 1.0087838.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 16 | 9 |
United States | 9 | 1 |
French Polynesia | 9 | 0 |
Russia | 2 | 0 |
France | 2 | 0 |
Unknown | 2 | 0 |
Poland | 1 | 0 |
Canada | 1 | 0 |
Japan | 1 | 0 |
City | Views | Downloads |
---|---|---|
Beijing | 13 | 0 |
Papeete | 9 | 0 |
Ashburn | 5 | 0 |
Unknown | 4 | 18 |
Shenzhen | 3 | 9 |
Saint Petersburg | 2 | 0 |
Sunnyvale | 2 | 0 |
Ottawa | 1 | 0 |
Paris | 1 | 0 |
Matawan | 1 | 0 |
Tokyo | 1 | 0 |
Redwood City | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0087838/manifest