L O C A L L I N E A R REGRESSION VERSUS B A C K C A L C U L A T I O N IN F O R E C A S T I N G by Xiaochun L i B .Sc , Tsinghua University, Beijing, China, 1986 M . S c , Tsinghua University, Beijing, China, 1988 M . S c , University of Saskatchewan, 1991 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF D O C T O R OF P H I L O S O P H Y in T H E F A C U L T Y OF G R A D U A T E STUDIES Department of Statistics We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y OF BRITISH C O L U M B I A August 1996 ©Xiaochun L i , 1996 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of StcffiS'TiC S The University of British Columbia Vancouver, Canada Date O l f % DE-6 (2/88) Abstract The local linear forecasting estimator is proposed in this thesis as an alternative technique to either parametric regression or the backcalculation approach in the context of forecasting for independent data. The asymptotic bias and variance of the local linear forecasting estimator are de- rived and used to develop procedures for the estimation of the optimal bandwidth for forecasting. Both the theoretical and the computational aspects of these procedures are explored. Simulation study shows that a cross-validation procedure has the best performance in forecasting among four bandwidth estimation procedures under study. Simulations and statistical analyses show that the backcalculation approach is very vulnerable to violations of the assumptions underlying this approach and that its appli- cation to AIDS data fails to achieve its two primary goals, to forecast the numbers of new AIDS cases and to estimate the historical H I V infection curve. To test the proposed forecasting estimator over parametric regression, both tech- niques are applied to the Canadian AIDS data and the U K AIDS data. The results of the two examples expose the weakness of parametric regression and show that the proposed technique does better than parametric regression in forecasting. i i Contents Abstract ii Table of Contents iii List of Tables vi List of Figures viii Acknowledgements x 1 Introduction 1 1.1 The linear operator approach 4 1.2 The regression approach 7 1.2.1 Parametric extrapolation 7 1.2.2 A local linear forecasting estimator 9 2 The linear operator approach 12 2.1 Some results on reproducing kernel Hilbert spaces 14 2.2 Appl i ca t ion to the A I D S data: backcalculation 16 3 Asymptotics for the local linear regression estimator 22 i i i 4 The choice of a smoothing parameter 39 4.1 The plug-in approach 40 4.1.1 Introduction 41 4.1.2 Plug-in approach using complete asymptotics for forecasting (CAMSE) 43 4.1.3 The plug-in approach using the finite sample variance for forecast- ing (HAMSE) 44 4.1.4 Estimation of m" and <72 - an introduction 45 4.1.5 Estimation of m" and a2 for forecasting 46 4.2 Cross-validation 49 4.2.1 Background to the non-forecasting setting 49 4.2.2 C V for the linear operator approach 53 4.2.3 C V for the local regression approach in curve estimation 53 4.2.4 C V for forecasting: FCV 54 4.2.5 C V for forecasting, BCV 61 5 Simulation: local linear forecasting estimator 71 5.1 Motivation and data 71 5.2 Results and comments 73 6 The local method versus backcalculation 85 6.1 Comments on the methods 85 6.2 A small simulation study 87 6.3 Discussion of the methods and the simulation results 96 7 Data analysis 100 7.1 Data, methods and results 100 8 Concluding remarks 110 iv Bibliography 112 v List o f Tables 5.1 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m\ with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value 5.2 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m 2 with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value 5.3 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m 3 with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed in column (m.s.e) for true value 6.1 Summary of forecasts by backcalculation and local linear regression for 100 realizations of the fictitious AIDS data 6.2 Summary of lower bounds of forecasts by backcalculation for 100 realiza- tions of the fictitious AIDS data 7.1 Quarterly numbers of AIDS cases reported within six years of diagnosis in Canada, 79Q4-90Q1 vi 7.2 Monthly numbers of AIDS cases reported to the end of September 1987 in U K . For time periods May 1986 to September 1987, the first number is the reported and the second number in parenthesis is the estimate of number of unreported new AIDS cases 102 7.3 Forecasts (ms) for 8 quarters, 88Q2 to 90Q1, on Canadian AIDS data (79Q4-88Q1) by local linear forecasting estimator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by hy0rpt. The FCV score and h?Jpt are displayed together in parenthesis. The a2 estimated by the Rice estimator is 177 106 7.4 Forecasts (ms) for 9 months, January 1987 to September 1987, on U K AIDS data (January 1982-December 1986) by local linear forecasting esti- mator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by hyJpt. The FCV score and hyJpt are displayed together in parenthesis. The a2 estimated by the Rice estimator is 14. . 107 List of Figures 1.1 The quarterly number of A I D S cases reported wi th in six years of diagnosis from the fourth quarter of 1979 (79Q4) to the first quarter of 1990 (90Q1). One unit i n T i m e is 1/42 w i t h 42 being the total number of quarters from 79Q4 to 90Q1 inclusive 3 5.1 The median of 100 shifted FCV curves (dotted), the median of 100 CAMSE curves (dashed) and the true AMSE curve (solid) for m = mi,m,2 and m 3 wi th n = 50 and A n = 0.1 79 5.2 Pairs of CAMSE (dotted) and HAMSE (solid) curves for 9 simulated data sets for m = ra3, n = 50 and A n = 0.1 80 5.3 The AMSE curve, the 25th and the 75th quantiles of shifted FCV and CAMSE respectively, for m = ra3, n = 100 and A n = 0.1 81 5.4 The median of 100 shifted BCV curves (dotted line) and AMSE curve (solid line) for m — m 3 , n = 50 and A n = 0.1 82 5.5 The AMSE curve (solid) and the mean of 100 shifted FCV curves (dot- ted) for m = m 3 w i t h A n = 0.1,0.2 and n = 50,100 respectively 84 6.1 The H I V infection function 7,1(s) = —200(s+l)(5—2) and the incubation function F , F(t) — 1 — e - ' used i n the generation of the fictitious A I D S data 89 viii 6.2 The sensitivity of I to the assumption of F i n the fictitious A I D S data (<r = 0). True F(t) = 1 - e~\ long F(t) = 1 - e"*' 1- 1 and short F(t) = l - e - ' / ° - 9 93 A 6.3 The sensitivity of / to the presence of the errors (er = 0.5) i n the fictitious A I D S data when the true F is used. The first plot shows that J is recovered from the error-free A I D S data while the rest show that I is highly distorted by the presence of errors i n the A I D S incidence data. The solid curve (-) i n each plot is the true / and the dotted line (...) the backcalculated J 94 6.4 The variabil i ty of backcalculated / 95 7.1 Canadian A I D S data and U K A I D S data: forecasting beyond t = 1. . . . 104 7.2 Forecasts by the local linear method and by Poisson regression using the most recent data. A m o n g three sets of forecasts by Poisson regression, the set w i t h the smallest ASFE is plotted 109 ix Acknowledgement s I would like to thank my thesis supervisor Nancy Heckman for her guidance i n the development of m y thesis, for her excellent advice and support, and for being an influential role model throughout my P h . D . program. I am also very grateful to the members of m y thesis committee, Jean Meloche, Steve M a r i o n , and i n particular, John Petkau for their careful reading of the manuscript and valuable comments. In addit ion, I a m very much indebted to J i m Ramsay for making m y stay at M c G i l l possible and for his inspiration i n statistics. I would also like to acknowledge the help from P i n g Y a n from the Laboratory Centre for Disease Control , who provided me w i t h the Canadian A I D S data, and the friendship and support of fellow graduate students and staff members i n the Department of Statistics. F ina l ly , I would l ike to thank the Universi ty and the Department of Statistics for the indispensible financial support I received throughout my P h . D program. Chapter 1 Introduction Suppose bivariate data {(t;, Yi), i = 1 , . . . , n] are observed at times 0 < ti < t2 < ... < 'tn < 1. Here the f;s are i n [0,1], which may be normalized physical times. Thus any t ime beyond 1 is the "future" . W h a t can be said about the future path of this stream of data by studying the information available up to the present? Specifically, what is the "expected" value of Y at some t > 1? Consider the A I D S example. A new disease, acquired immunodeficiency syndrome ( A I D S ) , was defined by the Center for Disease Control i n the Uni ted States i n 1982. A I D S is believed to be caused by the human immunodeficiency virus, or H I V . This disease spread rapidly i n the Uni ted States: reported cases of A I D S increased from 295 i n 1981 to nearly 42,000 i n 1991 wi th reported deaths increasing from 126 to more than 30,000 i n the same period. A I D S is now one of the major causes of death i n the Uni ted States [6]. A I D S has spread and caused serious concern around the world. Even though the current data suggest a slowing down i n the growth of the A I D S epidemic i n the Uni ted States, Northern Europe, Canada and Austra l ia , the spread of H I V infection is rampant i n Afr i ca . A I D S i n South A m e r i c a and A s i a seems to be at the onset of explosive spread [6]. 1 Uncertainty about A I D S has generated research involving epidemiologists and statis- ticians about the relationship between H I V and A I D S , the transmission of the disease and the future course of A I D S incidence. Mechanisms for monitoring the epidemic have been set up around the world. In the Uni ted States, the number of new A I D S cases each month has been available from the C D C (Centers for Disease Control) since 1982 (earlier data from 1975 and prior to 1982 being combined to assure confidentiality [1]) while i n Canada quarterly numbers have been recorded by the F C A (the Federal Centre for A I D S ) since 1979 [23] and the L C D C (the Laboratory Centre for Disease Control) . Figure 1.1 depicts the quarterly numbers of A I D S cases reported wi th in six years of diagnosis i n Canada. Those data are considered to be complete i n reporting [23]. This thesis w i l l focus on the problem of forecasting w i t h complete or adjusted A I D S data, since adjusting A I D S data for under-reporting and reporting-delays would pose a separate research problem on its own. Note that the data shown here are not the cumulated numbers of A I D S cases. Each data point represents the number of new A I D S cases i n the corresponding quarter. In this case, i ; corresponds to the end of the i t h quarter and Y{ observed at t,-, the number of new A I D S cases i n the zth quarter (£,•_!,£,•]. One goal i n A I D S research is forecasting the number of A I D S cases that w i l l occur i n future t ime periods. This number is important to health organizations and governments for estimating the future demand i n health care and allocating funds accordingly. To summarize data and make forecasts based on observed data of the form {(ti,Yi),i = 1 , . . . ,n } , various approaches exist i n different theoretical frameworks: 1. the linear operator approach; 2. the regression approach; 3. the t ime series approach. 2 o o CO C a n a d i a n D a t a o o C\J o o O H i 1 1 ; r ~ 0 . 2 0 .4 0 . 6 0 .8 0 . 0 1.0 T i m e ( 0 . 0 = 7 9 Q 3 , 1 . 0 = 9 0 Q 1 ) Figure 1.1: The quarterly number of A I D S cases reported wi th in six years of diagnosis from the fourth quarter of 1979 (79Q4) to the first quarter of 1990 (90Q1). One unit i n T i m e is 1/42 w i t h 42 being the total number of quarters from 79Q4 to 90Q1 inclusive. 3 This thesis focuses on the first two approaches, its main goal being the development of methodology for regression. Its application to A I D S data w i l l be used as an example. 1.1 The linear operator approach Assume the i,s are i n [0,1]. In the special case of equally spaced data, U = i/n, tn = 1 being the "present" t ime at which the last observation was made. Suppose (t, Y) has a certain probabilistic structure and E{g(Yi)\ti} = m(t , ) , g being a transformation function and m some smooth function. For a fixed design where the t,s are not random, E{g{Yi)\ti] means E{g(Yi)} w i t h Yi observed at t = t{. The linear operator approach assumes that data can be expressed i n a functional form, E{g(Yi)\ti} = Li(I) = where m is a smooth function and Li is a known linear operator on a certain function space where the unknown I resides. In this thesis, the case where X,- is the evaluation operator, i.e., L{I = I(ti), is considered as a problem i n regression (approach 2) i n which no additional information other than that i n the data is available. The function I w i l l have a specific meaning i n applications. In the A I D S example, I is the rate of H I V infection and the YiS are the numbers of A I D S cases. The linear operator L,- for the A I D S data w i l l be described later. The linear operator approach summarizes the data and enables forecasting by estimating I first. Then the operator Lj is applied to this estimated I to give an estimate of m (£,•). This approach has the potential benefit of embedding prior knowledge of the structure of the data into the linear operator Li. The technique called backcalculation [3] which exemplifies this approach is i l lustrated below through its application to the A I D S data. The prevalent theory i n A I D S research postulates that an A I D S patient was infected w i t h H I V at a certain t ime s i n the past which took some t ime to develop to the advanced stage of H I V infection and then to the t ime point of A I D S diagnosis, t. The 4 length of t ime between the point of H I V infection at s and the point of A I D S diagnosis at t is called the incubation t ime, whose distribution can be modeled by a t ime dependent distribution function. Estimates of F have been found from cohort studies. So F is usually assumed known. Based on the assumed relationship between H I V infection and A I D S diagnosis, the expected number of A I D S cases before t can be calculated by the formula, a(t) = i£(number of A I D S cases diagnosed before t ime t) = f* I(s)F(t-s\s)ds, (1.1) Jo where I(s) is the expected number of new H I V infections at t ime instant s, s > 0. /(•) is unknown and to be estimated. The convolution formula (1.1) uses an integral operator to l ink the H I V infections and the A I D S diagnoses. F r o m (1.1), for equally spaced i,s m(ti) = E{Yi) = ^ ( n u m b e r of A I D S cases i n (t,_i,t ,]) = a(tt) — a(2,_i) = I*' I(s)F(U - s\s)ds - f*"11(s)F(ti_1 - s\s)ds Jo Jo = £ I(s) (F(U - s\s) - F(ti - ^ - s\s)) ds (1.2) EE Li{T), (1.3) where F{u\s) = 0, if u < s. The linear operators, i.e., the L t s so defined, model the dynamics of the two processes: H I V infection and the A I D S diagnosis [3]. In general, define m(t) = j* I(s) (F(t - s\s) - F(t - i - s\s)^j ds. (1.4) Thus to forecast the number of A I D S cases i n a future quarter, say (1 + A „ —1/n, 1 + A„] , one must estimate m(t) at t = 1 + A n . 5 The first step to forecasting is to estimate the H I V infection curve I f rom the Y;s, the numbers of A I D S cases. This is done by confining /(•) to an appropriate function space Ti and minimiz ing a fitt ing criterion, e.g., /(•) = argminl€>l J2(Yi - m(ti))2 + A /* I"\t)dt (1.5) The parameter A i n (1.5) is the smoothing parameter which can be chosen by cross- validation. Mor e detailed discussion of cross-validation is provided i n Chapter 2 and Chapter 4. The above procedure for estimating I f rom the Y{S is the so-called backcal- culation [3]. In an appropriately chosen 7i, the minimizer I can be writ ten as a linear combination of l,t and the Riesz representors of L{S [31]. For details, see Chapter 2. After I is obtained, m ( l + A n ) can be predicted by using I i n formula (1.4). However, for s close to 1, I(s) is usually not reliable. The information about I(s) is contained i n the Ys observed after t ime s. Thus the data contain no information about I(s) for 5 > 1. Further, for s < 1, the closer a time point 5 is to 1 the less reliable the backcalculated I(s) is because there are fewer available data. A conservative approach is to be less ambitious and get a lower bound for m ( l + A n ) instead. A useful lower bound can be found if the disease has a long incubation. For a short term forecast, a great proportion of new H I V infections arising i n t ime period (1,1 + A n ] are st i l l latent and thus w i l l not contribute to the number of new A I D S cases occurring i n (1 + A n — 1/n , 1 + A n ] , Therefore, the percentage of new A I D S cases contributed by the H I V infections that occur i n (1,1 + A„] may be negligible. From (1.4), we have m ( l + A n ) > £ I(s) (F(l + An - s\s) - F(l + A n - i - s|s)) ds = LBn, (1.6) which i n effect takes I to be zero over ( 1 , 1 + A„]. A n estimate of LBn can be obtained by plugging I into the above formula for LBn, LBn = £ i(s) (^(1 + A„ - s\s) - F(l + An - ^ - s\sf) ds. (1.7) 6 More examples of the linear operator approach i n science and engineering can be found i n [29] and [30]. 1.2 The regression approach The regression approach assumes an unknown underlying regression function m for the data or the transformed data, i.e., E{g(Y)\U} — m{tt). It w i l l be assumed hereafter that the data are properly transformed and E(Yi\t{) = m(t,) i n the regression approach. The regression approach uses the data {Yi}™ to estimate ra directly. This general approach includes parametric regression i n which m depends on a global parametric form and local regression i n which m is only assumed to be smooth. The next two sections contain details of both types of regression i n the context of forecasting. 1.2.1 Parametric extrapolation This approach assumes that the trend of the data continues over a period of t ime i n a postulated parametric form, for example, log(E(Yi)) = /So + / M i - O n e fits this parametric model to the entire data set and then uses this fit to extrapolate to obtain the forecast. In 1986 the P u b l i c Health Service i n the Un i ted States gave a forecast of 270,000 for the cumulative number of A I D S cases i n the Un i ted States by the end of 1991 by extrapolating a polynomial model . The actual number of reported A I D S cases turned out to be 206,000 [6]. A s noted by a few authors ([9],[20]), parametric extrapolation can provide useful short-term forecasts. However, there are a few criticisms of this approach: 1. Parametric extrapolation does not make use of any available information on the progression of the H I V infections to A I D S disease and thus may be less efficient 7 than backcalculation. 2. The parametric assumption is not verifiable and therefore is a considerable source of uncertainty. Thus parametric extrapolation is unable to give a long-term fore- cast. A s for the first cr i t ic ism, it is not necessarily the case that a method that uses less data is less efficient than a method that uses more data. The discussion i n Chapter 6 w i l l show that the local linear forecasting estimator that uses only K)}™, suggests better forecasts than backcalculation that uses {(i,-,!^)}" and additional data. A s for the second cri t ic ism, so far as forecasting is concerned, future predictions of medium to long term pose a challenge for statisticians. If a parametric model is "correct" it w i l l be able to give a long-term forecast. However, no models are "correct" since they are simplified representations of the real world. W h e n a set of data is fitted to a specific parametric form, the fitted curve may be compromised to fit a l l the data well at the expense of local structure i n the data. Especially of interest is the local structure near the present, t = 1. Forecasts based on the extrapolation from such a fitted curve are unlikely to be as good as forecasts based an extrapolation from a fitted curve of more recent data. A s Healy and Ti l le t [20] write, "It doesn't seem entirely reasonable to give the early data . . . a high degree of influence i n forecasting the future." W h e n fitting the data to a log-linear model , they superimposed subjectively-chosen weights decreasing into the past. This method should be expected to be an improvement over simple parametric fitting when "appropriate" weights are chosen. It might be desirable to do away w i t h the parametric assumption and also to couple the fitting wi th a data-driven method for the choice of weights. The local regression method proposed i n the next section does exactly this. 8 1.2.2 A local linear forecasting estimator In general, to estimate a function m at a certain point t, the approach of local linear regression assumes that m has a linear structure i n a neighbourhood of t: m(x) = Po + Pi(x — t), and estimates flo, /?i by weighted regression. The weights assigned to the observations close to t are much bigger than the weights assigned to the rest of observations. That is, let the estimator of m(t) be defined as follows: m^t) = fa+ fo(t-t) = 0Oit, (1.8) where A M = argminJ2(Yi - A> ~ p\{U ~ t))2K(\-^-), (1.9) \ J w i t h K(-) a kernel function that gives more weight to the observations close to t (defined by hn) than the rest of the observations, and hn the so-called bandwidth that determines the size of the neighbourhood of t. For example, if the kernel function is the indicator function over [—1,1], then K((U — t)/hn) = 1 for those t,s w i t h \tj — t\ < hn and 0 otherwise. Since rhhn(t) is calculated from the Y{S, it is a random variable w i t h statistical properties which are affected by the bandwidth hn and the kernel K. In practice the choice of K has l i t t le effect on rhhn{t) while the choice of hn has a large effect. Therefore we assume K is specified i n Chapter 4 and describe data driven choice of hn. For now assume hn is given. Unless m is a straight line, rhhn(t) is i n general a biased estimator of m(t), that is, Bias{rhhn(t)} = E{mhn(t)\ti,... ,tn} — m(t) ^ 0. Note that if the kernel function is the indicator function over [—1,1], the bandwidth hn reflects the number of Y{S used i n the regression. A large hn means that a large number of T̂ -s is used i n the regression and a small hn means otherwise. Therefore a large bandwidth 9 causes rhhn(t) to have a small variance and a small bandwidth causes rhhn(t) to have a large variance. For commonly used kernels, when hn is small , rhhn(t) w i l l have a small bias but a large variance; when hn is large, rhhn(t) w i l l have a large bias but a small variance. If hn is small enough, rhhn{t) w i l l behave like an interpolant while if hn is large enough, m/ l n ( t) w i l l become the familiar least squares fit . A balance between the bias and the variance is desirable and this balance is usually achieved by min imiz ing the mean squared error of rhhn{i), i.e., E{(rhhn(t) — m(t))2 \t\,... ,tn}. To predict m a t 1 + A n , let t = 1 + A n i n (1.9). Therefore, rh,hn{l + A n ) = /30,I+A„- To avoid cumbersome calculations of the asymptotics later i n Chapter 3, an algebraic substitution is employed to re-center the data at 1 instead of 1 + A „ . Note that P o , l + A „ 1+A„ / n V argminj^iY. - ft - ft ft - 1 - An))2K(U ~ ][~ A") 2 Let argminf2(Yi - ( f t - ftAn) - p\{U - l))2K(U \ A"). fir, ft* = ft-ftAn, ft* = ft, K-fir±) = K{ti \ An). hr (1.10) (1.11) Then argminitiYi - ft* - ft(tt- - l ) ) 2 T C , (1.12) \ «, / and m^„(l + A n ) = ftx + ftflAn. Formulat ion (1.12) w i l l be used hereafter wi th the superscript * dropped. The no- tation m ^ n > i ( l + A n ) instead of rhhn(l + A n ) w i l l be used to denote the forecasting estimator that forecasts A n ahead with data centered at t = 1 and wi th bandwidth hn. To summarize: 10 Definition 1.1 The estimator ra/i„,i(l + An) of m(l + An) is defined as m w ( l + A„) where argminf^iYi - A, - &(*,- - l ) ) 2 t f (^-—^). (1.13) •n The bandwidth / i n can be chosen by plug-in or cross-validation approaches. The devel- opment of those approaches for forecasting i n Chapter 4 is the m a i n contribution of this thesis. Results of the statistical properties derived i n Chapter 3 serve as the cornerstone for al l bandwidth estimation procedures. In the next chapter, theoretical results for the linear operator approach w i l l be presented and applied to the A I D S data. 11 Chapter 2 The linear operator approach The linear operator approach w i l l be discussed under the theoretical framework of a HUbert space. Standard results w i l l be presented without proof. Proofs and other details can be found i n [31]. Examples of the linear operator approach, details of computation and some asymptotic results can be found i n [28]-[31]. The following theorem is central to the linear operator approach. Theorem 2.1 Assume the following conditions: 1. 7i is a HUbert space over the reals, 1Z, with inner product <, > and norm || • ||; 2. For each i G { 1 , . . . ,n}, Li : ri —* TZ, is a bounded linear functional; 3. P : Ti —> Til, is a projection operator with ri\ a subspace ofri; 4- The null space of P is TCo = span{<f>i,<f)no}, where the <j)jS are linearly inde- pendent and n0 < 00 ; 5. !F(Y, L\(I),..., Ln(I)) : lZn xTZn —> 71 is a real function, where Y = (3 /1 , . . . , y „ ) ' . 12 For given X > 0, if the minimizer I of F{Y,Lx{I),...,Ln{I)) + X\\P{I)\\2 ( 2 - 1 ) exists in Ti, it is of the form: no i = \Edsh + Hc&> (2-2) i=i i=i where the djS and CjS are real numbers, = P{r)j) and r]j is the Riesz representor of Lj, that is, Lj(I) =< rjj, I > for all I G Ti. C o r o l l a r y 2.1 Assume the conditions in Theorem 2.1. In addition, assume the matri- ces T — (Tij)nXno, Tij = Li((f)j) and E = ( E , j ) n X n , E,-j =< > are of full column rank, and that the WjS are positive. Then ^ E « - ^ ) ) X + A | | P ( / ) | | 2 (2.3) has a unique minimizer I in Ti and is as in (2.2) with d = { d u . . . , d n o ) t = {TtA-xT)-xTtA-1Y, (2.4) c = ( c i , . . . , c n ) * = i 4 - 1 [ / - r ( r 'A- 1 r ) - 1 r*A - 1 ]y , (2.5) where A = E + n\I. R e m a r k s : 1. Theorem 2.1 greatly reduces the complexity of the task of finding the minimizer 7 i n Ti (often of infinite dimension) to finding / i n the finite dimensional subspace spanned by <j>x,..., <j>no-,i\-, • • •, £«• Roughly speaking, a parametric model of span <f>i,..., (f>no determined by P, and the additional functions, £ 1 , . . . , £ n j allow us a more flexible fit to our n data points. 13 2. Theorem 2.1 is a very general result i n that T can be a general function. However, we are only interested i n J7 being —n^logijikelihood), i n which case (2.1) is called a penalized l ikelihood. W h e n the YJS are independent and normally distributed wi th mean Lj(I) and variance (2WJ)~1, J- is — re_1/o<jr(normal likelihood) and is the first term as i n (2.3). We w i l l assume (2.1) to be a penalized l ikelihood hereafter. The number A, known as the smoothing parameter, governs the relative importance of the goodness of fit of the data to that of the penalty P. If A is zero, the fit of the data is of the utmost importance and the penalty is ignored, w i t h the result that I is the m a x i m u m likelihood estimator. O n the other hand, a large value of A forces I close to the nul l space of P and results i n an estimate close to a parametric fit. A n y A between these two extremes reflects a compromise between the goodness of fit to the data and the form of I emphasized by the penalty. 3. A n automatic, i.e., data-driven choice of the smoothing parameter A might be desired. Cross-validation is a technique commonly used for the estimation of an opt imal A. Details of this method w i l l be presented i n Chapter 4. Though the Riesz representation theorem assures the existence of the rjjS, the represen- tors of the LjS, it does not give the analytical forms of the r}jS. The theory of reproducing kernel Hilbert spaces (r.k.h.s.) makes it possible to calculate those £,-s analytically. This is important i n applications. This chapter w i l l present the relevant results of r.k.h.s. and then apply them to the case of backcalculation. 2.1 Some results on reproducing kernel Hilbert spaces The theory of r.k.h.s. is very powerful and very useful i n the linear operator approach i n general. 14 D e f i n i t i o n 2.1 Let H be a Hilbert space of real-valued functions over [0,1], H={I: [0,1] -> f t } . Then 7i is a r.k.h.s. if and only if for all t G [0,1], Lt: H^Tl with Lt(I) = lit) is a bounded linear functional. D e f i n i t i o n 2.2 IfTi is a r.k.h.s., then by the Riesz representation theorem, for any t, there exists a function Kt € rt such that < J , Kt >= I(t), for all I eTi. Let K(s, t) = Kt(s) :. [0,1] x [0,1] -> K. (2.6) Then K is called the reproducing kernel ofri. A s shown i n the following lemma, i n a r.k.h.s. it is very easy to use the reproducing kernel to calculate the Riesz representor of a bounded linear operator. L e m m a 2.1 Assume that L is a bounded linear operator on a r.k.h.s., Ti. Its Riesz representor is n where n(t) = L(Kt). Proof : Let n be the Riesz representor of L i n 7i. So for al l I G rl, L(I) =< / / , / > . In particular, for I = Kt, L(Kt) =< r),Kt >• B y Definit ion 2.2, < rj,Kt > is also equal to n(t). So n{t) = L(Kt). • A commonly chosen r.k.h.s. is the class of smooth functions w i t h square-integrable r t h derivatives. D e f i n i t i o n 2.3 W ^ l 0 * 1 ] = {/ : [0,1] -> ft : / , 7 ( r - 1 ) are absolutely continuous in [0,1] with JQ I^{u)2du < oo}. The inner product is defined as: < f,9 >= E/ ( i )(0)<7 ( i )(0) + t / w ( u ) 5 W ( « ) d u . i=o J o 15 Standard results say that this space is a r.k.h.s. and has a kernel of a known form. Lemma 2.2 W£[0,1] is a r.k.h.s. with the reproducing kernel, * ( M ) = £ ^ + / o X ( j ^ ^ \ s - u y - \ t - u y - i d u , (2.7) where (s — u)+ = s — u, if s > u and 0 otherwise. 2.2 Application to the AIDS data: backcalculation Recal l that i n the linear operator approach i n the A I D S example i n Section 1.1, E(Yj), the expected number of new A I D S cases i n the jth quarter, is assumed to be l inked to an unknown function 7 by a linear operator Lj, Lj{I) = f' I(s)(F(tj - s\s) - F(tj - - - s\s))ds, (2.8) Jo n where I is the H I V infection function and F(-\s) is the distr ibution function of the incubation t ime from H I V infection to A I D S diagnosis at the instant s. F is assumed known so the operator Lj is known. Obviously Lj is a linear operator. In this application, the H I V infection function I w i l l be assumed to be i n H = W£[0>1] w i t h r = 2. This is a very m i n i m a l assumption on I because M^tO?!] 1 S a very general class of functions. No specific assumptions are made on those func- tions other than smoothness and integrability: 7 and V are absolutely continuous w i t h Jo1 J"(u)2du < co. Recovering the historical H I V infection pattern I is an important goal i n A I D S research. It is believed that I is a smooth curve. Therefore rough Is should be penalized. The quantity / Q 1 I"(u)2du is a measure of the roughness of I so it can be used as the penalty term. 16 To put it statistically, finding a reasonably smooth I i n W ^ O , 1] that fits the observed A I D S incidence data well is achieved by finding the I that minimizes F{Y,L1(I),...,Ln(I)) + \ I11'XuYdu, (2.9) JO where T is the negative of a loglikelihood function divided by n , I € VF^O, 1] and A > 0. The smoothing parameter A places a control on the roughness of I relative to the goodness of fit of the Lj(I)s to the YjS. M i n i m i z i n g the penalized l ikel ihood means choosing an I i n W%[0,1] wi th a tradeoff between the goodness of fit to the data and the roughness of I. This tradeoff is determined by the value of the smoothing parameter A. A n opt imal A for this tradeoff minimizes the prediction error, PE(\) = n-1Y,{Y;-mx(ti)}2 i = l where the 5^*s are hypothetical new observations at t,s, rh\(ti) = Li(I\) and 7A minimizes (2.9). This opt imal A can be estimated by a standard technique called cross-validation. Further details on cross-validation w i l l follow i n Chapter 4. The minimizer I of (2.9) can be determined by using Theorem 2.1 and the theory of r.k.h.s. i n Section 2.1. Theorem 2.2 If the minimizer I of (2.9) exists in W|[0,1], it takes the form: i(t) = Bo +iht+ (2-10) 3=1 where m = I Jo x / , M t ) 2 (sAt)3 st{s A t) - (a + i ) ^ 1 - + ( F i t j - s ^ - F i t j - ^ - s ^ d s , (2.11) and s At = s if s <t and t otherwise. 17 Theorem 2.2 enables one to find the minimizer i n a finite dimensional subspace instead of i n W|[0 ,1], a space of infinite dimension. The succeeding lemmas verify the conditions of Theorem 2.1 for its application to the A I D S data. So the ensemble of the following lemmas serves as the proof of Theorem 2.2. L e m m a 2.2 i n the previous section assures that Condit ion 1 of Theorem 2.1 is satis- fied. [G\ 1] i s a Hi lbert space wi th the inner product <f,g>= /(0M0) + / W ( 0 ) + fX f"(u)9"(u)du, f , g e w2[o, 1], Jo and the norm induced by the above inner product, 11/2 7(0) 2 + I ' (0) 2 + f1I"(u)2du Jo The lemma below checks Condit ion 2 of Theorem 2.1. Lemma 2.3 Suppose that for any s > 0, ^(-l-s) is a distribution function with F(u\s) = 0 if u < 0 and that F(u\s) is continuous in both u and s. For each j G { 1 , . . . , n}, the operator Lj : Lj(I) = j h I(s)(F(tj - s\s) - F{tj - - - s\s))ds, I G W2 [0,1] Jo n is a bounded linear operator on W|[0,1]. Proof: Lj is well-defined since I(s)(F(tj — s\s) — F(tj — ̂ — s\s)) is continuous i n s and thus is integrable. The linearity of Lj is obvious. B y the definition of the boundedness of a linear operator, Lj is bounded if there exists C, 0 < C < co, such that \Lj(I)\ < C\\I\\, (2.12) for a l l Ie W2[0,1]. 18 Since 0 < F(tj - s\s) - F(tj - J - s\s) < 1, it follows that |^(/)| = | I(s)(F(tj - s\s) - F(tj - - s\s))da\ Jo n < £ j \I(s)\ds < £ \I(s)\ds < I(s)2ds^ 2 . (2.13) To bound /g |/(s)|<£s, first consider I(s) = 1(0) + s/'(0) + f" (U I"{v)dvdu. (2.14) Jo Jo A p p l y i n g the inequality (a + b + c)2 < 3(a 2 + b2 + c 2 ) , to (2.14) gives I(s)2 < 3(/(0) 2 + s2I'{0)2 + ( / ' / " I"{v)dvdu)2). (2.15) Jo Jo B y Schwarz's inequality, ( [' fU I"(v)dvdu)2 < ( [' [U I"(v)2dvdu)( [' fUl2dvdu), Jo Jo Jo Jo Jo Jo which is bounded by I"(v)2dv for 0 < s < 1. For 0 < s < 1, applying the relationship i n (2.15) yields: I(s)2 < 3(/(0) 2 + s 2 / ' (0 ) 2 + lX I"{v)2dv) Jo < 3(/(0) 2 + 7'(0)2 + C I"{v)2dv) Jo = 3||/||2. (2.16) Using (2.16) i n (2.13) gives \Lj(I)\ < ^1/2\\I\\- Therefore Lj is a bounded linear operator on W2[0,1]. • For Condit ion 3 of Theorem 2.1, the following lemma gives the projection operator P corresponding to the penalty term / 0 X I"(u)2du and the basis functions (f>jS that span the nul l space of P . , 19 L e m m a 2.4 Let P : W 2 [0 ,1] - W 2 2[0,1] : P(I)(t) = I(t) - 7(0) - l'(0)t. (2.17) T/iera P is a projection operator onto H\ with Tii®rio = W|[0,1], where rio — span{l, t} is the null space of P and ||P(i")||2 = /o I"{ufdu. The next l emma uses the results of the r.k.h.s. to calculate the Riesz representor rjj of Lj i n W^fO, ! ] and its projected image £j = P(r)j) for Lj as i n (2.8) and P as i n (2.17). L e m m a 2.5 In W 2 2 [0,1], the projection of the Riesz representor of Lj is: w> - r x , x ( s A * ) 2 (sAt)3 st(s At) — (s + t)^—±- + ^—t- (F(tj - s\s) - F(tj - ^ - s\s))ds. (2.18) R e m a r k : The £jS are not splines (cubic polynomials) as i n the t r iv ia l case when Lj(I) = I(tj). One can show that defined by (2.18) is an increasing function w i t h £ j (0) = 0, concave i n [0, tj] and a straight line beyond tj. P r o o f : For r = 2, (2.7) gives the reproducing kernel as ~ / x / x / x(sAt) 2 (s At)3 K(s,t) = l + st + st(s At)-(s + t)K n ' + v ' (2.19) 2 3 Recal l that Kt(-) = K(-,t). B y L e m m a 2.1, the Riesz representor t]j of Lj evaluated at t is: - rjj(t) =< r,j,Kt >= Lj(Kt) = [*' Kt(s)(F(t}- - s\s) - F(tj - - - s\s))ds. Jo n So &(*) = P(rn)(t) = rjj(t) - nj(0) - r]'j(0)t = Lj(Kt)-Lj(K0)-jt(Lj(Kt))\t=0t (2.20) 20 One can show that at Jo n Therefore = f\Kt{s) - 1 - st)(F(tj - s\s) - F(tj - - - s\s))ds. • The results i n this chapter w i l l be used i n Chapter 6. The next chapter w i l l contain the derivation of the statistical properties of the local linear forecasting estimator rhhn,i(l + A„) for the regression approach. 21 Chapter 3 Asymptotics for the local linear regression estimator This chapter w i l l cover the asymptotic properties of m ^ n i < ( i + A n ) , as defined below based on data i^)}™, for i,s random and t,s nonrandom respectively. These asymptotic results w i l l help us understand the local linear forecasting estimator and choose an opt imal bandwidth for the estimator. The results and conditions are different from those i n non-forecasting regression problems (see, e.g. [8]). However, the proofs are similar . In non-forecasting regression problems, e.g., estimating m(t) by local regression for t i n the domain of design points t,s, data on both sides of t are used i n regression. There- fore usually a kernel function K symmetric about 0 is used. Often i n non-forecasting regression problems, K is assumed to be a density function. However, i n forecasting data are centered at the boundary t = 1, so only the left part of the kernel function K is actually used because there are no data beyond t — 1. For this reason, the local linear forecasting estimator is defined by a kernel function K w i t h a negative support. More- 22 over, the local linear forecasting estimator does not require K to be a density function but a general function that satisfies Condit ion 3.1 below. D e f i n i t i o n 3.1 Let rhhn,t{t') — / 3 0 i + P\,t{t' — t), where / 3 0 t and J3lt minimize X3i (Y]~ A)-P\{ti—t))2K((ti—t)/hn) with K a kernel function having a negative support. Note that the data are centered at t i n Definit ion 3.1. The estimator rhhnj(t') forecasts t' — t ahead of t using data prior to t since the kernel function K has a negative support. The main application of the asymptotic results for rhhntt(t') w i l l be to the case t = 1 and t' — t = A n (see (1.13)). However, the more general asymptotic results for t and t' i n a neighbourhood of 1 w i l l be used i n Chapter 4. From Definit ion 3.1, the asymptotic properties of rhhn,t[t + A n ) are completely de- termined by those of 0t = ($ot, fiijY- Note that the superscript of a vector or a matr ix indicates the action of taking the transpose of a vector or a matr ix and should not be confused wi th the scalar t as either an argument of a function or a subscript of a variable. The asymptotic bias and variance of /3t w i l l be presented first. For the case of random i ,s , assume the following conditions: 1. Yi = m(ti) + e,-, where the e,s are independent w i t h mean 0 and variance cr2; 2. t{S are i i d w i t h density /(•) and are independent of the e,s; 3. the density function / is known and satisfies: / is bounded away from 0 and / is continuous on [0,1]; 4. K is square integrable on its compact support [—1,0] w i t h u0 > 0, uQu2 — u\ > 0, (3.1) where u,- — ff j uxK(u)du\ let < = f\ tfK^Ydu- 23 5. m" is continuous over [0,1 + A n ] ; 6. hn —* 0 and nhn —> oo. T h e o r e m 3.1 Let e = (1,1) ' . Under the above assumptions, as n - m"(t) oo, _2_ ~hl E(B0tt\t) - m(t) [ hn (E0ht\t) - m'(tj) ) UQ U I Ui U2 = o p ( l ) e , (3.2) and / / t 0 w . \ \ nhnVar 0 /u I* /(*) uo ̂ 1 V J Pot « 0 Ui Ui U2 - oP(l)e, (3.3) Ui u 2 uniformly in t G [an, 1] l i m i n f an/hn > 1. R e m a r k s : 1. Note that here "0 n ( i ) = 0 p ((n/i)~ 1 / 2 ) uniformly for * <E [an, 1]" means that l i m l i m sup P ( ( n / O 1 / 2 | 0 n ( i ) | > C) = 0, C _ + o o n ^ o o t g [ o n i l ] which is weaker than l i m l i m P( sup {nhfl2\en{t)\ > C) = 0. Thus, for a fixed sample size, some values of 6n(t) could be far from 0. 2. It is not necessary to require that K have an one-sided support if a bandwidth is given because there are no data for t > 1 and thus any kernel w i l l have an one- sided support automatically. However, some bandwidth selection procedures (e.g. FCV i n Chapter 4) need this requirement to have certain asymptotic properties (see Chapter 4). 24 The proof of Theorem 3.1 w i l l be postponed unt i l its corollaries are presented and proven. The results i n Theorem 3.1 make calculations of the asymptotic bias and vari- ance of rhhn,t(t + An) very easy. The corollary below follows easily f rom Theorem 3.1. Corollary 3.1 Assume Conditions 1, 2, 3, 4 and 5. In addition assume 6'. A n = Dn-1^ and hn = Hn-1'5 for some D, H > 0. Let 8 = D/H. Let Bias(mhn,t(t + An)\t) = E(mhn<t(t + An)\t) - m(t + A n ) . (3.4) As n —• oo, and 2 -£jBias(mhntt(t + An)\t) - m"(t) ( — U1U3 U0U3 — U1U2 82 + )/(u0u2 - ul) - 1 = o p ( l ) nf{t)AnVar{mhntt(t + An)\t) -1 a28(l,8) U0 Ui Ui u 2 \ "1 "2 / UQ UI Ui u2 o„(l) uniformly in t G [an, 1] with l i m i n f an/hn > 1. Proof of Corollary 3.1: Taylor expansion of m{t + A„) at t yields: (3.5) (3.6) m{t + An) = m(t) + Anm'(t) + ̂ m"(t) + o(A2n), uniformly for t G [0,1]. 2 A2 r, _2_ Al So —•jBias(m k n t t(t + An)\t) = -j [E(mhn]t(t + An)\t) - m(t + A n ) ] (3.7) 25 A 2 , n L 82hl E(mhntt(t + An)\t) - m(t) - Anm'(t) - ^m"(t) + o (A 2 ) 1 E0o,\t) - m(t) N ( 1 I)A y82' 8'hi { hn (E0ht\t)-m'(t)) 1 E0o>t\t) - m(t) ^ - m"(t) + o(l) / J -m"(t) + o(l). (3.8) ^ K (E0u\t)-m'(t)) ) B y result (3.2) of Theorem 3.1, as n —> oo this expression converges i n probabil i ty uniformly to un ui 1 I « 9 I m"(t) (ii)-"W U0 U i Ui u 2 = m"(t) ( u\ - U!U3 UQU3- UXU2 2> + 6* 8 So (3.5) is proved. Now consider the variance nf(t)AnVar(mhntt(t + An)\t) = nf(t)AnVar0Oit + PuAn\t) )/(u0u2 - ul) - 1 nf(t)AnVar { / i o X 0 fcK r / i o N = n / ( t ) A „ ( M ) V a r { 0 hr ( \ *> / \ 1 0 1 nhnVar < » \ ^ 0 hn t 4 (3.9) (3.10) B y result (3.3) of Theorem 3.1, as n —• oo the last expression converges i n probabil i ty 26 to: 8{l,S)f{t) /(*) I UQ UI Ui u 2 = *28(1,6) ) \U*1 U2 ) U0 UI Ui U2 U0 Ui Ui u 2 \ *0 "1 \U*1 U2 ) ( V \ 8 ) U0 Ui Ui u 2 uniformly i n t. • Since / and m" are uniformly continuous on [0,1], Corollary 3.2 follows immediately from Corol lary 3.1. Corollary 3.2 Assume the same conditions as in Corollary 3.1. As n —* oo, 2 2-Bias(mhntt(t + An)\t) - m O V P 6 ) / ( u ° W 2 - u i ) _ 1 J = (3.11) nf(l)AnVar(mhn,t(t + An)\t) ( a2S(l,8) \ X ( UQ Ui Ui U2 " 5 «1 \Ul U2 ) \X( \ UQ Ui \Ui U2 J \ 6 I = o,(l) (3.12) uniformly in t G [1 — pn, 1] with any sequence pn > 0, pn —• 0. Remarks: 1. A s is usually done i n regression problems, we consider the conditional bias and variance of our estimates. 2. Corollary 3.2 says that the dominant terms i n both the asymptotic conditional bias and the variance of rhnntt(t + A n ) are equal to those of m / l n i i ( l + A n ) for t G [1 - / ? „ , ! ] . 27 3. For a finite sample, A n is given and hn will be chosen to minimize the asymptotic conditional mean squared error (AMSE) of m^ n > 1 ( l + A n ) . Because of Condition 6', choosing hn is equivalent to choosing 6 = An/hn. Although minimizing the ac- tual finite sample conditional MSE would be best, its form is far too complicated. 4. Based on the results in Corollary 3.1 one can write out the AMSE of m^ n i i ( l - | -A n ) as the square of the asymptotic bias plus the asymptotic variance. Condition 6' sets a guideline on the magnitude of A n , that is, on how far ahead one can forecast if the rate n - 4 / 5 is to be achieved for the AMSE of m/ l r J ii(l + An). Stone [27] has shown that under appropriate regularity conditions, the optimal rate of AMSE for a nonparametric estimator of a twice differentiable function is n - 4 / 5 . If A n grows faster (but still slowly enough so that A n —> 0), for example, An/hn —> oo, a rate slower than n - 4 / 5 can be achieved under appropriate conditions. This can be seen by the reasoning below. For A n —> 0, the bias and variance of m^.nii(l + A n ) are Bias(rhhn<1(l + An)\t) Var(mhntl(l + An)\t) = E(m f c B l l (l + A n ) - m ( l + An)|*) = E (fa - m(l) + - m'(l)]|t) + O(Al), = Var(A,1i + & l l A B |t) = Var0Otl\t) + 2AnCov0OAJ1A\t) + A2nVar0ltl\t). By the results in Theorem 3.1 for t = 1 and any A n > 0 with A. n 0 Bias(mhntl(l + An)\t) Var(mhntl{l-[- An)\t) Op(h2n) + Op(hnAn) + 0(Al), Op(l/(nhn)) + Op(An/(nh2n)) + Op(A2J(nh3n) Op{l/(nhn)) (l + Op(An/hn) + Op(A2Jh2n)) . )) If AJh, •n co, we have Bias{mhntl(l + An)\t) 0P(A2n), 28 Var(mhnA(l + An)\t) ~ Op(A2J(nh3n)). To minimize the AMSE, we need to make Bias2 and Var of the same magnitude, i.e., A* oc A2J{nh3n), or A 2 / * 3 oc 1/n. If A n / A n -> oo, A2nh3n oc 1/n implies n / i£ —• 0 and n A n 5 —• oo. A s a result, if An/hn —• oo under the conditions nh3 —> oo and A 2 / i 3 oc 1/n, the AMSE of rhfc„,i(l + A n ) achieves the rate Op(An) (equivalently Op(An/(nh3)), which is slower than n - 4 / 5 since nAhn —* oo. Recal l that i n Chapter 1.2.2, the algebraic substitution (1.11) is used to re-center the data. This recentering yields a simple dependence of AMSE on 8, thus making it easier to find the optimal bandwidth by finding the opt imal 8. If the original formulation (1.9) is used wi th the data centered at 1 + A n , then the estimate of m ( l + A n ) is /3O,I+A„- The formulae of the asymptotic bias and variance of A ) , i + A N w i i i show clearly the advantage of centering data at 1 over centering at 1 + A „ . Under the same assumptions as i n Corollary 3.1 but w i t h the additional assumption that D < H, one can show that the asymptotic bias and variance satisfy _2_ h2 Bias(mhntl(l + An)\t) - m " ( l + A n ) v2 - vxvz V0V2 - V{ 2 h2 O p ( l ) , £ ( / W » - m(l + An)\t) - m " ( l + An)V* VlV* VQV2 — Vi (3.13) nhnf(l + An)Var(mhntl(l + An)\t) - I \ X ( V0 Vi V\ v2 \ Vn V-> ( \V1 V2 J \ V0 V-L Vi V2 (3.14) where i>,- = / " f v{K(v)dv, v* = / ~ f v'K(v)2dv. M i n i m i z a t i o n of the resulting AMSE of fi01+An is computationally intensive since 8 i n the upper l imi t of the integrals depends on hn. Therefore if AMSE(hn) is to be minimized over a grid 29 of values of hn, a l l the integrals have to be computed for each hn. If K is, say, the indicator function on [—1,0], these integrals can be found i n closed form. If, however, K is a truncated normal density, then the intregrals must be evaluated numerically. There is another advantage of re-centering the data: after re-centering the data at 1, hn is the "real" bandwidth used i n forecasting. If the data were instead centered at 1 + A „ , the effective bandwidth would be hn — A n since there are no data beyond 1. L e m m a 3.1 below is used i n the proof of Theorem 3.1. Lemma 3.1 Suppose that g(-) and /(•) are bounded over [0,1] and W(-) is square in- tegrate over 1Z and that the i , s are iid with density /(•). If nhn = nh —• oo, then as n —» oo, ^ £ W&j^MU) - I £ W(^)g(u)f(u)du = O p ( ( n f c ) - 1 / a ) (3.15) uniformly for t 6 [0,1]. Proof: Let Z = (nh)-1 £?=1 W((t{ - t)/h)g(U). B y Chebyshev's inequality, Z = E ( Z ) + Op((Var(Z)fl2). E(Z) = \E (Vc^Mto) = I £ W(^)g(u)f(u)du, Var(Z) = ^y« r ( V ( * I^ ( t l ) ) * n h < W 2 ^ ^ ) = \ C W\^-±)g\u)f(u)dulh. (3.16) nh Jo n Since g and / are bounded, and / 0 X W2((u — t)/h)du/h = f^^h\V2(s)ds < co, Var(Z) = 0 ( \ ) , (3.17) nh 30 which does not depend on t. P r o o f o f T h e o r e m 3 . 1 : Wri te • -f%- fiiiU - t))2K(^r-) = (Y- Aj3)'W(Y - A/3), (3.18) where Y = A 1 tx-t 1 tn-t P V f t / and W = diag{K(p-±)). • V T L The minimizer of (3.18) is: pt = ( A ' W A J ^ A ' ^ y , (3.19) provided that A ' W A is invertible and positive-definite. So for t w i t h A'WA invertible, E{Pt\t) = (A'WA^A'Wro where m = (m(ti),..., m(t n ) ) * , and (3.20) Var(J3t\t) = <72(A'WA) - 1A'W2A(A'WA) - l (3.21) Result (3.28) below shows that, when suitably normalized, A'WA converges i n proba- bi l i ty to a positive definite matr ix . Since we are interested i n convergence i n probabil i ty of E{pt\t) and Var(Pt\t), it suffices to analyze (3.20) and (3.21). The asymptotic bias w i l l be derived first. The Taylor expansion of m at t yields: m"(t) m(ti) = m(t) + m\t){U -t) + —^(U - *)2 + - *)2) 31 uniformly i n ti w i t h \U — t\ < hn. Only these t,s are considered because the support of K is [-1,0] . Therefore, for \U — t\ < hn m (U) = (l,U-t) ( m(t) ^ + ^Y^-qiK2 + h2no(qi) A,- / m(t) N m'(t) + qihl(m"(t)/2 + o(l)) where A,- is the ith row of A and =(9i'92'---'^^((V) '•••'(V)) Since Wi = i f ( ( * i — t)/hn) = 0 for — t\ > hn, so ' l 0 W m(i) y 0 hn j ' i o N V (K tWK)-xK tWqhn 2(m"(t)l2 + o(l)) \ ( A lW A] 0 Zi n y 0 hn j ( = hn2 1 0 o hn I - l F o r t , i = l,2 1 0 y 0 h N J -1 1 A tWq)(m"(t)/2 + o(l)). 1 0 0 /*„ nh. -A*WA y 0 h N J 32 (3.22) (3.23) 1 1 hJ+t-2 nhn 1 1 ( A ' W A ) , , - •+j-2 (3.25) nk. — • • • • • ' ( 3 - 2 4 ) which by L e m m a 3.1, is equal to nn Jo \ nn J \ nn J Substituting s = (u — t)/hn yields (l-t)/hn t/h„ For t e [a„ , l ] wi th l i m i n f a n / f c n > 1, [-t/hn, (1 - t)/hn] D [-1,0] = [-1,0] for n sufficiently large. So for n —> oo and t £ [a„, 1], expression (3.25) is equal to f si+i-2K(s)f(t + shn)ds + Op((nhn)-^2), (3.26) where Op{(nhn)~xl2) holds uniformly i n t G [a„, 1]. The last expression converges (not only i n probability) to r(l-t)/h„ . . / K(s)st+>-2f(t + shn)ds + Op{{nhnyxl2). J-t/hn i: ,»"+J-2 K(s)f(t)ds = f ( t ) l l i + j - 2 uniformly i n t £ [an, 1] because | /_° si+j-2K(s){f(t + shn) - f(t)}ds\ < sup \f(t + shn)-f(t)\ [° \si+i-2K(s)\ds t e [ o „ , i ] sup |/(r + s/iB) - f(t)\( f \s2^-2Usf'2{ f° K(s)2ds) <=rn„.ii J - i J - i 1/2 <G[a„,l] > 0. (3.27) So 1 0 0 hn nh 1 A ' W A 0 fcn - l /(*) Ui u2 (3.28) 33 uniformly i n t G [an, 1], as n —+ oo. Similarly, for i = 1,2, 1 0 o hn \ - i nh. -A'Wq Ii = .l-±K(ijLzl) (h^l)'+1 nhn fr{ \ hn J \ hn J uniformly i n i G [ a n , l ] - This expression converges to / ( i ) u , + i uniformly i n £ G [ a n , l ] , as n —> oo. Thus / - l 1 0 o K nhn A'Wq A /(*) / \ (3.29) B y (3.23), (3.28) and (3.29), it follows that _2_ hi E(BQtt\t) - m{t) v hn (E01<t\t)-m'(tj) ( = m"(t) UQ UI J u 2 \ U 3 / m"(t) I I \\ UQ UI - 1 /(*) U i u 2 /(*) V « 3 J J (3.30) U i u 2 uniformly i n t G [a n , 1]. To calculate the asymptotic variance, first consider, nhnVar 11 u = nhno~2 1 0 0 hn 1 0 0 hn J ) (AlW AYX A'W2 A{A'W A) - l 0 h. 34 = (7 1 0 o K 1 0 0 K 1 o 0 hn \ nh, nhn - A * W 2 A - l 1 0 0 K -1 0 hn X 1 0 0 K (3.31) B y calculations similar to those i n the proof of (3.28) - l „ - i 0 K nh. - A ' W 2 A / 1 0 N 0 hn ( ux u2 j (3.32) Using results (3.28) and (3.32) i n (3.31) gives / / 1 O i l B0jt I , (nhn)Var ( /(*) 0 0 A B / V Ui Ui /(*) « 0 W l /(*) U 0 U i V J \ "1 / /(*) U0 Ui Ui u2 -1 + op(l) V «1 U2 J \ U0 Ui Ui U-i + oP(l), (3.33) Ui u2 uniformly for t € [an, 1]. Under appropriate conditions, Theorem 3.1 and its corollaries hold for the fixed design where i,-s are pre-specified design points rather than random variables. The analogous theorem and corollaries for the fixed design are presented below. Theorem 3.2 For the fixed design case, assume the same conditions on m , K, hn: 1, 4, 5 and 6 as in the random design. In addition assume that K' is continuous, that t{ = 35 with 0 < ti < ... < tn = 1 and rU f: f ' f(t)dt = i/n, inf / ( t ) > 0 Jo <e[o,i] (3.34) loii/i / continuous. Then as n —•> oo, / EfA/j - rof i ) ^ - m"(t) _2_ AH E ( f t , t ) - m(t) { hn (E01<t) - m'(t)) ) ( \ UQ U I U i U 2 o(l)e, (3.35) / 1 0 o K fit) \ UQ U I U i U 2 A M \ _ 1 / U 0 U i U i u 2 = o ( l ) e , (3.36) uniformly in t £ wifJi l i m i n f an/hn > 1. The condition on the design points £,s i n the theorem above is analogous to Condit ion 3 for the random design case. This condition serves as a guideline to the selection of the design points i n an experiment. Corollary 3.3 Assume the same conditions as those in Theorem 3.2 except with Con- dition 6 replaced by Condition 6' of Corollary 3.1. Then as n —• oo, sup jBias(rhhn,t(t + A „ ) ) - nu\ / lU2 ~ "1^3 . U 0 U 3 - U i U 2 ro(*H( 8* + 6 ) / ( u 0 u 2 - u\) - 1^ = o(l), (3.37) and sup <6[a„ , l ] nf(t)AnVar(mhntt(t + An)) - a26(l, 6) I \ UQ U I -1 U l U 2 Un U, UQ U I U i U 2 o(l). (3.38) 36 C o r o l l a r y 3.4 Assume the same conditions as in Corollary 3.3. As n —*• oo, sup t e [ w „ , i ] -^Bias(rhhntt(t + A„))- (tu\ ~ " i " 3 , u ^ - u i u 2 2\ A (3.39) sup tG[l-Pn,l] nf(l)AnVar(mhnit(t + An)) - a2S(l, 6) ( \ UQ Ui V / Ui U2 \ _ 1 / ' \ UQ UI Ui U2 = o ( l ) . (3.40) The proof of Theorem 3.2 for the fixed design case is along the same lines as for the random design case except that the following lemma is used instead of L e m m a 3.1 when replacing a sum by an integral. The proofs of the corollaries of Theorem 3.2 are the same as those of the corollaries of Theorem 3.1. L e m m a 3.2 Assume f,W and g' continuous on [0,1] and f satisfies (3.34)- Then 3 C > 0, such that C sup t€[0,l] < n h 2' (3.41) R e m a r k : Note that the error term 0((nh2)~l) i n L e m m a 3.2 is smaller than the error term 0({nh)~ll2) i n L e m m a 3.1 for h of order n - 1 / 5 . The proof of L e m m a 3.2 requires the result of L e m m a 3.3. L e m m a 3.3 If inf t e [ 0 , i ] f(t) > 0, and if 0 = i 0 < h < ... < tn = 1, then 3 C2 > 0 such that s u p i n \U — \ < C2/n. P r o o f : Let C2 = 1 / inf < e [ 0 ) i ] f(t). B y the assumption on / , - = [*' f(u)du> .[*' inf f(u)du = C2-l{ti-ti-i). n Jti-i Jti-i <e[o , i] 37 Thus U — < C 2 / n , V i , n . P r o o f o f L e m m a 3.2: l rK„,u-t f(u)du , _ 1 L " J EE A + B. Consider the first term i n (3.42). by assumption (3.34). The lemma w i l l be proved if \B\ < C/(nh2) for some C > 0. Since / , W and g' are continuous over [0,1], there exists C\ > 0, such that (3.42) (3.43) wC^)g{u) - W^-j-l)g(ti) f(u) < d u — ti (3.44) Thus 1 A /•*•• 1*1 < T E / C u — ti < 1 CxC22 c • (3.45) nh2 nh2 The results i n this chapter w i l l be used i n the next chapter for the estimation of an opt imal bandwidth for forecasting. 38 Chapter 4 The choice of a smoothing parameter The plug-in approach and the cross-validation approach are commonly used i n choosing a bandwidth for a local regression estimator i n the non-forecasting setting. Each approach seeks to min imize some measure of the discrepancy between the estimated and the true function and tries to estimate the bandwidth at which the m i n i m u m of such a measure occurs. There is abundant literature on the merits and l imitations of both methods. For a comprehensive review, see [7], [12], [13] and [19]. The ideas of choosing a bandwidth for forecasting by both approaches w i l l be developed in this chapter. The applicabil i ty of these approaches relies upon the fact that the data are independent. A n y correlation structure in the data w i l l affect any automatic selection of bandwidth (see, e.g. [16]). 39 4.1 The plug-in approach A plug-in bandwidth is an estimate of an "opt imal bandwidth" i n a certain sense. To estimate the function m at t by rhhn(t) (defined i n 1.8), an opt imal bandwidth hopt(t) may be defined to be the one that minimizes the mean squared error (MSE) of m/ l n ( t ) : hopt(t) = argminhnMSE(t, hn), where MSE(t,hn) = E(rhhn(t) - m(t)\t)2 i n the random design and MSE(t,hn) = E(rhhn(t) — m(t))2 i n the fixed design. Since the idea is the same for both designs, the selection of an opt imal bandwidth w i l l be described for the random design. The bandwidth hopt(t) is called a local bandwidth since MSE(t, hn) is a criterion of goodness of fit at a single point t. However, if m is believed to be reasonably smooth, a constant bandwidth for a l l t w i l l suffice ([7]). In such a case, an opt imal bandwidth hfpt may be defined to be the one that minimizes the sum of the mean squared errors: h^pt = argrninhnMSEG(hn) where MSEG{hn) = n'1 £?=1 E ((rhhn(ti) - m(ti))2\t). Since h°pt w i l l be used to esti- mate m(t) for a l l t, it is called a global bandwidth. Another commonly used criterion for a global bandwidth is the integrated mean squared error, MISE{hn) = f1 MSE(t,hn)dt Jo = jT E ((rhhn(t) - m(t))2\t) dt; (4.1) a global bandwidth can be obtained by minimizing MISE(hn). MSEa(hn) may be viewed as a discretized version of MISE(hn) when the £,s are equally spaced over [0,1]. Note that al l three criteria mentioned above involve the unknown function m and the data {(^i ,^)}" . So the opt imal bandwidth depends on m and a2 and thus has to be estimated. The plug-in approach estimates the opt imal bandwidth by minimiz ing an estimate of the asymptotic expression for MSE or MISE. 40 The criterion MSE(t, hn) w i l l be used hereafter. The context of the plug-in approach that of non-forecasting and some relevant results w i l l be presented first and then the generalization of the plug-in approach to the forecasting setting w i l l be discussed. 4.1.1 Introduction In this section suppose Y{ = m(t,) + e,-, where the e,s are independent w i t h mean 0 and variance cr 2(£,). W h e n the errors are homoscedastic, o"2(i,) = a1. Results for heteroscedastic errors i n the non-forecasting problem w i l l be presented because they encompass the special case of homoscedastic errors. In particular, a technique by Fan and Gijbels [7] of estimating m " ( l ) i n the case of heteroscedastic errors w i l l be described and used i n forecasting. In general, to estimate wSv\t)/v\ = ft, v — 0 , 1 , . . . , a polynomial of degree p (p > v) is fitted to the data wi th the following fitting criterion: A = argmin j j Y { - £ &•(*,• - tyfKip^), (4.2) and m/, n ' " ' (t) is set to v\$vt. In curve estimation, the kernel K i n (4.2) is usually taken to be symmetric around 0 over either the reals or [—1,1]. The performance of rhhn^\t) is assessed by its MSE: E j (rhhj^(t) — m M ( t ) J 2 | t - j . According to the general results i n [7], when p - v is odd, the asymptotic MSE (AMSE) of rhhn^\t) is Qi i2.2(p+i-</) , g2(*) / 4 3 \ ft+iM„ + a" f(t)nhn+^ V-6* Here a EE (a0,ai,...,apy = diag(S~1S*S~1), (4.4) 6 EE (6o, 6P)* = S-1(sp+i,...,s2p+2)\ (4.5) S = (5 , - j ) wi th sij EE si+j-2 = J ui+j-2K(u)du, (4.6) S* EE (,?.) wi th EE , ? + . _ A = Jui+i-2K2(u)du. (4.7) 41 The first term i n (4.3) is the square of the asymptotic bias which depends on m ( p + i ) ( 2 ) = ( p + l ) ! ( 5 P + 1 ) while the second term is the asymptotic variance which depends on cr2(t). For example, when m(t) is estimated by a local linear estimator, i.e., v = 0 and v = 1, the square of the asymptotic bias of rhhn(t) is P\b\hn (or m"(t)2blhn/4) and its asymptotic variance is a0a2(t)/(f(t)nhn). Formula (4.3) shows that if m ( p + 1 ) ( i ) ^ 0, a large bandwidth hn w i l l create a large bias but a small variance, and a small bandwidth w i l l y ie ld a small bias but a large variance. Requiring both the bias and the variance to be small at the same t ime is setting conflicting goals for hn. Therefore a tradeoff between the bias and the variance is needed and this is achieved by choosing hn to minimize the asymptotic MSE of the estimator. From (4.3), the bandwidth that minimizes the AMSE of rhh„^\t) is: _ / (21/ + l)a„a2{t) \ 5fe ^ ( t ) - U ( p + i - ^ ) W i « / ( o J ' ( ] For example, when m is to be estimated by a local l ine, i.e., v = 0 and p = 1, this locally opt imal bandwidth is *-*>- ( ^ ) * - G s ^ ) * " ^ Plugging this opt imal bandwidth into the formula for the AMSE of rhhn(t), o n e c a n see that the rate of n - 4 / 5 is achieved for its AMSE. Note that formula (4.8) depends on two unknowns, a 2 ( * ) a n d m(r+1\t) = /3p+1(p-rl)\. These unknowns need to be estimated and these estimates are then plugged into formula (4.8) to yield an estimate, hu,opt(t), of the opt imal bandwidth hv<opt(t). For example, if the curve m (v = 0) is being estimated at t, to get an opt imal bandwidth one has to estimate m"(t) and <r2(i) first. 42 4.1.2 Plug-in approach using complete asymptotics for fore- casting (CAMSE) We only consider the case of homoscedastic errors, i.e., o~2(i) = a2. In forecasting, the criterion that the plug-in approach considers is also the mean squared error: MSE(hn) = MSE(l + An,hn) = E([mhntl(l + An)-m(l+ (4.10) A g a i n , only the random design case w i l l be presented since the ideas and the results are the same (under proper conditions) for both designs. Under Condit ion 6' of Corollary 3.3, for A„ given, the asymptotic MSE(hn) can be writ ten i n terms of 8 = An/hn, so the notation AMSE(8) is used instead. B y Corollary 3.1 or 3.3, the AMSE of mhnA(l + An) is AASOVf^ m " ( l ) 2 ( u\ - UXU3 U 0 U 3 - U i U 2 2 V AMSE(8) = — I H - j 5 + s )/(UoUi ~ ui) - 1 1 + 4 a28 n / ( l ) A , •(1.*) UQ U\ UI U2 Un « i UQ UI Ui U2 (4.11) If the first term of AMSE(8) is a monotone and non-increasing function of 8, the above expression shows, i n an asymptotic sense, that for a given A n , a very large 8 (corresponding to a very small hn) w i l l y ie ld estimates w i t h a smal l bias but a large variance and that a small 8 (corresponding to a large hn) w i l l y ie ld estimates w i t h a small variance but a large bias. For a general kernel function K, the relationship between hn and the asymptotic bias is not as obvious as it is i n the non-forecasting setting. In the forecasting setting, a trade-off between the asymptotic bias and the asymptotic variance can be achieved by choosing the hn = An/8 wi th 8 min imiz ing AMSE(8). Observe that as 8 —* 0 and +oo, AMSE(8) —* +oo. So a m i n i m u m of AMSE(8) i n (0, +oo) is guaranteed. The m i n i m u m can be found by setting the first derivative of 43 AMSE to zero and choosing the positive root which gives the smallest AMSE. Whereas i n curve estimation for the non-forecasting setting an explicit formula (4.8) for the opt imal hn exists, no such formula exists i n forecasting. F i n d i n g the opt imal hn or equivalently 8 for forecasting, calls for solving a seventh degree polynomial equation (the one resulting from setting the first derivative of AMSE(8) to zero) for 6. A n alternative method for minimiz ing AMSE is by a grid search. Note that like i n curve estimation, estimating the opt imal hn i n forecasting requires the estimation of m" and a2. This issue w i l l be discussed i n Section 4.1.4. 4.1.3 The plug-in approach using the finite sample variance for forecasting (HAMSE) Recal l that the discrepancy measure (4.10), the MSE of the forecasting estimator, is comprised of two parts: a bias component and a variance component. The bias com- ponent involves the unknown function m but by (3.21) the variance component equals <72(1,A„)(A*WA)*A*W2A(A*WA)(1, A„)*, which does not involve m. The estimation of the variance parameter a2 is relatively easier than the estimation of m". Therefore we can directly use the exact variance expression rather than using its asymptotic expres- sion [7]. The asymptotic variance describes the variabil i ty when n —> oo. Specifically, the following can be used i n l ieu of AMSE. Definition 4.1 The asymptotic MSE using the finite sample variance for the forecast- ing estimator rhhn,\(l + A n ) is defined as: HAMSE(S) = ^ A n ( < ± ^ + ™ +<T2(1, A n ) (A i VTA) t A t W 2 A(A t WA)(1, A n)*. (4.12) 44 Expression (4.12) uses only "half" of the asymptotic results of the bias and the variance of m / l t l ) 1 ( l + A„) i n that HAMSE is defined as the sum of the asymptotic bias and the finite sample variance of the forecasting estimator, hence the acronym "HAMSE". Recal l that W = diag(K((U - l)/hn)) = diag(K(8(U - 1 ) / A „ ) ) . The opt imal 8, 8opt, that minimizes HAMSE(-) can be found by a grid search and the opt imal bandwidth is set to be: hnAMSE,oPt = An/8opt. Note that both the plug-in approach using AMSE and the plug- in approach us- ing HAMSE are based on the same discrepancy measure, the mean squared error of rhh„,i(l + A n ) . The difference is that the latter approach uses the exact variance of ^ / i „ , i ( l + A n ) i n the variance component instead of its asymptotic expression. 4.1.4 E s t i m a t i o n o f m" a n d a2 — a n i n t r o d u c t i o n The plug-in approach i n the non-forecasting setting calls for the estimation of the second derivative ra" of the unknown function and the variance function a2. W h e n homoscedastic errors are assumed, a2 can be estimated by the Rice estimator [25]. D e f i n i t i o n 4.2 The Rice estimator of the variance a2 is defined as: ^ = E % ^ f . (4.13) Under certain conditions, the Rice estimator is n 1 / , 2 -consistent, which means ^^(a^^ — a2) —+ 0 i n probability. Other estimators of a2 can be found i n [11]. W h e n heteroscedastic errors are assumed, Fan and Gijbels propose that o~2(t) can be estimated from residuals of the local polynomial regression [7]: 45 where ^ 1 ti-t ... (u-ty^ , W = diag(K((tj-t)/hn)), (4.15) yi t n - t ... {tn-ty j p is the degree of the polynomial and Y = (Yi,..., YnY = A p / 3 t , (3t being the usual non- forecasting estimate as defined i n (4.2). Clearly cr2{t) can be used i n the cases of either homoscedastic or heteroscedastic errors. Note that the estimation of cr2(t) requires again a choice of a bandwidth which may differ from the bandwidth for curve estimation. The estimation of m" requires a higher order method. Recal l that is estimated by locally f i t t ing a pth degree polynomial wi th p > v. Ruppert and W a n d [26] recom- mend that p — v should be taken to be odd to obtain an accurate estimate of m^(t). So m"(t) should be estimated by locally fitting at least a cubic polynomial . A g a i n the estimation of m" requires the choice of an opt imal bandwidth, which is different from (usually bigger than) the bandwidth that is used for estimating the function m itself. To summarize the idea of the plug-in approach, to get an opt imal hn for curve estimation one needs to estimate a2 and m". The estimation of a2 and m" requires two additional bandwidths whose opt imal values depend on higher derivatives of m . To avoid this spirall ing argument, one has to come up wi th an in i t ia l bandwidth for estimating derivatives and a2. Therefore i n the plug-in approach, the problem of choosing a good in i t ia l bandwidth has attracted special attention ([26],[12],[7]). 4.1.5 Estimation of m" and a2 for forecasting The discussion w i l l be restricted to the homoscedastic errors case. Recal l that the forecasting estimator is defined as m / l n i i ( l + A n ) = y901 + ^ j A n where fcx = argminj^iY - 80 - frfr - l))2K({ti - l)/hn). l 46 The objective is to choose an hn = An/8 such that AMSE(8) or HAMSE(8) achieves a m i n i m u m . Since both objective functions involve the unknowns a2 and m " ( l ) , these need to be estimated first. Exis t ing techniques for estimating both i n curve f i t t ing can be applied directly to forecasting. For estimating a2, aRice suffices because of the assumption of homoscedastic errors. Es t imat ion of m " ( l ) is a much more complicated matter. A s discussed earlier, m " ( l ) can be estimated by f i t t ing a local cubic polynomial around t = 1. Let v = 2, p = 3 and t = 1 i n (4.2), that is, let & = argmin J2(Yi - 1 ) ' ' ) 2 # ( ^ ) , (4.16) i=l j=0 and let m „ n i i " ( l ) = 2 f t , i . Note that the kernel function K and the bandwidth hn used to estimate m"(l) can be different from those used for forecasting. Formula (4.8) gives the opt imal bandwidth for estimating m " ( l ) as follows: = ( 4 ' 1 7 ) which depends on a2 and one higher unknown derivative m^(l) ( = 4!/?4). Ordinar i ly the fourth derivative of a function m is difficult to estimate, part ly because the opt imal bandwidth depends on higher order derivatives of m. However, an idea due to Fan and Gijbels [7] eliminates the need to estimate m^4\l). To avoid estimating higher order derivatives when estimating m^ by a pth degree polynomial , these authors have introduced a statistic RSC, RSC(t, hn) = a2(t){l + (p + 1 )V} , (4.18) where o2(i) is as defined i n (4.14) and V is the first diagonal element of the matr ix (A* W A p ) ~ 1 A p W 2 A p ( A p W A p ) " 1 . The motivation for using RSC is that its m i n i m u m reflects to some extent a trade-off between the bias and variance of the fit . 47 Fan and Gijbels [7] have shown that the opt imal bandwidth hQ(t) that minimizes the asymptotic value of E(RSC(t, hn)\t) is: where £, _ S 2 p + 2 — (Sp+l, • • . ,S2p+l)S~1(Sp+1, . . . ,S2p+lY ^ 2Q^ and Sj is as defined i n (4.6). Comparing (4.8) and (4.19) yields: » M ( (2J/ + 1) a u C p \ ^ A p p l y i n g this relationship to the estimation of m " ( l ) , i.e., t = 1, v = 2, p — 3, leads to /*2, 0 P< = />2,oPt(l) = (f̂ |)9 W (4-22) The point is that h0(l) can be estimated from the sample by minimiz ing RSC, yielding an estimate of h2,0pt since the scaling factor i n (4.22) depends on the kernel function only. Thus the opt imal bandwidth for estimating m " ( l ) can be estimated from (4.22) using the information i n the finite sample at hand. The following algorithm summarizes the steps to get the opt imal bandwidth for estimating m"(l). Algorithm 1 : 1. Fit a local cubic polynomial centered at t — 1 as in (4-16) 1° tne data for each value of hn on a grid; 2. find h0(l) that minimizes RSC(l,hn) on that grid of hn; 3. get h2t0pt via (4-22); 4- estimate m"(l) by fitting a local cubic polynomial to the data using (4-16) with bandwidth h2,opt- 48 4.2 Cross-validation 4.2.1 Background to the non-forecasting setting Cross-validation is a technique commonly used to choose a smoothing parameter. Most references i n literature applies cross-validation i n the context of curve f i t t ing by splines or kernel estimators. However the idea of cross-validation is the same for both techniques and can be applied i n other contexts, e.g., bandwidth selection i n local linear regression. For a brief historical note on cross-validation, see [15] (pages 152-153). This section w i l l present the motivation and results on cross-validation i n the setting of curve f i t t ing. A generic symbol A is used for the smoothing parameter. In the linear operator approach, A w i l l be the parameter used i n the penalty term. In the local regression approach, A w i l l be hn, the bandwidth of the kernel. Let m\(-) denote the fit to the data Y{S w i t h smoothing parameter A. Ideally, one would want to choose a A such that the prediction error is minimized . A criterion reflecting this objective is independent of the Y{S but have the same distribution as the i^s. To see the relationship to the criterion MSEG(X) used i n the plug-in approach, note that (4.23) where the Y*s are new observations made at the design points Thus the l^*s are E(PE(\)\t) m(ti))(m(ti) - mA(*,•)) i=i = a2 + MSEG(X). (4.24) 49 So minimiz ing the expected prediction error E(PE(X)\t) is equivalent to minimizing MSEG(\). Of course the 3^*s are unknown and so to implement this idea the Y*s have to be replaced by observables. S imply minimizing (4.23) w i t h the Y*s replaced by the Y{S would usually result i n choosing A = 0, the A that minimizes 2~Zi(^' — ™ A ( ^ , ) ) 2 and yields an estimate of m that interpolates the data. Thus the fitted curve would be very bumpy. This is a direct consequence of underestimating the prediction error PE(X) because rh\(-) is "closer" to the Y{S than to the Y*s since rk\(-) is fitted to the former. The idea of cross-validation is to correct this downward bias i n the estimation of PE{\). Note that Y* is independent of rh\(ti) but Y{ is not. Cross-validation gets around this dependence by substituting m£~^(2,-) for rh\(ti), where m ^ ( - ) is the curve fitted to the same data but wi th the zth data point (i,-, Yi) removed. A s a result Yi and m^~x\ti) are independent. Definition 4.3 The cross-validation function is defined as CV(X) = -f2(Yi-m{-i\ti))\ (4.25) where for each i £ { 1 , . . . , n} , m\~%\-) is estimated by the same procedure as rh\(-) but with (ti,Yi) removed from the data set. The cross-validation choice of the smoothing parameter, \cv, minimizes CV. B y taking the expectation, we see that CV(X) is approximately unbiased for PE{\). E{CV{\)\t) = -E(f^(Yi-m(ti))2 + 2it(Yi-MU))(™(ti)- n \i=i i-i + ±(m(U) - mi-^ti))^) t=i / = ^ + ^ (EM*0-4" o ( * . 0 ) 2 l * ) w <72 + MSEQ(\). (4.26) 50 Under suitable conditions, Xcv, the smoothing parameter chosen by cross-validation, can be shown to converge almost surely to the optimal smoothing parameter XPE, the minimizer of the prediction error. Hardle and M a r r o n [14] have shown such a result for the Nadaraya-Watson estimator. O n a practical level, cross-validation is a computationally intensive method. Since CV can rarely be minimized analytically, usually CV is minimized on a grid of A i n a specified interval [A, A] . Therefore for each combination of grid point Xj and omitted design point t,, the entire curve f i t t ing procedure has to be repeated. This amounts to n\ • n curve fits w i t h n\ being the number of grid points for A. In some cases, a short-cut formula is available which expresses CV(X) i n terms of quantities which can be evaluated directly from applying the curve fitting procedure once to the entire data set. W i t h this formula, only one regression curve is fitted for each value of A. The following lemma gives the short-cut formula and conditions for its validity. L e m m a 4.1 For each i £ { l , . . . , n } , let m^ - ^(-) be the fit with (ti,Yi) removed from the data set {(tj,Yj)}"; and let ̂ ~^*(-) be the fit with (ti,Yi) replaced by (£,-, ra^-^(£,•)) in the data set {(tjjYj)}™. Suppose that rh\ = (rh\(ti),... ,rh\(tn)y = H\Y, with H\ not dependent on Y and with its ith diagonal element [H\]a ̂ 1. If for each i, m^~^*(t,) = rn^Cl\ti) for any set ofYis, then Yi-m\ '(ti) = i _ [ H ] » and so (4.27) R e m a r k : The short-cut version of the CV function (4.28) casts insight on the idea of cross-validation. A s discussed earlier, n _ 1 Y%{Yi — rh\(ti))2 tends to underestimate 51 the prediction error. B y scaling the iih term by 1/(1 — [H\]u)2, CV function tries to eliminate this bias. Proof: Since rh\ = H\Y and H\ does not depend on Y, the following holds: ^ i _ , > ( * . - ) = E W i + [ ^ A ] « m i " ° ( * 0 - (4-2 9) Using the assumption m^ - ^* ( i ; ) = m^~^(t,) yields: Y i - = Yi-m^m(ti) = Yi- YlWijYj - [ f r A ] « m ^ ( t , - ) 1 = Yi-mM + iHxMYi-m^iti)). (4.30) Therefore, ^ - m^iU) = (Yi - m A ( * , ) ) / ( l - [Hx]n). A p p l y i n g this relationship to the definition of CV(X) (4.25) gives the short-cut formula (4.28). • The above applies to the scenario of curve f i tt ing i n the domain of the design variable when a global smoothing parameter A is desired. If one wants to choose a local value of A depending on t, localization of the CV function is straightforward by incorporating a weight function into CV. For example, a CV function for a local smoothing parameter can be defined as CV(X, t) = n-1 J2(Yi - m^iUtfwiit), (4.31) l where iw,-(i) is a weight function that gives more weight to points near t than the rest. This weight function Wi(t) can depend on another smoothing parameter A* . The A = X(t) that minimizes CV(-,t) w i l l determine the amount of smoothing around t. 52 4.2.2 C V for the linear operator approach The minimizer of the penalized l ikel ihood, F(Y,Lu...,Ln) = n - 1 fjXi - L^I))2 + \f I"{ufdu, 1 J for I € W ^ O , 1 ] , gives an estimator satisfying the conditions of L e m m a 4.1. So the short-cut formula holds for this case. For proof and details, see [31]. 4.2.3 C V for the local regression approach in curve estima- tion The local polynomial regression estimator, rhhn(t) = /3 0 t , of m(t) w i t h /3t defined i n (4.2), satisfies the conditions for the short-cut formula. The i t h row of the H\ matr ix i n this case is [Ap] , - [ (A£WA P ) - 1 A£W], where [Ap],- is the i t h row of A p , A p and W are as defined i n (4.15) wi th t — ti, i = 1 , . . . , n . Obviously, this H\ matr ix does not depend on Y. The lemma below verifies the other condition i n L e m m a 4.1 for the short-cut formula to hold. Lemma 4.2 For each i € {1 , . . . , n), let {wjtti}\ be a set of non-negative numbers with u)i,n > 0. Suppose that f3t. minimizes s f c - E #(*;-*.•/) (4.32) \ 1=0 / and that fi\. ^ minimizes E ( Y > - E m - *o')2 «>j,u + - E m - ti)1) j^i \ 1=0 J \ 1=0 ' = E fYi ~ E - ti)1) ™i,u + fc - ft)2 witti. (4.33) &i \ i=o ) v J Then /3 0 <^ = /30t'^* for i G {1, . . . ,n}. Here J30^ and (3Qt^ are the first component of ^ and p[. ^ respectively. 53 R e m a r k : The above lemma is the essence of the short-cut formula for regression w i t h the f i t t ing criterion E fa-!>(*;-')') i V ;=o / w i t h rhhn(t) = {30. In particular the short-cut formula holds for local linear regression estimator (p = 1) and local constant regression estimator (p = 0, Nadaraya-Waston estimator), that is, when Wjyt = K((tj — t ) / h n ) wi th K being a kernel function. P r o o f : B y the definition of /3t. , E fa - E ttf'to - **)')2 + (4? - 2 ^ < E fa - E - *0 ' ) 2 + te? - / C V < E fa - E A ( A f > ( * i - *••)')2 (4-3 4) \ /=o / since f}\ ^ minimizes (4.32). Therefore J3Qt^ — /?o,v • 1=1 4.2.4 C V for forecasting: FCV The following two sections contain discussion of ideas of cross-validation for forecasting based on different assumptions. Unl ike the case of curve f i t t ing i n the data range where the Y{S can be used to "cross- validate" the estimates ihhn(ti) computed wi th bandwidth h n , the forecasting case has no "future" data beyond 1 to cross-validate m / l n i i ( l + A n ) for hn. One natural idea would be to leave out a portion of the most recent data, use the forecasting estimator and the rest of the data to estimate the left-out data and minimize some measure of the discrepancy between the left-out YiS and their estimated values to choose an optimal bandwidth hn for forecasting. This idea has the flavour of 54 > conventional cross-validation i n that a statistical model is buil t on part of the data and then "validated" by the rest of the data. A formal definition of such a CV function for forecasting can be formed as below. Definition 4.4 The cross-validation function for forecasting is defined as: FCV(hn) = J2(™^,U-A„(U) - Y ) 2 , (4.35) where S = {U : ti £ [1 — pAn, 1]} with p > 1 and \S\ = # 2 t S £ <S*. Remarks: 1. Note that rhhn,ti-A„(ti) as defined i n Definition 3.1 centers the data at i ,- — A n and predicts A„ ahead. For each left-out 2,-, the forecasting estimator m^ n i i i_A„(i,) uses the rest of the data prior to the t ime point 2,- — A n to estimate m(2j) . Then rhhn<ti-An(ti), the "forecast" at t = 2,-, is compared to Yi. 2. From Corollary 3.2 or 3.4, FCV(hn) could use more i ,s , w i t h £ [1 — pn, 1] for pn —> 0. B u t for computational convenience, FCV(hn) leaves out the part of the recent data w i t h 2; € [1 — pAn, 1]. 3. Independent work by Hart [17] uses a cross-validation idea, TSCV (time series cross-validation), similar to FCV i n the context of forecasting. He assumes an AR(1) model for the e,s and forecasts one step, 1/n ahead, using a locally constant regression estimator. In contrast, we assume independent e,s and forecast A„ oc n - 1 / 5 ahead. Hart and Y i [18] recently modified TSCV to OSCV (one-sided cross-validation) for the bandwidth selection for curve estimation by local linear regression. They showed that the opt imal bandwidth estimated by OSCV is less variable than that estimated by the tradit ional CV as defined i n (4.25). In both these papers ([17], [18]), the asymptotic mean squared error of the forecast has a simple form, B2hn + V/(nhn) where B and V are constants depending on K, m", 55 a2 and / , since the estimate is forecasting 1/n ahead. B u t AMSE(6) has a more complicated form than B2hn + V/(nhn) since we are forecasting further ahead. Recal l that AMSE(6), the asymptotic mean squared error of m^„ i i ( l + A n ) is used i n the plug-in approaches to estimate the opt imal bandwidth for forecasting. The following corollary of Theorem 3.2 stated for non-random i,s sheds some light on the relationship between FCV(hn) and AMSE{8). C o r o l l a r y 4.1 Assume the same conditions as in Corollary 3.4- Then n4/5{E(FCV(hn)) -a2- AMSE{8)} = o ( l ) . (4.36) P r o o f : Since rh-hn,ti-&n(ti) is independent of Yi, E(FCV(hn)) = | 4 | £ {E(™»n,L-*M ~ m{ti)f + E(m(ti) - Y)2} = •£[ | £ E(mhntt,_An(ti) - m(ti))2 + a 2 j . (4.37) B y Corollary 3.4, sup n 4 ' 5 [E [mhntU.An(ti) - m(ti)}2 - AMSE(6)) = o ( l ) . A s a result, nAl5{E{FCV{hn)) -a2- AMSE(8)} = o ( l ) . • . The lemma below w i l l ensure that the stronger uniform results i n Theorem 3.2 and Corollaries 3.3 and 3.4 hold for 2,-s random. Then by the same proof, the above corollary is true for 2,-s random. L e m m a 4 .3 Suppose that 0 = t0 < tx < ... < tn < tn+1 = 1 and that tx,...,tn are order statistics of a distribution with density function f. Suppose that W and W are 56 bounded in the support ofW, g' and f are continuous over [0,1] and inf i e[o,i] f(t) = a > 0. Then ifnhn —• oo, sup <e[o,i] 0 (4.38) in probability as n —> oo. The proof of the above lemma w i l l use standard results on order statistics f rom the uniform distribution over [0,1] (see, e.g. [21]). Those results are stated i n the lemma below without proof. L e m m a 4.4 Let £/(,), i = 1 , . . . , n be order statistics from the uniform distribution over [0,1]. Let C/(0) = 0, and U(n+i) = 1. Then *• E{U(i)) = ^ , Var{l%) = for i = 1 , . . 3. E(U{i) - */(,_!)) = E(U(1)) and E(U(i) - U^f = E ^ ) , for i = 1 , . . . , n + 1. P r o o f of L e m m a 4.3: Since 1 "+1 t—t (n + l)hn hn n + l nhn Y «n [n + l)hn hn (4.39) we have sup te[o,i] sup te[o,i] n + l + tn+l — t (n + l)hn hn The assumptions on W', # and hn ensure that the last expression w i l l go to 0 as n —• oo. Therefore this lemma is true if r^uT E W(^)g(U) ~Y [ W(^)(gf)(u)du (n + l)hn j hn hn Jo hn sup te[o,i] (4.40) 57 i n probabil i ty as n —> oo. Consider the integral i n (4.40): _L f1 w{u-zl)(gf){u)du tin JO tln = r E / wC—-){gf){u)du Hn j — \ Jti—1 *in = rSr W{t\1)giti)mdu + fln ) = 1 - ' t . - i T - E 1 r H V ^ M - w&^Wi)] • In , = i Jti-i L / l n / t n Plugging (4.41) i n the left hand side of (4.40) gives: 1 n+1 t - — t 1 71+1 rU t—t E w(\^)g(u) - f E / ^(^-M*0/(«)«* (re + l ) / i n j • ftn / i n i = 1 Jf i_i An (4.41) sup te[o,i] = sup \A(t)-B(t)\. te[o,i] B y Chebyshev's inequality, for any e > 0, (4.42) P sup \A(t)-B(t)\ > e < \te[o,i] / £ ( s u P < 6 [ 0 ) 1 ] | A ( t ) - 5 ( i ) | ) < £ ( s u p « € [ 0 f l ] | A ( t ) l ) + ^ ( s u p t e [ 0 | 1 ] | ^ ( t ) | ) (4.43) Therefore the lemma is proved if the right hand side of (4.43) —> 0 as re —> oo. Consider E ( sup t e [ 0 1 ] |A(2)|) first. Ait) = ^ E ^ ^ - f s f w ( ^ ) ^ ( t o / ( « ) d « (n + l ) h n i h n h n iZ[ Jti-i h n = ^ £ V ( ^ M t o ^ - f E V ( ^ M * o / f i (4-44) h n i « n n + 1 ftn ftn which can be writ ten i n terms of the order statistics of a uniform distr ibution. Let U(i) = J fiu)du, i = 0, . . . , n + l . 58 Then . . . , ?/(„) are order statistics of the uniform distribution over [0,1] and U(o) = 0, <7 („ + i ) = 1. Using the facts 1 n + Y = EiUv-Uw), / f(u)du = U(i) - «7(,_i), t = l , . . . , n + l Jti-l i n A(t) and then re-arranging the terms yie ld: -i n+l 4 f Mi) = r E ^ V ^ K r ^ } t'=l •j n+l J ^ n j=l n (4.45) Recal l that U(0) = 0 a r i ( 1 ^ (n+ i ) = 1, so - f E w(̂ )̂ (to - .̂--i)} , = 2 = i E HHr^*-0 ~ w( r̂M*0} {EU«-x) ~U(i-^ -(4-46) Using the condition |(W#)'| < C in (4.46) yields: sup |A(t) | < ̂ E <G[0,1] " n , = 2 Because of the fact that EU(i-\) — U(i-i) (4.47) rU U(i) - r/(,-_i) = / /(«)<Zu > mif{t)(ti-ti-l) = a ( t i - t i - 1 ) , t£[0,l] (4.48) suPte[o,i] \ c a n ^ e bounded by c n+l ah2 " " n ,=2 E ^ ( 0 _ ^(«'-l) - ^ ( i - l ) 59 Schwarz' inequality can be used to bound the expected value of the above by C n~^~^ / 2\ 1/2 E {E fa) " tf(.--i>) ) {VarWw)) Using fact 3 of L e m m a 4.4 then yields sup \A(t)\) < - T J {U(i)) E ( ^ K ^ - i ) ) ) <6[0,1] / a f l n x 7 t=2 A g a i n , by facts 1 and 2 of L e m m a 4.4, the last expression equals C ( 2 V / 2 A / i(n + l-i) \1/2 /2 a h l V(» + + 2),/ £ V(" + 1 ) 2 ( " + 2 ) c ahl \(n + l){ Since n + as n —> oo, we have E sup = O \*€[0,1] / 0 under the condition that nhn —• oo. Now consider £ ( sup 1 £ [ 0 1 ] |i?(t)|), where 1 i l n , _ i •'ti-i L (in ftn (4.49) ^1 / 2 "+1 I _ l _ ± (J- (i _ n + 2)J (n + 2 ) V 2 + l ^ V n + 1 V n + 1 / ; 1/2' (4.50) (4.51) B y the assumptions of the lemma, W and W are bounded i n the support of W, and / and g' are continuous on [0,1]. So there exists C\ > 0, such that I u — ti /(«) < Cx hr (4.52) 60 Therefore, 1 ^ r*i sup \B(t)\ < T - E / C-i — au i i _ i — U du n+l " n i=l < § i E ( ^ ) - ^ - i ) ) 2 , (4-53) " n " «=1 by (4.48). F ina l ly , <?i" v 2 c?hnKn^ J ( n + l)(n + 2) = ( 4 ' 5 4 ) under the condition that nhn —> oo. So by (4.50) and (4.54), the right hand side of (4.43) -> 0. • Unfortunately, no short-cut formula for FCV has been identified, making the imple- mentation of this criterion computationally intensive. The cross-validation criterion i n the next section has a short-cut formula. 4.2.5 C V for forecasting, BCV FCV uses the data up to — A n to forecast ra(£,). In contrast the "backcast" cross- validation BCV i n this section w i l l use data to "forecast" the past. To understand the definition of BCV, recall that rhhn,i(l + s) = / ? 0 1 + J3ltls estimates m ( l + s) w i t h data centered ait — 1. Suppose that m is really a line. Then if m ^ n i i ( l + s) is a good estimate 61 of m(l+s) for s G [—A n , 0], rh-hn,i(l+s) is an equally good estimate of m(l+s) as it is for s G [0, A n ] . If m is not a line, we know by a Taylor expansion that m is approximately linear i n a neighbourhood of 2 = 1. So we can fit a line locally w i t h the data centered at 1 and estimate m ( l + s) using / 3 0 1 and to calculate m „ n ) i ( l + s). The accuracy of the fit is then assessed by the data i n [1 — A „ , 1]. Since for s £ [—A„, 0], m / , n i i ( l + 5 ) gives a "backcast" instead of a forecast, we call the following cross-validation criterion BCV. D e f i n i t i o n 4.5 Let S\ = {ti : 2,- > 1 — A n } and \S\\ = the number oftiS G S\. BCV{K) = - i - £ c ( l _ 2.) ( m , „ , 1 ( - ) ( 2 , ) - Y)2 , (4.55) w/*ere m / l n ) 1 ( " ' ) ( 2 , ) = + ^ ^ ( t , - - 1), @[ *J minimizes 52&i(Yj - Po - Pi(tj - l))2K((tj — l)/hn) and c(-) is a function that may depend on A n but not on hn. In (4.55), we recommend c(-) to be chosen so that E(BCV(hn)) « AM5,E(<5) + constant, (4.56) where AMSE(6) is the asymptotic M S £ of m f c B > 1 ( l + A n ) and 8 = An/hn. BCV always centers the data at 2 = 1 when estimating m(2,), thus using data i n [1 — hn, 1] but FCV i n the last section centers the data at 2,- — A„ when estimating m(2,), using data i n [2,- — A n — hn,ti — A„] . Suppose that a c(-) exists such that (4.56) holds. Then BCV behaves approximately like AMSE(8). The hope is that the bandwidth that minimizes BCV also approxi- mately minimizes AMSE{8). The opt imal bandwidth chosen by the BCV criterion is defined to be the one that minimizes BCV and w i l l be sought by a grid search. Working wi th (4.55) directly is computationally intensive because for each hn on the search grid and for each data point (U,Yi) wi th 2,- > 1 — A n , a regression needs to be computed. Fortunately, a short-cut formula holds for (4.55). This is shown i n L e m m a 4.6. 62 To choose a c(-) to satisfy (4.56), note that E{BCV{hn)) = ~ £ ^ - ^ { ( A w ^ ^ ^ ) - ^ - ^ ) ] ) ' } = E c(l-t,-)^{(mw (-°(*0-m(*0) a} + ^ 2 T^ E c(! " *0 l ^ l t.-eSi ~ I F , E c(! - {(mw(*0 - m(r,))2} + °- 2 |F7 E c ( ! -^ ) , (4-57) if \S\ | _ 1 ]Ct,-e5i c ( l — ^ ' ) E | ( m / i„, i^~^( '« ) — m (^«')) | c a n be replaced w i t h negligible error by ISil" 1 Eues, c(l - *,)£ {(mw(t.-) - m(t , ) ) 2 } . Thus to have (4.56), it suffices to f ind c ( l — i,)s such that E c(l - * ; ) £ {(mfc„,i(*0 " MU))2} ~ A M S £ ( £ ) . (4.58) Since AMSE(6) = 0(n~4/5) by the asymptotic results i n Chapter 3, we require the difference between the left and right sides of (4.58) to be o ( n - 4 / 5 ) . The following Theorem is stated for fixed design w i t h equally spaced i ,s , U = i/n, for the choice of the c;s. Theorem 4.1 Assume the same conditions as in Corollary 3.3. Let k = [n — nAn], e = ( 1 , . . . , 1 ) ' (a k x 1 vector), z = (zk,...,znY = (1 — ifc,..., 1 — £„)* and c = (cjt , . . . ,c„,Y — (c(zk),... ,c(zn)Y- Note that Si — {tk,...,tn}. If c(-) satisfies: n / V A . . 2 ^ c«( —) = \Si\ (4.59) 63 where ( z /A n ) * ' = ( ( z j t / A n ) \ ( z n / A n ) * y , then £ - ™(*0)2 } - AMSE(6) = o{n-A'5). (4.60) Proof: Since the t,-s are equally spaced, we can write l 0 l t ueSi i=fc = B + y, where 1 " P i I t=fc 1 n 1*1 & Theorem 3.2 tells us that E0Otl)-m{l) = E0hl)-m'(l) = Var0Otl) = Var0ltl) = 2 °iK + o(hn) 02hn + 0(/inJ 2 1 a2 1 1 <T2 I 2 2 + ° W ' where E = Ui u2 U0 Ui Ui u2 u2 \ U 3 J uT, u: - l \ Ul U2 J UQ UI Ui U2 and Ui and u* are as defined i n Condit ion 4 i n Chapter 3. 64 So, for \t- 1| < A „ , Bias(mhn<1(t)) = E(mhnA(t) - m(t)) = E + Pltl(t - 1) - m(t)) = E{p0tl-m(l) + [Phl-m'(l)](t-l)} -[m(t) - m ( l ) - m ' ( l ) ( i - 1)] = j & i f + fe2^-l)}m"(l) + o ( ^ ) - { m ' W ^ + « ( ( * - i ) 2 ) } = ^ { M n - M „ ( l - 0 - ( l - 0 2 } + « (A„) , (4-61) I V a r ( m w ( * ) ) = V a r + ^ ( t - 1)) = Var ( / ? 0 ) 1 ) + 2(i - l)C<wfAi> /31)X) + (f - l)2Var0hl) _ ( 1 2 ( t - l ) ( t - 1 ) 2 \ a 2 1 - feSll + ^ ^ S l 2 + ^ ^ S 2 2 J / ( i T + 0 W ' (4.62) where o ( A 2 ) and o(l/(nhn)) do not depend on t as long as \t — 1| < A n . B y (4.61) and (4.62) and the fact that hn — An/<5, 1 " i=k ll E « - M n ( l " ti) " (1 - t,-)2 + 0 ( A 2 ) } 5 1*11 .t A 4 " 15: \Si\k \*4 Zi_\ 2M2 / M 2 b\ ~ 26i A J ^ + U J 62 \ 3 2&2 2; 65 m " ( l ) |<S\| \ V U J £3 + U J £2 +<*(i)3 '̂(i)-4 (4.63) and V = 1 " 1*11 t=fc 1 * f 1 2 ( 1 - * Q 2 ^ c « ^ — r - ^ i i r 5 — L i 2 1*1 fa* l » * n n/>2 , (1 ~ * Q 2 nh3. , 1 x l ^ S 2 2 + 0 W / 7 ( T ) 1 " / 1 2 A f Z i \si\t nhn nhl An W J /(I) 1*1 « . t / z \ 2 A n VA„y nhl 12 + c t ( i ) 2 i S 2 2 + o ( i ) (4.64) B y (4.61) and (4.62), for t — 1 + A n , the square of the bias and the variance of m„„,i(l + A„) are ^ " ( l ) 1 2 Bias2(mhntl(l + An)) = {b2X + 2hb2h3nAn + (b\ - 2h)h2nAl - 2 & 2 / > n A 3 + A n + o ( A n ) } m " ( l ) A b 2 2 6 1 6 2 &2 - 2 & i » 1 64 + £ 3 + <52 and + 1+0(1) f 1 2 A A 2 1 1 a2 V a r ( ^ , ( l + A . ) ) = | _ E „ + ̂ E 1 2 + + o ( — ) j w (4.65) (4.66) Conditions i n (4.59) ensure that B - Bias2(mhnA(l + A B ) ) = o ( A n ) 66 and V - Var(mhn<1(l + A„) ) = o(l/(nhn)). Since hni A n oc n - 1 / 5 , we have B - Bias2(mhntl(l + A n ) ) = o ( n - 4 / 5 ) , and V - Var{mhn<l(l + A„) ) = o ( n " 4 / 5 ) . Therefore the theorem is proven. • Remarks: 1. Since for each value of [ n A n ] , c needs to be calculated i n the implementation of B C V , one might be interested i n the l i m i t i n g forms of conditions i n (4.59) when n —* oo and nAn, —* oo. Let c(t) = g(t/An). For equally spaced 2,-s, one can show that there exists a function g, e.g., a fourth degree polynomial , such that / g{u)du = 1 Jo I ug(u)du = —1 Jo / u2g(u)du = 1 Jo / u3g(u)du = —1 Jo f u4g(u)du = 1. (4.67) Jo 2. The conditions i n (4.59) are sufficient for (4.60) but may not be necessary. For example, if m " ( l ) = 0 the last two conditions i n (4.59) are not needed for (4.60) to hold . However, i f m " ( l ) ^ 0 and al l the coefficients of powers of 8 i n (4.65) and (4.66) are non-zero, then the conditions i n (4.59) are necessary for (4.60). 67 3. So far our discussion about B C V has been l imi ted to simply providing guidelines for choosing c(-). It has not been shown that the hn min imiz ing B C V is close to hopt, the bandwidth that equals An/5opt wi th 8opt min imiz ing AMSE{8). Note that when |<?i| > 5, there may not be a unique vector c that satisfies (4.59). In the implementation of BCV, we w i l l use the vector c that minimizes c'c subject to the conditions i n (4.59). Lemma 4.5 Suppose that \S\\ > 5 and that the are distinct. Let X1 = and ' ( A J ' ( A J ' ( A J ' ( A J 6' EE (1^1, -15x1,1*1,-ISxU^I). (4-68) Then the minimizer of ctc subject to (4-59) is c — Xt(XXt)~1b. Proof: Let L(c,\) = c t c - \ \ X c - b ) , (4.69) where A is a 5 by 1 vector. Setting the derivatives of L w i t h respect to c and A to zero yields: ^ = 2 c - X t \ = Q=>c = ]-Xt\, (4.70) oc 2 ^ = Xc-b = 0=> Xc = b. (4.71) Plugging c of (4.70) i n to (4.71) gives Xc = \XX*\ = b. (4.72) Since tk,...,tn are distinct, X is of fu l l row rank of 5. Therefore XXt is invertible. Inverting XX* i n (4.72) results i n A = 2(XXt)~1b. Plugging this A into (4.70) yields c = Xt(XXt)~1b, which may be either a minimizer or a maximizer . Since L is un- bounded above, this c is a minimizer . • 68 Unl ike FCV, BCV has the short-cut formula we established to enable one to calcu- late only one regression for each hn on the search grid. Thus the process of estimating the opt imal bandwidth hn is fast. We turn to that problem now. Lemma 4.6 B C V{hn) - — - 2̂ c(l - ti)—— , where [Hhn]u = {1,U - 1) x toe ito c o l m n o/ [(A*WA) A W ] , / i . . . i N A l = and W = diag(K(^-r-^)) (4.73) \ * i — 1 • • • tn-l J with K a positive kernel over its support. Proof: Recal l that & = argmin £ (^ - ft - ft(t; - l))2 K^T^)- j=i 71 F i x i and let n £ • — 1 <*U = argmin (Yj - a0 - ax{t3- - U)) K(J-^—). Then hn = (4-74) The usual estimate of m(*t) using /Sj is ft>x + ftii(i; — 1) which is equal to a0tU, the usual estimate of m(U) based on ati. Thus, both parameterizations yield the same estimate of (m(ti),..., m(tn)y, namely Hh„Y. For the same fixed i , also let fi™ = argminifciYj-h-lhitj-ltfKi^) 69 + ( ^ ; i * ' ) - / 5 o - ^ - i ) ) 2 / v ( \ ^ ) , t i - i . a[. l) = argrain^ (Yj - ct0 — a^tj - t,))2 K("3 hr, ct i"^* = ar grain ^ (5j - ao - a x ( i j - *i)) 2 ^ ( ^ 7 ) + (afc? - ao - axfc - t ,)) ' ^ f ^ ' then From L e m m a 4.2, cto~u* = ^o~u- Thus from L e m m a 4.1, we have B y the relationship between the d s and /3s, the short-cut formula holds. • In the next chapter, simulations w i l l be carried out for the local linear forecasting estimator to compare the performance of the four methods of bandwidth estimation. 70 Chapter 5 Simulation: local linear forecasting estimator 5.1 Motivation and data In this chapter we w i l l study the local linear forecasting estimator of m ( l + A n ) for A„ = 0.1 and 0.2 on simulated data sets, Yi = m{ti) + e,- w i t h e,-, i = 1 , . . . , n i i d normal variables ~ N(0,<T = 0.1), and equally spaced 2,-s, t,- = i/n. D a t a sets w i t h sample sizes n = 50 and 100 w i l l be generated from each of three ms, mi(t) = t, m 2 ( t ) = t2 and m3(t) 2-7l2(cosUvt) + 1), t < 1/2, (5.1) i 5 / 2 , t > 1/2. Methods of estimating an opt imal bandwidth by using the plug-in procedures CAMSE, HAMSE f rom Sections 4.1.2 and 4.1.3 and the cross-validation procedures FCV and BCV f rom Sections 4.2.4 and 4.2.5 are applied to each data set. The estimators of m " ( l ) and a2 needed for the plug-in procedures are those i n Section 4.1.5. The kernel 71 function K used i n al l four methods is the density function of the standard normal wi th support truncated to [—1,0]. For this kernel function, the asymptotic mean square error by formula (4.11) is m"(l)2 AMSE{8) = — ^ - A n ( 0 . 1 5 3 2 6 6 8 / 6 2 + 0.9663585/6 + l ) 2 2 c +-^—(4.034252 + 12.16996 + 12.21121662). (5.2) The opt imal bandwidth for forecasting is: hopt = An/6opt, where Sopt minimizes AMSE(S) The goal is to compare the estimates of m ( l + A„) and the estimates of opt imal band- widths by a l l the bandwidth estimation procedures discussed i n Chapter 4. The functions tp, p = 1 and 2, are chosen to study the sensitivity or the robust- ness of plug-in procedures of bandwidth estimation to the violat ion of assumptions on the regression functions i n the asymptotic analysis i n Chapter 4. Recal l that plug-in procedures require the estimation of m " ( l ) i n order to estimate the opt imal bandwidth for forecasting. In the estimation of m " ( l ) , the fourth derivative of m at 1, m^4\l), is assumed to exist and be non-zero. Therefore, it is interesting to study the performance of the plug-in procedures when m^(l) is zero. In contrast, the function m 3 is chosen because it has a non-zero fourth derivative at t = 1 and non-constant second derivative over [0,1]. Al though m 3 ' has a discontinuity point at t = 1/2, Corollary 4.1 st i l l holds when the FCV procedure is applied to m 3 . Recal l that Theorem 3.1 assumes that i n Condit ion 5 m" is continuous over [0,1 + A n ] and that the results concerning the asymptotic bias and the variance of rhhn,t(t + A„) hold uniformly for t 6 [an, 1] w i t h l i m i n f an/hn > 1. It can be easily shown that under the conditions of Theorem 3.1 but w i t h Condit ion 5 replaced by the condition that m" is continuous over [ao, 1 + A n ] for some 0 < ao < 1, the conclusions i n Theorem 3.1 hold uniformly for t (E [a 0,1]. Therefore Corollary 4.1 holds since FCV uses estimates of m(ti)s w i t h U > I — pAn and these t,s are i n [a0,1] for n sufficiently large. It is also interesting to see how the magnitude of m " ( l ) w i l l affect the estimated 72 opt imal bandwidth and the resulting forecasts. If m " ( l ) = 0, the square of the asymp- totic bias of rhkn(l + A n ) is of a higher order than A n and thus the hn that minimizes the AMSE of m „ n ( l + A n ) is of a different order from n - 1 / 5 . In particular, if ra is a line, the exact bias of m / i n ( l + A n ) is zero. Thus the AMSE i n this case has only the variance term which equals the second term i n (4.11), resulting i n the asymptotical ly opt imal bandwidth hn being co. It can be proved from (5.2) that a larger absolute value of ra"(l) w i l l result i n a smaller asymptotically opt imal bandwidth for the kernel function used i n this chapter. The three ras under study have m'((l) = 0, m 2 ' ( l ) = 2 and ra3'(l) = 3.75. 5.2 Results and comments Tables 5.1-5.3 display the summary statistics of the A n - a h e a d forecasts ( A n = 0.1,0.2) along wi th the values of hopt, and the estimates of opt imal bandwidth for the forecasts by the four procedures applied to 100 simulated data sets, each for sample sizes n = 50 or 100 and the functions ra = m i , ra2 and m3. Since the estimates of the optimal bandwidth are quite variable, the median of the estimates of the opt imal bandwidth for 100 simulations is a better indicator of the center of these estimates than the mean. The median of the estimates of the opt imal bandwidth for 100 simulations is denoted as hf^}. In the tables, the standard deviation (s.d.) and the mean square error (m.s.e.) of m / i „ , i ( l + A„) and hopt are estimated respectively as follows. Let 6\> generically represent either ra„nii(l + A n ) or hopt for the bth s imulation, 6 = 1 , . . . , 100, and 0 the true value, i.e., 9 = m(l+ An) or hopt. Then (s.d.) 2 is £ £ ° ° (§b - fj2 /100 w i t h I = E ^ i 4/100 and the m.s.e. is (0 — 0)2 +(s.d.)2. The last line of each table displays the true ra(l + A n ) and hopt. Results i n Tables 5.1-5.3 reveal that, among the four procedures examined, FCV 73 Table 5.1: Summary of 100 forecasts (ms) and estimates of opt imal bandwidths by local linear regression for m = m\ wi th sample size n = 50 and 100 respectively. The m i n i m u m of AMSE is displayed i n column (m.s.e) for t rue value. m = m i , n = 50 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) Ag? M O rk (s.d.) (m.s.e.) * £ ? ( B . d . ) CAMSE 1.08 (0.19) (0.04) 0.30 (0.27) 1.17 (0.34) (0.12) 0.27 (0.26) HAMSE 1.08 (0.19) (0.04) 0.27 (0.27) 1.18 (0.35) (0.12) 0.24 (0.26) FCV 1.11 (0.07) (0.01) 0.51 (0.21) 1.20 (0.08) (0.01) 0.51 (0.12) BCV 1.10 (0.10) (0.01) 0.31 (0.24) 1.23 (0.18) (0.03) 0.19 (0.15) t rue value 1.10 0.00 oo 1.20 0.00 oo m — m i , n = 100 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) *g? (s.d.) m (s.d.) (m.s.e.) Ag? (s.d.) CAMSE 1.10 (0.08) (0.01) 0.33 (0.28) 1.21 (0.14) (0.02) 0.30 (0.25) HAMSE 1.10 (0.08) (0.01) 0.33 (0.24) 1.20 (0.14) (0.02) 0.30 (0.24) FCV 1.10 (0.05) (0.00) 0.51 (0.21) 1.20 (0.06) (0.00) 0.51 (0.10) BCV 1.09 (0.10) (0.01) 0.19 (0.22) 1.20 (0.18) (0.03) 0.18 (0.14) t rue value 1.10 0.00 oo 1.20 0.00 oo 74 Table 5.2: Summary of 100 forecasts (ms) and estimates of opt imal bandwidths by local linear regression for m = m 2 wi th sample size n = 50 and 100 respectively. The m i n i m u m of AMSE is displayed i n column (m.s.e) for true value. ra = ra2, n = 50 method An = 0.1 An = 0.2 ra (s.d.) (m.s.e.) *S? (s.d.) ra (s.d.) (m.s.e.) hSg (s.d.) CAMSE 1.16 (0.24) (0.06) 0.24 (0.21) 1.33 (0.43) (0.20) 0.22 (0.20) HAMSE 1.16 (0.24) (0.06) 0.22 (0.20) 1.33 (0.43) (0.20) 0.21 (0.20) FCV 1.16 (0.09) (0.01) 0.29 (0.12) 1.33 (0.11) (0.02) 0.29 (0.09) BCV 1.13 (0.13) (0.02) 0.30 (0.27) 1.37 (0.25) (0.07) 0.20 (0.16) true value 1.21 0.01 0.34 1.44 0.02 0.32 ra = m 2 , n = 100 method An = 0.1 An = 0.2 ra (s.d.) (m.s.e.) fcg? (s-d.) ra (s.d.) (m.s.e.) * £ ? (s.d.) CAMSE 1.16 (0.12) (0.02) 0.24 (0.26) 1.34 (0.20) (0.05) 0.23 (0.22) HAMSE 1.17 (0.12) (0.02) 0.24 (0.21) 1.34 (0.21) (0.05) 0.22 (0.20) FCV 1.18 (0.08) (0.01) 0.26 (0.09) 1.36 (0.11) (0.02) 0.24 (0.08) BCV 1.15 (0.14) (0.02) 0.20 (0.28) 1.37 (0.20) (0.04) 0.20 (0.18) true value 1.21 0.01 0.30 1.44 0.02 0.27 75 Table 5.3: Summary of 100 forecasts (ms) and estimates of opt imal bandwidths by local linear regression for m = m 3 wi th sample size n = 50 and 100 respectively. The m i n i m u m of AMSE is displayed i n column (m.s.e) for t r u e v a l u e . m = ra3, n = 50 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) f>SS (s.d.) m (s.d.) (m.s.e.) AgJ (s.d.) CAMSE 1.21 (0.28) (0.08) 0.24 (0.21) 1.45 (0.51) (0.28) 0.23 (0.19) HAMSE 1.21 (0.28) (0.08) 0.22 (0.21) 1.45 (0.51) (0.28) 0.21 (0.19) FCV 1.19 (0.10) (0.02) 0.26 (0.09) 1.37 (0.12) (0.06) 0.25 (0.10) BCV 1.13 (0.16) (0.05) 0.30 (0.28) 1.39 (0.25) (0.10) 0.19 (0.17) t r u e v a l u e 1 . 2 7 0 . 0 2 0 . 2 6 1 . 5 8 0 . 0 5 0 . 2 4 m = m 3 , n = 100 method An = 0.1 An = 0.2 m (s.d.) (m.s.e.) h%] (s.d.) m (s.d.) (m.s.e.) Ag? (s-d-) CAMSE 1.20 (0.20) (0.04) 0.24 (0.26) 1.43 (0.33) (0.13) 0.22 (0.22) HAMSE 1.20 (0.20) (0.04) 0.22 (0.24) 1.44 (0.33) (0.13) 0.20 (0.22) FCV 1.19 (0.09) (0.01) 0.21 (0.08) 1.42 (0.15) (0.05) 0.21 (0.09) BCV 1.16 (0.16) (0.04) 0.21 (0.29) 1.44 (0.21) (0.06) 0.19 (0.17) t r u e v a l u e 1 . 2 7 0 . 0 1 0 . 2 2 1 . 5 8 0 . 0 4 0 . 2 1 76 produces the best forecasts i n terms of the mean square error (m.s.e.) and that the m.s.e. of FCV is closest to the m i n i m u m of AMSE. The m.s.e. of forecasts given by the other cross-validation procedure, BCV, is smaller than the m.s.e. of forecasts given by the two plug-in procedures i n most cases (eight cases out of twelve). However, BCV ties w i t h the two plug-in procedures (in terms of the m.s.e.) i n three cases and does worse i n one case (m = m i , A n = 0.2 and n = 100) where the m.s.e. of forecasts by BCV is 1% more than those of forecasts by plug-in procedures. Comparing the estimated mean and the estimated standard deviation of forecasts by al l four procedures shows that forecasts by plug-in procedures have a bias comparable to the bias of forecasts by cross-validation procedures but tend to have a much larger variance, and that the two plug-in procedures produce almost identical results. Tables 5.1-5.3 also show that the m.s.e.s of forecasts by al l procedures increase as |ra"(l)| increases, which is confirmed by examining plots i n Figure 5.1 where the min- i m u m of AMSE curve (solid line) increases as |m"(l)| increases from m ^ l ) = 0 to m 2 ' ( l ) = 2 and m 3 ' ( l ) = 3.75. Moreover, the m.s.e.s of forecasts by al l procedures are smaller for the larger sample size n = 100. This is expected since the convergence rate of AMSE of m / l n i i ( l + A n ) is n - 4 / 5 and gets smaller when n gets larger. From Tables 5.1-5.3, we see that among al l four procedures, the estimates of the opt imal bandwidths calculated by FCV are the least variable and have medians agreeing w i t h the true opt imal bandwidths reasonably well except for m — mi. Other procedures may do well for some cases but badly i n others. Recal l that a l l four procedures are devised to estimate the opt imal bandwidth by simulating the AMSE curve and that E(FCV) — a2 is approximately AMSE. To help us understand why FCV has the best performance, i n the following figures, AMSE(6) w i l l be plotted as a function of hn w i t h hn = An/S. Moreover, the shifted FCV, FCV - a2, w i l l be plotted instead of FCV. 77 The two plug-in procedures, CAMSE and HAMSE, produce almost identical sim- ulated AMSE curves. For example, Figure 5.2 shows that the CAMSE and HAMSE curves are i n good agreement for each of 9 simulated data sets for function mz w i t h n = 50 and A„ = 0.1. The similarity of those curves explains why the CAMSE and HAMSE procedures produce very similar forecasts and estimates of opt imal band- widths. F rom now on only the CAMSE curves are plotted for plug-in procedures. Of the four procedures, only FCV on average simulates AMSE curves reasonably well i n a l l cases, which is not surprising i n the light of Corollary 4.1. In contrast, as Figure 5.1 shows, the median of the CAMSE curves (dashed) agrees w i t h the AMSE curve only for m = but the median of the FCV curves agrees w i t h the AMSE curve for a l l three functions. This observation suggests that the condition of m^(l) ^ 0 is necessary for plug-in procedures to work. Furthermore, i n al l cases the AMSE curve simulated by FCV is less variable than the AMSE curve simulated by plug-in procedures. For example, Figure 5.3 of quantiles of the FCV curves and CAMSE curves shows that the AMSE curve simulated by FCV is less variable than the AMSE curve simulated by CAMSE for m = ra3. This observation probably explains why forecasts obtained through FCV are less variable than forecasts by plug-in procedures. The medians of simulated AMSE curves by BCV are too rough to resemble anything like the AMSE curve i n al l cases; see Figure 5.4 for example. This is probably due to the greatly oscillating values of the c,s, the weights used i n BCV which minimize c*c subject to (4.59). For example, when n = 50 and A n = 0.1, we have c< = (220.75,-753.77,632.54,492.46,-996.23,409.25). A s Figure 5.1 shows, FCV approximates the asymptotic mean squared error very well for ra = rax and ra2 but not as well for ra = m3. It is easy to see why the shifted FCV(hn) under-estimates AMSE(6), that is, 78 0.1 0.2 0.3 0.4 0.5 0.6 0.7 I I / / / / / / I I I m=m2 / / / \ \\ \ \ \ \ v., \ ^~ . — y s / / / / 0.2 0.3 0.4 0.5 0.6 0.7 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Figure 5.1: The median of 100 shifted FCV curves (dotted), the median of 100 CAMSE curves (dashed) and the true AMSE curve (solid) for m = m1,m2 and m3 w i t h n = 50 and A n = 0.1. 79 Figure 5.2: Pairs of CAMSE (dotted) and HAMSE (solid) curves for 9 simulated data sets for m = ra3, n = 50 and An -= 0.1. 80 81 o T3 CD C 55 CD E > O CO u3 CO CO o CM O O O 0.1 0.2 0.3 0.4 0.5 0.6 0.7 h Figure 5.4: The median of 100 shifted BCV curves (dotted line) and AMSE curve (solid line) for m = m 3 , n = 50 and A„ — 0.1. 82 AMSE(An/hn), particularly for large values of hn for m = m 3 . Recal l that the FCV score is an average of the estimates of the prediction errors for m(£,)s w i t h U £ [1 — A n , 1] and that by Corollary 3.1 or 3.3, AMSE of rhhn,ti-An(ti) depends on m"(ti — A n ) , while AMSE(8) depends on m " ( l ) . The second order derivatives of m i and m 2 are constants, w i t h m" = 0 and m'2' = 2. So for U £ [1 - A n , 1], m"(U — A n ) is equal to m " ( l ) . Thus, for m i and m 2 , one sees an agreement i n shape between the shifted FCV scores and the AMSE curve. B u t for m 3 , m 3 ( i ) = 15£ 1 / 2 /4 for t > 1/2, which is an increasing function of t. Therefore for ti £ [1 — A n , 1], m3'(£t- - A n ) is less than m 3 ( l ) . Even though as n —+ oo, the m"(ti — A n ) s converge to m " ( l ) , for a finite sample, one would expect the shifted FCV scores to under-estimate the AMSE curve and that the discrepancy increases as hn increases. However, Figure 5.5 shows that for fixed A n the discrepancy is smaller for larger sample size n , as would be expected. Based on our analysis and the above observations from our modest simulation study, we conclude that FCV has the best performance and thus is recommended. The next chapter w i l l contain the comparison of the backcalculation method and the local linear method applied to a fictitious A I D S example. 83 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.1 0.2 0.3 0.4 0.5 h h Figure 5.5: The AMSE curve (solid) and the mean of 100 shifted FCV curves (dotted) for m — ra3 w i t h A n = 0.1,0.2 and n — 50,100 respectively. 84 Chapter 6 The local method versus backcalculat ion This chapter offers the arguments that the proposed local method should serve as an alternative to the backcalculation approach i n forecasting. 6 . 1 Comments on the methods Backcalculation is used to achieve two goals: to estimate the historical H I V infection curve up to the present and to forecast the number of new A I D S cases using the esti- mated H I V infection curve. Backcalculation assumes the following: 1. the adjusted A I D S incidence data are accurate; 2. the incubation distr ibution, F, is accurately modelled and does not vary across subpopulations represented i n the counts of new cases. 85 In reality, the raw A I D S incidence data are affected by under-reporting and report- ing delays. In some developing countries, A I D S data are highly incomplete. Though adjusting the raw A I D S data for under-reporting and reporting delays improves the quality of the data significantly, the adjusted A I D S data st i l l contain errors depending on the imputat ion methods, the quality of the raw data and the random nature of the data. In short, accurate A I D S counts are not available. This is more of a problem for backcalculation (Section 6.2) than for the local linear forecasting estimator. Furthermore, knowledge about the "true" F is suspect. K n o w i n g F is crucial to the backcalculation approach but F is very hard to estimate. Recal l that F is the distr ibution of the incubation time from H I V infection to A I D S diagnosis. Usual ly external data other than the A I D S incidence data at hand are used i n the procedure for estimating F. There are several problems w i t h that procedure. F i rs t , exact infection times are known only for a few highly selected groups of individuals . So usually F is estimated from specific cohort(s) [3] and then plugged into the backcalculation model for a certain (or general) H I V infected population as the "true" incubation distribution. The applicabil i ty of F obtained from a cohort (a particular subpopulation) to a more general population is doubtful . In fact, Bacchetti et al [2] conclude f rom their analysis that the incubation distributions across the cohorts under study are quite different, resulting i n very different estimates of H I V infection curves. Second, the changes of the definition of an A I D S case [1] (once i n 1985 and then again i n 1987 i n the Uni ted States) further complicate the estimation of the incubation distribution. T h i r d , under-reporting and reporting delays also affect the estimation of the incubation distr ibution. In contrast, the local linear forecasting estimator has only one goal: to forecast the number of new A I D S cases. It assumes that the expectation of the adjusted A I D S incidence data reflects the true level of the underlying A I D S epidemic, i.e., E(Yi\ti) — m(ti) (with m the true level of the underlying A I D S epidemic) and uses the A I D S counts 86 (adjusted when necessary) directly. The local linear forecasting estimator does not use any external data other than the A I D S data. One might think that backcalculation should provide good estimates since it employs both external data and the A I D S data. However, as shown i n the simulation study described i n the next section, it does not produce better forecasts than the local linear forecasting estimator that uses the A I D S data only. In addit ion, it w i l l be shown that backcalculation can not be used to produce a good estimate of the H I V infection curve up to present. The next two sections w i l l contain discussions of the performance of the two methods. 6.2 A small simulation study This simulation study has two objectives: 1. to compare the forecasts of the number of new A I D S cases by backcalculation and the local linear forecasting estimator, 2. to investigate how the violation of either of the two assumptions about the A I D S incidence data and the incubation distribution function F w i l l affect the results of backcalculation. To achieve the two objectives, the performance of both approaches is investigated i n the simplest set-up: Y{ = m(ti) + e,- where the e;s are independent normal random variables w i t h standard deviation a: Al though the assumption of independent normal errors may not be realistic for the A I D S count data, it serves to show the statistical weakness of the backcalculation approach and simplifies the computation. This fictitious example w i l l use the simplest model for F: an incubation t ime distribution function which is independent of the infection t ime, i.e., F(t — s\s) = F(t — s). 87 The simulated data are Yi = m(ti) + e,-, i = 1 , . . . , 20, where the e,s are i . i . d . normal variables w i t h mean zero and variance 0.25, ti = i/20, m(£;) = JQ I(s)(F(ti — s) — F(U-i - s))ds, I(s) = - 2 0 0 ( 5 + l ) ( s - 2) and F(t) = 1 - e~*. One hundred simulated data sets are generated from the above model . Figure 6.1 shows the incubation function F and the H I V infection rate I. To attain the first objective, m ( l + A n ) , an estimate of m ( l + A „ ) , is calculated by both approaches for each of the simulation data sets for A n = 0.1 and A n = 0.2. A s explained i n the last section, the "true" F should be taken w i t h a grain of salt. It w i l l be interesting to see how the forecasts are affected by the assumed form of F. Therefore for the backcalculation approach, m ( l + A n ) is estimated under a few assumed forms of F, the true F: F(t) = 1 — e _ t and wrong J P S : F(t) = 1 — e~f^ w i t h ft = 4,1.1,0.9,0.5. Since the true ft is 1, ft = 4 or 0.5 is grossly wrong while ft = 1.1 or 0.9 is pretty close to the t ruth . The application of backcalculation w i t h each F to the 100 simulated data sets yields 100 m ( l + A„)s. In backcalculation, the smoothing parameter A is chosen by cross-validation (Section 4.2.2). The application of the local linear forecasting estimator to the same data also yields 100 m ( l + A n ) s , but not depending on the assumed form of F. Method FCV is used to choose the bandwidth for the local linear forecasting estimator because simulations i n the last chapter suggest that FCV has the best performance among the four competing bandwidth estimating procedures. For each set of 100 m ( l + A n ) s , the average, the standard deviation (s.d.) and the mean squared error (m.s.e.) are computed and displayed i n Table 6.1. For this fictitious example, the mean squared errors can be calculated since the true value of m ( l + A„) is known. Results i n Table 6.1 show that the local linear forecasting estimator always gives forecasts w i t h smaller mean squared errors than backcalculation. Comparison of the averaged forecasts and standard deviations suggests that forecasts by backcalculation 88 Time 0.0 0.2 0.4 0.6 0.8 1.0 Time Figure 6.1: The H I V infection function I, I(s) = - 2 0 0 ( 5 + l ) ( s - 2) and the incubation function F, F(t) = 1 — e~* used i n the generation of the fictitious A I D S data. 89 Table 6.1: Summary of forecasts by backcalculation and local linear regression for 100 realizations of the fictitious A I D S data. method assumed average of forecasts (s.d.) (m.s.e.) incubation F A n = 0.1 A n = 0.2 backcalculation 1 - e-*/ 4 14.27 (1.58) (2.52) 14.85 (3.64) ( 13.37) 1 - e-'l1-1 14.10 (2.55) (6.50) 14.23 (7.42) (55.13) 1 — e ~ f , ( t r u e ) 14.45 (2.94) (8.75) 15.52 (7.60 ) (58.80) 1 - e-</°- 9 14.31 (1.45) (2.14) 14.90 (3.24) ( 10.66) 1 - e - ' / ° - 5 13.59 (8.10) (65.88) 12.49 (27.00) (733.04) local linear F C V N A 14.58 (0.62) (0.61) 15.51 (0.96) (1.95) t r u e v a l u e 1 - e " < 1 4 . 1 1 1 4 . 5 0 are highly variable. For example, for A n = 0.1, even when the "true" F is used i n the backcalculation, the standard deviation (s.d.) of 100 forecasts by backcalculation is 2.94, which is much higher than the corresponding standard deviation of 0.62 for the local linear forecasting estimator. Results of backcalculation when the "true" F is used suggest that forecasts should not be made based on the extrapolated I. Estimates of the conservative lower bounds for the numbers of new A I D S cases, LBn of (1.6), are also calculated according to formula (1.7) for the backcalculation approach from the same simulated data. The average and the standard deviation of the 100 LBns 90 Table 6.2: Summary of lower bounds of forecasts by backcalculation for 100 realizations of the fictitious A I D S data. method assumed average of LBn (s.d.) incubation F A n = 0.1 A n = 0.2 backcalculation 1 - e - ' / 4 13.51 (0.89) 13.18 (0.87) 1 - e"*/1-1 12.89 (1.26) 11.78 (1.15) 1 — e _ t , ( t rue) 12.76 (1.21) 11.55 (1.09) 1 - e"*/ 0- 9 12.71 (0.83) 11.38 (0.75) 1 _ g-t/0-5 11.87 (0.54) 9.72 (0.44) t r u e l o w e r b o u n d 1 — e _ t , ( t rue) 12.69 11.48 t r u e v a l u e 1 — e~ f , ( t rue) 14.11 14.50 for each F are summarized i n Table 6.2. The true lower bounds given by formula (1.6) are calculated wi th F the "true" distr ibution used i n the generation of the simulation data. The s.d.'s of the lower bounds i n Table 6.2 are much smaller than the s.d.'s of forecasts based on extrapolated 7s in Table 6.1. This confirms that the problem shown i n Table 6.1 is due to the extrapolation of I. To attain the second objective of studying the sensitivity of backcalculation to the two parameters: the assumed form of the incubation function F and the presence of noise i n the YiS, each parameter is allowed to vary separately. F i rs t , estimates of I are backcalculated based on three similar Fs from the Y^s w i t h no error. The three Fs are 91 F(t) = 1 - e~t/p w i t h ft = 1.1,1 (true) and 0.9 (see Figure 6.2). A larger value of ft means a longer incubation period. The resulting I based on Fs wi th ft = 1.1 (long F), 1 (true F) and 0.9 (short F) are plotted together w i t h the true I i n Figure 6.2. Note that the true I is ful ly recovered when the true F is used i n the backcalculation and when there are no errors i n the YiS. To study the sensitivity of backcalculation to the presence of errors i n the YiS, esti- mates of I are backcalculated using the true F (the one used to generate the simulation data, ft = 1) for different realizations of the errors i n the Y{S w i t h a always equal to 0.5. The first plot of / i n Figure 6.3 shows that I is ful ly recovered when there is no error i n the YiS. The other three 7s i n Figure 6.3 are backcalculated from three realizations of the YiS w i t h errors. These three plots of I show that when there are errors i n the l^s, the backcalculated Is are highly variable. To get a better idea of the variabil ity of the backcalculated 7s when the YiS contain A A error, Figure 6.4 summarizes the 100 Is by plott ing the averaged 7, the 25th quantile, the 75th quantile and the true I. The simulation results show that the results produced by the backcalculation ap- proach are sensitive to the assumed form of the incubation distr ibution F and to the presence of errors i n the YiS and that the forecast based on the extrapolated / is highly variable (even when the true F is used). In comparison, the local linear estimator gives better forecasts by the criterion of MSE. Furthermore, the local linear estimator is much faster to compute and therefore is less expensive. 92 Time Figure 6.2: The sensitivity of / to the assumption of F i n the fictitious A I D S data (a = 0). True F(t) = 1 - e~\ long Fit) = 1 - e"*/1-1 and short F(t) = 1 - e" ' / 0 ' 9 . 93 Time Time Figure 6.3: The sensitivity of I to the presence of the errors (a — 0.5) i n the fictitious A I D S data when the true F is used. The first plot shows that I is recovered from the error-free A I D S data while the rest show that / is highly distorted by the presence of errors i n the A I D S incidence data. The solid curve (-) i n each plot is the true I and the dotted line (...) the backcalculated / . 94 0.0 0.2 0.4 0.6 0.8 1.0 Time Figure 6.4: The variability of backcalculated I. 95 6.3 Discussion of the methods and the simulation results This section w i l l offer theoretical insights on how the violation of either of the two assumptions on the A I D S incidence data and the incubation function F w i l l make back- calculation perform poorly. The reason that backcalculation is highly sensitive to inaccuracies i n both the A I D S incidence data and the incubation function F is that backcalculation is a mathematical ly ill-posed problem. Equat ion (1.4) m(t) = j T I{s) (F(t -s\s)-F(t-^- s|s)) ds. is a Volterra equation of the first k i n d [22]. For details on Volterra equations, see [22]. According to [22], a linear Volterra equation of the first k i n d is defined as /' K(t,s)I(s)ds = u(t), 0<t<T (6.1) Jo while a linear Volterra equation of the second k ind is defined as I(t) - I* K(t,s)I{s)ds = g{t), 0<t<T. (6.2) Jo The purpose is to solve these equations for I. A n equation is well-posed if a small change i n parameters, e.g. i n K or u i n (6.1), or i n K. or g i n (6.2), causes only a small change i n the solution I [22]. Equat ion (6.1) is not a well-posed problem. Volterra equations of the second k ind have been studied more extensively and are understood better than those of the first k i n d . Roughly speaking they are relatively well-posed i n contrast to those of the first k ind . A Volterra equation of the first k ind may be converted to one of the second k i n d by differentiating (6.1): K(t,t)I(t) + £ ™^ll(s)ds = u\t). (6.3) 96 If K(t,t) does not vanish i n 0 < t < T, then dividing (6.3) by K(t,t) yields a standard Volterra equation of the second k ind and can be solved by existing algorithms. The exact knowledge of u (consequently the exact knowledge of u') makes the equation (6.3) possibly well-posed. Under regularity conditions, a continuous solution of equation (6.1) exists uniquely and is the same as the continuous solution of equation (6.3). Even though Volterra equations of the second k i n d are relatively more tractable i n theory, solving them numerically is known to be hard. However, i n the backcalculation approach, K(s,t) = F(t - s\s) - F(t - l/n - s\s), (6.4) w i t h F being the incubation function, so K(t,t) = 0 for any t. Thus (6.3) is again an equation of the first k i n d . According to [22], a Volterra equation of the first k ind can be converted into one of the second k i n d if K and u are sufficiently differentiable. This result is not very useful i n practice because " u " (e.g. the functional form of m i n backcalculation) is the very unknown that is of interest to us. A s mentioned earlier only the exact knowledge of " u " may help convert equations of the first k i n d to those of the well-posed second k i n d . Suppose that (1.4) can be converted into an equation of the second k i n d by differ- entiating. Then one might propose to solve the original equation by instead solving the resulting equation. Since i n reality m is unknown, one needs to estimate derivatives of m f rom the raw data. It is well known that accurate estimation of derivatives of m from the raw data is hard, since errors inherited from the original data propagate (in terms of mean squared error). Those errors would be further amplified i n I due to the ill-posed nature of the problem. Penalizing 7, e.g., controlling the magnitude of $ I"(s)2ds when backcalculating 7, does not change the ill-posed nature of the problem. This is well manifested by the simulations i n the last section. A small amount of error added to m can distort the resulting estimate of I completely. See Figure 6.3. Of course, al l of above 97 discussions about Volterra equations assumes that u and K are known at a l l values of t, not just at the t,s. Smoothing the data {li}™ prior to applying the backcalculation procedure does not help either. This can be explained by the ill-posed nature of the problem. The ill-posed nature of backcalculation is also manifested i n the sensitivity of back- calculation to the assumed form of the incubation function F. See Figure 6.2. Figure 6.2 also exhibits a phenomenon which Brookmeyer [1] calls "nonidentifiabil- i t y " of I and F: for the same set of " A I D S incidence" data if a long incubation period is assumed (e.g., ft = 1.1), backcalculation produces an inflated estimate of the infection rate i n order to match the observed incidence. O n the other hand, if a short incubation period is assumed (e.g., ft = 0.9) backcalculation underestimates the infection rates. Bacchett i [2] makes the same observation. Recal l that one major goal of the backcalculation approach is to reconstruct (es- timate) the H I V infection curve /(•) from the A I D S incidence series. However, this nonidentifiability, accompanied by the unfortunate mathematical i l l-posed nature of backcalculation, makes backcalculation unfit for the task of reconstructing the past to present H I V infection curve I. It is also important to note that backcalculation is l imi ted to estimating only a proportion of H I V infections. Even though there is a strong correlation between H I V infection and A I D S disease, not every H I V infected person w i l l develop A I D S i n his lifetime. The proportion of people who are infected w i t h H I V virus and eventually develop A I D S is unknown. This is another reason that backcalculation can not recover the whole picture of the historical H I V infection pattern. In short, one can not expect to get a reliable H I V infection curve f rom backcalculation unless the true F and accurate A I D S incidence data (after correction for under-reporting and reporting delays) are available. Other methods must be devised to estimate the H I V 98 infections. Brookmeyer [1] believes that greater improvements i n reconstructing the H I V in- fection rates may come from empirical data on recent infection rates rather than from smoothing procedures or parametric models for I using backcalculation. A s the A I D S epidemic continues, more and more data on H I V seroprevalence are available from cross- sectional surveys (since the mid-1980s) and longitudinal cohort studies. Thus obtaining infection rates directly has become possible. Brookmeyer [4] [5] has proposed new ap- proaches based on cross-sectional surveys and cohort studies i n his recent research. Though the proposed local linear estimator does not attempt to estimate the H I V infections, it does a better job at forecasting than backcalculation. The local linear estimator w i l l be applied to two real data sets from Canada and the Uni ted K i n g d o m i n the next chapter. 99 Chapter 7 Data analysis In this chapter, forecasts using the local linear forecasting estimator and Healy and Ti l let t ' s method [20] w i l l be compared to the corresponding true or corrected numbers of new A I D S cases. The two data sets at hand are the Canadian A I D S data, and the U K A I D S data used by Healy and Ti l le t t [20]. For a given data set, a few recent observations w i l l be left out and treated as "future" , unknown observations. "Forecasts" by the two methods w i l l be made using the rest of the data and compared to the left out data, actual values of these "future" observations. 7.1 Data, methods and results The Canadian A I D S data are provided by the Laboratory Centre for Disease Control . Al though the national quarterly numbers of new A I D S cases are available from the fourth quarter of 1979 to the first quarter of 1996, quarterly numbers prior to the second quarter of 1991 are assumed to be free of reporting delays. This agrees w i t h previous work [23] which considered reporting to have ceased wi th in six years. Since correcting for reporting delays and under-reporting is beyond the scope of this thesis, the subset 100 of the Canadian A I D S cases that were diagnosed between the beginning of the fourth quarter of 1979 and the end of the first quarter of 1990 and reported w i t h i n six years of diagnosis w i l l be used. These data w i l l be treated as being free of reporting delays and under-reporting, i.e, these data w i l l be assumed to represent the true numbers of new A I D S cases. The Canadian data are displayed i n Table 7.1 w i t h quarters of a year labelled Q1-Q4. Table 7.1: Quarterly numbers of A I D S cases reported wi th in six years of diagnosis i n Canada, 79Q4-90Q1. 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 Q l N A 0 2 6 17 34 65 125 190 267 351 372 Q2 N A 4 2 4 16 36 82 147 223 254 317 N A Q3 N A 0 2 7 13 39 99 163 261 295 350 N A Q4 1 0 2 6 17 50 117 186 261 304 328 N A The U K data as shown i n Table 7.2 are taken from Table 6 i n [20] i n which the monthly numbers of reported new A I D S cases and Healy and Ti l le t t ' s estimates of number of unreported new A I D S cases are displayed. Healy and Ti l le t t [20] imputed the estimates of numbers of unreported new A I D S cases based on the delay distribution estimated from the data from 1984 to 1986. Thus the corrected data are the sum of the reported number and the estimate of the number of unreported new A I D S cases. The Canadian data and the corrected U K data are plotted i n Figure 7.1. To test our forecasting method, for the Canadian data, we consider 88Q1 as the present t ime 101 Table 7.2: M o n t h l y numbers of A I D S cases reported to the end of September 1987 i n U K . For t ime periods M a y 1986 to September 1987, the first number is the reported and the second number i n parenthesis is the estimate of number of unreported new A I D S cases. 1982 1983 1984 1985 1986 1987 Jan 1 1 6 16 25 33(7.5) Feb 0 0 5 15 30 44(9.7) M a r 2 2 8 18 26 35(12.2) A p r 1 0 4 16 35 35(15.8) M a y 0 1 5 14 38(0.1) 25(19.8) J u n 0 1 6 12 27(0.4) 48(25.9) J u l 1 5 10 17 24(0.9) 27(32.8) A u g 1 4 14 24 29(1.4) 27(44.2) Sep 1 3 7 " 23 39(2.3) 14(66.1) Oct 1 0 14 25 34(3.3) N A Nov 1 5 8 22 38(4.3) N A Dec 2 7 16 21 46(5.8) N A 102 and for the U K data we consider December 1986 as the present t ime. In each plot, t ime has been rescaled so that the data wi th t i n [0,1] are " k n o w n " and the data w i t h t > 1 are "unknown" and w i l l be predicted. For the Canadian data, data w i t h t i n [0,1] are quarterly numbers of new A I D S cases from the fourth quarter of 1979 (79Q4) to the first quarter of 1988 (88Q1) and are used to "forecast" the quarterly numbers of new A I D S cases from 88Q2 to 90Q1. For the U K data, data wi th t i n [0,1] are monthly numbers of new A I D S cases from January 1982 to December 1986 and are used to "forecast" the monthly numbers of new A I D S cases from January 1987 to September 1987. Healy and Ti l le t t [20] fit a log-linear model i n t to the most recent two years data assuming the Yis to be Poisson counts, l o g ( £ ( y , ) ) = «o + a i * , (7.1) and then extrapolate the fit to get forecasts. The local linear forecasting estimator w i t h FCV, and Poisson regression using the most recent data w i l l be applied to each data set to produce forecasts for specified t ime periods. Healy and Ti l le t t [20] chose a t ime span of two years i n the Poisson regression for the U K A I D S data. The choice of a t ime span of data on which Poisson regression is computed is equivalent to the choice of a bandwidth i n local regression. To show that forecasts are highly dependent on the t ime span, that is, the bandwidth, Poisson regression w i l l be computed using data from the most recent half year, one year and two years and then the fit w i l l be extrapolated to yie ld forecasts. This choice of the spans of "most recent" data is arbitrary. The forecasts w i l l then be compared to the corresponding true numbers of new A I D S cases for the Canadian data or to the corresponding corrected numbers of new A I D S cases for the U K data. Tables 7.3-7.4 display the "forecasts" by the local linear method and by Poisson re- gression using the most recent data wi th the three t ime spans. The opt imal bandwidths estimated by FCV are rescaled i n years so that it is easy to see the " t ime span" that 103 < 8 S3 at 8 1 Time (0.0=7903,1.0=8801) Time (0.0=Dec 81,1.0=Dec 86) ure 7.1: Canadian AIDS data and U K AIDS data: forecasting beyond t 104 the local linear forecasting estimator uses. The average of the squared forecasting errors (ASFE), displayed i n the last line of these tables, serves as a measure of the precision of forecasts by each method. For the local linear method, the FCV score and estimated bandwidth (in years) are also displayed. For the local linear forecast at t = 1 + A n for each A „ , AMSE(8) w i t h 8 = An/hopt can be used as a measure of accuracy of the forecast. However, AMSE is unknown. Recal l that FCV — a2 is approximately AMSE, provided that errors i n Ys are ho- moscedastic. Since Figure 7.1 shows that the assumption of homoscedastic errors is reasonable for the Canadian data w i t h t i n [0,1] and for the U K data w i t h t i n [0.5,1.0], the Rice estimator can be applied to estimate a2 i n each case. The Rice estimator as i n (4.13), when applied to the Canadian data i n [0,1] and the U K data i n [0.5,1.0] yields 177 and 14, respectively. Al though the distribution of FCV is unknown, FCV scores are displayed as an indication of forecasting errors. Table 7.3 shows that the local linear forecasting estimator w i t h FCV gives the closest forecasts for the Canadian data, w i t h the smallest ASFE, 716. Poisson regression from data of different t ime spans clearly gives very different forecasts, showing that the choice of a t ime span of most recent data is a practical problem. This sensitivity to the choice of the bandwidth is usual for smoothing methods. Poor performance of Poisson regression to the Canadian data also suggests the possibility that Y{S may not be Poisson and thus a specific assumption on the distribution of YiS may not be justifiable for different A I D S data sets. Table 7.4 shows that Poisson regression using data of the most recent year gives the closest forecasts, w i t h the smallest ASFE, 62. However, Poisson regression using data of the most recent half year gives the worst result, w i t h the largest ASFE, 2389. In comparison, the local linear forecasting estimator w i t h FCV gives decent forecasts w i t h ASFE, 129. Figure 7.2 shows the forecasts by the local linear method and the forecasts 105 Table 7.3: Forecasts (ms) for 8 quarters, 88Q2 to 90Q1, on Canadian A I D S data (79Q4- 88Q1) by local linear forecasting estimator ( L L ) w i t h FCV, and by Poisson regression ( P R ) using most recent data respectively. The opt imal bandwidth estimated by FCV is rescaled i n years and denoted by hyopt. The FCV score and hy0rpt are displayed together i n parenthesis. The <r2 estimated by the Rice estimator is 177. T i m e Method A c t u a l # m by L L rh by P R on data from most recent Y m (FCV,hy0rpt) 0.5 year 1.0 year 2.0 years 88Q2 293 (64, 3.2) 273 288 311 254 88Q3 315 (483, 2.4) 279 303 341 295 88Q4 309 (234, 0.9) 286 319 373 304 89Q1 349 (312, 1.0) 292 337 408 351 89Q2 370 (403, 1.2) 299 355 446 317 89Q3 346 (685, 0.9) 306 374 488 350 89Q4 359 (1816, 0.9) 313 394 534 328 90Q1 371 (2474, 0.9) 320 415 584 372 ASFE = £ ( m - y ) 2 / 8 716 1191 1222 17033 / 106 Table 7.4: Forecasts (ms) for 9 months, January 1987 to September 1987, on U K A I D S data (January 1982-December 1986) by local linear forecasting estimator ( L L ) wi th FCV, and by Poisson regression ( P R ) using most recent data respectively. The opt imal bandwidth estimated by FCV is rescaled i n years and denoted by hyJpt. The FCV score and hyJpt are displayed together i n parenthesis. The a2 estimated by the Rice estimator is 14. T i m e Method Delay- rh by L L m by P R on data from most recent corrected # m (FCV,h%t) 0.5 year 1.0 year 2.0 years yc 87Jan 53 (54, 0.6) 58 47 47 41 87Feb 45 (60, 1.4) 66 49 49 54 87Mar 48 (57, 1.3) 75 52 51 47 87Apr 49 (53, 1.3) 85 55 54 51 87May 52 (44, 1.2) 96 58 57 45 87Jun 55 (27, 1.1) 109 61 60 74 87Jul 57 (20, 1.0) 124 64 63 60 87Aug 59 (46, 1.0) 141 67 66 71 87Sep 61 (68, 0.9) 160 71 69 80 ASFE — E ( m - y c ) 2 / 9 129 2389 62 65 / 107 by the Poisson regression w i t h the smallest ASFE. For the Canadian A I D S data, the Poisson regression wi th the smallest ASFE uses the data f rom the most recent half year; for the U K A I D S data the Poisson regression w i t h the smallest ASFE uses the data from the most recent year. The plots i n Figure 7.2 confirm our observation from Tables 7.3 :7.4 that the local linear method gives very decent forecasts for both data sets. 108 I* E o C to P Time (0.0=7903.1.0=88Q1) S co a < S S Time (0.0=Dec 81,1.0=Dec 86) Figure 7.2: Forecasts by the local linear method and by Poisson regression using the most recent data. A m o n g three sets of forecasts by Poisson regression, the set w i t h the smallest ASFE is plotted. 109 Chapter 8 Concluding remarks In this thesis, the local linear forecasting estimator is proposed as an alternative to either backcalculation or parametric regression i n the context of forecasting for A I D S data. The asymptotic theory of this estimator is developed and applied to the automatic implementation of this method, i.e., the data driven estimation of an opt imal bandwidth for forecasting. For the estimation of an opt imal bandwidth, two plug- in procedures and two cross-validation procedures are investigated theoretically and by simulations. S im- ulations clearly show the advantage of FCV over the two plug-in procedures, CAMSE and HAMSE, and the other cross-validation procedure, BCV. The simulation study of Chapter 6 shows that the local linear method provides better forecasts than the back- calculation approach and that backcalculation can not accomplish the two tasks, that is, to provide a reasonable estimate of the H I V infection curve and to provide forecasts of A I D S incidence. Analyses of the Canadian A I D S data and the U K A I D S data show that the local linear method forecasts well . In addit ion, these analyses expose the practical problem of choosing an appropriate size of a t ime span of most recent data for use i n parametric regression. The theoretical results for the local linear forecasting estimator are derived based on 110 the assumption of independent data. This assumption is not entirely realistic and the statistical properties of the local linear forecasting estimator using correlated data and the estimation of an opt imal bandwidth w i l l be interesting topics for future research. In the literature of A I D S research, many authors (e.g., [1], [20] and [23]) assumed the numbers of A I D S cases to be Poisson. For example, Bacchetti [1] assumed that the numbers of people infected w i t h H I V follow a nonhomogeneous Poisson process. Under appropriate assumptions, the process of numbers of A I D S cases can be shown to be Poisson. It would be interesting to simulate the H I V infection process and the incu- bation distribution to get simulated A I D S counts and then to analyse the simulated A I D S counts by the backcalculation approach and the local linear forecasting estimator. Another interesting extension of the local linear forecasting estimator is to include co- variates i n the regression, for example, sexual practices, age and gender, and to estimate the expected numbers of new A I D S cases among a group w i t h specific covariate values. Moreover, construction of point-wise confidence intervals of the forecasts by the local linear method is an interesting problem. I l l Bibliography [1] Bacchett i , P . , Segal, M . R . , and Jewell, N . P . (1993), "Backcalculat ion of H I V Infec- t ion Rates," Statistical Science 8, 82-119. [2] Bacchett i , P . , Segal, M . R . , Hessol, N . A . , and Jewell , N . P . (1993), "Different A I D S Incubation Periods and Their Impacts on Reconstructing H u m a n Immunodeficiency V i r u s Epidemics and Project ing A I D S Incidence," Proc. Natl. Acad. Sci. USA 90, 2194-2196. [3] Brookmeyer, R . (1991), "Reconstruction and Future Trends of the A I D S Epidemic i n the Uni t ed States," Science 253, 37-42. [4] Brookmeyer, R . , and Q u i n n , T . C . (1995), "Est imat ion of Current H u m a n Immun- odeficiency V i r u s Incidence Rates from a Cross-sectional Survey Using Ear ly Diag- nostic Tests," American Journal of Epidemiology 141 , 166-172. [5] Brookmeyer, R . , Q u i n n , T . C , Shepherd, M . , et a l (1995), "The A I D S Epidemic i n India: A New Method for Est imat ing Current H u m a n Immunodeficiency V i r u s ( H I V ) Incidence Rates," American Journal of Epidemiology 142, 709-713. [6] Cohen, P . T . , Sande, M . A . , and Bolberding, P . A . (Editors) , The AIDS Knowledge Base (2nd E d , 1994), L i t t l e , Brown and Company. 112 [7] Fan, J . , and Gijbels, I. (1994), "Data-driven Bandwidth Selection i n Local Polyno- m i a l F i t t i n g : Variable Bandwidth and Spatial Adapta t ion , " JRSSB, to appear. [8] Fan, J ianqing (1992), "Design-adaptive Nonparametric Regression," JASA 87, 998- 1004. [9] G a i l , M . H . , and Brookmeyer, R . (1988), "Methods for Project ing Course of A c - quired Immunodeficiency Syndrome Epidemic (Review)," Journal of the National Cancer Institute 80, 900-911. [10] G a i l , M . H . , and Rosenberg, P.S. (1991), "Perspectives on Using Backcalculation to Est imate H I V Prevalence and Project A I D S Incidence" manuscript. [11] Gasser, T . , Sroka, L . , and Jennen-Steinmetz, C . (1986), "Residual Variance and Residual Pattern i n Nonlinear Regression, " Biometrika 73, 625-633. [12] Gasser, T . , K n e i p , A . and K5hler , W . (1991), " A Flexible and Fast Method for A u t o m a t i c Smoothing," JASA 86, 643-652. [13] H a l l , P . , Sheather, S.J. , Jones, M . C . and M a r r o n , J .S. (1991), " O n O p t i m a l Data- based B a n d w i d t h Selection i n Kerna l Density Es t imat ion , " Biometrika 78, 263-271. [14] Hardle , W . , and M a r r o n , J .S. (1985), " O p t i m a l Bandwidth Selection i n Nonpara- metric Regression Function Est imat ion , " The Annals of Statistics 13, 1465-1481. [15] Hardle , W . (1989), Applied Nonparametric Regression, S I A M Cambridge University Press. [16] Har t , J . D . and Wehrly, T . E . (1986), "Kernel Regression Est imat ion Using Re- peated Measurements Data , " JASA 81, 1080-1088. 113 [17] Har t , J . D . (1994), "Automated Kernel Smoothing of Dependent D a t a by Using T i m e Series Cross-validation," Journal of the Royal Statistical Society, Ser. B . 56, 529-542. [18] Har t , J . D . and Y i , S. (1996), "One-sided Cross-validation," manuscript. [19] Hastie, T . and Loader, C . (1993), "Loca l regression: Automat ic Carpentry," Statist. Sci. 8, 120-143. [20] Healy, M . J . R . , and Ti l l e t t , H . E . (1988), "Short-term Extrapolat ion of the A I D S Epidemic , " JRSSA 151 , 50-61. [21] Lehmann, E . L. ' (1975) , Nonparametrics: Statistical Methods Based on Ranks, Holden Day. [22] L i n z , Peter (1985), Analytical and Numerical Methods for Volterra Equations, SI A M Phi ladelphia . [23] M a r i o n , S. A . , and Schechter M . T . ( D e c , 1990), "Es t imat ion of the Number of Persons i n Canada Infected wi th H u m a n Immunodeficiency V i r u s , " A report com- missioned by the Minister of National Health and Welfare. [24] M a r i o n , S. A . , and Schechter M . T . (Mar . , 1992), " H u m a n Immunodeficiency Virus Infection i n Canada: A n Updated Analysis Using Backcalculat ion," A report com- missioned by the Minister of National Health and Welfare. [25] Rice, J . (1984), " B a n d w i d t h Choice for Nonparametric Regression," The Annals of Statistics 12, 1215-1230. [26] Ruppert , D . , Sheather, S.J. , and W a n d , M . R (1993), " A n Effective B a n d w i d t h Selector for Local Least Squares Regression," manuscript. 114 [27] Stone, C . J . (1982), " O p t i m a l Global Rates of Convergence for Nonparametric Regression," The Annals of Statistics 10, 1040-1053. [28] Wahba, G . (1977), "Pract ical Approximate Solutions to Linear Operator Equations W h e n the Data A r e Noisy," SIAM J. Numer. Anal. V o l . 14, 651-667. [29] Wahba, G . (1979), "Smoothing and 111 posed Problems," Solution Methods for Integral Equations with Applications, Michael Golberg (Editor) , P l e n u m Press, 183- 194. [30] Wahba, G . (1982), "Constrained Regularization for 111 Posed Linear Operator Equa- tions, w i t h Appl ica t ion i n Meteorology and Medic ine , " Statistical Decision Theory and Related Topics / / / V o l . 2, Shanti S. G u p t a and J . 0 . Berger (Editors) , 383-418. [31] Wahba, G . (1990), Spline Models for Observational Data, S I A M . 115
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Local linear regression versus backcalculation in forecasting
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Local linear regression versus backcalculation in forecasting Li, Xiaochun 1996
pdf
Page Metadata
Item Metadata
Title | Local linear regression versus backcalculation in forecasting |
Creator |
Li, Xiaochun |
Date | 1996 |
Date Created | 2009-03-17 |
Date Issued | 2009-03-17 |
Description | The local linear forecasting estimator is proposed in this thesis as an alternative technique to either parametric regression or the backcalculation approach in the context of forecasting for independent data. The asymptotic bias and variance of the local linear forecasting estimator are derived and used to develop procedures for the estimation of the optimal bandwidth for forecasting. Both the theoretical and the computational aspects of these procedures are explored. Simulation study shows that a cross-validation procedure has the best performance in forecasting among four bandwidth estimation procedures under study. Simulations and statistical analyses show that the backcalculation approach is very vulnerable to violations of the assumptions underlying this approach and that its application to AIDS data fails to achieve its two primary goals, to forecast the numbers of new AIDS cases and to estimate the historical HIV infection curve. To test the proposed forecasting estimator over parametric regression, both techniques are applied to the Canadian AIDS data and the UK AIDS data. The results of the two examples expose the weakness of parametric regression and show that the proposed technique does better than parametric regression in forecasting. |
Extent | 4410988 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | Eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-03-17 |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0087838 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 1996-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/6155 |
Aggregated Source Repository | DSpace |
Download
- Media
- ubc_1996-147800.pdf [ 4.21MB ]
- Metadata
- JSON: 1.0087838.json
- JSON-LD: 1.0087838+ld.json
- RDF/XML (Pretty): 1.0087838.xml
- RDF/JSON: 1.0087838+rdf.json
- Turtle: 1.0087838+rdf-turtle.txt
- N-Triples: 1.0087838+rdf-ntriples.txt
- Original Record: 1.0087838 +original-record.json
- Full Text
- 1.0087838.txt
- Citation
- 1.0087838.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 13 | 9 |
United States | 6 | 1 |
France | 2 | 0 |
Unknown | 2 | 0 |
Japan | 1 | 0 |
Canada | 1 | 0 |
City | Views | Downloads |
---|---|---|
Beijing | 13 | 0 |
Unknown | 3 | 4 |
Sunnyvale | 2 | 0 |
Ashburn | 2 | 0 |
Tokyo | 1 | 0 |
Ottawa | 1 | 0 |
Redwood City | 1 | 0 |
Matawan | 1 | 0 |
Paris | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0087838/manifest