L O C A L LINEAR REGRESSION VERSUS B A C K C A L C U L A T I O N IN F O R E C A S T I N G by Xiaochun L i B . S c , Tsinghua University, Beijing, China, 1986 M . S c , Tsinghua University, Beijing, China, 1988 M . S c , University of Saskatchewan, 1991 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF D O C T O R OF P H I L O S O P H Y in T H E F A C U L T Y O F G R A D U A T E STUDIES Department of Statistics We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y OF B R I T I S H C O L U M B I A August 1996 ©Xiaochun L i , 1996 In presenting degree at the this thesis in University of partial fulfilment of the requirements British Columbia, I agree that the for an advanced Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for department or by his or scholarly purposes may be granted her representatives. It publication of this thesis for financial gain shall not permission. Department of StcffiS'TiC S The University of British Columbia Vancouver, Canada Date DE-6 (2/88) Olf % is by the understood that head of copying my or be allowed without my written Abstract The local linear forecasting estimator is proposed in this thesis as an alternative technique to either parametric regression or the backcalculation approach in the context of forecasting for independent data. The asymptotic bias and variance of the local linear forecasting estimator are derived and used to develop procedures for the estimation of the optimal bandwidth for forecasting. Both the theoretical and the computational aspects of these procedures are explored. Simulation study shows that a cross-validation procedure has the best performance in forecasting among four bandwidth estimation procedures under study. Simulations and statistical analyses show that the backcalculation approach is very vulnerable to violations of the assumptions underlying this approach and that its application to AIDS data fails to achieve its two primary goals, to forecast the numbers of new AIDS cases and to estimate the historical H I V infection curve. To test the proposed forecasting estimator over parametric regression, both techniques are applied to the Canadian AIDS data and the U K AIDS data. The results of the two examples expose the weakness of parametric regression and show that the proposed technique does better than parametric regression in forecasting. ii Contents Abstract ii Table of Contents iii List of Tables vi List of Figures viii Acknowledgements x 1 Introduction 1 1.1 T h e linear operator approach 4 1.2 T h e regression approach 7 1.2.1 Parametric extrapolation 7 1.2.2 A local linear forecasting estimator 9 2 3 The linear operator approach 12 2.1 Some results on reproducing kernel H i l b e r t spaces 14 2.2 A p p l i c a t i o n to the A I D S data: backcalculation 16 Asymptotics for the local linear regression estimator iii 22 4 The choice of a smoothing parameter 39 4.1 The plug-in approach 40 4.1.1 Introduction 41 4.1.2 Plug-in approach using complete asymptotics for forecasting (CAMSE) 4.1.3 The plug-in approach using the finite sample variance for forecasting (HAMSE) 4.2 5 6 7 8 43 44 4.1.4 Estimation of m" and <7 - an introduction 45 4.1.5 Estimation of m" and a for forecasting 46 2 2 Cross-validation 49 4.2.1 Background to the non-forecasting setting 49 4.2.2 C V for the linear operator approach 53 4.2.3 C V for the local regression approach in curve estimation 53 4.2.4 C V for forecasting: FCV 54 4.2.5 C V for forecasting, BCV 61 Simulation: local linear forecasting estimator 71 5.1 Motivation and data 71 5.2 Results and comments 73 The local method versus backcalculation 85 6.1 Comments on the methods 85 6.2 A small simulation study 87 6.3 Discussion of the methods and the simulation results 96 Data analysis 100 7.1 100 Data, methods and results Concluding remarks 110 iv Bibliography 112 v List o f Tables 5.1 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m\ with sample size n = 50 and 100 respectively. The minimum of AMSE is displayed i n column (m.s.e) for true value 5.2 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m with sample size n = 50 and 100 2 respectively. The minimum of AMSE is displayed i n column (m.s.e) for true value 5.3 Summary of 100 forecasts (ms) and estimates of optimal bandwidths by local linear regression for m = m with sample size n = 50 and 100 3 respectively. The minimum of AMSE is displayed i n column (m.s.e) for true value 6.1 Summary of forecasts by backcalculation and local linear regression for 100 realizations of the fictitious AIDS data 6.2 Summary of lower bounds of forecasts by backcalculation for 100 realizations of the fictitious AIDS data 7.1 Quarterly numbers of AIDS cases reported within six years of diagnosis in Canada, 79Q4-90Q1 vi 7.2 Monthly numbers of AIDS cases reported to the end of September 1987 in U K . For time periods May 1986 to September 1987, the first number is the reported and the second number in parenthesis is the estimate of number of unreported new AIDS cases 7.3 102 Forecasts (ms) for 8 quarters, 88Q2 to 90Q1, on Canadian AIDS data (79Q4-88Q1) by local linear forecasting estimator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by h . The FCV score and h?J are displayed together in parenthesis. y r 0 pt pt The a estimated by the Rice estimator is 177 106 2 7.4 Forecasts (ms) for 9 months, January 1987 to September 1987, on U K AIDS data (January 1982-December 1986) by local linear forecasting estimator (LL) with FCV, and by Poisson regression (PR) using most recent data respectively. The optimal bandwidth estimated by FCV is rescaled in years and denoted by h J . The FCV score and h J are displayed y y pt pt together in parenthesis. The a estimated by the Rice estimator is 14. 2 . 107 List of Figures 1.1 T h e quarterly number of A I D S cases reported w i t h i n six years of diagnosis f r o m the fourth quarter of 1979 (79Q4) to the first quarter of 1990 (90Q1). One unit i n T i m e is 1/42 w i t h 42 being the total number of quarters from 79Q4 t o 90Q1 inclusive 5.1 3 T h e median of 100 shifted FCV curves (dotted), the m e d i a n of 100 CAMSE curves (dashed) and the true AMSE curve (solid) for m = mi,m,2 and m w i t h n = 50 and A 3 5.2 = 0.1 79 Pairs of CAMSE (dotted) and HAMSE (solid) curves for 9 simulated data sets for m = ra , n = 50 and A 3 5.3 n n = 0.1 80 T h e AMSE curve, the 25th and the 75th quantiles of shifted FCV and CAMSE respectively, for m = ra , n = 100 and A = 0.1 3 5.4 3 n = 0.1 82 T h e AMSE curve (solid) and the mean of 100 shifted FCV curves (dotted) for m = m w i t h A 3 6.1 81 T h e median of 100 shifted BCV curves (dotted line) and AMSE curve (solid line) for m — m , n = 50 and A 5.5 n = 0.1,0.2 and n = 50,100 respectively n 84 T h e H I V infection function 7,1(s) = —200(s+l)(5—2) and the incubation function F , F(t) — 1 — e ' used i n the generation of the fictitious A I D S - data 89 viii 6.2 T h e sensitivity of I to the assumption of F i n the fictitious A I D S data (<r = 0). True F(t) l-e-'/°- = 1 - e~\ long F(t) = 1 - e"*' - and short F(t) 1 1 = 93 9 A 6.3 T h e sensitivity of / to the presence of the errors (er = 0.5) i n the A I D S data when the true F is used. fictitious T h e first plot shows that J is recovered from the error-free A I D S data while the rest show that I is highly distorted by the presence of errors i n the A I D S incidence data. T h e solid curve (-) i n each plot is the true / and the dotted line (...) the backcalculated J 94 6.4 T h e variability of backcalculated / 95 7.1 C a n a d i a n A I D S data and U K A I D S data: forecasting beyond t = 1. . . . 104 7.2 Forecasts by the local linear method and by Poisson regression using the most recent data. A m o n g three sets of forecasts by Poisson regression, the set w i t h the smallest ASFE is plotted ix 109 Acknowledgement s I would like to thank m y thesis supervisor Nancy H e c k m a n for her guidance i n the development of m y thesis, for her excellent advice and support, and for being an influential role m o d e l throughout m y P h . D . program. I a m also very grateful to the members of m y thesis committee, Jean Meloche, Steve M a r i o n , and i n particular, John Petkau for their careful reading of the manuscript and valuable comments. In addition, I a m very much indebted to J i m Ramsay for m a k i n g m y stay at M c G i l l possible and for his inspiration i n statistics. I would also like to acknowledge the help f r o m P i n g Y a n f r o m the Laboratory Centre for Disease C o n t r o l , who provided me w i t h the C a n a d i a n A I D S data, a n d the friendship a n d support of fellow graduate students and staff members i n the Department of Statistics. F i n a l l y , I w o u l d like to thank the U n i v e r s i t y and the Department of Statistics for the indispensible financial support I received throughout m y P h . D program. Chapter 1 Introduction Suppose bivariate data {(t;, Yi), i = 1 , . . . , n] are observed at times 0 < ti < t < ... < 2 't < 1. Here the f;s are i n [0,1], which may be normalized physical times. Thus any n time beyond 1 is the "future". W h a t can be said about the future path of this stream of data by studying the information available up to the present? Specifically, what is the "expected" value of Y at some t > 1? Consider the A I D S example. A new disease, acquired immunodeficiency syndrome ( A I D S ) , was defined by the Center for Disease C o n t r o l i n the U n i t e d States i n 1982. A I D S is believed to be caused by the human immunodeficiency virus, or H I V . T h i s disease spread rapidly i n the U n i t e d States: reported cases of A I D S increased f r o m 295 i n 1981 to nearly 42,000 i n 1991 w i t h reported deaths increasing f r o m 126 to more than 30,000 i n the same period. A I D S is now one of the major causes of death i n the U n i t e d States [6]. A I D S has spread and caused serious concern around the world. E v e n though the current data suggest a slowing down i n the growth of the A I D S epidemic i n the U n i t e d States, Northern Europe, Canada and A u s t r a l i a , the spread of H I V infection is rampant i n A f r i c a . A I D S i n South A m e r i c a and A s i a seems to be at the onset of explosive spread [6]. 1 Uncertainty about A I D S has generated research involving epidemiologists and statisticians about the relationship between H I V and A I D S , the transmission of the disease and the future course of A I D S incidence. Mechanisms for monitoring the epidemic have been set up around the world. In the U n i t e d States, the number of new A I D S cases each m o n t h has been available from the C D C (Centers for Disease Control) since 1982 (earlier d a t a f r o m 1975 and prior to 1982 being combined to assure confidentiality [1]) while i n C a n a d a quarterly numbers have been recorded by the F C A (the Federal Centre for A I D S ) since 1979 [23] and the L C D C (the Laboratory Centre for Disease C o n t r o l ) . Figure 1.1 depicts the quarterly numbers of A I D S cases reported w i t h i n six years of diagnosis i n C a n a d a . Those data are considered to be complete i n reporting [23]. T h i s thesis w i l l focus on the problem of forecasting w i t h complete or adjusted A I D S data, since adjusting A I D S data for under-reporting and reporting-delays would pose a separate research problem on its own. Note that the data shown here are not the cumulated numbers of A I D S cases. E a c h data point represents the number of new A I D S cases i n the corresponding quarter. In this case, i ; corresponds to the end of the i t h quarter and Y{ observed at t,-, the number of new A I D S cases i n the zth quarter (£,•_!,£,•]. One goal i n A I D S research is forecasting the number of A I D S cases that w i l l occur i n future time periods. T h i s number is important to health organizations and governments for estimating the future demand i n health care and allocating funds accordingly. To summarize data and make forecasts based on observed data of the f o r m {(ti,Yi),i 1 , . . . , n } , various approaches exist i n different theoretical frameworks: 1. the linear operator approach; 2. the regression approach; 3. the time series approach. 2 = Canadian Data o o CO o o C\J o o O H 0.0 i 1 1 0.2 0.4 0.6 ; r~ 0.8 1.0 Time (0.0=79Q3, 1.0=90Q1) Figure 1.1: T h e quarterly number of A I D S cases reported w i t h i n six years of diagnosis f r o m the fourth quarter of 1979 (79Q4) to the first quarter of 1990 (90Q1). One unit i n T i m e is 1/42 w i t h 42 being the total number of quarters f r o m 79Q4 to 90Q1 inclusive. 3 T h i s thesis focuses on the first two approaches, its m a i n goal being the development of methodology for regression. Its application to A I D S data w i l l be used as an example. 1.1 The linear operator approach Assume the i,s are i n [0,1]. In the special case of equally spaced data, U = i/n, t = 1 n being the "present" time at which the last observation was made. Suppose (t, Y) a certain probabilistic structure and E{g(Yi)\ti} = has m ( t , ) , g being a transformation function and m some smooth function. For a fixed design where the t,s are not random, E{g{Yi)\ti] means E{g(Yi)} w i t h Yi observed at t = t{. T h e linear operator approach assumes that data can be expressed i n a functional f o r m , E{g(Yi)\ti} = Li(I) = where m is a smooth function and Li is a known linear operator on a certain function space where the unknown I resides. In this thesis, the case where X,- is the evaluation operator, i.e., L{I = I(ti), is considered as a problem i n regression (approach 2) i n which no additional information other than that i n the data is available. T h e function I w i l l have a specific meaning i n applications. In the A I D S example, I is the rate of H I V infection and the YiS are the numbers of A I D S cases. T h e linear operator L,- for the A I D S data w i l l be described later. T h e linear operator approach summarizes the data and enables forecasting by estimating I first. T h e n the operator Lj is applied to this estimated I to give an estimate of m (£,•). T h i s approach has the potential benefit of embedding prior knowledge of the structure of the data into the linear operator Li. T h e technique called backcalculation [3] which exemplifies this approach is illustrated below through its application to the A I D S data. T h e prevalent theory i n A I D S research postulates that an A I D S patient was infected w i t h H I V at a certain time s i n the past which took some time to develop to the advanced stage of H I V infection and then to the time point of A I D S diagnosis, t. T h e 4 length of time between the point of H I V infection at s and the point of A I D S diagnosis at t is called the incubation t i m e , whose distribution can be modeled b y a time dependent distribution function. Estimates of F have been found f r o m cohort studies. So F is usually assumed known. Based on the assumed relationship between H I V infection a n d A I D S diagnosis, the expected number of A I D S cases before t can be calculated b y the f o r m u l a , a(t) = i£(number of A I D S cases diagnosed before t i m e t) = f* I(s)F(t-s\s)ds, Jo (1.1) where I(s) is the expected number of new H I V infections at t i m e instant s, s > 0. /(•) is unknown a n d to be estimated. T h e convolution formula (1.1) uses an integral operator to link the H I V infections and the A I D S diagnoses. F r o m (1.1), for equally spaced i,s m(ti) where F{u\s) = E{Yi) = ^ ( n u m b e r of A I D S cases i n (t,_i,t,]) = a(t ) — a(2,_i) = I*' I(s)F(U Jo = £ EE Li{T), t - s\s)ds - f*" 1(s)F(t _ Jo 1 i 1 - s\s)ds (1.2) I(s) (F(U - s\s) - F(ti - ^ - s\s)) ds (1.3) = 0, if u < s. T h e linear operators, i.e., the L s so defined, model the t dynamics of the two processes: H I V infection and the A I D S diagnosis [3]. In general, define m(t) = j* I(s) (F(t - s\s) - F(t - i - s\s)^j ds. (1.4) Thus to forecast the number of A I D S cases i n a future quarter, say (1 + A „ —1/n, 1 + A „ ] , one must estimate m(t) at t = 1 + A . n 5 T h e first step to forecasting is to estimate the H I V infection curve I f r o m the Y;s, the numbers of A I D S cases. T h i s is done by confining /(•) to a n appropriate function space Ti and m i n i m i z i n g a fitting criterion, e.g., /(•) = argmin l€>l J2(Yi - m(ti)) 2 + A /* (1.5) I"\t)dt T h e parameter A i n (1.5) is the smoothing parameter which can be chosen b y crossvalidation. M o r e detailed discussion of cross-validation is provided i n Chapter 2 a n d Chapter 4. T h e above procedure for estimating I f r o m the Y{S is the so-called backcalculation [3]. I n an appropriately chosen 7i, the minimizer I can be w r i t t e n as a linear combination of l,t and the Riesz representors of L{S [31]. For details, see Chapter 2. A f t e r I is obtained, m ( l + A ) can be predicted by using I i n f o r m u l a (1.4). However, n for s close to 1, I(s) is usually not reliable. T h e information about I(s) is contained i n the Ys observed after time s. Thus the data contain no information about I(s) for 5 > 1. Further, for s < 1, the closer a time point 5 is to 1 the less reliable the backcalculated I(s) is because there are fewer available data. A conservative approach is to be less ambitious and get a lower bound for m ( l + A ) instead. A useful lower bound can be n found if the disease has a long incubation. For a short t e r m forecast, a great proportion of new H I V infections arising i n time period (1,1 + A ] are still latent and thus w i l l n not contribute to the number of new A I D S cases occurring i n (1 + A n — 1/n, 1 + A ] , n Therefore, the percentage of new A I D S cases contributed b y the H I V infections that occur i n (1,1 + A„] may be negligible. F r o m (1.4), we have m(l + A ) > £ n I(s) (F(l + An - s\s) - F(l + A n - i - s|s)) which i n effect takes I to be zero over ( 1 , 1 + A„]. A n estimate of LB n ds = LBn, (1.6) can be obtained by plugging I into the above formula for LB , n LB n =£ i(s) ( ^ ( 1 + A „ - s\s) - F(l + An - ^ - s\sf) ds. 6 (1.7) M o r e examples of the linear operator approach i n science and engineering can be found i n [29] and [30]. 1.2 The regression approach T h e regression approach assumes an unknown underlying regression function m for the data or the transformed data, i.e., E{g(Y)\U} — m{t ). It w i l l be assumed hereafter that t the data are properly transformed and E(Yi\t{) = m(t,) i n the regression approach. T h e regression approach uses the data {Yi}™ to estimate ra directly. T h i s general approach includes parametric regression i n which m depends on a global parametric f o r m and local regression i n which m is only assumed to be smooth. T h e next two sections contain details of b o t h types of regression i n the context of forecasting. 1.2.1 Parametric extrapolation T h i s approach assumes that the trend of the data continues over a period of time i n a postulated parametric f o r m , for example, log(E(Yi)) = /So + / M i - O n e fits this parametric m o d e l to the entire data set and then uses this fit to extrapolate to obtain the forecast. In 1986 the P u b l i c H e a l t h Service i n the U n i t e d States gave a forecast of 270,000 for the cumulative number of A I D S cases i n the U n i t e d States by the end of 1991 by extrapolating a p o l y n o m i a l model. T h e actual number of reported A I D S cases turned out to be 206,000 [6]. A s noted by a few authors ([9],[20]), parametric extrapolation can provide useful short-term forecasts. However, there are a few criticisms of this approach: 1. Parametric extrapolation does not make use of any available information on the progression of the H I V infections to A I D S disease and thus m a y be less efficient 7 than backcalculation. 2. T h e parametric assumption is not verifiable and therefore is a considerable source of uncertainty. Thus parametric extrapolation is unable to give a long-term forecast. A s for the first c r i t i c i s m , i t is not necessarily the case that a m e t h o d that uses less data is less efficient than a method that uses more data. T h e discussion i n Chapter 6 w i l l show that the local linear forecasting estimator that uses only K)}™, suggests better forecasts than backcalculation that uses {(i,-,!^)}" and additional data. A s for the second c r i t i c i s m , so far as forecasting is concerned, future predictions of m e d i u m to long t e r m pose a challenge for statisticians. If a parametric model is "correct" i t w i l l be able to give a long-term forecast. However, no models are "correct" since they are simplified representations of the real world. W h e n a set of data is fitted to a specific parametric f o r m , the fitted curve may be compromised to fit a l l the data well at the expense of local structure i n the data. Especially of interest is the local structure near the present, t = 1. Forecasts based on the extrapolation f r o m such a fitted curve are unlikely to be as good as forecasts based an extrapolation f r o m a fitted curve of more recent data. A s Healy and T i l l e t [20] write, "It doesn't seem entirely reasonable to give the early data . . . a high degree of influence i n forecasting the future." W h e n fitting the data to a log-linear m o d e l , they superimposed subjectively-chosen weights decreasing into the past. T h i s method should be expected to be an improvement over simple parametric fitting when "appropriate" weights are chosen. It might be desirable to do away w i t h the parametric assumption and also to couple the fitting w i t h a data-driven method for the choice of weights. T h e local regression method proposed i n the next section does exactly this. 8 1.2.2 A local linear forecasting estimator In general, to estimate a function m at a certain point t, the approach of local linear regression assumes that m has a linear structure i n a neighbourhood of t: m(x) = Po + Pi(x — t), and estimates flo, /?i b y weighted regression. T h e weights assigned to the observations close to t are much bigger than the weights assigned t o the rest of observations. T h a t is, let the estimator of m(t) be defined as follows: m^t) = fa+ fo(t-t) = 0 , (1.8) Oit where AM \ - A> ~ p\{U ~ t)) K(\-^-), = argminJ2(Yi (1.9) 2 J w i t h K(-) a kernel function that gives more weight to the observations close to t (defined by h ) than the rest of the observations, and h the so-called b a n d w i d t h that determines n n the size of the neighbourhood of t. For example, if the kernel function is the indicator function over [—1,1], then K((U — t)/h ) n = 1 for those t,s w i t h \tj — t\ < h n and 0 otherwise. Since rhh (t) is calculated f r o m the Y{S, i t is a r a n d o m variable w i t h statistical n properties which are affected b y the b a n d w i d t h h and the kernel K. In practice the n choice of K has little effect on rhh {t) while the choice of h has a large effect. Therefore n n we assume K is specified i n Chapter 4 and describe data driven choice of h . n For now assume h is given. Unless m is a straight line, rhh (t) is i n general a biased n n estimator of m(t), that is, Bias{rhh (t)} n = E{m (t)\ti,... hn ,t } — m(t) ^ 0. Note that n if the kernel function is the indicator function over [—1,1], the b a n d w i d t h h n reflects the number of Y{S used i n the regression. A large h means that a large number of T^-s n is used i n the regression and a small h means otherwise. Therefore a large b a n d w i d t h n 9 causes rhh (t) t o have a small variance and a small b a n d w i d t h causes rhh (t) t o have a n n large variance. For commonly used kernels, when h is s m a l l , rhh (t) w i l l have a small n n bias but a large variance; when h is large, rhh (t) w i l l have a large bias but a small n variance. If h n n is small enough, rhh {t) w i l l behave like a n interpolant while i f h is n n large enough, m / ( t ) w i l l become the familiar least squares fit. A balance between the ln bias and the variance is desirable and this balance is usually achieved b y m i n i m i z i n g the mean squared error of rhh {i), i.e., E{(rhh (t) — m(t)) \t\,... ,t }. 2 n n To predict m a t 1 + A , let t = 1 + A n n n i n (1.9). Therefore, rh,hn{l + A ) = /3 ,I+A„n 0 To avoid cumbersome calculations of the asymptotics later i n Chapter 3, an algebraic substitution is employed to re-center the data at 1 instead of 1 + A „ . Note that Po,l+A„ 1+A„ V / n argminj^iY. - ft - ft ft - 1 - A )) K( n 2 ~ ][~ U ") A 2 - (ft - argminf2(Yi ftA ) - p\{U - l)) K( 2 n U \ "). A (1.10) fir, Let ft* = ft-ftA , ft* n = ft, K-fi ±) r = K{ h ti \ - l ) ) ). (1.11) T C , (1.12) An r Then argminitiYi \ «, / and m ^ „ ( l + A ) = ft n x + - ft* - ft(t t 2 ft A . fl n Formulation (1.12) w i l l be used hereafter w i t h the superscript * dropped. T h e notation m ^ i ( l + A ) instead of rhh (l + A ) w i l l be used t o denote the forecasting n > n estimator that forecasts A n n n ahead w i t h data centered at t = 1 and w i t h b a n d w i d t h h . n To summarize: 10 Definition 1.1 The estimator ra/i„,i(l + A ) of m(l + A ) is defined as n n m ( l + A„) w where argminf^iYi The bandwidth / i n - A , - &(*,- - l)) tf(^-—^). •n 2 (1.13) can be chosen by plug-in or cross-validation approaches. T h e devel- opment of those approaches for forecasting i n Chapter 4 is the m a i n contribution of this thesis. Results of the statistical properties derived i n Chapter 3 serve as the cornerstone for a l l b a n d w i d t h estimation procedures. In the next chapter, theoretical results for the linear operator approach w i l l be presented a n d applied to the A I D S data. 11 Chapter 2 T h e linear operator approach T h e linear operator approach w i l l be discussed under the theoretical framework of a HUbert space. Standard results w i l l be presented without proof. Proofs and other details can be found i n [31]. Examples of the linear operator approach, details of computation and some asymptotic results can be found i n [28]-[31]. T h e following theorem is central to the linear operator approach. Theorem 2.1 Assume the following conditions: 1. 7i is a HUbert space over the reals, 1Z, with inner product <, > and norm || • ||; 2. For each i G { 1 , . . . ,n}, Li : ri —* TZ, is a bounded linear functional; 3. P : Ti —> Til, is a projection operator with ri\ a subspace ofri; 4- The null space of P is TCo = span{<f>i,<f) }, no where the <j)jS are linearly inde- pendent and n < 0 0 ; 0 5. !F(Y, L\(I),..., L (I)) n : lZ xTZ —> 71 is a real function, where Y = ( 3 / 1 , . . . , y „ ) ' . n n 12 For given X > 0, if the minimizer I of F{Y,L {I),...,L {I)) x (2- ) + X\\P{I)\\ 2 n 1 exists in Ti, it is of the form: no i = \E sh + H &> d (-) c i=i 2 i=i where the djS and CjS are real numbers, 2 = P{r)j) and r]j is the Riesz representor of Lj, that is, Lj(I) =< rjj, I > for all I G Ti. C o r o l l a r y 2.1 Assume the conditions in Theorem 2.1. In addition, assume the matriTij = Li((f)j) and E = ( E , j ) ces T — (Tij) , nXno n X n , E,-j = < > are of full column rank, and that the WjS are positive. Then ^ E « - ^ ) ) X + A||P(/)|| (2.3) 2 has a unique minimizer I in Ti and is as in (2.2) with d = {d ...,d c = (ci,...,c )* = u n o where A = E + n ) t (2.4) = {T A- T)- T A- Y, t x x t 1 i4- [/-r(r'A- r)- r*A- ]y, 1 1 1 (2.5) 1 n\I. Remarks: 1. T h e o r e m 2.1 greatly reduces the complexity of the task of finding the m i n i m i z e r 7 i n Ti (often of infinite dimension) to finding / i n the finite dimensional subspace spanned b y <j>x,..., <j> -,i\-, • • •, £«• Roughly speaking, a parametric m o d e l of span no <f>i,..., (f> no determined by P, and the additional functions, £ 1 , . . . , £ more flexible fit to our n data points. 13 n j allow us a 2. Theorem 2.1 is a very general result i n that T can be a general function. However, we are only interested i n J being —n^logijikelihood), 7 i n which case (2.1) is called a penalized likelihood. W h e n the YJS are independent and n o r m a l l y distributed w i t h mean Lj(I) and variance (2WJ)~ , the first term as i n (2.3). 1 J- is — re /o<jr(normal likelihood) and is _1 We w i l l assume (2.1) to be a penalized likelihood hereafter. T h e number A, known as the smoothing parameter, governs the relative importance of the goodness of fit of the data to that of the penalty P. If A is zero, the fit of the data is of the utmost importance and the penalty is ignored, w i t h the result that I is the m a x i m u m likelihood estimator. O n the other h a n d , a large value of A forces I close to the null space of P and results i n an estimate close to a parametric fit. A n y A between these two extremes reflects a compromise between the goodness of fit to the data and the form of I emphasized by the penalty. 3. A n automatic, i.e., data-driven choice of the smoothing parameter A might be desired. Cross-validation is a technique commonly used for the estimation of an o p t i m a l A. Details of this method w i l l be presented i n Chapter 4. T h o u g h the Riesz representation theorem assures the existence of the rjjS, the represen- tors of the LjS, it does not give the analytical forms of the r}jS. T h e theory of reproducing kernel H i l b e r t spaces (r.k.h.s.) makes it possible to calculate those £,-s analytically. T h i s is important i n applications. T h i s chapter w i l l present the relevant results of r.k.h.s. and then apply t h e m to the case of backcalculation. 2.1 Some results on reproducing kernel Hilbert spaces T h e theory of r.k.h.s. is very powerful and very useful i n the linear operator approach i n general. 14 D e f i n i t i o n 2.1 Let H be a Hilbert space of real-valued functions over [0,1], H={I: [0,1] - > f t } . Then 7i is a r.k.h.s. if and only if for all t G [0,1], L: t H^Tl with L (I) = lit) t is a bounded linear functional. D e f i n i t i o n 2.2 IfTi is a r.k.h.s., then by the Riesz representation theorem, for any t, there exists a function K € rt such that t < J , K >= I(t), t for all I eTi. Let K(s, t) = K (s) :. [0,1] x [0,1] -> K. (2.6) t Then K is called the reproducing kernel ofri. A s shown i n the following l e m m a , i n a r.k.h.s. it is very easy to use the reproducing kernel to calculate the Riesz representor of a bounded linear operator. L e m m a 2.1 Assume that L is a bounded linear operator on a r.k.h.s., Ti. Its Riesz representor is n where n(t) = L(K ). t P r o o f : Let n be the Riesz representor of L i n 7i. So for a l l I G rl, L(I) = < / / , / > . In particular, for I = K , L(K ) t =< r),K t t >• B y Definition 2.2, < rj,K t n(t). So n{t) = L(K ). > is also equal to • t A commonly chosen r.k.h.s. is the class of smooth functions w i t h square-integrable r t h derivatives. D e f i n i t i o n 2.3 W ^ l * ] = {/ : [0,1] -> ft : / , 0 1 in [0,1] with JQ I^{u) du 2 ( r - 1 ) are absolutely continuous < oo}. The inner product is defined as: < f,9 >= E/ (0)<7 (0) + t i=o (i) 7 (i) J o / (u) W(«)du. w 5 15 Standard results say that this space is a r.k.h.s. and has a kernel of a known f o r m . Lemma 2.2 W£[0,1] is a r.k.h.s. with the reproducing kernel, *(M) = £ where (s — u) + 2.2 ^ + / o X (j^^\s- y - \ t - u u y - i d u , (2.7) = s — u, if s > u and 0 otherwise. Application to the AIDS data: backcalculation Recall that i n the linear operator approach i n the A I D S example i n Section 1.1, E(Yj), the expected number of new A I D S cases i n the jth an unknown function 7 by a linear operator Lj{I) quarter, is assumed to be linked to Lj, = f' I(s)(F(tj - s\s) - F(tj - - - s\s))ds, Jo n (2.8) where I is the H I V infection function and F(-\s) is the distribution function of the incubation t i m e f r o m H I V infection to A I D S diagnosis at the instant s. F is assumed known so the operator Lj is known. Obviously Lj is a linear operator. In this application, the H I V infection function I w i l l be assumed to be i n H W£[0>1] w i t h r = 2 . T h i s is a very m i n i m a l assumption on I because M^tO?!] a very general class of functions. = 1S N o specific assumptions are made on those func- tions other than smoothness and integrability: 7 and V are absolutely continuous w i t h Jo J"(u) du < co. 1 2 Recovering the historical H I V infection pattern I is an important goal i n A I D S research. It is believed that I is a smooth curve. Therefore rough Is should be penalized. T h e quantity /Q I"(u) du is a measure of the roughness of I so it can be used as the 1 2 penalty t e r m . 16 To put i t statistically, finding a reasonably smooth I i n W ^ O , 1] that fits the observed A I D S incidence data well is achieved b y finding the I that minimizes + \ I 1'XuYdu, JO F{Y,L (I),...,L (I)) 1 (2.9) 1 n where T is the negative of a loglikelihood function divided by n , I € VF^O, 1] and A > 0. T h e smoothing parameter A places a control on the roughness of I relative to the goodness of fit of the Lj(I)s to the YjS. M i n i m i z i n g the penalized likelihood means choosing an I i n W%[0,1] w i t h a tradeoff between the goodness of fit to the data and the roughness of I. T h i s tradeoff is determined by the value of the smoothing parameter A. A n o p t i m a l A for this tradeoff minimizes the prediction error, PE(\) n- Y,{Y;-mx(t )} = 1 2 i i=l where the 5^*s are hypothetical new observations at t,s, rh\(ti) = Li(I\) and 7A minimizes (2.9). T h i s o p t i m a l A can be estimated b y a standard technique called cross-validation. Further details on cross-validation w i l l follow i n Chapter 4. T h e m i n i m i z e r I of (2.9) can be determined by using Theorem 2.1 and the theory of r.k.h.s. i n Section 2.1. Theorem 2.2 If the minimizer I of (2.9) exists in W|[0,1], it takes the form: i(t) = Bo +iht+ (- ) 2 10 3=1 where m = I x / ,Mt) st{s A t) - (a + i ) ^ - (sAt) 3 2 1 + Jo (Fitj-s^-Fitj-^-s^ds, and s At = s if s <t and t otherwise. 17 (2.11) Theorem 2.2 enables one to find the m i n i m i z e r i n a finite dimensional subspace instead of i n W | [ 0 , 1 ] , a space of infinite dimension. T h e succeeding lemmas verify the conditions of Theorem 2.1 for its application to the A I D S data. So the ensemble of the following lemmas serves as the proof of Theorem 2.2. L e m m a 2.2 i n the previous section assures that C o n d i t i o n 1 of Theorem 2.1 is satisfied. [G\ 1] i H i l b e r t space w i t h the inner product sa /(0M0) + / W ( 0 ) + f f"(u) "(u)du, <f,g>= X 9 f , g e w [o, 1], 2 Jo and the n o r m induced by the above inner product, 7(0) 2 + I'(0) + f I"(u) du 1 2 Jo 11/2 2 T h e l e m m a below checks C o n d i t i o n 2 of Theorem 2.1. Lemma 2.3 Suppose that for any s > 0, ^(-l-s) is a distribution function with F(u\s) = 0 if u < 0 and that F(u\s) is continuous in both u and s. For each j G { 1 , . . . , n}, the operator Lj : Lj(I) = j I(s)(F(tj Jo h - s\s) - F{tj - - - s\s))ds, n I G W [0,1] 2 is a bounded linear operator on W|[0,1]. Proof: Lj is well-defined since I(s)(F(tj — s\s) — F(tj — ^ — s\s)) is continuous i n s and thus is integrable. T h e linearity of Lj is obvious. B y the definition of the boundedness of a linear operator, Lj is bounded if there exists C, 0 < C < co, such that (2.12) \Lj(I)\ < C\\I\\, for a l l Ie W [0,1]. 2 18 Since 0 < F(tj - s\s) - F(tj - J - s\s) < 1, it follows that |^(/)| = | < £ I(s)(F(tj - s\s) - F(tj - Jo \I(s)\ds < £ j n \I(s)\ds < - s\s))da\ I(s) ds^ 2 2 . (2.13) To bound /g |/(s)|<£s, first consider I(s) = 1(0) + s/'(0) + f" ( I"{v)dvdu. Jo Jo (2.14) U A p p l y i n g the inequality (a + b + c) < 3 ( a + b + c ) , 2 2 2 2 to (2.14) gives I(s) 2 < 3 ( / ( 0 ) + s I'{0) 2 2 + ( / ' / " I"{v)dvdu) ). 2 (2.15) 2 Jo Jo B y Schwarz's inequality, ( [' f I"(v)dvdu) Jo Jo U 2 which is bounded by < ( [' [ I"(v) dvdu)( [' Jo Jo Jo Jo U f l dvdu), 2 U 2 I"(v) dv for 0 < s < 1. 2 For 0 < s < 1, applying the relationship i n (2.15) yields: I(s) 2 < 3(/(0) + s /'(0) + l Jo < 3 ( / ( 0 ) + 7'(0) + C I"{v) dv) Jo 3||/|| . 2 2 = 2 2 I"{v) dv) 2 X 2 2 (2.16) 2 Using (2.16) i n (2.13) gives \Lj(I)\ < ^ \\I\\1/2 operator on W [0,1]. Therefore Lj is a bounded linear • 2 For C o n d i t i o n 3 of Theorem 2.1, the following l e m m a gives the projection operator P corresponding to the penalty t e r m / I"(u) du and the basis functions (f>jS that span 0 2 X the n u l l space of P . , 19 L e m m a 2.4 Let P : W [0,1] - (2.17) W [0,1] : P(I)(t) = I(t) - 7(0) - l'(0)t. 2 2 2 T/iera P is a projection operator onto H\ with Tii®rio is the null space of P and ||P(i")|| = /o = W|[0,1], where rio — span{l, t} I"{ufdu. 2 T h e next l e m m a uses the results of the r.k.h.s. to calculate the Riesz representor rjj of Lj i n W ^ f O , ! ] and its projected image £j = P(r)j) for Lj as i n (2.8) and P as i n (2.17). L e m m a 2 . 5 In W [0,1], 2 w> - r the projection of the Riesz representor 2 , (sA*) st(s At) — (s + t)^—±x x 2 of Lj is: (sAt) ^—t3 + (2.18) (F(tj - s\s) - F(tj - ^ - s\s))ds. R e m a r k : T h e £jS are not splines (cubic polynomials) as i n the t r i v i a l case when Lj(I) I(tj). One can show that = defined by (2.18) is an increasing function w i t h £ j ( 0 ) = 0, concave i n [0, tj] and a straight line beyond tj. P r o o f : For r = 2, (2.7) gives the reproducing kernel as ~/ x / x / x(sAt) (s At) K(s,t) = l + st + st(s At)-(s + t) ' + ' 2 3 2 K R e c a l l that K (-) = K(-,t). t 3 v n (2.19) B y L e m m a 2.1, the Riesz representor t]j of Lj evaluated at t is: - rjj(t) =< r,j,K t >= Lj(K ) = [*' K (s)(F(t Jo t t - s\s) - F(tj - - - s\s))ds. n } So &(*) = = P(rn)(t) = rjj(t) - nj(0) - r]'j(0)t Lj(K )-Lj(K )-j (Lj(K ))\ t t 0 t t 20 t=0 (2.20) One can show that at Jo n Therefore = f\K {s) t - 1 - st)(F(tj - s\s) - F(tj - - - s\s))ds. • T h e results i n this chapter w i l l be used i n Chapter 6. T h e next chapter w i l l contain the derivation of the statistical properties of the local linear forecasting estimator rhh ,i(l n + A „ ) for the regression approach. 21 Chapter 3 Asymptotics for the local linear regression estimator T h i s chapter w i l l cover the asymptotic properties of m ^ on data n i < ( i + A ) , as defined below based n i^)}™, for i,s random and t,s nonrandom respectively. These asymptotic results w i l l help us understand the local linear forecasting estimator and choose an o p t i m a l b a n d w i d t h for the estimator. T h e results and conditions are different from those i n non-forecasting regression problems (see, e.g. [8]). However, the proofs are similar. In non-forecasting regression problems, e.g., estimating m(t) by local regression for t i n the domain of design points t,s, data on both sides of t are used i n regression. Therefore usually a kernel function K symmetric about 0 is used. Often i n non-forecasting regression problems, K is assumed to be a density function. However, i n forecasting data are centered at the boundary t = 1, so only the left part of the kernel function K is actually used because there are no data beyond t — 1. For this reason, the local linear forecasting estimator is defined by a kernel function K w i t h a negative support. More- 22 over, the local linear forecasting estimator does not require K to be a density function but a general function that satisfies C o n d i t i o n 3.1 below. D e f i n i t i o n 3.1 Let rhh ,t{t') — / 3 n 0i + P\,t{t' — t), where / 3 0t and J3 minimize lt X3i (Y]~ A) P\{ti—t)) K((ti—t)/h ) with K a kernel function having a negative support. - 2 n Note that the data are centered at t i n Definition 3.1. T h e estimator rhh j(t') forecasts n t' — t ahead of t using data prior to t since the kernel function K has a negative support. T h e m a i n application of the asymptotic results for rhh t(t') w i l l be to the case t = 1 nt and t' — t = A n (see (1.13)). However, the more general asymptotic results for t and t' i n a neighbourhood of 1 w i l l be used i n Chapter 4. F r o m Definition 3.1, the asymptotic properties of rhh ,t[t + A ) are completely den = ($ , termined by those of 0 ot t fiijY- Note that the superscript n of a vector or a m a t r i x indicates the action of taking the transpose of a vector or a m a t r i x and should not be confused w i t h the scalar t as either an argument of a function or a subscript of a variable. T h e asymptotic bias and variance of /3 w i l l be presented first. t For the case of random i,s, assume the following conditions: 1. Yi = m(ti) + e,-, where the e,s are independent w i t h mean 0 and variance cr ; 2 2. t{S are i i d w i t h density /(•) and are independent of the e,s; 3. the density function / is known and satisfies: / is bounded away f r o m 0 and / is continuous on [0,1]; 4. K is square integrable on its compact support [—1,0] w i t h u > 0, u u 0 where u,- — ff j let < = f\ Q — u\ > 0, 2 (3.1) u K(u)du\ x tfK^Ydu23 5. m" is continuous over [0,1 + A ] ; n 6. h —* 0 and nh n —> oo. n T h e o r e m 3.1 Let e = (1,1)'. Under the above assumptions, as n E(B \t) _2_ ~hl - m(t) 0tt [ h (E0 \t) n - = o (l)e, p Ui - m'(tj) ) ht UI UQ m"(t) oo, (3.2) U 2 and / / t 0 w . \ Pot nh Var \ I* n 0 /u « uo ^1 /(*) V Ui u 2 0 Ui Ui J uniformly in t G [a , 1] l i m i n f a /h n n o (l)e, P (3.3) U 2 > 1. n Remarks: 1. Note that here " 0 ( i ) = 0 ( ( n / i ) ~ n lim p l i m sup P ( ( n / O C_+oo n^oo 1/2 1/2 ) uniformly for * <E [a , 1]" means that n | 0 ( i ) | > C) = 0, n t g [ o n i l ] which is weaker than lim l i m P( sup {nhfl \e {t)\ > C) = 0. 2 n T h u s , for a fixed sample size, some values of 6 (t) could be far f r o m 0. n 2. It is not necessary to require that K have an one-sided support i f a b a n d w i d t h is given because there are no data for t > 1 and thus any kernel w i l l have a n onesided support automatically. However, some b a n d w i d t h selection procedures (e.g. FCV i n Chapter 4) need this requirement to have certain asymptotic properties (see Chapter 4). 24 T h e proof of Theorem 3.1 w i l l be postponed u n t i l its corollaries are presented a n d proven. T h e results i n Theorem 3.1 make calculations of the asymptotic bias and variance of rhh ,t(t + A ) very easy. T h e corollary below follows easily f r o m T h e o r e m 3.1. n n Corollary 3.1 Assume Conditions 1, 2, 3, 4 and 5. In addition assume 6'. A n = Dn- ^ for some D, H > 0. Let 8 = D/H. and h = Hn- ' 1 1 5 n Let Bias(m , (t + A )\t) hn t = E(m (t n hn<t (3.4) + A )\t) - m(t + A ) . n n As n —• oo, 2 -£jBias(m (t + A )\t) hntt — U1U3 m"(t) ( - n 8 2 + U0U3 — U1U2 )/(u u 0 2 - ul) - 1 = o (l) p (3.5) and nf{t)A Var{m (t n + A )\t) hntt n -1 U Ui Ui u 0 a 8(l,8) 2 \ "1 2 uniformly in t G [a , 1] with l i m i n f a /h n n n "2 / UQ UI Ui u o„(l) (3.6) 2 > 1. Proof of Corollary 3.1: Taylor expansion of m{t + A „ ) at t yields: m{t + A ) n = m(t) + A m'(t) + ^m"(t) + o(A ), 2 n n uniformly for t G [0,1]. So 2 —•jBias(m (t + A )\t) A kntt 2 n r, = -j _2_ [E(m (t hn]t + A )\t) - m(t + A ) ] n n Al 25 (3.7) A 2 , n E(m (t + A )\t) - m(t) - A m'(t) - ^m"(t) + o ( A ) hntt L - m(t) E0 ,\t) 1 o 8 hl 2 { h - m"(t) + o(l) / J ht - m(t) E0 \t) 1 N (E0 \t)-m'(t)) n ( 1 I)A 8 ' 8'hi y 2 n n o>t ^ -m"(t) + o(l). 2 ^ K (E0 \t)-m'(t)) (3.8) ) u B y result (3.2) of Theorem 3.1, as n —> oo this expression converges i n probability uniformly to (ii)-"W = m"(t) ( un U uii U Ui u 0 u\ - 1 I I «9 m"(t) 2 U!U 3 + 6* U UQ UU 3 X 2 0 8 > )/(u u - ul) - 1 2 2 (3.9) So (3.5) is proved. N o w consider the variance nf(t)A Var(m (t n hntt n + = nf(t)A Var0 n + A )\t) Oit P A \t) u / nf(t)A Var { n i o X n 0 fc K r = n/(t)A„(M)Var { / o i 0 ( nh Var < N h r 1 0 \ *> » n \ / ^ 0 hn 1 \ (3.10) t 4 B y result (3.3) of Theorem 3.1, as n —• oo the last expression converges i n probability 26 to: 8{l,S)f{t) /(*) I UI Ui u U Ui Ui u 0 = * 8(1,6) UQ 2 2 ) \ *1 ) U U \ ( *0 "1 2 U 0 UI Ui U U uniformly i n t. 2 U Ui Ui u 0 2 ) V \ *1 2 U \ ) 8 2 • Since / and m" are uniformly continuous on [0,1], Corollary 3.2 follows immediately from Corollary 3.1. Corollary 3.2 Assume the same conditions as in Corollary 3.1. As n —* oo, 2 -Bias(m (t 2 + A )\t) hntt O V m - n P nf(l)A Var(m , (t n W 2 - u i) _ 1 (3.11) J = n UQ Ui Ui U a S(l,8) u + A )\t) hn t ( )/( ° 6 \ X ( 2 2 \( \ X "5 \l U «1 UQ 2 ) U \Ui Ui = o,(l) J U 2 (3.12) \ I 6 uniformly in t G [1 — p , 1] with any sequence p > 0, p —• 0. n n n Remarks: 1. A s is usually done i n regression problems, we consider the conditional bias and variance of our estimates. 2. Corollary 3.2 says that the dominant terms i n both the asymptotic conditional bias a n d the variance of rh (t nntt + A ) are equal to those of m / n t G [1 - / ? „ , ! ] . 27 l n i i ( l + A ) for n 3. For a finite sample, A is given and h will be chosen to minimize the asymptotic n n conditional mean squared error (AMSE) of m ^ n>1 ( l + A ) . Because of Condition 6', choosing h is equivalent to choosing 6 = A /h . Although minimizing the ac- n n n n tual finite sample conditional MSE would be best, its form is far too complicated. 4. Based on the results in Corollary 3.1 one can write out the AMSE of m ^ i ( l - | - A ) ni n as the square of the asymptotic bias plus the asymptotic variance. Condition 6' sets a guideline on the magnitude of A , that is, on how far ahead one can forecast n if the rate n - 4 / is to be achieved for the AMSE of m/ 5 lrJi i(l + A ). Stone [27] has n AMSE shown that under appropriate regularity conditions, the optimal rate of for a nonparametric estimator of a twice differentiable function is n / . If A - 4 5 grows faster (but still slowly enough so that A —> 0), for example, A /h n n a rate slower than n - 4 n —> oo, n / can be achieved under appropriate conditions. This can 5 be seen by the reasoning below. For A —> 0, the bias and variance of m^. i(l + A ) are ni n Bias(rh (l hn<1 Var(m (l hntl + A )\t) = E(m = E (fa n + A )\t) = n = fcBll n ( l + A ) - m ( l + A )|*) n n - m(l) + - m'(l)]|t) + Var(A, i + & A | t ) ll 1 B Var0 \t) + 2A Cov0 J \t) Otl n OA + 1A By the results in Theorem 3.1 for t = 1 and any A > 0 with A . n hntl Var(m {l-[hntl + A )\t) n A )\t) n O (h ) + O (h A ) p 2 n AJh, •n hntl + A )\t) n + n n + O {l/(nh )) (l + O (A /h ) + O (A Jh )) n n 0 (A ), P 2 n 28 p 2 n p ltl 0(Al), + O (A /(nh )) p co, we have Bias{m (l n 2 O (l/(nh )) p If p A Var0 \t). 0 n Bias(m (l O(Al), n n n O (A J(nh ))) p p 2 2 2 n 3 n . Var(m (l + A )\t) hnA To m i n i m i z e the AMSE, i.e., A* oc A J{nh ), 2 n/i£ —• 0 a n d n A p and Var of the same magnitude, oc 1 / n . If A / A 3 n oc 1 / n , the AMSE 3 (equivalently O (A /(nh )), n -> oo, A h 2 p n oc 1 / n implies 3 n n —• oo under the conditions n of rhfc„,i(l + A ) achieves the rate n which is slower than n 3 n n n 2 O (A ) 3 —• oo. A s a result, if A /h —> oo and A / i 3 2 p 2 2 5 n O (A J(nh )). we need to make Bias or A / * 3 n nh ~ n - 4 / since nA h 5 n —* oo. R e c a l l that i n Chapter 1.2.2, the algebraic substitution (1.11) is used to re-center the data. T h i s recentering yields a simple dependence of AMSE on 8, thus m a k i n g it easier to find the o p t i m a l bandwidth by finding the o p t i m a l 8. If the original formulation (1.9) is used w i t h the data centered at 1 + A , then the estimate n of m ( l + A ) is /3O,I+A„- T h e formulae of the asymptotic bias and variance of n A),i+A N w i i i show clearly the advantage of centering data at 1 over centering at 1 + A „ . Under the same assumptions as i n Corollary 3.1 b u t w i t h the additional assumption that D < H, one can show that the asymptotic bias a n d variance satisfy _2_ Bias(m (l h 2 £ ( / W » h + A )\t) hntl n - m"(l + A ) n 2 - m(l + A )\t) n V 0 2 n Vi \ ( X Vn \1 V { 01+An + A )\t) hntl where i>,- = / " f v K(v)dv, fi VlV V QV + A )Var(m (l 2 of * V — Vi (3.13) V\ v AMSE z V{ n p I x - 2 - m"(l + A) * O (l), n 2 VV 0 2 nh f(l v - vv - n V-> \ ( V 0 2 J Vi V \ V-L (3.14) V 2 v* = / ~ f v'K(v) dv. 2 M i n i m i z a t i o n of the resulting is computationally intensive since 8 i n the upper l i m i t of the integrals depends on h . n Therefore if AMSE(h ) n 29 is to be m i n i m i z e d over a grid of values of h , all the integrals have to be computed for each h . n n If K is, say, the indicator function on [—1,0], these integrals can be found i n closed f o r m . If, however, K is a truncated n o r m a l density, then the intregrals must be evaluated numerically. There is another advantage of re-centering the data: after re-centering the data at 1, h n is the "real" b a n d w i d t h used i n forecasting. If the data were instead centered at 1 + A „ , the effective b a n d w i d t h would be h — A n n since there are no data beyond 1. L e m m a 3.1 below is used i n the proof of Theorem 3.1. Lemma 3.1 Suppose that g(-) and /(•) are bounded over [0,1] and W(-) is square integrate over 1Z and that the i , s are iid with density /(•). If nh n = nh —• oo, then as n —» oo, ^ £ W&j^MU) - I£ W(^)g(u)f(u)du = O ((nfc)p 1/a ) (3.15) uniformly for t 6 [0,1]. Proof: Let Z = (nh)- 1 £? =1 W((t - t)/h)g(U). { B y Chebyshev's inequality, Z = E ( Z ) + E(Z) 2 p = \E ( V c ^ M t o ) = Var(Z) O ((Var(Z)fl ). = * = I £ ^y« W(^)g(u)f(u)du, r (V(*I^( )) n h < t l W 2 ^ ^ ) \ C W\^-±)g\u)f(u)dulh. nh Jo n Since g and / are bounded, and / W ((u — t)/h)du/h 0 Var(Z) = X 2 0(\), (3.16) = f^^ \V (s)ds h 2 < co, (3.17) nh 30 • which does not depend on t. P r o o f o f T h e o r e m 3 . 1 : Write -f%- fiiiU - t)) K(^r-) = (Y- 2 Aj3)'W(Y - A/3), (3.18) where 1 Y = t -t x and P A V 1 W = ft/ t -t n diag{K(p-±)). • V T L T h e m i n i m i z e r of (3.18) is: (3.19) p = (A'WAJ^A'^y, t provided that A ' W A is invertible and positive-definite. So for t w i t h A'WA invertible, (3.20) E{P \t) = ( A ' W A ^ A ' W r o t where m = (m(ti),..., m ( t ) ) * , and n - l Var(J3 \t) = <7 (A'WA) A'W A(A'WA) 2 t -1 (3.21) 2 Result (3.28) below shows that, when suitably normalized, A'WA converges i n probab i l i t y to a positive definite m a t r i x . Since we are interested i n convergence i n probability of E{p \t) and Var(P \t), i t suffices to analyze (3.20) and (3.21). t t T h e asymptotic bias w i l l be derived first. T h e Taylor expansion of m at t yields: m"(t) m(ti) = m(t) + m\t){U -t) + —^(U 31 - *) + 2 - *) ) 2 uniformly i n ti w i t h \U — t\ < h . O n l y these t,s are considered because the support of n K is [ - 1 , 0 ] . Therefore, for \U — t\ < h n m(t) ^ ( m (U) = (l,U-t) + + ^Y^-qiK 2 h o( ) 2 n qi m'(t) m(t) / N A,- + hl(m"(t)/2 (3.22) + o(l)) qi where A,- is the ith row of A and ' '---'^^((V) '•••'(V)) =(9i 92 Since Wi = i f ( ( * i — t)/h ) = 0 for n ' l 0 y 0 h j m(i) V o N (K WK)- K Wqh (m"(t)l2 t y 0 h n x t n + o(l)) 2 h j n ( = n W n ' i — t\ > h , so 1 \ 0 ( 2 o AW A] l 0 Zi h n I -l 1 0 y 0 h N 1 A Wq)(m"(t)/2 t J F o r t , i = l,2 1 0 0 /*„ n -1 nh. -A*WA y 0 h N 32 J + o(l)). (3.23) 1 1 (A'WA),,- hJ+t- nh 1 1 2 n nk. — •+j-2 • • •• • ' ( 3 - 2 4 ) which b y L e m m a 3.1, is equal to n Jo \ n n J \ n n Substituting s = (u — t)/h n (l-t)/h„ (l-t)/h r / J n . n yields . K(s)s >- f(t t+ J-t/h t/h„ (3.25) O {{nh y l ). + sh )ds + 2 n n p x 2 n e For t [a„,l] with l i m i n f a /fc n [-t/h , (1 - t)/h ] > 1, n n n D [-1,0] = [-1,0] for n sufficiently large. So for n —> oo and t £ [a„, 1], expression (3.25) is equal to f s i- K(s)f(t i+ 2 n where O {(nh )~ l ) p n holds uniformly i n t G [a„, 1]. T h e last expression converges (not x 2 p (3.26) + sh )ds + O ((nh )-^ ), 2 n only i n probability) to i: ,»"+J-2 K(s)f(t)ds = f(t)ll - i + j 2 uniformly i n t £ [a , 1] because n | /_° s - K(s){f(t i+j < + sh ) - 2 n sup \f(t + sh )-f(t)\ f(t)}ds\ [° n \s i- K(s)\ds i+ 2 te[o„,i] sup |/(r + s/i ) - f(t)\( f \s ^- Usf' { <=rn„.ii J - i <G[a„,l] 2 B 2 f° J - i 2 K(s) ds) 2 1/2 (3.27) > 0. So -l 1 0 1 0 h nh n A ' W A (3.28) /(*) 0 fc Ui n 33 u 2 uniformly i n t G [a , 1], as n —+ oo. n Similarly, for i = 1,2, 1 \ 0 o - i nh. h -A'Wq n Ii = .l-± ( jLzl) (h^l)' i +1 K nh n fr{ \ h J \ n h n J as n —> oo. uniformly i n i G [ a , l ] - T h i s expression converges to / ( i ) u , i uniformly i n £ G [ a , l ] , Thus n / + -l 1 / 0 \ (3.29) A /(*) n K o A'Wq nh n B y (3.23), (3.28) and (3.29), i t follows that I E(B \t) - m{t) _2_ Qtt hi v m"(t) (E0 \t)-m'(tj) h 1<t n ( = m"(t) UQ UI Ui u 2 u \ J /(*) J I \\ UQ UI Ui u /(*) 2 V 2 U 3 / n To calculate the asymptotic variance, first consider, nh Var 1 0 0 h n u 1 = nh o~ n 0 ) J (A W AY A'W A{A'W A) - l l 2 n 0 « 3 J (3.30) uniformly i n t G [ a , 1]. 11 -1 2 X 0 h n 34 h. 1 0 o K - l = (7 1 nh n 0 0 K -1 \ 0 1 -A*W A 2 0 K 1 o 0 h nh, X 0 n h n 1 0 0 K (3.31) B y calculations similar to those i n the proof of (3.28) -l „ - i / -A'W A 0 1 ( N (3.32) 2 0 nh. K 0 h u n x j u 2 U s i n g results (3.28) and (3.32) i n (3.31) gives / 1 1 / (nh )Var 0 O i lB I, 0jt n 0 A B / -1 ( «0 W Ui V U 0 Ui Ui Ui u 0 l /(*) /(*) U /(*) \ "1 Ui / U 0 Ui o (l) + p 2 \ (3.33) + o (l), P /(*) V Ui u J 2 V «1 2 U Ui J U-i uniformly for t € [a , 1]. n Under appropriate conditions, Theorem 3.1 and its corollaries hold for the fixed design where i,-s are pre-specified design points rather than r a n d o m variables. T h e analogous theorem and corollaries for the fixed design are presented below. Theorem 3.2 For the fixed design case, assume the same conditions on m , K, h : 1, 4, n 5 and 6 as in the random design. In addition assume that K' is continuous, that t{ = 35 with 0 < ti < ... < t = 1 and n rU f ' f(t)dt = i/n, Jo f: (3.34) inf / ( t ) > 0 <e[o,i] loii/i / continuous. Then as n —•> oo, _2_ AH E i) E (ffA t , / )j -- r o fm(t) / ( ^ t { h (E0 ) n / fit) m"(t) UQ UI Ui U o(l)e, (3.35) - m'(t)) ) 1<t \ \ 1 0 o K \ AM / _ 1 UQ UI U Ui U Ui uniformly in t £ 2 2 wifJi l i m i n f a /h n n Ui 0 u (3.36) = o(l)e, 2 > 1. T h e condition on the design points £,s i n the theorem above is analogous to C o n d i t i o n 3 for the random design case. T h i s condition serves as a guideline to the selection of the design points i n an experiment. Corollary 3.3 Assume the same conditions as those in Theorem 3.2 except with Condition 6 replaced by Condition 6' of Corollary 3.1. Then as n —• oo, hn *H nu\ ro( + A„))- jBias(rh ,t(t sup / l2 ~ U ( 8* "1^3 + . U 0 U 6 3 - U i U 2 )/(u u 0 2 - u\) - 1^ = o(l), (3.37) and I sup nf(t)A Var(m (t n hntt + A )) - a 6(l, 6) 2 n <6[a„,l] Un 36 U, \ UQ UI Ul U UQ UI Ui U 2 -1 2 o(l). (3.38) C o r o l l a r y 3 . 4 Assume the same conditions as in Corollary 3.3. As n —*• oo, -^Bias(rh (t sup + A„))- hntt te[w„,i] (t \ ~ " i " u 3 , u ^-uiu \ 2 A 2 (3.39) ( nf(l)A Var(m (t sup n hnit \ + A )) - a S(l, 6) 2 n tG[l-Pn,l] UQ V Ui / UI UQ Ui Ui U2 \ / ' \ _ 1 = o ( l ) . (3.40) U 2 T h e proof of Theorem 3.2 for the fixed design case is along the same lines as for the random design case except that the following l e m m a is used instead of L e m m a 3.1 when replacing a s u m b y a n integral. T h e proofs of the corollaries of T h e o r e m 3.2 are the same as those of the corollaries of Theorem 3.1. L e m m a 3 . 2 Assume f,W and g' continuous on [0,1] and f satisfies (3.34)- Then 3 C > 0, such that < sup t€[0,l] R e m a r k : Note that the error t e r m 0((nh )~ ) 2 t e r m 0({nh)~ l ) l 2 l C n h' 2 (3.41) i n L e m m a 3.2 is smaller t h a n the error i n L e m m a 3.1 for h of order n - 1 / . 5 T h e proof of L e m m a 3.2 requires the result of L e m m a 3.3. L e m m a 3 . 3 If i n f [ , i ] f(t) > 0, and if te such that s u p i n 0 \U — -= n < h < ... < t n 2 <e [*' f(u)du> Jti-i 0 \ < C /n. P r o o f : Let C = 1 / i n f [ i ] f(t). 2 0 = i 0) .[*' Jti-i B y the assumption on / , i n f f(u)du = <e[o,i] 37 C - {ti-ti-i). l 2 = 1, then 3 C > 0 2 Thus U — < C /n, Vi,n. 2 P r o o f of L e m m a 3.2: l rK„,u-t f(u)du ,_1 EE " L J (3.42) A + B. Consider the first term i n (3.42). (3.43) by assumption (3.34). wC^)g{u) - W^- -l)g(t ) f(u) < d j i u — ti T h e l e m m a w i l l be proved if \B\ < C/(nh ) for some C > 0. (3.44) 2 Since / , W and g' are continuous over [0,1], there exists C\ > 0, such that Thus 1*1 < < 1 A /•*•• T E / u — ti C 1 CxC nh 2 2 2 c • nh 2 (3.45) T h e results i n this chapter w i l l be used i n the next chapter for the estimation of an o p t i m a l b a n d w i d t h for forecasting. 38 Chapter 4 T h e choice of a smoothing parameter T h e plug-in approach and the cross-validation approach are c o m m o n l y used i n choosing a b a n d w i d t h for a local regression estimator i n the non-forecasting setting. E a c h approach seeks to m i n i m i z e some measure of the discrepancy between the estimated and the true function and tries to estimate the b a n d w i d t h at which the m i n i m u m of such a measure occurs. There is abundant literature on the merits and limitations of b o t h methods. For a comprehensive review, see [7], [12], [13] and [19]. T h e ideas of choosing a b a n d w i d t h for forecasting by both approaches w i l l be developed i n this chapter. T h e applicability of these approaches relies upon the fact that the data are independent. A n y correlation structure i n the data w i l l affect any automatic selection of b a n d w i d t h (see, e.g. [16]). 39 The plug-in approach 4.1 A plug-in b a n d w i d t h is an estimate of an " o p t i m a l b a n d w i d t h " i n a certain sense. T o estimate the function m at t by rhh (t) (defined i n 1.8), an o p t i m a l b a n d w i d t h h (t) n opt may be defined to be the one that minimizes the mean squared error (MSE) of m / ( t ) : ln h (t) = argmin MSE(t, opt h ), hn where MSE(t,h ) = E(rh (t) n - m(t)\t) i n the random design and MSE(t,h ) 2 hn E(rhh (t) — m(t)) 2 n n i n the fixed design. n = Since the idea is the same for both designs, the selection of an o p t i m a l b a n d w i d t h w i l l be described for the r a n d o m design. T h e b a n d w i d t h h (t) is called a local b a n d w i d t h since MSE(t, opt h ) is a criterion of goodness n of fit at a single point t. However, if m is believed to be reasonably smooth, a constant b a n d w i d t h for a l l t w i l l suffice ([7]). In such a case, an o p t i m a l b a n d w i d t h hf m a y be pt defined to be the one that minimizes the sum of the mean squared errors: h^ = argrnin MSE (h ) pt hn where MSE {h ) G n = n' £? G 1 =1 n E ((rh (ti) - m(ti)) \t). Since h° 2 hn pt w i l l be used to esti- mate m(t) for a l l t, i t is called a global b a n d w i d t h . A n o t h e r commonly used criterion for a global b a n d w i d t h is the integrated mean squared error, MISE{h ) n = f Jo = jT E ((rh (t) - m(t)) \t) dt; 1 MSE(t,h )dt n (4.1) 2 hn a global b a n d w i d t h can be obtained b y m i n i m i z i n g MISE(h ). n viewed as a discretized version of MISE(h ) MSEa(h ) n m a y be when the £,s are equally spaced over [0,1]. n Note that a l l three criteria mentioned above involve the unknown function m and the data { ( ^ i , ^ ) } " . So the o p t i m a l b a n d w i d t h depends on m and a 2 and thus has to be estimated. T h e plug-in approach estimates the o p t i m a l b a n d w i d t h by m i n i m i z i n g an estimate of the asymptotic expression for MSE or 40 MISE. T h e criterion MSE(t, h ) w i l l be used hereafter. T h e context of the plug-in approach n that of non-forecasting and some relevant results w i l l be presented first and then the generalization of the plug-in approach to the forecasting setting w i l l be discussed. 4.1.1 Introduction In this section suppose Y{ = m(t,) + e,-, where the e,s are independent w i t h mean 0 and variance cr (£,). 2 W h e n the errors are homoscedastic, o" (i,) = a . 2 1 Results for heteroscedastic errors i n the non-forecasting problem w i l l be presented because they encompass the special case of homoscedastic errors. In particular, a technique by Fan and Gijbels [7] of estimating m " ( l ) i n the case of heteroscedastic errors w i l l be described and used i n forecasting. In general, to estimate wS \t)/v\ = ft, v — 0 , 1 , . . . , a p o l y n o m i a l of degree p (p > v) v is fitted to the data w i t h the following fitting criterion: A = argmin j j Y - £ { and m/, '"'(t) is set to v\$ . n vt &•(*,• - tyfKip^), (4.2) In curve estimation, the kernel K i n (4.2) is usually taken to be symmetric around 0 over either the reals or [—1,1]. T h e performance of is assessed by its MSE: E j (rhhj^(t) i2.2(p+i-</) , ft+iM„ g2 n — m M ( t ) J | t - j . A c c o r d i n g to the general results 2 i n [7], when p - v is o d d , the asymptotic MSE Qi rhh ^\t) (AMSE) (*) of rh ^\t) hn is / + " f(t)nh +^ 4 3 \ V- * a 6 n Here a EE (a ,ai,...,a y 0 6 EE (6o, S S* = (5,-j) with EE (,?.) w i t h (4.4) = diag(S~ S*S~ ), 1 p 1 6 )* = S- (s i,...,s )\ P sij 1 p+ = J u - K(u)du, EE si+j-2 EE , ? . _ + (4.5) 2p+2 i+j A = 2 Ju i- K (u)du. i+ 41 2 2 (4.6) (4.7) T h e first t e r m i n (4.3) m (p+i)(2) (p+l)! 5 = on cr (t). ( P + 1 ) is the square of the asymptotic bias which depends on while the second t e r m is the asymptotic variance which depends For example, when m(t) 2 is estimated by a local linear estimator, i.e., v = 0 and v = 1, the square of the asymptotic bias of rhh (t) is P\b\h (or m"(t) blh /4) n its asymptotic variance is 2 n and n a a (t)/(f(t)nh ). 0 2 F o r m u l a (4.3) shows that if m ( n p+1 ) ( i ) ^ 0, a large b a n d w i d t h h w i l l create a large n bias but a s m a l l variance, and a small b a n d w i d t h w i l l yield a s m a l l bias but a large variance. Requiring b o t h the bias and the variance to be small at the same time is setting conflicting goals for h . n Therefore a tradeoff between the bias and the variance is needed and this is achieved by choosing h n to m i n i m i z e the asymptotic MSE of the estimator. F r o m (4.3), the b a n d w i d t h that minimizes the AMSE _ / (21/ + l)a„a {t) of rhh„^\t) is: \ 5fe 2 ^ -U(p+i-^)Wi«/(oJ ' ( t ) ( ] For example, when m is to be estimated by a local line, i.e., v = 0 and p = 1, this locally o p t i m a l b a n d w i d t h is *-*>- ( ^ ) * - G s ^ ) * " ^ Plugging this o p t i m a l b a n d w i d t h into the formula for the AMSE see that the rate of n - 4 / is achieved for its of rhh (t), n o n e c a n AMSE. 5 Note that formula (4.8) depends on two unknowns, a ( * ) a n d m(r \t) 2 +1 = /3 (p-rl)\. p+1 These unknowns need to be estimated and these estimates are then plugged into formula (4.8) to yield an estimate, h , t(t), u op of the o p t i m a l b a n d w i d t h h (t). v<opt For example, if the curve m (v = 0) is being estimated at t, to get an o p t i m a l b a n d w i d t h one has to estimate m"(t) and <r (i) first. 2 42 4.1.2 Plug-in approach using complete asymptotics for forecasting (CAMSE) W e only consider the case of homoscedastic errors, i.e., o~ (i) = a . 2 2 In forecasting, the criterion that the plug-in approach considers is also the mean squared error: MSE(h ) n = MSE(l + A ,h ) n n = E([m (l + A )-m(l+ hntl (4.10) n A g a i n , only the random design case w i l l be presented since the ideas and the results are the same (under proper conditions) for both designs. Under C o n d i t i o n 6' of Corollary 3.3, for A „ given, the asymptotic MSE(h ) can be n written i n terms of 8 = A /h , n n so the notation AMSE(8) B y Corollary 3.1 or 3.3, the AMSE of mhnA(l AASOVf^ AMSE(8) m"(l) = + 2 ( u\ - — 4 I a8 X U 3 H-j5 2 n/(l)A, UU •(1.*) + UQ U\ UI U is used instead. + A ) is n 0 U 3 - U i U s )/( i ~ i) - UoU Un V 2 2 u UQ « i 1 UI Ui 2 1 U 2 (4.11) If the first t e r m of AMSE(8) is a monotone and non-increasing function of 8, the above expression shows, i n an asymptotic sense, that for a given A , a very large 8 n (corresponding to a very small h ) n w i l l yield estimates w i t h a s m a l l bias but a large variance and that a small 8 (corresponding to a large h ) n w i l l yield estimates w i t h a small variance but a large bias. For a general kernel function K, the relationship between h n and the asymptotic bias is not as obvious as it is i n the non-forecasting setting. In the forecasting setting, a trade-off between the asymptotic bias and the asymptotic variance can be achieved by choosing the h n = A /8 with 8 minimizing n Observe that as 8 —* 0 and +oo, AMSE(8) —* +oo. AMSE(8). So a m i n i m u m of AMSE(8) in (0, +oo) is guaranteed. T h e m i n i m u m can be found by setting the first derivative of 43 AMSE to zero and choosing the positive root which gives the smallest AMSE. Whereas i n curve estimation for the non-forecasting setting an explicit formula (4.8) for the o p t i m a l h exists, no such formula exists i n forecasting. F i n d i n g the o p t i m a l h n n or equivalently 8 for forecasting, calls for solving a seventh degree p o l y n o m i a l equation (the one resulting f r o m setting the first derivative of AMSE(8) alternative method for m i n i m i z i n g AMSE t o zero) for 6. An is by a grid search. Note that like i n curve estimation, estimating the o p t i m a l h i n forecasting requires n the estimation of m" and a . T h i s issue w i l l be discussed i n Section 4.1.4. 2 4.1.3 The plug-in approach using the finite sample variance for forecasting (HAMSE) R e c a l l that the discrepancy measure (4.10), the MSE of the forecasting estimator, is comprised of two parts: a bias component and a variance component. T h e bias component involves the unknown function m but by (3.21) the variance component equals <7 (1,A„)(A*WA)*A*W A(A*WA)(1, A„)*, which does not involve m. T h e estimation 2 2 of the variance parameter a is relatively easier than the estimation of m". Therefore we 2 can directly use the exact variance expression rather than using its asymptotic expression [7]. T h e asymptotic variance describes the variability when n —> oo. Specifically, the following can be used i n lieu of AMSE. Definition 4.1 The asymptotic MSE using the finite sample variance for the forecasting estimator rhh ,\(l + A ) is defined as: n HAMSE(S) n = ^ A ( n < ± ^ + ™ +<T (1, A ) ( A V T A ) A W A ( A WA)(1, A )*. 2 n i t 44 t 2 t n (4.12) Expression (4.12) uses only "half" of the asymptotic results of the bias and the variance of m / ltl)1 ( l + A „ ) i n that HAMSE is defined as the s u m of the asymptotic bias and the finite sample variance of the forecasting estimator, hence the acronym "HAMSE". R e c a l l that W = diag(K((U - l)/h )) = diag(K(8(U - 1 ) / A „ ) ) . T h e o p t i m a l 8, 8 n that minimizes HAMSE(-) is set to be: hnAMSE,o t = P can be found by a grid search a n d the o p t i m a l b a n d w i d t h A /8 . n opt Note that both the plug-in approach using AMSE ing HAMSE , opt a n d the plug-in approach us- are based on the same discrepancy measure, the mean squared error of rhh„,i(l + A ) . T h e difference is that the latter approach uses the exact variance of n ^ / i „ , i ( l + A ) i n the variance component instead of its asymptotic expression. n 4.1.4 E s t i m a t i o n o f m" a n d a — an introduction 2 T h e plug-in approach i n the non-forecasting setting calls for the estimation of the second derivative ra" of the unknown function and the variance function a . 2 W h e n homoscedastic errors are assumed, a can be estimated b y the R i c e estimator 2 [25]. D e f i n i t i o n 4 . 2 The Rice estimator of the variance a is defined as: 2 ^ = E%^f. (4.13) Under certain conditions, the Rice estimator is n a ) —+ 0 i n probability. Other estimators of a 2 2 1/,2 -consistent, which means ^^(a^^ — can be found i n [11]. W h e n heteroscedastic errors are assumed, F a n and Gijbels propose that o~ (t) can 2 be estimated f r o m residuals of the local p o l y n o m i a l regression [7]: 45 where ^ 1 ti-t ... (u-ty^ , W = diag(K((t -t)/h )), j yi t -t n ... {t -ty (4.15) n j n p is the degree of the p o l y n o m i a l and Y = (Yi,..., Y Y = A / 3 , (3 being the usual nonp n t t forecasting estimate as defined i n (4.2). Clearly cr {t) can be used i n the cases of either 2 homoscedastic or heteroscedastic errors. Note that the estimation of cr (t) requires again 2 a choice of a b a n d w i d t h which m a y differ f r o m the b a n d w i d t h for curve estimation. T h e estimation of m" requires a higher order method. R e c a l l that is estimated by locally fitting a p t h degree p o l y n o m i a l w i t h p > v. R u p p e r t and W a n d [26] recommend that p — v should be taken to be o d d to obtain an accurate estimate of m^(t). So m"(t) should be estimated by locally fitting at least a cubic p o l y n o m i a l . A g a i n the estimation of m" requires the choice of an o p t i m a l b a n d w i d t h , which is different from (usually bigger than) the b a n d w i d t h that is used for estimating the function m itself. To summarize the idea of the plug-in approach, to get an o p t i m a l h n for curve estimation one needs to estimate a and m". T h e estimation of a and m" requires two 2 2 additional bandwidths whose o p t i m a l values depend on higher derivatives of m . T o avoid this spiralling argument, one has to come u p w i t h a n i n i t i a l b a n d w i d t h for estimating derivatives and a . Therefore i n the plug-in approach, the problem of choosing a good 2 i n i t i a l b a n d w i d t h has attracted special attention ([26],[12],[7]). 4.1.5 Estimation of m" and a for forecasting 2 T h e discussion w i l l be restricted to the homoscedastic errors case. forecasting estimator is defined as m / fc = argminj^iY x l n i i ( l + A ) = y9 + ^ j A n where n - 8 - frfr - l)) K({ti 0 R e c a l l that the 2 l 46 01 - l)/h ). n T h e objective is to choose an h = A /8 such that AMSE(8) n n or HAMSE(8) achieves a minimum. Since b o t h objective functions involve the unknowns a and m " ( l ) , these need to be 2 estimated first. E x i s t i n g techniques for estimating b o t h i n curve f i t t i n g can be applied directly to forecasting. For estimating a , a 2 Rice suffices because of the assumption of homoscedastic errors. E s t i m a t i o n of m " ( l ) is a m u c h more complicated matter. A s discussed earlier, m " ( l ) can be estimated by fitting a local cubic p o l y n o m i a l around t = 1. L e t v = 2, p = 3 and t = 1 i n (4.2), that is, let & = argmin J2(Yi - 1)'') #(^), (4.16) 2 i=l j=0 and let m „ i " ( l ) = 2 f t , i . Note that the kernel function K and the b a n d w i d t h h ni used n to estimate m"(l) can be different from those used for forecasting. F o r m u l a (4.8) gives the o p t i m a l b a n d w i d t h for estimating m " ( l ) as follows: = ( which depends o n a and one higher unknown derivative m^(l) 2 4 ' 1 7 ) ( = 4!/? ). 4 O r d i n a r i l y the fourth derivative of a function m is difficult to estimate, partly because the o p t i m a l b a n d w i d t h depends on higher order derivatives of m. However, an idea due to F a n and Gijbels [7] eliminates the need to estimate m^ \l). 4 order derivatives when estimating m^ introduced a statistic 2 where o (i) by a pth degree p o l y n o m i a l , these authors have RSC, RSC(t, h ) = a (t){l n To avoid estimating higher + (p + 1 ) V } , (4.18) is as defined i n (4.14) a n d V is the first diagonal element of the m a t r i x 2 (A* W A p ) ~ A p W A p ( A p W A ) " . T h e motivation for using RSC is that its m i n i m u m 1 2 p 1 reflects to some extent a trade-off between the bias and variance of the fit. 47 Fan a n d Gijbels [7] have shown that the o p t i m a l b a n d w i d t h h (t) that minimizes Q the asymptotic value of E(RSC(t, h )\t) is: n where £, _ S — (Sp+l, • • . ,S p+l)S~ (Sp , . . . ,S2p+lY 1 2 p + 2 2 +1 ^ 2Q^ and Sj is as defined i n (4.6). C o m p a r i n g (4.8) a n d (4.19) yields: » ( M (2J/ + 1) a C \^ u p A p p l y i n g this relationship to the estimation of m " ( l ) , i.e., t = 1, v = 2, p — 3, leads to /*2, < = 0P />2,o t(l) P (f^|) W 9 = T h e point is that h (l) h , pt 2 4 22 can be estimated from the sample by m i n i m i z i n g RSC, 0 yielding an estimate of ( - ) 0 since the scaling factor i n (4.22) depends on the kernel function only. Thus the o p t i m a l b a n d w i d t h for estimating m " ( l ) can be estimated from (4.22) using the information i n the finite sample at hand. T h e following algorithm summarizes the steps to get the o p t i m a l b a n d w i d t h for estimating m"(l). Algorithm 1 : 1. Fit a local cubic polynomial centered at t — 1 as in (4-16) 1° t ne value of h n 2. find h (l) 0 3. get h 2t0pt data for each on a grid; that minimizes RSC(l,h ) n on that grid of h ; n via (4-22); 4- estimate m"(l) bandwidth by fitting a local cubic polynomial to the data using (4-16) with h ,opt2 48 4.2 Cross-validation 4.2.1 Background to the non-forecasting setting Cross-validation is a technique commonly used to choose a smoothing parameter. M o s t references i n literature applies cross-validation i n the context of curve fitting by splines or kernel estimators. However the idea of cross-validation is the same for b o t h techniques and can be applied i n other contexts, e.g., b a n d w i d t h selection i n local linear regression. For a brief historical note on cross-validation, see [15] (pages 152-153). T h i s section w i l l present the motivation and results on cross-validation i n the setting of curve f i t t i n g . A generic symbol A is used for the smoothing parameter. In the linear operator approach, A w i l l be the parameter used i n the penalty t e r m . I n the local regression approach, A w i l l be h , the b a n d w i d t h of the kernel. n Let m\(-) denote the fit to the data Y{S w i t h smoothing parameter A. Ideally, one would want to choose a A such that the prediction error is m i n i m i z e d . A criterion reflecting this objective is (4.23) where the Y*s are new observations made at the design points Thus the l^*s are independent of the Y{S but have the same distribution as the i^s. To see the relationship to the criterion MSEG(X) used i n the plug-in approach, note that E(PE(\)\t) m(ti))(m(ti) - m (*,•)) A = a 2 + i=i (4.24) MSE (X). G 49 So m i n i m i z i n g the expected prediction error E(PE(X)\t) is equivalent to m i n i m i z i n g MSE (\). G Of course the 3^*s are unknown and so to implement this idea the Y*s have to be replaced by observables. S i m p l y m i n i m i z i n g (4.23) w i t h the Y*s replaced by the Y{S would usually result i n choosing A = 0, the A that minimizes 2~Zi(^' — ™ A ( ^ , ) ) and 2 yields an estimate of m that interpolates the data. Thus the fitted curve would be very bumpy. T h i s is a direct consequence of underestimating the prediction error PE(X) because rh\(-) is "closer" to the Y{S than to the Y*s since rk\(-) is fitted to the former. T h e idea of cross-validation is to correct this downward bias i n the estimation of Note that Y* is independent of rh\(ti) b u t Y{ is not. Cross-validation gets PE{\). m£~^(2,-) for rh\(ti), around this dependence b y substituting where m ^ ( - ) is the curve fitted to the same data but w i t h the zth data point (i,-, Yi) removed. A s a result Yi and m^~ \ti) are independent. x Definition 4.3 CV(X) The cross-validation function is defined as (4.25) = -f2(Yi-m{- \ti))\ i where for each i £ { 1 , . . . , n } , m\~ \-) is estimated by the same procedure as rh\(-) but % with (ti,Yi) removed from the data set. parameter, \cv, minimizes The cross-validation choice of the smoothing CV. B y t a k i n g the expectation, we see that CV(X) is approximately unbiased for E{CV{\)\t) = -E(f^(Yi-m(ti)) 2 + \i=i n + ±(m(U) PE{\). 2it(Yi-MU))(™(ti)- i-i mi-^ti))^) / - t=i = ^ + ^(EM*0-4" (*.0) l*) o w <7 + MSEQ(\). 2 (4.26) 2 50 Under suitable conditions, Xcv, the smoothing parameter chosen b y cross-validation, can be shown to converge almost surely to the o p t i m a l smoothing parameter XPE, the minimizer of the prediction error. Hardle and M a r r o n [14] have shown such a result for the Nadaraya-Watson estimator. O n a practical level, cross-validation is a computationally intensive m e t h o d . Since CV can rarely be m i n i m i z e d analytically, usually CV is m i n i m i z e d on a grid of A i n a specified interval [A, A] . Therefore for each combination of grid point Xj and o m i t t e d design point t,, the entire curve fitting procedure has to be repeated. T h i s amounts to n\ • n curve fits w i t h n\ being the number of grid points for A. In some cases, a short-cut formula is available which expresses CV(X) i n terms of quantities which can be evaluated directly f r o m applying the curve fitting procedure once to the entire data set. W i t h this formula, only one regression curve is fitted for each value of A. T h e following l e m m a gives the short-cut f o r m u l a and conditions for its validity. L e m m a 4 . 1 For each i £ { l , . . . , n } , let m ^ ^ ( - ) be the fit with (ti,Yi) removed from - the data set {(tj,Yj)}"; in the data set and let ^~^*( ) be the fit with (ti,Yi) replaced by (£,-, ra^ ^(£,•)) - - {(tjjYj)}™. Suppose that rh\ = (rh\(ti),... ,rh\(t )y = H\Y, with H\ not dependent on Y and n with its ith diagonal element [H\]a ^ 1. If for each i, m^~^*(t,) = rn^C \ti) for any set l ofYis, then Yi-m\ Remark: '(ti) = i_[ H ] (4.27) » and so T h e short-cut version of the CV function (4.28) casts insight o n the idea of cross-validation. A s discussed earlier, n _ 1 51 Y%{Yi — rh\(ti)) 2 tends to underestimate the prediction error. B y scaling the iih term by 1/(1 — [H\]u) , CV function tries to 2 eliminate this bias. Proof: Since rh\ = H\Y and H\ does not depend on Y, the following holds: ^i _ , > (*.-) = E W i + [^A]«mi"°(*0- (- ) 4 29 Using the assumption m ^ ^ * ( i ; ) = m^~^(t,) yields: - Y i - = Yi-m^ (ti) m = Yi- YlWijYj - [ f r ] « m ^ ( t , - ) A 1 = Therefore, ^ - m^iU) Yi-mM + iHxMYi-m^iti)). (4.30) = (Yi - m ( * , ) ) / ( l - [H ]n). A p p l y i n g this relationship to the A x definition of CV(X) (4.25) gives the short-cut formula (4.28). • T h e above applies to the scenario of curve fitting i n the domain of the design variable when a global smoothing parameter A is desired. If one wants to choose a local value of A depending on t, localization of the CV function is straightforward by incorporating a weight function into CV. For example, a CV function for a local smoothing parameter can be defined as CV(X, t) = n- J2(Yi - m^iUtfwiit), l (4.31) 1 where iw,-(i) is a weight function that gives more weight to points near t than the rest. T h i s weight function Wi(t) can depend on another smoothing parameter A*. T h e A = X(t) that minimizes CV(-,t) w i l l determine the amount of smoothing around t. 52 4.2.2 C V for the linear operator approach T h e m i n i m i z e r of the penalized likelihood, F(Y,L ...,L ) u = n - fjXi - L^I)) + \f 1 n 2 1 I"{ufdu, J for I € W ^ O , 1 ] , gives an estimator satisfying the conditions of L e m m a 4.1. So the short-cut formula holds for this case. For proof and details, see [31]. 4.2.3 C V for the local regression approach in curve estimation T h e local p o l y n o m i a l regression estimator, rhh (t) = / 3 , of m(t) w i t h /3 defined i n 0t n t (4.2), satisfies the conditions for the short-cut formula. T h e i t h row of the H\ m a t r i x i n this case is [ A p ] , - [ ( A £ W A ) A £ W ] , where [A ],- is the i t h row of A , A and W are as -1 P p p p defined i n (4.15) w i t h t — ti, i = 1 , . . . , n . Obviously, this H\ m a t r i x does not depend on Y. T h e l e m m a below verifies the other condition i n L e m m a 4.1 for the short-cut formula to hold. Lemma 4.2 For each i € { 1 , . . . , n), let {wj }\ tti u)i,n > 0. Suppose that f3 . be a set of non-negative numbers with minimizes t s f c - E #(*;-*.•/) \ 1=0 (4.32) / and that fi\. ^ minimizes E ( > - E m Y - *o') j^i \ 1=0 = E fYi ~ E &i \ i=o - E m «>j,u + 2 J - ti) ) ) 1 \ ™i,u + v fc 1=0 - ft) 2 J - ti) ) 1 w . Then /3 ^ = /3 '^* for i G { 1 , . . . ,n}. Here J3 ^ and (3 ^ 0< 0t 0 ^ and p[. ^ respectively. 53 Qt itti ' (4.33) are the first component of R e m a r k : T h e above l e m m a is the essence of the short-cut f o r m u l a for regression w i t h the fitting criterion fa-!>(*;-')') E i with ;=o V rhh (t) = n / {3 . In particular the short-cut formula holds for local linear regression 0 estimator (p = 1) and local constant regression estimator (p = 0, Nadaraya-Waston estimator), that is, when Wj yt = K((tj P r o o f : B y the definition of /3 . < E < E fa fa \ - n w i t h K being a kernel function. , t E fa - E ttf'to — t)/h ) 2 + (4? - 2 + te? - **)') E - *0') ^ 2 - / C V - E A A ( * i - *••)') ( f> /=o ( - ) 2 4 / since f}\ ^ minimizes (4.32). Therefore J3 ^ — /?o,v • 1=1 Qt 4.2.4 34 C V for forecasting: FCV T h e following two sections contain discussion of ideas of cross-validation for forecasting based on different assumptions. U n l i k e the case of curve fitting i n the data range where the Y{S can be used to "crossvalidate" the estimates ihh (ti) n computed w i t h b a n d w i d t h no "future" data beyond 1 to cross-validate m / l n i h , n the forecasting case has i ( l + A ) for h . n n One natural idea would be to leave out a portion of the most recent data, use the forecasting estimator and the rest of the data to estimate the left-out data and m i n i m i z e some measure of the discrepancy between the left-out YiS and their estimated values to choose an o p t i m a l b a n d w i d t h h for forecasting. T h i s idea has the flavour of n 54 > conventional cross-validation i n that a statistical model is b u i l t on part of the data and then "validated" by the rest of the data. A formal definition of such a CV function for forecasting can be formed as below. Definition 4.4 FCV(h ) n The cross-validation function for forecasting is defined as: J2(™^,U-A„(U) = - Y) , (4.35) 2 where S = {U : ti £ [1 — pA , 1]} with p > 1 and \S\ = # 2 t S £ <S*. n Remarks: 1. Note that rhh ,ti-A„(ti) n predicts A „ ahead. as defined i n Definition 3.1 centers the data at i,- — A For each left-out 2,-, the forecasting estimator m^ uses the rest of the data prior to the t i m e point 2,- — A rhh ti-A (ti), n< n n and _A„(i,) to estimate m ( 2 j ) . T h e n the "forecast" at t = 2,-, is compared to Yi. 2. F r o m C o r o l l a r y 3.2 or 3.4, FCV(h ) n could use more i , s , w i t h p —> 0. B u t for computational convenience, FCV(h ) n n recent data w i t h 2; € [1 — pA , n £ [1 — p , 1] for n leaves out the part of the 1]. 3. Independent work by H a r t [17] uses a cross-validation idea, TSCV cross-validation), similar to FCV AR(1) niii n i n the context of forecasting. (time series H e assumes an model for the e,s and forecasts one step, 1/n ahead, using a locally constant regression estimator. In contrast, we assume independent e,s and forecast A „ oc n- / 1 5 ahead. H a r t and Y i [18] recently modified TSCV to OSCV (one-sided cross-validation) for the b a n d w i d t h selection for curve estimation by local linear regression. T h e y showed that the o p t i m a l b a n d w i d t h estimated by OSCV variable than that estimated by the traditional CV is less as defined i n (4.25). In both these papers ([17], [18]), the asymptotic mean squared error of the forecast has a simple f o r m , B h 2 n + V/(nh ) where B and V are constants depending on K, n 55 m", a 2 and / , since the estimate is forecasting 1 / n ahead. B u t AMSE(6) has a more complicated form than B h 2 R e c a l l that AMSE(6), + V/(nh ) n since we are forecasting further ahead. n the asymptotic mean squared error of m ^ „ i ( l + A ) is used i n i n the plug-in approaches to estimate the o p t i m a l bandwidth for forecasting. T h e following corollary of Theorem 3.2 stated for non-random i,s sheds some light on the relationship between FCV(h ) and n AMSE{8). C o r o l l a r y 4 . 1 Assume the same conditions as in Corollary 3.4- Then n {E(FCV(h )) 4/5 -a - AMSE{8)} 2 n = o(l). (4.36) P r o o f : Since rh-h ,ti-& (ti) is independent of Yi, n n E(FCV(h )) n = |4| £ {E(™» ,L-*M ~ m{ti)f + E(m(ti) - = •£[ | £ n E(m ,_ (ti) hntt - m(ti)) 2 An Y) } 2 + a j . (4.37) 2 B y Corollary 3.4, sup n ' [E [m 4 5 . (ti) hntU An - m(ti)} - AMSE(6)) 2 = o(l). A s a result, n l {E{FCV{h )) A 5 n -a 2 AMSE(8)} = o(l). •. T h e l e m m a below w i l l ensure that the stronger uniform results i n T h e o r e m 3.2 and Corollaries 3.3 and 3.4 hold for 2,-s random. T h e n by the same proof, the above corollary is true for 2,-s r a n d o m . L e m m a 4 . 3 Suppose that 0 = t 0 < t x < ... < t n < t n+1 order statistics of a distribution with density function f. 56 = 1 and that tx,...,t n are Suppose that W and W are bounded in the support ofW, g' and f are continuous over [0,1] and inf [o,i] f(t) = a > ie 0. Then ifnh n —• oo, (4.38) 0 sup <e[o,i] in probability as n —> oo. T h e proof of the above l e m m a w i l l use standard results on order statistics f r o m the uniform distribution over [0,1] (see, e.g. [21]). Those results are stated i n the l e m m a below without proof. L e m m a 4.4 Let £/(,), i = 1 , . . . , n be order statistics from the uniform distribution over [0,1]. Let C/( ) = 0, and U( i) = 1. 0 n+ Then *• E{U ) = ^ , Var{l%) = for i = 1 , . . (i) 3. E(U {i) - */(,_!)) = E(U ) (1) and E(U (i) - U^f = E ^ ) , for i = 1 , . . . , n + 1. P r o o f o f L e m m a 4.3: Since 1 "+ 1 (n + l)h t—t h n n n + l nh Y n «n [n + l)h (4.39) h n n we have sup te[o,i] sup te[o,i] n + l + (n + tn+l — t l)h n h n T h e assumptions on W', # and h ensure that the last expression w i l l go to 0 as n —• oo. n Therefore this l e m m a is true i f E W(^)g(U) sup r^uT (n + l)h j h te[o,i] n n ~Y [ h Jo n 57 W(^)(gf)(u)du h n (4.40) i n probability as n —> oo. Consider the integral i n (4.40): _L f u-zl 1 tin w{ JO n = r E / Hn wC—-){gf){u)du j — \ Jti—1 *i n = rSr fl n )(gf){u)du tl W{ \ )git )mdu t ) = 1 + i - ' t . - i 1 T- E • In , HV^M r 1 i Jti-i = L / l / n (4.41) w&^Wi)] t n Plugging (4.41) i n the left hand side of (4.40) gives: sup te[o,i] 1 + n E (re + = sup l)/i t- — t 1 w(\^)g(u) j n • -f ft 1 E / i n rU 71+1 n i / = 1 Jfi_i t—t ^(^-M*0/(«)«* An (4.42) \A(t)-B(t)\. te[o,i] B y Chebyshev's inequality, for any e > 0, P sup \A(t)-B(t)\ \te[o,i] £ (sup« < €[0fl] > e / < £ (su |A(t)l)+^(sup P < 6 [ 0 ) 1 ] te[0|1 |A(t)-5(i)|) ]|^(t)|) (4.43) Therefore the l e m m a is proved if the right hand side of (4.43) —> 0 asre—> oo. Consider E ( s u p [ ] |A(2)|) first. te Ait) = 01 ^ E ^ ^ - f s f (n + l)h n i h h n n w(^)^(to/(«)d« iZ[ Jti-i h n = ^ £ V ( ^ M t o ^ - f EV(^M*o/ f i h i «n n+ 1 ft ft which can be written i n terms of the order statistics of a uniform distribution. n n Let U(i) = J fiu)du, i = 0, . . . , n + l . 58 n (4-44) Then . . . , ?/(„) are order statistics of the uniform distribution over [0,1] and <7(„ i) = 1. U(o) = 0, + U s i n g the facts 1 n + / Y = EiUv-Uw), f(u)du = U(i) - «7(,_i), t = l,...,n + l Jti-l i n A(t) a n d then re-arranging the terms yield: n+l -i Mi) = f 4 r E ^ V ^ K r ^ } t'=l •j n R e c a l l that U( ) = 0 0 a r i ( 1 n+l J ^ j=l (4.45) n ^ ( n + i ) = 1, so - f E w(^)^(to - ^.--i)} , 2 = = i E HHr^*- ~ ^r *} { «- ~ -^ - 0 w( M 0 EU x) U(i (4 46) U s i n g the condition |(W#)'| < C i n (4.46) yields: <^ E sup |A(t) | <G[0,1] "n , EU(i-\) — U(i-i) (4.47) = a(t -t - ), (4.48) = 2 Because of the fact that U(i) - rU r/(,-_i) = > su Pte[o,i] \ c c n+l ah 2 E a n ^(0 / /(«)<Zu mif{t)(t -t - ) t£[0,l] i i l i ^ bounded by e _ ^(«'-l) - ^(i-l) " " n ,=2 59 i 1 Schwarz' inequality can be used to bound the expected value of the above by C ~^~^ / 2\ n { E E 1/2 (4.49) {VarWw)) fa) "tf(.--i>)) U s i n g fact 3 of L e m m a 4.4 then yields sup \A(t)\) <6[0,1] < / a f l n E {U(i)) - T J x /2 (^K^-i))) t=2 7 A g a i n , b y facts 1 and 2 of L e m m a 4.4, the last expression equals C ( V(» + ahl V 2 / 2 A / i(n + l-i) \ 1/2 + 2),/ £ V(" + 1 ) ( " + 2 2 ) ± (J- (i c ^ 1 / 2 "+ 1 I _ l _ ahl \(n + l){ n + 2)J (n + 2 ) V 2 + l ^ Vn + 1 V 1/2' _ n+ 1/; Since n + as n —> oo, we have E sup (4.50) 0 = O \*€[0,1] / under the condition that nh n —• oo. Now consider £ ( s u p [ ] |i?(t)|), where 1£ 01 1 i l n , _ i •'ti-i L (in (4.51) ftn B y the assumptions of the l e m m a , W and W are bounded i n the support of W, and / and g' are continuous on [0,1]. So there exists C\ > 0, such that /(«) < Cx 60 I u — ti hr (4.52) Therefore, sup \B(t)\ < 1 ^ r*i T - E / C-i— ii_i au — U du n+l "n < i=l §iE(^)-^-i)) , (4-53) 2 "n " «=1 by (4.48). Finally, <?i" c?h ^ Kn J n v 2 ( n + l ) ( n + 2) = ( 4 ' 5 4 ) under the condition that nh —> oo. n So by (4.50) and (4.54), the right hand side of (4.43) -> 0. • Unfortunately, no short-cut formula for FCV has been identified, m a k i n g the implementation of this criterion computationally intensive. T h e cross-validation criterion i n the next section has a short-cut formula. 4.2.5 FCV C V for forecasting, BCV uses the data up t o —A n to forecast ra(£,). I n contrast the "backcast" cross- validation BCV i n this section w i l l use data to "forecast" the past. T o understand the definition of BCV, recall that rhh ,i(l + s) = / ? n 0 1 + J3 s estimates m ( l + s) w i t h data ltl centered ait — 1. Suppose that m is really a line. T h e n if m ^ i ( l + s) is a good estimate n i 61 of m(l+s) for s G [ — A , 0], rh-h ,i(l+s) n n is an equally good estimate of m(l+s) as it is for s G [0, A ] . If m is not a line, we know by a Taylor expansion that m is approximately n linear i n a neighbourhood of 2 = 1. So we can fit a line locally w i t h the data centered at 1 and estimate m ( l + s) using / 3 01 and to calculate m „ i ( l + s). T h e accuracy n) of the fit is then assessed by the data i n [1 — A „ , 1]. Since for s £ [—A„, 0], m / , i ( l + 5 ) ni gives a "backcast" instead of a forecast, we call the following cross-validation criterion BCV. D e f i n i t i o n 4.5 Let S\ = {ti : 2,- > 1 — A } and \S\\ = the number oftiS G S\. n BCV{K) w/*ere m /ln)1 ( = - i - £ c ) 2 n (4.55) 2 1 + ^ ^ ( t , - - 1), @[ * minimizes 52&i(Yj - Po - Pi(tj - "' (2,) = l)) K((tj — l)/h ) ( l _ 2.) ( m , „ , ( - ) ( 2 , ) - Y) , J and c(-) is a function that may depend on A n but not on h . n In (4.55), we recommend c(-) to be chosen so that E(BCV(h )) n « AM5 E(<5) + constant, (4.56) , where AMSE(6) is the asymptotic M S £ of m f c B > 1 ( l + A ) and 8 = A /h . n n n BCV always centers the data at 2 = 1 when estimating m(2,), thus using data i n [1 — h , 1] n but FCV i n the last section centers the data at 2,- — A „ when estimating m(2,), using data i n [2,- — A n — h ,ti — A„]. n Suppose that a c(-) exists such that (4.56) holds. T h e n BCV behaves approximately like AMSE(8). T h e hope is that the bandwidth that minimizes BCV also approxi- mately minimizes AMSE{8). T h e o p t i m a l bandwidth chosen by the BCV criterion is defined to be the one that minimizes BCV and w i l l be sought by a grid search. W o r k i n g w i t h (4.55) directly is computationally intensive because for each h on the search grid n and for each data point (U,Yi) w i t h 2,- > 1 — A , a regression needs to be computed. n Fortunately, a short-cut formula holds for (4.55). This is shown i n L e m m a 4.6. 62 To choose a c(-) to satisfy (4.56), note that E{BCV{h )) = n ~ = £ ^ - ^ { ( A w ^ ^ ^ ) - ^ - ^ ) ] ) ' } E c(l-t,-)^{(mw -°(*0-m(*0) } + ( E (! " *0 ^ 2 T^ c l^l ~ t.-eSi E (! - {(mw(*0 - c IF, °- |F7 2 if \S\ | _ 1 ]Ct,-e5i ( l c — ^')E | ( /i„,i^~^('«) m a m(r,)) } + 2 E (!-^), c — m (^«')) | c a n (4-57) be replaced w i t h negligible error by ISil" Eues, c(l - *,)£ {(m (t.-) - m ( t , ) ) } . 1 2 w T h u s to have (4.56), i t suffices to find c ( l — i,)s such that E c(l - * ; ) £ {(mfc„,i(*0 " MU))2} ~ A M S £ ( £ ) . Since AMSE(6) = 0(n~ / ) 4 5 (4.58) b y the asymptotic results i n Chapter 3, we require the difference between the left and right sides of (4.58) t o be o ( n / ) . - 4 5 T h e following Theorem is stated for fixed design w i t h equally spaced i , s , U = i/n, for the choice of the c;s. Theorem 4.1 Assume the same conditions as in Corollary 3.3. Let k = [n — nA ], n e = (1,...,1)' (a k x 1 vector), z c = (cjt,... ,c„,Y — (c(zk),... ,c(z )Yn V A .n. / 2 ^ c«( —) = = (zk,...,z Y n = (1 — ifc,..., 1 — £„)* and Note that Si — {tk,...,t }. n \Si\ If c(-) satisfies: (4.59) 63 where ( z / A ) * ' = ( ( z j t / A ) \ ( z / A ) * y , then n n n £ n - ™(*0) } - AMSE(6) = o{n- ' ). 2 A 5 Proof: Since the t,-s are equally spaced, we can write t ueSi = B + y, l i=fc 0 l where 1 Pi " I t=fc 1 n 1*1 & Theorem 3.2 tells us that E0 )-m{l) = E0 )-m'(l) = Otl hl Var0 ) 2 = Otl Var0 ) °iK + o(h ) n 0h 2 2 1 a 1 <T n 1 2 I 2 = ltl 0(/i J + n 22 + ° W ' where u 2 Ui u 2 \ U 3 J -l U Ui Ui u 0 E = 2 uT, u: \ l U 2 J U UQ UI Ui U 2 and Ui and u* are as defined i n Condition 4 i n Chapter 3. 64 (4.60) So, for \t- 1| < A „ , Bias(m (t)) hn<1 = E(m (t) - m(t)) = E = E{p -m(l) hnA + P (t - 1) - m(t)) ltl + 0tl [P -m'(l)](t-l)} hl -[m(t) - m ( l ) - m ' ( l ) ( i - 1)] = j&if + - = { m ^ { M fe ^-l)}m"(l) ' n + o(^) 2 W + «((*-i) )} 2 ^ - M „ ( l - 0 - ( l - 0 2 } + «(A„), (4-61) I Var(m (*)) w = Var + ^ ( t - 1)) = Var(/? _ ( 1 0)1 - fe Sll ) + 2(i - l)C<wfAi> /3 ) + (f - l) Var0 ) 2 1)X 2(t-l) ^^ + S l 2+ (t-1) ^^ S 2 2 \ a 2 hl 1 2 J/(iT + W ' 0 (4.62) where o ( A ) and o(l/(nh )) do not depend on t as long as \t — 1| < A . B y (4.61) and 2 n n (4.62) and the fact that h — A /<5, n 1 n " i=k ll 1*11 A 4 E .t « - M n ( l " ti) " (1 - t,-) + 0 ( A ) } 2 2 " 15: Zi_\ \Si\k \* 4 A J ^ \ 2& 3 65 /M 2M2 2 + U 2; 2 b\ ~ J 62 26i 5 m"(l) UJ £ V |<S\| \ UJ £ 3+ 2 +<*(i) ^'(i)-4 (4.63) 3 and V = 1 1*11 1 " t=fc * f 1 2(1-*Q r25 — i 2 2^ « ^ — r - ^ i i c L n/> 1*1 fa* l » * n , (1 ~ * Q nh . , 2 3 1 " / 1 2A nhl nh \si\t fZ n S 2 2 + 0 xl 1 ^ W/7(T) i An W J « . t 1*1 / z \ 2A n 12 VA„y nhl + c t /(I) (i) i 2 S 2 2 + o (4.64) (i) B y (4.61) a n d (4.62), for t — 1 + A , the square of the bias a n d the variance of n m„„,i(l + A „ ) are Bias (m (l 2 hntl + A )) n = ^"(l) 1 2 {b X + 2hb h A 2 2 3 n + (b\ - 2h)h Al 2 n n -2& /> A +A 2 m"(l) A 26162 b 2 1 64 » + £3 + 3 n &2 - 2 & n + o(A )} n i <5 2 + 1+0(1) (4.65) and f 1 2A V a ( ^ , ( l +A.)) = | _ E „ + ^ E r 1 Conditions i n (4.59) ensure that B - Bias (m (l 2 hnA +A )) = o(A ) B n 66 2 + A 2 1 1 a +o ( — ) j 2 w (4.66) and V - Var(m (l hn<1 Since h ni A n oc n - 1 + A„)) = n / , we have 5 B - Bias (m (l 2 o(l/(nh )). + A ) ) = o(n- / ), hntl 4 n 5 and V - Var{m (l hn<l + A„)) = o(n" / ). 4 5 Therefore the theorem is proven. • Remarks: 1. Since for each value of [ n A ] , c needs to be calculated i n the implementation of n B C V , one might be interested i n the l i m i t i n g forms of conditions i n (4.59) when n —* oo and nAn, —* oo. Let c(t) = g(t/A ). n For equally spaced 2,-s, one can show that there exists a function g, e.g., a fourth degree p o l y n o m i a l , such that / g{u)du = Jo 1 I ug(u)du Jo = —1 / u g(u)du Jo = 1 / u g(u)du Jo = —1 f u g(u)du Jo = 1. 2 3 4 (4.67) 2. T h e conditions i n (4.59) are sufficient for (4.60) but m a y not be necessary. For example, i f m " ( l ) = 0 the last two conditions i n (4.59) are not needed for (4.60) to h o l d . However, i f m " ( l ) ^ 0 and a l l the coefficients of powers of 8 i n (4.65) and (4.66) are non-zero, then the conditions i n (4.59) are necessary for (4.60). 67 3. So far our discussion about B C V has been l i m i t e d to simply providing guidelines for choosing c(-). It has not been shown that the h m i n i m i z i n g B C V is close to n h , the bandwidth that equals A /5 t n opt with 8 op opt minimizing AMSE{8). Note that when |<?i| > 5, there m a y not be a unique vector c that satisfies (4.59). In the implementation of BCV, we w i l l use the vector c that minimizes c'c subject to the conditions i n (4.59). Lemma 4.5 Suppose that \S\\ > 5 and that the X 1 = 6' '(AJ'(AJ '(AJ '(AJ are distinct. Let and EE (1^1, -15x1,1*1,-ISxU^I). (4-68) Then the minimizer of c c subject to (4-59) is c — t X (XX )~ b. t t 1 Proof: L e t L(c,\) (4.69) =c c - \ \ X c - b ) , t where A is a 5 b y 1 vector. Setting the derivatives of L w i t h respect to c and A to zero yields: ^ oc = 2 c - X \ = Q=>c = ]-X \, 2 ^ = Xc-b t (4.70) t = 0=> Xc = b. (4.71) Plugging c of (4.70) i n to (4.71) gives Xc = \XX*\ Since tk,...,t (4.72) = b. are distinct, X is of f u l l row rank of 5. Therefore XX t n Inverting XX* i n (4.72) results i n A = 2(XX )~ b. t c = X (XX )~ b, t t 1 1 is invertible. Plugging this A into (4.70) yields which m a y be either a minimizer or a m a x i m i z e r . Since L is u n - bounded above, this c is a minimizer. • 68 U n l i k e FCV, BCV has the short-cut formula we established to enable one to calculate only one regression for each h on the search grid. T h u s the process of estimating n the o p t i m a l bandwidth h is fast. We t u r n to that problem now. n Lemma 4.6 B C V{h ) - — - 2^ c(l - ti)—— , n where [H ]u = {1,U - 1) x toe ito c o l m n o/ [(A*WA) hn / i ... i A W], N A = l \ * i — 1 • • • t -l J n and W = diag(K(^-r-^)) h (4.73) n with K a positive kernel over its support. Proof: Recall that & = argmin £ (^ - ft - ft(t; - l)) j=i K^T^)- 2 71 F i x i and let <*U = argmin n ( j - a - a {t - - U)) Y 0 x 3 £• — 1 K( -^—). J Then = (4-74) T h e usual estimate of m(* ) using /Sj is ft + ft i(i; — 1) which is equal to a , the usual t >x estimate of m(U) based on a . ti i T h u s , both parameterizations yield the same estimate of (m(ti),..., m(t )y, namely Hh„Y. For the same fixed i , also let n fi™ = 0tU argminifciYj-h-lhitj-ltfKi^) 69 + a[. l) cti"^* = = argrain^ argrain ^ (^;i*' -/5o-^-i)) /v(\^), ) (Yj - ct — a^tj - t,)) 0 2 ti-i. K(" 2 3 h, r (5j - ao - a ( i j - *i)) ^ ( ^ 7 2 x + (afc? - ao - axfc - t,))' ) ^ f ^ ' then F r o m L e m m a 4.2, cto~u* = ^o~u- Thus f r o m L e m m a 4.1, we have B y the relationship between the d s and /3s, the short-cut f o r m u l a holds. • In the next chapter, simulations w i l l be carried out for the local linear forecasting estimator to compare the performance of the four methods of b a n d w i d t h estimation. 70 Chapter 5 Simulation: local linear forecasting estimator 5.1 Motivation and data In this chapter we w i l l study the local linear forecasting estimator of m ( l + A ) for n A „ = 0.1 a n d 0.2 on simulated data sets, Yi = m{ti) + e,- w i t h e,-, i = 1 , . . . , n i i d n o r m a l variables ~ N(0,<T = 0.1), and equally spaced 2,-s, t,- = i/n. D a t a sets w i t h sample sizes n = 50 and 100 w i l l be generated from each of three m s , mi(t) = t, m ( t ) = t a n d 2 2 2- l (cosUvt) + 1), t < 1/2, 7 2 m (t) (5.1) 3 i / , 5 2 t > 1/2. Methods of estimating an o p t i m a l bandwidth by using the plug-in procedures HAMSE BCV CAMSE, f r o m Sections 4.1.2 and 4.1.3 a n d the cross-validation procedures FCV a n d f r o m Sections 4.2.4 and 4.2.5 are applied to each data set. T h e estimators of m " ( l ) and a needed for the plug-in procedures are those i n Section 4.1.5. T h e kernel 2 71 function K used i n all four methods is the density function of the standard normal w i t h support truncated to [—1,0]. For this kernel function, the asymptotic mean square error by formula (4.11) is m"(l) 2 AMSE{8) = —^-A (0.1532668/6 n 2 + 0.9663585/6 + l ) 2 2 c +-^—(4.034252 + 12.16996 + 12.2112166 ). (5.2) 2 T h e o p t i m a l b a n d w i d t h for forecasting is: h = A /6 , opt n opt where S opt minimizes AMSE(S) T h e goal is to compare the estimates of m ( l + A „ ) and the estimates of o p t i m a l bandwidths by a l l the b a n d w i d t h estimation procedures discussed i n Chapter 4. T h e functions t , p = 1 and 2, are chosen to study the sensitivity or the robustp ness of plug-in procedures of b a n d w i d t h estimation to the violation of assumptions on the regression functions i n the asymptotic analysis i n Chapter 4. R e c a l l that plug-in procedures require the estimation of m " ( l ) i n order to estimate the o p t i m a l b a n d w i d t h for forecasting. In the estimation of m " ( l ) , the fourth derivative of m at 1, m^ \l), is 4 assumed to exist and be non-zero. Therefore, i t is interesting to study the performance of the plug-in procedures when m^(l) is zero. In contrast, the function m is chosen because i t has a non-zero fourth derivative at 3 t = 1 and non-constant second derivative over [0,1]. A l t h o u g h m ' has a discontinuity 3 point at t = 1/2, Corollary 4.1 still holds when the FCV procedure is applied to m . 3 Recall that Theorem 3.1 assumes that i n C o n d i t i o n 5 m" is continuous over [0,1 + A ] n and that the results concerning the asymptotic bias and the variance of rhh ,t(t + A„) n hold uniformly for t 6 [a , 1] w i t h l i m i n f a /h n n n > 1. It can be easily shown that under the conditions of Theorem 3.1 but w i t h C o n d i t i o n 5 replaced by the condition that m" is continuous over [ao, 1 + A ] for some 0 < ao < 1, the conclusions i n Theorem 3.1 n hold uniformly for t (E [a ,1]. Therefore Corollary 4.1 holds since FCV uses estimates 0 of m(ti)s w i t h U > I — pA and these t,s are i n [a ,1] for n sufficiently large. n 0 It is also interesting to see how the magnitude of m " ( l ) w i l l affect the estimated 72 o p t i m a l b a n d w i d t h and the resulting forecasts. If m " ( l ) = 0, the square of the asymptotic bias of rhk (l + A ) is of a higher order than A n the AMSE n and thus the h that minimizes n n of m „ ( l + A ) is of a different order f r o m n n - 1 n / . line, the exact bias of m / i ( l + A ) is zero. Thus the AMSE n n 5 I n particular, i frais a i n this case has only the variance t e r m which equals the second t e r m i n (4.11), resulting i n the asymptotically o p t i m a l b a n d w i d t h h being co. n It can be proved f r o m (5.2) that a larger absolute value of ra"(l) w i l l result i n a smaller asymptotically o p t i m a l b a n d w i d t h for the kernel function used i n this chapter. T h e three ras under study have m'((l) = 0, m ' ( l ) = 2 2 and ra '(l) = 3.75. 3 5.2 Results and comments Tables 5.1-5.3 display the summary statistics of the A - a h e a d forecasts ( A n n = 0.1,0.2) along w i t h the values of h , and the estimates of o p t i m a l b a n d w i d t h for the forecasts opt by the four procedures applied to 100 simulated data sets, each for sample sizes n = 50 or 100 and the functions ra = m i , ra and m . 2 3 Since the estimates of the o p t i m a l b a n d w i d t h are quite variable, the median of the estimates of the o p t i m a l b a n d w i d t h for 100 simulations is a better indicator of the center of these estimates than the mean. T h e median of the estimates of the o p t i m a l b a n d w i d t h for 100 simulations is denoted as hf^}. In the tables, the standard deviation (s.d.) and the mean square error (m.s.e.) of m h / i „ , i ( l + A „ ) and either ra„ i(l ni opt are estimated respectively as follows. Let 6\> generically represent + A ) or h n opt for the bth simulation, 6 = 1 , . . . , 100, and 0 the true value, i.e., 9 = m(l+ A ) or h . T h e n (s.d.) n opt 2 is £ £ ° ° (§ - fj b 2 /100 w i t h I = E ^ i 4/100 and the m.s.e. is (0 — 0) +(s.d.) . T h e last line of each table displays the true ra(l + A ) 2 2 n and h . opt Results i n Tables 5.1-5.3 reveal that, among the four procedures examined, FCV 73 Table 5.1: S u m m a r y of 100 forecasts (ms) and estimates of o p t i m a l bandwidths b y local linear regression for m = m\ w i t h sample size n = 50 a n d 100 respectively. T h e m i n i m u m of AMSE is displayed i n column (m.s.e) for t r u e value. m = m i , n = 50 A method n A = = 0.1 n m (s.d.) (m.s.e.) Ag? CAMSE 1.08 (0.19) (0.04) HAMSE 1.08 (0.19) FCV 0.2 rk (s.d.) (m.s.e.) *£? 0.30 (0.27) 1.17 (0.34) (0.12) 0.27 (0.26) (0.04) 0.27 (0.27) 1.18 (0.35) (0.12) 0.24 (0.26) 1.11 (0.07) (0.01) 0.51 (0.21) 1.20 (0.08) (0.01) 0.51 (0.12) BCV 1.10 (0.10) (0.01) 0.31 (0.24) 1.23 (0.18) (0.03) 0.19 (0.15) t r u e value 1.10 0.00 oo 1.20 0.00 oo M O (B.d.) m — m i , n = 100 A = method n A 0.1 n = 0.2 m (s.d.) (m.s.e.) Ag? (s.d.) 0.33 (0.28) 1.21 (0.14) (0.02) 0.30 (0.25) (0.01) 0.33 (0.24) 1.20 (0.14) (0.02) 0.30 (0.24) 1.10 (0.05) (0.00) 0.51 (0.21) 1.20 (0.06) (0.00) 0.51 (0.10) BCV 1.09 (0.10) (0.01) 0.19 (0.22) 1.20 (0.18) (0.03) 0.18 (0.14) t r u e value 1.10 0.00 oo 1.20 0.00 oo m (s.d.) (m.s.e.) *g? CAMSE 1.10 (0.08) (0.01) HAMSE 1.10 (0.08) FCV (s.d.) 74 Table 5.2: S u m m a r y of 100 forecasts (ms) and estimates of o p t i m a l bandwidths b y local linear regression for m = m w i t h sample size n = 50 a n d 100 respectively. T h e 2 m i n i m u m of AMSE is displayed i n column (m.s.e) for true value. ra = ra2, n = 50 A = method n A = 0.1 n 0.2 ra (s.d.) (m.s.e.) hSg (s.d.) 0.24 (0.21) 1.33 (0.43) (0.20) 0.22 (0.20) (0.06) 0.22 (0.20) 1.33 (0.43) (0.20) 0.21 (0.20) 1.16 (0.09) (0.01) 0.29 (0.12) 1.33 (0.11) (0.02) 0.29 (0.09) BCV 1.13 (0.13) (0.02) 0.30 (0.27) 1.37 (0.25) (0.07) 0.20 (0.16) true value 1.21 0.01 0.34 1.44 0.02 0.32 ra (s.d.) (m.s.e.) *S? CAMSE 1.16 (0.24) (0.06) HAMSE 1.16 (0.24) FCV (s.d.) ra = m 2 , n = 100 A = method n A 0.1 n ra (s.d.) (m.s.e.) fcg? CAMSE 1.16 (0.12) (0.02) 0.24 (0.26) HAMSE 1.17 (0.12) (0.02) FCV 1.18 (0.08) BCV true value (m.s.e.) * £ ? (s.d.) 1.34 (0.20) (0.05) 0.23 (0.22) 0.24 (0.21) 1.34 (0.21) (0.05) 0.22 (0.20) (0.01) 0.26 (0.09) 1.36 (0.11) (0.02) 0.24 (0.08) 1.15 (0.14) (0.02) 0.20 (0.28) 1.37 (0.20) (0.04) 0.20 (0.18) 1.21 0.01 0.30 1.44 0.02 0.27 75 ra (s.d.) (s-d.) = 0.2 Table 5.3: S u m m a r y of 100 forecasts (ms) and estimates of o p t i m a l bandwidths by local linear regression for m = m w i t h sample size n = 50 and 100 respectively. T h e 3 m i n i m u m of AMSE is displayed i n column (m.s.e) for t r u e v a l u e . m = ra , n = 50 3 method A n = 0.1 A n m (s.d.) (m.s.e.) f>SS CAMSE 1.21 (0.28) (0.08) 0.24 (0.21) HAMSE 1.21 (0.28) (0.08) FCV 1.19 (0.10) BCV true value = 0.2 (m.s.e.) AgJ (s.d.) 1.45 (0.51) (0.28) 0.23 (0.19) 0.22 (0.21) 1.45 (0.51) (0.28) 0.21 (0.19) (0.02) 0.26 (0.09) 1.37 (0.12) (0.06) 0.25 (0.10) 1.13 (0.16) (0.05) 0.30 (0.28) 1.39 (0.25) (0.10) 0.19 (0.17) 1.27 0.02 0.26 1.58 0.05 0.24 (s.d.)m (s.d.) m = m , n = 100 3 method A An = 0.1 n = 0.2 m (s.d.) (m.s.e.) h% (s.d.) m (s.d.) (m.s.e.) Ag? (--) CAMSE 1.20 (0.20) (0.04) 0.24 (0.26) 1.43 (0.33) (0.13) 0.22 (0.22) HAMSE 1.20 (0.20) (0.04) 0.22 (0.24) 1.44 (0.33) (0.13) 0.20 (0.22) FCV 1.19 (0.09) (0.01) 0.21 (0.08) 1.42 (0.15) (0.05) 0.21 (0.09) BCV 1.16 (0.16) (0.04) 0.21 (0.29) 1.44 (0.21) (0.06) 0.19 (0.17) 1.27 0.01 0.22 1.58 0.04 0.21 true value ] 76 s d produces the best forecasts i n terms of the mean square error (m.s.e.) and that the m.s.e. of FCV is closest to the m i n i m u m of AMSE. T h e m.s.e. of forecasts given by the other cross-validation procedure, BCV, is smaller than the m.s.e. of forecasts given by the two plug-in procedures i n most cases (eight cases out of twelve). However, BCV ties w i t h the two plug-in procedures (in terms of the m.s.e.) i n three cases and does worse i n one case ( m = m i , A n = 0.2 and n = 100) where the m.s.e. of forecasts by BCV is 1% more than those of forecasts b y plug-in procedures. C o m p a r i n g the estimated mean a n d the estimated standard deviation of forecasts by a l l four procedures shows that forecasts by plug-in procedures have a bias comparable to the bias of forecasts by cross-validation procedures but tend to have a much larger variance, and that the two plug-in procedures produce almost identical results. Tables 5.1-5.3 also show that the m.s.e.s of forecasts by a l l procedures increase as |ra"(l)| increases, which is confirmed by examining plots i n Figure 5.1 where the m i n i m u m of AMSE curve (solid line) increases as |m"(l)| increases f r o m m ^ l ) = 0 to m ' ( l ) = 2 a n d m ' ( l ) = 3.75. Moreover, the m.s.e.s of forecasts by a l l procedures are 2 3 smaller for the larger sample size n = 100. T h i s is expected since the convergence rate of AMSE of m / l n i i ( l + A ) is n n - 4 / and gets smaller when n gets larger. 5 F r o m Tables 5.1-5.3, we see that among a l l four procedures, the estimates of the o p t i m a l bandwidths calculated by FCV are the least variable and have medians agreeing w i t h the true o p t i m a l bandwidths reasonably well except for m — mi. O t h e r procedures m a y do well for some cases but badly i n others. Recall that a l l four procedures are devised to estimate the o p t i m a l b a n d w i d t h by simulating the AMSE curve and that E(FCV) — a is approximately AMSE. 2 T o help us understand why FCV has the best performance, i n the following figures, AMSE(6) w i l l be plotted as a function of h n with h n FCV - a , w i l l be plotted instead of FCV. 2 77 = A /S. n Moreover, the shifted FCV, T h e two plug-in procedures, CAMSE and HAMSE, produce almost identical sim- ulated AMSE curves. For example, Figure 5.2 shows that the CAMSE and HAMSE curves are i n good agreement for each of 9 simulated data sets for function mz w i t h n = 50 a n d A „ = 0.1. T h e similarity of those curves explains why the CAMSE a n d HAMSE procedures produce very similar forecasts a n d estimates of o p t i m a l band- widths. F r o m now on only the CAMSE O f the four procedures, only FCV curves are plotted for plug-in procedures. on average simulates AMSE curves reasonably well i n a l l cases, which is not surprising i n the light of Corollary 4.1. In contrast, as Figure 5.1 shows, the median of the CAMSE curve only for m = curves (dashed) agrees w i t h the AMSE but the median of the FCV curves agrees w i t h the AMSE curve for a l l three functions. T h i s observation suggests that the condition of m^(l) ^ 0 is necessary for plug-in procedures to work. Furthermore, i n all cases the AMSE curve simulated by FCV is less variable than the AMSE curve simulated by plug-in procedures. For example, Figure 5.3 of quantiles of the FCV FCV curves and CAMSE curves shows that the AMSE is less variable than the AMSE curve simulated by curve simulated b y CAMSE observation probably explains w h y forecasts obtained through FCV for m = ra . T h i s 3 are less variable than forecasts by plug-in procedures. T h e medians of simulated AMSE curves by BCV are too rough to resemble anything like the AMSE curve i n a l l cases; see Figure 5.4 for example. T h i s is probably due to the greatly oscillating values of the c,s, the weights used i n BCV which m i n i m i z e c*c subject to (4.59). For example, when n = 50 and A = 0.1, we have n c< = (220.75,-753.77,632.54,492.46,-996.23,409.25). A s Figure 5.1 shows, FCV approximates the asymptotic mean squared error very well for ra = ra and ra but not as well for ra = m . x 2 3 It is easy to see w h y the shifted FCV(h ) n 78 under-estimates AMSE(6), that is, 0.1 0.2 0.3 0.4 0.5 0.6 0.7 / / / / I I I I I / / m=m2 \ \\ \\ \\ v., / y / / / / / / s \ ^~ 0.2 .— 0.3 0.4 0.5 0.6 0.1 0.7 0.2 0.3 0.4 0.5 0.6 0.7 Figure 5.1: T h e median of 100 shifted FCV curves (dotted), the median of 100 CAMSE curves (dashed) and the true AMSE curve (solid) for m = m ,m 1 and A n = 0.1. 79 2 and m w i t h n = 50 3 Figure 5.2: Pairs of CAMSE (dotted) and HAMSE sets for m = ra , n = 50 and A 3 n -= 0.1. 80 (solid) curves for 9 simulated data 81 o CO T3 CD o C 55 CD E CM O > O CO u3 CO O O 0.1 0.2 0.3 0.5 0.4 0.6 0.7 h Figure 5.4: T h e median of 100 shifted BCV (solid line) for m = m , n = 50 and A „ — 0.1. 3 82 curves (dotted line) and AMSE curve AMSE(A /h ), n particularly for large values of h for m = m . R e c a l l that the FCV n n 3 score is an average of the estimates of the prediction errors for m(£,)s w i t h U £ [1 — A , 1] n and that by Corollary 3.1 or 3.3, AMSE of rhh ,t -A (ti) depends o n m"(ti — A ) , while n i n n AMSE(8) depends on m " ( l ) . T h e second order derivatives of m i and m are constants, 2 w i t h m" = 0 and m' ' = 2. So for U £ [1 - A , 1], m"(U — A ) is equal to m " ( l ) . T h u s , 2 n n for m i a n d m , one sees a n agreement i n shape between the shifted FCV scores and 2 the AMSE curve. B u t for m , m ( i ) = 1 5 £ / / 4 for t > 1/2, which is a n increasing 3 1 3 2 function of t. Therefore for ti £ [1 — A , 1], m '(£ - - A ) is less than m ( l ) . E v e n though n 3 t n 3 as n —+ oo, the m"(ti — A ) s converge to m " ( l ) , for a finite sample, one would expect n the shifted FCV scores to under-estimate the AMSE curve and that the discrepancy increases as h increases. However, Figure 5.5 shows that for fixed A the discrepancy n n is smaller for larger sample size n , as would be expected. Based on our analysis and the above observations f r o m our modest simulation study, we conclude that FCV has the best performance and thus is recommended. T h e next chapter w i l l contain the comparison of the backcalculation method and the local linear m e t h o d applied to a fictitious A I D S example. 83 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.1 0.2 h 0.3 0.4 h Figure 5.5: T h e AMSE curve (solid) and the mean of 100 shifted FCV for m — ra w i t h A 3 n 0.5 = 0.1,0.2 and n — 50,100 respectively. 84 curves (dotted) Chapter 6 T h e local method versus backcalculat ion T h i s chapter offers the arguments that the proposed local method should serve as an alternative to the backcalculation approach i n forecasting. 6.1 Comments on the methods Backcalculation is used to achieve two goals: to estimate the historical H I V infection curve up to the present and to forecast the number of new A I D S cases using the estimated H I V infection curve. Backcalculation assumes the following: 1. the adjusted A I D S incidence data are accurate; 2. the incubation distribution, F, is accurately modelled and does not vary across subpopulations represented i n the counts of new cases. 85 In reality, the raw A I D S incidence data are affected by under-reporting and reporting delays. In some developing countries, A I D S data are highly incomplete. T h o u g h adjusting the raw A I D S data for under-reporting and reporting delays improves the quality of the data significantly, the adjusted A I D S data still contain errors depending on the i m p u t a t i o n methods, the quality of the raw data and the r a n d o m nature of the data. In short, accurate A I D S counts are not available. T h i s is more of a problem for backcalculation (Section 6.2) than for the local linear forecasting estimator. Furthermore, knowledge about the "true" F is suspect. K n o w i n g F is crucial to the backcalculation approach but F is very hard to estimate. R e c a l l that F is the distribution of the incubation time f r o m H I V infection to A I D S diagnosis. Usually external data other than the A I D S incidence data at hand are used i n the procedure for estimating F. There are several problems w i t h that procedure. F i r s t , exact infection times are known only for a few highly selected groups of individuals. So usually F is estimated f r o m specific cohort(s) [3] and then plugged into the backcalculation model for a certain (or general) H I V infected population as the "true" incubation distribution. T h e applicability of F obtained f r o m a cohort (a particular subpopulation) to a more general population is doubtful. In fact, Bacchetti et al [2] conclude f r o m their analysis that the incubation distributions across the cohorts under study are quite different, resulting i n very different estimates of H I V infection curves. Second, the changes of the definition of an A I D S case [1] (once i n 1985 and then again i n 1987 i n the U n i t e d States) further complicate the estimation of the incubation distribution. T h i r d , under-reporting and reporting delays also affect the estimation of the incubation distribution. In contrast, the local linear forecasting estimator has only one goal: to forecast the number of new A I D S cases. It assumes that the expectation of the adjusted A I D S incidence data reflects the true level of the underlying A I D S epidemic, i.e., E(Yi\ti) — m(ti) (with m the true level of the underlying A I D S epidemic) and uses the A I D S counts 86 (adjusted when necessary) directly. T h e local linear forecasting estimator does not use any external data other than the A I D S data. One might think that backcalculation should provide good estimates since it employs b o t h external data and the A I D S data. However, as shown i n the simulation study described i n the next section, it does not produce better forecasts than the local linear forecasting estimator that uses the A I D S data only. In a d d i t i o n , it w i l l be shown that backcalculation can not be used to produce a good estimate of the H I V infection curve up to present. T h e next two sections w i l l contain discussions of the performance of the two methods. 6.2 A small simulation study T h i s simulation study has two objectives: 1. to compare the forecasts of the number of new A I D S cases by backcalculation and the local linear forecasting estimator, 2. to investigate how the violation of either of the two assumptions about the A I D S incidence data and the incubation distribution function F w i l l affect the results of backcalculation. To achieve the two objectives, the performance of b o t h approaches is investigated i n the simplest set-up: Y{ = m(ti) + e,- where the e;s are independent n o r m a l r a n d o m variables w i t h standard deviation a: A l t h o u g h the assumption of independent n o r m a l errors may not be realistic for the A I D S count data, it serves to show the statistical weakness of the backcalculation approach and simplifies the computation. T h i s w i l l use the simplest model for F: fictitious example an incubation time distribution function w h i c h is independent of the infection t i m e , i.e., F(t — s\s) = F(t — s). 87 T h e simulated data are Yi = m(ti) + e,-, i = 1 , . . . , 20, where the e,s are i . i . d . normal variables w i t h mean zero and variance 0.25, ti = i/20, m(£;) = JQ I(s)(F(ti F(U-i — s) — - s))ds, I(s) = - 2 0 0 ( 5 + l ) ( s - 2) and F(t) = 1 - e~*. One hundred simulated data sets are generated f r o m the above model. Figure 6.1 shows the incubation function F and the H I V infection rate I. To attain the first objective, m ( l + A ) , an estimate of m ( l + A „ ) , is calculated n by both approaches for each of the simulation data sets for A n = 0.1 and A n = 0.2. A s explained i n the last section, the "true" F should be taken w i t h a grain of salt. It w i l l be interesting to see how the forecasts are affected b y the assumed f o r m of F. Therefore for the backcalculation approach, m ( l + A ) is estimated under a few n assumed forms of F, the true F: F(t) = 1 — e w i t h ft = 4,1.1,0.9,0.5. _ t and wrong J P S : F(t) = 1 — e~ ^ f Since the true ft is 1, ft = 4 or 0.5 is grossly wrong while ft = 1.1 or 0.9 is pretty close to the t r u t h . T h e application of backcalculation w i t h each F to the 100 simulated data sets yields 100 m ( l + A„)s. I n backcalculation, the smoothing parameter A is chosen by cross-validation (Section 4.2.2). T h e application of the local linear forecasting estimator to the same data also yields 100 m ( l + A ) s , but n not depending on the assumed form of F. M e t h o d FCV is used to choose the b a n d w i d t h for the local linear forecasting estimator because simulations i n the last chapter suggest that FCV has the best performance among the four competing b a n d w i d t h estimating procedures. For each set of 100 m ( l + A ) s , the average, the standard deviation (s.d.) n and the mean squared error (m.s.e.) are computed and displayed i n Table 6.1. For this fictitious example, the mean squared errors can be calculated since the true value of m ( l + A „ ) is known. Results i n Table 6.1 show that the local linear forecasting estimator always gives forecasts w i t h smaller mean squared errors t h a n backcalculation. Comparison of the averaged forecasts and standard deviations suggests that forecasts b y backcalculation 88 Time 0.0 0.2 0.4 0.6 0.8 1.0 Time Figure 6.1: T h e H I V infection function I, I(s) = - 2 0 0 ( 5 + l ) ( s - 2) and the incubation function F, F(t) = 1 — e~* used i n the generation of the fictitious A I D S data. 89 Table 6.1: S u m m a r y of forecasts by backcalculation and local linear regression for 100 realizations of the fictitious A I D S data. method backcalculation average of forecasts (s.d.) (m.s.e.) assumed A = 0.2 incubation F A 1 - e-*/ 14.27 (1.58) (2.52) 14.85 (3.64) ( 13.37) 1 - e-'l - 14.10 (2.55) (6.50) 14.23 (7.42) (55.13) 1 — e~ , (true) 14.45 (2.94) (8.75) 15.52 (7.60 ) (58.80) 1 - e-</°- 9 14.31 (1.45) (2.14) 14.90 (3.24) ( 10.66) 1 - e-'/°- 5 13.59 (8.10) (65.88) 12.49 (27.00) (733.04) 4 1 1 f = 0.1 n n local linear FCV NA 14.58 (0.62) (0.61) 15.51 (0.96) (1.95) true value 1-e"< 14.11 14.50 are highly variable. For example, for A n = 0.1, even when the "true" F is used i n the backcalculation, the standard deviation (s.d.) of 100 forecasts by backcalculation is 2.94, which is m u c h higher than the corresponding standard deviation of 0.62 for the local linear forecasting estimator. Results of backcalculation when the "true" F is used suggest that forecasts should not be made based on the extrapolated I. Estimates of the conservative lower bounds for the numbers of new A I D S cases, LB n of (1.6), are also calculated according to formula (1.7) for the backcalculation approach from the same simulated data. T h e average and the standard deviation of the 100 LB s n 90 Table 6.2: S u m m a r y of lower bounds of forecasts by backcalculation for 100 realizations of the fictitious A I D S data. method backcalculation average of LB assumed n incubation F A 1 - e-'/ 13.51 (0.89) 13.18 (0.87) 12.89 (1.26) 11.78 (1.15) 1 — e , (true) 12.76 (1.21) 11.55 (1.09) 1 - e"*/ - 9 12.71 (0.83) 11.38 (0.75) g-t/0-5 11.87 (0.54) 9.72 (0.44) 4 1 - e"*/ 1 1 _ t 0 1 _ n = 0.1 A (s.d.) n = 0.2 true lower b o u n d 1 — e , (true) 12.69 11.48 true value 1 — e~ , (true) 14.11 14.50 _ t f for each F are summarized i n Table 6.2. T h e true lower bounds given b y formula (1.6) are calculated w i t h F the "true" distribution used i n the generation of the simulation data. T h e s.d.'s of the lower bounds i n Table 6.2 are m u c h smaller than the s.d.'s of forecasts based on extrapolated 7s i n Table 6.1. T h i s confirms that the problem shown i n Table 6.1 is due to the extrapolation of I. To attain the second objective of studying the sensitivity of backcalculation to the two parameters: the assumed f o r m of the incubation function F and the presence of noise i n the YiS, each parameter is allowed to vary separately. F i r s t , estimates of I are backcalculated based on three similar Fs from the Y^s w i t h no error. T h e three Fs are 91 F(t) = 1 - e~ t/p w i t h ft = 1.1,1 (true) and 0.9 (see Figure 6.2). A larger value of ft means a longer incubation period. T h e resulting I based on Fs w i t h ft = 1.1 (long 1 (true F) and 0.9 (short F) are plotted together w i t h the true I i n Figure 6.2. F), Note that the true I is fully recovered when the true F is used i n the backcalculation and when there are no errors i n the YiS. To study the sensitivity of backcalculation to the presence of errors i n the YiS, estimates of I are backcalculated using the true F (the one used to generate the simulation data, ft = 1) for different realizations of the errors i n the Y{S w i t h a always equal to 0.5. T h e first plot of / i n Figure 6.3 shows that I is fully recovered when there is no error i n the YiS. T h e other three 7s i n Figure 6.3 are backcalculated f r o m three realizations of the YiS w i t h errors. These three plots of I show that when there are errors i n the l^s, the backcalculated Is are highly variable. To get a better idea of the variability of the backcalculated 7s when the YiS contain A A error, Figure 6.4 summarizes the 100 Is by plotting the averaged 7, the 25th quantile, the 75th quantile and the true I. T h e simulation results show that the results produced by the backcalculation approach are sensitive to the assumed f o r m of the incubation distribution F and to the presence of errors i n the YiS and that the forecast based on the extrapolated / is highly variable (even when the true F is used). In comparison, the local linear estimator gives better forecasts by the criterion of MSE. Furthermore, the local linear estimator is m u c h faster to compute and therefore is less expensive. 92 Time Figure 6.2: T h e sensitivity of / to the assumption of F i n the fictitious A I D S data (a = 0). True F(t) = 1 - e~\ long Fit) = 1 - e"*/ - and short F(t) = 1 - e " ' / ' . 1 93 1 0 9 Time Time Figure 6.3: T h e sensitivity of I to the presence of the errors (a — 0.5) i n the fictitious A I D S data when the true F is used. T h e first plot shows that I is recovered f r o m the error-free A I D S data while the rest show that / is highly distorted by the presence of errors i n the A I D S incidence data. T h e solid curve (-) i n each plot is the true I and the dotted line (...) the backcalculated / . 94 0.0 0.2 0.4 0.6 0.8 Time Figure 6.4: The variability of backcalculated I. 95 1.0 6.3 Discussion of the methods and the simulation results T h i s section w i l l offer theoretical insights on how the violation of either of the two assumptions on the A I D S incidence data and the incubation function F w i l l make backcalculation perform poorly. T h e reason that backcalculation is highly sensitive to inaccuracies i n both the A I D S incidence data and the incubation function F is that backcalculation is a mathematically ill-posed problem. E q u a t i o n (1.4) m(t) = j T I{s) (F(t -s\s)-F(t-^- s|s)) ds. is a V o l t e r r a equation of the first k i n d [22]. For details on Volterra equations, see [22]. A c c o r d i n g to [22], a linear Volterra equation of the first k i n d is defined as /' K(t,s)I(s)ds Jo (6.1) = u(t), 0<t<T while a linear Volterra equation of the second k i n d is defined as I(t) - I* K(t,s)I{s)ds Jo = g{t), 0<t<T. (6.2) T h e purpose is to solve these equations for I. A n equation is well-posed if a small change i n parameters, e.g. i n K or u i n (6.1), or i n K. or g i n (6.2), causes only a small change i n the solution I [22]. E q u a t i o n (6.1) is not a well-posed problem. Volterra equations of the second k i n d have been studied more extensively and are understood better than those of the first k i n d . Roughly speaking they are relatively well-posed i n contrast to those of the first k i n d . A Volterra equation of the first k i n d may be converted to one of the second k i n d by differentiating (6.1): K(t,t)I(t) + £ ™^ll( )ds s = u\t). 96 (6.3) If K(t,t) does not vanish i n 0 < t < T, then dividing (6.3) by K(t,t) yields a standard V o l t e r r a equation of the second k i n d and can be solved b y existing algorithms. T h e exact knowledge of u (consequently the exact knowledge of u') makes the equation (6.3) possibly well-posed. Under regularity conditions, a continuous solution of equation (6.1) exists uniquely and is the same as the continuous solution of equation (6.3). E v e n though V o l t e r r a equations of the second k i n d are relatively more tractable i n theory, solving t h e m numerically is known to be hard. However, i n the backcalculation approach, (6.4) K(s,t) = F(t - s\s) - F(t - l/n - s\s), w i t h F being the incubation function, so K(t,t) = 0 for any t. Thus (6.3) is again an equation of the first k i n d . A c c o r d i n g to [22], a V o l t e r r a equation of the first k i n d can be converted into one of the second k i n d if K and u are sufficiently differentiable. T h i s result is not very useful i n practice because " u " (e.g. the functional form of m i n backcalculation) is the very unknown that is of interest to us. A s mentioned earlier only the exact knowledge of " u " may help convert equations of the first k i n d to those of the well-posed second k i n d . Suppose that (1.4) can be converted into an equation of the second k i n d by differentiating. T h e n one might propose to solve the original equation by instead solving the resulting equation. Since i n reality m is unknown, one needs to estimate derivatives of m f r o m the raw data. It is well known that accurate estimation of derivatives of m from the raw data is h a r d , since errors inherited f r o m the original data propagate (in terms of mean squared error). Those errors would be further amplified i n I due to the ill-posed nature of the problem. Penalizing 7, e.g., controlling the magnitude of $ I"(s) ds when 2 backcalculating 7, does not change the ill-posed nature of the problem. T h i s is well manifested by the simulations i n the last section. A small amount of error added to m can distort the resulting estimate of I completely. See Figure 6.3. O f course, all of above 97 discussions about V o l t e r r a equations assumes that u and K are k n o w n at a l l values of t, not just at the t,s. Smoothing the data {li}™ prior to applying the backcalculation procedure does not help either. T h i s can be explained by the ill-posed nature of the problem. T h e ill-posed nature of backcalculation is also manifested i n the sensitivity of backcalculation to the assumed form of the incubation function F. See Figure 6.2. Figure 6.2 also exhibits a phenomenon which Brookmeyer [1] calls "nonidentifiabili t y " of I and F: for the same set of " A I D S incidence" data if a long incubation period is assumed (e.g., ft = 1.1), backcalculation produces an inflated estimate of the infection rate i n order to m a t c h the observed incidence. O n the other h a n d , if a short incubation period is assumed (e.g., ft = 0.9) backcalculation underestimates the infection rates. Bacchetti [2] makes the same observation. R e c a l l that one major goal of the backcalculation approach is to reconstruct timate) the H I V infection curve /(•) f r o m the A I D S incidence series. (es- However, this nonidentifiability, accompanied by the unfortunate m a t h e m a t i c a l ill-posed nature of backcalculation, makes backcalculation unfit for the task of reconstructing the past to present H I V infection curve I. It is also important to note that backcalculation is l i m i t e d to estimating only a proportion of H I V infections. E v e n though there is a strong correlation between H I V infection and A I D S disease, not every H I V infected person w i l l develop A I D S i n his lifetime. T h e proportion of people who are infected w i t h H I V virus and eventually develop A I D S is unknown. T h i s is another reason that backcalculation can not recover the whole picture of the historical H I V infection pattern. In short, one can not expect to get a reliable H I V infection curve f r o m backcalculation unless the true F and accurate A I D S incidence data (after correction for under-reporting and reporting delays) are available. Other methods must be devised to estimate the H I V 98 infections. Brookmeyer [1] believes that greater improvements i n reconstructing the H I V i n fection rates may come f r o m empirical data on recent infection rates rather than f r o m smoothing procedures or parametric models for I using backcalculation. A s the A I D S epidemic continues, more and more data on H I V seroprevalence are available f r o m crosssectional surveys (since the mid-1980s) and longitudinal cohort studies. Thus obtaining infection rates directly has become possible. Brookmeyer [4] [5] has proposed new approaches based on cross-sectional surveys and cohort studies i n his recent research. T h o u g h the proposed local linear estimator does not attempt to estimate the H I V infections, it does a better job at forecasting than backcalculation. T h e local linear estimator w i l l be applied to two real data sets from Canada and the U n i t e d K i n g d o m i n the next chapter. 99 Chapter 7 Data analysis In this chapter, forecasts using the local linear forecasting estimator and Healy and T i l l e t t ' s m e t h o d [20] w i l l be compared to the corresponding true or corrected numbers of new A I D S cases. T h e two data sets at hand are the C a n a d i a n A I D S data, and the U K A I D S data used by Healy and T i l l e t t [20]. For a given data set, a few recent observations w i l l be left out and treated as " f u t u r e " , unknown observations. "Forecasts" by the two methods w i l l be made using the rest of the data and compared to the left out data, actual values of these "future" observations. 7.1 Data, methods and results T h e C a n a d i a n A I D S data are provided by the Laboratory Centre for Disease Control. A l t h o u g h the national quarterly numbers of new A I D S cases are available f r o m the fourth quarter of 1979 to the first quarter of 1996, quarterly numbers prior to the second quarter of 1991 are assumed to be free of reporting delays. T h i s agrees w i t h previous work [23] which considered reporting to have ceased w i t h i n six years. Since correcting for reporting delays and under-reporting is beyond the scope of this thesis, the subset 100 of the C a n a d i a n A I D S cases that were diagnosed between the beginning of the fourth quarter of 1979 and the end of the first quarter of 1990 and reported w i t h i n six years of diagnosis w i l l be used. These data w i l l be treated as being free of reporting delays and under-reporting, i.e, these data w i l l be assumed to represent the true numbers of new A I D S cases. T h e Canadian data are displayed i n Table 7.1 w i t h quarters of a year labelled Q1-Q4. Table 7.1: Quarterly numbers of A I D S cases reported w i t h i n six years of diagnosis i n Canada, 79Q4-90Q1. 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 Ql NA 0 2 6 17 34 65 125 190 267 351 372 Q2 NA 4 2 4 16 36 82 147 223 254 317 NA Q3 NA 0 2 7 13 39 99 163 261 295 350 NA Q4 1 0 2 6 17 50 117 186 261 304 328 NA T h e U K data as shown i n Table 7.2 are taken f r o m Table 6 i n [20] i n which the m o n t h l y numbers of reported new A I D S cases and Healy and T i l l e t t ' s estimates of number of unreported new A I D S cases are displayed. Healy and T i l l e t t [20] i m p u t e d the estimates of numbers of unreported new A I D S cases based on the delay distribution estimated f r o m the data f r o m 1984 to 1986. Thus the corrected data are the sum of the reported number and the estimate of the number of unreported new A I D S cases. T h e C a n a d i a n data and the corrected U K data are plotted i n Figure 7.1. To test our forecasting m e t h o d , for the Canadian data, we consider 88Q1 as the present time 101 Table 7.2: M o n t h l y numbers of A I D S cases reported to the end of September 1987 i n U K . For t i m e periods M a y 1986 to September 1987, the first number is the reported and the second number i n parenthesis is the estimate of number of unreported new A I D S cases. 1983 1984 1985 Jan 1 1 6 16 25 33(7.5) Feb 0 0 5 15 30 44(9.7) Mar 2 2 8 18 26 35(12.2) Apr 1 0 4 16 35 35(15.8) May 0 1 5 14 38(0.1) 25(19.8) Jun 0 1 6 12 27(0.4) 48(25.9) Jul 1 5 10 17 24(0.9) 27(32.8) Aug 1 4 14 24 29(1.4) 27(44.2) Sep 1 3 7" 23 39(2.3) 14(66.1) Oct 1 0 14 25 34(3.3) NA Nov 1 5 8 22 38(4.3) NA Dec 2 7 16 21 46(5.8) NA 102 1986 1987 1982 and for the U K data we consider December 1986 as the present time. In each plot, time has been rescaled so that the data w i t h t i n [0,1] are " k n o w n " and the d a t a w i t h t > 1 are " u n k n o w n " and w i l l be predicted. For the Canadian data, data w i t h t i n [0,1] are quarterly numbers of new A I D S cases from the fourth quarter of 1979 (79Q4) to the first quarter of 1988 (88Q1) and are used to "forecast" the quarterly numbers of new A I D S cases f r o m 88Q2 to 90Q1. For the U K data, data w i t h t i n [0,1] are m o n t h l y numbers of new A I D S cases f r o m January 1982 to December 1986 and are used to "forecast" the m o n t h l y numbers of new A I D S cases from January 1987 to September 1987. Healy and T i l l e t t [20] fit a log-linear model i n t to the most recent two years data assuming the Yis to be Poisson counts, l o g ( £ ( y , ) ) = «o + a i * , (7.1) and then extrapolate the fit to get forecasts. T h e local linear forecasting estimator w i t h FCV, and Poisson regression using the most recent data w i l l be applied to each data set to produce forecasts for specified t i m e periods. Healy and T i l l e t t [20] chose a time span of two years i n the Poisson regression for the U K A I D S data. T h e choice of a time span of data on which Poisson regression is computed is equivalent to the choice of a b a n d w i d t h i n local regression. To show that forecasts are highly dependent on the time span, that is, the b a n d w i d t h , Poisson regression w i l l be computed using data from the most recent half year, one year and two years and then the fit w i l l be extrapolated to y i e l d forecasts. T h i s choice of the spans of "most recent" data is arbitrary. T h e forecasts w i l l then be compared to the corresponding true numbers of new A I D S cases for the C a n a d i a n data or to the corresponding corrected numbers of new A I D S cases for the U K data. Tables 7.3-7.4 display the "forecasts" by the local linear m e t h o d and by Poisson regression using the most recent data w i t h the three time spans. T h e o p t i m a l bandwidths estimated by FCV are rescaled i n years so that it is easy to see the " t i m e span" that 103 < 8 S3 8 1 at Time (0.0=7903,1.0=8801) Time (0.0=Dec 81,1.0=Dec 86) ure 7.1: Canadian AIDS data and U K AIDS data: forecasting beyond t 104 the local linear forecasting estimator uses. T h e average of the squared forecasting errors (ASFE), displayed i n the last line of these tables, serves as a measure of the precision of forecasts by each method. For the local linear method, the FCV score and estimated b a n d w i d t h (in years) are also displayed. For the local linear forecast at t = 1 + A n for each A „ , AMSE(8) w i t h 8 = can be used as a measure of accuracy of the forecast. Recall that FCV — a 2 moscedastic. is approximately AMSE, A /h n opt However, AMSE is unknown. provided that errors i n Ys are ho- Since Figure 7.1 shows that the assumption of homoscedastic errors is reasonable for the Canadian data w i t h t i n [0,1] and for the U K data w i t h t i n [0.5,1.0], the R i c e estimator can be applied to estimate a i n each case. T h e R i c e estimator as i n 2 (4.13), when applied to the Canadian data i n [0,1] and the U K data i n [0.5,1.0] yields 177 and 14, respectively. A l t h o u g h the distribution of FCV is unknown, FCV scores are displayed as an indication of forecasting errors. Table 7.3 shows that the local linear forecasting estimator w i t h FCV gives the closest forecasts for the C a n a d i a n data, w i t h the smallest ASFE, 716. Poisson regression from data of different t i m e spans clearly gives very different forecasts, showing that the choice of a t i m e span of most recent data is a practical problem. T h i s sensitivity to the choice of the b a n d w i d t h is usual for smoothing methods. Poor performance of Poisson regression to the C a n a d i a n data also suggests the possibility that Y{S m a y not be Poisson and thus a specific assumption on the distribution of YiS m a y not be justifiable for different A I D S data sets. Table 7.4 shows that Poisson regression using data of the most recent year gives the closest forecasts, w i t h the smallest ASFE, 62. However, Poisson regression using data of the most recent half year gives the worst result, w i t h the largest ASFE, 2389. In comparison, the local linear forecasting estimator w i t h FCV gives decent forecasts w i t h ASFE, 129. Figure 7.2 shows the forecasts by the local linear method and the forecasts 105 Table 7.3: Forecasts (ms) for 8 quarters, 88Q2 to 90Q1, on C a n a d i a n A I D S data (79Q488Q1) by local linear forecasting estimator ( L L ) w i t h FCV, and b y Poisson regression ( P R ) using most recent data respectively. T h e o p t i m a l b a n d w i d t h estimated b y FCV is rescaled i n years and denoted by h y . T h e FCV score and h y r 0 pt opt are displayed together i n parenthesis. T h e <r estimated b y the R i c e estimator is 177. 2 Actual # Method Time rh by P R on data f r o m most recent m by L L m (FCV,h ) y r 0 pt 0.5 year 1.0 year Y 2.0 years 88Q2 293 (64, 3.2) 273 288 311 254 88Q3 315 (483, 2.4) 279 303 341 295 88Q4 309 (234, 0.9) 286 319 373 304 89Q1 349 (312, 1.0) 292 337 408 351 89Q2 370 (403, 1.2) 299 355 446 317 89Q3 346 (685, 0.9) 306 374 488 350 89Q4 359 (1816, 0.9) 313 394 534 328 90Q1 371 (2474, 0.9) 320 415 584 372 716 1191 1222 17033 / ASFE = £(m - y) /8 2 106 Table 7.4: Forecasts (ms) for 9 months, January 1987 to September 1987, on U K A I D S data (January 1982-December 1986) b y local linear forecasting estimator ( L L ) w i t h and by Poisson regression ( P R ) using most recent data respectively. T h e o p t i m a l FCV, b a n d w i d t h estimated by FCV is rescaled i n years and denoted by h J . T h e FCV score y pt and h J are displayed together i n parenthesis. T h e a estimated by the R i c e estimator y 2 pt is 14. Delay- Method Time m by P R on data f r o m most recent rh by L L m (FCV,h% ) t 0.5 year 1.0 year corrected # 2.0 years yc 87Jan 53 (54, 0.6) 58 47 47 41 87Feb 45 (60, 1.4) 66 49 49 54 87Mar 48 (57, 1.3) 75 52 51 47 87Apr 49 (53, 1.3) 85 55 54 51 87May 52 (44, 1.2) 96 58 57 45 87Jun 55 (27, 1.1) 109 61 60 74 87Jul 57 (20, 1.0) 124 64 63 60 87Aug 59 (46, 1.0) 141 67 66 71 87Sep 61 (68, 0.9) 160 71 69 80 129 2389 62 65 / ASFE E(m - — y ) /9 c 2 107 by the Poisson regression w i t h the smallest ASFE. Poisson regression w i t h the smallest ASFE For the C a n a d i a n A I D S data, the uses the data f r o m the most recent half year; for the U K A I D S data the Poisson regression w i t h the smallest ASFE uses the data f r o m the most recent year. T h e plots i n Figure 7.2 confirm our observation f r o m Tables 7.3 7.4 that the local linear method gives very decent forecasts for b o t h data : sets. 108 I* E o C to P Time (0.0=7903.1.0=88Q1) S co a < S S Time (0.0=Dec 81,1.0=Dec 86) Figure 7.2: Forecasts by the local linear method and by Poisson regression using the most recent data. A m o n g three sets of forecasts by Poisson regression, the set w i t h the smallest ASFE is plotted. 109 Chapter 8 Concluding remarks In this thesis, the local linear forecasting estimator is proposed as an alternative to either backcalculation or parametric regression i n the context of forecasting for A I D S data. T h e asymptotic theory of this estimator is developed and applied to the automatic implementation of this m e t h o d , i.e., the data driven estimation of an o p t i m a l b a n d w i d t h for forecasting. For the estimation of an o p t i m a l b a n d w i d t h , two plug-in procedures and two cross-validation procedures are investigated theoretically and by simulations. S i m ulations clearly show the advantage of FCV and HAMSE, over the two plug-in procedures, and the other cross-validation procedure, BCV. CAMSE T h e simulation study of Chapter 6 shows that the local linear method provides better forecasts than the backcalculation approach and that backcalculation can not accomplish the two tasks, that is, to provide a reasonable estimate of the H I V infection curve and to provide forecasts of A I D S incidence. Analyses of the Canadian A I D S data and the U K A I D S data show that the local linear method forecasts well. In addition, these analyses expose the practical problem of choosing an appropriate size of a time span of most recent data for use i n parametric regression. T h e theoretical results for the local linear forecasting estimator are derived based on 110 the assumption of independent data. T h i s assumption is not entirely realistic and the statistical properties of the local linear forecasting estimator using correlated data and the estimation of an o p t i m a l b a n d w i d t h w i l l be interesting topics for future research. In the literature of A I D S research, m a n y authors (e.g., [1], [20] and [23]) assumed the numbers of A I D S cases to be Poisson. For example, Bacchetti [1] assumed that the numbers of people infected w i t h H I V follow a nonhomogeneous Poisson process. Under appropriate assumptions, the process of numbers of A I D S cases can be shown to be Poisson. It would be interesting to simulate the H I V infection process and the incubation distribution to get simulated A I D S counts and then to analyse the simulated A I D S counts by the backcalculation approach and the local linear forecasting estimator. A n o t h e r interesting extension of the local linear forecasting estimator is to include covariates i n the regression, for example, sexual practices, age and gender, and to estimate the expected numbers of new A I D S cases among a group w i t h specific covariate values. Moreover, construction of point-wise confidence intervals of the forecasts by the local linear method is an interesting problem. Ill Bibliography [1] Bacchetti, P . , Segal, M . R . , and Jewell, N . P . (1993), "Backcalculation of H I V Infection Rates," Statistical Science 8, 82-119. [2] Bacchetti, P . , Segal, M . R . , Hessol, N . A . , and Jewell, N . P . (1993), "Different A I D S Incubation Periods and T h e i r Impacts on Reconstructing H u m a n Immunodeficiency V i r u s Epidemics and Projecting A I D S Incidence," Proc. Natl. Acad. Sci. USA 90, 2194-2196. [3] Brookmeyer, R . (1991), "Reconstruction and Future Trends of the A I D S E p i d e m i c i n the U n i t e d States," Science 2 5 3 , 37-42. [4] Brookmeyer, R . , and Q u i n n , T . C . (1995), " E s t i m a t i o n of Current H u m a n I m m u n odeficiency V i r u s Incidence Rates from a Cross-sectional Survey U s i n g E a r l y Diagnostic Tests," American Journal of Epidemiology 1 4 1 , 166-172. [5] Brookmeyer, R . , Q u i n n , T . C , Shepherd, M . , et a l (1995), " T h e A I D S E p i d e m i c i n India: A N e w M e t h o d for E s t i m a t i n g Current H u m a n Immunodeficiency V i r u s ( H I V ) Incidence Rates," American Journal of Epidemiology 1 4 2 , 709-713. [6] C o h e n , P . T . , Sande, M . A . , and Bolberding, P . A . (Editors), The AIDS Knowledge Base (2nd E d , 1994), L i t t l e , B r o w n and Company. 112 [7] F a n , J . , and Gijbels, I. (1994), "Data-driven B a n d w i d t h Selection i n L o c a l Polynom i a l F i t t i n g : Variable B a n d w i d t h and Spatial A d a p t a t i o n , " JRSSB, to appear. [8] F a n , Jianqing (1992), "Design-adaptive Nonparametric Regression," JASA 87, 9981004. [9] G a i l , M . H . , and Brookmeyer, R . (1988), "Methods for P r o j e c t i n g Course of A c quired Immunodeficiency Syndrome E p i d e m i c (Review)," Journal of the National Cancer Institute 80, 900-911. [10] G a i l , M . H . , and Rosenberg, P . S . (1991), "Perspectives on U s i n g Backcalculation to E s t i m a t e H I V Prevalence and Project A I D S Incidence" manuscript. [11] Gasser, T . , Sroka, L . , and Jennen-Steinmetz, C . (1986), " R e s i d u a l Variance a n d Residual Pattern i n Nonlinear Regression, " Biometrika 73, 625-633. [12] Gasser, T . , K n e i p , A . a n d K 5 h l e r , W . (1991), " A Flexible and Fast M e t h o d for A u t o m a t i c Smoothing," JASA 86, 643-652. [13] H a l l , P . , Sheather, S.J., Jones, M . C . and M a r r o n , J . S . (1991), " O n O p t i m a l Databased B a n d w i d t h Selection i n K e r n a l Density E s t i m a t i o n , " Biometrika 78, 263-271. [14] H a r d l e , W . , and M a r r o n , J.S. (1985), " O p t i m a l B a n d w i d t h Selection i n Nonparametric Regression Function E s t i m a t i o n , " The Annals of Statistics 13, 1465-1481. [15] H a r d l e , W . (1989), Applied Nonparametric Regression, S I A M Cambridge University Press. [16] H a r t , J . D . and Wehrly, T . E . (1986), " K e r n e l Regression E s t i m a t i o n U s i n g R e peated Measurements D a t a , " JASA 81, 1080-1088. 113 [17] H a r t , J . D . (1994), " A u t o m a t e d K e r n e l Smoothing of Dependent D a t a b y U s i n g T i m e Series Cross-validation," Journal of the Royal Statistical Society, Ser. B . 5 6 , 529-542. [18] H a r t , J . D . and Y i , S. (1996), "One-sided Cross-validation," manuscript. [19] Hastie, T . and Loader, C . (1993), " L o c a l regression: A u t o m a t i c Carpentry," Statist. Sci. 8, 120-143. [20] Healy, M . J . R . , and T i l l e t t , H . E . (1988), "Short-term E x t r a p o l a t i o n of the A I D S E p i d e m i c , " JRSSA 1 5 1 , 50-61. [21] L e h m a n n , E . L . ' ( 1 9 7 5 ) , Nonparametrics: Statistical Methods Based on Ranks, Holden Day. [22] L i n z , Peter (1985), Analytical and Numerical Methods for Volterra Equations, SI A M Philadelphia. [23] M a r i o n , S. A . , and Schechter M . T . ( D e c , 1990), " E s t i m a t i o n of the N u m b e r of Persons i n C a n a d a Infected w i t h H u m a n Immunodeficiency V i r u s , " A report commissioned by the Minister of National Health and Welfare. [24] M a r i o n , S. A . , and Schechter M . T . ( M a r . , 1992), " H u m a n Immunodeficiency V i r u s Infection i n Canada: A n U p d a t e d Analysis U s i n g Backcalculation," A report commissioned by the Minister of National Health and Welfare. [25] Rice, J . (1984), " B a n d w i d t h Choice for Nonparametric Regression," The Annals of Statistics 1 2 , 1215-1230. [26] R u p p e r t , D . , Sheather, S . J . , and W a n d , M . R (1993), " A n Effective B a n d w i d t h Selector for L o c a l Least Squares Regression," manuscript. 114 [27] Stone, C . J . (1982), " O p t i m a l G l o b a l Rates of Convergence for Nonparametric Regression," The Annals of Statistics 1 0 , 1040-1053. [28] W a h b a , G . (1977), " P r a c t i c a l A p p r o x i m a t e Solutions to Linear Operator Equations W h e n the D a t a A r e Noisy," SIAM J. Numer. Anal. V o l . 14, 651-667. [29] W a h b a , G . (1979), "Smoothing and 111 posed P r o b l e m s , " Solution Methods for Integral Equations with Applications, M i c h a e l Golberg ( E d i t o r ) , P l e n u m Press, 183194. [30] W a h b a , G . (1982), "Constrained Regularization for 111 Posed Linear Operator E q u a tions, w i t h A p p l i c a t i o n i n Meteorology and M e d i c i n e , " Statistical Decision Theory and Related Topics / / / V o l . 2, Shanti S. G u p t a and J . 0 . Berger (Editors), 383-418. [31] W a h b a , G . (1990), Spline Models for Observational Data, S I A M . 115
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Local linear regression versus backcalculation in forecasting
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Local linear regression versus backcalculation in forecasting Li, Xiaochun 1996
pdf
Page Metadata
Item Metadata
Title | Local linear regression versus backcalculation in forecasting |
Creator |
Li, Xiaochun |
Date Issued | 1996 |
Description | The local linear forecasting estimator is proposed in this thesis as an alternative technique to either parametric regression or the backcalculation approach in the context of forecasting for independent data. The asymptotic bias and variance of the local linear forecasting estimator are derived and used to develop procedures for the estimation of the optimal bandwidth for forecasting. Both the theoretical and the computational aspects of these procedures are explored. Simulation study shows that a cross-validation procedure has the best performance in forecasting among four bandwidth estimation procedures under study. Simulations and statistical analyses show that the backcalculation approach is very vulnerable to violations of the assumptions underlying this approach and that its application to AIDS data fails to achieve its two primary goals, to forecast the numbers of new AIDS cases and to estimate the historical HIV infection curve. To test the proposed forecasting estimator over parametric regression, both techniques are applied to the Canadian AIDS data and the UK AIDS data. The results of the two examples expose the weakness of parametric regression and show that the proposed technique does better than parametric regression in forecasting. |
Extent | 4410988 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-03-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0087838 |
URI | http://hdl.handle.net/2429/6155 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1996-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1996-147800.pdf [ 4.21MB ]
- Metadata
- JSON: 831-1.0087838.json
- JSON-LD: 831-1.0087838-ld.json
- RDF/XML (Pretty): 831-1.0087838-rdf.xml
- RDF/JSON: 831-1.0087838-rdf.json
- Turtle: 831-1.0087838-turtle.txt
- N-Triples: 831-1.0087838-rdf-ntriples.txt
- Original Record: 831-1.0087838-source.json
- Full Text
- 831-1.0087838-fulltext.txt
- Citation
- 831-1.0087838.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0087838/manifest