KERNEL ESTIMATION OF THE DRIFT COEFFICIENT OF A DIFFUSION PROCESS IN THE PRESENCE OF MEASUREMENT ERROR

by

WOOYONG LEE

B.Econ., Korea University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies

(Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

June 2014

© Wooyong Lee, 2014

Abstract

Diffusion processes, a class of continuous-time stochastic processes, can be used to model time-series data observed at discrete time points. A diffusion process can be completely characterized by two functions, called the drift coefficient and the diffusion coefficient. For the nonparametric estimation of these two functions, Bandi and Phillips (2003) proved consistency and asymptotic normality of Nadaraya-Watson kernel estimators of the drift and the diffusion coefficients.

In some cases, we observe the time-series data with measurement error. For instance, it is well known that financial time-series data are observed with measurement error (Zhou, 1996). For nonparametric estimation of the drift and the diffusion coefficients in the presence of measurement error, some work has been done on estimating the integrated volatility, which is the integral of the diffusion coefficient over a fixed period of time, but little work exists on the estimation of the drift and the diffusion coefficients themselves. In this thesis, we focus on the estimation of the drift coefficient, and we propose a consistent and asymptotically normal Nadaraya-Watson type kernel estimator of the drift coefficient in the presence of measurement error.

Preface

This thesis is an original and unpublished work of the author, Wooyong Lee, under the supervision of Dr. Nancy Heckman and Dr. Priscilla Greenwood.

The research question and the estimator were established earlier by Nancy Heckman, Priscilla Greenwood and Dr. Wolfgang Wefelmeyer.
Based on that, I have written a full proof of consistency and asymptotic normality of the estimator, proposed a criterion for choosing the appropriate bandwidths and performed a simulation study of the estimator.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Brownian Motion
  1.2 Stochastic Integration
  1.3 Stochastic Differential Equation
  1.4 Stationarity and Ergodicity
  1.5 Kernel Estimation
2 Kernel Estimation of the Drift Coefficient of a Diffusion Process in the Presence of Measurement Error
  2.1 Introduction
  2.2 Statement of the Main Result
  2.3 Comparison to the Existing Estimators
  2.4 Bandwidth Choices
    2.4.1 Choice of the kernel bandwidth h
    2.4.2 Choice of the block size r
  2.5 Simulation Study
  2.6 Proof of Theorem 2.1
    2.6.1 Structure of the proof
    2.6.2 Preliminary lemmas
    2.6.3 Proof of Equation (2.20)
    2.6.4 Proof of Equation (2.21)
3 Conclusion
Bibliography

List of Tables

2.1 The values of $\Delta_n$, $r_n$, $m_n$ and $h_n$ when $\delta = 0.9$, $\rho = 0.58$ and $\eta = 0.02$.

2.2 Means (and standard errors, i.e. standard deviations/$\sqrt{1000}$) of the integrated squared errors (ISEs) of candidate estimators over 1,000 sample paths. Labels “BPS” and “BPD” stand for the single-smoothing and the double-smoothing estimators of Bandi and Phillips (2003), respectively. Label “Avg” stands for the pre-averaging estimator. The “s” after a label means the estimator is combined with the subsampling method.

2.3 The list of Z’s and W’s for each N (j).

List of Figures

2.1 A sample path of the stochastic process defined by (2.18), with the linear drift coefficient. Label “Original” represents the process without measurement errors. Label “Contaminated” represents the process with independent $N(0, 0.002^2)$-distributed additive measurement errors.
Label “Averaged” represents the averaged contaminated process with r = 5. Label “Subsampled” represents the subsampled process having 1/5 less sampling frequency than the original process.

2.2 A sample path of the stochastic process defined by (2.19), with the nonlinear drift coefficient. Label “Original” represents the process without measurement errors. Label “Contaminated” represents the process with independent $N(0, 0.0661^2)$-distributed additive measurement errors. Label “Averaged” represents the averaged contaminated process with r = 5. Label “Subsampled” represents the subsampled process having 1/5 less sampling frequency than the original process.

2.3 Density plot of cross-validation bandwidths of the BPSs, BPDs and Avg estimators. Labels “BPSs” and “BPDs” stand for the single-smoothing and the double-smoothing estimators of Bandi and Phillips (2003), respectively, both combined with the subsampling method. Label “Avg” stands for the pre-averaging estimator. The top panel corresponds to the model (2.18), and the bottom panel corresponds to the model (2.19).

2.4 Pointwise mean squared errors (MSE) of the estimators for the model (2.18) with oracle bandwidths. Refer to the caption of Table 2.2 for definition of the labels. The “-o” indicates that the oracle bandwidths are used. Label “AMSE” represents the asymptotic mean squared error computed using the oracle bandwidth. The numbers on the vertical axis do not apply to the AMSE. The bottom panel depicts the oracle bandwidths, $h_{opt}(x)$ defined in (2.14), according to the values of x.

2.5 Pointwise mean squared errors (MSE) of the estimators for the model (2.18) with cross-validation bandwidths. Refer to the caption of Table 2.2 for definition of the labels.
The “-cv” indicates that the cross-validation bandwidths are used. Label “AMSE” represents the asymptotic mean squared error computed using the oracle bandwidth. The numbers on the vertical axis do not apply to the AMSE.

2.6 Pointwise mean squared errors (MSE) of the estimators for the model (2.19) with oracle bandwidths. Refer to the caption of Table 2.2 for definition of the labels. The “-o” indicates that the oracle bandwidths are used. Label “AMSE” represents the asymptotic mean squared error computed using the oracle bandwidth. The numbers on the vertical axis do not apply to the AMSE. The bottom panel depicts the oracle bandwidths, $h_{opt}(x)$ defined in (2.14), according to the values of x.

2.7 Pointwise mean squared errors (MSE) of the estimators for the model (2.19) with cross-validation bandwidths. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” indicates that the cross-validation bandwidths are used. Label “AMSE” represents the asymptotic mean squared error computed using the oracle bandwidth. The numbers on the vertical axis do not apply to the AMSE.

2.8 Pointwise squared biases (the top panel) and pointwise variances (the bottom panel) of the pre-averaging estimator with the oracle bandwidth (denoted by “Avg-o”) under different values of the block size r for the model (2.19). The values of r are indicated in the legend.

2.9 Values of the oracle bandwidth $h_{opt}(x)$ defined in (2.14) and the function $\Gamma_\mu(x)$ defined in Theorem 2.1. The two top panels depict values of $h_{opt}(x)$ and $\Gamma_\mu(x)$ according to the values of x for model (2.18). The two bottom panels depict the values for model (2.19).

2.10 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPSs estimates for the model (2.18) with oracle bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-o” indicates that the oracle bandwidths are used. The black solid line is the 45-degree line. 825 points out of 1,000 are above the line.

2.11 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPDs estimates for the model (2.18) with oracle bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-o” indicates that the oracle bandwidths are used. The black solid line is the 45-degree line. 733 points out of 1,000 are below the line.

2.12 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPSs estimates for the model (2.18) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” indicates that the cross-validation bandwidths are used. The black solid line is the 45-degree line. 513 points out of 1,000 are above the line.

2.13 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPDs estimates for the model (2.18) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” indicates that the cross-validation bandwidths are used. The black solid line is the 45-degree line. 782 points out of 1,000 are above the line.

2.14 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPSs estimates for the model (2.19) with oracle bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-o” indicates that the oracle bandwidths are used. The black solid line is the 45-degree line. 679 points out of 1,000 are above the line.

2.15 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPDs estimates for the model (2.19) with oracle bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-o” indicates that the oracle bandwidths are used. The black solid line is the 45-degree line. 718 points out of 1,000 are below the line.

2.16 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPSs estimates for the model (2.19) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” indicates that the cross-validation bandwidths are used. The black solid line is the 45-degree line. 549 points out of 1,000 are above the line.

2.17 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPDs estimates for the model (2.19) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” indicates that the cross-validation bandwidths are used. The black solid line is the 45-degree line. 536 points out of 1,000 are below the line.

2.18 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of the pre-averaging estimator for the model (2.18). The sum is computed along the grid of evaluation points described in Section 2.5. The “-o” and “-cv” mean the oracle and the cross-validation bandwidths are used, respectively. The black solid line is the 45-degree line. 741 points out of 1,000 are above the line.

2.19 The $\log_{10}$-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of the pre-averaging estimator for the model (2.19). The sum is computed along the grid of evaluation points described in Section 2.5. The “-o” and “-cv” mean the oracle and the cross-validation bandwidths are used, respectively. The black solid line is the 45-degree line. 852 points out of 1,000 are above the line.

Acknowledgements

I express my deep gratitude to my supervisors, Dr. Nancy Heckman and Dr. Priscilla Greenwood, and my second reader, Dr. Alexandre Bouchard-Côté. Nancy and Cindy happily invested a lot of their time in my thesis and in my academic training. Their help led to great academic improvement on my part during my stay at UBC. In particular, I learned from them how to write an academic paper, without which I could never become a good researcher. Alex taught me three courses, two of which were core courses in graduate-level statistics. Taking Alex’s courses was great fun, and what I learned from him was extremely useful in reading the technical papers cited in this thesis.

I also thank all the other faculty members and fellow graduate students in the department for introducing me to many interesting fields in statistics in their classes and talks.
It was one of the best decisions I have made to join the statistics department at UBC.

To my family

Chapter 1

Introduction

A continuous-time stochastic process can be used to model time-series data that are observed at discrete time points. For example, Felsenstein (1985) uses Brownian motion to model the evolutionary history of species, and Andersen et al. (2001) use a continuous-time semimartingale to model the variability of exchange rates.

In this thesis, we focus on a specific type of continuous-time stochastic process, namely, a diffusion process. A diffusion process is a solution to a stochastic differential equation, and it is used to describe many kinds of time-series data such as price data of financial instruments (see e.g. Aït-Sahalia, 1996). A stochastic differential equation has the form
\[ dX_t = \mu(t, X_t)\,dt + \sigma(t, X_t)\,dW_t, \]
where the function $\mu$ and the nonnegative function $\sigma$ are two deterministic functions from $[0,\infty)\times\mathbb{R}$ to $\mathbb{R}$, called the drift coefficient and the diffusion coefficient respectively, and $W_t$ is a stochastic process called Brownian motion. A solution to this stochastic differential equation with an initial value random variable $Y$ is a stochastic process $\{X_t \mid t \ge 0\}$ satisfying $X_0 = Y$ and
\[ X_t = Y + \int_0^t \mu(s, X_s)\,ds + \int_0^t \sigma(s, X_s)\,dW_s, \quad t \ge 0. \]
The solution must satisfy additional conditions, introduced later in Definition 1.6. The integral $\int_0^t \sigma(s, X_s)\,dW_s$ is an example of a stochastic integral, which we define later in Definition 1.5.

As a solution to a stochastic differential equation, a diffusion process can be completely characterized by the drift coefficient $\mu$ and the diffusion coefficient $\sigma$.
In addition, $\mu$ determines the expected value of the (random) change in $X_t$ over an infinitesimal amount of time, and $\sigma$ determines the variance of the (random) change in $X_t$ over an infinitesimal amount of time. Therefore, the statistical goal when using a diffusion model is to estimate these two functions.

The aim of this thesis is to propose a statistical method to estimate the drift coefficient nonparametrically in the presence of measurement error, in which case the discrete-time observations do not provide exact values of the latent continuous-time process. In addition, we consider the simpler form of the stochastic differential equation in which $\mu$ and $\sigma$ are functions of $X_t$ only, rather than of both $t$ and $X_t$. In this case the stochastic differential equation is said to be time-homogeneous:
\[ dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t. \tag{1.1} \]

In this chapter, we introduce background knowledge used to formally define our research question and our estimator. We first introduce Brownian motion, which is used to define the stochastic integrals considered here. Then we use stochastic integration to construct a stochastic differential equation and its solution, which is called a diffusion process. After that, we discuss stationarity and ergodicity for diffusion processes, the properties we assume in our study in later chapters. Lastly, we discuss kernel estimation, a nonparametric estimation method we use in order to estimate the drift coefficient $\mu$. The review of existing literature and the statement of our research question in relation to the literature will be given in Section 2.1.

For the background on stochastic processes, we use Karatzas and Shreve (1991), Øksendal (1992) and Kutoyants (2004). Øksendal (1992) is a textbook on stochastic integration and stochastic differential equations intended for graduate students and non-experts, while Karatzas and Shreve (1991) offer a more abstract and rigorous treatment of these areas.
Kutoyants (2004) studies statistical problems for stationary and ergodic diffusion processes. For an introduction to kernel estimation, we use Simonoff (1996) and Hardle (1990) and the references therein, which give an overview for graduate students and applied statisticians while giving further references for more advanced treatment of the subject.

1.1 Brownian Motion

We introduce Brownian motion first. The family of sub-σ-algebras $\{\mathcal{F}_t \mid t \ge 0\}$ of a σ-algebra $\mathcal{F}$ will denote a filtration with the time index $t$. All random processes are assumed to be defined on the same probability space, when required.

Definition 1.1 (Karatzas and Shreve, 1991, page 47) Defined on a probability space $(\Omega, \mathcal{F}, P)$, a stochastic process is said to be Brownian motion if it is a continuous $\mathcal{F}_t$-adapted stochastic process $\{W_t, \mathcal{F}_t;\ t \ge 0\}$ such that
1. $W_0 = 0$ almost surely, and
2. for any $0 \le s < t$, the increment $W_t - W_s$ is independent of $\mathcal{F}_s$ and is normally distributed with mean 0 and variance $t - s$.

Brownian motion is also called the Wiener process, which is why it is usually denoted by $W$. From condition 2 above, for discrete times $t_1 < t_2 < \cdots < t_n$, the Brownian motion increments $\{W_{t_2} - W_{t_1}, W_{t_3} - W_{t_2}, \ldots, W_{t_n} - W_{t_{n-1}}\}$ are independent and normally distributed random variables.

1.2 Stochastic Integration

Stochastic integration is integration with respect to a stochastic process, in contrast to Lebesgue integration, which is integration with respect to a measure. In this section, we restrict our discussion to stochastic integration with respect to Brownian motion, although stochastic integration is defined for more general classes of stochastic processes including martingales.

To understand the definition of stochastic integration, recall that the Riemann integral of an integrable function can be characterized by the limit of integrals of step functions which converge to the integrable function.
The stochastic integral is defined in a similar way: we first define the stochastic integral of a simple process, which is similar to a step function, and then we define the stochastic integral of a stochastic process as a limit of the stochastic integrals of simple processes that converge to the process in a suitable norm.

Now we introduce this construction formally. For ease of exposition, we only consider stochastic integration over the time interval $[0, T]$. Its extension to a generic time interval $[S, T]$ is straightforward. We first define a simple process and then define the stochastic integral of a simple process.

Definition 1.2 (Karatzas and Shreve, 1991, page 132) Defined on a probability space $(\Omega, \mathcal{F}, P)$, a stochastic process $\{S_t \mid t \ge 0\}$ is said to be a simple process if there exists a strictly increasing sequence of real numbers $\{t_i\}_{i=0}^{\infty}$ with $t_0 = 0$ and $t_n \to \infty$ as well as a sequence of random variables $\{\xi_i\}_{i=0}^{\infty}$ such that
\[ S_t = \xi_k \quad \text{for } t_k \le t < t_{k+1}, \]
where $\sup_{n \ge 0} |\xi_n(\omega)| \le C < \infty$ for every $\omega \in \Omega$ and $\xi_n$ is $\mathcal{F}_{t_n}$-measurable for every $n \ge 0$.

Definition 1.3 (Karatzas and Shreve, 1991, page 132) The stochastic integral of the simple process $\{S_t \mid t \ge 0\}$ with respect to Brownian motion $\{W_t \mid t \ge 0\}$ over $[0, T]$ is defined as
\[ \int_0^T S_t\,dW_t \equiv \sum_{k=0}^{N-1} \xi_k \left(W_{t_{k+1}} - W_{t_k}\right) + \xi_N \left(W_T - W_{t_N}\right), \]
where $N$ is an integer such that $t_N \le T < t_{N+1}$.

Note that the $\xi$’s and $\{W_t\}$ need not be independent and that the stochastic integral is a random variable. Having defined stochastic integrals for simple processes, the next step is to define a class of stochastic processes that are well approximated by simple processes.
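
As an informal numerical illustration of Definition 1.3 (a sketch only, not part of the formal development), the following Python code evaluates the stochastic integral of a simple process along one simulated Brownian path. The function names `brownian_path` and `simple_integral`, the partition, the value of T and the choice $\xi_k = \cos(W_{t_k})$ (bounded and depending on the path only up to time $t_k$) are hypothetical choices made for this example.

```python
import bisect
import math
import random


def brownian_path(times, rng):
    """Simulate Brownian motion at increasing `times` (times[0] must be 0).

    Uses independent N(0, t - s) increments, as in Definition 1.1.
    """
    values = [0.0]
    for s, t in zip(times, times[1:]):
        values.append(values[-1] + rng.gauss(0.0, math.sqrt(t - s)))
    return values


def simple_integral(t, xi, W, T):
    """Integral over [0, T] of the simple process equal to xi[k] on [t[k], t[k+1]).

    `W` is a callable returning the value of the Brownian path at a given time.
    """
    N = bisect.bisect_right(t, T) - 1  # largest index N with t[N] <= T
    total = sum(xi[k] * (W(t[k + 1]) - W(t[k])) for k in range(N))
    return total + xi[N] * (W(T) - W(t[N]))


# Hypothetical example: partition 0 < 0.5 < 1.0 < 1.5 and T = 1.2.
rng = random.Random(0)
t = [0.0, 0.5, 1.0, 1.5]
T = 1.2
times = sorted(set(t + [T]))
path = dict(zip(times, brownian_path(times, rng)))
xi = [math.cos(path[t_k]) for t_k in t]  # bounded; uses the path only up to t_k
value = simple_integral(t, xi, path.__getitem__, T)
```

Replacing the simulated path by the deterministic function W(s) = s turns the formula into an ordinary Riemann-Stieltjes sum, which gives a simple check of the arithmetic: with values (1, 2, 3) on the partition (0, 1, 2) and T = 2.5, the sum is 1 + 2 + 1.5 = 4.5.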
We introduce the following class of stochastic processes, $L^2([0, T])$, which is conceptually similar to the $L^2$ space of random variables.

Definition 1.4 (Øksendal, 1992, page 18) $L^2([0, T])$ is defined as the class of $\mathcal{F}_t$-adapted stochastic processes $\{V_t, \mathcal{F}_t;\ t \ge 0\}$, with $V_t$ defined on $(\Omega, \mathcal{F}, P)$, such that
\[ E\left(\int_0^T V_t^2\,dt\right) < \infty. \]

The following result states that every stochastic process in $L^2([0, T])$ can be approximated by simple processes, in an “$L^2$ norm”.

Lemma 1.1 (Øksendal, 1992, page 19) For $\{V_t \mid t \ge 0\} \in L^2([0, T])$, there exists a sequence of simple processes $\left\{\{S^{(n)}_t \mid t \ge 0\} \,\middle|\, n = 1, 2, \ldots\right\} \subseteq L^2([0, T])$ such that
\[ \lim_{n \to \infty} E\left(\int_0^T \left(V_t - S^{(n)}_t\right)^2 dt\right) = 0. \]

Then we define the stochastic integral of a stochastic process in $L^2([0, T])$ as a limit of the stochastic integrals of the simple processes that converge to the process, as follows.

Definition 1.5 (Øksendal, 1992, page 21) The stochastic integral of $\{V_t \mid t \ge 0\} \in L^2([0, T])$ with respect to Brownian motion $\{W_t \mid t \ge 0\}$ over $[0, T]$ is defined as
\[ \int_0^T V_t\,dW_t \equiv \lim_{n \to \infty} \int_0^T S^{(n)}_t\,dW_t, \]
where the limit is taken in $L^2(P)$ and $\left\{\{S^{(n)}_t \mid t \ge 0\} \,\middle|\, n = 1, 2, \ldots\right\} \subseteq L^2([0, T])$ is a sequence of simple processes as in Lemma 1.1.

One can show that this stochastic integral is well-defined, that is, that the limit is independent of the choice of the $S^{(n)}_t$’s (Øksendal, 1992, page 21). In addition, the limiting random variable has a finite second moment (Øksendal, 1992, page 21).

We conclude this section by stating some basic properties of stochastic integration. These properties are obvious when the stochastic process in the integrand is simple.
For generic stochastic processes in $L^2([0, T])$, the properties can be proven via the limit argument.

Lemma 1.2 (Øksendal, 1992, page 22) For real numbers $S \le R \le T$, $\alpha$ and $\beta$, stochastic processes $\{V_t \mid t \ge 0\}$ and $\{U_t \mid t \ge 0\}$ in $L^2([0, T])$ and Brownian motion $\{W_t \mid t \ge 0\}$, the following hold almost surely.
\[
\begin{aligned}
\int_S^T V_t\,dW_t &= \int_S^R V_t\,dW_t + \int_R^T V_t\,dW_t, \\
\int_S^T (\alpha V_t + \beta U_t)\,dW_t &= \alpha \int_S^T V_t\,dW_t + \beta \int_S^T U_t\,dW_t, \\
E\left(\int_S^T V_t\,dW_t\right) &= 0, \\
E\left(\left[\int_S^T V_t\,dW_t\right]^2\right) &= E\left(\int_S^T V_t^2\,dt\right).
\end{aligned}
\]

1.3 Stochastic Differential Equation

In this section, we formally introduce stochastic differential equations and their solutions, which are called diffusion processes. Then, in the next section, we define stationarity and ergodicity for diffusion processes, the properties we assume in our study in later chapters. As stated earlier, we consider the time-homogeneous stochastic differential equation given in (1.1).

Let $\{W_t \mid t \ge 0\}$ be Brownian motion defined on a probability space $(\Omega, \mathcal{F}, P)$ and $Y$ be a real-valued random variable also defined on $(\Omega, \mathcal{F}, P)$ and independent of the Brownian motion. We define an augmented filtration $\mathcal{F}_t$, $t \ge 0$, based on the filtration
\[ \mathcal{G}_t \equiv \sigma\left(Y, \{W_s \mid 0 \le s \le t\}\right), \quad t \ge 0, \]
and the collection of all subsets of measure zero:
\[ \mathcal{N} \equiv \left\{ N \subseteq \Omega \,\middle|\, \exists\, G \in \bigcup_{t \ge 0} \mathcal{G}_t \text{ with } N \subseteq G \text{ and } P(G) = 0 \right\}. \]
Then the augmented filtration $\mathcal{F}_t$, $t \ge 0$, is defined by
\[ \mathcal{F}_t \equiv \sigma(\mathcal{G}_t \cup \mathcal{N}), \quad t \ge 0. \]

With this notation, we define a (strong) solution to a stochastic differential equation as follows.

Definition 1.6 (Karatzas and Shreve, 1991, page 285) A strong solution $\{X_t \mid t \ge 0\}$ to the stochastic differential equation in (1.1) on the probability space $(\Omega, \mathcal{F}, P)$ with respect to Brownian motion $\{W_t \mid t \ge 0\}$ and the initial value random variable $Y$ is defined as a stochastic process with continuous sample paths such that
1. $X_t$ is adapted to the augmented filtration $\mathcal{F}_t$.
2. $P(X_0 = Y) = 1$.
3. $P\left(\int_0^t \left(|\mu(X_s)| + \sigma^2(X_s)\right) ds < \infty\right) = 1$ for every $0 \le t < \infty$.
4. The following holds almost surely for all $t \in [0, \infty)$:
\[ X_t = X_0 + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s. \]

In contrast, a weak solution is a stochastic process that has the same distribution as a strong solution but that is not necessarily adapted to the augmented filtration $\mathcal{F}_t$, that is, not necessarily a function of $\{W_t\}$ and $Y$ both defined on $(\Omega, \mathcal{F}, P)$.

As in the study of ordinary differential equations, we are interested in existence and uniqueness conditions for solutions of stochastic differential equations. First, the following defines the uniqueness of a strong solution.

Definition 1.7 (Karatzas and Shreve, 1991, page 286) We say strong uniqueness holds for the pair $(\mu, \sigma)$ if, when $\{X_t \mid t \ge 0\}$ and $\{Y_t \mid t \ge 0\}$ are both strong solutions to the stochastic differential equation in (1.1) with the initial value random variable $Z$, we have $P(X_t = Y_t;\ t \ge 0) = 1$.

One well-known condition that ensures existence of the unique strong solution is the following.

Theorem 1.1 (Øksendal, 1992, page 48) Suppose that the initial value random variable $Y$ is independent of Brownian motion $\{W_t \mid t \ge 0\}$ and satisfies $E(Y^2) < \infty$. Also suppose that there exist constants $C$ and $D$ such that, for every $x, y \in \mathbb{R}$,
\[ |\mu(x) - \mu(y)| + |\sigma(x) - \sigma(y)| \le C|x - y|, \quad \text{and} \quad |\mu(x)| + |\sigma(x)| \le D(1 + |x|). \]
Then the stochastic differential equation in (1.1) has a unique strong solution.

There are other conditions that give the existence and uniqueness of a strong solution. For example, as Bandi and Phillips (2003, page 244) point out, if $\mu$ and $\sigma$ are twice continuously differentiable and if $\sigma^2(x) > 0$ for all $x$, then a unique strong solution exists by the following theorems in Karatzas and Shreve (1991): Theorem 2.5 (page 187), Theorem 5.15 (page 341) and Corollary 3.23 (page 310).

1.4 Stationarity and Ergodicity

In this section, we define stationarity and ergodicity of diffusion processes, which we assume in later chapters.
In order to define them, we first define recurrence, positive recurrence and null recurrence, which are defined not only for a diffusion process but also for a generic real-valued stochastic process.

Definition 1.8 (Kutoyants, 2004, page 39) Let $\{V_t\}$ be a real-valued stochastic process, and let $\tau_a \equiv \inf\{t \ge 0 : V_t = a\}$ and $\tau_a^b \equiv \inf\{t \ge \tau_a : V_t = b\}$. We define $\inf \emptyset \equiv \infty$.
1. The process $\{V_t\}$ is said to be recurrent if $P(\tau_a^b < \infty) = 1$ for all $a, b \in \mathbb{R}$.
2. The process $\{V_t\}$ is said to be positive recurrent if it is recurrent and $E(\tau_a^b) < \infty$ for all $a, b \in \mathbb{R}$.
3. The process $\{V_t\}$ is said to be null recurrent if it is recurrent and $E(\tau_a^b) = \infty$ for all $a, b \in \mathbb{R}$.

For a strong solution of the time-homogeneous stochastic differential equation given in (1.1), there are conditions on $\mu$ and $\sigma$ that are related to recurrence, positive recurrence and null recurrence of the corresponding strong solution. Below we give a necessary and sufficient condition on $\mu$ and $\sigma$ for a strong solution to be recurrent or positive recurrent. Note that we only have a sufficient condition (but not a necessary condition) for null recurrence.

Lemma 1.3 (Kutoyants, 2004, page 40) A strong solution $\{X_t \mid t \ge 0\}$ to the time-homogeneous stochastic differential equation in (1.1) is recurrent if and only if
\[ S(x) \equiv \int_0^x \exp\left\{-2 \int_0^y \frac{\mu(z)}{\sigma^2(z)}\,dz\right\} dy \]
satisfies $\lim_{x \to -\infty} S(x) = -\infty$ and $\lim_{x \to \infty} S(x) = \infty$.
In addition, $\{X_t \mid t \ge 0\}$ is positive recurrent if and only if it additionally satisfies
\[ G \equiv \int_{-\infty}^{\infty} \frac{1}{\sigma^2(y)} \exp\left\{2 \int_0^y \frac{\mu(z)}{\sigma^2(z)}\,dz\right\} dy < \infty. \]
Also, the solution process is null recurrent if it is recurrent and $G = \infty$.

A positive recurrent strong solution $\{X_t\}$ has the following properties. First, there exists a random variable $X$ whose probability density function is $f_X$, called the invariant density, such that $X_t \xrightarrow{d} X$ as $t \to \infty$.
In addition, a positive recurrent strong solution $\{X_t\}$ is ergodic, that is, for any measurable function $h$ such that $\int |h(x)| f_X(x)\,dx < \infty$, we have, almost surely,
\[ \frac{1}{T} \int_0^T h(X_s)\,ds \longrightarrow \int h(x) f_X(x)\,dx \quad \text{as } T \to \infty. \]

The following theorem summarizes this discussion and gives the analytical form of the invariant density $f_X$ for a positive recurrent strong solution $\{X_t\}$.

Theorem 1.2 (Kutoyants, 2004, page 40) If a strong solution $\{X_t \mid t \ge 0\}$ to the time-homogeneous stochastic differential equation in (1.1) is positive recurrent, then $\{X_t \mid t \ge 0\}$ is ergodic with the invariant density
\[ f_X(x) = \frac{1}{G \sigma^2(x)} \exp\left\{2 \int_0^x \frac{\mu(y)}{\sigma^2(y)}\,dy\right\}, \]
where $G$ is as in Lemma 1.3.

Now we discuss stationarity. We first define stationarity for a generic stochastic process. A (strictly) stationary process is a stochastic process whose joint probability distributions are invariant under shifts of the time indices, as defined below.

Definition 1.9 (Karatzas and Shreve, 1991, page 103) A stochastic process $\{V_t\}$ is said to be strictly stationary if, for any $n \in \mathbb{N}$, any time indices $t_1, \ldots, t_n$ and any $s \in \mathbb{R}$,
\[ (V_{t_1}, \ldots, V_{t_n}) \overset{d}{=} (V_{t_1 + s}, \ldots, V_{t_n + s}), \]
where the symbol “$\overset{d}{=}$” means both sides have the same distribution.

Now we relate stationarity to a diffusion process. If a strong solution to (1.1), $\{X_t\}$, is positive recurrent (so that the invariant density $f_X$, defined in Theorem 1.2, exists) and the initial value random variable $Y$ has the density function equal to $f_X$, then the strong solution $\{X_t\}$ is strictly stationary (Kutoyants, 2004, page 2). For this reason, the invariant density $f_X$ is also called the stationary density.

In the next section, we discuss kernel estimation, the last background information we need to provide in order to formally define our research question and our estimator.

1.5 Kernel Estimation

We will introduce what is called the Nadaraya-Watson estimator as our estimator for the drift coefficient.
The Nadaraya-Watson estimator is the first widely used kernel estimator for cross-sectional data (Nadaraya, 1964, and Watson, 1964). Suppose that we have independent bivariate data, $(x_1, y_1), \ldots, (x_n, y_n)$, from the distribution of $(X, Y)$ from the regression model
\[ Y = m(X) + \varepsilon \tag{1.2} \]
where $m$ is a function and $\varepsilon$ is a random variable such that $E(\varepsilon \mid X = x) = 0$ and $\mathrm{Var}(\varepsilon \mid X = x) = \sigma^2(x)$. Therefore, the function $m$ represents the conditional expectation of $Y$ given $X$. The Nadaraya-Watson estimator estimates $m(x) = E(Y \mid X = x)$ for each fixed $x$. In this section, we introduce the estimator using the overview of Simonoff (1996, Chapter 5) and the references therein.

Note that the conditional expectation $E(Y \mid X = x)$ is given by
\[ E(Y \mid X = x) = \int y\, f_{Y \mid X = x}(y)\,dy = \int y\, \frac{f_{X,Y}(x, y)}{f_X(x)}\,dy, \tag{1.3} \]
where $f_{Y \mid X = x}$, $f_{X,Y}$ and $f_X$ are conditional, joint and marginal densities, respectively. We can obtain the Nadaraya-Watson estimator if we replace $f_X(x)$ and $f_{X,Y}(x, y)$ in (1.3) with kernel density estimates, which we define in what follows.

We first define the kernel density estimate of $f_X(x)$. Note that we have observed independent and identically distributed data, $x_1, \ldots, x_n$, where $x_i \in \mathbb{R}$ for each $i$, having a common density $f_X$. The kernel density estimate $\hat{f}_n(x)$ of $f_X(x)$ is defined as
\[ \hat{f}_n(x) \equiv \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right), \tag{1.4} \]
where $K : \mathbb{R} \to \mathbb{R}$ is called the kernel function and $h$ is a positive constant called the bandwidth. Both $K$ and $h$ are chosen by the user. Parzen (1962) proved that $\hat{f}_n(x)$ is a consistent estimator of $f_X(x)$ in the $L^2$ norm if $f_X$ is continuous at $x$.
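
As an informal illustration of the kernel density estimate (1.4), the following Python sketch evaluates it on a small grid. The Gaussian kernel is only one admissible choice of K, and the function names, the data values and the bandwidth are hypothetical choices made for this example rather than anything used in the thesis.

```python
import math


def gaussian_kernel(z):
    """Standard normal density: bounded, symmetric and integrating to one."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kernel_density_estimate(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate (1.4) at the point x with bandwidth h > 0."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(data)
    return sum(kernel((x_i - x) / h) for x_i in data) / (n * h)


# Hypothetical data and bandwidth, for illustration only.
data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
h = 0.5
grid = [-3.0 + 0.5 * j for j in range(13)]  # evaluation points from -3 to 3
estimates = [kernel_density_estimate(x, data, h) for x in grid]
```

Because the Gaussian kernel is nonnegative and integrates to one, the resulting estimate is itself a probability density. A smaller h gives a rougher estimate that follows the data points more closely, while a larger h gives a smoother one.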
To emphasize that we choose $h$ according to $n$ but choose $K$ independently of $n$, we will sometimes write $h = h_n$.

Theorem 1.3 (Parzen, 1962, page 1069) Suppose that the kernel $K : \mathbb{R} \to \mathbb{R}$ is a bounded Borel measurable function such that
$$\lim_{z \to \infty} |zK(z)| = 0, \quad \int_{-\infty}^{\infty} |K(y)|\,dy < \infty \quad \text{and} \quad \int_{-\infty}^{\infty} K(y)\,dy = 1.$$
If $h_n \to 0$ and $nh_n \to \infty$ as $n \to \infty$, then
$$\lim_{n \to \infty} E\left( \left( \hat{f}_n(x) - f_X(x) \right)^2 \right) = 0$$
for every $x$ at which $f_X$ is continuous.

Parzen (1962) gave the order of the asymptotic variance of $\hat{f}_n(x)$, but not that of the asymptotic bias. Rosenblatt (1956) derived the orders of the asymptotic bias and variance for nonnegative $K$ and twice differentiable $f_X$. In addition, he found that using a symmetric $K$ makes the bias converge to 0 at a faster rate, which is why symmetric kernels are often used.

We can generalize the univariate kernel density estimate to a multivariate density estimate. Here we introduce a special case of the bivariate kernel density estimate used to derive the Nadaraya-Watson estimator. The product kernel density estimate of the joint density from the independent and identically distributed data, $(x_1, y_1), \dots, (x_n, y_n)$, where $(x_i, y_i) \in \mathbb{R}^2$ for each $i$, is defined as
$$\hat{f}_n(x, y) \equiv \frac{1}{n h_x h_y} \sum_{i=1}^{n} K_x\left(\frac{x_i - x}{h_x}\right) K_y\left(\frac{y_i - y}{h_y}\right), \tag{1.5}$$
where $K_x$ and $K_y$ are kernels and $h_x$ and $h_y$ are bandwidths. Discussion of a more general form of the multivariate kernel density estimate can be found in Simonoff (1996, Chapter 4).

Now we define the Nadaraya-Watson estimator following the derivation of Simonoff (1996, page 134), which is a simplified version of the derivation of Watson (1964). Recall the expression (1.3) of the conditional expectation $E(Y \mid X = x)$. If we replace $f_X(x)$ and $f_{X,Y}(x, y)$ in (1.3) with the kernel density estimates (1.4) and (1.5), set $h_x$ in (1.5) equal to $h$ in (1.4) and choose $K_y$ so that $\int K_y(z)\,dz = 1$ and $\int z K_y(z)\,dz = 0$ (for instance, if $K_y$ is symmetric about zero), we derive the Nadaraya-Watson estimator:
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}. \tag{1.6}$$
We note that Watson (1964) provided the estimator for the case of $x_i \in \mathbb{R}^d$ where $d \in \mathbb{N}$, in which the kernel $K$ is appropriately defined according to the value of $d$.

Nadaraya (1964) proved consistency of $\hat{m}(x)$ when $Y$ is bounded. Among the many consistency results for $\hat{m}(x)$ with unbounded $Y$, we state the following.

Theorem 1.4 (Hardle, 1990, Proposition 3.1.1) Suppose that the following three conditions hold:
1. the regression model $(X, Y)$ satisfies $f_X(x) > 0$ and $E(Y^2) < \infty$,
2. the kernel $K$ satisfies $\int |K(u)|\,du < \infty$ and $\lim_{|u| \to \infty} uK(u) = 0$,
3. the sequence of bandwidths $\{h_n\}$ satisfies $h_n \to 0$ and $nh_n \to \infty$.
Then $\hat{m}(x) \overset{p}{\longrightarrow} m(x)$ as $n \to \infty$ for every $x$ at which all of $m(\cdot)$, $f_X(\cdot)$ and $\sigma^2(\cdot)$ are continuous.

We note that the use of the Nadaraya-Watson estimator is not restricted to model (1.2). For example, Hall and Hart (1990) studied estimating $E(Y \mid X = x)$ by the Nadaraya-Watson estimator when the $\varepsilon_i$'s in the data are not independent, but instead form a stationary process indexed by $i$. Robinson (1983) considered time-series data, $z_1, \dots, z_n$, from a discrete-time stochastic process $\{Z_i\}_{i=1}^{n}$. He studied estimating $E(Z_{i+p} \mid Z_i, \dots, Z_{i+p-1})$ for some $p \in \mathbb{N}$ by the Nadaraya-Watson estimator, setting $y_i = z_{i+p}$ and $x_i = (z_i, \dots, z_{i+p-1})$. Researchers have also used the Nadaraya-Watson estimator to estimate statistical objects in continuous-time models, including diffusion processes. We refer to studies that used the Nadaraya-Watson estimator for estimation of the drift and the diffusion coefficients of a diffusion process in Section 2.1.

The Nadaraya-Watson estimator $\hat{m}(x)$, defined in (1.6), can be generalized to what is called the local polynomial estimator. Note that, for a fixed $x$, the estimator $\hat{m}(x)$ is the solution to the following weighted least squares problem:
$$\hat{m}(x) = \arg\min_{z} \sum_{i=1}^{n} (y_i - z)^2 K\left(\frac{x_i - x}{h_n}\right).$$
Generalizing this, the local polynomial estimator of degree $p \ge 0$ is defined (Stone, 1977, and Cleveland, 1979), for each $x$, as the value at $x$ of the locally fitted polynomial $\hat{\beta}_{x0} + \hat{\beta}_{x1}(x - x_i) + \dots + \hat{\beta}_{xp}(x - x_i)^p$, that is,
$$\hat{m}_{LP}(x) \equiv \hat{\beta}_{x0},$$
where
$$(\hat{\beta}_{x0}, \dots, \hat{\beta}_{xp}) = \arg\min_{\beta_0, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 (x - x_i) - \dots - \beta_p (x - x_i)^p \right)^2 K\left(\frac{x_i - x}{h_n}\right).$$
If $p = 0$, then $\hat{m}_{LP}$ equals $\hat{m}$. Stone (1977) proved that $\hat{m}_{LP}(x)$ is consistent when $p = 1$. Ruppert and Wand (1994) derived the asymptotic bias and variance of $\hat{m}_{LP}(x)$ for $p \ge 1$. Their key assumptions are that $m$ is $(p + 2)$-times differentiable at $x$ with continuous $(p + 2)$th derivative, that $x$ is an interior point of the support of $f_X$, that $f_X$ is continuous at $x$ and that the kernel $K$ has compact support (Ruppert and Wand, 1994, Theorem 4.1). In addition, they derived the asymptotic bias and variance of the multivariate generalization of $\hat{m}_{LP}(x)$ when $x \in \mathbb{R}^k$ and $p = 1, 2$, under similar key assumptions (Ruppert and Wand, 1994, Theorem 3.2).

Lastly, we discuss the choice of the bandwidth $h$. The conditions of Theorem 1.4 are satisfied by many sequences $\{h_n\}$. Therefore, given the sample size $n$, we have great freedom in choosing $h_n$. But this choice is important: as Simonoff (1996) writes (page 151), the shapes of the function estimates $\hat{m}$ and $\hat{m}_{LP}$ are strongly dependent on $h$. Larger $h$ leads to a function estimate that is close to the least squares degree $p$ polynomial. Therefore, we require a finite sample method of choosing $h$.

We will discuss the bandwidth choice criteria in detail in Section 2.4, so we give only a summary here. The goal of the choice of $h$ is to minimize the mean squared error of the estimator. We consider the bandwidth $h$ to be good if it minimizes either the asymptotic mean squared error or what is called the prediction error. We introduce a bandwidth choice method, called "cross-validation", in Section 2.4.
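The weighted least squares view of the Nadaraya-Watson estimator and its degree-1 generalization can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel, the function names and the toy data are choices made for the example and do not come from the text.

```python
import math

def gaussian_kernel(z):
    """Standard normal density; an assumed choice of K for illustration."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Estimator (1.6): the kernel-weighted average of the y_i (local constant fit)."""
    weights = [kernel((xi - x) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ZeroDivisionError("all kernel weights vanish at x; increase h")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

def local_linear(x, xs, ys, h, kernel=gaussian_kernel):
    """Degree-1 local polynomial fit: the intercept of a weighted least
    squares regression of y_i on the centered regressor (x_i - x)."""
    w = [kernel((xi - x) / h) for xi in xs]
    d = [xi - x for xi in xs]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))
    det = s0 * s2 - s1 * s1
    if det == 0.0:
        raise ZeroDivisionError("the local design is degenerate; increase h")
    return (s2 * t0 - s1 * t1) / det

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * xi + 1.0 for xi in xs]
print(nadaraya_watson(1.7, xs, ys, h=0.8))
print(local_linear(1.7, xs, ys, h=0.8))
```

On exactly linear data the local linear fit reproduces the line, while the local constant fit with a very large bandwidth collapses to the sample mean of the $y_i$, the least squares polynomial of degree 0.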
In cross-validation, we choose the bandwidth as a minimizer of an estimate of prediction error.

We note that the dependence structure of the $\varepsilon$'s affects the asymptotic mean squared error of the estimator and thus our choice of the bandwidth, if we choose the bandwidth as a minimizer of the asymptotic mean squared error. For example, Hall and Hart (1990) proved that the asymptotic variance of the Nadaraya-Watson estimator depends on the dependence structure of the $\varepsilon$'s when we observe data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i = i/n$.

If we use cross-validation, we should use an estimate of prediction error that is appropriate for the dependence structure of the $\varepsilon$'s. For model (1.2), a widely used estimate of prediction error is the estimate computed by "leave-one-out" cross-validation:
$$\widehat{PE}(h) \equiv \sum_{i=1}^{n} \left( \hat{m}^{(-i)}(x_i) - y_i \right)^2, \tag{1.7}$$
where
$$\hat{m}^{(-i)}(x) \equiv \frac{\sum_{j \in A_i} K\left(\frac{x_j - x}{h}\right) y_j}{\sum_{j \in A_i} K\left(\frac{x_j - x}{h}\right)} \tag{1.8}$$
and $A_i \equiv \{1, \dots, n\} \cap \{i\}^C$. That is, $\hat{m}^{(-i)}$ is the Nadaraya-Watson estimator, defined in (1.6), computed with the $i$th observation removed.

However, (1.7) may give an inaccurate estimate of prediction error if the data are generated from a model other than (1.2), which may lead to an unsatisfactory choice of the bandwidth. For example, if the $\varepsilon_i$'s in model (1.2) are correlated, using (1.7) tends to give a bandwidth that undersmooths the data (i.e. a bandwidth that is too small) when the $\varepsilon$'s are positively correlated and one that oversmooths the data when they are negatively correlated (see e.g. Chu and Marron, 1991, and Hart, 1994, and the references therein). Chu and Marron (1991) modified (1.7) for use with dependent $\varepsilon$'s, a modification that Burman, Chow, and Nolan (1994) also proposed for the analysis of time-series data. We will introduce their estimate in Section 2.4. As another approach, Hart (1994) proposed to, roughly speaking, modify the set $A_i$ in (1.8) to be $A_i = \{1, \dots, i - 1\}$ and also modify (1.7) according to the dependence structure of the $\varepsilon$'s.

We have now introduced all the concepts necessary to state our research question and our estimator. In the next chapter, we discuss our research question, the estimation of the drift coefficient $\mu$ of a positive recurrent and strictly stationary diffusion process when we observe the $X_{t_i}$'s with additive measurement errors at discrete times $t_1, \dots, t_n$, and we introduce our Nadaraya-Watson estimator of $\mu$.

Chapter 2

Kernel Estimation of the Drift Coefficient of a Diffusion Process in the Presence of Measurement Error

2.1 Introduction

Financial time-series data such as stock prices, interest rates and derivative prices can be modeled as diffusion processes. A diffusion process is completely characterized by two functions: the drift coefficient, which is related to the expected return of an asset over an infinitesimal amount of time, and the diffusion coefficient, which is related to the variance of the return over an infinitesimal amount of time. When we model time-series data as a diffusion process, we are interested in estimating these two functions as they completely characterize the underlying process. In addition, the diffusion coefficient integrated over time, which is called integrated volatility, has also received attention as a risk measure of an asset (see e.g. Andersen et al., 2001).

Recently, analysis of ultra-high frequency data revealed the unwelcome fact that we observe financial time series data with measurement errors, called microstructure noise in the financial econometrics literature, which are negligible compared to the observed return at low sampling frequency but have a significant effect at high sampling frequency (Zhou, 1996). While there are approaches that deal with the measurement error problem in the integrated volatility estimation literature (see e.g.
Zhang, Mykland, and Aït-Sahalia, 2005), there are few papers, to our knowledge, that incorporate the measurement error problem in the estimation of the drift and the diffusion coefficients. In this chapter, we focus on estimation of the drift coefficient, and we provide a nonparametric estimator of the drift coefficient that is consistent and asymptotically normal in the presence of measurement error under the assumption that the underlying process is stationary.

Integrated volatility is defined as the integral of the squared diffusion coefficient with respect to time over a fixed time period, which is identical to the integrated quadratic variation of the process. Integrated volatility represents the variability of a financial instrument over a given period of time, for example, the variability of a stock price within a day. A widely used estimator of integrated volatility proposed by Andersen et al. (2001) is the realized volatility estimator, which is simply the sum of squared returns over the sampling intervals. The theory of quadratic variation tells us that the realized volatility estimator is an unbiased and consistent estimator of integrated volatility when there is no measurement error. For details about the realized volatility estimator, see e.g. Andersen et al. (2009).

However, according to Zhang, Mykland, and Aït-Sahalia (2005), researchers knew that the performance of the realized volatility estimator is not satisfactory when measurement error is present and the data are sampled at high frequency. So the researchers purposely used low-frequency data to avoid estimation problems. Zhang, Mykland, and Aït-Sahalia (2005) formalized this approach, which they call the subsampling method, and proposed an estimator of integrated volatility which uses the subsampling method. They first chose a subsampling frequency by minimizing the mean squared error of the realized volatility estimator when measurement error is present.
Then they split the high frequency data into subsamples with the chosen subsampling frequency and with different starting times. For example, if the data are sampled hourly and the subsampling frequency is 24 hours, they would create 24 subsamples where the $k$th subsample contains the values at hour $k$ of every day. After that, they obtained an estimate by combining the realized volatility estimates obtained from all subsamples.

Other approaches proposed to deal with the measurement error problem in the context of integrated volatility estimation are that of Barndorff-Nielsen et al. (2008), who proposed the realized kernel estimator, which computes the kernel-weighted average of autocorrelations of the process, and that of Jacod et al. (2009), who proposed the preaveraging estimator, which uses an average of the returns computed at low sampling frequency.

In the literature on the estimation of the drift and the diffusion coefficients, the diffusion process is usually assumed to be time-homogeneous, that is, the drift and the diffusion coefficients are not functions of time but functions of the value of the process only, and the process is usually assumed to be stationary. Early work on the nonparametric estimation of the two functions includes that of Florens-Zmirou (1993), who provided a Nadaraya-Watson kernel estimator of the diffusion coefficient with the uniform kernel, Aït-Sahalia (1996), who estimated the diffusion coefficient nonparametrically under a parametric specification of the drift coefficient, and Stanton (1997), who proposed a Nadaraya-Watson kernel estimator of the drift and the diffusion coefficients.
Later, Bandi and Phillips (2003) provided Nadaraya-Watson kernel estimators of the drift and the diffusion coefficients under more general conditions, including non-stationarity, and proved consistency and asymptotic normality of their estimators.

As we stated earlier, in contrast to estimation of integrated volatility, there are few studies, to our knowledge, that consider estimation of the drift and the diffusion coefficients in the presence of measurement error. An exception is Bandi, Corradi, and Moloche (2009), who consider a standard deviation of the measurement error that converges to zero as the sampling frequency increases to infinity. In our paper, we focus on estimation of the drift coefficient and consider a less restrictive form of the measurement error. We extend the result of Bandi and Phillips (2003) and propose a Nadaraya-Watson type kernel estimator of the drift coefficient which is consistent and asymptotically normal in the presence of independent measurement errors of mean zero and bounded variance.

The structure of the chapter is as follows. In Section 2.2, we introduce our assumptions and define our estimator, and we state the consistency and asymptotic normality of our estimator. In Section 2.3, we compare our estimator to the existing nonparametric estimators of the drift coefficient, especially those proposed by Bandi and Phillips (2003). In Section 2.4, we discuss the bandwidth choice problem of our estimator. In Section 2.5, we describe our simulation study. We will prove the consistency and asymptotic normality result in Section 2.6.

2.2 Statement of the Main Result

We consider the following stochastic differential equation
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t \tag{2.1}$$
where $\mu$ and $\sigma$ are real-valued functions called the drift and the diffusion coefficients respectively, and $\{W_t \mid t \ge 0\}$ is a Brownian motion defined on a probability space $(\Omega, \mathcal{F}, P)$. A real-valued initial value random variable, $X_0$, is also defined on $(\Omega, \mathcal{F}, P)$ and is independent of the Brownian motion.
To define a strong solution (a sample path solution) to (2.1), we define an augmented filtration $\mathcal{F}_t$, $t \ge 0$, based on the filtration
$$\mathcal{G}_t \equiv \sigma\left(X_0, \{W_s \mid 0 \le s \le t\}\right), \quad t \ge 0,$$
and the collection of all subsets of measure zero:
$$\mathcal{N} \equiv \left\{ N \subseteq \Omega \;\middle|\; \exists\, G \in \bigcup_{t \ge 0} \mathcal{G}_t \text{ with } N \subseteq G \text{ and } P(G) = 0 \right\}.$$
Then an augmented filtration $\mathcal{F}_t$, $t \ge 0$, is defined by
$$\mathcal{F}_t \equiv \sigma(\mathcal{G}_t \cup \mathcal{N}), \quad t \ge 0.$$
A strong solution to (2.1) is a process $\{X_t \mid t \ge 0\}$ adapted to the augmented filtration $\{\mathcal{F}_t \mid t \ge 0\}$ such that the following holds almost surely:
$$X_t = X_0 + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s, \tag{2.2}$$
which is the integrated version of (2.1). See Karatzas and Shreve (1991, page 285) for a formal definition of a strong solution.

Our objective is to provide a consistent and asymptotically normal Nadaraya-Watson type kernel estimator of $\mu(x)$ from observations of a sample path of the solution process $\{X_t \mid t \ge 0\}$ sampled discretely in time and with additive measurement error. To formalize the setting, suppose that we observe the values of a sample path of the solution process $\{X_t\}$ at times $t \in \{t_1, \dots, t_n \mid t_k \in [0, T]\}$ for some time span $T > 0$ and that the times are equispaced, i.e. $t_i = i\Delta$ for some $\Delta > 0$. Then we suppose that we observe $\{Y_{i\Delta}\}_{i=1}^{n}$ such that
$$Y_{i\Delta} \equiv X_{i\Delta} + \varepsilon_{i\Delta} \tag{2.3}$$
where $\{\varepsilon_{i\Delta}\}_{i=1}^{n}$ are values from a process $\{\varepsilon_t \mid t \ge 0\}$ which is independent of $\{X_t\}$.

Our objective is to estimate $\mu(x)$ from $\{Y_{i\Delta}\}_{i=1}^{n}$. Our key idea is to estimate $\mu(x)$ by averaging the $Y_{i\Delta}$'s that neighbor each other in time, expecting that the averaging reduces the noise caused by the $\varepsilon_{i\Delta}$'s and reveals the latent solution process $\{X_t \mid t \ge 0\}$. Formally speaking, we construct a new stochastic process by averaging the $Y_{i\Delta}$'s in $m$ blocks, each of size $r$, as in Definition 2.1 below.

Definition 2.1 For fixed $\Delta > 0$ and $r$ and $n \in \mathbb{N}$, let $\bar{Y}^{r,\Delta}_j$ be the arithmetic average of the $Y_{i\Delta}$'s over $i$ such that $(j - 1)r < i \le jr$. In other words, for $j = 1, \dots, m \equiv \lfloor n/r \rfloor$ (the largest integer no greater than $n/r$),
$$\bar{Y}^{r,\Delta}_j \equiv \frac{1}{r} \sum_{i=1}^{r} Y_{[(j-1)r+i]\Delta}.$$
In addition, we define $\bar{X}^{r,\Delta}_j$ and $\bar{\varepsilon}^{r,\Delta}_j$ similarly as the arithmetic averages of the $X_{i\Delta}$'s and the $\varepsilon_{i\Delta}$'s over $i$ such that $(j - 1)r < i \le jr$, respectively.

Our estimator of $\mu(x)$, given in Definition 2.2 below, is a weighted average of the discrete slopes $(\bar{Y}^{r,\Delta}_{j+2} - \bar{Y}^{r,\Delta}_{j+1})/(r\Delta)$ with $j = 1, \dots, m - 2$.

Definition 2.2 Let $K : \mathbb{R} \to \mathbb{R}$ be a known function and $h > 0$. Let
$$\hat{\mu}_{\bar{Y}}(x) \equiv \frac{\frac{1}{m-2} \sum_{j=1}^{m-2} \frac{\bar{Y}^{r,\Delta}_{j+2} - \bar{Y}^{r,\Delta}_{j+1}}{r\Delta}\, \frac{1}{h} K\left(\frac{\bar{Y}^{r,\Delta}_j - x}{h}\right)}{\frac{1}{m-2} \sum_{j=1}^{m-2} \frac{1}{h} K\left(\frac{\bar{Y}^{r,\Delta}_j - x}{h}\right)} \equiv \frac{N_{\bar{Y}_1, \dots, \bar{Y}_m}(x)}{D_{\bar{Y}_1, \dots, \bar{Y}_{m-2}}(x)} \equiv \frac{N_{\bar{Y}}(x)}{D_{\bar{Y}}(x)}.$$

Note that the $j$th summand of $N_{\bar{Y}}(x)$ contains $\bar{Y}^{r,\Delta}_j$ in the argument of $K$, and the difference $\bar{Y}^{r,\Delta}_{j+2} - \bar{Y}^{r,\Delta}_{j+1}$ (not $\bar{Y}^{r,\Delta}_{j+1} - \bar{Y}^{r,\Delta}_j$). We chose these indices $(j, j + 1, j + 2)$ to make our asymptotic calculations easier: with these indices, the difference $\bar{Y}^{r,\Delta}_{j+2} - \bar{Y}^{r,\Delta}_{j+1}$ depends on values of $X_t + \varepsilon_t$ with $t$ in $[(jr + 1)\Delta, (j + 2)r\Delta]$, while $\bar{Y}^{r,\Delta}_j$ depends on values of $X_t + \varepsilon_t$ for $t$ in a different interval, namely $[((j - 1)r + 1)\Delta, jr\Delta]$. As for finite-sample performance, a simulation that is not included in Section 2.5 showed that this shift of indices increases the mean squared error of our pre-averaging estimator. We will discuss this issue more concretely in Chapter 3.

Our proof of consistency and asymptotic normality of $\hat{\mu}_{\bar{Y}}(x)$ requires that the observation time lag $\Delta$ tends to zero and the observation time span $T = n\Delta$ tends to infinity as the number of observations $n$ tends to infinity. These assumptions are necessary. As Bandi and Phillips (2003) note, without the condition $\Delta \to 0$, we suffer from what is called the aliasing problem: "different continuous-time processes may be indistinguishable when we observe the process discretely in time." If $\Delta$ is fixed and $n$ tends to infinity, the data form a discrete-time process. We may be able to deduce some properties of the discrete-time process.
But we cannot identify what continuous-time process generated the data, as there is usually more than one continuous-time process that can generate the discrete-time process.

They also note that, without the condition $n\Delta \to \infty$, we cannot obtain a consistent estimator of $\mu(x)$ in general, even if the process is observed without measurement error. In addition to these assumptions on $\Delta$, our proof requires similar conditions for the averaged process: the time lag between the two adjacent averages, $r\Delta$, tends to zero and the number of blocks $n/r$ tends to infinity.

Now we introduce the assumptions.

Assumption 2.1 As $n \to \infty$, the sequence of positive real numbers $\{\Delta_n\}_{n=1}^{\infty}$ and the sequence of positive integers $\{r_n\}_{n=1}^{\infty}$ satisfy $\Delta_n \to 0$, $n\Delta_n \to \infty$, $r_n \to \infty$, $r_n\Delta_n \to 0$ and $n/r_n \to \infty$.

We will often denote $m_n \equiv n/r_n$, which represents the number of blocks (which equals the number of $\bar{Y}^{r,\Delta}_j$'s).

Assumption 2.2 The functions $\mu$ and $\sigma$ are Borel-measurable and twice continuously differentiable on $\mathbb{R}$. In addition, $\sigma^2(x) > 0$ for all $x \in \mathbb{R}$.

Assumption 2.2 is a sufficient condition for the existence and uniqueness of a strong solution of the stochastic differential equation (2.1), as discussed in Bandi and Phillips (2003, page 244).

Assumption 2.3 The solution process $\{X_t\}$ is positive recurrent and strictly stationary. Let $f_X$ be the stationary density. The functions $\mu$, $\sigma$ and $f_X$ satisfy
$$\int \mu^2(x) f_X(x)\,dx < \infty \quad \text{and} \quad \int \sigma^2(x) f_X(x)\,dx < \infty.$$

Note that, if the solution process $\{X_t\}$ is positive recurrent, there exists a random variable $X$ whose probability density function is $f_X$, called the stationary density, such that $X_t \overset{d}{\longrightarrow} X$ as $t \to \infty$.
If we let the initial value random variable $X_0$ have the density function $f_X$, then $\{X_t\}$ is strictly stationary (Kutoyants, 2004, page 2).

Note that Assumption 2.2 and the positive recurrence assumption in Assumption 2.3 imply that $f_X$ is continuous: if $\{X_t\}$ is positive recurrent, $f_X$ is given by
$$f_X(x) = \frac{1}{G\sigma^2(x)} \exp\left\{ 2\int_0^x \frac{\mu(y)}{\sigma^2(y)}\,dy \right\}, \tag{2.4}$$
where $G$ is a normalizing constant (Kutoyants, 2004, Theorem 1.16, page 40).

Assumption 2.4 The kernel $K \in L^2(\mathbb{R})$ is bounded, symmetric, nonnegative and continuously differentiable. Its derivative, $K'$, is bounded and is in $L^1(\mathbb{R})$. In addition, $\int_{-\infty}^{\infty} K(x)\,dx = 1$ and $\int_{-\infty}^{\infty} s^2 K(s)\,ds < \infty$.

Assumption 2.5 The error process $\{\varepsilon_t\}$ is independent of $\{X_t\}$, and the $\varepsilon_t$'s are independent across $t$. Also, $E(\varepsilon_t) = 0$ for all $t$, and there exists a finite, positive constant $\sigma^2_\varepsilon$ such that $\sup_t \mathrm{Var}(\varepsilon_t) \le \sigma^2_\varepsilon$.

In the literature, the $\varepsilon_{i\Delta_n}$'s with $i$ in $1, \dots, n$ are usually assumed to be independent and identically distributed with $E(\varepsilon_{i\Delta_n}) = 0$. Some authors assume that $\mathrm{Var}(\varepsilon_{i\Delta_n}) = \sigma^2_\varepsilon$ for all $i$ and $n$ (see Zhang, Mykland, and Aït-Sahalia, 2005, among others). In contrast, some papers in the literature, and most of the papers in the rounding error literature according to Jacod et al. (2009), assume that $\mathrm{Var}(\varepsilon_{i\Delta_n}) = a_n\sigma^2_\varepsilon$ for all $i$ where $a_n \to 0$ as $n \to \infty$ (see Bandi, Corradi, and Moloche, 2009, among others). Our Assumption 2.5 includes both specifications as special cases.

Now we introduce our main result.

Theorem 2.1 Suppose Assumptions 2.1 to 2.5 hold, and suppose
(i) $\left(\dfrac{n\Delta_n}{h_n}\right)^2 r_n\Delta_n \ln\left(\dfrac{1}{r_n\Delta_n}\right) = o(1)$,
(ii) $n\Delta_n h_n \to \infty$, and
(iii) $\dfrac{n}{h_n^3 r_n^2} = o(1)$.
Let $K_2 \equiv \int K^2(s)\,ds$ and $\nu_2 \equiv \int s^2 K(s)\,ds$. Then the following consistency and asymptotic normality results hold for every $x \in \{y \mid f_X(y) > 0\}$.
1. $\hat{\mu}_{\bar{Y}}(x) \longrightarrow \mu(x)$ in probability as $n \to \infty$.
2. If $n\Delta_n h_n^5 = o(1)$, then
$$\sqrt{(n - r_n)\Delta_n h_n}\, \left\{ \hat{\mu}_{\bar{Y}}(x) - \mu(x) \right\} \overset{d}{\longrightarrow} N\left(0, K_2 \frac{\sigma^2(x)}{f_X(x)}\right).$$
3. If $n\Delta_n h_n^5 = O(1)$, then
$$\sqrt{(n - r_n)\Delta_n h_n}\, \left\{ \hat{\mu}_{\bar{Y}}(x) - \mu(x) - h_n^2 \Gamma_\mu(x) \right\} \overset{d}{\longrightarrow} N\left(0, K_2 \frac{\sigma^2(x)}{f_X(x)}\right)$$
where
$$\Gamma_\mu(x) = \nu_2 \times \left( \mu'(x) \frac{f_X'(x)}{f_X(x)} + \frac{1}{2}\mu''(x) \right).$$

Conclusions 1 and 2 give consistency and asymptotic normality of $\hat{\mu}_{\bar{Y}}(x)$. Conclusion 3 gives the asymptotic bias and variance, which are useful for the choice of the bandwidth $h$.

Bandi and Phillips (2003) provided consistency and asymptotic normality of the Nadaraya-Watson estimator of $\mu$ when one observes a sample path of a recurrent diffusion process $\{X_t\}$ sampled discretely in time and without measurement error. We prove consistency and asymptotic normality of our estimator by showing that the difference between our estimator and their estimator is asymptotically negligible under our stronger assumptions. The full proof of Theorem 2.1 can be found in Section 2.6.

We finish this section by considering simple sufficient conditions under which $\Delta_n$, $r_n$ and $h_n$ satisfy the conditions of Theorem 2.1. Suppose $\Delta_n = n^{-\delta}$, $r_n = n^{\rho}$ and $h_n = n^{-\eta}$ for some positive real numbers $\delta$, $\rho$ and $\eta$. We study conditions on $\delta$, $\rho$ and $\eta$ so that the conditions of Theorem 2.1 are satisfied.

To begin with, from Assumption 2.1, we have
$$0 < \rho < \delta < 1. \tag{2.5}$$
Also, from condition (i) of Theorem 2.1, we have
$$(\delta - \rho)\, n^{2 - 3\delta + 2\eta + \rho} \ln n = o(1),$$
that is, since $\rho \ne \delta$ by (2.5), we have $2 - 3\delta + 2\eta + \rho < 0$, or
$$\eta < \frac{3}{2}\delta - \frac{1}{2}\rho - 1. \tag{2.6}$$
Condition (ii) becomes $1 - \delta - \eta > 0$, or
$$\eta < 1 - \delta. \tag{2.7}$$
Condition (iii) becomes $1 + 3\eta - 2\rho < 0$, or
$$\eta < \frac{1}{3}(2\rho - 1). \tag{2.8}$$
In addition, for the condition of Theorem 2.1's conclusion 2, that $n\Delta_n h_n^5 = o(1)$, we require
$$\eta > \frac{1}{5}(1 - \delta). \tag{2.9}$$
Lastly, for the condition of Theorem 2.1's conclusion 3, that $n\Delta_n h_n^5 = O(1)$, we require
$$\eta \ge \frac{1}{5}(1 - \delta). \tag{2.10}$$

For example, suppose that $\delta = 0.9$ and $\rho = 0.58$, which satisfies (2.5). Then the upper bounds in (2.6), (2.7) and (2.8) are 0.06, 0.1 and 0.0533 respectively, so the conditions in Equations (2.6) to (2.8) and (2.10) are all satisfied provided
$$0.02 \le \eta < 0.0533.$$
Equation (2.9) is satisfied if $\eta > 0.02$ and Equation (2.10) is satisfied if $\eta = 0.02$.
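The arithmetic behind this example can be checked with a short sketch. The function names are hypothetical and introduced only for illustration; the block simply evaluates the bounds (2.5) to (2.10) and the power-law choices $\Delta_n = n^{-\delta}$, $r_n = n^{\rho}$ and $h_n = n^{-\eta}$, rounding $r_n$ down to an integer (an assumption about how non-integer powers are handled).

```python
import math

def eta_bounds(delta, rho):
    """Return (lower, upper): eta >= lower is (2.10), eta < upper combines (2.6)-(2.8)."""
    if not (0.0 < rho < delta < 1.0):
        raise ValueError("(2.5) requires 0 < rho < delta < 1")
    upper = min(
        1.5 * delta - 0.5 * rho - 1.0,  # (2.6), from condition (i)
        1.0 - delta,                    # (2.7), from condition (ii)
        (2.0 * rho - 1.0) / 3.0,        # (2.8), from condition (iii)
    )
    lower = (1.0 - delta) / 5.0         # (2.9) strict, (2.10) non-strict
    return lower, upper

def tuning_values(n, delta, rho, eta):
    """Evaluate the tuning sequences at sample size n."""
    dt = n ** (-delta)              # observation lag Delta_n
    r = math.floor(n ** rho)        # block size r_n, rounded down (assumption)
    m = n // r                      # number of blocks
    block_lag = n ** (rho - delta)  # r_n * Delta_n before rounding r_n
    h = n ** (-eta)                 # kernel bandwidth h_n
    return dt, r, m, block_lag, h

print(eta_bounds(0.9, 0.58))
for n in (5000, 10000, 15000):
    print(n, tuning_values(n, 0.9, 0.58, 0.02))
```

For $\delta = 0.9$ and $\rho = 0.58$ the binding upper bound is the one from condition (iii), which is why the admissible range for $\eta$ is so narrow.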
Table 2.1 shows values of $\Delta_n$, $r_n$ and $h_n$ for $\delta = 0.9$, $\rho = 0.58$ and $\eta = 0.02$, when $n = 5000, 10000, 15000$.

Table 2.1: The values of $\Delta_n$, $r_n$, $m_n$ and $h_n$ when $\delta = 0.9$, $\rho = 0.58$ and $\eta = 0.02$.

n         $\Delta_n = n^{-0.9}$    $r_n = n^{0.58}$    $m_n \approx n^{0.42}$    $r_n\Delta_n = n^{-0.32}$    $h_n = n^{-0.02}$
5,000     0.00047                  139                 35                        0.0655                       0.8434
10,000    0.00025                  208                 48                        0.0525                       0.8318
15,000    0.00017                  264                 56                        0.0461                       0.8250

Remark: The distribution of $\{\bar{X}^{r,\Delta}_j\}$ is not clear, although we suspect that the marginal distribution of $\bar{X}^{r,\Delta}_j$ converges in distribution to the marginal distribution of $X_t$ under the conditions of Theorem 2.1. We can prove, adapting the proof of Theorem 2.1, that
$$\frac{\frac{1}{m_n-2} \sum_{j=1}^{m_n-2} \frac{\bar{X}^{r_n,\Delta_n}_{j+2} - \bar{X}^{r_n,\Delta_n}_{j+1}}{r_n\Delta_n}\, \frac{1}{h_n} K\left(\frac{\bar{X}^{r_n,\Delta_n}_j - x}{h_n}\right)}{\frac{1}{m_n-2} \sum_{j=1}^{m_n-2} \frac{1}{h_n} K\left(\frac{\bar{X}^{r_n,\Delta_n}_j - x}{h_n}\right)} - \frac{\frac{1}{m_n-2} \sum_{j=1}^{m_n-2} \frac{X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}}{r_n\Delta_n}\, \frac{1}{h_n} K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)}{\frac{1}{m_n-2} \sum_{j=1}^{m_n-2} \frac{1}{h_n} K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)}$$
converges to 0 in probability as $n \to \infty$, where the two terms on the left-hand side are constructed by replacing $\bar{Y}_j$'s in Definition 2.2 with $\bar{X}_j$'s and with $X_t$'s, respectively.

2.3 Comparison to the Existing Estimators

Recall that our Theorem 2.1 is based on the result of Bandi and Phillips (2003). Our pre-averaging estimator is similar to their estimator which they call the "double-smoothing estimator", and we compare the two in this section. After that, we discuss the "subsampling method", which is used to estimate integrated volatility when measurement error is present. The subsampling method can also be applied to estimation of the drift coefficient, so we will compare it to our pre-averaging estimator.

We first compare the double-smoothing estimator of Bandi and Phillips (2003) with our estimator. Suppose that we observe the time-equispaced data $\{Y_{i\Delta}\}_{i=1}^{n}$.
Both estimators are of the form
$$\frac{\sum_{j=1}^{w} \frac{1}{h} K\left(\frac{W^{kern}_j - x}{h}\right) W^{slope}_j}{\sum_{j=1}^{w} \frac{1}{h} K\left(\frac{W^{kern}_j - x}{h}\right)}.$$
In our pre-averaging estimator,
$$w = m - 2, \quad W^{kern}_j = \bar{Y}^{r,\Delta}_j \quad \text{and} \quad W^{slope}_j = \frac{\bar{Y}^{r,\Delta}_{j+2} - \bar{Y}^{r,\Delta}_{j+1}}{r\Delta}.$$
In the double-smoothing estimator of Bandi and Phillips (2003),
$$w = n - 1, \quad W^{kern}_j = Y_{j\Delta} \quad \text{and} \quad W^{slope}_j = \frac{1}{N^{l,\Delta}_j} \sum_{k : |Y_{k\Delta} - Y_{j\Delta}| \le l} \frac{Y_{(k+1)\Delta} - Y_{k\Delta}}{\Delta}, \tag{2.11}$$
where $l \in \mathbb{R}^+$ and $N^{l,\Delta}_j$ is the number of $Y_{k\Delta}$'s such that $|Y_{k\Delta} - Y_{j\Delta}| \le l$. Note that $l$ can be interpreted as the bandwidth of the uniform kernel.

Bandi and Phillips (2003) also define the usual Nadaraya-Watson estimator and call it the "single-smoothing estimator" to contrast it to the double-smoothing estimator. The single-smoothing estimator (that is, the usual Nadaraya-Watson estimator) is defined by setting
$$w = n - 1, \quad W^{kern}_j = Y_{j\Delta} \quad \text{and} \quad W^{slope}_j = \frac{Y_{(j+1)\Delta} - Y_{j\Delta}}{\Delta}. \tag{2.12}$$
Note that the single-smoothing and the double-smoothing estimators were introduced for the case of no measurement error, that is, the case where $\varepsilon_t = 0$ for all $t$, in which case we have $Y_{i\Delta} = X_{i\Delta}$.

We note three differences between the double-smoothing estimator and our estimator. First, the double-smoothing estimator pre-averages $Y_{i\Delta}$'s such that $|Y_{i\Delta} - Y_{j\Delta}| \le l$ for each $j$ before computing the kernel-weighted average. This is similar to our pre-averaging $Y_{i\Delta}$'s such that $(j - 1)r < i \le jr$ for each $j$.
A difference is that the double-smoothing estimator pre-averages $Y_{i\Delta}$'s according to their values while our estimator pre-averages $Y_{i\Delta}$'s according to their time-indices (the $i\Delta$'s).

Second, while both estimators use averaging for $W^{slope}_j$, our estimator uses the averaged value, $\bar{Y}^{r,\Delta}_j$, for $W^{kern}_j$ while the double-smoothing estimator uses a single observation, $Y_{j\Delta}$.

Third, the double-smoothing estimator was introduced and studied for the case of no measurement error, so its consistency in the presence of measurement error is not yet established. In contrast, Theorem 2.1 states that our estimator is consistent in the presence of measurement error as well as in the case of no measurement error (note that $\varepsilon_t = 0$ for all $t$ satisfies Assumption 2.5). Our simulation study will indicate that, for finite samples, the double-smoothing estimator has higher mean squared error than our estimator when there is measurement error.

Now we discuss the subsampling method, which Zhang, Mykland, and Aït-Sahalia (2005) studied for estimation of integrated volatility, and compare it to our pre-averaging approach. Using notation in (2.1), integrated volatility is defined as $\int_a^b \sigma^2(X_t)\,dt$ for some fixed time period $[a, b]$. When we observe data $\{X_{t_i}\}_{i=1}^{n}$ where $a = t_1 < t_2 < \dots < t_n = b$, Andersen et al. (2001) proposed $\sum_{i=1}^{n-1} (X_{t_{i+1}} - X_{t_i})^2$, called the realized volatility estimator, as an estimator of the integrated volatility. They showed that $\sum_{i=1}^{n-1} (X_{t_{i+1}} - X_{t_i})^2$ converges to $\int_a^b \sigma^2(X_t)\,dt$ in probability as $n$ tends to infinity.

However, the realized volatility estimator does not give an accurate estimate of the integrated volatility when the data are observed with measurement errors and when the data are sampled at high frequency, i.e. when the $t_{i+1} - t_i$'s are small.
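The effect of measurement error on the realized volatility estimator can be seen in a small simulation. The following is a sketch under stated assumptions: the latent process is a Brownian motion with zero drift and constant diffusion coefficient (so that the integrated volatility over $[0, 1]$ equals $\sigma^2$), the noise is Gaussian, and the function names and parameter values are chosen only for illustration.

```python
import math
import random

def realized_volatility(path):
    """Sum of squared increments of the observed path."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))

def simulate(n, sigma, noise_sd, seed=0):
    """Simulate X on [0, 1] at n + 1 equispaced times (mu = 0, constant sigma:
    an assumed toy model) and return (X, Y) with Y = X + independent noise."""
    rng = random.Random(seed)
    dt = 1.0 / n
    x = [0.0]
    for _ in range(n):
        x.append(x[-1] + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    y = [xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y

x, y = simulate(n=10_000, sigma=1.0, noise_sd=0.01)
print(realized_volatility(x))  # computed from the latent path
print(realized_volatility(y))  # computed from the noisy observations
```

Even with a noise standard deviation of only 0.01, the estimate computed from the noisy observations is several times larger than the one computed from the latent path, because every squared increment picks up a noise contribution that does not shrink with the sampling interval.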
Zhang, Mykland, and Aït-Sahalia (2005) showed that, when we observe data $\{Y_{t_i} = X_{t_i} + \varepsilon_{t_i}\}_{i=1}^{n}$ where the $\varepsilon_{t_i}$'s are independent and identically distributed with mean zero and variance $s^2_\varepsilon$,
$$\sum_{i=1}^{n-1} (Y_{t_{i+1}} - Y_{t_i})^2 = 2n s^2_\varepsilon + O_p(n^{1/2})$$
(Zhang, Mykland, and Aït-Sahalia, 2005, page 1395, Equation 5). This shows that the realized volatility estimator does not converge in probability to the integrated volatility in the presence of measurement error.

As Zhang, Mykland, and Aït-Sahalia (2005) noted, in order to avoid this estimation problem of the realized volatility estimator, researchers used data sampled at lower frequency, namely, the data $\{Y_{t_k}, Y_{t_{2k}}, \dots\}$ for some $k > 1$ instead of $\{Y_{t_1}, Y_{t_2}, \dots\}$, to compute the realized volatility estimate. When considering $Y_{t_{(i+1)k}} - Y_{t_{ik}} = (X_{t_{(i+1)k}} - X_{t_{ik}}) + (\varepsilon_{t_{(i+1)k}} - \varepsilon_{t_{ik}})$ for each $i$ for some large $k$, the difference $X_{t_{(i+1)k}} - X_{t_{ik}}$ is large in magnitude relative to $\varepsilon_{t_{(i+1)k}} - \varepsilon_{t_{ik}}$. This yields $\sum_i (Y_{t_{(i+1)k}} - Y_{t_{ik}})^2 \approx \sum_i (X_{t_{(i+1)k}} - X_{t_{ik}})^2$ when $k$ is large. Recall that $\sum_i (X_{t_{(i+1)k}} - X_{t_{ik}})^2$ is a consistent estimator of the integrated volatility. Zhang, Mykland, and Aït-Sahalia (2005) formalized this ad-hoc approach and proposed to choose $k$ as the minimizer of the asymptotic mean squared error of the estimator $\sum_i (Y_{t_{(i+1)k}} - Y_{t_{ik}})^2$, and they called this approach the subsampling method.

We compare the subsampling method to our pre-averaging approach by considering hourly-observed stock price data. Our pre-averaging approach with a block size of one day obtains $\bar{Y}$'s that are average daily prices. In contrast, the equivalent subsampling method uses daily closing prices to construct the subsampled data.

Recall that the estimators of Bandi and Phillips (2003), defined in (2.11) and (2.12), use $\{Y_{i\Delta}\}_{i=1}^{n}$ as data while our pre-averaging estimator, defined in Definition 2.2, uses $\{\bar{Y}^{r,\Delta}_j\}_{j=1}^{m}$ where $m = n/r$ and $r$ is the block size.
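The contrast between the two ways of reducing the data, and the estimator of Definition 2.2 itself, can be sketched as follows. The Gaussian kernel, the function names and the noise-free linear test path are assumptions made for the illustration; they are not part of the definitions.

```python
import math

def gaussian_kernel(z):
    """Standard normal density; an assumed choice of K."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def block_averages(y, r):
    """Definition 2.1: averages over m = floor(n / r) consecutive blocks of size r.
    y[0] holds Y at time 1 * dt, so block j covers list indices (j-1)r .. jr-1."""
    m = len(y) // r
    return [sum(y[j * r:(j + 1) * r]) / r for j in range(m)]

def subsample(y, r):
    """Subsampled data Y at times j * r * dt for j = 1..m (last value of each block)."""
    m = len(y) // r
    return [y[(j + 1) * r - 1] for j in range(m)]

def preaveraged_drift(x, y, r, dt, h, kernel=gaussian_kernel):
    """Definition 2.2 evaluated at x; the common factor 1 / (m - 2) cancels."""
    ybar = block_averages(y, r)
    m = len(ybar)
    num = 0.0
    den = 0.0
    for j in range(m - 2):  # j = 1, ..., m - 2 in the 1-based notation of the text
        weight = kernel((ybar[j] - x) / h) / h
        slope = (ybar[j + 2] - ybar[j + 1]) / (r * dt)
        num += weight * slope
        den += weight
    if den == 0.0:
        raise ZeroDivisionError("no kernel weight at x; increase h")
    return num / den

# Hypothetical noise-free path with constant slope 2: every discrete slope equals 2.
dt = 0.01
y = [2.0 * i * dt for i in range(1, 101)]
print(block_averages([1, 2, 3, 4, 5, 6, 7], 3))  # block means, like average daily prices
print(subsample([1, 2, 3, 4, 5, 6, 7], 3))       # block-end values, like daily closing prices
print(preaveraged_drift(1.0, y, r=5, dt=dt, h=1.0))
```

On the linear test path the estimate equals the slope of the path for any evaluation point and bandwidth with positive kernel weight, which gives a simple check that the indexing of the blocks and of the shifted differences is implemented as in Definition 2.2.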
We can apply the subsampling method to the estimators of Bandi and Phillips (2003) by using $\{Y_{jr\Delta}\}_{j=1}^{m}$ as the data instead of $\{Y_{i\Delta}\}_{i=1}^{n}$. Our simulation study, which will be given in Section 2.5, indicates that applying the subsampling method leads to much lower mean squared errors for the estimators of Bandi and Phillips (2003). The estimators of Bandi and Phillips (2003) without subsampling have higher mean squared errors than our estimator. However, the mean squared errors of the estimators with subsampling are about the same as ours.

2.4 Bandwidth Choices

In Section 2.2, we stated Theorem 2.1, which says that $\hat{\mu}_{\bar{Y}}$ is a consistent and asymptotically normal estimator of $\mu$ in the presence of measurement error under some conditions. We can see that many sequences $\{h_n\}$ and $\{r_n\}$ satisfy the conditions of the theorem. Hence, it is desirable to have a principled way of choosing $h$ and $r$ given the sample size $n$. We discuss the choice of $h$ in Section 2.4.1 and the choice of $r$ in Section 2.4.2.

2.4.1 Choice of the kernel bandwidth h

The kernel bandwidth choice criterion has been extensively studied in the literature, at least in certain circumstances such as the analysis of cross-sectional data. For an overview of bandwidth choice methods, see e.g. Jones, Marron, and Sheather (1996) and the references therein. In this subsection, we discuss the two popular methods of choosing $h$, the plug-in method and the cross-validation method, in the context of our estimator. After that, we briefly introduce a bandwidth choice method recently proposed by Bandi, Corradi, and Moloche (2009). This method is explicitly intended for kernel estimators of the drift coefficient $\mu$ and the diffusion coefficient $\sigma$ of a diffusion process.

The plug-in method requires an expression for the asymptotic mean squared error of the estimator and the existence of $h_{opt}$, the $h$ that minimizes the asymptotic mean squared error. We then "plug into" the expression for $h_{opt}$ estimates of all unknown quantities.
This yields $\hat{h}_{opt}$, the plug-in bandwidth. In the context of our estimator, we first obtain the asymptotic bias and the asymptotic variance of $\hat{\mu}_{\bar{Y}}(x)$ using conclusion 3 of Theorem 2.1:

\[ \text{asymptotic bias} = h^2 \Gamma_\mu(x) \tag{2.13} \]

and

\[ \text{asymptotic variance} = \frac{K_2 \sigma^2(x)/f_X(x)}{(n-r)\Delta h}. \]

The asymptotic mean squared error (AMSE) of our estimator is, then, the sum of the squared asymptotic bias and the asymptotic variance:

\[ \mathrm{AMSE}(x) = h^4 \Gamma_\mu^2(x) + \frac{K_2 \sigma^2(x)/f_X(x)}{(n-r)\Delta h}. \]

Now we find the bandwidth $h$ that minimizes the AMSE. When $\Gamma_\mu^2(x) \neq 0$, differentiating with respect to $h$ and setting the derivative to zero yields the minimizer of the AMSE, $h_{opt}(x)$:

\[ h_{opt}(x) = \left( \frac{K_2 \sigma^2(x)/f_X(x)}{4\Gamma_\mu^2(x)} \right)^{1/5} \times \left( \frac{n}{n-r} \right)^{1/5} \times \left( \frac{1}{n\Delta} \right)^{1/5}. \tag{2.14} \]

When $\Gamma_\mu^2(x) = 0$, the AMSE decreases to 0 as $h$ approaches infinity.

While it is reasonable to take $h_{opt}(x)$ given by (2.14) as the bandwidth for our estimator, in practice the values of $\Gamma_\mu(x)$, $\sigma^2(x)$ and $f_X(x)$ are not known. The plug-in bandwidth is obtained by plugging estimates of $\Gamma_\mu(x)$, $\sigma^2(x)$ and $f_X(x)$ into (2.14). The estimates of these unknowns can be obtained either parametrically or nonparametrically. The "ideal" bandwidth, $h_{opt}(x)$, is called the oracle bandwidth and is often used in simulation studies as a gold standard.

Our simulation study in Section 2.5 indicates that the cross-validation bandwidth, which is discussed below, exhibits better finite sample performance than the oracle bandwidth. Therefore, we do not consider estimating $\Gamma_\mu(x)$, $\sigma^2(x)$ and $f_X(x)$ to calculate the plug-in bandwidth, and we recommend using the cross-validation bandwidth instead of the plug-in bandwidth.

The cross-validation method of choosing $h$ attempts to minimize what is called the prediction error as follows. We can think of our estimator $\hat{\mu}_{\bar{Y}}(x)$ as a predictor of $(X_{t+\delta} - X_t)/\delta$ given that $X_t = x$.
In this perspective, we choose the bandwidth $h$ so that the resulting estimator $\hat{\mu}_{\bar{Y}}(x)$ has the least error in predicting $(X_{t+\delta} - X_t)/\delta$ given $X_t = x$, where the prediction error is estimated using the data.

To compute an estimate of the prediction error, we can use what is called the H-block cross-validation method proposed by Chu and Marron (1991) and further developed by Burman, Chow, and Nolan (1994), who coined the name "H-block". The H-block cross-validation modifies the well-known leave-one-out cross-validation (Stone, 1974), used for independent data, for use with stationary time-series data. In cross-validation, in order to estimate the prediction error, one predicts a data value by using information in a portion of the data set, called the training data. Ideally, the training data and the target data value are independent. In leave-one-out cross-validation, one constructs the training data by omitting one observation. In H-block cross-validation, one omits $2H$ more observations, $H$ neighboring observations in the past and $H$ neighboring observations in the future. The objective of omitting the additional $2H$ observations is to weaken the dependence between the target data value and the training data: if the time between the two is large enough, we expect that the autocorrelation between the two is close to zero. This expectation is valid if the time-series data are stationary, which we assume in Assumption 2.3.

Now we formally describe the cross-validation bandwidth choice method based on H-block cross-validation in the context of our estimator. We estimate the prediction error by

\[ \widehat{PE}(h\,;H) \equiv \sum_{k=1}^{m-1} \left( \hat{\mu}^{(k,H)}_{\bar{Y}}\big(\bar{Y}^{r,\Delta}_{k}\big) - \frac{\bar{Y}^{r,\Delta}_{k+1} - \bar{Y}^{r,\Delta}_{k}}{r\Delta} \right)^2 \tag{2.15} \]

where $H$ is an integer and $\hat{\mu}^{(k,H)}_{\bar{Y}}\big(\bar{Y}^{r,\Delta}_{k}\big)$ is our drift coefficient estimator $\hat{\mu}_{\bar{Y}}(x)$ evaluated at $x = \bar{Y}^{r,\Delta}_{k}$ calculated by removing the terms $j = k-H, \ldots, k+H$ of the sums in Definition 2.2, that is,

\[ \hat{\mu}^{(k,H)}_{\bar{Y}}\big(\bar{Y}^{r,\Delta}_{k}\big) \equiv \frac{\sum_{j \in A^H_k} \frac{\bar{Y}^{r,\Delta}_{j+2} - \bar{Y}^{r,\Delta}_{j+1}}{r\Delta} \, \frac{1}{h} K\!\left( \frac{\bar{Y}^{r,\Delta}_{j} - \bar{Y}^{r,\Delta}_{k}}{h} \right)}{\sum_{j \in A^H_k} \frac{1}{h} K\!\left( \frac{\bar{Y}^{r,\Delta}_{j} - \bar{Y}^{r,\Delta}_{k}}{h} \right)} \]

for the set of indices $A^H_k = \{1, \ldots, m-2\} \cap \{k-H, \ldots, k+H\}^C$. The integer $H$ is chosen so that the correlation between $\bar{Y}^{r,\Delta}_{k}$ and the $\bar{Y}^{r,\Delta}_{j}$'s with $j \in A^H_k$ is "weak enough". For the implementation, $H$ can be chosen by looking at the empirical autocorrelation function.

The value $(\bar{Y}^{r,\Delta}_{k+1} - \bar{Y}^{r,\Delta}_{k})/r\Delta$ in (2.15) is called the target data value. Recall that we want to choose $h$ so that the resulting estimator $\hat{\mu}_{\bar{Y}}(x)$ has the least error in predicting $(X_{t+\delta} - X_t)/\delta$ given $X_t = x$. The target data value $(\bar{Y}^{r,\Delta}_{k+1} - \bar{Y}^{r,\Delta}_{k})/r\Delta$ is considered an estimate of the value $(X_{t+\delta} - X_t)/\delta$ when $t = (k-1)r\Delta$ and $\delta = r\Delta$. Since we cannot observe the underlying process $\{X_t\}$, we use the pre-averaged process for the target data value expecting that the pre-averaged process is close to the underlying process. We choose the target data value not to depend on $h$ in order to prevent interaction between $\hat{\mu}^{(k,H)}_{\bar{Y}}\big(\bar{Y}^{r,\Delta}_{k}\big)$ and the target data value.

The cross-validation bandwidth $h_{cv}$ is the minimizer of $\widehat{PE}(h\,;H)$:

\[ h_{cv} \equiv \operatorname*{argmin}_{h} \widehat{PE}(h\,;H). \tag{2.16} \]

The simulation study in Section 2.5 indicates that the mean integrated squared error of our pre-averaging estimator with $h_{cv}$ is smaller than that of our estimator with the oracle bandwidth $h_{opt}(x)$ in (2.14). Based on this simulation result, we recommend using (2.16) as a bandwidth choice criterion over the plug-in bandwidth.

We finish this subsection by introducing a recently proposed bandwidth choice method in Bandi, Corradi, and Moloche (2009), which is explicitly developed to jointly choose the bandwidths for the estimators of the drift and the diffusion coefficients of a diffusion process. Their method relies on residuals of the fits being approximately independent and normally distributed. Their method consists of two stages, and the first stage is as follows.
Given the time-equispaced time-series data $\{X_{i\Delta}\}_{i=1}^{n}$ generated from a diffusion process and the bandwidths $(h_{dr}, h_{dif})$ applied to the estimators of the drift and the diffusion coefficients respectively, they define the scaled residuals

\[ \hat{r}_{i\Delta} = \frac{X_{(i+1)\Delta} - X_{i\Delta} - \hat{\mu}(X_{i\Delta}; h_{dr})\Delta}{\hat{\sigma}(X_{i\Delta}; h_{dif})\sqrt{\Delta}} \]

for $i = 1, \ldots, n-1$ where $\hat{\mu}(\,\cdot\,; h_{dr})$ and $\hat{\sigma}(\,\cdot\,; h_{dif})$ are kernel estimates of the drift and the diffusion coefficients. Then they choose $(h^*_{dr}, h^*_{dif}) \in (0,\infty) \times (0,\infty)$ by

\[ (h^*_{dr}, h^*_{dif}) = \operatorname*{argmin}_{h_{dr},\, h_{dif}} \; \sup_x |F_{\hat{r}}(x) - \Phi(x)| \tag{2.17} \]

where $F_{\hat{r}}$ is the empirical distribution function of $\hat{r}$ and $\Phi$ is the distribution function of a standard normal random variable. The justification for this first step is as follows. For small $\Delta$, the drift coefficient $\mu$ and the diffusion coefficient $\sigma$ can be treated as constants in each time interval $[i\Delta, (i+1)\Delta]$, $i = 1, \ldots, n-1$. We denote such constants by $\mu_{i\Delta}$ and $\sigma_{i\Delta}$. Then, from (2.2), we have

\[ X_{(i+1)\Delta} - X_{i\Delta} \approx \int_{i\Delta}^{(i+1)\Delta} \mu_{i\Delta}\, dt + \int_{i\Delta}^{(i+1)\Delta} \sigma_{i\Delta}\, dW_t, \]

and so

\[ \frac{X_{(i+1)\Delta} - X_{i\Delta} - \mu_{i\Delta}\Delta}{\sigma_{i\Delta}\sqrt{\Delta}} \approx \frac{1}{\sqrt{\Delta}} \int_{i\Delta}^{(i+1)\Delta} dW_t = \frac{W_{(i+1)\Delta} - W_{i\Delta}}{\sqrt{\Delta}} \overset{d}{=} Z_i, \]

where $\{Z_i\}_{i=1}^{n-1}$ are independent and identically distributed standard normal random variables.

After choosing the bandwidth in (2.17), they proceed to the second stage. Here we just introduce the main idea of their second stage. The kernel estimators of the drift and the diffusion coefficients have conditions on the convergence rates of the bandwidth for consistency and asymptotic normality, for example, conditions (i), (ii) and (iii) and the additional condition in conclusion 2 of Theorem 2.1 for our estimator. They construct three random variables that depend on $(h^*_{dr}, h^*_{dif})$, the bandwidths chosen in the first step. They derive the asymptotic distribution of a functional of these random variables when $(h^*_{dr}, h^*_{dif})$ do not satisfy at least one of the conditions. They use this functional and its asymptotic distribution to determine if any of the conditions are violated.
If so, they use the values of the three random variables to determine how to adjust $h^*_{dr}$ and $h^*_{dif}$.

2.4.2 Choice of the block size r

Note first that pre-averaging is a form of smoothing. According to our simulation study using the oracle bandwidth $h_{opt}(x)$ defined in (2.14), there is a bias-variance tradeoff in choosing the block size $r$ (see Figure 2.8). This tradeoff is similar to the well-known tradeoff for the choice of the bandwidth $h$: a large value of $r$ is likely to produce a constant function estimate of $\mu$ and result in large bias but small variance. On the other hand, a small value of $r$ is likely to result in small bias but large variance.

Therefore, one might think of choosing $r$ by the plug-in method or the cross-validation method. However, there are complications in using these approaches. For the plug-in method, we cannot use AMSE for the choice of $r$ because we do not have an asymptotic result that contains $r$. For the cross-validation method, there are two complications. First, there is a computational issue. If we use the cross-validation method, we should minimize the prediction error with respect to both $r$ and $h$. However, this two-dimensional optimization problem is computationally burdensome considering that the measurement error problem is often considered for high-frequency data. Second, finding target data values that do not depend on $r$ is not easy. In choosing $h$, we proposed in (2.15) to use the $\bar{Y}^{r,\Delta}_{j}$'s. However, when we choose both $h$ and $r$, using these $\bar{Y}^{r,\Delta}_{j}$'s in the targets may lead to unsatisfactory choices.

Because of these problems, we recommend using an ad-hoc choice of $r$, just as researchers estimating integrated volatility used when choosing the subsampling frequency before Zhang, Mykland, and Aït-Sahalia (2005) formalized the subsampling approach. A sensible ad hoc choice of $r$ would be one considering any periodicity of the data. For example, in our simulation study in which we generated daily observed data with five business days per week, we chose $r = 5$ to yield weekly averages.

2.5 Simulation Study

In this section, we carry out a simulation study to assess finite sample performance of our estimator. We simulate data with two kinds of underlying models for the drift coefficient, $\mu$, one with linear drift coefficient and one with nonlinear drift coefficient, and with the methods of choosing $h$ and $r$ discussed in Section 2.4. We consider the following two kinds of underlying models:

\[ dX_t = 0.858 \times (0.086 - X_t)\,dt + 0.157\sqrt{X_t}\,dW_t, \tag{2.18} \]

\[ dX_t = -(X_t - 1)(X_t + 1)^2\,dt + 2\,dW_t. \tag{2.19} \]

The process defined by (2.18) is called a Cox-Ingersoll-Ross (CIR) process and is used as an underlying model for a short-term interest rate process. The value of a CIR process at time $t$ equals the annual interest rate, and the time is measured in days, with a year being 250 days (counting business days only). Following the parameter choice of Chapman and Pearson (2000), we use the parameter values (0.858, 0.086, 0.157) in (2.18) to match the solution process's monthly (i.e. 21st-order) autocorrelation, unconditional mean and unconditional variance to the corresponding sample quantities of the dataset of Aït-Sahalia (1996). The dataset is seven-day Eurodollar deposit rates observed daily from June 1, 1973 to February 25, 1995 (total of 5505 observations). The Eurodollar deposit rate is known to move in close connection with short-term interest rates such as T-bill rates (Aït-Sahalia, 1996, page 539). We use the process defined by (2.19) to study the performance of our estimator when the true drift coefficient is nonlinear.
It is straightforward to check that each of the models (2.18) and (2.19) satisfies Assumptions 2.2 and 2.3, so that the unique solution process exists and is positive recurrent. We generated 1,000 discretely-observed independent sample paths for each of (2.18) and (2.19) at time increments of $\Delta = 1/250$, which represents daily observations assuming 250 business days a year, and with the number of observations $n = 5505$, which is the sample size of the dataset of Aït-Sahalia (1996). The top panels in Figures 2.1 and 2.2 depict sample paths of the processes defined by (2.18) and (2.19) with these values of $\Delta$ and $n$.

In order to generate sample paths of the model (2.18), we first note that the analytical forms of the stationary density and the transition density are known. When generating each sample path, we used the package sde in R (Iacus, 2009) to generate an initial value by a random draw from the stationary density, to generate the first observation by a random draw from the transition density given the initial value, to generate the second observation by a random draw from the transition density given the first observation, and so on. Then the data generated by this procedure have the same distribution as the distribution of the discretely observed data of model (2.18).

For model (2.19), we note that the analytical form of the stationary density is known (Equation 2.4), but the analytical form of the transition density is unknown. When generating each sample path, we again used the package sde in R, which first obtains an initial value by a random draw from the stationary density, then implements a numerical approximation method proposed by Milstein to generate the discretely-observed sample paths. Milstein's method uses the first-order and the second-order derivatives of $\mu$ and $\sigma$. For details on Milstein's method, see e.g. Iacus (2008, Chapter 2, page 81, Equation 29).

We then added independent and identically normally distributed measurement errors to the generated discretely-observed sample paths. For model (2.18), we took 0.002 as the standard deviation of our measurement errors. This value is an estimate of the standard deviation of the measurement error of the dataset of Aït-Sahalia (1996), proposed by Jones (2003, page 812). We note that the value 0.002 is 5.7% of the unconditional standard deviation of the solution process of (2.18). We also set the standard deviation of the measurement error added to model (2.19) to be 5.7% of the unconditional standard deviation of the solution process of (2.19), that is, to be 0.0661. The second panels of Figures 2.1 and 2.2 depict sample paths of the processes defined by (2.18) and (2.19) with measurement errors added.

Using these sample paths with additive measurement errors, we estimated the drift coefficient by our pre-averaging estimator in Definition 2.2 (which we denote "Avg") and the double-smoothing and the single-smoothing estimator of Bandi and Phillips (2003), which are defined in (2.11) and (2.12) respectively and which we denote "BPD" and "BPS", respectively. We also combined the subsampling method explained in Section 2.3 with the BPS and the BPD estimators, and we denote these as "BPSs" and "BPDs". In Avg, we chose $r = 5$ (see Definition 2.1 for the definition of $r$), which means we took weekly averages assuming 5 business days a week. In BPSs and BPDs, we used the weekly closing prices (i.e. every fifth value) in order to construct subsampled data. The third and fourth panels of each of Figures 2.1 and 2.2 depict, respectively, averaged and subsampled sample paths of the process defined by each of (2.18) and (2.19).

For all estimators, we used the standard normal kernel for estimation.
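
The data-generating steps described above for model (2.18) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the simulation study, which relied on the R package sde. The sketch assumes the standard facts that the stationary law of a CIR process $dX_t = \kappa(\theta - X_t)dt + \sigma\sqrt{X_t}\,dW_t$ is a gamma distribution with shape $2\kappa\theta/\sigma^2$ and scale $\sigma^2/(2\kappa)$, and that its transition law is a scaled noncentral chi-square distribution. The function name `simulate_cir_exact` and the variable names are hypothetical.

```python
import numpy as np


def simulate_cir_exact(n, delta, kappa=0.858, theta=0.086, sigma=0.157, rng=None):
    """Draw X_delta, ..., X_{n delta} from the CIR model (2.18), started in stationarity."""
    rng = np.random.default_rng() if rng is None else rng
    decay = np.exp(-kappa * delta)
    c = sigma**2 * (1.0 - decay) / (4.0 * kappa)      # scale of the transition law
    df = 4.0 * kappa * theta / sigma**2               # degrees of freedom
    # Initial value: one draw from the stationary gamma distribution.
    x_prev = rng.gamma(shape=2.0 * kappa * theta / sigma**2,
                       scale=sigma**2 / (2.0 * kappa))
    x = np.empty(n)
    for i in range(n):
        nonc = x_prev * decay / c                     # noncentrality given the previous value
        x_prev = c * rng.noncentral_chisquare(df, nonc)
        x[i] = x_prev
    return x


rng = np.random.default_rng(2014)                     # arbitrary seed, for reproducibility only
x = simulate_cir_exact(n=5505, delta=1.0 / 250.0, rng=rng)
# Additive i.i.d. normal measurement error with standard deviation 0.002.
y = x + rng.normal(loc=0.0, scale=0.002, size=x.size)
```

Because each value is drawn from the transition law given the previous one, no discretization error is introduced for model (2.18); for model (2.19), where the transition density is unknown, an approximation scheme is needed instead.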
The estimators were evaluated pointwise at each point in the grid which consists of 100 equispaced points ranging from the 20th percentile to the 80th percentile of the invariant density $f_X$ defined in Assumption 2.3.

For each candidate estimator, we used the oracle bandwidth defined in (2.14). Note that all estimators have the same oracle bandwidths because they have the same asymptotic biases and variances. We used the cross-validation bandwidths defined in (2.16) for Avg, BPSs and BPDs estimators. We did not use the cross-validation bandwidths for BPS and BPD due to the high computational cost. However, it will become evident from the simulation result using the oracle bandwidths, summarized in Table 2.2, that BPS and BPD have much larger mean squared errors than those of Avg, BPSs and BPDs. When calculating the cross-validation bandwidth defined in (2.16) for Avg, BPSs and BPDs, we set $H = 150$ by the observation that, for most sample paths, the empirical autocorrelation functions of the averaged and the subsampled data reached zero before the time lag reaches 150. In addition, for BPSs and BPDs, we used $Y$'s instead of $\bar{Y}$'s in the target data value of (2.15). In other words, the $\widehat{PE}$ for BPSs and BPDs using the subsampled data $\{Y_{jr\Delta}\}_{j=1}^{m}$ is defined by

\[ \widehat{PE}(h\,;H) \equiv \sum_{j=1}^{m-1} \left( \hat{\mu}^{(j,H)}_{BP}\big(Y_{jr\Delta}\big) - \frac{Y_{(j+1)r\Delta} - Y_{jr\Delta}}{r\Delta} \right)^2, \]

where $\hat{\mu}^{(j,H)}_{BP}$ is either the BPSs or the BPDs estimator. The superscript $(j, H)$ has the same meaning as that of $\hat{\mu}^{(k,H)}_{\bar{Y}}\big(\bar{Y}^{r,\Delta}_{k}\big)$ in (2.15) for the Avg estimator.

In order to solve the minimization problem of (2.16), for each sample path, we first calculated $D = \max_i Y_{i\Delta} - \min_i Y_{i\Delta}$ where $Y_{i\Delta}$ is defined in (2.3), and we found the local minimum by looking at bandwidths in the grid of length 30, $\{D/30, 2D/30, \ldots, D\}$. If there were multiple bandwidths that attain local minima, we took the largest bandwidth, which is a common practice when using cross-validation. In our simulation result, a grid of 30 values was fine enough to detect the local minima.
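
The H-block prediction error (2.15) and the grid search just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the function names are hypothetical, the kernel is the standard normal density, the array `ybar` is assumed to hold the $m$ pre-averaged values, `step` stands for $r\Delta$, and the fallback to the overall grid minimizer when no interior local minimum exists is an assumption of this sketch.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def hblock_prediction_error(ybar, h, H, step):
    """Prediction error of (2.15) for bandwidth h; step plays the role of r * Delta."""
    ybar = np.asarray(ybar, dtype=float)
    m = len(ybar)
    j = np.arange(m - 2)                              # 0-based version of j = 1, ..., m-2
    response = (ybar[j + 2] - ybar[j + 1]) / step     # scaled increments in the numerator
    pe = 0.0
    for k in range(m - 1):                            # 0-based version of k = 1, ..., m-1
        keep = np.abs(j - k) > H                      # index set A_k^H
        w = gaussian_kernel((ybar[j[keep]] - ybar[k]) / h) / h
        denom = w.sum()
        if denom <= 0.0:                              # no usable training points for this k
            return np.inf
        prediction = np.dot(w, response[keep]) / denom
        target = (ybar[k + 1] - ybar[k]) / step       # target data value, free of h
        pe += (prediction - target) ** 2
    return pe


def select_bandwidth(ybar, H, step, d_range, n_grid=30):
    """Largest interior local minimizer of the prediction error over {D/n_grid, ..., D}."""
    grid = d_range * np.arange(1, n_grid + 1) / n_grid
    pe = np.array([hblock_prediction_error(ybar, h, H, step) for h in grid])
    local = [i for i in range(1, n_grid - 1)
             if pe[i] < pe[i - 1] and pe[i] <= pe[i + 1]]
    # Assumption of this sketch: without an interior local minimum, use the grid argmin.
    best = max(local) if local else int(np.argmin(pe))
    return grid[best], pe
```

A call such as `select_bandwidth(ybar, H=150, step=5 / 250, d_range=D)` would mirror the settings reported in this section, with `D` computed from the raw observations. Each evaluation of the prediction error costs on the order of $m^2$ kernel evaluations, which is consistent with the remark that cross-validation on the full, unsubsampled data is computationally expensive.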
We obtained an interior minimizer of $\widehat{PE}(\,\cdot\,;H)$ for Avg and BPSs estimators for every sample path we generated. Figure 2.3 contains density plots of the selected bandwidths. For the sample paths generated by model (2.18), the average value of $D$ was 0.176. For model (2.19), the average value was 5.41. These values are reflected in the scales of the horizontal axes in Figure 2.3.

For BPDs, recall that we need to choose both $h$ and $l$, where $l$ is defined in (2.11). We chose to minimize $\widehat{PE}(\,\cdot\,;H)$ with respect to $h$ with the restriction that $l = h$. This is motivated by the fact that, as Bandi and Phillips (2003) noted, choosing $\{l_n\}$, the bandwidth sequence of $l$ according to the sample size $n$, so that $h_n/l_n \to C > 0$ yields smaller asymptotic variance for the double-smoothing estimator than that of the single-smoothing estimator (Bandi and Phillips, 2003, Remark 5). In our simulation study, for some sample paths, the curve $h \mapsto \widehat{PE}(h\,;H)$ for the BPDs estimator evaluated at the grid of the bandwidths $\{D/30, 2D/30, \ldots, D\}$ was monotonically decreasing in $h$. In this case, we picked $D$ as the bandwidth. We see this in Figure 2.3 by the additional modes on the right side of the density plot of the BPDs bandwidths. Since $D$ is random, the bandwidths equal to $D$ form a smooth mode in the density plots. The number of $h$'s whose values were set to $D$ was 365 for model (2.18) and 214 for model (2.19). Whenever $h = D$, the BPDs function estimate was a constant function. There was very little variation in the intercept across such constant function estimates. For example, the standard deviation of the intercepts for model (2.19) was 0.07, which is small considering that the drift coefficient of model (2.19) ranges from -1 to 1 at our evaluation points.

Now we present the simulation results. Table 2.2 summarizes the estimated expected integrated squared errors (ISE) of the estimators.
For each combination of the model, the estimator and the bandwidth choice method, we approximated the ISE for each sample path by an invariant-density-weighted sum of the squared errors over the equispaced grid of length 100 on which the estimators were evaluated, and we provided the mean of the 1,000 ISEs along with the standard error of the mean in Table 2.2.

According to the table, if we use the oracle bandwidth, our Avg estimator has a smaller mean ISE than any other listed estimators except for the BPDs estimator. If we use the cross-validation bandwidth, our Avg estimator has smaller mean ISE than the BPSs estimator and has smaller mean ISE than BPDs for model (2.18) and larger for model (2.19). Except for BPDs in model (2.18), the cross-validation bandwidths have smaller mean ISEs than the oracle bandwidths.

             ISE, Model (2.18)                ISE, Model (2.19)
Estimator    Oracle           CV              Oracle        CV
BPS          1.474 (0.048)    —               94.8 (2.1)    —
BPD          0.628 (0.022)    —               50.9 (1.5)    —
BPSs         0.479 (0.012)    0.187 (0.010)   48.9 (1.1)    27.84 (0.8)
BPDs         0.194 (0.016)    0.283 (0.011)   27.2 (0.8)    22.67 (0.8)
Avg          0.327 (0.008)    0.138 (0.006)   38.0 (1.0)    24.37 (0.7)

Table 2.2: Means (and standard errors, i.e. standard deviations/$\sqrt{1000}$) of the integrated squared errors (ISEs) of candidate estimators over 1,000 sample paths. Labels "BPS" and "BPD" stand for the single-smoothing and the double-smoothing estimators of Bandi and Phillips (2003), respectively. Label "Avg" stands for the pre-averaging estimator. The "s" after a label means the estimator is combined with the subsampling method.

Figures 2.4 to 2.7 depict the pointwise mean squared errors (MSEs), i.e. the means of the 1,000 pointwise squared errors, for each estimator over the grid of evaluation points, for both bandwidths and for both models (2.18) and (2.19).
According to the MSE plots, an estimator which has smaller mean ISE than another estimator in Table 2.2 tends to have smaller pointwise MSEs at almost all evaluation points.

We see that the MSE is small around $x = 0.072$ in Figures 2.4 and 2.5 and around $x = -0.2$ and $x = 1$ in Figures 2.6 and 2.7. To understand why the MSEs are small around these points, in the asymptotic bias and variance defined in (2.13), we set $h$ equal to the oracle bandwidth $h_{opt}(x)$, defined in (2.14). Then we obtain that

\[ \mathrm{AMSE}(x) \propto \frac{\sigma^{8/5}(x)\,\Gamma_\mu^{2/5}(x)}{f_X^{4/5}(x)}. \]

Note that $\Gamma_\mu^{2/5}$ is nonnegative as $\Gamma_\mu^{2}$ is nonnegative. It follows that the points where the MSE is small correspond to points where $f_X(x)$ is large or $\Gamma_\mu(x)$ is small. In Figures 2.4 and 2.5, around $x = 0.072$, $\Gamma_\mu(x)$ equals zero and $f_X$ attains its maximum. In Figures 2.6 and 2.7, $\Gamma_\mu(x)$ equals zero around $x = -0.2$ and $f_X$ attains its maximum around $x = 1$.

We can rewrite $\Gamma_\mu(x)$ using the analytical form of $f_X(x)$ in (2.4), calculating $f'_X(x)$ and simplifying the expression for $\Gamma_\mu(x)$ to

\[ \Gamma_\mu(x) = \nu_2 \times \left( 2\mu'(x) \times \frac{\mu(x)/\sigma(x) - \sigma'(x)}{\sigma(x)} + \frac{1}{2}\mu''(x) \right). \]

From this expression, we can see that $\Gamma_\mu(x) = 0$ whenever $4\mu'(x) \times (\mu(x) - \sigma(x)\sigma'(x)) + \mu''(x)\sigma^2(x) = 0$. In particular, if the drift coefficient $\mu$ is linear so that $\mu'$ is constant and $\mu''$ is zero, then $\Gamma_\mu(x) = 0$ if and only if $\mu(x) = \sigma(x)\sigma'(x)$. This condition is equivalent to the condition that $f'_X(x)$ is zero.

Considering the oracle bandwidth, we see from the bottom panels of Figures 2.4 and 2.6 and the $\Gamma_\mu$ curve in Figure 2.9 that $h_{opt}(x)$ is very large when $\Gamma_\mu(x)$ is close to zero. Recall that the asymptotic bias obtained from Theorem 2.1 is equal to $h^2\Gamma_\mu(x)$, as in (2.13). That $\Gamma_\mu(x)$ is close to zero means that the asymptotic bias is very small, which means we choose large $h$ in order to reduce the asymptotic variance without suffering much from the increase in the asymptotic bias.

One may notice that the MSE curve slightly increases at the point where $h_{opt}(x)$ attains its maximum.
The point corresponds to the one at which $\Gamma_\mu(x)$, and hence the asymptotic bias, is nearly zero. When $\Gamma_\mu(x)$ is nearly zero, if we choose a very large $h$, the asymptotic variance and hence the AMSE are close to zero. However, the finite-sample bias is not necessarily zero even if the asymptotic bias is zero. Therefore, unlike the AMSE, the MSE curve calculated from the simulations does not equal zero. In fact, we see a slight increase in MSE for a very high choice of the bandwidth.

Next, we present a simulation result that indicates the bias-variance tradeoff of the block size $r$, where $r$ is defined in Equation (2.3). Figure 2.8 depicts pointwise squared bias and variance of the Avg estimator with the oracle bandwidth under different values of $r$, namely $r = 2, 5, 10, 20, 40, 60$, for model (2.19). It is clear from Figure 2.8 that the squared bias is increasing in $r$ and the variance is decreasing in $r$. Note that the sharp increase in the bias and the sharp decrease in the variance at $x \approx -0.2$ is due to a very high value of the oracle bandwidth, $h_{opt}$.

We provide the plots only for (2.19) because its drift coefficient is nonlinear, so that the bias-variance tradeoff of $r$ is clearly visible in the plots. We obtained similar results for (2.18), namely that the bias is increasing in $r$ and that the variance is decreasing in $r$.

Returning to Table 2.2, it indicates whether an estimator has smaller ISE than another estimator "on average". To make "pathwise" comparisons of ISEs, we construct Figures 2.10 to 2.17. Each plot is a scatterplot of 1,000 points, each point representing a pair of pathwise ISEs of the two estimators indicated in the plot. For instance, Figure 2.10 is a scatterplot of 1,000 pairs of pathwise ISEs of the Avg and the BPSs estimators, where each pair (i.e. each point in the scatterplot) corresponds to each of 1,000 sample paths of model (2.18).
According to the figures, the pathwise comparisons give conclusions that are consistent with Table 2.2; in other words, an estimator which has smaller mean ISE than another estimator tends to have smaller pathwise ISEs. From Figures 2.10, 2.12, 2.14 and 2.16, the Avg estimator has smaller pathwise ISE than BPSs for, respectively, 825, 513, 679 and 549 out of 1,000 sample paths. From Figures 2.11, 2.13, 2.15 and 2.17, the Avg estimator has smaller pathwise ISE than BPDs for, respectively, 267, 782, 282 and 464 out of 1,000 sample paths.

We can make similar statements about the oracle and the cross-validation bandwidth, that is, the cross-validation bandwidth tends to have smaller pathwise ISEs than the oracle bandwidth. As an example, Figures 2.18 and 2.19 compare pathwise ISEs of the Avg estimator with the oracle and the cross-validation bandwidths for models (2.18) and (2.19). From Figures 2.18 and 2.19, the Avg estimator with the cross-validation bandwidth has smaller pathwise ISEs than the estimator with the oracle bandwidth for, respectively, 741 and 852 out of 1,000 sample paths.

One may notice that some points in Figures 2.13 and 2.17 form straight horizontal lines in the plots. Those points correspond to sample paths for which the bandwidth $h$ for the BPDs estimator was equal to $D = \max_i Y_{i\Delta} - \min_i Y_{i\Delta}$, where $Y_{i\Delta}$ is defined in (2.3) (recall the discussion about the additional modes of the density plot of the BPDs bandwidths in Figure 2.3). We mentioned earlier that, when we chose $h = D$, the resulting function estimate was a constant function estimate with little variation in its intercept across such constant function estimates. Therefore, when we take $D$ as the bandwidth, the pathwise ISE of the BPDs estimator has little variation across the sample paths. Hence the points form a straight horizontal line, where the values of pathwise ISEs of the BPDs estimator are almost fixed at some level on the Y-axis.

2.6 Proof of Theorem 2.1

In this section, we provide a full proof of Theorem 2.1.
In order to clarify the argument, we first, in Section 2.6.1, give an overall scheme of the proof and derive key statements that are sufficient to prove Theorem 2.1. Then we prove those key statements in Sections 2.6.2 and 2.6.3.

2.6.1 Structure of the proof

Bandi and Phillips (2003) showed that their single-smoothing estimator, defined in (2.12), is a consistent and asymptotically normal estimator of $\mu$ when we observe a sample path of a recurrent solution process $\{X_t\}$ sampled discretely in time and without measurement error. We prove Theorem 2.1 by showing, under the conditions of this theorem, that the difference between our estimator computed from the $Y_{i\Delta}$'s and the single-smoothing estimator of Bandi and Phillips (2003) computed from the $X_{i\Delta}$'s converges to 0 in probability. This subsection presents the overall scheme of this proof.

To begin with, we introduce the single-smoothing estimator of Bandi and Phillips (2003) more explicitly and state their results on its consistency and asymptotic normality.
For notational convenience, in order to relate the consistency and asymptotic normality results for their estimator to our estimator, we present the estimator of Bandi and Phillips (2003) when the observation time lag is $r\Delta$ and the number of observations is $m - 1$, that is, when we observe $\{X_{(j-1)r_n\Delta_n}\}_{j=1}^{m-1}$.

Definition 2.3 The single-smoothing estimator $\hat{\mu}_X$ of the drift coefficient $\mu$ proposed by Bandi and Phillips (2003) is defined by

\[ \hat{\mu}_X(x) \equiv \frac{\frac{1}{m-2}\sum_{j=1}^{m-2} \frac{X_{jr\Delta} - X_{(j-1)r\Delta}}{r\Delta}\,\frac{1}{h} K\!\left(\frac{X_{(j-1)r\Delta} - x}{h}\right)}{\frac{1}{m-2}\sum_{j=1}^{m-2} \frac{1}{h} K\!\left(\frac{X_{(j-1)r\Delta} - x}{h}\right)} \equiv \frac{\mathcal{N}_{X_0,\ldots,X_{(m-2)r\Delta}}(x)}{\mathcal{D}_{X_0,\ldots,X_{(m-3)r\Delta}}(x)} \equiv \frac{\mathcal{N}_X(x)}{\mathcal{D}_X(x)}. \]

We state the consistency and asymptotic normality result of $\hat{\mu}_X(x)$ not for a recurrent diffusion process but for a positive recurrent diffusion process, in order to relate the result to our estimator.

Theorem 2.2 (Bandi and Phillips, 2003) Suppose that

(i) $\Delta_n \to 0$ and $n\Delta_n \to \infty$ as $n \to \infty$,
(ii) Assumption 2.2 holds,
(iii) The solution process $\{X_t\}$ is positive recurrent (so that the stationary density $f_X$ exists), and
(iv) The kernel $K$ satisfies Assumption 2.4 except for our additional requirement that $K'$ is bounded.

In addition, suppose

\[ \left(\frac{n\Delta_n}{h_n}\right)^2 r_n\Delta_n \ln\!\big(1/(r_n\Delta_n)\big) = o(1) \quad \text{and} \quad n\Delta_n h_n \to \infty. \]

Then the following consistency and asymptotic normality results hold whenever $f_X(x) > 0$.

1. $\hat{\mu}_X(x) \longrightarrow \mu(x)$ almost surely as $n \to \infty$.

2. If $n\Delta_n h_n^5 = o(1)$, then

\[ \sqrt{n\Delta_n h_n}\,\{\hat{\mu}_X(x) - \mu(x)\} \xrightarrow{d} N\!\left(0,\; K_2\frac{\sigma^2(x)}{f_X(x)}\right), \]

where $K_2$ is as in Theorem 2.1.

3. If $n\Delta_n h_n^5 = O(1)$, then

\[ \sqrt{n\Delta_n h_n}\,\{\hat{\mu}_X(x) - \mu(x) - h_n^2\Gamma_\mu(x)\} \xrightarrow{d} N\!\left(0,\; K_2\frac{\sigma^2(x)}{f_X(x)}\right), \]

where $\Gamma_\mu(x)$ is as in Theorem 2.1.

Note that Assumptions 2.1 and 2.3 include conditions (i) and (iii) of Theorem 2.2, respectively.
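Thus Theorem 2.2 holds under Assumptions 2.1 to 2.4 as well.

We will prove consistency and asymptotic normality of our estimator $\hat{\mu}_{\bar{Y}}(x)$ based on Theorem 2.2 and the following theorem.

Theorem 2.3 (Bandi and Phillips, 2003) Suppose conditions (i) − (iv) of Theorem 2.2 hold, and suppose

\[ \left(\frac{1}{h_n}\right)^2 r_n\Delta_n \ln\!\big(1/(r_n\Delta_n)\big) = o(1). \]

Then, for each $x$ such that $f_X(x) > 0$, we have

\[ \mathcal{D}_X(x) \longrightarrow f_X(x) \quad \text{almost surely}, \]

where $f_X(x)$ is as in Assumption 2.3.

In what follows, we will prove that the following statements hold under Assumptions 2.1 to 2.5 in Section 2.2 and conditions (i), (ii) and (iii) of Theorem 2.1. Recall the definition of $\hat{\mu}_{\bar{Y}}(x) = \mathcal{N}_{\bar{Y}}(x)/\mathcal{D}_{\bar{Y}}(x)$, in Definition 2.2.

\[ \mathcal{D}_{\bar{Y}}(x) - \mathcal{D}_X(x) = o_p(1) \quad \text{and} \tag{2.20} \]

\[ \sqrt{n\Delta_n h_n}\,\{\mathcal{N}_{\bar{Y}}(x) - \mathcal{N}_X(x)\} = o_p(1). \tag{2.21} \]

Then Theorem 2.1 follows from these statements and Theorems 2.2 and 2.3. First, since $n\Delta_n h_n \to \infty$ as in condition (ii) of Theorem 2.1, Equation (2.21) implies

\[ \mathcal{N}_{\bar{Y}}(x) - \mathcal{N}_X(x) = o_p(1). \tag{2.22} \]

Then (2.20) and (2.22) along with Theorems 2.2 and 2.3 imply Theorem 2.1's conclusion 1 since, for every $x \in \{y \mid f_X(y) > 0\}$,

\[ \hat{\mu}_{\bar{Y}}(x) = \frac{\mathcal{N}_{\bar{Y}}(x)}{\mathcal{D}_{\bar{Y}}(x)} = \frac{\mathcal{N}_X(x) + o_p(1)}{\mathcal{D}_X(x) + o_p(1)} \xrightarrow{p} \mu(x). \]

To prove Theorem 2.1's conclusion 2, where $n\Delta_n h_n^5 = o(1)$, we use (2.20) and (2.21) along with Theorems 2.2 and 2.3 to write

\begin{align*}
\sqrt{n\Delta_n h_n}\,\{\hat{\mu}_{\bar{Y}}(x) - \mu(x)\}
&= \frac{\sqrt{n\Delta_n h_n}\,\mathcal{N}_{\bar{Y}}(x)}{\mathcal{D}_{\bar{Y}}(x)} - \sqrt{n\Delta_n h_n}\,\mu(x) \\
&= \frac{\sqrt{n\Delta_n h_n}\,\mathcal{N}_X(x) + o_p(1)}{\mathcal{D}_X(x) + o_p(1)} - \sqrt{n\Delta_n h_n}\,\mu(x) \\
&= \sqrt{n\Delta_n h_n}\left\{ \frac{\mathcal{N}_X(x) + o_p(1)}{\mathcal{D}_X(x) + o_p(1)} - \mu(x) \right\} \\
&\xrightarrow{d} N\!\left(0,\; K_2\frac{\sigma^2(x)}{f_X(x)}\right),
\end{align*}

by Theorem 2.2's conclusion 2. We can prove Theorem 2.1's conclusion 3 similarly.

In summary, if we prove (2.20) and (2.21) under Assumptions 2.1 to 2.5 and conditions (i), (ii), (iii) of Theorem 2.1, then the conclusions of Theorem 2.1 follow. In the remainder of this section, we prove (2.20) and (2.21). The proof uses the following preliminary lemmas.

2.6.2 Preliminary lemmas

Lemma 2.1 The following hold.

1. Under Assumption 2.4, $K$ is globally Lipschitz, that is, there exists a finite constant $M > 0$ such that $|K(x) - K(y)| \le M|x - y|$ for all $x, y \in \mathbb{R}$. In addition, we can take $M \equiv \sup_x |K'(x)|$.

2. Under Assumption 2.5, the following inequalities hold for all $j$ and $n$:

\[ \left( E\big(|\bar{\varepsilon}^{\,r_n,\Delta_n}_{j}|\big) \right)^2 \le E\big([\bar{\varepsilon}^{\,r_n,\Delta_n}_{j}]^2\big) \le \frac{\sigma_\varepsilon^2}{r_n}. \]

Proof:

1. The conclusion follows directly from the fact that $K$ is continuously differentiable and that $K'$ is bounded.

2. The first inequality is from the fact that the $L^1$ norm (that is, expectation of the absolute value) is bounded by the $L^2$ norm. For the second inequality, since $E(\bar{\varepsilon}^{\,r_n,\Delta_n}_{j}) = 0$, we have $E([\bar{\varepsilon}^{\,r_n,\Delta_n}_{j}]^2) = \mathrm{Var}(\bar{\varepsilon}^{\,r_n,\Delta_n}_{j})$. Then independence of the $\varepsilon_t$'s and the boundedness of $\mathrm{Var}(\varepsilon_t)$ allow us to write

\[ \mathrm{Var}(\bar{\varepsilon}^{\,r_n,\Delta_n}_{j}) = \frac{1}{r_n^2}\sum_{i=1}^{r_n} \mathrm{Var}\big(\varepsilon_{[(j-1)r_n+i]\Delta_n}\big) \le \frac{\sigma_\varepsilon^2}{r_n}. \]

Lemma 2.2 Under Assumptions 2.2 and 2.3, for any real $0 \le a < b$,

\[ \left\{ E\left( \left| \int_a^b \mu(X_s)\,ds \right| \right) \right\}^2 \le E\left( \left[ \int_a^b \mu(X_s)\,ds \right]^2 \right) \le E(\mu^2(X_0))\,(b-a)^2, \]

\[ \left\{ E\left( \left| \int_a^b \sigma(X_s)\,dW_s \right| \right) \right\}^2 \le E\left( \left[ \int_a^b \sigma(X_s)\,dW_s \right]^2 \right) \le E(\sigma^2(X_0))\,(b-a). \]

Proof: The first inequality of each line is by the fact that the $L^1$ norm is bounded by the $L^2$ norm. We now prove the second inequality of the first line. By applying Hölder's inequality to $|\mu| \times 1$,

\[ \left[ \int_a^b \mu(X_s)\,ds \right]^2 \le \left[ \int_a^b |\mu(X_s)|\,ds \right]^2 \le \int_a^b \mu^2(X_s)\,ds \int_a^b ds = (b-a)\int_a^b \mu^2(X_s)\,ds. \]

Therefore,

\[ E\left( \left[ \int_a^b \mu(X_s)\,ds \right]^2 \right) \le (b-a)\,E\left( \int_a^b \mu^2(X_s)\,ds \right) = (b-a)\int_a^b E(\mu^2(X_s))\,ds. \]

In addition, since $E(\mu^2(X_s)) = E(\mu^2(X_0))$ as $\{X_t\}$ is stationary,

\[ (b-a)\int_a^b E(\mu^2(X_s))\,ds = (b-a)\int_a^b E(\mu^2(X_0))\,ds = E(\mu^2(X_0))(b-a)^2. \]

This proves the second inequality of the first line. We can use the same reasoning to prove the second inequality of the second line if we first use the following property of stochastic integration:

\[ E\left( \left[ \int_a^b \sigma(X_s)\,dW_s \right]^2 \right) = E\left( \int_a^b \sigma^2(X_s)\,ds \right). \]

Remark: We denote

\[ \mathcal{M}_a^b \equiv \int_{a\Delta}^{b\Delta} \mu(X_s)\,ds \quad \text{and} \quad \mathcal{W}_a^b \equiv \int_{a\Delta}^{b\Delta} \sigma(X_s)\,dW_s. \tag{2.23} \]

Then Lemma 2.2 can be rewritten as $E\big((\mathcal{M}_a^b)^2\big) \le E(\mu^2(X_0))\,(b-a)^2\Delta^2$ and $E\big((\mathcal{W}_a^b)^2\big) \le E(\sigma^2(X_0))\,(b-a)\Delta$.

Lemma 2.3 Suppose that Assumptions 2.1 to 2.3 hold.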
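Define
$$\kappa_n \equiv \max_{j \le m_n}\ \sup_{(j-1)r_n\Delta_n \le s \le jr_n\Delta_n} |X_s - X_{(j-1)r_n\Delta_n}|$$
and
$$\gamma_n \equiv \max_{j \le m_n} E\Big(\big(\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}\big)^2\Big).$$
Then the following hold.

(i) For any nonnegative numerical sequence $\{a_n\}$ such that $\lim_{n\to\infty} a_n\,(r_n\Delta_n \ln(1/(r_n\Delta_n)))^{1/2} = 0$, we have $\lim_{n\to\infty} a_n\kappa_n = 0$ a.s.

(ii) $\max_{j \le m_n} |\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}| \le \kappa_n$.

(iii) $\max_{j \le m_n} |\bar X_{j+1}^{r_n,\Delta_n} - \bar X_j^{r_n,\Delta_n}| \le 3\kappa_n$.

(iv) There exists a finite constant $\beta$ such that $\gamma_n \le \beta r_n\Delta_n$.

(v) $\max_{j \le m_n} E(|\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}|) \le \sqrt{\gamma_n}$.

(vi) $\max_{j \le m_n} E(|\bar X_{j+1}^{r_n,\Delta_n} - \bar X_j^{r_n,\Delta_n}|) \le 3\sqrt{\gamma_n}$.

Proof: We first prove (i). As Bandi and Phillips (2003, page 267) point out, by Lévy's modulus of continuity of diffusions, we have
$$P\left(\limsup_{n\to\infty} \frac{\kappa_n}{(r_n\Delta_n \ln(1/(r_n\Delta_n)))^{1/2}} = C\right) = 1,$$
where $C$ is a suitable constant (Karatzas and Shreve, 1991, Theorem 9.25, Chapter 2, page 114). Therefore, for a nonnegative sequence $a_n$ such that $a_n (r_n\Delta_n \ln(1/(r_n\Delta_n)))^{1/2} \to 0$, we have
$$P\left(\limsup_{n\to\infty} a_n\kappa_n = 0\right) = 1.$$
Since both $a_n$ and $\kappa_n$ are nonnegative, (i) follows.

Next, to prove (ii), it suffices to notice that, for any $j$,
$$|\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}| \le \frac{1}{r_n}\sum_{i=1}^{r_n} |X_{(j-1)r_n\Delta_n + i\Delta_n} - X_{(j-1)r_n\Delta_n}| \le \frac{1}{r_n}\sum_{i=1}^{r_n} \kappa_n = \kappa_n.$$
We can prove (iii) using (ii) and the definition of $\kappa_n$ as follows: for any $j$,
$$|\bar X_{j+1}^{r_n,\Delta_n} - \bar X_j^{r_n,\Delta_n}| \le |\bar X_{j+1}^{r_n,\Delta_n} - X_{jr_n\Delta_n}| + |X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}| + |\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}| \le 3\kappa_n.$$

Now we prove (iv). Note first that, by (2.2), we have the following on a set of probability 1:
$$\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n} = \frac{1}{r_n}\sum_{i=1}^{r_n} \int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + i\Delta_n} \mu(X_s)\,ds + \frac{1}{r_n}\sum_{i=1}^{r_n} \int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + i\Delta_n} \sigma(X_s)\,dW_s \equiv A + B. \tag{2.24}$$
Then we have $E((\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n})^2) \le 2E(A^2) + 2E(B^2)$, for $A$ and $B$ defined in (2.24), since the above equality holds almost surely and since $(A + B)^2 \le 2A^2 + 2B^2$. 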
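The elementary inequality $(A + B)^2 \le 2A^2 + 2B^2$ invoked in the last step holds for any real numbers $A$ and $B$. A short derivation, included only as a reminder, is:

```latex
0 \le (A - B)^2 = A^2 - 2AB + B^2
\;\Longrightarrow\; 2AB \le A^2 + B^2
\;\Longrightarrow\; (A + B)^2 = A^2 + 2AB + B^2 \le 2A^2 + 2B^2 .
```

Taking expectations on both sides then gives $E((A+B)^2) \le 2E(A^2) + 2E(B^2)$, which is the form used in the proof of (iv).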
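We now expand $A^2$:
$$\left(\frac{1}{r_n}\sum_{i=1}^{r_n} \int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + i\Delta_n} \mu(X_s)\,ds\right)^2 = \frac{1}{r_n^2}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n} \int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + i\Delta_n} \mu(X_s)\,ds \int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + k\Delta_n} \mu(X_s)\,ds. \tag{2.25}$$
We use the Cauchy-Schwarz inequality, Lemma 2.2 and the fact that $i, k \le r_n$ to bound the expectation of the absolute value of the $ik$th summand of (2.25) by
$$\sqrt{E\left(\left[\int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + i\Delta_n} \mu(X_s)\,ds\right]^2\right)}\ \sqrt{E\left(\left[\int_{(j-1)r_n\Delta_n}^{(j-1)r_n\Delta_n + k\Delta_n} \mu(X_s)\,ds\right]^2\right)} \le E(\mu^2(X_0))\,r_n^2\Delta_n^2.$$
This bound is uniform in $i$ and $k$, so $E(A^2)$ is bounded by $E(\mu^2(X_0))\,r_n^2\Delta_n^2$. In exactly the same way, we can bound $E(B^2)$ by $E(\sigma^2(X_0))\,r_n\Delta_n$. Then $E((\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n})^2)$ is bounded by
$$2E(\mu^2(X_0))\,r_n^2\Delta_n^2 + 2E(\sigma^2(X_0))\,r_n\Delta_n \le \beta r_n\Delta_n$$
for some constant $\beta$, since $E(\mu^2(X_0))$ and $E(\sigma^2(X_0))$ are finite and $r_n\Delta_n \to 0$. This bound is uniform in $j$, so (iv) follows.

Next, (v) follows directly from the fact that the $L^1$ norm is bounded by the $L^2$ norm and that the square root function $f(x) = \sqrt{x}$ is monotonically increasing. Lastly, we can prove (vi) using (v), the argument of (ii) and the argument in the proof of (iv) to bound $E(|X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}|)$ by $\sqrt{\gamma_n}$.

Now we are ready to present the proof of (2.20) and (2.21).

2.6.3 Proof of Equation (2.20)

Suppose Assumptions 2.1 to 2.5 and conditions (i), (ii), (iii) of Theorem 2.1 hold. Instead of proving (2.20), we prove the stronger statement that
$$D_{\bar Y}(x) - D_X(x) = o_p\left(\frac{1}{\sqrt{n\Delta_n h_n}}\right).$$
First, the Lipschitz continuity of the kernel $K$ implies
$$\begin{aligned}
|D_{\bar Y}(x) - D_X(x)| &\le \frac{1}{m_n - 2}\sum_{j=1}^{m_n-2} \frac{1}{h_n}\left|K\left(\frac{\bar Y_j^{r_n,\Delta_n} - x}{h_n}\right) - K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)\right| \\
&\le \frac{M}{m_n - 2}\sum_{j=1}^{m_n-2} \frac{|\bar Y_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}|}{h_n^2},
\end{aligned} \tag{2.26}$$
where $M$ is as in Lemma 2.1. In addition, using the definition of $\bar Y_j^{r_n,\Delta_n}$ and (ii) of Lemma 2.3, we have
$$|\bar Y_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}| = |\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n} + \bar\varepsilon_j^{\,r_n,\Delta_n}| \le \kappa_n + |\bar\varepsilon_j^{\,r_n,\Delta_n}|. \tag{2.27}$$
Combining (2.26) and (2.27), we can bound $|D_{\bar Y}(x) - D_X(x)|$ by
$$|D_{\bar Y}(x) - D_X(x)| \le \frac{M\kappa_n}{h_n^2} + \frac{M}{m_n - 2}\sum_{j=1}^{m_n-2} \frac{|\bar\varepsilon_j^{\,r_n,\Delta_n}|}{h_n^2}.$$
We now study the orders of the two terms of this bound. 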
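We show that
$$\frac{\kappa_n}{h_n^2} = o_{a.s.}\left(\frac{1}{n\Delta_n h_n}\right) \quad\text{and}\quad \frac{1}{m_n - 2}\sum_{j=1}^{m_n-2} \frac{|\bar\varepsilon_j^{\,r_n,\Delta_n}|}{h_n^2} = o_p\left(\frac{1}{\sqrt{n\Delta_n h_n}}\right), \tag{2.28}$$
which will complete the proof since $n\Delta_n h_n \to \infty$ by condition (ii) of Theorem 2.1. For the first claim of (2.28), we use condition (i) of Theorem 2.1, which says $(n\Delta_n/h_n) \times (r_n\Delta_n \ln(1/(r_n\Delta_n)))^{1/2} \to 0$. This condition together with (i) of Lemma 2.3 yields $n\Delta_n h_n \times \kappa_n/h_n^2 = (n\Delta_n/h_n) \times \kappa_n = o_{a.s.}(1)$, so the first term is $o_{a.s.}(1/(n\Delta_n h_n))$. In particular, it is $o_{a.s.}(1/\sqrt{n\Delta_n h_n})$, since $n\Delta_n h_n \to \infty$ by condition (ii) of Theorem 2.1.

For the second claim of (2.28), since $E(|\bar\varepsilon_j^{\,r_n,\Delta_n}|) \le \sigma_\varepsilon/\sqrt{r_n}$ for all $j$ by Lemma 2.1, we can bound its expectation by $\sigma_\varepsilon/(h_n^2 r_n^{1/2})$, which is $o(1/\sqrt{n\Delta_n h_n})$ since
$$n\Delta_n h_n \times \frac{1}{h_n^4 r_n} = \frac{n}{h_n^3 r_n^2} \times r_n\Delta_n = o(1) \times o(1)$$
by condition (iii) of Theorem 2.1 and the assumption that $r_n\Delta_n \to 0$ (see Assumption 2.1). Therefore, the $L^1$ norm of the second term is $o(1/\sqrt{n\Delta_n h_n})$, implying that it is $o_p(1/\sqrt{n\Delta_n h_n})$.

2.6.4 Proof of Equation (2.21)

In order to prove (2.21) under the conditions, we first define the following three terms. Recall the definition of $N_{\bar Y}(x)$ in Definition 2.2 and $N_X(x)$ in Definition 2.3.
$$\begin{aligned}
N_{\bar Y,\bar X}(j) &\equiv \big(\bar Y_{j+2}^{r_n,\Delta_n} - \bar Y_{j+1}^{r_n,\Delta_n}\big) K\left(\frac{\bar Y_j^{r_n,\Delta_n} - x}{h_n}\right) - \big(\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}\big) K\left(\frac{\bar X_j^{r_n,\Delta_n} - x}{h_n}\right), \\
N_{\bar X,X}(j) &\equiv \big(\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}\big) K\left(\frac{\bar X_j^{r_n,\Delta_n} - x}{h_n}\right) - \big(X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}\big) K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right), \\
N_{X,X}(j) &\equiv \big(X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}\big) K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right) - \big(X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}\big) K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right).
\end{aligned}$$
Then we have the following equality:
$$\sqrt{n\Delta_n h_n}\,\{N_{\bar Y}(x) - N_X(x)\} = \frac{\sqrt{n\Delta_n h_n}}{(m_n - 2)\,r_n\Delta_n h_n}\sum_{j=1}^{m_n-2} \big\{N_{\bar Y,\bar X}(j) + N_{\bar X,X}(j) + N_{X,X}(j)\big\}.$$
Then, to prove (2.21), we note that
$$\frac{\sqrt{n\Delta_n h_n}}{(m_n - 2)\,r_n\Delta_n h_n} = \frac{m_n}{m_n - 2} \times \frac{1}{\sqrt{n\Delta_n h_n}},$$
and we show that each of
$$\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X}(j), \quad \sum_{j=1}^{m_n-2} N_{\bar X,X}(j) \quad\text{and}\quad \sum_{j=1}^{m_n-2} N_{X,X}(j) \tag{2.29}$$
is $o_p(\sqrt{n\Delta_n h_n})$. Sometimes we treat the three sums in (2.29) all at once using the notation $N_{Z,W}(j)$. Note that $N_{Z,W}(j)$ can be rewritten in the form
$$\begin{aligned}
N_{Z,W}(j) &= Z_j^{inc} \times K\left(\frac{Z_j^{kern} - x}{h_n}\right) - W_j^{inc} \times K\left(\frac{W_j^{kern} - x}{h_n}\right) \\
&= Z_j^{inc} \times \left[K\left(\frac{Z_j^{kern} - x}{h_n}\right) - K\left(\frac{W_j^{kern} - x}{h_n}\right)\right] + \big(Z_j^{inc} - W_j^{inc}\big) \times K\left(\frac{W_j^{kern} - x}{h_n}\right) \\
&\equiv N_{Z,W,kern}(j) + N_{Z,W,inc}(j).
\end{aligned} \tag{2.30}$$
We shall call $N_{Z,W,kern}(j)$ the kernel difference term and $N_{Z,W,inc}(j)$ the increment difference term. To be precise, the following table lists the $Z$'s and $W$'s for each $N_{Z,W}(j)$.

Table 2.3: The list of $Z$'s and $W$'s for each $N(j)$.

| | $Z_j^{inc}$ | $W_j^{inc}$ | $Z_j^{kern}$ | $W_j^{kern}$ |
|---|---|---|---|---|
| $N_{\bar Y,\bar X}(j)$ | $\bar Y_{j+2}^{r_n,\Delta_n} - \bar Y_{j+1}^{r_n,\Delta_n}$ | $\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}$ | $\bar Y_j^{r_n,\Delta_n}$ | $\bar X_j^{r_n,\Delta_n}$ |
| $N_{\bar X,X}(j)$ | $\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}$ | $X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}$ | $\bar X_j^{r_n,\Delta_n}$ | $X_{(j-1)r_n\Delta_n}$ |
| $N_{X,X}(j)$ | $X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}$ | $X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}$ | $X_{(j-1)r_n\Delta_n}$ | $X_{(j-1)r_n\Delta_n}$ |

Note that $Z_j^{kern} = W_j^{kern}$ for $N_{X,X}(j)$, which means that $N_{X,X,kern}(j)$ is equal to zero. We must show that the two kernel difference terms and the three increment difference terms, summed over $j$, are all $o_p(\sqrt{n\Delta_n h_n})$. We first study the orders of the kernel difference terms, and then those of the increment difference terms. Within each group, we study the term of $N_{\bar Y,\bar X}$ first, then that of $N_{X,X}$, and then that of $N_{\bar X,X}$. We proceed in this order because the study of the terms of $N_{\bar Y,\bar X}$ is simplest and the study of those of $N_{\bar X,X}$ is most delicate.

Study of the kernel difference terms

We first study the orders of the kernel difference terms, summed over $j$. First, we show that $\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X,kern}(j)$ is $o_p(\sqrt{n\Delta_n h_n})$. We first use the Lipschitz continuity of $K$ to bound its absolute value:
$$\left|\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X,kern}(j)\right| \le \sum_{j=1}^{m_n-2} \frac{M}{h_n}\,|Z_j^{inc}|\,|Z_j^{kern} - W_j^{kern}|, \tag{2.31}$$
where the $Z$'s and $W$'s are as in Table 2.3 and $M$ is as in Lemma 2.1. Then we take the expectation of the bound (2.31), and we use the independence of $\{X_t\}$ and $\{\varepsilon_t\}$ and the independence of the $\varepsilon_t$'s to bound the expectation of the $j$th summand of the right-hand side of (2.31) by
$$\begin{aligned}
\frac{M}{h_n}\,E\big(|\bar Y_{j+2}^{r_n,\Delta_n} - \bar Y_{j+1}^{r_n,\Delta_n}|\,|\bar\varepsilon_j^{\,r_n,\Delta_n}|\big)
&\le \frac{M}{h_n}\Big(E\big(|\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}|\big) + E\big(|\bar\varepsilon_{j+2}^{\,r_n,\Delta_n} - \bar\varepsilon_{j+1}^{\,r_n,\Delta_n}|\big)\Big)\,E\big(|\bar\varepsilon_j^{\,r_n,\Delta_n}|\big) \\
&\le \frac{M}{h_n}\left(3\sqrt{\gamma_n} + \frac{2\sigma_\varepsilon}{\sqrt{r_n}}\right)\frac{\sigma_\varepsilon}{\sqrt{r_n}},
\end{aligned} \tag{2.32}$$
where the last inequality used Lemma 2.1 and (vi) of Lemma 2.3. 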
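The noise factor $\sigma_\varepsilon/\sqrt{r_n}$ in (2.32) comes from Lemma 2.1: averaging $r_n$ independent errors reduces the second moment to at most $\sigma_\varepsilon^2/r_n$. The following Monte Carlo sketch illustrates this numerically. It assumes i.i.d. Gaussian errors with standard deviation `sigma`, which is an illustrative choice rather than something the lemma requires, and the function name and parameter values are hypothetical.

```python
import math
import random


def averaged_noise_moments(r, sigma, reps=10_000, seed=0):
    """Monte Carlo estimates of E|eps_bar| and E[eps_bar^2], where eps_bar is
    the average of r i.i.d. N(0, sigma^2) errors (Gaussianity is an assumption)."""
    rng = random.Random(seed)
    abs_sum = 0.0
    sq_sum = 0.0
    for _ in range(reps):
        eps_bar = sum(rng.gauss(0.0, sigma) for _ in range(r)) / r
        abs_sum += abs(eps_bar)
        sq_sum += eps_bar * eps_bar
    return abs_sum / reps, sq_sum / reps


r, sigma = 20, 0.2
first_abs_moment, second_moment = averaged_noise_moments(r, sigma)

# Compare with the bounds of Lemma 2.1, item 2.
print("E|eps_bar|   ~", first_abs_moment, " bound:", sigma / math.sqrt(r))
print("E[eps_bar^2] ~", second_moment, " bound:", sigma ** 2 / r)
```

With equal error variances the second inequality of Lemma 2.1 holds with equality, so the simulated second moment should sit close to `sigma ** 2 / r`, while the first absolute moment stays below `sigma / math.sqrt(r)`.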
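Then, since $\gamma_n \le \beta r_n\Delta_n$ by (iv) of Lemma 2.3, the expression (2.32) is bounded further by a constant times
$$\frac{\sqrt{\Delta_n}}{h_n} + \frac{1}{r_n h_n}.$$
Therefore, $E(|\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X,kern}(j)|)$ is bounded by a constant times
$$\frac{m_n\sqrt{\Delta_n}}{h_n} + \frac{m_n}{r_n h_n} = \frac{n\sqrt{\Delta_n}}{r_n h_n} + \frac{n}{r_n^2 h_n},$$
since $m_n = n/r_n$ as in Assumption 2.1. This bound is $o(\sqrt{n\Delta_n h_n})$ since
$$\frac{1}{\sqrt{n\Delta_n h_n}} \times \frac{n\sqrt{\Delta_n}}{r_n h_n} = \sqrt{\frac{n}{r_n^2 h_n^3}} = o(1)$$
by condition (iii) of Theorem 2.1 and since
$$\frac{1}{\sqrt{n\Delta_n h_n}} \times \frac{n}{r_n^2 h_n} = \frac{n}{r_n^2 h_n^3} \times \frac{h_n^2}{\sqrt{n\Delta_n h_n}} = o(1) \times o(1)$$
by conditions (ii) and (iii) of Theorem 2.1 and since $h_n \to 0$ as in Assumption 2.1.

Next, we show that $\sum_{j=1}^{m_n-2} N_{\bar X,X,kern}(j)$, which is defined at the beginning of Section 2.6.4 and in (2.30), is $o_p(\sqrt{n\Delta_n h_n})$. We use the first-order Taylor expansion of $K$ and write $N_{\bar X,X,kern}(j)$ as follows:
$$\begin{aligned}
N_{\bar X,X,kern}(j) &= Z_j^{inc}\,K'\left(\frac{\xi_j - x}{h_n}\right)\left(\frac{Z_j^{kern} - W_j^{kern}}{h_n}\right) \\
&= \big(\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}\big)\,K'\left(\frac{\xi_j - x}{h_n}\right)\left(\frac{\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n}}{h_n}\right),
\end{aligned} \tag{2.33}$$
where $\xi_j$ is a value between the values of $\bar X_j^{r_n,\Delta_n}$ and $X_{(j-1)r_n\Delta_n}$. We must show that the sum of (2.33) over $j$ is $o_p(\sqrt{n\Delta_n h_n})$. First, we use the definition of $X_t$ in (2.2), that $X_t = X_0 + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s$ a.s., to write the increments of (2.33) as
$$\begin{aligned}
\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n} &= \frac{1}{r_n}\sum_{i=1}^{r_n} \big(X_{[(j+1)r_n+i]\Delta_n} - X_{[jr_n+i]\Delta_n}\big) \\
&= \frac{1}{r_n}\sum_{i=1}^{r_n} \int_{[jr_n+i]\Delta_n}^{[(j+1)r_n+i]\Delta_n} \mu(X_s)\,ds + \frac{1}{r_n}\sum_{i=1}^{r_n} \int_{[jr_n+i]\Delta_n}^{[(j+1)r_n+i]\Delta_n} \sigma(X_s)\,dW_s \\
&= \frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{M}_{jr_n+i}^{(j+1)r_n+i} + \frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{W}_{jr_n+i}^{(j+1)r_n+i},
\end{aligned}$$
and
$$\begin{aligned}
\bar X_j^{r_n,\Delta_n} - X_{(j-1)r_n\Delta_n} &= \frac{1}{r_n}\sum_{k=1}^{r_n} \big(X_{[(j-1)r_n+k]\Delta_n} - X_{(j-1)r_n\Delta_n}\big) \\
&= \frac{1}{r_n}\sum_{k=1}^{r_n} \int_{(j-1)r_n\Delta_n}^{[(j-1)r_n+k]\Delta_n} \mu(X_s)\,ds + \frac{1}{r_n}\sum_{k=1}^{r_n} \int_{(j-1)r_n\Delta_n}^{[(j-1)r_n+k]\Delta_n} \sigma(X_s)\,dW_s \\
&= \frac{1}{r_n}\sum_{k=1}^{r_n} \mathcal{M}_{(j-1)r_n}^{(j-1)r_n+k} + \frac{1}{r_n}\sum_{k=1}^{r_n} \mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}.
\end{aligned}$$
Recall the definition of $\mathcal{M}$ and $\mathcal{W}$ in the Remark after Lemma 2.2. Using these, we expand (2.33) as follows:
$$\begin{aligned}
(2.33) &= \frac{1}{r_n^2}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n} \mathcal{M}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{M}_{(j-1)r_n}^{(j-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_j - x}{h_n}\right) && (2.34) \\
&\quad + \frac{1}{r_n^2}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n} \mathcal{M}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_j - x}{h_n}\right) && (2.35) \\
&\quad + \frac{1}{r_n^2}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n} \mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{M}_{(j-1)r_n}^{(j-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_j - x}{h_n}\right) && (2.36) \\
&\quad + \frac{1}{r_n^2}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n} \mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_j - x}{h_n}\right). && (2.37)
\end{aligned}$$
Throughout the calculations regarding (2.34) to (2.37), we will use the bounds based on the Remark after Lemma 2.2: for some constant $C$, the second moments $E((\mathcal{M}_{jr_n+i}^{(j+1)r_n+i})^2)$ and $E((\mathcal{M}_{(j-1)r_n}^{(j-1)r_n+k})^2)$ are bounded by $Cr_n^2\Delta_n^2$, and $E((\mathcal{W}_{jr_n+i}^{(j+1)r_n+i})^2)$ and $E((\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k})^2)$ are bounded by $Cr_n\Delta_n$.

Note that, using the boundedness of $K'$ and the Cauchy-Schwarz inequality, we can bound the $L^1$ norm of the $ik$th summand of (2.34) by
$$\sqrt{E\Big(\big[\mathcal{M}_{jr_n+i}^{(j+1)r_n+i}\big]^2\Big)\,E\Big(\big[\mathcal{M}_{(j-1)r_n}^{(j-1)r_n+k}\big]^2\Big)} \times \frac{M}{h_n} \le \frac{CMr_n^2\Delta_n^2}{h_n}.$$
By the same reasoning, we can bound the $L^1$ norm of the $ik$th summand of (2.35) and (2.36) by $CM(r_n\Delta_n)^{3/2}/h_n$. These bounds are uniform in $i, j, k$, so the $L^1$ norm of (2.34) to (2.36) summed over $j$ is bounded by a constant times $m_n(r_n\Delta_n)^{3/2}/h_n = nr_n^{1/2}\Delta_n^{3/2}/h_n$ (recall $m_n = n/r_n$ in Assumption 2.1). In addition, by condition (i) of Theorem 2.1,
$$\frac{nr_n^{1/2}\Delta_n^{3/2}}{h_n} = \frac{n\Delta_n}{h_n}\sqrt{r_n\Delta_n} = o(1). \tag{2.38}$$
Therefore, the $L^1$ norm of (2.34) to (2.36) summed over $j$ is $o(1)$. This implies that the sum is $o_p(\sqrt{n\Delta_n h_n})$ since $n\Delta_n h_n \to \infty$ by condition (ii) of Theorem 2.1.

Now it remains to show that (2.37) summed over $j$ is $o_p(\sqrt{n\Delta_n h_n})$. It requires a more delicate argument than (2.34) to (2.36). We first rearrange the sum of (2.37) over $j$ as follows:
$$\frac{1}{r_n^2 h_n}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n}\sum_{j=1}^{m_n-2} \mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\,K'\left(\frac{\xi_j - x}{h_n}\right). \tag{2.39}$$
We derive a bound on the squared $L^2$ norm of the $ik$th summand of the above. We first prove that
$$E\left(\left[\sum_{j=1}^{m_n-2} \mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\,K'\left(\frac{\xi_j - x}{h_n}\right)\right]^2\right) = \sum_{j=1}^{m_n-2} E\left(\big[\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\big]^2 \big[\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\big]^2 K'^2\left(\frac{\xi_j - x}{h_n}\right)\right), \tag{2.40}$$
in other words, that
$$E\left(\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\,\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_j - x}{h_n}\right)\mathcal{W}_{lr_n+i}^{(l+1)r_n+i}\,\mathcal{W}_{(l-1)r_n}^{(l-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_l - x}{h_n}\right)\right) = 0 \tag{2.41}$$
for all $j > l$. In order to prove this, we use the fact that $\int_{u_1}^{v_1} \sigma(X_s)\,dW_s$ is independent of $\int_{u_2}^{v_2} \sigma(X_s)\,dW_s$ and $X_w$ whenever $u_2 \le v_2 \le u_1 \le v_1$ and $w \le u_1 \le v_1$. Since $\xi_j$ is a number between $\bar X_j^{r_n,\Delta_n}$ and $X_{(j-1)r_n\Delta_n}$, the variable $\xi_j$ depends on those $X_s$'s such that $(j-1)r_n\Delta_n \le s \le jr_n\Delta_n$. 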
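Thus, by independence, the left-hand side of (2.41) equals
$$E\big(\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\big)\,E\left(\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_j - x}{h_n}\right)\mathcal{W}_{lr_n+i}^{(l+1)r_n+i}\,\mathcal{W}_{(l-1)r_n}^{(l-1)r_n+k}\,\frac{1}{h_n} K'\left(\frac{\xi_l - x}{h_n}\right)\right) = 0,$$
because integrals with respect to Brownian motion, such as $\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}$, have mean zero. This proves (2.41).

Then, by the boundedness of $K'$, (2.40) is bounded further by
$$M^2\sum_{j=1}^{m_n-2} E\Big(\big[\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\big]^2 \big[\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\big]^2\Big) = M^2\sum_{j=1}^{m_n-2} E\Big(\big[\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}\big]^2\Big)\,E\Big(\big[\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}\big]^2\Big), \tag{2.42}$$
where we used the fact that $\mathcal{W}_{jr_n+i}^{(j+1)r_n+i}$ and $\mathcal{W}_{(j-1)r_n}^{(j-1)r_n+k}$ are independent. Applying the bounds based on the Remark after Lemma 2.2, we can bound (2.42) by a constant times $m_n r_n^2\Delta_n^2$. This is a bound of (2.40), so we can bound the $L^1$ norm of (2.39) by a constant times
$$\frac{1}{r_n^2 h_n}\sum_{i=1}^{r_n}\sum_{k=1}^{r_n} \sqrt{m_n r_n^2\Delta_n^2} = \sqrt{\frac{m_n r_n^2\Delta_n^2}{h_n^2}} = \sqrt{\frac{nr_n\Delta_n^2}{h_n^2}}$$
(recall that $m_n = n/r_n$ as in Assumption 2.1). We can rewrite the term inside the square root as
$$\frac{nr_n\Delta_n^2}{h_n^2} = \frac{nr_n^{1/2}\Delta_n^{3/2}}{h_n} \times \frac{r_n^{1/2}\Delta_n^{1/2}}{h_n}.$$
Now we show that the right-hand side is $o(1)$. We have shown in (2.38) that the first component is $o(1)$. For the second component, by condition (i) of Theorem 2.1 and the assumption that $n\Delta_n \to \infty$ (see Assumption 2.1),
$$\frac{\sqrt{r_n\Delta_n}}{h_n} = \frac{n\Delta_n}{h_n}\sqrt{r_n\Delta_n} \times \frac{1}{n\Delta_n} = o(1) \times o(1).$$
Therefore, the $L^1$ norm of (2.39) is $o(1)$ and thus $o(\sqrt{n\Delta_n h_n})$, as $n\Delta_n h_n \to \infty$ by condition (ii) of Theorem 2.1. This implies that (2.39) is $o_p(\sqrt{n\Delta_n h_n})$ as desired.

Study of the increment difference terms

We first show that $\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X,inc}(j)$, which is defined at the beginning of Section 2.6.4 and in (2.30), is $o_p(\sqrt{n\Delta_n h_n})$. We write $Z_j^{inc} - W_j^{inc} = D_{j+1} - D_j$, where $D_j = \bar Y_{j+1}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n} = \bar\varepsilon_{j+1}^{\,r_n,\Delta_n}$. Then we write
$$\begin{aligned}
\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X,inc}(j) &= \sum_{j=1}^{m_n-2} (D_{j+1} - D_j)\,K\left(\frac{W_j^{kern} - x}{h_n}\right) && (2.43) \\
&= \sum_{j=1}^{m_n-2} D_{j+1}\,K\left(\frac{W_j^{kern} - x}{h_n}\right) - \sum_{j=0}^{m_n-3} D_{j+1}\,K\left(\frac{W_{j+1}^{kern} - x}{h_n}\right) \\
&= D_{m_n-1}\,K\left(\frac{W_{m_n-2}^{kern} - x}{h_n}\right) - D_1\,K\left(\frac{W_1^{kern} - x}{h_n}\right) \\
&\quad - \sum_{j=1}^{m_n-2} D_{j+1}\left[K\left(\frac{W_{j+1}^{kern} - x}{h_n}\right) - K\left(\frac{W_j^{kern} - x}{h_n}\right)\right]. && (2.44)
\end{aligned}$$
Then, the boundedness and the Lipschitz continuity of $K$ and the boundedness of $K'$ yield
$$\left|\sum_{j=1}^{m_n-2} N_{\bar Y,\bar X,inc}(j)\right| \le C\left\{|D_{m_n-1}| + |D_1| + \frac{1}{h_n}\sum_{j=1}^{m_n-2} |D_{j+1}|\,|W_{j+1}^{kern} - W_j^{kern}|\right\} \tag{2.45}$$
for a suitable constant $C$. Recall that $W_{j+1}^{kern} - W_j^{kern} = \bar X_{j+1}^{r_n,\Delta_n} - \bar X_j^{r_n,\Delta_n}$ for $N_{\bar Y,\bar X,inc}(j)$. We use the independence of $\{X_t\}$ and $\{\varepsilon_t\}$ and the independence of the $\varepsilon_t$'s to bound the expectation of the right-hand side of (2.45) further by a constant times
$$E\big(|\bar\varepsilon_{m_n}^{\,r_n,\Delta_n}|\big) + E\big(|\bar\varepsilon_2^{\,r_n,\Delta_n}|\big) + \frac{1}{h_n}\sum_{j=1}^{m_n-3} E\big(|\bar\varepsilon_{j+2}^{\,r_n,\Delta_n}|\big)\,E\big(|\bar X_{j+1}^{r_n,\Delta_n} - \bar X_j^{r_n,\Delta_n}|\big).$$
We use Lemma 2.1 and (iv) and (vi) of Lemma 2.3 to bound it further by a constant times
$$\frac{1}{\sqrt{r_n}} + \frac{m_n\sqrt{\Delta_n}}{h_n}.$$
This bound is $o(\sqrt{n\Delta_n h_n})$ by the following. First, $1/\sqrt{r_n} = o(1)$ by Assumption 2.1, which implies it is $o(\sqrt{n\Delta_n h_n})$ as $n\Delta_n h_n \to \infty$ by condition (ii) of Theorem 2.1. Also,
$$\frac{1}{\sqrt{n\Delta_n h_n}} \times \frac{m_n\sqrt{\Delta_n}}{h_n} = \sqrt{\frac{n}{r_n^2 h_n^3}} = o(1)$$
by condition (iii) of Theorem 2.1.

Next, we show that $\sum_{j=1}^{m_n-2} N_{X,X,inc}(j)$, which is defined at the beginning of Section 2.6.4 and in (2.30), is $o_p(\sqrt{n\Delta_n h_n})$. We first write $Z_j^{inc} - W_j^{inc} = D_{j+1} - D_j$, where $D_j = X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}$. Then, by (2.2), which is the definition of $X_t$ that $X_t = X_0 + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s$ a.s., we have the following equality almost surely:
$$D_j = \int_{(j-1)r_n\Delta_n}^{jr_n\Delta_n} \mu(X_s)\,ds + \int_{(j-1)r_n\Delta_n}^{jr_n\Delta_n} \sigma(X_s)\,dW_s = \mathcal{M}_{(j-1)r_n}^{jr_n} + \mathcal{W}_{(j-1)r_n}^{jr_n} \equiv E_j + F_j$$
(recall the definition of $\mathcal{M}$ and $\mathcal{W}$ in the Remark after Lemma 2.2). Therefore, we have $Z_j^{inc} - W_j^{inc} = D_{j+1} - D_j = (E_{j+1} - E_j) + (F_{j+1} - F_j)$ almost surely. We now write
$$\sum_{j=1}^{m_n-2} N_{X,X,inc}(j) = \sum_{j=1}^{m_n-2} (E_{j+1} - E_j)\,K\left(\frac{W_j^{kern} - x}{h_n}\right) + \sum_{j=1}^{m_n-2} (F_{j+1} - F_j)\,K\left(\frac{W_j^{kern} - x}{h_n}\right) \equiv N_{X,X,e} + N_{X,X,f}. \tag{2.46}$$
Note that $N_{X,X,e}$ and $N_{X,X,f}$ are of the same form as (2.43), except for having $E_j$'s and $F_j$'s, respectively, instead of $D_j$'s. Now we show that $N_{X,X,e}$ and $N_{X,X,f}$ are $o_p(\sqrt{n\Delta_n h_n})$.

First, for $N_{X,X,e}$, we use the bound (2.45) to bound $N_{X,X,e}$ by a constant times
$$\Big|\mathcal{M}_{(m_n-2)r_n}^{(m_n-1)r_n}\Big| + \big|\mathcal{M}_0^{r_n}\big| + \frac{1}{h_n}\sum_{j=1}^{m_n-2} \Big|\mathcal{M}_{jr_n}^{(j+1)r_n}\Big|\,\big|X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}\big|. \tag{2.47}$$
Since $|X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n}| \le \kappa_n$ by the definition of $\kappa_n$ in Lemma 2.3, we can bound (2.47) by
$$\Big|\mathcal{M}_{(m_n-2)r_n}^{(m_n-1)r_n}\Big| + \big|\mathcal{M}_0^{r_n}\big| + \frac{\kappa_n}{h_n}\sum_{j=1}^{m_n-2} \Big|\mathcal{M}_{jr_n}^{(j+1)r_n}\Big|. \tag{2.48}$$
In addition, Lemma 2.2 and the Markov inequality imply that $\sum_{j=1}^{m_n-2} |\mathcal{M}_{jr_n}^{(j+1)r_n}| = O_p(n\Delta_n)$ and that $\mathcal{M}_{jr_n}^{(j+1)r_n} = O_p(r_n\Delta_n)$ for any $j$. Therefore, we can rewrite (2.48) as
$$O_p(r_n\Delta_n) + O_p(r_n\Delta_n) + \frac{\kappa_n}{h_n}\,O_p(n\Delta_n).$$
This bound is $o_p(1)$, and hence $o_p(\sqrt{n\Delta_n h_n})$ as $n\Delta_n h_n \to \infty$ by condition (ii) of Theorem 2.1, by the following. First, $O_p(r_n\Delta_n) = o_p(1)$ as $r_n\Delta_n \to 0$ by Assumption 2.1. In addition, $(n\Delta_n/h_n) \times \kappa_n = o_{a.s.}(1)$ by condition (i) of Theorem 2.1 and (i) of Lemma 2.3, which implies $(\kappa_n/h_n) \times O_p(n\Delta_n) = (O_p(n\Delta_n)/h_n) \times \kappa_n = o_p(1)$.

For $N_{X,X,f}$, defined in (2.46), we can bound the absolute value of $N_{X,X,f}$, using the reasoning used from (2.43) to (2.44) and the boundedness of $K$, by a constant times
$$\Big|\mathcal{W}_{(m_n-2)r_n}^{(m_n-1)r_n}\Big| + \big|\mathcal{W}_0^{r_n}\big| + \left|\sum_{j=1}^{m_n-2} \mathcal{W}_{jr_n}^{(j+1)r_n}\left[K\left(\frac{X_{jr_n\Delta_n} - x}{h_n}\right) - K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)\right]\right|.$$
Lemma 2.2 and the Markov inequality imply that, for any $j$, $\mathcal{W}_{jr_n}^{(j+1)r_n} = O_p(\sqrt{r_n\Delta_n}) = o_p(1)$. Therefore, it remains to show that
$$\left|\sum_{j=1}^{m_n-2} \mathcal{W}_{jr_n}^{(j+1)r_n}\left[K\left(\frac{X_{jr_n\Delta_n} - x}{h_n}\right) - K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)\right]\right| = o_p\big(\sqrt{n\Delta_n h_n}\big).$$
To show this, we show that its $L^2$ norm is $o(\sqrt{n\Delta_n h_n})$. Note that
$$E\left(\left[\sum_{j=1}^{m_n-2} \mathcal{W}_{jr_n}^{(j+1)r_n}\left[K\left(\frac{X_{jr_n\Delta_n} - x}{h_n}\right) - K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)\right]\right]^2\right) = \sum_{j=1}^{m_n-2} E\Big(\big(\mathcal{W}_{jr_n}^{(j+1)r_n}\big)^2\Big)\,E\left(\left[K\left(\frac{X_{jr_n\Delta_n} - x}{h_n}\right) - K\left(\frac{X_{(j-1)r_n\Delta_n} - x}{h_n}\right)\right]^2\right)$$
by the same reasoning used to prove (2.41). In addition, using the Lipschitz continuity of $K$, we can bound the above further by a constant times
$$\sum_{j=1}^{m_n-2} E\Big(\big(\mathcal{W}_{jr_n}^{(j+1)r_n}\big)^2\Big)\,\frac{E\big((X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n})^2\big)}{h_n^2} \le \frac{\gamma_n}{h_n^2}\sum_{j=1}^{m_n-2} E\Big(\big(\mathcal{W}_{jr_n}^{(j+1)r_n}\big)^2\Big), \tag{2.49}$$
where we used $E((X_{jr_n\Delta_n} - X_{(j-1)r_n\Delta_n})^2) \le \gamma_n$ for the inequality, which can be proved by adapting the proof of (iv) of Lemma 2.3. Then, since $E((\mathcal{W}_{jr_n}^{(j+1)r_n})^2) \le E(\sigma^2(X_0))\,r_n\Delta_n$ by Lemma 2.2 (and the Remark after it) and $\gamma_n \le \beta r_n\Delta_n$ by (iv) of Lemma 2.3, we can bound the right-hand side of (2.49) further by a constant times
$$\frac{m_n r_n^2\Delta_n^2}{h_n^2} = \frac{nr_n\Delta_n^2}{h_n^2}$$
($m_n = n/r_n$ as in Assumption 2.1). 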
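We must show that this bound is $o(n\Delta_n h_n)$, which proves that the $L^2$ norm is $o(\sqrt{n\Delta_n h_n})$. The following proves that it is $o(n\Delta_n h_n)$:
$$\frac{1}{n\Delta_n h_n} \times \frac{nr_n\Delta_n^2}{h_n^2} = \left(\frac{n\Delta_n}{h_n}\right)^2 r_n\Delta_n \times \frac{1}{n\Delta_n h_n} \times \frac{1}{n\Delta_n} = o(1) \times o(1) \times o(1)$$
by conditions (i) and (ii) of Theorem 2.1 and the assumption that $n\Delta_n \to \infty$ in Assumption 2.1.

Lastly, we show that $\sum_{j=1}^{m_n-2} N_{\bar X,X,inc}(j)$, which is defined at the beginning of Section 2.6.4 and in (2.30), is $o_p(\sqrt{n\Delta_n h_n})$. By (2.2), which is the definition of $X_t$ that $X_t = X_0 + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s$ a.s., the increment terms $Z_j^{inc} = \bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n}$ and $W_j^{inc} = X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}$ satisfy the following equations almost surely:
$$\bar X_{j+2}^{r_n,\Delta_n} - \bar X_{j+1}^{r_n,\Delta_n} = \frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{M}_{jr_n+i}^{(j+1)r_n+i} + \frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{W}_{jr_n+i}^{(j+1)r_n+i},$$
$$X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n} = \mathcal{M}_{jr_n}^{(j+1)r_n} + \mathcal{W}_{jr_n}^{(j+1)r_n}$$
(recall the definition of the $\mathcal{M}$'s and $\mathcal{W}$'s in the Remark after Lemma 2.2). Therefore, we can write
$$Z_j^{inc} - W_j^{inc} = \frac{1}{r_n}\sum_{i=1}^{r_n} \Big(\mathcal{M}_{jr_n+i}^{(j+1)r_n+i} - \mathcal{M}_{jr_n}^{(j+1)r_n}\Big) + \frac{1}{r_n}\sum_{i=1}^{r_n} \Big(\mathcal{W}_{jr_n+i}^{(j+1)r_n+i} - \mathcal{W}_{jr_n}^{(j+1)r_n}\Big).$$
In addition, as $\mathcal{M}$ and $\mathcal{W}$ are simplified notation for integrals, we have
$$\mathcal{M}_{jr_n+i}^{(j+1)r_n+i} - \mathcal{M}_{jr_n}^{(j+1)r_n} = \Big(\mathcal{M}_{jr_n+i}^{(j+1)r_n} + \mathcal{M}_{(j+1)r_n}^{(j+1)r_n+i}\Big) - \Big(\mathcal{M}_{jr_n}^{jr_n+i} + \mathcal{M}_{jr_n+i}^{(j+1)r_n}\Big) = \mathcal{M}_{(j+1)r_n}^{(j+1)r_n+i} - \mathcal{M}_{jr_n}^{jr_n+i}$$
and the same equation for the $\mathcal{W}$'s. Then we can decompose $Z_j^{inc} - W_j^{inc}$ as
$$\begin{aligned}
Z_j^{inc} - W_j^{inc} &= \left(\frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{M}_{(j+1)r_n}^{(j+1)r_n+i} - \frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{M}_{jr_n}^{jr_n+i}\right) + \left(\frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{W}_{(j+1)r_n}^{(j+1)r_n+i} - \frac{1}{r_n}\sum_{i=1}^{r_n} \mathcal{W}_{jr_n}^{jr_n+i}\right) \\
&\equiv (E_{j+1} - E_j) + (F_{j+1} - F_j).
\end{aligned}$$
Then, similarly to $\sum_{j=1}^{m_n-2} N_{X,X,inc}(j)$, we can decompose $\sum_{j=1}^{m_n-2} N_{\bar X,X,inc}(j)$ as the sum of $N_{\bar X,X,e}$ and $N_{\bar X,X,f}$ and show that each is $o_p(\sqrt{n\Delta_n h_n})$. Briefly, $N_{\bar X,X,e}$ is bounded by a constant times
$$\frac{1}{r_n}\sum_{i=1}^{r_n} \Big|\mathcal{M}_{(m_n-1)r_n}^{(m_n-1)r_n+i}\Big| + \frac{1}{r_n}\sum_{i=1}^{r_n} \Big|\mathcal{M}_{r_n}^{r_n+i}\Big| + \frac{1}{r_n h_n}\sum_{i=1}^{r_n}\sum_{j=1}^{m_n-2} \Big|\mathcal{M}_{(j+1)r_n}^{(j+1)r_n+i}\Big|\,\big|X_{(j+1)r_n\Delta_n} - X_{jr_n\Delta_n}\big|,$$
and $N_{\bar X,X,f}$ is bounded by a constant times
$$\frac{1}{r_n}\sum_{i=1}^{r_n} \Big|\mathcal{W}_{(m_n-1)r_n}^{(m_n-1)r_n+i}\Big| + \frac{1}{r_n}\sum_{i=1}^{r_n} \Big|\mathcal{W}_{r_n}^{r_n+i}\Big| + \frac{1}{r_n}\sum_{i=1}^{r_n} \left|\sum_{j=1}^{m_n-2} \mathcal{W}_{(j+1)r_n}^{(j+1)r_n+i}\left[K\left(\frac{X_{(j+1)r_n\Delta_n} - x}{h_n}\right) - K\left(\frac{X_{jr_n\Delta_n} - x}{h_n}\right)\right]\right|.$$
Applying, again, the reasoning used to study $N_{X,X,e}$ and $N_{X,X,f}$ to the above completes the proof.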
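To make the objects in this proof concrete, the following is a minimal sketch of the ratio $N_{\bar Y}(x)/D_{\bar Y}(x)$, written from the forms of $D_{\bar Y}$ and $N_{\bar Y}$ that appear in (2.26) and in the decomposition at the start of Section 2.6.4 (Definition 2.2 itself is not restated here). Everything else is an assumption made for illustration: the Gaussian kernel, the Ornstein-Uhlenbeck data-generating process with drift $\mu(x) = -x$, the Gaussian measurement errors, and all function names and parameter values. It is not the simulation design of Section 2.5.

```python
import math
import random


def simulate_noisy_ou(n, delta, theta=1.0, sigma=1.0, noise_sd=0.05, seed=1):
    """Noisy observations Y_i = X_{i*delta} + eps_i, i = 1..n, of the
    Ornstein-Uhlenbeck process dX = -theta*X dt + sigma dW (exact transitions).
    The model and the Gaussian errors are illustrative assumptions."""
    rng = random.Random(seed)
    decay = math.exp(-theta * delta)
    step_sd = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * theta))
    x = rng.gauss(0.0, sigma / math.sqrt(2.0 * theta))  # stationary start
    ys = []
    for _ in range(n):
        x = decay * x + step_sd * rng.gauss(0.0, 1.0)
        ys.append(x + rng.gauss(0.0, noise_sd))
    return ys


def block_averages(ys, r):
    """Pre-averaged series: the mean of each of the m = n // r blocks of size r."""
    m = len(ys) // r
    return [sum(ys[j * r:(j + 1) * r]) / r for j in range(m)]


def drift_estimate(ybar, x, h, r, delta):
    """Ratio N(x) / D(x); the common factor 1 / ((m - 2) * h) cancels."""
    num = 0.0
    den = 0.0
    for j in range(len(ybar) - 2):
        u = (ybar[j] - x) / h
        weight = math.exp(-0.5 * u * u)  # Gaussian kernel; its constant cancels
        # Increment of the averaged series, taken one block after the kernel point.
        num += (ybar[j + 2] - ybar[j + 1]) * weight
        den += weight
    return num / (r * delta * den)


r, delta, h = 5, 0.01, 0.3
ybar = block_averages(simulate_noisy_ou(n=100_000, delta=delta), r)
estimates = {x: drift_estimate(ybar, x, h, r, delta) for x in (-0.5, 0.0, 0.5)}
for x, value in estimates.items():
    print(f"x = {x:+.1f}: drift estimate {value:+.3f}")
```

In this toy setting the true drift is $-x$, so the estimates should be positive at $x = -0.5$, close to zero at $x = 0$ and negative at $x = 0.5$, up to smoothing and discretization bias. The increment is deliberately taken from blocks $j+1$ and $j+2$ while the kernel is evaluated at block $j$, matching the index pattern of $N_{\bar Y,\bar X}(j)$ above.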
590.050.100.150.200.250.050.100.150.200.250.050.100.150.200.250.050.100.150.200.25OriginalContaminatedAveragedSubsampled0 5 10 15 20timeXFigure 2.1: A sample path of the stochastic process defined by (2.18), with the linear driftcoefficient. Label ”Original” represents the process without measurement errors. Label ”Con-taminated” represents the process with independent N(0, 0.0022)-distributed additive mea-surement errors. Label ”Averaged” represents the averaged contaminated process with r = 5.Label ”Subsampled” represents the subsampled process having 1/5 less sampling frequencythan the original process.60−3−2−1012−3−2−1012−3−2−1012−3−2−1012OriginalContaminatedAveragedSubsampled0 5 10 15 20timeXFigure 2.2: A sample path of the stochastic process defined by (2.19), with the nonlinear driftcoefficient. Label ”Original” represents the process without measurement errors. Label ”Con-taminated” represents the process with independent N(0, 0.06612)-distributed additive mea-surement errors. Label ”Averaged” represents the averaged contaminated process with r = 5.Label ”Subsampled” represents the subsampled process having 1/5 less sampling frequencythan the original process.6105100.0 0.1 0.2 0.3BandwidthdensityTypeAvgBPSsBPDs0.00.51.01.52.00 2 4 6BandwidthdensityTypeAvgBPSsBPDsFigure 2.3: Density plot of cross-validation bandwidths of the BPSs, BPDs and Avg estimator.Labels “BPSs” and “BPDs” stand for the single-smoothing and the double-smoothing estima-tor of Bandi and Phillips (2003), respectively, both combined with the subsampling method.Label “Avg” stands for the pre-averaging estimator. The top panel corresponds to the model(2.18), and the bottom panel corresponds to the model (2.19).620.00000.00030.00060.00090.06 0.08 0.10XMSETypeAvg−oBPSs−oBPDs−oAMSE (scaled)0.050.100.150.200.06 0.08 0.10XBandwidth TypeOracleFigure 2.4: Pointwise mean squared errors (MSE) of the estimators for the model (2.18) withoracle bandwidths. 
Refer to the caption of Table 2.2 for definition of the labels. The “-o”represents the oracle bandwidths are used. Label “AMSE” represents the asymptotic meansquared error computed using the oracle bandwidth. The numbers of the vertical axis do notapply to the AMSE. The bottom panel depicts oracle bandwidths, hopt(x) defined in (2.14),according to the values of x.630.00000.00020.00040.00060.06 0.08 0.10XMSETypeAvg−cvBPSs−cvBPDs−cvAMSE (scaled)Figure 2.5: Pointwise mean squared errors (MSE) of the estimators for the model (2.18) withcross-validation bandwidths. Refer to the caption of Table 2.2 for definition of the labels.The “-cv” represents the cross-validation bandwidths are used. Label “AMSE” represents theasymptotic mean squared error computed using the oracle bandwidth. The numbers of thevertical axis do not apply to the AMSE.640123−1.0 −0.5 0.0 0.5 1.0XMSETypeAvg−oBPSs−oBPDs−oAMSE (scaled)1234−1.0 −0.5 0.0 0.5 1.0XBandwidth TypeOracleFigure 2.6: Pointwise mean squared errors (MSE) of the estimators for the model (2.19) withoracle bandwidths. Refer to the caption of Table 2.2 for definition of the labels. The “-o”represents the oracle bandwidths are used. Label “AMSE” represents the asymptotic meansquared error computed using the oracle bandwidth. The numbers of the vertical axis do notapply to the AMSE. The bottom panel depicts oracle bandwidths, hopt(x) defined in (2.14),according to the values of x.650.00.40.81.21.6−1.0 −0.5 0.0 0.5 1.0XMSETypeAvg−cvBPSs−cvBPDs−cvAMSE (scaled)Figure 2.7: Pointwise mean squared errors (MSE) of the estimators for the model (2.19) withcross-validation bandwidths. Refer to the caption of Table 2.2 for definition of the labels.The “-cv” represents the cross-validation bandwidths are used. Label “AMSE” represents theasymptotic mean squared error computed using the oracle bandwidth. 
The numbers of thevertical axis do not apply to the AMSE.660.00.51.01.52.0−1.0 −0.5 0.0 0.5 1.0XSquared biasTypeAvg−o (r=2)Avg−o (r=5)Avg−o (r=10)Avg−o (r=20)Avg−o (r=40)Avg−o (r=60)0.00.51.01.52.0−1.0 −0.5 0.0 0.5 1.0XVarianceTypeAvg−o (r=2)Avg−o (r=5)Avg−o (r=10)Avg−o (r=20)Avg−o (r=40)Avg−o (r=60)Figure 2.8: Pointwise squared biases (the top panel) and pointwise variances (the bottompanel) of the pre-averaging estimator with the oracle bandwidth (denoted by “Avg-o”) un-der different values of the block size r for the model (2.19). The values of r are indicated in thelegend.670.050.100.150.200.06 0.08 0.10Bandwidth TypeOracle−10010200.06 0.08 0.10Gamma TypeGamma12345−1.0 −0.5 0.0 0.5 1.0Bandwidth TypeOracle−4−202−1.0 −0.5 0.0 0.5 1.0Gamma TypeGammaFigure 2.9: Values of the oracle bandwidth hopt(x) defined in (2.14) and the function Γµ(x)defined in Theorem 2.1. The two top panels depict values of hopt(x) and Γµ(x) according tothe values of x for model (2.18). The two bottom panels depict the values for model (2.19).68lllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllll lllllllllllllllllllllllllllllllllllllll lllllllllllllllll lllllllllllll lllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.010.101.000.01 0.10 1.00log10(Prediction error), Avg−olog10(Prediction error), BPSs−oFigure 2.10: The log10-transformed pathwise 
invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPSs estimates for the model (2.18) with oracle band-widths. The sum is computed along the grid of evaluation points described in Section 2.5.Refer to the caption of Table 2.2 for definition of the labels. The “-o” represents the oraclebandwidths are used. The black solid line is the 45 degrees line. 825 points out of 1,000 areabove the line.69lllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllll llllllllllllllllllllllllllllllll ll llllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllll0.010.101.000.01 0.10 1.00log10(Prediction error), Avg−olog10(Prediction error), BPDs−oFigure 2.11: The log10-transformed pathwise invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPDs estimates for the model (2.18) with oracle band-widths. The sum is computed along the grid of evaluation points described in Section 2.5.Refer to the caption of Table 2.2 for definition of the labels. The “-o” represents the oraclebandwidths are used. The black solid line is the 45 degrees line. 
733 points out of 1,000 arebelow the line.70llllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllll lllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllll llllllllllllllllll llllllllllllllll lllllllllllllll0.011.000.01 0.10 1.00log10(Prediction error), Avg−cvlog10(Prediction error), BPSs−cvFigure 2.12: The log10-transformed pathwise invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPSs estimates for the model (2.18) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described inSection 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” representsthe cross-validation bandwidths are used. The black solid line is the 45 degrees line. 
513 pointsout of 1,000 are above the line.71llll llllllllllllllllllllllllllllllllllllllllllllllllllll llllll llllllllllllllllll llllllllllllll lll llllllllllllll llllll llllllllllllllllllllllllllllllllllll llll l llllllllllllllllllll llllllllllll llllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllll lllllllllllllllllllllllll llll lll lllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllll lll llllllllllllllllllllllllll l lllllllllll llllllllllllllllllllllllllllll lllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllll0.010.101.000.01 0.10 1.00log10(Prediction error), Avg−cvlog10(Prediction error), BPDs−cvFigure 2.13: The log10-transformed pathwise invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPDs estimates for the model (2.18) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described inSection 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” representsthe cross-validation bandwidths are used. The black solid line is the 45 degrees line. 
782 pointsout of 1,000 are above the line.72lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllll lllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllll llllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllll llllllllllllllllllllllllllllllllllllll llllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllll ll llll1010010 100log10(Prediction error), Avg−olog10(Prediction error), BPSs−oFigure 2.14: The log10-transformed pathwise invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPSs estimates for the model (2.19) with oracle band-widths. The sum is computed along the grid of evaluation points described in Section 2.5.Refer to the caption of Table 2.2 for definition of the labels. The “-o” represents the oraclebandwidths are used. The black solid line is the 45 degrees line. 
679 points out of 1,000 areabove the line.73lllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllll lllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll11010010 100log10(Prediction error), Avg−olog10(Prediction error), BPDs−oFigure 2.15: The log10-transformed pathwise invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPDs estimates for the model (2.19) with oracle band-widths. The sum is computed along the grid of evaluation points described in Section 2.5.Refer to the caption of Table 2.2 for definition of the labels. The “-o” represents the oraclebandwidths are used. The black solid line is the 45 degrees line. 
718 points out of 1,000 arebelow the line.74llllllllll lllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllll llllll llllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllll1101001 10 100log10(Prediction error), Avg−cvlog10(Prediction error), BPSs−cvFigure 2.16: The log10-transformed pathwise invariant-density-weighted sum of squaredpointwise prediction errors of Avg and BPSs estimates for the model (2.19) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described inSection 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” representsthe cross-validation bandwidths are used. The black solid line is the 45 degrees line. 
549 points out of 1,000 are above the line.

[Figure 2.17 plot omitted. Axis labels: log10(Prediction error), Avg-cv; log10(Prediction error), BPDs-cv.]

Figure 2.17: The log10-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of Avg and BPDs estimates for the model (2.19) with cross-validation bandwidths. The sum is computed along the grid of evaluation points described in Section 2.5. Refer to the caption of Table 2.2 for definition of the labels. The “-cv” indicates that the cross-validation bandwidths are used. The black solid line is the 45-degree line.
536 points out of 1,000 are below the line.

[Figure 2.18 plot omitted. Axis labels: log10(Prediction error), Avg-cv; log10(Prediction error), Avg-o.]

Figure 2.18: The log10-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of the pre-averaging estimator for the model (2.18). The sum is computed along the grid of evaluation points described in Section 2.5. The “-o” and “-cv” mean that the oracle and the cross-validation bandwidths are used, respectively. The black solid line is the 45-degree line.
741 points out of 1,000 are above the line.

[Figure 2.19 plot omitted. Axis labels: log10(Prediction error), Avg-cv; log10(Prediction error), Avg-o.]

Figure 2.19: The log10-transformed pathwise invariant-density-weighted sum of squared pointwise prediction errors of the pre-averaging estimator for the model (2.19). The sum is computed along the grid of evaluation points described in Section 2.5. The “-o” and “-cv” mean that the oracle and the cross-validation bandwidths are used, respectively. The black solid line is the 45-degree line. 852 points out of 1,000 are above the line.

Chapter 3
Conclusion

In this thesis, we proposed a Nadaraya-Watson type kernel estimator of the drift coefficient of a diffusion process. Our estimator is consistent and asymptotically normal when the data are generated from a positive recurrent and strictly stationary diffusion process and are sampled discretely in time and with additive measurement errors.
Our consistency and asymptotic normality result is built upon the result of Bandi and Phillips (2003), who proved consistency and asymptotic normality of the Nadaraya-Watson estimator of the drift coefficient when the data are generated from a recurrent diffusion process and are sampled discretely in time and without measurement error.

We recommended using the H-block cross-validation, proposed by Chu and Marron (1991) and Burman, Chow, and Nolan (1994), to choose the bandwidth h. Our simulation study in Section 2.5 indicates that, when the data are observed with independent and identically distributed additive measurement errors, our estimator with the H-block cross-validation bandwidth has smaller mean integrated squared error than our estimator with the oracle bandwidth, which has much smaller mean squared error than the estimators of Bandi and Phillips (2003) with the oracle bandwidth.

In our simulation study, as an alternative to our estimator, we also applied the subsampling method to the estimators of Bandi and Phillips (2003) for estimation of the drift coefficient. Our simulation study indicates that, when combined with the subsampling method, the estimators of Bandi and Phillips (2003) have mean squared errors that are as small as our estimator.

Because we have errors of observation in our model, which are not considered by Bandi and Phillips (2003), we needed to reduce the noise caused by these errors in order to improve the accuracy of the estimate of the drift coefficient. Our approach was to construct a pre-averaged process $\{\bar{Y}^{r,\Delta}_j\}$, as in Definition 2.1. Alternatively, according to our simulation study, the subsampling method seems to be another effective way to reduce the noise.

Our estimator offers wider applicability compared to the estimators developed for the case of no measurement error, as we do often observe data with measurement errors.
For example, Zhou (1996) reported the presence of measurement error in foreign exchange rates data, and Jones (2003) argued for the presence of measurement error in the dataset of Aït-Sahalia (1996), the seven-day Eurodollar rates dataset.

Despite the advantages of our proposed approach, the choice of the block size r and the choice of the subsampling rate are largely unsolved issues. In practice, we rely on an ad-hoc choice of r because of difficulties in using existing methods to choose r, as discussed in Section 2.4.2. In addition, our estimator involves shifts of the time-indices, as discussed right after Definition 2.2, and the effect of the shifts on the performance of the estimator is not clear. In another simulation study using the oracle bandwidth, which is not included in Section 2.5, we saw that the shifts of the time-indices increase the mean squared error of our pre-averaging estimator. However, the shifts of the time-indices do not necessarily seem to increase the mean squared error. The simulation study not included in Section 2.5 indicates that applying the shifts of the time-indices to the single-smoothing and the double-smoothing estimators, with subsampling, of Bandi and Phillips (2003) decreases their mean squared errors. The investigation of these issues could be a possible future research topic.

Another possible future research topic is estimation of $\mu(x)$ from time-irregularly observed data. An advantage of using a continuous-time process over a discrete-time process is that a continuous-time process allows us to consider time-irregularly observed data. For asymptotics, we can consider the situation where the time-difference between each pair of time-adjacent observations is a random number between 0 and infinity.

Lastly, the H-block cross-validation is a bandwidth choice method for a finite sample, i.e. for a fixed n, and the asymptotic behavior of the H-block cross-validation bandwidth as n tends to infinity has not yet been studied.
As n tends to infinity, we require $\{h_n\}$ to satisfy the conditions (i), (ii) and (iii) of Theorem 2.1. The study of the asymptotic behavior of the sequence of H-block cross-validation bandwidths in relation to the conditions of Theorem 2.1 is another possible future research topic.

Bibliography

Aït-Sahalia, Yacine. 1996. “Nonparametric pricing of interest rate derivative securities.” Econometrica 64 (3):527–560.

Andersen, Torben G, Tim Bollerslev, Francis X Diebold, and Paul Labys. 2001. “The distribution of realized exchange rate volatility.” Journal of the American Statistical Association 96 (453):42–55.

———. 2009. “Parametric and nonparametric volatility measurement.” Handbook of Financial Econometrics 1:67–138.

Bandi, Federico, Valentina Corradi, and Guillermo Moloche. 2009. “Bandwidth selection for continuous-time Markov processes.” Working paper.

Bandi, Federico M and Peter CB Phillips. 2003. “Fully nonparametric estimation of scalar diffusion models.” Econometrica 71 (1):241–283.

Barndorff-Nielsen, Ole E, Peter Reinhard Hansen, Asger Lunde, and Neil Shephard. 2008. “Designing realized kernels to measure the ex post variation of equity prices in the presence of noise.” Econometrica 76 (6):1481–1536.

Burman, Prabir, Edmond Chow, and Deborah Nolan. 1994. “A cross-validatory method for dependent data.” Biometrika 81 (2):351–358.

Chapman, David A and Neil D Pearson. 2000. “Is the short rate drift actually nonlinear?” The Journal of Finance 55 (1):355–388.

Chu, C-K and James S Marron. 1991. “Comparison of two bandwidth selectors with dependent errors.” The Annals of Statistics 19 (3):1906–1918.

Cleveland, William S. 1979. “Robust locally weighted regression and smoothing scatterplots.” Journal of the American Statistical Association 74 (368):829–836.

Felsenstein, Joseph. 1985. “Phylogenies and the comparative method.” American Naturalist 125 (1):1–15.

Florens-Zmirou, Danielle. 1993. “On estimating the diffusion coefficient from discrete observations.” Journal of Applied Probability 30 (4):790–804.

Hall, Peter and Jeffrey D Hart. 1990. “Nonparametric regression with long-range dependence.” Stochastic Processes and Their Applications 36 (2):339–351.

Härdle, Wolfgang. 1990. Applied Nonparametric Regression, vol. 27. Cambridge Univ Press.

Hart, Jeffrey D. 1994. “Automated kernel smoothing of dependent data by using time series cross-validation.” Journal of the Royal Statistical Society. Series B (Methodological) 56 (3):529–542.

Iacus, Stefano Maria. 2008. Simulation and Inference for Stochastic Differential Equations: with R Examples. Springer.

———. 2009. sde: Simulation and Inference for Stochastic Differential Equations. URL http://CRAN.R-project.org/package=sde. R package version 2.0.10.

Jacod, Jean, Yingying Li, Per A Mykland, Mark Podolskij, and Mathias Vetter. 2009. “Microstructure noise in the continuous case: the pre-averaging approach.” Stochastic Processes and their Applications 119 (7):2249–2276.

Jones, Christopher S. 2003. “Nonlinear mean reversion in the short-term interest rate.” Review of Financial Studies 16 (3):793–843.

Jones, M Chris, James S Marron, and Simon J Sheather. 1996. “A brief survey of bandwidth selection for density estimation.” Journal of the American Statistical Association 91 (433):401–407.

Karatzas, Ioannis and Steven Eugene Shreve. 1991. Brownian Motion and Stochastic Calculus, vol. 113. Springer.

Kutoyants, Yu A. 2004. Statistical Inference for Ergodic Diffusion Processes. Springer.

Nadaraya, Elizbar A. 1964. “On estimating regression.” Theory of Probability & Its Applications 9 (1):141–142.

Øksendal, Bernt. 1992. Stochastic Differential Equations. Springer.

Parzen, Emanuel. 1962. “On estimation of a probability density function and mode.” Annals of Mathematical Statistics 33 (3):1065–1076.

Robinson, Peter M. 1983. “Nonparametric estimators for time series.” Journal of Time Series Analysis 4 (3):185–207.

Rosenblatt, Murray. 1956. “Remarks on some nonparametric estimates of a density function.” Annals of Mathematical Statistics 27 (3):832–837.

Ruppert, David and Matthew P Wand. 1994. “Multivariate locally weighted least squares regression.” The Annals of Statistics 22 (3):1346–1370.

Simonoff, Jeffrey S. 1996. Smoothing Methods in Statistics. Springer.

Stanton, Richard. 1997. “A nonparametric model of term structure dynamics and the market price of interest rate risk.” The Journal of Finance 52 (5):1973–2002.

Stone, Charles J. 1977. “Consistent nonparametric regression.” The Annals of Statistics 5 (4):595–620.

Stone, Mervyn. 1974. “Cross-validatory choice and assessment of statistical predictions.” Journal of the Royal Statistical Society. Series B (Methodological) 36 (2):111–147.

Watson, Geoffrey S. 1964. “Smooth regression analysis.” Sankhyā: The Indian Journal of Statistics, Series A, 359–372.

Zhang, Lan, Per A Mykland, and Yacine Aït-Sahalia. 2005. “A tale of two time scales.” Journal of the American Statistical Association 100 (472):1394–1411.

Zhou, Bin. 1996. “High-frequency data and volatility in foreign-exchange rates.” Journal of Business & Economic Statistics 14 (1):45–52.
Item Metadata
Title | Kernel estimation of the drift coefficient of a diffusion process in the presence of measurement error |
Creator | Lee, Wooyong |
Publisher | University of British Columbia |
Date Issued | 2014 |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2014-06-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0167486 |
URI | http://hdl.handle.net/2429/46990 |
Degree | Master of Science - MSc |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2014-09 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
Aggregated Source Repository | DSpace |