RELEVANCE WEIGHTED SMOOTHING AND A NEW BOOTSTRAP METHOD

by

FEIFANG HU

B.Sc., Hangzhou Normal University, P.R. China, 1985
M.Sc., Zhejiang University, P.R. China, 1988

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1994
© Feifang Hu, 1994

Abstract

This thesis addresses two quite different topics. First, we consider several relevance weighted smoothing methods for relevant sample information. This topic can be viewed as a generalization of nonparametric smoothing. Second, we propose a new bootstrap method which is based on estimating functions.

A statistical problem usually begins with an unknown object of inferential interest. About this unknown object, we may have three types of information (classified in this thesis): direct information, exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. In many cases, relevant sample information is available and useful, but there is no systematic theory about relevant sample information. The problem of this thesis is to extract "the relevant information" contained in the relevant samples. Three general methods have been developed along three different lines of approach (parametric, nonparametric and semiparametric). In the parametric approach, we propose the idea of the relevance weighted likelihood (REWL). For the nonparametric approach, we develop our theory based on the relevance weighted empirical distribution function (REWED). In the semiparametric approach, relevance weighted estimating functions are used to extract "relevant information" from relevant samples. From asymptotic results, we find that these proposed methods have many desirable properties. We apply these proposed methods as well as some adjusted methods to generalized smoothing problems. Theoretical results as well as simulation results show our methods to be promising.

We also present a new bootstrap method. It has computational and theoretical advantages over conventional bootstrap methods when the data are obtained from non-identically distributed observables.
And it differs from conventional methods in that it resamples components of an estimating function rather than the data themselves.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction

2 Relevant Sample Information
2.1 Introduction
2.2 Classification of Information
2.3 Measure of Information
2.4 Fisher's Information of Relevant Sample in Point Estimation
2.5 Kullback and Leibler's Information in Relevant Sample in Discrimination
2.6 General Remarks

3 Relevance Weighted Likelihood Estimation
3.1 Introduction
3.2 The Relevance Weighted Likelihood
3.2.1 Definition
3.2.2 Examples
3.2.3 Remarks
3.3 Weakly Sufficient Statistics
3.3.1 Definition
3.4 Maximum Relevance Weighted Likelihood Estimation (MREWLE)
3.5 The Entropy Maximization Principle
3.5.1 The generalized entropy maximization principle
3.5.2 The MREWLE and generalized entropy maximization principle
3.6 MREWLE for Normal Observations
3.7 General Remarks

4 The Asymptotic Properties of the Maximum Relevance Weighted Likelihood Estimator
4.1 Introduction
4.2 Weak Consistency
4.3 Strong Consistency
4.4 The Asymptotic Normality of the MREWLE
4.5 Estimated Variance of the MREWLE
4.6 Some Possible Extensions and Remarks
4.7 Proofs

5 The MREWLE for Generalized Smoothing Models
5.1 Introduction
5.2 Small samples
5.3 Asymptotic Properties
5.4 Bandwidth Selection
5.5 Simulation

6 A Relevance Weighted Nonparametric Quantile Estimator
6.1 Introduction
6.2 The REWED and REW Quantile Estimation
6.3 Strong Consistency of REW Quantile Estimators
6.4 Asymptotic Representation Theory
6.5 Asymptotic Normality of the REW Quantile Estimators
6.6 Applications
6.7 Simulation Study
6.8 Discussion
6.9 Proofs of the Theorems

7 Some Further Results
7.1 Locally polynomial maximum relevance weighted likelihood estimation
7.2 Relevance weighted estimating functions

8 An Approach of Bootstrapping Through Estimating Equations
8.1 Introduction
8.2 Problems with Common Bootstrap Regression Methods
8.2.1 Residual Resampling
8.2.2 Vector resampling
8.3 A new bootstrap
8.4 Asymptotics
8.4.1 Preamble
8.4.2 Consistency of β*
8.4.3 Asymptotic normality of β*
8.5 Comparison and Simulation Study
8.6 Bootstrapping in Nonlinear Situations
8.6.1 Regression M-estimator
8.6.2 Nonlinear regression
8.6.3 Generalized Linear models
8.7 Concluding Remarks
8.8 Proofs

References

List of Tables

5.1 Pointwise Biases and Variances of Regression Smoothers
5.2 Conjectured Pointwise Biases and Variances of the MREWLE Smoothers
8.1 Averages of the Kolmogorov-Smirnov Statistics for Competing Bootstrap Distribution Estimators
8.2 Absolute Biases of the Competing Bootstrap Quantile Estimators
8.3 The Pivot Quantile Estimators

List of Figures

3.1 A comparison of MREWLE and MLE under o = 0. Curve A represents the MSE of the MLE at the line θ = θ₁. Curve B represents the maximum MSE of the MREWLE over the whole parameter space. Curve C represents the MSE of the MREWLE at the line θ = θ₁.
3.2 A comparison of MREWLE with MLE under i = 1. Curve A represents the MSE of the MLE at the line θ = θ₁. Curve B represents the maximum MSE of the MREWLE over the whole parameter space. Curve C represents the MSE of the MREWLE at the line θ = θ₁.
5.1 A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.21) with n = 200.
The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c.
5.2 The Nadaraya-Watson MREWLE based on five simulations from Model (5.21) with n = 200.
5.3 The locally linear smoother MREWLE based on five simulations from Model (5.21) with n = 200.
5.4 A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.22) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c.
5.5 A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.23) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c.
6.1 A comparison of the Nadaraya-Watson estimate with the REW quantile estimator. The model is Y = X(1 − X) + e, where X is uniform (0, 1) and e is N(0, 0.5). The sample size is n = 1000 and the bandwidth is h = 0.1. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.
6.2 A comparison of the Nadaraya-Watson estimate with the REW quantile estimator with outliers. To the data depicted in Figure 6.1, we add 50 ε-outliers from N(2, 0.5). The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.
6.3 A comparison of the Nadaraya-Watson estimate with the REW quantile estimator. The model is Y = X(1 − X) + e, where X is from uniform (0, 1) and e from a double exponential distribution with r = 0.1. The sample size is n = 100 and the bandwidth is h = 0.1. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.
6.4 A comparison of the Nadaraya-Watson estimate with the REW quantile estimator with outliers. To the data depicted in Figure 6.3, we add 10 ε-outliers from N(−.5, .25). The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.
6.5 A REW quantile estimator of a quantile curve. The .25 quantile curve is estimated for the data depicted in Figure 6.1. The true quantile curve is a, and the REW quantile estimator is b.
7.1 A comparison of the Nadaraya-Watson MREWLE, the locally linear smoother MREWLE and the locally linear MREWLE from Model (5.21) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, the locally linear smoother MREWLE, c, and the locally linear MREWLE, d.
7.2 The locally linear MREWLE based on five simulations from Model (5.21) with n = 200.
8.1 A comparison of bootstrap distribution estimators for regression with homoscedastic errors. We depict the distributions of w = β̂ − β induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron's estimator (c), Wu's estimator (d), and Freedman's estimator (e).
8.2 A comparison of bootstrap distribution estimators for regression with heteroscedastic errors. We depict the sampling distributions of w = β̂ − β induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron's estimator (c), Wu's estimator (d), and Freedman's estimator (e).

Acknowledgements

I would like to thank my supervisor James Zidek for his excellent guidance, for his constant encouragement and his patience.
Only the support from James Zidek made this thesis possible.

I wish to thank the people whose suggestions and comments about this thesis have been most helpful: Harry Joe, Nancy Heckman and Jean Meloche. This thesis benefitted from the helpful references and comments of Nancy Heckman. Many thanks go to Harry Joe, Nancy Heckman, Jian Liu and John Petkau for their encouragement and support during my stay at UBC.

Many thanks go to my dear wife Ying Liu for her constant encouragement and patience. Our son Leon made life more enjoyable with all the fun he gave me.

Finally, I would like to thank James Zidek for his financial support. The financial support from the Department of Statistics is acknowledged with great appreciation. I also acknowledge the support of the University of British Columbia through a University Graduate Fellowship.

Chapter 1

Introduction

This thesis addresses two quite different topics. First, we consider several relevance weighted smoothing methods for relevant sample information. This topic can be viewed as a generalization of nonparametric smoothing. Second, we propose a new bootstrap method which is based on estimating functions.

A statistical problem usually begins with an unknown object of inference. About this unknown object of inference, we may have three types of information (classified in this thesis): direct information, exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. In many cases, relevant sample information is available and useful, but there is no systematic theory about relevant sample information.

The problem of this thesis is to extract "the relevant information" contained in the relevant samples. Three general methods have been developed along three different lines of approach (parametric, nonparametric and semiparametric). For the parametric approach, we propose the idea of the relevance weighted likelihood (REWL). The REWL plays the same role in relevant sample analysis as the likelihood in classical statistical inference. For the nonparametric approach, we develop our theory based on the relevance weighted empirical distribution function (REWED). In the semiparametric approach, relevance weighted estimating functions are used to extract "relevant information" from relevant samples. We use estimating functions because of their generality.

We show that the maximum relevance weighted likelihood estimator (MREWLE) is consistent and asymptotically normal. The asymptotic theory of the nonparametric approach is also developed. [For the semiparametric approach, the asymptotic theory is omitted.] From these asymptotic results, we find that the proposed methods have many desirable properties.

We also apply these proposed methods to generalized smoothing methods. We find that: (i) the MREWLE has some advantages over current nonparametric regression methods (for example, the MREWLE always has a smaller variance, which depends on the Fisher information function; this also indicates that the MREWLE is a kind of efficient estimator); (ii) the relevance weighted quantile estimator (based on the REWED) is usually robust and quite efficient. For generalized smoothing models, we get some good estimators by some locally polynomial adjustments.
Simulations support our methods.

Various papers (Efron 1979, Bickel and Freedman 1981 and Singh 1981) speak to the quality of the bootstrap resampling procedure for estimating a sampling distribution in situations where the sampled observables are independent and identically distributed. In this thesis we present a new bootstrap method. It has computational and theoretical advantages when the data are obtained from non-identically distributed observables. And it differs from conventional bootstrap methods in that it resamples components of an estimating function rather than the data themselves. We apply this bootstrap method to ordinary linear regression. By comparison with Efron's method, Freedman's pairs method and Wu's method, our method gets support from theoretical results as well as simulations.

Because our new bootstrap method is based on estimating functions, applying this bootstrap method to relevance weighted smoothing is possible and reasonable. This is a further research topic.

We organize the thesis as follows.

In Chapter 2, we classify the different types of information about the unknown object of inference (parameter) in a statistical way. In classical statistical inference and Bayesian inference, statisticians usually focus on direct information and exact sample information. We learn that relevant sample information is very important in statistical inference. Two possible generalizations of Fisher's information and Kullback-Leibler's information for relevant samples are considered.

For relevant samples, we propose the idea of the relevance weighted likelihood (REWL) in Chapter 3. Our idea generalizes that of the likelihood function in that the independent samples going into the likelihood function may be discounted according to their degree of relevance. The classical likelihood obtains in the special case where the independent samples are all from the study population whose parameters are of inferential interest. But more generally, as in meta-analysis, for example, the value of such samples may be reduced in that their relevance is in doubt because, for example, they are noisy or biased. The relevance weights, which enter as exponents of the component factors in the sampling density, enable us to trade off information against such things as bias in the samples, which may be relevant even if not drawn from the study population.

We show how the REWL can be obtained from a generalization of the entropy maximization principle. Using the REWL we define the notion of weak sufficiency and the maximum REWL estimator (MREWLE). By using the MREWLE in a normal example, we find the MREWLE has some advantages.

In Chapter 4, we establish the weak and strong consistency of the MREWLE under a wide range of conditions. My results generalize those of Wald (1949) to both non-identically distributed random variables and unequally weighted likelihoods (when dealing with independent data sets of varying relevance to the inferential problem of interest). Asymptotic normality of the MREWLE is also proved.

We apply the REWL methods to generalized smoothing models in Chapter 5. By choosing different weights, four estimators are considered. They are the: Nadaraya-Watson MREWLE, Gasser-Muller MREWLE, k-NN weights MREWLE and locally linear smoother MREWLE.
Asymptotic results for these four estimators are developed. We also compare them by theoretical results as well as simulations.

Chapter 6 concerns situations in which a sample X₁ = x₁, ..., X_n = x_n of independent observations is drawn from populations with different CDF's F₁, ..., F_n respectively. Inference is about a quantile of another population with CDF F₀ when the data from the other populations are thought to be "relevant". Nonparametric smoothing of a quantile function would typify situations to which our theory applies. We define the relevance weighted quantile (REWQ) estimator derived from the relevance weighted empirical distribution (REWED) function. We show that the estimator has desirable asymptotic properties. A simulation study is also included. It shows that the median estimator is a robust alternative to the locally weighted averages used in conventional smoothing.

Some further results appear in Chapter 7. We propose a locally polynomial MREWLE. By comparison with the locally constant MREWLE (in Chapter 5) and current nonparametric regression methods, we find that the locally linear MREWLE has a simple bias and smaller variance. Some simulation results also support this locally linear MREWLE. For our semiparametric approach, we present the method of relevance weighted estimating functions. We find that the locally linear MREWLE is the best among all locally linear REW estimating equations (with kernel weights) and the locally linear quasi-MREWLE (maximum relevance weighted quasi-likelihood estimator) is the best among all locally linear REW linear estimating equations.

Chapter 8 presents a method of bootstrap estimation. Rather than resampling from the original sample, as is conventional, the proposed method resamples summands of the estimating function used to produce the original estimate. The result is computationally simpler than existing competitors. However, its main advantages lie in its treatment of non-identically distributed observations. Shortcomings of conventional methods are overcome. An application to ordinary linear regression is worked out in detail along with the appropriate asymptotic theory. We report as well a simulation study which provides support for this new bootstrap method.
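As a concrete illustration of the resampling idea just described, the following Python sketch resamples the summands x_i(y_i − x_i'β̂) of the least-squares estimating function for a simple linear model and maps them back through the normal equations to generate bootstrap draws β*. The mapping used here (β* = β̂ + (X'X)⁻¹ Σ g_i*) is only one plausible reading of the method summarized above, adopted for illustration; the precise definition of β* and its theory appear in Chapter 8.

import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])   # design: intercept + one covariate
beta = np.array([1.0, 2.0])
y = x @ beta + rng.normal(0, 1 + x[:, 1], n)               # heteroscedastic errors

# Original least-squares estimate and the summands of its estimating function.
xtx = x.T @ x
beta_hat = np.linalg.solve(xtx, x.T @ y)
g = x * (y - x @ beta_hat)[:, None]                        # g_i = x_i (y_i - x_i' beta_hat)

# Resample the summands g_i (not the data) and map back through the normal equations.
B = 2000
boot = np.empty((B, 2))
for b in range(B):
    gb = g[rng.integers(0, n, n)].sum(axis=0)
    boot[b] = beta_hat + np.linalg.solve(xtx, gb)          # assumed definition of beta*

print("bootstrap SE of slope:", boot[:, 1].std())

Because only the n fixed summands are resampled and one linear solve is done per replicate, the scheme is computationally light, which is the point made above.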
Chapter 2

Relevant Sample Information

2.1 Introduction

Information is a key word in statistics. After all, this is what the subject is all about. As Basu (1975) says:

A problem in statistics begins with a state of nature, a parameter of interest θ about which we do not have enough information. In order to generate further information about θ, we plan and then perform a statistical experiment E. This generates the sample x. By the term 'statistical data' we mean such a pair (E, x) where E is a well-defined statistical experiment and x the sample generated by a performance of the experiment. The problem of data analysis is to extract 'the whole of relevant information'—an expression made famous by R. A. Fisher—contained in the data (E, x) about the parameter θ.

However, statistical theory has traditionally been concerned with a narrow interpretation of the word embraced by Basu's description. Given data, statisticians would typically construct a sampling model with a parameter θ to describe a population from which the data were supposedly drawn. Information in the sample about the population comes out through inference about θ. Alternatively, given a θ of interest, the classical paradigm sees the statistician as conducting a statistical experiment to generate a sample from a population defined by a sampling distribution with parameter θ. The sample then provides information about θ. In either case, statistical inference will be based on these observations and their directly associated sampling model. This is the frequency theory viewpoint (see Lehmann 1983).

Bayesians think that we always have some prior distribution for the unknown parameter. Then we combine the prior information and the 'statistical data' information (see Lindley 1965).

But these two sources of information are not the only sources of information about the parameter. There is another source of information (relevant sample information, as defined in Section 2.2). This information is very useful in many cases. This can be clearly seen from the following example. (We use an example similar to that of Basu (1975) but for a different purpose.)

Example 2.1 Suppose an urn A contains 100 tickets that are numbered consecutively as θ + 1, θ + 2, ..., θ + 100, where θ is an unknown number. Let E_n stand for the statistical experiment of drawing a simple random sample of n tickets from the urn A without replacement and then recording the sample as a set of n numbers x₁ < x₂ < ... < x_n. Suppose we know that θ is bounded by 50; this means θ ≤ 50. This information is a kind of direct information (or prior information). Consider now the hypothetical situation where E₂₅ has been performed and has yielded the sample x = (55, 57, ..., 105), where 55 and 105 are respectively the smallest and largest numbers drawn. To be specific, with data (E_n, x), x = (x₁, x₂, ..., x_n), we know without any shadow of doubt that the true value of θ must belong to the set

S = {x₁ − 1, x₁ − 2, ..., x₁ − m},

where m = 100 − (x_n − x₁). Now with the information from the data we can assert that θ is an integer that lies somewhere in the interval [5, 54]. Combining the direct information and the 'statistical data' information, we can conclude that θ is an integer that lies somewhere in the interval [5, 50].

Now suppose another urn B contains another 100 tickets that are numbered consecutively as θ₁ + 1, θ₁ + 2, ..., θ₁ + 100, where θ₁ is another unknown number. But we know that the difference between θ and θ₁ is smaller than 5, meaning that |θ − θ₁| < 5. Suppose now that we draw a simple random sample of 5 tickets from the urn B and find the sample x' = (51, 80, ..., 149). With this information, we can assert that θ₁ = 50 or 49. Now with the information from the data x', we can say that θ is an integer that lies somewhere in the interval [45, 54]. This is the information from the data x' for the parameter θ.

Combining the information from these different sources, we can finally conclude that θ is an integer somewhere in the interval [45, 50].

In the above example, the data x' contain some useful information about the parameter θ. But x' are not from the experiment E with the parameter θ (the experiment we plan and perform). The x' come from the experiment with the unknown parameter θ₁, which has some relation with the parameter θ (in this example, |θ − θ₁| < 5). This is the difference between x' and x. Classical statistical inference usually focuses on the information from x alone.
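The interval arithmetic in Example 2.1 is easy to check directly. The short Python sketch below (an illustration only, with the numbers of the example hard-coded) recovers the candidate sets for θ and θ₁ and intersects them with the direct information θ ≤ 50 and the relation |θ − θ₁| < 5.

# Candidate values of theta from a sample of consecutive-ticket numbers:
# theta lies in {x1 - 1, ..., x1 - m} with m = 100 - (xn - x1).
def candidates(x_min, x_max):
    m = 100 - (x_max - x_min)
    return set(range(x_min - m, x_min))

theta_from_A = candidates(55, 105)                      # data from urn A: {5, ..., 54}
theta_direct = {t for t in theta_from_A if t <= 50}     # direct information: theta <= 50
theta1_from_B = candidates(51, 149)                     # data from urn B: {49, 50}

# Relevant sample information: |theta - theta1| < 5 for some admissible theta1.
theta_from_B = {t for t in range(-200, 200)
                if any(abs(t - t1) < 5 for t1 in theta1_from_B)}

print(sorted(theta_direct & theta_from_B))              # -> [45, 46, 47, 48, 49, 50]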
In this chapter, we will focus on the discussion of this third source of information. We come out very strongly in support of the use of this information for statistical inference when it is available.

In Section 2.2, we try to classify information about the unknown parameter into two main types. The first is "direct information" and the second we call "sample information" (as defined in Section 2.2). We classify sample information into exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. Relevant sample information has only been used in some special contexts of statistics. But systematic theory is not available and needs to be developed.

We have used the word information many times in this section. But what is information? And how do we measure information? As Basu (1975) remarks, no other concept in statistics is more elusive in its meaning and less amenable to a generally agreed definition. The measure of information plays a very important role in statistics and communication science. A lot of well-known work has been done in this area (see Fisher (1925), Shannon (1948), Wiener (1948) and Kullback (1959)). We will review three of the most important information measures in Section 2.3.

All these information measures are used for direct information and exact sample information, for different purposes. In Section 2.4, we discuss how to measure relevant sample information. A possible generalization of Fisher information is proposed. The use of relevant sample information in the testing of hypotheses or discrimination will be considered in Section 2.5. We propose a reasonable measure of information for discrimination. This measure is a generalization of the Kullback-Leibler information measure.

2.2 Classification of Information

Statistics is concerned with the collection of data and with their analysis and interpretation. Here we do not consider the problem of data collection but take the data as given and ask what they tell us. The answer depends not only on the data, but also on background knowledge of the situation; the latter is formalized in the assumptions with which the analysis is entered. We review two traditional principal lines of approach.

Classical inference and decision theory. The observations are now postulated to be the values taken on by random variables which are assumed to follow a joint probability distribution, F, belonging to some known class P. Frequently, the distributions are indexed by a parameter, say θ, taking values in a set Ω, so that

P = {F_θ, θ ∈ Ω}.

The aim of the analysis is then to specify a plausible value for θ (this is the problem of point estimation) or at least to determine a subset of Ω of which we can plausibly assert that it does, or does not, contain θ (estimation by confidence sets or hypothesis testing). Such a statement about θ can be viewed as a summary of the information provided by the data and may be used as a guide to action.

In this approach, statistical inference about θ is based on both this directly associated sampling model (here {F_θ}) and the observations.

Bayesian analysis. In this approach, it is assumed in addition that θ is itself a random variable (though unobservable) with a known distribution. This prior distribution (specified prior to the availability of the data) is modified in light of the data to determine a posterior distribution (the conditional distribution of θ given the data), which summarizes what can be said about θ on the basis of the assumptions made and the data.

This Bayesian approach to the parameter θ is based on the directly associated model, the prior distribution and the data.

It is frequently reasonable to assume that we get some other observations which are not from F_θ, but from some F_{θ₁}, where F_{θ₁} is related to F_θ. These observations do contain some information about θ.
But the above two traditional principal lines of approach do not include these observations.

Example 2.2 Assume that we wish to estimate the probability (a parameter θ_A) of a penny A showing heads when spun on a flat surface. We usually consider n spins of the penny as a set of n binomial trials with an unknown probability θ_A of showing heads. Suppose, however, that we have m spins of the penny B. If we believe this penny B is similar to penny A (meaning θ_A and θ_B are close to each other), then to estimate θ_A it might be reasonable to use the information from the m spins of penny B.

The above discussion leads to the following classification of information about the unknown object of inference (here the parameter θ).

Definition 2.1 (Direct information). All the information which is directly related to the unknown object of inference (parameter) is called direct information about the parameter θ.

Definition 2.2 (Sample information). We call sample information the information about θ from the sample or the data. If the sample is from the experiment (model) which relates directly to the parameter θ, we call it an exact sample for this parameter θ. The information in the exact sample is called exact sample information. A sample which is from an experiment (model) related to the parameter θ (but not directly) is called a relevant sample for the parameter θ. The relevant sample information is defined as the information from a relevant sample.

As in Example 2.1, θ ≤ 50 is direct information about θ. The data of E₂₅ are exact sample information. The data drawn from urn B are relevant sample information.

In classical inference and decision theory, statistical inference about θ is based on both the directly associated sampling model {F_θ} (direct information) and the observations (exact sample information). The Bayesian approach is based on the directly associated sampling model (direct information), the prior distribution (direct information) and the data (exact sample information).

In some cases, we may have relevant sample information about the inferential objective. This can be well illustrated by the examples of the following chapters. Examples 3.1, 3.2, 3.6 and 6.1 all indicate that relevant sample information is available and useful.

In the following examples, we show how information is classified.

Example 2.3 Linear Model. The observations y, considered as an n × 1 column vector, are a realization of the random vector Y with E(Y) = xβ, where x is a known n × q matrix of rank q ≤ n, and β is a q-dimensional column vector of unknown parameters. β is the parameter of interest.

Because the model involves the unknown parameters directly, the information from the observations y about β is exact sample information. More generally, for the Generalized Linear Model (see McCullagh and Nelder 1989), the information from the sample is still exact sample information, because in that case the experiment involves the unknown parameters directly.

Example 2.4 Let X be from N(θ, 1), and let the estimand be θ; θ has a convenient prior distribution, say N(0, 1). Now let us assume Y is from N(θ₁, 1), with θ₁ unknown. But we do know that θ − θ₁ has a prior distribution, say N(0, 1).

The data value X is an exact sample for θ. The prior distribution of θ is direct information for θ. The data value Y is a relevant sample for θ.

The statistical methods using direct information and exact sample information are well developed. We can easily find these methods in standard textbooks.
However, there are no systematic methods about how to use relevant sample information.

2.3 Measure of Information

In Section 2.2, we classified the different types of information. But what is information? How can we measure information? The answers to these questions seem controversial. The definition of information or entropy goes back to around 1870 and a series of papers by L. Boltzmann. Since then, statisticians have proposed many different definitions for different targets. Now we review the three most important definitions of statistical information.

(I) Shannon and Wiener Information (Entropy).

The statistical interpretation of thermodynamic entropy, a measure of the unavailable energy within a thermodynamic system, was developed by L. Boltzmann around 1870. His first contribution was the observation of the monotone decreasing behavior in time of a quantity defined by

E = ∫ f(x, t) log{f(x, t)} dx,

where f(x, t) denotes the frequency distribution of the number of molecules with energy between x and x + dx at time t (Boltzmann, 1872). When the distribution f is defined in terms of the velocities and positions of the molecules, the above quantity takes the form

E = ∫∫ f log f dx dy,

where x and y denote the vectors of position and velocity, respectively. Boltzmann showed that for some gases this quantity, multiplied by a negative constant, was identical to the thermodynamic entropy.

Shannon (1948) proposed the definition of the entropy of a probability distribution (the negative of the above quantity):

H = −∫ p(x) log p(x) dx,     (2.1)

where p(x) denotes the probability density with respect to the measure dx.

Shannon entropy plays a very important role in modern communication theory. There are almost uncountably many papers and books about the use of Shannon entropy. The quantity H is simply referred to as a measure of information, or uncertainty, or randomness. This definition may be used in the measure of direct information, when the direct information is a Bayes prior. Also it may be used in the case of a posterior distribution. However, Savage (1954, page 50) remarks: 'The ideas of Shannon and Wiener, though concerned with probability, seem rather far from statistics. It is, therefore, something of an accident that the term 'information' coined by them should be not altogether inappropriate in statistics.'

(II) Fisher's Information.

R. A. Fisher's (1925) measure of the amount of information supplied by data about an unknown parameter is well known to statisticians. This measure is the first use of 'information' in mathematical statistics, and was introduced especially for the theory of statistical estimation. For a real parameter and a density function satisfying the Cramer-Rao regularity conditions it has the form

I(θ) = E_θ[(∂ log f(X, θ)/∂θ)²].     (2.2)

We know that Fisher's information has a lot of optimal properties as a measure of information: (i) Fisher's information, being specific to a parameter, will stay the same if we reduce to a sufficient statistic; (ii) Fisher's information is additive over different sets of independent data; and (iii) Fisher's information gives a lower variance bound for the estimation of the parameter provided some regularity conditions are satisfied.
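To fix ideas, the following Python sketch evaluates the two measures just reviewed for a normal density: the Shannon entropy (2.1) of N(0, σ²), which should equal ½ log(2πeσ²), and the Fisher information (2.2) of the N(θ, 1) location family, which should equal 1. The Monte Carlo is purely illustrative; the parameter values are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Shannon entropy (2.1) of N(0, sigma^2): H = -E[log p(X)] = 0.5 * log(2*pi*e*sigma^2).
x = rng.normal(0, sigma, 200_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
print(-log_p.mean(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))

# Fisher information (2.2) of the N(theta, 1) location family at theta = 0:
# the score is d/dtheta log f(X, theta) = X - theta, so I(theta) = E[(X - theta)^2] = 1.
z = rng.normal(0, 1, 200_000)
print(np.mean((z - 0.0) ** 2))          # close to 1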
(III) Kullback-Leibler's Information.

Kullback and Leibler (1951) consider a definition of information for 'discrimination in favor of H₁ against H₂'. Here H_i, i = 1, 2, is the hypothesis that X is from the statistical population with probability measure μ_i, with dμ_i(x) = f_i(x) dx. They define the logarithm of the likelihood ratio, log{f₁(x)/f₂(x)}, as the information in X = x for discrimination in favor of H₁ against H₂. The mean information for discrimination in favor of H₁ against H₂ per observation from H₁ is defined as

I(1 : 2) = ∫ f₁(x) log{f₁(x)/f₂(x)} dx.     (2.3)

This definition is a departure from Shannon and Wiener's information. It is widely used in statistics for discrimination. This can be easily seen from Kullback (1959).

The above three definitions of information are all about direct information and exact sample information. This is because classical statistics is focused on these two types of information. In the following two sections, we will discuss the information measure for relevant sample information.

2.4 Fisher's Information of Relevant Sample in Point Estimation

It is well known that Fisher's information plays a very important role in the theory of statistical estimation. In the last section, we discussed Fisher's information for exact samples. Now we will propose a possible generalization of Fisher's information to the relevant sample case. The bias function and information function in Chapters 4, 5, and 7 offer other possible generalizations.

Let us begin with the simplest case. Assume that X has density f(x, θ₀), where θ₀ is a real parameter. Suppose f(x, θ₀) satisfies the Cramer-Rao regularity conditions; then the Fisher's information for θ₀ from X as defined in Section 2.3 is (2.2). We are interested in the parameter θ. As we have claimed in Section 2.2, there is some information in X for the parameter θ if we know that |θ − θ₀| ≤ c for some constant c. Now it is natural to ask how to measure the information in X for the parameter θ.

Definition 2.3 The information for the parameter θ in X is defined as Inf_X(θ, θ₀) = (Inf_X(θ₀), θ − θ₀). Here Inf_X(θ₀) is the Fisher's information.

In the above definition, we can see that the information in X for the parameter θ contains two parts: one is the information part; the other is the bias part. If the bias part is 0, which means that θ = θ₀, then the above information measure becomes Fisher information. If we know the bias exactly, then the bias can be eliminated. This is because then we can transform the parameter. Generally, if θ and θ₀ have a one-to-one relationship, then we can always do the parameter transformation. This will be the case of exact sample information.

Now we discuss the properties of the above definition of information. In the following discussion, we always assume the bias is the same, unless we specify otherwise. When we say Inf_X(θ, θ₀) equals Inf_Y(θ, θ₁), we mean θ₀ = θ₁ and Inf_X(θ₀) = Inf_Y(θ₁). Only when the bias is the same can we order two information indices.

• a. Inf_X(θ, θ₀) is independent of the σ-measure μ.

As we know, Inf_X(θ, θ₀) is calculated from f(x, θ₀) = dP_{θ₀}(x)/dμ. If we can find another σ-measure ν such that {P_θ : θ ∈ Ω} ≪ ν, then we can replace f(x, θ₀) by f*(x, θ₀) = dP_{θ₀}(x)/dν. The value Inf_X(θ, θ₀) is not changed. The proof is easy and we omit it here.

• b. The information of several independent observations is the sum (appropriately defined) of the information in these observations, if the biases of these observations are the same.

Mathematically, the above statement says: suppose X₁, ..., X_k are independent and X = (X₁, ..., X_k).
If f_i(x_i, θ₀) is the density of X_i, and these densities satisfy the Cramer-Rao regularity conditions, then

f(x, θ₀) = f₁(x₁, θ₀) ··· f_k(x_k, θ₀)

satisfies the Cramer-Rao regularity conditions, and

Inf_X(θ, θ₀) = {Inf₁(θ₀) + ··· + Inf_k(θ₀), θ − θ₀},     (2.4)

where Inf_i(θ, θ₀) is the information function for θ in X_i. The proof of the above result is exactly similar to that for Fisher's information. We omit the proof here.

• c. The information will not increase when we transform the data.

Let Y = T(X) be a statistic; that is, T is a function with domain X and range Y, and let B be an additive class of subsets of Y. We assume that T is measurable. Let g(t, θ₀) be the density of T(X). If f(x, θ₀) and g(t, θ₀) satisfy the Cramer-Rao regularity conditions, then

Inf_X(θ, θ₀) ≥ Inf_T(θ, θ₀),     (2.5)

where (x₁, y) ≥ (x₂, y) means x₁ ≥ x₂. We omit the proof here because it is a direct result from Fisher's information.

• d. Under the conditions of c), we have inequality in (2.5), with equality if and only if the statistic Y = T(X) is sufficient for the parameter θ₀.

The proof is omitted.

In connection with the basic properties of information, we have the following comments.

1. From (2.4), when we have n iid observations from f(x, θ₀), then Inf(θ, θ₀) = (n Inf(θ₀), θ − θ₀).

2. Property d) implies that we should use the sufficient statistic to do the statistical analysis for the parameter θ, although this sufficient statistic is for the population indexed by the parameter θ₀. This result tells us how to reduce the dimension of the data.

3. If we do a one-to-one transformation of the parameter, that is η = h(θ) with h differentiable, the information that X contains about η is

Inf_X(η, η₀) = (Inf_X(θ₀)/[h'(θ₀)]², η − η₀),

where η₀ = h(θ₀).

Now we are going to discuss the generalized information inequality, which extends the information inequality to the relevant sample case.

Theorem 2.1 (Relevant Information Inequality) Suppose that the Cramer-Rao regularity conditions are satisfied. Let δ be any statistic with E_{θ₀}(δ²) < ∞ for which the derivative with respect to θ₀ of h(θ₀) = E_{θ₀}(δ) exists and can be obtained by differentiating under the integral sign. Then

E_{θ₀}(δ − θ)² ≥ [h'(θ₀)]²/Inf(θ₀) + (h(θ₀) − θ)².     (2.6)

Proof. The result follows directly from

E_{θ₀}(δ − θ)² = Var_{θ₀}(δ) + (h(θ₀) − θ)²

and the Information Inequality.

Usually we can control the value of |θ − θ₀|, but not |h(θ₀) − θ|. We would like to choose h(θ₀) = θ₀. From the above Theorem, we obtain:

Corollary 2.1 Under the conditions of Theorem 2.1, for any unbiased estimator δ of θ₀,

E_{θ₀}(δ − θ)² ≥ 1/Inf(θ₀) + (θ₀ − θ)².     (2.7)

From the Corollary, we can see that our definition of information for relevant samples is reasonable. The lower bound of the mean squared error depends both on the information and the bias.
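The bound in Corollary 2.1 can be checked by simulation in the simplest normal case. In the sketch below (illustrative only; the parameter values are arbitrary), X ~ N(θ₀, 1), the unbiased estimator δ = X has Inf(θ₀) = 1, and its mean squared error about the parameter of interest θ equals 1/Inf(θ₀) + (θ₀ − θ)², so the bound (2.7) is attained.

import numpy as np

rng = np.random.default_rng(0)
theta, theta0 = 1.0, 1.4                 # parameter of interest and sampling parameter (bias 0.4)

x = rng.normal(theta0, 1.0, 500_000)     # X ~ N(theta0, 1); delta = X is unbiased for theta0
mse = np.mean((x - theta) ** 2)

info = 1.0                               # Fisher information of N(theta0, 1) in one observation
bound = 1.0 / info + (theta0 - theta) ** 2
print(mse, bound)                        # both close to 1.16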
Now we consider how to combine the information from relevant samples having different bias parts. Let X have density f(x, θ₀) and Y the density g(y, θ₁). Both f(x, θ₀) and g(y, θ₁) satisfy the Cramer-Rao regularity conditions. θ is the parameter of interest. Both X and Y contain some information about θ, so we need to combine their information.

From Definition 2.3, we get Inf_X(θ, θ₀) and Inf_Y(θ, θ₁). Then the information from (X, Y) is defined as {I(θ₀, θ₁), B(θ₀, θ₁)}, where I(θ₀, θ₁) = diag(Inf_X(θ₀), Inf_Y(θ₁)) and B(θ₀, θ₁) = (θ − θ₀, θ − θ₁)ᵗ. This is similar to Fisher's information for the multiparameter case except for our inclusion of the bias. We can easily obtain the following result.

Theorem 2.2 Let δ(X, Y) be any statistic with E(δ(X, Y)²) < ∞ such that the derivatives with respect to θ₀ and θ₁ of h(θ₀, θ₁) = E(δ(X, Y)) exist and can be obtained by differentiating under the integral sign. Then

E(δ(X, Y) − θ)² ≥ βᵗ I⁻¹(θ₀, θ₁) β + (h(θ₀, θ₁) − θ)²,     (2.8)

where βᵗ = (∂h(θ₀, θ₁)/∂θ₀, ∂h(θ₀, θ₁)/∂θ₁).

We stop this section here. Further theory is under development but not yet complete.

2.5 Kullback and Leibler's Information in Relevant Sample in Discrimination

Let us begin with the following example.

Example 2.5 Simple null hypothesis. Assume X₁ is a sample from N(θ, 1). The null hypothesis is H₀: θ = 0 and the alternative is Hₐ: θ = 2. From the Neyman-Pearson Lemma, we can easily get the most powerful test of level α = .05 as: if X₁ > 1.645, we reject the null hypothesis H₀; otherwise, we accept H₀. The power of this test is .639.

Now suppose we get another sample X₂ from N(θ₀, 1). We know that |θ − θ₀| ≤ .5. We construct a new test: if X₁ + X₂ > 2.826, we reject the null hypothesis H₀; otherwise we accept H₀.

For the second test, we have sup P_{H₀}(reject H₀) ≤ .05 and inf P_{Hₐ}(reject H₀) ≥ .683 > .639. So the second test is more powerful than the first one. This means the observation X₂ contains some information about the simple null hypothesis.

Example 2.5 tells us that the relevant sample is useful for testing. In this section, we will consider the information of the relevant sample for discrimination. We will use an idea similar to that underlying Kullback-Leibler information.

As we know, Kullback and Leibler (1951) define the logarithm of the likelihood ratio, log{f₁(x)/f₂(x)}, as the information in X = x for discrimination in favor of H₁ against H₂. Now we suppose that the sample X is from some density g₁(x) which may have some relationship to {f₁(x), f₂(x)} (X is a relevant sample).

Definition 2.4 The mean information for discrimination in favor of H₁ against H₂ per observation from g₁(x) is defined as

I(1 : 2; X) = ∫ g₁(x) log{f₁(x)/f₂(x)} dx.     (2.9)

This definition is a departure from Kullback and Leibler's information; Kullback and Leibler's information is about the sample from f₁(x). We generalize this to the relevant sample case. If I(1 : 2; X) > 0, then the sample from g₁(x) favors H₁. We can easily see that

I(1 : 2; X) = ∫ g₁(x) log{g₁(x)/f₂(x)} dx − ∫ g₁(x) log{g₁(x)/f₁(x)} dx.     (2.10)

The term ∫ g₁(x) log{g₁(x)/f₁(x)} dx is the bias part of this information. If g₁(x) = f₁(x), then the bias part vanishes.

Now we discuss the properties of this information.

• a. I(1 : 2; X) is independent of the σ-measure μ. This is similar to property a) in Section 2.4.

• b. The information could be negative, and a negative value means this sample favors H₂.

• c. The information for discrimination in independent observations is additive. This means that if we have some independent observations X₁, ..., X_k from densities g₁(x), ..., g_k(x) and let X = (X₁, ..., X_k), then

I(1 : 2; X) = I(1 : 2; X₁) + ··· + I(1 : 2; X_k).

The proof is easy and we omit it here.
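The decomposition (2.10) above is easy to verify numerically. The Python sketch below (an illustration with arbitrarily chosen normal densities) computes I(1 : 2; X) of Definition 2.4 by Monte Carlo for f₁ = N(0, 1), f₂ = N(2, 1) and a relevant sampling density g₁ = N(0.3, 1), and checks that it equals the difference of the two terms in (2.10).

import numpy as np

rng = np.random.default_rng(0)

def log_norm_pdf(x, mu, sd=1.0):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)

x = rng.normal(0.3, 1.0, 400_000)        # sample from the relevant density g1 = N(0.3, 1)

# (2.9): I(1:2; X) = E_g1[ log f1(X) - log f2(X) ] with f1 = N(0,1), f2 = N(2,1).
i12 = np.mean(log_norm_pdf(x, 0.0) - log_norm_pdf(x, 2.0))

# (2.10): the same quantity as KL(g1, f2) minus the "bias part" KL(g1, f1).
kl_g1_f2 = np.mean(log_norm_pdf(x, 0.3) - log_norm_pdf(x, 2.0))
kl_g1_f1 = np.mean(log_norm_pdf(x, 0.3) - log_norm_pdf(x, 0.0))
print(i12, kl_g1_f2 - kl_g1_f1)          # the two numbers agree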
2.6 General Remarks

As suggested in Section 2.2, there are three types of information. What is the relationship among these types of information? Relevant sample information contains two parts: one is the relation between the two experiments (models); the other is the observations. The relation between two experiments is a kind of direct information with an unknown parameter. When the number of relevant samples goes to infinity, the relevant sample information becomes direct information. For example, in Example 2.2, the similarity of θ_A and θ_B indicates a relationship between the two experiments. When m (the number of spins of penny B) goes to infinity, we can get an exact value of θ_B. Then the relevant sample information would mean θ_A is close to some known value (this is direct information).

The relation between θ_A and θ_B can be of several types. Here we list some of them: (1) |θ_A − θ_B| ≤ c for some constant c (Example 3.1); (2) θ_A − θ_B is small (Example 2.2 and Example 3.2); and (3) θ_A − θ_B is a random variable with a known distribution.

The information measures proposed in Sections 2.4 and 2.5 need further study.

Chapter 3

Relevance Weighted Likelihood Estimation

3.1 Introduction

In this chapter we generalize the classical likelihood as the Relevance Weighted Likelihood (REWL). The REWL arises in parametric inference when, in addition to (or instead of) the sample from the study population, relevant but independent samples from other populations are available. By down-weighting them according to their relevance, the REWL incorporates the information from these other samples. We have characterized such "relevant" sample information in Chapter 2. To motivate the likelihood theory presented below, we merely illustrate situations where such information arises.

Example 3.1 Let Y_i ~ N(μ_i, 1), i = 1, 2, be independent random variables, the {μ_i} being unknown parameters. We want to estimate μ₁. Two estimators present themselves.

(i) Classical likelihood-based estimation theory suggests the MLE which uses just Y₁. Then we get

μ̂₁ = Y₁.     (3.1)

(ii) However, if μ₂ was deemed to be "close" to μ₁, intuition suggests we use the information in Y₂ in some way. Yet the classical theory still yields the result in (i). Even when we add the structural condition that |μ₁ − μ₂| ≤ c for a specified constant c > 0, the MLE still uses just Y₁ unless the condition |Y₁ − Y₂| ≤ c is violated. If that condition fails, μ̂₁ = {Y₁ + (Y₂ − c)}/2 or μ̂₁ = {Y₁ + (Y₂ + c)}/2 according as Y₁ < (Y₂ − c) or Y₁ > (Y₂ + c). So then the MLE does bring Y₂ into the estimation of μ₁. But it does so crudely, only through truncation. So instead we turn to a seemingly more natural alternative which uses Y₂ more fully:

μ̃₁ = {(1 + c²)Y₁ + Y₂}/(2 + c²).     (3.2)

Under the mean squared error criterion, we find

E(μ̂₁ − μ₁)² = 1 > E(μ̃₁ − μ₁)².     (3.3)

From (3.3), we can conclude that the estimator (3.2) based on both Y₁ and Y₂ always has a smaller mean squared error than that in (3.1) based on Y₁ alone.

The last example demonstrates that in certain cases, we can profitably incorporate the information from samples drawn out of populations different from that under study. In other words, all "relevant" information must be used in inferences about the parameters of interest.

Example 3.2 Nonparametric Regression Model. If n data points {(X_i, Y_i)} have been collected, the regression relationship between Y and X can be modeled as

Y_i = m(X_i) + ε_i,  i = 1, ..., n,     (3.4)

using the unknown regression function m and observation errors ε_i. Assume that ε₁, ..., ε_n are iid from some unknown density function f(x) with E(ε_i) = 0. The function m is the parameter we want to estimate.

About the situation embraced by this last example, Eubank (1988, p. 7) says "If m is believed to be smooth, then the observations at X_i near x should contain information about the value of m at x. Thus it should be possible to use something like a local average of the data near x to construct an estimator of m(x)." Reasoning like this and the last example itself motivate our work. Developments of recent years in the theory and application of nonparametric regression validate Eubank's argument. These developments show nonparametric regression to be a useful explanatory and diagnostic tool.
Eubank (1988), Hardle (1990) and Muller (1988) discuss nonparametric regression where observations at X_i near x are used to infer m at x because they contain relevant information.

However, the domain encompassed by the heuristics of nonparametric regression theory appears much broader than that currently encompassed by the theory itself. In fact, Example 3.2 immediately suggests a number of unanswered questions. (i) If the error distribution associated with (3.4) had a known (parametric) form with unspecified parameters, how would all "relevant" information be used in inference about m(x)? More specifically, is there a likelihood based approach which would permit the use of that information?

(ii) If we were interested in estimating some unknown x-population attribute other than its mean, m(x) = E(Y|X = x), what could we do in either the parametric case of (i) or more generally in the nonparametric case?

(iii) How can the information about m(x) in the observations at X_i near x be described or quantified?

We have explored possible answers to question (iii) in Chapter 2. We will address question (ii) for the nonparametric case in Chapter 6; a solution for the parametric case implicitly derives from the theory of this chapter. Finally, the method of this chapter gives an answer to question (i) (see details in Chapter 5).

In this chapter, we propose the idea of the relevance weighted likelihood (REWL) for relevant sample situations. In Section 3.2, we construct the REWL function. After looking at several examples, we discuss there the relationship between the usual and relevance weighted likelihood function.

For the traditional purpose of data reduction, we define "weakly" sufficient statistics using the REWL in Section 3.3 (weakly because of their dependence on the relevance weights chosen in the construction of the REWL). We show that weakly sufficient statistics have some of the properties of their sufficient relatives.

In Section 3.4, we introduce the MREWLE, the maximum relevance weighted likelihood estimator. The MREWLE's obtained in some specific applications are new. In others, the theory merely enables us to rederive known estimators, albeit from a more basic starting point.

We can justify the idea of the REWL by appealing to entropy and prediction much as in the classical case. This we do in Section 3.5. We first extend there the entropy maximization principle to embrace all relevant samples. The resulting extension then yields the relevance weighted likelihood principle.

We use the relevance weighted likelihood under normal theory assumptions in Section 3.6. We show that maximizing the REWL yields a reasonable estimator. In this example, the MREWLE has an advantage over the MLE which is available in some special cases. The application of MREWL estimation to generalized smoothing models appears in Chapter 5. Some general remarks are made in Section 3.7.

3.2 The Relevance Weighted Likelihood

3.2.1 Definition

Let y = (y₁, ..., y_n) denote a realization of the random vector Y = (Y₁, ..., Y_n) and f_i ∈ F_i, i = 1, ..., n, the unknown probability density functions (PDF's) of the Y_i, which are assumed to be independent. We are interested in the PDF f ∈ F of a study variable X measurable on items of our study population (with PDF f). At least in some qualitative sense, the f_i are thought to be "like" f.
Consequently the yj’s are thought to be of valuein our inferential analysis even though the ‘ are independently drawn from a populationdifferent from our study population.In the familiar paradigm of repeated sampling, we impose the condition f, = f for all iin deriving the likelihood. In reality, this condition represents an approximation whichmay be more plausible for some of the i’s than others. It may even seem desirable todownweight certain of the likelihood components, f(y) in some way, when the quality of the associated approximation seems low. But how should those components beweighted?A heuristic Bayesian analysis suggests a way of assigning relative weights to the likelihood components. This analysis leads us to the REWL. Suppose we take logarithms ofthe various PDF’s (assumed to be positive for these heuristics) to put them on an affinescale: (—cc, cc). In the log-likelihood, the correct term associated with yj is log[f(y)j.If we are to replace this with a term involving only 1og[f(y)], we might plausibly use28the best linear predictor (BLP) of log[f(y)] based on log[f(y)j. This gives usin place of f(y) in the likelihood. Here p, represents the coefficient of log f(y) in theBLP. In other words, p represents the covariance between log f(y) and log f(y) (if weignore a multiplicative rescaling factor). [Our analysis also ignores an irrelevant additivefactor in the BLP.]This leads us to define the REWL at y= (yi,” ,Wlik{fQ),y} = llf(y), for f e . (3.5)The REWL, like the classical likelihood, allows the data to jointly assess the credibilityof any hypothesized candidate f for the role of the study population PDF. But herethe yj’s from the study population itself would be given the greatest weight in this jointassessment. As the relevance of the other yj’s decline in their relevance (measured bytheir pj’s) so does the weight accorded to them in the assessment.Usually it is convenient to work with the natural logarithm, denoted by Wl{fQ), y} andcalled the log relevance weighted likelihood:Wl{f(),y} = plogf(y), for f F. (3.6)Conventionally we take = F for all i and index F by a finite dimensional (unknown)parameter 0= (0k, ..., Oq) e Q. Then f(t) = f(t; 0), f(.;.) having a known form. Then(3.5) and (3.6) become for 0Wlik{0,y} flf’(y;0), Wl{0,y} = plogf(y;8). (3.7)293.2.2 Examples.The following examples illustrate the REWL and reveal differences between likelihoodand the REWL.Example 3.3 Continuation of Example 3.1. The usual likelihood function in thisproblem would belik(R; y) = (2ir)’ exp(—[(yi—I.Li)2 + (Y2 — R2)]/2). (3.8)This likelihood would ignore prior information like Ri — 1L21 is “small”. Now define theREWL by putting pi = 1 and P2. (1 +c2) as the relevance weights when inference isabout the study population parameter i1. ThenWlik(i;y) = (2’exp(— 2t1) )((2)_1exp(_2 t) (3.9)The likelihood (3.8) contains the parameters for both of the populations from which thedata were obtained. None of the information from Y2 would be used to estimate j evenwhen that information was deemed relevant. In contrast, the REWL in (3.9) containsollly the parameter of inferential interest, Ri• And Y2 would be used in estimating Rito the extent determined by the size of c.Example 3.4 Continuation of Example 3.2. Here define the likelihood to befJf(y - m(x)). (3.10)We find the result of no use in estimating m(x). Yet as we argued earlier, if rn(.) werethought to be smooth, observations near x should contain information about the PDF,f(y — m(x)). 
Example 3.4 Continuation of Example 3.2. Here define the likelihood to be

∏_{i=1}^n f(y_i − m(x_i)).     (3.10)

We find the result of no use in estimating m(x). Yet as we argued earlier, if m(·) were thought to be smooth, observations near x should contain information about the PDF, f(y − m(x)). The REWL can reflect this heuristic through the relevance weights p_i in

∏_{i=1}^n f^{p_i}(y_i − m(x)).     (3.11)

The next example differs from the last two in that we allow the relevance weights to depend on the data themselves.

Example 3.5 Robustness. Assume Y₁, ..., Y_n are iid observations with PDF f(y, θ) parametrized by θ, so that the likelihood is

∏_{i=1}^n f(y_i, θ).

We may believe some of the y_i to be outliers, effectively coming from some other population than that under study. The information from such data needs to be downweighted. To do this we may: (i) order the data as y_(1), ..., y_(n); and (ii) assign relevance weights p_i depending on the degree to which we regard the associated data as outlying. The REWL becomes

∏_{i=1}^n f^{p_i}(y_(i), θ).     (3.12)

In the extreme case, when a fraction 2ε are deemed to be outliers, we could choose p_i = 0 when i < [nε] or i > [n(1 − ε)]. [Here [x] denotes the greatest integer less than x.] Then the REWL becomes a "trimmed likelihood." The trimmed mean would then be obtained in certain cases as a MREWLE. This case will be discussed again in Section 3.4.

In "parametric-nonparametric regression" we would postulate PDF's in parametric form with a smoothly varying regression (mean) function. Yet other parameters, like quantiles and variances for example, may also be of interest in this setting. The following general approach to smoothing through the REWL enables us to deal with this diversity of possible inferential objectives within a unified framework.

Example 3.6 Generalized Smoothing. Suppose {(Y_i, X_i)} are n data pairs; for given X_i, Y_i has PDF f{y, θ(X_i)} with parameter θ(X_i). Interest lies in the study population corresponding to a fixed value X = x and fitting the associated PDF. The relevance weights p_i enable us to represent the degree to which the information from the populations corresponding to the X_i should be used in fitting f{y, θ(x)}. The REWL becomes

∏_{i=1}^n f^{p_i}{y_i, θ(x)}.     (3.13)

Generally, choosing the {p_i} will be like choosing a kernel and bandwidth in nonparametric regression theory. Indeed, in the domain of that theory, we can find the {p_i} directly from the corresponding kernels [and their bandwidths], making our task easy in that case.
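To illustrate Example 3.6, the sketch below (an illustration only; the kernel, bandwidth and model are arbitrary choices) fits a normal mean θ(x) at a single point x by maximizing the REWL (3.13) with Gaussian-kernel relevance weights p_i. For a normal response this local MREWLE reduces to a kernel-weighted average, of the Nadaraya-Watson type studied in Chapter 5.

import numpy as np

rng = np.random.default_rng(0)
n, h = 300, 0.1
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, n)        # Y_i ~ N(theta(X_i), sigma^2)

def mrewle_at(x0):
    # Relevance weights p_i from a Gaussian kernel in (X_i - x0) / h.
    p = np.exp(-0.5 * ((X - x0) / h) ** 2)
    # Maximizing sum_i p_i log f(Y_i, theta) for a normal f gives the weighted mean.
    return np.sum(p * Y) / np.sum(p)

x0 = 0.25
print(mrewle_at(x0), np.sin(2 * np.pi * x0))             # estimate vs. true theta(x0)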
As noted above, in the case of nonparametric regression,standard theory for kernel smoothers suggests reasonable possibilities.3.2.4 We can easily show that the REWL is preserved under arbitrary differentiabledata-transformations (with non- vanishing Jacobian) when the sampling densities are absolutely continuous with respect to Lebesgue measure. So the REWLinherits this important property of the likelihood.3.3 Weakly Sufficient Statistics3.3.1 Definition.The very important likelihood principle of classical statistical inference tells us that thelikelihood embraces all relevant information about the parameter. Indeed, according tothe factorization theorem sufficiency may be defined through the likelihood. Standardconstructions of minimally sufficient statistics rely on the likelihood.The counterpart of the likelihood theory in our setting would be the REWL principle.Lacking the invertibility of likelihood and sampling density found in standard frequencybased theory of the likelihood, we must resort to a REWL based definition of sufficiency.33Lacking a basis for claiming our likelihood captures all relevant information in the data,we call our notion of sufficiency “weak sufficiency”. That notion enables us to reducethe dimension of the observation vector to that of any (weakly) sufficient vector-valuedfunction of the data, while retaining all information in the REWL.Definition. We call weakly sufficient any vector- valued statistic which determines theREWL up to an arbitrary multiplicative factor which does not depend on f. Weaklyminimal-sufficient statistics are functions of every other weakly sufficient statistic.A weakly minimal-sufficient statistic yields the maximal data reduction. Such a statisticneed not be unique. The factorization theorem remains true for weak sufficiency.Theorem 3.1 A necessary and sufficient condition that S be weakly sufficient for theparametric family, F, indexed by 8 is that there exist functions mi(s, 8) and m2(y) suchthat for all 8 E 1,Wlik(8,y) = mi(s,O)m2(y). (3.14)Accepting the REWL as the basis for inference makes reliance on weakly sufficient statistics inevitable. The seemingly reasonable estimators we obtain below depend on the dataonly through a weakly sufficient statistic, thereby offering “empirical support” for ourprinciple. Just as the conventional likelihood (regarded as a function) is sufficient, theREWL is weakly sufficient. [This fact follows from the factorization theorem.] However,weakly sufficient statistics lack the property of conventional sufficiency which rendersthe conditional sampling distribution of the data given a sufficient statistic independentof 0. [Our REWL does not derive from a sampling density function.]343.4 Maximum Relevance Weighted Likelihood Es.timation (MREWLE)In this section we generalize the MLE.Definition: Call any 0 E f which maximizes the REWL, a maximum REWL estimator(MREWLE).Before discussing the properties of MREWLE, we reconsider one of our examples.Continuation of Example 3.5. Assume the density f(y0) is that of a normal distribution with mean 0. Then the MREWLE of 0 is[n(1—e)][nE]when we choose the {pj} in Example 3.5. This is a trimmed mean. Other choices yieldL-statistics as the MREWLE’s.The MREWLE inherits some of the properties of the MLE.• Under certain weak conditions, the MREWLE is consistent. 
We will prove thisfact in the next Chapter by generalizing to the non-iid and more general weightedcase the well-known theory of Wald (1949).• The asymptotic normality of MREWLE under certain conditions is proved inChapter 4.• The MREWLE always relies on the data only through weakly sufficient statistics.• the MREWLE possesses the familiar property of invariance under one-to-one parameter transformations.35• The goal of establishing the asymptotic efficiency of MREWLE in some appropriate general sense has eluded us. At this time we can give that property only inthe special case of nonparametric regression (Chapter 5 and Chapter 7).3.5 The Entropy Maximization Principle.In a series of papers, Akaike (1977, 1978, 1983,1985) discusses the importance of theentropy maximization principle in unifying conventional and Bayesian statistics. Wegeneralize this principle to the framework of relevant samples in this section. Thisgeneralization enables us to prove that the method of MREWL may be viewed as arealization of that principle to an important but limited extent.3.5.1 The generalized entropy maximization principle.To recall the conventional entropy maximization principle, suppose we drawx = (x1,x2,. . . , xk)’ from a multivariate distribution with density f. Suppose we intendto estimate f by g(. : x) and view this estimate as a predictive distribution for a futurevector drawn from f. As the index of the quality of g( : z), use the entropy of f withrespect to g( :B(f;g)=— f(z)log dz.The entropy maximization principle asks us to find the g(• : x) which maximizes theexpected entropy EB(f; g)= J’ B(f; g)f(x)dx. We may view the result as giving us an“optimum” estimator of f, regarded as the object of inferential interest. We would notein passing that Fisher’s maximum likelihood method and the AIC (Akaike informationcriterion) are two very important implications of the entropy maximization principle.36For simplicity of exposition, consider now only the univariate case. [The vector variablecase is an obvious generalization.] Suppose Yi, , Yn respectively, are independentlydrawn from distributions with densities f(y),..,f(y) thought to be related to thedensity of inferential interest f. [The relevance of f to f could be described by B(f; f) >—c for all i where the {c} are positive constants. This inequality means f is not farfrom f. For the special iid case, f = f for all i or equivalently, B(f; f) = 0 for all i.]Let g(• : y) denote an estimate of f where y = (yi,• , yj. Once again we may view gas an estimated predictive distribution of a future observation z from f.Because the relevance of the f to f varies with i, we assign different relevance weights,pj, to them. We then get the weighted entropy measure:pB(f;g) = _pjJfj(z)1og g[()dZ•[Because we do not know f, we choose the above index to force g to lie “close” to thedensities we do know, {f}, and which we deem to be close to f.]Our generalized entropy maximization principle may now be stated. All inference aboutf may be based on the g obtained by maximizing the expected weighted entropy of thepredictive distribution where the expected weighted entropy isE,pB(f; g) = — f pB(f; g)f(y)dy.3.5.2 The MREWLE and generalized entropy maximizationprinciple.We know thatpB(f;g) =p1Jf(z)logg(z : y)dz — Epiffi(z)logfi(z)dz.37The second term on the right, a constant, depends on only {f}. For assessing g weneed only consider the first term. However, we cannot evaluate that unknown termso estimate it by pj log g(yj : y). 
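In code, this estimate is simply the weighted log-likelihood (3.7) evaluated at the candidate density, and maximizing it over a parametric family of candidates yields the MREWLE. A minimal sketch, assuming a N(theta, 1) candidate family and externally supplied relevance weights (the names are illustrative only):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def estimated_weighted_entropy(theta, y, p):
    # sum_i p_i log g(y_i : y), with candidate g the N(theta, 1) density
    return np.sum(p * norm.logpdf(y, loc=theta, scale=1.0))

def mrewle(y, p):
    # maximize the estimated expected weighted entropy over theta;
    # for this family the maximizer is just the weighted mean sum(p * y)
    return minimize_scalar(lambda t: -estimated_weighted_entropy(t, y, p)).x

y = np.array([0.2, -0.1, 0.4, 2.0])
p = np.array([0.4, 0.3, 0.2, 0.1])   # nonnegative weights summing to one
print(mrewle(y, p), np.sum(p * y))   # the two values should agree closely
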
This amount uses what in Chapter 6 we call theRelevance Weighted Empirical Distribution function which puts mass p2 at y, for alli. If we specify a family of feasible g( : y)’s the one which maximizes the estimatedexpected weighted entropy at y defines the maximum REWL estimate of f. Obviouslythe performance of the MREWLE of f depends on both the choice of the feasible familyand the statistical characteristics of the simple and natural estimator. If for the feasiblefamily we choose the parametric family of the f(y0), then we find that the estimateobtained from the generalized entropy maximization principle is just the MREWLE.For brevity we will not pursue further our discussion of the generalized entropy maximization principle. However many questions about the generalized entropy maximization principle remain to be answered.3.6 MREWLE for Normal ObservationsIn this section, we develop a method of estimating the mean of a normal populationusing data from relevant samples, thereby extending Example 1 above. We use theMREWLE and compare it with other estimators.Let Y and Y1 be observations from normal populations with known variances andunknown means 0 and 01, respectively. Without essential loss of generality, supposeVar(Y) 1 and Var(Yi) = a. Assume 0—0 E [—c, c] for some fixed c> 0, 0 beingthe parameter of interest. We readily find the MREWLE of 0 to beY+1Y1, (3.15)1+c2+o? 1+c2+of38if we choose the relevance weights P1=and P2 +2’+ for Y and Y1, respectively. Here we choose the relevance weights by minimaxing the mean square error ofMREWLE.Now we compare the MREWLE with the maximum likelihood estimator.In agreement with intuition, we find that the MREWLE loses the advantage over theMLE as —p oo or c —* . The extra information in Y1 becomes useless in theseextreme cases because of the uncontrolled bias or noise in the second sample. Whenc = 0, the MREWLE becomes the MLE for the full data set. In all these cases, theMREWLE is the minimax estimator.If o —* 0, then the problem under consideration becomes that of estimating a boundednormal mean. However, the MREWLE differs from the MLE. Without loss of generality,assume 0 = 0. From (3.15), the MREWLE of 0 is0= Y1 + c2and the MLE is—c, ifY<—c0(MLE)= Y if—cY+c.The comparison of the MREWLE with the MLE is showing in the Figure 3.2.From the above comparison, we find that the MREWLE has the advantage over theMLE for the two normal relevant samples, when the mean square error criterion is used.With several relevant samples for 0 with differing means, the analytical calculation ofthe MLE of 0 proves nearly impossible. By choosing relevance weights, we can easilycalculate the MREWLE.3.7 General RemarksIn Example 3.5 of Section 3.2, we show that we can use the REWL for robustness. Thisidea is similar to the work of weighted partial likelihood by Sasieni (1992) for the Coxmodel (exact sample case). In that paper, he considers robustness and efficiency of theweighted partial likelihood method. So even for the iid sample case, we may used theREWL for both robustness and efficiency.The weak sufficient statistics defined in Section 3.3 depend on the relevance weights.This agrees with our intuition. For different relevance weights (i.e. different views ofrelative importances), the weak sufficient statistics should be different.4100w(000Cl)C(0(0(0d0Figure 3.2: A comparison of MREWLE with MLE under a1 = 1. Curve A representsthe MSE of MLE at line 0 = 0. Curve B represents the Maximum MSE of MREWLEover whole parameter space. 
Curve C represents the MSE of MREWLE at line 0 = 01.—ABC0.0 0.5 1.0 1.5 2.0Constant c42The REWL proposed in this Chapter depends on the relevance weights, {pt}. Theseweights express the statistician’s perceived relationships among the populations andusually can be chosen on intuitive grounds. For different problems, the relevance weightswere chosen in different ways. In Section 3.6, we choose these weights by minimaxingthe mean square error. In the Example 3.5, we may choose these weights by consideringboth robustness and efficiency. For the generalized smoothing model, we will used theweights similar to these of nonparametric regressions. This is in Chapter 5.The asymptotic theory of REWL in the next Chapter also give a guide line for thechoice of relevance weights. In the following Chapters, we will further discuss the choiceof these weights.43Chapter 4The Asymptotic Properties of theMaximum Relevance WeightedLikelihood Estimator4.1 IntroductionIn Chapter 3, we gave a very general method for using relevant sample information instatistical inference. We base the theory what we call the relevance weighted likelihood(REWL). The REWL function plays the same role in the case of relevance sampleinformation as the likelihood in that of exact sample information.The maximum relevance weighted likelihood estimate (MREWLE) studied in this paper, plays the role of the maximum likelihood estimator (MLE) in conventional pointestimation. The consistency of the MLE has been investigated by several authors (c.fCramer 1946 and Wald 1948). Here we prove the consistency of MREWLE under general conditions. But in many cases, the consistency of the MREWLE is not enough; weneed to get the asymptotic distribution of the MREWLE. The asymptotic normality of44the MREWLE is considered in this Chapter.I first consider the weak consistency of the MREWLE and show: (i) there exists aweakly consistent sequence of roots for the log REWL equation (Theorem 4.2); (ii) theMREWLE is weakly consistent (Theorem 4.4). I then go on to the strong consistencyof the MREWLE (Theorem 4.5). My analysis relies heavily on the work of Chow andLai (1973) as well as Stout (1968) who deal with the almost sure behavior of weightedsums of independeilt random variables. Finally, we prove the asymptotic normality ofthe MREWLE (Theorem 4.7).I organize the paper as follows. The main results on weak and strong consistency arestated in Sections 4.2 and 4.3, respectively. We state the asymptotic normality in Section4.4. [The proofs of these theorem are in Section 4.7]. Section 4.5 proposes two estimatorsof the variance of the MREWLE. In Section 4.6, I discuss possible extensions and makesome concluding remarks.4.2 Weak ConsistencyIn Chapter 3, we have defined the REWL and the MREWLE. In this Chapter, wewill treat only the parametric case so that interest focuses on a single parameter 0 forsimplicity.Let X1,. . . , X be random variables with probability density functions (PDF’s)fi, f2,. . . , f. We are interested in the PDF f(x, 0): 0 of a study variable X. 0 is anunknown parameter. To state the Theorem, we begin with the following assumptions.Assumptions 4.1 4.1.1 {F9 : 9 E } represents a family of distinct distributions45with common support and dominating measure i.Let f(x,0) denote the PDF ofF9.4.1.2 The distributions of the independent sample observations, X, i = 1,. . . , n, havethe same support as the {P9}.4.1.3 The relevance weights pj corresponding to X, i 1,. . . 
, n, and incorporated inthe vector P, = (Pnl,Pn2,...,pnn) play a central role in our theory. They satisfy theformal requirements 0 and = 1. As well, with the “true” value of 0 denotedby 00, we require that,‘ 0 as n ,‘ cc (4.1)and for any 0) 0 as n cc. (4.2)4.1.4 Q contains an open interval 0 of which the true parameter0o is an interior point.Let K = (X1,X2, ..., X). For fixed K = , the function, 0 -‘--f i-; fn(x, 0) will becalled the REWL function.Theorem 4.1 Assumptions implyP9o{f1(X,00)...fm(X,0 >f’(X1,0)...f’( ,0)} _* 1 (4.3)as n —* cc for any fixed 0 0. D46From (4.3), the value of the REWL function at 0 (regarded as depending of K) exceedsits value at any other fixed 0 with high probability when n is large. We do not know Oo,but we can determine the point, 0 called the MREWLE, at which the REWL functionfor fixed K = is maximized. Suppose the observations are from distributions withPDF’s like that of the true sampling distribution f(x, 0) (and that Conditions (4.1)and (4.2) hold). Then the last theorem suggests that the MREWLE of 0 should be closeto the true value of 0 if the REWL function of varies smoothly with 0. Hence theMREWLE should be a reasonable estimator.Remark A.(i) Assumption 4.1.2 seems quite reasonable. If the distributions did not have the samesupport, we could not construct a useful REWL function. For example, if X1 hadsupport [0,2] and f(x, 0) had support [0,1], the REWL function would be identically 0when X1 was in (1, 2].(ii) The independence assumed in 4.1.2 greatly simplifies our problem which wouldotherwise be insurmountable.(iii) Condition (4.1) underlies the construction of a useful REWL function. Recall that[the Kullback-Leibler (KL) functional] E log{f(X)/f(X2,0)} measures the discrepancybetween f and f(., Oo). That condition insures that the weighted KL discrepancy of theobservations converges to zero when the sample size grows large. When the PDF’s of theobservations are quite different from f(x, 9), we usually cannot get a good estimator ofthe true parameter. Our difficulty arises then because we do not get enough informationabout the unknown parameter from the observations.This condition is easily satisfied as when E1og{f(X)/f(X, 0)} is uniformly boullded47while 1imj Elog{f(X)/f(X,Oo)} —* 0. For then P can easily be chosen to make(4.1) hold.(iv) Condition (4.2) commonly holds as when max{p}—÷ 0 whileVar(1og(f(X)/f(X, 6))) is uniformly bounded for each 6. This is because thatj%Var[1og{f(X)/f(X,O)}] pC(O) max{p}C(O).Here the C(8) is a constant depend on 0. The first inequality follows from uniformboundedness and the second inequality because p7, = 1.(v) Conditions (4.1) and (4.2) hold respectively, when (X1,X2, ..., X) are independentand identically distributed with PDF f(., Oo) and when Var[log{f(Xi,80)/f(X1O)}]exists while maxt{pmj}—+ 0.Corollary 4.1 If is finite, Assumptions 4.1.1-4.1.3 imply that the MREWLEO: (i)exists; (ii) is unique with probability tending to 1 and (iii) is weakly consistent.Proof: The result follows immediately from the Theorem and the fact thatP(Alfl...flAk)—*1asn-—*ooifP(A)-—*1fori=1,...,k. CTheorem 4.2 Suppose: (i) = (X1,2...,X) satisfies Assumptions 4.1; and (ii)for 8 E 0 and almost all x, the function 0 --+ f(x, 0) is differentiable with derivativef’(x, 0). Then with probability tending to 1 as n —* oo, the relevance weighted likelihood(REWL) equationö/ö8{fJf(x,0)} = 0has a root, O, = . .,x), which tends to 0.48Note that the REWL equation in the last theorem may equivalently be stated asWL’(9,) = 0. 
(4.4)The following comments also relate to Theorem 4.2.1. Its proof shows incidentally that with probability tending to 1, the {9}, can bechosen to be local maxima. Therefore we may take the 9 to be the root closest to amaximum.2. But the Theorem does not establish the existence of a consistent estimator sequencesince, with the true value unknown, the data do not enable us to pick a specific consistentsequence.3. Theorem 4.2 only gives us the existence of a consistent root of the REWL equation.But only in very special cases is this root the MREWLE, in which case it is thenconsistent (see Corollary 4.2 below)4. To prove Theorem 4.2, we require that 0 -s-f f(x, 0) be differentiable, 0 0. We willgive some conditions (similar to the conditions given by Wald (1949) for the lid case)which avoid the requirement that 0 -‘--f f(x, 0) be differentiable.Corollary 4.2 If the weighted likelihood equation has a unique root 6 for each n and all. the hypotheses of Theorem 4.2 imply that {6} is a consistent sequence of estimatorsof 0. If in addition, the parameter space is any open interval (a, b) then with probabilitytending to 1, 5 maximizes the weighted likelihood, (3, is the MREWLE), and is thereforeconsistent.Proof: The first statement is obvious. To prove the second suppose the probability49of 6 being MREWLE does not tend to 1. Then for sufficiently large n, the weightedlikelihood must, with positive probability, tend to a supremum as 0 tends toward a orb. Now with probability tending to 1, is a local maximum of the weighted likelihood,which must then also possess a local minimum. This contradicts the assumed uniquenessof the root. DThe conclusion of Corollary 4.2 holds when the probability of multiple roots tends to 0as n —* co.We have already discussed the consistency of a root of the REWL equation. Now weare going to study the consistency of the MREWLE.Before formulating our assumptions, we introduce some notation. For any 0 and p, r > 0let: f(x,0,p) = sup{f(x,0’): 0’—0 p}; (x,r) = sup{f(x,8): 101 > r}; f*(x,o,p) =f(x,8,p) or 1 according as f(x,0,p) > 1 or < 1, respectively; y*(x,r) y(x,r) or 1according as (x, r) > 1 or 1, respectively.Assumptions 4.2 4.2.1 For any 0 and p, x —‘--* f(x, 0, p) is measurable.4.2.2 For any 0, 0o E 0, there exists a > 0, such that zf 0

0.4.2.6 For any 0 E 0, there exists B9, such that lB9 f(x, 0o)d(x) 0 for any 0 E 0,and for x E B8, f(x, 0’) —* f(x, 0) for any 0’ —* 0.4.2.7 In the observation vector, = (X1,2...,X), the X are independent withPDF f(x) with respect to the same dominating measure4.2.8 Let F,-, = (Pnl,pn2, ...,p,) denote the respective important weights satisfyingp 0 and p, = 1. Assume0 as n oc. (4.5)4.2.9 Let gn(x) = Assume there exists a Borel measurable function G,such that gn(x) < G(x) , fG(x)logf(x,0o)I4(x) < oc and fG(x)1og(x,r)I4(x) 0,> 0 48fPn1(Xi,00)...fPnn(X, } — . ( . )Here the probability F,., denotes that of the {X} obtained from {f1(x) : i = 1,.. . , n}. DTheorem 4.4 Suppose Assumptions 4.2 obtain. Let O(X1,. . . ,X,.,) be any function ofthe observations, X1,. . . ,X,,., such thatc> 0 all n and all X1,. .. ,X. (4.9)Then 0,., is a weakly consistent estimator of 00. 0Observe that the MREWLE always satisfies the conditions of Theorem 4.4 if we choosec = 1. So we have proved the weak consistency of MREWLE.4.3 Strong ConsistencyFor the strong consistency of MREWLE, we use almost sure results on linear combinations of independent random variables such as those of (see Chow and Lai (1973), Stout(1968)).For simplicity, define: (I) A = (ii) D, = log f(X:,00)— log f(X,0,p(8))—E(logf(X,0o)— log f(X,0,p(0))) from (4.24) and the proof of Theorem 4.3; (iii)= log f(X, 0)— log r0)— E(log f(X, 0) — log c(X1,r0)) from (4.25) and theproof of Theorem 4.3, where i = 1,. . . , n and j = 1,. . . , k. In the next theorem, D willgenerically denote all D3 and D.52Theorem 4.5 Suppose Assumptions obtain, and there exists a constant K >0, such that pj Kn for some a> 0 and that one of the following conditions hold:(i) ED(1 Dj)’ K for some > 0 and /3,(1 + a + /3)/a 2, ED2(log IDD2 K, A Kn;(ii) ED2 0,(iii) E)D(l )Ic(1og+ D)1 K for some > 0 and /3,1 <(1 + a + /3)/a <2, Kn7,A Knfor some 7 > 0;(iv) ED(l /°(log DI)1 K for some > 0 and /3,0 < (1 + a + /3)/a < 1, Kn, A Knforsome7>0andp=0fori>nwhere<7(1+a+/3)/a. Thensu ‘1X O OFl urn “ I• ‘ = 01 = 1 D (4 101n—oo fPni(X,)...fPnn ,9Theorem 4.6 Under the conditions of Theorem .5, let O(X1,.. . ,X,,j be any functionof the observations X1,. . . ,X, such that (.9) holds. Then O is a strongly consistentestimator of °oWe now state without proof a direct corollary of the last Theorem.Corollary 4.3 Under the conditions of Theorem .5, the MREWLE is strongly consistent.53The weak conditions of Theorem 4.5 are not easily verified. In contrast, the strongerconditions in the following corollaries are easy to verify (but the results are not then asgeneral as those of the Theorem).Corollary 4.4 Under the assumptions of Theorem 4., let pj 0.IfEIDI2/ K for some 0< a < 1 then (.1O) holds.Proof: Let 3 = 0. Then (i) of Theorem 4.5 is satisfied. 0In the case of independently and identically distributed observations, the assumptionsof Theorem 4.5 can be quite unrestrictive. We will not go into detail here because ourobservation follows immediately from Theorem 1 of Stout (1968).4.4 The Asymptotic Normality of the MREWLEIn the last section, we have shown that under regularity conditions, the MREWLEs areconsistent and strongly consistent. In this section, we shall show that the MREWLEsare asymptotically normal under some conditions.As we know, the asymptotic normality of MLE have been discussed by Cramer (1946)among others. 
Our results generalize Cramer’s to both non-i.i.d and unequal Pni.Assumptions 4.3 4.3.1 For each E 0, the derivativesôlogf(x,O) 92logf(x,O) ô3logf(x,8)00 ‘ 802 ‘54exist, all x.4.3.2 For each 0o e €, there exist functions g, h and H (possibly depending on O)such that for 0 in a neighborhood N(00) the relationsãf(x)g(x)jô2f(x,O) I h(x)jã3logf(x,0) < H(x)hold, all x, andJ g(x)4(x) < ,f h(x)4(x) pf(x). Assume g(x) —+ f(x,Oo) almost surely andthere exists a Borel measurable function G*(x) such that g,(x) G*(x) andf (){ ogf,00)}24()p = 1. But then Conditions (4.1) and(4.2) need to be changed. We could for example replace condition (4.1) by the followingstronger conditions:0 as n oo. (4.14)In Chapter 5, we describe plausible situations where we might want to choose negativeweights.The consistency of MREWLE’s can be extended to REWME’s defined in Chapter 6.The results of this Chapter apply in several important subdomains of estimation theoryindicated by Chapter 5. Of particular note is that of nonparametric smoothing methods.For simplicity, our treatment in this chapter is confined to the case of a one-dimensionalparameter. The multivariate extension is similar and we omit it for brevity.4.7 ProofsTo prove Theorem 4.1, we need the following two important lemmas about KulibackLeibler information (KU; see Kullback 1959).Lemma 4.1 Let f(x) i = 1,. . . , n and g(x) be n + 1 general PDF’s with the samesupport andq1 0 i = 1,...,n be such that q +q2 + ... +qn = 1. If f(x) = qifi(x)++qf(x), thenqffi(x)log{}d(x) > Jf(x)log{}d(x) (4.15)59with equality if and only if fi(x) = f2(x) = ... = f(x) almost everywhere with respectto measure t.Lemma 4.2 If the PDF’s {gn(x)} and g(x) have the same support thenlim+ fg(x)log{g(x)/g(x)}dp(x) = 0 if and only if lim g(x)/g(x) = I [] uniformly.We can now apply these results.Lemma 4.3 Let f(x) i = 1,. . . , oc and g(x) be PDF’s with common support and F, =(Pnl,Pn2, ...,pnn) denote the respective importance weights, > 0 and = 1. Letgn(x) = pf(x). IfpniJfi(X)l0g{}d(x) ‘0 (4.16)then gn(x) — g(x) almost surely.Proof: By Lemma 4.1,Epniffi(x)1og{}d(x) Jgfl(x)1og{(}d(x) 0.The second inequality follows from the positivity of the KU.The assumptions of Lemma’s 4.1 and 4.2 imply that 1im(g(x)/g(x)) = 1 [i’], uniformly. Now for each x e A = {x : lim,(g(x)/g(x)) = 1}, we have lim g(x) =g(x). Because 1t(A) = 0, the conclusion follows. 0Lemma 4.4 Let f(x) i = 1,. . . , oc and g(x) be PDF’s with common support and P =(pnl,pn2, ...,p,) denote the respective importance weights. Suppose condition (4.16)60holds and 1u{x : h(x) g(x)} > 0. Here h(x) is another PDF with the same support asg(x). Then there exists a 6> 0 and an N(6) such that for n > N(6)pniJfi(x)log{}4(x) >6. (4.17)Proof: The KU is always positive. If the inequality (4.17) is not true, then there existsa subsequence n(j) such thatn(j) f__Pn(j)i f f(x) log{ }4(x) ‘ 0 as j .‘ oo.Now let gn(x) = By the last Lemma, we get gn(j)(x) —* h(x) almost surely.Also by the same Lemma, gn(x) —* g(x) almost surely, so g(j)(x) —* g(x) almost surely.Therefore {x : h(x) g(x)} = 0. This contradicts the assumption and completes theproof. DProof of Theorem 4.1: The inequality in (4.3) is equivalent toplogf(X,8o) — p1ogf(X,9) > 0.Nowlog f(X, Oo) — log f(X, &)= Pnilo{f(9)}=(I)+(II)From (4.1) and lemma 4.2, pjf(x) —f f(x,Oo) almost surely. 
Now because f(x,8)and f(x, O) are distinct densities, then by Lemma 4.4, there exists a 6 > 0 such thatfor n large enoughZPniElo{f(0)} 6.61Now from Assumption 4.1.4,(I) — PfliE1o{f(0)} 0 in probability.Therefore (I) > 6 in probability. Similarly, (II) —* 0 in probability. This meansplogf(X,Oo) —pnj1ogf(Xj,0) 0in probability. That observation completes the proof. DProof of Theorem 4.2: Let a be small enough so that (O — a, O + a) contains in 0and let= {x: WL(00,) > WL(00 — a,) and WL(00,) > WL(0o + a,)}.By Theorem 4.1, Fe0 (Sn) —f 1. For any . E 5,, there thus exists a value O — a <0,, < 0o + a at which WL(0) has a local maximum, so that WL’(O,,) 0. Hence forany a > 0 sufliciently small, there exists a sequence 0,, = 0,,(a) of roots such thatP90(O,,— Oo 0, there exist r0(6) > 0 and N(6, r0), such that for everyn> N(6,r0) and r r0,pElogf(X,0o) >6. (4.20)Proof: From (4.15), we havepE log f(X,00)-log (Xr)= PmiE10{f()}> I gfl(x) log{ }dJL(x) - pflE1og{f)}= fn(x)lo{f )}d(x) +fg(x)logf(x,0o)4(x)_fgn(x)log(x,r)4(x)- pfliElog{f[)}= (I) + (II) — (III) — (IV)By Assumption 4.2.8, (I) —* 0 and (IV) —* 0.Now we prove (II)— E90 log f(X, 0w). From Lemma 4.3, we know that gn(x) _* f(x, 0w).The result now follows from the Assumption 4.2.9 and the Dominated ConvergenceTheorem. We can prove (III) —* E90 log(X,r). in a similar fashion. From (4.19),we can choose r0 such that E90 log r0) < —(56 + E00 log f(X, 0)). Now choose63N1, N2, N3 and N4 such that: (i)when n > N1, j(I) < 6; (ii)when n > N2, 1(11) —E00 log f(X,U0) < 6; (iii)when n > N3, (III)— E90 log y(X,r0) < 6; and (iv)whenn > N4, (IV) < 6, respectively. Let N(6,r0) = max{Ni,N2,N34} we have prove(4.20) for r = r0. Because p(X, r) decreases with r, the proof is complete. DLemma 4.9 For any 0 0 in 0, there exist 6 > 0, p(O, 6) > 0 and N(p(8, 6), 6) suchthat forn > N(p(0, 6), 6),pE1ogf(X,0o) — pE1ogf{X,0,p(0,6)} >6. (4.21)Proof: The proof of this lemma is similar to that of the last. The oniy difference isthat we use (4.18) instead of (4.19). 0Proof of Theorem 4.3: From Lemma 4.8, we know there exist r0 and N(r0) such thatpE1ogf(X,0o) —p1E1og(X,r0)>1. (4.22)Let w1 = o fl {0 : 0H r0}. Then by Lemma 4.9, for any 0 E , there exist p(O) > 0,6(0) > 0 and N(0,p(0),6(0)) such that n> N(8,p(0),6(0)) andpjE log f(X, 0) — p,E log f(X, 8, p(O, 6)) > 6(0). (4.23)Now let S(8, p) denote the sphere with center 0 and radius p. Since is compact, bythe Finite Covering Theorem, there exists a finite number of points {0,. . ., Ok} insuch that S(81,p(0))U... U S(Ok,p(Ok)) contains as a subset. Clearly, we have0< sup6Ewk< f’{Xi,0,p(8)} . . .f{X,0,p(0)} + y’(Xi,r0)..64Hence, the theorem is proved if we can show thatfPnl(Xi,Oj,p(Oj))...fPnn(Xn,8j,p(Oj))> k 1 0 1 kfPni(Xi,O0)...fPnn(X, } — e/( + )j ,‘ z —andr ) r ) c/(k+ 1)1 0.Proving these last results is equivalent to showing that for i = 1,. . . , kn{p1logf(X,Oo)— oc in probability. (4.24)andn{plogf(X1,Oo)— oc in probability. (4.25)Under our assumptions, (4.22), (4.23) and the weak law of large numbers, we can prove(4.24) and (4.25). This completes the proof of this theorem. DProof of Theorem 4.4: For any > 0, let= {(X1,X2...) : 8(X1,. . . ,X,) E S(8o,E) for sufficient large n}.From (4.9), we obtainAcC={(X1,X2...):sup19_901>.fPni (X1,0) . . . f’(X, 0)fPrti(X,0o) .. . f(x,0) } c for infznztely many n}By Theorem 4.3, we have lim F(C) = 0, then lim P(A) = 0. Therefore,lim Pn(Ae) = 1. This completes the proof. 
DProof of Theorem 4.5: If we can prove (4.24) and (4.25) with probability 1, then fromthe proof of Theorem 4.3, we obtain the asserted result. But Theorem 4 of Stout (1968)and our conditions on P imply this result and hence the conclusion of our theorem. D65The proof of Theorem 4.6 is similar to Theorem 4.4 and hence we omit this for brevity.Proof of Theorem 4.7: By Assumption (4.3.1) and (4.3.2), we have for 0 in theneighborhood N(00) a Taylor expansion of ôlog f(x, 0)/ö0 about the point 0 = 0 asfollows:ölogf(x,8) 8logf(x,0) O21ogf(x,0) 1 280 — 80 — °‘ 8O 6=6 +8 — 0) H(xwhere < 1. Therefore, puttingãlog f(X, 0)A = 9082 log f(X, 0)B = pnj 882and= pH(X).We have—A = B(Ô — 0) + — Oo)2where < 1.From Theorem 4.2, we know that there exist a sequence of 8,-, such that 8, —* soo o —n°Bn+*Cn(n_0o)Now we prove:(i) A .-.‘ AN(pbj, (pjI(Oo)).As we knows—.. Ologf(X:,0)A = 80 Ozz666so EA = — pj b andVar(A) = log f(X, 0)6=80)2 — pb00=800By Assumption 4.3.5, 4.3.7 and Dominated Convergence TheoremVar(A) — (p)I(0o) + o(p).Now let0logf(X,Oo)= O0i=1E0logf(X,0o)= E I 000i=1max(p)J Ologf(x,0) IBy Dominated Convergence Theoremf Ologf(x,0o) c Ologf(x,8o) f(x,0o)4(x),000 (x)Jso 1’ —* 0 (max(p) —* 0). Then by Theorem 7.1.2 of Chung (1974, p200) and Assumption 4.3.4, we get(ii) B —* —I(0) in probability.02 log f(X, 0)B= ooO2logf(XOo)— O2logf(x,0o)oog oog—*—I(0)by Dominated Convergence Theorem.So from Assumption 4.3.8, B,-, —* I(0) in probability.67(iii) C,. —*E90H(X) in probability. The proof is similar to (ii).Therefore O,—= —A,./[B,. + *C,.(O,. — 9)]. Since Or, — Oü —÷ 0 in probability.B,. +—O) —* —I(O)Further, —A,. - AN(— (ZpjI(Oo)). Consequently, by Slutsky’s Theorem,(p){êr,--____‘ N(0,establishing the theorem.68Chapter 5The MREWLE for GeneralizedSmoothing Models5.1 IntroductionAs we have mentioned in the Chapter 2 and 3, the nonparametric regression (NR)paradigm motivates much of our work on relevance weighted inference. Methods developed in NR use relevant information in relevant samples (see Chapter 2). Nonparametricregression provides a useful explanatory and diagnostic tool for this propose. See Eubank(1988), Hardle (1990), and Muller (1988) for good introductions to the general subjectarea. Several methods have been proposed for estimating m(x): kernel, spline, andorthogonal series. Recently, local likelihood was introduced by Tibshirani and Hastie(1987) as a method of smoothing by ‘running lines’ in nongaussian regression models.Staniswalis (1989) carried out a similar generalization of the kernel estimator.In this Chapter, we apply the method of MREWLE to capture the relevant sampleinformation for generalized smoothing models. Both local likelihood and Staniswalis’smethod can be viewed as special cases of our methods. We wish to demonstrate the69applicability of our methodology, whose primary advantage over individual methodswhich have been developed in NR, lies in its generality. However, we are able to establishsecondary advantages as well. The MREWLE always has a smaller variance whichdepends on Fisher information.Usually NR is used for the location parameter and it also assumes the mean and varianceof the observations exist. But in many cases, we are interested in some other parameters, for example, the variance of a normal distribution or parameters of the Weibulldistribution in Section 5.5. In some cases, even when we are interested in the locationparameter, the mean and variance of the observations may not exist. The Cauchy distribution in Section 5.5 is an example. 
The method of MREWLE works well in thesesituations.Under the model of Example 3.6, we know that given X, Y has density f(y, 0(X1))(assumed known up to the unknown parameter 8(X1)). We seek to estimate 8(x) atthe fixed point X = x. After choosing the relevance weights, we get the MREWLE bymaximizing over 8(x),fJ f(, 0(x))(x).The next straightforward result shows how locally weighted regression estimators obtain.Theorem 5.1 . If f(y,0(X1)) is the density for a normal distribution with mean p(X1)and variance cr2(X1) [here 8(x) = {t(x),u2(x)}], then the MREWLE is(x) = pni(x)) (5.1)and= Pni(x){ - (x)}2.D (5.2)70Thus the MREWLE in the normal case of the last theorem, is the linear smoother of Fan(1992). Obviously we can get Nadaraya-Watson kernel estimators, k-nearest neighborestimates, Gasser-Muller estimators and locally linear regression smoothers by choosingappropriate relevance weights. These smoothers include those generated by the use ofspline and orthogonal series methods. This means that the MREWLE method subsumescurrent nonparametric smoothing methods when the error distribution is normal. Thevariance estimator in (5.2) is the same as the variance estimate in Hardle (1990). Hereit is a direct result of the MREWLE.We now study the MREWLE in relation to current nonparametric smoothing methodsin situations where the error distribution family is known. Our discussion addressesboth asymptotic and non-asymptotic issues.We organize this chapter as follows. Small sample properties are considered in Section5.2. The main asymptotic results are shown in Section 5.3. Also in Section 5.3, wecompare the MREWLE with current NR methods. How to choose the relevant weightsis considered in Section 5.4. Finally in Section 5.5, we give some simulation results.5.2 Small samples.Our earlier discussion leads us to wonder about the difference between the MREWLEand other linear smoothers for nonnormal error models. The following theorem partiallyaddresses this issue. There we refer to conventional sufficiency with respect to the jointsampling distribution of all the data.Theorem 5.2 . Suppose the sufficient statistics for the error distribution family arenot linear in the data. If X = X, for some i j, then with respect to quadratic loss,71the linear smoother is inadmissible.Proof. We easily obtain the conclusion using Rao-Blackwell Theorem and sufficiencyat the replication points. 0The last theorem shows that if we have replicate observations in a designed experiment,we can achieve a uniformly smaller risk than that of any linear smoother (which dependslinearly on the data when sufficiency shows it should not). For instance, the one parameter exponential family has sufficient statistics based on { T(1’)} for some functionT) where the sums are taken at the replication points. Obviously basing any smootheron the {1’} would violate the sufficiency principle in this case.It could be argued that this claim is unfair. Linear smoothers are proposed in anonparametric-nonparametric framework where neither the error distribution nor theregression function has parametric form.That argument ignores Theorem 5.1 which shows that these linear smoothers are consistent with normal error models. Indeed, in Chapter 6 we obtain smoothers in thenonparametric- nonparametric setting which are very different than linear smoothers.So the argument fails to blunt the impact of the last theorem. Rather that theorempoints to the nonrobustness of linear smoothing methods. 
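To make the contrast concrete, the following sketch implements the normal-case MREWLE smoother of Theorem 5.1, which reduces to the weighted mean (5.1) and weighted variance (5.2), alongside a Cauchy location family for which the weighted likelihood must be maximized numerically. The kernel, bandwidth, toy data and function names are our illustrative assumptions, not part of the theory above.

import numpy as np
from scipy.optimize import minimize_scalar

def weights(x0, X, h):
    # Gaussian-kernel relevance weights p_i(x0), summing to one
    k = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return k / k.sum()

def mrewle_normal(x0, X, Y, h):
    # Closed form of Theorem 5.1: weighted mean (5.1) and weighted variance (5.2)
    p = weights(x0, X, h)
    mu = np.sum(p * Y)
    sigma2 = np.sum(p * (Y - mu) ** 2)
    return mu, sigma2

def mrewle_cauchy(x0, X, Y, h):
    # No closed form: maximize the weighted Cauchy log-likelihood over theta(x0),
    # i.e. minimize sum_i p_i log(1 + (Y_i - theta)^2)
    p = weights(x0, X, h)
    nll = lambda t: np.sum(p * np.log(1.0 + (Y - t) ** 2))
    return minimize_scalar(nll, bounds=(Y.min(), Y.max()), method="bounded").x

# toy data from Y = X(1 - X) + error
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = X * (1 - X) + 0.1 * rng.standard_normal(200)
print(mrewle_normal(0.5, X, Y, h=0.1))
print(mrewle_cauchy(0.5, X, Y, h=0.1))

The normal case depends on the data only through local weighted first and second moments, whereas the Cauchy case does not, which is one way of seeing why a single linear smoother cannot be efficient for every error family.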
Theorem 5.2 emphasizes theimportance of the vehicle which carries the data into a smoothing procedure. And ittells us how to improve on a linear smoother if we have repeated observations at somepoints.We know by weak sufficiency that the MREWLE must depend only on the sufficientstatistics. So it evades the difficulty confronted above by linear smoothers.72When the {X}’s are continuous covariates, we cannot (in principle) have repeated observations at any point, so cannot improve on linear smoothers by invoking the lastTheorem. But we may nevertheless have near ties among the {X}’s in which case theheuristics underlying that theorem still obtain. Large sample theory below will lead tofurther discussion of this issue.In Chapter 2 and 3, we emphasized the NR paradigm because it provided a contextwherein some information from the relevant samples have been used to advantage. Thelast theorem suggests that these methods fail to use all the relevant informatioll availablewhen the error distribution cannot be assumed to be normal. In this way, these linearsmoothers seem analogous to moment estimators in classical estimation theory; theMREWLE would then be analogous to the MLE.5.3 Asymptotic PropertiesWe begin with the generalized smoothing models described in Example 3.6. Let(X1,Y1), ..., (X, Y) be a random sample from a population (X, Y). For given X =Y has density function f{y, O(x)}.Because we have used relevant sample information for estimation, the MREWLE isusually biased. We define the bias function asB(z) = E9()l3 log f{Y O(x)}/O(x). (5.3)This bias function indicates the bias when we used the information from YX = z toestimate 8(x). Under some conditions, we can get B(x) = 0 andB(z) = I{O(x)}{8(z) — 8(x)} + o{O(z) — O(x)}, (5.4)73where I{0(x)} is the Fisher information function for 0(x). Equation (5.4) indicatesthe meaning of this bias function. Also this bias function is a special case of the biasfunction in Chapter 4.In the next four subsections we present classes of possible relevance weights suggestedby results in modern NR theory. We study one of these classes, that suggested byNadaraya-Watson, in some detail. For the rest, some expected results will merely besketched for brevity.(I) Kernel weights (Nadaraya-Watson).The weight sequence for kernel smoothers (for a one dimensional x) is defined bypni(X) = Kh(x— X)/{nh(x)}, (5.5)wheregh(x) = n1 Kh(x — X) and Kh(u) =h1K(u/h). (5.6)The kernel K is a continuous, bounded and symmetric real function which integrates toone,f K(u)du = 1. (5.7)Because the form (5.5) of kernel weights pj(x) has been proposed by Nadaraya (1964)and Watson (1964), we call these Nadaraya-Watson weights.The MREWLE with Nadaraya-Watson weights obtains from maximizingn_i— X) log f{1’;, 0(x)}.We now consider the asymptotic properties of this MREWLE. In the sequel, we alwaysletCK= i_:u2K(u)du, dK = L K2(u)du. (5.8)74We need the following assumptions:Assumptions 5.1 5.1.1 The bias function B(z) has a bounded and continuous second derivative for every fixed x.5.1.2 Let B(x) = Es(2) log{f(Y,9(z))/f(Y0(x))}. 
B(x) has a bounded and continuous first derivative for every fixed x.5.1.3 The marginal density g(x) of the covariate X has continuous first derivative andis bounded away from zero in an interval (a0,b0).5.1.4 j’uK(u)du = 0 and fu4K( )du < cc.5.1.5 The density function f(y, 0)) satisfies the following regularity conditions:(i) log f(y, 0) has three continuous partial derivatives with respect to 9;(ii) for each Oo E e, there exists integrable functions H(y) such that for 9 in a neighborhood N(90) the relationsI öf(,0) H(y), ãf(mO) iH2(y),Iö3logf(y,0) H3(y)hold, for all y, andf Hi(y)dy < oc,fH2(y)dy < co,E9(H3Y)) 1be a triangular array of row-independent random variables with associated array ofdistribution functions, F ( [F1, ..., F,]; n > 1 and nonnegative constantsdefPn — [Pni, .., Pnnl, 194satisfying 1. Define:• the relevance weighted empirical distribution function (REWED) byF(x) =• the relevance weighted average distribution function (REWADF) for —cc < x p}for a sample {X1,...,To illustrate the use of these REW quantile estimators, we offer the following example.Example 6.2 Nonparametric Regression. Let= f(x) + e x E [a, b] i = i, 2, ... , n;here e, e2, ... , e are iid, symmetric, E(e) = 0 for all i, and f(x) is a smooth function.To estimate f(x) we may use the median of the RE WED. This kind of estimator isusually robust and it often quite efficient.956.3 Strong Consistency of REW Quantile EstimatorsIn this section, we describe strong and uniform consistency properties of the REWED.Then we describe the strong consistency of the quantile estimators derived from theREWED. In the following discussion, we assume that F is the CDF of interest and,its pth quantile.Theorem 6.1 (Strong Consistency of F(x)). a) Suppose exp(—e2K) < forall e> 0, where I=(1p)—. Then F(x)— F(x)I —*0 a.s. for all x.b) Further, if IF(x) — —* 0 for all x, then IF(x) — F(x)I —* 0 a.s. for all x.Corollary 1. If log(n)/K = o(1), thenF(x)— F(x)I —* 0 a.s. for every x. DThe hypothesis of the theorem is easily satisfied. If, for example, maxj{pni} =then the hypothesis is satisfied. The assumption F(x) — —* 0 for all x is essential; without this, we cannot get a consistent estimator of the CDF F(x). Qualitativelythis condition is the one which gives operational meaning to the notion of “relevanceweights”.Theorem 6.2 (Uniformly Strong Consistency of F(x)) a) Under the hypothesis ofTheorem 6.1 a), and the further assumptions that (i) sup,f(x) is bounded and, (ii)limsupM,supfl{(1 — (M)),(—M)} —*0, where f(x) is the derivative of F(x),thensup F(x) — —* 0, a.s..96b) Further, if sup F(x) — —* 0, thensup F(r) — F(x) — 0, a.s..When the distributions underlying our investigation derive from the same family, theconditions of the last theorem are usually satisfied.Theorem 6.3 (Strong Consistency of ) Under the conditions of Theorem 6.2 a),suppose x = solves uniquely the inequalities F’(x—)

e) 2exp{—2(n)K}for every e > 0 and n, where 5’(n) = min{P’(() + e)—p,p——The last theorem shows P() p(n)1 > e) converges toO exponentially fast. The valueof e (> 0) may depend upon K if desired. These bounds hold for each n = 1, 2,and so may be applied for any fixed n as well as for asymptotic analysis.976.4 Asymptotic Representation Theory.For the case of iid data and pj = 1/n, i = 1,..., n, Bahadur (1966) expresses samplequantiles asymptotically as sums of independent random variables by representing themas a linear transform of the sample distribution function evaluated at the relevant quantile. From these representations, a number of important properties ensue. (see Bahadur1966 and Serifing 1980 for details). We now generalize this asymptotic representationto the cases of non-iid observations and general p,1.Theorem 6.5 Let 0

0, such that inf J(()) > c;3. there exists c > 0 , such that < cc;4. F has a uniformly bounded first derivative in the neighbourhood of();5. m,, =o{K3/4(logK)’/}.Thenp—F(())np = p(n) + + R.JnIcp(n))where= O{K,314(log )}, n —* cc, with probability 1.The Bahadur representation is a special case of this theorem suggesting the result ofour theorem may be fairly accurate, that is hard to improve upon.98The distribution of REW sample quantile is usually hard to find, but the REW sampledistribution relatively easy. By this theorem, we can use the REW sample distributionfunction evaluated at the relevant quantile to study the REW sample quantile asymptotically. A simple example is that we can use this representation to prove quite easilythe asymptotic normality of the REW sample quantile.6.5 Asymptotic Normality ofExcept for the case of iid random variables, we cannot always find the exact distributionof.The asymptotic distribution of given in the following theorem may thereforebe useful.Theorem 6.6 Let 0 < p < 1 and V = — Assume P’is differentiable at p(n), inf > c > 0 and maxi0.We easily deduce that c = [o2]/[Z1u2] minimizes the mean squared error. ThenAN{, ( 1/u’}.Now let us try using the median to estimate . Let be the distribution of X and=pF,,j. The median of F is t and we use the sample median med to estimatei’• By the results of Section 6.5, we get-‘n 2A7.Tf_________________cmed ‘‘ i-iJv1fL,,1i—n .c .1i=iPnjnPiihere fT(Iu) =We want to minimize the variance of the asymptotic distribution subject to p = 1.We easily obtain = /oThe asymptotic relative efficiency of these two estimators isARE(I%,Imed) =Remarks 1 1. For the iid normal case, the ARE of the sample mean estimatorrelative to the sample median estimator is . Here we have proved that whenthe samples are from normal distributions with the same mean, but differentvariances, the ARE of the weighted sample mean estimator relative to theweighted sample median estimator yields the same value .. The weights used in the sample mean are different from the weights used inthe sample median. We only compare the two best estimators here. If we usethe same weights, the ARE can be larger or smaller than .1003. The weighted sample median should be more robust than the weighted samplemean.Example 6.4 Consider the double exponential family. Assume the density of X to be1/2r exp(—x — /r); the r are known while j is unknown i — 1,. . . , n. We again usethe weighted sample mean and weighted sample median of Example 6.3 to estimate t.Analysis of Example 6.4. Choose c = (r2)/(1r2) to minimize the meansquared error. ThenAN{, 2/( 1/r)}.By choosing = (1/r)/( 1/ri), we get the weighted sample median med Fromthe results of Section 6.5,med AN(,The asymptotic relative efficiency of these two estimators isARE(I2,*med) = 2.Remarks 2 1. The ARE of the best sample mean estimator relative to the bestsample median estimator does not depend on the {r} ‘s. The median is amore efficient estimator.2. As in the normal case, the weights used in the sample mean do not equal theweights used in the sample median.3. The weighted sample median should be more robust.1016.7 Simulation StudyWe have shown via asymptotics that the REW quantile estimator possesses a numberof desirable asymptotic properties. In this section we use two simulated examples toobtain insight into its performance with a finite sample.The model in Example 6.2 is used in this simulation where we compare the REW quantileestimator with the Nadaraya-Watson estimate. 
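Before describing the studies, we record a minimal sketch of the two estimators being compared. It is our own illustration (not the code used for the simulations) and assumes Gaussian kernel relevance weights; the REW quantile is the pth quantile of the REWED, and the Nadaraya-Watson estimate is the corresponding weighted mean.

import numpy as np

def kernel_weights(x0, X, h):
    # Gaussian-kernel relevance weights summing to one
    k = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return k / k.sum()

def rew_quantile(x0, X, Y, h, p=0.5):
    # REW quantile: inf{y : F_n(y) >= p}, where the REWED F_n puts mass p_i at Y_i
    w = kernel_weights(x0, X, h)
    order = np.argsort(Y)
    cumw = np.cumsum(w[order])
    return Y[order][np.searchsorted(cumw, p)]

def nadaraya_watson(x0, X, Y, h):
    # weighted mean, for comparison
    return np.sum(kernel_weights(x0, X, h) * Y)

# toy version of the model of Example 6.2
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)
Y = X * (1 - X) + rng.normal(0, 0.5, 1000)
grid = np.linspace(0.05, 0.95, 19)
median_fit = [rew_quantile(x, X, Y, h=0.1, p=0.5) for x in grid]
mean_fit = [nadaraya_watson(x, X, Y, h=0.1) for x in grid]
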
The Gaussian kernel function is used to generate the relevance weights.

Simulation Study 1. A random sample of size n is simulated from the model

Y = X(1 - X) + e,

with e ~ N(0, 0.5) independent of X ~ U(0, 1). A typical realization when n = 1000 is shown in Figure 6.1. The bandwidth used here and in all that follows is h = 0.1 (chosen by eye). Let us next add 50 outliers from N(2, 0.5) to the simulation experiment just described. The result is shown in Figure 6.2. We have not used boundary corrections.

Simulation Study 2. In the model of Simulation 1, instead of using the normal error, we generate the errors from a double exponential distribution with r = 0.1. Figure 6.3 shows the results of a curve fit based on 100 simulated observations. The simulation results with 10 outliers from N(-0.5, 0.25) are shown in Figure 6.4.

For the data of Simulation 1 without outliers, the 0.25 quantile curve estimate obtained using the REW quantile estimator is shown in Figure 6.5.

Figure 6.1: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator. The model is Y = X(1 - X) + e, where X is uniform (0, 1) and e is N(0, 0.5). The sample size is n = 1000 and the bandwidth is h = 0.1. The true curve is a, the REW quantile estimator is b, and the Nadaraya-Watson estimate is c.

Figure 6.2: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator with outliers. To the data depicted in Figure 6.1, we add 50 outliers from N(2, 0.5). The true curve is a, the REW quantile estimator is b, and the Nadaraya-Watson estimate is c.

Figure 6.3: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator. The model is Y = X(1 - X) + e, where X is uniform (0, 1) and e follows a double exponential distribution with r = 0.1. The sample size is n = 100 and the bandwidth is h = 0.1. The true curve is a, the REW quantile estimator is b, and the Nadaraya-Watson estimate is c.

Figure 6.4: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator with outliers. To the data depicted in Figure 6.3, we add 10 outliers from N(-0.5, 0.25). The true curve is a, the REW quantile estimator is b, and the Nadaraya-Watson estimate is c.

Figure 6.5: A REW quantile estimator of a quantile curve. The 0.25 quantile curve is estimated for the data depicted in Figure 6.1. The true quantile curve is a, and the REW quantile estimator is b.

The results of the simulation can be summarized as follows:

1. In the model of Example 6.2, when the error has the double exponential distribution, the REW quantile estimator performs a little better than the Nadaraya-Watson estimate (see Figure 6.3). Even when the error is normal, the REW quantile estimator performs about as well as the Nadaraya-Watson estimate (see Figure 6.1).

2. When the data contain a small fraction of outliers, say about 5 or 10 percent, the REW quantile estimator is robust (see Figures 6.2 and 6.4). By contrast, the Nadaraya-Watson estimator fails. This observation suggests using the REW quantile estimator and the Nadaraya-Watson estimate together to diagnose the model and determine whether there are outliers in the data set. If the REW quantile estimator and the Nadaraya-Watson estimate disagree, then we should reconsider the model and the outliers.

3. The REW quantile estimator seems promising judging from these simulation studies.

4.
Computing the REW quantile curve estimator took about one minute inSimulation Study 1 using Splus in a Sun workstation.6.8 DiscussionWe have presented a general method for estimating a population quantile based on independent observations drawn from other related but not identical populations. We haveshown the estimator to be strongly consistent and asymptotically normal under mild assumptions. Our method derives from a generalization of the empirical distribution (theREWED), and we have shown that the latter is also strongly consistent under certainconditions.108The context of our method includes that of nonparametric regression and smoothing.Thus our estimator may be viewed as a generalized smoothing quantile estimator. Inthe special case of Example 6.2, we obtain a nonparametric-nonparametric quantileestimator in as much as nothing is assumed about the form of the population distributions involved. In particular, as the Examples of Section 6.6 show, heteroscedasticity isallowed in the smoothing context.Our theory depends on the relevance weights, {p} used to construct the REWED.These weights express the statistician’s perceived relationships among the populationsand would usually be chosen on intuitive grounds. Making F approximate F well isa primary objective in this choice. Additional restrictions on the {p} stem from thelarge sample theory developed in this paper. Theorems 6.1, 6.2, and 6.3 on consistency,for example, require that exp(—€K) < oc for all > 0 where iç= (This imposes a requirement that the {p,,} —* 0 fairly rapidly as n —* oc, say fasterthan 1/log(n). And for asymptotic normality, we see in Theorem 6.6 the requirementthat maxi A) exp(—A2/8)(l + 2)for all A > 0. DPro of of Theorem 6.1.F(x)-= x) - F(x)fisay. So on applying Lemma 6.1 with A =P(F(x)- > )>(1 + 2/21reK/2)exp(_e2Kn/8)for every e> 0. The assumption, exp(—c2K) < oo, implies that K,., —* whenn —+ cc. It follows that for every C> 0, there exists N, such that for every n > N(1 + 2V2KeK2) <(eKfl)110Consequently(1 + 2V2eI2)exp(_fl) exp(—).But > exp(—€2K) < oo for all > 0. So exp(—-) < 00 for every > 0.HenceP(F(x)— > ) <00 for all >0.The Borel-Cantelli Lemma then impliesF(x)— P’(x)I —+ 0 a.s. for every x. 0Proof of Theorem 6.2. Let M be a large positive integer and= max IF(i/M) ——M2 0. By the uniqueness condition and the definition ofPn(p(n)— €)

n pm— p(m)I > e) —* 0 as n —* cc. This completes the proof. DTo prove the Theorem 6.4, we need the following useful result of Hoeffding (1963).Lemma 6.2 Let Y1, ..., Y, be independent random variables satisfying P(a < Y 0,— E()] > t) 0. Then—> e} + e} + —But with Y = pI(X > p(n) + e),p(n) + }= P{p > F(() +€)}= > p(n) + e) > 1—p}=— E()) > 1—p—pj(1— + e))}=F{(1 - E()) > n(p(n) + e) — p}.Because P(0 p,j) = 1 for each i, by Lemma 6.2, we havep(n) + e) exp(—2S/p) = exp(—26K);112here 6 P’(() + e) — p. Similarly,p(n) — e) exp(—26/p) = exp(—26K)where 62= p — —Putting 6E(n) = min{Si, 62}, completes the proof. DTo prove Theorem 6.5, we need the following results (see Shorack and Weliner, 1986,page 855)Lemma 6.3 (Bernstein) Let Y1,Y2, ..., Y,., be independent random variables satisfyingPY — E(Y) 0,- E()) 2exP[_ VarO) +for all n = 1, 2Lemma 6.4 Let 0

0, p(n) solves uniquely P(x—)p < (x) and p=Put= (+1)K12(log113We then have+ Cfl)—p = P(P(fl) + n)— n(p(n))!n(p(n))en + o(e)/7(iog Kn)”2/K?’/for all sufficiently large n.Likewise we may show that p ——e,) satisfies a similar inequality. Thus, withas defined in Theorem 6.4, we have2KflSE(n) clogKfor all n sufficiently large. Hence by Theorem 6.4,2— p(n) > en)for all sufficiently large n.This last result, hypothesis 3 of this theorem and the Borel-Cantelli Lemma imply thatwpl np — p(n) > e holds for only finitely many n. This completes the proof. DLemma 6.5 Let 0

0 and q 1/2. Let m = maxi 0. NowP(A > 7n) P(Dn(ir,n)I 7n)r=—bAndIDn(r,n)I = I E + r,n)) — E(I(X E + r,n))))Iby definition. With Y = pI(X E Bernstein’s Lemma (see Lemma6.3) impliesP(IDn(77r,n)I y,) 2exp(—-y,/D)where D = 2Z1Var()1) + 2/3in-y.Choose c2 > sup Then there exists an integer N such that+ a) — < c2a116and— — a) < c2aboth of the above inequalities being for all n > N and i = 1,. . . , n. ThenVar() c + 1. It then followsthat there exists N* such thatP(Dn(??r,n) y) 2K1)for all r < b and n > N*. Consequently, for n > N*P(A > 8bK*+).In turn this impliesP(A‘) 8Kv.Hence P(A,. > ) p).ThusG(t) = P(pnjI(Xnj a,) p)= P(V112 pnj(I(Xnj a) — E(I(X a,))) V-’/2— 0 and r > 0:1181. 1 P(ri ) —* 0. (since i7j 2maxi ô-xx(XTX)’.Wu’s (1986) particular implementation of the general bootstrap described above canbe obtained by choosing ô = (r(1 — h)’/2)for all i. To explicate these varianceestimators, define y = x”/3 + ôjt, i 1, ..., n; the t” {t} obtain fromresampling (denoted by *) withEt*= 0, and Cov*(t*) = I. (8.2)Wu (1986) shows that (8.2) entails E$* = and= (XTX)_1 r(1— h)’xx’(XTX) . The latter is consistent in general.Remarks1 When o = u for all i, vb is unbiased and consistent. Indeed Efron’s method(Efron 1979) may well be best if e1, ..., e,7. are iid. If not and interest focuseson the distribution of /3 (and not just its covariance), the bootstrap encountersserious difficulties. A very simple example demonstrates this.Example 8.1 Consider the one dimensional regression model:= x/3 + ej. i = 1, ..., n137with Ee = 0, Var(e) = a2, and Ee = tt3. Interest focuses on 13’s thirdmoment as when the Edgeworth expansion and related results for the bootstrapmethod are required. The LSE is 3= (x1 > xY and=But the bootstrap estimate of the third moment is—= (x) xn_1 r. (8.3)In general, (8.3) is biased and inconsistent. This is not surprising since the {r}are not exchangeable (the {j} being unequal).2 The properties of Wu’s bootstrap depend on the distribution of t. Specifying thatdistribution becomes a new problem. (Wu’s bootstrap uses information not in theoriginal samples. This may make this method nonrobust in more general cases).3 For heteroscedastic errors, Wu’s method encounters the problem identified in (Remark 2.1). We can get a consistent third moment estimate by choosing the distribution of t to satisfy E(t*3) = 1. But we may well be interested in otherquantities as well. Manipulating the distribution of t may not simultaneouslyprovide satisfactory estimators for all quantities of interest.4 As noted by Wu (1986), bootstrap methods are hard to generalize to nonlinear situations; the heteroscedastic bootstrap is based on resampling the residuals whichis not feasible in the nonlinear case.1388.2.2 Vector resamplingFreedman (1981) suggests a way of dealing with the correlation model used for a nonhomogeneous errors model. His method draws a simple random sample with replacement(* sampling) {(y’, 4T)}? from {(y, xfl}?. The method computes the bootstrap LSEfrom {(y, 4T)}?,= (E64T)Z4y, (8.4)and the bootstrap covariance estimator= E($*_$)($*_$)T. (8.5)This approach suffers from several drawbacks. Firstly injudicious use of this methodcan lead to inconsistent covariance estimators (Wu 1986 gives an example). Secondly,the method fails to incorporate the knowledge that Var(e) changes smoothly with xwhen such knowledge obtains. 
Thirdly, on the general grounds of requiring inference tobe conditional on the design D = (x1, , ..., xj, one should not risk having simulateddata sets whose designs D* = (4, , ..., x) are very different from D. Fourthly, if n ork is large, computational costs may be quite high. Fifthly, small sample sizes can entailsingular design matrices, D*.8.3 A new bootstrapLet Yi, ..., Y7. be a sequence of independent random variables with observed values,y,,. Suppose for specified functions {gj}, 0 satisfies Es{g()1, 0)} = 0 for all139i = 1, ..., n and 0. Here and in the sequel E9 denotes the conditional expectation ofthe {} given 0; var represents the corresponding conditional variance. For simplicityof presentation, we assume 0 is a scalar but our results easily extend to vector-valuedparameters. We let 0 be a solution of the estimating equationg(y,0) = 0,and consider the distribution of 0 — 0.A standard two step method for approximating that distribution when the {1’} areindependent and identically distributed can be described as follows:1. use the Taylor expansion to get the approximation{ a(•,o) 0)2. invoke the Central Limit Theorem to get— 0) N[0, flVaro{g(Y,0)}1The standard bootstrap method exploits this approximation when the {11} are independent and identically distributed. However, that condition often fails. Then drawing abootstrap sample directly from {u, ..., y} proves unproductive, leading us to our alternative bootstrap method. First replace 0 by 0 in gj(yj, 0). Then define z = gj(yj, ).Finally:1. draw the bootstrap sample {z, ..., z,} from {zi, ..., z} as a simple random samplewith replacement;2. compute the bootstrap estimator ase*1403. use the bootstrap, i.e empirical distribution of the (O*— O)’s obtained after manyrepetitions of 1-2 to approximate the distribution of 0 — 0.Since our method approximates the distribution of 0— 0, we get approximations tothe distributions needed for inferences based on that distribution. The quality of thoseapproximations depend of course, on the quality of our underlying approximation tothe distribution. That notwithstanding, our method offers great flexibility. And in amanner of Liu (1988), we can prove that our bootstrap yields a better approximationthan its competitors under certain conditions. However, our method cannot be used toestimate the bias.The following example shows how our method differs from its conventional relative.Observations are made of n independent and identically distributed random variables,each with an unspecified probability distribution function, F. Inferential interest focuseson the mean, [t, of F. The usual bootstrap would: (i) draw the “bootstrap sample”{y, ..., y} from {yi, ..., y} as a simple random sample with replacement; (ii) calculatethe bootstrap sample mean = y’; and (iii) repeat (i)-(ii) sequentially to obtaina sequence of y’ values. The empirical distribution of the (* — )‘s is the bootstrapapproximation to the sampling distribution of—Our approach begins with the sampling distribution modeli=1,...,n.Given a sample {yj} generated by this distribution, we readily find that the least squaresestimate of satisfies the “estimating equation“, Z(y— ) = 0; for simplicity here andin the sequel we (usually) suppress the upper and lower summation limits, i = 1 andi = n. This estimating equation relies on the component functions yj — t, i = 1, ...,141which may be estimated by z — , i = 1,...., n. 
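For concreteness, the component-resampling steps spelled out just below can be sketched in Python as follows (a minimal sketch, assuming numpy; the gamma-distributed data, sample size and seed are illustrative assumptions only, not part of the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n independent draws from an unspecified distribution F.
y = rng.gamma(shape=2.0, scale=1.5, size=50)

mu_hat = y.mean()          # solves the estimating equation sum_i (y_i - mu) = 0
z = y - mu_hat             # estimated components z_i = y_i - mu_hat

B = 2000
boot_devs = np.empty(B)
for b in range(B):
    # resample the components of the estimating function, not the data themselves
    z_star = rng.choice(z, size=z.size, replace=True)
    mu_star = mu_hat + z_star.mean()       # bootstrap estimator mu* = mu_hat + n^{-1} sum z*_i
    boot_devs[b] = mu_star - mu_hat        # approximates the law of mu_hat - mu

# The empirical distribution of boot_devs approximates that of mu_hat - mu; for example,
# a basic bootstrap interval for mu:
ci = (mu_hat - np.quantile(boot_devs, 0.975), mu_hat - np.quantile(boot_devs, 0.025))
```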
Our method: (i) draws a bootstrapsample, {4, ..., 4} from {zi, ..., z,} as a simple random sample with replacement; (ii)calculates the bootstrap estimator, /i’ 2 + n1 4; and (iii) repeats (i)-(ii) abovesequentially to obtain a sequence of (/*— )‘s and the bootstrap approximation to thesampling distribution of fi — u.From this general discussion of our method we turn in the following sections to linearregression and describe properties of our bootstrap estimator of the regression coefficientvector.8.4 Asymptotics.8.4.1 Preamble.For the linear model (8.1), Efron (1979), Freedman (1981) and Wu (1986) have suggestedbootstrap methods described in Section 8.2, based on either the residuals of the leastsquares fit or the complete observation vectors. Our method differs from theirs.We start with the normal equations for the ordinary least squares estimate,Xi(yi-x/3) = 0 (8.6)and their solution . Letzj=xj(yj_x$), i=1,...,n.The bootstrap estimator defined in Section 8.3 becomes= + (XTX)’ 4, (8.7){ z, ..., z,} being a bootstrap sample from {z1, ..., z}.1428.4.2 Consistency of 3*•For brevity, we now state our main results. All proofs except that of Theorem 8.1, andunderlying assumptions appear in Section 8.8.Theorem 8.1 Suppose Assumptions 8.1 and 8.2 hold for model (8.1). ThenE(v) = cov(){1 + O(n_1)},where v. = — — )T and E represents expectation with respect to thedistribution induced by bootstrap sampling.Proof. This is an immediate consequence of Lemma 8.1 in Section 8.8. 0Because we use an estimate of /3 in resampling, we lose degrees of freedom. This suggestsrenormalising the terms on the left side of equation (8.6). Two alternatives suggest themselves, z’ = (1_n_1k)h/2xj(yj_x$) and 2) = (1_hj)_h/2xj(yj_x$), i =where the {h1} represent the diagonal elements of the hat matrix. The bootstrap estimators corresponding to these two renormalisations are *(1) and $*(2) and the covarianceestimators are V*(l) and V*(2). The asymptotic properties of /3’, /3*(1) and $*(2) are equivalent.The proof of the next theorem, the counterpart of Theorem 8.1 for the renormalized“bootstrappands”, follows immediately from Lemma 1 and so it is omitted.Theorem 8.2 For model (8.1), Ev(2) = Cov(/3) if the assumptions in (8.2.q) hold. IfAssumptions 8.1 and 8.3 hold as well, E(v()) = Cov(/3)(1 + O(n1)), for i = 1,2.143An asymptotic analysis must explore the third moments needed for Edgeworth expansions. The use of such expansions in bootstrap regression models seems to be havebeen proposed by Navidi (1989) who considers only Efron’s method for independentand identically distributed errors. Liu (1988) investigates the third moment bootstrapestimator of 1T/3 based on Wu’s bootstrap for the case of heteroscedastic errors, 1 beinga fixed k x 1 vector. For model (8.1), we also consider the sampling distribution of theleast squares estimator of any such specified linear functional of 3, 1T3•For model (8.1), letE(e) = E(e) = [L4,j, i 1, ..., n. (8.8)From the bootstrap estimate (8.7), the third and fourth moment estimators for 1T/3 aredefined as = E{lT(/3* — Th} and /14,* = E{lT(/3* —It is easily shown thatn33= w r,i=1and= + (8.10)i=1 ijwhere w = lT(XTX)_lx and r = y — x.With the notation of (8.8), the third and fourth moments of 1T,3 are= E{lT( — 3=(8.11)and= E{lT(f —= Ew4, + (8.12)i=1 ijTheorem 8.3 Assume the elements of 1 are bounded. 
Then under model (8.1) and(8.8),144(i) E(,u3,) = j + O(n3) when Assumptions 8.1 and 8.5 hold and(ii) E(4,) = 1u4 + O(n3)=i4{1 + O(n1)} when Assumptions 8.1 and 8.6 hold.Remarks:1 The covariance estimators v,, V*(l) and V*(2) corresponding to the bootstrap estimator triplet , I3) and *(2) are asymptotically equivalent. As well, in estimating cov(/3), U*(2) is unbiased for homoscedastic models and consistent forheteroscedastic models. So it has the desiderata set out by Wu (1986, Section 5).2 Wu’s method (Wu 1986) fails in general to give a consistent estimator of the thirdmoment. This difficulty can be overcome with Wu’s approach simply by modifyingthe distribution of his random variable t to achieve a consistent third momentestimator (Liu 1988). But then an inconsistent fourth moment estimator mayresult. By contrast, our method yields consistent 3rd and 4th moment estimators.8.4.3 Asymptotic normality of 8*.In this section, we describe the asymptotic distribution of /3w’ under general conditionsincluding the case of independent and identically distributed errors and the correlationmodel as special cases. Freedman (1981) investigates the asymptotic theory of Efron’sresidual resampling method for the case of independent and identically distributed errorsand the vector resampling method for the correlation model.To simplify the statement of our results, denote by £(UV) the distribution of U conditional on V for any random objects U and V and P denotes the probability distribution145condition on {(yj, xfl, i = 1, ..., n}. Let Nk(, 1’) be the multivariate normal distribution with mean and covariance F.The next result involves two distinct sampling processes, the first leading to the originaldata set of size n, and the second created by bootstrap resampling, say m times. Withthe original sample fixed classical limit theory applies when m — cc. But complicationsarising from the need to allow n —* cc entail more delicate analysis, as emphasized inthe work of Bickel and Freedman (1981). Such analysis leads to the following results.Theorem 8.4 For the regression model (8.1) with fixed regressor variables, supposeAssumptions 8.7-8.10 obtain. Then— ${(yj, x) : i = 1, ..., n}] Nk(O, V’WV’)for almost all sequences {(y,, x) : i = 1, ..., n} as n —+ cc.Theorem 8.5 For the regression model (8.1), with random regressor variables, suppose Assumptions 8.11-8.13 obtain. Then £[/i(,3*— 3){(y,x) : i = 1,...,n}}Nk(O, Qj’M11)for almost all sequences {(yj, x) : i = 1, ..., n} as n — cc, whereM = Exx’e?.Remarks:3 Freedman (1981) suggests different bootstrap methods for different models. If themodel is changed, the normality of the bootstrap method can be lost. In contrast,Theorems 8.4 and 8.5 assert that the one bootstrap method discussed in Section8.4.1 yields asymptotic normality for both models.1468.5 Comparison and Simulation Study.In this section we compare our method with those of Efron (1979), Wu (1986) and Freedman (1981). We do this by highlighting the simple structural differences between thesemethods in the way they simulate resampling of the regression coefficient estimationvectors. Then we present the results of simulation studies in support our method.Suppose we wish to estimate the distribution of—/3 = (XTX)_lThen we need only estimate the distribution of (XTX)_1 being fixed.Efron’s bootstrap resamples from the residuals r1, ..., r,, r=— x/3 and determinesthe corresponding bootstrap estimators by—= (XTX)Efron’s {r} are independent and identically distributed samples from {r}. 
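As a point of reference for the comparison developed below, Efron's residual resampling just described can be sketched as follows (a minimal sketch, assuming numpy; the function and variable names are ours, not the thesis's):

```python
import numpy as np

def efron_residual_bootstrap(X, y, B=1000, rng=None):
    """Residual resampling for y = X beta + e: draw r*_i iid from the least squares
    residuals {r_i} and form beta*_b - beta_hat = (X'X)^{-1} X' r*_b."""
    rng = rng or np.random.default_rng()
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    r = y - X @ beta_hat                                   # least squares residuals
    devs = np.empty((B, X.shape[1]))
    for b in range(B):
        r_star = rng.choice(r, size=r.size, replace=True)  # iid draws from {r_i}
        devs[b] = XtX_inv @ X.T @ r_star                   # beta*_b - beta_hat
    return beta_hat, devs
```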
While this procedure seems natural for exchangeable residuals $\{r_i\}$, doubt about the validity of his procedure arises when they are not. And these residuals are definitely not exchangeable in heteroscedastic regression models, where Efron's estimators can be inconsistent.

In Wu's approach, bootstrap estimators are found from
$$\hat\beta^* - \hat\beta = (X^T X)^{-1} \sum_i x_i r_i t_i^*,$$
under the independent and identically distributed resampling model, denoted by $*$, for $t^* = (t_1^*, \ldots, t_n^*)$ satisfying
$$E_*(t^*) = 0, \quad \mathrm{cov}_*(t^*) = I.$$

So we see that Wu's method uses $\sum x_i r_i t_i^*$ to simulate $\sum x_i e_i$. The distribution of $\sum x_i r_i t_i^*$ will depend on the choice of that of the $\{t_i^*\}$. Specifying that distribution becomes a new problem and, in any event, means that Wu's bootstrap requires information not in the original sample. In general the method risks possible nonrobustness. And if $n$ is insufficiently large, the distribution of $\sum x_i r_i t_i^*$ can vary greatly over varying distributions of $t^*$.

Freedman (1981) suggests a way of dealing with the correlation model used for a nonhomogeneous errors model. His "pairs" method draws a simple random sample with replacement, $\{(y_i^*, x_i^{*T}): i = 1, \ldots, n\}$, from $\{(y_i, x_i^T): i = 1, \ldots, n\}$ and computes the bootstrap least squares estimate from $\{(y_i^*, x_i^{*T}): i = 1, \ldots, n\}$,
$$\hat\beta^* = \Big(\sum_i x_i^* x_i^{*T}\Big)^{-1} \sum_i x_i^* y_i^*.$$
Then the distribution of $\hat\beta^* - \hat\beta$ approximates that of $\hat\beta - \beta$. As indicated in Wu (1986), Hinkley (1988) and Section 8.2, this approach suffers from several drawbacks.

As shown in equation (8.7), our approach uses
$$\hat\beta^* - \hat\beta = (X^T X)^{-1} \sum_i z_i^*.$$
The permutability of $z_1, \ldots, z_n$ in (8.6) leads us to make the $\{z_i^*\}$ independent and identically distributed. We see that of the four methods, ours is the most direct in simulating the distribution of interest.

We now compare the four methods again, this time through a simulation study of their performance. From the simulation study of Wu (1986), we know that our bootstrap method will perform well in variance-covariance estimation; our method has the desiderata set out by Wu (1986, Section 5).

We study the following regression model:
$$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1(1)10,$$
$$x_i = 0,\ 1,\ 1.5,\ 2,\ 3,\ 3.5,\ 4,\ 4.5,\ 4.75,\ 5.$$
Two error distributions are used in our study: equal variances, $e_i \sim N(0, 1)$, and unequal variances, $e_i \sim x_i N(0, 1)$, $i = 1, \ldots, n$. In all cases, the $\{e_i\}$ are independent.

We consider four bootstrap methods: (i) Efron's method; (ii) Wu's method, with the distribution of $t^*$ given by the "wild bootstrap" (Hardle and Marron 1991); this choice for the distribution of the $\{t_i^*\}$ ensures fulfilment of the first, second and third moment conditions (Liu 1988); (iii) Freedman's "pairs" method; (iv) our new bootstrap method.

We have run many simulations and chosen two figures to illustrate the results obtained for the homoscedastic and heteroscedastic model simulation experiments. Each figure displays the cumulative frequency plots for the simulated error distributions. They depict the distributions associated with each of the four methods under investigation, labelled b through e; that associated with the true estimation error distribution is labelled a. The five variables underlying the displays a through e are represented generically by $w = \hat\beta_0 - \beta_0$, with the bootstrap displays based on the corresponding values $\hat\beta_0^* - \hat\beta_0$. A single simulation run begins with a $\hat\beta_0$ from a single sample generated in accord with the sampling distribution. Then 1000 bootstrap values $w_b^*$ of $w$ are generated by each of the four methods under consideration.

The results in Tables 8.1 and 8.2, unlike those in the figures, offer an aggregate view of performance and are based on 500 simulations of $B = 1000$ bootstrap samples. We use the Kolmogorov-Smirnov statistic to index the accuracy of the various distribution estimators.
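To make the contrast concrete, the new method's resampling for this simulation model can be sketched as follows (a minimal sketch, assuming numpy; the true coefficient values and the seed are illustrative assumptions, and only the heteroscedastic case is shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Design of the simulation study; unequal variances e_i ~ x_i * N(0, 1).
x = np.array([0, 1, 1.5, 2, 3, 3.5, 4, 4.5, 4.75, 5.0])
X = np.column_stack([np.ones_like(x), x])
beta = np.array([1.0, 2.0])                     # illustrative true (beta_0, beta_1)
y = X @ beta + x * rng.standard_normal(x.size)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
z = X * (y - X @ beta_hat)[:, None]             # rows z_i = x_i (y_i - x_i' beta_hat)

B = 1000
w = np.empty(B)                                 # w_b = beta*_{0,b} - beta_hat_0
for b in range(B):
    idx = rng.integers(0, x.size, size=x.size)  # simple random sample with replacement
    w[b] = (XtX_inv @ z[idx].sum(axis=0))[0]    # intercept coordinate of beta* - beta_hat
```

The empirical distribution of these values of $w$ then estimates that of $\hat\beta_0 - \beta_0$, which is what the tables below assess.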
In Table 8.1, the Kolmogorov-Smirnov statistic is defined as $\sup_w |F^*(w) - F(w)|$, where $F^*(\cdot)$ denotes the empirical distribution of the bootstrap values $w_b^*$ and $F(\cdot)$ that of the true distribution.

[Figure 8.1 near here: cumulative frequency curves plotted against $w$ for the five displays a through e.]
Figure 8.1: A comparison of bootstrap distribution estimators for regression with homoscedastic errors. We depict the distributions of $w = \hat\beta_0 - \beta_0$ induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron's estimator (c), Wu's estimator (d), and Freedman's estimator (e).

[Figure 8.2 near here: cumulative frequency curves plotted against $w$ for the five displays a through e.]
Figure 8.2: A comparison of bootstrap distribution estimators for regression with heteroscedastic errors. We depict the sampling distributions of $w = \hat\beta_0 - \beta_0$ induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron's estimator (c), Wu's estimator (d), and Freedman's estimator (e).

Table 8.1: Averages of the Kolmogorov-Smirnov Statistics for Competing Bootstrap Distribution Estimators (standard errors in parentheses)

              Equal Variance     Unequal Variance
New Method    0.115 (0.0029)     0.080 (0.0021)
Efron         0.079 (0.0019)     0.136 (0.0028)
Wu            0.158 (0.0024)     0.113 (0.0019)
Pair          0.126 (0.0025)     0.095 (0.0017)

Table 8.2: Absolute Biases of the Competing Bootstrap Quantile Estimators (standard errors in parentheses)

              Equal Variance                  Unequal Variance
              5%              95%             5%              95%
New Method    0.328 (0.009)   0.320 (0.009)   0.411 (0.014)   0.403 (0.014)
Efron         0.224 (0.007)   0.220 (0.007)   1.298 (0.041)   1.276 (0.039)
Wu            0.344 (0.009)   0.338 (0.009)   0.436 (0.014)   0.431 (0.014)
Pair          0.345 (0.012)   0.343 (0.011)   0.785 (0.033)   0.680 (0.030)

The mean values of the Kolmogorov-Smirnov statistics from these 500 simulations are summarized in Table 8.1. The value beside each mean is the corresponding standard error.

Since the confidence interval is a commonly used inferential procedure, studying the quantile estimators obtained from the four methods seemed worthwhile. The absolute bias of the 95%-quantile estimator is defined as $|q_{.95}(F^*) - q_{.95}(F)|$, where $q_{.95}(F)$ denotes the 95%-quantile of the distribution $F$. The means of the absolute biases of the 5%- and 95%-quantile estimators are summarized in Table 8.2. The value beside each mean is the corresponding standard error.

The results of our simulation studies can be summarized as follows.

1. As expected, for estimating a distribution Efron's method performs best when the error distributions are independent and identically distributed (see Table 8.1). The new bootstrap method also works well in that case, though not quite as well as Efron's. Freedman's pairs method and Wu's method seem to yield rough estimators and to be qualitatively poorer than the other two methods. This may be because Wu's method stresses consistency of the estimator's moments at the expense of estimating the overall distribution (see Figure 8.1 also).

   From Table 8.1, we see that in regression with heteroscedastic errors, Efron's approach does badly, again in accord with our expectations; the model errors are not exchangeable and Efron's estimator is not consistent. The new bootstrap method works best, Freedman's reasonably well and Wu's not so well (see also Figure 8.2).

2. For quantile estimation, Efron's method performs best for both the 5%- and 95%-quantiles when the error distributions are independent and identically distributed (see Table 8.2). The new method ranks second and Wu's third. For regression with heteroscedastic errors, the new method is best, Wu's second and Freedman's third. However, Efron's method gives an inconsistent estimator and seems totally unsatisfactory.

3. Overall, we can conclude that Efron's method is not robust against heteroscedastic errors. Wu's method seems unsatisfactory for estimating a distribution. We found Freedman's method effective but computationally expensive in the examples considered; presumably the method would be unrealistic for use in nonlinear situations. Even with just two parameters, it required much more time than the others.
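The accuracy criteria underlying Tables 8.1 and 8.2 can be written out as follows (a minimal sketch, assuming numpy; here w_boot denotes the B bootstrap values of $w$ produced by any one method and w_true a Monte Carlo sample from the true distribution of $\hat\beta_0 - \beta_0$; both names are ours):

```python
import numpy as np

def ks_statistic(w_boot, w_true):
    """sup_w |F*(w) - F(w)| between the bootstrap draws and a Monte Carlo sample
    from the true distribution of the estimation error."""
    w_boot, w_true = np.asarray(w_boot), np.asarray(w_true)
    grid = np.sort(np.concatenate([w_boot, w_true]))
    F_star = np.searchsorted(np.sort(w_boot), grid, side="right") / w_boot.size
    F_true = np.searchsorted(np.sort(w_true), grid, side="right") / w_true.size
    return np.abs(F_star - F_true).max()

def quantile_abs_bias(w_boot, w_true, q=0.95):
    """|q-quantile of F* minus q-quantile of F|, the quantity averaged in Table 8.2."""
    return abs(np.quantile(w_boot, q) - np.quantile(w_true, q))
```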
We next consider the use of pivotal quantities in bootstrapping. The bootstrap-$t$ statistic is constructed as
$$t^*(b) = (\hat\beta^*(b) - \hat\beta)/se^*(b), \qquad (8.13)$$
where $b$ denotes the bootstrap sample and $se$ means "standard error". In view of our earlier findings, we consider only three bootstrap methods: Efron's, Freedman's, and ours.

For both the case of equal and that of unequal variances, we ran a single simulation. The pivot quantile estimators in Table 8.3 are based on 1000 bootstrap samples. In Table 8.3, t(8) denotes the t-distribution with 8 degrees of freedom and N(0,1) denotes the standard normal distribution.

Table 8.3: The Pivot Quantile Estimators

                    0.05    0.10    0.25    0.50    0.75    0.90    0.95
Equal Variance
  New Method       -1.72   -1.34   -0.70    0.052   0.72    1.30    1.67
  Efron            -1.86   -1.36   -0.71   -0.004   0.71    1.49    1.97
  Pair             -1.51   -1.08   -0.60    0.122   0.58    1.16    1.61
  t(8)             -1.86   -1.40   -0.71    0       0.71    1.40    1.86
  N(0,1)           -1.65   -1.28   -0.67    0       0.67    1.28    1.65
Unequal Variance
  New Method       -1.77   -1.34   -0.69   -0.029   0.68    1.31    1.84
  Efron            -1.99   -1.55   -0.76   -0.033   0.64    1.36    1.86
  Pair             -1.63   -1.06   -0.55   -0.011   0.45    0.90    1.17

For the case of equal variances, t(8) is a good reference distribution. From Table 8.3, we find that both Efron's method and the new bootstrap method perform well. The pairs estimator is not as good as that given by the other two methods.

For the case of unequal variances, we do not know a good reference distribution. If we use t(8) as before, both Efron's and the new method give good estimators. The pairs method fails to give reasonable results. But since Efron's method yields an inconsistent variance estimator, we cannot use Efron's method to construct a useful confidence interval.

8.6 Bootstrapping in Nonlinear Situations

In this section we merely sketch extensions of our bootstrap to the nonlinear situations considered by Wu (1986). The extensive detail needed for the required analysis of lower-order terms will be presented elsewhere.

8.6.1 Regression M-estimator

An M-estimate $\hat\beta$ of the regression coefficient vector $\beta$ is found by minimizing
$$\sum_{i=1}^n \rho(y_i - x_i^T\beta) \qquad (8.14)$$
over $\beta$, where $\rho$ is usually assumed to be symmetric. Choices of $\rho$ can be found in Huber (1981). The M-estimate is found by solving
$$\sum_{i=1}^n x_i\,\rho'(y_i - x_i^T\beta) = 0. \qquad (8.15)$$
Call $x_i\rho'(y_i - x_i^T\beta)$ the "score function". From the method of Section 8.3, we obtain a bootstrap sample from the estimated score function with $\beta$ replaced by $\hat\beta$. To be precise, let $z_i = x_i\rho'(y_i - x_i^T\hat\beta)$, $i = 1, \ldots, n$. Next: (1) draw the bootstrap sample $\{z_1^*, \ldots, z_n^*\}$, a simple random sample with replacement from $\{z_1, \ldots, z_n\}$; (2) construct the bootstrap M-estimator $\hat\beta^*$ as
$$\hat\beta^* = \hat\beta + \Big(\sum_i x_i x_i^T \rho''(\hat e_i)\Big)^{-1}\sum_i z_i^*,$$
where $\hat e_i = y_i - x_i^T\hat\beta$, $i = 1, \ldots, n$.

For estimating $\mathrm{Var}(\hat\psi)$, $\hat\psi = \psi(\hat\beta)$, we use an analogue of the estimator in Section 8.4, $v_\psi = E_*(\hat\psi^* - \hat\psi)(\hat\psi^* - \hat\psi)^T$, where $\hat\psi^* = \psi(\hat\beta^*)$. Principal interest centers on $\psi(\beta) = \beta$ and $v_* = E_*(\hat\beta^* - \hat\beta)(\hat\beta^* - \hat\beta)^T$, where
$$v_* = \Big(\sum_i x_i x_i^T\rho''(\hat e_i)\Big)^{-1}\Big(\sum_i x_i x_i^T r_i^2\Big)\Big(\sum_i x_i x_i^T\rho''(\hat e_i)\Big)^{-1} \qquad (8.16)$$
and $r_i = \rho'(y_i - x_i^T\hat\beta)$.

Now let us calculate the asymptotic approximation to the covariance matrix of $\hat\beta$. Under appropriate conditions (not stated here for brevity),
$$0 = \sum_i x_i\rho'(y_i - x_i^T\hat\beta) = \sum_i x_i\rho'(y_i - x_i^T\beta) - \Big(\sum_i x_i x_i^T\rho''(y_i - x_i^T\beta)\Big)(\hat\beta - \beta) + o(\hat\beta - \beta).$$
This implies $\hat\beta - \beta \approx \big[\sum_i x_i x_i^T\rho''(y_i - x_i^T\beta)\big]^{-1}\sum_i x_i\rho'(y_i - x_i^T\beta)$.
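Before completing the covariance comparison below, the M-estimator version of the bootstrap just described can be sketched as follows (a minimal sketch, assuming numpy; Huber's rho with tuning constant 1.345 is an illustrative choice, and beta_hat is assumed to have been computed already by some M-estimation routine):

```python
import numpy as np

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)                  # psi = rho' for Huber's rho

def huber_psi_deriv(u, c=1.345):
    return (np.abs(u) <= c).astype(float)     # rho'' (almost everywhere)

def m_estimator_bootstrap(X, y, beta_hat, B=1000, rng=None):
    """Resample z_i = x_i rho'(y_i - x_i' beta_hat) and set
    beta*_b = beta_hat + (sum_i x_i x_i' rho''(e_hat_i))^{-1} sum_i z*_i."""
    rng = rng or np.random.default_rng()
    e_hat = y - X @ beta_hat
    z = X * huber_psi(e_hat)[:, None]                          # estimated score components
    H_inv = np.linalg.inv(X.T @ (X * huber_psi_deriv(e_hat)[:, None]))
    devs = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = rng.integers(0, len(y), size=len(y))             # SRS with replacement
        devs[b] = H_inv @ z[idx].sum(axis=0)                   # beta*_b - beta_hat
    return devs
```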
Alternatively, v E(8— /3)( — /3)T wherev [ xjxp”(ej)]_1[ xxT(p’(ej))2][xjxp”(ej)]’; (8.17)here e: = — x3: i = 1, ..., n. Comparison of (8.16) and (8.17) suggest the bootstrapestimator is a reasonable estimate of the covariance matrix of j3.8.6.2 Nonlinear regression.In nonlinear regression yj = f(3) + e i = 1, ..., n, where f is a nonlinear smoothfunction of /3 and e satisfies (8.1). The LSE /3 is obtained by minimizing (y — f(/3))2.The score function is xB)(y— f(/3)), where x = af/9,8. Put the LSE into thescore function. The bootstrap method of Section 8.3 is with samples, {z, ..., z7}from {z = Xj()(yj—f()) i = 1, ..., n}. The bootstrap coefficient estimators are= + (X(/)TX(/))_1 z, where X=(x1,...,x). It is easily shown that= /3 and v = E*(,8*_/3)(/3*_)T, with= (X()TX())’ (8.18)156where r=— f(,8).If we calculate the asymptotic approximation covariance matrix of the LSE, , we get- (X(/3)TX(/3))’X(/9 e and v =-- 9) wherev (X()TX())’ (8.19)From (8.18) and (8.19), we find that the bootstrap estimator is consistent and robustin general.8.6.3 Generalized Linear modelsA generalized linear model is characterized by three components (McCullagh and Nelder,1989):(i) an error distribution from the exponential family f(y; ) = exp([y’b — a(b) +b(y, ‘)]/q),(ii) a systematic component = x/3, and(iii) a monotonic, differentiable link function i = g(), where t = E(y).Here we consider generalized linear models with independent observations. Let y =(yr,..., y), E(y) = (IL’, ..., IL)T andCov(y) = diag(cr,...,a,) V(IL) = diag(uv()). (8.20)The mean, tj, is related to the regressor, xj, by the link function g, i.e. = g(x8). Thefull likelihood may not be available. Inference is instead based on the log quasilikelihood157(see Wedderburn 1974, McCullagh 1983), L(t, y), defined by= V([L)’(y—LL)A generalized least squares estimator (GLSE), 3, is defined as a solution ofDTV_l(y— i(3)) = 0, (8.21)where it(,3) = ([tj) = (g(x’/3))! and D = d/LL/d/3 = diag(g’(x3))X. Here DTV1(y —i(i3)) is called the quasi-score function.Moulton and Zeger (1991) consider two bootstrap methods for generalized linear models.One corresponds to standardized Pearson residual resampling, which extends Efron’sresidual bootstrapping method using standardized Pearson residuals. The method issimilar to that for the ordinary regression model, with heteroscedastic errors (i.e. unequal a’s in (8.20)); the bootstrapped covariance estimator is in general inconsistentfor the true covariance of GLSE, 6. The other method of Moulton and Zeger, involvesobservation vector resampling; this extends ordinary vector resampling using a one-stepapproach.We use the bootstrap method proposed in Section 8.3. By substituting the estimatorin (8.21) and rewriting it, we find the quasi-score function—Let z, = xjg1(xj)v ()_1(yj— /%j), j = 1, ..., n. The bootstrap is thus based on theuniform distribution, Fb with support z1, ..., zr-, and the calculation of the bootstrap158coefficients by= + (DTV_1)_lz; (8.22)here we have inserted the estimator 3 into the righthand side of the (8.22). For estimating Var(), j’ = ib(), the bootstrap covariance estimator is v,, =For /3, v,. = E(3* — 3)(9* — )T wherev = (DTV_1)_l(g’(x)vj(1L)_1)2rxjx(DTV_ D)_1,where {r=—: i = 1, ..., n} are the prediction errors.For the GLSE, /3, we have the approximation (McCullagh, 1983)— /3 (DTV_1)_DTV_l(y—so thatCov($) (DTV_1)_DTV_bcov(y —u)V1D(DTV’ ).Under model (8.20) Cov(y—t) = diag(uv(1u)) andCov(/3) = (DTVD)_l (g’(x/3))2v1(it)uxx(DTV_i D)’.We show elsewhere v,. 
is a consistent estimator for Cov($).The homoscedastic model (o = u2) is the most interesting special case; then the covariance of is Cov($) =u2(DTV_lD) and the bootstrap covariance estimator becomesv,, = (DTV_lD)_lDV_lDiag{r}V_lD(DTV_lD)_l. (8.23)The bootstrap covariance estimator (8.23) is the same as that employed by Cox (1961),Huber (1967), White (1982), and Royall (1986) for handling quite general model misspecification.1598.7 Concluding Remarks.The conventional approach to the bootstrap has been through quasi replicates of theoriginal sample possibly subject to reweighting or some constraints. The bootstrapdistribution is then obtained from the empirical distribution or some smoothed versionof the empirical distribution of the succession of realized estimators obtained from thesequasi replicate samples.Even when the resampling population consists of (possibly renormalized) model fit residuals the approach has been to construct successive pseudo estimator values and thenthe bootstrap distribution for the estimator of interest, through the construction ofbootstrap data sets.What we have done is to by-pass the data sets. Instead we have taken the estimator ofinterest as the baseline, generated successive estimator fit residuals (rather than modelfit residuals). The approximate sampling distribution for the estimator is then found bytaking the resulting empirical distribution of the realizations of estimator plus residual.A critical feature of our approach is the use of the components of the estimating functionitself to transform to residuals to the appropriate scale. Heuristically, we have perturbedthe estimator by bootstrapped realizations of a normalized first order term from a Taylorexpansion of the estimator about the true value of the parameter. This approach seemsnatural. Generating bootstrap replicates of the original sample seems unnecessary whenthe object of interest is the sampling distribution of an estimator and it can be realizeddirectly from suitably expressed estimator fit residuals.The idea of by-passing the data sets is in itself not new. Moulton and Zeger (1991)use this idea in what they call their “one step procedure”. They intend their procedure160for use in estimating coefficients in link functions of generalized linear models. Theirgoal is computational simplicity. If they were to follow thw traditional practice ofbootstrapping the data set, they would need to find the bootstrap coefficient estimatorby an iterative process for each successive bootstrap replicate sample. The resultingintolerable computational burden leads them to use the estimating equations (for thegeneralized linear models) directly in much the same way as we suggest in this paperfor the general case.Apart from the difference in motivation and intended domain of application, our methoddiffers from that of Moulton and Zeger (1991) in the way we resample residuals. Theyuse a method like that of Wu described above. Their method then inherits the potentialdeficiencies discussed above of Wu’s approach.Not surprisingly, our bootstrap has the asymptotic properties up to second order whichone might expect from the Taylor expansion heuristic. However, the unexpected robustness of our method up to at least nonhomogeneity in four order moments is unexpectedand encouraging. Overall our method has promise. 
As well, its underlying Taylor expansion heuristic suggests other ways of perturbing the estimator to approximate itssampling distribution and these are currently under investigation.We noted above a number of potential applications of our method. In current work weare adapting our method for use with longitudinal count data series. As well we areseeing if our method can be used to find standard errors for estimators computed fromdata obtained through complex survey designs. Binder (1991), building on the work ofGodambe and Thompson (1986), has shown how estimating equations can be used inthat context. That work provides the platform on which we are attempting to build ourbootstrap variance estimation procedure.1618.8 ProofsOur asymptotic theory requires certain conditions.Assumptions 8.1 For X = X the design matrix in (8.1) for n observations,max1< 0 independent of n.Assumptions 8.2 For the error variances in model (8.1), maxi<< u < cc.Assumptions 8.3 The minimum and maximum eigenvalues of n’XX are uniformly bounded away from 0 and cc.Assumptions 8.4 The elements of X are uniformly bounded.We also require the following lemma.Lemma 8.1 (Wu 1986) If= xT(XX)1j= 0 for any i,j with u aj, (8.24)then E(r?) = (1 — h)o. More generally, Assumptions 8.1 and 8.2 imply E(r?) =(1— h)u? + O(n’) = u + O(n_1).Comparing (8.9) with (8.11), and (8.10) with (8.12) suggests and /14,* are robust forestimating the third and four moments, respectively under substantial departures fromthe model assumptions. We prove this below under additional assumptions.Assumptions 8.5 maxi<< I1L3,i hE(e) = (1 — h)3,— ht3,. (8.29)ij ijThe first equality in (8.29) follows from the independence of e, ..., e,. The secondinequality in (8.25) obtains in a similar way.Now more generally when 0, i j,nmax{3,}Assumptions 8.1 and 8.5 imply the right hand side of these last inequalities is of order n2since ( /i) by the Cauchy-Schwarz inequality . Therefore, the first equality163in (8.26) follows from (8.29) and the second from (8.29) and h c/n. This completesthe proof of (8.26).Now from (8.28)E(r) (1 — h)41t,+ hj1t+ 6(1 — h)2a hoji+ 6ii(i)i= (I) + (II) + (III) + (IV),say. Assumptions 8.1, 8.2 and the Cauchy-Schwarz inequality imply(IV) <6n2c4maxa (8.30)(III) < 12(n’c +n3c4)max4. (8.31)From Assumptions 8.1, 8.5 and the Cauchy-Schwarz inequality,(II) n3c4maxi, (8.32)(I) = (1 — 4h + 6h — 4h + = iti,i + O(n’). (8.33)The first equality in (8.27) follows from (8.30), (8.31) and (8.32) and the second from(8.33). This completes the proof. DProof of Theorem 8.3 From Lemma 8.2,= wEr= + O(n’))=wt3,+9+ wO(n’)= 3+O(n)164The last equality follows from 0 such that Vi-, = n X Xi,. —* V.def—1 2 TAssumptions 8.9 There exists a matrix W> 0 such that W = n o x:; W.Assumptions 8.10 x2 and E(4) are uniformly bounded, i = 1, ..., n.For random {x}, the corresponding assumptions are:Assumptions 8.11 The vectors {(, x)} are independent and identically distributedwith common unknown distribution function F on R’1, and E (, x’) j oc, whereis Euclidean length.Assumptions 8.12 The k x k covariance matrix Q = E(xix’) is positive definite.Assumptions 8.13 Exte: = 0, i = 1, ..., n where e = Yi — x’/9.165def—1 n T2Proof of Theorem 8.4 We will first show W,-, = n xx r —* W almost surelyas n —* cc where r = — xT/3 for all i. 
Then in bootstrap sampling we may conditionon the set of unit probability where the just asserted convergence takes place and letm -* cc.To obtain this almost sure result, observe that= xjx(yj - xB)2 + n xxflx{j3 -+ 2n xxxftB-- x/3)= (I) + (II) + (III)say. Assumptions 8.8, 8.9 and Kolmogorov’s strong law (c.f. Chung, 1968, p 119 andthe Corollary given there for the case, p = 2) entail /3 — —* 0 almost surely asn —* cc. Using the trace norm for matrices and the usual norm for vectors, (II)II 0 is a uniform upper bound forIIxII obtained from Assumption 8.10. So (II) —* 0 a.e.. At the same time, I(111)H =H2n — < 2n_ e IxH3 L — < 2cn_ Z IeHLB— /W for thesame constant c> 0 used above. So Assumption 8.10 and Kolmogorov’s strong law citedabove imply (III) — 0 almost surely. Finally, (I) = n’ Zxxre = n Zxx(e —c)+n But n’ Zxx(e?—a,) —*0 almost surely by Assumption 8.10 andKolmogorov’s strong law of large numbers. Then I —* W almost surely by Assumption8.9.Now- ) 1V’Let = lTzj, i = 1, ..., n for any fixed k dimensional vector 1 with lH = 1. Observe thatconditional on the actual sample, the{1T4, i = 1, ..., n} are independent and identically166distributed random variables with zero means and common standard deviation, o =(lTJ,fl)1/2 (n—’ )1/2. The Berry-Esseen Theorem impliessup F(u’/’ lTz x) — (x) (8.34)where p = But n1 i’Wi > 0 a.e.. Moreover,> —* 0 as n —* oo a.e.. The last result obtains since by our assumptions, the{E’} are uniformly bounded; we may then invoke a general version of the strong lawof large numbers (Chung, 1968, pll’T, Theorem 5.4.1 with q(.) = (.)4/3 and a = 3/2)to obtain the result.Since inequality (8.34) holds for all 1, the conclusion of the theorem now follows. DCorollary 8.1 Under the assumptions of the last theorem,- ){(yj,X) : i =1, ...,n}]converges to the standard normal on R’ for almost all sequences {(yj, x’) : i = 1, ..., n}as n —* oo.Proof: This result is an immediate consequence of the result of the last theorem. DProof of Theorem 8.5 This proof is similar to that of Theorem 8.4 and the detailswill be omitted. That 3 — —* 0 almost surely follows from 8.11 and 8.13 and thestrong law of large numbers. Assumptions 8.11 and 8.12 imply (I) — 0, (II) —* 0, and(III) —p M almost surely. These results and Assumption 8.12 imply the conclusion ofthe Theorem. 0167Bibliography[1] Akaiki, H. (1973). Information theory and an extension of entropy maximizationprinciple. 2nd International Symposium on Information Theory, eds. B. N. Petrovand F. Csak, Kiado: Akademia, p. 276-281.[2] Akaiki, H. (1977). On entropy maximization principle. In P. R. Krishnaiah, (ed.),Applications of Statistics. Amsterdam: North-Holland, 27-41.[3] Akaiki, H. (1978). A Bayesian analysis of the minimum AIC procedure.Ann.Inst.Statist.Math., 30A, 9-14.[4] Akaiki, H. (1982). On the fallacy of the likelihood principle. Statistics and Probability Letters, 1, 75-78.[5] Akaike, H. (1983). Information measures and model selection. Proceedings of the44th Session of 181, 1, 277-291.[6] Akaike, H. (1985). Prediction and entropy. A Celebration of statistics, The 181Centenary Volume, Berlin, Spring-Verlag.[7] Bahadur, R.R.(].966). A note on quantiles in large sample, Ann.Math.Statist. 37,577-580.[8] Barndorff-Nielsen, 0. (1978). Information and Exponential Families in StatisticalTheory. Wiley, New York.168[9] Basu, D. (1975). Statistical information and likelihood (with discussions). Sankhya,Ser. A. 37, 1-71.[10] Bickel, P.J. (1981). 
Minimax estimation of the mean of a normal distribution whenthe parameter space is restricted. Ann. Statist. 9, 1301-1309.[11] Bickel, P.J. and Freedman, D.A. (1981). Some asymptotic theory for the bootstrap.Ann. Statist., 9, 1196-1217.[12] Binder, D.A. (1991). Use of estimating functions for interval estimation from complex surveys (with discussion), Proc. Sect. Res. Methods, Amer. Statist. Assoc.,34-44.[13] Birnbaum, A. (1962). On the foundation of statistical inference (with discussions).JASA, 57, 269-306.[14] Booth, J. Hall, P. and Wood, A. (1992). Bootstrap estimation of conditional distributions. Ann. Statist., 20, 1594-1610.[15] Buehler, R.J. (1971). Measuring information and uncertainty. Foundations of Statistical Inference. Eds: Godambe and Sprott; Holt, Rinehart and Winston.[16] Casella, C. and Strawderman, W.E. (1981). Estimating a bounded normal mean.Ann. Statist., 9, 870-878.[17] Chow, Y.S. and Lai, T.L. (1973). Limiting behavior of weighted sums of independent random variables. Ann. of Prob., 1, 810-824.[18] Chu, C.K. and Marron, J.S. (1992). Choosing a kernel regression estimator. Statistical Science, 6, 404-436.[19] Chung, K.L. (1968). A Course in Probability Theory, New York: Harcourt, Brace& World, Inc..169[20] Cox, D.R. (1961). Tests for separate families of hypotheses. Proceedings of theFourth Berkeley Symposium on Mathematical Statistics and Probability, (Universityof California Press, Berkeley), 105-123.[21] Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall,Landon.[22] Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press,Princeton.[23] Edwards, A.W.F. (1972). Likelihood. C.U.P., Cambridge.[24] Efron,B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist.,7,1-26.[25] Efron,B. (1982). The jackknife, the bootstrap and other resampling plans, SIAM,Philadelphia.[26] Efron,B. (1987). Better bootstrap confidence intervals and bootstrap approximations. JASA, 82, 171-185.[27] Efron,B. (1990). More efficient bootstrap computations. JASA, 85, 79-89.[28] Efron, B. and Tibshirani, R.J. (1993). An Introduction to Bootstrap. Chapman &Hall, New York.[29] Eubank, P.L. (1988). Spline Smoothing and Nonparametric Regression. New York:Marcel Dekker.[30] Fan, J. (1992). Design-adaptative nonparametric regression. JASA, 87, 998-1004.[31] Fan, J. (1993). Local linear regression smoothers and their Minimax efficiencies.Ann. Statist., 21, 196-216.170[32] Fan, J., Heckman, N.E. and Wand M.P. (1993). Local polynomial kernel regressionfor generalized linear models and quasi-likelihood functions. JASA, in press.[33] Fan, J. and Marron, J.S. (1992). Best possible constant for bandwidth selection.Ann. Statist., in press.[34] Fisher, R.A. (1925). Theory of statistical estimation. Proc. Cambridge Phil. Soc.,22, 700-725.[35] Fisher, R.A. (1956) Statistical Methods and Scientific Inference. Oliver and Boyd.Edinbugh.[36] Fraser, D.A.S. (1965). On information in statistics. Ann. Math. Statist., 36 890-896.[37] Fraser, D.A.S. (1968). The Structure of Inference. New York: John Wiley.[38] Freedman, D.A. (1981) Bootstrapping regression models. Ann. Statist., 9, 1218-1228.[39] Freedman, D.A. and Peters,S.C. (1984). Bootstrapping a regression equation: someempirical results. JASA, 79, 97-106.[40] Gasser, T. and Muller, H.G. (1979). Kernel estimation of regression function. inSmoothing Techniques of Curve Estimation. Lecture Notes in Mathematics 757,eds. T. Gasser and M. Resenblatt. Heidelberg: Springer-Verlag, 23-68.[41] Godambe, V.P. 
(1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist., 31, 1208-1212.[42] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-284.171[43] Godambe, V.P. and Heyde, C.C. (1987). Quasi-likelihood and optimal estimation.mt. Statist. Rev., 55, 231-244.[44] Godambe, V.P. and Thompson, M.E. (1986). Parameters of superpopulation andsurvey population: their relationships and estimation, International Statist. Rev.,54, 127-138.[45] Good, I.J. (1971). The probability explication of information, evidence, surpise,causality, explanation, and utility. Foundations of Statistical Inference, Eds: Godambe and Sprott; Holt, Rinehart and Winston, 108-141.[46] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, NewYork, Berlin.[47] Hardle, W. (1991). Applied Nonparametric Regression, Cambridge: Cambridge University Press.[48] Hardle, W. and Marron, J.S. (1990). Bootstrap simulaneous error bars for nonparametric regression. Ann. Statist.[49] Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapmanand Hall, London, New York.[50] Hinkley, D.V. (1988). Bootstrap methods. J. Roy. Statist. Soc. Ser.B, 50,321-337.[51] Hoeffding, W.(1963). Probability Inequalities for sum of bounded random variables,JASA, 58, 13-30.[52] Ru, F. (1994). The consistency of the maximum relevance weighted likelihood estimator. Unpublished Manuscript.[53] Hu, F. and Zidek, J.V. (1993a). An approach to bootstrapping through estimatingequations. Tech. Report of University of British Columbia, No.126.172[54] Ru, F. and Zidek, J.V. (1993b). A relevance weighted nonparametric quantile estimator. Tech. Report of University of British Columbia, No.134.[55] Hu, F. and Zidek, J.V. (1993c). Relevance weighted likelihood estimations. Unpublished Manuscript.[56] Ruber, P.J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on MathematicalStatistics and Probability (University of California Press, Berkeley) 221-233.[57] Huber, P.J. (1981). Robust Statistics, Wiley, New York.[58] Jennen-Steinmetz, C. and Gasser, T. (1988). A unifying approach to nonparametricregression estimation. JASA, 83, 1084-1089.[59] Kuliback, S. (1954). Certain inequalities in information theory and Cramer-Raoinequality. Ann. Math. Statist., 25, 745-751.[601 Kuliback, S. (1959). Information Theory and Statistics, New York: Wiley.[61] Kullback, S. and Leibler, R.A. (1951). On information and sufficient. Ann. Math.Statist., 22, 79-86.[62] Lahiri, S.N. (1992). Bootstrapping M - estimators of a multiple linear regressionparameter. Ann. Statist., 20, 1548-1570.[63] Lee, K.W. (1990) Bootstrapping logistic regression model with regressors. Comm’un.Statist. Theory Meth., 19, 2527-2539.[64) Lehmann, E.L. (1983). Theory of Point Estimation. New York: Wiley.[65] Lehmann, E.L. (1986). Testing Statistical Hypothesis. New York: Wiley.173[66] Lindley, D.V. (1956). On a measure of the information provided by an experimentAnn. Math. Statist., 27, 980-1005.[67] Lindley, D.V. (1965). Introduction to Probability and Statistics from a BayesianViewpoint. Part 2. Inference. Cambridge: Cambridge University Press.[68] Liu, R.Y. (1988). Bootstrap procedures under some non-iid models. Ann. Statist.,16, 1696-1708.[69] Liu, R.Y. and Singh, K. (1992). Efficiency and robustness in resampling. Ann.Statist., 20, 370-384.[70] Mack, Y.P. and Muller, H.G. (1989). 
Convolution-type estimators for nonparametnc regression estimation. Statistics and Probability Letters, 7, 229-239.[71] Marcus, M.B. and Zinn, J.(1984). The bounded law of the iterated logarithm for theweighted empirical distribution process in the non-iid case, Ann.Prob., 12, 335-360.[72] Marshall, A.W. and 01km, I.(1979). Inequalities: Theory of Majorization and ItsApplications, Academic Press, New York.[73] McCullagh, P.J. (1983). Quasi-likelihood functions. Ann. Statist., 11, 59-67.[74] McCullagh, P.J. and Nelder, J.A. (1989). Generalized linear models, 2nd edition,Chapman and Hall, London.[75] Moulton, L.H. and Zeger, S.L. (1991). Bootstrapping generalized linear models.Computational Statistics and Data Analysis, 11, 53-63.[76] Muller, H.G. (1988). Nonparametric Analysis of Longitudinal Data, Berlin:Springer-Verlag.174[77] Muller, 11.0. and Stadtmuller, U. (1987). Estimation of heteroscedasticity in regression analysis. Ann. Statist., 15, 610-635.[78] Navidi, W. (1989). Edgeworth expansions for bootstrapping regression models.Ann. Statist., 17, 1472-1478.[79] Owen, A.B. (1991). Empirical likelihood for linear models. Ann. Statist., 19, 1725-1747.[80] Park, B.U. and Marron, J.S. (1990). Comparison of data-driven bandwidth selectors. JASA, 85, 66-72.[81] Royall, R.M. (1986). Model robust confidence intervals using maximum likelihoodestimators. mt. Statist. Rev., 54, 221-226.[82] Rubin, D.B. (1981). The Bayesian bootstrap. Ann. Statist., 9, 130-134.[83] Sasieni, P. (1992). Maximum weighted partial likelihood estimators of Cox model.JASA, 88, 144-152.[84] Sarndal, C.-E., Swenson, B. and Wretman, J.(1992). Model Assessed Survey Sampling, New York: Springer-Valag.[85] Savage (1954). The Foundations of Statistics. New York: Wiley.[86] Serfling, R.J.(1980). Approximation Theorems of Mathematical Statistics , Wiley,New York.[87] Serfiing, R.J.(1984). Generalized L-, M- and R-statistics, Ann.Statist., 12, 76-86.[88] Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech.J., 27, 379-423; 623-656.175[89] Shorack, G.R.(1978). Convergence of reduced empirical and quantile processes withapplications of order statistics in the non-iid case. Ann.Statist., 1, 146-152.[90] Shorack, G.R. and Weilner J.A.(1986). Empirical Processes with Applications toStatistics, Wiley, New York.[91] Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap. Ann. Statist.,9, 1187-1195.[92] Staniswalis, J.G. (1989). The kernel estimate of a regression function in likelihood-based models, JASA, 84, 276-283.[93] Stout, W.F. (1968). Some results on complete and almost sure convergence of linearcombinations of independent random variables and martingale differences. Ann.Math. Statist., 39, 1549-1562.[94] Stout, W.F. (1974). Almost Sure Convergence. Academic Press, New York.[95] Tibshirani, R. and Hastie, T. (1987). Local likelihood estimation. JASA, 82, 559-567.[96] Wahba, C. and Wold, 5. (1975). A completely automatic French curve: fittingspline functions by cross-validation. Communications in Statistics, Part A—Theoryand Methods, 4, 1-7.[97] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann.Math. Statist., 20,595-601.[98] Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models,and the Gauss-Newton method. Biometrika, 61, 439-447.[99] Weerahandi, S. and Zidek, J.V. (1988). Bayesian nonparametric smoothers for regular processes. The Canadian Journal of Statistics, 16, 61-74.176[100] White, H. (1982). 
Maximum likelihood estimation of misspecified models.Econometrika, 50, 1-25.[101] Wiener, N. (1948). Cybernetics. New York: Wiley.[102] Wu, C.F.J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Ann. Statist., 14, 1261-1295.177