RELEVANCE WEIGHTED SMOOTHING AND A NEW BOOTSTRAP METHOD

by

FEIFANG HU

B.Sc., Hangzhou Normal University, P.R. China, 1985
M.Sc., Zhejiang University, P.R. China, 1988

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1994
© Feifang Hu, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

(Signature)
The University of British Columbia, Vancouver, Canada

Abstract

This thesis addresses two quite different topics. First, we consider several relevance weighted smoothing methods for relevant sample information. This topic can be viewed as a generalization of nonparametric smoothing. Second, we propose a new bootstrap method which is based on estimating functions.

A statistical problem usually begins with an unknown object of inferential interest. About this unknown object, we may have three types of information (classified in this thesis): direct information, exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. In many cases, relevant sample information is available and useful. But there is no systematic theory about relevant sample information.
The aim of this thesis is to extract “the relevant information” contained in the relevant samples. Three general methods have been developed under three different lines of approach (parametric, nonparametric and semiparametric). In the parametric approach, we propose the idea of relevance weighted likelihood (REWL). For the nonparametric approach, we develop our theory based on the relevance weighted empirical distribution function (REWED). In the semiparametric approach, the relevance weighted estimating functions are used to extract “relevant information” from relevant samples. From asymptotic results, we find that these proposed methods have many desirable properties. We apply these proposed methods, as well as some adjusted methods, to generalized smoothing problems. Theoretical results as well as simulation results show our methods to be promising.

We also present a new bootstrap method. It has computational and theoretical advantages over conventional bootstrap methods when the data come from non-identically distributed observables. It differs from conventional methods in that it resamples components of an estimating function rather than the data themselves.

Contents

Abstract ii
Table of Contents iv
List of Tables ix
List of Figures x
Acknowledgements xiii
1 Introduction 1
2 Relevant Sample Information 6
2.1 Introduction 6
2.2 Classification of Information 10
2.3 Measure of Information 13
2.4 Fisher’s Information of Relevant Sample in Point Estimation 16
2.5 Kullback and Leibler’s Information in Relevant Sample in Discrimination 21
2.6 General Remarks 23
3 Relevance Weighted Likelihood Estimation 24
3.1 Introduction 24
3.2 The Relevance Weighted Likelihood 28
3.2.1 Definition 28
3.2.2 Examples 30
3.2.3 Remarks 32
3.3 Weakly Sufficient Statistics 33
3.3.1 Definition 33
3.4 Maximum Relevance Weighted Likelihood Estimation (MREWLE) 35
3.5 The Entropy Maximization Principle 36
3.5.1 The generalized entropy maximization principle 36
3.5.2 The MREWLE and generalized entropy maximization principle 37
3.6 MREWLE for Normal Observations 38
3.7 General Remarks 41
4 The Asymptotic Properties of the Maximum Relevance Weighted Likelihood Estimator 44
4.1 Introduction 44
4.2 Weak Consistency 45
4.3 Strong Consistency 52
4.4 The Asymptotic Normality of the MREWLE 54
4.5 Estimated Variance of the MREWLE 58
4.6 Some Possible Extensions and Remarks 58
4.7 Proofs 59
5 The MREWLE for Generalized Smoothing Models 69
5.1 Introduction 69
5.2 Small samples 71
5.3 Asymptotic Properties 73
5.4 Bandwidth Selection 82
5.5 Simulation 83
6 A Relevance Weighted Nonparametric Quantile Estimator 91
6.1 Introduction 91
6.2 The REWED and REW Quantile Estimation 94
6.3 Strong Consistency of REW Quantile Estimators 96
6.4 Asymptotic Representation Theory 98
6.5 Asymptotic Normality of 99
6.6 Applications 99
6.7 Simulation Study 102
6.8 Discussion 108
6.9 Proofs of the Theorems 110
7 Some Further Results 120
7.1 Locally polynomial maximum relevance weighted likelihood estimation 120
7.2 Relevance weighted estimating functions 127
8 An Approach of Bootstrapping Through Estimating Equations 132
8.1 Introduction 132
8.2 Problems with Common Bootstrap Regression Methods 134
8.2.1 Residual Resampling 135
8.2.2 Vector resampling 139
8.3 A new bootstrap 139
8.4 Asymptotics 142
8.4.1 Preamble 142
8.4.2 Consistency of $\beta^*$ 143
8.4.3 Asymptotic normality of $\beta^*$ 145
8.5 Comparison and Simulation Study 147
8.6 Bootstrapping in Nonlinear Situations 155
8.6.1 Regression M-estimator 155
8.6.2 Nonlinear regression 156
8.6.3 Generalized Linear models 157
8.7 Concluding Remarks 160
8.8 Proofs 162
References 168

List of Tables

5.1 Pointwise Biases and Variances of Regression Smoothers 81
5.2 Conjectured Pointwise Biases and Variances of the MREWLE Smoothers 81
8.1 Averages of the Kolmogorov-Smirnov Statistics for Competing Bootstrap Distribution Estimators 152
8.2 Absolute Biases of the Competing Bootstrap Quantile Estimators 152
8.3 The Pivot Quantile Estimators 154

List of Figures

3.1 A comparison of MREWLE and MLE under o = 0. Curve A represents the MSE of MLE at line 0 = 01. Curve B represents the Maximum MSE of MREWLE over whole parameter space. Curve C represents the MSE of MREWLE at line 0 = O 40
3.2 A comparison of MREWLE with MLE under i = 1. Curve A represents the MSE of MLE at line 0 = 0. Curve B represents the Maximum MSE of MREWLE over whole parameter space. Curve C represents the MSE of MREWLE at line 0 = 01 42
5.1 A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.21) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c 85
5.2 The Nadaraya-Watson MREWLE based on five simulations from Model (5.21) with n = 200 86
5.3 The locally linear smoother MREWLE based on five simulations from Model (5.21) with n = 200 87
5.4 A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.22) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c 88
5.5 A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.23) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c 90
6.1 A comparison of the Nadaraya-Watson estimate with REW quantile estimator. The model is Y = X * (1 - X) + e, where X is uniform (0,1) and e is N(0, 0.5). The sample size n = 1000 and the bandwidth, h = 0.1.
The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c 103
6.2 A comparison of the Nadaraya-Watson estimate with REW quantile estimator with outliers. To the data depicted in Figure 6.1, we add 50 e outliers from N(2, 0.5). The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c 104
6.3 A comparison of the Nadaraya-Watson estimate with REW quantile estimator. The model is Y = X * (1 - X) + e, where X is from uniform (0,1) and e from a double exponential distribution with r = 0.1. The sample size is n = 100 and the bandwidth, h = 0.1. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c 105
6.4 A comparison of the Nadaraya-Watson estimate with REW quantile estimator with outliers. To the data depicted in Figure 6.3, we add 10 c-outliers from N(-0.5, 0.25). The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c 106
6.5 A REW quantile estimator of a quantile curve. The .25 quantile curve is estimated for the data depicted in Figure 1. The true quantile curve is a, and the REW quantile estimator is b 107
7.1 A comparison of the Nadaraya-Watson MREWLE, the locally linear smoother MREWLE and the locally linear MREWLE from Model (5.21) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, the locally linear smoother MREWLE, c, and the locally linear MREWLE, d 125
7.2 The locally linear MREWLE based on five simulations from Model (5.21) with n = 200 126
8.1 A comparison of bootstrap distribution estimators for regression with homoscedastic errors. We depict the distributions of w induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron’s estimator (c), Wu’s estimator (d), and Freedman’s estimator (e) 150
8.2 A comparison of bootstrap distribution estimators for regression with heteroscedastic errors.
We depict the sampling distributions of w induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron’s estimator (c), Wu’s estimator (d), and Freedman’s estimator (e) 151

Acknowledgements

I would like to thank my supervisor James Zidek for his excellent guidance, for his constant encouragement and his patience. Only the support from James Zidek made this thesis possible.

I wish to thank the people whose suggestions and comments about this thesis have been most helpful: Harry Joe, Nancy Heckman and Jean Meloche. This thesis benefitted from the helpful references and comments of Nancy Heckman. Many thanks go to Harry Joe, Nancy Heckman, Jian Liu and John Petkau for their encouragement and support during my stay at UBC.

Many thanks go to my dear wife Ying Liu for her constant encouragement and patience. Our son Leon made life more enjoyable with all the fun he gave me.

Finally, I would like to thank James Zidek for his financial support. The financial support from the Department of Statistics is acknowledged with great appreciation. I also acknowledge the support of the University of British Columbia through a University Graduate Fellowship.

Chapter 1 Introduction

This thesis addresses two quite different topics. First, we consider several relevance weighted smoothing methods for relevant sample information. This topic can be viewed as a generalization of nonparametric smoothing. Second, we propose a new bootstrap method which is based on estimating functions.

A statistical problem usually begins with an unknown object of inference. About this unknown object of inference, we may have three types of information (classified in this thesis): direct information, exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. In many cases, the relevant sample information is available and useful.
There is no systematic theory about relevant sample information.

The aim of this thesis is to extract “the relevant information” contained in the relevant samples. Three general methods have been developed under three different lines of approach (parametric, nonparametric and semiparametric). For the parametric approach, we propose the idea of relevance weighted likelihood (REWL). The REWL plays the same role in relevant sample analysis as the likelihood in classical statistical inference. For the nonparametric approach, we develop our theory based on the relevance weighted empirical distribution function (REWED). In the semiparametric approach, the relevance weighted estimating functions are used to extract “relevant information” from relevant samples. We use estimating functions because of their generality.

We show that the maximum relevance weighted likelihood estimator (MREWLE) is consistent and asymptotically normal. The asymptotic theory of the nonparametric approach is also developed. [For the semiparametric approach, the asymptotic theory is omitted.] From these asymptotic results, we find that the proposed methods have many desirable properties.

We also apply the proposed methods to generalized smoothing problems. We find that: (i) the MREWLE has some advantages over current nonparametric regression methods (for example, the MREWLE always has a smaller variance, which depends on the Fisher information function; this also indicates that the MREWLE is a kind of efficient estimator); (ii) the relevance weighted quantile estimator (based on the REWED) is usually robust and quite efficient. For generalized smoothing models, we obtain some good estimators through locally polynomial adjustments. Simulations support our methods.
Various papers (Efron 1979, Bickel and Freedman 1981 and Singh 1981) speak to the quality of the bootstrap resampling procedure for estimating a sampling distribution in situations where the sampled observables are independent and identically distributed. In this thesis we present a new bootstrap method. It has computational and theoretical advantages when the data come from non-identically distributed observables. It differs from conventional bootstrap methods in that it resamples components of an estimating function rather than the data themselves. We apply this bootstrap method to ordinary linear regression. In comparisons with Efron’s method, Freedman’s pairs method and Wu’s method, our method is supported by theoretical results as well as simulations. Because our new bootstrap method is based on estimating functions, applying it to relevance weighted smoothing is possible and reasonable. This is a topic for further research.

The thesis is organized as follows. In Chapter 2, we classify, in a statistical way, the different types of information about the unknown object of inference (parameter). In classical statistical inference and Bayesian inference, statisticians usually focus on direct information and exact sample information. We find that relevant sample information is very important in statistical inference. Two possible generalizations of Fisher’s information and Kullback-Leibler’s information for relevant samples are considered.

For relevant samples, we propose the idea of the relevance weighted likelihood (REWL) in Chapter 3. Our idea generalizes that of the likelihood function in that the independent samples going into the likelihood function may be discounted according to their degree of relevance. The classical likelihood obtains in the special case where the independent samples are all from the study population whose parameters are of inferential interest.
But more generally, as in meta-analysis, for example, the value of such samples may be reduced because their relevance is in doubt, for example because they are noisy or biased. The relevance weights, which enter as exponents of the component factors in the sampling density, enable us to trade off information against such things as bias in the samples, which may be relevant even if not drawn from the study population. We show how the REWL can be obtained from a generalization of the entropy maximization principle. Using the REWL we define the notion of weak sufficiency and the maximum REWL estimator (MREWLE). By using the MREWLE in a normal example, we find the MREWLE has some advantages.

In Chapter 4, we establish the weak and strong consistency of the MREWLE under a wide range of conditions. My results generalize those of Wald (1948) to both nonidentically distributed random variables and unequally weighted likelihoods (when dealing with independent data sets of varying relevance to the inferential problem of interest). Asymptotic normality of the MREWLE is also proved.

We apply the REWL methods to generalized smoothing models in Chapter 5. By choosing different weights, four estimators are considered. They are the Nadaraya-Watson MREWLE, the Gasser-Müller MREWLE, the k-NN weights MREWLE and the locally linear smoother MREWLE. Asymptotic results for these four estimators are developed. We also compare them through theoretical results as well as simulations.

Chapter 6 concerns situations in which a sample $X_1 = x_1, \ldots, X_n = x_n$ of independent observations is drawn from populations with different CDFs $F_1, \ldots, F_n$, respectively. Inference is about a quantile of another population with CDF $F_0$ when the data from the other populations are thought to be “relevant”. Nonparametric smoothing of a quantile function would typify situations to which our theory applies. We define the relevance weighted quantile (REWQ) estimator derived from the relevance weighted empirical distribution (REWED) function.
We show that the estimator has desirable asymptotic properties. A simulation study is also included. It shows that the median estimator is a robust alternative to the locally weighted averages used in conventional smoothing.

Some further results appear in Chapter 7. We propose a locally polynomial MREWLE. By comparing it with the locally constant MREWLE (in Chapter 5) and current nonparametric regression methods, we find that the locally linear MREWLE has a simple bias and smaller variance. Some simulation results also support this locally linear MREWLE. For our semiparametric approach, we present the method of relevance weighted estimating functions. We find that the locally linear MREWLE is the best among all locally linear REW estimating equations (with kernel weights) and the locally linear quasi-MREWLE (maximum relevance weighted quasi-likelihood estimator) is the best among all locally linear REW linear estimating equations.

Chapter 8 presents a method of bootstrap estimation. Rather than resampling from the original sample, as is conventional, the proposed method resamples summands of the estimating function used to produce the original estimate. The result is computationally simpler than existing competitors. However, its main advantages lie in its treatment of non-identically distributed observations. Shortcomings of conventional methods are overcome. An application to ordinary linear regression is worked out in detail along with the appropriate asymptotic theory. We also report a simulation study which provides support for this new bootstrap method.

Chapter 2 Relevant Sample Information

2.1 Introduction

Information is a key word in statistics. After all, this is what the subject is all about. As Basu (1975) says:

A problem in statistics begins with a state of nature, a parameter of interest $\theta$ about which we do not have enough information. In order to generate further information about $\theta$, we plan and then perform a statistical experiment $E$.
This generates the sample $x$. By the term ‘statistical data’ we mean such a pair $(E, x)$ where $E$ is a well-defined statistical experiment and $x$ the sample generated by a performance of the experiment. The problem of data analysis is to extract ‘the whole of relevant information’ (an expression made famous by R. A. Fisher) contained in the data $(E, x)$ about the parameter $\theta$.

However, statistical theory has traditionally been concerned with a narrow interpretation of the word embraced by Basu’s description. Given data, statisticians would typically construct a sampling model with a parameter $\theta$ to describe a population from which the data were supposedly drawn. Information in the sample about the population comes out through inference about $\theta$. Alternatively, given a $\theta$ of interest, the classical paradigm sees the statistician as conducting a statistical experiment to generate a sample from a population defined by a sampling distribution with parameter $\theta$. The sample then provides information about $\theta$. In either case, statistical inference will be based on these observations and their directly associated sampling model. This is the frequency theory viewpoint (see Lehmann 1983).

Bayesians think that we always have some prior distribution for the unknown parameter. Then we combine the prior information and the ‘statistical data’ information (see Lindley 1965).

But these two sources of information are not the only sources of information for the parameter. There is another source of information (relevant sample information as defined in Section 2.2). This information is very useful in many cases. This can be clearly seen from the following examples. (We use an example similar to that of Basu (1975) but for a different purpose.)

Example 2.1 Suppose an urn A contains 100 tickets that are numbered consecutively as $\theta + 1, \theta + 2, \ldots, \theta + 100$ where $\theta$ is an unknown number.
Let $E_n$ stand for the statistical experiment of drawing a simple random sample of $n$ tickets from the urn A without replacement and then recording the sample as a set of $n$ numbers $x_1 < x_2 < \cdots < x_n$. Suppose we know that $\theta$ is bounded by 50, that is, $\theta \le 50$. This information is a kind of direct information (or prior information). Consider now the hypothetical situation where $E_{25}$ has been performed and has yielded the sample $x = (55, 57, \ldots, 105)$, where 55 and 105 are respectively the smallest and largest numbers drawn. To be specific, with data $(E_n, x)$, $x = (x_1, x_2, \ldots, x_n)$, we know without any shadow of doubt that the true value of $\theta$ must belong to the set $S = \{x_1 - 1, x_1 - 2, \ldots, x_1 - m\}$ where $m = 100 - (x_n - x_1)$. With the information from the data we can now assert that $\theta$ is an integer that lies somewhere in the interval [5, 54]. Combining the direct information and the ‘statistical data’ information, we can conclude that $\theta$ is an integer that lies somewhere in the interval [5, 50].

Now suppose another urn B contains another 100 tickets that are numbered consecutively as $\theta_1 + 1, \theta_1 + 2, \ldots, \theta_1 + 100$ where $\theta_1$ is another unknown number. But we know that the difference between $\theta$ and $\theta_1$ is smaller than 5, meaning that $|\theta - \theta_1| < 5$. Suppose now that we draw a simple random sample of 5 tickets from the urn B and find the sample $x' = (51, 80, \ldots, 149)$. With this information, we can assert that $\theta_1 = 50$ or 49. Now with the information from the data $x'$, we can say that $\theta$ is an integer that lies somewhere in the interval [45, 54]. This is the information from the data $x'$ for the parameter $\theta$.

Combining the information from these different sources, we can finally conclude that $\theta$ is an integer somewhere in the interval [45, 50].

From the above example, the data $x'$ contain some useful information for the parameter $\theta$. But $x'$ are not from the experiment $E$ with the parameter $\theta$ (the experiment we plan and perform).
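The interval arithmetic of Example 2.1 can be checked with a short sketch. The function name `feasible_theta` and the variable names are hypothetical, introduced only for illustration; only the smallest and largest ticket of each sample are needed, so the samples are represented by those two values.

```python
def feasible_theta(sample, n_tickets=100):
    """Integers t such that every drawn ticket lies in t+1, ..., t+n_tickets."""
    lo = max(sample) - n_tickets   # the largest ticket is at most t + n_tickets
    hi = min(sample) - 1           # the smallest ticket is at least t + 1
    return set(range(lo, hi + 1))

# Exact sample from urn A: smallest ticket 55, largest ticket 105.
exact = feasible_theta([55, 105])

# Relevant sample from urn B: smallest ticket 51, largest ticket 149.
theta1 = feasible_theta([51, 149])

# |theta - theta1| < 5 for integers means theta1 - 4 <= theta <= theta1 + 4.
relevant = {t for t1 in theta1 for t in range(t1 - 4, t1 + 5)}

# Direct information: theta <= 50.
combined = {t for t in exact & relevant if t <= 50}

print(min(exact), max(exact))        # 5 54
print(sorted(theta1))                # [49, 50]
print(min(relevant), max(relevant))  # 45 54
print(min(combined), max(combined))  # 45 50
```

The three sets correspond to the exact sample information, the relevant sample information and their combination with the direct information, in the order in which the example derives them.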
The $x'$ come from the experiment with the unknown parameter $\theta_1$, which has some relation with the parameter $\theta$ (in this example, $|\theta - \theta_1| < 5$). This is the difference between $x'$ and $x$. Classical statistical inference usually focuses on the information from $x$ alone.

In this chapter, we focus on the discussion of this third source of information. We strongly support the use of this information for statistical inference when it is available.

In Section 2.2, we try to classify information about the unknown parameter into two main types. The first is “direct information” and the second we call “sample information” (as defined in Section 2.2). We classify sample information into exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. Relevant sample information has only been used in some special contexts of statistics. But a systematic theory is not available and needs to be developed.

We have used the word information many times in this section. But what is information? And how should information be measured? As Basu (1975) remarks, no other concept in statistics is more elusive in its meaning and less amenable to a generally agreed definition. The measure of information plays a very important role in statistics and communication science. A lot of well-known work has been done in this area (see Fisher (1925), Shannon (1948), Wiener (1948) and Kullback (1959)). We will review three of the most important information measures in Section 2.3. All these information measures are used for direct information and exact sample information for different purposes.

In Section 2.4, we discuss how to measure the relevant sample information. A possible generalization of Fisher information is proposed.

The use of the relevant sample information in the test of hypotheses or discrimination will be considered in Section 2.5.
We propose a reasonable measure of information for discrimination. This measure is a generalization of the Kullback-Leibler information measure.

2.2 Classification of Information

Statistics is concerned with the collection of data and with their analysis and interpretation. Here we do not consider the problem of data collection but take the data as given and ask what they tell us. The answer depends not only on the data, but also on background knowledge of the situation; the latter is formalized in the assumptions with which the analysis is entered. We review two traditional principal lines of approach.

Classical inference and decision theory. The observations are now postulated to be the values taken on by random variables which are assumed to follow a joint probability distribution, $P$, belonging to some known class $\mathcal{P}$. Frequently, the distributions are indexed by a parameter, say $\theta$, taking values in a set, $\Omega$, so that $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$. The aim of the analysis is then to specify a plausible value for $\theta$ (this is the problem of point estimation) or at least to determine a subset of $\Omega$ of which we can plausibly assert that it does, or does not, contain $\theta$ (estimation by confidence sets or hypothesis testing). Such a statement about $\theta$ can be viewed as a summary of the information provided by the data and may be used as a guide to action. In this approach, statistical inference about $\theta$ is based on both this directly associated sampling model (here $\{P_\theta, \theta \in \Omega\}$) and the observations.

Bayesian analysis. In this approach, it is assumed in addition that $\theta$ is itself a random variable (though unobservable) with a known distribution. This prior distribution (specified prior to the availability of the data) is modified in light of the data to determine a posterior distribution (the conditional distribution of $\theta$ given the data), which summarizes what can be said about $\theta$ on the basis of assumptions made and the data.
Bayesian inference about the parameter $\theta$ is thus based on the directly associated model, the prior distribution and the data.

It is frequently reasonable to assume that we get some other observations which are not from $P_\theta$, but from some $P_{\theta_1}$ where $P_{\theta_1}$ is related to $P_\theta$. These observations do contain some information about $\theta$. But the above two traditional principal lines of approach do not include these observations.

Example 2.2 Assume that we wish to estimate the probability (a parameter $\theta_A$) of a penny A showing heads when spun on a flat surface. We usually consider $n$ spins of the penny as a set of $n$ binomial trials with an unknown probability $\theta_A$ of showing heads. Suppose, however, that we have $m$ spins of the penny B. If we believe this penny B is similar to penny A (meaning $\theta_A$ and $\theta_B$ are close to each other), then to estimate $\theta_A$ it might be reasonable to use the information from the $m$ spins of penny B.

The above discussion leads to the following classification of information about the unknown object of inference (here the parameter $\theta$).

Definition 2.1 (Direct information). All the information which is directly related to the unknown object of inference (parameter) is called direct information of the parameter $\theta$.

Definition 2.2 (Sample information). We call sample information the information about $\theta$ from the sample or the data. If the sample is from the experiment (model) which is direct to the parameter $\theta$, we call it an exact sample for this parameter $\theta$. The information in the exact sample is called exact sample information. The sample which is from the experiment (model) related to the parameter $\theta$ (not direct) is called a relevant sample for the parameter $\theta$. The relevant sample information is defined as the information from a relevant sample.

As in Example 2.1, $\theta \le 50$ is direct information about $\theta$. The data of $E_{25}$ are exact sample information. The data drawn from urn B are relevant sample information.
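For the two pennies of Example 2.2, one way to use the relevant sample is to discount it with a relevance weight that enters the likelihood as an exponent, in the spirit of the relevance weighted likelihood outlined in Chapter 1. The sketch below is only an illustration under stated assumptions: the single weight `w` on penny B, the function name and the spin counts are all hypothetical and are not taken from the thesis.

```python
def weighted_binomial_estimate(heads_a, n_a, heads_b, n_b, w):
    """Maximizer over theta of
    [theta^heads_a (1-theta)^(n_a-heads_a)] * [theta^heads_b (1-theta)^(n_b-heads_b)]^w,
    where w in [0, 1] is an assumed relevance weight for penny B."""
    return (heads_a + w * heads_b) / (n_a + w * n_b)

# Hypothetical data: 6 heads in 10 spins of penny A, 45 heads in 100 spins of penny B.
for w in (0.0, 0.5, 1.0):
    print(w, weighted_binomial_estimate(6, 10, 45, 100, w))
```

With `w = 0` the estimate uses the exact sample only (the proportion of heads for penny A), and with `w = 1` the two samples are pooled as if they were both exact; intermediate weights trade the extra information in penny B against its possible bias.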
In classical inference and decision theory, statistical inference about $\theta$ is based on both the directly associated sampling model $\{P_\theta, \theta \in \Omega\}$ (direct information) and the observations (exact sample information). The Bayesian approach is based on the directly associated sampling model (direct information), the prior distribution (direct information) and the data (exact sample information).

In some cases, we may have relevant sample information about the inferential objective. This is well illustrated by the examples of the following chapters. Examples 3.1, 3.2, 3.6 and 6.1 all indicate that relevant sample information is available and useful.

In the following examples, we show how information is classified.

Example 2.3 Linear Model. Observations $y$, considered as an $n \times 1$ column vector, are a realization of the random vector $Y$ with $E(Y) = x\beta$, where $x$ is a known $n \times q$ matrix of rank $q \le n$, and $\beta$ is a $q$-dimensional column vector of unknown parameters. The components of $\beta$ are the parameters of interest. Because the model involves the unknown parameters directly, the information from the observations $y$ about $\beta$ is exact sample information.

More generally, for the Generalized Linear Model (see McCullagh and Nelder 1989), the information from the sample is still exact sample information, because in that case the experiment involves the unknown parameters directly.

Example 2.4 Let $X$ be from $N(\theta, 1)$, and let the estimand be $\theta$; $\theta$ has a convenient prior distribution, say $N(0, 1)$. Now let us assume $Y$ is from $N(\theta_1, 1)$, with $\theta_1$ unknown. But we do know that $\theta - \theta_1$ has a prior distribution, say $N(0, 1)$.

The data value $X$ is an exact sample for $\theta$. The prior distribution of $\theta$ is direct information for $\theta$. The data value $Y$ is a relevant sample for $\theta$.

The statistical methods using direct information and exact sample information are well developed. We can easily find these methods in standard textbooks. However, there are no systematic methods for using relevant sample information.
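In the special setting of Example 2.4, a standard normal-normal calculation shows concretely that the relevant sample $Y$ carries information about $\theta$. The sketch below rests on an assumption not stated in the example: that $\theta - \theta_1$ is independent of $\theta$ and of the sampling errors, so that $Y$ given $\theta$ is $N(\theta, 2)$. The function name and the data values are hypothetical.

```python
def posterior_theta(x, y=None):
    """Posterior mean and variance of theta by precision weighting.

    Each source is (value, precision): the prior N(0, 1), the exact sample
    X | theta ~ N(theta, 1) and, optionally, the relevant sample
    Y | theta ~ N(theta, 1 + 1) under the independence assumption above.
    """
    sources = [(0.0, 1.0), (x, 1.0)]
    if y is not None:
        sources.append((y, 0.5))
    precision = sum(p for _, p in sources)
    mean = sum(v * p for v, p in sources) / precision
    return mean, 1.0 / precision

print(posterior_theta(1.0))       # prior and exact sample only
print(posterior_theta(1.0, 2.0))  # relevant sample added
```

Under these assumptions the relevant sample adds precision 1/2, half of what an exact observation contributes, so the posterior variance falls from 0.5 to 0.4 when $Y$ is used.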
2.3 Measure of Information

In Section 2.2, we classified the different types of information. But what is information? How can we measure it? The answers to these questions seem controversial. The definition of information or entropy goes back to around 1870 and a series of papers by L. Boltzmann. Since then, statisticians have proposed many different definitions for different purposes. We now review the three most important definitions of statistical information.

(I) Shannon and Wiener Information (Entropy).

The statistical interpretation of thermodynamic entropy, a measure of the unavailable energy within a thermodynamic system, was developed by L. Boltzmann around 1870. His first contribution was the observation of the monotone decreasing behavior in time of a quantity defined by

$$E = \int f(x,t) \log\{f(x,t)\}\,dx,$$

where $f(x, t)$ denotes the frequency distribution of the number of molecules with energy between $x$ and $x + dx$ at time $t$ (Boltzmann, 1872). When the distribution $f$ is defined in terms of the velocities and positions of the molecules, the above quantity takes the form

$$E = \int f \log f \,dx\,dy,$$

where $x$ and $y$ denote the vectors of the position and velocity, respectively. Boltzmann showed that for some gases this quantity, multiplied by a negative constant, was identical to the thermodynamic entropy.

Shannon (1948) proposed the definition of the entropy of a probability distribution (the negative of the above quantity):

$$H = -\int p(x) \log p(x)\,dx, \tag{2.1}$$

where $p(x)$ denotes the probability density with respect to the measure $dx$. Shannon entropy plays a very important role in modern communication theory. There are almost uncountably many papers and books about the use of Shannon entropy. The quantity $H$ is simply referred to as a measure of information, or uncertainty, or randomness. This definition may be used to measure direct information when the direct information is a Bayesian prior. It may also be used in the case of a posterior distribution.
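As a numerical check on definition (2.1), the sketch below approximates $H$ for a standard normal density by a midpoint rule and compares it with the known closed form $\frac{1}{2}\log(2\pi e)$. The function names, the integration range and the number of steps are illustrative choices, not part of the thesis.

```python
import math

def normal_pdf(x, sigma=1.0):
    """Density of N(0, sigma^2)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def shannon_entropy(pdf, lo, hi, steps=20000):
    """Midpoint-rule approximation of H = -integral of p(x) log p(x) dx on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lo + (i + 0.5) * width)
        if p > 0.0:  # skip points where the density underflows to zero
            total -= p * math.log(p) * width
    return total

h_numeric = shannon_entropy(normal_pdf, -10.0, 10.0)
h_exact = 0.5 * math.log(2.0 * math.pi * math.e)
print(h_numeric, h_exact)
```

The same function can be applied to a prior or a posterior density, which is the use of (2.1) suggested in the text.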
However, Savage (1954, page 50) remarks: ‘The ideas of Shannon and Wiener, though concerned with probability, seem rather far from statistics. It is, therefore, something of an accident that the term ‘information’ coined by them should be not altogether inappropriate in statistics.’

(II) Fisher’s Information.

R. A. Fisher’s (1925) measure of the amount of information supplied by data about an unknown parameter is well known to statisticians. This measure is the first use of ‘information’ in mathematical statistics, and was introduced especially for the theory of statistical estimation. For a real parameter and a density function satisfying the Cramér-Rao regularity conditions it has the form
$$I(\theta) = E_\theta\left[\left\{\frac{\partial}{\partial\theta}\log f(X,\theta)\right\}^2\right]. \qquad (2.2)$$
Fisher’s information has many desirable properties as a measure of information: (i) Fisher’s information, being specific to a parameter, stays the same if we reduce to a sufficient statistic; (ii) Fisher’s information is additive over different sets of independent data; and (iii) Fisher’s information gives a lower variance bound for the estimation of the parameter provided some regularity conditions are satisfied.

(III) Kullback-Leibler’s Information.

Kullback and Leibler (1951) consider a definition of information for ‘discrimination in favor of $H_1$ against $H_2$’. Here $H_i$, $i = 1, 2$, is the hypothesis that $X$ is from the statistical population with probability measure $\mu_i$, with $d\mu_i(x) = f_i(x)\,dx$. They define the logarithm of the likelihood ratio, $\log\{f_1(x)/f_2(x)\}$, as the information in $X = x$ for discrimination in favor of $H_1$ against $H_2$. The mean information for discrimination in favor of $H_1$ against $H_2$ per observation from $H_1$ is defined as
$$I(1:2) = \int f_1(x)\log\frac{f_1(x)}{f_2(x)}\,dx. \qquad (2.3)$$
This definition is a departure from Shannon and Wiener’s information. It is widely used in statistics for discrimination, as can easily be seen from Kullback (1959).

The above three definitions of information are all about direct information and exact sample information.
This is because classical statistics is focused on these two types of information. In the following two sections, we discuss information measures for relevant sample information.

2.4 Fisher’s Information of Relevant Sample in Point Estimation

It is well known that Fisher’s information plays a very important role in the theory of statistical estimation. In the last section, we discussed Fisher’s information for exact samples. Now we propose a possible generalization of Fisher’s information to the relevant sample case. The bias function and information function in Chapters 4, 5, and 7 offer other possible generalizations.

Let us begin with the simplest case. Assume that $X$ has density $f(x, \theta_0)$, where $\theta_0$ is a real parameter. Suppose $f(x, \theta_0)$ satisfies the Cramér-Rao regularity conditions; then the Fisher’s information for $\theta_0$ from $X$, as defined in Section 2.3, is (2.2).

We are interested in the parameter $\theta$. As we claimed in Section 2.2, there is some information in $X$ for the parameter $\theta$ if we know that $|\theta - \theta_0| \le c$ for some constant $c$. It is then natural to ask how to measure the information in $X$ for the parameter $\theta$.

Definition 2.3 The information for the parameter $\theta$ in $X$ is defined as
$$\mathrm{Inf}_X(\theta, \theta_0) = (\mathrm{Inf}_X(\theta_0),\ \theta - \theta_0).$$
Here $\mathrm{Inf}_X(\theta_0)$ is the Fisher’s information.

In the above definition, the information in $X$ for the parameter $\theta$ contains two parts: one is the information part, the other is the bias part. If the bias part is 0, which means that $\theta = \theta_0$, then the above information measure becomes Fisher’s information. If we know the bias exactly, then the bias can be eliminated, because we can then transform the parameter. Generally, if $\theta$ and $\theta_0$ have a one-to-one relationship, then we can always transform the parameter. This will be the case of exact sample information.

Now we discuss the properties of the above definition of information.
In the following discussion, we always assume the bias is the same, unless we specify otherwise. When we say $\mathrm{Inf}_X(\theta, \theta_0)$ equals $\mathrm{Inf}_Y(\theta, \theta_1)$, we mean $\theta_0 = \theta_1$ and $\mathrm{Inf}_X(\theta_0) = \mathrm{Inf}_Y(\theta_1)$. Only when the bias is the same can we order the two information indices.

• a. $\mathrm{Inf}(\theta, \theta_0)$ is independent of the $\sigma$-finite measure $\mu$.

As we know, $\mathrm{Inf}(\theta, \theta_0)$ is calculated from $f(x, \theta_0) = dP_{\theta_0}(x)/d\mu$. If we can find another $\sigma$-finite measure $\nu$ such that $\{P_\theta : \theta \in \Theta\} \ll \nu$, then we can replace $f(x, \theta_0)$ by $f^*(x, \theta_0) = dP_{\theta_0}(x)/d\nu$. The value of $\mathrm{Inf}(\theta, \theta_0)$ is not changed. The proof is easy and we omit it here. □

• b. The information of several independent observations is the sum (appropriately defined) of the information in these observations, if the biases of these observations are the same.

Mathematically, suppose $X_1, \dots, X_k$ are independent and $X = (X_1, \dots, X_k)$. If $f_i(x_i, \theta_0)$ is the density of $X_i$, and these densities satisfy the Cramér-Rao regularity conditions, then $f(x, \theta_0) = f_1(x_1, \theta_0) \cdots f_k(x_k, \theta_0)$ satisfies the Cramér-Rao regularity conditions, and
$$\mathrm{Inf}(\theta, \theta_0) = \{\mathrm{Inf}_1(\theta_0) + \cdots + \mathrm{Inf}_k(\theta_0),\ \theta - \theta_0\}, \qquad (2.4)$$
where $\mathrm{Inf}_i(\theta, \theta_0)$ is the information function for $\theta$ in $X_i$. The proof is entirely analogous to that for Fisher’s information, and we omit it here. □

• c. The information will not increase when we transform the data.

Let $Y = T(X)$ be a statistic, that is, $T$ is a function with domain $\mathcal{X}$ and range $\mathcal{Y}$, and let $\mathcal{T}$ be an additive class of subsets of $\mathcal{Y}$. We assume that $T$ is measurable. Let $g(t, \theta_0)$ be the density of $T(X)$. If $f(x, \theta_0)$ and $g(t, \theta_0)$ satisfy the Cramér-Rao regularity conditions, then
$$\mathrm{Inf}_X(\theta, \theta_0) \ge \mathrm{Inf}_T(\theta, \theta_0), \qquad (2.5)$$
where $(x_1, y) \ge (x_2, y)$ means $x_1 \ge x_2$. We omit the proof because it is a direct consequence of the corresponding result for Fisher’s information. □

• d. Under the conditions of c), we have inequality in (2.5), with equality if and only if the statistic $Y = T(X)$ is sufficient for the parameter $\theta_0$. The proof is omitted. □

In connection with the basic properties of information, we have the following comments.

1.
From (2.4), when we have $n$ iid observations from $f(x, \theta_0)$, then $\mathrm{Inf}(\theta, \theta_0) = (n\,\mathrm{Inf}(\theta_0),\ \theta - \theta_0)$.

2. Property d) above implies that we should use the sufficient statistic in the statistical analysis for the parameter $\theta$, although this statistic is sufficient for the population indexed by the parameter $\theta_0$. This result tells us how to reduce the dimension of the data.

3. If we make a one-to-one transformation of the parameter, that is, $\eta = h(\theta)$ with $h$ differentiable, the information that $X$ contains about $\eta$ is
$$\mathrm{Inf}_X(\eta, \eta_0) = \left(\mathrm{Inf}_X(\theta_0)/[h'(\theta_0)]^2,\ \eta - \eta_0\right),$$
where $\eta_0 = h(\theta_0)$.

Now we discuss the generalized information inequality, which extends the information inequality to the relevant sample case.

Theorem 2.1 (Relevant Information Inequality) Suppose that the Cramér-Rao regularity conditions are satisfied. Let $\delta$ be any statistic with $E_{\theta_0}(\delta^2) < \infty$ for which the derivative with respect to $\theta_0$ of $h(\theta_0) = E_{\theta_0}(\delta)$ exists and can be obtained by differentiating under the integral sign. Then
$$E_{\theta_0}(\delta - \theta)^2 \ge \frac{[h'(\theta_0)]^2}{\mathrm{Inf}(\theta_0)} + (h(\theta_0) - \theta)^2. \qquad (2.6)$$

Proof. The result follows directly from $E_{\theta_0}(\delta - \theta)^2 = \mathrm{Var}(\delta) + (h(\theta_0) - \theta)^2$ and the Information Inequality. □

Usually we can control the value of $|\theta - \theta_0|$, but not $|h(\theta_0) - \theta|$. We would therefore like to choose $h(\theta_0) = \theta_0$. From the above theorem, we obtain the following.

Corollary 2.1 Under the conditions of Theorem 2.1, for any unbiased estimator $\delta$ of $\theta_0$,
$$E_{\theta_0}(\delta - \theta)^2 \ge \mathrm{Inf}^{-1}(\theta_0) + (\theta_0 - \theta)^2. \qquad (2.7)$$

From the Corollary, we can see that our definition of information for relevant samples is reasonable. The lower bound of the mean square error depends both on the information and on the bias.

Now we consider how to combine the information from relevant samples having different bias parts. Let $X$ have density $f(x, \theta_0)$ and $Y$ the density $g(y, \theta_1)$. Both $f(x, \theta_0)$ and $g(y, \theta_1)$ satisfy the Cramér-Rao regularity conditions, and $\theta$ is the parameter of interest. Both $X$ and $Y$ contain some information about $\theta$, so we need to combine their information. From Definition 2.3, we get $\mathrm{Inf}_X(\theta, \theta_0)$ and $\mathrm{Inf}_Y(\theta, \theta_1)$.
Then the information from $(X, Y)$ is defined as $\{I(\theta_0, \theta_1), B(\theta_0, \theta_1)\}$; here $I(\theta_0, \theta_1) = \mathrm{diag}(\mathrm{Inf}_X(\theta_0), \mathrm{Inf}_Y(\theta_1))$ and $B(\theta_0, \theta_1) = (\theta - \theta_0,\ \theta - \theta_1)^t$. This is similar to Fisher’s information for the multiparameter case except for our inclusion of the bias. We can easily obtain the following result.

Theorem 2.2 Let $\delta(X, Y)$ be any statistic with $E(\delta(X, Y)^2) < \infty$ such that the derivatives with respect to $\theta_0$ and $\theta_1$ of $h(\theta_0, \theta_1) = E(\delta(X, Y))$ exist and can be obtained by differentiating under the integral sign. Then
$$E(\delta(X, Y) - \theta)^2 \ge \beta^t I^{-1}(\theta_0, \theta_1)\beta + (h(\theta_0, \theta_1) - \theta)^2, \qquad (2.8)$$
where $\beta^t = (\partial h(\theta_0, \theta_1)/\partial\theta_0,\ \partial h(\theta_0, \theta_1)/\partial\theta_1)$.

We stop this section here. Further theory is under development but not yet complete.

2.5 Kullback and Leibler’s Information in Relevant Sample in Discrimination

Let us begin with the following example.

Example 2.5 Simple null hypothesis. Assume $X_1$ is a sample from $N(\theta, 1)$. The null hypothesis is $H_0: \theta = 0$ and the alternative is $H_a: \theta = 2$. From the Neyman-Pearson Lemma, we can easily get the most powerful test of level $\alpha = .05$: if $X_1 > 1.645$, we reject the null hypothesis $H_0$; otherwise, we accept $H_0$. The power of this test is .639.

Now suppose we get another sample $X_2$ from $N(\theta_0, 1)$, and we know that $|\theta - \theta_0| \le .5$. We construct a new test: if $X_1 + X_2 > 2.826$, we reject the null hypothesis $H_0$; otherwise we accept $H_0$. For the second test, we have $\sup P_{H_0}(\text{reject } H_0) \le .05$ and $\inf P_{H_a}(\text{reject } H_0) \ge .683 > .639$. So the second test is more powerful than the first one. This means the observation $X_2$ contains some information about the simple null hypothesis.

Example 2.5 tells us that the relevant sample is useful for testing. In this section, we consider the information of the relevant sample for discrimination. We use an idea similar to that underlying Kullback-Leibler information.

As we know, Kullback and Leibler (1951) define the logarithm of the likelihood ratio, $\log\{f_1(x)/f_2(x)\}$, as the information in $X = x$ for discrimination in favor of $H_1$ against $H_2$.
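The error rates quoted in Example 2.5 can be checked numerically. The sketch below is only such a check under the normal models stated in the example; the helper names are illustrative. It uses the fact that the rejection probability of the second test is increasing in $\theta_0$, so the worst cases are $\theta_0 = .5$ under $H_0$ and $\theta_0 = 1.5$ under $H_a$.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def reject_prob_single(theta, cutoff=1.645):
    """P(X1 > cutoff) when X1 ~ N(theta, 1)."""
    return 1.0 - norm_cdf(cutoff - theta)

def reject_prob_sum(theta, theta0, cutoff=2.826):
    """P(X1 + X2 > cutoff) for independent X1 ~ N(theta, 1), X2 ~ N(theta0, 1)."""
    # The sum X1 + X2 is N(theta + theta0, 2).
    return 1.0 - norm_cdf((cutoff - theta - theta0) / math.sqrt(2.0))

power_test1 = reject_prob_single(2.0)     # first test under Ha: theta = 2
size_test2 = reject_prob_sum(0.0, 0.5)    # worst case under H0 (theta0 = .5)
power_test2 = reject_prob_sum(2.0, 1.5)   # worst case under Ha (theta0 = 1.5)
```

Up to rounding, these three quantities reproduce the values .639, .05 and .683 given in the example.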
Now we suppose that the sample $X$ is from some density $g_1(x)$ which may have some relationship to $\{f_1(x), f_2(x)\}$ (that is, $X$ is a relevant sample).

Definition 2.4 The mean information for discrimination in favor of $H_1$ against $H_2$ per observation from $g_1(x)$ is defined as
$$I(1:2; X) = \int g_1(x)\log\frac{f_1(x)}{f_2(x)}\,dx. \qquad (2.9)$$
□

This definition is a departure from Kullback and Leibler information, which is about the sample from $f_1(x)$; we generalize it to the relevant sample case. If $I(1:2; X) > 0$, then the sample from $g_1(x)$ favors $H_1$.

We can easily see that
$$I(1:2; X) = \int g_1(x)\log\frac{g_1(x)}{f_2(x)}\,dx - \int g_1(x)\log\frac{g_1(x)}{f_1(x)}\,dx. \qquad (2.10)$$
The term $\int g_1(x)\log\{g_1(x)/f_1(x)\}\,dx$ is the bias part of this information. If $g_1(x) = f_1(x)$, then the bias part vanishes.

Now we discuss the properties of this information.

• a. $I(1:2; X)$ is independent of the $\sigma$-finite measure $\mu$. This is similar to property a) in Section 2.4.

• b. The information can be negative, and a negative value means that the sample favors $H_2$.

• c. The information for discrimination in independent observations is additive. This means that if we have independent observations $X_1, \dots, X_k$ from densities $g_1(x), \dots, g_k(x)$ and let $X = (X_1, \dots, X_k)$, then
$$I(1:2; X) = I(1:2; X_1) + \cdots + I(1:2; X_k).$$
The proof is easy and we omit it here.

2.6 General Remarks

As suggested in Section 2.2, there are three types of information. What is the relationship among these types of information? The relevant sample information contains two parts: one is the relation between the two experiments (models); the other is the observations. The relation between two experiments is a kind of direct information with an unknown parameter. When the number of relevant samples goes to infinity, the relevant sample information becomes direct information. For example, in Example 2.2, the similarity of $\theta_A$ and $\theta_B$ indicates a relationship between the two experiments. When $m$ (the number of spins of penny B) goes to infinity, we can get an exact value of $\theta_B$.
Then the relevant sample information would mean that $\theta_A$ is close to some known value (this is direct information).

The relation between $\theta_A$ and $\theta_B$ can be of several types. Here we list some of them: (1) $|\theta_A - \theta_B| \le c$ for some constant $c$ (Example 3.1); (2) $\theta_A - \theta_B$ is small (Example 2.2 and Example 3.2); and (3) $\theta_A - \theta_B$ is a random variable with a known distribution.

The information measures proposed in Sections 2.4 and 2.5 need further study.

Chapter 3 Relevance Weighted Likelihood Estimation

3.1 Introduction

In this chapter we generalize the classical likelihood to the Relevance Weighted Likelihood (REWL). The REWL arises in parametric inference when, in addition to (or instead of) the sample from the study population, relevant but independent samples from other populations are available. By down-weighting these samples according to their relevance, the REWL incorporates the information they contain.

We have characterized such “relevant” sample information in Chapter 2. To motivate the likelihood theory presented below, we merely illustrate situations where such information arises.

Example 3.1 Let $Y_i \sim N(\mu_i, 1)$, $i = 1, 2$, be independent random variables, the $\{\mu_i\}$ being unknown parameters. We want to estimate $\mu_1$. Two estimators present themselves.

(i) Classical likelihood-based estimation theory suggests the MLE, which uses just $Y_1$. Then we get
$$\hat\mu_1 = Y_1. \qquad (3.1)$$

(ii) However, if $\mu_2$ were deemed to be “close” to $\mu_1$, intuition suggests we use the information in $Y_2$ in some way. Yet the classical theory still yields the result in (i). Even when we add the structural condition that $|\mu_1 - \mu_2| \le c$ for a specified constant $c > 0$, the MLE still uses just $Y_1$ unless the condition $|Y_1 - Y_2| \le c$ is violated. If that condition fails, $\hat\mu_1 = \{Y_1 + (Y_2 - c)\}/2$ or $\hat\mu_1 = \{Y_1 + (Y_2 + c)\}/2$ according as $Y_1 < (Y_2 - c)$ or $Y_1 > (Y_2 + c)$. So then the MLE does bring $Y_2$ into the estimation of $\mu_1$, but it does so crudely, only through truncation.
So instead we turn to a seemingly more natural alternative which uses $Y_2$ more fully:
$$\tilde\mu_1 = \frac{1+c^2}{2+c^2}\,Y_1 + \frac{1}{2+c^2}\,Y_2. \qquad (3.2)$$
Under the mean squared error criterion, we find
$$E(\hat\mu_1 - \mu_1)^2 = 1 > E(\tilde\mu_1 - \mu_1)^2. \qquad (3.3)$$
From (3.3), we conclude that the estimator (3.2) based on both $Y_1$ and $Y_2$ always has a smaller mean squared error than the estimator (3.1) based on $Y_1$ alone.

The last example demonstrates that in certain cases we can profitably incorporate the information from samples drawn from populations different from that under study. In other words, all “relevant” information should be used in inferences about the parameters of interest.

Example 3.2 Nonparametric Regression Model. If $n$ data points $\{(X_i, Y_i)\}$ have been collected, the regression relationship between $Y$ and $X$ can be modeled as
$$Y_i = m(X_i) + \epsilon_i, \quad i = 1, \dots, n, \qquad (3.4)$$
using the unknown regression function $m$ and observation errors $\epsilon_i$. Assume that $\epsilon_1, \dots, \epsilon_n$ are iid from some unknown density function $f(x)$ with $E(\epsilon_i) = 0$. The function $m$ is the parameter we want to estimate.

About the situation embraced by this last example, Eubank (1988, p. 7) says “If m is believed to be smooth, then the observations at $X_i$ near x should contain information about the value of m at x. Thus it should be possible to use something like a local average of the data near x to construct an estimator of m(x).” Reasoning like this, and the last example itself, motivates our work.

Developments of recent years in the theory and application of nonparametric regression validate Eubank’s argument. These developments show nonparametric regression to be a useful explanatory and diagnostic tool. Eubank (1988), Härdle (1990) and Müller (1988) discuss nonparametric regression where observations at $X_i$ near $x$ are used to infer $m$ at $x$ because they contain relevant information.

However, the domain encompassed by the heuristics of nonparametric regression theory appears much broader than that currently encompassed by that theory.
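The “local average” heuristic quoted above can be made concrete with a minimal sketch. It is an illustration only, not one of the estimators developed in later chapters: the uniform window, the bandwidth, the toy data and the function name `local_average` are all assumptions introduced here. The estimator has the familiar kernel-weighted (Nadaraya-Watson) form, in which each observation $Y_i$ receives a weight depending on how close $X_i$ is to $x$.

```python
def local_average(x0, xs, ys, bandwidth, kernel=None):
    """Kernel-weighted local average of ys at the point x0."""
    if kernel is None:
        # Uniform window: observations within one bandwidth of x0 count equally.
        kernel = lambda u: 1.0 if abs(u) <= 1.0 else 0.0
    weights = [kernel((x - x0) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within the window around x0")
    return sum(w * y for w, y in zip(weights, ys)) / total

# Noise-free toy data with m(x) = 2x on an equally spaced design (hypothetical).
xs = [i / 10 for i in range(11)]
ys = [2.0 * x for x in xs]
m_hat = local_average(0.5, xs, ys, bandwidth=0.25)
```

With noise-free linear data and a window that is symmetric about an interior design point, the local average recovers $m(0.5) = 1$ up to rounding; with noisy data the bandwidth governs the usual trade-off between bias and variance.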
In fact, Example 3.2 immediately suggests a number of unanswered questions. (i) If the error distribution associated with (3.4) had a known (parametric) form with unspecified parameters, how would all “relevant” information be used in inference about $m(x)$? More specifically, is there a likelihood-based approach which would permit the use of that information? (ii) If we were interested in estimating some unknown $x$-population attribute other than its mean, $m(x) = E(Y \mid X = x)$, what could we do, either in the parametric case of (i) or more generally in the nonparametric case? (iii) How can the information about $m(x)$ in the observations at $X_i$ near $x$ be described or quantified?

We have explored possible answers to question (iii) in Chapter 2. We will address question (ii) for the nonparametric case in Chapter 6; a solution for the parametric case derives implicitly from the theory of this chapter. Finally, the method of this chapter gives an answer to question (i) (see details in Chapter 5).

In this chapter, we propose the idea of the relevance weighted likelihood (REWL) for relevant sample situations. In Section 3.2, we construct the REWL function. After looking at several examples, we discuss there the relationship between the usual and the relevance weighted likelihood functions.

For the traditional purpose of data reduction, we define “weakly” sufficient statistics using the REWL in Section 3.3 (weakly because of their dependence on the relevance weights chosen in the construction of the REWL). We show that weakly sufficient statistics have some of the properties of their sufficient relatives.

In Section 3.4, we introduce the MREWLE, the maximum relevance weighted likelihood estimator. The MREWLEs obtained in some specific applications are new. In others, the theory merely enables us to rederive known estimators, albeit from a more basic starting point.

We can justify the idea of the REWL by appealing to entropy and prediction, much as in the classical case.
This we do in Section 3.5. We first extend there the entropy maximization principle to embrace all relevant samples. The resulting extension then yields the relevance weighted likelihood principle.

We use the relevance weighted likelihood under normal theory assumptions in Section 3.6. We show that maximizing the REWL yields a reasonable estimator. In this example, the MREWLE has an advantage over the MLE, which is available in some special cases. The application of MREWL estimation to generalized smoothing models appears in Chapter 5. Some general remarks are made in Section 3.7.

3.2 The Relevance Weighted Likelihood

3.2.1 Definition.

Let $y = (y_1, \dots, y_n)$ denote a realization of the random vector $Y = (Y_1, \dots, Y_n)$ and $f_i \in F_i$, $i = 1, \dots, n$, the unknown probability density functions (PDFs) of the $Y_i$, which are assumed to be independent. We are interested in the PDF $f \in F$ of a study variable $X$ measurable on items of our study population (with PDF $f$). At least in some qualitative sense, the $f_i$ are thought to be “like” $f$. Consequently the $y_i$’s are thought to be of value in our inferential analysis even though the $Y_i$ are independently drawn from populations different from our study population.

In the familiar paradigm of repeated sampling, we impose the condition $f_i = f$ for all $i$ in deriving the likelihood. In reality, this condition represents an approximation which may be more plausible for some of the $i$’s than others. It may even seem desirable to downweight certain of the likelihood components, $f(y_i)$, in some way, when the quality of the associated approximation seems low. But how should those components be weighted?

A heuristic Bayesian analysis suggests a way of assigning relative weights to the likelihood components. This analysis leads us to the REWL. Suppose we take logarithms of the various PDFs (assumed to be positive for these heuristics) to put them on an affine scale, $(-\infty, \infty)$. In the log-likelihood, the correct term associated with $y_i$ is $\log[f_i(y_i)]$.
If we are to replace this with a term involving only $\log[f(y_i)]$, we might plausibly use the best linear predictor (BLP) of $\log[f_i(y_i)]$ based on $\log[f(y_i)]$. This gives us $f^{p_i}(y_i)$ in place of $f_i(y_i)$ in the likelihood. Here $p_i$ represents the coefficient of $\log f(y_i)$ in the BLP. In other words, $p_i$ represents the covariance between $\log f_i(y_i)$ and $\log f(y_i)$ (if we ignore a multiplicative rescaling factor). [Our analysis also ignores an irrelevant additive factor in the BLP.] This leads us to define the REWL at $y = (y_1, \dots, y_n)$:
$$\mathrm{Wlik}\{f(\cdot), y\} = \prod_{i=1}^{n} f^{p_i}(y_i), \quad \text{for } f \in F. \qquad (3.5)$$

The REWL, like the classical likelihood, allows the data to jointly assess the credibility of any hypothesized candidate $f$ for the role of the study population PDF. But here the $y_i$’s from the study population itself would be given the greatest weight in this joint assessment. As the relevance of the other $y_i$’s (measured by their $p_i$’s) declines, so does the weight accorded to them in the assessment.

Usually it is convenient to work with the natural logarithm, denoted by $\mathrm{Wl}\{f(\cdot), y\}$ and called the log relevance weighted likelihood:
$$\mathrm{Wl}\{f(\cdot), y\} = \sum_{i=1}^{n} p_i \log f(y_i), \quad \text{for } f \in F. \qquad (3.6)$$

Conventionally we take $F_i = F$ for all $i$ and index $F$ by a finite dimensional (unknown) parameter $\theta = (\theta_1, \dots, \theta_q) \in \Omega$. Then $f(t) = f(t; \theta)$, $f(\cdot\,;\cdot)$ having a known form. Then (3.5) and (3.6) become, for $\theta \in \Omega$,
$$\mathrm{Wlik}\{\theta, y\} = \prod_{i=1}^{n} f^{p_i}(y_i; \theta), \qquad \mathrm{Wl}\{\theta, y\} = \sum_{i=1}^{n} p_i \log f(y_i; \theta). \qquad (3.7)$$

3.2.2 Examples.

The following examples illustrate the REWL and reveal differences between the likelihood and the REWL.

Example 3.3 Continuation of Example 3.1. The usual likelihood function in this problem would be
$$\mathrm{lik}(\mu; y) = (2\pi)^{-1}\exp\left(-[(y_1 - \mu_1)^2 + (y_2 - \mu_2)^2]/2\right). \qquad (3.8)$$
This likelihood would ignore prior information such as “$|\mu_1 - \mu_2|$ is small”. Now define the REWL by putting $p_1 = 1$ and $p_2 = 1/(1 + c^2)$ as the relevance weights when inference is about the study population parameter $\mu_1$.
Then
$$\mathrm{Wlik}(\mu_1; y) = \left\{(2\pi)^{-1/2}\exp\left(-\frac{(y_1 - \mu_1)^2}{2}\right)\right\}\left\{(2\pi)^{-1/2}\exp\left(-\frac{(y_2 - \mu_1)^2}{2}\right)\right\}^{1/(1+c^2)}. \qquad (3.9)$$

The likelihood (3.8) contains the parameters of both of the populations from which the data were obtained. None of the information from $Y_2$ would be used to estimate $\mu_1$, even when that information was deemed relevant. In contrast, the REWL in (3.9) contains only the parameter of inferential interest, $\mu_1$, and $Y_2$ would be used in estimating $\mu_1$ to the extent determined by the size of $c$.

Example 3.4 Continuation of Example 3.2. Here define the likelihood to be
$$\prod_{i=1}^{n} f(y_i - m(x_i)). \qquad (3.10)$$
We find the result of no use in estimating $m(x)$. Yet, as we argued earlier, if $m(\cdot)$ were thought to be smooth, observations near $x$ should contain information about the PDF $f(y - m(x))$. The REWL can reflect this heuristic through the relevance weights $p_i$ in
$$\prod_{i=1}^{n} f^{p_i}(y_i - m(x)). \qquad (3.11)$$

The next example differs from the last two in that we allow the relevance weights to depend on the data themselves.

Example 3.5 Robustness. Assume $Y_1, \dots, Y_n$ are iid observations with PDF $f(y, \theta)$ parametrized by $\theta$, so that the likelihood is $\prod_{i=1}^{n} f(y_i, \theta)$.

We may believe some of the $y_i$ to be outliers, effectively coming from some population other than that under study. The information from such data needs to be downweighted. To do this we may: (i) order the data as $y_{(1)}, \dots, y_{(n)}$; and (ii) assign relevance weights $p_i$ depending on the degree to which we regard the associated data as outlying. The REWL becomes
$$\prod_{i=1}^{n} f^{p_i}(y_{(i)}, \theta). \qquad (3.12)$$
In the extreme case, when a fraction $2\epsilon$ are deemed to be outliers, we could choose $p_i = 0$ when $i \le [n\epsilon]$ or $i > [n(1 - \epsilon)]$. [Here $[x]$ denotes the greatest integer less than $x$.] Then the REWL becomes a “trimmed likelihood.” The trimmed mean would then be obtained in certain cases as a MREWLE. This case will be discussed again in Section 3..

In “parametric-nonparametric regression” we would postulate PDFs in parametric form with a smoothly varying regression (mean) function.
Yet other parameters, such as quantiles and variances, may also be of interest in this setting. The following general approach to smoothing through the REWL enables us to deal with this diversity of possible inferential objectives within a unified framework.

Example 3.6 Generalized Smoothing. Suppose $\{(Y_i, X_i)\}$ are $n$ data pairs and, for given $X_i$, $Y_i$ has PDF $f\{y, \theta(X_i)\}$ with parameter $\theta(X_i)$. Interest lies in the study population corresponding to a fixed value $X = x$ and in fitting the associated PDF. The relevance weights $p_i$ enable us to represent the degree to which the information from the populations corresponding to $X_i$ should be used in fitting $f\{y, \theta(x)\}$. The REWL becomes
$$\prod_{i=1}^{n} f^{p_i}\{y_i, \theta(x)\}. \qquad (3.13)$$
Generally, choosing the $\{p_i\}$ will be like choosing a kernel and bandwidth in nonparametric regression theory. Indeed, in the domain of that theory, we can find the $\{p_i\}$ directly from the corresponding kernels [and their bandwidths], making our task easy in that case.

3.2.3 Remarks

Here we discuss the relationship between the likelihood and the REWL. Then we consider some properties of the MLE retained by the REWL.

3.2.1 The likelihood is obtained from the REWL as a special case when all the data are independently drawn from the study population. However, even here there may be a role for the REWL, as Example 3.5 demonstrates.

3.2.2 The likelihood is usually derived from the sampling density by inverting what is fixed and what is varying. In particular, conditional on $f$ (or the parameter of $f$), the likelihood integrates over the sample space to 1. This duality between likelihood and sampling density may be useful for determining the likelihood. However, it does not seem intrinsic. We could, in the usual case of iid sampling, take the $n$th root of the inverted sampling density without apparent loss and without preserving the aforementioned property. Moreover, in the Bayesian framework the sample space need not even be specified, once the data have been obtained.
Yet the likelihood can certainly be defined. So we do not see the lack of duality with sampling as a shortcoming of our proposed extension of the likelihood. The usual asymptotic theory, appropriately modified, still obtains, as shown in the next chapter.

3.2.3 The REWL depends on the relevance weights, and work remains to be done on how these may be chosen. As noted above, in the case of nonparametric regression, standard theory for kernel smoothers suggests reasonable possibilities.

3.2.4 We can easily show that the REWL is preserved under arbitrary differentiable data transformations (with non-vanishing Jacobian) when the sampling densities are absolutely continuous with respect to Lebesgue measure. So the REWL inherits this important property of the likelihood.

3.3 Weakly Sufficient Statistics

3.3.1 Definition.

The very important likelihood principle of classical statistical inference tells us that the likelihood embraces all relevant information about the parameter. Indeed, according to the factorization theorem, sufficiency may be defined through the likelihood. Standard constructions of minimally sufficient statistics rely on the likelihood.

The counterpart of the likelihood theory in our setting would be the REWL principle. Lacking the invertibility of likelihood and sampling density found in the standard frequency-based theory of the likelihood, we must resort to a REWL-based definition of sufficiency. Lacking a basis for claiming that our likelihood captures all relevant information in the data, we call our notion of sufficiency “weak sufficiency”. That notion enables us to reduce the dimension of the observation vector to that of any (weakly) sufficient vector-valued function of the data, while retaining all information in the REWL.

Definition. We call weakly sufficient any vector-valued statistic which determines the REWL up to an arbitrary multiplicative factor which does not depend on $f$.
Weakly minimal-sufficient statistics are functions of every other weakly sufficient statistic. A weakly minimal-sufficient statistic yields the maximal data reduction. Such a statistic need not be unique.

The factorization theorem remains true for weak sufficiency.

Theorem 3.1 A necessary and sufficient condition that $S$ be weakly sufficient for the parametric family $F$, indexed by $\theta$, is that there exist functions $m_1(s, \theta)$ and $m_2(y)$ such that for all $\theta \in \Omega$,
$$\mathrm{Wlik}(\theta, y) = m_1(s, \theta)\,m_2(y). \qquad (3.14)$$

Accepting the REWL as the basis for inference makes reliance on weakly sufficient statistics inevitable. The seemingly reasonable estimators we obtain below depend on the data only through a weakly sufficient statistic, thereby offering “empirical support” for our principle.

Just as the conventional likelihood (regarded as a function) is sufficient, the REWL is weakly sufficient. [This fact follows from the factorization theorem.] However, weakly sufficient statistics lack the property of conventional sufficiency which renders the conditional sampling distribution of the data given a sufficient statistic independent of $\theta$. [Our REWL does not derive from a sampling density function.]

3.4 Maximum Relevance Weighted Likelihood Estimation (MREWLE)

In this section we generalize the MLE.

Definition: Call any $\theta \in \Omega$ which maximizes the REWL a maximum REWL estimator (MREWLE).

Before discussing the properties of the MREWLE, we reconsider one of our examples.

Continuation of Example 3.5. Assume the density $f(y, \theta)$ is that of a normal distribution with mean $\theta$. Then the MREWLE of $\theta$ is
$$\hat\theta = \frac{1}{[n(1-\epsilon)] - [n\epsilon]}\sum_{i=[n\epsilon]+1}^{[n(1-\epsilon)]} y_{(i)}$$
when we choose the $\{p_i\}$ as in Example 3.5. This is a trimmed mean. Other choices yield L-statistics as the MREWLEs.

The MREWLE inherits some of the properties of the MLE.

• Under certain weak conditions, the MREWLE is consistent. We will prove this fact in the next chapter by generalizing the well-known theory of Wald (1949) to the non-iid and more general weighted case.
• The asymptotic normality of the MREWLE under certain conditions is proved in Chapter 4.

• The MREWLE always relies on the data only through weakly sufficient statistics.

• The MREWLE possesses the familiar property of invariance under one-to-one parameter transformations.

• The goal of establishing the asymptotic efficiency of the MREWLE in some appropriate general sense has eluded us. At this time we can give that property only in the special case of nonparametric regression (Chapter 5 and Chapter 7).

3.5 The Entropy Maximization Principle

In a series of papers, Akaike (1977, 1978, 1983, 1985) discusses the importance of the entropy maximization principle in unifying conventional and Bayesian statistics. We generalize this principle to the framework of relevant samples in this section. This generalization enables us to prove that the method of MREWL may be viewed as a realization of that principle, to an important but limited extent.

3.5.1 The generalized entropy maximization principle.

To recall the conventional entropy maximization principle, suppose we draw $x = (x_1, x_2, \dots, x_k)'$ from a multivariate distribution with density $f$. Suppose we intend to estimate $f$ by $g(\cdot : x)$ and view this estimate as a predictive distribution for a future vector $z$ drawn from $f$. As the index of the quality of $g(\cdot : x)$, use the entropy of $f$ with respect to $g(\cdot : x)$:
$$B(f; g) = -\int f(z)\log\frac{f(z)}{g(z : x)}\,dz.$$
The entropy maximization principle asks us to find the $g(\cdot : x)$ which maximizes the expected entropy
$$E_x B(f; g) = \int B(f; g) f(x)\,dx.$$
We may view the result as giving us an “optimum” estimator of $f$, regarded as the object of inferential interest. We note in passing that Fisher’s maximum likelihood method and the AIC (Akaike information criterion) are two very important implications of the entropy maximization principle.

For simplicity of exposition, consider now only the univariate case. [The vector variable case is an obvious generalization.]
Suppose $y_1, \dots, y_n$ are independently drawn from distributions with densities $f_1(y), \dots, f_n(y)$, respectively, which are thought to be related to the density of inferential interest $f$. [The relevance of $f_i$ to $f$ could be described by $B(f; f_i) > -c_i$ for all $i$, where the $\{c_i\}$ are positive constants. This inequality means $f_i$ is not far from $f$. For the special iid case, $f_i = f$ for all $i$ or, equivalently, $B(f; f_i) = 0$ for all $i$.] Let $g(\cdot : y)$ denote an estimate of $f$, where $y = (y_1, \dots, y_n)$. Once again we may view $g$ as an estimated predictive distribution of a future observation $z$ from $f$.

Because the relevance of the $f_i$ to $f$ varies with $i$, we assign different relevance weights, $p_i$, to them. We then get the weighted entropy measure:
$$\sum_{i=1}^{n} p_i B(f_i; g) = -\sum_{i=1}^{n} p_i \int f_i(z)\log\frac{f_i(z)}{g(z : y)}\,dz.$$
[Because we do not know $f$, we choose the above index to force $g$ to lie “close” to the densities we do know, $\{f_i\}$, and which we deem to be close to $f$.]

Our generalized entropy maximization principle may now be stated. All inference about $f$ may be based on the $g$ obtained by maximizing the expected weighted entropy of the predictive distribution, where the expected weighted entropy is
$$E_y \sum_{i=1}^{n} p_i B(f_i; g) = \int \sum_{i=1}^{n} p_i B(f_i; g) f(y)\,dy.$$

3.5.2 The MREWLE and generalized entropy maximization principle.

We know that
$$\sum_{i=1}^{n} p_i B(f_i; g) = \sum_{i=1}^{n} p_i \int f_i(z)\log g(z : y)\,dz - \sum_{i=1}^{n} p_i \int f_i(z)\log f_i(z)\,dz.$$
The second term on the right, a constant, depends only on $\{f_i\}$. For assessing $g$ we need only consider the first term. However, we cannot evaluate that unknown term, so we estimate it by
$$\sum_{i=1}^{n} p_i \log g(y_i : y).$$
This estimate uses what in Chapter 6 we call the Relevance Weighted Empirical Distribution function, which puts mass $p_i$ at $y_i$, for all $i$.

If we specify a family of feasible $g(\cdot : y)$’s, the one which maximizes the estimated expected weighted entropy at $y$ defines the maximum REWL estimate of $f$. Obviously the performance of the MREWLE of $f$ depends on both the choice of the feasible family and the statistical characteristics of the simple and natural estimator.
If for the feasible family we choose the parametric family of the $f(y; \theta)$, then we find that the estimate obtained from the generalized entropy maximization principle is just the MREWLE.

For brevity we will not pursue further our discussion of the generalized entropy maximization principle. However, many questions about the generalized entropy maximization principle remain to be answered.

3.6 MREWLE for Normal Observations

In this section, we develop a method of estimating the mean of a normal population using data from relevant samples, thereby extending Example 1 above. We use the MREWLE and compare it with other estimators.

Let $Y$ and $Y_1$ be observations from normal populations with known variances and unknown means $\theta$ and $\theta_1$, respectively. Without essential loss of generality, suppose $\mathrm{Var}(Y) = 1$ and $\mathrm{Var}(Y_1) = \sigma_1^2$. Assume $\theta_1 - \theta \in [-c, c]$ for some fixed $c > 0$, $\theta$ being the parameter of interest. We readily find the MREWLE of $\theta$ to be

$$\tilde\theta = \frac{c^2 + \sigma_1^2}{1 + c^2 + \sigma_1^2}\,Y + \frac{1}{1 + c^2 + \sigma_1^2}\,Y_1, \qquad (3.15)$$

if we choose the relevance weights $p_1$ and $p_2$ for $Y$ and $Y_1$, respectively, by minimaxing the mean square error of the MREWLE.

Now we compare the MREWLE with the maximum likelihood estimator. In agreement with intuition, we find that the MREWLE loses its advantage over the MLE as $\sigma_1 \to \infty$ or $c \to \infty$. The extra information in $Y_1$ becomes useless in these extreme cases because of the uncontrolled bias or noise in the second sample. When $c = 0$, the MREWLE becomes the MLE for the full data set. In all these cases, the MREWLE is the minimax estimator.

If $\sigma_1 \to 0$, then the problem under consideration becomes that of estimating a bounded normal mean. However, the MREWLE differs from the MLE. Without loss of generality, assume $\theta_1 = 0$. From (3.15), the MREWLE of $\theta$ is

$$\tilde\theta = \frac{c^2}{1 + c^2}\,Y$$

and the MLE is

$$\hat\theta_{(MLE)} = \begin{cases} -c, & \text{if } Y < -c, \\ Y, & \text{if } -c \le Y \le c, \\ c, & \text{if } c < Y. \end{cases}$$

The mean square errors for these two estimators are

$$\max_\theta E(\tilde\theta - \theta)^2 = c^2/(1 + c^2)$$

and

$$E(\hat\theta_{(MLE)} - \theta)^2 = 2\Phi(c)(1 - c^2) + 2c^2 - 1 - 2c\exp(-c^2/2)/\sqrt{2\pi},$$

the latter for $\theta = \theta_1$ only.
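The two mean square error expressions for the case $\sigma_1 \to 0$ can be checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the maximum MSE $c^2/(1+c^2)$ of the MREWLE and the closed-form MSE of the truncated MLE at $\theta = \theta_1 = 0$, with the boundary term $2c\exp(-c^2/2)/\sqrt{2\pi}$ entering with a negative sign (which is what direct integration of $y^2$ against the standard normal density over $[-c, c]$ gives), and compares the latter with a Monte Carlo estimate. All function names, the seed and the number of draws are illustrative assumptions.

```python
import math
import random


def std_normal_pdf(c):
    """Standard normal density at c."""
    return math.exp(-c * c / 2.0) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(c):
    """Standard normal distribution function at c."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))


def max_mse_mrewle(c):
    """Maximum MSE of the MREWLE over |theta - theta_1| <= c when sigma_1 = 0."""
    return c * c / (1.0 + c * c)


def mse_truncated_mle_at_zero(c):
    """Closed-form MSE of the truncated MLE when theta = theta_1 = 0."""
    return (2.0 * std_normal_cdf(c) * (1.0 - c * c) + 2.0 * c * c - 1.0
            - 2.0 * c * std_normal_pdf(c))


def simulate_mse_truncated_mle_at_zero(c, n_draws=200_000, seed=0):
    """Monte Carlo MSE of the MLE (Y clipped to [-c, c]) for Y ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = rng.gauss(0.0, 1.0)
        estimate = max(-c, min(c, y))  # truncated MLE
        total += estimate * estimate   # true theta is 0
    return total / n_draws
```

Evaluating these functions over a grid of $c$ in $[0, 2]$ gives the kind of curves that a comparison such as Figure 3.1 displays.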
There is no closed form of the maximum mean square error for the MLE over the whole parameter space. We compare the two estimators in Figure 3.1.

Figure 3.1: A comparison of the MREWLE and MLE under $\sigma_1 = 0$. Curve A represents the MSE of the MLE on the line $\theta = \theta_1$. Curve B represents the maximum MSE of the MREWLE over the whole parameter space. Curve C represents the MSE of the MREWLE on the line $\theta = \theta_1$. (Horizontal axis: the constant $c$, from 0.0 to 2.0.)

Assume $\sigma_1 = 1$. From (3.15), the MREWLE of $\theta$ is

$$\tilde\theta = \frac{1 + c^2}{2 + c^2}\,Y + \frac{1}{2 + c^2}\,Y_1,$$

and the MLE becomes

$$\hat\theta_{(MLE)} = \begin{cases} (Y + Y_1 + c)/2, & \text{if } Y_1 < Y - c, \\ Y, & \text{if } Y - c \le Y_1 \le Y + c, \\ (Y + Y_1 - c)/2, & \text{if } Y_1 > Y + c. \end{cases}$$

The comparison of the MREWLE with the MLE is shown in Figure 3.2.

From the above comparison, we find that the MREWLE has the advantage over the MLE for the two normal relevant samples when the mean square error criterion is used. With several relevant samples for $\theta$ with differing means, the analytical calculation of the MLE of $\theta$ proves nearly impossible. By choosing relevance weights, we can easily calculate the MREWLE.

3.7 General Remarks

In Example 3.5 of Section 3.2, we show that we can use the REWL for robustness. This idea is similar to the work on weighted partial likelihood by Sasieni (1992) for the Cox model (exact sample case). In that paper, he considers robustness and efficiency of the weighted partial likelihood method. So even for the iid sample case, we may use the REWL for both robustness and efficiency.

The weakly sufficient statistics defined in Section 3.3 depend on the relevance weights. This agrees with our intuition. For different relevance weights (i.e. different views of relative importance), the weakly sufficient statistics should be different.

Figure 3.2: A comparison of the MREWLE with the MLE under $\sigma_1 = 1$. Curve A represents the MSE of the MLE on the line $\theta = \theta_1$. Curve B represents the maximum MSE of the MREWLE over the whole parameter space. Curve C represents the MSE of the MREWLE on the line $\theta = \theta_1$. (Horizontal axis: the constant $c$, from 0.0 to 2.0.)
The REWL proposed in this Chapter depends on the relevance weights, $\{p_i\}$. These weights express the statistician's perceived relationships among the populations and usually can be chosen on intuitive grounds. For different problems, the relevance weights are chosen in different ways. In Section 3.6, we choose these weights by minimaxing the mean square error. In Example 3.5, we may choose these weights by considering both robustness and efficiency. For the generalized smoothing model of Chapter 5, we will use weights similar to those of nonparametric regression. The asymptotic theory of the REWL in the next Chapter also gives a guideline for the choice of relevance weights. In the following Chapters, we will further discuss the choice of these weights.

Chapter 4 The Asymptotic Properties of the Maximum Relevance Weighted Likelihood Estimator

4.1 Introduction

In Chapter 3, we gave a very general method for using relevant sample information in statistical inference. We base the theory on what we call the relevance weighted likelihood (REWL). The REWL function plays the same role in the case of relevant sample information as the likelihood in that of exact sample information.

The maximum relevance weighted likelihood estimator (MREWLE) studied in this chapter plays the role of the maximum likelihood estimator (MLE) in conventional point estimation. The consistency of the MLE has been investigated by several authors (cf. Cramer 1946 and Wald 1948). Here we prove the consistency of the MREWLE under general conditions. But in many cases, the consistency of the MREWLE is not enough; we need the asymptotic distribution of the MREWLE. The asymptotic normality of the MREWLE is considered in this Chapter.

I first consider the weak consistency of the MREWLE and show: (i) there exists a weakly consistent sequence of roots for the log REWL equation (Theorem 4.2); (ii) the MREWLE is weakly consistent (Theorem 4.4).
I then go on to the strong consistency of the MREWLE (Theorem 4.5). My analysis relies heavily on the work of Chow and Lai (1973) as well as Stout (1968), who deal with the almost sure behavior of weighted sums of independent random variables. Finally, we prove the asymptotic normality of the MREWLE (Theorem 4.7).

I organize the chapter as follows. The main results on weak and strong consistency are stated in Sections 4.2 and 4.3, respectively. We state the asymptotic normality in Section 4.4. [The proofs of these theorems are in Section 4.7.] Section 4.5 proposes two estimators of the variance of the MREWLE. In Section 4.6, I discuss possible extensions and make some concluding remarks.

4.2 Weak Consistency

In Chapter 3, we defined the REWL and the MREWLE. In this Chapter, we treat only the parametric case, so that, for simplicity, interest focuses on a single parameter $\theta$. Let $X_1, \ldots, X_n$ be random variables with probability density functions (PDFs) $f_1, f_2, \ldots, f_n$. We are interested in the PDF $f(x, \theta)$, $\theta \in \Omega$, of a study variable $X$, where $\theta$ is an unknown parameter. To state the Theorem, we begin with the following assumptions.

Assumptions 4.1

4.1.1 $\{P_\theta : \theta \in \Omega\}$ represents a family of distinct distributions with common support and dominating measure $\mu$. Let $f(x, \theta)$ denote the PDF of $P_\theta$.

4.1.2 The distributions of the independent sample observations, $X_i$, $i = 1, \ldots, n$, have the same support as the $\{P_\theta\}$.

4.1.3 The relevance weights $p_{ni}$ corresponding to $X_i$, $i = 1, \ldots, n$, and incorporated in the vector $P_n = (p_{n1}, p_{n2}, \ldots, p_{nn})$, play a central role in our theory. They satisfy the formal requirements $p_{ni} \ge 0$ and $\sum_{i=1}^n p_{ni} = 1$. As well, with the "true" value of $\theta$ denoted by $\theta_0$, we require that

$$\sum_{i=1}^n p_{ni} E \log\{f_i(X_i)/f(X_i, \theta_0)\} \to 0 \quad \text{as } n \to \infty \qquad (4.1)$$

and, for any $\theta$,

$$\sum_{i=1}^n p_{ni}^2 \mathrm{Var}[\log\{f_i(X_i)/f(X_i, \theta)\}] \to 0 \quad \text{as } n \to \infty. \qquad (4.2)$$

4.1.4 $\Omega$ contains an open interval $\Theta$ of which the true parameter $\theta_0$ is an interior point.

Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. For fixed $\mathbf{X} = \mathbf{x}$, the function $\theta \mapsto \prod_{i=1}^n f^{p_{ni}}(x_i, \theta)$ will be called the REWL function.
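The REWL function just defined can be evaluated directly on the log scale. The following is a minimal sketch under stated assumptions: the unit-variance normal model, the data, the weights and the grid are all hypothetical choices made only for illustration, and the grid search is one simple way of locating the maximizer, not a method prescribed here.

```python
import math


def normal_logpdf(x, theta):
    """Log density of N(theta, 1); the model is an illustrative assumption."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2


def log_rewl(theta, xs, weights, logpdf=normal_logpdf):
    """Log of the REWL function: sum_i p_ni * log f(x_i, theta)."""
    return sum(p * logpdf(x, theta) for x, p in zip(xs, weights))


def mrewle_grid(xs, weights, grid, logpdf=normal_logpdf):
    """Grid point at which the log REWL function is largest."""
    return max(grid, key=lambda theta: log_rewl(theta, xs, weights, logpdf))


# Hypothetical data and relevance weights (non-negative, summing to one).
xs = [0.0, 1.0, 4.0]
weights = [0.5, 0.3, 0.2]
grid = [k / 100.0 for k in range(-500, 501)]
theta_hat = mrewle_grid(xs, weights, grid)
```

For this normal model the maximizer is the weighted mean $\sum_i p_{ni} x_i$ (here 1.1), so the grid search can be checked against a closed form; for other families only `logpdf` needs to change.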
Theorem 4.1 Assumptions 4.1 imply

$$P_{\theta_0}\{f^{p_{n1}}(X_1, \theta_0) \cdots f^{p_{nn}}(X_n, \theta_0) > f^{p_{n1}}(X_1, \theta) \cdots f^{p_{nn}}(X_n, \theta)\} \to 1 \qquad (4.3)$$

as $n \to \infty$ for any fixed $\theta \ne \theta_0$.

From (4.3), the value of the REWL function at $\theta_0$ (regarded as depending on $\mathbf{X}$) exceeds its value at any other fixed $\theta$ with high probability when $n$ is large. We do not know $\theta_0$, but we can determine the point $\hat\theta_n$, called the MREWLE, at which the REWL function for fixed $\mathbf{X} = \mathbf{x}$ is maximized. Suppose the observations are from distributions with PDFs like that of the true sampling distribution $f(x, \theta_0)$ (and that Conditions (4.1) and (4.2) hold). Then the last theorem suggests that the MREWLE of $\theta$ should be close to the true value of $\theta$ if the REWL function varies smoothly with $\theta$. Hence the MREWLE should be a reasonable estimator.

Remark A.

(i) Assumption 4.1.2 seems quite reasonable. If the distributions did not have the same support, we could not construct a useful REWL function. For example, if $X_1$ had support $[0, 2]$ and $f(x, \theta)$ had support $[0, 1]$, the REWL function would be identically 0 when $X_1$ was in $(1, 2]$.

(ii) The independence assumed in 4.1.2 greatly simplifies our problem, which would otherwise be insurmountable.

(iii) Condition (4.1) underlies the construction of a useful REWL function. Recall that [the Kullback-Leibler (KL) functional] $E \log\{f_i(X_i)/f(X_i, \theta_0)\}$ measures the discrepancy between $f_i$ and $f(\cdot, \theta_0)$. That condition ensures that the weighted KL discrepancy of the observations converges to zero when the sample size grows large. When the PDFs of the observations are quite different from $f(x, \theta_0)$, we usually cannot get a good estimator of the true parameter. Our difficulty arises then because we do not get enough information about the unknown parameter from the observations. This condition is easily satisfied, for example when $E \log\{f_i(X_i)/f(X_i, \theta_0)\}$ is uniformly bounded while $E \log\{f_i(X_i)/f(X_i, \theta_0)\} \to 0$ as $i \to \infty$. For then $P_n$ can easily be chosen to make (4.1) hold.
(iv) Condition (4.2) commonly holds, for example when $\max_i\{p_{ni}\} \to 0$ while $\mathrm{Var}(\log\{f_i(X_i)/f(X_i, \theta)\})$ is uniformly bounded for each $\theta$. This is because

$$\sum_{i=1}^n p_{ni}^2 \mathrm{Var}[\log\{f_i(X_i)/f(X_i, \theta)\}] \le \sum_{i=1}^n p_{ni}^2 C(\theta) \le \max_i\{p_{ni}\} C(\theta).$$

Here $C(\theta)$ is a constant depending on $\theta$. The first inequality follows from uniform boundedness and the second because $\sum_{i=1}^n p_{ni} = 1$.

(v) Conditions (4.1) and (4.2) hold, respectively, when $(X_1, X_2, \ldots, X_n)$ are independent and identically distributed with PDF $f(\cdot, \theta_0)$ and when $\mathrm{Var}[\log\{f(X_1, \theta_0)/f(X_1, \theta)\}]$ exists while $\max_i\{p_{ni}\} \to 0$.

Corollary 4.1 If $\Omega$ is finite, Assumptions 4.1.1-4.1.3 imply that the MREWLE $\hat\theta_n$: (i) exists; (ii) is unique with probability tending to 1; and (iii) is weakly consistent.

Proof: The result follows immediately from the Theorem and the fact that $P(A_{1n} \cap \cdots \cap A_{kn}) \to 1$ as $n \to \infty$ if $P(A_{in}) \to 1$ for $i = 1, \ldots, k$.

Theorem 4.2 Suppose: (i) $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ satisfies Assumptions 4.1; and (ii) for $\theta \in \Theta$ and almost all $x$, the function $\theta \mapsto f(x, \theta)$ is differentiable with derivative $f'(x, \theta)$. Then with probability tending to 1 as $n \to \infty$, the relevance weighted likelihood (REWL) equation

$$\frac{\partial}{\partial\theta}\Big\{\prod_{i=1}^n f^{p_{ni}}(x_i, \theta)\Big\} = 0$$

has a root, $\hat\theta_n = \hat\theta_n(x_1, \ldots, x_n)$, which tends to $\theta_0$.

Note that the REWL equation in the last theorem may equivalently be stated as

$$WL'(\theta, \mathbf{x}) = 0. \qquad (4.4)$$

The following comments also relate to Theorem 4.2.

1. Its proof shows incidentally that with probability tending to 1, the $\{\hat\theta_n\}$ can be chosen to be local maxima. Therefore we may take the $\hat\theta_n$ to be the root closest to a maximum.

2. But the Theorem does not establish the existence of a consistent estimator sequence since, with the true value unknown, the data do not enable us to pick a specific consistent sequence.

3. Theorem 4.2 only gives us the existence of a consistent root of the REWL equation. But only in very special cases is this root the MREWLE, in which case it is then consistent (see Corollary 4.2 below).

4. To prove Theorem 4.2, we require that $\theta \mapsto f(x, \theta)$ be differentiable, $\theta \in \Theta$.
We will give some conditions (similar to the conditions given by Wald (1949) for the iid case) which avoid the requirement that $\theta \mapsto f(x, \theta)$ be differentiable.

Corollary 4.2 If the weighted likelihood equation has a unique root $\hat\theta_n$ for each $n$ and all $\mathbf{x}$, the hypotheses of Theorem 4.2 imply that $\{\hat\theta_n\}$ is a consistent sequence of estimators of $\theta$. If, in addition, the parameter space is any open interval $(a, b)$, then with probability tending to 1, $\hat\theta_n$ maximizes the weighted likelihood (that is, $\hat\theta_n$ is the MREWLE), and is therefore consistent.

Proof: The first statement is obvious. To prove the second, suppose the probability of $\hat\theta_n$ being the MREWLE does not tend to 1. Then for sufficiently large $n$, the weighted likelihood must, with positive probability, tend to a supremum as $\theta$ tends toward $a$ or $b$. Now with probability tending to 1, $\hat\theta_n$ is a local maximum of the weighted likelihood, which must then also possess a local minimum. This contradicts the assumed uniqueness of the root.

The conclusion of Corollary 4.2 holds when the probability of multiple roots tends to 0 as $n \to \infty$.

We have already discussed the consistency of a root of the REWL equation. Now we study the consistency of the MREWLE. Before formulating our assumptions, we introduce some notation. For any $\theta$ and $\rho, r > 0$ let:

$f(x, \theta, \rho) = \sup\{f(x, \theta') : |\theta' - \theta| \le \rho\}$;
$\varphi(x, r) = \sup\{f(x, \theta) : |\theta| > r\}$;
$f^*(x, \theta, \rho) = f(x, \theta, \rho)$ or 1 according as $f(x, \theta, \rho) > 1$ or $\le 1$, respectively;
$\varphi^*(x, r) = \varphi(x, r)$ or 1 according as $\varphi(x, r) > 1$ or $\le 1$, respectively.

Assumptions 4.2

4.2.1 For any $\theta$ and $\rho$, $x \mapsto f(x, \theta, \rho)$ is measurable.

4.2.2 For any $\theta, \theta_0 \in \Theta$, there exists $\rho_0 > 0$ such that, if $0 < \rho < \rho_0$, the expected value $\int_{-\infty}^{\infty} \log f^*(x, \theta, \rho)\,dF(x, \theta_0)$ is finite. Similarly, $\int_{-\infty}^{\infty} \log \varphi^*(x, r)\,dF(x, \theta_0) < \infty$ for sufficiently large $r$ (depending on $\theta_0$); here $F(\cdot, \theta_0)$ represents the CDF for $P_{\theta_0}$.

4.2.3 For any $\theta_0 \in \Theta$, $\int_{-\infty}^{\infty} |\log f(x, \theta_0)| f(x, \theta_0)\,d\mu(x) < \infty$.

4.2.4 There exists $A$, a Borel set, such that for any $\theta_0 \in \Theta$, $\int_A f(x, \theta_0)\,d\mu(x) = 0$, and for $x \notin A$, $\lim_{|\theta| \to \infty} f(x, \theta) = 0$.
50 4.2.5 If9 O then i({x: f(x,Oi) f(x,02}) > 0. 4.2.6 For any 0 E 0, there exists B9, such that lB9 f(x, 0o)d(x) 0 for any 0 E 0, and for x E B8, f(x, 0’) —* f(x, 0) for any 0’ —* 0. 4.2.7 In the observation vector, = (X1,2...,X), the X are independent with PDF f(x) with respect to the same dominating measure 4.2.8 Let F,-, = (Pnl,pn2, ...,p,) denote the respective important weights satisfying p 0 and p, = 1. Assume 0 as n oc. (4.5) 4.2.9 Let gn(x) = Assume there exists a Borel measurable function G, such that gn(x) < G(x) , fG(x)logf(x,0o)I4(x) < oc and fG(x)1og(x,r)I4(x) < oc for sufficiently large r. 4.2.10 For each 0 and p(O), — 0 as n —* oc. (4.6) For each r, 0 as n oc. (4.7) The Assumptions 4.2.1-4.2.6 are similar to Wald’s assumptions for the i.i.d case. These assumptions insure the validity of the lemmas of Wald (1948). Assumption 4.2.8 is essential for constructing a useful REWL. Assumption 4.2.9 and Assumption 4.2.10 are for the Dominated Convergence Theorem and Weak Law of Large Numbers. 51 Theorem 4.3 If 0 w a closed subset of 0, Assumptions 4.2 imply that for any C> 0, > 0 48fPn1(Xi,00)...fPnn(X, } — . ( . ) Here the probability F,., denotes that of the {X} obtained from {f1(x) : i = 1,.. . , n}. D Theorem 4.4 Suppose Assumptions 4.2 obtain. Let O(X1,. . . ,X,.,) be any function of the observations, X1,. . . ,X,,., such that c> 0 all n and all X1,. .. ,X. (4.9) Then 0,., is a weakly consistent estimator of 00. 0 Observe that the MREWLE always satisfies the conditions of Theorem 4.4 if we choose c = 1. So we have proved the weak consistency of MREWLE. 4.3 Strong Consistency For the strong consistency of MREWLE, we use almost sure results on linear combina tions of independent random variables such as those of (see Chow and Lai (1973), Stout (1968)). 
For simplicity, define: (I) A = (ii) D, = log f(X:,00)— log f(X,0,p(8)) — E(logf(X,0o) — log f(X,0,p(0))) from (4.24) and the proof of Theorem 4.3; (iii) = log f(X, 0) — log r0) — E(log f(X, 0) — log c(X1,r0)) from (4.25) and the proof of Theorem 4.3, where i = 1,. . . , n and j = 1,. . . , k. In the next theorem, D will generically denote all D3 and D. 52 Theorem 4.5 Suppose Assumptions obtain, and there exists a constant K > 0, such that pj Kn for some a> 0 and that one of the following conditions hold: (i) ED(1 Dj)’ K for some > 0 and /3, (1 + a + /3)/a 2, ED2(log IDD2 K, A Kn; (ii) ED2 <K, ED = 0, and p6 <J(a(2_S)_1 for some 0 < 6 < 2 and > 0, (iii) E)D(l )Ic(1og+ D)1 K for some > 0 and /3, 1 <(1 + a + /3)/a <2, Kn7,A Kn for some 7 > 0; (iv) ED(l /°(log DI)1 K for some > 0 and /3, 0 < (1 + a + /3)/a < 1, Kn, A Kn forsome7>0andp=0fori>nwhere<7(1+a+/3)/a. Then su ‘1X O OFl urn “ I• ‘ = 01 = 1 D (4 101n—oo fPni(X,)...fPnn ,9 Theorem 4.6 Under the conditions of Theorem .5, let O(X1,.. . ,X,,j be any function of the observations X1,. . . ,X, such that (.9) holds. Then O is a strongly consistent estimator of °o We now state without proof a direct corollary of the last Theorem. Corollary 4.3 Under the conditions of Theorem .5, the MREWLE is strongly consis tent. 53 The weak conditions of Theorem 4.5 are not easily verified. In contrast, the stronger conditions in the following corollaries are easy to verify (but the results are not then as general as those of the Theorem). Corollary 4.4 Under the assumptions of Theorem 4., let pj <Kn for some a> 0. IfEIDI2/ K for some 0< a < 1 then (.1O) holds. Proof: Let 3 = 0. Then (i) of Theorem 4.5 is satisfied. 0 In the case of independently and identically distributed observations, the assumptions of Theorem 4.5 can be quite unrestrictive. We will not go into detail here because our observation follows immediately from Theorem 1 of Stout (1968). 
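The role of the weighted Kullback-Leibler discrepancy in the consistency results can be illustrated by a small simulation. The following is a minimal sketch under assumptions that are not taken from the text: observation $i$ is drawn from $N(\theta_0 + b_i, 1)$ with bias $b_i = i^{-1/2}$, the target density is $N(\theta_0, 1)$, and the weights are $p_{ni} = 1/n$. For this pair of normals the KL discrepancy is $b_i^2/2$, so the weighted discrepancy is of order $\log n / n$ and vanishes as $n$ grows, while the MREWLE (here the weighted mean) approaches $\theta_0$.

```python
import math
import random


def mrewle_normal_location(xs, weights):
    """MREWLE for a N(theta, 1) location; with weights summing to one it is the weighted mean."""
    return sum(p * x for p, x in zip(weights, xs))


def weighted_kl_discrepancy(biases, weights):
    """Weighted KL discrepancy, using KL(N(theta0 + b, 1) || N(theta0, 1)) = b^2 / 2."""
    return sum(p * b * b / 2.0 for p, b in zip(weights, biases))


def simulate(n, theta0=1.0, seed=0):
    """Draw one relevant sample of size n; return (MREWLE, weighted KL discrepancy)."""
    rng = random.Random(seed)
    biases = [1.0 / math.sqrt(i) for i in range(1, n + 1)]  # hypothetical decaying biases
    weights = [1.0 / n] * n                                  # hypothetical equal weights
    xs = [rng.gauss(theta0 + b, 1.0) for b in biases]
    return mrewle_normal_location(xs, weights), weighted_kl_discrepancy(biases, weights)


for n in (100, 1_000, 10_000):
    estimate, discrepancy = simulate(n)
    print(n, round(estimate, 4), round(discrepancy, 6))
```

With other weight sequences the same two functions show how quickly the discrepancy, and with it the bias of the estimate, dies out.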
4.4 The Asymptotic Normality of the MREWLE In the last section, we have shown that under regularity conditions, the MREWLEs are consistent and strongly consistent. In this section, we shall show that the MREWLEs are asymptotically normal under some conditions. As we know, the asymptotic normality of MLE have been discussed by Cramer (1946) among others. Our results generalize Cramer’s to both non-i.i.d and unequal Pni. Assumptions 4.3 4.3.1 For each E 0, the derivatives ôlogf(x,O) 92logf(x,O) ô3logf(x,8) 00 ‘ 802 ‘ 54 exist, all x. 4.3.2 For each 0o e €, there exist functions g, h and H (possibly depending on O) such that for 0 in a neighborhood N(00) the relations ãf(x) g(x)jô2f(x,O) I h(x)jã3logf(x,0) < H(x) hold, all x, and J g(x)4(x) < ,f h(x)4(x) <oo,E90H(x) <00. 4.3.3 For each 0 e 0 0 <1(0) = [{alof(x;o)}2] < 4.3.4 —*0. 4.3.5 Define b = f{Dlog f(x; 0o)/O0o}f(x)dt(x) as the biased functions for each ob servation. Then (p)’ —* 0. 4.3.6 Let gn(x) = Zpf(x). Assume there exists a Borel measurable function G such that g(x) G(x), J G(x) I Ologf(x,0o) I d1t(x) 55 J G(x) ö21ogf(x,9 d(x) <cc, and J G(x)H(x)4(x) <cc. 4.3.7 Let g(x) = (p)’>pf(x). Assume g(x) —+ f(x,Oo) almost surely and there exists a Borel measurable function G*(x) such that g,(x) G*(x) and f (){ ogf,00)}24()<cc. 4.3.8 2i :fv n 2 (.1 iogjy,uopVar( ) —* 0 and pVar(H(X)) —+0 as n —* cc. Some interpretations of these conditions as following. Assumption 4.3.1 insures that the function ãlogf(x, 0)700 has, for each x, a Taylor expansion as a function of 0. As sumption 4.3.2 insures (justifies) that ff(x,0)d1t and f{O1ogf(x,0)/O0}dp(x) may be differentiated with respect to 0 under the integral sign. Assumption 4.3.3 states that Fisher’s information is positive. Assumption 4.3.4 insures a sufficiently fast asymptotic rate and Assumption 4.3.5 insures that the weighted bias go to 0. Assumptions 4.3.6, 4.3.7 and 4.3.8 insure the validity of the Dominated Convergence Theorem and Weak Law of Large Numbers. 
Theorem 4.7 Under the Assumptions 4.1 and 4.3, the relevance weighted likelihood equations admit a sequence of solutions {O} satisfy: 56 (i) Ô,., — Oo in probability ri — (ii) (pj_1/2{ê — Oo — b/I(00)} —* N(O, 7-)). Here b = as the asymptotic bias of the MREWLE. Further, for any consistent sequence O, of roots of the REWL equations, (ii) is true. Theorems 4.5 and 4.7 insure that the MREWLEs are asymptotically normal, when the MREWLEs are roots of REWL equations. By comparing this result with the asymptotic normality of the MLE for the case of iid sampling, we find that the MREWLE has a kind of asymptotic efficiency, because the variance is the inverse of Fisher Information. The convergence rate of the MREWLE is (p)’12, which depends on the relevance weights. The best convergence rate is _h/2 obtained by choosing Pni = 1/n, But in most of relevant sample information cases, we cannot use ptj = 1/n for condition (4.1) and Assumption 4.3.5. A straight forward result from the above Theorem is Corollary 4.5 When X1, ...,X,,. are iid from f(x,Oo) and the Assumptions i.1 and 4.3 obtain, then any consistent sequence {O} of roots of the REWL equation (4.4) satisfies (p2.)-1/2(â - o) ,‘ N(o, i(O)• (4.11) To get the best convergent rate, we always choose p = 1/n for the iid exact sample case. 57 4.5 Estimated Variance of the MREWLE It was seen in Subsections 4.2, 4.3 and 4.4 that the MREWLE’s are consistent and asymptotically normal under certain conditions. The variance of this asymptotically normal distribution provides a reasonable measure of the accuracy of the estimator sequence. But in most cases, we do not know this variance, so an estimator of this variance will be useful. From Theorem 4.7, there are several possible estimators of the variance. We only discuss two of them. They are zpii V1 — Zpö2logf(x.9)/öO I8=8 (4.12) and V2 {Zp.ölogf(x.8)/ 0}8=6 (4.13) For the estimator , we try to estimate the Fisher Information. 
The is obtained directly from the Taylor expansion of the REWL equation (4.4). Under the conditions of Theorem 4.7, we can show that both estimators are consistent. In Theorem 4.7, the b — 0, so we can use the above variance estimator to construct asymptotic confidence intervals. We do not discuss these variance estimators further in this thesis. 4.6 Some Possible Extensions and Remarks The method given in this report can be extended to establish the consistency of the MREWLE’s for certain types of dependent random variables for which the weak and 58 strong law of large numbers remain valid. Assumption 4.1.4 about the importance weights P can be extended to general cases where we drop the requirements 0 and >p = 1. But then Conditions (4.1) and (4.2) need to be changed. We could for example replace condition (4.1) by the following stronger conditions: 0 as n oo. (4.14) In Chapter 5, we describe plausible situations where we might want to choose negative weights. The consistency of MREWLE’s can be extended to REWME’s defined in Chapter 6. The results of this Chapter apply in several important subdomains of estimation theory indicated by Chapter 5. Of particular note is that of nonparametric smoothing methods. For simplicity, our treatment in this chapter is confined to the case of a one-dimensional parameter. The multivariate extension is similar and we omit it for brevity. 4.7 Proofs To prove Theorem 4.1, we need the following two important lemmas about Kuliback Leibler information (KU; see Kullback 1959). Lemma 4.1 Let f(x) i = 1,. . . , n and g(x) be n + 1 general PDF’s with the same support andq1 0 i = 1,...,n be such that q +q2 + ... +qn = 1. If f(x) = qifi(x)+ +qf(x), then qffi(x)log{}d(x) > Jf(x)log{}d(x) (4.15) 59 with equality if and only if fi(x) = f2(x) = ... = f(x) almost everywhere with respect to measure t. 
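Lemma 4.2 If the PDFs $\{g_n(x)\}$ and $g(x)$ have the same support, then $\lim_{n \to \infty} \int g_n(x) \log\{g_n(x)/g(x)\}\,d\mu(x) = 0$ if and only if $\lim_{n \to \infty} g_n(x)/g(x) = 1$ $[\mu]$ uniformly.

We can now apply these results.

Lemma 4.3 Let $f_i(x)$, $i = 1, 2, \ldots$, and $g(x)$ be PDFs with common support and let $P_n = (p_{n1}, p_{n2}, \ldots, p_{nn})$ denote the respective importance weights, $p_{ni} \ge 0$ and $\sum_{i=1}^n p_{ni} = 1$. Let $g_n(x) = \sum_{i=1}^n p_{ni} f_i(x)$. If

$$\sum_{i=1}^n p_{ni} \int f_i(x) \log\Big\{\frac{f_i(x)}{g(x)}\Big\}\,d\mu(x) \to 0 \qquad (4.16)$$

then $g_n(x) \to g(x)$ almost surely.

Proof: By Lemma 4.1,

$$\sum_{i=1}^n p_{ni} \int f_i(x) \log\Big\{\frac{f_i(x)}{g(x)}\Big\}\,d\mu(x) \ge \int g_n(x) \log\Big\{\frac{g_n(x)}{g(x)}\Big\}\,d\mu(x) \ge 0.$$

The second inequality follows from the nonnegativity of the Kullback-Leibler information. The assumptions and Lemmas 4.1 and 4.2 imply that $\lim_{n \to \infty} g_n(x)/g(x) = 1$ $[\mu]$, uniformly. Now for each $x \in A = \{x : \lim_{n \to \infty} g_n(x)/g(x) = 1\}$, we have $\lim_{n \to \infty} g_n(x) = g(x)$. Because $\mu(A^c) = 0$, the conclusion follows.

Lemma 4.4 Let $f_i(x)$, $i = 1, 2, \ldots$, and $g(x)$ be PDFs with common support and let $P_n = (p_{n1}, p_{n2}, \ldots, p_{nn})$ denote the respective importance weights. Suppose condition (4.16) holds and $\mu\{x : h(x) \ne g(x)\} > 0$. Here $h(x)$ is another PDF with the same support as $g(x)$. Then there exist a $\delta > 0$ and an $N(\delta)$ such that for $n > N(\delta)$

$$\sum_{i=1}^n p_{ni} \int f_i(x) \log\Big\{\frac{f_i(x)}{h(x)}\Big\}\,d\mu(x) > \delta. \qquad (4.17)$$

Proof: The Kullback-Leibler information is always nonnegative. If the inequality (4.17) is not true, then there exists a subsequence $n(j)$ such that

$$\sum_{i=1}^{n(j)} p_{n(j)i} \int f_i(x) \log\Big\{\frac{f_i(x)}{h(x)}\Big\}\,d\mu(x) \to 0 \quad \text{as } j \to \infty.$$

Now let $g_n(x) = \sum_{i=1}^n p_{ni} f_i(x)$. By the last Lemma, we get $g_{n(j)}(x) \to h(x)$ almost surely. Also by the same Lemma, $g_n(x) \to g(x)$ almost surely, so $g_{n(j)}(x) \to g(x)$ almost surely. Therefore $\mu\{x : h(x) \ne g(x)\} = 0$. This contradicts the assumption and completes the proof.

Proof of Theorem 4.1: The inequality in (4.3) is equivalent to

$$\sum_{i=1}^n p_{ni} \log f(X_i, \theta_0) - \sum_{i=1}^n p_{ni} \log f(X_i, \theta) > 0.$$

Now

$$\sum_{i=1}^n p_{ni} \log f(X_i, \theta_0) - \sum_{i=1}^n p_{ni} \log f(X_i, \theta) = \sum_{i=1}^n p_{ni} \log\Big\{\frac{f_i(X_i)}{f(X_i, \theta)}\Big\} - \sum_{i=1}^n p_{ni} \log\Big\{\frac{f_i(X_i)}{f(X_i, \theta_0)}\Big\} = (I) + (II).$$

From (4.1) and Lemma 4.3, $\sum_{i=1}^n p_{ni} f_i(x) \to f(x, \theta_0)$ almost surely. Now because $f(x, \theta)$ and $f(x, \theta_0)$ are distinct densities, by Lemma 4.4 there exists a $\delta > 0$ such that for $n$ large enough

$$\sum_{i=1}^n p_{ni} E \log\Big\{\frac{f_i(X_i)}{f(X_i, \theta)}\Big\} \ge \delta.$$

Now from Assumption 4.1.4,

$$(I) - \sum_{i=1}^n p_{ni} E \log\Big\{\frac{f_i(X_i)}{f(X_i, \theta)}\Big\} \to 0 \quad \text{in probability}.$$

Therefore $(I) > \delta$ in probability.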
Similarly, (II) —* 0 in probability. This means plogf(X,Oo) —pnj1ogf(Xj,0) 0 in probability. That observation completes the proof. D Proof of Theorem 4.2: Let a be small enough so that (O — a, O + a) contains in 0 and let = {x: WL(00,) > WL(00 — a,) and WL(00,) > WL(0o + a,)}. By Theorem 4.1, Fe0 (Sn) —f 1. For any . E 5,, there thus exists a value O — a < 0,, < 0o + a at which WL(0) has a local maximum, so that WL’(O,,) 0. Hence for any a > 0 sufliciently small, there exists a sequence 0,, = 0,,(a) of roots such that P90(O,, — Oo <a) —p 1. It is remains to show that we can determine such a sequence, which does not depend on a. Let 0 be the closest root to 00. [This root exists because the limit of a sequence of roots is again a root by the assumed continuity of WL(0).] ThenP90(O — 0o < a) —* 1 and this completes the proof. 0 Before we prove Theorem 4.3, we state the following Lemmas. Lemma 4.5 cp(O, r), f*(x, 0, p) and *(o, r) are Borel measurable functions. This last lemma follows from immediately from Assumption 4.2.1 and we omit it for brevity. The next two lemmas follow from Wald (1949). 62 Lemma 4.6 For any 0 0 in 0 we have limE90 log f(X,O,p) = E90 log f(X,0). (4.18) p—,0 Lemma 4.7 For any 00 E 0 we have lim E90 log(X,r) = —00. (4.19) Lemma 4.8 For any 6 > 0, there exist r0(6) > 0 and N(6, r0), such that for every n> N(6,r0) and r r0, pElogf(X,0o) >6. (4.20) Proof: From (4.15), we have pE log f(X,00) - log (Xr) = PmiE10{f()} > I gfl(x) log{ }dJL(x) - pflE1og{f)} = fn(x)lo{f )}d(x) +fg(x)logf(x,0o)4(x) _fgn(x)log(x,r)4(x) - pfliElog{f[)} = (I) + (II) — (III) — (IV) By Assumption 4.2.8, (I) —* 0 and (IV) —* 0. Now we prove (II) — E90 log f(X, 0w). From Lemma 4.3, we know that gn(x) _* f(x, 0w). The result now follows from the Assumption 4.2.9 and the Dominated Convergence Theorem. We can prove (III) —* E90 log(X,r). in a similar fashion. From (4.19), we can choose r0 such that E90 log r0) < —(56 + E00 log f(X, 0)). 
Now choose 63 N1, N2, N3 and N4 such that: (i)when n > N1, j(I) < 6; (ii)when n > N2, 1(11) — E00 log f(X,U0) < 6; (iii)when n > N3, (III) — E90 log y(X,r0) < 6; and (iv)when n > N4, (IV) < 6, respectively. Let N(6,r0) = max{Ni,N2,N34} we have prove (4.20) for r = r0. Because p(X, r) decreases with r, the proof is complete. D Lemma 4.9 For any 0 0 in 0, there exist 6 > 0, p(O, 6) > 0 and N(p(8, 6), 6) such that forn > N(p(0, 6), 6), pE1ogf(X,0o) — pE1ogf{X,0,p(0,6)} >6. (4.21) Proof: The proof of this lemma is similar to that of the last. The oniy difference is that we use (4.18) instead of (4.19). 0 Proof of Theorem 4.3: From Lemma 4.8, we know there exist r0 and N(r0) such that pE1ogf(X,0o) —p1E1og(X,r0)>1. (4.22) Let w1 = o fl {0 : 0H r0}. Then by Lemma 4.9, for any 0 E , there exist p(O) > 0, 6(0) > 0 and N(0,p(0),6(0)) such that n> N(8,p(0),6(0)) and pjE log f(X, 0) — p,E log f(X, 8, p(O, 6)) > 6(0). (4.23) Now let S(8, p) denote the sphere with center 0 and radius p. Since is compact, by the Finite Covering Theorem, there exists a finite number of points {0,. . . , Ok} in such that S(81,p(0))U... U S(Ok,p(Ok)) contains as a subset. Clearly, we have 0< sup 6Ew k < f’{Xi,0,p(8)} . . .f{X,0,p(0)} + y’(Xi,r0).. 64 Hence, the theorem is proved if we can show that fPnl(Xi,Oj,p(Oj))...fPnn(Xn,8j,p(Oj)) > k 1 0 1 kfPni(Xi,O0)...fPnn(X, } — e/( + )j ,‘ z — and r ) r ) c/(k+ 1)1 0. Proving these last results is equivalent to showing that for i = 1,. . . , k n{p1logf(X,Oo) — oc in probability. (4.24) and n{plogf(X1,Oo) — oc in probability. (4.25) Under our assumptions, (4.22), (4.23) and the weak law of large numbers, we can prove (4.24) and (4.25). This completes the proof of this theorem. D Proof of Theorem 4.4: For any > 0, let = {(X1,X2...) : 8(X1,. . . ,X,) E S(8o,E) for sufficient large n}. From (4.9), we obtain AcC={(X1,X2...): sup19_901>.fPni (X1,0) . . . f’(X, 0) fPrti(X,0o) .. . 
f(x,0) } c for infznztely many n} By Theorem 4.3, we have lim F(C) = 0, then lim P(A) = 0. Therefore, lim Pn(Ae) = 1. This completes the proof. D Proof of Theorem 4.5: If we can prove (4.24) and (4.25) with probability 1, then from the proof of Theorem 4.3, we obtain the asserted result. But Theorem 4 of Stout (1968) and our conditions on P imply this result and hence the conclusion of our theorem. D 65 The proof of Theorem 4.6 is similar to Theorem 4.4 and hence we omit this for brevity. Proof of Theorem 4.7: By Assumption (4.3.1) and (4.3.2), we have for 0 in the neighborhood N(00) a Taylor expansion of ôlog f(x, 0)/ö0 about the point 0 = 0 as follows: ölogf(x,8) 8logf(x,0) O21ogf(x,0) 1 2 80 — 80 — °‘ 8O 6=6 +8 — 0) H(x where < 1. Therefore, putting ãlog f(X, 0)A = 90 82 log f(X, 0)B = pnj 882 and = pH(X). We have —A = B(Ô — 0) + — Oo)2 where < 1. From Theorem 4.2, we know that there exist a sequence of 8,-, such that 8, —* so o o —n °Bn+*Cn(n_0o) Now we prove: (i) A .-.‘ AN(pbj, (pjI(Oo)). As we know s—.. Ologf(X:,0)A = 80 Ozz6 66 so EA = — pj b and Var(A) = log f(X, 0)6=80)2 — pb00 = 800 By Assumption 4.3.5, 4.3.7 and Dominated Convergence Theorem Var(A) — (p)I(0o) + o(p). Now let 0logf(X,Oo) = O0i=1 E 0logf(X,0o) = E I 000i=1 max(p)J Ologf(x,0) I By Dominated Convergence Theorem f Ologf(x,0o) c Ologf(x,8o) f(x,0o)4(x),000 (x)J so 1’ —* 0 (max(p) —* 0). Then by Theorem 7.1.2 of Chung (1974, p200) and Assump tion 4.3.4, we get (ii) B —* —I(0) in probability. 02 log f(X, 0)B = oo O2logf(XOo) — O2logf(x,0o) oog oog —* —I(0) by Dominated Convergence Theorem. So from Assumption 4.3.8, B,-, —* I(0) in probability. 67 (iii) C,. —*E90H(X) in probability. The proof is similar to (ii). Therefore O, — = —A,./[B,. + *C,.(O,. — 9)]. Since Or, — Oü —÷ 0 in probability. B,. + — O) —* —I(O) Further, —A,. - AN(— (ZpjI(Oo)). Consequently, by Slutsky’s Theorem, (p){êr, - - ____ ‘ N(0, establishing the theorem. 
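The normalization by $(\sum_i p_{ni}^2)^{-1/2}$ in the asymptotic normality result can be illustrated in the iid setting of Corollary 4.5. The following is a minimal sketch under assumptions chosen only for illustration: the data are iid exponential with rate $\theta_0$, so that $I(\theta_0) = 1/\theta_0^2$ and the MREWLE is $1/\sum_i p_{ni} x_i$, and the weights decay linearly in $i$ (a hypothetical choice with $\max_i p_{ni} \to 0$). The sample variance of the standardized errors should then be close to $1/I(\theta_0) = \theta_0^2$.

```python
import random


def triangular_weights(n):
    """Hypothetical relevance weights, linearly decreasing in i and summing to one."""
    total = n * (n + 1) / 2.0
    return [(n + 1 - i) / total for i in range(1, n + 1)]


def mrewle_exponential_rate(xs, weights):
    """Maximizer of sum_i p_i * (log(theta) - theta * x_i) when the weights sum to one."""
    return 1.0 / sum(p * x for p, x in zip(weights, xs))


def standardized_errors(theta0, n, n_reps, seed=0):
    """Replicates of (sum_i p_i^2)^(-1/2) * (MREWLE - theta0) for iid exponential data."""
    rng = random.Random(seed)
    weights = triangular_weights(n)
    scale = sum(p * p for p in weights) ** -0.5
    errors = []
    for _ in range(n_reps):
        xs = [rng.expovariate(theta0) for _ in range(n)]
        errors.append(scale * (mrewle_exponential_rate(xs, weights) - theta0))
    return errors


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)
```

For example, `sample_variance(standardized_errors(2.0, 400, 1000))` can be compared with $\theta_0^2 = 4$; replacing the weights by $1/n$ recovers the usual $n^{1/2}$ scaling.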
Chapter 5 The MREWLE for Generalized Smoothing Models

5.1 Introduction

As we have mentioned in Chapters 2 and 3, the nonparametric regression (NR) paradigm motivates much of our work on relevance weighted inference. Methods developed in NR use relevant information in relevant samples (see Chapter 2). Nonparametric regression provides a useful explanatory and diagnostic tool for this purpose. See Eubank (1988), Hardle (1990), and Muller (1988) for good introductions to the general subject area. Several methods have been proposed for estimating $m(x)$: kernel, spline, and orthogonal series.

Recently, local likelihood was introduced by Tibshirani and Hastie (1987) as a method of smoothing by 'running lines' in nongaussian regression models. Staniswalis (1989) carried out a similar generalization of the kernel estimator.

In this Chapter, we apply the method of MREWLE to capture the relevant sample information for generalized smoothing models. Both local likelihood and Staniswalis's method can be viewed as special cases of our methods. We wish to demonstrate the applicability of our methodology, whose primary advantage over individual methods which have been developed in NR lies in its generality. However, we are able to establish secondary advantages as well. The MREWLE always has a smaller variance, which depends on Fisher information.

Usually NR is used for the location parameter, and it also assumes that the mean and variance of the observations exist. But in many cases we are interested in some other parameters, for example, the variance of a normal distribution or the parameters of the Weibull distribution in Section 5.5. In some cases, even when we are interested in the location parameter, the mean and variance of the observations may not exist. The Cauchy distribution in Section 5.5 is an example. The method of MREWLE works well in these situations.
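The Cauchy case mentioned above gives a concrete picture of what a local MREWLE computes when no mean exists. The following is a minimal sketch, not the procedure of Section 5.5: it assumes a unit-scale Cauchy density with location $\theta(x)$, Gaussian kernel weights with a hand-picked bandwidth as the relevance weights, and a plain grid search for the maximizer. All names, data and tuning values are hypothetical.

```python
import math


def gaussian_kernel_weights(xs, x0, bandwidth):
    """Hypothetical relevance weights: Gaussian kernel in |x_i - x0|, normalized to sum to one."""
    raw = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(raw)
    return [w / total for w in raw]


def cauchy_log_density(y, theta):
    """Log density of a Cauchy distribution with location theta and unit scale."""
    return -math.log(math.pi) - math.log(1.0 + (y - theta) ** 2)


def local_mrewle_cauchy(xs, ys, x0, bandwidth, n_grid=4000):
    """Grid maximizer of sum_i p_ni(x0) * log f(y_i, theta) over the range of the responses."""
    weights = gaussian_kernel_weights(xs, x0, bandwidth)
    lo, hi = min(ys), max(ys)
    grid = [lo + k * (hi - lo) / n_grid for k in range(n_grid + 1)]

    def log_rewl(theta):
        return sum(p * cauchy_log_density(y, theta) for p, y in zip(weights, ys))

    return max(grid, key=log_rewl)


# Hypothetical data: responses near 2 around x = 0, one wild response, one distant design point.
xs = [0.0, 0.1, -0.1, 0.05, 0.0, 5.0]
ys = [1.9, 2.0, 2.1, 2.0, 50.0, -30.0]
theta_hat = local_mrewle_cauchy(xs, ys, x0=0.0, bandwidth=0.5)
```

On these data the local MREWLE stays near 2, whereas the kernel-weighted mean of the same responses is pulled above 10 by the single wild value; the grid search is used because the Cauchy log REWL function can have several local maxima.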
Under the model of Example 3.6, we know that given $X_i$, $Y_i$ has density $f(y, \theta(X_i))$ (assumed known up to the unknown parameter $\theta(X_i)$). We seek to estimate $\theta(x)$ at the fixed point $X = x$. After choosing the relevance weights, we get the MREWLE by maximizing over $\theta(x)$,

$$\prod_{i=1}^{n} f(Y_i, \theta(x))^{p_{ni}(x)}.$$

The next straightforward result shows how locally weighted regression estimators obtain.

Theorem 5.1. If $f(y, \theta(X_i))$ is the density for a normal distribution with mean $\mu(X_i)$ and variance $\sigma^2(X_i)$ [here $\theta(x) = \{\mu(x), \sigma^2(x)\}$], then the MREWLE is

$$\hat\mu(x) = \sum_{i=1}^{n} p_{ni}(x) Y_i \qquad (5.1)$$

and

$$\hat\sigma^2(x) = \sum_{i=1}^{n} p_{ni}(x)\{Y_i - \hat\mu(x)\}^2. \qquad (5.2)$$ □

Thus the MREWLE in the normal case of the last theorem is the linear smoother of Fan (1992). Obviously we can get Nadaraya-Watson kernel estimators, k-nearest neighbor estimates, Gasser-Muller estimators and locally linear regression smoothers by choosing appropriate relevance weights. These smoothers include those generated by the use of spline and orthogonal series methods. This means that the MREWLE method subsumes current nonparametric smoothing methods when the error distribution is normal. The variance estimator in (5.2) is the same as the variance estimate in Hardle (1990). Here it is a direct result of the MREWLE.

We now study the MREWLE in relation to current nonparametric smoothing methods in situations where the error distribution family is known. Our discussion addresses both asymptotic and non-asymptotic issues.

We organize this chapter as follows. Small sample properties are considered in Section 5.2. The main asymptotic results are shown in Section 5.3. Also in Section 5.3, we compare the MREWLE with current NR methods. How to choose the relevance weights is considered in Section 5.4. Finally, in Section 5.5, we give some simulation results.

5.2 Small Samples

Our earlier discussion leads us to wonder about the difference between the MREWLE and other linear smoothers for nonnormal error models. The following theorem partially addresses this issue.
There we refer to conventional sufficiency with respect to the joint sampling distribution of all the data.

Theorem 5.2. Suppose the sufficient statistics for the error distribution family are not linear in the data. If $X_i = X_j$ for some $i \neq j$, then with respect to quadratic loss, the linear smoother is inadmissible.

Proof. We easily obtain the conclusion using the Rao-Blackwell Theorem and sufficiency at the replication points. □

The last theorem shows that if we have replicate observations in a designed experiment, we can achieve a uniformly smaller risk than that of any linear smoother (which depends linearly on the data when sufficiency shows it should not). For instance, the one parameter exponential family has sufficient statistics based on $\{\sum T(Y_i)\}$ for some function $T$, where the sums are taken at the replication points. Obviously basing any smoother on the $\{\sum Y_i\}$ would violate the sufficiency principle in this case.

It could be argued that this claim is unfair. Linear smoothers are proposed in a nonparametric-nonparametric framework where neither the error distribution nor the regression function has parametric form. That argument ignores Theorem 5.1, which shows that these linear smoothers are consistent with normal error models. Indeed, in Chapter 6 we obtain smoothers in the nonparametric-nonparametric setting which are very different from linear smoothers. So the argument fails to blunt the impact of the last theorem. Rather, that theorem points to the nonrobustness of linear smoothing methods.

Theorem 5.2 emphasizes the importance of the vehicle which carries the data into a smoothing procedure. And it tells us how to improve on a linear smoother if we have repeated observations at some points. We know by weak sufficiency that the MREWLE must depend only on the sufficient statistics. So it evades the difficulty confronted above by linear smoothers.
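Theorem 5.1 can be checked numerically. The following is a minimal sketch, not part of the original development: the Gaussian kernel, the toy data, the bandwidth and all function names are assumptions made for illustration. It computes (5.1) and (5.2) with Nadaraya-Watson-type weights that sum to one, and compares the relevance weighted normal log-likelihood at the closed-form solution with its value at nearby parameter values.

```python
import numpy as np

def nw_weights(x0, x, h):
    """Kernel relevance weights p_ni(x0) (Gaussian kernel assumed); they sum to one."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return k / k.sum()

def normal_mrewle(x0, x, y, h):
    """Closed-form normal MREWLE of (mu(x0), sigma^2(x0)), as in (5.1) and (5.2)."""
    p = nw_weights(x0, x, h)
    mu = float(np.sum(p * y))
    sigma2 = float(np.sum(p * (y - mu) ** 2))
    return mu, sigma2

def weighted_normal_loglik(mu, sigma2, p, y):
    """Relevance weighted log-likelihood sum_i p_i * log f(y_i; mu, sigma2)."""
    return float(np.sum(p * (-0.5 * np.log(2 * np.pi * sigma2)
                             - (y - mu) ** 2 / (2 * sigma2))))

# Toy data (hypothetical, chosen only for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = x * (1 - x) + 0.1 * rng.normal(size=200)

p = nw_weights(0.5, x, h=0.1)
mu_hat, s2_hat = normal_mrewle(0.5, x, y, h=0.1)
best = weighted_normal_loglik(mu_hat, s2_hat, p, y)

# Moving either parameter away from the closed form should not raise the criterion.
for d_mu, d_s2 in [(0.01, 0.0), (-0.01, 0.0), (0.0, 0.002), (0.0, -0.002)]:
    other = weighted_normal_loglik(mu_hat + d_mu, s2_hat + d_s2, p, y)
    print(d_mu, d_s2, other <= best)
```

Because the weights sum to one, the weighted mean and weighted variance are exactly the maximizers, so every comparison printed above should be `True`.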
When the $\{X_i\}$'s are continuous covariates, we cannot (in principle) have repeated observations at any point, so we cannot improve on linear smoothers by invoking the last Theorem. But we may nevertheless have near ties among the $\{X_i\}$'s, in which case the heuristics underlying that theorem still obtain. Large sample theory below will lead to further discussion of this issue.

In Chapters 2 and 3, we emphasized the NR paradigm because it provided a context wherein some information from the relevant samples has been used to advantage. The last theorem suggests that these methods fail to use all the relevant information available when the error distribution cannot be assumed to be normal. In this way, these linear smoothers seem analogous to moment estimators in classical estimation theory; the MREWLE would then be analogous to the MLE.

5.3 Asymptotic Properties

We begin with the generalized smoothing models described in Example 3.6. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a random sample from a population $(X, Y)$. For given $X = x$, $Y$ has density function $f\{y, \theta(x)\}$. Because we have used relevant sample information for estimation, the MREWLE is usually biased. We define the bias function as

$$B(z) = E_{\theta(z)}\, \partial \log f\{Y, \theta(x)\}/\partial\theta(x). \qquad (5.3)$$

This bias function indicates the bias when we use the information from $Y \mid X = z$ to estimate $\theta(x)$. Under some conditions, we can get $B(x) = 0$ and

$$B(z) = I\{\theta(x)\}\{\theta(z) - \theta(x)\} + o\{\theta(z) - \theta(x)\}, \qquad (5.4)$$

where $I\{\theta(x)\}$ is the Fisher information function for $\theta(x)$. Equation (5.4) indicates the meaning of this bias function. Also, this bias function is a special case of the bias function in Chapter 4.

In the next four subsections we present classes of possible relevance weights suggested by results in modern NR theory. We study one of these classes, that suggested by Nadaraya-Watson, in some detail. For the rest, some expected results will merely be sketched for brevity.

(I) Kernel weights (Nadaraya-Watson).
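The weight sequence for kernel smoothers (for a one dimensional $x$) is defined by

$$p_{ni}(x) = K_h(x - X_i)/\{n\hat g_h(x)\}, \qquad (5.5)$$

where

$$\hat g_h(x) = n^{-1}\sum_{i=1}^{n} K_h(x - X_i) \quad\text{and}\quad K_h(u) = h^{-1}K(u/h). \qquad (5.6)$$

The kernel $K$ is a continuous, bounded and symmetric real function which integrates to one,

$$\int K(u)\,du = 1. \qquad (5.7)$$

Because the form (5.5) of kernel weights $p_{ni}(x)$ was proposed by Nadaraya (1964) and Watson (1964), we call these Nadaraya-Watson weights. The MREWLE with Nadaraya-Watson weights obtains from maximizing

$$n^{-1}\sum_{i=1}^{n} K_h(x - X_i)\log f\{Y_i, \theta(x)\}.$$

We now consider the asymptotic properties of this MREWLE. In the sequel, we always let

$$c_K = \int_{-\infty}^{\infty} u^2K(u)\,du, \qquad d_K = \int_{-\infty}^{\infty} K^2(u)\,du. \qquad (5.8)$$

We need the following assumptions:

Assumptions 5.1

5.1.1 The bias function $B(z)$ has a bounded and continuous second derivative for every fixed $x$.

5.1.2 The function $z \mapsto E_{\theta(z)}\log\{f(Y, \theta(z))/f(Y, \theta(x))\}$ has a bounded and continuous first derivative for every fixed $x$.

5.1.3 The marginal density $g(x)$ of the covariate $X$ has a continuous first derivative and is bounded away from zero in an interval $(a_0, b_0)$.

5.1.4 $\int uK(u)\,du = 0$ and $\int u^4K(u)\,du < \infty$.

5.1.5 The density function $f(y, \theta)$ satisfies the following regularity conditions:
(i) $\log f(y, \theta)$ has three continuous partial derivatives with respect to $\theta$;
(ii) for each $\theta_0 \in \Theta$, there exist integrable functions $H_1(y)$, $H_2(y)$ and $H_3(y)$ such that for $\theta$ in a neighborhood $N(\theta_0)$ the relations
$$\left|\frac{\partial f(y,\theta)}{\partial\theta}\right| \le H_1(y), \quad \left|\frac{\partial^2 f(y,\theta)}{\partial\theta^2}\right| \le H_2(y), \quad \left|\frac{\partial^3 \log f(y,\theta)}{\partial\theta^3}\right| \le H_3(y)$$
hold for all $y$, and $\int H_1(y)\,dy < \infty$, $\int H_2(y)\,dy < \infty$, $E_\theta\{H_3(Y)\} < \infty$ for $\theta \in N(\theta_0)$;
(iii) the Fisher information $I\{\theta(x)\}$ is continuously differentiable and bounded away from zero.

We state the following pointwise properties of the MREWLE.

Theorem 5.3 Under Assumptions 5.1, assume $h \to 0$ and $nh \to \infty$. If $x$ is a fixed point in $(a_0, b_0)$, then the REWL equations admit a sequence of solutions $\{\hat\theta_n(x)\}$ satisfying

$$\left[\frac{nhg(x)I\{\theta(x)\}}{d_K}\right]^{1/2}\left[\hat\theta_n(x) - \theta(x) - \left\{\frac{B''(x)}{2I\{\theta(x)\}} + \frac{B'(x)g'(x)}{g(x)I\{\theta(x)\}}\right\}c_Kh^2\{1 + O(h)\}\right] \to N(0, 1). \qquad (5.9)$$

Proof: For $x \in (a_0, b_0)$, similar to the proof of Theorem 4.2, we can show that the REWL equations admit a sequence of solutions $\{\hat\theta_n(x)\}$ satisfying $\hat\theta_n(x) \to \theta(x)$ in probability. As in the proof of Theorem 4.7, putting

$$A_n = \sum_{i=1}^{n} p_{ni}(x)\frac{\partial\log f\{Y_i, \theta(x)\}}{\partial\theta(x)}, \quad B_n = \sum_{i=1}^{n} p_{ni}(x)\frac{\partial^2\log f\{Y_i, \theta(x)\}}{\partial\theta(x)^2}, \quad C_n = \sum_{i=1}^{n} p_{ni}(x)H_3(Y_i)$$

yields

$$-A_n = B_n\{\hat\theta_n(x) - \theta(x)\} + \tfrac12\xi C_n\{\hat\theta_n(x) - \theta(x)\}^2.$$

Here $|\xi| < 1$. Now

$$\hat\theta_n(x) - \theta(x) = \frac{-A_n}{B_n + \tfrac12\xi C_n\{\hat\theta_n(x) - \theta(x)\}}.$$

We prove:

$$A_n \sim AN\left[\left\{\frac{B''(x)}{2} + \frac{B'(x)g'(x)}{g(x)}\right\}c_Kh^2\{1 + O(h)\},\ \frac{I\{\theta(x)\}d_K}{nhg(x)}\right]. \qquad (5.10)$$

As we know,

$$A_n = \sum_{i=1}^{n} p_{ni}(x)\frac{\partial\log f\{Y_i, \theta(x)\}}{\partial\theta(x)} = \frac{1}{n\hat g_h(x)}\sum_{i=1}^{n} K_h(x - X_i)\frac{\partial\log f\{Y_i, \theta(x)\}}{\partial\theta(x)}.$$

Here the $K_h(x - X_i)\,\partial\log f\{Y_i, \theta(x)\}/\partial\theta(x)$ are iid random variables with expectation

$$a_n = \int\!\!\int K_h(x - z)\frac{\partial\log f\{y, \theta(x)\}}{\partial\theta(x)}f\{y, \theta(z)\}g(z)\,dy\,dz = \int K_h(x - z)B(z)g(z)\,dz$$
$$= \int K(u)B(x - hu)g(x - hu)\,du = \{B''(x)g(x)/2 + B'(x)g'(x)\}c_Kh^2\{1 + O(h)\}.$$

The variance is

$$b_n = \int\!\!\int\left[K_h(x - z)\frac{\partial\log f\{y, \theta(x)\}}{\partial\theta(x)}\right]^2 f\{y, \theta(z)\}g(z)\,dy\,dz - \left[E\,K_h(x - X_1)\frac{\partial\log f\{Y_1, \theta(x)\}}{\partial\theta(x)}\right]^2 = I\{\theta(x)\}d_Kg(x)h^{-1}\{1 + O(h)\}.$$

From the Central Limit Theorem, we have

$$\left(\frac{n}{b_n}\right)^{1/2}\left\{n^{-1}\sum_{i=1}^{n} K_h(x - X_i)\frac{\partial\log f\{Y_i, \theta(x)\}}{\partial\theta(x)} - a_n\right\} \to N(0, 1). \qquad (5.11)$$

As we know that $\hat g_h(x) = n^{-1}\sum_{i=1}^{n} K_h(x - X_i)$,

$$E\,K_h(x - X_1) = \int K_h(x - z)g(z)\,dz = \int K(u)g(x - hu)\,du = g(x)\{1 + O(h)\}$$

and

$$\mathrm{Var}\{K_h(x - X_1)\} = d_Kg(x)h^{-1}\{1 + O(h)\}.$$

By the Central Limit Theorem,

$$\left\{\frac{nh}{g(x)d_K}\right\}^{1/2}\left[\hat g_h(x) - g(x)\{1 + O(h)\}\right] \to N(0, 1). \qquad (5.12)$$

Now

$$A_n - \frac{a_n}{g(x)\{1 + O(h)\}} = \frac{g(x)\{1 + O(h)\}\{\hat g_h(x)A_n - a_n\} - a_n[\hat g_h(x) - g(x)\{1 + O(h)\}]}{\hat g_h(x)g(x)\{1 + O(h)\}}.$$

So

$$\left[\frac{nhg(x)}{I\{\theta(x)\}d_K}\right]^{1/2}\left[A_n - \frac{a_n}{g(x)\{1 + O(h)\}}\right] \to N(0, 1) \qquad (5.13)$$

by (5.11), (5.12) and Slutsky's Theorem ($\hat g_h(x) \to g(x)$ in probability and $a_n \to 0$). We get result (5.10) from (5.13).

As in the above proof, by the SLLN and Slutsky's theorem,

$$B_n \to -I\{\theta(x)\} \text{ in probability} \qquad (5.14)$$

and

$$C_n \to E_\theta H_3(Y) \text{ in probability.} \qquad (5.15)$$

By (5.10), (5.14), (5.15) and Slutsky's theorem, we get the conclusion of the theorem. □

In Theorem 5.3, we only prove that the REWL equations admit a sequence of solutions which is asymptotically normal. As we have discovered in Chapter 4, the MREWLE is a consistent sequence of solutions under certain conditions.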
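As a consistency check on the bias function and on (5.9), it may help to work through the simplest case. The following worked example is an illustration under a stated assumption, namely that $f(y, \theta)$ is the normal density with mean $\theta$ and known variance $\sigma^2$; it is not part of the original argument.

```latex
% Assumption (illustrative only): f(y, \theta) is the N(\theta, \sigma^2) density, \sigma^2 known.
\begin{align*}
\frac{\partial \log f\{y,\theta(x)\}}{\partial\theta(x)} &= \frac{y-\theta(x)}{\sigma^{2}},
\qquad
I\{\theta(x)\} = \frac{1}{\sigma^{2}},\\[4pt]
B(z) &= E_{\theta(z)}\,\frac{Y-\theta(x)}{\sigma^{2}}
      = \frac{\theta(z)-\theta(x)}{\sigma^{2}}
      = I\{\theta(x)\}\{\theta(z)-\theta(x)\},\\[4pt]
B'(x) &= \frac{\theta'(x)}{\sigma^{2}},
\qquad
B''(x) = \frac{\theta''(x)}{\sigma^{2}},\\[4pt]
\text{bias in (5.9)} &=
\left\{\frac{B''(x)}{2I\{\theta(x)\}}+\frac{B'(x)g'(x)}{g(x)I\{\theta(x)\}}\right\}c_K h^{2}
 = \left\{\frac{\theta''(x)}{2}+\frac{\theta'(x)g'(x)}{g(x)}\right\}c_K h^{2},\\[4pt]
\text{variance in (5.9)} &= \frac{d_K}{nhg(x)I\{\theta(x)\}} = \frac{\sigma^{2}d_K}{nhg(x)} .
\end{align*}
```

In this case (5.4) holds exactly, with no remainder term, and the bias and variance reduce to the familiar leading terms for the Nadaraya-Watson kernel estimator of a regression function, in agreement with Theorem 5.1.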
In the following discussion, we always assume that the MREWLE is a consistent sequence of solutions of the REWL equations.

The last theorem gives us the asymptotic bias and variance for $\hat\theta_n$. These can formally be combined to give a conditional MSE (mean square error) result for the MREWLE of $\theta$. That combination suggests heuristically that the MREWLE has the conditional MSE

$$E[\{\hat\theta_n(x) - \theta(x)\}^2 \mid X_1, \ldots, X_n] = \left\{\frac{B''(x)}{2} + \frac{B'(x)g'(x)}{g(x)}\right\}^2 I\{\theta(x)\}^{-2}c_K^2h^4 + \frac{d_K}{nhg(x)I\{\theta(x)\}} + o\left(h^4 + \frac{1}{nh}\right), \qquad (5.16)$$

when $x$ is any fixed point in $(a_0, b_0)$.

Similar results are suggested for the case of fixed design variables of the form $X_j = G^{-1}(j/n) + o(1/n)$, where the function $G$ is the cdf corresponding to $g(x)$. We defer for future work the formal proof of these and other results which are stated below.

When $\theta(x)$ is a location parameter, we can use the ordinary Nadaraya-Watson kernel estimator. The MSE of the Nadaraya-Watson kernel estimator is (Hardle 1990, p. 77)

$$\mathrm{MSE} = \left\{\frac{\theta''(x)}{2} + \frac{\theta'(x)g'(x)}{g(x)}\right\}^2 c_K^2h^4 + \frac{\sigma^2(x)d_K}{nhg(x)} + o\left(h^4 + \frac{1}{nh}\right). \qquad (5.17)$$

Comparing (5.16) with (5.17) suggests tentatively that:

(i) the MREWLE usually has smaller variance, which depends on Fisher information (like the Fisher information bound for the exact sample case);

(ii) the bias of the MREWLE depends on the bias function $B(z)$ and the Fisher information function, while the bias of the Nadaraya-Watson kernel estimator depends on the regression function $\theta(x)$. [But it is hard to compare the biases of these two methods.]

(II) Gasser-Muller type weights.

Let us now use the weights derived from those of Gasser and Muller (1979). Their results formally suggest the following conclusion for the resulting MREWLE:

$$\left[\frac{2nhg(x)I\{\theta(x)\}}{3d_K}\right]^{1/2}\left[\hat\theta_n(x) - \theta(x) - \frac{B''(x)}{2I\{\theta(x)\}}c_Kh^2\{1 + O(h)\}\right] \to N(0, 1), \qquad (5.18)$$

if $x$ is a fixed point in $(a_0, b_0)$. We leave verification of this result to future work.

(III) k-NN weights.
Use of the k-NN weights defined in Hardle (1989, page 42), and his results, heuristically suggest that the MREWLE sequence $\{\hat\theta_n(x)\}$ with k-NN weights satisfies

$$\left[\frac{kI\{\theta(x)\}}{2d_K}\right]^{1/2}\left[\hat\theta_n(x) - \theta(x) - \left(\frac{k}{n}\right)^2\frac{B''(x)g(x) + 2B'(x)g'(x)}{8g^3(x)I\{\theta(x)\}}c_K\{1 + O(k/n)\}\right] \to N(0, 1) \qquad (5.19)$$

for fixed $x$.

(IV) Locally linear smoother weights.

Locally linear smoother weights are defined by Fan (1992) for nonparametric regression. Using those weights to define the MREWLE for generalized smoothing models suggests that

$$\left[\frac{nhg(x)I\{\theta(x)\}}{d_K}\right]^{1/2}\left[\hat\theta_n(x) - \theta(x) - \frac{B''(x)}{2I\{\theta(x)\}}c_Kh^2\{1 + O(h)\}\right] \to N(0, 1) \qquad (5.20)$$

by analogy with Fan's results, for any fixed $x$.

Now we compare the MREWLE smoothers with the nonparametric regression smoothers. In the following discussion, assume $\theta(x)$ is a location parameter, so that nonparametric regression smoothers are applicable. For comparison, we summarize the asymptotic pointwise bias and variance of nonparametric smoothers in Table 5.1 (see Hardle 1989 and Fan 1992). We also give in Table 5.2 the results stated above (but not yet proved) about the MREWLE smoothers.

Table 5.1: Pointwise Biases and Variances of Regression Smoothers

Method | Bias | Variance
Nadaraya-Watson | $\{\theta''(x)/2 + \theta'(x)g'(x)/g(x)\}h^2c_K$ | $\sigma^2(x)d_K/\{nhg(x)\}$
Gasser-Muller | $\theta''(x)h^2c_K/2$ | $3\sigma^2(x)d_K/\{2nhg(x)\}$
k-NN weights | $[\{\theta''(x)g(x) + 2\theta'(x)g'(x)\}/\{8g^3(x)\}](k/n)^2c_K$ | $2\sigma^2(x)d_K/k$
Local linear smoother | $\theta''(x)h^2c_K/2$ | $\sigma^2(x)d_K/\{nhg(x)\}$

Table 5.2: Conjectured Pointwise Biases and Variances of the MREWLE Smoothers

Method | Bias | Variance
Nadaraya-Watson | $\{B''(x)/2 + B'(x)g'(x)/g(x)\}h^2c_K/I\{\theta(x)\}$ | $d_K/[nhg(x)I\{\theta(x)\}]$
Gasser-Muller | $B''(x)h^2c_K/[2I\{\theta(x)\}]$ | $3d_K/[2nhg(x)I\{\theta(x)\}]$
k-NN weights | $[\{B''(x)g(x) + 2B'(x)g'(x)\}/\{8g^3(x)I\{\theta(x)\}\}](k/n)^2c_K$ | $2d_K/[kI\{\theta(x)\}]$
Local linear smoother | $B''(x)h^2c_K/[2I\{\theta(x)\}]$ | $d_K/[nhg(x)I\{\theta(x)\}]$

Tables 5.1 and 5.2 suggest that the results for the MREWLE smoothers are similar to those for their corresponding nonparametric regression smoothers. The variances of the MREWLE smoothers would seem to depend on the Fisher information function, while the variances of nonparametric regression smoothers are based on the variance function.
This suggests that the MREWLE smoothers usually have smaller variances. But the biases of the MREWLE smoothers depend on the bias function and the Fisher information function, while the biases of the nonparametric regression smoothers depend on the parameter function $\theta(x)$. It is hard to compare them. Our tentative findings remain to be confirmed by rigorous analysis.

In nonparametric regression models, the comparison of these four smoothers has been considered by several authors (see Chu and Marron 1991; Fan 1992; Hardle 1989; Mack and Muller 1988; among others). Now we compare again, in a heuristic fashion, the four MREWLE smoothers.

The Nadaraya-Watson MREWLE seems essentially similar to the k-NN MREWLE. By choosing $k = 2g(x)nh$, they have the same apparent variance and bias. The bias of the Nadaraya-Watson MREWLE depends on both $B''(x)$ and $B'(x)g'(x)/g(x)$. Keeping $B''(x)$ fixed, we first remark that in a highly clustered design, where $g'(x)/g(x)$ is large, the bias of the Nadaraya-Watson MREWLE is large. Note also that when $B'(x)$ is large, so is the bias of that estimator. The Gasser-Muller MREWLE appears to have an asymptotic variance 3/2 times as large as that of the Nadaraya-Watson MREWLE. On the other hand, the bias of the Gasser-Muller MREWLE seems simpler; it does not share the drawbacks mentioned above. The locally linear smoother MREWLE appears to overcome the disadvantages of these two methods.

As noted above, the biases of the MREWLE smoothers seem to depend on the bias function, not the parameter function. This makes it hard to compare them with other methods. In Chapter 7, we propose an MREWLE which has both a simple bias (depending on the parameter function $\theta$) and a small variance.

5.4 Bandwidth Selection

For generalized smoothing models, the selection of weights to construct the REWL is analogous to getting a proper bandwidth.
It is well known that the bandwidth plays a very important role in the trade-off between reducing bias and reducing variance in nonparametric regression. Often the user will be able to choose the bandwidth satisfactorily by eye with interactive graphics. It is nevertheless desirable to have a reliable data-driven rule for selecting the value of $h$. Here we list some possible methods.

Cross-validation (Wahba and Wold, 1975) may be used to select the bandwidth, in which case $h$ is chosen to maximize

$$\sum_{i=1}^{n}\log f(Y_i, \hat\theta^{(i)}).$$

Here $\hat\theta^{(i)}$ is the maximizer with respect to $\theta$ of

$$\sum_{j \neq i} p_{nj}(X_i)\log f(Y_j, \theta).$$

This is a global bandwidth selection procedure, whose asymptotic optimality properties are not known.

Alternatively the bandwidth may be selected by the 'plug-in' procedure, based on the asymptotic expansion of the squared error for kernel smoothers:

$$\mathrm{MSE} \approx \frac{n^{-1}h^{-1}d_K}{I\{\theta(x)\}g(x)} + \frac{h^4[c_K\{B''(x)/2 + B'(x)(g'/g)(x)\}]^2}{I^2\{\theta(x)\}}.$$

An 'optimal' bandwidth minimizing this expression would be

$$h_{opt} = \left\{\frac{d_KI\{\theta(x)\}/g(x)}{4[c_K\{B''(x)/2 + B'(x)(g'/g)(x)\}]^2}\right\}^{1/5}n^{-1/5}.$$

This bandwidth is proportional to $n^{-1/5}$, with constants depending on the unknown $I\{\theta(x)\}$, $B'(x)$, $B''(x)$ and so on. The plug-in approach has been shown to be more stable in both theoretical and practical performance in NR (see e.g. Park and Marron 1990). Indeed, Fan and Marron (1992) show that, in the density estimation case, the 'plug-in' selector is an asymptotically efficient method from a semiparametric point of view. We expect the 'plug-in' procedure will perform as well in the MREWLE case.

5.5 Simulation

We have shown, via asymptotics, that the MREWLE possesses a number of desirable properties. Now we use three simulated experiments to illustrate its finite sample behavior. Two methods are considered here: the Nadaraya-Watson MREWLE and the locally linear smoother MREWLE.

Simulation 1. A random sample of size $n$ is simulated from the model

$$Y = X(1 - X) + 0.01\varepsilon, \qquad (5.21)$$

with $\varepsilon \sim$ Cauchy(0,1) independent of $X \sim$ Uniform(0,1).
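A Nadaraya-Watson MREWLE for a Cauchy location model of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for the figures: the Gaussian kernel and bandwidth follow the text, but the noise scale (treated as known), the grid search used in place of a proper optimizer, the random seed and all function names are assumptions.

```python
import numpy as np

def kernel_weights(x0, x, h):
    """Gaussian-kernel relevance weights p_ni(x0), normalized to sum to one."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return k / k.sum()

def cauchy_mrewle(x0, x, y, h, scale, grid):
    """Grid maximizer of sum_i p_ni(x0) * log f(y_i; theta) for a Cauchy
    location model with known scale (additive constants are dropped)."""
    p = kernel_weights(x0, x, h)
    resid = (y[None, :] - grid[:, None]) / scale
    loglik = -(p[None, :] * np.log1p(resid ** 2)).sum(axis=1)
    return float(grid[np.argmax(loglik)])

# Simulated data in the spirit of Simulation 1 (scale 0.01 is an assumption).
rng = np.random.default_rng(0)
n, h, scale = 200, 0.1, 0.01
x = rng.uniform(0.0, 1.0, n)
y = x * (1 - x) + scale * rng.standard_cauchy(n)

grid = np.linspace(-0.5, 0.75, 2501)            # candidate values of theta(x0)
theta_hat = cauchy_mrewle(0.5, x, y, h, scale, grid)
nw_mean = float(np.sum(kernel_weights(0.5, x, h) * y))  # ordinary kernel mean

print("MREWLE at x = 0.5:", theta_hat)          # true value is 0.25
print("kernel-weighted mean at x = 0.5:", nw_mean)
```

The grid search is used because the weighted Cauchy log-likelihood can have several local maxima; a local optimizer started from the grid maximizer would be a natural refinement. The kernel-weighted mean is printed only for comparison, since it can be dominated by a few extreme Cauchy draws.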
The nonparametric regression kernel smoothers do not work here, because the mean and variance of the Cauchy(0,1) distribution do not exist. We use the MREWLE smoothers here. A typical realization when $n = 200$ is shown in Figure 5.1. The bandwidth used here is $h = 0.1$. A Gaussian kernel function is used here and in the sequel. The Nadaraya-Watson MREWLE based on five simulations from Model (5.21) is shown in Figure 5.2. Figure 5.3 shows the locally linear smoother MREWLE based on five simulations from Model (5.21).

From the above Figures, both the Nadaraya-Watson MREWLE and the locally linear smoother MREWLE perform reasonably. The Nadaraya-Watson MREWLE has large boundary effects, and may require boundary modifications. But the locally linear smoother MREWLE seems accurate over the whole interval.

Simulation 2. A random sample of size $n$ is simulated from the model

$$Y = \sin(X) + 0.01\varepsilon, \qquad (5.22)$$

with $\varepsilon \sim$ Cauchy(0,1) independent of $X \sim N(0, \sigma^2)$. When $\sigma$ is small, the quantity $g'(x)/g(x)$ gets large. Thus we anticipate that the Nadaraya-Watson MREWLE does not behave well for small $\sigma$.

We estimate the parameter function in the interval $x \in [-2\sigma, 2\sigma]$, two standard deviations away from its normal mean. Figure 5.4 plots the estimates for the case $n = 200$ and $\sigma = 0.25$. The bandwidth is $h = 0.15$. The Nadaraya-Watson MREWLE has a large bias. The locally linear smoother MREWLE provides a suitable estimate.

Simulation 3. Instead of estimating the location parameter function, consider the following model:

$$f_{Y|X=x}(y) = \ldots\, y^{\ldots}\exp(\ldots\beta(x)\ldots) \qquad (5.23)$$

with $\alpha = 1$, $\beta(x) = 2 + 4x(1 - x)$ and $X \sim$ Uniform(0,1). A typical realization with $n = 200$ is shown in Figure 5.5. The bandwidth used here is $h = 0.25$. The Nadaraya-Watson MREWLE has a large bias, while the locally linear smoother MREWLE performs reasonably well in the tails.

[Figure 5.1: A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.21) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c.]

[Figure 5.2: The Nadaraya-Watson MREWLE based on five simulations from Model (5.21) with n = 200.]

[Figure 5.3: The locally linear smoother MREWLE based on five simulations from Model (5.21) with n = 200.]

[Figure 5.4: A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.22) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c.]

[Figure 5.5: A comparison of the Nadaraya-Watson MREWLE with the locally linear smoother MREWLE from Model (5.23) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, and the locally linear smoother MREWLE, c.]

Chapter 6

A Relevance Weighted Nonparametric Quantile Estimator

6.1 Introduction

This Chapter concerns situations in which a sample $X_1 = x_1, \ldots, X_n = x_n$ of independent observations is drawn from populations with different CDF's $F_1, \ldots, F_n$, respectively. Inference is about an attribute $\theta_0$ of another population with CDF $F_0$; an observation may be available from the latter population as well. In this Chapter $\theta_0$ will be a quantile of $F_0$; elsewhere we address other problems of interest within the same general framework. The $\{F_i\}$ are unknown, so that we are in the nonparametric case.

The special character of the problems investigated in this chapter derives from the belief that there is relevant information in the $X_i$, $i = 1, \ldots, n$, about $\theta_0$. However, this information is deemed to be "inexact". By this we mean it cannot be translated into a prior distribution from which a marginal posterior distribution for $\theta_0$ could be constructed. And we mean there are no known structural constraints among the attributes of the various populations to force the $x_1, \ldots, x_n$ into inferences about $\theta_0$.

The example given below illustrates the problem. That example reflects the situation underlying nonparametric regression. In fact, our approach may be thought of as generalized smoothing. In nonparametric smoothing it leads to locally weighted regression quantile estimators $\hat\theta(t)$, $-\infty < t < \infty$, for even rough regression quantile functions $\theta(t)$, $-\infty < t < \infty$, if the relevance of the $x(t_i)$'s corresponding to $t_i$'s remote from $t$ can be ascertained. Our method bears on other problems, like those of meta-analysis, where there is no well defined underlying mathematical structure. In Section 6.8, we briefly discuss linkages with standard statistical methods.

As we have noted in Chapter 2, there are problems where relevant sample information about $\theta$ may be used to advantage. Yet there does not seem to be a general theory underlying such problems. How to use relevant sample information like that encountered in meta-analysis thus becomes an important topic, which seems to have been addressed largely on a piecemeal basis.

In Chapter 3, we introduced the "relevance weighted likelihood" (REWL) as a general device for using relevant sample information in parameter estimation. However, when we cannot specify the parametric form of the population CDF's, the REWL based theory is of no avail. To cope with such problems we introduce in this chapter a nonparametric but general theory based on an extension of the empirical distribution function which we call the REWED ("Relevance Weighted Empirical Distribution"). We then tackle the problem of quantile estimation within that framework, which is exemplified by the following example.
Example 6.1 Distribution Smoothing. Let $X(t_i)$ have distribution function $F_{t_i}$; $0 < t_i < 1$, $i = 1, \ldots, n$. Assume that $X(t_1), \ldots, X(t_n)$ are independent and that $F_t$ changes smoothly with $t$, "smooth" meaning

$$\sup_x |F_t(x) - F_{t+\Delta(t)}(x)| \to 0 \text{ as } \Delta(t) \to 0.$$

We seek to estimate $F_t$ for a fixed $t$.

In general, let $F_0$ be an unknown distribution function describing the population of interest. The classical paradigm would assume independent and identically distributed (hereafter iid) observations from $F_0$. Here instead, only observations from other populations described by CDF's $F_i$, $i = 1, 2, \ldots, n$, are available. If we believe the $\{F_i\}$, $i = 1, 2, \ldots, n$, are related to $F_0$, then $x_1, x_2, \ldots, x_n$ may be used for inference about attributes of $F_0$. The question is how.

In this Chapter, our answer to this question uses the REWED, defined in Section 6.2. From the REWED we can construct moment estimators for parameters defined in terms of the moments of $F_0$ [this idea will be discussed elsewhere]. But here we consider only the estimation of the quantiles of $F_0$.

In Section 6.2, the REW quantile estimator will be defined, in addition to the REWED, for the problem identified by the last example. And we will offer generalizations along with some examples.

Strong consistency of the REWED and of the quantile estimators is stated in Section 6.3 and proved in Section 6.9 under mild conditions. These results generalize the results for iid sampling. Some other asymptotic properties are given in Section 6.3.

R.R. Bahadur (1966) gives a useful asymptotic representation of the sample quantile as a simple sum of random variables by using the empirical distribution function. We give a generalization of Bahadur's results for general weights $\{p_{ni}\}$ in the non-iid case. This is the subject of Section 6.4.

We discuss the asymptotic normality of the REW quantile estimator in Section 6.5. In Section 6.6, we apply the theory of this chapter. We get reasonable estimators for location parameters for several distributions.
By comparing them with the weighted sample mean, we find their asymptotic relative efficiency (ARE) in the iid case to be fairly high.

Section 6.7 presents the results of a simulation study, using the REW quantile estimator for the nonparametric smoothing model. The proofs of our Theorems appear in Section 6.9.

6.2 The REWED and REW Quantile Estimation

Let us reconsider Example 6.1. Because $t \mapsto F_t$ changes smoothly, we could hypothetically use $\sum_{i=1}^{n} p_iF_{t_i}(x)$ to approximate $F_t(x)$. The choice of the weights $p_i$ would depend on the perceived relationship between $F_{t_i}$ and $F_t(x)$. [The $\{p_i\}$ might plausibly be generated from a kernel.] But the $\{F_{t_i}\}$ are unknown. So instead we must use the data $X(t_i)$, $i = 1, \ldots, n$, to estimate $F_t(x)$, say by $\sum_{i=1}^{n} p_iI(X(t_i) \le x)$, $I(\cdot)$ being an indicator function. This empirical distribution we will call the relevance weighted empirical distribution (REWED).

Estimating $F_t(x)$ by the REWED results in two errors, from:
(i) using $\sum_{i=1}^{n} p_iF_{t_i}(x)$ to approximate $F_t(x)$;
(ii) using $\sum_{i=1}^{n} p_iI(X(t_i) \le x)$ to estimate $\sum_{i=1}^{n} p_iF_{t_i}(x)$.
Much of this Chapter will be concerned with (ii).

To generalize the ideas in the above example, let $X_n = [X_{n1}, \ldots, X_{nn}]$, $n \ge 1$, be a triangular array of row-independent random variables with associated array of distribution functions $F_n = [F_{n1}, \ldots, F_{nn}]$, $n \ge 1$, and nonnegative constants $p_n = [p_{n1}, \ldots, p_{nn}]$, $n \ge 1$, satisfying $\sum_{i=1}^{n} p_{ni} = 1$. Define:

• the relevance weighted empirical distribution function (REWED) by
$$F_n(x) = \sum_{i=1}^{n} p_{ni}I(X_{ni} \le x);$$

• the relevance weighted average distribution function (REWADF), for $-\infty < x < \infty$, by
$$\bar F_n(x) = \sum_{i=1}^{n} p_{ni}F_{ni}(x);$$

• the $p$th quantile of $\bar F_n$ by
$$\xi_p(n) = \inf\{x : \bar F_n(x) \ge p\}, \quad 0 < p < 1;$$

• the $p$th relevance weighted quantile (REWQ) estimator by
$$\hat\xi_{np} = \inf\{x : F_n(x) \ge p\}$$
for a sample $\{X_{n1}, \ldots, X_{nn}\}$.

To illustrate the use of these REW quantile estimators, we offer the following example.

Example 6.2 Nonparametric Regression. Let
$$Y_i = f(x_i) + e_i, \quad x_i \in [a, b], \quad i = 1, 2, \ldots, n;$$
here $e_1, e_2, \ldots, e_n$ are iid and symmetric, $E(e_i) = 0$ for all $i$, and $f(x)$ is a smooth function. To estimate $f(x)$ we may use the median of the REWED.
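The REWED and the REW quantile estimator defined in this section can be sketched in a few lines. This is a minimal illustration, assuming the weights are nonnegative and sum to one; the function names `rewed` and `rew_quantile` and the toy data are hypothetical.

```python
import numpy as np

def rewed(x, data, p):
    """REWED: F_n(x) = sum_i p_i * I(data_i <= x)."""
    return float(np.sum(p * (data <= x)))

def rew_quantile(data, p, q):
    """REW quantile estimator inf{x : F_n(x) >= q}: the smallest observation
    whose cumulative weight (in sorted order) reaches q."""
    order = np.argsort(data)
    cum = np.cumsum(p[order])
    # The min(...) guards against cum[-1] falling just below 1 through rounding.
    idx = min(int(np.searchsorted(cum, q, side="left")), len(data) - 1)
    return float(data[order][idx])

# Toy example with exactly representable weights (hypothetical data).
data = np.array([10.0, 1.0, 5.0])
p = np.array([0.5, 0.25, 0.25])

print(rewed(5.0, data, p))          # 0.5  (weights of the observations 1 and 5)
print(rew_quantile(data, p, 0.5))   # 5.0  (REW median)
print(rew_quantile(data, p, 0.6))   # 10.0
```

With equal weights $p_{ni} = 1/n$ these reduce to the ordinary empirical distribution function and sample quantile. In a smoothing problem such as Example 6.2, the weights would instead be generated from a kernel centred at the point of interest.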
This kind of estimator is usually robust and is often quite efficient.

6.3 Strong Consistency of REW Quantile Estimators

In this section, we describe strong and uniform consistency properties of the REWED. Then we describe the strong consistency of the quantile estimators derived from the REWED. In the following discussion, we assume that $F$ is the CDF of interest and $\xi_p$ its $p$th quantile.

Theorem 6.1 (Strong Consistency of $F_n(x)$).
a) Suppose $\sum_{n} \exp(-\epsilon^2K_n) < \infty$ for all $\epsilon > 0$, where $K_n = (\sum_{i=1}^{n} p_{ni}^2)^{-1}$. Then $|F_n(x) - \bar F_n(x)| \to 0$ a.s. for all $x$.
b) Further, if $|\bar F_n(x) - F(x)| \to 0$ for all $x$, then $|F_n(x) - F(x)| \to 0$ a.s. for all $x$.

Corollary 1. If $\log(n)/K_n = o(1)$, then $|F_n(x) - \bar F_n(x)| \to 0$ a.s. for every $x$. □

The hypothesis of the theorem is easily satisfied. If, for example, $\max_i\{p_{ni}\}$ = [...], then the hypothesis is satisfied. The assumption $\bar F_n(x) - F(x) \to 0$ for all $x$ is essential; without this, we cannot get a consistent estimator of the CDF $F(x)$. Qualitatively, this condition is the one which gives operational meaning to the notion of "relevance weights".

Theorem 6.2 (Uniformly Strong Consistency of $F_n(x)$)
a) Under the hypothesis of Theorem 6.1 a), and the further assumptions that (i) $\sup_{n,x} \bar f_n(x)$ is bounded and (ii) $\lim_{M\to\infty}\sup_n\{(1 - \bar F_n(M)), \bar F_n(-M)\} \to 0$, where $\bar f_n(x)$ is the derivative of $\bar F_n(x)$, we have
$$\sup_x |F_n(x) - \bar F_n(x)| \to 0 \text{ a.s.}$$
b) Further, if $\sup_x |\bar F_n(x) - F(x)| \to 0$, then
$$\sup_x |F_n(x) - F(x)| \to 0 \text{ a.s.}$$

When the distributions underlying our investigation derive from the same family, the conditions of the last theorem are usually satisfied.

Theorem 6.3 (Strong Consistency of $\hat\xi_{np}$)
a) Under the conditions of Theorem 6.2 a), suppose $x = \xi_p(n)$ solves uniquely the inequalities $\bar F_n(x-) \le p \le \bar F_n(x)$. Then $\hat\xi_{np} - \xi_p(n) \to 0$ a.s. as $n \to \infty$.
b) Further, if $\sup_x |\bar F_n(x) - F(x)| \to 0$, then $\hat\xi_{np} - \xi_p \to 0$ a.s. as $n \to \infty$.

The uniqueness condition on $\xi_p(n)$ imposed in the last theorem cannot be dropped.

We finish this section with the following theorem giving a probabilistic inequality for quantile estimators. This theorem will be used in the next section.
Theorem 6.4 Suppose $x = \xi_p(n)$ solves uniquely the inequalities $\bar F_n(x-) \le p \le \bar F_n(x)$ for any given $p \in (0, 1)$. Then
$$P(|\hat\xi_{np} - \xi_p(n)| > \epsilon) \le 2\exp\{-2\delta_\epsilon^2(n)K_n\}$$
for every $\epsilon > 0$ and $n$, where $\delta_\epsilon(n) = \min\{\bar F_n(\xi_p(n) + \epsilon) - p,\ p - \bar F_n(\xi_p(n) - \epsilon)\}$.

The last theorem shows that $P(|\hat\xi_{np} - \xi_p(n)| > \epsilon)$ converges to 0 exponentially fast. The value of $\epsilon$ ($> 0$) may depend upon $K_n$ if desired. These bounds hold for each $n = 1, 2, \ldots$, and so may be applied for any fixed $n$ as well as for asymptotic analysis.

6.4 Asymptotic Representation Theory

For the case of iid data and $p_{ni} = 1/n$, $i = 1, \ldots, n$, Bahadur (1966) expresses sample quantiles asymptotically as sums of independent random variables by representing them as a linear transform of the sample distribution function evaluated at the relevant quantile. From these representations, a number of important properties ensue (see Bahadur 1966 and Serfling 1980 for details).

We now generalize this asymptotic representation to the cases of non-iid observations and general $p_{ni}$.

Theorem 6.5 Let $0 < p < 1$ and $m_n = \max_{1 \le i \le n}\{p_{ni}\}$. Suppose:
1. $\bar F_n$ has a bounded second derivative in the neighbourhood of $\xi_p(n)$, with $\bar F_n'(\xi_p(n)) = \bar f_n(\xi_p(n))$;
2. there exists $c > 0$ such that $\inf_n \bar f_n(\xi_p(n)) > c$;
3. there exists $c > 0$ such that [...] $< \infty$;
4. $F_{ni}$ has a uniformly bounded first derivative in the neighbourhood of $\xi_p(n)$;
5. $m_n = o\{K_n^{-3/4}(\log K_n)^{[\ldots]}\}$.
Then
$$\hat\xi_{np} = \xi_p(n) + \frac{p - F_n(\xi_p(n))}{\bar f_n(\xi_p(n))} + R_n,$$
where $R_n = O\{K_n^{-3/4}(\log [\ldots])^{[\ldots]}\}$, $n \to \infty$, with probability 1.

The Bahadur representation is a special case of this theorem, suggesting that the result of our theorem may be fairly accurate, that is, hard to improve upon.

The distribution of the REW sample quantile is usually hard to find, but that of the REW sample distribution function is relatively easy. By this theorem, we can use the REW sample distribution function evaluated at the relevant quantile to study the REW sample quantile asymptotically. A simple example is that we can use this representation to prove quite easily the asymptotic normality of the REW sample quantile.
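The exponential inequality of Theorem 6.4 can be examined by simulation. The sketch below is illustrative only and rests on stated assumptions: the bound is taken in the form $2\exp\{-2\delta_\epsilon^2(n)K_n\}$ with $K_n = (\sum_i p_{ni}^2)^{-1}$, the observations are iid Uniform(0,1) (so that the averaged CDF is $F(x) = x$ whatever the weights, $\xi_p(n) = p$ and $\delta_\epsilon(n) = \epsilon$), and the kernel-shaped weights, sample size and function names are hypothetical choices.

```python
import numpy as np

def rew_quantile(data, p, q):
    """REW quantile estimator inf{x : F_n(x) >= q} for weights p summing to one."""
    order = np.argsort(data)
    cum = np.cumsum(p[order])
    idx = min(int(np.searchsorted(cum, q, side="left")), len(data) - 1)
    return float(data[order][idx])

rng = np.random.default_rng(0)
n, q, eps, reps = 200, 0.5, 0.1, 2000

# Unequal, kernel-shaped relevance weights (an arbitrary illustrative choice).
t = (np.arange(n) + 0.5) / n
p = np.exp(-0.5 * ((t - 0.5) / 0.2) ** 2)
p /= p.sum()
K_n = 1.0 / np.sum(p ** 2)

# For iid U(0,1) data: xi_p(n) = q and delta_eps(n) = eps when eps < min(q, 1 - q).
bound = 2.0 * np.exp(-2.0 * eps ** 2 * K_n)

exceed = 0
for _ in range(reps):
    sample = rng.uniform(0.0, 1.0, n)
    exceed += abs(rew_quantile(sample, p, q) - q) > eps
freq = exceed / reps

print("K_n:", K_n)
print("Monte Carlo frequency of |error| > eps:", freq)
print("exponential bound:", bound)
```

The simulated exceedance frequency should lie well below the bound, which is expected since inequalities of this Hoeffding type are conservative. Note that $K_n$ is smaller than $n$ here, reflecting the loss of effective sample size caused by unequal weights.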
6.5 Asymptotic Normality of Except for the case of iid random variables, we cannot always find the exact distribution of . The asymptotic distribution of given in the following theorem may therefore be useful. Theorem 6.6 Let 0 < p < 1 and V = — Assume P’ is differentiable at p(n), inf > c > 0 and maxi<j<n(pnj’/2)—* 0 as n —* co. Then llm - ())V”2<t} = (t) where (t) is the distribution function of N(0, 1). 6.6 Applications In this section, we use REW sample quantiles to estimate location parameters, and compare these estimators with the weighted sample mean estimators. Example 6.3 Let {X} be an independent sample with X N(,u,o) i = 1, ... , the a being known and it unknown. An estimate of it is required. 99 Analysis of Example 6.3 Using the weighted sample mean to estimate seems nat ural: i%=YZcX, c=1andc>0. We easily deduce that c = [o2]/[Z1u2] minimizes the mean squared error. Then AN{, ( 1/u’}. Now let us try using the median to estimate . Let be the distribution of X and = pF,,j. The median of F is t and we use the sample median med to estimate i’• By the results of Section 6.5, we get -‘n 2 A7.Tf _________________ cmed ‘‘ i-iJv1fL, ,1i—n .c .1i=iPnjnPii here fT(Iu) = We want to minimize the variance of the asymptotic distribution subject to p = 1. We easily obtain = /o The asymptotic relative efficiency of these two estimators is ARE(I%,Imed) = Remarks 1 1. For the iid normal case, the ARE of the sample mean estimator relative to the sample median estimator is . Here we have proved that when the samples are from normal distributions with the same mean, but different variances, the ARE of the weighted sample mean estimator relative to the weighted sample median estimator yields the same value . . The weights used in the sample mean are different from the weights used in the sample median. We only compare the two best estimators here. If we use the same weights, the ARE can be larger or smaller than . 100 3. 
The weighted sample median should be more robust than the weighted sample mean.

Example 6.4 Consider the double exponential family. Assume the density of $X_i$ to be $(2r_i)^{-1}\exp(-|x - \mu|/r_i)$; the $r_i$ are known while $\mu$ is unknown, $i = 1, \ldots, n$. We again use the weighted sample mean and weighted sample median of Example 6.3 to estimate $\mu$.

Analysis of Example 6.4. Choose $c_i = r_i^{-2} / \sum_{j=1}^n r_j^{-2}$ to minimize the mean squared error. Then $\hat\mu \sim AN\{\mu, 2/(\sum_{i=1}^n 1/r_i^2)\}$. By choosing $p_{ni} = (1/r_i)/(\sum_{j=1}^n 1/r_j)$, we get the weighted sample median $\hat\mu_{med}$. From the results of Section 6.5, $\hat\mu_{med} \sim AN\{\mu, 1/(\sum_{i=1}^n 1/r_i^2)\}$. The asymptotic relative efficiency of these two estimators is $ARE(\hat\mu, \hat\mu_{med}) = 2$.

Remarks 2
1. The ARE of the best sample mean estimator relative to the best sample median estimator does not depend on the $\{r_i\}$. The median is the more efficient estimator.
2. As in the normal case, the weights used in the sample mean do not equal the weights used in the sample median.
3. The weighted sample median should be more robust.

6.7 Simulation Study

We have shown that the REW quantile estimator possesses a number of desirable asymptotic properties. In this section we use two simulated examples to obtain insight into its performance with a finite sample. The model in Example 6.2 is used in this simulation, where we compare the REW quantile estimator with the Nadaraya-Watson estimate. The Gaussian kernel function is used to generate the relevance weights.

Simulation Study 1. A random sample of size $n$ is simulated from the model

$$Y = X(1 - X) + \epsilon,$$

with $\epsilon \sim N(0, 0.5)$ independent of $X \sim U(0, 1)$. A typical realization when $n = 1000$ is shown in Figure 6.1. The bandwidth used here and in all subsequent simulations is $h = 0.1$ (chosen by eye). Let us next add 50 outliers from $N(2, 0.5)$ to the simulation experiment just described. The result is shown in Figure 6.2. We have not used boundary corrections.

Simulation Study 2. In the model of Simulation 1, instead of using the normal error, we draw the $\epsilon$ from a double exponential distribution with $r = 0.1$.
Figure 6.3 shows the results of a curve fit based on 100 simulated observations. The simulation results with 10 outliers from $N(-0.5, 0.25)$ are shown in Figure 6.4. For the data of Simulation 1 without outliers, the 0.25 quantile curve estimate obtained by using the REW quantile estimator is shown in Figure 6.5.

Figure 6.1: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator. The model is $Y = X(1 - X) + \epsilon$, where $X$ is uniform (0,1) and $\epsilon$ is $N(0, 0.5)$. The sample size is $n = 1000$ and the bandwidth is $h = 0.1$. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.

Figure 6.2: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator with outliers. To the data depicted in Figure 6.1, we add 50 $\epsilon$-outliers from $N(2, 0.5)$. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.

Figure 6.3: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator. The model is $Y = X(1 - X) + \epsilon$, where $X$ is from uniform (0,1) and $\epsilon$ from a double exponential distribution with $r = 0.1$. The sample size is $n = 100$ and the bandwidth is $h = 0.1$. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.

Figure 6.4: A comparison of the Nadaraya-Watson estimate with the REW quantile estimator with outliers. To the data depicted in Figure 6.3, we add 10 $\epsilon$-outliers from $N(-0.5, 0.25)$. The true curve is a, the REW quantile estimator, b, and the Nadaraya-Watson, c.

Figure 6.5: A REW quantile estimator of a quantile curve. The 0.25 quantile curve is estimated for the data depicted in Figure 6.1. The true quantile curve is a, and the REW quantile estimator is b.

The results of the simulation can be summarized as follows:
1. In the model of Example 6.2, when the error has the double exponential distribution, the REW quantile estimator performs a little better than the Nadaraya-Watson estimate (see Figure 6.3). Even when the error is normal, the REW quantile estimator performs about as well as the Nadaraya-Watson estimate (see Figure 6.1).
2. When the data have a small fraction of outliers, say about 5 or 10 percent, the REW quantile estimator is robust (see Figures 6.2 and 6.4). By contrast, the Nadaraya-Watson estimator fails. This observation suggests using the REW quantile estimator and the Nadaraya-Watson estimate together to diagnose the model and determine whether there are outliers in the data set. If the REW quantile estimator and the Nadaraya-Watson estimate disagree, then we should reconsider the model and check for outliers.
3. The REW quantile estimator seems promising, judging from these simulation studies.
4. Computing the REW quantile curve estimator took about one minute in Simulation Study 1 using S-PLUS on a Sun workstation.

6.8 Discussion

We have presented a general method for estimating a population quantile based on independent observations drawn from other related but not identical populations. We have shown the estimator to be strongly consistent and asymptotically normal under mild assumptions. Our method derives from a generalization of the empirical distribution (the REWED), and we have shown that the latter is also strongly consistent under certain conditions.

The context of our method includes that of nonparametric regression and smoothing. Thus our estimator may be viewed as a generalized smoothing quantile estimator. In the special case of Example 6.2, we obtain a nonparametric-nonparametric quantile estimator inasmuch as nothing is assumed about the form of the population distributions involved. In particular, as the examples of Section 6.6 show, heteroscedasticity is allowed in the smoothing context.
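The smoothing use of the REW quantile estimator, and its contrast with the Nadaraya-Watson estimate in the presence of outliers, can be sketched as follows. This is only an illustration under stated assumptions: Gaussian kernel weights normalised to sum to one serve as relevance weights, the REW quantile is taken as the smallest response whose cumulative weight reaches $p$, the error standard deviation 0.1 in the demo is arbitrary, and all function names are hypothetical rather than the thesis's S-PLUS code.

```python
import math
import random


def kernel_weights(x0, xs, h):
    """Gaussian kernel relevance weights at x0, normalised to sum to one."""
    raw = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    total = sum(raw)
    return [r / total for r in raw]


def nadaraya_watson(x0, xs, ys, h):
    """Kernel-weighted mean of the responses."""
    return sum(w * y for w, y in zip(kernel_weights(x0, xs, h), ys))


def rew_quantile_smoother(x0, xs, ys, h, p=0.5):
    """Kernel-weighted p-quantile of the responses (weighted median by default)."""
    weights = kernel_weights(x0, xs, h)
    cumulative = 0.0
    for y, w in sorted(zip(ys, weights)):
        cumulative += w
        if cumulative >= p - 1e-12:  # tolerance for floating-point round-off
            return y
    return max(ys)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(500)]
    ys = [x * (1 - x) + rng.gauss(0.0, 0.1) for x in xs]
    # Contaminate with about 5 percent outliers centred at 2 (illustrative values).
    xs += [rng.random() for _ in range(25)]
    ys += [rng.gauss(2.0, 0.5) for _ in range(25)]
    for x0 in (0.25, 0.5, 0.75):
        print(x0, x0 * (1 - x0),
              rew_quantile_smoother(x0, xs, ys, 0.1),
              nadaraya_watson(x0, xs, ys, 0.1))
```

The weighted median moves only slightly when a small fraction of the weight sits on outliers, whereas the weighted mean is shifted in proportion to the size of the outliers, which is the qualitative behaviour reported for Figures 6.2 and 6.4.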
Our theory depends on the relevance weights $\{p_{ni}\}$ used to construct the REWED. These weights express the statistician's perceived relationships among the populations and would usually be chosen on intuitive grounds. Making $\bar F_n$ approximate $F$ well is a primary objective in this choice. Additional restrictions on the $\{p_{ni}\}$ stem from the large sample theory developed in this chapter. Theorems 6.1, 6.2 and 6.3 on consistency, for example, require that $\sum_n \exp(-\epsilon K_n) < \infty$ for all $\epsilon > 0$, where $K_n = (\sum_{i=1}^n p_{ni}^2)^{-1}$. This imposes a requirement that the $\{p_{ni}\} \to 0$ fairly rapidly as $n \to \infty$, say faster than $1/\log(n)$. And for asymptotic normality, we see in Theorem 6.6 the requirement that $\max_{1 \le i \le n}(p_{ni} V_n^{-1/2}) \to 0$ as $n \to \infty$, where $V_n = \sum_{i=1}^n p_{ni}^2 F_i(\xi_p(n))[1 - F_i(\xi_p(n))]$. We believe these conditions offer some guidance on the choice of the relevance weights without unduly restricting it.

In the smoothing model, we can usually use kernel weights as the relevance weights, as we did in the simulation study. The kernel weights usually satisfy the above conditions.

$F_n(x)$ also arises as a population distribution estimator in finite population sampling theory (see Särndal, Swensson and Wretman 1992, p. 199), where $F_i$ may be regarded as the distribution function of the subpopulation from which $x_i$ is drawn; here $p_{ni} = \pi_i^{-1}$, $\pi_i$ being $x_i$'s selection probability, $i = 1, \ldots, n$.

We would note a Bayesian connection with our theory. If the $\{p_{ni}\}$ are thought of as prior weights, then $\bar F_n$ is just the marginal CDF of the independent observations obtained by mixing the conditional models $\{F_i\}$. Viewed from this perspective, the weights should be chosen to make the CDF for the population of interest, $F$, that marginal CDF. We would note that, incidentally, this chapter provides a large sample theory for the Bayesian marginal mixture distribution, in particular for the quantiles of that mixture distribution.

6.9 Proofs of the Theorems

Lemma 6.1 (Marcus and Zinn, 198).
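Let $\{c_n\}$, $n = 1, 2, \ldots$, be a sequence of real numbers and $\{X_n\}$, $n = 1, 2, \ldots$, a sequence of independent random variables. Define $U_n(t)$ by

$$U_n(t) = \Big(\sum_{i=1}^n c_i^2\Big)^{-1/2} \sum_{i=1}^n c_i [I(X_i \le t) - P(X_i \le t)].$$

Then

$$P\Big(\sup_t |U_n(t)| > \lambda\Big) \le (1 + 2\sqrt{2\pi}\,\lambda)\exp(-\lambda^2/8)$$

for all $\lambda > 0$. □

Proof of Theorem 6.1.

$$F_n(x) - \bar F_n(x) = \sum_{i=1}^n p_{ni}[I(X_i \le x) - F_i(x)] = K_n^{-1/2} U_n(x), \text{ say}.$$

So on applying Lemma 6.1 with $c_i = p_{ni}$ and $\lambda = \epsilon K_n^{1/2}$,

$$P(|F_n(x) - \bar F_n(x)| > \epsilon) \le (1 + 2\sqrt{2\pi}\,\epsilon K_n^{1/2})\exp(-\epsilon^2 K_n/8)$$

for every $\epsilon > 0$. The assumption $\sum_n \exp(-\epsilon^2 K_n) < \infty$ implies that $K_n \to \infty$ when $n \to \infty$. It follows that for every $c > 0$ there exists $N_c$ such that for every $n > N_c$,

$$1 + 2\sqrt{2\pi}\,\epsilon K_n^{1/2} < \exp(c\,\epsilon^2 K_n).$$

Consequently, for any $0 < c < 1/8$ and $n > N_c$,

$$(1 + 2\sqrt{2\pi}\,\epsilon K_n^{1/2})\exp(-\epsilon^2 K_n/8) \le \exp\{-(1/8 - c)\epsilon^2 K_n\}.$$

But $\sum_n \exp(-\epsilon^2 K_n) < \infty$ for all $\epsilon > 0$. So $\sum_n \exp\{-(1/8 - c)\epsilon^2 K_n\} < \infty$ for every $\epsilon > 0$. Hence

$$\sum_n P(|F_n(x) - \bar F_n(x)| > \epsilon) < \infty \text{ for all } \epsilon > 0.$$

The Borel-Cantelli Lemma then implies $|F_n(x) - \bar F_n(x)| \to 0$ a.s. for every $x$. □

Proof of Theorem 6.2. Let $M$ be a large positive integer and

$$u_n = \max_{-M^2 \le i \le M^2} |F_n(i/M) - \bar F_n(i/M)|.$$

By Theorem 6.1, $u_n \to 0$ a.s. Also, monotonicity implies that for $(i - 1)/M \le t \le i/M$,

$$F_n(t) - \bar F_n(t) \le F_n(i/M) - \bar F_n((i - 1)/M) = [F_n(i/M) - \bar F_n(i/M)] + [\bar F_n(i/M) - \bar F_n((i - 1)/M)].$$

By similar reasoning,

$$F_n(t) - \bar F_n(t) \ge [F_n((i - 1)/M) - \bar F_n((i - 1)/M)] - [\bar F_n(i/M) - \bar F_n((i - 1)/M)].$$

So

$$\limsup_{n \to \infty} \sup_x |F_n(x) - \bar F_n(x)| \le \limsup_{n \to \infty} u_n + \limsup_{n \to \infty} \max_{-M^2 < i \le M^2}\{\bar F_n(i/M) - \bar F_n((i - 1)/M),\; 1 - \bar F_n(M),\; \bar F_n(-M)\}$$

$$\le \limsup_{n \to \infty} u_n + \limsup_{n \to \infty} M^{-1}\sup_x \bar f_n(x) + \limsup_{n \to \infty}\{(1 - \bar F_n(M)) + \bar F_n(-M)\}$$

under the assumptions. Since $M$ is arbitrary, the result follows. □

Proof of Theorem 6.3. Let $\epsilon > 0$. By the uniqueness condition and the definition of $\xi_p(n)$,

$$\bar F_n(\xi_p(n) - \epsilon) < p < \bar F_n(\xi_p(n) + \epsilon).$$

By Theorem 6.2, $F_n(\xi_p(n) - \epsilon) - \bar F_n(\xi_p(n) - \epsilon) \to 0$ a.s. and $F_n(\xi_p(n) + \epsilon) - \bar F_n(\xi_p(n) + \epsilon) \to 0$ a.s. Hence

$$P[F_m(\xi_p(m) - \epsilon) < p < F_m(\xi_p(m) + \epsilon), \text{ for all } m \ge n] \to 1 \text{ as } n \to \infty.$$

That is, $P(\sup_{m \ge n} |\hat\xi_{mp} - \xi_p(m)| > \epsilon) \to 0$ as $n \to \infty$. This completes the proof. □

To prove Theorem 6.4, we need the following useful result of Hoeffding (1963).

Lemma 6.2 Let $Y_1, \ldots, Y_n$ be independent random variables satisfying $P(a_i \le Y_i \le b_i) = 1$ for each $i$, where $a_i < b_i$. Then for $t > 0$,

$$P\Big(\sum_{i=1}^n [Y_i - E(Y_i)] \ge t\Big) \le \exp\Big(-2t^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2\Big). \; □$$

Proof of Theorem 6.4. Fix $\epsilon > 0$.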
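Then

$$P\{|\hat\xi_{np} - \xi_p(n)| > \epsilon\} = P\{\hat\xi_{np} > \xi_p(n) + \epsilon\} + P\{\hat\xi_{np} < \xi_p(n) - \epsilon\}.$$

But with $Y_i = p_{ni} I(X_i > \xi_p(n) + \epsilon)$,

$$P\{\hat\xi_{np} > \xi_p(n) + \epsilon\} = P\{p > F_n(\xi_p(n) + \epsilon)\} = P\Big\{\sum_{i=1}^n p_{ni} I(X_i > \xi_p(n) + \epsilon) > 1 - p\Big\}$$

$$= P\Big\{\sum_{i=1}^n (Y_i - E(Y_i)) > 1 - p - \sum_{i=1}^n p_{ni}(1 - F_i(\xi_p(n) + \epsilon))\Big\} = P\Big\{\sum_{i=1}^n (Y_i - E(Y_i)) > \bar F_n(\xi_p(n) + \epsilon) - p\Big\}.$$

Because $P(0 \le Y_i \le p_{ni}) = 1$ for each $i$, by Lemma 6.2 we have

$$P(\hat\xi_{np} > \xi_p(n) + \epsilon) \le \exp\Big(-2\delta_1^2 \Big/ \sum_{i=1}^n p_{ni}^2\Big) = \exp(-2\delta_1^2 K_n);$$

here $\delta_1 = \bar F_n(\xi_p(n) + \epsilon) - p$. Similarly,

$$P(\hat\xi_{np} < \xi_p(n) - \epsilon) \le \exp\Big(-2\delta_2^2 \Big/ \sum_{i=1}^n p_{ni}^2\Big) = \exp(-2\delta_2^2 K_n),$$

where $\delta_2 = p - \bar F_n(\xi_p(n) - \epsilon)$. Putting $\delta_\epsilon(n) = \min\{\delta_1, \delta_2\}$ completes the proof. □

To prove Theorem 6.5, we need the following results (see Shorack and Wellner, 1986, page 855).

Lemma 6.3 (Bernstein) Let $Y_1, Y_2, \ldots, Y_n$ be independent random variables satisfying $P(|Y_i - E(Y_i)| \le m) = 1$ for each $i$, where $m < \infty$. Then, for $\epsilon > 0$,

$$P\Big(\Big|\sum_{i=1}^n (Y_i - E(Y_i))\Big| \ge \epsilon\Big) \le 2\exp\Big[-\frac{\epsilon^2}{2\sum_{i=1}^n Var(Y_i) + \frac{2}{3} m \epsilon}\Big]$$

for all $n = 1, 2, \ldots$

Lemma 6.4 Let $0 < p < 1$. Suppose conditions 1-3 of Theorem 6.5 hold. Then with probability 1 (hereafter wp1)

$$|\hat\xi_{np} - \xi_p(n)| \le \frac{(\sqrt{c} + 1) K_n^{-1/2}(\log K_n)^{1/2}}{\bar f_n(\xi_p(n))}$$

for all sufficiently large $n$.

Proof. Since $\bar F_n$ is continuous at $\xi_p(n)$ with $\bar F_n'(\xi_p(n)) > 0$, $\xi_p(n)$ solves uniquely $\bar F_n(x-) \le p \le \bar F_n(x)$ and $p = \bar F_n(\xi_p(n))$. Put

$$\epsilon_n = \frac{(\sqrt{c} + 1) K_n^{-1/2}(\log K_n)^{1/2}}{\bar f_n(\xi_p(n))}.$$

We then have

$$\bar F_n(\xi_p(n) + \epsilon_n) - p = \bar F_n(\xi_p(n) + \epsilon_n) - \bar F_n(\xi_p(n)) = \bar f_n(\xi_p(n))\epsilon_n + o(\epsilon_n) \ge \sqrt{c}\,(\log K_n)^{1/2} / K_n^{1/2}$$

for all sufficiently large $n$. Likewise we may show that $p - \bar F_n(\xi_p(n) - \epsilon_n)$ satisfies a similar inequality. Thus, with $\delta_\epsilon(n)$ as defined in Theorem 6.4, we have

$$2K_n \delta_{\epsilon_n}^2(n) \ge c \log K_n$$

for all $n$ sufficiently large. Hence by Theorem 6.4,

$$P(|\hat\xi_{np} - \xi_p(n)| > \epsilon_n) \le \frac{2}{K_n^c}$$

for all sufficiently large $n$. This last result, hypothesis 3 of Theorem 6.5 and the Borel-Cantelli Lemma imply that wp1 $|\hat\xi_{np} - \xi_p(n)| > \epsilon_n$ holds for only finitely many $n$. This completes the proof. □

Lemma 6.5 Let $0 < p < 1$ and $T_n$ be any estimator of $\xi_p(n)$ for which $T_n - \xi_p(n) \to 0$ wp1. Suppose $\bar F_n$ has a bounded second derivative in the neighborhood of $\xi_p(n)$. Then wp1

$$\bar F_n(T_n) - \bar F_n(\xi_p(n)) = \bar f_n(\xi_p(n))(T_n - \xi_p(n)) + O((T_n - \xi_p(n))^2)$$

as $n \to \infty$.

Proof. The proof is an immediate consequence of the Taylor expansion. □

For convenience in presenting the next result, we set

$$D_n(x) = [F_n(\xi_p(n) + x) - F_n(\xi_p(n))] - [\bar F_n(\xi_p(n) + x) - \bar F_n(\xi_p(n))].$$

Lemma 6.6 Let $\{a_n\}$ be a sequence of positive constants such that $a_n \sim c_0 K_n^{-1/2}(\log K_n)^q$ as $n \to \infty$, for some constants $c_0 > 0$ and $q \ge 1/2$.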
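Let $m_n = \max_{1 \le i \le n}\{p_{ni}\}$ and $H_n = \sup_{|x| \le a_n} |D_n(x)|$. If $m_n = o(K_n^{-3/4}(\log K_n)^{(q-1)/2})$, then under the hypothesis of Theorem 6.5, wp1

$$H_n = O(K_n^{-3/4}(\log K_n)^{(q+1)/2}).$$

Proof. Let $\{b_n\}$ be any sequence of positive integers such that $b_n \sim c_0 K_n^{1/4}(\log K_n)^q$ as $n \to \infty$. For successive integers $r = -b_n, \ldots, b_n$, put $\eta_{r,n} = a_n b_n^{-1} r$ and

$$\alpha_{r,n} = \bar F_n(\xi_p(n) + \eta_{r+1,n}) - \bar F_n(\xi_p(n) + \eta_{r,n}).$$

The monotonicity of $F_n$ and $\bar F_n$ implies that for $\eta_{r,n} \le x \le \eta_{r+1,n}$,

$$D_n(x) \le [F_n(\xi_p(n) + \eta_{r+1,n}) - F_n(\xi_p(n))] - [\bar F_n(\xi_p(n) + \eta_{r,n}) - \bar F_n(\xi_p(n))] = D_n(\eta_{r+1,n}) + [\bar F_n(\xi_p(n) + \eta_{r+1,n}) - \bar F_n(\xi_p(n) + \eta_{r,n})].$$

Similarly,

$$D_n(x) \ge D_n(\eta_{r,n}) - [\bar F_n(\xi_p(n) + \eta_{r+1,n}) - \bar F_n(\xi_p(n) + \eta_{r,n})].$$

So $H_n \le A_n + \beta_n$, where $A_n = \max\{|D_n(\eta_{r,n})| : -b_n \le r \le b_n\}$ and $\beta_n = \max\{\alpha_{r,n} : -b_n \le r \le b_n - 1\}$.

Since $\eta_{r+1,n} - \eta_{r,n} = a_n b_n^{-1} \sim K_n^{-3/4}$, $-b_n \le r \le b_n - 1$, we have by the Mean Value Theorem that

$$\alpha_{r,n} \le \Big[\sup_{|x| \le a_n} \bar f_n(\xi_p(n) + x)\Big](\eta_{r+1,n} - \eta_{r,n}) \sim \Big[\sup_{|x| \le a_n} \bar f_n(\xi_p(n) + x)\Big] K_n^{-3/4},$$

$-b_n \le r \le b_n - 1$. Thus $\beta_n = O(K_n^{-3/4})$, $n \to \infty$.

We now establish that wp1 $A_n = O(K_n^{-3/4}(\log K_n)^{(q+1)/2})$ as $n \to \infty$. By the Borel-Cantelli Lemma it suffices to show that $\sum_n P(A_n \ge \gamma_n) < \infty$, where $\gamma_n = c_1 K_n^{-3/4}(\log K_n)^{(q+1)/2}$ for some constant $c_1 > 0$. Now

$$P(A_n \ge \gamma_n) \le \sum_{r=-b_n}^{b_n} P(|D_n(\eta_{r,n})| \ge \gamma_n).$$

And

$$|D_n(\eta_{r,n})| = \Big|\sum_{i=1}^n p_{ni}\big(I(X_i \in (\xi_p(n), \xi_p(n) + \eta_{r,n}]) - E(I(X_i \in (\xi_p(n), \xi_p(n) + \eta_{r,n}]))\big)\Big|$$

by definition. With $Y_i = p_{ni} I(X_i \in (\xi_p(n), \xi_p(n) + \eta_{r,n}])$, Bernstein's Lemma (see Lemma 6.3) implies

$$P(|D_n(\eta_{r,n})| \ge \gamma_n) \le 2\exp(-\gamma_n^2 / D),$$

where $D = 2\sum_{i=1}^n Var(Y_i) + \frac{2}{3} m_n \gamma_n$.

Choose $c_2 > \sup_{n,i} f_i(\xi_p(n))$. Then there exists an integer $N$ such that

$$F_i(\xi_p(n) + a_n) - F_i(\xi_p(n)) < c_2 a_n$$

and

$$F_i(\xi_p(n)) - F_i(\xi_p(n) - a_n) < c_2 a_n,$$

both of the above inequalities holding for all $n > N$ and $i = 1, \ldots, n$. Then

$$\sum_{i=1}^n Var(Y_i) \le \sum_{i=1}^n p_{ni}^2 c_2 a_n = K_n^{-1} c_2 a_n.$$

Hence

$$\gamma_n^2 / D \ge \gamma_n^2 / \{2K_n^{-1} c_2 a_n + \tfrac{2}{3} m_n \gamma_n\} \ge c_1^2 \log K_n / (4 c_2 c_0)$$

for all sufficiently large $n$. The last result obtains because of the condition $m_n = o(K_n^{-3/4}(\log K_n)^{(q-1)/2})$. Given $c_0$ and $c_2$, we may choose $c_1$ large enough that $c_1^2 (4 c_2 c_0)^{-1} > c + 1$. It then follows that there exists $N^*$ such that

$$P(|D_n(\eta_{r,n})| \ge \gamma_n) \le 2K_n^{-(c+1)}$$

for all $|r| \le b_n$ and $n > N^*$. Consequently, for $n > N^*$,

$$P(A_n \ge \gamma_n) \le 8 b_n K_n^{-(c+1)}.$$

In turn this implies $P(A_n \ge \gamma_n) \le 8K_n^{-c}$. Hence $\sum_n P(A_n \ge \gamma_n) < \infty$, and the proof is complete. □

Proof of Theorem 6.5. Under the conditions of the theorem, we may apply Lemma 6.4. This means Lemma 6.5 becomes applicable with $T_n = \hat\xi_{np}$ and we have wp1,

$$\bar F_n(\hat\xi_{np}) - \bar F_n(\xi_p(n)) = \bar f_n(\xi_p(n))(\hat\xi_{np} - \xi_p(n)) + O(K_n^{-1} \log K_n), \text{ as } n \to \infty.$$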
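Now using Lemma 6.6 with $q = 1/2$, and appealing to Lemma 6.4 again, we may pass from the last conclusion to: wp1

$$F_n(\hat\xi_{np}) - F_n(\xi_p(n)) = \bar f_n(\xi_p(n))(\hat\xi_{np} - \xi_p(n)) + O(K_n^{-3/4}(\log K_n)^{3/4}), \text{ as } n \to \infty.$$

Finally, since wp1 $F_n(\hat\xi_{np}) = p + O(m_n)$ as $n \to \infty$, we have wp1

$$p - F_n(\xi_p(n)) = \bar f_n(\xi_p(n))(\hat\xi_{np} - \xi_p(n)) + O(K_n^{-3/4}(\log K_n)^{3/4}), \text{ as } n \to \infty.$$

This completes the proof. □

Proof of Theorem 6.6. Fix $t$, and put

$$G_n(t) = P[\bar f_n(\xi_p(n))(\hat\xi_{np} - \xi_p(n)) V_n^{-1/2} \le t] = P(\hat\xi_{np} \le a_n),$$

where $a_n = \xi_p(n) + t V_n^{1/2} / \bar f_n(\xi_p(n))$. Then by the definition of $\hat\xi_{np}$, $G_n(t) = P(F_n(a_n) \ge p)$. Thus

$$G_n(t) = P\Big(\sum_{i=1}^n p_{ni} I(X_i \le a_n) \ge p\Big) = P\Big(V_n^{-1/2}\sum_{i=1}^n p_{ni}\big(I(X_i \le a_n) - E(I(X_i \le a_n))\big) \ge V_n^{-1/2}\Big[p - \sum_{i=1}^n p_{ni} E(I(X_i \le a_n))\Big]\Big) = P(Z_n \ge c_n);$$

here

$$Z_n = V_n^{-1/2}\sum_{i=1}^n p_{ni}\big(I(X_i \le a_n) - E(I(X_i \le a_n))\big)$$

and

$$c_n = V_n^{-1/2}\Big[p - \sum_{i=1}^n p_{ni} E(I(X_i \le a_n))\Big].$$

We first prove $Z_n \to N(0, 1)$ in distribution and then that $c_n \to -t$ as $n \to \infty$ to complete the proof. To this end

$$Z_n = V_n^{-1/2}\sum_{i=1}^n p_{ni}(I(X_i \le a_n) - F_i(a_n)) = \sum_{i=1}^n \eta_{ni},$$

where $\eta_{ni} = p_{ni} V_n^{-1/2}(I(X_i \le a_n) - F_i(a_n))$. From the condition $\max_{1 \le i \le n}(p_{ni} V_n^{-1/2}) \to 0$, we get $\max_{1 \le i \le n} p_{ni} \to 0$ and $V_n \to 0$. We then easily obtain for every $\epsilon > 0$ and $\tau > 0$:
1. $\sum_{i=1}^n P(|\eta_{ni}| \ge \epsilon) \to 0$ (since $|\eta_{ni}| \le 2\max_{1 \le i \le n}(p_{ni} V_n^{-1/2}) \to 0$);
2. $\sum_{i=1}^n [E(\eta_{ni}^2 I(|\eta_{ni}| < \tau)) - (E(\eta_{ni} I(|\eta_{ni}| < \tau)))^2] \to 1$ (for $n$ large enough, $I(|\eta_{ni}| < \tau) = 1$);
3. $\sum_{i=1}^n E(\eta_{ni} I(|\eta_{ni}| < \tau)) \to 0$.
So $Z_n \to N(0, 1)$ in distribution by the CLT for triangular arrays of independent random variables (see Chung, 1968, page 191).

Next we prove $c_n \to -t$:

$$c_n = V_n^{-1/2}\Big[p - \sum_{i=1}^n p_{ni} F_i(a_n)\Big] = V_n^{-1/2}\sum_{i=1}^n p_{ni}(F_i(\xi_p(n)) - F_i(a_n)) = -V_n^{-1/2}\sum_{i=1}^n p_{ni}\big(f_i(\xi_p(n))(a_n - \xi_p(n)) + o(a_n - \xi_p(n))\big) \to -t$$

as $n \to \infty$. The proof is now complete. □

Chapter 7

Some Further Results

In this chapter we discuss some further results about the use of relevant sample information. In Section 7.1, we consider the locally polynomial MREWLE for generalized smoothing models. We show that the locally linear MREWLE has many desirable properties.

In Chapter 3, we considered the fully parametric approach to relevant sample analysis. Chapter 6 proposes a kind of nonparametric approach. In Section 7.2, we outline a semiparametric approach using estimating functions.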
7.1 Locally polynomial maximum relevance weighted likelihood estimation

Local regression is a popular form of nonparametric regression, combining excellent theoretical properties with conceptual simplicity and flexibility to find structure in many datasets. The method was introduced by Stone (1977) using a rectangular window. Cleveland (1979) introduced his lowess procedure, which is based on locally polynomial fitting and incorporates many important features, such as design adaptiveness, kernel weighting and robustness. Recently, Fan (1992, 1993) has studied minimax properties of locally linear regression. A detailed summary of the advantages of local regression compared to kernel fitting may be found in Hastie and Loader (1993). Fan, Heckman and Wand (1993) have studied locally polynomial kernel regression for generalized linear models and quasi-likelihood functions. Weerahandi and Zidek (1988) present locally smooth processes from a Bayesian viewpoint.

In this section we investigate the generalization of locally polynomial fitting with kernel methods for the MREWLE in the case of generalized smoothing models. We are motivated by the fact that local regressions are both intuitively and mathematically simple, which allows us to achieve a deeper understanding of their performance.

In the generalized smoothing model of Chapter 5, when we suppose that $\theta$ has $q$ continuous derivatives, then in a small neighborhood of a point $x$,

$$\theta(z) = \theta(x) + \theta'(x)(z - x) + \cdots + \frac{\theta^{(q)}(x)}{q!}(z - x)^q + o\{(z - x)^q\} = \beta_0 + \beta_1(z - x) + \cdots + \beta_q(z - x)^q + o\{(z - x)^q\}. \quad (7.1)$$

To estimate $\theta(x)$, we want to use the information from $\{(Y_i, X_i)\}$. As with the relevance weighted likelihood in Chapter 5, we define the locally polynomial relevance weighted likelihood at the point $x$ to be

$$\prod_{i=1}^n f\{Y_i, \beta_0 + \beta_1(X_i - x) + \cdots + \beta_q(X_i - x)^q\}^{p_{ni}(x)}. \quad (7.2)$$

Here $\{p_{ni}(x)\}$ are the relevance weights. One would then estimate $(\beta_0, \ldots, \beta_q)$ by maximizing the locally polynomial relevance weighted likelihood (7.2). This is the locally polynomial MREWLE. Because $\beta_0 + \beta_1(z - x) + \cdots + \beta_q(z - x)^q$ is a better approximation to $\theta(z)$ than $\theta(x)$, the locally polynomial REWL in (7.2) is better than the REWL in Chapter 5. We would expect the locally polynomial MREWLE to perform better than its locally constant counterpart.

There are many possible strategies for weighting the likelihood (see Chapter 5). Because of their mathematical simplicity we will use kernel weights, the Nadaraya-Watson weights of Chapter 5. The locally polynomial kernel MREWLE of $\theta(x)$ is then given by $\hat\theta_0(x; q, h) = \hat\beta_0$, where $(\hat\beta_0, \ldots, \hat\beta_q)$ maximizes

$$\sum_{i=1}^n K_h(x - X_i) \log f\{Y_i, \beta_0 + \cdots + \beta_q(X_i - x)^q\}.$$

Note that in the case $q = 0$, the estimator of $\theta(x)$ is the Nadaraya-Watson MREWLE of Chapter 5.

Locally polynomial fitting also provides consistent estimates of higher order derivatives of $\theta(x)$ through the coefficients of the higher order terms in the polynomial fit. Thus, for $0 \le r \le q$, we define the locally polynomial MREWLE of $\theta^{(r)}(x)$ to be $\hat\theta_r(x; q, h) = r!\hat\beta_r$. We leave the investigation of the asymptotic properties of $\hat\theta_r(x; q, h)$ to future work.

In most applications, we are interested in a low-degree polynomial ($q$ = 1, 2, or 3). For brevity, we consider the locally linear MREWLE in this section. The locally linear MREWLE maximizes

$$\sum_{i=1}^n K_h(x - X_i) \log f\{Y_i, \beta_0 + \beta_1(X_i - x)\}.$$

The corresponding locally linear REWL equations are

$$\sum_{i=1}^n \frac{\partial \log f\{Y_i, \beta_0 + \beta_1(X_i - x)\}}{\partial \beta_0} K_h(x - X_i) = 0$$

and

$$\sum_{i=1}^n \frac{\partial \log f\{Y_i, \beta_0 + \beta_1(X_i - x)\}}{\partial \beta_1} K_h(x - X_i) = 0.$$

We state the following asymptotic properties of the locally linear MREWLE as conjectures. Proofs are left to future work.

Assume that the second derivative of $\theta$ is continuous. If $h \to 0$ and $nh \to \infty$, then under Assumption 5.1, for $x \in (a_0, b_0)$, the locally linear REWL equations admit a sequence of solutions $\{\hat\theta(x; 1, h)\}$ satisfying

$$[n h g(x) I\{\theta(x)\} / d_K]^{1/2}\Big[\hat\theta(x; 1, h) - \theta(x) - \tfrac{1}{2}\theta''(x) c_K h^2\{1 + O(h)\}\Big] \to_D N(0, 1). \quad (7.3)$$

So the asymptotic bias and variance are $\theta''(x) c_K h^2 / 2$ and $d_K / [n h g(x) I\{\theta(x)\}]$ respectively.

Now we compare the above statement with Theorem 1 of Fan (1992) for locally linear regression.
Fan (1992) proved that under certain conditions, the locally linear regression estimator $\hat\theta$ has MSE

$$E\{\hat\theta(x) - \theta(x)\}^2 = \frac{\{\theta''(x)\}^2 c_K^2 h^4}{4} + \frac{d_K \sigma^2(x)}{n h g(x)} + o\Big(h^4 + \frac{1}{nh}\Big), \quad (7.4)$$

where $\sigma^2(x)$ denotes the conditional variance of $Y$ given $X = x$.

From the bias and variance comparison, we find that the locally linear MREWLE and the locally linear regression estimator have the same bias. But the variance of the locally linear MREWLE depends on the Fisher information and is never larger than that of the locally linear regression estimator. Also, the locally linear regression estimator only works for a location parameter, while the locally linear MREWLE works for any parameter.

Fan (1992) also shows that the locally linear regression estimator is the best among all the linear estimators of $\theta(x)$. The above result for the locally linear MREWLE tells us that we can usually find a better estimator than locally linear regression when the density $f\{y, \theta(x)\}$ is known. This result supports the argument in Section 5.2 that when we know the distribution family, the linear smoother is asymptotically inadmissible.

In the next section, we will show that the locally linear MREWLE is the best among all the locally linear estimators for estimating functions.

The locally linear MREWLE has all the desirable properties discussed in Fan (1992). Now we use a simple simulation to illustrate its finite sample behavior.

We use the model given in (5.21), with simulation sample size $n = 200$, bandwidth $h = 0.1$ and a Gaussian kernel function. Three methods are used: the Nadaraya-Watson MREWLE, the locally linear smoother MREWLE (defined in Chapter 5) and the locally linear MREWLE. Figure 7.1 shows a comparison among these three methods. The locally linear MREWLE based on five simulations is shown in Figure 7.2.

From Figure 7.1, both the locally linear smoother MREWLE and the locally linear MREWLE perform very well. The Nadaraya-Watson MREWLE has large boundary effects. The locally linear MREWLE performs very well, as shown in Figure 7.2.
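The boundary behaviour described here can be illustrated with a short sketch. It assumes a Gaussian density with mean $\theta(x)$ and known variance, in which case maximizing the kernel-weighted log-likelihood reduces to kernel-weighted least squares; this is a special case chosen for illustration, not the general MREWLE. Model (5.21) is not reproduced, so the test data are a plain straight line, and all names are hypothetical.

```python
import math


def _weighted_moments(x0, xs, ys, h):
    """Kernel-weighted sums needed for the local constant and local linear fits."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # Gaussian kernel weight
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    return s0, s1, s2, t0, t1


def local_constant(x0, xs, ys, h):
    """Nadaraya-Watson type fit (the case q = 0) under the Gaussian assumption."""
    s0, _, _, t0, _ = _weighted_moments(x0, xs, ys, h)
    return t0 / s0


def local_linear(x0, xs, ys, h):
    """Solve the two weighted normal equations for (beta0, beta1) at x0."""
    s0, s1, s2, t0, t1 = _weighted_moments(x0, xs, ys, h)
    det = s0 * s2 - s1 * s1
    beta0 = (s2 * t0 - s1 * t1) / det  # estimate of theta(x0)
    beta1 = (s0 * t1 - s1 * t0) / det  # estimate of theta'(x0)
    return beta0, beta1


if __name__ == "__main__":
    xs = [i / 100 for i in range(101)]
    ys = [2.0 + 3.0 * x for x in xs]
    print(local_constant(0.0, xs, ys, 0.1), local_linear(0.0, xs, ys, 0.1))
```

At the boundary point $x = 0$ all neighbours lie on one side, so the locally constant fit is pulled toward the interior values, while the locally linear fit recovers a straight line exactly.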
From the results of Simulation 1 of Chapter 5 and Figure 7.2, we conclude that the locally linear MREWLE performs best among the methods compared.

Figure 7.1: A comparison of the Nadaraya-Watson MREWLE, the locally linear smoother MREWLE and the locally linear MREWLE from Model (5.21) with n = 200. The true curve is a, the Nadaraya-Watson MREWLE, b, the locally linear smoother MREWLE, c, and the locally linear MREWLE, d.

Figure 7.2: The locally linear MREWLE based on five simulations from Model (5.21) with n = 200.

7.2 Relevance weighted estimating functions

For the exact sample case, it is well known that in many practical circumstances, even though the full likelihood is unknown, one can specify some relationship about the parameters. For example, if we can specify the relationship between the mean and variance, then we can use a quasi-likelihood function approach. More generally, the theory of estimating functions (which includes quasi-likelihood as a special case) has been developed in a series of papers by Godambe (e.g. Godambe (1960), Godambe and Thompson (1974), Godambe (1976)). This theory focused initially on the elementary fact that any point estimator may be regarded as the solution of an equation

$$\psi(y, \theta) = 0 \quad (7.5)$$

for data vector $y$ and parameter $\theta$. Unlike traditional approaches, however, which impose conditions on the estimator such as unbiasedness or invariance, the approach here focuses on properties of the estimating function itself, rather than the estimator derived from it. Thus, instead of dealing with linear estimators, unbiased estimators and so on, we deal with linear estimating functions, unbiased estimating functions and so on.

The simplest condition one might impose on an estimating function is that of unbiasedness, i.e. $E_\theta \psi(Y; \theta) = 0$. Note that this does not necessarily imply unbiasedness of the corresponding estimator unless $\psi$ is linear in $\theta$.
However, under suitable regularity conditions it does imply consistency of the estimator.

Clearly, there will be a great variety of competing estimating functions. Godambe (1960) suggested that an optimal estimating function (OEF) would be one for which the efficiency index

$$\frac{E_\theta(\psi^2)}{\{E_\theta(\partial\psi/\partial\theta)\}^2} \quad (7.6)$$

takes its smallest possible value among all unbiased estimating functions. Under the type of regularity conditions used in the Cramér-Rao argument, Godambe (1960) showed that the least possible value of (7.6) is attained, in the one parameter case, by the score function

$$\psi = \frac{\partial \log f(y, \theta)}{\partial \theta}.$$

Moreover, the smallest possible value is the reciprocal of the Fisher information. We thus have a justification for maximum likelihood estimation which, unlike more traditional arguments, does not rely explicitly on asymptotics.

The estimating function framework is fairly general and includes many models important in application. Examples are:
• nuisance parameter models, e.g. Neyman and Scott (1948);
• location model, i.e. $F = \{f(x - \theta) : f \text{ unknown}\}$;
• scale model, $F = \{f(x/\theta) : f \text{ unknown}\}$;
• semi-parametric models such as Cox's regression model, Cox (1972);
• generalized linear models with unspecified error distributions (for example, quasi-likelihood models).

In Chapter 2, we classified different sources of information. For relevant sample information, the full likelihood approach was proposed in Chapter 3. Because of the generality of estimating functions, in this section we investigate their generalization to the relevant sample, which we call relevance weighted estimating functions.

Let $Y_1, \ldots, Y_n$ be random variables on a sample space with probability density $f(y, \theta_i)$, $i = 1, \ldots, n$. Here $\theta_i$ is a real- or vector-valued parameter which is assumed to be unknown. Interest lies in the study population corresponding to parameter $\theta$. We know that there is some relation (see Chapter 2) between $\theta_i$ and $\theta$. Our object is to estimate $\theta$ on the basis of the observed values of $Y_1, \ldots, Y_n$.
If we know the distribution $f(y, \theta)$, we can use the REWL method of Chapter 3. When $f(y, \theta)$ is unknown, but we can specify some relationship about the parameters (for example, $E_\theta \psi(Y, \theta) = 0$), then we should use the following method of relevance weighted estimating functions instead.

As with estimating equation (7.5) for an exact sample, we define the relevance weighted estimating equation

$$\sum_{i=1}^n p_{ni}\psi(Y_i, \theta) = 0 \quad (7.7)$$

to estimate the parameter $\theta$ from data $Y_1, \ldots, Y_n$. Here $\{p_{ni}\}$ are the corresponding relevance weights of the $Y_i$. Note that if we choose $\psi(y, \theta) = \partial \log f(y, \theta)/\partial\theta$, we get the relevance weighted likelihood equation of Chapter 3. Also, if we let $\psi(y, \theta) = y - \theta$, the relevance weighted estimating equation becomes the relevance weighted least squares equation.

Since the estimating function $\psi(y, \theta)$ is initially used to obtain an estimator by solving the equation $\psi(y, \theta) = 0$, the unbiasedness condition

$$E_\theta \psi(Y, \theta) = 0 \quad (7.8)$$

becomes natural when the sample $Y$ is exact. Now for relevant observations $Y_i$ from $f(y, \theta_i)$, usually $E_{\theta_i}\psi(Y_i, \theta) \ne 0$. This means that relevant samples are usually biased. This leads us to define the bias function for relevant samples as

$$B_i(\theta) = E_{\theta_i}\psi(Y_i, \theta). \quad (7.9)$$

The bias functions in Chapters 4 and 5 are both special cases of (7.9).

For competing estimating functions, we define the following efficiency index for estimating function (7.7):

$$\mathrm{Eff}(\psi) = \frac{E[\sum_{i=1}^n p_{ni}\psi(Y_i, \theta)]^2}{[E\sum_{i=1}^n p_{ni}\,\partial\psi(Y_i, \theta)/\partial\theta]^2}. \quad (7.10)$$

An optimal estimating function would be one for which the above efficiency index takes its smallest possible value among all functions which satisfy $E_\theta \psi(Y, \theta) = 0$.

Under certain conditions, we can generalize the results of Chapter 4 to the case of relevance weighted estimating functions.

We now apply relevance weighted estimating functions to generalized smoothing models. Consider a model for the relationship between a dependent variable $Y$ and an independent variable $X$. Suppose that $Y$ given $X = x$ satisfies

$$E\,\psi\{Y, \theta(x)\} = 0, \quad (7.11)$$

where $\theta$ is an unknown smooth function.
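As a small numerical illustration of the relevance weighted estimating equation (7.7), the sketch below solves $\sum_i p_{ni}\psi(Y_i, \theta) = 0$ by bisection. It assumes a scalar $\theta$ and a $\psi$ that is non-increasing in $\theta$, so that a bracketing interval contains the root; the solver, the Huber-type $\psi$ and all names are illustrative choices and not part of the thesis.

```python
def solve_rew_estimating_equation(ys, weights, psi, lower, upper, tol=1e-10):
    """Bisection root of g(theta) = sum_i p_i * psi(y_i, theta) on [lower, upper].

    Assumes g is non-increasing in theta with g(lower) >= 0 >= g(upper).
    """
    def g(theta):
        return sum(w * psi(y, theta) for y, w in zip(ys, weights))

    if g(lower) < 0 or g(upper) > 0:
        raise ValueError("the root is not bracketed by [lower, upper]")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if g(mid) > 0:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def psi_least_squares(y, theta):
    """psi(y, theta) = y - theta gives the relevance weighted mean."""
    return y - theta


def psi_huber(y, theta, k=1.0):
    """A bounded (Huber-type) psi, included only as a second example."""
    return max(-k, min(k, y - theta))


if __name__ == "__main__":
    ys = [0.0, 0.0, 0.0, 10.0]
    p = [0.25, 0.25, 0.25, 0.25]
    print(solve_rew_estimating_equation(ys, p, psi_least_squares, -100.0, 100.0))
    print(solve_rew_estimating_equation(ys, p, psi_huber, -100.0, 100.0))
```

With $\psi(y, \theta) = y - \theta$ the solution is the weighted mean, matching the relevance weighted least squares equation; replacing $\psi$ changes the estimator without changing the solver.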
Nonparametric regression (NR; Härdle 1990), semiparametric smoothing models (Severini and Staniswalis 1994), the generalized linear smoothing model (Fan, Heckman and Wand 1993) and the generalized smoothing model in Chapter 3 are special cases of (7.11).

To estimate $\theta(x)$, we construct the relevance weighted (REW) estimating equation as

$$\sum_{i=1}^n p_{ni}(x)\psi\{Y_i, \theta(x)\} = 0.$$

As in Chapter 5, we can prove several theorems about the properties of the REW estimating equation. By using the efficiency index (7.10), we can compare REW estimating functions.

Now we extend the idea of locally polynomial approximations to the REW estimating equation. For simplicity, we only consider the locally linear REW estimating equations, which are defined as

$$\sum_{i=1}^n p_{ni}(x)\psi\{Y_i, \theta(x) + \theta'(x)(X_i - x)\} = 0 \quad (7.12)$$

and

$$\sum_{i=1}^n p_{ni}(x)\psi\{Y_i, \theta(x) + \theta'(x)(X_i - x)\}(X_i - x) = 0. \quad (7.13)$$

The above locally linear REW estimating equations are easily extended to locally polynomial REW estimating equations.

If we use the Nadaraya-Watson kernel weights, after some calculation, we get the efficiency index for $\theta(x)$ to be

$$\Big\{\frac{\theta''(x)}{2}\Big\}^2 c_K^2 h^4 + \frac{d_K}{n h g(x)}\,\frac{E[\psi\{Y, \theta(x)\}]^2}{[E\,\partial\psi\{Y, \theta(x)\}/\partial\theta(x)]^2} + o\Big(h^4 + \frac{1}{nh}\Big). \quad (7.14)$$

The minimum of (7.14) is attained by choosing

$$\psi = \frac{\partial \log f(Y, \theta)}{\partial\theta}.$$

This result tells us that the locally linear MREWLE is the best among all locally linear REW estimating equations (with kernel weights). So when the likelihood function is available, we should use the MREWLE to extract the relevant sample information from relevant samples.

Similarly, we can show that the locally linear quasi-MREWLE (maximum relevance weighted quasi-likelihood estimator) is the best among all locally linear REW linear estimating equations.

These last two statements ensure that both the relevance weighted likelihood and the relevance weighted quasi-likelihood play important roles in generalized smoothing models.

Chapter 8

An Approach of Bootstrapping Through Estimating Equations

8.1 Introduction

This chapter presents a new bootstrap method.
It has computational and theoretical advantages when the data arise from non-identically distributed observables. And it differs from conventional bootstrap methods in that it resamples components of an estimating function rather than the data themselves.

Various authors explore bootstrap resampling procedures for estimating a sampling distribution in situations where the sampled observables are independent and identically distributed (cf. Efron 1979, Bickel and Freedman 1981 and Singh 1981). However, the potential value of bootstrap methods lies in more complex situations like nonlinear regression analysis. In such situations standard inferential methods encounter serious difficulty. A number of authors (Efron 1979, Freedman 1981 and Wu 1986) propose the use of bootstrap methods instead.

The recent publications of Hall (1992) as well as Efron and Tibshirani (1993) survey the literature on bootstrapping. A recent contribution is that of Liu and Singh (1992); it provides a succinct overview of bootstrapping and an assessment of methods for estimating mean square errors of statistics in heteroscedastic linear regression. Booth, Hall and Wood (1992) extend the bootstrap to estimate conditional distributions (obtaining confidence intervals and hypothesis tests in particular). Lahiri (1992) "fixed" the bootstrap M-estimator (we offer an alternative in Section 8.6).

In Section 8.2 we describe the three principal bootstrapping methods for regression; this is necessary to put our results in perspective. Two of these methods bootstrap residuals. They implicitly assume exchangeability of the residuals and hence fail to be robust against "heteroscedasticity" of high order moments (the third and fourth moments, in particular). The third method resamples from the $(y, x)$ pairs and also encounters major difficulties.

The new bootstrap in Section 8.3 can be used whenever estimating equations define the estimator.
The method, unlike conventional bootstrapping, does not resample the data. Instead, we start with the estimator of interest and generate random estimators to fit residuals, rather than using the model to fit residuals. The approximate sampling distribution for the estimator is then found by taking the empirical distribution of the resulting estimator plus bootstrapped residuals. A critical feature of our approach is its use of the components of the estimating function itself to transform the residuals to an appropriate scale. For simplicity, we restrict ourselves in this chapter to estimation for the linear model.

Section 8.4 gives the asymptotic properties of our method, which yield covariance estimators having the desiderata listed by Wu (1986, Section 5). In particular, for linear regression models (8.1), the proposed covariance estimators are almost unbiased and consistent for heteroscedastic errors. For homoscedastic errors, we show that one of these estimators is unbiased for estimating the covariance of the least squares estimator. The new method is robust against nonhomogeneity, not just of the second, but also of higher moments. Such robustness is needed in constructing Edgeworth approximations to the distribution of a bootstrap estimator, where consistent third moment estimators are essential. Finally, the asymptotic distribution of the new bootstrap estimator is normal for both random and nonrandom regression design matrices.

In Section 8.5, we give a general comparison of all the principal bootstrap methods for regression. This comparison shows on heuristic grounds that our new method is more natural than its competitors in heteroscedastic regression. The simulation reported in Section 8.5 also shows our bootstrap to advantage. In that simulation, we use the bootstrap distribution to estimate the true distribution. From the estimate, we can obtain confidence intervals, among other things.
Because our new method is based on estimating equations, it generalizes readily to nonlinear regression. The importance of this advantage derives from the growing importance in practice of nonlinear regression models. Three nonlinear situations are discussed in Section 8.6. Section 8.8 contains the proofs of our main results.

8.2 Problems with Common Bootstrap Regression Methods

Define a linear model by $Y_i = x_i^T\beta + e_i$, where $x_i$ is a $k \times 1$ nonrandom vector. Here $\beta$ denotes a $k \times 1$ parameter vector and the $\{e_i\}$ are uncorrelated errors with means zero and variances $\{\sigma_i^2\}$, respectively. Assume the model includes an intercept term, so that the same designated coordinate of $x_i$ is 1 for all $i$. With $Y = (Y_1, \ldots, Y_n)^T$, $e = (e_1, \ldots, e_n)^T$ and $X = (x_1, \ldots, x_n)^T$, reexpress the linear model as

$$Y = X\beta + e, \qquad \mathrm{Cov}(e) = \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2). \tag{8.1}$$

To reduce clutter, we drop in (8.1) and in the sequel the subscript representing the fixed parameter in conditional distributions. Thus we use $E$, var and cov to represent, respectively, the conditional expectation, variance and covariance induced by the model in equation (8.1) for fixed $\beta$ and $\Sigma$. Assume $X^TX$ is nonsingular, so that we have the usual coefficient estimator, $\hat\beta = (X^TX)^{-1}X^TY$. The observed $Y$ is $y = (y_1, \ldots, y_n)^T$.

Suppose a bootstrap distribution of some parameter $\psi = \psi(\beta)$ is required. Efron (1979) suggests two methods (described in the next subsections) based on either the residuals of the least squares fit or the complete observation vectors. These methods are studied by Freedman (1981), Wu (1986) and Hinkley (1988), among others.

8.2.1 Residual Resampling

a) Efron's Method. The most common method of bootstrapping an ordinary linear model, that of Efron (1979), exploits the assumed exchangeability of the error terms when $\sigma_i^2 = \sigma^2$ for all $i$. This method is the focus of work by Freedman and Peters (1984), Wu (1986) and Hinkley (1988). Under this assumption, $e$ in model (8.1) is a vector of iid random variables with mean 0 and covariance $\sigma^2 I$.
The bootstrap uses the empirical distribution function $F_r$, which puts mass $1/n$ at $r_i$, $i = 1, \ldots, n$; here the $\{r_i\}$ represent the elements of the residual vector

$$r = [I - X(X^TX)^{-1}X^T]\,y,$$

where $y = (y_1, \ldots, y_n)^T$ when the observed responses are $\{y_1, \ldots, y_n\}$. Since an intercept term is included in the linear model, $\sum r_i = 0$. The bootstrap (Monte Carlo) approximation is obtained from an iid sample of $n$ residuals from $F_r$, say $\{e_1^*, \ldots, e_n^*\}$. Let $e^* = (e_1^*, \ldots, e_n^*)^T$ and $y^* = (y_1^*, \ldots, y_n^*)^T$, where $y_i^* = x_i^T\hat\beta + e_i^*$ for all $i$. Finally define the bootstrap least squares estimator (LSE) as

$$\beta^* = (X^TX)^{-1}X^Ty^* = \hat\beta + (X^TX)^{-1}X^Te^*.$$

Efron (1979) shows that $E_*\beta^* = \hat\beta$ and that

$$v_b = E_*(\beta^* - \hat\beta)(\beta^* - \hat\beta)^T = \hat\sigma^2 (X^TX)^{-1}, \qquad \hat\sigma^2 = n^{-1}\sum r_i^2 .$$

We may use $\psi^* = \psi(\beta^*)$ for estimating $\psi$. If $\psi$ is smooth, we may develop asymptotic theory for $\psi^*$ directly from that of $\beta^*$. For brevity we will not consider such an extension of our results here.

Notice that $v_b$ is the usual maximum likelihood estimator of the covariance of $\hat\beta$; it is biased since $E(v_b) = n^{-1}(n - k)\sigma^2 (X^TX)^{-1}$. The bias results from the fact that $\mathrm{Var}(r_i) = \sigma^2(1 - h_i)$, where $h_i = x_i^T(X^TX)^{-1}x_i$ is the $i$th diagonal element of the hat matrix $X(X^TX)^{-1}X^T$. A bootstrap method which uses standardized residuals, $r_i(1 - h_i)^{-1/2}$ or $r_i(1 - kn^{-1})^{-1/2}$, eliminates this bias.

b) Wu's Method. For unequal $\sigma_i$'s in (8.1), $v_b$ is not only biased but inconsistent as well, since the true covariance of the least squares estimator (LSE), $\hat\beta$, is

$$\mathrm{Cov}(\hat\beta) = (X^TX)^{-1} \sum \sigma_i^2 x_i x_i^T \,(X^TX)^{-1}.$$

This deficiency of $v_b$ is intrinsic. Drawing iid samples from $\{r_i\}$ only makes sense for exchangeable residuals $\{r_i\}$; iid sampling wipes out heterogeneity among the $\{r_i\}$, and this heterogeneity will not be reflected in $v_b$. To deal with this difficulty, Wu (1986) proposes another bootstrap (see also Efron 1986), described below.

Suppose we wish to assess the variability of a statistic computed from the random vector of observables $Y$, which comes from a specific but unknown structural/stochastic mechanism. Here we suppose $Y = X\beta + e$, with unknown parameters $\beta, \sigma_1, \ldots, \sigma_n$, and $\hat\beta = (X^TX)^{-1}X^TY$.
The general bootstrap method would consist of: (i) estimating the entire probability mechanism, $P$, by say $\hat P$ [in this case the $\hat P$ associated with $(\hat\beta, \hat\sigma_1, \ldots, \hat\sigma_n)$]; (ii) after resampling the data vector $y^*$ from $\hat P$, recalculating the statistic of interest, $\beta^* = (X^TX)^{-1}X^Ty^*$; (iii) and finally using the observed variability of the $\beta^*$'s to estimate the variability of $\hat\beta$. For (8.1) the bootstrap calculations can be carried out explicitly, and they yield

$$\mathrm{Cov}_*(\beta^*) = (X^TX)^{-1} \sum \hat\sigma_i^2 x_i x_i^T \,(X^TX)^{-1}.$$

Wu's (1986) particular implementation of the general bootstrap described above can be obtained by choosing $\hat\sigma_i = r_i(1 - h_i)^{-1/2}$ for all $i$. To explicate these variance estimators, define $y_i^* = x_i^T\hat\beta + \hat\sigma_i t_i^*$, $i = 1, \ldots, n$; the $\{t_i^*\}$ are obtained from resampling (denoted by $*$) with

$$E_* t^* = 0 \quad\text{and}\quad \mathrm{Cov}_*(t^*) = I. \tag{8.2}$$

Wu (1986) shows that (8.2) entails $E_*\beta^* = \hat\beta$ and

$$\mathrm{Cov}_*(\beta^*) = (X^TX)^{-1} \sum r_i^2 (1 - h_i)^{-1} x_i x_i^T \,(X^TX)^{-1}.$$

The latter is consistent in general.

Remarks

1. When $\sigma_i = \sigma$ for all $i$, $v_b$ is consistent (and unbiased once the residuals are standardized as above). Indeed Efron's method (Efron 1979) may well be best if $e_1, \ldots, e_n$ are iid. If they are not, and interest focuses on the distribution of $\hat\beta$ (and not just its covariance), the bootstrap encounters serious difficulties. A very simple example demonstrates this.

Example 8.1 Consider the one dimensional regression model

$$y_i = x_i\beta + e_i, \qquad i = 1, \ldots, n,$$

with $E e_i = 0$, $\mathrm{Var}(e_i) = \sigma^2$, and $E e_i^3 = \mu_{3,i}$. Interest focuses on the third moment of $\hat\beta$, as when the Edgeworth expansion and related results for the bootstrap method are required. The LSE is $\hat\beta = (\sum x_i^2)^{-1} \sum x_i y_i$ and

$$E(\hat\beta - \beta)^3 = \Big(\sum x_i^2\Big)^{-3} \sum x_i^3 \mu_{3,i}.$$

But the bootstrap estimate of the third moment is

$$E_*(\beta^* - \hat\beta)^3 = \Big(\sum x_i^2\Big)^{-3} \sum x_i^3 \; n^{-1}\sum r_i^3. \tag{8.3}$$

In general, (8.3) is biased and inconsistent. This is not surprising since the $\{r_i\}$ are not exchangeable (the $\{\mu_{3,i}\}$ being unequal).

2. The properties of Wu's bootstrap depend on the distribution of $t^*$. Specifying that distribution becomes a new problem. (Wu's bootstrap uses information not in the original samples. This may make the method nonrobust in more general cases.)
3. For heteroscedastic errors, Wu's method encounters the problem identified in Remark 1. We can get a consistent third moment estimate by choosing the distribution of $t^*$ to satisfy $E_*(t^{*3}) = 1$. But we may well be interested in other quantities as well. Manipulating the distribution of $t^*$ may not simultaneously provide satisfactory estimators for all quantities of interest.

4. As noted by Wu (1986), bootstrap methods are hard to generalize to nonlinear situations; the heteroscedastic bootstrap is based on resampling the residuals, which is not feasible in the nonlinear case.

8.2.2 Vector resampling

Freedman (1981) suggests a way of dealing with the correlation model used for a nonhomogeneous errors model. His method draws a simple random sample with replacement ($*$ sampling), $\{(y_i^*, x_i^{*T})\}_1^n$, from $\{(y_i, x_i^T)\}_1^n$. The method computes the bootstrap LSE from $\{(y_i^*, x_i^{*T})\}_1^n$,

$$\beta^* = \Big(\sum x_i^* x_i^{*T}\Big)^{-1} \sum x_i^* y_i^*, \tag{8.4}$$

and the bootstrap covariance estimator

$$E_*(\beta^* - \hat\beta)(\beta^* - \hat\beta)^T. \tag{8.5}$$

This approach suffers from several drawbacks. Firstly, injudicious use of this method can lead to inconsistent covariance estimators (Wu 1986 gives an example). Secondly, the method fails to incorporate the knowledge that $\mathrm{Var}(e_i)$ changes smoothly with $x_i$ when such knowledge is available. Thirdly, on the general grounds of requiring inference to be conditional on the design $D = (x_1, \ldots, x_n)$, one should not risk having simulated data sets whose designs $D^* = (x_1^*, \ldots, x_n^*)$ are very different from $D$. Fourthly, if $n$ or $k$ is large, computational costs may be quite high. Fifthly, small sample sizes can entail singular design matrices, $D^*$.

8.3 A new bootstrap

Let $Y_1, \ldots, Y_n$ be a sequence of independent random variables with observed values $y_1, \ldots, y_n$. Suppose that, for specified functions $\{g_i\}$, $\theta$ satisfies $E_\theta\{g_i(Y_i, \theta)\} = 0$ for all $i = 1, \ldots, n$ and all $\theta$. Here and in the sequel $E_\theta$ denotes the conditional expectation of the $\{Y_i\}$ given $\theta$; $\mathrm{var}_\theta$ represents the corresponding conditional variance.
For simplicity of presentation, we assume $\theta$ is a scalar, but our results easily extend to vector-valued parameters. We let $\hat\theta$ be a solution of the estimating equation $\sum g_i(y_i, \theta) = 0$, and consider the distribution of $\hat\theta - \theta$. A standard two step method for approximating that distribution when the $\{Y_i\}$ are independent and identically distributed (so that $g_i \equiv g$) can be described as follows:

1. use the Taylor expansion to get the approximation

$$\hat\theta - \theta \approx \Big\{-\sum \frac{\partial g_i(y_i, \theta)}{\partial\theta}\Big\}^{-1} \sum g_i(y_i, \theta);$$

2. invoke the Central Limit Theorem to get

$$\hat\theta - \theta \;\overset{\text{approx}}{\sim}\; N\Big[0,\ \Big\{\sum \frac{\partial g_i(y_i, \theta)}{\partial\theta}\Big\}^{-2} n\,\mathrm{Var}_\theta\{g(Y, \theta)\}\Big].$$

The standard bootstrap method exploits this approximation when the $\{Y_i\}$ are independent and identically distributed. However, that condition often fails. Then drawing a bootstrap sample directly from $\{y_1, \ldots, y_n\}$ proves unproductive, leading us to our alternative bootstrap method. First replace $\theta$ by $\hat\theta$ in $g_i(y_i, \theta)$. Then define $z_i = g_i(y_i, \hat\theta)$. Finally:

1. draw the bootstrap sample $\{z_1^*, \ldots, z_n^*\}$ from $\{z_1, \ldots, z_n\}$ as a simple random sample with replacement;

2. compute the bootstrap estimator as

$$\theta^* = \hat\theta + \Big\{-\sum \frac{\partial g_i(y_i, \hat\theta)}{\partial\theta}\Big\}^{-1} \sum z_i^*;$$

3. use the bootstrap, i.e. empirical, distribution of the $(\theta^* - \hat\theta)$'s obtained after many repetitions of 1-2 to approximate the distribution of $\hat\theta - \theta$.

Since our method approximates the distribution of $\hat\theta - \theta$, we get approximations to the distributions needed for inferences based on that distribution. The quality of those approximations depends, of course, on the quality of our underlying approximation to the distribution. That notwithstanding, our method offers great flexibility. And in the manner of Liu (1988), we can prove that our bootstrap yields a better approximation than its competitors under certain conditions. However, our method cannot be used to estimate the bias.

The following example shows how our method differs from its conventional relative. Observations are made of $n$ independent and identically distributed random variables, each with an unspecified probability distribution function, $F$. Inferential interest focuses on the mean, $\mu$, of $F$.
The usual bootstrap would: (i) draw the "bootstrap sample" $\{y_1^*, \ldots, y_n^*\}$ from $\{y_1, \ldots, y_n\}$ as a simple random sample with replacement; (ii) calculate the bootstrap sample mean $\bar y^* = n^{-1}\sum y_i^*$; and (iii) repeat (i)-(ii) sequentially to obtain a sequence of $\bar y^*$ values. The empirical distribution of the $(\bar y^* - \bar y)$'s is the bootstrap approximation to the sampling distribution of $\bar y - \mu$.

Our approach begins with the sampling distribution model

$$Y_i = \mu + e_i, \qquad i = 1, \ldots, n.$$

Given a sample $\{y_i\}$ generated by this distribution, we readily find that the least squares estimate of $\mu$ satisfies the "estimating equation" $\sum (y_i - \mu) = 0$; for simplicity, here and in the sequel we (usually) suppress the upper and lower summation limits, $i = 1$ and $i = n$. This estimating equation relies on the component functions $y_i - \mu$, $i = 1, \ldots, n$, which may be estimated by $z_i = y_i - \bar y$, $i = 1, \ldots, n$. Our method: (i) draws a bootstrap sample $\{z_1^*, \ldots, z_n^*\}$ from $\{z_1, \ldots, z_n\}$ as a simple random sample with replacement; (ii) calculates the bootstrap estimator, $\mu^* = \bar y + n^{-1}\sum z_i^*$; and (iii) repeats (i)-(ii) sequentially to obtain a sequence of $(\mu^* - \bar y)$'s and hence the bootstrap approximation to the sampling distribution of $\hat\mu - \mu$.

From this general discussion of our method we turn in the following sections to linear regression and describe properties of our bootstrap estimator of the regression coefficient vector.

8.4 Asymptotics

8.4.1 Preamble

For the linear model (8.1), Efron (1979), Freedman (1981) and Wu (1986) have suggested the bootstrap methods described in Section 8.2, based on either the residuals of the least squares fit or the complete observation vectors. Our method differs from theirs. We start with the normal equations for the ordinary least squares estimate,

$$\sum x_i (y_i - x_i^T\beta) = 0, \tag{8.6}$$

and their solution $\hat\beta$. Let

$$z_i = x_i (y_i - x_i^T\hat\beta), \qquad i = 1, \ldots, n.$$

The bootstrap estimator defined in Section 8.3 becomes

$$\beta^* = \hat\beta + (X^TX)^{-1} \sum z_i^*, \tag{8.7}$$

$\{z_1^*, \ldots, z_n^*\}$ being a bootstrap sample from $\{z_1, \ldots, z_n\}$.

8.4.2 Consistency of $\beta^*$

For brevity, we now state our main results.
All proofs except that of Theorem 8.1, and the underlying assumptions, appear in Section 8.8.

Theorem 8.1 Suppose Assumptions 8.1 and 8.2 hold for model (8.1). Then

$$E(v_*) = \mathrm{cov}(\hat\beta)\{1 + O(n^{-1})\},$$

where $v_* = E_*(\beta^* - \hat\beta)(\beta^* - \hat\beta)^T$ and $E_*$ represents expectation with respect to the distribution induced by bootstrap sampling.

Proof. This is an immediate consequence of Lemma 8.1 in Section 8.8. □

Because we use an estimate of $\beta$ in resampling, we lose degrees of freedom. This suggests renormalising the terms on the left side of equation (8.6). Two alternatives suggest themselves,

$$z_i^{(1)} = (1 - n^{-1}k)^{-1/2}\, x_i (y_i - x_i^T\hat\beta) \quad\text{and}\quad z_i^{(2)} = (1 - h_i)^{-1/2}\, x_i (y_i - x_i^T\hat\beta), \qquad i = 1, \ldots, n,$$

where the $\{h_i\}$ represent the diagonal elements of the hat matrix. The bootstrap estimators corresponding to these two renormalisations are $\beta^{*(1)}$ and $\beta^{*(2)}$, and the covariance estimators are $v_*^{(1)}$ and $v_*^{(2)}$. The asymptotic properties of $\beta^*$, $\beta^{*(1)}$ and $\beta^{*(2)}$ are equivalent. The proof of the next theorem, the counterpart of Theorem 8.1 for the renormalized "bootstrappands", follows immediately from Lemma 8.1 and so is omitted.

Theorem 8.2 For model (8.1), $E\,v_*^{(2)} = \mathrm{Cov}(\hat\beta)$ if the condition in (8.24) holds. If Assumptions 8.1 and 8.3 hold as well, $E(v_*^{(i)}) = \mathrm{Cov}(\hat\beta)(1 + O(n^{-1}))$, for $i = 1, 2$.

An asymptotic analysis must explore the third moments needed for Edgeworth expansions. The use of such expansions in bootstrap regression models seems to have been proposed by Navidi (1989), who considers only Efron's method for independent and identically distributed errors. Liu (1988) investigates the third moment bootstrap estimator of $l^T\hat\beta$ based on Wu's bootstrap for the case of heteroscedastic errors, $l$ being a fixed $k \times 1$ vector. For model (8.1), we also consider the sampling distribution of the least squares estimator of any such specified linear functional of $\beta$, $l^T\beta$. For model (8.1), let

$$E(e_i^3) = \mu_{3,i}, \qquad E(e_i^4) = \mu_{4,i}, \qquad i = 1, \ldots, n. \tag{8.8}$$

From the bootstrap estimate (8.7), the third and fourth moment estimators for $l^T\hat\beta$ are defined as $\mu_{3,*} = E_*\{l^T(\beta^* - \hat\beta)\}^3$ and $\mu_{4,*} = E_*\{l^T(\beta^* - \hat\beta)\}^4$. Because $\sum w_i r_i = 0$ by the normal equations, it is easily shown that

$$\mu_{3,*} = \sum_{i=1}^n w_i^3 r_i^3 \tag{8.9}$$

and

$$\mu_{4,*} = (4 - 3n^{-1}) \sum_{i=1}^n w_i^4 r_i^4 + 3(1 - n^{-1}) \sum_{i \neq j} w_i^2 w_j^2 r_i^2 r_j^2, \tag{8.10}$$

where $w_i = l^T(X^TX)^{-1}x_i$ and $r_i = y_i - x_i^T\hat\beta$. With the notation of (8.8), and for independent errors, the third and fourth moments of $l^T\hat\beta$ are

$$\mu_3 = E\{l^T(\hat\beta - \beta)\}^3 = \sum_{i=1}^n w_i^3 \mu_{3,i} \tag{8.11}$$

and

$$\mu_4 = E\{l^T(\hat\beta - \beta)\}^4 = \sum_{i=1}^n w_i^4 \mu_{4,i} + 3 \sum_{i \neq j} w_i^2 w_j^2 \sigma_i^2 \sigma_j^2. \tag{8.12}$$

Theorem 8.3 Assume the elements of $l$ are bounded. Then under model (8.1) and (8.8),

(i) $E(\mu_{3,*}) = \mu_3 + O(n^{-3})$ when Assumptions 8.1 and 8.5 hold, and

(ii) $E(\mu_{4,*}) = \mu_4 + O(n^{-3}) = \mu_4\{1 + O(n^{-1})\}$ when Assumptions 8.1 and 8.6 hold.

Remarks:

1. The covariance estimators $v_*$, $v_*^{(1)}$ and $v_*^{(2)}$ corresponding to the bootstrap estimator triplet $\beta^*$, $\beta^{*(1)}$ and $\beta^{*(2)}$ are asymptotically equivalent. As well, in estimating $\mathrm{cov}(\hat\beta)$, $v_*^{(2)}$ is unbiased for homoscedastic models and consistent for heteroscedastic models. So it has the desiderata set out by Wu (1986, Section 5).

2. Wu's method (Wu 1986) fails in general to give a consistent estimator of the third moment. This difficulty can be overcome within Wu's approach simply by modifying the distribution of his random variable $t^*$ to achieve a consistent third moment estimator (Liu 1988). But then an inconsistent fourth moment estimator may result. By contrast, our method yields consistent third and fourth moment estimators.

8.4.3 Asymptotic normality of $\beta^*$

In this section, we describe the asymptotic distribution of $\beta^*$ under general conditions, including the case of independent and identically distributed errors and the correlation model as special cases. Freedman (1981) investigates the asymptotic theory of Efron's residual resampling method for the case of independent and identically distributed errors and of the vector resampling method for the correlation model.

To simplify the statement of our results, denote by $\mathcal{L}(U \mid V)$ the distribution of $U$ conditional on $V$ for any random objects $U$ and $V$, and let $P_*$ denote the probability distribution conditional on $\{(y_i, x_i) : i = 1, \ldots, n\}$.
Let $N_k(\mu, \Gamma)$ be the multivariate normal distribution with mean $\mu$ and covariance $\Gamma$.

The next result involves two distinct sampling processes, the first leading to the original data set of size $n$, and the second created by bootstrap resampling, say $m$ times. With the original sample fixed, classical limit theory applies when $m \to \infty$. But complications arising from the need to allow $n \to \infty$ entail more delicate analysis, as emphasized in the work of Bickel and Freedman (1981). Such analysis leads to the following results.

Theorem 8.4 For the regression model (8.1) with fixed regressor variables, suppose Assumptions 8.7-8.10 hold. Then

$$\mathcal{L}\big[\sqrt{n}(\beta^* - \hat\beta) \mid \{(y_i, x_i) : i = 1, \ldots, n\}\big] \to N_k(0, V^{-1}WV^{-1})$$

for almost all sequences $\{(y_i, x_i) : i = 1, \ldots, n\}$ as $n \to \infty$.

Theorem 8.5 For the regression model (8.1) with random regressor variables, suppose Assumptions 8.11-8.13 hold. Then

$$\mathcal{L}\big[\sqrt{n}(\beta^* - \hat\beta) \mid \{(y_i, x_i) : i = 1, \ldots, n\}\big] \to N_k(0, Q^{-1}MQ^{-1})$$

for almost all sequences $\{(y_i, x_i) : i = 1, \ldots, n\}$ as $n \to \infty$, where $M = E\,x_i x_i^T e_i^2$.

Remarks:

3. Freedman (1981) suggests different bootstrap methods for different models. If the model is changed, the normality of the bootstrap method can be lost. In contrast, Theorems 8.4 and 8.5 assert that the one bootstrap method discussed in Section 8.4.1 yields asymptotic normality for both models.

8.5 Comparison and Simulation Study

In this section we compare our method with those of Efron (1979), Wu (1986) and Freedman (1981). We do this by highlighting the simple structural differences between these methods in the way they simulate resampling of the regression coefficient estimation vectors. Then we present the results of simulation studies in support of our method.

Suppose we wish to estimate the distribution of $\hat\beta - \beta = (X^TX)^{-1}\sum x_i e_i$. Then we need only estimate the distribution of $\sum x_i e_i$, $(X^TX)^{-1}$ being fixed.
Efron's bootstrap resamples from the residuals $r_1, \ldots, r_n$, $r_i = y_i - x_i^T\hat\beta$, and determines the corresponding bootstrap estimators by

$$\beta^* - \hat\beta = (X^TX)^{-1} \sum x_i r_i^*.$$

Efron's $\{r_i^*\}$ are independent and identically distributed samples from $\{r_i\}$. While this procedure seems natural for exchangeable residuals $\{r_i\}$, doubt about its validity arises when they are not. And these residuals are definitely not exchangeable in heteroscedastic regression models, where Efron's estimators can be inconsistent.

In Wu's approach, bootstrap estimators are found from

$$\beta^* - \hat\beta = (X^TX)^{-1} \sum x_i r_i t_i^*,$$

under the independent and identically distributed sampling model, denoted by $*$, for $t^* = (t_1^*, \ldots, t_n^*)$ satisfying

$$E_*(t^*) = 0 \quad\text{and}\quad \mathrm{cov}_*(t^*) = I.$$

So we see that Wu's method uses $\sum x_i r_i t_i^*$ to simulate $\sum x_i e_i$. The distribution of $\sum x_i r_i t_i^*$ will depend on the choice of that of the $\{t_i^*\}$. Specifying that distribution becomes a new problem and, in any event, means that Wu's bootstrap requires information not in the original sample. In general the method risks possible nonrobustness. And if $n$ is insufficiently large, the distribution of $\sum x_i r_i t_i^*$ can vary greatly for varying distributions of $t^*$.

Freedman (1981) suggests a way of dealing with the correlation model used for a nonhomogeneous errors model. His "pairs" method draws a simple random sample with replacement, $\{(y_i^*, x_i^*) : i = 1, \ldots, n\}$, from $\{(y_i, x_i) : i = 1, \ldots, n\}$ and computes the bootstrap least squares estimate from $\{(y_i^*, x_i^*) : i = 1, \ldots, n\}$,

$$\beta^* = \Big(\sum x_i^* x_i^{*T}\Big)^{-1} \sum x_i^* y_i^*.$$

Then the distribution of $\beta^* - \hat\beta$ approximates that of $\hat\beta - \beta$. As indicated in Wu (1986), Hinkley (1988) and Section 8.2, this approach suffers from several drawbacks.

As shown in equation (8.7), our approach uses

$$\beta^* - \hat\beta = (X^TX)^{-1} \sum z_i^*.$$

The permutability of the $z_1, \ldots, z_n$ in (8.6) leads us to make the $\{z_i^*\}$ independent and identically distributed. We see that of the four methods, ours is the most direct in simulating the distribution of interest.

We now compare the four methods again, this time through a simulation study designed to compare their performance.
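The four perturbation schemes compared above can be sketched side by side. The following is a minimal illustration in Python with NumPy, not the code used for the thesis simulations: the function names `bootstrap_deviations` and `sandwich` are hypothetical, Wu's multipliers $t_i^*$ are taken to be random signs (one assumed choice satisfying $E_*(t^*) = 0$ and $\mathrm{cov}_*(t^*) = I$), and the pairs refit uses a least squares solver that tolerates a singular resampled design.

```python
import numpy as np


def bootstrap_deviations(X, y, B=2000, seed=0):
    """Return beta_hat and, for each scheme, a (B, k) array of beta* - beta_hat."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    A = np.linalg.inv(X.T @ X)          # (X^T X)^{-1}, fixed throughout
    beta_hat = A @ X.T @ y
    r = y - X @ beta_hat                # least squares residuals r_i
    z = X * r[:, None]                  # rows z_i = x_i r_i, the components in (8.6)

    dev = {m: np.empty((B, k)) for m in ("efron", "wu", "pairs", "new")}
    for b in range(B):
        # Efron: iid residuals r_i* attached to the fixed x_i
        idx = rng.integers(0, n, size=n)
        dev["efron"][b] = A @ X.T @ r[idx]
        # Wu: x_i r_i t_i*, with t_i* = +/-1 (assumed choice of distribution)
        t = rng.choice(np.array([-1.0, 1.0]), size=n)
        dev["wu"][b] = A @ X.T @ (r * t)
        # Freedman: resample (y_i, x_i) pairs and refit
        idx = rng.integers(0, n, size=n)
        fit = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        dev["pairs"][b] = fit - beta_hat
        # New method: resample the z_i themselves, as in (8.7)
        idx = rng.integers(0, n, size=n)
        dev["new"][b] = A @ z[idx].sum(axis=0)
    return beta_hat, dev


def sandwich(X, y):
    """(X^T X)^{-1} [sum r_i^2 x_i x_i^T] (X^T X)^{-1}."""
    A = np.linalg.inv(X.T @ X)
    r = y - X @ (A @ X.T @ y)
    return A @ (X.T * r**2) @ X @ A
```

Because the normal equations force $\sum z_i = 0$, the resampled sums in the new scheme have bootstrap mean zero and bootstrap covariance $\sum r_i^2 x_i x_i^T$, so the Monte Carlo covariance of `dev["new"]` should settle near `sandwich(X, y)` as `B` grows.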
From the simulation study of Wu (1986), we know that our bootstrap method will perform well in variance-covariance estimation. We know this because our method has the desiderata set out by Wu (1986, Section 5).

We study the following regression model:

$$y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1(1)10,$$

$$x_i = 0,\ 1,\ 1.5,\ 2,\ 3,\ 3.5,\ 4,\ 4.5,\ 4.75,\ 5.$$

Two error distributions are used in our study: equal variances, $e_i \sim N(0, 1)$, and unequal variances, $e_i \sim x_i N(0, 1)$, $i = 1, \ldots, n$. In all cases, the $\{e_i\}$ are independent.

We consider four bootstrap methods: (i) Efron's method; (ii) Wu's method, the distribution of $t^*$ being given by the "wild bootstrap" (Hardle and Marron 1991); this choice for the distribution of the $\{t_i^*\}$ ensures fulfilment of the first, second and third moment conditions (Liu 1988); (iii) Freedman's "pairs" method; (iv) our new bootstrap method.

We have run many simulations and chosen two figures to illustrate the results obtained for the homoscedastic and heteroscedastic model simulation experiments. Each figure displays the cumulative frequency plots for simulated error distributions. They depict the distributions associated with each of the four methods under investigation, labelled b through e. That associated with the true estimation error distribution is labelled a. The five variables underlying the displays a through e are represented generically by $w = \hat\beta_0 - \beta_0$. A single simulation run begins with an estimate computed from a single sample generated in accord with the sampling distribution. Then 1000 bootstrap values $w^*$ of $w$ are generated by each of the four methods under consideration.

Figure 8.1: A comparison of bootstrap distribution estimators for regression with homoscedastic errors. We depict the distributions of $w = \hat\beta_0 - \beta_0$ induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron's estimator (c), Wu's estimator (d), and Freedman's estimator (e). [Cumulative frequency plot; horizontal axis $w$.]

Figure 8.2: A comparison of bootstrap distribution estimators for regression with heteroscedastic errors. We depict the sampling distributions of $w = \hat\beta_0 - \beta_0$ induced by the true distribution (labelled a), using our bootstrap estimator (b), Efron's estimator (c), Wu's estimator (d), and Freedman's estimator (e). [Cumulative frequency plot; horizontal axis $w$.]

The results in Table 8.1 and Table 8.2, unlike those in the figures, offer an aggregate view of performance and are based on 500 simulations of $B = 1000$ bootstrap samples. We use the Kolmogorov-Smirnov statistic to index the accuracy of the various distribution estimators. In Table 8.1, the Kolmogorov-Smirnov statistic is defined as $\sup_w |F^*(w) - F(w)|$, where $F^*(\cdot)$ denotes the empirical distribution of $w^*$ and $F(\cdot)$ that of the true distribution. The mean values of the Kolmogorov-Smirnov statistics from these 500 simulations are summarized in Table 8.1. The value beside each mean is the corresponding standard error.

Table 8.1: Averages of the Kolmogorov-Smirnov Statistics for Competing Bootstrap Distribution Estimators

                Equal Variance     Unequal Variance
New Method      0.115 (0.0029)     0.080 (0.0021)
Efron           0.079 (0.0019)     0.136 (0.0028)
Wu              0.158 (0.0024)     0.113 (0.0019)
Pair            0.126 (0.0025)     0.095 (0.0017)

Table 8.2: Absolute Biases of the Competing Bootstrap Quantile Estimators

                Equal Variance                     Unequal Variance
                5%              95%                5%              95%
New Method      0.328 (0.009)   0.320 (0.009)      0.411 (0.014)   0.403 (0.014)
Efron           0.224 (0.007)   0.220 (0.007)      1.298 (0.041)   1.276 (0.039)
Wu              0.344 (0.009)   0.338 (0.009)      0.436 (0.014)   0.431 (0.014)
Pair            0.345 (0.012)   0.343 (0.011)      0.785 (0.033)   0.680 (0.030)

Since the confidence interval is one commonly used inferential procedure, studying the quantile estimators obtained from the four methods seemed worthwhile. The absolute bias of the 95%-quantile estimator is defined as $|q_{.95}(F^*) - q_{.95}(F)|$, where $q_{.95}(F)$ denotes the 95%-quantile of the distribution $F$.
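The two accuracy measures just defined can be computed as follows. This is a minimal sketch in Python with NumPy, assuming the true distribution $F$ is itself represented by a (large) Monte Carlo sample of $w$ values rather than a closed form; the function names `ks_distance` and `abs_quantile_bias` are hypothetical.

```python
import numpy as np


def ks_distance(sample_a, sample_b):
    """sup_w |F_a(w) - F_b(w)| for the empirical distribution functions of two samples."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Both step functions are right-continuous and only jump at sample points,
    # so the supremum is attained at one of the pooled points.
    grid = np.concatenate([a, b])
    Fa = np.searchsorted(a, grid, side="right") / a.size
    Fb = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(Fa - Fb)))


def abs_quantile_bias(boot_values, true_values, q):
    """|q(F*) - q(F)| for a quantile level q in (0, 1)."""
    return float(abs(np.quantile(boot_values, q) - np.quantile(true_values, q)))
```

Averaging `ks_distance(w_star, w_true)` over repeated simulated data sets gives a quantity of the kind reported in Table 8.1, and averaging `abs_quantile_bias` at levels 0.05 and 0.95 gives one of the kind reported in Table 8.2.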
The means of the absolute biases of the 5%- and 95%-quantile estimators are summarized in Table 8.2. The value beside each mean is the corresponding standard error.

The results of our simulation studies can be summarized as follows.

1. As expected, for estimating a distribution Efron's method performs best when the errors are independent and identically distributed (see Table 8.1). The new bootstrap method also works well in that case, though not quite as well as Efron's. Freedman's pairs method and Wu's method seem to yield rough estimators and to be qualitatively poorer than the other two methods. This may be because Wu's method stresses consistency of the estimator's moments at the expense of estimating the overall distribution (see also Figure 8.1). From Table 8.1, we see that in regression with heteroscedastic errors Efron's approach does badly, again in accord with our expectations; the model errors are not exchangeable and Efron's estimator is not consistent. The new bootstrap method works best, Freedman's reasonably well and Wu's not so well (see also Figure 8.2).

2. For quantile estimation, Efron's method performs best for both 5%- and 95%-quantiles when the errors are independent and identically distributed (see Table 8.2). The new method ranks second and Wu's third. For regression with heteroscedastic errors, the new method is best, Wu's second and Freedman's third. However, Efron's method gives an inconsistent estimator and seems totally unsatisfactory.

3. Overall, we can conclude that Efron's method is not robust against heteroscedastic errors. Wu's method seems unsatisfactory for estimating a distribution. We found Freedman's method effective but computationally expensive in the examples considered; presumably the method would be unrealistic for use in nonlinear situations. Even with just two parameters, the method required much more time than the others.

We next consider the use of pivotal quantities in bootstrapping.
The bootstrap-t statistic is constructed as

$$t^*(b) = \frac{\beta^*(b) - \hat\beta}{se^*(b)}, \tag{8.13}$$

where $b$ denotes the bootstrap sample and $se$ means "standard error". In view of our earlier findings, we consider only three bootstrap methods: Efron's, Freedman's, and ours. For both the case of equal and that of unequal variances, we ran a single simulation. The pivot quantile estimators in Table 8.3 are based on 1000 bootstrap samples. In Table 8.3, $t(8)$ denotes the t-distribution with 8 degrees of freedom and $N(0, 1)$ denotes the standard normal distribution.

Table 8.3: The Pivot Quantile Estimators

                0.05     0.10     0.25     0.50      0.75     0.90     0.95
Equal Variance
New Method      -1.72    -1.34    -0.70    0.052     0.72     1.30     1.67
Efron           -1.86    -1.36    -0.71    -0.004    0.71     1.49     1.97
Pair            -1.51    -1.08    -0.60    0.122     0.58     1.16     1.61
t(8)            -1.86    -1.40    -0.71    0         0.71     1.40     1.86
N(0,1)          -1.65    -1.28    -0.67    0         0.67     1.28     1.65
Unequal Variance
New Method      -1.77    -1.34    -0.69    -0.029    0.68     1.31     1.84
Efron           -1.99    -1.55    -0.76    -0.033    0.64     1.36     1.86
Pair            -1.63    -1.06    -0.55    -0.011    0.45     0.90     1.17

For the case of equal variances, $t(8)$ is a good reference distribution. From Table 8.3, we find that both Efron's method and the new bootstrap method perform well. The pairs estimator is not as good as that given by the other two methods.

For the case of unequal variances, we do not know a good reference distribution. If we use $t(8)$ as before, both Efron's and the new method give good estimators. The pairs method fails to give reasonable results. But since Efron's method yields an inconsistent variance estimator, we cannot use Efron's method to construct a useful confidence interval.

8.6 Bootstrapping in Nonlinear Situations

In this section we merely sketch extensions of our bootstrap to the nonlinear situations considered by Wu (1986). The extensive detail needed for the required analysis of lower-order terms will be presented elsewhere.
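8.6.1 Regression M-estimator

An M-estimate $\hat\beta$ of the regression coefficient vector $\beta$ is found by minimizing

$$\sum \rho(y_i - x_i^T\beta) \tag{8.14}$$

over $\beta$, where $\rho$ is usually assumed to be symmetric. Choices of $\rho$ can be found in Huber (1981). The M-estimate is found by solving

$$\sum x_i \rho'(y_i - x_i^T\beta) = 0. \tag{8.15}$$

Call $x_i\rho'(y_i - x_i^T\beta)$ the "score function". From the method of Section 8.3, we obtain a bootstrap sample from the estimated score function with $\beta$ replaced by $\hat\beta$. To be precise, let $z_i = x_i\rho'(y_i - x_i^T\hat\beta)$, $i = 1, \ldots, n$. Next: (1) draw the bootstrap sample $\{z_1^*, \ldots, z_n^*\}$ as a simple random sample with replacement from $\{z_1, \ldots, z_n\}$; (2) construct the bootstrap M-estimator $\beta^*$ as

$$\beta^* = \hat\beta + \Big(\sum x_i x_i^T \rho''(\hat e_i)\Big)^{-1} \sum z_i^*,$$

where $\hat e_i = y_i - x_i^T\hat\beta$, $i = 1, \ldots, n$.

For estimating $\mathrm{Var}(\hat\psi)$, $\hat\psi = \psi(\hat\beta)$, we use an analogue of the estimator in Section 8.4, $v_* = E_*(\psi^* - \hat\psi)(\psi^* - \hat\psi)^T$, where $\psi^* = \psi(\beta^*)$. Principal interest centers on $\psi = \beta$, and $v_* = E_*(\beta^* - \hat\beta)(\beta^* - \hat\beta)^T$, where

$$v_* = \Big(\sum x_i x_i^T \rho''(\hat e_i)\Big)^{-1} \Big(\sum x_i x_i^T r_i^2\Big) \Big(\sum x_i x_i^T \rho''(\hat e_i)\Big)^{-1} \tag{8.16}$$

and $r_i = \rho'(y_i - x_i^T\hat\beta)$.

Now let us calculate the asymptotic approximation to the covariance matrix of $\hat\beta$. Under appropriate conditions (not stated here for brevity),

$$0 = \sum x_i \rho'(y_i - x_i^T\hat\beta) = \sum x_i \rho'(y_i - x_i^T\beta) - \sum x_i x_i^T \rho''(y_i - x_i^T\beta)\,(\hat\beta - \beta) + o(\hat\beta - \beta).$$

This implies

$$\hat\beta - \beta \approx \Big[\sum x_i x_i^T \rho''(y_i - x_i^T\beta)\Big]^{-1} \sum x_i \rho'(y_i - x_i^T\beta).$$

Alternatively, $v \approx E(\hat\beta - \beta)(\hat\beta - \beta)^T$, where

$$v = \Big[\sum x_i x_i^T \rho''(e_i)\Big]^{-1} \Big[\sum x_i x_i^T (\rho'(e_i))^2\Big] \Big[\sum x_i x_i^T \rho''(e_i)\Big]^{-1}; \tag{8.17}$$

here $e_i = y_i - x_i^T\beta$, $i = 1, \ldots, n$. Comparison of (8.16) and (8.17) suggests the bootstrap estimator is a reasonable estimate of the covariance matrix of $\hat\beta$.

8.6.2 Nonlinear regression

In nonlinear regression,

$$y_i = f_i(\beta) + e_i, \qquad i = 1, \ldots, n,$$

where $f_i$ is a nonlinear smooth function of $\beta$ and $e$ satisfies (8.1). The LSE $\hat\beta$ is obtained by minimizing $\sum (y_i - f_i(\beta))^2$. The score function is $x_i(\beta)(y_i - f_i(\beta))$, where $x_i(\beta) = \partial f_i/\partial\beta$. Put the LSE into the score function. The bootstrap method of Section 8.3 then works with samples $\{z_1^*, \ldots, z_n^*\}$ from $\{z_i = x_i(\hat\beta)(y_i - f_i(\hat\beta)) : i = 1, \ldots, n\}$. The bootstrap coefficient estimators are

$$\beta^* = \hat\beta + \big(X(\hat\beta)^T X(\hat\beta)\big)^{-1} \sum z_i^*,$$

where $X(\beta) = (x_1(\beta), \ldots, x_n(\beta))^T$. It is easily shown that $E_*\beta^* = \hat\beta$ and $v_* = E_*(\beta^* - \hat\beta)(\beta^* - \hat\beta)^T$, with

$$v_* = \big(X(\hat\beta)^T X(\hat\beta)\big)^{-1} \sum r_i^2\, x_i(\hat\beta) x_i(\hat\beta)^T \big(X(\hat\beta)^T X(\hat\beta)\big)^{-1}, \tag{8.18}$$

where $r_i = y_i - f_i(\hat\beta)$.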
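The nonlinear least squares construction leading to (8.18) can be sketched numerically. The following Python/NumPy fragment is a minimal illustration under stated assumptions, not the thesis's implementation: the mean function $f_i(\beta) = \beta_0 \exp(-\beta_1 t_i)$ is a hypothetical example, the LSE is found by a simple Gauss-Newton iteration with step halving (any least squares solver would do), and the names `f`, `jac`, `fit_lse` and `one_step_bootstrap` are invented for the sketch.

```python
import numpy as np


def f(beta, t):
    """Hypothetical mean function f_i(beta) = beta_0 * exp(-beta_1 * t_i)."""
    return beta[0] * np.exp(-beta[1] * t)


def jac(beta, t):
    """Matrix X(beta) whose rows are x_i(beta)^T = (d f_i / d beta)^T."""
    e = np.exp(-beta[1] * t)
    return np.column_stack([e, -beta[0] * t * e])


def fit_lse(y, t, beta0, iters=200):
    """Least squares estimate by Gauss-Newton with step halving (assumed solver)."""
    beta = np.asarray(beta0, dtype=float)

    def sse(b):
        return float(np.sum((y - f(b, t)) ** 2))

    for _ in range(iters):
        step = np.linalg.lstsq(jac(beta, t), y - f(beta, t), rcond=None)[0]
        lam = 1.0
        # Halve the step while it clearly increases the sum of squares.
        while sse(beta + lam * step) > sse(beta) * (1 + 1e-12) and lam > 1e-4:
            lam /= 2.0
        beta = beta + lam * step
        if np.linalg.norm(step) < 1e-10:
            break
    return beta


def one_step_bootstrap(y, t, beta0, B=2000, seed=0):
    """Draw B values of beta* = beta_hat + (X^T X)^{-1} sum z_i*, X = X(beta_hat)."""
    rng = np.random.default_rng(seed)
    beta_hat = fit_lse(y, t, beta0)
    Xh = jac(beta_hat, t)
    r = y - f(beta_hat, t)                 # residuals r_i = y_i - f_i(beta_hat)
    z = Xh * r[:, None]                    # estimated score components z_i
    A = np.linalg.inv(Xh.T @ Xh)           # symmetric, so s^T A = (A s)^T
    idx = rng.integers(0, y.size, size=(B, y.size))
    boot = beta_hat + z[idx].sum(axis=1) @ A
    return beta_hat, boot
```

No model is refitted inside the bootstrap loop: each replicate costs one resampled sum of the $z_i$, which is the computational advantage over pairs resampling noted in Section 8.5.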
If we calculate the asymptotic approximation to the covariance matrix of the LSE, $\hat\beta$, we get

$$\hat\beta - \beta \approx \big(X(\beta)^T X(\beta)\big)^{-1} X(\beta)^T e$$

and $v \approx E(\hat\beta - \beta)(\hat\beta - \beta)^T$, where

$$v = \big(X(\beta)^T X(\beta)\big)^{-1} \sum \sigma_i^2\, x_i(\beta) x_i(\beta)^T \big(X(\beta)^T X(\beta)\big)^{-1}. \tag{8.19}$$

From (8.18) and (8.19), we find that the bootstrap estimator is consistent and robust in general.

8.6.3 Generalized Linear Models

A generalized linear model is characterized by three components (McCullagh and Nelder, 1989): (i) an error distribution from the exponential family, $f(y; \psi, \phi) = \exp([y\psi - a(\psi) + b(y, \phi)]/\phi)$; (ii) a systematic component, $\eta = x^T\beta$; and (iii) a monotonic, differentiable link function, $\mu = g(\eta)$, where $\mu = E(y)$.

Here we consider generalized linear models with independent observations. Let $y = (y_1, \ldots, y_n)^T$, $E(y) = \mu = (\mu_1, \ldots, \mu_n)^T$ and

$$\mathrm{Cov}(y) = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)\, V(\mu) = \mathrm{diag}(\sigma_i^2 v(\mu_i)). \tag{8.20}$$

The mean, $\mu_i$, is related to the regressor, $x_i$, by the link function $g$, i.e. $\mu_i = g(x_i^T\beta)$. The full likelihood may not be available. Inference is instead based on the log quasilikelihood (see Wedderburn 1974, McCullagh 1983), $L(\mu, y)$, defined by

$$\frac{\partial L(\mu, y)}{\partial\mu} = V(\mu)^{-1}(y - \mu).$$

A generalized least squares estimator (GLSE), $\hat\beta$, is defined as a solution of

$$D^T V^{-1}(y - \mu(\beta)) = 0, \tag{8.21}$$

where $\mu(\beta) = (\mu_i)_{i=1}^n = (g(x_i^T\beta))_{i=1}^n$ and $D = d\mu/d\beta = \mathrm{diag}(g'(x_i^T\beta))X$. Here $D^T V^{-1}(y - \mu(\beta))$ is called the quasi-score function.

Moulton and Zeger (1991) consider two bootstrap methods for generalized linear models. One corresponds to standardized Pearson residual resampling, which extends Efron's residual bootstrapping method using standardized Pearson residuals. The method is similar to that for the ordinary regression model; with heteroscedastic errors (i.e. unequal $\sigma_i$'s in (8.20)), the bootstrapped covariance estimator is in general inconsistent for the true covariance of the GLSE, $\hat\beta$. The other method of Moulton and Zeger involves observation vector resampling; this extends ordinary vector resampling using a one-step approach.

We use the bootstrap method proposed in Section 8.3.
By substituting the estimator $\hat\beta$ in (8.21) and rewriting it, we find the quasi-score function

$$\sum x_i\, g'(x_i^T\hat\beta)\, v(\hat\mu_i)^{-1} (y_i - \hat\mu_i).$$

Let $z_i = x_i\, g'(x_i^T\hat\beta)\, v(\hat\mu_i)^{-1}(y_i - \hat\mu_i)$, $i = 1, \ldots, n$. The bootstrap is thus based on the uniform distribution, $F_b$, with support $z_1, \ldots, z_n$, and on the calculation of the bootstrap coefficients by

$$\beta^* = \hat\beta + (D^T V^{-1} D)^{-1} \sum z_i^*; \tag{8.22}$$

here we have inserted the estimator $\hat\beta$ into the righthand side of (8.22). For estimating $\mathrm{Var}(\hat\psi)$, $\hat\psi = \psi(\hat\beta)$, the bootstrap covariance estimator is $v_* = E_*(\psi^* - \hat\psi)(\psi^* - \hat\psi)^T$. For $\beta$, $v_* = E_*(\beta^* - \hat\beta)(\beta^* - \hat\beta)^T$, where

$$v_* = (D^T V^{-1} D)^{-1} \sum \big(g'(x_i^T\hat\beta)\, v(\hat\mu_i)^{-1}\big)^2 r_i^2\, x_i x_i^T \,(D^T V^{-1} D)^{-1},$$

and $\{r_i = y_i - \hat\mu_i : i = 1, \ldots, n\}$ are the prediction errors.

For the GLSE, $\hat\beta$, we have the approximation (McCullagh, 1983)

$$\hat\beta - \beta \approx (D^T V^{-1} D)^{-1} D^T V^{-1}(y - \mu),$$

so that

$$\mathrm{Cov}(\hat\beta) \approx (D^T V^{-1} D)^{-1} D^T V^{-1}\, \mathrm{Cov}(y - \mu)\, V^{-1} D\, (D^T V^{-1} D)^{-1}.$$

Under model (8.20), $\mathrm{Cov}(y - \mu) = \mathrm{diag}(\sigma_i^2 v(\mu_i))$ and

$$\mathrm{Cov}(\hat\beta) \approx (D^T V^{-1} D)^{-1} \sum (g'(x_i^T\beta))^2\, v(\mu_i)^{-1} \sigma_i^2\, x_i x_i^T \,(D^T V^{-1} D)^{-1}.$$

We show elsewhere that $v_*$ is a consistent estimator for $\mathrm{Cov}(\hat\beta)$.

The homoscedastic model ($\sigma_i^2 = \sigma^2$) is the most interesting special case; then the covariance of $\hat\beta$ is $\mathrm{Cov}(\hat\beta) \approx \sigma^2 (D^T V^{-1} D)^{-1}$ and the bootstrap covariance estimator becomes

$$v_* = (D^T V^{-1} D)^{-1} D^T V^{-1}\, \mathrm{Diag}\{r_i^2\}\, V^{-1} D\, (D^T V^{-1} D)^{-1}. \tag{8.23}$$

The bootstrap covariance estimator (8.23) is the same as that employed by Cox (1961), Huber (1967), White (1982), and Royall (1986) for handling quite general model misspecification.

8.7 Concluding Remarks

The conventional approach to the bootstrap has been through quasi replicates of the original sample, possibly subject to reweighting or some constraints. The bootstrap distribution is then obtained from the empirical distribution, or some smoothed version of the empirical distribution, of the succession of realized estimators obtained from these quasi replicate samples.

Even when the resampling population consists of (possibly renormalized) model fit residuals, the approach has been to construct successive pseudo estimator values, and then the bootstrap distribution for the estimator of interest, through the construction of bootstrap data sets.
What we have done is to by-pass the data sets. Instead we have taken the estimator of interest as the baseline and generated successive estimator fit residuals (rather than model fit residuals). The approximate sampling distribution for the estimator is then found by taking the resulting empirical distribution of the realizations of estimator plus residual. A critical feature of our approach is the use of the components of the estimating function itself to transform the residuals to the appropriate scale.

Heuristically, we have perturbed the estimator by bootstrapped realizations of a normalized first order term from a Taylor expansion of the estimator about the true value of the parameter. This approach seems natural. Generating bootstrap replicates of the original sample seems unnecessary when the object of interest is the sampling distribution of an estimator and it can be realized directly from suitably expressed estimator fit residuals.

The idea of by-passing the data sets is in itself not new. Moulton and Zeger (1991) use this idea in what they call their "one step procedure". They intend their procedure for use in estimating coefficients in link functions of generalized linear models. Their goal is computational simplicity. If they were to follow the traditional practice of bootstrapping the data set, they would need to find the bootstrap coefficient estimator by an iterative process for each successive bootstrap replicate sample. The resulting intolerable computational burden leads them to use the estimating equations (for the generalized linear models) directly, in much the same way as we suggest in this paper for the general case.

Apart from the difference in motivation and intended domain of application, our method differs from that of Moulton and Zeger (1991) in the way we resample residuals. They use a method like that of Wu described above. Their method then inherits the potential deficiencies, discussed above, of Wu's approach.
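As an illustration of by-passing the data sets, the following is a minimal sketch of the estimating-function bootstrap of Section 8.6.3, assuming a Poisson model with log link (so $g = \exp$, $v_i(\mu) = \mu_i$ and $z_j = x_j(y_j - \hat\mu_j)$). The function names `fit_quasi_score` and `ef_bootstrap`, the simulated data and the choice of 4000 replicates are hypothetical and are not taken from the thesis. The sketch draws $n$ components $z_j^*$ with replacement, forms $\beta^*$ as in (8.22), and compares the empirical covariance of the replicates with the sandwich form of $v_*$.

```python
import numpy as np

def fit_quasi_score(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson/log-link quasi-score equation X^T (y - mu) = 0."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())            # assumes column 0 of X is an intercept
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        A = X.T @ (mu[:, None] * X)       # D^T V^{-1} D for this model
        step = np.linalg.solve(A, X.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def ef_bootstrap(X, y, B=4000, seed=0):
    """Resample components z_j of the quasi-score instead of the data."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta_hat = fit_quasi_score(X, y)
    mu_hat = np.exp(X @ beta_hat)
    A = X.T @ (mu_hat[:, None] * X)
    # Rows are z_j = x_j g'(x_j beta) v_j^{-1} (y_j - mu_j); here g'/v_j = 1.
    Z = X * (y - mu_hat)[:, None]
    idx = rng.integers(0, n, size=(B, n))  # n uniform draws per replicate
    z_sums = Z[idx].sum(axis=1)            # shape (B, p)
    beta_star = beta_hat + np.linalg.solve(A, z_sums.T).T   # as in (8.22)
    A_inv = np.linalg.inv(A)
    sandwich = A_inv @ (Z.T @ Z) @ A_inv   # closed form of the bootstrap covariance
    return beta_hat, beta_star, sandwich, Z

# Simulated data (hypothetical): intercept plus one covariate.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8]))).astype(float)

beta_hat, beta_star, sandwich, Z = ef_bootstrap(X, y)
boot_cov = np.cov(beta_star, rowvar=False)
print("bootstrap s.e.:", np.sqrt(np.diag(boot_cov)))
print("sandwich s.e.: ", np.sqrt(np.diag(sandwich)))
```

No model refit is needed inside the resampling loop: each replicate costs one linear solve, which is the computational advantage the text describes.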
Not surprisingly, our bootstrap has the asymptotic properties up to second order which one might expect from the Taylor expansion heuristic. However, the robustness of our method, up to at least nonhomogeneity in fourth-order moments, is unexpected and encouraging. Overall our method has promise. As well, its underlying Taylor expansion heuristic suggests other ways of perturbing the estimator to approximate its sampling distribution, and these are currently under investigation.

We noted above a number of potential applications of our method. In current work we are adapting our method for use with longitudinal count data series. As well, we are seeing if our method can be used to find standard errors for estimators computed from data obtained through complex survey designs. Binder (1991), building on the work of Godambe and Thompson (1986), has shown how estimating equations can be used in that context. That work provides the platform on which we are attempting to build our bootstrap variance estimation procedure.

8.8 Proofs

Our asymptotic theory requires certain conditions.

Assumption 8.1 For $X = X_n$, the design matrix in (8.1) for $n$ observations, $\max_{1 \le i \le n} x_i^T (X^T X)^{-1} x_i \le c/n$ for some scalar $c > 0$ independent of $n$.

Assumption 8.2 For the error variances in model (8.1), $\max_{1 \le i < \infty} \sigma_i^2 < \infty$.

Assumption 8.3 The minimum and maximum eigenvalues of $n^{-1} X^T X$ are uniformly bounded away from 0 and $\infty$.

Assumption 8.4 The elements of $X$ are uniformly bounded.

We also require the following lemma.

Lemma 8.1 (Wu 1986) If

$$h_{ij} = x_i^T (X^T X)^{-1} x_j = 0 \text{ for any } i, j \text{ with } \sigma_i \ne \sigma_j, \qquad (8.24)$$

then $E(r_i^2) = (1 - h_i)\sigma_i^2$. More generally, Assumptions 8.1 and 8.2 imply

$$E(r_i^2) = (1 - h_i)\sigma_i^2 + O(n^{-1}) = \sigma_i^2 + O(n^{-1}).$$

Comparing (8.9) with (8.11), and (8.10) with (8.12), suggests $\mu_{3,*}$ and $\mu_{4,*}$ are robust for estimating the third and fourth moments, respectively, under substantial departures from the model assumptions. We prove this below under additional assumptions.
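Lemma 8.1 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original argument: it uses the exact identity $E(r_i^2) = \sum_j (\delta_{ij} - h_{ij})^2 \sigma_j^2$ for $r = (I - H)e$ with independent errors. The helper names `expected_sq_residuals` and `max_gap`, the two-group design and the variance values are all hypothetical. In the two-group design $h_{ij} = 0$ across groups, so condition (8.24) holds and the identity $E(r_i^2) = (1 - h_i)\sigma_i^2$ is exact; for a general bounded design the gap between $E(r_i^2)$ and $\sigma_i^2$ shrinks as $n$ grows.

```python
import numpy as np

def expected_sq_residuals(X, sigma2):
    """Exact E(r_i^2) for r = (I - H) e with independent errors, Var(e_j) = sigma2[j]."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix, h_ij = x_i^T (X^T X)^{-1} x_j
    M = np.eye(X.shape[0]) - H
    return (M ** 2) @ sigma2, np.diag(H)

# Two-group indicator design: h_ij = 0 whenever i and j lie in different groups,
# and the variances differ only between groups, so (8.24) holds.
X_block = np.zeros((10, 2))
X_block[:6, 0] = 1.0
X_block[6:, 1] = 1.0
sigma2_block = np.r_[np.full(6, 1.0), np.full(4, 9.0)]
er2_block, h_block = expected_sq_residuals(X_block, sigma2_block)

def max_gap(n, seed=0):
    """max_i |E(r_i^2) - sigma_i^2| for a bounded random design with unequal variances."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
    sigma2 = rng.uniform(0.5, 2.0, n)
    er2, _ = expected_sq_residuals(X, sigma2)
    return np.max(np.abs(er2 - sigma2))

gap_small, gap_large = max_gap(50), max_gap(800)
print(er2_block, (1 - h_block) * sigma2_block)
print(gap_small, gap_large)
```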
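Assumption 8.5 $\max_{1 \le i < \infty} |\mu_{3,i}| < \infty$.

Assumption 8.6 $\max_{1 \le i < \infty} |\mu_{4,i}| < \infty$.

To prove the main theorems, we also need the following lemma.

Lemma 8.2 If $h_{ij} = x_i^T (X^T X)^{-1} x_j = 0$ for any $i \ne j$, then

$$E(r_i^3) = (1 - h_i)^3 \mu_{3,i}, \qquad E(r_i^4) = (1 - h_i)^4 \mu_{4,i}, \qquad (8.25)$$

where $h_i = x_i^T (X^T X)^{-1} x_i$. More generally, Assumptions 8.1 and 8.5 imply

$$E(r_i^3) = (1 - h_i)^3 \mu_{3,i} + O(n^{-2}) = \mu_{3,i} + O(n^{-1}) \qquad (8.26)$$

and Assumptions 8.1, 8.2 and 8.6 imply

$$E(r_i^4) = (1 - h_i)^4 \mu_{4,i} + O(n^{-1}) = \mu_{4,i} + O(n^{-1}). \qquad (8.27)$$

Proof. From $r_i = y_i - x_i^T\hat\beta = e_i - x_i^T (X^T X)^{-1} X^T e$,

$$r_i = e_i - \sum_{j=1}^n h_{ij} e_j = (1 - h_i) e_i - \sum_{j \ne i} h_{ij} e_j. \qquad (8.28)$$

From (8.28),

$$E(r_i^3) = (1 - h_i)^3 E(e_i^3) - \sum_{j \ne i} h_{ij}^3 E(e_j^3) = (1 - h_i)^3 \mu_{3,i} - \sum_{j \ne i} h_{ij}^3 \mu_{3,j}. \qquad (8.29)$$

The first equality in (8.29) follows from the independence of $e_1, \ldots, e_n$. The second equality in (8.25) obtains in a similar way. Now more generally, when $h_{ij} \ne 0$ for some $i \ne j$,

$$\Big|\sum_{j \ne i} h_{ij}^3 \mu_{3,j}\Big| \le \sum_{j \ne i} |h_{ij}|^3 |\mu_{3,j}| \le n \max_{j}|h_{ij}|^3 \max_j\{|\mu_{3,j}|\}.$$

Assumptions 8.1 and 8.5 imply the right hand side of these last inequalities is of order $n^{-2}$, since $|h_{ij}| \le (h_i h_j)^{1/2} \le c/n$ by the Cauchy-Schwarz inequality. Therefore, the first equality in (8.26) follows from (8.29) and the second from (8.29) and $h_i \le c/n$. This completes the proof of (8.26).

Now from (8.28),

$$E(r_i^4) = (1 - h_i)^4 \mu_{4,i} + \sum_{j \ne i} h_{ij}^4 \mu_{4,j} + 6(1 - h_i)^2 \sigma_i^2 \sum_{j \ne i} h_{ij}^2 \sigma_j^2 + 6 \sum_{j < k;\; j, k \ne i} h_{ij}^2 h_{ik}^2 \sigma_j^2 \sigma_k^2 = (I) + (II) + (III) + (IV), \text{ say.}$$

Assumptions 8.1, 8.2 and the Cauchy-Schwarz inequality imply

$$(IV) \le 6 n^{-2} c^4 \max_i \sigma_i^4, \qquad (8.30)$$

$$(III) \le 12 (n^{-1} c^2 + n^{-3} c^4) \max_i \sigma_i^4. \qquad (8.31)$$

From Assumptions 8.1, 8.6 and the Cauchy-Schwarz inequality,

$$(II) \le n^{-3} c^4 \max_i \mu_{4,i}, \qquad (8.32)$$

$$(I) = (1 - 4h_i + 6h_i^2 - 4h_i^3 + h_i^4)\mu_{4,i} = \mu_{4,i} + O(n^{-1}). \qquad (8.33)$$

The first equality in (8.27) follows from (8.30), (8.31) and (8.32) and the second from (8.33). This completes the proof. □

Proof of Theorem 8.3 From Lemma 8.2,

$$E(\mu_{3,*}) = \sum_i w_i E(r_i^3) = \sum_i w_i (\mu_{3,i} + O(n^{-1})) = \sum_i w_i \mu_{3,i} + \sum_i w_i O(n^{-1}) = \mu_3 + O(n^{-3}).$$

The last equality follows from $\sum_i |w_i| \le n \max_i |w_i| = O(n^{-2})$. This proves part (i). Result (ii) is reached by similar reasoning. □

Let $X_n$ denote the $X$ matrix in (8.1). Key assumptions for the case of nonrandom $\{x_i\}$ are:

Assumption 8.7 The residuals $e_i$ are independent with distribution $F_i$ having mean 0 and variance $\sigma_i^2$, both $F_i$ and $\sigma_i$ being unknown, $i = 1, \ldots, n$.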
Assumption 8.8 There exists a matrix $V > 0$ such that $V_n \stackrel{\mathrm{def}}{=} n^{-1} X_n^T X_n \to V$.

Assumption 8.9 There exists a matrix $W > 0$ such that $W_n \stackrel{\mathrm{def}}{=} n^{-1} \sum_{i=1}^n \sigma_i^2 x_i x_i^T \to W$.

Assumption 8.10 $\|x_i\|^2$ and $E(e_i^4)$ are uniformly bounded, $i = 1, \ldots, n$.

For random $\{x_i\}$, the corresponding assumptions are:

Assumption 8.11 The vectors $\{(y_i, x_i^T)\}$ are independent and identically distributed with common unknown distribution function $F$ on $R^{k+1}$, and $E\|(y_i, x_i^T)\|^4 < \infty$, where $\|\cdot\|$ is Euclidean length.

Assumption 8.12 The $k \times k$ covariance matrix $Q = E(x_i x_i^T)$ is positive definite.

Assumption 8.13 $E(x_i e_i) = 0$, $i = 1, \ldots, n$, where $e_i = y_i - x_i^T\beta$.

Proof of Theorem 8.4 We will first show $\hat W_n \stackrel{\mathrm{def}}{=} n^{-1} \sum_{i=1}^n x_i x_i^T r_i^2 \to W$ almost surely as $n \to \infty$, where $r_i = y_i - x_i^T\hat\beta$ for all $i$. Then in bootstrap sampling we may condition on the set of unit probability where the just asserted convergence takes place and let $m \to \infty$.

To obtain this almost sure result, observe that

$$\hat W_n = n^{-1} \sum x_i x_i^T (y_i - x_i^T\beta)^2 + n^{-1} \sum x_i x_i^T [x_i^T(\beta - \hat\beta)]^2 + 2 n^{-1} \sum x_i x_i^T\, x_i^T(\beta - \hat\beta)(y_i - x_i^T\beta) = (I) + (II) + (III), \text{ say.}$$

Assumptions 8.8, 8.9 and Kolmogorov's strong law (cf. Chung, 1968, p. 119 and the Corollary given there for the case $p = 2$) entail $\hat\beta - \beta \to 0$ almost surely as $n \to \infty$. Using the trace norm for matrices and the usual norm for vectors,

$$\|(II)\| \le n^{-1} \sum \|x_i\|^4 \|\hat\beta - \beta\|^2 \le c^4 \|\hat\beta - \beta\|^2,$$

where $c > 0$ is a uniform upper bound for $\|x_i\|$ obtained from Assumption 8.10. So (II) $\to 0$ a.e. At the same time,

$$\|(III)\| = \Big\|2 n^{-1} \sum x_i x_i^T e_i\, x_i^T(\beta - \hat\beta)\Big\| \le 2 n^{-1} \sum |e_i| \|x_i\|^3 \|\hat\beta - \beta\| \le 2 c^3 n^{-1} \sum |e_i| \,\|\hat\beta - \beta\|$$

for the same constant $c > 0$ used above. So Assumption 8.10 and Kolmogorov's strong law cited above imply (III) $\to 0$ almost surely. Finally,

$$(I) = n^{-1} \sum x_i x_i^T e_i^2 = n^{-1} \sum x_i x_i^T (e_i^2 - \sigma_i^2) + n^{-1} \sum x_i x_i^T \sigma_i^2.$$

But $n^{-1} \sum x_i x_i^T (e_i^2 - \sigma_i^2) \to 0$ almost surely by Assumption 8.10 and Kolmogorov's strong law of large numbers. Then (I) $\to W$ almost surely by Assumption 8.9.

Now $n^{1/2}(\beta^* - \hat\beta) = V_n^{-1} n^{-1/2} \sum_{i=1}^n z_i^*$. Let $\xi_i = l^T z_i$, $i = 1, \ldots, n$, for any fixed $k$-dimensional vector $l$ with $\|l\| = 1$.
Observe that, conditional on the actual sample, the $\{l^T z_i^*, i = 1, \ldots, n\}$ are independent and identically distributed random variables with zero means and common standard deviation, $\sigma_n = (l^T \hat W_n l)^{1/2} = (n^{-1} \sum \xi_i^2)^{1/2}$. The Berry-Esseen Theorem implies

$$\sup_x \Big| P_*\Big(\sigma_n^{-1} n^{-1/2} \sum_{i=1}^n l^T z_i^* \le x\Big) - \Phi(x) \Big| \le K \rho_n \sigma_n^{-3} n^{-1/2}, \qquad (8.34)$$

where $K$ is an absolute constant and $\rho_n = n^{-1} \sum |\xi_i|^3$. But $l^T \hat W_n l \to l^T W l > 0$ a.e. Moreover, $n^{-3/2} \sum |\xi_i|^3 \to 0$ as $n \to \infty$ a.e. The last result obtains since, by our assumptions, the $\{E\xi_i^4\}$ are uniformly bounded; we may then invoke a general version of the strong law of large numbers (Chung, 1968, p. 117, Theorem 5.4.1 with $\varphi(\cdot) = |\cdot|^{4/3}$ and $a_n = n^{3/2}$) to obtain the result. Since inequality (8.34) holds for all $l$, the conclusion of the theorem now follows. □

Corollary 8.1 Under the assumptions of the last theorem, the conditional distribution of $\hat W_n^{-1/2} V_n\, n^{1/2}(\beta^* - \hat\beta)$ given $\{(y_i, x_i^T) : i = 1, \ldots, n\}$ converges to the standard normal on $R^k$ for almost all sequences $\{(y_i, x_i^T) : i = 1, \ldots, n\}$ as $n \to \infty$.

Proof: This result is an immediate consequence of the result of the last theorem. □

Proof of Theorem 8.5 This proof is similar to that of Theorem 8.4 and the details will be omitted. That $\hat\beta - \beta \to 0$ almost surely follows from Assumptions 8.11 and 8.13 and the strong law of large numbers. Assumptions 8.11 and 8.12 imply (I) $\to 0$, (II) $\to 0$, and (III) $\to M$ almost surely. These results and Assumption 8.12 imply the conclusion of the Theorem. □

Bibliography

[1] Akaike, H. (1973). Information theory and an extension of entropy maximization principle. 2nd International Symposium on Information Theory, eds. B. N. Petrov and F. Csaki, Budapest: Akademiai Kiado, 276-281.
[2] Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (ed.), Applications of Statistics. Amsterdam: North-Holland, 27-41.
[3] Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math., 30A, 9-14.
[4] Akaike, H. (1982). On the fallacy of the likelihood principle. Statistics and Probability Letters, 1, 75-78.
[5] Akaike, H. (1983). Information measures and model selection. Proceedings of the 44th Session of ISI, 1, 277-291.
[6] Akaike, H. (1985). Prediction and entropy. A Celebration of Statistics, The ISI Centenary Volume, Berlin: Springer-Verlag.
[7] Bahadur, R.R. (1966). A note on quantiles in large samples. Ann. Math. Statist., 37, 577-580.
[8] Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
[9] Basu, D. (1975). Statistical information and likelihood (with discussions). Sankhya, Ser. A, 37, 1-71.
[10] Bickel, P.J. (1981). Minimax estimation of the mean of a normal distribution when the parameter space is restricted. Ann. Statist., 9, 1301-1309.
[11] Bickel, P.J. and Freedman, D.A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist., 9, 1196-1217.
[12] Binder, D.A. (1991). Use of estimating functions for interval estimation from complex surveys (with discussion). Proc. Sect. Res. Methods, Amer. Statist. Assoc., 34-44.
[13] Birnbaum, A. (1962). On the foundation of statistical inference (with discussions). JASA, 57, 269-306.
[14] Booth, J., Hall, P. and Wood, A. (1992). Bootstrap estimation of conditional distributions. Ann. Statist., 20, 1594-1610.
[15] Buehler, R.J. (1971). Measuring information and uncertainty. Foundations of Statistical Inference, eds. Godambe and Sprott; Holt, Rinehart and Winston.
[16] Casella, G. and Strawderman, W.E. (1981). Estimating a bounded normal mean. Ann. Statist., 9, 870-878.
[17] Chow, Y.S. and Lai, T.L. (1973). Limiting behavior of weighted sums of independent random variables. Ann. of Prob., 1, 810-824.
[18] Chu, C.K. and Marron, J.S. (1992). Choosing a kernel regression estimator. Statistical Science, 6, 404-436.
[19] Chung, K.L. (1968). A Course in Probability Theory. New York: Harcourt, Brace & World, Inc.
[20] Cox, D.R. (1961). Tests for separate families of hypotheses. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (University of California Press, Berkeley), 105-123.
[21] Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall, London.
[22] Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
[23] Edwards, A.W.F. (1972). Likelihood. C.U.P., Cambridge.
[24] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist., 7, 1-26.
[25] Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. SIAM, Philadelphia.
[26] Efron, B. (1987). Better bootstrap confidence intervals and bootstrap approximations. JASA, 82, 171-185.
[27] Efron, B. (1990). More efficient bootstrap computations. JASA, 85, 79-89.
[28] Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall, New York.
[29] Eubank, R.L. (1988). Spline Smoothing and Nonparametric Regression. New York: Marcel Dekker.
[30] Fan, J. (1992). Design-adaptive nonparametric regression. JASA, 87, 998-1004.
[31] Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist., 21, 196-216.
[32] Fan, J., Heckman, N.E. and Wand, M.P. (1993). Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. JASA, in press.
[33] Fan, J. and Marron, J.S. (1992). Best possible constant for bandwidth selection. Ann. Statist., in press.
[34] Fisher, R.A. (1925). Theory of statistical estimation. Proc. Cambridge Phil. Soc., 22, 700-725.
[35] Fisher, R.A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh.
[36] Fraser, D.A.S. (1965). On information in statistics. Ann. Math. Statist., 36, 890-896.
[37] Fraser, D.A.S. (1968). The Structure of Inference. New York: John Wiley.
[38] Freedman, D.A. (1981). Bootstrapping regression models. Ann. Statist., 9, 1218-1228.
[39] Freedman, D.A. and Peters, S.C. (1984). Bootstrapping a regression equation: some empirical results. JASA, 79, 97-106.
[40] Gasser, T. and Muller, H.G. (1979). Kernel estimation of regression function. In Smoothing Techniques of Curve Estimation. Lecture Notes in Mathematics 757, eds. T. Gasser and M. Rosenblatt. Heidelberg: Springer-Verlag, 23-68.
[41] Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist., 31, 1208-1212.
[42] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-284.
[43] Godambe, V.P. and Heyde, C.C. (1987). Quasi-likelihood and optimal estimation. Int. Statist. Rev., 55, 231-244.
[44] Godambe, V.P. and Thompson, M.E. (1986). Parameters of superpopulation and survey population: their relationships and estimation. Int. Statist. Rev., 54, 127-138.
[45] Good, I.J. (1971). The probability explication of information, evidence, surprise, causality, explanation, and utility. Foundations of Statistical Inference, eds. Godambe and Sprott; Holt, Rinehart and Winston, 108-141.
[46] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York, Berlin.
[47] Hardle, W. (1991). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
[48] Hardle, W. and Marron, J.S. (1990). Bootstrap simultaneous error bars for nonparametric regression. Ann. Statist.
[49] Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London, New York.
[50] Hinkley, D.V. (1988). Bootstrap methods. J. Roy. Statist. Soc. Ser. B, 50, 321-337.
[51] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. JASA, 58, 13-30.
[52] Hu, F. (1994). The consistency of the maximum relevance weighted likelihood estimator. Unpublished Manuscript.
[53] Hu, F. and Zidek, J.V. (1993a). An approach to bootstrapping through estimating equations. Tech. Report of University of British Columbia, No. 126.
[54] Hu, F. and Zidek, J.V. (1993b). A relevance weighted nonparametric quantile estimator. Tech. Report of University of British Columbia, No. 134.
[55] Hu, F. and Zidek, J.V. (1993c). Relevance weighted likelihood estimations. Unpublished Manuscript.
[56] Huber, P.J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (University of California Press, Berkeley), 221-233.
[57] Huber, P.J. (1981). Robust Statistics. Wiley, New York.
[58] Jennen-Steinmetz, C. and Gasser, T. (1988). A unifying approach to nonparametric regression estimation. JASA, 83, 1084-1089.
[59] Kullback, S. (1954). Certain inequalities in information theory and Cramer-Rao inequality. Ann. Math. Statist., 25, 745-751.
[60] Kullback, S. (1959). Information Theory and Statistics. New York: Wiley.
[61] Kullback, S. and Leibler, R.A. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79-86.
[62] Lahiri, S.N. (1992). Bootstrapping M-estimators of a multiple linear regression parameter. Ann. Statist., 20, 1548-1570.
[63] Lee, K.W. (1990). Bootstrapping logistic regression model with regressors. Commun. Statist. Theory Meth., 19, 2527-2539.
[64] Lehmann, E.L. (1983). Theory of Point Estimation. New York: Wiley.
[65] Lehmann, E.L. (1986). Testing Statistical Hypotheses. New York: Wiley.
[66] Lindley, D.V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist., 27, 980-1005.
[67] Lindley, D.V. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 2. Inference. Cambridge: Cambridge University Press.
[68] Liu, R.Y. (1988). Bootstrap procedures under some non-iid models. Ann. Statist., 16, 1696-1708.
[69] Liu, R.Y. and Singh, K. (1992). Efficiency and robustness in resampling. Ann. Statist., 20, 370-384.
[70] Mack, Y.P. and Muller, H.G. (1989). Convolution-type estimators for nonparametric regression estimation. Statistics and Probability Letters, 7, 229-239.
[71] Marcus, M.B. and Zinn, J. (1984). The bounded law of the iterated logarithm for the weighted empirical distribution process in the non-iid case. Ann. Prob., 12, 335-360.
[72] Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York.
[73] McCullagh, P.J. (1983). Quasi-likelihood functions. Ann. Statist., 11, 59-67.
[74] McCullagh, P.J. and Nelder, J.A. (1989). Generalized Linear Models, 2nd edition. Chapman and Hall, London.
[75] Moulton, L.H. and Zeger, S.L. (1991). Bootstrapping generalized linear models. Computational Statistics and Data Analysis, 11, 53-63.
[76] Muller, H.G. (1988). Nonparametric Analysis of Longitudinal Data. Berlin: Springer-Verlag.
[77] Muller, H.G. and Stadtmuller, U. (1987). Estimation of heteroscedasticity in regression analysis. Ann. Statist., 15, 610-635.
[78] Navidi, W. (1989). Edgeworth expansions for bootstrapping regression models. Ann. Statist., 17, 1472-1478.
[79] Owen, A.B. (1991). Empirical likelihood for linear models. Ann. Statist., 19, 1725-1747.
[80] Park, B.U. and Marron, J.S. (1990). Comparison of data-driven bandwidth selectors. JASA, 85, 66-72.
[81] Royall, R.M. (1986). Model robust confidence intervals using maximum likelihood estimators. Int. Statist. Rev., 54, 221-226.
[82] Rubin, D.B. (1981). The Bayesian bootstrap. Ann. Statist., 9, 130-134.
[83] Sasieni, P. (1992). Maximum weighted partial likelihood estimators of Cox model. JASA, 88, 144-152.
[84] Sarndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
[85] Savage, L.J. (1954). The Foundations of Statistics. New York: Wiley.
[86] Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
[87] Serfling, R.J. (1984). Generalized L-, M- and R-statistics. Ann. Statist., 12, 76-86.
[88] Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379-423; 623-656.
[89] Shorack, G.R. (1978). Convergence of reduced empirical and quantile processes with applications of order statistics in the non-iid case. Ann. Statist., 1, 146-152.
[90] Shorack, G.R. and Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
[91] Singh, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Ann. Statist., 9, 1187-1195.
[92] Staniswalis, J.G. (1989). The kernel estimate of a regression function in likelihood-based models. JASA, 84, 276-283.
[93] Stout, W.F. (1968). Some results on complete and almost sure convergence of linear combinations of independent random variables and martingale differences. Ann. Math. Statist., 39, 1549-1562.
[94] Stout, W.F. (1974). Almost Sure Convergence. Academic Press, New York.
[95] Tibshirani, R. and Hastie, T. (1987). Local likelihood estimation. JASA, 82, 559-567.
[96] Wahba, G. and Wold, S. (1975). A completely automatic French curve: fitting spline functions by cross-validation. Communications in Statistics, Part A, Theory and Methods, 4, 1-7.
[97] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist., 20, 595-601.
[98] Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439-447.
[99] Weerahandi, S. and Zidek, J.V. (1988). Bayesian nonparametric smoothers for regular processes. The Canadian Journal of Statistics, 16, 61-74.
[100] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.
[101] Wiener, N. (1948). Cybernetics. New York: Wiley.
[102] Wu, C.F.J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Ann. Statist., 14, 1261-1295.
Title | Relevance weighted smoothing and a new bootstrap method |
Creator |
Hu, Feifang |
Date | 1994 |
Date Created | 2009-04-14 |
Date Issued | 2009-04-14 |
Description | This thesis addresses two quite different topics. First, we consider several relevance weighted smoothing methods for relevant sample information. This topic can be viewed as a generalization of nonparametric smoothing. Second, we propose a new bootstrap method which is based on estimating functions. A statistical problem usually begins with an unknown object of inferential interest. About this unknown object, we may have three types of information (classified in this thesis): direct information, exact sample information and relevant sample information. Almost all classical statistical theory is about direct information and exact sample information. In many cases, relevant sample information is available and useful. But there is no systematic theory about relevant sample information. The problem of this thesis is to extract "the relevant information" contained in the relevant samples. Three general methods have been developed under three different lines of approach (parametric, nonparametric and semiparametric approach). In the parametric approach, we propose the idea of relevance weighted likelihood (REWL). For the nonparametric approach, we develop our theory based on the relevance weighted empirical distribution function (REWED). In the semiparametric approach, the relevance weighted estimating functions are used to extract "relevant information" from relevant samples. From asymptotic results, we find that these proposed methods have many desirable properties. We apply these proposed methods as well as some adjusted methods to generalized smoothing problems. Theoretical results as well as simulation results show our methods to be promising. We also present a new bootstrap method. It has computational and theoretical advantages over conventional bootstrap methods when the data are obtained from non-identically distributed observables. It differs from conventional methods in that it resamples components of an estimating function rather than the data themselves. |
Extent | 3279081 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | Eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-04-14 |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0088029 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 1994-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/7074 |
Aggregated Source Repository | DSpace |