Maximum Weighted Likelihood Estimation
Wang, Steven Xiaogang (2001)

Maximum Weighted Likelihood Estimation

by

Steven Xiaogang Wang

B.Sc., Beijing Polytechnic University, P.R. China, 1991.
M.S., University of California at Riverside, U.S.A., 1996.

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES, Department of Statistics.

We accept this thesis as conforming to the required standard.

THE UNIVERSITY OF BRITISH COLUMBIA

June 21, 2001

(c) Steven Xiaogang Wang, 2001

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

A maximum weighted likelihood method is proposed for combining all the relevant data from different sources to improve the quality of statistical inference, especially when the sample sizes are moderate or small. The linear weighted likelihood estimator (WLE) is studied in depth. The weak consistency, strong consistency and asymptotic normality of the WLE are proved. The asymptotic properties of the WLE using adaptive weights are also established. A procedure for choosing the weights adaptively by cross-validation is proposed, and the analytical forms of the adaptive weights are derived for the case in which the WLE is a linear combination of the MLE's. The weak consistency and asymptotic normality of the WLE with weights chosen by the cross-validation criterion are established. A connection between the weighted likelihood and information theory is identified: a derivation of the weighted likelihood from the maximum entropy principle is presented. Approximations to the distributions of the WLE for small sample sizes are derived by the saddlepoint approximation. The results of an application to disease mapping are reported in the last chapter of this thesis.
Contents

Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
  1.1 Introduction
  1.2 Local Likelihood and Related Methods
  1.3 Relevance Weighted Likelihood Method
  1.4 Weighted Likelihood Method
  1.5 A Simple Example
  1.6 The Scope of the Thesis

2 Motivating Example: Normal Populations
  2.1 A Motivating Example
  2.2 Weighted Likelihood Estimation
  2.3 A Criterion for Assessing Relevance
  2.4 The Optimum WLE
  2.5 Results for Bivariate Normal Populations

3 Maximum Weighted Likelihood Estimation
  3.1 Weighted Likelihood Estimation
  3.2 Results for One-Parameter Exponential Families
  3.3 WLE on Restricted Parameter Spaces
  3.4 Limits of Optimum Weights

4 Asymptotic Properties of the WLE
  4.1 Asymptotic Results for the WLE
    4.1.1 Weak Consistency
    4.1.2 Asymptotic Normality
    4.1.3 Strong Consistency
  4.2 Asymptotic Properties of Adaptive Weights
    4.2.1 Weak Consistency and Asymptotic Normality
    4.2.2 Strong Consistency by Using Adaptive Weights
  4.3 Examples
    4.3.1 Estimating a Univariate Normal Mean
    4.3.2 Restricted Normal Means
    4.3.3 Multivariate Normal Means
  4.4 Concluding Remarks

5 Choosing Weights by Cross-Validation
  5.1 Introduction
  5.2 Linear WLE for Equal Sample Sizes
    5.2.1 Two Population Case
    5.2.2 Alternative Matrix Representation of A and b
    5.2.3 Optimum Weights by Cross-Validation
  5.3 Linear WLE for Unequal Sample Sizes
    5.3.1 Two Population Case
    5.3.2 Optimum Weights by Cross-Validation
  5.4 Asymptotic Properties of the Weights
  5.5 Simulation Studies

6 Derivations of the Weighted Likelihood Functions
  6.1 Introduction
  6.2 Existence of the Optimal Density
  6.3 Solution to the Isoperimetric Problem
  6.4 Derivation of the WL Functions

7 Saddlepoint Approximation of the WLE
  7.1 Introduction
  7.2 Review of the Saddlepoint Approximation
  7.3 Results for Exponential Family
  7.4 Approximation for General WL Estimation

8 Application to Disease Mapping
  8.1 Introduction
  8.2 Weighted Likelihood Estimation
  8.3 Results of the Analysis
  8.4 Discussion

Bibliography

List of Figures

2.1 A special case: the solid line is max_{|β−α|≤C} MSE(WLE) and the broken line is MSE(MLE), plotted against λ_1; the parameters are set to σ²/S_t² = 1, ρ = 0.5 and C = 1.
8.1 Daily hospital admissions for CSD #380 in the summer of 1983.
8.2 Hospital admissions for CSD #380, 362, 366 and 367 in 1983.
List of Tables

5.1 MSE of the MLE and the WLE, with standard deviations of the squared errors, for samples of equal size from N(0,1) and N(0.3,1).
5.2 Optimum weights and their standard deviations for samples of equal size from N(0,1) and N(0.3,1).
5.3 MSE of the MLE and the WLE, with their standard deviations, for samples of equal size from P(3) and P(3.6).
5.4 Optimum weights and their standard deviations for samples of equal size from P(3) and P(3.6).
7.1 Saddlepoint approximations of Γ(n+1) = n!.
7.2 Saddlepoint approximation of the spread density for m = 10.
8.1 Estimated MSE for the MLE and the WLE.
8.2 Correlation matrix and the weight function for 1984.
8.3 MSE of the MLE and the WLE for CSD 380.

Acknowledgements

I am most grateful to my supervisor, Dr. James V. Zidek, whose advice and support have been a source of invaluable guidance, constant encouragement and great inspiration throughout my Ph.D. study at the University of British Columbia. I am also very grateful to my co-supervisor, Dr. Constance van Eeden, who provided invaluable advice and rigorous training in technical writing. I wish to thank my thesis committee members Dr. John Petkau and Dr. Paul Gustafson, whose insights and help are greatly appreciated. I also wish to thank Dr. Ruben Zamar for his suggestions. Finally, I would like to thank my wife, Ying Luo, for her understanding and support. My other family members all have their share in this thesis.

Chapter 1

Introduction

1.1 Introduction

Recently there has been increasing interest in combining information from diverse sources. Effective methods for combining information are needed in a variety of fields, including engineering, environmental sciences, geosciences and medicine. Cox (1981) gives an overview of some methods for the combination of data, such as weighted means and pooling in the presence of over-dispersion. An excellent survey of more current techniques for combining information, with some concrete examples, can be found in the report Combining Information written by the Committee on Applied and Theoretical Statistics, U.S. National Research Council (1992). This thesis concentrates on the combination of information from separate sets of data.

When two or more data sets, derived from similar variables measured on different samples of subjects, are available to answer a given question, a judgment must be made as to whether the samples and variables are sufficiently similar that the data sets may be directly merged, or whether some other method of combining information that only partially merges them is more appropriate. For instance, the data sets may be time series of ozone levels for each of several years, or hospital admissions for different geographical regions that are close to each other. The question is how to combine information from data sets collected under different conditions, and with differing degrees of precision and bias, to yield more reliable conclusions than those available from a single information source.

1.2 Local Likelihood and Related Methods

Local likelihood, introduced by Tibshirani and Hastie (1987), extends the idea of local fitting to likelihood-based regression models. Local regression may be viewed as a special case of the local likelihood procedure.
Staniswalis (1989) defines her version of local likelihood in the context of non-parametric regression as follows:

    W(\theta) = \sum_{i=1}^{n} w\left(\frac{x - x_i}{b}\right) \log f(y_i; \theta),

where the x_i are fixed, b is a bandwidth and θ is a single unknown parameter. Recently, further versions of local likelihood for estimation have been proposed and discussed. The general form of local likelihood was presented by Eguchi and Copas (1998). The basic idea is to infuse local adaptation into the likelihood by considering

    L(t; x_1, x_2, ..., x_n) = \sum_{i=1}^{n} K\left(\frac{x_i - t}{h}\right) \log f(x_i; \theta),

where K is a kernel function with center t and bandwidth h. The local maximum likelihood estimate θ̂_t of a parameter in a statistical model f(x; θ) is defined by maximizing this weighted version of the likelihood function, which gives more weight to sample points near t. This does not give an unbiased estimating equation as it stands, so the local likelihood approach introduces a correction factor to ensure consistent estimation. The resulting local maximum likelihood estimator (LMLE), say θ̂_{t,h}, depends on the controllable variables t and h through the kernel function K, and, intuitively, it is natural to think of θ̂_{t,h} as gaining more information about the data around t in the sample space. Detailed discussions of the local likelihood method can be found in Eguchi and Copas (1998).

The LMLE may be related to the local M-estimator proposed by Hardle and Gasser (1984), defined as the solution of

    \sum_{i=1}^{n} K\left(\frac{t - t_i}{h}\right) \psi(Y_i - \theta) = 0,

where the Y_i are observations obtained at the points t_i.

The term weighted likelihood has also been used in the statistics literature outside the local likelihood approach. Dickey and Lientz (1970) propose the use of what they call the weighted likelihood ratio for tests of a simple hypothesis. Dickey (1971) proposes the weighted utility-likelihood along the same line of argument. Assume a statistical model in which the observed data vector D ∈ E^n occurs with probability mass or density function φ(D|θ), depending continuously on an unknown parameter vector θ ∈ E^r. Suppose that one suspects the unknown parameter θ of belonging to a given Borel set H ⊂ E^r. Let H̄ denote a Borel measurable alternative such that H ∩ H̄ = ∅ and P(H) + P(H̄) = 1. The key feature of their proposal is the weighted likelihood ratio

    \frac{\Phi(D|H)}{\Phi(D|\bar H)},

where Φ(D|H) = ∫ φ(D|θ) dP(θ|H) and Φ(D|H̄) = ∫ φ(D|θ) dP(θ|H̄). They called this a weighted likelihood ratio because the quantity Φ(D|H) summarizes the evidence in D for H. A modern name for their weighted likelihood ratio might be the odds ratio.

Markatou, Basu and Lindsay (1997, 1998) propose a method based on the weighted likelihood equation in the context of robust estimation. Their approach can be described as follows. Suppose that {X_1, X_2, ..., X_n} is a random sample with distribution f(x; θ). The weighted likelihood equation is defined as

    \sum_{i=1}^{n} w(x_i, \hat F)\, \frac{\partial}{\partial \theta} \log f(x_i; \theta) = 0,

where F̂ is the empirical cumulative distribution function. The weight function w(x_i, F̂) is selected so that it has a value close to 1 when the empirical distribution function shows no evidence of model violation at x_i, and a value close to 0, or exactly 0, when the empirical cumulative distribution function indicates lack of fit at or near x_i. Thus, the role of the weight function is to down-weight points in the sample that are inconsistent with the assumed model.
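To make the kernel-weighted constructions above concrete, here is a minimal numerical sketch of an Eguchi and Copas style local likelihood for a normal model. The data, kernel and bandwidth are illustrative assumptions, not values taken from the thesis; for f(x; θ) = N(θ, 1) the local maximizer reduces to a kernel-weighted mean.

```python
import numpy as np

# Local likelihood sketch: maximize L(t) = sum_i K((x_i - t)/h) log f(x_i; theta)
# in theta, with f = N(theta, 1) and a Gaussian kernel K. For this f the
# maximizer is the kernel-weighted mean of the x_i near the center t.
rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 200)          # illustrative sample

def local_mle(t, h):
    k = np.exp(-0.5 * ((x - t) / h) ** 2)   # kernel weights centered at t
    # argmax_theta sum_i k_i * (-(x_i - theta)^2 / 2) = weighted mean
    return np.sum(k * x) / np.sum(k)

print([round(local_mle(t, h=0.5), 3) for t in (-1.0, 0.0, 1.0)])
```

Near each center t the estimate is dominated by the observations with |x_i − t| small, which is exactly the localization effect described above.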
Hunsberger (1994) also uses the term "weighted likelihood", to arrive at kernel estimators of the parametric and non-parametric components of semi-parametric regression models. Consider the model in which X_i | (Y_i, T_i = t_i) has distribution f(x_i; λ_i), where λ_i = y_i β_0 + g(t_i). Furthermore, let f, the conditional density of X | (Y, T), be arbitrary but known. Then y β_0 is the parametric portion, β_0 being the unknown parameter that relates the covariate y to the response, and g is the non-parametric portion of the model, the only assumption on g being that it is a smooth function of t. Assume y_i = r(t_i) + η_i, where r is a smooth function and the η_i are independent random error terms with E(η_i) = 0 and E(η_i²) = σ². Then λ_i can be rewritten using the model for the y's as λ_i = η_i β_0 + h(t_i), where h(t_i) = r(t_i) β_0 + g(t_i) is the portion that depends on t. The main purpose is to estimate β_0 and θ_i = h(t_i), i = 1, 2, ..., n, in the semi-parametric model by maximizing a weighted likelihood function

    W(\beta, \theta) = \sum_{i=1}^{n} \sum_{j=1}^{n} w\left(\frac{t_i - t_j}{b}\right) \log f(X_j; \beta, \theta_i) / (n^2 b)

with respect to β and θ, where θ = (θ_1, θ_2, ..., θ_n). In this weighted likelihood function, w is a kernel that assigns zero weight to the observations X_j that correspond to t_j outside a neighborhood of t_i.
Consequently the  thought to be of value in our inferential analysis  though the Xi are independently drawn from a population different from our study population. The relevance-weighted likelihood is defined as n  REWL(0)  =Hf(xi;9)  Xi  i=i  where A;, i = 1,2,n,  are weight functions.  R E W L plays the same role in the "relevant-sample" case as the likelihood in the conventional ("exact-information") case. It can be seen that this method generalizes local likelihood. The asymptotic properties of the Relevance Maximum Weighted Likelihood Estimator can be found in Hu (1997). Hu, Rosenberger and Zidek (2000) extend the results for independent sequences in Hu (1997) to dependent sequences.  1.4  Weighted Likelihood Method  In this thesis, we are interested in a context which is different from that of all the methods described above. Suppose that we are interested in a single population. A parameter or parameter vector, t?i, is of inferential interest. Information from other related populations, Population 2, Population 3 and so on, are available together with the direct information from Population 1. Let m denote the total number of popula-  1.4. W E I G H T E D L I K E L I H O O D M E T H O D  7  tions whose distributions are thought to "resemble" Population 1. Let n n , ...,n l5  2  m  denote the number of observations obtained from each population mentioned above. Let X i , X  2  , X  be random variables or vectors with marginal probability density  m  functions / i ( . ; 6>i), / ( . ; 6 > ) , / ( . ; 6> ), where X i = (X ,X 2,...,X ) . t  2  2  m  m  il  i  ini  The joint  distribution of ( X i , X , . . . , X ) is not assumed. We are interested in the probability 2  m  density function / i ( . ; 9i) : 9\ € O of a study variable or vector of variables X , 9\ being an unknown parameter or vector of parameters. A t least in some qualitative sense, the / ( . ; 62), 2  /m(-; 9m) are thought to be "similar to" /i(.; 9i).  For fixed X = x, the weighted likelihood (WL) is defined as:  m W L ( ^ ) = n/ (x ;c9 ) S A  1  1  i  1  i=i  where A = ( A A , A ) ' is the "weight vector" which must be specified by the 1;  2  m  analyst. We say that 9~x is a maximum weighted likelihood estimator (WLE) for 9\ if h = arg sup WL(0i). The uniqueness of the maximizer is not assumed. In this thesis, we assume that Xn,X{2, ...,X , i=l,2,...,m, are independent and irii  identically distributed random variables. The W L then becomes m  n;  wL(0o=nn^^) Ai  i=i 3=1  Hu (1997) proposes a paradigm which abstracts that of non-parametric regression and function estimation. There information about 9\ builds up because the number of populations grows with increasingly many in close proximity to that of B\. This is the paradigm commonly invoked in the context of non-parametric regression but it is not always the most natural one. In contrast we postulate a fixed number of  1.5. A SIMPLE E X A M P L E  8  populations with an increasingly large number of samples from each population. Our paradigm may be more natural in situations like that in which James-Stein estimator obtained, where a specific set of populations is under study.  1.5  A Simple Example  The advantages of using WLE might be illustrated by the following example. A coin is tossed twice. Let 6\ = P(H) for this coin. Let 6\ denote the MLE of 6\. It then follows that 6 = Si/2 where Si =Xi + X and the X's are independent X  2  Bernoulli random variables. 
1.5 A Simple Example

The advantages of using the WLE may be illustrated by the following example. A coin is tossed twice. Let θ_1 = P(H) for this coin, and let θ̂_1 denote the MLE of θ_1. It follows that θ̂_1 = S_1/2, where S_1 = X_1 + X_2 and the X's are independent Bernoulli random variables.

If, unknown to the statistician, θ_1 = 1/2, then

    P(θ̂_1 = 1/2) = P({X_1 = 0, X_2 = 1} ∪ {X_1 = 1, X_2 = 0}) = 1/2,

so the probability that the MLE concludes the fair coin has either no heads or no tails is 50%. The probability of making a nonsensical decision about the fair coin is thus extremely high, owing to the small sample size.

Suppose that another coin, not necessarily a fair one, is also tossed twice. The question is whether we can use the result of the second experiment to derive a better estimate of θ_1. The answer is affirmative if we combine the information obtained from the two experiments.

Suppose, for definiteness, that θ_2 = P(H) = 0.6 for the second coin. Let θ̂_2 denote the MLE of θ_2, so θ̂_2 = S_2/2 = (Y_1 + Y_2)/2, where Y_1 and Y_2 are independent Bernoulli random variables. It then follows that

    P(θ̂_2 = 0) = P(Y_1 = Y_2 = 0) = 0.16;
    P(θ̂_2 = 1/2) = P({Y_1 = 0, Y_2 = 1} ∪ {Y_1 = 1, Y_2 = 0}) = 0.48;
    P(θ̂_2 = 1) = P(Y_1 = Y_2 = 1) = 0.36.

Consider a new estimate which is a linear combination of θ̂_1 and θ̂_2:

    θ̃_1 = λ_1 θ̂_1 + λ_2 θ̂_2,

where λ_1 and λ_2 are relevance weights. The optimum weights will be discussed in later chapters. Pretending we do not know how to choose the best weights, we might simply set each weight to 1/2. It follows that

    P(θ̃_1 = 0) = P(S_1 = 0, S_2 = 0) = 0.04;
    P(θ̃_1 = 1/4) = P({S_1 = 1, S_2 = 0} ∪ {S_1 = 0, S_2 = 1}) = 0.20;
    P(θ̃_1 = 1/2) = P({S_1 = 1, S_2 = 1} ∪ {S_1 = 2, S_2 = 0} ∪ {S_1 = 0, S_2 = 2}) = 0.37;
    P(θ̃_1 = 3/4) = P({S_1 = 1, S_2 = 2} ∪ {S_1 = 2, S_2 = 1}) = 0.30;
    P(θ̃_1 = 1) = P(S_1 = 2, S_2 = 2) = 0.09.

Note that the probability of making a nonsensical decision has been greatly reduced, to 0.13 (= 0.04 + 0.09), through the introduction of the information obtained from the second coin. Furthermore, it can be verified that

    MSE(θ̂_1) = 1/8;  MSE(θ̃_1) = 1.02 × 1/16 ≈ 1/16,

so that MSE(θ̃_1)/MSE(θ̂_1) ≈ 0.5: the MSE of the WLE is only about 50% of that of the MLE.

Owing to the small sample size of the first experiment, for arbitrary θ_1 the probability of making a nonsensical decision is θ_1² + (1 − θ_1)², with a minimum value of 50%. By incorporating the relevant information in a very simple way, that probability is greatly reduced. In particular, if the second coin is indeed a fair coin and θ_1 is arbitrary, the probability of making a nonsensical decision is reduced to (1/4)[θ_1² + (1 − θ_1)²], one quarter of its previous value (12.5% when θ_1 = 1/2).

We now consider the reduction in MSE achieved by using the simple average of the two MLE's in this case. Let θ̄_1 = (1/2)θ̂_1 + (1/2)θ̂_2. For arbitrary θ_1 and θ_2, with a sample size of 2 from each coin, we have

    MSE(θ̂_1) = Var(θ̂_1) = (1/2)θ_1(1 − θ_1),
    MSE(θ̄_1) = Var(θ̄_1) + Bias²(θ̄_1) = (1/8)θ_1(1 − θ_1) + (1/8)θ_2(1 − θ_2) + (1/4)(θ_1 − θ_2)².

It follows that, for θ_1 ≠ 0 or 1,

    MSE(θ̄_1)/MSE(θ̂_1) = [ (1/8)θ_1(1 − θ_1) + (1/8)θ_2(1 − θ_2) + (1/4)(θ_1 − θ_2)² ] / [ (1/2)θ_1(1 − θ_1) ].

Assume that θ_1, θ_2 ∈ [0.35, 0.65]. It then follows that

    θ_1(1 − θ_1) ≥ 0.35 × 0.65 = 0.2275;  (θ_1 − θ_2)² ≤ 0.09;  θ_2(1 − θ_2) ≤ 0.25.

We then have

    MSE(θ̄_1)/MSE(θ̂_1) ≤ 0.72.

Therefore, for a wide range of values of θ_1 and θ_2, the simple average of θ̂_1 and θ̂_2 produces at least a 28% reduction in MSE compared with the traditional MLE θ̂_1. The maximum reduction is achieved if the two coins happen to be of the same type, i.e. θ_1 = θ_2.
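The probabilities and MSEs above are easy to verify by simulation. The sketch below (replication count illustrative) reproduces MSE(θ̂_1) = 1/8, MSE(θ̃_1) ≈ 1.02/16, and the two probabilities of a nonsensical estimate.

```python
import numpy as np

# Monte Carlo check of the two-coin example: theta1 = 0.5, theta2 = 0.6,
# two tosses per coin, and the equal-weight combination (theta1_hat + theta2_hat)/2.
rng = np.random.default_rng(1)
theta1, theta2, reps = 0.5, 0.6, 200_000
s1 = rng.binomial(2, theta1, reps)      # heads for coin 1
s2 = rng.binomial(2, theta2, reps)      # heads for coin 2
mle = s1 / 2
wle = 0.5 * (s1 / 2) + 0.5 * (s2 / 2)

print("MSE(MLE) ~", np.mean((mle - theta1) ** 2))             # ~ 1/8
print("MSE(WLE) ~", np.mean((wle - theta1) ** 2))             # ~ 1.02/16
print("P(MLE in {0,1}) ~", np.mean((mle == 0) | (mle == 1)))  # ~ 0.50
print("P(WLE in {0,1}) ~", np.mean((wle == 0) | (wle == 1)))  # ~ 0.13
```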
We remark that the upper bound on the bias in this case is 0.15. Notice that the weights here were simply set to 0.5; they are not the optimum weights, which minimize the MSE of a weighted average of θ̂_1 and θ̂_2. The optimum weights will be studied in later chapters.

1.6 The Scope of the Thesis

In Chapter 2 we show that certain linear combinations of the MLE's derived from two possibly different normal populations achieve smaller MSE than the traditional one-sample MLE for finite sample sizes. A criterion for assessing the relevance of two related samples is proposed to control the magnitude of the possible bias introduced by combining the data. Results for two normal populations are presented in that chapter.

The weighted likelihood function, which uses all the relevant information, is formally proposed in Chapter 3. Our weighted likelihood generalizes the REWL of Hu (1994). Results for exponential families are presented, and the advantages of using a linear WLE on restricted parameter spaces are demonstrated. A set of optimum weights for the linear WLE is proposed, and the non-stochastic limits of these optimum weights, as the sample size of the first population goes to infinity, are found.

Chapter 4 is concerned with the asymptotic properties of the WLE. The weak consistency, strong consistency and asymptotic normality of the WLE are proved when the parameter space is a subset of R^p, p ≥ 1. The asymptotic results proved here differ from those of Hu (1997) because a different asymptotic paradigm is used. Hu's paradigm abstracts that of non-parametric regression and function estimation: there, information about θ_1 builds up because the number of populations grows, with increasingly many in close proximity to that of θ_1. This is the paradigm commonly invoked in non-parametric regression, but it is not always the most natural one. In contrast, we postulate a fixed number of populations with an increasingly large number of samples from each; asymptotically, the procedure can rely on the data from the population of interest alone. The asymptotic properties of the WLE using adaptive weights, i.e. weights determined from the sample, are also established in this chapter. These results offer guidance on the difficult problem of specifying λ.

In Chapter 5 we address the problem of choosing the adaptive weights by the cross-validation criterion. Stone (1974) introduced and studied in detail the cross-validatory choice and assessment of statistical predictions. The K-group estimators of Stone (1974) and Geisser (1975) are closely related to the linear WLE. Breiman and Friedman (1997) also demonstrate the benefit of using cross-validation to obtain a linear combination of predictions that achieves better estimation in the context of multivariate regression. Although there are many ways of dividing the sample into subsamples, such as a random selection of the validation sample, we use the simplest leave-one-out approach in this chapter, since the analytic forms of the optimum weights are then tractable for the linear WLE. The weak consistency and asymptotic normality of the WLE based on cross-validated weights are established in this chapter.

We develop a theoretical foundation for the WLE in Chapter 6. Akaike (1985) reviewed the historical development of entropy and discussed the importance of the maximum entropy principle.
Hu and Zidek (1997) discovered a connection between the relevance weighted likelihood and the maximum entropy principle in the discrete case. We shall show that the weighted likelihood function can be derived from the maximum entropy principle in the continuous case.

In the context of weighted likelihood estimation, the i.i.d. assumption is no longer valid: observations from different samples follow different distributions. The saddlepoint approximation technique of Daniels (1954) is therefore generalized to the non-i.i.d. case to derive very accurate approximations to the distributions of the WLE for exponential families in Chapter 7. The saddlepoint approximation for estimating equations proposed in Daniels (1983) is also generalized, to derive the approximate density of the WLE defined by estimating equations.

The last chapter of this thesis applies the WLE approach to disease mapping data. Weekly hospital admission data are analyzed: the data from a particular site and its neighboring sites are combined to yield a more reliable estimate of the average weekly hospital admissions than the traditional MLE.

Chapter 2

Motivating Example: Normal Populations

Combining information from disparate sources is a fundamental activity in both scientific research and policy decision making. The process of learning, for example, is one of combining information: we are constantly called upon to update our beliefs in the light of new evidence, which may come in various forms. In some cases, the nature of the similarity among different populations is revealed through some geometrical structure in the parameter space; for example, the means of several populations may all be points on a circular helix. From the values of relevant variables it might then be possible to obtain a great deal of information about the parameter of primary inferential interest. Therefore, we should be able to construct a better estimate of the parameter of primary interest by combining data derived from different sources. How might one combine the data in a sensible way? To illustrate some of the fundamental characteristics of our research, it is useful to consider a simple example and examine the inferences that can be made.

2.1 A Motivating Example

In this section we consider the following simple example:

    X_i = α t_i + ε_i,   ε_i ~ N(0, σ_1²),   i = 1, ..., n,    (2.1)
    Y_i = β t_i + ε_i',  ε_i' ~ N(0, σ_2²),  i = 1, ..., n,    (2.2)

where the {t_i}_{i=1}^n are fixed. The {ε_i} are i.i.d., as are the {ε_i'}, while Cov(ε_i, ε_j') = 0 for i ≠ j and Cov(ε_i, ε_i') = ρσ_1σ_2 for all i. For the purpose of our demonstration we assume ρ, σ_1 and σ_2 are known, although that would rarely be the case in practice.

Note that a bivariate normal distribution is not assumed in the above model. In fact, only the marginal distributions are specified; no joint distribution is assumed, although we do assume the correlation structure. The parameters α and β are of primary interest. They are thought, or expected, not to be too different, owing to the "similarity" of the two experiments. The error terms are assumed i.i.d. within each sample. The joint distributions of the (X_i, Y_i), i = 1, ..., n, are not specified; only the correlations between the samples are assumed known. The objective is to obtain reasonably good estimates of the parameters without assuming a functional form for the joint distribution of (X_i, Y_i), which may be unknown to the investigator.
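As a sketch of model (2.1)-(2.2), the following code simulates one convenient joint law consistent with the stated marginals and cross-correlation (a bivariate normal for (ε_i, ε_i'); the thesis itself assumes no joint distribution) and computes the two marginal MLEs used below. All parameter values are illustrative.

```python
import numpy as np

# Paired regressions through the origin with cross-correlated errors,
# as in (2.1)-(2.2). Parameter values are illustrative only.
rng = np.random.default_rng(2)
n, alpha, beta = 25, 1.0, 1.1
sigma1, sigma2, rho = 1.0, 1.0, 0.5
t = np.linspace(0.1, 2.0, n)                 # fixed design points

# One joint law consistent with the assumed marginals and Cov(eps_i, eps'_i).
cov = [[sigma1**2, rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]
e = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x = alpha * t + e[:, 0]
y = beta * t + e[:, 1]

# Marginal MLEs: alpha_hat = sum(t_i x_i) / sum(t_i^2), and similarly beta_hat.
alpha_hat = np.sum(t * x) / np.sum(t * t)
beta_hat = np.sum(t * y) / np.sum(t * t)
print(alpha_hat, beta_hat)
```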
Assuming marginal normality, the marginal likelihoods for α and β are

    L_1(x_1, x_2, ..., x_n; α) ∝ \prod_{i=1}^{n} \exp\left(-\frac{(x_i - α t_i)^2}{2σ_1^2}\right),
    L_2(y_1, y_2, ..., y_n; β) ∝ \prod_{i=1}^{n} \exp\left(-\frac{(y_i - β t_i)^2}{2σ_2^2}\right).

Therefore, ignoring the constants,

    \ln L_1(α) = -\frac{1}{2σ_1^2} \sum_{i=1}^{n} (x_i - α t_i)^2,
    \ln L_2(β) = -\frac{1}{2σ_2^2} \sum_{i=1}^{n} (y_i - β t_i)^2.

The MLE's based on the x_i and y_i, respectively for α and β, are

    α̂ = \sum_{i=1}^{n} t_i x_i \Big/ \sum_{i=1}^{n} t_i^2,   β̂ = \sum_{i=1}^{n} t_i y_i \Big/ \sum_{i=1}^{n} t_i^2.

2.2 Weighted Likelihood Estimation

If we know that |α − β| ≤ C, where C is a known constant from past studies or expert opinion, this "direct information" or "prior information" might be used to yield a better estimate of the parameter. An extremely important aspect of the problem of combining information just described is that we can incorporate the relevant information into the likelihood function by assigning possibly different weights to different samples. We next show how this is done.

The weighted likelihood (WL) for inference about α is defined as

    WL(α) = L_1(x_1, x_2, ..., x_n; α)^{λ_1} L_2(y_1, y_2, ..., y_n; α)^{λ_2},    (2.3)

where λ_1 and λ_2 are weights selected according to the relevance of the likelihoods to which they attach. Non-negativity of the weights is not assumed, although the optimum weights turn out to be non-negative. Note that L_2(y_1, y_2, ..., y_n; α), rather than L_2(y_1, y_2, ..., y_n; β), is used to define WL(α), since α is of primary interest at this stage and the marginal distributions of the Y's are thought to resemble those of the X's. Note also that the WL depends on the distributions of the X's, but it does not depend on the distribution of the Y's.
A CRITERION F O R ASSESSING R E L E V A N C E  18  We call the estimator derived from the weighted likelihood function the W L E in line with the terminology found in the literature although the estimator described here differs from others proposed in those published papers. In particular, we work with the problem in which the number of samples are fixed in advance while the authors of the R E W L E were interested in the problem where the number of populations goes to infinity. The distinction here should be clear.  2.3  A Criterion for Assessing Relevance  The weighted likelihood estimator is a linear combination of the M L E ' s derived from the likelihood function under the condition of marginal normality. We would like to find a good W L E under our model since there is no guarantee that any W L E will outperform the M L E in the sense of achieving a smaller M S E . Obviously, the W L E will be determined by the weights assigned to the likelihood function and the M L E ' s obtained from the marginals. The value of any estimator will depend on our choice of a loss function. The most commonly used criterion, the mean squared error (MSE), is selected as the measure of the performance of the W L E . The next proposition will give a lower bound for the ratio ^ such that the W L E will outperform the M L E obtained from the marginal distribution. For simplicity, we make the assumption of equal variances, i.e. o\ = o\. P r o p o s i t i o n 2.1 Let a be the WLE and a, the MLE of a. If \a — 8\ < C, then E(a - a) \E(a-a)\ Var(a)  =  A (8 - a)  <  A C  <  2  2  Var(a).  2.3. A C R I T E R I O N FOR ASSESSING R E L E V A N C E  19  If p < 1 and A > 0, then 2  2 c2 Ai C S;4 iff -± > , A 2(1-p)a1  max MSE(d) < MSE(a) v <*-P\<c ' . ' x  If p =  1, i/ien  max MSE(a)  JJ  2  > MSE(a)  with equality iff Xi  |a—6|<C  2  = 1 and A = 0. 2  Proof: It can be verified that E(a)  = a,  £(/3)  = b,  E(a)  = Ai a + A /?. 2  It follows that E(a  -a)  = X (B2  a).  Thus ct is not an unbiased estimator of a unless a = B. However the absolute bias is bounded by A C if \B - a\ < C. 2  It can also be checked that a  2  Var(a)  =  Var0)  =  n (~ a\ Cov(a,p)  =  ^ , Cov(i ,<?x) x  po 'Sf' 2  =  where S = £ * . 2  2  i=l  Furthermore, we have Var{a)  =  A Var(a) + X\ V.ar0) + 2X A Cov(a, /3) 2  X  (A + A + 2 A i A 2  2  2  p)°^.  2  2.3.  A CRITERION FOR ASSESSING R E L E V A N C E  20  It can be seen that the M S E of W L E is a function of a and 8. So is the bias function for a. We would like to choose weights that are independent of a and 8 which are unknown. Let us consider the MSE's of a and a under the assumption \a — 8\ < C , E(a  — a)  E(a  - a)  2  2  (i) If p = 1, we have Var(a) Var(a)  =  Var(a)  =  =  Var(a)  +  <  (X + X + 2X X p)^ 2  2  ,  (Bias)* a  2  1  .2  +  2  XC z  2  2  as well as max\p- \ <cMSE(a)  = Var(a),  > MSE(a)  a i  =  . Equality is achieved only when Ai is set to 1 and A to 0. 2  (ii) If p < 1, it follows that Var(a)  <  Var(a).  Furthermore, 2  max E(a - a)  ^2  < E(& - a)  2  2  X X p L  \P—a\<C  2  + X\C  2  2  Ot  <2 X A x  '  .  2  Oj  We then have max E(a - a) < E(a - a) ?-a|<c ' ' 2  v  In conclusion, we have max^^ \<cM a  >  2  A  v  S E(a)  < MSE(a)  '  2 ( 1 - p)a ' 2  2  iff ^ > (i- )a  2  2  P  • °  , The optimum weights which achieve the minimum of the M S E in this case will be positive as will be shown in the next section. Thus, we will not consider the case when A < 0. 2  We remark that the reduction in the M S E is independent of the assumption that of equals o\. 
In general, it can be verified that , if p < 0\jo , 2  max MSE(a) \P-a\<c  < MSE(a)  iff  ^ A  2  then  > ^ f + i^-^i) 2(o\ po o ) x  2  ( 2  4  )  2.3.  A C R I T E R I O N FOR ASSESSING R E L E V A N C E If o\ = of  21  ° 1 then the equation (2.4) is reduced to the inequality in the  =  2  previous Proposition, i.e., max  |a-/J|<C  MSE(a) K  '  < MSEia)  A  iff  ^ *  2  -± >  A  9  C  2  2  2(l-p)(T  2  From equation (2.4), observe that we will have Ai = 1 and A = 0 for a number 2  of special cases: 1) C —> oo and p < 1. Here we have little information about the upper bound for the distance between the two parameters, or we do not have any prior information at all. 2) Sf —>• oo and p < 1. We already have enough data to make inference and the additional relevant data has little value in terms of estimating the parameter of primary interest. 3) af —> 0 and p < 1. This means that the precision ofthe data from the first sample is good enough already.  p  0.0  0.2  0.4  0.6  0.8  1.0  Weight Function  Figure 2.1:  A special  broken line represents parameters  case:  the solid  the MSE(MLE).  line  is the max\p_ \<cMSE(WLE), a  The X-axis  have been set to be the following:  represents  = l , p = 0.5  and  the value of A .  and  x  C — 1.  the The  2.4.  THE OPTIMUM WLE  22  A special case is discussed here to gain some insight into the situation. The upper bound of mean square error for WLE as a function of Ai is shown in the Figure 1. It can be seen that the upper bound of the MSE for the WLE will be less than that of the MLE if Ai lies in a certain range. From the Proposition 2.1, the cut-off point should be 0.5 which is exactly the case shown in the graph. In fact, the minimum of the  MSE can be obtained by using the Proposition 2.2 which will be  max\i3- \<c a  established in the next section.  2.4  The Optimum W L E  In the previous section we found the threshold for Ai and A in order to make WLE 2  outperform the MLE in the sense of obtaining a MSE and variance which are smaller than that of MLE uniformly on the set {(a, 8) : \8 — a\ < C}. Because there are numerous cases under which \ and A satisfy the threshold, we might want to find X  2  the optimum WLE in the sense of minimizing the maximum of MSE and variance. P r o p o s i t i o n 2.2  The  optimum  WLE  under mean squared  .  1 + K a — 2 + K  a H  2 + K " where K =  Proof: G{Xi,  C  a  2  is  3  : — 8,  2 +  K ' P  2 + K  S t  rnax\p_ \< MSE{a)  A) =  1  error  c  max\p- \<cMSE(a). a  is a function of Ai and A for fixed a, 2  S  2  and  p.  Let  The saddle point (stationary point) of this function  can be obtained as the solution of the following equations, dG(XiM) dXi ax1  _  =  5G(Ai,A )  ex  2  2  Ai + A  2  °.  ~  = 1.  2.4. T H E O P T I M U M W L E  23  In fact, the above equation has a unique solution. It can be verified that the minimum is achieved at that point by checking the second derivative. The solution to these equations are A^ =  and A2 =  where  K  =  jjzfijp-  Note that the  weights X{ and A2 satisfy the condition of ^ > y because ^ = K + 1 > y . This implies that with the optimum weights given above the WLE will achieve a smaller MSE than the MLE according to Proposition 2.1 Thus, the optimum weights for estimating a are \*  _  I+K  1 A*  2  A  2+ftT' —  1  —  2+K'  The optimum WLE's for a and 8 are a  =  _L_  l±K  "  2+K  r  2+K  ~  r  '  R  2+K l -> J  2+K  where i f = ^ _ ^ . Note that the optimum WLE for ^ is obtained by the argument p  2  of symmetry, o Notice that A2 < \ for all K > 0. 
This implies that, for the purpose of estimating a, a should never get a weight which is less than that of b if we want to obtain the optimum WLE. This is consistent with our intuition since relevant sample information (the {Fjj's) should never get larger weight than direct sample information (the {-^}'s) when the sample sizes and variances are all equal. It should be noted that the WLE under a marginal normality assumption is a linear combination of the MLE's. Furthermore, the optimum WLE is the best linear combination of MLE's in the sense of achieving the smallest upper bound for MSE compared to other linear combination of MLE's. The optimization procedure is used to obtain the  minimax  solution in the  sense that basically we are minimizing the upper bound of the MSE function over the set  {{a,  8) :\a-B\<  C}.  2.5.  RESULTS FOR BIVARIATE N O R M A L POPULATIONS  2.5  Results for Bivariate N o r m a l  24  Populations  This subsection contains some results for bivariate normal distributions. The following results should be considered as immediate corollaries of Propositions 2.1 and 2.2 established in the previous section. C o r o l l a r y 2.1 Let /  X  l  N  Yi If \Px — PY\ < C  and  Px  a  \  2  po  po  2  l,..,n.  o  2  2  p < 1, then the optimum WLE's for estimating the  marginal  means are r,  —  v  y +-  +  l  M  1  2+M ^  ^  where M =  2 (l-p)  a  2  -  2+M  v  1  2+M  + 2+M  Y  '  A  •  Furthermore, ^ax\it -ii \<c x  (Px  E  Y  ~ Px)  2  E(fiY ~ PY)  max^x-ny^c  2  < E(X  -  fi ) ;  < E(Y  -  fi ) ;  2  X  2  Y  o  2  Var(flx)  < Var(X)  Var(fi )  < Var(Y)  =  n  2  Y  Cov(fi ,py) X  >  = —; n Cov(X,Y). n  Proof: Let ti = 1 in Proposition 2.1. Then Sf —  _  1 = n. B y letting a — X  and b — Y, we can apply Proposition 2.2 to get the optimum W L E . Therefore the optimum W L E will have smaller MSE and variance than the M L E , X, obtained from the marginal normal distribution. . Observe that fi + PY = X + Y. If we take Var on both sides, we then have the X  following Var (fix) + Var fay) + 2 Cov(fi , X  fiy) = Var(X)  + Var(Y)  + 2 Cov(X,  Y).  2.5. RESULTS F O R B I V A R I A T E N O R M A L P O P U L A T I O N S  25  It follows that Cov(p, ,p, ) x  > Cov(X,Y).  Y  o  From Corollary 2.1, we draw the conclusion that the optimum WLE out-performs the MLE when  \JJL — p,y\ x  Corollary 2.2  Proof:  Let  < C is true.  Under the conditions  =  M  2  g 2  2.1,  of Corollary  -  p  fix — X  —>  0 as n —» oo.  • Thus, M -> oo as n  -4  oo. Therefore,  -> 0 as  n —> co. By Markov's inequality, we have, for all e > 0, P(\px-X\>e)=p(^\X-Y\>e^j  Corollary 2.3 77ie Corollary  2.1.  Proof:  Consider  optimum  P(\fi -p \>e) x  where M = ( f l " 2  p  g 2  WLE is strongly  =  x  < ^  E  ^ ~  consistent  Y  ^  under  _>  0  n  ^  oo.o  the conditions  \P(\(ji -X)-{n -X)\>e) x  x  as before. But  '(^-^>i)-Klf^l>5)^5Ts)*(*- )?<^ ,  p ,  where Mi is a constant. Furthermore  p  0- -"-i 2)^—1^ Y  >  ^ w = ^-  where M is a constant. Thus, 2  oo  P(\jix-Hx\ n=l  > e) < oo.  of  2.5.  RESULTS FOR BIVARIATE N O R M A L POPULATIONS  By the Borel-Cantelli Lemma (Chung 1968), we conclude that  fix  26 x- o  n  It follows from Corollary 2.1 that the optimum W L E is preferable to the M L E . However the relevant information will play a decreasingly important role as the direct sample size increases. Motivated by our previous results, it seems that the linear combination of M L E ' s is a convenient way to combine information if the marginal distributions are assumed to be normal. 
A more general result is given as follows when the normality condition is relaxed. T h e o r e m 2.1 Let Xi, X Var(Xi)  , X  be i.i.d.  n  = o \ < oo, and let Yi,Y ,  Assume  E(a)  2+7?' ^  Var(a)  < oo.  2+K  is assumed  IfVar0)  < Var(a),  be i.i.d.  n  Let a and B denote  = a, E0) 2 =  random  ...,Y  2  and Var(Yi)  =  2  a n  some  variables random  estimates  = B, and \B — a\ < C.  d K  =  {i- )Var{a) • ^  s  0  s u  P  with E(Xi)  variables  = a and  with E(Yi) = 6  of a and  ft  respectively.  Let a = A i a + A $, 2  PP  o s e  P — cor(a,J3)  <  1 while  known. then \E(a - a)\ < A 2 C and  Var(a) max\ -p\<c a  < I^ar(o;),  E(a — a)  2  a — a —> 0 if  Var(a)  < E(a — a) , 2  —> 0 as  while  n —>• oo.  Proof: We consider the variance of weighted estimate d of a: Var(a)  =  Var{X a  =  A Var(a)  + X  <  A Var(a)  + X Var{a)  + 2 X X p ^/Var(a)  <  X\ Var(a)  + X Var(&)  + 2A A  =  Var(a).  1  2  2  + X P) 2  2 2  Var0)  2  2  2  2  + 2 A : A2  x  Cov(aJ)  2  x  2  Var(a)  where  \]var(B)  2.5.  RESULTS FOR  Thus, we have  Var(6t) < Var(a).  MSE\p- \< {a) a  BIVARIATE NORMAL  c  +  Var(a)  =  A Var(&)  + \\  <  (A  + 2 Ai A p)Var(a)  2  +  \\  27  Furthermore,  =  2  POPULATIONS  Bias (a) 2  Var0)  + 2 Ai  2  +  A Cov(a,a)  +  2  \\  \ \ (B -  a)  2  C. 2  Now we are in a position to apply Proposition 2.2 to get the weights for the best linear combination of a and B. Consequently we have the best linear combination, a =  Ai  a + A  J3,  2  we have max\p_ \<c a  where the weights E(a  — a)  2  Ai = | ± f , A  < E(a  — a) . 2  2  = ^  and  K =  ( 1  _  < p )  V  a r ( a )  • Also,  The last statement of this theorem can  be established in the same way as shown in Corollary 2.2. o The above theorem implies that any information contained in B can be used to improve the quality of inference provided it is as reliable as a. Normality assumptions for the marginals are no longer necessary to obtain a better estimate of a under the conditions of Theorem 2.1. Moreover, we can correct for our "working assumption" of independent samples through likelihood weights.  Chapter 3 Maximum Weighted Likelihood Estimation 3.1  Weighted Likelihood Estimation  Suppose that we are interested in a single population. A parameter or parameter vector, di, is of inferential interest.  Information from other related popula-  tions, Population 2, Population 3 and so on, are available together with the direct information from Population 1. Let m denote the total number of populations whose distributions are thought to "resemble" Population 1. Let  ni,n , ...,n 2  m  de-  note the number of observations obtained from each population mentioned above. Let X i , X functions that of  2  , X  /i(.;  Xn,Xi ,  6 ), / (.; 9 ),f {.\ X  Hi  (Xij, X j,X j)  function  2  ...,Xi  2  2  be random variables or vectors with marginal probability density  m  m  fi(.;9 ) x  2  are  m  i.i.d  0 ), m  where X = s  (X , X ix  i 2  ,X  i n  .y.  We assume  random variables in this thesis. The joint distribution  is not assumed. We are interested in the probability density : Oi € 0 of a study variable or vector of variables X,  9  X  being an  unknown parameter or vector of parameters. At least in some qualitative sense, the /2(-; #2),  / m ( - ; Om) are thought to be "similar to" / i ( . ; 9 ). X  28  3.2.  R E S U L T S F O R O N E - P A R A M E T E R E X P O N E N T I A L F A M I L I E S 29 For fixed X = x, the weighted likelihood (WL) is defined as:  wL(^)=nn^(^;^) ' rii  m  Ai  i=l j=l  where A = (Ai, A , A ) * is the "weight vector". 
This must be specified by the 2  m  analyst. It follows that rii  m  log WL(0O =  E  °9 M v> h)-  X i l  x  i=l j=l  We say that 9 is a maximum weighted likelihood estimator (WLE) for 9\ if X  6i = arg sup WL(0i). In many cases the W L E may be obtained by solving the estimating  equation:  (d/ddi) log WL(di) = 0. Note that the uniqueness of the W L E is not assumed. Throughout this thesis, 9\ is the parameter of primary inferential interest. We will use 9\ and 9\ to denote the W L E and the true value of 9\ in the sequel.  3.2  Results for One-Parameter Exponential Families  Exponential family models play a central role in classical statistical theory for independent observations. Many models used in statistical practice are from exponential families and they are analytically tractable. Assume that X n , X i , X 2  l  n  ;  are independent random variables which follow the  same distribution from the exponential family with one parameter, #i, that is, h(x; 9 ) = exp (A^S^) l  + B^)  + Ci{x)).  3.2. RESULTS FOR O N E - P A R A M E T E R E X P O N E N T I A L FAMILIES 30 The likelihood function for n observations which are i.i.d from the above distrix  bution family can be written as ^ ( J n , ^ , . . , ^ ; ^ )  = exp iA^J^Sixu)  V  + n B^,) x  + Y C (x ) /  i=i  1  ).  u  i=i  It follows that In L^Xn, X  1  2  ,X  l  n  i  ;  9i) = Ai(6i) ^  ( v)  s  + i B ^ ) + Constant.  x  n  3=1  It then follows that  i=l  m (i'^OTlx ) + B;(0 )) 1  1  n _1  where Tlx ) = ^E5K). r 1  j'=i  It is known (Lehmann 1983, p 123) that for the exponential family, the necessary and sufficient condition for an unbiased estimator to achieve the Cramer-Rao lower bound is that there exists D(9i) such that <91n Li(X ,X , n  Xi ;0i)  l2  ni  001  Theorem 3.1 WLE  Assume  of 6\ is a linear  n D (9 )(T(x )-9 ),  V0x.  1  l  1  1  l  = l,2,...,nj.  that, for any given i, Xij '~" f(x;9i),j %  combination  of the MLE's  obtained  from  the marginals  The  if  o9\ i.e., the WLE of 9\ will be the linear lower Proof:  bound can be achieved  combination  by the MLE's  of the MLE's  derived  from  the  if the  marginals.  Under the condition din  Li(xn,Xi2,...,x ;8i) ini  = n D {e ) (T(^) - 0 ), i  d6  1  1  1  X  Cramer-Rao  3.2.  R E S U L T S F O R O N E - P A R A M E T E R E X P O N E N T I A L F A M I L I E S 31  it can be seen that T(x}) is the traditional M L E for 9\ which achieves the Cramer-Rao lower bound and is unbiased as well. Then we have din W L 09,  =J2 > i X  D {d )(T(x )-B )  n  i  1  1  1  Thus the W L E is given by m  m  where U = rii/ ]P Aj/i;. i=i Therefore the W L E of 9\ is a linear combination of the M L E ' s obtained from the marginals. This completes the proof, o Thus, for normal distributions, Bernoulli, exponential and Poisson distributions the W L E is a linear combination of the M L E ' s obtained from the marginal distributions. T h e o r e m 3.2 For distributions 9\ has the form  of g(T(x ))  where T(x})  1  (  of the exponential  m  is the sufficient  \  2~2 U\ T{x ) l  Proof:  family  i=i  As seen above  dlnLi(sii,3;i2,  form, statistic.  I, where U = n^/ ]T  )  i=i  ...,x \9i) = lni  n  i  (A'MTix^  \ni.  B ' M ) .  +  x  Consequently, the W L E satisfies ( X ( 0 ) T ( ^ ) + B ' M ) =o 1  i=i  which implies that A[(9i) (f>A,  ,i=i m  where *i = n*/  i=i  Ajnj.  T  ^ ) )  +  B  '  M =  0  the MLE of  Then the WLE of  m  dQ  ^Xim  suppose  3.3.  
W L E O N RESTRICTED P A R A M E T E R SPACES  32  Therefore the W L E of 9\ takes the form  (  m  5>A, T V )  i=i  m  where U = rii/ E ^ " i - This completes the proof, o. Therefore the W L E has the same functional form as the M L E obtained from the marginals if we confine our attention to the one-parameter exponential family. The only modification made by the W L E is to use the linear combination of sufficient statistics from the two samples instead of using the sufficient statistic from a single sample. The advantage of doing this is that it may be a better estimator in terms of variance and M S E .  3.3  W L E O n Restricted Parameter  Spaces  The estimation of parameters in restricted parameter spaces is an old and difficult problem. A n overview of the history and some recent developments can be found in van Eeden (1996). van Eeden and Zidek (1998) consider the problem of combining sample information in estimating ordered normal means, van Eeden and Zidek (2001) also consider estimating one of two normal means when their difference is bounded. We are concerned with combining the sample information when a number of populations in question are known to be related. Often, prior to gathering of the current data in a given investigation, relevant information in some form is available from past studies or expert opinions. A l l statisticians must make use of such information in their analysis although they might differ in the degree of combining information. A linear combination of the estimates from each population is straightforward to use. In this section, we assume that the W L E is a linear combination of the individual M L E . We remark that the results of this sec-  3.3.  W L E O N R E S T R I C T E D P A R A M E T E R SPACES  33  tion hold for the general case where the new estimator is a linear combination of the estimator derived from each sample. We need the following Lemma on optimization to prove the major theorem of this section. We emphasize that we do not require the Aj's to be non-negative. n  Lemma 3.1  Let A  = (Ai, A , A ) ' 2  Aj = 1.  and  m  Let A be a  symmetric  i=i  invertible function,  m x m  A*^4A,  matrix.  is given  The  weight  function,  by the following  Proof:  lA  which  minimizes  the  objective  formula:  A* = provided  X,  - ^ i -  1 > 0.  l  Using the Lagrange method for maximizing an objective function under con-  straints, we only need to maximize G = X A X - c(l*A - 1), l  m  subject to 2~2 Xi = 1. i=l  Differentiating the function G gives ^  oX  =  2AX-cl.  Setting | f = 0, it follows that X =  -A~ l. l  2  Now 1 = 1* A = - l ^ - ! . 2 2 So c = lM-U' 1  Therefore, VA- !' 1  3.3.  W L E O N R E S T R I C T E D P A R A M E T E R SPACES  34  Since X A A is a quadratic function of A;, therefore, A A A has its global minimum at t  the stationary point. This completes the proof.o As before, we would assume that the parameters are not too far apart. The following theorem will show that some benefit could be gained if we take advantage of such information, namely that, sufficient precision may be gained at the expense of bias so as to reduce the MSE. However maximizing that gain may entail negative weights. Theorem 3.3 with E(Xij)  Assume  that Xij, i —  = Qi, and m is the the number  sample size for each source. joint  distribution  structure  1,2,m,  The marginal  = 1, 2 , n ;  j  of related  distributions  distribution  is assumed  are assumed  to be known.  
As before, we assume that the parameters are not too far apart. The following theorem shows that some benefit can be gained by taking advantage of such information: sufficient precision may be gained, at the expense of bias, to reduce the MSE. Maximizing that gain, however, may entail negative weights.

Theorem 3.3 Assume that X_{ij}, i = 1, 2, ..., m, j = 1, 2, ..., n_i, are random variables with E(X_{ij}) = θ_i, where m is the number of related data sources and n_i is the sample size for the i-th source. The marginal distributions of the X_{ij} are assumed known. The joint distribution across the related data sources is not assumed; instead, the covariance structure of the joint distribution of θ̂_1, θ̂_2, ..., θ̂_m is assumed known, where θ̂_i = θ̂_i(x_{i1}, x_{i2}, ..., x_{in_i}) is the MLE of θ_i derived from the marginal distribution. Let θ̂ = (θ̂_1, θ̂_2, ..., θ̂_m)^t. Furthermore, suppose E(θ̂_i) = θ_i, the entries of V = cov(θ̂) are all finite, |θ_i − θ_1| ≤ C_i where C_i is a known constant, and (V + BB^t) is invertible, where B = (0, C_2, C_3, ..., C_m)^t. Then the minimax linear WLE of θ_1 is

    θ̃_1 = \sum_{i=1}^{m} λ_i^* θ̂_i,  where  λ* = (λ_1^*, λ_2^*, ..., λ_m^*)^t = (V + BB^t)^{-1} 1 / (1^t (V + BB^t)^{-1} 1).

Proof: We seek the best linear combination of the MLE's derived from the marginal distributions. As before, consider the MSE of the WLE. Writing θ̃_1 = λ^t θ̂ = Σ_i λ_i θ̂_i, we can calculate

    MSE(θ̃_1) = E[ Σ_i λ_i θ̂_i − θ_1 ]²  (since Σ_i λ_i = 1)
             = E[ Σ_i λ_i (θ̂_i − θ_1) ]²
             = Σ_i Σ_j λ_i λ_j E(θ̂_i − θ_1)(θ̂_j − θ_1)
             = Σ_i Σ_j λ_i λ_j [ Cov(θ̂_i, θ̂_j) + (θ_i − θ_1)(θ_j − θ_1) ]
             ≤ λ^t cov(θ̂) λ + λ^t BB^t λ    (by the assumptions)
             = λ^t (V + BB^t) λ.

Setting A = V + BB^t and applying Lemma 3.1, we conclude that the optimal linear WLE is θ̃_1 = Σ_i λ_i^* θ̂_i with λ* = (V + BB^t)^{-1} 1 / (1^t (V + BB^t)^{-1} 1). □

It can be seen that the max MSE function is a quadratic form in λ_1, λ_2, ..., λ_m involving only the first and second moments of the marginal distributions and the assumed joint covariance structure. The procedure consists of two stages: first, work out the functional form of the MSE bound; second, work out the optimum weights for the optimum WLE. Note that the optimum weights are functions of the matrices V and B. Let us consider some special cases.

(i) B = 0 and V = σ²I. All the data can be pooled, with equal weight λ_i^* = 1/m attached to the MLE derived from each marginal distribution, since now θ_1 = θ_2 = ... = θ_m. This is consistent with our intuition in the i.i.d. normal case.

(ii) B = 0 and V = diag(σ_1², σ_2², ..., σ_m²). The optimum WLE is then given by

    θ̃_1 = \sum_{i=1}^{m} \frac{σ_i^{-2}}{\sum_{k=1}^{m} σ_k^{-2}}\, θ̂_i.

Note that most current pooling methods estimate the common population parameter θ by a weighted average of the estimates obtained from each data set, θ̄ = Σ_i w_i θ̂_i / Σ_i w_i, where the optimal weights, when the Var(θ̂_i) = σ_i² are known, are proportional to the precision of the i-th data set, that is, to the inverse of the variance. This is a simple example of a weighted least squares estimator. The optimum WLE therefore coincides with the weighted least squares estimator under the assumptions B = 0 and V = diag(σ_1², σ_2², ..., σ_m²).

Furthermore, let us apply Theorem 3.3 to our motivating example, since it too is a special case of the theorem. To be consistent with the notation of Chapter 2, we use α̂ and β̂ to denote the parameter estimates. We need to work out the matrices V and BB^t before applying the theorem, since both are assumed known. As before,

    Var(α̂) = σ²/S_t²,  Var(β̂) = σ²/S_t²,  Cov(α̂, β̂) = ρσ²/S_t².

In matrix form,

    V = cov((α̂, β̂)^t) = (σ²/S_t²) [ 1  ρ ; ρ  1 ],  BB^t = [ 0  0 ; 0  C² ].

Adding the two matrices gives

    V + BB^t = [ σ²/S_t²   ρσ²/S_t² ; ρσ²/S_t²   σ²/S_t² + C² ],

and inverting,

    (V + BB^t)^{-1} = \frac{1}{(σ²/S_t²)²(1 − ρ²) + C²σ²/S_t²} [ σ²/S_t² + C²   −ρσ²/S_t² ; −ρσ²/S_t²   σ²/S_t² ].

Thus

    (V + BB^t)^{-1} 1 ∝ ( (1 − ρ)σ²/S_t² + C²,  (1 − ρ)σ²/S_t² )^t,

and the optimum weight function λ* = (λ_1^*, λ_2^*)^t is

    λ* = \left( \frac{(1 − ρ)σ² + C²S_t²}{2(1 − ρ)σ² + C²S_t²},\ \frac{(1 − ρ)σ²}{2(1 − ρ)σ² + C²S_t²} \right)^t = \left( \frac{1 + K}{2 + K},\ \frac{1}{2 + K} \right)^t,

where K = C²S_t²/((1 − ρ)σ²). These are exactly the weights we derived in Proposition 2.2; we have therefore shown that Theorem 3.3 reproduces the results found earlier.

3.4 Limits of Optimum Weights

The weights are of fundamental importance, so it seems worthwhile to investigate the behavior of the optimum weights as the sample sizes get large. Since the weights are fixed numbers for fixed sample sizes, their limits are non-stochastic as well. For simplicity, we assume that C_2 = C_3 = ... = C_m = C, where C is a constant.

Theorem 3.4 Let B = (0, C, C, ..., C)^t, C > 0, and let V = cov(θ̂), where θ̂ = (θ̂_1, θ̂_2, ..., θ̂_m)^t. Re-write V = σ_1²W, where σ_1² is a function of n_1 and W = W(n_1, n_2, ..., n_m) is a function of n_1, n_2, ..., n_m. Assume lim_{n_1→∞} σ_1²(n_1) = 0 and lim_{n_1→∞} W^{-1} = W_0^{-1}. Then

    \lim_{n_1→∞} λ* = S_0 1 / (1^t S_0 1),  where  S_0 = W_0^{-1} − \frac{W_0^{-1} B B^t W_0^{-1}}{B^t W_0^{-1} B}.

Proof: From the matrix identity (Rao 1965),

    (σ_1² W + BB^t)^{-1} = \frac{1}{σ_1²} \left[ W^{-1} − \frac{W^{-1} B B^t W^{-1}}{σ_1² + B^t W^{-1} B} \right],

it follows that

    λ* = \frac{(σ_1²W + BB^t)^{-1} 1}{1^t (σ_1²W + BB^t)^{-1} 1} = \frac{\left[ W^{-1} − \frac{W^{-1}BB^tW^{-1}}{σ_1² + B^tW^{-1}B} \right] 1}{1^t \left[ W^{-1} − \frac{W^{-1}BB^tW^{-1}}{σ_1² + B^tW^{-1}B} \right] 1},

the factor 1/σ_1² cancelling. Letting n_1 → ∞, so that σ_1² → 0 and W^{-1} → W_0^{-1}, yields λ* → S_0 1 / (1^t S_0 1). This completes the proof. □

Corollary 3.1 Under the conditions of Theorem 3.4, assume V = diag(σ_1², σ_2², ..., σ_m²). If σ_1²(n_1) → 0 and σ_1²(n_1)/σ_i²(n_1) → γ_i ≥ 0 for i ≥ 2 as n_1 → ∞, with Σ_{i=2}^m γ_i > 0, then

    \lim_{n_1→∞} λ* = (1, 0, 0, ..., 0)^t.

Proof: Here W = diag(1, σ_2²/σ_1², ..., σ_m²/σ_1²), so that W_0^{-1} = diag(1, γ_2, ..., γ_m). Then

    W_0^{-1} 1 = (1, γ_2, ..., γ_m)^t,  W_0^{-1} B = C (0, γ_2, ..., γ_m)^t,
    B^t W_0^{-1} 1 = C Σ_{i=2}^m γ_i,  B^t W_0^{-1} B = C² Σ_{i=2}^m γ_i,

and hence

    S_0 1 = (1, γ_2, ..., γ_m)^t − (0, γ_2, ..., γ_m)^t = (1, 0, ..., 0)^t.

It follows that 1^t S_0 1 = 1, which implies lim λ* = (1, 0, ..., 0)^t. □
Suppose that /  0  1  P  P'  „m—1  P  1  P  „m-2  y p —l m  cov(6) = ^-V  p~ m  2  where \ (3.2)  p ~^ m  where /) / 0 and p < 1. This kind of covariance structure is found in first order 2  autoregressive model with lage 1 effect, namely AR(1) model, in time series analysis. If the goal of inference is to predict the average response at the next time point given observations from current and afixednumber of previous time points, then this type of covariance structure is of our interest. By Proposition 2.2, the optimum weights will go to (1,0) if m = 2. This is because K = , _ ? ^ / x goes to infinity. As a result, x  the optimum weights A = (^f-, 2+K) ~~^ (1>0)-  n  3.4. LIMITS OF O P T I M U M W E I G H T S 2  A  Corollary 3.2  41  For m > 2, if cov(0)  = ^-Vb, where  V  0  takes the form  as in  (3.2),  then umA where D  0  Proof:  D - p ' A)-p '""'  P +  I L ,  2  2  0  = 1 + (m - 2)(1 -  D  0  - p * ' D  0  - p * J  '  p) . 2  It can be verified that i 1  /  WV  1  -p  -p 1 + p  l  0  n  -p  2  n 0  n 0  0  \  0  =  l -  p  2  0  0 0  -p 1 + p  2  -p  0  It follows that C  ( - P , i - p + p , (i - p ) , ( i - p) , i 2  1-p  2  1-p  2  2  y,  2  P  B WQ 1 1  1  B ^ ^ B  We then have  Do-P  (1 -  (-p, 1 - p + p , (1 - p ) , ( 1 - p) ,1 - pf. 2  P ) 2  A)  2  2  3.4. LIMITS OF O P T I M U M W E I G H T S  42  Therefore,  SI  (1-  pf  (1-  pf  i-p  1-  p  \  z  P )D 2  0  (1 - pf  ( 1 - pf 1 -- p 2  +p  (1 - Pf  Do-P (1 -  1-p  V P ( I - P + P  1  \  (  1 -- p  -p + '  D ' D 00  It can be verified that lim 1*51  p{i-p)  2  )  2  D  p(i- )  D  0  D  0  =-^-^(1  j^).  —  pii-py*  2  P  D  0  0  This completes the proof, o  Note that the limits of A2 and A^ take different form than that of A 3 , A ^ _  x  reason is that B = (0, C, C,C) ignores the first row or column of W  0  . _1  The when  multiplied with it and the last row of W ~ is different than the other rows in that l  0  matrix. Positive Weights If the estimators are uncorrelated, that is, p = 0, then we show below that asymptotically all weights are positive. Furthermore, all the weights are always positive in this case. Theorem 3.5 B  As before,  = (0, C, C, ...,Cy.  of, i = 1 , 2 , m ,  Furthermore,  are known.  Observe that  Proof:  The matrix  let V be the covariance  Then  V~ = diag  (V + BB*)-  l  1  assume  ( - \ , - \ - ) .  = V-  1  0  = B'V^B  that V = diag(o\,o\,  can be written as  and T  x  1  '  V  =  of (9\, 9 , •  A* > 0 for all i = 1 , 2 , m  (V + BB')-  where t  matrix  l + <o  (V^B^V- ). 1  2  9 ) and let  ...,ofJ,  m  where the  and for all n.  3.4. LIMITS OF O P T I M U M WEIGHTS  43  It may be verified that ™ Q2  of'  ^ i=2  and  1  0  0  0  0  c  c  2  \  2  c  2  0  Therefore, we have (V + BB )1  1  =  diag(\,\,...,±-) \°1  °2  l + *o  V i /  n  0  Ti  \  0  0  0  0  n  c  c  c  c  2  2  1+to o  o  2  ... 4 -  2  Furthermore, (V + BB )' 1  1  1  diag  1  1  2'  °\  1  2' ' ^2  1 + to  0  0  l  \  0  0  °l  Ti 1  c  2  1  0  c  2  T?O-2  1  l + *o 0 /  i  c  2  V / 1  /  \  1  C a , a: 2  1  2  1 + 4 0  u  V l  +  t  0  < T |  C  1  1+to  2  \ "l, l+to  u  since m  1  io  1+V  1  1  <7?l+t " 0  V / 1  3.4. LIMITS OF O P T I M U M WEIGHTS  44  Hence, •1 1 1 = 4 + T - V T  1 -2 •  m  1' (V + BB')'  1  Therefore, K  =  ^r—>o. J_ + _ J _ V - L 3=2 3  Also, for 2 < i < m, I  A:  I  * \ l +  =  > o.  This completes the proof, o  Negative Weights The above theorem gives conditions that insure positive weights. However, weights will not always be non-negative. 
In Corollary 3.2 , we have lim  A  *  ( i , -f> ° D  =  +  y '  "l^oo  P( ~P  + P ),  l  DQ-  2  PJ ~ Pf 1  D  p  2  - p  ? 2  0  .  Pi  - Pf  1  ' '"'  D  - p  1  2  Q  ~ P) V  Pi  '  DQ  = 1 + (m — 2)(1 — p) . It follows that D  ?  - p J 2  > I provided  where p  < 1 and D  m > 2.  Hence the sign of those asymptotic weights taking the form ^1^,2 are  2  2  0  0  m  determined by the sign of p. Notice that  hm A* = 1 and lim Ai = 1*. Hence  m  lim A = — 2^2 bm A*. Therefore, we have the following. 2  Til—S-OO  j_gni—•oo  1) If p > 0, then lim A* > 0, i = 3 , m , and A < 0. 2  Jll—»oo  2) If p < 0, then lim A* < 0,  i = 3,...,m,  and A2 > 0.  A general result on negative weights is shown below.  Theorem 3.6  Assume  B  m e  a  3=2 61  C , C )  and m >  3. Lei kFo" = 1  (aij) . mxm  Let  2  i = YI ij-  if ^  = (0, C,  <  Assume  a  u  — -^r— > 0. h  for some i > 1.  E et  Then the asymptotic  weight,  lim A*, is  negative  3.4. LIMITS OF O P T I M U M WEIGHTS  45  Proof:  By Theorem 3.4, we have (V + BB*)- ! 1  lim A* = lim n  where 5  =W  m x m  1 0  -  ,—  n^oo V(V + BB*)- !  ^oo  °  B  t  i  -i °  y  1  S  m x m  l*5  l  l  m x m  •  B  We then have W B = ( C ^ a l  0  .  i  i  , C ^ a  J=2  2  i  , . . . , C ^ a  j=2  m  )  i  = C(ei, e , e ) * .  i  2  m  j=2  Therefore, WQ-^B'WQ- , 1  B~*wF B r  C ( e i , e , ••.,e ) C(ei,e2, - , e ) f  2  m  m  (0,C,C,...,C)C(e ,e ,...,e y 1  2  C (ei, e ,  e )'(ei, e , •  2  =  (an+ei,ai2 + e ,...,a i+e )' 2  m  m  2  m  2  m  m  ^  ai2  i=2  (  a n + ei ^  +e  eie e e!  2  2  e  2  e2e  2  m  m  i=2  y c ei m  C77je2  ...  e  m  y  V / 1  e ). m  3.4.  LIMITS OF O P T I M U M WEIGHTS  46  It follows that /  9  o n + ei  e  2  m  \ a  mX  \  E* e  ei  i=l  ai2 + e  1 —  (  ^  +e ) m  E  J  e  •m  E i  2  e  i=i  i  i=2  I \  6 J^J  I  m  1= 1  /  e i ( e i + E e; j=2  a n + ei  m i= 2 m  a i + e 2  2  e2(ei+ E ev) i=2 m  E e, i= 2  (ei+E 0 i=2 e  a i + m  m  i=2  =  e2 ei ,a i- — — , a  e\  (an  m  2  E* e  i=2  e m  m  i  e —) . :  t  i=2  i=2  Note that W is symmetric which implies that the matrix W 0  1 0  is also symmetric.  Thus, we have m  m  3=2  3=2  By the assumption of the theorem, we have l ^ m x m l — °11  m^~~ >  0,  E i e  i=2  Therefore, lim A* < 0 if an < ^r - for any i such that i > 2. o 1  ni->oo  ti i=2  We can construct a simple example where the non-asymptotic weights can actually be negative. Suppose that information from three populations is available and one single observation is obtained from each population. Further, we assume that the three random variables, X\, X , and X , say, follow a multivariate normal distribution 2  3  3.4. LIMITS OF O P T I M U M WEIGHTS  47  with covariance matrix as follows:  V =  / 1.0  0.7  0.3 \  I  0.7  1.0  0.7  V 0.3  0.7  1.0  Also, assume that C = C = 1. Thus B = (0,1,1)'. It follows that 2  3  ( 1.0 V + BB  1  =  0.3 \  0.7  0.7 1.0  0.7  0.3 0.7  1.0  ( 0  +  0  0 ^  ( 1.0  0.7 0.3  0 1.0  1.0  0.7 2.0  1.7  0 1.0  1.0  0.3  2.0  1.7  X  Hence, approximately ( (V + BB )' 1  1  =  1.67  -1.34  0.89 ^  -1.34  2.88  -2.24  0.89  -2.24  2.27  We then have (V + BB)  1 = (1.22, -0.71, 0.92)*,  and 1\V  + BB )' ! 1  1  = 1.43  It follows that A* =  (V + V(V +  BB*)- ! 1  BB')- ! 1  (0.85,-0.49,0.64).  Thus, A is negative in this example. The negative weights in this example might 2  be due to the collinearity. If we replace 0.7 by 0.3 in the covariance matrix, all the weights will be positive.  
Chapter 4 Asymptotic Properties of the W L E Throughout this chapter, 9\ is the parameter of primary inferential interest although in their extension of the REWL, Hu and Zidek (1997) consider simultaneous inference for all the 0's. Recall that for fixed X = x, the weighted likelihood (WL) is defined as: m  rii  nii/i^i)*. i=l  w  j=l  where A = (Ai, A , A ) * is the "weight vector". 2  m  It follows that logWL(x,  0i)  =EE i=l  X i  l  o  g  hfai'i  3=1  We say that 0\ is a maximum weighted likelihood estimator (WLE) for 0i if 0i = arg sup WL(x, 0i).  We will use 0J to denote the true value of 0i in the sequel. The asymptotic results proved here differ from those of Hu (1997) because a different asymptotic paradigm is used. Hu's paradigm abstracts that of non-parametric regression and function estimation. There information about 0i builds up because the number of populations grows with increasingly many in close proximity to that of 9\. 48  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E  49  This is the paradigm commonly invoked in the context of non-parametric regression but it is not always the most natural one. In contrast we postulate a fixed number of populations with an increasingly large number samples from each. Asymptotically, the procedure can rely on just the data from the population of interest alone. These results offer guidance on the difficult problem of specifying A. We also consider the general version of the adaptively weighted likelihood in which the weights are allowed to depend on the data. Such likelihood arises naturally when the responses are measured on a sequence of independent draw on discrete random variables. In that case the likelihood factors into powers of the common probability mass function at successive discrete points in the sample space. (The multinomial likelihood arises in precisely this way for example). The factors in the likelihood may well depend on a vector of parameters deemed to be approximately fixed during the sampling period. The sample itself now "self-weights" the likelihood factors according to their degree of relevance in estimating the unknown parameter. In Section 4.1, we present our extension of the classical large sample theory for the the asymptotic results for the maximum likelihood estimator. Both consistency and asymptotic normality of the W L E for the fixed weights are shown under appropriate assumptions.  The weights may be "adaptive" that is, allowed to depend on the  data. In Section 4.2 we present the asymptotic properties of the W L E using adaptive weights.  4.1  A s y m p t o t i c Results for the W L E  In this section, establish the consistency and asymptotic normality of the W L E under appropriate conditions.  4.1.  ASYMPTOTIC RESULTS FOR T H E W L E  4.1.1  50  Weak Consistency  Consistency is a minimal requirement for any good estimate of the parameter of interest. In this and the next sub-section, we will give a general conditions that ensure the consistency of the W L E ' s . Our analysis concerns a-finite probability spaces (X, T, fii),i = 1,2 under suitable regularity conditions. We assume that the probability measures fa is absolutely  with respect to one another; that is, suppose  continuous  there exists no set (event) E G T for which fa(E) = 0 and faj(E) ^ 0, or fa(E) = 0 and fJ-j(E) ^ 0 for i ^ j. Let v be a measure that dominates fa, i = 1, 2, for example (pi +fa>)/2and fr, i = 1,2. By the Radon-Nikodym theorem (Royden 1988, p. 
276), there exist measurable functions fi(x),  i = 1,2, called probability  unique  densities,  up to sets of (probability) measure zero in v, 0 < fi(x) < oo (a.e. v),i = 1,2, ...,m such that  p.i{E)=  j  fi(x)du(x),  i =  l,2.  for all £ 6 ^ . Define the Kullback-Leibler K(f f ) ll 2  information  = E  1  as:  number  ^ l o g ^ ^  = J  log j^h{x)du{x). f  is defined as +oo if fi(x)  In this expression, log(fi(X)/f (X)) 2  so the expectation could be +oo. Although log(fi(x)/f (x)) 2  fi{x) = 0 and f (x) 2  > 0, the integrand, log(fi(x)/f {x))fi(x) 2  > 0 and f (x) 2  = 0,  is defined as — oo when is defined as zero in  this case. The next lemma gives well known result. L e m m a 4.1 (Shannon-Kolmogorov densities  with respect to v.  K(f ,f ) 1  with equality  2  Information  Inequality)  Let f\(x)  Then  = E (log^^j 1  if and only if f\(x)  = f (x) 2  = j (a.e.  logj&hWM*) v).  > 0,  and f (x) 2  be  4.1. A S Y M P T O T I C RESULTS FOR T H E W L E Proof:  Let 2,3,  51  (See for example, Ferguson 1996, p. 113). denote the true value of  9\  ...,m.  and  9i  for ^ e 9 , i =  9° — (9\, B , 9 ) , 2  m  Throughout this chapter, the following assumptions are assumed to hold  except where otherwise stated. Assumption 4.1  The parameter  Assumption 4.2  For each i = l,..,m,  random  variables  having  Assumption 4.3  common  is compact  fi(x,  For any 8\ G  density  0  = 1,2,rij.}  j  functions  are  i.i.d.  with respect to v.  v) implies  and for any open set  \log(h{x^)/h{x;  are each measurable  separable.  that 9\ = 9[ for  any  9) have the same support for all 9 G O.  OC0,  IM/i(x;  sup \ l o g ( h 0 ? i n f sup  :  {Xij  assume  probability  and  fi(x,9\) = fi(x,9[)(a.e.)  Assume  6\, 9[ G © and the densities  Assumption 4.4  O  space  assume  6>))|  9 )/h(x; 0  x  x  inf |/op(/i(a;;0f)//i(a;;0i))|  d,))\  in x and  ^ [ s u p i i o g y y ; ; ! ]112' 1  Oiee where K > 0 is a constant  <K<OO,  JiK-X-ifVi)  independent  of 9{, i = 1, 2,..,  Assumption 4.5 Let n = (m, n ,n ).  Assume A  2  m  m. ( n )  = (A , A n)  x  n ) 2  ,A^)'  satisfies \W^W  =  (w ,W ,...,W ) =(l,0,...,0) t  1  2  t  m  while  max l<A:<m  nl K  max ItUj — A - ^ l < O 2  l<i<m '  I —  t  (n)~ ) 6  V l  /  as m  —> oo, ,  /or some 8 > 0.  Assumption 4.5 will be satisfied if n i = 2 , m are in the same order of n and if  also  \wi-\f \ ]  = O (n^)/ ) . 2  x  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E  52  In this chapter, we require the density functions to be upper semi-continuous. Let |.|| be defined as Euclidean norm; that is  W for  any  x =  1,2  1=1  ( x i , ^2, . . . j X g ) ' .  Definition 4.1  A real-valued  said to be upper  semi-continuous  such that  = CE°Z) >  2  lim \\9  n  semi-continuous  — 9\ \ =  if g(9) <  0,  function,  g(0), defined  O,  on  n  space, Q, is ,  if for all 9 G © and for any sequence  we have g(9) >  liminfg(# )  on the parameter  lim s u p A lim \\9  whenever  n  We need to show that sup U(x] 9i) = 1°9JJ0^  function  — 9\ \  6  GQ  n  is called  lower  = 0.  for some open set O is measurable  if fi(x; 9i) is upper semi-continuous. Proposition 4.1  If U(x; 9,) is lower semi-continuous  in 9\ for all x, then  sup  U(x  \\6i-e \\<R 0  is  measurable.  Proof: For simplicity, let us assume that 6 = R and O = {9_i : ||0i - 9^\\ < R} . We will show that sup  U(x;9 ) 1  = sup U(x;9 ) 1  (4.2)  e eD  \0i-e\\<R  x  where D — N D {9\ : \Q\ — 0\\ < R}. The set N is the set of all the rational numbers in  R.  Let s =  sup U(x; 9 ). It follows that for any 9\ G D , U(x\ 9\) < s. 
For any given X  |0i-0?|<R  e > 0, there exist  91(e)  such that  Since D is a dense subset of {9\ such that lim 9^\e) n—foo  = 91(e).  \9l(e) - 9°\ < R  and  U(x,9{(e))  >s-e/2.  (4.3)  : \6\—9\\ <  i?}, then there exist a sequence 9^(6)  G D  Since U(x;9i) is lower semi-continuous, then, for fixed  e and some S > 0, there exist 0** G D such that  — 0JI < 5 and  [7(x; 0J) - e/2 < {/(a; 0**)-  (4.4)  4.1.  ASYMPTOTIC RESULTS FOR T H E W L E  53  Thus, combining equation (4.3) and (4.4), it follows that for any given e > 0, there exist  9{*  e  such that  D  U(x\9\*)  > s - e.  Equation (4.2) is then established.  We then have :  {x  sup  U(x;  |0i-0?|<R = {x : sup U(x;6i) 0lG£>  #i) < a} < a}  (0 £ D}. = n? {x : U(x; 9?) < a, 6? =1  Since  {x : U(x;9i)  is measurable. Therefore the set  < a}  is measurable for any 6\. Therefore the set {x :  {x  :  sup  U(x; 9±) <  a}  \ei-e°\<R  sup  U(x; 9i) < a} is measurable.  \ei-e°\<R  This completes the proof, o Therefore, if f\(x;9i)  is upper semi-continuous in 9\ for all x and the open set  is defined as {9 : \9\ — 9\\ < R, R > 0}, then the Assumption 4.4 is automatically X  satisfied because loq^' A  is lower semi-continuous and  6  3  /i(x;0i)  log  sup  log  sup  0ieD  9i-e°\<R  AM?)  AMi)  for any denumerable set D dense in {9 ; \9\ — 9°\ < R} by Proposition 4.1. The X  measurability of  inf  log  |0i-0?|<R  fi(x,0i)  also follows.  L e m m a 4.2 Let Aij(x; 9) be measurable K for some constant  K independent  1  i=i  for any 9 ,9 , ...,0 , 9 eQ,i 2  Proof:  Put  3  m  Aij = Aij(Xij)  of 9i,  in x for all 9. If Eg [A(X j; i  i  and  <  2  <  then  B  ...m. = (w^ - x\ ^)Aij. n  i3  Observe that by the Cauchy-  Schwartz Inequality, for any i, i',j, and / , we have Ego\BijBi'ji\  9i)]  J=I  = 1,2,  {  function  (»)i  \wi — X)  - X^f I ^1E oA jE oA ,., 1  e  2  e  2  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E in view of the finite second moment condition on the  54  Aij(X).  Further, / i  HI.  \  <«i  *  1  1=1 3=1  ^  m  771  7l{  i=l  j=l  m  ni  ?V  i=i i'=i j=l j'=i  1  < O (—- 1 max n\ max liu; — A,- ^| n  \nf/l<k<m  2  l<i<m  <o(i) (nn 0  <  o n  0, as  1+S  ni  —>  oo.  by Assumption 4.5. It then follows that 771  sup  71,  ±£i>-A, "'M« < ^ £ £ h -><r%\ (  8ie& "•i  1=1  '*i i=l j. =- il  j=l  = u  1  i  771  7li  lttMM-  i=l 3=1  . _ i .•_ 1  We then have ^  1  for any 9 ,6 ,9 , 1  2  3  ...,8 ,0i m  m  rii  t=i j=i e Q,i =  1,2, ...,m. o  If we set Aij = log ^ffilg)) and  It then follows from Lemma 4.2 that —<5 (X, 0i) ni  for any 0 , 2  03,  ni  0  (4.5)  0m, 0; G O, i = 2, 3 , m .  The above result will be used to establish the following theorem which will be applied to prove the weak consistency and asymptotic normality of the WLE.  4.1.  55  ASYMPTOTIC RESULTS FOR T H E W L E  T h e o r e m 4.1 For 9, ^ 9\,  ( for any 9 ,6> ,0 , 2  3  m n;  m  i=l j=l  i=l j=l  9 e&,i  m  =  {  2,3,m.  Proof:  With P o measure 1, g  m rii |W i=l j=l  m nt T TT T i = l j=l  ,  ,  ^ s  if and only if W„,(X,9 )>0. 1  where ^n), Jl{Xj \9l) n  n  3  i~l~{  ji\Xj,Vi y  Observe that l~l~{  n  /  1  Jl{Xj,Vl)  /i(^ij,#i)  o  /  a  N  By equation (4.5) we have 1 1 ~ -S  ni  Hi  for any #i € 0 and any 9 , 9 2  5  , 9  (X, 9\)  Po e  t  m  By the weak law of large numbers, i n  i  ,  /ipfij-;fl?)  h{x^e\)  /i(^ij;0i)  /I(AIJ-;0I)  4.1.  A S Y M P T O T I C RESULTS F O R T H E W L E  56  when 6\ ^ 9\ by Lemma 4.1 and Assumption 4.3. Therefore lim P (W > 0) = 1 for all 6 ^ 0?, 9 ,0 ,9 . 
0O  ni  X  2  3  m  o  ni->oo  For any open set  let Zi (0) =  O,  3  We are now in a position to  m^log^^ ' ^. 3 6  prove the weak consistency of the WLE. Theorem 4.2  Suppose  that logf\(x;9)  is upper semi-continuous  in 6 for all x. As-  sume that for every 9\ ^ 9\ there is an open set Ng such that 9\ € Ng C O. Then for 1  any sequence  of maximum  weighted  likelihood  lim Pgo ( | | 0 i ni—>oo  ni)  x  estimates  0(™^ of'9\, and for all e > 0,  - 0 ° | | > e) =0,  \  /  for any 9 ,9 , ...,9 ,9i e Q,i = 2,3, ...,m. 2  Proof:  3  m  The proof of this theorem given below resembles the proof of weak consistency  ofthe MLE in Schervish (1995). For each  9\ ^ 9\,  let NJfi\k = 1, 2,... be a sequence of open balls centered at 0i and  of radius at most l/k such that for all k, N:  (*+i)  C Ng^ c  It follows that Hfcli NJ,?  =  therefore has a limit as  —>• oo. Note that  k  e.  {^i}- Thus, for fixed Xij, Zij(Ng?) increases with k and is lower semi-continuous in 0i  loq{ ^ A l  ,9  for each x. So, lim ZIJW)  > log  J  The limit in the last expression is not required to be finite. Observe that E  =E  a  inf  log  fi(Xij-,9^) fi(Xij',9[)  fi{Xif, 9\) SUPlog f\(Xij;9[) e[ee  < EgO 1  < oo,  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E by Assumption 4.4. This implies that  57  are integrable. Using the monotone  Z\j(Ng')  convergence theorem, we then have lim  =  EeoZ^N^)  Thus, we can choose of 'NJfi ^ for each  lim Z ( < ) > E*(log ffi''®\) )  E*  k* = k*(0i)  Q\ G  > 0.  f  l j  so that  0. Let e > 0 and  EgoZij(N^  ^)  k  N  0  > 0. Let  be the interior  be the open ball of radius e around  d\.  Now, 0 \ A^o is a compact set since 0 is compact. Also, {JV£:0i"ee\JV } o  is an open cover of 0 \ A^. Therefore, there exist afinitesub-cover, A ^, Ng*2, 7  such that  > 0,  EeoZ^iN*,)  1, 2, ...,p.  I=  We then have ^(ll*i  B , )  -0ill>e)  (flf 0  = Po e  for some  e  ;=i  ;=i  p  =E  1=1 P  V /  1  1  -  i=i  j=i  m  n;  /  _.  E E " M N k )  i=l j=l / -. ni  \  + - EE( * A  1  1  ^  m  m  i=l  \  1  j=l  1  i=l  rij  j=l  - ^)^-(^) < o  j=l  • =E ^ - E^(^) +- E E ^ (=1  (n)  \  n )  - «*) w . ) < o .  If we show the last expression goes to zero as n\ goes to infinity, then P*> ( i f r  0  ~ i\\ >e) —>0 as m'^oo.. e  /  NQP  4.1.  A S Y M P T O T I C RESULTS F O R T H E  Since EgoZ^N*,)  llog^f^]l)  (sup  < E  2  9i  follows from Lemma 4.2 that  n-i 1  58  < oo by Assumption 4.4, it  /  Hi  m  1  < K  '  1  WLE  i=i j=i  Also, 1  n i  -  p  Z (N* ) -A Xj  EgoZ (N* )  0[  Xj  > 0, for any 6[ e 9 \ N ,  g[  0  by the Weak Law of Large Numbers and the construction of NZ. Thus, for any 6[eO\N , 0  -X^(Ar;,) + -^^(AS 1  J=l  1  i=l  n )  —>0 as m ' - x x ) .  -^)^(iV;,)<0  j=1  /  This implies that /  P  ni  ^  rii  m  \  E *° - E ^ W ) + - E E ( ^ - ^)^(^) P  1=1  \  < 0 U 0 a a m 4 o o .  A  x  1  j=l  1  i=\ j=l  J  Thus the assertion follows, o In the next theorem we drop Assumption 4.1 which assumes the compactness of the parameter space and replace it with a slightly different condition. A t the same time we keep Assumption 4.2-4.5.  Theorem 4.3  Suppose  that for every 9 ^ 9 X  assume  X  logfi(x;9)  there is an open set Ng  0 < Ego I c  is a constant  for  all  inf  l o g  J  ; ) ^  f (X -9' ) x  C O . In  addition,  i:j  a  [  x  < K  J  G  < oo,  (4.6)  of 9 , •••,9 . 
2  of maximum  m  weighted  likelihood  e> 0  lim P*>(||0i -0 ||>e) = 0 , Bl)  1  ni—>oo  1  X  independent  Then for any sequence  X  Assume  subset C of Q such that 9 G C and  \e[£ccn@  where K  such that 9 e Ng  x  that there is a compact  in 9 for all x.  is upper semi-continuous  \  /  estimates  of 9  X  and  4.1. A S Y M P T O T I C RESULTS FOR T H E W L E for any 9 , 9 , 9 , 9 eQ,i 2  Proof:  3  m  Let  2,3,ra.  =  {  59  and e be as in the proof of Theorem 4.2, and let N*i,N* ,N*p  be  2  an open cover of C \ N with EgoZij(N^) > 0. Then Q  P* (\\%  - %\\ > *)  ni)  P  (^i  < Y, e° p  +  k)  N  ni) e  (^  n i )  e c n e) c  *=i k=l  (  1 _ _  1 _ _  nl  \  m  It follows from the proof of Theorem 4.2 that the first term of last expression goes to zero as n goes to infinity. By the Weak Law of Large Numbers, we have I E *«<C n 6) ^ ft. (_ igf (  I \—  2~^(X i=ij=i m  If we show that ^ ^ n i  > 0, by equation (4.6).  n9  Wi)Zij(C  c  fl 6)  P0  — 0 , then the second expression  goes to zero. Consequently, the result of the theorem will follow. Observe that m  rii  ^EE( S A  n i  n )  i=l i=\  -^)%(  G C n 0  )  = E ^ ( i=l  U l  A  i  n  )  •  - ^ E ^ n e ) . U i  j=l  By the Weak Law of Large Numbers, it follows that  inf^ c ne c  l€  l°Q f\x --9i)) 3  IS  a  ^^ n  e  number by the condition of this theorem.  By Assumption 4.5, it follows that Tl'  — {\r  n\  ( \  - wt)  —> 0, as m ->• oo.  (4.8)  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E  60  Combining equation (4.7) and (4.8), we then have m rii  ,  1  4.1.2  1=1  j=l  Asymptotic Normality  To obtain asymptotic normality of the WLE, more restrictive conditions are needed. In particular, some conditions will be imposed upon the first and second derivative of the likelihood function. For each fixed n, there may be many solutions to the likelihood equation even if the WLE is unique. However, as will be seen in the next theorem, there generally exist a sequence of solutions of this equation that are asymptotically normal. Assume that 0i is a vector defined in RP with p, a positive integer, i. e. 0i = (011, 012, • 0\ ) and the true value of the parameter is 0° = ( 0 , 0 ° , 6 \ ) . n  P  2  p  Write ip(x; 6i) = ——logfi(x; 0i),  a p column vector,  (701  and a p by p matrix.  tp =—ip(x;9x),  Then, for any j, the Fisher Information matrix is defined as  Assuming that the partial derivatives can be passed under the integral sign in f fi(x;6 )di/(x) x  E^{X ^) lj  = 1, we find that, for any =  j,,  J ^^^jf^x^duix)^  J  so that I(0i) is in fact the covariance matrix of tp,  I(e° ) 1  =  cov oiJj(X ;6 ). 0  e  lj  1  ^f (x;e° )du(x)=0, 1  1  (4.9)  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E  61  If the second partial derivatives with respect to 9\ can also be passed under the integral sign, then  f(d /d9 )fi(x; 2  = 0, and  9\)dv(x)  2  I(9 ) =  -E oi;(X ;9 ). 0  1  e  l  l  To simplify the notation, let WL (x;9 ) ni  = —logWL(x;8 )  1  and  1  0°)  WL (x; ni  0i)|„ o  = —logWL(x;  1=fl  In the next theorem we assume that the parameter space is an open subset of RP. 
Theorem 4.4 (1) for almost  Suppose: all x the first and second partial  to 9 exist, are continuous ff (x;9)du{x)  in 9 £ Q, and may be passed  exist three functions  Ego\Gi(Xij)\  2  < Ki < oo,l  each component (respectively  of fi(x;9)  through  with  respect  the integral  sign in  = l;  1  (2) there  derivatives  G\(x),  G^O^) and Gz(x) such that for all 9< ,---,9 , 1  = 1,2,3,,  i =  of ip(x) (respectively  G2(x)) uniformly  l,...,m,  ip( ))  «s bounded  x  in 9\ G 6.  and in some neighborhood in absolute  value  m  of 9\  by G\(x)  Further,  &logh{x;e,) d9i d9ik d9i ^ kl  ki,k,2,kz  = 1, ..,p, is bounded  (3) I(9\) is positive Then there exists weakly consistent  2  by Gs(x) uniformly  in 9\ G 0;  definite.  a sequence  of roots 9^^ of the weighted  likelihood  equation  that is  and  ^ ( f ' - ^ A i V ^ ^ ) ) Proof:  k  1. Existence  of consistent  roots.The  1  ) ) ,  as m->oo.  proof of existence of consistent roots  resembles the proof in Lehmann (1983, p 430-432). Let d be small enough so that  4.1. A S Y M P T O T I C RESULTS F O R T H E W L E S  a  62  : ||0! - 0°|| < a} C 6 and let  = {9  l  4(a) = {x : log WL(x; 0?) > log WL(x, 0?) for all boundary points 9\ of S } a  = {x : log WL(x; 0°) > sup log WL(x, 0*)}. e^eSa  The set I (a) is measurable since log WL(x.; n  9\)  is measurable and s u p  0  6  e5a  logWL(x,  9\)  is measurable by Proposition 4.1. We will show that Peo(X. for any given e, there exist  7 (a)) —>• 1 for all sufficiently small enough a. That is,  6  N  €  n  such that, for any n  > N, e  we have  Peo(X.  e / (a)) > n  1 — e. This implies that I (a) is not an empty set when n > N . To prove the claim, n  e  we expand the log weighted likelihood on the boundary of S about the true value 9\ a  and divide it by —log ni  to find WL(x; 9\) - —log rii  WL(x; 0?) = Si + S + S 2  3  where  Kl=l  2  P  P  2ni 1  j  " 6ni  Jfel=l*2=1  v  v  (EE  p E ( ^ i  ~  01*1) (0i*2  -  0?* )(0i*3 2  *1=1 *2 = 1 *3 = 1  i=i j=i and 4 W  = E E i=i j=i = 1  j  =  1  (^dlogfiixij-^i) A  39  i*i  10i=0?>  l 8 , = , t  '  - 0?*) 3  4.1. A S Y M P T O T I C RESULTS FOR T H E W L E  63  and |Cfcifc fc ( ij)l < 1 by assumption. a;  2  3  By the Weak Law of Large Numbers and (4.9) — 2, 1 ^  —  k=e?—^°  d logf (X ;9 )  p  2  l  — Z_^  lj  M  where Ikik {6i) is the 2  as ru-too,  l  e0  (fci,fc) element 2  0  —>"V^i)  k=«?  M  q  (4.10)  flsm->oo  (4.11)  of the information matrix 1(9®). By Lemma  4.2, we then have 1 v~>v>/,(n) —  2^ Z J  A  '  sdlogfi{Xij\6\). ~  >  p  l*i=«? ~ ^  7ti~^  Wi  ,  e0  0  a  S  n  ^  i  °°-  (  4  .  1  2  )  32;  - L D  1 - ^ ) - I ^ - - I M ^  a  m->«>•.  0  ( 4  1 3  )  i=i j=\ ^  m  rii  ^ iE =i (j=i ! A  n )  ni  - ')W^ w  i^°°'  0  (- )  flsn  4 14  To prove that the maximum difference -^log WL(x; 9\) — ^log WL(x; 9®) over all the boundary points 9\ of S is negative with P o probability tending to 1 for a  e  sufficiently small a, we will show that, with Pgo probability tending to 1, the maximum of S for all the boundary points 9\ of S is negative while 2  a  and  IS3I  are small  compared to | 5 | . 2  We begin with S\. Observe that —A {x) kl  = —  2^ j=i  M——1*1=*?  +  zr.l^Z^( i  ~ >—m  x  1  1=1  Wi  k=*?-  j=i  By (4.10) and (4.12), it then follows that, for any 9 , •••,9 , 2  - 4 ( X ) - ^ 0  as  Further, for any boundary point #J such that  00. —  iSii<^Ekwx)i. Bi  =1  m  = a, we have  (4.15)  4.1. 
A S Y M P T O T I C RESULTS FOR T H E W L E  64  For any given a, it follows from (4.15) that with Pgo probability tending to 1  - V  (4.16)  2  iti=i  1  \A (X)\<pa kl  and (4.16) then gives |Si|<pa  (4.17)  3  with P o probability tending to 1. e  Next consider p  2S  E E^  = -  2  p  k l  lfcl _  0° )I k (0i)(0ik2 kl  kl  - 0ik )  2  2  (4.18)  fc = l  1=  2  +E  -  +.W0?))(*i* a  fci=ifc =i  1  2  For the second term in the above expression, consider 1  ,  R  r  ( a  o,  1 ^df logfi(X ;9i) lj  i  (  =— > —™—™  —B +l [9i) klk2  ,  i  klk2  ni  ni ^ ^ E 2 >  +  i=i  e ey + Jfcifc l i) 1=  oVi c>V kl  0  2  y  1 lk2  l«irff-  -«0  j=i  By (4.11) and (4.13), it then follows that, for any ki and k , 2  —B  klk2  +I  M)  klk  ^  0  as m  oo.  (4.19)  Thus, for any boundary points such that \\9\ — 9°\\ = a, we have  I E X>*i "  -•*?*,) I < A -  (4.20)  2  A;i=lfc = l 2  By equation (4.19) and (4.20), it follows that, for given a v  IE  v  X .>(0i^, -— k ) )(( ^ d  • fe =ifc =i 1  2  B  k  ni  with Pgo probability tending to 1.  ^ + W * i ))(*!* - 0l )\ < p a 2  2  3  (4.21)  4.1. A S Y M P T O T I C RESULTS FOR T H E W L E  65  Let us examine the first term in (4.18). Since 1(8®) is a symmetric and positive definite matrix, there exits a matrix identity matrix and  B  t  -  b n  8° ,d  -  b  n  2  12  BB  = Diag(l,  l  where Si <  5 ,5 )B  —I(8\) = B Diag(5i,  Let £ = (6,6, ...,£„)* = B(8  such that  p x p  P  the  1 , 1 )  0,i = 1,2,  ...,m.  0° , ..,0> - 0°,)'. Then we have ||£|| = 2  2  p  ||0} - 0 ° | | = a . It follows that 2  2  V  V  rn  - E E o&i fc lfc2 1=  where S* — Combining  2  = l  max{5i,  (4.21)  w t f x c - *W = E<^ ^  and  (- )  V  4 22  (=1  < 0. We see that there exist  S ,S } 2  5  p  a  such that 5* +p ao 2  0  it follows that with P o probability tending to  (4.22),  e  1 there  exists c > 0 such that for a < a  0  S  Note that  m  m  1  (4.23)  Thus for 53 we only consider  lOtii^fca^)! < 1-  Tlx  < —ca . 2  2  i=i j = i -.rii  m  =- E  G  3 ( *  y  )  rii  + - E E( * A  (n)  "O^Oy).  i = l 3=1  J'=l  By the Weak Law of Large Numbers with probability tending to 1, m  <2(i + # )  1  (4.24)  3  Tii ^  J=l  where we use the inequality \EZ\ < E\Z\ < 1 + E\Z\ for any random variable Z. 2  By  (4.14),  it follows that with P o probability tending to 1 e  1  m  nt  (4.25) ' i J  . _ i  1  =i i = i  Hence by (4.24) and (4.25) , for any given a and'0f such that ||0j - 0?|| = \S \<^(1 3  +  K )a\ 3  < 0.  a,  (4.26)  4.1.  A S Y M P T O T I C RESULTS FOR T H E W L E  66  Finally combining (4.17), (4.23) and (4.26), we then have, with P o probability e  tending to 1, for a < ao, max(5 + S 1  e\€S  + (p + ~(1 +  + S ) < -ca  2  2  3  K ))a? z  2  a  which is less than zero if a < c/[p + ^(1 + K )). This completes the proof of our 3  claim that for any sufficiently small a the Pgo probability tends to 1 that max  log WL(x;  0f) < log WL{x, 0°).  e'leSa  For any such a and x G /71(a), it follows that there exists at least one point with ||0( ^ ni  —  9\\\ < a at  which WL(x;  9\)  9^  has a local maximum, -^-WL .(x; 0i)| _s( ; ni  n  0. Since Pgo (X G J„(a)) —> 1 as n —> 00 for all sufficiently small a, it then folx  lows that for such fixed a  >  0, there exists a sequence of roots  such that  9^{ \a) x  P o (ll^V) -0°|| < a) -» 1. e  It remains to show that we can determine such a sequence of roots, which does not depend on a. Let 0 ^ be the closest root to 9\ among all the roots to the likelihood equation for every fixed n\. This closest root always exists. 
The reason for this can be seen as follows. If there is afinitenumber of roots within the closed ball ||0i — Q±\\ < a, we can always find such a root. If there are infinitely many roots in that sphere which is compact, then there exists a convergent sequence of roots inside the sphere 0~f^ such that lim ||0^ — 0°|| = inf ||0i — 0i||, where V is the ni  set of all the roots to the likelihood equation. Then the closest root exists since the limit of this sequence of roots 0^ is again a root by the assumed continuity of the Thus 9  ^WL (9 ). ni  { lY  1  x  2. Asymptotic Normality.  does not depend on a and P o (||0i * - 0?|| < a) -> 1. ' ni)  e  Expand  »1  WL (x-9 ) ni  1  = WL (x;9 )+ 0  ni  1  WL(x; 0x)as  -^-log m  rii  / ^^A J o  l  { n  i=i j=i  V ( ^ ; 0 ? + i(0 -0?))a't(0 -0?), 1  1  4.1. A S Y M P T O T I C RESULTS FOR T H E W L E rii  m  , >  = EE  where WL (x;9\) ni  67  A^VO^?)-  i=ij=i Now let #i = #~[ \ where 9^ is any weakly consistent sequence of roots satisfying ni  WL (x; (j^) = 0, and divide by y/n[ to get: —WL {x-X) = B ^ T ( 0 S ni  l  ni  ni)  niV  - 0?),  (4.27)  where r  B  _.  1  1  m  „- / £ £ * ^ ; ?+*$ A  ni  n )  e  ni)  -  Note that m  WL (x;9°)  m  nt  E £ ^ ( ^ ) +££(\ w  =  ni  rii  i=l  X  j=l  i=l  j=l  ( n )  -«;0^«;O  _;'=1  j=l  i=l  By (4.27), it follows that *tl  ^  V  1  lit  3=1  V  1  i =  IL-i  i j=i  From the Central Limit Theorem, because E oip(Xij-,9°) = 0 and cov oil>(Xij]9l) e  1(9°),  e  =  we find that l  "*  V i n  If we show -±= £  i = 1  E(A- n)  w^X^  9°) %  0 and B  ni  ^  1(9°),  then by the  i=lj=l  multivariate version of Slutsky's theorem (see for example, Sen and Singer (1993), p. 130.) we have ±=(§™ - 91) = B~l^=WL  l  77.1  V* ! 1  ni  A  my^Z*  ~ JV(0,/(6I?)- ) 1  4.1.  A S Y M P T O T I C RESULTS F O R T H E W L E  68  Now we prove i=ij=i  - wOV'^ijI^i)-  Let Kx = E E( ! A  1=1 J'=l  We then have  <  (-1=1114,11 > e ) ^ *  '  g-EEEEi . " -«'.ii^ -'»-i A  (  >  )  i = l i' = l j = l j ' = l  < 0(  J —>• 0,  \™i/  —>• co,  as  by hypothesis (2) of this theorem. (ii)  B  ni  —  ^  —oo.  1(9®) as n\  Let B = < + B , where H  ni  i  < = --/  First, we prove B  7  S  p  = {9i  70  + t ^ - e ^ d t ,  0  i=l j=l  1(9®) as n\ —> oo.  !  Let  /^m^Oi  - j=i  1  711  /-l «i  ni  : ||0i -  9\\\ < p}.  Note that  is continuous in 0], by  E oip(Xij,0i) e  condition (1), so there is a p > 0 such that < P=> lEeo^Xij-A)  + T(9°)\<  e.  (4.28)  For any t € TZ such that 0 < t < 1, then l l ^ ^ - ^ I K P ^ I I ^  By equation  MI$  (4.28) B1)  and  " *ill  + ^i" -^) 0  0?||<P-  (4-29)  (4.29), we then have  < P) <  0? + ^ i "  0  - 0?)) + Z(0?)|  <-c).  4.1. A S Y M P T O T I C RESULTS FOR T H E W L E Note that P o ( | | ^  69  - 0j|| < p) —> 1. We then have  n i )  0  Peo(\Ego7p(X f,e  + t(9^ -9° ))  0  + I(e Y < e ) . — • ! as n 0  l)  1  l  x  o o .  (4.30)  This result implies that P o (9° + t ( 0 i  - 0°) eS )-^l  ni)  e  (4.31)  as m -»• oo.  p  From the Uniform Strong Law of Large Numbers, Theorem 16(a) in Ferguson (1991), with P o probability 1, there is an integer N[ such that e  Til  Hi  >N  sup\—f]TJ>{Xi ,e )-E oi>(Xu,6i)\<t-  =>  r  j  flies,  l  9  (4.32)  i  n  Then, assuming N is so large that rii > A / => ||0[ ^ — 0j|| < p, then ni  7  ' -W)\ < /  -^(x ,9 t(9(r -^))+m) ]  nx  j  =  °  /  Jo  n  1n  j=i  i  •  i eo  v  0? + i ( 0 i  lj  1  /  1  SUp  V  SiGS  p  n  ir  v  "1  ni)  - 0°)) + 7(0?)left  i  -^(X^ej-Ego^Xij-A)  ^7  0? + i(0~! - 0?)) + 7(0?) )dt  *W  + sup  7  = 1  r / JO  .  
- 0?)) - Egoip (X , 9° + t(9^ - 9  - £ ^ fe: *? +  +E ^(X ; <  dt  1+  ir  m)  0<t<l  0 as ni —> oo by equation (4.30), (4.31) and (4.32). p Secondly, we prove B™ — -> Q as rii —> oo. By Lemma 4.2, every component of e  Bj/ goes to 0 in probability. Thus  \ "\ < B  ^ - E E h - t m x - 0 ? + ( 9 ^ - 0?))  This completes the proof, o  ]  ]  l 3  1  dt  0 as ni  oo.  4.1.  A S Y M P T O T I C RESULTS FOR T H E W L E  70  R E M A R K : If there is a unique root of the weighted likelihood equation for every n, as in many applications, this sequence of roots will be consistent and asymptotically normal. Small et. al. (2000) discuss the multiple root problems in estimation and propose various methods for selecting among the roots if the solution to the estimating equation is not unique.  4.1.3  Strong Consistency  We prove strong consistency of the WLE in this subsection. Recall that (X, 8 ) = 1 £ f > -  ±S  X  ni  tWj^^y  To prove the strong consistency, we prove the following lemma: Lemma 4.3  Under Assumptions 4-1- 4-5  sup —S (x,0i)  (4.33)  ni  for any d , 0 , ...,8m, 8 G 6, i = 2, 3 , m . 2  Proof:  3  {  By Lemma 4.2, we then have  /i(*«j,g?) /i(*y,«i)  '  It follows by the Borel-Cantelli Lemma that, m rii 1  1=1  7=1  By the definition of Aij, it follows that  0, a.s. [Poo].  4.1.  A S Y M P T O T I C RESULTS FORT H E W L E  71  Therefore, we have sup —S (x;0i) —)• 0, a.s. [P o] ni  e  for any 0j ^ 0°, 0 , 0 , 0 , 0, G 6, i = 1, 2 , m . 2  T h e o r e m 4.5 (1) Q is  3  o  m  Suppose:  compact;  (2) logfi(x;  0) is upper semi-continuous  (3) there exists a function for all x and  K(x)  in 0 for all x;  such that E o\K(X\)\  <  g  oo and / o g ^ ^ , ' ^ < 1  K{x),  0 € O;  Then for any sequence  of maximum  weighted  likelihood  estimates  9^  of 9\,  0™ —• 0? a.s. [P, ] ?  / o r any 0 , 0 , 9 , 0, G O, i = 2, 3 , r a . 2  3  m  Proof: Let 9\ be the parameter of interest and let 1. , W  =  ni  1  —logWL(9 ) 1  = - E E -( ..  ni  l  m  A  U l  -—logWL(9 i)  W/I(A^-;  00 - '°s/i(*y; 0?))  j=l i=l  1  = -E (^'^) + - " ( '^) 1  n l  c/  5  3=1  ^hereU(X ,9 ) lj  =  1  l o g ^ ^  )  1  x  .  Let p > 0 and I> = (0i G 0 : ||0i - 0°|| > p). Then D is compact by condition (1). We then have, (c.f. Ferguson 1996, p. 109) P o (limsup sup — V r / p f y ^ ) < sup p(0 ) 1=1, e  :  y ni->oo 8 eD ni ~(  dieD  x  where p(0i) = /  log^^h{x;9\)du{x)  <  J  0 for 0i G D by Lemma 4.1.  (4.34)  4.1.  ASYMPTOTIC RESULTS FOR T H E W L E  72  Furthermore, p,(9i) is upper semi-continuous since Hid,) >  f \imsuplog )^ f  f^e^duix)  ei  > limsup  ]  /l g -^p^Ux-9®)du{x). f  0  Hence p(9i) achieves its maximum value on D. Let 5 = sup p(9i); then 5 < 0 by Lemma 4.1 and Assumption 4.3. Then by Lemma 4.3, with P o measure 1, there d  exists an Ni such that, for all n > Ni, x  1  sup  <  S (X.,9i) ni  ni  0ieD  -5/2.  Observe that  (  1  1  711  \  1  - E ^ i ^ i ) + -5 (X e ) n  l ~ [  n  l  )  ril  n i  Sni (X,  < sup -  1  J  B x t D U x ^  J2u(X ;e )+su lj  l  eD  #1)  ni V  It follows that, with P o measure 1, there exists an ./V such that for all ni > N, dl  e  sup sup ( — jhu{X ;9 ) lj  6i&D h&D  \  J  + —S (XA))  1  ni  l ^  n  "1  <5/2<Q.  But, for all ni > AT,  1  ni *i  1  m  U(Xtf 0? ) + S (X; }  ni  p7  8~t ) = sup ]  i  / 1 -  0iee \n\  n  1  711  V U(X  lj;  6,) + - 5  ^  \  B 1  ( X , 9{)  > 0.  J  " i  since 1 ni n i J2U(X ;9° ) j=i l]  + -S (X;9 ) 0  1  ni  = 0  U l  This implies that the W L E , 0~j e D for m > N; that is | | ^ ni)  c  n i )  - 0 \\ < p. 
Since p X  is arbitrary, the theorem follows, o . The proof of the above theorem resembles the famous proof given by Wald (1949) which established the strong consistency of M L E except that we have to deal with the extra term,  ^S (X.,8i). ni  Again, a slightly different condition is required if 6 is not compact.  4.1.  73  A S Y M P T O T I C RESULTS FOR T H E W L E  Theorem 4.6 Suppose: (1) there is a compact subset C of 0 such that 9\ e C and Ego  sup log < 0; 0 0],ec ne Jiv ij; iJ J  c  f,2) i/iere exzsi  A  y  a function K(x) such that E o\K(X)\ < oo e  and loff^jgoj <  K(x), for  all x and 9 € C; (3) logfi(x; 9) is upper semi-continuous in 9 for all x; Then for any sequence of maximum weighted likelihood estimates 9^^ of 9\, a.s. [Pgo]  fifO _ > for any 9 ,9 , ...,0 ,9i € Q,i = 2,3, ...,m. 2  3  m  Proof: Let D = (9i : \\9i - 9°\\ > p) as in the proof of Theorem 4.5 such that C n D ± <f>. It follows that C fl D is also compact. It follows from the proof of Theorem 4.5 that, with PQO measure 1, there exits an Ni such that, for all n  sup (-JTu(Xi ,9i) r  where U(X^  +  x  > Ni,  -S (X,9i))<5/2<0, ni  0) = X  Also, with P o measure 1, there exits an N e  2  1 .  W l  such that, for all n i >  N,  .  —V  sup t/(Xi,-;0i) < 5  (4.35)  by the Strong Law of Large Numbers and the fact that Ego c U(Xij; eC  nQ  A s in the proof of L e m m a 4.3, it can be shown that  —  sup S (X;0i) —•(), a.s. [Pgo].  ni 0iec ne c  2  ni  9{) < 0.  4.2.  A S Y M P T O T I C PROPERTIES OF A D A P T I V E W E I G H T S  74  It follows that, with Pgo measure 1, there exist an A^, such that, for all n\ > N , 3  1 —V  1  n i  sup  i~~^eiec ne  n  c  U(X ;9 ) lj  + —  1  sup 5 (X;0 ) < ni  0iec ne c  x  5/2 <  0.  It implies that, with Pgo measure 1, for all rii > N , 3  1  n i  sup —J"U(X ;9 )+  flieone  lj  ~ ^  1  sup  1 —  e i € C n e ™i  S (X;0i) < 0.  (4.36)  ni  c  Therefore, it follows that, with Pgo measure 1, there exist an N* = max(N2, N ), 3  such that for all n > N*, sup \—f]u(x ;e ) lj  1  + —s (x-,e )) ni  1  <s<o.  0i es  But 1  n  1  i  - V  m j^t  [/(^s^V-^n^X;^ ) 0  i  n  = sup  / 1  711  1  - V ^ l , ^ ! ) + -5 (X;0 )  0iee \ni  n  n i  x  1  \  J  > 0.  since the sum is equal to 0 for 9\ = 9\. This implies that the WLE, 0\ G D for c  rii > N*; that is, H ^ " ^ ~ 9 \\ < p. Since p is arbitrary, the theorem follows, o . 1  X  4.2  Asymptotic Properties of Adaptive Weights  At the practical level, we might want to let the relevant weight vector to be a function of the data. This section is concerned with the asymptotic properties of the WLE using adaptive weights. Assumption 4.1-4.4 are assumed to hold in this section.  4.2.1  Weak Consistency and Asymptotic Normality  In this subsection, we adopt the following additional condition: Assumption 4.6 (Weak Convergence Condition). Assume:  (i) lim ^- < oo, for i = 1,2, ...,m;  4.2.  A S Y M P T O T I C PROPERTIES OF A D A P T I V E W E I G H T S  (ii) the adaptive  relevant  weight vector  A^ ^(X) = (A[ ^(X), A " ^ ( X ) , A m ^ ( X ) ) * n  n  2  75 sat-  isfies, for any e > 0,  A^ (X) where (wi,w ,  —> oo,  as n\  Wi,  = (1,0,0)*.  wy  2  —>  m  Let ^ ( X ,  = J gg(A}" (X)-«,)^^gg. ,  W  r  We then have the following lemma. Lemma 4.4  If the adaptive  relevance  weight  (X, 0 ) X  for any 9 ,9 , 2  3  %  vector  0,  as  satisfies  Assumption  nx -»• oo  ...,0 ,0 G 0,z = 2,3, ...,m. m  Proof: Let Tj =  fl  i  for i = 1, 2 , m . 
Then  9 \\f-fe%  lo  f  f  i  1  ^  m  = ^(nx)-,)r, m  1  By the weak law of large numbers, for any i = 1, 2 , m ,  ru  nt ^  fi [Xij;  0!)  / i (A%-; 0 ) X  It then follows that, for any i = 1, 2 , m ni V for any # ,#3, 2  9m,^  / n,  :  6 0,J = 2 , 3 , m , by Assumption 4.6. o  We then have the following theorems:  4-6, then  4.2.  A S Y M P T O T I C PROPERTIES OF A D A P T I V E W E I G H T S  Theorem 4.7  For each 9\, the true value of 6i, and each 9i ^ 9® i> m rii m ni  (  nn/i(^^?) " A,(  for any 9 , 9 , 9 2  3  Theorem 4.8 any sequence adaptive  , 9 € 6, i = 2, 3 , m .  > nnA(^i«i)  t=lj=l  that the conditions  of maximum  weighted  = i. J  of Theorem  likelihood  \  t ) ( x )  i=lj=l  {  Suppose  estimates  4-2 are satisfied. 9^  Then for  of 9\ constructed  with  weights X\ (X), and for all e > 0,  for any 9 , 9 , 9 2  3  Theorem 4.9 any sequence adaptive  m  )(x)  76  m  , Qi e O, % = 2, 3 , m .  Suppose  that the conditions  of maximum  weights  Aj(X),  weighted  and for all  likelihood  of Theorem estimates  4-3 are satisfied.  Then for  9^^ of 9\ constructed  with  e>0  lim P o ( | | ^ - ^ | | > ) = 0 , l )  e  for any 9 , 9 ,9 , 2  3  m  9iG&,i  e  = 2, 3 , m .  We remark that the proofs of Theorem 4.7 - 4.9 are identical to the proofs of Theorem 4.1 - 4.3 except that the fixed weights are replaced by adaptive weights and the utilization of Lemma 4.2 is replaced everywhere by Lemma 4.4. We are now in a position to establish the asymptotic normality for the WLE constructed by adaptive weights. We assume that the parameter space is an open subset of RP. Theorem 4.10  (Multi-dimensional)  satisfied.  there  Then  based on adaptive  exists  weights 9^  Suppose  a sequence  that the conditions  of roots of the weighted  that is weakly consistent  of Theorem  4-4  likelihood  function  and  as ni —>  oo.  a  r  e  4.2.  ASYMPTOTIC PROPERTIES OFADAPTIVE  4.2.2  WEIGHTS  77  Strong Consistency by Using Adaptive Weights  To establish the strong consistency of the WLE constructed by the adaptive weights, we need a condition that is stronger than Assumption 4.6. We hence assume the following condition: A s s u m p t i o n 4.7 (Strong  Convergence  Condition)  Assume  that:  (i) lim ?i < oo, for i = 1,2, ...,m; ni->-oo  U  1  (ii) the adaptive  relevant  weight vector \W(X) = ( A  (  N ) 1  (X), A  N ) 2  ( X ) , A £ ( X ) ) * sat}  isfies  AJ (X) B)  where ( w , w , w 1  2  m  )'  wt, a.s. [P ], 0O  = (1, 0 , 0 ) * .  L e m m a 4.5 If the adaptive  relevance  weight  vector satisfies  0, a.s.  for any 9 ,8 , ...,9 ,.0i eO,i 2  3  m  =  Assumption  4-7, then  [Pgo],  2,3,..., m.  By the Strong Law of Large Numbers, for any i = 1, 2 , m ,  where Eg An = Eg sup log /i(*u;0i) < oo. This implies that, for any i = 1, 2 , m , 0i ee 0  0  0, a.s. by Assumption 4.7. Since  [P o] e  4.3.  EXAMPLES.  78  it then follows that 0, a.s. [P o].  sup  e  0iG0  This completes the proof.o We then have the following theorems: Theorem 4.11 sequence weights  Suppose  of maximum  the conditions  of Theorem  weighted likelihood  estimates  4-5 are satisfied.  Then for any  9^^ of 6\ constructed  by adaptive  A^(X),  for any 9 ,9 , ...,9 ,9i G Q,i = 2,3, ...,m. 2  3  Theorem 4.12 sequence weights  m  Suppose  of maximum  the conditions  weighted  likelihood  estimates  4-6 are satisfied.  Then for any  9^^ of 9\ constructed  by adaptive  Aj(X), •  for any 9 ,9 ,...,9 ,9i 2  4.3  of Theorem  3  m  el a.s. [Pgo],  §M  e Q,i =  2,3,...,m.  Examples.  
In this section we demonstrate the use of our theory in some examples.  4.3.1  Estimating a Univariate normal M e a n .  Suppose Xij are independent random variables that follow a normal distribution with mean 9i and variance 1. Assume 0 = (—00,00) and C = [—M,M]. We need to verify the condition that, for 9\ G C, 0 < Ego ^inf ' c ^^/iffi'a' ) ) — 1  fl i6C  K < 00 for some constants M and K . c  c  n0  4.3.  EXAMPLES.  79  We then have -\{x-e\f  if  -\{x  - 9lf + \{x - M )  2  if  -\{x  - 9\) + \{x + M )  2  if  2  \ X \ > M ,  0<x<M, -M<x<0.  It then follows that Eo  inf  e  log  f (X-- 9°) ' = hi + I 1  tJ  +I ,  1  i2  i=  i3  l,2,...,m,  where = -  hi  f  \{x-9«Y~exp-^ l dx, 2 2  |x|>M M  1  2  1  =  / 0  "  -  +  - ^  )  M)2  e  X  P  "  {  X  ~  9  i  )  2  /  2  d  X  '  0  7i = /  [~(x - 0?) + \{x + M ) ) - i ^ c x p - ^ ) ^ . 2  3  2  8  —M  The first term ^ goes to zero as M goes to infinity. It can be verified that hi + hi = M  + o(M ).  It then follows that there exist  for  M  If we choose  2  1, 2,  2  >  MQ.  K  =  C  M  0  >  0 such that  I  u  + hi  + hi  > 0  2M , it then follows that, for i = 1,2, ...,m,j 2  0  =  ...,Tli,  /i(-Xij-;0?) \. 0< £o 0  4.3.2  inf  log ;\' \)[ J  J  2  ) <2M  Z  < oo, /or a« M > M . 0  Restricted N o r m a l Means.  A simple but important example is presented in this subsection. That problem is treated by van Eeden and Zidek (2001). Let Xn,  Xi  ni  be i.i.d. normal random  variables each with mean 9\ and variance a . We now introduce a second random 2  sample drawn independently of the first one from a second population:  X \, 2  ...,X , 2n2  4.3.  EXAMPLES.  i.i.d.  normal random variables each with mean  80 and variance  6  2  a. 2  Population 1  is of inferential interest while Population 2 is the relevant population. However, 102  —  I < C for a known constant C > 0. Assumptions 4.2 and 4.3 are obviously  satisfied for this example. The condition (4.6) in Theorem 4.3 is satisfied as shown in the previous example. If we show that Assumption 4.5 is also satisfied, then all the conditions assumed will be satisfied for this example. To verify the final assumption, an explicit expression for the weight vector is needed. Let  mXi. =  ij,  x  i = l,...,m,V  and  = Cov{{X .,X ) ) t  l  2  B  = (0,  C) . f  It follows  that  It can be shown that the "optimum" WLE in this case, the one that minimizes the maximum MSE over the restricted parameter space, takes the following form — ^ I ^ - I . ~r /\ ^V2.)  "l  2  where (V +  (A;,A;)*  BB )- ! 1  1  l^V + BB^-n'  We find that  It follows that 1\V  +  BB )- l l  1  l  u /ni 2  Thus, we have a /n +C 2  A; =  2  1  + a /n 2  2  + C  4.3.  EXAMPLES.  81  Finally K  =  1— ' \a*  n  ni/  2  Estimators of this type are considered by van Eeden and Zidek (2000). It follows that |A| ^ — ni  Wi\ = O(^),  i —  1, 2. If we have n = 2  0(n ~ ), 2  s  then Assump-  tion 4.5 will b e satisfied. Therefore, we do not require that the two sample sizes approach to infinity at the same rate for this example in order to obtain consistency and asymptotic normality. The sample size of the relevant sample might go to infinity at a much higher rate. Under the assumptions made in the subsection, it can be shown that the conditions of Theorem 4.4 are satisfied. The maximum of the likelihood estimator in this example is unique for any fixed sample size. Therefore, we have  4.3.3  Multivariate Normal  Means.  Let X = ( A i , X ) , where for i = 1,..., m, m  x  i  =  Y^Xij/m *~ N(6i, l/m). 
i=i  Assume that the Oi are "close" to each other. The objective is to obtain a reasonable good estimate of 6\. If the sample size from the first population is relatively small, we choose WLE as the estimator. In the normal case, the WLE, 9i, takes the following form: m  9\  =^  \Xi.  4.3.  EXAMPLES.  82  Note that the James-Stein estimator of the parameter 9 = (9i,...,0 ) is given by m  C ( X ) = ( C i ( X ) , . . , U X ) ) , where  /  \  m-2  1  C*(X) =  Xi.  The quantity,  P-2  1 -  m  _  '  i=i  can be viewed as a weight function derived from the weight in the James-Stein estimator. Consider the following choice of weights of James-Stein type : Ai(X)  1-  =  1  m - 2  1+6  rn  _  )  i=l  1 m  -  , 1 I  m - 2 - „  m  )'  z  _  2  '  3  ' ' • • '  m  '  t=i  for some 5 > 0 and c > 0. It can be verified that E A* = 1 and A; > 0, i =  2,3,m.  i=i  It follows that Peo(  1 "i-  m-2  m-2  + c  Y.x? 2=1  i=i  <  m-2„ 1 , -geonl+ e c 1 4  = O  5  n  (since Xf > 0)  i+5 / •  We then have 1 n  i  m-2 E A ? i=l-  > e) = O  +C  n  1+6  P ( A i < 0) = O e 0  n  1+6  (4.37)  (4.38)  4.4.  CONCLUDING  REMARKS  83  .We consider the following two scenarios (i) If we set 5 = 0, it follows that Aj(X) - u>i for  i  0,  = 1,2, ...,m. Assumption 4.6 is then satisfied. Therefore, the weak consistency  and asymptotic normality of the WLE constructed by this set of weights will follow. (ii) If 5 > 0, Ai(X) - Wi for  i  0, a.s. [P o], e  = 1,2, ...,m, then this leads to strong consistency. Since strong consistency  implies weak consistency, asymptotic normality of the WLE using adaptive weights will follow in this case.  4.4  Concluding Remarks  In this chapter we have shown how classical large sample theory for the maximum likelihood estimator can be extended to the adaptively weighted likelihood estimator. In particular, we have proved the weak consistency of the latter and of the roots of the likelihood equation under more restrictive conditions. The asymptotic normality of the WLE is also proved. Observations from the same population are assumed to be independent although the observations from different populations obtained at the same time can be dependent. In practice weights will sometimes need to be estimated. Assumption 4.6 states conditions that insure the large sample results obtain. In particular, they obtain as long as the samples drawn from populations different from that of inferential interest are of the same order as that of the drawn from the latter.  4.4. C O N C L U D I N G  REMARKS  84  This finding could have useful practical implications since often there will be a differential cost of drawing samples from the various populations.  The overall cost  of sampling may be reduced by judiciously collecting a relatively larger amount of inexpensive data, that although biased, nevertheless increases the accuracy of the estimator.  Our theory suggests that as long as the amount of that other data is  about the same as obtained from the population of direct interest (and the weights are chosen appropriately), the asymptotic theory will hold.  Chapter 5 Choosing Weights by Cross-Validation 5.1  Introduction  This chapter is concerned with the application of the  cross-validation criterion  to  the choice of optimum weights. This concept is an old one. In its most primitive but nevertheless useful form, it consists of controlled and uncontrolled division of the data sample into two subsamples. 
For example, the subsample can be selected by deleting one or a few observations or it can be a random sample from the dataset. Stone (1974) conducts a complete study of the cross-validatory choice and assessment of statistical predictions. Stone (1974) and Geisser (1975) discuss the application of cross-validation to the so-called  K-group  problem which uses a linear combinations  of the sample means from different groups to estimate a common mean. Breiman and Friedman (1997) also demonstrate the benefit of using cross-validation to obtain the linear combination of predictions to achieve better estimation in the context of multivariate regression. Although there are many ways of dividing the entire sample into subsamples such 85  /  5.1.  INTRODUCTION  .  86  as a random selection technique, we use the simplest  leave-one-out  approach in this  chapter since the analytic forms of the optimum weights are then completely tractable for the linear WLE. We will denote the vector of parameters and the weight vector by  0  =  for  i  = 1,2,  (di, 9 , 9 ) 2  m  ...,m.  Let  and  A  X°  and  pt e  —  (Ai, A , A ) respectively. Assume that ||0||.< oo 2  A°  pt  m  be the optimum weight vector for samples with m  equal and unequal sizes. We require that E \ = 1 in this chapter. i=i Suppose that we have m populations which might be related to each other. The probability density functions or probability mass functions are of the form  fi(x;9i)  with 6i as the parameter for population i. Assume that -Xii)  X12,  X13,  Xi  X21, X22, X23,  Xml,  X 2, m  ~  ni  X2  n2  X ,  X  m3  fi(x;9i)  ~ f2(x]0 ) 2  ~ fm( ',9m) x  mrim  where, for fixed i, the {Xij} are observations obtained from population i and so on. Assume that observations obtained from each population are independent of those from other populations and E(Xij)  = 4>(9i),j  = 1, 2 , r i j . The population parameter  of the first population, 9 , is of inferential interest. Taking the usual path, we predict X  Xij  by <f>(9[~ ), the WLE of its mean without using the j)  X\j.  Note that  <j>(9~[  function of the weight vector A by the construction of the WLE. A natural measure for the discrepancy of the WLE is the following: D(X) = f2(x -d>(9[^)) . 2  l3  (5.1)  The optimum weights are derived such that the minimum of D(\) is achieved for m  fixed sample sizes n\, n , n  and E A* = 1. j=i We will study the linear WLE by using cross-validation when E(Xij) = 9i,j = 1.2, ...nj for any fixed i. The asymptotic properties of the WLE are established in 2  m  is a  5.2. L I N E A R W L E F O R E Q U A L S A M P L E SIZES  87  this chapter. The results of simulation studies are shown later in this chapter.  5.2  L i n e a r W L E f o r E q u a l S a m p l e Sizes  Stone (1974) and Geisser (1975) discuss the application of the the cross-validation approach to the so-called K-group problem. Suppose that the data set  S  consists of  n observations in each of K groups. The prediction of the mean for the ith group is constructed as: (ii = aXi. m  where Y.. =  n  + (1 — a)X_. n  E ^ij d X^ = ^ ]T) -^0'- ^ a n  i=lj=l  1  e a r e  interested in group 1, then  j=l  the prediction for group 1 becomes A  w  = ( i - ^  Q  ) x  1  .  +  g  ^  „  .  We remark that the above formula is a special form of linear combination of the sample means. The cross-validation procedure is used by Stone (1974) to derive the value of ct. We consider general linear combinations. Let 0^ denote the WLE by using the cross-validation rule when the sample sizes are equal. 
If $\phi(\theta) = \theta$, the linear WLE for $\theta_1$ is then defined as

$$\tilde\theta_1^e = \sum_{i=1}^{m}\lambda_i\bar X_{i.}, \qquad \sum_{i=1}^{m}\lambda_i = 1.$$

We assume $n_1 = n_2 = \ldots = n_m = n$ in this section.

In this section, we cross-validate by simultaneously deleting $X_{1j}, X_{2j}, \ldots, X_{mj}$ for each fixed j. That is, we delete one data point from each sample at each step. This might be appropriate if these data points are obtained at the same time point and strong associations exist among the observations. By simultaneously deleting $X_{1j}, X_{2j}, \ldots, X_{mj}$ for each fixed j, we might also achieve numerical stability of the cross-validation procedure. An alternative approach is to delete a data point from only the first sample at each step; it will be studied in the next section.

Let $\bar X_i^{(-j)}$ be the sample mean of the ith sample with the jth element of that sample excluded. A natural measure of the discrepancy of $\tilde\theta_1^e$ is

$$D_e^{(m)}(\lambda) = \sum_{j=1}^{n}\Bigl(X_{1j} - \sum_{i=1}^{m}\lambda_i\bar X_i^{(-j)}\Bigr)^2 = \sum_{j=1}^{n}X_{1j}^2 - 2\sum_{i=1}^{m}\lambda_i\sum_{j=1}^{n}X_{1j}\bar X_i^{(-j)} + \sum_{i=1}^{m}\sum_{k=1}^{m}\lambda_i\lambda_k\sum_{j=1}^{n}\bar X_i^{(-j)}\bar X_k^{(-j)} = c(X) - 2\lambda^t b_e(X) + \lambda^t A_e(X)\lambda,$$

where $c(X) = \sum_{j=1}^{n}X_{1j}^2$, $(b_e(X))_i = \sum_{j=1}^{n}X_{1j}\bar X_i^{(-j)}$ and $(A_e(X))_{ik} = \sum_{j=1}^{n}\bar X_i^{(-j)}\bar X_k^{(-j)}$, $i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, m$. An optimum weight vector by the cross-validation rule is defined to be a weight vector which minimizes the objective function $D_e^{(m)}$ and satisfies $\sum_{i=1}^{m}\lambda_i = 1$. For expository simplicity, let $b_e = b_e(X)$ and $A_e = A_e(X)$ in this chapter.

5.2.1 Two Population Case

For simplicity, first consider the case of two populations, i.e.

$$X_{11}, X_{12}, X_{13}, \ldots, X_{1n} \sim f_1(x;\theta_1), \qquad X_{21}, X_{22}, X_{23}, \ldots, X_{2n} \sim f_2(x;\theta_2),$$

with $E(X_{1j}) = \theta_1$ and $E(X_{2j}) = \theta_2$ respectively. Let $\sigma_1^2$ and $\sigma_2^2$ denote the variances of $X_{1j}$ and $X_{2j}$, and let $\rho = cor(X_{1j}, X_{2j})$. Let $\theta^0 = (\theta_1^0, \theta_2^0)$, where $\theta_1^0$ and $\theta_2^0$ are the true values of $\theta_1$ and $\theta_2$ respectively.

We seek the optimum weights such that $\lambda_1 + \lambda_2 = 1$ and they minimize the objective function

$$D_e = \sum_{j=1}^{n}\bigl(X_{1j} - \lambda_1\bar X_1^{(-j)} - \lambda_2\bar X_2^{(-j)}\bigr)^2 - \gamma(\lambda_1 + \lambda_2 - 1).$$

Differentiating $D_e$ with respect to $\lambda_1$ and $\lambda_2$, we have

$$\frac{\partial D_e}{\partial\lambda_1} = -2\sum_{j=1}^{n}\bar X_1^{(-j)}\bigl(X_{1j} - \lambda_1\bar X_1^{(-j)} - \lambda_2\bar X_2^{(-j)}\bigr) - \gamma = 0,$$

$$\frac{\partial D_e}{\partial\lambda_2} = -2\sum_{j=1}^{n}\bar X_2^{(-j)}\bigl(X_{1j} - \lambda_1\bar X_1^{(-j)} - \lambda_2\bar X_2^{(-j)}\bigr) - \gamma = 0.$$

Subtracting, it follows that

$$\sum_{j=1}^{n}\bigl(\bar X_1^{(-j)} - \bar X_2^{(-j)}\bigr)\bigl(X_{1j} - \lambda_1\bar X_1^{(-j)} - \lambda_2\bar X_2^{(-j)}\bigr) = 0.$$

Note that $\lambda_1 + \lambda_2 = 1$. We then have

$$\lambda_2^{opt}(X) = \frac{\sum_{j=1}^{n}\bigl(\bar X_1^{(-j)} - \bar X_2^{(-j)}\bigr)\bigl(\bar X_1^{(-j)} - X_{1j}\bigr)}{\sum_{j=1}^{n}\bigl(\bar X_1^{(-j)} - \bar X_2^{(-j)}\bigr)^2}, \qquad \lambda_1^{opt}(X) = 1 - \lambda_2^{opt}(X). \qquad (5.2)$$

Lemma 5.1 The following identity holds: $\lambda_1^{opt} = 1 - \lambda_2^{opt}$ and $\lambda_2^{opt} = S_2^e/S_1^e$, where

$$S_1^e = \frac{1}{(n-1)^2}\Bigl[n(n-2)\bigl(\bar X_{1.} - \bar X_{2.}\bigr)^2 + \frac{1}{n}\sum_{j=1}^{n}(X_{1j} - X_{2j})^2\Bigr], \qquad S_2^e = \frac{n}{(n-1)^2}\bigl(\hat\sigma_1^2 - \widehat{cov}\bigr),$$

with $\hat\sigma_1^2 = \frac{1}{n}\sum_{j=1}^{n}(X_{1j} - \bar X_{1.})^2$ and $\widehat{cov} = \frac{1}{n}\sum_{j=1}^{n}(X_{1j} - \bar X_{1.})(X_{2j} - \bar X_{2.})$.

Proof: Observe that

$$\bar X_1^{(-j)} = \frac{1}{n-1}\bigl(X_{11} + \ldots + X_{1,j-1} + X_{1,j+1} + \ldots + X_{1n}\bigr) = \frac{n}{n-1}\bar X_{1.} - \frac{1}{n-1}X_{1j} = e_n\bar X_{1.} - \frac{1}{n-1}X_{1j},$$

where $e_n = \frac{n}{n-1}$. Let $S_1^e = \frac{1}{n}\sum_{j=1}^{n}(\bar X_1^{(-j)} - \bar X_2^{(-j)})^2$. It then follows that

$$S_1^e = \frac{1}{n}\sum_{j=1}^{n}\Bigl(e_n(\bar X_{1.} - \bar X_{2.}) - \frac{1}{n-1}(X_{1j} - X_{2j})\Bigr)^2 = e_n^2(\bar X_{1.} - \bar X_{2.})^2 - \frac{2e_n}{n-1}(\bar X_{1.} - \bar X_{2.})^2 + \frac{1}{(n-1)^2}\,\frac{1}{n}\sum_{j=1}^{n}(X_{1j} - X_{2j})^2,$$

since $\frac{1}{n}\sum_{j=1}^{n}(X_{1j} - X_{2j}) = \bar X_{1.} - \bar X_{2.}$. Because $e_n^2 - 2e_n/(n-1) = n(n-2)/(n-1)^2$, we obtain

$$S_1^e = \frac{1}{(n-1)^2}\Bigl[n(n-2)\bigl(\bar X_{1.} - \bar X_{2.}\bigr)^2 + \frac{1}{n}\sum_{j=1}^{n}(X_{1j} - X_{2j})^2\Bigr].$$
Next let $S_2^e = \frac{1}{n}\sum_{j=1}^{n}(\bar X_1^{(-j)} - \bar X_2^{(-j)})(\bar X_1^{(-j)} - X_{1j})$. Since

$$\bar X_1^{(-j)} - X_{1j} = e_n\bar X_{1.} - \frac{1}{n-1}X_{1j} - X_{1j} = e_n\bigl(\bar X_{1.} - X_{1j}\bigr),$$

it follows that

$$S_2^e = \frac{1}{n}\sum_{j=1}^{n}\Bigl(e_n(\bar X_{1.} - \bar X_{2.}) - \frac{1}{n-1}(X_{1j} - X_{2j})\Bigr)e_n\bigl(\bar X_{1.} - X_{1j}\bigr) = -\frac{e_n}{n-1}\,\frac{1}{n}\sum_{j=1}^{n}(X_{1j} - X_{2j})\bigl(\bar X_{1.} - X_{1j}\bigr),$$

since $\sum_{j=1}^{n}(\bar X_{1.} - X_{1j}) = 0$. Writing $X_{1j} - X_{2j} = (X_{1j} - \bar X_{1.}) - (X_{2j} - \bar X_{2.}) + (\bar X_{1.} - \bar X_{2.})$, we then have

$$S_2^e = \frac{e_n}{n-1}\,\frac{1}{n}\sum_{j=1}^{n}(X_{1j} - X_{2j})(X_{1j} - \bar X_{1.}) = \frac{n}{(n-1)^2}\bigl(\hat\sigma_1^2 - \widehat{cov}\bigr).$$

This completes the proof. □

The value of $\lambda_2^{opt}$ can be seen as a measure of the relevance of the second sample to the first. If that measure is almost zero, the formula for $\lambda_2^{opt}$ will reflect that by assigning a very small value to $\lambda_2^{opt}$. This implies that there is no need to combine the two populations if the difference between the two sample means is relatively large, or if the second sample carries little information of relevance to the first. The weights chosen by the cross-validation rule will then guard against the undesirable scenario in which too much bias is introduced into the estimation procedure. On the other hand, if the second sample does contain valuable information about the parameter of interest, the cross-validation procedure will recognize that by assigning a non-negligible value to $\lambda_2^{opt}$.

Proposition 5.1 If $\rho < \sigma_1/\sigma_2$, then $P_{\theta^0}(\lambda_2^{opt} > 0) \to 1$.

Proof: By the Weak Law of Large Numbers, $\hat\sigma_1^2 \xrightarrow{P} \sigma_1^2$ and $\widehat{cov} \xrightarrow{P} \rho\sigma_1\sigma_2$. Therefore $\hat\sigma_1^2 - \widehat{cov} \xrightarrow{P} \sigma_1^2 - \rho\sigma_1\sigma_2 = \sigma_1(\sigma_1 - \rho\sigma_2)$. Thus the condition $\rho < \sigma_1/\sigma_2$ implies that $\hat\sigma_1^2 > \widehat{cov}$, and hence $S_2^e > 0$, with probability tending to one, so that $\lambda_2^{opt}$ is eventually positive. □

We remark that the condition $\rho < \sigma_1/\sigma_2$ is satisfied if $\sigma_2 \le \sigma_1$ or $\rho < 0$. If the condition is not satisfied, then $\lambda_2^{opt}$ will be negative for sufficiently large n. However, the value of $\lambda_2^{opt}$ converges to zero in either case, as shown in the next proposition.

Proposition 5.2 If $\theta_1^0 \ne \theta_2^0$, then, for any given $\epsilon > 0$, $P_{\theta^0}(|\lambda_1^{opt} - 1| < \epsilon) \to 1$ and $P_{\theta^0}(|\lambda_2^{opt}| < \epsilon) \to 1$.

Proof: From Lemma 5.1, the second term of $S_1^e$ goes to zero in probability as n goes to infinity, while the first term converges to $(\theta_1^0 - \theta_2^0)^2$ in probability. Therefore

$$S_1^e \xrightarrow{P} (\theta_1^0 - \theta_2^0)^2 \quad \text{as } n \to \infty,$$

where $(\theta_1^0 - \theta_2^0)^2 \ne 0$ by assumption. Moreover, $S_2^e = O_P(\frac{1}{n})$. By the definition of $\lambda_2^{opt}$, it follows that

$$|\lambda_2^{opt}| = \Bigl|\frac{S_2^e}{S_1^e}\Bigr| \xrightarrow{P} 0 \quad \text{as } n \to \infty.$$

This completes the proof. □

The asymptotic limit of the weights will not exist if $\theta_1^0$ equals $\theta_2^0$, because the cross-validation procedure cannot detect a difference between the two populations when there is none. This can be rectified by adding a positive constant $c > 0$ to the denominator of $\lambda_2^{opt}$. We remark that no knowledge of the variances and covariances is assumed.
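Lemma 5.1 makes the two-population weights directly computable from a handful of summary statistics. A minimal sketch, assuming equal sample sizes and the $1/n$ forms of $\hat\sigma_1^2$ and $\widehat{cov}$ used above (the function name is ours):

```python
import numpy as np

def lemma51_weights(x1, x2):
    """Cross-validated weights of Lemma 5.1 for two samples of equal size n.
    Returns (lam1, lam2) with lam2 = S2e / S1e."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n = x1.size
    var1 = np.mean((x1 - x1.mean()) ** 2)         # sigma-hat_1^2, 1/n form
    cov12 = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
    s1e = (n * (n - 2) * (x1.mean() - x2.mean()) ** 2
           + np.mean((x1 - x2) ** 2)) / (n - 1) ** 2
    s2e = n * (var1 - cov12) / (n - 1) ** 2
    lam2 = s2e / s1e
    return 1.0 - lam2, lam2
```

For data with a large separation of means, the returned lam2 shrinks toward zero, in line with Proposition 5.2.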
5.2.2 Alternative Matrix Representation of $A_e$ and $b_e$

To study the case of more than two populations, it is necessary to derive an alternative matrix representation of $\lambda_e^{opt}$. It can be verified that

$$\bar X_i^{(-j)}\bar X_k^{(-j)} = \Bigl(e_n\bar X_{i.} - \frac{X_{ij}}{n-1}\Bigr)\Bigl(e_n\bar X_{k.} - \frac{X_{kj}}{n-1}\Bigr) = e_n^2\bar X_{i.}\bar X_{k.} - \frac{e_n}{n-1}\bigl(\bar X_{i.}X_{kj} + \bar X_{k.}X_{ij}\bigr) + \frac{X_{ij}X_{kj}}{(n-1)^2},$$

where $e_n = \frac{n}{n-1}$. Let $\hat\Sigma = (\widehat{cov}_{ik})$ denote the sample covariance matrix with entries $\widehat{cov}_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}(X_{ij} - \bar X_{i.})(X_{kj} - \bar X_{k.})$, and let $\hat\theta = (\bar X_{1.}, \ldots, \bar X_{m.})^t$. Summing over j and using $\sum_{j}X_{ij}X_{kj} = (n-1)\widehat{cov}_{ik} + n\bar X_{i.}\bar X_{k.}$, we have

$$\sum_{j=1}^{n}\bar X_i^{(-j)}\bar X_k^{(-j)} = \Bigl(e_n^2(n-2) + \frac{n}{(n-1)^2}\Bigr)\bar X_{i.}\bar X_{k.} + \frac{1}{n-1}\widehat{cov}_{ik} = n\,\bar X_{i.}\bar X_{k.} + \frac{1}{n-1}\widehat{cov}_{ik},$$

since the coefficient $e_n^2(n-2) + n/(n-1)^2$ simplifies to n. Recalling that, for $1 \le i \le m$ and $1 \le k \le m$, $A_e(ik) = \sum_{j=1}^{n}\bar X_i^{(-j)}\bar X_k^{(-j)}$, it follows that

$$A_e = \frac{1}{n-1}\hat\Sigma + n\,\hat\theta\hat\theta^t. \qquad (5.3)$$

We also have

$$(b_e)_i = \sum_{j=1}^{n}X_{1j}\Bigl(e_n\bar X_{i.} - \frac{X_{ij}}{n-1}\Bigr) = e_n n\bar X_{1.}\bar X_{i.} - \frac{1}{n-1}\sum_{j=1}^{n}X_{1j}X_{ij} = n\bar X_{1.}\bar X_{i.} - \widehat{cov}_{1i}.$$

Comparing with the first column $A_1$ of $A_e$, it then follows that

$$b_e = A_1 - e_n\hat\Sigma_1, \qquad (5.4)$$

where $\hat\Sigma_1$ is the first column of the sample covariance matrix $\hat\Sigma$.

5.2.3 Optimum Weights $\lambda_e^{opt}$ by Cross-Validation

We are now in a position to derive the optimum weights when the sample sizes are equal.

Proposition 5.3 The optimum weight vector which minimizes $D_e^{(m)}$ takes the following form:

$$\lambda_e^{opt} = (1, 0, \ldots, 0)^t - e_n\Bigl(A_e^{-1}\hat\Sigma_1 - \frac{1^tA_e^{-1}\hat\Sigma_1}{1^tA_e^{-1}1}A_e^{-1}1\Bigr).$$

Proof: Differentiating $D_e^{(m)} - \nu(1^t\lambda - 1)$ with respect to $\lambda$ and setting the result to zero, it follows that

$$2A_e\lambda - 2b_e - \nu 1 = 0, \qquad \text{so that} \qquad \lambda = A_e^{-1}b_e + \frac{\nu}{2}A_e^{-1}1.$$

Imposing $1 = 1^t\lambda = 1^tA_e^{-1}b_e + \frac{\nu}{2}1^tA_e^{-1}1$ gives $\frac{\nu}{2} = (1 - 1^tA_e^{-1}b_e)/(1^tA_e^{-1}1)$. Therefore

$$\lambda_e^{opt} = A_e^{-1}b_e + \frac{1 - 1^tA_e^{-1}b_e}{1^tA_e^{-1}1}A_e^{-1}1.$$

Since $D_e^{(m)}$ is a quadratic function of $\lambda$ and $A_e > 0$, the minimum is achieved at this point. Furthermore, by equations (5.3) and (5.4),

$$A_e^{-1}b_e = A_e^{-1}\bigl(A_1 - e_n\hat\Sigma_1\bigr) = (1, 0, \ldots, 0)^t - e_nA_e^{-1}\hat\Sigma_1,$$

and substituting this expression into the formula above yields the stated form of $\lambda_e^{opt}$. This completes the proof. □

We remark that $A_e$ is invertible since $\hat\Sigma$ is invertible. We also remark that the expression for the weight vector in the two population case can be derived from this matrix representation; the detailed calculation is quite similar to that given in the previous subsection.
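Proposition 5.3 reduces to a few lines of linear algebra once $A_e$ and $b_e$ are formed via (5.3) and (5.4). A sketch, under the assumption that $\hat\Sigma$ (and hence $A_e$) is invertible; the function name is ours:

```python
import numpy as np

def prop53_weights(samples):
    """Optimum weights of Proposition 5.3 for m equal-size samples,
    built directly from A_e and b_e of equations (5.3) and (5.4)."""
    x = np.asarray(samples, dtype=float)          # m x n data matrix
    m, n = x.shape
    theta = x.mean(axis=1)                        # vector of sample means
    sigma = np.cov(x)                             # 1/(n-1) sample covariance
    a_e = sigma / (n - 1) + n * np.outer(theta, theta)   # equation (5.3)
    e_n = n / (n - 1.0)
    b_e = a_e[:, 0] - e_n * sigma[:, 0]           # equation (5.4)
    a_inv_b = np.linalg.solve(a_e, b_e)
    a_inv_1 = np.linalg.solve(a_e, np.ones(m))
    gamma = (1.0 - a_inv_b.sum()) / a_inv_1.sum() # Lagrange correction
    return a_inv_b + gamma * a_inv_1              # weights summing to one
```

The returned vector agrees with the two-population formula (5.2) when m = 2.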
5.3 Linear WLE for Unequal Sample Sizes

In the previous section we discussed choosing the optimum weights when the sample sizes are equal. In this section, we propose a cross-validation method for choosing adaptive weights when the sample sizes are unequal. If the sample sizes are not equal, it is not clear whether the delete-one-column approach is a reasonable one. For example, suppose that there are 10 observations in the first sample and 5 in the second. Then there is no observation to delete from the second sample for half of the cross-validation steps. Furthermore, we might lose prediction accuracy by deleting a whole column when the sample sizes are small. We therefore propose an alternative method which deletes only one data point from the first sample and keeps all the data points from the remaining samples when the sample sizes are not equal.

5.3.1 Two Population Case

Let us again consider the case of two populations. The optimum weights $\lambda_u^{opt}$ are obtained by minimizing the objective function

$$D_u^{(2)}(\lambda) = \sum_{j=1}^{n_1}\bigl(X_{1j} - \lambda_1\bar X_1^{(-j)} - \lambda_2\bar X_{2.}\bigr)^2,$$

where $\lambda_1 + \lambda_2 = 1$ and $\bar X_1^{(-j)} = \frac{1}{n_1-1}\sum_{k \ne j}X_{1k}$. We remark that the major difference between $D_u^{(2)}$ and $D_e^{(2)}$ is that only the jth data point of the first sample is left out of the jth term of $D_u^{(2)}$. Under the condition $\lambda_1 + \lambda_2 = 1$, we can rewrite $D_u^{(2)}$ as a function of $\lambda_1$ alone:

$$D_u^{(2)}(\lambda_1) = \sum_{j=1}^{n_1}\bigl((X_{1j} - \bar X_{2.}) - \lambda_1(\bar X_1^{(-j)} - \bar X_{2.})\bigr)^2.$$

Differentiating with respect to $\lambda_1$ and setting the derivative to zero, we obtain

$$\lambda_1^{opt} = \frac{\sum_{j=1}^{n_1}(X_{1j} - \bar X_{2.})(\bar X_1^{(-j)} - \bar X_{2.})}{\sum_{j=1}^{n_1}(\bar X_1^{(-j)} - \bar X_{2.})^2} = \frac{S_2^u}{S_1^u}.$$

Since $\bar X_1^{(-j)} - \bar X_{2.} = (\bar X_{1.} - \bar X_{2.}) + \frac{1}{n_1-1}(\bar X_{1.} - X_{1j})$ and $\sum_{j}(\bar X_{1.} - X_{1j}) = 0$, we find

$$S_2^u = \sum_{j=1}^{n_1}(X_{1j} - \bar X_{2.})\Bigl((\bar X_{1.} - \bar X_{2.}) + \frac{\bar X_{1.} - X_{1j}}{n_1-1}\Bigr) = n_1(\bar X_{1.} - \bar X_{2.})^2 - \frac{n_1}{n_1-1}\hat\sigma_1^2$$

and

$$S_1^u = \sum_{j=1}^{n_1}\Bigl((\bar X_{1.} - \bar X_{2.}) + \frac{\bar X_{1.} - X_{1j}}{n_1-1}\Bigr)^2 = n_1(\bar X_{1.} - \bar X_{2.})^2 + \frac{n_1}{(n_1-1)^2}\hat\sigma_1^2,$$

where $\hat\sigma_1^2 = \frac{1}{n_1}\sum_{j=1}^{n_1}(X_{1j} - \bar X_{1.})^2$. We then have

$$\lambda_1^{opt} = \frac{n_1(\bar X_{1.} - \bar X_{2.})^2 - \frac{n_1}{n_1-1}\hat\sigma_1^2}{n_1(\bar X_{1.} - \bar X_{2.})^2 + \frac{n_1}{(n_1-1)^2}\hat\sigma_1^2}, \qquad \lambda_2^{opt} = 1 - \lambda_1^{opt}. \qquad (5.5)$$

Proposition 5.4 If $\theta_1^0 \ne \theta_2^0$, then $\lambda_1^{opt} \xrightarrow{P} 1$ and $\lambda_2^{opt} \xrightarrow{P} 0$.

Proof: From equation (5.5), it follows that

$$\lambda_2^{opt} = \frac{\frac{n_1^2}{(n_1-1)^2}\hat\sigma_1^2}{n_1(\bar X_{1.} - \bar X_{2.})^2 + \frac{n_1}{(n_1-1)^2}\hat\sigma_1^2}.$$

By the Weak Law of Large Numbers, $\hat\sigma_1^2 \xrightarrow{P} \sigma_1^2$ and $(\bar X_{1.} - \bar X_{2.})^2 \xrightarrow{P} (\theta_1^0 - \theta_2^0)^2 \ne 0$. The numerator therefore remains bounded in probability while the denominator tends to infinity, so $\lambda_2^{opt} \xrightarrow{P} 0$. The assertion for $\lambda_1^{opt}$ follows from $\lambda_1 + \lambda_2 = 1$. □

5.3.2 Optimum Weights by Cross-Validation

We now derive the general formula for the optimum weights by cross-validation when the sample sizes are not all equal. The objective function is

$$D_u^{(m)}(\lambda) = \sum_{j=1}^{n_1}\Bigl(X_{1j} - \lambda_1\bar X_1^{(-j)} - \sum_{i=2}^{m}\lambda_i\bar X_{i.}\Bigr)^2 = c(X) - 2b(X)^t\lambda + \lambda^tA(X)\lambda,$$

where, using $\bar X_1^{(-j)} = \bar X_{1.} + \frac{1}{n_1-1}(\bar X_{1.} - X_{1j})$,

$$b_1 = \sum_{j=1}^{n_1}X_{1j}\Bigl(\bar X_{1.} + \frac{\bar X_{1.} - X_{1j}}{n_1-1}\Bigr) = n_1\bar X_{1.}^2 - \frac{n_1}{n_1-1}\hat\sigma_1^2, \qquad b_i = n_1\bar X_{1.}\bar X_{i.}, \quad i = 2, \ldots, m,$$

and

$$a_{11} = \sum_{j=1}^{n_1}\Bigl(\bar X_{1.} + \frac{\bar X_{1.} - X_{1j}}{n_1-1}\Bigr)^2 = n_1\bar X_{1.}^2 + \frac{n_1}{(n_1-1)^2}\hat\sigma_1^2, \qquad a_{ij} = n_1\bar X_{i.}\bar X_{j.}, \quad i \ne 1 \text{ or } j \ne 1.$$
It then follows that

$$A = n_1\hat\theta\hat\theta^t + D,$$

where $\hat\theta = (\bar X_{1.}, \ldots, \bar X_{m.})^t$ and $D = (d_{ij})$ with $d_{11} = \frac{n_1}{(n_1-1)^2}\hat\sigma_1^2$ and $d_{ij} = 0$ for $i \ne 1$ or $j \ne 1$. By the elementary rank inequality,

$$rank(A) \le rank(\hat\theta\hat\theta^t) + rank(D) = 2,$$

so that $rank(A) < m$ whenever $m > 2$. Thus A is not invertible for $m > 2$, and the Lagrange method used earlier fails, since it involves the inversion of A. To solve this problem, we rewrite the objective function in terms of $\lambda_2, \lambda_3, \ldots, \lambda_m$ only, replacing $\lambda_1$ by $1 - \sum_{i=2}^{m}\lambda_i$. The original constrained minimization is thereby transformed into a minimization problem without constraints. As the following derivation shows, the new objective function is a quadratic function of $\lambda_2, \ldots, \lambda_m$, so its minimum exists and is unique.

Substituting $\lambda_1 = 1 - \sum_{i=2}^{m}\lambda_i$, we have

$$b(X)^t\lambda = b_1 + \sum_{i=2}^{m}(b_i - b_1)\lambda_i = b_1 + n_1\sum_{i=2}^{m}\Bigl(\bar X_{1.}\bar X_{i.} - \bar X_{1.}^2 + \frac{\hat\sigma_1^2}{n_1-1}\Bigr)\lambda_i.$$

Since A is symmetric, a similar computation gives

$$\lambda^tA\lambda = a_{11} - 2\sum_{i=2}^{m}(a_{11} - a_{1i})\lambda_i + \sum_{i=2}^{m}\sum_{k=2}^{m}\lambda_i\bigl(a_{ik} + a_{11} - a_{1i} - a_{1k}\bigr)\lambda_k.$$

Combining these expressions in $D_u^{(m)} = c(X) - 2b(X)^t\lambda + \lambda^tA\lambda$, the terms in $\bar X_{1.}\bar X_{i.}$ cancel in the linear coefficient, which reduces to $-2n_1^2\hat\sigma_1^2/(n_1-1)^2$, while

$$a_{ik} + a_{11} - a_{1i} - a_{1k} = n_1\Bigl[(\bar X_{i.} - \bar X_{1.})(\bar X_{k.} - \bar X_{1.}) + \frac{\hat\sigma_1^2}{(n_1-1)^2}\Bigr].$$

We therefore obtain

$$D_u^{(m)} = a_{11} - 2b_1 + c(X) - \frac{2n_1^2\hat\sigma_1^2}{(n_1-1)^2}\sum_{i=2}^{m}\lambda_i + n_1\sum_{i=2}^{m}\sum_{k=2}^{m}\lambda_iC_{i-1,k-1}\lambda_k,$$

where C is the $(m-1)\times(m-1)$ matrix with entries, for $i = 1, 2, \ldots, m-1$ and $j = 1, 2, \ldots, m-1$,

$$C_{ij} = \hat\theta_{i+1}\hat\theta_{j+1} + \hat\theta_1^2 - \hat\theta_{i+1}\hat\theta_1 - \hat\theta_{j+1}\hat\theta_1 + \frac{\hat\sigma_1^2}{(n_1-1)^2}.$$

If C is invertible, it then follows by differentiation that

$$\lambda^{opt,(-1)} = (\lambda_2, \lambda_3, \ldots, \lambda_m)^t = \frac{n_1\hat\sigma_1^2}{(n_1-1)^2}C^{-1}1.$$

We then have $\lambda_u^{opt} = (\lambda_1^{opt}, (\lambda^{opt,(-1)})^t)^t$, where $\lambda_1^{opt} = 1 - 1^t\lambda^{opt,(-1)}$. We remark that C is indeed invertible for $m = 2$ and $m = 3$; it is not clear whether C is invertible for $m > 3$, so in general a g-inverse of C should be considered in order to find the optimum weight vector.

5.4 Asymptotic Properties of the Weights

In this section we derive the asymptotic properties of the cross-validated weights. Let $\hat\theta_1^{(n_1)}$ be the MLE based on the first sample of size $n_1$. Let $\hat\theta_1^{(-j)}$ and $\tilde\theta_1^{(-j)}$ respectively be the MLE and the WLE based on the m samples without the jth data point from the first sample. This framework covers both of the schemes above, in which either only the jth data point of the first sample is deleted, or the jth data point is deleted from every sample. Note that $\tilde\theta_1^{(-j)}$ is a function of the weight vector $\lambda$. Let $\frac{1}{n_1}D_{n_1}$ be the average discrepancy in the cross-validation, defined as

$$\frac{1}{n_1}D_{n_1}(\lambda) = \frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\tilde\theta_1^{(-j)})\bigr)^2.$$

Let $\lambda^{(cv)}$ be the optimum weight vector chosen by cross-validation, and let $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_m^0)$, where $\theta_1^0$ is the true value of $\theta_1$. We then have the following theorem.

Theorem 5.1 Assume that

(1) $\frac{1}{n_1}D_{n_1}(\lambda)$ has a unique minimum for any fixed $n_1$;

(2) $\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(\phi(\hat\theta_1^{(-j)}) - \phi(\hat\theta_1^{(n_1)})\bigr)^2 \xrightarrow{P} 0$ as $n_1 \to \infty$;

(3) $P_{\theta^0}\Bigl(\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\hat\theta_1^{(-j)})\bigr)^2 < K\Bigr) \to 1$ for some constant $0 < K < \infty$;

(4) $P_{\theta^0}\bigl(|\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})| > M\bigr) = o(\frac{1}{n_1})$ for some constant $0 < M < \infty$;

then

$$\lambda^{(cv)} \xrightarrow{P_{\theta^0}} w_0 = (1, 0, 0, \ldots, 0)^t. \qquad (5.6)$$

Proof: Expanding the square, we have the decomposition

$$\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\tilde\theta_1^{(-j)})\bigr)^2 = \frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\hat\theta_1^{(-j)})\bigr)^2 + \frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})\bigr)^2 + S_1 + S_2,$$

where

$$S_1 = \frac{2}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\hat\theta_1^{(n_1)})\bigr)\bigl(\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})\bigr), \qquad S_2 = \frac{2}{n_1}\sum_{j=1}^{n_1}\bigl(\phi(\hat\theta_1^{(n_1)}) - \phi(\hat\theta_1^{(-j)})\bigr)\bigl(\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})\bigr).$$

We first show that $S_1 \xrightarrow{P} 0$. Consider

$$P_{\theta^0}(|S_1| > \epsilon) \le P_{\theta^0}\bigl(\epsilon < |S_1| \text{ and } |\phi(\hat\theta_1^{(-l)}) - \phi(\tilde\theta_1^{(-l)})| \le M \text{ for all } l\bigr) + n_1P_{\theta^0}\bigl(|\phi(\hat\theta_1^{(-l)}) - \phi(\tilde\theta_1^{(-l)})| > M\bigr).$$

On the first event, $|S_1| \le \frac{2M}{n_1}\sum_{j}|X_{1j} - \phi(\hat\theta_1^{(n_1)})|$. The first term goes to zero by the Weak Law of Large Numbers.
The second term also goes to zero by assumption (4). We then have

$$P_{\theta^0}(|S_1| > \epsilon) \to 0 \quad \text{as } n_1 \to \infty. \qquad (5.7)$$

We next show that $S_2 \xrightarrow{P} 0$ as $n_1 \to \infty$. Splitting on the same event,

$$P_{\theta^0}(|S_2| > \epsilon) \le P_{\theta^0}\bigl(\epsilon < |S_2| \text{ and } |\phi(\hat\theta_1^{(-l)}) - \phi(\tilde\theta_1^{(-l)})| \le M \text{ for all } l\bigr) + n_1P_{\theta^0}\bigl(|\phi(\hat\theta_1^{(-l)}) - \phi(\tilde\theta_1^{(-l)})| > M\bigr).$$

On the first event, the Cauchy-Schwarz inequality gives $|S_2| \le 2M\bigl(\frac{1}{n_1}\sum_{j}(\phi(\hat\theta_1^{(n_1)}) - \phi(\hat\theta_1^{(-j)}))^2\bigr)^{1/2}$, so the first term goes to zero by assumption (2); the second term goes to zero by assumption (4). We then have

$$P_{\theta^0}(|S_2| > \epsilon) \to 0 \quad \text{as } n_1 \to \infty. \qquad (5.8)$$

It then follows that

$$\frac{1}{n_1}D_{n_1}(\lambda) = \frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\hat\theta_1^{(-j)})\bigr)^2 + \frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})\bigr)^2 + R_n, \qquad (5.9)$$

where $R_n \xrightarrow{P} 0$. Observe that the first term is independent of $\lambda$; the second term, which is always non-negative, must therefore be minimized with respect to $\lambda$ in order to minimize $\frac{1}{n_1}D_{n_1}(\lambda)$. It then follows that, with probability tending to 1,

$$\frac{1}{n_1}D_{n_1}(\lambda) \ge \frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \phi(\hat\theta_1^{(-j)})\bigr)^2 = \frac{1}{n_1}D_{n_1}(w_0),$$

since $\phi(\tilde\theta_1^{(-j)}) = \phi(\hat\theta_1^{(-j)})$ when $\lambda = w_0 = (1, 0, \ldots, 0)^t$ for fixed m. Finally, suppose that $\lambda^{(cv)} \xrightarrow{P} w_0 + d$ for some non-zero vector d. Then there exists $n_0$ such that, for $n_1 > n_0$, $\frac{1}{n_1}D_{n_1}(\lambda^{(cv)}) > \frac{1}{n_1}D_{n_1}(w_0)$ with positive probability. This is a contradiction, because $\lambda^{(cv)}$ is the vector which minimizes $\frac{1}{n_1}D_{n_1}$ for every fixed $n_1$, and the minimum is unique by assumption (1). □

To check the assumptions of the theorem, consider the linear WLE for two samples with equal sample sizes. Assumption (1) is satisfied since $\frac{1}{n_1}D_{n_1}(\lambda)$ is a quadratic form in $\lambda$ whose minimum is indeed unique for each fixed $n_1$. To check assumption (2), note that $\bar X_1^{(-j)} - \bar X_{1.} = \frac{1}{n_1-1}(\bar X_{1.} - X_{1j})$, so that

$$\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(\bar X_1^{(-j)} - \bar X_{1.}\bigr)^2 = \frac{1}{(n_1-1)^2}\,\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \bar X_{1.}\bigr)^2 \xrightarrow{P} 0.$$

For assumption (3),

$$\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \bar X_1^{(-j)}\bigr)^2 = \Bigl(\frac{n_1}{n_1-1}\Bigr)^2\,\frac{1}{n_1}\sum_{j=1}^{n_1}\bigl(X_{1j} - \bar X_{1.}\bigr)^2 \xrightarrow{P} var(X_{11}) < \infty \quad \text{as } n_1 \to \infty.$$

For the last assumption, observe that

$$\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)}) = \bar X_1^{(-j)} - \bigl(\lambda_1^{(cv)}\bar X_1^{(-j)} + \lambda_2^{(cv)}\bar X_2^{(-j)}\bigr) = \lambda_2^{(cv)}\bigl(\bar X_1^{(-j)} - \bar X_2^{(-j)}\bigr).$$

It then follows from Lemma 5.1 and Markov's inequality that

$$P_{\theta^0}\bigl(|\phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)})| > \epsilon\bigr) \le \frac{1}{(n-2)^2\epsilon^2}E_{\theta^0}\Bigl[\frac{(\hat\sigma_1^2 - \widehat{cov})^2}{(\bar X_{1.} - \bar X_{2.})^2}\Bigr] = o\Bigl(\frac{1}{n}\Bigr),$$

since $\bar X_{1.} - \bar X_{2.} \xrightarrow{P} \theta_1^0 - \theta_2^0 \ne 0$ and $\hat\sigma_1^2 - \widehat{cov} \xrightarrow{P} \sigma_1^2 - cov(X_{11}, X_{21})$. Thus the assumptions of the theorem are all satisfied, and we have $\lambda^{(cv)} \xrightarrow{P} w_0$, which is consistent with the result of Proposition 5.2. Since the cross-validated weight vector converges in probability to $w_0$, the asymptotic normality of $\tilde\theta_1$ with cross-validated weights follows by Theorem 4.10.

5.5 Simulation Studies

To demonstrate and verify the benefits of the cross-validation procedures described in the previous sections, we perform simulations according to the following algorithm, which deletes the jth point from each sample (the delete-one-column approach):

Step 1: Draw random samples of size n from $f_1(x;\theta_1)$ and $f_2(x;\theta_2)$;

Step 2: Calculate the cross-validated optimum weights by (5.2);

Step 3: Calculate $(MLE - \theta_1^0)^2$ and $(WLE - \theta_1^0)^2$.

Repeat Steps 1-3 1000 times. Calculate the averages and standard deviations of the squared differences of the MLE and the WLE relative to the true value of the parameter $\theta_1^0$, and the averages and standard deviations of the optimum weights.
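This algorithm is easy to reproduce. A sketch of the normal case with c = 0.3, reusing the lemma51_weights fragment shown earlier; the seed and function names are ours, and exact figures will vary from run to run:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, c=0.3, reps=1000):
    """Monte Carlo comparison of the MLE and the WLE for N(0,1) vs N(c,1),
    following Steps 1-3; weights come from (5.2) via lemma51_weights."""
    se_mle, se_wle, lam1s = [], [], []
    for _ in range(reps):
        x1 = rng.normal(0.0, 1.0, n)              # Step 1: theta_1^0 = 0
        x2 = rng.normal(c, 1.0, n)
        lam1, lam2 = lemma51_weights(x1, x2)      # Step 2
        wle = lam1 * x1.mean() + lam2 * x2.mean() # Step 3
        se_mle.append(x1.mean() ** 2)
        se_wle.append(wle ** 2)
        lam1s.append(lam1)
    return np.mean(se_mle), np.mean(se_wle), np.mean(lam1s)

for n in (5, 10, 20, 40):
    print(n, simulate(n))                          # MSE(MLE), MSE(WLE), ave lam1
```

The printed averages should track the patterns reported in the tables that follow.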
We generate random samples from $N(0, \sigma_1^2)$ and $N(c, \sigma_2^2)$. For simplicity, we set $\sigma_1 = \sigma_2 = 1$. For the purpose of the demonstration, we set c = 0.3, which is 30% of the variance. Other values were tried; in general, the larger the value of c, the smaller the improvement in MSE. For example, if we set $\sigma_1 = \sigma_2 = 1$ and c = 1, the ratio of the MSE of the WLE to that of the MLE is almost 1, which means that the cross-validation procedure does not consider the second sample useful in that case. Some of the results for c = 0.3 are shown in Table 5.1.

n    MSE(MLE)   SD of (MLE$-\theta_1^0)^2$   MSE(WLE)   SD of (WLE$-\theta_1^0)^2$   MSE(WLE)/MSE(MLE)
5    0.203      0.451                        0.130      0.360                        0.638
10   0.100      0.317                        0.075      0.274                        0.751
15   0.069      0.262                        0.057      0.238                        0.826
20   0.051      0.227                        0.042      0.204                        0.809
25   0.041      0.203                        0.035      0.187                        0.843
30   0.035      0.187                        0.031      0.177                        0.895
35   0.030      0.173                        0.028      0.166                        0.931
40   0.025      0.159                        0.023      0.153                        0.932
45   0.023      0.151                        0.023      0.151                        0.997
50   0.020      0.141                        0.020      0.141                        1.007
55   0.018      0.135                        0.019      0.139                        1.066
60   0.017      0.129                        0.018      0.133                        1.057

Table 5.1: MSE of the MLE and the WLE and standard deviations of the squared errors for samples with equal sizes from N(0, 1) and N(0.3, 1).

n    AVE. of $\lambda_1$   AVE. of $\lambda_2$   SD of $\lambda_1$ and $\lambda_2$
5    0.710                 0.290                 0.027
10   0.725                 0.275                 0.053
15   0.734                 0.266                 0.064
20   0.750                 0.250                 0.075
25   0.755                 0.245                 0.080
30   0.765                 0.235                 0.085
35   0.777                 0.223                 0.089
40   0.785                 0.216                 0.092
45   0.785                 0.215                 0.092
50   0.788                 0.220                 0.094
55   0.792                 0.208                 0.094
60   0.807                 0.193                 0.097

Table 5.2: Optimum weights and their standard deviations for samples with equal sizes from N(0, 1) and N(0.3, 1).

It is obvious from Table 5.1 that the MSE of the WLE is much smaller than that of the MLE for small and moderate sample sizes. The standard deviation of the squared differences for the WLE is less than or equal to that of the MLE. This suggests that the WLE not only achieves a smaller MSE but also that its squared error has less variation than that of the MLE. Intuitively, as the sample size increases, the importance of the second sample diminishes. As indicated by Table 5.2, the cross-validation procedure begins to recognize this by assigning a larger value to $\lambda_1$ as the sample size increases. Although quite slowly, the optimum weights do increase towards the asymptotic weights (1, 0) for the normal case.

We repeat the algorithm for Poisson distributions, $\mathcal{P}(3)$ and $\mathcal{P}(3.6)$. Some of the results are shown in Table 5.3 and Table 5.4. The results for the Poisson distributions differ somewhat from those for the normal. The striking difference can be seen in the ratio of the average MSE of the WLE to that of the MLE: the WLE achieves a smaller MSE on average when the sample sizes are less than 30, but once the sample size exceeds 30 it seems that we should not combine the two samples. This is not the case for the normal distributions until the sample size reaches 45. We remark that the reduction in MSE disappears if we set c = 1.5 in the Poisson case: the cross-validation procedure will not combine two samples if the second sample does not help to predict the behavior of the first. We also emphasize that the value of c is not assumed to be known to the cross-validation procedure in either case.
n    MSE(mle)   SD      MSE(wle)   SD      MSE(wle)/MSE(mle)
5    0.312      0.558   0.235      0.484   0.753
15   0.142      0.377   0.127      0.357   0.896
25   0.120      0.347   0.114      0.338   0.950
30   0.104      0.323   0.104      0.323   1.000
35   0.077      0.277   0.081      0.284   1.054
40   0.074      0.272   0.076      0.275   1.025
45   0.072      0.268   0.075      0.274   1.045
50   0.057      0.238   0.065      0.255   1.141
55   0.054      0.233   0.060      0.245   1.098
60   0.046      0.215   0.052      0.229   1.132

Table 5.3: MSE of the MLE and the WLE and their standard deviations for samples with equal sizes from $\mathcal{P}(3)$ and $\mathcal{P}(3.6)$.

n    AVE. of $\lambda_1$   AVE. of $\lambda_2$   SD of $\lambda_1$ and $\lambda_2$
5    0.710                 0.289                 0.027
10   0.729                 0.270                 0.057
15   0.738                 0.261                 0.065
20   0.754                 0.245                 0.077
25   0.754                 0.245                 0.078
30   0.768                 0.231                 0.086
35   0.777                 0.222                 0.091
40   0.789                 0.210                 0.093
45   0.797                 0.202                 0.097
50   0.799                 0.200                 0.095
55   0.812                 0.187                 0.097
60   0.820                 0.179                 0.096

Table 5.4: Optimum weights and their standard deviations for samples with equal sizes from $\mathcal{P}(3)$ and $\mathcal{P}(3.6)$.

We remark that simulations using the delete-one-point approach have also been done; they give quite similar results.

Chapter 6

Derivations of the Weighted Likelihood Functions

In this chapter, we develop the theoretical foundation for using the weighted likelihood. Hu and Zidek (1997) discuss the connection between the relevance weighted likelihood and the maximum entropy principle for the discrete case. We show that the weighted likelihood function can also be derived from the maximum entropy principle advocated by Akaike in the continuous case.

6.1 Introduction

Akaike (1985) reviewed the historical development of entropy and discussed the importance of the maximum entropy principle. Hu and Zidek (1997) discovered the connection between the relevance weighted likelihood and the maximum entropy principle for the discrete case. We offer a proof for the continuous case. We first state the maximum entropy principle: all statistical activities are directed to maximize the expected entropy of the predictive distribution in each particular application. When a parametric model $\{p(x;\theta); \theta \in \Theta\}$ for the distribution of a future observation x is given, the goodness of a particular member $p(\cdot\,;\theta)$ as the predictive distribution of x is evaluated by the relative entropy

$$B(f; p(\cdot\,;\theta)) = -I(f, p(\cdot\,;\theta)) = -\int f(x)\log\frac{f(x)}{p(x;\theta)}\,dx,$$

where f(x) is the true distribution. In this expression, $\log f(x)/p(x;\theta)$ is defined as $+\infty$ if $f(x) > 0$ and $p(x;\theta) = 0$, so the expectation could be $+\infty$. Although $\log f(x)/p(x;\theta)$ is defined as $-\infty$ when $f(x) = 0$ and $p(x;\theta) > 0$, the integrand $f(x)\log f(x)/p(x;\theta)$ is defined as zero in this case. The relative entropy is a natural measure of discrepancy between two probability functions. Hence, maximizing $B(f; p(\cdot\,;\theta))$ is equivalent to minimizing $I(f, p(\cdot\,;\theta))$ with respect to $\theta$.

Without any restrictions, the desired density function is f(x) itself. However, the true density function f(x) is unknown. We therefore use a set of density functions, $f_1(x), f_2(x), \ldots, f_m(x)$ say, to represent the true density function. The density function $f_1(x)$ represents the density function which is thought to be the "closest" to the true density function f(x).
This resembles the idea of the compromised MLE proposed by Easton (1991). To be consistent with our use of relative entropy, we use it to interpret the "resemblance" of any density function g(x) to the $f_i(x)$, and define that term to mean $I(f_i, g) \le a_i$, $i = 2, 3, \ldots, m$. The constant $a_i$ reflects the maximum allowable deviation from the density function $f_i(x)$; if $a_i$ is set to zero, then g(x) takes exactly the same functional form as $f_i(x)$.

For a given set of density functions, we seek a probability density function which minimizes

$$I(f_1, g) = \int f_1(x)\log\frac{f_1(x)}{g(x)}\,dx$$

over all probability densities g satisfying

$$I(f_i, g) \le a_i, \quad i = 2, 3, \ldots, m, \qquad (6.1)$$

where the $a_i$ are finite non-negative constants. This idea of minimizing the relative entropy under constraints is similar to the approach outlined in Kullback (1959, Chapter 3) for hypothesis testing.

6.2 Existence of the Optimal Density

Let V be a reflexive Banach space and let $\mathcal{E}$ be a non-empty closed convex subset of V. Define a functional I(g) on $\mathcal{E}$ into $\mathcal{R}$, where g is a continuous function. We are concerned with the minimization problem

$$\inf_{g\in\mathcal{E}}I(g).$$

To avoid trivial cases, we assume that the functional I(g) is proper, i.e. it does not take the value $-\infty$ and is not identically equal to $+\infty$. We then have the following theorem.

Theorem 6.1 Assume that I(g) is convex, lower semi-continuous and proper with respect to g. In addition, assume that the admissible set $\mathcal{D}$ is bounded and that there exists a constant M, say, such that $\sup_{g\in\mathcal{D}}I(g) \le M$. Then the minimization problem has at least one solution. The solution is unique if the functional I(g) is strictly convex on $\mathcal{D}$.

Proof: See, for example, Ekeland (1976, p. 35). □

Consider the $L^p$ spaces ($1 < p < \infty$). It is known that $L^p$ ($1 < p < \infty$) is reflexive (Royden 1988, p. 227). Define $I(g) = \int f_1(x)\log\frac{f_1(x)}{g(x)}dx$. It can be seen that I(g) is a proper convex function and is continuous in g on $L^p$. We also assume that $I(g) < \infty$. Define

$$\mathcal{E}_i = \Bigl\{g : \int f_i(x)\log\frac{f_i(x)}{g(x)}dx \le a_i,\ \int g(x)dx = 1,\ g(x) \ge 0,\ g \in L^p\Bigr\}, \quad i = 2, \ldots, m,$$

and $\mathcal{E} = \bigcap_{i=2}^{m}\mathcal{E}_i$.

Lemma 6.1 The following inequality holds:

$$V^2(f_1, f_2)/2 \le I(f_1, f_2),$$

where $V(f_1, f_2) = \int|f_1(x) - f_2(x)|dx$.

Proof: See Kullback (1954). □

Lemma 6.2 If $|g(x) - f_i(x)|^{p-1} \le B_i$ for all x, $1 < p < \infty$, where the $B_i$ are constants, $i = 2, \ldots, m$, then the set $\mathcal{E}$ is a bounded subset of $L^p$.

Proof: It suffices to show that each $\mathcal{E}_i$ is bounded in $L^p$. For any density g in $\mathcal{E}_i$, $i = 2, \ldots, m$, we have

$$\int|g(x) - f_i(x)|^pdx = \int|g(x) - f_i(x)|\,|g(x) - f_i(x)|^{p-1}dx \le B_i\int|g(x) - f_i(x)|dx \le B_i\sqrt{2I(f_i, g)} \le B_i\sqrt{2a_i},$$

using Lemma 6.1. This implies that $\|g - f_i\|_p \le C_i$ for some constant $C_i$. By the Minkowski inequality for $L^p$ spaces with $1 < p < \infty$,

$$\|g\|_p \le \|g - f_i\|_p + \|f_i\|_p \le C_i + \|f_i\|_p.$$

It then follows that $\mathcal{E}_i$ is bounded in $L^p$. □

We are now in a position to establish the existence of the optimal solution.

Theorem 6.2 For a given set of density functions $f_1, f_2, \ldots, f_m$, the minimization problem (6.1) has a unique solution if the admissible density functions satisfy the conditions

$$|g(x) - f_i(x)|^{p-1} \le B_i \text{ for all } x, \qquad g \in L^p, \quad 1 < p < \infty.$$

Proof: Note that the log function is strictly convex. The assertion of the theorem follows from Theorem 6.1 and Lemma 6.2. □
6.3 Solution to the Isoperimetric Problem

In order to find the solution of the problem, it is useful to state certain results from the calculus of variations. In 1744, Euler formulated the general isoperimetric problem as follows: find a vector function $(y_1, y_2, \ldots, y_m)$ which minimizes the functional

$$Z[y_1, y_2, \ldots, y_m] = \int_a^b f\bigl(x, y_1(x), \ldots, y_m(x), y_1'(x), \ldots, y_m'(x)\bigr)dx, \qquad (6.2)$$

and satisfies the initial conditions

$$y_i(a) = y_{ia}, \quad i = 1, 2, \ldots, m, \qquad (6.3)$$

as well as the terminal conditions

$$y_i(b) = y_{ib}, \quad i = 1, 2, \ldots, m, \qquad (6.4)$$

while

$$\int_a^b f_i\bigl(x, y_1(x), \ldots, y_m(x), y_1'(x), \ldots, y_m'(x)\bigr)dx = l_i, \quad i = 1, 2, \ldots, m, \qquad (6.5)$$

where $l_1, l_2, \ldots, l_m$ are given numbers. We now state a fundamental theorem of the calculus of variations.

Theorem 6.3 For $(y_1, y_2, \ldots, y_m) \in C^1[a, b]$ to be a solution of the isoperimetric problem [(6.2)-(6.5)], it is necessary that there exist m + 1 constants $t = (t_0, t_1, \ldots, t_m) \ne (0, \ldots, 0)$ such that, for $k = 1, 2, \ldots, m$,

$$\frac{d}{dx}h_{y_k'}\bigl(x, y_1, \ldots, y_m, y_1', \ldots, y_m', t\bigr) - h_{y_k}\bigl(x, y_1, \ldots, y_m, y_1', \ldots, y_m', t\bigr) = 0, \qquad (6.6)$$

where

$$h\bigl(x, y_1, \ldots, y_m, y_1', \ldots, y_m', t\bigr) = t_0f\bigl(x, y_1, \ldots, y_m, y_1', \ldots, y_m'\bigr) + \sum_{p=1}^{m}t_pf_p\bigl(x, y_1, \ldots, y_m, y_1', \ldots, y_m'\bigr).$$

Proof: See, for example, Theorem 2 on p. 90 of Giaquinta and Hildebrandt (1996). □

Note that the values of a and b are arbitrary; they may take the values $+\infty$ and $-\infty$. The original proof is given in Bliss (1930), where detailed discussions can also be found.

6.4 Derivation of the WL Functions

We now establish the necessary condition for the optimal solution of the minimization problem. Assume that the density functions $f_1, f_2, \ldots, f_m$ are continuous and twice differentiable.

Theorem 6.4 (Necessary Condition) For g* to be the optimal solution, it is necessary that it be a mixture distribution; i.e., there exist constants $t_1^*, t_2^*, \ldots, t_m^*$ such that

$$\sum_{i=1}^{m}t_i^* = 1 \qquad \text{and} \qquad g^*(x) = \sum_{k=1}^{m}t_k^*f_k(x).$$

Proof: There exist constants $b_2, b_3, \ldots, b_m$ such that $I(f_i, g) = b_i \le a_i$, $i = 2, 3, \ldots, m$, for any particular choice of g satisfying the constraints (6.1). We seek the optimal solution, which is a point on a certain manifold in the function space. Thus the optimization problem can be reformulated in the context of the calculus of variations as

$$\min_{g\in\mathcal{E}}I(g) = \min_g\int f_1(x)\log\frac{f_1(x)}{g(x)}dx$$

subject to the constraints

$$\int f_i(x)\log\frac{f_i(x)}{g(x)}dx = b_i, \quad i = 2, \ldots, m; \qquad \int g(x)dx = 1; \qquad g(x) \ge 0.$$

Define

$$\psi(x, g) = f_1(x)\log\frac{f_1(x)}{g(x)} + l_0\,g(x) + \sum_{k=2}^{m}l_kf_k(x)\log\frac{f_k(x)}{g(x)}.$$

By Theorem 6.3, a necessary condition for g* to be the optimal solution is that it satisfy the Euler-Lagrange equation

$$\frac{\partial\psi}{\partial g} - \frac{d}{dx}\Bigl(\frac{\partial\psi}{\partial g'}\Bigr) = 0, \qquad (6.9)$$

where $g' = \frac{dg}{dx}$. Notice that $\psi(x, g)$ is not a function of g', which implies that $\frac{\partial\psi}{\partial g'} = 0$. The Euler equation then becomes $\frac{\partial\psi}{\partial g} = 0$, i.e.

$$-\frac{f_1(x)}{g(x)} + l_0 - \sum_{k=2}^{m}l_k\frac{f_k(x)}{g(x)} = 0.$$

We then have

$$g^*(x) = \sum_{k=1}^{m}t_k^*f_k(x),$$

where $t_1^* = 1/l_0$ and $t_k^* = l_k/l_0$, $k = 2, \ldots, m$. Since we seek a density function, the $t_i^*$ must sum to 1, because $\int g^*(x)dx = \sum_{k=1}^{m}t_k^* = 1$. It also follows that $g^*(x) = \sum_{k=1}^{m}t_k^*f_k(x) \ge 0$: if $g^*(x) \le 0$ for all $x \in K$ with $P_1(K) > 0$, then the constraints $\int f_i(x)\log(f_i(x)/g^*(x))dx = b_i \ge 0$ cannot be satisfied. This completes the proof. □
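Theorem 6.4 can also be checked numerically: restricting the search to mixtures and minimizing $I(f_1, g)$ subject to (6.1) is a small constrained optimization. A sketch with three normal densities; the densities, grid and constraint levels $a_i$ are all our illustrative choices, not values from the thesis:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import minimize
from scipy.stats import norm

# Candidate densities f1, f2, f3 evaluated on a grid (illustrative choices).
xs = np.linspace(-8.0, 8.0, 4001)
fs = np.stack([norm.pdf(xs, mu, 1.0) for mu in (0.0, 0.5, 1.0)])

def rel_entropy(p, q):
    """I(p, q) = integral of p log(p/q), by quadrature on the grid."""
    return trapezoid(p * np.log(p / q), xs)

a = [None, 0.05, 0.05]                    # constraint levels a_2, a_3 in (6.1)
cons = [{'type': 'eq', 'fun': lambda t: t.sum() - 1.0}]
cons += [{'type': 'ineq', 'fun': (lambda t, i=i: a[i] - rel_entropy(fs[i], t @ fs))}
         for i in (1, 2)]

# By Theorem 6.4 we may search over mixtures g = sum_k t_k f_k only.
res = minimize(lambda t: rel_entropy(fs[0], t @ fs), np.ones(3) / 3,
               bounds=[(0.0, 1.0)] * 3, constraints=cons, method='SLSQP')
print(res.x)                              # approximate optimal weights t*
```

Tightening an $a_i$ toward zero pushes the corresponding $t_i$ toward one, in line with Theorems 6.6 and 6.7 below.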
Consider the minimization problem without any constraints, in which we seek the optimal density g* minimizing $I(f_1, g)$ for a given $f_1(x)$. According to Theorem 6.4, the necessary condition for the optimal function is $g^*(x) = t_1^*f_1(x)$. Since $t_i^* = 0$, $i = 2, 3, \ldots, m$, we must have $t_1^* = 1$. It then follows that $g^*(x) = f_1(x)$ a.e. Furthermore, we have $I(f_1, g) \ge I(f_1, g^*) = I(f_1, f_1) = 0$ for any density function g. This result is also known as the Shannon-Kolmogorov information inequality.

We establish the uniqueness of the optimal solution in the next theorem.

Theorem 6.5 (Uniqueness) Suppose $g^* = \sum_{i=1}^{m}t_i^*f_i(x)$ with the $t_i^*$ chosen so that g* satisfies the constraints (6.1), $\sum_{i=1}^{m}t_i^* = 1$ and $0 \le t_i^* \le 1$, $i = 1, 2, \ldots, m$. Then g* uniquely minimizes $I(f_1, g)$ over all probability densities g satisfying the constraints (6.1).

Proof: Suppose that there exists a probability density function $g_0$ such that

$$\int f_1(x)\log\frac{f_1(x)}{g_0(x)}dx \le \int f_1(x)\log\frac{f_1(x)}{g^*(x)}dx, \qquad \text{while} \qquad \int f_i(x)\log\frac{f_i(x)}{g_0(x)}dx \le a_i, \quad i = 2, \ldots, m.$$

It follows that

$$\int f_i(x)\log g^*(x)dx \le \int f_i(x)\log g_0(x)dx, \quad i = 1, 2, \ldots, m.$$

Multiplying by the $t_i^*$ and summing, we have

$$\sum_{i=1}^{m}t_i^*\int f_i(x)\log g^*(x)dx \le \sum_{i=1}^{m}t_i^*\int f_i(x)\log g_0(x)dx,$$

that is, $\int g^*(x)\log g^*(x)dx \le \int g^*(x)\log g_0(x)dx$. It follows that $I(g^*, g_0) \le 0$. However, we know that $I(g^*, g_0) \ge 0$. Therefore $g^*(x) = g_0(x)$ for almost all x. This completes the proof. □

The weights $t_i^*$ are functions of $a_2, \ldots, a_m$. The relationship between the $t_i^*$ and the $a_i$ is described by the next theorem.

Theorem 6.6 Suppose there exist $a^0 = (a_2, \ldots, a_m)^t$ and $\delta^0 = (\delta_2, \ldots, \delta_m)^t$ such that $g_0 = \sum_{i=1}^{m}t_if_i(x)$ exists, with the $t_i$ chosen so that $g_0$ achieves the equalities in the constraints (6.1) and $\sum_{i=1}^{m}t_i = 1$, $0 \le t_i \le 1$, for any a with $|a_i - a_i^0| \le \delta_i^2$. Then the $t_i$ are monotone functions of the $a_i$. Moreover,

$$\frac{\partial t_i}{\partial a_i} \le 0, \quad i = 2, \ldots, m, \qquad \text{and} \qquad \frac{\partial t_1}{\partial a_i} \ge 0.$$

Proof: Let $\phi_i(x) = f_i(x) - f_1(x)$. Then

$$g^*(x) = f_1(x) + \sum_{k=2}^{m}t_k\phi_k(x) \qquad \text{and} \qquad \int\phi_i(x)dx = 0, \quad i = 2, \ldots, m.$$

It also follows that

$$f_i(x) = g^*(x) + \phi_i(x) - \sum_{k=2}^{m}t_k\phi_k(x) \ge 0,$$

which implies that

$$-\Bigl[\phi_i(x) - \sum_{k=2}^{m}t_k\phi_k(x)\Bigr] \le g^*(x). \qquad (6.10)$$

Since g* satisfies the constraints (6.1) with equality, and $\partial g^*/\partial t_i = \phi_i$, it follows that, for $2 \le i \le m$,

$$\frac{\partial a_i}{\partial t_i} = \frac{\partial}{\partial t_i}\Bigl[-\int f_i(x)\log g^*(x)dx\Bigr] = -\int\bigl[g^*(x) + \phi_i(x) - \sum_kt_k\phi_k(x)\bigr]\frac{\phi_i(x)}{g^*(x)}dx = -\int\phi_i(x)dx - \int\bigl[\phi_i(x) - \sum_kt_k\phi_k(x)\bigr]\frac{\phi_i(x)}{g^*(x)}dx \le -\int\phi_i(x)dx + \int g^*(x)\frac{\phi_i(x)}{g^*(x)}dx = 0,$$

by (6.10). Therefore, for $i = 2, \ldots, m$, $\frac{\partial a_i}{\partial t_i} \le 0$, and hence $\frac{\partial t_i}{\partial a_i} \le 0$. It also follows that $\frac{\partial t_1}{\partial a_i} \ge 0$, since $t_1 + t_2 + \ldots + t_m = 1$. This completes the proof. □

Theorem 6.7 The weights $t_i$ all lie between 0 and 1.

Proof: Note that if we set $a_i = 0$ then $t_i = 1$, while if $a_i = \infty$ then $t_i = 0$. Since $t_j$ is a monotone function of $a_i$ for any fixed $a_j$, $i \ne j$, it follows that $0 \le t_i \le 1$. □

The distribution functions $f_1, f_2, \ldots, f_m$ are, in fact, unknown, and the optimum function must be derived by using samples from the different populations.
The derivation of the weighted likelihood function for the discrete case is given in Hu and Zidek (1994). We now generalize their derivation of the weighted likelihood function.

Theorem 6.8 Assume that the optimal distribution takes the same functional form as $f_1$, i.e. f(x). Given $X_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n_i$, the optimization problem considered in this section is equivalent to optimizing the weighted likelihood function.

Proof: By the proof of Theorem 6.4 and the Lagrange theorem, we need to choose the optimal density function g* which minimizes

$$\int f_1(x)\log\frac{f_1(x)}{g(x)}dx + \sum_{i=2}^{m}l_i\Bigl[\int f_i(x)\log\frac{f_i(x)}{g(x)}dx - b_i\Bigr].$$

The minimization problem considered is then equivalent to maximizing the first term in that expression with the sign reversed, i.e.

$$\max_g\sum_{i=1}^{m}t_i\int f_i(x)\log g(x)dx, \qquad \text{that is,} \qquad \max_\theta\sum_{i=1}^{m}t_i\int\log f(x;\theta)\,dF_i(x).$$

However, the distributions $f_1(x), f_2(x), \ldots, f_m(x)$ are unknown. Their natural estimates in the non-parametric context replace each $F_i$ by its empirical counterpart $\hat F_i$. Assuming that the optimal density function takes the same functional form as $f_1$, the estimate of the parameter of the optimal distribution is found as

$$\hat\theta = \arg\max_\theta\sum_{i=1}^{m}\frac{t_i}{n_i}\sum_{j=1}^{n_i}\log f(X_{ij};\theta).$$

This implies that estimating the parameter of the optimal density is equivalent to finding the WLE derived from the weighted likelihood function, provided the functional form of the optimal density is known. This completes the proof. □

We have shown that the optimal function takes the form $g^*(x) = \sum_{k=1}^{m}t_kf_k(x)$. However, the density g* does not exist if the constraints define an empty set. Consider a relatively simple situation in which three populations are involved. Recall that the $t_i$ must satisfy $t_1 + t_2 + t_3 = 1$ and $0 \le t_i \le 1$, $i = 1, 2, 3$, or equivalently $0 \le t_2 + t_3 \le 1$, $0 \le t_2 \le 1$ and $0 \le t_3 \le 1$. If we set $a_2 = a_3 = 0$, then no probability distribution satisfies the constraints when $f_2 \ne f_3$: a probability density cannot take the functional forms of $f_2$ and $f_3$ at the same time. Indeed, in order to satisfy $a_2 = 0$, the weight $t_2$ must be set to 1, and for the same reason we must also have $t_3 = 1$; clearly this set of weights no longer satisfies $t_1 + t_2 + t_3 = 1$.

Lemma 6.3 The following inequality holds:

$$I\Bigl(f_i, \sum_{k=1}^{m}t_kf_k\Bigr) \le \sum_{k=1}^{m}t_kI(f_i, f_k), \quad i = 2, 3, \ldots, m.$$

Proof: The function $-\log g$ is a convex function of g. It then follows that

$$-\log\Bigl(\sum_{k=1}^{m}t_kf_k(x)\Bigr) \le -\sum_{k=1}^{m}t_k\log f_k(x). \qquad (6.11)$$

We then have

$$I\Bigl(f_i, \sum_{k=1}^{m}t_kf_k\Bigr) = \int\log\frac{f_i(x)}{\sum_kt_kf_k(x)}f_i(x)dx \le \int\Bigl[\log f_i(x) - \sum_{k=1}^{m}t_k\log f_k(x)\Bigr]f_i(x)dx = \int\sum_{k=1}^{m}t_k\bigl[\log f_i(x) - \log f_k(x)\bigr]f_i(x)dx = \sum_{k=1}^{m}t_kI(f_i, f_k),$$

using $\sum_kt_k = 1$. This completes the proof. □

Let $D = (d_{ij})$, where

$$d_{ij} = \begin{cases}1 & \text{if } i = 1,\\ I(f_i, f_j) & \text{if } i = 2, 3, \ldots, m,\end{cases}$$

and let $a = (1, a_2, \ldots, a_m)^t$.

Theorem 6.9 (Existence) The optimal solution does not exist if

$$rank(D_{m\times m}) < rank(B_{m\times(m+1)}), \qquad (6.12)$$

where $B_{m\times(m+1)} = [D : a]$.

Proof: Note that $I(f_i, \sum_{k=1}^{m}t_kf_k)$ is bounded by $\sum_{k=1}^{m}t_kI(f_i, f_k)$ by Lemma 6.3. Set $a_i = \sum_{k=1}^{m}t_kI(f_i, f_k)$, $i = 2, 3, \ldots, m$, and note that $\sum_{k=1}^{m}t_k = 1$. We then have the simultaneous linear equations in t:

$$Dt = a.$$

By a standard result of elementary linear algebra, this system is consistent only when (6.12) fails, and the assertion of the theorem follows. □
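The rank condition (6.12) is mechanical to check once the pairwise entropies $I(f_i, f_k)$ are available. A small sketch; the numeric entries below are invented for illustration:

```python
import numpy as np

def solution_may_exist(D, a):
    """Necessary-condition check from Theorem 6.9: the linear system
    D t = a must be consistent, i.e. rank(D) == rank([D : a])."""
    D = np.asarray(D, float)
    a = np.asarray(a, float).reshape(-1, 1)
    return np.linalg.matrix_rank(D) == np.linalg.matrix_rank(np.hstack([D, a]))

# Row 1 encodes sum(t) = 1; rows 2..m hold pairwise entropies I(f_i, f_k),
# with zeros on the diagonal since I(f_i, f_i) = 0.
D = np.array([[1.00, 1.00, 1.00],
              [0.20, 0.00, 0.35],
              [0.30, 0.35, 0.00]])   # illustrative I(f_i, f_k) values
print(solution_may_exist(D, [1.0, 0.25, 0.30]))
```

When the check fails, the constraint set of (6.1) is empty for the given $a_i$ and the weights should be recomputed with looser constraint levels.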
Chapter 7

Saddlepoint Approximation of the WLE

7.1 Introduction

In the context of weighted likelihood estimation, the i.i.d. assumption is no longer valid. Furthermore, the sample sizes are usually moderate or even very small. The powerful saddlepoint technique is therefore applied to derive the approximate distribution of the WLE for the exponential family. The saddlepoint approximation for estimating equations proposed in Daniels (1983) is further generalized to derive the approximate density of a WLE defined by an estimating equation.

7.2 Review of the Saddlepoint Approximation

In a pioneering paper, Daniels (1954) introduced a new idea into statistics by applying saddlepoint techniques to derive a very accurate approximation to the distribution of the sample mean in the i.i.d. case. It is a general technique which allows us to compute asymptotic expansions of integrals of the form

$$\int_{\mathcal{P}}e^{vw(z)}\phi(z)dz \qquad (7.1)$$

when the real parameter v is large and positive. Here w and $\phi$ are analytic functions of z in a domain of the complex plane which contains the path of integration $\mathcal{P}$. This technique is called the method of steepest descent and is used to derive saddlepoint approximations to the density function of a general statistic.

Consider the integral (7.1). In order to find the approximation, we may deform the path of integration $\mathcal{P}$ arbitrarily, provided we remain in the domain where w and $\phi$ are analytic. We deform the path $\mathcal{P}$ so that (i) the new path of integration passes through a zero of the derivative w'(z), and (ii) the imaginary part of w, $\Im w(z)$, is constant on the new path.

If we write $z = x + iy$, $w(z) = u(x, y) + iv(x, y)$ and $z_0 = x_0 + iy_0$ with $w'(z_0) = 0$, and denote by S the surface $(x, y) \to u(x, y)$, then by the Cauchy-Riemann differential equations

$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x},$$

the point $(x_0, y_0)$ cannot be a maximum or a minimum, but is a saddlepoint of the surface S. Moreover, the orthogonal trajectories of the level curves $u(x, y) = constant$ are given by the curves $v(x, y) = constant$. It follows that the paths on S corresponding to the orthogonal trajectories of the level curves are paths of steepest descent. We truncate the integration at a certain point along the paths of steepest descent; the major part of the integral is then used to approximate the integral over the complex plane. Detailed discussions can be found in Daniels (1954).

Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. real-valued random variables with density f and that $T_n(X_1, X_2, \ldots, X_n)$ is a real-valued statistic with density $f_n$. Let $M_n(a) = \int e^{at}f_n(t)dt$ be the moment generating function and $K_n(a) = \log M_n(a)$ the cumulant generating function. Suppose further that the moment generating function $M_n(a)$ exists for real a in some non-vanishing interval containing the origin. By Fourier inversion,

$$f_n(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}M_n(ir)e^{-irx}dr = \frac{n}{2\pi i}\int_{r-i\infty}^{r+i\infty}\exp\bigl(n(R_n(z) - zx)\bigr)dz,$$

where the path of integration is parallel to the imaginary axis, r is any real number in the interval where the moment generating function exists, and $R_n(z) = K_n(nz)/n$. Applying the saddlepoint approximation to the last integral gives an approximation to $f_n$ with uniform error of order $n^{-1}$:

$$g_n(x) = \Bigl(\frac{n}{2\pi R_n''(a_0)}\Bigr)^{1/2}\exp\bigl(n[R_n(a_0) - a_0x]\bigr), \qquad (7.2)$$

where $a_0$ is the saddlepoint determined by the equation

$$R_n'(a_0) = x,$$

and $R_n'$ and $R_n''$ denote the first two derivatives of $R_n$. Detailed discussions of the saddlepoint approximation can be found in Daniels (1954) and Field and Ronchetti (1990).
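As a concrete instance of (7.2), the sketch below approximates the density of the mean of n unit-exponential variables, for which $K(a) = -\log(1-a)$ and $R_n = K$; the exact density of the mean is Gamma(n, scale 1/n), so the two can be printed side by side. This worked example is ours, not from the thesis:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gamma

n = 10
K  = lambda a: -np.log1p(-a)          # cumulant generating function of Exp(1)
K1 = lambda a: 1.0 / (1.0 - a)        # K'
K2 = lambda a: 1.0 / (1.0 - a) ** 2   # K''

def g_n(x):
    """Saddlepoint density (7.2) of the mean of n Exp(1) variables at x."""
    a0 = brentq(lambda a: K1(a) - x, -50.0, 1.0 - 1e-12)  # solve K'(a0) = x
    return np.sqrt(n / (2 * np.pi * K2(a0))) * np.exp(n * (K(a0) - a0 * x))

for x in (0.5, 1.0, 1.5):
    print(x, g_n(x), gamma.pdf(x, n, scale=1.0 / n))      # approx vs exact
```

Even at n = 10 the two columns agree to about two significant digits before any renormalization, which is the kind of accuracy the next table illustrates in the Gamma-function setting.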
n   n!       Stirling    Saddlepoint   r.e. of Stirling   r.e. of saddlepoint
1   1        0.92        0.99          0.07786            0.00102
2   2        1.92        1.99          0.04050            0.00052
3   6        5.83        5.99          0.02730            0.00028
4   24       23.50       24.00         0.02058            0.00017
5   120      118.02      119.99        0.01651            0.00016
6   720      710.08      719.94        0.01378            0.00008
7   5040     4980.40     5039.69       0.01183            0.00006
8   40320    39902.87    40318.05      0.01036            0.00005
9   362880   359537.62   362865.91     0.00921            0.00004

Table 7.1: Saddlepoint approximations of $\Gamma(n+1) = n!$.

It is known that the Stirling formula serves as a very good approximation to the Gamma function. A comparison of the accuracies of the Stirling formula and the saddlepoint approximation based on the first two terms is given in the table above, with the last two columns showing the relative errors of the Stirling formula and of the saddlepoint approximation respectively. The saddlepoint approximation is given by $\sqrt{2\pi}\,n^{n+1/2}e^{-n}\bigl(1 + \frac{1}{12n}\bigr)$. Notice that the expression before $\bigl(1 + \frac{1}{12n}\bigr)$ is exactly the Stirling formula.
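The entries of Table 7.1 can be regenerated directly from the two-term formula; a short sketch (the output formatting is ours):

```python
import numpy as np
from math import factorial

# Two-term saddlepoint approximation of Gamma(n+1) = n!:
# sqrt(2*pi) * n**(n + 0.5) * exp(-n) * (1 + 1/(12n));
# the leading factor alone is the Stirling formula.
for n in range(1, 10):
    stirling = np.sqrt(2 * np.pi) * n ** (n + 0.5) * np.exp(-n)
    saddle = stirling * (1 + 1.0 / (12 * n))
    exact = factorial(n)
    print(n, exact, round(stirling, 2), round(saddle, 2),
          round(abs(exact - stirling) / exact, 5),
          round(abs(exact - saddle) / exact, 5))
```

Running this reproduces the table, including the roughly hundredfold gain in relative accuracy from the single correction term.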
The saddlepoint approximate density of W is n  \ f  I  \  I  N  2TT (j^K (al) +  ^K (al))  x  exp (ni[Ki(ao) V  o o 32 , , \27r2-^K(a ) R  0  0  0  = ii'. We then have  1  =  2  2  2  x  2  + —K (—a* ) - a* t) ] . ni n Til Tlo / I  V  Assume that n\ = n = n and A^ ) = fw»( )  1/2  I  exp (n(2K(a* ) - a* x)) . 0  J  0  The sample average of the combined sample is W /2. It then follows that n  1/2  2n  1/2  where CHQ is the root of the equation 2—AT(z) = x = 2y.  7.3. RESULTS F O R E X P O N E N T I A L F A M I L Y  131  It then implies that aig indeed satisfies the following equation  This is exactly the saddlepoint  a  0  for an  i.i.d.  sample with size  2n.  Thus, the  saddlepoint approximation of the density of W /2 by Theorem 7.1 when the random n  variables from both samples are indeed i.i.d. is exactly the saddlepoint approximation of the sample mean of a Example 7.2  i.i.d.  sample with size  2n.  (Spread Density for the Exponential)  If Yi, Y  2  ,Y  m +  \  are independent,  exponentially distributed random variables with common mean 1, then the spread, V( +i)  the difference between maxima and minima of the sample, has the  —  m  density  This is also the density of the sum Xi + X + ... + X of independent, exponentially 2  m  distributed random variables Yj with respective means 1 , 1 / 2 , 1 / m . A proof of this claim is sketched in Problem 1.13.13 in Feller (1971). It follows that the cumulant generating function of the sum £>(X) = X\ + X ... + X 2  m  m  .  m  is YJ Ri(z) = — YJ ^ (l n  i=l  lJ)- The equation YJ R\{ ) —  — z  j=l  m  • z  x  m  Theorem 7.1 becomes  i=l  YJ 1/ (j< — z) = x which needs to be solved numerically. Due to the unequal variances of Yj, the normal approximation does not work well. Lange (1999) calculates the saddlepoint approximation for this particular example. Note that the following table is part of Table 17.2 in Lange (1999). The last column is the difference between the exact density and the saddlepoint approximation. It can be seen that saddlepoint approximation gives a very accurate approximation. Lange (1999) also shows that the saddlepoint approximation out-performs the Edgeworth expansion in this example.  7.3.  RESULTS FOR E X P O N E N T I A L F A M I L Y  132  X  Exact  Error  0.5  .00137  -.00001  1.0  .05928  -.00027  1.5  .22998  .00008  2.0  .36563  .00244  2.5  .37874  .00496  3.0  .31442  .00550  3.5  .22915  .00423  4.0  .15508  .00242  4.5  .10046  .00098  5.0  .06340  .00012  5.5  .03939  -.00026  6.0  .02424  -.00037  6.5  .01483  -.00035  7.0  .00904  -.00028  7.5  .00550  -.00021  8.0  .00334  -.00014  Table 7.2: Saddlepoint Approximation of the Spread Density for m = 10. We now consider the saddlepoint approximation to the distribution of the WLE in the general exponential family. It has been shown that the WLE takes the form V m  9  ( Zj  \ AJTJ(XJI,  X )  Theorem 7.2  Assume  ini  j  for some smooth function g under fairly general conditions.  that g is a smooth function.  The saddlepoint  approximation  7.3.  RESULTS F O R E X P O N E N T I A L F A M I L Y  to the density of g (j£  X ^ X ^ , X ^ ^ j  is 1/2  \  ( ni  SwLEiy)  27r£X(A mO/n i  i=i /  '.exp satisfies  1  J  m  I ni( J^i2i(Ajniao)M t=i  \  where  133  the following  -  1  o9~ (y))  a  l  equation  m i=l  Proof: We first derive the approximate density function for S = YJ AjIi(Yjx,  Y ). i n i  i=i  The cumulant generating function of AjTj is Ri(Xiz) where Ri(z) is the cumulant generating function of Tj. 
By Theorem 7.1, the approximate density function of $S = \sum_{i=1}^{m}\lambda_iT_i$ is given by

$$f_S(x) = \Biggl(\frac{n_1}{2\pi\sum_{i=1}^{m}\frac{d^2}{dz^2}R_i(\lambda_in_1z)\big|_{z=\alpha_0^*}/n_1}\Biggr)^{1/2}\exp\Bigl(n_1\Bigl(\sum_{i=1}^{m}R_i(\lambda_in_1\alpha_0^*)/n_1 - \alpha_0^*x\Bigr)\Bigr),$$

where $\alpha_0^*$ is the root of the equation $\sum_{i=1}^{m}\frac{d}{dz}R_i(\lambda_in_1z)/n_1 = x$. The stated approximation for the density of g(S) then follows by the change-of-variable formula, with $x = g^{-1}(y)$ and the Jacobian factor $1/|g'(g^{-1}(y))|$. This completes the proof. □

7.4 Approximation for General WL Estimation

The saddlepoint approximation to estimating equations for the i.i.d. case was developed by Daniels (1983). We generalize the technique to the WLE derived from the estimating equation constructed from the weighted likelihood function. Recall that the WLE is the root of the estimating equation

$$\sum_{i=1}^{m}\lambda_i\sum_{j=1}^{n_i}\frac{\partial}{\partial\theta_1}\log f(X_{ij};\theta_1) = 0. \qquad (7.4)$$

For simplicity, let $\psi(X_{ij};\theta_1) = \frac{\partial}{\partial\theta_1}\log f(X_{ij};\theta_1)$. The estimating equation for the WLE can then be written as

$$\sum_{i=1}^{m}\lambda_i\sum_{j=1}^{n_i}\psi(X_{ij};\theta_1) = 0. \qquad (7.5)$$

Assume that $\psi(X_{ij};\theta)$ is a monotone decreasing function of $\theta$ and that $\lambda_i > 0$, $i = 1, 2, \ldots, m$. For a fixed number a, write

$$S(a) = \sum_{i=1}^{m}\lambda_i\sum_{j=1}^{n_i}\psi(X_{ij}; a),$$

and let $\alpha_0$ be the root of the equation

$$\sum_{i=1}^{m}n_i\lambda_i\frac{\partial}{\partial t}\kappa_i(t, a)\Big|_{t=n_1\lambda_i\alpha_0} = x,$$

where $\exp(\kappa_i(t, a)) = E_{\theta_i}\exp\bigl(t\psi(X_{ij}, a)\bigr)$, $i = 1, 2, \ldots, m$. We have the following theorem.

Theorem 7.3 Let $\hat\theta_1$ be the WLE derived from the weighted likelihood equation. Then

$$P_{\theta_1}(\hat\theta_1 > a) = P(S(a) > 0) \approx \int_0^{\infty}f_S(x, a)dx \qquad (7.6)$$

and

$$f_S(x, a) = \Biggl(\frac{n_1}{2\pi\sum_{i=1}^{m}\frac{d^2}{dz^2}\bigl[n_i\kappa_i(n_1\lambda_iz, a)\bigr]\big|_{z=\alpha_0}/n_1}\Biggr)^{1/2}\exp\Bigl(\sum_{i=1}^{m}n_i\kappa_i(n_1\lambda_i\alpha_0, a) - n_1\alpha_0x\Bigr). \qquad (7.7)$$

Proof: The moment generating function of S(a) is

$$M_S(t, a) = \exp(K_S(t, a)) = \prod_{i=1}^{m}\int\ldots\int\exp\Bigl(t\lambda_i\sum_{j=1}^{n_i}\psi(x_{ij}, a)\Bigr)\prod_{j=1}^{n_i}f(x_{ij};\theta_i)dx_{i1}\ldots dx_{in_i} = \exp\Bigl(\sum_{i=1}^{m}n_i\kappa_i(t\lambda_i, a)\Bigr),$$

where $\exp(\kappa_i(t, a)) = E_{\theta_i}\exp\bigl(t\psi(X_{ij}, a)\bigr)$, $i = 1, 2, \ldots, m$. The function $M_S(t, a)$ is assumed to converge for real t in a non-vanishing interval containing the origin. It then follows that

$$f_S(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\exp\Bigl(\sum_{i=1}^{m}n_i\kappa_i(ir\lambda_i, a)\Bigr)\exp(-irx)dr = \frac{n_1}{2\pi i}\int_{\tau-i\infty}^{\tau+i\infty}\exp\Bigl(\sum_{i=1}^{m}n_i\kappa_i(n_1\lambda_iz, a) - n_1xz\Bigr)dz.$$

The saddlepoint $\alpha_0$ is the root of the equation

$$\sum_{i=1}^{m}n_i\lambda_i\frac{\partial}{\partial t}\kappa_i(t, a)\Big|_{t=n_1\lambda_i\alpha_0} = x,$$

and the saddlepoint approximation to the density $f_S(x)$ is given by (7.7). We can then deploy the device used in Field and Hampel (1982) and Daniels (1983):

$$P_{\theta_1}(\hat\theta_1 > a) = P_{\theta_1}(S(a) > 0), \qquad (7.8)$$

which follows from the monotonicity of $\psi$ in $\theta$. We then have

$$P_{\theta_1}(\hat\theta_1 > a) = P(S(a) > 0) \approx \int_0^{\infty}f_S(x, a)dx. \qquad \Box$$

In general, the saddlepoint approximation is computationally intensive, since the normalizing constant needs to be calculated by numerical integration. We remark that the saddlepoint approximations to the WLE proposed in this chapter are for fixed weights; the saddlepoint approximation to the WLE with adaptive weights needs further study.
Chapter 8

Application to Disease Mapping

8.1 Introduction

In this chapter, we present the results of applying maximum weighted likelihood estimation to parallel time series of hospital-based health data. Specifically, the weighted likelihood method is illustrated on daily hospital admissions for respiratory disease obtained from 733 census sub-divisions (CSD) in Southern Ontario over the May-to-August period from 1983 to 1988. Our main interest is the estimation of the rate of weekly hospital admissions for certain densely populated areas. The association between air pollutants and respiratory morbidity is studied in Zidek et al. (1998) and Burnett (1994).

For the purpose of our demonstration, we consider the estimation of the rate of weekly hospital admissions for CSD # 380, which has the largest yearly hospital admission total among all CSD's from 1983 to 1988. The estimation of the rate of weekly admissions is a challenging task due to the sparseness of the data set. The original data set contains many 0's, representing no hospital admissions on most days in the summer. On certain days of the summer, however, quite a number of people with respiratory disease went to hospital to seek treatment, perhaps due to the high level of pollution in the region. For CSD # 380, which has the largest number of hospital admissions among all the CSD's, there are no hospital admissions on 112 of the 123 days in the summer of 1983, yet there were 17 hospital admissions on day 51. The daily counts for this CSD are shown in Figure 8.1.

[Figure 8.1: Daily hospital admissions for CSD # 380 in the summer of 1983.]

The problems of data sparseness and high variation are quite obvious. We will therefore consider the estimation of the rate of weekly admissions instead of the daily counts. There are 17 weeks in total; for simplicity, the data obtained in the last few days of each year are dropped from the analysis, since they do not constitute a whole week.

8.2 Weighted Likelihood Estimation

We assume that the total number of hospital admissions in a week for a particular CSD follows a Poisson distribution; i.e., for year q, CSD i and week j,

$$Y_{ij}^q \sim \mathcal{P}(\theta_{ij}^q), \quad j = 1, 2, \ldots, 17;\ i = 1, 2, \ldots, 733;\ q = 1, 2, \ldots, 6.$$

[Figure 8.2: Hospital admissions for CSD # 380, 362, 366 and 367 in 1983.]

The raw estimate of $\theta_{ij}^q$, namely $Y_{ij}^q$ itself, is highly unreliable: the sample size is only 1 in this case. Also, each CSD may contain only a small group of people whose lung conditions are susceptible to high levels of air pollution. We therefore think it desirable to combine neighboring CSD's in order to produce a more reliable estimate. For any given CSD, the neighboring CSD's are defined to be CSD's that are in close proximity to the CSD of interest, CSD # 380 in our analysis.
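The proximity rule used in the next paragraph, Euclidean distance below 0.2 on the (longitude, latitude) scale, is simple to express in code. A sketch; the coordinate container and names are our own:

```python
import numpy as np

def neighbours(target, coords, radius=0.2):
    """Return the CSD ids whose (longitude, latitude) lie within Euclidean
    distance `radius` of the target CSD.  `coords` maps CSD id -> (lon, lat);
    in practice the values would come from the census geography files."""
    t = np.asarray(coords[target], dtype=float)
    return [csd for csd, ll in coords.items()
            if csd != target and np.linalg.norm(np.asarray(ll, float) - t) < radius]
```

Applied to the coordinates of the Southern Ontario CSD's with target 380, a rule of this form yields the neighbor set used below.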
To study the rate of weekly hospital admissions in a particular CSD, we would expect neighboring subdivisions to contain relevant information which might help us derive a better estimate than the traditional sample average. The importance of combining disease and exposure data is discussed in Waller et al. (1997). The Euclidean distance between the target CSD and every other CSD in the dataset is calculated using their longitudes and latitudes. CSDs whose Euclidean distance from the target CSD is less than 0.2 are selected as the relevant CSDs. For CSD #380, the neighboring CSDs are CSDs #362, 366 and 367. The time series plots of weekly hospital admissions for these CSDs in 1983 are shown in Figure 8.2. The hospital admissions of these CSDs in a given week appear to be related, since the major peaks in the time series plots occur at roughly the same time points. However, including data from other CSDs might introduce bias. The weight function defined in the WLE can control the degree of bias introduced by the combination of data from other CSDs.

Ideally, we would assume that $\theta_{1j}^q = \theta_1^q$ for $j = 1, 2, \ldots, 17$. But this assumption does not hold, due to seasonality. For example, week 8 always has the largest hospital admissions count for CSD 380. Examining the data more closely, we find that the 8th week of each year is a week with more than normal hospital admissions. In 1983, there are 21 admissions in week 8, while the second largest weekly count is only 7, in week 15. In fact, week 8 is an unusual week throughout 1983 to 1988. The air pollution level might further explain this phenomenon by the method proposed in Zidek et al. (1998). Thus, the assumption is violated at week 8. One alternative is to perform the analysis on a window of a few weeks and to repeat the analysis as the window moves one week forward. This is equivalent to assuming that the $\theta_{1j}^q$ take the same value over a period of a few weeks instead of the entire summer. In this chapter, we simply exclude the observations from week 8 and proceed under the assumption that $\theta_{1j}^q = \theta_1^q$ for $j = 1, \ldots, 7, 9, \ldots, 17$. The fact that the sample means and sample variances of the weekly hospital admissions over those 16 weeks for CSD 380 are quite close to each other supports our assumption.

Let $\bar Y_i^q$ denote the overall sample average of the weekly admissions for CSD $i$ in year $q$. For Poisson distributions, the MLE of $\theta_1^q$ is the sample average of the weekly admissions of CSD #380, and the WLE is a linear combination of the sample averages of the individual CSDs according to Theorem 3.1. Thus, the weighted likelihood estimate of the average weekly hospital admissions for CSD #380 is

$$WLE^q = \sum_{i=1}^{4} \lambda_i \bar Y_i^q, \quad q = 1, 2, \ldots, 6.$$

For our analysis, the weights are selected by the cross-validation procedure proposed in Chapter 5. Recall that, for equal sample sizes, the cross-validated weights solve the linear system $A^q(y)\, \lambda = b^q(y)$, whose components are

$$b_i^q(y) = \sum_{j=1}^{16} Y_{1j}^q\, \bar Y_i^{q(-j)} \quad \text{and} \quad A_{ik}^q(y) = \sum_{j=1}^{16} \bar Y_i^{q(-j)}\, \bar Y_k^{q(-j)}, \quad i = 1, 2, 3, 4;\; k = 1, 2, 3, 4,$$

where $\bar Y_i^{q(-j)}$ denotes the sample average for CSD $i$ in year $q$ with week $j$ deleted.
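Assuming the delete-one form of $b^q(y)$ and $A^q(y)$ given above, the weight computation for one year reduces to a small linear solve. The sketch below uses simulated weekly counts as placeholders for the Ontario data; row 0 plays the role of the target CSD #380.

```python
import numpy as np

# Simulated weekly counts Y[i, j]: 4 CSDs by 16 retained weeks (placeholders).
rng = np.random.default_rng(7)
Y = rng.poisson(lam=[[2.0], [1.5], [1.8], [1.2]], size=(4, 16)).astype(float)

nweeks = Y.shape[1]
# Delete-one-week means: Ybar_loo[i, j] = mean of CSD i's counts without week j.
Ybar_loo = (Y.sum(axis=1, keepdims=True) - Y) / (nweeks - 1)

A = Ybar_loo @ Ybar_loo.T        # A[i, k] = sum_j Ybar_i^(-j) * Ybar_k^(-j)
b = Ybar_loo @ Y[0]              # b[i]    = sum_j Y_{1j}    * Ybar_i^(-j)

weights = np.linalg.solve(A, b)  # cross-validated weights lambda
wle = weights @ Y.mean(axis=1)   # WLE of the weekly admission rate
print("weights:", np.round(weights, 3), " WLE:", round(float(wle), 3))
```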
8.3 Results of the Analysis

We assess the performance of the MLE and the WLE by comparing their MSEs. The MSEs of the MLE and the WLE are defined, for $q = 1, 2, \ldots, 6$, as

$$MSE_M^q = E_{\theta_1^q}\left(\bar Y_1^q - \theta_1^q\right)^2, \qquad MSE_W^q = E_{\theta_1^q}\left(WLE^q - \theta_1^q\right)^2.$$

In fact, the $\theta_1^q$ are unknown. We therefore estimate $MSE_M^q$ and $MSE_W^q$ by replacing $\theta_1^q$ with the MLE. Under the assumption of Poisson distributions, the estimated MSE for the MLE is given by

$$\widehat{MSE}_M^q = \widehat{var}(Y_{1j}^q)/16, \quad q = 1, 2, \ldots, 6.$$

The estimated MSE for the WLE is given as follows:

$$\widehat{MSE}_W^q = \sum_{i=1}^{4} \sum_{k=1}^{4} \lambda_i \lambda_k\, \widehat{Cov}\left(\bar Y_i^q, \bar Y_k^q\right) + \left(\sum_{i=1}^{4} \lambda_i \bar Y_i^q - \bar Y_1^q\right)^2.$$

The estimated MSEs for the MLE and the WLE are given in Table 8.1. It can be seen from the table that the MSE for the WLE is much smaller than that of the MLE. In fact, the average reduction of the MSE from using the WLE is about 25%.

Year   MLE    WLE    16·MSE_M   16·MSE_W   MSE_W/MSE_M
 1     .185   .174    .101       .084       0.80
 2     .328   .282    .241       .131       0.87
 3     .227   .257    .286       .143       0.54
 4     .151   .224    .159       .084       0.96
 5     .303   .322    .298       .130       0.80
 6     .378   .412    .410       .244       0.54

Table 8.1: Estimated MSEs for the MLE and the WLE.

Combining information across these CSDs might also help in prediction, since the patterns exhibited at a neighboring location in a particular year might manifest themselves at the location of interest the next year. To assess the predictive performance of the WLE, we use the WLE derived from one particular year to predict the overall weekly average of the next year, and define the overall prediction error as the average of the resulting squared errors. To be more specific, the overall prediction errors for the traditional sample average and for the WLE are defined as follows:

$$PRED_M = \frac{1}{5} \sum_{q=1}^{5} \left(\bar Y_1^q - \bar Y_1^{q+1}\right)^2, \qquad PRED_W = \frac{1}{5} \sum_{q=1}^{5} \left(WLE^q - \bar Y_1^{q+1}\right)^2.$$

The average prediction error for the MLE, $PRED_M$, is 0.065, while $PRED_W$, the average prediction error for the WLE, is 0.047, which is about 72% of that of the MLE.

8.4 Discussion

Bayes methods are popular choices in the area of disease mapping. Manton et al. (1989) discuss empirical Bayes procedures for stabilizing maps of cancer mortality rates. Hierarchical Bayes generalized linear models are proposed for the analysis of disease mapping in Ghosh et al. (1999). It is not obvious, however, how one would specify the neighborhood that needs to be defined in these approaches. The numerical values of the weight function can be used as a criterion for appropriately defining the neighborhood in a Bayesian analysis. We use the following example to demonstrate how a neighborhood can be defined by using the weights derived from the cross-validation procedure for the WLE.

         CSD 380   CSD 362   CSD 366   CSD 367   Weights
CSD 380   1.000     0.421     0.906     0.572     0.455
CSD 362   0.421     1.000     0.400     0.634     0.202
CSD 366   0.906     0.400     1.000     0.553     0.128
CSD 367   0.572     0.634     0.553     1.000     0.215

Table 8.2: Correlation matrix and the weight function for 1984.

From Table 8.2, we see that there is a strong linear association between CSD 380 and CSD 366. However, the weight assigned to CSD 366 is the smallest one. This shows that a CSD with higher correlation may contain less information for prediction, since its pattern in a given year may be too similar to that of the target CSD to be helpful in predicting the next year. Thus CSD 366, which has the smallest weight, should not be included in the analysis. Therefore, the "neighborhood" of CSD 380 in the analysis should only include CSD 362 and CSD 367.

In general, we might examine those CSDs which are in close proximity to the target CSD and calculate the weight for each selected CSD by using the cross-validation procedure. CSDs with small weights should be dropped from the analysis, since they are not deemed to be helpful or relevant to our analysis according to the cross-validation procedure.
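As a companion to the accuracy comparison of Section 8.3, the sketch below computes the estimated MSEs and their ratio for one year, assuming the variance-plus-squared-bias form of $\widehat{MSE}_W^q$ given there. The weekly counts are simulated placeholders; the weights are borrowed from Table 8.2 purely for illustration.

```python
import numpy as np

# Simulated weekly counts for 4 CSDs (target CSD #380 in row 0) over 16 weeks.
rng = np.random.default_rng(7)
Y = rng.poisson(lam=[[2.0], [1.5], [1.8], [1.2]], size=(4, 16)).astype(float)
weights = np.array([0.455, 0.202, 0.128, 0.215])  # the 1984 weights of Table 8.2

nweeks = Y.shape[1]
ybar = Y.mean(axis=1)

mse_mle = np.var(Y[0], ddof=1) / nweeks     # MSE-hat_M = var-hat(Y_1j)/16
S = np.cov(Y) / nweeks                      # estimated Cov(Ybar_i, Ybar_k)
mse_wle = weights @ S @ weights + (weights @ ybar - ybar[0])**2

print(f"MSE_M={mse_mle:.3f}  MSE_W={mse_wle:.3f}  ratio={mse_wle/mse_mle:.2f}")
```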
We remark that the weight function can also be helpful in selecting an appropriate distribution that takes the spatial structure into account. Ghosh et al. (1999) propose a very general hierarchical Bayes spatial generalized linear model that is considered broad enough to cover a large number of situations where a spatial structure needs to be incorporated. In particular, they propose the following:

$$\theta_i = x_i' b + u_i + v_i, \quad i = 1, 2, \ldots, m,$$

where the $q_i$ are known constants, the $x_i$ are covariates, and the $u_i$ and $v_i$ are mutually independent with $v_i \overset{iid}{\sim} N(0, \sigma_v^2)$, while the $u_i$ have joint pdf

$$f(u_1, \ldots, u_m) \propto \exp\left( -\frac{1}{2\sigma_u^2} \sum_{i=1}^m \sum_{j \neq i} w_{ij} (u_i - u_j)^2 \right),$$

where $w_{ij} > 0$ for all $1 \le i \ne j \le m$. The above distribution is designed to take the spatial structure into account. In their paper, they propose to use $w_{ij} = 1$ if locations $i$ and $j$ are neighbors. They also mention the possibility of using the inverse of the correlation matrix as the weight function. The weights derived from the cross-validation procedure might be a better choice, since they take account of the spatial structure without any specific model assumptions.

The predictive distribution for the weekly total is Poisson(WLE). We can then derive the 95% predictive interval for the weekly average hospital admissions. This might be criticized as failing to take into account the uncertainty about the unknown parameter. Smith (1998), however, argues that the traditional plug-in method has a smaller MSE than the posterior mean under certain circumstances; in particular, it has a smaller MSE when the true value of the parameter is not large. Let $CI_W$ and $CI_M$ be the 95% predictive intervals of the weekly averages calculated from the WLE and the MLE respectively. The results are shown in Table 8.3.

Year    CI_M     CI_W
1983    [0, 3]   [0, 3]
1984    [0, 5]   [0, 4]
1985    [0, 4]   [0, 4]
1986    [0, 3]   [0, 4]
1987    [0, 4]   [0, 5]
1988    [0, 5]   [0, 6]

Table 8.3: 95% predictive intervals based on the MLE and the WLE for CSD 380.

We remark that this chapter is merely a demonstration of the weighted likelihood method. Further analysis is needed if one wants to compare the performances of the WLE, the MLE and the Bayesian estimator in disease mapping.
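To illustrate how intervals like those in Table 8.3 can be produced, the sketch below treats the weekly total as Poisson(WLE), as described above, and reads a central 95% interval off the Poisson quantile function. The rate used is an invented value rather than one of the fitted WLEs.

```python
from scipy.stats import poisson

def predictive_interval(rate, level=0.95):
    """Central plug-in predictive interval for a Poisson(rate) count."""
    alpha = 1.0 - level
    return (int(poisson.ppf(alpha / 2, rate)),
            int(poisson.ppf(1.0 - alpha / 2, rate)))

print(predictive_interval(1.6))   # (0, 4) for an assumed weekly rate of 1.6
```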
Bibliography

[1] Akaike, H. (1985). Prediction and entropy. In: A Celebration of Statistics 1-24, edited by Atkinson, A. C. and Fienberg, S. E., Springer-Verlag, New York.

[2] Bliss, G. A. (1930). The problem of Lagrange in the calculus of variations. The American Journal of Mathematics 52 673-744.

[3] Breiman, L. and Friedman, J. H. (1997). Predicting multivariate responses in multiple regression. Journal of the Royal Statistical Society: Series B 36 111-147.

[4] Burnett, R. and Krewski, D. (1994). Air pollution effects on hospital admission rates: A random effects modeling approach. The Canadian Journal of Statistics 22 441-458.

[5] Cox, D. R. (1981). Combination of data. Encyclopedia of Statistical Sciences 45-52, John Wiley & Sons, Inc., New York.

[6] Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability 3 146-158.

[7] Daniels, H. E. (1954). Saddlepoint approximation in statistics. The Annals of Mathematical Statistics 25 59-63.

[8] Daniels, H. E. (1983). Saddlepoint approximation for estimating equations. Biometrika 70 89-96.

[9] Dickey, J. M. (1971). The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics 42 204-223.

[10] Dickey, J. M. and Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. The Annals of Mathematical Statistics 41 214-226.

[11] Easton, G. (1991). Compromised maximum likelihood estimators for location. Journal of the American Statistical Association 86 1051-1064.

[12] Eguchi, S. and Copas, J. (1998). A class of local likelihood methods and near-parametric asymptotics. Journal of the Royal Statistical Society, Series B 60 709-724.

[13] Ekeland, I. and Temam, R. (1976). Convex Analysis and Variational Problems. American Elsevier Publishing Company Inc., New York.

[14] Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. John Wiley & Sons, Inc., New York.

[15] Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman and Hall, New York.

[16] Field, C. A. and Hampel, F. R. (1982). Small sample asymptotic distributions of M-estimators of location. Biometrika 69 29-46.

[17] Field, C. A. and Ronchetti, E. (1990). Small Sample Asymptotics. Institute of Mathematical Statistics, Hayward.

[18] Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association 70 320-328.

[19] Genest, C. and Zidek, J. V. (1986). Combining probability distributions: a critique and an annotated bibliography. Statistical Science 1 114-148.

[20] Ghosh, M., Natarajan, K., Waller, L. A. and Kim, D. (1999). Hierarchical Bayes GLMs for the analysis of spatial data: An application to disease mapping. Journal of Statistical Planning and Inference 75 305-318.

[21] Giaquinta, M. and Hildebrandt, S. (1996). Calculus of Variations. Springer-Verlag, New York.

[22] Härdle, W. and Gasser, T. (1984). Robust nonparametric function fitting. Journal of the Royal Statistical Society, Series B 46 42-51.

[23] Hu, F. (1994). Relevance Weighted Smoothing and a New Bootstrap Method. Ph.D. Dissertation, Department of Statistics, University of British Columbia, Canada.

[24] Hu, F. (1997). The asymptotic properties of the maximum-relevance weighted likelihood estimators. The Canadian Journal of Statistics 25 45-59.

[25] Hu, F., Rosenberger, W. F. and Zidek, J. V. (2000). Relevance weighted likelihood for dependent data. Metrika 51 223-243.

[26] Hu, F. and Zidek, J. V. (1997). The relevance weighted likelihood with applications. In: Empirical Bayes and Likelihood Inference 211-235, edited by Ahmed, S. E. and Reid, N., Springer, New York.

[27] Hunsberger, S. (1994). Semiparametric regression in likelihood-based models. Journal of the American Statistical Association 89 1354-1365.

[28] Kullback, S. (1954). Certain inequality in information theory and the Cramer-Rao inequality. The Annals of Mathematical Statistics 25 745-751.

[29] Kullback, S. (1959). Information Theory and Statistics. Lecture Notes-Monograph Series Volume 21, Institute of Mathematical Statistics.

[30] Lange, K. (1999). Numerical Analysis for Statisticians. Springer-Verlag, New York.

[31] Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley & Sons Inc., New York.

[32] Markatou, M., Basu, A. and Lindsay, B. (1997). Weighted likelihood estimating equations: The discrete case with applications to logistic regression. Journal of Statistical Planning and Inference 92 215-232.

[33] Markatou, M., Basu, A. and Lindsay, B. (1998). Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association 93 740-750.
[34] Manton, K. G., Woodbury, M. A., Stallard, E., Riggan, W. B., Creason, J. P. and Pellon, A. C. (1989). Empirical Bayes procedures for stabilizing maps of U.S. cancer mortality rates. Journal of the American Statistical Association 84 637-650.

[35] National Research Council (1992). Combining Information: Statistical Issues and Opportunities for Research. National Academy Press, Washington, D.C.

[36] Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B 56 3-48.

[37] Rao, B. L. S. (1991). Asymptotic theory of weighted maximum likelihood estimation for growth models. In: Statistical Inference in Stochastic Processes 183-208, edited by Prabhu, N. U. and Basawa, I. V., Marcel Dekker, Inc., New York.

[38] Rao, C. R. (1965). Linear Statistical Inference and Its Applications. John Wiley & Sons, Inc., New York.

[39] Royden, H. L. (1988). Real Analysis. Prentice Hall, New York.

[40] Savage, L. J. (1954). The Foundations of Statistics. Springer-Verlag, New York.

[41] Schervish, M. J. (1995). Theory of Statistics. Springer-Verlag, New York.

[42] Small, C. G., Wang, J. and Yang, Z. (2000). Eliminating multiple root problems in estimation. Statistical Science 15 313-341.

[43] Smith, R. L. (1998). Bayesian and frequentist approaches to parametric predictive inference. Bayesian Statistics 6 589-612.

[44] Staniswalis, J. G. (1989). The kernel estimate of a regression function in likelihood-based models. Journal of the American Statistical Association 89 276-283.

[45] Stone, M. (1974). Cross-validation choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B 36 111-147.

[46] Tibshirani, R. and Hastie, T. (1987). Local likelihood estimation. Journal of the American Statistical Association 82 559-567.

[47] van Eeden, C. (1996). Estimation in restricted parameter spaces: some history and some recent developments. Statistics & Decisions 17 1-30.

[48] van Eeden, C. and Zidek, J. V. (1998). Combining sample information in estimating ordered normal means. Technical Report #182, Department of Statistics, University of British Columbia.

[49] van Eeden, C. and Zidek, J. V. (2001). Estimating one of two normal means when their difference is bounded. Statistics & Probability Letters 51 277-284.

[50] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics 15 358-372.

[51] Waller, L. A., Louis, T. A. and Carlin, B. P. (1997). Bayes methods for combining disease and exposure data in assessing environmental justice. Environmental and Ecological Statistics 4 267-281.

[52] Warm, T. A. (1987). Weighted likelihood estimation of ability in item response theory. Psychometrika 54 427-450.

[53] Zidek, J. V., White, R. and Le, N. D. (1998). Using spatial data in assessing the association between air pollution episodes and respiratory morbidity. Statistics for the Environment 4: Pollution Assessment and Control 111-136.
