Adaptive Likelihood Weights and Mixtures of Empirical Distributions by Jean-Frangois Plante B.Sc, Universite Laval, 2000 M.Sc, Universite Laval, 2002 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics) THE UNIVERSITY OF BRITISH COLUMBIA July, 2007 © Jean-Frangois Plante 2007 Abstract Suppose that you must make inference about a population, but that data from m — 1 similar populations are available. The weighted likelihood uses exponential weights to include all the available information into the inference. The contribution of each datum is discounted based on its dissimilarity with the target distribution. One could hope to elicitate the likelihood weights from scientific information, but using data-based weights is more pragmatic. To this day, no entirely satisfactory method has been found for determining likelihood weights from the data. We propose a way to determine the likelihood weights based on data. The suggested "MAMSE" weights are nonparametric and can be used as likelihood weights, or as mixing probabilities to define a mixture of empirical distributions. In both cases, using the MAMSE weights allows strength to be borrowed from the m—1 similar populations whose distribution may differ from the target. The MAMSE weights are defined for different types of data: univariate, censored and multivariate. In addition to their role for the likelihood, the MAMSE weights are used to define a weighted Kaplan-Meier estimate of the survival distribution and weighted co efficients of correlation based on ranks. The maximum weighted pseudo-likelihood, a new method to fit a family of copulas, is also proposed. All these examples of inference using the MAMSE weights are shown to be asymptotically unbiased. Furthermore, simulations show that inference based on MAMSE-weighted methods can perform better than their unweighted counterparts. Hence, the adaptive weights we propose successfully trade bias for precision. ii Table of Contents Abstract ii Table of Contents . iiList of Tables vi List of Figures xAcknowledgements i xiii 1 Introduction 1 1.1 Review1.2 Overview 3 2 Likelihood, Entropy and Mixture Distributions 5 2.1 Likelihood and Entropy 5 2.2 The Weighted Likelihood 6 2.3 The MAMSE Weights 7 2.4 An Algorithm to Compute the MAMSE Weights 10 3 MAMSE Weights for Univariate Data 17 3.1 Notation and Review 13.2 Definition of the MAMSE Weights 18 3.3 Computing the MAMSE Weights 9 iii Table of Contents 3.4 Structural Properties of the MAMSE Weights 20 3.5 Strong Uniform Convergence of the Weighted Empirical CDF 22 3.6 Weighted Strong Law of Large Numbers 30 3.7 Strong Consistency of the Maximum Weighted Likelihood Estimate .... 36 3.8 Asymptotic Behavior of the MAMSE Weights 43.9 Issues for Asymptotic Normality 50 3.10 Simulations 51 3.10.1 Two Normal Distributions 52 3.10.2 Complementary Populations 4 3.10.3 Negative Weights 6 3.10.4 Earthquake Data 60 4 MAMSE Weights for Right-Censored Data 65 4.1 Notation and Review of the Kaplan-Meier Estimate 66 4.2 Definition of the MAMSE Weights for Right-Censored Data 69 4.3 Computing the MAMSE Weights for Right-Censored Data 72 4.4 Uniform Convergence of the MAMSE-Weighted Kaplan-Meier Estimate . . 73 4.5 Simulations 86 4.5.1 Survival in the USA 87 4.5.2 Survival in Canada 94 5 MAMSE Weights for Copulas 100 5.1 Review of Copulas '. 100 5.2 Notation and Empirical Estimation . . . ., 106 5.3 Definition of the MAMSE Weights for Copulas Ill 5.4 Computing the MAMSE Weights for Copulas 113 5.5 Uniform Convergence of the MAMSE-Weighted Empirical Copula 114 5.6 Weighted Coefficients of Correlation Based on Ranks 122 Table of Contents 5.7 Weighted Strong Law of Large Numbers 126 5.8 Strong Consistency of the Weighted Pseudo-Likelihood 134 5.9 Simulations 143 5.9.1 Different Correlations for Bivariate Copulas 144 5.9.2 Different Bivariate Distributions with Common Correlation 147 5.9.3 Multivariate Normal 148 6 Summary and Future Research 159 6.1 Summary of the Work 156.2 Future Research 161 7 Conclusion 165 Bibliography 7 v List of Tables 3.1 Average MAMSE weights for Population 1 when equal samples of size n are drawn from Normal distributions with unit variance and means 0 and A respectively. The results are averages over 10000 replicates 52 3.2 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of equal size n are simulated from Normal distributions with unit variance and means 0 and A respectively. The results are averaged over 10000 replicates. 53 3.3 Average MAMSE weights for Population 1 when samples of size n and lOn are drawn from Normal distributions with unit variance and means 0 and A respectively. The results are averages over 40000 replicates 54 3.4 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of sizes n and lOn are simulated from Normal distributions with unit variance and means 0 and A respectively. The results are averaged over 40000 replicates. 55 3.5 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) and av erage MAMSE weights allocated to samples of sizes n drawn from A/\0,1), |A/"(0,1)| and — pV(0,1)| respectively. The results are averages over 10000 repetitions. 56 3.6 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the MAMSE weights are calculated without the constraints A; > 0. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a A/"(0,1) and a A/"(A, 1) distribution. All results are averages over 40000 repetitions. ... 59 vi List of Tables 3.7. Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the usual MAMSE weights (i.e. constrained to positive values) are used. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N~(0, 1) and a A^(A, 1) distribution. All results are averages over 40000 repetitions 59 3.8 Relative efficiency of the MWLE with and without the constraints Aj > 0 as measured by 100 MSE(constrained MWLE)/MSE(unconstrained MWLE). Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a A/"(0,1) and a A/"(A, 1) distribution. All results are averages over 40000 repetitions. 60 3.9 Number of earthquakes in three Western Canadian areas between the 12th of February 2001 and the \1th of February 2006. The magnitude of these earthquakes is modeled by a Gamma distribution; the maximum likelihood estimates appear above and are used as the "true" parameters for this sim ulation 62 3.10 Efficiency in estimating some probabilities about the magnitude of the next earthquake in the Lower Mainland - Vancouver Island area followed by the average of the actual estimates and their true values. Efficiency is measured by 100 MSE(plug-in MLE)/MSE(plug-in MWLE). The first four columns of probabilities should be multiplied by the corresponding multiplier 63 4.1 Relative performance of the WKME as measured by 100Ai/A^. Both areas are calculated on the interval [0, 55] and various upper bounds U are used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times 89 List of Tables - 4.2 Relative performance of the WKME as measured by 100Ai/A\. Areas are calculated on the interval [0, U — 5], where U is the upper bound used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times 90 4.3 Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating i*\(55) = 0.11976 as mea sured by 100 MSE{FI(55)}/MSE{FA(55)}. Different choices of U and n are considered. Each scenario is repeated 20000 times. 93 4.4 Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating F1_1(0.10) = 52.081 as measured by MSE(QI)/MSE(QA)- Different choices of U and n are considered. Each scenario is repeated 20000 times 93 4.5 Average MAMSE weights for different rates of censoring p and different sam ple sizes, n. The right-hand side of the table presents the relative performance of the WKME as measured by 100Ai/A\. Figures are averaged over 20000 repetitions and the values U = 80 and T = 75 are used 94 4.6 Relative performance of the weighted Kaplan-Meier estimate compared to the Kaplan-Meier estimate as measured by 100AI/4A- Average MAMSE weights are also shown, but they are multiplied by a factor of 1000 for easier reading. In the simulation with males and females, the average weights in italics refer to the male populations. Note that U — 90, T = 85 and that all figures are based on 10000 repetitions. '. . 97 5.1 Population value of different coefficients of correlation 102 5.2 Empirical estimates of different coefficients of correlation 109 5.3 Values of p under two different scenarios that are simulated for different families of copulas 144 List of Tables 5.4 Performance of a MAMSE-weighted coefficient of correlation based on ranks as measured by 100 MSE(/5)/MSE(px) f°r different scenarios and sample sizes n e {20,50,100,250}. Each figure is averaged over 10000 repetitions 146 5.5 Performance of the maximum weighted pseudo-likelihood estimate as mea sured by 100 MSE((?)/MSE((9.x) for different scenarios and sample sizes n G {20, 50,100, 250}. Each figure is averaged over 10000 repetitions 146 5.6 Distributions from which bivariate random variables are drawn. The choice of parameters in each population yields a Spearman correlation of p = 0.20. 147 5.7 Average MAMSE weights as well as the relative efficiency of the weighted correlation and of the MWPLE when compared to their non-weighted coun terparts. Samples of size n € {20,50,100,250} are drawn from five different bivariate distributions that have a common correlation of p = 0.2. Figures are averaged over 10000 repetitions 148 5.8 Parameters of the simulation for 3-variate Normal variables. Population 1 is from the target distribution, but the other populations are affected by measurement errors 153 5.9 Average weight allocated to each of four 3-variate Normal distributions. Pop ulation 1 is observed from the target distribution, but the other populations contain measurement errors. The values are averaged over 5000 repetitions. 153 5.10 Relative performance of the MWPLE when compared to the MPLE as mea sured by 100 MSE(f i)/MSE(f A) or an equivalent ratio for the individual entries of Ti. Population 1 is observed from a 3-variate Normal distribu tion and the other populations contain measurement errors. The values are averaged over 5000 repetitions 154 5.11 Parameters of the simulations for 4-variate Normal variables. Population 1 comes from the target distribution, but the other populations are affected by measurement errors 156 List of Tables 5.12 Average weight allocated to each of four 4-variate Normal distributions. Pop ulation 1 is observed from the target distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions. 157 5.13 Relative performance of the MWPLE when compared to the MPLE for 4-variate Normal distributions. Population 1 is observed from the target dis tribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions 157 x List of Figures 3.1 Average values of 100x the MAMSE weights without the constraints A; > 0. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a A/"(A, 1) distribution. All results are averages over 40000 repetitions. 58 3.2 Average values of 100x the usual MAMSE weights (with constraints A; > 0). Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a A/"(0,1) and a A/"(A, 1) distribution. All results are averages over 40000 repetitions. 61 3.3 Histograms of the magnitude of earthquakes measured between the \2th of February 2001 and the 12th of February 2006 for three different Western Canadian areas. The curves correspond to the fitted Gamma density 62 4.1 Graphics representing the Riemann sums used in the proof of Case 2 (left panel) and Case 3 (right panel) 82 4.2 Survival functions for subgroups of the American population as taken from the life tables of the National Center for Health Statistics (1997) 88 4.3 Average value of the MAMSE weights for different upper bounds U and sam ple sizes n. The cells' area are proportional to the average weight allocated to each population. The numbers correspond to 100Ai and are averaged over 20000 repetitions 91 xi List of Figures 4.4 Typical examples" of the weighted Kaplan-Meier estimate (dashed line) and of the usual Kaplan-Meier estimate (plain black line) for different sample sizes. Note that U = 80 and T = 75. The true distribution is depicted by a smooth gray line •. 92 4.5 Survival functions of Canadians as taken from the life tables of Statistics Canada (2006). Survival functions are available for males and females of each province. The curves for New Brunswick are repeated as gray dotted lines on each panel to facilitate comparisons since they are the populations of interest 96 4.6 Typical examples of estimates under different scenarios. Note the increased number of steps of the weighted Kaplan-Meier estimate (dashed line) which makes it smoother than the usual Kaplan-Meier estimate (plain black line). The true distribution appears as a smooth gray curve 99 5.1 Average value of the MAMSE weights for scenarios where samples of equal sizes n are drawn from 5 different populations. The cell areas are proportional to the average weight allocated to each population. The numbers correspond to lOOAi and are averaged over 10000 repetitions 145 Acknowledgements I would like to thank my supervisor Jim Zidek for the numerous inspiring discussions we had about the foundations of statistics. With an amazing intellectual tact, Jim gave me enough direction to keep me thinking hard, but enough liberty to let me design this thesis my own way. His guidance will definitely have a deep impact in the way I approach statistics in my future career. I would also like to acknowledge Jim's financial support. I would like to thank the' members of my supervisory committee, John Petkau and Mafias Salibian-Barrera, for their insightful comments and suggestions. I would like to extend an additional thank to John for sharing his wisdom in many occasions when I was seeking an advice. The Department of Statistics offers a friendly environment for development and growth. I did appreciate the opportunity to be exposed to different views of statistics. Department's faculty and staff members as well as fellow students have always been very helpful. I am thankful to them. I made many very good friends during my stay in Vancouver. They brightened my days and will truly be missed. Thank you for the friendship, the support and for the fun! I want to thank Christian Genest, my Master's supervisor, for his outstanding guidance while I was at Laval University. I learned so much under his supervision that his teachings are still making a significant difference in my work and in my understanding of statistics as a science and as a discipline. Thanks to my family for their constant support and prayers. I had the chance to grow'up in a healthy and supportive familial environment that forged my personality. I am thankful xiii Acknowledgements for that. Thank you Pascale for sharing my life through the second half of this endeavour. For partial support of this work through graduate scholarships and research grants, thanks are due to the Natural Sciences and Engineering Research Council of Canada and to the Fonds quebecois de la recherche sur la nature et les technologies. xiv Chapter 1 Introduction Suppose that you must make inference about a population from which you obtained a sample and that additional data are available from populations whose distributions may be similar to the target, but not necessarily identical. The weighted likelihood allows use of the relevant information contained in that data and its incorporation into the inference. In this thesis, we propose a data-based criterion to determine likelihood weights and show that they can successfully trade bias for precision. 1.1 Review The expression weighted likelihood refers to a few different statistical methods. In sur vey sampling theory, a method called weighted likelihood is used to adjust for a response-dependent sampling intensity; see Krieger & Pfeffermann (1992). The increased robustness of estimates obtained through weighted likelihood equations have been studied by several groups of statisticians including Markatou et al. (1997) and Ahmed et al. (2005). However, both these methods are only remotely related to the relevance weighted likelihood as defined by Hu (1994). In that context, the weighted likelihood is an extension to the likelihood that uses data from multiple populations for making inference. Statistical models typically use information about data that come from the target dis tribution. When other possibly similar data are available, the relevant information they contain is ignored unless their similarity with the population of interest is precisely incor porated in the model. 1 Chapter 1. Introduction The original work of Hu (1994), enhanced in Hu & Zidek (2001) and Hu k. Zidek (2002), proposes use of the weighted likelihood to capture the relevant information from data whose distribution may differ from that of interest, hence trading bias for precision. A specific paradigm is derived from the work of Hu and Zidek by Wang (2001) and further developed in Wang, van Eeden & Zidek (2004) and in Wang & Zidek (2005). We adopt that paradigm throughout this thesis. Suppose the data come from m distinct populations that have different yet similar distributions. More formally, for each fixed i = 1,..., m, Aji, . . . , A.ini ~ i*i where Fj denotes the cumulative distribution function (CDF) of each population. We denote by fi the corresponding density or mass function for population i = 1,..., m. The family of density (or mass) functions f(x\6) indexed by 6 € 0 is used to model the data. Population 1 is of inferential interest, but the weighted likelihood m nj Lx(e) = Hl[f(xij\e)^ . i=ij=i lets other populations contribute to the inference so that the relevant information they contain is not lost. The vector of exponents A = [Ai,..., Am]T downweights data according i to the degree to which they are thought to be dissimilar from Population 1. The expression for the weighted log-likelihood may be more intuitive: The maximum weighted likelihood estimate (MWLE) is a value of 9 that maximizes L\(9). Ideally, the weights Aj would be determined from scientific knowledge. However, us ing data-based weights seems more pragmatic. The ad-hoc methods proposed by Hu & 2 Chapter 1. Introduction Zidek (2002) as well as the cross-validation weights of Wang (2001) and Wang & Zidek (2005) are the only adaptive weights proposed in the literature for the paradigm considered. How ever, none of these solutions is fully satisfactory. In particular, the cross-validation weights (CVW) suffer from instability problems. The simulation results in Wang (2001) and Wang & Zidek (2005) can be reproduced only at the cost of fine-tuning the algorithms numerically. In the ideal situation where some of the populations are identical to Population 1, the weights may not even be defined. Let Fi denote the empirical distribution function (EDF) based on the sample from Population i. Hu k. Zidek (1993) call F\ = 2~2u=i ^i^i the relevance weighted empirical distribution (REWED) when A* > 0 add to 1. Hu k.Zidek (2002) show that the REWED is the nonparametric maximum weighted likelihood estimate of the target distribution func tion. In Chapter 2, we look at the link between the REWED and the weighted likelihood from a different angle. Note that Ahmed (2000) uses a paradigm similar to the one retained in this thesis as he considers samples of different sizes from m populations to develop a Stein-type estimator of the quantiles of the distribution. The properties of his method and its links to the REWED and the weighted likelihood are however not the object of this work. In this thesis, we propose a data-based criterion to determine likelihood weights which yields consistent estimates. Inferential methods based on our proposed weights perform at least as well as the CVW in comparable simulations, but without the instability problems. We also propose other weighted methods based on the newly proposed weights that are shown to perform well and to be asymptotically unbiased. 1.2 Overview In Chapter 2, we present our heuristic interpretation of the weighted likelihood that links it to mixture distributions. This leads to the general definition of the MAMSE (Minimum Averaged Mean Squared Error) weights and we provide an algorithm to calculate them. 3 Chapter 1. Introduction The ensuing three chapters are implementations of the idea of the MAMSE weights in three different contexts. In Chapter 3, univariate data are treated using the empirical distribution function. The properties of the resulting nonparametric MAMSE weights are studied. Asymptotically, the MAMSE-weighted empirical distribution function converges to the target distribution and the maximum weighted likelihood estimate is strongly consistent. Finally, simulations show that improved performance is indeed possible on samples of realistic sizes. Chapter 4 introduces MAMSE weights for right^censored data by using the Kaplan-Meier estimate to infer the CDF of each population. The weights are used to define a weighted Kaplan-Meier estimate that converges uniformly to the target distribution. Sim ulations show that improved performance is once again possible. Finally, MAMSE weights for multivariate data are defined using the empirical copula. The MAMSE-weighted mixture of empirical copulas converges uniformly to the copula underlying the target distribution. We define MAMSE-weighted coefficients of correlation and show that they are consistent estimates of the correlation in the target population. To infer the copula underlying the target population, the maximum weighted pseudo-likelihood estimate (MWPLE) is proposed. We show that the MWPLE is consistent when calculated with MAMSE weights. Simulations confirm once more that MAMSE-weighted methods may perform better than their unweighted counterparts, even in subasymptotic ranges. We conclude this thesis with a summary and some suggestions for future research. 4 Chapter 2 Likelihood, Entropy and Mixture Distributions This chapter links the likelihood to the weighted likelihood by showing that both expres sions can be derived as particular cases of the entropy maximization principle proposed by Akaike (1977). ' After drawing the connection between relative entropy and the likelihood, we heuristi-cally extend that link to the relationship between mixture distributions and the weighted likelihood. This leads us to formulate a general criterion to determine likelihood weights, called the "MAMSE" weights. We propose an algorithm to compute the MAMSE weights and prove its convergence. 2.1 Likelihood and Entropy Consider first the one-sample situation where n independent data points, Yi,..., Yn, come from a distribution whose unknown and unknowable density is g(y). In his pioneering work, Akaike (1977) argues that the goal of inference should be the estimation of g(y). When a parametric model f(y\9) is to be used, Akaike proposes maximizing the relative entropy: 5 Chapter 2. Likelihood, Entropy and Mixture Distributions The relative entropy is in fact minus the Kullback-Leibler divergence between / and g. In that sense, it is a measure of the proximity of the distributions / and g. The expression for B(g,f) can be further simplified: B(gJ) = -j log|-^|5(?/)dy = J log{f(y\e)}g(y)dy-J log{g(y)}g(y) dy. In particular, when the objective is to maximize B(g, /) as a function of 8, the last term of the rightmost expression can be ignored since it does not depend on 9. Calculating the entropy would require the knowledge of the unknown and unknowable true distribution g. We thus have to estimate it. Let = v) be the empirical distribution function (EDF) of the dataset Y\,... ,Yn. The indicator vari able !(•) is equal to one if all the elements of its argument are true and equal to 0 otherwise. Using dG(y) as an approximation to dG{y) = g(y) dy yields J \og{f(y\0)}dG(y) = i]Tlog/(m i=l the log-likelihood! Therefore, calculating the likelihood is equivalent to calculating the entropy where the true distribution is estimated by the empirical distribution of the data. Hence, the maximum likelihood estimate can be seen as a special case of Akaike's entropy maximization principle. 2.2 The Weighted Likelihood Consider now the m-population paradigm of Wang (2001) introduced in Chapter 1. With appropriate weights, the mixture Fx = 2~HL\\Ei can be arbitrarily close to F\. Let Fj denote the EDF based on the sample from Population i. The weighted EDF, written 6 Chapter 2. Likelihood, Entropy and Mixture Distributions F\ = YliLi ^iFi witn > 0 and called relevance weighted empirical distribution by Hu & Zidek (1993), may use more data than Fi, and thus be less variable. Hu & Zidek (1993) note the implicit bias involved in defacto replacing Fi by F\, but do not investigate as we do here the possibility of trading bias for precision. In the context of maximum entropy, consider using the weighted EDF as an estimate of Fx. Then, /m . m . rii \ogf(x\6)dFx(x) = J2^i / logf(x\B)dFi(x) = Y,-J2^gf(XlJ\e), i=i i=x j=i the weighted log-likelihood. The maximum weighted likelihood can thus be derived from Akaike's entropy maximization principle. 2.3 The MAMSE Weights Let Fi(x) represent an estimate of the CDF of Population i. Based on the heuristics presented in the previous section, the weighted likelihood is akin to using a mixture of the CDFs of the m populations from which data are available. Let m Fx(x) = J2xiFi(x) i=i with A = [Ai,..., Am]T, Aj > 0 and AT1 = 1 be such a mixture. A wise choice of likelihood weights should make: • F\(x) close to Fi(x), the target distribution, • F\(x) less variable than F\(x). 7 Chapter 2. Likelihood, Entropy and Mixture Distributions We combine the two requirements above in an objective function. Let p(x) be a proba bility measure that has a domain comparable to the distribution of interest and define r i m P(A) = 7 •{P1(x)-Fx(x)} +^Afvar{Fi(x)} L i=l d/i(x) (2.1) The term vaf |i<i(:r) j is a form of penalty to foster decreasing the variance by using as many of the m populations as necessary. In the coming chapters, we will use different estimates Fj depending on the type of data at hand. The exact expression for vaf JFj(:c) j will depend on that choice; see Sections 3.1, 4.2 and 5.3 for specific examples. The probability measure p(x) may also depend on the data and will be specified later. We call MAMSE (minimum averaged mean squared error) weights any vector of values A = [Ai,..., Am]T that solves the program minimize -P(A) ' 771 subject to {Aj >0,i = 1,..., m} and ^ Aj = 1. (2.2) i=l The name MAMSE is motivated by the conceptual resemblance of the integrand with the mean squared error (Bias2 4- Variance). Note that the objective function P(A) is quadratic in A and must be optimized under linear constraints. Substituting Ai = 1 — auows the constraint Y^Li Aj = 1 to embedded into the objective function. Let us write A2 Fx{x) -F2(x) A = - Fm(x) and 8 Chapter 2. Likelihood, Entropy and Mixture Distributions V(x) = var{F2(x)} var{Fm(x)} Then, the function P(A) can be written as A(z) - £\Fiix) - ll - £^j | i=2 / i=2 /I" m -£(_)} . Li=2 +(1 - AT1)2 var { A (x)} + J_ A? var {£(x)} i=2 = J {AV(x)}2 + (l-2ATl + ATllTA)vi:r|Fi(a;)} + ATV(x)A = J AT ^(x)Jr(x)T+V(x) + llTvar|Fi(x)}] A dp(x) dp.(x) dp(x) -2AT1 var {FI(X)} + var {A(x)} d/u(x) = ATAA-2AT16 + 6 (2.3) where A = J ^F(x)F(x)T + V(x) + llTvar {A(x)} d/j(z) b = f var [FI(X)| d//(x). Hence, the objective function F(A) can be written as a quadratic function of A involving matrices and vectors whose entries depend on Fj(x) and its estimated variance. Note that a similar quadratic development is possible with A as well. 9 Chapter 2. Likelihood, Entropy and Mixture Distributions 2.4 An Algorithm to Compute the MAMSE Weights To improve inference about Population 1, we attempt to use the data from m — 1 other populations. Suppose that one of the other populations is different enough that its sample does not even overlap with that of Population 1. Extracting useful information from that sample seems unlikely and it is nearly impossible to evaluate to which degree the populations are similar, except for suspecting that they are probably quite different. We suggest to preprocess the samples from the m populations to discard those that are clearly different from the target population. Some samples may also be discarded because they are difficult to compare with the target. Specifics of the preprocessing are described for all three versions of the MAMSE weights in their respective chapters. Typically, the preprocessing steps will also insure that Assumption 2.1 is respected. Assumption 2.1 J var{Fi(x)}d^(x) > 0 for any Population i that is considered for the optimization of (2.2). All samples that fail the preprocessing are assigned weights of zero, hence they do not appear in the MAMSE objective function. For simplicity, the notation used in this section does not reflect the possibly reduced number of populations considered: we suppose that m populations remain after preprocessing. If we ignore the constraints Aj > 0, the MAMSE weights are the solution to the equation A A = 61. To ensure the weights are nonnegative, we apply the following algorithm and denote its solution by A* (or A*). 1. Solve the equation A A = 61; 2. if all the weights obtained are nonnegative, stop. Otherwise set the negative weights to 0, ignore the corresponding samples and repeat from Step 1 with the reduced 10 Chapter 2. Likelihood, Entropy and Mixture Distributions system. The weight allocated to Population 1 from Step 1 cannot be negative (see Lemma 2.4). If no other samples are left, then A = 0 and Ai = 1. The objective function P(X) is quadratic and positive (thus convex). Since the con straints form a convex set, intuition suggests that A* should be the global constrained minimum. We prove this more formally next. Consider the generic program where A G IRm and h(A) = [/ii(A),..., hk(X)]T is a vector of functions, each being from IRm to IR. Let VP(A) denote the gradient of P and P(A) its Hessian. Similarly, V/i*(A) is the gradient of hi(X) and Hi (A) its Hessian. By definition, an m x m matrix B is positive definite (denoted B y 0) if yT£y > 0 for all y G IRm\{0}. The Kuhn-Tucker conditions are necessary conditions that any solution to a program like (2.2) must respect. The second order version that follows (see for instance Luenberger (2003), page 316) are necessary and sufficient conditions to show that a given A* solves program (2.2). Theorem 2.1 (Kuhn-Tucker Second Order Sufficiency Conditions) Let hi,...,hk and P be continuous and twice differentiable functions from IRm to IR. Sufficient conditions that a point X* be a strict relative minimum point of the program above are that there exists p = [pi,..., /Li/C]T € IRfc such that p, > 0, pTh(X*) = 0, minimize -P(A) subject to h(A) < 0 k (2.4) %=i 11 Chapter 2. Likelihood, Entropy and Mixture Distributions and the matrix P(A*) + 5^_/iiHI(A*)^0. (2.5) i=l Note that from Equation (2.3), we have VP(A) = AA - lb and P(A) = A. The function P(A) and its derivatives do not depend on Ai since it was replaced by Ai = 1 — 1TA. Consequently, it is implicitly understood in the following that P and h{ are functions of A € IRm_1, even when we write P(A) and /ij(A) rather than P(A) and /ij(A). Lemma 2.1 The Hessian matrix P(A) is positive definite. Proof of Lemma 2.1. Remember that A = y T(x)F(x)T + V(x) + llTvar{Pi(x)} dp(x). For any fixed x, each term of the integrand as written above are nonnegative definite. In particular, for any y e IRm_1\{0}, yT{^(x)^(x)T}y >0 and yTllTy J^{F^x)} dp(x) > 0 because of Assumption 2.1. As another consequence of Assumption 2.1; the diagonal ele ments of f V(x) d/i(x) are strictly positive, meaning that V(x) dfi(x) m—l = E i=i var{Pj(x)} dp,(x) > 0 for any y € IRm ^{0}. Consequently, yTAy > 0, i.e. the Hessian of P, P(A) = A, is positive definite. _ 12 Chapter 2. Likelihood, Entropy and Mixture Distributions Corollary 2.1 Equation (2.5) is satisfied. Proof of Corollary 2.1. In our implementation of the general Kuhn-Tucker conditions, hi(\) = — Ai+1. Therefore, Hj(A) = VTV/ii(A) = 0 are null matrices. From Lemma 2.1, we know that P(A*) is positive definite, hence Equation (2.5) is satisfied. _ Applying the algorithm above will change negative weights to Aj+i = 0 for some i G Ic C {1,... ,m — 1} where Ic may be null. The set I contains the remaining indices and may also be null. Let J C {1,..., m — 1} be a possibly null subset of indices, then Ajtj is the sub-matrix of A for the rows i £ I and the columns j G J. We define the subvector Xj similarly. The. proposed algorithm involves solving reduced systems where the rows and columns for i G Ic are excluded. The system of equations that has to be solved then involves the matrix Ai = J [Ei{x)Fr(x)r + V/,/(s) + l/l7var{A(a;)}] d/x(s). For convenience of exposition, suppose that the order of appearance of the Aj in A are such that all the Aj that are "forced" to be zero are last. Then, with T(x) = write Tjc (x) we can A = " Fi(?) ' Tjc (x) Tjc (x) + V(x) + llTvax{Fi(x)} Fiix^ix)7 + Vu{x) FIc(x)Jri(x)T A/./c Ajc jc Jrl(x)J7Ic(x)r FIc{x)JrIc(x)T + VIcjc(x) dp(x) + llTvar{Fi(x)}d^(x) A/,/ A/c,7 In particular, Ai = AIJ and Ajc = AJCJC. Therefore, the last step of the proposed algorithm is to solve the system of equations AIXI — AIJXJ = ljb. 13 Chapter 2. Likelihood, Entropy and Mixture Distributions Lemma 2.2 If Ic 0 and y e IRm :\{0} is any nonnegative vector with yj = 0 and ylC > 0, then VP(A*)Ty > 0. Proof of Lemma 2.2. First note that the expression VF(A*)Ty corresponds to the direc tional derivative of P at A* in the direction y. Next consider the unit vector ej G IRm_1 whose ith element is 1. For i e Ic, the global unconstrained minimum of the convex func tion P is outside of the half-space Aj+i > 0. Therefore, P increases in the direction ej at A* and thus VP(A*)Te, > 0: Finally, the hypothesized vector y can be expressed as a linear combination of vectors {ei : i e Ic} with nonnegative coefficients j/j. Therefore, VP(A*)Ty = J2 yiVP(A*)Tei > 0. _ ieic Although I = 0 or Ic = 0 may occur, the following proofs hold under these special cases. Lemma 2.3 The proposed algorithm solves the quadratic program minimize -P(A) subject to {Xi > 0, i = 2,..., m). Proof of Lemma 2.3. To verify that the Kuhn-Tucker conditions are satisfied, first note that for i = 1,..., m — 1 the functions foj(A) = — Aj+i are continuous and twice differentiable. The. quadratic objective function P(X) shares the same smoothness properties. Moreover, Corollary 2.1 establishes that Equation (2.5) holds. At termination, the algorithm yields A^ > 0 and X]c = 0. The proposed solution A* is thus in the feasible set. It remains to show that there exists a p: satisfying the Kuhn-Tucker conditions stated earlier. We will show that /x = VP(A*) satisfies the required properties. Expression (2.4) can be written m—l VP(A*) + w(-ei) = VP(A*) -n = 0 i=l 14 Chapter 2. Likelihood, Entropy and Mixture Distributions and clearly holds for /i = VP(A*). The other Kuhn-Tucker conditions require that pi > 0 -i * and pi A =0. /x > 0 The last step of the algorithm before termination is to solve AIJXI = 1/6. Therefore, \ij = VP(A*)/ = [AA* -lb\j = A/,/A/ + &ijcKc -1/6 = 0 since Xjc =0. In addition, we have from Lemma 2.2 that p,i = pJei = VP(A*)ei > 0 for all i G Ic, and hence /LX7C > 0. Therefore, /i > 0. TA* = 0 We can write the condition pJX* = 0 as /xjA/ + /LIJCA/C = 0. It is shown above that pij = 0, hence pb]\*j — 0. Moreover, the definition of the set I implies that X*jc = 0, thus PJCXJC = 0 and the condition is satisfied. Consequently, the solution found by the proposed algorithm is a strict relative minimum since it satisfies the sufficient Kuhn-Tucker conditions. _ Lemma 2.4 The solution found by the proposed algorithm satisfies the additional constraint 2~^I=2 Ai < 1, or equivalently, Ai > 0. Proof of Lemma 2.4- The solution found by the algorithm satisfies A^/A]- = lib. Expanding AIJ in this equation yields J [^/(x)^7(x)T + V/,/(x)4-l/ljTOr{p1(x)}] dp(x Xr = 1/6. By subtracting 61/1J A/ from both sides and multiplying the resulting equation by Xj on 15 Chapter 2. Likelihood, Entropy and Mixture Distributions the left, we have x; = 6A;T (I7 - l/ijAj) = 6A;TI7 (I - ijxi) = i (i - E^i) (z A*+1) • By the same argument as in Lemma 2.1, the matrix on the left hand-side is positive definite, and hence the expression itself is positive. Since b and Aj are positive, we necessarily have 1 — ^i+i > 0- Hencei the solution to the program in Lemma 2.3 always satisfies the additional constraint _3ie/ A*+1 = YA=2 K < 1 (remember that X"jc = 0). This inequality is equivalent to > 0. Regarding the comment to the effect that Ai cannot be negative for intermediate steps, consider the development above for such steps where A/ may still contain negative values. Note that the left-hand-side of the expression is still positive because of its positive defmite-ness. Moreover, the right-hand-side can be written as Ai(l — Ai)b, meaning that Ai(l — Ai) is positive. Therefore, Ai £ (0,1), except if I = 0 in which case Ai = 1 and A = 0. _ Theorem 2.2 The proposed algorithm solves the quadratic program minimize -P(A)' m subject to {Xi > 0, i = 1,..., m} and ^ Aj = 1. i=l Proof of Theorem 2.2. The result follows from Lemmas 2.3 and 2.4. _ J {rrtx^ix)* + VItI(x)} dM(x) We now define MAMSE weights for specific types of data and study their properties. 16 Chapter 3 MAMSE Weights for Univariate Data We define MAMSE weights for univariate data by using the empirical distribution function (EDF) based on the sample from each population as an estimate of their CDF. This choice yields nonparametric MAMSE weights with interesting asymptotic properties that are stud ied. In particular, the MAMSE-weighted mixture of empirical distributions converges to the target distribution and the maximum weighted likelihood estimate (MWLE) is a strongly consistent estimate of the true parameter when the MAMSE weights are used. Simulation studies evaluate the performance of the MWLE on finite samples. 3.1 Notation and Review Let (fi, S(fi), P) be the sample space on which the random variables Xij{u) : fi IR, i = 1,..., m, j € IN are defined. The Xij are assumed to be independent with continuous distribution F;. We consider samples of nondecreasing sizes: for any positive integer k, the random variables {X^ : i = l,...,m, j — l,...,^} are observed. Moreover, the sequences of sample sizes are such that —> oo as k —> oo. We do not require that the sample sizes of the other populations tend to oo, nor do we restrict the rate at which they increase. 17 Chapter 3. MAMSE Weights for Univariate Data Let Fik(x) be the EDF based on the sample Xij, j = 1,..., n^, i.e. Fik{x) = — J^^i^jH <X}. The empirical measure dFik(x) allocates a weight 1/nik to each of the observations X^, j = l,...,nik. Under the assumption that data from each population are independent and identically distributed as is the case here, the empirical distribution is a nonparametric estimate of the CDF. Indeed, the Glivenko-Cantelli lemma (see e.g. Durrett (2005) page 58) states that sup X Fik(x) - Fi(x) 0 almost surely as —* oo. For any fixed value of x, one may also note that riikFik(x) follows a Binomial distribution with parameters and Fi{x). Hence, the asymptotic normality of the EDF at fixed x follows from the central limit theorem. 3.2 Definition of the MAMSE Weights For univariate data, we define the MAMSE weights based on the EDFs of the m populations. The MAMSE objective function assesses Y^L\ \Fik{x) as an estimate of F\k{x) in terms of bias and variance. As a preselection step, any sample whose range of values does not overlap with that of Population 1 will be discarded. For the remaining m samples, the MAMSE weights will be the vector A = [Ai,..., Am]T that minimizes Pfc(A) = F\k{x) - ^2\iFik(x) + 1=1 i=i Fik(x) {l - Fik(x)} dFlk(x) (3.1) under the constraints Aj > 0 and YLlLi Ai = 1. 18 Chapter 3. MAMSE Weights for Univariate Data Equation (3.1) is a special case of Equation (2.1) where dp(x) = dFik(x) and where the substitution var^z)} = — Fik(x) {l - Fik(x)\ is based on the variance of the Binomial variable nikF{(x). The choice of dp,(x) = dFik(x) allows to integrate where the target distribution F\(x) has most of its mass. Other choices for p,(x) could be explored in the future. For u e fi and k <E IN, we denote the MAMSE weights by Xk(co) = [Aifc(w),..., Amfc(w)]T. The MAMSE weights are used to define an estimate of the distribution F\(x), the MAMSE-weighted EDF. 3.3 Computing the MAMSE Weights The algorithm proposed in Section 2.4 applies to the MAMSE, weights for univariate data. To prove the convergence of the algorithm, it is sufficient to show that Assumption 2.1 is respected when the MAMSE weights are defined using Expression (3.1). Lemma 3.1 Let x be any point within the range of the data set from Population i, i.e. let m Gk(x) = _j \k(u)Fik(x), i=i Then. Proof of Lemma 3.1. Note that 0 < Fik{x) #{Xjj < x : j < nik] nik < 1 19 Chapter 3. MAMSE Weights for Univariate Data since the numerator is at least one and at most — 1, or otherwise x would be outside the specified interval. Consequently, •1 var {Pik(x)} = —Fik{x) {l - Fik(x)} > 0 nik Theorem 3.1 Assumption 2.1 is respected for the definition of MAMSE weights suggested in Section 3.2. . Proof of Theorem 3.1. The preprocessing step described in Section 3.2 requires that any sample considered overlaps with that of Population 1. In particular this means that at least one of the data points from Population 1 will fall within the range of the sample of Population i, meaning that j var [Pik{x)) dFx{x) = J ^~Fik(x) {l - Pik(x)} dP^x) nik Tllk • , L J J-l since all the terms in the sum are nonnegative and at least one must be positive by Lemma 3.1. • 3.4 Structural Properties of the MAMSE Weights Choosing the empirical distribution function to define the MAMSE weights implies some invariance properties that are discussed next.. Theorem 3.2 The MAMSE weights are invariant to a strictly increasing transformation of the data. Proof of Theorem 3.2. Let Xij — g(Xij) where g is a strictly increasing function of the real line. Let Hi denote the cumulative distribution function of Yij. Then for all y, x = g(y) 20 Chapter 3. MAMSE Weights for Univariate Data and any i = 1,..., m H%k{y) = — J^MYij<y} = — J2M9(yij) < 9(V)} lk j=\ lk j=i = —Y,MXij<xY=Fik(x). '* 3 = 1 Since P(X) is integrated with respect to dFik, a discrete measure, there is no Jacobian of transformation in the integral and replacing all Fik by the corresponding Hik will not change the expression P(X), nor its maximum. _ Theorem 3.3 The MAMSE weights do not depend on the parametric model f(x\8) used in the weighted likelihood. Proof of Theorem 3.3. The result follows immediately from the definition of MAMSE weights and the choice of the nonparametric empirical distribution functions as estimates of Fi. i Theorem 3.4 The MWLE based on MAMSE weights is invariant under .a one-to-one reparametrization of f(x\9) into h{x\r) = f{x\h{r)}, i.e. 6 is a MWLE iff f is a MWLE. Proof of Theorem 3.4. By Theorem 3.3, the MAMSE weights Xk(u) = [Alfc(w),..., Xmk(cj)]T are invariant to the choice of parametric model f(x\6). If r is such that 9 = h(r) and h is a one-to-one mapping of the parameter space, then rmax is such that n n / {XiJWT)}x»M'n« < n n / {^(w)}^^ i=lj=l i=lj=\ for all r if and only if #max = fa(rmax) is such that in nik • m nik n n fixyie)^'"" < n n f^™^^ i=l j=l i=lj'.= l 21 Chapter 3. MAMSE Weights for Univariate Data for all 6. Hence, the MWLE possesses the same functional invariance property as the MLE if we use the MAMSE weights. _ 3.5 Strong Uniform Convergence of the Weighted This section explores the large sample behavior of the MAMSE-weighted EDF. Note that asymptotic results for adaptive weights are obtained by Wang, van Eeden and Zidek (2004), but they do not apply here because their assumption that the weight allocated to Popu lation 1 tends to one as the sample size increases may not be satisfied (see Section 3.8 for more details on the asymptotic behavior of the MAMSE weights). The proof of uniform convergence of Gk(x) is built as a sequence of lemmas showing that Gk is close to F^, and ultimately that almost surely as k —» oo. Whether a sample is rejected in the preprocessing or not may vary with k and to. However, as the sample sizes increase, the probability that a sample is rejected tends to zero unless the domain of possible values of a Population does not overlap at all with that of Population 1, i.e. unless P(X\\ < Xn) = 0 or 1. Thus, without loss of generality, we suppose that no samples were excluded by the preprocessing. Lemma 3.2 For any to £ Q and k E IN; Empirical CDF sup Gk{x)-Fx(x) ^0 x 22 Chapter 3. MAMSE Weights for Univariate Data Proof of Lemma 3.2. For any LO 6 fl and k G IN, consider I = Gk(x)-Flk(x) dFlk(x) < J |Gfc(_)-Flfc(x)f^ By the definition of MAMSE weights, the expression above is minimized in A. The sub-optimal choice of weights [Ai,..., Am] = [1,0,...,0] cannot lead to a smaller value of I, i.e. I < |Fife(x) - Flk(x) 2+ —Flk(x)\l-Flk(x)\ nlk I J dFlk(x) This bound is tight since the optimal A could be arbitrarily close to [1,0,... ,0]T, making I arbitrarily close to the bound above. For instance, letting ri\k —> oo while the other njfc's are held constant will do the trick. _ Lemma 3.3 There exists fii C fl with P(fli) = 1 such that for all LO G fl\ and any fixed k G IN, max X Gk{x) - Flk(x) < h max n\k j'e{i,...,nlfc} Gk{Xl:i(uj)} - Flk{Xlj(u;)} Proof of Lemma 3.3. Define fl0 = {u efl: 3i,i',j,f with ^ (i',f) and Xij(u)-=XVjl(u)} . Since the distributions Fj are continuous, P(flo) = 0. Fix k G IN and consider any fixed LO € fli = fi\Q:0. Note that for i G {1,..., m} and j G {1,..., n^}, [minjj -Xjj(o;), maxjj Xij(u)\ is a compact set outside of which D{x) — \Gk(x) — Fifc(a;)| = 0. Let XQ be a value maximiz-23 Chapter 3. MAMSE Weights for Univariate Data ing the bounded function D(x). We treat two cases. Case 1: Gk(x0) < Flk(x0). Define x\ = max{Xij(uj) : j = 1,..., n\k, X\J(LU) < XQ}, the largest data point less than XQ found in Population 1. The step function F\k(x) is right-continuous, nondecreasing and has equal steps of size l/n\k at each observation XXj{ui). By the choice of x\, and the definition of the EDF, AfcOi) = FifcOo)-The step function Gk(x) is nondecreasing, thus Gk(xi) < Gk(xQ). Consequently, \Gk(x0) - Fik(xQ)\ = Flk(xQ) - Gk(x0) < Flk(xi) - Gfc(xi) < max je{i,..,Tiifc} Flk{Xl3{u)} - GkiXx^uj)} < h max n\k je{i,...,nlk} Flk{X1:j(uj)} - Gk{Xl3(Lu)} Case 2: GiAxn) > F1t.(xn). Define x<i = min{Xij(u>) : j = 1,... ,nik, Xij(u) > XQ}, the smallest data point exceeding XQ found in Population 1. The step function F\k{x) is right-continuous, nondecreasing and has equal steps of size l/n\k at each observation Xij(to). Therefore, Since Gk(x) is nondecreasing, F\k{x2) = + Flk(x0). nik Gk(x2) > Gk(x0). 24 Chapter 3. MAMSE Weights for Univariate Data Consequently, \Gk(x0) - Flk(x0)\ = Gk(x0) - Fik(x0) < Gk(x2) - Flk(x2) + n\k < h max n\k ie{i,-,"u} FikiXijiu)} - GkiXijiu)} which completes the proof. Lemma 3.4 Let ak be a sequence of numbers such that limk^00aki/nik = 0. Then, there exists fii C fi with P(fii) = 1 such that for all e > 0, there exists a ko such that Vcu € fii ak max ' j'G{l,...,nlfc} G^Xi^u)} - Fik{Xij{u)} <e for all k > ko. Proof of Lemma 3.4- Let e > 0. Consider an arbitrary but fixed u G fii = fi\fio where fio has the same definition as in the proof of Lemma 3.3. Suppose that Lemma 3.4 is false. Then there exists an infinite sequence ki such that for I G IN, (3.2) for some jctf £ {15 2,..., nike}, where parentheses in the index identify order statistics, i.e. Xi(j) is the jth smallest value among Xn,..., Xinikt. -Consider a fixed value of L For simplicity, we drop the index £ and revert to k and jo that are fixed. Note that 1. Gk(x) is a nondecreasing function, 2. Fik(x) is a right-continuous nondecreasing step function with equal jumps of l/nik. 25 Chapter 3. MAMSE Weights for Univariate Data We treat two cases: / Case 1: Gk{X1{jo)(u>)} > Flk{Xm(u)}. Note that Fik{X1(jo)(u)}<Gk{X1(jo)(u)}<l and hence, Fik{Xi^0^(u;)} < 1 — ek or inequality (3.2) would not hold. Consequently, Jo < nik ~ L€fcnifcJ and for i € {0,1,... , [efc^ifcj}, we have Gk{Xi(j0-+i)(u})} > Gk{X1{jo)(u)}, Fik{X1{jo+i)(u>)} = Flk{X1{jo)(u)} + ^-nlk and hence /v ^ A % % Gk{Xi(jo+i)(u})} - Fik{X1{jo+i)(u)} > Gk{X1{JQ)(u>)} - Fik{X1{jo){to)} - — > ek - —. nlk nik As a consequence, \Gk(x) - Fik(x)\2 dFik(x > nik — J2 \Gk{X1(jo+i)(u)} - Flk{X1{jo+i)(u)}\' 1=0 nik ^ V W < ^ n\k ^ Lefc™ifcJ(|_efcnifcJ + l)(2LefcnifcJ + 1) z2 6r4 > i /j_____v 3 V nxk J 26 Chapter 3. MAMSE Weights for Univariate Data By Lemma 3.2, we thus have that 1 feknlk-l\3 < 1 /Lefcnlfcj\3 < fn\k-l\ 1 < 1-nik J 3 \ nXk J \ nik J 6nifc 6nife V afc J - 2 afc - 2V3 ^ ak 2^3 „2/3 + 2V3 e^ifc n^{3 + 2:/3 ' nifc a contradiction since a\/nik —> 0 and fc^ —> oo as £ —» oo, i.e. the left-hand term converges to 0. Therefore, ak max Gk{Xij(u)} - Fik{Xij(u)} < e je{o,...,nlk} Note that the bound does not depend on the choice of LO. Case 2: Gk{X1{jo)(u)} < Flk{X1(jo){u)}. Note that Fik{Xi^(uj)} > Gk{Xx^{u)} > 0. Since both functions are at least ek apart, Fik{Xi(jo)(uj)} > ek and thus j0 > [eknik\. Then for i 6 {0,1;..., LefcnifcJ}> we have GkiX^-^uj)} < Gk{X1{jo)(Lu)} Afc{A-l0-0_0(w)} = Fik{X1(jo)(u)} nik and hence i i Flk{XHjo_i}(oj)} - Gfc{A-l0-0_0(a;)} > Flk{X1(jo)(u)} - Gk{X1(jo)(u)} - ^- > ek nlk nlk Then, J \Gk(x) - Fik(x)\2 dFik(x) > nik - J2 I^^KJO-OH}-^^-*)^)}!2 i=0 27 Chapter 3. MAMSE Weights for Univariate Data nik f£ V W n\k f^0 n\k ^ _ LefcnifcJ(LefcnifcJ + l)(2|_efcnlfcJ + 1) > i (Lefc"ifcJ 3 V nifc By Lemma 3.2, we thus have that 1 (eknlk-l\6 1 /Lefcnlfcjy < fn\k-l\ 1 < 1 3 V nik J 3 V nlk J \ n\k J 6nifc 6nlfc / \ 3 2 2/3 * Ur-'1; ~2~ * ak 2^3 n2{3 + 21/3 —->-275 — <=> a* — >21/3e, €nifc n^/3 + 2i/3 nit a contradiction since a\/nik —> 0 and —* oo as I —> oo, i.e. the left-hand term converges to 0. Therefore, ak max Fu{^ij(w)}-G(t{Iij(w)}<e. j€{0,..,nlfc} Combining the two cases, we know that Ve > 0, 3k0 such that \/k> kQ, Gk{X1:i(uj)} ~ Flk{X1:J(Lj)}\ ak max j6{0,..,nifc} < e. Since fcrj does not depend on to £ fl\ and P(fl\) = 1, the uniform convergence is almost sure. _ Lemma 3.5 There exists Q\ C fl with P(fli) = 1 such that for all e > 0, there exists a ko such that i - . . - . . i < e max X \Gk(x) - Flk(x) for all k > kQ with the same ko for all to £ fl\. 28 Chapter 3. MAMSE Weights for Univariate Data Proof of Lemma 3.5. Consider the set Vt\ defined in the proof of Lemma 3.3. For any fixed LO € Qi, max X Gk(x)-Flk(x) 1 < h max nxk je{i,...,nlk} By Lemma 3.4, Ve > 0, 3k\ such that VA: > kx, max jG{l,..,nlfc} Gk{Xl3{u)} - Flk{Xlj{u)} < for all LO E Oi. Moreover, 3k2 such that VA; > k2, l/nxk < e/2. Therefore, for all k > ko — max(A;i, k2), we have 0 < max X Gk(x) - Flk(x) e e ^2 + 2=e' Theorem 3.5 The random variable sup X Gk(x)-F!(x) converges almost surely to 0. Proof of Theorem 3.5. By Lemma 3.5, Ve > 0, 3kx such that max X Gk(x) - Fik(x) < VA; > k\ and any to € fii with P(flx) — 1. The Glivenko-Cantelli theorem states that sup X Flfe(x)-f\(x) almost surely as A; —> co. Hence, there exists Vt2 C 0, with P(fl2) = 1 such that Ve > 0 and 29 Chapter 3. MAMSE Weights for Univariate Data LU € 3^2(0;) with sup x FikW-F^x) e <2 V7c > ^2(0;). Consider now fio = ^ 1 H ^2 and fco(w) = max{fci, ^(w)}. Note that we have •P(^o) > -P(^i) + -P(ri2) — 1 = 1. For any fixed u>, k and x, the inequality Gfe(x) - #1 (x) < Gfc(x) - Fi„(x) + Flfc(x) - Fi(s) holds, hence for any to G QQ and all > ko(uj) we have sup Gfc(x) - Fi(x)| < sup \Gk(x) - Fifc(x)I + sup |Fifc(x) - Fi(x) X ^2 + 2=£-Therefore, supx Gk(x) — F\{x) converges almost surely to 0. Corollary 3.1 Let Y\. be a sequence of random variables with distribution Gk(y), then Yk converge weakly to the variable Y with distribution Fi(y) as k —> 00. Proof of Corollary 3.1. The result is clear from Theorem 3.5 and the definition of weak convergence. _ 3.6 Weighted Strong Law of Large Numbers By the weak convergence shown in Corollary 3.1, Jg(x)dGk(x) -> Jg{x)dFx (3.3) as k —> 00 for functions g that respect some regularity conditions. Examples of such results can be found in Section 2.2.b of Durrett (2005) for instance. 30 Chapter 3. MAMSE Weights for Univariate Data The left-hand side of Expression (3.3) can be written as m Aifc(w) E i=l and is hence a form of weighted strong law of large numbers (WSLLN). In Section 3.7, we use the WSLLN to prove the consistency of the weighted likelihood with MAMSE weights. For that, we need a WSLLN that holds for an unbounded function g with some discontinuities. We prove the needed theorem in this section as it could not be located in the literature. Lemma 3.6 Consider any two distribution functions F and G from IR to [0,1] such that supx \F(x) — G(x)\ < e for some e > 0. Then, for any connected set A C IR, for all 8 > 0. Since 8 can be arbitrarily small, | dF(B) — dG(B)\ < 2e. The result holds for any combination of closed or open boundaries with minor changes to the proof. _ Theorem 3.6 Let g{x) be a function for which J \g(x) \ dF\(x) < oo. The function g(x) is continuous on IR except possibly on a finite set of point {d\,..., d_}. For each of populations 2,..., m at least one of these two conditions hold 1. the sample size is bounded: Vfc 6 IN, < Mj. dF(A) - dG(A)\ < 2e. Proof of Lemma 3.6. Let B = [a, b] C IR and define B$ = (a — 8, b}' for 8 > 0. Let e5 = | dF(B6) - dG(Bs)\ = \F(b) - F(a -8)- G{b) + G(a - S)\ < \F{b) - G(b)\ + \F(a -8)- G{a - 8)\ < 2e 2. J\g(x)\dFi(x) < oo. 31 Chapter 3. MAMSE Weights for Univariate Data Further suppose that the sequences of sample sizes are non-decreasing with k for all popu lations. Then, Jg(x)dGk(x)- Jg(x)dF1(x) 0 almost surely as k —> oo. Proof of Theorem 3.6. We show that for any e > 0, we can find a sequence of inequalities that imply that J g(x) dGk(x) — f g(x) dFi(x) < e for any large enough k. The inequalities come from truncating g and from approximating it by a step function. For t G IN, let Dt = nf=1(de - 2~l, de + 2^)°, Bt = [-t, t] n Dt and g(x) if x G Bt rt(x) = 0 otherwise Since g{x) is continuous and Bt is a compact set, the image of Tt is bounded, say rt(x) G [Lt,Ut]. By the Heine-Cantor Theorem, Tt is uniformly continuous on Bt, i.e. VeT]i > 0, 35Tj > 0 such that Vxx,x2 G Bt, \xi - x2\ < 5T,t ==^ \rt{x\) - Tt(x2)\ < eTjt. Let eT]t = 2_t and choose 0 < ST:t < 2~l accordingly. For s — 1,..., St, where St = \2t/6Ttt], let Ast ={~t+ (s - l)5Ttt, -t + s5T,t) n Bt. In the rare case where 2t/5T}t is an integer, we let Asttt = [2t — 8Ttt, 2t]. The sets Ast form a partition of the compact set Bt- Note that the choice of Dt and 5Ttt ensures that Ast are connected, with the harmless exception of Ast,t which could sometimes consist of two singletons. Define ht by St ht(x) ^/^bstTl-Astix) s=l 32 Chapter 3. MAMSE Weights for Univariate Data where 1 if x G Ast bst = inf g(y) and TL-Ast(x) = yeAst • 1 0 otherwise Then, by construction, sup^ \rt(x) — ht(x)\ < 2 * and g(x)dGk{x) - jg(x) dF^x) < Tx + T2 + T3 + T4 + T5 (3.4)" where Jg(x)dGk(x) - jrt(x) dGk(x Jn(x)dGk(x)- J ht(x)dGk(x) Jht(x)dGk(x)- Jht(x)dFi{x) J htWdF^x)- J rt(x)dFi(x) rt(x)dF!(s)- Jg{x)dFl{x) Ti T2 n T5 We will now prove that for any e > 0 and w in a subset of fl with probability 1, we can choose 7_ such that the five terms above are less than e/5 for all k > A_(£_). To begin, note that T4 = ht(x) - rt(x) dFi(x) < J \ht{x)-Tt(x)\dF1(x)<2-by construction. The same bound applies for T2 and does not depend on k or to. By Theorem 3.5, sup^. \Gk(x) — F\(x)\ converges almost surely to 0. Therefore, 3flg C fl with P(flo) = 1 such that for each to G flo and any t, 3A_,t with sup|Gfe(x) -Fi(a;)| < 5tmax(|Ut|)|Lt|)21+1 33 ) Chapter 3. MAMSE Weights for Univariate Data for all k > kUtf For any such k and u>, Lemma 3.6 implies that dGk(Ast) - dF^Ast) < 2 5tmax(|C/t|,|Lt|)2*+i for any s = 1,..., St- Developing T3 yields st St T3 = j2b^dG^A^-J2b^dF^A^ St < J2\b^\- dGk(Ast) - dF^Ast) Therefore, 3t\ such that 2-t < e/5 for all t > t\, i.e. T2, T3 and T4 are each bounded by e/5 tor any t>t\ and k > / We can write as t —> 00 since the integrand goes to 0 for each x € IR\{di,... , d_} by the dominated convergence theorem with bounding function |<?(x)|. The integrand does not converge to 0 on {di,..., d_}, but that set has measure 0. Therefore, there exists t2 such that T5 < e/5 for all t > t2. Turning now to T\, we denote by I C {l,...,m} the indices corresponding to the populations for which riik —> 00 as k —-> 00. By the strong law of large numbers, for any fixed t, there exists fi,^ C 0 with P(!i\4) = 1 such that for all u € £\t, T5 = / 5(x)lSc(x)dF1(x) < / |5(x)|lBc(x)dF1(x) - 0 34 Chapter 3. MAMSE Weights for Univariate Data as k —> oo. Consider a fixed u> € Q,i = {Lo\Xij(u) — dt for some i,j,£}C D I n fiM iei,te\N The intersection is over a countable number of sets of probability 1, hence P(fii) = 1. For any such to G fii, T\ is developed as jg{x)TLBc(x)&Gk{x) < J |_(x)|lB?(aO dGfc(x) E ^ E iflf{^«H} 2=1 J=l ^ E — E \9{XijH}\lB?{*i>)}-. 2 = 1 J = l Since LO is fixed, 3£* such that TLB*{Xij(u>)} = 0, Vi G I0, j = 1,..., 71^, < > C- FOR i £ I, the dominated convergence theorem says that there exists i* such that |_(x)|lBf(aOdFi(x) < 10m for all t>t*. Choose t > £3 = maxi6/1*. Since u G fii, B&i,^ such that for all k > kitt:LJ, — 'zZ\9{Xij(u)}\TLBc{Xij(Lo)} < / |_(x)|l_c(x)dFiv lfc j=i J x) H < . 10m 5m Therefore, Vi > max(i3,t_), there exists fc* t = maxje/ fci^u, such that •T, = 5(x)dGfc(x) - 1 r4(x) dGk(i < for all k>klt. In conclusion, for any u; G fio H fii and any e •> 0, we can choose £_ = max(ti, <2, *3, C) 35 Chapter 3. MAMSE Weights for Univariate Data that yields inequalities showing that g(x)dGk(x)- / _(x)dFi(x) < e for all k > kuitu) = max(fc_itw, fc* tu). In other words, the left hand side of Expression (3.4). converges to 0 for any u G fio PI fii with P(fio fl fii) = 1, i.e. that expression converges almost surely to 0. _ Corollary 3.2 (Weighted Strong Law of Large Numbers) Let Xi denote a variable with distribution Fi. Suppose E\Xi\ < oo for i'= 1,..., m, then almost surely as k —> oo. Proof of Corollary 3.2. Use Theorem 5 with g(x) = x. _ Theorem 3.6 is useful for proving the consistency of the MWLE by extending the proof of Wald (1949). This extension is given next. 3.7 Strong Consistency of the Maximum Weighted Likelihood Estimate In this section, we adapt the work of Wald (1949) to prove that the MWLE obtained with MAMSE weights is a strongly consistent estimate. For easier reference to his original work, the numbering of Wald is noted with the prefix W. 36 Chapter 3. MAMSE Weights for Univariate Data Wald's Assumptions The assumptions of Wald (1949) are reproduced below and adapted as required to extend his proof to the MWLE. Let F{x\9) be a parametric family of distributions where 6 is an element of 8, a closed subset of a finite dimensional Cartesian space. We assume that 36o G 0 such that F{X\9Q) = Fi(x). Wald (1949) does not assume that F(x\9o) is continuous in x, but we do and denote its corresponding density function by f(x\9o). The following notation is used by Wald (1949): V9ee,p>0, f(x,9,p) = sup f(x\9'), f*(x,9,p) = max{/(x, 0, p), 1}, ' \e-6'\<P Vr > 0, (f)(x, r) = sup f(x\9), <fi*(x, r) = max{<f)(x, r), 1}. |0|>r Assumption 3.1 (WI) For all 9 G 0, F(x\9) is absolutely continuous for all x. There fore, F(x\9) admits a density function f(x\9). Assumption 3.2 (W2) For sufficiently small p and sufficiently large r, the expected values J\ogf*(x,9,p)dFi(x) and f log<f>*(x,r) dF\(x) are finite. Assumption 3.3 (W3) 7/limj_>oo0j = 9, then lim^oo f(x\6j) = f{x\9). Assumption 3.4 (W4) If 9\ ^ 9Q, then F(X\6Q) ^ F{x\9\) for at least one x. Assumption 3.5 (W5) Iflimi^^ \9i\ = oo, then limj^oo f{x\9i) = 0. Assumption 3.6 (W6) / | log f{x\90) \ dFi(x) < oo for i = 1,m. Assumption 3.7 (W7) The parameter space 0 is a closed subset of a finite-dimensional Cartesian space. Assumption 3.8 (W8) The functions f(x,9,p) and cp(x,r) are measurable for any 9, p and r. 37 Chapter 3. MAMSE Weights for Univariate Data Assumption 3.9 The functions f(x\9o), f(x,9,p) and <j>(x,r) are continuous except pos sibly on a finite set of points {di,... The set of discontinuities may depend on 9, p or r, but must be finite for any fixed values of these parameters. Assumptions WI to W8 are from Wald (1949); only Assumption W6 is modified to cover the m populations of our paradigm. Assumption 3.9 is required to ensure that Theorem 3.6 applies. Note that these assumptions are mostly concerned with the family of distributions F(x\9) (the model), rather than with the true distribution of the data. Lemmas 3.10 and 3.11 are useful for determining if the family of distributions F(x\9) satisfies Assumption 3.9. Wald's Lemmas Wald's lemmas do not need to be modified. We state them for completeness, but do not reproduce the proofs provided in Wald (1949). For expectations, the following convention is adopted. Let U be a random variable. The expected value of U exists if E{max(f/, 0)} < oo. If E{max(U, 0)} is finite but E{min(U, 0)} is not, we say that E{mm(U, 0)} = —oo. Let a generic X represent a random variable with distribution Fi(x) = F(X\9Q). Lemma 3.7 (WI) For any 9 ^ 9Q, we have Elog f(X\9) < Elog f{X\90). Lemma 3.8 (W2) lim E log f(X, 9, p) = E log f(X\9). p->0 Lemma 3.9 (W3) The equation limr_>ooElog0(Ar,r) = -oo holds. The next two lemmas are useful in determining if Assumption 3.9 is satisfied. Lemma 3.10 Let f(x\9) be continuous for all 9 G © and x G NXl, a neighborhood of' x\. Then for 9Q and p fixed, f(x,8o,p) is continuous at x\. 38 Chapter 3. MAMSE Weights for Univariate Data Proof of Lemma 3.10. Suppose that f(x,6o,p) has a discontinuity at x — x\. Then, there exists e > 0 such that for all 5 > 0, there exists x2 with \xx — x2\ < 5 but \f(xu80,p)-f(x2;90,p)\ >e. (3.5) Let A c NXl be a compact set around x\. Let B = {6 : \0 — 9Q\ < p). The set A x B is compact and hence f(x\9) is uniformly continuous on that domain by Heine-Borel. Therefore, for the e chosen above, there exists a Si > 0 such that x\,x2 e A and \x\ — x2\ < 5i imply \f(Xl\9)-f(x2\9)\<e/2 (3.6) for all 9 G B. Choose such an x2 and define 9\ = arg max fixAO) and • 92 = arg max f(x2\9). \e-e0\<P \e-e0\<P The maxima are attained since A x B is compact and /(x|#) continuous in 6. Therefore, f(xi,90,p) = /(xi|0i) and f{x2,e0,p) = f(x2\62). j (3.7) Consider the following two cases. Case 1: ffsijfli) > ffa^l^) By Equations (3.5) and (3.7), /(xi|0i) > f(x2\92) + e. Furthermore, inequality (3.6) implies that f(x2\9l)>f(x1\91)-€->f(x2\B2) + ^, a contradiction with the definition of 92. 39 Chapter 3. MAMSE Weights for Univariate Data Case 2: f(xi\9i) < f{x2\92) By Equations (3.5) and (3.7), f(xi\9\) < f(x2\92) — e. Inequality (3.6) yields f{xi\e2)> f{x2\d2)-e->f{xl\el)+€-a contradiction with the definition of 9\. Therefore, we conclude that f(x,9o,p) is continuous at x\. _ By Lemma 3.10, if f(x\8) is continuous in x and 9, then f(x,9o,p) is continuous in x for any fixed 9Q and p. Before introducing Lemma 3.11, define the modulus of continuity of the function g(x) around XQ. Note that when it exists, Proof of Lemma 3.11. Fix r > 0. Since <f>{x,r) is discontinuous at XQ\ there exists e > 0 such that for any 5 > 0, 3x\ such that \XQ — xi\ < 8 but iog(5,x0)= sup \g(x)-g(x0)\ \x—xo\<6 s'(*o)|. Lemma 3.11 Suppose that f(x\9) is continuous in 9 and that (f)(x,r) has. a discontinuity at XQ. Then, there exists e > 0 such that Uf^.^(S, XQ) > e for any 5 > 0 and some 9. \<p(xQ,r) - <p{xi,r )\>2e. (3.8) For any fixed S and x\, consider the following two cases. 40 Chapter 3. MAMSE Weights for Univariate Data Case 1: d>(xn, r) > <b(x-\, r) + le. By the continuity of f(x\6), it is possible to choose \6Q\ > r such that /(xo|#o) is arbitrarily close to (f>(xQ,r), say less than e apart, i.e. f(xo\0o) ^ <Pixo,r) - e. For that possibly suboptimal Oo, (f)(xiir) > f(xi\8o), hence f(xo\0o) ></>(x0,r) - e > ^(xi.r) > /(xi|0o) meaning that |/(xo|»o) - /(xi|0o)| > /(*o|0o) - /(asil^o) > <K*o,r) - e - <f>(Xl,r) > e because of Equation 3.8. Therefore, Uf^g0^(5,xo) > e. Case 2: </>(xc r) < 4>{x\,r) — 2e. The continuity of /(x|#) allows us to choose |#i| > r such that is close to <j)(xi,r), say less than e apart, i.e. f(xi\di) >4>{xi,r) - e. ) Then, by the definition of (j>, we have (p(xo,r) > f(xo\9\), hence f{x1\e1)>(f)(x1,r)-e>^(xo,r)>f(x0\9l). Therefore, \f{xi\9i) - /(xo|^i)| > /(xil^i) - /(aroint) > </>(xur) - e - <f>(x0,r)> e by Equation 3.8. Therefore, ujf'.\gl)(5,xo) > e. 41 Chapter 3. MAMSE Weights for Univariate Data By combining both cases, we can conclude that for all 5 > 0, there exists a 9 such that cOf(.\e){S,x0) > e u Note that if g(x) is continuous at XQ, lims-toujg(5,xo) = 0. Having a modulus of conti nuity u>y(.|0)(5, xo) bounded below is akin to having an infinite derivative df(x\9)/dx at xo-This occurs when: • there is a discontinuity in the function f(x\8) at xo, • as 9 —* oo, the slope of f(x\9) around xo keeps increasing. Therefore, discontinuities in (f>(x,r) will occur if f(x\9) is discontinuous itself, or if f(x\8) has a peak that can become arbitrarily steep (i.e. its slope is not bounded for a fixed x as 9 varies). The main result of this paper uses Theorem 3.6 which allows a finite number of discontinuities. Therefore, as long as f(x\9) is a continuous model such that the set • {x : for an arbitrarily large r > 0,Uf(.\g)(5,x) is arbitrarily large for some |0| > r) = {x : f(x\9) has an arbitrarily steep peak at x for some \9\ > r} is empty or made up of a finite number of singletons, Assumption 3.9 will hold. Note the effect of the constraint \9\ > r. Consider for instance the normal model with unknown mean and variance. As cr2 —* 0, the normal density will have an arbitrarily steep peak close to fi. However, \[p,,cr2]T| > r implies that there is a lower bound for a2, hence the steepness of the peak is bounded. Assumption 3.9 is satisfied under that model. Main Result We now turn to the main results of this section. 42 Chapter 3. MAMSE Weights for Univariate Data Theorem 3.7 (Theorem WI) Let T be any closed subset of Q that does not contain 9Q. Then, m nik supTT TT f{xij(u)\e}Xik^nik'ntk = ol =1. nn/^Hi^} (w)nifc/nifc i=li=i Proof of Theorem 3.7. Let X denote a random variable with distribution F\(x) = F(X\9Q) and let ro be a positive number chosen such that E{log<t>(X,r0)}<E{logf(X\90)}. (3.9) The existence of such an ro follows from Lemma 3.9 and Assumption 3.6. Then, 71 — {9 : ^<ro}nTisa compact set since it is a closed and bounded subset of a finite-dimensional Cartesian space. With each element 9 € 71, we associate a positive value pg such that E{logf(X,9,p9)} < E{log/(X|0o)}. (3.10) The existence of such pg follows from Lemma 3.7 and 3.8. Let S(9, p) denote the sphere with center 9 and radius p. The spheres {S(9,pg)} form a covering of the compact 71, hence there exists a finite sub-covering. Let 6\,..., 9h € 71 such that 71 c Us=i S(9S, pgs). Clearly, m nik o < suPnnw>)i^Aifc {w)nxk/nik m nik h m nik ^ Enn/^(w)'^'^}Aifc(a,)nifc/nifc + nn.^M'^^^"1*^-s=li=lj=l i=lj=l 43 Chapter 3. MAMSE Weights for Univariate Data Therefore, to prove Theorem 3.7 if suffices to show that m nik nn/{^H.^.^.}Aifc(w)nit/nifc nn/{^(^)i«o}AttMnu/Bik t=ij=i lim = 0 = 1 for s — 1,..., h and that m nik ^"srss : nii/{^iHi^}A<fc(")nifc/Tiifc i=ij=i = 0 =1. The above equations can be rewritten as lim nik k—»oo EE¥^1O§/{^HA,^} i=i j=i nik \ogf{Xijiu)\eQ} = — oo = p lim / logf(x,9s,pga)dGk(x) log/(x|^o)dGib(x) } = -oo = 1 (3.11) for s = 1,..., /i and lim nik k—»oo L i=l i=l Ajfc(^) nik- log/{A-yH|0o} = -co (3.12) 44 Chapter 3. MAMSE Weights for Univariate Data = P lim nxk{ j \og<t>(x,r0)dGk{x) k-*oo - \ogf{x\90)dGk(x) = -oo = 1 respectively. Assumptions 3.6 and 3.9 insure that Theorem 3.6 applies to the integrals above, each of these converging almost surely to log/CzA^JdFiOr), J log^(x,r0)dF1(x) or J log f(x\90) dFi(x) Combining this result with Equations (3.10) and (3.9), we have that (3.11) and (3.12) hold. Hence the proof of Theorem 3.7 is complete. • Theorem 3.8 (Theorem W2) Let 9k(u>) be a sequence of random variables such that there exists a positive constant c with m nik TJ TJ /{^(w)|4M}A<fc(w)nifc/nifc i=i j=\ > c> 0 (3.13) JJJJfiXijito^o}^^^ i=Vj=l for all k € IN and all u> G fl. Then p{ lim 6k(u>) = 60\ = 1. Proof of Theorem 3.8. Let e > 0 and consider the values of 9k(to) as k goes to infinity. Suppose that 9( is a limit point away from #o, such that \9g — 9Q\ > e. Then, m nik sup nn/{Xij(w)|0}Aifc(w)nifc/nifc \0-0o\>et=lf=l m riik > c> 0 i=ij=i 45 Chapter 3. MAMSE Weights for Univariate Data infinitely often. By Theorem 3.7, this event has probability 0 even with e arbitrarily small. Therefore, for all e > 0. • Corollary 3.3 (Corollary WI) The MWLE is a strongly consistent estimate of 9Q. Proof of Corollary 3.3. The MWLE clearly satisfies Equation (3.13) with c = 1 because We study the asymptotic behavior of the MAMSE weights as k —> oo and its consequences in constructing a weighted central limit theorem. Let where = indicates that the functions are equal for all x. Clearly, £ is a nonempty convex set with [1,0,... ,0]T € C. Moreover, if we consider the elements of £ as elements of the normed space ([0, l]m, || • ||) where || • || stands for the Euclidean norm, then Cc is an open set. We will show that for k € IN, all accumulation points of the MAMSE weights will be in the set C. In other words, the MAMSE weights can only converge to vectors that define a mixture distribution identical to the target distribution. Theorem 3.9 Suppose that C° ^ 0 and let A* € CP', then for any e > 0, there exists a set flo of probability 1 such that \\X* — Ajfc(w)|| > e i.o. for all u> & fl® and hence the MAMSE weights do not converge to X* as k —> oo. 0k(to) is chosen to maximize the numerator of (3.13). 3.8 Asymptotic Behavior of the MAMSE Weights 46 Chapter 3. MAMSE Weights for Univariate Data Proof of Theorem 3.9. The Glivenko-Cantelli lemma shows that supx \Fik(x) — Fi(x)\ —> 0 almost surely as k —> oo. Let fij be the set of probability 1 where the convergence occurs. The summand and the integrand of the following expressions are bounded by 1, thus i nik — E Y^\iFi{Xij(Lo)} - Fi{Xij{u)} Li=l E ^XiF^Xn) - Fi(Xn .i=i almost surely as k —> oo by the strong law of large numbers. Let Q' be the set of probability 1 on which the convergence occurs. Note that the expectation in the expression above is taken over the random variable Xn which follows distribution Fi. Consider now the set QQ = ft' C\ Di=i &i and let u G f2o be any fixed element of that set. Note that by construction P(fio) — L Let P(x, r) denote the open ball of radius r centered at x. Since is an open set, any small enough e > 0 will be such that B(X*,e) D C = 0. Then, consider Pfc(A) as defined in Equation (3.1) and for any A € P(A*,e), Pfc(A) > dPifc(x) E XiFik{x) ~ Elk(x) 1 m i=i m J2^iFi{x)- Pi(x) l m m ^2\iFik(x) -^2\iFi(x Fi{x) - Flk(x)\ } dFlk(x) > -< nik —E nlk f-f i=l m F^-Fuix) } dFlk{x j=l Li=l Y/^iFi{Xlj(u)}-F1{Xlj(u)} J |EA?I^(3 x) - Fi(x) + Flk(x) - Fi{x)\ } dFlk(x) 47 Chapter 3. MAMSE Weights for Univariate Data * — E Tin. i—* Y_ A?sup Fik(x)-Fi(x i=i m J2xiFi(xn)-Fi(Xn) — sup x Fxk[x)-Fx{x) . i=i K > 0 for a large enough k. The fact that A G £c implies that YA=I XiFi(x) ^ Fi(x) for some x where F\(x) is not flat, i.e. some x with positive probability, thus E J^XiF^Xu) - FxiXn) . i=i > 0. Therefore, there exist ko(u) and K > 0 such that Pfc(A) > if for all k > ko(u). However, Lemma 3.2 shows that Pk{Xk(u)} —> 0 as A; —> oo. Therefore, Afc(w) € JB(A*,e) at most finitely many times. This is true of any A* G Cc, meaning that for all LO e flo, \\X* — Afe(w)|| > e at most finitely many times. _ Corollary 3.4 Consider the sequence of MAMSE weights Xk(u>) for LO fixed and k G IN. Let A be an accumulation point of the sequence Xk(u>), then A G £. Proof of Corollary 3.4- By Theorem 3.9, the neighborhood of any A G Cc can be visited at most finitely many times. Hence, any accumulation point belongs to C. ' • Corollary 3.5 If C is a singleton, then C = {[1,0,..., 0]T} and Afc(u;)«>[l,0,...,0]T almost surely as k —> oo. Proof of Corollary 3.5. The vector .[1,0,... ,0]T is always in L. Therefore, C will be a 48 Chapter 3. MAMSE Weights for Univariate Data singleton only when C = {[1,0,..., 0]T}. Let e > 0 and let ^=[0,ir\B([l,0,...,0]T,e) where B(x,r) denote the open ball of radius r centered at x. The set A is closed and bounded thus compact. Let B(x,r) be the closed ball of radius r centered at x. Consider the sets £(x, e/2) for x € [0, l]m; they form a covering of A. Since A is a compact set, there exist a finite sub-covering with balls centered at xs for s — 1,..., S. Consider now the sequence of MAMSE weights \k{u)- By Theorem 3.9, for every fixed to £ fl\ with P(Qi) = 1, any of the balls B(xs,e/2) will contain at most finitely many Afc(oi), i.e. s A,t(w) € (J B(xs,e/2) finitely many times. (3-14) s=l Consequently, . Afc(cu) € ||JS(xS)e/2)| CB([1,0,..,0]T,£) Lo. (3.15) Expressions (3.14) and (3.15) imply that if it exists, the limit of Afc(w) is in the set B([l, 0,..., 0]T, e). Since e can be arbitrarily small and since the space is complete, we conclude that Afc(w) —> [1,0,..., 0]T almost surely. _ In the case where £ is not a singleton, the MAMSE weights do not seem to converge to any particular point. Corollary 3.4 indicates that any accumulation point will be in C, but it seems that the neighborhood of many points in C is visited infinitely often. Describing the limit of the MAMSE weights in a general case seems rather tedious. In particular, the speeds at which the sample sizes increase in each population as well as the shape of each Fjfc(x) compared to -Fife (a;) will have an influence on the behavior of the 49 Chapter 3. MAMSE Weights for Univariate Data weights. The precise description thereof is left to future work. 3.9 Issues for Asymptotic Normality In all simulations performed, we have not found any evidence showing that a MAMSE-weighted sum of random variables is not asymptotically normal. A formal proof showing the asymptotic distribution of such a sum remains however to be found. Even in the realistic case where no mixture of the populations are exactly identical to the population of interest, i.e. C = {[1,0,.... ,0]T}, we cannot develop a central limit theorem for a MAMSE-weighted sum of variables without studying the speed of convergence of the MAMSE weights. Let pi — E(A'ji) and of" = var(Xii). Under reasonable assumptions, one could hope to show that the expression converges weakly to a Normal variable with mean 0. To simplify things, suppose that the sample sizes from all populations increase at the same rate, that is [nik, «2fe, • • •, nmk]/nik —• [<*i, • • •, cxm] as k —> oo with 0 < ctj < oo. converges almost surely to 0 by the strong law of large numbers and that y/n~ik~Bik converges weakly to a Normal distribution with mean 0 and variance of/o^ by the central limit theorem. However, developing Bk yields (3.16) Note that m Aife(w) Y^(Xij - Pi) + Z~2 Xikiu)(pi ~ Ml) 3=1 i=l 50 Chapter 3. MAMSE Weights for Univariate Data m m = \k(u)Bik + ^_ \ik(u)(m - fii). 1=1 t=l Recall that Bik —> 0 and that Corollary 3.2 shows that Bk —> 0. Consequently, the ex pression 2~TJi_i Ajjfc(u;) (fii — /ii) converges to 0 as well, but we do not know the rate of that convergence. If that rate is slower than l/y/nik, Expression (3.16) will not converge to a Normal distribution even if the MAMSE weights converge almost surely to a fixed value. To prove the asymptotic normality of (3.16) using the strategy sketched above, we need to show that the speed of convergence of the MAMSE weights is fast enough. This condition corresponds to the second half of Assumption 2.5 of Wang, van Eeden and Zidek (2004) who require the same rate of convergence, l/y/nik, for the weights. Note that the classical proof of asymptotic normality uses the moment generating func tion or the characteristic function, but it does not apply here because each datum is multi plied by a data-based weight. As a consequence, the terms \k(yj)Xij are not independent. The study of the asymptotic distribution of expressions similar to (3.16) are left to future work. Any significant advances will probably come only after a thorough investigation of the speed of convergence of the MAMSE weights. 3.10 Simulations In this section, the finite-sample performance of the MWLE with MAMSE weights is eval uated through simulations. Different cases of interest are considered. The number of repetitions for each simulation study varies from 10000 to 40000. We used the bootstrap on a pilot simulation to evaluate the variability of the values presented throughout this section. Unless otherwise stated, the standard deviation of the error due to simulation is less than one unit of the last digit shown. 51 Chapter 3. MAMSE Weights for Univariate Data 3.10.1 Two Normal Distributions We first explore the merits of our weights for. the ubiquitous Normal distribution. Samples of equal sizes n are drawn from Pop. 1 : A/"(0,1), Pop. 2 : Af(A,.l) for different values of A, each scenario being repeated 10000 times. Table 3.1 shows the average MAMSE weights under different circumstances. Average Values of 100Ai n = 5 10 15 20 25 50 100 200 1000 10000 A = 0 72 71 72 71 71 72 72 72 72 72 0.001 72 71 71 72 72 72 72 71 72 72 0.01 72 72 71 72 72 72 72 72 72 74 0.10 72 72 73 73 73 73 74 76 86 98 0.25 74 74 75 76 76 79 83 88 97 100 0.50 77 79 80 82 83 88 93 96 99 100 0.75 80 83 86 88 89 94 97 98 100 100 1.00 84 87 90 92 93 96 98 99 100 100 1.50 89 92 94 95 96 98 99 99 100 100 ' 2.00 93 94 96 97 97 99 99 100 100 100 Table 3.1: Average MAMSE weights for Population 1 when equal samples of size n are drawn from Normal distributions with unit variance and means 0 and A respectively. The results are averages over 10000 replicates. From Table 3.1, we notice that the average weight of Population 1 does not seem to go below 0.7 for these scenarios. As n increases, the weight of Population 1 approaches 1, hence the MAMSE weights detect that the distributions are different and ultimately discard Population 2. Note that this convergence to 1 does not seem to occur for A = 0 and seems very slow when A is tiny. The average weight for Population 1 increases as well when the discrepancy between the populations increases while n is kept fixed. Table 3.2 shows the performance obtained for the MWLE with MAMSE weights when compared to the MLE. The ratio of the mean squared errors, 100 MSE(MLE)/MSE(MWLE) 52 Chapter 3. MAMSE Weights for Univariate Data is shown; a value greater than 100 meaning that the MWLE is preferable. This ratio is akin to the relative efficiency of the MLE with respect to the MWLE. Efficiency of the MWLE n = 5 10 15 20 25 50 100 200 1000 10000 A = 0 146 145 144 144 143 143 144 144 144 143 0.001 . 147 : 146 145 144 143 143 142 143 143 144 0.01 146 146 145 144 143 143 144 143 141 127 0.10 143 143 142 140 139 135 128 118 89 94 0.25 .139 134 131 125 123 110 96 87 91 99 0.50 127 117 108' 104 97 88 88 90 97 100 0.75 114 103 95 91 89 87 91 95 99 100 1.00 103 94 90 88 88 90 94 97 99 100 1.50 89 88 89 91 91 94 98 98 100 100 2.00 84 87 91 92 93 96 98 99 100 100 Table 3.2: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of equal size n are simulated from Normal distributions with unit variance and means 0 and A respectively. The results are averaged over 10000 replicates. The MWLE performs better than the MLE for small n and A. When n and A increase, the methods' performances are eventually equivalent. For the cases in between however, the MLE is a better choice than the MWLE. Fortunately, the loss (at most 16%) seems to be smaller than the potential gain (up to 47%). When the two populations are identical, a steady improvement of about 43% is observed. Note that we cannot expect to improve uniformly over the MLE since the mean is an admissible estimator. The weighted likelihood could be especially useful in situations where a large population is available to support a few observations from the population of interest. For the next sim ulation, 40000 replicates of each scenario are produced with the same Normal distributions as before, but with samples of size n and lOn for Population 1 and 2 respectively. Table 3.3 shows the average weight allocated to Population 1; Table 3.4 shows the relative efficiency of the methods as measured by 100 MSE(MLE)/MSE(MWLE). The general behavior of the weights is similar to that in the previous simulation, except that their minimal average value is below 0.5 this time around. As a consequence of its 53 Chapter 3. MAMSE Weights for Univariate Data Average Values of lOOAi n = 5 10 15 20 25 50 100 200 A = 0 51 50 49 49 49 49 49 48 0.001 51 50 49 49 49 49 49 48 0.01 52 50 50 49 49 49 49 49 0.10 54 53 52 53 53 54 57 62 0.25 58 59 60 61 62 69 78 86 0.50 66 70 73 76 79 87 93 96 0.75 74 79 83 86 88 94 97 98 1.00 80 86 89 91 93 96 98 99 1.50 87 92 94 95 96 98 99 99 2.00 91 94 96 97 97 99 99 100 Table 3.3: Average MAMSE weights for. Population 1 when samples of size n and lOn are drawn from Normal distributions with unit variance and means 0 and A respectively. The results are averages over 40000 replicates. larger size, the sample from Population 2 gets a heavier weight. It appears that a larger Population 2 magnifies the gains or losses observed previously. Fortunately however, the magnitude of the further improvements seem to exceed that of the- extra losses. Note that the MAMSE weights are invariant to a common transformation of the data in all populations. Therefore, simulation results would be identical (less simulation error) for Normal populations with variance a2 and with means 0 and Ac respectively. Overall, the MWLE works very well under the suggested scenarios. 3.10.2 Complementary Populations We explained in Section 2.2 how the likelihood weights can be seen as mixing probabilities. Can the MAMSE weights detect and exploit the fact that Population 1 has the same distribution as a mixture of some of the other populations? Would the quality of the inference then be improved? 54 Chapter 3. MAMSE Weights for Univariate Data Efficiency of the MWLE n = 5 10 15 20 25 50 100 200 A = 0 223 223 223 222 222 221 222 221 0.001 223 225 223 221 222 223 221 220 0.01 223 222 222 220 221 221 220 218 0.10 216 209 203 197 191 169 142 113 0.25 187 165 147 135 125 100 83 78 0.50 139 111 97 90 85 79 83 89 0.75 111 91 85 82 82 85 90 94 1.00 98 85 84 83 85 90 94 97 1.50 88 86 88 89 90 94 97 98 2.00 86 89 91 92 93 96 98 99 Table 3.4: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of sizes n and lOn are simulated from Normal distributions with unit variance and means 0 and A respectively. The results are averaged over 40000 replicates. Pseudo-random samples of equal sizes n are drawn from the distributions Pop. 1 : jV(0,1), Pop. 2 : \jV(0,1)|, Pop. 3 : -\j\f(0,1)| where | • | denotes absolute values. Hence Population 2 has a Half-Normal distribution and Population 3 follows the complementary distribution. We consider different sample sizes, each scenario being repeated 10000 times. The results are summarized in Table 3.5. The first column shows 100 MSE(MLE)/MSE(MWLE); the other columns show the average MAMSE weights allocated to each of the three populations. First observe that the combined average MAMSE weight of Population 2 and 3 accounts for at least half of the total weight for all sample sizes. The MAMSE weights thus detect that an equal mixture of Population 2 and 3 shares the same distribution as Population 1. Note also that the relative efficiency is uniformly greater than 100, meaning that the MWLE . with MAMSE weights is preferable to the MLE in these situations. 55 Chapter 3. MAMSE Weights for Univariate Data n Efficiency lOOAi 100A2 IOOA3 5 115 50 19 30 10 121 46 23 30 15 120 46 25 29 20 118 45 25 29 25 118 45 26 29 50 117 45 27 28 100 116 44 27 28 200 116 44 28 28 1000 115 44 28 28 10000 116 44 28 28 Table 3.5: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) and aver age MAMSE weights, allocated to samples of sizes n drawn from M(0,1), |A/"(0,1)| and-— \M(0,1)| respectively. The results are averages over 10000 repetitions. 3.10.3 Negative Weights In most cases, the unconstrained optimization of -P(A) yields positive weight's. In some cases such as the one that we are going to explore, negative weights systematically occur. Some previous work such as van Eeden &z Zidek (2004) showed that allowing negative weights may sometimes boost the performance of the MWLE. We explore the possibility of such improvements here. Imagine a situation where a measurement of interest is cheaply obtained, but it is costly to determine whether a patient is diseased or not. We want to study the measurement of interest on the diseased patients.. Suppose we have two small samples (one diseased, one not) as well as a larger sample where the health status of patients is unknown. If we allow negative values for MAMSE weights, would they adapt by including the larger population •in the inference and allocating a negative weight to the small healthy population? . To represent the hypothetical situation above we simulate from the following distribu tions: Pop. 1 : A/10,-1), Pop, 2 : 0.5Af(0,1) + 0.5A/YA, 1), Pop. 3 : Af(A, 1), 56 Chapter 3. MAMSE Weights for Univariate Data where Population 1 and 3 have equal sample sizes of n, but Population 2 has a sample size of lOn. Each scenario is repeated 40000 times. Although we allow weights to be negative, we still apply the preprocessing step and set the weight of a population to 0 when it does not overlap with the sample from Population 1. If the preprocessing were ignored, there would be a possibility that A would be nonnegative definite and that the MAMSE weights would not be unique. Applying the preprocessing does not affect the pertinence of this example: if the dis tributions in the populations of diseased and healthy are so different that the samples are often disjoint, there is no point in using the weighted likelihood to include Population 2 as the measurements are in fact a cheap diagnostic test. Moreover, previous simulations without preprocessing yielded results that are not better than those presented here. Figure 3.1 shows the average values of the unconstrained MAMSE weights for different scenarios. Negative weights do appear, hence the MAMSE criterion detects that Population 2 is a mixture of the other two populations and removes the component which is not of interest. For a large A, notice how the negative weights are closer to 0 for smaller samples. In such cases, there is a higher probability that the sample from Population 3 will be disjoint of the sample from Population 1. As a result, the weight allocated to Population 3 is more often forced to 0 by the preprocessing step. As the sample size increases, the samples overlap more frequently. Table 3.6 shows the performances obtained by the MWLE with unconstrained MAMSE weights. The MWLE performs better than the MLE in most cases, being almost twice as good in many cases. Unfortunately, the performances for large A are very poor, especially in the cases where the difference between the.populations is so large that they overlap very lightly. Using a weighted likelihood with negative weights provides an improvement over the MLE, but a similar improvement may be obtainable when the constraints are enforced. 57 Chapter 3. MAMSE Weights for Univariate Data n = 5 n= 10 n= 15 n = 20 n = 25 n = 50 n= 100 A=0.00 so 4 ii » 4 ii » 5 zzi to5 5 5 B : 5 zz 50 5 -0.001 w 4 Z_ so 4 ZZ) so 4 5 •i » 5 ZZ 3 5 —l 5 0.01 50 3 4 zzi 2 4 5 ZZ 50 5 Z_] » 5 •l 50 5 pi 48 0.10 50 3 zzi 50 3 3 E3 so 4 4 3 50 3 ii 50 2 ii 50 0.25 49 1 zzl S 1 •i 5 1 50 0 50 ZZ 50 ZZ] 55 -53 0.50 50 ? Si 56 F "[ 60 63 •M 57 -15 -21 30 1.00 51 60 -ii r_ P i =3 | -20 C 63 -22 C 65 ZZT 63, -24 f_ 66 65 -28 •31 69 2.00 48 -17 -33 60 78 -37 ^^1^ 60 •38 1 60 80 -39 \~Z I 60 80 -to -40 82 5.00 18 I 87 I 14 -0 89 90 90 ', "l 1 87 1 12--10 -23 EED lOOAi wm 100A2 BB IOOA3 Figure 3.1: Average values of 100x the MAMSE weights without the constraints A; > 0. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a A/^O,1) and &J\f(A, 1) distribution. All results are averages over 40000 repetitions. Table 3.7 shows the performance of the MWLE when the usual MAMSE weights are used. Figure 3.2 shows the average values of the weights obtained in that case. Using the MWLE with positively constrained MAMSE weights also provides an improvement over the MLE. This improvement is sometimes larger than that obtained with unconstrained weights. To discern between the two versions of MAMSE weights, Table 3.8 compares their relative effi ciency; values above 100 favor the unconstrained weights. Note that the standard deviation of the error due to simulation in Table 3.8 can be more than one unit, but does not exceed 1.3 units. 58 Chapter 3. MAMSE Weights for Univariate Data 100 MSE(MLE)/MSE(MWLE) n = 5 10 15 20 25 50 100 A = 0 195 196 197 198 197 197 198 0.001 196 196 197 197 198 198 197 0.01 196 196 197 197 198 198 197 0.10 195 194 194 194 192 184 172 0.25 190 182 176 170 165 144 121 0.50 173 153 140 131 124 107 97 1.00 137 113 105 101 100 97 96 2.00 116 92 86 84 84 84 84 5.00 51 49 51 54 57 62 55 Table 3.6: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the MAMSE weights are calculated without the constraints A* > 0. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a A/"(0,1) and a A/"(A, 1) distribution. All results are averages over 40000 repetitions. 100 MSE(MLE)/MSE(MWLE) n = 5 10 15 20 25 50 100 A = 0 211 209 210 210 209 208 208 0.001 212 210 209 . 209 210 209 208 0.01 212 210 210 209 210 209 208 0.10 212 209 207 206 203 194 180. 0.25 207 196 187 180 173 146 118 0.50 186 161 144 131 122 98 82 1.00 139 111 97 89 86 79 82 2.00 97 82 79 78 79 84 90 5.00 51 48 50 53 57 68 79 Table 3.7: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the usual MAMSE weights (i.e. constrained to positive values) are used. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture'of Populations 1 and 3 that respectively follow aJV(0,1) and a JV(A, 1) distribution. All results are averages over 40000 repetitions. It seems that allowing negative weights further improves the performances only in a few cases. In fact, Figure 3.2 shows that Population 2 by itself can be used and Table 3.7 shows it has a positive impact. Table 3.8 suggests that the constrained MAMSE weights are to 59 Chapter 3. MAMSE Weights for Univariate Data 100 MSE(constrained)/MSE(negative) n = 5 10 15 20 25 50 100 A = 0 92 94 94 94 95 95 95 0.001 92 93 94 94 94 95 95 0.01 92 93 94 94 94 95 95 0.10 92 93 94 94 94 95 96 0.25 92 93 94 95 96 98 102 0.50 93 95 98 100 102 109 119 1.00 99 102 109 114 117 123 117 2.00 119 112 109 107 107 100 94 5.00 100 101 102 102 101 91 69 Table 3.8: Relative efficiency of the MWLE with and without the constraints A; > 0 as measured by 100 MSE(constrained MWLE)/MSE(unconstrained MWLE). Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of' Populations 1 and 3 that respectively follow a j\f(0, 1) and a A/"(A, 1) distribution. All results are averages over 40000 repetitions. be preferred more often than not. If we consider other complications arising from allowing negative weights, (e.g. making the weighted EDF non-monotone) keeping the constraints Xi > 0 in the definition of the MAMSE weights seems a better option. A different prevalence of diseased in Population 2 could affect the simulation results. If major differences were observed, the conclusion above could be revisited. 3.10.4 Earthquake Data We now use a model whose weighted likelihood estimate does not have a simple form, i.e. it is not a weighted average of the MLE of each population. Natural Resources Canada http://earthquakescanada.nrcan.gc.ca/ maintains an educational website with resources about earthquakes. From their website, it is possible.to download data about recent Western Canadian earthquakes. The histograms in Figure 3.3 show the magnitude of the earthquakes that occurred in the 5 year period from the 12th of February 2001 to the 12th of February 2006. Events are divided into 3 groups depending on the geographical location of their epicenter. For the purpose of this example, we make 60 Chapter 3. MAMSE Weights for Univariate Data n = 5 n= 10 n= 15 n = 20 n = 25 n = 50 n= 100 A=0.00 43 P 1 ps 1 I 46 P 1 ? 1 0.001 43 8 I3 1 8 46 46 S 48 0.01 43 8 p : •—' 1 r3 1 5 : 6 46 46 8 50 0.10 42 S Is i zJ S 7 zr « 7 7 44 6 52 0.25 41 7 ^ 1 F1 :i ZT « 6 —T « 6 55 ZT 41 4 60 ZT 3 0.50 38 6 —? s 4 59 zr 4 1 60 3 1 62 36 2 68 31 1 77 23 0 65 1.00 32 3 m'~ 1 69 30 1 72 27 1 75 • 24 0 0 86 I 0 92 1 0 i 1 2.00 23 0 82 18 0 06 1 0 99 J « 0 90 10 0 I 95 5 0 1 97 3 0 i 82 5.00 18 0 1 87 J 13 0 90 1 0 92 0 93 I 0 96 4 0 98 2 0 i [ZJ lOOAi HI 100A2 IB IOOA3 Figure 3.2: Average values of lOOx the usual MAMSE weights (with constraints A* > 0). Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow aA/'(0,1) and &M(A, 1) distribution. All results are averages over 40000 repetitions. the assumption that the magnitude of the earthquakes are independent random variables and fit a gamma distribution to each of the three populations using maximum likelihood. The fitted curves appear on Figure 3.3 and the estimated values of their parameters are shown in Table 3.9 along with the number of observations in each area. The gamma model is parametrized as for 8, p, x > 0. 61 Chapter 3. MAMSE Weights for Univariate Data Lower Mainland - Elsewhere in British Columbia Yukon and Vancouver Island or in Alberta North West Territories 0 2468 02468 0246 Magnitude Magnitude Magnitude Figure 3.3: Histograms of the magnitude of earthquakes measured between the 12 of February 2001 and the 12th of February 2006 for three different Western Canadian areas. The curves correspond to the fitted Gamma density. Lower Mainland - Elsewhere in BC Yukon and Vancouver Island or in Alberta North West Territories p 1.654 2.357 6.806 M 1.437 1.869 2.782 n 4743 4866 1621 Table 3.9: Number of earthquakes in three Western Canadian areas between the 12th of February 2001 and the 12th of February 2006. The magnitude of these earthquakes is modeled by a Gamma distribution; the maximum likelihood estimates appear above and are used as the "true" parameters for this simulation. • . We focus our interest on the magnitude of the next earthquake with epicenter in the Lower Mainland - Vancouver Island area. Suppose that only the 50 most recent events from each of the three regions are available. Would the MWLE that uses data from all three regions provide a better estimate than the MLE? To investigate the question, we produce 10000 pseudo-random samples of earthquakes based on the fitted gamma models shown above. The average MAMSE weights are 0.959 for the Lower Mainland - Vancouver Island 62 Chapter 3. MAMSE Weights for Univariate Data area, 0.041 for the rest of British Columbia and Alberta and finally, nearly 0 for Yukon and North West Territories. Although it looks like a small contribution, the MSE of the MWLE for the vector (j3, u) was smaller with 100 MSE(MLE)/MSE(MWLE)=107. We also considered other values of possible interest, namely some probabilities about the magnitude (M) of the next earthquake that are all obtained by plugging the MLE or MWLE in the Gamma model. Table 3.10 summarizes these results. Probabilities Prob Efficiency MLE MWLE Model Data Multiplier P{M > 1) 123 62 63 68 51 xlO-2 P(M > 2) 114 22 24 40 22 xlO"2 P(M > 3) 112 66 73 174 98 xlO"3 P(M > 4) 113 19 21 51 26 xlO-3 P(M > 5) 112 51 59 99 53 xlO-4 P(M > 6) 80 14 17 12 6 xlO"4 Table 3.10: Efficiency in estimating some probabilities about the magnitude of the next earthquake in the Lower Mainland - Vancouver Island area followed by the average of the actual estimates and their true values. Efficiency is measured by 100 MSE(plug-in ' MLE)/MSE(plug-in MWLE). The first four columns of probabilities should be multiplied by the corresponding multiplier. The first column of Table 3.10 corresponds to the relative efficiency of using the MWLE compared to using the MLE as plug-in parameters for the gamma model in order to evaluate the probability of interest. The numbers shown are 100 MSE(plug-in MLE)/MSE(plug-in MWLE) followed by the estimated values of P(M > k) using the MLE and the MWLE as plug-in parameters. For comparison purposes, the columns Model and Data contain re spectively the true probabilities (from the simulated model) and the empirical proportions in the complete dataset. All probabilities are scaled for easier reading; using the corre sponding multiplier will yield the original value. Note that discrepancies with the empirical probabilities reveal weaknesses of the gamma model to perfectly represent the magnitude of earthquakes rather than an advantage for one method over the other. 63 Chapter 3. MAMSE Weights for Univariate Data Interestingly enough, the MSE of the estimates is almost always smaller with the MWLE. Improved performance is hence possible by using the MWLE with MAMSE weights in this situation with distributions copied from real life. 64 Chapter 4 MAMSE Weights for Right-Censored Data Situations arise where the exact value of some data may not be observed. In engineering, a test of the lifetime of light bulbs, say, may not last long enough to see the last bulb fail. In health science, patients may move away while participating in a study and the researcher does not know when the symptoms stopped or when the death of the patient occurred. In these two examples, the data are right-censored: some times of failure are not observed, but they are bounded below. If censored data were ignored, large values of the response variable would be more often excluded from the inference than early deaths (or failures), resulting in an underestimation of the lifetimes. Some methods exist to account for censored data, including a nonparametric estimate of the cumulative distribution function proposed by Kaplan & Meier (1958). We suggest use of that estimate to define a version of the MAMSE weights for censored data. The paradigm still involves m populations, one of which is of inferential interest. This situation may arise in practice when: • interest concerns a subgroup of the population studied, • data come from different studies with different schemes for censoring, and possibly under other circumstances. We first introduce the notation for this section and review the Kaplan-Meier estimate and its properties. We then define the MAMSE weights and show that the algorithm for 65 Chapter 4. MAMSE Weights for Right-Censored Data calculating them converges. We propose the MAMSE-weighted Kaplan-Meier estimate and prove its uniform convergence to the target distribution. Finally, simulations explore the performance of our proposed estimate in finite samples. 4.1 Notation and Review of the Kaplan-Meier Estimate We introduce a notation that comprises m possibly right-censored populations and accounts-for increasing sample sizes. For simplicity, we will adopt a "survival analysis" terminology where measurement of interest is the survival of individuals. For Population i, let The independent positive random variables Xij(u) and Vij(u>) are defined on a common probability space (fl,B(fl), P). We denote their distributions by Fi and Gi respectively; the distributions Fi are assumed to be continuous, but the Gi do not need to satisfy that assumption. Instead of the values X{j, we rather observe For any fixed k € IN, we observe (Z^, 5ij) for i = 1,..., m and j = 1,..., n,^. The index k will be useful to express the asymptotic results of Section 4.4. We assume that the sample sizes are non-decreasing with k and that —> oo as k —> oo. Let us define Xij = time of death of individual j Vij — censoring time of individual j. and 66 Chapter 4. MAMSE Weights for Right-Censored Data diVj^s) = Nik(s) — Nik(s~) = # of deaths at time s, Yik(s) = ^2 ~^-{Zij>s) = # at risk Just before time s, 3=1 dlifc(s) = Yik(s) — Yik(s+) = # of deaths and censored at time s. For Population i, the Kaplan-Meier estimate (KME) of the probability of dying at time t or earlier is written (see Kaplan 8z Meier 1958) Note that the factors of the product differ from 1 only at times of death; Let Hi(t) = P(Zn < t) and let T#. — sup{£ : Hi(t) < 1} be the largest value that Zij can attain. The possibility that THT = oo is not ruled out although it is unlikely to occur in practice. In addition, let H*(t) = P(Zn < t,6n = 1) = Al - Gi(x)}dFi(a;) Jo be the distribution of observed death times for Population i. The Kaplan-Meier estimate is an increasing step function with a jump at each of the times of death. By the definition of THX , the number of deaths observed in Population 1 is •A/fc = NikirH-i)- For k € IN, let tki < • • • < tkj^k be the ordered times of these deaths. The times of death are distinct by the continuity of Fj. If in addition, we use the convention that tko = 0, the jumps of Flfc(i) are Jkj = Fik(tkj) - Afc^fcO-i)) for j € {1,... ,A4} and we have that Jkj < 1-Efron (1967) discusses how the Kaplan-Meier estimate redistributes weight of the cen sored data to the observations on the right. Consequently, Jki < Jk2 < ••• < JkMk- (4.1) 67 Chapter 4. MAMSE Weights for Right-Censored Data We will consider the Kaplan-Meier estimate on a bounded interval [0, U] with U < TBI • Theorem 4.1 (Winter, Foldes & Rejto) sup\Fik{t)-Fi{t)\^0 t<u ; almost surely as —>.oo. Foldes & Rejto (1981) study the rate of convergence of supt<rj \Fik(t) - Fi(t)\. To get a better rate, they assume that the distribution Gi is continuous. However, they also mention that they proved the result of Theorem 4.1 without making any assumptions about Fi and Gi in some earlier work published as Winter, Foldes & Rejto (1978). Efron (1967) and Breslow & Crowley (1974) assume that the distribution Gi is contin uous and show that Fjfc(i) is approximately normal with mean Fj(i) and variance ~{1-Fm2 Al-fTi(a)}-2dff;(s). n Jo This expression for the variance can be estimated using Greenwood's formula (see e.g. Chen & Lo (1997), page 1069), yielding var{Flfc(i)} « vTr{F^)} = {1 - Fik{t)f £ (4.2) This expression becomes less reliable as t approaches TH{, but for a large enough n^, it should be reliable on any interval [0,T] with T < U. Note that the terms in the sum above are 0 except when s is a time of death. Defining MAMSE weights based on the Kaplan-Meier estimate involves using an esti mate of its variance. Equation (4.2) is thus used for the definition of the MAMSE weights in Equation (4.3). Since the variance term is used as a penalty to foster the inclusion of many populations into the inference, we take the liberty of using v&x{Fik(t)} even though we do not make the assumption that Gi is continuous. 68 Chapter 4. MAMSE Weights for Right-Censored Data In the next section, we build a version of the MAMSE weights based on the Kaplan-Meier estimate. The weights are then used to define a MAMSE-weighted Kaplan-Meier estimate (WKME). 4.2 Definition of the MAMSE Weights for Right-Censored Data We suggest using an expression of the form YliLi ^i-^ifcW to estimate Fi(t). To find the weights A = [Ai,..., Am]T adaptively, we use a version of the MAMSE criterion, see Equa tion (2.1), where the distribution functions are estimated with the corresponding Kaplan-Meier estimates. When censored observations are found after the last time of death, the Kaplan-Meier estimate is not a proper cumulative distribution function on the positive real line because it never reaches a value of 1. For that reason, we assume that we can specify an upper bound U < THI and limit our study of the survival function to the interval [0, T] where T < U is such that H*(T) < H{{U). The last inequality means that there is a non-null probability that a death is observed in the interval (T, U]. This will be the case whenever the probability of death (observed or not) is non-null in that interval since the cumulative probability of being censored before T is less than one by the definition of TJJI > U. Preprocessing For a fixed k and i G {2,..., m}, let rriik = min and Mn~ — max Z^ '{j<nik-Sij = l} {j<nik--Sij=l} be the smallest and largest times of death observed in Population i. The weights allocated to the sample from Population i are set to 0 if it fails to satisfy the following two conditions: 69 Chapter 4. MAMSE Weights for Right-Censored Data 1. U & [rriik,Mik], i.e. at least one observed death from Population i is in the interval [0, U] and at least one observed death occurs after U; ' 2. ^ e \mik-, min(Mjfc, [/)]} > 1, i.e. at least one observed death from {j<nik:8ij=l} Population 1 which occurred in [0, U] falls within the range of the observed times of death in Population i. Condition 1 ensures that Formula (4.2) is well defined on [0, U] and not null everywhere on that interval. Condition 2 means that the same formula will be strictly positive for at least one of the times of death from Population 1 in [0, U], ensuring the unicity of the MAMSE weights and the convergence of the algorithm used to calculate them. These consequences are explained in greater detail in Section 4.3. The preprocessing requirements appear to be technical, but they avoid using samples yielding unreliable Kaplan-Meier estimates. After the last time of death, the KME remains constant forever. If Condition 1 failed, we could be relying on such a plateau where subtle differences in the distribution of the lifetimes may be hidden by the censoring scheme. If Condition 2 failed, the whole range where the KME of Population i increases would be compared against a constant function from Population 1. Hence, the conditions also have some intuitive foundations. If very few deaths from Population 1 fall in [0,17], Condition 2 is likely to fail and the other populations may see their weights set to 0. In such cases, the little information available about Population 1 makes it.hard to compare it to the other samples. It is then better to be more conservative and to avoid relying on the other populations. In particular, if less than 2 deaths from Population 1 fall in the interval [0, U], we allocate no weight to the other populations. The lack of knowledge about Population 1 makes the comparison to other populations too uncertain. Let A4k C {1,..., m} be a set of indices corresponding to the population whose samples satisfy the preprocessing conditions. We always have 1 € A4k since Population 1 is never 70 Chapter 4. MAMSE Weights for Right-Censored Data removed from the pool of populations. Objective Function Let Pk(X)= / \Flk(t)-y£tXiFik(t)\ + ]TA2W{/4(*)} J° I i=l ) i=l dFlk(t) - (4.3) be a special case of Equation (2.1) where var{Fik(t)} is estimated by var{Fik(t)} from Equation (4.2) and dp(x) is replaced by dF\k(t). Note that none of the preprocessing steps can remove Population 1 since it is the popula tion of interest. In cases where the last observed death in Population 1 is contained in [0, U] (i.e. when M\k = tkj^k < U), the expression for v5Jc{Fik(tkjvk)} involves a division by 0 since ^»fc(*fe/Vfc) = 0' ^n *na* case) we substitute the ill-defined term by its value just before time tkMk- Although this solution would not be acceptable for constructing confidence intervals, it should do no harm here where our purpose is to construct a penalty term that fosters using the data from all populations. In particular, this adjustment will affect at most one term of the integral Pk(X). The weights are chosen to minimize -Pfc(A) subject to A > 0 and ^ Aj = 1. The solution to that program will be referred to as the survival MAMSE weights and written Afc(w) = [Aife(o;),..., Amfc(w)]T to represent their dependence on LO and k. For values of t in the interval [0,T], the weighted Kaplan-Meier estimate (WKME) of the lifetime's CDF is then defined by m Gk(t) = Y,\k(u)Fik{t). (4.4) i=i 71 Chapter 4. MAMSE Weights for Right-Censored Data 4.3 Computing the MAMSE Weights for Right-Censored Data The algorithm suggested in Section 2.4 applies to the MAMSE weights defined with the Kaplan-Meier estimate. To prove that the algorithm converges, we must show that As sumption 2.1 is satisfied. Lemma 4.1 y"var{Afc(a;)} dFlk(x) >0 Proof of Lemma 4-1- To calculate the MAMSE weights, we require that at least two times of death from Population 1 fall in the interval [0, U]. Let t* be the smaller of the two times r \ of death. By the definition of the Kaplan-Meier estimate, F\k{t*) < 1 since the larger time of death has some mass. Hence, Expression (4.2) is positive at t = t*, i.e. vax{Flfc(O}>0. Since t* is a time of death, the Kaplan-Meier estimate for Population 1 makes a jump at that time. Hence dFik(t*) > 0 as well and consequently, J j<tix{Plk{x)} dFifc(x) >0. • Lemma 4.2 For all external populations remaining after prescreening: i = {2,... ,m}, j™x{Pik{x)} &Flk{x) > 0. Proof of Lemma 4-2. Let and M{k be the smallest and largest time of death respectively observed in Population i. Note that the sum in Equation (4.2) is cumulative. It is thus positive for all x > rriik. 72 Chapter 4. MAMSE Weights for Right-Censored Data The first condition for preprocessing requires that < U < M^. Since the Kaplan-Meier estimate for Population i jumps at M^, the first part of Equation (4.2), {1 — Fik{t)}2, is positive for all t < M^. Consequently, var for all x € [mik,Mik). The second requirement for preprocessing ensures that a death occurs in Population 1 in the interval [m-ik, min(Mjjfc, U)]. Consequently, the discrete measure dFik(x) gives positive mass to at least one point in that interval where var |Fjfc(x)| > 0. Therefore, var {Fife(x)| dFlk(x) > 0. • Lemmas 4.1 and 4.2 are sufficient to show that the algorithm in Section 2.4 converges for this new application to the MAMSE weights. 4.4 Uniform Convergence of the MAMSE-Weighted Kaplan-Meier Estimate We prove that the WKME, Gk(t), converges uniformly in probability to F\(t). The proof is built as a sequence of lemmas that appear next and follow a logic akin to that of Section 3.5. Remember that we assumed in Section 4.1 that the distribution of the times of death is continuous, but the distribution of the times of censoring need not be. Lemma 4.3 E dNik(s) _P 0<3<uYMs)Ylk^ as k —> oo. 73 Chapter 4. MAMSE Weights for Right-Censored Data Proof of Lemma 4-3. Notice that v dNlk(s) ^ dYlk(s) v Ylk(s) - Ylk(s+) The first inequality holds since dNik(s) < dYik(s) for all s. Suppose that the Z\j's are distinct and for a fixed k, let Z^(j) denote the jth order statistic of the sample from Population 1, i.e. the jth smallest value in the list Zu, • • •, Z\nik- Let jok = max{j : j < n\k, Z^ < U} Then, there is at most one censored datum or death at any given time and the expression above can be rewritten as y /_> 1_\ = y Z-~> I VT,J*+\ YTJ.*} I 0<s<U Ylk(s+) Ylk(s)J Z^lYlk(ZlU+1)) Ylk(Zl(j)) Ylk(Z1{jok)) nlk - Ylk(U) (4-5) since Yik(s) is decreasing in s and the series telescopes. Concurrent death times are impossible, but concurrent censoring times are possible since the continuity of their underlying distribution is not assumed. Moreover, censoring cannot occur at a time of death since the two times are independent random variables and at least one of them has a continuous distribution. Even if multiple censoring occurs at one time, inequality (4.5) will still hold. Indeed, the term for concurrent censoring times in the summation is equal to the sum of the individual terms if the times were different but consecutive. For instance, if Yik(t) — Yik(t+) = 2, we have Yik(t)-Ylk(t+) _ 1 1 Ylk(t)Ylk(t+) Ylk(t)-2 Ylk(t) 1 1 1 + Ylk(t)-2 Ylk(t)-1 Ylk(t)-1 Ylk(tY This extends for an arbitrary number of concurrent censoring times. Let us now show that the bound l/Y\k(U) converges to 0. Since the Z-^fs are indepen-74 Chapter 4. MAMSE Weights for Right-Censored Data dent, Yik(U) has a Binomial distribution with parameters nik and 1 — Hi(U). Let e > 0, then p\ E vnvfV8 ^ >*} = P<i/*} • 0 (4.6) 0<s<£/ — — / r l/e-nik{l-Hi(U)} ^nikHi(U){l - Hi(U)} as k oo since the argument inside the standard normal CDF $ tends to — oo. The ap proximation is from the central limit theorem and becomes exact as nik —> oo. • Whether a sample is rejected in the preprocessing or not may vary with k and LO. Remember that m populations are available before preprocessing and that weights of populations in JAk are forced to 0, but Population 1 is never excluded from the optimization problem. Hence preprocessing does not affect the distribution of probabilities calculated in expres sions such as (4.6). Moreover, preprocessing does not change the fact that A = [1,0,..., 0]T is a suboptimal choice of weights, which we use in the proof of the next result. Lemma 4.4 £ {Flk(t)-Gk(i)}2 dFlk(t)$0 as k —> oo, where Gk(t) is defined in Equation (4-4)-Proof of Lemma 4-4- By the definition of the MAMSE weights, Pfc{AfcH}<Pfc{[l,0,...,0]T} since [1,0,... ,0]T is a suboptimal choice of weights. Following Equations (4.2) and (4.3), we thus have £ {Fik(t) - Gk(t)}2 dFik < Pk{Xk(Lv)} < Pfc{[l,0,...,0]T} 75 Chapter 4. MAMSE Weights for Right-Censored Data -f Jo n p rm2 V- dNlk(s) {1 " Flk(t)} ^ Ylk(s)YMs" 0<s<t < E dNlk(s) F{I - Flk{t)}2 dFlk{t) Jo dFlk(t) 0<s<U * E Ylk(s)Ylk(s+) dNlk(s) P 0<s<U Ylk(s)Ylk(s as k —> oo by Lemma 4.3. Let vk = max{j < n\k : tkj < U} be the index of the largest time of death observed at or before time U in the sample from Population 1. Since the steps of Fik(t) are increasing in size (see Equation 4.1), we may define Jk = max\Flk(t) - Flk(t )| = Jfcl/fc, the biggest jump of Fik{t) on the interval [0, U}. Lemma 4.5 Jk —> 0 almost surely as k —> oo. Proof of Lemma 4-5. The result follows from Theorem 4.1 since Jk < 2 sup o<t<u almost surely as k —> oo. Recalling that T < U is such that H{{T) < H*(U), we let Dk = Nlk(U) - Nlk(T) = J2 *{zlje(T,u],6lj=i} i=i be the number of deaths observed in the interval (T, U] among individuals sampled from 76 Chapter 4. MAMSE Weights for Right-Censored Data Population 1. Since the Zy are independent, Dk follows a Binomial distribution with parameters nik and Hl(U) — Hl(T). Let £k — N\k(T) be the number of deaths observed in the interval [0,T], and their corresponding times of death tki < ... < tkek < T. By convention, we set tk(ek+i) = THX if no death is observed after tkek-Lemma 4.6 P < max 0<t<T Fik{t) - Gk(t)\< Jk + max Flk(t) - Gk{t) I *€{tfcl,...,tfc(£fc + 1)} converges to 1 as k —> oo. Proof of Lemma 4-6. Fix k G IN and LO e fl, and let xo G [0, T] be the value maximizing \F\h(t) — Gk(t)\. That maximum exists since |Fifc(*) — Gk(t)\ is a bounded function being optimized on a compact set. Three disjoint cases need to be considered; Case 1: Gk(%o) < Fik(xo) and > 1. Let j\ = max{j G {1,... ,£k} '• tkj < xo} be the index of the largest time of death from Population 1 inferior to XQ. By the choice of ji, tkjx belongs to the same step as xo and hence Fikitkn) = Flk(x0). Moreover, Gk{tkjx) < Gk(x0). since Gk{t) is a monotone nondecreasing function. Recalling that XQ maximizes the differ ence between Fik(t) and Gk(t), we can write mffl \Fik{t) - Gk(t)\ = Flk(x0) - Gk(x0) < Fik(tkji) - Gk(tkjx) 77 Chapter 4. MAMSE Weights for Right-Censored Data < max te{tfci,...,tfc£,} Fik(t)-Gk(t) Fik(t) - Gk(t) < Jk+ max meaning that the maximum will always occur at a time of death from Population 1, where F\k{t) has a step. Case 2: Gk(xo) > Afe(^o) &nd Dk > 1. . Let J2 = minjj € {1,..., £k + 1} : tkj > xo} be the index of the smallest time of death greater than XQ. The condition Dk > 1 ensures that tk(ek+i) exists, hence j2 is well defined. The choice of j2 ensures that it belongs to the step of Fik(t) that immediately follows xo, hence Fik(tk(j2-i)) = Flk(x0). The function Gk(t) is a right-continuous nondecreasing function and tkj2 > XQ, thus Gk{tk]2) > Gk(x0). Recalling that XQ is the point maximizing the difference between Fik(t) and Gk(t), we write < Gk(tkj2) - Fik(tkQj2_i)) = {Fik{tkj2) - F\k{tk^2-i))} + Gk(tkj2) - Flk(tkJ2) < Jk+ max *G{tfci,...,tfc(ffc+1)} Fik(t) - Gk(t) meaning that under Case 2, the maximum will occur immediately before a jump of Flk(t). Case 3: Dk = 0. This event has probability [1 - {H{(U) - H~i(T)}}nik —> 0 as —> oo. 78-Chapter 4. MAMSE Weights for Right-Censored Data The combination of Cases 1 and 2 implies that the desired result is true at least when ever Dk > 1. Consequently, P < max 0<t<T Flk(t)-Gk(t) < Jk + max te{tki, — ,tk'ek+1)} Fik(t) - Gk(t) > P(Dk > 1) = 1 - P(Dk = 0) 1 as k —> oo since the probability of Case 3 converges to 0 as k —* oo. I Lemma 4.7 as k —> oo. max t€{tfci,...,tfc(ffc + 1)} PifcW-GfcCt) Proof of Lemma 4-7. Let e > 0 be such that e < iff ((7) - /^(T). We will show that P < max l*€{*fcl.-">*k(<fc + l)} Fik(t) - Gfc(t) > e ^ -> 0 as fc —> oo. For a large k,let xk G {1,..., £k +1} be the index of a time of death where the difference Flk(t)-Gk(t) is maximized. We define the following three events: Ak = jw G fi : Flk(tkXk) - Gk{tkxk) > ej £fe = G fi : Gk(tkXk) - Fxk{tkXk) > e} Cfc = {w G fi : Dk > enik + 1} . Then, P < max I tG{tfci.--.>tfc(«fc+i)} Pifc(t)-Gfc(t)|>6) < P{Cfu(4nCfc)u(Bkncfc)} 79 Chapter 4. MAMSE Weights for Right-Censored Data < p(c^) + p(Aknck) + p(Bknck). We next show that each of the three probabilities on the right hand side go to zero as k —> oo. Note that the event Ck is used to remove an event of probability 0 that would otherwise complicate the proof that P(Bk) —-> 0. Case 1: P(Cf) -> 0. Recalling that Dk follows a Binomial distribution with n\k trials and probability of success {#*([/) - H{(T)}, we have P(Cf) = P{Dk < enlk + 1} ^ enlk + l-nlk{H*(U)-H*(T)} . \ y/nlk{H*AU) ~ Ht(T)}{l - H{{U) + H*(T)} J = $(-cy/n~[k~ + d/^/n^) -> 0 as k —> oo since the choice of e implies that c > 0, hence the argument inside the standard normal CDF $ tends towards — oo. We use the central limit theorem to approximate the Binomial by the Normal, but this comparison becomes exact as n\k approaches oo. Case 2: P{Ak n Ck) -* 0. Let uk = min{u : J2i=u+i ^ki < e}- This index exists when k is large enough since • Jkxk < Jk —> 0 by Lemma 4.5 and • * /^JJki = Fik(tkXk) > Gk(tkXk) + e>e. i=l • For a large enough fc, < e and hence uk < xk — 1. For j € {ufc,'..., — 1}, we have Gk(tkj) < Gk(tkXk) . 80 Chapter 4. MAMSE Weights for Right-Censored Data since the function Gk(t) is monotone nondecreasing and Fik{tkj) = Flk{tkXk) — /2 since the function F\k{t) makes a jump of size Jki at time t^. Combining these inequalities yields Xk 3*k F\k{tkj) - Gk{tkj) > Fik{tkXk) - Gk{tkXk) ~ ^2 Jki>e— ^2 Jki>0-i=j+l i=j+l The last inequality holds because of the choice of uk. The function F\k{t) gives a mass of Jkj to the point tkj, and hence / \Gk(t)-Flk(t)\2dFlk(t) > Y,Jkj\Gk(tkj)-Flk(tkj)\2 — yi jkj\Gk(tkj) - Fik(tkj)Y 3=Uk ( xk \ 2 > Jkxke2 + Jk3 U ~ J2 Jki ' (4-7)' 3=Uk \ i=j+l ) f\e - x) Jo I£ e3 2 dx = / x2dx = — since the summation corresponds to the Riemann sum for the integral J0£(e — x)2 dx depicted in Figure 4.1. The sum converges as k —-> oo because the width of the columns Jkj tends to zero by Lemma 4.5. To clarify the link between the Riemann sum and the integral, consider the change of variable p = xk — j and let 0 p = 0 ]LX=l Jk(xk-i+l) p=l,--.,xk-uk 81 Chapter 4. MAMSE Weights for Right-Censored Data Note that cfc(p+1) — ckv = Jk{Xk-p) — Jkj and with respect to the variable j, ckp equals J2i=j+i Jki when p > 0. We can thus write the expression in (4.7) as 2J (cfc(p+i) - cfcp)(e - cfep)2 ~* / (e-x)2dx = x2 dx = — P=O 7o « (4.8) Consequently, there exists a rCo such that /Vfc(0-Flfc(i)|2dFlfc(i)>^ 7o o for all fe > fco, an event of probability 0 according to Lemma 4.4. We conclude that P(Ak n Ck) -* 0 as k -> oo. Area of the shaded column: (ck(p+l) - Ckp)(e - Ckp)2 ll dkidk Area of the shaded column: (dk, - 4(?-i))(e - 49)2 Hashed column is ignored Figure 4.1: Graphics representing the Riemann sums used in the proof of Case 2 (left panel) and Case 3 (right panel). Case 3: P(Bk D Ck) -» 0. Recall Equation (4.1) and note that the smallest possible size of a jump in F\j(t) is \jn\k-Hence Jkj > l/n^ and /J^ > en\k + 1 implies that {j:tfc,e(r,t/]j>4+i} Let ffc = max{?j : Y^Vj=xk+i J^j — €}- F°r a large enough /c, Lemma 4.5 implies that 82 Chapter 4. MAMSE Weights for Right-Censored Data Jk(xk+i) < Jk < e and thus vk > xk + 1. For j € {xk + 1,.'.., i^}, the monotonicity of the nondecreasing function Gk(t) implies that Gk(tkj) > Gk(tkxk) In addition, the function Fik(t), a Kaplan-Meier estimate, makes a jump of size Jki at each time of death tki- Therefore, j E\k(tkj) = Eik{tkxk) + E Combining these two inequalities yields . Gk(tkj) - Fik(tkj) > Gk(tkxk) - Fik(tkxk) - E Jki>£~ E — ^' i=H + l i=xfc+l r the last inequality holding because of the choice of vk. Using again the fact that dFifc(i) allocates a mass of Jkj to tkj, we find that rU 4 / |Gfe(t)-Fifc(t)|2dF1Jfc(t) > Jfcj|Gfc(tfci) - Afc(**j)|2 Jo 3=1 - E Jkj\Gk{tkj) - Fik(tkj)\2 ' 3=*k > / (e-x)2dx= / x2dx=- (4.10) since the summation corresponds to the Riemann sum for the integral f0e(e — x)2 dx depicted in Figure 4.1. The term Jkxk£2 ignored in Equation (4.9) corresponds to the hashed column. The sum converges as k —> oo because the width of the columns Jkj tend to zero by Lemma 4.5. 83 Chapter 4. MAMSE Weights for Right-Censored Data To clarify the link between the Riemann sum and the integral, consider the change of variable q = j — xk and let ( 0 q = 0 dkq = \ • Note that dkq - 4(9_i) = Jk(xk+q) = Jkj and that with respect to the variable j, dkq corresponds to Y?i=Xk+i Jki wnen 9 > 0. We can thus rewrite Expression (4.9) as vk-xk j.e /.e £3 2-(dfc,-dfc(9-i))(^-4,)2- J (e-x)2dx = Jo x2dx=- (4.11) 9=1 0 Therefore, there exists a ko such that fU \Gk(t)-Flk(t)\2dFlk(t)>e3/6 Jo for all k > ko, an event of probability 0 according to Lemma 4.4. We conclude that P(Bk n Ck) -> 0 as k -> oo. Combining the three cases yields the desired result. I Lemma 4.8 For T <U < THx with H{(T) < H*(U) and any sufficiently small e > 0, P lsup\Gk(t) - Flk(t)\ >e*> ->0 as k —> oo. Proof of Lemma 4-8. For an arbitrary e > 0, let Ak — < LO 6 fl : max o<t<r Fik(*) -Gk(t)\<Jk+ max Fife(t) - Gk(t) 1 Kfth-A((k + i)} . 84 Chapter 4. MAMSE Weights for Right-Censored Data Bic wgfi: max Flk(t)-Gk(t) e <2 %•= {w € fi : Jk < |}. Clearly, Akr\Bkn Ck implies that sup - Fik(t)\ < e. Therefore, t<T p suP|Gfe(t)-Fifc(t)| <e > P(Afc n 7ifc n c7fc) > P(Afc) + P(Bk) + P(Ck) - 2 -» 1 as A; —* oo by Lemmas 4.5, 4.6 and 4.7. Theorem 4.2 LetO <T <U < THi with H*(T) < H{{U). For all sufficiently small e > 0, P{sup Gfc(i)-Fi(t) t<T <e\ -1 as k —> oo, i.e. supt<T Gfc(t) — F\(t) Proof of Theorem 1. Let us define 0. Bk = Ck = co G fi : sup o<t<r a; G fi : sup 0<£<T to G fi : sup o<t<r Gfc(t)-Fi(t) Gfe(t)-Fife(<) PifcW-^iW < e e < -- 2 By the triangular inequality, GfcW-Pi(t) < Gk(t)-Flk(t) + Flk(t)-F1(t) 85 Chapter 4. MAMSE Weights for Right-Censored Data for any fixed t. In particular sup C*k{t) — F\(t) < sup {|Gfc(i)-Flfe(t)| + |Flfc(t)-Fi(0|} < sup Gk(t)-Flk(t) + sup Fifc(t)-Fi(t) . Consequently, (Bk C\ Ck) C Ak and P(Afc) > P(Bk n Cfc) >P(Bfc) + P\Ck) -1-1 as A; —> oo. Lemma'4.8 implies that P(Bk) —> 1 and Theorem 4.1 implies that P(Ck) —> 1. Therefore, The WKME converges uniformly in probability to the lifetime distribution for Popula tion 1 in the interval [0, T]. The MAMSE weights, although they use data from different distributions, provide an asymptotically unbiased estimate of Fi(i). 4.5 Simulations This section presents the results of simulations performed to evaluate the finite-sample per formance of the MAMSE-weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate. The two examples are based on real survival functions to mimic reality. Simulations use between 10000 and 20000 repetitions. Unless otherwise stated, this number is large enough to make the standard deviation of the simulation error smaller than the last digit shown in the tables or on the figures. sup Gk(t)-Fi(t) 40. t<T 86 Chapter 4. MAMSE Weights for Right-Censored Data 4.5.1 Survival in the USA The Centers for Disease Control and Prevention maintains a website which includes a section called National. Center for Health Statistics. That section contains the decennial life tables published by the National Center for Health Statistics (1997) at the address http://www.cdc.gov/nchs/products/pubs/pubd/lftbls/decenn/1991-89.htm . From this publication, we obtain the survival curves for four subgroups of the population in the United States: • White males; • White females; • Males other than white; • Females other than white. The life tables have a resolution of one year from birth to the age of 109 years. We simulate the age at death based on these tables; the day and time of death is chosen uniformly during the year of death. The CDF of the survival time for each of the four populations is depicted in Figure 4.2. Our purpose here is to explore the potential value of our new method rather than to study its behavior extensively under all possible conditions. Hence, we use a simple definition for the distribution of the censoring times based on the distribution of the lifetimes. Lemma 4.9 Let X\,... ,Xr+i be r + 1 independent and identically distributed variables with common distribution F. Then, P{max(X1, ...,Xr)< Xr+1} = —|— 87 Chapter 4. MAMSE Weights for Right-Censored Data Distribution of Lifetime in the USA White Male — • - White Female Male Other Than White - - - • Female Other Than White f* ,**• rm J. Ji m . — • — • — -~T-20 1 1 40 60 Age at Death (Years) 80 —I— 100 Figure 4.2: Survival functions for subgroups of the American population as taken from the life tables of the National Center for Health Statistics (1997). Proof of Lemma 4-9. poo P{max(Xi,...,Xr)<Xr+i}= / P {max(Ii,..., Xr) < Xr+1\Xr+x = i) dF(t) Jo = {F(t)Y dF(t) = furdu=jL-Jo Jo r + 1 Let X\,..., Xr be r random variables with the same distribution as the survival time of an individual. The censoring time is defined as V = max(Xi,..., Xr), yielding a censoring rate of l/(r + 1). Throughout this section, r = 4 is used yielding a censoring rate of 20%. The last example however involves different rates of censoring, from 16% to 34%, obtained with r e {2,3,4,5}.' We restrict our goal to inferring the survival distribution of white males based on equal samples drawn from the four demographic groups mentioned above. We investigate if the 88 Chapter 4. MAMSE Weights for Right-Censored Data MAMSE-weighted Kaplan-Meier estimate is a better choice than the usual Kaplan-Meier estimate based on the data from white males. For different values of the upper bound U £ {60,70,80,90,100}, we generate samples of equal sizes n € {10,25,100,1000} from each of the four populations. Each scenario is repeated 20000 times. We calculate both the weighted Kaplan-Meier estimate F^{t) and the usual Kaplan-Meier estimate Fi(t). To evaluate the quality of the estimators, we compare the area that separates them from the real survival curve F\(t), more precisely, we use Table 4.1 shows the ratio lOOAi/A^ for different choices of n, U and with T = 55. The interval of interest is thus [0,T]. Values above 100 mean that the weighted Kaplan-Meier estimate performs better. Table 4.1: Relative performance of the WKME as measured by 100Ai/A\. Both areas are calculated on the interval [0,55] and various upper bounds U are used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times. The weighted Kaplan-Meier estimate seems to be a better estimate under all scenarios considered. No apparent trends in the magnitude of the improvement against n and U are observed. The MAMSE criterion evaluates the dissimilarity between the populations on the interval [0, U]. It is thus not surprising that no clear trend is observed as U varies. Note that for the largest sample size (n — 1000), the advantage of the WKME seems more Ratio 100Ai/Ax, with T = 55 U = 60 70 80 90 100 n = 10 25 100 1000 114 135 142 118 100 137 148 149 128 101 135 143 140 128 102 121 120 108 105 103 89 Chapter 4. MAMSE Weights for Right-Censored Data modest. The white females have the longest survival time and nearly 25% of them reach the age of 90. However, less than 3% survive long enough to celebrate their 100th birthday. For U — 100, the samples from other populations will frequently fall short of the upper bound and be ignored, especially for small sample sizes. This might partially explain the abrupt change in performance from U = 90 to U = 100. The improvements observed in Table 4.1 appear again in Table 4.2 where the interval on which the functions are compared varies with the upper bound U. Once again, the newly proposed weighted Kaplan-Meier estimate performs better than the classical one under all the scenarios. Table 4.2: Relative performance of the WKME as measured by lOOAi/A\. Areas are calculated on the interval [0,(7 — 5], where U is the upper bound used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 v times. Figure 4.3 illustrates the average weights allocated to each of the four subpopulations. Notice that the weight allocated to Population 1 is close to one when U = 100, especially for small sample sizes. This supports the explanations regarding the sudden drop in perfor mance observed in Table 4.1 and 4.2: samples from other populations are often dismissed because no individual reaches 100 years of age. Unless a mixture of the survival distributions of the other populations is identical to that of the survival distribution for Population 1, Theorem 4.2 predicts that the weight allocated to Population 1 will converge to 1. A tendency to increase towards 1 is observed for U £ {70,80,90}, but not for U = 60. It is expected that for larger sample sizes, Ai Ratio 100Ai/Ax, T = U - 5 U = 60 70 80 90 100 n = 10 25 100 1000 114 132 137 116 100 137 141 137 122 101 135 134 128 118 101 121 115 103 101 101 90 Chapter 4. MAMSE Weights for Right-Censored Data Average Weight Allocated to Each Population U = 60 70 ! U=80 U=90 U= 100 n=10 73 51 53 78 1 100 n= 25 46 43 - 70 I 99 n=100 39 40 64 1 97 n= 1000 37 46 69 83 I 96 | • While Males • White Females • Non-White Males • Non-White Females Figure 4.3: Average value of the MAMSE weights for different upper bounds U and sample sizes n. The cells' area are proportional to the average weight allocated to each population. The numbers correspond to 100Ai and are averaged over 20000 repetitions. would eventually converge to 1 even in that latter case. The large weight allocated to the three other subpopulations for samples as large as 1000 should be interpreted as a sign that a mixture of these 3 distributions is extremely close to the true survival distribution in Population 1 and that does not seem unreasonable based on Figure 4.2. Figure 4.4 depicts examples of estimates of the survival functions. While the smooth gray line shows the true distribution of the lifetime of a white male in the United States, the plain black line shows the Kaplan-Meier estimate based on a sample of size n and the dashed line corresponds to the MAMSE-weighted Kaplan-Meier estimate that we propose. The numbers on each panel correspond to A\ and A\ respectively with T — 75. As we may expect from the good performances noticed in the previous tables, A\ is typically smaller than A\. Some exceptions arise: A\ will occasionally be smaller than A\, as it is the case for n = 1000 in Figure 4.4. A close look at the dashed line shows one advantage of the weighted Kaplan-Meier estimate: using more populations involves having more steps in the estimate since jumps may occur at a time of death from any of the samples 91 Chapter 4. MAMSE Weights for Right-Censored Data 0 20 40 60 0 20 40 60 Age at death (Years) . Age at death (Years) Figure 4.4: Typical examples of the weighted Kaplan-Meier estimate (dashed line) and of the usual Kaplan-Meier estimate (plain'black line) for different sample sizes. Note that U = 80 and T = 75. The true distribution is depicted by a smooth gray line. considered. This results in a smoother step function. An estimated survival function will typically be used to answer further questions about the population of interest. Tables 4.3 and 4.4 show the performances of the weighted Kaplan-Meier in estimating Fi(55) = 0.11976 or Ff^O.lO) = 52.081. Note that we write qx = Ff^O.lO) and qx = FT^O.IO). The estimates obtained by the weighted Kaplan-Meier estimate feature a smaller MSE in almost all cases. Moreover, the magnitude of the gains seems to outweigh that of the occasional losses, especially when we consider that such losses occur when n is large, not the cases where our method would be most useful. The relative performances of the WKME for estimating a quantile is similar to the results obtained for the probability F\(55). The resemblance between the tables is not 92 .. Chapter 4. MAMSE Weights for Right-Censored Data MSE{A(55)} MSE{FA(55)} (7 = 60 70 80 90 100 n = 10 117 151 172 134 101 25 137 159 170 149 102 100 125 142 142 134 104 1000 110 107 84 86 103 Table 4.3: Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating Fi(55) = 0.11976 as measured by 100 MSE{F\(55)}/MSE{F\(55)}. Different choices of U and n are considered. Each scenario is repeated 20000 times. 100 MSE(gi)/MSE(gA) U = 60 70 80 90 100 n = 10 120 140 161 141 100 25 153 173 172 133 101 100 124 145 139 126 102 1000 119 113 86 86 106 Table 4.4: Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating FT1 (0.10) = 52.081 as measured by MSE(gi)/MSE(f7^). Different choices of U and n are considered. Each scenario is repeated 20000 times. surprising since they both use the estimated curves around their 10% quantile. For Table 4.5, we fix U = 80 and T = 75, then try different distributions for the time of censoring that yield an average fraction p € {1/3,1/4,1/5,1/6} of censored data. ' The proportion of censored data has little or no effect on the relative performance of the WKME compared to the KME. A closer look at the raw data shows that the precision of both the KME and the WKME are affected by a larger p, but it appears that the magnitude of this effect is the same for both methods. As a result, their relative performance seems unaffected by the rate of censoring. Overall, the weighted Kaplan-Meier estimate seems to outperform the usual Kaplan-Meier estimate in almost all the cases studied. Next, we look at another population with more subgroups. 93 Chapter A. MAMSE Weights for Right-Censored Data Average of lOOAi lOOAi/Ax 1 1 l l II I I ^ 3 4 5 6 3 4 5 6 n = 10 57 54 53 52 133 135 137 136 25 50 48 48 47 136 137 137 138 100 47 47 47 47 127 127 128 126 1000 68 69 69 69 102 103 103 102 Table 4.5: Average MAMSE weights for different rates of censoring p and different sample sizes n. The right-hand side of the table presents the relative performance of the WKME as measured by 100Ai/A\. Figures are averaged over 20000 repetitions and the values U = 80 and T = 75 are used. 4.5.2 Survival in Canada Statistics Canada periodically publishes life tables for the survival of Canadians. We use the life tables published for the reference period of 2000 to 2002 that are available online at the address http://www.statcan.ca/bsolc/english/bsolc?catno=84-537-X. . The life tables from Statistics Canada (2006) provide the survival functions of Canadians with a resolution of one year, from birth to over 105 years. Distinct tables are provided for males and females from each of the 10 Canadian provinces. Due to its smaller population, Prince Edward Island is the only exception with a resolution of 5 years. It is excluded from our simulations for that reason. We suppose that n males and n females from each province (except PEI) are followed and that their time of death is observed or censored. Censorship was determined the same way as before, following Lemma 4.9 with r = 4, which yields a censoring rate of 20%. We perform three simulations. 1. Males: We estimate the survival function of a male in New Brunswick when data sets of males across the country are available (total 9 populations). 2. Females: We estimate the survival function of a female in New Brunswick when data - sets of females across the country are available (total 9 populations). 3. Males and Females: We estimate the survival function of a female in New Brunswick 94 Chapter 4. MAMSE Weights for Right-Censored Data when data sets of males and females across the country are available (total 18 popu lations). Figure 4.5 depicts the survival functions for the different provinces and genders. The survival curves for New Brunswick are repeated on each panel for comparison purposes since they are the target distributions; they appear as gray dotted lines. In the following, we express the distributions in term of survival functions. These functions are obtained by simple arithmetic as S,(t) = 1 — Fm(t) and S,(t) = 1 — F,(t) where • stands for any common index defined previously. Throughout this section, we choose the upper bound U = 90 and we fix T — 85. Denote by Si(tj the target distribution and let Sx(t) and S\(t) be the KME and the WKME respectively. The measure of performance considered will use the area between our estimate and the true survival function: Ax= [T \Sx(t)-Si(t)\dt and Al = F \Sx{t) - Sx{t)\dt. . Jo Jo Table 4.6 shows 100Ax/A\, the ratio of these areas. Values above 100 mean that the WKME performs better than the KME. Table 4.6 also shows the average weight allocated to each population under the three scenarios considered. Simulations are performed for n 6 {10,25,100} and each scenario is based on 10000 repetitions. For Simulation 3, we can distinguish between the weights allocated to men and women by their font: the weights for males appear in italics. Under all scenarios, the WKME yields a more precise estimate than the usual Kaplan-Meier estimate. The improvement is small for the males, but substantial for estimating the survival of females from New Brunswick, with a gain exceeding 50% in some cases. This might mean that the survival of the women are more similar across provinces than that of men, although it is not easily seen from Figure 4.5. 95 Age at Death (Years) Age at Death (Years) Age at Death (Years) Figure 4.5: Survival functions of Canadians as taken from the life tables of Statistics Canada (2006). Survival functions are available for males and females of each province. The curves for New Brunswick are repeated as gray dotted lines on each . panel to facilitate comparisons since they are the populations of interest. CO 1000 x n 100 4^ XNB XNF XNS XQC XoN XMB XsK XAB XBC 10 101 349 82 83 80 77 85 -84 81 79 Males 25 101 346 91 84 83 76 86 .82 78 75 100 101 350 109 87 82 71 87 79 70 65 10 146 438 65 66 68 67 72 78 73 73 Females 25 153 376 82 79 78 76 80 77 77 75 100 151 359 98 87 77 78 79 75 74 74 10 146 365 22 47 22 51 22 52 22 53 22 55 22 60 22 58 22 59 22 Males & 25 148 284 ' 49 54 61 59 57 64 64 69 Females 25 25 25 25 25 25 25 25 25 100 144 271 48 59 73 76 64 79 81 94 15 15 15 15 15 15 15 15 15 Table 4.6: Relative performance of the weighted Kaplan-Meier estimate compared to the Kaplan-Meier estimate as measured by lOOAi/'A\. Average MAMSE weights are also shown, but they are multiplied by a factor of 1000 for easier reading. In the simulation with males and females, the average weights in italics refer to the male populations. Note that U = 90, T — 85 and that all figures are based on 10000 repetitions. CO Chapter 4. MAMSE Weights for Right-Censored Data Note the difference in performance between the Females and the Males & Females sim ulations: using more populations did not seem to improve the performance and may even have worsened it. Intuitively, calculating the MAMSE weights has a cost in effective sample size. When the populations' distributions are close to each other, this cost is recovered and the quality of inference is improved. Otherwise, the performances may degrade. Figure 4.5 shows how the survival functions of men and women differ. The additional information contained in the males survival seems insufficient to recover the cost of using them. Their dissimilarity is also visible through the smaller weights allocated to the male populations on average when compared to the females in that simulation. It is interesting to note that the weight allocated to female populations of interest seems to decrease as the sample size increases. In the Males & Females scenario, notice also the small magnitude of the weights allocated to men. As n increases, the dissimilarity between the survival distributions becomes more certain. Figure 4.6 depicts typical estimates obtained in the simulations. The WKME is typically more precise than the usual KME, however, some exceptions arise where the area between the KME and the true distribution is smaller than A\. Notice once again that using the weighted Kaplan-Meier estimate produces a smoother curve than the KME since the function has more jumps. The MAMSE weights allow us to build a weighted Kaplan-Meier estimate that exploits information from similar populations. For the problem of estimating the survival of Canadi ans from New Brunswick, the weighted Kaplan-Meier estimate features better performance than the usual Kaplan-Meier estimate in all scenarios considered. The WKME may be an alternative to consider when data from similar populations are easily available, or when one is interested in a subpopulation of a larger group as is the case for this simulation. 98 Chapter 4. MAMSE Weights for Right-Censored Data Females Males and Females Area weighted =0.88 Area normal =2.92 Area weighted = 1.63 Area normal =2.35 Area weighted =0.71 Area normal =2.34 Area weighted = 1.05 Area normal = 1.66 Area weighted =0.78 Area normal = 1.40 Area weighted =0.52 Area normal =0.85 40 60 Age at death (Years) 80 40 60 Age at death (Years) 80 Figure 4.6: Typical examples of estimates under different scenarios. Note the increased num ber of steps of the weighted Kaplan-Meier estimate (dashed line) which makes it smoother than the usual Kaplan-Meier estimate (plain black line). The true distribution appears as a smooth gray curve. 99 Chapter 5 MAMSE Weights for Copulas Copulas are distribution functions with uniform margins, but most importantly, they are a tool for expressing the dependence structure of multivariate distributions. In this chapter, the MAMSE weights are extended to multivariate data using copulas. The empirical copula, a nonparametric estimate of the copula based on the ranks of the data, is used for that purpose. After reviewing the theory related to copulas, we define the MAMSE weights using the empirical copula. The nonparametric weights obtained are then used to define a mix ture of empirical copulas which yields weighted coefficients of correlation. The weighted pseudo-likelihood, an extension of the pseudo-likelihood, is proposed and shown to pro duce consistent estimates when used in conjunction with the MAMSE weights. Simulations evaluate the performances of the maximum weighted pseudo-likelihood estimate and of the weighted coefficients of correlation. 5.1 Review of Copulas Let X be a p-dimensional vector with continuous distribution i?(x) and continuous marginal distributions Gi(xi),..., Gp{xp). The dependence between the elements of X is best de scribed by its underlying copula, a cumulative distribution function with uniform marginals. Sklar (1959) showed that all multivariate distributions admit the representation H {[xu ..., xp]T) = C {Gx(xi),..., Gp(xp)} 100 Chapter 5. MAMSE Weights for Copulas where C is a CDF with uniform margins called a copula. Any continuous distribution H is associated with a unique copula C. If hi,..., hp are one-to-one increasing functions, then the unique copula associated with the distribution of Y = [hi (X\),..., hp(Xp)]J is the same as the copula underlying the distribution of X = [X\,..., Xp]T. Coefficients of Correlation The expression "coefficient of correlation" typically refers to an empirical estimate of the dependence between two variables. These statistics however estimate population values that are also called "coefficients of correlation". For instance, the Pearson correlation of a bivariate distribution H(x, y) with marginal distributions F(x) and G(y) is given by where px and ax are the mean and the variance of F(x) and similarly for G(y). Pearson's correlation is a parameter of the multivariate normal model and it is an efficient measure of dependence under that model. However, Pearson's correlation is not a good measure of dependence in a general setting. In particular, the range of values for Pearson's coefficient is limited by the distribution of the margins F and G. How can one interpret a correlation of 0.80, say, when it could be the maximal value for r given the marginal distributions at hand? Moreover, a monotone transformation of the variables will typically affect r. Better measures of correlation are invariant to monotone transformations of the data. Spearman's p and Kendall's r are well-known such coefficients. Let C(u, v) denote the unique copula associated with H(x, y). Then Table 5.1 gives the population value of different coefficients of correlation, including p and r. These coefficients depend only on the copula C, hence they are invariant to a monotone increasing transformation of the margins. The 101 Chapter 5. MAMSE Weights for Copulas empirical estimates of the coefficients of correlation in Table 5.1 are based on ranks; they are presented in Section 5.2 Usual Name Population Value Spearman Kendall Gini Blomqvist Blest Symmetrized Blest Table 5.1: Population value of different coefficients of correlation. More details on Spearman's p, Kendall's r, Gini's 7 and Blomqvist's 6 can be found in Nelsen (1999). Blest's coefficients were first introduced by Blest (2000), then further developed by Genest & Plante (2003). Pinto da Costa & Soares (2005) studied the same coefficients of correlation and rediscovered independently some of the results published by Genest & Plante (2003). Families of Copulas A number of families of copulas have been studied in the literature. The books of Joe (1997), Nelsen (1999) and Cherubini et al. (2004) present the most common ones, and discuss meth ods for constructing copulas. Note in particular that building a copula in more than 2 di mensions is not always an easy endeavor as not all lower dimensional copulas are compatible p = 12 J j uvdC{u,v) - 3 T^iJJ C(u,v)dC(u,v) - 1 7 = / j I it + v — 1| — \u — v\dC(u, v) * = *C|'i,i'|-l v = 2 - 12 j J (1 - ufv dC(u, v) i = -2 + 12 J J m;(4 -u-v) dC(u, v) 102 Chapter 5. MAMSE Weights for Copulas with each other. Let us introduce the families of copulas that will be used for simulations in Section 5.9. Normal Copula The copula underlying the Normal distribution does not have a closed form but can be expressed as follow. Let i/i;(x) be the CDF of a multivariate Normal with mean 0 and covariance matrix E. In addition, suppose that the diagonal of E contains only ones, hence E is a correlation matrix and the margins of H% are J\f (0,1). If $(x) = P(Z < x) for Z ~ JV(0,1), then CE(u) = HX{$-1(U1),...,<P-\UP)} is the copula underlying H%. In p dimensions, the Normal copula depends on p(p — l)/2 parameters which make E positive definite. In 2 dimensions, it depends on one parameter, r € (—1,1). The limiting cases where r = ±1 corresponds to perfect linear concordance or discordance. Independence is given by r = 0. Under the bivariate Normal model, the population values of p and r are p = — arcsin (^-] and r = — arcsin(r). Simulating the Normal copula can be done by generating a multivariate Normal, then applying the inverse transformation $_1 to its marginal values. Farlie-Gumbel-Morgenstern Copula The Farlie-Gumbel-Morgenstern (FGM) copula (see Nelsen (1999) page 68) is parametrized by 9 € [—1,1] and written CQ(U, V) = uv + 9uv(l — u)(l — v). 103 Chapter 5. MAMSE Weights for Copulas Its simple closed form expression is convenient, but it does not span a great range of depen dence. In particular, p = 9/3 under that model, hence the absolute value of Spearman's p is at most 1/3. With the simplicity of its closed-form expression, the FGM copula is often used for illustrative purposes in the literature. However, it is not a very flexible model since it features a limited range of dependence. Hence, it is rarely used as an inferential model in practice. Simulating a datum (U, V) from a FGM copula can be done by generating a first uniform variate (V), then transforming a second uniform variate (W) by inverting the CDF of the conditional distribution of U given V. This yields a quadratic equation whose root of interest is -{1 + q(l - 2V)} + y/{l + Q(1 - 2V)Y - 4a(l - 2V)W : •• -2a(l -2V) Genest k, MacKay (1986) present a recipe to build numerous families of copulas that are called Archimedean copulas. Many important families of copulas are Archimedean. Chap ter 4 of Nelsen (1999) is devoted to the construction of such copulas and to the study of their properties. The next three families of copulas are particular cases of Archimedean copulas. An algorithm to produce pseudo-random variates from these distributions is given by Genest & MacKay (1986). Clayton Copula The Clayton copula was introduced by Clayton (1978) and is often used in survival anal ysis (see Oakes (1986) for instance). The bivariate Clayton family is parametrized by 9 6 [—1,0) U (0, oo), the limiting case where 9 —> 0 corresponding to independence. The 104 Chapter 5. MAMSE Weights for Copulas expression for the copula is C$(u, v) = max } Frank Copula This family was first discussed by Frank (1979) and is indexed by one real parameter, 9 G (—oo,0) U (0, oo). The limiting case where 9 —> 0 corresponds to independence. Its The Frank copula is the only radially symmetric Archimedean copula. Its shape is akin to the Normal copula. Gumbel-Hougaard The Gumbel-Hougaard family is one of three copulas that can be a limit distribution for extremes and is hence used to model data of that.type. Indexed by 9 G [1, oo) it is written and the choice 9 = 1 corresponds to independence. The families of copulas presented in this section are absolutely continuous and hence admit a density function, as long as we omit limiting cases of perfect concordance or discordance where the distributions collapse to a line. The existence of a density function is less clear for the Clayton, Frank and Gumbel-Hougaard copulas. Genest & MacKay (1986) show that all bivariate Archimedean copulas can be factorized as a mixture of two components, one absolutely continuous, the second singular on a line that they describe explicitly. From their result, it is straightforward to CDF is 105 Chapter 5. MAMSE Weights for Copulas verify that the singular part of the three Archimedean copulas presented in this section has mass 0, hence that they are absolutely continuous. 5.2 Notation and Empirical Estimation Suppose that p-dimensional data are available from m different populations believed to have similar dependence structures (i.e. similar copulas). For any fixed k, we observe njfc data points from Population i. Explicitly, are observed where Xjj = [Xij\,... ,Xijp]T is a vector in p dimensions. By Sklar's (1959) Theorem, there exists a unique copula underlying the distribution FJ; we denote it by Cj(u), u = [u\,... ,up]T being a vector in [0, l]p. That unique copula is a cumulative distribution function defined on the unit cube such that • . list. Assume that the {Fj} are continuous, and hence ties cannot occur with probability 1. The empirical copula, defined on the ranks of a sample, is for u = [u\,... ,up]T. The indicator variable !(•) is equal to one if all the elements of its argument are true and equal to 0 otherwise. The empirical copula puts a weight of 1/njfc Fj(x) = Ci {Gii(xi),... .,Gip(xp)} where Gn,..., GiP are the marginal distributions of Fj. Let R^- = [R-iji, • , Rijp] be the ranks associated with the vectors Xy, j = 1,..., njfc. For fixed i and t, the list of values Xm,..., Xinike is sorted and Rj?-t is the rank of X^ in that 106 Chapter 5. MAMSE Weights for Copulas on the points of the grid 1 2 1} X ••• X corresponding to an observed combination of ranks. There is exactly one such point in every (p — l)-dimensional slice of the grid (rows and columns in 2 dimensions). Consequently, almost surely as k —> oo. Fermanian et al. (2004) show that y/nik{Cik(u) — Cj(u)} con verges weakly to a Brownian sheet with mean zero and a variance that depends on Cj and its partial first-order derivatives. Although they hold for an arbitrary number of di mensions, the results of Fermanian et al. (2004) are presented for bivariate copulas only. Tsukahara (2005) credits Fermanian et al. (2004) for the discovery and expresses the same results in p dimensions. Remark 5.1 LetUi(u) be a p-dimensional centered Gaussian random field with covariance function where A is the component-wise minimum. Such a random field is called a p-dimensional pinned Ci-Brownian sheet. • Theorem 5.1 (Tsukahara (2005)) Assume that C;(u) is differentiable with continuous partial derivatives dCi(u)/dug for 1= 1,... ,p and let [1, ue, 1]T represent a vector of ones, the univariate marginals of the empirical, copula C{k are uniformly distributed on the points {l/nik,2/nik,...,l}. Suppose the {nik} are strictly increasing with k. Deheuvels (1979) shows that Ci(uAv) -Ci(u)Ci(y) 107 Chapter 5. MAMSE Weights for Copulas except for the Ith element who is equal to the £th element of u. Then the random variable y/n^{Cik(u) - Ci(u)} converges weakly to the random field Wi(u)-^|^Ci(u)}wi([l,U<,l]T as k —> oo. Remark 5.2 Let u € [0, l]p and = [1, U£, 1]T for some £ e {1,... ,p}. For the Brownian sheet defined in the previous remark and 1 < I < £ < p, we have: vax{Wi(u)} = Ci{u) - Qiu)2 = Ci(u){l - C^u)} var{Ui{ve)} = Ci(vi){l-Ci{ve)} = ue(l-ue) cov {Wi(u),Wi (v<)} = Ci(u) - Ci(u)u< = a(u)(l - u<) cov{Wi(v;),tVj(v£)} = Ci([l,ui,l,ue,l}T) ~ um where [1, itj, 1, ug, 1]T is a vector of ones, except for the elements I and I that are equal to the Ith and Ith element of u respectively. To define the MAMSE weights in Section 5.3, we need an estimate of the asymptotic variance of the empirical copula. Remark 5.3 From Theorem 5.1, we have that n-k{Cik(u) - Ci(u)}] -> var jwi(u) "53^ Ci(u)Ui (v^)| W {Ki(n)} + 2 J2 ITT" CI(U)| {jT Q(U)I C0V {UI (V/) (V')} i<i<e<P ^ Ue ) V ui ) + £{^Ci(u)j var{^(v,)}-2^|^-CI(u)| cov {^(u),^ (v£)} var 108 Chapter 5. MAMSE Weights for Copulas The expressions for the variances and covariances from Remark 5.2 can be substituted in the expression above. Note that the only term that does not depend on derivatives o/Cj(u) is var{Wi(u)} = Ci(u){l-Ci(u)}. Coefficients of Correlation Based on Ranks The measures of dependence presented in Table 5.1 can be estimated from a sample by using ranks. Their classical estimates appear in Table 5.2. Note that we use a simplified notation where for a fixed population i and for a fixed k, we write n — no-, C{u\,U2) = Cifc(u) and (Rj,Sj) = R^-. That simplified notation is the most commonly seen in the literature. Usual Name Empirical Estimate Spearman Pn~ 3 n_1 + YJX=\ RiSt Kendall Tn = G)-1 Ei<i<j<n sign(Ri - Rj)sign{Si - Sj) Gini In ~ |n2/2J Si=l \Ri "b Si n 1| \Ri Si\ Blomqvist Pn = 4C(H)-l Blest K _ 2n+l 12 1 1 p\2c ~. n-1 n(n+iy2{n-l) ^i=lKn ^ 1 -"-zj^i Symmetrized Blest ln = 4n+5 , 6 y^n r, Q (A Ri+Si\ n-1 1 n(n+l)(n-l) 2~>i=l ritJl \L n+1 J Table 5.2: Empirical estimates of different coefficients of correlation. Rescaled Ranks The empirical copula allocates a non-null weight to points that are on the boundary of [0, l]p even though that set has probability 0 under a continuous copula. In some applications, we need to evaluate functions that are well defined on (0, l)p, but that may have asymptotes 109 Chapter 5. MAMSE Weights for Copulas close to the boundaries of the p-dimensional cube. To avoid such problems, one can use a rescaled empirical copula where the mass of R^/njfc is rather allocated to T)fc 1. Tjfc Y** = riZ_5 or Yj = -^-. J nik J nik +1 Hence, the rescaled empirical copula allocates an equal mass to some points of the grid 0.5 nik -0.51 f 0.5 - 0.5 x • • • x or X * * * X nifc + 1' " '' nik + 1 J " " [ nik + 1' " '' nik + 1 J ' where the function of interest is well-defined. Remark 5.4 Let C*k denote a rescaled empirical copula. The asymptotic behavior of C*k is typically identical to that of'Cik since sup'\C.;k(u)-Cik(u)\< — ->0 ue[o,i]p nik as k —-> oo. The Pseudo-Likelihood Methods have been proposed for fitting a family of copulas to the data. A review of such methods can be found in Cherubini et al. (2004). Let us consider the pseudo-likelihood originally proposed by Genest et al. (1995). Suppose that the family of copulas C(u\9), 8 E 0 admits a density function c(u|6^). To fit that family of distributions to the data from Population i, Genest et al. (1995) suggest 110 Chapter 5. MAMSE Weights for Copulas maximizing the pseudo-likelihood, i=i where are the reseated ranks described above. The resulting maximum pseudo-likelihood estimate is a consistent estimate of the true parameter t90 for which C(u|#0) = Cj(u). Alternatively, the log-pseudo-likelihood can be written *(0) = f>gc(Y£.|0)= /togC(u|0)dC4(u) where C*k(u) denotes the empirical copula defined with the rescaled ranks Y^. 5.3 Definition of the MAMSE Weights for Copulas Univariate MAMSE weights are designed to yield a mixture of distributions close to the target distribution F\, but less variable than its empirical distribution function. In the context of copulas, an extended version of the MAMSE weights can aim at choosing a mixture of the empirical copulas, (5.1) i=l with Aj > 0 and 1TA = 1, that is close to Ci, but less variable than C\k- Let us define «(A) = / |Cife(u) - CXk(u)\2 + J2 A2var{Qfc(u)} t=i dMfc(u) (5.2) where Mk is a discrete probability measure allocating a weight of l/nplk to every point of the grid ffl> = LL-£-\n\k nik nik n^ J 111 Chapter 5. MAMSE Weights for Copulas Remark 5.3 provides an approximation to the variance of yjriik Qfc(u). However, the expression depends on the derivatives of Ci(u), the true unknown and unknowable copula. To estimate these derivatives, we could use the data and either assume a parametric model for each population, or use smoothing techniques. Either of these choices involves many degrees of freedom in the selection of families of distributions or basis functions for instance. Our goal is to show that the concept of the MAMSE weights has the potential to improve the quality of the inference while at the same time keeping their computation feasible. Thus at this stage we choose on the basis of Remark 5.3, the simple compromise approximation, <™{Ctk{n)} « var{Cifc(u)} = — Cifc(u){l - Cik(u)}, (5.3) n>ik which corresponds to the only term in Remark 5.3 that does not involve a derivative. We recognize that incorporating other terms in that approximation may improve ^the perfor mance of the weights but leave'that and the investigation of the computational feasibility of doing so to future work. As we will show, the choice made above does demonstrate the potential value of the MAMSE weights in this context. Note that the theoretical results in the following sections hold with a different penalty term as long as the property expressed in Lemma 5.1 is preserved, i.e. as long as y*var{Clfc(u)}dMfc(u)-0 as k —> oo. The property stated in Theorem 5.2 should also hold to insure the proper convergence of the algorithm calculating the MAMSE weights. The value of A minimizing the objective function Pfc(A) defined in (5.2) with the sub stitution (5.3) are called the MAMSE weights and denoted Xk(oj) = [\ik(ui),..., Amfc(cj)]T. The weighted empirical copula obtained using MAMSE weights will be written m Ck(u) = J^\ik(u)Cik(u). (5.4) i=l 112 Chapter 5. MAMSE Weights for Copulas 5.4 Computing the MAMSE Weights for Copulas The algorithm proposed in Section 2.4 applies to the MAMSE weights for copulas. All copulas are constrained to be on the. [0, l]p cube and have uniform marginals. For that reason, there is no need for preprocessing in this case as the domain of all distributions overlap at all times. To prove the convergence of the algorithm, it is sufficient to show that Assumption 2.1 is satisfied when the MAMSE weights are defined as in Equation (5.2). Theorem 5.2 Assumption 2.1 is satisfied for the definition of MAMSE weights suggested in Section 5.3, that is J vab{Cik(u)} dMk(u) > 0 for i e {1,... ,m}. Proof of Theorem 5.2. Consider the term :{Cik(u)}dMk(u) = [ — Cik(u){l-Cik(u)}dMk(u) J nik van for some i = 1,... ,m. Note that the measure Mk(u) gives a weight l/n^. to each point of the grid Qk. The empirical copula Ci(u) is greater than 0 and less than 1 for all points of the grid Qk, except for the point 1 and possibly for a few "slices" (rows and columns in 2 dimensions) of points close to the axes when n{k < n\k. In all cases, 0 < Q(u) < 1 for many u € Qk, meaning that Cjfc(u){l — Qfc(u)} > 0 for those u. Consequently, — / Cifc(u){l - Qfc(u)}dMfc(u) > V Cifc(u){l - Cik(u)} > 0, nikJ nifcnifcutefe. the desired result. • 113 Chapter 5. MAMSE Weights for Copulas 5.5 Uniform Convergence of the MAMSE-Weighted Empirical Copula We now prove a sequence of lemmas that will build up to showing that the MAMSE-weighted empirical copula converges to the copula underlying Population 1. For that purpose, we suppose that the {n^} are increasing with k and that —•> oo as k —> oo. Lemma 5.1 /{< 'Clk{u) - Ck(u)} dMfc(u) -0 almost surely as k —> oo where Cfc(u) is defined as in (5.4)-Proof of Lemma 5.1. The choice of A = [1,0,..., 0]T yields a larger value of -Pfc(A) than the MAMSE weights because of the optimization involved in the definition of the latter. Hence, for any LO € fl, J {Clfc(u)-Cfc(u)}2 dMfe(u) <Pk{\k{Lo)} <Pfc([l,0,...,0]T) = — /-Cifc(u){l - Clfc(u)}dMfc(u) < J- - 0 n\k J 4nife as k —> oo. Lemma 5.2 Let u, v € [0, l]p be such that V£ < ug for £ — 1,... ,p. Then 0<Clk(u)-Clk(,)<±^-Ve)] nik where \x\ denotes the smallest integer greater or equal to x. Proof of Lemma 5.2. Consider the vectors u, v G [0, l]p with vi < for £ = 1,..., p. Then, 0 < Cifc(u) - Cifc(v) nik ^-f 1 € {[0,ui] x ••• x [0,up]} nik - 1 ^ie{[0,Vl}x nik x [0,«p]} 114 Chapter 5. MAMSE Weights for Copulas ^ nik Pi ~ nik ^ ^ I TDK pre nik 1 nik P / pre = -EE* — ^ 1 Wjfc E 1 e=i I lK .?=! pfc *=i nik for any such vectors. Let Qk be an extended grid that includes the axes: Qt .= jo,—i) 1 2 x ••• x <J0, , ,...,1 n\k nik Lemma 5.3 sup ue[o,i]p Cifc(u)-Cfc(u) P < — + sup nik ueg*k Cifc(u)-Cfc(u) /or aZZ fc and u e fl. Proof of Lemma 5.3. For a fixed fc, |Cifc(u) - CK(u)| is a bounded function on the compact set [0, l]p and hence its maximum is attained. Let v 6 [0, l]p be a point, where that maxi mum is achieved. We treat two cases. Casel: Cu(v) >4(v). ' Let v* = [v*,..., v*]T be defined by v| = |_nifc^J/nifc> where [x\ denotes the largest integer smaller or equal to x. Then v* £ Qk is on the same "plateau" of the multivariate 115 Chapter 5. MAMSE Weights for Copulas step function C\k(u) as v, meaning that cwv*) = elfc(v). Moreover, since C^(u) is a nondecreasing function and v* < v, we have 4(v*) < Cfe(v). Recalling that v is the point where the difference between Cjt(u) and Cifc(u) is maximized, we can write |Cife(v) - 4(v)| = Clk(v) - Cfc(v) < Clk(v*) - Cfc(v*) < sup < — + sup nik ueg*k Cife(u) -Cfe(u) Cifc(u)-Cfc(u) meaning that the maximum occurs at a point of the grid Q\. • Case 2: 'C1t(y) < C,fv). Let v* = [•«*,..., v*]T be defined by v\ — \nikV(]/nik, where \x] denotes the smallest integer greater or equal to x. Then, v* e Qk and Cfc(v*)>Cfc(v) . since Cfc(u) is a nondecreasing function and v* > v. By Lemma 5.2, cu(vn-c^)<±lnikiv*-ve)] <^ j~i nlk nik e=i 116 Chapter 5. MAMSE Weights for Copulas Recalling that v maximizes the difference between Cifc(u) and Cfc(u), we can write |Cifc(v)-Cfc(v)| = Ck(v)-Clk(v) < - Cfe(v*.) - Clfc(v*) + P P < h sup Combining Cases 1 and 2 yields the desired result. Lemma 5.4 Lei u = [iti,..., up]T G [0, l]p, then 0 < C'jfe(u) < min it^. <?=I,...,P Proof o/ Lemma 5.4- For each ^ G {1,... ,p}, we have Cik(u)-Ck{u) nik P / ijC 1 P: •ijt nik J nik y nik We get the desired result by noting that these p inequalities have to be satisfied simultane ously. H Lemma 5.5 We have almost surely as k —> oo. sup Cifc(u)-Cfc(u) Proof of Lemma 5.5. Let e > 0. For any given fc G IN, let = [it^i,..., Ukp]T be the point of the grid (/£ where |Cifc(u) — Cfc(u)| is maximized. Let Ak Ek Ck {w G fi : Cifc(ufc) - Cfc(ufc) > e} , |w G fi : Cfc(ufc) - Cifc(ufc) > ej , = |CJ G fi : Ufc G 1.1 117 Chapter 5. . MAMSE Weights for Copulas The negation of Lemma 5.5 is which will happen if and only if Ak U Bk i.o. (Ak UBkD Cf) U (Ak \JBkn Ck) i.o.. We will show that neither of the two events in the decomposition above can occur infinitely often. Case 1: Ak[jBkC\ Cg. We have < |Ci/c(ufc)| + Cfc(ujfc) m = Cifc(ufc) + 53 \k{u)Cik(\\k) min ' <e{i,...lP} i=i < 2 min uk£ < e by Lemma 5.4 since the MAMSE weights sum to 1. Consequently, Ak\jBkr\C^ = 0 for all k. Case 2: Afe U 5fe n Ck. Let v be a vector of integers such that \/n\k = uk; we temporarily omit the index k for notational simplicity. Let also w = [w\,..., wp]T be a point from the set W= 0,1,..., nxk 2p x • • • x 4 0,1 [2p The points (v — w)/nik belong to Qk since uk G [e/2, l]p. Next, we show that nik v-w\ - /v-w wife 2 nlfc 118 Chapter 5. MAMSE Weights for Copulas by treating two subcases. Note that the last inequality holds because 1 V- P nik [2p - 2 Subcase A: Ak!~)Ck-From the fact that the copulas are monotone functions, we get: V — W \ / V Ck ( < Ck[ ) = Cfc(ufc). nik J \nik. Then from Lemma 5.2, nik nik J \nikj nik = Cik(uk) - nik Combining the two inequalities yields C Ik v — w nik — Ck v — w nik > Cik(uk) - Cfc(ufc) nik > e > -~ 2 nik > 0. ™ifc Subcase B: E>k H CK. By Lemma 5.2, we have — Ck v — w nik = y\ Aifc(w) \ Cik [—) - Cik i=l n%k e=l nik v — w nik v S Hik S njkWj nik 119 Chapter 5. MAMSE Weights for Copulas < (nikW£ i i) = W^ | P Hence, V "ife / \nikj n^ ~inik nifc ~i Uik Since the empirical copula is a monotone function, we also have /v — w\ / v \ Cifc < Cife — = Cife(ujfc). V nik ) \nikj Let us consider only k that are large enough to make YliLiP/nik < e/2. From the two previous inequalities, we obtain ck (^) - Cik (l^L) > Ck(nk) - Cik(uk) - £.JL \ nik J \ nik J ~ nik m "lfc f—f njfc 2 nifc Combining subcases A and B yields PkW > f (4(u) - Cu(u)}2dM,(u) > i V f ^ - ^ The sum above corresponds to the Riemann sum of the multiple integral p \ 2 f^'M^S'*) ivi" dVp = Kp-The number is a fixed positive constant for any fixed p. 120 Chapter 5. MAMSE Weights for Copulas As a consequence, there exists a ko such that for all k > ko, Pk(X) > Kv/2 > 0, a contradiction with Lemma 5.1. We must thus conclude that Ak U Bk fl Ck occurs at most a finite number of times. Since the two cases above do not occur infinitely often, we conclude that Ak U Bk occurs at most a finite number of times. Hence, sup Cifc(u)-Cfc(u) 0 almost surely as k —• oo. Lemma 5.6 almost surely as k —> oo. sup U6[0,1]P Clk(u)-Ck(u) Proof of Lemma 5.6. The result follows from Lemma 5.3 and Lemma 5.5. Theorem 5.3 We have uniform convergence of the MAMSE weighted empirical copula: sup ue[o,i]p Cfc(u) - Cx(u) almost surely as k —> oo. Proof of Theorem 5.3. Consider the decomposition 4(u)-Ci(u) < Cfc(u) - Cufefu) + Cifc(u)-Ci(u that holds for all u e [0, Then sup U6[0,l]p 4(u)-d(u). < sup {|cfc(u)-Cifc(u) + Cifc(u)-Ci(u)|} uG[0,l]p U U 121 Chapter 5. MAMSE Weights for Copulas < sup Cfc(u) - Cife(u) + sup Cifc(u) - Ci(u) U£[0,1]P ..Wn-,1„ ue[o,i]p 0 almost surely as fc — oo. Indeed, the first term goes to 0 almost surely by Lemma 5.6 and the expression sup ue[o,i]p |Cifc.(u)-Ci(u) converges almost surely to 0 as k —> oo, see e.g. Expression 3.1 of Deheuvels (1979). Corollary 5.1 Let TJk be a sequence of random vectors with distribution Ck(u) and TJ a .random vector with distribution Ci(u). Then XJk converges to U in distribution. Proof of Corollary 5.1. For almost every a; € fl, the sequence of distributions Ck(u) con verges to Ci(u) for all u, hence the definition of weak convergence is respected with prob ability 1. • Corollary 5.2 If g(u) is a bounded function, then jg(u)dCk(u) = E{g(XJk)} - E{g(V)} = J5(u)dd(u) • almost surely as k —> oo. Proof of Corollary 5.2. The result is a well known consequence of the weak convergence proved in Corollary 5.1, see page 164 of Durrett (2005) for more details. • 5.6 Weighted Coefficients of Correlation Based on Ranks Coefficients of correlation based on ranks typically estimate a measure of dependence that depends on the copula of the underlying distribution. Examples of such coefficient are provided in Tables 5.1 and 5.2. Suppose that bivariate data from m populations are available. We suggest the use of the MAMSE weights to create a weighted coefficient of correlation and show in this section that it converges almost surely to the true correlation of Population 1. 122 Chapter 5. MAMSE Weights for Copulas Consider first Blomqvist's (3 and let /3;re be the Blomqvist coefficient based on the ranks of the sample XJI, ... Xjn.fc. The weighted Blomqvist's (3 is given by m Theorem 5.4 (3\k Pi almost surely as k —> oo. Proof of Theorem 5.4 As a direct consequence of Corollary 5.3, P\k = 4^ \ik(v)Cik almost surely as k —> oo • The remainder of this section will cover a general case that includes the other coefficients in Table 5.1 and 5.2, except for Kendall's r. The population version of Kendall's r depends on J C(u,v)dC(u,v). A weighted version of Kendall's r is achievable, but replacing both C(u,v) in this theoretical expression with mixture distributions will not yield a linear com bination like the other coefficients considered. The weighted Kendall's r hence requires a special treatment that is left to future work. Replacing the copulas by their empirical counterparts in Table 5.1 yields estimates of the correlation based on ranks. However, the expressions obtained typically differ from the usual estimates based on ranks and do not necessarily span the interval [—1,1], but rather an interval whose bounds tend to ±1 as the sample size goes to infinity. The classic formulas for the coefficients of correlation based on the ranks of the sample 123 Chapter 5. MAMSE Weights for Copulas Xji,..., Xjn.fc will thus admit a representation of the form Kik = akJ J g(u)dCik(u) + bk (5.5) that estimates &ik = a J J g(u) dCi(u) + b. The coefficients an —> a and bn b as n —> oo are chosen to ensure that kik € [—1,1] for all sample sizes n^. Moreover, the values ±1 occur only for perfect concordance or discordance, i.e. when the ranks are identical {Rkji = Rkj2) or antithetic {Rk3i = nik + 1 — Rkj2)-The function g(u) is typically bounded and its functional form defines the coefficient of correlation at hand (see Chapter 2 of Plante (2002) for detailed explanations). Consider a particular case: for Population i and a fixed k, the empirical estimate of Spearman's coefficient, can be written _ 3 nik + 1 ^ 12nife Rkji Rjj2 %k nik-l (nik + l){nik - 1) nik nik = -3^+|+ , u f ulU2dClk(u) no, - 1 [nik + l)("ifc - I) J and clearly has the same asymptotic behavior as —3 + 12 J u\u2 dCjfc(u). Let m ^Afc = E ^ik{u)kik i=l be a MAMSE-weighted coefficient of correlation based on ranks. Remark 5.5 The MAMSE weighted coefficients of correlations are invariant under mono tone transformations of the data. Indeed, both the MAMSE weights and the individual correlations are based on the ranks of the data which are invariant to such transformations. Next, we prove that the MAMSE-weighted coefficients of correlations are strongly con sistent estimates of the correlation in Population 1. 124 Chapter 5. MAMSE Weights for Copulas Lemma 5:7 = J5(u)dCifc(u) Jff(u)dQ(u) almost surely as k —> oo for any bounded continuous function g. Proof of Lemma 5.7. By the uniform convergence of dk proved by Deheuvels (1979), the sequence of distributions Cjfc(u) converges to Ci(u) for all u and almost every to € fl. Consequently, Cik converges weakly to Cj. Since the function g is bounded, the almost sure convergence of the expectation follows (see e.g. Durrett (2005), page 164). • Lemma 5.8 The coefficient kik converges almost surely to Ki = a J g(u) dCi(u) + b, its associated population value. Proof of Lemma 5.8. Note that I kik Ki I — kik ~ | a J g{u) dCik(u) + fjj + | a J g(u) dCik(u) + 6 j - m (afe - a) Jg(u) dCik(u) + (bk - b) jg(u) dCifc(u) - a J.g(u) dQ(u) + 6-6 < • |ajfc - a\ .0 J g(u)dCik(u) +|6fc-6| Jg(u)dCik(u)- Jg(u)dCi(u) almost surely as k —> oo since J g(u) dQfc(u) is bounded and by Lemma 5.7, the other term involving integrals tends to 0. • 125 Chapter 5. MAMSE Weights for Copulas Theorem 5.5 The weighted coefficient of correlation 1=1 almost surely as k —> oo. Proof of Theorem 5.5. The proof follows the same steps as that of Lemma 5.8 I^Afc-^ll < k\k - {a/p(u)dCfc(u) + &j + J<?(u)dCfc(u) + frj (afc - a) J g{u) dCk (u) + (6fc - 6) a y g(u) dCfc(u) - a Kl < \ak - a\ 0 y p(u)dCfc(u) + |6n-6| y «?(u)dcfc(u)- y 5(u)dd(u) almost surely as k —» oo since g is bounded and J"g(u)dCK(u) converges to Jg(u) dCi(u) by Corollary 5.2. • Corollary 5.3 The MAMSE weighted versions of Spearman's p, Gini's 7 or Blest's v and £ are strongly consistent estimates of their corresponding population value. Proof of Corollary 5.3. Direct consequence of Theorem 5.5. • 5.7 Weighted Strong Law of Large Numbers In Chapter 3, we proved a WSLLN for unbounded functions in order to prove the consistency of the weighted likelihood with MAMSE weights. In the next section, we propose the maximum weighted pseudo-likelihood estimate and prove its consistency. To do that, we first prove a weighted strong law of large number more general than Corollary 5.2 as it 126 Chapter 5. MAMSE Weights for Copulas holds for functions g continuous on (0, l)p with asymptotes at the boundary of the unit hyper-cube. Let C*k denote a rescaled empirical copula based on transformed ranks Y ^ as defined in Section 5.2. We let C*(u) = YA=I ^ik{u)C*k and propose to show that for a function g satisfying some regularity conditions, E~I> (Y&) = /9(n) dC*k(u) - /g(u) dC7l(u) i=i - lk j=i J J almost surely as k —> oo. Since we are now using rescaled copulas, let us first prove the following result. Lemma 5.9 sup |C£(u)-Ci(u)|->0 ue[o,i]p almost surely as k —> oo. Proof of Lemma 5.9. Remark 5.4 points out that sup \C*k(u) - Cifc(u)l < — - 0 u6[0,l]P nik as k —> oo. Therefore, sup |CJE(u)-Ci(u)| < sup {|Cfc*(u)-Cfc(u)| + |Cfc(u)-Ci(u)|} uG[0,l]P ue[0,l]P J < sup |^(u)-4(u)|+ sup |4(u)-d(u)| ue[o,i]p ue[o,i]p m < SUP \Gik(*) -Cik(a)\ + sup |Cfc(u)-Ci(u)| i=1 ' ue[o,i]p ue[o,i]p m < SUP |.C;fc(u)-Cifc(u)|+ sup |Cfe(u)-Ci(u)|-0 I=L U6[0,1]P U£[0,1]P as k —> oo by Remark 5.4 and Theorem 5.3. I 127 Chapter 5. MAMSE Weights for Copulas Proposition A-l(i) of Genest et al. (1995) provides a strong law of large numbers that we use to prove the weighted strong law of large numbers that we need. To simplify their notation, Genest et al. (1995) present their result with bivariate copulas, but we do not make this simplification here. Theorem 5.6 (Genest et al. (1995)) Let r(u) = u(l -u), 5 > 0, {qe,£ = 1,... ,p} be positive numbers satisfying Yle=i V% ~ 1 an<^ #(u) a continuous function from (0, l)p to IR such that pi = f g(u)dCj(u) exists. If M < oo is a positive real number such that p \g(u)\<Ml\r(ue)ae t=\ ' with ai = (—1 + 6)/qe, then Rn = J g(u) dC7;fc(u) — pi almost surely as k — oo. A bounded function always satisfies the assumptions of Theorem 5.6 because n?=i r(ue)ae has a lower bound. Lemma 5.7 is thus a particular case of Theorem 5.6. Lemma 5.10 Let A be a cube in p. dimensions whose opposite corners are u\ and u2 with ui < u2, i.e. A = (tin,«2l] x • • • x (mp,u2p}. Let Ci(u) and C2(u) be p-dimensional copulas such that sup |Ci(u)-C2(u)|<e. U€[0,1]P Then, op \dd{A)- dC2{A)\ < —. 128 Chapter 5. MAMSE Weights for Copulas Proof of Lemma 5.10. The coordinates of the corners of the cube can be re-expressed by replacing some elements of Ui by the corresponding elements of 112. Let 5 C {l,...,p} be a set of indices and v$ any vector such that Vi = Uu if i € S u-2i if i G Sc To calculate the probability of the cube, we can use the following development which is akin to the formula for the intersection of multiple sets: dC1(A) = C1(v0)-X>(v{i})+ £ d (v{ij}) - • • • ± Ci (v{ w}) . i=l l<i<J<P Consider now the difference \dCi(A)- dC2{A)\ = Ci (v0) - C2 (v0) - ]T {Ci (v{l}) - C2 (v{i})-} i=i + E {Ci(vm)-C2(v{l,j})} l<i<j<p -•••±{CI(V{1I...)P})-C2(V{1I...IP})} p < |d (v0) - C2 (v0)| + }2\Ci (v{l}) - C2 (v{i}) j i=i + E lCl(Vfe})-C2(v{lj})| l<i<j<p + --- + |C1(v{w})-C2(v{1)...,p})| < 2pe. . Since the sum above has terms each bounded by e. i=l 129 Chapter 5. MAMSE Weights for Copulas Corollary 5.4 Let B be a closed cube, B = [un,u2i]x •••x [uip,U2p\. Let Ci(u) and C2(u) be p-dimensional copulas such that sup |Ci(u) - C2(u)| < e. ue[o,i]p Then, cyp |dCi(5)- dC2(B)\ < —. Proof of Corollary 5.4- Apply Lemma 5.10 to the sequence of half-open cubes As = (itii - 5, U2\] x • • • x (uip - 5, u2p] with 5 —> 0. A similar proof holds for a mix of closed and open intervals. • Theorem 5.7 Let g(u) be any continuous function on (0, l)p that satisfies the assumptions of Theorem 5.6. Suppose that the sample sizes —» oo for all populations. Then, 0 <7(u)dCjJ(u)- J <7(u)dCi(u) -almost surely as k —• oo. Proof of Theorem 5.7. For t G IN, let Bt = [2~Vl - 2~*]p and , g(u) iiuEBt rt(u) = _ 0 otherwise Since g(u) is continuous and Bt is a compact set, the image of rt is bounded. Suppose that Tj(u) € [Lt, J7t]. By the Heine-Cantor Theorem, rt is uniformly continuous on Bt, 130 Chapter 5. MAMSE Weights for Copulas i.e. VeT)t > 0, 3<5r,t > 0 such that Vu, v G Bt, |u - v| < <5T,t ==> \Tt(u) - rt(v)| < eT)t. Let eTj = 2-t and choose 0 < 8T:t < 2_* accordingly. Let At be a partition of the interval [2-*, 1-2-*], A = {[2-*, (2)2"*]} U {(S2-4, (s + 1)2"*], s = 2,..., 2* - 2} . Then, the elements of At x ••• x At = {Au,...,AStt} form a partition of Bt of cardinality St = (2* — 2)p. Define Mu) = ^^*lAst(u) s=l where 1 if u e Ast 6st= inf s(u) and l^fu) = yeAst 1 0 otherwise Then, by construction supue[01]P |rt(u) — ht(x)\ < 2 * and 5(u)dCfc*(u)- J 5(u)dd(u) (5.6) where T3 5(u)d^(u)-|rt(u)d^(u Tt(n)dC*k(u)- j ht(u)dC*k(u) J ht(u)dC*k(u)- J ^(u)dCi(u) 131 Chapter 5. MAMSE Weights for Copulas n = /^(u)dCi(u)- J rt(u) dCi(u) rt(u)dCi(u)- J g(u) dC:(u) We now prove that for any e > 0 and w in a subset of fi with probability 1, we can choose tw such that the five terms above are less than e/5 for all k > kw(tu)-First note that J /it(u)-rt(u)dCi(u) < y |^(u)-rt(u)|dC1(u)<2-t by construction. The same bound applies for T2 and does not depend on or CJ. By Lemma 5.9, supue[01]P |C£(u) — Ci(u')| converges almost surely to 0. Therefore, 3fin C fi with P(flo) = 1 such that for each u> G fin and any t, 3/c^j with sup |Cfc(u) - Ci(u)| < UG[O,I]P 5tmax(|f/t|, |Lt|)2T+P for all k > ku,t- For any such k and w, Lemma 5.10 implies that dC*k(Ast) - dCi(Art) < 2p 5t max(|C^|,|Lt|)2*+P for any s,t. Developing T3 yields T* = St st b«dC*k(Ast) - £bst dC^Ast) S=l 8=1 St < £>• st\ • s=l < Stmax{\Ut\,\Lt 1_ 2*' dC^A*) - dCi(Ast) 2P '5tmax(|f/t|,|Li|)2*+P Therefore, 3*i such that 2 * < e/5 for all t > t\, i.e. T2, T3 and T4 are each bounded by 132 Chapter 5. MAMSE Weights for Copulas e/5 for any t > t\ and k > kWtt-We can write s(u)lBc(u) dCi(u) < j |^(u)|lflc(u)dCi(u). The integrand on the right-hand side goes to 0 as t —> oo for each u 6 (0, l)p. Hence, the dominated convergence theorem ensures that the right-hand-side expression goes to 0 as t —> oo (the bounding function is |<?(u)|). Therefore, there exists t2 such that T5 < e/5 for all t>t2. Turning now to T\, Theorem 5.6 means that there exists fi^t C fi with P(fij]t) = 1 such that for all u £ fi^t, ^-E^ (Y&) = /9(n)dC*k(n) - /<?(u)dQ(u) as —> 00. Consider a fixed w € fii = P) fii;t. i€{l,...,m},telN The intersection is over a countable number of sets of probability 1, hence P(fii) = 1. For any to £ fii, T\ is developed as r, = 5(u)lBF(x)dCfc*(u) < / |0(u)|lBc(u)dCfc*(u) m \ ( \ nik ™ 1 nik i=i j=i i=i j=i The dominated convergence theorem says that 3t* such that |5(u)|lBc(u)da(u) < e/lOm 133 Chapter 5. MAMSE Weights for Copulas for all t > t*. Choose t > t% = maxi<j<m t*. Since uGfij, such'that for all k > fc,^^ 1 Uik r 1 j=i e e H < -—• 10m 5m Therefore, Vt > max(i3,t*), there exists k*t = maxi<i<m k^u such that Jg(u)dC*k(u)- Jrt(u)dCfc*(u) < for all k > k*t. In conclusion, for any ui e fi0 fl fix and any e > 0, we can choose tw = max(ti, t2, £3, t*) that yields inequalities showing that s(u)dCfe»- y ^(u)dCi(u) < e for all k > ku(tu) = max(ku>ttu,k*ttu). In other words, the left hand side of expression (5.6) converges to 0 for any u G fi0 D fii with P(fi0 n fii) = 1, i.e. it converges almost surely. • 5.8 Strong Consistency of the Weighted Pseudo-Likelihood Section 5.2 introduces the pseudo-likelihood as proposed by Genest et al. (1995). When the goal of inference is to study the dependence structure underlying the data, the heuristics of Section 2.1 can be used with distributions replaced by copulas. Therefore, the maximum pseudo-likelihood estimate can be seen as a particular case of the entropy maximization principle where the true copula is replaced by its empirical counterpart. Suppose that p-dimensional data are available from m populations. It is natural to ex tend the pseudo-likelihood following the heuristics of Section 2.2. If we consider the family of copulas c(u|0) and use a mixture of empirical copulas such as C£(u) as an approxima tion to the true underlying copula, the entropy maximization principle yields the weighted 134 Chapter 5. MAMSE Weights for Copulas pseudo-likelihood: ^)=nnc(Y«l*) •Wntfc (5.7) where Y^- are rescaled ranks as presented in Section 5.2. The maximum weighted pseudo-likelihood estimate (MWPLE) is a value of 9 maximizing L(9). To our knowledge, the weighted pseudo-likelihood has never been suggested in the litera ture. This section shows that when MAMSE weights are used in conjunction with (5.7), the MWPLE is a strongly consistent estimate of the true parameter 9Q. The proof is adapted from the work of Wald (1949). Families of Copulas The weighted pseudo-likelihood is based on the density function of a family of copulas. However, not all copulas have a density function. In fact, any copula can be factorized as the sum of an absolutely continuous part and a singular term (see Nelsen (1999) page 23). For instance, the bivariate Marshall-Olkin copula, C(u\a,0) = mm(u\~aU2,Uiu\~13), is parametrized by 0 < a,Q < 1 and gives a positive probability to the line ua = v@, its singular component. Such a family cannot be fitted with the pseudo-likelihood or its weighted counterpart as it does not admit a density function with respect to Lebesgue measure. As mentioned previously, all the families presented in Section 5.1 admit a density. Let {c(u\9) : 9 € 6} be the family of copulas that we wish to fit to the data at hand. A family of copulas that does not cover the complete range of dependence (i.e. which does not approach perfect positive or negative correlation) will typically have a compact parameter space. This is the case for the FGM copula. Perfect concordance or discordance between two (or more) variables corresponds to a limit case where the copula places all of its mass on the lines itj = Uj or u\ = —Vj. For such a limit case, the copula does not admit a density anymore and its parameter space will typically be infinite or open, and hence not compact. Other limiting cases may have to 135 Chapter 5. MAMSE Weights for Copulas be omitted because the definition of the copula presents an undefined form; independence under the Clayton model is such an example as it would involve dividing by 0. Despite this fact, we make the assumption that 6 is a compact set. In practice, this means that the proof of consistency that we provide will hold for (possibly constrained) families of copulas that do not include limiting cases such as perfect concordance or dis cordance between any 2 marginals. Relaxing Assumption 5.1 may be possible but is left to future work. Assumption 5.1 The set 0 is a compact subset of a finite-dimensional Cartesian space. Due to the constraints on their margins, copulas that admit a density will typically be smooth, hence little generality is lost with Assumption 5.2. • Assumption 5.2 The density function of C{\x\9), c(u\9), is jointly continuous for u € (0,1)P and 8 EG. Assumption 5.3 Suppose the true copula underlying the true unknown distribution is C{n\9o) = Ciin) for some 80 EQ. L In particular, this latter assumption means that the copula underlying the true distribution admits a continuous density function. Revised Wald's Assumptions For all 8 E 9 and p > 0, let us define c(u, 9,p)= sup c(u\9'). \8-8'\<p Assumption 5.4 For a sufficiently small p, the expected values Jlog[max{c(u,0,/0),l}] dCi(u) 136 Chapter 5. MAMSE Weights for Copulas are finite. Assumption 5.5 If Q\ ^ 9Q, then C{U\0Q) ^ C{u\6{) for at least one u. Assumption 5.6 j | log c(u|(90)| dCi(u) < oo for i = 1,... , m. Assumption 5.7 The functions c(u,9,p) are measurable for any 8 and p. Assumption 5.8 Suppose that the functions logc(u|t90) and logc(u,8,p) satisfy the as sumptions of Theorem 5.6 for any 8 and p. Wald (1949) makes additional assumptions, but they are not required here because the copula is defined on a bounded set and because of the stronger assumptions on continuity that are made above. Remark 5.6 If lim^oo 8i = 8, then lim;_oo c(u\9i) = c(u|f7) from the continuity of c(u\9). Remark 5.7 The function cf> of Wald also found in Section 3.7 is not required here because we assume that the set Q is compact. Remark 5.8 The functions c(x,9,p) are continuous from the continuity of c(u|f7) and Lemma 3.10. An Example That Satisfies the Assumptions Before proving the main result of this section, we show that the assumptions made above are not vacuous by verifying that at least one often-used family of distributions satisfies all the stated conditions. For this example, we use a simplified notation where u = [u, v]T. Consider the Clayton family of distributions whose density function 137 Chapter 5. MAMSE Weights for Copulas is indexed by 8 G 0 = [a, b) with 0 < a < b < oo. Choosing a compact 0 for this family means that the limiting cases of independence and of perfect concordance are omitted. Assumptions 5.1, 5.2 and 5.5 are satisfied. All distributions considered are defined on (0, l)2 and its associated Borelian, hence Assumption 5.7 is also satisfied. Assumption 5.3 is necessary in the context of proving consistency. For the model considered, we show that Assumptions 5.4 and 5.6 are consequences of the bound implied by Assumption 5.8 to which we now turn our attention. Remark 5.9 For u,v € (0,1), there exists a finite constant M > 0 such that . M M \l°Su\ ^ 7771 < «V4 - {r(u)r(v)y/4 where r(u) = u(l — u) since logii is finite for all u G (0,1), except when u —> 0 and — \ogU U~l 1/4 lim —7-r- = lim . • .. = 4 hm u 1 =0 by VHospital's rule. Remark 5.10 For u,v G (0,1), the function r(u) is bounded between .0 and 1/4. Therefore, r(«)V4 - 4 V2 meaning that for any constant M\, there exists a constant M2 = M\/yj2 such that M2 M2 Mi < —< r(u)V4 ~ {r(u)r{v)y/*' Lemma 5.11 The functions of Assumption 5.8 satisfy the bound implied by Theorem 5.6 for the Clayton family of copulas with 0 = [a, b) where 0 < a < b < 00. Proof of Lemma 5.11. Using the notation of Theorem 5.6, set p — q — 2 and 8 = 1/2. We 138 Chapter 5. MAMSE Weights for Copulas show that there exists positive finite constants MQ0 and M*D such that |logc(u|0o)| < M0O {r(u)r(v)y/4 and We can write |logc(u,0,p)| < {r(u)r(v)}W |logc(u|0)| log(0 + !) - (| + 2) lQg {u~6 + v~° - l) - + 1) log uv < |log(0 + l)| + < Q + 2^) log (V* + v~e - l) + \(6 + 1) log u\ + |(0 + 1) log v\ log(0 + 1) + (1 + 20)^ log (u~e + v~e - 1 +(0 + 1)| log u| +(0 + l)| log u| log^'1) + (1 + 20)M^ + 2(0 + 1)M {r(u)r(z;)}V4 {r(u)r(v)y/4 by Remark 5.9 and the fact that ^log (vTe + v-e - 1 < ^log{2max(«-%-e) - lj < I log (21*-* - l) + I log (2v-° - l) Jlog (2^.)+ 1 log (2^) 2 |logu| + I log t>I + - log 2 o M M < V2\og2 + + {r(u)r(v)}V* {r(u)r{v)}ll4 (r(u)r(i;)}1/4 {r{u)r{v)y/A 139 Chapter 5. MAMSE Weights for Copulas by Remarks 5.9 and 5.10. Note that the constant MQ is continuous in 9 for parameters 9 in any compact subset of [a, b] The inequalities above holds for 9 = 9Q, which is the first part of Assumption 5.8. The second part is also satisfied since log sup c(u|0)= sup logc(u|0)< sup < ' \9-6'\<p \0-0'\<p \0-0'\<p {r(u)r-(u)}1/4 {r^r^)}1^ because M# is a continuous function on the compact set {9 : \9 — 9'\ < p] D ©, it thus achieves its maximum denoted M*E,.. • Corollary 5.5 Under the assumptions of Lemma 5.11, Assumptions 5.4 and 5.6 are sat isfied. Proof of Corollary 5.5. Any of the integrals in Assumptions 5.4 and 5.6 are bounded by a positive constant times / {r(u)r(u)}V4 dCi(u) " / mm{r(u),r(v)y/idCi{U dCi(u) ) J{u:r(u)<r(v)} r{U)1'^ J{u:r(u)>r(v)} r{V)1'' £ /-^CtuH/^dQtu) = 2/^I75d»='r<cc' Hence we obtain the desired result. - B Remark 5.11 Corollary 5.5 guarantees that J logc(u|0o)dCj(u) and J logc(u, 9, p) dCj(u) exist. Therefore, the assumptions of Theorem 5.6 are satisfied, i.e. Assumption 5.8 is satisfied. 140 Chapter 5. MAMSE Weights for Copulas Wald's Lemmas For expectations, the following convention is adopted. Let Y be a random variable. The expected value of Y exists if E{max(Y, 0)} < oo. If .E{max(Y, 0)} is finite but £{min(Y, 0)} is not, we say that E{min(Y, 0)} = —oo. Moreover, a generic U represents a random variable with distribution C\(u) = C(u,$o). The following lemmas are equivalent to those found in Section 3.7. Lemma 5.12 For any 9 ^ 60, we have Elogc(U|0) < Elogc(U|0o). Lemma 5.13 lim Elogc(u,8,p) = Elogc(U|0). p^O Main Result Let now turn to the main result of this section. Throughout this subsection, we suppose that Assumptions 5.1 to 5.8 are satisfied. Theorem 5.8 Let T be any closed subset of © that does not contain 8Q. Then, L i=ij=i . J Proof of Theorem 5.8: Let U denote a random variable with distribution Ci(u) = C(u|0o). With each element 6 £ T, we associate a positive value po such that The existence of such pe follows from Lemmas 5.12 and 5.13. Let S(8,p) denote the sphere with center 8 and radius p. The spheres {S(0, pe) : 8 G T} form a covering of the compact set P E{logc(U,(9,pe)}<E{logc(U|0o)}. (5.8) T, hence there exists a finite sub-covering. Let 6\,... ,8h eT such that T C (Js=i S(9s, P6S)-141 Chapter 5. MAMSE Weights for Copulas Clearly, m nik i=ij=i \ik{u)nik/nik ^ "T_™ik , \ Aifc(u;)ralfc/nifc < Ennc(Y6'(?"^ s=l1=1j=l Therefore, to prove Theorem 5.8 if suffices to show that m nik niic(Y£>^) Kk(u)nlk/nik i=l j=l fc-»oo m nik i=lj=l \ik{u)nlk/nik = 0 for s — 1,..., h. These equations can be rewritten as lim nik k—>oo m nik i=l j=l = p ik \ = —oo logc(u|0o)d^(u) = —oo = 1 (5.9) for s — 1,..., h. The integrals in (5.9) converge almost surely to Elogc(U,9S,pes) and Elogc(U|f70) by Theorem 5.7. For large k, the expression inside the curly brackets is thus negative by (5.8). Hence the proof of Theorem 5.8 is complete. I Theorem 5.9 Let 9k(to) be a sequence of random variables such that there exists a positive 142 Chapter 5. MAMSE Weights for Copulas constant c with m nik niHnl i=i j=i Aifc(cj)nifc/raifc m nik i=lj=l for all k £ IN and all to £ Q. Then Xik(,cj)nlk/nik > c > 0 (5.10) P{ lim 0fc(w) = 0ol = 1. Proo/ o/ Theorem 5.9. Let e > 0 and consider the values of 9k(u>) as k goes to infinity. Suppose that 9( is a limit point away from 9Q, such that — 9Q\> e. Then, m nik suP nncH e) \k{io)nlk/nik m nik nn«(Ys|*) Xik(ui)nlk/nik > c> 0 i=i j=i infinitely often. By Theorem 5.8, this event has probability 0 even with e arbitrarily small. Therefore, LO lim 9k(u) -9 <e =1 for all e > 0. Corollary 5.6 TTie MWPLE with MAMSE weights is a strongly consistent estimate of 9. Proof of Corollary 5.6. The MWPLE clearly respects Equation (5.10) with c = 1 since 9k(u>) is then chosen to maximize the numerator of (5.10). • 5.9 Simulations We study the performance of the weighted coefficients of correlations and of the weighted pseudo-likelihood with MAMSE weights in finite samples through simulations. Unless oth-143 Chapter 5. MAMSE Weights for Copulas erwise specified, the number of repetitions for each simulation is sufficient to make the standard deviation of the simulation less than the last digit shown in the tables or on the figures. 5.9.1 Different Correlations for Bivariate Copulas The bivariate families of copulas presented in Section 5.1 all depend on a single parameter and the population value of Spearman's p is a monotone function of that parameter. To use these families on a comparable scale, we use parameter values that are associated with a specified p. Four different families of copulas are considered: Normal, Clayton, Frank and Gumbel-Hougaard. Equal samples of size n € {20, 50,100, 250} are simulated from five bivariate populations with different correlations. Two scenarios are adopted; they are described in Table 5.3. Each situation considered is repeated 10000 times. Table 5.3: Values of p under two different scenarios that are simulated for different families of copulas. Figure 5.1 shows the average weight allocated to each of the five populations by the MAMSE approach. Within a given scenario, the average weights do not seem to depend strongly on the actual distribution of the data. This is not very surprising since populations share the same correlation under such circumstances, and hence the shape of their copulas is similar even if they come from different families. As one would expect, the difference between Scenario A and B is bigger than the difference between two distributions within the same scenario. Under Scenario A, bias can cancel out since the correlation of the target population sits squarely within the range of correlations from the other'populations. Consequently, more weight is given to Population 1 under Scenario B where that situation cannot occur. Scenario A Scenario B Pop. 1 Pop. 2 Pop. 3 Pop. 4 Pop. 5 0.35 0.25 0.25 0.30 0.30 0.35 0.40 0.40 0.45 0.45 144 \ Chapter 5. MAMSE Weights for Copulas Finally, note that under Scenario A populations 2 to 5 receive an approximately equal share of the weight, but under Scenario B, populations whose correlation is closer to the target receive a larger mass than the others. This behavior conforms with intuition. Average Weights Allocated to Each Population Scenario / Family n= 20 n=50 n= 100 n= 250 Normal Clayton A Frank Gumbel 30 30 30 30 30 30 30 30 30 30 30 30 31 30 30 30 Normal Clayton B Frank Gumbel 32 33 34 38 32 33 34 38 32 33 34 38 32 33 35 38 • Pop. 1 H Pop. 2 • Pop. 3 • Pop. 4 • Pop. 5 Figure 5.1: Average value of the MAMSE weights for scenarios where samples of equal sizes n are drawn from 5 different populations. The cell areas are proportional to the average weight allocated to each population. The numbers correspond to lOOAi and are averaged over 10000 repetitions. Table 5.4 shows a ratio of mean squared errors comparing the usual estimate of Spear man's p with its MAMSE-weighted counterpart. Table 5.5 shows a similar ratio for esti mating the parameter of the underlying copula with the pseudo-likelihood or the MAMSE-weighted pseudo-likelihood. Note that the error due to simulation in Table 5.5 has a stan dard deviation from below 6 units for n = 20 to less than 3 units for n = 250. Under Scenario A, the performance of the MAMSE-weighted methods is impressive, featuring in most cases a MSE three times smaller than the equivalent non-weighted method. 145 Chapter 5. MAMSE Weights for Copulas Family n =20 ' 50 100 250 Normal 333 331 315 271 Scenario A Clayton 335 329 314 271 Frank 338 333 314 271 Gumbel 331 327 312 271 Normal 275 197 131 72 Scenario B Clayton 284 197 131 81 Frank 283 195 131 77 Gumbel • 286 196 136 ' 78 Table 5.4: Performance of a MAMSE-weighted coefficient of correlation based on ranks as measured by 100 MSE(p)/MSE(p^) f°r different scenarios -and sample sizes n € {20, 50,100, 250}. Each figure is averaged over 10000 repetitions. Family n =20 50 100 250 Normal 305 310 304 266 Scenario A Clayton Frank 407 398 349 356 327 325 278. 273 Gumbel 389 341 315 276 Normal 193 139 97 57 Scenario B Clayton 184 110 72 45 Frank 245 164 112 68 Gumbel 188 117 83 51 Table 5.5: Performance of the maximum weighted pseudo-likelihood estimate as measured by 100 MSE(f?)/MSE(i9A) for different scenarios and sample, sizes n € {20,50,100,250}. Each figure is averaged over 10000 repetitions. Under that scenario, the correlation of the population of interest is in the middle of the range of correlations from the other populations, hence the bias can cancel out. Under Scenario B however, Populations 2 to 5 have larger correlations than the target population. Despite that fact, the methods of inference based on MAMSE weights still achieve appreciable gains in performance in many cases. The performance of the weighted methods is sometimes clearly inferior to their unweighted equivalents. Although the counter-performances seem to occur for larger sample sizes, a situation where the weighted methods are less needed, the performance of the method should be studied further before recommending its use in practice to avoid encountering such losses. 146 Chapter 5. MAMSE Weights for Copulas 5.9.2 Different Bivariate Distributions with Common Correlation In this simulation, we explore the possible advantages of the MAMSE-weighted methods when populations have copulas of different yet similar shapes. Five populations are used, all of which have a theoretical Spearman correlation of p = 0.2. However, the actual shape varies since the populations come from different copulas that are described in Table 5.6. Table 5.6: Distributions from which bivariate random variables are drawn. The choice of parameters in each population yields a Spearman correlation of p = 0.20. Note that Frank, Farlie-Gumbel-Morgenstern and the Normal copulas are all radially symmetric while those of Clayton and Gumbel-Hougaard are not. Hence, they do differ in shape although they are chosen to share a common theoretical correlation. For each sample size n G {20, 50,100, 250} considered, 10000 replicates are produced. The MSE of estimating the correlation p — 0.2 and the MSE of estimating the parameter of the target distribution, a Normal model with 9 — r = 2sin(0.27r/6) « 0.209, are evaluated. Table 5.7 shows the average MAMSE weights allocated to each of the five populations considered as well as the efficiency calculated by the ratios 100 MSE(pi)/MSE(p^) and 100 MSE(f3i)/MSE(0\). Note that the standard deviation of the error due to simulation can reach nearly 4 units for the efficiency at estimating 9\. All other figures in the table respect the quoted margin of error of one standard deviation. First note that a substantial weight is allocated to the external populations, the sam ple from the target population contributing a mere 31% in the inference. Note also that the remaining weight is spread rather evenly between the other populations. The gain in efficiency is clear and impressive. In fact, it corresponds approximately to the inverse of the weights allocated to Population 1 in average: 1/0.31 « 3.23. When the populations Pop. 1 Pop. 2 Pop. 3 Pop. 4 Pop. 5 Normal Copula, r = 0.209 Frank Copula, 9 = 1.22 Farlie-Gumbel-Morgenstern Copula, 6 = 0.600 Gumbel-Hougaard Copula, 9 = 1.16 Clayton Copula, 9 = 0.310 147 Chapter 5. MAMSE Weights for Copulas Efficiency 100 x n pi §i Ai A2 A3 A4 A5 20 328 299 50 . 336 334 100 338 345 250 345 356 31 17 17 17 17 31 17 17 17 17 31 17 17 17 17 31 17 17 17 17 Table 5.7: Average MAMSE weights as well as the relative efficiency of the weighted cor relation and of the MWPLE when compared to their non-weighted counterparts. Samples of size n € {20,50,100,250} are drawn from five different bivariate distributions that have a common correlation of p = 0.2. Figures are averaged over 10000 repetitions. considered are similar, the MAMSE-weighted correlation and the MWPLE clearly improve the inference. The relevant information contained in the samples from Population 2 to 5 has a clear value; that information should not be ignored. The average weight allocated to each population does not seem to vary with n. This unexpected behavior cannot hold for an arbitrarily large sample size as it would contradict Theorem (5.3). In Table 3.1, it was noticed that the weights seem to converge very slowly when the distributions are very close to each other. The same behavior may be observed here since the populations that were simulated share a common Spearman p, i.e. they are quite similar to each other. 5.9.3 Multivariate Normal The multivariate Normal copula is parametrized by its correlation matrix. In the follow ing simulations, we will use Normal distributions in 3 and 4 dimensions, hence depending respectively on 3 and 6 parameters. We simulate a scenario where the population of interest follows the target distribution and 3 other populations are available with the same underlying copula, but with measure ment error that changes the correlations associated with their underlying copula. The sample sizes simulated are unduly small as the MAMSE weights suffer from the curse of dimensionality: the measure Mk contains nplk points for which operations of non-trivial 148 Chapter 5. MAMSE Weights for Copulas time must be performed. The time required to calculate the equations to determine the MAMSE weights increases rapidly to the hardware limitations even for moderate n^' and p. There are ways in which the calculation of the MAMSE weights might be accelerated. The integrals with respect to dMfc(u) could be evaluated on a grid sparser than that associated with Mfc(u), whether it is through sampling or by a new definition of Mfc(u). Proving that the algorithm and the weights converge properly with such practical simplifications is left to future work. Maximum Pseudo-Likelihood Estimate and its Weighted Counterpart To calculate the pseudo-likelihood of the data, we apply the inverse CDF of a standard Normal to the rescaled ranks. For this section, we choose to use Y^* = (R*- — l/2)/rijfe, for a fixed k, i — 1,..., m and j = 1,..., n^. We thus transform the data into Zjj — a-1 (Y*;),...,*11 (y*;) By construction, the mean of the vectors Z*j, j = 1,..., is exactly 0 and the MLE of the marginal variances are ^{•-ft1)} »/;{•-'«}'...-/y<>•(„-1. The actual value of the sample variance differs from 1 by a fixed amount that depends on the sample size since it determines the number of terms in the sum used above to approximate the corresponding integral. The expression for the pseudo-likelihood based on the rescaled ranks Y^* is identical to the likelihood of a centered normal based on the corresponding Zjj. The maximum 149 Chapter 5. MAMSE Weights for Copulas likelihood estimate of Ej, the covariance matrix in Population i, is given by Sj = / ZjjZi7-. i« j=i The diagonal of Ej already contains a numerical approximation of 1. Define then E* = Ej, except for diagiT,*) = 1, i.e. we replace the diagonal of Ej by true ones. We use E* as the maximum pseudo-likelihood estimate of E. The change of diagonal does not impact the positive definiteness of the matrix because the elements of the diagonal of Ej are equal and slightly smaller than 1, hence the replace ment by 1 is equivalent to adding a small positive multiple of the identity matrix which is itself positive definite. We prefer the approach above to brute force numerical maximization' of the pseudo-likelihood. Now, turn to the weighted pseudo-likelihood whose expression is equivalent to the weighted likelihood for centered Normal variates with common covariance matrix E based on the Zjj-: m nik , -I \ Xikiu) . . rn nik / \ / \ \ z=lj=l ' 11 i=lj=\ where |E| denotes the determinant of the matrix E. The corresponding weighted log-likelihood is -, m . / s nik I m . / N nik | I=I j=i ^ i=i j=i j = -IloglEl-itrfE-1^} (5.11) 150 Chapter 5. MAMSE Weights for Copulas where A = EA^)EzTZy i=l Hlk 3=1 Theorem 5.10 (Wichern & Johnson (2002), Result 4.10) Given apxp symmetric positive definite matrix B and a scalar b > 0, it follows that -61og|E| - ^-tr(E~lB) < -6log\B\ + p61og(26) - bp for all positive definite p x p matrix E, with equality holding only for E = (1/26)5. The calculation of the MLE of the covariance matrix of a multivariate normal distribution typically uses a result akin to Theorem (5.10). Its proof can be found from different sources and is thus not reproduced here. Applying Theorem (5.10) to Expression (5.11) with 6 = 1/2 and B = A, we conclude that the maximum weighted likelihood estimate of the covariance matrix E is given by 1=1-.™** j=l i=l To avoid brute force optimization involving constraints on the positive definiteness of the estimate, we use the same approach as before and use m I=I as the maximum weighted pseudo-likelihood estimate. 151 Chapter 5. MAMSE Weights for Copulas 3-D Multivariate Normal We suppose a multivariate normal model with measurement error. Let 1 0.4 0.3 1 0.8 0.6 EA = 0.4 1 0.4 • > E# = 0.8 1 0.8 0.3 0.4 1 0.6 0.8 1 0.2 0 0 0.2 0.1 0 S2 = 0 0.2 0 S3 = 0.1 0.2 0 0 0 0.2 0 0 0.2 0.2 -0.1 0 and S4 = -0.1 0.2 0 0 0 0.2 We draw random vectors from multivariate Normals with means 0 and covariances that are defined in terms of the matrices above according to the formulas in the column Covariance of Table 5.8. Measurement errors affect the dependence structure of the populations. Let 1 ri2 Ta 1 ri2 ri3 I be the covariance matrix of Population i = 1,..., m. The actual value of Ej is defined by three parameters explicitly written in Table 5.8. Equal samples of size n € {20, 35, 50} are drawn from each population. The maximum pseudo-likelihood estimate of Ti = [ru, r12,Ti3]T and its MAMSE-weighted equivalent are calculated. Table 5.9 shows the average weight allocated to each population. The inference relies 152 Chapter 5. MAMSE Weights for Copulas 100 x Pop. Covariance ri2 ri3 1 40 30 40 Scenario A 2 £A + S2/4 38 29 38 3 SA + S3/4 40 29 38 4 EA + 54/4 38 29 36 1 80 60 80 Scenario B 2 TiB + S2 67 50 67 3 ZB + S3 75 50 67 4 SB + S4 67 50 58 Table 5.8: Parameters of the simulation for 3-variate Normal variables. Population 1 is from the target distribution, but the other populations are affected by measurement errors. strongly on the populations with measurement errors since they have a total weight of about 60% in all cases. The weights do not seem very affected by the sample size n. The small range of values for n might be partially responsible. Recalling a similar comment from the previous simulation, it is also possible that the measurement error model produced distributions that are similar enough to make the convergence of the weights rather slow. 100 x n Ai A2 A3 A4 20 41 19 20 20 Scenario A 35 41 20 20 20 50 41 20 20 19 20 37 21 22 21 ' Scenario B 35 38 21 22 20 50 39 20 22 19 Table 5.9: Average weight allocated to each of four 3-variate Normal distributions. Pop ulation 1 is observed from the target distribution, but the other populations contain mea surement errors.-The values are averaged over 5000 repetitions. To evaluate the performance of the MWPLE, we compare its MSE to that of the MPLE based on Population 1 only. Let T\ denote the maximum pseudo-likelihood estimate of Y\ 153 Chapter 5. MAMSE Weights for Copulas and T\ denote its weighted counterpart. We estimate the mean squared errors MSE(fi) = E||fi -Till2 and MSE(f A) = E||f A - Ti||2 with 5000 replicates; || • || denoting Euclidean distance. Table 5.10 shows the values of MSE(r^) as well as an equivalent ratio of the MSE for estimating each element of the vector T\. Note that the standard deviation due to simulation error may reach almost 5 units under Scenario A and nearly 2.5 units under Scenario B. Efficiency n Ti Tn Tl2 Ti3 20 243 240 260 231 Scenario A 35 255 262 261 241 50 257 264 266 243 20 74 68 123 47 Scenario B 35 60 53 110 35 50 52 44 105 27 Table 5.10: Relative performance of the MWPLE when compared to the MPLE as mea sured by 100 MSE(Ti)/MSE(rA) or an equivalent ratio for the individual entries of Tx-Population 1 is observed from a 3-variate Normal distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions. Under Scenario A, the performances of the MWPLE are very good as its MSE is less than half that of the MPLE. However, this excellent performance does not obtain under Scenario B where the pseudo-likelihood is the winner. We can only speculate about the causes of these results at this stage, but at least two possible explanations should be explored in future research. • The variance term in the definition of the MAMSE weights is a very rough estimate of the actual asymptotic variance of the empirical copula. This choice appears to be 154 Chapter 5. MAMSE Weights for Copulas reasonable for mild correlations as in Scenario A, but may degenerate as the correlation increases which is the case under Scenario B. A better penalty term can certainly be found. • When the correlations get larger, the data points tend to cluster around a singular subset of the unit cube. The choice of the uniform measure MK in the definition of the MAMSE weight might not be optimal in that situation.. Despite poor performances under Scenario B, the results of this section show that the MAMSE weights offer great potential to improve the quality of the inference, at least for moderate correlation and possibly for more cases if the MAMSE criterion is improved. 4-D Multivariate Normal We suppose another multivariate normal model with measurement error. Let 52 = 1 0.4 0.3 0.2 0.4 1 0.4 0.3 0.3 0.4 1 0.4 0.2 0.3 0.4 1 0.2 0 0 0 0 0.2 0 0 0 0 0.2 0 0 0 0 0.2 5,= 1 0.8 0.6 0.4 0.8 1 0.8 0.6 0.6 0.8 1 0.8 0.4 0.6 0.8 1 0.2 0.1 0 0 0.1 0.2 0 0 0 0 0.2 0.1 0 0 0.1 0.2 and S4 = 0.2 -0.1 0 0 -0.1 0.2 0 0 0 0 0.2 -0.1 0 0 -0. 0.2 155 Chapter 5. MAMSE Weights for Copulas We draw random vectors from multivariate Normal distributions with means 0 and co-variances that are defined in terms of the matrices above according to the formulas in the column Covariance of Table 5.11. Measurement errors affect the dependence structure of the populations. The covariance matrix of Population i is written 1 Tii Yi2 TJ3 Ti2 TJ4 i r*6 Ti3 TJ5 Tj6 1 and depends on a vector of six parameters, T{ = [Tn,TJ2,r*3,r*4,1^5,rj6]T, whose values are explicitly written in Table 5.11. 100 x Pop. Covariance ri2 ri3 Ti4 Ti5 Ti6 1 40 30 20 40 30 40 Scenario A 2 3 + £2/4 XA + S3/4 38 40 29 29 19 19 38 38 29 29 38 40 4 36 29 19 38 29 36 1 80 60 40 80 60 80 Scenario B 2 + 52 67 50 33 67 50 67 3 + 53 75 • 50 33 67 50 75 4 Ss + S4 58 50 33 67 50 58 Table 5.11: Parameters of the simulations for 4-variate Normal variables. Population 1 comes from the target distribution, but the other populations are affected by measurement errors. We simulate 5000 samples of size n = 20 from each of the populations and calculate both the maximum pseudo-likelihood estimate of Ti and its weighted counterpart. The choice of such a small sample size is dictated by the curse of dimensionality. The measure MK used to average the MAMSE criterion puts an equal weight on nvlk points. Doubling the sample size thus multiplies the run time by a factor greater than 16. To accelerate the calculation of the multivariate MAMSE weights, one could try to 156 Chapter 5. MAMSE Weights for Copulas approximate the integrals of the MAMSE criterion by choosing a sparser Mk. Using &C\k instead of dMk is another option to consider as it would not suffer from the curse of dimensionality in terms of run time. It might however be too sparse to detect differences between the copulas. In particular, it is not clear if the uniform convergence of the MAMSE-weighted empirical copula would hold with such a definition of the MAMSE weights. These investigations are however left to future research. Table 5.12 shows the average weight allocated to each of the populations. Less than half of the weight is allocated to Population 1 meaning that the contribution of the other populations is quite substantial. 100 x n Ai A2 A3 A4 Scenario A 20 46 18 18 18 Scenario B 20 41 20 20 19 Table 5.12: Average weight allocated to each of four 4-variate Normal distributions. Pop ulation 1 is observed from the target distribution and the other populations contain mea surement errors. The values are averaged over 5000 repetitions. Table 5.13 compares the performance of the MPLE to that of its weighted counterpart. The ratios in the table are calculated as those in the last section. The standard deviation of the error due to simulation is less than 4 units in that table. Efficiency n Ti Vu Ti2 Ti3 Ti4 Fi5 Ti6 Scenario A 20 235 232 234 259 225 234 229 Scenario B 20 98 58 118 214 59 130 62 Table 5.13: Relative performance of the MWPLE when compared to the MPLE for 4-variate Normal distributions. Population 1 is observed from the target distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions. Once again, the improvement from using the weighted pseudo-likelihood is very sub stantial under Scenario A, but it suffers a small loss under Scenario B. The results tend to confirm that the proposed implementation of the MAMSE weights performs better for 157 Chapter 5. MAMSE Weights for Copulas moderate correlations. Nonetheless, the potential for improvement is clear and in this case, at least outweighs the performance losses that occur under Scenario B. The simulations in this section show that using the MAMSE weights can improve the mean squared error in many cases, but as with most methods, it is not uniformly better over all scenarios that were considered. 158 Chapter 6 Summary and Future Research This chapter presents a list of the most important original contributions contained in this thesis and sketches some of the directions that will be explored in future research. 6.1 Summary of the Work The heuristic justification of the weighted likelihood presented in Section 2.2 provided the genesis of this thesis. That interpretation does not appear to have been exploited previously in the literature. With these heuristics in mind, intuition suggests choosing likelihood weights that make a mixture of the m distributions close to the target distribution, but less variable than its natural estimate. The MAMSE weights are a specific implementation of that idea. The MAMSE weights are not only likelihood weights. As a consequence of their defini tion, they can also be used to define a mixture of estimates of the CDF. The general idea of the MAMSE weights involves using an estimate of the CDF of each population. Specific properties are studied for three different kinds of data. These three cases are built from nonparametric estimates of the CDF, meaning that the resulting MAMSE weights are nonparametric as well. Univariate Data We use the empirical distribution function to define the MAMSE criterion and prove the following properties. 159 Chapter 6. Summary and Future Research • Invariance of the MWLE to a reparametrization of the model. • The strong uniform convergence of the MAMSE-weighted empirical CDF. • A weighted strong law of large numbers. • The strong consistency of the MWLE with MAMSE weights. In addition, simulations show that using the MAMSE weights allows for superior perfor mances in many scenarios. Censored Data We use the Kaplan-Meier estimate to accommodate right-censored data. The main contri butions are: • the weighted Kaplan-Meier estimate, a fully nonparametric estimate of the survival function that uses data from the m populations at hand, and • the uniform convergence thereof. Simulations show possible improvements for inference on finite samples when using the MAMSE-weighted Kaplan-Meier estimate. Multivariate Data Through Copulas We treat the dependence structure of multivariate data through copulas. The empirical cop ula, an estimate based on the ranks of the data, is used to that end. Important contributions are as follows. • Definition of the weighted empirical copula and a proof of its strong uniform conver gence to the target copula. • A weighted strong law of large numbers for bounded multivariate rank order statistics. 160 Chapter 6. Summary and Future Research • A weighted coefficient of correlation based on MAMSE weights; the proof that this estimate is strongly consistent. • The weighted pseudo-likelihood for fitting a family of copulas on data from m popu lations. • A proof that the MWPLE is a strongly consistent estimate of the parameter of interest. Here again, simulations showed that the methods proposed are powerful tools when appro priately applied. The sum of these results show that the MAMSE weights succeed in trading bias for precision for different types of data. 6.2 Future Research The idea of MAMSE weights and the intuitive- understanding of the weighted likelihood open many directions for future research. We list some of these below. Principle of the MAMSE Weights • Using different functions to combine bias and variance. For instance, it was recently brought to my attention that statistics of the form J{F(x) — G(x)}2<&(x) dF(x) have been studied for goodness-of-fit. The choice \l>(x) = 1 that corresponds to the bias term in the MAMSE criterion is also known as the Cramer-von Mises test statis tic. However, it seems that the choice ^(x) = F(x){l — F(x)} which corresponds to Anderson-Darling goodness-of-fit test typically offers a more powerful alternative. Hence, we intend to explore the properties of a different MAMSE criterion where the bias term adopts the form of the Anderson-Darling statistic. • Using parametric estimates of the distributions in the general formulation of the 161 Chapter 6. Summary and Future Research MAMSE weights (rather than nonparametric estimates) may allow one to account for nuisance parameters or for covariates. Univariate Data • The rate of convergence of the MAMSE weights needs further study. • It would be useful to describe the asymptotic distribution of a MAMSE-weighted sum of variables and even more useful to determine the asymptotic distribution of the MWLE. ' • The weights implied by the Stein-type estimate of Ahmed (2000) could be studied as possible likelihood weights. Alternatively, the approach of Ahmed (2000) may be extended to the empirical distribution functions to define Stein-like likelihood weights. Censored Data • MAMSE weights could be used as likelihood weights. The fact that the weighted Kaplan-Meier estimate may not converge on the complete real line is a challenge. Defining the MAMSE weights based on parametric models rather than on the Kaplan-Meier estimates may help in building weights that will make the MWLE consistent. • The idea of the MAMSE weights might be extendable to a Cox proportional hazard model, something that warrants further investigations given the great importance of that model. Copulas • Defining and studying the properties of a weighted version of Kendall's r. • Exploring different definitions of the MAMSE weights that suffer less from the curse of dimensionality. 162 Chapter 6. Summary and Future Research • Studying the effect of using different estimates for the variance of the empirical copula. In particular, verifying the hypothesis that it may improve the performance of the MWPLE in the presence of strong dependence. • Using the weighted pseudo-likelihood in other contexts where the weights need not be estimated from the data. For instance, if data were available from populations with the same dependence structure, but different marginals (e.g. different currencies), or if m sets of rankings are the only information available. • Proposing weighted coefficients of correlation under scenarios where the weights do not need to be determined from the data. • Using the MWPLE and univariate MWLEs in order to build an estimate of a multivari ate distribution function by combining the estimated copula with estimated marginal distributions. An important advantage of such an approach is the possibility to use a different number of populations for each of the marginal distributions. Suppose that data are available from some studies who did not record all the variables of interest. The information from these studies could be used for the inference on the marginals, but only the complete data could be used to determine the dependence structure. • One approach to building multivariate distributions may be obtained by extending the work of Cox & Reid (2004) where the likelihoods would be replaced by MAMSE-weighted likelihoods. Such an extension may however require a definition of MAMSE weights for multivariate distributions that are not copulas. Bootstrap In this document, we propose three weighted nonparametric asymptotically unbiased empir ical estimates of a target distribution. Resampling techniques such as the bootstrap could be developed from these weighted mixtures of empirical functions. To use such methods 163 Chapter 6. Summary and Future Research with confidence, we should first demonstrate that the bootstrap estimators obtained that way are consistent. One advantage of bootstrapping from a MAMSE-weighted mixture of empirical functions is the ability to draw from a larger number of values, resulting in a bootstrap sample with fewer ties. One could even try including a distribution as an external population, i.e. to allow for an infinite sample, hence further reducing the number of ties in the bootstrap sample. The Stein-like quantile estimate of Ahmed (2000) implies an empirical distribution that could also be considered for the definition of an alternative bootstrap method. Foundations of Statistics In addition to these directions for research, this work may have shed a new light on likelihood methods for some readers; writing it definitely changed my understanding of this important statistical tool. In particular, the likelihood can be seen as an estimate of the entropy between the fitted model and the EDF. The convergence properties of the CDF can then be used to show the consistency of the MLE. By showing that the MAMSE-weighted EDF converges, we could also show that a MWLE based on MAMSE weights is consistent. A question arises from that interpretation of the likelihood: Is the likelihood a good method because the empirical distribution function is a good estimate, or is the likelihood a more general principle? Can any model be fitted successfully using the likelihood? Using the likelihood for linear model based on the normal distribution for instance is definitely reasonable, especially if when we consider that maximizing the normal likelihood is equivalent to a form of least squares. But what about complex models, possibly involving nonlinear covariates? Should we avoid using the likelihood blindly or is it safe in all situations? 164 Chapter 7 Conclusion The weighted likelihood as proposed by Hu, Wang and Zidek suggests that information from populations similar to that of interest should not be ignored, but rather incorporated into the inference. When I started working on the weighted likelihood, the question that everybody asked first was: "But how do you choose the weights?" The MAMSE weights provide one answer to that question which seemed to be of most concern to those who were not acquainted with the weighted likelihood. This thesis does not say how to choose optimal likelihood weights. In fact, the answer to that question will depend on the purpose of the inference, more particularly on the loss function that one wants to use. This thesis however offers a practical and effective solution to determine likelihood weights. These weights are shown to produce consistent estimates, yet they successfully trade bias for precision in many cases that were simulated. The different cases of the MAMSE weights studied in detail in this document are all based on nonparametric estimates of the distribution functions. Hence, the proposed MAMSE weights are themselves nonparametric. For all cases studied, the MAMSE-weighted empirical distributions are shown to con verge uniformly to the target distribution without assuming any particular shape and min imal regularity conditions for the other populations. Hence, the estimates suggested are asymptotically unbiased even though they take into consideration unknown data. With MAMSE weights, the maximum weighted likelihood estimate and the maximum 165 Chapter 7. Conclusion weighted pseudo-likelihood estimate converge strongly under conditions that are akin to those imposed on their unweighted counterparts. Finally, simulations show that the MAMSE weights can indeed improve the quality of inference on finite samples in many situations. The results presented in this document show that when different sources of data are available, it is possible to use them to improve the quality of the inference. Hence it is possible with the MAMSE weights to use the data from m populations in order to evaluate their level of dissimilarity and then to use that information to successfully trade bias for precision. « 166 Bibliography E. S. Ahmed, A. L Volodin k A. A. Hussein (2005). Robust weighted likelihood estimation of exponential parameters, IEEE Transactions on Reliability,-54, 389-395. E. S. Ahmed (2000). Stein-type shrinkage quantile estimation, Stochastic Analysis and Applications, 18, 475-492. H. Akaike (1977). On entropy maximization principle, Applications of Statistics, 27-42. D. Blest (2000). Rank correlation—an alternative measure, Australian and New Zealand Journal of Statistics, 42, 101-111. N. E. Breslow k J. Crowley (1974). A large sample study of the life table and product limit estimates under random censorship, The Annals of Statistics, 2, 437-453. K. Chen k S.-H. Lo (1997). On the rate of uniform convergence of the product-limit estimator: Strong and weak laws, The Annals of Statistics, 25, 1050-1087. U. Cherubini, E. Luciano k W. Vecchiato (2004). Copula Methods in Finance. Wiley Finance, Wiley, Hoboken. D. G. Clayton (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence, Biometrika, 65, 141-151. D. R. Cox k N. Reid (2004). A note on pseudolikelihood constructed from marginal densities, Biometrika, 91, 729-737. 167 Bibliography P. Deheuvels (1979). La fonction de dependance empirique et ses proprietes: un test non parametrique d'independance, Academie royale de Belgique-Bulletin de la classe des sciences, 65, 274-292. R. Durrett (2005). Probability: Theory and Examples, Third Edition. Duxbury Advanced Series, Thomson, Belmont. B. Efron (1967). The two sample problem with censored data. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California, 4, 831-853. J.-D. Fermanian, D. Radulovic, M. Wegkamp (2004). Weak convergence of empirical copula processes, Bernoulli, 10, 847-860. A. Foldes & L. Rejto (1981). Strong uniform consistency for nonparametric survival curve estimators from randomly censored data. The Annals of Statistics, 9, 122-129. M. J. Frank (1979). On the simultaneous associativity of F(x,y) and x + y — F(x,y). Aequationes Mathematicae, 19, 194-226. C. Genest, K. Ghoudi Sz L.-P. Rivest (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82, 543-552. C. Genest & J. MacKay (1986). Copules archimediennes et families de lois bidimension-nelles dont les marges sont donnees. Canadian Journal of Statistics, 14, 145-159. C. Genest &; J.-F. Plante (2003). On Blest's measure of rank correlation, The Canadian Journal of Statistics, 31, 35-52. F. Hu (1994). Relevance Weighted Smoothing and a New Bootstrap Method, unpublished doctoral dissertation, Department of Statistics, The University of British Columbia. 168 Bibliography F. Hu & J. V. Zidek (1993). A Relevance Weighted Nonparametric Quantile Estima tor. Technical report no. 134, Department of Statistics, The University of British Columbia, Vancouver. F. Hu & J. V. Zidek (2001). The relevance weighted likelihood with applications, Empirical Bayes and Likelihood Inference, 211-235. F. Hu & J. V. Zidek (2002). The weighted likelihood, The Canadian Journal of Statistics, 30, 347-371. H. Joe (1997). Multivariate Models and Dependence Concepts. Chapman &Hall, London. R. A. Johnson Sz D. W. Wichern (2002). Applied Multivariate Statistical Analysis, Fifth Edition, Prentice Hall, Upper Saddle River. E. L. Kaplan & P. Meier (1958). Nonparametric estimation from incomplete observations, Journal of the American Statistical Association, 53, 457-481. A. M. Krieger k, D. Pfeffermann (1992). Maximum likelihood estimation from complex sample surveys, Survey Methodology, 18, 225-239. D. G. Luenberger (2003). Linear and Nonlinear Programming, Second Edition, Kluwer, Nor well. M. Markatou, A. Basu Sz B. Lindsay (1997). Weighted likelihood estimating equations: The discrete case with applications to logistic regression. Journal of Statistical Plan ning and Inference, 57, 215-232.-National Center for Health Statistics (1997). U.S.. Decennial Life Tables for 1989-91, vol. 1, no. 1, Hyattsville, Maryland. R. B. Nelsen (1999). An Introduction to Copulas, Lecture Notes in Statistics No. 139, Springer, Berlin. 169 Bibliography D. Oakes (1986). Semiparametric inference in a model for association in bivariate survival data, Biometrika, 73, 353-361. J. Pinto da Costa h C. Soares (2005). A weighted rank measure of correlation, Australian and New Zealand Journal of Statistics, 47, 515-529. J.-F. Plante (2002). A propos d'une mesure de correlation des rangs de Blest, unpublished Master's Thesis, Departement de mathematiques et de statistique, Universite Laval. A. Sklar (1959). Fonctions de repartition a n dimensions et leurs marges. Publications de I'Institut de statistique de V Universite de Paris, 8, 229-231. Statistics Canada (2006). Life Tables, Canada, Provinces and Territories. Reference Period: 2000 to 2002., catalog number 84-537-XIE, Ottawa, Canada. H. Tsukahara (2005). Semiparametric estimation in copula models. The Canadian Journal of Statistics, 33, 357-375. C. van Eeden & J. V. Zidek (2004). Combining the data from two normal populations to estimate the mean of one when their means difference is bounded, Journal of Multivariate Analysis, 88, 19-46. A. Wald (1949). Note on the consistency of the maximum likelihood estimate, The Annals of Mathematical Statistics, 20, 595-601. X. Wang (2001). Maximum Weighted Likelihood Estimation, unpublished doctoral disser tation, Department of Statistics, The University of British Columbia. X. Wang, C. van Eeden h J. V. Zidek (2004). Asymptotic properties of maximum weighted likelihood estimators, Journal of Statistical Planning and Inference, 119, 37-54. X. Wang & J. V. Zidek (2005). Selecting likelihood weights by cross-validation, The Annals of Statistics, 33, 463-501. 170 Bibliography B. B. Winter, A. Foldes & L. Rejto (1978). Glivenko-Cantelli theorems for the PL estimate. Problems of Control and Information Theory, 7, 213-225. 171
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Adaptive likelihood weights and mixtures of empirical...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Adaptive likelihood weights and mixtures of empirical distributions Plante, Jean-François 2007-12-31
pdf
Page Metadata
Item Metadata
Title | Adaptive likelihood weights and mixtures of empirical distributions |
Creator |
Plante, Jean-François |
Publisher | University of British Columbia |
Date | 2007 |
Date Issued | 2011-02-17T21:54:35Z |
Description | Suppose that you must make inference about a population, but that data from m -1 similar populations are available. The weighted likelihood uses exponential weights to include all the available information into the inference. The contribution of each datum is discounted based on its dissimilarity with the target distribution. One could hope to elicitate the likelihood weights from scientific information, but using data-based weights is more pragmatic. To this day, no entirely satisfactory method has been found for determining likelihood weights from the data. We propose a way to determine the likelihood weights based on data. The suggested "MAMSE" weights are nonparametric and can be used as likelihood weights, or as mixing probabilities to define a mixture of empirical distributions. In both cases, using the MAMSE weights allows strength to be borrowed from the m -1 similar populations whose distribution may differ from the target. The MAMSE weights are defined for different types of data: univariate, censored and multivariate. In addition to their role for the likelihood, the MAMSE weights are used to define a weighted Kaplan-Meier estimate of the survival distribution and weighted co-efficients of correlation based on ranks. The maximum weighted pseudo-likelihood, a new method to fit a family of copulas, is also proposed. All these examples of inference using the MAMSE weights are shown to be asymptotically unbiased. Furthermore, simulations show that inference based on MAMSE-weighted methods can perform better than their unweighted counterparts. Hence, the adaptive weights we propose successfully trade bias for precision. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2011-02-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0100736 |
URI | http://hdl.handle.net/2429/31457 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- [if-you-see-this-DO-NOT-CLICK]
- ubc_2007-319055.pdf [ 8.56MB ]
- Metadata
- JSON: 1.0100736.json
- JSON-LD: 1.0100736+ld.json
- RDF/XML (Pretty): 1.0100736.xml
- RDF/JSON: 1.0100736+rdf.json
- Turtle: 1.0100736+rdf-turtle.txt
- N-Triples: 1.0100736+rdf-ntriples.txt
- Original Record: 1.0100736 +original-record.json
- Full Text
- 1.0100736.txt
- Citation
- 1.0100736.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
United States | 14 | 0 |
Poland | 11 | 0 |
China | 9 | 5 |
France | 5 | 0 |
Canada | 4 | 10 |
Russia | 3 | 0 |
Ukraine | 3 | 0 |
Czech Republic | 2 | 0 |
United Kingdom | 2 | 0 |
Germany | 2 | 6 |
Malta | 1 | 0 |
Taiwan | 1 | 0 |
Latvia | 1 | 0 |
City | Views | Downloads |
---|---|---|
Unknown | 20 | 6 |
Ashburn | 8 | 0 |
Beijing | 5 | 0 |
Denton | 3 | 0 |
Montreal | 2 | 10 |
Shenzhen | 2 | 5 |
Brighton | 2 | 0 |
Prague | 2 | 0 |
Changsha | 2 | 0 |
Novosibirsk | 2 | 0 |
Saint Petersburg | 1 | 0 |
Paris | 1 | 0 |
Seattle | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0100736/manifest