UBC Theses and Dissertations
Adaptive likelihood weights and mixtures of empirical distributions Plante, Jean-François 2007

Adaptive Likelihood Weights and Mixtures of Empirical Distributions

by Jean-François Plante

B.Sc., Université Laval, 2000
M.Sc., Université Laval, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA

July, 2007

© Jean-François Plante 2007

Abstract

Suppose that you must make inference about a population, but that data from m − 1 similar populations are available. The weighted likelihood uses exponential weights to include all the available information in the inference. The contribution of each datum is discounted based on its dissimilarity with the target distribution. One could hope to elicit the likelihood weights from scientific information, but using data-based weights is more pragmatic. To this day, no entirely satisfactory method has been found for determining likelihood weights from the data.

We propose a data-based method for determining the likelihood weights. The suggested "MAMSE" weights are nonparametric and can be used as likelihood weights, or as mixing probabilities to define a mixture of empirical distributions. In both cases, using the MAMSE weights allows strength to be borrowed from the m − 1 similar populations whose distributions may differ from the target. The MAMSE weights are defined for different types of data: univariate, censored and multivariate. In addition to their role in the likelihood, the MAMSE weights are used to define a weighted Kaplan-Meier estimate of the survival distribution and weighted coefficients of correlation based on ranks. The maximum weighted pseudo-likelihood, a new method to fit a family of copulas, is also proposed. All these examples of inference using the MAMSE weights are shown to be asymptotically unbiased. Furthermore, simulations show that inference based on MAMSE-weighted methods can perform better than their unweighted counterparts.
Hence, the adaptive weights we propose successfully trade bias for precision.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
   1.1 Review
   1.2 Overview

2 Likelihood, Entropy and Mixture Distributions
   2.1 Likelihood and Entropy
   2.2 The Weighted Likelihood
   2.3 The MAMSE Weights
   2.4 An Algorithm to Compute the MAMSE Weights

3 MAMSE Weights for Univariate Data
   3.1 Notation and Review
   3.2 Definition of the MAMSE Weights
   3.3 Computing the MAMSE Weights
   3.4 Structural Properties of the MAMSE Weights
   3.5 Strong Uniform Convergence of the Weighted Empirical CDF
   3.6 Weighted Strong Law of Large Numbers
   3.7 Strong Consistency of the Maximum Weighted Likelihood Estimate
   3.8 Asymptotic Behavior of the MAMSE Weights
   3.9 Issues for Asymptotic Normality
   3.10 Simulations
      3.10.1 Two Normal Distributions
      3.10.2 Complementary Populations
      3.10.3 Negative Weights
      3.10.4 Earthquake Data

4 MAMSE Weights for Right-Censored Data
   4.1 Notation and Review of the Kaplan-Meier Estimate
   4.2 Definition of the MAMSE Weights for Right-Censored Data
   4.3 Computing the MAMSE Weights for Right-Censored Data
   4.4 Uniform Convergence of the MAMSE-Weighted Kaplan-Meier Estimate
   4.5 Simulations
      4.5.1 Survival in the USA
      4.5.2 Survival in Canada

5 MAMSE Weights for Copulas
   5.1 Review of Copulas
   5.2 Notation and Empirical Estimation
   5.3 Definition of the MAMSE Weights for Copulas
   5.4 Computing the MAMSE Weights for Copulas
   5.5 Uniform Convergence of the MAMSE-Weighted Empirical Copula
   5.6 Weighted Coefficients of Correlation Based on Ranks
   5.7 Weighted Strong Law of Large Numbers
   5.8 Strong Consistency of the Weighted Pseudo-Likelihood
   5.9 Simulations
      5.9.1 Different Correlations for Bivariate Copulas
      5.9.2 Different Bivariate Distributions with Common Correlation
      5.9.3 Multivariate Normal

6 Summary and Future Research
   6.1 Summary of the Work
   6.2 Future Research

7 Conclusion

Bibliography

List of Tables

3.1 Average MAMSE weights for Population 1 when equal samples of size n are drawn from Normal distributions with unit variance and means 0 and Δ respectively. The results are averages over 10000 replicates.

3.2 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of equal size n are simulated from Normal distributions with unit variance and means 0 and Δ respectively. The results are averaged over 10000 replicates.

3.3 Average MAMSE weights for Population 1 when samples of size n and 10n are drawn from Normal distributions with unit variance and means 0 and Δ respectively. The results are averages over 40000 replicates.

3.4 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of sizes n and 10n are simulated from Normal distributions with unit variance and means 0 and Δ respectively. The results are averaged over 40000 replicates.

3.5 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) and average MAMSE weights allocated to samples of sizes n drawn from N(0,1), |N(0,1)| and −|N(0,1)| respectively. The results are averages over 10000 repetitions.

3.6 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the MAMSE weights are calculated without the constraints λ_i ≥ 0. Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.
3.7 Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the usual MAMSE weights (i.e. constrained to nonnegative values) are used. Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

3.8 Relative efficiency of the MWLE with and without the constraints λ_i ≥ 0 as measured by 100 MSE(constrained MWLE)/MSE(unconstrained MWLE). Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

3.9 Number of earthquakes in three Western Canadian areas between the 12th of February 2001 and the 12th of February 2006. The magnitude of these earthquakes is modeled by a Gamma distribution; the maximum likelihood estimates appear above and are used as the "true" parameters for this simulation.

3.10 Efficiency in estimating some probabilities about the magnitude of the next earthquake in the Lower Mainland - Vancouver Island area, followed by the average of the actual estimates and their true values. Efficiency is measured by 100 MSE(plug-in MLE)/MSE(plug-in MWLE). The first four columns of probabilities should be multiplied by the corresponding multiplier.

4.1 Relative performance of the WKME as measured by 100 A_1/A_λ. Both areas are calculated on the interval [0, 55] and various upper bounds U are used to determine the weights.
Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times.

4.2 Relative performance of the WKME as measured by 100 A_1/A_λ. Areas are calculated on the interval [0, U − 5], where U is the upper bound used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times.

4.3 Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating F_1(55) = 0.11976 as measured by 100 MSE{F̂_1(55)}/MSE{F̂_λ(55)}. Different choices of U and n are considered. Each scenario is repeated 20000 times.

4.4 Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating F_1⁻¹(0.10) = 52.081 as measured by 100 MSE(Q_1)/MSE(Q_λ). Different choices of U and n are considered. Each scenario is repeated 20000 times.

4.5 Average MAMSE weights for different rates of censoring p and different sample sizes n. The right-hand side of the table presents the relative performance of the WKME as measured by 100 A_1/A_λ. Figures are averaged over 20000 repetitions and the values U = 80 and T = 75 are used.

4.6 Relative performance of the weighted Kaplan-Meier estimate compared to the Kaplan-Meier estimate as measured by 100 A_1/A_λ. Average MAMSE weights are also shown, but they are multiplied by a factor of 1000 for easier reading. In the simulation with males and females, the average weights in italics refer to the male populations. Note that U = 90, T = 85 and that all figures are based on 10000 repetitions.
5.1 Population value of different coefficients of correlation.

5.2 Empirical estimates of different coefficients of correlation.

5.3 Values of ρ under two different scenarios that are simulated for different families of copulas.

5.4 Performance of a MAMSE-weighted coefficient of correlation based on ranks as measured by 100 MSE(ρ̂)/MSE(ρ̂_λ) for different scenarios and sample sizes n ∈ {20, 50, 100, 250}. Each figure is averaged over 10000 repetitions.

5.5 Performance of the maximum weighted pseudo-likelihood estimate as measured by 100 MSE(θ̂)/MSE(θ̂_λ) for different scenarios and sample sizes n ∈ {20, 50, 100, 250}. Each figure is averaged over 10000 repetitions.

5.6 Distributions from which bivariate random variables are drawn. The choice of parameters in each population yields a Spearman correlation of ρ = 0.20.

5.7 Average MAMSE weights as well as the relative efficiency of the weighted correlation and of the MWPLE when compared to their non-weighted counterparts. Samples of size n ∈ {20, 50, 100, 250} are drawn from five different bivariate distributions that have a common correlation of ρ = 0.2. Figures are averaged over 10000 repetitions.

5.8 Parameters of the simulation for 3-variate Normal variables. Population 1 is from the target distribution, but the other populations are affected by measurement errors.

5.9 Average weight allocated to each of four 3-variate Normal distributions. Population 1 is observed from the target distribution, but the other populations contain measurement errors. The values are averaged over 5000 repetitions.

5.10 Relative performance of the MWPLE when compared to the MPLE as measured by 100 MSE(Γ̂_1)/MSE(Γ̂_λ), or an equivalent ratio for the individual entries of Γ_1. Population 1 is observed from a 3-variate Normal distribution and the other populations contain measurement errors. The values are
The values are averaged over 5000 repetitions  154  5.11 Parameters of the simulations for 4-variate Normal variables. Population 1 comes from the target distribution, but the other populations are affected by measurement errors  156  List of Tables 5.12 Average weight allocated to each of four 4-variate Normal distributions. Population 1 is observed from the target distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions.  157  5.13 Relative performance of the M W P L E when compared to the M P L E for 4variate Normal distributions. Population 1 is observed from the target distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions  157  x  List of Figures 3.1  Average values of 100x the MAMSE weights without the constraints A; > 0. Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a and a A/"(A, 1) distribution. All results are averages over  3.2  40000  N(0,1)  repetitions.  58  Average values of 100x the usual MAMSE weights (with constraints A; > 0). Samples of size n, lOn and n are taken from each population. Population 2 is an equal mixture of Populations  1  and 3 that respectively follow a A/"(0,1)  and a A/"(A, 1) distribution. All results are averages over 3.3  40000  repetitions.  Histograms of the magnitude of earthquakes measured between the \2  th  February  2001  and the 12  th  of February  2006  of  for three different Western  Canadian areas. 
The curves correspond to the fitted Gamma density.

4.1 Graphics representing the Riemann sums used in the proof of Case 2 (left panel) and Case 3 (right panel).

4.2 Survival functions for subgroups of the American population as taken from the life tables of the National Center for Health Statistics (1997).

4.3 Average value of the MAMSE weights for different upper bounds U and sample sizes n. The cells' areas are proportional to the average weight allocated to each population. The figures correspond to 100 A_1/A_λ and are averaged over 20000 repetitions.

4.4 Typical examples of the weighted Kaplan-Meier estimate (dashed line) and of the usual Kaplan-Meier estimate (plain black line) for different sample sizes. Note that U = 80 and T = 75. The true distribution is depicted by a smooth gray line.

4.5 Survival functions of Canadians as taken from the life tables of Statistics Canada (2006). Survival functions are available for males and females of each province. The curves for New Brunswick are repeated as gray dotted lines on each panel to facilitate comparisons since they are the populations of interest.

4.6 Typical examples of estimates under different scenarios. Note the increased number of steps of the weighted Kaplan-Meier estimate (dashed line), which makes it smoother than the usual Kaplan-Meier estimate (plain black line). The true distribution appears as a smooth gray curve.

5.1 Average value of the MAMSE weights for scenarios where samples of equal sizes n are drawn from 5 different populations. The cell areas are proportional to the average weight allocated to each population. The numbers correspond to 100·λ_1 and are averaged over 10000 repetitions.

Acknowledgements

I would like to thank my supervisor Jim Zidek for the numerous inspiring discussions we had about the foundations of statistics.
With an amazing intellectual tact, Jim gave me enough direction to keep me thinking hard, but enough liberty to let me design this thesis my own way. His guidance will definitely have a deep impact on the way I approach statistics in my future career. I would also like to acknowledge Jim's financial support.

I would like to thank the members of my supervisory committee, John Petkau and Matías Salibián-Barrera, for their insightful comments and suggestions. I would like to extend additional thanks to John for sharing his wisdom on many occasions when I was seeking advice.

The Department of Statistics offers a friendly environment for development and growth. I did appreciate the opportunity to be exposed to different views of statistics. The Department's faculty and staff members as well as fellow students have always been very helpful. I am thankful to them.

I made many very good friends during my stay in Vancouver. They brightened my days and will truly be missed. Thank you for the friendship, the support and for the fun!

I want to thank Christian Genest, my Master's supervisor, for his outstanding guidance while I was at Laval University. I learned so much under his supervision that his teachings are still making a significant difference in my work and in my understanding of statistics as a science and as a discipline.

Thanks to my family for their constant support and prayers. I had the chance to grow up in a healthy and supportive familial environment that forged my personality. I am thankful for that. Thank you Pascale for sharing my life through the second half of this endeavour.

For partial support of this work through graduate scholarships and research grants, thanks are due to the Natural Sciences and Engineering Research Council of Canada and to the Fonds québécois de la recherche sur la nature et les technologies.
Chapter 1

Introduction

Suppose that you must make inference about a population from which you obtained a sample, and that additional data are available from populations whose distributions may be similar to the target, but not necessarily identical. The weighted likelihood allows the relevant information contained in those data to be used and incorporated into the inference. In this thesis, we propose a data-based criterion to determine likelihood weights and show that they can successfully trade bias for precision.

1.1 Review

The expression weighted likelihood refers to a few different statistical methods. In survey sampling theory, a method called weighted likelihood is used to adjust for a response-dependent sampling intensity; see Krieger & Pfeffermann (1992). The increased robustness of estimates obtained through weighted likelihood equations has been studied by several groups of statisticians, including Markatou et al. (1997) and Ahmed et al. (2005). However, both these methods are only remotely related to the relevance weighted likelihood as defined by Hu (1994). In that context, the weighted likelihood is an extension of the likelihood that uses data from multiple populations for making inference. Statistical models typically use information about data that come from the target distribution. When other possibly similar data are available, the relevant information they contain is ignored unless their similarity with the population of interest is precisely incorporated in the model.

The original work of Hu (1994), enhanced in Hu & Zidek (2001) and Hu & Zidek (2002), proposes use of the weighted likelihood to capture the relevant information from data whose distribution may differ from that of interest, hence trading bias for precision. A specific paradigm is derived from the work of Hu and Zidek by Wang (2001) and further developed in Wang, van Eeden & Zidek (2004) and in Wang & Zidek (2005).
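To fix ideas before the formal development, the bias-for-precision trade behind the weighted likelihood can be illustrated numerically. The sketch below is a hypothetical toy example, not code from the thesis: it assumes the exponent form λ_i/n_i for the weighted log-likelihood of a N(θ, 1) model, in which case the maximizer is simply the λ-weighted average of the sample means.

```python
import numpy as np

def weighted_loglik(theta, samples, lam):
    # Weighted log-likelihood for a N(theta, 1) model (constants dropped):
    # sum_i (lam_i / n_i) * sum_j log f(x_ij | theta).
    return sum(l / len(s) * np.sum(-0.5 * (s - theta) ** 2)
               for l, s in zip(lam, samples))

# Two populations: the target and a similar but shifted one.
samples = [np.array([0.0, 2.0]), np.array([4.0, 6.0])]
lam = [0.75, 0.25]                      # downweight the dissimilar sample

# For this model the MWLE is the lambda-weighted average of sample means.
mwle = np.dot(lam, [s.mean() for s in samples])

# Check numerically that this value maximizes the weighted log-likelihood.
grid = np.linspace(0.0, 6.0, 601)
best = grid[np.argmax([weighted_loglik(t, samples, lam) for t in grid])]
```

With all the weight on the first sample the estimate would be its mean alone; giving some weight to the second sample pulls the estimate toward it (here, to 2.0), introducing bias but reducing variance when the populations are in fact similar.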
We adopt that paradigm throughout this thesis.

Suppose the data come from m distinct populations that have different yet similar distributions. More formally, for each fixed i = 1, ..., m,

X_{i1}, ..., X_{in_i} ~ F_i

where F_i denotes the cumulative distribution function (CDF) of each population. We denote by f_i the corresponding density or mass function for population i = 1, ..., m. The family of density (or mass) functions f(x|θ) indexed by θ ∈ Θ is used to model the data. Population 1 is of inferential interest, but the weighted likelihood

L_λ(θ) = ∏_{i=1}^m ∏_{j=1}^{n_i} f(X_{ij}|θ)^{λ_i/n_i}

lets other populations contribute to the inference so that the relevant information they contain is not lost. The vector of exponents λ = [λ_1, ..., λ_m]^T downweights data according to the degree to which they are thought to be dissimilar from Population 1. The expression for the weighted log-likelihood may be more intuitive:

ℓ_λ(θ) = ∑_{i=1}^m (λ_i/n_i) ∑_{j=1}^{n_i} log f(X_{ij}|θ).

The maximum weighted likelihood estimate (MWLE) is a value of θ that maximizes L_λ(θ). Ideally, the weights λ_i would be determined from scientific knowledge. However, using data-based weights seems more pragmatic. The ad hoc methods proposed by Hu & Zidek (2002) as well as the cross-validation weights of Wang (2001) and Wang & Zidek (2005) are the only adaptive weights proposed in the literature for the paradigm considered. However, none of these solutions is fully satisfactory. In particular, the cross-validation weights (CVW) suffer from instability problems. The simulation results in Wang (2001) and Wang & Zidek (2005) can be reproduced only at the cost of fine-tuning the algorithms numerically. In the ideal situation where some of the populations are identical to Population 1, the weights may not even be defined.

Let F̂_i denote the empirical distribution function (EDF) based on the sample from Population i. Hu & Zidek (1993) call F̂_λ = ∑_{i=1}^m λ_i F̂_i the relevance weighted empirical distribution (REWED) when the λ_i ≥ 0 add to 1. Hu & Zidek (2002) show that the REWED is the nonparametric maximum weighted likelihood estimate of the target distribution function. In Chapter 2, we look at the link between the REWED and the weighted likelihood from a different angle. Note that Ahmed (2000) uses a paradigm similar to the one retained in this thesis, as he considers samples of different sizes from m populations to develop a Stein-type estimator of the quantiles of the distribution. The properties of his method and its links to the REWED and the weighted likelihood are, however, not the object of this work.

In this thesis, we propose a data-based criterion to determine likelihood weights which yields consistent estimates. Inferential methods based on our proposed weights perform at least as well as the CVW in comparable simulations, but without the instability problems. We also propose other weighted methods based on the newly proposed weights that are shown to perform well and to be asymptotically unbiased.

1.2 Overview

In Chapter 2, we present our heuristic interpretation of the weighted likelihood that links it to mixture distributions. This leads to the general definition of the MAMSE (Minimum Averaged Mean Squared Error) weights, and we provide an algorithm to calculate them.

The ensuing three chapters are implementations of the idea of the MAMSE weights in three different contexts. In Chapter 3, univariate data are treated using the empirical distribution function. The properties of the resulting nonparametric MAMSE weights are studied. Asymptotically, the MAMSE-weighted empirical distribution function converges to the target distribution and the maximum weighted likelihood estimate is strongly consistent. Finally, simulations show that improved performance is indeed possible on samples of realistic sizes.

Chapter 4 introduces MAMSE weights for right-censored data by using the Kaplan-Meier estimate to infer the CDF of each population. The weights are used to define a weighted Kaplan-Meier estimate that converges uniformly to the target distribution. Simulations show that improved performance is once again possible.

Finally, MAMSE weights for multivariate data are defined using the empirical copula. The MAMSE-weighted mixture of empirical copulas converges uniformly to the copula underlying the target distribution. We define MAMSE-weighted coefficients of correlation and show that they are consistent estimates of the correlation in the target population. To infer the copula underlying the target population, the maximum weighted pseudo-likelihood estimate (MWPLE) is proposed. We show that the MWPLE is consistent when calculated with MAMSE weights. Simulations confirm once more that MAMSE-weighted methods may perform better than their unweighted counterparts, even in subasymptotic ranges.

We conclude this thesis with a summary and some suggestions for future research.

Chapter 2

Likelihood, Entropy and Mixture Distributions

This chapter links the likelihood to the weighted likelihood by showing that both expressions can be derived as particular cases of the entropy maximization principle proposed by Akaike (1977).

After drawing the connection between relative entropy and the likelihood, we heuristically extend that link to the relationship between mixture distributions and the weighted likelihood. This leads us to formulate a general criterion to determine likelihood weights, called the "MAMSE" weights. We propose an algorithm to compute the MAMSE weights and prove its convergence.

2.1 Likelihood and Entropy

Consider first the one-sample situation where n independent data points, Y_1, ..., Y_n, come from a distribution whose unknown and unknowable density is g(y). In his pioneering work, Akaike (1977) argues that the goal of inference should be the estimation of g(y). When a parametric model f(y|θ) is to be used, Akaike proposes maximizing the relative entropy:

B(g, f) = ∫ log{f(y|θ)/g(y)} g(y) dy.

The relative entropy is in fact minus the Kullback-Leibler divergence between f and g. In that sense, it is a measure of the proximity of the distributions f and g. The expression for B(g, f) can be further simplified:

B(g, f) = −∫ log{g(y)/f(y|θ)} g(y) dy = ∫ log{f(y|θ)} g(y) dy − ∫ log{g(y)} g(y) dy.

In particular, when the objective is to maximize B(g, f) as a function of θ, the last term of the rightmost expression can be ignored since it does not depend on θ.

Calculating the entropy would require the knowledge of the unknown and unknowable true distribution g. We thus have to estimate it. Let

Ĝ(y) = (1/n) ∑_{i=1}^n 1(Y_i ≤ y)

be the empirical distribution function (EDF) of the dataset Y_1, ..., Y_n. The indicator variable 1(·) is equal to one if all the elements of its argument are true and equal to 0 otherwise. Using dĜ(y) as an approximation to dG(y) = g(y) dy yields

∫ log{f(y|θ)} dĜ(y) = (1/n) ∑_{i=1}^n log f(Y_i|θ),

the log-likelihood! Therefore, calculating the likelihood is equivalent to calculating the entropy where the true distribution is estimated by the empirical distribution of the data. Hence, the maximum likelihood estimate can be seen as a special case of Akaike's entropy maximization principle.

2.2 The Weighted Likelihood

Consider now the m-population paradigm of Wang (2001) introduced in Chapter 1. With appropriate weights, the mixture F_λ = ∑_{i=1}^m λ_i F_i can be arbitrarily close to F_1. Let F̂_i denote the EDF based on the sample from Population i. The weighted EDF, written
Hu & Zidek (1993) note the implicit bias involved in defacto replacing Fi by F\, but do not investigate as we do here the possibility of trading bias for precision. In the context of maximum entropy, consider using the weighted EDF as an estimate of F . Then, x  m  /  \ogf(x\6)dF (x) x  =  .  J2^i  m  / logf(x\B)dF (x) i  i=i  .  rii  Y,-J2^gf(X \e),  =  lJ  i=x  j=i  the weighted log-likelihood. The maximum weighted likelihood can thus be derived from Akaike's entropy maximization principle.  2.3  The M A M S E Weights  Let Fi(x) represent an estimate of the CDF of Population i.  Based on the heuristics  presented in the previous section, the weighted likelihood is akin to using a mixture of the CDFs of the m populations from which data are available. Let m  F (x) x  = J2 i i(x) x  F  i=i  with A = [ A i , . . . , A ] , Aj > 0 and A 1 = 1 be such a mixture. A wise choice of likelihood T  T  m  weights should make: • F\(x) close to Fi(x),  the target distribution,  • F\(x) less variable than  F\(x).  7  Chapter 2. Likelihood, Entropy and Mixture Distributions We combine the two requirements above in an objective function. Let p(x) be a probability measure that has a domain comparable to the distribution of interest and define r  P(A) = 7  •{P (x)-F (x)}  L  1  i  x  m  + ^ A f v a r { F ( x ) } d/i(x) i  (2.1)  i=l  The term vaf |i<i(:r) j is a form of penalty to foster decreasing the variance by using as many of the m populations as necessary. In the coming chapters, we will use different estimates Fj depending on the type of data at hand. The exact expression for vaf JFj(:c) j will depend on that choice; see Sections 3.1, 4.2 and 5.3 for specific examples. The probability measure p(x) may also depend on the data and will be specified later. We call M A M S E (minimum averaged mean squared error) weights any vector of values A = [ A i , . . . , A ] that solves the program T  m  minimize  (2.2)  -P(A) 771  '  subject to  {Aj >0,i = 1,..., m} and ^  Aj = 1.  
i=l  The name M A M S E is motivated by the conceptual resemblance of the integrand with the mean squared error (Bias 4- Variance). 2  Note that the objective function P(A) is quadratic in A and must be optimized under linear constraints. Substituting Ai = 1 —  a u o w s  the constraint Y^Li Aj = 1 to  embedded into the objective function. Let us write  A  2  Fx{x)  -F (x) 2  and  A = -  F (x) m  8  Chapter 2. Likelihood, Entropy and Mixture Distributions var{F (x)} 2  V(x) = var{F (x)} m  Then, the function P(A) can be written as  A(z) - £ \ F i i x )  -  ll  - £ ^ j  |  dp(x) i=2  i=2  /  I" m  -£(_)}  /  . Li=2  +(1 - A 1 ) var { A (x)} + J _ A? var { £ ( x ) } dp.(x) T  2  i=2  =  J  { A V ( x ) } + ( l - 2 A l + A l l A ) v i : r | F i ( a ; ) } + A V ( x ) A dp(x)  =  J  A  2  T  T  T  T  T  ^(x)J (x) +V(x) + ll var|Fi(x)}] A r  T  T  -2A 1 var { F I ( X ) } + var { A ( x ) } d/u(x) T  =  A AA-2A 16 + 6 T  (2.3)  T  where  A  =  b =  J ^F(x)F(x) + V ( x ) + l l v a r { A ( x ) } T  T  d/j(z)  f var [ F I ( X ) | d//(x).  Hence, the objective function F(A) can be written as a quadratic function of A involving matrices and vectors whose entries depend on Fj(x) and its estimated variance. Note that a similar quadratic development is possible with A as well.  9  Chapter 2. Likelihood, Entropy and Mixture Distributions  2.4  A n Algorithm to Compute the M A M S E Weights  To improve inference about Population 1, we attempt to use the data from m — 1 other populations. Suppose that one of the other populations is different enough that its sample does not even overlap with that of Population 1. Extracting useful information from that sample seems unlikely and it is nearly impossible to evaluate to which degree the populations are similar, except for suspecting that they are probably quite different. We suggest to preprocess the samples from the m populations to discard those that are clearly different from the target population. 
Some samples may also be discarded because they are difficult to compare with the target. Specifics of the preprocessing are described for all three versions of the MAMSE weights in their respective chapters. Typically, the preprocessing steps will also insure that Assumption 2.1 is respected. Assumption 2.1 J var{Fi(x)}d^(x) > 0 for any Population  i that is considered for the optimization  of (2.2).  All samples that fail the preprocessing are assigned weights of zero, hence they do not appear in the M A M S E objective function. For simplicity, the notation used in this section does not reflect the possibly reduced number of populations considered: we suppose that m populations remain after preprocessing. If we ignore the constraints Aj > 0, the MAMSE weights are the solution to the equation A A = 61. To ensure the weights are nonnegative, we apply the following algorithm and denote its solution by A* (or A*). 1. Solve the equation A A = 61; 2. if all the weights obtained are nonnegative, stop. Otherwise set the negative weights to 0, ignore the corresponding samples and repeat from Step 1 with the reduced  10  Chapter 2. Likelihood, Entropy and Mixture Distributions  system. The weight allocated to Population 1 from Step 1 cannot be negative (see Lemma 2.4). If no other samples are left, then A = 0 and Ai = 1. The objective function P(X) is quadratic and positive (thus convex). Since the constraints form a convex set, intuition suggests that A* should be the global constrained minimum. We prove this more formally next.  Consider the generic program minimize  -P(A) h(A) < 0  subject to  where A G IR and h(A) = [/ii(A),..., hk(X)] is a vector of functions, each being from m  T  IR to IR. Let V P ( A ) denote the gradient of P and P(A) its Hessian. Similarly, V/i*(A) is m  the gradient of hi(X) and H i (A) its Hessian. By definition, an m x m matrix B is positive definite (denoted B y 0) if y £ y > 0 for all y G IR \{0}. 
The Kuhn-Tucker conditions are necessary conditions that any solution to a program like (2.2) must respect. The second order version that follows (see for instance Luenberger (2003), page 316) gives necessary and sufficient conditions to show that a given λ̃* solves program (2.2).

Theorem 2.1 (Kuhn-Tucker Second Order Sufficiency Conditions) Let h₁, ..., h_k and P be continuous and twice differentiable functions from ℝᵐ to ℝ. Sufficient conditions that a point λ* be a strict relative minimum point of the program above are that there exists μ = [μ₁, ..., μ_k]ᵀ ∈ ℝᵏ such that μ ≥ 0, μᵀh(λ*) = 0,

∇P(λ*) + Σ_{i=1}^k μ_i ∇h_i(λ*) = 0  (2.4)

and the matrix

∇²P(λ*) + Σ_{i=1}^k μ_i H_i(λ*) ≻ 0.  (2.5)

Note that from Equation (2.3), we have ∇P(λ̃) = Aλ̃ − 1b and ∇²P(λ̃) = A, dropping an immaterial factor of 2. The function P(λ̃) and its derivatives do not depend on λ₁ since it was replaced by λ₁ = 1 − 1ᵀλ̃. Consequently, it is implicitly understood in the following that P and the h_i are functions of λ̃ ∈ ℝ^{m−1}, even when we write P(λ) and h_i(λ) rather than P(λ̃) and h_i(λ̃).

Lemma 2.1 The Hessian matrix ∇²P(λ̃) is positive definite.

Proof of Lemma 2.1. Remember that

A = ∫ [F̃(x)F̃(x)ᵀ + V(x) + 11ᵀ var{F₁(x)}] dμ(x).

For any fixed x, each term of the integrand as written above is nonnegative definite. In particular, for any y ∈ ℝ^{m−1}\{0},

yᵀ{F̃(x)F̃(x)ᵀ}y = {F̃(x)ᵀy}² ≥ 0  and  yᵀ11ᵀy ∫ var{F₁(x)} dμ(x) ≥ 0

because of Assumption 2.1. As another consequence of Assumption 2.1, the diagonal elements of ∫ V(x) dμ(x) are strictly positive, meaning that

yᵀ [∫ V(x) dμ(x)] y = Σ_{i=1}^{m−1} y_i² ∫ var{F_{i+1}(x)} dμ(x) > 0

for any y ∈ ℝ^{m−1}\{0}. Consequently, yᵀAy > 0, i.e. the Hessian of P, ∇²P(λ̃) = A, is positive definite. ∎

Corollary 2.1 Equation (2.5) is satisfied.

Proof of Corollary 2.1.
In our implementation of the general Kuhn-Tucker conditions, h_i(λ̃) = −λ_{i+1}. Therefore, the H_i(λ̃) = ∇∇ᵀh_i(λ̃) = 0 are null matrices. From Lemma 2.1, we know that ∇²P(λ̃*) is positive definite, hence Equation (2.5) is satisfied. ∎

Applying the algorithm above will change negative weights to λ*_{i+1} = 0 for some i ∈ I^c ⊂ {1, ..., m−1}, where I^c may be null. The set I contains the remaining indices and may also be null.

Let J ⊂ {1, ..., m−1} be a possibly null subset of indices; then A_{I,J} is the sub-matrix of A for the rows i ∈ I and the columns j ∈ J. We define the subvector λ̃_J similarly. The proposed algorithm involves solving reduced systems where the rows and columns for i ∈ I^c are excluded. The system of equations that has to be solved then involves the matrix

A_{I,I} = ∫ [F̃_I(x)F̃_I(x)ᵀ + V_{I,I}(x) + 1_I 1_Iᵀ var{F₁(x)}] dμ(x).

For convenience of exposition, suppose that the order of appearance of the λ_i in λ̃ is such that all the λ_i that are "forced" to be zero are last. Then, with F̃(x) = [F̃_I(x)ᵀ, F̃_{I^c}(x)ᵀ]ᵀ, we can write

A = ∫ [ F̃(x)F̃(x)ᵀ + V(x) + 11ᵀ var{F₁(x)} ] dμ(x)
  = [ A_{I,I}    A_{I,I^c}
      A_{I^c,I}  A_{I^c,I^c} ],

where each block A_{I,J} has the form displayed above, with the rows restricted to I and the columns to J. Therefore, the last step of the proposed algorithm is to solve the system of equations A_{I,I}λ̃_I = 1_I b.

Lemma 2.2 If I^c ≠ ∅ and y ∈ ℝ^{m−1}\{0} is any nonnegative vector with y_I = 0 and y_{I^c} ≥ 0, then ∇P(λ̃*)ᵀy > 0.

Proof of Lemma 2.2. First note that the expression ∇P(λ̃*)ᵀy corresponds to the directional derivative of P at λ̃* in the direction y. Next consider the unit vector e_i ∈ ℝ^{m−1} whose i-th element is 1. For i ∈ I^c, the global unconstrained minimum of the convex function P is outside of the half-space λ_{i+1} ≥ 0.
Therefore, P increases in the direction e_i at λ̃*, and thus ∇P(λ̃*)ᵀe_i > 0.

Finally, the hypothesized vector y can be expressed as a linear combination of the vectors {e_i : i ∈ I^c} with nonnegative coefficients y_i. Therefore, ∇P(λ̃*)ᵀy = Σ_{i∈I^c} y_i ∇P(λ̃*)ᵀe_i > 0. ∎

Although I = ∅ or I^c = ∅ may occur, the following proofs hold under these special cases.

Lemma 2.3 The proposed algorithm solves the quadratic program

minimize P(λ̃)  subject to  {λ_i ≥ 0, i = 2, ..., m}.

Proof of Lemma 2.3. To verify that the Kuhn-Tucker conditions are satisfied, first note that for i = 1, ..., m−1 the functions h_i(λ̃) = −λ_{i+1} are continuous and twice differentiable. The quadratic objective function P(λ̃) shares the same smoothness properties. Moreover, Corollary 2.1 establishes that Equation (2.5) holds.

At termination, the algorithm yields λ̃*_I ≥ 0 and λ̃*_{I^c} = 0. The proposed solution λ̃* is thus in the feasible set. It remains to show that there exists a μ satisfying the Kuhn-Tucker conditions stated earlier. We will show that μ = ∇P(λ̃*) satisfies the required properties.

Expression (2.4) can be written

∇P(λ̃*) + Σ_{i=1}^{m−1} μ_i (−e_i) = ∇P(λ̃*) − μ = 0

and clearly holds for μ = ∇P(λ̃*). The other Kuhn-Tucker conditions require that μ ≥ 0 and μᵀλ̃* = 0.

The last step of the algorithm before termination is to solve A_{I,I}λ̃_I = 1_I b. Therefore,

μ_I = [∇P(λ̃*)]_I = [Aλ̃* − 1b]_I = A_{I,I}λ̃*_I + A_{I,I^c}λ̃*_{I^c} − 1_I b = 0

since λ̃*_{I^c} = 0.

In addition, we have from Lemma 2.2 that μ_i = μᵀe_i = ∇P(λ̃*)ᵀe_i > 0 for all i ∈ I^c, and hence μ_{I^c} ≥ 0. Therefore, μ ≥ 0.

We can write the condition μᵀλ̃* = 0 as μ_Iᵀλ̃*_I + μ_{I^c}ᵀλ̃*_{I^c} = 0. It is shown above that μ_I = 0, hence μ_Iᵀλ̃*_I = 0. Moreover, the definition of the set I^c implies that λ̃*_{I^c} = 0, thus μ_{I^c}ᵀλ̃*_{I^c} = 0 and the condition is satisfied.
Consequently, the solution found by the proposed algorithm is a strict relative minimum since it satisfies the sufficient Kuhn-Tucker conditions. ∎

Lemma 2.4 The solution found by the proposed algorithm satisfies the additional constraint Σ_{i=2}^m λ_i < 1, or equivalently, λ₁ > 0.

Proof of Lemma 2.4. The solution found by the algorithm satisfies A_{I,I}λ̃*_I = 1_I b. Expanding A_{I,I} in this equation yields

∫ [F̃_I(x)F̃_I(x)ᵀ + V_{I,I}(x) + 1_I 1_Iᵀ var{F₁(x)}] dμ(x) λ̃*_I = 1_I b.

By subtracting b1_I 1_Iᵀλ̃*_I from both sides and multiplying the resulting equation by λ̃*_Iᵀ on the left, we have

λ̃*_Iᵀ ∫ [F̃_I(x)F̃_I(x)ᵀ + V_{I,I}(x)] dμ(x) λ̃*_I = b λ̃*_Iᵀ(1_I − 1_I 1_Iᵀλ̃*_I) = b λ̃*_Iᵀ1_I (1 − 1_Iᵀλ̃*_I) = b (1 − Σ_{i∈I} λ*_{i+1}) (Σ_{i∈I} λ*_{i+1}).

By the same argument as in Lemma 2.1, the matrix on the left-hand side is positive definite, and hence the expression itself is positive. Since b and λ̃*_I are positive, we necessarily have 1 − Σ_{i∈I} λ*_{i+1} > 0. Hence, the solution to the program in Lemma 2.3 always satisfies the additional constraint Σ_{i∈I} λ*_{i+1} = Σ_{i=2}^m λ*_i < 1 (remember that λ̃*_{I^c} = 0). This inequality is equivalent to λ*₁ > 0.

Regarding the comment to the effect that λ₁ cannot be negative for intermediate steps, consider the development above for such steps, where λ̃_I may still contain negative values. Note that the left-hand side of the expression is still positive because of its positive definiteness. Moreover, the right-hand side can be written as λ₁(1 − λ₁)b, meaning that λ₁(1 − λ₁) is positive. Therefore, λ₁ ∈ (0, 1), except if I = ∅, in which case λ₁ = 1 and λ̃ = 0. ∎

Theorem 2.2 The proposed algorithm solves the quadratic program

minimize P(λ)  subject to  {λ_i ≥ 0, i = 1, ..., m}  and  Σ_{i=1}^m λ_i = 1.

Proof of Theorem 2.2. The result follows from Lemmas 2.3 and 2.4. ∎

We now define MAMSE weights for specific types of data and study their properties.
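The two-step algorithm whose validity Theorem 2.2 establishes is simple to implement. The sketch below is a minimal illustration (the function name mamse_solve and the toy inputs are hypothetical, and the immaterial factor of 2 in the gradient is dropped as in the text):

```python
import numpy as np

def mamse_solve(A, b):
    """Sketch of the Section 2.4 algorithm: solve A lam = b 1, set any negative
    weights to zero, and re-solve on the reduced system until all remaining
    weights are nonnegative.  Returns (lam_1, ..., lam_m) with lam_1 = 1 - sum."""
    m1 = A.shape[0]                       # m - 1 free weights lam_2, ..., lam_m
    active = list(range(m1))              # indices not yet forced to zero
    lam = np.zeros(m1)
    while active:
        sub = np.linalg.solve(A[np.ix_(active, active)], b * np.ones(len(active)))
        if np.all(sub >= 0):              # Step 2: stop when all are nonnegative
            lam[active] = sub
            break
        active = [i for i, s in zip(active, sub) if s >= 0]
        lam[:] = 0.0                      # drop negative weights, repeat Step 1
    lam1 = 1.0 - lam.sum()                # Lemma 2.4: always positive
    return np.concatenate(([lam1], lam))

# Hypothetical positive definite A and b > 0, purely for illustration.
A = np.array([[0.05, 0.01],
              [0.01, 0.08]])
weights = mamse_solve(A, b=0.02)
assert np.isclose(weights.sum(), 1.0) and np.all(weights >= 0)
```

Because A is positive definite (Lemma 2.1), every reduced system is well posed; if all remaining weights come out negative, the loop empties and the sketch returns λ₁ = 1 with λ̃ = 0, matching the text.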
Chapter 3

MAMSE Weights for Univariate Data

We define MAMSE weights for univariate data by using the empirical distribution function (EDF) based on the sample from each population as an estimate of their CDF. This choice yields nonparametric MAMSE weights whose interesting asymptotic properties are studied. In particular, the MAMSE-weighted mixture of empirical distributions converges to the target distribution, and the maximum weighted likelihood estimate (MWLE) is a strongly consistent estimate of the true parameter when the MAMSE weights are used. Simulation studies evaluate the performance of the MWLE on finite samples.

3.1 Notation and Review

Let (Ω, B(Ω), P) be the sample space on which the random variables

X_ij(ω) : Ω → ℝ,  i = 1, ..., m,  j ∈ ℕ

are defined. The X_ij are assumed to be independent with continuous distribution F_i. We consider samples of nondecreasing sizes: for any positive integer k, the random variables {X_ij : i = 1, ..., m, j = 1, ..., n_ik} are observed. Moreover, the sequences of sample sizes are such that n_1k → ∞ as k → ∞. We do not require that the sample sizes of the other populations tend to ∞, nor do we restrict the rate at which they increase.

Let F_ik(x) be the EDF based on the sample X_ij, j = 1, ..., n_ik, i.e.

F_ik(x) = (1/n_ik) Σ_{j=1}^{n_ik} 1{X_ij ≤ x}.

The empirical measure dF_ik(x) allocates a weight 1/n_ik to each of the observations X_ij, j = 1, ..., n_ik.

Under the assumption that data from each population are independent and identically distributed, as is the case here, the empirical distribution is a nonparametric estimate of the CDF. Indeed, the Glivenko-Cantelli lemma (see e.g. Durrett (2005), page 58) states that

sup_x |F_ik(x) − F_i(x)| → 0

almost surely as n_ik → ∞. For any fixed value of x, one may also note that n_ik F_ik(x) follows a Binomial distribution with parameters n_ik and F_i(x).
Hence, the asymptotic normality of the EDF at fixed x follows from the central limit theorem.

3.2 Definition of the MAMSE Weights

For univariate data, we define the MAMSE weights based on the EDFs of the m populations. The MAMSE objective function assesses Σ_{i=1}^m λ_i F_ik(x) as an estimate of F_1k(x) in terms of bias and variance.

As a preselection step, any sample whose range of values does not overlap with that of Population 1 will be discarded. For the remaining m samples, the MAMSE weights will be the vector λ = [λ₁, ..., λ_m]ᵀ that minimizes

P_k(λ) = ∫ [ {F_1k(x) − Σ_{i=1}^m λ_i F_ik(x)}² + Σ_{i=1}^m λ_i² (1/n_ik) F_ik(x){1 − F_ik(x)} ] dF_1k(x)  (3.1)

under the constraints λ_i ≥ 0 and Σ_{i=1}^m λ_i = 1.

Equation (3.1) is a special case of Equation (2.1) where dμ(x) = dF_1k(x) and where the substitution

var{F_ik(x)} = (1/n_ik) F_ik(x){1 − F_ik(x)}

is based on the variance of the Binomial variable n_ik F_ik(x).

The choice of dμ(x) = dF_1k(x) allows us to integrate where the target distribution F₁(x) has most of its mass. Other choices for μ(x) could be explored in the future.

For ω ∈ Ω and k ∈ ℕ, we denote the MAMSE weights by λ_k(ω) = [λ_1k(ω), ..., λ_mk(ω)]ᵀ. The MAMSE weights are used to define an estimate of the distribution F₁(x),

G_k(x) = Σ_{i=1}^m λ_ik(ω) F_ik(x),

the MAMSE-weighted EDF.

3.3 Computing the MAMSE Weights

The algorithm proposed in Section 2.4 applies to the MAMSE weights for univariate data. To prove the convergence of the algorithm, it is sufficient to show that Assumption 2.1 is respected when the MAMSE weights are defined using Expression (3.1).

Lemma 3.1 Let x be any point within the range of the data set from Population i, i.e. let x ∈ [min_{j≤n_ik} X_ij, max_{j≤n_ik} X_ij). Then var{F_ik(x)} > 0.

Proof of Lemma 3.1. Note that

0 < F_ik(x) = (1/n_ik) #{X_ij ≤ x : j ≤ n_ik} < 1

since the numerator is at least one and at most n_ik − 1, or otherwise x would be outside the specified interval.
Consequently,

var{F_ik(x)} = (1/n_ik) F_ik(x){1 − F_ik(x)} > 0. ∎

Theorem 3.1 Assumption 2.1 is respected for the definition of MAMSE weights suggested in Section 3.2.

Proof of Theorem 3.1. The preprocessing step described in Section 3.2 requires that any sample considered overlaps with that of Population 1. In particular, this means that at least one of the data points from Population 1 will fall within the range of the sample of Population i, meaning that

∫ var{F_ik(x)} dF_1k(x) = (1/n_1k) Σ_{j=1}^{n_1k} (1/n_ik) F_ik(X_1j){1 − F_ik(X_1j)} > 0

since all the terms in the sum are nonnegative and at least one must be positive by Lemma 3.1. ∎

3.4 Structural Properties of the MAMSE Weights

Choosing the empirical distribution function to define the MAMSE weights implies some invariance properties that are discussed next.

Theorem 3.2 The MAMSE weights are invariant to a strictly increasing transformation of the data.

Proof of Theorem 3.2. Let Y_ij = g(X_ij), where g is a strictly increasing function of the real line. Let H_i denote the cumulative distribution function of Y_ij. Then for any x, with y = g(x) and any i = 1, ..., m,

H_ik(y) = (1/n_ik) Σ_{j=1}^{n_ik} 1{Y_ij ≤ y} = (1/n_ik) Σ_{j=1}^{n_ik} 1{g(X_ij) ≤ g(x)} = (1/n_ik) Σ_{j=1}^{n_ik} 1{X_ij ≤ x} = F_ik(x).

Since P_k(λ) is integrated with respect to dF_1k, a discrete measure, there is no Jacobian of transformation in the integral, and replacing all the F_ik by the corresponding H_ik will not change the expression P_k(λ), nor its minimum. ∎

Theorem 3.3 The MAMSE weights do not depend on the parametric model f(x|θ) used in the weighted likelihood.

Proof of Theorem 3.3. The result follows immediately from the definition of the MAMSE weights and the choice of the nonparametric empirical distribution functions as estimates of the F_i. ∎

Theorem 3.4 The MWLE based on MAMSE weights is invariant under a one-to-one reparametrization of f(x|θ) into g(x|τ) = f{x|h(τ)}, i.e. θ̂ is a MWLE if and only if τ̂ = h⁻¹(θ̂) is a MWLE.
Proof of Theorem 3.4. By Theorem 3.3, the MAMSE weights λ_k(ω) = [λ_1k(ω), ..., λ_mk(ω)]ᵀ are invariant to the choice of parametric model f(x|θ). If τ is such that θ = h(τ) and h is a one-to-one mapping of the parameter space, then τ_max is such that

Π_{i=1}^m Π_{j=1}^{n_ik} f{X_ij(ω) | h(τ)}^{λ_ik(ω)/n_ik} ≤ Π_{i=1}^m Π_{j=1}^{n_ik} f{X_ij(ω) | h(τ_max)}^{λ_ik(ω)/n_ik}

for all τ if and only if θ_max = h(τ_max) is such that

Π_{i=1}^m Π_{j=1}^{n_ik} f{X_ij(ω) | θ}^{λ_ik(ω)/n_ik} ≤ Π_{i=1}^m Π_{j=1}^{n_ik} f{X_ij(ω) | θ_max}^{λ_ik(ω)/n_ik}

for all θ. Hence, the MWLE possesses the same functional invariance property as the MLE if we use the MAMSE weights. ∎

3.5 Strong Uniform Convergence of the Weighted Empirical CDF

This section explores the large sample behavior of the MAMSE-weighted EDF. Note that asymptotic results for adaptive weights are obtained by Wang, van Eeden and Zidek (2004), but they do not apply here because their assumption that the weight allocated to Population 1 tends to one as the sample size increases may not be satisfied (see Section 3.8 for more details on the asymptotic behavior of the MAMSE weights).

The proof of uniform convergence of G_k(x) is built as a sequence of lemmas showing that G_k is close to F_1k, and ultimately that

sup_x |G_k(x) − F₁(x)| → 0

almost surely as k → ∞.

Whether a sample is rejected in the preprocessing or not may vary with k and ω. However, as the sample sizes increase, the probability that a sample is rejected tends to zero unless the domain of possible values of a population does not overlap at all with that of Population 1, i.e. unless P(X_i1 < X_11) = 0 or 1. Thus, without loss of generality, we suppose that no samples were excluded by the preprocessing.

Lemma 3.2 For any ω ∈ Ω and k ∈ ℕ,

∫ |G_k(x) − F_1k(x)|² dF_1k(x) ≤ (1/n_1k) ∫ F_1k(x){1 − F_1k(x)} dF_1k(x).

Proof of Lemma 3.2.
For any ω ∈ Ω and k ∈ ℕ, consider

I = ∫ |G_k(x) − F_1k(x)|² dF_1k(x) ≤ ∫ [ |G_k(x) − F_1k(x)|² + Σ_{i=1}^m λ_ik(ω)² var{F_ik(x)} ] dF_1k(x) = P_k{λ_k(ω)}.

By the definition of the MAMSE weights, the expression above is minimized in λ. The suboptimal choice of weights [λ₁, ..., λ_m]ᵀ = [1, 0, ..., 0]ᵀ cannot lead to a smaller value of the objective, i.e.

I ≤ ∫ [ |F_1k(x) − F_1k(x)|² + (1/n_1k) F_1k(x){1 − F_1k(x)} ] dF_1k(x) = (1/n_1k) ∫ F_1k(x){1 − F_1k(x)} dF_1k(x).

This bound is tight since the optimal λ could be arbitrarily close to [1, 0, ..., 0]ᵀ, making I arbitrarily close to the bound above. For instance, letting n_1k → ∞ while the other n_ik's are held constant will do the trick. ∎

Lemma 3.3 There exists Ω₁ ⊂ Ω with P(Ω₁) = 1 such that for all ω ∈ Ω₁ and any fixed k ∈ ℕ,

max_x |G_k(x) − F_1k(x)| ≤ 1/n_1k + max_{j∈{1,...,n_1k}} |G_k{X_1j(ω)} − F_1k{X_1j(ω)}|.

Proof of Lemma 3.3. Define

Ω₀ = {ω ∈ Ω : ∃ i, i′, j, j′ with (i, j) ≠ (i′, j′) and X_ij(ω) = X_i′j′(ω)}.

Since the distributions F_i are continuous, P(Ω₀) = 0. Fix k ∈ ℕ and consider any fixed ω ∈ Ω₁ = Ω\Ω₀. Note that for i ∈ {1, ..., m} and j ∈ {1, ..., n_ik}, [min_ij X_ij(ω), max_ij X_ij(ω)] is a compact set outside of which D(x) = |G_k(x) − F_1k(x)| = 0. Let x₀ be a value maximizing the bounded function D(x). We treat two cases.

Case 1: G_k(x₀) ≤ F_1k(x₀).

Define x₁ = max{X_1j(ω) : j = 1, ..., n_1k, X_1j(ω) ≤ x₀}, the largest data point of Population 1 not exceeding x₀. The step function F_1k(x) is right-continuous, nondecreasing and has equal steps of size 1/n_1k at each observation X_1j(ω). By the choice of x₁ and the definition of the EDF, F_1k(x₁) = F_1k(x₀). The step function G_k(x) is nondecreasing, thus G_k(x₁) ≤ G_k(x₀). Consequently,

|G_k(x₀) − F_1k(x₀)| = F_1k(x₀) − G_k(x₀) ≤ F_1k(x₁) − G_k(x₁) ≤ max_{j∈{1,...,n_1k}} |F_1k{X_1j(ω)} − G_k{X_1j(ω)}| ≤ 1/n_1k + max_{j∈{1,...,n_1k}} |F_1k{X_1j(ω)} − G_k{X_1j(ω)}|.

Case 2: G_k(x₀) > F_1k(x₀).

Define x₂ = min{X_1j(ω) : j = 1, ..., n_1k, X_1j(ω) > x₀}, the smallest data point exceeding x₀ found in Population 1. The step function F_1k(x) is right-continuous, nondecreasing and has equal steps of size 1/n_1k at each observation X_1j(ω). Therefore,

F_1k(x₂) = 1/n_1k + F_1k(x₀).

Since G_k(x) is nondecreasing, G_k(x₂) ≥ G_k(x₀). Consequently,

|G_k(x₀) − F_1k(x₀)| = G_k(x₀) − F_1k(x₀) ≤ G_k(x₂) − F_1k(x₂) + 1/n_1k ≤ 1/n_1k + max_{j∈{1,...,n_1k}} |G_k{X_1j(ω)} − F_1k{X_1j(ω)}|,

which completes the proof. ∎

Lemma 3.4 Let a_k be a sequence of numbers such that lim_{k→∞} a_k³/n_1k = 0. Then, there exists Ω₁ ⊂ Ω with P(Ω₁) = 1 such that for all ε > 0, there exists a k₀ such that, for all ω ∈ Ω₁,

a_k max_{j∈{1,...,n_1k}} |G_k{X_1j(ω)} − F_1k{X_1j(ω)}| < ε

for all k ≥ k₀.

Proof of Lemma 3.4. Let ε > 0. Consider an arbitrary but fixed ω ∈ Ω₁ = Ω\Ω₀, where Ω₀ has the same definition as in the proof of Lemma 3.3. Suppose that Lemma 3.4 is false. Then there exists an infinite sequence k_ℓ such that for ℓ ∈ ℕ,

a_{k_ℓ} |G_{k_ℓ}{X_{1(j₀)}(ω)} − F_{1k_ℓ}{X_{1(j₀)}(ω)}| ≥ ε  (3.2)

for some j₀ ∈ {1, 2, ..., n_1k_ℓ}, where parentheses in the index identify order statistics, i.e. X_1(j) is the j-th smallest value among X_11, ..., X_1n_1k_ℓ.

Consider a fixed value of ℓ. For simplicity, we drop the index ℓ and revert to k and j₀, which are fixed. Note that

1. G_k(x) is a nondecreasing function;

2. F_1k(x) is a right-continuous nondecreasing step function with equal jumps of 1/n_1k.

We treat two cases.

Case 1: G_k{X_{1(j₀)}(ω)} > F_1k{X_{1(j₀)}(ω)}.
1{jo)  lk  m  Note that Fik{X (u)}<G {X (u)}<l 1(jo)  k  1(jo)  and hence, Fi {Xi^ ^(u;)} < 1 — e or inequality (3.2) would not hold. Consequently, k  0  k  Jo < nik ~ L fc ifcJ and for i € {0,1,... , [efc^ifcj}, we have €  n  k{Xi(j - )(u})} >  G {X (u)},  G  0  Fik{X  +i  (u>)}  k  =  1{jo+i)  1{jo)  F {X (u)} lk  + ^lk  1{jo)  n  and hence /v  ^  k{Xi( )(u})} - Fi {X jo+i  k  A  (u)}  G  > G {X  1{jo+i)  k  %  (u>)} - Fik{X {to)}  1{JQ)  - — >e lk  1{jo)  k  n  %  —. i  n  k  As a consequence,  \G (x) - Fi (x)\ dFik(x 2  k  k  > —  J2  \Gk{X  (u)}  1(jo+i)  nik  1=0  ni  ^ V W  k  -  F {X  < ^  lk  ( )}\'  1{jo+i)  u  n\  k  ^  z  2  L fc™ifcJ(|_ fc ifcJ + l)(2LefcnifcJ + 1) e  e  n  6r  >  i  3  4  /j_____v  V  n  xk  J  26  Chapter 3. MAMSE Weights for Univariate Data  By Lemma 3.2, we thus have that 1 fe n k  l k  -l\  ni  3  1 /Le n  <  f c  J  k  l f c  j\  3\  nk  J  a  J  -  V  X  fc  a  2^3  e^ifc  n^{ + 2 /3 '  k  1  k  <  \  ik  J  n  2  6n  a  1-  <  ifc  6n  ife  - 2V3 ^  fc  „ / + V3 2  3  2  ni  :  3  a contradiction since a\/ni  fn\ -l\  3  fc  —> 0 andfc^—> oo as £ —» oo, i.e. the left-hand term converges  k  to 0. Therefore, a  k  max G {Xij(u)} je{o,...,n } k  - Fi {Xij(u)}  <e  k  lk  Note that the bound does not depend on the choice of LO.  Case 2: G {X (u)} k  <  1{jo)  Note that Fi {Xi^(uj)} ik{Xi( )(uj)} jo  lk  1(jo)  > Gk{X ^{u)}  k  F  F {X {u)}. > 0. Since both functions are at least e apart,  x  k  > e and thus j > [e ni \. Then for i 6 {0,1;..., L fc ifcJ}> we have e  k  0  k  GkiX^-^uj)} Afc{A- - _ (w)} l0  0  n  k  0  <  G {X (Lu)}  =  Fi {X (u)}  k  1{jo)  k  1(jo)  ni  k  and hence  F {X _ (oj)} lk  Hjo  - G {A- - _ (a;)} > F {X (u)}  i}  fc  l0  0  0  lk  1(jo)  - G {X (u)} k  1(jo)  - ^ i- > e lk  i lk  k  n  n  Then,  J \G (x) - Fi (x)\  2  k  >  k  -  nik  J2  dFik(x)  I^^KJO-OH}-^^-*)^)}!  2  i=0  27  Chapter 3. 
MAMSE Weights for Univariate Data  nik  V  f£  W  n\  f^  k  L fc ifcJ(L fcnifcJ + l)(2|_e n J + 1)  >  i (L fc"ifcJ  n  n\  0  _  e  ^  k  e  fc  lfc  e  3 V nifc By Lemma 3.2, we thus have that 1 (e n k  3 V  l k  -l\  ik  1 /Le n jy  6  fc  J  n  lfc  3V n  \ 3  1  k  J  lk  /  fn\ -l\  <  \  n\  J 6ni  k  1  <  6n  fc  2  lfc  2/3  * Ur ' ; ~2~ * - 1  a  2^3 —->-275 —  n { + 2 /3 2  k  €ni a contradiction since a\/ni  k  fc  n  ^/  3 +  2  1  <=> a * —  i/3  —> 0 and  3  >2  1/3  e,  nit —* oo as I —> oo, i.e. the left-hand term converges  to 0. Therefore, a  k  max  F {^ij(w)}-G(t{Iij(w)}<e. u  j€{0,..,n } lfc  Combining the two cases, we know that Ve > 0, 3k such that \/k> k , 0  a  k  G {X (uj)} ~  max  k  1:i  Q  F {X (Lj)}\ lk  1:J  < e.  j6{0,..,ni } fc  Sincefcrjdoes not depend on to £ fl\ and P(fl\) = 1, the uniform convergence is almost sure.  _  Lemma 3.5 There exists Q\ C fl with P(fli) = 1 such that for all e > 0, there exists a ko such that i -  . .  -  . . i  max \G (x) - F (x) < e k  lk  X  for all k > kQ with the same ko for all to £ fl\. 28  Chapter 3. MAMSE Weights for Univariate Data  Proof of Lemma 3.5. Consider the set Vt\ defined in the proof of Lemma 3.3. For any fixed LO  € Qi,  max G (x)-F (x) k  lk  1  <  X  max  h  n  je{i,...,n }  xk  lk  By Lemma 3.4, Ve > 0, 3k\ such that VA: > k , x  G {X {u)} -  max  k  F {X {u)} <  l3  lk  lj  jG{l,..,n } lfc  for all LO E Oi. Moreover, 3k such that VA; > k , l/n 2  2  < e/2. Therefore, for all  xk  k > ko — max(A;i, k ), we have 2  0 < max G (x) k  e ^2  F (x) lk  X  e +  2  =  e  '  Theorem 3.5 The random variable  sup G (x)-F!(x) k  X  converges almost surely to 0.  Proof of Theorem 3.5. By Lemma 3.5, Ve > 0, 3k such that x  max G (x) - Fi (x) k  k  X  <  VA; > k\ and any to € fii with P(fl ) — 1. The Glivenko-Cantelli theorem states that x  sup  F  l  f  e  ( x ) - f \ ( x )  X  almost surely as A; —> co. Hence, there exists Vt C 0, with P(fl ) = 1 such that Ve > 0 and 2  2  29  Chapter 3. 
ω ∈ Ω₂, there exists a k₂(ω) with

sup_x |F_1k(x) − F₁(x)| < ε/2

for all k ≥ k₂(ω). Consider now Ω₀ = Ω₁ ∩ Ω₂ and k₀(ω) = max{k₁, k₂(ω)}. Note that we have P(Ω₀) ≥ P(Ω₁) + P(Ω₂) − 1 = 1. For any fixed ω, k and x, the inequality

|G_k(x) − F₁(x)| ≤ |G_k(x) − F_1k(x)| + |F_1k(x) − F₁(x)|

holds, hence for any ω ∈ Ω₀ and all k ≥ k₀(ω) we have

sup_x |G_k(x) − F₁(x)| ≤ sup_x |G_k(x) − F_1k(x)| + sup_x |F_1k(x) − F₁(x)| ≤ ε/2 + ε/2 = ε.

Therefore, sup_x |G_k(x) − F₁(x)| converges almost surely to 0. ∎

Corollary 3.1 Let Y_k be a sequence of random variables with distribution G_k(y). Then Y_k converges weakly to the variable Y with distribution F₁(y) as k → ∞.

Proof of Corollary 3.1. The result is clear from Theorem 3.5 and the definition of weak convergence. ∎

3.6 Weighted Strong Law of Large Numbers

By the weak convergence shown in Corollary 3.1,

∫ g(x) dG_k(x) → ∫ g(x) dF₁(x)  (3.3)

as k → ∞ for functions g that respect some regularity conditions. Examples of such results can be found in Section 2.2.b of Durrett (2005), for instance.

The left-hand side of Expression (3.3) can be written as

∫ g(x) dG_k(x) = Σ_{i=1}^m λ_ik(ω) (1/n_ik) Σ_{j=1}^{n_ik} g{X_ij(ω)}

and is hence a form of weighted strong law of large numbers (WSLLN). In Section 3.7, we use the WSLLN to prove the consistency of the weighted likelihood with MAMSE weights. For that, we need a WSLLN that holds for an unbounded function g with some discontinuities. We prove the needed theorem in this section, as it could not be located in the literature.

Lemma 3.6 Consider any two distribution functions F and G from ℝ to [0, 1] such that sup_x |F(x) − G(x)| < ε for some ε > 0. Then, for any connected set A ⊂ ℝ,

|dF(A) − dG(A)| ≤ 2ε.

Proof of Lemma 3.6. Let B = [a, b] ⊂ ℝ and define B_δ = (a − δ, b] for δ > 0. Then

|dF(B_δ) − dG(B_δ)| = |F(b) − F(a − δ) − G(b) + G(a − δ)| ≤ |F(b) − G(b)| + |F(a − δ) − G(a − δ)| < 2ε

for all δ > 0.
Since δ can be arbitrarily small, |dF(B) − dG(B)| ≤ 2ε. The result holds for any combination of closed or open boundaries with minor changes to the proof. ∎

Theorem 3.6 Let g(x) be a function for which ∫ |g(x)| dF₁(x) < ∞. The function g(x) is continuous on ℝ except possibly on a finite set of points {d₁, ..., d_q}. For each of Populations 2, ..., m, at least one of these two conditions holds:

1. the sample size is bounded: ∀k ∈ ℕ, n_ik ≤ M_i;

2. ∫ |g(x)| dF_i(x) < ∞.

Further suppose that the sequences of sample sizes are non-decreasing with k for all populations. Then,

∫ g(x) dG_k(x) − ∫ g(x) dF₁(x) → 0

almost surely as k → ∞.

Proof of Theorem 3.6. We show that for any ε > 0, we can find a sequence of inequalities that imply that |∫ g(x) dG_k(x) − ∫ g(x) dF₁(x)| < ε for any large enough k. The inequalities come from truncating g and from approximating it by a step function.

For t ∈ ℕ, let D_t = ∩_{ℓ=1}^q (d_ℓ − 2^{−t}, d_ℓ + 2^{−t})^c, B_t = [−t, t] ∩ D_t and

τ_t(x) = g(x) if x ∈ B_t,  τ_t(x) = 0 otherwise.

Since g(x) is continuous and B_t is a compact set, the image of τ_t is bounded, say τ_t(x) ∈ [L_t, U_t]. By the Heine-Cantor theorem, τ_t is uniformly continuous on B_t, i.e. ∀ε₁ > 0, ∃δ₁ > 0 such that

∀x₁, x₂ ∈ B_t,  |x₁ − x₂| < δ₁ ⟹ |τ_t(x₁) − τ_t(x₂)| < ε₁.

Let ε_{τ,t} = 2^{−t} and choose 0 < δ_{τ,t} < 2^{−t} accordingly. For s = 1, ..., S_t, where S_t = ⌈2t/δ_{τ,t}⌉, let

A_{s,t} = [−t + (s − 1)δ_{τ,t}, −t + sδ_{τ,t}) ∩ B_t.

In the rare case where 2t/δ_{τ,t} is an integer, we let A_{S_t,t} = [t − δ_{τ,t}, t]. The sets A_{s,t} form a partition of the compact set B_t. Note that the choice of D_t and δ_{τ,t} ensures that the A_{s,t} are connected, with the harmless exception of an A_{S_t,t} that could sometimes consist of two singletons. Define h_t by

h_t(x) = Σ_{s=1}^{S_t} b_{s,t} 1_{A_{s,t}}(x)
MAMSE Weights for Univariate Data where  1 b  =  st  inf g(y)  and  T h e n , by construction, sup^  0  1  \rt(x) — h (x)\ < 2 t  g(x)dG {x) - jg(x)  x  G  A  otherwise  * and  dF^x) < T + T + T + T +  k  st  st  •  y e A s t  if  TL-A (x) =  x  2  3  4  T  5  (3.4)"  where  Ti  Jg(x)dG (x) - jr (x)  T  J (x)dG (x)-  n  Jh (x)dG (x)-  k  n  2  J  k  t  dG (x  t  k  h (x)dG (x) t  k  Jht(x)dFi{x)  k  J r (x)dFi(x)  J htWdF^x)-  t  r ( x ) d F ! ( s ) - Jg{x)dF {x)  T  t  5  l  We will now prove that for any e > 0 a n d w i n a subset of fl w i t h probability 1, we can choose 7_ such that the five terms above are less t h a n e/5 for all k > A _ ( £ _ ) . T o begin, note that  T = 4  ht(x) - r (x) dFi(x) < J t  by construction. T h e same b o u n d applies for T  3.5,  sup^.  \G (x) — F\(x)\ k  t  t  1  a n d does not depend on k or to.  2  By Theorem  \h {x)-T (x)\dF (x)<2-  converges almost surely to  0.  Therefore,  3flg  C  fl  with P(flo) = 1 such that for each to G flo a n d any t, 3A_,t with  sup|G (x) -Fi(a;)| < 5 max(|U | |L |)2 + fe  1  t  t  )  1  t  33 )  Chapter 3. MAMSE Weights for Univariate Data for all k > k f For any such k and u>, Lemma 3.6 implies that Ut  dG (A ) k  - dF^Ast) <  st  2 5 max(|C/ |,|L |)2*+i t  for any s  St- Developing  = 1,...,  s  T =  T3  t  t  yields  t  St  j2 ^ ^ ^-J2 ^ ^ ^ b  3  dG  A  b  dF  A  St  < J2\ ^\b  Therefore, 3t\ such that 2 by e/5 tor any t>t\  - t  dG (A ) k  - dF^Ast)  st  < e/5 for all t > t\, i.e. T , T 3 and T 4 are each bounded 2  and k > /  We can write  T = 5  / (x)l c(x)dF (x) < / | (x)|l c(x)dF (x) - 0 5  S  1  5  B  1  as t —> 00 since the integrand goes to 0 for each x € IR\{di,... , d_} by the dominated convergence theorem with bounding function |<?(x)|. The integrand does not converge to 0 on { d i , . . . , d_}, but that set has measure 0. Therefore, there exists t such that T5 < e/5 2  for all t > t . 2  Turning now to T\, we denote by I C { l , . . . 
, m } the indices corresponding to the populations for which rii —> 00 as k —-> 00. By the strong law of large numbers, for any k  fixed t, there exists fi,^ C 0 with P(!i\ ) = 1 such that for all u € £\t, 4  34  Chapter 3. MAMSE  Weights  for Univariate Data  as k —> oo. Consider a fixed  u> € Q,i = {Lo\Xij(u)  — dt  for some i,j,£}  C  DI  n  M  fi  iei,te\N The intersection is over a countable number of sets of probability 1, hence P(fii) = 1. For any such to G f i i , T\ is developed as jg{x)TL c(x)&G {x) B  E ^  E  E —E 2  Since  LO  B?  \9{XijH}\l {*i>)}-. B?  J= l  =1  fc  iflf{^«H}  J=l  2=1  ^  < J |_(x)|l (aO dG (x)  k  is fixed, 3£* such that TLB*{Xij(u>)} = 0, V i G I , j = 0  1,..., 7 1 ^ , < >  C-  F  O  R  i £ I, the dominated convergence theorem says that there exists i * such that |_(x)|lBf(aOdFi(x) <  10m  for all t>t*. Choose t > £3 = maxi / 1*. Since u G f i i , B&i,^ such that for all k > ki , 6  — lfc  tt:LJ  'zZ\9{X (u)}\TL c{X (Lo)} ij  B  / |_(x)|l_c(x)dF x) H  <  ij  j=i  iv  J  10m  < . 5m  Therefore, V i > max(i3,t_), there existsfc* = maxj / fci^u, such that t  •T, = for all  5(x)dG (x) fc  e  1 r (x) 4  dG (i k  <  k>kl . t  In conclusion, for any u; G fio H fii and any e •> 0, we can choose £_ = max(ti, <2, *3, C )  35  Chapter 3. MAMSE  Weights for Univariate Data  that yields inequalities showing that  g(x)dG (x)k  / _(x)dFi(x) < e  for all k > kuitu) = max(fc_ t ,fc* ). In other words, the left hand side of Expression (3.4). i w  tu  converges to 0 for any u G fio PI fii with P(fio f l fii) = 1, i.e. that expression converges almost surely to 0. C o r o l l a r y 3 . 2 ( W e i g h t e d S t r o n g L a wo fL a r g e N u m b e r s )  _  Let Xi denote a v  with distribution Fi. Suppose E\Xi\ < oo for i'= 1,..., m, then  almost surely as k —> oo.  Proof of Corollary 3.2. Use Theorem 5 with g(x) = x.  _  Theorem 3.6 is useful for proving the consistency of the M W L E by extending the proof of Wald (1949). 
This extension is given next.

3.7 Strong Consistency of the Maximum Weighted Likelihood Estimate

In this section, we adapt the work of Wald (1949) to prove that the MWLE obtained with MAMSE weights is a strongly consistent estimate. For easier reference to his original work, the numbering of Wald is noted with the prefix W.

Wald's Assumptions

The assumptions of Wald (1949) are reproduced below and adapted as required to extend his proof to the MWLE. Let F(x|θ) be a parametric family of distributions, where θ is an element of Θ, a closed subset of a finite-dimensional Cartesian space. We assume that there exists θ₀ ∈ Θ such that F(x|θ₀) = F₁(x). Wald (1949) does not assume that F(x|θ₀) is continuous in x, but we do, and we denote its corresponding density function by f(x|θ₀).

The following notation is used by Wald (1949):

∀θ ∈ Θ, ρ > 0:  f(x, θ, ρ) = sup_{|θ−θ′|≤ρ} f(x|θ′),  f*(x, θ, ρ) = max{f(x, θ, ρ), 1};

∀r > 0:  φ(x, r) = sup_{|θ|>r} f(x|θ),  φ*(x, r) = max{φ(x, r), 1}.

Assumption 3.1 (W1) For all θ ∈ Θ, F(x|θ) is absolutely continuous for all x. Therefore, F(x|θ) admits a density function f(x|θ).

Assumption 3.2 (W2) For sufficiently small ρ and sufficiently large r, the expected values ∫ log f*(x, θ, ρ) dF₁(x) and ∫ log φ*(x, r) dF₁(x) are finite.

Assumption 3.3 (W3) If lim_{i→∞} θ_i = θ, then lim_{i→∞} f(x|θ_i) = f(x|θ).

Assumption 3.4 (W4) If θ₁ ≠ θ₀, then F(x|θ₀) ≠ F(x|θ₁) for at least one x.

Assumption 3.5 (W5) If lim_{i→∞} |θ_i| = ∞, then lim_{i→∞} f(x|θ_i) = 0.

Assumption 3.6 (W6) ∫ |log f(x|θ₀)| dF_i(x) < ∞ for i = 1, ..., m.

Assumption 3.7 (W7) The parameter space Θ is a closed subset of a finite-dimensional Cartesian space.

Assumption 3.8 (W8) The functions f(x, θ, ρ) and φ(x, r) are measurable for any θ, ρ and r.
Assumption 3.9. The functions f(x|θ₀), f(x, θ, ρ) and φ(x, r) are continuous except possibly on a finite set of points {d₁, …, d_D}. The set of discontinuities may depend on θ, ρ or r, but must be finite for any fixed values of these parameters.

Assumptions W1 to W8 are from Wald (1949); only Assumption W6 is modified to cover the m populations of our paradigm. Assumption 3.9 is required to ensure that Theorem 3.6 applies. Note that these assumptions are mostly concerned with the family of distributions F(x|θ) (the model), rather than with the true distribution of the data. Lemmas 3.10 and 3.11 are useful for determining whether the family of distributions F(x|θ) satisfies Assumption 3.9.

Wald's Lemmas

Wald's lemmas do not need to be modified. We state them for completeness, but do not reproduce the proofs provided in Wald (1949). For expectations, the following convention is adopted. Let U be a random variable. The expected value of U exists if E{max(U, 0)} < ∞. If E{max(U, 0)} is finite but E{min(U, 0)} is not, we say that E(U) = −∞. Let a generic X represent a random variable with distribution F₁(x) = F(x|θ₀).

Lemma 3.7 (W1). For any θ ≠ θ₀, we have E log f(X|θ) < E log f(X|θ₀).

Lemma 3.8 (W2). lim_{ρ→0} E log f(X, θ, ρ) = E log f(X|θ).

Lemma 3.9 (W3). The equation lim_{r→∞} E log φ(X, r) = −∞ holds.

The next two lemmas are useful in determining whether Assumption 3.9 is satisfied.

Lemma 3.10. Let f(x|θ) be continuous for all θ ∈ Θ and x ∈ N_{x₁}, a neighborhood of x₁. Then for θ₀ and ρ fixed, f(x, θ₀, ρ) is continuous at x₁.

Proof of Lemma 3.10. Suppose that f(x, θ₀, ρ) has a discontinuity at x = x₁. Then, there exists ε > 0 such that for all δ > 0, there exists x₂ with |x₁ − x₂| < δ but

    |f(x₁, θ₀, ρ) − f(x₂, θ₀, ρ)| > ε.   (3.5)

Let A ⊂ N_{x₁} be a compact set around x₁. Let B = {θ : |θ − θ₀| ≤ ρ}.
The set A × B is compact and hence f(x|θ) is uniformly continuous on that domain by Heine–Borel. Therefore, for the ε chosen above, there exists a δ₁ > 0 such that x₁, x₂ ∈ A and |x₁ − x₂| < δ₁ imply

    |f(x₁|θ) − f(x₂|θ)| < ε/2   (3.6)

for all θ ∈ B. Choose such an x₂ and define

    θ₁ = argmax_{|θ−θ₀|≤ρ} f(x₁|θ)   and   θ₂ = argmax_{|θ−θ₀|≤ρ} f(x₂|θ).

The maxima are attained since A × B is compact and f(x|θ) is continuous in θ. Therefore,

    f(x₁, θ₀, ρ) = f(x₁|θ₁)   and   f(x₂, θ₀, ρ) = f(x₂|θ₂).   (3.7)

Consider the following two cases.

Case 1: f(x₁|θ₁) > f(x₂|θ₂). By Equations (3.5) and (3.7), f(x₁|θ₁) > f(x₂|θ₂) + ε. Furthermore, inequality (3.6) implies that

    f(x₂|θ₁) > f(x₁|θ₁) − ε/2 > f(x₂|θ₂) + ε/2,

a contradiction with the definition of θ₂.

Case 2: f(x₁|θ₁) < f(x₂|θ₂). By Equations (3.5) and (3.7), f(x₁|θ₁) < f(x₂|θ₂) − ε. Inequality (3.6) yields

    f(x₁|θ₂) > f(x₂|θ₂) − ε/2 > f(x₁|θ₁) + ε/2,

a contradiction with the definition of θ₁.

Therefore, we conclude that f(x, θ₀, ρ) is continuous at x₁. ∎

By Lemma 3.10, if f(x|θ) is continuous in x and θ, then f(x, θ₀, ρ) is continuous in x for any fixed θ₀ and ρ. Before introducing Lemma 3.11, define

    ω_g(δ, x₀) = sup_{|x−x₀|<δ} |g(x) − g(x₀)|,

the modulus of continuity of the function g(x) around x₀. Note that when it exists, lim_{δ→0} ω_g(δ, x₀)/δ = |g′(x₀)|.

Lemma 3.11. Suppose that f(x|θ) is continuous in θ and that φ(x, r) has a discontinuity at x₀. Then, there exists ε > 0 such that ω_{f(·|θ)}(δ, x₀) > ε for any δ > 0 and some θ.

Proof of Lemma 3.11. Fix r > 0. Since φ(x, r) is discontinuous at x₀, there exists ε > 0 such that for any δ > 0, ∃x₁ such that |x₀ − x₁| < δ but

    |φ(x₀, r) − φ(x₁, r)| > 2ε.   (3.8)

For any fixed δ and x₁, consider the following two cases.

Case 1: φ(x₀, r) > φ(x₁, r) + 2ε.
By the continuity of f(x|θ), it is possible to choose |θ₀| > r such that f(x₀|θ₀) is arbitrarily close to φ(x₀, r), say less than ε apart, i.e.

    f(x₀|θ₀) ≥ φ(x₀, r) − ε.

For that possibly suboptimal θ₀, φ(x₁, r) ≥ f(x₁|θ₀), hence

    f(x₀|θ₀) ≥ φ(x₀, r) − ε > φ(x₁, r) ≥ f(x₁|θ₀),

meaning that

    |f(x₀|θ₀) − f(x₁|θ₀)| ≥ f(x₀|θ₀) − f(x₁|θ₀) ≥ φ(x₀, r) − ε − φ(x₁, r) > ε

because of Equation (3.8). Therefore, ω_{f(·|θ₀)}(δ, x₀) > ε.

Case 2: φ(x₀, r) < φ(x₁, r) − 2ε.

The continuity of f(x|θ) allows us to choose |θ₁| > r such that f(x₁|θ₁) is close to φ(x₁, r), say less than ε apart, i.e.

    f(x₁|θ₁) ≥ φ(x₁, r) − ε.

Then, by the definition of φ, we have φ(x₀, r) ≥ f(x₀|θ₁), hence

    f(x₁|θ₁) ≥ φ(x₁, r) − ε > φ(x₀, r) ≥ f(x₀|θ₁).

Therefore,

    |f(x₁|θ₁) − f(x₀|θ₁)| ≥ f(x₁|θ₁) − f(x₀|θ₁) ≥ φ(x₁, r) − ε − φ(x₀, r) > ε

by Equation (3.8). Therefore, ω_{f(·|θ₁)}(δ, x₀) > ε.

By combining both cases, we can conclude that for all δ > 0, there exists a θ such that ω_{f(·|θ)}(δ, x₀) > ε. ∎

Note that if g(x) is continuous at x₀, lim_{δ→0} ω_g(δ, x₀) = 0. Having a modulus of continuity ω_{f(·|θ)}(δ, x₀) bounded below is akin to having an infinite derivative ∂f(x|θ)/∂x at x₀. This occurs when:

• there is a discontinuity in the function f(x|θ) at x₀, or
• as θ → ∞, the slope of f(x|θ) around x₀ keeps increasing.

Therefore, discontinuities in φ(x, r) will occur if f(x|θ) is discontinuous itself, or if f(x|θ) has a peak that can become arbitrarily steep (i.e. its slope is not bounded for a fixed x as θ varies). The main result of this paper uses Theorem 3.6 which allows a finite number of discontinuities.
Therefore, as long as f(x|θ) is a continuous model such that the set

    {x : for an arbitrarily large r > 0, ω_{f(·|θ)}(δ, x) is arbitrarily large for some |θ| > r}
    = {x : f(x|θ) has an arbitrarily steep peak at x for some |θ| > r}

is empty or made up of a finite number of singletons, Assumption 3.9 will hold.

Note the effect of the constraint |θ| > r. Consider for instance the Normal model with unknown mean and variance. As σ² → 0, the Normal density will have an arbitrarily steep peak close to μ. However, |[μ, σ²]ᵀ| > r implies that there is a lower bound for σ², hence the steepness of the peak is bounded. Assumption 3.9 is satisfied under that model.

Main Result

We now turn to the main results of this section.

Theorem 3.7 (Theorem W1). Let T be any closed subset of Θ that does not contain θ₀. Then,

    P{ lim_{k→∞} sup_{θ∈T} [ ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik} ] = 0 } = 1.

Proof of Theorem 3.7. Let X denote a random variable with distribution F₁(x) = F(x|θ₀) and let r₀ be a positive number chosen such that

    E{log φ(X, r₀)} < E{log f(X|θ₀)}.   (3.9)

The existence of such an r₀ follows from Lemma 3.9 and Assumption 3.6. Then, T₁ = {θ : |θ| ≤ r₀} ∩ T is a compact set since it is a closed and bounded subset of a finite-dimensional Cartesian space. With each element θ ∈ T₁, we associate a positive value ρ_θ such that

    E{log f(X, θ, ρ_θ)} < E{log f(X|θ₀)}.   (3.10)

The existence of such a ρ_θ follows from Lemmas 3.7 and 3.8. Let S(θ, ρ) denote the sphere with center θ and radius ρ. The spheres {S(θ, ρ_θ)} form a covering of the compact T₁, hence there exists a finite sub-covering. Let θ₁, …, θ_h ∈ T₁ be such that T₁ ⊂ ∪_{s=1}^h S(θ_s, ρ_{θ_s}). Clearly,

    0 ≤ sup_{θ∈T} ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik}
      ≤ Σ_{s=1}^h ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω), θ_s, ρ_{θ_s}}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik}
      + ∏_{i=1}^m ∏_{j=1}^{n_ik} φ{X_ij(ω), r₀}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik}.
Therefore, to prove Theorem 3.7 it suffices to show that

    P{ lim_{k→∞} ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω), θ_s, ρ_{θ_s}}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik} = 0 } = 1

for s = 1, …, h and that

    P{ lim_{k→∞} ∏_{i=1}^m ∏_{j=1}^{n_ik} φ{X_ij(ω), r₀}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik} = 0 } = 1.

The above equations can be rewritten as

    P[ lim_{k→∞} n_1k { Σ_{i=1}^m Σ_{j=1}^{n_ik} (λ_ik(ω)/n_ik) log f{X_ij(ω), θ_s, ρ_{θ_s}} − Σ_{i=1}^m Σ_{j=1}^{n_ik} (λ_ik(ω)/n_ik) log f{X_ij(ω)|θ₀} } = −∞ ]
    = P[ lim_{k→∞} n_1k { ∫ log f(x, θ_s, ρ_{θ_s}) dG_k(x) − ∫ log f(x|θ₀) dG_k(x) } = −∞ ] = 1   (3.11)

for s = 1, …, h and

    P[ lim_{k→∞} n_1k { Σ_{i=1}^m Σ_{j=1}^{n_ik} (λ_ik(ω)/n_ik) log φ{X_ij(ω), r₀} − Σ_{i=1}^m Σ_{j=1}^{n_ik} (λ_ik(ω)/n_ik) log f{X_ij(ω)|θ₀} } = −∞ ]   (3.12)
    = P[ lim_{k→∞} n_1k { ∫ log φ(x, r₀) dG_k(x) − ∫ log f(x|θ₀) dG_k(x) } = −∞ ] = 1

respectively. Assumptions 3.6 and 3.9 insure that Theorem 3.6 applies to the integrals above, each of these converging almost surely to

    ∫ log f(x, θ_s, ρ_{θ_s}) dF₁(x),   ∫ log φ(x, r₀) dF₁(x)   or   ∫ log f(x|θ₀) dF₁(x).

Combining this result with Equations (3.10) and (3.9), we have that (3.11) and (3.12) hold. Hence the proof of Theorem 3.7 is complete. ∎

Theorem 3.8 (Theorem W2). Let θ_k(ω) be a sequence of random variables such that there exists a positive constant c with

    ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ_k(ω)}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik} ≥ c > 0   (3.13)

for all k ∈ ℕ and all ω ∈ Ω. Then

    P{ lim_{k→∞} θ_k(ω) = θ₀ } = 1.

Proof of Theorem 3.8. Let ε > 0 and consider the values of θ_k(ω) as k goes to infinity. Suppose that θ_ℓ is a limit point away from θ₀, such that |θ_ℓ − θ₀| ≥ ε. Then,

    sup_{|θ−θ₀|≥ε} ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ}^{λ_ik(ω) n_1k/n_ik} / ∏_{i=1}^m ∏_{j=1}^{n_ik} f{X_ij(ω)|θ₀}^{λ_ik(ω) n_1k/n_ik} ≥ c > 0

infinitely often. By Theorem 3.7, this event has probability 0 even with ε arbitrarily small. Therefore,

    P{ |θ_k(ω) − θ₀| ≥ ε infinitely often } = 0

for all ε > 0. ∎

Corollary 3.3 (Corollary W1). The MWLE is a strongly consistent estimate of θ₀.

Proof of Corollary 3.3.
The MWLE clearly satisfies Equation (3.13) with c = 1 because θ_k(ω) is chosen to maximize the numerator of (3.13). ∎

3.8 Asymptotic Behavior of the MAMSE Weights

We study the asymptotic behavior of the MAMSE weights as k → ∞ and its consequences in constructing a weighted central limit theorem. Let

    ℒ = { λ : Σ_{i=1}^m λ_i F_i(x) ≡ F₁(x) },

where ≡ indicates that the functions are equal for all x. Clearly, ℒ is a nonempty convex set with [1, 0, …, 0]ᵀ ∈ ℒ. Moreover, if we consider the elements of ℒ as elements of the normed space ([0, 1]^m, ‖·‖) where ‖·‖ stands for the Euclidean norm, then ℒᶜ is an open set.

We will show that for k ∈ ℕ, all accumulation points of the MAMSE weights will be in the set ℒ. In other words, the MAMSE weights can only converge to vectors that define a mixture distribution identical to the target distribution.

Theorem 3.9. Suppose that ℒᶜ ≠ ∅ and let λ* ∈ ℒᶜ; then for any ε > 0, there exists a set Ω₀ of probability 1 such that

    ‖λ* − λ_k(ω)‖ > ε   i.o.

for all ω ∈ Ω₀ and hence the MAMSE weights do not converge to λ* as k → ∞.

Proof of Theorem 3.9. The Glivenko–Cantelli lemma shows that sup_x |F_ik(x) − F_i(x)| → 0 almost surely as k → ∞. Let Ω_i be the set of probability 1 where the convergence occurs. The summand and the integrand of the following expressions are bounded by 1, thus

    (1/n_1k) Σ_{j=1}^{n_1k} { Σ_{i=1}^m λ_i F_i{X_1j(ω)} − F₁{X_1j(ω)} }² → E{ Σ_{i=1}^m λ_i F_i(X_11) − F₁(X_11) }²

almost surely as k → ∞ by the strong law of large numbers. Let Ω′ be the set of probability 1 on which the convergence occurs. Note that the expectation in the expression above is taken over the random variable X_11 which follows distribution F₁. Consider now the set Ω₀ = Ω′ ∩ ∩_{i=1}^m Ω_i and let ω ∈ Ω₀ be any fixed element of that set. Note that by construction P(Ω₀) = 1.

Let B(x, r) denote the open ball of radius r centered at x. Since ℒᶜ is an open set, any small enough ε > 0 will be such that B(λ*, ε) ∩ ℒ = ∅.
Then, consider P_k(λ) as defined in Equation (3.1) and for any λ ∈ B(λ*, ε),

    P_k(λ) ≥ ∫ { Σ_{i=1}^m λ_i F_ik(x) − F_1k(x) }² dF_1k(x).

Adding and subtracting Σ_{i=1}^m λ_i F_i(x) and F₁(x) inside the braces and bounding each substitution by a supremum norm yields

    P_k(λ) ≥ (1/n_1k) Σ_{j=1}^{n_1k} { Σ_{i=1}^m λ_i F_i{X_1j(ω)} − F₁{X_1j(ω)} }² − 2 Σ_{i=1}^m λ_i sup_x |F_ik(x) − F_i(x)| − 2 sup_x |F_1k(x) − F₁(x)| ≥ K > 0

for a large enough k. The fact that λ ∈ ℒᶜ implies that Σ_{i=1}^m λ_i F_i(x) ≠ F₁(x) for some x where F₁(x) is not flat, i.e. some x with positive probability, thus

    E{ Σ_{i=1}^m λ_i F_i(X_11) − F₁(X_11) }² > 0.

Therefore, there exist k₀(ω) and K > 0 such that P_k(λ) > K for all k ≥ k₀(ω). However, Lemma 3.2 shows that P_k{λ_k(ω)} → 0 as k → ∞. Therefore, λ_k(ω) ∈ B(λ*, ε) at most finitely many times. This is true of any λ* ∈ ℒᶜ, meaning that for all ω ∈ Ω₀, ‖λ* − λ_k(ω)‖ ≤ ε at most finitely many times. ∎

Corollary 3.4. Consider the sequence of MAMSE weights λ_k(ω) for ω fixed and k ∈ ℕ. Let λ be an accumulation point of the sequence λ_k(ω); then λ ∈ ℒ.

Proof of Corollary 3.4. By Theorem 3.9, the neighborhood of any λ ∈ ℒᶜ can be visited at most finitely many times. Hence, any accumulation point belongs to ℒ. ∎

Corollary 3.5. If ℒ is a singleton, then ℒ = {[1, 0, …, 0]ᵀ} and

    λ_k(ω) → [1, 0, …, 0]ᵀ

almost surely as k → ∞.

Proof of Corollary 3.5. The vector [1, 0, …, 0]ᵀ is always in ℒ. Therefore, ℒ will be a singleton only when ℒ = {[1, 0, …, 0]ᵀ}. Let ε > 0 and let

    A = [0, 1]^m \ B([1, 0, …, 0]ᵀ, ε),

where B(x, r) denotes the open ball of radius r centered at x. The set A is closed and bounded, thus compact. Let B̄(x, r) be the closed ball of radius r centered at x.
Consider the sets B̄(x, ε/2) for x ∈ [0, 1]^m; they form a covering of A. Since A is a compact set, there exists a finite sub-covering with balls centered at x_s for s = 1, …, S.

Consider now the sequence of MAMSE weights λ_k(ω). By Theorem 3.9, for every fixed ω ∈ Ω₁ with P(Ω₁) = 1, any of the balls B̄(x_s, ε/2) will contain at most finitely many λ_k(ω), i.e.

    λ_k(ω) ∈ ∪_{s=1}^S B̄(x_s, ε/2)   finitely many times.   (3.14)

Consequently,

    λ_k(ω) ∈ { ∪_{s=1}^S B̄(x_s, ε/2) }ᶜ ⊂ B([1, 0, …, 0]ᵀ, ε)   i.o.   (3.15)

Expressions (3.14) and (3.15) imply that if it exists, the limit of λ_k(ω) is in the set B([1, 0, …, 0]ᵀ, ε). Since ε can be arbitrarily small and since the space is complete, we conclude that λ_k(ω) → [1, 0, …, 0]ᵀ almost surely. ∎

In the case where ℒ is not a singleton, the MAMSE weights do not seem to converge to any particular point. Corollary 3.4 indicates that any accumulation point will be in ℒ, but it seems that the neighborhood of many points in ℒ is visited infinitely often. Describing the limit of the MAMSE weights in a general case seems rather tedious. In particular, the speeds at which the sample sizes increase in each population as well as the shape of each F_ik(x) compared to F_1k(x) will have an influence on the behavior of the weights. The precise description thereof is left to future work.

3.9 Issues for Asymptotic Normality

In all simulations performed, we have not found any evidence showing that a MAMSE-weighted sum of random variables is not asymptotically normal. A formal proof showing the asymptotic distribution of such a sum remains however to be found.

Even in the realistic case where no mixture of the populations is exactly identical to the population of interest, i.e. ℒ = {[1, 0, …, 0]ᵀ}, we cannot develop a central limit theorem for a MAMSE-weighted sum of variables without studying the speed of convergence of the MAMSE weights.
Let μ_i = E(X_i1) and σ_i² = var(X_i1). Under reasonable assumptions, one could hope to show that the expression

    √n_1k B_k = √n_1k { Σ_{i=1}^m λ_ik(ω) (1/n_ik) Σ_{j=1}^{n_ik} X_ij(ω) − μ₁ }   (3.16)

converges weakly to a Normal variable with mean 0. To simplify things, suppose that the sample sizes from all populations increase at the same rate, that is, [n_1k, n_2k, …, n_mk]/n_1k → [α₁, …, α_m] as k → ∞ with 0 < α_i < ∞. Note that

    B_ik = (1/n_ik) Σ_{j=1}^{n_ik} (X_ij − μ_i)

converges almost surely to 0 by the strong law of large numbers and that √n_1k B_ik converges weakly to a Normal distribution with mean 0 and variance σ_i²/α_i by the central limit theorem. However, developing B_k yields

    B_k = Σ_{i=1}^m λ_ik(ω) (1/n_ik) Σ_{j=1}^{n_ik} (X_ij − μ_i) + Σ_{i=1}^m λ_ik(ω)(μ_i − μ₁)
        = Σ_{i=1}^m λ_ik(ω) B_ik + Σ_{i=1}^m λ_ik(ω)(μ_i − μ₁).

Recall that B_ik → 0 and that Corollary 3.2 shows that B_k → 0. Consequently, the expression Σ_{i=1}^m λ_ik(ω)(μ_i − μ₁) converges to 0 as well, but we do not know the rate of that convergence. If that rate is slower than 1/√n_1k, Expression (3.16) will not converge to a Normal distribution even if the MAMSE weights converge almost surely to a fixed value.

To prove the asymptotic normality of (3.16) using the strategy sketched above, we need to show that the speed of convergence of the MAMSE weights is fast enough. This condition corresponds to the second half of Assumption 2.5 of Wang, van Eeden and Zidek (2004) who require the same rate of convergence, 1/√n_ik, for the weights.

Note that the classical proof of asymptotic normality uses the moment generating function or the characteristic function, but it does not apply here because each datum is multiplied by a data-based weight. As a consequence, the terms λ_ik(ω)X_ij are not independent. The study of the asymptotic distribution of expressions similar to (3.16) is left to future work. Any significant advances will probably come only after a thorough investigation of the speed of convergence of the MAMSE weights.
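Before turning to simulations, the flavour of the weight computation can be illustrated numerically. Equation (3.1), which defines the MAMSE criterion P_k(λ), is not reproduced in this section; the sketch below therefore minimizes an assumed plug-in criterion of the same spirit — the squared distance between the target EDF and the λ-mixture of EDFs, plus a variance penalty — over the simplex, for m = 2 populations. The function names and the exact penalty form are illustrative, not quoted from the thesis.

```python
import random

def ecdf(sample):
    """Return a function computing the empirical CDF of `sample`."""
    s = sorted(sample)
    n = len(s)
    def F(x):
        count = 0
        for v in s:          # s is sorted, so we can stop early
            if v <= x:
                count += 1
            else:
                break
        return count / n
    return F

def mamse_weights_2pop(x1, x2, grid=101):
    """Grid search for (lam1, lam2) with lam1 + lam2 = 1 minimizing a
    plug-in criterion in the spirit of the MAMSE weights: the average,
    over the target sample, of
        [F1(x) - (lam1*F1(x) + lam2*F2(x))]^2 + variance penalty.
    The penalty form is an assumption of this sketch, not Eq. (3.1)."""
    F1, F2 = ecdf(x1), ecdf(x2)
    n1, n2 = len(x1), len(x2)
    best = (None, float("inf"))
    for g in range(grid):
        lam2 = g / (grid - 1)
        lam1 = 1.0 - lam2
        obj = 0.0
        for x in x1:
            bias2 = (F1(x) - (lam1 * F1(x) + lam2 * F2(x))) ** 2
            penalty = (lam1**2 * F1(x) * (1 - F1(x)) / n1
                       + lam2**2 * F2(x) * (1 - F2(x)) / n2)
            obj += (bias2 + penalty) / n1
        if obj < best[1]:
            best = ((lam1, lam2), obj)
    return best[0]

random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(20)]         # small target sample
same = [random.gauss(0, 1) for _ in range(200)]      # similar population
far  = [random.gauss(4, 1) for _ in range(200)]      # very different population
w_same = mamse_weights_2pop(x1, same)
w_far  = mamse_weights_2pop(x1, far)
print(w_same, w_far)
```

As expected from the theory above, a population resembling the target receives a substantial weight, while a clearly different one is nearly discarded.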
3.10 Simulations

In this section, the finite-sample performance of the MWLE with MAMSE weights is evaluated through simulations. Different cases of interest are considered. The number of repetitions for each simulation study varies from 10000 to 40000. We used the bootstrap on a pilot simulation to evaluate the variability of the values presented throughout this section. Unless otherwise stated, the standard deviation of the error due to simulation is less than one unit of the last digit shown.

3.10.1 Two Normal Distributions

We first explore the merits of our weights for the ubiquitous Normal distribution. Samples of equal sizes n are drawn from

    Pop. 1 : N(0, 1),    Pop. 2 : N(Δ, 1)

for different values of Δ, each scenario being repeated 10000 times. Table 3.1 shows the average MAMSE weights under different circumstances.
The average weight for Population 1 increases as well when the discrepancy between the populations increases while n is kept fixed. Table 3.2 shows the performance obtained for the M W L E with MAMSE weights when compared to the MLE. The ratio of the mean squared errors, 100 MSE(MLE)/MSE(MWLE)  52  Chapter 3. MAMSE Weights for Univariate Data is shown; a value greater than 100 meaning that the M W L E is preferable. This ratio is akin to the relative efficiency of the M L E with respect to the MWLE.  A =0 0.001 0.01 0.10 0.25 0.50 0.75 1.00 1.50 2.00  n =5 146 . 147 146 143 .139 127 114 103 89 84  :  10 145 146 146 143 134 117 103 94 88 87  15 144 145 145 142 131 108' 95 90 89 91  Efficiency of 25 20 144 143 144 143 144 143 140 139 125 123 104 97 91 89 88 88 91 91 92 93  the M W L E 50 100 200 143 144 144 143 142 143 143 144 143 135 128 118 110 96 87 88 88 90 87 91 95 94 90 97 94 98 98 96 98 99  1000 144 143 141 89 91 97 99 99 100 100  10000 143 144 127 94 99 100 100 100 100 100  Table 3.2: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of equal size n are simulated from Normal distributions with unit variance and means 0 and A respectively. The results are averaged over 10000 replicates.  The M W L E performs better than the M L E for small n and A . When n and A increase, the methods' performances are eventually equivalent. For the cases in between however, the M L E is a better choice than the MWLE. Fortunately, the loss (at most 16%) seems to be smaller than the potential gain (up to 47%). When the two populations are identical, a steady improvement of about 43% is observed. Note that we cannot expect to improve uniformly over the M L E since the mean is an admissible estimator. The weighted likelihood could be especially useful in situations where a large population is available to support a few observations from the population of interest. 
For the next simulation, 40000 replicates of each scenario are produced with the same Normal distributions as before, but with samples of size n and 10n for Populations 1 and 2 respectively. Table 3.3 shows the average weight allocated to Population 1; Table 3.4 shows the relative efficiency of the methods as measured by 100 MSE(MLE)/MSE(MWLE).

    Average values of 100·λ₁
    Δ \ n      5   10   15   20   25   50  100  200
    0         51   50   49   49   49   49   49   48
    0.001     51   50   49   49   49   49   49   48
    0.01      52   50   50   49   49   49   49   49
    0.10      54   53   52   53   53   54   57   62
    0.25      58   59   60   61   62   69   78   86
    0.50      66   70   73   76   79   87   93   96
    0.75      74   79   83   86   88   94   97   98
    1.00      80   86   89   91   93   96   98   99
    1.50      87   92   94   95   96   98   99   99
    2.00      91   94   96   97   97   99   99  100

Table 3.3: Average MAMSE weights for Population 1 when samples of size n and 10n are drawn from Normal distributions with unit variance and means 0 and Δ respectively. The results are averages over 40000 replicates.

The general behavior of the weights is similar to that in the previous simulation, except that their minimal average value is below 0.5 this time around. As a consequence of its larger size, the sample from Population 2 gets a heavier weight. It appears that a larger Population 2 magnifies the gains or losses observed previously. Fortunately however, the magnitude of the further improvements seems to exceed that of the extra losses. Note that the MAMSE weights are invariant to a common transformation of the data in all populations. Therefore, simulation results would be identical (less simulation error) for Normal populations with variance σ² and with means 0 and Δσ respectively. Overall, the MWLE works very well under the suggested scenarios.

3.10.2 Complementary Populations

We explained in Section 2.2 how the likelihood weights can be seen as mixing probabilities. Can the MAMSE weights detect and exploit the fact that Population 1 has the same distribution as a mixture of some of the other populations?
Would the quality of the inference then be improved?

    Efficiency of the MWLE
    Δ \ n      5   10   15   20   25   50  100  200
    0        223  223  223  222  222  221  222  221
    0.001    223  225  223  221  222  223  221  220
    0.01     223  222  222  220  221  221  220  218
    0.10     216  209  203  197  191  169  142  113
    0.25     187  165  147  135  125  100   83   78
    0.50     139  111   97   90   85   79   83   89
    0.75     111   91   82   82   85   85   90   94
    1.00      98   85   84   83   85   90   94   97
    1.50      88   94   86   88   89   90   97   98
    2.00      86   92   89   91   93   96   98   99

Table 3.4: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE). Samples of sizes n and 10n are simulated from Normal distributions with unit variance and means 0 and Δ respectively. The results are averaged over 40000 replicates.

Pseudo-random samples of equal sizes n are drawn from the distributions

    Pop. 1 : N(0, 1),    Pop. 2 : |N(0, 1)|,    Pop. 3 : −|N(0, 1)|,

where |·| denotes absolute values. Hence Population 2 has a Half-Normal distribution and Population 3 follows the complementary distribution.

We consider different sample sizes, each scenario being repeated 10000 times. The results are summarized in Table 3.5. The first column shows 100 MSE(MLE)/MSE(MWLE); the other columns show the average MAMSE weights allocated to each of the three populations.

First observe that the combined average MAMSE weight of Populations 2 and 3 accounts for at least half of the total weight for all sample sizes. The MAMSE weights thus detect that an equal mixture of Populations 2 and 3 shares the same distribution as Population 1. Note also that the relative efficiency is uniformly greater than 100, meaning that the MWLE with MAMSE weights is preferable to the MLE in these situations.
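For the Normal model used throughout these simulations, the weighted likelihood is easy to maximize: the MWLE of the mean reduces to the λ-weighted average of the per-population sample means. The sketch below uses fixed, hand-picked weights (not MAMSE weights) and identical populations, purely to illustrate how relative-efficiency figures like those in the tables are produced; the ratio it prints is indicative only and does not reproduce any table entry.

```python
import random

def weighted_mle_mean(samples, lam):
    """MWLE of a Normal mean: maximizing
    sum_i lam_i * (1/n_i) * sum_j log f(x_ij | mu)
    over mu gives the lam-weighted average of the sample means."""
    means = [sum(s) / len(s) for s in samples]
    return sum(l * m for l, m in zip(lam, means))

random.seed(2)
reps, n = 2000, 5
lam = (0.5, 0.5)              # fixed illustrative weights, not MAMSE weights
mse_mle, mse_mwle = 0.0, 0.0
for _ in range(reps):
    s1 = [random.gauss(0, 1) for _ in range(n)]   # target population
    s2 = [random.gauss(0, 1) for _ in range(n)]   # identical second population
    mle = sum(s1) / n
    mwle = weighted_mle_mean([s1, s2], lam)
    mse_mle += mle**2 / reps      # the true mean is 0
    mse_mwle += mwle**2 / reps
print(100 * mse_mle / mse_mwle)   # akin to the efficiency ratios reported above
```

With two identical populations and equal weights, the weighted estimate effectively pools both samples, so the ratio is well above 100.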
    n       Efficiency  100·λ₁  100·λ₂  100·λ₃
    5          115        50      19      30
    10         121        46      23      30
    15         120        46      25      29
    20         118        45      25      29
    25         118        45      26      29
    50         117        45      27      28
    100        116        44      27      28
    200        116        44      28      28
    1000       115        44      28      28
    10000      116        44      28      28

Table 3.5: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) and average MAMSE weights allocated to samples of sizes n drawn from N(0,1), |N(0,1)| and −|N(0,1)| respectively. The results are averages over 10000 repetitions.

3.10.3 Negative Weights

In most cases, the unconstrained optimization of P(λ) yields positive weights. In some cases such as the one that we are going to explore, negative weights systematically occur. Some previous work such as van Eeden & Zidek (2004) showed that allowing negative weights may sometimes boost the performance of the MWLE. We explore the possibility of such improvements here.

Imagine a situation where a measurement of interest is cheaply obtained, but it is costly to determine whether a patient is diseased or not. We want to study the measurement of interest on the diseased patients. Suppose we have two small samples (one diseased, one not) as well as a larger sample where the health status of patients is unknown. If we allow negative values for MAMSE weights, would they adapt by including the larger population in the inference and allocating a negative weight to the small healthy population?

To represent the hypothetical situation above we simulate from the following distributions:

    Pop. 1 : N(0, 1),    Pop. 2 : 0.5 N(0, 1) + 0.5 N(Δ, 1),    Pop. 3 : N(Δ, 1),

where Populations 1 and 3 have equal sample sizes of n, but Population 2 has a sample size of 10n. Each scenario is repeated 40000 times. Although we allow weights to be negative, we still apply the preprocessing step and set the weight of a population to 0 when it does not overlap with the sample from Population 1.
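The preprocessing step just described can be sketched as a simple range-overlap test. The exact rule used in the thesis may differ; in particular, the renormalization of the remaining weights below is an assumption of this sketch.

```python
def overlaps(target, other):
    """True if the range of `other` intersects the range of `target`."""
    return min(other) <= max(target) and min(target) <= max(other)

def preprocess_weights(samples, lam):
    """Force lam_i = 0 when sample i does not overlap the target sample
    (Population 1), then renormalize so the weights still sum to 1
    (renormalization is an assumption of this sketch)."""
    target = samples[0]
    lam = [l if i == 0 or overlaps(target, s) else 0.0
           for i, (l, s) in enumerate(zip(lam, samples))]
    total = sum(lam)
    return [l / total for l in lam]

s1 = [0.1, 0.5, 1.2]   # target sample
s2 = [0.8, 1.0, 3.0]   # overlaps the target
s3 = [5.0, 6.2, 7.1]   # disjoint from the target: its weight is zeroed
print(preprocess_weights([s1, s2, s3], [0.4, 0.3, 0.3]))
```

Only the disjoint population loses its weight; the others keep their relative importance.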
If the preprocessing were ignored, there would be a possibility that A would be nonnegative definite and that the MAMSE weights would not be unique. Applying the preprocessing does not affect the pertinence of this example: if the distributions in the populations of diseased and healthy are so different that the samples are often disjoint, there is no point in using the weighted likelihood to include Population 2 as the measurements are in fact a cheap diagnostic test. Moreover, previous simulations without preprocessing yielded results that are not better than those presented here.

Figure 3.1 shows the average values of the unconstrained MAMSE weights for different scenarios. Negative weights do appear, hence the MAMSE criterion detects that Population 2 is a mixture of the other two populations and removes the component which is not of interest. For a large Δ, notice how the negative weights are closer to 0 for smaller samples. In such cases, there is a higher probability that the sample from Population 3 will be disjoint of the sample from Population 1. As a result, the weight allocated to Population 3 is more often forced to 0 by the preprocessing step. As the sample size increases, the samples overlap more frequently.

Table 3.6 shows the performances obtained by the MWLE with unconstrained MAMSE weights. The MWLE performs better than the MLE in most cases, being almost twice as good in many cases. Unfortunately, the performances for large Δ are very poor, especially in the cases where the difference between the populations is so large that they overlap very lightly. Using a weighted likelihood with negative weights provides an improvement over the MLE, but a similar improvement may be obtainable when the constraints are enforced.
[Figure 3.1 appears here: bar charts of the average weights 100·λ₁, 100·λ₂ and 100·λ₃ for n = 5, 10, 15, 20, 25, 50, 100 and Δ ranging from 0 to 5.]

Figure 3.1: Average values of 100× the MAMSE weights without the constraints λ_i ≥ 0. Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

Table 3.7 shows the performance of the MWLE when the usual MAMSE weights are used. Figure 3.2 shows the average values of the weights obtained in that case. Using the MWLE with positively constrained MAMSE weights also provides an improvement over the MLE. This improvement is sometimes larger than that obtained with unconstrained weights. To discern between the two versions of MAMSE weights, Table 3.8 compares their relative efficiency; values above 100 favor the unconstrained weights. Note that the standard deviation of the error due to simulation in Table 3.8 can be more than one unit, but does not exceed 1.3 units.
    100 MSE(MLE)/MSE(MWLE)
    Δ \ n      5   10   15   20   25   50  100
    0        195  196  197  198  197  197  198
    0.001    196  196  197  197  198  198  197
    0.01     196  196  197  197  198  198  197
    0.10     195  194  194  194  192  184  172
    0.25     190  182  176  170  165  144  121
    0.50     173  153  140  131  124  107   97
    1.00     137  113  105  101  100   97   96
    2.00      92   84  116   86   84   84   84
    5.00      54   51   49   51   57   62   55

Table 3.6: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the MAMSE weights are calculated without the constraints λ_i ≥ 0. Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

    100 MSE(MLE)/MSE(MWLE)
    Δ \ n      5   10   15   20   25   50  100
    0        211  209  210  210  209  208  208
    0.001    212  210  209  209  210  209  208
    0.01     212  210  210  209  210  209  208
    0.10     212  209  207  206  203  194  180
    0.25     207  196  187  180  173  146  118
    0.50     186  161  144  131  122   98   82
    1.00     139  111   97   89   86   79   82
    2.00      82   97   79   78   79   84   90
    5.00      51   48   50   57   53   68   79

Table 3.7: Relative efficiency as measured by 100 MSE(MLE)/MSE(MWLE) when the usual MAMSE weights (i.e. constrained to positive values) are used. Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

It seems that allowing negative weights further improves the performances only in a few cases. In fact, Figure 3.2 shows that Population 2 by itself can be used and Table 3.7 shows it has a positive impact. Table 3.8 suggests that the constrained MAMSE weights are to
                100 MSE(constrained)/MSE(negative)
Δ         n=5    10    15    20    25    50   100
0          92    94    94    94    95    95    95
0.001      92    94    94    94    93    95    95
0.01       92    93    94    94    94    95    95
0.10       92    94    94    94    93    95    96
0.25       92    94    93    95    96    98   102
0.50       93    95    98   100   102   109   119
1.00       99   102   109   114   117   123   117
2.00       94   119   112   109   107   107   100
5.00      100   101   102   102   101    91    69

Table 3.8: Relative efficiency of the MWLE with and without the constraints λ_i ≥ 0 as measured by 100 MSE(constrained MWLE)/MSE(unconstrained MWLE). Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

be preferred more often than not. If we consider other complications arising from allowing negative weights (e.g. making the weighted EDF non-monotone), keeping the constraints λ_i ≥ 0 in the definition of the MAMSE weights seems a better option. A different prevalence of diseased individuals in Population 2 could affect the simulation results. If major differences were observed, the conclusion above could be revisited.

3.10.4 Earthquake Data

We now use a model whose weighted likelihood estimate does not have a simple form, i.e. it is not a weighted average of the MLE of each population. Natural Resources Canada (http://earthquakescanada.nrcan.gc.ca/) maintains an educational website with resources about earthquakes. From their website, it is possible to download data about recent Western Canadian earthquakes. The histograms in Figure 3.3 show the magnitude of the earthquakes that occurred in the 5-year period from the 12th of February 2001 to the 12th of February 2006. Events are divided into 3 groups depending on the geographical location of their epicenter. For the purpose of this example, we make
[Figure 3.2 about here: a panel chart of the average MAMSE weights for each combination of Δ and n; the numerical annotations are not recoverable from the text extraction.]

Figure 3.2: Average values of 100× the usual MAMSE weights (with constraints λ_i ≥ 0). Samples of size n, 10n and n are taken from each population. Population 2 is an equal mixture of Populations 1 and 3 that respectively follow a N(0,1) and a N(Δ,1) distribution. All results are averages over 40000 repetitions.

the assumption that the magnitudes of the earthquakes are independent random variables and fit a gamma distribution to each of the three populations using maximum likelihood. The fitted curves appear on Figure 3.3 and the estimated values of their parameters are shown in Table 3.9 along with the number of observations in each area. The gamma model is parametrized as

    f(x | β, ρ) = β^ρ x^(ρ−1) e^(−βx) / Γ(ρ)

for β, ρ, x > 0.

[Figure 3.3 about here: three histograms with fitted Gamma curves; panel titles: Lower Mainland - Vancouver Island; Elsewhere in British Columbia or in Alberta; Yukon and North West Territories; horizontal axis: Magnitude.]

Figure 3.3: Histograms of the magnitude of earthquakes measured between the 12th of February 2001 and the 12th of February 2006 for three different Western Canadian areas. The curves correspond to the fitted Gamma density.
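Assuming the rate parametrization above, the fitted curves of Figure 3.3 are easy to check numerically. The helper names below (`dgamma`, `fit_gamma_mom`) are mine, and the fitting routine uses the method of moments as a simple stand-in for the maximum likelihood fit reported in Table 3.9.

```python
import math

def dgamma(x, beta, rho):
    """Gamma density with rate beta and shape rho:
    f(x) = beta^rho * x^(rho-1) * exp(-beta*x) / Gamma(rho)."""
    return beta ** rho * x ** (rho - 1.0) * math.exp(-beta * x) / math.gamma(rho)

def fit_gamma_mom(xs):
    """Method-of-moments fit, using mean = rho/beta and variance = rho/beta^2."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    beta = m / v
    return beta, m * beta   # (rate, shape)
```

Under this parametrization the fitted mean magnitude is ρ̂/β̂, e.g. roughly 1.654/1.437 ≈ 1.15 for the Lower Mainland - Vancouver Island sample of Table 3.9.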
            Lower Mainland -    Elsewhere in BC    Yukon and North
            Vancouver Island    or in Alberta      West Territories
ρ̂            1.654               2.357              6.806
β̂            1.437               1.869              2.782
n            4743                4866               1621

Table 3.9: Number of earthquakes in three Western Canadian areas between the 12th of February 2001 and the 12th of February 2006. The magnitude of these earthquakes is modeled by a Gamma distribution; the maximum likelihood estimates appear above and are used as the "true" parameters for this simulation.

We focus our interest on the magnitude of the next earthquake with epicenter in the Lower Mainland - Vancouver Island area. Suppose that only the 50 most recent events from each of the three regions are available. Would the MWLE that uses data from all three regions provide a better estimate than the MLE? To investigate the question, we produce 10000 pseudo-random samples of earthquakes based on the fitted gamma models shown above. The average MAMSE weights are 0.959 for the Lower Mainland - Vancouver Island area, 0.041 for the rest of British Columbia and Alberta and finally, nearly 0 for Yukon and North West Territories. Although it looks like a small contribution, the MSE of the MWLE for the vector (β, ρ) was smaller, with 100 MSE(MLE)/MSE(MWLE) = 107. We also considered other values of possible interest, namely some probabilities about the magnitude (M) of the next earthquake that are all obtained by plugging the MLE or MWLE in the Gamma model. Table 3.10 summarizes these results.
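The plug-in probabilities of Table 3.10 only require the Gamma upper-tail probability P(M > k) at the fitted parameters. A minimal sketch (the function `gamma_sf` is my own; it integrates the density numerically instead of calling an incomplete-gamma routine, and truncates the tail at a fixed upper bound):

```python
import math

def gamma_sf(k, beta, rho, steps=20000, upper=60.0):
    """P(M > k) under a Gamma(rate beta, shape rho) model, by trapezoidal
    integration of the density over (k, upper); the tail beyond `upper`
    is negligible for the magnitudes considered here."""
    def f(x):
        return beta ** rho * x ** (rho - 1.0) * math.exp(-beta * x) / math.gamma(rho)
    h = (upper - k) / steps
    total = 0.5 * (f(k) + f(upper))
    for i in range(1, steps):
        total += f(k + i * h)
    return total * h
```

For example, gamma_sf(1.0, 1.437, 1.654) evaluates the model probability of a magnitude above 1 at the Lower Mainland - Vancouver Island estimates of Table 3.9.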
   Prob        Efficiency    MLE   MWLE   Model   Data   Multiplier
   P(M > 1)       123         62     63     68     51     x 10^-2
   P(M > 2)       114         22     24     22     40     x 10^-2
   P(M > 3)       112         66    174     73     98     x 10^-3
   P(M > 4)       113         19     21     51     26     x 10^-3
   P(M > 5)       112         51     59     99     53     x 10^-4
   P(M > 6)        80         14     17     12      6     x 10^-4

Table 3.10: Efficiency in estimating some probabilities about the magnitude of the next earthquake in the Lower Mainland - Vancouver Island area, followed by the average of the actual estimates and their true values. Efficiency is measured by 100 MSE(plug-in MLE)/MSE(plug-in MWLE). The first four columns of probabilities should be multiplied by the corresponding multiplier.

The first column of Table 3.10 corresponds to the relative efficiency of using the MWLE compared to using the MLE as plug-in parameters for the gamma model in order to evaluate the probability of interest. The numbers shown are 100 MSE(plug-in MLE)/MSE(plug-in MWLE), followed by the estimated values of P(M > k) using the MLE and the MWLE as plug-in parameters. For comparison purposes, the columns Model and Data contain respectively the true probabilities (from the simulated model) and the empirical proportions in the complete dataset. All probabilities are scaled for easier reading; using the corresponding multiplier will yield the original value. Note that discrepancies with the empirical probabilities reveal weaknesses of the gamma model to perfectly represent the magnitude of earthquakes rather than an advantage for one method over the other.

Interestingly enough, the MSE of the estimates is almost always smaller with the MWLE. Improved performance is hence possible by using the MWLE with MAMSE weights in this situation with distributions copied from real life.

Chapter 4

MAMSE Weights for Right-Censored Data

Situations arise where the exact value of some data may not be observed.
In engineering, a test of the lifetime of light bulbs, say, may not last long enough to see the last bulb fail. In health science, patients may move away while participating in a study and the researcher does not know when the symptoms stopped or when the death of the patient occurred. In these two examples, the data are right-censored: some times of failure are not observed, but they are bounded below. If censored data were ignored, large values of the response variable would be more often excluded from the inference than early deaths (or failures), resulting in an underestimation of the lifetimes. Some methods exist to account for censored data, including a nonparametric estimate of the cumulative distribution function proposed by Kaplan & Meier (1958). We suggest use of that estimate to define a version of the MAMSE weights for censored data.

The paradigm still involves m populations, one of which is of inferential interest. This situation may arise in practice when:

• interest concerns a subgroup of the population studied,

• data come from different studies with different schemes for censoring,

and possibly under other circumstances.

We first introduce the notation for this section and review the Kaplan-Meier estimate and its properties. We then define the MAMSE weights and show that the algorithm for calculating them converges. We propose the MAMSE-weighted Kaplan-Meier estimate and prove its uniform convergence to the target distribution. Finally, simulations explore the performance of our proposed estimate in finite samples.

4.1 Notation and Review of the Kaplan-Meier Estimate

We introduce a notation that comprises m possibly right-censored populations and accounts for increasing sample sizes. For simplicity, we will adopt a "survival analysis" terminology where the measurement of interest is the survival of individuals.
For Population i, let

    X_ij = time of death of individual j,
    V_ij = censoring time of individual j.

The independent positive random variables X_ij(ω) and V_ij(ω) are defined on a common probability space (Ω, B(Ω), P). We denote their distributions by F_i and G_i respectively; the distributions F_i are assumed to be continuous, but the G_i do not need to satisfy that assumption. Instead of the values X_ij, we rather observe

    Z_ij = min(X_ij, V_ij)   and   δ_ij = 1(X_ij ≤ V_ij).

For any fixed k ∈ IN, we observe (Z_ij, δ_ij) for i = 1, …, m and j = 1, …, n_ik. The index k will be useful to express the asymptotic results of Section 4.4. We assume that the sample sizes are non-decreasing with k and that n_1k → ∞ as k → ∞. Let us define

    dN_ik(s) = N_ik(s) − N_ik(s⁻) = # of deaths at time s,
    Y_ik(s) = Σ_{j=1}^{n_ik} 1(Z_ij ≥ s) = # at risk just before time s,
    dY_ik(s) = Y_ik(s) − Y_ik(s⁺) = # of deaths and censored at time s.

For Population i, the Kaplan-Meier estimate (KME) of the probability of dying at time t or earlier is written (see Kaplan & Meier 1958)

    F̂_ik(t) = 1 − Π_{s ≤ t} {1 − dN_ik(s)/Y_ik(s)}.

Note that the factors of the product differ from 1 only at times of death. Let H_i(t) = P(Z_i1 ≤ t) and let τ_{H_i} = sup{t : H_i(t) < 1} be the largest value that Z_ij can attain. The possibility that τ_{H_i} = ∞ is not ruled out although it is unlikely to occur in practice. In addition, let

    H_i*(t) = P(Z_i1 ≤ t, δ_i1 = 1) = ∫_0^t {1 − G_i(x)} dF_i(x)

be the distribution of observed death times for Population i.

The Kaplan-Meier estimate is an increasing step function with a jump at each of the times of death. By the definition of τ_{H_1}, the number of deaths observed in Population 1 is N_k = N_1k(τ_{H_1}). For k ∈ IN, let t_{k1} < ⋯ < t_{kN_k} be the ordered times of these deaths. The times of death are distinct by the continuity of F_1. If in addition we use the convention that t_{k0} = 0, the jumps of F̂_1k(t) are J_kj = F̂_1k(t_kj) − F̂_1k(t_{k(j−1)}) for j ∈ {1, …, N_k}, and we have that Σ_{j=1}^{N_k} J_kj ≤ 1.

Efron (1967) discusses how the Kaplan-Meier estimate redistributes the weight of the censored data to the observations on the right. Consequently,

    J_k1 ≤ J_k2 ≤ ⋯ ≤ J_kN_k.   (4.1)

We will consider the Kaplan-Meier estimate on a bounded interval [0, U] with U < τ_{H_1}.

Theorem 4.1 (Winter, Foldes & Rejto)

    sup_{t ≤ U} |F̂_ik(t) − F_i(t)| → 0

almost surely as n_ik → ∞.

Foldes & Rejto (1981) study the rate of convergence of sup_{t ≤ U} |F̂_ik(t) − F_i(t)|. To get a better rate, they assume that the distribution G_i is continuous. However, they also mention that they proved the result of Theorem 4.1 without making any assumptions about F_i and G_i in some earlier work published as Winter, Foldes & Rejto (1978).

Efron (1967) and Breslow & Crowley (1974) assume that the distribution G_i is continuous and show that F̂_ik(t) is approximately normal with mean F_i(t) and variance

    (1/n_ik) {1 − F_i(t)}² ∫_0^t {1 − H_i(s)}⁻² dH_i*(s).

This expression for the variance can be estimated using Greenwood's formula (see e.g. Chen & Lo (1997), page 1069), yielding

    var{F̂_ik(t)} ≈ v̂ar{F̂_ik(t)} = {1 − F̂_ik(t)}² Σ_{s ≤ t} dN_ik(s) / {Y_ik(s) Y_ik(s⁺)}.   (4.2)

This expression becomes less reliable as t approaches τ_{H_i}, but for a large enough n_ik, it should be reliable on any interval [0, T] with T < U. Note that the terms in the sum above are 0 except when s is a time of death.

Defining MAMSE weights based on the Kaplan-Meier estimate involves using an estimate of its variance. Equation (4.2) is thus used for the definition of the MAMSE weights in Equation (4.3). Since the variance term is used as a penalty to foster the inclusion of many populations into the inference, we take the liberty of using v̂ar{F̂_ik(t)} even though we do not make the assumption that G_i is continuous.

In the next section, we build a version of the MAMSE weights based on the Kaplan-Meier estimate.
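The Kaplan-Meier estimate and the variance formula (4.2) just reviewed can be sketched as follows before turning to the weights. The function assumes all observed times are distinct, which matches the setting here since the death-time distributions F_i are continuous; the name `kaplan_meier` and the return format are my own conventions, not the thesis's.

```python
def kaplan_meier(z, delta):
    """Return a list of (death time, F_hat, var_hat) tuples, where F_hat is
    the Kaplan-Meier CDF estimate and var_hat the variance estimate (4.2),
    assuming all observed times z are distinct (delta = 1 marks a death)."""
    data = sorted(zip(z, delta))
    n = len(data)
    surv = 1.0   # current value of 1 - F_hat
    acc = 0.0    # running sum of dN(s) / {Y(s) Y(s+)}
    out = []
    for i, (t, d) in enumerate(data):
        y = n - i                # number at risk just before time t
        if d == 1:
            surv *= 1.0 - 1.0 / y
            if y > 1:            # Y(s+) = y - 1 when a single death occurs at s
                acc += 1.0 / (y * (y - 1))
            out.append((t, 1.0 - surv, surv * surv * acc))
    return out
```

With no censoring the estimate reduces to the empirical CDF; censored observations produce no output point but still shrink the risk set, which redistributes their mass to later death times exactly as Efron (1967) describes.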
The weights are then used to define a MAMSE-weighted Kaplan-Meier estimate (WKME).

4.2 Definition of the MAMSE Weights for Right-Censored Data

We suggest using an expression of the form Σ_{i=1}^m λ_i F̂_ik(t) to estimate F_1(t). To find the weights λ = [λ_1, …, λ_m]^T adaptively, we use a version of the MAMSE criterion, see Equation (2.1), where the distribution functions are estimated with the corresponding Kaplan-Meier estimates. When censored observations are found after the last time of death, the Kaplan-Meier estimate is not a proper cumulative distribution function on the positive real line because it never reaches a value of 1. For that reason, we assume that we can specify an upper bound U < τ_{H_1} and limit our study of the survival function to the interval [0, T], where T < U is such that H_1*(T) < H_1*(U). The last inequality means that there is a non-null probability that a death is observed in the interval (T, U]. This will be the case whenever the probability of death (observed or not) is non-null in that interval, since the cumulative probability of being censored before T is less than one by the definition of τ_{H_1} > U.

Preprocessing

For a fixed k and i ∈ {2, …, m}, let

    m_ik = min_{j ≤ n_ik : δ_ij = 1} Z_ij   and   M_ik = max_{j ≤ n_ik : δ_ij = 1} Z_ij

be the smallest and largest times of death observed in Population i. The weights allocated to the sample from Population i are set to 0 if it fails to satisfy the following two conditions:

1. U ∈ [m_ik, M_ik], i.e. at least one observed death from Population i is in the interval [0, U] and at least one observed death occurs after U;

2. #{j ≤ n_1k : δ_1j = 1, Z_1j ∈ [m_ik, min(M_ik, U)]} ≥ 1, i.e. at least one observed death from Population 1 which occurred in [0, U] falls within the range of the observed times of death in Population i.

Condition 1 ensures that Formula (4.2) is well defined on [0, U] and not null everywhere on that interval.
Condition 2 means that the same formula will be strictly positive for at least one of the times of death from Population 1 in [0, U], ensuring the unicity of the MAMSE weights and the convergence of the algorithm used to calculate them. These consequences are explained in greater detail in Section 4.3.

The preprocessing requirements appear to be technical, but they avoid using samples yielding unreliable Kaplan-Meier estimates. After the last time of death, the KME remains constant forever. If Condition 1 failed, we could be relying on such a plateau where subtle differences in the distribution of the lifetimes may be hidden by the censoring scheme. If Condition 2 failed, the whole range where the KME of Population i increases would be compared against a constant function from Population 1. Hence, the conditions also have some intuitive foundations.

If very few deaths from Population 1 fall in [0, U], Condition 2 is likely to fail and the other populations may see their weights set to 0. In such cases, the little information available about Population 1 makes it hard to compare it to the other samples. It is then better to be more conservative and to avoid relying on the other populations. In particular, if less than 2 deaths from Population 1 fall in the interval [0, U], we allocate no weight to the other populations. The lack of knowledge about Population 1 makes the comparison to other populations too uncertain.

Let M_k ⊆ {1, …, m} be the set of indices corresponding to the populations whose samples satisfy the preprocessing conditions. We always have 1 ∈ M_k since Population 1 is never removed from the pool of populations.

Objective Function

Let

    P_k(λ) = ∫_0^U [ {F̂_1k(t) − Σ_{i=1}^m λ_i F̂_ik(t)}² + Σ_{i=1}^m λ_i² v̂ar{F̂_ik(t)} ] dF̂_1k(t)   (4.3)

be a special case of Equation (2.1) where var{F̂_ik(t)} is estimated by v̂ar{F̂_ik(t)} from Equation (4.2) and dμ(x) is replaced by dF̂_1k(t).

Note that none of the preprocessing steps can remove Population 1 since it is the population of interest. In cases where the last observed death in Population 1 is contained in [0, U] (i.e. when M_1k = t_{kN_k} ≤ U), the expression for v̂ar{F̂_1k(t_{kN_k})} involves a division by 0 since Y_1k(t_{kN_k}⁺) = 0; in that case, we substitute the ill-defined term by its value just before time t_{kN_k}. Although this solution would not be acceptable for constructing confidence intervals, it should do no harm here where our purpose is to construct a penalty term that fosters using the data from all populations. In particular, this adjustment will affect at most one term of the integral P_k(λ).

The weights are chosen to

    minimize P_k(λ)   subject to   λ ≥ 0 and Σ_{i=1}^m λ_i = 1.

The solution to that program will be referred to as the survival MAMSE weights and written λ_k(ω) = [λ_1k(ω), …, λ_mk(ω)]^T to represent their dependence on ω and k. For values of t in the interval [0, T], the weighted Kaplan-Meier estimate (WKME) of the lifetime's CDF is then defined by

    G_k(t) = Σ_{i=1}^m λ_ik(ω) F̂_ik(t).   (4.4)
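To make the optimization concrete, here is a hypothetical sketch for m = 2 populations: a plain grid search over the simplex stands in for the algorithm of Section 2.4, and the inputs are the Kaplan-Meier estimates, the variance estimates (4.2) and the jumps of F̂_1k, all evaluated at Population 1's death times (the function `mamse2` is mine, not the thesis's):

```python
def mamse2(jumps, F1, v1, F2, v2, grid=101):
    """Survival MAMSE weights for two populations by minimizing the
    criterion (4.3) over lambda = (1 - a, a) on a grid; jumps[j] is the
    mass that dF1_hat gives to Population 1's j-th death time, F1/F2 are
    the Kaplan-Meier estimates and v1/v2 the variance estimates there."""
    best_a, best_val = 0.0, float("inf")
    for g in range(grid):
        a = g / (grid - 1)
        val = 0.0
        for j in range(len(jumps)):
            mix = (1.0 - a) * F1[j] + a * F2[j]
            val += jumps[j] * ((F1[j] - mix) ** 2
                               + (1.0 - a) ** 2 * v1[j] + a ** 2 * v2[j])
        if val < best_val:
            best_a, best_val = a, val
    return (1.0 - best_a, best_a)
```

When the two estimated survival curves agree, the squared-difference term vanishes and the variance penalty pushes the weights toward an even split; when they disagree, the weight of Population 2 shrinks toward 0.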
Hence dFi (t*) > 0 as well and consequently, k  j<tix{P {x)} lk  dFi (x) >0. fc  J  •  Lemma 4.2 For all external populations remaining after prescreening: i = {2,... ,m},  j™x{P {x)} ik  Proof of Lemma 4-2. Let  &F {x) > 0. lk  and M{ be the smallest and largest time of death respectively k  observed in Population i. Note that the sum in Equation (4.2) is cumulative. It is thus positive for all x > rrii . k  72  Chapter 4. MAMSE Weights for Right-Censored Data  The first condition for preprocessing requires that  < U < M^. Since the Kaplan-  Meier estimate for Population i jumps at M ^ , the first part of Equation (4.2), {1 — Fik{t)} , 2  is positive for all t < M^. Consequently,  var  for all x €  [m ,M ). ik  ik  The second requirement for preprocessing ensures that a death occurs in Population 1 in the interval [m-ik, min(Mjjfc, U)]. Consequently, the discrete measure dFi (x) gives positive k  mass to at least one point in that interval where var |Fjfc(x)| > 0. Therefore,  var { F ( x ) | dF (x) > 0. ife  lk  •  Lemmas 4.1 and 4.2 are sufficient to show that the algorithm in Section 2.4 converges for this new application to the M A M S E weights.  4.4  Uniform Convergence of the MAMSE-Weighted Kaplan-Meier Estimate  We prove that the W K M E , G (t), converges uniformly in probability to F\(t). The proof is k  built as a sequence of lemmas that appear next and follow a logic akin to that of Section 3.5. Remember that we assumed in Section 4.1 that the distribution of the times of death is continuous, but the distribution of the times of censoring need not be. Lemma 4.3  EYMs)Y ^  dNik(s)  0<3<u  _P  lk  as k — > oo.  73  Chapter 4. MAMSE Weights for Right-Censored Data Proof of Lemma 4-3. Notice that dN (s)  v  ^  lk  dY (s) lk  Y (s) - Y (s+) lk  v  lk  The first inequality holds since dNi (s) < dYi (s) for all s. k  k  Suppose that the Z\j's are distinct and for a fixed k, let Z^(j) denote the j  th  of the sample from Population 1, i.e. 
the j  th  order statistic  smallest value in the list Zu, • • •, Z\n - Let ik  jo = max{j : j < n\ , Z^ < U} Then, there is at most one censored datum or death at k  k  any given time and the expression above can be rewritten as  y 0<s<U Z-~>  /_> Y (s )  1 _ \ = Y (s)J  +  lk  y  lk  I VT,J*+\  YTJ.*} I  Z^lY (Z ) lk  Y (Z lk  Y (Z )  lU+1)  lk  )  n  1{jok)  lk  l(j)  - Y (U)  ( 4  lk  -  5 )  since Y (s) is decreasing in s and the series telescopes. ik  Concurrent death times are impossible, but concurrent censoring times are possible since the continuity of their underlying distribution is not assumed. Moreover, censoring cannot occur at a time of death since the two times are independent random variables and at least one of them has a continuous distribution. Even if multiple censoring occurs at one time, inequality (4.5) will still hold. Indeed, the term for concurrent censoring times in the summation is equal to the sum of the individual terms if the times were different but consecutive. For instance, if Yi (t) — Yi (t ) = 2, we have +  k  Yik(t)-Y (t+) lk  Y (t)Y (t+) lk  lk  k  _  1  1  Y (t)-2 1 Y (t)-2 lk  lk  Y (t) 1 Y (t)-1 lk  lk  1 Y (t)-1 + lk  Y (tY lk  This extends for an arbitrary number of concurrent censoring times. Let us now show that the bound l/Y\ (U) k  converges to 0. Since the Z-^fs are indepen-  74  Chapter 4. MAMSE Weights for Right-Censored Data dent, Yik(U) has a Binomial distribution with parameters ni  k  and 1 — Hi(U). Let e > 0,  then  p  \ E vnvfV  ^  8  0<s<£/ —  —  >*} =  P<i/*}  /  l/e-ni {l-Hi(U)}  r  k  ^ni Hi(U){l k  as k  - Hi(U)}  •0  (4.6)  oo since the argument inside the standard normal CDF $ tends to — oo. The ap-  proximation is from the central limit theorem and becomes exact as nik — > oo.  •  Whether a sample is rejected in the preprocessing or not may vary with k and LO. 
Remember that m populations are available before preprocessing and that weights of populations in JA  are forced to 0, but Population 1 is never excluded from the optimization problem.  k  Hence preprocessing does not affect the distribution of probabilities calculated in expressions such as (4.6). Moreover, preprocessing does not change the fact that A = [1,0,..., 0]T is a suboptimal choice of weights, which we use in the proof of the next result. Lemma 4.4 £  {F (t)-G (i)}  dF (t)$0  2  lk  k  lk  as k — > oo, where G (t) is defined in Equation (4-4)k  Proof of Lemma 4-4- By the definition of the MAMSE weights, Pfc{A H}<P {[l,0,...,0] } T  fc  fc  since [1,0,... ,0] is a suboptimal choice of weights. Following Equations (4.2) and (4.3), T  we thus have  £  {F (t) - G (t)}  2  ik  k  dFik <  P {X (Lv)} k  k  <  P {[l,0,...,0] } T  fc  75  Chapter 4. MAMSE Weights for Right-Censored Data  -f  n { 1  p "  r  2 V -  m  Jo  E  <  0<s<U  * E 0<s<U  dN (s) lk  lk  dN (s) Y (s)Y (s lk  lk  dF (t)  lk  lk  Y (s)YMs" lk  F{I Jo  Y (s)Y (s+) lk  dN (s)  ^ 0<s<t  F l k ( t ) }  - F {t)}  2  lk  dF {t) lk  P  lk  as k — > oo by Lemma 4.3.  Let v = max{j < n\ k  : t j < U} be the index of the largest time of death observed at or  k  k  before time U in the sample from Population 1. Since the steps of Fi (t) are increasing in k  size (see Equation 4.1), we may define  Jk = max\F (t) - F (t lk  lk  )| = J  fcl/fc  ,  the biggest jump of Fi {t) on the interval [0, U}. k  Lemma 4.5 J —> 0 almost surely as k — > oo. k  Proof of Lemma 4-5. The result follows from Theorem 4.1 since  J < 2 sup o<t<u k  almost surely as k — > oo.  Recalling that T < U is such that H{{T) < H*(U), we let  D = N (U) - N (T) k  lk  lk  = J2 *{z e(T,u],6 =i} lj  lj  i=i  be the number of deaths observed in the interval (T, U] among individuals sampled from  76  Chapter 4. MAMSE Weights for Right-Censored Data Population 1. 
Since the Zy are independent, D follows a Binomial distribution with k  parameters ni and Hl(U) — Hl(T). k  Let £k — N\k(T) be the number of deaths observed in the interval [0,T], and their corresponding times of death t i < . . . k  no death is observed after  < T. By convention, we set t (e +i)  < te k  k  k  k  =  H  T  X  if  tek  k  Lemma 4.6  P < max Fik{t) - G (t)\< Jk +  max  k  0<t<T  I  *€{tfcl,...,tfc(£  fc  +  F (t) - G {t) lk  1  k  )}  converges to 1 as k —> oo. Proof of Lemma 4-6. Fix k G IN and LO e fl, and let xo G [0, T] be the value maximizing \F\h(t) — Gk(t)\. That maximum exists since |Fifc(*)  Gk(t)\ is a bounded function being  —  optimized on a compact set. Three disjoint cases need to be considered;  Case 1: Gk(%o) < Fik(xo) and  > 1.  Let j\ = max{j G {1,... ,£k} '• tkj < xo} be the index of the largest time of death from Population 1 inferior to XQ. By the choice of j i , t j belongs to the same step as xo and k  x  hence Fikitkn) =  F (x ). lk  0  Moreover, Gk{tkj ) < G (x ). x  k  0  since G {t) is a monotone nondecreasing function. Recalling that XQ maximizes the differk  ence between Fi (t) and Gk(t), we can write k  mffl \Fi {t) - G (t)\ k  k  = F (x ) lk  <  0  - G (x ) k  0  Fik(tkji) - Gk(tkj ) x  77  Chapter 4. MAMSE Weights for Right-Censored Data <  max Fik(t)-G (t) te{tfci,...,tfc£,}  <  Jk+  k  max  Fi (t) - G (t) k  k  meaning that the maximum will always occur at a time of death from Population 1, where F\k{t) has a step.  Case 2: G (xo) > Afe(^o) &nd D > 1. k  .  k  Let J2 = minjj € {1,..., £ + 1} : t j > xo} be the index of the smallest time of death k  k  greater than XQ. The condition D > 1 ensures that tk(e +i) exists, hence j is well defined. k  k  2  The choice of j ensures that it belongs to the step of Fi (t) that immediately follows xo, 2  k  hence Fik(t (j -i)) = k  F (x ).  2  lk  0  The function G (t) is a right-continuous nondecreasing function and t j > XQ, thus k  k  2  G {t ) > G (x ). 
k  k]2  k  0  Recalling that XQ is the point maximizing the difference between Fi (t) and G (t), we write k  k  <  G (t j ) -  =  {Fik{tkj ) - F\ {t ^ -i))} + G (t )  <  J+  k  k  2  Fi (t Qj _i)) k  k  2  k  2  k  k  max  2  k  kj2  -  F (t ) lk  kJ2  Fi (t) - G (t) k  k  *G{tfci,...,t (f )} fc  fc+1  meaning that under Case 2, the maximum will occur immediately before a jump of F (t). lk  Case 3: D = 0. k  This event has probability [1 - {H{(U) - H~i(T)}}  nik  —> 0 as  —> oo. 78-  Chapter 4. MAMSE Weights for Right-Censored Data  The combination of Cases 1 and 2 implies that the desired result is true at least whenever D > 1. Consequently, k  < Jk +  P < max F (t)-G (t) lk  k  0<t<T  max Fi (t) - G (t) te{tki, — ,t 'e )} k  k  k  k+1  > P(D > 1) = 1 - P(D = 0) k  k  as k —> oo since the probability of Case 3 converges to 0 as k —* oo.  1 I  Lemma 4.7  PifcW-GfcCt)  max  t€{tfci,...,tfc(f  fc  +  1  )}  as k —> oo. Proof of Lemma 4-7. Let e > 0 be such that e < iff ((7) - / ^ ( T ) . We will show that  P <  max  Fi (t) - G (t) > e ^ -> 0 k  fc  l*€{*fcl.-">*k(< + l)} fc  asfc—> oo. For a large k,let x G {1,..., £ +1} be the index of a time of death where the difference k  F (t)-G (t) lk  k  k  is maximized. We define the following three events:  A  =  k  £  f e  C  fc  j w G fi : F (t ) lk  = =  kXk  Gfi: G (t ) k  kXk  - G {t ) > e j k  kxk  - F {t ) xk  kXk  > e}  {w G fi : D > enik + 1} . k  Then,  P <  Pifc(t)-G (t)|>6)  max  fc  <  P{Cfu(4nC  f c  )u(B nc k  f c  )}  I tG{tfci.--.>tfc(« +i)} fc  79  Chapter 4. MAMSE Weights for Right-Censored Data  <  p(c^) + p(A nc ) k  +  k  p(B nc ). k  k  We next show that each of the three probabilities on the right hand side go to zero as k —> oo. Note that the event C is used to remove an event of probability 0 that would k  otherwise complicate the proof that P(B ) —-> 0. k  Case 1: P ( C f ) -> 0. 
Recalling that D follows a Binomial distribution with n\ trials and probability of success k  k  {#*([/) - H{(T)}, we have  P(Cf)  =  P{D  k  < en + 1} lk  ^  en + l-n {H*(U)-H*(T)} lk  y/n {H*AU)  \  ~ Ht(T)}{l - H{{U) + H*(T)} J  lk  =  .  lk  $(-c /n~[ ~ + d/^/n^) -> 0 y  k  as k — > oo since the choice of e implies that c > 0, hence the argument inside the standard normal CDF $ tends towards — oo. We use the central limit theorem to approximate the Binomial by the Normal, but this comparison becomes exact as n\  k  approaches oo.  Case 2: P{A n C ) -* 0. k  k  Let u = min{u : J2i= +i ^ki < e}- This index exists when k is large enough since k  u  • Jkx < Jk — 0 by Lemma 4.5 and • >  k  * /^ Jki = Fi (t ) J  k  kXk  > G (t ) + e>e. k  kXk  i=l •  For a large enough  fc,  < e and hence u < x — 1. For j € {ufc,'..., k  k  — 1}, we have  G (t j) < G (t ) . k  k  k  kXk  80  Chapter 4. MAMSE Weights for Right-Censored Data  since the function G (t) is monotone nondecreasing and k  — /2  Fik{tkj) = F {t ) lk  kXk  since the function F\ {t) makes a jump of size J i at time t^. Combining these inequalities k  k  yields X  3*k  k  F\k{tkj) - Gk{tkj) > Fi {t ) k  - Gk{t ) ~ ^2 Jki>e— ^2 i=j+l i=j+l  kXk  Jki>0-  kXk  The last inequality holds because of the choice of u . The function F\ {t) gives a mass of k  k  J j to the point t j, and hence k  k  /  \G (t)-F (t)\ dF (t)  >  2  k  lk  lk  Y,J \G (t )-F (t )\  2  kj  —  k  kj  yi kj\G (t j) -  kj  Fi (t j)Y  j  k  lk  k  k  k  3=Uk  >  kx e  2  J  k  +  k3  J  2  Jo  x  \  i=j+l  \  k  U ~ J2  3=Uk  f\e - x)  (  J  ki  2  '  ( - )' 4  7  )  I e dx = / x dx = — £  3  2  since the summation corresponds to the Riemann sum for the integral J (e — x) dx depicted £  2  0  in Figure 4.1. The sum converges as k —-> oo because the width of the columns J j tends to k  zero by Lemma 4.5. 
To clarify the link between the Riemann sum and the integral, consider the change of variable p = x — j and let k  0  p=0  ]LX=l k(x -i+l) J  k  p=l,--.,x -u k  k  81  Chapter 4. MAMSE Weights for Right-Censored Data  Note that c ( ) — c fc  p+1  kv  — Jkj and with respect to the variable j , c  = Jk{ -p) Xk  equals  kp  J2i=j+i Jki when p > 0. We can thus write the expression in (4.7) as  2J  ( fc(p+i) - fcp)( - fep) ~* / (e-x) dx c  c  e  c  2  =  2  P=O  (4.8)  x dx = — 2  7o  «  Consequently, there exists a rCo such that  /V (0-F (i)| dF (i)>^o 2  7o  fc  lfc  lfc  for all fe > fco, an event of probability 0 according to Lemma 4.4. We conclude that P(A  k  n C ) -* 0 as k -> oo. k  A r e a of the s h a d e d c o l u m n : (ck(p+l)  -  C )(e kp  A r e a of t h e s h a d e d c o l u m n :  l l  C )  2  kp  (dk, - 4 ( - i ) ) ( e ?  4  9  )  2  H a s h e d column is ignored  dkidk  Figure 4.1: Graphics representing the Riemann sums used in the proof of Case 2 (left panel) and Case 3 (right panel).  Case 3: P(B D C ) -» 0. k  k  Recall Equation (4.1) and note that the smallest possible size of a jump in F\j(t) is \jn\kHence Jkj > l / n ^ and /J^ > en\k + 1 implies that  {j:t ,e(r,t/]j>4+i} fc  Let  ffc  = max{?j  :  Y^ j=x +i J^j — }- F ° €  V  k  ra  large enough /c, Lemma 4.5 implies that 82  Chapter 4. MAMSE Jk(x +i) k  <  < Jk  e  Weights for Right-Censored Data  and thus v > x + 1. For j € {x + 1,.'.., i^}, the monotonicity of the k  k  k  nondecreasing function G (t) implies that k  Gk(tkj)  In addition, the function Fi (t), k  >  Gk(tkx ) k  a Kaplan-Meier estimate, makes a jump of size J i at each k  time of death t i- Therefore, k  j  E\k(tkj)  = Eik{tkx )  +  k  E  Combining these two inequalities yields . Gk(tkj)  - Fik(t j)  > Gk(tkx )  k  - Fik(tkx )  k  -  k  E  ki>£~  E  J  i=H + l  — ^'  i=x +l fc  r  the last inequality holding because of the choice of v . 
Using again the fact that dF_{1k}(t) allocates a mass of J_{kj} to t_{kj}, we find that

∫_0^U |G_k(t) − F_{1k}(t)|² dF_{1k}(t) ≥ Σ_j J_{kj} |G_k(t_{kj}) − F_{1k}(t_{kj})|² ≥ Σ_{j=x_k+1}^{v_k} J_{kj} |G_k(t_{kj}) − F_{1k}(t_{kj})|²    (4.9)

  ≥ Σ_{j=x_k+1}^{v_k} J_{kj} ( ε − Σ_{i=x_k+1}^{j} J_{ki} )² → ∫_0^ε (ε − x)² dx = ∫_0^ε x² dx = ε³/3    (4.10)

since the summation corresponds to the Riemann sum for the integral ∫_0^ε (ε − x)² dx depicted in Figure 4.1. The term J_{kx_k} ε² ignored in Equation (4.9) corresponds to the hashed column. The sum converges as k → ∞ because the width of the columns J_{kj} tends to zero by Lemma 4.5.

To clarify the link between the Riemann sum and the integral, consider the change of variable q = j − x_k and let

d_{kq} = 0 if q = 0,   d_{kq} = Σ_{i=1}^{q} J_{k(x_k+i)} if q = 1, ..., v_k − x_k.

Note that d_{kq} − d_{k(q−1)} = J_{k(x_k+q)} = J_{kj} and that, with respect to the variable j, d_{kq} corresponds to Σ_{i=x_k+1}^{j} J_{ki} when q > 0. We can thus rewrite Expression (4.9) as

Σ_{q=1}^{v_k−x_k} (d_{kq} − d_{k(q−1)}) (ε − d_{kq})² → ∫_0^ε (ε − x)² dx = ∫_0^ε x² dx = ε³/3.    (4.11)

Therefore, there exists a k_0 such that

∫_0^U |G_k(t) − F_{1k}(t)|² dF_{1k}(t) > ε³/6

for all k ≥ k_0, an event of probability 0 according to Lemma 4.4. We conclude that P(B_k ∩ C_k) → 0 as k → ∞.

Combining the three cases yields the desired result. ∎

Lemma 4.8 For T < U < τ_{H_1} with H_1^*(T) < H_1^*(U) and any sufficiently small ε > 0,

P{ sup_{t≤T} |G_k(t) − F_{1k}(t)| > ε } → 0

as k → ∞.

Proof of Lemma 4.8. For an arbitrary ε > 0, let

A_k = {ω ∈ Ω : max_{0≤t≤T} |F_{1k}(t) − G_k(t)| ≤ J_k + max_j |F_{1k}(t_{kj}) − G_k(t_{kj})|},

B_k = {ω ∈ Ω : max_j |F_{1k}(t_{kj}) − G_k(t_{kj})| ≤ ε/2},

C_k = {ω ∈ Ω : J_k < ε/2}.

Clearly, A_k ∩ B_k ∩ C_k implies that sup_{t≤T} |F_{1k}(t) − G_k(t)| < ε. Therefore,

P{ sup_{t≤T} |G_k(t) − F_{1k}(t)| < ε } ≥ P(A_k ∩ B_k ∩ C_k) ≥ P(A_k) + P(B_k) + P(C_k) − 2 → 1

as k → ∞ by Lemmas 4.5, 4.6 and 4.7. ∎
Theorem 4.2 Let 0 < T < U < τ_{H_1} with H_1^*(T) < H_1^*(U). For all sufficiently small ε > 0,

P{ sup_{t≤T} |G_k(t) − F_1(t)| < ε } → 1

as k → ∞, i.e. sup_{t≤T} |G_k(t) − F_1(t)| converges to 0 in probability.

Proof of Theorem 4.2. Let us define

A_k = {ω ∈ Ω : sup_{0≤t≤T} |G_k(t) − F_1(t)| < ε},

B_k = {ω ∈ Ω : sup_{0≤t≤T} |G_k(t) − F_{1k}(t)| < ε/2},

C_k = {ω ∈ Ω : sup_{0≤t≤T} |F_{1k}(t) − F_1(t)| < ε/2}.

By the triangle inequality,

|G_k(t) − F_1(t)| ≤ |G_k(t) − F_{1k}(t)| + |F_{1k}(t) − F_1(t)|

for any fixed t. In particular,

sup_{t≤T} |G_k(t) − F_1(t)| ≤ sup_{t≤T} { |G_k(t) − F_{1k}(t)| + |F_{1k}(t) − F_1(t)| } ≤ sup_{t≤T} |G_k(t) − F_{1k}(t)| + sup_{t≤T} |F_{1k}(t) − F_1(t)|.

Consequently, (B_k ∩ C_k) ⊂ A_k and

P(A_k) ≥ P(B_k ∩ C_k) ≥ P(B_k) + P(C_k) − 1 → 1

as k → ∞. Lemma 4.8 implies that P(B_k) → 1 and Theorem 4.1 implies that P(C_k) → 1. Therefore,

sup_{t≤T} |G_k(t) − F_1(t)| → 0 in probability. ∎

The WKME converges uniformly in probability to the lifetime distribution of Population 1 on the interval [0, T]. The MAMSE weights, although they use data from different distributions, provide an asymptotically unbiased estimate of F_1(t).

4.5 Simulations

This section presents the results of simulations performed to evaluate the finite-sample performance of the MAMSE-weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate. The two examples are based on real survival functions to mimic reality. Simulations use between 10000 and 20000 repetitions. Unless otherwise stated, this number is large enough to make the standard deviation of the simulation error smaller than the last digit shown in the tables or on the figures.

4.5.1 Survival in the USA

The Centers for Disease Control and Prevention maintains a website which includes a section called the National Center for Health Statistics.
That section contains the decennial life tables published by the National Center for Health Statistics (1997) at the address http://www.cdc.gov/nchs/products/pubs/pubd/lftbls/decenn/1991-89.htm. From this publication, we obtain the survival curves for four subgroups of the population in the United States:

• White males;

• White females;

• Males other than white;

• Females other than white.

The life tables have a resolution of one year from birth to the age of 109 years. We simulate the age at death based on these tables; the day and time of death is chosen uniformly within the year of death. The CDF of the survival time for each of the four populations is depicted in Figure 4.2.

Our purpose here is to explore the potential value of our new method rather than to study its behavior extensively under all possible conditions. Hence, we use a simple definition for the distribution of the censoring times based on the distribution of the lifetimes.

Lemma 4.9 Let X_1, ..., X_{r+1} be r + 1 independent and identically distributed variables with common distribution F. Then

P{ max(X_1, ..., X_r) ≤ X_{r+1} } = 1/(r + 1).

Figure 4.2: Survival functions for subgroups of the American population (white males, white females, males other than white, females other than white) as taken from the life tables of the National Center for Health Statistics (1997).

Proof of Lemma 4.9.

P{max(X_1, ..., X_r) ≤ X_{r+1}} = ∫_0^∞ P{max(X_1, ..., X_r) ≤ X_{r+1} | X_{r+1} = t} dF(t)
  = ∫_0^∞ {F(t)}^r dF(t) = ∫_0^1 u^r du = 1/(r + 1). ∎

Let X_1, ..., X_r be r random variables with the same distribution as the survival time of an individual. The censoring time is defined as V = max(X_1, ..., X_r), yielding a censoring rate of 1/(r + 1). Throughout this section, r = 4 is used, yielding a censoring rate of 20%.
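Lemma 4.9 and the resulting censoring rate are easy to verify by simulation. The sketch below is ours; it uses an arbitrary exponential lifetime, since the result holds for any continuous F.

```python
import random

# Monte Carlo check of Lemma 4.9: with censoring time V = max(X_1, ..., X_r),
# a fraction 1/(r+1) of lifetimes is censored on average.  The exponential
# lifetime below is purely illustrative; any continuous distribution works.
random.seed(1)
r, reps = 4, 100_000
censored = 0
for _ in range(reps):
    lifetime = random.expovariate(1.0)                    # plays X_{r+1}
    v = max(random.expovariate(1.0) for _ in range(r))    # censoring time V
    censored += lifetime > v                              # censored observation
print(censored / reps)  # close to 1/(r+1) = 0.2
```

With r = 4 the empirical censoring proportion settles near 20%, matching the rate used throughout this section.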
The last example however involves different rates of censoring, from about 17% to 33%, obtained with r ∈ {2, 3, 4, 5}.

We restrict our goal to inferring the survival distribution of white males based on equal samples drawn from the four demographic groups mentioned above. We investigate whether the MAMSE-weighted Kaplan-Meier estimate is a better choice than the usual Kaplan-Meier estimate based on the data from white males.

For different values of the upper bound U ∈ {60, 70, 80, 90, 100}, we generate samples of equal sizes n ∈ {10, 25, 100, 1000} from each of the four populations. Each scenario is repeated 20000 times.

We calculate both the weighted Kaplan-Meier estimate F̂_λ(t) and the usual Kaplan-Meier estimate F̂_1(t). To evaluate the quality of the estimators, we compare the area that separates them from the real survival curve F_1(t); more precisely, we use

A_1 = ∫_0^T |F̂_1(t) − F_1(t)| dt   and   A_λ = ∫_0^T |F̂_λ(t) − F_1(t)| dt.

Table 4.1 shows the ratio 100A_1/A_λ for different choices of n and U, with T = 55. The interval of interest is thus [0, T]. Values above 100 mean that the weighted Kaplan-Meier estimate performs better.

Ratio 100A_1/A_λ, with T = 55

            U = 60    70    80    90   100
n = 10        114   135   142   118   100
n = 25        137   148   149   128   101
n = 100       135   143   140   128   102
n = 1000      121   120   108   105   103

Table 4.1: Relative performance of the WKME as measured by 100A_1/A_λ. Both areas are calculated on the interval [0, 55] and various upper bounds U are used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times.

The weighted Kaplan-Meier estimate seems to be a better estimate under all scenarios considered. No apparent trend in the magnitude of the improvement against n or U is observed. The MAMSE criterion evaluates the dissimilarity between the populations on the interval [0, U]. It is thus not surprising that no clear trend is observed as U varies.
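The area criterion A defined above is an integral of the absolute difference between a step-function estimate and the true CDF. A minimal numerical sketch (ours, with a made-up two-step estimate and a uniform target) follows.

```python
# Approximate the area between a right-continuous step CDF (such as a
# Kaplan-Meier curve) and a reference CDF F on [0, T] with a midpoint rule.
# Function name and toy inputs are hypothetical.
def area_between(jump_times, jump_values, F, T, grid=10_000):
    total = 0.0
    for g in range(grid):
        t = T * (g + 0.5) / grid
        fhat = 0.0                      # step-function value at t
        for tj, vj in zip(jump_times, jump_values):
            if tj <= t:
                fhat = vj
        total += abs(fhat - F(t)) * (T / grid)
    return total

# Toy example: a two-step estimate versus the uniform CDF on [0, 1].
a = area_between([0.3, 0.7], [0.4, 1.0], lambda t: t, T=1.0)
print(a)  # approximately 0.14
```

The exact area here is 0.045 + 0.05 + 0.045 = 0.14; the grid approximation recovers it to within a fraction of a grid cell.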
Note that for the largest sample size (n = 1000), the advantage of the WKME seems more modest. The white females have the longest survival time and nearly 25% of them reach the age of 90. However, less than 3% survive long enough to celebrate their 100th birthday. For U = 100, the samples from other populations will frequently fall short of the upper bound and be ignored, especially for small sample sizes. This might partially explain the abrupt change in performance from U = 90 to U = 100.

The improvements observed in Table 4.1 appear again in Table 4.2, where the interval on which the functions are compared varies with the upper bound U. Once again, the newly proposed weighted Kaplan-Meier estimate performs better than the classical one under all the scenarios.

Ratio 100A_1/A_λ, T = U − 5

            U = 60    70    80    90   100
n = 10        114   132   137   116   100
n = 25        137   141   137   122   101
n = 100       135   134   128   118   101
n = 1000      121   115   103   101   101

Table 4.2: Relative performance of the WKME as measured by 100A_1/A_λ. Areas are calculated on the interval [0, U − 5], where U is the upper bound used to determine the weights. Samples of equal size n are taken from each of four subpopulations, then used to estimate the survival of a white male living in the USA. Each scenario is repeated 20000 times.

Figure 4.3 illustrates the average weights allocated to each of the four subpopulations. Notice that the weight allocated to Population 1 is close to one when U = 100, especially for small sample sizes. This supports the explanations regarding the sudden drop in performance observed in Tables 4.1 and 4.2: samples from other populations are often dismissed because no individual reaches 100 years of age. Unless a mixture of the survival distributions of the other populations is identical to the survival distribution for Population 1, Theorem 4.2 predicts that the weight allocated to Population 1 will converge to 1.
A tendency to increase towards 1 is observed for U ∈ {70, 80, 90}, but not for U = 60. It is expected that for larger sample sizes, λ_1 would eventually converge to 1 even in that latter case. The large weight allocated to the three other subpopulations for samples as large as 1000 should be interpreted as a sign that a mixture of these 3 distributions is extremely close to the true survival distribution of Population 1, which does not seem unreasonable based on Figure 4.2.

Figure 4.3: Average value of the MAMSE weights for different upper bounds U and sample sizes n. The cells' areas are proportional to the average weight allocated to each population (white males, white females, non-white males, non-white females). The numbers correspond to 100λ_1 and are averaged over 20000 repetitions.

Figure 4.4 depicts examples of estimates of the survival functions. While the smooth gray line shows the true distribution of the lifetime of a white male in the United States, the plain black line shows the Kaplan-Meier estimate based on a sample of size n and the dashed line corresponds to the MAMSE-weighted Kaplan-Meier estimate that we propose. The numbers on each panel correspond to A_1 and A_λ respectively, with T = 75. As we may expect from the good performances noticed in the previous tables, A_λ is typically smaller than A_1. Some exceptions arise: A_1 will occasionally be smaller than A_λ, as is the case for n = 1000 in Figure 4.4.

A close look at the dashed line shows one advantage of the weighted Kaplan-Meier estimate: using more populations involves having more steps in the estimate since jumps may occur at a time of death from any of the samples
considered. This results in a smoother step function.

Figure 4.4: Typical examples of the weighted Kaplan-Meier estimate (dashed line) and of the usual Kaplan-Meier estimate (plain black line) for different sample sizes. Note that U = 80 and T = 75. The true distribution is depicted by a smooth gray line.

An estimated survival function will typically be used to answer further questions about the population of interest. Tables 4.3 and 4.4 show the performance of the weighted Kaplan-Meier estimate in estimating F_1(55) = 0.11976 or F_1^{-1}(0.10) = 52.081. Note that we write q̂_1 = F̂_1^{-1}(0.10) and q̂_λ = F̂_λ^{-1}(0.10).

The estimates obtained with the weighted Kaplan-Meier estimate feature a smaller MSE in almost all cases. Moreover, the magnitude of the gains seems to outweigh that of the occasional losses, especially when we consider that such losses occur when n is large, not the cases where our method would be most useful.

The relative performance of the WKME for estimating a quantile is similar to the results obtained for the probability F_1(55). The resemblance between the tables is not

100 MSE{F̂_1(55)} / MSE{F̂_λ(55)}

            U = 60    70    80    90   100
n = 10        117   151   172   134   101
n = 25        137   159   170   149   102
n = 100       125   142   142   134   104
n = 1000      110   107    84    86   103

Table 4.3: Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating F_1(55) = 0.11976 as measured by 100 MSE{F̂_1(55)}/MSE{F̂_λ(55)}. Different choices of U and n are considered. Each scenario is repeated 20000 times.
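For reference, the Kaplan-Meier estimates being compared can be computed with a short product-limit routine. This is a minimal sketch, not the exact implementation used in the thesis, and the toy data are hypothetical.

```python
# Minimal Kaplan-Meier (product-limit) estimator: returns (time, S(t)) at each
# distinct death time, using the usual convention that deaths precede
# censorings occurring at the same time.
def kaplan_meier(times, died):
    data = sorted(zip(times, died), key=lambda td: (td[0], not td[1]))
    n_at_risk = len(data)
    surv, out, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]          # True counts as 1
            at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            out.append((t, surv))
        n_at_risk -= at_t
    return out

# Hypothetical sample: deaths at t = 2, 3, 5 and a censored subject at t = 4.
print(kaplan_meier([2, 3, 4, 5], [True, True, False, True]))
```

The censored observation at t = 4 leaves the curve unchanged but shrinks the risk set, which is why the final factor drops the survival estimate to 0 at t = 5.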
surprising since they both use the estimated curves around their 10% quantile.

100 MSE(q̂_1) / MSE(q̂_λ)

            U = 60    70    80    90   100
n = 10        120   140   161   141   100
n = 25        153   173   172   133   101
n = 100       124   145   139   126   102
n = 1000      119   113    86    86   106

Table 4.4: Relative performance of the weighted Kaplan-Meier estimate compared to the usual Kaplan-Meier estimate for estimating F_1^{-1}(0.10) = 52.081 as measured by 100 MSE(q̂_1)/MSE(q̂_λ). Different choices of U and n are considered. Each scenario is repeated 20000 times.

For Table 4.5, we fix U = 80 and T = 75, then try different distributions for the time of censoring that yield an average fraction p ∈ {1/3, 1/4, 1/5, 1/6} of censored data.

The proportion of censored data has little or no effect on the relative performance of the WKME compared to the KME. A closer look at the raw data shows that the precision of both the KME and the WKME is affected by a larger p, but it appears that the magnitude of this effect is the same for both methods. As a result, their relative performance seems unaffected by the rate of censoring.

Overall, the weighted Kaplan-Meier estimate seems to outperform the usual Kaplan-Meier estimate in almost all the cases studied. Next, we look at another population with more subgroups.

Average of 100λ_1 (left) and ratio 100A_1/A_λ (right)

            p = 1/3   1/4   1/5   1/6      1/3   1/4   1/5   1/6
n = 10          57    54    53    52      133   135   137   136
n = 25          50    48    48    47      136   137   137   138
n = 100         47    47    47    47      127   127   128   126
n = 1000        68    69    69    69      102   103   103   102

Table 4.5: Average MAMSE weights for different rates of censoring p and different sample sizes n. The right-hand side of the table presents the relative performance of the WKME as measured by 100A_1/A_λ. Figures are averaged over 20000 repetitions and the values U = 80 and T = 75 are used.

4.5.2 Survival in Canada

Statistics Canada periodically publishes life tables for the survival of Canadians.
We use the life tables published for the reference period of 2000 to 2002 that are available online at the address http://www.statcan.ca/bsolc/english/bsolc?catno=84-537-X.

The life tables from Statistics Canada (2006) provide the survival functions of Canadians with a resolution of one year, from birth to over 105 years. Distinct tables are provided for males and females from each of the 10 Canadian provinces. Due to its smaller population, Prince Edward Island is the only exception, with a resolution of 5 years; it is excluded from our simulations for that reason.

We suppose that n males and n females from each province (except PEI) are followed and that their time of death is observed or censored. Censorship was determined the same way as before, following Lemma 4.9 with r = 4, which yields a censoring rate of 20%. We perform three simulations.

1. Males: We estimate the survival function of a male in New Brunswick when data sets of males across the country are available (total 9 populations).

2. Females: We estimate the survival function of a female in New Brunswick when data sets of females across the country are available (total 9 populations).

3. Males and Females: We estimate the survival function of a female in New Brunswick when data sets of males and females across the country are available (total 18 populations).

Figure 4.5 depicts the survival functions for the different provinces and genders. The survival curves for New Brunswick are repeated on each panel for comparison purposes since they are the target distributions; they appear as gray dotted lines.

In the following, we express the distributions in terms of survival functions. These functions are obtained by simple arithmetic as S_•(t) = 1 − F_•(t) and Ŝ_•(t) = 1 − F̂_•(t), where • stands for any common index defined previously. Throughout this section, we choose the upper bound U = 90 and we fix T = 85.
Denote by S_1(t) the target distribution and let Ŝ_1(t) and Ŝ_λ(t) be the KME and the WKME respectively. The measure of performance considered will use the area between our estimate and the true survival function:

A_1 = ∫_0^T |Ŝ_1(t) − S_1(t)| dt   and   A_λ = ∫_0^T |Ŝ_λ(t) − S_1(t)| dt.

Table 4.6 shows 100A_1/A_λ, the ratio of these areas. Values above 100 mean that the WKME performs better than the KME. Table 4.6 also shows the average weight allocated to each population under the three scenarios considered. Simulations are performed for n ∈ {10, 25, 100} and each scenario is based on 10000 repetitions. For Simulation 3, we can distinguish between the weights allocated to men and women by their presentation in the table.

Under all scenarios, the WKME yields a more precise estimate than the usual Kaplan-Meier estimate. The improvement is small for the males, but substantial for estimating the survival of females from New Brunswick, with a gain exceeding 50% in some cases. This might mean that the survival curves of women are more similar across provinces than those of men, although this is not easily seen from Figure 4.5.

Figure 4.5: Survival functions of Canadians as taken from the life tables of Statistics Canada (2006). Survival functions are available for males and females of each province. The curves for New Brunswick are repeated as gray dotted lines on each panel to facilitate comparisons since they are the populations of interest.
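The weighted estimate compared here is easy to sketch in code. Assuming, as in the definition of G_k used in the proofs earlier in this chapter, that the WKME is the λ-weighted mixture of the per-population Kaplan-Meier curves, a minimal evaluator is as follows; all step functions and weights below are hypothetical.

```python
# Sketch of a weighted Kaplan-Meier evaluation: the WKME is taken to be the
# lambda-weighted mixture of per-population Kaplan-Meier CDFs, each given as a
# sorted list of (time, value) jumps.  Inputs are hypothetical.
def step_at(curve, t):
    value = 0.0
    for tj, vj in curve:
        if tj <= t:
            value = vj
    return value

def wkme(curves, weights, t):
    return sum(w * step_at(c, t) for c, w in zip(curves, weights))

pop1 = [(1.0, 0.5), (3.0, 1.0)]      # CDF jumps of population 1
pop2 = [(2.0, 0.25), (4.0, 1.0)]     # CDF jumps of population 2
print(wkme([pop1, pop2], [0.8, 0.2], 2.5))  # 0.8*0.5 + 0.2*0.25 = 0.45
```

Because jumps can occur at the death times of any contributing sample, the mixture has more steps than either curve alone, which is why the WKME looks smoother in the figures.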
100A_1/A_λ and average MAMSE weights (multiplied by 1000)

              n   100A_1/A_λ    λ_NB    λ_NF    λ_NS    λ_ON    λ_QC    λ_SK    λ_MB    λ_AB    λ_BC
Males        10       101        349      82      83      80      77      85      84      81      79
             25       101        346      91      84      83      76      82      86      78      75
            100       101        350     109      87      82      71      87      79      70      65
Females      10       146        438      65      66      68      67      72      78      73      73
             25       153        376      82      79      78      76      77      80      77      75
            100       151        359      98      87      77      78      79      75      74      74
Males &      10       146     365/22   47/22   51/22   52/22   53/22   55/22   60/22   58/22   59/22
Females      25       148     284/25   49/25   54/25   61/25   59/25   57/25   64/25   64/25   69/25
            100       144     271/15   48/15   59/15   73/15   76/15   64/15   79/15   81/15   94/15

Table 4.6: Relative performance of the weighted Kaplan-Meier estimate compared to the Kaplan-Meier estimate as measured by 100A_1/A_λ. Average MAMSE weights are also shown, multiplied by a factor of 1000 for easier reading. In the simulation with males and females, the weight after the slash refers to the male population of the same province. Note that U = 90, T = 85 and that all figures are based on 10000 repetitions.
In the Males & Females scenario, notice also the small magnitude of the weights allocated to men. As n increases, the dissimilarity between the survival distributions becomes more certain. Figure 4.6 depicts typical estimates obtained in the simulations. The W K M E is typically more precise than the usual K M E , however, some exceptions arise where the area between the K M E and the true distribution is smaller than A\.  Notice once again that using  the weighted Kaplan-Meier estimate produces a smoother curve than the K M E since the function has more jumps. The M A M S E weights allow us to build a weighted Kaplan-Meier estimate that exploits information from similar populations. For the problem of estimating the survival of Canadians from New Brunswick, the weighted Kaplan-Meier estimate features better performance than the usual Kaplan-Meier estimate in all scenarios considered. The W K M E may be an alternative to consider when data from similar populations are easily available, or when one is interested in a subpopulation of a larger group as is the case for this simulation.  98  Chapter 4. MAMSE Weights for Right-Censored Data  Females  Males and Females  Area weighted =0.88  Area weighted = 1.63  Area normal =2.92  Area normal =2.35  Area weighted =0.71  Area weighted = 1.05  Area normal =2.34  Area normal = 1.66  Area weighted =0.78  Area weighted =0.52  Area normal = 1.40  Area normal =0.85  40 Age at death (Years)  60  80  40 Age at death (Years)  60  80  Figure 4.6: Typical examples of estimates under different scenarios. Note the increased number of steps of the weighted Kaplan-Meier estimate (dashed line) which makes it smoother than the usual Kaplan-Meier estimate (plain black line). The true distribution appears as a smooth gray curve.  99  Chapter 5  M A M S E Weights for Copulas Copulas are distribution functions with uniform margins, but most importantly, they are a tool for expressing the dependence structure of multivariate distributions. 
In this chapter, the MAMSE weights are extended to multivariate data using copulas. The empirical copula, a nonparametric estimate of the copula based on the ranks of the data, is used for that purpose. After reviewing the theory related to copulas, we define the MAMSE weights using the empirical copula. The nonparametric weights obtained are then used to define a mixture of empirical copulas, which yields weighted coefficients of correlation. The weighted pseudo-likelihood, an extension of the pseudo-likelihood, is proposed and shown to produce consistent estimates when used in conjunction with the MAMSE weights. Simulations evaluate the performance of the maximum weighted pseudo-likelihood estimate and of the weighted coefficients of correlation.

5.1 Review of Copulas

Let X be a p-dimensional vector with continuous distribution H(x) and continuous marginal distributions G_1(x_1), ..., G_p(x_p). The dependence between the elements of X is best described by its underlying copula, a cumulative distribution function with uniform marginals. Sklar (1959) showed that all multivariate distributions admit the representation

H([x_1, ..., x_p]^T) = C{G_1(x_1), ..., G_p(x_p)}

where C is a CDF with uniform margins called a copula. Any continuous distribution H is associated with a unique copula C.

If h_1, ..., h_p are one-to-one increasing functions, then the unique copula associated with the distribution of Y = [h_1(X_1), ..., h_p(X_p)]^T is the same as the copula underlying the distribution of X = [X_1, ..., X_p]^T.

Coefficients of Correlation

The expression "coefficient of correlation" typically refers to an empirical estimate of the dependence between two variables. These statistics however estimate population values that are also called "coefficients of correlation".
For instance, the Pearson correlation of a bivariate distribution H(x, y) with marginal distributions F(x) and G(y) is given by

r = (1/(σ_x σ_y)) ∫∫ (x − μ_x)(y − μ_y) dH(x, y)

where μ_x and σ_x² are the mean and the variance of F(x), and similarly for G(y).

Pearson's correlation is a parameter of the multivariate normal model and it is an efficient measure of dependence under that model. However, Pearson's correlation is not a good measure of dependence in a general setting. In particular, the range of values for Pearson's coefficient is limited by the distribution of the margins F and G. How can one interpret a correlation of 0.80, say, when it could be the maximal value of r given the marginal distributions at hand? Moreover, a monotone transformation of the variables will typically affect r.

Better measures of correlation are invariant to monotone transformations of the data. Spearman's ρ and Kendall's τ are well-known such coefficients. Let C(u, v) denote the unique copula associated with H(x, y). Then Table 5.1 gives the population value of different coefficients of correlation, including ρ and τ. These coefficients depend only on the copula C, hence they are invariant to monotone increasing transformations of the margins. The empirical estimates of the coefficients of correlation in Table 5.1 are based on ranks; they are presented in Section 5.2.

Usual Name           Population Value

Spearman             ρ = 12 ∫∫ uv dC(u, v) − 3
Kendall              τ = 4 ∫∫ C(u, v) dC(u, v) − 1
Gini                 γ = 2 ∫∫ ( |u + v − 1| − |u − v| ) dC(u, v)
Blomqvist            β = 4 C(1/2, 1/2) − 1
Blest                ν = 2 − 12 ∫∫ (1 − u)² v dC(u, v)
Symmetrized Blest    ξ = −4 + 6 ∫∫ uv (4 − u − v) dC(u, v)

Table 5.1: Population value of different coefficients of correlation.

More details on Spearman's ρ, Kendall's τ, Gini's γ and Blomqvist's β can be found in Nelsen (1999). Blest's coefficients were first introduced by Blest (2000), then further developed by Genest & Plante (2003).
Pinto da Costa & Soares (2005) studied the same coefficients of correlation and rediscovered independently some of the results published by Genest & Plante (2003).

Families of Copulas

A number of families of copulas have been studied in the literature. The books of Joe (1997), Nelsen (1999) and Cherubini et al. (2004) present the most common ones and discuss methods for constructing copulas. Note in particular that building a copula in more than 2 dimensions is not always an easy endeavor, as not all lower dimensional copulas are compatible with each other. Let us introduce the families of copulas that will be used for simulations in Section 5.9.

Normal Copula

The copula underlying the Normal distribution does not have a closed form, but can be expressed as follows. Let H_Σ(x) be the CDF of a multivariate Normal with mean 0 and covariance matrix Σ. In addition, suppose that the diagonal of Σ contains only ones, hence Σ is a correlation matrix and the margins of H_Σ are N(0, 1). If Φ(x) = P(Z ≤ x) for Z ~ N(0, 1), then

C_Σ(u) = H_Σ{ Φ^{-1}(u_1), ..., Φ^{-1}(u_p) }

is the copula underlying H_Σ. In p dimensions, the Normal copula depends on the p(p − 1)/2 parameters that make Σ positive definite. In 2 dimensions, it depends on one parameter, r ∈ (−1, 1). The limiting cases where r = ±1 correspond to perfect linear concordance or discordance. Independence is given by r = 0. Under the bivariate Normal model, the population values of ρ and τ are

ρ = (6/π) arcsin(r/2)   and   τ = (2/π) arcsin(r).

Simulating the Normal copula can be done by generating a multivariate Normal, then applying the standard Normal CDF Φ to each of its marginal values.

Farlie-Gumbel-Morgenstern Copula

The Farlie-Gumbel-Morgenstern (FGM) copula (see Nelsen (1999), page 68) is parametrized by θ ∈ [−1, 1] and written

C_θ(u, v) = uv + θuv(1 − u)(1 − v).
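Two facts about the FGM family are easy to check numerically: its density is c_θ(u, v) = 1 + θ(1 − 2u)(1 − 2v), so the Spearman formula of Table 5.1 gives ρ = θ/3, and a pair (U, V) can be drawn by conditional inversion, whose closed-form root is derived below. The sketch that follows is illustrative only; the function names are ours.

```python
import math
import random

# FGM density: c(u, v) = 1 + theta*(1 - 2u)*(1 - 2v).
# Check Spearman's rho = 12 * E[UV] - 3 = theta/3 with a midpoint rule.
def fgm_spearman(theta, grid=200):
    total = 0.0
    for a in range(grid):
        u = (a + 0.5) / grid
        for b in range(grid):
            v = (b + 0.5) / grid
            total += u * v * (1 + theta * (1 - 2 * u) * (1 - 2 * v))
    return 12 * total / grid ** 2 - 3

# Conditional-inversion sampler: V uniform, then U solves
# C(u | v) = u + theta*u*(1-u)*(1-2v) = W for a second uniform W.
def fgm_sample(theta, rng):
    v, w = rng.random(), rng.random()
    a = theta * (1 - 2 * v)
    if abs(a) < 1e-12:                 # conditional law is uniform
        return w, v, w
    u = ((1 + a) - math.sqrt((1 + a) ** 2 - 4 * a * w)) / (2 * a)
    return u, v, w

print(abs(fgm_spearman(0.9) - 0.3) < 1e-3)  # rho is close to theta/3 = 0.3
```

Plugging a sampled U back into the conditional CDF returns the driving uniform W, which gives a quick correctness check of the closed-form root.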
Its simple closed-form expression is convenient, but it does not span a great range of dependence. In particular, ρ = θ/3 under that model, hence the absolute value of Spearman's ρ is at most 1/3. With the simplicity of its closed-form expression, the FGM copula is often used for illustrative purposes in the literature. However, it is not a very flexible model since it features a limited range of dependence. Hence, it is rarely used as an inferential model in practice.

Simulating a datum (U, V) from a FGM copula can be done by generating a first uniform variate (V), then transforming a second uniform variate (W) by inverting the CDF of the conditional distribution of U given V. This yields a quadratic equation whose root of interest is

U = [ {1 + θ(1 − 2V)} − √( {1 + θ(1 − 2V)}² − 4θ(1 − 2V)W ) ] / {2θ(1 − 2V)}.

Genest & MacKay (1986) present a recipe to build numerous families of copulas that are called Archimedean copulas. Many important families of copulas are Archimedean. Chapter 4 of Nelsen (1999) is devoted to the construction of such copulas and to the study of their properties. The next three families of copulas are particular cases of Archimedean copulas. An algorithm to produce pseudo-random variates from these distributions is given by Genest & MacKay (1986).

Clayton Copula

The Clayton copula was introduced by Clayton (1978) and is often used in survival analysis (see Oakes (1986) for instance). The bivariate Clayton family is parametrized by θ ∈ [−1, 0) ∪ (0, ∞), the limiting case where θ → 0 corresponding to independence. The expression for the copula is

C_θ(u, v) = max{ (u^{−θ} + v^{−θ} − 1)^{−1/θ}, 0 }.

Frank Copula

This family was first discussed by Frank (1979) and is indexed by one real parameter, θ ∈ (−∞, 0) ∪ (0, ∞). The limiting case where θ → 0 corresponds to independence. Its CDF is

C_θ(u, v) = −(1/θ) log[ 1 + (e^{−θu} − 1)(e^{−θv} − 1)/(e^{−θ} − 1) ].

The Frank copula is the only radially symmetric Archimedean copula.
Its shape is akin to the Normal copula.

Gumbel-Hougaard Copula

The Gumbel-Hougaard family is one of three copulas that can be a limit distribution for extremes and is hence used to model data of that type. Indexed by θ ∈ [1, ∞), it is written

C_θ(u, v) = exp[ −{ (−log u)^θ + (−log v)^θ }^{1/θ} ]

and the choice θ = 1 corresponds to independence.

The families of copulas presented in this section are absolutely continuous and hence admit a density function, as long as we omit the limiting cases of perfect concordance or discordance where the distributions collapse to a line. The existence of a density function is less clear for the Clayton, Frank and Gumbel-Hougaard copulas. Genest & MacKay (1986) show that all bivariate Archimedean copulas can be factorized as a mixture of two components, one absolutely continuous, the second singular on a line that they describe explicitly. From their result, it is straightforward to verify that the singular part of the three Archimedean copulas presented in this section has mass 0, hence that they are absolutely continuous.

5.2 Notation and Empirical Estimation

Suppose that p-dimensional data are available from m different populations believed to have similar dependence structures (i.e. similar copulas). For any fixed k, we observe n_{ik} data points from Population i. Explicitly,

X_{i1}, ..., X_{in_{ik}},   i = 1, ..., m,

are observed, where X_{ij} = [X_{ij1}, ..., X_{ijp}]^T is a vector in p dimensions. By Sklar's (1959) Theorem, there exists a unique copula underlying the distribution F_i; we denote it by C_i(u), u = [u_1, ..., u_p]^T being a vector in [0, 1]^p. That unique copula is a cumulative distribution function defined on the unit cube such that

F_i(x) = C_i{ G_{i1}(x_1), ..., G_{ip}(x_p) }

where G_{i1}, ..., G_{ip} are the marginal distributions of F_i.

Let R_{ij}^{(k)} = [R_{ij1}^{(k)}, ..., R_{ijp}^{(k)}]^T be the ranks associated with the vectors X_{ij}, j = 1, ..., n_{ik}. For fixed i and ℓ, the list of values X_{i1ℓ}, ..., X_{in_{ik}ℓ} is sorted and R_{ijℓ}^{(k)} is the rank of X_{ijℓ} in that list.
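These coordinate-wise ranks are all that the empirical copula introduced next requires. As a quick bivariate illustration (function names and data are ours, p = 2 and no ties):

```python
# Bivariate empirical copula evaluated from ranks: C_hat(u1, u2) is the
# fraction of observations whose rescaled ranks R_j / n fall in [0,u1] x [0,u2].
def ranks(values):
    order = sorted(range(len(values)), key=lambda j: values[j])
    r = [0] * len(values)
    for position, j in enumerate(order, start=1):
        r[j] = position
    return r

def empirical_copula(xs, ys, u1, u2):
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    return sum(1 for j in range(n) if rx[j] <= u1 * n and ry[j] <= u2 * n) / n

xs = [2.3, 0.1, 5.7, 3.4]   # hypothetical sample, no ties
ys = [1.0, 0.2, 9.9, 4.5]
print(empirical_copula(xs, ys, 0.5, 0.5))  # -> 0.5 (two of the four points)
```

Because xs and ys are perfectly concordant here, the estimate puts mass along the diagonal of the rank grid, mimicking the comonotone copula.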
Assume that the $\{F_i\}$ are continuous; hence, with probability 1, ties do not occur. The empirical copula, defined on the ranks of a sample, is
\[ C_{ik}(\mathbf{u}) = \frac{1}{n_{ik}} \sum_{j=1}^{n_{ik}} \mathbb{1}\left( \frac{\mathbf{R}^k_{ij}}{n_{ik}} \le \mathbf{u} \right) \]
for $\mathbf{u} = [u_1, \ldots, u_p]^{\mathrm{T}}$. The indicator variable $\mathbb{1}(\cdot)$ is equal to one if all the elements of its argument are true and equal to 0 otherwise. The empirical copula puts a weight of $1/n_{ik}$ on the points of the grid
\[ \left\{ \frac{1}{n_{ik}}, \frac{2}{n_{ik}}, \ldots, 1 \right\} \times \cdots \times \left\{ \frac{1}{n_{ik}}, \frac{2}{n_{ik}}, \ldots, 1 \right\} \]
corresponding to an observed combination of ranks. There is exactly one such point in every $(p-1)$-dimensional slice of the grid (rows and columns in 2 dimensions). Consequently, the univariate marginals of the empirical copula $C_{ik}$ are uniformly distributed on the points $\{1/n_{ik}, 2/n_{ik}, \ldots, 1\}$.

Suppose the $\{n_{ik}\}$ are strictly increasing with $k$. Deheuvels (1979) shows that
\[ \sup_{\mathbf{u} \in [0,1]^p} |C_{ik}(\mathbf{u}) - C_i(\mathbf{u})| \to 0 \]
almost surely as $k \to \infty$. Fermanian et al. (2004) show that $\sqrt{n_{ik}}\{C_{ik}(\mathbf{u}) - C_i(\mathbf{u})\}$ converges weakly to a Brownian sheet with mean zero and a variance that depends on $C_i$ and its partial first-order derivatives. Although they hold for an arbitrary number of dimensions, the results of Fermanian et al. (2004) are presented for bivariate copulas only. Tsukahara (2005) credits Fermanian et al. (2004) for the discovery and expresses the same results in $p$ dimensions.

Remark 5.1 Let $\mathbb{U}_i(\mathbf{u})$ be a $p$-dimensional centered Gaussian random field with covariance function
\[ \operatorname{cov}\{\mathbb{U}_i(\mathbf{u}), \mathbb{U}_i(\mathbf{v})\} = C_i(\mathbf{u} \wedge \mathbf{v}) - C_i(\mathbf{u}) C_i(\mathbf{v}), \]
where $\wedge$ is the component-wise minimum. Such a random field is called a $p$-dimensional pinned $C_i$-Brownian sheet.

Theorem 5.1 (Tsukahara (2005)) Assume that $C_i(\mathbf{u})$ is differentiable with continuous partial derivatives $\partial C_i(\mathbf{u})/\partial u_\ell$ for $\ell = 1, \ldots, p$, and let $[1, u_\ell, 1]^{\mathrm{T}}$ represent a vector of ones, except for the $\ell$th element, which is equal to the $\ell$th element of $\mathbf{u}$. Then the random variable
\[ \sqrt{n_{ik}}\{C_{ik}(\mathbf{u}) - C_i(\mathbf{u})\} \]
converges weakly to the random field
\[ \mathbb{U}_i(\mathbf{u}) - \sum_{\ell=1}^{p} \left\{ \frac{\partial}{\partial u_\ell} C_i(\mathbf{u}) \right\} \mathbb{U}_i\left( [1, u_\ell, 1]^{\mathrm{T}} \right) \]
as $k \to \infty$.
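The componentwise ranks and the empirical copula defined above can be computed directly; a minimal bivariate-friendly sketch (function names are illustrative, not from the thesis):

```python
def ranks(values):
    """Componentwise ranks, 1 = smallest; assumes no ties, as in the text."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    r = [0] * len(values)
    for pos, j in enumerate(order):
        r[j] = pos + 1
    return r

def empirical_copula(data):
    """Return the empirical copula of a list of p-dimensional points:
    C(u) = (1/n) * sum_j 1(R_j1/n <= u_1, ..., R_jp/n <= u_p)."""
    n, p = len(data), len(data[0])
    R = list(zip(*(ranks([x[l] for x in data]) for l in range(p))))
    def C(u):
        return sum(all(R[j][l] <= n * u[l] for l in range(p)) for j in range(n)) / n
    return C
```

As the text notes, each univariate margin of the resulting step function is uniform on $\{1/n, \ldots, 1\}$.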
Remark 5.2 Let $\mathbf{u} \in [0,1]^p$ and $\mathbf{v}_\ell = [1, u_\ell, 1]^{\mathrm{T}}$ for some $\ell \in \{1, \ldots, p\}$. For the Brownian sheet defined in the previous remark and $1 \le l < \ell \le p$, we have:
\[ \operatorname{var}\{\mathbb{U}_i(\mathbf{u})\} = C_i(\mathbf{u}) - C_i(\mathbf{u})^2 = C_i(\mathbf{u})\{1 - C_i(\mathbf{u})\}, \]
\[ \operatorname{var}\{\mathbb{U}_i(\mathbf{v}_\ell)\} = C_i(\mathbf{v}_\ell)\{1 - C_i(\mathbf{v}_\ell)\} = u_\ell(1 - u_\ell), \]
\[ \operatorname{cov}\{\mathbb{U}_i(\mathbf{u}), \mathbb{U}_i(\mathbf{v}_\ell)\} = C_i(\mathbf{u}) - C_i(\mathbf{u})\, u_\ell = C_i(\mathbf{u})(1 - u_\ell), \]
\[ \operatorname{cov}\{\mathbb{U}_i(\mathbf{v}_l), \mathbb{U}_i(\mathbf{v}_\ell)\} = C_i\left( [1, u_l, 1, u_\ell, 1]^{\mathrm{T}} \right) - u_l u_\ell, \]
where $[1, u_l, 1, u_\ell, 1]^{\mathrm{T}}$ is a vector of ones, except for the elements $l$ and $\ell$ that are equal to the $l$th and $\ell$th elements of $\mathbf{u}$ respectively.

To define the MAMSE weights in Section 5.3, we need an estimate of the asymptotic variance of the empirical copula.

Remark 5.3 From Theorem 5.1, we have that
\[ \operatorname{var}\left[ \sqrt{n_{ik}}\{C_{ik}(\mathbf{u}) - C_i(\mathbf{u})\} \right] \to \operatorname{var}\{\mathbb{U}_i(\mathbf{u})\} - 2 \sum_{\ell=1}^{p} \frac{\partial C_i(\mathbf{u})}{\partial u_\ell} \operatorname{cov}\{\mathbb{U}_i(\mathbf{u}), \mathbb{U}_i(\mathbf{v}_\ell)\} + \sum_{\ell=1}^{p} \left\{ \frac{\partial C_i(\mathbf{u})}{\partial u_\ell} \right\}^2 \operatorname{var}\{\mathbb{U}_i(\mathbf{v}_\ell)\} + 2 \sum_{1 \le l < \ell \le p} \frac{\partial C_i(\mathbf{u})}{\partial u_l} \frac{\partial C_i(\mathbf{u})}{\partial u_\ell} \operatorname{cov}\{\mathbb{U}_i(\mathbf{v}_l), \mathbb{U}_i(\mathbf{v}_\ell)\}. \]
The expressions for the variances and covariances from Remark 5.2 can be substituted in the expression above. Note that the only term that does not depend on derivatives of $C_i(\mathbf{u})$ is
\[ \operatorname{var}\{\mathbb{U}_i(\mathbf{u})\} = C_i(\mathbf{u})\{1 - C_i(\mathbf{u})\}. \]

Coefficients of Correlation Based on Ranks

The measures of dependence presented in Table 5.1 can be estimated from a sample by using ranks. Their classical estimates appear in Table 5.2. Note that we use a simplified notation where, for a fixed population $i$ and a fixed $k$, we write $n = n_{ik}$, $\hat{C}(u_1, u_2) = C_{ik}(\mathbf{u})$ and $(R_j, S_j) = \mathbf{R}^k_{ij}$. That simplified notation is the most commonly seen in the literature.

Usual Name — Empirical Estimate
Spearman: $\hat\rho_n = \frac{12}{n(n+1)(n-1)} \sum_{i=1}^n R_i S_i - 3\,\frac{n+1}{n-1}$
Kendall: $\hat\tau_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \operatorname{sign}(R_i - R_j)\operatorname{sign}(S_i - S_j)$
Gini: $\hat\gamma_n = \frac{1}{\lfloor n^2/2 \rfloor} \sum_{i=1}^n \left( |R_i + S_i - n - 1| - |R_i - S_i| \right)$
Blomqvist: $\hat\beta_n = 4\,\hat{C}\!\left( \frac{1}{2}, \frac{1}{2} \right) - 1$
Blest: $\hat\nu_n = \frac{2n+1}{n-1} - \frac{12}{n(n+1)^2(n-1)} \sum_{i=1}^n (n+1-R_i)^2 S_i$
Symmetrized Blest: $\hat\xi_n = -\frac{4n+5}{n-1} + \frac{6}{n(n+1)(n-1)} \sum_{i=1}^n R_i S_i \left( 4 - \frac{R_i + S_i}{n+1} \right)$
Table 5.2: Empirical estimates of different coefficients of correlation.

Rescaled Ranks

The empirical copula allocates a non-null weight to points that are on the boundary of $[0,1]^p$ even though that set has probability 0 under a continuous copula. In some applications, we need to evaluate functions that are well defined on $(0,1)^p$, but that may have asymptotes close to the boundaries of the $p$-dimensional cube. To avoid such problems, one can use a rescaled empirical copula where the mass of $\mathbf{R}^k_{ij}/n_{ik}$ is rather allocated to
\[ \mathbf{Y}^k_{ij} = \frac{\mathbf{R}^k_{ij} - 0.5}{n_{ik}} \quad \text{or} \quad \mathbf{Y}^k_{ij} = \frac{\mathbf{R}^k_{ij}}{n_{ik} + 1}. \]
Hence, the rescaled empirical copula allocates an equal mass to some points of the grid
\[ \left\{ \frac{0.5}{n_{ik}}, \ldots, \frac{n_{ik} - 0.5}{n_{ik}} \right\}^p \quad \text{or} \quad \left\{ \frac{1}{n_{ik} + 1}, \ldots, \frac{n_{ik}}{n_{ik} + 1} \right\}^p, \]
where the function of interest is well defined.

Remark 5.4 Let $C^*_{ik}$ denote a rescaled empirical copula. The asymptotic behavior of $C^*_{ik}$ is typically identical to that of $C_{ik}$ since
\[ \sup_{\mathbf{u} \in [0,1]^p} |C^*_{ik}(\mathbf{u}) - C_{ik}(\mathbf{u})| \le \frac{p}{n_{ik}} \to 0 \]
as $k \to \infty$.

The Pseudo-Likelihood

Methods have been proposed for fitting a family of copulas to the data. A review of such methods can be found in Cherubini et al. (2004). Let us consider the pseudo-likelihood originally proposed by Genest et al. (1995). Suppose that the family of copulas $C(\mathbf{u}|\theta)$, $\theta \in \Theta$, admits a density function $c(\mathbf{u}|\theta)$. To fit that family of distributions to the data from Population $i$, Genest et al. (1995) suggest maximizing the pseudo-likelihood
\[ L(\theta) = \prod_{j=1}^{n_{ik}} c\left( \mathbf{Y}^k_{ij} \middle| \theta \right), \]
where the $\mathbf{Y}^k_{ij}$ are the rescaled ranks described above. The resulting maximum pseudo-likelihood estimate is a consistent estimate of the true parameter $\theta_0$ for which $C(\mathbf{u}|\theta_0) = C_i(\mathbf{u})$.
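As an illustration of the pseudo-likelihood, the sketch below fits the Clayton family restricted to $\theta > 0$, whose density is $c_\theta(u,v) = (1+\theta)(uv)^{-\theta-1}(u^{-\theta}+v^{-\theta}-1)^{-2-1/\theta}$, by a crude grid search over $\theta$; in practice one would use a proper optimizer. Helper names and the grid are illustrative assumptions:

```python
import math

def _ranks(values):
    order = sorted(range(len(values)), key=lambda j: values[j])
    r = [0] * len(values)
    for pos, j in enumerate(order):
        r[j] = pos + 1
    return r

def clayton_logpdf(u, v, theta):
    """Log-density of the bivariate Clayton copula, theta > 0."""
    return (math.log(1 + theta) - (theta + 1) * (math.log(u) + math.log(v))
            - (2 + 1 / theta) * math.log(u ** -theta + v ** -theta - 1))

def max_pseudo_likelihood(pairs, grid):
    """Maximum pseudo-likelihood estimate over a grid of theta values,
    using the rescaled ranks Y = R/(n+1) of the sample."""
    n = len(pairs)
    rx = _ranks([x for x, _ in pairs])
    ry = _ranks([y for _, y in pairs])
    Y = [(rx[j] / (n + 1), ry[j] / (n + 1)) for j in range(n)]
    return max(grid, key=lambda t: sum(clayton_logpdf(u, v, t) for u, v in Y))
```

Because the estimate only depends on ranks, it is invariant to monotone increasing transformations of the margins.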
Alternatively, the log-pseudo-likelihood can be written
\[ \ell(\theta) = \sum_{j=1}^{n_{ik}} \log c\left( \mathbf{Y}^k_{ij} \middle| \theta \right) = n_{ik} \int \log c(\mathbf{u}|\theta) \, \mathrm{d}C^*_{ik}(\mathbf{u}), \]
where $C^*_{ik}(\mathbf{u})$ denotes the empirical copula defined with the rescaled ranks $\mathbf{Y}^k_{ij}$.

5.3 Definition of the MAMSE Weights for Copulas

Univariate MAMSE weights are designed to yield a mixture of distributions close to the target distribution $F_1$, but less variable than its empirical distribution function. In the context of copulas, an extended version of the MAMSE weights can aim at choosing a mixture of the empirical copulas,
\[ C_{\lambda k}(\mathbf{u}) = \sum_{i=1}^m \lambda_i\, C_{ik}(\mathbf{u}), \qquad (5.1) \]
with $\lambda_i \ge 0$ and $\boldsymbol{\lambda}^{\mathrm{T}} \mathbf{1} = 1$, that is close to $C_1$, but less variable than $C_{1k}$. Let us define
\[ P_k(\boldsymbol{\lambda}) = \int \left[ \left\{ C_{1k}(\mathbf{u}) - C_{\lambda k}(\mathbf{u}) \right\}^2 + \sum_{i=1}^m \lambda_i^2\, \widehat{\operatorname{var}}\{C_{ik}(\mathbf{u})\} \right] \mathrm{d}M_k(\mathbf{u}), \qquad (5.2) \]
where $M_k$ is a discrete probability measure allocating a weight of $1/n_{1k}^p$ to every point of the grid
\[ \mathcal{G}_k = \left\{ \frac{1}{n_{1k}}, \frac{2}{n_{1k}}, \ldots, 1 \right\} \times \cdots \times \left\{ \frac{1}{n_{1k}}, \frac{2}{n_{1k}}, \ldots, 1 \right\}. \]

Remark 5.3 provides an approximation to the variance of $\sqrt{n_{ik}}\, C_{ik}(\mathbf{u})$. However, the expression depends on the derivatives of $C_i(\mathbf{u})$, the true unknown and unknowable copula. To estimate these derivatives, we could use the data and either assume a parametric model for each population, or use smoothing techniques. Either of these choices involves many degrees of freedom in the selection of families of distributions or basis functions, for instance. Our goal is to show that the concept of the MAMSE weights has the potential to improve the quality of the inference while at the same time keeping their computation feasible. Thus, at this stage, we choose, on the basis of Remark 5.3, the simple compromise approximation
\[ \operatorname{var}\{C_{ik}(\mathbf{u})\} \approx \widehat{\operatorname{var}}\{C_{ik}(\mathbf{u})\} = \frac{1}{n_{ik}}\, C_{ik}(\mathbf{u})\{1 - C_{ik}(\mathbf{u})\}, \qquad (5.3) \]
which corresponds to the only term in Remark 5.3 that does not involve a derivative.
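To make criterion (5.2) concrete, here is a bivariate sketch that evaluates $P_k(\boldsymbol{\lambda})$ with the plug-in variance of (5.3) and, for $m = 2$, locates the minimizing weights by grid search. The thesis computes the weights with the iterative algorithm of its Section 2.4; the grid search and all names below are illustrative assumptions:

```python
def mamse_objective(copulas, sizes, lam):
    """P_k(lambda) of (5.2) for bivariate empirical copulas.
    copulas[i] is a function u -> C_ik(u); sizes[i] = n_ik; the variance is
    approximated by C_ik(u){1 - C_ik(u)}/n_ik as in (5.3)."""
    m, n1 = len(copulas), sizes[0]
    total = 0.0
    for j1 in range(1, n1 + 1):        # M_k puts mass 1/n1^2 on each point
        for j2 in range(1, n1 + 1):    # of the grid {1/n1, ..., 1}^2
            u = (j1 / n1, j2 / n1)
            c = [cop(u) for cop in copulas]
            mix = sum(lam[i] * c[i] for i in range(m))
            pen = sum(lam[i] ** 2 * c[i] * (1 - c[i]) / sizes[i] for i in range(m))
            total += (c[0] - mix) ** 2 + pen
    return total / n1 ** 2

def mamse_weights_2pop(copulas, sizes, step=0.01):
    """Crude grid search for the minimizing weight vector when m = 2."""
    grid = [s * step for s in range(int(round(1 / step)) + 1)]
    l1 = min(grid, key=lambda a: mamse_objective(copulas, sizes, (a, 1 - a)))
    return (l1, 1 - l1)
```

When the two populations have identical copulas, the penalty terms alone drive the optimum, which then favors the larger sample, as the theory suggests.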
We recognize that incorporating other terms in that approximation may improve the performance of the weights, but we leave that, and the investigation of the computational feasibility of doing so, to future work. As we will show, the choice made above does demonstrate the potential value of the MAMSE weights in this context.

Note that the theoretical results in the following sections hold with a different penalty term as long as the property expressed in Lemma 5.1 is preserved, i.e. as long as
\[ \int \widehat{\operatorname{var}}\{C_{1k}(\mathbf{u})\} \, \mathrm{d}M_k(\mathbf{u}) \to 0 \]
as $k \to \infty$. The property stated in Theorem 5.2 should also hold to ensure the proper convergence of the algorithm calculating the MAMSE weights.

The value of $\boldsymbol{\lambda}$ minimizing the objective function $P_k(\boldsymbol{\lambda})$ defined in (5.2) with the substitution (5.3) is called the vector of MAMSE weights and denoted $\boldsymbol{\lambda}_k(\omega) = [\lambda_{1k}(\omega), \ldots, \lambda_{mk}(\omega)]^{\mathrm{T}}$. The weighted empirical copula obtained using MAMSE weights will be written
\[ \hat{C}_k(\mathbf{u}) = \sum_{i=1}^m \lambda_{ik}(\omega)\, C_{ik}(\mathbf{u}). \qquad (5.4) \]

5.4 Computing the MAMSE Weights for Copulas

The algorithm proposed in Section 2.4 applies to the MAMSE weights for copulas. All copulas are supported on the cube $[0,1]^p$ and have uniform marginals. For that reason, there is no need for preprocessing in this case, as the domains of all the distributions overlap at all times. To prove the convergence of the algorithm, it is sufficient to show that Assumption 2.1 is satisfied when the MAMSE weights are defined as in Equation (5.2).

Theorem 5.2 Assumption 2.1 is satisfied for the definition of MAMSE weights suggested in Section 5.3, that is,
\[ \int \widehat{\operatorname{var}}\{C_{ik}(\mathbf{u})\} \, \mathrm{d}M_k(\mathbf{u}) > 0 \]
for $i \in \{1, \ldots, m\}$.

Proof of Theorem 5.2. Consider the term
\[ \int \widehat{\operatorname{var}}\{C_{ik}(\mathbf{u})\} \, \mathrm{d}M_k(\mathbf{u}) = \frac{1}{n_{ik}} \int C_{ik}(\mathbf{u})\{1 - C_{ik}(\mathbf{u})\} \, \mathrm{d}M_k(\mathbf{u}) \]
for some $i = 1, \ldots, m$. Note that the measure $M_k(\mathbf{u})$ gives a weight $1/n_{1k}^p$ to each point of the grid $\mathcal{G}_k$.
The empirical copula $C_{ik}(\mathbf{u})$ is greater than 0 and less than 1 for all points of the grid $\mathcal{G}_k$, except for the point $\mathbf{1}$ and possibly for a few "slices" (rows and columns in 2 dimensions) of points close to the axes when $n_{ik} < n_{1k}$. In all cases, $0 < C_{ik}(\mathbf{u}) < 1$ for many $\mathbf{u} \in \mathcal{G}_k$, meaning that $C_{ik}(\mathbf{u})\{1 - C_{ik}(\mathbf{u})\} > 0$ for those $\mathbf{u}$. Consequently,
\[ \frac{1}{n_{ik}} \int C_{ik}(\mathbf{u})\{1 - C_{ik}(\mathbf{u})\} \, \mathrm{d}M_k(\mathbf{u}) = \frac{1}{n_{ik} n_{1k}^p} \sum_{\mathbf{u} \in \mathcal{G}_k} C_{ik}(\mathbf{u})\{1 - C_{ik}(\mathbf{u})\} > 0, \]
the desired result. ∎

5.5 Uniform Convergence of the MAMSE-Weighted Empirical Copula

We now prove a sequence of lemmas that will build up to showing that the MAMSE-weighted empirical copula converges to the copula underlying Population 1. For that purpose, we suppose that the $\{n_{ik}\}$ are increasing with $k$ and that $n_{1k} \to \infty$ as $k \to \infty$.

Lemma 5.1
\[ \int \left\{ C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right\}^2 \mathrm{d}M_k(\mathbf{u}) \to 0 \]
almost surely as $k \to \infty$, where $\hat{C}_k(\mathbf{u})$ is defined as in (5.4).

Proof of Lemma 5.1. The choice of $\boldsymbol{\lambda} = [1, 0, \ldots, 0]^{\mathrm{T}}$ yields a larger value of $P_k(\boldsymbol{\lambda})$ than the MAMSE weights because of the optimization involved in the definition of the latter. Hence, for any $\omega \in \Omega$,
\[ \int \left\{ C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right\}^2 \mathrm{d}M_k(\mathbf{u}) \le P_k\{\boldsymbol{\lambda}_k(\omega)\} \le P_k\left( [1, 0, \ldots, 0]^{\mathrm{T}} \right) = \frac{1}{n_{1k}} \int C_{1k}(\mathbf{u})\{1 - C_{1k}(\mathbf{u})\} \, \mathrm{d}M_k(\mathbf{u}) \le \frac{1}{4 n_{1k}} \to 0 \]
as $k \to \infty$. ∎

Lemma 5.2 Let $\mathbf{u}, \mathbf{v} \in [0,1]^p$ be such that $v_\ell \le u_\ell$ for $\ell = 1, \ldots, p$. Then
\[ 0 \le C_{ik}(\mathbf{u}) - C_{ik}(\mathbf{v}) \le \sum_{\ell=1}^p \frac{\lceil n_{ik}(u_\ell - v_\ell) \rceil}{n_{ik}}, \]
where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$.

Proof of Lemma 5.2. Consider vectors $\mathbf{u}, \mathbf{v} \in [0,1]^p$ with $v_\ell \le u_\ell$ for $\ell = 1, \ldots, p$. Then,
\[ 0 \le C_{ik}(\mathbf{u}) - C_{ik}(\mathbf{v}) = \frac{1}{n_{ik}} \sum_{j=1}^{n_{ik}} \left[ \mathbb{1}\left\{ \frac{\mathbf{R}^k_{ij}}{n_{ik}} \in [0, u_1] \times \cdots \times [0, u_p] \right\} - \mathbb{1}\left\{ \frac{\mathbf{R}^k_{ij}}{n_{ik}} \in [0, v_1] \times \cdots \times [0, v_p] \right\} \right] \]
\[ \le \frac{1}{n_{ik}} \sum_{j=1}^{n_{ik}} \sum_{\ell=1}^p \mathbb{1}\left\{ \frac{R^k_{ij\ell}}{n_{ik}} \in (v_\ell, u_\ell] \right\} = \sum_{\ell=1}^p \frac{1}{n_{ik}} \sum_{j=1}^{n_{ik}} \mathbb{1}\left\{ \frac{R^k_{ij\ell}}{n_{ik}} \in (v_\ell, u_\ell] \right\} \le \sum_{\ell=1}^p \frac{\lceil n_{ik}(u_\ell - v_\ell) \rceil}{n_{ik}} \]
for any such vectors. ∎

Let $\mathcal{G}^+_k$ be an extended grid that includes the axes:
\[ \mathcal{G}^+_k = \left\{ 0, \frac{1}{n_{1k}}, \frac{2}{n_{1k}}, \ldots, 1 \right\} \times \cdots \times \left\{ 0, \frac{1}{n_{1k}}, \frac{2}{n_{1k}}, \ldots, 1 \right\}. \]

Lemma 5.3
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right| \le \frac{p}{n_{1k}} + \sup_{\mathbf{u} \in \mathcal{G}^+_k} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right| \]
for all $k$ and $\omega \in \Omega$.

Proof of Lemma 5.3. For a fixed $k$, $|C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u})|$ is a bounded function on the compact set $[0,1]^p$ and hence its maximum is attained. Let $\mathbf{v} \in [0,1]^p$ be a point where that maximum is achieved. We treat two cases.

Case 1: $C_{1k}(\mathbf{v}) \ge \hat{C}_k(\mathbf{v})$.

Let $\mathbf{v}^* = [v^*_1, \ldots, v^*_p]^{\mathrm{T}}$ be defined by $v^*_\ell = \lfloor n_{1k} v_\ell \rfloor / n_{1k}$, where $\lfloor x \rfloor$ denotes the largest integer smaller than or equal to $x$. Then $\mathbf{v}^* \in \mathcal{G}^+_k$ is on the same "plateau" of the multivariate step function $C_{1k}(\mathbf{u})$ as $\mathbf{v}$, meaning that
\[ C_{1k}(\mathbf{v}^*) = C_{1k}(\mathbf{v}). \]
Moreover, since $\hat{C}_k(\mathbf{u})$ is a nondecreasing function and $\mathbf{v}^* \le \mathbf{v}$, we have $\hat{C}_k(\mathbf{v}^*) \le \hat{C}_k(\mathbf{v})$. Recalling that $\mathbf{v}$ is the point where the difference between $C_{1k}(\mathbf{u})$ and $\hat{C}_k(\mathbf{u})$ is maximized, we can write
\[ |C_{1k}(\mathbf{v}) - \hat{C}_k(\mathbf{v})| = C_{1k}(\mathbf{v}) - \hat{C}_k(\mathbf{v}) \le C_{1k}(\mathbf{v}^*) - \hat{C}_k(\mathbf{v}^*) \le \sup_{\mathbf{u} \in \mathcal{G}^+_k} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right| \le \frac{p}{n_{1k}} + \sup_{\mathbf{u} \in \mathcal{G}^+_k} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right|, \]
meaning that the maximum occurs at a point of the grid $\mathcal{G}^+_k$.

Case 2: $C_{1k}(\mathbf{v}) < \hat{C}_k(\mathbf{v})$.

Let $\mathbf{v}^* = [v^*_1, \ldots, v^*_p]^{\mathrm{T}}$ be defined by $v^*_\ell = \lceil n_{1k} v_\ell \rceil / n_{1k}$. Then $\mathbf{v}^* \in \mathcal{G}^+_k$ and
\[ \hat{C}_k(\mathbf{v}^*) \ge \hat{C}_k(\mathbf{v}) \]
since $\hat{C}_k(\mathbf{u})$ is a nondecreasing function and $\mathbf{v}^* \ge \mathbf{v}$. By Lemma 5.2,
\[ C_{1k}(\mathbf{v}^*) - C_{1k}(\mathbf{v}) \le \sum_{\ell=1}^p \frac{\lceil n_{1k}(v^*_\ell - v_\ell) \rceil}{n_{1k}} \le \frac{p}{n_{1k}}. \]
Recalling that $\mathbf{v}$ maximizes the difference between $C_{1k}(\mathbf{u})$ and $\hat{C}_k(\mathbf{u})$, we can write
\[ |C_{1k}(\mathbf{v}) - \hat{C}_k(\mathbf{v})| = \hat{C}_k(\mathbf{v}) - C_{1k}(\mathbf{v}) \le \hat{C}_k(\mathbf{v}^*) - C_{1k}(\mathbf{v}^*) + \frac{p}{n_{1k}} \le \frac{p}{n_{1k}} + \sup_{\mathbf{u} \in \mathcal{G}^+_k} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right|. \]
Combining Cases 1 and 2 yields the desired result. ∎

Lemma 5.4 Let $\mathbf{u} = [u_1, \ldots, u_p]^{\mathrm{T}} \in [0,1]^p$; then
\[ 0 \le C_{ik}(\mathbf{u}) \le \min_{\ell = 1, \ldots, p} u_\ell. \]

Proof of Lemma 5.4. For each $\ell \in \{1, \ldots, p\}$, we have
\[ C_{ik}(\mathbf{u}) = \frac{1}{n_{ik}} \sum_{j=1}^{n_{ik}} \mathbb{1}\left( \frac{\mathbf{R}^k_{ij}}{n_{ik}} \le \mathbf{u} \right) \le \frac{1}{n_{ik}} \sum_{j=1}^{n_{ik}} \mathbb{1}\left( \frac{R^k_{ij\ell}}{n_{ik}} \le u_\ell \right) \le \frac{\lfloor n_{ik} u_\ell \rfloor}{n_{ik}} \le u_\ell. \]
We get the desired result by noting that these $p$ inequalities have to be satisfied simultaneously. ∎

Lemma 5.5 We have
\[ \sup_{\mathbf{u} \in \mathcal{G}^+_k} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$.

Proof of Lemma 5.5. Let $\epsilon > 0$. For any given $k \in \mathbb{N}$, let $\mathbf{u}_k = [u_{k1}, \ldots, u_{kp}]^{\mathrm{T}}$ be the point of the grid $\mathcal{G}^+_k$ where $|C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u})|$ is maximized. Let
\[ A_k = \left\{ \omega \in \Omega : C_{1k}(\mathbf{u}_k) - \hat{C}_k(\mathbf{u}_k) > \epsilon \right\}, \quad B_k = \left\{ \omega \in \Omega : \hat{C}_k(\mathbf{u}_k) - C_{1k}(\mathbf{u}_k) > \epsilon \right\}, \quad C_k = \left\{ \omega \in \Omega : \mathbf{u}_k \in \left[ \frac{\epsilon}{2}, 1 \right]^p \right\}. \]
The negation of Lemma 5.5 is "$A_k \cup B_k$ i.o.", which will happen if and only if
\[ \left\{ (A_k \cup B_k) \cap C_k^c \right\} \cup \left\{ (A_k \cup B_k) \cap C_k \right\} \ \text{i.o.} \]
We will show that neither of the two events in the decomposition above can occur infinitely often.

Case 1: $(A_k \cup B_k) \cap C_k^c$.

We have
\[ |C_{1k}(\mathbf{u}_k) - \hat{C}_k(\mathbf{u}_k)| \le C_{1k}(\mathbf{u}_k) + \sum_{i=1}^m \lambda_{ik}(\omega)\, C_{ik}(\mathbf{u}_k) \le 2 \min_{\ell \in \{1, \ldots, p\}} u_{k\ell} < 2\, \frac{\epsilon}{2} = \epsilon \]
by Lemma 5.4, since the MAMSE weights sum to 1 and, on $C_k^c$, at least one coordinate of $\mathbf{u}_k$ is smaller than $\epsilon/2$. Consequently, $(A_k \cup B_k) \cap C_k^c = \emptyset$ for all $k$.

Case 2: $(A_k \cup B_k) \cap C_k$.

Let $\mathbf{v}$ be the vector of integers such that $\mathbf{v}/n_{1k} = \mathbf{u}_k$; we temporarily omit the index $k$ for notational simplicity. Let also $\mathbf{w} = [w_1, \ldots, w_p]^{\mathrm{T}}$ be a point from the set
\[ \mathcal{W} = \left\{ 0, 1, \ldots, \left\lfloor \frac{n_{1k}\epsilon}{2p} \right\rfloor \right\} \times \cdots \times \left\{ 0, 1, \ldots, \left\lfloor \frac{n_{1k}\epsilon}{2p} \right\rfloor \right\}. \]
The points $(\mathbf{v} - \mathbf{w})/n_{1k}$ belong to $\mathcal{G}_k$ since $\mathbf{u}_k \in [\epsilon/2, 1]^p$. Next, we show that, for all $k$ large enough,
\[ \left| C_{1k}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) - \hat{C}_k\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \right| \ge \frac{\epsilon}{4}, \]
by treating two subcases. Note first that
\[ \sum_{\ell=1}^p \frac{w_\ell}{n_{1k}} \le \frac{p}{n_{1k}} \left\lfloor \frac{n_{1k}\epsilon}{2p} \right\rfloor \le \frac{\epsilon}{2}. \]

Subcase A: $A_k \cap C_k$.

From the fact that the copulas are monotone functions, we get
\[ \hat{C}_k\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \le \hat{C}_k\left( \frac{\mathbf{v}}{n_{1k}} \right) = \hat{C}_k(\mathbf{u}_k). \]
Then from Lemma 5.2,
\[ C_{1k}\left( \frac{\mathbf{v}}{n_{1k}} \right) - C_{1k}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \le \sum_{\ell=1}^p \frac{w_\ell}{n_{1k}}, \quad \text{i.e.} \quad C_{1k}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \ge C_{1k}(\mathbf{u}_k) - \sum_{\ell=1}^p \frac{w_\ell}{n_{1k}}. \]
Combining the two inequalities yields
\[ C_{1k}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) - \hat{C}_k\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \ge C_{1k}(\mathbf{u}_k) - \hat{C}_k(\mathbf{u}_k) - \sum_{\ell=1}^p \frac{w_\ell}{n_{1k}} > \epsilon - \frac{\epsilon}{2} = \frac{\epsilon}{2} > 0. \]

Subcase B: $B_k \cap C_k$.

By Lemma 5.2, we have
\[ \hat{C}_k\left( \frac{\mathbf{v}}{n_{1k}} \right) - \hat{C}_k\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) = \sum_{i=1}^m \lambda_{ik}(\omega) \left\{ C_{ik}\left( \frac{\mathbf{v}}{n_{1k}} \right) - C_{ik}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \right\} \le \sum_{i=1}^m \lambda_{ik}(\omega) \sum_{\ell=1}^p \frac{1}{n_{ik}} \left\lceil \frac{n_{ik} w_\ell}{n_{1k}} \right\rceil \le \sum_{\ell=1}^p \frac{w_\ell}{n_{1k}} + \sum_{i=1}^m \frac{p}{n_{ik}} \le \frac{\epsilon}{2} + \sum_{i=1}^m \frac{p}{n_{ik}}. \]
Hence,
\[ \hat{C}_k\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \ge \hat{C}_k(\mathbf{u}_k) - \frac{\epsilon}{2} - \sum_{i=1}^m \frac{p}{n_{ik}}. \]
Since the empirical copula is a monotone function, we also have
\[ C_{1k}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \le C_{1k}\left( \frac{\mathbf{v}}{n_{1k}} \right) = C_{1k}(\mathbf{u}_k). \]
Let us consider only $k$ large enough to make $\sum_{i=1}^m p/n_{ik} < \epsilon/4$. From the two previous inequalities, we obtain
\[ \hat{C}_k\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) - C_{1k}\left( \frac{\mathbf{v} - \mathbf{w}}{n_{1k}} \right) \ge \hat{C}_k(\mathbf{u}_k) - C_{1k}(\mathbf{u}_k) - \frac{\epsilon}{2} - \sum_{i=1}^m \frac{p}{n_{ik}} > \epsilon - \frac{\epsilon}{2} - \frac{\epsilon}{4} = \frac{\epsilon}{4}. \]

Combining Subcases A and B yields, on $(A_k \cup B_k) \cap C_k$ and for all such $k$,
\[ P_k\{\boldsymbol{\lambda}_k(\omega)\} \ge \int \left\{ C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right\}^2 \mathrm{d}M_k(\mathbf{u}) \ge \frac{1}{n_{1k}^p} \sum_{\mathbf{w} \in \mathcal{W}} \left( \frac{\epsilon}{4} \right)^2. \]
The sum above corresponds to the Riemann sum of the multiple integral
\[ \int_0^{\epsilon/2p} \cdots \int_0^{\epsilon/2p} \left( \frac{\epsilon}{4} \right)^2 \mathrm{d}v_1 \cdots \mathrm{d}v_p = K_p. \]
The number $K_p$ is a fixed positive constant for any fixed $p$. As a consequence, there exists a $k_0$ such that for all $k \ge k_0$, $P_k\{\boldsymbol{\lambda}_k(\omega)\} > K_p/2 > 0$, a contradiction with Lemma 5.1. We must thus conclude that $(A_k \cup B_k) \cap C_k$ occurs at most a finite number of times.

Since the two cases above do not occur infinitely often, we conclude that $A_k \cup B_k$ occurs at most a finite number of times. Hence,
\[ \sup_{\mathbf{u} \in \mathcal{G}^+_k} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$. ∎

Lemma 5.6
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| C_{1k}(\mathbf{u}) - \hat{C}_k(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$.

Proof of Lemma 5.6. The result follows from Lemma 5.3 and Lemma 5.5. ∎

Theorem 5.3 We have uniform convergence of the MAMSE-weighted empirical copula:
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| \hat{C}_k(\mathbf{u}) - C_1(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$.

Proof of Theorem 5.3. Consider the decomposition
\[ \left| \hat{C}_k(\mathbf{u}) - C_1(\mathbf{u}) \right| \le \left| \hat{C}_k(\mathbf{u}) - C_{1k}(\mathbf{u}) \right| + \left| C_{1k}(\mathbf{u}) - C_1(\mathbf{u}) \right|, \]
which holds for all $\mathbf{u} \in [0,1]^p$. Then
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| \hat{C}_k(\mathbf{u}) - C_1(\mathbf{u}) \right| \le \sup_{\mathbf{u} \in [0,1]^p} \left| \hat{C}_k(\mathbf{u}) - C_{1k}(\mathbf{u}) \right| + \sup_{\mathbf{u} \in [0,1]^p} \left| C_{1k}(\mathbf{u}) - C_1(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$. Indeed, the first term goes to 0 almost surely by Lemma 5.6 and the expression
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| C_{1k}(\mathbf{u}) - C_1(\mathbf{u}) \right| \]
converges almost surely to 0 as $k \to \infty$; see e.g. Expression 3.1 of Deheuvels (1979). ∎

Corollary 5.1 Let $\mathbf{U}_k$ be a sequence of random vectors with distribution $\hat{C}_k(\mathbf{u})$ and $\mathbf{U}$ a random vector with distribution $C_1(\mathbf{u})$. Then $\mathbf{U}_k$ converges to $\mathbf{U}$ in distribution.

Proof of Corollary 5.1. For almost every $\omega \in \Omega$, the sequence of distributions $\hat{C}_k(\mathbf{u})$ converges to $C_1(\mathbf{u})$ for all $\mathbf{u}$, hence the definition of weak convergence is respected with probability 1. ∎

Corollary 5.2 If $g(\mathbf{u})$ is a bounded continuous function, then
\[ \int g(\mathbf{u}) \, \mathrm{d}\hat{C}_k(\mathbf{u}) = \mathrm{E}\{g(\mathbf{U}_k)\} \to \mathrm{E}\{g(\mathbf{U})\} = \int g(\mathbf{u}) \, \mathrm{d}C_1(\mathbf{u}) \]
almost surely as $k \to \infty$.

Proof of Corollary 5.2. The result is a well-known consequence of the weak convergence proved in Corollary 5.1; see page 164 of Durrett (2005) for more details. ∎

5.6 Weighted Coefficients of Correlation Based on Ranks

Coefficients of correlation based on ranks typically estimate a measure of dependence that depends on the copula of the underlying distribution. Examples of such coefficients are provided in Tables 5.1 and 5.2. Suppose that bivariate data from $m$ populations are available. We suggest the use of the MAMSE weights to create weighted coefficients of correlation and show in this section that they converge almost surely to the true correlation of Population 1.

Consider first Blomqvist's $\beta$ and let $\hat\beta_{ik}$ be the Blomqvist coefficient based on the ranks of the sample $\mathbf{X}_{i1}, \ldots, \mathbf{X}_{i n_{ik}}$. The weighted Blomqvist's $\beta$ is given by
\[ \hat\beta_{\lambda k} = \sum_{i=1}^m \lambda_{ik}(\omega)\, \hat\beta_{ik}. \]

Theorem 5.4 $\hat\beta_{\lambda k} \to \beta_1$ almost surely as $k \to \infty$.
Proof of Theorem 5.4. As a direct consequence of Theorem 5.3,
\[ \hat\beta_{\lambda k} = 4 \sum_{i=1}^m \lambda_{ik}(\omega)\, C_{ik}\!\left( \tfrac{1}{2}, \tfrac{1}{2} \right) - 1 = 4\, \hat{C}_k\!\left( \tfrac{1}{2}, \tfrac{1}{2} \right) - 1 \to 4\, C_1\!\left( \tfrac{1}{2}, \tfrac{1}{2} \right) - 1 = \beta_1 \]
almost surely as $k \to \infty$. ∎

The remainder of this section will cover a general case that includes the other coefficients in Tables 5.1 and 5.2, except for Kendall's $\tau$. The population version of Kendall's $\tau$ depends on $\int C(u, v)\, \mathrm{d}C(u, v)$. A weighted version of Kendall's $\tau$ is achievable, but replacing both occurrences of $C(u, v)$ in this theoretical expression with mixture distributions will not yield a linear combination like the other coefficients considered. The weighted Kendall's $\tau$ hence requires a special treatment that is left to future work.

Replacing the copulas by their empirical counterparts in Table 5.1 yields estimates of the correlation based on ranks. However, the expressions obtained typically differ from the usual estimates based on ranks and do not necessarily span the interval $[-1, 1]$, but rather an interval whose bounds tend to $\pm 1$ as the sample size goes to infinity.

The classic formulas for the coefficients of correlation based on the ranks of the sample $\mathbf{X}_{i1}, \ldots, \mathbf{X}_{i n_{ik}}$ will thus admit a representation of the form
\[ \hat\kappa_{ik} = a_{n_{ik}} \iint g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) + b_{n_{ik}} \qquad (5.5) \]
that estimates
\[ \kappa_i = a \iint g(\mathbf{u}) \, \mathrm{d}C_i(\mathbf{u}) + b. \]
The coefficients $a_n \to a$ and $b_n \to b$ as $n \to \infty$ are chosen to ensure that $\hat\kappa_{ik} \in [-1, 1]$ for all sample sizes $n_{ik}$. Moreover, the values $\pm 1$ occur only for perfect concordance or discordance, i.e. when the ranks are identical ($R^k_{ij1} = R^k_{ij2}$) or antithetic ($R^k_{ij1} = n_{ik} + 1 - R^k_{ij2}$). The function $g(\mathbf{u})$ is typically bounded and its functional form defines the coefficient of correlation at hand (see Chapter 2 of Plante (2002) for detailed explanations).
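The weighted coefficients discussed here are simple linear combinations of the classical per-population estimates. A minimal sketch for Blomqvist's $\beta$ and Spearman's $\rho$ (helper names are illustrative; the weights are taken as given, e.g. from the MAMSE minimization):

```python
def _ranks(values):
    """Ranks (1 = smallest), assuming no ties as in the text."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    r = [0] * len(values)
    for pos, j in enumerate(order):
        r[j] = pos + 1
    return r

def blomqvist_beta(sample):
    """Classical Blomqvist beta: 4 * C_hat(1/2, 1/2) - 1 (Table 5.2)."""
    n = len(sample)
    rx = _ranks([x for x, _ in sample])
    ry = _ranks([y for _, y in sample])
    return 4 * sum(rx[j] <= n / 2 and ry[j] <= n / 2 for j in range(n)) / n - 1

def spearman_rho(sample):
    """Classical rank estimate of Spearman's rho (Table 5.2)."""
    n = len(sample)
    rx = _ranks([x for x, _ in sample])
    ry = _ranks([y for _, y in sample])
    s = sum(rx[j] * ry[j] for j in range(n))
    return 12 * s / (n * (n + 1) * (n - 1)) - 3 * (n + 1) / (n - 1)

def weighted_coefficient(samples, lam, coef):
    """MAMSE-weighted coefficient: sum_i lambda_i * coef(sample_i)."""
    return sum(l * coef(s) for l, s in zip(lam, samples))
```

Both per-population estimates equal $\pm 1$ exactly under perfect concordance or discordance, consistent with the normalization of $a_n$ and $b_n$ described above.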
Consider a particular case: for Population $i$ and a fixed $k$, the empirical estimate of Spearman's coefficient can be written
\[ \hat\rho_{ik} = \frac{12}{n_{ik}(n_{ik}+1)(n_{ik}-1)} \sum_{j=1}^{n_{ik}} R^k_{ij1} R^k_{ij2} - 3\, \frac{n_{ik}+1}{n_{ik}-1} = -3\, \frac{n_{ik}+1}{n_{ik}-1} + \frac{12\, n_{ik}^2}{(n_{ik}+1)(n_{ik}-1)} \iint u_1 u_2 \, \mathrm{d}C_{ik}(\mathbf{u}) \]
and clearly has the same asymptotic behavior as $-3 + 12 \iint u_1 u_2 \, \mathrm{d}C_{ik}(\mathbf{u})$.

Let
\[ \hat\kappa_{\lambda k} = \sum_{i=1}^m \lambda_{ik}(\omega)\, \hat\kappa_{ik} \]
be a MAMSE-weighted coefficient of correlation based on ranks.

Remark 5.5 The MAMSE-weighted coefficients of correlation are invariant under monotone increasing transformations of the data. Indeed, both the MAMSE weights and the individual correlations are based on the ranks of the data, which are invariant to such transformations.

Next, we prove that the MAMSE-weighted coefficients of correlation are strongly consistent estimates of the correlation in Population 1.

Lemma 5.7
\[ \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) \to \int g(\mathbf{u}) \, \mathrm{d}C_i(\mathbf{u}) \]
almost surely as $k \to \infty$ for any bounded continuous function $g$.

Proof of Lemma 5.7. By the uniform convergence of $C_{ik}$ proved by Deheuvels (1979), the sequence of distributions $C_{ik}(\mathbf{u})$ converges to $C_i(\mathbf{u})$ for all $\mathbf{u}$ and almost every $\omega \in \Omega$. Consequently, $C_{ik}$ converges weakly to $C_i$. Since the function $g$ is bounded and continuous, the almost sure convergence of the expectation follows (see e.g. Durrett (2005), page 164). ∎

Lemma 5.8 The coefficient $\hat\kappa_{ik}$ converges almost surely to
\[ \kappa_i = a \int g(\mathbf{u}) \, \mathrm{d}C_i(\mathbf{u}) + b, \]
its associated population value.

Proof of Lemma 5.8. Note that
\[ |\hat\kappa_{ik} - \kappa_i| \le \left| \hat\kappa_{ik} - \left\{ a \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) + b \right\} \right| + \left| \left\{ a \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) + b \right\} - \kappa_i \right| \]
\[ = \left| (a_{n_{ik}} - a) \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) + (b_{n_{ik}} - b) \right| + \left| a \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) - a \int g(\mathbf{u}) \, \mathrm{d}C_i(\mathbf{u}) \right| \]
\[ \le |a_{n_{ik}} - a| \left| \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) \right| + |b_{n_{ik}} - b| + |a| \left| \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) - \int g(\mathbf{u}) \, \mathrm{d}C_i(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$, since $\int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u})$ is bounded and, by Lemma 5.7, the term involving the difference of integrals tends to 0. ∎

Theorem 5.5 The weighted coefficient of correlation
\[ \hat\kappa_{\lambda k} = \sum_{i=1}^m \lambda_{ik}(\omega)\, \hat\kappa_{ik} \to \kappa_1 \]
almost surely as $k \to \infty$.

Proof of Theorem 5.5. The proof follows the same steps as that of Lemma 5.8:
\[ |\hat\kappa_{\lambda k} - \kappa_1| \le \left| \hat\kappa_{\lambda k} - \left\{ a \int g(\mathbf{u}) \, \mathrm{d}\hat{C}_k(\mathbf{u}) + b \right\} \right| + \left| a \int g(\mathbf{u}) \, \mathrm{d}\hat{C}_k(\mathbf{u}) - a \int g(\mathbf{u}) \, \mathrm{d}C_1(\mathbf{u}) \right| \]
\[ \le \sum_{i=1}^m \lambda_{ik}(\omega) \left\{ |a_{n_{ik}} - a| \left| \int g(\mathbf{u}) \, \mathrm{d}C_{ik}(\mathbf{u}) \right| + |b_{n_{ik}} - b| \right\} + |a| \left| \int g(\mathbf{u}) \, \mathrm{d}\hat{C}_k(\mathbf{u}) - \int g(\mathbf{u}) \, \mathrm{d}C_1(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$, since $g$ is bounded and $\int g(\mathbf{u}) \, \mathrm{d}\hat{C}_k(\mathbf{u})$ converges to $\int g(\mathbf{u}) \, \mathrm{d}C_1(\mathbf{u})$ by Corollary 5.2. ∎

Corollary 5.3 The MAMSE-weighted versions of Spearman's $\rho$, Gini's $\gamma$, or Blest's $\nu$ and $\xi$ are strongly consistent estimates of their corresponding population values.

Proof of Corollary 5.3. Direct consequence of Theorem 5.5. ∎

5.7 Weighted Strong Law of Large Numbers

In Chapter 3, we proved a WSLLN for unbounded functions in order to prove the consistency of the weighted likelihood with MAMSE weights. In the next section, we propose the maximum weighted pseudo-likelihood estimate and prove its consistency. To do that, we first prove a weighted strong law of large numbers more general than Corollary 5.2, as it holds for functions $g$ continuous on $(0,1)^p$ with asymptotes at the boundary of the unit hypercube.

Let $C^*_{ik}$ denote a rescaled empirical copula based on the transformed ranks $\mathbf{Y}^k_{ij}$ as defined in Section 5.2. We let $\hat{C}^*_k(\mathbf{u}) = \sum_{i=1}^m \lambda_{ik}(\omega)\, C^*_{ik}(\mathbf{u})$ and propose to show that for a function $g$ satisfying some regularity conditions,
\[ \sum_{i=1}^m \frac{\lambda_{ik}(\omega)}{n_{ik}} \sum_{j=1}^{n_{ik}} g\left( \mathbf{Y}^k_{ij} \right) = \int g(\mathbf{u}) \, \mathrm{d}\hat{C}^*_k(\mathbf{u}) \to \int g(\mathbf{u}) \, \mathrm{d}C_1(\mathbf{u}) \]
almost surely as $k \to \infty$. Since we are now using rescaled copulas, let us first prove the following result.

Lemma 5.9
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| \hat{C}^*_k(\mathbf{u}) - C_1(\mathbf{u}) \right| \to 0 \]
almost surely as $k \to \infty$.

Proof of Lemma 5.9. Remark 5.4 points out that
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| C^*_{ik}(\mathbf{u}) - C_{ik}(\mathbf{u}) \right| \le \frac{p}{n_{ik}} \to 0 \]
as $k \to \infty$. Therefore,
\[ \sup_{\mathbf{u} \in [0,1]^p} \left| \hat{C}^*_k(\mathbf{u}) - C_1(\mathbf{u}) \right| \le \sum_{i=1}^m \lambda_{ik}(\omega) \sup_{\mathbf{u} \in [0,1]^p} \left| C^*_{ik}(\mathbf{u}) - C_{ik}(\mathbf{u}) \right| + \sup_{\mathbf{u} \in [0,1]^p} \left| \hat{C}_k(\mathbf{u}) - C_1(\mathbf{u}) \right| \to 0 \]
as $k \to \infty$ by Remark 5.4 and Theorem 5.3. ∎

Proposition A.1(i) of Genest et al. (1995) provides a strong law of large numbers that we use to prove the weighted strong law of large numbers that we need. To simplify their notation, Genest et al. (1995) present their result with bivariate copulas, but we do not make this simplification here.

Theorem 5.6 (Genest et al. (1995)) Let $r(u) = u(1-u)$, $\delta > 0$, and let $\{q_\ell,\ \ell = 1, \ldots, p\}$ be positive numbers satisfying $\sum_{\ell=1}^p 1/q_\ell = 1$. Let $g(\mathbf{u})$ be a continuous function from $(0,1)^p$ to $\mathbb{R}$ such that $\mu_i = \int g(\mathbf{u}) \, \mathrm{d}C_i(\mathbf{u})$ exists. If $M < \infty$ is a positive real number such that
\[ |g(\mathbf{u})| \le M \prod_{\ell=1}^p r(u_\ell)^{a_\ell} \]
with $a_\ell = (-1 + \delta)/q_\ell$, then
\[ \int g(\mathbf{u}) \, \mathrm{d}C^*_{ik}(\mathbf{u}) \to \mu_i \]
almost surely as $k \to \infty$.

A bounded function always satisfies the assumptions of Theorem 5.6 because $\prod_{\ell=1}^p r(u_\ell)^{a_\ell}$ has a positive lower bound. Lemma 5.7 is thus a particular case of Theorem 5.6.

Lemma 5.10 Let $A$ be a half-open cube in $p$ dimensions whose opposite corners are $\mathbf{u}_1$ and $\mathbf{u}_2$ with $\mathbf{u}_1 \le \mathbf{u}_2$, i.e.
\[ A = (u_{11}, u_{21}] \times \cdots \times (u_{1p}, u_{2p}]. \]
Let $C_1(\mathbf{u})$ and $C_2(\mathbf{u})$ be $p$-dimensional copulas such that
\[ \sup_{\mathbf{u} \in [0,1]^p} |C_1(\mathbf{u}) - C_2(\mathbf{u})| \le \epsilon. \]
Then,
\[ |\, \mathrm{d}C_1(A) - \mathrm{d}C_2(A) \,| \le 2^p \epsilon. \]

Proof of Lemma 5.10. The coordinates of the corners of the cube can be re-expressed by replacing some elements of $\mathbf{u}_1$ by the corresponding elements of $\mathbf{u}_2$. Let
, p } be a set of indices and v$ any vector such that Uu  if i € S  Vi =  u-2i if i G S  c  To calculate the probability of the cube, we can use the following development which is akin to the formula for the intersection of multiple sets:  dC (A) = C ( v ) - X > ( v 1  1  0  { i }  )+  i=l  £  d (v  { i j }  ) - • • • ± Ci (v  { w  }  ) .  l<i<J<P  Consider now the difference  \dCi(A)-  dC {A)\ 2  =  Ci (v ) - C (v ) - ]T {Ci ( v ) - C (v )-} 0  2  0  {l}  2  {i}  i=i +  E l<i<j<p  { i(vm)-C2(v , )} C  {l j}  -•••±{CI(V  { 1 I  ...  ) P }  )-C (V 2  { 1 I  ...  I P }  )}  p  <  | d (v ) - C (v )| + }2\Ci ( v ) - C ( v ) j 0  2  0  {l}  2  {i}  i=i  +  E  l l( fe})- 2(v C  V  C  { l j }  )|  l<i<j<p  + --- + | C ( v 1  <  { w }  )-C (v 2  { 1 )  ...,  p }  )|  2 e. . p  Since the sum above has i=l  terms each bounded by e.  129  Chapter 5. MAMSE Weights for Copulas Corollary 5.4 Let B be a closed cube,  B = [un,u i]x 2  •••x  [ui ,U2p\. p  Let C i ( u ) and C ( u ) be p-dimensional copulas such that 2  sup | C i ( u ) - C ( u ) | < e. ue[o,i]p 2  Then,  cyp | d C i ( 5 ) - dC (B)\ < —. 2  Proof of Corollary 5.4- Apply Lemma 5.10 to the sequence of half-open cubes  As = (itii - 5, U2\] x • • • x (ui - 5, u p] p  2  with 5 —> 0. A similar proof holds for a mix of closed and open intervals.  •  Theorem 5.7 Let g(u) be any continuous function on (0, l ) that satisfies the assumptions p  of Theorem 5.6. Suppose that the sample sizes  —» oo for all populations. Then,  <7(u)dCjJ(u)- J <7(u)dCi(u)  - 0  almost surely as k —• oo. Proof of Theorem 5.7. For t G IN, let B = [2~Vl - 2~*] and p  t  , g(u)  r (u) = _  iiuEB  t  t  0  otherwise  Since g(u) is continuous and Bt is a compact set, the image of r is bounded. Suppose t  that T j ( u ) € [Lt, J7t]. By the Heine-Cantor Theorem, r is uniformly continuous on B , t  t  130  Chapter 5. MAMSE Weights for Copulas i.e. Ve t > 0, 3<5,t > 0 such that T)  r  |u - v| < <5, ==> \T (u) - r (v)| < e .  
Vu, v G B , t  Let e j = 2  - t  T  Tt  t  t  T)t  and choose 0 < 8 t < 2 * accordingly. Let At be a partition of the interval _  T:  [2-*, 1-2-*],  A = {[2-*, (2)2"*]} U {( 2- , (s + 1)2"*], s = 2 , . . . , 2* - 2} . 4  S  Then, the elements of A x ••• x A = t  {A ,...,A }  t  u  Stt  form a partition of B of cardinality St = (2* — 2) . Define p  t  M ) u  =  ^ ^ * l A  s  t  ( u )  s=l  where 1 if u e A 6 = inf s(u)  a n  s t  d  s t  l^fu) =  y e A s t  1  0 otherwise  Then, by construction s u p [ ] |rt(u) — h (x)\ < 2 * and ue  5  01  P  t  (u)dC *(u)- J f c  (5.6)  (u)dd(u)  5  where  5  (u)d^(u)-|r (u)d^(u t  T (n)dC* (u)- j h (u)dC* (u) t  T  3  k  J h (u)dC* (u)- J t  k  t  k  ^(u)dCi(u)  131  Chapter 5. MAMSE Weights for Copulas / ^ ( u ) d C i ( u ) - J r (u) dCi(u) t  n=  r ( u ) d C i ( u ) - J g(u) d C ( u ) t  :  We now prove that for any e > 0 and w in a subset of fi with probability 1, we can choose t  w  such that the five terms above are less than e/5 for all  k >  k (tu)w  First note that  J /it(u)-r (u)dCi(u) t  < y  |^(u)-r (u)|dC (u)<2t  by construction. The same bound applies for T and does not depend on 2  By Lemma 5.9, s u p [ ] | C £ ( u ) ue  3fin C  01  P  —  sup  or CJ.  C i ( u ' ) | converges almost surely to 0.  fi with P(flo) = 1 such that for each u> G  UG[O,I]P  t  1  |Cfc(u) - C i ( u ) | <  fin  Therefore,  and any t, 3/c^j with  5 max(|f/t|, | L t | ) 2 T + P t  for all k > ku,t- For any such k and w, Lemma 5.10 implies that 2  dC* (A ) k  st  - dCi(Art)  p  < 5 max(|C^|,|L |)2*+P t  t  for any s,t. Developing T 3 yields s  St  T* =  t  b«dC* (A ) - £bst dC^Ast) k  S=l St  < £ > •st\ •  st  8=1  dC^A*) - dCi(A ) st  s=l  <  2  S max{\U \,\L t  t  t  P  '5 max(|f/ |,|L |)2*+P t  t  i  1_ 2*' Therefore, 3*i such that 2 * < e/5 for all t > t\, i.e. T , T 3 and T 4 are each bounded by 2  132  Chapter 5. MAMSE Weights for Copulas e/5 for any t > t\ and k > k tWt  We can write  s(u)l c(u) dCi(u)  < j |^(u)|lc(u)dCi(u). 
fl  B  The integrand on the right-hand side goes to 0 as t —> oo for each u 6 (0, l ) . Hence, the p  dominated convergence theorem ensures that the right-hand-side expression goes to 0 as t —> oo (the bounding function is  |<?(u)|).  Therefore, there exists t such that 2  T5  < e/5 for  all t>t . 2  Turning now to T\, Theorem 5.6 means that there exists fi^t C fi with P(fij ) = 1 such ]t  that for all u £ fi^t,  ^-E^ ( &) = Y  as  /9(n)dC* (n) k  - /<?(u)dQ(u)  —> 0 0 . Consider a fixed w € fii =  P) fii . i€{l,...,m},telN ;t  The intersection is over a countable number of sets of probability 1, hence P(fii) = 1. For any to £ fii, T\ is developed as  r, =  (u)l (x)dC *(u) < / |0(u)|lc(u)dC *(u)  5 m  i=i  fc  BF  \  fc  B  ( \ ik  ™  n  j=i  i=i  1  ik  n  j=i  The dominated convergence theorem says that 3t* such that  | (u)|l c(u)da(u) < e/lOm 5  B  133  Chapter 5. MAMSE Weights for Copulas for all t > t*. Choose t > t% = maxi<j< t*. Since u G f i j , m  1 1  such'that for all k > fc,^^  r  Uik  e H  j=i  10m  e < - — •  5m  Therefore, Vt > max(i3,t*), there exists k* = maxi<i< k^u such that t  m  Jg(u)dC* (u)- J r ( u ) d C * ( u ) < t  k  f c  for all k > k* . t  In conclusion, for any ui e fi fl fix and any e > 0, we can choose t = max(ti, t , £3, t*) 0  w  2  that yields inequalities showing that  s(u)dC »f e  for all k >  ku(tu)  y  ^(u)dCi(u) < e  = max(k ,k* ). In other words, the left hand side of expression (5.6) u>ttu  ttu  converges to 0 for any u G fi D f i i with P ( f i n f i i ) = 1, i.e. it converges almost surely. • 0  5.8  0  Strong Consistency of the Weighted Pseudo-Likelihood  Section 5.2 introduces the pseudo-likelihood as proposed by Genest et al. (1995). When the goal of inference is to study the dependence structure underlying the data, the heuristics of Section 2.1 can be used with distributions replaced by copulas. 
Therefore, the maximum pseudo-likelihood estimate can be seen as a particular case of the entropy maximization principle where the true copula is replaced by its empirical counterpart.

Suppose that p-dimensional data are available from m populations. It is natural to extend the pseudo-likelihood following the heuristics of Section 2.2. If we consider the family of copulas c(u|θ) and use a mixture of empirical copulas such as C*_k(u) as an approximation to the true underlying copula, the entropy maximization principle yields the weighted pseudo-likelihood:

    L(θ) = ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ)^{λ_{ik} n_{1k}/n_{ik}}    (5.7)

where the Ŷ_{ij} are rescaled ranks as presented in Section 5.2. The maximum weighted pseudo-likelihood estimate (MWPLE) is a value of θ maximizing L(θ). To our knowledge, the weighted pseudo-likelihood has never been suggested in the literature. This section shows that when MAMSE weights are used in conjunction with (5.7), the MWPLE is a strongly consistent estimate of the true parameter θ_0. The proof is adapted from the work of Wald (1949).

Families of Copulas

The weighted pseudo-likelihood is based on the density function of a family of copulas. However, not all copulas have a density function. In fact, any copula can be factorized as the sum of an absolutely continuous part and a singular term (see Nelsen (1999), page 23). For instance, the bivariate Marshall-Olkin copula, C(u_1, u_2|α, β) = min(u_1^{1−α} u_2, u_1 u_2^{1−β}), is parametrized by 0 < α, β < 1 and gives a positive probability to the curve u_1^α = u_2^β, its singular component. Such a family cannot be fitted with the pseudo-likelihood or its weighted counterpart as it does not admit a density function with respect to Lebesgue measure. As mentioned previously, all the families presented in Section 5.1 admit a density.

Let {c(u|θ) : θ ∈ Θ} be the family of copulas that we wish to fit to the data at hand. A family of copulas that does not cover the complete range of dependence (i.e.
which does not approach perfect positive or negative correlation) will typically have a compact parameter space. This is the case for the FGM copula.

Perfect concordance or discordance between two (or more) variables corresponds to a limit case where the copula places all of its mass on the lines u_i = u_j or u_i = 1 − u_j. For such a limit case, the copula does not admit a density anymore and its parameter space will typically be infinite or open, and hence not compact. Other limiting cases may have to be omitted because the definition of the copula presents an undefined form; independence under the Clayton model is such an example as it would involve dividing by 0. Despite this fact, we make the assumption that Θ is a compact set. In practice, this means that the proof of consistency that we provide will hold for (possibly constrained) families of copulas that do not include limiting cases such as perfect concordance or discordance between any 2 marginals. Relaxing Assumption 5.1 may be possible but is left to future work.

Assumption 5.1 The set Θ is a compact subset of a finite-dimensional Cartesian space.

Due to the constraints on their margins, copulas that admit a density will typically be smooth, hence little generality is lost with Assumption 5.2.

Assumption 5.2 The density function of C(u|θ), denoted c(u|θ), is jointly continuous for u ∈ (0,1)^p and θ ∈ Θ.

Assumption 5.3 Suppose the copula underlying the true unknown distribution is C(u|θ_0) = C_1(u) for some θ_0 ∈ Θ.

In particular, this latter assumption means that the copula underlying the true distribution admits a continuous density function.

Revised Wald's Assumptions

For all θ ∈ Θ and ρ > 0, let us define

    c(u, θ, ρ) = sup_{|θ−θ'| ≤ ρ} c(u|θ').

Assumption 5.4 For a sufficiently small ρ, the expected values

    ∫ log[max{c(u, θ, ρ), 1}] dC_i(u)

are finite.
Assumption 5.5 If θ_1 ≠ θ_0, then C(u|θ_0) ≠ C(u|θ_1) for at least one u.

Assumption 5.6 ∫ |log c(u|θ_0)| dC_i(u) < ∞ for i = 1, …, m.

Assumption 5.7 The functions c(u, θ, ρ) are measurable for any θ and ρ.

Assumption 5.8 Suppose that the functions log c(u|θ_0) and log c(u, θ, ρ) satisfy the assumptions of Theorem 5.6 for any θ and ρ.

Wald (1949) makes additional assumptions, but they are not required here because the copula is defined on a bounded set and because of the stronger assumptions on continuity that are made above.

Remark 5.6 If lim_{i→∞} θ_i = θ, then lim_{i→∞} c(u|θ_i) = c(u|θ) from the continuity of c(u|θ).

Remark 5.7 The function φ of Wald, also found in Section 3.7, is not required here because we assume that the set Θ is compact.

Remark 5.8 The functions c(u, θ, ρ) are continuous from the continuity of c(u|θ) and Lemma 3.10.

An Example That Satisfies the Assumptions

Before proving the main result of this section, we show that the assumptions made above are not vacuous by verifying that at least one often-used family of distributions satisfies all the stated conditions. For this example, we use a simplified notation where u = [u, v]^T.

Consider the Clayton family of copulas whose density function

    c(u, v|θ) = (θ + 1) (uv)^{−θ−1} (u^{−θ} + v^{−θ} − 1)^{−2−1/θ}

is indexed by θ ∈ Θ = [a, b] with 0 < a < b < ∞. Choosing a compact Θ for this family means that the limiting cases of independence and of perfect concordance are omitted. Assumptions 5.1, 5.2 and 5.5 are satisfied. All distributions considered are defined on (0,1)^2 and its associated Borelian, hence Assumption 5.7 is also satisfied. Assumption 5.3 is necessary in the context of proving consistency. For the model considered, we show that Assumptions 5.4 and 5.6 are consequences of the bound implied by Assumption 5.8, to which we now turn our attention.
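The Clayton log-density above is easy to evaluate and check numerically. A minimal sketch (Python; the midpoint-rule check is only illustrative and is not part of the thesis):

```python
import numpy as np

def clayton_log_density(u, v, theta):
    """Log-density of the bivariate Clayton copula,
    c(u, v | theta) = (theta+1) (uv)^(-theta-1) (u^-theta + v^-theta - 1)^(-2-1/theta),
    for theta > 0 and (u, v) in (0, 1)^2."""
    return (np.log1p(theta)
            - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u**(-theta) + v**(-theta) - 1.0))

# midpoint-rule check that the density integrates to about 1 over (0, 1)^2
m = 400
grid = (np.arange(m) + 0.5) / m
U, V = np.meshgrid(grid, grid)
mass = np.exp(clayton_log_density(U, V, theta=0.31)).mean()
```

The density is exchangeable in (u, v), and its total mass is 1 up to the discretization error of the grid.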
Remark 5.9 For u, v ∈ (0,1), there exists a finite constant M > 0 such that

    |log u| ≤ M / {r(u) r(v)}^{1/4}

where r(u) = u(1 − u), since log u is finite for all u ∈ (0,1), except when u → 0 and

    lim_{u→0} (−log u) / u^{−1/4} = lim_{u→0} u^{−1} / {(1/4) u^{−5/4}} = 4 lim_{u→0} u^{1/4} = 0

by l'Hospital's rule.

Remark 5.10 For u, v ∈ (0,1), the function r(u) is bounded between 0 and 1/4. Therefore,

    {r(u) r(v)}^{1/4} ≤ r(u)^{1/4} / √2,

meaning that for any constant M_1, there exists a constant M_2 = M_1/√2 such that

    M_1 / r(u)^{1/4} ≤ M_2 / {r(u) r(v)}^{1/4}.

Lemma 5.11 The functions of Assumption 5.8 satisfy the bound implied by Theorem 5.6 for the Clayton family of copulas with Θ = [a, b] where 0 < a < b < ∞.

Proof of Lemma 5.11. Using the notation of Theorem 5.6, set p = q = 2 and δ = 1/2. We show that there exist positive finite constants M_{θ_0} and M*_θ such that

    |log c(u|θ_0)| ≤ M_{θ_0} / {r(u) r(v)}^{1/4}   and   |log c(u, θ, ρ)| ≤ M*_θ / {r(u) r(v)}^{1/4}.

We can write

    |log c(u|θ)| = | log(θ+1) − (2 + 1/θ) log(u^{−θ} + v^{−θ} − 1) − (θ+1) log uv |
                ≤ |log(θ+1)| + (1 + 2θ)(1/θ) log(u^{−θ} + v^{−θ} − 1) + (θ+1)|log u| + (θ+1)|log v|
                ≤ { log(θ+1) + (1 + 2θ) M_2 + 2(θ+1) M } / {r(u) r(v)}^{1/4}

by Remark 5.9 and the fact that

    (1/θ) log(u^{−θ} + v^{−θ} − 1) ≤ (1/θ) log{2 max(u^{−θ}, v^{−θ}) − 1}
        ≤ (1/θ) log(2u^{−θ} − 1) + (1/θ) log(2v^{−θ} − 1)
        ≤ (1/θ) log(2u^{−θ}) + (1/θ) log(2v^{−θ})
        = |log u| + |log v| + (2/θ) log 2
        ≤ { M + M + (√2/θ) log 2 } / {r(u) r(v)}^{1/4}

by Remarks 5.9 and 5.10. Hence |log c(u|θ)| ≤ M_θ / {r(u) r(v)}^{1/4} for a finite constant M_θ. Note that the constant M_θ is continuous in θ for parameters θ in any compact subset of [a, b]. The inequalities above hold for θ = θ_0, which is the first part of Assumption 5.8.
The second part is also satisfied since

    log c(u, θ, ρ) = log sup_{|θ−θ'| ≤ ρ} c(u|θ') = sup_{|θ−θ'| ≤ ρ} log c(u|θ')
                   ≤ sup_{|θ−θ'| ≤ ρ} M_{θ'} / {r(u) r(v)}^{1/4} ≤ M*_θ / {r(u) r(v)}^{1/4}

because M_θ is a continuous function on the compact set {θ' : |θ − θ'| ≤ ρ} ∩ Θ; it thus achieves its maximum, denoted M*_θ. ∎

Corollary 5.5 Under the assumptions of Lemma 5.11, Assumptions 5.4 and 5.6 are satisfied.

Proof of Corollary 5.5. Any of the integrals in Assumptions 5.4 and 5.6 is bounded by a positive constant times

    ∫ {r(u) r(v)}^{−1/4} dC_i(u) ≤ ∫ min{r(u), r(v)}^{−1/2} dC_i(u)
        ≤ ∫_{u: r(u) ≤ r(v)} r(u)^{−1/2} dC_i(u) + ∫_{u: r(u) > r(v)} r(v)^{−1/2} dC_i(u)
        ≤ ∫ r(u)^{−1/2} dC_i(u) + ∫ r(v)^{−1/2} dC_i(u) = 2 ∫_0^1 {u(1−u)}^{−1/2} du = 2π < ∞,

the last equality holding because the margins of a copula are uniform. Hence we obtain the desired result. ∎

Remark 5.11 Corollary 5.5 guarantees that

    ∫ log c(u|θ_0) dC_i(u)   and   ∫ log c(u, θ, ρ) dC_i(u)

exist. Therefore, the assumptions of Theorem 5.6 are satisfied, i.e. Assumption 5.8 is satisfied.

Wald's Lemmas

For expectations, the following convention is adopted. Let Y be a random variable. The expected value of Y exists if E{max(Y, 0)} < ∞. If E{max(Y, 0)} is finite but E{min(Y, 0)} is not, we say that EY = −∞. Moreover, a generic U represents a random variable with distribution C_1(u) = C(u|θ_0). The following lemmas are equivalent to those found in Section 3.7.

Lemma 5.12 For any θ ≠ θ_0, we have E log c(U|θ) < E log c(U|θ_0).

Lemma 5.13 lim_{ρ→0} E log c(U, θ, ρ) = E log c(U|θ).

Main Result

Let us now turn to the main result of this section. Throughout this subsection, we suppose that Assumptions 5.1 to 5.8 are satisfied.

Theorem 5.8 Let T be any closed subset of Θ that does not contain θ_0. Then,

    P{ lim_{k→∞} sup_{θ∈T} [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ)^{λ_{ik} n_{1k}/n_{ik}} ] / [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ_0)^{λ_{ik} n_{1k}/n_{ik}} ] = 0 } = 1.

Proof of Theorem 5.8: Let U denote a random variable with distribution C_1(u) = C(u|θ_0). With each element θ ∈ T, we associate a positive value ρ_θ such that

    E{log c(U, θ, ρ_θ)} < E{log c(U|θ_0)}.    (5.8)

The existence of such ρ_θ follows from Lemmas 5.12 and 5.13.
Let S(θ, ρ) denote the sphere with center θ and radius ρ. The spheres {S(θ, ρ_θ) : θ ∈ T} form a covering of the compact set T, hence there exists a finite sub-covering. Let θ_1, …, θ_h ∈ T be such that T ⊂ ∪_{s=1}^h S(θ_s, ρ_{θ_s}).

Clearly,

    sup_{θ∈T} ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ)^{λ_{ik} n_{1k}/n_{ik}}
        ≤ Σ_{s=1}^h ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}, θ_s, ρ_{θ_s})^{λ_{ik} n_{1k}/n_{ik}}.

Therefore, to prove Theorem 5.8 it suffices to show that

    lim_{k→∞} [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}, θ_s, ρ_{θ_s})^{λ_{ik} n_{1k}/n_{ik}} ] / [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ_0)^{λ_{ik} n_{1k}/n_{ik}} ] = 0

almost surely for s = 1, …, h. These equations can be rewritten as

    P[ lim_{k→∞} n_{1k} { ∫ log c(u, θ_s, ρ_{θ_s}) dC*_k(u) − ∫ log c(u|θ_0) dC*_k(u) } = −∞ ] = 1    (5.9)

for s = 1, …, h.

The integrals in (5.9) converge almost surely to E log c(U, θ_s, ρ_{θ_s}) and E log c(U|θ_0) by Theorem 5.7. For large k, the expression inside the curly brackets is thus negative by (5.8). Hence the proof of Theorem 5.8 is complete. ∎

Theorem 5.9 Let θ̂_k(ω) be a sequence of random variables such that there exists a positive constant c with

    [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ̂_k(ω))^{λ_{ik}(ω) n_{1k}/n_{ik}} ] / [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ_0)^{λ_{ik}(ω) n_{1k}/n_{ik}} ] ≥ c > 0    (5.10)

for all k ∈ ℕ and all ω ∈ Ω. Then

    P{ lim_{k→∞} θ̂_k(ω) = θ_0 } = 1.

Proof of Theorem 5.9. Let ε > 0 and consider the values of θ̂_k(ω) as k goes to infinity. Suppose that θ̂ is a limit point away from θ_0, such that |θ̂ − θ_0| ≥ ε. Then,

    sup_{θ: |θ−θ_0| ≥ ε} [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ)^{λ_{ik}(ω) n_{1k}/n_{ik}} ] / [ ∏_{i=1}^m ∏_{j=1}^{n_{ik}} c(Ŷ_{ij}|θ_0)^{λ_{ik}(ω) n_{1k}/n_{ik}} ] ≥ c > 0

infinitely often. By Theorem 5.8, this event has probability 0, even with ε arbitrarily small. Therefore,

    P{ ω : lim_{k→∞} |θ̂_k(ω) − θ_0| ≤ ε } = 1

for all ε > 0. ∎

Corollary 5.6 The MWPLE with MAMSE weights is a strongly consistent estimate of θ_0.

Proof of Corollary 5.6. The MWPLE clearly respects Equation (5.10) with c = 1 since θ̂_k(ω) is then chosen to maximize the numerator of (5.10). ∎
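To make the object of this consistency result concrete, the weighted pseudo-likelihood (5.7) can be computed and maximized directly. The following sketch (Python) fits the Clayton family by grid search; the weights below are arbitrary illustrative constants, not MAMSE weights, and the sampler and names are assumptions of this sketch rather than thesis material:

```python
import numpy as np

def clayton_log_density(u, v, theta):
    # bivariate Clayton copula log-density, theta > 0
    return (np.log1p(theta)
            - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u**(-theta) + v**(-theta) - 1.0))

def clayton_sample(n, theta, rng):
    # conditional-inversion sampler for the Clayton copula
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = ((w**(-theta / (1.0 + theta)) - 1.0) * u**(-theta) + 1.0)**(-1.0 / theta)
    return np.column_stack([u, v])

def rescaled_ranks(sample):
    # ranks divided by (n + 1), keeping the pseudo-observations inside (0, 1)
    n = sample.shape[0]
    return (np.argsort(np.argsort(sample, axis=0), axis=0) + 1.0) / (n + 1.0)

def weighted_pseudo_loglik(theta, samples, weights):
    # log of (5.7): sum_i lambda_i * (n_1/n_i) * sum_j log c(Y_ij | theta)
    n1 = samples[0].shape[0]
    total = 0.0
    for lam, s in zip(weights, samples):
        y = rescaled_ranks(s)
        total += lam * n1 / s.shape[0] * clayton_log_density(y[:, 0], y[:, 1], theta).sum()
    return total

rng = np.random.default_rng(1)
samples = [clayton_sample(n, 1.0, rng) for n in (100, 100, 100)]
weights = [0.5, 0.25, 0.25]          # illustrative fixed weights, not MAMSE weights
grid = np.linspace(0.1, 3.0, 291)
mwple = grid[np.argmax([weighted_pseudo_loglik(t, samples, weights) for t in grid])]
```

With all three samples drawn from the same Clayton copula, the grid maximizer lands near the true parameter; in practice the thesis determines the weights from the data via the MAMSE criterion.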
5.9 Simulations

We study the performance of the weighted coefficients of correlation and of the weighted pseudo-likelihood with MAMSE weights in finite samples through simulations. Unless otherwise specified, the number of repetitions for each simulation is sufficient to make the standard deviation of the simulation less than the last digit shown in the tables or on the figures.

5.9.1 Different Correlations for Bivariate Copulas

The bivariate families of copulas presented in Section 5.1 all depend on a single parameter and the population value of Spearman's ρ is a monotone function of that parameter. To use these families on a comparable scale, we use parameter values that are associated with a specified ρ. Four different families of copulas are considered: Normal, Clayton, Frank and Gumbel-Hougaard.

Equal samples of size n ∈ {20, 50, 100, 250} are simulated from five bivariate populations with different correlations. Two scenarios are adopted; they are described in Table 5.3. Each situation considered is repeated 10000 times.

              Scenario A   Scenario B
    Pop. 1       0.35         0.25
    Pop. 2       0.25         0.30
    Pop. 3       0.30         0.35
    Pop. 4       0.40         0.40
    Pop. 5       0.45         0.45

Table 5.3: Values of ρ under two different scenarios that are simulated for different families of copulas.

Figure 5.1 shows the average weight allocated to each of the five populations by the MAMSE approach. Within a given scenario, the average weights do not seem to depend strongly on the actual distribution of the data. This is not very surprising since populations share the same correlation under such circumstances, and hence the shape of their copulas is similar even if they come from different families. As one would expect, the difference between Scenario A and B is bigger than the difference between two distributions within the same scenario.
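Two of the four families permit a closed-form link between Spearman's ρ and the copula parameter: for the Normal copula, r = 2 sin(πρ/6), and for the FGM copula, θ = 3ρ (valid for |ρ| ≤ 1/3). A minimal sketch of these standard identities (Python; the Clayton, Frank and Gumbel-Hougaard parameters must instead be obtained numerically):

```python
import numpy as np

def normal_param_from_rho(rho_s):
    """Normal-copula correlation r yielding Spearman correlation rho_s:
    r = 2 sin(pi * rho_s / 6)."""
    return 2.0 * np.sin(np.pi * rho_s / 6.0)

def fgm_param_from_rho(rho_s):
    """FGM-copula parameter yielding Spearman correlation rho_s:
    theta = 3 * rho_s, defined for |rho_s| <= 1/3."""
    return 3.0 * rho_s
```

For ρ = 0.2, these give r ≈ 0.209 and θ = 0.6, the values that appear later in Table 5.6.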
Under Scenario A, bias can cancel out since the correlation of the target population sits squarely within the range of correlations from the other populations. Consequently, more weight is given to Population 1 under Scenario B, where that situation cannot occur.

Finally, note that under Scenario A populations 2 to 5 receive an approximately equal share of the weight, but under Scenario B, populations whose correlation is closer to the target receive a larger mass than the others. This behavior conforms with intuition.

[Figure 5.1: mosaic plot of the average MAMSE weights by scenario, family and sample size. Under Scenario A, Population 1 receives about 30% of the weight for every family and sample size; under Scenario B, its share grows from about 32% at n = 20 to about 38% at n = 250.]

Figure 5.1: Average value of the MAMSE weights for scenarios where samples of equal sizes n are drawn from 5 different populations. The cell areas are proportional to the average weight allocated to each population. The numbers correspond to 100λ_i and are averaged over 10000 repetitions.

Table 5.4 shows a ratio of mean squared errors comparing the usual estimate of Spearman's ρ with its MAMSE-weighted counterpart. Table 5.5 shows a similar ratio for estimating the parameter of the underlying copula with the pseudo-likelihood or the MAMSE-weighted pseudo-likelihood. Note that the error due to simulation in Table 5.5 has a standard deviation from below 6 units for n = 20 to less than 3 units for n = 250.

Under Scenario A, the performance of the MAMSE-weighted methods is impressive, featuring in most cases a MSE three times smaller than the equivalent non-weighted method.
    Scenario   Family     n = 20    50    100    250
    A          Normal       333    331    315    271
               Clayton      335    329    314    271
               Frank        338    333    314    271
               Gumbel       331    327    312    271
    B          Normal       275    197    131     72
               Clayton      284    197    131     81
               Frank        283    195    131     77
               Gumbel       286    196    136     78

Table 5.4: Performance of a MAMSE-weighted coefficient of correlation based on ranks as measured by 100 MSE(ρ̂)/MSE(ρ̂_λ) for different scenarios and sample sizes n ∈ {20, 50, 100, 250}. Each figure is averaged over 10000 repetitions.

    Scenario   Family     n = 20    50    100    250
    A          Normal       305    310    304    266
               Clayton      407    349    327    278
               Frank        398    356    325    273
               Gumbel       389    341    315    276
    B          Normal       193    139     97     57
               Clayton      184    110     72     45
               Frank        245    164    112     68
               Gumbel       188    117     83     51

Table 5.5: Performance of the maximum weighted pseudo-likelihood estimate as measured by 100 MSE(θ̂)/MSE(θ̂_λ) for different scenarios and sample sizes n ∈ {20, 50, 100, 250}. Each figure is averaged over 10000 repetitions.

Under that scenario, the correlation of the population of interest is in the middle of the range of correlations from the other populations, hence the bias can cancel out. Under Scenario B however, Populations 2 to 5 have larger correlations than the target population. Despite that fact, the methods of inference based on MAMSE weights still achieve appreciable gains in performance in many cases.

The performance of the weighted methods is sometimes clearly inferior to their unweighted equivalents. Although the counter-performances seem to occur for larger sample sizes, a situation where the weighted methods are less needed, the performance of the method should be studied further before recommending its use in practice to avoid encountering such losses.

5.9.2 Different Bivariate Distributions with Common Correlation

In this simulation, we explore the possible advantages of the MAMSE-weighted methods when populations have copulas of different yet similar shapes.
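The unweighted estimate ρ̂ appearing in the ratios above is the usual rank-based coefficient, i.e. the Pearson correlation computed on the ranks. A minimal sketch (Python; the simulated data are illustrative only):

```python
import numpy as np

def spearman_rho(x, y):
    """Sample Spearman correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)) + 1.0
    ry = np.argsort(np.argsort(y)) + 1.0
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
y = 0.5 * x + rng.normal(size=2000)   # positively dependent pair
rho = spearman_rho(x, y)
```

The MSE ratios reported in the tables then simply compare repeated realizations of this estimator with its weighted counterpart.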
Five populations are used, all of which have a theoretical Spearman correlation of ρ = 0.2. However, the actual shape varies since the populations come from different copulas that are described in Table 5.6.

    Pop. 1   Normal copula, r = 0.209
    Pop. 2   Frank copula, θ = 1.22
    Pop. 3   Farlie-Gumbel-Morgenstern copula, θ = 0.600
    Pop. 4   Gumbel-Hougaard copula, θ = 1.16
    Pop. 5   Clayton copula, θ = 0.310

Table 5.6: Distributions from which bivariate random variables are drawn. The choice of parameters in each population yields a Spearman correlation of ρ = 0.20.

Note that the Frank, Farlie-Gumbel-Morgenstern and Normal copulas are all radially symmetric while those of Clayton and Gumbel-Hougaard are not. Hence, they do differ in shape although they are chosen to share a common theoretical correlation.

For each sample size n ∈ {20, 50, 100, 250} considered, 10000 replicates are produced. The MSE of estimating the correlation ρ = 0.2 and the MSE of estimating the parameter of the target distribution, a Normal model with θ = r = 2 sin(0.2π/6) ≈ 0.209, are evaluated.

Table 5.7 shows the average MAMSE weights allocated to each of the five populations considered as well as the efficiency calculated by the ratios 100 MSE(ρ̂_1)/MSE(ρ̂_λ) and 100 MSE(θ̂_1)/MSE(θ̂_λ). Note that the standard deviation of the error due to simulation can reach nearly 4 units for the efficiency at estimating θ_1. All other figures in the table respect the quoted margin of error of one standard deviation.

First note that a substantial weight is allocated to the external populations, the sample from the target population contributing a mere 31% to the inference. Note also that the remaining weight is spread rather evenly between the other populations. The gain in efficiency is clear and impressive. In fact, it corresponds approximately to the inverse of the average weight allocated to Population 1: 1/0.31 ≈ 3.23. When the populations
considered are similar, the MAMSE-weighted correlation and the MWPLE clearly improve the inference. The relevant information contained in the samples from Population 2 to 5 has a clear value; that information should not be ignored.

           Efficiency              100 × average weight
    n      ρ̂_1     θ̂_1      λ_1    λ_2    λ_3    λ_4    λ_5
    20     328      299       31     17     17     17     17
    50     336      334       31     17     17     17     17
    100    338      345       31     17     17     17     17
    250    345      356       31     17     17     17     17

Table 5.7: Average MAMSE weights as well as the relative efficiency of the weighted correlation and of the MWPLE when compared to their non-weighted counterparts. Samples of size n ∈ {20, 50, 100, 250} are drawn from five different bivariate distributions that have a common correlation of ρ = 0.2. Figures are averaged over 10000 repetitions.

The average weight allocated to each population does not seem to vary with n. This unexpected behavior cannot hold for an arbitrarily large sample size as it would contradict Theorem 5.3. In Table 3.1, it was noticed that the weights seem to converge very slowly when the distributions are very close to each other. The same behavior may be observed here since the populations that were simulated share a common Spearman ρ, i.e. they are quite similar to each other.

5.9.3 Multivariate Normal

The multivariate Normal copula is parametrized by its correlation matrix. In the following simulations, we will use Normal distributions in 3 and 4 dimensions, hence depending respectively on 3 and 6 parameters. We simulate a scenario where the population of interest follows the target distribution and 3 other populations are available with the same underlying copula, but with measurement error that changes the correlations associated with their underlying copula.

The sample sizes simulated are unduly small as the MAMSE weights suffer from the curse of dimensionality: the measure M_k contains n_{1k}^p points for which operations of non-trivial cost must be performed.
The time required to calculate the equations to determine the MAMSE weights increases rapidly to the hardware limitations even for moderate n_{ik} and p. There are ways in which the calculation of the MAMSE weights might be accelerated. The integrals with respect to dM_k(u) could be evaluated on a grid sparser than that associated with M_k(u), whether it is through sampling or by a new definition of M_k(u). Proving that the algorithm and the weights converge properly with such practical simplifications is left to future work.

Maximum Pseudo-Likelihood Estimate and its Weighted Counterpart

To calculate the pseudo-likelihood of the data, we apply the inverse CDF of a standard Normal to the rescaled ranks. For this section, we choose to use Y*_{ijl} = (R_{ijl} − 1/2)/n_{ik}, for a fixed k, i = 1, …, m and j = 1, …, n_{ik}. We thus transform the data into

    Z_{ij} = [ Φ^{−1}(Y*_{ij1}), …, Φ^{−1}(Y*_{ijp}) ]^T.

By construction, the mean of the vectors Z_{ij}, j = 1, …, n_{ik}, is exactly 0 and the MLE of the marginal variances are

    (1/n_{ik}) Σ_{j=1}^{n_{ik}} Φ^{−1}((j − 1/2)/n_{ik})² ≈ ∫_0^1 Φ^{−1}(y)² dy = 1.

The actual value of the sample variance differs from 1 by a fixed amount that depends on the sample size, since the sample size determines the number of terms in the sum used above to approximate the corresponding integral.

The expression for the pseudo-likelihood based on the rescaled ranks Y*_{ij} is identical to the likelihood of a centered Normal based on the corresponding Z_{ij}. The maximum likelihood estimate of Σ_i, the covariance matrix in Population i, is given by

    Σ̂_i = (1/n_{ik}) Σ_{j=1}^{n_{ik}} Z_{ij} Z_{ij}^T.

The diagonal of Σ̂_i already contains a numerical approximation of 1. Define then Σ̂*_i = Σ̂_i, except for diag(Σ̂*_i) = 1, i.e. we replace the diagonal of Σ̂_i by true ones. We use Σ̂*_i as the maximum pseudo-likelihood estimate of Σ_i.
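The rank-to-normal-scores construction just described can be sketched as follows (Python; `NormalDist.inv_cdf` from the standard library plays the role of Φ^{-1}, and the data-generating model is an illustrative assumption, not thesis material):

```python
import numpy as np
from statistics import NormalDist

def mple_correlation(sample):
    """Maximum pseudo-likelihood estimate of a Normal-copula correlation matrix.

    Ranks are rescaled as Y_ij = (R_ij - 1/2) / n, mapped to normal scores
    Z_ij = Phi^{-1}(Y_ij); the MLE (1/n) sum Z Z^T is computed and its diagonal
    replaced by exact ones, as described in the text."""
    sample = np.asarray(sample, dtype=float)
    n, p = sample.shape
    ranks = np.argsort(np.argsort(sample, axis=0), axis=0) + 1.0
    y = (ranks - 0.5) / n
    inv = np.vectorize(NormalDist().inv_cdf)
    z = inv(y)
    sigma = z.T @ z / n
    np.fill_diagonal(sigma, 1.0)   # the diagonal is close to, but not exactly, 1
    return sigma

rng = np.random.default_rng(3)
true = np.array([[1.0, 0.4, 0.3],
                 [0.4, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])
data = rng.multivariate_normal(np.zeros(3), true, size=1000)
est = mple_correlation(data)
```

Because only the ranks are used, the same estimate is obtained under any strictly increasing transformation of the margins.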
The change of diagonal does not impact the positive definiteness of the matrix because the elements of the diagonal of Σ̂_i are equal and slightly smaller than 1, hence the replacement by 1 is equivalent to adding a small positive multiple of the identity matrix, which is itself positive definite. We prefer the approach above to brute-force numerical maximization of the pseudo-likelihood.

Now, turn to the weighted pseudo-likelihood, whose expression is equivalent to the weighted likelihood for centered Normal variates with common covariance matrix Σ based on the Z_{ij}:

    L(Σ) = ∏_{i=1}^m ∏_{j=1}^{n_{ik}} [ (2π)^{−p/2} |Σ|^{−1/2} exp(−½ Z_{ij}^T Σ^{−1} Z_{ij}) ]^{λ_{ik} n_{1k}/n_{ik}}

where |Σ| denotes the determinant of the matrix Σ. The corresponding weighted log-likelihood is

    ℓ(Σ) = Σ_{i=1}^m λ_{ik} (n_{1k}/n_{ik}) Σ_{j=1}^{n_{ik}} { −(p/2) log 2π − ½ log|Σ| − ½ Z_{ij}^T Σ^{−1} Z_{ij} }
         = const − (n_{1k}/2) log|Σ| − (n_{1k}/2) tr(Σ^{−1} A)    (5.11)

where

    A = Σ_{i=1}^m (λ_{ik}/n_{ik}) Σ_{j=1}^{n_{ik}} Z_{ij} Z_{ij}^T.

Theorem 5.10 (Wichern & Johnson (2002), Result 4.10) Given a p × p symmetric positive definite matrix B and a scalar b > 0, it follows that

    −b log|Σ| − ½ tr(Σ^{−1} B) ≤ −b log|B| + pb log(2b) − bp

for all positive definite p × p matrices Σ, with equality holding only for Σ = (1/2b)B.

The calculation of the MLE of the covariance matrix of a multivariate normal distribution typically uses a result akin to Theorem 5.10. Its proof can be found in different sources and is thus not reproduced here.

Applying Theorem 5.10 to Expression (5.11) with b = 1/2 and B = A, we conclude that the maximum weighted likelihood estimate of the covariance matrix Σ is given by

    Σ̂_λ = Σ_{i=1}^m (λ_{ik}/n_{ik}) Σ_{j=1}^{n_{ik}} Z_{ij} Z_{ij}^T = Σ_{i=1}^m λ_{ik} Σ̂_i.

To avoid brute-force optimization involving constraints on the positive definiteness of the estimate, we use the same approach as before and use

    Σ̂*_λ = Σ_{i=1}^m λ_{ik} Σ̂*_i

as the maximum weighted pseudo-likelihood estimate.

3-D Multivariate Normal

We suppose a multivariate normal model with measurement error.
Let

    Σ_A = [ 1    0.4  0.3        Σ_B = [ 1    0.8  0.6
            0.4  1    0.4                0.8  1    0.8
            0.3  0.4  1   ],             0.6  0.8  1   ],

    S_2 = [ 0.2  0    0          S_3 = [ 0.2  0.1  0          S_4 = [ 0.2  −0.1  0
            0    0.2  0                  0.1  0.2  0                  −0.1  0.2  0
            0    0    0.2 ],             0    0    0.2 ],             0     0    0.2 ].

We draw random vectors from multivariate Normals with means 0 and covariances that are defined in terms of the matrices above according to the formulas in the column Covariance of Table 5.8. Measurement errors affect the dependence structure of the populations. Let

    [ 1     Γ_i1  Γ_i2
      Γ_i1  1     Γ_i3
      Γ_i2  Γ_i3  1   ]

be the correlation matrix associated with the copula of Population i = 1, …, m. The actual value of Γ_i = [Γ_i1, Γ_i2, Γ_i3]^T is defined by the three parameters explicitly written in Table 5.8.

    Scenario   Pop.   Covariance     100 × Γ_i1  Γ_i2  Γ_i3
    A          1      Σ_A                 40      30    40
               2      Σ_A + S_2/4         38      29    38
               3      Σ_A + S_3/4         40      29    38
               4      Σ_A + S_4/4         36      29    38
    B          1      Σ_B                 80      60    80
               2      Σ_B + S_2           67      50    67
               3      Σ_B + S_3           75      50    67
               4      Σ_B + S_4           58      50    67

Table 5.8: Parameters of the simulation for 3-variate Normal variables. Population 1 is from the target distribution, but the other populations are affected by measurement errors.

Equal samples of size n ∈ {20, 35, 50} are drawn from each population. The maximum pseudo-likelihood estimate of Γ_1 = [Γ_11, Γ_12, Γ_13]^T and its MAMSE-weighted equivalent are calculated.

Table 5.9 shows the average weight allocated to each population. The inference relies strongly on the populations with measurement errors since they have a total weight of about 60% in all cases. The weights do not seem very affected by the sample size n. The small range of values for n might be partially responsible. Recalling a similar comment from the previous simulation, it is also possible that the measurement error model produced distributions that are similar enough to make the convergence of the weights rather slow.
                n     100 × λ_1   λ_2   λ_3   λ_4
    Scenario A  20        41       19    20    20
                35        41       20    20    20
                50        41       20    20    19
    Scenario B  20        37       21    22    21
                35        38       21    22    20
                50        39       20    22    19

Table 5.9: Average weight allocated to each of four 3-variate Normal distributions. Population 1 is observed from the target distribution, but the other populations contain measurement errors. The values are averaged over 5000 repetitions.

To evaluate the performance of the MWPLE, we compare its MSE to that of the MPLE based on Population 1 only. Let Γ̂_1 denote the maximum pseudo-likelihood estimate of Γ_1 and Γ̂_λ denote its weighted counterpart. We estimate the mean squared errors

    MSE(Γ̂_1) = E‖Γ̂_1 − Γ_1‖²   and   MSE(Γ̂_λ) = E‖Γ̂_λ − Γ_1‖²

with 5000 replicates, ‖·‖ denoting Euclidean distance. Table 5.10 shows the values of

    100 MSE(Γ̂_1)/MSE(Γ̂_λ)

as well as an equivalent ratio of the MSE for estimating each element of the vector Γ_1. Note that the standard deviation due to simulation error may reach almost 5 units under Scenario A and nearly 2.5 units under Scenario B.

    Efficiency      n     Γ_1   Γ_11  Γ_12  Γ_13
    Scenario A      20    243   240   260   231
                    35    255   262   261   241
                    50    257   264   266   243
    Scenario B      20     74    68   123    47
                    35     60    53   110    35
                    50     52    44   105    27

Table 5.10: Relative performance of the MWPLE when compared to the MPLE as measured by 100 MSE(Γ̂_1)/MSE(Γ̂_λ), or an equivalent ratio for the individual entries of Γ_1. Population 1 is observed from a 3-variate Normal distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions.

Under Scenario A, the performances of the MWPLE are very good as its MSE is less than half that of the MPLE. However, this excellent performance does not obtain under Scenario B, where the pseudo-likelihood is the winner. We can only speculate about the causes of these results at this stage, but at least two possible explanations should be explored in future research.
• The variance term in the definition of the MAMSE weights is a very rough estimate of the actual asymptotic variance of the empirical copula. This choice appears to be reasonable for mild correlations as in Scenario A, but may degenerate as the correlation increases, which is the case under Scenario B. A better penalty term can certainly be found.

• When the correlations get larger, the data points tend to cluster around a singular subset of the unit cube. The choice of the uniform measure M_k in the definition of the MAMSE weights might not be optimal in that situation.

Despite poor performances under Scenario B, the results of this section show that the MAMSE weights offer great potential to improve the quality of the inference, at least for moderate correlation and possibly for more cases if the MAMSE criterion is improved.

4-D Multivariate Normal

We suppose another multivariate normal model with measurement error. Let

    Σ_A = [ 1    0.4  0.3  0.2        Σ_B = [ 1    0.8  0.6  0.4
            0.4  1    0.4  0.3                0.8  1    0.8  0.6
            0.3  0.4  1    0.4                0.6  0.8  1    0.8
            0.2  0.3  0.4  1   ],             0.4  0.6  0.8  1   ],

    S_2 = 0.2 I_4,

    S_3 = [ 0.2  0.1  0    0          S_4 = [ 0.2  −0.1  0     0
            0.1  0.2  0    0                  −0.1  0.2  0     0
            0    0    0.2  0.1                0     0    0.2  −0.1
            0    0    0.1  0.2 ],             0     0   −0.1   0.2 ].

We draw random vectors from multivariate Normal distributions with means 0 and covariances that are defined in terms of the matrices above according to the formulas in the column Covariance of Table 5.11. Measurement errors affect the dependence structure of the populations. The correlation matrix of Population i is written

    [ 1     Γ_i1  Γ_i2  Γ_i3
      Γ_i1  1     Γ_i4  Γ_i5
      Γ_i2  Γ_i4  1     Γ_i6
      Γ_i3  Γ_i5  Γ_i6  1   ]

and depends on a vector of six parameters, Γ_i = [Γ_i1, Γ_i2, Γ_i3, Γ_i4, Γ_i5, Γ_i6]^T, whose values are explicitly written in Table 5.11.
    Scenario   Pop.   Covariance     100 × Γ_i1  Γ_i2  Γ_i3  Γ_i4  Γ_i5  Γ_i6
    A          1      Σ_A                 40      30    20    40    30    40
               2      Σ_A + S_2/4         38      29    19    38    29    38
               3      Σ_A + S_3/4         40      29    19    38    29    40
               4      Σ_A + S_4/4         36      29    19    38    29    36
    B          1      Σ_B                 80      60    40    80    60    80
               2      Σ_B + S_2           67      50    33    67    50    67
               3      Σ_B + S_3           75      50    33    67    50    75
               4      Σ_B + S_4           58      50    33    67    50    58

Table 5.11: Parameters of the simulations for 4-variate Normal variables. Population 1 comes from the target distribution, but the other populations are affected by measurement errors.

We simulate 5000 samples of size n = 20 from each of the populations and calculate both the maximum pseudo-likelihood estimate of Γ_1 and its weighted counterpart. The choice of such a small sample size is dictated by the curse of dimensionality. The measure M_k used to average the MAMSE criterion puts an equal weight on n_{1k}^p points. Doubling the sample size thus multiplies the run time by a factor greater than 16.

To accelerate the calculation of the multivariate MAMSE weights, one could try to approximate the integrals of the MAMSE criterion by choosing a sparser M_k. Using dĈ_{1k} instead of dM_k is another option to consider as it would not suffer from the curse of dimensionality in terms of run time. It might however be too sparse to detect differences between the copulas. In particular, it is not clear if the uniform convergence of the MAMSE-weighted empirical copula would hold with such a definition of the MAMSE weights. These investigations are however left to future research.

Table 5.12 shows the average weight allocated to each of the populations. Less than half of the weight is allocated to Population 1, meaning that the contribution of the other populations is quite substantial.

                n     100 × λ_1   λ_2   λ_3   λ_4
    Scenario A  20        46       18    18    18
    Scenario B  20        41       20    20    19

Table 5.12: Average weight allocated to each of four 4-variate Normal distributions. Population 1 is observed from the target distribution and the other populations contain measurement errors.
The values are averaged over 5000 repetitions.

Table 5.13 compares the performance of the MPLE to that of its weighted counterpart. The ratios in the table are calculated as those in the last section. The standard deviation of the error due to simulation is less than 4 units in that table.

    Efficiency    n    Γ1   Γ11  Γ12  Γ13  Γ14  Γ15  Γ16
    Scenario A   20   235  232  234  259  225  234  229
    Scenario B   20    98   58  118  214   59  130   62

Table 5.13: Relative performance of the MWPLE when compared to the MPLE for 4-variate Normal distributions. Population 1 is observed from the target distribution and the other populations contain measurement errors. The values are averaged over 5000 repetitions.

Once again, the improvement from using the weighted pseudo-likelihood is very substantial under Scenario A, but it suffers a small loss under Scenario B. The results tend to confirm that the proposed implementation of the MAMSE weights performs better for moderate correlations. Nonetheless, the potential for improvement is clear and, in this case at least, it outweighs the performance losses that occur under Scenario B.

The simulations in this section show that using the MAMSE weights can improve the mean squared error in many cases, but as with most methods, it is not uniformly better over all scenarios that were considered.

Chapter 6

Summary and Future Research

This chapter presents a list of the most important original contributions contained in this thesis and sketches some of the directions that will be explored in future research.

6.1 Summary of the Work

The heuristic justification of the weighted likelihood presented in Section 2.2 provided the genesis of this thesis. That interpretation does not appear to have been exploited previously in the literature.
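To make the weighted likelihood concrete, here is a small numerical sketch. It is not code from the thesis: the samples, the fixed weights, and the choice of averaging each population's log-likelihood over its sample (so the weights compare populations rather than individual observations) are all assumptions made for this illustration. For a N(theta, 1) model, the maximizer of the weighted log-likelihood is then simply the lambda-weighted average of the sample means, which the grid search confirms.

```python
import numpy as np

# Toy sketch of the weighted likelihood: each population's (average)
# log-likelihood is discounted by a weight lambda_k before being summed.
rng = np.random.default_rng(0)
samples = [rng.normal(0.0, 1.0, 25),    # target population
           rng.normal(0.2, 1.0, 100),   # similar population, slightly shifted
           rng.normal(-0.1, 1.0, 50)]   # another similar population
lam = np.array([0.6, 0.25, 0.15])       # likelihood weights (fixed here)

def weighted_loglik(theta):
    # sum_k lambda_k * mean_i log N(x_ki; theta, 1), dropping constants
    return sum(l * np.mean(-0.5 * (x - theta) ** 2)
               for l, x in zip(lam, samples))

# Closed-form maximizer: the lambda-weighted average of the sample means.
theta_closed = float(np.sum(lam * np.array([x.mean() for x in samples])))

# Check numerically with a grid search over [-1, 1].
grid = np.linspace(-1.0, 1.0, 2001)
theta_grid = float(grid[np.argmax([weighted_loglik(t) for t in grid])])
# theta_grid matches theta_closed up to the grid resolution (0.001)
```

The weights here are fixed by hand; the point of the MAMSE construction discussed next is to choose them from the data.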
With these heuristics in mind, intuition suggests choosing likelihood weights that make a mixture of the m distributions close to the target distribution, but less variable than its natural estimate. The MAMSE weights are a specific implementation of that idea.

The MAMSE weights are not only likelihood weights. As a consequence of their definition, they can also be used to define a mixture of estimates of the CDF.

The general idea of the MAMSE weights involves using an estimate of the CDF of each population. Specific properties are studied for three different kinds of data. These three cases are built from nonparametric estimates of the CDF, meaning that the resulting MAMSE weights are nonparametric as well.

Univariate Data

We use the empirical distribution function to define the MAMSE criterion and prove the following properties.

• Invariance of the MWLE to a reparametrization of the model.

• The strong uniform convergence of the MAMSE-weighted empirical CDF.

• A weighted strong law of large numbers.

• The strong consistency of the MWLE with MAMSE weights.

In addition, simulations show that using the MAMSE weights allows for superior performance in many scenarios.

Censored Data

We use the Kaplan-Meier estimate to accommodate right-censored data. The main contributions are:

• the weighted Kaplan-Meier estimate, a fully nonparametric estimate of the survival function that uses data from the m populations at hand, and

• the uniform convergence thereof.

Simulations show possible improvements for inference on finite samples when using the MAMSE-weighted Kaplan-Meier estimate.

Multivariate Data Through Copulas

We treat the dependence structure of multivariate data through copulas. The empirical copula, an estimate based on the ranks of the data, is used to that end. Important contributions are as follows.
• Definition of the weighted empirical copula and a proof of its strong uniform convergence to the target copula.

• A weighted strong law of large numbers for bounded multivariate rank order statistics.

• A weighted coefficient of correlation based on MAMSE weights; a proof that this estimate is strongly consistent.

• The weighted pseudo-likelihood for fitting a family of copulas on data from m populations.

• A proof that the MWPLE is a strongly consistent estimate of the parameter of interest.

Here again, simulations showed that the methods proposed are powerful tools when appropriately applied.

The sum of these results shows that the MAMSE weights succeed in trading bias for precision for different types of data.

6.2 Future Research

The idea of MAMSE weights and the intuitive understanding of the weighted likelihood open many directions for future research. We list some of these below.

Principle of the MAMSE Weights

• Using different functions to combine bias and variance. For instance, it was recently brought to my attention that statistics of the form ∫ {F(x) − G(x)}² Ψ(x) dF(x) have been studied for goodness-of-fit. The choice Ψ(x) = 1, which corresponds to the bias term in the MAMSE criterion, is also known as the Cramér-von Mises test statistic. However, it seems that the choice Ψ(x) = 1/[F(x){1 − F(x)}], which corresponds to the Anderson-Darling goodness-of-fit test, typically offers a more powerful alternative. Hence, we intend to explore the properties of a different MAMSE criterion where the bias term adopts the form of the Anderson-Darling statistic.

• Using parametric estimates of the distributions in the general formulation of the MAMSE weights (rather than nonparametric estimates) may allow one to account for nuisance parameters or for covariates.

Univariate Data

• The rate of convergence of the MAMSE weights needs further study.
• It would be useful to describe the asymptotic distribution of a MAMSE-weighted sum of variables, and even more useful to determine the asymptotic distribution of the MWLE.

• The weights implied by the Stein-type estimate of Ahmed (2000) could be studied as possible likelihood weights. Alternatively, the approach of Ahmed (2000) may be extended to the empirical distribution functions to define Stein-like likelihood weights.

Censored Data

• MAMSE weights could be used as likelihood weights. The fact that the weighted Kaplan-Meier estimate may not converge on the complete real line is a challenge. Defining the MAMSE weights based on parametric models rather than on the Kaplan-Meier estimates may help in building weights that will make the MWLE consistent.

• The idea of the MAMSE weights might be extendable to a Cox proportional hazards model, something that warrants further investigation given the great importance of that model.

Copulas

• Defining and studying the properties of a weighted version of Kendall's τ.

• Exploring different definitions of the MAMSE weights that suffer less from the curse of dimensionality.

• Studying the effect of using different estimates for the variance of the empirical copula; in particular, verifying the hypothesis that this may improve the performance of the MWPLE in the presence of strong dependence.

• Using the weighted pseudo-likelihood in other contexts where the weights need not be estimated from the data, for instance if data were available from populations with the same dependence structure but different marginals (e.g. different currencies), or if m sets of rankings are the only information available.

• Proposing weighted coefficients of correlation under scenarios where the weights do not need to be determined from the data.
• Using the MWPLE and univariate MWLEs in order to build an estimate of a multivariate distribution function by combining the estimated copula with estimated marginal distributions. An important advantage of such an approach is the possibility of using a different number of populations for each of the marginal distributions. Suppose that data are available from some studies that did not record all the variables of interest. The information from these studies could be used for inference on the marginals, but only the complete data could be used to determine the dependence structure.

• One approach to building multivariate distributions may be obtained by extending the work of Cox & Reid (2004), where the likelihoods would be replaced by MAMSE-weighted likelihoods. Such an extension may however require a definition of MAMSE weights for multivariate distributions that are not copulas.

Bootstrap

In this document, we propose three weighted nonparametric asymptotically unbiased empirical estimates of a target distribution. Resampling techniques such as the bootstrap could be developed from these weighted mixtures of empirical functions. To use such methods with confidence, we should first demonstrate that the bootstrap estimators obtained that way are consistent.

One advantage of bootstrapping from a MAMSE-weighted mixture of empirical functions is the ability to draw from a larger number of values, resulting in a bootstrap sample with fewer ties. One could even try including a distribution as an external population, i.e. to allow for an infinite sample, hence further reducing the number of ties in the bootstrap sample. The Stein-like quantile estimate of Ahmed (2000) implies an empirical distribution that could also be considered for the definition of an alternative bootstrap method.
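The resampling scheme just described can be written down directly. The sketch below is a hypothetical illustration, not thesis code: the samples and the mixing weights are invented, and in practice the weights would be MAMSE weights computed from the data.

```python
import numpy as np

# Sketch of bootstrapping from a weighted mixture of empirical distributions:
# pick a population with probability lambda_k, then draw one of its points
# uniformly at random.
rng = np.random.default_rng(42)
samples = [rng.normal(0.0, 1.0, 30),    # small target sample
           rng.normal(0.1, 1.0, 120),   # larger samples from
           rng.normal(-0.2, 1.0, 80)]   # similar populations
lam = [0.5, 0.3, 0.2]                   # mixing weights (fixed here)

def mixture_bootstrap(samples, lam, size, rng):
    pops = rng.choice(len(samples), size=size, p=lam)        # which population
    return np.array([rng.choice(samples[k]) for k in pops])  # which point

boot = mixture_bootstrap(samples, lam, size=30, rng=rng)
# The resample is drawn from 230 distinct values rather than 30,
# which is the fewer-ties advantage mentioned above.
```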
Foundations of Statistics

In addition to these directions for research, this work may have shed a new light on likelihood methods for some readers; writing it definitely changed my understanding of this important statistical tool. In particular, the likelihood can be seen as an estimate of the entropy between the fitted model and the EDF. The convergence properties of the CDF can then be used to show the consistency of the MLE. By showing that the MAMSE-weighted EDF converges, we could also show that a MWLE based on MAMSE weights is consistent.

A question arises from that interpretation of the likelihood: Is the likelihood a good method because the empirical distribution function is a good estimate, or is the likelihood a more general principle? Can any model be fitted successfully using the likelihood? Using the likelihood for a linear model based on the normal distribution, for instance, is definitely reasonable, especially when we consider that maximizing the normal likelihood is equivalent to a form of least squares. But what about complex models, possibly involving nonlinear covariates? Should we avoid using the likelihood blindly, or is it safe in all situations?

Chapter 7

Conclusion

The weighted likelihood as proposed by Hu, Wang and Zidek suggests that information from populations similar to that of interest should not be ignored, but rather incorporated into the inference. When I started working on the weighted likelihood, the question that everybody asked first was: "But how do you choose the weights?" The MAMSE weights provide one answer to that question, which seemed to be of most concern to those who were not acquainted with the weighted likelihood.

This thesis does not say how to choose optimal likelihood weights. In fact, the answer to that question will depend on the purpose of the inference, more particularly on the loss function that one wants to use.
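Under squared-error loss, for instance, even a crude fixed weight on a similar population can pay off. The toy illustration below is not an experiment from the thesis; all sample sizes, shifts, and weights are invented for the sake of the example.

```python
import numpy as np

# Toy bias-precision trade-off: estimate the mean of a small target sample
# (true mean 0) either alone, or blended with a fixed weight with a larger
# sample from a slightly shifted population.
rng = np.random.default_rng(7)
reps, lam1 = 2000, 0.5
mse_plain = mse_weighted = 0.0
for _ in range(reps):
    x1 = rng.normal(0.0, 1.0, 10)     # target population
    x2 = rng.normal(0.15, 1.0, 200)   # similar population, slightly biased
    mse_plain += x1.mean() ** 2
    mse_weighted += (lam1 * x1.mean() + (1 - lam1) * x2.mean()) ** 2
mse_plain /= reps      # roughly Var(x1_bar) = 1/10
mse_weighted /= reps   # roughly bias^2 + variance = 0.075^2 + 0.02625
```

The weighted estimator accepts a bias of 0.075 but has a much smaller variance, so its mean squared error here is roughly a third of that of the target-only estimator.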
This thesis however offers a practical and effective solution for determining likelihood weights. These weights are shown to produce consistent estimates, yet they successfully trade bias for precision in many cases that were simulated.

The different cases of the MAMSE weights studied in detail in this document are all based on nonparametric estimates of the distribution functions. Hence, the proposed MAMSE weights are themselves nonparametric. For all cases studied, the MAMSE-weighted empirical distributions are shown to converge uniformly to the target distribution without assuming any particular shape for, and under minimal regularity conditions on, the other populations. Hence, the estimates suggested are asymptotically unbiased even though they take into consideration data from populations whose distributions may differ from the target.

With MAMSE weights, the maximum weighted likelihood estimate and the maximum weighted pseudo-likelihood estimate converge strongly under conditions that are akin to those imposed on their unweighted counterparts. Finally, simulations show that the MAMSE weights can indeed improve the quality of inference on finite samples in many situations.

The results presented in this document show that when different sources of data are available, it is possible to use them to improve the quality of the inference. Hence it is possible with the MAMSE weights to use the data from m populations in order to evaluate their level of dissimilarity and then to use that information to successfully trade bias for precision.

Bibliography

E. S. Ahmed, A. I. Volodin & A. A. Hussein (2005). Robust weighted likelihood estimation of exponential parameters, IEEE Transactions on Reliability, 54, 389-395.

E. S. Ahmed (2000). Stein-type shrinkage quantile estimation, Stochastic Analysis and Applications, 18, 475-492.

H. Akaike (1977). On entropy maximization principle, Applications of Statistics, 27-42.

D. Blest (2000).
Rank correlation - an alternative measure, Australian and New Zealand Journal of Statistics, 42, 101-111.

N. E. Breslow & J. Crowley (1974). A large sample study of the life table and product limit estimates under random censorship, The Annals of Statistics, 2, 437-453.

K. Chen & S.-H. Lo (1997). On the rate of uniform convergence of the product-limit estimator: Strong and weak laws, The Annals of Statistics, 25, 1050-1087.

U. Cherubini, E. Luciano & W. Vecchiato (2004). Copula Methods in Finance. Wiley Finance, Wiley, Hoboken.

D. G. Clayton (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence, Biometrika, 65, 141-151.

D. R. Cox & N. Reid (2004). A note on pseudolikelihood constructed from marginal densities, Biometrika, 91, 729-737.

P. Deheuvels (1979). La fonction de dependance empirique et ses proprietes: un test non parametrique d'independance, Academie royale de Belgique, Bulletin de la classe des sciences, 65, 274-292.

R. Durrett (2005). Probability: Theory and Examples, Third Edition. Duxbury Advanced Series, Thomson, Belmont.

B. Efron (1967). The two sample problem with censored data, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California, 4, 831-853.

J.-D. Fermanian, D. Radulovic & M. Wegkamp (2004). Weak convergence of empirical copula processes, Bernoulli, 10, 847-860.

A. Foldes & L. Rejto (1981). Strong uniform consistency for nonparametric survival curve estimators from randomly censored data, The Annals of Statistics, 9, 122-129.

M. J. Frank (1979). On the simultaneous associativity of F(x,y) and x + y - F(x,y), Aequationes Mathematicae, 19, 194-226.

C. Genest, K. Ghoudi & L.-P. Rivest (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions, Biometrika, 82, 543-552.

C. Genest & J. MacKay (1986). Copules archimediennes et familles de lois bidimensionnelles dont les marges sont donnees, Canadian Journal of Statistics, 14, 145-159.

C. Genest & J.-F. Plante (2003). On Blest's measure of rank correlation, The Canadian Journal of Statistics, 31, 35-52.

F. Hu (1994). Relevance Weighted Smoothing and a New Bootstrap Method, unpublished doctoral dissertation, Department of Statistics, The University of British Columbia.

F. Hu & J. V. Zidek (1993). A Relevance Weighted Nonparametric Quantile Estimator. Technical report no. 134, Department of Statistics, The University of British Columbia, Vancouver.

F. Hu & J. V. Zidek (2001). The relevance weighted likelihood with applications, Empirical Bayes and Likelihood Inference, 211-235.

F. Hu & J. V. Zidek (2002). The weighted likelihood, The Canadian Journal of Statistics, 30, 347-371.

H. Joe (1997). Multivariate Models and Dependence Concepts. Chapman & Hall, London.

R. A. Johnson & D. W. Wichern (2002). Applied Multivariate Statistical Analysis, Fifth Edition, Prentice Hall, Upper Saddle River.

E. L. Kaplan & P. Meier (1958). Nonparametric estimation from incomplete observations, Journal of the American Statistical Association, 53, 457-481.

A. M. Krieger & D. Pfeffermann (1992). Maximum likelihood estimation from complex sample surveys, Survey Methodology, 18, 225-239.

D. G. Luenberger (2003). Linear and Nonlinear Programming, Second Edition, Kluwer, Norwell.

M. Markatou, A. Basu & B. Lindsay (1997). Weighted likelihood estimating equations: The discrete case with applications to logistic regression, Journal of Statistical Planning and Inference, 57, 215-232.

National Center for Health Statistics (1997). U.S. Decennial Life Tables for 1989-91, vol. 1, no. 1, Hyattsville, Maryland.

R. B. Nelsen (1999). An Introduction to Copulas, Lecture Notes in Statistics No. 139, Springer, Berlin.

D. Oakes (1986). Semiparametric inference in a model for association in bivariate survival data, Biometrika, 73, 353-361.

J. Pinto da Costa & C. Soares (2005). A weighted rank measure of correlation, Australian and New Zealand Journal of Statistics, 47, 515-529.

J.-F. Plante (2002). A propos d'une mesure de correlation des rangs de Blest, unpublished Master's thesis, Departement de mathematiques et de statistique, Universite Laval.

A. Sklar (1959). Fonctions de repartition a n dimensions et leurs marges, Publications de l'Institut de statistique de l'Universite de Paris, 8, 229-231.

Statistics Canada (2006). Life Tables, Canada, Provinces and Territories. Reference Period: 2000 to 2002, catalogue number 84-537-XIE, Ottawa, Canada.

H. Tsukahara (2005). Semiparametric estimation in copula models, The Canadian Journal of Statistics, 33, 357-375.

C. van Eeden & J. V. Zidek (2004). Combining the data from two normal populations to estimate the mean of one when their means difference is bounded, Journal of Multivariate Analysis, 88, 19-46.

A. Wald (1949). Note on the consistency of the maximum likelihood estimate, The Annals of Mathematical Statistics, 20, 595-601.

X. Wang (2001). Maximum Weighted Likelihood Estimation, unpublished doctoral dissertation, Department of Statistics, The University of British Columbia.

X. Wang, C. van Eeden & J. V. Zidek (2004). Asymptotic properties of maximum weighted likelihood estimators, Journal of Statistical Planning and Inference, 119, 37-54.

X. Wang & J. V. Zidek (2005). Selecting likelihood weights by cross-validation, The Annals of Statistics, 33, 463-501.

B. B. Winter, A. Foldes & L. Rejto (1978). Glivenko-Cantelli theorems for the PL estimate, Problems of Control and Information Theory, 7, 213-225.