STATISTICAL MODELLING AND INFERENCE FOR MULTIVARIATE AND LONGITUDINAL DISCRETE RESPONSE DATA

by

James Jianmeng Xu

B.Sc. Sichuan University
M.Sc. Université Laval

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
September 1996
© James Jianmeng Xu, 1996

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

This thesis presents research on modelling, statistical inference and computation for multivariate discrete data. I address the problem of how to systematically model multivariate discrete response data, including binary, ordinal categorical and count data, and how to carry out statistical inference and computations. To this end, I relate the multivariate models to similar univariate models already widely used in applications and to some multivariate models that hitherto were available but scattered in the literature, and I introduce new classes of models. The main contributions of this thesis to multivariate discrete data analysis are in several distinct directions.

In modelling of multivariate discrete data, we propose two new classes of multivariate parametric discrete models: multivariate copula discrete (MCD) models and multivariate mixture discrete (MMD) models. Numerous new multivariate discrete models are introduced through these two classes, and several multivariate discrete models which have appeared in the literature are unified by them. With appropriate choices of copulas, these two classes of models allow the marginal parameters and dependence parameters to vary with covariates in a natural way. By using special dependence structures, the models can be used for longitudinal data with short time series or for repeated measures data. As a result, the scope of multivariate discrete data analysis is substantially broadened.

In statistical inference and computation for multivariate models, we propose the inference functions of margins (IFM) approach, in which each inference function is a likelihood equation for some marginal distribution of a multivariate distribution. Examples where the approach applies are the multivariate logit model with copulas having certain closure properties and the multivariate probit model for binary data. This general approach makes the estimation of parameters for the multivariate models computationally feasible. The corresponding asymptotic theory, the estimation of standard errors by the Godambe information matrix as well as by the jackknife method, and the efficiency of the IFM approach relative to the full multivariate likelihood approach are studied. Particular attention has been given to models with special dependence structure (e.g. the copula dependence structure is exchangeable or AR(1) type if applicable), and efficient parameter estimation schemes based on IFM (the weighting approach and the pool-marginal-likelihood approach) are developed. We also give detailed assessments of the efficiency of the GEE approach for estimating regression parameters in multivariate models; this is lacking in the literature.
Detailed data analyses of existing data sets are provided to give concrete applications of the multivariate models and the statistical inference procedures in this thesis.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Basic Notation and Definitions
Acknowledgements

1 Introduction
  1.1 Multivariate discrete response data
  1.2 Review of literature and research motivation
  1.3 Statistical modelling
  1.4 Overview of thesis

2 Foundation: models, statistical inference and computation
  2.1 Multivariate copulas and dependence measures
    2.1.1 Multivariate distribution functions
    2.1.2 Multivariate copulas and Fréchet bounds
    2.1.3 Dependence measures
    2.1.4 Examples of multivariate copulas
    2.1.5 CUOM, CUOM(k), MUBE, PUBE and MPME concepts
  2.2 Multivariate discrete models
    2.2.1 Multivariate copula discrete models
    2.2.2 Multivariate mixture discrete models
    2.2.3 Examples of MCD and MMD models
    2.2.4 Some properties of MCD and MMD models
  2.3 Inference functions of margins
    2.3.1 Approaches for fitting multivariate models
    2.3.2 Inference functions for multiple parameters
    2.3.3 Inference function of margins
  2.4 Parameter estimation with IFM and asymptotic results
    2.4.1 Models with no covariates
    2.4.2 Models with covariates
    2.4.3 Asymptotic results for the models assuming a joint distribution for response vector and covariates
  2.5 The Jackknife approach for the variance of IFME
    2.5.1 Jackknife approach for models with no covariates
    2.5.2 Jackknife for a function of $\theta$
    2.5.3 Jackknife approach for models with covariates
  2.6 Estimation for models with parameters common to more than one margin
    2.6.1 Weighting approach
    2.6.2 The pool-marginal-likelihoods approach
    2.6.3 Examples
  2.7 Numerical methods for the model fitting
  2.8 Summary

3 Modelling of multivariate discrete data
  3.1 Multivariate copula discrete models for binary data
    3.1.1 Multivariate logit model
    3.1.2 Multivariate probit model
  3.2 Comparison of models
  3.3 Multivariate copula discrete models for count data
    3.3.1 Multivariate Poisson model
    3.3.2 Multivariate generalized Poisson model
    3.3.3 Multivariate negative binomial model
    3.3.4 Multivariate logarithmic series model
  3.4 Multivariate copula discrete models for ordinal data
    3.4.1 Multivariate logit model
    3.4.2 Multivariate probit model
    3.4.3 Multivariate binomial model
  3.5 Multivariate mixture discrete models for binary data
    3.5.1 Multivariate probit-normal model
    3.5.2 Multivariate Bernoulli-Beta model
    3.5.3 Multivariate logit-normal model
  3.6 Multivariate mixture discrete models for count data
    3.6.1 Multivariate Poisson-lognormal model
    3.6.2 Multivariate Poisson-gamma model
    3.6.3 Multivariate negative-binomial mixture model
    3.6.4 Multivariate Poisson-inverse Gaussian model
  3.7 Application to longitudinal and repeated measures data
  3.8 Summary

4 The efficiency of IFM approach and the efficiency of jackknife variance estimate
  4.1 The assessment of the efficiency of IFM approach
  4.2 Analytical assessment of the efficiency
  4.3 Efficiency assessment through simulation
  4.4 IFM efficiency for models with special dependence structure
  4.5 Jackknife variance estimate compared with Godambe information matrix
  4.6 Summary

5 Modelling, data analysis and examples
  5.1 Some issues on modelling
    5.1.1 Data analysis cycle
    5.1.2 Model selection
    5.1.3 Diagnostic checking
    5.1.4 Testing the dependence structure
  5.2 Data analysis examples
    5.2.1 Example with multivariate/longitudinal binary response data
    5.2.2 Example with multivariate/longitudinal ordinal response data
    5.2.3 Example with multivariate count response data
  5.3 Summary

6 GEE methodology and its comparison with ML and IFM approaches
  6.1 Generalized estimating equations
  6.2 GEE in multivariate analysis
  6.3 GEE compared with the ML and IFM approaches
  6.4 A combination of GEE and IFM estimation approach
  6.5 Summary

7 Some further research topics

Bibliography

A Maple programs

List of Tables

1.1 The structure of general multivariate discrete data
4.1 Efficiency assessment with MCD model for binary data: $d = 3$, $z = (0, 0, 0)'$, $N = 1000$
4.2 Efficiency assessment with MCD model for binary data: $d = 3$, $\beta_0 = (0.7, 0.5, 0.3)'$, $\beta_1 = (0.5, 0.5, 0.5)'$, $x_{ij}$ discrete, $N = 1000$
4.3 Efficiency assessment with MCD model for binary data: $d = 3$, $\beta_0 = (0.7, 0.5, 0.3)'$, $\beta_1 = (0.5, 0.5, 0.5)'$, $x_{ij} = x_i$ continuous, $N = 100$
4.4 Efficiency assessment with MCD model for binary data: $d = 4$, $\alpha_{12} = \alpha_{13} = \alpha_{14} = \alpha_{23} = \alpha_{24} = \alpha_{34} = 1.3863$, $N = 1000$
4.5 Efficiency assessment with MCD model for binary data: $d = 4$, $\alpha_{12} = \alpha_{23} = \alpha_{34} = 2.1972$, $\alpha_{13} = \alpha_{24} = 1.5163$, $\alpha_{14} = 1.1309$, $N = 1000$
4.6 Efficiency assessment with MCD model for ordinal data: $d = 3$, $z(1) = (-0.5, -0.5, -0.5)'$, $z(2) = (0.5, 0.5, 0.5)'$, $N = 1000$
4.7 Efficiency assessment with MCD model for ordinal data: $d = 3$, $z(1) = (-0.5, 0, -0.5)'$, $z(2) = (0.5, 1, 0.5)'$, $N = 1000$
4.8 Efficiency assessment with MCD model for ordinal data: $d = 4$, $z(1) = (-0.5, -0.5, -0.5, -0.5)'$, $z(2) = (0.5, 0.5, 0.5, 0.5)'$, $\alpha_{12} = \alpha_{13} = \alpha_{14} = \alpha_{23} = \alpha_{24} = \alpha_{34} = 1.3863$, $N = 100$
4.9 Efficiency assessment with MCD model for ordinal data: $d = 4$, $z(1) = (-0.5, -0.5, -0.5, -0.5)'$, $z(2) = (0.5, 0.5, 0.5, 0.5)'$, $\alpha_{12} = \alpha_{23} = \alpha_{34} = 2.1972$, $\alpha_{13} = \alpha_{24} = 1.5163$, $\alpha_{14} = 1.1309$, $N = 100$
4.10 Efficiency assessment with MCD model for ordinal data: $d = 4$, $z(1) = (-0.5, 0, -0.5, 0)'$, $z(2) = (0.5, 1, 0.5, 1)'$, $\alpha_{12} = \alpha_{13} = \alpha_{14} = \alpha_{23} = \alpha_{24} = \alpha_{34} = 1.3863$, $N = 100$
4.11 Efficiency assessment with MCD model for ordinal data: $d = 4$, $z(1) = (-0.5, 0, -0.5, 0)'$, $z(2) = (0.5, 1, 0.5, 1)'$, $\alpha_{12} = \alpha_{23} = \alpha_{34} = 2.1972$, $\alpha_{13} = \alpha_{24} = 1.5163$, $\alpha_{14} = 1.1309$, $N = 100$
4.12 Efficiency assessment with MCD model for count data: $d = 3$, $\beta_0 = (1, 1, 1)'$ and $\beta_1 = (0.5, 0.5, 0.5)'$, $x_{ij}$ discrete, $N = 1000$
4.13 Efficiency assessment with MCD model for count data: $d = 3$, $(\beta_1, \beta_2, \beta_3) = (1.6094, 1.0986, 1.6094)$, $N = 1000$
4.14 Efficiency assessment with MCD model for count data: $d = 4$, $(\beta_1, \beta_2, \beta_3, \beta_4) = (1.6094, 1.0986, 1.6094, 1.6094)$, $N = 1000$
4.15 Efficiency assessment with MCD model for count data: $d = 4$, $(\beta_1, \beta_2, \beta_3, \beta_4) = (1.3863, 0.6931, 1.6094, 2.0794)$, $N = 1000$
4.16 Efficiency assessment with multivariate Poisson-Morgenstern-gamma model, $d = 3$
4.17 Efficiency assessment with multivariate Poisson-Morgenstern-gamma model, $d = 4$
4.18 Efficiency assessment with multivariate Poisson-Morgenstern-gamma model, $d = 5$
4.19 Efficiency assessment with special dependence structure: $d = 3$, $z = (0.5, 0.5, 0.5)'$
4.20 Efficiency assessment with special dependence structure: $d = 3$, $z = (0.5, 1.0, 1.5)'$
4.21 Efficiency assessment with special dependence structure: $d = 3$, $\alpha_0 = (0.5, 0.5, 0.5)'$, $\alpha_1 = (1, 1, 1)'$
4.22 Efficiency assessment with special dependence structure: $d = 3$, $\alpha_0 = (0.5, 0.5, 0.5)'$, $\alpha_1 = (1, 0.5, 1.5)'$
4.23 Efficiency assessment with special dependence structure: $d = 4$, $z = (0.5, 0.5, 0.5, 0.5)'$
4.24 Efficiency assessment with special dependence structure: $d = 4$, $z = (0.5, 0.8, 1.2, 1.5)'$
4.25 Efficiency assessment with special dependence structure: $d = 4$, $z = (0.5, 0.5, 0.5, 0.5)'$
4.26 Efficiency assessment with special dependence structure: $d = 4$, $z = (0.5, 0.8, 1.2, 1.5)'$
4.27 Comparison of estimates of standard error: (i) true, (ii) Godambe, (iii) jackknife with $g$ groups; $N = 500$, $n = 1000$
4.28 Comparison of estimates of standard error: (i) true, (ii) Godambe, (iii) jackknife with $g$ groups; $N = 500$, $n = 500$
5.1 Six Cities Study: Percentages for binary variables
5.2 Six Cities Study: Frequencies of the response vector (Age 9, 10, 11, 12)
5.3 Six Cities Study: Pairwise log odds ratios for Age 9, 10, 11, 12
5.4 Six Cities Study: Estimates of marginal regression parameters for multivariate logit model
5.5 Six Cities Study: Estimates of dependence regression parameters for multivariate logit model with multinormal copula
5.6 Six Cities Study: Comparisons of AIC values and $X^2$ values from various submodels of models (1a) and (2)
5.7 Six Cities Study: Comparisons of AIC values from various models
5.8 Six Cities Study: Comparisons of $X^2$ values from various models
5.9 Six Cities Study: Estimates (SE) of dependence regression parameters from the submodel l.md.g.wn of various models
5.10 Six Cities Study: Estimates of $\Pr(Y = y)$ from various submodels of model (1a)
5.11 Six Cities Study: Estimates of $\Pr(Y = y)$ from the submodel l.md.g.wn of various models
5.12 Six Cities Study: Observed frequencies in comparison with estimates of $\Pr(Y = y \mid x)$ from various models, $x$ = (City, Smoking9, Smoking10, Smoking11, Smoking12)
5.13 Six Cities Study: Estimates of $\Pr(Y = y \mid x)$ from the submodel l.md.g.wn of various models, $x$ = (City, Smoking9, Smoking10, Smoking11, Smoking12)
5.14 TMI Accident Study: Stress levels for 4 years following accident at TMI. Responses with non-zero frequencies
5.15 TMI Accident Study: Univariate marginal (and relative) frequencies
5.16 TMI Accident Study: Pairwise gamma measures for Year 1979, 1980, 1981, 1982
5.17 TMI Accident Study: Estimates of univariate marginal regression parameters for multivariate logit models
5.18 TMI Accident Study: Estimates of dependence regression parameters for multivariate logit model with multinormal copula
5.19 TMI Accident Study: Comparisons of AIC values and $X^2$ values from various submodels of models (1a) and (2)
5.20 TMI Accident Study: Comparisons of AIC values and $X^2$ values from the submodel l.md.g.wn of various models
5.21 TMI Accident Study: Estimates (SE) of dependence regression parameters from the submodel l.md.g.wn of various models
5.22 TMI Accident Study: Comparisons of values from the submodels l.md.g.wn and l.md.a.wc of model (1a)
5.23 TMI Accident Study: Estimates of $\Pr(Y = y)$ and frequencies from the submodels l.md.g.wn and l.md.a.wc of model (1a)
5.24 TMI Accident Study: Estimates of $\Pr(Y = y \mid x)$ and frequencies from the submodels l.md.g.wn and l.md.a.wc of model (1a)
5.25 Bacteria Counts: Bacterial counts by 3 samplers in 50 sterile locations
5.26 Bacteria Counts: Univariate marginal frequencies
5.27 Bacteria Counts: Pairwise gamma measures for samplers 1, 2, 3
5.28 Bacteria Counts: Moment estimates of means, variances, correlations and other summary statistics of responses
5.29 Bacteria Counts: Estimates of marginal parameters for multivariate Poisson model
5.30 Bacteria Counts: Estimates of dependence regression parameters for multivariate Poisson model with multinormal copula
5.31 Bacteria Counts: Comparisons of AIC values from various submodels of multivariate Poisson model with multinormal copula
5.32 Bacteria Counts: Estimates of marginal parameters from multivariate Poisson-lognormal model
5.33 Bacteria Counts: Estimates of dependence parameters from multivariate Poisson-lognormal model
5.34 Bacteria Counts: Comparisons of AIC values from various submodels of multivariate Poisson-lognormal model
5.35 Bacteria Counts: Estimates of means, variances and correlations of responses from the submodel md.g of multivariate Poisson-lognormal model
6.1 GEE assessment: $d = 2$, $\beta_0 = 0.5$, $\beta_1 = 1$, $x_{ij}$ discrete, $\rho = 0.9$, $N = 1000$
6.2 GEE assessment: $d = 2$, $\beta_0 = 0.5$, $\beta_1 = 1$, $x_{ij}$ continuous, $\rho = 0.9$, $N = 1000$
6.3 GEE assessment: $d = 2$, $\beta_0 = -0.5$, $\beta_1 = 0.5$, $\beta_2 = 1$, $w_i$, $x_{ij}$ discrete, $\rho = 0.9$, $N = 1000$
6.4 GEE assessment: $d = 2$, $\beta_0 = 0.5$, $\beta_1 = 1$, $x_{ij}$ discrete, $\rho = 0.5$, $N = 1000$
6.5 GEE assessment: $d = 3$, $z = 0.5$, latent exchangeable, $\rho = 0.9$, "working" exchangeable, $N = 1000$
6.6 GEE assessment: $d = 3$, $z = 1.5$, latent exchangeable, $\rho = 0.9$, "working" exchangeable, $N = 1000$
6.7 GEE assessment: $d = 3$, $z = 0.5$, latent AR(1), $\rho = 0.9$, "working" AR(1), $N = 1000$
6.8 GEE assessment: $d = 3$, $z = 1.5$, latent AR(1), $\rho = 0.9$, "working" AR(1), $N = 1000$
6.9 GEE assessment: $d = 3$, $\beta_0 = 0.5$, $\beta_1 = 1$, $x_{ij}$ discrete, latent exchangeable, $\rho = 0.9$, "working" exchangeable, $N = 1000$
6.10 GEE assessment: $d = 3$, $\beta_0 = 0.5$, $\beta_1 = 1$, $x_{ij}$ discrete, latent AR(1), $\rho = 0.9$, "working" AR(1), $N = 1000$
6.11 GEE assessment: $d = 4$, $z = 0.5$, latent exchangeable, $\rho = 0.9$, "working" exchangeable, $N = 1000$
6.12 GEE assessment: $d = 4$, $z = 0.5$, latent AR(1), $\rho = 0.9$, "working" AR(1), $N = 1000$
6.13 Estimates of $\nu$ and $\eta$ under different variance specification
6.14 GEE assessment: $(\nu, \eta) = (0.99995, 0.01)$, $E(Y) = 2.718282$, $\mathrm{Var}(Y) = 2.719$, $n = 1000$, $N = 500$
6.15 GEE assessment: $(\nu, \eta) = (-0.1, 1.48324)$, $E(Y) = 2.718282$, $\mathrm{Var}(Y) = 62.02$, $n = 1000$, $N = 100$
6.16 GEE assessment: $\alpha = 0.5$, $\beta = 0.5$, $\eta = 0.01$, $n = 1000$, $N = 500$
6.17 A comparison of IFM to GEE with $R_i(\alpha)$ given

List of Figures

4.1 Trivariate probit, exchangeable: The efficiency of $\rho$ from the margin (1,2) (or (1,3), (2,3))
4.2 Trivariate probit, AR(1): The weight $u_1$ (or $u_3$) versus $\rho$ (solid line) and the weight $u_2$ versus $\rho$ (dash line)
4.3 Trivariate probit, AR(1): The efficiency of $\rho_p$
4.4 Trivariate probit, AR(1): (a) The efficiency of $\rho$ from the margins (1,2) or (2,3); (b) The efficiency of $\rho$ from the margin (1,3)
4.5 Trivariate Morgenstern-binary model: Relative efficiency of IFM approach versus IFS approach
4.6 Trivariate Morgenstern-binary model: (a) Ordered relative efficiency values of IFM approach versus IFS approach; (b) A histogram of the efficiency value $r_g$
4.7 Trivariate Morgenstern-binary model: Relative efficiency of IFM approach versus IFS approach when $u_1 = u_2 = u_3$ and $\theta_{12} = \theta_{13} = \theta_{23}$
4.8 Trivariate Morgenstern-binary model: Relative efficiency of IFM approach versus IFS approach when $u_1 = u_2 = u_3$, $\theta_{12} = \theta_{23} = \theta$ and $\theta_{13} = \theta^2$
4.9 Trivariate normal-binary model: Relative efficiency of IFM approach versus IFS approach
4.10 Trivariate normal-binary model: (a) Ordered relative efficiency of IFM approach versus IFS approach; (b) A histogram of the efficiency $r_g$
4.11 Trivariate normal-binary model: Relative efficiency of IFM approach versus IFS approach when $u_1 = u_2 = u_3$ and $\rho_{12} = \rho_{13} = \rho_{23}$
5.1 Bacteria Counts: Residuals from the submodel md.g of model (1)
5.2 Bacteria Counts: Residuals from the submodel md.g of trivariate Poisson-lognormal model

Basic Notation and Definitions

The following notation and definitions are used throughout the thesis.
1. cdf stands for cumulative distribution function; pdf stands for probability density function; pmf stands for probability mass function; Pr stands for probability of.

2. rv stands for random variable or random vector, depending on the context; iid stands for independent and identically distributed.

3. BVN and MVN are the abbreviations for bivariate normal and multivariate normal respectively.

4. CUOM is the abbreviation for closure under taking of margins. MUBE is the abbreviation for model univariate and bivariate expressible. PUBE is the abbreviation for parameter univariate and bivariate expressible. MPME is the abbreviation for model parameters marginally expressible. (Definitions are given in Section 2.1.)

5. ML and MLE are the abbreviations for maximum likelihood and maximum likelihood estimates or estimation. An MLE of $\theta$ will usually be denoted by $\hat{\theta}$.

6. IFM and IFME are the abbreviations for inference functions of margins and inference functions of margins estimates or estimation. An IFME of $\theta$ will usually be denoted by $\tilde{\theta}$. IFS is the abbreviation for inference functions of scores.

7. MCD and MMD are the abbreviations for multivariate copula discrete and multivariate mixture discrete.

8. The symbol "•" indicates the end of a definition, a statement of assumptions, a proof, a result, or an example.

9. For a vector or matrix, the transpose is indicated with a superscript $T$ or $'$, depending on convenience in the context.

10. All vectors are column vectors; hence transposed vectors such as $\mathbf{X}'$, $\mathbf{x}'$ (or $\mathbf{X}^T$, $\mathbf{x}^T$) are row vectors.

11. $R^k = \{\mathbf{x} : \mathbf{x} = (x_1, \ldots, x_k)',\ -\infty < x_j < \infty \text{ for } j = 1, \ldots, k\}$ denotes the $k$-dimensional Euclidean space.

12. $d$ denotes the dimension of the multivariate response vector or multivariate distribution.

13. The boldfaced Roman upper case letter $\mathbf{Y} = (Y_1, \ldots, Y_d)'$, usually with subscripts, denotes a response random vector, and $\mathbf{y}$ denotes the observed value of this response vector. A vector of explanatory variables or covariates is usually denoted by $\mathbf{x}$ or $\mathbf{w}$.

14. Boldfaced Roman upper case letters $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ and so on, usually with subscripts, are used for (random) vectors; boldfaced Roman lower case letters $\mathbf{x}, \mathbf{y}, \mathbf{z}$ and so on are used for the observed vector values.

15. Roman upper case letters $X, Y, Z$ and so on, usually with subscripts, are used for random variables; Roman lower case letters $x, y, z$ and so on are used for the observed values.

16. Greek boldfaced lower case letters, often with subscripts, are used for a collection of parameters of families of distributions, e.g. $\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{\delta}$. They are in vector format. Greek lower case letters, often with subscripts, are used for parameters of families of distributions, e.g. $\alpha, \beta, \theta, \delta$.

17. Greek upper case letters $\Theta$, $\Sigma$ are used for a set of parameters (often dependence parameters) in a multivariate family; they are mostly in matrix format.

18. $\Re$ is the symbol for the parameter space, usually $\Re \subset R^k$ for some $k$.

19. Script Roman upper case letters $\mathcal{F}$ and $\mathcal{G}$ are used for classes of functions or distribution families.

20. $F$, $G$, $H$ are the symbols for a (multivariate) cdf.

21. For a $d$-variate cdf $F$, the set of its marginal distributions is denoted by $\{F_S : S \in S_d\}$, where $S_d$ is the set of non-empty subsets of $\{1, \ldots, d\}$. For a specific $S$, the subscript is written without braces, e.g. $F_1, \ldots, F_d$, $F_{12}$, $F_{123}$, etc.

22. We define the pdf or pmf of $\mathbf{Y}$ at $\mathbf{y} = (y_1, \ldots, y_d)$ as $P_{1 \cdots d}(y_1 \cdots y_d)$ or simply $P(y_1 \cdots y_d)$, with the corresponding $j$th marginal $P_j(y_j)$, the bivariate $(j, k)$ marginal $P_{jk}(y_j y_k)$, and so on. We also write $P(y_1 \cdots y_d; \theta)$ to denote that the pdf or pmf of $\mathbf{Y}$ depends on a parameter (or parameter vector) $\theta$.

23. The frequency of observing a particular outcome $(y_1, \ldots, y_d)'$ in a data set is denoted by $n_{1 \cdots d}(y_1 \cdots y_d)$ or simply $n(y_1 \cdots y_d)$. The frequency corresponding to the $j$th marginal outcome $y_j$ is $n_j(y_j)$, that corresponding to the $(j, k)$ bivariate marginal outcome $y_j$ and $y_k$ is $n_{jk}(y_j y_k)$, and so on.

24. $\sum_{\{y_j\}}$ means the summation over all possible different values of $y_j$. $\sum_{\{x_1 \le y_1, \ldots, x_d \le y_d\}}$ means the summation over all possible different vector values of $\mathbf{x} = (x_1, \ldots, x_d)'$ which satisfy $\{x_1 \le y_1, \ldots, x_d \le y_d\}$. $\sum_{\{y_1, \ldots, y_d\} \setminus \{y_j\}}$ means the summation over all possible different vector values $\mathbf{y} = (y_1, \ldots, y_d)'$ with the $j$th component absent, and so on.

25. $U(a, b)$ denotes the uniform distribution on the interval $[a, b]$. $N(\mu, \sigma^2)$ denotes the univariate normal distribution with mean $\mu$ and variance $\sigma^2$. $N_d(\boldsymbol{\mu}, \Sigma)$ denotes the $d$-variate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$; $\Phi_d(\mathbf{x}; \boldsymbol{\mu}, \Sigma)$ (or $\Phi_d(\mathbf{x})$) denotes the corresponding cdf and $\phi_d(\mathbf{x}; \boldsymbol{\mu}, \Sigma)$ (or $\phi_d(\mathbf{x})$) the pdf.

26. The partial derivative of a scalar function $\psi(\theta)$, $\partial \psi(\theta) / \partial \theta$, is the $q \times 1$ vector

$$\frac{\partial \psi(\theta)}{\partial \theta} = \left( \frac{\partial \psi}{\partial \theta_1}, \ldots, \frac{\partial \psi}{\partial \theta_q} \right)',$$

where $\theta_1, \ldots, \theta_q$ are the components of the vector $\theta$.

27. The partial derivative of a vector function $\Psi = (\psi_1(\theta), \ldots, \psi_r(\theta))'$, $\partial \Psi / \partial \theta'$, is the $r \times q$ matrix

$$\frac{\partial \Psi}{\partial \theta'} = \begin{pmatrix} \dfrac{\partial \psi_1}{\partial \theta_1} & \cdots & \dfrac{\partial \psi_1}{\partial \theta_q} \\ \vdots & & \vdots \\ \dfrac{\partial \psi_r}{\partial \theta_1} & \cdots & \dfrac{\partial \psi_r}{\partial \theta_q} \end{pmatrix},$$

where $\theta_1, \ldots, \theta_q$ are the components of the vector $\theta$.

Acknowledgements

I wish to record my warmest thanks to my thesis supervisor, Professor Harry Joe, for his continuous encouragement and for numerous suggestions and discussions during the development of this thesis. I acknowledge, with gratitude, the work of Harry Joe on multivariate copulas and dependence, and especially his invaluable book draft on "Multivariate Models and Dependence Concepts"; this has had a significant impact on the development of this thesis and served as a foundation for it. There are also many computer programming techniques and computations involved in this work where Harry Joe's tremendous experience was a crucial help.
I would like to thank Professors John Petkau and Michael Schulzer for serving on my supervisory committee. In addition, I thank Professor John Petkau for helpful discussions, his valuable comments on the thesis, as well as his encouragement and support. Also, many thanks go to Professors Nancy Heckman, James Zidek and Ruben Zamar for their encouragement and support. Special thanks to Rinaldo Artes, Billy Ching and John Smith for their encouragement and support. I would also like to thank my fellow students, friends, professors and staff members in the Department of Statistics at UBC for providing a pleasant and stimulating research environment.

Finally, I would like to thank Harry Joe for his financial support through an NSERC grant. The financial support from the Department of Statistics is acknowledged with great appreciation.

Chapter 1

Introduction

This chapter starts by discussing the structure of the multivariate data for which we are going to build appropriate multivariate models. We motivate our thesis research by reviewing and critiquing the relevant literature on the modelling of multivariate discrete data.

This chapter is organized in the following way. In section 1.1, we introduce the multivariate data structure for which we are going to develop multivariate models. In this section, we discuss in detail multivariate binary, multivariate ordinal categorical and multivariate count data. The models developed in this thesis are general in nature, but the illustrative examples will be mainly based on the aforementioned three types of multivariate discrete data. In section 1.2, we briefly summarize and critique the relevant statistical literature on the modelling of data of the types described in section 1.1, point out the inadequacies thereof, and thus motivate our thesis research. In section 1.3, we outline some desirable features of multivariate models and briefly discuss my views on statistical modelling.
Section 1.4 provides an overview of the thesis.

1.1 Multivariate discrete response data

The data structure

Many data sets consist of discrete variables. Familiar examples of such variables are religion, nationality, level of education, degree of disability, attitude to a social issue, and the number of job changes for an individual during a certain period of time. These variables are categorical or count variables; the categorical ones may be unordered (religion, nationality) or ordered (degree of disability, attitude to a social issue). In real life, the situation is more complicated because the discrete data are often multivariate and the multiple measurements may be interdependent in some way. The dependence may be general or special. The multivariate data structure can be further complicated by having missing data, random covariates and so on. In this thesis, we shall concentrate mainly on multivariate discrete response data, with or without covariates. The general multivariate discrete data set of interest is given in Table 1.1.

Table 1.1: The structure of general multivariate discrete data

  d-variate response          margin-independent covariates   margin-dependent covariates
  $y_{11} \cdots y_{1d}$      $x_{11} \cdots x_{1p}$          $z_{111} \cdots z_{11p_1}, \ldots, z_{1d1} \cdots z_{1dp_d}$
  $\vdots$                    $\vdots$                        $\vdots$
  $y_{i1} \cdots y_{id}$      $x_{i1} \cdots x_{ip}$          $z_{i11} \cdots z_{i1p_1}, \ldots, z_{id1} \cdots z_{idp_d}$
  $\vdots$                    $\vdots$                        $\vdots$
  $y_{n1} \cdots y_{nd}$      $x_{n1} \cdots x_{np}$          $z_{n11} \cdots z_{n1p_1}, \ldots, z_{nd1} \cdots z_{ndp_d}$

The data structure in Table 1.1 consists of three basic parts: the $d$-dimensional discrete response observations $\mathbf{y}_i = (y_{i1}, \ldots, y_{id})'$; a margin-independent covariate vector of $p$ components $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})'$, that is, a covariate vector which is constant across margins; and $d$ margin-dependent (or margin-specific) covariate vectors $\mathbf{z}_{i1}, \ldots, \mathbf{z}_{id}$, where $\mathbf{z}_{ij} = (z_{ij1}, \ldots, z_{ijp_j})'$ is a vector of $p_j$ components for the $j$th margin, $j = 1, \ldots, d$, $i = 1, \ldots, n$. In longitudinal or repeated measures settings, the margins might be defined by successive points in time.
In these situations, we can call the margin-independent covariates time-independent, that is, constant across times, and the margin-dependent covariates time-dependent. The response vector $\mathbf{y}_i$ can consist of measures on $d$ variates with general or special dependence structure, such as multiple measures from a human, a litter of animals, a piece of equipment, a geographical location, or any other unit for which the observations are a collection of related measures. The measures can be spatial or temporal.

One way to make inferences from such a data structure is through a multivariate parametric model. (Nonparametric multivariate inference requires much more data than parametric multivariate inference.) The development and analysis of suitable models for the multivariate data in Table 1.1 are the main objectives of this thesis.

Some typical multivariate discrete data

Binary data. Binary data arise when measurements can take only one of two values. Conventionally these are represented by 0 and 1, with 1 usually representing the occurrence of an event and 0 representing non-occurrence. For example, the reaction of a living organism to some material, often observed as presence or absence of the reaction (usually called quantal response), is binary. Alive throughout a specified period or died during the period, won or lost, success or failure in a specified task, gender, and agree or disagree are all examples of sources of binary data. Multivariate binary data are frequent in statistical applications. Whenever multivariate data are coded, for each dimension, as one of two mutually exclusive categories, the data are multivariate binary. In more complicated situations, covariates can be included when one is considering binary response data. An example of multivariate binary data is the Six Cities Study analyzed in subsection 5.2.1.

Ordinal categorical data. An ordinal variable is one that has a natural ordering of its possible values, but for which the distances between the values are undefined, such as a four-category Likert scale. Ordinal categorical (or ordered categorical) response data, often accompanied by a set of covariates, arise frequently in a wide variety of experimental studies, such as in bioassay, epidemiology, econometrics and medicine. For example, in medicine it may be possible to classify a patient as, say, severely, moderately or mildly ill when a more exact measurement of the severity of the disease is not possible; the covariates may be age, gender and so on. With ordinal variables, the categories are known to have an order, but knowledge of the scale is insufficient to consider them as forming a metric. We may treat the ordinal categories simply as nominal categories, that is, as unordered categorical measures, but by doing so the valuable information in the order is lost. Taking the order into account is therefore important for optimal information extraction. For an ordinal variable, it is often reasonable to assume that the ordered categories correspond to non-overlapping and exhaustive intervals of the real line. Multivariate ordinal data are frequent in applications. Whenever multivariate response variables are each ordinal categorical, the data are multivariate ordinal categorical. More complicated situations include covariates for each of the response variables. A case of multivariate ordinal data from the Three Mile Island (TMI) nuclear power plant accident study can be found in subsection 5.2.2.

Count data. Data in the form of counts appear regularly in life. In the simplest case, the number of occurrences of some phenomenon on each unit is counted. Because no explanatory variable (e.g. time, treatment) distinguishes among these observed events, they can be aggregated as single numbers, the counts.
Examples of count data are the counts of pest eggs on plant leaves, the counts of bacteria in different kinds of bacteria colonies, the number of organic cells with a fixed number of chromosome interchanges produced by X-ray irradiation, etc. Consul (1989) discussed many count data examples in a variety of situations, including home injuries and strikes in industries. Other examples include the number of units of different commodities purchased by consumers over a period of time, the number of times authors are cited over a number of years, spatial patterns of plants, the number of television commercials, or the number of speakers in a meeting. Multivariate count data are also frequent in applications. Whenever multivariate response variables are each counts in nature, the data are multivariate count data. More complicated situations also include covariates for the response variables. An example of multivariate count data can be found in subsection 5.2.3.

1.2 Review of literature and research motivation

For the data types we have seen in section 1.1, one question is how to build a model or a probability distribution as an approximation to a stochastic phenomenon of multivariate nature and, based on the available data, to estimate the distribution and make inferences or predictions. For this purpose, the construction of an appropriate probability distribution or statistical model in accordance with the available data generated by the stochastic phenomenon is essential. Models for univariate discrete data have been studied extensively. The well-known generalized linear models for a univariate variable are such examples (McCullagh and Nelder 1989, Nelder and Wedderburn 1972). However, general studies on multivariate models for the type of data outlined in Table 1.1 are lacking in the statistical literature.
One difficulty with the analysis of nonnormal multivariate data (including continuous and discrete data) has been the lack of rich classes of models such as the multivariate Gaussian. Some isolated studies on the modelling of a particular data set or under a particular multivariate setting of the type of data in Table 1.1 have appeared in the literature. These studies can be classified in general as being based either on a completely specified probability model or on a partially specified probability model. We overview some of them here, and point out their drawbacks or weaknesses.

Completely specified probability models

Exponential family: Following Cox (1972), the probability distribution for a binary random vector $Y$ can be represented as a saturated log-linear model

$$P(y) = \exp\Big(u_0 + \sum_{j} u_j y_j + \sum_{j<k} u_{jk} y_j y_k + \cdots + u_{12\cdots d}\, y_1 \cdots y_d\Big), \qquad (1.1)$$

where $u_0$ is a normalizing constant. The $2^d - 1$ parameters $u_1, \ldots, u_d, \ldots, u_{12}, u_{13}, \ldots, u_{(d-1)d}, \ldots, u_{12\cdots d}$ vary independently from $-\infty$ to $\infty$. Expressions similar to (1.1) can also be found in Zhao and Prentice (1990), Liang et al. (1992) and Fitzmaurice and Laird (1993). The representation (1.1) is not closed under taking of margins (see Section 2.1 for a definition). In fact, if we write $P(y_1 y_2) = \exp(u_0^* + \sum_{j=1}^{2} u_j^* y_j + u_{12}^* y_1 y_2)$, then $u_0^*$, $u_j^*$ and $u_{12}^*$ must depend on all the parameters $u_0, u_j, u_{jk}, \ldots, u_{12\cdots d}$. This fact makes the interpretation of the parameters $u_j, u_{jk}, \ldots, u_{12\cdots d}$ very difficult, and it is not clear how covariates could be included. For the general form, there are too many parameters.

Bahadur representation: Bahadur (1961) gave a representation of the distribution for a binary random vector $Y$, in terms of the moments:

$$P(y) = \prod_{j=1}^{d} P_j(1)^{y_j} P_j(0)^{1-y_j} \Big[1 + \sum_{j<k} \rho_{jk} e_j e_k + \sum_{j<k<l} \rho_{jkl} e_j e_k e_l + \cdots + \rho_{12\cdots d}\, e_1 e_2 \cdots e_d\Big], \qquad (1.2)$$

where $e_j = (y_j - P_j(1))/\sqrt{P_j(1)P_j(0)}$, and $\rho_{jk} = E(e_j e_k)$, ..., $\rho_{12\cdots d} = E(e_1 e_2 \cdots e_d)$.
This representation has the closure property under taking of margins, but the parameterization may not be desirable since the $\rho_{jk}$'s and the parameters of higher order are constrained by the marginal probabilities (see Prentice 1988), and the extension to include covariates may be difficult. For the general form, there are also too many parameters for the model to be useful.

Multivariate Poisson convolution-closed model: Teicher (1954) and Mahamunulu (1967) discussed a class of multivariate Poisson convolution-closed models. For example, a trivariate Poisson convolution-closed model has the stochastic representation

$$(Y_1, Y_2, Y_3) \stackrel{d}{=} (X_1 + X_{12} + X_{13} + X_{123},\; X_2 + X_{12} + X_{23} + X_{123},\; X_3 + X_{13} + X_{23} + X_{123}), \qquad (1.3)$$

where $X_1, X_2, X_3, X_{12}, X_{13}, X_{23}, X_{123}$ are independent Poisson rv's with parameters $\lambda_1, \lambda_2, \lambda_3, \lambda_{12}, \lambda_{13}, \lambda_{23}, \lambda_{123}$ respectively. These models may be suitable for counts in overlapping regions or time periods if the Poisson process is a reasonable model of the underlying count process. The model has a closure property under taking of margins, but it is not "model univariate-bivariate expressible" (see Chapter 2 and Example 2.5 for further explanation of this expression), and it can only accommodate multivariate count data with a limited range of dependence (positive dependence only).

Exchangeable mixture models: Prentice (1986) gives an expression for a joint distribution of a binary random vector $Y$, with

$$P(y) = \frac{\prod_{j=0}^{y_+ - 1} (p + \gamma j)\, \prod_{j=0}^{d - y_+ - 1} (1 - p + \gamma j)}{\prod_{j=0}^{d-1} (1 + \gamma j)}, \qquad (1.4)$$

where $0 < p < 1$, $y_+ = y_1 + \cdots + y_d$ and $\gamma \ge -(d-1)^{-1} \min\{p, 1-p\}$. The model (1.4) is an extension of a beta-Bernoulli model derived from the mixture model $P(y) = \int_0^1 p^{y_+} (1-p)^{d-y_+} g(p)\, dp$, where $y_+ = \sum_{j=1}^{d} y_j$ and $g(p)$ is the density of a Beta($\alpha$, $\beta$) distribution. This model implies equicorrelation of $Y$ with correlation parameter $(1 + \gamma^{-1})^{-1}$. The representation (1.4) has the closure property under taking of margins, but it is limited to equicorrelation of response variables.
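As a rough numerical check on the exchangeable family (1.4), the sketch below assumes the product form of Prentice's (1986) extended beta-binomial probabilities, $P(y) = \prod_{j=0}^{y_+-1}(p+\gamma j)\prod_{j=0}^{d-y_+-1}(1-p+\gamma j)/\prod_{j=0}^{d-1}(1+\gamma j)$. It enumerates all $2^d$ binary vectors to confirm that the probabilities sum to one and that the pairwise correlation is $(1+\gamma^{-1})^{-1} = \gamma/(1+\gamma)$. The function names and parameter values are illustrative assumptions, not part of the thesis.

```python
from itertools import product


def exchangeable_pmf(y, p, gamma):
    """Probability of the binary vector y under the assumed form of (1.4)."""
    d, s = len(y), sum(y)
    num = 1.0
    for j in range(s):          # factors for the components equal to 1
        num *= p + gamma * j
    for j in range(d - s):      # factors for the components equal to 0
        num *= 1.0 - p + gamma * j
    den = 1.0
    for j in range(d):
        den *= 1.0 + gamma * j
    return num / den


def total_probability(d, p, gamma):
    """Sum of the pmf over all 2^d binary vectors (should be 1)."""
    return sum(exchangeable_pmf(y, p, gamma) for y in product((0, 1), repeat=d))


def pairwise_correlation(d, p, gamma):
    """Correlation of (Y_1, Y_2), computed by brute-force enumeration."""
    e1 = e12 = 0.0
    for y in product((0, 1), repeat=d):
        pr = exchangeable_pmf(y, p, gamma)
        e1 += y[0] * pr
        e12 += y[0] * y[1] * pr
    # By exchangeability E(Y_2) = E(Y_1) and Var(Y_1) = Var(Y_2) = e1 (1 - e1).
    return (e12 - e1 * e1) / (e1 * (1.0 - e1))


if __name__ == "__main__":
    d, p = 4, 0.3
    for gamma in (0.5, -0.05):  # -0.05 lies above the lower bound -(d-1)^{-1} min(p, 1-p) = -0.1
        print(gamma, total_probability(d, p, gamma),
              pairwise_correlation(d, p, gamma), gamma / (1.0 + gamma))
```

The second value of `gamma` illustrates the small amount of negative dependence the family allows.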
Joe (1996) discusses the range of negative dependence for the family (1.4). Discussions of extensions to incorporate covariates appeared in Prentice (1986) and Connolly and Liang (1988).

Multivariate probit model: A $d$-variate probit model for binary data is

$$Y_j = I(Z_j \le z_j), \quad j = 1, \ldots, d, \qquad (1.5)$$

where $I(\cdot)$ is the indicator function, $Z = (Z_1, \ldots, Z_d)' \sim N(0, \Theta)$, and $\Theta = (\theta_{jk})$ is a correlation matrix. The $z_j$'s are often referred to as cut-off points. Ashford and Sowden (1970) used the bivariate probit model for binary data to describe a coal miner's status of development of breathlessness (present or absent) and wheeze (present or absent) as a function of the miner's age. Anderson and Pemberton (1985) used a trivariate probit model for the analysis of an ornithological data set on the three aspects of colouring of blackbirds. A general introduction to the multivariate probit model is Lesaffre and Molenberghs (1991). The multivariate probit model has many nice properties, such as closure under taking of margins, model univariate-bivariate expressibility, and a wide range of dependence. Maximum likelihood estimation (MLE) has been considered, but it becomes computationally more difficult as $d$ increases. New approaches to estimation and inference are explored in this thesis.

Multivariate Poisson-lognormal model: Aitchison and Ho (1989) studied a model for a count random vector $Y$, with

$$P(y_1 \cdots y_d) = \int_0^\infty \cdots \int_0^\infty \prod_{j=1}^{d} f_j(y_j \mid \lambda_j)\, g(\lambda)\, d\lambda_1 \cdots d\lambda_d, \qquad (1.6)$$

where $f_j(y_j \mid \lambda_j)$ is a Poisson pmf with parameter $\lambda_j$ and $g(\lambda)$ is the density of a multivariate lognormal distribution. This model also has many nice properties, such as closure under taking of margins, model univariate-bivariate expressibility, and a wide range of dependence. Again the MLE is computationally difficult.

Molenberghs-Lesaffre model: A model that may be suitable for binary and ordinal data is studied in Molenberghs and Lesaffre (1994). This model can accommodate general dependence structure from the Molenberghs-Lesaffre construction (Joe 1996) with bivariate copulas, such as in Joe (1993). The multivariate objects in the Molenberghs-Lesaffre construction have not been proved to be proper multivariate copulas, but they can be used for parameter values that lead to positive orthant probabilities for the multivariate binary vector.

Other miscellaneous models (some for time series or longitudinal data):

- Kocherlakota and Kocherlakota (1992) provide a good summary of bivariate discrete distributions (including bivariate Poisson, bivariate negative binomial, etc.).

- Markov chain of first order for binary data with $\Pr(Y_{j+1} = 1 \mid Y_j = 0) = P_{j,j+1}(01)$ and $\Pr(Y_{j+1} = 1 \mid Y_j = 1) = P_{j,j+1}(11)$. It can be generalized to higher order Markov chains. Some of $P_{j,j+1}(01)$, $P_{j,j+1}(11)$ and $\Pr(Y_{j+1} = 1)$ (but not all three) could be replaced by logistic functions to incorporate covariates. Examples are in Darlington and Farewell (1992), Muenz and Rubinstein (1985), Zeger, Liang and Self (1985) and Gardner (1990).

- Poisson AR(1) time series, as in Al-Osh and Alzaid (1987) and McKenzie (1988). The bivariate Poisson margin (for consecutive $Y_j$'s) from this Poisson AR(1) time series is the same as a bivariate margin of (1.3).

- Negative binomial AR(1) time series, as in McKenzie (1986), Al-Osh and Aly (1992) and Joe (1996b). The model of Al-Osh and Aly has a range of serial correlation that depends on the parameters of the negative binomial distribution (and hence is not very flexible).

- When the binary or count variables are observed sequentially in time, one could use a model consisting of a product of a sequence of logit models for binary data (logit of $Y_t$ given $Y_1, \ldots, Y_{t-1}, x$) and of Poisson models for counts (Poisson of $Y_t$ given $Y_1, \ldots, Y_{t-1}, x$). This is proposed in Bonney (1987) and Fahrmeir and Kaufmann (1987). The advantage of such models is that one can use widely available software for univariate logit and Poisson models.
One disadvantage of such models is that it would be difficult to predict $Y_t$ based on $x$ alone.

- Meester and MacKay (1994) studied a class of multivariate exchangeable models with the multivariate Frank copula. The models have limited application since only exchangeable dependence structures are considered.

- Glonek and McCullagh (1995) have a bivariate model similar to the Molenberghs-Lesaffre model in that the dependence parameter is linear in covariates and the related bivariate copula is the Plackett copula. Their multivariate extension appears to overlap with that of Molenberghs and Lesaffre (1994), but with a different model construction approach.

Partially specified probability models: generalized estimating equations approach

General application of many of the preceding models was impeded, however, by their mathematical complications and by the computational difficulty usually encountered in multivariate analysis. A different body of methodology, called the generalized estimating equations (GEE) approach, has been developed based on moment-type methods which do not require explicit distributional assumptions. References for this methodology are Liang and Zeger (1986), Zeger and Liang (1986), Zhao and Prentice (1990) and Fitzmaurice and Laird (1993), among others. However, the GEE approach has several disadvantages, mainly related to modelling, inference, diagnostic checking and interpretation. Furthermore, the GEE approach does not apply directly to multivariate ordinal data. A detailed study of the GEE approach, including a discussion of some of its shortcomings, can be found in Chapter 6.

Research motivation

In summary, although some approaches have appeared in the literature to model specific instances of the data in Table 1.1, there are at least two major features lacking in the statistical literature in terms of modelling multivariate discrete data:

1. A unified, systematic approach to multivariate discrete modelling, with classes of models for multivariate discrete data where some models in the class have nice properties (see section 1.3 for some desirable features of multivariate models).

2. A model-fitting strategy with computationally feasible parameter estimation and inference procedures, with good asymptotic properties and efficiency.

This thesis contributes to filling these two gaps in multivariate discrete (more generally, nonnormal) data modelling. We study systematic approaches to the modelling of multivariate discrete response data with covariates. The response types include binary, ordinal categorical and count. Statistical inference and computational aspects of the multivariate nonnormal models are studied.

1.3 Statistical modelling

We discuss here two issues in statistical modelling. One is what we mean by statistical modelling in general. The other is the construction of multivariate models with desirable properties. Other aspects of statistical modelling as part of data analysis will be discussed in Chapter 5.

In practice, with a finite sample of data, it is impossible to capture exactly the possibly complex multivariate system which generated the data. The problem can be even more complicated than modelling a system; it might be that the system itself does not exist and is forever a hypothetical one. In statistical modelling, the specification of a particular model for the data is always somewhat arbitrary; what we hope is that the stochastic models we use reflect relatively well the randomness or uncertainty in the system, as well as the significant features of the systematic relationships. The statistical models should be considered as a means of providing statistical inference; they should be viewed as tentative approximations to the truth.
The most important consideration in using any statistical method (or model) is whether the method (or model) can give insight into important practical problems. All models are subjective to some degree. Often the modeller chooses those elements of the system under investigation that should be included in the model as well as the mode of representation. Modelling should not be a substitute for thinking and will only be effective if combined with an interest in and knowledge of the system being modelled.

The construction of multivariate nonnormal models is not easy. For modelling purposes, we would like to have parametric families of models that (i) cover the different types of dependence, (ii) have interpretable parameters, and (iii) apply to multivariate discrete data. Some desirable properties of a multivariate model are the following:

1. The model is natural. That is, the model is interpretable in terms of mixture, stochastic or latent variable representations, etc.

2. Parameters in the model are interpretable. A parametric family has extra interpretability if some of the parameters can be identified as dependence or multivariate parameters, such that some range of the parameters corresponds to positive dependence and some corresponds to negative dependence, and it is desirable for the amount of dependence to increase as the parameters increase.

3. The model allows a wide and flexible range of dependence, with interpretable dependence parameters which are flexible enough for the needs of different applications.

4. The model extends naturally to include covariates for the univariate marginal parameters as well as the dependence parameters, in the sense that after the extension we still have a probabilistic model and proper interpretations.

5. The model has marginal expressibility properties, such as model parameters being expressible in terms of the parameters of univariate and bivariate distributions, and a closure property under the extension from univariate to bivariate and to higher order margins.

6. The model has a simple form, preferably with closed form representations of the cdf and density, or at least is easy to use for computation, estimation and inference.

Generally, it is not possible to achieve all of these desirable properties simultaneously, in which case one must decide the relative importance of the properties and sacrifice one or more of them. There is no known multivariate family having all of these properties, but the family of multivariate normal distributions may be the closest. Multinormal distributions satisfy (1), (2), (3), (4) and (5) but not (6), since the multinormal has no closed form cdf. The mixture of max-id copulas (Joe and Hu 1996) satisfies (1), (2), (3), (4) and (6) but only partially (5). In Chapter 3, these desirable properties of a multivariate model will be used as criteria to compare different models.

1.4 Overview of thesis

This thesis consists of seven chapters. In Chapter 2, we develop the theoretical background for the multivariate discrete models, statistical inference and computation procedures. Two general classes of multivariate discrete models are introduced; their common feature is that both rely on the copula concept. Several new concepts related to multivariate models are proposed. The asymptotic theory for parameter estimation based on the inference functions of margins (IFM) is also given in this chapter. In Chapter 3, we study and compare many specific models in the two general classes of multivariate models proposed in Chapter 2. Mathematical details for parameter estimation for some of the models are provided. In Chapter 4, the efficiency of the IFM approach relative to the classical maximum likelihood approach is investigated.
The major advantages of IFM are its computational feasibility and its good asymptotic properties. We demonstrate that IFM is an efficient parameter estimation approach when it is applicable. We also study the efficiency of the jackknife method of variance estimation proposed in Chapter 2. In Chapter 5, some important issues such as a proper data analysis cycle, model selection and diagnostic checking are discussed. Data analysis examples illustrating the modelling and inference procedures developed in this thesis are also carried out. In Chapter 6, we study the usefulness and efficiency of the GEE approach, which has been the focus of many recent statistical applications dealing with multivariate and longitudinal data with univariate margins covered by the theory of generalized linear models. In Chapter 7, the final chapter, we discuss some further important research topics closely related to this thesis work. Finally, the Appendix contains a Maple symbolic manipulation program example used in Chapter 4.

Chapter 2

Foundation: models, statistical inference and computation

In this chapter, we propose two classes of multivariate discrete models: multivariate copula discrete (MCD) models and multivariate mixture discrete (MMD) models. These two classes of models provide a new classification of multivariate discrete models, and allow a general approach to modelling multivariate discrete data. The two classes unify a number of multivariate discrete models appearing in the literature, such as the multivariate probit model, the multivariate Poisson-lognormal model, etc. At the same time, numerous new models are proposed under these two classes. We also propose an inference functions of margins (IFM) approach to parameter estimation for MCD and MMD models. This estimation approach is built on the general theory of inference functions (or estimating equations). Asymptotic theory for IFM is developed and applied to the specific models in Chapter 3.
While similar ideas about the same kind of estimating functions for a specific model have appeared in the literature, the general development of the procedure as an approach to parameter estimation for a class of multivariate discrete models, and the related asymptotic results, are new. We also show that a jackknife estimate of the covariance matrix of the estimates from the IFM approach is asymptotically equivalent to the asymptotic covariance matrix from the Godambe information matrix. The jackknife procedure has the advantage of general computational feasibility. These results are used extensively in the applications in Chapter 5. The efficiency of IFM versus the optimal estimation procedure based on maximum likelihood estimation, and the numerical assessment of the efficiency of jackknife covariance matrix estimates compared with the Godambe information matrix, are studied in detail in Chapter 4.

The present chapter is organized as follows. Section 2.1 introduces the multivariate copula models, some multivariate dependence concepts and a number of new concepts regarding the properties of a multivariate model. In section 2.2, we introduce two classes of multivariate discrete models: the multivariate copula discrete models and the multivariate mixture discrete models. These two classes of models are the focus of this thesis, and specific models in these two classes will be extensively studied in Chapter 3. In section 2.3, we propose an inference functions of margins (IFM) approach for the parameter estimation of MCD and MMD models; the theoretical foundation is built on the theory of inference functions for the multi-parameter situation. Section 2.4 is devoted to the study of the asymptotic properties of parameter estimates based on the IFM approach.
Under regularity conditions, the IFM estimators (IFME) for parameters are shown to be consistent and asymptotically normal with a Godambe information matrix as the variance-covariance matrix. These results are established for the models with no covariates as well as for models with covariates. The extension of models with no covariates to models with covariates will be made clear in this section. In section 2.5, we propose a jackknife approach to the asymptotic variance estimation of IFME, and show theoretically that the jackknife estimate of variance is asymptotically equivalent to the Godambe information matrix. The importance of the jackknife estimate of variance will be demonstrated in Chapter 5 for real data analysis.

2.1 Multivariate copulas and dependence measures

2.1.1 Multivariate distribution functions

We begin by recalling the definition of a distribution function on $\mathbb{R}^d$.

Definition 2.1 A $d$-dimensional distribution function is a function $F : \mathbb{R}^d \to \mathbb{R}$, which is right continuous, with

(i) $\lim_{y_j \to -\infty} F(y_1, \ldots, y_d) = 0$, $j = 1, \ldots, d$,

(ii) $\lim_{y_1 \to \infty, \ldots, y_d \to \infty} F(y_1, \ldots, y_d) = 1$,

and which satisfies the following rectangle inequality: for all $(a_1, \ldots, a_d)$, $(b_1, \ldots, b_d)$ with $a_j < b_j$, $j = 1, \ldots, d$,

$$\sum_{k_1=1}^{2} \cdots \sum_{k_d=1}^{2} (-1)^{k_1 + \cdots + k_d} F(x_{1k_1}, \ldots, x_{dk_d}) \ge 0, \qquad (2.1)$$

where $x_{j1} = a_j$ and $x_{j2} = b_j$.

The following are several remarks related to Definition 2.1:

i. If $F$ has $d$th order derivatives, then (2.1) is equivalent to $\partial^d F / \partial y_1 \cdots \partial y_d \ge 0$.

ii. Letting $a_2, \ldots, a_d \to -\infty$, (2.1) reduces to $F(b_1, b_2, \ldots, b_d) - F(a_1, b_2, \ldots, b_d) \ge 0$, so $F$ is increasing in the first variable. Similarly, by symmetry, $F$ is increasing in the remaining variables.

iii. Let $S$ be a subset of $\{1, \ldots, d\}$. The margins $F_S$ of $F(y_1, \ldots, y_d)$ are obtained by letting $y_i \to \infty$ for $i \notin S$.

There are two important types of cdf generated from a random vector $Y$: discrete and continuous.
In the case of an absolutely continuous random vector $Y$, there is a corresponding density function $f(y_1, \ldots, y_d)$ which satisfies $f(y_1, \ldots, y_d) \ge 0$ and $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(y_1 \cdots y_d)\, dy_1 \cdots dy_d = 1$. The cdf can be written as

$$F(y_1, \ldots, y_d) = \int_{-\infty}^{y_d} \cdots \int_{-\infty}^{y_1} f(x_1 \cdots x_d)\, dx_1 \cdots dx_d.$$

In the case of a discrete random vector $Y$, the probability that $Y$ takes on a value $y = (y_1, \ldots, y_d)'$ is defined by the pmf

$$P(y_1 \cdots y_d) = \Pr(Y_1 = y_1, \ldots, Y_d = y_d),$$

which satisfies $P(y_1 \cdots y_d) \ge 0$ and $\sum_{\{y_1\}} \cdots \sum_{\{y_d\}} P(y_1 \cdots y_d) = 1$. The cdf can be written as

$$F(y_1, \ldots, y_d) = \sum_{\{x_1 \le y_1, \ldots, x_d \le y_d\}} P(x_1 \cdots x_d).$$

For a discrete random vector, the $j$th marginal distribution is defined by

$$P_j(y_j) = \sum_{\{y_1, \ldots, y_d\} \setminus \{y_j\}} P(y_1 \cdots y_d).$$

The $(j, k)$ marginal distribution is defined by

$$P_{jk}(y_j y_k) = \sum_{\{y_1, \ldots, y_d\} \setminus \{y_j, y_k\}} P(y_1 \cdots y_d).$$

In general, the marginal distributions can be obtained from the previous remark (iii).

2.1.2 Multivariate copulas and Fréchet bounds

The multivariate normal distribution is used extensively in multivariate analysis because of its many nice properties (see for example, Seber 1984). The wide range of successful applications of the multivariate normal distribution is due to its flexibility in representing different dependence structures, rather than to physical reasons or to its role as an approximation arising from the Central Limit Theorem. The dependence structure plays a crucial role in multivariate modelling. But the multivariate normal model is not sufficient for every multivariate modelling situation. To be able to model multivariate data in general, a good understanding of the general parametric families of multivariate distribution functions, the constructs which describe the characteristics of the random phenomena, is necessary. One useful and well-known approach to understanding a multivariate distribution function $F$ is to express $F$ in terms of its marginals and its associated dependence function $C(\cdot)$.
This $C(\cdot)$ (or simply $C$) is commonly called the copula.

Definition 2.2 (Copula) $C$ is a copula if

$$G(y_1, \ldots, y_d) = C(G_1(y_1), \ldots, G_d(y_d))$$

is a distribution function, whenever $G_1, \ldots, G_d$ are all arbitrary univariate distribution functions. □

Let $Y = (Y_1, \ldots, Y_d)'$ be a $d$-variate continuous random vector with cdf $G(y_1, \ldots, y_d)$ and with continuous univariate marginal distribution functions $G_1(y_1), \ldots, G_d(y_d)$ respectively. Then $U_1 = G_1(Y_1), \ldots, U_d = G_d(Y_d)$ are each uniformly distributed on $[0, 1]$. Let $G_1^{-1}, \ldots, G_d^{-1}$ be the univariate quantile functions, where $G_j^{-1}$ is defined by $G_j^{-1}(t) = \inf\{y : G_j(y) \ge t\}$, $j = 1, \ldots, d$. The copula, $C$, of $Y = (Y_1, \ldots, Y_d)'$ is constructed by making marginal probability integral transforms on $Y_1, \ldots, Y_d$ to $U_1, \ldots, U_d$. That is, the copula is the joint distribution function of $U_1, \ldots, U_d$:

$$C(u_1, \ldots, u_d) = G(G_1^{-1}(u_1), \ldots, G_d^{-1}(u_d)). \qquad (2.2)$$

$C$ is non-unique if the $G_j$'s are not all continuous. This point will be made clear in section 2.2. Suppose $Y$ is a continuous random vector with distribution function $G(y_1, \ldots, y_d)$ and the corresponding copula is $C(u_1, \ldots, u_d)$ with density function $c(u_1, \ldots, u_d)$. The density function of $G(y_1, \ldots, y_d)$ in terms of the copula density function is $g(y_1, \ldots, y_d) = c(G_1(y_1), \ldots, G_d(y_d)) \prod_{j=1}^{d} g_j(y_j)$, where $g_j$ is the density of $G_j$.

The copula captures the dependence among the components of the random vector $Y$; it contains all of the information that couples the $d$ marginal distributions together to yield the joint distribution of $Y$. This understanding is essential for the subsequent development of the multivariate discrete models. The copula was first introduced by Sklar (1959). For parametric families of copulas with good properties, see Joe (1993, 1996). Through the copula, a distribution function is decomposed into two parts: a set of marginal distribution functions and the dependence structure which is specified in terms of the copula.
This suggests that one natural way to model multivariate data is to find the dependence structure in terms of the copula and the univariate marginals separately. This important feature will be extended to form multivariate discrete models by using the copula concept in section 2.2 and in Chapter 3.

Next we define the Fréchet bounds.

Definition 2.3 (Fréchet bounds) Let $F(x)$ be a $d$-variate cdf with univariate margins $F_1, \ldots, F_d$. Then for all $x_1, \ldots, x_d$,

$$\max\{0, F_1(x_1) + \cdots + F_d(x_d) - (d - 1)\} \le F(x_1, \ldots, x_d) \le \min\{F_1(x_1), \ldots, F_d(x_d)\}, \qquad (2.3)$$

where $\min\{F_1(x_1), \ldots, F_d(x_d)\}$ is the Fréchet upper bound, and $\max\{0, F_1(x_1) + \cdots + F_d(x_d) - (d - 1)\}$ is the Fréchet lower bound. □

We state here some important properties of the Fréchet bounds.

Properties

1. The Fréchet upper bound is a cdf.

2. The Fréchet lower bound is a cdf for $d = 2$.

3. The Fréchet upper bound copula is $C_U(u) = \min\{u_1, \ldots, u_d\}$. For $d = 2$, the Fréchet lower bound copula is $C_L(u) = \max\{0, u_1 + u_2 - 1\}$.

For a proof of properties 1, 2 and 3 and other properties of Fréchet bounds, see Joe (1996). Under independence, the copula is

$$C_I(u_1, \ldots, u_d) = \prod_{j=1}^{d} u_j,$$

and any copula must fall pointwise between $\max\{0, u_1 + \cdots + u_d - (d - 1)\}$ and $\min\{u_1, \ldots, u_d\}$.

2.1.3 Dependence measures

It is desirable for a parametric family of multivariate distributions to have a flexible and wide range of dependence. For nonnormal random variables, correlation is not the best measure of dependence, and concepts based on linearity are not necessarily the best to work with. More general concepts of positive and negative dependence and measures of monotone dependence are needed. These are necessary for analyzing the type of dependence and range of dependence in a parametric family of multivariate models. For a thorough treatment of dependence concepts and dependence orderings, see Joe (1996, Chapter 2).
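Returning briefly to the Fréchet bounds of Definition 2.3, the following sketch evaluates the two bounds and the independence copula on a grid, and applies the rectangle inequality (2.1) to the lower bound. It is only an illustration under stated assumptions: the helper names (`frechet_lower`, `rectangle_mass`, and so on) and the particular rectangle $[1/2, 1]^d$ are hypothetical choices, not taken from the thesis.

```python
from itertools import product


def frechet_upper(u):
    """Frechet upper bound copula: min(u_1, ..., u_d)."""
    return min(u)


def frechet_lower(u):
    """Frechet lower bound: max(0, u_1 + ... + u_d - (d - 1))."""
    return max(0.0, sum(u) - (len(u) - 1))


def independence(u):
    """Independence copula: product of the u_j."""
    out = 1.0
    for uj in u:
        out *= uj
    return out


def rectangle_mass(F, a, b):
    """Left-hand side of the rectangle inequality (2.1) for the box (a, b]."""
    total = 0.0
    for ks in product((1, 2), repeat=len(a)):
        point = [a[j] if k == 1 else b[j] for j, k in enumerate(ks)]
        total += (-1) ** sum(ks) * F(point)
    return total


def bounds_hold(d, steps=5):
    """Check lower bound <= independence copula <= upper bound on a grid."""
    grid = [i / steps for i in range(steps + 1)]
    return all(
        frechet_lower(u) <= independence(u) + 1e-12
        and independence(u) <= frechet_upper(u) + 1e-12
        for u in product(grid, repeat=d)
    )


if __name__ == "__main__":
    print(bounds_hold(2), bounds_hold(3))
    # Mass that the lower bound would assign to [1/2, 1]^d for d = 2 and d = 3.
    print(rectangle_mass(frechet_lower, [0.5] * 2, [1.0] * 2))
    print(rectangle_mass(frechet_lower, [0.5] * 3, [1.0] * 3))
```

For $d = 3$ the lower bound assigns negative "mass" to the cube $[1/2, 1]^3$, which is why property 2 is stated for $d = 2$ only.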
In multivariate analysis, one of the most important activities is to model the dependence structure among the random variables. The complexity of the dependence structure and its range often determine the practical usefulness of the model. The dependence structure of a multivariate model can be considered as, in some sense, equivalent to the copula; for example, Schweizer and Wolff (1981) used copulas to define several natural nonparametric measures of dependence for pairs of random variables. The parameters in a multivariate copula reflect the degree of dependence among variables; for example, the multivariate normal copula can be adequately expressed in terms of a correlation matrix whose elements are the pairwise correlation coefficients of a multinormal random vector, with large correlation coefficients indicating strong dependence among variables. However, it is not always possible to express a copula in terms of correlation coefficients of a set of random variables. There may also be mathematical reasons, such as simplicity, not to express a copula in terms of correlation coefficients.

A measure of dependence for two random variables indicates how closely these two random variables are related. The extreme situations would be mutual independence and complete mutual dependence. Some very useful dependence concepts, such as positive and negative dependence concepts, are based on the refinement of some intuitive understanding of dependence among random variables. For example, for two random variables $X$ and $Y$, the positive dependence concept means roughly that large (small) values of $X$ tend to accompany large (small) values of $Y$. Often in practice, this knowledge of the amount of dependence is good enough for some modelling purposes.

Some well-known measures of dependence for two random variables are Pearson's correlation coefficient $r$, Spearman's rho and Kendall's tau.
These measures are defined as follows: Let $X$, $Y$ be random variables with continuous distribution functions $F(x)$ and $G(y)$ and copula $C$. We further assume that $(X_1, Y_1)$, $(X_2, Y_2)$ and $(X, Y)$ are independent with the same joint distribution. Then Pearson's correlation coefficient is

$$r = \mathrm{Cov}(X, Y)/(\sigma_X \sigma_Y) \quad \text{or} \quad r = \frac{1}{\sigma_X \sigma_Y} \int\!\!\int [C(F(x), G(y)) - F(x)G(y)]\, dx\, dy,$$

Kendall's tau is

$$\tau = \mathrm{Corr}(\mathrm{sgn}(X_1 - X), \mathrm{sgn}(Y_1 - Y)) = 2\Pr((X_1 - X)(Y_1 - Y) > 0) - 1 \quad \text{or} \quad \tau = 4 \int\!\!\int C(u, v)\, dC(u, v) - 1,$$

and Spearman's rho is

$$\rho = 3\,\mathrm{Cov}(\mathrm{sgn}(X_1 - X), \mathrm{sgn}(Y_2 - Y)) \quad \text{or} \quad \rho = 12 \int\!\!\int [C(u, v) - uv]\, du\, dv,$$

where $\sigma_X$ and $\sigma_Y$ stand for the standard deviations of the random variables $X$ and $Y$, and $\mathrm{sgn}(\cdot)$ denotes the sign function. Both Kendall's tau and Spearman's rho are invariant to strictly increasing transformations. They are equal to 1 for the Fréchet upper bound and $-1$ for the Fréchet lower bound. These properties do not hold for Pearson's correlation. Essentially, Pearson's $r$ measures the strength of the linear relationship between two random variables $X$ and $Y$, whereas Kendall's tau and Spearman's rho are measures of monotone correlation (strength of monotone relationship). For bivariate quantitative data, Spearman's rho corresponds to the rank correlation (Pearson's correlation applied to the ranks of the two variables).

That the copula captures the basic dependence structure among the components of $Y$ can be seen from the fact that all nonparametric measures of association, such as Kendall's tau and Spearman's rho, are normed distances of the copula from the independence copula. In general, it is difficult to judge the intensity of dependence for a given multivariate model solely on the basis of one dependence measure; the three common dependence measures can be used as a reference for the attainable intensity of dependence of a given multivariate model. For ease of interpretation of the dependence structure, we would like to see the dependence structure expressed in easily interpretable parameters.
For example, for arbitrary marginals, a question is how to express a copula in terms of the most common measures of association, such as Pearson's $r$ (from some specific marginals), Spearman's rho, or Kendall's tau, in a natural way. For some well-defined classes of distributions, such as the multivariate normal, Pearson's correlation coefficient is the measure of choice. In other classes, other measures may be more appropriate. (For example, the Morgenstern copula in subsection 2.1.4 is expressed in terms of Kendall's tau in a natural way.) A parametric family has extra interpretability if some of the parameters can be identified as dependence parameters. More specifically, one would like to be able to say that some range of the parameters corresponds to positive dependence and some corresponds to negative dependence. Furthermore, it would be desirable for the amount of dependence to increase (decrease) as the parameters increase (decrease).

2.1.4 Examples of multivariate copulas

Some well-known examples of copula families are: the multivariate normal copula, Morgenstern copula, Plackett copula, Frank copula, etc. Joe (1993, 1996) provides a detailed list of families of copulas with good properties. In Genest and MacKay (1986), a class of copulas, called Archimedean copulas, is studied extensively. Most existing parametric families of copulas represent monotone dependence structures where the intensity of the dependence is determined by the value of the dependence parameter. Some families, such as the normal family, possess a complete range of dependence intensities whereas others, such as the Morgenstern family, possess only a limited range. In fact, the Morgenstern copula never attains the Fréchet bounds; Spearman's rho lies between $-1/3$ and $1/3$. For general modelling purposes, we would naturally seek families with a wide range of dependence intensities.
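The invariance of Kendall's tau and Spearman's rho under strictly increasing transformations, noted in subsection 2.1.3, can be checked numerically on sample versions of the three measures. The sketch below is a minimal illustration, assuming simulated bivariate normal data and the transformations $x \mapsto e^x$ and $y \mapsto y^3$; the function names, the seed and the sample size are hypothetical choices, not taken from the thesis.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0] * len(v)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman(x, y):
    """Sample Spearman's rho: Pearson's correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Sample Kendall's tau: average sign of concordance over all pairs."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))


def demo(seed=0, n=300, corr=0.6):
    """Compare the three measures before and after increasing transformations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [corr * a + math.sqrt(1.0 - corr ** 2) * rng.gauss(0.0, 1.0) for a in x]
    xt = [math.exp(a) for a in x]   # strictly increasing transformation of x
    yt = [b ** 3 for b in y]        # strictly increasing transformation of y
    return {
        "pearson": (pearson(x, y), pearson(xt, yt)),
        "spearman": (spearman(x, y), spearman(xt, yt)),
        "kendall": (kendall(x, y), kendall(xt, yt)),
    }


if __name__ == "__main__":
    for name, (before, after) in demo().items():
        print(name, before, after)
```

Because the transformations preserve the ordering of the observations, the rank-based measures are unchanged, while Pearson's $r$ generally moves.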
Here we give some examples of multivariate copulas. More examples of multivariate copulas will be given in Chapter 3 for constructing multivariate discrete models.

Example 2.1 (Multivariate normal copula) Let $\Phi$ be the standard normal distribution function and let $\Phi_d$ be the $d$-variate normal distribution function with mean vector 0, variance vector 1 and correlation matrix $\Theta$. Then the corresponding $d$-variate copula is

$C(u_1,\ldots,u_d;\Theta) = \Phi_d(\Phi^{-1}(u_1),\ldots,\Phi^{-1}(u_d);\Theta)$,   (2.4)

where every bivariate copula $C_{jk}(u_j,u_k;\theta_{jk})$, $1 \le j < k \le d$, attains the lower Fréchet bound, the independence case, or the upper Fréchet bound according to $\theta_{jk} = -1, 0$, or 1. Pearson's correlation coefficient for the corresponding bivariate normal distribution is $r = \theta_{jk}$. For Spearman's rho and Kendall's tau, we can also establish the following relationships: $\tau = (2/\pi)\sin^{-1} r$ and $\rho = (6/\pi)\sin^{-1}(r/2)$. With this copula, we have

$G(z_1,\ldots,z_d) = C(G_1(z_1),\ldots,G_d(z_d);\Theta) = \Phi_d(\Phi^{-1}(G_1(z_1)),\ldots,\Phi^{-1}(G_d(z_d));\Theta)$,

where the $G_j$'s are arbitrary cdfs. For example, we can have $G_j(z_j) = \exp(z_j)/(1+\exp(z_j))$, $G_j(z_j) = \Phi(z_j)$, $G_j(z_j) = 1-\exp(-\exp(z_j))$ or $G_j(z_j) = \exp(-\exp(-z_j))$. $G_j(z_j)$ can even be taken as a mixture distribution function, for example, $G_j(z_j) = \pi_j\Phi(z_j) + (1-\pi_j)\int_{-\infty}^{z_j}\exp(-|x|)/2\,dx$, where $0 \le \pi_j \le 1$. These flexible choices of univariate marginal distributions combined with the complete range of the dependence parameter matrix $\Theta$ make the multivariate normal copula a powerful copula for general modelling purposes. In Chapter 3, we will use this copula extensively. •

Example 2.2 (Morgenstern copula) In the literature, the names of several people are sometimes put together in naming this copula; Farlie-Gumbel-Morgenstern copula is one such name. In this thesis, we simply call this copula the Morgenstern copula (Morgenstern 1956). One simpler version of a $d$-dimensional Morgenstern copula, which does not include higher order terms, is

$C(u_1,\ldots,u_d) = \Big[1 + \sum_{j<k}^{d}\theta_{jk}(1-u_j)(1-u_k)\Big]\prod_{h=1}^{d}u_h$.   (2.5)
It has density function

$c(u_1,u_2,\ldots,u_d) = 1 + \sum_{j<k}^{d}\theta_{jk}(1-2u_j)(1-2u_k)$.

The conditions on the parameters $\theta_{jk}$ so that (2.5) is indeed a copula are straightforward. For $d = 3$, the conditions can be conveniently summarized as follows: $1+\theta_{12}+\theta_{13}+\theta_{23} \ge 0$, $1+\theta_{13} \ge \theta_{12}+\theta_{23}$, $1+\theta_{12} \ge \theta_{13}+\theta_{23}$, $1+\theta_{23} \ge \theta_{12}+\theta_{13}$, or more succinctly

$-1 + |\theta_{12}+\theta_{23}| \le \theta_{13} \le 1 - |\theta_{12}-\theta_{23}|$,   $-1 \le \theta_{12},\theta_{13},\theta_{23} \le 1$.

Similar conditions for higher dimensions $d = 4,5,\ldots$ can also be obtained by considering the $2^d$ cases for $u_j = 0$ or 1, $j = 1,\ldots,d$, and verifying that $c(u_1,\ldots,u_d) \ge 0$. It is easy to see that for any $j,k = 1,\ldots,d$; $j \ne k$,

$C_{jk}(u_j,u_k;\theta_{jk}) = [1 + \theta_{jk}(1-u_j)(1-u_k)]u_ju_k$,   $-1 \le \theta_{jk} \le 1$,

with density function

$c_{jk}(u_j,u_k) = 1 + \theta_{jk}(1-2u_j)(1-2u_k)$.

The dependence structure between $U_j$ and $U_k$ is controlled by the parameter $\theta_{jk}$. Spearman's rho is $\rho = \theta_{jk}/3$. The maximum Pearson's correlation coefficient over all choices of $G_j$ and $G_k$ is 1/3 (when $\theta_{jk} = 1$), which occurs for uniform marginals. For normal marginals, the maximum of Pearson's correlation coefficient is $1/\pi$; for exponential marginals it is 1/4; for double exponential marginals, the limit is 0.281. Kendall's tau is $2\theta_{jk}/9$, with the maximum range of $-2/9$ to $2/9$.

Because of the dependence range limitation, the Morgenstern copula is not very useful for general modelling. Nevertheless, because the Morgenstern copula has such a simple form, it can be used as an investigation tool in, for example, simulation studies to check properties of some general modelling procedures. An example of its use is provided in section 4.3. If a new procedure breaks down with a distribution based on the Morgenstern copula, then it will probably have difficulties with other models that admit a wider range of dependence.
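The stated values of Spearman's rho and Kendall's tau for the bivariate Morgenstern copula can be checked numerically. The sketch below assumes the standard copula-integral representations $\rho = 12\int\!\!\int C(u,v)\,du\,dv - 3$ and $\tau = 4\int\!\!\int C(u,v)\,c(u,v)\,du\,dv - 1$, and approximates the integrals by a midpoint rule; the function names, the grid size and the value $\theta = 0.7$ are illustrative assumptions.

```python
def morgenstern_cdf(u, v, theta):
    """Bivariate Morgenstern copula C(u, v; theta) = [1 + theta(1-u)(1-v)] u v."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def morgenstern_density(u, v, theta):
    """Bivariate Morgenstern copula density 1 + theta(1-2u)(1-2v)."""
    return 1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v)


def midpoint_grid(n):
    """Midpoints of n equal subintervals of [0, 1]."""
    return [(i + 0.5) / n for i in range(n)]


def morgenstern_spearman_rho(theta, n=200):
    """rho = 12 * integral of C over the unit square - 3 (midpoint rule)."""
    pts = midpoint_grid(n)
    total = sum(morgenstern_cdf(u, v, theta) for u in pts for v in pts)
    return 12.0 * total / (n * n) - 3.0


def morgenstern_kendall_tau(theta, n=200):
    """tau = 4 * integral of C * c over the unit square - 1 (midpoint rule)."""
    pts = midpoint_grid(n)
    total = sum(
        morgenstern_cdf(u, v, theta) * morgenstern_density(u, v, theta)
        for u in pts
        for v in pts
    )
    return 4.0 * total / (n * n) - 1.0


theta = 0.7  # illustrative value in [-1, 1]
print("rho:", morgenstern_spearman_rho(theta), "theta/3   :", theta / 3.0)
print("tau:", morgenstern_kendall_tau(theta), "2*theta/9 :", 2.0 * theta / 9.0)
```

Setting `theta` to 1 or $-1$ in this sketch gives the end points of the attainable ranges quoted above.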
A version of the $d$-dimensional Morgenstern copula with higher order terms has the following density function:

$c(u_1,u_2,\ldots,u_d) = 1 + \sum_{j_1<j_2}\beta_{j_1j_2}[1-2u_{j_1}][1-2u_{j_2}] + \sum_{j_1<j_2<j_3}\beta_{j_1j_2j_3}[1-2u_{j_1}][1-2u_{j_2}][1-2u_{j_3}] + \cdots + \beta_{12\cdots d}\prod_{i=1}^{d}[1-2u_i]$.   (2.6)

This form expands the correlation structure of the Morgenstern distribution (2.5). For more details, see Johnson and Kotz (1975, 1977). •

2.1.5 CUOM, CUOM(k), MUBE, PUBE and MPME concepts

In this thesis, we are mainly interested in (a) parametric models (or copulas) with a wide range of dependence intensities, and (b) parametric models (or copulas) with certain types of marginal distribution closure properties. In this subsection, we introduce several concepts for the marginal behaviour of a distribution.

Definition 2.4 (Closure under taking of margins, or CUOM) A parametric model (copula) is said to have the property of closure under taking of margins if the bivariate margins and higher-order margins belong to the same parametric family. •

Definition 2.5 (Closure under taking of margins of order k, or CUOM(k)) A parametric model (copula) is said to have the property of closure under taking of margins of order $k$ if the $k$-variate margins belong to the same parametric family. •

Definition 2.6 (Model univariate-bivariate expressible, or MUBE) A parametric model (copula) is called model univariate-bivariate expressible, or MUBE, if all the parameters in the model can be expressed by parameters in the model's univariate and bivariate marginal distributions. •

Definition 2.7 (Parameter univariate-bivariate expressible, or PUBE) If a parameter in a model can be expressed by the model's univariate and bivariate marginal distributions, then this parameter is called parameter univariate-bivariate expressible, or PUBE, under the model.
•

Definition 2.8 (Model parameters marginally expressible, or MPME) If all the parameters in a model can be expressed by the model's lower dimensional (lower than full) marginal distributions, then the model is said to have the property of model parameters being marginally expressible, or the MPME property. •

If we are thinking about parameter estimation, then expressions such as "expressible" and "be expressed" in the above definitions should be understood as "estimable" and "be estimated", respectively, from lower-dimensional margins.

A model with CUOM is also said to have reproducibility or upward compatibility under taking of margins. Basically, the marginal distributions "reproduce" themselves under taking of margins. This property is desirable in many multivariate applications because initial data analysis often starts with lower dimensional margins.

A model with the MUBE property is one in which all the parameters appearing in the multivariate distribution appear in univariate and bivariate marginal distributions. A model with the PUBE property may have multivariate parameters of order higher than 3, but some of its parameters of interest can be expressed through univariate or bivariate margins without the involvement of other multivariate parameters (e.g. trivariate parameters). A model with the MPME property is one in which all of the parameters may be expressed marginally. These are important properties of a multivariate model that allow for a simplification in the parameter estimation through the IFM approach (defined in section 2.3).

Based on the above definitions, the following implications hold:

i. If a model has the CUOM property, then it also has the CUOM(k) property. If a model is not CUOM(k), then it is not CUOM.

ii. CUOM($r_1$) implies CUOM($r_0$) if $r_1 > r_0$. That is, there exists a parameterization of the lower dimensional margins so that the lower order closure property holds.

iii.
If a model has the MUBE property, then all the parameters in this model are PUBE. Furthermore, this model is also MPME.

iv. If every parameter is PUBE, then the model is MUBE.

No other implications hold in general. In the following, a few examples are used to illustrate the above concepts and some of their relationships.

Example 2.3 (Models with CUOM and MUBE properties) A familiar example of a model with the CUOM and MUBE properties is the multivariate normal model. The closure under taking of margins for the multinormal distribution is somewhat stronger than the CUOM property defined here, since it is also closed under taking of univariate margins, which is not required in our definition. •

Example 2.4 (Models with MUBE property) For some copulas, such as (2.4) and (2.5), the dependence structure can be expressed by a $d \times d$ matrix parameter $\Theta = (\theta_{jk})$ with $\theta_{jj} = 1$. For such a $d$-dimensional copula $C(\cdot;\Theta)$, the 2-dimensional margins can be expressed by a bivariate copula $C_{jk}(\cdot;\theta_{jk})$ with one dependence parameter $\theta_{jk}$, for $j,k = 1,\ldots,d$; $j \ne k$. Thus each element in the dependence structure described by the parametric matrix $\Theta = (\theta_{jk})$ can be equivalently expressed by a set of bivariate copulas $C_{jk}(\cdot;\theta_{jk})$. The distribution with this copula is thus MUBE. Some copulas such as (2.4) have a wide range of dependence; some such as (2.5) do not. •

Example 2.5 (Models with CUOM but not MUBE property) We give two examples here:

a. Consider the generalized Morgenstern copula (2.6). This copula has the CUOM property, since for any $\{j_1,\ldots,j_m\} \subset \{1,\ldots,d\}$ where $m < d$, it is straightforward to verify that $C(u_{j_1},u_{j_2},\ldots,u_{j_m})$ has the form (2.6). But this generalized Morgenstern copula is not MUBE.

b. Another example is the multivariate Poisson distribution. Let us examine the trivariate Poisson distribution.
Let the random variables $X_1, X_2, X_3, X_{12}, X_{13}, X_{23}, X_{123}$ have independent Poisson distributions with the mean parameters $\lambda_1, \lambda_2, \lambda_3, \lambda_{12}, \lambda_{13}, \lambda_{23}, \lambda_{123}$ respectively. We now construct new random variables as follows:

$Y_1 = X_1 + X_{12} + X_{13} + X_{123}$,   $Y_2 = X_2 + X_{12} + X_{23} + X_{123}$,   $Y_3 = X_3 + X_{13} + X_{23} + X_{123}$.

Using the convolution property of the Poisson, we derive that $Y_1 \sim \mathrm{Po}(\lambda_1+\lambda_{12}+\lambda_{13}+\lambda_{123})$, $Y_2 \sim \mathrm{Po}(\lambda_2+\lambda_{12}+\lambda_{23}+\lambda_{123})$, $Y_3 \sim \mathrm{Po}(\lambda_3+\lambda_{13}+\lambda_{23}+\lambda_{123})$, that $(Y_1,Y_2)$, $(Y_1,Y_3)$, $(Y_2,Y_3)$ have bivariate Poisson distributions, and that $(Y_1,Y_2,Y_3)$ has a trivariate Poisson distribution. This 3-dimensional Poisson model has the CUOM property because the bivariate margins have a similar stochastic representation. But it is neither MUBE nor PUBE. In fact, with univariate and bivariate margins, we can only estimate $\lambda_1+\lambda_{13}$, $\lambda_2+\lambda_{23}$ and $\lambda_{12}+\lambda_{123}$ from the (1,2) margin, $\lambda_1+\lambda_{12}$, $\lambda_3+\lambda_{23}$ and $\lambda_{13}+\lambda_{123}$ from the (1,3) margin, and $\lambda_2+\lambda_{12}$, $\lambda_3+\lambda_{13}$ and $\lambda_{23}+\lambda_{123}$ from the (2,3) margin. These nine linear expressions form only six independent linear expressions. Since we have seven parameters in the model, the model is not MUBE. Furthermore, it can be easily verified that no single parameter can be univariate-bivariate expressed. •

Example 2.6 (Models with MUBE but not CUOM(2) property) Consider a trivariate copula constructed in the following way:

$C_{123}(u_1,u_2,u_3) = \int_0^{u_2} C_{1|2}(u_1|x;\delta_{12})\,C_{3|2}(u_3|x;\delta_{23})\,dx$,   (2.7)

where $C_{1|2}$ and $C_{3|2}$ are conditional cdfs obtained from two arbitrary bivariate copula families $C_{12}(u_1,u_2;\delta_{12})$ and $C_{23}(u_2,u_3;\delta_{23})$. This trivariate copula has (1,2) bivariate margin $C_{12}(u_1,u_2;\delta_{12})$, (1,3) bivariate margin $C_{13}(u_1,u_3) = \int_0^1 C_{1|2}(u_1|x;\delta_{12})\,C_{3|2}(u_3|x;\delta_{23})\,dx$, and (2,3) bivariate margin $C_{23}(u_2,u_3;\delta_{23})$. Suppose we let $C_{12}$ be the Plackett copula

$C(u,v;\delta) = 0.5\,\eta^{-1}\{1 + \eta(u+v) - [(1+\eta(u+v))^2 - 4\delta\eta uv]^{1/2}\}$,   $0 \le \delta < \infty$,   (2.8)
where $\eta = \delta - 1$, and we let $C_{23}$ be the Frank copula

$C(u,v;\delta) = -\delta^{-1}\log([\xi - (1-e^{-\delta u})(1-e^{-\delta v})]/\xi)$,   $0 \le \delta < \infty$,   (2.9)

where $\xi = 1 - e^{-\delta}$. Then the model (2.7) is well-defined, and is obviously MUBE with 2 bivariate dependence parameters $\delta_{12}$ and $\delta_{23}$. But the model (2.7) is not CUOM(2), since the Plackett copula and the Frank copula are not in the same parametric family.

Generally speaking, given bivariate distributions $F_{12}, F_{23}$ with univariate margins $F_1, F_2, F_3$, it can be shown that

$F_{123}(y_1,y_2,y_3) = \int_{-\infty}^{y_2} C_{13}(F_{1|2}(y_1|z_2;\theta_{12}), F_{3|2}(y_3|z_2;\theta_{23}))\,F_2(dz_2)$   (2.10)

is a proper trivariate distribution with univariate margins $F_1, F_2, F_3$, (1,2) bivariate margin $F_{12}$, and (2,3) bivariate margin $F_{23}$. In (2.10), $F_{1|2}, F_{3|2}$ are conditional cdfs obtained from $F_{12}, F_{23}$, and $C_{13}$ is a bivariate copula associated with the (1,3) margin (it can be interpreted as a copula representing the amount of conditional dependence in the first and third univariate margins given the second). Specifically, $C_{13}(u_1,u_3) = u_1u_3$ corresponds to conditional independence and $C_{13}(u_1,u_3) = \min\{u_1,u_3\}$ corresponds to perfect conditional dependence. The model (2.10) is MUBE, but it may not be CUOM(2); to see this, it is enough to choose $F_{12}$ and $F_{23}$ from different parametric families. The model (2.7) is a special case of (2.10) obtained by letting $F_{12}, F_{23}$ be the Plackett and Frank copulas respectively, and $C_{13}(u_1,u_3) = u_1u_3$. The construction (2.10) is a special case of Joe (1996a). •

Example 2.7 (Models with CUOM(2) but not CUOM property) Let $F(u,v;\theta) = uv(1 + \theta(1-u)(1-v))$, $-1 \le \theta \le 1$, be the bivariate Morgenstern family (2.5). Let $F_{12}$ and $F_{23}$ be in this family with parameters $\theta_{12}$ and $\theta_{23}$ respectively. Let $C_{13}(u_1,u_3) = u_1u_3$. The conditional distributions are

$F_{j|2}(u_j|u_2) = u_j + \theta_{j2}u_j(1-u_j)(1-2u_2)$,   $j = 1,3$.
Hence by (2.10), we have

$F_{13}(u_1,u_3) = \int_0^1 F_{1|2}(u_1|z_2)F_{3|2}(u_3|z_2)\,dz_2 = u_1u_3[1 + 3^{-1}\theta_{12}\theta_{23}(1-u_1)(1-u_3)]$,

which is in the bivariate Morgenstern family (2.5) with parameter $\theta_{12}\theta_{23}/3$. Hence the model

$F_{123}(u_1,u_2,u_3) = \int_0^{u_2} F_{1|2}(u_1|z_2)F_{3|2}(u_3|z_2)\,dz_2$   (2.11)

is CUOM(2). But (2.11) is not CUOM. In fact, we find

$F_{123}(u_1,u_2,u_3) = u_1u_2u_3[1 + \theta_{12}(1-u_1)(1-u_2) + 3^{-1}\theta_{12}\theta_{23}(1-u_1)(1-u_3) + \theta_{23}(1-u_2)(1-u_3) + 2\theta_{12}\theta_{23}(1-u_1)(1-u_2)(1-u_3)(1-2u_2)/3]$,

which is not in the trivariate Morgenstern family (2.5). •

Example 2.8 (Models with CUOM($r_0$) but not CUOM($r_1$) property, when $r_0 < r_1$) Consider a 4-variate copula model:

$F_{1234}(u_1,u_2,u_3,u_4) = u_1u_2u_3u_4[1 + \theta_{12}(1-u_1)(1-u_2) + 3^{-1}\theta_{12}\theta_{23}(1-u_1)(1-u_3) + \theta_{14}(1-u_1)(1-u_4) + \theta_{23}(1-u_2)(1-u_3) + \theta_{24}(1-u_2)(1-u_4) + \theta_{34}(1-u_3)(1-u_4) + 2\theta_{12}\theta_{23}(1-u_1)(1-u_2)(1-u_3)(1-2u_2)/3]$,

where

$|\theta_{14}+\theta_{24}+\theta_{34}| - \theta_{12} - 1 \le \theta_{23}(1+\theta_{12}) \le 1 + \theta_{12} - |\theta_{14}+\theta_{24}-\theta_{34}|$,

$|\theta_{14}-\theta_{24}-\theta_{34}| + \theta_{12} - 1 \le \theta_{23}(1-\theta_{12}) \le 1 - \theta_{12} - |\theta_{14}-\theta_{24}+\theta_{34}|$,

and $|\theta_{jk}| \le 1$, $1 \le j < k \le 4$. It can be shown that $F_{12}, F_{13}, F_{14}, F_{23}, F_{24}$ and $F_{34}$ are in the bivariate Morgenstern family (2.5), but $F_{123}, F_{124}, F_{134}$ and $F_{234}$ are not in the same parametric family. In fact, $F_{124}, F_{134}$ and $F_{234}$ are in the trivariate Morgenstern family (2.5), but $F_{123}$ is not. •

Example 2.9 (Models with PUBE but not MPME property) We give two examples here:

a. In the generalized Morgenstern copula (2.6), the parameters $\beta_{j_1j_2}$ ($1 \le j_1 < j_2 \le d$) are PUBE, but the model is not MPME, as the parameter $\beta_{12\cdots d}$ cannot be expressed by any marginal copula.

b. Another example is the Molenberghs-Lesaffre model in Example 2.17. The parameters $\eta_j$ ($1 \le j \le d$) and $\eta_{jk}$ ($1 \le j < k \le d$) are PUBE, but the model is not MPME, as the parameter $\eta_{12\cdots d}$ cannot be expressed by any marginal pmf.
•

2.2 Multivariate discrete models

Assume $\mathcal{F}$ is a parametric family defined on a common measurable space $(\mathcal{Y},\mathcal{A})$, where $\mathcal{Y}$ is a discrete sample space and $\mathcal{A}$ the corresponding $\sigma$-field. We further assume

$\mathcal{F} = \{P(\mathbf{y};\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Re\}$,   $\Re \subseteq \mathbb{R}^q$,   (2.12)

where $\boldsymbol{\theta} = (\theta_1,\ldots,\theta_q)'$ is a $q$-component vector, and $\Re$ is the parameter space. The parameter space is usually a subset of $q$-dimensional Euclidean space. We presume the existence of a measure $\mu$ on $\mathcal{Y}$ such that for each fixed value of the parameter $\boldsymbol{\theta}$, the function $P(\mathbf{y};\boldsymbol{\theta})$ is the density with respect to $\mu$ of a probability measure $\mathcal{P}$ on $\mathcal{Y}$. For a $d$-dimensional random discrete vector $\mathbf{Y} = (Y_1,\ldots,Y_d)'$, its pmf $P(y_1\cdots y_d;\boldsymbol{\theta})$ (or simply $P(y_1\cdots y_d)$) is assumed to be in $\mathcal{F}$.

2.2.1 Multivariate copula discrete models

We define a cdf of a discrete random vector $\mathbf{Y} = (Y_1,\ldots,Y_d)'$ as

$G(y_1,\ldots,y_d) = C(G_1(y_1),\ldots,G_d(y_d))$,   (2.13)

where $C$ is a $d$-dimensional copula and $G_j$ ($j = 1,\ldots,d$) is the cdf of the discrete rv $Y_j$. Thus $G(y_1,\ldots,y_d)$ is a well-defined cdf for a discrete random vector $\mathbf{Y}$. The pmf of $\mathbf{Y} = \mathbf{y} = (y_1,\ldots,y_d)'$ is

$P(y_1\cdots y_d) = \sum_{k_1=1}^{2}\cdots\sum_{k_d=1}^{2}(-1)^{k_1+\cdots+k_d}\,C(x_{1k_1},\ldots,x_{dk_d})$,   (2.14)

where $x_{j1} = G_j(y_j^*)$, $x_{j2} = G_j(y_j)$ with $G_j(y_j^*) < G_j(y_j)$, and $\Pr(Y_j = x) = 0$ for any $x$ such that $y_j^* < x < y_j$; that is, $y_j^*$ is the point immediately below $y_j$ in the support of $Y_j$, so that (2.14) agrees with the bivariate binary pmf given in Example 2.10. We call the model (2.13) for a discrete random vector $\mathbf{Y}$ a multivariate copula discrete (MCD) model.

The family of MCD models is a big family. With MCD models, we have flexible choices of marginal cdfs, including standard distributions such as Bernoulli, binomial, negative binomial, Poisson and generalized Poisson, and these allow the models to accommodate a wide range of data. We may also have flexible choices of copulas; examples are the multinormal copula, the Hüsler-Reiss copula and the Morgenstern copula. For a summary of properties of MCD models, see subsection 2.2.4.
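The rectangle formula (2.14) can be sketched in a few lines of code. The example below is a minimal illustration, assuming integer-valued margins (so that the support point immediately below $y_j$ is $y_j - 1$), Poisson marginal cdfs, and the bivariate Frank copula (2.9); the parameter values and all function names are illustrative assumptions rather than part of the thesis.

```python
import itertools
import math


def frank_copula(u, v, delta):
    """Bivariate Frank copula (2.9), for delta > 0."""
    xi = 1.0 - math.exp(-delta)
    num = xi - (1.0 - math.exp(-delta * u)) * (1.0 - math.exp(-delta * v))
    return -math.log(num / xi) / delta


def poisson_cdf(y, lam):
    """Poisson cdf at y; equals 0 for y < 0."""
    if y < 0:
        return 0.0
    return sum(math.exp(-lam) * lam ** m / math.factorial(m) for m in range(int(y) + 1))


def mcd_pmf(y, cdfs, copula):
    """Rectangle formula (2.14) for integer-valued margins.

    k_j = 2 selects G_j(y_j); k_j = 1 selects G_j(y_j - 1).
    """
    total = 0.0
    for ks in itertools.product((1, 2), repeat=len(y)):
        args = [cdfs[j](y[j]) if k == 2 else cdfs[j](y[j] - 1) for j, k in enumerate(ks)]
        total += (-1) ** sum(ks) * copula(*args)
    return total


# Illustrative choices (assumptions): Poisson(2) and Poisson(3) margins, delta = 2.
cdfs = [lambda y: poisson_cdf(y, 2.0), lambda y: poisson_cdf(y, 3.0)]
copula = lambda u, v: frank_copula(u, v, 2.0)

# The pmf should sum to (almost) one over a grid covering nearly all the mass.
total_mass = sum(mcd_pmf((y1, y2), cdfs, copula) for y1 in range(30) for y2 in range(30))

# Summing out the second component should recover the first Poisson margin.
margin_at_1 = sum(mcd_pmf((1, y2), cdfs, copula) for y2 in range(40))

print(total_mass, margin_at_1, 2.0 * math.exp(-2.0))
```

Any other copula with a closed-form cdf could be passed in place of `frank_copula`; only the copula argument of `mcd_pmf` changes, which reflects the separation of margins and dependence structure in (2.13).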
For a given $d$-variate discrete distribution $F$, we can often find multiple copulas which match $F$ into an MCD model. For example, suppose we have a bivariate binary random vector $\mathbf{Y} = (Y_1,Y_2)'$, where $Y_j$ ($j = 1,2$) takes values 0 and 1. The probabilities of observing (1,1), (1,0), (0,1) and (0,0) are $P(11)$, $P(10)$, $P(01)$ and $P(00)$ respectively. Then for any given one-parameter family of bivariate copulas $C(u_1,u_2;\theta)$ that ranges from the Fréchet lower bound to the Fréchet upper bound, we can find a $\theta$ to express the four probability masses in the following way:

$C(u_1,u_2;\theta) = P(11)$,   $u_1 = P(11) + P(10)$,   $u_2 = P(11) + P(01)$.   (2.15)

(2.15) may not hold if $C(\cdot;\theta)$ cannot attain the Fréchet bounds. The above observation suggests that to model multivariate discrete data, different copulas could do the modelling job equally well. To make the modelling successful in the general sense, it is important that the copula has a wide dependence range. Evidently, with different copulas, we will not be estimating the same dependence parameters, but nevertheless the fitted models should lead to similar inferences or interpretations.

2.2.2 Multivariate mixture discrete models

Multivariate discrete models can be constructed in ways other than the derivation of MCD models. We can envisage circumstances in which the multivariate discrete random vector $\mathbf{Y}$ at $\mathbf{y} = (y_1,\ldots,y_d)'$ has pmf $f(y_1\cdots y_d;\boldsymbol{\lambda})$ for a given $\boldsymbol{\lambda}$. Suppose further that $\boldsymbol{\lambda}$ is a random outcome, which we assume to be a $p$-component vector ($p$ may be different from $d$) subject to chance variation described by a certain (continuous) multivariate distribution $G(\lambda_1,\ldots,\lambda_p)$, which in turn can be expressed in terms of a copula function $C(u_1,\ldots,u_p)$ with (continuous) univariate marginal distributions $G_j$, $j = 1,\ldots,p$.
This is similar to imagining a group of outcomes, with random traits or effects for the individuals in the group, and having a common constant trait or element through the distribution of the random effects. Then the probability of $\mathbf{Y} = \mathbf{y}$, or the pmf of $\mathbf{Y}$ at $\mathbf{y}$, is

$P(y_1\cdots y_d) = \int\cdots\int f(y_1\cdots y_d;\boldsymbol{\lambda})\,c(G_1(\lambda_1),\ldots,G_p(\lambda_p))\prod_{j=1}^{p}g_j(\lambda_j)\,d\lambda_1\cdots d\lambda_p$.   (2.16)

We call (2.16) a multivariate mixture discrete (MMD) model. We use the word mixture since the distribution function is constructed as a mixture of $\{f(y_1\cdots y_d;\boldsymbol{\lambda})\}$ over $\boldsymbol{\lambda}$. A special case of (2.16) is obtained by assuming that the univariate marginal probability mass corresponding to the outcome of $Y_j$, which is $P_j(y_j)$, depends on a parameter $\gamma_j$, $j = 1,\ldots,d$ (or a vector of parameters), and that given the $\gamma_j$, the variables $Y_j$ are independent. If $\boldsymbol{\lambda} = (\lambda_1,\ldots,\lambda_p)'$ is the $p$-component vector formed by the non-singular components of $\gamma_j$, $j = 1,\ldots,d$, then the model (2.16) becomes

$P(y_1\cdots y_d) = \int\cdots\int\prod_{j=1}^{d}f_j(y_j;\gamma_j)\,c(G_1(\lambda_1),\ldots,G_p(\lambda_p))\prod_{j=1}^{p}g_j(\lambda_j)\,d\lambda_1\cdots d\lambda_p$,   (2.17)

where $f_j(y_j;\gamma_j) = \Pr(Y_j = y_j \mid \Gamma_j = \gamma_j)$. The dependence among the response variables is induced through the mixing distribution of $\boldsymbol{\lambda}$. Usually $\lambda_j = \gamma_j$, $j = 1,\ldots,d$. A special case is $\gamma_j = \lambda_j = \lambda$ for all $j$.

2.2.3 Examples of MCD and MMD models

From their definitions, we see that the above two classes are rather general. We can choose any appropriate multivariate copula as the copula in the construction of the distribution. The sets of MCD and MMD models are not disjoint, as we can see from Example 2.13.

From a practical viewpoint, we need to find some specific multivariate copulas $C$ which offer good modelling properties and have a simple analytic form. One such choice is the multivariate normal copula (2.4). With this copula, we have

$C(G_1(z_1),\ldots,G_d(z_d)) = \Phi_d(\Phi^{-1}(G_1(z_1)),\ldots,\Phi^{-1}(G_d(z_d));\Theta)$,

where the $G_j$'s are arbitrary cdfs.
The multivariate normal copula allows us to fully or almost fully exploit the dependence structure among the response variables. Its primary disadvantage may be computational difficulties when $d$ is large (e.g. $d > 7$, see Schervish 1984).

This subsection consists of examples of MCD and MMD models. Discussion concerning the inclusion of covariates is given in some cases. More extensive studies of specific MCD and MMD models are given in Chapter 3.

Example 2.10 (MCD binary model)

1. General models. Let $Y_j$ ($j = 1,\ldots,d$) be a binary random variable taking values 0 or 1, and suppose the probability of outcome 1 is $p_j$. The cdf for $Y_j$ is

$G_j(y_j) = 0$ if $y_j < 0$;   $G_j(y_j) = 1 - p_j$ if $0 \le y_j < 1$;   $G_j(y_j) = 1$ if $y_j \ge 1$.   (2.18)

For a given $d$-dimensional copula $C(u_1,\ldots,u_d;\boldsymbol{\theta})$, $C(G_1(y_1),\ldots,G_d(y_d);\boldsymbol{\theta})$ is a well-defined distribution for the binary random vector $\mathbf{Y} = (Y_1,\ldots,Y_d)'$. When $d = 2$, with a one-parameter copula $C(u_1,u_2;\theta_{12})$, we can write down the pmf of $\mathbf{Y}$ as

$P(y_1y_2) = C(b_1,b_2;\theta_{12}) - C(b_1,a_2;\theta_{12}) - C(a_1,b_2;\theta_{12}) + C(a_1,a_2;\theta_{12})$,

where $a_1 = G_1(y_1-1)$, $b_1 = G_1(y_1)$, $a_2 = G_2(y_2-1)$ and $b_2 = G_2(y_2)$. The pmf of $\mathbf{Y} = \mathbf{y}$ for a general $d$ is expressed by (2.14).

One simple way to reparameterize $p_j$ in (2.18), so that the new parameter associated with the univariate margin has the range $(-\infty,\infty)$, is by letting $p_j = F_j(z_j)$, where $F_j$ is a proper cdf. This is equivalent to writing $Y_j = I(Z_j \le z_j)$, where $Z_j$ is a rv with cdf $F_j$, and the random vector $\mathbf{Z} = (Z_1,\ldots,Z_d)'$ has a multivariate cdf $F_{12\cdots d}$. In the literature, this approach is referred to as a latent variable model or a multivariate latent model, since $\mathbf{Z}$ is an unobserved (latent) vector. There is also the option of including covariates in the parameter $z_j$, as well as in the dependence parameters $\boldsymbol{\theta}$ in the copula $C(u_1,\ldots,u_d;\boldsymbol{\theta})$. We will show these by examples.

2. Multivariate probit model with no covariates. The classical multivariate probit model for the multivariate binary response vector $\mathbf{Y}$ is (2.14) with the multinormal copula (2.4), where $p_j$ is reparameterized as $p_j = \Phi(z_j)$ and $G_j$ has the form (2.18). This model has the CUOM and MUBE properties. Through its latent variable representation, the model can also be written as $Y_j = I(Z_j \le z_j)$, $j = 1,\ldots,d$, where $\mathbf{Z} = (Z_1,\ldots,Z_d)' \sim N_d(0,\Theta)$, $\Theta = (\theta_{jk})$; $z_j$ is often referred to as the cut-off point. $\Theta$ is a correlation matrix, which (a) has elements bounded by 1 in absolute value and (b) is nonnegative definite. To avoid the constraint of the bounds, we can reparameterize $\theta_{jk}$ through the hyperbolic tangent transform as

$\theta_{jk} = \dfrac{\exp(\gamma_{jk}) - 1}{\exp(\gamma_{jk}) + 1}$,   (2.19)

so that the new parameter $\gamma_{jk}$ is in the range $(-\infty,\infty)$. The right-hand side of (2.19) is an increasing function of $\gamma_{jk}$. Condition (a) is not sufficient to guarantee that $\Theta$ is nonnegative definite except when $d = 2$. For $d = 2$, $\Theta$ is always nonnegative definite since the determinant of $\Theta$, $1 - \theta_{12}^2$, is always nonnegative. For $d = 3$, $\Theta$ is a nonnegative definite matrix provided

$\det(\Theta) = 1 + 2\theta_{12}\theta_{13}\theta_{23} - \theta_{12}^2 - \theta_{13}^2 - \theta_{23}^2 \ge 0$;   (2.20)

this constraint is satisfied for about 61.7% of the cube $[-1,1]^3$ for $(\theta_{12},\theta_{13},\theta_{23})$. For $d = 4$, only about 18.3% of the hypercube $[-1,1]^6$ leads to a nonnegative definite matrix $\Theta$; see Rousseeuw and Molenberghs (1994). Theoretically, the constraint (b) causes no trouble for the usefulness of the model. But numerically, this constraint may be a problem, since the space where the numerical computation can be carried out is quite limited. For the numerical computation to be successful, we have to guarantee that the current values do not leave the constrained space, which, in some situations (e.g. when the true parameters are close to the boundary of the space), may render the computation time consuming or even impossible.
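The figure of about 61.7% quoted for $d = 3$ can be checked by a simple Monte Carlo experiment, together with the transform (2.19). The sketch below draws $(\theta_{12},\theta_{13},\theta_{23})$ uniformly on $[-1,1]^3$ and counts how often condition (2.20) holds; the function names, the sample size and the seed are illustrative assumptions, and the result is a simulation estimate rather than an exact value.

```python
import math
import random


def theta_from_gamma(gamma):
    """Transform (2.19): maps gamma in (-inf, inf) to theta in (-1, 1)."""
    return (math.exp(gamma) - 1.0) / (math.exp(gamma) + 1.0)


def det3(t12, t13, t23):
    """Determinant (2.20) of a 3 x 3 correlation matrix."""
    return 1.0 + 2.0 * t12 * t13 * t23 - t12 ** 2 - t13 ** 2 - t23 ** 2


def fraction_nonnegative_definite(n, seed=0):
    """Share of uniform draws from the cube [-1, 1]^3 that satisfy (2.20)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        t12, t13, t23 = (rng.uniform(-1.0, 1.0) for _ in range(3))
        if det3(t12, t13, t23) >= 0.0:
            hits += 1
    return hits / n


print(fraction_nonnegative_definite(200000))

# Each entry is kept inside (-1, 1) by (2.19), but the triple may still violate (2.20).
thetas = [theta_from_gamma(g) for g in (3.0, 3.0, -3.0)]
print(thetas, det3(*thetas))
```

The last two lines illustrate the point made in the text: the transform (2.19) takes care of condition (a) only, so an unconstrained search over the $\gamma_{jk}$ can still reach values for which $\Theta$ is not nonnegative definite.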
In some situations, these problems with the constraint (b) can be avoided by limiting consideration to a simple correlation structure, so that the nonnegative definite condition is always satisfied. Examples include an exchangeable correlation matrix with all correlations equal to the same $\theta$, and an AR(1) correlation matrix with the $(j,k)$ component equal to $\theta^{|j-k|}$ for some $\theta$.

3. Multivariate probit model with covariates. The classical multivariate probit model for a binary response vector $\mathbf{Y}_i$, $i = 1,\ldots,n$, with covariate vector $\mathbf{x}_{ij}$ for the $j$th univariate marginal parameter, if we use the latent variable representation, is $Y_{ij} = I(Z_{ij} \le \alpha_j + \boldsymbol{\beta}_j\mathbf{x}_{ij})$, $j = 1,\ldots,d$, $i = 1,\ldots,n$, where $\mathbf{Z}_i \sim N(0,\Theta_i)$, $\Theta_i = (\theta_{ijk})$. A modelling question may be whether the dependence parameters should also be functions of covariates. If so, what are natural functions to choose, so that the $\Theta_i$ are all correlation matrices? If $\Theta_i$ does not depend on any covariates, then the $\mathbf{Z}_i$ are iid $N(0,\Theta)$, with $\Theta_i = \Theta = (\theta_{jk})$. If $\Theta_i$ depends on some covariate vectors, say $\theta_{ijk}$ depends on $\mathbf{w}_{ijk}$, then to satisfy $|\theta_{ijk}| < 1$, we can let

$\theta_{ijk} = \dfrac{\exp(\gamma_{jk,0} + \boldsymbol{\gamma}_{jk}\mathbf{w}_{ijk}) - 1}{\exp(\gamma_{jk,0} + \boldsymbol{\gamma}_{jk}\mathbf{w}_{ijk}) + 1}$.   (2.21)

Since all $\Theta_i$, $i = 1,\ldots,n$, must be nonnegative definite, this may be a very strong restriction on the regression parameters $(\gamma_{jk,0},\boldsymbol{\gamma}_{jk})$. In some situations, choices of the parameters $(\gamma_{jk,0},\boldsymbol{\gamma}_{jk})$ in (2.21) making all $\Theta_i$ nonnegative definite may not exist. The inclusion of covariates in the dependence parameters $\theta_{ijk}$ as expressed in (2.21) is a mathematical construction. In Example 2.13, we will give a more "natural" way to include covariates in dependence parameters. •

Example 2.11 (MCD count model)

1. General models. Consider a $d$-variate random count vector $\mathbf{Y} = (Y_1,\ldots,Y_d)'$. Let $Y_j$ be a random variable taking the integer values $0,1,2,\ldots$, $j = 1,2,\ldots,d$. Let $\Pr(Y_j = m) = p_j^{(m)}$. Then we have $\sum_{m=0}^{\infty}p_j^{(m)} = 1$ and the cdf of $Y_j$ is

$G_j(y_j) = \sum_{m=0}^{[y_j]}p_j^{(m)}$,   (2.22)

where $[y_j]$ means the largest integer less than or equal to $y_j$. Thus for a given $d$-dimensional copula $C(u_1,\ldots,u_d;\boldsymbol{\theta})$, $C(G_1(y_1),\ldots,G_d(y_d);\boldsymbol{\theta})$ is a well-defined distribution for the count random vector $\mathbf{Y}$. The pmf of $\mathbf{Y} = \mathbf{y}$ for a general $d$ is expressed by (2.14). If we further assume that $Y_j$ has a Poisson distribution with parameter $\lambda_j$, that is

$p_j^{(m)} = \dfrac{\lambda_j^m\exp(-\lambda_j)}{m!}$,   (2.23)

then we will say we have an MCD Poisson model. For the MCD Poisson model, the univariate parameter $\lambda_j$ can be reparameterized by $\eta_j = \log(\lambda_j)$, so that the new parameter $\eta_j$ has the range $(-\infty,\infty)$. Covariates can be included in $\eta_j$ in an appropriate way. The comments on modelling of the dependence structure in the copula $C$ for the MCD binary model are also relevant here.

To represent the MCD Poisson model by latent variables, let $Y_j = m$ if $z_{m-1} < Z_j \le z_m$, $-\infty = z_{-1} < z_0 < \cdots < z_\infty = \infty$, where $Z_j$ is a rv with cdf $F_j$, and the random vector $\mathbf{Z} = (Z_1,\ldots,Z_d)'$ has a multivariate cdf $F_{12\cdots d}$. The form of $F_j$ does not have much importance since for count data, we are seldom interested in the cut-off points $z_0, z_1, \ldots, z_\infty$. But the copula related to $F_{12\cdots d}$ has essential importance for the modelling of count data, since it determines the multivariate structure of the random count vector $\mathbf{Y}$. Thus we may say that for count data, the MCD representation (2.14) is more relevant than the latent variable representation.

2. Multivariate Poisson model with multinormal copula. The multivariate Poisson model with multinormal copula for a count response vector $\mathbf{Y}$ is that in (2.14), where the copula is the multinormal copula (2.4) and $p_j^{(m)}$ has the form (2.23). This model has the CUOM and MUBE properties. The univariate marginal parameters $\lambda_j$ can be transformed to $\eta_j = \log(\lambda_j)$ so that $\eta_j$ has range $(-\infty,\infty)$. For a random vector $\mathbf{Y}_i$, $i = 1,\ldots,n$, if there is a covariate vector $\mathbf{x}_{ij}$ for $\lambda_{ij}$, a possible way to include $\mathbf{x}_{ij}$ is by letting $\eta_{ij} = \alpha_j + \boldsymbol{\beta}_j\mathbf{x}_{ij}$, where $\eta_{ij} = \log(\lambda_{ij})$. Similarly, if $\Theta_i = (\theta_{ijk})$ with $\theta_{ijk}$ depending on a covariate vector $\mathbf{w}_{ijk}$, a possible way to include $\mathbf{w}_{ijk}$ is by letting $\theta_{ijk}$ have the form (2.21). The difficulties with adding covariates to $\Theta_i$ remain, as in the previous example. •

Example 2.12 (MMD Poisson model)

1. General models. Let $\mathbf{Y} = (Y_1,\ldots,Y_d)'$ be a random vector of count data, where $Y_j$, $j = 1,\ldots,d$, has a Poisson distribution. The MMD Poisson model for the random vector $\mathbf{Y}$ is

$P(y_1\cdots y_d) = \int_0^\infty\cdots\int_0^\infty\prod_{j=1}^{d}f_j(y_j;\lambda_j)\,c(G_1(\eta_1),\ldots,G_p(\eta_p))\prod_{j=1}^{p}g_j(\eta_j)\,d\eta_1\cdots d\eta_p$,   (2.24)

where

$f_j(y_j;\lambda_j) = \exp(-\lambda_j)\lambda_j^{y_j}/y_j!$   (2.25)

is the probability mass function of a Poisson distribution for $Y_j$ given the parameter $\lambda_j$. In (2.24), $\boldsymbol{\eta} = (\eta_1,\ldots,\eta_p)'$ is a $p \times 1$ vector of the collection of functions of $\lambda_1,\ldots,\lambda_d$; it is assumed to be random with density function $c(G_1(\eta_1),\ldots,G_p(\eta_p))\prod_{j=1}^{p}g_j(\eta_j)$, where $c(\cdot)$ is the density function of a copula $C$ and $g_j(\cdot)$ the marginal density of $\eta_j$. The model can cover a wide range of dependence through appropriate parametric families of the copula $C$. Through conditional expectations one can study the covariances and correlations of $\mathbf{Y}$. If $\lambda_j = \eta_j$, $j = 1,\ldots,d$, we have

$E(Y_j) = E(E(Y_j|\lambda_j)) = E(\lambda_j)$,
$\mathrm{Var}(Y_j) = E(\mathrm{Var}(Y_j|\lambda_j)) + \mathrm{Var}(E(Y_j|\lambda_j)) = \mathrm{Var}(\lambda_j) + E(\lambda_j)$,   (2.26)
$\mathrm{Cov}(Y_j,Y_k) = E(\mathrm{Cov}(Y_j,Y_k|\lambda_j,\lambda_k)) + \mathrm{Cov}(E(Y_j|\lambda_j),E(Y_k|\lambda_k)) = \mathrm{Cov}(\lambda_j,\lambda_k)$,   $j \ne k$.

Therefore the correlation of $Y_j$ and $Y_k$ is

$\mathrm{Corr}(Y_j,Y_k) = \dfrac{\mathrm{Cov}(\lambda_j,\lambda_k)}{\{[\mathrm{Var}(\lambda_j) + E(\lambda_j)][\mathrm{Var}(\lambda_k) + E(\lambda_k)]\}^{1/2}}$,   (2.27)

which has the same sign as the correlation of $\lambda_j$ and $\lambda_k$. $\mathrm{Corr}(Y_j,Y_k)$ is smaller in absolute value than $\mathrm{Corr}(\lambda_j,\lambda_k)$ and tends to $\mathrm{Corr}(\lambda_j,\lambda_k)$ when $E(\lambda_j)/\mathrm{Var}(\lambda_j)$ and $E(\lambda_k)/\mathrm{Var}(\lambda_k)$ tend to zero. When $\lambda_j = \eta$, $j = 1,\ldots,d$, $\mathbf{Y}$ is equicorrelated with $\mathrm{Corr}(Y_j,Y_k) = \mathrm{Var}(\eta)/[\mathrm{Var}(\eta) + E(\eta)]$. The range of dependence for this special situation is quite restricted.
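The moment formulas (2.26) and (2.27) can be illustrated by simulation in the equicorrelated special case $\lambda_j = \eta$. The sketch below assumes, purely for illustration, a gamma mixing distribution for the shared $\eta$ (shape 2, scale 1.5, so that $E(\eta) = 3$ and $\mathrm{Var}(\eta) = 4.5$); the gamma choice, the Poisson sampler and all function names are assumptions and are not taken from the thesis.

```python
import math
import random


def poisson_draw(lam, rng):
    """Poisson variate by Knuth's multiplication method (adequate for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_shared_mixture(n, shape, scale, seed=0):
    """Pairs (Y_1, Y_2), conditionally independent Poisson given a shared gamma eta."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        eta = rng.gammavariate(shape, scale)
        pairs.append((poisson_draw(eta, rng), poisson_draw(eta, rng)))
    return pairs


def sample_corr(pairs):
    """Sample Pearson correlation of the two components."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)


def mixed_poisson_corr(cov_lam, var_j, mean_j, var_k, mean_k):
    """Correlation (2.27) from the moments of the mixing variables."""
    return cov_lam / math.sqrt((var_j + mean_j) * (var_k + mean_k))


# Shared eta: Cov(lambda_1, lambda_2) = Var(eta) = 4.5 and E(eta) = 3.
theory = mixed_poisson_corr(4.5, 4.5, 3.0, 4.5, 3.0)
estimate = sample_corr(simulate_shared_mixture(20000, 2.0, 1.5))
print(theory, estimate)
```

Although the mixing variables are perfectly correlated in this special case, the correlation of the counts stays below 1, which is the restriction on the dependence range noted above.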
For the general model (2.24), the parameters are introduced by the marginal distributions of the $\eta_j$ and the copula $C$. Letting the parameters depend on covariates is possible, as we can see from the next example with a specific copula.

2. Multivariate Poisson-lognormal model. The multivariate Poisson-lognormal model for a random Poisson vector $\mathbf{Y}$ is that in (2.24), where the copula is the multinormal copula (2.4), and $\eta_j$ has a lognormal distribution with parameters $\mu_j$ and $\sigma_j$. The pmf for $\mathbf{Y} = \mathbf{y}$ is

$$P(y_1 \cdots y_d) = \int_0^\infty \cdots \int_0^\infty \prod_{j=1}^d f_j(y_j; \lambda_j)\, g(\boldsymbol{\eta}; \boldsymbol{\mu}, \boldsymbol{\sigma}, \Theta)\, d\eta_1 \cdots d\eta_p, \tag{2.28}$$

where $f_j(y_j; \lambda_j)$ is of the form (2.25), and

$$g(\boldsymbol{\eta}; \boldsymbol{\mu}, \boldsymbol{\sigma}, \Theta) = \frac{1}{(2\pi)^{p/2} (\eta_1 \cdots \eta_p) |\boldsymbol{\sigma}'\Theta\boldsymbol{\sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\log \boldsymbol{\eta} - \boldsymbol{\mu})' (\boldsymbol{\sigma}'\Theta\boldsymbol{\sigma})^{-1} (\log \boldsymbol{\eta} - \boldsymbol{\mu}) \right\}, \tag{2.29}$$

with $\eta_j > 0$, $j = 1, \ldots, p$, is a multivariate lognormal density function (here $\boldsymbol{\sigma}'\Theta\boldsymbol{\sigma}$ denotes the covariance matrix with entries $\theta_{jk}\sigma_j\sigma_k$). The model (2.28) has the CUOM and MUBE properties. The parameters in the model are $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)'$, $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_d)'$ and $\Theta$. By (2.26) and (2.27), we have

$$\begin{aligned}
E(Y_j) &= \exp\{\mu_j + \tfrac{1}{2}\sigma_j^2\} \stackrel{\mathrm{def}}{=} a_j, \\
\mathrm{Var}(Y_j) &= a_j + a_j^2 [\exp(\sigma_j^2) - 1], \\
\mathrm{Cov}(Y_j, Y_k) &= a_j a_k [\exp(\theta_{jk}\sigma_j\sigma_k) - 1], \quad j \neq k.
\end{aligned} \tag{2.30}$$

The margins are overdispersed Poisson since $\mathrm{Var}(Y_j)/E(Y_j) > 1$. $|\mathrm{Corr}(Y_j, Y_k)|$ is less than $|\mathrm{Corr}(\eta_j, \eta_k)|$, and $\mathrm{Corr}(Y_j, Y_k)$ approaches $\mathrm{Corr}(\eta_j, \eta_k)$ when $a_j, a_k \to \infty$. A covariate vector $\mathbf{x}$ can be included in the model, say by letting the components of $\boldsymbol{\mu}$ be linear functions of $\mathbf{x}$. $\boldsymbol{\sigma}$ can be assumed to have some special pattern, for example $\sigma_1 = \cdots = \sigma_p = \sigma$. It is harder to let the correlation matrix $\Theta$ depend on covariates in a natural way, as already discussed for the multivariate probit model for binary data. •

Example 2.13 (MMD model for binary data)

1. General models. Let $\mathbf{Y} = (Y_1, \ldots, Y_d)'$ be a binary random vector. Assume that $\mathbf{Y}$ has the MCD binary model in Example 2.10 for a given cut-off point vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)'$. $\boldsymbol{\alpha}$ in turn is assumed to be a random vector. Let $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_p)'$ be the collection of functions of $\boldsymbol{\alpha}$. With the latent variable representation, we have that for given $\boldsymbol{\eta}$

$$\mathbf{Y} = (Y_1, \ldots, Y_d)' = (I(A_1 \le \alpha_1), \ldots, I(A_d \le \alpha_d))', \tag{2.31}$$

where $\mathbf{A} = (A_1, \ldots, A_d)'$ has a multivariate cdf $F$, and $\boldsymbol{\eta}$ has a multivariate cdf $G$. Thus

$$P(y_1 \cdots y_d) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} P(y_1 \cdots y_d \mid \boldsymbol{\eta})\, c(G_1(\eta_1), \ldots, G_p(\eta_p)) \prod_{j=1}^p g_j(\eta_j)\, d\eta_1 \cdots d\eta_p,$$

where $c(G_1(\eta_1), \ldots, G_p(\eta_p)) \prod_{j=1}^p g_j(\eta_j)$ is the density function of $\boldsymbol{\eta}$, with $c(\cdot)$ the density function of a copula $C$ and $g_j(\cdot)$ the marginal density of $\eta_j$. A more general case is when there is a covariate vector $\mathbf{x}$. In this situation, we may let $\alpha_j = \beta_{j0} + \boldsymbol{\beta}_j'\mathbf{x}$, $j = 1, \ldots, d$, where the $\beta_{j0}$'s and $\boldsymbol{\beta}_j$'s are random, and $\boldsymbol{\eta}$ is now assumed to be the collection of functions of the random components $\beta_{j0}$'s and $\boldsymbol{\beta}_j$'s.

2. Multivariate probit-normal model. The MMD probit model is obtained by assuming that in (2.31), $\mathbf{A} = (A_1, \ldots, A_d)' \sim N_d(\mathbf{0}, \Theta)$ and $\boldsymbol{\eta} \sim N_p(\boldsymbol{\mu}, \Sigma)$, where $\Theta = (\theta_{jk})$ is a correlation matrix and $\Sigma = (\sigma_{jk})$ is a variance-covariance matrix. Without loss of generality, let us assume $\boldsymbol{\eta} = \boldsymbol{\alpha}$. Then the MMD probit model of the form (2.31) becomes

$$\mathbf{Y} = (Y_1, \ldots, Y_d)' = (I(Z_1 \le z_1^*), \ldots, I(Z_d \le z_d^*))', \tag{2.32}$$

where $Z_j = \{A_j - (\alpha_j - \mu_j)\}/\sqrt{1 + \sigma_{jj}}$, $z_j^* = \mu_j/\sqrt{1 + \sigma_{jj}}$, $j = 1, \ldots, d$, and $\mathbf{Z} = (Z_1, \ldots, Z_d)' \sim N_d(\mathbf{0}, R)$, where $R = (r_{jk})$ is a correlation matrix with $r_{jk} = (\theta_{jk} + \sigma_{jk})/\{(1 + \sigma_{jj})(1 + \sigma_{kk})\}^{1/2}$, $j \neq k$. This is a special class of the multivariate probit model in Example 2.10. When $\sigma_{jj} = 0$ for all $j$, it is the multivariate probit model discussed in Example 2.10. This example demonstrates that the intersection of the sets of MCD and MMD models is not empty. It is straightforward to extend such a construction to the more general situation with a covariate vector $\mathbf{x}$, such that $\alpha_j = \beta_{j0} + \boldsymbol{\beta}_j'\mathbf{x}$ with the $\beta_{j0}$'s and $\boldsymbol{\beta}_j$'s random.
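The reduction of the probit-normal model to an ordinary multivariate probit model can be sketched numerically. The snippet below computes the standardized cut-off points $z_j^*$, $z_k^*$ and the correlation $r_{jk}$ for one pair of margins; the function and argument names are hypothetical, and only the formulas given with (2.32) are used.

```python
import math
from statistics import NormalDist


def probit_normal_pair(mu_j, mu_k, theta_jk, s_jj, s_kk, s_jk):
    """Return (z_j*, z_k*, r_jk) for the probit-normal model around (2.32).

    theta_jk is the latent correlation of (A_j, A_k); s_jj, s_kk, s_jk are
    the variances and covariance of the random cut-off points. Names are
    illustrative, not from the thesis.
    """
    z_j = mu_j / math.sqrt(1.0 + s_jj)
    z_k = mu_k / math.sqrt(1.0 + s_kk)
    r_jk = (theta_jk + s_jk) / math.sqrt((1.0 + s_jj) * (1.0 + s_kk))
    return z_j, z_k, r_jk


z_j, z_k, r_jk = probit_normal_pair(0.0, 1.0, 0.3, 1.0, 1.0, 0.5)
# Marginal success probability P(Y_j = 1) = Phi(z_j*).
p_j = NormalDist().cdf(z_j)
# With degenerate (non-random) cut-off points the latent correlation is unchanged.
_, _, r_fixed = probit_normal_pair(0.0, 1.0, 0.3, 0.0, 0.0, 0.0)
```

Setting the three variance arguments to zero recovers $r_{jk} = \theta_{jk}$, the case noted above in which the model coincides with the multivariate probit model of Example 2.10.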
With one covariate $x_j$, for example, one might take $\alpha_j = \beta_{j0} + \beta_j x_j$ with $\boldsymbol{\beta}_0 = (\beta_{1,0}, \ldots, \beta_{d,0})' \sim N_d(\boldsymbol{\mu}_0, \Sigma_0)$ independent of $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)' \sim N_d(\boldsymbol{\mu}, \Sigma)$, where $\boldsymbol{\mu}_0 = (\mu_{1,0}, \ldots, \mu_{d,0})'$, $\Sigma_0 = (\sigma_{jk,0})$, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)'$ and $\Sigma = (\sigma_{jk})$. Now in (2.32), we have $Z_j = \{A_j - (\beta_{j0} + \beta_j x_j - \mu_{j,0} - \mu_j x_j)\}/\sqrt{1 + \sigma_{jj,0} + \sigma_{jj} x_j^2}$, with $\mathbf{Z} = (Z_1, \ldots, Z_d)' \sim N_d(\mathbf{0}, R)$, $R = (r_{jk})$, such that

$$\begin{aligned}
z_j^* &= \frac{\mu_{j,0} + \mu_j x_j}{\{1 + \sigma_{jj,0} + \sigma_{jj} x_j^2\}^{1/2}}, \quad j = 1, \ldots, d, \\
r_{jk} &= \frac{\theta_{jk} + \sigma_{jk,0} + \sigma_{jk} x_j x_k}{\{(1 + \sigma_{jj,0} + \sigma_{jj} x_j^2)(1 + \sigma_{kk,0} + \sigma_{kk} x_k^2)\}^{1/2}}, \quad j \neq k.
\end{aligned} \tag{2.33}$$

The function $r_{jk}$ in (2.33) can be considered as a "natural" form for the correlation parameters as functions of the covariates, since this functional form is derived directly from the linear regression for the marginal parameters. As long as the conditions for the linear regression for the marginal parameters hold, $r_{jk}$ will always satisfy the constraints for forming a correlation matrix. For $R$ to be nonnegative definite, it suffices that $\Theta$, $\Sigma_0$ and $\Sigma$ be nonnegative definite. These three matrices do not depend on covariates, which is very attractive numerically compared with the nonnegative definite requirement on $\Theta_i$ in (2.21). A special case is $\theta_{jk} = 0$ and $\sigma_{jk,0} = 0$ ($j \neq k$), in which case the only constraint is that $\Sigma$ be nonnegative definite. Finally, we notice that in contrast to the conventional univariate probit analysis, the regression functions in (2.33) for the cut-off points are not linear functions of the covariates. Nevertheless, (2.33) can be used in lieu of the multivariate probit model with covariates in Example 2.10, since the parameters in (2.33) are also interpretable. To use the model (2.33), it is necessary to reparameterize $\sigma_{jj,0}$, $\sigma_{kk,0}$, $\sigma_{jk,0}$, $\sigma_{jj}$, $\sigma_{kk}$, $\sigma_{jk}$ and $\theta_{jk}$ such that the new parameters have $(-\infty, \infty)$ as their domain. •

2.2.4 Some properties of MCD and MMD models

We summarize some of the properties of MCD and MMD models:

1.
MCD and MMD models, constructed through stochastic or latent variable representations, provide a clear probabilistic description of multivariate discrete random phenomena. In some situations, the pmf and cdf have closed forms; in other situations, the pmf or cdf can be numerically computed in a reasonably short time. Likelihood inference can be used, with the help of the theory in section 2.3 and section 2.4.

2. MCD and MMD models allow flexible choices of multivariate copulas (multinormal copula, Hüsler-Reiss copula, Morgenstern copula, etc.) as well as flexible choices of all the univariate marginal distributions (any discrete distributions: Bernoulli, binomial, negative binomial, Poisson, generalized Poisson, etc.), and they allow relevant covariates to be included in the appropriate parameters in the models. In this way, these two classes of models are able to capture the nature of discrete data on an individual or grouped observation basis, and thus they allow appropriate inferences to be drawn from the data.

3. With appropriate copulas, many MCD and MMD models have the CUOM and MUBE properties. The CUOM property, sometimes referred to as "reproducibility" or "upward compatibility" in the literature, is also sought for modelling longitudinal and repeated measures data. With appropriate families of parametric copulas, a wide range of dependence, including negative dependence, is possible.

4. With appropriate copulas, the parameters related to the univariate marginal structure and the parameters related to the dependence structure can be allowed to vary independently in separate parameter spaces. This is a good property that the multivariate Gaussian model also enjoys.

5.
By choosing appropriate marginal distributions, the MCD and MMD models can naturally account for a variety of situations occurring with discrete data, such as over-dispersion which is independent of covariates, skewed distributions, multimodality, etc.

6. For a given d-variate discrete distribution F, there may be many copulas that represent F as a member of the MCD model class; MCD models are robust in terms of data modelling with copulas of similar structure.

Some of the points above will be made clear in Chapter 3 as well as in Chapter 5.

2.3 Inference functions of margins

For a general multivariate model, parameter estimation is often a difficult computational issue. Without readily available parameter estimation methods, any model, even though interpretable, will not have practical usefulness. For situations involving univariate models, many methods have been devised for parameter estimation, ranging from the method of moments through formal maximum likelihood to informal graphical techniques. The maximum likelihood approach is used in general because it has a number of desirable statistical properties. For example, under general regularity conditions, ML estimators are consistent and asymptotically normal. With some weak additional assumptions, the MLE is also asymptotically efficient. However, the method has not been successfully applied for estimating the parameters of multivariate models, except for the multivariate normal and a few cases with low dimension (e.g. d = 2). A primary cause of this unsatisfactory situation is the computational difficulty involved with multivariate models, even with modern powerful computers. The ML approach for parameter estimation in multivariate situations is still not routine. The question is: can we have a general and effective estimation procedure to estimate parameters for a model in the MCD and MMD classes?

In this section, we first discuss model fitting strategies for multivariate models in subsection 2.3.1.
One strategy leads to the inference functions of margins approach, that we propose as the parameter estimation approach for MCD and MMD models with the MUBE, PUBE or MPME properties. In subsection 2.3.2, we introduce some important results in inference function theory for multiple parameters needed for developing the inference basis for MCD and MMD models. In subsection 2.3.3, we introduce the inference functions of margins (IFM) approach and give some examples.

2.3.1 Approaches for fitting multivariate models

There are at least three possible likelihood-based approaches to estimate parameters in a multivariate model:

Approach 1. All univariate and multivariate parameters are estimated simultaneously by maximizing the full-dimensional likelihood function. This is the MLE approach.

Approach 2. For a model where all multivariate parameters are in a copula, univariate parameters are estimated from the separate univariate likelihoods. The multivariate parameters are then estimated from multivariate likelihoods with the univariate parameters fixed as estimated from separate univariate likelihoods.

Approach 3. For a model with the MUBE, PUBE or MPME property, univariate parameters are estimated from separate univariate likelihoods. Bivariate, trivariate and multivariate parameters are then estimated from bivariate, trivariate and multivariate likelihoods, with lower order parameters fixed as estimated from lower order likelihoods.

The first approach is general and direct. While this strategy sounds most natural from the likelihood point of view, it could be computationally very difficult for most of the multivariate models, even in relatively low dimensional situations. The multivariate normal distribution, which can be easily handled by this approach, is an exception.
The second approach makes the computational task easier, but it still has the difficulties of dealing with a multivariate object in general. These difficulties are mainly two: the high-dimensional maximization problem and the multivariate probability calculation. The third approach reduces these difficulties by working with lower dimensional maximizations or lower dimensional probability calculations. This is a valuable approach if the parametric family of interest has the MUBE, PUBE or MPME properties. It is important because it makes statistical inference for multivariate data easier. Computational tractability is an important factor for the popularity of certain statistical tools, as we observe in many areas of statistics. The third approach to stochastic modelling is often convenient, since many tractable models are readily available for the marginal distributions. It is also invaluable as a general strategy for data analysis in that it allows one to investigate the dependence structure independently of marginal effects (through the copula), while computationally dealing only with lower dimensional (often two-dimensional) models.

Example 2.14 Consider the multivariate probit model for a d-dimensional binary vector $\mathbf{Y}$ with pmf

$$P(y_1 \cdots y_d) = \sum_{i_1=1}^{2} \cdots \sum_{i_d=1}^{2} (-1)^{i_1 + \cdots + i_d}\, \Phi_d(\Phi^{-1}(a_{1i_1}), \ldots, \Phi^{-1}(a_{di_d}); \Theta), \tag{2.34}$$

where $\Theta = (\theta_{jk})$, $a_{j1} = G_j(y_j - 1)$ and $a_{j2} = G_j(y_j)$, with $G_j(1) = 1$ and $G_j(0) = 1 - \Phi(z_j)$ (and $G_j(-1) = 0$). This model has the CUOM and MUBE properties. For estimation from a random sample of iid $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$, the three approaches for fitting multivariate models could be used here:

Approach 1. Estimate the parameters $\mathbf{z} = (z_1, \ldots, z_d)'$ and $\Theta$ by maximizing the multivariate likelihood $L = \prod_{i=1}^n P(y_{i1} \cdots y_{id})$. Let the resulting estimates be $\hat{\mathbf{z}}$ and $\hat{\Theta}$.

Approach 2.
(a) Obtain the estimates $\tilde{\mathbf{z}} = (\tilde{z}_1, \ldots, \tilde{z}_d)'$ by maximizing separately the $d$ univariate marginal likelihoods. (b) Estimate the parameters $\Theta$ from the multivariate likelihood $L = \prod_{i=1}^n P(y_{i1} \cdots y_{id})$ with the parameters $\mathbf{z}$ fixed at the estimated values $\tilde{\mathbf{z}}$ from (a).

Approach 3. (a) Obtain the estimates $\tilde{\mathbf{z}} = (\tilde{z}_1, \ldots, \tilde{z}_d)'$ by maximizing separately the $d$ univariate marginal likelihoods. (b) Estimate the parameters $\theta_{jk}$, $1 \le j < k \le d$, by maximizing separately the $d(d-1)/2$ bivariate likelihoods $L_{jk} = \prod_{i=1}^n P_{jk}(y_{ij} y_{ik})$ with the parameters $z_j, z_k$ fixed at the estimated values $\tilde{z}_j, \tilde{z}_k$ from (a). Let the resulting estimate be $\tilde{\Theta}$.

Approach 1 is computationally demanding, since it requires the calculation of high-dimensional multinormal probabilities and a numerical optimization over many parameters. Approach 2 reduces the numerical optimization problem to fewer parameters, but the high-dimensional multinormal probability calculation is still required. Approach 3 reduces the numerical optimization in Approach 2 to several numerical optimizations, each involving fewer parameters. Further, the high-dimensional multinormal probability calculation is no longer required; all that is needed are binormal probability calculations, which are readily feasible with modern computers. Multi-dimensional calculations are needed for predicted or expected frequencies, but this is much less effort compared with multi-dimensional numerical integrations within a numerical optimization procedure. Since it is computationally easier to obtain $\tilde{\mathbf{z}}$ and $\tilde{\Theta}$, a natural question is what is the asymptotic efficiency of $\tilde{\mathbf{z}}$ and $\tilde{\Theta}$ compared with $\hat{\mathbf{z}}$ and $\hat{\Theta}$. In Chapter 4, we will deal with this problem in a general context. •

2.3.2 Inference functions for multiple parameters

Introduction

The common approach to the problem of estimation is to propose an estimator T(x) and then study its properties.
For estimators with specific properties such as unbiasedness, minimum variance or minimum mean squared error, or asymptotic normality, theories for ordering these estimators are developed. Standard methods for obtaining the estimator T(x) include least squares (LS), maximum likelihood (ML), best linear unbiased, method of moments, uniform minimum variance (UMV), and so on. However, many point estimation procedures may be viewed as the solution of one or more appropriate estimating equations. Indeed, any estimator may be regarded as a solution to an equation or a set of equations of the form $\Psi(\mathbf{x}, \theta) = 0$, where $\Psi$ is a vector of functions (or a single function in the one-parameter case) of the data $\mathbf{x}$ and the parameter $\theta$. $\Psi(\mathbf{x}, \theta)$ is commonly called a vector of inference functions or estimating functions. In this thesis, we use mainly the term "inference functions". But when we focus more on the use of the inference functions for estimation, we also employ the term "estimating functions".

The theory of inference functions is studied in, for example, Godambe (1960, 1976, 1991), McLeish and Small (1988) and Jørgensen and Labouriau (1995). The theory of inference functions imposes optimality criteria on the function $\Psi$ rather than on the estimators obtained from it. The approach of considering a class of inference functions and finding the optimal inference function has the advantage of retaining the strengths of the estimation methods (e.g. LS, ML, UMV) while at the same time eliminating some of their weaknesses. For example, in point estimation, the Cramér-Rao lower bound is attained only on rare occasions, whereas the optimality of the score function among inference functions holds merely under regularity conditions (see below). Inference functions may be used either as estimating equations to determine a point estimate or as the basis for constructing tests or confidence intervals for the parameters.
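To make the idea of an estimator as the root of an estimating equation concrete, here is a minimal sketch for iid Poisson counts with mean $\theta$. It solves two different scalar inference functions by Newton's method with a numerical derivative: the score function and a function based on the proportion of zeros. The solver, the function names and the small data set are illustrative assumptions, not part of the thesis.

```python
import math


def solve_estimating_equation(psi, theta0, tol=1e-10, max_iter=100, h=1e-6):
    """Find a root of the scalar estimating equation psi(theta) = 0 by Newton's method."""
    theta = theta0
    for _ in range(max_iter):
        slope = (psi(theta + h) - psi(theta - h)) / (2.0 * h)  # central difference
        step = psi(theta) / slope
        theta -= step
        if abs(step) < tol:
            break
    return theta


data = [2, 0, 3, 1, 4]  # hypothetical Poisson counts


def score(theta):
    # Poisson score function: sum_i (x_i / theta - 1)
    return sum(x / theta - 1.0 for x in data)


def zero_proportion(theta):
    # Unbiased inference function based on P(X = 0) = exp(-theta)
    return sum((1.0 if x == 0 else 0.0) - math.exp(-theta) for x in data)


theta_score = solve_estimating_equation(score, 1.0)
theta_zero = solve_estimating_equation(zero_proportion, 1.0)
```

The root of the score equation is the sample mean, while the second equation gives minus the log of the observed proportion of zeros. Both functions have expectation zero at the true $\theta$, yet they yield different estimators, which is why criteria for comparing inference functions are needed.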
An example is the maximum likelihood estimator, which is obtained as the solution of the estimating equations formed from the score functions. Thus the inference functions for the MLE are the score functions. Other examples of the application of inference functions are the theory of M-estimators for obtaining robust estimators and the quasi-likelihood methods used in generalized linear models. Inference functions have also found application in a wide variety of applied fields; examples in biostatistics, stochastic processes, and survey sampling can be found in Godambe (1991). In the following, we introduce the notion of regular inference functions and study the asymptotic properties of the resulting estimates in the iid situation.

Inference functions for a vector parameter

In the following, we will give a series of definitions for the inference functions for a vector of parameters and a general asymptotic result for the parameter estimates from the defined inference functions. Let us consider a parametric family $\mathcal{F}$ defined on a common measurable space $(\mathcal{Y}, \mathcal{A})$, where $\mathcal{A}$ is the $\sigma$-field associated with $\mathcal{Y}$. We further assume

$$\mathcal{F} = \{P(\mathbf{y}; \theta) : \theta \in \Re\}, \quad \Re \subseteq \mathbb{R}^q, \tag{2.35}$$

where $\theta = (\theta_1, \ldots, \theta_q)'$ is a $q$-component vector, and $\Re$ is the parameter space. The parameter space is usually a subset of $q$-dimensional Euclidean space. We presume the existence of a measure $\mu$ on $\mathcal{Y}$ such that for each fixed value of the parameter $\theta$ the function $P(\mathbf{y}; \theta)$ is the density with respect to $\mu$ of a probability measure $\mathcal{P}$ on $\mathcal{Y}$.

Definition 2.9 (Inference functions) An $\mathbb{R}^q$-valued vector of functions

$$\Psi(\mathbf{y}; \theta) = (\psi_1(\mathbf{y}; \theta), \ldots, \psi_q(\mathbf{y}; \theta))^T : \mathcal{Y} \times \Re \to \mathbb{R}^q$$

is called a vector of inference functions, if the component functions of $\Psi(\mathbf{y}; \theta)$ are measurable for each fixed $\theta = (\theta_1, \ldots, \theta_q)' \in \Re$. •

Definition 2.10 (Unbiased inference functions) $\Psi$ is said to be unbiased if for each $\theta \in \Re$ and $j = 1, \ldots, q$, $E_\theta\{\psi_j\} = 0$, where $E_\theta$ means expectation relative to $P(\cdot\,; \theta)$.
•

Unbiasedness is a natural requirement which ensures that the roots of the equations are close to the true values when little random variation is present. Whereas $\theta$ may not have an unbiased estimator, unbiased inference functions exist under fairly general circumstances. For any given inference function vector $\Psi$ and any $\mathbf{y} \in \mathcal{Y}$, an estimator of $\theta$, say $\tilde\theta = \tilde\theta(\mathbf{y})$, can be obtained as the solution to $\Psi = 0$. In order for the estimate $\tilde\theta$ to be well-defined and well-behaved, the inference function vector $\Psi$ must satisfy some regularity conditions, that is, $\Psi$ must consist of regular inference functions.

Definition 2.11 (Regular inference functions) The vector of inference functions $\Psi$ is said to be a vector of regular inference functions if, for all $\theta \in \Re$, the following assumptions are satisfied:

1. The support of $\mathbf{y}$ does not depend on any $\theta \in \Re$.

2. $E_\theta\{\psi_j\} = 0$, $j = 1, \ldots, q$.

3. The partial derivative $\partial\psi_j/\partial\theta_k$ exists for almost every $\mathbf{y} \in \mathcal{Y}$, $j, k = 1, \ldots, q$.

4. The order of integration and differentiation may be interchanged as follows:

$$\frac{\partial}{\partial\theta_k} \int \psi_j P(\mathbf{y}; \theta)\, d\mu(\mathbf{y}) = \int \frac{\partial}{\partial\theta_k} [\psi_j P(\mathbf{y}; \theta)]\, d\mu(\mathbf{y}), \quad j, k = 1, \ldots, q.$$

5. $E\{\psi_j\psi_k\}$ exists, $j, k = 1, \ldots, q$, and the $q \times q$ matrix $M_\Psi(\theta) = E\{\Psi\Psi^T\}$ is positive-definite.

6. The $q \times q$ matrix $D_\Psi(\theta) = E\{\partial\Psi/\partial\theta'\}$, with $(j,k)$ entry $E\{\partial\psi_j/\partial\theta_k\}$, is non-singular. •

A model $P(\mathbf{y}; \theta)$ in (2.35) is said to be regular, if the score functions are regular inference functions and $\Re$ is an open region of $\mathbb{R}^q$. We are only interested in regular models, such that the asymptotic theory concerning MLEs is readily available for use. This is not a strong assumption for applications. (The main limitation may be the exclusion of models that violate condition 1 of Definition 2.11.)

Definition 2.12 (Fisher information matrix) The Fisher information matrix is the matrix-valued function $I : \Re \to \mathbb{R}^{q \times q}$ defined by

$$I(\theta) = E\{U(\theta)U^T(\theta)\},$$

where $U(\theta)$ is the vector of score functions, $U(\theta) = \partial \log P(\mathbf{y}; \theta)/\partial\theta$.
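•

Definition 2.13 (Godambe information matrix) For a regular inference function vector $\Psi$, the Godambe information matrix is the matrix-valued function $J_\Psi : \Re \to \mathbb{R}^{q \times q}$ defined by

$$J_\Psi(\theta) = D_\Psi^T(\theta) M_\Psi^{-1}(\theta) D_\Psi(\theta),$$

where $M_\Psi(\theta) = E\{\Psi\Psi^T\}$ and $D_\Psi(\theta) = E\{\partial\Psi/\partial\theta'\}$. •

Consider $n$ iid observations $\mathbf{y}_1, \ldots, \mathbf{y}_n$ from a model $P(\mathbf{y}; \theta)$ in (2.35). Let $\Psi(\mathbf{y}_i; \theta) = (\psi_{i1}, \ldots, \psi_{iq})'$. The inference function vector based on the $n$ observations is $\Psi_n : \mathcal{Y}^n \times \Re \to \mathbb{R}^q$ given by

$$\Psi_n(\theta) = \sum_{i=1}^n \Psi(\mathbf{y}_i; \theta).$$

We define the estimator $\tilde\theta = \tilde\theta(\mathbf{y}_1, \ldots, \mathbf{y}_n)$ as the solution of $\Psi_n = 0$. The following theorem establishes the asymptotic normality of the solution $\tilde\theta$ based on regular inference functions and gives an asymptotic interpretation of the Godambe information matrix.

Theorem 2.1 Assume that the estimator $\tilde\theta = \tilde\theta(\mathbf{y}_1, \ldots, \mathbf{y}_n)$ associated with the regular inference function vector $\Psi_n : \mathcal{Y}^n \times \Re \to \mathbb{R}^q$ is a $\sqrt{n}$-consistent estimator of $\theta$, that is, $\sqrt{n}(\tilde\theta_j - \theta_j)$, $j = 1, \ldots, q$, is bounded in probability so that $\tilde\theta_j$ tends to $\theta_j$ at least at the rate of $1/\sqrt{n}$. We further assume that there exist functions $M_{jkl}(\mathbf{y})$ such that $|\partial^2\psi_j/\partial\theta_k\partial\theta_l| \le M_{jkl}(\mathbf{y})$ for all $\theta \in \Re$, where $E\{M_{jkl}(\mathbf{y})\} < \infty$ for all $j, k, l$. Then as $n \to \infty$, we have asymptotically

$$\sqrt{n}(\tilde\theta - \theta) \stackrel{D}{\to} N_q(0, J_\Psi^{-1}(\theta))$$

under $P(\cdot\,; \theta)$.

Proof. The proof is similar to the corresponding theorem for the asymptotic normality of the MLE. We therefore only sketch it. $\Psi_n$ has the following expansion around $\theta$:

$$0 = \Psi_n(\tilde\theta) = \Psi_n(\theta) + H_n(\theta)(\tilde\theta - \theta) + R_n,$$

where $H_n$ is the $q \times q$ matrix $\partial\Psi_n/\partial\theta'$ and, since each component of $R_n$ is a sum of $n$ second-order terms bounded through the $M_{jkl}$, $R_n = O_p(n\|\tilde\theta - \theta\|^2) = O_p(1)$ by the assumptions, so that $R_n/\sqrt{n} = o_p(1)$. Thus

$$\sqrt{n}(\tilde\theta - \theta) = \left[\frac{1}{n} H_n(\theta)\right]^{-1} \frac{1}{\sqrt{n}} \left[-\Psi_n(\theta) - R_n\right]. \tag{2.36}$$

By the Law of Large Numbers

$$\frac{1}{n} H_n(\theta) \stackrel{p}{\to} D_\Psi(\theta).$$

Now for any fixed vector $\mathbf{u} = (u_1, \ldots, u_q)'$, consider the sequence of one-dimensional rv's

$$\mathbf{u}'\Psi(\mathbf{y}_i; \theta) = u_1\psi_1(\mathbf{y}_i; \theta) + \cdots + u_q\psi_q(\mathbf{y}_i; \theta), \quad i = 1, 2, \ldots$$

By the central limit theorem (Lindeberg-Lévy), $\mathbf{u}'\Psi_n(\theta)/\sqrt{n}$ is asymptotically $N_1(0, \mathbf{u}'M_\Psi\mathbf{u})$.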
This result leads to

$$\frac{1}{\sqrt{n}} \Psi_n(\theta) \stackrel{D}{\to} N_q(0, M_\Psi(\theta)).$$

Applying Slutsky's Theorem to (2.36), we obtain

$$\sqrt{n}(\tilde\theta - \theta) \stackrel{D}{\to} N_q(0, D_\Psi^{-1} M_\Psi (D_\Psi^{-1})^T),$$

or

$$\sqrt{n}(\tilde\theta - \theta) \stackrel{D}{\to} N_q(0, J_\Psi^{-1}(\theta)).$$

•

Optimality criteria for inference functions

In this subsection, we summarize optimality results for inference functions in the multi-parameter situation. These results will be referred to later for comparing two sets of regular inference functions.

Consider a scalar inference function $\Psi$. It is natural to seek an unbiased estimating function $\Psi$ for which the variance $E\{\Psi^2\}$ is as small as possible. This is analogous to the theory of minimum variance unbiased (MVU) estimation. Since the variance may be changed by multiplying $\Psi$ with an arbitrary constant, some further standardization is necessary for the purpose of comparing variances. Godambe (1960) suggested considering the variance of the standardized estimating function $\Psi_s = \Psi/E\{\partial\Psi/\partial\theta\}$, and defined an optimal estimating function to be one which minimizes $\mathrm{Var}(\Psi_s) = E\{\Psi^2\}/\{E(\partial\Psi/\partial\theta)\}^2$, or maximizes $\mathrm{Var}^{-1}(\Psi_s)$, the Godambe information for $\Psi$. Godambe showed that in the one-parameter case the usual maximum likelihood estimating equation has this optimal property within a wide class of regular unbiased inference functions. Thus the Godambe information can be used to compare two regular inference functions, and the function with the larger Godambe information is generally preferred.

Given two vectors of inference functions, $\Psi$ and $\Omega$, several different optimality criteria can be used to say that $\Omega$ is preferred (or optimal) to $\Psi$.

Definition 2.14 (M-optimality) A vector of inference functions $\Omega$ is said to have matrix optimality or M-optimality versus a vector of inference functions $\Psi$ if the difference of the inverses of the Godambe information matrices

$$J_\Psi^{-1}(\theta) - J_\Omega^{-1}(\theta)$$

is non-negative definite. •

Definition 2.15 (T-optimality) A vector of inference functions $\Omega$ is said to have trace optimality or T-optimality versus a vector of inference functions $\Psi$ if the difference of the traces of the inverses of the Godambe information matrices

$$\mathrm{Tr}(J_\Psi^{-1}(\theta)) - \mathrm{Tr}(J_\Omega^{-1}(\theta))$$

is positive. •

Definition 2.16 (D-optimality) A vector of inference functions $\Omega$ is said to have determinant optimality or D-optimality versus a vector of inference functions $\Psi$ if the difference of the determinants of the inverses of the Godambe information matrices

$$|J_\Psi^{-1}(\theta)| - |J_\Omega^{-1}(\theta)|$$

is positive.

Chandrasekar and Kale (1984) proved that M-optimality implies T-optimality and D-optimality. Joseph and Durairajan (1991) further proved that the above three criteria are equivalent in the sense that if $\Omega$ is optimal with respect to any one of the three criteria then it is also optimal with respect to the remaining two.

When comparing two sets of regular inference functions, we could examine a slightly different version of T-optimality and D-optimality. For example, for T-optimality, we may examine the ratio of $\mathrm{Tr}(J_\Psi^{-1}(\theta))$ and $\mathrm{Tr}(J_\Omega^{-1}(\theta))$, and for D-optimality the ratio of $|J_\Psi^{-1}(\theta)|$ and $|J_\Omega^{-1}(\theta)|$. In practice, and often in simulation studies, only the estimated values of $J_\Psi^{-1}(\theta)$ and $J_\Omega^{-1}(\theta)$ are available, so M-optimality, T-optimality or D-optimality may be violated slightly numerically when based on only one set of observations.

We end this subsection by stating an extended Cramér-Rao inequality for inference functions:

Theorem 2.2 For any given vector of regular inference functions $\Psi$, and for all $\theta \in \Re$, $J_\Psi^{-1}(\theta) - I^{-1}(\theta)$ is non-negative definite.

For a proof of this result, see Jørgensen and Labouriau (1995). Related references include Ferreira (1982) and Chandrasekar (1988), among others. •

This theorem states that, for a regular model $P(\mathbf{y}; \theta)$, the vector of score functions

$$U(\theta) = \frac{\partial \log P(\mathbf{y}; \theta)}{\partial\theta} = \left( \frac{\partial \log P(\mathbf{y}; \theta)}{\partial\theta_1}, \ldots, \frac{\partial \log P(\mathbf{y}; \theta)}{\partial\theta_q} \right)'$$

is M-optimal within the class of all regular unbiased estimating functions.

2.3.3 Inference function of margins

We have seen from the previous subsection that, under fairly general regularity conditions, the score functions are asymptotically optimal among regular inference functions. However, with multivariate models, except in a few special cases (e.g. multivariate normal), the estimating equations based on the score functions are computationally very cumbersome or intractable. It would be an invaluable alternative to have inference functions which are computationally feasible in general and also efficient compared to the score functions. In the ensuing subsection, we introduce a set of inference functions, which we call the inference functions of margins (IFM). In Chapter 4, we show that IFM shares the asymptotic optimality properties of the score functions, and this is particularly true for the multivariate models with MUBE and PUBE properties. One major advantage of IFM is that it is computationally feasible in general and more flexible for handling different types of data. This leads us to develop a new inference theory and computationally feasible procedures for many MCD and MMD models.

Inference function of scores

We consider the family (2.35) and assume it is a regular parametric family. The likelihood function of $\theta$, given $\mathbf{y}$, is $L(\theta; \mathbf{y}) = P(\mathbf{y}; \theta)$, and the corresponding loglikelihood function is $\ell(\theta; \mathbf{y}) = \log P(\mathbf{y}; \theta)$. Let

$$L_n(\theta) = \prod_{i=1}^n L(\theta; \mathbf{y}_i)$$

denote the likelihood of $\theta$ based on $\mathbf{y}_1, \ldots, \mathbf{y}_n$, a sample from $\mathcal{Y}$. The loglikelihood function of $\theta$ based on $\mathbf{y}_1, \ldots, \mathbf{y}_n$ is

$$\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^n \ell(\theta; \mathbf{y}_i).$$

Definition 2.17 (Inference functions of scores, or IFS) The vector of score functions

$$\Psi_{IFS} = \frac{\partial\ell_n(\theta)}{\partial\theta} = \left( \frac{\partial\ell_n(\theta)}{\partial\theta_1}, \ldots, \frac{\partial\ell_n(\theta)}{\partial\theta_q} \right)'$$

is called inference function vector of scores, or IFS. •
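The optimality of the score function stated in Theorem 2.2 can be checked by hand in a scalar case. The sketch below takes a single Poisson($\theta$) observation and compares the Fisher information $1/\theta$ with the Godambe information $D^2/M$ of two unbiased inference functions: $\psi_1 = y - \theta$ and $\psi_2 = I(y = 0) - e^{-\theta}$. The choice of these two functions and all names in the code are illustrative assumptions; the expectations $D$ and $M$ are written in closed form.

```python
import math


def godambe_info_scalar(d, m):
    """Scalar Godambe information D^2 / M, with D = E(d psi / d theta), M = E(psi^2)."""
    return d * d / m


def poisson_informations(theta):
    """Return (Fisher, Godambe of psi1, Godambe of psi2) for one Poisson(theta) draw."""
    fisher = 1.0 / theta
    # psi1 = y - theta: D = -1 and M = Var(Y) = theta.
    j_mean = godambe_info_scalar(-1.0, theta)
    # psi2 = 1{y = 0} - exp(-theta): D = exp(-theta), M = p0 * (1 - p0).
    p0 = math.exp(-theta)
    j_zero = godambe_info_scalar(p0, p0 * (1.0 - p0))
    return fisher, j_mean, j_zero


fisher, j_mean, j_zero = poisson_informations(1.0)
efficiency_zero = j_zero / fisher  # asymptotic relative efficiency of the psi2 root
```

Here $\psi_1$ is proportional to the score and attains the Fisher information, while $\psi_2$ has Godambe information $1/(e^\theta - 1) \le 1/\theta$, so the difference of inverse informations is non-negative, as the theorem requires.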
The maximum likelihood estimate (MLE) is generally determined as the solution to the likelihood equations

$$\partial\ell_n(\theta)/\partial\theta = 0.$$

The Hessian matrix of the function $-\ell_n(\theta)/n$ is $J(\theta)$, where $(J(\theta))_{jk} = -(1/n)(\partial^2\ell_n(\theta)/\partial\theta_j\partial\theta_k)$. The expected value of $J(\theta)$, $I(\theta) = E\{J(\theta)\}$, is the Fisher information matrix. The value $J(\hat\theta)$ of $J(\cdot)$ at the maximum likelihood estimate $\hat\theta = \hat\theta(\mathbf{y}_1, \ldots, \mathbf{y}_n)$ is referred to as the observed information. $J(\hat\theta)$ will generally be positive definite since $\hat\theta$ is the point of maximum likelihood. A consistent estimate of $I(\theta)$ is $\hat{I}(\theta) = J(\hat\theta)$. Under very general regularity conditions, it is known that the MLEs are asymptotically normal, in the sense that as $n \to \infty$,

$$\sqrt{n}(\hat\theta - \theta) \stackrel{D}{\to} N_q(0, I(\theta)^{-1}).$$

See Sen and Singer (1993, p.209) for a proof.

Inference function of margins

We now introduce the loglikelihood function of a margin and the inference function of a margin for one parameter, and then define the inference functions of margins (IFM) for a parameter vector $\theta$. The asymptotic results for the estimates from IFM will be established in the next section.

Consider the parametric family (2.35) and assume $P(\mathbf{y}; \theta)$ is a $d$-dimensional density function with respect to a measure $\mu$ on $\mathcal{Y}$. Let $\mathcal{S}_d$ denote the set of non-empty subsets of $\{1, \ldots, d\}$. For any $S \in \mathcal{S}_d$, we use $|S|$ to denote the cardinality of $S$. Let $P_S(\mathbf{y}_S)$ be the $S$-margin of $P(\mathbf{y}; \theta)$, where $\mathbf{y}_S = \{y_j : j \in S\}$. Assume $P_S(\mathbf{y}_S)$ depends on $\theta_S$, where $\theta_S$ is a subvector of $\theta$.

Definition 2.18 Let $\theta = (\theta_1, \ldots, \theta_q)'$. Suppose the parameter $\theta_k$ appears in the $S$-margin $P_S(\mathbf{y}_S)$. The loglikelihood function of the $S$-margin is

$$\ell_S(\theta_S) = \log P_S(\mathbf{y}_S).$$

An inference function for $\theta_k$ is

$$\frac{\partial\ell_S(\theta_S)}{\partial\theta_k}.$$

•

The inference function of a margin for a parameter $\theta_k$ is not necessarily uniquely defined by the above definition. In this thesis, unless specified otherwise, we always work with the inference function from a margin with the smallest $|S|$.
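For a specific model, it is often evident when $|S|$ is the smallest for a parameter, so we will not be concerned with the proof of this feature in most applied situations. If there are two or more inference functions for $\theta_k$ with the same smallest $|S|$, then there is a question of how to combine these inference functions to optimally extract information. We will discuss this issue in section 2.6. Note that with the assumption of MPME (or MUBE), one can use $S$ with $|S| < d$ ($|S| \le 2$ for MUBE) for every parameter $\theta_k$. In the case where MPME does not hold, one has $S = \{1, \ldots, d\}$ for some $\theta_k$ in the model. For the new theory below, we assume MPME or MUBE or PUBE in the remainder of this chapter.

Assume that for the parameter vector $\theta = (\theta_1, \ldots, \theta_q)'$, the corresponding smallest cardinality subsets associated with the parameters are $S_1, \ldots, S_q$ (as $q$ is usually greater than $d$, there are duplicates among the $S_k$'s).

Definition 2.19 (Inference functions of margins, or IFM) The vector of inference functions

$$\Psi_{IFM} = \left( \frac{\partial\ell_{n,S_1}(\theta_{S_1})}{\partial\theta_1}, \ldots, \frac{\partial\ell_{n,S_q}(\theta_{S_q})}{\partial\theta_q} \right)',$$

where $\ell_{n,S}(\theta_S) = \sum_{i=1}^n \log P_S(\mathbf{y}_{i,S})$ is the loglikelihood of the $S$-margin based on $\mathbf{y}_1, \ldots, \mathbf{y}_n$, is called the inference functions of margins, or IFM, for $\theta$. •

For a regular model, the inference functions derived from the likelihood functions of margins also satisfy the regularity conditions. Thus asymptotic properties related to the regular inference functions should apply to IFM. Detailed development of this aspect will be given in section 2.4.

Definition 2.20 (Inference functions of margins estimates, or IFME) Any $\tilde\theta \in \Re$ which is the solution of

$$\Psi_{IFM} = \left( \frac{\partial\ell_{n,S_1}(\theta_{S_1})}{\partial\theta_1}, \ldots, \frac{\partial\ell_{n,S_q}(\theta_{S_q})}{\partial\theta_q} \right)' = 0$$

is called the inference functions of margins estimate, or IFME, of the unknown true parameter vector $\theta$. •

In a few cases, $\tilde\theta$ has an analytic expression (e.g. Example 4.3). In general, $\tilde\theta$ has to be obtained by means of numerical methods.

Examples of inference functions of margins

Example 2.15 Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be $n$ iid rv's from the distribution $C(G_1(x_1), \ldots, G_d(x_d); \Theta)$ with $G_j(x_j) = \Phi(x_j; \mu_j, \sigma_j)$, where $C$ is the multinormal copula (2.4). Let $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)$ and $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_d)$. The loglikelihood function is

$$\ell_n(\boldsymbol{\mu}, \boldsymbol{\sigma}, \Theta) = \sum_{i=1}^n \log\left( c(\Phi(x_{i1}; \mu_1, \sigma_1), \ldots, \Phi(x_{id}; \mu_d, \sigma_d); \Theta) \prod_{j=1}^d \phi(x_{ij}; \mu_j, \sigma_j) \right).$$

Thus the IFS is

$$\Psi_{IFS} = \left( \frac{\partial\ell_n}{\partial\mu_1}, \ldots, \frac{\partial\ell_n}{\partial\mu_d}, \frac{\partial\ell_n}{\partial\sigma_1}, \ldots, \frac{\partial\ell_n}{\partial\sigma_d}, \frac{\partial\ell_n}{\partial\theta_{12}}, \ldots, \frac{\partial\ell_n}{\partial\theta_{d-1,d}} \right)', \quad \ell_n = \ell_n(\boldsymbol{\mu}, \boldsymbol{\sigma}, \Theta).$$

The loglikelihood functions of the 1- and 2-dimensional margins for the parameters $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\Theta$ are

$$\begin{aligned}
\ell_{nj}(\mu_j, \sigma_j) &= \sum_{i=1}^n \log \phi(x_{ij}; \mu_j, \sigma_j), \quad j = 1, \ldots, d, \\
\ell_{njk}(\theta_{jk}, \mu_j, \mu_k, \sigma_j, \sigma_k) &= \sum_{i=1}^n \log\left( c(\Phi(x_{ij}; \mu_j, \sigma_j), \Phi(x_{ik}; \mu_k, \sigma_k); \theta_{jk})\, \phi(x_{ij}; \mu_j, \sigma_j)\, \phi(x_{ik}; \mu_k, \sigma_k) \right), \quad 1 \le j < k \le d.
\end{aligned}$$

Thus the IFM is

$$\Psi_{IFM} = \left( \frac{\partial\ell_{n1}(\mu_1, \sigma_1)}{\partial\mu_1}, \ldots, \frac{\partial\ell_{nd}(\mu_d, \sigma_d)}{\partial\mu_d}, \frac{\partial\ell_{n1}(\mu_1, \sigma_1)}{\partial\sigma_1}, \ldots, \frac{\partial\ell_{nd}(\mu_d, \sigma_d)}{\partial\sigma_d}, \frac{\partial\ell_{n12}(\theta_{12}, \mu_1, \mu_2, \sigma_1, \sigma_2)}{\partial\theta_{12}}, \ldots, \frac{\partial\ell_{n,d-1,d}(\theta_{d-1,d}, \mu_{d-1}, \mu_d, \sigma_{d-1}, \sigma_d)}{\partial\theta_{d-1,d}} \right)'.$$

It is known that $\Psi_{IFS}$ and $\Psi_{IFM}$ lead to the same estimates; see for example Seber (1984). •

Example 2.16 Let $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ be $n$ iid rv's from the multivariate Poisson model with multinormal copula in Example 2.11. The loglikelihood functions of margins for the parameters $\boldsymbol{\lambda}$ and $\Theta$ are

$$\begin{aligned}
\ell_{nj}(\lambda_j) &= \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1, \ldots, d, \\
\ell_{njk}(\theta_{jk}, \lambda_j, \lambda_k) &= \sum_{i=1}^n \log P_{jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d,
\end{aligned}$$

where $P_j(y_{ij}) = \lambda_j^{y_{ij}} \exp(-\lambda_j)/y_{ij}!$ and

$$P_{jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); \theta_{jk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); \theta_{jk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); \theta_{jk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); \theta_{jk}),$$

where $a_{ij} = G_{ij}(y_{ij} - 1)$, $b_{ij} = G_{ij}(y_{ij})$, $a_{ik} = G_{ik}(y_{ik} - 1)$ and $b_{ik} = G_{ik}(y_{ik})$, with $G_{ij}(y_{ij}) = \sum_{x=0}^{y_{ij}} P_j(x)$ and $G_{ik}(y_{ik}) = \sum_{x=0}^{y_{ik}} P_k(x)$. Let $\eta_j = \log(\lambda_j)$.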
The IFM for $\eta_j$, $j = 1, \ldots, d$, and $\theta_{jk}$, $1 \le j < k \le d$, are

$$\Psi_{IFM} = \left( \sum_{i=1}^n \frac{1}{P_1(y_{i1})} \frac{\partial P_1(y_{i1})}{\partial\eta_1}, \ldots, \sum_{i=1}^n \frac{1}{P_d(y_{id})} \frac{\partial P_d(y_{id})}{\partial\eta_d}, \sum_{i=1}^n \frac{1}{P_{12}(y_{i1} y_{i2})} \frac{\partial P_{12}(y_{i1} y_{i2})}{\partial\theta_{12}}, \ldots, \sum_{i=1}^n \frac{1}{P_{d-1,d}(y_{i,d-1} y_{id})} \frac{\partial P_{d-1,d}(y_{i,d-1} y_{id})}{\partial\theta_{d-1,d}} \right)'.$$

For a similar random vector $\mathbf{Y}_i$, $i = 1, \ldots, n$, with a covariate vector $\mathbf{x}_{ij}$ for $\lambda_{ij}$, a possible way to include $\mathbf{x}_{ij}$ is by letting $\eta_{ij} = \alpha_j + \beta_j \mathbf{x}_{ij}$, where $\eta_{ij} = \log(\lambda_{ij})$. We can similarly write down the IFM for the parameters $\alpha_j$, $\beta_j$ and $\theta_{jk}$. •

Example 2.17 (Multivariate binary Molenberghs-Lesaffre model) We first define the multivariate binary Molenberghs-Lesaffre model (Molenberghs and Lesaffre, 1994), or M-L model. Let $\mathbf{Y} = (Y_1, \ldots, Y_d)$ be a $d$-variate binary random vector taking values 0 or 1 for each component. A model for $\mathbf{Y}$ is defined in the following way. Consider a set of $2^d - 1$ generalized cross-ratios with values in $(0, \infty)$: $\eta_j$, $1 \le j \le d$, $\eta_{jk}$, $1 \le j < k \le d$, ..., and $\eta_{12 \cdots d}$ such that

$$\eta_{j_1 \cdots j_q} = \frac{\prod_{(y_{j_1}, \ldots, y_{j_q}) \in A_q^+} P_{j_1 \cdots j_q}(y_{j_1} \cdots y_{j_q})}{\prod_{(y_{j_1}, \ldots, y_{j_q}) \in A_q^-} P_{j_1 \cdots j_q}(y_{j_1} \cdots y_{j_q})}, \tag{2.37}$$

where $A_q^+ = \{(y_{j_1}, \ldots, y_{j_q}) \in \{1, 0\}^q \mid (q - \sum_{t=1}^q y_{j_t}) = 0 \pmod 2\}$, $A_q^- = \{1, 0\}^q \setminus A_q^+$, and $\{j_1, \ldots, j_q\}$ is a subset of $\{1, 2, \ldots, d\}$ with cardinality $q$. We can verify for example when $q = 1, 2, 3, 4$ that

$$\begin{aligned}
\eta_j &= \frac{P(1)}{P(0)}, \quad 1 \le j \le d, \\
\eta_{jk} &= \frac{P(11)P(00)}{P(10)P(01)}, \quad 1 \le j < k \le d, \\
\eta_{jkl} &= \frac{P(111)P(100)P(010)P(001)}{P(110)P(101)P(011)P(000)}, \quad 1 \le j < k < l \le d, \\
\eta_{jklm} &= \frac{P(1111)P(1100)P(1010)P(1001)P(0110)P(0101)P(0011)P(0000)}{P(1110)P(1101)P(1011)P(1000)P(0111)P(0100)P(0010)P(0001)}, \quad 1 \le j < k < l < m \le d,
\end{aligned} \tag{2.38}$$

where subscripts on $P$ are suppressed to simplify the notation. Molenberghs and Lesaffre (1994) show that the $2^d - 1$ equations in (2.37) together with $\sum P(y_1 \cdots y_d) = 1$ lead to unique nonnegative solutions for $P(y_1 \cdots y_d)$, $(y_1, \ldots, y_d) \in \{1, 0\}^d$, under some compatibility conditions on the $(d-1)$- and lower-dimensional probabilities. If all these conditions in the Molenberghs-Lesaffre construction are satisfied, we have a well-defined multivariate Bernoulli model.
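The definition (2.37) can be evaluated directly from a joint pmf. The following minimal sketch computes the generalized cross-ratio of any margin of a binary vector, assuming all cell probabilities of that margin are positive. The dictionary representation of the pmf and the function names are illustrative choices, not the thesis's notation.

```python
from itertools import product


def margin(pmf, subset):
    """Marginal pmf of the components listed in `subset` (0-based indices)."""
    out = {}
    for y, p in pmf.items():
        key = tuple(y[j] for j in subset)
        out[key] = out.get(key, 0.0) + p
    return out


def cross_ratio(pmf, subset):
    """Generalized cross-ratio (2.37) for the margin indexed by `subset`.

    Cells with an even number of zeros go to the numerator (the set A+),
    all other cells to the denominator. Assumes positive cell probabilities.
    """
    q = len(subset)
    m = margin(pmf, subset)
    num, den = 1.0, 1.0
    for y in product((0, 1), repeat=q):
        if (q - sum(y)) % 2 == 0:
            num *= m.get(y, 0.0)
        else:
            den *= m.get(y, 0.0)
    return num / den


# Hypothetical bivariate binary pmf, keyed by (y1, y2).
pmf2 = {(1, 1): 0.40, (1, 0): 0.10, (0, 1): 0.15, (0, 0): 0.35}
eta_1 = cross_ratio(pmf2, (0,))       # odds P(1)/P(0) of the first margin
eta_12 = cross_ratio(pmf2, (0, 1))    # odds ratio P(11)P(00)/(P(10)P(01))

# Independent trivariate pmf: every cross-ratio of order two or more equals 1.
probs = (0.3, 0.6, 0.5)
pmf3 = {}
for y in product((0, 1), repeat=3):
    cell = 1.0
    for yj, pj in zip(y, probs):
        cell *= pj if yj == 1 else 1.0 - pj
    pmf3[y] = cell
eta_123 = cross_ratio(pmf3, (0, 1, 2))
```

For $q = 1$ and $q = 2$ the function reproduces the odds and the odds ratio in (2.38), and under independence the higher-order cross-ratios equal 1.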
We call this model the multivariate M-L binary model. The multivariate M-L binary model is not MUBE, but the parameters $\eta_j$ and $\eta_{jk}$ are PUBE. The special case where $\eta_S = 1$ for $|S| \ge 3$ is MUBE. In relation to the MCD model, it is not clear whether there exists a MCD model such that (2.37) is true, or under what conditions a MCD model is equivalent to (2.37). The difficulty is to prove that there exists a copula such that (2.37) is an equivalent expression of a MCD model. The existence of a copula is needed in order to properly define this model with covariates (e.g. logistic regression univariate margins). For a discussion of whether the Molenberghs-Lesaffre construction leads to a copula model, see Joe (1996). Nevertheless, (2.37) in terms of $P_{j_1\cdots j_q}(y_{j_1}\cdots y_{j_q})$ certainly defines a multivariate model for binary data for some ranges of the parameters.

Let $Y_1,\ldots,Y_n$ be n iid binary rv's from a proper multivariate M-L binary model. Assume the parameters of interest are $\eta = (\eta_1,\ldots,\eta_d,\eta_{12},\ldots,\eta_{d-1,d})'$ and let $\eta_S$ be arbitrary for $|S| \ge 3$. The loglikelihood functions of margins for $\eta$ are
\[
\begin{aligned}
\ell_{nj}(\eta_j) &= \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1,\ldots,d,\\
\ell_{njk}(\eta_{jk},\eta_j,\eta_k) &= \sum_{i=1}^n \log P_{jk}(y_{ij}y_{ik}), \quad 1 \le j < k \le d.
\end{aligned}
\]
Thus the IFM is
\[
\Psi_{IFM} = \Bigl( \frac{\partial \ell_{n1}(\eta_1)}{\partial\eta_1},\ldots,\frac{\partial \ell_{nd}(\eta_d)}{\partial\eta_d},
\frac{\partial \ell_{n12}(\eta_{12},\eta_1,\eta_2)}{\partial\eta_{12}},\ldots,\frac{\partial \ell_{n,d-1,d}(\eta_{d-1,d},\eta_{d-1},\eta_d)}{\partial\eta_{d-1,d}} \Bigr).
\]
For an interpretation of the parameters in (2.37), see Joe (1996). □

Some advantages of IFM approach

The IFM approach has many advantages for parameter estimation and statistical inference:

1. The IFM approach for parameter estimation is computationally simpler than estimating all the parameters from the IFS approach. A numerical optimization with many parameters is much more time-consuming (sometimes beyond the capacity of current computers) compared with several numerical optimizations, each with fewer parameters.
In some cases, optimization is done with parameters from lower-dimensional margins already estimated (that is, there is some order to the sequence of numerical optimizations). IFM leads to estimates of the parameters of many multivariate nonnormal models efficiently and quickly.

2. A potential problem with the IFS approach is the lack of stability of the solution when there are outliers or perturbations of the data in one or a few dimensions. With the IFM approach, we suggest that only the contaminated margins will have such nonrobustness problems. In other words, IFM has some robustness properties in multivariate analysis. It would be interesting to study theoretically and numerically how outliers perturb the IFS and IFM estimates.

3. A large sample size is often needed for a large dimension of the responses. This may not be easily satisfied in most applied problems. Rather, sparse data are commonplace when there are multiple responses; these often create problems for ML estimation. By working with the lower-dimensional likelihoods, the IFM approach avoids the sparseness problem in multivariate situations to a certain degree; this could be a major advantage in small sample situations.

4. The IFM approach should be robust against some misspecification in the multivariate model. Also, some assessment of the goodness-of-fit of the copula can be made after solving the part of the estimating equations from IFM that corresponds to the parameters of the univariate margins.

5. Finally, IFM leads in some situations to separate modelling of the relationship of the response with marginal covariates and of the association among the response variables. This feature can be exploited to shorten the modelling cycle when some quick answer on the marginal behaviour of the covariates is the scientific focus.

In the above, we listed some advantages of the IFM approach.
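The sequential optimization described in advantage 1 can be sketched for the $d = 2$ case of the M-L model of Example 2.17, where the bivariate parameter is the odds ratio. This is only an illustrative sketch: the cell counts, the function names (`plackett_p11`, `golden_max`, `ifm_bivariate_binary`), the parametrization of the margins by $p_j = P(Y_j = 1)$ instead of $\eta_j$, and the golden-section search are assumptions made here and are not taken from the thesis.

```python
import math

def plackett_p11(p1, p2, psi):
    """P(Y1=1, Y2=1) implied by margins p1, p2 and odds ratio psi."""
    if abs(psi - 1.0) < 1e-9:                 # odds ratio 1 means independence
        return p1 * p2
    s = 1.0 + (psi - 1.0) * (p1 + p2)
    disc = s * s - 4.0 * psi * (psi - 1.0) * p1 * p2
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))

def golden_max(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - g * (hi - lo)
    d = lo + g * (hi - lo)
    while hi - lo > tol:
        if f(c) > f(d):
            hi = d                            # maximizer lies in [lo, d]
        else:
            lo = c                            # maximizer lies in [c, hi]
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
    return 0.5 * (lo + hi)

def ifm_bivariate_binary(n11, n10, n01, n00):
    """Two-step IFM fit from the 2x2 table of counts n_{y1 y2}."""
    n = n11 + n10 + n01 + n00
    # Step 1: univariate margins (Bernoulli MLEs are the sample proportions).
    p1 = (n11 + n10) / n
    p2 = (n11 + n01) / n

    # Step 2: bivariate margin, with p1 and p2 held fixed at their estimates.
    def loglik(log_psi):
        a = plackett_p11(p1, p2, math.exp(log_psi))
        return (n11 * math.log(a) + n10 * math.log(p1 - a)
                + n01 * math.log(p2 - a) + n00 * math.log(1.0 - p1 - p2 + a))

    psi = math.exp(golden_max(loglik, -5.0, 5.0))
    return p1, p2, psi

# Hypothetical counts, used only for illustration.
p1_hat, p2_hat, psi_hat = ifm_bivariate_binary(30, 20, 10, 40)
```

Because the $d = 2$ model is saturated, the second-step estimate should agree closely with the sample odds ratio $30 \cdot 40/(20 \cdot 10) = 6$; for larger $d$ the same two-step pattern would be repeated for each bivariate margin.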
In the next section, we study the asymptotic properties of the IFM approach. The remaining question of the efficiency of IFM will be studied in Chapter 4.

2.4 Parameter estimation with IFM and asymptotic results

In this section we will be concerned with the asymptotic properties of the parameter estimates from the IFM approach. We will develop in detail the parameter estimation procedure with the IFM approach for a MCD or MMD model with MUBE or with some parameters of the models having PUBE properties. The situations we consider include models with covariates. Sufficient conditions for the consistency and asymptotic normality of the IFME are given. Some theory concerning the asymptotic variance matrix (Godambe information matrix) for the estimates from the IFM approach is also developed. Detailed direct calculations of the Godambe information matrix for the estimates based on the data and fitted models are given. An alternative computational approach, namely the jackknife method, for the estimation of the Godambe information matrix is given in section 2.5. This technique has considerable importance because of its practical usefulness (see Chapter 5). Later, in section 2.6, we will propose computational algorithms, based on IFM, for parameter estimation where common parameters appear in different inference functions of margins.

2.4.1 Models with no covariates

In this subsection, we confine our discussion to the case of samples of n independent observations from the same distribution. The case of samples of n independent observations from different distributions will be studied in the next subsection. We consider a regular MCD or MMD model in (2.12)
\[
P(y_1\cdots y_d;\theta), \quad \theta \in \Re, \tag{2.39}
\]
where $\theta = (\theta_1,\ldots,\theta_d,\theta_{12},\ldots,\theta_{d-1,d})'$. The model (2.39) is assumed to have MUBE or to have some of its parameters having PUBE properties.
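In general, we assume that $\theta_j$ ($j = 1,\ldots,d$) is a parameter vector for the jth univariate margin of (2.39) such that $P_j(y_j) = P_j(y_j;\theta_j)$, and $\theta_{jk}$ ($1 \le j < k \le d$) is a parameter vector for the $(j,k)$ bivariate margin of (2.39) such that $P_{jk}(y_jy_k) = P_{jk}(y_jy_k;\theta_j,\theta_k,\theta_{jk})$. The situation for models with higher order ($> 2$) parameters is similar; the extension of the results here should be straightforward. For the purpose of illustration, and without loss of generality, we assume in the following that $\theta_j$ and $\theta_{jk}$ are scalar parameters.

Let $Y, Y_1,\ldots,Y_n$ be iid rv's with model (2.39). The loglikelihood functions of margins of $\theta$ are
\[
\begin{aligned}
\ell_{nj}(\theta_j) &= \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1,\ldots,d,\\
\ell_{njk}(\theta_j,\theta_k,\theta_{jk}) &= \sum_{i=1}^n \log P_{jk}(y_{ij}y_{ik}), \quad 1 \le j < k \le d.
\end{aligned} \tag{2.40}
\]
These expressions can also be rewritten as
\[
\begin{aligned}
\ell_{nj}(\theta_j) &= \sum_{\{y_j\}} n_j(y_j) \log P_j(y_j), \quad j = 1,\ldots,d,\\
\ell_{njk}(\theta_j,\theta_k,\theta_{jk}) &= \sum_{\{y_jy_k\}} n_{jk}(y_jy_k) \log P_{jk}(y_jy_k), \quad 1 \le j < k \le d,
\end{aligned} \tag{2.41}
\]
based on the summary data $n_j(y_j)$ and $n_{jk}(y_jy_k)$. In the following we continue to use the expression (2.40) for technical development, for consistency with the case where covariates are present. But (2.41) is a more economical form for computation, and should be used for model fitting whenever possible.

Let
\[
\begin{aligned}
\psi_j = \psi_j(\theta_j) &\stackrel{\rm def}{=} \frac{1}{P_j(y_j)}\frac{\partial P_j(y_j)}{\partial\theta_j}, \quad j = 1,\ldots,d,\\
\psi_{jk} = \psi_{jk}(\theta_j,\theta_k,\theta_{jk}) &\stackrel{\rm def}{=} \frac{1}{P_{jk}(y_jy_k)}\frac{\partial P_{jk}(y_jy_k)}{\partial\theta_{jk}}, \quad 1 \le j < k \le d,\\
\psi_{i;j} = \psi_{i;j}(\theta_j) &\stackrel{\rm def}{=} \frac{1}{P_j(y_{ij})}\frac{\partial P_j(y_{ij})}{\partial\theta_j}, \quad j = 1,\ldots,d,\\
\psi_{i;jk} = \psi_{i;jk}(\theta_j,\theta_k,\theta_{jk}) &\stackrel{\rm def}{=} \frac{1}{P_{jk}(y_{ij}y_{ik})}\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{jk}}, \quad 1 \le j < k \le d,
\end{aligned}
\]
for $i = 1,\ldots,n$. Let $\Psi = \Psi(\theta) = (\psi_1,\ldots,\psi_d,\psi_{12},\ldots,\psi_{d-1,d})'$ and $\Psi_n = \Psi_n(\theta) = (\Psi_{n1},\ldots,\Psi_{nd},\Psi_{n12},\ldots,\Psi_{n,d-1,d})'$, where $\Psi_{nj} = \sum_{i=1}^n \psi_{i;j}$ ($j = 1,\ldots,d$) and $\Psi_{njk} = \sum_{i=1}^n \psi_{i;jk}$ ($1 \le j < k \le d$). From (2.40), we derive the IFM for $\theta$:
\[
\begin{aligned}
\Psi_{nj} &= \sum_{i=1}^n \psi_{i;j}, \quad j = 1,\ldots,d,\\
\Psi_{njk} &= \sum_{i=1}^n \psi_{i;jk}, \quad 1 \le j < k \le d.
\end{aligned} \tag{2.42}
\]
Since (2.39) is a regular model, the regularity conditions of Definition 2.11 are also true for the inference functions (2.42).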
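The equivalence of the per-observation form (2.40) and the summary-data form (2.41) can be checked numerically for a single univariate margin. The sketch below assumes a Poisson margin with a hypothetical rate `lam` and a small made-up sample; the function names are illustrative only.

```python
import math
from collections import Counter

def poisson_logpmf(y, lam):
    """log P_j(y) for a Poisson margin with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def loglik_per_observation(ys, lam):
    """Form (2.40): sum over the n observations."""
    return sum(poisson_logpmf(y, lam) for y in ys)

def loglik_from_counts(ys, lam):
    """Form (2.41): sum over distinct values y_j, weighted by counts n_j(y_j)."""
    counts = Counter(ys)
    return sum(n_y * poisson_logpmf(y, lam) for y, n_y in counts.items())

# Hypothetical sample and parameter value, for illustration only.
ys = [0, 2, 1, 1, 3, 0, 2, 1, 0, 1]
lam = 1.2
l_obs = loglik_per_observation(ys, lam)
l_cnt = loglik_from_counts(ys, lam)
```

The two values agree up to rounding, while the summary form evaluates the probability mass function only once per distinct response value, which is where the computational saving comes from.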
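With the IFM approach, an estimate of $\theta$ that we denote by $\tilde\theta = \tilde\theta(Y_1,\ldots,Y_n) = (\tilde\theta_1,\ldots,\tilde\theta_d,\tilde\theta_{12},\ldots,\tilde\theta_{d-1,d})'$ is obtained by solving the following system of nonlinear equations
\[
\begin{cases}
\Psi_{nj} = 0, & j = 1,\ldots,d,\\
\Psi_{njk} = 0, & 1 \le j < k \le d.
\end{cases}
\]

Properties of estimators

We start by examining the simple case (particularly, for notation) when $d = 2$ to illustrate how the consistency and asymptotic normality of $\tilde\theta$ can be established. The model (2.39) is now
\[
P(y_1y_2;\theta_1,\theta_2,\theta_{12}) \tag{2.43}
\]
with $\theta = (\theta_1,\theta_2,\theta_{12})' \in \Re$. Without loss of generality, $\theta_1,\theta_2,\theta_{12}$ are assumed to be scalar parameters. Let the random vectors $Y = (Y_1,Y_2)'$ and $Y_i = (Y_{i1},Y_{i2})'$ ($i = 1,\ldots,n$) be iid with (2.43), and let $y$, $y_i$ be the observed values of $Y$ and $Y_i$ respectively. Thus the IFM for $\theta_1,\theta_2,\theta_{12}$ are
\[
\Psi_{n1} = \sum_{i=1}^n \psi_{i;1}, \qquad \Psi_{n2} = \sum_{i=1}^n \psi_{i;2}, \qquad \Psi_{n12} = \sum_{i=1}^n \psi_{i;12}. \tag{2.44}
\]
We denote the true value of $\theta = (\theta_1,\theta_2,\theta_{12})'$ by $\theta_0 = (\theta_{1,0},\theta_{2,0},\theta_{12,0})'$. Using Taylor's theorem on (2.44) at $\theta_0$ to the first order, we have
\[
\begin{aligned}
0 = \Psi_{n1}(\tilde\theta_1) &= \Psi_{n1}(\theta_{1,0}) + (\tilde\theta_1 - \theta_{1,0}) \frac{\partial\Psi_{n1}}{\partial\theta_1}\Big|_{\theta_1^*},\\
0 = \Psi_{n2}(\tilde\theta_2) &= \Psi_{n2}(\theta_{2,0}) + (\tilde\theta_2 - \theta_{2,0}) \frac{\partial\Psi_{n2}}{\partial\theta_2}\Big|_{\theta_2^*},\\
0 = \Psi_{n12}(\tilde\theta_{12},\tilde\theta_1,\tilde\theta_2) &= \Psi_{n12}(\theta_{12,0}) + (\tilde\theta_{12} - \theta_{12,0}) \frac{\partial\Psi_{n12}}{\partial\theta_{12}}\Big|_{\theta^{**}}
+ (\tilde\theta_1 - \theta_{1,0}) \frac{\partial\Psi_{n12}}{\partial\theta_1}\Big|_{\theta^{**}}
+ (\tilde\theta_2 - \theta_{2,0}) \frac{\partial\Psi_{n12}}{\partial\theta_2}\Big|_{\theta^{**}},
\end{aligned} \tag{2.45}
\]
where $\theta_1^*$ is some value between $\tilde\theta_1$ and $\theta_{1,0}$, $\theta_2^*$ is some value between $\tilde\theta_2$ and $\theta_{2,0}$, and $\theta^{**}$ is some vector value between $\tilde\theta$ and $\theta_0$. Note that $\Psi_{n12}$ also depends on $\theta_1$ and $\theta_2$. Let
\[
H_n = H_n(\theta) = \begin{pmatrix}
\dfrac{\partial\Psi_{n1}}{\partial\theta_1} & 0 & 0\\
0 & \dfrac{\partial\Psi_{n2}}{\partial\theta_2} & 0\\
\dfrac{\partial\Psi_{n12}}{\partial\theta_1} & \dfrac{\partial\Psi_{n12}}{\partial\theta_2} & \dfrac{\partial\Psi_{n12}}{\partial\theta_{12}}
\end{pmatrix}
\]
and $D_\Psi = D_\Psi(\theta) = E\{n^{-1}H_n\}$. Since (2.43) is assumed to be a regular model, we have that $E(\Psi_n) = 0$ and that $D_\Psi$ is non-singular.

On the right-hand side of (2.45), $\Psi_{n1}$, $\Psi_{n2}$, $\Psi_{n12}$, $\partial\Psi_{n1}/\partial\theta_1$, $\partial\Psi_{n2}/\partial\theta_2$, $\partial\Psi_{n12}/\partial\theta_{12}$, $\partial\Psi_{n12}/\partial\theta_1$ and $\partial\Psi_{n12}/\partial\theta_2$ are all sums of independent identically distributed variates, so that, as $n \to \infty$, each of them divided by $n$ converges to its expectation by the strong law of large numbers. The expectations of $\Psi_{n1}(\theta_{1,0})$, $\Psi_{n2}(\theta_{2,0})$ and $\Psi_{n12}(\theta_{12,0})$ are zero and the expectations of $\partial\Psi_{n1}/\partial\theta_1$, $\partial\Psi_{n2}/\partial\theta_2$ and $\partial\Psi_{n12}/\partial\theta_{12}$ are non-zero by regularity assumptions.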
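Since all terms on the right-hand side must converge to zero to remain equal to the left-hand sides, we see that we must have $(\tilde\theta_1 - \theta_{1,0})$, $(\tilde\theta_2 - \theta_{2,0})$ and $(\tilde\theta_{12} - \theta_{12,0})$ converging to zero as $n \to \infty$, so that $\tilde\theta_1$, $\tilde\theta_2$ and $\tilde\theta_{12}$ are consistent estimators under our assumptions (for a more rigorous proof along these lines, see Cramér 1946, pages 500-504).

Now let
\[
H_n^* = \begin{pmatrix}
\dfrac{\partial\Psi_{n1}}{\partial\theta_1}\Big|_{\theta_1^*} & 0 & 0\\
0 & \dfrac{\partial\Psi_{n2}}{\partial\theta_2}\Big|_{\theta_2^*} & 0\\
\dfrac{\partial\Psi_{n12}}{\partial\theta_1}\Big|_{\theta^{**}} & \dfrac{\partial\Psi_{n12}}{\partial\theta_2}\Big|_{\theta^{**}} & \dfrac{\partial\Psi_{n12}}{\partial\theta_{12}}\Big|_{\theta^{**}}
\end{pmatrix}.
\]
It follows from the convergence in probability of $\tilde\theta$ to $\theta_0$ that
\[
\frac{1}{n}\bigl[H_n(\tilde\theta) - H_n(\theta_0)\bigr] \to_p 0.
\]
Since each element of $n^{-1}H_n(\theta)$ is the mean of n independent identically distributed variates, by the strong law of large numbers, it converges with probability 1 to its expectation. This implies that $n^{-1}H_n(\theta_0)$ converges with probability 1 to $D_\Psi(\theta_0)$. Thus
\[
\frac{1}{n}H_n(\tilde\theta) \to_p D_\Psi(\theta_0).
\]
Now we rewrite (2.45) in the following form:
\[
\sqrt{n}(\tilde\theta - \theta_0) = \Bigl[\frac{1}{n}H_n^*\Bigr]^{-1} \frac{1}{\sqrt{n}}\bigl[-\Psi_n(\theta_0)\bigr]. \tag{2.46}
\]
Since $\theta_1^*$ lies between $\tilde\theta_1$ and $\theta_{1,0}$, $\theta_2^*$ lies between $\tilde\theta_2$ and $\theta_{2,0}$, and $\theta^{**}$ lies between $\tilde\theta$ and $\theta_0$, we also have
\[
\frac{1}{n}H_n^* \to_p D_\Psi(\theta_0).
\]
Along the same lines as the proof of Theorem 2.1, we see that
\[
\frac{1}{\sqrt{n}}\Psi_n(\theta_0) \to_D N_3(0, M_\Psi(\theta_0)),
\]
where $M_\Psi(\theta_0) = E(\Psi\Psi')$. Applying Slutsky's Theorem to (2.46), we obtain
\[
\sqrt{n}(\tilde\theta - \theta_0) \to_D N_3\bigl(0, D_\Psi^{-1}(\theta_0) M_\Psi(\theta_0) (D_\Psi^{-1}(\theta_0))^T\bigr),
\]
or equivalently
\[
\sqrt{n}(\tilde\theta - \theta_0) \to_D N_3\bigl(0, J_\Psi^{-1}(\theta_0)\bigr).
\]
Thus we can extend to the following theorem for the IFME of model (2.43):

Theorem 2.3 Consider the model (2.43) and let the dimension of $\theta_j$ ($j = 1,2$) be $p_j$ and that of $\theta_{12}$ be $p_{12}$. Let $\tilde\theta$ denote the IFME of $\theta$ under the IFM corresponding to (2.44). Then $\tilde\theta$ is a consistent estimator of $\theta$. Furthermore, as $n \to \infty$,
\[
\sqrt{n}(\tilde\theta - \theta) \to_D N_{p_1+p_2+p_{12}}(0, J_\Psi^{-1}),
\]
where $J_\Psi = J_\Psi(\theta) = D_\Psi^T M_\Psi^{-1} D_\Psi$, with $M_\Psi = E\{\Psi\Psi'\}$ and $D_\Psi = E\{\partial\Psi/\partial\theta'\}$. □

Inverting the Godambe information matrix $J_\Psi$ yields the asymptotic variances and covariances of $\tilde\theta = (\tilde\theta_1,\tilde\theta_2,\tilde\theta_{12})$. We provide the calculations for the situation where $\theta_1,\theta_2,\theta_{12}$ are scalars.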
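The asymptotic variance of $\tilde\theta_j$, $j = 1,2$, is
\[
n^{-1}\,[E\psi_j^2]\,\Bigl[E\frac{\partial\psi_j}{\partial\theta_j}\Bigr]^{-2}
\]
and the asymptotic covariance of $\tilde\theta_1,\tilde\theta_2$ is
\[
n^{-1}\,[E\psi_1\psi_2]\,\Bigl[E\frac{\partial\psi_1}{\partial\theta_1}\Bigr]^{-1}\Bigl[E\frac{\partial\psi_2}{\partial\theta_2}\Bigr]^{-1}.
\]
The asymptotic variance of $\tilde\theta_{12}$ is
\[
n^{-1}\Bigl[E\frac{\partial\psi_{12}}{\partial\theta_{12}}\Bigr]^{-2}\Bigl\{ E\psi_{12}^2
+ \sum_{j=1}^2 \Bigl[E\frac{\partial\psi_{12}}{\partial\theta_j}\Bigr]^2\Bigl[E\frac{\partial\psi_j}{\partial\theta_j}\Bigr]^{-2} E\psi_j^2
- 2\sum_{j=1}^2 \Bigl[E\frac{\partial\psi_{12}}{\partial\theta_j}\Bigr]\Bigl[E\frac{\partial\psi_j}{\partial\theta_j}\Bigr]^{-1} E\psi_{12}\psi_j
+ 2\prod_{j=1}^2 \Bigl[E\frac{\partial\psi_{12}}{\partial\theta_j}\Bigr]\Bigl[E\frac{\partial\psi_j}{\partial\theta_j}\Bigr]^{-1} E\psi_1\psi_2 \Bigr\}
\]
and the asymptotic covariance of $\tilde\theta_{12},\tilde\theta_j$ ($j = 1,2$) is
\[
n^{-1}\Bigl[E\frac{\partial\psi_{12}}{\partial\theta_{12}}\Bigr]^{-1}\Bigl[E\frac{\partial\psi_j}{\partial\theta_j}\Bigr]^{-1}\Bigl\{ E\psi_{12}\psi_j
- \sum_{k=1}^2 \Bigl[E\frac{\partial\psi_{12}}{\partial\theta_k}\Bigr]\Bigl[E\frac{\partial\psi_k}{\partial\theta_k}\Bigr]^{-1} E\psi_k\psi_j \Bigr\}.
\]
Furthermore, from the calculation steps leading to the asymptotic variance expression, we can see that $\tilde\theta_1$, $\tilde\theta_2$ and $\tilde\theta_{12}$ are $\sqrt{n}$-consistent.

Now we turn to the general model (2.39) where $d$ is arbitrary. As we can see from the detailed development for $d = 2$, it is straightforward to generalize Theorem 2.3 to the case where $d \ge 3$, since in the general situation of the model (2.39), the corresponding IFM are
\[
\begin{aligned}
\Psi_{nj} &= \sum_{i=1}^n \psi_{i;j}, \quad j = 1,\ldots,d,\\
\Psi_{njk} &= \sum_{i=1}^n \psi_{i;jk}, \quad 1 \le j < k \le d.
\end{aligned}
\]
In (2.39), $\theta_j$ ($j = 1,\ldots,d$) and $\theta_{jk}$ ($1 \le j < k \le d$) can be scalars or vectors, and in the latter case, $\psi_j(\theta_j)$ and $\psi_{jk}(\theta_{jk})$ are function vectors. The asymptotic properties of $\tilde\theta$ for a general $d$ are given by the following theorem:

Theorem 2.4 Consider the model (2.39) and let the dimension of $\theta_j$ ($j = 1,\ldots,d$) be $p_j$ and that of $\theta_{jk}$ ($1 \le j < k \le d$) be $p_{jk}$. Let $\tilde\theta$ denote the IFME of $\theta$ under the IFM corresponding to (2.42). Then $\tilde\theta$ is a consistent estimator of $\theta$. Furthermore, as $n \to \infty$, we have asymptotically
\[
\sqrt{n}(\tilde\theta - \theta) \to_D N_q(0, J_\Psi^{-1}),
\]
where $q = \sum_{j=1}^d p_j + \sum_{1 \le j < k \le d} p_{jk}$, $J_\Psi = J_\Psi(\theta) = D_\Psi^T M_\Psi^{-1} D_\Psi$, with $M_\Psi = M_\Psi(\theta) = E\{\Psi\Psi'\}$ and $D_\Psi = D_\Psi(\theta) = E\{\partial\Psi/\partial\theta'\}$. □

For many models, we have $p_{jk} = 1$; that is, $\theta_{jk}$ is a scalar (see Chapter 3).

The asymptotic variance-covariance of $\tilde\theta$ is expressed in terms of a Godambe information matrix $J_\Psi(\theta)$. Assume $\theta_j$ ($j = 1,\ldots,d$) and $\theta_{jk}$ ($1 \le j < k \le d$) are scalars. Let us see how we can calculate the Godambe information matrix $J_\Psi$.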
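As a small numerical companion to the variance and covariance expressions for the univariate parameters in the $d = 2$ case, the following sketch computes the empirical versions of $E\psi_j\psi_k$ and $E\,\partial\psi_j/\partial\theta_j$ for two Bernoulli margins and combines them in the sandwich form. The Bernoulli parametrization $\theta_j = P(Y_j = 1)$, the data and the function name are assumptions chosen for illustration; the bivariate parameter is left out to keep the example short.

```python
def godambe_univariate_block(data):
    """Sandwich estimate for sqrt(n) * (p1_hat, p2_hat) from binary pairs.

    data: list of (y1, y2) with entries 0 or 1 (hypothetical input format).
    Returns (p, cov), where cov[a][b] = M[a][b] / (D[a] * D[b]).
    """
    n = len(data)
    p = [sum(y[j] for y in data) / n for j in range(2)]

    # psi_{i;j} = (1 / P_j(y_ij)) * dP_j(y_ij)/dp_j for a Bernoulli margin.
    def psi(y, pj):
        return (y - pj) / (pj * (1.0 - pj))

    scores = [[psi(y[j], p[j]) for j in range(2)] for y in data]

    # Empirical M: averages of psi_{i;a} * psi_{i;b}.
    m = [[sum(s[a] * s[b] for s in scores) / n for b in range(2)]
         for a in range(2)]
    # Empirical D (diagonal): minus the average of psi_{i;j}^2.
    d = [-sum(s[j] ** 2 for s in scores) / n for j in range(2)]

    cov = [[m[a][b] / (d[a] * d[b]) for b in range(2)] for a in range(2)]
    return p, cov

# Hypothetical data: 30 pairs (1,1), 20 pairs (1,0), 10 pairs (0,1), 40 pairs (0,0).
data = [(1, 1)] * 30 + [(1, 0)] * 20 + [(0, 1)] * 10 + [(0, 0)] * 40
p_hat, cov_hat = godambe_univariate_block(data)
```

For Bernoulli margins this block reduces to the sample covariance matrix of $(Y_1, Y_2)$, which gives a quick sanity check on an implementation; dividing by $n$ gives the estimated covariance of the estimates themselves.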
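We first calculate the matrix $M_\Psi$ and then $D_\Psi$. Since $M_\Psi = E(\Psi\Psi')$, we only need to calculate its typical components $E(\psi_j\psi_k)$ ($1 \le j,k \le d$), $E(\psi_j\psi_{km})$ ($k < m$), where $j$ may be equal to $k$ or $m$, and $E(\psi_{jk}\psi_{lm})$ ($j < k$, $l < m$), where $j$ may be equal to $l$ and $k$ may be equal to $m$.

For $E(\psi_j\psi_k)$, we have
\[
E(\psi_j\psi_k) = E\Bigl( \frac{1}{P_j(y_j)P_k(y_k)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_k(y_k)}{\partial\theta_k} \Bigr)
= \sum_{\{y_jy_k\}} \frac{P_{jk}(y_jy_k)}{P_j(y_j)P_k(y_k)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_k(y_k)}{\partial\theta_k}.
\]
It can be estimated consistently by
\[
\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j(y_{ij})P_k(y_{ik})} \frac{\partial P_j(y_{ij})}{\partial\theta_j}\frac{\partial P_k(y_{ik})}{\partial\theta_k}\Big|_{\tilde\theta_j,\tilde\theta_k} \tag{2.47}
\]
or equivalently by
\[
\frac{1}{n}\sum_{\{y_jy_k\}} \frac{n_{jk}(y_jy_k)}{P_j(y_j)P_k(y_k)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_k(y_k)}{\partial\theta_k}\Big|_{\tilde\theta_j,\tilde\theta_k}
\]
based on the summary data. For the case $j = k$, we need to replace $P_{jk}(y_jy_k)$ by $P_j(y_j)$, $\{y_jy_k\}$ by $\{y_j\}$ and $n_{jk}(y_jy_k)$ by $n_j(y_j)$ in the above expressions.

For $E(\psi_j\psi_{km})$ ($k < m$), we have
\[
E(\psi_j\psi_{km}) = E\Bigl( \frac{1}{P_j(y_j)P_{km}(y_ky_m)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_{km}(y_ky_m)}{\partial\theta_{km}} \Bigr)
= \sum_{\{y_jy_ky_m\}} \frac{P_{jkm}(y_jy_ky_m)}{P_j(y_j)P_{km}(y_ky_m)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_{km}(y_ky_m)}{\partial\theta_{km}}.
\]
It can be estimated consistently by
\[
\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j(y_{ij})P_{km}(y_{ik}y_{im})} \frac{\partial P_j(y_{ij})}{\partial\theta_j}\frac{\partial P_{km}(y_{ik}y_{im})}{\partial\theta_{km}}\Big|_{\tilde\theta}. \tag{2.48}
\]
For the case $j = k$ or $j = m$, a slight modification of the above expression is necessary. For example, for $j = k$, we need to replace $P_{jkm}(y_jy_ky_m)$ by $P_{jm}(y_jy_m)$, $\{y_jy_ky_m\}$ by $\{y_jy_m\}$ and $n_{jkm}(y_jy_ky_m)$ by $n_{jm}(y_jy_m)$ in the above expressions.

For $E(\psi_{jk}\psi_{lm})$ ($j < k$, $l < m$), we have
\[
E(\psi_{jk}\psi_{lm}) = E\Bigl( \frac{1}{P_{jk}(y_jy_k)P_{lm}(y_ly_m)} \frac{\partial P_{jk}(y_jy_k)}{\partial\theta_{jk}}\frac{\partial P_{lm}(y_ly_m)}{\partial\theta_{lm}} \Bigr)
= \sum_{\{y_jy_ky_ly_m\}} \frac{P_{jklm}(y_jy_ky_ly_m)}{P_{jk}(y_jy_k)P_{lm}(y_ly_m)} \frac{\partial P_{jk}(y_jy_k)}{\partial\theta_{jk}}\frac{\partial P_{lm}(y_ly_m)}{\partial\theta_{lm}}.
\]
It can be estimated consistently by
\[
\frac{1}{n}\sum_{i=1}^n \frac{1}{P_{jk}(y_{ij}y_{ik})P_{lm}(y_{il}y_{im})} \frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{jk}}\frac{\partial P_{lm}(y_{il}y_{im})}{\partial\theta_{lm}}\Big|_{\tilde\theta}. \tag{2.49}
\]
For the particular case $j = l$, $k \ne m$ (or similarly the cases $j = m$, or $k = l$, or $j \ne l$, $k = m$), we need to replace $P_{jklm}(y_jy_ky_ly_m)$ by $P_{jkm}(y_jy_ky_m)$, $\{y_jy_ky_ly_m\}$ by $\{y_jy_ky_m\}$ and $n_{jklm}(y_jy_ky_ly_m)$ by $n_{jkm}(y_jy_ky_m)$ in the above expressions. For the particular case $j = l$, $k = m$, we need to replace $P_{jklm}(y_jy_ky_ly_m)$ by $P_{jk}(y_jy_k)$, $\{y_jy_ky_ly_m\}$ by $\{y_jy_k\}$ and $n_{jklm}(y_jy_ky_ly_m)$ by $n_{jk}(y_jy_k)$ in the above expressions.

Now let us calculate $D_\Psi$.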
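Since $D_\Psi = D_\Psi(\theta)$ is a matrix whose $(p,r)$ element is $E(\partial\psi_p/\partial\theta_r)$ ($1 \le p,r \le q$), where $\psi_p$ is the pth component of $\Psi$ and $\theta_r$ is the rth component of $\theta$, we only need to calculate its typical components $E(\partial\psi_j/\partial\theta_m)$ ($1 \le j,m \le d$), $E(\partial\psi_j/\partial\theta_{lm})$ ($1 \le j \le d$; $1 \le l < m \le d$), $E(\partial\psi_{jk}/\partial\theta_m)$ ($1 \le j < k \le d$; $1 \le m \le d$), and $E(\partial\psi_{jk}/\partial\theta_{lm})$ ($1 \le j < k \le d$; $1 \le l < m \le d$).

Since
\[
\frac{\partial\psi_j}{\partial\theta_m} = -\frac{1}{P_j^2(y_j)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_j(y_j)}{\partial\theta_m} + \frac{1}{P_j(y_j)} \frac{\partial^2 P_j(y_j)}{\partial\theta_j\partial\theta_m},
\]
and the second-derivative term has expectation zero because $\sum_{\{y_j\}} P_j(y_j) = 1$, we have
\[
E\Bigl(\frac{\partial\psi_j}{\partial\theta_m}\Bigr) = -\sum_{\{y_j\}} \frac{1}{P_j(y_j)} \frac{\partial P_j(y_j)}{\partial\theta_j}\frac{\partial P_j(y_j)}{\partial\theta_m}.
\]
It can be estimated consistently by
\[
-\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j^2(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial\theta_j}\frac{\partial P_j(y_{ij})}{\partial\theta_m}\Big|_{\tilde\theta}. \tag{2.50}
\]
Because $P_j$ depends only on the univariate parameter $\theta_j$, $\psi_j$ does not depend on the parameter $\theta_{lm}$. So $\partial\psi_j(\theta_j)/\partial\theta_{lm} = 0$; this also leads to $E(\partial\psi_j/\partial\theta_{lm}) = 0$.

Since
\[
\frac{\partial\psi_{jk}}{\partial\theta_m} = -\frac{1}{P_{jk}^2(y_jy_k)} \frac{\partial P_{jk}(y_jy_k)}{\partial\theta_{jk}}\frac{\partial P_{jk}(y_jy_k)}{\partial\theta_m} + \frac{1}{P_{jk}(y_jy_k)} \frac{\partial^2 P_{jk}(y_jy_k)}{\partial\theta_{jk}\partial\theta_m},
\]
we have
\[
E\Bigl(\frac{\partial\psi_{jk}}{\partial\theta_m}\Bigr) = -\sum_{\{y_jy_k\}} \frac{1}{P_{jk}(y_jy_k)} \frac{\partial P_{jk}(y_jy_k)}{\partial\theta_{jk}}\frac{\partial P_{jk}(y_jy_k)}{\partial\theta_m}.
\]
It can be estimated consistently by
\[
-\frac{1}{n}\sum_{i=1}^n \frac{1}{P_{jk}^2(y_{ij}y_{ik})} \frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{jk}}\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_m}\Big|_{\tilde\theta}. \tag{2.51}
\]
Similarly, we find
\[
E\Bigl(\frac{\partial\psi_{jk}}{\partial\theta_{lm}}\Bigr) =
\begin{cases}
-\displaystyle\sum_{\{y_jy_k\}} \frac{1}{P_{jk}(y_jy_k)} \Bigl(\frac{\partial P_{jk}(y_jy_k)}{\partial\theta_{jk}}\Bigr)^2, & \text{if } (l,m) = (j,k),\\
0, & \text{otherwise},
\end{cases}
\]
where $E(\partial\psi_{jk}/\partial\theta_{jk})$ can be estimated consistently by
\[
-\frac{1}{n}\sum_{i=1}^n \frac{1}{P_{jk}^2(y_{ij}y_{ik})} \Bigl(\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{jk}}\Bigr)^2\Big|_{\tilde\theta_j,\tilde\theta_k,\tilde\theta_{jk}}. \tag{2.52}
\]

2.4.2 Models with covariates

In this subsection, we consider extensions of models to include covariates. Under regularity conditions, the IFME for parameters are shown to be consistent and asymptotically normal, and the form of the asymptotic variance-covariance matrix is derived. One approach for asymptotic results is given in this subsection; a second approach is given in the next subsection. Our objective is to set forth results as simply as possible and in useful forms; more general theorems for multivariate models could be obtained.

Let $Y_1,\ldots,Y_n$ be a sequence of independent random vectors of dimension $d$, which are defined on the probability measure space $(\mathcal{Y},\mathcal{A},P(Y;\theta))$, $\theta \in \Re \subset \mathbb{R}^q$.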
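The marginal probability measure spaces are defined as $(\mathcal{Y}_j,\mathcal{A}_j,P(Y_j;\theta))$ ($j = 1,\ldots,d$) for the jth margin, $(\mathcal{Y}_{jk},\mathcal{A}_{jk},P(Y_j,Y_k;\theta))$ ($1 \le j < k \le d$) for the $(j,k)$ bivariate margin, and so on. Particularly, the random vectors $Y_i$ ($i = 1,\ldots,n$) are assumed to have the distribution
\[
P(y_{i1}\cdots y_{id};\theta_i), \quad \theta_i \in \Re, \tag{2.53}
\]
where $P(y_{i1}\cdots y_{id};\theta_i)$ is a MCD or MMD model with univariate and bivariate expressible (PUBE) parameters $\theta_i = (\theta_{i;1},\ldots,\theta_{i;d},\theta_{i;12},\ldots,\theta_{i;d-1,d})$. We also assume that $\theta_{i;j}$ ($j = 1,\ldots,d$) is the parameter for the jth univariate margin of $Y_i$ such that $P_j(y_{ij})$ depends only on the parameter $\theta_{i;j}$, and $\theta_{i;jk}$ ($1 \le j < k \le d$) is the parameter for the $(j,k)$ bivariate margin of $Y_i$ such that $\theta_{i;jk}$ is the only bivariate parameter appearing in $P_{jk}(y_{ij}y_{ik})$. $\theta_{i;j}$ and $\theta_{i;jk}$ can both be vectors, but for the purpose of illustration, and without loss of generality, we assume they are scalar parameters. Furthermore, we assume for $i = 1,\ldots,n$,
\[
\begin{aligned}
\theta_{i;j} &= g_j(\alpha_j'x_{ij}), \quad j = 1,2,\ldots,d,\\
\theta_{i;jk} &= h_{jk}(\beta_{jk}'w_{ijk}), \quad 1 \le j < k \le d,
\end{aligned} \tag{2.54}
\]
where the functions $g_j(\cdot)$ and $h_{jk}(\cdot)$ are usually monotonic increasing (or decreasing) and differentiable functions (for examples of the choice of $g_j(\cdot)$ and $h_{jk}(\cdot)$ in a specific model, see Chapter 3), called link functions. They link the model parameters to a set of covariates through a functional relationship. In (2.54), $\alpha_j = (\alpha_{j1},\ldots,\alpha_{jp_j})'$ is a $p_j \times 1$ parameter vector, and $x_{ij} = (x_{ij1},\ldots,x_{ijp_j})'$ is a $p_j \times 1$ vector of observable covariates corresponding to the response $y_i$. $\beta_{jk} = (\beta_{jk1},\ldots,\beta_{jkq_{jk}})'$ is a $q_{jk} \times 1$ parameter vector, and $w_{ijk} = (w_{ijk1},\ldots,w_{ijkq_{jk}})'$ is a $q_{jk} \times 1$ vector of observable covariates corresponding to the response $y_i$. Usually $\sum_j p_j + \sum_{j<k} q_{jk}$ is much smaller than the total number of observations $n$. $x_{ij}$ and $w_{ijk}$ may or may not include the same covariate components. In terms of margins, the covariates $x_{ij}$ and $w_{ijk}$ may be margin-dependent, or margin-independent, or partly margin-dependent and partly margin-independent. The marginal parameters may also be margin-dependent or margin-independent.

We consider the problem of how to estimate the regression parameters in the statistical model defined by (2.53) and (2.54). We assume we have a well-structured data set as in Table 1.1. Problems such as the misspecification of the link function, the omission of one or several important covariates, the random nature of some covariates, missing values, and so on will not be dealt with here.

From (2.54) and the PUBE assumption, $P_j$ can be considered as a function of $\alpha_j$, and $P_{jk}$ can be considered as a function of $\alpha_j$, $\alpha_k$ and $\beta_{jk}$. Let $y_1,\ldots,y_n$ be the observed values of $Y_1,\ldots,Y_n$. The loglikelihood functions of margins of the parameter vectors $\alpha_j$ and $\beta_{jk}$ based on $y_1,\ldots,y_n$ are
\[
\begin{aligned}
\ell_{nj}(\alpha_j) &= \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1,\ldots,d,\\
\ell_{njk}(\alpha_j,\alpha_k,\beta_{jk}) &= \sum_{i=1}^n \log P_{jk}(y_{ij}y_{ik}), \quad 1 \le j < k \le d.
\end{aligned}
\]
Let
\[
\begin{aligned}
\varphi_{i;j} = \varphi_{i;j}(\theta_{i;j}) &\stackrel{\rm def}{=} \frac{1}{P_j(y_{ij})}\frac{\partial P_j(y_{ij})}{\partial\theta_{i;j}}, \quad j = 1,\ldots,d,\\
\varphi_{i;jk} = \varphi_{i;jk}(\theta_{i;jk}) &\stackrel{\rm def}{=} \frac{1}{P_{jk}(y_{ij}y_{ik})}\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;jk}}, \quad 1 \le j < k \le d,\\
\varphi_{i;j,j} = \frac{\partial\varphi_{i;j}}{\partial\theta_{i;j}}, \quad
\varphi_{i;jk,j} &= \frac{\partial\varphi_{i;jk}}{\partial\theta_{i;j}}, \quad
\varphi_{i;jk,k} = \frac{\partial\varphi_{i;jk}}{\partial\theta_{i;k}}, \quad
\varphi_{i;jk,jk} = \frac{\partial\varphi_{i;jk}}{\partial\theta_{i;jk}}, \quad 1 \le j < k \le d,
\end{aligned} \tag{2.55}
\]
and
\[
\begin{aligned}
\psi_{i;js} = \psi_{i;js}(\alpha_j) &\stackrel{\rm def}{=} \frac{1}{P_j(y_{ij})}\frac{\partial P_j(y_{ij})}{\partial\alpha_{js}}, \quad j = 1,\ldots,d;\ s = 1,\ldots,p_j,\\
\psi_{i;jkt} = \psi_{i;jkt}(\beta_{jk}) &\stackrel{\rm def}{=} \frac{1}{P_{jk}(y_{ij}y_{ik})}\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_{jkt}}, \quad 1 \le j < k \le d;\ t = 1,\ldots,q_{jk}.
\end{aligned}
\]
Let $\gamma = (\alpha_1',\ldots,\alpha_d',\beta_{12}',\ldots,\beta_{d-1,d}')'$ and $\gamma_0 = (\alpha_{1,0}',\ldots,\alpha_{d,0}',\beta_{12,0}',\ldots,\beta_{d-1,d,0}')'$, where $\gamma_0$ is the true vector value of $\gamma$. Let $\tilde\gamma = (\tilde\alpha_1',\ldots,\tilde\alpha_d',\tilde\beta_{12}',\ldots,\tilde\beta_{d-1,d}')'$ be the IFME. Assume $\gamma$, $\gamma_0$ and $\tilde\gamma$ are all $q \times 1$ vectors. Let
\[
\begin{aligned}
\psi_{i;j} = \psi_{i;j}(\alpha_j) &= (\psi_{i;j1},\ldots,\psi_{i;jp_j})', &
\psi_{i;jk} = \psi_{i;jk}(\beta_{jk}) &= (\psi_{i;jk1},\ldots,\psi_{i;jkq_{jk}})',\\
\Psi_{nj} = \Psi_{nj}(\alpha_j) &= (\Psi_{nj1},\ldots,\Psi_{njp_j})', &
\Psi_{njk} = \Psi_{njk}(\beta_{jk}) &= (\Psi_{njk1},\ldots,\Psi_{njkq_{jk}})',
\end{aligned}
\]
where $\Psi_{njs} = \sum_{i=1}^n \psi_{i;js}$ ($s = 1,\ldots,p_j$) and $\Psi_{njkt} = \sum_{i=1}^n \psi_{i;jkt}$ ($t = 1,\ldots,q_{jk}$). Let $\Psi_i(\gamma) = (\psi_{i;1}(\alpha_1)',\ldots,\psi_{i;d}(\alpha_d)',\psi_{i;12}(\beta_{12})',\ldots,\psi_{i;d-1,d}(\beta_{d-1,d})')'$, $\Psi_n(\gamma) = \sum_{i=1}^n \Psi_i(\gamma)$, and we define
\[
M_n(\gamma) = n^{-1}\sum_{i=1}^n E\bigl(\Psi_i(\gamma)\Psi_i(\gamma)'\bigr) \quad\text{and}\quad D_n(\gamma) = n^{-1}\sum_{i=1}^n E\Bigl[\frac{\partial\Psi_i(\gamma)}{\partial\gamma'}\Bigr].
\]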
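(2.56)

From (2.55), the IFM for $\alpha_j$ and $\beta_{jk}$ are
\[
\begin{aligned}
\Psi_{nj} &= \sum_{i=1}^n \frac{1}{P_j(y_{ij})}\frac{\partial P_j(y_{ij})}{\partial\alpha_j}, \quad j = 1,\ldots,d,\\
\Psi_{njk} &= \sum_{i=1}^n \frac{1}{P_{jk}(y_{ij}y_{ik})}\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_{jk}}, \quad 1 \le j < k \le d,
\end{aligned} \tag{2.57}
\]
and the estimating equations based on IFM are
\[
\begin{cases}
\Psi_{nj} = 0, & j = 1,\ldots,d,\\
\Psi_{njk} = 0, & 1 \le j < k \le d.
\end{cases} \tag{2.58}
\]
With the IFM approach, estimates of $\alpha_j$ and $\beta_{jk}$, denoted by $\tilde\alpha_j = (\tilde\alpha_{j1},\ldots,\tilde\alpha_{jp_j})'$ and $\tilde\beta_{jk} = (\tilde\beta_{jk1},\ldots,\tilde\beta_{jkq_{jk}})'$, are obtained by solving the nonlinear system of equations (2.58). We next give several necessary assumptions for establishing asymptotic properties of $\tilde\alpha_j$ and $\tilde\beta_{jk}$.

Assumptions 2.1 For $1 \le j < k \le d$, $1 \le l < m \le d$ and $i = 1,\ldots,n$, we make the following assumptions:

1. (a) For almost all $y_i \in \mathcal{Y}$ and for all $\theta_i \in \Re$, $\varphi_{i;j}$, $\varphi_{i;jk}$, $\varphi_{i;j,j}$, $\varphi_{i;jk,j}$, $\varphi_{i;jk,k}$, $\varphi_{i;jk,jk}$ exist.

(b) For almost all $y_i \in \mathcal{Y}$ and for every $\theta_i \in \Re$,
\[
\begin{gathered}
\Bigl|\frac{\partial P_j(y_{ij})}{\partial\theta_{i;j}}\Bigr| < K_{i;j}, \quad
\Bigl|\frac{\partial^2 P_j(y_{ij})}{\partial\theta_{i;j}^2}\Bigr| < S_{i;j}, \quad
\Bigl|\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;j}}\Bigr| < L_{i;j}, \quad
\Bigl|\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;k}}\Bigr| < L_{i;k}, \quad
\Bigl|\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;jk}}\Bigr| < L_{i;jk},\\
\Bigl|\frac{\partial^2 P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;j}^2}\Bigr| < T_{i;j}, \quad
\Bigl|\frac{\partial^2 P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;k}^2}\Bigr| < T_{i;k}, \quad
\Bigl|\frac{\partial^2 P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;jk}^2}\Bigr| < T_{i;jk},\\
\Bigl|\frac{\partial^2 P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;jk}\partial\theta_{i;j}}\Bigr| < T_{i;jk,j}, \quad
\Bigl|\frac{\partial^2 P_{jk}(y_{ij}y_{ik})}{\partial\theta_{i;jk}\partial\theta_{i;k}}\Bigr| < T_{i;jk,k},
\end{gathered}
\]
where $K_{i;j}$, $L_{i;j}$, $L_{i;k}$, $L_{i;jk}$, $S_{i;j}$, $T_{i;j}$, $T_{i;k}$, $T_{i;jk}$, $T_{i;jk,j}$ and $T_{i;jk,k}$ are integrable (summable) over $\mathcal{Y}$, and are all bounded by some positive constants.

2. (a)
\[
\sum_{i=1}^n \sum_{\{y_{ij}:\,|\varphi_{i;j}| > n\}} P_j(y_{ij}) = o_p(1), \qquad
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk}| > n\}} P_{jk}(y_{ij}y_{ik}) = o_p(1),
\]
and
\[
\begin{aligned}
\sum_{i=1}^n \sum_{\{y_{ij}:\,|\varphi_{i;j}| \le n\}} \varphi_{i;j}^2 P_j(y_{ij}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,\sup(|\varphi_{i;j}|,|\varphi_{i;k}|) \le n\}} \varphi_{i;j}\varphi_{i;k} P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{il},y_{im}:\,\sup(|\varphi_{i;j}|,|\varphi_{i;lm}|) \le n\}} \varphi_{i;j}\varphi_{i;lm} P_{jlm}(y_{ij}y_{il}y_{im}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk}| \le n\}} \varphi_{i;jk}^2 P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik},y_{il},y_{im}:\,\sup(|\varphi_{i;jk}|,|\varphi_{i;lm}|) \le n\}} \varphi_{i;jk}\varphi_{i;lm} P_{jklm}(y_{ij}y_{ik}y_{il}y_{im}) &= o_p(n^2).
\end{aligned}
\]

(b)
\[
\begin{aligned}
\sum_{i=1}^n \sum_{\{y_{ij}:\,|\varphi_{i;j,j}| > n\}} P_j(y_{ij}) &= o_p(1), &
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk,j}| > n\}} P_{jk}(y_{ij}y_{ik}) &= o_p(1),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk,k}| > n\}} P_{jk}(y_{ij}y_{ik}) &= o_p(1), &
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk,jk}| > n\}} P_{jk}(y_{ij}y_{ik}) &= o_p(1),
\end{aligned}
\]
and
\[
\begin{aligned}
\sum_{i=1}^n \sum_{\{y_{ij}:\,|\varphi_{i;j,j}| \le n\}} \varphi_{i;j,j}^2 P_j(y_{ij}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk,j}| \le n\}} \varphi_{i;jk,j}^2 P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk,k}| \le n\}} \varphi_{i;jk,k}^2 P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,|\varphi_{i;jk,jk}| \le n\}} \varphi_{i;jk,jk}^2 P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,\sup(|\varphi_{i;j,j}|,|\varphi_{i;k,k}|) \le n\}} \varphi_{i;j,j}\varphi_{i;k,k} P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{il},y_{im}:\,\sup(|\varphi_{i;j,j}|,|\varphi_{i;lm,l}|) \le n\}} \varphi_{i;j,j}\varphi_{i;lm,l} P_{jlm}(y_{ij}y_{il}y_{im}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{il},y_{im}:\,\sup(|\varphi_{i;j,j}|,|\varphi_{i;lm,lm}|) \le n\}} \varphi_{i;j,j}\varphi_{i;lm,lm} P_{jlm}(y_{ij}y_{il}y_{im}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik}:\,\sup(|\varphi_{i;jk,j}|,|\varphi_{i;jk,k}|) \le n\}} \varphi_{i;jk,j}\varphi_{i;jk,k} P_{jk}(y_{ij}y_{ik}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik},y_{il},y_{im}:\,\sup(|\varphi_{i;jk,j}|,|\varphi_{i;lm,lm}|) \le n\}} \varphi_{i;jk,j}\varphi_{i;lm,lm} P_{jklm}(y_{ij}y_{ik}y_{il}y_{im}) &= o_p(n^2),\\
\sum_{i=1}^n \sum_{\{y_{ij},y_{ik},y_{il},y_{im}:\,\sup(|\varphi_{i;jk,jk}|,|\varphi_{i;lm,lm}|) \le n\}} \varphi_{i;jk,jk}\varphi_{i;lm,lm} P_{jklm}(y_{ij}y_{ik}y_{il}y_{im}) &= o_p(n^2).
\end{aligned}
\]
□

Assumptions 2.2 Let $\mathcal{G} \subseteq \mathbb{R}^q$ be the parameter space of $\gamma$, where $\gamma$ is assumed to be a vector of length $q$. Suppose $\mathcal{G}$ is an open convex set.

1. Each $g_j(\cdot)$ and $h_{jk}(\cdot)$ is twice continuously differentiable.

2. (a) The covariates are uniformly bounded, that is, there exists an $M_0$ such that $\|x_{ij}\| \le M_0$, $\|w_{ijk}\| \le M_0$. Furthermore, $\sum_i x_{ij}x_{ij}^T$ and $\sum_i w_{ijk}w_{ijk}^T$ have full rank for $n \ge n_0$, where $n_0$ is some positive integer value.

(b) In the neighborhood of the true $\gamma$ defined by $B(\delta) = \{\gamma^* \in \mathcal{G} : \|\gamma^* - \gamma\| \le \delta\}$, $\delta \downarrow 0$, $g_j(\cdot)$, $g_j'(\cdot)$, $g_j''(\cdot)$, $h_{jk}(\cdot)$, $h_{jk}'(\cdot)$ and $h_{jk}''(\cdot)$ are bounded away from zero. □

Assumption 2.3 For all $\epsilon > 0$ and for any fixed vector $\|u\| \ne 0$, the following condition is satisfied:
\[
\lim_{n\to\infty} \frac{1}{n\,u'M_n(\gamma_0)u} \sum_{i=1}^n \sum_{\{y_i:\,|u'\Psi_i(\gamma_0)| > \epsilon (n\,u'M_n(\gamma_0)u)^{1/2}\}} \bigl(u'\Psi_i(\gamma_0)\bigr)^2 P(y_i;\gamma_0) = 0.
\]
□

Assumptions 2.1 and 2.2 are needed so that we may apply some weak law of large numbers theorem to prove the consistency of the estimates. Assumption 2.3 is needed for applying the central limit theorem for deriving the asymptotic normality of the estimates.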
These conditions appear to be complicated, but for special cases they are often not difficult to verify. For instance, for the models we will use in Chapter 5, if the covariates are bounded and have full rank in the sense of Assumptions 2.2, then with an appropriate choice of the link functions the conditions are almost automatically satisfied.

In the statistical literature, Bradley and Gart (1962) studied the asymptotic properties of maximum likelihood estimates (MLE's) where the observations are independent but not identically distributed (i.n.i.d.). Hoadley (1971) established general conditions (for cases where the observations are i.n.i.d. and there is a lack of densities with respect to Lebesgue or counting measure) under which maximum likelihood estimators are consistent and asymptotically normal. Assumptions 2.1, 2.2 and 2.3 are closely related to the conditions in Bradley and Gart (1962) and Hoadley (1971), but adapted to multivariate discrete models with MPME property. In particular, Assumptions 2.1 reflect the uniform integrability concept (for discrete models) employed in Hoadley (1971).

Properties of estimators

As with the model (2.39) with no covariates, we first develop the asymptotic results for the simplest situation, $d = 2$, in which $Y_i = (Y_{i1},Y_{i2})'$ ($i = 1,\ldots,n$) has the distribution
\[
P(y_{i1}y_{i2};\theta_i), \tag{2.59}
\]
where $\theta_i = (\theta_{i;1},\theta_{i;2},\theta_{i;12})$. Without loss of generality, $\theta_{i;1}$, $\theta_{i;2}$, $\theta_{i;12}$ are assumed to be scalar parameters. We further assume
\[
\begin{aligned}
\theta_{i;j} &= g_j(\alpha_j'x_{ij}), \quad j = 1,2,\\
\theta_{i;12} &= h_{12}(\beta_{12}'w_{i12}),
\end{aligned} \tag{2.60}
\]
where the functions $g_j(\cdot)$ and $h_{12}(\cdot)$ are monotonic increasing (or decreasing) and differentiable functions. In (2.60), $\alpha_j = (\alpha_{j1},\ldots,\alpha_{jp_j})'$ is a $p_j \times 1$ parameter vector and $x_{ij} = (x_{ij1},\ldots,x_{ijp_j})'$ is a $p_j \times 1$ vector of covariates corresponding to the response variables $Y_{ij}$ ($j = 1,2$).
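Similarly, $\beta_{12} = (\beta_{121},\ldots,\beta_{12q_{12}})'$ is a $q_{12} \times 1$ parameter vector and $w_{i12} = (w_{i121},\ldots,w_{i12q_{12}})'$ is a $q_{12} \times 1$ vector of covariates corresponding to the response vector $y_i$, where $y_i = (y_{i1},y_{i2})$ is the observed value of $Y_i$.

Theorem 2.5 Consider the model (2.59) together with (2.60) and let $\tilde\gamma = (\tilde\alpha_1',\tilde\alpha_2',\tilde\beta_{12}')'$ denote the IFME of $\gamma$ corresponding to IFM (2.57). Assume $\tilde\gamma$ is a $q \times 1$ vector. Under Assumptions 2.1 and 2.2, $\tilde\gamma$ is a consistent estimator of $\gamma_0$. Furthermore, under Assumptions 2.1, 2.2 and 2.3, as $n \to \infty$, we have
\[
\sqrt{n}A_n^{-1}(\tilde\gamma - \gamma_0) \to_D N_q(0,I),
\]
where $A_n = D_n^{-1/2}(\gamma_0) M_n^{1/2}(\gamma_0) (D_n^{-1/2}(\gamma_0))^T$, with $D_n(\gamma_0)$ and $M_n(\gamma_0)$ defined in (2.56).

Proof: Using Taylor's theorem to the first order, we have
\[
\begin{aligned}
\Psi_{n1}(\tilde\alpha_1) &= \Psi_{n1}(\alpha_{1,0}) + \frac{\partial\Psi_{n1}}{\partial\alpha_1}\Big|_{\alpha_1^*}(\tilde\alpha_1 - \alpha_{1,0}),\\
\Psi_{n2}(\tilde\alpha_2) &= \Psi_{n2}(\alpha_{2,0}) + \frac{\partial\Psi_{n2}}{\partial\alpha_2}\Big|_{\alpha_2^*}(\tilde\alpha_2 - \alpha_{2,0}),\\
\Psi_{n12}(\tilde\beta_{12}) &= \Psi_{n12}(\beta_{12,0}) + \frac{\partial\Psi_{n12}}{\partial\beta_{12}}\Big|_{\gamma^{**}}(\tilde\beta_{12} - \beta_{12,0})
+ \frac{\partial\Psi_{n12}}{\partial\alpha_1}\Big|_{\gamma^{**}}(\tilde\alpha_1 - \alpha_{1,0})
+ \frac{\partial\Psi_{n12}}{\partial\alpha_2}\Big|_{\gamma^{**}}(\tilde\alpha_2 - \alpha_{2,0}),
\end{aligned} \tag{2.61}
\]
where $\alpha_1^*$ is some vector value between $\tilde\alpha_1$ and $\alpha_{1,0}$, $\alpha_2^*$ is some vector value between $\tilde\alpha_2$ and $\alpha_{2,0}$, and $\gamma^{**}$ is some vector value between $\tilde\gamma$ and $\gamma_0$. Note that in (2.61), $\Psi_{n12}(\tilde\beta_{12})$ also depends on $\tilde\alpha_1$ and $\tilde\alpha_2$, and $\Psi_{n12}(\beta_{12,0})$ depends on $\alpha_{1,0}$ and $\alpha_{2,0}$. Furthermore, in (2.61), we have
\[
\begin{aligned}
\frac{\partial\Psi_{n1}(\alpha_1)}{\partial\alpha_1} &= \sum_{i=1}^n \bigl[\varphi_{i;1,1}(\theta_{i;1})\,(g_1'(\alpha_1'x_{i1}))^2 + \varphi_{i;1}(\theta_{i;1})\,g_1''(\alpha_1'x_{i1})\bigr]\,x_{i1}x_{i1}^T,\\
\frac{\partial\Psi_{n2}(\alpha_2)}{\partial\alpha_2} &= \sum_{i=1}^n \bigl[\varphi_{i;2,2}(\theta_{i;2})\,(g_2'(\alpha_2'x_{i2}))^2 + \varphi_{i;2}(\theta_{i;2})\,g_2''(\alpha_2'x_{i2})\bigr]\,x_{i2}x_{i2}^T,\\
\frac{\partial\Psi_{n12}(\beta_{12})}{\partial\beta_{12}} &= \sum_{i=1}^n \bigl[\varphi_{i;12,12}(\theta_{i;12})\,(h_{12}'(\beta_{12}'w_{i12}))^2 + \varphi_{i;12}(\theta_{i;12})\,h_{12}''(\beta_{12}'w_{i12})\bigr]\,w_{i12}w_{i12}^T,\\
\frac{\partial\Psi_{n12}(\beta_{12})}{\partial\alpha_1} &= \sum_{i=1}^n \frac{\partial\varphi_{i;12}(\theta_{i;12})}{\partial\theta_{i;1}}\,g_1'(\alpha_1'x_{i1})\,h_{12}'(\beta_{12}'w_{i12})\,w_{i12}x_{i1}^T,\\
\frac{\partial\Psi_{n12}(\beta_{12})}{\partial\alpha_2} &= \sum_{i=1}^n \frac{\partial\varphi_{i;12}(\theta_{i;12})}{\partial\theta_{i;2}}\,g_2'(\alpha_2'x_{i2})\,h_{12}'(\beta_{12}'w_{i12})\,w_{i12}x_{i2}^T,
\end{aligned} \tag{2.62}
\]
and $\partial\Psi_{n1}(\alpha_1)/\partial\alpha_2 = 0$, $\partial\Psi_{n1}(\alpha_1)/\partial\beta_{12} = 0$, $\partial\Psi_{n2}(\alpha_2)/\partial\alpha_1 = 0$ and $\partial\Psi_{n2}(\alpha_2)/\partial\beta_{12} = 0$.

To establish the consistency and asymptotic normality of $\tilde\gamma$, note that with Assumptions 2.1, we have that $n^{-2}E(\Psi_{nj}(\alpha_{j,0})\Psi_{nj}(\alpha_{j,0})^T) \to 0$ ($j = 1,2$), and $n^{-2}E(\Psi_{n12}(\beta_{12,0})\Psi_{n12}(\beta_{12,0})^T) \to 0$. By the assumed monotonicity (e.g.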
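increasing) and differentiability of $g_j(\cdot)$ and $h_{12}(\cdot)$, $g_j'(\cdot)$ is a non-negative function of the unknown $\alpha_j$ and the given $x_{ij}$, and $h_{12}'(\cdot)$ is a non-negative function of the unknown $\beta_{12}$ and the given $w_{i12}$. From Assumptions 2.1 and 2.2, $\partial\Psi_{nj}(\alpha_j)/\partial\alpha_j$ ($j = 1,2$) and $\partial\Psi_{n12}(\beta_{12})/\partial\beta_{12}$ have full rank. By Markov's weak law of large numbers (see for instance Petrov 1995, p.134), we obtain that $n^{-1}\Psi_{n1}(\alpha_{1,0}) \to_p 0$, $n^{-1}\Psi_{n2}(\alpha_{2,0}) \to_p 0$ and $n^{-1}\Psi_{n12}(\beta_{12,0}) \to_p 0$. Since $\Psi_{n1}(\tilde\alpha_1) = 0$, $\Psi_{n2}(\tilde\alpha_2) = 0$ and $\Psi_{n12}(\tilde\beta_{12}) = 0$, by following similar arguments as for the consistency of $\tilde\theta$ in the model (2.39), we establish the consistency of $\tilde\gamma$.

Now let
\[
\Psi_n(\gamma) = \begin{pmatrix} \Psi_{n1}(\alpha_1)\\ \Psi_{n2}(\alpha_2)\\ \Psi_{n12}(\beta_{12}) \end{pmatrix}, \qquad
H_n(\gamma) = \begin{pmatrix}
\dfrac{\partial\Psi_{n1}}{\partial\alpha_1} & 0 & 0\\
0 & \dfrac{\partial\Psi_{n2}}{\partial\alpha_2} & 0\\
\dfrac{\partial\Psi_{n12}}{\partial\alpha_1} & \dfrac{\partial\Psi_{n12}}{\partial\alpha_2} & \dfrac{\partial\Psi_{n12}}{\partial\beta_{12}}
\end{pmatrix}
\]
and
\[
H_n^* = \begin{pmatrix}
\dfrac{\partial\Psi_{n1}}{\partial\alpha_1}\Big|_{\alpha_1^*} & 0 & 0\\
0 & \dfrac{\partial\Psi_{n2}}{\partial\alpha_2}\Big|_{\alpha_2^*} & 0\\
\dfrac{\partial\Psi_{n12}}{\partial\alpha_1}\Big|_{\gamma^{**}} & \dfrac{\partial\Psi_{n12}}{\partial\alpha_2}\Big|_{\gamma^{**}} & \dfrac{\partial\Psi_{n12}}{\partial\beta_{12}}\Big|_{\gamma^{**}}
\end{pmatrix}.
\]
(2.61) can be rewritten in the following matrix form:
\[
\sqrt{n}(\tilde\gamma - \gamma_0) = \Bigl[\frac{1}{n}H_n^*\Bigr]^{-1} \frac{1}{\sqrt{n}}\bigl[-\Psi_n(\gamma_0)\bigr]. \tag{2.63}
\]
It follows from the convergence in probability of $\tilde\gamma$ to $\gamma_0$ that
\[
\frac{1}{n}\bigl[H_n(\tilde\gamma) - H_n(\gamma_0)\bigr] \to_p 0.
\]
Since each element of $n^{-1}H_n(\gamma_0)$ is the mean of n independent variates, under some conditions for the law of large numbers, we have
\[
\frac{1}{n}H_n(\gamma_0) - D_n(\gamma_0) \to_p 0, \tag{2.64}
\]
where $D_n(\gamma_0) = n^{-1}E\{H_n(\gamma_0)\}$. Assumptions 2.1 and 2.2 imply that
\[
\lim_{n\to\infty} n^{-2}\,{\rm Var}(H_n(\gamma_0)) = 0. \tag{2.65}
\]
Thus by Markov's weak law of large numbers, we derive that
\[
\frac{1}{n}H_n^* - D_n(\gamma_0) \to_p 0. \tag{2.66}
\]
Next, we note that $\Psi_n(\gamma_0)$ involves independent centered summands. Therefore, we may directly appeal to the Lindeberg-Feller central limit theorem to establish its asymptotic normality. From Assumption 2.3, by the Lindeberg-Feller central limit theorem, we have
\[
\frac{n^{-1/2}\,u'\Psi_n(\gamma_0)}{(u'M_n(\gamma_0)u)^{1/2}} \to_D N(0,1).
\]
Applying Slutsky's Theorem to (2.63), we obtain
\[
\sqrt{n}A_n^{-1}(\tilde\gamma - \gamma_0) \to_D N_q(0,I),
\]
where $A_n = D_n^{-1/2}(\gamma_0) M_n^{1/2}(\gamma_0) (D_n^{-1/2}(\gamma_0))^T$, and $I$ is a $q \times q$ identity matrix. □

Next we turn to the general model (2.53) where $d$ is arbitrary. With Assumptions 2.1, 2.2 and 2.3, Theorem 2.5 can be generalized to the case $d > 2$.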
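To make the univariate part of the estimating equations (2.58) concrete, the sketch below solves $\Psi_{nj} = 0$ for one Poisson margin with the log link $\lambda_{ij} = \exp(\alpha_j'x_{ij})$, as in the covariate version of Example 2.16. The Newton-Raphson iteration, the restriction to two regression coefficients, the starting value and the data are all assumptions made for this illustration and are not the thesis's algorithm.

```python
import math

def solve_poisson_margin(ys, xs, max_iter=50, tol=1e-12):
    """Solve sum_i (y_i - lam_i) x_i = 0 with lam_i = exp(alpha' x_i).

    xs: list of 2-component covariate vectors (first component 1 for the
    intercept). Only the two-parameter case is handled, to keep the
    linear solve explicit.
    """
    alpha = [math.log(sum(ys) / len(ys)), 0.0]   # start at the intercept-only fit
    for _ in range(max_iter):
        u = [0.0, 0.0]                           # score vector Psi_nj
        h = [[0.0, 0.0], [0.0, 0.0]]             # minus its derivative matrix
        for y, x in zip(ys, xs):
            lam = math.exp(alpha[0] * x[0] + alpha[1] * x[1])
            for a in range(2):
                u[a] += (y - lam) * x[a]
                for b in range(2):
                    h[a][b] += lam * x[a] * x[b]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        step = [(h[1][1] * u[0] - h[0][1] * u[1]) / det,
                (h[0][0] * u[1] - h[1][0] * u[0]) / det]
        alpha = [alpha[0] + step[0], alpha[1] + step[1]]
        if abs(step[0]) + abs(step[1]) < tol:
            break
    return alpha

# Hypothetical data: a binary covariate, two observations per covariate level.
xs = [(1.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
ys = [1, 3, 4, 8]
alpha_hat = solve_poisson_margin(ys, xs)
```

With a single binary covariate the fitted means equal the group means (2 and 6 here), so the solution can be checked by hand against $\log 2$ and $\log 3$. In the IFM scheme, estimates of this kind would then be held fixed while each bivariate equation $\Psi_{njk} = 0$ is solved for $\beta_{jk}$.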
Compared with the $d = 2$ situation, the IFM for the general model (2.53) is a system of equations (2.58), which does not introduce any complication in terms of the asymptotic properties for the IFME. Thus we have the following:

Theorem 2.6 Consider the general model (2.53) with arbitrary $d$. Let $\tilde\gamma$ denote the IFME of $\gamma$ under the IFM (2.58). Under Assumptions 2.1 and 2.2, $\tilde\gamma$ is a consistent estimator of $\gamma$. Furthermore, under Assumptions 2.1, 2.2 and 2.3, as $n \to \infty$, we have
\[
\sqrt{n}A_n^{-1}(\tilde\gamma - \gamma_0) \to_D N_q(0,I),
\]
where $A_n = D_n^{-1/2}(\gamma_0) M_n^{1/2}(\gamma_0) (D_n^{-1/2}(\gamma_0))^T$, with $D_n(\gamma_0)$ and $M_n(\gamma_0)$ defined in (2.56). □

We now calculate the matrices $M_n(\gamma)$ and $D_n(\gamma)$ (or just part of the matrices, depending on the need) corresponding to the IFM for $\alpha_j$ and $\beta_{jk}$. For example, suppose $\alpha_{jl}$ is a parameter appearing in $P_j(y_{ij})$ and $\alpha_{km}$ is a parameter appearing in $P_k(y_{ik})$. Then the element of the matrix $M_n(\gamma)$ corresponding to the parameters $\alpha_{jl}$ and $\alpha_{km}$ can be estimated by
\[
\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j(y_{ij})P_k(y_{ik})} \frac{\partial P_j(y_{ij})}{\partial\alpha_{jl}}\frac{\partial P_k(y_{ik})}{\partial\alpha_{km}}\Big|_{\tilde\alpha_j,\tilde\alpha_k}, \tag{2.67}
\]
where $j$ may be equal to $k$. If $\alpha_{jl}$ is a parameter appearing in $P_j(y_{ij})$, and $\beta_{kms}$ is a parameter appearing in $P_{km}(y_{ik}y_{im})$, then the element of the matrix $M_n(\gamma)$ corresponding to the parameters $\alpha_{jl}$ and $\beta_{kms}$ can be estimated by
\[
\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j(y_{ij})P_{km}(y_{ik}y_{im})} \frac{\partial P_j(y_{ij})}{\partial\alpha_{jl}}\frac{\partial P_{km}(y_{ik}y_{im})}{\partial\beta_{kms}}\Big|_{\tilde\alpha_j,\tilde\alpha_k,\tilde\alpha_m,\tilde\beta_{km}}, \tag{2.68}
\]
where $k < m$ and $j$ may be equal to $k$ or $m$. Furthermore, if $\beta_{jks}$ is a parameter appearing in $P_{jk}(y_{ij}y_{ik})$ and $\beta_{lmt}$ is a parameter appearing in $P_{lm}(y_{il}y_{im})$, then the element of the matrix $M_n(\gamma)$ corresponding to the parameters $\beta_{jks}$ and $\beta_{lmt}$ can be estimated by
\[
\frac{1}{n}\sum_{i=1}^n \frac{1}{P_{jk}(y_{ij}y_{ik})P_{lm}(y_{il}y_{im})} \frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_{jks}}\frac{\partial P_{lm}(y_{il}y_{im})}{\partial\beta_{lmt}}\Big|_{\tilde\alpha_j,\tilde\alpha_k,\tilde\alpha_l,\tilde\alpha_m,\tilde\beta_{jk},\tilde\beta_{lm}}, \tag{2.69}
\]
where $j < k$, $l < m$ and $(j,k) = (l,m)$ is allowed.
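The estimate (2.67) is an average of products of per-observation scores, which can be sketched directly. The example below assumes two Poisson margins with log links, for which $(1/P_j)\,\partial P_j/\partial\alpha_{jl} = (y_{ij} - \lambda_{ij})x_{ijl}$; the function names, the data and the parameter values at which the scores are evaluated are hypothetical (in practice the IFME $\tilde\alpha_j$, $\tilde\alpha_k$ would be plugged in).

```python
import math

def poisson_score(y, x, alpha):
    """Vector (1/P) dP/d(alpha) for a Poisson margin with log link."""
    lam = math.exp(sum(a * xi for a, xi in zip(alpha, x)))
    return [(y - lam) * xi for xi in x]

def m_block(ys_j, xs_j, alpha_j, ys_k, xs_k, alpha_k):
    """Empirical block of M_n for (alpha_j, alpha_k), in the spirit of (2.67)."""
    n = len(ys_j)
    pj, pk = len(alpha_j), len(alpha_k)
    block = [[0.0] * pk for _ in range(pj)]
    for i in range(n):
        sj = poisson_score(ys_j[i], xs_j[i], alpha_j)
        sk = poisson_score(ys_k[i], xs_k[i], alpha_k)
        for l in range(pj):
            for m in range(pk):
                block[l][m] += sj[l] * sk[m] / n
    return block

# Hypothetical two-observation data set with an intercept and one covariate.
xs = [(1.0, 0.0), (1.0, 1.0)]
block = m_block([2, 0], xs, [0.0, 0.0], [1, 3], xs, [0.0, 0.0])
```

Setting `ys_k`, `xs_k` and `alpha_k` equal to the margin-$j$ arguments gives the within-margin block ($j = k$); the blocks involving $\beta_{jks}$ in (2.68) and (2.69) have the same structure, with the bivariate scores in place of one or both univariate scores.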
For the elements of the matrix $D_n(\gamma)$, suppose $\alpha_{jl}$ and $\alpha_{jm}$ are parameters appearing in $P_j(y_{ij})$. Then the element of the matrix $D_n(\gamma)$ corresponding to the expression $\partial\Psi_{nj}(\alpha_{jl})/\partial\alpha_{jm}$ can be estimated by
$$-\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j^2(y_{ij})}\,\frac{\partial P_j(y_{ij})}{\partial\alpha_{jl}}\,\frac{\partial P_j(y_{ij})}{\partial\alpha_{jm}}\Bigg|_{\hat\alpha_j}. \tag{2.70}$$
If $\alpha_{jl}$ is a parameter appearing in $P_j(y_{ij})$ and $\beta_{jks}$ is a parameter appearing in $P_{jk}(y_{ij}y_{ik})$, then the element of the matrix $D_n(\gamma)$ corresponding to the expression $\partial\Psi_{nj}(\alpha_{jl})/\partial\beta_{jks}$ is 0. If $\beta_{jks}$ is a parameter appearing in $P_{jk}(y_{ij}y_{ik})$ and $\alpha_l$ is a parameter corresponding to a univariate margin, then the element of the matrix $D_n(\gamma)$ corresponding to the expression $\partial\Psi_{njks}(\beta_{jks})/\partial\alpha_l$ can be estimated by
$$-\frac{1}{n}\sum_{i=1}^n \frac{1}{P_{jk}^2(y_{ij}y_{ik})}\,\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_{jks}}\,\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\alpha_l}. \tag{2.71}$$
If $\beta_{jks}$ is a parameter appearing in $P_{jk}(y_{ij}y_{ik})$ and $\beta_t$ is a parameter corresponding to a bivariate margin, then the element of the matrix $D_n(\gamma)$ corresponding to the expression $\partial\Psi_{njks}(\beta_{jks})/\partial\beta_t$ can be estimated by
$$-\frac{1}{n}\sum_{i=1}^n \frac{1}{P_{jk}^2(y_{ij}y_{ik})}\,\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_{jks}}\,\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_t}. \tag{2.72}$$
However, as in section 2.5 and the data analysis examples in Chapter 5, it is easier to use the jackknife technique to estimate $M_n(\gamma)$ and $D_n(\gamma)$.

2.4.3 Asymptotic results for the models assuming a joint distribution for response vector and covariates

The asymptotic developments in subsection 2.4.2 treat $x_{i1},\ldots,x_{id}$ and $w_{i12},\ldots,w_{i,d-1,d}$ as known constant vectors. An alternative would be to consider the covariates as marginal stochastic outcomes of a vector $V_i$, and to consider the distribution of the random vector formed by the response vector together with the covariate vector. Then the model (2.53) can be interpreted as a conditional model, i.e., (2.53) is the conditional mass function given the covariate vectors, where the $Y_i$ are conditionally independent. Specifically, let $Z_i = (Y_i', V_i')'$, $i = 1,\ldots,n$, be iid with distribution $F_\delta$ belonging to a family $\mathcal{F} = \{F_\delta, \delta \in \Delta\}$.
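Suppose that the distributions $F_\delta$ possess densities or mass functions (or mixtures of density functions and mass functions) with support $\mathcal{Z}$; denote these functions by $f(z; \delta)$. Let $\delta = (\gamma', \eta')'$, where $\gamma = (\alpha_1',\ldots,\alpha_d',\beta_{12}',\ldots,\beta_{d-1,d}')'$. Assume that the conditional distribution of $Y_i$ given $V_i = v_i$,
$$P(y_i; \theta_i \mid V_i = v_i), \tag{2.73}$$
is as in (2.53), that is, $P(y_i; \theta_i \mid V_i = v_i)$ is a MCD or MMD model with univariate and bivariate expressible parameters (PUBE)
$$\theta_i = (\theta_1(\alpha_1, v_i),\ldots,\theta_d(\alpha_d, v_i),\theta_{12}(\beta_{12}, v_i),\ldots,\theta_{d-1,d}(\beta_{d-1,d}, v_i)),$$
where $\theta_i$ is assumed to be completely independent of $\eta$ (defined below), and is a function of $\gamma$. We also assume that $\theta_j$ ($j = 1,\ldots,d$) is the parameter for the $j$th univariate margin $P_j(y_{ij})$ of $Y_i$ from the conditional distribution (2.73), such that $P_j(y_{ij})$ depends only on the parameter $\theta_j$, and $\theta_{jk}$ ($1 \le j < k \le d$) is the parameter for the $(j, k)$ bivariate margin $P_{jk}(y_{ij}y_{ik})$ of $Y_i$ from the conditional distribution (2.73), such that $\theta_{jk}$ is the only bivariate parameter. In contrast to (2.54), here we only need to assume that $\theta_j$ is a function of $\alpha_j$ and $v$, $j = 1, 2,\ldots,d$, and $\theta_{jk}$ is a function of $\beta_{jk}$ and $v$, $1 \le j < k \le d$, without explicitly imposing the type of link functions in (2.54). The marginal distribution of $V_i$ is assumed to depend only on the parameter vector $\eta$, which is treated as a nuisance parameter vector. Its density or mass function (or the mixture of the two) is denoted by $g(v_i; \eta)$. Thus
$$f(z_i; \delta) = P(y_i; \theta_i \mid V_i = v_i)\,g(v_i; \eta). \tag{2.74}$$
Under this framework, weak assumptions on the support of $V_i$, in particular not requiring any special feature of the link function, are sufficient for consistency and asymptotic normality of the IFME in the conditional sense.

Let the density of $Z = (Y', V')'$ be $f(z; \delta)$, and that of $V$ be $g(v; \eta)$. Based on our assumptions on $P(y_i; \theta_i \mid V_i = v_i)$, we see that the joint marginal distribution of $Y_j$ and $V$ is
$$P_{j,V} = P_j(y_j \mid V = v)\,g(v; \eta)$$
and the joint marginal distribution of $Y_j$, $Y_k$ and $V$ is
$$P_{jk,V} = P_{jk}(y_jy_k \mid V = v)\,g(v; \eta).$$
For notational simplicity, in the following, we simply write $P_j(y_j)$ and $P_{jk}(y_jy_k)$ in lieu of $P_j(y_j \mid V = v)$ and $P_{jk}(y_jy_k \mid V = v)$. The corresponding marginal distributions for $Z_i$ are
$$P_{j,V_i} = P_j(y_{ij})\,g(v_i; \eta), \qquad P_{jk,V_i} = P_{jk}(y_{ij}y_{ik})\,g(v_i; \eta).$$
Thus we obtain a set of loglikelihood functions of margins for $\gamma$:
$$\begin{aligned}
\ell_{nj} &= \sum_{i=1}^n \log P_{j,V_i} = \sum_{i=1}^n \log P_j(y_{ij}) + \sum_{i=1}^n \log g(v_i; \eta), && j = 1,\ldots,d,\\
\ell_{njk} &= \sum_{i=1}^n \log P_{jk,V_i} = \sum_{i=1}^n \log P_{jk}(y_{ij}y_{ik}) + \sum_{i=1}^n \log g(v_i; \eta), && 1 \le j < k \le d.
\end{aligned} \tag{2.75}$$
Let
$$\begin{aligned}
\omega_{j,s} &= \omega_{j,s}(\alpha_{js}) \overset{\mathrm{def}}{=} \frac{1}{P_{j,V}}\frac{\partial P_{j,V}}{\partial\alpha_{js}}, && j = 1,\ldots,d;\ s = 1,\ldots,p_j,\\
\omega_{jk,t} &= \omega_{jk,t}(\beta_{jkt}) \overset{\mathrm{def}}{=} \frac{1}{P_{jk,V}}\frac{\partial P_{jk,V}}{\partial\beta_{jkt}}, && 1 \le j < k \le d;\ t = 1,\ldots,q_{jk},
\end{aligned}$$
and
$$\begin{aligned}
\omega_{i;j,s} &= \omega_{i;j,s}(\alpha_{js}) \overset{\mathrm{def}}{=} \frac{1}{P_{j,V_i}}\frac{\partial P_{j,V_i}}{\partial\alpha_{js}}, && j = 1,\ldots,d;\ s = 1,\ldots,p_j,\\
\omega_{i;jk,t} &= \omega_{i;jk,t}(\beta_{jkt}) \overset{\mathrm{def}}{=} \frac{1}{P_{jk,V_i}}\frac{\partial P_{jk,V_i}}{\partial\beta_{jkt}}, && 1 \le j < k \le d;\ t = 1,\ldots,q_{jk},
\end{aligned}$$
for $i = 1,\ldots,n$. From (2.75), we derive the IFM for $\alpha_j$ and $\beta_{jk}$:
$$\begin{aligned}
\Omega_{nj,s} &= \sum_{i=1}^n \omega_{i;j,s} = \sum_{i=1}^n \frac{1}{P_{j,V_i}}\frac{\partial P_{j,V_i}}{\partial\alpha_{js}} = \sum_{i=1}^n \frac{1}{P_j(y_{ij})}\frac{\partial P_j(y_{ij})}{\partial\alpha_{js}} = \sum_{i=1}^n \psi_{i;j,s}, && j = 1,\ldots,d;\ s = 1,\ldots,p_j,\\
\Omega_{njk,t} &= \sum_{i=1}^n \omega_{i;jk,t} = \sum_{i=1}^n \frac{1}{P_{jk,V_i}}\frac{\partial P_{jk,V_i}}{\partial\beta_{jkt}} = \sum_{i=1}^n \frac{1}{P_{jk}(y_{ij}y_{ik})}\frac{\partial P_{jk}(y_{ij}y_{ik})}{\partial\beta_{jkt}} = \sum_{i=1}^n \psi_{i;jk,t}, && 1 \le j < k \le d;\ t = 1,\ldots,q_{jk},
\end{aligned} \tag{2.76}$$
and the estimating equations for $\alpha_j$ and $\beta_{jk}$ based on IFM are
$$\begin{aligned}
\Omega_{nj,s} &= 0, && j = 1,\ldots,d;\ s = 1,\ldots,p_j,\\
\Omega_{njk,t} &= 0, && 1 \le j < k \le d;\ t = 1,\ldots,q_{jk}.
\end{aligned} \tag{2.77}$$
With the IFM approach, estimates of $\alpha_j$, $j = 1,\ldots,d$, and $\beta_{jk}$, $1 \le j < k \le d$, denoted by $\hat\alpha_j = (\hat\alpha_{j1},\ldots,\hat\alpha_{j,p_j})'$ and $\hat\beta_{jk} = (\hat\beta_{jk1},\ldots,\hat\beta_{jk,q_{jk}})'$, are obtained by solving the nonlinear system of equations (2.77). Note that (2.77) is computationally equivalent to (2.57); they both lead to the same numerical solutions (assuming the link functions given in (2.54)).

Let $\Omega_{nj} = (\Omega_{nj,1},\ldots,\Omega_{nj,p_j})'$ and $\Omega_{njk} = (\Omega_{njk,1},\ldots,\Omega_{njk,q_{jk}})'$. Then (2.76) can be rewritten in function vector form
$$\begin{cases}\Omega_{nj} = 0, & j = 1,\ldots,d,\\ \Omega_{njk} = 0, & 1 \le j < k \le d.\end{cases}$$
Let $\Omega_j = (\omega_{j,1},\ldots,\omega_{j,p_j})'$ and $\Omega_{jk} = (\omega_{jk,1},\ldots,\omega_{jk,q_{jk}})'$. Let $\Omega = (\Omega_1',\ldots,\Omega_d',\Omega_{12}',\ldots,\Omega_{d-1,d}')'$. Let $M_\Omega = E(\Omega\Omega')$, $D_\Omega = E(\partial\Omega/\partial\gamma)$.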
Under some regularity conditions similar to subsection 2.4.1, consistency and asymptotic normality for $\hat\gamma$ can be established. Basically, the assumptions that we need are those making $\Omega$ a regular inference function vector.

Assumptions 2.3
1. The support $\mathcal{Z}$ of $Z$ does not depend on any $\delta \in \Delta$.
2. $E_\delta(\Omega) = 0$;
3. The partial derivative $\partial\Omega/\partial\gamma$ exists for almost every $z \in \mathcal{Z}$;
4. Assume $\Omega$ and $\gamma$ are $q \times 1$ vectors, and their components have subindex $j = 1,\ldots,q$. The order of integration and differentiation may be interchanged as follows:
$$\frac{\partial}{\partial\gamma_i}\int_{\mathcal{Z}} \omega_j f(z; \delta)\,d\mu(z) = \int_{\mathcal{Z}} \frac{\partial}{\partial\gamma_i}\left[\omega_j f(z; \delta)\right]d\mu(z);$$
5. $E_\delta\{\Omega\Omega'\} \in \mathbb{M}_{q\times q}$ and the $q \times q$ matrix $M_\Omega = E_\delta\{\Omega\Omega'\}$ is positive-definite.
6. The $q \times q$ matrix $D_\Omega = E_\delta\{\partial\Omega(\delta)/\partial\gamma\}$ is non-singular. □

Assumptions 2.3 are equivalent to assuming that $M_n(\gamma)$ in (2.56) is positive-definite and $D_n(\gamma)$ in (2.56) is non-singular for certain $n > n_0$, where $n_0$ is a fixed integer. We have the following theorem.

Theorem 2.7 Consider the model (2.73) and let $\hat\gamma$ denote the IFME of $\gamma$ under the IFM (2.77). Under Assumptions 2.3, $\hat\gamma$ is a consistent estimator of $\gamma$. Furthermore, as $n \to \infty$, we have asymptotically
$$\sqrt{n}(\hat\gamma - \gamma_0) \xrightarrow{D} N_q(0, J_\Omega^{-1}),$$
where $J_\Omega = D_\Omega^T M_\Omega^{-1} D_\Omega$.

Proof. Under the model (2.74) and the Assumptions 2.3, the proof is similar to that of Theorem 2.3 and 2.4. □

We believe this approach for deriving asymptotic properties of an estimate has not appeared in the statistical literature. The assumptions are suitable for an observational study but not an experimental study in which one has control of the $v$'s. Theorem 2.7 is different from Theorem 2.6 in that $M_\Omega$ and $D_\Omega$ both depend on the distribution function of $V$. Nevertheless, because $\omega_{i;j,s} = \psi_{i;j,s}$ and $\omega_{i;jk,t} = \psi_{i;jk,t}$, the numerical evaluations of $M_\Omega$ and $M_n(\gamma)$ in (2.59) based on data are the same, because only the empirical distribution for $V$ is needed. For example, suppose $\alpha_l$ is a parameter of $P_j(y_{ij})$ and $\alpha_m$ a parameter of $P_k(y_{ik})$; then the element of the matrix $M_\Omega$ corresponding to the parameters $\alpha_l$ and $\alpha_m$ can be estimated by
$$\frac{1}{n}\sum_{i=1}^n \frac{1}{P_j(y_{ij})P_k(y_{ik})}\,\frac{\partial P_j(y_{ij})}{\partial\alpha_l}\,\frac{\partial P_k(y_{ik})}{\partial\alpha_m},$$
which is the same as (2.67). We can similarly obtain (2.68) and (2.69). The same result is true for $D_\Omega$ versus $D_n(\gamma)$ in (2.56); they both lead to the same numerical results based on the data. We thus derive the same formulas (2.70)-(2.72) for numerical evaluation of $D_\Omega$.

2.5 The Jackknife approach for the variance of IFME

The calculation of the Godambe information matrix based on regular IFM for the models (2.39) and (2.53) is straightforward in terms of symbolic representation. However, the actual computation of the Godambe information matrix requires many derivatives of first and second order, and in terms of computer implementation, considerable programming effort would be required. With this consideration, an alternative jackknife procedure for the calculation of the IFME asymptotic variance is developed. The jackknife idea is simple, but very useful, especially in cases where the analytical answer is very difficult to obtain or computationally very complex. This procedure has the advantage of general computational feasibility.

In this section, we show that our jackknife method for calculating the asymptotic variance matrix of $\hat\theta$ is asymptotically equivalent to using the inverse of the Godambe information matrix. We examine the situation for models with covariates and with no covariates. Our main interest in using the jackknife is to obtain the SE of an estimate and not bias correction (because for multivariate statistical inference the data set cannot be small), though several results about jackknife parameter estimation are also given.
The jackknife estimate of variance may be preferred when the appropriate computer code is not available to compute the Godambe information matrix or there are other complications such as the need to calculate the asymptotic variance of a function of an estimator. Some numerical comparisons of the Godambe information and the jackknife variance estimate based on simulations are reported in Chapter 4. The jackknife procedure is demonstrated to be satisfactory.

Some general references about jackknife methods are Quenouille (1956), Miller (1974), Efron and Stein (1981), and Efron (1982), among others. A recent reference on the use of jackknife estimators of variance for parameter estimates from estimating equations is Lipsitz et al. (1994), though their one-step jackknife estimator is not as general as what we are studying here, and their main application is to clustered survival data.

In the following, we assume that we can partition the $n$ observations into $g$ groups with $m$ observations each so that $n = m \times g$, where $m$ is an integer. We discuss two situations for applying the jackknife idea: leaving out one observation at a time and leaving out more than one observation at a time.

2.5.1 Jackknife approach for models with no covariates

Let $Y, Y_1,\ldots,Y_n$ be iid rv's from a regular discrete model $P(y_1\cdots y_d; \theta)$, $\theta \in \Re$, in (2.12), and $y, y_1,\ldots,y_n$ be their observed values respectively. Let $\Psi(\theta) = (\psi_1(\theta),\ldots,\psi_q(\theta))$ be the IFM based on $y$, $\Psi_i(\theta) = (\psi_{i;1}(\theta),\ldots,\psi_{i;q}(\theta))$ be the IFM based on $y_i$, and $\Psi_n(\theta) = (\Psi_{n1}(\theta),\ldots,\Psi_{nq}(\theta))$ be the IFM based on $y_1,\ldots,y_n$, where $\Psi_{nj}(\theta) = \sum_{i=1}^n \psi_{i;j}(\theta)$ ($j = 1,\ldots,q$). Let $\hat\theta = (\hat\theta_1,\ldots,\hat\theta_q)$ be the estimate of $\theta$ from $\Psi_n(\theta) = 0$.

Leave out one observation at a time

Let $\hat\theta_{(i)}$ be an estimate of $\theta$ based on the same set of inference functions $\Psi$ but with the $i$th observation $y_i$ from the data set $y_1,\ldots,y_n$ deleted, $i = 1,\ldots,n$. In this situation, we have $m = 1$
That is, we delete one group of size 1 each time and calculate the same estimate of 0 based on the remaining n — 1 observations. Let 0,- = nO — (n — 1)0(«), and 0(.) = X2"=1 0({)/n. 0,- are called "pseudo-values" in the literature. The jackknife estimate of 0 is defined as -!)*(•)• (2-78) i = l The jackknife estimator has the property that it eliminates the order l/n term from a bias of the form E{6) = 0 + u\/n + U2/n2 + • • •, where the functions ui,U2,... do not depend upon n (see Miller 1974). In fact, the jackknife statistic Oj often has less bias than the original statistic 0. The early version of the jackknife estimate of variance for the jackknife estimator Oj was sug gested by Tukey (1958). It is defined as Vj = ^nhf) ~ ~6j)ih ~ h)T = ^ B*« - *<•>)(*«> - *0)T- (2J9) In (2.79), the pseudo-values 0,- are treated as if they are independent and identically distributed, disregarding the fact that the pseudo-values 0t- actually depend on the common observations. To justify (2.79), Thorburn (1976) proved that under rather natural conditions on the original statistic 0, the pseudo-values are asymptotically uncorrelated. In practice, the pseudo-values have been used to estimate the variance not only of the jackknife estimate 0j, but also 0. But if the bias correction in jackknife estimate is not needed, we propose to use a simpler estimate of the asymptotic variance (matrix) of 0: ^ = D*(o-Wo-*)T- (2-8°) «=1 In our context, unless stated otherwise, we always call Vj defined by (2.80) the jackknife estimate of variance. In the following, we first prove that the asymptotic distribution of a jackknife statistic 6j is the same as the original statistic 0 under the same model assumptions; subsequently, we prove that Vj defined by (2.80) is a consistent estimator of inverse of the Godambe information matrix MO). Theorem 2.8 Under the same assumptions as in Theorem 2.4, the jackknife estimate 0j in (2.78) has the same asymptotic distribution as 0. 
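That is, as $n \to \infty$,
$$\sqrt{n}(\hat\theta_J - \theta) \xrightarrow{D} N_q(0, J_\Psi^{-1}(\theta)),$$
where $J_\Psi(\theta) = D_\Psi^T(\theta)M_\Psi^{-1}(\theta)D_\Psi(\theta)$, with $M_\Psi(\theta) = E_\theta\{\Psi(\theta)\Psi^T(\theta)\}$ and $D_\Psi(\theta) = E_\theta\{\partial\Psi(\theta)/\partial\theta\}$.

Proof. We sketch the proof. The function $\Psi_n(\theta) = (\Psi_{n1}(\theta),\ldots,\Psi_{nq}(\theta)) : \mathcal{Y}^n \times \Re \to \mathbb{R}^q$ has the following expansion around $\theta$:
$$0 = \Psi_n(\hat\theta) = \Psi_n(\theta) + H_n(\theta)(\hat\theta - \theta) + R_n,$$
where $H_n(\theta) = \partial\Psi_n(\theta)/\partial\theta$ is a $q \times q$ matrix and $R_n = O_p(\|\hat\theta - \theta\|^2) = O_p(n^{-1})$ by assumptions. Thus
$$\sqrt{n}(\hat\theta - \theta) = \left[\frac{1}{n}H_n(\theta)\right]^{-1}\frac{1}{\sqrt{n}}\left(-\Psi_n(\theta) - R_n\right).$$
Let $\Psi_{(i)}(\theta)$ be $\Psi_n(\theta)$ calculated without the $i$th observation, and $H_{(i)}(\theta)$ be the $q \times q$ matrix $\partial\Psi_n(\theta)/\partial\theta$ calculated without the $i$th observation. Similarly, we have
$$\sqrt{n-1}(\hat\theta_{(i)} - \theta) = \left[\frac{1}{n-1}H_{(i)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-1}}\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right),$$
where $R_{n-1,i} = O_p(\|\hat\theta_{(i)} - \theta\|^2) = O_p(n^{-1})$. Since
$$\sqrt{n}(\tilde\theta_i - \theta) = n\left(\sqrt{n}(\hat\theta - \theta) - \sqrt{n}(\hat\theta_{(i)} - \theta)\right) + \sqrt{n}(\hat\theta_{(i)} - \theta),$$
we have
$$\sqrt{n}(\hat\theta_J - \theta) = n\left(\sqrt{n}(\hat\theta - \theta) - \frac{1}{n}\sum_{i=1}^n \sqrt{n}(\hat\theta_{(i)} - \theta)\right) + \frac{1}{n}\sum_{i=1}^n \sqrt{n}(\hat\theta_{(i)} - \theta).$$
Thus
$$\begin{aligned}
\sqrt{n}(\hat\theta_J - \theta) = {} & n\left\{\left[\frac{1}{n}H_n(\theta)\right]^{-1}\frac{1}{\sqrt{n}}\left(-\Psi_n(\theta) - R_n\right) - \frac{1}{n}\sqrt{\frac{n}{n-1}}\sum_{i=1}^n\left[\frac{1}{n-1}H_{(i)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-1}}\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right)\right\}\\
& + \frac{1}{n}\sqrt{\frac{n}{n-1}}\sum_{i=1}^n\left[\frac{1}{n-1}H_{(i)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-1}}\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right).
\end{aligned}$$
By the Law of Large Numbers, we have
$$\frac{1}{n}H_n(\theta) \xrightarrow{p} D_\Psi(\theta) \quad\text{and}\quad \frac{1}{n-1}H_{(i)}(\theta) \xrightarrow{p} D_\Psi(\theta).$$
From the central limit theorem,
$$\frac{1}{\sqrt{n}}\Psi_n(\theta) \xrightarrow{D} N_q(0, M_\Psi).$$
This, together with the $\sqrt{n}$-consistency of $\hat\theta$ (Theorem 2.4), implies that the first term (in braces, multiplied by $n$) of the last expansion converges to 0 in probability, and that
$$\frac{1}{n}\sqrt{\frac{n}{n-1}}\sum_{i=1}^n\left[\frac{1}{n-1}H_{(i)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-1}}\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right) \xrightarrow{D} N_q\left(0, D_\Psi^{-1}M_\Psi(D_\Psi^{-1})^T\right).$$
Thus by applying Slutsky's Theorem, we obtain
$$\sqrt{n}(\hat\theta_J - \theta) \xrightarrow{D} N_q(0, J_\Psi^{-1}(\theta)). \quad □$$

Theorem 2.9 Under the same assumptions as in Theorem 2.4, the sample size $n$ times the jackknife estimate of variance $V_J$ defined by (2.80) is a consistent estimator of $J_\Psi^{-1}(\theta)$.

Proof. We have
$$\begin{aligned}
\sum_{i=1}^n (\hat\theta_{(i)} - \hat\theta)(\hat\theta_{(i)} - \hat\theta)^T = {} & \sum_{i=1}^n (\hat\theta_{(i)} - \theta)(\hat\theta_{(i)} - \theta)^T - \left[\sum_{i=1}^n (\hat\theta_{(i)} - \theta)\right](\hat\theta - \theta)^T\\
& - (\hat\theta - \theta)\left[\sum_{i=1}^n (\hat\theta_{(i)} - \theta)\right]^T + n(\hat\theta - \theta)(\hat\theta - \theta)^T.
\end{aligned}$$
Recall that
$$\hat\theta - \theta = H_n^{-1}(\theta)\left(-\Psi_n(\theta) - R_n\right)$$
and
$$\hat\theta_{(i)} - \theta = H_{(i)}^{-1}(\theta)\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right),$$
where $R_n = O_p(\|\hat\theta - \theta\|^2) = O_p(n^{-1})$ and $R_{n-1,i} = O_p(\|\hat\theta_{(i)} - \theta\|^2) = O_p(n^{-1})$. Thus
$$\begin{aligned}
\sum_{i=1}^n (\hat\theta_{(i)} - \hat\theta)(\hat\theta_{(i)} - \hat\theta)^T = {} & \sum_{i=1}^n H_{(i)}^{-1}(\theta)\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right)\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right)^T\left(H_{(i)}^{-1}(\theta)\right)^T\\
& - \left[\sum_{i=1}^n H_{(i)}^{-1}(\theta)\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right)\right]\left(-\Psi_n(\theta) - R_n\right)^T\left(H_n^{-1}(\theta)\right)^T\\
& - H_n^{-1}(\theta)\left(-\Psi_n(\theta) - R_n\right)\left[\sum_{i=1}^n H_{(i)}^{-1}(\theta)\left(-\Psi_{(i)}(\theta) - R_{n-1,i}\right)\right]^T\\
& + nH_n^{-1}(\theta)\left(-\Psi_n(\theta) - R_n\right)\left(-\Psi_n(\theta) - R_n\right)^T\left(H_n^{-1}(\theta)\right)^T.
\end{aligned} \tag{2.81}$$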
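As $\Psi_{(i)}(\theta) = \Psi_n(\theta) - \Psi_i(\theta)$, we have $\sum_{i=1}^n \Psi_{(i)}(\theta) = (n-1)\Psi_n(\theta)$ and $\sum_{i=1}^n \Psi_{(i)}(\theta)\Psi_{(i)}^T(\theta) = (n-2)\Psi_n(\theta)\Psi_n^T(\theta) + \sum_{i=1}^n \Psi_i(\theta)\Psi_i^T(\theta)$. By the Law of Large Numbers, we have
$$\frac{1}{n}H_n(\theta) \xrightarrow{p} D_\Psi(\theta), \qquad \frac{1}{n-1}H_{(i)}(\theta) \xrightarrow{p} D_\Psi(\theta)$$
and
$$\frac{1}{n}\sum_{i=1}^n \Psi_i(\theta)\Psi_i^T(\theta) \xrightarrow{p} M_\Psi(\theta).$$
From (2.81) and these limits, it follows that
$$n\sum_{i=1}^n (\hat\theta_{(i)} - \hat\theta)(\hat\theta_{(i)} - \hat\theta)^T \xrightarrow{p} D_\Psi^{-1}(\theta)M_\Psi(\theta)\left(D_\Psi^{-1}(\theta)\right)^T.$$
In other words, we proved that $nV_J$ is a consistent estimator of $J_\Psi^{-1}(\theta)$. □

Leave out more than one observation at a time

Now for general $g > 1$, we assume a random subdivision of $y_1,\ldots,y_n$ into $g$ groups ($n = gm$). Let $\hat\theta_{(\nu)}$ ($\nu = 1,\ldots,g$) be an estimate of $\theta$ based on the same set of inference functions $\Psi$ from the data set $y_1,\ldots,y_n$ but deleting the $\nu$-th group of size $m$ ($m$ is fixed); thus $\hat\theta_{(\nu)}$ is calculated based on a subsample of size $m(g-1)$. The jackknife estimate of $\theta$ in this general setting is the mean of the $\tilde\theta_\nu$, which is
$$\hat\theta_J = \frac{1}{g}\sum_{\nu=1}^g \tilde\theta_\nu = g\hat\theta - (g-1)\hat\theta_{(\cdot)}, \tag{2.82}$$
where $\hat\theta_{(\cdot)} = g^{-1}\sum_{\nu=1}^g \hat\theta_{(\nu)}$, and $\tilde\theta_\nu = g\hat\theta - (g-1)\hat\theta_{(\nu)}$ ($\nu = 1,\ldots,g$) are the pseudo-values. In this situation, the jackknife estimate of variance for $\hat\theta$, $V_J$, is defined as
$$V_J = \sum_{\nu=1}^g (\hat\theta_{(\nu)} - \hat\theta)(\hat\theta_{(\nu)} - \hat\theta)^T. \tag{2.83}$$
Theorems 2.8 and 2.9 can be easily generalized to the situation with fixed $m > 1$.

Theorem 2.10 Under the same assumptions as in Theorem 2.4, the jackknife estimate $\hat\theta_J$ defined by (2.82) with $m$ fixed has the same asymptotic distribution as $\hat\theta$. That is, as $n \to \infty$ (thus $g \to \infty$),
$$\sqrt{n}(\hat\theta_J - \theta) \xrightarrow{D} N_q(0, J_\Psi^{-1}(\theta)),$$
where $J_\Psi(\theta) = D_\Psi^T(\theta)M_\Psi^{-1}(\theta)D_\Psi(\theta)$, with $M_\Psi(\theta) = E_\theta\{\Psi(\theta)\Psi^T(\theta)\}$ and $D_\Psi(\theta) = E_\theta\{\partial\Psi(\theta)/\partial\theta\}$.

Proof. We sketch the proof. Let $\Psi_{(\nu)}(\theta)$ be $\Psi_n(\theta)$ calculated without the $\nu$th group, and $H_{(\nu)}(\theta)$ be the $q \times q$ matrix $\partial\Psi_n(\theta)/\partial\theta$ calculated without the $\nu$th group. We have
$$\sqrt{n-m}(\hat\theta_{(\nu)} - \theta) = \left[\frac{1}{n-m}H_{(\nu)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-m}}\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right),$$
where $R_{n-m,\nu} = O_p(\|\hat\theta_{(\nu)} - \theta\|^2) = O_p(n^{-1})$. Recall that
$$\sqrt{g}(\tilde\theta_\nu - \theta) = g\left(\sqrt{g}(\hat\theta - \theta) - \sqrt{g}(\hat\theta_{(\nu)} - \theta)\right) + \sqrt{g}(\hat\theta_{(\nu)} - \theta);$$
we thus have
$$\sqrt{g}(\hat\theta_J - \theta) = g\left(\sqrt{g}(\hat\theta - \theta) - \frac{1}{g}\sum_{\nu=1}^g \sqrt{g}(\hat\theta_{(\nu)} - \theta)\right) + \frac{1}{g}\sum_{\nu=1}^g \sqrt{g}(\hat\theta_{(\nu)} - \theta).$$
This implies that
$$\sqrt{n}(\hat\theta_J - \theta) = g\left[\sqrt{n}(\hat\theta - \theta) - \frac{1}{g}\sqrt{\frac{n}{n-m}}\sum_{\nu=1}^g \sqrt{n-m}(\hat\theta_{(\nu)} - \theta)\right] + \frac{1}{g}\sqrt{\frac{n}{n-m}}\sum_{\nu=1}^g \sqrt{n-m}(\hat\theta_{(\nu)} - \theta).$$
Thus
$$\begin{aligned}
\sqrt{n}(\hat\theta_J - \theta) = {} & g\left\{\left[\frac{1}{n}H_n(\theta)\right]^{-1}\frac{1}{\sqrt{n}}\left(-\Psi_n(\theta) - R_n\right) - \frac{1}{g}\sqrt{\frac{n}{n-m}}\sum_{\nu=1}^g\left[\frac{1}{n-m}H_{(\nu)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-m}}\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right)\right\}\\
& + \frac{1}{g}\sqrt{\frac{n}{n-m}}\sum_{\nu=1}^g\left[\frac{1}{n-m}H_{(\nu)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-m}}\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right).
\end{aligned} \tag{2.84}$$
By the Law of Large Numbers, we have
$$\frac{1}{n}H_n(\theta) \xrightarrow{p} D_\Psi(\theta) \quad\text{and}\quad \frac{1}{n-m}H_{(\nu)}(\theta) \xrightarrow{p} D_\Psi(\theta).$$
From the central limit theorem,
$$\frac{1}{\sqrt{n}}\Psi_n(\theta) \xrightarrow{D} N_q(0, M_\Psi).$$
This, together with the $\sqrt{n}$-consistency of $\hat\theta$ (Theorem 2.4), implies that the first term of (2.84) (in braces, multiplied by $g$) converges to 0 in probability, and that
$$\frac{1}{g}\sqrt{\frac{n}{n-m}}\sum_{\nu=1}^g\left[\frac{1}{n-m}H_{(\nu)}(\theta)\right]^{-1}\frac{1}{\sqrt{n-m}}\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right) \xrightarrow{D} N_q\left(0, D_\Psi^{-1}M_\Psi(D_\Psi^{-1})^T\right).$$
Applying Slutsky's Theorem to (2.84), we obtain
$$\sqrt{n}(\hat\theta_J - \theta) \xrightarrow{D} N_q(0, J_\Psi^{-1}(\theta)). \quad □$$

Theorem 2.11 Under the same assumptions as in Theorem 2.4, the sample size $n$ times the jackknife estimate of variance $V_J$ defined by (2.83) is a consistent estimator of $J_\Psi^{-1}(\theta)$, when $m$ is fixed and $g \to \infty$.

Proof. We use the same notation as in the proof of Theorem 2.10. We have
$$\begin{aligned}
\sum_{\nu=1}^g (\hat\theta_{(\nu)} - \hat\theta)(\hat\theta_{(\nu)} - \hat\theta)^T = {} & \sum_{\nu=1}^g (\hat\theta_{(\nu)} - \theta)(\hat\theta_{(\nu)} - \theta)^T - \left[\sum_{\nu=1}^g (\hat\theta_{(\nu)} - \theta)\right](\hat\theta - \theta)^T\\
& - (\hat\theta - \theta)\left[\sum_{\nu=1}^g (\hat\theta_{(\nu)} - \theta)\right]^T + g(\hat\theta - \theta)(\hat\theta - \theta)^T
\end{aligned}$$
and
$$\hat\theta_{(\nu)} - \theta = H_{(\nu)}^{-1}(\theta)\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right).$$
Thus
$$\begin{aligned}
\sum_{\nu=1}^g (\hat\theta_{(\nu)} - \hat\theta)(\hat\theta_{(\nu)} - \hat\theta)^T = {} & \sum_{\nu=1}^g H_{(\nu)}^{-1}(\theta)\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right)\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right)^T\left(H_{(\nu)}^{-1}(\theta)\right)^T\\
& - \left[\sum_{\nu=1}^g H_{(\nu)}^{-1}(\theta)\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right)\right]\left(-\Psi_n(\theta) - R_n\right)^T\left(H_n^{-1}(\theta)\right)^T\\
& - H_n^{-1}(\theta)\left(-\Psi_n(\theta) - R_n\right)\left[\sum_{\nu=1}^g H_{(\nu)}^{-1}(\theta)\left(-\Psi_{(\nu)}(\theta) - R_{n-m,\nu}\right)\right]^T\\
& + gH_n^{-1}(\theta)\left(-\Psi_n(\theta) - R_n\right)\left(-\Psi_n(\theta) - R_n\right)^T\left(H_n^{-1}(\theta)\right)^T.
\end{aligned}$$
Let $\Psi_\nu^*(\theta) = \Psi_n(\theta) - \Psi_{(\nu)}(\theta)$. Then
$$\sum_{\nu=1}^g \Psi_{(\nu)}(\theta) = (g-1)\Psi_n(\theta) \tag{2.85}$$
and
$$\sum_{\nu=1}^g \Psi_{(\nu)}(\theta)\Psi_{(\nu)}^T(\theta) = (g-2)\Psi_n(\theta)\Psi_n^T(\theta) + \sum_{\nu=1}^g \Psi_\nu^*(\theta)\left(\Psi_\nu^*(\theta)\right)^T.$$
By the Law of Large Numbers, we have
$$\frac{1}{n-m}H_{(\nu)}(\theta) \xrightarrow{p} D_\Psi(\theta).$$
We also have $E\{\Psi_\nu^*(\theta)(\Psi_\nu^*(\theta))^T/m\} = E\{\Psi(\theta)\Psi^T(\theta)\}$; thus by the Law of Large Numbers,
$$\frac{1}{n}\sum_{\nu=1}^g \Psi_\nu^*(\theta)\left(\Psi_\nu^*(\theta)\right)^T = \frac{1}{g}\sum_{\nu=1}^g \left\{\Psi_\nu^*(\theta)\left(\Psi_\nu^*(\theta)\right)^T/m\right\} \xrightarrow{p} E\left(\Psi(\theta)\Psi^T(\theta)\right).$$
From (2.85) and these limits, it follows that
$$n\sum_{\nu=1}^g (\hat\theta_{(\nu)} - \hat\theta)(\hat\theta_{(\nu)} - \hat\theta)^T \xrightarrow{p} D_\Psi(\theta)^{-1}M_\Psi(\theta)\left(D_\Psi(\theta)^{-1}\right)^T.$$
In other words, we proved that $n\sum_{\nu=1}^g (\hat\theta_{(\nu)} - \hat\theta)(\hat\theta_{(\nu)} - \hat\theta)^T$ is a consistent estimator of $J_\Psi^{-1}(\theta)$, when $m$ is fixed and $g \to \infty$. □

The main motive for the leave-out-more-than-one-observation-at-a-time approach is to reduce the amount of computation required for the jackknife method.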
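The delete-a-group estimate of variance (2.83) can be sketched as follows. This is only an illustration under stated assumptions: the function `jackknife_variance` and its arguments are hypothetical names, and the estimator plugged in is a plain vector of sample means rather than an IFME, chosen because its jackknife variance has a closed form that can be checked. Any routine that maps a data array to a parameter vector could be passed in its place.

```python
import numpy as np

def jackknife_variance(data, estimator, g, rng=None):
    """Delete-a-group jackknife in the form of (2.83).

    data      : array with one observation per row
    estimator : callable mapping a data array to a parameter vector
    g         : number of groups; leftover observations (n mod g) are dropped
    rng       : optional numpy Generator used for the random grouping
    Returns (theta_hat, V_J) with V_J = sum_v (theta_(v) - theta_hat)(...)^T.
    """
    data = np.asarray(data)
    n = data.shape[0]
    m = n // g                                   # group size
    order = np.arange(n) if rng is None else rng.permutation(n)
    used = data[order[: g * m]]                  # g groups of size m
    theta_hat = np.atleast_1d(estimator(used))
    V = np.zeros((theta_hat.size, theta_hat.size))
    for v in range(g):
        keep = np.ones(g * m, dtype=bool)
        keep[v * m:(v + 1) * m] = False          # delete the v-th group
        diff = np.atleast_1d(estimator(used[keep])) - theta_hat
        V += np.outer(diff, diff)
    return theta_hat, V

# Illustrative data: bivariate counts; the estimator is the vector of means.
rng = np.random.default_rng(2)
n = 60
y = rng.poisson(lam=[2.0, 5.0], size=(n, 2)).astype(float)
mean_vec = lambda d: d.mean(axis=0)

theta_hat, V1 = jackknife_variance(y, mean_vec, g=n)               # (m, g) = (1, n)
_, V_grp = jackknife_variance(y, mean_vec, g=12, rng=rng)          # (m, g) = (5, 12)
se_leave_one_out = np.sqrt(np.diag(V1))                            # jackknife SEs
```

With $g = n$ the loop refits the estimator $n$ times, while $g = 12$ needs only 12 refits; this is the computational saving referred to above.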
For large samples, it may be helpful to make an initial random grouping of the data into $g$ groups of size $m$, randomly deleting a few observations if necessary. The choice of the number of groups $g$ may be based on the computation costs and the precision or accuracy of the resulting estimators. As regards computation costs, the choice $(m, g) = (1, n)$ is the most expensive. For large samples, $g = n$ may not be computationally feasible, thus some values of $g$ less than $n$ may be preferred. The grouping, however, introduces a degree of arbitrariness, a problem not encountered when $g = n$. This results in an analysis that is not uniquely defined. This is generally not a problem for SE estimation for application purposes, as usually a rough assessment is then sufficient. As regards the precision of the estimators, when the sample size $n$ is small to moderate, the choice $(m, g) = (1, n)$ is preferred. See Chapter 5 for examples.

2.5.2 Jackknife for a function of θ

The jackknife method can also be used for estimates of functions of parameters, such as the asymptotic variance of $P(y_1\cdots y_d; \hat\theta)$ for a MCD or MMD model. The usual delta method requires partial derivatives of the function with respect to the parameters, and these may be very difficult to obtain. The jackknife method eliminates the need for these partial derivatives. In the following, we present some results on the jackknife method for functions of $\theta$.

Suppose $b(\theta) = (b_1(\theta),\ldots,b_s(\theta))'$ is a vector-valued function defined on $\Re$ and taking values in $s$-dimensional space. We assume that each component function of $b$, $b_j(\cdot)$ ($j = 1,\ldots,s$), is real-valued and has a differential at $\theta_0$; thus $b$ has the following expansion as $\theta \to \theta_0$:
$$b(\theta) = b(\theta_0) + (\theta - \theta_0)\left(\frac{\partial b}{\partial\theta_0'}\right)^T + o(\|\theta - \theta_0\|), \tag{2.86}$$
where $\partial b/\partial\theta_0' = (\partial b/\partial\theta')|_{\theta=\theta_0}$ is of rank $t = \min(s, q)$. By Theorem 2.4, $\hat\theta$ has an asymptotic normal distribution in the sense that
$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{D} N_q(0, J_\Psi^{-1}).$$
Similarly, by Theorems 2.8 and 2.10, $\hat\theta_J$ has an asymptotic normal distribution in the sense that
$$\sqrt{n}(\hat\theta_J - \theta) \xrightarrow{D} N_q(0, J_\Psi^{-1}).$$
We have the following results for $b(\hat\theta)$ and $b(\hat\theta_J)$:

Theorem 2.12 Let $b$ be as described above and suppose (2.86) holds. Under the same assumptions as in Theorem 2.4, $b(\hat\theta)$ has the asymptotic distribution given by
$$\sqrt{n}\left(b(\hat\theta) - b(\theta)\right) \xrightarrow{D} N\left(0, \left(\frac{\partial b}{\partial\theta'}\right)J_\Psi^{-1}\left(\frac{\partial b}{\partial\theta'}\right)^T\right).$$
Proof. See Serfling (1980, Ch.3). □

Theorem 2.13 Let $b$ be as described above and suppose (2.86) holds. Under the same assumptions as in Theorem 2.4, $b(\hat\theta_J)$ has the asymptotic distribution given by
$$\sqrt{n}\left(b(\hat\theta_J) - b(\theta)\right) \xrightarrow{D} N\left(0, \left(\frac{\partial b}{\partial\theta'}\right)J_\Psi^{-1}\left(\frac{\partial b}{\partial\theta'}\right)^T\right).$$
Proof. See Serfling (1980, Ch.3). □

As in the previous subsection, let $\hat\theta_{(\nu)}$ be the estimator of $\theta$ with the $\nu$-th group of size $m$ deleted, $\nu = 1,\ldots,g$. We define the jackknife estimate of variance of $b(\hat\theta)$, which we denote by $V_{Jb}$, as
$$V_{Jb} = \sum_{\nu=1}^g \left(b(\hat\theta_{(\nu)}) - b(\hat\theta)\right)\left(b(\hat\theta_{(\nu)}) - b(\hat\theta)\right)^T. \tag{2.87}$$
We have the following theorem.

Theorem 2.14 Let $b$ be as described above and suppose (2.86) holds. Under the same assumptions as in Theorem 2.4, the sample size $n$ times the jackknife estimate of variance $V_{Jb}$ defined by (2.87) is a consistent estimator of
$$\left(\frac{\partial b}{\partial\theta'}\right)J_\Psi^{-1}\left(\frac{\partial b}{\partial\theta'}\right)^T.$$
Proof. The proof is similar to that of Theorem 2.11, and thus omitted here. □

To carry out the above computations for estimates of functions of parameters, it would be desirable to maintain a table of the parameter estimates for the full sample and each jackknife subsample. Then one can use this table for computing estimates of one or more functions of the parameters, and their corresponding SEs.

The results in Theorems 2.12, 2.13 and 2.14 have immediate applications. One example is given next.

Example 2.18 For a MCD or MMD model in (2.12), say $P(y_1\cdots y_d; \theta)$, we could apply the above results to say something about the asymptotic behaviour of $P(y_1\cdots y_d; \hat\theta)$ and $P(y_1\cdots y_d; \hat\theta_J)$. From Theorems 2.12 and 2.13, we derive that as $n \to \infty$
$$\sqrt{n}\left(P(y_1\cdots y_d; \hat\theta) - P(y_1\cdots y_d; \theta)\right) \xrightarrow{D} N\left(0, \left(\frac{\partial P}{\partial\theta'}\right)J_\Psi^{-1}\left(\frac{\partial P}{\partial\theta'}\right)^T\right)$$
and
$$\sqrt{n}\left(P(y_1\cdots y_d; \hat\theta_J) - P(y_1\cdots y_d; \theta)\right) \xrightarrow{D} N\left(0, \left(\frac{\partial P}{\partial\theta'}\right)J_\Psi^{-1}\left(\frac{\partial P}{\partial\theta'}\right)^T\right).$$
Furthermore, by Theorem 2.14, we obtain a consistent estimator of $(\partial P/\partial\theta')J_\Psi^{-1}(\partial P/\partial\theta')^T$, i.e.
$$n\sum_{\nu=1}^g \left\{P(y_1\cdots y_d; \hat\theta_{(\nu)}) - P(y_1\cdots y_d; \hat\theta)\right\}^2.$$
Also see Chapter 5 for direct application in data analysis. □

2.5.3 Jackknife approach for models with covariates

Suppose we have the model defined by (2.53) and (2.54). Let $\hat\gamma_{(\nu)}$ ($\nu = 1,\ldots,g$) be an estimate of $\gamma$ based on the same set of inference functions $\Psi_n(\gamma)$ from the data set $y_1,\ldots,y_n$ but deleting the $\nu$-th group of size $m$ ($m$ is fixed). The jackknife estimate of $\gamma$ is
$$\hat\gamma_J = \frac{1}{g}\sum_{\nu=1}^g \tilde\gamma_\nu = g\hat\gamma - (g-1)\hat\gamma_{(\cdot)}, \tag{2.88}$$
where $\hat\gamma_{(\cdot)} = (1/g)\sum_{\nu=1}^g \hat\gamma_{(\nu)}$, and $\tilde\gamma_\nu = g\hat\gamma - (g-1)\hat\gamma_{(\nu)}$ ($\nu = 1,\ldots,g$). We define the jackknife estimate of variance $V_{J,\gamma}$ for $\hat\gamma$ as follows:
$$V_{J,\gamma} = \sum_{\nu=1}^g (\hat\gamma_{(\nu)} - \hat\gamma)(\hat\gamma_{(\nu)} - \hat\gamma)^T. \tag{2.89}$$
Under the assumptions for the establishment of Theorem 2.6, in parallel to Theorems 2.10 and 2.11, we have the following theorems for the models with covariates. The proofs are generalizations of the proofs for Theorems 2.5, 2.6, 2.10 and 2.11. We omit the proofs and only state the results here.

Theorem 2.15 Consider the general model (2.53) with $d$ arbitrary. Let $\hat\gamma$ denote the IFME of $\gamma$ under the IFM corresponding to (2.57). Under Assumptions 2.1, 2.2 and 2.3, the jackknife estimator $\hat\gamma_J$ defined by (2.88) is asymptotically normal in the sense that, as $n \to \infty$,
$$\sqrt{n}\,A_n^{-1}(\hat\gamma_J - \gamma_0) \xrightarrow{D} N_q(0, I),$$
where $A_n = D_n^{-1/2}(\gamma_0)M_n^{1/2}(\gamma_0)(D_n^{-1/2}(\gamma_0))^T$, and $D_n(\gamma_0)$ and $M_n(\gamma_0)$ are defined by (2.56). □

Theorem 2.16 Under the same assumptions as in Theorem 2.7, we have
$$nV_{J,\gamma} - D_n^{-1}(\gamma_0)M_n(\gamma_0)\left(D_n^{-1}(\gamma_0)\right)^T \xrightarrow{p} 0,$$
where $V_{J,\gamma}$ is the jackknife estimate of variance defined by (2.89), and $D_n(\gamma_0)$ and $M_n(\gamma_0)$ are defined by (2.56). □

Theorem 2.17 Let $b$ be as described in subsection 2.5.2 and suppose (2.86) holds. Under the same assumptions as in Theorem 2.7, $b(\hat\gamma)$, a function of the IFME $\hat\gamma$, has the asymptotic distribution given by
$$\sqrt{n}\,B_n^{-1}\left(b(\hat\gamma) - b(\gamma_0)\right) \xrightarrow{D} N_t(0, I),$$
where
$$B_n = \left[\left(\frac{\partial b}{\partial\gamma_0'}\right)D_n^{-1}(\gamma_0)M_n(\gamma_0)\left(D_n^{-1}(\gamma_0)\right)^T\left(\frac{\partial b}{\partial\gamma_0'}\right)^T\right]^{1/2},$$
and $D_n(\gamma_0)$ and $M_n(\gamma_0)$ are defined by (2.56).
□

Theorem 2.18 Let $b$ be as described in subsection 2.5.2 and suppose (2.86) holds. Under the same assumptions as in Theorem 2.7, $b(\hat\gamma_J)$, a function of the jackknife estimate $\hat\gamma_J$ derived from $\hat\gamma$, has the asymptotic distribution given by
$$\sqrt{n}\,B_n^{-1}\left(b(\hat\gamma_J) - b(\gamma_0)\right) \xrightarrow{D} N_t(0, I),$$
where
$$B_n = \left[\left(\frac{\partial b}{\partial\gamma_0'}\right)D_n^{-1}(\gamma_0)M_n(\gamma_0)\left(D_n^{-1}(\gamma_0)\right)^T\left(\frac{\partial b}{\partial\gamma_0'}\right)^T\right]^{1/2},$$
and $D_n(\gamma_0)$ and $M_n(\gamma_0)$ are defined by (2.56). □

We define the jackknife estimate of variance of $b(\hat\gamma)$, $V_{Jb}$, as follows:
$$V_{Jb} = \sum_{\nu=1}^g \left(b(\hat\gamma_{(\nu)}) - b(\hat\gamma)\right)\left(b(\hat\gamma_{(\nu)}) - b(\hat\gamma)\right)^T. \tag{2.90}$$

Theorem 2.19 Let $b$ be as described in subsection 2.5.2 and suppose (2.86) holds. Under the same assumptions as in Theorem 2.6, we have
$$nV_{Jb} - \left(\frac{\partial b}{\partial\gamma_0'}\right)D_n^{-1}(\gamma_0)M_n(\gamma_0)\left(D_n^{-1}(\gamma_0)\right)^T\left(\frac{\partial b}{\partial\gamma_0'}\right)^T \xrightarrow{p} 0,$$
where the jackknife estimate of variance $V_{Jb}$ is defined by (2.90), and $D_n(\gamma_0)$ and $M_n(\gamma_0)$ are defined by (2.56). □

2.6 Estimation for models with parameters common to more than one margin

One potential application of the MCD and MMD models is for longitudinal or repeated measures studies with short time series, in which the interest may be on how the distribution of the response changes over time. Some common characteristics, which carry over time, may appear in the form of common regression parameters or common dependence parameters. There are also general situations in which the same parameters appear in more than one margin. This happens with the MCD and MMD models, for example, when there is a special dependence structure in the copula $C$, such as in the multinormal copula (2.4), where $\Theta = (\theta_{jk})$ is an exchangeable correlation matrix with all correlations equal to $\theta$, or $\Theta$ is an AR(1) correlation matrix with the $(j, k)$ component equal to $\theta^{|j-k|}$ for some $\theta$.

Example 2.19 Suppose the $d$-variate binary vector $Y_i$ with a covariate vector for the $j$th univariate margin can be represented as $Y_{ij} = I(Z_{ij} < \alpha_j + \beta_j x_{ij})$, $i = 1,\ldots,n$, where $Z_i \sim N(0, \Theta)$. This is a multivariate probit model.
Assume $\beta_j = \beta$; then the common regression coefficients appear in more than one margin. We could estimate $\beta$ from the $d$ univariate margins based on the IFM approach, but then we have $d$ estimates of $\beta$. Taking any one of them as the estimate of $\beta$ evidently results in some loss of information. Can we pool the information together to get a better estimate of $\beta$? The same question arises for the correlation matrix $\Theta$. Assume there is no covariate for $\Theta$. When $\Theta$ has certain special forms, for example exchangeable or AR(1), the same parameter appears in $d(d-1)/2$ bivariate margins. Can we get a more efficient estimate from the IFM approach? There are also situations where a parameter is common to some margins, such as $\theta_{12} = \theta_{23} = \cdots = \theta_{d-1,d}$ in $\Theta$. The same question about getting a more efficient estimate arises. □

A direct approach for common parameter estimation is to use the likelihood of a higher-order margin, if this is computationally feasible. Otherwise, the IFM approach for model fitting can be applied. With the IFM approach, appropriately taking the information about common parameters into account can improve the efficiency of the parameter estimates. Analytical and numerical evidence supporting this claim is given in Chapter 4 for the two approaches of information pooling for IFM that we propose here. The first approach, called the weighting approach (WA), is to form a new estimate based on some weighting of the estimates for the same parameter from different margins. A special case is the simple average. The second approach, called the pool-marginal-likelihoods approach (PMLA), is to rewrite the inference functions of margins under the assumption that the same parameter appears in several margins. In the following, we outline the two estimating approaches in general terms.

2.6.1 Weighting approach

WA is a method to get an efficient estimate based on a weighted average of different estimates for the same parameter. We state this approach in general terms.
Assume $\hat\gamma_1,\ldots,\hat\gamma_q$ are estimates of the same parameter $\gamma$, but from different inference functions. Let $\hat\gamma = (\hat\gamma_1,\ldots,\hat\gamma_q)'$, and let $\Sigma_{\hat\gamma}$ be the asymptotic variance-covariance matrix based on the Godambe information matrix. One relatively efficient estimate of the parameter $\gamma$ is based on the following result, which is easily obtained by the method of Lagrange multipliers.

Result 2.1 Suppose $X$ is a $q$-variate random vector with mean vector $\mu_X = (\mu,\ldots,\mu)' = \mu\mathbf{1}$ and $\mathrm{Var}(X) = \Sigma_X$, where $\mu$ is a scalar and $\mathbf{1} = (1,\ldots,1)'$. A linear unbiased estimate of $\mu$, $u'X$, has the smallest variance when
$$u = \frac{\Sigma_X^{-1}\mathbf{1}}{\mathbf{1}'\Sigma_X^{-1}\mathbf{1}}. \quad □$$

Applying the above result to our problem, the resulting estimate of $\gamma$ is
$$\tilde\gamma = \frac{\mathbf{1}'\Sigma_{\hat\gamma}^{-1}\hat\gamma}{\mathbf{1}'\Sigma_{\hat\gamma}^{-1}\mathbf{1}}. \tag{2.91}$$
If the $\hat\gamma_j$ are consistent estimates of $\gamma$, then $\tilde\gamma$ is also a consistent estimate of $\gamma$, and its asymptotic variance is no larger than that of any of the individual estimates of $\gamma$ from one particular inference function. The asymptotic variance of $\tilde\gamma$ is
$$\sigma_{\tilde\gamma}^2 = u'\Sigma_{\hat\gamma}u = \frac{1}{\mathbf{1}'\Sigma_{\hat\gamma}^{-1}\mathbf{1}}. \tag{2.92}$$
A computationally simpler but slightly less efficient estimate of $\gamma$ is
$$\tilde\gamma = \frac{\mathbf{1}'\mathrm{diag}\{\Sigma_{\hat\gamma}^{-1}\}\hat\gamma}{\mathbf{1}'\mathrm{diag}\{\Sigma_{\hat\gamma}^{-1}\}\mathbf{1}}, \tag{2.93}$$
and an approximation of the asymptotic variance of $\tilde\gamma$ from (2.93) is $\sigma_{\tilde\gamma}^2 = 1/(\mathbf{1}'\mathrm{diag}\{\Sigma_{\hat\gamma}^{-1}\}\mathbf{1})$. A naive estimate of $\gamma$ is $\tilde\gamma = \mathbf{1}'\hat\gamma/\mathbf{1}'\mathbf{1}$, which is just the simple average. In some special situations, such as in Example 4.4, the estimate (2.91) reduces to a simple average.

In practice, $\Sigma_{\hat\gamma}$ may not be available. We may have to estimate it based on the data. The following algorithm may be useful for getting a final estimate of $\gamma$. Assume we already have $\hat\gamma$ and $\hat\Sigma_{\hat\gamma}$.

Computation algorithm:
1. Let $u = \hat\Sigma_{\hat\gamma}^{-1}\mathbf{1}/(\mathbf{1}'\hat\Sigma_{\hat\gamma}^{-1}\mathbf{1})$.
2. Find $\tilde\gamma = u'\hat\gamma$.
3. Set $\hat\gamma = (\tilde\gamma,\ldots,\tilde\gamma)'$, and update $\hat\Sigma_{\hat\gamma}$ with this new $\hat\gamma$.
4. Go back to step 1 until the final estimate of $\gamma$ satisfies a prescribed convergence criterion. The asymptotic variance of $\tilde\gamma$ at the final iteration can be calculated by (2.92).
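Steps 1 and 2 of the algorithm, together with the variance formula (2.92), can be sketched as follows. The function name `weighted_estimate`, the three numerical estimates and the exchangeable covariance matrix are all made up for illustration; in an application $\hat\Sigma_{\hat\gamma}$ would come from the Godambe information or the jackknife. The re-estimation of $\hat\Sigma_{\hat\gamma}$ in step 3 is model specific and is not shown.

```python
import numpy as np

def weighted_estimate(gamma_hat, sigma):
    """Optimal linear pooling of q estimates of one parameter (Result 2.1).

    gamma_hat : length-q vector of estimates of the same parameter
    sigma     : q x q (estimated) asymptotic covariance matrix of gamma_hat
    Returns the pooled estimate (2.91), its variance (2.92) and the weights u.
    """
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    ones = np.ones(gamma_hat.size)
    w = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1 without forming the inverse
    denom = ones @ w                   # 1' Sigma^{-1} 1
    u = w / denom                      # weights, summing to one
    return u @ gamma_hat, 1.0 / denom, u

# Hypothetical inputs: three estimates with an exchangeable covariance matrix.
gamma_hat = np.array([0.48, 0.52, 0.55])
sigma = 0.04 * ((1 - 0.3) * np.eye(3) + 0.3 * np.ones((3, 3)))

est, var, u = weighted_estimate(gamma_hat, sigma)
```

Because the covariance matrix in this example is exchangeable, the optimal weights are all equal and the pooled estimate coincides with the simple average, which mirrors the remark about Example 4.4.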
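2.6.2 The pool-marginal-likelihoods approach

In this approach, different inference functions for the same parameter are pooled together as a sum, where each inference function is the loglikelihood of some margin. We use the model (2.54) to illustrate the approach. Suppose that in (2.54), $\alpha_j = \alpha$, $j = 1, \ldots, d$, and $\beta_{jk} = \beta$, $1 \le j < k \le d$; then more efficient estimators of $\alpha$ and $\beta$ may be obtained. For example, we can sum up the loglikelihood functions of margins corresponding to the parameter vectors $\alpha$ and $\beta$ in the following way:
\[ \ell(\alpha) = \sum_{i=1}^n \sum_{j=1}^d \log P_j(y_{ij}), \qquad \ell(\alpha, \beta) = \sum_{i=1}^n \sum_{j<k} \log P_{jk}(y_{ij} y_{ik}). \tag{2.94} \]
(2.94) is an example of PMLA. The inference functions of margins from (2.94) corresponding to $\alpha = (\alpha_1, \ldots, \alpha_p)'$ and $\beta = (\beta_1, \ldots, \beta_q)'$ are the partial derivatives of these pooled loglikelihoods, and the estimating equations based on IFM are
\[ \sum_{i=1}^n \sum_{j=1}^d \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \alpha_s} = 0, \quad s = 1, \ldots, p, \qquad \sum_{i=1}^n \sum_{j<k} \frac{1}{P_{jk}(y_{ij} y_{ik})} \frac{\partial P_{jk}(y_{ij} y_{ik})}{\partial \beta_t} = 0, \quad t = 1, \ldots, q. \tag{2.95} \]
If we consider (2.95) as inference functions, we see that asymptotic theory on the estimates from the PMLA can be established by applying the general inference function theory.

For completeness, we give here algebraic expressions on IFM with the PMLA for the Godambe information for the model (2.39) with $\theta = (\theta_1, \ldots, \theta_d, \theta_{12}, \ldots, \theta_{d-1,d})'$, where we assume $\theta_1, \ldots, \theta_d$ are each a function of one parameter $\lambda$, and $\theta_{12}, \ldots, \theta_{d-1,d}$ are each a function of one parameter $\rho$. The loglikelihood functions of margins corresponding to the parameters $\lambda$ and $\rho$ are
\[ \ell(\lambda) = \sum_{i=1}^n \sum_{j=1}^d \log P_j(y_{ij}), \qquad \ell(\lambda, \rho) = \sum_{i=1}^n \sum_{j<k} \log P_{jk}(y_{ij} y_{ik}), \]
and the estimating equations based on IFM are
\[ \sum_{i=1}^n \sum_{j=1}^d \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \lambda} = 0, \qquad \sum_{i=1}^n \sum_{j<k} \frac{1}{P_{jk}(y_{ij} y_{ik})} \frac{\partial P_{jk}(y_{ij} y_{ik})}{\partial \rho} = 0. \tag{2.96} \]
The IFM corresponding to one observation is
\[ \psi_1 = \sum_{j=1}^d \frac{1}{P_j(y_j)} \frac{\partial P_j(y_j)}{\partial \lambda}, \qquad \psi_2 = \sum_{j<k} \frac{1}{P_{jk}(y_j y_k)} \frac{\partial P_{jk}(y_j y_k)}{\partial \rho}. \]
The Godambe information matrix is a $2 \times 2$ matrix. To calculate the Godambe information matrix we need the matrices
\[ M_\psi = \begin{pmatrix} \mathrm{E}(\psi_1^2) & \mathrm{E}(\psi_1\psi_2) \\ \mathrm{E}(\psi_1\psi_2) & \mathrm{E}(\psi_2^2) \end{pmatrix} \quad \text{and} \quad D_\psi = \begin{pmatrix} \mathrm{E}(\partial\psi_1/\partial\lambda) & 0 \\ 0 & \mathrm{E}(\partial\psi_2/\partial\rho) \end{pmatrix}. \]
For $\mathrm{E}(\psi_1^2)$, we have
\[ \mathrm{E}(\psi_1^2) = \mathrm{E}\left[\left(\sum_{j=1}^d \frac{1}{P_j(y_j)} \frac{\partial P_j(y_j)}{\partial \lambda}\right)^2\right] = \sum_{\{y_1 \cdots y_d\}} P(y_1 \cdots y_d) \left(\sum_{j=1}^d \frac{1}{P_j(y_j)} \frac{\partial P_j(y_j)}{\partial \lambda}\right)^2. \]
It can be estimated consistently by
\[ \frac{1}{n} \sum_{i=1}^n \left(\sum_{j=1}^d \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \lambda}\right)^2, \tag{2.97} \]
evaluated at the estimate of $\lambda$. Similarly, we have
\[ \mathrm{E}(\psi_2^2) = \mathrm{E}\left[\left(\sum_{j<k} \frac{1}{P_{jk}(y_j y_k)} \frac{\partial P_{jk}(y_j y_k)}{\partial \rho}\right)^2\right] \]
and
\[ \mathrm{E}(\psi_1\psi_2) = \sum_{\{y_1 \cdots y_d\}} P(y_1 \cdots y_d) \left(\sum_{j=1}^d \frac{1}{P_j(y_j)} \frac{\partial P_j(y_j)}{\partial \lambda}\right) \left(\sum_{j<k} \frac{1}{P_{jk}(y_j y_k)} \frac{\partial P_{jk}(y_j y_k)}{\partial \rho}\right). \]
For $\mathrm{E}(\partial\psi_1/\partial\lambda)$, we have
\[ \mathrm{E}(\partial\psi_1/\partial\lambda) = \sum_{\{y_1 \cdots y_d\}} P(y_1 \cdots y_d) \sum_{j=1}^d \left[ -\frac{1}{P_j^2(y_j)} \left(\frac{\partial P_j(y_j)}{\partial \lambda}\right)^2 + \frac{1}{P_j(y_j)} \frac{\partial^2 P_j(y_j)}{\partial \lambda^2} \right]. \]
Similarly, we find
\[ \mathrm{E}(\partial\psi_2/\partial\rho) = \sum_{\{y_1 \cdots y_d\}} P(y_1 \cdots y_d) \sum_{j<k} \left[ -\frac{1}{P_{jk}^2(y_j y_k)} \left(\frac{\partial P_{jk}(y_j y_k)}{\partial \rho}\right)^2 + \frac{1}{P_{jk}(y_j y_k)} \frac{\partial^2 P_{jk}(y_j y_k)}{\partial \rho^2} \right]. \]
Consistent estimates for $\mathrm{E}(\psi_2^2)$, $\mathrm{E}(\psi_1\psi_2)$, $\mathrm{E}(\partial\psi_1/\partial\lambda)$ and $\mathrm{E}(\partial\psi_2/\partial\rho)$ can be similarly written as in (2.97).

2.6.3 Examples

We give two examples of WA and PMLA.

Example 2.20
1. A trivariate probit model with exchangeable dependence structure: Suppose we have a trivariate probit model with known cut-off points and $P(111) = \Phi_3(0, 0, 0, \rho, \rho, \rho)$. It can be shown (see Example 4.4) that the asymptotic variance of the estimate of $\rho$ from one bivariate margin is $[(\pi^2 - 4(\sin^{-1}\rho)^2)(1 - \rho^2)]/(4n)$, and the asymptotic variance of the estimate of $\rho$ from WA or PMLA is $[(1 - \rho^2)(\pi + 6\sin^{-1}\rho)(\pi - 2\sin^{-1}\rho)]/(12n)$. The ratio of the former to the latter is $[3(\pi + 2\sin^{-1}\rho)]/[\pi + 6\sin^{-1}\rho]$, which decreases from $\infty$ to 1.5 as $\rho$ increases from $-0.5$ to 1. In this example, the optimal weighting is equivalent to a simple average (see Example 4.4 for details).
2. A trivariate probit model with AR(1) dependence structure: Suppose we have a trivariate probit model with known cut-off points, such that $P(111) = \Phi_3(0, 0, 0, \rho, \rho^2, \rho)$. Let $\sigma_w^2$ be the asymptotic variance of the estimate of $\rho$ from WA, $\sigma_p^2$ be the asymptotic variance from PMLA, $\sigma_{12}^2$ be the asymptotic variance from the (1, 2) margin, and $\sigma_{13}^2$ be the asymptotic variance from the (1, 3) margin. In Example 4.5, we show that $\sigma_p^2/\sigma_w^2 \ge 1$, with a maximum value of 1.0391, which is attained at $\rho = 0.3842$ and $\rho = -0.3842$; $\sigma_{12}^2/\sigma_w^2$ increases from 1.707 to 2 as $\rho$ goes from $-1$ to 0 and decreases from 2 to 1.707 as $\rho$ goes from 0 to 1; and $\sigma_{13}^2/\sigma_w^2$ increases from 1.207 to $\infty$ as $\rho$ goes from $-1$ to 0, and decreases from $\infty$ to 1.207 as $\rho$ goes from 0 to 1.

PMLA in the form presented in this section can also be considered as a simple weighted likelihood approach. More complicated weighting schemes for the PMLA can be sought. In general, as long as a reasonable efficiency is preserved, we prefer the simple weighting schemes.

2.7 Numerical methods for the model fitting

From previous sections, we see that the IFM approach for parameter estimation leads to the problem of optimization of a set of loglikelihood functions of margins. The typical system of functions for the models with no covariates in the form of loglikelihood functions of margins is
\[ \ell_{nj}(\lambda_j) = \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1, \ldots, d, \qquad \ell_{njk}(\theta_{jk}) = \sum_{i=1}^n \log P_{jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d, \tag{2.98} \]
and the estimating equations (derived from the loglikelihood functions of margins) are
\[ \Psi_{nj}(\lambda_j) = \sum_{i=1}^n \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \lambda_j} = 0, \quad j = 1, \ldots, d, \qquad \Psi_{njk}(\theta_{jk}) = \sum_{i=1}^n \frac{1}{P_{jk}(y_{ij} y_{ik})} \frac{\partial P_{jk}(y_{ij} y_{ik})}{\partial \theta_{jk}} = 0, \quad 1 \le j < k \le d. \tag{2.99} \]
For the models with covariates, the typical system of functions in the form of loglikelihood functions of margins is
\[ \ell_{nj}(\alpha_j) = \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1, \ldots, d, \qquad \ell_{njk}(\beta_{jk}) = \sum_{i=1}^n \log P_{jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d, \tag{2.100} \]
and the estimating equations are
\[ \Psi_{nj}(\alpha_j) = \sum_{i=1}^n \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \alpha_j} = 0, \quad j = 1, \ldots, d, \qquad \Psi_{njk}(\beta_{jk}) = \sum_{i=1}^n \frac{1}{P_{jk}(y_{ij} y_{ik})} \frac{\partial P_{jk}(y_{ij} y_{ik})}{\partial \beta_{jk}} = 0, \quad 1 \le j < k \le d. \tag{2.101} \]

Newton-Raphson method

The traditional approach for numerical optimization or root-finding is the Newton-Raphson method. This method requires the evaluation of both the first and the second derivatives of the objective functions in (2.98) and (2.100). This method, with its good rate of convergence, is the preferred method if the derivatives can be easily obtained analytically and coded in a program.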
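But in many cases, for example with $\ell_{njk}(\theta_{jk})$ or $\ell_{njk}(\beta_{jk})$, where bivariate objects involve non-closed form two-dimensional integrals, application of the Newton-Raphson method is difficult since analytical derivatives in such situations are very hard to obtain. The Newton-Raphson method can only be easily applied for a few cases with $\ell_{nj}(\lambda_j)$ or $\ell_{nj}(\alpha_j)$, where only univariate objects are involved. For example, the Newton-Raphson method may be used to solve $\Psi_{nj}(\lambda_j) = 0$ to find $\tilde\lambda_j$, $j = 1, \ldots, d$. In this case, based on the Newton-Raphson method, for a given initial value $\lambda_{j,0}$, an updated value of $\lambda_j$ is
\[ \lambda_{j,\mathrm{new}} = \lambda_{j,0} - \left[ \frac{\partial \Psi_{nj}(\lambda_j)}{\partial \lambda_j} \right]^{-1} \Psi_{nj}(\lambda_j) \Bigg|_{\lambda_j = \lambda_{j,0}}. \tag{2.102} \]
This is repeated until successive $\lambda_{j,\mathrm{new}}$ agree to a specified precision. In (2.102), we need to be able to code
\[ \Psi_{nj}(\lambda_j) = \sum_{i=1}^n [1/P_j(y_{ij})][\partial P_j(y_{ij})/\partial \lambda_j] \tag{2.103} \]
and
\[ \frac{\partial \Psi_{nj}(\lambda_j)}{\partial \lambda_j} = \sum_{i=1}^n \left[ \frac{1}{P_j(y_{ij})} \frac{\partial^2 P_j(y_{ij})}{\partial \lambda_j^2} - \frac{1}{P_j^2(y_{ij})} \left( \frac{\partial P_j(y_{ij})}{\partial \lambda_j} \right)^2 \right]. \tag{2.104} \]
This is for the case with no covariates. For the case with covariates, similar iteration equations to (2.102) can be written down. We need to calculate
\[ \Psi_{nj}(\alpha_j) = \sum_{i=1}^n [1/P_j(y_{ij})][\partial P_j(y_{ij})/\partial \alpha_j], \tag{2.105} \]
which is a $p_j \times 1$ vector, and
\[ \frac{\partial \Psi_{nj}(\alpha_j)}{\partial \alpha_j'} = \sum_{i=1}^n \left[ \frac{1}{P_j(y_{ij})} \frac{\partial^2 P_j(y_{ij})}{\partial \alpha_j \partial \alpha_j'} - \frac{1}{P_j^2(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \alpha_j} \left( \frac{\partial P_j(y_{ij})}{\partial \alpha_j} \right)' \right], \tag{2.106} \]
which is a $p_j \times p_j$ matrix. This is equivalent to calculating the gradient of $P_j(y_{ij})$ at the point $\alpha_j$, which is the $p_j$-vector of (first order) partial derivatives
\[ \left( \frac{\partial}{\partial \alpha_{j1}}, \ldots, \frac{\partial}{\partial \alpha_{jp_j}} \right)' P_j(y_{ij}), \]
and the Hessian of $P_j(y_{ij})$ at the point $\alpha_j$, which is a $p_j \times p_j$ matrix of second order partial derivatives with $(s, t)$ ($s, t = 1, \ldots, p_j$) component $(\partial^2/\partial \alpha_{js} \partial \alpha_{jt}) P_j(y_{ij})$.

To avoid the often tedious algebraic derivatives in (2.103) - (2.106), modern symbolic computation software, such as Maple (Char et al., 1992), may be used. This software is also convenient in that it outputs the results in the form of C or Fortran code.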
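The iteration (2.102), with the score and its derivative coded as in (2.103) and (2.104), can be sketched as follows for a single univariate margin. A Poisson margin is assumed here purely for illustration, because the root is then known to be the sample mean; the function names and the data are hypothetical.

```python
import math


def pmf(y, lam):
    """Poisson probability P_j(y) with mean lam (illustrative margin)."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def d_pmf(y, lam):
    """First derivative of P_j(y) with respect to lam."""
    return pmf(y, lam) * (y / lam - 1.0)


def d2_pmf(y, lam):
    """Second derivative of P_j(y) with respect to lam."""
    return pmf(y, lam) * ((y / lam - 1.0) ** 2 - y / lam ** 2)


def score(ys, lam):
    """Psi_nj as in (2.103): sum of (1/P) dP/dlam."""
    return sum(d_pmf(y, lam) / pmf(y, lam) for y in ys)


def score_derivative(ys, lam):
    """dPsi_nj/dlam as in (2.104)."""
    total = 0.0
    for y in ys:
        p = pmf(y, lam)
        total += d2_pmf(y, lam) / p - (d_pmf(y, lam) / p) ** 2
    return total


def newton_raphson(ys, lam0, tol=1e-10, max_iter=50):
    """Iterate (2.102) until successive values agree to within tol.

    Assumes at least one positive count, so the derivative is nonzero.
    """
    lam = lam0
    for _ in range(max_iter):
        new = lam - score(ys, lam) / score_derivative(ys, lam)
        if abs(new - lam) < tol:
            return new
        lam = new
    return lam


counts = [0, 1, 2, 2, 3, 4, 1, 3, 5, 4]      # made-up data, mean 2.5
lam_hat = newton_raphson(counts, lam0=1.0)   # converges to the sample mean
```

Only `pmf`, `d_pmf` and `d2_pmf` would change for another univariate margin; the iteration itself is generic.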
Quasi-Newton method

For many multivariate models, it is inconvenient to supply both first and second partial derivatives of the objective functions as required by the Newton-Raphson method. For example, obtaining the partial derivatives of the forms (2.104) - (2.106) may be tedious, particularly with functions such as $\ell_{njk}(\theta_{jk})$ or $\ell_{njk}(\beta_{jk})$, where 2-dimensional integrations are often involved. A numerical method for optimization that is useful for many multivariate models in this thesis is the quasi-Newton method (or variable-metric method). This method uses numerical approximations to the derivatives (gradients and Hessian matrix) in the Newton-Raphson iteration; thus it can be considered as a derivative-free method. In many situations, a crude approximation to the derivatives can lead to convergence in the Newton-Raphson iteration as well. Application of this method requires only the objective functions, such as those in (2.98) and (2.100), to be coded. The gradients are computed numerically and the inverse Hessian matrix of second order derivatives is updated after each iteration. This method has the advantage of not requiring the analytic derivatives of the objective functions with respect to the parameters. Its disadvantage is that convergence could be slow compared with the Newton-Raphson approach. An example of a quasi-Newton routine, which is used in the programs written for this thesis work, is the quasi-Newton minimization routine in Nash (1990) (Algorithm 21, p. 192). This is a modified Fletcher variable-metric method; the original method is due to Fletcher (1970).

With the quasi-Newton methods, all we need to do is to write down the optimization (minimization or maximization) objective function (such as $\ell_{njk}(\beta_{jk})$), and then let a quasi-Newton routine take care of the rest.
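To make the idea concrete, here is a minimal sketch of a variable-metric minimizer that needs only the objective function: gradients are taken by central differences and the inverse Hessian approximation is revised after every step. It uses the standard BFGS update with a backtracking line search, which is an assumption made for illustration and not the Nash (1990) routine used for the thesis; the objective (two univariate logit loglikelihoods with made-up counts) is likewise hypothetical.

```python
import math


def neg_loglik(z, counts):
    """Minus the summed logit loglikelihoods; counts = [(n1, n0), ...]."""
    total = 0.0
    for zj, (n1, n0) in zip(z, counts):
        log_p1 = -math.log1p(math.exp(-zj))   # log P(Y = 1)
        log_p0 = -zj + log_p1                 # log P(Y = 0)
        total -= n1 * log_p1 + n0 * log_p0
    return total


def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation to the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = x[:], x[:]
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def quasi_newton_minimize(f, x0, tol=1e-6, max_iter=200):
    n = len(x0)
    x = list(x0)
    h_inv = [[float(i == j) for j in range(n)] for i in range(n)]
    g = numerical_gradient(f, x)
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < tol:
            break
        d = [-sum(h_inv[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(di * gi for di, gi in zip(d, g))
        if slope >= 0.0:                      # restart with steepest descent
            h_inv = [[float(i == j) for j in range(n)] for i in range(n)]
            d = [-gi for gi in g]
            slope = -sum(gi * gi for gi in g)
        fx, t, x_new = f(x), 1.0, None
        for _step in range(60):               # backtracking line search
            trial = [xi + t * di for xi, di in zip(x, d)]
            if f(trial) <= fx + 1e-4 * t * slope:
                x_new = trial
                break
            t *= 0.5
        if x_new is None:
            break                             # no acceptable step found
        g_new = numerical_gradient(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:                        # BFGS update of inverse Hessian
            hy = [sum(h_inv[i][j] * y[j] for j in range(n)) for i in range(n)]
            yhy = sum(yi * hyi for yi, hyi in zip(y, hy))
            for i in range(n):
                for j in range(n):
                    h_inv[i][j] += ((1.0 + yhy / sy) * s[i] * s[j]
                                    - hy[i] * s[j] - s[i] * hy[j]) / sy
        x, g = x_new, g_new
    return x


counts = [(30, 70), (80, 20)]
z_hat = quasi_newton_minimize(lambda z: neg_loglik(z, counts), [0.0, 0.0])
```

For this objective the minimizer is known in closed form, $z_j = \log(n_j(1)/n_j(0))$, which makes it a convenient check before moving to objectives that involve numerical integration.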
A quasi-Newton routine works fine if the objective function can be computed to arbitrary precision, say $\epsilon_0$. The numerical gradients are then based on a step size (or step length) $\epsilon < \epsilon_0$. The calculation of the optimization objective function for a multivariate model often involves the evaluation of multiple integrals at some arbitrary points. One-dimensional and two-dimensional numerical integrals can usually be computed quite quickly to around six digits of precision, but computational time becomes a problem when trying to achieve many digits of precision for numerical integrals of dimension three or more. When the objective function is not computed sufficiently accurately, the numerical gradients are poor approximations to the true gradients and this will lead to poor performance of the quasi-Newton method. On the other hand, for statistical problems, great accuracy is seldom required; it often suffices to obtain two or three significant digits, and we expect that in most situations we are not dealing with the worst cases.

Starting points for numerical optimization

In general, an objective function may have many local optima in addition to possibly a single global optimum. There is no numerical method which will always locate an existing global optimum, and the computational complexity in general increases either linearly or quadratically in the number of parameters. The best scenario is that we have a dependable method which converges to a local optimum based on initial guesses of the values which optimize the objective function. Thus good starting points for the numerical optimization methods are important. It is desirable to locate a good starting point based on a simple method, rather than trying many random starting points.
An example based on method of moments estimation for deciding the starting points is the multivariate Poisson-lognormal model (see Example 2.12), where the initial values for the estimates of $\mu_j$, $\sigma_j$ and $\theta_{jk}$ based on the sample means ($\bar y_j$), sample variances ($s_j^2$) and sample correlations ($r_{jk}$) are
\[ \sigma_j^0 = \{\log[(s_j^2 - \bar y_j)/\bar y_j^2 + 1]\}^{1/2}, \qquad \mu_j^0 = \log \bar y_j - 0.5(\sigma_j^0)^2 \qquad \text{and} \qquad \theta_{jk}^0 = \log[r_{jk} s_j s_k/(\bar y_j \bar y_k) + 1]/(\sigma_j^0 \sigma_k^0), \]
respectively. If the problem involves covariates, one can solve some linear equation systems with appropriately chosen covariate values to obtain initial values for the regression parameters. Initial values may also be obtained from prior knowledge of the study or by trial and error. Generally speaking, it is easier to have a good starting point for a model with interpretable parameters or parameters which are easily related to interpretable characteristics of the model. In situations where closed-form moment characteristics of the model are not available, we may compute the model moments numerically.

Numerical integration

There are several methods for obtaining numerical integrals; among them are Romberg integration, adaptive integration and Monte-Carlo integration. The latter is especially useful for high dimensional integration provided the accuracy requirements are modest. With the IFM approach, as often only one or two-dimensional integrations are needed, the necessary numerical integrations are not a problem in most cases. (Thus IFM can be considered as a tractable computational method, which alleviates the often extremely heavy computational burdens in fitting multivariate models.) For this thesis work, an integration routine based on the Romberg integration method in Davis and Rabinowitz (1984) is used; this routine is good for two to about four dimensional integrations. A routine in Fortran code for computing the multinormal probability (or multinormal copula) can be found in Schervish (1984).
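As an illustration of the one-dimensional case, the sketch below implements a basic Romberg rule and uses it to evaluate a univariate Poisson-lognormal probability. It is not the Davis and Rabinowitz routine; the function names, the truncation of the integral to $[-8, 8]$, and the parameterization (Poisson mean $\exp(\mu + \sigma z)$ with $z$ standard normal) are assumptions made for the example.

```python
import math


def romberg(f, a, b, max_levels=14, min_levels=6, tol=1e-10):
    """Romberg integration of f over [a, b] (trapezoid plus Richardson)."""
    h = b - a
    prev = [0.5 * h * (f(a) + f(b))]
    for k in range(1, max_levels):
        h *= 0.5
        # Add the midpoints of the previous grid to refine the trapezoid rule.
        mid = sum(f(a + (2 * i + 1) * h) for i in range(2 ** (k - 1)))
        row = [0.5 * prev[0] + h * mid]
        for m in range(1, k + 1):
            row.append(row[m - 1] + (row[m - 1] - prev[m - 1]) / (4 ** m - 1))
        # Require a few levels before trusting the stopping rule.
        if k >= min_levels and abs(row[k] - prev[k - 1]) < tol:
            return row[k]
        prev = row
    return prev[-1]


def poisson_lognormal_pmf(y, mu, sigma):
    """P(Y = y) when Y | z ~ Poisson(exp(mu + sigma z)), z ~ N(0, 1)."""
    def integrand(z):
        log_lam = mu + sigma * z
        log_pois = -math.exp(log_lam) + y * log_lam - math.lgamma(y + 1)
        return math.exp(log_pois - 0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return romberg(integrand, -8.0, 8.0)


probs = [poisson_lognormal_pmf(y, 0.0, 0.5) for y in range(41)]
mean = sum(y * p for y, p in enumerate(probs))   # compare with exp(mu + sigma^2/2)
```

The comparison of `mean` with $\exp(\mu + \sigma^2/2)$ is the kind of numerically computed model moment that can also be used when choosing starting points.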
Joe (1995) provides some good approximation methods for computing the multinormal cdf and rectangle probabilities.

2.8 Summary

In this chapter, two classes of models, MCD and MMD, are proposed and studied. The IFM is proposed as a parameter estimation and inference procedure, and its asymptotic properties are studied. Most of the results are for models with MUBE or PUBE properties, but the results of this chapter should apply to a very wide range of inference problems for numerous popular models in the MCD and MMD classes. The IFME has the advantage of computational feasibility; this makes numerous models in the MCD and MMD classes practically useful. We also proposed a jackknife procedure for computing the asymptotic covariance matrix of the IFME, and demonstrated the asymptotic equivalence of this estimate to the Godambe information matrix. Based on the IFM approach, we also proposed estimation procedures for models with parameters common to more than one margin.

One problem of great interest is that of determining the efficiency of the IFME relative to the conventional MLE. Clearly, a general comparison would be very difficult, if not impossible. Analytic and simulation studies on IFM efficiency will be given in Chapter 4. Another problem of interest is to see how the jackknife procedure compares with the Godambe information calculation numerically. A study of this is also given in Chapter 4.

Our results may have several interesting extensions; some of these will be discussed in Chapter 7 as possible future research topics.

Chapter 3

Modelling of multivariate discrete data

In this chapter, we study some specific models in the class of MCD and MMD models and develop the corresponding IFM approach for fitting the model based on data. The models are first discussed for the case with no covariates and then for the case with covariates.
For the dependence structure in the models, we distinguish between models with general dependence structure and models with special dependence structure. Different ways to extend the models, especially to include covariates for the dependence parameters, are discussed.

This chapter is organized as follows. In section 3.1, we study MCD models for binary data, with emphasis on multivariate logit models and probit models for binary data. In section 3.2, we make some comparison of the models discussed in section 3.1. The general ideas in this section should also extend to other MCD and MMD models. In section 3.3, we study MCD models for count data, and in section 3.4, we study MCD models for ordinal data. MMD models for binary data are studied in section 3.5, and MMD models for count data are studied in section 3.6. Finally in section 3.7, we discuss the use of MCD and MMD models for longitudinal and repeated measures data. In each section, only a few parametric models are given, but many others can be derived. For data analysis examples with different models presented in this chapter, see Chapter 5.

3.1 Multivariate copula discrete models for binary data

3.1.1 Multivariate logit model

A multivariate logit model should be based on a multivariate logistic distribution. As there is no natural multivariate logistic distribution, we construct multivariate logit models by putting univariate logistic margins into a family of copulas with a wide range of dependence, and simple computational form if possible. As a result, multivariate logit models for binary data are obtained by letting $G_j(0) = 1/[1 + \exp(z_j)]$ and $G_j(1) = 1$ in the model (2.13), with arbitrary copula $C$. Some choices of the copula $C$ are:

• Multinormal copula
\[ C(u_1, \ldots, u_d; \Theta) = \Phi_d(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d); \Theta), \tag{3.1} \]
where $\Theta = (\theta_{jk})$ is a correlation matrix (of normal margins).
The bivariate margins of (3.1) are $C_{jk}(u_j, u_k) = \Phi_2(\Phi^{-1}(u_j), \Phi^{-1}(u_k); \theta_{jk})$.

• Mixture of max-id copula (id for infinitely divisible, see Joe and Hu 1996)
\[ C(u) = \psi\left( -\sum_{j<k} \log K_{jk}\left(e^{-p_j\psi^{-1}(u_j)}, e^{-p_k\psi^{-1}(u_k)}\right) + \sum_{j=1}^d \nu_j p_j \psi^{-1}(u_j) \right), \tag{3.2} \]
where $K_{jk}$ are max-id copulas, $p_j = (\nu_j + d - 1)^{-1}$, $\nu_j \ge 0$. For interpretation, we may say that $\psi$ can be considered as providing the minimal level of (pairwise) dependence, the copula $K_{jk}$ adds some pairwise dependence to the global dependence, and the $\nu_j$'s can be used for bivariate and multivariate asymmetry (the asymmetries are represented through $\nu_j/(\nu_j + \nu_k)$, $j \ne k$). (3.2) has the MUBE and partially CUOM properties. With $\psi(s) = -\theta^{-1}\log(1 - [1 - e^{-\theta}]e^{-s})$, $\theta > 0$, (3.2) becomes
\[ C(u) = -\theta^{-1} \log\left( 1 - (1 - e^{-\theta}) \prod_{j<k} K_{jk}(x_j, x_k) \prod_{j=1}^d x_j^{\nu_j} \right), \tag{3.3} \]
where $x_j = [(1 - e^{-\theta u_j})/(1 - e^{-\theta})]^{p_j}$, $p_j = (\nu_j + d - 1)^{-1}$. The bivariate margins of (3.3) are
\[ C_{jk}(u_j, u_k) = -\theta^{-1} \log[1 - (1 - e^{-\theta}) x_j^{\nu_j + d - 2} x_k^{\nu_k + d - 2} K_{jk}(x_j, x_k)]. \]
One good choice of $K_{jk}$ would be $K_{jk}(x_j, x_k) = (x_j^{-\delta_{jk}} + x_k^{-\delta_{jk}} - 1)^{-1/\delta_{jk}}$, $0 \le \delta_{jk} < \infty$, because the resulting copula is simple in form. (See Kimeldorf and Sampson (1975) and Joe (1993) for a discussion of this bivariate copula.)

• Molenberghs-Lesaffre (M-L) construction (Joe 1996). The construction generates a multivariate "copula" with general dependence structure. An example for a 4-dimensional copula is the following:
\[ \eta_{1234} = \frac{x (x - a_0) \prod_{1 \le j < k \le 4} (x - a_{jk})}{(C_{123} - x)(C_{124} - x)(C_{134} - x)(C_{234} - x) \prod_{j=1}^4 (a_j - x)}, \tag{3.4} \]
where $x = C_{1234}$ is the copula, $a_0 = C_{123} + C_{124} + C_{134} + C_{234} - C_{12} - C_{13} - C_{14} - C_{23} - C_{24} - C_{34} + u_1 + u_2 + u_3 + u_4 - 1$, $a_1 = u_1 - C_{12} - C_{13} - C_{14} + C_{123} + C_{124} + C_{134}$, $a_2 = u_2 - C_{12} - C_{23} - C_{24} + C_{123} + C_{124} + C_{234}$, $a_3 = u_3 - C_{13} - C_{23} - C_{34} + C_{123} + C_{134} + C_{234}$, $a_4 = u_4 - C_{14} - C_{24} - C_{34} + C_{124} + C_{134} + C_{234}$, and $a_{jk} = C_{jkl} + C_{jkm} - C_{jk}$, for $1 \le j < k \le 4$, where $l$ and $m$ are the two indices in $\{1, 2, 3, 4\}$ other than $j$ and $k$.
In (3.4), $C_{123}$, $C_{124}$, $C_{134}$ and $C_{234}$ are compatible trivariate copulas such that
\[ \eta_{jkl} = \frac{z\, b_1 b_2 b_3}{b_4 b_5 b_6 b_7}, \tag{3.5} \]
where $z = C_{jkl}$, $b_1 = u_j - C_{jk} - C_{jl} + z$, $b_2 = u_k - C_{jk} - C_{kl} + z$, $b_3 = u_l - C_{jl} - C_{kl} + z$, $b_4 = C_{jk} - z$, $b_5 = C_{jl} - z$, $b_6 = C_{kl} - z$ and $b_7 = 1 - u_j - u_k - u_l + C_{jk} + C_{jl} + C_{kl} - z$, for $1 \le j < k < l \le 4$. The bivariate copulas $C_{12}$, $C_{13}$, $C_{14}$, $C_{23}$, $C_{24}$, $C_{34}$ are arbitrary compatible bivariate copulas. Examples are the bivariate normal, Plackett (2.8) or Frank copula (2.9); see Joe (1993, 1996) for a list of bivariate copula families with good properties. The parameters in $C_{1234}$ are the parameters from the bivariate copulas, plus $\eta_{jkl}$, $1 \le j < k < l \le 4$, and $\eta_{1234}$. The generalization of this construction to arbitrary dimension can be found in Joe (1996). Notice that we have quotation marks on the word copula, because the multivariate object obtained from (3.4) and (3.5), or the corresponding form for general dimension, has not been proven to be a proper multivariate copula. But it can be used for the parameter range that leads to positive orthant probabilities for the multivariate binary vector.

• Morgenstern copula
\[ C(u_1, \ldots, u_d) = \left[ 1 + \sum_{j<k} \theta_{jk} (1 - u_j)(1 - u_k) \right] \prod_{h=1}^d u_h, \tag{3.6} \]
where the $\theta_{jk}$ must satisfy some constraints such that (3.6) is a proper copula. The bivariate margins of (3.6) are $C_{jk}(u_j, u_k) = [1 + \theta_{jk}(1 - u_j)(1 - u_k)] u_j u_k$, $|\theta_{jk}| \le 1$.

• The permutation symmetric copulas of the form
\[ C(u_1, \ldots, u_d) = \phi\left( \sum_{j=1}^d \phi^{-1}(u_j) \right), \tag{3.7} \]
where $\phi : [0, \infty) \to [0, 1]$ is strictly decreasing and continuously differentiable (of all orders), and $\phi(0) = 1$, $\phi(\infty) = 0$, $(-1)^j \phi^{(j)} \ge 0$. With $\phi(s) = -\theta^{-1} \log(1 - [1 - e^{-\theta}]e^{-s})$, $\theta > 0$, (3.7) is
\[ C(u_1, \ldots, u_d) = -\frac{1}{\theta} \log\left( 1 - \frac{\prod_{j=1}^d (1 - e^{-\theta u_j})}{(1 - e^{-\theta})^{d-1}} \right). \tag{3.8} \]
This choice of $\phi(s)$ leads to (3.8) with bivariate marginal copulas that are reflection symmetric.

It is straightforward to extend the univariate marginal parameters to include covariates.
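As a small check on the last item of the list, the sketch below evaluates the permutation symmetric copula both through the generator form (3.7) and through the closed form (3.8), and confirms numerically that setting an argument to 1 yields the lower-dimensional margin. The function names are hypothetical, and arguments are assumed to lie in $(0, 1]$ with $\theta > 0$.

```python
import math


def generator(s, theta):
    """phi(s) = -log(1 - (1 - exp(-theta)) exp(-s)) / theta."""
    return -math.log(1.0 - (1.0 - math.exp(-theta)) * math.exp(-s)) / theta


def generator_inverse(u, theta):
    """phi^{-1}(u) for u in (0, 1]."""
    return -math.log((1.0 - math.exp(-theta * u)) / (1.0 - math.exp(-theta)))


def copula_generator_form(u, theta):
    """Equation (3.7): phi applied to the sum of phi^{-1}(u_j)."""
    return generator(sum(generator_inverse(x, theta) for x in u), theta)


def copula_closed_form(u, theta):
    """Equation (3.8)."""
    prod = 1.0
    for x in u:
        prod *= 1.0 - math.exp(-theta * x)
    denom = (1.0 - math.exp(-theta)) ** (len(u) - 1)
    return -math.log(1.0 - prod / denom) / theta


u = [0.3, 0.6, 0.8]
same = abs(copula_generator_form(u, 2.0) - copula_closed_form(u, 2.0))
# Setting the third argument to 1 gives the bivariate margin of (u_1, u_2).
margin_gap = abs(copula_closed_form([0.3, 0.6, 1.0], 2.0)
                 - copula_closed_form([0.3, 0.6], 2.0))
```

Both `same` and `margin_gap` should be zero up to rounding error. Returning to the extension of the marginal parameters to covariates: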
For example, for $z_{ij}$ corresponding to the random vector $Y_i$, we can let $z_{ij} = \alpha_j' x_{ij}$, where $\alpha_j$ is a parameter vector and $x_{ij}$ is a covariate vector corresponding to the $j$th margin. But what about the dependence parameters in the copula? Should the dependence parameters be functions of covariates? If so, what are the functions? These questions have no obvious answers. It is evident that if the dependence parameters are functions of covariates, the form of the functions will depend on the particular copula associated with the model. A simple way to deal with the dependence parameters is to let the dependence parameters in the copula be independent of covariates, and sometimes this may be good enough for the modelling purposes. If we need to include covariates for the dependence parameters, careful consideration should be given. In the following, in referring to specific copulas, we give some discussion on different ways of letting the dependence parameters depend on covariates:

- With the multinormal copula, the dependence structure in the copula for the $i$th response vector $Y_i$ is $\Theta_i = (\theta_{ijk})$. It is well-known that (i) $\Theta_i$ has to be nonnegative definite, and (ii) the component $\theta_{ijk}$ of $\Theta_i$ has to be bounded by 1 in absolute value. Under these constraints, different ways of letting $\Theta_i$ depend on covariates are possible: (a) let $\theta_{ijk} = [\exp(\beta_{jk}' w_{i,jk}) - 1]/[\exp(\beta_{jk}' w_{i,jk}) + 1]$; (b) let $\Theta_i$ have a simple correlation structure such as exchangeable and AR(1); (c) use a representation such as $z_{ij} = [\alpha_j' x_{ij}]/[1 + x_{ij}' x_{ij}]^{1/2}$, $\theta_{ijk} = r_{jk}/[(1 + x_{ij}' x_{ij})(1 + x_{ik}' x_{ik})]^{1/2}$; (d) use a more general representation such as $\theta_{ijk}$ = r^w^ jkvfi,jk; or (e) reparameterize $\Theta_i$ into the form of $d - 1$ correlations and $(d - 1)(d - 2)/2$ partial correlations. The extension (a) satisfies condition (ii), but not necessarily (i). The extension (b) satisfies conditions (i) and (ii), but is only suitable for data with a special dependence structure.
The extension (c) is more natural, as it is derived from a mixture representation (see section 3.5 for a more general form) and it satisfies condition (ii) and also condition (i) as long as the correlation matrix $(r_{jk})$ is nonnegative definite. This is an advantage in comparison with (a), as in (a), for (i) to be satisfied, all $n$ correlation matrices must be nonnegative definite. The disadvantage of (c) is that the dependence range may be limited once a particular formal representation is chosen. The extension (d) is similar to the extension (c), except that now the $\theta_{ijk}$'s are not required to depend on the same covariate vectors as $z_{ij}$. The extension (e) completely avoids the matrix constraint (i), thus removing a major obstacle to the appropriate inclusion of covariates in the dependence parameters. But this extension implicitly implies an indexing of variables, which may not be reasonable in general, although this is not a problem in many applications as the indexing of variables is often evident, such as with time series; see Joe (1996).

- With the mixture of max-id copula (3.3), extensions to parameters $\theta$, $\nu_j$, $\delta_{jk}$ as functions of the covariates are straightforward. For example, for the parameters $\theta_i$, $\nu_{ij}$ and $\delta_{ijk}$ corresponding to the random vector $Y_i$, we may have $\theta_i$, $\nu_{ij}$ constant, and $\delta_{ijk} = \exp(\beta_{jk}' w_{i,jk})$.

- With the Molenberghs-Lesaffre construction, the extension to include covariates is possible. In applications, it is often good enough to let the bivariate parameters be functions of covariates, such as $\delta_{ijk} = \exp(\beta_{jk}' w_{i,jk})$ for the bivariate Plackett copula or Frank copula, and to let the higher order parameters be constant values, such as 1. This is a simple and useful approach, but there is no guarantee that this leads to compatible parameters. (See Joe 1996 for a maximum entropy interpretation in this case.)
- With the Morgenstern copula, the extension to let the parameters $\theta_{ijk}$ be functions of some covariates is not easy, since the $\theta_{ijk}$ must satisfy some constraints. This is rather complicated and difficult to manage when the dimension is high. The situation is very similar to the multinormal copula where $\Theta_i$ should be nonnegative definite.

- With the permutation symmetric copula (3.8), the extension to include covariates is to let $\theta_i$ be a function of covariates, such as $\theta_i = \exp(\beta' w_i)$.

We see that for different copulas, there are many ways to extend the model to include covariates. Some are obvious and thus appear to be natural, others are not easy or obvious to specify.

Note also that an exchangeable structure within the copula does not imply an exchangeable structure for the response variables. For MCD models for binary data, an exchangeable structure within the copula plus constant cut-off points across all the margins implies an exchangeable structure for the response variables. The AR dependence structure for the discrete response variables should be understood as a latent Markov dependence structure (see section 3.7). When we mention an AR(1) (or AR) dependence structure, we are referring to the latent dependence structure within the multinormal copula.

In summary, under "multivariate logit models", many slightly different models are available. For example, we have the multivariate logit model with
i. multinormal copula (3.1),
ii. multivariate Molenberghs-Lesaffre construction
  a. with bivariate normal copula,
  b. with Plackett copula (2.8),
  c. with Frank copula (2.9),
iii. mixture of max-id copula (3.3),
iv. Morgenstern copula (3.6),
v. the permutation symmetric copula (3.8).

Indeed, such multiple choices of models are available for any kind of MCD model. For a discrete random vector $Y$, different copulas in (2.13) lead to different probabilistic models.
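The remark made earlier in this subsection about extension (a) for the multinormal copula, that it guarantees condition (ii) but not necessarily condition (i), can be seen in a small numerical sketch. The linear predictor values below are made up, and the helper names are hypothetical; for a $3 \times 3$ correlation matrix with entries inside $(-1, 1)$, nonnegative definiteness is equivalent to a nonnegative determinant.

```python
import math


def theta_from_predictor(eta):
    """Extension (a): map a linear predictor eta to a value in (-1, 1)."""
    return (math.exp(eta) - 1.0) / (math.exp(eta) + 1.0)


def corr3_determinant(t12, t13, t23):
    """Determinant of the 3 x 3 correlation matrix with these entries."""
    return 1.0 + 2.0 * t12 * t13 * t23 - t12 ** 2 - t13 ** 2 - t23 ** 2


# Hypothetical linear predictors for one subject's three pairs of margins.
t12 = theta_from_predictor(3.0)
t13 = theta_from_predictor(3.0)
t23 = theta_from_predictor(-3.0)

bounded = all(abs(t) < 1.0 for t in (t12, t13, t23))   # condition (ii) holds
det_bad = corr3_determinant(t12, t13, t23)             # negative: (i) fails

# With equal, moderate predictors the matrix is a proper correlation matrix.
t = theta_from_predictor(1.0)
det_ok = corr3_determinant(t, t, t)                    # positive
```

In a fitting routine this means that condition (i) would have to be checked for every subject's matrix when (a) is used. Returning to the choice among the models listed above: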
The question is when is one model preferred to another? We will discuss this question in section 3.2. In the following, as an example, we examine estimation aspects of the multivariate logit model with multinormal copula.

The multivariate logit model with multinormal copula can also be called the multivariate normal-copula logit model, to highlight the fact that the multinormal copula is used. For the case with no covariates, the estimating equations for the multivariate normal-copula logit model based on IFM can be written as
\[ \Psi_{nj}(z_j) = [n_j(1)(1 + \exp(-z_j)) - n_j(0)(1 + \exp(-z_j))\exp(z_j)] \frac{\exp(-z_j)}{(1 + \exp(-z_j))^2} = 0, \quad j = 1, \ldots, d, \]
\[ \Psi_{njk}(\theta_{jk}) = \left( \frac{n_{jk}(11)}{P_{jk}(11)} - \frac{n_{jk}(10)}{P_{jk}(10)} - \frac{n_{jk}(01)}{P_{jk}(01)} + \frac{n_{jk}(00)}{P_{jk}(00)} \right) \phi_2(\Phi^{-1}(u_j), \Phi^{-1}(u_k), \theta_{jk}) = 0, \quad 1 \le j < k \le d, \]
where $P_{jk}(11) = C_{jk}(u_j, u_k; \theta_{jk})$, $P_{jk}(10) = u_j - C_{jk}(u_j, u_k; \theta_{jk})$, $P_{jk}(01) = u_k - C_{jk}(u_j, u_k; \theta_{jk})$, $P_{jk}(00) = 1 - u_j - u_k + C_{jk}(u_j, u_k; \theta_{jk})$ with $C_{jk}(u_j, u_k; \theta_{jk}) = \Phi_2(\Phi^{-1}(u_j), \Phi^{-1}(u_k), \theta_{jk})$, $u_j = 1/[1 + \exp(-z_j)]$, $u_k = 1/[1 + \exp(-z_k)]$, and $\phi_2$ is the BVN density. We obtain the estimates $\tilde z_j = \log(n_j(1)/n_j(0))$, $j = 1, \ldots, d$, and $\tilde\theta_{jk}$ is the root of the equation $\Phi_2(\Phi^{-1}(\tilde u_j), \Phi^{-1}(\tilde u_k), \theta_{jk}) = n_{jk}(11)/n$, $1 \le j < k \le d$, where $\tilde u_j = 1/(1 + \exp(-\tilde z_j))$ and $\tilde u_k = 1/(1 + \exp(-\tilde z_k))$.

For the situation with covariates, we may have the cut-off points depending on some covariate vectors, such that
\[ z_{ij} = \alpha_{j0} x_{ij0} + \alpha_{j1} x_{ij1} + \cdots + \alpha_{jp_j} x_{ijp_j} = \alpha_j' x_{ij}, \tag{3.9} \]
where $x_{ij0} = 1$, and possibly
\[ \theta_{ijk} = \frac{\exp(\beta_{jk}' w_{i,jk}) - 1}{\exp(\beta_{jk}' w_{i,jk}) + 1}, \tag{3.10} \]
where $\beta_{jk} = (b_{jk,0}, \ldots, b_{jk,p_{jk}})'$. We recognize that (3.10) is one of the forms of functions of dependence parameters (in the multinormal copula) on covariates that we discussed previously. We use this form of function for the purpose of illustration; other forms can also be used instead. Because of the linearity of (3.9) and (3.10), the regression parameter vectors $\alpha_j$, $\beta_{jk}$ have marginal interpretations.
The loglikelihood functions of margins are
\[ \ell_{nj}(\alpha_j) = \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1, \ldots, d, \qquad \ell_{njk}(\alpha_j, \alpha_k, \beta_{jk}) = \sum_{i=1}^n \log P_{jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d, \]
where
\[ P_j(y_{ij}) = y_{ij} \frac{\exp(z_{ij})}{1 + \exp(z_{ij})} + (1 - y_{ij}) \frac{1}{1 + \exp(z_{ij})}, \]
\[ P_{jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); \theta_{ijk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); \theta_{ijk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); \theta_{ijk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); \theta_{ijk}), \]
where $a_{ij} = G_{ij}(y_{ij} - 1)$, $b_{ij} = G_{ij}(y_{ij})$, $a_{ik} = G_{ik}(y_{ik} - 1)$ and $b_{ik} = G_{ik}(y_{ik})$, with $G_{ij}(1) = 1$ and $G_{ij}(0) = 1/[1 + \exp(z_{ij})]$. We can apply quasi-Newton minimization to the loglikelihood functions of margins for getting the estimates of the parameters $\alpha_j$ and $\beta_{jk}$. The Newton-Raphson method can also be used for getting the estimates of $\alpha_j$ (this is what we used in our computer programs). In this case, we have to solve the estimating equations
\[ \Psi_{nj}(\alpha_j) = \sum_{i=1}^n \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial \alpha_j} = 0. \]
For applying the Newton-Raphson method, we need to calculate $\partial P(y_{ij})/\partial \alpha_{js}$ and $\partial^2 P(y_{ij})/\partial \alpha_{js} \partial \alpha_{jt}$. We have $\partial P(y_{ij})/\partial \alpha_{js} = (2y_{ij} - 1)\{\exp(z_{ij})/(1 + \exp(z_{ij}))^2\} x_{ijs}$, $s = 0, 1, \ldots, p_j$, and $\partial^2 P(y_{ij})/\partial \alpha_{js} \partial \alpha_{jt} = (2y_{ij} - 1)\{\exp(z_{ij})(1 - \exp(z_{ij}))/(1 + \exp(z_{ij}))^3\} x_{ijs} x_{ijt}$, $s, t = 0, 1, \ldots, p_j$. For details about the Newton-Raphson and quasi-Newton methods, see section 2.7.

$M_\psi$ and $D_\psi$ can be calculated and estimated by following the results in section 2.4. In applications, to avoid the tedious coding of $M_\psi$ and $D_\psi$, we may use the jackknife technique to obtain the asymptotic variance of $\tilde z_j$ and $\tilde\theta_{jk}$ in case there are no covariates, or that of $\tilde\alpha_j$ and $\tilde\beta_{jk}$ in case there are covariates.

3.1.2 Multivariate probit model

The general multivariate probit model, similar to the multivariate logit model, is obtained by letting $G_j(0) = 1 - \Phi(z_j)$ and $G_j(1) = 1$ in the model (2.13). The multivariate probit model in Example 2.10 is the classical multivariate probit model discussed in the literature, where the copula in (2.13) is the multinormal copula. All the discussion of the multivariate logit model is relevant and can be directly applied to the multivariate probit model.
For completeness, we give in the following some detailed discussion about the multivariate probit model when the copula is the multinormal copula, as a continuation of Example 2.10.

For the multivariate probit model in Example 2.10, it is easy to see that $\mathrm{E}(Y_j) = \Phi(z_j)$, $\mathrm{Var}(Y_j) = \Phi(z_j)(1 - \Phi(z_j))$, $\mathrm{Cov}(Y_j, Y_k) = \Phi_2(z_j, z_k, \theta_{jk}) - \Phi(z_j)\Phi(z_k)$, $j \ne k$. The correlation of the response variables $Y_j$ and $Y_k$ is
\[ \mathrm{Corr}(Y_j, Y_k) = \frac{\mathrm{Cov}(Y_j, Y_k)}{\{\mathrm{Var}(Y_j)\mathrm{Var}(Y_k)\}^{1/2}} = \frac{\Phi_2(z_j, z_k, \theta_{jk}) - \Phi(z_j)\Phi(z_k)}{\{\Phi(z_j)(1 - \Phi(z_j))\Phi(z_k)(1 - \Phi(z_k))\}^{1/2}}. \]
The variance of $Y_j$ achieves its maximum when $z_j = 0$. In this case $\mathrm{E}(Y_j) = 1/2$, $\mathrm{Var}(Y_j) = 1/4$. If $z_j = 0$, $z_k = 0$, we have $\mathrm{Cov}(Y_j, Y_k) = (\sin^{-1}\theta_{jk})/(2\pi)$, and $\mathrm{Corr}(Y_j, Y_k) = (2\sin^{-1}\theta_{jk})/\pi$. Without loss of generality, assume $z_j \le z_k$; then when $\theta_{jk}$ is at its boundary values,
\[ \mathrm{Corr}(Y_j, Y_k) = \begin{cases} \{\Phi(z_j)(1 - \Phi(z_k))/[(1 - \Phi(z_j))\Phi(z_k)]\}^{1/2}, & \theta_{jk} = 1, \\ -\{(1 - \Phi(z_j))(1 - \Phi(z_k))/[\Phi(z_j)\Phi(z_k)]\}^{1/2}, & \theta_{jk} = -1,\ -z_k \le z_j, \\ -\{\Phi(z_j)\Phi(z_k)/[(1 - \Phi(z_j))(1 - \Phi(z_k))]\}^{1/2}, & \theta_{jk} = -1,\ z_j < -z_k, \\ 0, & \theta_{jk} = 0. \end{cases} \]
From the Fréchet bound inequalities,
\[ -\min\{a^{1/2}, a^{-1/2}\} \le \mathrm{Corr}(Y_j, Y_k) \le \min\{b^{1/2}, b^{-1/2}\}, \]
where $a = [P_j(1)P_k(1)]/[P_j(0)P_k(0)]$, $b = [P_j(1)P_k(0)]/[P_j(0)P_k(1)]$, we see that $\mathrm{Corr}(Y_j, Y_k)$ attains its upper and lower bound when $\theta_{jk} = 1$ and $\theta_{jk} = -1$ respectively. $\mathrm{Corr}(Y_j, Y_k)$ is an increasing function of $\theta_{jk}$; it varies over its full range as $\theta_{jk}$ varies over its full range. Thus in a general situation a multivariate probit model of dimension $d$ consists of $d$ univariate probit models describing some marginal characteristics and $d(d - 1)/2$ latent correlations $\theta_{jk}$, $1 \le j < k \le d$, expressing the strength of the association among the response variables. $\theta_{jk} = 0$ corresponds to independence among the response variables. The response variables are exchangeable if $\Theta$ has an exchangeable structure and the cut-off points are constant across all the margins. Note that when $\Theta$ has an exchangeable structure, we must have $\theta_{jk} = \theta > -1/(d - 1)$.
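The boundary values of the correlation and the Fréchet bounds above can be compared numerically. In the sketch below the joint probability $P(Y_j = 1, Y_k = 1)$ at $\theta_{jk} = 1$ and $\theta_{jk} = -1$ is taken as $\min\{\Phi(z_j), \Phi(z_k)\}$ and $\max\{0, \Phi(z_j) + \Phi(z_k) - 1\}$ respectively; the function names and cut-off values are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def binary_corr(p11, pj, pk):
    """Correlation of two binary variables with P(Y_j = 1, Y_k = 1) = p11."""
    return (p11 - pj * pk) / math.sqrt(pj * (1 - pj) * pk * (1 - pk))


def corr_at_boundaries(zj, zk):
    """Corr(Y_j, Y_k) when the latent correlation is -1 and +1."""
    pj, pk = PHI(zj), PHI(zk)
    lower = binary_corr(max(0.0, pj + pk - 1.0), pj, pk)   # theta = -1
    upper = binary_corr(min(pj, pk), pj, pk)               # theta = +1
    return lower, upper


def frechet_corr_bounds(zj, zk):
    """Bounds -min{a^(1/2), a^(-1/2)} and min{b^(1/2), b^(-1/2)}."""
    pj, pk = PHI(zj), PHI(zk)
    a = pj * pk / ((1 - pj) * (1 - pk))
    b = pj * (1 - pk) / ((1 - pj) * pk)
    lower = -min(math.sqrt(a), 1.0 / math.sqrt(a))
    upper = min(math.sqrt(b), 1.0 / math.sqrt(b))
    return lower, upper


boundary = corr_at_boundaries(-0.5, 1.0)
bounds = frechet_corr_bounds(-0.5, 1.0)   # agrees with `boundary`
```

Only when both cut-off points are zero do the bounds reach $-1$ and $1$; with unequal margins the attainable range of $\mathrm{Corr}(Y_j, Y_k)$ is strictly narrower, which is one reason the latent correlation $\theta_{jk}$ is the more convenient dependence parameter.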
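The estimating equations for the multivariate probit model with multinormal copula, based on $n$ response vectors $y_i$, $i = 1, \ldots, n$, are
\[ \Psi_{nj}(z_j) = \left( \frac{n_j(1)}{\Phi(z_j)} - \frac{n_j(0)}{1 - \Phi(z_j)} \right) \phi(z_j) = 0, \quad j = 1, \ldots, d, \]
\[ \Psi_{njk}(\theta_{jk}) = \left( \frac{n_{jk}(11)}{\Phi_2(z_j, z_k, \theta_{jk})} - \frac{n_{jk}(10)}{\Phi(z_j) - \Phi_2(z_j, z_k, \theta_{jk})} - \frac{n_{jk}(01)}{\Phi(z_k) - \Phi_2(z_j, z_k, \theta_{jk})} + \frac{n_{jk}(00)}{1 - \Phi(z_j) - \Phi(z_k) + \Phi_2(z_j, z_k, \theta_{jk})} \right) \phi_2(z_j, z_k, \theta_{jk}) = 0, \quad 1 \le j < k \le d. \]
These lead to the solutions $\tilde z_j = \Phi^{-1}(n_j(1)/n)$, $j = 1, \ldots, d$, and $\tilde\theta_{jk}$ is the root of the equation $\Phi_2(\tilde z_j, \tilde z_k, \theta_{jk}) = n_{jk}(11)/n$, $1 \le j < k \le d$.

For the situation with covariates, the details are similar to the multivariate logit model with multinormal copula in the preceding subsection, except that now we have
\[ P_j(y_{ij}) = y_{ij}\Phi(z_{ij}) + (1 - y_{ij})(1 - \Phi(z_{ij})), \]
\[ P_{jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); \theta_{ijk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); \theta_{ijk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); \theta_{ijk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); \theta_{ijk}), \]
with $a_{ij} = G_{ij}(y_{ij} - 1)$, $b_{ij} = G_{ij}(y_{ij})$, $a_{ik} = G_{ik}(y_{ik} - 1)$ and $b_{ik} = G_{ik}(y_{ik})$, with $G_{ij}(1) = 1$ and $G_{ij}(0) = 1 - \Phi(z_{ij})$. We also have $\partial P(y_{ij})/\partial \alpha_{js} = (2y_{ij} - 1)\phi(z_{ij}) x_{ijs}$, $s = 0, 1, \ldots, p_j$, and $\partial^2 P(y_{ij})/\partial \alpha_{js} \partial \alpha_{jt} = (1 - 2y_{ij})\phi(z_{ij}) z_{ij} x_{ijs} x_{ijt}$, $s, t = 0, 1, \ldots, p_j$; these expressions are needed for applying the Newton-Raphson method to get estimates of $\alpha_j$.

$M_\psi$ and $D_\psi$ can be calculated and estimated by following the results in section 2.4. For example, for the case with no covariates, we have $\mathrm{E}(\psi^2(z_j)) = \phi^2(z_j)/\{\Phi(z_j)(1 - \Phi(z_j))\}$ and $\mathrm{E}(\psi^2(\theta_{jk})) = [1/P_{jk}(11) + 1/P_{jk}(10) + 1/P_{jk}(01) + 1/P_{jk}(00)][\partial P_{jk}(11)/\partial \theta_{jk}]^2$, where $\partial P_{jk}(11)/\partial \theta_{jk} = \partial \Phi_2(z_j, z_k, \theta_{jk})/\partial \theta_{jk} = \phi_2(z_j, z_k, \theta_{jk})$, a result due to Plackett (1954). In applications, to avoid the tedious computer coding of $M_\psi$ and $D_\psi$, we may use the jackknife technique to obtain the asymptotic variance of $\tilde z_j$ and $\tilde\theta_{jk}$ in case there are no covariates, or that of $\tilde\alpha_j$ and $\tilde\beta_{jk}$ in case there are covariates.

3.2 Comparison of models

We obtain many models under the name of multivariate logit model (also multivariate probit model) for binary data. An immediate question is when is one model preferred to another?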
In section 1.3, we outlined some desirable features of multivariate models; among them (2) and (3) may be the most important. In applications, however, the importance of a particular desirable feature of a multivariate model may well depend on the practical needs and constraints. As an example, we briefly compare the multivariate logit models and the multivariate probit models with different copulas studied in section 3.1.

The multivariate logit model with multinormal copula satisfies the desirable properties (1), (2), (3) and (4) of a multivariate model outlined in section 1.3, but not (5). The multivariate probit model with multinormal copula is similar, except that one has logit univariate margins and the other has probit univariate margins. In applications, the multivariate logit model with multinormal copula may be preferred to the multivariate probit model with multinormal copula, because the former has a closed form univariate marginal cdf. This consideration also leads to a general preference for the multivariate logit model over the multivariate probit model when both have the same associated multivariate copula. For this reason, in the following, we concentrate on the discussion of multivariate logit models.

The multivariate logit model with the mixture of max-id copula (3.3) satisfies the desirable properties (1), (3) and (5) of a multivariate model outlined in section 1.3, but only partially (2) and (4). The model only admits positive dependence (otherwise, it is flexible and wide in terms of dependence range) and it is CUOM($k$) ($k > 2$) but not CUOM. The closed form cdf of this model is a very attractive feature. If the data exhibit only positive dependence (or prior knowledge tells us so), then the multivariate logit model with mixture of max-id copula (3.3) may be preferred to the multivariate logit model with multinormal copula.
The multivariate logit model with the M-L construction satisfies the desirable properties (1), (2), (3) and (4) of a multivariate model outlined in section 1.3, but not (5). The computation of the cdf may be easier numerically than that of the multivariate logit model with multinormal copula, since the former only requires solving a set of polynomial equations whereas the latter requires multiple integration. The disadvantage of this model, as stated earlier, is that the object obtained from the construction has not been proven to be a proper multivariate copula. What has been verified numerically (see Joe 1996) is that (3.4) and its extensions do not yield proper distributions if $\eta_{1234}$ and $\eta_{jkl}$ ($1 \le j < k < l \le 4$) are either too small or too large. In any case, the useful thing about this model is that it leads to multivariate objects with given proper univariate and bivariate margins.

The multivariate logit model with the Morgenstern copula satisfies the desirable properties (1), (4) and (5) of a multivariate model outlined in section 1.3, but not (2) and (3). This is a major drawback, so this model is not very useful.

The multivariate logit models with the permutation symmetric copulas (3.7) are only suitable for modelling data with special exchangeable dependence patterns. They cannot be considered as widely applicable models, because the desirable property (2) of multivariate models is not satisfied. Nevertheless, such a model may be worth considering in some applications, such as when the data to be modelled are repeated measures over different treatments, or familial data.

In summary, for general applications, the multivariate logit model with the multinormal copula or the mixture of max-id copula (3.3) may be preferred.
If the condition of positive dependence holds in a study, then the multivariate logit model with the mixture of max-id copula (3.3) may be preferred to the multivariate logit model with multinormal copula because the former has a closed form multivariate cdf; this is particularly attractive for a moderate to large dimension $d$ of the response. The multivariate logit model with the Molenberghs-Lesaffre construction may be another interesting option. When several models fit the data about equally well, a preference for one should be based on which desirable feature is considered important to the successful application of the models. In many situations, several equally good models may be possible; see Chapter 5 for discussion and data analysis examples.

In the statistical literature, the multivariate probit model with multinormal copula has been studied and applied. An early reference on an application to binary data is Ashford and Sowden (1970). One explanation of the popularity of the multivariate probit model with multinormal copula is that the model is related to the multivariate normal distribution, which allows the multivariate probit model to accommodate dependence over its full range for the response variables. Furthermore, the marginal models are simple univariate probit models.

3.3 Multivariate copula discrete models for count data

Univariate count data may be modelled by binomial, negative binomial, logarithmic, Poisson, or generalized Poisson distributions, depending on the amount of dispersion. In this section, we study some MCD models for multivariate count data.

3.3.1 Multivariate Poisson model

The multivariate Poisson model for Poisson count data is obtained by letting $G_j(y_j) = \sum_{m=0}^{y_j} p_j^{(m)}$, $y_j = 0, 1, 2, \ldots, \infty$, $j = 1, 2, \ldots, d$, in the model (2.13), where $p_j^{(m)} = [\lambda_j^m \exp(-\lambda_j)]/m!$, $\lambda_j > 0$. The copula $C$ in (2.13) is arbitrary. Copulas (3.1)-(3.8) are some interesting choices here.
The multivariate Poisson model has univariate Poisson marginals. We have $E(Y_j) = \mathrm{Var}(Y_j) = \lambda_j$, which is a characterizing feature of the Poisson distribution called equidispersion. There are situations where the variance of count data is greater than the mean, or the variance is smaller than the mean. The former case is called overdispersion and the latter case is called underdispersion. We will see models dealing with overdispersion and underdispersion in the subsequent sections. Although the multivariate Poisson model has Poisson univariate marginal distributions, the conditional distributions are not Poisson.

The univariate parameter $\lambda_j$ in the multivariate Poisson model can be reparameterized by taking $\eta_j = \log(\lambda_j)$ so that the new parameter $\eta_j$ has the range $(-\infty, \infty)$. It is also straightforward to extend the univariate marginal parameters to include covariates. For example, for $\lambda_{ij}$ corresponding to random vector $Y_i$, we can let $\lambda_{ij} = \exp(\alpha_j' x_{ij})$, where $\alpha_j$ is a parameter vector and $x_{ij}$ is a covariate vector corresponding to the $j$th margin. The discussion on modelling the dependence parameters in the copulas in section 3.1 is also relevant here. Most of the discussion in section 3.2 about the comparison of models is also relevant here, as the comparison is essentially the comparison of the associated multivariate copulas.

In summary, under the name "multivariate Poisson models", we may have the multivariate Poisson model with

i. multinormal copula (3.1),
ii. multivariate Molenberghs-Lesaffre construction
   a. with bivariate normal copula,
   b. with Plackett copula (2.8),
   c. with Frank copula (2.9),
iii. mixture of max-id copula (3.3),
iv. Morgenstern copula (3.6),
v. the permutation symmetric copula (3.8).

These are similar to the multivariate logit models for binary data. For illustration purposes, in the following, we provide some details on the multivariate Poisson model with the multinormal copula.
The multivariate Poisson model with the multinormal copula can also be called the multivariate normal-copula Poisson model. This model was already introduced in Example 2.11. For the multivariate normal-copula Poisson model, the Fréchet upper bound is reached in the limit if $\Theta = J$, where $J$ is the matrix of 1's. In fact, when $\theta_{jk} = 1$ and $\lambda_j = \lambda_k$, the correlation of the response variables $Y_j$ and $Y_k$ is 1. When $\Theta$ has an exchangeable structure and $\lambda_j$ does not depend on $j$, there is also an exchangeable correlation structure in the response vector $Y$.

The loglikelihood functions of margins for the parameters $\lambda$ and $\theta$ based on $n$ observed response vectors $y_i$, $i = 1, \ldots, n$, are:

\[
\ell_{nj}(\lambda_j) = \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1, \ldots, d, \qquad
\ell_{njk}(\theta_{jk}) = \sum_{i=1}^n \log P_{jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d,
\tag{3.11}
\]

where $P_j(y_{ij}) = \lambda_j^{y_{ij}} \exp(-\lambda_j)/y_{ij}!$ and

\[
P_{jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); \theta_{jk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); \theta_{jk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); \theta_{jk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); \theta_{jk}),
\]

where $a_{ij} = G_{ij}(y_{ij} - 1)$, $b_{ij} = G_{ij}(y_{ij})$, $a_{ik} = G_{ik}(y_{ik} - 1)$ and $b_{ik} = G_{ik}(y_{ik})$, with $G_{ij}(y_{ij}) = \sum_{x=0}^{y_{ij}} p_j^{(x)}$ and $G_{ik}(y_{ik}) = \sum_{x=0}^{y_{ik}} p_k^{(x)}$. The estimating equations based on IFM are

\[
\Psi_{nj}(\lambda_j) = \sum_{i=1}^n \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial\lambda_j} = 0, \quad j = 1, \ldots, d, \qquad
\Psi_{njk}(\theta_{jk}) = \sum_{i=1}^n \frac{1}{P_{jk}(y_{ij} y_{ik})} \frac{\partial P_{jk}(y_{ij} y_{ik})}{\partial\theta_{jk}} = 0, \quad 1 \le j < k \le d,
\tag{3.12}
\]

which lead to $\hat\lambda_j = \sum_{i=1}^n y_{ij}/n$; $\hat\theta_{jk}$ can be found through numerical computation.

An extension of the multivariate normal-copula Poisson model with covariate $x_{ij}$ for the response observation $y_{ij}$ is to let $\lambda_{ij} = h_j(\gamma_j, x_{ij})$ for some function $h_j$ in the range $[0, \infty)$. An example of the function $h_j$ is $\lambda_{ij} = \exp(\gamma_j' x_{ij})$ (or $\log(\lambda_{ij}) = \gamma_j' x_{ij}$). The ways to let the dependence parameters $\theta_{jk}$ be functions of covariates follow the discussion in section 3.1 for the multivariate logit model with multinormal copula.

We can apply quasi-Newton minimization to the negatives of the loglikelihood functions of margins (3.11) to obtain the estimates of the parameters $\gamma_j$ and the dependence parameters $\theta_{jk}$ (or the regression parameters for the dependence parameters if applicable).
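As an illustration of the bivariate marginal pmf $P_{jk}$ entering (3.11), the following is a minimal sketch assuming only the Python standard library. The function names (`Phi2`, `normal_copula`, `poisson_cdf`, `pmf_pair`) are hypothetical, and the bivariate normal cdf is computed by Simpson integration of Plackett's identity, valid for $|\theta_{jk}| < 1$; a production implementation would more likely call a dedicated bivariate normal routine.

```python
import math
from statistics import NormalDist

N = NormalDist()  # standard normal distribution

def phi2(h, k, r):
    """Standard bivariate normal density with correlation r, |r| < 1."""
    q = (h * h - 2.0 * r * h * k + k * k) / (1.0 - r * r)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - r * r))

def Phi2(h, k, rho, n=200):
    """Bivariate normal cdf via Plackett's identity and Simpson's rule."""
    step = rho / n
    total = phi2(h, k, 0.0) + phi2(h, k, rho)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * phi2(h, k, i * step)
    return N.cdf(h) * N.cdf(k) + total * step / 3.0

def normal_copula(u, v, rho):
    """C(u, v; rho), with the boundary cases handled explicitly."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    if u >= 1.0:
        return min(v, 1.0)
    if v >= 1.0:
        return u
    return Phi2(N.inv_cdf(u), N.inv_cdf(v), rho)

def poisson_cdf(y, lam):
    """G(y) = sum_{m=0}^{y} lam^m exp(-lam) / m!, with G(y) = 0 for y < 0."""
    if y < 0:
        return 0.0
    term = total = math.exp(-lam)
    for m in range(1, y + 1):
        term *= lam / m
        total += term
    return min(total, 1.0)

def pmf_pair(yj, yk, lam_j, lam_k, theta):
    """Bivariate marginal pmf P_jk(y_j y_k) as a rectangle probability."""
    aj, bj = poisson_cdf(yj - 1, lam_j), poisson_cdf(yj, lam_j)
    ak, bk = poisson_cdf(yk - 1, lam_k), poisson_cdf(yk, lam_k)
    return (normal_copula(bj, bk, theta) - normal_copula(bj, ak, theta)
            - normal_copula(aj, bk, theta) + normal_copula(aj, ak, theta))

# With theta = 0 the pair probability factorizes into two Poisson pmfs.
print(pmf_pair(2, 1, 1.5, 0.7, 0.0))
```

Summing the logarithm of `pmf_pair` over the observations, with $\lambda_j$ and $\lambda_k$ fixed at the sample means, gives the function of $\theta_{jk}$ that the second stage of IFM maximizes.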
The Newton-Raphson method can also be used to obtain the estimates of $\gamma_j$ from $\Psi_{nj}(\gamma_j) = 0$. Let $\log(\lambda_{ij}) = \gamma_{j0} + \gamma_{j1}x_{ij1} + \cdots + \gamma_{jp_j}x_{ijp_j}$. To apply the Newton-Raphson method, we need to calculate $\partial P(y_{ij})/\partial\gamma_{js}$ and $\partial^2 P(y_{ij})/\partial\gamma_{js}\partial\gamma_{jt}$. If we let $x_{ij0} = 1$, we have

\[
\frac{\partial P(y_{ij})}{\partial\gamma_{js}} = \frac{\lambda_{ij}^{y_{ij}} \exp(-\lambda_{ij})}{y_{ij}!} [y_{ij} - \lambda_{ij}] x_{ijs}, \quad s = 0, 1, \ldots, p_j,
\]

and

\[
\frac{\partial^2 P(y_{ij})}{\partial\gamma_{js}\partial\gamma_{jt}} = \frac{\lambda_{ij}^{y_{ij}} \exp(-\lambda_{ij})}{y_{ij}!} [(y_{ij} - \lambda_{ij})^2 - \lambda_{ij}] x_{ijs} x_{ijt}, \quad s, t = 0, 1, \ldots, p_j.
\]

For details about numerical methods, see section 2.7.

3.3.2 Multivariate generalized Poisson model

The multivariate generalized Poisson model for count data is obtained by letting $G_j(y_j) = \sum_{s=0}^{y_j} p_j^{(s)}$, $y_j = 0, 1, 2, \ldots, \infty$, $j = 1, 2, \ldots, d$, in the model (2.13), where

\[
p_j^{(s)} =
\begin{cases}
\dfrac{\lambda_j(\lambda_j + s\alpha_j)^{s-1}}{s!} \exp(-\lambda_j - s\alpha_j), & s = 0, 1, 2, \ldots, \\
0, & \text{for } s > m, \text{ when } \alpha_j < 0,
\end{cases}
\tag{3.14}
\]

where $\lambda_j > 0$, $\max(-1, -\lambda_j/m) < \alpha_j < 1$ and $m$ ($\ge 4$) is the largest positive integer for which $\lambda_j + m\alpha_j > 0$ when $\alpha_j$ is negative. The copula $C$ in (2.13) is arbitrary. The copulas (3.1)-(3.8) are some choices here.

The multivariate generalized Poisson model has as its $j$th ($j = 1, \ldots, d$) margin the generalized Poisson distribution with pmf (3.14). This generalized Poisson distribution is extensively studied in a monograph by Consul (1989). Its main characteristic is that it allows for both overdispersion and underdispersion by introducing one additional parameter $\alpha_j$. The generalized Poisson distribution has the Poisson distribution as a special case when $\alpha_j = 0$. The mean and variance for $Y_j$ are $E(Y_j) = \lambda_j(1 - \alpha_j)^{-1}$ and $\mathrm{Var}(Y_j) = \lambda_j(1 - \alpha_j)^{-3}$, respectively. Thus, the generalized Poisson distribution displays overdispersion for $0 < \alpha_j < 1$, equidispersion for $\alpha_j = 0$ and underdispersion for $\max(-1, -\lambda_j/m) < \alpha_j < 0$. The restrictions leading to underdispersion are rather complicated, as the parameters $\alpha_j$ are restricted by the sample space. It is easier to work with the overdispersion situation, where the restrictions are simply $\lambda_j > 0$, $0 < \alpha_j < 1$.

The details of applying the IFM procedure to the generalized Poisson model are similar to those for the multivariate Poisson model. For the situation with no covariates, the univariate estimating equations for the parameters $\lambda_j$ and $\alpha_j$, $j = 1, \ldots, d$, are

\[
\Psi_{nj}(\lambda_j) = \sum_{i=1}^n \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial\lambda_j} = 0, \qquad
\Psi_{nj}(\alpha_j) = \sum_{i=1}^n \frac{1}{P_j(y_{ij})} \frac{\partial P_j(y_{ij})}{\partial\alpha_j} = 0.
\tag{3.15}
\]

They lead to

\[
\begin{cases}
\displaystyle\sum_{i=1}^n \frac{y_{ij}(y_{ij} - 1)}{y_{+j} + (n y_{ij} - y_{+j})\alpha_j} - \frac{y_{+j}}{n} = 0, \\[2ex]
y_{+j}(1 - \alpha_j) - n\lambda_j = 0,
\end{cases}
\]

where $y_{+j} = \sum_{i=1}^n y_{ij}$. When there is a covariate vector $x_{ij}$ for the response observation $y_{ij}$, we may let $\lambda_{ij} = a_j(\gamma_j, x_{ij})$ for some function $a_j$ in the range $[0, \infty)$, and let $\alpha_{ij} = b_j(\eta_j, x_{ij})$ for some function $b_j$ in the range $[0, 1]$. An example is $\lambda_{ij} = \exp(\gamma_j' x_{ij})$ (or $\log(\lambda_{ij}) = \gamma_j' x_{ij}$) and $\alpha_{ij} = 1/[1 + \exp(-\eta_j' x_{ij})]$.

The discussion on modelling the dependence parameters in the copulas in section 3.1 is also appropriate here. Furthermore, most of the discussion in section 3.2 about the comparison of models is also relevant here, since the comparison is essentially the comparison of the associated multivariate copulas.

3.3.3 Multivariate negative binomial model

The multivariate negative binomial model for count data is obtained by letting $G_j(y_j) = \sum_{s=0}^{y_j} p_j^{(s)}$, $y_j = 0, 1, 2, \ldots, \infty$, $j = 1, 2, \ldots, d$, in the model (2.13), where

\[
p_j^{(s)} = \frac{\Gamma(\alpha_j + s)}{\Gamma(\alpha_j)\, s!}\, p_j^{\alpha_j} (1 - p_j)^s, \quad s = 0, 1, 2, \ldots,
\tag{3.16}
\]

with $\alpha_j > 0$ and $0 < p_j < 1$. The mean and variance for $Y_j$ are $E(Y_j) = \alpha_j(1 - p_j)/p_j$ and $\mathrm{Var}(Y_j) = \alpha_j(1 - p_j)/p_j^2$, respectively. Since $\mathrm{Var}(Y_j)/E(Y_j) = 1/p_j > 1$, this model allows for overdispersion. When there is a covariate vector $x_{ij}$ for the response observation $y_{ij}$, we may let $\alpha_{ij} = a_j(\gamma_j, x_{ij})$ for some function $a_j$ in the range $[0, \infty)$, and let $p_{ij} = b_j(\eta_j, x_{ij})$ for some function $b_j$ in the range $[0, 1]$. See Lawless (1987) for another way to deal with covariates. Other details are similar to those for the multivariate generalized Poisson model.
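For the generalized Poisson margin of section 3.3.2, the moment formulas and the reduced estimating equations can be illustrated as follows. This is a minimal sketch restricted to the overdispersed case $0 \le \alpha_j < 1$, using only the Python standard library; the function names and the bisection solver are illustrative choices, not the numerical method of section 2.7.

```python
import math

def gp_logpmf(s, lam, alpha):
    """Log of the generalized Poisson pmf (3.14), for lam > 0 and 0 <= alpha < 1."""
    return (math.log(lam) + (s - 1) * math.log(lam + s * alpha)
            - lam - s * alpha - math.lgamma(s + 1))

def gp_moments(lam, alpha, smax=2000):
    """Total mass, mean and variance by direct summation up to smax."""
    probs = [math.exp(gp_logpmf(s, lam, alpha)) for s in range(smax + 1)]
    total = sum(probs)
    mean = sum(s * p for s, p in enumerate(probs))
    var = sum((s - mean) ** 2 * p for s, p in enumerate(probs))
    return total, mean, var

def gp_fit(y, tol=1e-13):
    """Solve the reduced equation in alpha by bisection on [0, 1), then set
    lam = ybar * (1 - alpha). Assumes the sample is overdispersed."""
    n, ysum = len(y), sum(y)

    def g(a):
        return sum(v * (v - 1) / (ysum + (n * v - ysum) * a) for v in y) - ysum / n

    lo, hi = 0.0, 1.0 - 1e-9
    if g(lo) <= 0.0:
        raise ValueError("sample is not overdispersed; no root in (0, 1)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return ysum * (1.0 - alpha) / n, alpha

print(gp_moments(2.0, 0.3))        # mean near 2/0.7, variance near 2/0.7**3
print(gp_fit([0, 1, 1, 2, 3, 5, 8]))
```

The bisection bracket works because the left-hand side of the first reduced equation is positive at $\alpha_j = 0$ exactly when the sample variance (with divisor $n$) exceeds the sample mean, and is negative as $\alpha_j \to 1$.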
3.3.4 Multivariate logarithmic series model

The multivariate logarithmic series model for count data is obtained by letting $G_j(y_j) = \sum_{s=1}^{y_j} p_j^{(s)}$, $y_j = 0, 1, 2, \ldots, \infty$, $j = 1, 2, \ldots, d$, in the model (2.13), where

\[
p_j^{(s)} = \alpha_j p_j^s / s, \quad s = 1, 2, \ldots,
\tag{3.17}
\]

with $\alpha_j = -[\log(1 - p_j)]^{-1}$ and $0 < p_j < 1$. The mean and variance for $Y_j$ are $E(Y_j) = \alpha_j p_j/(1 - p_j)$ and $\mathrm{Var}(Y_j) = \alpha_j p_j(1 - \alpha_j p_j)/(1 - p_j)^2$, respectively. This model allows for overdispersion when $p_j > 1 - e^{-1}$ and underdispersion when $p_j < 1 - e^{-1}$. Note that for this model to allow a zero count, we need a shift of one such that $p_j^{(t)} = \alpha_j p_j^{t+1}/(t + 1)$ for $t = 0, 1, 2, \ldots$.

For the situation where there is a covariate vector $x_{ij}$ for the response observation $y_{ij}$, we may let $p_{ij} = F_j(\gamma_j' x_{ij})$, where $F_j$ is a univariate cdf.

An unattractive feature of this model is that $p_j^{(s)}$ is a decreasing function of $s$, which may not be suitable in many applications.

3.4 Multivariate copula discrete models for ordinal data

In this section, we discuss the modelling of multivariate ordinal categorical data with multivariate copula discrete (MCD) models. We first briefly discuss some special features of ordinal categorical data before we introduce the general MCD model for ordinal data and some specific models.

When a polytomous variable has an ordered structure, we may assume the existence of a latent continuous random variable that measures the level of the ordered polytomous variable. For a binary variable, models for ordered data and unordered data are equivalent, but for categorical variables with more than 2 categories, ordered data and unordered data are quite different. The modelling of unordered data is not as straightforward as the modelling of ordered data. This is especially so in the multivariate situation, where it is not obvious how to model the dependence structure of unordered data. We will discuss briefly the modelling of multivariate polytomous unordered data in Chapter 7.
One aspect of ordinal data worth noting is that one category can be combined with an adjacent category for data analysis. This practice is less meaningful for unordered categorical data, since the notion of adjacent category does not apply, and arbitrary clumping of categories may be unsatisfactory.

We next introduce the MCD model for ordinal data. Consider a $d$-dimensional ordinal categorical random vector $Y$ with $m_j$ categories for the $j$th margin ($j = 1, 2, \ldots, d$) and with the categories coded as $1, 2, \ldots, m_j$. For the $j$th margin, the outcome $y_j$ can take values $1, 2, \ldots, m_j$, where $m_j$ can differ with the index $j$. For $Y_j$, suppose the probability of outcome $s$, $s = 1, 2, \ldots, m_j$, is $p_j^{(s)}$. We define

\[
G_j(y_j) =
\begin{cases}
0, & y_j < 1, \\
\sum_{s=1}^{[y_j]} p_j^{(s)}, & 1 \le y_j < m_j, \\
1, & y_j \ge m_j,
\end{cases}
\tag{3.18}
\]

where $[y_j]$ means the largest integer less than or equal to $y_j$. For a given $d$-dimensional copula $C(u_1, \ldots, u_d; \theta)$, $C(G_1(y_1), \ldots, G_d(y_d); \theta)$ is a well-defined distribution for the ordinal random vector $Y$. The pmf of $Y$ is

\[
P(y_1 \cdots y_d) = \sum_{i_1=1}^{2} \cdots \sum_{i_d=1}^{2} (-1)^{i_1 + \cdots + i_d}\, C(a_{1i_1}, \ldots, a_{di_d}; \theta),
\tag{3.19}
\]

where $a_{j1} = G_j(y_j - 1)$ and $a_{j2} = G_j(y_j)$. The model (3.19) is called the multivariate copula discrete model for ordinal data.

Since $Y_j$ is an ordered categorical variable, one simple way to reparameterize $p_j^{(s)}$, so that the new parameter associated with the univariate margin ranges over the entire real line, is to let $G_j(y_j) = F_j(z_j(y_j))$, where $F_j$ is the cdf of a continuous random variable $Z_j$. Thus $p_j^{(y_j)} = F_j(z_j(y_j)) - F_j(z_j(y_j - 1))$. This is equivalent to

\[
\begin{aligned}
Y_j = 1 \quad &\text{iff} \quad z_j(0) < Z_j \le z_j(1), \\
Y_j = 2 \quad &\text{iff} \quad z_j(1) < Z_j \le z_j(2), \\
&\ \vdots \\
Y_j = m_j \quad &\text{iff} \quad z_j(m_j - 1) < Z_j \le z_j(m_j),
\end{aligned}
\tag{3.20}
\]

where $-\infty = z_j(0) < z_j(1) < \cdots < z_j(m_j - 1) < z_j(m_j) = \infty$ are constants, $j = 1, 2, \ldots, d$, and the random vector $Z = (Z_1, \ldots, Z_d)'$ has a multivariate cdf $F_{12\cdots d}$. In the literature, the representation in (3.20) is referred to as modelling $Y$ through the latent random vector $Z$, and the parameter $z_j(y_j)$ is called the $y_j$th cut-off point for the random variable $Z_j$.

As for the MCD model for binary data, the choices of $F_j$ are abundant. It can be the standard logistic, normal, extreme value, gamma or lognormal cdf, and so on. Furthermore, the $F_j(z_j)$ need not be in the same distribution family for different $j$. Similarly, the copula $C(u_1, \ldots, u_d; \theta)$ can be the multivariate normal copula, a mixture of max-id copula, the Molenberghs-Lesaffre construction, the Morgenstern copula or a permutation symmetric copula. It is also possible to express the parameters $z_j(y_j)$ as functions of covariates, as we will see through examples. For the dependence parameters $\theta$ in the copula $C(u_1, \ldots, u_d; \theta)$, there is also the option of including covariates. The discussion in section 3.1 on letting the dependence parameters in the copulas be functions of covariates is also relevant here, since this only depends on the associated copulas. In the following, in parallel with the multivariate models for binary data, we will see some examples of multivariate models for ordinal data.

3.4.1 Multivariate logit model

The multivariate logit model for ordinal data is obtained by letting $G_j(y_j) = \exp(z_j(y_j))/[1 + \exp(z_j(y_j))]$ in (3.19), where $-\infty = z_j(0) < z_j(1) < \cdots < z_j(m_j - 1) < z_j(m_j) = \infty$ are constants, $j = 1, 2, \ldots, d$. This is equivalent to letting $F_j(z) = \exp(z)/[1 + \exp(z)]$, that is, to choosing $F_j$ to be the standard logistic cdf. The copula $C$ in (3.19) is arbitrary. The copulas (3.1)-(3.8) are some choices here.
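The rectangle formula (3.19) with logistic margins can be sketched in a few lines. The code below is an illustrative sketch using only the Python standard library; the function names are hypothetical, and the two copulas used (independence and the Fréchet upper bound) are chosen only because they need no numerical integration, not because they are among the copulas (3.1)-(3.8).

```python
import itertools
import math

def logistic_cdf(z):
    """Standard logistic cdf F(z), with the infinite cut-off points handled."""
    if z == math.inf:
        return 1.0
    if z == -math.inf:
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_cdf(y, cuts):
    """G_j(y) = F(z_j(y)), where cuts = [z_j(1), ..., z_j(m_j - 1)]
    and z_j(0) = -inf, z_j(m_j) = +inf."""
    m = len(cuts) + 1
    if y < 1:
        return 0.0
    if y >= m:
        return 1.0
    return logistic_cdf(cuts[y - 1])

def mcd_pmf(y, cuts_list, copula):
    """P(y_1 ... y_d) from (3.19): signed sum of the copula over the 2^d
    corners, where index 1 selects G_j(y_j - 1) and index 2 selects G_j(y_j)."""
    total = 0.0
    for idx in itertools.product((1, 2), repeat=len(y)):
        corner = [ordinal_cdf(yj - 1 if i == 1 else yj, cuts)
                  for yj, i, cuts in zip(y, idx, cuts_list)]
        total += (-1) ** sum(idx) * copula(corner)
    return total

def independence_copula(u):
    return math.prod(u)

def frechet_upper_copula(u):
    return min(u)

cuts_list = [[-1.0, 0.5], [0.0]]  # three categories, then two categories
print(mcd_pmf((2, 1), cuts_list, independence_copula))
```

Any function mapping a list of $d$ uniform arguments to a copula value can be passed as `copula`, so the same routine serves for each of the models discussed in this section once its copula is coded.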
It is relatively straightforward to extend the univariate marginal parameters to include covariates. For example, for $z_{ij}$ corresponding to random vector $Y_i$, we can let $z_{ij}(y_{ij}) = \gamma_j(y_{ij}) + g_j(\alpha_j, x_{ij})$, for some constants $-\infty = \gamma_j(0) < \gamma_j(1) < \cdots < \gamma_j(m_j - 1) < \gamma_j(m_j) = \infty$, and some function $g_j$ in the range $(-\infty, \infty)$. An example of the function $g_j$ is the linear form $g_j(\alpha_j, x_{ij}) = \alpha_j' x_{ij}$. As we have discussed for the multivariate copula discrete models for binary data, a simple way to deal with the dependence parameters is to let the dependence parameters in the copula be independent of covariates. Letting the dependence parameters be functions of covariates requires specific knowledge of the associated copula $C$. The discussion in section 3.1 on letting the dependence parameters in the copulas be functions of covariates for the multivariate logit models for binary data is also relevant here.

As with the multivariate logit models for binary data, we may also have the multivariate logit model for ordinal data with

i. multinormal copula (3.1),
ii. multivariate Molenberghs-Lesaffre construction
   a. with bivariate normal copula,
   b. with Plackett copula (2.8),
   c. with Frank copula (2.9),
iii. mixture of max-id copula (3.3),
iv. Morgenstern copula (3.6),
v. the permutation symmetric copula (3.8).

For illustrative purposes, we give some details on the multivariate logit model with multinormal copula for ordinal data. The multivariate logit model with multinormal copula for ordinal data is also called the multivariate normal-copula logit model for ordinal data. Let the data be $y_i = (y_{i1}, \ldots, y_{id})$, $i = 1, \ldots, n$. For the situation with no covariates, there are $\sum_{j=1}^d (m_j - 1)$ univariate parameters and $d(d - 1)/2$ dependence parameters. The estimating equations from (2.42) are

\[
\Psi_{nj}(z_j(y_j)) = \left( \frac{n_j(y_j)}{F(z_j(y_j)) - F(z_j(y_j - 1))} - \frac{n_j(y_j + 1)}{F(z_j(y_j + 1)) - F(z_j(y_j))} \right) \frac{\exp(-z_j(y_j))}{(1 + \exp(-z_j(y_j)))^2} = 0, \quad y_j = 1, \ldots, m_j - 1,\ j = 1, \ldots, d,
\]

\[
\Psi_{njk}(\theta_{jk}) = \sum_{y_j=1}^{m_j} \sum_{y_k=1}^{m_k} \frac{n_{jk}(y_j y_k)}{P_{jk}(y_j y_k)} \frac{\partial P_{jk}(y_j y_k)}{\partial\theta_{jk}} = 0, \quad 1 \le j < k \le d,
\]

where $F(z) = 1/[1 + \exp(-z)]$, and

\[
P_{jk}(y_j y_k) = \Phi_2(\Phi^{-1}(u_j), \Phi^{-1}(u_k); \theta_{jk}) - \Phi_2(\Phi^{-1}(u_j), \Phi^{-1}(u_k'); \theta_{jk}) - \Phi_2(\Phi^{-1}(u_j'), \Phi^{-1}(u_k); \theta_{jk}) + \Phi_2(\Phi^{-1}(u_j'), \Phi^{-1}(u_k'); \theta_{jk}),
\]

with $u_j = 1/[1 + \exp(-z_j(y_j))]$, $u_k = 1/[1 + \exp(-z_k(y_k))]$, $u_j' = 1/[1 + \exp(-z_j(y_j - 1))]$ and $u_k' = 1/[1 + \exp(-z_k(y_k - 1))]$. From $\Psi_{nj}(z_j(y_j)) = 0$, we obtain

\[
F(z_j(y_j + 1)) - F(z_j(y_j)) = \frac{n_j(y_j + 1)}{n_j(y_j)} \big( F(z_j(y_j)) - F(z_j(y_j - 1)) \big) = \frac{n_j(y_j + 1)}{n_j(1)} F(z_j(1)).
\]

This implies that

\[
\sum_{y_j=0}^{m_j - 1} \big( F(z_j(y_j + 1)) - F(z_j(y_j)) \big) = \sum_{y_j=0}^{m_j - 1} n_j(y_j + 1) \frac{F(z_j(1))}{n_j(1)},
\]

which, since the left-hand side telescopes to 1, leads to

\[
F(z_j(1)) = \frac{n_j(1)}{n},
\]

where $n = \sum_{y_j=1}^{m_j} n_j(y_j)$. It is thus easy to see that

\[
F(z_j(y_j)) = \frac{\sum_{r=1}^{y_j} n_j(r)}{n},
\]

which means that the estimate of $z_j(y_j)$ from IFM is

\[
\hat z_j(y_j) = \log \frac{\sum_{r=1}^{y_j} n_j(r)}{n - \sum_{r=1}^{y_j} n_j(r)}.
\]

A closed form for $\hat\theta_{jk}$ is not available. We need to solve $\Psi_{njk}(\theta_{jk}) = 0$ numerically to find $\hat\theta_{jk}$.

For the situation with a covariate vector $x_{ij}$ for the marginal parameters $z_{ij}(y_{ij})$ for $Y_{ij}$, and a covariate vector $w_{i,jk}$ for the dependence parameter $\theta_{i,jk}$, $i = 1, \ldots, n$, one way to extend the model to include the covariates is as follows:

\[
z_{ij}(y_{ij}) = \gamma_j(y_{ij}) + \alpha_j' x_{ij}, \quad j = 1, 2, \ldots, d, \qquad
\theta_{i,jk} = \frac{\exp(\beta_{jk}' w_{i,jk}) - 1}{\exp(\beta_{jk}' w_{i,jk}) + 1}, \quad 1 \le j < k \le d.
\tag{3.21}
\]

The loglikelihood functions of margins for the parameter vectors $\gamma_j = (\gamma_j(1), \ldots, \gamma_j(m_j - 1))'$, $\alpha_j$ ($j = 1, \ldots, d$) and $\beta_{jk}$ ($1 \le j < k \le d$) are

\[
\ell_{nj}(\gamma_j, \alpha_j) = \sum_{i=1}^n \log P_{i,j}(y_{ij}), \quad j = 1, \ldots, d, \qquad
\ell_{njk}(\gamma_j, \gamma_k, \alpha_j, \alpha_k, \beta_{jk}) = \sum_{i=1}^n \log P_{i,jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d,
\tag{3.22}
\]

where $P_{i,j}(y_{ij}) = F(z_{ij}(y_{ij})) - F(z_{ij}(y_{ij} - 1))$ and

\[
P_{i,jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); \theta_{i,jk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); \theta_{i,jk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); \theta_{i,jk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); \theta_{i,jk}),
\]

with $a_{ij} = F(z_{ij}(y_{ij} - 1))$, $b_{ij} = F(z_{ij}(y_{ij}))$, $a_{ik} = F(z_{ik}(y_{ik} - 1))$ and $b_{ik} = F(z_{ik}(y_{ik}))$.

We can apply quasi-Newton minimization to the negatives of the loglikelihood functions of margins (3.22) to obtain the estimates of the parameters $\gamma_j$, $\alpha_j$ and the dependence parameters $\theta_{jk}$ (or the regression parameters for the dependence parameters, $\beta_{jk}$, if applicable). The Newton-Raphson method can also be used to obtain the estimates of $\gamma_j$ from $\Psi_{nj}(\gamma_j) = 0$, and the estimates of $\alpha_j$ from $\Psi_{nj}(\alpha_j) = 0$.
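For the no-covariate case, the closed-form IFM estimates of the cut-off points amount to the logit of the cumulative sample proportions. A minimal sketch follows, assuming all category counts are positive; the function name `ifm_cutpoints_logit` and the example counts are illustrative only.

```python
import math

def ifm_cutpoints_logit(counts):
    """Cut-off estimates for one ordinal margin with logistic F.

    counts = [n_j(1), ..., n_j(m_j)]; returns the estimates of
    z_j(1), ..., z_j(m_j - 1) as logits of cumulative proportions."""
    n = sum(counts)
    cuts, cum = [], 0
    for c in counts[:-1]:
        cum += c
        cuts.append(math.log(cum / (n - cum)))
    return cuts

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

counts = [10, 30, 60]
cuts = ifm_cutpoints_logit(counts)
# Fitted cell probabilities F(z(y)) - F(z(y-1)) reproduce the sample proportions.
fitted = [logistic_cdf(cuts[0]),
          logistic_cdf(cuts[1]) - logistic_cdf(cuts[0]),
          1.0 - logistic_cdf(cuts[1])]
print(cuts, fitted)
```

The dependence parameters have no such closed form; with the cut-off points fixed at these values, each $\theta_{jk}$ is found by a one-dimensional numerical search.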
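To apply the Newton-Raphson method, we need to calculate $\partial P_j(y_{ij})/\partial\gamma_j$, $\partial P_j(y_{ij})/\partial\alpha_j$, $\partial^2 P_j(y_{ij})/\partial\gamma_j\partial\gamma_j'$, $\partial^2 P_j(y_{ij})/\partial\alpha_j\partial\alpha_j'$ and $\partial^2 P_j(y_{ij})/\partial\gamma_j\partial\alpha_j'$. The mathematical details are as follows. Let $z_{ij}(y_{ij}) = \gamma_j(y_{ij}) + \alpha_{j1}x_{ij1} + \cdots + \alpha_{jp_j}x_{ijp_j}$. For $y_{ij} \ne 1, m_j$, we have

\[
P_j(y_{ij}) = \frac{\exp(z_{ij}(y_{ij}))}{1 + \exp(z_{ij}(y_{ij}))} - \frac{\exp(z_{ij}(y_{ij} - 1))}{1 + \exp(z_{ij}(y_{ij} - 1))},
\]

thus

\[
\frac{\partial P_j(y_{ij})}{\partial\alpha_{js}} = \left\{ \frac{\exp(z_{ij}(y_{ij}))}{(1 + \exp(z_{ij}(y_{ij})))^2} - \frac{\exp(z_{ij}(y_{ij} - 1))}{(1 + \exp(z_{ij}(y_{ij} - 1)))^2} \right\} x_{ijs}, \quad s = 1, \ldots, p_j,
\]

and

\[
\frac{\partial^2 P_j(y_{ij})}{\partial\alpha_{js}\partial\alpha_{jt}} = \left\{ \frac{\exp(z_{ij}(y_{ij}))(1 - \exp(z_{ij}(y_{ij})))}{(1 + \exp(z_{ij}(y_{ij})))^3} - \frac{\exp(z_{ij}(y_{ij} - 1))(1 - \exp(z_{ij}(y_{ij} - 1)))}{(1 + \exp(z_{ij}(y_{ij} - 1)))^3} \right\} x_{ijs} x_{ijt}, \quad s, t = 1, \ldots, p_j.
\]

For $r = 1, 2, \ldots, m_j - 1$, we have

\[
\frac{\partial P_j(y_{ij})}{\partial\gamma_j(r)} =
\begin{cases}
\dfrac{\exp(z_{ij}(y_{ij}))}{(1 + \exp(z_{ij}(y_{ij})))^2} & \text{if } r = y_{ij}, \\[2ex]
-\dfrac{\exp(z_{ij}(y_{ij} - 1))}{(1 + \exp(z_{ij}(y_{ij} - 1)))^2} & \text{if } r = y_{ij} - 1, \\[2ex]
0 & \text{otherwise},
\end{cases}
\]

and for $r_1, r_2 = 1, 2, \ldots, m_j - 1$, we have

\[
\frac{\partial^2 P_j(y_{ij})}{\partial\gamma_j(r_1)\partial\gamma_j(r_2)} =
\begin{cases}
\dfrac{\exp(z_{ij}(y_{ij}))(1 - \exp(z_{ij}(y_{ij})))}{(1 + \exp(z_{ij}(y_{ij})))^3} & \text{if } r_1 = r_2 = y_{ij}, \\[2ex]
-\dfrac{\exp(z_{ij}(y_{ij} - 1))(1 - \exp(z_{ij}(y_{ij} - 1)))}{(1 + \exp(z_{ij}(y_{ij} - 1)))^3} & \text{if } r_1 = r_2 = y_{ij} - 1, \\[2ex]
0 & \text{otherwise},
\end{cases}
\]

and

\[
\frac{\partial^2 P_j(y_{ij})}{\partial\gamma_j(r)\partial\alpha_{js}} =
\begin{cases}
\dfrac{\exp(z_{ij}(y_{ij}))(1 - \exp(z_{ij}(y_{ij})))}{(1 + \exp(z_{ij}(y_{ij})))^3}\, x_{ijs} & \text{if } r = y_{ij}, \\[2ex]
-\dfrac{\exp(z_{ij}(y_{ij} - 1))(1 - \exp(z_{ij}(y_{ij} - 1)))}{(1 + \exp(z_{ij}(y_{ij} - 1)))^3}\, x_{ijs} & \text{if } r = y_{ij} - 1, \\[2ex]
0 & \text{otherwise}.
\end{cases}
\]

For $y_{ij} = 1$, $P_j(y_{ij}) = \exp(z_{ij}(y_{ij}))/[1 + \exp(z_{ij}(y_{ij}))]$, and for $y_{ij} = m_j$, $P_j(y_{ij}) = 1 - \exp(z_{ij}(y_{ij} - 1))/[1 + \exp(z_{ij}(y_{ij} - 1))]$; the corresponding slight modifications to the above formulas should then be made. For details about numerical methods, see section 2.7.

$M_\Psi$ and $D_\Psi$ can be calculated and estimated by following the results in section 2.4. In applications, to avoid the tedious coding of $M_\Psi$ and $D_\Psi$, we may use the jackknife technique to obtain the asymptotic variance of $\hat z_j(y_j)$ and $\hat\theta_{jk}$ when there are no covariates, or that of $\hat\gamma_j$, $\hat\alpha_j$ and $\hat\beta_{jk}$ when there are covariates.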
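Analytic derivatives of this kind are easy to mistype when coded, so a finite-difference comparison is a useful safeguard. The sketch below, using only the Python standard library and hypothetical function names, checks the first and second derivatives with respect to $\alpha_j$ for an interior category; it is an illustration, not the implementation used for the analyses in this thesis.

```python
import math

def F(z):
    """Standard logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

def f(z):
    """Logistic density exp(z) / (1 + exp(z))^2."""
    e = math.exp(z)
    return e / (1.0 + e) ** 2

def fprime(z):
    """Derivative of the logistic density: exp(z)(1 - exp(z)) / (1 + exp(z))^3."""
    e = math.exp(z)
    return e * (1.0 - e) / (1.0 + e) ** 3

def cell_prob(gam_hi, gam_lo, alpha, x):
    """P_j(y) = F(gamma(y) + alpha'x) - F(gamma(y-1) + alpha'x), interior category."""
    lin = sum(a * b for a, b in zip(alpha, x))
    return F(gam_hi + lin) - F(gam_lo + lin)

def grad_alpha(gam_hi, gam_lo, alpha, x):
    """Analytic first derivatives with respect to alpha_s."""
    lin = sum(a * b for a, b in zip(alpha, x))
    c = f(gam_hi + lin) - f(gam_lo + lin)
    return [c * xs for xs in x]

def hess_alpha(gam_hi, gam_lo, alpha, x):
    """Analytic second derivatives with respect to (alpha_s, alpha_t)."""
    lin = sum(a * b for a, b in zip(alpha, x))
    c = fprime(gam_hi + lin) - fprime(gam_lo + lin)
    return [[c * xs * xt for xt in x] for xs in x]

def numeric_grad(gam_hi, gam_lo, alpha, x, h=1e-5):
    """Central finite differences of cell_prob in each alpha_s."""
    out = []
    for s in range(len(alpha)):
        up = list(alpha); up[s] += h
        dn = list(alpha); dn[s] -= h
        out.append((cell_prob(gam_hi, gam_lo, up, x)
                    - cell_prob(gam_hi, gam_lo, dn, x)) / (2.0 * h))
    return out

args = (0.8, -0.4, [0.3, -0.2], [1.5, 2.0])
print(grad_alpha(*args), numeric_grad(*args))
```

The same comparison applies to the derivatives with respect to the cut-off parameters $\gamma_j(r)$, and to the probit analogues, by swapping in the corresponding density and its derivative.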
3.4.2 Multivariate probit model

Similar to the multivariate probit model for binary data, the general multivariate probit model is obtained by letting $G_j(y_j) = \Phi(z_j(y_j))$ in (3.19), where $-\infty = z_j(0) < z_j(1) < \cdots < z_j(m_j - 1) < z_j(m_j) = \infty$ are constants, $j = 1, 2, \ldots, d$. This is equivalent to letting $F_j(z) = \Phi(z)$, that is, to choosing $F_j$ to be the standard normal cdf. The copula $C$ in (3.19) is arbitrary. The copulas (3.1)-(3.8) are some choices here.

The multivariate probit model with multinormal copula for ordinal data is discussed in the literature (see for example Anderson and Pemberton 1985). The discussion of the multivariate logit model for ordinal data in the previous subsection is relevant and can be directly applied to the multivariate probit model for ordinal data. For completeness, we provide some detailed discussion of the multivariate probit model for ordinal data when the copula is the multinormal copula.

Let the data be $y_i = (y_{i1}, \ldots, y_{id})$, $i = 1, \ldots, n$. For the situation with no covariates, there are $\sum_{j=1}^d (m_j - 1)$ univariate parameters and $d(d - 1)/2$ dependence parameters. As for the multivariate logit model, with the IFM approach, we find that $\hat z_j(y_j) = \Phi^{-1}\big(\sum_{r=1}^{y_j} n_j(r)/n\big)$, and $\hat\theta_{jk}$ must be obtained numerically.

For the situation with a covariate vector $x_{ij}$ for the marginal parameters $z_{ij}(y_{ij})$ for $Y_{ij}$, and a covariate vector $w_{i,jk}$ for the dependence parameter $\theta_{i,jk}$, $i = 1, \ldots, n$, the details on IFM for parameter estimation are similar to the multivariate logit model for ordinal data in the preceding subsection. We provide here some mathematical details for this model. We have $P_{i,j}(y_{ij}) = \Phi(z_{ij}(y_{ij})) - \Phi(z_{ij}(y_{ij} - 1))$ and

\[
P_{i,jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); \theta_{i,jk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); \theta_{i,jk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); \theta_{i,jk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); \theta_{i,jk}),
\]

where $a_{ij} = \Phi(z_{ij}(y_{ij} - 1))$, $b_{ij} = \Phi(z_{ij}(y_{ij}))$, $a_{ik} = \Phi(z_{ik}(y_{ik} - 1))$ and $b_{ik} = \Phi(z_{ik}(y_{ik}))$. The mathematical details for applying the Newton-Raphson method are the following.
For $y_{ij} \ne 1, m_j$, we have $P_j(y_{ij}) = \Phi(z_{ij}(y_{ij})) - \Phi(z_{ij}(y_{ij} - 1))$, thus $\partial P_j(y_{ij})/\partial\alpha_{js} = [\phi(z_{ij}(y_{ij})) - \phi(z_{ij}(y_{ij} - 1))]x_{ijs}$, $s = 1, \ldots, p_j$, and $\partial^2 P_j(y_{ij})/\partial\alpha_{js}\partial\alpha_{jt} = [-\phi(z_{ij}(y_{ij}))z_{ij}(y_{ij}) + \phi(z_{ij}(y_{ij} - 1))z_{ij}(y_{ij} - 1)]x_{ijs}x_{ijt}$, $s, t = 1, \ldots, p_j$. For $r = 1, 2, \ldots, m_j - 1$, we have

\[
\frac{\partial P_j(y_{ij})}{\partial\gamma_j(r)} =
\begin{cases}
\phi(z_{ij}(y_{ij})) & \text{if } r = y_{ij}, \\
-\phi(z_{ij}(y_{ij} - 1)) & \text{if } r = y_{ij} - 1, \\
0 & \text{otherwise},
\end{cases}
\]

and for $r_1, r_2 = 1, 2, \ldots, m_j - 1$, we have

\[
\frac{\partial^2 P_j(y_{ij})}{\partial\gamma_j(r_1)\partial\gamma_j(r_2)} =
\begin{cases}
-\phi(z_{ij}(y_{ij}))z_{ij}(y_{ij}) & \text{if } r_1 = r_2 = y_{ij}, \\
\phi(z_{ij}(y_{ij} - 1))z_{ij}(y_{ij} - 1) & \text{if } r_1 = r_2 = y_{ij} - 1, \\
0 & \text{otherwise},
\end{cases}
\]

and

\[
\frac{\partial^2 P_j(y_{ij})}{\partial\gamma_j(r)\partial\alpha_{js}} =
\begin{cases}
-\phi(z_{ij}(y_{ij}))z_{ij}(y_{ij})x_{ijs} & \text{if } r = y_{ij}, \\
\phi(z_{ij}(y_{ij} - 1))z_{ij}(y_{ij} - 1)x_{ijs} & \text{if } r = y_{ij} - 1, \\
0 & \text{otherwise}.
\end{cases}
\]

For $y_{ij} = 1$, $P_j(y_{ij}) = \Phi(z_{ij}(y_{ij}))$, and for $y_{ij} = m_j$, $P_j(y_{ij}) = 1 - \Phi(z_{ij}(y_{ij} - 1))$; the corresponding slight modifications to the above formulas should then be made. For details on numerical methods, see section 2.7.

$M_\Psi$ and $D_\Psi$ can be calculated and estimated by following the results in section 2.4. For example, for the case with no covariates, we have $E(\psi^2(z_j(y_j))) = \{[P_j(y_j + 1) + P_j(y_j)]\phi^2(z_j(y_j))\}/\{P_j(y_j + 1)P_j(y_j)\}$, where $P_j(y_j) = \Phi(z_j(y_j)) - \Phi(z_j(y_j - 1))$, and so on. In applications, to avoid the tedious coding of $M_\Psi$ and $D_\Psi$, we may use the jackknife technique to obtain the asymptotic variances of $\hat z_j(y_j)$ and $\hat\theta_{jk}$ when there are no covariates, or those of $\hat\gamma_j$, $\hat\alpha_j$ and $\hat\beta_{jk}$ when there are covariates.

The multivariate probit model with multinormal copula for ordinal data has been studied and applied in the literature. For example, Anderson and Pemberton (1985) used a trivariate probit model for the analysis of an ornithological data set on three aspects of the colouring of blackbirds.

3.4.3 Multivariate binomial model

In the previous subsections, we supposed that for $Y_j$, the probability of outcome $s$ is $p_j^{(s)}$, $s = 1, 2, \ldots, m_j$, $j = 1, \ldots, d$, and we linked the $m_j$ probabilities $p_j^{(s)}$ to $m_j - 1$ cut-off points $z_j(1), z_j(2), \ldots, z_j(m_j - 1)$.
In doing so we keep as many independent parameters within the margins and between the margins as possible. In some situations, it is worthwhile to reduce the number of free parameters and obtain a more parsimonious model which may still capture the major features of the data and serve the inference purpose. One way to reduce the number of free parameters for the ordinal variable is to reparameterize the marginal distribution. Because $\sum_{s=1}^{m_j} p_j^{(s)} = 1$ and $p_j^{(s)} \ge 0$, we may let the $p_j^{(s)}$ be the binomial probabilities (3.23) for some $0 < p_j < 1$. In other words, we assume that $Y_j$ follows a binomial distribution $\mathrm{Bi}(m_j - 1, p_j)$. This reparameterization of the distribution of $Y_j$ reduces the number of free parameters to one, namely $p_j$. The model constructed in this way is called the multivariate binomial model for ordinal data.

By treating the $P_j(y_j)$ as binomial probabilities, we need only deal with one parameter $p_j$ for the $j$th margin. The assumption (3.23) is artificial for ordinal data, as $s$ in (3.23) is based on letting $Y_j$ take the integer values in $\{0, 1, \ldots, m_j\}$ as its category indicator. But $s$ is a qualitative quantity; it should reflect the ordinal nature of the variable $Y_j$, and need not take on the integer values in $\{0, 1, \ldots, m_j\}$. In applications, if one feels justified in assuming binomial behaviour in the sense of (3.23) for the univariate margin, then this model may be considered. The assumption (3.23) is more natural if the categorical outcome of each univariate response can be considered as the number of realizations of an event in a fixed number of random trials. In this situation, it is an MCD model for binomial count data.

When there is a covariate vector $x_{ij}$ for the response observation $y_{ij}$, we may let $p_{ij} = b_j(\eta_j, x_{ij})$ for some function $b_j$ in the range $[0, 1]$. Other details are similar to the multivariate logit model for binary data.
3.5 Multivariate mixture discrete models for binary data

The multivariate mixture discrete models (2.16) or (2.17) are flexible for the type of discrete data and for the multivariate structure by allowing different choices of copulas. However, they generally do not have a closed form pmf or cdf. The choice of models should be based on the desirable features for multivariate models outlined in section 1.3; among them, (2) and (3) are considered to be essential.

In this and the next section, we study some specific MMD models. The mathematical development for other MMD models with different choices of copulas should be similar.

3.5.1 Multivariate probit-normal model

The multivariate probit-normal model for binary data is introduced in (2.32) in Example 2.13. Following the notation in Example 2.13, the corresponding cut-off point $\alpha_{ij}$ is $\alpha_{ij} = \beta_j' x_{ij}$ for a more general situation, where $x_{ij}$ is a covariate vector, $j = 1, \ldots, d$ and $i = 1, \ldots, n$. Assume $\beta_j \sim N_{p_j}(\mu_j, \Sigma_j)$, $j = 1, \ldots, d$. Let $\gamma = (\beta_1', \ldots, \beta_d')' \sim N_q(\mu, \Sigma)$, where $q = \sum_{j=1}^d p_j$ and $\mathrm{Cov}(\beta_j, \beta_k) = \Sigma_{jk}$. From the stochastic representation in Example 2.13, we have

\[
z^*_{ij} = \frac{\mu_j' x_{ij}}{\{1 + x_{ij}'\Sigma_j x_{ij}\}^{1/2}}, \quad j = 1, \ldots, d, \qquad
r_{ijk} = \frac{\theta_{jk} + x_{ij}'\Sigma_{jk} x_{ik}}{\{(1 + x_{ij}'\Sigma_j x_{ij})(1 + x_{ik}'\Sigma_k x_{ik})\}^{1/2}}, \quad j \ne k.
\tag{3.24}
\]

The $j$th and $(j, k)$ marginal pmfs are $P_{i,j}(y_{ij}) = y_{ij}\Phi(z^*_{ij}) + (1 - y_{ij})(1 - \Phi(z^*_{ij}))$ and

\[
P_{i,jk}(y_{ij} y_{ik}) = \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(b_{ik}); r_{ijk}) - \Phi_2(\Phi^{-1}(b_{ij}), \Phi^{-1}(a_{ik}); r_{ijk}) - \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(b_{ik}); r_{ijk}) + \Phi_2(\Phi^{-1}(a_{ij}), \Phi^{-1}(a_{ik}); r_{ijk}),
\]

where $a_{ij} = G_{ij}(y_{ij} - 1)$, $b_{ij} = G_{ij}(y_{ij})$, $a_{ik} = G_{ik}(y_{ik} - 1)$ and $b_{ik} = G_{ik}(y_{ik})$, with $G_{ij}(1) = 1$ and $G_{ij}(0) = 1 - \Phi(z^*_{ij})$. We can thus apply quasi-Newton minimization to the negatives of the loglikelihood functions of margins

\[
\ell_{nj}(\mu_j, \Sigma_j) = \sum_{i=1}^n \log P_{i,j}(y_{ij}), \quad j = 1, \ldots, d, \qquad
\ell_{njk}(\mu_j, \Sigma_j, \mu_k, \Sigma_k, \theta_{jk}, \Sigma_{jk}) = \sum_{i=1}^n \log P_{i,jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d,
\]

to obtain the estimates of the parameters $\mu_j$, $\Sigma_j$, $\mu_k$, $\Sigma_k$, $\theta_{jk}$ and $\Sigma_{jk}$. Under appropriate assumptions, many simplifications of (3.24) are possible.
For example, if Ej = I and Ej* = 0, j ^ fc, then (3.24) simplifies to 1.1/2 ' ^ i., . . . , u, (3.25) * {1 + x^xy}^' 11/2' ,J* {(l + x<.xti)(l+x<fcxi,)}1 which is a simple example of having the dependence parameters be functions of covariates in a natural way, as they are derived. The numerical advantage is that as long as 0 = (9jk) is positive-definite, then all R{ = (ujk), i = 1,..., n, are positive-definite. An extension of (3.24) is to let z*j=lt'iXi}> J = l> •••'d> rijk 11/2 ' J'^' (3.26) {(l + w{,.Eiwii)(l + w;.iEtw,-t)}1 where xy and wy may differ. However this does not obtain from a mixture model. 3.5.2 Multivariate Bernoulli-Beta model For a d-variate binary random vector Y taking value 0 or 1 for each component, assume we have the MMD model (2.17), such that P(yi---Vd)= t •• f ^f{yi-,Pj)<GM,...,Gi(pq))'[{gj{pj)dpl---dpq) (3.27) Jo Jo „•_-, Chapter 3. Modelling of multivariate discrete data 117 where f(yj',Pj) = pJJ(l— Pj)l~Vi • If in (3.27), Gj is a Beta(a?j, Bj) distribution, with density gj(pj) = [B(aj, Bj)]~1Pj3~1(l— PjY'-1, 0 < pj < 1, then (3.27) is called a multivariate Bernoulli-Beta model. The copula C in (3.27) is arbitrary; a good choice may be the normal copula (2.4). With the normal copula, the model (3.27) is MUBE, thus the IFM approach can be applied to fit the model. We can then write the jth and (j, k) marginal pmf as Pi (%) = / Pfi1 ~ Pif~Vi9j (Pj)dpj = B(aj + yj ,Bj+l- yj)/B(aj, Bj), Jo PjkiVjyk) = Pi''(l -Pj)1_WP2*(l -Pk)l-ykHx,y\6jk)9j{Pi)9k{Pk)dpjdpk, where x = $~1(Gj(pj)), y = 3>~l(Gk(pk)), and <p2(x,y;0) is the density of the bivariate normal, and gj the density of Beta(a, ,Bj). Given data y,- = (yu,..., ytd) with no covariates, we may obtain aj, fa and Ojk with the IFM approach. For the case of an individual covariate x,j for Yij, an interpretable extension of (3.27) is ^(y.O = f •• f HM*ii>Pj)]yii[l ~ hjixij^jtf-^ciGM,Gq(Pg)) n gj(Pj)dPl • • • dPq, Jo Jo j=1 j=1 (3.28) for some function hj with range in [0,1]. 
A large family of such functions is hj (xtj ,pj) = Fj (Ff1 (Pj)+ P'j'X-ij) where Fj is a univariate cdf. Pj(yij) and Pjk(yijVik) can be written accordingly. For ex ample, if Fj(z) = exp(—e~z), then hj(xij,Pj) = p^XP^ ^jX,:'^ and we have that when y,j = 1, Pj(yij) = B(ctj + exp(—3jXij),3j)/B(ctj,8j). If covariates are not subject dependent, but only margin-dependent, an alternative extension is to let a,,- and depend on the covariates for some functions aj and bj with range in [0,oo], such that a,-,- = a,j(fj,Xj) and = bj(t)j,Xj). In this sit uation, we have, for example, Pj(yij) = B(aj(y'jXj) + yij, bj(rr,jXj) + l-yij)/B(aj(y'jXj), bj(tfjXj)). An example of the functions aj and 6, is a,j — exp(y'jXj) and /?,-,- = exp(t]'jXj). When apply ing the IFM approach to parameter estimation, the numerical computation involves 2-dimensional integration which would be feasible in most cases. A special case of the model (3.27), where pj = p, j = 1,..., d, is the model (1.1), studied in Prentice (1986). The pmf of the model is P(vi---Vd)= I' py+(l-P)d-y+g(p)dp, : (3.29) Jo where y+ = J2j=iVj and 9(P) *s the density of a Beta(a,/?) distribution. The model (3.29) has exchangeable dependence and admits only positive dependence. A discussion of this special model, with extensions to include covariates and to admit negative dependence, can be found in Joe (1996). Chapter 3. 
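As an illustration of the exchangeable special case (3.29), the integral $\int_0^1 p^{y_+}(1-p)^{d-y_+} g(p)\,dp$ with a Beta$(\alpha, \beta)$ density $g$ reduces to a ratio of Beta functions, $B(\alpha + y_+, \beta + d - y_+)/B(\alpha, \beta)$. The sketch below evaluates this pmf; the function names are illustrative and not from the text. It also returns the implied pairwise correlation $1/(\alpha + \beta + 1)$, which is always positive, in line with the remark that (3.29) admits only positive dependence.

```python
import math
from itertools import product

def log_beta(a, b):
    """log B(a, b) computed through log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def exchangeable_bernoulli_beta_pmf(y, alpha, beta):
    """P(y_1 ... y_d) for model (3.29) with a Beta(alpha, beta) mixing density."""
    d = len(y)
    y_plus = sum(y)
    return math.exp(log_beta(alpha + y_plus, beta + d - y_plus) - log_beta(alpha, beta))

def pairwise_correlation(alpha, beta):
    """Corr(Y_j, Y_k) = Var(P) / {E(P)(1 - E(P))} for a Beta(alpha, beta) P."""
    return 1.0 / (alpha + beta + 1.0)

# Illustrative parameter values; the pmf over {0,1}^3 should sum to one.
alpha, beta, d = 2.0, 3.0, 3
total = sum(exchangeable_bernoulli_beta_pmf(y, alpha, beta)
            for y in product((0, 1), repeat=d))
```

Because the pmf depends on $y$ only through $y_+$, full maximum likelihood is straightforward for this special case, in contrast to the general model (3.27).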
Modelling of multivariate discrete data 118 3.5.3 Multivariate logit-normal model For a d-variate binary random vector Y taking value 0 or 1 for each component, suppose we have the MMD model (2.17), such that rl rl d P(yi---yd)= •••/ J\f{yj\Pj)g{pi,-•-,Pd)dpi-•-dPd, (3.30) Jo Jo j=1 where f(yj ;pj) = py' (1 —pj)l~yi, and g(-) is the density function of a normal copula, with univariate marginal cdf Gj Jo V °~j In other words, if pj is the outcome of a rv Pj, and Zj = logit(Pj) = log(P,/l — Pj), j = 1,...', d, then (Z\,..., Zd)' has a joint d-dimensional normal distribution with mean vector ji, variance vector <r2 and a correlation matrix 0 = {Ojk). We have 9(Pi, ...,Pd)= ,d/2,ir. >J2T1d T. r exp { - J(z - p)'{<T'Qa)-\z - p)\ , (2Tr)d'2\a'e(T\1/2Ylj=lpj(l-pj) L 2 J where z = (zi,..., z^)', with Zj = log(pj/l — pj). We call this model the multivariate logit-normal model. The Frechet upper bound is reached in the limit if 0 = J, where J is matrix of I's and c2 —• oo. The multivariate probit model obtains in the limit as a goes to oo, by assuming 0 be a fixed correlation matrix and the mean parameters pj = CjZj where Zj is constant. The j'th and the (j, k) marginal pmf are PjiVi) = t ^7lpf_1(l " Pj)-y^(xj)dpj, j = l,...,d, Jo Pjk(yjyk) =11 (o-jo-kr'py _1(1 -Pj)-y'plk-\l -Pk)-ykfo(xj,xk;Ojk) dPjdpk, l<j<k<d, Jo Jo where XJ = {[log(pj/(l — pj)) — Pj]/o-j}, j = 1,..., d, cf> is the standard univariate normal density function and <f>2 the standard bivariate normal density function. Given data = (yn,..., yid) with no covariates, we may obtain pj, aj and Ojk by the IFM approach. For the case of different covariates for different margins, similar to the multivariate Bernoulli-Beta model, an interpretable extension of (3.30) is obtained by letting rl rl d P(yn---Vid)= / •••/ X\[hj{*ij,Pi)]yii[l-hj(xij,pj)]1-yi>g{p1,...,pd)dpl---dpd, (3.31) Jo Jo j=l for some function hj with range in [0,1]. Pj{yij) and Pjk{yijy%k) can be written accordingly. 
If covariates are not different, an interpretable extension to include covariates to the parameters in c(l - x) dx, 0 < pj < 1, j = 1,. Chapter 3. Modelling of multivariate discrete data 119 (3.30) obtains by letting ptj = aj{fj,Xj) and Cij = bj(r)j,Xij) for some functions aj and bj. The loglikelihood functions of margins for the parameters are now £nj(Pj,CTj) = ^logPjjyij), j=l,...,d, n enjk(8jk) = E \°&pjk{yiiyik), i<j<k<d, »=i where J-oo l + exp(/iy+<7,ja;) D / \ r°° r°° exp{yij(pij + aijx)} exp{yik(pik + o-iky)} , , a , , , Pjk(yijVik)= / / . , , , \ 7 7 '—<l>2{x,y\6jk)dxdy. J-oo J-oo 1 + exp(^ij + cr^z) 1 -)- exp(pik + <riky) An example of the functions a, and 6, is /i8j = 7,Xj and <r,j = exp^x,). It is also possible to include covariates to the dependence parameters 0jk; a discussion of this can be found in section 3.1. Again, when applying the IFM approach to parameter estimation, the numerical computation involves 2-dimensional integration, which should be feasible in most cases. 3.6 Multivariate mixture discrete models for count data 3.6.1 Multivariate Poisson-lognormal model The multivariate Poisson-lognormal model for count data is introduced in Example 2.12. The pmf of y,- = ,..., yid), i = 1, • • •, n, is P(yn---yid)= •••/ T[f(yij;Xij)g(il-,l*i,<ri,®i)dT)i---driP, (3.32) Jo Jo *_i where ; A,-j) = exp(-A,j)A??V2/0'!> and (3.33) with ?7j > 0, j = l,...,p, is the multivariate lognormal density. For simple situation with no covariates, fa = /i, fl-,- = c and 0,- = 0. This model is studied in Aitchison and Ho (1989). The model (3.32) can accommodate a wide range of dependence, as we have seen in Example 2.12. Corr(Y}, Yk) is an increasing function of 9jk, and varies over its full range when 9jk varies over its full range. Thus in a general situation a multivariate Poisson-lognormal model of dimension d, consists of d univariate Poisson-lognormal models describing some marginal characteristics and d(d — l)/2 Chapter 3. 
dependence parameters $\theta_{jk}$, $1 \le j < k \le d$, expressing the strength of the associations among the response variables. $\theta_{jk} = 0$ for all $j \ne k$ corresponds to independence among the response variables. The response variables are exchangeable if $\Theta$ has an exchangeable structure and $\mu_j$ and $\sigma_j$ are constant across the margins. We will see another special case later which also leads to exchangeable response variables.

The loglikelihood functions of margins for the parameters are now

\[
\ell_{nj}(\mu_j, \sigma_j) = \sum_{i=1}^n \log P_j(y_{ij}), \quad j = 1, \ldots, d,
\]
\[
\ell_{njk}(\theta_{jk}, \mu_j, \mu_k, \sigma_j, \sigma_k) = \sum_{i=1}^n \log P_{jk}(y_{ij} y_{ik}), \quad 1 \le j < k \le d,
\]

where

\[
P_j(y_{ij}) = \frac{1}{\sqrt{2\pi}\, y_{ij}!} \int_{-\infty}^{\infty} \exp\{ y_{ij}(\mu_j + \sigma_j z_j) - e^{\mu_j + \sigma_j z_j} \} \exp(-z_j^2/2)\, dz_j,
\]
\[
P_{jk}(y_{ij} y_{ik}) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \frac{\exp[ y_{ij}(\mu_j + \sigma_j z_j) + y_{ik}(\mu_k + \sigma_k z_k) ]}{y_{ij}!\, y_{ik}!\, \exp( e^{\mu_j + \sigma_j z_j} + e^{\mu_k + \sigma_k z_k} )}\, \phi_2(z_j, z_k | \theta_{jk})\, dz_j\, dz_k,
\]

and $\phi_2$ is the standard binormal density. To get the IFME of $\mu$, $\sigma$ and $\Theta$, a quasi-Newton minimization method can be used. Good starting points can be obtained from the method of moments estimates. Let $\bar y_j$, $s_j^2$ and $r_{jk}$ be the sample mean, sample variance and sample correlation respectively. The method of moments estimates based on the expected values given in (2.30) are

\[
\sigma_j^0 = \{ \log[ (s_j^2 - \bar y_j)/\bar y_j^2 + 1 ] \}^{1/2}, \qquad
\mu_j^0 = \log \bar y_j - 0.5 (\sigma_j^0)^2, \qquad
\theta_{jk}^0 = \frac{\log\{ r_{jk} s_j s_k / (\bar y_j \bar y_k) + 1 \}}{\sigma_j^0 \sigma_k^0}.
\]

When there is a covariate vector $x_{ij}$ for the response observation $y_{ij}$, we may let $\mu_{ij} = a_j(\gamma_j, x_{ij})$ for some function $a_j$ with range in $(-\infty, \infty)$, and let $\sigma_{ij} = b_j(\eta_j, x_{ij})$ for some function $b_j$ with range in $[0, \infty)$. An example of the functions $a_j$ and $b_j$ is $\mu_{ij} = \gamma_j' x_{ij}$ and $\sigma_{ij} = \exp(\eta_j' x_{ij})$. It is also possible to let the dependence parameters $\theta_{jk}$ be functions of covariates; a discussion of this can be found in section 3.1. For details on numerical methods for obtaining the parameter estimates, see section 2.7.

A special situation of the multivariate Poisson-lognormal model is to assume that $f(y_j; \lambda_j) = e^{-\lambda_j} \lambda_j^{y_j} / y_j!$, where $\lambda_j = \Lambda \beta_j$.
$\beta_j > 0$ is considered as a scale factor (known or unknown) and the common parameter $\Lambda$ has the lognormal distribution $LN(\mu, \sigma^2)$. In this situation we have

\[
P(y_1 \cdots y_d) = \frac{\prod_{j=1}^d \beta_j^{y_j}}{\sqrt{2\pi} \prod_{j=1}^d y_j!} \int_{-\infty}^{\infty} \frac{\exp\{ y_+ (\mu + \sigma z) - z^2/2 \}}{\exp( e^{\mu + \sigma z} \sum_{j=1}^d \beta_j )}\, dz, \qquad y_+ = \sum_{j=1}^d y_j, \tag{3.34}
\]

and the parameters $\mu$ and $\sigma$ are common across all the margins. To calculate $P(y_1 \cdots y_d)$, we need only calculate a one-dimensional integral; thus full maximum likelihood estimation can be used to get the estimates of $\mu$, $\sigma$ and $\beta_j$ (if it is unknown). By the formulas in (2.26), it can be shown that there is an exchangeable correlation structure in the response vector $Y$, with the pairwise correlations tending to 1 when $\mu$ or $\sigma$ tends to infinity. Independence is achieved when $\sigma \to 0$. The model (3.34) does not admit negative dependence.

3.6.2 Multivariate Poisson-gamma model

The multivariate Poisson-gamma model is obtained by letting $G_j(\eta_j)$ in (2.24) be the cdf of a univariate gamma distribution with shape parameter $\alpha_j$ and scale parameter $\beta_j$, with the density function $g_j(x; \alpha_j, \beta_j) = \beta_j^{-\alpha_j} x^{\alpha_j - 1} e^{-x/\beta_j} / \Gamma(\alpha_j)$, $x > 0$, $\alpha_j > 0$ and $\beta_j > 0$. The Gamma family is closed under convolution for fixed $\beta$. The copula $C$ in (2.24) is arbitrary; (3.1)-(3.8) are some choices here. For example, with the multinormal copula, the multivariate Poisson-gamma model is MUBE. Thus the IFM approach can be applied to fit the model. The $j$th marginal distribution of a multivariate Poisson-gamma distribution is

\[
P_j(y_j) = \int_0^{\infty} f(y_j; z_j) g_j(z_j)\, dz_j
= \int_0^{\infty} \frac{e^{-z_j} z_j^{y_j} z_j^{\alpha_j - 1} e^{-z_j/\beta_j}}{y_j!\, \beta_j^{\alpha_j} \Gamma(\alpha_j)}\, dz_j
= \frac{\Gamma(y_j + \alpha_j)}{y_j!\, \Gamma(\alpha_j)} \left( \frac{1}{1 + \beta_j} \right)^{\alpha_j} \left( \frac{\beta_j}{1 + \beta_j} \right)^{y_j}, \tag{3.35}
\]

which implies that $Y_j$ has a negative binomial distribution (in the generalized sense). We have $E(Y_j) = \alpha_j \beta_j$ and $\mathrm{Var}(Y_j) = \alpha_j \beta_j (1 + \beta_j)$. The margins are overdispersed since $\mathrm{Var}(Y_j)/E(Y_j) > 1$. Based on (3.35), if $\alpha_j$ is an integer, $y_j$ can be interpreted as the number of observed failures in $y_j + \alpha_j$ trials, with $\alpha_j$ a previously fixed number of successes.
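The negative binomial margin of the Poisson-gamma model and its stated moments can be checked numerically. The sketch below evaluates the marginal pmf on the log scale and sums it over a long truncated range; the truncation point `y_max` and the function names are illustrative assumptions, not part of the text.

```python
import math

def poisson_gamma_margin_pmf(y, alpha, beta):
    """Negative binomial pmf of a Poisson variable with Gamma(shape=alpha, scale=beta) mean."""
    log_p = (math.lgamma(y + alpha) - math.lgamma(y + 1) - math.lgamma(alpha)
             - alpha * math.log1p(beta)
             + y * (math.log(beta) - math.log1p(beta)))
    return math.exp(log_p)

def truncated_moments(alpha, beta, y_max=2000):
    """Total mass, mean and variance from the pmf summed over 0..y_max."""
    probs = [poisson_gamma_margin_pmf(y, alpha, beta) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    second = sum(y * y * p for y, p in enumerate(probs))
    return total, mean, second - mean ** 2

# Illustrative values: mean should be alpha*beta, variance alpha*beta*(1 + beta).
total, mean, var = truncated_moments(2.5, 1.5)
```

The variance-to-mean ratio is $1 + \beta_j$, so the overdispersion of each margin is governed by the scale parameter alone.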
The parameter estimation procedure based on IFM is similar to that for the multivariate Poisson-lognormal model. Some simplifications are possible. One simplification for the Poisson-gamma model is to hold the shape parameter $\alpha_j$ constant across $j$. In this situation, we have $E(Y_j) = \mu_j = \alpha \beta_j$ and $\mathrm{Var}(Y_j) = \mu_j (1 + \mu_j/\alpha)$. Similarly, we can also require $\beta_j$ to be constant across $j$ and obtain the same functional relationship between the mean and the variance across $j$. By doing so, we reduce the total number of parameters. With this simplification, the same parameter appears in different margins. The IFM approach for estimating parameters common to more than one margin, discussed in section 2.6, can be applied.

Another special case is to let $\lambda_j = \Lambda \beta_j$, where $\beta_j > 0$ is considered to be a scale factor (known or unknown) and the common parameter $\Lambda$ has a Gamma distribution. This is similar to the multivariate Poisson-lognormal model (3.34); as with that model, negative dependence cannot be admitted in this special situation.

3.6.3 Multivariate negative-binomial mixture model

Consider $d$-dimensional count data with $y_j = r_j, r_j + 1, \ldots$, $r_j \ge 1$, $j = 1, 2, \ldots, d$. For example, with given integer value $r_j$, $y_j$ might be the total number of Bernoulli trials until the $r_j$th success, where the probability of success in each trial is $p_j$; that is,

\[
P(y_j | p_j) = \binom{y_j - 1}{r_j - 1} p_j^{r_j} (1 - p_j)^{y_j - r_j}, \qquad y_j = r_j, r_j + 1, \ldots.
\]

If $p_j$ is itself the outcome of a random variate $X_j$, $j = 1, \ldots, d$, and these have the joint distribution $G(p_1, \ldots, p_d)$, then the distribution for $Y = (Y_1, \ldots, Y_d)$ is called the multivariate negative-binomial mixture model. If $1/X_j$ has a distribution with mean $\mu_j$ and variance $\sigma_j^2$, then simple calculations lead to $E(Y_j) = r_j \mu_j$ and $\mathrm{Var}(Y_j) = r_j \mu_j (\mu_j - 1) + r_j (r_j + 1) \sigma_j^2$.
This multivariate negative-binomial mixture model for count data is similar to the multivariate Bernoulli-Beta model for binary data in section 3.5. Thus the comments on the extensions to include covariates apply here as well. A more general form of negative binomial distribution is (3.16), such that PjiVj lPi) = Y{0%{yj + l/ji{l ~ ^' # > °. W = °. L 2> • • • • Using the recursive relation T(x) = (x — l)T(x — 1), Pj(yj\pj) can be written as yj I pj(yj\pj) = Pi (3j + k - - Pj) k=l 1, yj = 0. The multivariate negative-binomial mixture model can be denned with this general negative binomial distribution as the discrete mixing part. Bj, j — 1,.. .,d, can be considered as parameters in the mo del. 3.6.4 Multivariate Poisson-inverse Gaussian model The multivariate Poisson-inverse Gaussian model is obtained by letting Gj(rjj) in (2.24) be the cdf of a three-parameter univariate inverse Gaussian distribution with density function 9j(*j) = -JTi—\Xrlexp[(^/2)(CJ/AJ- + \j/ij)), Xj > 0, (3.36) where uij = f? + a? — aj > 0, £j > 0 and —oo < jj < oo. In the density expression, Kv(z) denotes the modified Bessel function of the second kind of order v and argument z. It satisfies the Chapter 3. Modelling of multivariate discrete data 123 relationship 2v Ku+1(z) = —Kv(z) + Kv.1(z), z with K-i/2(z) = K 1/2(2) = yJ!Tf2zexp(—z). The copula C in (2.24) is arbitrary; interesting choices are copulas (3.1)—(3.8). With the multinormal copula, the multivariate Poisson-inverse Gaussian model is MUBE; thus the IFM approach can be applied to fit the model. A special case of the multivariate Poisson-inverse Gaussian model results when f(yj\Zj) = e~ZiZj1 /yj\, where Zj = Xtj, with tj > 0 considered as a scale factor (j = l,...,d). Then the pmf for Y is Kk+1 (y/w(w + 2tZt:) ( u> V*+7)/2 TT (&i)yi Ky(w) Vw + 2^E^y /=i yj-where k = J^j=i extensive study of this special model can be found in Stein et al. (1987). 
3.7 Application to longitudinal and repeated measures data

Multivariate copula discrete (MCD) and multivariate mixture discrete (MMD) models can be used for longitudinal and repeated measures (over time) data when the response variables are discrete (binary, ordinal and count), and the number of measures is small and constant over subjects. The multivariate dependence structure has the form of time series dependence or of dependence decreasing with lag. Examples include MCD and MMD models with special copula dependence structure and special patterns of marginal parameters. These models include stationary time series models that allow arbitrary univariate margins, and non-stationary cases in which there are time-dependent or time-independent covariates or time trends.

In classical time series analysis, the standard models are autoregressive (AR) and moving average (MA) models. The generalization of these concepts to MCD and MMD models for discrete time series is that "autoregressive" is replaced by Markov and "moving average" is replaced by $k$-dependent (only rv's that are separated by a lag of $k$ or less are dependent). A particularly interesting model is the Markov model of order one, which can be considered as a replacement for the AR(1) model in classical time series analysis; these types of Markov models can be constructed from families of bivariate copulas. For a more detailed discussion of related topics, such as the extension of models to include covariates and models for different individuals observed at different times, see Joe (1996, Chapter 8).

If the copula is the multinormal copula (3.1), the correlation matrix in the multinormal copula may have patterns of correlations depending on lags, such as exchangeable or AR type. For example, for the exchangeable pattern, $\theta_{jk} = \theta$ for all $1 \le j < k \le d$. For AR(1), $\theta_{jk} = \theta^{|j-k|}$ for some $|\theta| < 1$. For AR(2), $\theta_{jk} = \rho_s$, with $s = |j - k|$.
$\rho_s$ is the autocorrelation of lag $s$; the autocorrelations satisfy $\rho_s = \phi_1 \rho_{s-1} + \phi_2 \rho_{s-2}$, $s \ge 3$, with $\phi_1 = \rho_1 (1 - \rho_2)/(1 - \rho_1^2)$ and $\phi_2 = (\rho_2 - \rho_1^2)/(1 - \rho_1^2)$, and are determined from $\rho_1$ and $\rho_2$.

Some examples of models suitable for modelling longitudinal data and repeated measures (over time) are the multivariate Poisson-lognormal model, the multivariate logit-normal model, the multivariate logit model with multinormal copula or with M-L construction, the multivariate probit model with multinormal copula, and so on. In fact, the multivariate probit model with multinormal copula is equivalent to the discretization of ARMA normal time series for binary and ordinal responses. For discrete time series with $d > 4$, approximations can be used for the probabilities $\Pr(Y_j = y_j, j = 1, \ldots, d)$, which in general are multidimensional integrals.

3.8 Summary

In this chapter, we studied specific MCD models for binary, ordinal and count data, and MMD models for binary and count data. (MMD models for ordinal data are not presented, since there is no natural simple way to represent such models; however, MMD models for binary data can be extended to MMD models for nominal categorical data.) Extensions that let the marginal parameters as well as the dependence parameters be functions of covariates are discussed. We also outlined the potential application of MCD and MMD models to longitudinal data, repeated measures and time series data. However, this chapter does not contain an exhaustive list of models in the MCD and MMD classes. Many additional interesting models in these classes could be introduced and studied. Our purpose in this chapter is to demonstrate the richness of the classes of MCD and MMD models, and to make several specific models available for applications. Some examples of the application of models introduced in this chapter can be found in Chapter 5.
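The lag-patterned correlation matrices described in section 3.7 for the multinormal copula (exchangeable, AR(1), AR(2)) can be sketched as follows. The function names are illustrative, and the code does not check that the resulting matrix is positive definite, which has to hold for a valid copula parameter.

```python
def exchangeable_corr(d, theta):
    """theta_jk = theta for all j != k."""
    return [[1.0 if j == k else theta for k in range(d)] for j in range(d)]

def ar1_corr(d, theta):
    """theta_jk = theta ** |j - k|, with |theta| < 1."""
    return [[theta ** abs(j - k) for k in range(d)] for j in range(d)]

def ar2_corr(d, rho1, rho2):
    """theta_jk = rho_s with s = |j - k|; higher lags follow the AR(2) recursion."""
    phi1 = rho1 * (1.0 - rho2) / (1.0 - rho1 ** 2)
    phi2 = (rho2 - rho1 ** 2) / (1.0 - rho1 ** 2)
    rho = [1.0, rho1, rho2]
    for s in range(3, d):
        rho.append(phi1 * rho[s - 1] + phi2 * rho[s - 2])
    return [[rho[abs(j - k)] for k in range(d)] for j in range(d)]

# When rho2 = rho1**2, phi2 vanishes and the AR(2) pattern collapses to AR(1).
a1 = ar1_corr(5, 0.6)
a2 = ar2_corr(5, 0.6, 0.36)
```

Each pattern replaces the $d(d-1)/2$ free dependence parameters by one or two, which is what makes these structures attractive for short longitudinal series.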
Chapter 4. The efficiency of IFM approach and the efficiency of jackknife variance estimate

It is well known that under regularity conditions, the (full-dimensional) maximum likelihood estimator (MLE) is asymptotically efficient and optimal. But in multivariate situations, except for the multinormal model, the computation of the MLE is often complicated or impossible. The IFM approach is proposed in Chapter 2 as an alternative estimation approach. We have shown that the IFM approach provides consistent estimators with some good asymptotic properties (such as asymptotic normality of the estimators). This approach has many advantages; computational feasibility is the main one. It can be applied to many MCD and MMD models (models with MUBE, PUBE properties) with appropriate choices of the copulas; examples of such copulas are the multinormal copula, the M-L construction, copulas from mixtures of max-id distributions, copulas from mixtures of conditional distributions, and so on. The IFM theory is a new statistical inference theory for the analysis of multivariate non-normal models. However, the efficiency of estimators obtained from IFM in comparison with ML estimators is not clear.

In this chapter, we investigate the efficiency of the IFM approach relative to maximum likelihood. Our studies suggest that the IFM approach is a viable alternative to ML for models with MUBE, PUBE or MPME properties. This chapter is organized as follows. In section 4.1, we discuss how to assess the efficiency of the IFM approach. In section 4.2, we carry out some analytical comparisons of the IFM approach to ML for some models. These studies show that the IFM approach is quite efficient.
A general analytical investigation is not possible, as closed form expressions for estimators and the corresponding asymptotic variance-covariance matrices from ML and IFM are not available for the majority of multivariate non-normal models. Most often, numerical assessment of their performance must be used. In section 4.3, we carry out extensive numerical studies of the efficiency of the IFM approach relative to the ML approach. These studies are done mainly for MCD and MMD models with MUBE or PUBE properties. The situations include models without and with covariates. In section 4.4, we numerically study the efficiency of the IFM approach relative to the ML approach for models with special dependence structure. The IFM approach extends easily to models with parameters common to more than one margin. Section 4.5 is devoted to the numerical assessment of the efficiency of the jackknife approach for variance estimation of the IFME. The numerical results show that the jackknife variance estimates are quite satisfactory.

4.1 The assessment of the efficiency of IFM approach

In section 2.3, we have given some optimality criteria for inference functions. We concluded that in the class of all regular unbiased estimating functions, the inference function of scores (IFS) is M-optimal (so T-optimal or D-optimal as well). For the regular model (2.12), the inference functions in the IFM approach are in the class of regular unbiased inference functions; thus all the (asymptotic) properties of regular inference functions apply to IFM. To assess the efficiency of IFM relative to IFS, at least three approaches are possible:

A1. Examine the M-optimality (or T-optimality or D-optimality) of IFM relative to IFS.

A2. Compare the MSE of the estimates from IFM and IFS based on simulation.

A3. Examine the asymptotic behaviour of $2\ell(\tilde\theta) - 2\ell(\theta)$ based on the knowledge that $2\ell(\hat\theta) - 2\ell(\theta)$ has an asymptotic $\chi^2_q$ distribution when $\theta$ is the true parameter vector (of length $q$), where $\tilde\theta$ is the IFME and $\hat\theta$ is the MLE.
A1 is along the lines of inference function theory. As an estimator may be regarded as a solution to an equation of the form $\Psi(y; \theta) = 0$, we study the inference functions instead of the estimators. This approach can be carried out analytically in a few cases when both the Godambe information matrix of IFM and the Fisher information matrix for IFS are available in closed form, or otherwise numerically by computing (or estimating) the Godambe information matrix and the Fisher information matrix (based on simulation). With this approach, we do not need to actually find the parameter estimates for the purposes of comparison. The disadvantage is that the Godambe information matrix or Fisher information matrix may be difficult to calculate, because partial derivatives are needed for the computation and they are difficult to calculate for most multivariate non-normal models. Also, this is an asymptotic comparison.

A2 is a conventional approach; it provides a way to investigate the small sample properties of the estimates. This possibility is especially interesting in comparison with A1, since although MLEs are asymptotically optimal, this may not generally be the case for finite samples. The disadvantage of A2 is that it may be computationally demanding with multivariate non-normal models, because for each simulation, parameter estimation based on both IFM and IFS is carried out.

A3 is based on the understanding that if the estimates from IFM are efficient, we would envisage that the full-dimensional likelihood function evaluated at these estimates should have similar asymptotic behaviour to the full-dimensional likelihood function evaluated at the MLE. More specifically, suppose the loglikelihood function is $\ell(\theta) = \sum_{i=1}^n \log f(y_i | \theta)$, where $\theta$ is a vector of length $q$.
Under regularity conditions, $2(\ell(\hat\theta) - \ell(\theta))$ has an asymptotic $\chi^2_q$ distribution (see for example, Sen and Singer 1993, p. 236). Thus a rough method of assessing the efficiency of $\tilde\theta$ is to see if $2(\ell(\tilde\theta) - \ell(\theta))$ is in the likelihood-based confidence interval for $2(\ell(\hat\theta) - \ell(\theta))$; this interval of $1 - \alpha$ confidence is $(\chi^2_{q,\alpha/2}, \chi^2_{q,1-\alpha/2})$, where $\chi^2_{q,\beta}$ is the lower $\beta$ quantile of a chi-square distribution with $q$ degrees of freedom. The assessment can be carried out by comparing the frequency (empirical confidence level) of $2(\ell(\tilde\theta) - \ell(\theta))$ in $(\chi^2_{q,\alpha/2}, \chi^2_{q,1-\alpha/2})$ with $1 - \alpha$. In other words, we check the frequency of

\[
\chi^2_{q,\alpha/2} \le 2(\ell(\tilde\theta) - \ell(\theta)) \le \chi^2_{q,1-\alpha/2},
\]

and $\tilde\theta$ is considered to be efficient if the empirical frequency is close to $1 - \alpha$. The advantage of this approach is that only $\tilde\theta$, $\ell(\tilde\theta)$ and $\ell(\theta)$ need to be calculated; this leads to much less computation compared with finding $\hat\theta$. The disadvantage is that this approach may not be very informative about the efficiency of $\tilde\theta$ in comparison with $\hat\theta$ in relatively small sample situations. In our studies, A3 will not be used. We mention this approach merely for further potential investigations.

To compare IFM with IFS by A1, we need to calculate the Fisher information (matrix) and the Godambe information (matrix). Suppose $P(y_1 \cdots y_d; \theta)$, $\theta \in \Re$, is a regular MCD or MMD model in (2.12), where $\theta = (\theta_1, \ldots, \theta_q)'$ is a $q$-component vector, and $\Re$ is the parameter space. The Fisher information matrix from one observation for the parameter vector $\theta$, $I$, has the expression $I = (I_{jk})_{q \times q}$, where

\[
I_{jj} = \sum_{\{y_1 \cdots y_d\}} \frac{1}{P_{1 \cdots d}(y_1 \cdots y_d)} \left( \frac{\partial P_{1 \cdots d}(y_1 \cdots y_d)}{\partial \theta_j} \right)^2, \quad 1 \le j \le q,
\]
\[
I_{jk} = \sum_{\{y_1 \cdots y_d\}} \frac{1}{P_{1 \cdots d}(y_1 \cdots y_d)} \frac{\partial P_{1 \cdots d}(y_1 \cdots y_d)}{\partial \theta_j} \frac{\partial P_{1 \cdots d}(y_1 \cdots y_d)}{\partial \theta_k}, \quad 1 \le j < k \le q.
\]

Assume the IFM for one observation is $\Psi = (\psi_1, \ldots, \psi_q)$. The Godambe information matrix $J_\Psi$ based on IFM for one observation is $J_\Psi = D_\Psi^T M_\Psi^{-1} D_\Psi$, where $M_\Psi = (M_{jk})_{q \times q}$ and $D_\Psi = (D_{jk})_{q \times q}$, with $M_{jj} = E(\psi_j^2)$ $(j = 1, \ldots, q)$, $M_{jk} = E(\psi_j \psi_k)$ $(j, k = 1, \ldots, q,\ j \ne k)$, $D_{jj} = E(\partial \psi_j / \partial \theta_j)$ $(j = 1, \ldots, q)$, and $D_{jk} = E(\partial \psi_j / \partial \theta_k)$ $(j, k = 1, \ldots, q,\ j \ne k)$. The detailed calculation of the elements of $M_\Psi$ and $D_\Psi$ can be found in section 2.4 for the models without covariates.

The M-optimality assessment examines the positive-definiteness of $J_\Psi^{-1} - I^{-1}$. It is equivalent to T-optimality, which examines the ratio of the traces of the two inverse information matrices, $\mathrm{Tr}(J_\Psi^{-1})/\mathrm{Tr}(I^{-1})$, and to D-optimality, which examines the ratio of their determinants, $\det(J_\Psi^{-1})/\det(I^{-1})$. T-optimality is a suitable index for the efficiency investigation as it is easier to compute. An equivalent index to D-optimality is $\sqrt[q]{\det(J_\Psi^{-1})/\det(I^{-1})}$. In our efficiency assessment studies, we will use M-optimality, T-optimality or D-optimality interchangeably, depending on which is most convenient.

In most multivariate settings, A1 is not feasible analytically and is extremely difficult computationally (involving tedious programming of partial derivatives). A2 is an approach which eliminates the above problems as long as MLEs are available. As MLEs and IFMEs are both unbiased only asymptotically, the actual bias related to the sample size is an important issue. For an investigation related to sample size, it is more sensible to examine measures of closeness of an estimator to its true value. Such a measure is the mean squared error (MSE) of an estimator. For an estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$, where $X_1, \ldots, X_n$ is a random sample of size $n$ from a distribution indexed by $\theta$, the MSE of $\hat\theta$ about the true value $\theta$ is

\[
\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2 = \mathrm{Var}(\hat\theta) + [E(\hat\theta) - \theta]^2.
\]

Suppose that $\hat\theta$ has a sampling distribution $F$, and suppose $\hat\theta_1, \ldots, \hat\theta_m$ are iid from $F$; then an obvious estimator of $\mathrm{MSE}(\hat\theta)$ is

\[
\widehat{\mathrm{MSE}}(\hat\theta) = \frac{\sum_{i=1}^m (\hat\theta_i - \theta)^2}{m}. \tag{4.1}
\]
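To make approach A2 concrete, the following is a small simulation sketch, not the thesis's own program. It assumes the exchangeable trivariate probit model with zero cut-offs that is treated in Example 4.4, with $\rho \ge 0$ so that the latent variables can be generated from a shared factor, and it compares the estimator (4.1) for the MLE of $\rho$ with that for the estimate from a single bivariate margin. All function names, the sample sizes and the seed are illustrative assumptions.

```python
import math
import random

def mse_hat(estimates, theta):
    """Equation (4.1): average squared distance of the replicates from the true value."""
    return sum((e - theta) ** 2 for e in estimates) / len(estimates)

def estimate_once(n, rho, rng):
    """Simulate n observations and return (MLE, single-margin estimate) of rho."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)   # one-factor form, needs rho >= 0
    all_equal = 0    # observations equal to (1,1,1) or (0,0,0)
    pair_equal = 0   # observations with y1 == y2
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        y = [a * w + b * rng.gauss(0.0, 1.0) <= 0.0 for _ in range(3)]
        if y[0] == y[1] == y[2]:
            all_equal += 1
        if y[0] == y[1]:
            pair_equal += 1
    mle = math.sin(math.pi * (4 * all_equal - n) / (6 * n))
    single = math.sin(math.pi * (2 * pair_equal - n) / (2 * n))
    return mle, single

def mse_ratio(n, m, rho, seed=0):
    """Ratio of estimated MSEs: single-margin estimate over MLE, from m replicates."""
    rng = random.Random(seed)
    pairs = [estimate_once(n, rho, rng) for _ in range(m)]
    mse_mle = mse_hat([p[0] for p in pairs], rho)
    mse_single = mse_hat([p[1] for p in pairs], rho)
    return mse_single / mse_mle
```

A call such as `mse_ratio(100, 400, 0.5, seed=1)` gives a finite-sample counterpart of the asymptotic variance ratio derived in Example 4.4; the simulated value depends on the seed and the number of replicates.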
If $\tilde\theta$ is from the IFM approach and $\hat\theta$ from the IFS approach, A2 suggests that we compare $\widehat{\mathrm{MSE}}(\tilde\theta)$ and $\widehat{\mathrm{MSE}}(\hat\theta)$. For a fixed sample size, $\hat\theta$ need not be the optimal estimate of $\theta$ in terms of MSE, since now the bias of the estimate is also taken into consideration. The measure $\widehat{\mathrm{MSE}}(\tilde\theta)/\widehat{\mathrm{MSE}}(\hat\theta)$ thus gives us an idea of how the IFME performs relative to the MLE. The approach is mainly computational, based on the computer implementation of specific models and subsequent intensive simulation and parameter estimation. A2 can be easily used to study models with no covariates as well as with covariates.

4.2 Analytical assessment of the efficiency

In this section, we study the efficiency of the IFM approach analytically for some special models where the Godambe information matrix and the Fisher information matrix are available in closed form, or computable.

Example 4.1 (Multinormal, general) Let $X \sim N_d(\mu, \Sigma)$. Given $n$ independent observations $x_1, \ldots, x_n$ from $X$, the MLEs are

\[
\hat\mu = n^{-1} \sum_{i=1}^n x_i, \qquad \hat\Sigma = n^{-1} \sum_{i=1}^n (x_i - \hat\mu)(x_i - \hat\mu)'.
\]

It can be easily shown that the IFMEs $\tilde\mu$ and $\tilde\Sigma$ are equal to $\hat\mu$ and $\hat\Sigma$ respectively, so MLE and IFME are equivalent for the general multinormal distribution.

Example 4.2 (Multinormal, common marginal mean) Let $X \sim N_d(\mu, \Sigma)$, where $\mu = (\mu_1, \ldots, \mu_d)' = \mu 1$ for a scalar parameter $\mu$ and $\Sigma$ is known. Given $n$ independent observations $x_1, \ldots, x_n$ with the same distribution as $X$, the MLE of $\mu$ is

\[
\hat\mu = \frac{1' \Sigma^{-1} \bar x}{1' \Sigma^{-1} 1}, \qquad \bar x = n^{-1} \sum_{i=1}^n x_i.
\]

The IFM for $\mu = (\mu_1, \ldots, \mu_d)'$ are equivalent to

\[
\psi_j = \sum_{i=1}^n \frac{x_{ij} - \mu_j}{\sigma_j^2}, \qquad j = 1, \ldots, d,
\]

which leads to $\tilde\mu_j = n^{-1} \sum_{i=1}^n x_{ij}$, or $\tilde\mu = n^{-1} \sum_{i=1}^n x_i$. Simple calculation leads to $J_\Psi^{-1}(\tilde\mu) = n^{-1} \Sigma$. Thus if we incorporate the knowledge that $\mu_1, \ldots, \mu_d$ are the same value, with WA and PMLA, we have

i. WA: the final IFME of $\mu$ is

\[
\hat\mu_w = \frac{1' \Sigma^{-1} \tilde\mu}{1' \Sigma^{-1} 1},
\]

which is exactly the same as $\hat\mu$ since $\tilde\mu = n^{-1} \sum_{i=1}^n x_i$. So in this situation, the IFME is equivalent to the MLE.

ii.
PMLA: the final IFME of $\mu$ is

\[
\hat\mu_p = \frac{1' \{\mathrm{diag}(\Sigma)\}^{-1} \tilde\mu}{1' \{\mathrm{diag}(\Sigma)\}^{-1} 1} = \frac{\sum_{j=1}^d \sigma_j^{-2} \tilde\mu_j}{\sum_{j=1}^d \sigma_j^{-2}}.
\]

With this approach, IFME is not equivalent to MLE. The ratio of $\mathrm{Var}(\hat\mu_p)$ to $\mathrm{Var}(\hat\mu)$ is

\[
\frac{[1' \{\mathrm{diag}(\Sigma)\}^{-1} \Sigma \{\mathrm{diag}(\Sigma)\}^{-1} 1]\,[1' \Sigma^{-1} 1]}{(1' \{\mathrm{diag}(\Sigma)\}^{-1} 1)^2}.
\]

There is some loss of efficiency with simple PMLA.

Example 4.3 (Trivariate probit, general) Suppose we have a trivariate probit model with known cut-off points, such that $P(111) = \Phi_3(0, 0, 0, \rho_{12}, \rho_{13}, \rho_{23})$. We have the following (Tong 1990):

\[
\begin{aligned}
P_1(1) &= P_2(1) = P_3(1) = \Phi(0) = \tfrac{1}{2}, \\
P(111) &= \Phi_3(0, 0, 0, \rho_{12}, \rho_{13}, \rho_{23}) = \frac{1}{8} + \frac{1}{4\pi}(\sin^{-1}\rho_{12} + \sin^{-1}\rho_{13} + \sin^{-1}\rho_{23}).
\end{aligned} \tag{4.2}
\]

The full loglikelihood function is

\[
\begin{aligned}
\ell_n = {} & n(111)\log P(111) + n(110)\log P(110) + n(101)\log P(101) + n(100)\log P(100) \\
& + n(011)\log P(011) + n(010)\log P(010) + n(001)\log P(001) + n(000)\log P(000).
\end{aligned} \tag{4.3}
\]

Even in this simple situation, the MLE of $\rho_{jk}$ is not available in closed form. The information matrix for $\rho_{12}$, $\rho_{13}$ and $\rho_{23}$ from one observation is

\[
I = \begin{pmatrix} I_{11} & I_{12} & I_{13} \\ I_{12} & I_{22} & I_{23} \\ I_{13} & I_{23} & I_{33} \end{pmatrix},
\]

where, for example,

\[
I_{11} = \sum_{\{y_1 y_2 y_3\}} \frac{1}{P(y_1 y_2 y_3)} \left( \frac{\partial P(y_1 y_2 y_3)}{\partial \rho_{12}} \right)^2,
\]

the sum being over the eight outcomes $111, 110, 101, 100, 011, 010, 001, 000$. Simple calculation gives us $\partial P(111)/\partial \rho_{12} = 1/(4\pi\sqrt{1 - \rho_{12}^2})$, and the other terms have similar expressions. After simplification, we get

\[
I_{11} = \frac{4\pi^3 + 64a - 16\pi b}{\pi (1 - \rho_{12}^2)\, c\, d\, e\, f},
\]

where

\[
\begin{aligned}
a &= \sin^{-1}\rho_{12} \sin^{-1}\rho_{13} \sin^{-1}\rho_{23}, \\
b &= (\sin^{-1}\rho_{12})^2 + (\sin^{-1}\rho_{13})^2 + (\sin^{-1}\rho_{23})^2, \\
c &= \pi + 2\sin^{-1}\rho_{12} + 2\sin^{-1}\rho_{13} + 2\sin^{-1}\rho_{23}, \\
d &= \pi + 2\sin^{-1}\rho_{12} - 2\sin^{-1}\rho_{13} - 2\sin^{-1}\rho_{23}, \\
e &= \pi - 2\sin^{-1}\rho_{12} + 2\sin^{-1}\rho_{13} - 2\sin^{-1}\rho_{23}, \\
f &= \pi - 2\sin^{-1}\rho_{12} - 2\sin^{-1}\rho_{13} + 2\sin^{-1}\rho_{23}.
\end{aligned}
\]

Other components in the matrix $I$ can be computed similarly.
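The inverse of $I$, after simplification, is found to be

\[
I^{-1} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{pmatrix},
\]

where

\[
\begin{aligned}
a_{11} &= \frac{(\pi^2 - 4(\sin^{-1}\rho_{12})^2)(1 - \rho_{12}^2)}{4}, \quad
a_{22} = \frac{(\pi^2 - 4(\sin^{-1}\rho_{13})^2)(1 - \rho_{13}^2)}{4}, \quad
a_{33} = \frac{(\pi^2 - 4(\sin^{-1}\rho_{23})^2)(1 - \rho_{23}^2)}{4}, \\
a_{12} &= \frac{(\pi\sin^{-1}\rho_{23} - 2\sin^{-1}\rho_{12}\sin^{-1}\rho_{13})(1 - \rho_{12}^2)^{1/2}(1 - \rho_{13}^2)^{1/2}}{2}, \\
a_{13} &= \frac{(\pi\sin^{-1}\rho_{13} - 2\sin^{-1}\rho_{12}\sin^{-1}\rho_{23})(1 - \rho_{12}^2)^{1/2}(1 - \rho_{23}^2)^{1/2}}{2}, \\
a_{23} &= \frac{(\pi\sin^{-1}\rho_{12} - 2\sin^{-1}\rho_{13}\sin^{-1}\rho_{23})(1 - \rho_{13}^2)^{1/2}(1 - \rho_{23}^2)^{1/2}}{2}.
\end{aligned}
\]

For the IFM approach, the likelihood equation from the $(j, k)$ margin is

\[
\frac{\partial \ell_{njk}}{\partial \rho_{jk}} = \left( \frac{n_{jk}(11) + n_{jk}(00)}{P_{jk}(11)} - \frac{n_{jk}(10) + n_{jk}(01)}{P_{jk}(10)} \right) \frac{\partial P_{jk}(11)}{\partial \rho_{jk}} = 0,
\]

which leads to

\[
\tilde\rho_{jk} = \sin\left( \frac{\pi}{2} \cdot \frac{n_{jk}(11) + n_{jk}(00) - n_{jk}(10) - n_{jk}(01)}{n} \right), \quad 1 \le j < k \le 3.
\]

If the IFM for one observation is $\Psi = (\psi_{12}, \psi_{13}, \psi_{23})$, then from section 2.4, we have

\[
E(\psi_{jk}\psi_{lm}) = \sum_{\{y_j y_k y_l y_m\}} \frac{P_{jklm}(y_j y_k y_l y_m)}{P_{jk}(y_j y_k) P_{lm}(y_l y_m)} \frac{\partial P_{jk}(y_j y_k)}{\partial \rho_{jk}} \frac{\partial P_{lm}(y_l y_m)}{\partial \rho_{lm}},
\]

where $1 \le j < k \le 3$, $1 \le l < m \le 3$, and

\[
E\left( \frac{\partial \psi_{jk}}{\partial \rho_{jk}} \right) = -\sum_{\{y_j y_k\}} \frac{1}{P_{jk}(y_j y_k)} \left( \frac{\partial P_{jk}(y_j y_k)}{\partial \rho_{jk}} \right)^2, \quad 1 \le j < k \le 3.
\]

We thus find that

\[
M_\Psi = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{12} & b_{22} & b_{23} \\ b_{13} & b_{23} & b_{33} \end{pmatrix}
\quad\text{and}\quad
D_\Psi = \begin{pmatrix} -b_{11} & 0 & 0 \\ 0 & -b_{22} & 0 \\ 0 & 0 & -b_{33} \end{pmatrix},
\]

where

\[
\begin{aligned}
b_{11} &= \frac{4}{(\pi^2 - 4(\sin^{-1}\rho_{12})^2)(1 - \rho_{12}^2)}, \quad
b_{22} = \frac{4}{(\pi^2 - 4(\sin^{-1}\rho_{13})^2)(1 - \rho_{13}^2)}, \quad
b_{33} = \frac{4}{(\pi^2 - 4(\sin^{-1}\rho_{23})^2)(1 - \rho_{23}^2)}, \\
b_{12} &= \frac{8\pi\sin^{-1}\rho_{23} - 16\sin^{-1}\rho_{12}\sin^{-1}\rho_{13}}{(\pi^2 - 4(\sin^{-1}\rho_{12})^2)(\pi^2 - 4(\sin^{-1}\rho_{13})^2)(1 - \rho_{12}^2)^{1/2}(1 - \rho_{13}^2)^{1/2}}, \\
b_{13} &= \frac{8\pi\sin^{-1}\rho_{13} - 16\sin^{-1}\rho_{12}\sin^{-1}\rho_{23}}{(\pi^2 - 4(\sin^{-1}\rho_{12})^2)(\pi^2 - 4(\sin^{-1}\rho_{23})^2)(1 - \rho_{12}^2)^{1/2}(1 - \rho_{23}^2)^{1/2}}, \\
b_{23} &= \frac{8\pi\sin^{-1}\rho_{12} - 16\sin^{-1}\rho_{13}\sin^{-1}\rho_{23}}{(\pi^2 - 4(\sin^{-1}\rho_{13})^2)(\pi^2 - 4(\sin^{-1}\rho_{23})^2)(1 - \rho_{13}^2)^{1/2}(1 - \rho_{23}^2)^{1/2}}.
\end{aligned}
\]

After simplification, $J_\Psi^{-1} = D_\Psi^{-1} M_\Psi (D_\Psi^{-1})^T$ turns out to be equal to $I^{-1}$. Therefore by M-optimality, the IFM approach is as efficient as the IFS approach.

The algebraic computation in this example was carried out with the help of the symbolic manipulation software Maple (Char et al. 1992). Maple is also used for other analytical examples in this section. For completeness, the Maple program for this example is listed in Appendix A. The Maple programs for other examples in this section are similar.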
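The conclusion of Example 4.3 can also be checked numerically at a given parameter point. The sketch below is a Python illustration, not the Maple program of Appendix A. It builds the Fisher information matrix from the eight cell probabilities, which under zero cut-offs are $1/8 + (4\pi)^{-1}\sum_{j<k} \epsilon_{jk}\sin^{-1}\rho_{jk}$ with $\epsilon_{jk} = 1$ if $y_j = y_k$ and $-1$ otherwise (an extension of (4.2) by sign symmetry), and multiplies it by the closed-form asymptotic covariance matrix of the IFME obtained from the delta method. The function names and the test point are illustrative.

```python
import math
from itertools import product

PAIRS = [(0, 1), (0, 2), (1, 2)]   # parameter order: rho12, rho13, rho23

def fisher_information(rho):
    """Fisher information for (rho12, rho13, rho23) from one observation."""
    s = [math.asin(r) for r in rho]
    info = [[0.0] * 3 for _ in range(3)]
    for y in product((0, 1), repeat=3):
        signs = [1.0 if y[j] == y[k] else -1.0 for j, k in PAIRS]
        prob = 0.125 + sum(e * v for e, v in zip(signs, s)) / (4.0 * math.pi)
        grad = [e / (4.0 * math.pi * math.sqrt(1.0 - r * r))
                for e, r in zip(signs, rho)]
        for a in range(3):
            for b in range(3):
                info[a][b] += grad[a] * grad[b] / prob
    return info

def ifm_asymptotic_covariance(rho):
    """Closed-form asymptotic covariance (one observation) of the IFME."""
    s = [math.asin(r) for r in rho]
    root = [math.sqrt(1.0 - r * r) for r in rho]
    cov = [[0.0] * 3 for _ in range(3)]
    for a in range(3):
        for b in range(3):
            if a == b:
                cov[a][b] = (math.pi ** 2 - 4.0 * s[a] ** 2) * root[a] ** 2 / 4.0
            else:
                c = 3 - a - b   # index of the remaining pair
                cov[a][b] = ((math.pi * s[c] - 2.0 * s[a] * s[b])
                             * root[a] * root[b] / 2.0)
    return cov

def matmul(x, y):
    """Plain 3x3 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Illustrative positive definite correlation structure.
rho = (0.5, 0.3, 0.4)
check = matmul(fisher_information(rho), ifm_asymptotic_covariance(rho))
```

If the IFM approach is fully efficient for this model, `check` should be the identity matrix up to rounding error, which avoids an explicit matrix inversion.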
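•

Example 4.4 (Trivariate probit, exchangeable) Suppose now we have a trivariate probit model with known cut-off points, such that $P(111) = \Phi_3(0,0,0,\rho,\rho,\rho)$. That is, the latent variables are permutation-symmetric or exchangeable. With (4.2), we obtain
\[
\begin{aligned}
P_1(1) &= P_2(1) = P_3(1) = \Phi(0) = \tfrac{1}{2},\\
P_{12}(11) &= P_{13}(11) = P_{23}(11) = \Phi_2(0,0,\rho) = \tfrac{1}{4} + \tfrac{1}{2\pi}\sin^{-1}\rho,\\
P(111) &= \Phi_3(0,0,0,\rho,\rho,\rho) = \tfrac{1}{8} + \tfrac{3}{4\pi}\sin^{-1}\rho.
\end{aligned}
\]
The MLE of $\rho$ is
\[
\hat\rho = \sin\left(\frac{\pi(4n_{111} + 4n_{000} - n)}{6n}\right).
\]
Based on the full loglikelihood function (4.3), we calculate the Fisher information for $\rho$ (using Maple). The asymptotic variance of $\hat\rho$ is found to be
\[
\mathrm{Var}(\hat\rho) = \frac{(1-\rho^2)(\pi + 6\sin^{-1}\rho)(\pi - 2\sin^{-1}\rho)}{12n}.
\]
Let the IFM for one observation be $\Psi = (\psi_{12}, \psi_{13}, \psi_{23})$. We use WA and PMLA to estimate the common parameter $\rho$:

i. WA: We have
\[
M_\Psi = \begin{pmatrix} a & b & b\\ b & a & b\\ b & b & a\end{pmatrix}
\quad\text{and}\quad
D_\Psi = \begin{pmatrix} -a & 0 & 0\\ 0 & -a & 0\\ 0 & 0 & -a\end{pmatrix},
\]
where
\[
a = \frac{4}{(\pi^2 - 4(\sin^{-1}\rho)^2)(1-\rho^2)}, \qquad
b = \frac{8\sin^{-1}\rho}{(\pi - 2\sin^{-1}\rho)(\pi + 2\sin^{-1}\rho)^2(1-\rho^2)}.
\]
Thus
\[
J_\Psi^{-1} = D_\Psi^{-1}M_\Psi(D_\Psi^{-1})^T = \begin{pmatrix} a^{-1} & a^{-2}b & a^{-2}b\\ a^{-2}b & a^{-1} & a^{-2}b\\ a^{-2}b & a^{-2}b & a^{-1}\end{pmatrix}.
\]
Let the IFMEs of $\rho_{12}$, $\rho_{13}$, $\rho_{23}$ be $\tilde\rho_{12}$, $\tilde\rho_{13}$, $\tilde\rho_{23}$ respectively. With WA, we find the weighting vector $\mathbf{u} = (1/3, 1/3, 1/3)'$. So the IFME of $\rho$, $\tilde\rho_w$, is
\[
\tilde\rho_w = \tfrac{1}{3}(\tilde\rho_{12} + \tilde\rho_{13} + \tilde\rho_{23}),
\]
and the asymptotic variance of $\tilde\rho_w$ is
\[
\begin{aligned}
\mathrm{Var}(\tilde\rho_w) &= \frac{1}{n}\,\mathbf{u}'J_\Psi^{-1}\mathbf{u}
= \frac{1}{3}\cdot\frac{(1-\rho^2)(\pi^2 - 4(\sin^{-1}\rho)^2)}{4n} + \frac{2}{3}\cdot\frac{(1-\rho^2)(\pi - 2\sin^{-1}\rho)\sin^{-1}\rho}{2n}\\
&= \frac{(1-\rho^2)(\pi + 6\sin^{-1}\rho)(\pi - 2\sin^{-1}\rho)}{12n}.
\end{aligned}
\]

ii. PMLA: The IFM is $\Psi = \psi_{12} + \psi_{13} + \psi_{23}$. Thus
\[
M_\Psi = E(\psi_{12} + \psi_{13} + \psi_{23})^2 = \sum_{\{y_1y_2y_3\}} P(y_1y_2y_3)\left(\sum_{j<k}\frac{1}{P_{jk}(y_jy_k)}\frac{\partial P_{jk}(y_jy_k)}{\partial\rho}\right)^2
\tag{4.4}
\]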
and
\[
D_\Psi = E\left(\frac{\partial(\psi_{12} + \psi_{13} + \psi_{23})}{\partial\rho}\right)
= \sum_{\{y_1y_2y_3\}} P(y_1y_2y_3)\sum_{j<k}\left[-\frac{1}{P_{jk}^2(y_jy_k)}\left(\frac{\partial P_{jk}(y_jy_k)}{\partial\rho}\right)^2 + \frac{1}{P_{jk}(y_jy_k)}\frac{\partial^2 P_{jk}(y_jy_k)}{\partial\rho^2}\right].
\tag{4.5}
\]
We find (using Maple)
\[
M_\Psi = \frac{12(\pi + 6\sin^{-1}\rho)}{(1-\rho^2)(\pi - 2\sin^{-1}\rho)(\pi + 2\sin^{-1}\rho)^2}, \qquad
D_\Psi = -\frac{12}{(\pi^2 - 4(\sin^{-1}\rho)^2)(1-\rho^2)}.
\]
The evaluation of $J_\Psi^{-1} = D_\Psi^{-1}M_\Psi(D_\Psi^{-1})^T$ leads to the asymptotic variance of $\tilde\rho_p$:
\[
\mathrm{Var}(\tilde\rho_p) = \frac{(1-\rho^2)(\pi + 6\sin^{-1}\rho)(\pi - 2\sin^{-1}\rho)}{12n}.
\]
We have so far shown that $\mathrm{Var}(\hat\rho) = \mathrm{Var}(\tilde\rho_w) = \mathrm{Var}(\tilde\rho_p)$, which means that the IFM with WA and with PMLA leads to an estimate as efficient as that from the IFS approach.

Any single estimating equation from the IFM also gives an asymptotically unbiased estimate of $\rho$, and the $\tilde\rho$ from each of these estimating equations has the same asymptotic properties because of the exchangeability of the latent variables. The ratio of the asymptotic variance of the IFME of $\rho$ from one estimating equation to the asymptotic variance of the MLE $\hat\rho$ is found to be
\[
\frac{3(\pi + 2\sin^{-1}\rho)}{\pi + 6\sin^{-1}\rho}.
\]
Figure 4.1 is a plot of the ratio versus $\rho \in [-0.5, 1]$. The ratio decreases from $\infty$ to 1.5 as $\rho$ increases from $-0.5$ to 1. When $\rho = 0$, the ratio value is 3. These imply that the estimate from a single estimating equation has relatively high efficiency relative to the MLE when there is high positive correlation, but performs poorly when there is high negative correlation. •

Example 4.5 (Trivariate probit, AR(1)) Suppose we have a trivariate probit model with known cut-off points, such that $P(111) = \Phi_3(0,0,0,\rho,\rho^2,\rho)$. That is, the latent variables have AR(1) correlation structure. With (4.2), we obtain
\[
\begin{aligned}
P_1(1) &= P_2(1) = P_3(1) = \Phi(0) = \tfrac{1}{2},\\
P_{12}(11) &= P_{23}(11) = \Phi_2(0,0,\rho) = \tfrac{1}{4} + \tfrac{1}{2\pi}\sin^{-1}\rho,\\
P_{13}(11) &= \Phi_2(0,0,\rho^2) = \tfrac{1}{4} + \tfrac{1}{2\pi}\sin^{-1}\rho^2,\\
P(111) &= \Phi_3(0,0,0,\rho,\rho^2,\rho) = \tfrac{1}{8} + \tfrac{1}{2\pi}\sin^{-1}\rho + \tfrac{1}{4\pi}\sin^{-1}\rho^2.
\end{aligned}
\]
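Figure 4.1: Trivariate probit, exchangeable: The efficiency of $\tilde\rho$ from the margin (1,2) (or (1,3), (2,3)).

Based on the full loglikelihood function (4.3), the asymptotic variance of $\hat\rho$ is found (using Maple) to be
\[
\mathrm{Var}(\hat\rho) = \frac{\pi(1-\rho^4)\,a_1a_2a_3}{8n\left[\rho^2a_4 + (1+\rho^2)a_5 + \rho(1+\rho^2)^{1/2}a_6\right]},
\tag{4.6}
\]
where
\[
\begin{aligned}
a_1 &= \pi - 2\sin^{-1}\rho^2,\\
a_2 &= \pi - 4\sin^{-1}\rho + 2\sin^{-1}\rho^2,\\
a_3 &= \pi + 4\sin^{-1}\rho + 2\sin^{-1}\rho^2,\\
a_4 &= 2\pi^2 - 16(\sin^{-1}\rho)^2 + 4\pi\sin^{-1}\rho^2,\\
a_5 &= \pi^2 - 4(\sin^{-1}\rho^2)^2,\\
a_6 &= 16\sin^{-1}\rho\,\sin^{-1}\rho^2 - 8\pi\sin^{-1}\rho.
\end{aligned}
\]
Let the IFM for one observation be $\Psi = (\psi_{12}, \psi_{13}, \psi_{23})$. We use WA and PMLA to estimate the common parameter $\rho$:

i. WA: We have
\[
M_\Psi = \begin{pmatrix} a & c & d\\ c & b & c\\ d & c & a\end{pmatrix}
\quad\text{and}\quad
D_\Psi = \begin{pmatrix} -a & 0 & 0\\ 0 & -b & 0\\ 0 & 0 & -a\end{pmatrix},
\]
where
\[
\begin{aligned}
a &= \frac{4}{(\pi^2 - 4(\sin^{-1}\rho)^2)(1-\rho^2)},\\
b &= \frac{16\rho^2}{(\pi^2 - 4(\sin^{-1}\rho^2)^2)(1-\rho^4)},\\
c &= \frac{16\rho\sin^{-1}\rho}{(\pi^2 - 4(\sin^{-1}\rho)^2)(\pi + 2\sin^{-1}\rho^2)(1-\rho^2)(1+\rho^2)^{1/2}},\\
d &= \frac{8\pi\sin^{-1}\rho^2 - 16(\sin^{-1}\rho)^2}{(\pi^2 - 4(\sin^{-1}\rho)^2)^2(1-\rho^2)}.
\end{aligned}
\]
Thus
\[
J_\Psi^{-1} = D_\Psi^{-1}M_\Psi(D_\Psi^{-1})^T = \begin{pmatrix} a^{-1} & c(ab)^{-1} & da^{-2}\\ c(ab)^{-1} & b^{-1} & c(ab)^{-1}\\ da^{-2} & c(ab)^{-1} & a^{-1}\end{pmatrix}.
\]
Let the IFMEs of $\rho_{12}$, $\rho_{13}$, $\rho_{23}$ be $\tilde\rho_{12}$, $\tilde\rho_{13}$, $\tilde\rho_{23}$ respectively, and let $\tilde{\boldsymbol\rho} = (\tilde\rho_{12}, \tilde\rho_{13}, \tilde\rho_{23})'$. With WA, the IFME of $\rho$, $\tilde\rho_w$, is $\tilde\rho_w = \mathbf{u}'\tilde{\boldsymbol\rho}$, where the weighting vector is $\mathbf{u} = (u_1, u_2, u_3)' = J_\Psi\mathbf{1}/(\mathbf{1}'J_\Psi\mathbf{1})$. We find that
\[
\begin{aligned}
u_1 = u_3 &= \frac{a_1a_2a_3a_7}{2a_8\left[\rho^2a_4 + (1+\rho^2)a_5 + \rho(1+\rho^2)^{1/2}a_6\right]},\\
u_2 &= \frac{\left(2\rho^2a_4 + \rho(1+\rho^2)^{1/2}a_6\right)a_2a_3}{2a_8\left[\rho^2a_4 + (1+\rho^2)a_5 + \rho(1+\rho^2)^{1/2}a_6\right]},
\end{aligned}
\]
where $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_6$ are as above, and
\[
\begin{aligned}
a_7 &= \pi + \pi\rho^2 + 2\sin^{-1}\rho^2 + 2\rho^2\sin^{-1}\rho^2 - 4\rho(\rho^2+1)^{1/2}\sin^{-1}\rho,\\
a_8 &= \pi^2 + 4\pi\sin^{-1}\rho^2 + 4(\sin^{-1}\rho^2)^2 - 16(\sin^{-1}\rho)^2.
\end{aligned}
\]
Figure 4.2 is a plot of the weights versus $\rho \in [-1, 1]$. The asymptotic variance of $\tilde\rho_w$ is $\mathrm{Var}(\tilde\rho_w) = n^{-1}\mathbf{u}'J_\Psi^{-1}\mathbf{u}$, which turns out to be the same as (4.6).

ii. PMLA: The IFM is $\Psi = \psi_{12} + \psi_{13} + \psi_{23}$. Following (4.4) and (4.5), we calculate (using Maple) the corresponding $M_\Psi$ and $D_\Psi$, and then $\mathrm{Var}(\tilde\rho_p) = n^{-1}J_\Psi^{-1}$.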
The algebraic expression for $\mathrm{Var}(\tilde\rho_p)$ is complicated, so we do not display it. Instead, we plot the ratio $\mathrm{Var}(\tilde\rho_p)/\mathrm{Var}(\hat\rho)$ versus $\rho \in [-1, 1]$ in Figure 4.3. The maximum of the ratio is 1.0391, which is attained at $\rho = 0.3842$ and $\rho = -0.3842$.

Figure 4.4: Trivariate probit, AR(1): (a) The efficiency of $\tilde\rho$ from the margins (1,2) or (2,3). (b) The efficiency of $\tilde\rho$ from the margin (1,3).

The above results show that IFM with WA leads to an estimate as efficient as the IFS approach in the AR(1) situation, and IFM with simple PMLA leads to a slightly less efficient estimator (ratio $< 1.04$).

The $\tilde\rho$ from the estimating equation based on margin (1,2) (or (2,3)) is different from the $\tilde\rho$ based on margin (1,3). For $\tilde\rho$ from IFM with the (1,2) (or (2,3)) margin, the ratio of the asymptotic variance of the IFME of $\rho$ to the asymptotic variance of the MLE $\hat\rho$ is
\[
r_1 = \frac{2(\pi^2 - 4(\sin^{-1}\rho)^2)\left[\rho^2a_4 + (1+\rho^2)a_5 + \rho(1+\rho^2)^{1/2}a_6\right]}{\pi(1+\rho^2)\,a_1a_2a_3}.
\]
For $\tilde\rho$ from IFM with the (1,3) margin, the corresponding ratio is
\[
r_2 = \frac{(\pi^2 - 4(\sin^{-1}\rho^2)^2)\left[\rho^2a_4 + (1+\rho^2)a_5 + \rho(1+\rho^2)^{1/2}a_6\right]}{2\pi\rho^2\,a_1a_2a_3}.
\]
We plot $r_1$ and $r_2$ versus $\rho \in [-1, 1]$ in Figure 4.4. We see that when $\rho$ goes from $-1$ to 0, $r_1$ increases from 1.707 to values around 2. When $\rho$ goes from 0 to 1, $r_1$ decreases from values around 2 to 1.707. Similarly, $r_2$ increases from 1.207 to $\infty$ as $\rho$ goes from $-1$ to 0, and decreases from $\infty$ to 1.207 as $\rho$ goes from 0 to 1. We conclude that the (1,3) margin by itself leads to an inefficient estimator over a wide range of values of $\rho$. We notice that $r_2 > r_1$ when $|\rho| < 0.6357$, $r_2 < r_1$ when $|\rho| > 0.6357$, and $r_1 = r_2 = 1.97$ when $|\rho| = 0.6357$. •
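The two single-margin ratios of Example 4.5 are easy to evaluate directly from the quantities $a_1, \ldots, a_6$ of (4.6). The following Python sketch is an illustration only (the function name is an assumption, and it is not the thesis's Maple code); it returns $r_1$ and $r_2$ for $0 < |\rho| < 1$, where both ratios are finite.

```python
import math

def ar1_single_margin_ratios(rho):
    """Return (r1, r2) for the AR(1) trivariate probit, valid for 0 < |rho| < 1.

    r1: margin (1,2) or (2,3) versus MLE; r2: margin (1,3) versus MLE.
    """
    pi = math.pi
    s = math.asin(rho)        # sin^{-1}(rho)
    t = math.asin(rho * rho)  # sin^{-1}(rho^2)
    a1 = pi - 2 * t
    a2 = pi - 4 * s + 2 * t
    a3 = pi + 4 * s + 2 * t
    a4 = 2 * pi**2 - 16 * s**2 + 4 * pi * t
    a5 = pi**2 - 4 * t**2
    a6 = 16 * s * t - 8 * pi * s
    bracket = rho**2 * a4 + (1 + rho**2) * a5 + rho * math.sqrt(1 + rho**2) * a6
    r1 = 2 * (pi**2 - 4 * s**2) * bracket / (pi * (1 + rho**2) * a1 * a2 * a3)
    r2 = (pi**2 - 4 * t**2) * bracket / (2 * pi * rho**2 * a1 * a2 * a3)
    return r1, r2

# The two curves cross near |rho| = 0.6357, as reported in the text.
r1_cross, r2_cross = ar1_single_margin_ratios(0.6357)
```

Evaluating the function on a grid of $\rho$ values reproduces the qualitative shape described for Figure 4.4: $r_2$ blows up as $\rho \to 0$ because the (1,3) margin depends on $\rho$ only through $\rho^2$.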
Example 4.6 (Trivariate MCD model for binary data with Morgenstern copula) Suppose we have a trivariate MCD model for binary data with Morgenstern copula, such that
\[
P(111) = u_1u_2u_3\left[1 + \theta_{12}(1-u_1)(1-u_2) + \theta_{13}(1-u_1)(1-u_3) + \theta_{23}(1-u_2)(1-u_3)\right], \quad |\theta_{jk}| \le 1,
\]
where the dependence parameters $\theta_{12}$, $\theta_{13}$ and $\theta_{23}$ obey several constraints:
\[
\begin{aligned}
1 + \theta_{12} + \theta_{13} + \theta_{23} &\ge 0,\\
1 + \theta_{13} &\ge \theta_{12} + \theta_{23},\\
1 + \theta_{12} &\ge \theta_{13} + \theta_{23},\\
1 + \theta_{23} &\ge \theta_{12} + \theta_{13}.
\end{aligned}
\tag{4.7}
\]
We have $P_j(1) = u_j$, $j = 1, 2, 3$, and $P_{jk}(11) = [1 + \theta_{jk}(1-u_j)(1-u_k)]u_ju_k$, $1 \le j < k \le 3$. Assume the $u_j$ are given, and the parameters of interest are $\theta_{12}$, $\theta_{13}$ and $\theta_{23}$. The full loglikelihood function is (4.3). The Fisher information matrix for the parameters $\theta_{12}$, $\theta_{13}$ and $\theta_{23}$ is $I$. Assume we have the IFM for one observation $\Psi = (\psi_{12}, \psi_{13}, \psi_{23})$. The Godambe information for $\Psi$ is $J_\Psi = D_\Psi^T M_\Psi^{-1} D_\Psi$. We proceed to calculate $M_\Psi$ and $D_\Psi$. The algebraic expressions of $I$ and $J_\Psi$ are extremely complicated. We used Maple to output algebraic results in the form of C code and then numerically evaluated the ratio
\[
r_g = \frac{\mathrm{Tr}(J_\Psi^{-1})}{\mathrm{Tr}(I^{-1})},
\]
where $r_g$ means the general efficiency ratio. For this purpose, we first generate $n_1$ uniform points $(\theta_{12}, \theta_{13}, \theta_{23})$ from the cube $[-1, 1]^3$ in three dimensional space under the constraints (4.7), and then order these $n_1$ points based on the value of $|\theta_{12}| + |\theta_{13}| + |\theta_{23}|$ from the smallest to the largest. For each one of the $n_1$ points $(\theta_{12}, \theta_{13}, \theta_{23})$, we generate $n_2$ points $(u_1, u_2, u_3)$ with $(\theta_{12}, \theta_{13}, \theta_{23})$ as given dependence parameters in a trivariate Morgenstern copula in (2.5) (see section 4.3 for how to generate multivariate Morgenstern variates), and then order these $n_2$ points based on the value of $u_1 + u_2 + u_3$ from the smallest to the largest. Each generated set $(u_1, u_2, u_3, \theta_{12}, \theta_{13}, \theta_{23})$ determines a trivariate MCD model with Morgenstern copula for binary data. We calculate $r_g$ corresponding to each particular model.
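The first step of this scheme, drawing the dependence parameters uniformly from the part of the cube that satisfies (4.7) and ordering them, can be sketched in Python as follows. This is a minimal rejection-sampling illustration; the function names, the seed and the use of rejection sampling are assumptions, since the text does not say how the constrained uniform points were produced. The second step (drawing $(u_1, u_2, u_3)$ from the Morgenstern copula) is not shown.

```python
import random

def satisfies_constraints(t12, t13, t23):
    """Constraints (4.7) on the Morgenstern dependence parameters."""
    return (1 + t12 + t13 + t23 >= 0
            and 1 + t13 >= t12 + t23
            and 1 + t12 >= t13 + t23
            and 1 + t23 >= t12 + t13)

def sample_theta_points(n1, seed=0):
    """Draw n1 points uniformly from [-1, 1]^3 subject to (4.7), by rejection.

    The points are returned ordered by |theta12| + |theta13| + |theta23|,
    from the smallest to the largest.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n1:
        theta = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        if satisfies_constraints(*theta):
            points.append(theta)
    points.sort(key=lambda th: sum(abs(x) for x in th))
    return points

theta_grid = sample_theta_points(300)  # n1 = 300, as in the study
```

Rejection from the cube keeps the accepted points uniform on the constrained region, which is what the ordering by $|\theta_{12}| + |\theta_{13}| + |\theta_{23}|$ then organises into the first axis of the efficiency plot.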
Figure 4.5 presents the values of $r_g$ at $n_1 \times n_2 = 300 \times 300$ "grid" points. We can see from Figure 4.5 that the IFM approach is reasonably efficient in most situations. It is also clear that the magnitude of $|\theta_{12}| + |\theta_{13}| + |\theta_{23}|$ has an effect on the efficiency of the IFM approach, with generally higher efficiency ($r_g$ close to 1) when $|\theta_{12}| + |\theta_{13}| + |\theta_{23}|$ is smaller. The magnitude of $u_1 + u_2 + u_3$ also has some effect: the efficiency of the IFM approach is lower in the area close to the boundary of $u_1 + u_2 + u_3$ (that is, close to 0 or 3). The following facts show that the general efficiency of the IFM approach is quite good: in these 90,000 efficiency ($r_g$) evaluations, 50% of the $r_g$ values are less than 1.0196, 90% of the $r_g$ values are less than 1.0722, 99% of the $r_g$ values are less than 1.1803 and 99.99% of the $r_g$ values are less than 1.4654. The maximum is 1.7084. The minimum is 1.

Figure 4.5: Trivariate Morgenstern-binary model: Relative efficiency of IFM approach versus IFS approach.

The two plots in Figure 4.6 are used to clarify the above observations. Plot (a) consists of the 90,000 ordered $r_g$ values versus their ordered positions (from 1 to 90,000) in the data set. Plot (b) is a histogram of the $r_g$ values. Overall, we consider the IFM approach to be efficient.

It is also possible to examine the efficiency ratio in some special situations. We study two of them here. The first one is the situation where $u_1 = u_2 = u_3 = u$ and $\theta_{12} = \theta_{13} = \theta_{23} = \theta$ ($-1/3 \le \theta \le 1$).
The ratio of the asymptotic variance of $\tilde\theta$ (based on WA) versus the asymptotic variance of $\hat\theta$ is found to be
\[
r_1(u, \theta) = \frac{a_1a_2a_3}{b_1b_2},
\]
where
\[
\begin{aligned}
a_1 &= 27\theta^2u^4 - 54\theta^2u^3 + 33\theta^2u^2 - 10\theta u^2 - 6\theta^2u + 10\theta u - 3\theta - 1,\\
a_2 &= 3\theta^3u^6 - 9\theta^3u^5 + 9\theta^3u^4 - 11\theta^2u^4 - 3\theta^3u^3 + 22\theta^2u^3 - 12\theta^2u^2 + \theta u^2 + \theta^2u - \theta u - \theta - 1,\\
a_3 &= \theta u^2 - \theta u + 1,\\
b_1 &= (3\theta u^2 - 6\theta u + 3\theta + 1)(3\theta u^2 - 4\theta u + \theta + 1),\\
b_2 &= (3\theta u^2 - 2\theta u + 1)(3\theta u^2 + 1)(\theta u^2 - \theta u - 1)^2.
\end{aligned}
\]
Figure 4.7 is a plot of $r_1(u, \theta)$ versus $u$ ($0 \le u \le 1$) and $\theta$ ($-1/3 \le \theta \le 1$). We observe that at the boundaries, when $\theta = 1$, $r_1(u, \theta)$ is in the interval $[1, 1.232]$, and the maximum 1.232 is attained at $u = 0.2175$ or $u = 0.7825$. When $\theta = -1/3$, $r_1(u, \theta)$ is in the interval $[1, 2]$, and the maximum 2 is attained at $u = 0$ or $u = 1$. Since the maximum ratio is 2 at some extreme points in the parameter space and for the most part the ratio is less than 1.1, we consider the IFM approach to be efficient.

Figure 4.6: Trivariate Morgenstern-binary model: (a) Ordered relative efficiency values of IFM approach versus IFS approach; (b) A histogram of the efficiency value $r_g$.

The second special situation is where $u_1 = u_2 = u_3 = u$, $\theta_{12} = \theta_{23} = \theta$ and $\theta_{13} = \theta^2$. The algebraic expression of the ratio $r_2(u, \theta)$ of the asymptotic variance of $\tilde\theta$ (based on WA) versus the asymptotic variance of $\hat\theta$ extends to several pages. We thus only present a plot of $r_2(u, \theta)$ versus $u$ ($0 \le u \le 1$) and $\theta$ ($-1 \le \theta \le 1$) in Figure 4.8. We observe that at the boundaries, when $\theta = 1$, the ratio $r_2(u, \theta)$ is in the interval $[1, 1.200097]$, and the maximum is attained at $u = 0.2139$ or $u = 0.7861$. When $\theta = -1$, the ratio $r_2(u, \theta)$ is in the interval $[1, 1.148333]$, and the maximum is attained at $u = 0.154$ or $u = 0.846$. Overall, the IFM approach is demonstrated again to be efficient.
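The closed form for $r_1(u, \theta)$ in the exchangeable Morgenstern case is a ratio of polynomials, so it can be evaluated directly. The Python sketch below is an illustration only (the function name is an assumption); it codes the five polynomials $a_1, a_2, a_3, b_1, b_2$ and can be used to confirm the symmetry of the ratio about $u = 0.5$ and the boundary maximum quoted for $\theta = 1$.

```python
def morgenstern_exchangeable_ratio(u, theta):
    """r1(u, theta) = a1 a2 a3 / (b1 b2) for u1 = u2 = u3 = u and a common theta."""
    a1 = (27 * theta**2 * u**4 - 54 * theta**2 * u**3 + 33 * theta**2 * u**2
          - 10 * theta * u**2 - 6 * theta**2 * u + 10 * theta * u - 3 * theta - 1)
    a2 = (3 * theta**3 * u**6 - 9 * theta**3 * u**5 + 9 * theta**3 * u**4
          - 11 * theta**2 * u**4 - 3 * theta**3 * u**3 + 22 * theta**2 * u**3
          - 12 * theta**2 * u**2 + theta * u**2 + theta**2 * u - theta * u - theta - 1)
    a3 = theta * u**2 - theta * u + 1
    b1 = ((3 * theta * u**2 - 6 * theta * u + 3 * theta + 1)
          * (3 * theta * u**2 - 4 * theta * u + theta + 1))
    b2 = ((3 * theta * u**2 - 2 * theta * u + 1)
          * (3 * theta * u**2 + 1)
          * (theta * u**2 - theta * u - 1) ** 2)
    return a1 * a2 * a3 / (b1 * b2)

ratio_at_max = morgenstern_exchangeable_ratio(0.2175, 1.0)  # close to the reported 1.232
ratio_at_half = morgenstern_exchangeable_ratio(0.5, 1.0)    # IFM fully efficient at u = 0.5
```

At $\theta = -1/3$ with $u = 0$ or $u = 1$ both numerator and denominator vanish, so the reported maximum of 2 there is a limiting value and should be approached with $u$ slightly inside $(0, 1)$.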
•

Example 4.7 (Trivariate normal-copula model for binary data) In Examples 4.3, 4.4 and 4.5, we studied the efficiency of the IFM approach versus the IFS approach in the special situations of $P(111) = \Phi_3(0,0,0,\rho_{12},\rho_{13},\rho_{23})$, $P(111) = \Phi_3(0,0,0,\rho,\rho,\rho)$ and $P(111) = \Phi_3(0,0,0,\rho,\rho^2,\rho)$. We found that the IFM approach was fully efficient in these situations. For a general trivariate normal-binary model
\[
P(111) = \Phi_3(\Phi^{-1}(u_1), \Phi^{-1}(u_2), \Phi^{-1}(u_3), \rho_{12}, \rho_{13}, \rho_{23}),
\tag{4.8}
\]
the closed form efficiency evaluation, as provided for the trivariate Morgenstern-binary model in Example 4.6, is not possible because $\Phi_3(\Phi^{-1}(u_1), \Phi^{-1}(u_2), \Phi^{-1}(u_3), \rho_{12}, \rho_{13}, \rho_{23})$ does not have a closed form. Nevertheless, since a high precision multinormal probability calculation subroutine (Schervish 1984) is available, we can evaluate the efficiency numerically.

Figure 4.7: Trivariate Morgenstern-binary model: Relative efficiency of IFM approach versus IFS approach when $u_1 = u_2 = u_3$ and $\theta_{12} = \theta_{13} = \theta_{23}$.

Figure 4.8: Trivariate Morgenstern-binary model: Relative efficiency of IFM approach versus IFS approach when $u_1 = u_2 = u_3$, $\theta_{12} = \theta_{23} = \theta$ and $\theta_{13} = \theta^2$.

With the model (4.8), we have $P_j(1) = u_j$, $j = 1, 2, 3$, and $P_{jk}(11) = \Phi_2(\Phi^{-1}(u_j), \Phi^{-1}(u_k); \rho_{jk})$, $1 \le j < k \le 3$. Assume the $u_j$ are given, and the parameters of interest are $\rho_{12}$, $\rho_{13}$ and $\rho_{23}$. Let $\theta_1 = \rho_{12}$, $\theta_2 = \rho_{13}$ and $\theta_3 = \rho_{23}$. The Fisher information matrix from one observation for the parameters $\theta_1$, $\theta_2$ and $\theta_3$, $I$, has the following expression
\[
I = \begin{pmatrix} I_{11} & I_{12} & I_{13}\\ I_{12} & I_{22} & I_{23}\\ I_{13} & I_{23} & I_{33}\end{pmatrix},
\]
where
\[
\begin{aligned}
I_{jj} &= \sum_{\{y_1y_2y_3\}} \frac{1}{P_{123}(y_1y_2y_3)}\left(\frac{\partial P_{123}(y_1y_2y_3)}{\partial\theta_j}\right)^2, \quad j = 1, 2, 3,\\
I_{jk} &= \sum_{\{y_1y_2y_3\}} \frac{1}{P_{123}(y_1y_2y_3)}\,\frac{\partial P_{123}(y_1y_2y_3)}{\partial\theta_j}\,\frac{\partial P_{123}(y_1y_2y_3)}{\partial\theta_k}, \quad 1 \le j < k \le 3.
\end{aligned}
\]
We can similarly calculate the Godambe information matrix $J_\Psi$ based on the IFM approach for one observation.
We then numerically evaluate the ratio (T-optimality)
\[
r_g = \frac{\mathrm{Tr}(J_\Psi^{-1})}{\mathrm{Tr}(I^{-1})}
\]
in the joint trinormal copula sample space and its parameter space. Similar to Example 4.6 for the trivariate Morgenstern-binary model, we first generate $n_1$ uniform points $(\rho_{12}, \rho_{13}, \rho_{23})$ from the cube $[-1, 1]^3$ in three dimensional space under the constraint $1 + 2\rho_{12}\rho_{13}\rho_{23} - \rho_{12}^2 - \rho_{13}^2 - \rho_{23}^2 > 0$ (which guarantees that the determinant of a trinormal correlation matrix is positive) and order these $n_1$ points based on the value of $|\rho_{12}| + |\rho_{13}| + |\rho_{23}|$ from the smallest to the largest. Then for each one of the $n_1$ points $(\rho_{12}, \rho_{13}, \rho_{23})$, we generate $n_2$ points $(u_1, u_2, u_3)$ with $(\rho_{12}, \rho_{13}, \rho_{23})$ as given dependence parameters in a trinormal copula, and order these $n_2$ points based on the value of $u_1 + u_2 + u_3$ from the smallest to the largest. Each generated set $(u_1, u_2, u_3, \rho_{12}, \rho_{13}, \rho_{23})$ determines a trivariate normal-binary model. We evaluate $r_g$ corresponding to each particular model.

The plot in Figure 4.9 presents the values of $r_g$ at $n_1 \times n_2 = 300 \times 300$ "grid" points for the trivariate normal-copula model for binary data. We observe from the plot that the IFM approach is reasonably efficient in most situations. It is also clear that the magnitude of $|\rho_{12}| + |\rho_{13}| + |\rho_{23}|$ has an effect on the efficiency of the IFM approach, with generally higher efficiency ($r_g$ close to 1) when $|\rho_{12}| + |\rho_{13}| + |\rho_{23}|$ is smaller. The magnitude of $u_1 + u_2 + u_3$ also has some effect: the efficiency of the IFM approach is lower in the area close to the boundaries of $u_1 + u_2 + u_3$ (that is, close to 0 or 3).

Figure 4.9: Trivariate normal-binary model: Relative efficiency of IFM approach versus IFS approach.
In general the IFM approach is quite efficient: in these 90,000 efficiency ($r_g$) evaluations, 50% of the $r_g$ values are less than 1.0128, 90% of the $r_g$ values are less than 1.0589, 99% of the $r_g$ values are less than 1.1479 and 99.99% of the $r_g$ values are less than 1.3672. The maximum is 1.8097. The minimum is 1. The two plots in Figure 4.10 are used to clarify the above observations. Plot (a) consists of the 90,000 ordered $r_g$ values versus ordered positions (from 1 to 90,000) in the data set. Plot (b) is a histogram of the $r_g$ values. Overall, we draw the conclusion that the IFM approach is efficient.

In the situation where $u_1 = u_2 = u_3 = u$ and $\rho_{12} = \rho_{13} = \rho_{23} = \rho$ ($-1/2 \le \rho \le 1$), let $r_1(u, \rho)$ denote the ratio of the asymptotic variance of $\tilde\rho$ (based on WA) versus the asymptotic variance of $\hat\rho$. The ratio $r_1(u, \rho)$ has to be evaluated numerically. Figure 4.11 shows a plot of $r_1(u, \rho)$ versus $u$ ($0 \le u \le 1$) and $\rho$ ($-1/2 \le \rho \le 1$). It is difficult to evaluate $r_1(u, \rho)$ numerically when the values of $u$ and $\rho$ are near the boundaries of the sample space and the parameter space, but generally speaking, the efficiency is lower when the values of $u$ and $\rho$ are close to the boundaries. In the situation where $u_1 = u_2 = u_3 = u$, $\rho_{12} = \rho_{23} = \rho$ and $\rho_{13} = \rho^2$ ($\rho \in [-1, 1]$), we observed similar efficiency behaviour. These results are not presented here. •

Figure 4.11: Trivariate normal-binary model: Relative efficiency of IFM approach versus IFS approach when $u_1 = u_2 = u_3$ and $\rho_{12} = \rho_{13} = \rho_{23}$.

We have seen from the trivariate normal-copula model for binary data and the trivariate Morgenstern-copula model for binary data that, in some situations, IFM is as efficient as IFS (e.g. when $u = 0.5$ for the normal-binary model and $u = 0$, $0.5$ or $1$ for the Morgenstern-binary model).
In other situations, the efficiency of IFM relative to IFS varies from 1 to a value very close to 1. It is hoped the above results may help to develop intuition for the efficiency of IFM. We would guess that the relative efficiency of IFM to IFS for a model with the MUBE property should be good, as we have seen with the trivariate normal-copula model for binary data and the trivariate Morgenstern-copula model for binary data. However, a general exhaustive analytical investigation such as the above is not possible; we have to rely on numerical investigation based on simulation for most of the complicated situations (higher dimensions or models with covariates).

4.3 Efficiency assessment through simulation

In this section, we give efficiency assessment results through simulation studies with various models. The following are the steps in the simulation and computation:

(1) a MCD or MMD model (with the MUBE property) is chosen;
(2) different sets of model parameters are specified;
(3) with a given set of parameters, a sample of size $n$ is generated from the model, and the IFM and IFS approaches are used on the same generated data set to estimate the model parameters;
(4) with the same set of parameters, step (3) is repeated $m$ times;
(5) for any single parameter in the model, say $\theta$, if the estimates of $\theta$ with the IFM approach from steps (3) and (4) are $\tilde\theta_1, \ldots, \tilde\theta_m$, and the estimates of $\theta$ with the IFS approach from steps (3) and (4) are $\hat\theta_1, \ldots, \hat\theta_m$, then we compute
\[
\bar{\tilde\theta} = \frac{\sum_{i=1}^m \tilde\theta_i}{m}, \qquad \mathrm{MSE}(\tilde\theta) = \frac{\sum_{i=1}^m (\tilde\theta_i - \theta)^2}{m}
\tag{4.9}
\]
and
\[
\bar{\hat\theta} = \frac{\sum_{i=1}^m \hat\theta_i}{m}, \qquad \mathrm{MSE}(\hat\theta) = \frac{\sum_{i=1}^m (\hat\theta_i - \theta)^2}{m}.
\]
The relative efficiency of IFME to MLE is defined as the ratio $r$ where $r^2 = \mathrm{MSE}(\tilde\theta)/\mathrm{MSE}(\hat\theta)$.

The values of $\bar{\tilde\theta}$, $\sqrt{\mathrm{MSE}(\tilde\theta)}$, $\bar{\hat\theta}$, $\sqrt{\mathrm{MSE}(\hat\theta)}$ and $r$ are tabulated, with $\sqrt{\mathrm{MSE}(\tilde\theta)}$ and $\sqrt{\mathrm{MSE}(\hat\theta)}$ presented in parentheses. For a fixed sample size, a parameter estimation approach is said to be good if $\bar{\tilde\theta}$ (or $\bar{\hat\theta}$) is close to $\theta$, and if $\sqrt{\mathrm{MSE}(\tilde\theta)}$ (or $\sqrt{\mathrm{MSE}(\hat\theta)}$) is small. There is no "good" in the strict sense; it should be understood in terms of inference, interpretation (i.e.
no misleading interpretation or false inference would be derived, assuming the model is correct) and in comparison with conventional, well-established approaches. The main objective of this section is to show that with fairly complex models, the IFM approach still has high efficiency.

Multivariate copula discrete models for binary data

In this subsection, we study the MCD models for binary data. The parameters are assumed to be margin-dependent. In our simulation, we use the MVN copula, and simulate $d$-dimensional binary observations $\mathbf{y}_i$ ($i = 1, \ldots, n$) from a multivariate probit model
\[
Y_{ij} = I(Z_{ij} \le z_{ij}), \quad j = 1, \ldots, d, \quad i = 1, \ldots, n,
\]
where $\mathbf{Z}_i = (Z_{i1}, \ldots, Z_{id})' \sim \mathrm{MVN}_d(\mathbf{0}, \Theta_i)$ with $z_{ij} = \boldsymbol\beta_j'\mathbf{x}_{ij}$, and $\Theta_i = (\theta_{ijk})$ is assumed to be free of covariates, that is, $\Theta_i = \Theta$ or $\theta_{ijk} = \theta_{jk}$, $\forall i$. We transform the dependence parameter $\theta_{jk}$ with $\theta_{jk} = (\exp(\alpha_{jk}) - 1)/(\exp(\alpha_{jk}) + 1)$, and estimate $\alpha_{jk}$ instead of $\theta_{jk}$. We use the following simulation scheme:

1. The sample size is $n$, the number of simulations is $N$; both are reported in the tables.

2. For $d = 3$, we study the two situations: $Y_{ij} = I(Z_{ij} \le z_j)$ and $Y_{ij} = I(Z_{ij} \le \beta_{j0} + \beta_{j1}x_{ij})$. For each situation, two general dependence structures are chosen: $\theta_{12} = \theta_{13} = \theta_{23} = 0.6$ (or $\alpha_{12} = \alpha_{13} = \alpha_{23} = 1.3863$) and $\theta_{12} = \theta_{23} = 0.8$ (or $\alpha_{12} = \alpha_{23} = 2.1972$), $\theta_{13} = 0.64$ (or $\alpha_{13} = 1.5163$). Other parameters are:
(a) With no covariates, $\mathbf{z} = (0, 0, 0)'$.
(b) With covariates, $\boldsymbol\beta_0 = (\beta_{10}, \beta_{20}, \beta_{30})' = (0.7, 0.5, 0.3)'$ and $\boldsymbol\beta_1 = (\beta_{11}, \beta_{21}, \beta_{31})' = (0.5, 0.5, 0.5)'$. Situations where $x_{ij}$ is discrete and continuous are considered. For the discrete situation, $x_{ij} = I(U \le 0)$ where $U \sim U(-1, 1)$; for the continuous situation, the $x_{ij}$ are margin-independent, that is, $x_{ij} = x_i$ with $x_i \sim N(0, 1/4)$.

3. For $d = 4$, we only study $Y_{ij} = I(Z_{ij} \le z_j)$.
Two dependence structures in the study are $\theta_{12} = \theta_{13} = \theta_{14} = \theta_{23} = \theta_{24} = \theta_{34} = 0.6$ (or $\alpha_{12} = \alpha_{13} = \alpha_{14} = \alpha_{23} = \alpha_{24} = \alpha_{34} = 1.3863$) and $\theta_{12} = \theta_{23} = \theta_{34} = 0.8$ (or $\alpha_{12} = \alpha_{23} = \alpha_{34} = 2.1972$), $\theta_{13} = \theta_{24} = 0.64$ (or $\alpha_{13} = \alpha_{24} = 1.5163$) and $\theta_{14} = 0.512$ (or $\alpha_{14} = 1.1309$). The cut-off points are (a) $\mathbf{z} = (0, 0, 0, 0)'$, (b) $\mathbf{z} = (0.7, 0.7, 0.7, 0.7)'$, (c) $\mathbf{z} = (0.7, 0, 0.7, 0)'$.

Table 4.1: Efficiency assessment with MCD model for binary data: d = 3, z = (0,0,0)', N = 1000

margin         1        2        3        (1,2)    (1,3)    (2,3)
n     method   z1       z2       z3       α12      α13      α23

α12 = α13 = α23 = 1.3863
100   IFM      0.003    -0.002   0.005    1.442    1.426    1.420
               (0.131)  (0.121)  (0.128)  (0.376)  (0.380)  (0.378)
      MLE      0.002    -0.003   0.004    1.441    1.426    1.420
               (0.131)  (0.121)  (0.128)  (0.376)  (0.380)  (0.378)
      r        0.998    0.999    0.999    0.999    0.999    0.999
1000  IFM      -0.0006  -0.0016  -0.0008  1.3924   1.3897   1.3906
               (0.040)  (0.038)  (0.039)  (0.114)  (0.114)  (0.113)
      MLE      -0.0018  -0.0028  -0.0019  1.3919   1.3893   1.3902
               (0.040)  (0.038)  (0.039)  (0.114)  (0.114)  (0.113)
      r        0.997    0.997    0.997    1.000    1.001    1.000

α12 = α23 = 2.1972, α13 = 1.5163
100   IFM      0.0027   -0.0006  0.0003   2.2664   1.5571   2.2586
               (0.131)  (0.123)  (0.130)  (0.454)  (0.377)  (0.453)
      MLE      0.0015   -0.0020  -0.0012  2.2646   1.5552   2.2579
               (0.131)  (0.123)  (0.131)  (0.453)  (0.377)  (0.452)
      r        0.999    1.000    0.999    1.001    1.001    1.002
1000  IFM      -0.0006  -0.0001  -0.0005  2.2009   1.5174   2.2043
               (0.040)  (0.038)  (0.039)  (0.135)  (0.118)  (0.136)
      MLE      -0.0023  -0.0020  -0.0022  2.2003   1.5166   2.2036
               (0.040)  (0.038)  (0.039)  (0.135)  (0.118)  (0.137)
      r        0.996    1.000    0.996    0.999    1.000    1.000

The numerical results from MCD models for binary data are presented in Table 4.1 to Table 4.5. These tables lead to two clear conclusions: i) The IFM approach is efficient relative to the ML approach, for small to large sample sizes. The ratio values r are very close to 1 in almost all the situations studied.
These results are consistent with the results from the analytical studies reported in the previous section. ii) The MLE may be slightly more efficient than the IFME, but this observation is not conclusive. We would say that IFME and MLE are comparable.

Table 4.2: Efficiency assessment with MCD model for binary data: d = 3, β0 = (0.7,0.5,0.3)', β1 = (0.5,0.5,0.5)', x_ij discrete, N = 1000

margin         1                 2                 3                 (1,2)    (1,3)    (2,3)
n     method   β10      β11      β20      β21      β30      β31      α12      α13      α23

α12 = α13 = α23 = 1.3863
100   IFM      0.694    0.559    0.496    0.547    0.294    0.526    1.446    1.447    1.435
               (0.199)  (0.385)  (0.192)  (0.335)  (0.186)  (0.282)  (0.530)  (0.492)  (0.455)
      MLE      0.692    0.557    0.495    0.545    0.293    0.526    1.447    1.450    1.438
               (0.198)  (0.347)  (0.192)  (0.319)  (0.185)  (0.282)  (0.498)  (0.490)  (0.458)
      r        1.005    1.108    1.001    1.050    1.001    1.001    1.064    1.004    0.993
1000  IFM      0.700    0.501    0.498    0.509    0.298    0.503    1.395    1.386    1.385
               (0.063)  (0.100)  (0.058)  (0.089)  (0.058)  (0.085)  (0.145)  (0.136)  (0.131)
      MLE      0.699    0.500    0.497    0.508    0.298    0.503    1.395    1.387    1.387
               (0.063)  (0.099)  (0.058)  (0.089)  (0.058)  (0.085)  (0.145)  (0.136)  (0.131)
      r        1.002    1.013    1.000    1.006    0.998    1.001    1.001    1.001    1.001

α12 = α23 = 2.1972, α13 = 1.5163
100   IFM      0.694    0.559    0.496    0.540    0.293    0.534    2.392    1.599    2.314
               (0.199)  (0.385)  (0.198)  (0.312)  (0.186)  (0.289)  (0.790)  (0.592)  (0.669)
      MLE      0.692    0.558    0.494    0.542    0.292    0.534    2.352    1.591    2.314
               (0.199)  (0.344)  (0.198)  (0.309)  (0.187)  (0.290)  (0.675)  (0.548)  (0.597)
      r        1.001    1.117    0.999    1.007    0.995    0.996    1.171    1.081    1.119
1000  IFM      0.700    0.501    0.499    0.503    0.299    0.502    2.205    1.521    2.201
               (0.064)  (0.100)  (0.058)  (0.092)  (0.057)  (0.086)  (0.167)  (0.141)  (0.155)
      MLE      0.699    0.501    0.498    0.503    0.297    0.502    2.207    1.523    2.204
               (0.063)  (0.098)  (0.058)  (0.092)  (0.057)  (0.086)  (0.167)  (0.141)  (0.155)
      r        1.005    1.019    1.000    1.008    0.998    1.001    0.999    1.001    0.999

Table 4.3: Efficiency assessment with MCD model for binary data: d = 3, β0 = (0.7,0.5,0.3)', β1 = (0.5,0.5,0.5)', x_ij = x_i continuous, N = 100

margin         1                 2                 3                 (1,2)    (1,3)    (2,3)
n     method   β10      β11      β20      β21      β30      β31      α12      α13      α23

α12 = α13 = α23 = 1.3863
100   IFM      0.722    0.529    0.488    0.520    0.312    0.524    1.453    1.403    1.473
               (0.136)  (0.326)  (0.144)  (0.278)  (0.137)  (0.310)  (0.398)  (0.402)  (0.401)
      MLE      0.722    0.532    0.486    0.519    0.311    0.522    1.458    1.407    1.476
               (0.137)  (0.320)  (0.144)  (0.278)  (0.138)  (0.308)  (0.402)  (0.412)  (0.406)
      r        0.999    1.019    0.999    1.002    0.993    1.005    0.990    0.976    0.989
1000  IFM      0.704    0.495    0.501    0.504    0.306    0.504    1.413    1.380    1.391
               (0.042)  (0.089)  (0.046)  (0.084)  (0.041)  (0.093)  (0.140)  (0.109)  (0.124)
      MLE      0.703    0.494    0.500    0.503    0.305    0.503    1.415    1.381    1.393
               (0.042)  (0.090)  (0.045)  (0.084)  (0.040)  (0.093)  (0.139)  (0.109)  (0.124)
      r        1.004    0.988    1.004    0.993    1.007    1.001    1.000    1.006    0.999

α12 = α23 = 2.1972, α13 = 1.5163
100   IFM      0.722    0.529    0.490    0.543    0.300    0.544    2.303    1.556    2.303
               (0.136)  (0.326)  (0.133)  (0.272)  (0.131)  (0.309)  (0.494)  (0.362)  (0.525)
      MLE      0.721    0.532    0.488    0.542    0.298    0.538    2.318    1.550    2.310
               (0.136)  (0.317)  (0.134)  (0.279)  (0.131)  (0.306)  (0.504)  (0.371)  (0.533)
      r        1.000    1.028    0.993    0.976    1.002    1.010    0.981    0.978    0.985
1000  IFM      0.704    0.495    0.502    0.500    0.303    0.506    2.220    1.541    2.213
               (0.042)  (0.089)  (0.045)  (0.076)  (0.041)  (0.091)  (0.155)  (0.123)  (0.142)
      MLE      0.703    0.494    0.499    0.498    0.301    0.505    2.222    1.541    2.215
               (0.042)  (0.089)  (0.045)  (0.075)  (0.041)  (0.091)  (0.156)  (0.124)  (0.142)
      r        1.011    1.002    1.008    1.015    1.010    1.010    0.991    0.997    0.997

Table 4.4: Efficiency assessment with MCD model for binary data: d = 4, α12 = α13 = α14 = α23 = α24 = α34 = 1.3863, N = 1000

margin         1        2        3        4        (1,2)    (1,3)    (1,4)    (2,3)    (2,4)    (3,4)
n     method   z1       z2       z3       z4       α12      α13      α14      α23      α24      α34

z = (0,0,0,0)'
100   IFM      0.002    -0.008   0.002    -0.002   1.417    1.421    1.419    1.420    1.406    1.397
               (0.121)  (0.122)  (0.128)  (0.125)  (0.369)  (0.374)  (0.374)  (0.370)  (0.374)  (0.366)
      MLE      0.001    -0.009   0.000    -0.003   1.417    1.419    1.418    1.419    1.405    1.395
               (0.121)  (0.123)  (0.128)  (0.125)  (0.369)  (0.374)  (0.374)  (0.370)  (0.373)  (0.365)
      r        0.997    0.996    0.999    0.999    1.001    0.999    1.000    1.000    1.002    1.003
1000  IFM      -0.001   -0.001   -0.001   -0.003   1.388    1.386    1.392    1.385    1.391    1.387
               (0.040)  (0.037)  (0.039)  (0.039)  (0.108)  (0.112)  (0.112)  (0.115)  (0.114)  (0.118)
      MLE      -0.002   -0.003   -0.002   -0.004   1.388    1.386    1.392    1.385    1.390    1.387
               (0.040)  (0.037)  (0.039)  (0.039)  (0.108)  (0.112)  (0.112)  (0.115)  (0.114)  (0.117)
      r        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000

z = (0.7,0.7,0.7,0.7)'
100   IFM      0.709    0.703    0.710    0.708    1.447    1.403    1.444    1.405    1.401    1.404
               (0.139)  (0.139)  (0.140)  (0.139)  (0.444)  (0.443)  (0.466)  (0.430)  (0.431)  (0.444)
      MLE      0.707    0.702    0.709    0.706    1.447    1.402    1.442    1.403    1.401    1.403
               (0.138)  (0.139)  (0.140)  (0.139)  (0.442)  (0.441)  (0.465)  (0.430)  (0.429)  (0.445)
      r        1.002    1.000    0.998    1.000    1.005    1.004    1.001    1.000    1.005    0.997
1000  IFM      0.700    0.703    0.699    0.700    1.388    1.384    1.390    1.380    1.383    1.389
               (0.043)  (0.044)  (0.043)  (0.045)  (0.134)  (0.130)  (0.131)  (0.131)  (0.136)  (0.132)
      MLE      0.699    0.701    0.698    0.699    1.389    1.385    1.390    1.382    1.384    1.390
               (0.043)  (0.044)  (0.043)  (0.044)  (0.133)  (0.130)  (0.131)  (0.130)  (0.136)  (0.132)
      r        1.000    1.001    1.000    1.004    1.001    0.999    1.001    1.004    1.000    1.000

z = (0.7,0,0.7,0)'
100   IFM      0.709    -0.007   0.710    -0.002   1.464    1.403    1.480    1.463    1.406    1.454
               (0.139)  (0.122)  (0.140)  (0.124)  (0.567)  (0.443)  (0.596)  (0.533)  (0.374)  (0.574)
      MLE      0.708    -0.009   0.709    -0.003   1.458    1.400    1.472    1.459    1.405    1.447
               (0.138)  (0.122)  (0.140)  (0.124)  (0.512)  (0.445)  (0.541)  (0.493)  (0.373)  (0.518)
      r        1.002    0.999    0.998    1.000    1.107    0.996    1.102    1.081    1.003    1.108
1000  IFM      0.700    -0.001   0.699    -0.002   1.392    1.384    1.398    1.386    1.391    1.392
               (0.043)  (0.037)  (0.043)  (0.039)  (0.128)  (0.130)  (0.132)  (0.133)  (0.114)  (0.131)
      MLE      0.699    -0.002   0.698    -0.004   1.394    1.385    1.399    1.387    1.390    1.393
               (0.043)  (0.038)  (0.043)  (0.039)  (0.128)  (0.129)  (0.132)  (0.133)  (0.114)  (0.131)
      r        1.002    0.999    1.001    0.996    1.000    1.008    1.001    0.999    0.999    1.001

Table 4.5: Efficiency assessment with MCD model for binary data: d = 4, α12 = α23 = α34 = 2.1972, α13 = α24 = 1.5163, α14 = 1.1309, N = 1000

margin         1        2        3        4        (1,2)    (1,3)    (1,4)    (2,3)    (2,4)    (3,4)
n     method   z1       z2       z3       z4       α12      α13      α14      α23      α24      α34

z = (0,0,0,0)'
1000  IFM      0.002    0.001    0.001    0.001    2.210    1.518    1.137    2.198    1.516    2.204
               (0.039)  (0.038)  (0.039)  (0.041)  (0.135)  (0.116)  (0.106)  (0.131)  (0.115)  (0.128)
      MLE      -0.000   -0.002   -0.001   -0.001   2.209    1.517    1.136    2.197    1.515    2.204
               (0.039)  (0.039)  (0.039)  (0.041)  (0.135)  (0.115)  (0.106)  (0.131)  (0.115)  (0.127)
      r        1.000    0.996    1.004    0.997    1.000    1.002    1.001    1.000    1.000    1.000

z = (0.7,0.7,0.7,0.7)'
1000  IFM      0.700    0.701    0.700    0.701    2.206    1.514    1.136    2.205    1.521    2.206
               (0.042)  (0.043)  (0.043)  (0.044)  (0.154)  (0.132)  (0.125)  (0.153)  (0.134)  (0.159)
      MLE      0.699    0.699    0.699    0.700    2.207    1.514    1.135    2.206    1.521    2.208
               (0.042)  (0.044)  (0.043)  (0.044)  (0.154)  (0.130)  (0.124)  (0.153)  (0.132)  (0.159)
      r        1.000    0.999    1.003    0.999    1.000    1.010    1.008    1.000    1.011    1.001

z = (0.7,0,0.7,0)'
1000  IFM      0.700    0.001    0.700    0.001    2.212    1.514    1.139    2.209    1.516    2.214
               (0.042)  (0.038)  (0.043)  (0.041)  (0.159)  (0.132)  (0.122)  (0.162)  (0.115)  (0.162)
      MLE      0.699    -0.001   0.699    -0.001   2.215    1.513    1.140    2.214    1.515    2.218
               (0.042)  (0.039)  (0.043)  (0.041)  (0.159)  (0.130)  (0.121)  (0.163)  (0.115)  (0.162)
      r        0.996    0.994    0.998    0.997    0.999    1.014    1.009    0.996    1.008    1.001

Multivariate copula discrete models for ordinal data

In this subsection, we study the MCD models for ordinal data. The parameters are assumed to be margin-dependent. In our simulation, we use the MVN copula. We simulate $d$-dimensional ordinal observations $\mathbf{y}_i$ ($i = 1, \ldots, n$) from a multivariate probit model for ordinal data, such that
\[
\begin{cases}
Y_{ij} = 1 & \text{iff } z_{ij}(0) < Z_{ij} \le z_{ij}(1),\\
Y_{ij} = 2 & \text{iff } z_{ij}(1) < Z_{ij} \le z_{ij}(2),\\
\quad\vdots\\
Y_{ij} = m_j & \text{iff } z_{ij}(m_j - 1) < Z_{ij} \le z_{ij}(m_j),
\end{cases}
\]
where $-\infty = z_{ij}(0) < z_{ij}(1) < \cdots < z_{ij}(m_j - 1) < z_{ij}(m_j) = \infty$ are constants, $j = 1, 2, \ldots, d$, and $\mathbf{Z}_i = (Z_{i1}, \ldots, Z_{id})' \sim \mathrm{MVN}_d(\mathbf{0}, \Theta_i)$ with $z_{ij}(y_{ij}) = \gamma_j(y_{ij}) + \boldsymbol\beta_j'\mathbf{x}_{ij}$. $\Theta_i = (\theta_{ijk})$ is assumed to be free of covariates, that is, $\Theta_i = \Theta$ or $\theta_{ijk} = \theta_{jk}$, $\forall i$. We transform the dependence parameters $\theta_{jk}$ with $\theta_{jk} = (\exp(\alpha_{jk}) - 1)/(\exp(\alpha_{jk}) + 1)$, and estimate $\alpha_{jk}$ instead of $\theta_{jk}$. In our simulation study, we only examine the situation where no covariates are involved in the marginal parameters, and further assume that $m_j = 3$. In these situations, for each margin, we need to estimate two parameters: $z_j(1)$ and $z_j(2)$. We use the following simulation scheme:

1. The sample size is $n$, the number of simulations is $N$; both are reported in the tables.

2. For $d = 3$, we study two situations of marginal parameters:
(a) $\mathbf{z}(1) = (-0.5, -0.5, -0.5)'$, $\mathbf{z}(2) = (0.5, 0.5, 0.5)'$
(b) $\mathbf{z}(1) = (-0.5, 0, -0.5)'$, $\mathbf{z}(2) = (0.5, 1, 0.5)'$
and for each situation, two dependence structures are used in the simulation study: $\theta_{12} = \theta_{13} = \theta_{23} = 0.6$ (or $\alpha_{12} = \alpha_{13} = \alpha_{23} = 1.3863$) and $\theta_{12} = \theta_{23} = 0.8$ (or $\alpha_{12} = \alpha_{23} = 2.1972$), $\theta_{13} = 0.64$ (or $\alpha_{13} = 1.5163$).

3. For $d = 4$, we similarly study two situations of marginal parameters:
(a) $\mathbf{z}(1) = (-0.5, -0.5, -0.5, -0.5)'$, $\mathbf{z}(2) = (0.5, 0.5, 0.5, 0.5)'$
(b) $\mathbf{z}(1) = (-0.5, 0, -0.5, 0)'$, $\mathbf{z}(2) = (0.5, 1, 0.5, 1)'$
The efficiency of IFM approach and the efficiency of jackknife variance estimate 152 Table 4.6: Efficiency assessment with MCD model for ordinal data: d = 3, z(l) = (—0.5, —0.5, —0.5)', z(2) = (0.5,0.5,0.5)', N = 1000 n margin parameters 1 2 3 (1,2) (1,3) (2,3) Zi(2) Z2(l) Z2(2) 23(D 2fc(2) Oi2 «13 "23 «12 = <*13 = «23 = 1.3863 100 IFM MLE r -0.500 0.508 -0.508 0.500 -0.507 0.508 1.413 1.407 1.414 (0.135) (0.135) (0.130) (0.133) (0.134) (0.137) (0.275) (0.284) (0.287) -0.503 0.507 -0.511 0.498 -0.510 0.507 1.413 1.408 1.415 (0.134) (0.135) (0.130) (0.133) (0.134) (0.136) (0.275) (0.284) (0.287) 1.004 1.003 1.000 1.003 1.006 1.003 1.000 0.999 0.998 1000 IFM MLE r -0.501 0.498 -0.501 0.499 -0.502 0.500 1.390 1.386 1.387 (0.043) (0.041) (0.041) (0.041) (0.042) (0.042) (0.086) (0.089) (0.088) -0.504 0.497 -0.504 0.498 -0.504 0.498 1.390 1.386 1.387 (0.043) (0.041) (0.041) (0.041) (0.042) (0.042) (0.085) (0.089) (0.088) 0.998 1.002 0.998 1.002 0.997 1.006 1.005 1.004 1.005 a„ = ttM = 2.1972, a13 = 1.5163 100 IFM MLE r -0.500 0.508 -0.506 0.502 -0.509 0.508 2.251 1.542 2.242 (0.135) (0.135) (0.132) (0.136) (0.136) (0.134) (0.323) (0.282) (0.324) -0.504 0.506 -0.512 0.500 -0.513 0.506 2.252 1.539 2.243 (0.136) (0.136) (0.132) (0.136) (0.138) (0.135) (0.321) (0.285) (0.322) 0.990 0.999 0.997 1.005 0.991 0.999 1.005 0.992 1.005 1000 IFM MLE r -0.501 0.498 -0.500 0.498 -0.502 0.500 2.202 1.516 2.199 (0.043) (0.041) (0.041) (0.040) (0.042) (0.041) (0.093) (0.088) (0.097) -0.505 0.496 -0.504 0.496 -0.506 0.498 2.203 1.516 2.200 (0.043) (0.041) (0.041) (0.040) (0.042) (0.041) (0.093) (0.088) (0.097) 0.997 0.999 1.005 1.008 0.998 1.001 0.999 1.004 1.000 and for each situation, two dependence structures are used in the simulation study: #i2 = #13 — flu — 023 = 024 = 034 = 0.6 (or ai2 - c*i3 — «i4 = a23 — a24 — a34 -• 1.3863) and 012 = 023 = 034 = 0.8 (or «12 = a23 = a34 = 2.1972), 013 = 024 = 0.64 (or aj3 = a24 = 1.5163) and 0i4 =0.512 (or ai4 = 1.1309). 
The numerical results from the MCD models for ordinal data are presented in Tables 4.6 to 4.11. Again, two conclusions are clear from these tables: (i) The IFM approach is efficient relative to the ML approach for small to large sample sizes; the ratio values r are very close to 1 in almost all the situations studied. (ii) The MLE may be slightly more efficient than the IFME, but this observation is not conclusive. We would say that the IFME and the MLE are comparable.

Multivariate copula discrete models for count data

In this subsection, we study the MCD models for count data. The parameters are assumed to be margin-dependent. In our simulation, we use the MVN copula. We simulate d-dimensional Poisson observations $\mathbf{y}_i$ ($i = 1, \ldots, n$) from a multivariate normal-copula Poisson model

$$P(y_1 \cdots y_d) = \sum_{i_1=1}^{2} \cdots \sum_{i_d=1}^{2} (-1)^{i_1+\cdots+i_d}\, C(a_{1i_1}, \ldots, a_{di_d}; \Theta),$$

Table 4.7: Efficiency assessment with MCD model for ordinal data: d = 3, z(1) = (-0.5, 0, -0.5)', z(2) = (0.5, 1, 0.5)', N = 1000 n margin parameters 1 2 3 (1,2) (1,3) (2,3) Zi(l) Zi(2) z2(l) z2(2) z3(l) *3(2) "12 "13 "23 "12 = "13 = "23 = 1.3863 100 IFM MLE r -0.500 0.508 -0.002 1.01 -0.507 0.508 1.429 1.407 1.416 (0.135) (0.135) (0.121) (0.16) (0.134) (0.137) (0.294) (0.284) (0.298) -0.503 0.507 -0.004 1.00 -0.510 0.507 1.430 1.408 1.417 (0.134) (0.135) (0.120) (0.16) (0.134) (0.136) (0.293) (0.284) (0.297) 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1000 IFM MLE r -0.501 0.498 -0.002 0.998 -0.502 0.500 1.393 1.386 1.389 (0.043) (0.041) (0.038) (0.049) (0.042) (0.042) (0.089) (0.089) (0.090) -0.503 0.497 -0.003 0.996 -0.504 0.498 1.393 1.385 1.389 (0.043) (0.041) (0.038) (0.049) (0.042) (0.042) (0.089) (0.089) (0.090) 0.998 1.000 0.996 1.000 0.997 1.005 1.006 1.004 1.004 ai2 = aM = 2.1972, ais = 1.5163 100 IFM MLE r -0.500 0.508 -0.001 1.012 -0.509 0.508 2.261 1.542 2.247 (0.135) (0.135) (0.123) (0.164) (0.136) (0.134)
(0.343) (0.282) (0.344) -0.504 0.505 -0.004 1.011 -0.513 0.505 2.268 1.538 2.254 (0.136) (0.136) (0.121) (0.166) (0.139) (0.135) (0.345) (0.290) (0.347) 0.994 0.995 1.016 0.989 0.984 0.997 0.993 0.975 0.991 1000 IFM MLE r -0.501 0.498 -0.000 0.999 -0.502 0.500 2.204 1.516 2.199 (0.043) (0.041) (0.038) (0.049) (0.042) (0.041) (0.099) (0.088) (0.101) -0.504 0.496 -0.003 0.996 -0.505 0.498 2.204 1.516 2.199 (0.043) (0.041) (0.038) (0.049) (0.042) (0.041) (0.099) (0.087) (0.100) 0.998 0.999 1.018 1.002 0.998 1.002 1.002 1.006 1.011 Table 4.8: Efficiency assessment with MCD model for ordinal data: d = 4, z(l) = (—0.5,-0. -0.5, -0.5)', z(2) = (0.5,0.5,0.5,0.5)', a12 = aX3 = aM = a23 = a24 = a34 = 1.3863, TV = 100 n margin parametei •s Zi(l) 12 3 4 zi(2) z2(l) z2(2) z3(l) z3(2) z4(l) z4(2) 100 IFM MLE r -0.4928 0.5025 -0.5037 0.4820 -0.4997 0.4986 -0.5088 0.4890 (0.1238) (0.1345) (0.1139) (0.1377) (0.1145) (0.1293) (0.1365) (0.1349) -0.4930 0.5023 -0.5043 0.4819 -0.5007 0.4983 -0.5101 0.4877 (0.1225) (0.1325) (0.1145) (0.1368) (0.1149) (0.1292) (0.1373) (0.1361) 1.011 1.015 0.995 1.006 0.996 1.001 0.994 0.991 1000 IFM MLE r -0.4954 0.5044 -0.5068 0.4981 -0.4987 0.5015 -0.4966 0.5016 (0.0433) (0.0415) (0.0457) (0.0436) (0.0413) (0.0413) (0.0459) (0.0414) -0.4973 0.5026 -0.5094 0.4963 -0.5008 0.4998 -0.4988 0.4998 (0.0431) (0.0416) (0.0463) (0.0439) (0.0413) (0.0413) (0.0460) (0.0419) 1.003 0.999 0.987 0.993 1.000 1.001 0.997 0.988 n p margin arameters (1,2) (1,3) (1,4) (2,3) (2,4) (3,4) tv-2 ai3 a14 a23 0:94 a34 100 IFM MLE r 1.4489 1.4332 1.4174 1.4539 1.4050 1.4149 (0.2741) (0.2858) (0.2906) (0.2903) (0.3005) (0.3088) 1.4498 1.4351 1.4203 1.4562 1.4070 1.4175 (0.2732) (0.2842) (0.2903) (0.2928) (0.3019) (0.3042) 1.003 1.006 1.001 0.991 0.995 1.015 1000 IFM MLE r 1.3939 1.3937 1.3942 1.3828 1.3739 1.3785 (0.0786) (0.0815) (0.0869) (0.0833) (0.0775) (0.0822) 1.3949 1.3950 1.3958 1.3827 1.3745 1.3800 (0.0795) (0.0817) (0.0886) (0.0834) (0.0790) (0.0823) 0.988 0.997 0.980 
0.999 0.981 0.998 Chapter 4. The efficiency of IFM approach and the efficiency of jackknife variance estimate 154 Table 4.9: Efficiency assessment with MCD model for ordinal data: d = 4, z(l) = (—0.5,-0.5, -0.5,-0.5)', z(2) = (0.5,0.5,0.5,0.5)', an = a23 = "34 = 2.1972, a13 = "24 = 1.5163, a14 = 1.1309, N = 100 n margin parametei ts Zi(l) 12 3 4 zi(2) z2(l) z2(2) z3(l) z3(2) z4(l) z4(2) 100 IFM MLE r -0.4928 0.5025 -0.5032 0.4924 -0.5038 0.4917 -0.5142 0.4909 (0.1238) (0.1345) (0.1156) (0.1357) (0.1155) (0.1342) (0.1349) (0.1424) -0.4965 0.5050 -0.5049 0.4945 -0.5122 0.4936 -0.5203 0.4936 (0.1265) (0.1339) (0.1162) (0.1322) (0.1238) (0.1328) (0.1388) (0.1457) 0.979 1.005 0.994 1.026 0.933 1.011 0.972 0.978 1000 IFM MLE r -0.4954 0.5044 -0.5047 0.5018 -0.5011 0.5019 -0.5035 0.5012 (0.0433) (0.0415) (0.0454) (0.0435) (0.0417) (0.0401) (0.0433) (0.0391) -0.4986 0.5017 -0.5079 0.4988 -0.5047 0.4989 -0.5069 0.4984 (0.0433) (0.0416) (0.0460) (0.0434) (0.0412) (0.0408) (0.0436) (0.0394) 0.999 0.999 0.988 1.004 1.014 0.984 0.992 0.992 n p margin arameters (1,2) (1,3) (1,4) (2,3) (2,4) (3,4) aU a13 "14 "23 "24 "34 100 IFM MLE r 2.2807 1.5610 1.1782 2.2764 1.5603 2.2516 (0.3091) (0.2732) (0.2897) (0.3453) (0.3270) (0.3331) 2.2754 1.5503 1.1795 2.2750 1.5491 2.2477 (0.2933) (0.2621) (0.2802) (0.3354) (0.3263) (0.3291) 1.050 1.040 1.030 1.030 1.000 1.010 1000 IFM MLE r 2.2190 1.5263 1.1396 2.1868 1.5055 2.1851 (0.0915) (0.0803) (0.0790) (0.0884) (0.0874) (0.0957) 2.2217 1.5267 1.1394 2.1887 1.5055 2.1865 (0.0916) (0.0791) (0.0789) (0.0871) (0.0865) (0.0951) 0.999 1.015 1.002 1.014 1.011 1.007 Table 4.10: Efficiency assessment with MCD model for ordinal data: d = 4, z(l) — (—0.5, 0, —0.5, 0)', Z(2) = (0.5, 1, 0.5, 1)', "12 = "13 = "14 = "23 = "24 = "34 = 1.3 8 63, N = 100 n margin parameters *i(l) 12 3 4 Zl(2) z2(l) z2(2) z3(l) z3(2) z4(l) z4(2) 100 IFM MLE r -0.4928 0.5025 -0.01817 0.9924 -0.4997 0.4986 -0.0108 0.9994 (0.1238) (0.1345) (0.12289) (0.1507) (0.1145) (0.1293) 
(0.1159) (0.1528) -0.4939 0.5020 -0.01733 0.9886 -0.5013 0.4979 -0.0115 0.9964 (0.1217) (0.1318) (0.12165) (0.1505) (0.1141) (0.1296) (0.1146) (0.1532) 1.018 1.020 1.010 1.002 1.003 0.998 1.011 0.997 1000 IFM MLE r -0.4954 0.5044 -0.0076 1.0021 -0.4987 0.5015 -0.0017 1.0044 (0.0433) (0.0415) (0.0437) (0.0450) (0.0413) (0.0413) (0.0405) (0.0474) -0.4975 0.5026 -0.0095 0.9996 -0.5009 0.5000 -0.0037 1.0018 (0.0431) (0.0413) (0.0436) (0.0448) (0.0414) (0.0413) (0.0404) (0.0473) 1.003 1.005 1.001 1.005 0.998 1.000 1.003 1.002 n margin parameters (1,2) (1,3) (1,4) (2,3) (2,4) (3,4) "12 "13 "14 "23 "24 "34 100 IFM MLE r 1.4398 1.4332 1.4428 1.4452 1.4228 1.4345 (0.2874) (0.2858) (0.2801) (0.2866) (0.2966) (0.3429) 1.4468 1.4363 1.4467 1.4509 1.4313 1.4341 (0.2888) (0.2838) (0.2781) (0.2882) (0.2971) (0.3409) 0.995 1.007 1.007 0.995 0.998 1.006 1000 IFM MLE r 1.4060 1.3937 1.3907 1.3866 1.3798 1.3756 (0.0822) (0.0815) (0.0891) (0.0806) (0.0940) (0.0919) 1.4067 1.3947 1.3924 1.3877 1.3813 1.3769 (0.0820) (0.0811) (0.0911) (0.0808) (0.0941) (0.0908) 1.003 1.005 0.978 0.997 0.999 1.012 Chapter 4. 
Table 4.11: Efficiency assessment with MCD model for ordinal data: d = 4, z(1) = (-0.5, 0, -0.5, 0)', z(2) = (0.5, 1, 0.5, 1)', α12 = α23 = α34 = 2.1972, α13 = α24 = 1.5163, α14 = 1.1309, N = 100

Margin parameters
n      method   z1(1)      z1(2)      z2(1)      z2(2)      z3(1)      z3(2)      z4(1)      z4(2)
100    IFM      -0.4928    0.5025     -0.0217    0.9877     -0.5038    0.4917     -0.01462   0.9791
                (0.1238)   (0.1345)   (0.1241)   (0.1516)   (0.1155)   (0.1342)   (0.10744)  (0.1439)
       MLE      -0.4944    0.5010     -0.0189    0.9796     -0.5090    0.4892     -0.01577   0.9780
                (0.1274)   (0.1342)   (0.1244)   (0.1548)   (0.1176)   (0.1319)   (0.11106)  (0.1449)
       r        0.972      1.002      0.998      0.979      0.981      1.018      0.967      0.994
1000   IFM      -0.4954    0.5044     -0.0006    0.9995     -0.5011    0.5019     -0.0013    1.0018
                (0.0433)   (0.0415)   (0.0406)   (0.0476)   (0.0417)   (0.0401)   (0.0382)   (0.0489)
       MLE      -0.4985    0.5010     -0.0034    0.9956     -0.5046    0.4988     -0.0044    0.9988
                (0.0433)   (0.0410)   (0.0396)   (0.0476)   (0.0416)   (0.0405)   (0.0392)   (0.0482)
       r        0.999      1.013      1.025      0.999      1.003      0.991      0.974      1.014

Dependence parameters (margins (1,2), (1,3), (1,4), (2,3), (2,4), (3,4))
n      method   α12        α13        α14        α23        α24        α34
100    IFM      2.2370     1.5610     1.1717     2.2543     1.5524     2.2616
                (0.2871)   (0.2732)   (0.2737)   (0.3444)   (0.3290)   (0.3438)
       MLE      2.2484     1.5582     1.1819     2.2640     1.5540     2.2616
                (0.2867)   (0.2743)   (0.2832)   (0.3330)   (0.3279)   (0.3406)
       r        1.001      0.996      0.966      1.034      1.003      1.010
1000   IFM      2.2143     1.5263     1.1385     2.1965     1.5095     2.1807
                (0.0957)   (0.0803)   (0.0811)   (0.0980)   (0.0842)   (0.0951)
       MLE      2.2180     1.5268     1.1377     2.1988     1.5102     2.1812
                (0.0941)   (0.0799)   (0.0814)   (0.0953)   (0.0841)   (0.0975)
       r        1.018      1.005      0.996      1.028      1.001      0.975

where $a_{j1} = G_j(y_j - 1)$, $a_{j2} = G_j(y_j)$, and $C$ is the d-dimensional normal copula. $G_j(\cdot)$ is defined as

$$G_j(y_j) = \begin{cases} 0, & y_j < 0, \\ \sum_{s=0}^{y_j} p_j^{(s)}, & y_j \ge 0, \end{cases}$$

where $p_j^{(s)} = \lambda_j^{s} \exp(-\lambda_j)/s!$, $s = 0, 1, 2, \ldots$. In general, we assume $\lambda_{ij} = \exp(\beta_j' \mathbf{x}_{ij})$, and $\Theta = (\theta_{jk})$ to be free of covariates. We further transform the dependence parameters $\theta_{jk}$ with $\theta_{jk} = (\exp(\alpha_{jk}) - 1)/(\exp(\alpha_{jk}) + 1)$, and estimate $\alpha_{jk}$ instead of $\theta_{jk}$. We use the following simulation scheme:

1.
The sample size is n, the number of simulations is N; both are reported in the tables.

2. For d = 3, we study two situations: $\log(\lambda_{ij}) = \beta_j$ and $\log(\lambda_{ij}) = \beta_{j0} + \beta_{j1} x_{ij}$. For each situation, we choose two dependence structures: $\theta_{12} = \theta_{13} = \theta_{23} = 0.6$ (or $\alpha_{12} = \alpha_{13} = \alpha_{23} = 1.3863$), and $\theta_{12} = \theta_{23} = 0.8$ (or $\alpha_{12} = \alpha_{23} = 2.1972$), $\theta_{13} = 0.64$ (or $\alpha_{13} = 1.5163$). Other parameters are
(a) $\beta_0 = (\beta_{10}, \beta_{20}, \beta_{30})' = (1, 1, 1)'$ and $\beta_1 = (\beta_{11}, \beta_{21}, \beta_{31})' = (0.5, 0.5, 0.5)'$. The situation where $x_{ij}$ is discrete is considered, with $x_{ij} = I(U < 0)$ where $U \sim$ uniform(-1, 1).
(b) $(\lambda_{i1}, \lambda_{i2}, \lambda_{i3}) = (5, 3, 5)$ (or $(\beta_1, \beta_2, \beta_3) = (1.6094, 1.0986, 1.6094)$).

3. For d = 4, we only study $\log(\lambda_{ij}) = \beta_j$. Two dependence structures are considered: $\theta_{12} = \theta_{13} = \theta_{14} = \theta_{23} = \theta_{24} = \theta_{34} = 0.6$ (or $\alpha_{12} = \alpha_{13} = \alpha_{14} = \alpha_{23} = \alpha_{24} = \alpha_{34} = 1.3863$), and $\theta_{12} = \theta_{23} = \theta_{34} = 0.8$ (or $\alpha_{12} = \alpha_{23} = \alpha_{34} = 2.1972$), $\theta_{13} = \theta_{24} = 0.64$ (or $\alpha_{13} = \alpha_{24} = 1.5163$) and $\theta_{14} = 0.512$ (or $\alpha_{14} = 1.1309$). Other parameters are
(a) $(\lambda_{i1}, \lambda_{i2}, \lambda_{i3}, \lambda_{i4}) = (5, 5, 5, 5)$ (or equivalently $(\beta_1, \beta_2, \beta_3, \beta_4) = (1.6094, 1.6094, 1.6094, 1.6094)$).
(b) $(\lambda_{i1}, \lambda_{i2}, \lambda_{i3}, \lambda_{i4}) = (4, 2, 5, 8)$ (or equivalently $(\beta_1, \beta_2, \beta_3, \beta_4) = (1.3863, 0.6931, 1.6094, 2.0794)$).

Table 4.12: Efficiency assessment with MCD model for count data: d = 3, β0 = (1, 1, 1)' and β1 = (0.5, 0.5, 0.5)', x_ij discrete, N = 1000

n      method   β10        β11        β20        β21        β30        β31        α12        α13        α23
α12 = α13 = α23 = 1.3863
100    IFM      1.0054     0.490      1.0017     0.493      1.0029     0.493      1.423      1.412      1.420
                (0.0867)   (0.109)    (0.0883)   (0.110)    (0.0870)   (0.106)    (0.198)    (0.191)    (0.188)
       MLE      1.0018     0.488      0.9985     0.491      0.9998     0.491      1.422      1.409      1.417
                (0.0876)   (0.111)    (0.0892)   (0.112)    (0.0872)   (0.107)    (0.194)    (0.183)    (0.187)
       r        0.990      0.985      0.990      0.979      0.997      0.993      1.019      1.041      1.009
1000   IFM      1.0007     0.4991     1.0013     0.4975     1.0014     0.4984     1.3953     1.3864     1.3896
                (0.0287)   (0.0358)   (0.0271)   (0.0343)   (0.0278)   (0.0352)   (0.0625)   (0.0564)   (0.0595)
       MLE      0.9976     0.4993     0.9984     0.4977     0.9983     0.4987     1.3918     1.3849     1.3875
                (0.0292)   (0.0362)   (0.0274)   (0.0343)   (0.0283)   (0.0354)   (0.0578)   (0.0563)   (0.0574)
       r        0.983      0.988      0.989      0.998      0.983      0.995      1.081      1.003      1.035
α12 = α23 = 2.1972, α13 = 1.5163
100    IFM      1.0054     0.490      1.0044     0.490      1.0046     0.491      2.242      1.551      2.236
                (0.0867)   (0.109)    (0.0884)   (0.110)    (0.0870)   (0.107)    (0.190)    (0.196)    (0.186)
       MLE      1.0006     0.489      0.9993     0.488      0.9998     0.489      2.239      1.545      2.233
                (0.0878)   (0.111)    (0.0895)   (0.112)    (0.0881)   (0.109)    (0.187)    (0.180)    (0.185)
       r        0.987      0.986      0.987      0.981      0.987      0.980      1.015      1.087      1.007
1000   IFM      1.0007     0.4991     1.0013     0.4979     1.0014     0.4982     2.2055     1.5185     2.1975
                (0.0287)   (0.0358)   (0.0272)   (0.0350)   (0.0277)   (0.0351)   (0.0555)   (0.0620)   (0.0572)
       MLE      0.9962     0.4992     0.9963     0.4981     0.9971     0.4984     2.2037     1.5156     2.1952
                (0.0291)   (0.0364)   (0.0279)   (0.0355)   (0.0279)   (0.0354)   (0.0553)   (0.0553)   (0.0569)
       r        0.985      0.983      0.977      0.986      0.993      0.990      1.003      1.121      1.006

The numerical results from the MCD models for count data are presented in Tables 4.12 to 4.15. We obtain conclusions similar to those for the MCD models for binary data and ordinal data.
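The data-generation step of the scheme above can be sketched in code. The following is a minimal illustration, not the thesis's own implementation: it assumes that observations are drawn by the standard latent-normal construction (draw a correlated normal vector, map each coordinate to a uniform with the standard normal cdf, then invert the Poisson cdf), which yields the normal-copula Poisson probabilities given earlier. All function names are hypothetical, and only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist


def theta_from_alpha(alpha):
    """Transform used in the text: theta = (exp(alpha) - 1) / (exp(alpha) + 1)."""
    return (math.exp(alpha) - 1.0) / (math.exp(alpha) + 1.0)


def cholesky(R):
    """Lower-triangular Cholesky factor of a small positive definite matrix."""
    d = len(R)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L


def poisson_quantile(u, lam):
    """Smallest y with G(y) >= u, where G is the Poisson(lam) cdf."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1.0 from rounding
    y = 0
    p = math.exp(-lam)
    cdf = p
    while cdf < u:
        y += 1
        p *= lam / y
        cdf += p
    return y


def simulate_copula_poisson(n, lams, R, rng):
    """Draw n vectors from a normal-copula Poisson model (latent-normal construction)."""
    L = cholesky(R)
    d = len(lams)
    std_normal = NormalDist()
    data = []
    for _ in range(n):
        e = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [sum(L[j][k] * e[k] for k in range(j + 1)) for j in range(d)]
        u = [std_normal.cdf(zj) for zj in z]
        data.append([poisson_quantile(u[j], lams[j]) for j in range(d)])
    return data


# Setting 2(b) with the first dependence structure: alpha = 1.3863, lambda = (5, 3, 5).
rng = random.Random(1996)  # arbitrary seed, chosen for reproducibility only
theta = theta_from_alpha(1.3863)
R = [[1.0, theta, theta], [theta, 1.0, theta], [theta, theta, 1.0]]
sample = simulate_copula_poisson(2000, [5.0, 3.0, 5.0], R, rng)
means = [sum(row[j] for row in sample) / len(sample) for j in range(3)]
```

The transform maps the unconstrained value 1.3863 back to a correlation of about 0.6, and the marginal sample means should lie near the Poisson means (5, 3, 5), since the copula leaves the margins unchanged.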
The efficiency of IFM approach and the efficiency of jackknife variance estimate 157 Table 4.13: Efficiency assessment with MCD model for count data: d = 3, (0i, 02,03) = (1.6094, 1.0986,1.6094), TV = 1000 n margin parameters 1 2 3 (1,2) (1,3) (2,3) 01 02 03 "12 "13 "23 "12 = "13 = "23 - 1.3863 100 IFM MLE r 1.6075 1.1000 1.6076 1.415 1.403 1.408 (0.0465) (0.0593) (0.0456) (0.194) (0.190) (0.195) 1.6024 1.0943 1.6032 1.413 1.398 1.403 (0.0519) (0.0648) (0.0490) (0.191) (0.189) (0.192) 0.896 0.915 0.929 1.018, 1.003 1.018 1000 IFM MLE r 1.6098 1.0988 1.6095 1.3885 1.3885 1.3880 (0.0141) (0.0185) (0.0140) (0.0597) (0.0575) (0.0586) 1.6077 1.0963 1.6076 1.3877 1.3855 1.3869 (0.0146) (0.0191) (0.0143) (0.0588) (0.0577) (0.0574) 0.966 0.967 0.975 1.015 0.996 1.021 a,, = a93 = '2.1972, a13 = 1.5163 100 IFM MLE r 1.6075 1.0991 1.6089 2.234 1.539 2.219 (0.0465) (0.0599) (0.0455) (0.187) (0.187) (0.188) 1.6017 1.0912 1.6032 2.231 1.533 2.217 (0.0509) (0.0667) (0.0490) (0.187) (0.181) (0.188) 0.913 0.897 0.929 0.999 1.033 0.995 1000 IFM MLE r 1.6098 1.0992 1.6097 2.2027 1.5176 2.2002 (0.0141) (0.0182) (0.0140) (0.0565) (0.0579) (0.0547) 1.6063 1.0944 1.6063 2.1992 1.5149 2.1968 (0.0155) (0.0197) (0.0152) (0.0563) (0.0567) (0.0551) 0.915 0.924 0.923 1.003 1.021 0.993 Table 4.14: Efficiency assessment with MCD model for count data: d = 4, (0i,02,03,0i) — (1.6094,1.0986,1.6094,1.6094), N = 1000 n margin parameters 1 2 3 4 (1,2) (1,3) (1,4) (2,3) (2,4) (3,4) 01 02 03 04 "12 "13 "14 "23 "24 "34 "12 = "13 = "14 = "23 — "24 = "34 = 1.38 63 100 IFM MLE r 1.6072 1.6099 1.6076 1.6085 1.417 1.410 1.413 1.402 1.407 1.398 (0.0434) (0.0443) (0.0459) (0.0451) (0.179) (0.188) (0.190) (0.185) (0.183) (0.186) 1.6016 1.6044 1.6023 1.6031 1.415 1.410 1.413 1.401 1.404 1.401 (0.0477) (0.0494) (0.0495) (0.0490) (0.179) (0.189) (0.193) (0.184) (0.183) (0.184) 0.910 0.895 0.927 0.920 1.001 0.991 0.986 1.009 1.003 1.008 aia = a,3 = a.* = 2.1972, a13 = a94 = 1.5163, a14 = 1.1309 100 IFM MLE r 1.6072 
1.6090 1.6084 1.6084 2.228 1.546 1.159 2.218 1.536 2.215 (0.0434) (0.0445) (0.0454) (0.0456) (0.176) (0.184) (0.195) (0.183) (0.179) (0.177) 1.5996 1.6003 1.5993 1.5999 2.228 1.538 1.153 2.215 1.527 2.217 (0.0499) (0.0513) (0.0544) (0.0525) (0.176) (0.187) (0.193) (0.184) (0.177) (0.177) 0.871 0.867 0.835 0.867 0.997 0.982 1.015 0.995 1.009 1.001 Chapter 4. The efficiency of IFM approach and the efficiency of jackknife variance estimate 158 Table 4.15: Efficiency assessment with MCD model for count data: d = 4, (/?i,/?2,/?3, A) = (1.3863,0.6931,1.6094,2.0794), N = 1000 T1T2T "12 w "13 IP7 "23 "24 W "34 margin parameters 1 2 3 J3_ 4 = a24 = "34 = 1 TOO" ~JFW MLE 1.3835 0^918 1,1.607^ 2.07*8*7 l2ll8 !' (0.0489) (0.0705) (0. 1.414 1.405 (0.185) (0.194) 1.411 1.405 (0.185) (0.193) 0.998 1.002 1.406 (0.189) 1.407 (0.187) 1.011 1.398 (0.186) 1.402 (0.183) 1.017 1.3772 0.6840 1 (0.0549) (0.0756) (0. 0.891 0.932 0 0459) (0.0356) 6024 2.0744 0496) (0.0384) .927 0.927 1.418 1.411 (0.185) (0.189) 1.415 1.411 (0.185) (0.188) 1.003 1.005 = "23 =_Q!34 ==2.1972^, 0:13 = Q2j = 1.5163, aj4 =1.131)9 TOO" ~TFM~ MLE l.fe F.69061. (0.0489) (0.0693) (0. 1.3758 0.6792 1. (0.0550) (0.0764) (0. 0.890 0.907 0 A »i a = 6084 2.0790 0454) (0.0361) 6006 2.0735 0524) (0.0412) .866 0.876 27239 1.546 (0.191) (0.184) 2.236 1.537 (0.191) (0.184) 1.001 0.997 T53T" (0.191) 1.529 (0.186) 1.023 2.216 (0.173) 2.216 (0.174) 0.991 155 2.219 (0.194) (0.198) 1.152 2.221 (0.188) (0.202) 1.032 0.981 Multivariate mixture discrete models for count data We now consider a MMD model for count data with the Morgenstern copula J J j=l j=l l + £^*(l-2G(Ai))(l-2G(Alt)) 3<k d\\ •••d\d, (4.11) where f(yj',Xj) = e—Aj'AjJ/J/J! is the Poisson frequency function with parameter Aj (Aj > 0), g(X) a Gamma density function, having the form fl'(Aj) = {l/[/?",r(aj)]}AjJ-1e-A3'/'3j', Aj > 0, with /?j being a scale parameter, and G(Aj) is a Gamma cdf. We have -E'(Aj) = «j/?j and Var(Aj) = ctjPj. 
(4.11) is a multivariate Poisson-Morgenstern-gamma model. The multiple integral in (4.11) over the joint space of $\lambda_1, \ldots, \lambda_d$ can be decomposed into a sum of products of integrals of a single variable. The calculation of $P(y_1 \cdots y_d)$ can thus be accomplished by calculating $2d$ univariate integrals. In fact, we have

$$P(y_1 \cdots y_d) = \prod_{j=1}^{d} P(y_j) + \sum_{j<k} \theta_{jk} \Big\{ [P(y_j) - 2P_w(y_j)]\,[P(y_k) - 2P_w(y_k)] \prod_{m=1,\, m \ne j,k}^{d} P(y_m) \Big\}$$

and

$$P_{jk}(y_j y_k) = P(y_j)P(y_k) + \theta_{jk}\,[P(y_j) - 2P_w(y_j)]\,[P(y_k) - 2P_w(y_k)],$$

where $P(y_j) = \int f(y_j; \lambda_j)\, g(\lambda_j)\, d\lambda_j$ and $P_w(y_j) = \int f(y_j; \lambda_j)\, g(\lambda_j)\, G(\lambda_j)\, d\lambda_j$. Now

$$P(y_j) = \int_0^\infty \frac{1}{\beta_j^{\alpha_j}\Gamma(\alpha_j)}\, \lambda_j^{\alpha_j - 1} e^{-\lambda_j/\beta_j}\, \frac{e^{-\lambda_j}\lambda_j^{y_j}}{y_j!}\, d\lambda_j = \frac{\beta_j^{y_j}\,\Gamma(\alpha_j + y_j)}{(\beta_j + 1)^{\alpha_j + y_j}\,\Gamma(\alpha_j)\, y_j!}. \qquad (4.12)$$

Further, as $f(y_j; \lambda_j) g(\lambda_j)$ is proportional to the density of a Gamma$(y_j + \alpha_j,\ \beta_j/(\beta_j + 1))$ random variable, if we let $\mu_j = \beta_j (y_j + \alpha_j)/(\beta_j + 1)$ and $\sigma_j^2 = (y_j + \alpha_j)\beta_j^2/(\beta_j + 1)^2$, then the lower and upper integration limits for $P_w(y_j)$ can be set to $L_j = \mu_j - 5\sigma_j$ and $U_j = \mu_j + 5\sigma_j$ for numerical evaluation of the integrals.

To carry out the efficiency assessment through simulation, we need to simulate from the multivariate Poisson-Morgenstern-gamma distribution. Let $C$ be the Morgenstern copula, and $G(x)$ be the cdf of a univariate Gamma distribution. The following simulation algorithm is used:

1. Generate $U_1, \ldots, U_d$ from $C(U_1, \ldots, U_d)$.
2. Let $\lambda_j = G^{-1}(U_j)$, $j = 1, \ldots, d$.
3. Generate $Y_j$ from Poisson$(\lambda_j)$, $j = 1, \ldots, d$.

In the above algorithm, the difficult part is the generation of $U_1, \ldots, U_d$ from $C(U_1, \ldots, U_d)$. The conditional distribution approach to generating multivariate random variates can be used here. This approach obtains $\mathbf{x} = (x_1, \ldots, x_d)'$ with $F(x_1) = V_1$, $F(x_2 \mid x_1) = V_2$, ..., $F(x_d \mid x_1, \ldots, x_{d-1}) = V_d$, where $V_1, \ldots, V_d$ are independent uniform(0,1). With the Morgenstern copula, for $m \le d$,

$$C(u_m \mid U_1 = u_1, \ldots, U_{m-1} = u_{m-1}) = \int_0^{u_m} \frac{f(u_1, \ldots, u_{m-1}, u)}{f(u_1, \ldots, u_{m-1})}\, du,$$

where $f(u_1, \ldots, u_m) = 1 + \sum_{j<k}^{m} \theta_{jk}(1 - 2u_j)(1 - 2u_k)$. Since

$$f(u_1, \ldots, u_m) = f(u_1, \ldots, u_{m-1}) + \sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)(1 - 2u_m),$$

it follows that

$$\frac{f(u_1, \ldots, u_m)}{f(u_1, \ldots, u_{m-1})} = 1 + \frac{\sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)}{f(u_1, \ldots, u_{m-1})} - \frac{2\big(\sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)\big)\, u_m}{f(u_1, \ldots, u_{m-1})}.$$

Hence

$$C(u_m \mid U_1 = u_1, \ldots, U_{m-1} = u_{m-1}) = \Big(1 + \frac{\sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)}{f(u_1, \ldots, u_{m-1})}\Big) u_m - \frac{\sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)}{f(u_1, \ldots, u_{m-1})}\, u_m^2.$$

Let $A = f(u_1, \ldots, u_{m-1})$, $B = \sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)$, and $D = B/A$. Setting the conditional cdf equal to $V_m$ gives $D u_m^2 - (D + 1) u_m + V_m = 0$, from which we get

$$u_m = \frac{(D + 1) \pm \sqrt{(D + 1)^2 - 4 D V_m}}{2D}.$$

Thus the algorithm for generating $U_1, \ldots, U_d$ from $C(u_1, \ldots, u_d)$ is as follows:

1. Generate $V_1, \ldots, V_d$ from Uniform(0,1).
2. Let $U_1 = V_1$.
3. Let $A = 1$ if $m = 2$ and $A = 1 + \sum_{j<k}^{m-1} \theta_{jk}(1 - 2u_j)(1 - 2u_k)$ if $m > 2$. Let $B = \sum_{j=1}^{m-1} \theta_{jm}(1 - 2u_j)$, and $D = B/A$.
4. For $m \ge 2$, if $D = 0$, $U_m = V_m$. If $D \ne 0$, $U_m$ takes the one of the two values $[(D + 1) \pm \sqrt{(D + 1)^2 - 4 D V_m}]/[2D]$ that is positive and less than 1.

The efficiency studies with the multivariate Poisson-Morgenstern-gamma model are carried out only for the dependence parameters $\theta_{jk}$; the univariate parameters are held fixed. We use the following simulation scheme:

1. The sample size is n = 3000, the number of simulations is N = 200.
2. The dimension d is chosen to be 3, 4 and 5.
3. The marginal parameters $\alpha_j$ and $\beta_j$ are fixed. They are $\alpha_j = \beta_j = 1$ for $j = 1, \ldots, d$.
4. For each dimension, two dependence structures are considered:
(a) For d = 3, we have $(\theta_{12}, \theta_{13}, \theta_{23}) = (0.5, 0.5, 0.5)$ and $(\theta_{12}, \theta_{13}, \theta_{23}) = (0.6, 0.7, 0.8)$.
(b) For d = 4, we have $(\theta_{12}, \theta_{13}, \theta_{14}, \theta_{23}, \theta_{24}, \theta_{34}) = (0.5, 0.5, 0.5, 0.5, 0.5, 0.5)$ and $(\theta_{12}, \theta_{13}, \theta_{14}, \theta_{23}, \theta_{24}, \theta_{34}) = (0.6, 0.7, 0.8, 0.6, 0.7, 0.6)$.
(c) For d = 5, we have $(\theta_{12}, \theta_{13}, \theta_{14}, \theta_{15}, \theta_{23}, \theta_{24}, \theta_{25}, \theta_{34}, \theta_{35}, \theta_{45}) = (0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5)$ and $(\theta_{12}, \theta_{13}, \theta_{14}, \theta_{15}, \theta_{23}, \theta_{24}, \theta_{25}, \theta_{34}, \theta_{35}, \theta_{45}) = (0.6, 0.7, 0.8, 0.8, 0.6, 0.7, 0.8, 0.6, 0.7, 0.8)$.

The numerical results from the MMD models for count data with the Morgenstern copula are presented in Tables 4.16 to 4.18.
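The conditional-distribution algorithm above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used for the thesis. The function names are hypothetical. The sketch always takes the root with the minus sign, because for |D| ≤ 1 (which a valid Morgenstern density guarantees) that is the root lying in [0, 1]. The mixing distribution is restricted to α_j = 1, matching the simulation scheme, so that the Gamma quantile reduces to the exponential quantile -β log(1 - u).

```python
import math
import random


def morgenstern_density(u, theta):
    """f(u_1..u_m) = 1 + sum_{j<k} theta[j][k] (1 - 2u_j)(1 - 2u_k)."""
    m = len(u)
    return 1.0 + sum(theta[j][k] * (1 - 2 * u[j]) * (1 - 2 * u[k])
                     for j in range(m) for k in range(j + 1, m))


def sample_morgenstern(theta, rng):
    """One draw (U_1, ..., U_d) from the Morgenstern copula, coordinate by coordinate."""
    d = len(theta)
    u = [rng.random()]                      # U_1 = V_1
    for m in range(1, d):                   # 0-based index of the new coordinate
        v = rng.random()
        A = morgenstern_density(u, theta)   # density of the coordinates drawn so far
        B = sum(theta[j][m] * (1 - 2 * u[j]) for j in range(m))
        D = B / A
        if abs(D) < 1e-12:
            root = v                        # conditional cdf is the identity
        else:
            disc = max((D + 1.0) ** 2 - 4.0 * D * v, 0.0)
            root = ((D + 1.0) - math.sqrt(disc)) / (2.0 * D)
        u.append(min(max(root, 0.0), 1.0 - 1e-15))  # clamp against rounding
    return u


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the moderate means arising here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_poisson_morgenstern(n, theta, beta, rng):
    """Steps 1-3 of the simulation algorithm, assuming Gamma shape alpha_j = 1."""
    data = []
    for _ in range(n):
        u = sample_morgenstern(theta, rng)
        lam = [-beta * math.log(1.0 - uj) for uj in u]  # exponential quantile
        data.append([poisson_draw(l, rng) for l in lam])
    return data


rng = random.Random(7)  # arbitrary seed
theta = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
draws = [sample_morgenstern(theta, rng) for _ in range(20000)]
# Under this copula E[(1 - 2U_1)(1 - 2U_2)] = theta_12 / 9, which gives a moment check.
theta12_hat = 9.0 * sum((1 - 2 * u[0]) * (1 - 2 * u[1]) for u in draws) / len(draws)
counts = simulate_poisson_morgenstern(20000, theta, 1.0, rng)
mean_y1 = sum(row[0] for row in counts) / len(counts)
```

The moment check at the end is only a sanity test of the sampler: `theta12_hat` should be near 0.5 and `mean_y1` near E(λ_1) = α_1 β_1 = 1.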
We obtain conclusions similar to those for the MCD models for binary, ordinal and count data. Basically, they are: (i) The IFM approach is efficient relative to the ML approach; the ratio values r are very close to 1 in almost all the situations studied. (ii) The MLE may be slightly more efficient than the IFME, but this observation is not conclusive. The IFME and the MLE are comparable.

Table 4.16: Efficiency assessment with multivariate Poisson-Morgenstern-gamma model, d = 3

method   θ12       θ13       θ23
(θ12, θ13, θ23) = (0.5, 0.5, 0.5)
IFM      0.495     0.500     0.501
         (0.125)   (0.125)   (0.124)
MLE      0.494     0.499     0.500
         (0.122)   (0.124)   (0.123)
r        1.022     1.008     1.003
(θ12, θ13, θ23) = (0.6, 0.7, 0.8)
IFM      0.603     0.699     0.792
         (0.127)   (0.118)   (0.119)
MLE      0.600     0.697     0.790
         (0.127)   (0.120)   (0.119)
r        1.000     0.985     0.995

Table 4.17: Efficiency assessment with multivariate Poisson-Morgenstern-gamma model, d = 4

method   θ12       θ13           θ14       θ23           θ24       θ34
(θ12, θ13, θ14, θ23, θ24, θ34) = (0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
IFM      0.500     0.495         0.513     [illegible]   0.494     0.488
         (0.131)   (0.128)       (0.124)   (0.133)       (0.134)   (0.138)
MLE      0.501     0.493         0.512     0.497         0.495     0.485
         (0.130)   (0.124)       (0.122)   (0.131)       (0.132)   (0.135)
r        1.008     1.026         1.014     1.021         1.018     1.019
(θ12, θ13, θ14, θ23, θ24, θ34) = (0.6, 0.7, 0.8, 0.6, 0.7, 0.6)
IFM      0.593     [illegible]   0.794     0.589         0.692     0.599
         (0.130)   (0.127)       (0.120)   (0.121)       (0.133)   (0.124)
MLE      0.589     0.678         0.792     0.585         0.689     0.598
         (0.127)   (0.124)       (0.117)   (0.118)       (0.129)   (0.124)
r        1.017     1.026         1.024     1.025         1.037     1.006

Table 4.18: Efficiency assessment with multivariate Poisson-Morgenstern-gamma model, d = 5

method   θ12       θ13       θ14       θ15       θ23       θ24       θ25       θ34       θ35       θ45
(θ12, θ13, θ14, θ15, θ23, θ24, θ25, θ34, θ35, θ45) = (0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
IFM      0.501     0.496     0.477     0.486     0.511     0.467     0.478     0.504     0.493     0.495
         (0.122)   (0.131)   (0.137)   (0.132)   (0.123)   (0.130)   (0.123)   (0.128)   (0.131)   (0.116)
MLE      0.495     0.493     0.473     0.482     0.508     0.466     0.475     0.503     0.489     0.494
         (0.121)   (0.125)   (0.134)   (0.128)   (0.122)   (0.125)   (0.119)   (0.127)   (0.128)   (0.113)
r        1.012     1.046     1.023     1.026     1.002     1.043     1.037     1.011     1.026     1.018
(θ12, θ13, θ14, θ15, θ23, θ24, θ25, θ34, θ35, θ45) = (0.6, 0.7, 0.8, 0.8, 0.6, 0.7, 0.8, 0.6, 0.7, 0.8)
IFM      0.595     0.667     0.775     0.767     0.597     0.693     0.778     0.590     0.693     0.602
         (0.140)   (0.137)   (0.132)   (0.130)   (0.127)   (0.139)   (0.118)   (0.136)   (0.126)   (0.125)
MLE      0.590     0.666     0.772     0.766     0.593     0.690     0.778     0.588     0.687     0.604
         (0.137)   (0.132)   (0.128)   (0.126)   (0.119)   (0.135)   (0.115)   (0.132)   (0.124)   (0.113)
r        1.023     1.036     1.029     1.032     1.067     1.029     1.028     1.028     1.018     1.103

4.4 IFM efficiency for models with special dependence structure

The IFM approach may have important applications for models with special dependence structure. Data with special dependence structure arise often in practice: longitudinal studies, repeated measures, Markov type dependence data, k-dependent data, and so on. The analytical assessment of the efficiency of the IFM approach for several models with special dependence structure was studied in section 4.2. In the following, we give numerical results on IFM efficiency for some more complex models with special dependence structure. The estimation approach used here is the pool-marginal-likelihood approach (PMLA). We only present representative results from the MCD model for binary data, with the MVN copula of exchangeable and AR(1) dependence structures. Results with other models are quite similar, as we also observed in section 4.3 for various situations with a general model. We use the following simulation scheme:

1. The sample size is n = 1000, the number of simulations is N = 200.
2. The dimension d is chosen to be 3 and 4.
3.
For d = 3, we consider two marginal models, $Y_{ij} = I(Z_{ij} < z_j)$ and $Y_{ij} = I(Z_{ij} < \alpha_{j0} + \alpha_{j1} x_{ij})$, with $x_{ij} = I(U < 0)$ where $U \sim$ uniform(-1, 1), and with the regression parameters
(a) with no covariates: $z = (0.5, 0.5, 0.5)'$ and $z = (0.5, 1.0, 1.5)'$;
(b) with covariates: $\alpha_0 = (\alpha_{10}, \alpha_{20}, \alpha_{30})' = (0.5, 0.5, 0.5)'$, $\alpha_1 = (\alpha_{11}, \alpha_{21}, \alpha_{31})' = (1, 1, 1)'$, and $\alpha_0 = (\alpha_{10}, \alpha_{20}, \alpha_{30})' = (0.5, 0.5, 0.5)'$, $\alpha_1 = (\alpha_{11}, \alpha_{21}, \alpha_{31})' = (1, 0.5, 1.5)'$.
For each marginal model, exchangeable and AR(1) dependence structures in the MVN copula are considered, with the single dependence parameter in both cases being $\theta_i = [\exp(\beta_0 + \beta_1 w_i) - 1]/[\exp(\beta_0 + \beta_1 w_i) + 1]$, with $w_i = I(U < 0)$ where $U \sim$ uniform(-1, 1), and parameters $\beta_0 = 1$ and $\beta_1 = 1.5$.

4. For d = 4, we only study $Y_{ij} = I(Z_{ij} < z_j)$, with the marginal parameters $z = (0.5, 0.5, 0.5, 0.5)'$ and $z = (0.5, 0.8, 1.2, 1.5)'$. For each marginal model, exchangeable and AR(1) dependence structures in the MVN copula are considered. The single dependence parameter in both cases is $\theta_i = [\exp(\beta_0) - 1]/[\exp(\beta_0) + 1]$, with $\beta_0 = 1.386$ and $\beta_0 = 2.197$ for both situations.

The numerical results from these models with special dependence structure are presented in Tables 4.19 to 4.26. We basically have the same conclusions as with all other general cases
studied previously. These conclusions are: (i) The IFM approach (PMLA) is efficient relative to the ML approach; the ratio values r are very close to 1 in almost all the situations studied. (ii) The MLE may be slightly more efficient than the IFME, but this observation is not conclusive. The IFME and the MLE are comparable.

Table 4.19: Efficiency assessment with special dependence structure: d = 3, z = (0.5, 0.5, 0.5)'

method   z1        z2        z3        β0            β1
exchangeable, β0 = 1, β1 = 1.5
IFM      0.496     0.497     0.497     [illegible]   1.511
         (0.043)   (0.041)   (0.042)   (0.118)       (0.194)
MLE      0.494     0.496     0.496     0.996         1.520
         (0.041)   (0.040)   (0.041)   (0.118)       (0.195)
r        1.047     1.021     1.015     1.003         0.998
AR(1), β0 = 1, β1 = 1.5
IFM      0.496     0.497     0.496     [illegible]   [illegible]
         (0.043)   (0.041)   (0.041)   (0.123)       (0.185)
MLE      0.494     0.497     0.495     0.994         1.512
         (0.042)   (0.040)   (0.041)   (0.119)       (0.183)
r        1.034     1.027     1.012     1.031         1.015

Table 4.20: Efficiency assessment with special dependence structure: d = 3, z = (0.5, 1.0, 1.5)'

method   z1        z2        z3            β0            β1
exchangeable, β0 = 1, β1 = 1.5
IFM      0.496     0.997     [illegible]   [illegible]   1.531
         (0.043)   (0.047)   (0.064)       (0.154)       (0.249)
MLE      0.496     0.996     1.499         0.997         1.534
         (0.043)   (0.047)   (0.063)       (0.156)       (0.247)
r        1.009     0.999     1.010         0.986         1.008
AR(1), β0 = 1, β1 = 1.5
IFM      0.496     0.997     1.500         0.991         1.509
         (0.043)   (0.047)   (0.063)       (0.158)       (0.250)
MLE      0.496     0.996     1.500         0.993         1.518
         (0.043)   (0.046)   (0.062)       (0.156)       (0.249)
r        1.017     1.013     1.018         1.011         1.003

4.5 Jackknife variance estimate compared with Godambe information matrix

Now we turn to numerical evaluation of the performance of jackknife variance estimates of the IFME. We have shown, in Chapter 2, that the jackknife estimate of variance is asymptotically equivalent to the estimate of variance from the corresponding Godambe information matrix. The jackknife approach may be preferred when the appropriate computer packages are not available to compute
The efficiency of IFM approach and the efficiency of jackknife variance estimate 164 Table 4.21: Efficiency assessment with special dependence structure: d = 3, ct$ = (0.5,0.5,0.5)', Of! = (1,1,1)' ~parameters am Qn apn Q21 Qan "31 Po 01 ~ exchangeable, Bn = 1, Pi = 15 IFM 0.500 1.020 0.499 i.ulO 0.500 1.002 UM0 1.536 (0.055) (0.109) (0.060) (0.108) (0.059) (0.104) (0.153) (0.242) MLE 0.500 1.018 0.498 1.010 0.500 0.999 0.978 1.556 (0.052) (0.104) (0.059) (0.107) (0.058) (0.102) (0.152) (0.250) r 1.052 1.048 1.011 1.007 1.018 1.028 1.002 0.968 AR(1) Bp = "l p\ = 1.5 IFM 0.500 1.020 0.499 1.010 0.497 1.002 0^88 1.529 (0.055) (0.109) (0.060) (0.108) (0.058) (0.101) (0.158) (0.233) MLE 0.501 1.017 0.499 1.009 0.497 0.999 0.985 1.545 (0.052) (0.104) (0.059) (0.105) (0.058) (0.100) (0.157) (0.235) r 1.043 1.047 1.023 1.022 1.004 1.004 1.008 0.991 Table 4.22: Efficiency assessment with special dependence structure: d — 3, oro = (0.5,0.5,0.5)', cti = (1,0.5,1.5)' "parameters am an qpn 021 0:30 «3i Pa Pi exchangeable, pp = 1, Pi = 1.5 IFM 0.500 1.020 0.499 6.H0 0.500 1.512 UME 1.528 (0.055) (0.109) (0.060) (0.089) (0.059) (0.141) (0.160) (0.238) MLE 0.500 1.017 0.498 0.510 0.500 1.506 0.983 1.539 (0.052) (0.103) (0.059) (0.089) (0.058) (0.132) (0.159) (0.239) r 1.047 1.050 1.011 1.002 1.017 1.070 1.004 0.996 AH(1), 00 = 1 Pi = 1-5 ~ IFM 0.500 1.020 0.499 0.510 0.497 1.514 0^9r3 T3T8-(0.055) (0.109) (0.060) (0.089) (0.058) (0.140) (0.159) (0.225) MLE 0.500 1.017 0.499 0.510 0.497 1.510 0.994 1.530 (0.053) (0.104) (0.059) (0.089) (0.057) (0.133) (0.158) (0.223) _r 1.041 1.045 1.021 1.003 1.006 1.049 1.007 1.010 Table 4.23: Efficiency assessment with special dependence structure: d= 4, z = (0.5,0.5,0.5,0.5)' parameters Zi Z2 z3 24 Jo exchangeable, Po = 1.386 IFM MLE r 0.502 0.499 (0.041) (0.043) 0.501 0.499 (0.041) (0.043) 1.000 1.002 0.501 (0.042) 0.500 (0.042) 1.005 0.501 (0.042) 0.500 (0.042) 1.003 1.387 (0.071) 1.389 (0.070) 1.013 AR(1), A, = 1.386 IFM MLE r 0.502 0.499 
(0.041) (0.043) 0.502 0.499 (0.041) (0.043) 0.996 1.000 0.501 (0.041) 0.500 (0.041) 0.998 0.497 (0.042) 0.496 (0.042) 0.998 1.385 (0.072) 1.387 (0.069) 1.047 Chapter 4. The efficiency of IFM approach and the efficiency of jackknife variance estimate 165 Table 4.24: Efficiency assessment with special dependence structure: d = 4, z = (0.5, 0.8,1.2,1.5)' parameters 2l 23 24 Bo exch angeable, 0o = 1.386 IFM MLE r 0.502 (0.041) 0.502 (0.041) 0.998 fi.803 (0.045) 0.802 (0.045) 1.004 1.199 (0.052) 1.198 (0.052) 1.002 1.494 (0.061) 1.492 (0.061) 1.007 1.389 (0.087) 1.391 (0.087) 1.004 ARID, 0O = 1.386 IFM MLE r 0.502 (0.041) 0.502 (0.041) 0.999 0.803 (0.045) 0.802 (0.045) 1.006 1.20 (0.05) 1.20 (0.05) 1.00 1.495 (0.067) 1.494 (0.065) 1.017 1.388 (0.085) 1.389 (0.083) 1.025 Table 4.25: Efficiency assessment with special dependence structure: d = 4, z = (0.5,0.5,0.5,0.5)' parameters Z\ Z2 2.3 24 00 exch angeable, 0o = 2.197 IFM MLE r 0.502 (0.041) 0.500 (0.041) 0.999 ' Q.501 (0.042) 0.499 (0.042) 1.000 0.501 (0.042) 0.499 (0.042) 0.999 0.501 (0.042) 0.499 (0.042) 1.000 2.200 (0.093) 2.202 (0.092) 1.015 AR(1), 0o = 2.197 IFM MLE r 0.502 (0.041) 0.501 (0.041) 0.995 0.501 (0.042) 0.499 (0.043) 0.993 0.501 (0.042) 0.499 (0.042) 1.000 0.499 (0.042) 0.498 (0.042) 0.999 2.194 (0.086) 2.199 (0.084) 1.025 Table 4.26: Efficiency assessment with special dependence structure: d = 4, z = (0.5,0.8,1.2,1.5)' parameters Zl 22 23 24 0o exchangeable, 0o = 2.197 IFM MLE r 0.502 0.802 (0.041) (0.046) 0.501 0.801 (0.041) (0.046) 0.996 1.002 1.201 (0.056) 1.199 (0.055) 1.005 1.499 (0.060) 1.496 (0.059) 1.003 2.203 (0.114) 2.204 (0.111) 1.031 AR(1), 0o = 2.197 IFM MLE r 0.502 0.802 (0.041) (0.046) 0.501 0.801 (0.041) (0.046) 0.997 1.005 1.198 (0.052) 1.196 (0.052) 0.993 1.500 (0.060) 1.500 (0.060) 1.000 2.200 (0.110) 2.200 (0.100) 1.040 Chapter 4. 
the Godambe information matrix or when the asymptotic variance in terms of Godambe information matrix is difficult to compute analytically or computationally. For example, to compute the asymptotic variance of $P(y_1 \cdots y_d; \tilde\theta)$ by means of Godambe information is not an easy task. To complement the theoretical results in Chapter 2, in this subsection, we give some analytical and numerical comparisons of the variance estimates from Godambe information and the jackknife method. The application of jackknife methods to modelling and inference of real data sets is demonstrated in Chapter 5.

Analytical comparison of the two approaches

Example 4.8 (Multinormal, general) Let $X \sim N_d(\boldsymbol{\mu}, \Sigma)$, and suppose we are interested in estimating $\boldsymbol{\mu}$. Given n independent observations $x_1, \ldots, x_n$ from X, the IFME of $\boldsymbol{\mu}$ is $\tilde{\boldsymbol{\mu}} = n^{-1}\sum_{i=1}^n x_i$, and the corresponding inverse of the Godambe information matrix is $J_\Psi^{-1} = \Sigma$. A consistent estimate of $J_\Psi^{-1}$ is

$$\tilde J_\Psi^{-1} = \frac{1}{n}\sum_{i=1}^n (x_i - \tilde{\boldsymbol{\mu}})(x_i - \tilde{\boldsymbol{\mu}})^T.$$

The jackknife estimate of the Godambe information matrix is

$$nV_J = n\sum_{i=1}^n (\tilde{\boldsymbol{\mu}}_{(i)} - \tilde{\boldsymbol{\mu}})(\tilde{\boldsymbol{\mu}}_{(i)} - \tilde{\boldsymbol{\mu}})^T,$$

where $\tilde{\boldsymbol{\mu}}_{(i)} = (n-1)^{-1}(n\tilde{\boldsymbol{\mu}} - x_i)$. Some algebraic manipulation leads to

$$nV_J = \frac{n^2}{(n-1)^2}\,\frac{1}{n}\sum_{i=1}^n (x_i - \tilde{\boldsymbol{\mu}})(x_i - \tilde{\boldsymbol{\mu}})^T,$$

which is a consistent estimate of $\Sigma$. Furthermore, we see that

$$nV_J = \frac{n^2}{(n-1)^2}\,\tilde J_\Psi^{-1},$$

which shows that the jackknife estimate of the Godambe information matrix is also good when the sample size is moderate to small. •

Example 4.9 (Multinormal, common marginal mean) Let $X \sim N_d(\boldsymbol{\mu}, \Sigma)$, where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)' = \mu\mathbf{1}$ and $\Sigma$ is known. We are interested in estimating the common parameter $\mu$. Given n independent observations $x_1, \ldots, x_n$ with the same distribution as X, the IFME of $\mu$ by the weighting approach is (see Example 4.2)

$$\tilde\mu_w = \frac{\mathbf{1}'\Sigma^{-1}\tilde{\boldsymbol{\mu}}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}.$$

The inverse of Godambe information of $\tilde\mu_w$ is

$$J_\Psi^{-1} = \frac{1}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}.$$

The jackknife estimate of the Godambe information is

$$nV_J = n\sum_{i=1}^n (\tilde\mu_{w(i)} - \tilde\mu_w)(\tilde\mu_{w(i)} - \tilde\mu_w)^T,$$

where $\tilde\mu_{w(i)} = \mathbf{1}'\Sigma^{-1}\tilde{\boldsymbol{\mu}}_{(i)} / \mathbf{1}'\Sigma^{-1}\mathbf{1}$.
Some algebraic manipulation leads to

$$nV_J = \frac{\mathbf{1}'\Sigma^{-1}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}\left[n\sum_{i=1}^n (\tilde{\boldsymbol{\mu}}_{(i)} - \tilde{\boldsymbol{\mu}})(\tilde{\boldsymbol{\mu}}_{(i)} - \tilde{\boldsymbol{\mu}})^T\right]\frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}.$$

We replace $n\sum_{i=1}^n (\tilde{\boldsymbol{\mu}}_{(i)} - \tilde{\boldsymbol{\mu}})(\tilde{\boldsymbol{\mu}}_{(i)} - \tilde{\boldsymbol{\mu}})^T$ with $\frac{n^2}{(n-1)^2}\Sigma$. Thus

$$nV_J \approx \frac{n^2}{(n-1)^2}\,\frac{1}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}$$

and

$$nV_J \approx \frac{n^2}{(n-1)^2}\,J_\Psi^{-1},$$

which shows that the jackknife estimate of the Godambe information is also good when the sample size is moderate to small. •

Numerical comparison of the two approaches

In this subsection, we numerically compare the variance estimates of IFME from the jackknife method and from the Godambe information. For this purpose, we use a 3-dimensional probit model with normal copula. The comparison studies are carried out only for the dependence parameters $\theta_{jk}$. For the chosen model parameters, we carry out N simulations for each sample size n. For each simulation s (s = 1, ..., N) of sample size n, we estimate model parameters $\theta_{12}, \theta_{13}, \theta_{23}$ with the IFM approach. Let us denote these estimates $\tilde\theta_{12}^{(s)}, \tilde\theta_{13}^{(s)}, \tilde\theta_{23}^{(s)}$. We then compute the jackknife estimate of variance (with g groups of size m such that g × m = n) for $\tilde\theta_{12}^{(s)}, \tilde\theta_{13}^{(s)}, \tilde\theta_{23}^{(s)}$. We denote these variance estimates by $v_{12}^{(s)}, v_{13}^{(s)}, v_{23}^{(s)}$. Let the asymptotic variance estimate of $\tilde\theta_{12}, \tilde\theta_{13}, \tilde\theta_{23}$ based on the Godambe information matrix from a sample of size n be $v_{12}, v_{13}, v_{23}$. We compare the following three variance estimates:

(i) MSE: $\frac{1}{N}\sum_{s=1}^N (\tilde\theta_{12}^{(s)} - \theta_{12})^2$, $\frac{1}{N}\sum_{s=1}^N (\tilde\theta_{13}^{(s)} - \theta_{13})^2$, $\frac{1}{N}\sum_{s=1}^N (\tilde\theta_{23}^{(s)} - \theta_{23})^2$;

(ii) Godambe: $v_{12}, v_{13}, v_{23}$;

(iii) Jackknife: $\frac{1}{N}\sum_{s=1}^N v_{12}^{(s)}$, $\frac{1}{N}\sum_{s=1}^N v_{13}^{(s)}$, $\frac{1}{N}\sum_{s=1}^N v_{23}^{(s)}$.

The MSE in (i) should be considered as the true variance of the parameter estimate assuming unbiasedness. (ii) and (iii) should be compared with each other and also with (i).

Table 4.27 and Table 4.28 summarize the numerical computation of the variance estimates of $\tilde\theta_{12}, \tilde\theta_{13}, \tilde\theta_{23}$ based on approaches (i), (ii) and (iii). For the jackknife method, the results for different combinations of (g, m) are reported in the two tables.
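The delete-one-group jackknife used in this comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `jackknife_variance`, the use of contiguous groups, the restriction to a scalar estimator and the toy data are all hypothetical, and the unscaled form $V_J = \sum_\nu (\tilde\theta_{(\nu)} - \tilde\theta)^2$ is taken from the examples above. With g = n and the sample mean as the estimator, it reproduces the identity of Example 4.8 in the case d = 1.

```python
from statistics import fmean


def jackknife_variance(estimator, data, g):
    """Delete-one-group jackknife variance of a scalar estimator.

    Assumes (hypothetically) that the n observations are split into g
    contiguous groups of equal size m = n / g, and uses the unscaled form
    V_J = sum_v (theta_(v) - theta)^2.
    """
    n = len(data)
    if g < 2 or n % g != 0:
        raise ValueError("g must be at least 2 and divide the sample size")
    m = n // g
    theta = estimator(data)
    v_j = 0.0
    for k in range(g):
        # Recompute the estimate with the k-th group left out.
        reduced = data[:k * m] + data[(k + 1) * m:]
        v_j += (estimator(reduced) - theta) ** 2
    return v_j


# Toy data (hypothetical) and the delete-one case g = n.
x = [2.1, 3.4, 1.9, 4.0, 2.8, 3.3, 2.5, 3.9]
n = len(x)
v_j = jackknife_variance(fmean, x, g=n)

# Plug-in variance (1/n) * sum (x_i - mean)^2, the d = 1 analogue of the
# consistent estimate of the inverse Godambe information in Example 4.8.
xbar = fmean(x)
s2 = sum((xi - xbar) ** 2 for xi in x) / n

# Grouped version with g = 4 groups of size m = 2.
v_j_grouped = jackknife_variance(fmean, x, g=4)
```

Only the code for obtaining the estimate is needed, which is the practical appeal of the jackknife noted in the text; for an IFME, `estimator` would be replaced by the full estimation routine.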
In total, four models with different marginal parameters $z = (z_1, z_2, z_3)$ and different dependence parameters $\theta = (\theta_{12}, \theta_{13}, \theta_{23})$ are studied. The details about the parameter values are reported in the tables. We have studied two sample sizes: n = 500 and n = 1000. For both sample sizes, the number of simulations is N = 500. From examining the two tables, we see that the three measures are very close to each other. We conclude that the jackknife method is indeed consistent with the Godambe information computation approach. Both approaches yield variance estimates which are comparable to MSE.

In conclusion, we have shown theoretically and demonstrated numerically in several cases that the jackknife method for variance estimation compares very favorably with the Godambe information computation. We are willing to extrapolate to general situations. The jackknife approach is simple and computationally straightforward (computationally, it only requires the code for obtaining the parameter estimates); it also has the advantage of easily handling more complex situations where the Godambe information computation is not possible. One major concern with the jackknife approach is the computational time needed to carry out the whole process. If the computing time problem is due to an extremely large sample size, appropriate grouping of the sample for the sake of applying the jackknife approach may improve the situation. A discussion is given in Section 2.5. Overall, we recommend the general use of the jackknife approach in applications.

4.6 Summary

In this chapter, we demonstrated analytically and numerically that the IFM approach is an efficient parameter estimation procedure for MCD and MMD models with MUBE or PUBE properties. We have chosen a wide variety of cases so that we can extrapolate this conclusion to the general situation.
Theoretically, we expect IFM to be quite efficient because it is closely tied to MLE in that each inference function is a likelihood score function of a margin. For comparison purposes, we carried out ML estimates for several multivariate models. Our experience was that finding the MLE is a difficult and very time consuming task for multivariate models, while the IFME is computationally simple and results in significant saving of computing time. We further demonstrated numerically that the jackknife method yields SEs for the IFME, which are comparable to the SEs obtained from the Godambe information matrix. The jackknife method for variance estimates has significant practical importance as it eliminates the need to calculate the partial derivatives which are required for calculating the Godambe information matrix. The jackknife method can also be used for estimates of functions of parameters (such as probabilities of being in some category or probabilities of exceedances).

Table 4.27: Comparison of estimates of standard error, (i) true, (ii) Godambe, (iii) jackknife with g groups; N = 500, n = 1000

approach   (g, m)        θ12        θ13        θ23
z = (0.0, 0.7, 0.0)', θ = (-0.5, 0.5, -0.5)
(i)                   0.002079   0.001704   0.002085
(ii)                  0.002012   0.001645   0.002012
(iii)      (1000,1)   0.002030   0.001646   0.002038
           (500,2)    0.002028   0.001653   0.002043
           (250,4)    0.002025   0.001658   0.002047
           (125,8)    0.002058   0.001653   0.002046
           (100,10)   0.002046   0.001663   0.002046
           (50,20)    0.002089   0.001685   0.002089
z = (0.7, 0.0, 0.7)', θ = (0.5, 0.9, 0.5)
(i)                   0.002090   0.000281   0.002200
(ii)                  0.002012   0.000295   0.002012
(iii)      (1000,1)   0.002026   0.000299   0.002023
           (500,2)    0.002027   0.000300   0.002021
           (250,4)    0.002036   0.000300   0.002035
           (125,8)    0.002056   0.000302   0.002049
           (100,10)   0.002063   0.000301   0.002054
           (50,20)    0.002088   0.000301   0.002067
z = (0.7, 0.7, 0.7)', θ = (0.9, 0.7, 0.5)
(i)                   0.000333   0.001218   0.002319
(ii)                  0.000295   0.001239   0.002187
(iii)      (1000,1)   0.000302   0.001254   0.002208
           (500,2)    0.000303   0.001257   0.002210
           (250,4)    0.000302   0.001260   0.002212
           (125,8)    0.000303   0.001267   0.002216
           (100,10)   0.000305   0.001261   0.002214
           (50,20)    0.000310   0.001252   0.002220
z = (1.0, 0.5, 0.0)', θ = (0.8, 0.6, 0.8)
(i)                   0.000821   0.002147   0.000766
(ii)                  0.000869   0.002089   0.000666
(iii)      (1000,1)   0.000873   0.002129   0.000683
           (500,2)    0.000874   0.002118   0.000683
           (250,4)    0.000877   0.002108   0.000681
           (125,8)    0.000884   0.002119   0.000688
           (100,10)   0.000887   0.002138   0.000687
           (50,20)    0.000899   0.002151   0.000690

Table 4.28: Comparison of estimates of standard error, (i) true, (ii) Godambe, (iii) jackknife with g groups; N = 500, n = 500

approach   (g, m)        θ12        θ13        θ23
z = (0.0, 0.7, 0.0)', θ = (-0.5, 0.5, -0.5)
(i)                   0.004158   0.003135   0.004262
(ii)                  0.004024   0.003290   0.004024
(iii)      (500,1)    0.004085   0.003315   0.004104
           (250,2)    0.004071   0.003333   0.004122
           (125,4)    0.004053   0.003331   0.004119
           (50,10)    0.004115   0.003396   0.004176
z = (0.7, 0.0, 0.7)', θ = (0.5, 0.9, 0.5)
(i)                   0.003998   0.000602   0.003768
(ii)                  0.004024   0.000591   0.004024
(iii)      (500,1)    0.004062   0.000604   0.004049
           (250,2)    0.004062   0.000601   0.004054
           (125,4)    0.004091   0.000607   0.004103
           (50,10)    0.004123   0.000617   0.004171
z = (0.7, 0.7, 0.7)', θ = (0.9, 0.7, 0.5)
(i)                   0.000632   0.002688   0.004521
(ii)                  0.000591   0.002479   0.004374
(iii)      (500,1)    0.000607   0.002501   0.004410
           (250,2)    0.000611   0.002510   0.004425
           (125,4)    0.000616   0.002533   0.004467
           (50,10)    0.000622   0.002539   0.004501
z = (1.0, 0.5, 0.0)', θ = (0.8, 0.6, 0.8)
(i)                   0.001634   0.003846   0.001413
(ii)                  0.001738   0.004179   0.001332
(iii)      (500,1)    0.001821   0.004397   0.001365
           (250,2)    0.001837   0.004407   0.001368
           (125,4)    0.001846   0.004433   0.001360
           (50,10)    0.001876   0.004476   0.001388
The IFM approach together with the jackknife estimation of SEs makes many more multivariate models computationally feasible for working with real data. The IFM theory as part of statistical inference theory for multivariate non-normal models is highly recommended because of its good asymptotic properties and its computational feasibility. This approach should have significant practical usefulness. We will demonstrate its application in Chapter 5.

Chapter 5
Modelling, data analysis and examples

Possessing a tool is one thing, but using it effectively is quite another. In this chapter, we explore the possibility of effectively using the tools developed in this thesis for multivariate statistical modelling (including IFM theory, jackknife variance estimation, etc.) and provide data analysis examples. In section 5.1, we first discuss our view of the proper data analysis cycle. This is an important issue since the interpretation of the results, and possibly the indication of further studies, are directly related to the way that the data analysis was carried out. We next discuss several other important issues in multivariate discrete modelling, such as how to choose among models and how to check the adequacy of models. We also provide some discussion on the testing of dependence structure hypotheses, which is useful for identifying some specific multivariate models. In section 5.2, we carry out several data analysis examples with the models and inference procedures developed in the previous chapters. We show some applications of the models and inference procedures developed in this thesis and point out difficulties related to multivariate nonnormal analysis.

5.1 Some issues on modelling

5.1.1 Data analysis cycle

A proper data analysis cycle usually consists of initial data analysis, statistical modelling, diagnostic model assessment and inferences.
The initial data analysis may consist of computing various data summaries and examining various graphical representations of data. The type of summary statistics and graphical representations depends on the basic features of the data set. For example, for binary, ordinal and count data, we can compute the empirical frequencies (and percentages) of response variables as well as covariates, separately and jointly. If some covariates are continuous, then standard summaries such as the mean, median, standard deviation, quartiles, maximum, minimum, as well as graphical displays such as boxplots and histograms could be examined. To have a rough idea of the dependence among the response variables, for binary data, a check of the pairwise log odds ratios of the responses could be helpful. Another convenient empirical pairwise dependence measure for multivariate discrete data, which is particularly useful for ordinal and count data, is a measure called gamma. This measure, for ordinal and count data $y_i = (y_{i1}, y_{i2})$, $i = 1, \ldots, n$, is defined as

$$\gamma = \frac{C - D}{C + D}, \tag{5.1}$$

where $C = \sum_{i=1}^n \sum_{i'=1}^n I(y_{i1} > y_{i'1})\, I(y_{i2} > y_{i'2})$ and $D = \sum_{i=1}^n \sum_{i'=1}^n I(y_{i1} < y_{i'1})\, I(y_{i2} > y_{i'2})$, and I is the indicator function. In (5.1), C can be interpreted as the number of concordant pairs and D the number of discordant pairs. The gamma measure is studied in Goodman and Kruskal (1954), and is considered as a discrete generalization of Kendall's tau for continuous variables. The properties of the gamma measure follow directly from its definition. Like the correlation coefficient, its range is $-1 \le \gamma \le 1$: $\gamma = 1$ when the number of discordant pairs D = 0, $\gamma = -1$ when the number of concordant pairs C = 0, and $\gamma = 0$ when the number of concordant pairs equals the number of discordant pairs. Other dependence measures, such as the discrete generalizations of Kendall's tau or Spearman's $\rho$, can also be used for ordinal response and count response as well as binary response. Furthermore, summaries such as means, variances and correlations could also be meaningful and useful for count data. Initial data analysis is particularly important in multivariate analysis, since the structure of multivariate data is much more complicated than that of univariate data, and the initial data analysis results will shed light on identifying the suitable statistical models.

Statistical modelling usually consists of specification, estimation, and evaluation steps. The specification formulates a probabilistic model which is assumed to have generated the observed data. At this stage, to choose appropriate models, relevant questions are: "What is the nature of the data?" and "How have the data been generated?" The chosen models should make sense for the data. The decision concerning which model to fit to a set of data should, if possible, be the result of a prior consideration of what might be a suitable model for the process under investigation, as well as the result of computation. In some situations, a data set may have several suitable alternative models. After obtaining estimation and computation results, model selections could be made based on certain criteria.

Diagnostics consist of assessments of the reliability of the estimates, the fit of the model and the overall performance of the model. Both the fitting error of the model and possibly the prediction error should be studied. We should also bear in mind that often a small fitting error does not lead to a small prediction error. Sometimes, it is necessary to seek a balance between the two. Appropriate diagnostic checking is an important but not easy step in the whole modelling process.

At the inference stage, relevant statements about the population from which the sample was taken can be made based on the statistical modelling (mainly probabilistic models) results from the previous stages.
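Returning briefly to the initial data analysis step, the gamma measure in (5.1) can be computed directly from its definition. The Python sketch below is illustrative only: the function name `goodman_kruskal_gamma` and the toy data are assumptions. It loops over unordered pairs of observations, which gives the same C and D as the double sums in (5.1), because each concordant (or discordant) pair contributes to exactly one ordering of (i, i').

```python
def goodman_kruskal_gamma(pairs):
    """Gamma = (C - D) / (C + D) for a list of bivariate observations.

    `pairs` is a list of (y_i1, y_i2) tuples with ordinal or count values.
    Pairs of observations tied on either margin are neither concordant
    nor discordant and are ignored.
    """
    concordant = 0
    discordant = 0
    for i, (a1, a2) in enumerate(pairs):
        for b1, b2 in pairs[i + 1:]:
            sign = (a1 - b1) * (a2 - b2)
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when C + D = 0")
    return (concordant - discordant) / (concordant + discordant)


# Toy ordinal data (hypothetical): 3 concordant pairs, 1 discordant pair.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]
gamma_toy = goodman_kruskal_gamma(toy)

# Boundary cases from the text: D = 0 gives +1, C = 0 gives -1.
gamma_up = goodman_kruskal_gamma([(0, 0), (1, 1), (2, 2)])
gamma_down = goodman_kruskal_gamma([(0, 2), (1, 1), (2, 0)])
```

For d response variables, the function would be applied to each of the d(d-1)/2 bivariate margins in turn.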
These inferences may be the explanation of changes in responses over margin or time, the effects of covariates on the probabilities of occurrence, the marginal and conditional behaviour of response variables, the probability of exceedance, as well as hypothesis testing as suggested by the theory in the application domain, and so on. Some relevant questions are: "How can valid inference be drawn?", "What interpretation can be given to the estimates?", "Is there a structural interpretation, relating to the underlying theory in the application?", and "Are the results pointing to further studies?"

5.1.2 Model selection

When modelling a data set, usually it is required only that the model provide accurate predictions or other aspects of data, without necessarily duplicating every detail of the real system. A valid model is any model that gives an adequate representation of the system that is of interest to the model user. Often a large number of equally good models exist for a particular data set in terms of the specific inference aspect of interest to the practitioner. Model selection is carried out by comparing alternative models. If a model fits the data approximately as well as the other more complex models, we usually prefer the simple one. There are many criteria to distinguish between models. One suitable criterion for choosing a model is the associated maximum loglikelihood value. However, within the same family, the maximum loglikelihood value usually depends on the number of parameters estimated in the model, with more parameters yielding a bigger value. Thus maximizing this statistic cannot be the sole criterion since we would inevitably choose models with more parameters and more complex structure. In application, parsimonious models which identify the essential relations between the variables and capture the major characteristic features of the problem under study are more useful.
Such models often lead to clear and simple interpretation. The ideal situation is that we arrive at a simple model which is consistent with the observed data. In this vein, a balance between the size of the maximum loglikelihood value and the number of parameters is important. But it is often difficult to judge the appropriateness of the balance. One widely used criterion is the Akaike Information Criterion (AIC), which is defined as

$$\mathrm{AIC} = -2\ell(\hat\theta; y) + 2s,$$

where $\ell(\hat\theta; y)$ is the maximum loglikelihood of the model, and s is the number of estimated parameters of the model. (With IFM estimation, the AIC is modified to $\mathrm{AIC} = -2\ell(\tilde\theta; y) + 2s$, where $\tilde\theta$ is the IFME.) By definition, a model with a smaller AIC is preferable. The AIC considers the principles of maximum likelihood and the model dimension (the number of parameters) simultaneously, and thus aims for a balance of maximum likelihood value and model complexity. The negative of AIC/2 is asymptotically an unbiased estimator of the mean expected loglikelihood (see Sakamoto et al. 1986); thus AIC can be interpreted as an asymptotically unbiased estimator of -2 times the mean expected loglikelihood of the fitted model. The model having minimum AIC should have minimum prediction error, at least asymptotically. In the use of AIC, it is the difference of AIC values that matters and not the actual values themselves. This is because AIC is an estimate of the mean expected loglikelihood of a model. If the difference between two models is less than 1, their goodness-of-fit is almost the same. For a detailed account of AIC, see Sakamoto et al. (1986). The AIC was introduced by Akaike (1973) for the purpose of selecting an optimal model from within a set of proposed models (hypotheses). The AIC procedure has been used successfully to identify models; see, for example, Akaike (1977).
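As a small illustration of how the criterion is used, the following Python sketch computes AIC values for a set of candidate models and ranks them. The model names, loglikelihood values and parameter counts are made-up numbers chosen only for the example, and the helper `aic` is a hypothetical name; with IFM estimation the loglikelihood would be evaluated at the IFME, as noted above.

```python
def aic(max_loglik, n_params):
    """AIC = -2 * (maximized loglikelihood) + 2 * (number of parameters)."""
    return -2.0 * max_loglik + 2 * n_params


# Hypothetical candidates: name -> (maximized loglikelihood, parameter count).
candidates = {
    "exchangeable": (-512.4, 3),
    "AR(1)": (-510.9, 3),
    "unstructured": (-509.8, 8),
}

aic_values = {name: aic(ll, s) for name, (ll, s) in candidates.items()}
best = min(aic_values, key=aic_values.get)

# Only differences in AIC matter, so report each model relative to the best.
aic_differences = {name: value - aic_values[best]
                   for name, value in aic_values.items()}
```

In this made-up example the unstructured model has the largest loglikelihood but the largest AIC, because its five extra parameters are not repaid by the small gain in fit.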
The selection of models should also be based on the understanding that it is an essential part of modelling to direct the analysis to aspects which are relevant to the context and to omit other aspects of the real world situation which often lead to spurious results. This is also the reason that we have to be careful not to overparameterize the model, since, although this might improve the goodness-of-fit, it is likely to result in the model portraying spurious features of the sampled data, which may detract from the usefulness of the achieved fit and may lead to poor prediction. The selection of models should also be based on the consideration of the practical importance of the models, which in turn is based on the nature and extent of the models and their contribution to our understanding of the problem.

Statistical modelling is often an iterative process. The general process is such that, after a promising member from a family of models is tentatively chosen, parameters in the model are next efficiently estimated; and finally, the success of the resulting fit is assessed. The now precisely defined model is either accepted at this verification stage or the diagnostic checks carried out will find it lacking in certain respects and should then suggest a sensible modified identification. Further estimation and checking may take place, and the cycle of identification, estimation, and verification is repeated until a satisfactory fit is obtained.

5.1.3 Diagnostic checking

A model should be judged by its predictive power as well as its goodness-of-fit. Diagnostic checking is a procedure for evaluating to what extent the data support the model. The AIC only compares models through their relative predictive power; it does not assess the goodness-of-fit of the model to the data. In multivariate nonnormal analysis, it is not obvious how the goodness-of-fit checking could be carried out.
We discuss this issue in the following.

There are many conventional ways to check the goodness-of-fit of a model. One direct way to check the model is by means of residuals (mainly for continuous data). A diagnostic check based on residuals consists of making a residual plot of the (standardized) residuals. Another frequently applied approach is to calculate some goodness-of-fit statistics. When the checking of residuals is feasible, the goodness-of-fit statistics are often used as a supplement. In multivariate analysis, direct comparison of estimated probabilities with the corresponding empirical probabilities may also be considered as a good and efficient diagnostic checking method.

For multivariate binary or ordinal categorical data, a diagnostic check based on residuals of observed data is not meaningful. However, statistics of goodness-of-fit are available in these situations. We illustrate the situation here by means of multivariate binary data. For a d-dimensional random binary vector Y with a model P, its sample space contains $2^d$ elements. We denote these by $k = 1, \ldots, 2^d$, with k representing the kth particular outcome pattern and $P_k$ the corresponding probability, with $\sum_{k=1}^{2^d} P_k = 1$. Assume n is the number of observations and $n_1, \ldots, n_{2^d}$ are the empirical frequencies corresponding to $k = 1, \ldots, 2^d$. Let $\tilde P_k$ be the estimate of $P_k$ for a specified model. Under the hypothesis that the specified model is the true model and with the assumption of some regularity conditions (e.g. efficient estimates, see Read and Cressie 1988, §4.1), Fisher (1924) shows, in the case where $\tilde P_k$ depends on one estimated parameter, that the Pearson $\chi^2$ type statistic

$$X^2 = \sum_{k=1}^{2^d} \frac{(n_k - n\tilde P_k)^2}{n\tilde P_k} \tag{5.2}$$

is asymptotically chi-squared with $2^d - 2$ degrees of freedom. If $\tilde P_k$ depends on s (s > 1) estimated parameters, then the generalization is that (5.2) is asymptotically chi-squared with $2^d - s - 1$ degrees of freedom.
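The statistic (5.2) and its loglikelihood ratio counterpart are simple to compute once the fitted pattern probabilities are available. The Python sketch below is a minimal illustration under stated assumptions: the function names and the toy counts are hypothetical, and the fitted probabilities come from an independence model with the two marginal probabilities estimated from the data (so s = 2), which is used here only because its pattern probabilities can be written down by hand.

```python
import math


def pearson_x2(counts, probs):
    """Pearson statistic: sum_k (n_k - n*P_k)^2 / (n*P_k), as in (5.2)."""
    n = sum(counts)
    return sum((nk - n * pk) ** 2 / (n * pk) for nk, pk in zip(counts, probs))


def loglik_ratio_g2(counts, probs):
    """G^2 = 2 * sum_k n_k * log(n_k / (n*P_k)); empty cells contribute 0."""
    n = sum(counts)
    return 2.0 * sum(nk * math.log(nk / (n * pk))
                     for nk, pk in zip(counts, probs) if nk > 0)


def gof_degrees_of_freedom(d, s):
    """Degrees of freedom 2^d - s - 1 for d binary responses, s parameters."""
    return 2 ** d - s - 1


# Toy bivariate binary data (hypothetical); patterns (0,0), (0,1), (1,0), (1,1).
counts = [40, 10, 20, 30]
n = sum(counts)
p1 = (counts[2] + counts[3]) / n  # estimated P(Y1 = 1)
p2 = (counts[1] + counts[3]) / n  # estimated P(Y2 = 1)

# Fitted pattern probabilities under independence of Y1 and Y2.
fitted = [(1 - p1) * (1 - p2), (1 - p1) * p2, p1 * (1 - p2), p1 * p2]

x2 = pearson_x2(counts, fitted)
g2 = loglik_ratio_g2(counts, fitted)
df = gof_degrees_of_freedom(d=2, s=2)
```

The observed value of either statistic would then be referred to a chi-squared distribution with `df` degrees of freedom; that step is omitted to keep the sketch free of external libraries.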
A more general situation is that Y depends on a covariate of g categories. For each category of the covariate, we have the situation of (5.2). If we assume independence between the categories of the covariate, we can form an overall Pearson $\chi^2$ type test statistic for the goodness-of-fit of the model as

$$X^2 = \sum_{v=1}^{g} \sum_{k=1}^{2^d} \frac{(n_k^{(v)} - n^{(v)}\tilde P_k^{(v)})^2}{n^{(v)}\tilde P_k^{(v)}}, \tag{5.3}$$

where v is the index of the categories in the covariate. Suppose we estimated s parameters in the model; thus $\tilde P_k^{(v)}$ depends on s parameters. Under the hypothesis that the specified model is the true model, the test statistic $X^2$ in (5.3), with some regularity conditions (e.g. efficient estimates), is asymptotically $\chi^2_{g(2^d-1)-s}$, where g is the number of categories of the covariate, and s is the total number of parameters estimated in the model. Similarly, an overall loglikelihood ratio type statistic

$$G^2 = 2\sum_{v=1}^{g} \sum_{k=1}^{2^d} n_k^{(v)} \log\left[n_k^{(v)} / (n^{(v)}\tilde P_k^{(v)})\right] \tag{5.4}$$

is also asymptotically $\chi^2_{g(2^d-1)-s}$. $X^2$ and $G^2$ are asymptotically equivalent, but they are not the same in the finite-sample case, so sometimes there is a question of which statistic to choose. Read and Cressie (1988) may shed some light on this matter. The computation of the test statistic $X^2$ or $G^2$ requires the calculation of $\tilde P_k^{(v)}$, which may not be easily obtained, depending on the copula associated with the model (for example, it is generally feasible with a mixture of max-id copula but only feasible with relatively low dimension for the multinormal copula, unless approximations are used).

One frequently encountered problem in applications with multivariate binary or ordinal categorical data (also count data) is that when the dimension of the response is relatively high, the empirical frequency for some particular outcomes of the response vector is relatively small or even zero. Thus $\tilde P_k$ or $\tilde P_k^{(v)}$ would usually be very small for any particular model, and (5.2) or (5.3) with its related statistical inferences are not suitable in these situations. What we may still do in terms of goodness-of-fit checking in these situations is to limit the comparison of $\tilde P_k(x_i)$ with $n_k(x_i)/n(x_i)$ by tables and graphics to outcomes of non-zero frequency (where $x_i$ is the covariate vector), or to calculate

$$X^2_{(a)} = \sum_{\{n_k \ge a\}} \frac{(n_k - \hat n_k)^2}{\hat n_k} \quad \text{or} \quad G^2_{(a)} = 2\sum_{\{n_k \ge a\}} n_k \log(n_k/\hat n_k), \tag{5.5}$$

where $\hat n_k = \sum_{i=1}^{n} \tilde P_k(x_i)$ and k represents the kth pattern of the response variables, and plot $X^2_{(a)}$ (or $G^2_{(a)}$) versus $a \in \{1, 2, 3, 4, 5\}$ to get a rough idea of how the model fits the non-zero frequency observations. The data obviously support the model if the observed values of $X^2_{(a)}$ (or $G^2_{(a)}$) go down quickly to zero, while large values indicate potential model departures. Obviously, in any case, some partial assessments using (5.2) or (5.3) may be done for some lower-dimensional margins where frequencies are sufficiently large. Sometimes, these kinds of goodness-of-fit checking may be used to retain a model while (5.5) is not helpful.

The statistics in (5.5) and related analysis can be applied to multivariate count data as well. Furthermore, a diagnostic check based on the residuals of the observed counts is also meaningful. If there are no covariates, quick and overall residual checking can be based on examining

$$\hat e_{ij} = y_{ij} - E[Y_{ij} \mid Y_{i,-j}; \tilde\theta] \tag{5.6}$$

for a particular fixed j, where $Y_{i,-j}$ means the response vector $Y_i$ with the jth margin omitted. The model is considered as adequate based on the residual plot in terms of goodness-of-fit if the residuals are small and do not exhibit systematic patterns. Note that the computation of $E[Y_{ij} \mid Y_{i,-j}; \tilde\theta]$ may not be a simple task when the dimension d is large (e.g. d > 3). Another rough check of the goodness-of-fit of a model for multivariate count data is to compare the empirical marginal means, variances and pairwise correlation coefficients with the corresponding means, variances and pairwise correlation coefficients calculated from the fitted model.
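The restricted statistics in (5.5) can be tabulated over the cut-off a with a few lines of code. The following Python sketch is illustrative only: the function name `truncated_gof`, the observed counts and the fitted frequencies are all hypothetical, and it assumes the restriction keeps the patterns whose observed frequency is at least a, so that a = 1 corresponds to the non-zero frequency patterns.

```python
import math


def truncated_gof(counts, fitted, a):
    """Return (X2_a, G2_a) summed over patterns with observed count >= a.

    `counts` holds the observed pattern frequencies n_k and `fitted` the
    model-based expected frequencies (sums of fitted pattern probabilities
    over the observations). Requires a >= 1 so that log(n_k) is defined.
    """
    if a < 1:
        raise ValueError("the cut-off a must be at least 1")
    kept = [(nk, ek) for nk, ek in zip(counts, fitted) if nk >= a]
    x2 = sum((nk - ek) ** 2 / ek for nk, ek in kept)
    g2 = 2.0 * sum(nk * math.log(nk / ek) for nk, ek in kept)
    return x2, g2


# Hypothetical observed and fitted frequencies for six response patterns.
counts = [30, 0, 5, 1, 2, 12]
fitted = [28.0, 0.6, 6.1, 1.4, 1.9, 12.0]

# X2_(a) for a = 1, ..., 5, ready to be plotted against a.
x2_profile = [truncated_gof(counts, fitted, a)[0] for a in range(1, 6)]
```

Because every retained term of the Pearson-type sum is non-negative, the profile of $X^2_{(a)}$ cannot increase with a; the individual terms of $G^2_{(a)}$ can be negative, so its profile need not be monotone.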
In principle, a model can be forced to fit the data increasingly well by increasing its number of parameters. However, the fact that the fitting errors are small is no guarantee that the prediction errors will be. Many of the terms in a complex model may simply be accounting for noise in the data. The overfitted models may predict future values quite poorly. Thus to arrive at a model which represents only the main features of the data, selection and diagnostic criteria which balance model complexity and goodness-of-fit must be used simultaneously. As we have discussed, often there are many relevant models that provide an acceptable approximation to reality or data. The purpose of statistical modelling is not to get the "true" model, but rather to obtain one or several models which extract the most information and better serve the inference purposes.

5.1.4 Testing the dependence structure

We next discuss a topic related to model identification. Short series of longitudinal data or repeated measures with many subjects often exhibit a highly structured pattern of dependence, with the dependence usually becoming weaker as the time separation (if the observation point is time) increases. Valid inferences can be made by borrowing strength across subjects. That is, the consistency of a pattern across subjects is the basis for substantive conclusions. For this reason, inferences from longitudinal or repeated measures studies can be made more robust to model assumptions than those from time series data, particularly to assumptions about the nature of the dependence.

There are many possible structures for longitudinal or repeated measures type dependence. The exchangeable or AR(1)-like dependence structures are the simplest. But in a particular situation, how can we test whether a particular dependence structure is plausible? The AIC for model comparison may be a useful index.
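In the following, we provide an alternative approach for testing special dependence structures. For this purpose, we first give a definition and state two results that we are going to use in the later development. A reference for these materials is Rao (1973).

Definition 5.1 (Generalized inverse of a matrix) A generalized inverse of an n × m matrix A of any rank is an m × n matrix denoted by $A^-$ which satisfies the following equality: $AA^-A = A$. •

Result 5.1 (Spectral decomposition theorem) Let A be a real n × n symmetric matrix. Then there exists an orthogonal matrix Q such that Q'AQ is a diagonal matrix whose diagonal elements $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ are the characteristic roots of A, that is

$$Q'AQ = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}.$$ •

Result 5.2 If $X \sim N_p(\boldsymbol{\mu}, \Sigma_X)$, and $\Sigma_X$ is positive semidefinite, then a set of necessary and sufficient conditions for $X'AX \sim \chi^2_r(\delta^2)$ is (i) $\mathrm{tr}(A\Sigma_X) = r$ and $\boldsymbol{\mu}'A\boldsymbol{\mu} = \delta^2$, (ii) $\Sigma_X A \Sigma_X A \Sigma_X = \Sigma_X A \Sigma_X$, (iii) $\boldsymbol{\mu}'A\Sigma_X A\boldsymbol{\mu} = \boldsymbol{\mu}'A\boldsymbol{\mu}$, (iv) $\boldsymbol{\mu}'(A\Sigma_X)^2 = \boldsymbol{\mu}'A\Sigma_X$. Here $\chi^2_r(\delta^2)$ denotes the non-central chi-square distribution with r degrees of freedom and noncentrality parameter $\delta^2$. •

In the following, we are going to build up a general statistical test, which in turn can be used to test exchangeable or AR(1)-type dependence assumptions. Suppose $X \sim N_p(\boldsymbol{\mu}, \Sigma_X)$ where $\Sigma_X$ is known. We want to test if $\boldsymbol{\mu} = \mu\mathbf{1}$, where $\mu$ is a constant. Let $a = \Sigma_X^{-1}\mathbf{1} / \mathbf{1}'\Sigma_X^{-1}\mathbf{1}$; then

$$X - a'X\mathbf{1} = (X_1 - a'X, \ldots, X_p - a'X)' = BX,$$

where $B = I - \mathbf{1}\mathbf{1}'\Sigma_X^{-1} / \mathbf{1}'\Sigma_X^{-1}\mathbf{1}$, and I is the identity matrix. Thus $BX \sim N_p(B\boldsymbol{\mu}, B\Sigma_X B')$. It is easy to see that Rank(B) = p − 1, which implies that Rank($B\Sigma_X B'$) = p − 1. By Result 5.1, there is an orthogonal matrix Q such that

$$B\Sigma_X B' = Q \begin{pmatrix} \lambda_1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & \lambda_{p-1} & 0 \\ 0 & \cdots & 0 & 0 \end{pmatrix} Q',$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{p-1} > 0$. Let

$$A = Q \begin{pmatrix} \lambda_1^{-1} & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & \lambda_{p-1}^{-1} & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix} Q';$$

then A is a full rank matrix. It is also easy to show that A is a generalized inverse of $B\Sigma_X B'$, and that all the conditions in Result 5.2 are satisfied. We thus have

$$X'B'ABX \sim \chi^2_{p-1}(\delta^2),$$

where $\delta^2 = \boldsymbol{\mu}'B'AB\boldsymbol{\mu}$, $\delta^2 \ge 0$.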
b2 = 0 is true if and only if Bp = 0, and this in turn is true if and only if p = pi, that is p should be an equal constant vector. Thus under the null hypothesis p = pi, we should have X'B'ABX-xl-i, where xP-i means central chi-square distribution with p — 1 degrees of freedom. Now we use an example to illustrate the use of above results. Example 5.1 Suppose we choose the multivariate logit model with multinormal copula (3.1) with correlation matrix 0 = (Ojk) to model the d-dimensional binary observations y1;.. .,y„. We want to know if an exchangeable (that is Ojk = 0 for all 1 < < k < d and for some \0\ < 1) or an AR(1) Chapter 5. Modelling, data analysis and examples 181 (that is 9jk = 0^ for all 1 < j < k < d and for some |t9| < 1) correlation matrix in the multinormal copula is the suitable assumptions. The above results can be used to test these assumptions. Let $W be the IFME of 0 from the (j,k) bivariate margin, and ~6 = (0~(12\ 0~(13\ ..., fifa-1.*)). By Theorem 2.4, we have asymptotically ~6~ Nd{d_1)/2(01,X0), where is the inverse of Godambe information matrix of 0. Thus under the exchangeable or AR(1) assumptions of 0, we have asymptotically where 0B'AB~e~x\i-i),2-i, a -5 = 7-1 l'Er1 6 6 A = Q 0 \ 0 ^d(ci-l)/2-l 0 0 1/ Q', and Q is an orthogonal matrix from the spectral decomposition /Ax ... 0 0\ where Ai > A2 > • • • > Xd(d-i)/2-i > 0. 0 ••• Ad(d_1)/2_i 0 \0 ••• 0 0/ The above results are valid for large samples, and can be used in the applications for a rough judgement about the special dependence structure assumptions, though would typically have to be estimated from the data. 5.2 Data analysis examples In this section, we apply and compare some models developed in Chapter 3 on some real data sets, and illustrate the estimation procedures of Chapter 2. Following the discussion in section 5.1, the examples show the stages of the data analysis cycle and the special features related to the specific type of data. Chapter 5. 
5.2.1 Example with multivariate/longitudinal binary response data

In this subsection, several models for multivariate binary response data with covariates are applied to a subset of a data set from the "Six Cities Study" discussed and analyzed by Ware et al. (1984) and Stram et al. (1988). The Six Cities Study is a longitudinal investigation of the effects of indoor and outdoor air pollution on respiratory health. As in most longitudinal studies, there were missing data for some subjects. In this analysis we consider a subset of the data with no missing values, on the occurrence of persistent wheeze (graded as wheeze 1 and none 0) in 1020 children followed yearly from ages 9 to 12 in two different cities in the US: Kingston-Harriman, Tennessee (KHT), and Portage, Wisconsin (PW).

The outdoor air pollution is measured by the children's residence location, that is, the two cities. These two cities have very different ambient air quality. KHT (coded as 1 in the data set) is influenced by air pollution from several metropolitan and industrial areas, and thus has relatively high average concentrations of fine particulate matter and acid aerosols. PW (coded as 0 in the data set) is located in a region that has relatively low concentrations of these pollutants. Indoor pollution is measured by the level of maternal smoking, graded as 1 (> 10 cigarettes) or 0 (< 10 cigarettes). Let us call the outdoor air pollution variable "City", and the indoor pollution variable "Smoking". Smoking is a time-dependent covariate since the level of maternal smoking may vary from year to year, and City is considered a time-independent covariate (for the four-year period) since no one in the study moved over the four years. More documentation of the study can be found in Ware et al. (1984) and Stram et al. (1988).
Some of the potential scientific questions are: (1) Does the prevalence of wheeze differ between cities or smoking groups? If so, does the difference change over time? If the effects are constant over time, how should they be estimated? (2) How should the rate of respiratory disease for children whose mothers smoke be compared to the rate for children whose mothers do not smoke?

Tables 5.1 - 5.3 summarize the initial data analysis. Table 5.1 provides the univariate summaries of the data, with the percentages of 1's for the binary response and predictor variables (City, and Smoking at 4 time points, which we denote by Smoking9, Smoking10, Smoking11 and Smoking12). We see, from the response variables Age 9 to Age 12, that the incidence of persistent wheeze decreases slightly across the ages. The same is true for the maternal smoking levels. Table 5.2 contains the frequencies of the response vector at the 4 time points when ignoring the effects of the covariates. Table 5.3 has the pairwise log odds ratios for the response variables, ignoring the covariates; in addition to Table 5.2, it gives some indication of the amount of dependence in the response variables. Table 5.3 indicates that the dependence for consecutive years is larger.

Multivariate binary response models that were used to model the data include

1. The multivariate logit model from section 3.1, with
   a. multinormal copula (3.1),
   b. multivariate Molenberghs-Lesaffre construction
      i. with bivariate normal copula,
      ii. with Plackett copula (2.8),
      iii. with Frank copula (2.9),
   c. mixture of max-id copula (3.3),
   d. the permutation symmetric copula (3.8).
2. The multivariate probit model with multinormal copula.
The multivariate logit-normal model (an MMD model) was also fitted to this data set, but in this fit the estimates of the variance parameters ($\sigma_j^2$, $j = 1,2,3,4$) all go to 0, which in fact reduces the model to an MCD model. We will therefore not pursue MMD models further with this data set; only the results of the MCD model fitting are reported here.

Since we have the covariates City and Smoking, there is a question of how to include these variables in the models. For subject $i$ ($i = 1, \ldots, 1020$), the cut-off points are $z_{ij}$ ($j = 1,2,3,4$) for a univariate probit or logit model. A suitable way to make the cut-off points functions of the covariates is to let

$$z_{ij} = \alpha_{j0} + \alpha_{j1}\,\mathrm{City}_i + \alpha_{j2}\,\mathrm{Smoking}_{ij}.$$

To let the dependence parameters be functions of covariates is more complicated, and many possibilities are open. A simple approach is to let the dependence parameters be independent of the covariates. This may serve the general modelling purpose in many situations while keeping the model simple. Besides this simple approach, partly for illustrative purposes, we also examine the situation where the dependence parameters depend on the covariate City.

For model (1a), the dependence parameters are $\theta_{i,jk}$ for subject $i$, $1 \le j < k \le 4$. There are many ways to include covariates in the dependence parameters $\theta_{i,jk}$, as we have discussed in section 3.1 for model (1a). For a general dependence structure, we may simply let

$$\theta_{i,jk} = \frac{\exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i) - 1}{\exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i) + 1}.$$

Two other dependence structures appropriate for this data set (suggested by the nature of the study and the initial data analysis) are the exchangeable and AR(1) type structures, with $\Theta_i = (\theta_{i,jk})$ for the $i$th subject. The exchangeable situation is $\theta_{i,jk} = \theta_i$ for some $|\theta_i| < 1$. The AR(1) situation is $\theta_{i,jk} = \theta_i^{|j-k|}$ for some $|\theta_i| < 1$. In both situations, we let

$$\theta_i = \frac{\exp(\beta_0 + \beta_1\,\mathrm{City}_i) - 1}{\exp(\beta_0 + \beta_1\,\mathrm{City}_i) + 1}.$$
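The link between the regression parameters and the correlation, and the two special correlation structures, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the coefficient values in the usage lines are the exchangeable "with covariate" estimates reported in Table 5.5 (intercept 1.948, City coefficient -0.254).

```python
import math


def correlation_from_predictor(beta0, beta1, city):
    """Map the linear predictor beta0 + beta1 * city into (-1, 1).

    The map (exp(eta) - 1) / (exp(eta) + 1) is the same as tanh(eta / 2).
    """
    eta = beta0 + beta1 * city
    return (math.exp(eta) - 1.0) / (math.exp(eta) + 1.0)


def correlation_matrix(theta, d, structure):
    """Build a d x d exchangeable or AR(1) correlation matrix from theta."""
    if structure not in ("exchangeable", "ar1"):
        raise ValueError("structure must be 'exchangeable' or 'ar1'")
    mat = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(d):
            if j != k:
                if structure == "exchangeable":
                    mat[j][k] = theta
                else:
                    mat[j][k] = theta ** abs(j - k)
    return mat


if __name__ == "__main__":
    # Exchangeable structure, with covariate (estimates from Table 5.5).
    for city in (0, 1):
        theta = correlation_from_predictor(1.948, -0.254, city)
        print(city, round(theta, 3), correlation_matrix(theta, 4, "exchangeable")[0])
```

Because the link is increasing in the linear predictor, a negative City coefficient translates directly into a smaller correlation for City = 1 than for City = 0.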
For models (1bi), (1bii) and (1biii), we first let the higher order ($\ge 3$) parameters $\eta_{i,jkl}$ and $\eta_{i,1234}$ be constant, say 1. (This is usually good enough for practical purposes; refer to section 3.1.) We next let the parameters appearing in the bivariate copulas be functions of covariates. For model (1bi), the dependence parameters in the bivariate copulas are $\theta_{i,jk}$. Since the $\theta_{i,jk}$ are correlation coefficients in bivariate normal copulas, we let

$$\theta_{i,jk} = \frac{\exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i) - 1}{\exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i) + 1}.$$

For model (1bii), the $\delta_{i,jk}$ are the parameters in the Plackett copulas; we let $\delta_{i,jk} = \exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i)$. For model (1biii), the $\delta_{i,jk}$ are the dependence parameters in the bivariate Frank copulas; we let $\delta_{i,jk} = \exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i)$.

For model (1c), the dependence parameters are $\theta_i$ and $\delta_{i,jk}$ ($1 \le j < k \le 4$). (We let the parameter of asymmetry $\nu_{ij} = 0$ for all $i$ and $j$.) $\theta_i$ represents a general minimum level of dependence, and $\delta_{i,jk}$ represents bivariate dependence exceeding the minimum dependence. For the dependence parameters, we let $\delta_{i,jk} = \exp(\beta_{jk,0} + \beta_{jk,1}\,\mathrm{City}_i)$ and let $\theta_i = \exp(\beta_0)$ be independent of covariates. For model (1d), the dependence parameters are $\theta_i$; we let $\theta_i = \exp(\beta_0 + \beta_1\,\mathrm{City}_i)$. For model (2), the dependence structure is the same as for model (1a).

We use "l" to denote the logit model and "p" to denote the probit model. For the univariate marginal regressions, at least two situations could be considered: regression coefficients differing across margins (or times), denoted by "md", and regression coefficients common across margins (or times), denoted by "mc". For the regression of the dependence parameters, for models (1a) and (2), we consider the general (denoted by "g"), exchangeable (denoted by "e") and AR(1) (denoted by "a") dependence structures. We also consider the situations with a covariate (denoted by "wc") and with no covariate (denoted by "wn") for the dependence parameters.
Thus a total of 12 submodels of model (1a) are considered; they are l.md.g.wc, l.md.g.wn, l.md.e.wc, l.md.e.wn, l.md.a.wc, l.md.a.wn, l.mc.g.wc, l.mc.g.wn, l.mc.e.wc, l.mc.e.wn, l.mc.a.wc and l.mc.a.wn, where for example "l.md.g.wc" stands for the multivariate logit model with marginal regression coefficients differing across margins and a general dependence structure with covariates. There are also 12 submodels of model (2): p.md.g.wc, p.md.g.wn, p.md.e.wc, p.md.e.wn, p.md.a.wc, p.md.a.wn, p.mc.g.wc, p.mc.g.wn, p.mc.e.wc, p.mc.e.wn, p.mc.a.wc and p.mc.a.wn. For models (1bi), (1bii), (1biii), (1c) and (1d), the AR(1) type latent dependence structure may not be well-defined. In any case, to avoid repeating similar analyses, within models (1bi), (1bii), (1biii), (1c) and (1d) we will only consider submodels with a structure similar to those retained by the analysis with models (1a) and (2).

For all the models except (1d), the IFM estimation theory is applied. That is, the univariate (regression) parameters are estimated from separate univariate likelihoods (using the Newton-Raphson method), and bivariate and multivariate (regression) parameters are estimated from bivariate likelihoods, using a quasi-Newton optimization routine, with the univariate parameters fixed at their estimates from the separate univariate likelihoods. Furthermore, for the situation "mc" of common marginal regression coefficients, and for the exchangeable (or AR(1) if applicable) dependence structure, the weighting approach (WA) of (2.93) in section 2.6 for parameter estimation based on IFM is used. It is also used for estimating the parameter $\beta_0$ in $\theta = \exp(\beta_0)$ in model (1c), since $\theta$ is an overall parameter common across all margins. Notice that only one choice of parametric families for $\psi$ and the $K_{jk}$'s was used; it is expected that other choices could lead to a better fit according to AIC.
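The first IFM stage described above, a separate univariate likelihood maximized by Newton-Raphson, can be sketched as follows. This is a minimal sketch and not the thesis code: the function name `fit_logit_newton` is hypothetical, and for brevity the model has an intercept and a single covariate, whereas the margins in this analysis involve both City and Smoking. The toy data in the usage lines are made up for illustration.

```python
import math


def fit_logit_newton(x, y, tol=1e-10, max_iter=50):
    """Maximum likelihood for P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0                  # score vector
        h00 = h01 = h11 = 0.0          # negative Hessian (information matrix)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton-Raphson step: solve the 2 x 2 system H * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    # Illustrative data: 3 of 10 responses equal 1 when x = 0, 6 of 10 when x = 1.
    x = [0] * 10 + [1] * 10
    y = [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4
    print(fit_logit_newton(x, y))
```

In the second stage, the marginal estimates obtained this way would be held fixed while the dependence parameters are estimated from the bivariate likelihoods.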
Model (1d) has a copula with a closed form cdf and only one dependence parameter (or one set of regression parameters related to it), so MLEs are computed in this situation. Model (1d) is used here to compare a simple permutation symmetric MCD model with the other models, which all allow a general dependence structure. Models (1c) and (1d) have the advantage of having a copula with a closed form cdf; this is particularly convenient for dealing with multivariate discrete data of high dimension, as it leads to faster computation of probabilities of the form $\Pr(Y = y)$ or $\Pr(Y = y \mid x)$.

For standard errors (SEs) of parameter estimates and prediction probabilities, the jackknife method from Chapter 2 is used with 255 random groups of 4. Furthermore, the weights used for WA for common parameter estimation are based on the jackknife SEs, and these weights in turn are used, based on (2.93), in each step of the jackknife parameter estimation.

Summaries of the fits of the models are given in several tables. Table 5.4 contains the estimates and SEs of the regression parameters for the marginal parameters with the logit model, when the regression parameters are considered either to differ or to be common across the margins. Table 5.5 contains the estimates and SEs of the regression parameters for the dependence parameters under various settings for the multivariate logit model with multinormal copula (model (1a)). Table 5.6 contains AIC values and $X^2$ values (calculated based on (5.5) with a = 0) for all the submodels of the multivariate logit and probit models with multinormal copula (that is, models (1a) and (2)). Care must be taken in the comparison, since the AICs here are not calculated from the ML of all parameters simultaneously; the parameter estimates are IFMEs. The AIC values and $X^2$ values for corresponding submodels of models (1a) and (2) are comparable; this echoes the well-known fact that the univariate probit and logit models are comparable.
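The jackknife with random groups used for the SEs can be sketched generically as below. This is a hedged sketch rather than the Chapter 2 procedure: the function names are hypothetical, the estimator is an arbitrary scalar function of the data, and the factor (g - 1)/g is the usual delete-a-group scaling, which is an assumption here and may differ from the formula in Chapter 2.

```python
import random


def sample_mean(values):
    """A simple scalar estimator used for illustration."""
    return sum(values) / len(values)


def grouped_jackknife_se(data, estimator, group_size, seed=0):
    """Delete-a-group jackknife SE with randomly formed groups.

    Returns (standard error, number of groups). The (g - 1) / g scaling is
    an assumption of this sketch.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    groups = [indices[s:s + group_size]
              for s in range(0, len(indices), group_size)]
    g = len(groups)
    full_estimate = estimator(data)
    sum_sq = 0.0
    for group in groups:
        dropped = set(group)
        kept = [data[i] for i in range(len(data)) if i not in dropped]
        sum_sq += (estimator(kept) - full_estimate) ** 2
    return ((g - 1) / g * sum_sq) ** 0.5, g


if __name__ == "__main__":
    # 1020 illustrative binary observations, in random groups of 4.
    toy = [1 if i % 4 == 0 else 0 for i in range(1020)]
    print(grouped_jackknife_se(toy, sample_mean, group_size=4))
```

With 1020 observations and groups of 4 this produces the 255 groups mentioned in the text; for the models of this section the estimator would be the full IFM fit recomputed with each group deleted.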
We thus only compare the submodels within the multivariate logit model. From examining the AIC and $X^2$ values for the 12 models, the models l.md.g.wn, l.md.e.wc, l.md.e.wn and l.mc.g.wn seem to stand out as interesting choices. Since l.md.e.wc and l.md.e.wn are about the same in terms of AIC and $X^2$ values, and l.md.e.wn is simpler than l.md.e.wc, we only consider l.md.e.wn. At this stage, three models are retained for further inspection: l.md.g.wn, l.md.e.wn and l.mc.g.wn.

Table 5.7 contains AIC values and Table 5.8 contains $X^2$ values of submodels l.md.g.wn, l.md.e.wn and l.mc.g.wn of models (1a), (1bi), (1bii), (1biii), (1c) and (1d). These two tables suggest that the models are comparable in general, with models (1c) and (1d) performing relatively poorly; possibly other parametric families of mixture of max-id copulas would do better. Model (1bi) seems to be the best for this data set. Note that since there is only one dependence structure with model (1d), the submodels l.md.g.wn and l.md.e.wn are equivalent in this case.

Table 5.9 contains estimates and SEs of the bivariate dependence parameters of the submodel l.md.g.wn of models (1bi), (1bii), (1biii) and (1c). This table and Table 5.5 also suggest that the models are comparable; the conclusions about which bivariate margins are more or less dependent are the same across the models. They show that the dependence for consecutive years is slightly stronger; this is also observed in Table 5.3 in the initial data analysis. (Note also the closeness of the dependence parameter estimates for model (1bii) to the empirical pairwise log odds ratios in Table 5.3.) For comparison, the estimate of the dependence parameter for model (1d), with a permutation symmetric copula, is 1.719 with SE equal to 0.067. As we have pointed out, model (1d) does not perform as well as the other models with this data set.
This may indicate that, even though a model with an exchangeable dependence structure may be acceptable, a better model has a general dependence structure. All these models are quite similar in terms of computer time for the parameter estimation (because of the IFM approach), but models (1a) and (2) used much more computer time than the other models to compute AIC, $X^2$ and the prediction probabilities, since 4-dimensional integrations were involved.

For the predicted probabilities and inference, and also as a supplement to the $X^2$ values for an assessment of goodness of fit, Table 5.10 contains estimates of probabilities of the form $\Pr(Y = y)$ for all possible $y$ from submodels l.md.g.wn, l.md.e.wn and l.mc.g.wn of model (1a), and Table 5.11 contains estimates of probabilities of the form $\Pr(Y = y)$ for all possible $y$ from submodel l.md.g.wn of models (1a), (1bi), (1bii), (1biii), (1c) and (1d). These $\Pr(Y = y)$ are estimated by $\sum_{i=1}^{1020} \Pr(Y = y \mid x_i)/1020$. Table 5.12 contains estimates of probabilities of the form $\Pr(Y = y \mid x)$ for various $y$ and $x$ from submodels l.md.g.wn, l.md.e.wn and l.mc.g.wn of model (1a), and Table 5.13 contains estimates of probabilities of the form $\Pr(Y = y \mid x)$ for various $y$ and $x$ from submodel l.md.g.wn of models (1a), (1bi), (1bii), (1biii), (1c) and (1d). In Table 5.12, n* is the subset size for the specific value of $x$ and "rel. freq" is the observed relative frequency of the given $y$ under that value of $x$. In Table 5.13, to save space, only the maximum estimated SE over the different models is given for each line; the SEs are in fact quite close to each other. The selected $x$ and $y$ values in Table 5.12 and Table 5.13 are common values in the data set. Tables 5.10 - 5.13 suggest that the submodels l.md.g.wn, l.md.e.wn and l.mc.g.wn are all adequate for predictive purposes, since the prediction probabilities are comparable when the SEs are taken into account.
These three submodels can be used to complement each other for slightly different inference purposes. The large divergences in estimated probabilities occur with model (1d), and only in the case where the vector $x$ is at the extreme of the covariate space, for example $x = (1, 0, 0, 0, 0)$ and $x = (0, 0, 0, 0, 0)$ for $y = (1,1,1,1)$. A simple exchangeable dependence model, l.md.e.wn, is among the three retained submodels l.md.g.wn, l.md.e.wn and l.mc.g.wn. An explanation for this may be that the dependences in the bivariate margins are different, but not different enough to make a difference in the prediction probabilities. Another possibility is the dominance of the response vector (0,0,0,0).

The analysis (e.g. from submodel l.md.g.wn) indicates a slight decline in the rate of wheeze over time (the intercepts in Table 5.4 for regression parameters differing across margins decrease gradually over time from -1.090 to -1.564) and a moderate increase in wheeze for children of mothers who smoke (the corresponding regression parameters increase over time from 0.144 to 0.444) and for the city with pollution (the corresponding regression parameters increase over time from 0.003 to 0.209). There is an indication that the excess of maternal smoking and the city with pollution both increase significantly the probability of the occurrence of wheeze (e.g. from submodel l.mc.g.wn). This is also consistent with the observation in Ware et al. (1984), where it is believed that maternal smoking is predictive of respiratory illness. If we study the model with a covariate (City) for the dependence (e.g. the submodel l.md.e.wc), we see that a high city pollution level has a negative effect on the correlation; it possibly means that a low level of city pollution leads to a slightly higher correlation in the occurrence of persistent wheeze. We can interpret this as saying that wheeze occurrence not caused by pollution is more stable over time.
The analysis indicates that the dependence for consecutive years is stronger and that the pairwise dependences are all significant. The rate of respiratory disease for children whose mothers smoke heavily is higher than the rate for children whose mothers do not smoke or smoke only slightly. This can be seen from Table 5.12 (with the l.md.g.wn submodel), where for example for $y = (1,1,1,1)$, $P(y \mid x^{(a)}) = 0.099 > 0.071 = P(y \mid x^{(b)})$ where $x^{(a)} = (0,1,1,1,1)$ and $x^{(b)} = (0,0,0,0,0)$, and $P(y \mid x^{(c)}) = 0.118 > 0.085 = P(y \mid x^{(d)})$ where $x^{(c)} = (1,1,1,1,1)$ and $x^{(d)} = (1,0,0,0,0)$. Similarly, we also observe that the rate of no persistent wheeze for children whose mothers smoke is lower than the rate for children whose mothers do not smoke (e.g. for $y = (0,0,0,0)$, $P(y \mid x^{(e)}) = 0.606 > 0.541 = P(y \mid x^{(f)})$ where $x^{(e)} = (0,0,0,0,0)$ and $x^{(f)} = (0,1,1,1,1)$). Also similarly, the rate of persistent wheeze for children who reside in the city with more pollution is higher than the rate for children who reside in the city with less pollution, e.g., for $y = (1,1,1,1)$, $P(y \mid x^{(g)}) = 0.118 > 0.099 = P(y \mid x^{(h)})$ where $x^{(g)} = (1,1,1,1,1)$ and $x^{(h)} = (0,1,1,1,1)$, or $P(y \mid x^{(i)}) = 0.085 > 0.071 = P(y \mid x^{(j)})$ where $x^{(i)} = (1,0,0,0,0)$ and $x^{(j)} = (0,0,0,0,0)$. More detailed comparisons for different situations can be made.

Table 5.1: Six Cities Study: Percentages for binary variables

Variables    # 1's   Percentage
Age 9         266     26.07%
Age 10        256     25.09%
Age 11        241     23.62%
Age 12        217     21.27%
City          512     50.19%
Smoking9      325     31.86%
Smoking10     313     30.68%
Smoking11     311     30.49%
Smoking12     309     30.29%

Table 5.2: Six Cities Study: Frequencies of the response vector (Age 9, 10, 11, 12)

Response pattern   Observed numbers   Relative frequency
1 1 1 1                   95               0.093
1 1 1 0                   30               0.029
1 1 0 1                   15               0.015
1 1 0 0                   28               0.027
1 0 1 1                   14               0.014
1 0 1 0                    9               0.009
1 0 0 1                   12               0.012
1 0 0 0                   63               0.062
0 1 1 1                   19               0.019
0 1 1 0                   15               0.015
0 1 0 1                   10               0.010
0 1 0 0                   44               0.043
0 0 1 1                   17               0.017
0 0 1 0                   42               0.041
0 0 0 1                   35               0.034
0 0 0 0                  572               0.561
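The pairwise odds ratios summarized in Table 5.3 can be recomputed directly from the response-pattern frequencies in Table 5.2. The short Python sketch below does this; the counts are those of Table 5.2, while the variable and function names are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

# Response patterns (Age 9, 10, 11, 12) and observed counts from Table 5.2.
COUNTS = {
    "1111": 95, "1110": 30, "1101": 15, "1100": 28,
    "1011": 14, "1010": 9, "1001": 12, "1000": 63,
    "0111": 19, "0110": 15, "0101": 10, "0100": 44,
    "0011": 17, "0010": 42, "0001": 35, "0000": 572,
}
AGES = (9, 10, 11, 12)


def pairwise_odds_ratio(counts, j, k):
    """Odds ratio of the 2 x 2 table for positions j and k of the pattern."""
    table = [[0, 0], [0, 0]]
    for pattern, freq in counts.items():
        table[int(pattern[j])][int(pattern[k])] += freq
    return (table[1][1] * table[0][0]) / (table[1][0] * table[0][1])


if __name__ == "__main__":
    for j, k in combinations(range(4), 2):
        odds_ratio = pairwise_odds_ratio(COUNTS, j, k)
        print(f"Age {AGES[j]},{AGES[k]}: odds ratio {odds_ratio:.2f}, "
              f"log odds ratio {math.log(odds_ratio):.2f}")
```

The same dictionary also reproduces the marginal counts of Table 5.1, for example by summing the counts of all patterns whose first digit is 1 to obtain the number of children with wheeze at age 9.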
Similar results to the partial interpretations given above are also obtained in the literature on the analysis of a similar data set from the same study; see for example Fitzmaurice and Laird (1993), Zeger et al. (1988) and Stram et al. (1988).

Table 5.3: Six Cities Study: Pairwise log odds ratios for Age 9, 10, 11, 12

Pair         odds ratio   log odds ratio
Age 9,10       12.97          2.56
Age 9,11        8.91          2.19
Age 9,12        8.69          2.16
Age 10,11      13.63          2.61
Age 10,12      10.45          2.35
Age 11,12      14.83          2.69

Table 5.4: Six Cities Study: Estimates of marginal regression parameters for multivariate logit model

margin   intercept (SE)    city (SE)        smoking (SE)
differ across the margins
1        -1.090 (0.113)    0.003 (0.150)    0.144 (0.144)
2        -1.229 (0.120)    0.080 (0.148)    0.293 (0.155)
3        -1.412 (0.136)    0.311 (0.161)    0.237 (0.166)
4        -1.564 (0.123)    0.209 (0.155)    0.444 (0.166)
common across the margins
         -1.308 (0.061)    0.144 (0.077)    0.270 (0.078)

Table 5.5: Six Cities Study: Estimates of dependence regression parameters for multivariate logit model with multinormal copula

margin   intercept (SE)    city (SE)
general dependence, with covariate
12       2.156 (0.217)     -0.373 (0.288)
13       1.740 (0.183)     -0.178 (0.249)
14       1.583 (0.209)      0.041 (0.288)
23       2.334 (0.210)     -0.628 (0.287)
24       1.891 (0.214)     -0.294 (0.281)
34       2.079 (0.224)     -0.109 (0.287)
general dependence, without covariate
12       1.960 (0.143)
13       1.645 (0.124)
14       1.604 (0.143)
23       1.987 (0.143)
24       1.733 (0.139)
34       2.020 (0.142)
exchangeable dependence, with covariate
         1.948 (0.085)     -0.254 (0.114)
exchangeable dependence, without covariate
         1.815 (0.057)
AR(1) dependence, with covariate
         2.380 (0.086)     -0.258 (0.115)
AR(1) dependence, without covariate
         2.236 (0.057)
Table 5.6: Six Cities Study: Comparisons of AIC values and $X^2$ values from various submodels of models (1a) and (2)

Logit models   AIC         X^2       Probit models   AIC         X^2
l.md.g.wc      3642.882     7.862    p.md.g.wc       3642.810     7.863
l.md.g.wn      3637.415     7.992    p.md.g.wn       3637.347     7.993
l.md.a.wc      3659.683    42.472    p.md.a.wc       3659.615    42.480
l.md.a.wn      3660.126    42.329    p.md.a.wn       3660.063    42.331
l.md.e.wc      3641.662    19.941    p.md.e.wc       3641.585    19.950
l.md.e.wn      3641.637    19.926    p.md.e.wn       3641.561    19.939
l.mc.g.wc      3646.339    21.031    p.mc.g.wc       3646.396    21.086
l.mc.g.wn      3640.444    21.119    p.mc.g.wn       3640.499    21.175
l.mc.a.wc      3661.238    52.281    p.mc.a.wc       3661.308    52.386
l.mc.a.wn      3660.876    51.942    p.mc.a.wn       3660.938    52.044
l.mc.e.wc      3647.731    37.758    p.mc.e.wc       3647.784    37.839
l.mc.e.wn      3646.785    37.687    p.mc.e.wn       3646.833    37.771

Table 5.7: Six Cities Study: Comparisons of AIC values from various models

Models       (1a)        (1bi)       (1bii)      (1biii)     (1c)        (1d)
l.md.g.wn    3637.415    3631.991    3634.040    3636.773    3668.059    3703.763
l.md.e.wn    3641.637    3637.184    3638.884    3642.080    3663.932    (as l.md.g.wn)
l.mc.g.wn    3640.444    3633.903    3635.855    3638.423    3674.095    3708.404

Table 5.8: Six Cities Study: Comparisons of $X^2$ values from various models

Models       (1a)      (1bi)     (1bii)    (1biii)   (1c)      (1d)
l.md.g.wn     7.992     1.765     1.795     1.866    31.747    77.157
l.md.e.wn    19.926    15.832    15.485    16.131    36.315    (as l.md.g.wn)
l.mc.g.wn    21.119    13.959    14.249    14.436    50.613    96.223

Table 5.9: Six Cities Study: Estimates (SE) of dependence regression parameters from the submodel l.md.g.wn of various models

margin   (1bi)            (1bii)           (1biii)          (1c)
12       1.960 (0.143)    2.556 (0.173)    1.861 (0.094)    2.715 (0.202)
13       1.645 (0.124)    2.183 (0.153)    1.654 (0.090)    2.181 (0.183)
14       1.604 (0.143)    2.150 (0.178)    1.638 (0.107)    2.066 (0.179)
23       1.987 (0.143)    2.6
Statistical modelling and inference for multivariate and longitudinal discrete response data Xu, James Jianmeng 1996
Item Metadata
Title | Statistical modelling and inference for multivariate and longitudinal discrete response data |
Creator |
Xu, James Jianmeng |
Date | 1996 |
Date Issued | 2009-03-17 |
Description | This thesis presents research on modelling, statistical inference and computation for multivariate discrete data. I address the problem of how to systematically model multivariate discrete response data including binary, ordinal categorical and count data, and how to carry out statistical inference and computations. To this end, I relate the multivariate models to similar univariate models already widely used in applications and to some multivariate models that hitherto were available but scattered in the literature, and I introduce new classes of models. The main contributions in this thesis to multivariate discrete data analysis are in several distinct directions. In modelling of multivariate discrete data , we propose two new classification of multivariate parametric discrete models: multivariate copula discrete (MCD) models and multivariate mixture discrete (MMD) models. Numerous new multivariate discrete models are introduced through these two classes and several multivariate discrete models which have appeared in the literature are unified by these two classes. With appropriate choices of copulas, these two classes of models allow the marginal parameters and dependence parameters to vary with covariates in a natural way. By using special dependence structures, the models can be used for longitudinal data with short time series or repeated measures data. As a result, the scope of multivariate discrete data analysis is substantially broadened. In statistical inference and computation for multivariate models, we propose the inference function of margins (IFM) approach in which each inference function is a likelihood equation for some marginal distribution of a multivariate distribution. Examples where the approach applies are the multivariate logit model with the copulas having certain closure properties and the multivariate probit model for binary data. 
This general approach makes the estimation of parameters for the multivariate models computationally feasible. The corresponding asymptotic theory, the estimation of standard errors by the Godambe information matrix as well as the jackknife method, and the efficiency of the IFM approach relative to full multivariate likelihood function approach are studied. Particular attention has been given to the models with special dependence structure (e.g. the copula dependence structure is exchangeable or AR(1) type if applicable), and efficient parameter estimation schemes based on IFM (weighting approach and pool-marginal-likelihood approach) are developed. We also give detailed assessments of the efficiency of the GEE approach for estimating regression parameters in multivariate models; this is lacking in the literature. Detailed data analyses of existing data sets are provided to give concrete application of multivariate models and the statistical inference procedures in this thesis. |
Extent | 12783655 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | Eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-03-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0087914 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1996-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/6188 |
Aggregated Source Repository | DSpace |