A Comparison of Methods for Multivariate Familial Binary Responses by Abu Hena M. Mahbub-ul Latif M.Sc, Dhaka University, 1991 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia September 2001 © Abu Hena M. Mahbub-ul Latif, 2001 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of 5 T A T I $ T [ C $ The University of British Columbia Vancouver, Canada Date DE-6 (2/88) Abstract Among the existing methods for analysing the multivariate familial binary response, we discuss latent variable models and the estimating equations based methods. A brief description of the multivariate Plackett distribution is given and the role of this distribution in developing the esti-mating equations based methods is pointed out. The maximum likelihood and estimating equations based methods for estimating the parameters of the multivariate logistic model are compared. For this comparison, a simulation study examines the effects of the sample sizes, dependence struc-tures, the within-family dependence, etc. in estimating the parameters. The data are generated from the multivariate probit models. The multivariate logistic and probit models are compared for estimating conditional probabilities of interest in a genetics context and the respective standard errors. Numerical methods are used to estimate the parameters of the models considered. Because the original GEE2 code cannot handle multivariate binary data for arbitrary family structures, we have a new implementation of the GEE2 method for familial data; this routine used automatic differentiation for computing the Hessian matrix. C o n t e n t s Abstract ii Contents iii List of Tables vi List of Figures viii Acknowledgements ix Dedication x 1 Introduction 1 2 The Multivariate Probit Model 8 2.1 Univariate Probit Model 9 2.2 Bivariate Probit Model . . 11 2.3 Multivariate Probit Model . 13 2.4 Conditional Probabilities • 16 2.5 Summary 17 iii 3 The Mult ivariate Logistic Model 18 3.1 Plackett Distributions 19 3.1.1 The Bivariate Plackett Distribution 20 3.1.2 The 3-variate Plackett Construction 22 3.1.3 The d-variate Plackett Construction 24 3.2 The Regression Model 26 3.2.1 The Likelihood Function 28 3.3 McCullagh-Nelder-Glonek Approach 30 3.3.1 The Model and Parameter Estimation 30 3.4 Summary 33 4 Generalized Estimating Equations 36 4.1 Generalized Linear Models 37 4.1.1 The Model 37 4.1.2 Parameter Estimation 38 4.1.3 Multivariate Binary Response . . 40 4.2 Generalized Estimating Equations I 41 4.2.1 The Method and Parameter Estimation 42 4.3 Quadratic Exponential Family 44 4.3.1 Zhao-Prentice Method 46 4.4 Generalized Estimating Equations II • • 48 4.4.1 The Method 48 4.4.2 Estimating Equations 49 4.5 Summary 50 iv 5 Simulation Study 52 5.1 Numerical Optimization Methods 52 5.1.1 The Newton-Raphson Method 54 5.1.2 The Quasi-Newton Method 55 5.1.3 Newton method with Automatic Differentiation 57 5.2 Comparison of the Methods 59 5.2.1 Pedigree A 60 5.2.2 Pedigree B 62 5.2.3 Pedigree C 64 5.2.4 Pedigree D 65 5.2.5 Mixture of pedigrees A, B, and C 67 5.2.6 Estimating Conditional Probabilities 67 5.2.7 Comments on Computing Time 68 5.3 Summary 69 6 Discussion 77 Appendix A 81 A . l 81 A.2 82 Bibliography 83 v List of Tables 5.1 Average computing time (in mins.) by pedigrees and methods 70 5.2 Simulation results for Pedigree A with exchangeable correlation structure 70 5.3 Simulation results for Pedigree A with familial correlation structure 71 5.4 Simulation results for Pedigree A with exchangeable correlation structure and sample of size 600 71 5.5 Simulation results for Pedigree B with exchangeable correlation structure 72 5.6 Simulation results for Pedigree B with familial correlation structure 72 5.7 Simulation results for Pedigree C with exchangeable correlation structure 73 5.8 Simulation results for Pedigree C with familial correlation structure 73 5.9 Simulation results for Pedigree D with exchangeable correlation structure 74 5.10 Simulation results for Pedigree D with familial correlation structure 74 5.11 Simulation results for mixed Pedigree with exchangeable correlation structures. . . . 75 5.12 Simulation results for mixed Pedigree with familial correlation structures 75 5.13 Conditional probabilities for a family of Pedigree A with pss = 0.9, ppo = 0.4, X! = 53, X2 = 20, X 3 = 16, and X 4 = 14. 76 5.14 Conditional probabilities for a family of Pedigree B with pss — 0.9, ppo = 0.5, pD2 = 0.3, Xi = 68, X2 = 46, X3 = 14 and XA = 10 76 vi 5.15 Conditional probabilities for a family of Pedigree C with pss — 0.9, ppo — 0.5, pm = 0.3, Xi = 68, X2 = 46, X3 = 14, and X4 = 10 76 vii List of Figures 1.1 Classifications of the methods for multivariate binary responses 2 3.1 Contour plots for the bivariate normal distribution 34 3.2 Contour plots for the bivariate Plackett-normal distribution 35 5.1 Graphical diagram of Pedigree A 61 5.2 Graphical diagram of Pedigree B 63 5.3 Graphical diagram of Pedigree C ' .'. 64 5.4 Graphical diagram of Pedigree D 66 viii Acknowledgements I would like to thank Dr. Harry Joe for his guidance and supervision throughout the development of this thesis. I would also like to thank Dr. John Petkau for his careful reading of the manuscript and subsequent helpful suggestions. Many thanks to Dr. Lionel Pandolfo for his generous support. A B U H E N A M . M A H B U B - U L L A T I F The University of British Columbia September 2001 ix To my Parents. x Chapter 1 Introduction In quantitative genetics, researchers are often interested in identifying important variables related to the occurrence of a genetic disease of interest. Since relatives are genetically more alike, the association between the members of a family (familial aggregation) for the occurrence of the disease is also of interest. Accurate quantification of the familial aggregation leads to more sophisticated genetic studies. Depending on the type of the disease and the objective of the study, the response corresponding to the disease status can be continuous or discrete. For continuous responses (e.g. blood pressure, cholesterol level, etc.), statistical models are well developed (see [1]) to meet the objective of this type of genetic study. But for discrete responses (e.g. presence/absence of disease, levels of disease status, etc.), there is still active research in development of statistical methods. In this thesis, we focus on methods available for binary response. The same data structure also arises in the analysis of repeated measurements, where each individual is examined over time to record the presence/absence of a specific event. Covariates are also recorded for all the individuals over time. Since the measurements are made from the members of a family or from the same individual over time, the responses are usually positively correlated. Responses of this type are known as multivariate or correlated binary responses. The methods 1 available for analyzing multivariate binary response can be classified into two broad classes of methods: likelihood based and estimating equation based methods. The likelihood based methods require complete specification of the joint distribution of the multivariate responses. On the other hand, the estimating equation based methods can be employed when the joint distribution is not fully specified. The likelihood based models can further be divided into several categories: (i) latent variable models, (ii) random effect models, (iii) transition or Markov models, and (iv) conditional logistic regression models. In this thesis, we are mainly interested in latent variable models because of the attractive interpretation of the parameters of this models. The latent variable approach assumes that there is an underlying continuous variable which is categorized to give the observed discrete response. Here the regression models express the param-eters of the distribution of the latent variable as functions of the explanatory variables. Equivalently, this is a model of the marginal distribution of the observed response directly as a function of the explanatory variables. For estimating equation based methods, only the univariate marginal prob-abilities and some form of the associations are specified. Figure 1.1 shows the classification of the methods for analyzing multivariate binary responses that are considered in this thesis. Figure 1.1: Classifications of the methods for multivariate binary responses. Methods for Mult. Binary Responses Likelihood Based Est. Equation Based Mprobit Mlogit G E E l GEE2 2 A common latent variable model is the probit model (Finney [8]), which is widely used in the dose-response studies. In such studies, different dosages of a stimulus are applied to randomly selected subjects and binary responses such as the presence/absence of a specific event of interest (e.g. death) are observed on the subjects. It is assumed that the binary responses depend upon the tolerance of the stimulus and for each subject there is a certain level of tolerance below (above) which the response occurs and above (below) which the response does not occur. In probit analysis, it is assumed that the unobserved tolerance variable follows normal distribution; hence the probability of the occurrence of the observed response can be expressed as a function of the cumulative distribution function of the standard normal distribution (see §2.1). In case of multivariate binary responses, by assuming the tolerance distribution is multivariate normal, Ashford and Sowden [2] used a multivariate probit model (Mprobit) to analyze the data from a dose-response study. In quantitative genetics, when the character or trait is measured on an individual, the observed value is known as the phenotypic value; this depends on the genotypic value and the environmental deviation. The genotypic value is the aggregation of one or several genes possessed by the individual and all other sources of variation in the phenotypic value are assumed to arise from environmental deviation. The phenotypic value can be continuous or discrete. In this thesis, we consider only binary phenotypic values such as presence/absence of a specific trait. Falconer [7] proposed that there is an underlying variable liability and the observed phenotypic value depends on some threshold values of the distribution of the liability. The liability variable related to the phenotypic value is known as the phenotypic liability which is the sum of two independent random variables: the genetic liability and the environmental liability. Each of these two liability variables may be considered as the sum of many small effects so that a normal distribution is reasonable for liability variables, based on the central limit theorem (CLT). The liability variable is analogous to the tolerance in dose-response studies. If the distribution of the liability variable is assumed to be normal, the probit model can be used to analyze the binary response regarding the qualitative 3 trait (e.g. presence/absence of the disease) of interest. In genetic study, the individual who is the first member of the family diagnosed with the disease is known as the proband and usually subjects are selected from the family members of the probands. Since the family members are genetically alike, responses of this type of study are usually correlated. Mendell and Elston [30] considered multivariate probit models for studying multifactorial qualitative traits. Lesaffre and Molenberghs [22] used multivariate probit model to study the relationship between drinking and smoking behavior among Belgians. Another important latent variable model for univariate binary responses is the linear logistic model. This model assumes the underlying distribution is logistic and can be also used for identify-ing important covariates for occurrence of a specific event of interest. The standard logistic density is bell-shaped like the normal density and has a variance of 7r 2/3 (SD=1.81). Logistic regression for a binary response is common in biostatistics and epidemiology. This is mainly due to the convenient closed form of the logistic density and the ready interpretation of resulting regression coefficients as log odds-ratios; however it has not been derived from physical principles. The multivariate version of the logistic model is not as advanced as the multivariate probit model because there is no natural multivariate logistic distribution. Bivariate Plackett distributions (see Plackett [34], Mardia [28]), can provide a legitimate joint bivariate distribution function corre-sponding to a given pair of the univariate margins with the cross-product ratio or odds-ratio as the dependence parameter. The bivariate logistic distribution with cross-product ratio as the depen-dence parameter can be constructed by using univariate logistic margins in the bivariate Plackett distribution. Molenberghs and Lesaffre [31] proposed a procedure for constructing multivariate Plackett distributions for the given univariate margins and two- and higher-order cross-product ratios as the dependence parameters. Joe [17] also discussed different properties and applications of bivariate and multivariate Plackett distributions, as well as a more general view of the construc-tion. The joint distribution function of the multivariate Plackett construction satisfies the Frechet 4 bounds (see Joe [17]), which is one of the necessary conditions for a multivariate joint distribution function. To be a proper distribution function the joint distribution function of the three- and higher-order Plackett distribution must satisfy some necessary conditions; the analytic proof of these conditions are still open problems. Joe [17] has numerically shown for many combinations of parameters that the multivariate Plackett construction is a proper distribution if the third- and higher-order parameters are not too large or too small. Dale [5] used bivariate Plackett distribution for analyzing bivariate categorical responses. She considered regression models corresponding to the univariate margins as well as the dependence parameter the global cross-product ratio. Molenberghs and Lesaffre [31] extended Dale's model for multivariate ordinal categorical responses. Dale's models are defined for any continuous univariate margins and different link functions can be considered for modeling the univariate margins and the cross-product ratios (so that there is not always a latent variable interpretation). A multivariate logistic model (Mlogit) can be obtained by using the logit link for the univariate margins and the log link for the cross-product ratios. Glonek and McCullagh [11] considered a different approach to define a multivariate logistic model. In their approach, the regression models are defined for both the univariate margins and the dependence parameters (cross-product ratios). No distributional assumption corresponding to the univariate margins are required to estimate the parameters of the model. Beside these likelihood based methods for analyzing the multivariate binary responses, a common approach is generalized estimating equations (GEE). Liang and Zeger [23] proposed a estimating equation based method (known as GEE1) which can be used for analyzing both con-tinuous and discrete correlated responses within the generalized linear model framework. This method provides inferences only for the regression coefficients and considers the dependence among the observations as a nuisance. This method can provide consistent estimators of the regression parameters if the specification of the marginal means is correct. They introduce the "working" 5 correlation matrix in which a larger value of the working correlation parameter is used if there is more dependence in the data. In 1990, Zhao and Prentice [38] proposed a method for correlated binary regression by using a quadratic exponential model. The quadratic exponential family is a special case of Cox's [3] log-linear representation of the joint distribution of multivariate binary responses. Zhao and Prentice called this method a "pseudo-likelihood" approach which can estimate the regression parameters corresponding to the marginal means and the dependence parameters of the model. The difference of the "pseudo-likelihood" approach to the classical likelihood approach is that the former does not require to estimate all the parameters which are necessary to fully define the likelihood function of interest. Zhao and Prentice [38] consider one-to-one transformations of the canonical parameters of the quadratic exponential family to the first two moments of the marginal responses. If the regression models corresponding to the marginal means and the pairwise marginal correlations are correctly specified then this pseudo-likelihood approach can consistently estimate the regression parameters of the model. By using the Frechet bounds it can be shown that the range of the pairwise correlation coefficient depends on the univariate margins (see Joe [17], Chapter 7). Lipsitz, Laird and Harrington [27] considered a simulation study to show the advantages of odds-ratio over the correlation coefficient as a dependence parameter for analyzing multivariate binary responses. Fitzmaurice and Laird [9] considered conditional log odds-ratios (canonical parameters of the Cox's log-linear representation) as the dependence parameters to model multivariate binary responses by using the quadratic exponential family. This model is explained more clearly in Joe and Liu [16]. This model is not appropriate to use for familial data of various sizes, because it is not closed under margins (not reproducible). In 1992, Liang, Zeger and Qaqish [25] proposed a method (known as GEE2) for analyzing multivariate binary responses which is based on esti-mating equations for regression and dependence parameters. They considered odds-ratios as the dependence parameters. Given the correct model specification of the mean function and the de-6 pendence structure, the GEE2 can provide consistent estimators for the parameters defined for the mean function and the dependence structure simultaneously. Liang and Beaty [24] used the GEE2 method for examining the degree of familial aggregation for binary responses. Because the GEE2 method considers odds-ratio as the dependence parameter, it turns out that it implicitly uses the multivariate logistic model of Molenberghs and Lesaffre, but are using estimating equations that are easier to compute compared with the maximum likelihood equations. The objective of this thesis is to compare the likelihood based and estimating equation based procedures for the multivariate logistic model based on cross-product ratios. For this comparison, we first review the theoretical development of these models and then consider a simulation study to examine their performance in analyzing data. In comparing these models, the effect of the family sizes and the strength of the within-family dependence are also examined. In the subsequent three chapters the multivariate probit model, the multivariate logistic model, and the estimation equation based methods are described. In Chapter 5, the results of the simulation study is given. A brief description of the numerical optimization methods used for fitting these models is given in Chapter 5. 7 Chapter 2 The Multivariate Probit Model Probit analysis (Finney [8]) is a technique which is commonly used to study the dose-response relationship in a population of biological organisms. In dose-response study, different levels of a stimulus (e.g. a vitamin, a drug, etc.) are applied to a randomly selected group of subjects and the action of a particular level of stimulus is assessed in terms of a quantal responses which depend on the intensity of the stimulus. For any subject, there will be a certain level of intensity below which the response does not occur and above which the response occurs. This level of intensity is known as tolerance or threshold which will vary from subject to subject in the population. The main objective of the dose-response study is to assess the relationship between the levels of stimulus and the probability of occurrence of the response. If the distribution of the tolerance is assumed to be normal normal, a regression technique of probit analysis can assess this relationship after controlling for important covariates. In a dose-response study the response can also be multivariate. In some studies for a specific level of the stimulus, in addition to the responses regarding the main effect responses about some side effects might also be of interest in some studies. In this situation to assess the efficacy of the stimulus, analyzing the responses simultaneously would be the most efficient procedure. Ashford 8 and Sowden [2] used the multivariate probit model in the dose-response context. The multivariate probit model is also considered in quantitative genetics. Mendell and Elston [30] used the multivariate probit model to analyze multi-factorial qualitative traits. In genetics, the term liability is used analogously to tolerance in the dose-response study. For any subject the trait will show up if the liability is smaller (or greater depending on the relationship between the trait and the liability) than a specific threshold value. It is assumed that there is an underlying normal distribution of the liability and different thresholds of this distribution provides the response in terms of qualitative traits. The normal distribution is reasonable because the liability variable may be considered as the sum of many small effects in a polygenic model (response is influenced by many genes). Ochi and Prentice [33] considered multivariate probit models with exchangeable correlation structure. Lesaffre and Molenberghs [22] used multivariate probit models to analyze multivariate ordinal categorical response. A detailed discussion of the general class of multivariate probit models can be found in Joe [17]. In the following sections, probit models are described for univariate, bivariate, and multivariate binary responses respectively. 2.1 Univariate Probit Model Let us consider a study where K subjects are randomly selected from a population and each subject is examined to collect information about the presence of a specific qualitative trait of interest. The focus of the study is to identify important covariates for the occurrence of the trait (we are assuming the selected subjects are independent for the time being). The response y^ (i = 1,2,..., K) corresponding to the i t h subject is a binary variable, where yi — 1 if the trait is present and otherwise yi = 0. Let X be the design matrix of order K x (p + 1), with the first column of X being 1 to accommodate an intercept term. Suppose the random variable Zj, which denotes the latent variable for the z t h subject, follows 9 a normal distribution with mean 0 and variance 1. Let the response yi be the realization of the random variable Yi, such that Yi = I(Zi<Xil3), (2.1) where (3 = (/3o,[3\...., f3p)T is the vector of regression parameters corresponding to the design matrix X, Xi is the ith row of X, and /(•) is an indicator function such that 1(A) = 1, if A is true, 0, otherwise. (2.2) The probability that the i t h subject has the trait is -jTi(f3) = Pv(Yi = l)=Pv(Zi<Xif3) = ^(Xif3), (2.3) where $(•) is the cumulative distribution function of the standard normal distribution. For the given sample y\, y2, • • •, DK, the log-likelihood function is > K m = E ^ 1 ° g ^ ( / 3 ) + ( 1 - ^ ) 1 ° g ( 1 - ^ ^ ) ) } - (2-4) The maximum likelihood estimator f3 is a solution of the score equations t/(/3) = 0, where the jth (j = 1,2,... ,p) element of the score vector U(f3) is 3*08) dfij K and <j)(-) is the probability density function of the standard normal distribution. The covariance matrix of $ can be obtained from the Fisher information matrix. The (j, k) element of the observed Fisher Information matrix I((3) is -d2l([3) 10 <&2(X,/3) {l-S'XiP)}-\ Vi 1 - Vi | (j>2{XiP)XijXik <j)(XiB)XijXik. (2.6) The estimated covariance matrix of /3 can be obtained as 10) l . In the univariate probit model the score function and the elements of the Fisher information matrix can be written in simple convenient form, so an iterative procedure such as the Newton-Raphson procedure can be used to estimate the parameters of the model. In the previous section to describe univariate probit model, we considered a hypothetical study where each of the K independent units provide binary responses about a specific qualitative trait of interest. But there might be a situation where binary responses are available for the members of a family; for example, we could consider a situation where the responses are available for each father-son pair of K randomly selected families from a population. Besides identifying the important covariates for the occurrence of the trait, the objective of this study is to estimate the association of the occurrence of the trait between the father and the son in the father-son pairs. In this case, the responses between families are independent but within-family responses are correlated. So the univariate probit model is not appropriate for this problem. Let yj = (yn,yi2)T and X j be the response vector and and covariate matrix of order 2 x (p + 1) corresponding to the ith (i = 1,2,..., K) family. As the binary responses are bivariate, the corresponding distribution of the latent variable is also bivariate. Suppose the latent vector Zi = (ZJ I ,ZJ2 ) t follows a bivariate standard normal distribution with correlation coefficient p. That means the marginal distribution of Zij (j = 1,2) corresponding to the ith family follows standard normal distribution and corr(Zji, Zi%) = p-2 . 2 Bivariate Probit Model n (2.7) The observed responses are realizations of the random variables Y^-, where Yij = I{Zij < Xtj0), j = 1,2. The marginal probabilities corresponding to the occurrence of the trait are 7rii.(/3) = Pr(y i l = l) = Pr(Zil<Xilf3) = $(Xilf3), = Pr(y i 2 = 1) = Pr(Zi2<Xi2p) = $(Xi2p), where $(•) is the standard normal cumulative distribution function. The other marginal probabil-ities are 7TJO. = 1 — T^H- and TX^Q = 1 — 7 ^ . 1 . The bivariate probability of both the members of the family has the trait is ""in (Pi p) = Pr(y i i = i , y i 2 = i) = Pr(Zil<Xilp,Zi2<Xi2p) = ^ ( X i i / S , X i 2 / 3 ;p ) , . (2.8) where <&2(-> •; •) is the cumulative distribution function of the bivariate standard normal distribution. Similarly, we can write the other bivariate probabilities as 7r i l0(/3,p) = Pr (y i i= 1,^2 = 0) = • Pr(Zil<Xilp,Zi2> Xi2p) = P r ( Z a < Xn 0) - Pv(Zil < Xn p,Zi2 < Xi2p) = ^il-(P) - Kiu(P,P), nm(P,p) = n-\(P) ~ nni(P,p), nm{$,p) = l-Triu(P,p)-Tno{P,p)-'Km(P,p)-To construct the likelihood function, let us consider a 2 x 2 contingency table M j for the ith response vector with (j, k = 0,1) be the number of observations in the (j, k) cell. If there 12 is only one observation per subject within any family then only one cell of Mj will take the value 1 and the remaining cells take the value 0. Throughout this thesis we assume that there is only one observation per member of any family. The likelihood function for the sample of size K can be written as K 1 1 (0) 1=1 jl=0 j2=0 K = £ £ ™ y l o g 7 r y ( 0 ) , (2.9) i=l j where j indicates a multi-index j = {(j'1,.72) : ji,J2 = 0,1} and 6 — ((3T,p)T. The vector of score function for the parameters can be written as The maximum likelihood estimator 6 can be obtained from the solution of the equation U(0) = 0. The elements of the score vector and Fisher information matrix can be expressed in terms of probability density function (pdf) and cumulative distribution function (cdf) of the standardized univariate and bivariate normal distributions. 2.3 Multivariate Probit Model The multivariate probit model is an extension of the bivariate probit model, considering the under-lying distribution of the latent vector as multivariate normal. When binary responses are available for more than two members of a family, multivariate probit models can be used. In practice the fam-ily sizes can be unequal but for notational simplicity throughout this thesis, we will consider equal family sizes for deriving multivariate methods. Let = (yn,yi2, • • •, yid)T be the binary response vector corresponding to the i t h (i = 1, 2 , . . . , K) family and Xij be the 1 x (p + 1) covariate vector corresponding to the jth (j = 1,2,..., d) member of the ith family. Let Zj = (Zji, Z ^ , • • •, Z ^ ) 7 be 13 the latent vector corresponding to the i family which is assumed to follow a multivariate standard normal distribution with correlation matrix R. Let us assume the observed responses are the realizations of the random variables Assume that there are q different types of relative pairs (e.g. father-offspring, sibling-sibling, etc.) within a family, q < d(d — l)/2. Then we can express the correlation matrix as R(a), where a = (ai, #2, • • • j &q)T is the vector of correlation parameters corresponding to the different types of relative pairs within families. For a response vector of dimension d, 21 univariate, 2 2 bivariate, 2 3 3-variate, ..., 2d d-variate probabilities can be obtained. These probabilities can be expressed in terms of the cumu-lative distribution function of the multivariate normal distribution. We can express the marginal probabilities T T i i . . .(fi) = Pv(Yih = 1) = Qiix^fi), (2.11) the bivariate probabilities 7rai.. .(fi,a) P r ( % = l , ^ i a = l ) , h + h = *2(-Xyi fi, Xij2 fi; RjlJ2(a)), (2.12) the 3-variate probabilities W i l l i - <*) = P r ( % = 1, Ym = L Yij3 = 1), j i + j2 # h — §3{Xijx fi, Xij2 fi, Xij3 fi; Rj-Ljij^a)), (2.13) and in general the d-variate probabilities TTJII-I(/3, a) = Pr(Yij x = 1, YiJ2 = 1,..., Yijd = 1), j l ^ j 2 ^ - - - ^ jd = $d(X i j 1 fi,XiJ2 fi,.. .,Xijd fi;Rjl...jd(ct)), (2.14) 14 where <frj (•; S) is the cumulative distribution function corresponding to the j-variate (j — 1,2,..., d) standard normal distribution with correlation matrix £. Once we have these probabilities, we can compute the remaining orthant probabilities. As in the bivariate case, let Mj be a 2 x 2 x • • • x 2 (= 2d) contingency table corresponding to the d—dimensional response vector y^ Let mij1,,,jd be the number of observations corresponding to the (ji,J2, • • • ,jd) cell of Mj. The log-likelihood function can be written as K 1 1 = E E • - • E Wiii-id l o S - i d ( e ) i=l.j'i=0 jd=0 ' . i f = E E ^ w W - (2-15) i= l j where j-indicates a multi-index j = {(ji,J2,---,jd) '• jr G {0,1}, r = 1,2, and 6 = (f3T, ctT)T. The maximum likelihood estimator 6 is the solution of the equations U(9) = 0, where is the vector of the score functions. The equations (2.15) and (2.16) are the general form of the corresponding equations (2.10) and (2.9) of the bivariate case. As the score function U(0) contains multi-dimensional integrals, the second derivative of the score function (i.e. elements of Fisher information matrix) is very difficult to write down. Numerical integration can be used to approximate these high-dimensional integrals; we used the approximations proposed in Joe [15]. Since for the multivariate probit model, it is very difficult to evaluate the Hessian matrix analytically; the classical Newton-Raphson method is no longer useful as an optimization method for estimating the parameters of the model. Numerical methods, which can apply without having the Hessian matrix analytically, are required for this case. A brief description of these methods will be given in Chapter 5. 15 2.4 Conditional Probabilities In the preceding sections, we discussed the multivariate probit model which can be used for iden-tifying important covariates for the occurrence of a disease of interest. This model can be used to estimate the association of the occurrence of the disease between the relatives. The multivari-ate probit model can also be used to predict the disease status of an individual given the disease status and the covariate values of that individual's relatives. Predicting future disease status has significant importance in genetics. In this section, we discuss a procedure to predict an individual's future disease status given the information about his relatives. Let us define p(d\ 1 , 2 , . . . , = Pv(Yd = l\Yl = yuY2 = y2,...,Yd„1=yd_1,X) Pr(Y1 = yl,Y2 = y2,...,Yd = l,X) (2.17) Pr(Yi = j / i , Y2 = y2,..., Y" d_i = yd-i,X)' where p(d | 1, 2 , . . . , d — 1) is the conditional probability that the d}h member of the family will have the disease given the current disease status of the other members of the family and the covariate values. This conditional probability is the ratio of orthant probabilities of order d and d — 1 respectively. The equation (2.14) shows that these orthant probabilities are functions of 6 = ( / 3 t , Q t ) t , where (3 and a are the regression and association parameters defined for the multivariate probit model. So for the multivariate probit model, the conditional probability can be written as . W H O rl-U - MXiP,---,Xd/3;R1...d(<x)) P [ ' ' j " * d_ 1(JT 1/3,... >A: < l_ l i8;il 1... < f_ 1(a)) = q{9). (2.18) Given the maximum likelihood estimator 0 and the covariate values, the maximum likelihood estimate of the conditional probability p(d \ 1,2,..., d — 1) can be obtained from (2.18) as q(9)-The conditional probability is a function of the parameters 0, we can approximate the 16 variance of the maximum likelihood estimate of the conditional probability by using delta method. The approximate expression for the variance of the estimate of p(d | 1,2,..., d — 1) can be written as • = | £ J v < . ) ^ . • (2.19) where V(0) is the covariance matrix of 0. The estimate of the variance can be obtained by replacing 0 by 0 in (2.19). Because of the complex form of q(6), the partial derivatives of the equation (2.19) is very difficult to be obtained analytically. We have used numerical methods to compute the conditional probabilities and the respective standard errors, which are shown in Chapter 5. 2.5 Summary In this chapter different types of probit models are described which can be used to analyze binary responses. The multivariate probit models are very useful in analyzing correlated binary responses. Maximum likelihood estimates for the regression parameters of the model can be obtained. The estimates of the association between the members of the family in terms of the latent correlation coefficient, also known as the tetrachloric correlation, can also be obtained. The multivariate probit model can estimate the conditional probabilities that a particular member of a family has the disease given the current status of the other members of the family and the covariate values, which has importance in genetics. 17 C h a p t e r 3 The Multivariate Logistic Model In the previous chapter, the multivariate probit model was described. This chapter contains the description of an analogous multivariate logistic model. This model is also a latent variable model and has univariate margins that regress each response on the covariate vector. The multivariate logistic model models the dependence parameter in terms of the cross-product ratio which has an attractive interpretation. However there is no physical or stochastic model that leads naturally to the cross-product ratio as the natural dependence parameter. For the multivariate logistic and probit models the underlying univariate margins are assumed to be logistic and normal respectively. This approach of modeling multivariate binary data is based on a class of bivariate dis-tributions proposed by Plackett [34]. For given univariate margins and cross-product ratios as dependence parameter, the Plackett distribution can completely specify the joint latent distribu-tion of bivariate binary responses. Mardia [28] studied the properties of the bivariate Plackett distribution. Dale [5] first considered the bivariate Plackett distribution to model bivariate categorical re-sponses; this model is known as bivariate Dale model in the literature. Molenberghs and Lesaffre [31] introduced the multivariate Plackett distribution and extended the Dale model to multivariate or-18 dinal categorical responses. In the Dale model, the global cross-product ratio is considered as the dependence parameter which is equivalent to the local cross-product ratio for the binary case. The multivariate version of the Dale model is defined for any underlying univariate continuous distribu-tion. For this model, different link functions can be considered for modeling the univariate margins and the cross-product ratios. For the multivariate logistic distribution, the underlying univariate margins are logistic and the multivariate Plackett distribution is used to construct higher order margins with the univariate logistic margins. Joe [17] studied the multivariate Plackett construc-tion within a more general theoretical framework and considered different data sets to show the applications of the multivariate logistic model. Besides the Plackett distribution based approach to multivariate logistic models, McCullagh and Nelder [29] described this model differently based on the logistic transformation of linear combinations of joint probabilities. Their approach does not consider any latent variables. Glonek and McCullagh [11] also studied this approach of modelling the multivariate logistic model. The main focus of this chapter is to describe the multivariate logistic models for analyzing multivariate binary response. The Plackett distribution plays a vital role in defining the multi-variate logistic models that we consider, so we first discuss the bivariate and multivariate Plackett distribution in Section 3.1. The subsequent two sections contain the model description and methods of estimating parameters of this multivariate logistic model. In Section 3.3, a brief description of the McCullagh-Nelder-Glonek approach is given. 3.1 Plackett Distributions In this section, a brief introduction of bivariate and multivariate Plackett distribution is given. Throughout this chapter we will consider the binary responses as taking either 1 (diseased) or 2 (non-diseased) instead of 1/0 which we considered in the previous chapters. This will help us to derive general expression of cumulative distribution function of the multivariate Plackett 19 construction. 3.1.1 The Bivariate Plackett Distribution In 1965, Plackett proposed a procedure for constructing a class of bivariate distributions. Suppose Z\ and Z2 are two continuous random variables and F\(z\) = Pr(Zi < z\) and F2(z2) = Pr(^2 < z2) are the univariate margins of Z\ and Z2 respectively. Plackett [34] considered F\2(z\, z2) = Pr(Zi < zi,Z2 < z2), the possible joint bivariate distribution function of Z\ and Z2, as the solution of the equation _ -^ 12(^ 12 - F\ — F2 + I) ^ ^ ! (Fl — Fi2)(^ 2 - F12) where F12 = 1^2(^ 1, 2^), F\ = Fi(z\), F2 = F2(z2) and the cross-product ratio or odds-ratio 7 is a positive constant for all (zi,z2) for which neither Fi nor F2 assumes the value 0 or 1. The equation (3.1) is known as defining equation. Mardia [28] showed that only the following solution of the defining equation (3.1) (3-2) (l/2)( 7 - I)" 1!.! - (Fi + F2)(l - 7) - S(FuF2,j)}, if 7 * 1 , Fi F2, if 7 = 1, where 5 ( F i , F 2 , 7 ) = [ { l - ( F ! + F 2 ) ( l-7)} 2 +47(1-7)^1^2]V2 (3.3) is the only root leading to a proper bivariate distribution. The corresponding joint density function of Z\ and Z2 can be written as , , . d2F12(zl,z2) 7/ 1/ 2{(7 - l ) ( F i + F 2 - 2 F i F 2 ) + l} / l 2 ( 2 l ' * 2 ) = a ^ i = s~* ' 7 > 0 ' ( 3 " 4 ) where f\ and f2 are the univariate marginal density functions of Z\ and Z2 respectively and S is given in (3.3). 20 For given univariate margins F i and F2, the dependence parameter of the bivariate Plackett distribution (the cross-product ratio 7) is a monotonic increasing function in F12, i.e. 7 = 0 when F\2 = FL a n d 7 = 00 when F12 = Fu, where Fu(zi,Z2) = mm{Fi(zi),F2(z2)}, and FL(zuz2) = max{Fi(;zi) + F2(z2) - 1,0}, are known as upper and lower Frechet bounds (see [17]) respectively. From equation (3.2) the following can also be seen (i) If F i (F 2) tend to 1 then F 1 2 tends to F 2 (Fi). (ii) If F i and F2 tend to 1 then F i 2 also tends to 1 which indicates that FL and F2 are marginal distribution functions. (iii) For fixed F i (F2) and 7, F i 2 increases with F2 (Fi). Bivariate Plackett—normal vs Bivariate Normal The bivariate Plackett distribution is defined for continuous random variables with arbitrary uni-variate margins. The most widely used bivariate distribution is the bivariate normal, so it is of interest to examine how the bivariate Plackett distribution with univariate standard normal margins (which is known as bivariate Plackett-normal distribution) resembles the usual bivariate standard normal distribution. To compare the bivariate standard normal and the bivariate Plackett-normal distributions, the relationship between the cross-product ratio (7) and the correlation coefficient (p), the depen-dence parameters of these two distributions respectively, is required. For a specific cut-off point (z\,z2) the cross-product ratio can be written from equation (3.1) as / s = $2f>l,32;p)$2(-Zl, -Z2;p) ,0 e\ Wl,Z2,P) {a>{z1)-$2(Zl,Z2;p)}{*(z2)-$2(Zi,Z2;p)y ^ " ' 21 The quantity on the right hand side of equation (3.5) depends on the bivariate normal orthant probabilities corresponding to the cut-off point (21,22)- Kepner et al. [21] derived the following expression for the orthant probability for the cut-off point (0,0) Using this result in equation (3.5), we get the following relationship between the cross-product ratio and the correlation coefficient for the cut-off point (0,0): We use this relationship to obtain the cross-product ratio at the cut-off point (0,0) from a given value of the correlation coefficient. For a given correlation coefficient p, different cross-product ratios can be obtained from the equation (3.5) for different cut-off points (21,22)- Numerically it can be shown that i.e. the lower bound of the cross-product ratios is attained at the cut-off point (0,0). The theoretical proof of this result is still an open problem. Figure 3.1 shows contour plots of the bivariate standard normal density for selected values of correlation coefficient p. The corresponding plot for the Plackett-normal distribution is shown in Figure 3.2, where the identical cross-product ratios are obtained from the correlation coefficient values (used to generate plots for bivariate normal) by using the equation (3.6). These sets of plots are almost similar which indicates that the bivariate Plackett distribution with univariate standard normal margins is similar to the bivariate normal distribution. 3.1.2 The 3-variate Plackett Construction Let Z\, Z2, and Z3 be three continuous random variables with univariate margins F\, F2, and F 3 re-spectively. Let us consider F J U 2 (1 < j\ < j2 < 3) as the bivariate Plackett margin corresponding to $2(0,0;p) = l /4+(2 7 r)- 1 sin- 1 p-(3-6) 7(0,0;p) = min 7(21,22^), (3.7) 22 ( ZJD Zj2). Given the univariate and bivariate margins, the 3-variate margin F123 = 2*123(21, 22,23) is a solution of the defining equation ^ 3 _ ^123(2*123 - Ql)(-P1123 - ^2) (2*123 ~ Q3) ^ ( 6 l - Fi2s){b2 - -Fl23)(?>3 - ^123)(^4 - -F123) ' where ai = F12 + Fi3 - Fu a2 = F12 + F13 - F2, a 3 = F13 + F23 - F3, bi = F12, b2 = Fi3, b3 = F23, 64 = 1 — Yl,iF + J2i<j Fij- The function F123 satisfies the Frechet bounds (Joe, 1997), Fu = m i n { 6 i , & 2 , & 3 , & 4 } , a n d FL = max{ai, a2, a3,0}. Because Fi23(zi, z2, z3) is defined implicitly for each (z±,z2,z3), it is difficult to check if-F123 satisfies the rectangle condition necessary for a proper cumulative distribution function. Interpretation of the parameters Let Y i , Y 2 , and Y3 are three independent Bernoulli variables. The third-order dependence param-eter of the 3-variate Plackett distribution can be expressed as 7123 = CR3(YUY2,Y3) CR2(YUY2\Y3 = 1) CR2(YUY2\Y3 = 2) _ W i l l 7T221 7T122 7T212 7!"H2 W222 Wl21 7T211 Now by comparing this with the equation (3.8), the orthant probabilities can be expressed in terms of the 3-variate joint distribution function as W i n = -^123, Wi 2 2 = Fi23 - O i , 7T212 = -£123 _ a2, ^221 = 2*123 ~ ^3, W112 = h — Fi23, 7T121 = 62 - F123, 7T2H = 63 ~ 2*123, W222 = 6 4 - 2*123-Given the dependence parameters 712, 7123 and the univariate and bivariate margins,,these orthant probabilities can be computed from the 3-variate Plackett distribution function. 23 3.1.3 The d-variate Plackett Construction In this section, we describe the multivariate Plackett distribution for arbitrary dimension d (d > 1). To express the general form of the joint distribution and the cross-product ratios, we use the notation of Molenberghs and Lesaffre [31]. Let Z = (Zi, Z2,. • •, Zd)T be the d-dimensional random vector and Fh (1 < jx < 2), F j l h (1 < jx < j2 < 2), . . . , Fj1j2„jd_1 (1 < ji < • • • < jd-i < 2) be the univariate, bivariate,..., (d — l)-variate Plackett distribution functions of Z respectively. Let 1Tjij2-jd = P r ( ^ i = Ji) ^2 = J2y • • i Yd = 3d) be the d-dimensional orthant probabilities. The general expression for the d-dimensional cross-product ratio can be written as _ n ( J - 1 , . . . j < , ) e A+ 7 r i i - jd . . 7l2-d = Ff > W - y ) n 0 l , . . . , where A$ = | (J"I ,J2 ,-- . e {l,2}d : Yji~d i s e v e n | ' A-A = {l,2}d\A+. For example, when d = 2, A+ = {(1,1),(2,2)}, when d = 3, A+ = {(1,1,1), (1,2,2), (2,1,2), (2,2,1)}, when d = 4, ^+ = {(1,1,1,1), (1,1,2,2), (1,2,1,2), (1,2,2,1), (2,1,2,1), (2,2,1,1), (2,1,1,2), (2,2,2,2)}, Now by using these elements of the sets A^ and A^ in equation (3.9), we can get the cross-product ratios 7ril7T22 7rill7ri227!"2127T221 712 = — , 7123 = , • ' ' . 7!"127I"21 7ril27I"12l7!"21l7r222 24 To get the general expression of the defining equation corresponding to the d-variate Plackett distribution, the orthant probabilities in equation (3.9) need to be expressed in terms of the marginal distributions of different dimensions. Suppose the multi-index j represents a d-dimensional vector of l's and 2's, i.e. j € {1,2}d. Let K(J) = n(ji,J2, • • • ,jd) be the set of dimensions for which ji = 1 (1 < I < d), i.e. «(j) = {/ : 1 < « < d,ji = 1}. For example, K(1,2) = {1}, K(1,2, 1) = {1,3}, K(2, 1,2,1) = {2,4}, etc. Let us introduce another notation s(j): the set s(j) contains all possible subsets of {1,2, ...,d} which contain with the elements of s(j) are arranged in lexicographical order. For example, d = 2, «(1,2) = {1} => 8(1,2) = {1},{1,2}, d = 3, K(2, 2,1) = {3} =• s(2,2,1) = {3}, {1,3}, {2,3}, {1,2,3}, d = 4, «(2,1,1,2) = {2,3} => s(2,1,1,2) = {2,3}, {1,2,3}, {2,3,4}, {1,2,3,4}, Let N be the difference between the number of elements of the sets s(j) and n(j). Using the inclusion and exclusion probability law, we can define the general form of the orthant probabilities as = £ s g n ( S ( j ) ) F a ( j ) , (3.10) where 1 if AT is even, -1 otherwise. For instance, we can write down the expressions for the orthant probabilities as 25 sgn(s(j)) = { T221 ""2112 = Fs - F\3 — F23 + F123, Using equation (3.10), we can rewrite the equation (3.9) as i l j eA+^J = nieA+Zs(i)Sgn(s(j))FsQ) UjeA- *j n j e A - Ea(j) sgn(s(j))F s ( j ) 7l2-d = (3.11) i W ^ - « j ) ' where F = Fi2-d) o>j and 6^ - are the functions of the margins of order d! (< d) which can be obtained from equation (3.10). The d-dimensional Plackett distribution function is the solution of equation (3.11) which satisfies the Frechet bounds ^max{aj, 0}, min{6j}^ . For third- and higher-order Plackett distributions, an open problem is whether the function F is a proper distribution function, i.e. whether the mixed derivatives are non-negative. 3.2 The Regression Model The main objective of this study is to compare the existing estimation methods for analyzing multivariate binary responses. We wish to examine the performance of the multivariate logistic model in the field of genetics. As already discussed in the previous chapter, in genetics studies the response is sometimes binary (e.g. presence/absence of a qualitative trait of interest) and the responses are not independent within families. It is assumed that for any individual, the occurrence of the qualitative trait depends on the underlying distribution of a latent variable. Let us assume that the trait occurs if the liability is less than a predefined threshold value; otherwise it does not occur. 26 Let j/j = (yn,yi2, • • • , y i d ) T be the response vector corresponding to the i t h (i = 1,2,..., K) family, where the yjj's are binary variables representing the presence (y^ = 1) or absence (yij = 2) of the qualitative trait of interest in the j t b member of the family. Suppose the random vector Z{ = (Zn, Z i 2 , . . . , Zi,i)T denotes the latent vector corresponding to the i t h family. Let us assume Zij follows standard logistic distribution, i.e. location parameter is 0 and scale parameter is 1. Let X{ be the covariate matrix of order d x (p + 1) corresponding to the ith family with a first column of 1, to accommodate an intercept term. Suppose the observed response y^ is the realization of the random variable Yij, where Yij = I(Zij < Xijd), where Xij is the j t h row of X i and /(•) is the indicator function such that 1, if A is true, 1(A) = 2, otherwise. Let us assume only the univariate margins depend on the covariates. The univariate margin corresponding to the j t h member of the ith family ir^ = Pr(Yij = 1) can be written as a function of the unknown regression parameters 3 as itijiP) = Pv(Yij = l)=Pv(Zij <Xij3)=F0(Xij8), l < j < d , (3.12) where Fo(x) = 1/{1 + e~x} is the distribution function of the standard logistic distribution. Prom equation (3.12) the linear predictor can be written as Vij(3) = 3 = F Q - 1 (^(3)) = log Y Z ^ f y In G L M terminology the function F 0 - 1 ( - ) is known as the link function; beside this logit link, depending on the distribution of the latent variable other link functions can also be considered. These univariate margins do not fully determine the joint distribution of because the elements of yj are not independent. This mean dependence parameters are needed to describe the 27 association between the elements of Let us assume the third- and higher-order cross-product ratios, the dependence parameters of the Plackett distribution, are constant at a fixed value 70 = 1. The choice of 70 = 1 leads to a multivariate logistic distribution that is analogous to the multivariate normal distribution (see Joe [17] for an entropy interpretation). Suppose q different types of pairs (e.g. parent-offspring, sib-sib, grandfather-grandchild, etc.) are possible in the selected families. Suppose the (j, k) pair of the ith family is of the Ith (1 < I < q) type of the pair. We can define the bivariate cross-product ratios corresponding to (j, k) pair of the ith family as lijk(a) = 9~l{oLi), 1 < I < q, (3.13) where ct = (ai, a2, • • •, (xq)T is the vector of the parameters corresponding to the q different types of the association and g(-) is the link function for the cross-product ratios. Having values of the parameters 6 = (f3T,a.T)T, the univariate margins and the cross-product ratios can be computed from equations (3.12) and (3.13) respectively. Given the univariate margins and cross-product ratios corresponding to a pair of members within a family, the bivariate margins 7TJJ U 2(0) = Pr(Yij1 = l,Yij2 = 1) (1 < j\ < j2 < d) can be computed from the equa-tion (3.2). As we already assumed the third- and higher-order cross-product ratios are fixed at 70 = 1 (which is known), the 3-variate margins are the solution of equation (3.8) provided the univariate and bivariate margins are known. Similarly, higher-order margins can also be obtained by using the equation (3.9). Once all the margins are known, the orthant probabilities can be obtained from the equation (3.10). 3.2.1 The Likelihood Function Let Mi (i = 1,2,..., K) denote a 2 x 2 x • • • x 2 (= 2d) contingency table which can be constructed from the observed response vector y^ Let be the number of observations in the j th cell of the table Mj, where j indicates a multi-index j = {ji,j2, • • • ,jd)- Let 7^(0) = Pr(Yi = ji,Y2, = 28 J2,..., Yd = jd) (ji G {1,2}, 1 < l < d) be the orthant probabilities corresponding to the j th cell of Let 0 = (f3T, aT)T be the vector of parameters of the model. As for the multivariate probit model, for the given sample j / ^ (i = 1,2,..., K), the log likelihood function can be written as K 2 2 l ( e ) = E E " " E mijij2...jd log ninj2...jd (0) i=lji=l jd=l K 2 = E E m i j l o g T r ^ ) . (3.14) t=i j=i The maximum likelihood estimator 0 is the solution of the score equation U(0) — 0, where the score function U ( 9 ) - d l { 6 ) - f f m * (3 15) Using the chain rule we can write dTTjj _ drjj _ dirjj dr)ik d\3 ~ dVi d(3 ~ £r[ drjik df3 da ~ d~ii da ~ ^ fc£^ dyijk da ' For the general dimension d, it is impossible to obtain the algebraic expression of the terms (diTij/drjik) and (dir^/djijk) because 7Tjj contains higher dimension cumulative distribution func-tions of the multivariate Plackett construction which has no closed form. So the expressions for the elements of the Fisher information matrix cannot be shown. Numerical methods are needed to get the maximum likelihood estimates of the regression parameters 6. In Chapter 5, two such numerical methods will be described. The conditional probabilities which we discussed for the multivariate probit model (see §2.4) can also be defined for the multivariate logistic model. The conditional probability p(d | 1,2,..., d — 1), which is the ratio of two orthant probabilities, is a function of the parameters of 29 the model. So, given the maximum likelihood estimator 6 of the multivariate logistic model, the estimate of p(d | 1,2,..., d — 1) and corresponding standard error can be obtained for this model. 3.3 McCullagh-Nelder-Glonek Approach McCullagh and Nelder [29] introduced a multivariate logistic transformation and used this to define a class of regression models which can be applied for analyzing multivariate categorical responses. These models are also known as multivariate logistic models and are systematically studied by Glonek and McCullagh [11]. This approach of defining multivariate logistic models does not as-sume any underlying distribution of the univariate margins. The likelihood construction is similar for both the logistic models, but the estimating procedure of the orthant probabilities is differ-ent. In the previous sections, we developed how the multivariate Plackett distribution can be used to estimate the orthant probabilities; McCullagh and Nelder used the multivariate logistic transformation to estimate these probabilities. We have numerically checked that these two ap-proaches give identical orthant probabilities. In the following section, inference procedure of the McCullagh-Nelder-Glonek approach is briefly described. 3.3.1 The Model and Parameter Estimation Let j/j = [yn,yi2, • • •, Vid)T be the response vector corresponding to the ith (i = 1,2,..., K) family, where the response corresponding to the jth (j = 1,2,..., d) member of the iih family, y^ is binary indicates the presence (y^ = 1) or absence (y^ = 2) of a qualitative trait of interest. As before, the objective is to estimate the effect of the covariate for the occurrence of the trait. The estimate of the dependence parameter of the occurrence of the trait between the members within a family is also of interest. Let Xi be the covariate matrix of order d x (p + 1) corresponding to the members of the i t h family. 30 Let us define nj(Xi) = 'Pr(Yii=yii,Yi2 = yi2,...,Yid = yid\Xi), yu = 1,2, l = l,2,...d, as the probability of observing the response vector y{ given the covariate matrix Xi and j indicates a multi-index. Let 7r be the vector of all possible 2d probabilities. For example, for bivariate case 7r = (7ri i ,7r i2,7T2i,7T22) T . If 7 is the vector of (f) x 2 1 univariate, (2) x 2 2 bivariate, (d) x 2d d-variate margins, let us make a linear transformation n —> 7 by 7 = LTV (3.16) where L is a matrix of zeros and ones. For example, when d = 3, 7 would be the vector of 6 univariate, 12 bivariate and 8 trivariate margins. Let us define rj = (770,771, • • • 1 Vd,Vi2, • • •, Vd,-i,d, • • • ,Vi2-d)T as the vector of the logistic factorial contrasts which can be obtained from 7 by 77 = C log(LTr), (3.17) where C is an appropriately chosen contrast matrix. The transformation 7r —> rj defined in equa-tion (3.17) is called a multivariate logistic transformation by McCullagh and Nelder [29]. A latent multivariate logistic distribution obtains only for a suitable choice of L . The first element 770 of the vector 77 is for ensuring the requirement Yij TTJ = 1; 770 also ensures the transformation TC to 7 is of full rank. For the bivariate case the elements of the vector 77 would be 770 = log(7Tn + 7T12 + 7T21 + 7T 2 2) , 7?l = l o g T T i . - 10g7T 2 . = l 0 g i t ( 7 T l . ) , 772 = l og7T . l - logTT.2 = logit(7T.l), 7?12 = log 7Tn + log 7T22 - log 7Ti2 - log 7T21 = log OR\2, where OR\2 is the ratio of the odds of having the trait corresponding to the first and second member of the family, i.e. ORn = {Pr(Yi = 1)/Pr(y x = 2)}/{Pr(y 2 = l ) /P r (F 2 = 2)}. 31 In G L M terminology these 77's are known as linear predictors. The dependence of the 7r's on the covariates can be described through these 77's. Let us assume that only the univariate margins depend on the covariates, so we can define the following regression models for the univariate margins where [3 is the vector of regression parameters. To define the joint distribution of j/j fully, we need to define regression models corresponding to the two- and higher-order margins. Let us assume there are at most q different types of pairs (e.g. father-son, sib-sib, etc.) are in a family; the bivariate margins can be expressed as where the vector a = (a\,... ,aq)T represents the dependence parameter. Let us assume third-and higher-order margins are constant at ao, i.e. »7l23 = V124 = ••• = 7?1234 = ' ' ' = Vli-d = <*Q. The likelihood construction is similar to the model described in previous sections. The score function (3.15) can be written as = f ± r ^ % (3.13) i=l j=i li • From equation (3.17), we can write d 7 V = (CJD _ 1 L) _ 1 , Or} where D = diag(L7r). The matrix D can be written in terms of the elements of 7, as an example which is shown in Appendix (A.2) for 3-variate model. McCullagh and Nelder [29] provide the detailed expressions for the elements of the score function and the Fisher information matrix. 32 3.4 S u m m a r y In this chapter, we have described multivariate logistic models for analyzing multivariate binary responses. The multivariate Plackett distribution, which is described in Section 3.1.3, has been used to construct a multivariate logistic distribution. The bivariate Plackett distribution (see Section 3.1.1) is flexible to consider any continuous univariate margin to construct corresponding bivariate margin. Using two univariate logistic margins, we have constructed bivariate logistic margins from the Plackett distribution with a constant bivariate cross-product ratio. A comparison between the bivariate Plackett-normal distribution and the bivariate normal distribution with a set of comparable dependence parameters is shown in Section 3.1.1. This comparison reveals that the bivariate Plackett distribution with standard normal margin is similar to the bivariate normal distribution. In Section 3.3, McCullagh-Nelder approach of multivariate logistic model is briefly described. 33 ure 3.1: Contour plots for the bivariate normal distribution. 34 Figure 3.2: Contour plots for the bivariate Plackett-normal distribution. 35 Chapter 4 Generalized Estimating Equations The objective of this thesis is to compare the available regression models and methods for multi-variate binary responses. Generalized linear models (GLM) [29] are a general class of regression methods for univariate discrete and continuous responses, in which the density has a certain expo-nential form. Logistic regression models for binary responses, linear regression models for continu-ous responses, and log-linear models for count data are special cases of generalized linear models. In this chapter, we start our discussion from generalized linear models because the construction of a class of multivariate regression methods (estimating equation based) are closely related to that of G L M . The main focus of this chapter is to describe generalized estimating equations (GEE) meth-ods for analyzing multivariate binary responses. In 1986, Liang and Zeger [23] and Zeger and Liang [37] proposed the first version of the GEE method (GEEl) which can be used for analyzing both multivariate continuous and discrete responses in which the univariate margins are G L M . By considering the dependence parameters as a nuisance, the G E E l method focus on estimation of the regression parameters defined for the mean function of the model. Later Liang, Zeger, and Qaqish [25] proposed a second version of the GEE method (GEE2) which can estimate both the 36 regression and dependence parameters (in the form of the log odds-ratios) simultaneously. GEE methods are not likelihood based other than in the case of the normal distribution; estimating equations are defined for estimating the parameters. These estimating equations are similar to the score equations of a multivariate normal model. For analyzing multivariate binary data, models based on the quadratic exponential family, which can provide pseudo-likelihood estimators, are also available in the literature, e.g. Zhao and Prentice [38], Fitzmaurice and Laird [9]. A detailed discussion of these models can be found in Fitzmaurice et al. [10]. In this chapter, first we describe generalized linear models for analyzing univariate and multivariate binary responses. The GEE1 method and quadratic exponential family based methods are described in §4.2 and §4.3 respectively. Section 4.4 contains a description of the GEE2 method. 4.1 Generalized Linear Models The GLM is the unified class of regression models for univariate continuous and discrete responses. Though our main focus is binary responses, in this section, we describe the GLM for general responses. The GLM has two important components: systematic and random component. The random component specifies the distribution of the response. The systematic component specifies the linear predictor which is a linear function of the known explanatory variables. The systematic component can be expressed as a known function of the mean parameter of the distribution of the response. This function is known as the link function which is a monotonic and differentiable function, with an appropriate domain. 4.1.1 The Model Let yi,t/2, • • • ,VK be a random sample from a distribution in the exponential family (see Lindsey [26], p. 11) having mean parameter m (i = 1, 2,..., K) and constant dispersion parameter <f>. The 37 density function of yi is of the form fy(yf,0i,4>) = exp{{yi0i-b(ei))/a(<t>)+c(yi, </>)}, (4.1) where a(-), b(-), and c(-) are specified functions and di is a function of //j, known as canonical parameter of the exponential family. It can be shown that l H = E{Yi) = b'{9i) and va r^ ) = b"{6i) a(4>). def The variance of yi is product of two terms: one, V(fii) = b"(9i), a function of /ij, is known as variance function and the other, a((j)), is a function of only the dispersion parameter <f>. Thus, the second moment of yi is a function of its first moment. To define the systematic component of the model, let us consider the matrix X of order K x (p + 1) as the design matrix and to accommodate an intercept term, the first column of X is 1. The linear predictor corresponding to the ith (i — 1,2,..., K) observation can be written as rji{0) = fa + Btfa + foXa + • • • + 8pXip, where /?o is the intercept term and /3j (j = 1,2,... ,p) is the regression coefficient corresponding to the jf t h explanatory variable. Let h(-) be any monotonic differentiable function such that rji(B) = h(m), i = l,2,...,K; h(-) is known as the link function in generalized linear model terminology. The link function relates the systematic component to the random component of the model. 4.1.2 Parameter Es t imat ion Given the sample yi,y2, • • • ,yK, the log-likelihood function can be written as K i(e) = 5 > s / » u / . ; ; ^ ) i = l 38 i=i i=i To estimate the parameters (3 of the model, we must solve the score equations 17(0) = ^ = 0. (4.2) By using chain rule, we can write T J I Q \ = y d l i d d i d ^ drn h "i 5 M i drli 9(3 _ y (Vi ~ Mi) 1 dfij = E f S ^ ^ - M i ) i=l " " ( i ) V l ' ^ " > - ( 4 ' 3 ) where V = diagjVi, V2,..., VK} and V. = V(p,i)a(4>) = var(j/j). From the expression of the score equation (4.3), it is evident that the dispersion parameter <p can be ignored for estimating the regression parameters f3. But the estimate of the dispersion parameter is required to compute the standard error of the estimate of regression parameter. The equation (4.2) can be solved by iteratively reweighted least squares (McCullagh and Nelder [29], p.41) to estimate the parameters of interest [3. Replacing /3 by the maximum likelihood estimator [3, equation (4.3) becomes a function of the dispersion parameter only which can be solved to obtain the maximum likelihood estimator (f>. The maximum likelihood estimator (3 has an asymptotic multivariate normal distribution with mean vector /3 and covariance matrix Vu = d(3) \df3 (4.4) The variance can be estimated by Vu which is obtained by replacing (3 and <j> by the corresponding maximum likelihood estimates /3 and <j> respectively in expression (4.4). 39 4.1.3 Multivariate Binary Response In this section, the GLM procedure is described for multivariate binary responses. Let y{ = (yn,yi2, • • • >yid)T be the binary response vector corresponding to the ith (i = 1,2,... ,K) family. Let Xij be the p-dimensional covariate vector corresponding to the j t h (j = 1,2,..., d) member of the i t h family. To apply the GLM procedure to multivariate binary responses, let us naively assume the responses are "independent" within each family. That means there are d x K independent obser-vations in the sample. Assuming the marginal distribution of (i = 1, 2,... K; j = 1, 2,..., d) is a member of the exponential family, the log-likelihood contribution of the j t h member of the ith family is kj(0ij) = eyupKyijOij-b(6ij))/a(4>)+c(yij,(f))}, (4.5) where 6{j is the canonical parameter which is a function of the corresponding mean function u-ij = E{Yij). Let h(-) be the link function which relates the mean parameter ^ to the linear predictor r]ij(B) = XijB as nij = h~1(XijB), where 8 = (/3o, Q\,Bi, • • • ,PP)T is the vector of parameters of interest. The score function for the i t h family (similar to equation (4.3)) can be written as UiW) = {^)TV-\yi-^ (4.6) where Vi = diag{var(j/ii), var(yj2)5 • • •, va^y )^}. The maximum likelihood estimator 0r is the solution of the equation U(B) = 0, where = Y.CiVr^y,-^). (4.7) If the binary responses within each family were independent, the maximum likelihood estimator fii is consistent and has an asymptotic multivariate normal distribution with mean vector B and 40 covariance matrix \i=l Replacing /3 by 0j, equation (4.5) becomes a function of the dispersion parameter cf> only. The corresponding likelihood function can be maximized to obtain the estimate (4>) of the dispersion parameter. In spite of the fact that the responses within a family are correlated, the pseudo maximum likelihood estimator $j (obtained by assuming within family responses as independent) is consistent but V/ 0 can be inconsistent. To obtain the consistent estimator of the covariance matrix of 0r, Liang and Zeger [23] used the inverse Godambe information matrix (see Godambe [12]) as Vr = VjoHitfriVjo, (4.8) where H2(f3) = J2CiVf1(Vi ~ f*i)(Vi ~ HifV^C?. i=i The estimated covariance matrix Vj of $j can be obtained by using the estimator f3t and 4> in the equation (4.8). The main disadvantage of this approach is that it sometimes provides less efficient estimates of the regression parameters in some cases when the intra-familial association is high. 4.2 Generalized Estimating Equations I In the previous section, we have seen that the G L M procedure can be used for analyzing multivariate binary responses. Although this approach does not consider the within-family dependence in the analysis, with the correct specification of the univariate margins this approach can provide consistent estimators for both the regression parameters and the respective variances. However, ignoring the within-family dependence in the analysis costs in the efficiency of the estimators corresponding to the regression parameters. 41 To incorporate within-family dependence in the analysis, Liang and Zeger [23] proposed the generalized estimating equations procedure (GEE1). The main focus of the GEE1 method is to examine the dependence of the response on the covariate set which are measured for each observation. In the GEE1 method, regression models are defined for the mean function of each ob-servation. Instead of specifying the joint distribution of the responses, Liang and Zeger [23] defined estimating equations for the regression parameters only, which they called generalized estimating equations. They introduce a "working" correlation or weight matrix to avoid the specification of a joint distribution of the responses. When there is stronger dependence in the data, then one should use larger correlations in the weight matrix. Crowder [4] showed some examples where the estimators corresponding to the parameters of the "working" correlation matrix do not converge in probability to a value in [—1,1] or [0,1]. 4.2.1 The Method and Parameter Estimation Let Hi = (yn,yi2, • • •, yid)T be the response vector corresponding to the ith (i = l,2,...,K) family, where yij, the binary response corresponding to the jth (j = l,2,...,d) member of the ith family, representing the presence/absence of a specific trait of interest. Let X{j be the p-dimensional vector of covariate values for the j t h member of the iih family. As in Section (4.1.3), let us assume that y^ follows a distribution from exponential family and the dependence of the mean function Hij = Pv(Yij = 1) on the covariate set can be expressed by the link function h(-) as /Zjj = where [3 = (fio, /3i, /?2, - • •, Pp)T is the parameter of interest. Liang and Zeger [23] used a "working" correlation matrix Ri(a) (i = 1,2,... ,K) of order dx d for specifying the within-family dependence. In the case of unequal family sizes, the dimension of the "working" correlation matrix is different for different families. The form of the "working" cor-relation matrix is assumed to be fully specified by the parameters a = (ai, a2,..., oeq)T. Common correlation structures such as "independence" and "exchangeable" correlation structures can be ob-42 tained by considering Ri(ct) = Id and Ri(ct) = (1 — p)I([+pJ(i respectively, where p = covv{Yij ,Yn), j,k — 1,2,..., d (j / k), where /<$ is an unit matrix of order d x d and is a d x d matrix with all elements are one. The estimation equations corresponding to the "independence" correlation structure is known as independence estimating equations (IEE) which is similar to the procedure described in 4.1.3. For estimating the regression parameters, Liang and Zeger [23] proposed the following set of estimating equations U(P) = I j ( ^ f ) VfHVi-Pi), (4-9) where Vi is the "working" covariance matrix considered for ith family, which can be expressed as a function of "working" correlation matrix as Vi = A j / 2 i i i ( a ) A j / 2 , (4.10) where Aj = diag{var(ya), var(yj2), • • •, var(yj<f)} and var(yy) = p,ij a((j)), is a function of the known mean function and the dispersion parameter. So the "working" covariance matrix V} is a function of 0, ct, and <j>. Thus the estimating equations defined in equation (4.9) are a function of B, a and (f>. Since we are only interested in estimating the regression parameters, the score equation (4.9) can be reduced as a function of 0 only by replacing a. and <f> by &(Y,3, <f>) and (j){Y, 0) respectively. So equation (4.9) can be written as K "r u[0,&{0,m}] = E(^) {Vi(p,c\(0^))Yl{yi-Hi) K = 2~2CiBiAi. (4.11) i = i Let 0Gl be the solution of the equation U[0, d{/3, <^ (/3)}] = 0. According to Liang and Zeger [23] given the estimators of a and <j> that converge in probability, the estimator of the regression param-eter 0Gx is consistent and asymptotically multivariate normal with mean vector 0 and covariance 43 matrix V G I = ( z a ^ c f ) ^J2^Bi(yi-»i)(yi-»i)TBic?}(^ • (4-12) The variance estimate VQX of $ G l can be obtained by replacing 3, <f>, and a in the expression for VGI by their estimates. Liang and Zeger [23] showed that the consistency of the regression parameters and their variances does not depend on the choice of the "working" correlation matrix as long as the estimates of the parameters of this matrix converge in probability. This is an error of Liang and Zeger pointed out by Crowder [4] and Sutradhar and Das [35]; one cannot talk about the consistency of the parameter that does not have real interpretation. There are also some other drawbacks of the maximum quasi-likelihood estimator 0Gl. Crow-der [4] also indicated that there may not exist any solution for d; in that situation G E E l cannot estimate the regression parameters. Even if d exist and it converges to a specific value, its limiting value depend on the form chosen for "working" correlation matrix. Sutradhar and Das [35] showed some results for multivariate binary data where the estimators corresponding to IEE are found as efficient as the estimators from G E E l . 4.3 Quadratic Exponential Family As we have seen in the Section (4.2), G E E l focus on estimation of the regression parameters corresponding to the marginal mean functions. The G E E l procedure is not sufficient when the objective of the study is to estimate the parameters corresponding to both the mean function and within-family dependence structure. Regression methods based on a quadratic exponential family can estimate both kind of parameters simultaneously. In this section, we first describe the quadratic exponential family and then review a regression method based on this family of distribution. Let yt = (j /j i , yi2,..., yid)T be the binary response vector corresponding to the ith (i = 1,2,..., K) family. According to Cox [3], the joint distribution of j/j can be written in the following 44 saturated log-linear form: f(yi;&i,ni)=exp{&Jyi + nfzi-A(^i,Qi)}, (4.13) where Z ; = (yaw,. • • ,yid-iVid, • • •, Vi\Vi2 • • • Vid)T contains the second- and higher-order cross-products oi y^ @i = {0n,0i2,...,6id)T and Qi = (wji2, • • •, ^id-i,d, • • •, ^H2...d)T are vectors of canonical parameters, and A(@i,Q{) is a normalizing constant such that J2exp{A(@i,fi;)} = Y^exP{®TVi + zi} where sum is over all 2d possible values of yi. The parameters of 6; have interpretations in terms of conditional probabilities as Oij = logit{Pr(Yy: = 1 | Yik = 0, j + k)} and the parameters of Q; can be interpreted in terms of log conditional odds-ratios and contrasts of log conditional odds-ratios as Vijij2 = logOiJd/iji,yij 2 | yij3 = 0, j3 ^ j i , j 2 ) , .. . . _ l o ORiyijiiVm 1 Vijs = hVm = ° ' J4 ^ 'ju 32,33) « U 2 j 3 OR{yih,yiJ2 \ yiJ3 = 0,yiJ4 = 0, j4 + h,h,hY where Pr(Y! = 1, y 2 = 0) Pr(Yi = 0, Y 2 = 1)' is the odds-ratio between between two binary variables Y\ and Y 2 . A special case of Cox's log-linear representation is the quadratic exponential family (Zhao and Prentice [38]) obtained by setting three- and higher-order dependence parameters in fi; at some fixed values in (4.13). This special case of the Cox's log-linear representation is equivalent to that obtained from conditionally logistic regressions (Joe and Liu [16]). Gourieroux et al. [13] consider an exponential quadratic model, parameterized by the mean vector and covariance matrix, for a general response vector. 45 However, one drawback of this representation is that it is not reproducible or closed under margins, i.e. if y* is a subset of y; of order d* x 1 (d* < d), then Vi' # exp {0J r yJ + nf<-A(e?,fi;)} , where 0* is the corresponding subset of 6; of order d* x 1, and fl* and z* are the corresponding subsets of fi; and z; respectively. The canonical parameters in fi; are defined in terms of conditional odds-ratios which have limited use for the studies with unequal family sizes because conditional odds-ratios are specific to the number of subjects in a family. For example, since 0 -R(y ; i , y ;2 | y ; 3 — 0) ^ OR(yn,yi2 | y%z = yi4 = 0), the same parameters cannot be used to measure the association between yn and yj2 if some families have three subjects and others have four subjects. 4.3.1 Zhao-Prentice Method Zhao and Prentice [38] used the quadratic exponential family for analyzing multivariate binary data by making a one-to-one transformation from the canonical parameters (©;,fi;) to the marginal parameters (p^cri), where \ij = Oi and cr; is the vector of the marginal covariances for the i t h family. Let us define the regression model for marginal parameters /ij and cr; as »i(f3) = h-1(Xi(3), o-i(a,f3)=g-1(a,Xif3), respectively, where X ; be the matrix of covariates for the ith (i = 1,2,..., K) family, [3 and a are the vectors of parameters to be estimated, and h(-) and g(-) are the link functions. Let £; = E(zi) and for (j, k) pair of the iih family we can write £ ; j f c = + Hij^ik- That is, £; is the function of both the parameters (3 and a. 46 The score function for the Zhao-Prentice model is T K ( dJh n V / i=i K w 0 i ?L M i \ d/3 da cov(y») coviy^Zi) \ cov{zi,yi) C O V ( Z J ) J = CjBjAj. (4.14) i=i The solution of U(0, a) — 0 provides pseudo-maximum likelihood estimators (/3 ,aT)T. Pseudo-maximum likelihood estimation of (0,ct) requires direct calculation of third and fourth order moments of y, which involves summation over 2d possible values of y^ . This computa-tion is tedious when the family size is large. In this context, Zhao and Prentice [38] suggested the use of any convenient "working" covariance matrices in equation (4.14). In this case, the estimator (/3, &) is no longer a pseudo-maximum likelihood estimator but the estimator is consistent and is asymptotically normally distribution provided the model specification of /Zj and <Tj is correct. Both the G E E l and Zhao-Prentice approaches considered the correlation coefficient for specifying the dependence among the observations within a family. But as a dependence parameter, the correlation coefficient is not the best choice because its range depends on the univariate margins. Fitzmaurice and Laird [9] used quadratic exponential family with conditional log odds-ratio as the dependence parameter. The main drawback of this approach is that the joint distribution of the observations are not reproducible. Thus this approach has limited application for studies with unequal family sizes. The interpretation of the conditional log odds-ratio is not as attractive as the unconditional association parameters. In the following section we will describe a GEE2 method which overcomes some of the drawbacks of the G E E l , the Zhao-Prentice, and the Fitzmaurice-Laird methods. 47 4 . 4 Generalized Estimating Equations II The focus of the GEE1 method is to describe the dependence of the mean function of the response on the explanatory variables. By considering the dependence as a nuisance, the GEE1 procedure provides consistent estimators of the regression parameters given the correct specification of the mean function. If the objective of the study is to describe both the dependence of the mean response on the explanatory variables and the dependence structures among the responses, the GEE1 method is not sufficient. The quadratic exponential model is one way to deal with such problems but has several drawbacks. Liang, Zeger, and Qaqish [25] extended the GEE1 procedure for estimating the parameters defined in the mean function and the dependence structure simultaneously. They call this procedure GEE2 and considered the bivariate log-odds ratio as the dependence parameter to illustrate this procedure. 4.4.1 The Method Let yi = (yn,yi2, • • •, yid)T be the binary response vector corresponding to the ith (i = 1,2,..., K) family. Let us also define Zj = (ynyi2,ynyi3, • • • ,yid-iyid)T, a vector of order m, where m = (ij). Let Xij be the covariate vector of order 1 x (p+1) corresponding to the jth (j = 1,2,..., d) member of the ith family. Let h(-) be the link function such that jjij(/3) = E(Yij) = h-1(Xijf3). The pairwise dependence is expressed in terms of the odds-ratios, for the (j, k) pair of the i t h family the odds-ratio is defined as = Pr(j/jj = l,yik = 1) Pr(3/tj = 0, yik = 0) 7 l j f c Pr(y;j = 1, yik = 0) Pr(yjj = 0, yik = 1)' Let us assume the vector 7 of order m x 1, where m is the number of the pairs in the i t h family, can be expressed as a function of the q x 1 (q < m) vector a. = ( a i , . . . ,aq)T, i.e. 7 = g~1(a), where 48 g(-) is any suitable link function. From the bivariate Plackett distribution (see §3.1), for any pair of responses we can write ( l / 2 ) ( 7 i j f e - l ) - 1 { ^ i Hiki lijk - 4(7yfe - 1) 7ijfc Hij Mifc]1/2} , if lijk / 1, ( 4- 1 5) Hij Hik i if lijk = 1 > Zijk{P,ot) = E<yijYik) = { de f where S((ii,H2,l) = 1 ~ ( ^ l + A*2>(1 - 7)-4.4.2 Estimating Equations Let 0 = (3T, aT)T be the parameters of the model which we wish to estimate. Since the joint distribution of the response vector y{ is not fully specified, Liang et al. [25] consider the following estimating equations for 0: K a , J. \ ( „. \ ( U(0) = E^Mcov-1 K = £ 80 w 0 \ a/3 da J K = £ CjSjAj = 0. i = l (4.16) For the i t h family the inverse of the covariance matrix is Bi = ( cov{yi) cov(yj,Zi) ^ - l \ cov{zi,yi) cov(zi) J The components of this matrix can be expressed in terms of the first four moments of yu which are shown in Appendix A . l . The estimating equations (4.16) are similar to the pseudo-score functions of the Zhao-Prentice method with the odds-ratio as the marginal dependence parameter and a specific choice of "working" covariance matrix. 49 The solution 9 — (0 , a ) of U(9) = 0 follows an asymptotic multivariate normal distri-bution with mean vector 9 and covariance matrix VG2 = Hil{0)H2{9)H?{9), (4.17) where K H\{9) — CjBjCj', 2=1 f y.-U \ / « - » ^ ^ 2 - ^ 2 The estimator 0 is consistent if the model for both the mean parameters h(fx) and the association parameters 3(7) are correctly specified. The main advantage of the GEE2 method over the regression methods based on the quadratic exponential family lies in the interpretation of the dependence parameters. The GEE2 models the bivariate odds-ratios for the within-family dependence which has a straightforward interpretation regarding the magnitude of the association between any pair of members within a family. The quadratic exponential family methods consider conditional models for association parameters which are of limited use in the case of unequal family sizes. 4 . 5 Summary In this chapter chronological developments of some estimating equation approaches which are ap-plicable to multivariate binary responses have been shown. Starting from the GLM, which is used for analyzing univariate binary responses, we have covered regression methods based on generalized estimating equations as well as quadratic exponential family based methods. By naively assuming the correlated binary responses as "independent", one can apply GLM to estimate the regression parameters defined for the mean function. These estimators are con-50 sistent but less efficient if the within-family association is high. The G E E l approach considers within family dependence in the analysis as a nuisance and can estimate regression parameters corresponding to the marginal mean function. Provided the model specification of the marginal means is correct, the G E E l approach provides consistent estimators of the regression parameters. The G E E l approach does not estimate the parameters corresponding to the within family correlation structure which might be of interest of some studies. Some models based on quadratic exponential family (e.g Zhao-Prentice, Fitzmaurice—Laird) and the GEE2 can estimate regression parameters and parameters corresponding to dependence structure simultaneously. These three models considered three different types of association parameters in the respective models. Pair-wise correlations, conditional odds-ratios and bivariate odds-ratios are considered as dependence parameters in the Zhao-Prentice, the Fitzmaurice-Laird and the GEE2 methods respectively. Liang et al. [25] proposed the GEE2 method without mentioning that the probabilistic assumptions are consistent with the multivariate logistic model of Molenberghs and Lesaffre [31] (which came later). 51 Chapter 5 Simulation Study The main objective of this thesis is to compare maximum likelihood and estimation equation based estimators of the multivariate logistic model for analyzing multivariate binary responses. For this comparison, a simulation study is considered with different family structures. The multivariate probit and logistic models are also compared for estimating the conditional probabilities. Numerical methods have been used for estimating the parameters of the methods that are considered in this thesis. This chapter contains results of the simulation study and the description of the numerical methods that are used in this thesis for estimating the parameters of the models. In the following section, a brief description of these numerical methods is given. 5.1 Numerical Optimization Methods Generally numerical optimization methods are used to optimize a function of the independent variables which can also consider restrictions on the independent variables; the function to be optimized is known as the objective function. In statistical applications, the negative of the log-likelihood function 1(0) is the objective function to be minimized to estimate the unknown pa-52 rameters 0 = (0i,..., 6V)T. A common method for estimating the unknown parameters is the Newton-Raphson method which requires analytic evaluation of the gradient and the Hessian ma-trix. If the objective function has a complex form, often the Hessian matrix cannot be evaluated analytically; in such situations quasi-Newton methods which only require analytic evaluation of the objective function, can be used for minimizing the negative of the log likelihood function. Numer-ical differentiation techniques are also useful for estimating the Hessian and possibly the gradient at fixed values of the parameters. For fitting the multivariate logistic and probit models we have used the C routines of Joe ([18], [19]) respectively. The quasi-Newton method with numerically approximated gradient has been used in these routines. For solving the estimating equations, we have used the C++ code of Joe [20]. This routine is based on quasi-Newton method with the gradient and the Hessian matrix computed by a differentiation package FADBAD (http://www.imm.dtu.dk/~os/fadbad.html) which uses automatic differentiation techniques to compute the derivatives of a function written in a high level language such as FORTRAN, C, or C++. The public domain version of the GEE2 code, originally written in Pascal1 and converted to C with p2c, cannot handle general familial data. It can handle familial data with a simple interclass/intraclass structure. It is also restrictive in computer memory requirements. It would have taken more time to modify the original GEE2 code to suit our purposes than to start from scratch; The advantage of the automatic differentiation approach is that we need to code only the likelihood function. For simulated familial data with a structure such that the original GEE2 program can be used, we checked that our program and the original program gave the same (or nearly the same) estimates. The main difference between the public domain version of GEE2 and our implementation is the computational procedure for the second- and higher-order moments. We used the multivariate Plackett construction for computing these moments, whereas the pub-'http://statlab.uni-heidelberg.de/statlib/GEE/GEE2/ 53 lie domain version uses the McCullagh-Nelder-Glonek approach. For estimating the parameters, the public domain version uses the classical Newton-Raphson method and we have used a FAD-BAD based Newton-Raphson method. Because of the operator overloading, the FADBAD based Newton-Raphson method is bit slower than the classical Newton-Raphson method. In the subsequent two sections, we first describe the classical Newton-Raphson method and then the quasi-Newton methods are also briefly described. 5.1.1 The Newton-Raphson Method The Newton-Raphson method is based on approximating the objective function 1(0) locally by a quadratic model and then minimizing that function. For the value of the parameters 0k at the fcth iteration, the objective function can be approximated as l(Ok + d) * l(0k) + UT(0k)d+^dTH(0k)d, (5.1) where are the gradient and the Hessian matrix of the objective function 1(0) respectively. The Hessian matrix is the Jacobian of the gradient which is symmetric and assumed to be positive definite. The minimum of the right-hand side of (5.1) is achieved when dk is the minimum of the quadratic function Q(d) = UT(0k)d+^tfH(0k)d. The expression for dk can be obtained by solving the equation {dQ(d)/dd} — 0, which leads to dk = -H-\0k)U(0k). 54 The quantity dk is known as the Newton direction and is used to update the current parameter value Ok by Ok+i = Ok + dk, fc = 0 , l ,2 . . . = 0k-H-\ek)U{0k) = Ok-H^gk, (5.2) where Hk = H(0k) and gk — U(0k) are introduced to simplify the notation. Starting with a suitable initial value Oo, the equation (5.2) is successively evaluated until the parameter vector 0 has converged. This method requires the computation of the gradient and inverse of the Hessian matrix at each iteration. To evaluate the Hessian matrix, p(p + l)/2 partial derivatives must be calculated analytically and for inverting the Hessian matrix a system of linear equations must be solved. But in many cases the second derivative of the objective function is not available analytically, so the Hessian matrix cannot be computed. The Newton method also breaks down if the Hessian matrix is singular at some iteration. Otherwise Newton's method works very well if the initial value is sufficiently close to the true value. These difficulties of the Newton's method lead researchers to search for other optimization methods, such as quasi-Newton methods which do not require computing the Hessian matrix or the gradient analytically. 5.1.2 The Quasi-Newton Method The quasi-Newton method, also known as a variable metric method (see Nash [32]), approximates the inverse of the Hessian matrix by some modification of the previously constructed matrix. This method does not require computation or inversion of the Hessian matrix for minimizing the objective function 1(0). For implementing the quasi-Newton method, assume that either the gradient is available analytically or a computer routine is available for numerically computing the gradient. 55 For example, at the kth iteration, from the currently available quantities such as 9k, 9k, 0 f c - i , and g k - i the quasi-Newton method provide an approximation Bk to H^1. This approximation can be used to update the current value of the parameter by 9k+i = 9k-Bkgk- (5.3) By using Taylor's theorem, the gradient at kth iteration can approximately be written as fffc-i - 9k + Hk(9k-i - 9k) =>yfc ~ Hksk (5.4) where Sk = 9k — 9k-i and y k — 9k — 9k-i- If the objective function is quadratic, the Hessian matrix is constant, i.e. Hk = H, VA; > 0 and the relationship of (5.4) is exact. If the objective function is not quadratic but is strictly convex and has continuous second partial derivatives in a neighborhood of 9* then the objective function is well approximated by a quadratic function with Hessian matrix H(9*) in a sufficiently small neighborhood of 9* (Wolfe [36]). For such objective functions the relation (5.4) becomes exact as 9k approaches 9*. So, it seems desirable that the approximation Bk to H^1 should satisfy the relation Bkyk = sk- (5.5) This equation is known as the quasi-Newton equation. It is desirable that Bk can be computed from Sk, and y k - Consider Bk = Bk_i + Ck{sk,yk,Bk-i), (5.6) where the correction matrix Ck has the same properties as the matrix Bk. This construction of the sequence {Bk} is not unique; a number of methods are available for computing Bk from the current values of the parameters. One such method is described in the following subsection. 56 Davidson-Fletcher-Powell Method The Davidson-Fletcher-Powell method (see Wolfe [36]) is one of the popular methods for construct-ing the sequence {Bk}. For this method the correction matrix is written as Ck(sk,yk,Bk-i) = SkU? - Bk-iyk Vfe, where Uk and Vk are nonzero vectors such that ukVk = !> vkVk = 1- (5-7) By using the this form of correction matrix Ck in equation (5.6) we get BkVk = Bk-iVk + SkuTyk - -S fc_iy feujy fc = Bk-iyk + Sk~ Bk-iVk = sk. This shows that the Davidson-Fletcher-Powell method satisfies the quasi-Newton equation (5.5). The choice Sk Bk-iVk Uk = ~T—, vk = s%yk VkBk-iVk satisfies the equation (5.7). Therefore, we can write the expression of the matrix Bk+i as o o , sksl (Bk-iyk){Bk-iyk)T ok = ±>A:-1 + ~y 7fr- . skVk VkBk-iVk After getting the value of Bk we can update the parameter by using the equation (5.2). 5.1.3 Newton method with Automatic Differentiation As we already mentioned that the Newton-Raphson/quasi-Newton method can be used with the gradient and the Hessian matrix obtained from differentiation packages. A differentiation package FADBAD, which is based on automatic differentiation, is used for coding the routine of Joe [20]. 57 Automatic differentiation has become a very popular tool for numerical differentiation in the last twenty years. The advantages of this method over other existing procedures (e.g. divided difference, symbolic differentiation) have been discussed by Griewank [14]. There are two versions of automatic differentiation, known as forward and backward modes. A number of software packages for automatic differentiation written in high-level languages such as FORTRAN or C++ are avail-able. The user only needs to provide the subroutine to evaluate the underlying objective function provided an interface routine has been written for Newton-Raphson. We have used the forward mode of the automatic differentiation method to evaluate the Hessian matrix. This Hessian matrix used in Newton-Raphson routine for estimating the parameters of the GEE2 method. Automatic differentiation introduced a data type doublet (see Dixon [6]). A doublet variable U, consists of p + 1 values (u, XjUi), where \/U{ = du/dxi, i = 1,2,... ,p. For doublet variable the usual operators (e.g. +, —, *,.••) a r e overloaded; operator overloading is a nice feature of the computer languages such as C++/Fortran which helps to redefine the meaning of the elementary operators. For an assignment W — U * V where U and V are doublets, the multiplication operator * is defined in such a way that it not only computes the product of U and V but also update the associated gradient object by \/Wi = usj Vi + vsj u% (i = 1,2,... ,p). Similarly all other elementary operations can also be defined; for example W = U + V ->• (u + v,S7Ui + S7Vi, i = 1,2,... ,p), W = l/U -»• (-,4v«i>« = l , 2 p), W-logu -+ flogu, - v«» i = 1,2,... ,p). Automatic differentiation packages convert any given functions to these elementary functions and then use the chain rule to compute the derivatives of the given function. 58 5.2 Comparison of the Methods In this section, a comparison of maximum likelihood (ML) and GEE2 approaches for estimating the parameters of the multivariate logistic model is shown by considering a simulation study. Four different types of the family structures namely Pedigree A, B, C, and D are considered, the descrip-tion and the graphical representation of these families are given in the following sections. Because the multivariate logistic distribution is defined based on implicit equations for dimensions d > 3, simulation from it would be very difficult. To simulate from the multivariate logit binary model requires the computation of 2d (d is the family size) multivariate probabilities and then the 2d pos-sible binary d-dimensional vectors can be considered as outcomes of a multivariate discrete random variable with these probabilities. This is much more difficult than simulation from the multivariate probit model. Therefore, the comparison is based on multivariate binary data simulated from the multivariate probit model. A sample of 200 families (family size depends on the pedigree types) is generated for each pedigree. For simplicity of the comparison, all 200 families have the same family structure. A mixture of pedigrees A, B, and C is also considered for the simulation study because for real data pedigrees for different families will typically be different. To examine the effect of the sample size for the comparison of ML and GEE2 approaches, we also analyzed a data set from Pedigree A with 600 families. The only covariate Age is assumed uniform on (I, u), where / and u depend on the member of the family. The response vector is generated by using the multivariate probit model; i.e., given the specified correlation structure and the values of the intercept (Bo) and regression coefficient (B\) for the covariate Age the response vector is obtained by using the equation (2.1). To simulate binary response vectors by the multivariate probit models, two types of correlation structures (exchangeable and familial) are considered for each pedigree. For an exchangeable correlation structure, the correlation is the same for all pairs of members of a family; on the other hand, for a familial correlation structure, the correlations differ for different types of pairs. Three values (0.9, 59 0.5, 0.1) of the correlation coefficients and two sets of the correlation coefficients are considered for the exchangeable and familial correlation structures respectively. We also used different values of the regression constant and coefficient in different analyses. For this analysis, we assume that only the univariate margins depend on the covariate Age. The multivariate logistic model has equations logit 7T = fio+Pi* Age, log7 = a, for the univariate margins and the dependence structure respectively. Here, 7r and 7 are the vector of univariate margins and cross-product ratios respectively. For the multivariate logistic model the third- and higher-order dependence parameters are assumed fixed at one. For each pedigree, the results of the simulation analysis for the exchangeable and famil-ial correlation structures are shown in two separate tables. The maximum likelihood and GEE2 methods are compared with respect to the mean estimates of the regression parameters and log OR for dependence parameters, and their corresponding standard errors. The average of the ab-solute difference of the parameter estimates and their standard errors corresponding to these two methods are also shown in these tables. Empirical standard deviations of the parameter estimates corresponding to these two approaches are also shown in these tables. For each of the analysis 500 repetitions were considered to obtain the results of these tables. The true parameter values which are used to generate the data for different pedigree are also listed in these tables. 5.2.1 Pedigree A Pedigree A is the simplest family structure considered for the simulation study. This pedigree consists of four members who are from two different relative classes and the graphical diagram of this pedigree is shown in Figure 5.1. The first member is one of the parents and the others are 60 offspring, so only two types of dependence namely, sib-sib (SS) (2:3, 2:4, 3:4) and parent-offspring (PO) (1:2, 1:3, 1:4) are considered for this pedigree. Figure 5.1: Graphical diagram of Pedigree A. Table 5.2 shows the simulation results for the Pedigree A, where the responses are generated by a multivariate probit model with an exchangeable correlation structure, i.e. ppo = pss- The average absolute difference of the estimates show that the maximum likelihood and GEE2 parameter estimates are equal up to two decimal places for the correlation coefficient p = 0.1 and p = 0.5. For p = 0.9, the parameter estimates corresponding to the covariate Age (/3i) and corresponding to parent-offspring pair (log ORpo) are equal up to one decimal place and others are equal up to two decimal places. The differences among the parameter estimates tend to increase as the within-family dependence increases. The averages of the maximum likelihood and the GEE2 estimates of the standard errors are equal up to one decimal place for all the values of p. For the extreme values of correlation coefficients (i.e. p = 0.1,0.9) the absolute differences between the corresponding standard errors are larger than at p = 0.5. The average standard errors of the estimates of the dependence parameters (log ORpo and log ORss) increase as p increases. The average of the intercept (/3o) and the estimates corresponding to the covariate Age tend to decrease as p increases. For Pedigree A, on average the GEE2 estimates are marginally more efficient than maximum likelihood estimates with exchangeable correlation structure. Table 5.3 shows the simulation results for Pedigree A, where the responses are generated by a multivariate probit model with a familial correlation structure with sib-sib and parent-offspring 61 correlations. Two sets of correlation coefficients {pss — 0.8, ppo — 0.6} and {pss — 0.9, ppo = 0.4} are considered for this case. The ML estimates are found marginally less efficient than the corresponding estimates from the GEE2. Table 5.4 shows the simulation results for Pedigree A with a sample of size 600. This results show that the average of the difference of the ML and GEE2 estimates tend to decrease as sample size increases and standard errors behave as expected. For the Pedigree A, the empirical standard deviations (SDs) of the parameter estimates are also shown in the Tables 5.2-5.4. The SDs are quite similar for the estimates corresponding to the ML and GEE2 approaches. As expected these empirical standard deviations are approximately equal to the average of the corresponding standard errors. Note that as expected, most of the regression estimates are approximately 1.6-1.8 times the probit regression coefficients. Also the odds-ratio parameter estimates are roughly satisfied the equations (3.6) and (3.7). For example, for Pedigree A with exchangeable correlation structure (p = 0.5), the regression parameter estimates from the logistic model of /3o = 1.318 and $i = 0.362 correspond to the regression parameters of the probit model of 6Q = 0.8 and B\ = 0.2. That is, the logistic estimates are 1.65 (=1.318/0.8) and 1.81 (=0.362/0.2) times the respective probit parameters. 5.2.2 Pedigree B Figure 5.2 shows the graphical diagram of Pedigree B, which has four members from three different relative classes. The members are one grandparent, one parent and two offspring; three types of dependence namely sib-sib (3:4), parent-offspring (1:2, 2:3, 2:4), and second-degree relationship (D2) (1:3, 1:4) can be considered for this family structure. Table 5.5 shows the simulation results of Pedigree B, where the responses are generated by a multivariate probit model with an exchangeable correlation structure. The averages of the 62 Figure 5.2: Graphical diagram of Pedigree B. parameter estimates and the corresponding standard errors for the maximum likelihood and the GEE2 methods are very close. For p = 0.1, the average absolute differences show that the maximum likelihood and GEE2 estimates of the regression parameters are similar up to two decimal places. For p = 0.5, $i and log ORpo are similar up to two decimal places and other estimators are similar up to one decimal place. For p = 0.9, all the.estimators are similar up to one decimal place. The average of the difference between the parameter estimates increases as the within-family dependence increases. The difference between the standard errors of the parameter estimates corresponding to the maximum likelihood and the GEE2 are equal up to one decimal points for all the values of the correlation coefficients. The average standard errors agree more closely for p = 0.5 than for the other values of p. For Pedigree B with exchangeable correlation structure, the estimates of Q\ and log OR are found to be marginally less efficient for the maximum likelihood in most cases. Table 5.6 shows the simulation result for Pedigree B, where the responses are generated by a multivariate probit model with a familial correlation structure. Two sets of correlation coefficients {pss = 0.8, ppo = 0.6, and pm = 0.5} and {pss = 0.9, ppo = 0.5, and pm = 0.1} are considered for this case. The average of the absolute differences show that except for \ogORm-, all other estimators from the maximum likelihood and the GEE2 are similar up to two decimal places. For this case the parameter estimates corresponding to the GEE2 are found to be marginally more 63 Figure 5.3: Graphical diagram of Pedigree C. efficient than maximum likelihood estimator. The empirical SDs of the corresponding parameter estimates of the ML and GEE2 ap-proaches are very close and are consistent with the average of the corresponding standard errors. 5.2.3 P e d i g r e e C Pedigree C has five members: one grandparent, one parent, one uncle, and two offspring; the graphical diagram of this pedigree is in Figure 5.3. Similar to Pedigree B, for this pedigree three types of dependence can be considered: sib-sib (2:3, 4:5), parent-offspring (1:2, 1:3, 2:4, 2:5), and second-degree relationship (1:4, 1:5, 3:4, 3:5). Table 5.7 shows the results for Pedigree C, where the responses are generated by a multivari-ate probit model with an exchangeable correlation structure. For correlation coefficient p = 0.1, the average of the absolute differences show that the parameter estimates corresponding to the maximum likelihood and GEE2 are similar up to two decimal places. For p = 0.5 and p = 0.9 all the estimates corresponding to the dependence parameters are similar up to one decimal place. The absolute difference between the respective parameter estimates increases as the within-family dependence increases. The average of the absolute differences also show that the standard errors are similar up to one decimal place for all values of the correlation coefficients. For p — 0.5 these 64 standard errors are more similar than for the other values of the correlation coefficient. For Pedigree C, the maximum likelihood method provides marginally more efficient estimators than the GEE2 only for the case in which within-family dependence is high. Table 5.8 shows the simulation results for Pedigree C, where the responses are generated by a multivariate probit model with a familial correlation structure. As before two sets of correlation coefficients are considered for this case. The averages of the absolute differences show that both the parameter estimates and the corresponding standard errors for the maximum likelihood and the GEE2 are similar up to one decimal place. The estimates corresponding to the GEE2 are found to be marginally efficient than the estimates corresponding to maximum likelihood. The empirical SDs of the corresponding parameter estimates of the ML and GEE2 ap-proaches are very close and are consistent with the average of the corresponding standard errors. Though the family sizes are different, the same dependence structures were used for Pedigree B and C in the simulation study. The comparison of the results of these two pedigrees would provide the effect of family size for comparing the parameter estimates of the maximum likelihood and the GEE2. This comparison reveals that the standard errors of all the parameter estimates decreases as the pedigree size increases and the absolute differences between the parameters and the standard errors also decreases as pedigree size increases. 5.2.4 Pedigree D Figure 5.4 shows the graphical diagram of the Pedigree D which has six members: one grandparent, one parent, one uncle, two offsprings, and a cousin. This pedigree can consider four types of dependences: sib-sib (2:3, 4:5), parent-offsprings (1:2, 1:3, 2:4, 2:5, 3:6), second-degree relationship (1:4, 1:5, 1:6, 2:6, 3:4, 3:5) and third-degree relationship (D3) (4:6, 5:6). Table 5.9 shows the simulation results for Pedigree D, where the responses are generated by a multivariate probit model with exchangeable correlation structure. The average absolute difference 65 Figure 5.4: Graphical diagram of Pedigree D. show that the parameter estimates corresponding to the dependence structure (i.e. log OR) are similar up to one decimal place for all correlation coefficients we have considered. The parameter estimate corresponding to the covariate Age is similar up to two decimal place except for p = 0.9. The average of the absolute difference also show that the standard errors corresponding to log ORs are similar up to one decimal place for p = 0.1 and 0.5. For correlation coefficient p = 0.9 the average differences between these standard errors are greater than 0.1. The maximum likelihood estimates corresponding to the log ORs are found marginally efficient than the corresponding GEE2 estimates for p = 0.9. For p — 0.5 both the parameter estimates and the standard errors are found to be closer than the extreme cases. Table 5.10 shows the simulation results for Pedigree D, where the responses are generated by a multivariate probit model with a familial correlation structure. The averages of the absolute differences show that the maximum likelihood and GEE2 parameter estimates and standard errors are similar up to one decimal place. The maximum likelihood estimates are found marginally more efficient than the GEE2 only for log ORpo and log ORr>2-The empirical SDs of the corresponding parameter estimates of the ML and GEE2 ap-proaches are very close and are consistent with the average of the corresponding standard errors. 66 5.2.5 Mixture of pedigrees A, B, and C For the mixture of the pedigrees A, B, and C, a sample of 100 families from each of these pedigrees are considered. The comparison of ML and GEE2 for this mixture pedigree is very important because in practice, the families are of different sizes. For this case, we can consider three types of dependences: sib-sib, parent-offspring, second-degree relationship. Table 5.11 shows the simulation results for the mixture of pedigrees with exchangeable correlation structure. The averages of the absolute differences show that the maximum likelihood and GEE2 parameter estimates are similar up to two decimal places for p = 0.1,0.5; for p = 0.9 these estimates are similar up to one decimal place. The absolute differences also show that the standard errors corresponding to maximum likelihood and GEE2 are similar up to one decimal place for all values of the correlation coefficients. The estimators corresponding to maximum likelihood are found to be marginally more efficient than the corresponding estimators of the GEE2 method only when the within-family dependence is high. Table 5.12 shows the results for mixture pedigrees with a familial correlation structure. The absolute differences show that the maximum likelihood and GEE2 estimators are similar up to one decimal place. The estimators corresponding to the GEE2 are found to be marginally more efficient than the corresponding parameters of the maximum likelihood. 5.2.6 Estimating Conditional Probabilities In Chapter 2, we have discussed the procedure of estimating conditional probabilities and their standard errors for the multivariate probit model (see §2.4). A similar procedure can also be derived for the multivariate logistic model, but not for the GEE2 method. The GEE2 method does not specify the joint distribution, so expressions for the orthant probabilities which enter into the conditional probabilities are not available. Given estimates of the parameters of the multivariate probit and logistic models, these orthant probabilities can be computed from equations (2.14) and 67 (3.10) respectively. We have used C routines of Joe ([18] and [19]) for computing these orthant probabilities; the conditional probabilities and corresponding standard errors are obtained by using numerical differentiation methods. To compare the multivariate probit and logistic models for the estimation of these conditional probabilities, we have considered samples of size 200 from the pedigrees A, B, and C. As before these samples are also generated from the multivariate probit models. Tables 5.13-5.15 show the estimates of the conditional probabilities and their standard errors for these pedigrees. As an example, five different conditional probabilities are arbitrarily chosen for each of these cases. Given the parameter estimates and the covariate values similar estimates of other probabilities can also be obtained. For example, given a diseased individual of age 67 from Pedigree B the estimate of the probability that his younger offspring of age 14 has the disease is about 0.85, with a standard error 0.02. These tables show that most of these estimates are similar for the multivariate probit and logistic models. The empirical SDs of the corresponding parameter estimates of the ML and GEE2 ap-proaches are very close and are consistent with the average of the corresponding standard errors. 5.2.7 Comments on Computing Time This research was motivated by the observation that maximum likelihood estimation for the mul-tivariate logistic model became very time consuming when there were pedigrees in the data set of sizes 7 or more. The computing time for the multivariate Plackett distribution is exponentially in-creasing in the dimension, because of the equations that must be solved for dimension 3 and higher. Table 5.1 shows the average computing time needed for the maximum likelihood and the GEE2 methods. The GEE2 equations are computationally more efficient in dimensions > 6 because they require only computation of the multivariate Plackett distribution up to dimension 4. In practice, if the familial data has many pedigrees of dimension 6 or more, we recommend the GEE2 approach 68 over maximum likelihood. 5.3 Summary In this chapter, a simulation study for comparing the maximum likelihood and GEE2 method for the multivariate logistic model is discussed. In Section 5.1, a brief description of the numerical optimization methods which are used to fit these models is given. Because of the complex form of the gradient and Hessian matrix, the classical Newton-Raphson method is not be used for these models. The quasi-Newton method is used to estimate the parameters of the models considered in this chapter. The results of the simulation study indicate that the maximum likelihood and GEE2 estimates of the regression parameters and their respective standard errors are usually equal at least up to one decimal place. This result is consistent for all the pedigrees we have considered for this study. The empirical SDs of the parameter estimates corresponding to the M L and GEE2 approaches are found quite close and are consistent with the corresponding average of the standard errors. The estimates of the conditional probabilities corresponding to the multivariate probit and logistic models are also found to be similar. 69 Table 5.1: Average computing time (in mins.) by pedigrees and methods. Correlation coefficient 0.1 0.5 0.9 ML GEE2 ML GEE2 ML GEE2 A B C D Mixed 0.216 0.465 0.294 0.460 3.755 2.037 36.852 4.204 2.288 1.479 0.215 0.460 0.283 0.466 3.628 1.911 31.120 3.681 2.324 1.494 0.217 0.507 0.277 0.542 3.653 2.190 40.951 4.914 2.266 1.704 Table 5.2: Simulation results for Pedigree A with exchangeable correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.307 (0.155) 1.308 (0.155) 0.001 0.155 0.152 0.004 Age 0.2 0.383 (0.532) 0.381 (0.532) 0.002 0.517 0.504 0.017 SS 0.1 0.282 (0.249) 0.283 (0.249) 0.001 0.253 0.250 0.019 PO 0.1 0.291 (0.273) 0.292 (0.272) 0.002 0.265 0.263 0.015 Const. 0.8 1.318 (0.149) 1.318 (0.149) 0.002 0.158 0.159 0.004 Age 0.2 0.362 (0.442) 0.360 (0.442) 0.008 0.444 0.442 0.014 SS 0.5 1.547 (0^ 266) 1.545 (0.266) 0.005 0.274 0.272 0.010 PO 0.5 1.578 (0.287) 1.575 (0.287) 0.008 0.284 0.281 0.009 Const. 0.8 1.316 (0.175) 1.312 (0.175) 0.009 0.170 0.169 0.003 Age 0.2 0.376 (0.311) 0.378 (0.305) 0.032 0.305 0.307 0.014 SS 0.9 3.827 (0.379) 3.824 (0.374) 0.008 0.390 0.376 0.016 PO 0.9 3.849 (0.393) 3.839 (0.378) 0.032 0.409 0.388 0.023 tempirical SDs are in the parenthesis 70 Table 5.3: Simulation results for Pedigree A with familial correlation structure. True Parameter Estimates Standard Errors values M L G E E 2 Diff. M L G E E 2 Diff. Const. 0.8 1.314 (0.181) 1.312 (0.181) 0.005 0.183 0.180 0.004 Age 0.2 0.369 (0.442) 0.371 (0.443) 0.012 0.450 0.440 0.015 SS 0.8 2.960 (0.346) 2.959 (0.347) 0.003 0.339 0.327 0.012 P O 0.6 1.927 (0.325) 1.921 (0.325) 0.009 0.341 0.330 0.012 Const. 0.5 0.810 (0.185) 0.810 (0.185) 0.003 0.184 0.179 0.006 Age 1.0 1.731 (0.532) 1.731 (0.533) 0.010 0.505 0.483 0.025 SS • 0.9 3.717 (0.364) 3.718 (0.364) 0.006 0.368 0.357 0.017 P O 0.4 1.221 (0.359) 1.219 (0.359) 0.010 0.376 0.364 0.015 tempirical SDs are in the parenthesis Table 5.4: Simulation results for Pedigree A with exchangeable correlation structure and sample of size 600. True Parameter Estimates Standard Errors values M L G E E 2 Diff. M L G E E 2 Diff. Const. 0.8 1.311 (0.084) 1.312 (0.084) 0.001 0.090 0.090 0.001 Age 0.2 .0.361 (0.292) 0.359 (0.291) 0.002 0.300 0.299 0.006 SS 0.1 0.298 (0.146) 0.298 (0.146) 0.001 0.146 0.146 0.007 P O 0.1 0.306 (0.154) 0.306 (0.154) 0.001 0.152 0.151 0.005 Const. 0.8 1.316 (0.090) 1.316 (0.089) 0.001 0.093 0.093 0.001 Age 0.2 0.352 (0.252) 0.352 (0.252) 0.004 0.260 0.260 0.005 SS 0.5 1.548 (0.164) 1.546 (0.164) 0.003 0.160 0.158 0.004 P O 0.5 1.558 (0.165) 1.556 (0.165) 0.005 0.165 0.162 0.004 Const. 0.8 1.324 (0.099) 1.320 (0.098) 0.006 0.098 0.097 0.002 Age 0.2 0.344 (0.168) 0.346 (0.167) 0.018 0.173 0.176 0.006 SS 0.9 3.812 (0.221) 3.809 (0.222) 0.004 0.223 0.220 0.005 P O 0.9 3.840 (0.229) 3.834 (0.227) 0.016 0.231 0.226 0.007 tempirical SDs are in the parenthesis 71 Table 5.5: Simulation results for Pedigree B with exchangeable correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.309 (0.154) 1.310 (0.154) 0.001 0.158 0.156 0.003 Age 0.2 0.372 (0.395) 0.370 (0.395) 0.001 0.400 0.393 0.011 SS 0.1 0.261 (0.432) 0.261 (0.432) 0.002 0.422 0.417 0.015 PO 0.1 0.287 (0.265) 0.287 (0.265) 0.002 0.258 0.262 0.019 D2 0.1 0.288 (0.340) 0.288 (0.340) 0.002 0.321 0.317 0.024 Const. 0.8 1.316 (0.148) 1.317 (0.149) 0.003 0.159 0.162 0.004 Age 0.2 0.357 (0.321) 0.354 (0.326) 0.008 0.346 0.343 0.010 SS 0.5 1.527 (0.397) 1.563 (0.920) 0.047 0.385 0.388 0.015 PO 0.5 1.574 (0.273) 1.572 (0.274) 0.008 0.282 0.281 0.012 D2 0.5 1.580 (0.324) 1.581 (0.328) 0.011 0.317 0.316 0.018 Const. 0.8 1.323 (0.177) 1.314 (0.172) 0.020 0.171 0.171 0.006 Age 0.2 0.356 (0.246) 0.367 (0.236) 0.067 0.229 0.243 0.019 SS 0.9 3.822 (0.493) 3.857 (0.478) 0.035 0.482 0.475 0.030 PO 0.9 3.862 (0.386) 3.838 (0.373) 0.052 0.397 0.392 0.035 D2 0.9 3.897 (0.414) 3.878 (0.409) 0.086 0.432 0.432 0.041 fempirical SDs are in the parenthesis Table 5.6: Simulation results for Pedigree B with familial correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.5 0.797 (0.163) 0.797 (0.163) 0.002 0.171 0.167 0.005 Age 1.0 1.768 (0.391) 1.767 (0.392) 0.005 0.412 0.397 0.018 SS 0.8 2.861 (0.411) 2.861 (0.412) 0.003 0.399 0.393 0.017 PO 0.6 1.971 (0.280) 1.972 (0.278) 0.009 0.294 0.288 0.013 D2 0.4 1.249 (0.409) 1.245 (0.406) 0.013 0.393 0.381 0.027 Const. 0.5 0.796 (0.183) 0.797 (0.183) 0.001 0.180 0.175 0.005 Age 1.0 1.765 (0.445) 1.764 (0.446) 0.003 0.430 0.415 0.016 SS 0.9 3.764 (0.478) 3.765 (0.477) 0.003 0.469 0.452 0.023 PO 0.5 1.581 (0.272) 1.581 (0.270) 0.006 0.295 0.289 0.013 D2 0.3 0.905 (0.443) 0.904 (0.442) 0.010 0.419 0.400 0.031 fempirical SDs are in the parenthesis 72 Table 5.7: Simulation results for Pedigree C with exchangeable correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.313 (0.157) 1.314 (0.158) 0.001 0.156 0.153 0.006 Age 0.2 0.361 (0.405) 0.359 (0.405) 0.002 0.400 0.391 0.015 SS 0.1 0.281 (0.324) 0.281 (0.325) 0.003 0.306 0.303 0.013 PO 0.1 0.264 (0.231) 0.264 (0.232) 0.003 0.231 0.229 0.016 D2 0.1 0.283 (0.222) 0.284 (0.222) 0.002 0.227 0.224 0.021 Const. 0.8 1.313 (0.159) 1.311 (0.160) 0.004 0.161 0.159 0.006 Age 0.2 0.370 (0.339) 0.370 (0.339) 0.008 0.343 0.341 0.014 SS 0.5 1.563 (0.289) 1.557 (0.292) 0.012 0.290 0.287 0.014 PO 0.5 1.553 (0.246) 1.546 (0.248) 0.012 0.253 0.247 0.014 D2 0.5 1.579 (0.240) 1.573 (0.239) 0.010 0.249 0.244 0.017 Const. 0.8 1.317 (0.174) 1.309 (0.175) 0.015 0.171 0.168 0.007 Age 0.2 0.364 (0.235) 0.369 (0.230) 0.032 0.229 0.234 0.020 SS 0.9 3.835 (0.393) 3.817 (0.407) 0.024 0.364 0.370 0.034 PO 0.9 3.811 (0.375) 3.793 (0.374) 0.027 0.342 0.342 0.030 D2 0.9 3.835 (0.364) 3.818 (0.348) 0.035 0.341 0.342 0.032 fempirical SDs are in the parenthesis Table 5.8: Simulation results for Pedigree C with familial correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.5 0.787 (0.157) 0.787 (0.157) 0.003 0.165 0.160 0.007 Age 1.0 1.796 (0.395) 1.798 (0.394) 0.011 0.406 0.394 0.021 SS 0.8 2.885 (0.295) 2.888 (0.293) 0.011 0.294 0.294 0.014 PO 0.6 1.944 (0.272) 1.947 (0.271) 0.021 0.273 0.271 0.018 D2 0.4 1.223 (0.272) 1.212 (0.273) 0.019 0.281 0.279 0.022 Const. 0.8 1.310 (0.178) 1.310 (0.178) 0.004 0.181 0.178 0.008 Age 0.2 0.374 (0.394) 0.375 (0.393) 0.011 0.399 0.393 0.021 SS 0.9 3.815 (0.353) 3.821 (0.351) 0.015 0.343 0.342 0.020 PO 0.5 1.527 (0.272) 1.534 (0.275) 0.018 0.267 0.263 0.020 D2 0.3 0.914 (0.293) 0.909 (0.293) 0.020 0.287 0.279 0.025 fempirical SDs are in the parenthesis 73 Table 5.9: Simulation results for Pedigree D with exchangeable correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.303 (0.131) 1.306 (0.139) 0.003 0.137 0.131 0.009 Age 0.2 0.385 (0.317) 0.388 (0.325) 0.004 0.321 0.315 0.020 SS 0.1 0.317 (0.301) 0.294 (0.590) 0.026 0.309 0.307 0.030 PO 0.1 0.320 (0.197) 0.306 (0.363) 0.017 0.216 0.212 0.032 D2 0.1 0.309 (0.190) 0.286 (0.542) 0.026 0.200 0.196 0.036 D3 0.1 0.272 (0.315) 0.249 (0.584) 0.026 0.311 0.306 0.035 Const. 0.8 1.326 (0.147) 1.323 (0.146) 0.005 0.146 0.144 0.007 Age 0.2 0.354 (0.264) 0.355 (0.264) 0.008 0.275 0.271 0.017 SS 0.5 1.558 (0.285) 1.544 (0.289) 0.019 0.287 0.286 0.023 PO 0.5 1.577 (0.233) 1.564 (0.234) 0.017 0.231 0.226 0.021 D2 0.5 1.572 (0.221) 1.559 (0.223) 0.017 0.220 0.215 0.020 D3 0.5 1.548 (0.290) 1.538 (0.294) 0.016 0.298 0.294 0.025 Const. 0.8 1.342 (0.181) 1.325 (0.174) 0.019 0.190 0.162 0.040 Age 0.2 0.327 (0.181) 0.342 (0.175) 0.030 0.193 0.187 0.048 SS 0.9 3.855 (0.364) 3.817 (0.379) 0.046 0.326 0.366 0.115 PO 0.9 3.825 (0.317) 3.787 (0.322) 0.046 0.285 0.319 0.106 D2 0.9 3.840 (0.314) 3.802 (0.316) 0.052 0.274 0.311 0.107 D3 0.9 3.813 (0.389) 3.779 (0.407) 0.046 0.346 0.378 0.108 tempirical SDs are in the parenthesis Table 5.10: Simulation results for Pedigree D with familial correlation structure. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.321 (0.145) 1.318 (0.140) 0.008 0.156 0.142 0.022 Age 0.2 0.354 (0.303) 0.353 (0.310) 0.017 0.318 0.303 0.041 SS 0.9 3.802 (0.347) 3.812 (0.336) 0.025 0.333 0.333 0.059 PO 0.5 1.529 (0.228) 1.546 (0.232) 0.024 0.212 0.222 0.053 D2 0.3 0.911 (0.232) 0.907 (0.232) 0.022 0.212 0.221 0.048 D3 0.1 0.264 (0.409) 0.258 (0.406) 0.027 0.388 0.378 0.063 tempirical SDs are in the parenthesis 74 Table 5.11: Simulation results for mixed Pedigree with exchangeable correlation structures. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.318 (0.125) 1.319 (0.125) 0.001 0.127 0.125 0.003 Age 0.2 0.352 (0.338) 0.351 (0.338) 0.002 0.346 0.341 0.010 SS 0.1 0.286 (0.263) 0.287 (0.263) 0.002 0.253 0.247 0.018 PO 0.1 0.304 (0.201) 0.304 (0.201) 0.002 0.205 0.203 0.012 D2 0.1 0.321 (0.263) 0.321 (0.263)' 0.002 0.261 0.257 0.031 Const. 0.8 1.316 (0.129) 1.315 (0.128) 0.002 0.131 0.130 0.003 Age 0.2 0.358 (0.302) 0.359 (0.301) 0.006 0.300 0.300 0.009 SS 0.5 1.547 (0.256) 1.544 (0.258) 0.007 0.250 0.247 0.011 PO 0.5 1.561 (0.206) 1.557 (0.206) 0.008 0.221 0.218 0.010 D2 0.5 1.580 (0.267) 1.576 (0.268) 0.009 0.259 0.258 0.024 Const. 0.8 1.318 (0.136) 1.313 (0.136) 0.010 0.138 0.137 0.004 Age 0.2 0.360 (0.206) 0.363 (0.204) 0.024 0.199 0.206 0.013 SS 0.9 3.822 (0.330) 3.813 (0.335) 0.016 0.324 0.324 0.024 PO 0.9 3.830 (0.295) 3.821 (0.308) 0.022 0.303 0.302 0.021 D2 0.9 3.850 (0.323) 3.843 (0.331) 0.035 0.331 0.340 0.038 fempirical SDs are in the parenthesis Table 5.12: Simulation results for mixed Pedigree with familial correlation structures. True Parameter Estimates Standard Errors values ML GEE2 Diff. ML GEE2 Diff. Const. 0.8 1.322 (0.136) 1.323 (0.137) 0.004 0.156 0.153 0.007 Age 0.2 0.342 (0.358) 0.343 (0.355) 0.009 0.350 0.344 0.016 SS 0.9 3.781 (0.251) 3.783 (0.245) 0.012 0.322 0.316 0.017 PO 0.5 1.539 (0.212) 1.541 (0.215) 0.012 0.243 0.238 0.015 D2 0.3 0.917 (0.279) 0.908 (0.282) 0.025 0.306 0.294 0.033 Const. 0.5 0.809 (0.149) 0.807 (0.150) 0.006 0.138 0.135 0.005 Age 1.0 1.734 (0.347) 1.737 (0.348) 0.017 0.359 0.351 0.016 SS 0.8 2.809 (0.341) 2.816 (0.337) 0.018 0.254 ' 0.253 0.017 PO 0.6 1.931 (0.221) 1.942 (0.223) 0.023 0.230 0.226 0.016 D2 0.3 0.611 (0.291) 0.602 (0.294) 0.035 0.298 0.290 0.033 fempirical SDs are in the parenthesis 75 Table 5.13: Conditional probabilities for a family of Pedigree A with pss = 0.9, ppo = 0.4, X\ = 53, X2 = 20, X3 = 16, and X4 = 14. Conditional Probabilities Mlogit Est. SE Mprobit Est. • SE Pr(Y 4 = 1 | Yi = 1,Y2 = 1,Y3 = 1,X) Pr(Y 4 = l | Yi = l , Y 2 = l , Y 3 = 0 , X ) Pr(Y 4 = l | Yi = l , Y 2 = 0,Y 3 = 0 ,X) Pr(Y 4 = l | Y 1 = 0,Y 2 = 1,Y3 = 1 , X ) Pr(Y 4 = l | Y 1 = 1,Y2 = 1 , X ) 0.952 0.009 0.638 0.028 0.140 0.033 0.932 0.016 0.930 0.013 0.959 0.008 0.551 0.022 0.160 0.033 0.931 0.018 0.932 0.012 Table 5.14: Conditional probabilities for a family of Pedigree B with pss = 0.9, ppo = 0.5, pD2 = 0.3, X-i = 68, X2 = 46, X3 = 14 and X4 = 10. Conditional Probabilities Mlogit Est. SE Mprobit Est. SE Pr(Y 4 = l | Y1 = 1,Y2 = 1,Y3 = 1,X) Pr(Y 4 = l | Y 1 = 1 ) Y 2 = 0 , Y 3 = 0 ,X) Pr(Y 4 = 1 | Yi = 0,Y 2 = 1,Y3 = 0 , X ) Pr(Y 4 = 1 | Yi = 0,Y 2 = 0,Y 3 = 1,X) Pr(Y 4 = l | Y2 = l,X) 0.950 0.012 0.156 0.046 0.321 0.093 0.912 0.029 0.825 0.021 0.953 0.011 0.178 0.041 0.259 0.060 0.895 0.032 0.826 0.021 Table 5.15: Conditional probabilities for a family of Pedigree C with pss — 0.9, ppo = 0.5, pD2 = 0.3, Xx = 68, X2 - 46, X3 = 14, and XA = 10. Conditional Probabilities Mlogit Est. SE Mprobit Est. SE Pr(Y 5 = 1 | Yi = 1,Y2 = 1,Y3 = 1,Y4 = 1,X) Pr(Y 5 = 1 | Yi = 0,Y 2 = 1,Y3 = 0,Y 4 = 0,X) Pr(Y 5 = 1 | Yi = 1,Y2 = 0,Y 3 = 0,Y 4 = 0,X) Pr(Y 5 = 1 | Yi = 0,Y 2 = 0,Y 3 = 1,Y4 = 0,X) Pr(Y 5 = l | Yi = 1,Y2 = 1,Y3 = 1 , X ) 0.957 0.009 0.277 0.094 0.171 0.035 0.094 0.035 0.870 0.022 0.962 0.008 0.227 0.053 0.186 0.034 0.139 0.033 0.875 0.023 76 Chapter 6 Discussion The objective of this thesis was to compare existing methods for analyzing multivariate binary responses. We are mainly interested to study the performance of these methods for analyzing the binary response (e.g. presence/absence of a genetic disease) in the context of data structures typically arising in genetics. In genetics, it is assumed that these binary responses are the quantal values of an underlying continuous distribution. This assumption leads us to consider latent variable models for analyzing binary data in genetics. These latent variable models are likelihood based and have been widely used since the early 1930's. Recently, estimating equation based methods were introduced which can also be used to analyze multivariate binary responses. Among the existing methods, we have compared latent variable and estimating equation based methods for multivariate binary responses. We have reviewed the latent variable models (multivariate probit and logistic) in Chapters 2 and 3 respectively. The multivariate probit model is based on the assumption that the underlying latent variable follows multivariate normal distribution. In genetics, this assumption has physical meaning because the observed phenotypic value is assumed to be the sum of many small genetic effects and the environmental deviations. By the central limit theorem, the normality assumption 77 for this phenotypic value is reasonable. The multivariate probit model is useful to obtain maxi-mum likelihood estimator of the parameters corresponding to the regression model defined for the univariate margins. The maximum likelihood estimator of the association between the members of the family can also be obtained in terms of the latent correlation coefficient, which is known as tetrachloric correlation. In biostatistics and epidemiology, the odds-ratio has a more attractive interpretation than the correlation coefficient as a dependence parameter. Because of that linear logistic model is widely used though there is no physical reasoning of considering this model. In Chapter 3, we discussed the multivariate logistic model which is based on the assumption that the underlying univariate margins are logistic. Given the univariate logistic margins and the two- and higher-order cross-product ratios as the dependence parameters, we have used the multivariate Plackett construction to obtain the higher-order margins. Besides estimating the regression parameters corresponding to the univariate margins, the multivariate logistic model parameterizes the dependence among the members of the family in terms of the log odds-ratio. We also described the McCullagh-Nelder-Glonek approach to the multivariate logistic model. Instead of using the multivariate Plackett construction, to obtain the orthant probabilities, they defined univariate margins and the two- and higher-order dependence parameters in terms of the linear combinations of the joint probabilities. Using the logit and logarithmic links corresponding to the univariate margins and the dependence parameters respectively, the McCullagh-Nelder-Glonek approach provide similar results as the multivariate Plackett construction based model. In Chapter 4, estimating equation based methods are reviewed. These methods do not re-quire full specification of the joint distribution of the responses. Among the estimating equation based methods, the GEE2 method can estimate the regression parameters and the dependence parameters (in terms of logORs) simultaneously. Instead of using likelihood based score functions, for the GEE2 method, a set of estimating equations are considered to estimate the parameters cor-responding to the univariate margins and to the dependence structure. These estimating equations 78 are similar to the score function of a multivariate normal model. In GEE2, the bivariate dependence parameters are defined according to the Plackett distribution. These estimating equations require the computation of third- and fourth-order moments, for which McCullagh-Nelder-Glonek's ap-proach of multivariate logistic model has been used. Besides using estimating equations in place of likelihood score functions for estimating the parameters of the model, the GEE2 method is sim-ilar to the multivariate logistic model. That means, the GEE2 method is an estimating equation approach for estimating the parameters of the multivariate logistic model. For analyzing binary responses, besides identifying important covariates and estimating the association between the family members for the occurrence of the disease, the estimate of the probability that an individual has the disease given the disease status of the other family members is also of interest in genetics. This conditional probability is the ratio of the two orthant probabilities. The GEE2 method cannot estimate the orthant probabilities. In this thesis we have compared the multivariate logistic and probit models for estimating these conditional probabilities and the respective standard errors. Since the form of the likelihood function of the multivariate logistic model is very complex, the elements of the gradient and the Hessian matrix cannot be computed analytically. We have used quasi-Newton method to solve the system of equations for estimating the parameters and the corresponding standard errors of the multivariate logistic model. For estimating the parameters of the GEE2 method, we have used a Newton-Raphson routine with the gradient and Hessian matrix which are computed by FADBAD, a differentiation package. FADBAD is relatively a new differentiation package which uses automatic differentiation to compute the derivatives of a given function. Automatic differentiation could be useful elsewhere in statistical applications for solving system of non-linear equations or optimizing a function where analytic derivatives are too difficult to obtain with symbolic manipulation software. To compare the maximum likelihood and estimating equation based approaches, we have 79 carried out a simulation study with different pedigrees. The multivariate Plackett construction does not have a closed form expression of the joint cumulative distribution function, so generating multivariate binary data from the multivariate Plackett construction is difficult. For our simulation study, we used the multivariate probit model to generate multivariate binary data. The effects of the pedigree sizes, the types of pedigree, the dependence structure, and the sample sizes have been studied for the comparison of these two approaches. To define the regression model corresponding to the univariate margins, only the covariate Age is used. No covariates have been used for the model corresponding to the dependence structure. Chapter 5 contains the simulation results and the description of the estimation procedures that are used for the methods considered in this thesis. The maximum likelihood and estimating equation based estimates of the parameters defined for the univariate margins and dependence structures are found to be quite similar for all the pedigrees we have considered. For samples of size 200, the estimating equation based estimates are marginally more efficient than the maximum likelihood based estimates. But for samples of size 600, there is no difference between the efficiency of the corresponding estimates of these two approaches. The estimates of the conditional probabil-ities and the respective standard errors are similar for the multivariate logistic and probit models. The estimating equation based method GEE2 requires the computation only to the fourth-order moments, whereas the maximum likelihood method based on the multivariate Plackett construc-tion requires the computation of all the moments up to the dth-order (d is the family size). As a consequence, this of this maximum likelihood based method requires more computation time for estimating the parameters than the estimating equation based approach if the family size > 6. 80 Appendix A A . l cov(j/jj, yik) cov{yij, yikVii) cov(yij yik, yu yis) E(yij){l - E(yij)} Hj = k, , E(yij yik) ~ E(yij)E(yik) if j ^ k. E(yij y^ yu) - E(yij)E{yik yu) if j ± k ± l ' E(yij yik) - E(yij)E(yik yu) ii j = I ^ k E(yij yu) - E{yij)E(yik yu) if j = k + I * E(yij yik yu yis) - E(yij yik)E(yu yis) if j ^ k ^ I ^ E(yij yik yu) - E(yij yik)E(yu yis) if j = s or k = * E{yij yik yis) - E(yij yik)E{yu yis) if j = I or k = E(yij y i f c ) ! 1 - E(yij yik)} if (j = i k k = (j = s&k = 81 A.2 drj m m m m2 mz 7?123 W i l l W121 1 1 7T, TT -TT -1 TT 11-TT 1-1 TT , - 1 11 TT -1 -TT . . - 1 TT -TT 12-7T 1-1 — 7T . - 1 •21 -TT 121 W211 1 w.l1 ' 2 1 -— TT. 2-1 7T „ - l •11 -TT. 211 W221 1 - W . 2 7T. 22-— TT. 2-1 — 7T . - 1 21 TT. 221 W H 2 1 1 TT I*. -1 — TT . - 1 7T 11-— TT - 1 1-2 — 7T. „ - l 12 TT 112 7T122 1 TT -1 1--7T •2--7T 12-— TT - 1 1-2 7T , - 1 22 -TT 122 W 2 12 1 -TTo. TT 'n't — TT, 21-7T, 2-2 -7T - 1 •12 -71". W 2 22 1 — 7IV -1 -TT •2-7T, 22-TT. . - 1 2-2 7T 22 212 TT. 222 / 82 Bibliography [1] Andrade, M. de, Amos, C. I. and Thiel, T. J. (1999). Methods to estimate genetic compo-nents of variance for quantitative traits in family studies. Genetic Epidemiology, 17, 64-76. [2] Ashford, J. R. and Sowden, R. R. (1970). Multi-variate probit analysis. Biometrics, 26, 535-46. [3] Cox, D. R. (1972). Analysis of multivariate binary data. Applied Statistics, 21, 113-20. [4] Crowder, M. (1995). On the use of a working correlation matrix in using generalized linear models for repeated measures. Biometrika, 82, 407-10. [5] Dale, J. R. (1986). Global cross-ratio models for bivariate, discrete and ordered responses. Biometrics, 42, 909-17. [6] Dixon, L. C. W. (1991). On the impact of automatic differentiation on the relative per-formance of parallel truncated Newton and variable metric algorithms. SI AM J. Opt., 1, 475-86. • [7] Falconer, D. S. (1960). Introduction to Quantitative Genetics. The Ronald Press, New York. [8] Finney, D. J. (1971). Probit Analysis, Third edition. Cambridge University Press, London. [9] Fitzmaurice, G. M. and Laird N. M. (1993). A likelihood-based method for analysing lon-gitudinal binary responses. Biometrika, 80, 141-51. 83 [10] Fitzmaurice, G. M., Laird, N. M., and Rotnitzky, A. G. (1993). Regression models for discrete longitudinal responses. Statistical Science, 8, 284-99. [11] Glonek, G. F. V. and McCullagh, P. (1995). Multivariate logistic models. J. Roy. Stat. Soc. B, 57, 533-46. [12] Godambe, V. P. (ed.) (1991). Estimating Functions. Oxford University Press, Oxford. [13] Gourieroux, C , Monfort, A. and Trognon, A. (1984). Pseudo maximum likelihood methods: theory. Econometrica, 52, 681-700. [14] Griewank, A. (1989). On automatic differentiation. Mathematical Programming, Kluwer Aca-demic Publishers, Japan. [15] Joe, H. (1995). Approximations to multivariate normal rectangle probabilities based on conditional expectations. J. Amer. Statist. Assoc., 90, 957-64. [16] Joe, H. and Liu, Y. (1996). A model for a multivariate binary response with covariates based on compatible conditionally specified logistic regressions. Statistics & Probability Letters, 31, 113-20. [17] Joe, H. (1997). Multivariate Models and Dependence Concepts. Chapman &; Hall, London. [18] Joe, H. (1999). C routine for multivariate Plackett distribution based on Molenberghs-Lesaffre construction, ftp://ftp.stat.ubc.ca/pub/hjoe/famil/. [19] Joe, H. (1999a). C routine for multivariate probit model, ftp://ftp.stat.ubc.ca/pub/hj oe/f amil/. [20] Joe, H. (2001). C++ rountine for modified Newton-Raphson using FADBAD. ftp://ftp.stat.ubc.ca/pub/hjoe/famil/. 84 [21] Kepner, J. L., Harper, J. D., and Keith, S. Z. (1989). A note on evaluating a certain orthant probability. Amer. Statist., 43, 48-9. [22] Lesaffre, E. and Molenberghs, G. (1991). Multivariate probit analysis: A neglected procedure in medical statistics. Statistics in Medicine, 10, 1391-1401. [23] Liang, K . - Y . and Zeger, S. C. (1986). Longitudinal data analysis with generalized linear models. Biometrika, 73, 13-22. [24] Liang, K . - Y . and Beaty, T. H. (1992). Measuring familial aggregation by using odds-ratio regression models. Genetic Epidemiology, 8, 361-70. [25] Liang, K . - Y . , Zeger, S. L., and Qaqish, B. (1992). Multivariate regression analysis for cate-gorical data (with discussion). J. Roy. Stat. Soc. B, 54, 3-40. [26] Lindsey, J. K. (1997). Applying Generalized Linear Models. Springer-Verlag New York, Inc. [27] Lipsitz, S. R., Laird, N. M. and Harrington, D. P. (1991). Generalized estimating equations for correlated binary data: Using the odds ratio as a measure of association. Biometrika, 78,153-60. [28] Mardia K. V. (1970). Families of Bivariate Distributions. Griffin, London. [29] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. Chap-man &; Hall, London. [30] Mendell, N . R. and Elston, R. C. (1974). Multifactorial qualitative traits: Genetic analysis and prediction of recurrence risks. Biometrics, 30, 41-57. [31] Molengerghs, G. and Lesaffre, E. (1994). Marginal modeling of correlated ordinal data using a multivariate Plackett distribution. J. Amer. Statist. Assoc., 89, 633-44. 85 [32] Nash, J. C. (1979). Compact Numerical Methods for Computers: Linear Algebra and Func-tion Minimisation. Adam Hilger Ltd., Bristol. [33] Ochi, Y. and Prentice, R. L. (1984). Likelihood inference in a correlated probit regression analysis. Biometrika, 71, 531-43. [34] Plackett, R. L. (1965). A class of bivariate distributions. J. Amer. Statist. Assoc., 60, 516-22. [35] Sutradhar, B. C. and Das, K. (1999). On the efficiency of regression estimators in generalized linear models for longitudinal data. Biometrika, 86, 459-65. [36] Wolfe, M. A. (1978). Numerical Methods for Unconstrained Optimization : an introduction. Van Nostrand Reinhold Company Ltd., England. [37] Zeger, S. L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121-130. [38] Zhao, L. P. and Prentice, R. L. (1990). Correlated binary regression using a quadratic exponential model. Biometrika, 77, 642-48. 86
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A comparison of methods for multivariate familial binary...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A comparison of methods for multivariate familial binary responses Latif, Abu Hena M. Mahbub-ul 2001
pdf
Page Metadata
Item Metadata
Title | A comparison of methods for multivariate familial binary responses |
Creator |
Latif, Abu Hena M. Mahbub-ul |
Date Issued | 2001 |
Description | Among the existing methods for analysing the multivariate familial binary response, we discuss latent variable models and the estimating equations based methods. A brief description of the multivariate Plackett distribution is given and the role of this distribution in developing the estimating equations based methods is pointed out. The maximum likelihood and estimating equations based methods for estimating the parameters of the multivariate logistic model are compared. For this comparison, a simulation study examines the effects of the sample sizes, dependence structures, the within-family dependence, etc. in estimating the parameters. The data are generated from the multivariate probit models. The multivariate logistic and probit models are compared for estimating conditional probabilities of interest in a genetics context and the respective standard errors. Numerical methods are used to estimate the parameters of the models considered. Because the original GEE2 code cannot handle multivariate binary data for arbitrary family structures, we have a new implementation of the GEE2 method for familial data; this routine used automatic differentiation for computing the Hessian matrix. |
Extent | 4113952 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2018-05-07 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0090246 |
URI | http://hdl.handle.net/2429/11852 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2001-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2001-0442.pdf [ 3.92MB ]
- Metadata
- JSON: 831-1.0090246.json
- JSON-LD: 831-1.0090246-ld.json
- RDF/XML (Pretty): 831-1.0090246-rdf.xml
- RDF/JSON: 831-1.0090246-rdf.json
- Turtle: 831-1.0090246-turtle.txt
- N-Triples: 831-1.0090246-rdf-ntriples.txt
- Original Record: 831-1.0090246-source.json
- Full Text
- 831-1.0090246-fulltext.txt
- Citation
- 831-1.0090246.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0090246/manifest