Joint Inference for Longitudinal and Survival Data with Incomplete Time-dependent Covariates by Xu Wang B.Sc., Zhejiang University, 2008 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Statistics) The University of British Columbia (Vancouver) August 2010 c Xu Wang, 2010 Abstract In many longitudinal studies, individual characteristics associated with their repeated measures may be covariates for the time to an event of interest. Thus, it is desirable to model both the survival process and the longitudinal process together. Statistical analysis may be complicated with missing data or measurement errors in the time-dependent covariates. This thesis considers a nonlinear mixed-effects model for the longitudinal process and the Cox proportional hazards model for the survival process. We provide a method based on the joint likelihood for nonignorable missing data, and we extend the method to the case of time-dependent covariates. We adapt a Monte Carlo EM algorithm to estimate the model parameters. We compare the method with the existing two-step method with some interesting findings. A real example from a recent HIV study is used as an illustration. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Longitudinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Longitudinal Studies . . . . . . . . . . . . . . . . . . . . 2 1.2.2 Approaches to Longitudinal Data Analysis . . . . . . . . 3 Survival Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3.1 Survival Studies . . . . . . . . . . . . . . . . . . . . . . 5 1.3.2 Approaches to Survival Data Analysis . . . . . . . . . . . 6 Missing Data Problems . . . . . . . . . . . . . . . . . . . . . . . 7 1.4.1 Missing Data and Measurement Errors . . . . . . . . . . . 7 1.4.2 Classification of Missing Mechanisms . . . . . . . . . . . 8 1.4.3 Approaches to Missing Data Problem . . . . . . . . . . . 9 Joint Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.5.1 10 1.3 1.4 1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . iii 1.5.2 2 Approaches to Joint Modeling . . . . . . . . . . . . . . . 11 1.6 A Motivating Example . . . . . . . . . . . . . . . . . . . . . . . 12 1.7 Objective and Outline . . . . . . . . . . . . . . . . . . . . . . . . 12 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Nonlinear Mixed Effects Models . . . . . . . . . . . . . . . . . . 15 2.3 Covariate Models . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 surement Errors and Missing Data . . . . . . . . . . . . . 16 Model for Time-independent Covariate . . . . . . . . . . 18 2.4 Survival Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Model for Missing Data . . . . . . . . . . . . . . . . . . . . . . . 20 Two-Step Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 Simple Two-Step Method . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Modified Two-Step Method . . . . . . . . . . . . . . . . . . . . . 23 Joint Likelihood Inference with Time-independent Covariate . . . . 25 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2 Joint Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.3 A Monte Carlo EM Algorithm . . . . . . . . . . . . . . . . . . . 26 4.4 Sampling Methods and Convergence . . . . . . . . . . . . . . . . 28 2.3.2 3 4 5 6 Empirical Model for Time-dependent Covariate with Mea- Time-dependent Covariate with Measurement Error . . . . . . . . 30 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2 Joint Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3 A Monte Carlo EM Algorithm . . . . . . . . . . . . . . . . . . . 32 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.3.1 38 The NLME Model for HIV Viral Dynamics . . . . . . . . iv 7 6.3.2 The Covariate Model . . . . . . . . . . . . . . . . . . . . 40 6.3.3 Survival Models . . . . . . . . . . . . . . . . . . . . . . 41 6.3.4 The Dropout Models . . . . . . . . . . . . . . . . . . . . 41 6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.5 Computation Issues . . . . . . . . . . . . . . . . . . . . . . . . . 44 6.5.1 Choice of Starting Value . . . . . . . . . . . . . . . . . . 44 6.5.2 Convergence Criteria . . . . . . . . . . . . . . . . . . . . 44 6.5.3 Running Time . . . . . . . . . . . . . . . . . . . . . . . 44 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 7.2 Design of Simulation Study . . . . . . . . . . . . . . . . . . . . . 46 7.2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 7.3 Comparison Criteria . . . . . . . . . . . . . . . . . . . . . . . . . 47 7.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 48 7.4.1 Comparisons of Methods in Different Missing Rates . . . 48 7.4.2 Comparisons of Methods in Different Measurement Times 49 7.4.3 Comparisons of Methods with Different Number of Patients 51 7.4.4 Comparisons of Methods with A Larger Variance of Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 7.5 8 v List of Tables 6.1 Summary of the HIV dataset . . . . . . . . . . . . . . . . . . . . 6.2 Model Selection on NLME model with various random effects specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 39 6.3 Model Selection on covariate model in different forms (linear and quadratic). 40 6.4 Results summary by different methods with time-dependent covariate. . 43 7.1 Simulation result (10% missing) . . . . . . . . . . . . . . . . . . 49 7.2 Simulation result (20% missing) . . . . . . . . . . . . . . . . . . 50 7.3 Simulation results (ni = 25) . . . . . . . . . . . . . . . . . . . . 50 7.4 Simulation results (N = 500) . . . . . . . . . . . . . . . . . . . . 51 7.5 Simulation results (σ = 1) . . . . . . . . . . . . . . . . . . . . . 52 vi List of Figures 6.1 Profiles of viral load values for six randomly selected patients. . . vii 38 Acknowledgements First and foremost, I would like to thank my supervisor, Dr. Lang Wu, for his excellent guidance and immense help during my study at Department of Statistics of the University of British Columbia. Without his support, expertise and patience, this thesis would not have been completed. Also, I would like to thank my second reader, Dr. Paul Gustafson, for his invaluable comments and suggestions on this thesis. Further more, I would like to thank Dr. John Petkau for his invaluable advice on my consulting projects, which benefits me very much in the past and future. I would like to thank all the faculty and staff in Department of Statistics of the University of British Columbia for providing such a nice academic environment. I would like to thank all graduate students in the Department of Statistics for making my study and life here so enjoyable. Most importantly, I would like to thank my parents for their great love. I also want to thank my girlfriend, Ying, for the happiness she brings to me. It is their love, constant support and encouragement that push me to be the best at everything I do. viii To my parents and Ying. ix Chapter 1 Introduction 1.1 Background In many longitudinal studies, both the longitudinal process and the survival process are of interest. For example, in HIV studies, while the HIV viral load dynamics in the early period after an anti-HIV treatment are of our interest, we are also interested in the relationship between the individual-specific characteristics of the viral load process in the early period and a long term antiviral response such as the time to a viral load rebound (or a viral load suppression or death). Specifically, one such important question is to check if patients with a faster initial viral load decay rate may have an earlier viral load rebound later in the study. For longitudinal data analysis, nonlinear mixed-effects (NLME) models are often used in many cases because these models are based on underlying mechanisms which generate the observed data (Davidian and Giltinan, 1995). For survival data, the Cox proportional hazards model is often of research interest. Also in such studies, missing data are common since individuals may drop out early for various reasons such as a drug resistance. The missing data may be informative in the sense that the missing data mechanism may be related to unobserved values such as the viral load value at that time point or the initial unobservable true viral load decay rates. Also, measurement errors often appear. For example, in HIV studies, CD4 cell count, which is measured repeatedly on the same individual during the study period, may be hard to be measured accurately. Thus, the analysis of longitudinal data often involves 1 methods for missing data and measurement errors. When the longitudinal process and the survival process may be related via observed variable(s) or latent variable(s), making inference based on information provided by both two process may be helpful. Methods jointly modeling longitudinal data and survival data have been studied in the literature (e.g., DeGruttola and Tu, 1994; Wulfsohn and Tsiatis 1997; Henderson et al. 2002; Guo and Carlin, 2004). Tsiatis and Davidian (2004) provides a very nice review. Wu (2008) discussed a joint model with informative missing data using baseline covariate information for statistical inference. In this thesis, we consider a joint likelihood method to jointly model longitudinal data and survival data, incorporating missing information and measurement errors using time-dependent covariates. 1.2 Longitudinal Data 1.2.1 Longitudinal Studies Longitudinal studies involve repeated observations of the same individual over a long time period. Longitudinal studies are often called panel studies in economics and sociology. In longitudinal studies, individuals are followed over a period of time. For each individual, data are collected at multiple time points. These collected data are called longitudinal data, which is very common in observational studies. These repeated measurements of a variable on the same individual over time is the defining feature of longitudinal studies. These repeated measurements of a variable on the same individual may share a common characteristic and may be correlated, although measurements on different individuals could be assumed to be independent. The measurement correlation within each individual reflects the key characteristic of longitudinal data. For example, in HIV studies, the viral load of each patient is measured repeatedly over time. The viral load values of one specified patient at different time points could be correlated due to the health status of this patient. 2 Longitudinal studies are often compared with cross-sectional studies. Both longitudinal studies and cross-sectional studies are observational studies. The fundamental difference between cross-sectional studies and longitudinal studies is that cross-sectional studies take place at a single time point and longitudinal studies involve a series of measurements taken over a period of time. An important assumption for cross-sectional data is that all observations in the sample are independent with each other. However, in longitudinal studies, the repeated observations on the same individual are usually correlated, although observations from different individuals are regarded to be independent. Hence, applying the classical statistical methods for cross-sectional data to longitudinal data would ignore the correlation in the measurements within each individual. Longitudinal studies have an advantage over cross-sectional studies in that longitudinal studies take the correlation in measurements within the same individual into account. Longitudinal studies are also often compared with time series analysis. Time series takes the measurement correlation within the same individual into account as well. It observes a single long series of measurements over time. When there is only one individual included in the longitudinal study, longitudinal data is reduced to a single time series. In most time series studies, only one single series is available to be used to find clues and draw conclusions. Longitudinal studies have advantage over time series analysis in that the analysis of longitudinal data can be made by borrowing information across different individuals. 1.2.2 Approaches to Longitudinal Data Analysis For longitudinal data, there may be substantial variations in both between and within individual measurements. A main objective of statistical models for longitudinal data analysis is to address these two sources of variations. One can study variable change of a certain subject over time via modeling the within-individual variation, while one can investigate differences between individuals via modeling the between-individual variation. Before introducing commonly used approaches to longitudinal studies, we first 3 define notations that will be used. Let yi j be a response variable and xi j be a p × 1 vector of p explanatory variables for the jth measurement on individual i at time point ti j , i = 1, ..., N, j = 1, ..., ni , where N is the total number of individuals involved in the study, and ni is the number of repeated measures for individual i. The set of repeated measurements for individual i are collected into an ni × 1 vector, yi = (yi1 , ..., yini )T . The covariate matrix for individual i is denoted as Xi = (x1 , ..., xni )T , an ni × p matrix. Three approaches are usually used in longitudinal studies. The first one is often called generalized estimating equations (GEE models) or marginal models. GEE models, which were introduced by Liang and Zeger (1986), specify the mean structure and the correlation structure separately without distributional assumptions. The primary scientific objective of GEE models is to model the mean of the response variable. The correlation structure of the response variable may be specified based on the nature of the observed data or based on simplicity, not necessarily based on any parametric distributions. Thus, GEE models could be useful when the distributional assumptions are questionable, for example, when the response variable is binary or discrete. A transition model is another approach to longitudinal studies. A transition model specifies the measurement correlation within an individual via Markov structures. That is, one models the conditional distribution of yi j given the past measurements, yi, j−1 , ..., yi,1 . The third approach is mixed effects models (or random effects models). Mixed effects models explain the between-individual variation and the within-individual correlation by introducing random effects. In mixed effects models, the conditional expectation of yi j given the individual-specific coefficient, β i is E(yi j |β i ) = f (xi j , β i ), where f (·) is a link function, which explains the relationship between the response variable and the explanatory variables. In practice, usually there are not enough 4 measurements observed for an individual, thus an efficient estimation of the regression coefficients β i is not valid, especially when the link function has a complex form. For example, a nonlinear model. Hence, the β i ’s are further assumed to be independent from some distribution with a mean of β . Then, we can write β i = β + bi , where β is fixed and bi is a vector of random variables with mean 0. In this way, the individual characteristics can be represented by random effects bi . All repeated measures of a response variable for a specified individual share a common unobserved random effect bi , and these responses are correlated via this common factor bi , although bi ’s varies across different individuals. Mixed effects models focus on both the population parameters β and the individual characteristics bi ’s. Hence, mixed effects models are particulary useful when inferences need to be made about both population behaviors and individual trajectories, like in HIV studies. Because of the advantage of mixed effects models in HIV studies, we will focus on mixed effects models in this thesis. 1.3 Survival Data 1.3.1 Survival Studies In many medical studies, the time to an event is often of interest. Common time to events of interest includes the time to death, the time to drop out from a follow-up study, the time to an efficacy loss of a medical treatment, etc. These types of data are called survival data. The analysis of survival data is called survival analysis. In HIV studies, for example, the viral load of a patient would rebound some time after receiving an anti-HIV treatment. This viral load rebound may be due to the loss of the efficacy of an anti-HIV treatment. Hence, it is of interest to find possible relationship between the rebound time and covariates such as the level of CD4 cell count and the personal characteristics which would be represented by random effects if a mixed effects model is used. Survival data, which usually refers to the time to a certain event of interest, has its own features that makes it different from other types of data. First, the 5 distribution of survival data is usually not symmetric and usually skewed to right. Hence, survival data may not be reasonably assumed to have a normal distribution. The second feature is that survival data are often censored. That is, the event of interest may not be observed for some individuals during the study period. The censored data may possibly be due to the dropouts of individuals, loss of followup, or early termination of the study. For instance, in HIV studies, due to the limited follow-up time period, the time to a viral load rebound for an individual may be censored. Therefore, at the end of the follow-up study, one could only know whether the viral load had rebounded but could know nothing for the future. With these unique features of survival data, special statistical analysis procedures are required. 1.3.2 Approaches to Survival Data Analysis Due to the features of survival data, nonparametric and semiparametric models are popularly used, since no distributional assumptions for the survival data are made in these models. Parametric models are also valid for survival data, and they are more efficient than nonparametric or semiparametric models if the distributional assumptions hold. Fleming and Harrington (1991), Andersen er al. (1993), Collett (2003), Lawless (2003), and Wu (2009) give comprehensive discussions of survival models and methodologies. In this thesis, we focus on the Cox proportional hazards model, which are particularly popular in survival analysis. In survival analysis, the survival function and the hazard function play an important role. Let random variable T be the time to an event of interest, called survival time. The survival function is the probability that an individual survives to some time beyond time t, S(t) = P(T ≥ t) = 1 − F(t), t > 0, where F(t) = P(T < t) is the cumulative distribution function of T . The hazard function is defined as P(t ≤ T ≤ t + ∆t|T ≥ t) , ∆t→0 ∆t h(t) = lim 6 t > 0, which is the risk or hazard of an event at time t. It means the probability for an individual to experience an event immediately after time t, given the individual has experienced no event or survived to time t. In survival regression models, finding any possible relationship between the survival time and the covariates of interest is the main goal. A popular approach is to model the hazard (or risk) of an event, rather than the mean of the response as in a classical regression analysis. The hazard function could be modeled in a nonparametric way in that the hazard functions may be complicated and the distribution assumption could be avoided using nonparametric models. Then, the hazard function and the covariates xi can be linked via a usual linear predictor xTi β . This leads to a semiparametric regression model. The Cox proportional hazards model (Cox, 1972) is a widely used semiparametric survival regression model. In the Cox proportional hazards model, the hazard function and the covariates are linked in the following form: hi (t) = h0 (t)exp(xTi β ), where h0 (t) is an unspecified baseline hazard function and other notations have the same meaning as before. The baseline hazard function h0 (t) could be interpreted as the hazard when all covariates equals to 0. The Cox proportional hazards model assumes the hazard ratio hi (t) h0 (t) is proportional to exp(xTi β ), which needs checking in practice. It makes no distributional assumptions for the survival data, thus it is very flexible to use. 1.4 1.4.1 Missing Data Problems Missing Data and Measurement Errors In both longitudinal studies and survival studies, it is usually impossible to have complete information of interest collected. In regression models, the missing data can be missing in responses, missing in covariates, or missing in both responses and covariates. For example, in HIV studies, individuals may not be able to come to medical center for measurement at every scheduled time point due to various of reasons, or they may even dropout permanently from the study due to the drug 7 intolerance or death. Thus, the responses, the viral load, and the covariates such as CD4 cell count, are missing in the follow-up study. Missing data is an important issue in both longitudinal studies and survival studies. Ignoring missing data or using naive methods to deal with the missing data problem may lead to invalid inferences. Standard statistical methods are usually designed for complete data, and they cannot be directly applied to the case of missing values. Measurement errors in covariates are another form of missing data and very common in practice. For example, in HIV studies, CD4 cell count, which is measured repeatedly on the same individual during the study period, may be hard to be measured accurately, possibly due to the imprecision of medical machines. In the presence of measurement errors, the observed data are not the true values but possibly measured with errors. If we treat these mis-measured values of data as true values, statistical analysis would not be appropriate. Particularly, in regression models, if the covariates are measured with errors but treated as accurately measured, the statistical inference will be misleading, for example, a significant covariate may be found to be non-significant. Hence, the measurement errors in covariates must be taken into account for valid inference. In this thesis, the measurement errors in the covariate is taken into account. 1.4.2 Classification of Missing Mechanisms Missing data issues make the statistical analysis for longitudinal studies and survival studies more complicated. It is important to determine the reason for the missing data mechanism (the missingness) because a valid statistical method to deal with the missing data depends on the type of missingness. Little and Rubin (1987) and Little (1995) give a general treatment of statistical analysis with missing values. Little and Rubin (1987) classifies the missing value mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR), and nonignorably or informative missing (NIM). If the missingness is irrelevant either with the observed data or the missing 8 data, then the missingness is regarded as MCAR. For example, in HIV studies, if the dropout of a patient at a scheduled time point is simply because he/she forgot, the missingness at this time point could be regarded as MCAR. Missing data are MAR if the missingness depends only on the observed data, but not on the missing data. For example, if a patient fails to come to a medical center at a scheduled because he/she is very old (suppose the age information is known at the beginning of a follow-up study), the missingness at this time point could be regarded as MAR. Missing data are NIM if the missingness depends on missing data. NIM can further be categorized into two cases in the context of random effects models: • The missingness depends on unobserved responses. For example, a patient fails to visit the medical center because he/she is in a very bad health state. The missingness is called outcome-based informative missingness (Little, 1995). • The missingness depends on unknown random effects which may substantially affect the responses. For example, in HIV studies, the missingness depends on individual characteristics such as individual viral load decay rates. The missingness is called random-effect-based informative missingness (Little, 1995). For NIM the missingness could also depend on the missing covariates if one considers a missing in covariate problem. When the missingness is NIM, the missing data mechanism must be taken into account in the likelihood inference (Little and Rubin, 1987). 1.4.3 Approaches to Missing Data Problem Many methods have been developed to deal with the missing data problem. Simple methods, like the complete-case (CC) method, the last-value-carried-forward (LVCF) method and the mean imputation method, are widely used because they are very simple and easy to carry out; however, these simple methods may lead to 9 inefficient or biased results. For example, the CC method is probably the simplest method used for missing data problems. It simply discards all individuals with missing values. However, the CC method may lead biased results when the missing data is not missing completely at random. Since simple methods often lead inefficient and unbiased results, they are generally not recommended, especially when the missing rate is high. Formal methods, like likelihood inference using EM algorithms, single imputation methods with variance adjustments, multiple imputation methods, Bayesian methods are usually used for more appropriate analysis for missing data. Little and Rubin (2002) provided an overview of missing data methods, Carroll, Ruppert, and Stefanski (2006) reviewed common methods for measurement errors, and Maronna, Martin, and Yohai (2006) discussed recent development of robust methods. Wu (2009) gives a comprehensive review of incomplete data problem in mixed effects models. In longitudinal studies, mixed effects models are widely used, and missing data problem is very common. Because the maximum likelihood method is a standard statistical inference approach for mixed effects models, in this thesis, we will use likelihood-based methods for missing data problem. 1.5 1.5.1 Joint Modeling Motivation In many longitudinal studies, both the longitudinal process and the survial process are of interest. For example, in HIV studies, we are interested in both the HIV viral load dynamics in the early period after an anti-HIV treatment and the long term antiviral responses such as the time to a viral load rebound. The time to a viral load rebound may possibly be related with individual characteristics of the viral load process in the longitudinal process. That is, it is of interest to check if patients with faster initial viral load decay rate may have earlier viral load rebound in the later period of the study. Also, in the analysis of survival data using time-dependent covariates with measurement errors, we may need to model the longitudinal covariate process, which is used to address the measurement errors, in addition to a survival 10 model. In both examples, the longitudinal process and the survival process may be related. To better understand this relationship between the two processes and to make inference based on information provided by both processes, joint modeling of the longitudinal process and the survival process is needed. 1.5.2 Approaches to Joint Modeling The longitudinal model and the survival model are often viewed as shared parameter models since the two models are usually linked through some common unknown variables. To make statistical inference simultaneously, a simple two-step method (TS) or a two-stage method is often used. In the first step, it estimates the common unknown variables or parameters based on one model using the observed data in the second step, it estimates parameters in the other model separately, with common latent variables or unknown parameters substituted by their estimates from the first step as if the estimated values of latent variables or the unknown parameters were observed values. The two-step method is simple, and statistical softwares can be readily used, However, the simple two-step method may lead inappropriate results when the longitudinal process and the survival process are strongly associated (Tsiatis and Davidian, 2004). Also, by the simple two-step method, the uncertainty of estimation in the first step can not be incorporated in the second step. Another approach is the joint likelihood method, where the statistical inference for the joint model is based on the joint likelihood of all observed data. Wu (2009) gives a nice review on the joint likelihood method. Joint likelihood is very appealing because it provides a valid and reliable inference and standard likelihood theory could be used. Maximum likelihood estimations of all model parameters can be obtained simultaneously by maximizing the joint likelihood. However, there are two issues related with joint likelihood methods. First, when the joint modeling contains several longitudinal process, like the HIV viral load dynamics process and the process of time-dependent covariates, there will be too many unknown parameters in the joint model, thus, the models or the parameters may possibly be nonidentifiable. Another issue is that joint modeling may require high-dimensional 11 and intractable integrals, so the computation could be quite intensive. 1.6 A Motivating Example Our research is motivated from HIV studies. In HIV studies, we are often interested in modeling viral load dynamics in the early period after an anti-HIV treatment. In the meantime, we are also interested in the relationship between the individualspecific characteristics of the viral load process in the early period and a long term antiviral response such as the time to a viral load rebound. In HIV studies, a patient’s viral load after an anti-HIV treatment will typically decline in the early period. Late in the follow-up period, the patient may experience a viral rebound, possibly due to an emergence of drug resistance. Some patients may even drop out before the termination of the study for various reasons such as a bad health status. NLME models have been used for modeling HIV viral load dynamics in the early period after an anti-HIV treatment, and the covariates may be used to partially explain large inter-patient variations (Wu and Ding, 1999; Wu, 2005). Ding and Wu (2001) show that some viral load dynamic parameters, such as the initial viral decay rate, may reflect the effcacy/potency of an anti-HIV treatment. It is therefore important to study if some patient-specific viral load dynamic parameters are predictive for a long term antiviral response such as the time to a viral load rebound. 1.7 Objective and Outline In this thesis, we consider a joint likelihood method for a NLME model and a Cox proportional hazard model with informative dropouts in the response and missing or mismeasured information in covariates. By the joint likelihood method we can estimate all model parameters simultaneously. The missing responses in the NLME model are allowed to be nonignorable, which is associated with personal characteristics, while the missing covariates are assumed to be ignorable. The random effects in the NLME model, which represent individual-specific characteristics of the longitudinal process in the early period, are used as possible error-free “covari12 ates” for the proportional hazards model and for the missing response model. A Monte-Carlo EM algorithm is used for estimation. Joint modeling of longitudinal data and survival data has been studied in the literature (e.g., DeGruttola and Tu, 1994; Wulfsohn and Tsiatis 1997; Henderson et al. 2002; Guo and Carlin, 2004). Tsiatis and Davidian (2004) provides a very nice review. Wu (2008) discussed joint models with nonignorable missing data using the baseline covariate information for inference. In this thesis, we extend Wu’s method to the case of time-dependent covariate when the covariate is measured with errors. In Section 2, we describe the models for longitudinal data and survival data, as well as the model to describe the missing data mechanism and the model for the time-dependent covariates which are measured with errors. In Section 3, we describe the two step method for joint inference. In Section 4, we describe the joint likelihood method for simultaneous inference using a Monte-Carlo EM algorithm. In Section 5, we extend the method in Section 4 to the case of time-dependent covariates with measurement errors. A real example of HIV studies is presented in Section 6. We compare the different methods via a simulation study in Section 7. We conclude the thesis with discussions in Section 8. 13 Chapter 2 Statistical Models 2.1 Notation Suppose that there are N individuals. Let yi j be the response for individual i at time ti j , i = 1, ..., N; j = 1, ..., ni , and let yi = (yi1 , ..., yini )T . Let zi be the collection of time-independent covariates for individual i. We write yi = (yi,mis , yi,obs )T , where yi,mis are a collection of missing responses and yi,obs are a collection of observed responses, and similarly we write zi = (zi,mis , zi,obs ). Let si = (si1 , ..., sini )T be a vector of missing response indicators such that si j = 1 if yi j is missing, and si j = 0 if yi j is observed. Let ri = (ri1 , ..., rimi )T be the vector of “event” indicators for individual i. ri j = 1 if an event has happened by time ti j ; ri j = 0 if not. We assume that ri1 = 0 for all i, which means at the beginning of study, no subject experiences an event. For individual i, let Ti be the time to an event. Note that the exact event time Ti usually cannot be directly observed. However, if we observe no events at times ti1 ,...,ti,k−1 (i.e., ri1 = ... = ri,k−1 = 0) but know that an event has occurred by time tik (i.e., rik = 1), we can conclude that the actual event time is between ti,k−1 and tik (i.e., ti,k−1 < Ti ≤ tik ), k = 1, 2, ..., mi . This type of event time data structure is referred to as interval censored event time (Lawless, 2003). 14 2.2 Nonlinear Mixed Effects Models In many longitudinal studies, classical linear models are usually not appropriate, although linear models are widely used for their simplicity. In many cases, these linear models are empirical models, which means they only describe the observed data but cannot reveal the underlying mechanism of data generation. On the other hand, nonlinear models are often used in longitudinal studies when the underlying data generation mechanism can be explained by nonlinear models. There many advantages of nonlinear models over linear models. In terms of model fitting, a nonlinear model is able to fit the observed data as well as its competing linear models but uses fewer parameters. In the aspect of interpretation, nonlinear models are often introduced based on the data generation mechanism, thus the parameters of these nonlinear models may have a natural physical meaning. Nonlinear models may also provide more reliable predictions, even outside the range of the observed data than linear models. In HIV studies, nonlinear mixed effects models (NLMEs) are popular in that NLMEs can characterize the variation both between individuals and within an individual (Davidian and Giltinan, 1995; Vonesh and Chinchilli, 1996). For the longitudinal process, we consider the following general NLME model which could be written as a hierarchical two-stage models (Davidian and Giltinan, 1995): yi j = g(ti j , β i ) + ei j , ei |β i ∼ N(0, σ 2 I), β i = h(zi , β ) + Bi bi , bi i.i.d ∼ N(0, D), (2.1) i = 1, ..., N; j = 1, ..., ni , (2.2) where g(·) is a known nonlinear function, ei = (ei1 , ..., eini )T are random errors, β i = (βi1 , ..., βis )T is a vector of individual-specific regression parameters, β = (β 1 , ..., β s )T is a vector of population parameters, h(·) is a s-dimensional vectorvalued known function, bi is an incidence matrix of 0’s and 1’s, bi = (bi1 , ..., bis )T is a vector of random effects and is independent of ei , σ 2 is the unknown within individual variance, I is the identity matrix, and D is a covariance matrix. 15 If there are no missing data, the probability density function for the responses yi can be written as f (yi |zi , β , σ , D) = f (yi |zi , bi , β , σ ) f (bi |D)dbi . (2.3) Therefore, the likelihood function is L(β , σ 2 , D|y) = ΠNi=1 f (yi |zi , bi , β , σ ) f (bi |D)dbi . (2.4) The likelihood function is complex and generally has no closed-form expression. Thus, numerical method could be used to get exact likelihood calculations. The computation would be intensive when the dimension of random effects bi ’s is high. Alternative methods like the Monte Carlo method and the approximate method (Lindstrom and Bates, 1990) could be considered for this intensive computation. 2.3 2.3.1 Covariate Models Empirical Model for Time-dependent Covariate with Measurement Errors and Missing Data Measurement errors and missing data in time-dependent covariates are very common in practice. For example, CD4 cell count is usually of interest in HIV studies. One can hardly make sure CD4 cell count could be measured at each scheduled time point for an individual because the individual may not come to the medical center every time due to various reasons. Also, CD4 cell count is often measured with errors, possibly due to the imprecision of medical machines or carelessness of physicians. Thus, it is important to model the covariate process in order to address measurement errors or missing data in the covariate. Let zikl be the observed covariate value and z∗ikl be the (unobservable) “true” value of covariate k for individual i at time uil , i = 1, ..., N; k = 1, ..., ν; l = 1, ...ni . We focus on the case where z∗ikl is the current true covariate value. We allow missing data in the covariates. Let zi = (zTi1 , ..., zTini )T , where zil = (zi1l , ..., ziνl )T , l = 1, ..., ni . Following Shah, Laird, and Schoenfeld (1997), we consider the fol16 lowing multivariate LME model to empirically describe the covariate processes zil = Uil α +Vil ai + εil , i = 1, ..., N, l = 1, ..., ni , (2.5) where Uil and Vil are design matrices, α and ai are unknown population (fixedeffects) and individual-specific (random-effects) parameter vectors, and εil are the random measurement errors for individual i at time uil . Parameters in (2.5) may be regarded as nuisance parameters because they are not of our main interest. Therefore, the true (unobservable) covariate values are assumed to be z∗il = Uil α +Vil ai . We also assume that ai i.i.d.∼ N(0, A), εil i.i.d. ∼ N(0, R), and ai and εi = (εi1T , ..., εinT i )T are independent, where A and R are unknown covariance matrices. We further assume that α and ai are independent of ei and bi in the response model. Note that for T commonly-used polynomial empirical LME models, we have uik = (1, uik , ..., ul−1 ik ) T and vik = (1, uik , ..., ur−1 ik ) . To allow for missing data in time-dependent covariates, we recast model (2.5) in continuous time: zi (t) = Ui (t)α +Vi (t)ai + εi (t), i = 1, ..., N, where zi (t), Ui (t), and εi (t) are the covariate values, design matrices, and measurement errors at time t respectively. At the response measurement time ti j , the possibly unobserved “true” covariate values can be viewed as z∗i j = Ui j α +Vi j ai , where Ui j = Ui (ti j ) and Vi j = Vi (ti j ). Note that, without a clear understanding of the data generation mechanism for the covariates, we use an empirical model to describe the covariate process. 17 This empirical model only describes the observed data but cannot reveal the data generation mechanism in the covariates. We may also model the covariate process using empirical polynomial models with random effects, as Higgins et al. (1997) and Wu (2002). By standard model selection procedure, an empirical model for the covariate process can be selected in terms of AIC and BIC criteria. 2.3.2 Model for Time-independent Covariate When the covariates are time-independent, we consider a multivariate normal distribution to model the time-independent covariates (Little and Schlucher, 1985). For example, in longitudinal studies, possible covariates of interest like gender are time-independent. Sometimes, for simplicity, time-varying covariates are only considered for their baseline values (Lee, 2009), thus they can also be regarded as time-independent covariates. To allow for both continuous and categorical covariates, the multivariate normal model for the covariates, zi = (zi1 , ..., zip ), can be written as a product of one-dimensional conditional distributions (Ibrahim et. al., 1999) f (zi ; α) = f (zip |zi1 , ..., zi,p−1 ; α p )... f (zi,1 ; α 1 ) (2.6) where α = (α T1 , ..., α Tp )T are nuisance parameters for the conditional models, and p are number of covariates. 2.4 Survival Model The time to an event may possibly be related with individual characteristics, for the survival process, we assume that the distribution of Ti may depend on the random effects bi which represent individual-specific longitudinal processes in the early period. For example, in HIV studies patients with a faster (or slower) viral load decay rate may be more likely to have an earlier viral load rebound, so the time to viral load rebound Ti may depend on the random effects associated with the viral load decay rates. Therefore, we consider a survival model for the distribution of Ti , which links the probability of the time to an event to the random effects bi in the NLME model. 18 Let the survival function S(t) = P(T > t) be the probability of the survival time T being larger than t. The hazard rate is h(t) = lim∆t→0 P(T < t + ∆t|T > t) , ∆t which means the probability of experiencing an event immediately given no event appears previously. Further we have h(t) = lim∆t→0 − = S(t + ∆t) − S(t) ∆tS(t) S (t) . S(t) Therefore, T S(T ) = exp(− h(t)dt). 0 The Cox proportional hazards model assumes the hazard proportional to covariates in an exponential form. In particular, we assume that the conditional hazard rate at time Ti = ti given the random effects bi as follows h(ti |zi , bi ) = h0 (ti )exp(γ T1 zi + γ T2 bi ), (2.7) where h0 (ti ) is the baseline hazard function, γ 1 and γ 2 are unknown parameters linking the covariates zi , and random effects bi to the conditional hazard rate, respectively. This assumption assumes the hazard is affected by both the covariate value and the individual characteristics. Let pik = P(rik = 1|ril = 0, 0 ≤ l < k; zi , bi ) (2.8) = 1 − P(Ti ≥ tik |Ti ≥ ti,k−1 , zi , bi ) (2.9) = 1− S(tik ) , S(ti,k−1 ) k = 1, 2, ..., ni . 19 (2.10) Then we have, pik = 1 − exp[−exp(γ0k + γ T1 zi + γ T2 bi )], (2.11) or, log(−log(1 − pik )) = γ0k + γ T1 zi + γ T2 bi where, tik γ0k = log ti,k−1 k = 1, ..., max{ni }, h0 (t)dt, . Let γ 0 = (γ 01 , ..., γ 0max{ni } ) and γ = (γ0 , γ 1 , γ 2 ) . The density for ri can be written as i f (ri |zi , bi , γ) = Πnk=1 f (rik |ril = 0, 0 ≤ l < k; zi , bi , γ) (2.12) where, f (rik |ril = 0, 0 ≤ l < k; zi , bi , γ) = prikik (1 − pik )(1−rik ) . 2.5 Model for Missing Data When there are informative dropouts, the missing data mechanism must be taken into account for valid likelihood inference. We assume that the missing covariates are missing at random (or ignorable) in the sense that the missingness may be related to the observed data but not the missing values, so we do not need to specify a missing covariate mechanism. For the missing longitudinal responses; however, it is likely that the missingness may be nonignorable in the sense that the missingness may be related to unobserved values. For example, in HIV studies, patients with slower initial viral load decay after treatment may be more likely to dropout early or miss visits than those with faster initial viral load decay, so the missingness probability may be related to the individual-specific initial viral load decay rates. Thus, here we assume a missing longitudinal response model which allows the missing probability to possibly depend on the unobservable random ef- 20 fects bi . Such a missing data model is related to the shared-parameter models or random-effect-based dropouts (Wu and Carroll, 1988; DeGruttola and Tu, 1994; Little, 1995; Follmann and Wu, 1995; Ten Have et al., 1998). In other words, the missingness depends on both ymis,i and yobs,i through the random effects bi . For such missing responses, a model specifying the missing response mechanism must be incorporated in the likelihood inference. Note that the probability of missing responses at time ti j may also depend on the missing status at the previous time point ti, j−1 . Based on the above arguments, as an example, we may consider the following model for the missing responses: logit(P(si j = 1|si, j−1 , bi , φ )) = φ0 + φ1 si, j−1 + φ T2 bi , f (si |bi , φ ) = i f (si1 |bi , φ )Πnj=2 f (si j |si, j−1 , bi , φ ), (2.13) (2.14) where the parameters φ may be viewed as nuisance parameters and are usually not of inferential interest. More complicated missing data models may be assumed, but a too complicated missing data model may introduce too many nuisance parameters and may cause parameter identifiability problems. 21 Chapter 3 Two-Step Method 3.1 Simple Two-Step Method In joint models of longitudinal data and survival data, the longitudinal model and the survival model are usually linked through some shared parameters or shared variables. For example, the following two cases often arise in practice: • the response of a longitudinal model is a time-dependent covariate in the survival model, which often arises in survival analysis with measurement error or missing data in time-dependent covariates; • the longitudinal model and the survival model share same parameters or random effects, which often arises in longitudinal analysis with dropouts, or when there is a latent process which governs both the longitudinal process and the survival process. In both cases, a simple or naive two-step approach can be used. It is to first fit one model (often the secondary model) to the observed data separately, ignoring the other model, and then in the second step the shared parameters or random effects are substituted by their estimates from the first step. Then, one proceeds with the inference in a usual way as if the estimated parameters or random effects were observed data. This two-step method is closely related to the regression calibration method in measurement error literature. A major advantage of the simple or naive 22 two-step method is that it is simple, and standard softwares are available to use. However, such a simple or naive two-step method may lead to misleading results. In the following, we discuss the two-step method in more details. 3.2 Modified Two-Step Method As pointed out by Ye, Lin, Taylor (2008) and Albert and Shih (2009), the simple two step method mentioned in the last section may lead to misleading results in two ways: • (i) the covariate trajectories of subjects who experience an event (e.g., die or drop out) may be different from those who do not experience any event, so the estimation of the covariate model in the first step to all covariate data may be biased; • (ii) inference in the second step that ignores the estimation uncertainty in the first step may lead to misleading results (e.g., under-estimating standard errors). The bias in case (i), called bias from informative dropouts, may depend on the strength of the association between the longitudinal process and the survival process. The misleading results in case (ii) may depend on the magnitude of measurement errors in covariates. In the following, we consider a modified two-step method to address these issues. In order to adjust the standard errors of parameter estimates in the survival model by incorporating the estimation uncertainty in the first step, we can consider a parametric bootstrap method as follows: • Step 1: Generate covariate values based on the assumed covariate model, with unknown parameters substituted by their estimates; • Step 2: Generate survival times from the fitted survival model; • Step 3: For each generated bootstrap dataset from step 1 and step 2, fit the models using the two-step method and obtain new parameter estimates; 23 • Step 4: Repeat Step 1-3 B times (say, B = 500). We can obtain the estimated standard errors for the fixed parameters from the sample covariance matrix across the B bootstrap datasets. This modified method produces more reliable estimates of the standard errors than the naive two-step method, if the assumed models are correct. The modified two step method gets an advantages over the naive two step method in that it includes the uncertainty of latent parameters or latent variables (which are estimated in the first step). However, a limitation of this modified two-step method is that it can only deal with a dataset with no missing information. When the missing data problem appears, the previous two step method might give misleading results. In next chapters, we consider another approach based on the joint likelihood of observed data to address the issue of missing data in joint models. 24 Chapter 4 Joint Likelihood Inference with Time-independent Covariate 4.1 Introduction In this chapter, we consider simultaneous likelihood inference for all parameters based on the joint likelihood of the observed data. We first focus on timeindependent (or baseline) covariates. The extension to time-dependent covariates with measurement errors will be discussed in next chapter. 4.2 Joint Likelihood We consider simultaneous likelihood inference for all parameters based on the joint likelihood of the observed data {(yi,obs , zi,obs , ri , si ), i = 1, 2, ..., N}. We consider time-independent (or baseline) covariates following Wu (2009). Let f (·) denote a generic density function and f (y|x) denote the conditional distribution of y given x. Let θ = (β , σ , γ, φ , D) denote the collection of all unknown parameters. We assume that yi and ri are conditionally independent given the random effects bi , i.e., ri depends on yi through the random effects bi . We also assume that f (si |yi , bi , φ ) = f (si |bi , φ ). 25 Thus, we have f (yi , ri , si |zi , bi , θ ) = f (yi |zi , bi , β , σ ) f (ri |zi , bi , γ) f (si |bi , φ ). The joint likelihood for the observed data can then be written as L0 (θ ) =ΠNi=1 f (yi |zi , bi , β , σ ) f (ri |zi , bi , γ) f (si |bi , φ ) × f (zi |α) f (bi |D) × f (yi,mis , zi,mis , bi |yi,obs , zi,obs , si , ri , θ )dyi,mis dzi,mis dbi . 4.3 A Monte Carlo EM Algorithm Maximum likelihood estimates (MLEs) of all parameters θ can be obtained by maximizing the observed data likelihood L0 (θ ). However, the observed data likelihood L0 (θ ) may be difficult to evaluate because it involves intractable and high dimensional integral. In the following, we use a Monte-Carlo EM algorithm to obtain the MLEs. If we treat the unobservable random effects bi as additional “missing data”, we can write the “complete data” as {(yi , zi , ri , si , bi ), i = 1, 2, ..., N}. Thus, the complete-data log-likelihood for individual i can be written as (i) lc = log f (yi |zi , bi , β , σ ) + log f (zi |α) + log f (bi |D) + log f (ri |zi , bi , γ) + log f (si |bi , φ ). The E-step at the t th iteration of the EM algorithm for individual i can then be written as Qi (θ |θ (t) ) = {log f (yi |zi , bi , β , σ ) + log f (zi |α) + log f (bi |D) + log f (ri |zi , bi , γ) + log f (si |bi , φ )} × f (yi,mis , zi,mis , bi |yi,obs , zi,obs , si , ri , θ (t) )dyi,mis dzi,mis dbi . Since it is difficult to evaluate the integral Qi (θ |θ (t) ) analytically, we approximate the integral by the Monte-Carlo methods following Wu (2009) as follows. 26 Since Qi (θ |θ (t) ) is a (conditional) expectation with respect to the density f (yi,mis , zi,mis , bi |yi,obs , zi,obs , si , ri , θ (t) ), we may approximate Qi by its empirical mean, obtained by simulating many samples from the conditional density f (yi,mis , zi,mis , bi |yi,obs , zi,obs , si , ri , θ (t) ) and then replacing the expectation by an empirical mean. To generate random samples from the conditional density f (yi,mis , zi,mis , bi |yi,obs , zi,obs , si , ri , θ (t) ), we may use the Gibbs sampler method (Gelfand and Smith, 1990) along with the multivariate rejection method by iteratively sampling from the full conditionals f (yi,mis |yi,obs , zi , bi , si , ri , θ (t) ), f (zi,mis |zi,obs , yi , bi , si , ri , θ (t) ), and f (bi |yi , zi , si , ri , θ (t) ) in turn until the resulting Markov chain converges. To sample these full conditionals, note that f (yi,mis |yi,obs , zi , bi , si , ri , θ (t) ) ∝ f (yi |zi , bi , β (t) , σ (t) ) (t) (t) (4.1) (t) (t) f (zi,mis |zi,obs , yi , bi , si , ri , θ ) ∝ f (yi |zi , bi , β , σ ) f (zi |α) f (ri |zi , bi , γ ) (4.2) (t) (t) (t) (t) f (bi |yi , zi , si , ri , θ ) ∝ f (bi |D ) f (yi |zi , bi , β , σ ) × f (ri |zi , bi , γ (t) ) f (si |bi , φ (t) ). (1) (1) (1) (m ) (m ) (mt ) t t Suppose that {(˜yi,mis , z˜ i,mis , b˜ i ), ..., (˜yi,mis , z˜ i,mis , b˜ i (4.3) )} is a random sample of size mt generated from f (yi,mis , zi,mis , bi |yi,obs , zi,obs , si , ri , θ (t) ). The E-step of the Monte Carlo EM algorithm at the (t + 1)th iteration can be approximated as follows Q(θ |θ (t) ) =ΣNi=1 Qi (θ |θ (t) ) (4.4) 1 t ( j) ( j) ( j) ≈ ΣNi=1 { Σmj=1 log f (yi,obs , y˜ i,mis |zi,obs , z˜ i,mis , b˜ i , β , σ ) mt ( j) ( j) + log f (zi,obs , z˜ |α) + log f (b˜ |D) + log i,mis i ( j) ˜ ( j) f (ri |zi,obs , z˜ i,mis , bi , γ) + log ( j) f (si |b˜ i , φ )}. (4.5) The above approximation can be made arbitrarily accurate by increasing mt . The 27 M-step of the Monte Carlo EM algorithm is then to maximize Q(θ |θ (t) ), which is just like a complete data maximization, so standard optimization procedures for complete-data models such as the Newton-Raphson method can be used to obtain the updated parameters θ (t+1) . If we assume that the parameters in each term of Q(θ |θ (t) ) are distinct, we can maximize each term of Q(θ |θ (t) ) separately using standard methods for linear, nonlinear, and logistic regression models. The variance covariance matrix of θ can be approximated as follows. At the convergence of the EM algorithm, let ( j) ( j) ( j) Si j = ∂ l(θ |yi,obs , y˜ i,mis , zi,obs , z˜ i,mis , b˜ i , ri , si )/∂ θ evaluated at θ = θˆ , and t I(θˆ ) ≈ ΣNi=1 Σmj=1 1 Si j (θˆ )SiTj (θˆ ). mt The approximate asymptotic variance covariance matrix of θˆ is I −1 (θˆ ). 4.4 Sampling Methods and Convergence To implement the Monte-Carlo EM algorithm described in the previous section, one of the major computational steps is to sample from the full conditionals in (4.1)-(4.3). Sampling from the distribution (4.1)-(4.3) can be accomplished by rejection sampling methods. If the appropriate densities on the right hand-sides of (4.1)-(4.3) are log-concave, the adaptive rejection algorithm of Gilks and Wild (1992) may be used. If some densities are not log-concave, we may consider the multivariate rejection sampling method. For example, suppose that we want to generate random samples from f (bi |yi , zi , ri , si , θ (t) ) 28 in (4.3). Let h(bi ) = f (yi |zi , bi , β (t) , σ (t) ) f (ri |zi , bi , γ (t) ) f (si |bi , φ (t) ) and τ = supb h(bi ). A random sample from f (bi |yi , zi , ri , si , θ (t) ) can be obtained as follows: • Step 1: sample b∗i from f (bi |D(t) ), and independently, sample w from the uniform(0,1) distribution • Step 2: if w ≤ h(b∗i )/τ then accept b∗i ; otherwise, go to step 1. Samples from the other two full conditionals can be obtained in a similar way. Therefore, the E-step of the Monte-Carlo EM method can be accomplished by the Gibbs sampling method combined with the rejection sampling methods. To assess the convergence of the Gibbs sampler, we may use standard graphical tools such as trace plots and autocorrelations. To implement the E-step of the Monte-Carlo EM algorithm, we should choose the numbers of Monte-Carlo samples mt . Generally, larger values of mt will result in more exact approximation in the E-step but the computation will be slower. To ensure convergence of the Monte-Carlo EM algorithm, we should increase mt as the number t of EM iterations increases. Note that, for Monte-Carlo EM algorithms, the incomplete-data log-likelihood is not guaranteed to increase at each iteration due to Monte Carlo error at the E-step. However, under suitable regularity conditions, Monte-Carlo EM algorithms still converge to the maximum likelihood estimate (Fort and Moulines, 2003). 29 Chapter 5 Time-dependent Covariate with Measurement Error 5.1 Introduction The method presented in Chapter 4 can be extended to the case of time-dependent covariates where the covariates may be missing (ignorable) or measured with errors. In practice, some covariates may be measured with errors, and the timedependent covariates may also be missing due to different measurement schedules from the response measurements or other problems. For example, in HIV studies, CD4 cell count is often measured with substantial errors and may have measurement schedules different from the viral load measurement schedules. To address covariate measurement errors or missing data, we may model the time-dependent covariates empirically using linear mixed effects (LME) models as follows. 5.2 Joint Likelihood Let zikl be the observed value and z∗ikl be the (unobservable) “true” value of covariate k for the ith individual at time uil , i = 1, ..., N; k = 1, ..., ν; l = 1, ...mi . We focus on the case where z∗ikl is the current true covariate value. We allow missing data in the covariates. Let zi = (zTi1 , ..., zTimi )T , where zil = (zi1l , ..., ziνl )T , l = 1, ..., mi . Following Shah, Laird, and Schoenfeld (1997), we consider the following multivariate 30 LME model to empirically describe the covariate processes zil = Uil α +Vil ai + εil , i = 1, ..., N, l = 1, ..., mt , (5.1) where Uil and Vil are design matrices, α and ai are unknown population (fixedeffects) and individual-specific (random-effects) parameter vectors, and εil are the random measurement errors for the ith individual at time uil . The true (unobservable) covariate values are assumed to be z∗il = Uil α +Vil ai . We also assume that ai T )T are independent, i.i.d.∼ N(0, A), εil i.i.d. ∼ N(0, R), and ai and εi = (εi1T , ..., εim i where A and R are unknown covariance matrices. We further assume that α and ai are independent of ei and bi in the response model. Note that for commonlyT used polynomial empirical LME models, we have uik = (1, uik , ..., ul−1 ik ) and vik = T (1, uik , ..., ur−1 ik ) . To allow for missing data in the time-dependent covariates, we recast model (5.1) in continuous time: zi (t) = Ui (t)α +Vi (t)ai + εi (t), i = 1, ..., N, (5.2) where zi (t), Ui (t), and εi (t) are the covariate values, design matrices, and measurement errors at time t respectively. At the response measurement time ti j , the possibly unobserved “true” covariate values can be viewed as z∗i j = Ui j α + Vi j ai , where Ui j = Ui (ti j ) and Vi j = Vi (ti j ). When the covariates are measured with errors, we assume that the response and the time-to-event distributions f (yi |ai , bi , β , σ ) and f (ri |ai , bi , γ) may depend on the unobserved true covariate values rather than the observed mis-measured covariate values, i.e., the distributions of yi and ri may depend on the random effects ai and bi . Let θ be the collection of unknown parameters, and θ = (α, β , γ, φ , σ , δ , D, A). 31 Therefore, the full likelihood for the observed data can thus be written as L(θ ) = ΠNi=1 f (yi |z∗i (ai , α), bi , β , σ ) f (ri |z∗i (ai , α), bi , γ) × f (ai |A) f (bi |D) f (si |bi , φ ) f (zi |α, ai , δ ) × f (yi,mis , zi,mis , ai , bi |yi,obs , zi,obs , si , ri , θ )dyi,mis dzi,mis dai dbi , where z∗ (α, ai ) is the true covariate value which depends on random effects ai according to model (5.2), and δ 2 stands for the measurement error variability in covariate. We assume that εi1 , εi2 , ..., εini ∼i.i.d N(0, δ 2 ). The random effects ai are introduced to account for large inter-individual variations in the change of the time-dependent covariate. We assume ai = (ai1 , ai2 )T ∼ N(0, A). 5.3 A Monte Carlo EM Algorithm Maximum likelihood estimates (MLEs) of unknown parameters θ can be obtained by maximizing the observed data likelihood L0 (θ ). However, the observed data likelihood L0 (θ ) may be difficult to evaluate because it involves intractable and high dimensional integral. In the following, we use a Monte-Carlo EM algorithm to obtain the MLEs. If we treat the unobservable random effects ai and bi as additional “missing data”, we can write the “complete data” as {(yi , zi , ri , si , bi ), i = 1, 2, ..., N}. Thus, the complete-data log-likelihood for individual i can be written as (i) lc = log f (yi |z∗i (ai , α), bi , β , σ ) + log f (zi |α, ai , δ ) + log f (bi |D) + log f (ri |z∗i (ai , α), bi , γ) + log f (si |bi , φ ) + log f (ai |A). The E-step at the t th iteration of the EM algorithm for individual i can then be 32 written as Qi (θ |θ (t) ) = {log f (yi |z∗i (ai , α), bi , β , σ ) + log f (zi |α, ai , δ ) + log f (bi |D) + log f (ri |z∗i (ai , α), bi , γ) + log f (si |bi , φ ) + log f (ai |A)} × f (yi,mis , zi,mis , ai , bi |yi,obs , zi,obs , si , ri , θ (t) )dyi,mis dzi,mis dai dbi . Since it is difficult to evaluate the integral Qi (θ |θ (t) ) analytically, we approximate the integral by the Monte-Carlo methods. Since Qi (θ |θ (t) ) is a (conditional) expectation with respect to the density f (yi,mis , zi,mis , ai , bi |yi,obs , zi,obs , si , ri , θ (t) ), we may approximate Qi by its empirical mean, obtained by simulating many samples from the conditional density f (yi,mis , zi,mis , ai , bi |yi,obs , zi,obs , si , ri , θ (t) ) and then replacing the expectation by an empirical mean. To generate random samples from the conditional density f (yi,mis , ai , bi |yi,obs , zi,obs , si , ri , θ (t) ), we may use the Gibbs sampler method (Gelfand and Smith, 1990) along with the multivariate rejection method by iteratively sampling from the full conditionals f (yi,mis |yi,obs , zi,obs , ai , bi , si , ri , θ (t) ), f (zi |α, ai , δ ), f (ai |zi,obs , yi , bi , si , ri , θ (t) ), and f (bi |yi , zi,obs , ai , si , ri , θ (t) ) in turn until the resulting Markov chain converges. 33 To sample these full conditionals, note that f (yi,mis |yi,obs , zi,obs , ai , bi , si , ri , θ (t) ) ∝ f (yi |z∗i (ai , α), bi , β (t) , σ (t) ) (5.3) (t) (5.4) f (zi,mis |yi , zi,obs , ai , bi , si , ri , θ ) ∝ f (zi |α, ai , δ ) f (ai |zi,obs , yi , bi , si , ri , θ (t) ) ∝ f (yi |z∗i (ai , α), bi , β (t) , σ (t) ) (5.5) × f (zi |α, ai ) f (ri |z∗i (ai , α), bi , γ (t) ) f (ai |A) (5.6) f (bi |yi , zi,obs , si , ri , θ (t) ) ∝ f (bi |D(t) ) f (yi |z∗i (ai , α), bi , β (t) , σ (t) ) × f (ri |z∗i (ai , α), bi , γ (t) ) f (si |bi , φ (t) ). (5.7) (1) (1) (1) (1) (m ) (m ) (mt ) t t Suppose that {(˜zi,mis , z˜ i,mis , a˜ i , b˜ i ), ..., (˜yi,mis , z˜ i,mis , a˜ i (mt ) , b˜ i )} is a random sam- (t) ple of size mt generated from f (yi,mis , ai , bi |yi,obs , zi,obs , si , ri , θ ). The E-step of the Monte Carlo EM algorithm at the (t + 1)th iteration can be approximated as follows Q(θ |θ (t) ) = ΣNi=1 Qi (θ |θ (t) ) (5.8) 1 t ( j) ( j) ( j) ≈ ΣNi=1 { Σmj=1 log f (yi,obs , y˜ i,mis |z∗i (˜ai , α), b˜ i , β , σ ) mt ( j) ( j) ( j) ( j) + log f (zi,obs , z˜ |α, a˜ , δ ) + log f (˜a |A) + log f (b˜ |D) + log i,mis i ∗ ( j) ˜ ( j) f (ri |zi (˜ai , bi , γ) + log i ( j) f (si |b˜ i , φ )}. i (5.9) The above approximation can be made arbitrarily accurate by increasing mt . The M-step of the Monte Carlo EM algorithm is then to maximize Q(θ |θ (t) ), which is just like a complete data maximization, so standard optimization procedures for complete-data models such as the Newton-Raphson method can be used to obtain the updated parameters θ (t+1) . If we assume that the parameters in each term of Q(θ |θ (t) ) are distinct, we can maximize each term of Q(θ |θ (t) ) separately using standard methods for linear, nonlinear, and logistic regression models. The variance covariance matrix of θ can be approximated as follows. At the 34 convergence of the EM algorithm, let ( j) ( j) ( j) ( j) Si j = ∂ l(θ |yi,obs , y˜ i,mis , z˜ i,mis , a˜ i , b˜ i , ri , si )/∂ θ evaluated at θ = θˆ , and t I(θˆ ) ≈ ΣNi=1 Σmj=1 1 Si j (θˆ )SiTj (θˆ ). mt The approximate asymptotic variance covariance matrix of θˆ is I −1 (θˆ ). 35 Chapter 6 Data Analysis 6.1 Introduction We have discussed the two-step method and the method using the joint likelihood inference for the statistical analysis on the longitudinal process and the survival process in the previous chapters. In this chapter, we analyze a real example using the methods discussed. We describe the data set in Section 6.2. We introduce the models for longitudinal data and survival data as well as the model for missingness in Section 6.3. In Section 6.4, we analyze a real HIV dataset with some interesting findings. We discuss some computational issues in Section 6.5. 36 6.2 Data Description The dataset comes from a recent HIV study. It consists of 41 HIV patients who were given an anti-HIV treatment at the beginning of the study. We consider the study within the first 400 days after the treatment since data after 400 days is likely to be influenced by the long term clinical factors. The time then is scaled from 0 to 1 for convenience. The viral load (in log10 scale) and the CD4 cell count are measured repeatedly over time after an anti-HIV treatment. The measurement times within a patient varies from 8 to 14 (with a mean of 10 and a standard deviation of 1.43). Note that there is a substantial variation among patients. Also some patients may not experience any viral load rebound (a viral load increase) during the study period. About 16% CD4 cell count values and 1% viral load values are missing. A summary of the HIV dataset is shown in Table 6.1. Table 6.1: Summary of the HIV dataset Variable Sample Sample Percentage mean standard deviation of missing values Viral load 2.16 1.189 1.17% CD4 cell count 305.63 156.92 16.19% No. of patients = 41 Observations per patients from 8 to 14 Total missing rate : 17.37% Rebound rate : 15.26% Figure 6.1 shows viral load trajectories for six randomly selected patients from the study. We see that after an anti-HIV treatment, the patients’ viral loads would decline in the early period, which reflected the efficacy of the anti-HIV treatment. As time went by, the patients’ viral loads may continue to decline, or become flat, or rebound. For those patients with a rebound of viral load in the latter period, the rebound might be due to the potency of the treatment in the early period and the possible drug resistance developed after the early period. The difference in the viral load among patients may be due to the individual characteristics. It is therefore interesting to study if the individual characteristics are predictive for the time to a viral load rebound. 37 6 ● 5 ● ● ● 4 ● ● ● ● 3 ●● ● ● ● ● 2 Viral Load (log10 scale) ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● 0 1 ●● ● 0.0 0.2 0.4 0.6 0.8 1.0 Day Figure 6.1: Profiles of viral load values for six randomly selected patients. 6.3 6.3.1 Models The NLME Model for HIV Viral Dynamics Wu and Ding (1999) proposed a two-compartment exponential decay model for viral load dynamics in the early period. They considered a NLME model for statistical inference. The random effects specifications could be various for this two-compartment model. The NLME model we used has the random effects specifications based on the standard model selection procedures. Table 6.2 shows the AIC and BIC values, and the approximate log-likelihood (logLik) values for different random effects specifications. We find that Model 3, which is without random effects specified in λ1i , attains the smallest AIC value and the smallest BIC value. Further likelihood ratio test would support no significant difference between Model 38 1 and Model 3; however, Model 3 is simpler. Model 1 Model 2 Model 3 Model 4 Model 5 Random Effect P1i λ1i P2i λ2i j λ1i P2i λ2i j P1i P2i λ2i j P1i λ1i λ2i j P1i λ1i P2i df 16 12 12 12 12 AIC 305.52 340.43 294.48 325.22 310.57 BIC 364.29 384.51 338.56 369.30 354.65 logLik -136.76 -158.21 -135.24 -150.61 -143.28 1 vs 2 1 vs 3 1 vs 4 1 vs 5 L.Ratio p-value 42.91 3.04 27.69 13.05 < .0001 0.55 < .0001 0.011 Table 6.2: Model Selection on NLME model with various random effects specifications. Hence, we choose the NLME model with random effects specification in Model 3: yi j = log10 (P1i e−λ1iti j + P2i e−λ2iti j ) + ei j , log(P1i ) = β1 + b1i , λ1i = β2 , log(P2i ) = β3 + b2i , λ2i j = β4 + β5CD4∗i j + b3i , (6.1) where yi j is the log10 scale of the viral load measurement for the ith patient at jth measurement at ti j . λ1i and λ2i j represent the individual-specific first and second phases of viral load decay rates, respectively, P1i and P2i are individual-specific baseline values, β = (β1 , ..., β5 )T are population parameters (fixed effects), ei j represents the within individual errors. The exponential decay rates λ1i and λ2i j can be interpreted as the turnover rates of productively infected cells and the long-lived and/or latently infected cells, respectively. bki ’s are random effects. Note that the individual characteristics of the viral load trajectories can be represented by the random effects (or individual effects) bi = (bi1 , bi2 , bi3 )T . We assume that ei j |bi ∼i.i.d N(0, σ 2 ), where bi ∼i.i.d N(0, D). CD4∗i j represents the true but unobserved value of CD4 cell count for patient i at time ti j . This time dependent covariate CD4 cell count is introduced to partially explain the between individual variation in the second phrase of viral load decay rate. 39 6.3.2 The Covariate Model The covariate CD4 cell count changes with time and may be measured with errors. We need to model the change of CD4 cell count over the study period. In the absence of theoretical justification, we model the CD4 cell count process based on empirical polynomial linear mixed effects (LME) models. There are many specification in the random effects of LME model. Similar with the model selection process in the viral load model, we select the “best” model of covariate based on the AIC and BIC criteria. Table 6.3 shows the model selection results. It is found that the LME model with random effects in two coefficient gets the smallest AIC value and the smallest BIC value. Also, this LME model beats a quadratic model for its simplicity. Model Linear1 Linear2 Linear3 Quadratic Random Effect a0 a1 a0 a1 a0 a1 a2 df 6 4 4 10 AIC 313.53 330.60 620.71 317.51 BIC 335.57 345.29 635.40 354.24 logLik -150.76 -161.30 -306.35 -148.75 Test L.Ratio p-value 1 vs 2 1 vs 3 1 vs 4 21.07 311.17 315.19 < .0001 < .0001 0.77 Table 6.3: Model Selection on covariate model in different forms (linear and quadratic). The LME model describing the change of CD4 cell count is specified as below: CD4i j = αi0 + αi1ti j + εi j , CD4∗i j (6.2) = αi0 + αi1ti j , αi0 = α0 + ai0 , αi1 = α1 + ai2 , δ 2 stands for the measurement error variability in CD4 cell count. CD4∗i j represents the true value of CD4 cell count for patient i at time ti j . We assume that εi1 , εi2 , ..., εini ∼i.i.d N(0, δ 2 ). The random effects ai are introduced to account for large inter-individual variations in the change of CD4 cell count. We assume ai = (ai1 , ai2 )T ∼ N(0, A). 40 6.3.3 Survival Models Section 2.4 has given a general discussion on the survival analysis and the Cox proportional hazards model. In particular, in this part, it is of our interest to see if the CD4 cell count or the random effects bi1 , bi2 , bi3 are predictive to the time of a viral load rebound. We consider the following Cox proportional hazards model of the time-to-rebound Ti with the time-dependent covariate CD4 cell count: h(ti j |zi , bi ) = h0 (ti j )exp(γ1 z∗i j + γ2 bi1 + γ3 bi2 + γ4 bi3 ), (6.3) where h0 (ti j ) is the baseline hazards function, and z∗i j is the true value of CD4 cell count for patient i at the jth measurement. 6.3.4 The Dropout Models Missing data appears in both CD4 cell count and viral load, which may probably be due to patients’ dropouts from the follow-up study or failure to visit regularly. The missingness may be informative. The missingness probability of the responses may depend on the random effects which characterize individual differences of the viral load trajectories. Therefore, we assume the following model for the missing mechanism of the viral load in order to include the missingness in the analysis: i f (si |bi , φ ) = Πnj=1 P(si j = 1|φ , bi )si j (1 − P(si j = 1|φ , bi ))1−si j , log P(si j = 1|φ , bi ) = φ0 + φ1 b1i + φ2 b2i + φ3 b3i , i = 1, 2, ..., N, 1 − P(si j = 1|φ , bi ) (6.4) By 6.4, we assume the missingness in the response variable depends on the unobserved value of random effects. Therefore, this missingness is informative missing. We assume the missing in the CD4 cell count missing completely at random. Note that, although we may assume a more complicated model for the missing response mechanism, in this study we would like to avoid building too complicated a model for missing response here since too many nuisance parameters may lead to non-identifiability. 41 6.4 Results We have two main interests in the analysis results. One is to see if the CD4 cell count is predictive for the decay rate of the viral load, which could be found by doing inference on parameters in the NLME model: β = (β1 , ..., β5 )T . The other is to see if the individual characteristics are predictive for the time to a viral load rebound, which could be found by looking at parameters in the survival model: γ = (γ1 , ..., γ4 )T . We will consider the statistical analysis with the time-dependent CD4 cell count. We apply the four methods: the naive two-step method (TS), the modified two-step method (MTS), the joint model (JM), the joint model with complete data (CC). Table 6.4 shows the results by different methods with time-dependent CD4 cell count. All methods suggest a weak predictive power of the time-dependent CD4 cell count in the decay rate of viral load in this dataset. Also, all coefficients in the survival model are insignificant, which implies that the individual characteristics may not be predictive to the time to rebound in this dataset. Although the four methods give similar answers to the issue of our interest, the results by different methods are different. JM considers the noninformative missingness in the inference while the other three methods assume the missingness ignorable. The estimate of β2 by JM is higher than that by the other three methods. This difference suggests simply discarding the information containing missing data may underestimate the initial decay rate. The standard error estimated by TS is generally smaller than the other three methods. This result is not surprising. The standard error for the estimate represents the uncertainty. In this dataset, the uncertainty comes from mainly two sources. One is the sampling variability of obtaining these observed patients. The other one comes from the uncertainty of unknown missing data and the individual characteristics. TS discarded the information containing missing values and it is well known in literature (e.g. Tsiatis and Davidian, 2004) that such a two-step method fails to include the uncertainty of the random effects. Therefore, TS leads 42 to an underestimation of the variability of parameters. MTS adjusts the standard errors of parameter estimates by including the uncertainty of the random effects using the parametric bootstrap method (Wu, 2009). JM, in general, gives larger standard errors probably because JM includes both the uncertainty of missing values and the unknown individual characteristics. Table 6.4: Results summary by different methods with time-dependent covariate. Parameter β1 β2 β3 β4 β5 γ1 γ2 γ3 γ4 TS EST SE 10.97 0.20 68.73 2.92 5.45 0.19 4.05 0.33 -0.05 0.19 -0.03 0.20 0.43 0.41 -0.81 0.43 0.04 0.30 BSM 10.82 64.16 5.44 4.03 -0.05 -0.16 -0.14 -0.29 0.18 MTS SEM 0.37 5.53 0.15 0.22 0.12 0.30 1.01 0.68 0.73 BSE 0.51 6.74 0.16 0.28 0.17 0.33 1.10 0.75 0.80 JM EST ASE 11.1 0.16 90.8 9.96 5.96 0.15 4.81 0.27 -0.02 0.11 -0.01 0.04 -0.78 0.16 -0.44 1.62 0.74 0.34 CC EST ASE 11.12 0.15 68.64 3.79 5.47 0.13 3.86 0.26 0.02 0.13 -0.09 1.72 -0.12 0.78 -0.76 1.45 0.51 0.61 Note: EST is parameter estimate; SE is the estimated default standard error; BSM and BSE are the bootstrap mean and standard error; SEM is the bootstrap mean of the estimated default standard errors; ASE is the approximated asymptotic standard error. By the analysis above, we conclude that, for this particular HIV dataset, ignoring missing data mechanisms may under-estimate the initial decay rate. Additionally, the survival process may not have been linked to the individual-specific characteristics or the CD4 cell count values. However, these conclusions are based on one single dataset; therefore simulation study is required to check the validity of these conclusions. 43 6.5 Computation Issues Much of the computation issues lie in the joint model which is based on the joint likelihood inference. 6.5.1 Choice of Starting Value The EM algorithm was used for the joint inference of the joint model in the example. The starting values for the regression coefficients in the NLME model (β ) were chosen by fitting a nonlinear mixed effects model to the complete dataset, which is after removing the missing information. The regression coefficients in the survival model (γ) were chosen by fitting a Cox proportional hazards model to the complete dataset with covariates from the NLME model. The regression coefficients in the dropout model (φ ) was chosen by fitting a logistic regression model to the original dataset, with covariates from the NLME model. The regression coefficients in the empirical model for the time-dependent covariate were chosen by fitting a linear mixed effects model to the complete dataset. 6.5.2 Convergence Criteria The convergence criteria was based on the relative change in the parameter estimates. The EM algorithm would stop if the differences of the parameter estimates between the current step and the last step is smaller than a tolerance level which is set at beginning. In our example, the tolerance level was set to be 5%. That is, the EM algorithm would stop if the maximum difference of all differences of parameter estimate between the current step and the last step is smaller than 5%. In principle, with a smaller the tolerance level, we could get more accuracy in the parameter estimate, but we have to pay for the additional cost of computation. 6.5.3 Running Time The running time by the joint model could be huge comparing to the native twostep method and the modified two-step method. There are mainly two reasons for the huge computation time. One is due to the use of the Gibbs sampling in generating samples from a complex probability distribution. The other one is due to the long run to reach convergence in the EM algorithm. 44 Chapter 7 Simulation Study 7.1 Introduction In order to evaluate the performance of the joint model comparing to the two-step methods, and the joint method considering the missing data mechanism comparing to the methods with complete data, we conduct a simulation study in this chapter. We compare different methods in terms of their bias and the mean squared errors of the corresponding estimates. We first introduce the design of the simulation study including the setup of parameters and the data generation models. Then, we compare different methods in different scenarios. 45 7.2 7.2.1 Design of Simulation Study Models We generate the response variable yi j from the NLME model as follows: yi j = log10 (P1i e−λ1iti j + P2i e−λ2iti j ) + ei j , log(P1i ) = β1 + b1i , λ1i = β2 , log(P2i ) = β3 + b2i , λ2i j = β4 + β5CD4∗i j + b3i , (7.1) where β = (β1 , ..., β5 )T are the regression parameters of interest. The true value of β is set to be (11, 80, 5, 4, 1). bi = (bi1 , bi2 , bi3 )T are random effects. We assume bi ∼i.i.d N(0, D), so bi is generated from the normal distribution N(0, D), where D is the variance covariance matrix of bi . The number of subjects, N, the measurement times for each individual ni j , the variance of the error terms, σ , and the measurement error variability δ are chosen differently in later comparisons. We generate the true value of CD4 cell count and the observed value of CD4 cell count following the linear mixed effects model as below: CD4i j = αi0 + αi1ti j + εi j , CD4∗i j (7.2) = αi0 + αi1ti j , αi0 = α0 + ai0 , αi1 = α1 + ai2 , δ 2 stands for the measurement error variability in CD4 cell count. CD4∗i j represents the true value of CD4 cell count for patient i at time ti j . We assume that εi j ∼i.i.d N(0, δ 2 ). The random effects ai are introduced to account for large inter-individual variations in the change of CD4 cell count. We assume ai = (ai1 , ai2 )T ∼ N(0, A). We assign the missing values in the response yi j using the dropout model as 46 follows: log P(si j = 1|φ , bi ) = φ0 + φ1 b1i + φ2 b2i + φ3 b3i , i = 1, 2, ..., N, 1 − P(si j = 1|φ , bi ) (7.3) where φ are regression coefficients of interest. si j is the missing indicator for yi j . si j = 1 means yi j is missing; si j = 0 means yi j is observed. The dropout model means the missing mechanism of response is related with the random effects. According to different missing rate, we choose different settings of φ in latter comparisons. The time to a viral load rebound (event) is generated following the Cox proportional hazards model as follows: P(ri j = 1|ril = 0, l < j, γ, bi ) = 1 − exp(−exp(γ0 j + γ1 z∗i j + γ2 bi1 + γ3 bi2 + γ4 bi3 )). (7.4) ri j is the event indicator, which is a binary variable. ri j = 1 means there is a rebound in viral load at time ti j ; ri j = 0 means no rebound at time ti j . The model for the time to an event suggests that the time to an event depends on the random effects and the current covariate value. We choose different settings of model coefficients and the baseline hazard according to different missing rates in the latter comparisons. 7.3 Comparison Criteria We compare different methods in terms of bias and the mean squared errors. The criteria are made in terms of the percentage relative bias and percentage relative root of mean squared errors. The bias for βi is defined as biasi = |βˆi − βi |, where βˆi is the estimate of βi . The MSE of βi is defined as MSEi = bias2i + s2i , 47 where si is the standard deviation for βˆi . Therefore the percentage relative bias of βˆi is 100% × biasi . |βi | The percentage relative root of MSE is defined as √ MSEi . 100% × |βi | In the latter paragraphs, by MSE we mean the percentage relative root of MSE, by Bias we mean the percentage relative bias. 7.4 7.4.1 Simulation Results Comparisons of Methods in Different Missing Rates We will apply different methods to datasets with different rate of missing values in order to check how the rate of missingness affects the estimate results by different methods. We will compare two rates of missing values, 10% and 20%. By rate of missing values, we mean the total rate of missing either in the response or the covariate or in both. The missingness is assumed to be MCAR in this part. The variance covariance matrix for the random effects in the NLME model 7.1 is chosen as D = diag(1, 1, 1); the standard deviation of the error term is set to be σ = 0.25; the variance covariance matrix for the random effect in the LME model 7.2 is A = diag(0.6, 0.2) and α = (0.5, 0.5); the standard deviation of the error terms δ = 0.05. We generate N = 50 subjects with 15 within subject measurement times. We run the simulation with 100 repetitions. Table 7.1 shows simulation results for the missing value rate around 10%. It is 48 found that the bias of parameter estimation by all three methods are similar. However, JM gives quite larger MSE in β2 than that given by the other two methods. This result makes sense considering that JM includes the uncertainty of missing values while the other two methods donot. In β5 , which is the coefficient of covariate, JM gives a smaller MSE. This finding is not unexpected since JM considers a measurement error model for the covariate and imputes the covariate value with “true” value from the covariate model. Missing Rate(%) 10 10 Table 7.1: Simulation result (10% missing) Parameter True MTS JM Value Bias MSE Bias MSE β1 11 1 6 1 9 β2 80 7 16 12 28 β3 5 2 5 0 4 β4 4 1 10 4 8 β5 1 1 24 6 14 γ1 -1 44 74 42 66 γ2 1 43 62 55 62 γ3 -1 41 60 37 44 γ4 1 43 63 49 55 Bias 2 11 1 6 3 39 64 40 48 CC MSE 9 19 3 11 22 62 68 47 53 Table 7.2 shows simulation results when the rate of missing values is 20%. Compared with results in Table 7.1, the results of Bias and MSE given by the other two method MTS and CC generally are not much changed. However, the MSE by JM method increases in general while the Bias of parameter estimations by JM stay similar as before. The result of increased MSE by JM is not surprising since the uncertainty of missing information gets larger as the missing rate goes up, which may bring more uncertainty to the parameter estimation for JM method. 7.4.2 Comparisons of Methods in Different Measurement Times In order to investigate the influence of measurement times on the parameter estimates, in this part we choose 25 visits during the study period. The other setting are the same as the case of missing rate 10% in 7.4.1. The simulation results are 49 Missing Rate(%) 20 20 Table 7.2: Simulation result (20% missing) Parameter True MTS JM Value Bias MSE Bias MSE β1 11 0 9 4 11 β2 80 7 21 11 35 β3 5 2 5 2 9 β4 4 1 11 7 14 β5 1 6 27 4 18 γ1 -1 41 80 39 78 γ2 1 43 63 67 74 γ3 -1 37 55 49 59 γ4 1 40 58 60 64 Bias 4 10 1 4 11 40 66 41 55 CC MSE 11 23 5 10 23 53 69 47 58 shown in Table 7.3. Comparing with Table 7.1, both Bias and MSE in parameter estimation of longitudinal process decrease, which implies that including more individual longitudinal information may give us a better understanding of the longitudinal process. Missing Rate(%) 10 10 Table 7.3: Simulation results (ni = 25) Parameter True MTS JM Value Bias MSE Bias MSE β1 11 0 5 0 4 β2 80 6 12 11 23 β3 5 1 5 1 2 β4 4 1 8 3 6 β5 1 7 15 2 7 γ1 -1 46 71 25 40 γ2 1 64 83 48 57 γ3 -1 61 73 21 35 γ4 1 49 59 35 44 50 Bias 7 9 0 3 3 35 59 37 40 CC MSE 12 13 3 7 11 49 63 44 45 7.4.3 Comparisons of Methods with Different Number of Patients In practice, we can only obtain information from a limited number of patients. To judge the influence of the number of patients in parameter estimation, in this part, we run simulation with a different number of patients (N = 500). The setting of the other parameters are the same as the case of missing rate 10% in 7.4.1. The simulation results are shown in Table 7.4. Compared with results in Table 7.1, we find that the Bias decreases for all three methods. Particularly in parameters of survival model, the Bias decreases substantially, which shows that a larger size of subjects may help us get a better understanding of the survival process in this case. Missing Rate(%) 10 10 7.4.4 Table 7.4: Simulation results (N = 500) Parameter True MTS JM Value Bias MSE Bias MSE β1 11 0 6 0 6 β2 80 3 12 9 17 β3 5 1 3 2 5 β4 4 2 4 4 7 β5 1 1 12 3 7 γ1 -1 22 44 25 41 γ2 1 24 38 23 41 γ3 -1 24 35 21 38 γ4 1 23 36 25 40 Bias 3 4 0 4 6 23 22 27 25 CC MSE 5 11 3 5 13 35 36 41 34 Comparisons of Methods with A Larger Variance of Response When the variance of response changes the model estimation may change as well. To judge the influence of variance of the response variable, we run simulations with an increased σ = 1. The setting about the other parameters are the same as the case of missing rate 10% in 7.4.1. The simulation results are shown in Table 7.5. Comparing to Table 7.1, we find that the MSE increases substantially both in parameters of the longitudinal model and the survival model by all three methods. The increase of MSE suggests a larger sampling variability of response, so the uncertainty of model estimation tends to be larger. 51 Missing Rate(%) 10 10 7.5 Table 7.5: Simulation results (σ = 1) Parameter True MTS JM Value Bias MSE Bias MSE β1 11 7 12 4 16 β2 80 13 28 3 39 β3 5 3 6 12 25 β4 4 3 10 8 25 β5 1 3 29 7 48 γ1 -1 36 55 57 64 γ2 1 45 62 87 91 γ3 -1 42 59 81 89 γ4 1 30 42 78 81 Bias 3 0 13 9 12 25 77 93 78 CC MSE 9 17 24 28 38 42 78 96 81 Conclusion By the previous simulation study, we find that when the missingness is missing completely at random, MTS, JM, and CC tend to give similar bias in model parameter estimation but the estimated variability for the model parameter estimation are different. Besides the uncertainty of random effects, JM accounts for the uncertainty of missing values than MTS and CC, so JM gives a larger MSE in general. For the coefficient of covariate, JM gives a smaller MSE which probably because the consideration of measurement errors in the covariate. All three methods perform relatively good when the overall missing rate is low, or when the number of subjects is large, or when the within subject measurement times is large. Also, all three methods perform better when the response has a relatively small variability. 52 Chapter 8 Conclusion In this thesis, we use a joint model to describe the longitudinal process and the survival process simultaneously. The longitudinal process is characterized by a nonlinear mixed effects model, and the survival process is characterized by a Cox proportional hazards model. We introduce a method based on joint likelihood to estimate parameters in the two models. This method is able to consider timedependent covariate which may be measured with errors and also it is able to account for informative missingness in the response. Due to the intense of likelihood computation, we use a Monte Carlo EM algorithm to get model parameter estimation. Simulation studies are carried out to compare the performance of joint modeling and the existing modified two-step method. Our simulation results suggest the joint modeling method considering informative missingness and measurement errors in time-dependent covariates may give a more reliable results than the results given by the modified two-step method or methods with complete data. By the simulation study, we also find that the rate of missing values, the size of study subjects, the within individual visit times, the variability of response values may affect the model parameter estimation. A real example from a recent HIV study with informative dropouts is analyzed by the joint model method and the two step methods. By the analysis result, we 53 find in this dataset, the CD4 cell count seems not significantly affecting the second phrase viral load decay rate. Also, the individual characteristics which are represented by random effects may not be associated with the survival time. However, the first period decay rate estimated by the joint model considering informative missing is quite larger than that by other methods. This may suggest that simply ignoring or discarding missing information may underestimate the first phrase viral load decay rate. One point needed to address is that in the joint model, we only include random effects as covariates in the missingness model for simplicity. It may be possible that CD4 cell count or other variable is associated with the missingness. Hence, in future research, we may consider a more complex missingness model; however, we should pay attention to the computational issue at the same time since a more complex model may lead a failure of model identification. 54 References Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. London: Chapman & Hall Davidian, M. and Giltinan, D. M. (1995). Nonlinear Models for Repeated Measurements Data. Chapman & Hall. DeGruttola, V. and Tu, X.M. (1994). Modeling Progression of CD4-Lymphocyte Count and Its Relationship to Survival Time. Biometrics 50, 1003-1014. Ding, A. and Wu, H. (2001). Assessing antiviral potency of anti-HIV therapies in vivo by comparing viral decay rates in viral dynamic models. Biostatistics 2(1), 13 - 29. Fort, G. and Moulines, E. (2003). Convergence of the Monte-Carlo EM for curved exponential families. Annals of Statistics 31(4), 1220-1259. Follmann, D. and Wu, M. (1995). An approximate generalized linear model with random effects for informative missing data. Biometrics 51, 151-168. Gelfand, A.E. and Smith, A.F.M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398-409. Gilks, W.R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41, 337-348. Guo, X. and Carlin, B.P. Separate and Joint Modeling of Longitudinal and Event Time Data Using Standard Computer Packages (2004). The American Statistician 58, 1-9. 55 Henderson, R., Diggle, P. J. & Dobson, A. (2002). Joint modeling of longitudinal measurements and event time data. Biostatistics 1, 465-480. Lawless, J.F. (2003). Statistical Models and Methods for Lifetime Data, 2nd edition, Wiley Series in Probability and Statistics, Wiley-Interscience. Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association 90, 1112-1121. Shah, A., Laird, N., and Schoenfeld, D. (1997). A random-effects model for multiple characteristics with possibly missing data. Journal of the American Statistical Association 92, 775-779. Ten Have, T. R., Pulkstenis, E., Kunselman, A., and Landis, J. R. (1998). Mixed effects logistics regression models for longitudinal binary response data with informative dropout. Biometrics 54, 367-383. Tsiatis, A.A. and Davidian, M. (2004) An overview of joint modeling of longitudinal and time-to-event data. Statistica Sinica 14, 793-818. Wei, G. C. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithm. Journal of the American Statistical Association, 85, 699-704. Wu, H. and Ding, A. (1999). Population HIV-1 dynamics in vivo: application models and inferential tools for virological data from AIDS clinical trials. Biometrics, 55, 410-418. Wu, H. (2005). Statistical Methods for HIV Dynamic Studies in AIDS Clinical Trials. Statistical Methods in Medical Research, to appear. Wu, L. (2002). A joint model for nonlinear mixed-effects models with censoring and covariates measured with error, with application to AIDS studies. Journal of the American Statistical Association, 97, 955-964. Wu, L., Hu, J. and Wu, H. (2008). Joint inference for nonlinear mixed-effects models and time-to-event at the presence of missing data. Biostatistics, 9, 308-320. 56 Wu, L. (2009). Mixed effects models for the complex data. Chapman & Hall/CRC. Wu, M. C. and Carroll, R. J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics 55, 410-418. 19. Wulfsohn, M.S. and Tsiatis, A.A. (1997). A Joint Model for Survival and Longitudinal Data Measured with Error. Biometrics 53, 330-339. 20. 57
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Joint inference for longitudinal and survival data...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Joint inference for longitudinal and survival data with incomplete time-dependent covariates Wang, Xu 2010
pdf
Page Metadata
Item Metadata
Title | Joint inference for longitudinal and survival data with incomplete time-dependent covariates |
Creator |
Wang, Xu |
Publisher | University of British Columbia |
Date Issued | 2010 |
Description | In many longitudinal studies, individual characteristics associated with their repeated measures may be covariates for the time to an event of interest. Thus, it is desirable to model both the survival process and the longitudinal process together. Statistical analysis may be complicated with missing data or measurement errors in the time-dependent covariates. This thesis considers a nonlinear mixed-effects model for the longitudinal process and the Cox proportional hazards model for the survival process. We provide a method based on the joint likelihood for nonignorable missing data, and we extend the method to the case of time-dependent covariates. We adapt a Monte Carlo EM algorithm to estimate the model parameters. We compare the method with the existing two-step method with some interesting findings. A real example from a recent HIV study is used as an illustration. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-08-27 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0071204 |
URI | http://hdl.handle.net/2429/27842 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2010-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2010_fall_wang_xu.pdf [ 306.67kB ]
- Metadata
- JSON: 24-1.0071204.json
- JSON-LD: 24-1.0071204-ld.json
- RDF/XML (Pretty): 24-1.0071204-rdf.xml
- RDF/JSON: 24-1.0071204-rdf.json
- Turtle: 24-1.0071204-turtle.txt
- N-Triples: 24-1.0071204-rdf-ntriples.txt
- Original Record: 24-1.0071204-source.json
- Full Text
- 24-1.0071204-fulltext.txt
- Citation
- 24-1.0071204.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0071204/manifest