Imperfect variables: The combined problem of missing data and mismeasured variables with application to generalized linear models

by

Michael David Regier

B.AR., Ambrose University College, 1997
B.Sc., University of the Fraser Valley, 2003
M.Sc., The University of British Columbia, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate Studies
(Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
July 2009
© Michael David Regier 2009

Abstract

Observational studies predicated on the secondary use of information from administrative and health databases often encounter the problem of missing and mismeasured data. Although there is much methodological literature pertaining to each problem in isolation, there is a scant body of literature addressing both problems in tandem. I investigate the effect of missing and mismeasured covariates on parameter estimation from a binary logistic regression model and propose a likelihood-based method to adjust for the combined data deficiencies. Two simulation studies are used to understand the effect of data imperfection on parameter estimation and to evaluate the utility of a likelihood-based adjustment.

When missing and mismeasured data occurred for separate covariates, I found that the parameter estimate associated with the mismeasured portion was biased and that the parameter estimate for the missing data aspect may be biased under both missing at random and non-ignorable missing at random assumptions. A Monte Carlo Expectation-Maximization adjustment reduced the magnitude of the bias, but a trade-off was observed. Bias reduction for the mismeasured covariate was achieved by increasing the bias associated with the others.

When both problems affected a single covariate, the parameter estimate for the imperfect covariate was biased. Additionally, the parameter estimates for the other covariates were also biased.
The Monte Carlo Expectation-Maximization adjustment often corrected the bias, but the bias trade-off amongst the covariates was observed. For both simulation studies, I observed a potential dissimilarity across missing data mechanisms.

A substantive data set was investigated and by using the second simulation study, which was structurally similar, I could provide reasonable conclusions about the nature of the estimates. Also, I could suggest avenues of research which would potentially minimize expenditures for additional high quality data.

I conclude that the problem of imperfection may be addressed through standard statistical methodology, but that the known effects of missing data or measurement error may not manifest as expected when more general data imperfections are considered.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Literature review: identifying the gap
  1.2 Thesis goals and structure
  1.3 Motivating substantive problem
    1.3.1 Background to the substantive problem
    1.3.2 Study population
    1.3.3 The three routinely collected databases
  1.4 Summary
2 A synthetic methodology for imperfect variables
  2.1 Introductory example: a variable which suffers from both missing data and mismeasurement
  2.2 Nomenclature
  2.3 Imperfection indicator
  2.4 The synthetic notation
  2.5 The model
    2.5.1 Likelihood
    2.5.2 Covariate model
    2.5.3 The imperfection model
    2.5.4 Surrogate versus proxy: nondifferential and differential error
    2.5.5 Response model
    2.5.6 Mechanism of imperfection
    2.5.7 Missing data
    2.5.8 Missing data and mismeasurement as special cases
3 Maximum likelihood estimation via the Expectation-Maximization algorithm
  3.1 Brief history of the EM algorithm
  3.2 Rationale for and critiques against the use of the EM algorithm
  3.3 Formulation of the EM algorithm
    3.3.1 EM algorithm
    3.3.2 EM algorithm and imperfect variables
  3.4 Implementing the Expectation-Maximization algorithm: Monte Carlo Expectation-Maximization
    3.4.1 Monte Carlo integration
    3.4.2 Monte Carlo EM: the expectation step
    3.4.3 Gibbs sampler
    3.4.4 Maximization
  3.5 Louis standard errors
    3.5.1 Monte Carlo with Louis standard error for imperfect variables
4 Simulation studies
  4.1 Global aims of the simulation studies
  4.2 General structural elements and procedure of the simulation studies
  4.3 Evaluative measures
    4.3.1 Measures of bias and accuracy
    4.3.2 Confidence interval based measures
    4.3.3 Robust measures
  4.4 Simulation study 1: binary logistic regression with two imperfect explanatory variables
    4.4.1 Likelihood model for simulation study 1
    4.4.2 Simulation specific details for the Monte Carlo EM algorithm
    4.4.3 Results and discussions
    4.4.4 Simulation study 1 discussion
  4.5 Simulation study 2: a random variable which suffers from both imperfections
    4.5.1 Likelihood model for simulation study 2
    4.5.2 Simulation specific details for the Monte Carlo EM algorithm
    4.5.3 Results and discussions
    4.5.4 Simulation study 2 discussion
5 Does income affect the location of death for males, living in Aboriginal communities, who died of prostate cancer?
  5.1 Results and discussion
  5.2 Conclusions
6 Conclusion
  6.1 Future work
Bibliography
Appendices
  A British Columbia Vital Statistics location of death

List of Tables

1.1 Some results for single and multiple keyword searches using the Web of Science index
2.1 An example of how the imperfect variable framework allows for more complicated relationships in a data set. Here, there is a single variable which suffers from both missing data and mismeasurement. In this example, mismeasurement is not uniform for all realizations
2.2 Both missing data and mismeasured data have something missing and something observed. The similar features of both types of data are used as the foundation for the concept of an experimentally observable random variable and the target random variable
2.3 Imperfect data patterns as specified by the generalized imperfect data indicator
2.4 Types of data problems which can be modelled using the imperfect variable framework
4.1 Notation and combinations for the missing data mechanism that generates the data and the missing data mechanism assumed by the fitted model. The missing data mechanism takes the form of a binary logistic regression
4.2 Coefficient values for different missing data mechanisms used to generate simulated data sets.
The systematic component is specified through the coefficient vector $\gamma = (\gamma_0, \gamma_1, \gamma_2, \gamma_3, \ldots)^T$
4.3 Comparing the effect of the missing data mechanism (MAR, NMAR) on the point estimate, bias, and mean squared error when the mechanism that generates the missing data and the missing data mechanism in the model are matched and mismatched. Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$, case 1 and case 2 only
4.4 Comparing the effect of the missing data mechanism (MAR, NMAR) on the point estimate, bias, and mean squared error when the mechanism that generates the missing data and the missing data mechanism in the model are matched and mismatched. Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$, case 3 and case 4 only
4.5 Comparing the effect of the missing data mechanism (MAR, NMAR) on confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data are matched and mismatched. Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$
4.6 Comparing the effect of the missing data mechanism (MAR, NMAR) on the median and difference between the median and true value ($\delta$). Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$
4.7 Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data is matched. Simulation sizes for case 1 (MAR-MAR) are 153, 168, 167 for mechanism A, B, and C respectively
4.8 Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data is matched.
Simulation sizes for case 4 (NMAR-NMAR) are 139, 163, 168 for mechanism A, B, and C respectively
4.9 Confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data are matched. Simulation sizes for case 1 (MAR-MAR) are 153, 168, 167 for mechanism A, B, and C respectively; for case 4 (NMAR-NMAR), they are 139, 163, 168 respectively
4.10 Comparing the effect of the missing data mechanism (MAR, NMAR) on the median and difference between the median and true value ($\delta$) when the mechanism that generated the missing data and the mechanism assumed by the model are matched. Simulation sizes for case 1 (MAR-MAR) are 153, 168, 167 for mechanism A, B, and C respectively; for case 4 (NMAR-NMAR), they are 139, 163, 168 respectively
4.11 Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 1: MAR-MAR. Missing data mechanism B was used for this comparison
4.12 Comparing the effect of sample size on confidence interval length and coverage for case 1: MAR-MAR. Missing data mechanism B was used for this comparison
4.13 Comparing the effect of sample size on the median and difference between the median and true value ($\delta$) for case 1: MAR-MAR. Missing data mechanism B was used for this comparison
4.14 Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 4: NMAR-NMAR. Missing data mechanism B was used for this comparison
4.15 Comparing the effect of sample size on confidence interval length and coverage for case 4: NMAR-NMAR. Missing data mechanism B was used for this comparison
4.16 Comparing the effect of sample size on the median and difference between the median and true value ($\delta$) for case 4: NMAR-NMAR.
Missing data mechanism B was used for this comparison
4.17 Study design for exploring the effect of $\tau$
4.18 Comparing how agreement and disagreement on the specification of $\tau$ when the true value of $\tau = 1.0$ (Table 4.17) affects point estimation, bias, mean squared error, and relative efficiency when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR
4.19 Comparing how agreement and disagreement on the specification of $\tau$ when the true value of $\tau = 0.5$ (Table 4.17) affects point estimation, bias, mean squared error, and relative efficiency when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR
4.20 Comparing confidence interval length and coverage for the four patterns of $\tau$ as given in Table 4.17 when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR
4.21 Comparing the effect of sample size on the median and difference between the median and true value ($\delta$) for the four patterns of $\tau$ as given in Table 4.17 when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR
4.22 Coefficient values for different missing data mechanisms used to generate simulated data sets. The systematic component is specified through the coefficient vector $\gamma$
4.23 Comparing the effect of the missing data mechanism when the missing data was generated with a MAR mechanism on the point estimate, bias, and mean squared error when a covariate has both missing data and mismeasured data. The mechanism that generates the missing data and the missing data mechanism used to model the data is matched and mismatched.
Mechanism A was used to generate the missing data for $x$
4.24 Comparing the effect of the missing data mechanism when the missing data was generated with a NMAR mechanism on the point estimate, bias, and mean squared error when a covariate has both missing data and mismeasured data. The mechanism that generates the missing data and the missing data mechanism used to model the data is matched and mismatched. Mechanism A was used to generate the missing data for $x$
4.25 Comparing the effect of the missing data mechanism when the missing data was generated with a MAR mechanism on confidence interval length and coverage where a single covariate suffers from both missing data and measurement error. Both matched and mismatched missing data mechanisms are considered
4.26 Comparing the effect of the missing data mechanism when the missing data was generated with a NMAR mechanism on confidence interval length and coverage where a single covariate suffers from both missing data and measurement error.
Both matched and mismatched missing data mechanisms are considered
4.27 Comparing the effect of the missing data mechanism on the median and difference between the median and true value ($\delta$) where a single covariate suffers from both missing data and measurement error
4.28 Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data is matched (MAR-MAR) for a covariate with both missing and mismeasured data
4.29 Confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data is matched (MAR-MAR) for a covariate with both missing and mismeasured data
4.30 Comparing the effect of the missing data mechanism on the median and difference between the median and the true value ($\delta$) when the mechanism that generated the missing data and mechanism assumed by the model are matched (MAR-MAR) for a covariate with both missing and mismeasured data
4.31 Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data is matched (NMAR-NMAR) for a covariate with both missing and mismeasured data
4.32 Confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data is matched (NMAR-NMAR) for a covariate with both missing and mismeasured data
4.33 Comparing the effect of the missing data mechanism on the median and difference between the median and the true value ($\delta$) when the mechanism that generated the missing data and mechanism assumed by the model are matched (NMAR-NMAR) for a covariate with both missing and mismeasured data
4.34 Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 1 (MAR-MAR) when a covariate has both missing data and mismeasurement.
Missing data mechanism A was used for this comparison
4.35 Comparing the effect of sample size on confidence interval length and coverage for case 1 (MAR-MAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison
4.36 Comparing the effect of sample size on the median and difference between the median and true value ($\delta$) for case 1 (MAR-MAR) where a single covariate suffers from both missing data and measurement error. Missing data mechanism A was used for this comparison
4.37 Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 4 (NMAR-NMAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison
4.38 Comparing the effect of sample size on confidence interval length and coverage for case 4 (NMAR-NMAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison
4.39 Comparing the effect of sample size on the median and difference between the median and true value ($\delta$) for case 4 (NMAR-NMAR) where a single covariate suffers from both missing data and measurement error. Missing data mechanism A was used for this comparison
4.40 Comparing the effect of $\tau$ on the point estimate, bias, and mean squared error when a covariate has both missing data and mismeasured data and when the missing data mechanism is NMAR. Mechanism A was used to generate the missing data for $x$; 200 simulations were run
4.41 Comparing the effect of $\tau$ on confidence interval length and coverage where a single covariate suffers from both missing data and measurement error and when the missing data mechanism is NMAR.
Mechanism A was used to generate the missing data for $x$; 200 simulations were run
4.42 Comparing the effect of $\tau$ on the median and difference between the median and true value ($\delta$) where a single covariate suffers from both missing data and measurement error and when the missing data mechanism is NMAR. Mechanism A was used to generate the missing data for $x$; 200 simulations were run
5.1 Univariate summary of the applied data set. The overall column has the counts, proportions, means and standard deviation for all 215 subjects. The hospital and place of residence columns have the rates, proportions, means and standard deviations for the subjects contingent on being in the column category for the location of death
5.2 Naive complete-cases results
5.3 Parameter estimates, standard error, z-score, and associated p-value for the MCEM methodology assuming that the missing data mechanism is MAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$
5.4 Point estimate, 95% confidence interval, and confidence interval length on the odds ratio scale for the MCEM methodology assuming that the missing data mechanism is MAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$
5.5 Parameter estimates, standard error, z-score, and associated p-value for the MCEM methodology assuming that the missing data mechanism is NMAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$
5.6 Point estimate, 95% confidence interval, and confidence interval length on the odds ratio scale for the MCEM methodology assuming that the missing data mechanism is NMAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$
B.1 Log-concave properties for probability distributions found in this document [35]

Acknowledgements

Firstly, I would like to thank my supervisor, Dr. Paul Gustafson, for a passing comment about missing data and measurement error which eventually turned into this thesis. I am grateful for his patience through this process and for the numerous points at which his insights have directed the research and influenced my particular interests in statistical methodology. Secondly, I extend my appreciation to my supervisory committee, Dr. Lang Wu and Dr. Ying MacNab, for their help and encouragement through the process. Finally, I would like to thank Dr. John Petkau for his advice about my career development and Dr. Rollin Brant for his thought-provoking conversations.

Dedication

To my wife Christine Elizabeth Regier and daughter Elizabeth Anne Violet Regier, for their patience and sacrifices.

Chapter 1

Introduction

Mismeasured variables and missing information are common problems encountered in the routine analysis of data. Frequently, the presence of such data deficiencies obfuscates the contextual truth which is sought. For example, assume there is a single predictor which has a direct causal effect on the outcome, but the predictor cannot be accurately measured. The parameter estimates of a simple linear model with a mismeasured covariate will be attenuated if the statistical model does not account for the data deficiency.
In this simple situation, neglecting mismeasurement and assuming the data is perfect may result in erroneously discounting the effect under investigation. In a similar manner, neglecting missing data may result in erroneous conclusions. If some of the observations in a data set are missing and the data is analysed without properly accounting for this deficiency, then the parameter estimates may be biased and suffer from reduced efficiency. In both situations, by not accounting for the data deficiencies, biased estimates and erroneous associated measures of variability may result.

Encountering data which is plagued with both missing data and mismeasurement appears to be a common phenomenon with observational data that has been constructed through the linkage of several disparate data sources. Often the motivation for data linkage is the acquisition of a set of covariates which are of primary, or even secondary, interest to substantive researchers. One set of covariates that has recently drawn much attention is cultural covariates. Typically, these are abstracted from census data to model adjustment variables such as sex, age or socio-economic status [51]. The use of surrogates and proxies for these situations is a common practice, with census data being a common source of surrogates. Census data is routinely collected and is often released freely to the public at an aggregate level. Although this represents a monetary efficiency, it brings its own set of problems.

Aggregate census data derived from a measured attribute of the individual members of the group of interest are “contextual” variables [98]. If conceived as a cultural construct, it is feasible to view aggregate census data as describing a “context” which affects all the members of the group. Both classical and Berkson errors may be reasonable assumptions for such ambiguous social constructions.
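The attenuation described above, and the distinction between classical and Berkson error, can be illustrated with a small simulation. The following Python sketch is not taken from the thesis: the sample size, the unit variances, the true slope of 1, and the names `ols_slope`, `simulate_slopes` and `tau` (used here as a stand-in for the measurement error standard deviation) are all assumptions chosen for illustration. It fits a naive simple linear regression that ignores the error in the covariate.

```python
import numpy as np


def ols_slope(x, y):
    """Least-squares slope of y on x (model with an intercept)."""
    x_c = x - x.mean()
    return float(np.dot(x_c, y - y.mean()) / np.dot(x_c, x_c))


def simulate_slopes(n=200_000, beta=1.0, sigma_x=1.0, tau=1.0, seed=0):
    """Naive slopes under classical and Berkson error (illustrative values)."""
    rng = np.random.default_rng(seed)

    # Classical error: surrogate = truth + noise independent of the truth.
    x_true = rng.normal(0.0, sigma_x, n)
    y = beta * x_true + rng.normal(0.0, 1.0, n)
    x_classical = x_true + rng.normal(0.0, tau, n)

    # Berkson error: truth = assigned value + noise independent of the
    # assigned value (e.g. a group-level value ascribed to each person).
    x_assigned = rng.normal(0.0, sigma_x, n)
    x_berkson_true = x_assigned + rng.normal(0.0, tau, n)
    y_berkson = beta * x_berkson_true + rng.normal(0.0, 1.0, n)

    return {
        "true": ols_slope(x_true, y),
        "classical": ols_slope(x_classical, y),
        "berkson": ols_slope(x_assigned, y_berkson),
        # Standard attenuation factor for classical error in a linear model.
        "expected_classical": beta * sigma_x**2 / (sigma_x**2 + tau**2),
    }


slopes = simulate_slopes()
```

Under these assumed settings, the naive slope fitted to the classically mismeasured covariate should be close to `expected_classical` (0.5 here, half the true slope), whereas the slope fitted to the assigned value under Berkson error should be close to the true slope of 1. This holds for the linear model sketched here; it is not a statement about the logistic models studied later.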

When using census based measures, the contextual measurement is ascribed to each unit, which is typically a person. The process of ascription is called geocoding. In its simplest form, geocoding uses a geographic measure to link the aggregate surrogate data to a subject [51]. A fundamental concern with geocoding is the accuracy of the aggregate data in revealing the true relationship between the explanatory variable and the outcome [98].

A naive approach is to assume no measurement error and treat the surrogate or proxy as perfectly measured. In such a situation, we may see a conclusion that would sound something like, “cancer patients who live in a South-Asian community are more likely to die in the hospital when compared to a culturally heterogeneous community”. Although this statement is hard to fault, there is a subtle semantic game at play. The inference desired is for the patient, not the community, thus the implication is that people who live in a South-Asian community are more likely to be South-Asian. The danger is to conclude that South-Asians are more likely to die in a hospital when compared to non-South-Asians. An ecological analysis is often performed with the intent of inferring to the individual, and the error of applying group-level conclusions to individuals is referred to as the ecological fallacy [80, 89, 92]. Distilling the ecological fallacy to its core problem reveals that it is a problem of inference across levels of data where the patient is seen as an aggregate unit of size one. The fallacy occurs when inference based on the aggregate data is ascribed to the smaller aggregations.

In contrast to this concern is the individualistic fallacy, which is the notion that individual-level data is sufficient for epidemiological studies [51, 92]. This suggests that the individual-level analysis is able to capture the group dynamics of a health related phenomenon. These are aggregate-level conclusions based upon a generalization from individual-level data [98]. There is much literature concerning these fallacies, which will not be visited in this exposition [34, 51, 72, 80, 92, 98].

The applied example in this thesis utilizes geocoding to associate cancer patients with surrogates of desired cultural attributes, thus we will briefly consider ecological studies. Morgenstern [80] deals with this situation in an oblique manner through the concept of cross-level bias. He partitions the bias resulting from an ecological study into
• aggregation bias, which results from the grouping of individuals, and
• specification bias, which results from the confounding effect of the group itself.
Cross-level bias is the sum of aggregation bias and specification bias and can make the ecological associations appear stronger or weaker than the individual-level associations. It is noted that the two could cancel each other out, resulting in no bias at all. It is suggested that no cross-level bias will occur if the ecological covariates affect the outcome [80]. Observational studies which require the use of surrogates or proxies may seem overly problematic, but Greenland [37] gives balanced insight into the problem: “it is important to remember that the possibility of bias does not demonstrate the presence of bias, and that a conflict between ecological and individual-level estimates does not by itself demonstrate that the ecological estimates are more biased”.

Stripping away the semantics and definitions, we see at the centre of all of this a basic mismeasurement problem. We have a desired inference about a population, predicated on subject level data, but a model which crosses both the unit level and the aggregate level of the data. With the use of secondary data sources we should expect a level of missing data.
For databases, the absence of information may result from poor database management or from using constructed variables which are predicated on information from multiple sources. Poor database management would include such issues as incomplete database design such that fields can be left blank without any notation as to the intent of the blank. The intent of the database itself can result in missing information when the primary users of the data have very different data needs than the substantive researchers who are using the database as a secondary source of information.

A further source of imperfection is the construction of variables. Some information, such as survival time, may be constructed using a variety of data sources. The date of diagnosis may be in the health database, but the date of death may be abstracted from a different source such as vital statistics. Although the probability of missing information may be low for this particular scenario, it illustrates the type of problem that may be encountered. If census data is used, it is possible to link incomplete surrogate information, such as mean income, to the subject.

It is clear that an observational study predicated on the linkage of disparate data sources for research beyond the intent of the databases themselves may present the dual problem of missing data and mismeasurement. It is reasonable to assume that any analysis of this type of data should not be based on standard naive complete-case methods. The obvious result would be measures of variation which would be too optimistic, but what other dangers of such an approach may be encountered? What kind of adjustments are necessary?

1.1 Literature review: identifying the gap

Before we address these questions, we will see if they have been considered in the literature.
Considered independently from one another, there is a wealth of information concerning the problems of missing data and mismeasurement, but the literature in which both problems are considered in tandem is very small. For the identification of peer reviewed articles where both missing data and mismeasurement were integrated into the same model under a unified theoretical framework, we restricted the search to the Thomson’s Institute for Scientific Information Web of Science index. Furthermore, we restricted the search to the following set of keywords: missing data, missing response, missing covariate, measurement error, mismeasurement, misclassification, error in covariate, and errors in variables. Single term searches resulted in large numbers of articles for which any review would extend beyond the parameters of this thesis, so all searches were restricted to combinations of keywords (e.g. missing response AND misclassification). When two keywords were considered, the number of articles was greatly reduced (Table 1.1). The abstracts were read and promising articles were retrieved and reviewed.

Table 1.1: Some results for single and multiple keyword searches using the Web of Science index
Keyword(s): “Measurement error”; “Missing data”; Misclassification; “Error in covariates”; “Errors in variables”; Misclassification AND “missing response”; Misclassification AND “missing covariate”; Misclassification AND “missing data”; “Measurement error” AND “missing response”; “Measurement error” AND “missing covariate”; “Measurement error” AND “missing data”; “Error in covariates” AND “missing data”; “Errors in variables” AND “missing data”
Number of Results (August 2007, September 2007, April 2009): 6,011 4,150 3,670 6,051 4,206 3,701 7,086 5,355 4,383 10 119 0 0 0 0 31 2 8 78 0 0 37 2 8 94 6 3 2 8 77

The results of the literature search fall into four general categories and, since this is not a systematic review, an illustrative approach will be taken.
From the search, peer reviewed articles
1. used measurement error and missing data as synonyms,
2. used proxies for the missing data and then proceeded with a measurement error model,
3. addressed measurement error and missing data separately, or
4. proposed a model which adjusted for both the missing data and mismeasurement problems found in the data.

Category 1 used the terms measurement error and missing data as synonyms. These papers typically identify one in the abstract and the other in the keywords (e.g. missing data in the abstract and measurement error in the keywords). They had a strong application focus and populated both subject area and statistical journals. An example of this is Strickland and Crabtree [94], where patterns of missingness of survey responses among heterogeneous subgroups with family medical practices are investigated. Missingness is the central idea. An EM algorithm is used in conjunction with a hierarchical model. Although the researchers are investigating missing data, measurement error is listed as a keyword.

For category 2, a missing data problem is transformed into a measurement error problem through the use of a proxy or supplemental data. This was done for covariates that were not completely observed and for covariates that were missing (i.e. completely unobserved). The most prevalent manifestation of this approach is for the latter situation. For example, Pepe [84] considers the situation where an entire covariate is missing, thus a proxy is used and a challenging missing data problem is transformed into a simpler and more concrete measurement error problem. Wang and Pepe [107] give another example. Here, the investigators are concerned with estimation based on general estimating equations when true covariate data are missing for all the study subjects, but surrogate or mismeasured covariates are available.
The third category covers those papers that mention both measurement error and missing data, but handle them separately. Hsiao and Wang [42] propose a method to obtain least-squares or generalized least-squares estimators of structural nonlinear errors-in-variables models and then suggest that their proposed method is useful for missing data or covariate proxies.

The final category contains only four papers. Each considers both the problem of missing data and measurement error in tandem. When this research was initiated there was one article concerning partial linear models which remotely touched on the underlying themes to be addressed in this thesis. Liang, Wang and Carroll [63] considered the situation where the response is missing at random and measurement error affects only the predictors. A semiparametric estimator and an empirical likelihood estimator are proposed. Since then, three articles addressing this dual data deficiency have been published; each focuses on longitudinal data. Yi [115] considers the effect of covariate mismeasurement on the estimation of response parameters for longitudinal studies. Missing data is assumed to be missing at random along with a classical measurement error model. An inverse probability weighted generalized estimating equation (IPWGEE) is used with SIMEX. Wang et al. [106] consider the combined problem of missing data and measurement error for longitudinal studies and use a general approach based on expected estimating equations (EEEs) with the asymptotic variance obtained with sandwich estimation. Finally, we have Liu and Wu [71], who consider missing data and measurement error within the context of a longitudinal study with nonlinear mixed-effects models. Non-ignorable missing data in the response and missing data in other time-varying covariates are considered.
1.2 Thesis goals and structure

It is clear that little work has been done on the issue of managing the combined problem of missing data and mismeasurement, but the work done is within a rather sophisticated methodological framework. Although it provides solutions to very real substantive problems and is interesting from a statistical point of view, it only obliquely asks a fundamental question: are missing data and mismeasurement manifestations of a more general problem? In the spirit of Dempster et al. [22], we recognize that much work has been done in the areas of missing data and mismeasurement with much of the work exhibiting advanced statistical methodology, but there has been no formal consideration that these two areas may just be particulars of a more general problem. The primary endeavour of this thesis is to propose a conceptual framework which considers missing data and mismeasurement as manifestations of a more general problem: imperfection. We will consider the structure of experimental or observational data and the role that observation plays in defining the problem of imperfection. Furthermore, we will propose an imperfect variable model which is an integrated missing data-mismeasurement model.

The secondary objective is to develop a likelihood based approach to adjusting for imperfection and to provide a parsimonious analytic framework for adjustment within the context of generalized linear models. Since there is no known literature concerning the adjustment for imperfect variables within the generalized linear model framework, we will explore the effect of imperfection in covariates on parameter estimation for two situations: when missing data and mismeasurement affect separate covariates, and when missing data and mismeasurement affect a single variable.
In order to highlight the substantive relevance of addressing both missing data and mismeasurement within a unified framework, a social-epidemiological example will be used to exhibit the utility of the proposed conceptual framework and methodological approach. One feature of the substantive problem which will be retained is the assumption that no auxiliary information exists for the mismeasured variables.

In a rather unconventional twist, chapter 2 will take a look at the concepts, language and notation. This chapter will begin with an illustrative example to motivate the ensuing metamorphosis of the standard language, notation and concepts used in both the missing data and mismeasurement literature. Since there is no common language that transcends either statistical sub-discipline, the synthesizing of statistical methodologies for the dual problem of missing data and mismeasurement requires nomenclature modification. Within this chapter, a unified model for imperfect variables will be presented and issues such as nondifferential error will be addressed for the proposed imperfect variable. Once the common language and notation have been established, the likelihood model and the components of the model will be discussed. Chapter 3 will focus on how to obtain maximum likelihood type estimates of the parameters and their associated standard errors through the use of the EM algorithm. In this section, the time devoted to clearly delineating the conceptual framework for imperfect variables should become evident. Chapter 4 will present the background and the results for two simulation studies. The first study will consider the problem of having both deficiencies in the same data set, but only one of the two deficiencies can be present in a single covariate. The second simulation study will consider the problem of having both deficiencies in a single covariate. Chapter 5 will consider a data set abstracted from a linked data set.
Most of the information about the motivating social-epidemiological problem will be presented in chapter 1, thus little of the substantive background will be repeated. The final chapter will draw some general conclusions and present some future areas of research.

1.3 Motivating substantive problem

Participation in substantive research permits statisticians an alternate perspective on their discipline. For a statistician, the phrase getting your hands dirty means working with data much like a gardener would work with the soil. Metaphorically digging through a complicated data set often reveals treasures much like a gardener would experience when first acquiring an overgrown and unruly garden. After a period of time a familiarity begins to develop; what may at first have looked like an unimportant specimen soon becomes a valued and prized piece.

A composite data set, one constructed from a collection of disparate sources, often seems like an inherited and unruly garden to a statistician. The variables may be unfamiliar and may not be immediately recognized for their value, but familiarity brings understanding and appreciation. The inherent beauty of the composite data set is that information unattainable from a single source can be constructed from multiple and potentially disparate sources to address complicated substantive questions.

Although we can link data from disparate sources to construct ever-more elaborate databases and data sets, the act of doing so obscures two basic questions: should we perform data linkage and what is the cost of data linkage? The first involves a labyrinth of legal issues spanning ethical concerns about the use of secondary information for scientific research to basic concerns about the protection of an individual’s identity and privacy. As statisticians, we may choose to defer those questions to other experts.
Beyond the obvious costs of time and money, there is a much more subtle question concerning the validity of conclusions based upon routine analysis of composite data sets. This second question is where a statistical perspective is needed. This can be thought of as the cost of untenable inference.

1.3.0.1 The problem of defining and measuring culture

A team of researchers, known as the Cross-Cultural NET, is interested in understanding the impact of culture on health decisions and health outcomes. For this group, culture has been defined as a complex interplay of meanings that represent and shape the individual and collective lives of people. The definition, although satisfying from a sociological perspective, gives little guidance for measuring culture. The intentional omission of meta-narrative language has been used as an opportunity to generalize the concept of culture to include everything from familiar ideas, such as ethnicity and religion, to more novel constructions of culture which include socio-economic status and occupation. From a conceptual point of view, this is a flexible definition of culture which has high utility in an open-ended course of research, but it does not provide a guide for measuring culture. The generality of the definition is one of its strengths because a study specific definition of culture can be constructed conditional on the data collected. The advantage is also its disadvantage; often the aspects of culture which are of most interest, such as self-reported cultural identity, are those not routinely collected in Canadian databases.

One way around this problem is to link a database with census data. In doing this, researchers are able to associate each patient in the defined population with a wider range of cultural constructs, but as previously discussed, it comes with costs beyond those of time and money: a complex database.
1.3.1 Background to the substantive problem

The body of literature concerning the use of end-of-life health services in Canada is small, but when issues of culture are of interest, it becomes extremely sparse. Due to the substantive focus of the research and the sparsity of the literature, much of the investigation into culture has been predicated on the work of a few Canadian researchers: Burge, Lawson, and Johnston [9]. The operating hypothesis is that given a choice, Canadian cancer patients would prefer to die outside of a hospital setting. One avenue of research is to understand if this preference is shared across all Canadian cultural groups. A challenge with the research is that much of the information is predicated on routinely collected data housed in government and health databases. These sources of information can only address the action and not the intent of the patients, thus inference from the acts of the patients to the desires of the patients is assisted with supplementary domestic and international end-of-life research.

It is known that approximately two-thirds of Canadian cancer patients die in the hospital, yet from international literature it is hypothesized that between 50% and 80% of these patients would prefer to die at home [100, 112]. It is suggested that this difference arises from a complicated mechanism that prevents people from dying in their preferred place [88]. Barriers such as socio-economics, culture, provincial and federal policy, geography, and barriers inherent in the health service delivery system may result in a displacement of people from their desired location of death.

Although it is commonly assumed that people would prefer to die in their place of usual residence, for a culturally diverse population, this assumption may be erroneous.
For example, discussions about death and dying within Chinese communities are, in general, considered disharmonious and are avoided due to the belief that such conversations will bring bad luck [29]. The Cross-Cultural NET researchers have posited that this belief affects the location of death and manifests in the decision to not die at home in order to prevent bad luck for the household. This hypothesis stands in contrast to a recent finding that suggests the trend towards hospital deaths among Chinese is more a factor of improved medical resources and changing societal norms due to the changes in the location of the best medical care and natural generational shifts. Furthermore, it is believed that this desire for the dying in Western societies to die at home is related to the quality of care given at the home or in a hospice when compared with hospital care [39]. In the United Kingdom, where there is a different mixture of cultures, cancer patients exhibit an overwhelming preference to die at home or in a hospice [101]. This suggests that under a general definition of culture, disregarding a patient’s culture impedes quality health service delivery.

1.3.2 Study population

The study population used to obtain and construct the composite database is defined as all adult British Columbian residents, age 20 and older, who died in BC, as identified on their death certificate, due to malignant cancer between 1997 and 2003, excluding death certificate only (DCO) diagnoses of cancer. A DCO diagnosis is one made at the time of death and has not been traced back to any prior information which indicated or suggested that the patient may have cancer. Prior information can include a pathology report, or even a notation of an assumption that the patient had cancer but not confirmed through testing. In the latter case, it is assumed that the patient is dying from a non-cancer primary cause which will progress more rapidly than the cancer.
This is the source population on which all study data sets will be predicated.

1.3.3 The three routinely collected databases

Figure 1.1: Database linkages used in the construction of the End-of-life Database and subsequent study data sets. (Diagram boxes: BC Cancer Registry; BC Vital Statistics; Subject Selection; BCR, BCVS linked database of unique patients; Canada Census Dissemination Area Geocode; Canada Census Data; End-of-life Database; Study Data Set.)

Focusing primarily on the location of death, which was readily available, the investigation was based on routinely collected data which resided with various agencies: the British Columbia Cancer Agency, Statistics Canada, and the British Columbia Vital Statistics. Compared to the acquisition of the data, the process of linking the information is inexpensive and straightforward given the appropriate software [17]. Currently there are three sources of data: the British Columbia Cancer Registry (BCCR), the British Columbia Vital Statistics (BCVS), and the Statistics Canada Census (Census or Census data).

The BC Cancer Registry (BCCR) began collecting patient data in 1969. In 1980 the maintenance of the BCCR was transferred to the BC Cancer Agency (BCCA). Since then, the BCCA has collected patient information for the purposes of surveillance and research. It is a population based registry that is routinely linked with BCVS death certificate information and it is assumed that all persons who lived and died in British Columbia and had a diagnosis of cancer (primary, secondary, or tertiary) are included in the BCCR.

The BCCR and BC Vital Statistics are routinely linked for basic information such as the causes of death. Information more specific to the death of a patient is not routinely linked and must be requested from BCVS. The linkage of additional information is done by either the BCVS or the BCCR depending on the type and quantity of information requested.
The location of death is an example of information which would require a special data linkage.

The linked databases of the BCCR and the BCVS contain basic health information. They do not include information such as relationship status (e.g. married, single), socio-economic status (SES), or cultural information. More sophisticated demographic, SES and cultural measures are derived from census data. The linked database comprised of the BCCA, BCVS and census data will be referred to as the End-of-life database (EOLDB); all study data sets are derived from this database (Figure 1.1).

The postal code of usual residence was taken to be the postal code abstracted from the death certificate. If this was unavailable, the postal code was obtained from the BCCR. The postal code was used in conjunction with Statistics Canada’s Postal Code Conversion File program to associate each patient with a dissemination area. This is known as geocoding [11]. The geocode is a seven or eight digit number which reflects the Standard Geographical Classification (SGC) used for census reporting. The hierarchy of the SGC, from largest geographic unit to smallest, is province, census division, census subdivision, and dissemination area. In this hierarchy, the province is divided into census divisions, a census division is divided into census subdivisions, and a census subdivision is divided into dissemination areas. The seven digit geocode is broken down as follows: the first two digits represent the province, the second two represent the census division (CD), and the last three represent the census subdivision (CSD). The eight digit geocode has the province and census division for the first four digits with the last four representing the dissemination area (DA). Census information is abstracted at the DA level and linked to each patient based on the geocode.
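The digit layout of the geocode described above can be illustrated with a short sketch. The Python function below is only an illustration of that layout; the name `parse_geocode` and the example codes are hypothetical and are not taken from the Postal Code Conversion File or from the study data.

```python
from typing import Dict


def parse_geocode(code: str) -> Dict[str, str]:
    """Split a 7- or 8-digit geocode into the components described in the text.

    The code is handled as a string so that leading zeros are preserved.
    """
    if not code.isdigit():
        raise ValueError("geocode must contain digits only")

    # In both layouts the first two digits are the province and
    # the next two are the census division (CD).
    parts = {"province": code[:2], "census_division": code[2:4]}

    if len(code) == 7:
        # Seven digits: the last three are the census subdivision (CSD).
        parts["census_subdivision"] = code[4:]
    elif len(code) == 8:
        # Eight digits: the last four are the dissemination area (DA).
        parts["dissemination_area"] = code[4:]
    else:
        raise ValueError("geocode must have 7 or 8 digits")
    return parts


# Hypothetical codes, used only to show the split.
seven = parse_geocode("5915022")
eight = parse_geocode("59150123")
```

Keeping the components as strings is a deliberate choice in this sketch, since converting them to integers would drop leading zeros such as the one in the census subdivision "022".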
1.3.3.1 British Columbia Cancer Registry

The British Columbia Cancer Registry (BCCR) is the primary source of patient level data. The target population, initially abstracted from the BCCR, contained approximately 80,000 unique patients. The data contained incomplete patient information with imprecisely recorded fields. Immediately, the combined problem of missing information and mismeasurement was evident. Much of the data in the BCCR is categorical. There are many coding options, but in general there is no code to indicate the intentional omission of information. This makes the data more challenging to use. Does a blank mean that the data was obtained but not recorded or does it mean that no data was ever collected and would never be collected? Furthermore, the updating and completion of the BCCR data is done as time permits [113]. These represent only some of the problems encountered with the BCCR data.

1.3.3.2 Canadian Census

Census information is released in a variety of ways and at various aggregations. One useful aggregation is the dissemination area. Following the geographic hierarchy of province, census division, census subdivision and then dissemination area (DA), the DA is the lowest level that is freely released; it represents a small compact area bounded by visible borders (e.g. river, road) and has an average size of 550 people [12]. If there are small numbers of people (less than 40) in a DA the information is not released, which suggests the presence of missing census information [12]. This problem occurs frequently for Indian Reservations, rural and northern locations. An example is the use of mean income rather than the quartiles provided by the Postal Code conversion program [11]. There are many dissemination areas where the population is less than 40 individuals, thus there is non-released income information.
This translates to missing surrogate information which is a combined missing data and measurement error problem. With the applied problem, we will consider a surrogate for which both problems exist within the same covariate.

1.3.3.3 British Columbia Vital Statistics

The data obtained from the death certificate has a high level of completeness. The problem with the death certificate information relates to the outcome measure (Appendix A). With secondary tables which contain the facility codes for various health units and facilities, it is possible to fine tune these codes to abstract more specific locations of death, but there is no information to identify hospices or locations of death inside the hospital (e.g. ICU) which are of greater interest to the Cross-Cultural NET researchers. With effort, these aggregate groups can be broken down into more specific categories, but for some locations it is currently impossible. The inability to identify the location of death in a hospice or in a specific location within a hospital represents a fundamental problem and a misclassification for the location of death. It will be assumed that the outcome is accurately measured and this will be reflected in the proposed model.

1.4 Summary

Observational epidemiological research utilizes census based data in order to supplement health information with cultural information that may be difficult to obtain. The process of geocoding associates each patient with a geographic region which allows for census based information to be associated with the patient. In epidemiological terms, this is a cross-level analysis and may suffer from the ecological fallacy. The heart of this problem is one of mismeasurement since an aggregate surrogate is being used, but when the surrogate is derived from the census, missing information may occur. The problems which arise with having both problems present in a data set are, in general, unknown.
Even more opaque is the effect that this dual problem has on the parameter estimates. It is assumed that they will be biased and have problems with accurate measures of efficiency, but the directions and trends of these problems are not well documented. Furthermore, there exists no known literature addressing the dual problem as manifest in a single covariate. In response to these gaps of knowledge, a uniform perspective will be proposed in which both missing data and mismeasurement will be conceived of as particulars of a more general problem of data deficiency called imperfection.

Two simulation studies will be executed in order to investigate the effect of imperfection on parameter estimation. A likelihood based approach to adjust for imperfection will be explored. An applied data set will be abstracted from the population of all adult British Columbian residents, age 20 and older, who died in BC, as identified on their death certificate, due to malignant cancer between 1997 and 2003, excluding death certificate only (DCO) diagnoses of cancer.

Chapter 2

A synthetic methodology for imperfect variables

Frequently, missing data and mismeasurement are treated in isolation from one another. In this chapter, a unified approach will be proposed for the general problem of data imperfection. Since there is substantial work done in both the areas of missing data and measurement error, it would be prudent to use the established methodological approaches and terminology as foundational to the generalization process. To this end, the primary methodological goal becomes the integration of the two areas under a unified perspective.

2.1 Introductory example: a variable which suffers from both missing data and mismeasurement

To begin, we will consider a constructed example to highlight important features, nomenclature, and notation for integrating missing data and mismeasurement into a general framework: imperfection.
Table 2.1 is an example of how the complexity of imperfections can be accounted for within the single framework of imperfect variables. In this example, there is a single variable which suffers from both missing data and mismeasurement. Assume, for this example, that realizations are mismeasured once they are above a certain value, hence subjects 1 and 4 suffer from mismeasurement. Furthermore, above a certain value, the realizations are unobservable, as with subject 3. Finally, below a certain value, the realizations are observable and they are accurately measured, as with subjects 2 and 5. The realizations are being generated from a single process, but there is a multiplicity of data problems.

Table 2.1: An example of how the imperfect variable framework allows for more complicated relationships in a data set. Here, there is a single variable which suffers from both missing data and mismeasurement. In this example, mismeasurement is not uniform for all realizations.

Subject   Target: $x$   Observed: $x^E$   Imperfection       Relationship between $x$ and $x^E$   Synthetic notation
1         6.5           4.3               Mismeasurement     $x^E = f(x) + \epsilon$              $x^E = x_{obs}$
2         1.3           1.3               None (Perfect)     $x^E = x$
3         9.8           -                 Both               $x^E = f(x) + \epsilon$              $x^E = x_{miss}$
4         8.2           7.6               Mismeasurement     $x^E = f(x) + \epsilon$              $x^E = x_{obs}$
5         2.2           2.2               None (Perfect)     $x^E = x$

In this example, $x^E$ is the experimentally observable realization and it is always observed. It is what a researcher would observe and record, the third column in Table 2.1. The observed realization is not necessarily a realization from the target, $X$, so we need to determine the relationship between what is observed, $x^E$, and the target which is what we want to observe, $x$. Imperfect variables occur with subjects 1, 3, and 4. For subjects 1 and 4, the problem is restricted to measurement error. The realization $x^E$ is observed, but it is known to be functionally related to the target, thus $x^E = f(x) + \epsilon$. Additionally, the realization was observed, so $x^E = x_{obs}$.
For subject 3, we know a priori that it is mismeasured, but we are unable to experimentally observe it. In this case we have $x^E = x_{miss}$ and we know that if it was observed then $x^E = f(x) + \epsilon$.

The introduction of an experimentally observable random variable lays the foundation for the development of both the nomenclature and the notation. Although this may appear to add a level of redundancy, it will help to reduce the notational complexity for specifying a model and assist in developing a model which relates the observed and unobserved data.

2.2 Nomenclature

When missing data and mismeasurement methodologies are brought together, the foundational theses do not cleanly align. In the case of mismeasurement, there are situations where a mismeasurement problem can be re-cast as a missing data problem. The counter-point is that the entire body of mismeasurement literature cannot be recast as a missing data problem and in the process be relegated to the status of a particular manifestation of missing data. If this were the case, missing data would then be a more general version of mismeasurement.

It is the process of integrating the fundamental concepts and notation of missing data and mismeasurement which necessitated the evolution of the notation and the supporting concepts. Although imperfect random variables are of primary interest, we must first consider the language used to describe such random variables. In the example, realizations from the random variable $X$ were sought, but not always obtained.

Table 2.2: Both missing data and mismeasured data have something missing and something observed. The similar features of both types of data are used as the foundation for the concept of an experimentally observable random variable and the target random variable.

                  Unobserved     Observed
Missing Data      ($X_{miss}$,   $X_{obs}$)
Mismeasurement    ($X$,          $X^*$)
Imperfect Data    ($X$,          $X^E$)
                  Target         Experimental
These were called the desired or the target realizations because in the experimental process, the variable under investigation or the target of the investigation is $X$, from which a researcher desires realizations.

Recognizing that it is possible to not observe realizations from the target random variable leads us to make a distinction between what is observed and what is not observed. In a perfect experiment, we would observe all realizations from the target random variable $X$, but this is not always possible. When this occurs there are realizations from $X$ for which observation is not possible, but something is observed whether it is a mismeasured version of the target random variable or a placeholder denoting the absence of an observation. Recognizing these two features of data collection assists us in modifying our usage of observation.

For each subject in an experiment, something is observed. We denote the experimentally observable “something” as $X^E$ for the random variable generating the realization and $x^E$ for the realization itself. This is an extension of the mismeasurement use of the notation where $x^*$ denotes the directly observable realization from $X^*$ which is functionally related to the desired random variable $X$ (Table 2.2). This extension facilitates a modification of how we perceive the random variable associated with a trial. From the example it was implicitly suggested that for each trial there are two associated random variables: the target random variable $X$ and the experimentally observed random variable $X^E$. Rather than having the random variable $X$ being associated with each trial or subject, a random vector $(X, X^E)$ is associated.

The random vector $(X, X^E)$ associated with each trial of an experiment has a particular structure which should be recognized. For each trial, there is a desired realization from the target random variable $X$. Unfortunately, experimental realizations of the target can be frustrated, thus we observe a version $x^E$ from $X^E$.
We assume that there is a relationship between $X$ and $X^E$ and, predicated on this assumption, we use $X^E$ to help us understand how $X$ affects some outcome of interest. Underneath the vector $(X, X^E)$ there is a structure which links the two and will provide a coherent means by which information from the experimentally observable random variable can be used to understand the relationship between the target random variable and the outcome of interest.

Finally, we will clarify the usage of direct and indirect observation. Direct observation of the random variable $X$ suggests that we not only observe realizations from $X$, but that these observations are themselves sufficiently accurate with respect to the precision of measurement in the case of quantitative measures and precision of the language with respect to qualitative measures. With indirect observation, we observe realizations from the target random variable, but the tools by which the observations are made are sufficiently blunt that the observations are not sufficiently accurate, an example of which would be rounding error. In this case, we will treat the measured value as being mismeasured for it is unable to provide accurate measures of the target random variable. Realizations are not from $X$, but from $X^E$ which is related to the target.

2.3 Imperfection indicator

For each target variable associate a two dimensional indicator variable, $R = (R^I, R^M)$. The indicator $R^I$ corresponds to the missing data aspect and $R^M$ is the indicator for mismeasurement. The random vector $R$ has an associated joint probability distribution, $p(R^I, R^M)$. The imperfection indicator is not associated with $X^E$ because $X^E$ is what is observed; the imperfection is a problem associated with $X$ and manifests through $X^E$.

Table 2.3: Imperfect data patterns as specified by the generalized imperfect data indicator.

$r^I$   $r^M$   Generalized Imperfect Data Problem
1       1       Perfect Data
0       1       Missing Data
1       0       Mismeasurement
0       0       Missing Data and Mismeasurement
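The correspondence in Table 2.3 can be sketched as a simple lookup. This is an illustrative sketch only: the names `PATTERNS` and `imperfection_pattern` are hypothetical, and the indicator values assigned to the five subjects of Table 2.1 are an assumption inferred from the imperfection column of that table rather than values stated in the text.

```python
from typing import Dict, Tuple

# Keys are (r_I, r_M), following Table 2.3:
# r_I = 1 when the target is observed, r_M = 1 when it is accurately measured.
PATTERNS: Dict[Tuple[int, int], str] = {
    (1, 1): "Perfect Data",
    (0, 1): "Missing Data",
    (1, 0): "Mismeasurement",
    (0, 0): "Missing Data and Mismeasurement",
}


def imperfection_pattern(r_i: int, r_m: int) -> str:
    """Return the Table 2.3 label for one realization of the indicator."""
    try:
        return PATTERNS[(r_i, r_m)]
    except KeyError:
        raise ValueError("each indicator component must be 0 or 1") from None


# Assumed indicator realizations for subjects 1 to 5 of Table 2.1
# (inferred from the table's imperfection column).
subjects = {1: (1, 0), 2: (1, 1), 3: (0, 0), 4: (1, 0), 5: (1, 1)}
labels = {subject: imperfection_pattern(*r) for subject, r in subjects.items()}
```

Under these assumed values, subjects 2 and 5 map to perfect data, subjects 1 and 4 to mismeasurement, and subject 3 to both problems, which agrees with the imperfection column of Table 2.1.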
In section 2.2, we associated the random vector $(X, X^E)$ with each subject. In the same spirit, we associate the random indicator vector with each variable. For $(X, X^E)$ we associate $R = (R^I, R^M)$, therefore for each realization $(x, x^E)$ we have realizations of the indicator random vector $(r^I, r^M)$; the realized components are defined as

$$r^I = \begin{cases} 1 & \text{if } x \text{ is observed} \\ 0 & \text{if } x \text{ is not observed} \end{cases}$$

and

$$r^M = \begin{cases} 1 & \text{if } x \text{ is accurately measured} \\ 0 & \text{if } x \text{ is mismeasured} \end{cases}$$

which jointly allow for all combinations of imperfection to be specified (Table 2.3).

The realization of the imperfection indicator, $r = (1, 1)$, indicates completely observed and accurate, thus we say that the target random variable $X$ is an imperfect variable if and only if $\Pr(R = (1, 1) \mid \cdot) < 1$. Using one of the conditional expressions of the joint distribution as illustrative, if missing data is probable, then $\Pr(R^M = 1) = 1$ and $\Pr(R^I = 1 \mid r^M = 1) < 1$ (Table 2.4). For mismeasurement, $\Pr(R^M = 1) < 1$ and $\Pr(R^I = 1 \mid r^M = 1) = 1$ (Table 2.4). If the variable suffers from both problems, then the product of the conditionals will be less than one. Finally, if neither problem exists then $\Pr(R^M = 1) = 1$ and $\Pr(R^I = 1 \mid r^M = 1) = 1$. An analogous result holds if the conditioning is done as $\Pr(R^M \mid r^I) \Pr(R^I)$.

Table 2.4: Types of data problems which can be modelled using the imperfect variable framework.

$\Pr(R^I = 1 \mid \cdot)$   $\Pr(R^M = 1 \mid \cdot)$   Imperfection
1                           1                           Perfect Data
[0, 1)                      1                           Missing Data
1                           [0, 1)                      Mismeasurement
[0, 1)                      [0, 1)                      Both

Before moving on to a further synthesis of missing data and mismeasurement ideas, a final observation is that the imperfection indicator $R$ is analogous to Little and Rubin’s [67] proposal of an indicator for missing data $M$, but is modified to include the problem of mismeasurement. To foreshadow the subsequent formulation, it is an easy extension to conceive of multiple covariates each with an associated imperfection indicator vector, $R_{ij}$, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$.
In this situation, we will have the imperfection analog to Little and Rubin's proposed missing data indicator matrix $M$.

2.4 The synthetic notation

With the use of the indicator vector and some further observations, we will be able to complete the notational synthesis. Recall that it was proposed that the random vector $(X, X^E)$ be used rather than just the random variable $X$. The idea behind this conceptual shift is that in an experiment, each trial produces something that can be observed: a realization in the form of a quantity or quality, or a placeholder denoting the absence of the quantity or quality. This is denoted $X^A = (X, X^E)$. It is the augmented version of what is typically associated with an experimental trial, with a realization denoted $x^A = (x, x^E)$. Under the imperfection framework, it is possible to consider a vector of everything that can be observed during the experimental process. The complete set of random variables for each trial of an experiment, or universa, is denoted

$$X^U = (X, X^E, R^I, R^M) = (X^A, R).$$

Alluded to in previous sections is a relationship between $X$ and $X^E$. A missing data mechanism is a probabilistic model that characterizes the pattern of missing data and will be used to provide a relationship between $x^E$ and the potentially unobserved target $x$ when the imperfection involves missing data. For mismeasured data, a mismeasurement model relates $x^E$ and $x$. Auxiliary data may exist which would help to construct a relationship between $X$ and $X^E$, but this auxiliary data is not part of $X^U$ in the sense that it may not be observed concurrently with the other study variables. If there is no auxiliary data, a measurement error model can be assumed.

Now, let us extend these ideas to the situation where there are $n$ independent and identically distributed subjects with more than one covariate. Let $X^U_{ij} = (X_{ij}, X^E_{ij}, R^I_{ij}, R^M_{ij})$ be the random vector for the $i$th trial or subject, $i = 1, \ldots, n$, and the $j$th random variable, $j = 1, \ldots, p$.
For $n$ independent and identically distributed subjects, the design matrix will be an $n \times 3p$ matrix which is related to the underlying conceptual $n \times 4p$ design matrix, which in turn is functionally related to the target $n \times p$ matrix.

Now that we have a working framework for imperfect variables, we can turn our attention to the likelihood. It can be broken into three parts: a covariate model, a response model, and a model for the mechanism of imperfection. Before moving on to the likelihood, two notes should be made. The first is that the missing data literature often uses $z$ to denote missing data. Although there is no direct correspondence between the notation used in the missing data literature and that proposed for imperfect variables, the use of $x^E = x_{miss}$ and $x^E = x^*_{miss}$ is analogous to $z$. The second is that much of the notation has been inspired by the mismeasurement literature. It is noted that the use of $x^*$ is not universal notation for the mismeasured variable and often $w$ is used.

2.5 The model

From notation and underlying concepts, we turn our attention to the models themselves. We will begin by considering the likelihood model and its components.

2.5.1 Likelihood

Suppose that there are $n$ independent and identically distributed subjects with $(X^U_i, Y_i)$ being the random vector for the $i$th subject, where $Y_i$ is the response, $X^U_i$ is the universa random vector and $p(X^U_i, Y_i \mid \Phi)$ is the associated joint probability distribution indexed by the parameter $\Phi$. For notational simplicity, $p(U_1 \mid u_2)$ will be used to denote $p(U_1 = u_1 \mid U_2 = u_2)$, which is the conditional distribution of the random variable $U_1$ given the realization $u_2$ from $U_2$. This notation will be used throughout for both continuous and discrete random variables. The joint distribution, as defined through a product of conditionals, is

$$p(X^U_i, Y_i \mid \Phi) = p(R_i \mid x^A_i, y_i, \gamma)\, p(Y_i \mid x_i, \beta, \phi)\, p(X^A_i \mid \psi) \qquad (2.5.1)$$

where $X_i = (X_{i1}, \ldots, X_{ip})$, $X^E_i = (X^E_{i1}, \ldots, X^E_{ip})$, and $\Phi = (\gamma, \beta, \phi, \psi)^T$. The conditional distribution of the indicator vector given the explanatory variables and response, indexed by $\gamma$, is $p(R_i \mid x^A_i, y_i, \gamma)$ and will be referred to as the mechanism of imperfection model. The conditional distribution of $Y_i$ given the explanatory variables, indexed by the $p \times 1$ vector of regression coefficients $\beta$ and the dispersion parameter $\phi$, is $p(Y_i \mid x_i, \beta, \phi)$ and will be referred to as the response model. Finally, $p(X^A_i \mid \psi)$ is the joint distribution of the explanatory variables, indexed by $\psi$, and will be referred to as the covariate model.

The complete likelihood for the $i$th subject is

$$L_c(\Phi \mid x^U_i, y_i) = p(R_i \mid x^A_i, y_i, \gamma)\, p(Y_i \mid x_i, \beta, \phi)\, p(X^A_i \mid \psi) \qquad (2.5.2)$$

with the complete likelihood as

$$L_c(\Phi \mid x^U, y) = \prod_{i=1}^{n} L_c(\Phi \mid x^U_i, y_i) \qquad (2.5.3)$$

and the complete log-likelihood

$$l_c(\Phi \mid x^U, y) = \sum_{i=1}^{n} \left[ \log p(R_i \mid x^A_i, y_i, \gamma) + \log p(Y_i \mid x_i, \beta, \phi) + \log p(X^A_i \mid \psi) \right]. \qquad (2.5.4)$$

Within the context of missing data, the idea of factoring the joint probability distribution was implicitly suggested by Rubin [90] when he begins to address missing data and the sampling distribution of inference. Little and Schluchter [68] make this approach explicit in their discussion of alternate strategies for specifying the joint probability distribution of the response and explanatory variables. When the outcome is categorical and the explanatory variables are continuous, the factorization $p(X, Y) = p(Y \mid x)\, p(x)$ is recommended when $y$ is binary because it underlies the logistic regression model which, unlike the linear discriminant factorization $p(X \mid y)\, p(Y)$, does not rely on multivariate normal assumptions for the continuous variables. Additionally, the number of parameters is greater for the latter than the number needed to characterize a logistic factorization [91]. The logistic regression factorization approach has subsequently been used by Ibrahim [43], Lipsitz and Ibrahim [64], Lipsitz et al. [65], Ibrahim et al. [44], and Ibrahim and Lipsitz [48].
Furthermore, by including in the likelihood the probability distribution of $R_i$, the reason for missing information, and $p(X^A_i \mid \psi)$, the model for the mismeasurement mechanism, bias from missing information and attenuation should be removed [14, 65].

Equation 2.5.1 is the probability distribution for the $i$th subject and reflects the factorization proposed by Ibrahim and Lipsitz. Additionally, it represents the complete, or perfect, data likelihood. Without loss of generality, assume that the imperfect covariates for which both problems do not co-exist in the same covariate are the first $j$ covariates in the data set and that the imperfect covariates for which both problems coexist within the same covariate are the following $r$ covariates. Then, when data is imperfect, the joint distribution of the observable covariates is

$$p(X^E_i, R_i, Y_i \mid \Phi) = \int_{\mathcal{X} \cup \mathcal{X}^E} p(X^U_i, Y_i \mid \Phi)\, dx_{i1} \cdots dx_{i(j+r)} \qquad (2.5.5)$$

where, for the $i$th subject which suffers from imperfection, $\mathcal{X} \cup \mathcal{X}^E$ is the union of the variable space associated with the target covariates and the variable space associated with the experimentally observable covariates.

With the specification of the likelihood (Equation 2.5.4), we are implicitly suggesting a particular approach for handling the missing data component of imperfection. Little [67] classifies as selection models those models where the likelihood is written as a product of uniquely indexed distributions: the probability distribution for the indicator of missing information conditional on the data, and the probability distribution of the data. Furthermore, this characterization of the relationship between the indicators of missing information and the data which suffers from the imperfection motivates likelihood-based techniques [67].

2.5.2 Covariate model

The covariate model is of particular interest because it has the potential for some combination of missing data and mismeasurement.
A successful integration of the two deficiencies will require ideas from both areas to be woven together: a model which specifies the joint probability distribution of $X^A = (X, X^E)$ and a mapping from $X$ to $X^E$.

Both nonparametric and parametric approaches have been proposed for the specification of the covariate model when explanatory data is missing. baird [57] and Robins [86] proposed nonparametric approaches. Agresti [1] considered a parametric approach and used a joint multinomial log-linear model for discrete covariates. Although the number of nuisance parameters needed to model the distribution could be reduced by retaining only the main effects, Lipsitz and Ibrahim [64] proposed an alternate approach in which they sought to reduce the number of nuisance parameters by using a series of one-dimensional conditional distributions. With this approach, the joint distribution of $(Y, X)$ is broken into a response model, $Y$ conditional on $x$, and a conditional specification of the joint distribution of $X$. This approach has been termed the conditional-conditional approach.

Two immediate benefits are that the conditional-conditional specification of the joint distribution has been shown to approximate a joint log-linear model for the covariates, and that any completely observed covariate does not need to be modelled [48]. The latter is a significant improvement over any method which would require the full specification of the joint probability model. In a discrete covariate situation, the number of cell probabilities to be modelled can grow quickly for a fully saturated model. For continuous covariates, specifying a high-dimensional joint probability distribution may be unrealistic and a joint probability distribution may not naturally present itself.
For example, consider a data set with three covariates: the first is time to event data such as survival time, the second is continuous and resides in the interval [0, 1], and the third appears normally distributed, but has heavy tails. A three dimensional joint probability distribution does not instantly come to mind, but a conditional model using an exponential, a beta, and a t-distribution is readily apparent.

Initially, the conditional-conditional approach was used to reduce the number of nuisance parameters needed to specify the joint distribution of discrete covariates, and it was later extended to continuous covariates [48]. If the problem at hand was only missing data, then the joint distribution for a $p$-dimensional explanatory variable vector $X_i = (X_{i1}, \ldots, X_{ip})$ would be conditionally specified as

$$p(X_{i1}, \ldots, X_{ip} \mid \psi) = p(X_{ip} \mid x_{i1}, \ldots, x_{i(p-1)}, \psi_p) \times \cdots \times p(X_{i2} \mid x_{i1}, \psi_2)\, p(X_{i1} \mid \psi_1)$$

where $\psi = (\psi_1, \ldots, \psi_p)^T$, $\psi_j$ is the indexing vector for the $j$th conditional distribution, and the $\psi_j$ are distinct for all $j$.

A beneficial feature of this approach is that it accommodates the fact that data sets rarely contain only discrete or only continuous covariates. Problems with both discrete and continuous covariates have been called mixed covariate models in that the covariate model is not strictly composed of only discrete or only continuous data. The conditional-conditional approach is general enough to handle these situations [48].

The complete likelihood (Equation 2.5.1) suggests that the conditional-conditional approach may have utility; we have the joint distribution $p(X^A_i, Y_i \mid \beta, \phi, \psi)$, which can be expressed as $p(Y_i \mid x_i, \beta, \phi)\, p(X^A_i \mid \psi)$. The conditional model for the covariate distribution is

$$p(X^A_i \mid \psi) = p(X^A_{ip} \mid x^A_{i1}, \ldots, x^A_{i(p-1)}, \psi_p, \psi^E_p) \times \cdots \times p(X^A_{i2} \mid x^A_{i1}, \psi_2, \psi^E_2)\, p(X^A_{i1} \mid \psi_1, \psi^E_1)$$
$$= p(X^E_{ip} \mid x_{ip}, x^A_{i1}, \ldots, x^A_{i(p-1)}, \psi^E_p)\, p(X_{ip} \mid x^A_{i1}, \ldots, x^A_{i(p-1)}, \psi_p) \times \cdots \times p(X^E_{i1} \mid x_{i1}, \psi^E_1)\, p(X_{i1} \mid \psi_1) \qquad (2.5.6)$$

where $\psi = (\psi_1, \psi^E_1, \ldots, \psi_p, \psi^E_p)^T$, $\psi_j$ is the parameter vector indexing the conditional distribution associated with the target explanatory variable for the $j$th conditional distribution, and $\psi^E_j$ is the parameter vector indexing the conditional distribution associated with the experimentally observable explanatory variable for the $j$th conditional distribution.

Buried within the conditional-conditional specification of the joint distribution $p(X^A_i, Y_i \mid \beta, \phi, \psi)$ is a measurement error model. Consider the $j$th component of the conditional specification of the covariate model,

$$p(X^A_{ij} \mid x^A_{i1}, \ldots, x^A_{i(j-1)}, \psi_j, \psi^E_j).$$

If the imperfection includes mismeasurement then this component is

$$p(X^E_{ij} = x^E_{ij,P} \mid x_{ij}, x^A_{i1}, \ldots, x^A_{i(j-1)}, \psi^E_j)\, p(X_{ij} \mid x^A_{i1}, \ldots, x^A_{i(j-1)}, \psi_j)$$

where $P$ indexes the state of observation, that is $P \in \{obs, miss\}$. By extending the conditional-conditional idea to the joint distribution of $X^A$, it is easy to see that we have come to the likelihood construction for classical measurement error [14]. A Berkson specification can result as well, with the $j$th component of the conditional distribution of the covariates being

$$p(X_{ij} = x_{ij,P} \mid x^E_{ij}, x^A_{i1}, \ldots, x^A_{i(j-1)}, \psi_j)\, p(X^E_{ij} \mid x^A_{i1}, \ldots, x^A_{i(j-1)}, \psi^E_j).$$

An interesting feature of this approach is that a measurement error relationship emerges naturally through the application of a missing data modelling technique.

At first, the conditional-conditional specification of the covariate model may appear excessive, but the complexity of the specification allows for increased flexibility. Additionally, such complex models may be rare. Consider the three cases of imperfection and the one case of perfect covariates. For a perfect covariate, no joint distribution needs to be specified, so all of the associated components in equation 2.5.6 drop out.
For problems with only missing data, the observed realizations and the target are identical and the missing data mechanism is already modelled in the complete likelihood. Mismeasured variables, missing or not, need a specification for a joint distribution and will still need the full complexity of the covariate model.

2.5.3 The imperfection model

Using a measurement error model as a genesis point, a structured relationship between the observable random variable and the target random variable will be constructed. There are two general types of models that can be used to specify the measurement error process:

• Error models, which include the classical measurement error models, where the conditional distribution of $X^*$ given $X$ is modelled, and

• Regression calibration models, which include Berkson error models, where the conditional distribution of $X$ given $X^*$ is modelled.

Given that a classical measurement error model has been specified in equation 2.5.6, we will restrict ourselves to this type of measurement error.

A general version of the classical error model relates the mismeasured variable to the true variable and may be conditional on some set of perfectly measured covariates. In a standard problem of measurement error, the general classical measurement error model for the $i$th subject and the $j$th covariate is

$$X^*_{ij} = \alpha_0 + \alpha_1 X_{ij} + \alpha_2^T X_{i(-j)} + \epsilon_{ij} \qquad (2.5.7)$$

where $\alpha = (\alpha_0, \alpha_1, \alpha_2^T)^T$ is the indexing vector for the measurement error model, $X_{ij}$ is the unobserved target variable, $X_{i(-j)}$ is the set of accurately measured covariates without the $j$th covariate, and $E(\epsilon_{ij} \mid X_{ij}) = 0$. If this was strictly a measurement error problem, then this would be sufficient for specifying the relationship between $X^*_{ij}$ and $X_{ij}$, but it is unsatisfactory for integrating the problems of missing data and measurement error.

Using material from both measurement error and missing data, a more general model can be constructed which relates the observable random variable, $X^E$, to the target random variable, $X$.
The imperfection model must be able to characterize:

• accurate and observable data,

• missing data,

• mismeasured data, and

• data with both problems.

A simple approach is to use $R$ to combine these four into a single relationship. For the $i$th subject and $j$th covariate we have

$$X^E_{ij} = r^M_{ij} \left[ r^I_{ij} X_{ij,obs} + (1 - r^I_{ij}) X_{ij,miss} \right] + (1 - r^M_{ij}) \left[ r^I_{ij} X^*_{ij,obs} + (1 - r^I_{ij}) X^*_{ij,miss} \right] \qquad (2.5.8)$$

where $X^*_{ij} = \alpha_{0j} + \alpha_{1j} X_{ij} + \alpha_{2j}^T X_{i(-j)} + \epsilon_{ij}$, $E(\epsilon_{ij} \mid X_{ij}) = 0$, $X_{i(-j)}$ is the set of covariates with the $j$th removed, and $\alpha_j = (\alpha_{0j}, \alpha_{1j}, \alpha_{2j}^T)^T$ is the indexing vector for the $j$th imperfect covariate. Although this stitching together is a naive approach, it does permit the flexibility required to model a wide range of complex data structures. For example, the $i$th subject may have $x^U_{ij} = (x_{ij,miss}, -, 0, 1)$, thus we have $x^E_{ij} = x_{ij,miss}$. The $k$th subject may have $x^U_{kj} = (x_{kj,obs}, 4.5, 1, 0)$, so $x^E_{kj} = x^*_{kj,obs}$, for which $x^*_{kj,obs} = x_{kj} + \epsilon_{kj}$.

An interesting feature of this model is that a further generalization is possible. Consider the situation where the measurement error does not apply uniformly to all subjects. In this setting, there exist disjoint subsets of subjects for which the measurement error structure is different. With $K$ subsets indexed by $k$, equation 2.5.8 becomes

$$X^E_{ij} = r^M_{ij} \left[ r^I_{ij} X_{ij,obs} + (1 - r^I_{ij}) X_{ij,miss} \right] + (1 - r^M_{ij}) \left[ r^I_{ij} X^*_{ijk,obs} + (1 - r^I_{ij}) X^*_{ijk,miss} \right] \qquad (2.5.9)$$

where $X^*_{ijk} = \alpha_{0jk} + \alpha_{1jk} X_{ij} + \alpha_{2jk}^T X_{i(-j)} + \epsilon_{ijk}$ and $E(\epsilon_{ijk} \mid X_{ij}) = 0$. An obvious benefit of this flexibility is that imperfection models need not be constant across all subjects. The indexing parameter $\alpha_{jk}$ and random error $\epsilon_{ijk}$ permit a different parameterization for different imperfection subsets.

Consider the situation where a measurement tool is imprecise at both the lower end and the upper end of the measurement scale. In such situations, it may not be reasonable to assume that the structure of imprecision is identical at the two extremes.
It is observed that as $K \to n$ the imperfection model becomes a subject level model rather than a group level model. For simplicity, we will restrict ourselves to the case where the imperfection model is identical for all subjects (Equation 2.5.8).

Equation 2.5.8 provides much flexibility, but at this point we will focus on unbiased measurement error. Under this restriction $\alpha_j = (\alpha_{0j} = 0, \alpha_{1j} = 1, \alpha_{2j} = 0)$, which reduces equation 2.5.8 to

$$X^E_{ij} = r^M_{ij} \left[ r^I_{ij} X_{ij,obs} + (1 - r^I_{ij}) X_{ij,miss} \right] + (1 - r^M_{ij}) \left[ r^I_{ij} X^*_{ij,obs} + (1 - r^I_{ij}) X^*_{ij,miss} \right] \qquad (2.5.10)$$

where $X^*_{ij} = X_{ij} + \epsilon_{ij}$.

With the measurement error component of the imperfection model unbiased, the overall bias needs to be determined. The expectation of $X^E_{ij}$ given the target random variable is

$$E(X^E_{ij} \mid X_{ij}, r_{ij}) = E\left( r^M_{ij} \left[ r^I_{ij} X_{ij,obs} + (1 - r^I_{ij}) X_{ij,miss} \right] + (1 - r^M_{ij}) \left[ r^I_{ij} X^*_{ij,obs} + (1 - r^I_{ij}) X^*_{ij,miss} \right] \,\Big|\, X_{ij}, r_{ij} \right)$$

where $E(X_{ij,obs} \mid X_{ij}, r_{ij}) = E(X_{ij,miss} \mid X_{ij}, r_{ij}) = X_{ij}$ and $E(X^*_{ij,obs} \mid X_{ij}, r_{ij}) = E(X^*_{ij,miss} \mid X_{ij}, r_{ij})$, so

$$E(X^E_{ij} \mid X_{ij}, r_{ij}) = r^M_{ij} X_{ij} + (1 - r^M_{ij})\, E(X^*_{ij} \mid X_{ij}, r_{ij})$$
$$= r^M_{ij} X_{ij} + (1 - r^M_{ij})\, E(\alpha_{0j} + \alpha_{1j} X_{ij} + \alpha_{2j}^T X_{i(-j)} + \epsilon_{ij} \mid X_{ij}, r_{ij})$$

and under the assumption of unbiased measurement error and $E(\epsilon_{ij} \mid X_{ij}) = 0$,

$$= r^M_{ij} X_{ij} + (1 - r^M_{ij}) X_{ij} = X_{ij},$$

so the imperfection model, under the assumption that the measurement error is unbiased, is itself unbiased.

A final note concerns the covariance structure of $X$. We will not assume independence amongst the target random variables, so $\mathrm{Cov}(X_{ij}, X_{ik}) \neq 0$, and independence will not be assumed amongst the experimental random variables, $\mathrm{Cov}(X^E_{ij}, X^E_{ik}) \neq 0$. We will assume that the error terms are independent across subjects and across covariates, so the error is only random noise generated without respect to any other mechanism under consideration. Furthermore, we will assume that the experimentally observable random variables are conditionally independent given the target variable and the indication that this is a measurement error problem.
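The unbiasedness argument for the imperfection model (Equation 2.5.8 with $\alpha_0 = 0$, $\alpha_1 = 1$, $\alpha_2 = 0$) can be checked numerically. The following is a minimal sketch under assumptions that are not in the text: the error $\epsilon$ is taken to be Gaussian with standard deviation `sigma`, the term involving the other covariates is dropped ($\alpha_2 = 0$), and the function names `observable_value` and `conditional_mean` are hypothetical.

```python
import random


def observable_value(x, r_i, r_m, eps, alpha0=0.0, alpha1=1.0):
    """Evaluate the stitched imperfection model (Equation 2.5.8) for one value.

    x    : realization of the target variable
    r_i  : 1 if the value is observed, 0 if it is missing
    r_m  : 1 if the value is accurately measured, 0 if it is mismeasured
    eps  : realization of the measurement error
    The alpha_2 term (other covariates) is assumed to be zero in this sketch.
    """
    # x_obs and x_miss label the same target value by its observation state.
    x_obs = x_miss = x
    # Classical error model: X* = alpha0 + alpha1 * X + eps.
    x_star_obs = x_star_miss = alpha0 + alpha1 * x + eps
    accurate = r_i * x_obs + (1 - r_i) * x_miss
    mismeasured = r_i * x_star_obs + (1 - r_i) * x_star_miss
    return r_m * accurate + (1 - r_m) * mismeasured


def conditional_mean(x, r_i, r_m, alpha0=0.0, alpha1=1.0,
                     sigma=1.0, n_draws=50_000, seed=0):
    """Monte Carlo estimate of E(X^E | X = x, r) under Gaussian error (assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        eps = rng.gauss(0.0, sigma)
        total += observable_value(x, r_i, r_m, eps, alpha0, alpha1)
    return total / n_draws
```

With the default `alpha0=0.0` and `alpha1=1.0`, `conditional_mean(2.0, r_i, r_m)` should be close to the target value 2.0 for every indicator pattern, whereas a nonzero `alpha0` shifts the estimate for mismeasured patterns by roughly that amount, which is the biased case excluded in the text.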
2.5.4 Surrogate versus proxy: nondifferential and differential error

Surrogate and proxy are terms commonly used to describe $X^E$, the mismeasured, observed random variable, but there is evidence that these two terms should not be used interchangeably. Carroll [14] suggests that a surrogate is the measured substitute for the desired variable, but that it gives no information beyond that which would have been given by the true variable. In contrast, a proxy yields information beyond that which the true variable would have given. This distinction in terminology reflects the two types of measurement error: nondifferential and differential error.

More technically, the measurement error is said to be nondifferential when $X^E_i$ and $Y_i$ are conditionally independent given $X_i$, so that $p(X^E_i \mid X_i, Y_i) = p(X^E_i \mid X_i)$ [40]. If conditioning includes the indicator of imperfection, then the measurement error is nondifferential when $X^E_i$ and $(Y_i, R_i)$ are conditionally independent given the target $X_i$, so that $p(X^E_i \mid X_i, Y_i, R_i) = p(X^E_i \mid X_i)$.

2.5.5 Response model

The response model, $p(Y_i \mid x_i, \beta, \phi)$, will be developed in two stages. Firstly, we will consider response models that are members of the exponential family. Secondly, we will restrict ourselves to a particular member, the Bernoulli response model.

The exponential family is a rich source of probability models, such as the normal, binomial, and gamma, which has often been exploited for the development of statistical theory. Some authors make a distinction between the types of members and identify single parameter members as members of the exponential family, while two parameter members are members of the exponential dispersion family [27]. For our purposes, we will focus on single parameter members, thus future references to the exponential family will be within this context.
Assuming that the $i$th subject has a response from the exponential family conditional on the indexing parameters $\theta$ and $\phi$, the response $Y_i$ is distributed as

$$p(Y_i = y_i \mid x_i, \theta, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\} \qquad (2.5.11)$$

indexed by the canonical parameter $\theta$, which represents location, and the dispersion parameter $\phi$, which represents scale. Members of the exponential family are defined through the specification of the functions $a$, $b$ and $c$. For example, the binomial distribution has $a(\phi) = 1$, $b(\theta) = -m \log(1 - \pi)$, where $\pi$ indexes the binomial distribution and $\theta = \log\left(\frac{\pi}{1 - \pi}\right)$, and $c(y, \phi) = \log \binom{m}{y}$.

The form for the expectation and variance of the exponential family is well known [73]. The expectation of an exponential family random variable $Y_i$ is $E(Y_i) = b'(\theta)$, a function of $\theta$. The variance is a product of the location and the scale, $\mathrm{Var}(Y_i) = b''(\theta)\, a(\phi)$. For the exponential family, $b''(\theta)$ is called the variance function and describes how the variance of $Y_i$ is related to the mean. The function $a(\phi)$ is commonly of the form $a(\phi) = \phi / w$, where $w$ is a known prior weight that varies from observation to observation and $\phi$ is constant over observations. For one parameter members, such as the Poisson and binomial, $\phi$ is fixed at one, but for two parameter members, such as the Normal or Gamma, $\phi$ is free. Given that we are restricting ourselves to one parameter members, it will be assumed that $\phi = 1$. An implication of this restriction is that over- and under-dispersed models will not be considered.

At this point it is good to observe that the response model given in the likelihood is parameterized by $\beta$ and relates a set of covariates to the response (Equation 2.5.1). The proposed exponential family response is indexed by $\theta$, which is the canonical location parameter. A generalized linear model allows for a natural connection between the parameter of interest, $\beta$, and the canonical parameter of the exponential family, $\theta$. This relationship requires three pieces: a probability distribution, a systematic component, and a link function.
The probability distribution has already been identified, so only the systematic component and the link function are needed.

The systematic component is a linear combination of the covariates. If the covariates were perfectly measured and complete, then the systematic component would be

$$\eta = X\beta \qquad (2.5.12)$$

where $X$ is the $n \times p$ design matrix and $\beta$ is the $p \times 1$ indexing vector.

The link function, $g(\cdot)$, connects the random component, given by the probability distribution of $Y$, and the systematic component, which is a linear combination of the covariates. In general, $g(\cdot)$ describes how the mean response, $E(Y) = \mu$, is linked to the covariates through the linear predictor. For the $i$th subject,

$$g(\mu_i) = x_i^T \beta.$$

The link function can be any monotone, continuous and differentiable function, but we will restrict ourselves to canonical links. Working with canonical links brings a level of simplicity to the problem; a canonical link arises when $\theta_i = g(\mu_i) = \eta_i$. An immediate benefit is that a sufficient statistic equal in dimension to $\beta$ in the linear predictor, $\eta = X\beta$, exists for canonical links [73].

The response of interest, which is also a common response in epidemiological research, is the Bernoulli random variable, $Y$. A typical goal is to find a functional relationship between the response and a set of explanatory variables $X$ indexed by $\beta$. We will consider an independent and identically distributed sample of $n$ subjects with the $i$th response denoted $Y_i$, $i = 1, \ldots, n$. A realization from the Bernoulli random variable $Y_i$ is

$$y_i = \begin{cases} 1 & \text{with probability } \pi_i \\ 0 & \text{with probability } (1 - \pi_i) \end{cases}$$

where $\pi_i$ is the probability of observing a success in the Bernoulli experiment. The associated probability mass function is

$$p(Y_i = y_i \mid \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

with $E(Y_i) = \pi_i$ and $\mathrm{Var}(Y_i) = \pi_i (1 - \pi_i)$. Here $a(\phi) = 1$ when we let the weight $w$ equal the number of independent trials and assume no dispersion problems with the data.
We let $m = 1$ and $\phi = 1$, where $m$ is the number of trials in the Bernoulli experiment, $b(\theta) = -\log(1 - \pi)$, where $\pi$ indexes the binomial distribution and $\theta = \log\left(\frac{\pi}{1 - \pi}\right)$, and $c(y, \phi) = 0$. We see that the canonical link is $\theta = \log\left(\frac{\pi}{1 - \pi}\right)$, the logit function. The linkage between the random and the systematic components is

$$\eta_i = \log\left(\frac{\pi_i}{1 - \pi_i}\right),$$

thus

$$p(Y_i = y_i \mid \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$
$$= \exp\left( y_i \log\left(\frac{\pi_i}{1 - \pi_i}\right) + \log(1 - \pi_i) \right)$$
$$= \exp\left( y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right) \qquad (2.5.13)$$
$$= p(Y_i \mid x_i, \beta, \phi = 1).$$

2.5.6 Mechanism of imperfection

The final component of equation 2.5.1 is the model for the mechanism of imperfection, $p(R_i \mid x^A_i, y_i, \gamma)$. Modelling the joint probability distribution of the indicator vector is simplified through two observations. The first is that we can use the ideas from the conditional-conditional model for the joint distribution of the covariates and model the missing data mechanism as a sequence of one-dimensional distributions [48]. With this approach we have

$$p(R_i \mid x^A_i, y_i, \gamma) = p(R_{ip} \mid r_{i1}, \ldots, r_{i(p-1)}, x^A_i, y_i, \gamma_p) \times p(R_{i(p-1)} \mid r_{i1}, \ldots, r_{i(p-2)}, x^A_i, y_i, \gamma_{p-1}) \times \cdots \times p(R_{i2} \mid r_{i1}, x^A_i, y_i, \gamma_2)\, p(R_{i1} \mid x^A_i, y_i, \gamma_1)$$

where $\gamma_j$ indexes the model for the $j$th indicator vector, $j = 1, \ldots, p$.

As with the covariates, the above relationship provides only a modest simplification of the problem. The next step is to apply the conditioning within each of the $p$ joint conditional distributions, $j = 1, \ldots, p$. Unlike the covariate situation, the relationship between the random variables is less obvious. With the covariates, there was a perfect random variable and there was an imperfect random variable. The natural relationship between the two was motivated by the classical measurement error assumption. Here, the $j$th joint distribution of the indicator variables is between the indicator for missing data, $R^I_{ij}$, and the indicator for mismeasurement, $R^M_{ij}$.

Without a clear guide, we have three situations to consider. The first is a conditioning on the indicator of mismeasurement.
This characterization of the joint probability model suggests that the probabilistic mechanism for missing data is dependent on the presence, or absence, of measurement error. If we choose to condition on the indicator of missing information, then it is suggested that the mechanism for mismeasurement is dependent on the observation of a realization from the associated random variable. The third situation is to assume independence, so that the presence of one type of imperfection has no bearing on the presence of the other.

Although simplest, at this point the assumption of independence between the two mechanisms is not desirable, primarily because it can be considered a special case of either conditional specification. For the $i$th subject and $j$th covariate, the associated models are

$$p(R_{ij} \mid r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma_j) = p(R^I_{ij} \mid r^M_{ij}, r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma^I_j) \times p(R^M_{ij} \mid r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma^M_j)$$

or, for the second conditional situation,

$$p(R_{ij} \mid r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma_j) = p(R^M_{ij} \mid r^I_{ij}, r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma^M_j) \times p(R^I_{ij} \mid r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma^I_j).$$

Although the first will be used for expository purposes, there are subtle contextual reasons for this choice. Recall that the first model suggests that missingness depends on mismeasurement, whereas the second suggests that mismeasurement depends on missingness.

Intuitively, the first seems more plausible, but why? Consider the second situation. The conditional distribution is conditional on the observational status: observed or not observed. If the value is not observed, then there is no observed information about the measurement error mechanism for the $i$th subject. The only information we would have about the mismeasurement mechanism would be from the observed realizations. To apply this model, we would have to assume that both the observed and the unobserved are subject to the same measurement error model.

The first situation is appealing at this point. A motivating illustration is to consider the situation where obtaining perfect measurements is difficult, but obtaining imperfect measurements is easy.
If it is difficult to observe a perfect realization of the random variable, there may be a high probability that the perfect realization will not be observed. If it is easy to obtain an imperfect realization of the random variable, then there may be a high probability of observing the realization. This characterization of the mechanism of imperfection still allows for multiple types of measurement error models for a single random variable. It is for these reasons that the first characterization will be used, with the recognition that the other characterization is not inferior.

Conditioning on the indicator of mismeasurement yields

$$p(R_i \mid x^A_i, y_i, \gamma) = p(R^I_{ip} \mid r^M_{ip}, r_{i1}, \ldots, r_{i(p-1)}, x^A_i, y_i, \gamma^I_p)\, p(R^M_{ip} \mid r_{i1}, \ldots, r_{i(p-1)}, x^A_i, y_i, \gamma^M_p)$$
$$\times\, p(R^I_{i(p-1)} \mid r^M_{i(p-1)}, r_{i1}, \ldots, r_{i(p-2)}, x^A_i, y_i, \gamma^I_{p-1})\, p(R^M_{i(p-1)} \mid r_{i1}, \ldots, r_{i(p-2)}, x^A_i, y_i, \gamma^M_{p-1})$$
$$\times \cdots \times p(R^I_{i1} \mid r^M_{i1}, x^A_i, y_i, \gamma^I_1)\, p(R^M_{i1} \mid x^A_i, y_i, \gamma^M_1) \qquad (2.5.14)$$

where $\gamma^v_j$ is the indexing vector for the $j$th indicator vector and the $v$th imperfection, $v \in \{I, M\}$, and $\gamma = (\gamma^I_1, \gamma^M_1, \ldots, \gamma^I_p, \gamma^M_p)^T$. Since imperfection is binary, we can use a sequence of logistic regressions for equation 2.5.14. With binary indicators, we can use the previous exposition about the response model, substituting the random variable $R_{ij}$ for $Y_i$. Here we will apply the same assumptions as with the response, thus we have

$$p(R^I_{ij} \mid r^M_{ij}, r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma^I_j) = \exp\left( r^I_{ij} \eta^I_{ij} - \log\left(1 + e^{\eta^I_{ij}}\right) \right)$$

and

$$p(R^M_{ij} \mid r_{i1}, \ldots, r_{i(j-1)}, x^A_i, y_i, \gamma^M_j) = \exp\left( r^M_{ij} \eta^M_{ij} - \log\left(1 + e^{\eta^M_{ij}}\right) \right)$$

where $\eta^I_{ij} = [x^A_i, y_i, r^M_{ij}, r_{i(-j)}]\, \gamma^I_j$ is a linear combination of the realizations and the indicators on which conditioning is predicated for the indication of missing data model, $r_{i(-j)} = (r_{i1}, \ldots, r_{i(j-1)})$ is the vector of indicators with the $j$th through $p$th indicators removed, and $\eta^M_{ij} = [x^A_i, y_i, r_{i(-j)}]\, \gamma^M_j$ is the linear combination of the realizations and the indicators on which conditioning is predicated for the indication of mismeasurement model. As indicated with the covariate model, this method greatly reduces the number of nuisance parameters that need specification compared with a joint log-linear specification [95].
Furthermore, we have the advantage that it provides a natural approach to the specification of the joint probability distribution of the imperfection indicators.

2.5.7 Missing data

As it was natural to discuss the measurement error model in the context of the covariate model, the model for the missing data mechanism provides a natural context for discussing the types of missing data. There are three types of missing data: missing completely at random (MCAR), missing at random (MAR), and non-ignorable missing (NMAR) data. These three classes of missing data will follow the original descriptions given by Rubin [90] and their interpretation by Ibrahim [46, 47].

2.5.7.1 Missing completely at random

Data are said to be missing completely at random (MCAR) when the inability to observe a realization from a random variable does not depend on any data, either observed or unobserved. In our situation, the response is completely observed whereas some components of the $i$th realization $x_i$ are missing. For example, if $x_{ij}$ is missing, then we say it is MCAR if the probability of observing a realization of $X_{ij}$ is independent of the response $y_i$ and of $x_i$. From a model perspective, let $\gamma^I_{j(0)}$ denote the indexing parameter vector, with the intercept term removed, of the logistic regression model for the $j$th indicator variable associated with the missing information component of imperfection. If $\gamma^I_{j(0)} = 0$, then the data is missing completely at random [47].

Under the MCAR assumption, the observed data is a random sample of all the data. When imperfection is restricted to missing data, a complete case analysis will lose efficiency, but will not introduce bias to the parameter estimates.

2.5.7.2 Missing at random

Data are said to be missing at random (MAR) if, conditional on the observed data, the inability to observe a realization from a random variable does not depend on the data that are not observed.
Viewed another way, the inability to observe a realization may depend on the observed data. This conditional dependence does not prevent the unconditional probability of not observing a realization from depending on the unobserved data. As with MCAR data, assume that the response is completely observed and that some components of the $i$th realization $x_i$ are missing. The missing values of $x_i$ are MAR if, conditional on the observed data, the probability of observing $x_{ij}$ is independent of the values of $x_{ij}$ that would have been observed, where the observed data include both the response and the observed values of $x_i$.

It has been observed that MAR is a much more reasonable assumption than MCAR, but adjustments should be made because the observed realizations are no longer a random sample of all the data [46]. In general, a complete case analysis will result in inefficient and biased estimates of the parameters. More specifically, if the probability of not observing a realization depends on the observed covariates and not on the response, then a complete case analysis will result in unbiased estimates, but if it depends on at least the response, then a complete case analysis will result in biased estimates [67]. Little and Rubin's assessment, to which Ibrahim refers, is made in the context of a regression model. The unbiased property when the response is not part of the missing data model stems from the fact that both the regression model and the missing data model are conditional on the covariates [36].

For logistic regression, the effect of having data which are missing at random is less conclusive than with linear regression. Vach and Blettner [104] observed for case-control studies under the MAR assumption that a complete case analysis may result in biased estimation when both the exposure (covariate) and the disease status (outcome) are variables in the missing data mechanism.
Furthermore, Vach and Blettner comment that the bias for MAR data which involves both the exposure and the disease state can range from small to rather large. Greenland and Finkle [38] demonstrate in a series of limited simulation studies that logistic regression with a MAR mechanism depending on the outcome can produce parameter estimates with negligible bias. These two studies indicate that missing at random data for logistic regression models may not operate as expected from linear regression.

If we consider the situation where there is a single covariate which suffers from missing data, $x_{ij}$, then from a modelling point of view, if the coefficient for $x_{ij}$ in the vector of parameters $\gamma^I_j$, which indexes the missing data component of the $j$th imperfection model, is 0, then the missing data mechanism does not depend on $x_{ij}$; in this situation, the missing data are missing at random [47].

2.5.7.3 Non-ignorable missing data

Data are said to be non-ignorable missing if the probability of observing a realization from a random variable depends on the value of the realization that would have been observed. As with the previous examples, assume that some components of $x_i$ are unobservable and that the response is always observed. Here, if the probability of observing realizations, conditional on the observed data, depends on the missing values that would have been observed from the components of $x_i$, then the missing data are non-ignorable. In this setting, a complete case analysis, when considering only missing data imperfection, leads to unbiased estimates of the parameters if the conditional probability depends only on the missing covariate and not on the response. Valid inference requires the correct specification of the mechanism of missingness and/or the specification of the distribution of the covariates [46, 67]. Finally, if the coefficient for $x_{ij}$ in $\gamma^I_j$ is not 0, then the missing data mechanism is not ignorable [47, 90].
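The three mechanisms can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions: a single standard normal covariate, a binary response generated from a logistic model, and a logistic model for the missing data indicator whose coefficients (the dictionary `GAMMA`) are hypothetical values chosen for illustration, not values from this thesis. The function names `simulate` and `summarize` are likewise illustrative.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (intercept, response, covariate) of the logistic
# model for the missing data indicator; chosen only for illustration.
GAMMA = {
    "MCAR": (1.0, 0.0, 0.0),   # depends on no data at all
    "MAR": (1.0, -1.5, 0.0),   # depends only on the always-observed response
    "NMAR": (1.0, 0.0, -1.5),  # depends on the value that may go missing
}


def simulate(mechanism, n=20000, seed=1):
    """Return lists (x, y, r); r = 1 means the covariate x was observed."""
    rng = random.Random(seed)
    g0, gy, gx = GAMMA[mechanism]
    xs, ys, rs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < expit(-0.5 + x) else 0
        r = 1 if rng.random() < expit(g0 + gy * y + gx * x) else 0
        xs.append(x)
        ys.append(y)
        rs.append(r)
    return xs, ys, rs


def summarize(mechanism):
    """Compare the covariate mean among observed cases with the full-data mean."""
    xs, _, rs = simulate(mechanism)
    observed = [x for x, r in zip(xs, rs) if r == 1]
    return {
        "prop_observed": len(observed) / len(xs),
        "mean_all": sum(xs) / len(xs),
        "mean_observed": sum(observed) / len(observed),
    }


if __name__ == "__main__":
    for name in ("MCAR", "MAR", "NMAR"):
        print(name, summarize(name))
```

Under the MCAR setting the observed cases behave like a random sample of all the data. Under the MAR setting the indicator depends only on the response, yet the unconditional mean of the observed covariate still shifts because the response depends on the covariate. Under the NMAR setting the indicator depends directly on the value that would have been observed.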
2.5.8 Missing data and mismeasurement as special cases

Throughout the development of the generalized imperfect variable model, examples have been given which highlight the flexibility and complexity of the framework. As a counterpoint to these examples, we will now consider simplifying assumptions which illustrate that the missing data and mismeasurement frameworks are special cases. For this, equations 2.5.8 and 2.5.14 are of primary interest.

For missing data, $R^M_{ij} = 1$ for all subjects, so the imperfect variable reduces to

$$
X^E_{ij} = r^I_{ij} X_{ij,obs} + (1 - r^I_{ij}) X_{ij,miss}.
$$

If the realization is observed, then $X^E_{ij} = X_{ij,obs}$ and we observe a realization from the true random variable, but if it is not observed then $X^E_{ij} = X_{ij,miss}$, which is the random variable that would have generated the realization if it had been observable. The model for the mechanism of imperfection reduces to

$$
p(R_i \mid x^A_i, y_i, \gamma) = p(R^I_{ip} \mid r^I_{i1}, \ldots, r^I_{i,p-1}, x^A_i, y_i, \gamma^I_p) \times \cdots \times p(R^I_{i2} \mid r^I_{i1}, x^A_i, y_i, \gamma^I_2)\, p(R^I_{i1} \mid x^A_i, y_i, \gamma^I_1),
$$

which is identical to Ibrahim's proposal for a likelihood based approach to handling missing data within a generalized linear model framework [48].

For measurement error, it is often assumed that all realizations from $X_{ij}$ are unobservable. Instead a surrogate is used, and realizations from $X^E_{ij}$, which is functionally related to $X_{ij}$, are observed. In this scenario $R^I_{ij} = 1$ and $R^M_{ij} = 0$ for all subjects. The generalized imperfect variable reduces to

$$
X^E_{ij} = \alpha_0 + \alpha_1^T X_{i(-j)} + \alpha_2 X_{ij} + \epsilon_{ij},
$$

which is Carroll's proposed classical measurement error [14]. In this situation, the model for the mechanism of imperfection is $p(R_{ij} \mid r_{i1}, \ldots, r_{i,j-1}, x^A_i, y_i, \gamma) = 1$ for all subjects. Since all the observations are observable and all realizations are mismeasured with probability one, the probability distribution is a point mass on the event $R_{ij} = (1, 0)$ for all subjects. If the restriction on mismeasurement is relaxed, an application of the conditional-conditional method for measurement error results.
Consider the situation where only $R^I_{ij} = 1$, so that

$$
X^E_{ij} = r^M_{ij} X_{ij,obs} + (1 - r^M_{ij})\,[\ldots].
$$

The model for the mechanism of imperfection reduces to

$$
p(R_i \mid x^A_i, y_i, \gamma) = p(R^M_{ip} \mid r^M_{i1}, \ldots, r^M_{i,p-1}, x^A_i, y_i, \gamma^M_p) \times \cdots \times p(R^M_{i2} \mid r^M_{i1}, x^A_i, y_i, \gamma^M_2)\, p(R^M_{i1} \mid x^A_i, y_i, \gamma^M_1),
$$

which is Ibrahim's conditional-conditional method applied to a reformulated mismeasurement problem. It does not require all realizations from $X_{ij}$ to be mismeasured.

Chapter 3

Maximum likelihood estimation via the Expectation-Maximization algorithm

The Expectation-Maximization algorithm, or EM algorithm, is a broadly applicable algorithm that provides an iterative methodology for obtaining maximum likelihood estimates in situations where straightforward estimation of the maximum likelihood estimate is frustrated by data complexity such as incomplete information. In this chapter, we will briefly recount the history of the EM algorithm, consider the rationale for and critiques against its use, present a formulation of the EM algorithm, translate the EM algorithm for generalized imperfect variables, provide a Monte Carlo based method for obtaining the EM algorithm based maximum likelihood estimates, and finally obtain standard error estimates for the parameters of primary interest.

3.1 Brief history of the EM algorithm

The earliest manifestation of an EM-type algorithm was in 1886. The central problem was parameter estimation of a mixture of two univariate normal distributions [74]. The EM-type algorithm in this instance was used to incorporate observations for which the associated errors were abnormally large into the estimation of the parameter, rather than discarding these observations as outliers and estimating parameters based on the reduced set of observations. These outliers are affected by some error, termed the evil of a value, that is suspiciously similar, in some respects, to measurement error problems reformulated as missing data problems [81].

Following this early example of the EM algorithm, other examples of EM-type and EM algorithms in the statistical literature were considered.
The number of these grew over the course of ninety years. In 1977, Dempster, Laird, and Rubin proposed a general formulation, known as the Expectation-Maximization algorithm, of which all predecessors are particular examples [22].

A few precursors to the Dempster, Laird, and Rubin formulation of the EM algorithm should be noted. In 1958, Hartley's approach for the general case of count data set forth the basic ideas of the EM algorithm. Two years later, Buck [8] considered the estimation of the p-dimensional mean vector and the associated covariance matrix when some of the observations are missing. Suggesting an imputation method based on regressing the missing variables on the observed ones, the parameters are estimated on a synthetic "complete" data set. Under certain conditions, this approach yields the maximum likelihood estimators of parameters for the exponential family and has the basic components of the EM algorithm.

Convergence results for EM-type algorithms emerged in the late 1960s, and an interrelated series of papers forms the basis of the present-day formulation of the EM algorithm [4-6]. In 1972, the Missing Information principle was introduced; it is in the spirit of and related to the basic ideas of the EM algorithm [83]. Here, the relationship between the complete-data and incomplete-data log-likelihood functions leads to the conclusion that the maximum likelihood estimator is a fixed point of a particular transformation. Contemporaneous with these results was the proposal of the Self-Consistency principle by Efron [24].

Turnbull uses Efron's Self-Consistency idea for nonparametric estimation of a survivorship function with doubly censored data [102]. Here, Efron's ideas are extended to show the equivalence between Efron's self-consistency and the nonparametric likelihood equations. Convergence is proven for these EM-like algorithms.
Two years later, Turnbull dealt with the empirical distribution in the context of grouped, censored, and truncated data [103]. It is here that Turnbull derives a version of the EM algorithm and notes that this approach can be used not only for missing data but also for truncated data. Just prior to Dempster, Laird, and Rubin's seminal paper proposing the present version of the EM algorithm, work also appeared addressing mixture distributions [20], and an iterative mapping for incomplete-data problems predicated on exponential families, now known as the Sundberg formulas, was developed [96, 97].

Since Dempster, Laird, and Rubin's paper, there have been many applications of the EM algorithm. The algorithm has since become a staple in many statisticians' methodological toolbox. Although there is a high level of utility to the algorithm, two issues need to be addressed: interpretation and extensions of the EM algorithm. McLachlan and Krishnan [74] suggest two interpretations. The first comes from problems involving mixture distributions. In these problems, the EM algorithm naturally emerges from the particular forms taken by the derivatives of the log-likelihood function, of which Day's paper is an example [20]. The second interpretation is taken by viewing a complex problem as an incomplete-data problem with an associated complete-data problem. If well formulated, there is a natural connection between the two likelihood functions which the EM algorithm can exploit. This interpretation is in the spirit of the Missing Information principle, and is the one taken in this thesis [83].

The work surrounding the EM algorithm has continued to grow since Dempster, Laird, and Rubin generalized the approach in 1977. A detailed account of the convergence properties of the EM algorithm was given by Wu [114]. In particular, Wu shows that convergence of $L(\Psi^{(k)})$ to $L(\Psi^*)$ does not necessarily imply the convergence of $\Psi^{(k)}$ to $\Psi^*$, where $\Psi$ is the vector of parameters.
That is, the convergence of the likelihood evaluated at the $k$th iterative estimate of the parameter to the likelihood evaluated at the unique maximizer, $L(\Psi^*)$, does not necessarily imply that the $k$th iterative estimate of the parameter converges to the unique maximizer $\Psi^*$. Almost a decade later, Lansky, Casella, McCulloch and Lansky established some invariance and convergence results in addition to considering the rate of convergence for the EM algorithm [60].

Laird [57], in work on maximum likelihood estimation in survival/sacrifice experiments, shows the equivalence of Efron's Self-Consistency principle [24] and Orchard and Woodbury's Missing Information principle [83]. Furthermore, Laird also shows that with parametric exponential families, these two principles have the same mathematical basis as the Sundberg formulas, and establishes the self-consistency algorithm as a special case of the EM algorithm.

Despite the popularity of the EM algorithm, it is not without its problems. An initial criticism, and probably one of the biggest impediments to its use, is its inability to automatically produce an estimate of the covariance matrix of the maximum likelihood estimate. Early on, Louis [70] developed a method which yields the observed information matrix in terms of the gradient and curvature of the complete-data log-likelihood function. In doing so, Louis, by providing a method for obtaining the second moments, deepens the connections between the Dempster, Laird, and Rubin EM algorithm and Fisher's [28] observation that the incomplete-data score statistic is the conditional expectation of the complete-data score statistic given the incomplete data.

Since Louis, others have proposed methods for obtaining covariance estimates. Meilijson [75] proposed a numerical approach which uses components of the E- and M-steps of the algorithm. This method avoids the computation of the second derivatives required by Louis' method.
Furthermore, it was shown that single-observation scores for the incomplete-data model can be obtained as a by-product of the E-step. Meng and Rubin [76] proposed a method which produces a numerically stable estimate of the asymptotic covariance of the EM estimate using only the code of the EM algorithm itself and standard matrix operations. This approach is called the Supplemented EM algorithm (SEM). A negative aspect of the SEM algorithm is that accurate estimates of the parameters are needed, causing the SEM to be potentially algorithmically expensive. Furthermore, it has been suggested that the SEM suffers from numerical inaccuracy, which is a considerable drawback for an algorithmic or numerical method [3, 93]. A recent proposal comes from Oakes [82]. Oakes proposes a computation of the standard error by deriving a formula for the observed information matrix, using an explicit expression for the second derivative of the observed-data log-likelihood in terms of the derivatives of the conditional expectation of the complete-data log-likelihood given the observed data.

Two secondary issues with the EM algorithm are the rate of convergence and the potential for high dimensional integration in the E-step. An interesting feature of the Dempster, Laird, and Rubin [22] paper is that keys for addressing acceleration and identifying problems with the likelihood surfaces are buried within. Louis [70] not only suggested a widely adopted approach to obtaining the covariance matrix of the EM estimates, but also proposed a method to accelerate the convergence itself. His proposal uses the multivariate generalization of the Aitken acceleration procedure which, when applied to the EM problem, is essentially equivalent to using the Newton-Raphson method to find a zero of the incomplete-data score statistic. Jamshidian and Jennrich [49] propose a gradient approach for accelerating convergence, as does Lange [58, 59].
Meng and van Dyk [77] consider acceleration of the algorithm by introducing a working parameter in the specification of the complete-data likelihood. This working parameter indexes a class of EM algorithms, thus a parameter can be specified which accelerates the EM algorithm without affecting either the stability or the simplicity of the algorithm. Parameter Expanded Expectation Maximization (PX-EM) expands the parameter space over which maximization occurs and often results in accelerating the convergence of the EM algorithm [69].

High dimensional integration, although interesting, can pose serious analytic challenges even in what may be conceived as the simplest of problems. Wei and Tanner [109] use Monte Carlo integration for the E-step and propose the Monte Carlo EM algorithm (MCEM). Ibrahim [44] uses the MCEM approach combined with Gilks and Wild's [35] and Wild and Gilks' [111] adaptive rejection sampling for the Gibbs sampler in order to obtain EM estimates with missing data for generalized linear models.

3.2 Rationale for and critiques against the use of the EM algorithm

A rational application of statistical methodology is best done acknowledging both the reasons for a method and the critiques against it. The EM algorithm has high utility and an inherent simplicity, yet it has limitations and failings. McLachlan and Krishnan [74] provide a useful summary of reasons for and against the use of the EM algorithm. This list will not be reproduced, but for completeness, a summary of the reasons will be presented. The authors identify ten reasons why the EM algorithm has such widespread appeal.

• The EM algorithm is numerically stable, increasing the likelihood with each iteration until it reaches a stationary point.

• Under general conditions, the EM algorithm has reliable global convergence. Given an arbitrary starting point, $\Psi^{(0)}$, in the parameter space, convergence to a local maximizer almost always occurs.
• In general, the EM algorithm is easily implemented. The E-step involves taking the expectations over the complete-data likelihood conditional on the observed data, and the M-step involves complete-data maximum likelihood estimation.

• The estimation in the M-step is frequently a standard one, thus it is possible to use standard software packages when the maximum likelihood estimates are not available in closed form.

• The analytic work is often simpler than other methods because only the conditional expectation of the complete-data log-likelihood needs maximization. The counterpoint is that the E-step may require non-trivial analytic work.

• Programming of the EM algorithm is generally easy since there is no evaluation of the likelihood itself nor of any of its derivatives.

• In general, it is inexpensive in terms of computer resources since estimation does not involve the storage of large data structures such as the information matrix. Furthermore, since there is low cost per iteration, even a large number of iterations can prove to be less expensive computationally than other methods.

• Monitoring convergence is easy since each iterative step increases the value of the likelihood.

• The EM algorithm is useful in providing estimates for missing data problems.

A further reason in favour of the EM algorithm, specific to the context of missing data and logistic regression, is the recommendation by Vach and Schumacher [105], who recommend either the EM algorithm for maximum likelihood estimation, following the methodology of Ibrahim [43], or the pseudo maximum likelihood approach.

Although there are compelling reasons to use the EM algorithm, it is not without its faults. As indicated in the previous section, the EM algorithm does not naturally provide an estimate of the covariance matrix of the parameter estimates. Although this is a failing, it was shown that work has been done to address this shortcoming.
The counterpoint to the EM algorithm's low cost of computer resources for a single iteration is that it may take a long time to converge. This is a particular problem when there is a large amount of "missing information" [74]. Even though the EM algorithm will eventually converge, it may only converge to a local maximum. This is not a specific failure of the EM algorithm, but a general failure of optimization algorithms. Finally, the E-step may be intractable, thus more complicated methods such as MCEM will be required. Here the simplicity of the EM algorithm is muted by the complexity of Monte Carlo integration; additionally, the computational complexity and cost will increase.

3.3 Formulation of the EM algorithm

3.3.1 EM algorithm

In this section, we will first consider the EM algorithm without reference to a specific likelihood, then we will consider the EM algorithm in the case of the exponential family. Finally, we will briefly discuss the mapping induced by the EM algorithm and its role.

The EM algorithm is a commonly used iterative procedure for finding maximum likelihood estimates when part of the data is missing or, more generally, for finding maximum likelihood estimates in situations where maximum likelihood estimation would be straightforward but for the additional complexity of incomplete information. The objective is to obtain maximum likelihood type estimates, with nice statistical properties, that would be obtained if the data were perfect. The EM algorithm is less an algorithm and more a two-step general principle. The first step, the E-step, involves taking the conditional expectation of the complete likelihood given the observed data. The second step, the M-step, involves maximizing the conditional expectation with respect to the indexing parameter. We then return to the E-step, treating the parameter estimates obtained in the previous M-step as part of the observed data.
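The two-step principle can be illustrated on a toy problem. The sketch below is not the imperfect variable model of this thesis; it assumes a two-component normal mixture with unit variances, in which the unobserved component labels play the role of the missing information. The function names, starting values, and tolerance are illustrative assumptions. The E-step computes the conditional probability of each label given the data and the current estimates, and the M-step maximizes the resulting expected complete-data log-likelihood in closed form.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def log_likelihood(data, pi, mu0, mu1):
    """Observed-data log-likelihood of the two-component mixture."""
    return sum(
        math.log((1.0 - pi) * normal_pdf(x, mu0) + pi * normal_pdf(x, mu1))
        for x in data
    )


def em_mixture(data, pi=0.5, mu0=-1.0, mu1=1.0, tol=1e-8, max_iter=500):
    """EM iterations; stops when successive parameter estimates differ by < tol."""
    history = [log_likelihood(data, pi, mu0, mu1)]
    n = len(data)
    for _ in range(max_iter):
        # E-step: conditional probability that each point came from component 1,
        # given the data and the current parameter estimates.
        w = []
        for x in data:
            a = pi * normal_pdf(x, mu1)
            b = (1.0 - pi) * normal_pdf(x, mu0)
            w.append(a / (a + b))
        # M-step: closed-form maximizer of the expected complete-data
        # log-likelihood.
        sw = sum(w)
        new_pi = sw / n
        new_mu1 = sum(wi * x for wi, x in zip(w, data)) / sw
        new_mu0 = sum((1.0 - wi) * x for wi, x in zip(w, data)) / (n - sw)
        change = max(abs(new_pi - pi), abs(new_mu0 - mu0), abs(new_mu1 - mu1))
        pi, mu0, mu1 = new_pi, new_mu0, new_mu1
        history.append(log_likelihood(data, pi, mu0, mu1))
        if change < tol:
            break
    return pi, mu0, mu1, history


def simulate_mixture(n=4000, pi=0.3, mu0=-2.0, mu1=2.0, seed=7):
    """Draw n values from (1 - pi) N(mu0, 1) + pi N(mu1, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < pi else mu0, 1.0) for _ in range(n)]
```

The returned `history` records the observed-data log-likelihood after each iteration, which gives a simple way to check that the iterations never decrease the likelihood.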
In this manner, an iterative process evolves which, under some predetermined stopping rule, yields a EM based maxi mum likelihood estimate. At this point, we will deviate from the standard notation commonly used for presenting the basic formulation of the EM algorithm. The fundamental reason for this to minimize the introduction of superfluous notation and to provide a natural bridge between the proposed imperfect variable framework and the EM algorithm. Recall that, for the jth subject, the universa random vector is XU (X,XE,RI,RM) = (XA,R) where X is the true or target random variable, XE is the experimentally oh-. 54 servable version of X and R = (RI, RM) is the indicator vector for missing information and mismeasurement respectively. The log-likelihood, equation 2.5.4 is the product of three conditional distributions: the joint probability distribution of the indicator vector, the probability density of the response and the joint probability distribution of the covariate vector (X, XE). Fur thermore, recall that if all the data were observable, then the perfectly observed log-likelihood function is 1c(IxU,y) = l(Ix,yj) p(YjIx,, = ) +logp(XI). and the experimentally observable likelihood given by l(Ix,rj,yj) = f .. 1 .dXidX . . .dX (3.3.1) £U for the jth subject which suffers from imperfection, is the union of the variable space associated with the target covariates and the variable space associated with the experimentally observable covariates. Furthermore we assume that the imperfect covariates for which both problems do not co exist in the same covariate are the first j covariates in the data set and that the imperfect covariates for which both problems coexist within the same covariate are the following r covariates. It is clear that all the unobservable information is integrated out so that all that remains are the experimentally observed realizations from X The EM algorithm proceeds iteratively in terms of the complete-data loglikelihood function. 
Since the complete-data log-likelihood is unobservable, it is replaced by the conditional expectation given the observed data and the current estimate of the parameter, $\Psi^{(t)}$. To begin, specify some initial value, $\Psi^{(0)}$, for $\Psi$. For the first iteration of the EM algorithm, the E-step requires the calculation of the expectation of the complete log-likelihood given the observed data and the initial parameter estimate, $\Psi^{(0)}$. For the $i$th subject this is

$$
Q_i(\Psi \mid \Psi^{(0)}) = E\left[ l_c(\Psi \mid y_i, x^U_i) \mid \Psi^{(0)}, x^E_i, y_i, r_i \right] \tag{3.3.2}
$$

thus over all subjects, the expectation, denoted $Q(\Psi \mid \Psi^{(0)})$, is

$$
Q(\Psi \mid \Psi^{(0)}) = \sum_{i=1}^{n} Q_i(\Psi \mid \Psi^{(0)}), \tag{3.3.3}
$$

where the expectation is taken with respect to the probability distribution of the missing information conditional on the observed information and the most recent update of the parameter estimates, $p(x^A_i \mid \Psi^{(0)}, x^E_i, y_i, r_i)$.

The M-step requires the maximization of $Q(\Psi \mid \Psi^{(0)})$ with respect to $\Psi$ over the parameter space. The parameter $\Psi^{(1)}$ is chosen such that $Q(\Psi^{(1)} \mid \Psi^{(0)}) \ge Q(\Psi \mid \Psi^{(0)})$ for all $\Psi$ in the parameter space. Alternately, this can be expressed as $\Psi^{(1)} = \arg\max_{\Psi} Q(\Psi \mid \Psi^{(0)})$. The subsequent step replaces $\Psi^{(0)}$ with $\Psi^{(1)}$. The E- and M-steps are repeated, but this time with $\Psi^{(1)}$ in place of $\Psi^{(0)}$. For the $(t+1)$th iteration the E-step is

$$
Q(\Psi \mid \Psi^{(t)}) = \sum_{i=1}^{n} E\left[ l_c(\Psi \mid y_i, x^U_i) \mid \Psi^{(t)}, x^E_i, y_i, r_i \right] \tag{3.3.4}
$$

For the M-step, we choose $\Psi^{(t+1)}$ such that it lies in the parameter space and maximizes $Q(\Psi \mid \Psi^{(t)})$, that is

$$
Q(\Psi^{(t+1)} \mid \Psi^{(t)}) \ge Q(\Psi \mid \Psi^{(t)}) \tag{3.3.5}
$$

or alternately,

$$
\Psi^{(t+1)} = \arg\max_{\Psi} Q(\Psi \mid \Psi^{(t)}) \tag{3.3.6}
$$

for all $\Psi$ in the parameter space.

The E- and M-steps are repeated in an iterative fashion until the predetermined stopping criterion is satisfied. One of three choices is commonly used: two are based on the estimate of the parameter and the other is based on the likelihood. The first criterion is to allow the EM algorithm to iterate until the distance between lagged estimates of the parameter is less than some value, $\|\Psi^{(t+l)} - \Psi^{(t)}\| < \delta$, where $l$ is the chosen lag and $\delta$ is some arbitrarily small amount [65, 70]. The second criterion is to use only a subset of the parameters, for example, to require that the distance between lagged estimates of the parameter vector of interest be less than some value, $\|\beta^{(t+l)} - \beta^{(t)}\| < \delta$.
The third criterion is to require that the distance between lagged estimates of the likelihood be less than some value, $|L(\Psi^{(t+l)}) - L(\Psi^{(t)})| < \delta$ [74]. It is good to note that, regardless of the criterion chosen, this is only a measure of lack of progress in the change of the parameter estimate or in the likelihood and not a measure of distance to the true maximum value of the likelihood.

Dempster, Laird, and Rubin [22] showed that the likelihood function does not decrease after an EM iteration; that is

$$
L(\Psi^{(t+1)}) \ge L(\Psi^{(t)}) \tag{3.3.7}
$$

for $t = 0, 1, 2, \ldots$, thus a sequence of likelihood values that is bounded above converges monotonically to a limiting value. This is a strong motivation to use a likelihood based stopping criterion. The counterpoint is that the potential complexity of computing the likelihood negates one of the advantages of using the EM algorithm. In response to this, a lagged distance between EM based estimates is commonly used.

3.3.2 EM algorithm and imperfect variables

Although we have begun translating the EM algorithm for generalized imperfect variables, we still need to finish the translational process by parsing the expectation function into several components. We begin with equation 3.3.4,

$$
\begin{aligned}
Q(\Psi \mid \Psi^{(t)}) &= \sum_{i=1}^{n} E\left[ l_c(\Psi \mid y_i, x^U_i) \mid \Psi^{(t)}, x^E_i, y_i, r_i \right] \\
&= \sum_{i=1}^{n} \left[ \prod_{j=1}^{p} r^I_{ij} r^M_{ij}\; l_c(\Psi \mid y_i, x^U_i) + \left(1 - \prod_{j=1}^{p} r^I_{ij} r^M_{ij}\right) E\left[ l_c(\Psi \mid y_i, x^U_i) \mid \Psi^{(t)}, x^E_i, y_i, r_i \right] \right] \\
&= \sum_{i=1}^{n} \left[ \prod_{j=1}^{p} r^I_{ij} r^M_{ij}\; l_c(\Psi \mid y_i, x^U_i) + \left(1 - \prod_{j=1}^{p} r^I_{ij} r^M_{ij}\right) \int \cdots \int l_c(\Psi \mid y_i, x^U_i)\, p(X^A_i \mid \Psi^{(t)}, x^E_i, r_i, y_i)\, dx_i \right]
\end{aligned}
\tag{3.3.8}
$$

where $dx_i = dx_{i1} \cdots dx_{iq}\, dx_{i,q+1} \cdots dx_{i,q+r}$ for notational simplicity in the context of integration unless otherwise specified. Equation 3.3.8 consists of two parts. The first is the contribution when the $i$th subject is perfectly observed, the log-likelihood that is seen in complete case analysis. The second part is the contribution when the $i$th subject contains imperfections. The contribution is the expected log-likelihood over the space of unobservable realizations given the observed data.
Notice that if there exists an imperfect covariate for the jth subject, the second component is the contri bution because that imperfection may affect not only response model, but also the missing data mechanism and the covariate model. Now consider only the portion for which imperfection is problematic, f.. f lc(Iyi, . x)p(X(t), x, r, xU E which can be written as 58 f. f.. f f. f1ogp(RiIx, y, )p(XI(t), x, r, + . , , )p(XI(t), x, r, y)dx 4 logp(Yjx U + f1ogp(XI)p(XI(t), x, r, y)dx U and is succinctly represented as + )+ t Q(7I7 where Q(.) ) t Q(I is the expectation for the th 1 subject (Equation 3.3.2); therefore = + (1- rrf) ) + Q(I t [Q(7I ) + Qt))]] t (3.3.9) This needs further decomposition. First we will consider the joint distribu tion of the imperfection indicator. From equation 2.5.14 we have 59 f. =J )p(X(t), x, r, y)dx f1ogP(RiIx, y, 7 = UE p ,. j=1 .f1ogP(R,Rprii,.. . UE xp(X4I(t) I’ ‘ x i r,y)dx p ) 7 =f...f1ogP(RIrr.ir..ix4yi j1 UE xp(X4I(t) uS ‘ xE r,y)dx p +f...flogP(RIrii,...,rii_i,xjA YijM j1 UE x p(X(t),xf, r,y)dx p = j=1 [ / 1(t) i -I-r M(t)”] ) +Qi (Vxj )] (3.3.10) where rio denotes the absence of an indicator. In an analogous manner, the joint distribution of the covariates can be written as p ) t Q(I = j=1 [ (fii’) + Q (I/(1 j ,j 1]’ (3.3.11) 60 thus equation 3.3.9 becomes ) t Q(I + (1- fl rrf ) [Q(7I7) + Q(Ip ) + Q((t))]] t 1c(Iyj,xV) =, + + ( [(i) [P - rr) ) t Q(pIØ + [ (77t)) [ (i) + + Q (7M7M(t))] Q (It))]]] (3.3.12) It is here that the relevance of assuming unique parameterization for the various distributions becomes evident. Equation 3.3.12 tells us that a com plicated joint distribution, under the assumption that the conditional ex pression of that distribution is uniquely parameterized, can be partitioned into a sum of simpler and most likely standard problems. Equation 3.3.12 has four components. The first is the log-likelihood when there are no data imperfections. The remaining three result when imperfection exists. 
The second component corresponds to the sequence of one-dimensional conditional distributions used to specify the joint distribution of the imperfection indicator, which decomposes into a conditional expectation, given the observed data, of the indicator of missing information and a conditional expectation, given the observed data, of the indicator of mismeasurement. Since the indicators are binary, a binary regression model is used to characterize the uniquely parameterized conditional distributions. The third component is the conditional expectation, given the observed data, of the response distribution. Recall that a binary regression model is being used to characterize the response distribution (Equation 2.5.13). The final component of equation 3.3.12 is for the joint distribution of the true variable and the experimentally observable variable. Recall that this distribution was constructed by using a sequence of one-dimensional conditional distributions utilizing unique parameterization for each conditional distribution, which also undergoes a component-wise decomposition.

3.4 Implementing the Expectation-Maximization algorithm: Monte-Carlo Expectation-Maximization

Although equation 3.3.12 brings much needed simplicity to the E-step (Equation 3.3.4), it is still, for the most part, non-trivial. Even in a small problem that may involve only two imperfect covariates, double integrals will be required for the response and may be required for the covariate and indicator variable portions, depending on their structure. It is clear that even in well defined and modest problems where imperfection is present, high-dimensional integration over $\mathcal{X}^U$ will be necessary.

Evaluating the expectations of the E-step can take several forms. As we are synthesizing missing data and mismeasurement together, we have multiple sources from which guidance can be taken.
Wei and Tanner [109, 110] proposed a Monte Carlo approach to compute the expectation in the E-step of the EM algorithm. The literature on missing data in the context of generalized linear models tends to support the use of Monte Carlo integration as a means to handle potentially difficult and high dimensional integration [46, 48, 95]. From the measurement error side of the problem, Monte Carlo methods have been considered when framing the mismeasurement problem as a missing data problem and using the EM algorithm [13, 30].

Wang [108] proposed a Gaussian quadrature EM (GQEM) that handles low dimensional problems as an alternative to MCEM. In general, Monte Carlo methods are not competitive against quadrature methods, which have convergence rates of $O(M^{-2})$ or $O(M^{-4})$ in one dimension with $M$ evaluations [108]. Typically, quadratures are used for low dimensional problems and are commonly used for single integrals. Wang shows that the GQEM performs well for $k < 3$ dimensions [108]. This would be an appealing approach, but the simple model under discussion is only a first step in understanding how to integrate similar but more complex patterns of missingness and measurement error. It is reasonable to believe that any extension of this problem will result in high dimensional integration. Although MCEM is less efficient, it is a much more attractive starting point for a problem that is only going to grow in complexity.

3.4.1 Monte Carlo integration

Monte Carlo methods are well documented, thus only a brief overview of the methods used will be presented. In 1949, Metropolis presented a statistical approach to the study of differential equations and, as indicated by Metropolis, to a more general class of integro-differential equations that occur across the natural sciences [79]. With the advent of inexpensive and almost ubiquitous computing power, Monte Carlo based methods have flourished.
We will consider classical Monte Carlo integration as presented by Robert and Casella [85]. Consider the problem of evaluating the integral

$$E_f[h(X)] = \int_{\mathcal{X}} h(x) f(x)\,dx \tag{3.4.1}$$

where $\mathcal{X}$ is the domain of $X$. A sample $(X_1, \ldots, X_m)$ is generated from the density $f(x)$ and equation 3.4.1 is approximated using the empirical average

$$\bar{h}_m = \frac{1}{m} \sum_{j=1}^{m} h(x_j).$$

By the Strong Law of Large Numbers, $\bar{h}_m$ converges almost surely to $E_f[h(X)]$. The variance is

$$\operatorname{var}(\bar{h}_m) = \frac{1}{m} \int_{\mathcal{X}} \left(h(x) - E_f[h(X)]\right)^2 f(x)\,dx$$

and can also be approximated from the sample with

$$v_m = \frac{1}{m^2} \sum_{j=1}^{m} \left(h(x_j) - \bar{h}_m\right)^2.$$

For large $m$,

$$\frac{\bar{h}_m - E_f[h(X)]}{\sqrt{v_m}}$$

has an approximate $N(0, 1)$ distribution, which allows for the construction of tests and confidence bounds on the approximation of $E_f[h(X)]$.

3.4.2 Monte Carlo EM: the expectation step

In the application of Monte Carlo integration to the problem of finding the expectation in the E-step of the EM algorithm, the approach suggested by Wei and Tanner [109, 110] will be used. A sample of simulated data is drawn from the conditional distribution $p(X^A \mid \Phi^{(t)}, x^E, r, y)$ on the $(t+1)$th iteration of the E-step. This sample is used to approximate the expectation in equation 3.3.4 using the empirical average

$$Q^{(t+1)}(\Phi \mid \Phi^{(t)}) = \sum_{i=1}^{n} \frac{1}{m} \sum_{l=1}^{m} l_c(\Phi \mid y_i, x_{il}^{U}). \tag{3.4.2}$$

When $m \to \infty$ this converges to $Q(\Phi \mid \Phi^{(t)})$, hence MCEM is the regular EM in the limit [85].

Although using Monte Carlo integration is an attractive solution for handling a difficult high-dimensional integral, it is not without cost. Obtaining the sample $x_1, \ldots, x_m$ from $p(X^A \mid \Phi^{(t)}, x^E, r, y)$ can greatly increase the computation time for an algorithm which is notorious for its slow convergence. In response to this, Levine and Casella [61] proposed a method by which samples may be reused, thus reducing the overall computational time. Alternatively, sampling has been eased through the use of Gibbs sampling and Gibbs Adaptive Rejection Sampling [18, 35, 44]. Rodriguez [87] has implemented Gibbs Adaptive Rejection Sampling in the statistical package R in the ars library.
The ars function is based on the paper by Gilks [35] and requires log-concavity of the functions from which it is sampling.

At this point, a brief digression concerning MCEM implementation will be taken. The discussion herein is relevant to the MCEM algorithm, but will not be incorporated into the simulation studies or the applied example. McLachlan [74] points out that in MCEM a Monte Carlo error is introduced at the E-step and that the monotonicity property of the EM algorithm is lost, in that sequential steps may not result in an increase of the likelihood. Yet it is posited that, with an adaptive procedure for increasing the size of the drawn sample, the MCEM algorithm will move towards the maximizer with high probability [7, 85]. Wei and Tanner [109] recommended that the size of the sample be increased as the algorithm moves closer to convergence, thus equation 3.4.2 can be written as

$$Q^{(t+1)}(\Phi \mid \Phi^{(t)}) = \sum_{i=1}^{n} \frac{1}{m_{g(t)}} \sum_{l=1}^{m_{g(t)}} l_c(\Phi \mid y_i, x_{il}^{U}) \tag{3.4.3}$$

where $m_{g(t)}$ is the size of the sample drawn from the sampling distribution, which is itself a function of the iterative step. By using a function of the current step of the EM algorithm, $g(t)$, a more flexible representation of how to increase the sample size is obtained. In its simplest form, $g(t) = t$, with a sample size increase at each step of the algorithm. A more sophisticated approach is to monitor the induced Monte Carlo error so that $m$ is increased when changes in the parameter become dominated by the Monte Carlo error itself [7]. Since Booth and Hobert proposed the increase to be applied to the following step, the sample size can be thought of as a function of the step itself. The method requires independent Monte Carlo samples in the E-step to allow for computationally inexpensive and straightforward assessment of Monte Carlo error through an application of the central limit theorem.
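To make the role of independent Monte Carlo samples and the central limit theorem concrete, the following is a minimal sketch of classical Monte Carlo integration with a sample size that grows with the iteration. It is an illustration only and not the implementation used in this thesis: the function names, the base size `m0 = 100`, the linear schedule standing in for $g(t) = t$, and the toy target $E[X^2]$ with $X \sim N(0, 1)$ are all assumptions made for the example.

```python
import math
import random


def mc_expectation(h, sampler, m, rng):
    """Estimate E_f[h(X)] from m independent draws; return (mean, std_error)."""
    values = [h(sampler(rng)) for _ in range(m)]
    mean = sum(values) / m
    # v_m = (1 / m^2) * sum (h(x_j) - mean)^2 estimates the variance of the mean.
    v_m = sum((v - mean) ** 2 for v in values) / (m * m)
    return mean, math.sqrt(v_m)


def sample_size(t, m0=100):
    """Hypothetical schedule for m_{g(t)} with g(t) = t: linear growth in t."""
    return m0 * t


rng = random.Random(2009)
h = lambda x: x * x                     # toy target: E[X^2] = 1 for X ~ N(0, 1)
draw = lambda r: r.gauss(0.0, 1.0)      # independent draws from f

results = []
for t in range(1, 6):
    m = sample_size(t)
    estimate, std_error = mc_expectation(h, draw, m, rng)
    results.append((t, m, estimate, std_error))
    # CLT-based approximate 95% half-width for the Monte Carlo error.
    print(f"t={t} m={m} estimate={estimate:.3f} half-width={1.96 * std_error:.3f}")
```

The half-width shrinks roughly like $1/\sqrt{m}$, which is the quantity an adaptive rule would compare against the change in the parameter estimates between iterations.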
Levine and Casella [61] proposed a method for monitoring Monte Carlo error in the MCEM algorithm in the presence of dependent Monte Carlo samples. Although there are two recent proposals for monitoring the Monte Carlo error for the MCEM algorithm, the process is not yet truly automatic. Both approaches can determine when $m$ should be increased, but neither suggests by how much $m$ should be increased. This is an important feature, since the operating hypothesis is that by increasing the sample size as the EM algorithm nears convergence, the effect of the Monte Carlo error can be minimized, thus allowing the algorithm to get closer to the maximizer with high probability. Levine [62] considers a central limit theorem based method for gauging Monte Carlo error directly on the EM estimates in the presence of dependent Monte Carlo samples. Using the asymptotic results, an adaptive rule for updating the Monte Carlo sample size at each iteration of the MCEM algorithm is proposed.

Another proposition is to allow each subject to have a subject-specific Monte Carlo sample size, $m_i$. In practice this recommendation, given by Ibrahim, is not implemented [44, 46, 48]. This proposition manifests as

$$Q^{(t+1)}(\Phi \mid \Phi^{(t)}) = \sum_{i=1}^{n} \frac{1}{m_i} \sum_{l=1}^{m_i} l_c(\Phi \mid y_i, x_{il}^{U}) \tag{3.4.4}$$

where $m_i$ is the Monte Carlo sample size for the $i$th subject. If the aforementioned adaptive Monte Carlo sample size methodologies were integrated, the E-step would become

$$Q^{(t+1)}(\Phi \mid \Phi^{(t)}) = \sum_{i=1}^{n} \frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} l_c(\Phi \mid y_i, x_{il}^{U}) \tag{3.4.5}$$

where $m_{g_i(t)}$ is the Monte Carlo sample size for the $i$th subject as a function of the $t$th iterative step in the MCEM algorithm.

3.4.2.1 Monte Carlo EM: the expectation step for imperfect variables

The implementation of this method requires the application of equation 3.4.5 to the conditional expectation of the log-likelihood given the observed information, equation 3.3.12. Recall that equation 3.3.12 has four components. The first is not of direct interest in this setting since it is the complete case, or perfect data, contribution to the likelihood.
The remaining three compo nents provide the complexity to the problem, hence are the direct points of interest. Of the three remaining components in equation 3.3.12, the first is com prised of two parts, each pertaining to one aspect of the mechanism of imper fection: missing information and mismeasurement. Here, we will introduce the notation xjt+1) to indicate the 1’ Monte Carlo sample (xt ,x) for 1 jth subject and th covariate for the t + application of Monte Carlo integration, the [ (i) Q (i) = + EM iteration. With the Q (7I7)] is approximated by mg.(t) p (t+i) ( t) 7 17 [logp(RijIr1,rii, 1 m g = . . 7 ,rjj_i,xt+1),y t ))+ j, 1=1 M logp(R In1,. where . A(t+) . ((x’),x),. . , . . nj_i, x , , (x,x)). yi, M(t) The response model is approximated by . (t) 9 m = 1g%(t) m l g 0 p(yjxt,t)), (3.4.6) 67 and the covariate likelihood [ Q (i) = ((t)) + Q (i)] is approximated by mg.(t) p (t+i) 1 (j, , (t)) . . = m 1 () i=i ,x,9(t))+ j=i A(t+1) 1 ogp1 X- 1 x 1 , . . A(t+1) 1 P(t) i(j—1)1’ — The notation Q(t+1)(.) brings emphasis to the fact that this is the approxi mation of the estimating equation for the (t + l)th iteration of the MCEM algorithm; it is reasonable to drop the suffix (t + 1) which indexes the iter ation for which the estimation is being made. Although notationally com plex, the core idea remains constant; generate a large sample of data from p(XAlt), x, r, y,) and then average over the sample. 3.4.3 Gibbs sampler A key idea of the MCEM algorithm is to use a well chosen sample from the distribution of unobservable variables which are then used to approximate an expectation through the use of an empirical average, thus we will turn our attention to how the sample is to be obtained before proceeding to a full representation of the MCEM algorithm for imperfect variables within a generalized linear model framework. 
The Gibbs sampler can be traced back to the development of a fast computing algorithm for calculating the properties of any substance which may be considered as being composed of interacting individual molecules [78], with further developments by Hastings [41]. In the mid 1980s, the Gibbs sampler enjoyed renewed interest with Geman and Geman's [33] introduction of the Gibbs sampler in a Bayesian approach to image restoration [31]. Here, the Gibbs sampler is used to generate realizations from a Markov random field through an exploitation of the equivalence between Gibbs distributions and Markov random fields. Interest in the Gibbs sampler was renewed again with Gelfand and Smith's [31] paper, which considers the Gibbs sampler as a method by which numerical estimates of non-analytically available marginal densities can be obtained. An objective of that paper is to translate the Gibbs sampler, which until then was widely known in the image-processing literature, for more general and conventional statistical problems. Two years later, Casella and George [16] provided an accessible and insightful paper concerning the Gibbs sampler in which they explain both how and why it works.

To begin, we will consider the general form of the Gibbs sampler before interpreting it for imperfect variables within a generalized linear model framework, then we will briefly review log-concavity, and finally we will present a general approach using Gibbs adaptive rejection sampling for imperfect variables within a generalized linear model framework.

Suppose we have a joint density $p(X \mid \Phi)$ indexed by $\Phi$ and we want to obtain the characteristics of the marginal density $p(X_j \mid \Phi_j)$, where $\Phi_j$ indexes the marginal distribution for the $j$th variable and $\Phi = (\Phi_1, \ldots, \Phi_p)$. A straightforward approach would be to calculate $p(X_j \mid \Phi_j)$ directly and then use the marginal density to obtain the desired characteristics. Obtaining the marginal distribution of $X_j$ directly may not be feasible.
The integral may be analytically intractable and numerical methods may be overly difficult to implement. In these cases, the Gibbs sampler provides an alternative method for obtaining $p(X_j \mid \Phi_j)$.

With Gibbs sampling, we do not compute or approximate $p(X_j \mid \Phi_j)$ directly, but rather generate a sample $X_{1j}, \ldots, X_{mj}$ from $p(X_j \mid \Phi_j)$. By simulating a large enough sample, the empirical distribution of the sample is used to compute the desired characteristics [16]. It is worth reminding ourselves that although an empirical distribution based on a simulated sample is being used to obtain the characteristics of the marginal distribution, if $m$ is large enough the population characteristics and even the density itself can be obtained with a specified degree of accuracy. For example, if we are interested in the mean of the marginal distribution of the $j$th random variable, then

$$\lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} X_{ij} = \int x_j\, p(x_j \mid \Phi_j)\,dx_j = E(X_j).$$

A fundamental assumption is that we can simulate from the univariate conditional densities $p_1, \ldots, p_p$, where $p_j = p(X_j \mid x_{(-j)})$ are called the full conditionals and $x_{(-j)} = (x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p)$ is the vector $x$ with the $j$th variable removed. The Gibbs sampling algorithm, or Gibbs sampler, for the $(t+1)$th sample given the $t$th sample $x^{(t)} = (x_1^{(t)}, \ldots, x_p^{(t)})$ is

$$\begin{aligned}
X_1^{(t+1)} &\sim p_1(x_1 \mid x_2^{(t)}, \ldots, x_p^{(t)}) \\
X_2^{(t+1)} &\sim p_2(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)}) \\
&\;\;\vdots \\
X_p^{(t+1)} &\sim p_p(x_p \mid x_1^{(t+1)}, \ldots, x_{p-1}^{(t+1)}).
\end{aligned} \tag{3.4.7}$$

A readily apparent feature of the Gibbs sampler is that the full conditionals are the only distributions used for the simulation, thus even in high-dimensional problems, all of the simulations may be done using univariate full conditionals. Since the Gibbs sampler is being used as a tool to implement the EM algorithm via Monte Carlo integration, we will not discuss the technicalities of obtaining a sample from a stationary distribution. Such technical details are well documented in texts such as Robert and Casella [85] or Gelman, Carlin, Stern and Rubin [32].
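The updating scheme described above, in which each variable is drawn in turn from its univariate full conditional, can be sketched for a case where the full conditionals are known in closed form. The example below is an illustrative assumption and not part of the thesis: it uses a standard bivariate normal with correlation `rho`, for which $X_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$ and symmetrically for $X_2 \mid x_1$. The function name, starting values and burn-in length are likewise hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Only the two univariate full conditionals are ever sampled from.
    """
    cond_sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0                        # arbitrary starting values
    draws = []
    for t in range(burn_in + n_samples):
        x1 = rng.gauss(rho * x2, cond_sd)    # update X1 given the current x2
        x2 = rng.gauss(rho * x1, cond_sd)    # update X2 given the new x1
        if t >= burn_in:
            draws.append((x1, x2))
    return draws


rng = random.Random(1)
draws = gibbs_bivariate_normal(rho=0.5, n_samples=20000, burn_in=1000, rng=rng)

# Marginal characteristics of X1 recovered from the simulated sample.
mean_x1 = sum(d[0] for d in draws) / len(draws)
var_x1 = sum((d[0] - mean_x1) ** 2 for d in draws) / (len(draws) - 1)
print(f"marginal mean of X1: {mean_x1:.3f}, marginal variance of X1: {var_x1:.3f}")
```

The empirical mean and variance of the first coordinate should be close to the known marginal values of 0 and 1, which mirrors the use of the empirical distribution to recover marginal characteristics.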
Adaptive rejection sampling is one approach used to obtain samples from the full conditionals for which details are given in Appendix B.! for details. 3.4.3.1 Gibbs sampler for imperfect variables From the previous section, most of the general work for the specification of the Gibbs sampler for imperfect variables has been done. Equation 3.4.7 is ‘70 already close to the general form needed for imperfect variables. The joint (t), x, r, Yi). 1 distribution from which samples are to be taken is Pr (X4 If we have p imperfect random variables, then for the th subject we have (x-y), (t) Xt+1) p r, y); , . t) (xi (t), 4 , 7 r, (t+1), U(t) (xI , 7 t), p(t), ,(t) r, Xt1 Xt+1) 4 t ) (t+1) . J(t) Yi) Yi) , Notice that the indexing parameter for each full conditional is itself indexed t), (t), çt))T 7 such that the th full conditional is indexed by ( This in dexing is a result of the conditional-conditional approach for specifying the joint probability distribution of the imperfection indicator and the joint dis tribution of the covariates (Sections 2.5.2 and 2.5.6). Currently, we have the full conditionals for the joint distribution of XA. For imperfect variables, we further decompose the sampling distribution to yield (x. i 1 7) Xt+1) Xt+1) t t 4 t)x+1)rixt) (XI , 7 t), p (x ..t) p (t) (x 7 r, (t), (t) ,,(t) E(t) (t) ,,(t) , .. xu Yi) . r, xg(t+1), (t+1) r, . Yi) , U(t+1) U(t+1) u(t+1) Yi) (3.4.8) With the sequence of full conditional distributions specified, we can consider the structure of the th set of full conditionals, (t+1) X p/ I (t) E(t) , , U(t+1) r, x 1 , U(t+1) . U(t) , . . . , U(t) x , 71 and XE(t+J ‘‘- = where p (i3 (t) j (t+1) , , r, x 1U(t+1) , ( t 7 ), 13 (t) (t))T U(t+1) . . . , 1 _ 3 x U(t) , U(t) . , , denotes the set of updated covariates and X) be the set of covariates which have yet to be updated. 
The full conditionals reduce to the following p/ Ij, E u r, X(_), Vi) p(X,YjI) = I E U p Xjj,X(_),Yi j cxp(Rx,y, 7)p(YjIx, /3, )p(X/,) and I E — P o where X_) i3ij,X_Yi j p(XI) denotes the universa with the (3.4.9) th information removed. 3.4.4 Maximization Since (it) is equivalent to a sum of uniquely indexed functions, the maximization of ( t 1 ) reduces to the maximization of each compo (t)) 1 nent of ( (Appendix B.2.1). Given that we are working within the generalized linear model and that we are working with a binary outcome, 1313 ( t) can be maximized using the standard iteratively re-weighted lin ear regression method typically used for binary logistic regression. With this approach, the maximization is the same as with complete data with weights [44, 109]. For example, if the jth subject has a perfect set of covariates, then it is given a weight of 1, but if the jth subject has an imperfect variable, then each contribution from the Monte Carlo sample is given weight m g(t) 72 (t)) 7 (y This approach can be directly applied to the maximization of since the components are themselves Bernoulli random variables modelled t)) 1 with a binary logistic regression. The maximization for Q (, cannot be explicitly given without applying a distributional assumption, thus its maximization will be discussed in context of particular examples. 3.5 Louis standard errors Although there are a variety of strategies for obtaining the standard error of the EM based parameter estimates, Louis’ approaches method was chosen for several reasons. The first is that the information matrix and related derivatives can provide information about the rate of convergence. As well, the EM algorithm can be modified to increase rate of convergence [74]. 
The second reason is that the likelihood surface is sufficiently complex that any movement towards understanding this surface can be considered an advantageous step in understanding the overall problem of imperfection and how it may affect the likelihood itself. A third reason is that the mapping induced by the EM algorithm can be expressed in terms of the information matrix and related objects. As before, any movement towards a deeper understanding of the EM algorithm can be considered advantageous. To this end, Louis' approach, based on an analytic derivation of the information matrix, is a reasonable venture.

An attractive feature of this method is that it can be embedded in the EM algorithm itself, with the evaluation occurring only after the EM algorithm has converged. Computationally this is attractive since potentially large matrices do not need to be stored, nor are computationally intense inversions of these matrices required for each EM iteration. With the implementation of the EM algorithm, both the estimates and the associated standard errors are produced. A secondary feature which made Louis' method popular was that the gradient and second derivative matrix were derived from the complete-data log-likelihood and not the incomplete-data log-likelihood. Key to the proposed method is the ability to extract the observed information matrix from the complete-data log-likelihood.

Proceeding with the Louis standard error, denoted $SE_L$, will require the introduction of some notation. This will be done as it is needed. To begin we will consider the case for a single random variable before proceeding with the situation of $n$ independent cases. Recall from equation 2.5.2 that the $i$th complete-data log-likelihood is

$$l_c(\Phi \mid x_i^U, y_i) = \log p(X_i^U, Y_i \mid \Phi) = \log p(R_i \mid x_i, y_i, \gamma) + \log p(Y_i \mid x_i, \beta, \phi) + \log p(X_i \mid \psi).$$

For imperfect variables, $x_i$ is always missing. What is observed is an opaque version of $x_i$.
For the jth subject, the observable distribution is p(X, R, YI) = f. . YjI)p(XIt), x, r, yj)dxj . thus we can define r, y) = log (i•• Jr(XY, x, r, Yi)dxi). The score vector for the subject associated with the complete-data log likelihood, lc(IxY, y) is denoted S(x’, y) and the score vector associated with the observed-data log-likelihood, l*(Ixp, r, y), is denoted S*(Ix, r, y). The Hessian matrix for the complete-data log-likelihood is denoted H(Ix’, ) and the for the observed-data log-likelihood, the Hessian is denoted as r, yj). Now under the regularity conditions which permit differentiating under an integral sign [15] we have S*(Ix,rj,yj) (3.5.1) 74 with the details given in appendix B.3.1. Furthermore, with satisfying S*(Ix,rj,yj) =0. Now we turn our attention to deriving the observed (incomplete-data) information matrix. Orchard and Woodbury [83] set down the basic princi ples for correctly specifying the information matrix in the context of missing data. Interestingly enough, the conceptual framework used by Orchard and Woodbury is similar to the imperfect variable framework currently under discussion. The similarity may stem from the common assumption that the data contains unobservable information. The goal for deriving the informa tion matrix, which provides direct link to estimating standard errors, is to adjust the information matrix obtained from the observed data with the in formation that has been lost due to non-observance. 
Louis [70] showed that the expected information matrix for for the unobservable data x given the observed data x and the assumption that we have a regular exponential family distribution in the canonical form [74] is given by Im(Ix) =Cov -E (S(IxV,y))] x [S(Ix,yj) -E (S(Ixy,yj))]Tx] =E — S*(Ix, r, yj)S* (Ix1, r, .)T (3.5.2) thus the information matrix for the observed data is I(Ix) = ‘(I) Im(IX) =E (Ic(IX)Ixfl - E .)T + S*(Ix, r, yj)S*(Ix, r, (3.5.3) 75 where the first part of equation 3.5.3 is the conditional expectation of the complete-data information matrix while the latter two parts give the ex pected information information for the conditional distribution of the completedata given the observed data. The Louis formula for the observed, incompletedata, information matrix for the MCEM based maximum likelihood solution is given by I(Ix) -_E (I(IxV) ix) E - x) T (s(EIx’, y)S(Ix’, yi) T +S*(EIxf,rj,y j)S*(x,rj,yj) (3.5.4) Details concerning the development of equations 3.5.3 and 3.5.4 are given in appendix B.3.2. 3.5.1 Monte Carlo wit Louis standard error for imperfect variables From equation 3.5.3 we have three expectations for which Monte Carlo in tegration will be used. Following Wei and Tanner [1091, the Monte Carlo estimate of S*(xE, r, y) is .*(ijxEriyi) =E[S(Ix’,yj)Ixf,rj,yjj m ( 9 t) 1 9( ) 1=1 mg.(t) mg(t) where x = (xj,x,rj,yj) and x 1 second term of equation 3.5.3 is = , 1 (x . .. The estimate of the . (t) 9 m E (s(Ix’, yj)S(xV, y) Ix) T (S(Ix, yj)S(Ix, yi) ) T m(t) m ( 9 t) m(t) T Z 76 The first term in equation 3.5.3 is approximated by E (Ic(IXV)Ixfl = m. (t) 1 mg.(t) I(IX) 1z1 mg.(t) 2 a = mgi(t) l(Ix, ye). 
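Pulling these three approximations together transforms equation 3.5.3 into

$$\begin{aligned}
I(\Phi \mid x) \approx{} & -\frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \frac{\partial^2}{\partial\Phi\,\partial\Phi^T}\, l_c(\Phi \mid x_{il}, y_i) \\
& - \frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \left[\frac{\partial}{\partial\Phi}\, l_c(\Phi \mid x_{il}, y_i)\right] \left[\frac{\partial}{\partial\Phi}\, l_c(\Phi \mid x_{il}, y_i)\right]^T \\
& + \left[\frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \frac{\partial}{\partial\Phi}\, l_c(\Phi \mid x_{il}, y_i)\right] \left[\frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \frac{\partial}{\partial\Phi}\, l_c(\Phi \mid x_{il}, y_i)\right]^T
\end{aligned} \tag{3.5.5}$$

and equation 3.5.4 into

$$\begin{aligned}
I(\hat\Phi \mid x) \approx{} & -\frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \frac{\partial^2}{\partial\Phi\,\partial\Phi^T}\, l_c(\hat\Phi \mid x_{il}, y_i) \\
& - \frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \left[\frac{\partial}{\partial\Phi}\, l_c(\hat\Phi \mid x_{il}, y_i)\right] \left[\frac{\partial}{\partial\Phi}\, l_c(\hat\Phi \mid x_{il}, y_i)\right]^T \\
& + \left[\frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \frac{\partial}{\partial\Phi}\, l_c(\hat\Phi \mid x_{il}, y_i)\right] \left[\frac{1}{m_{g_i(t)}} \sum_{l=1}^{m_{g_i(t)}} \frac{\partial}{\partial\Phi}\, l_c(\hat\Phi \mid x_{il}, y_i)\right]^T.
\end{aligned} \tag{3.5.6}$$

Chapter 4

Simulation studies

Two simulation studies will be executed. The first study will be based on a binary logistic regression with two covariates. One covariate will suffer from missing data and the other will suffer from measurement error. The second study will use a binary logistic regression model with a set of perfectly measured covariates and a single imperfect covariate. The imperfect covariate will suffer from both missing data and measurement error. It is recognized that there are many parameters involved with the utilized models. Due to the embryonic nature of this investigation, we will focus on the performance of the response model parameters.

4.1 Global aims of the simulation studies

There are three general and shared aims for the two simulation studies. The first is to determine if the combined problem of missing data and mismeasurement is computationally feasible. This will be considered from a resource management point of view, with a particular focus on computational time and storage requirements. The second goal is to identify how the combined problem of missing data and mismeasurement affects parameter estimation. This goal derives from the observation that in many situations complete case analysis is chosen for its simplicity of execution and low cost in resources. The third is to assess the effect of the MCEM adjustment on parameter estimation.

Table 4.1: Notation and combinations for the missing data mechanism that generates the data and the missing data mechanism assumed by the fitted model. The missing data mechanism takes the form of a binary logistic regression.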
Case   True Mechanism   Fitted Mechanism
1      MAR              MAR
2      MAR              NMAR
3      NMAR             MAR
4      NMAR             NMAR

4.2 General structural elements and procedure of the simulation studies

Each simulation study will be a composite of simulation scenarios for which a global scenario will be subsequently presented. For each scenario moderately independent simulations will be used, which use the same set of simulated data sets to compare statistical methodologies [10]. For each scenario, we will have S simulations. Within a scenario, the simulated data sets are independent, but the results from the methodologies under investigation are like matched pairs, reducing the effects of between sample variation.

Table 4.1 is an example of four scenarios which serve to construct a study of how the specification of the missing data mechanism in the model affects parameter estimation. It is clear from table 4.1 that we can consider two modelling situations: one in which we correctly specify the missing data mechanism in the model and one in which we incorrectly specify it. Outside of highly controlled situations, accurately specifying all pieces of a complicated model is highly unlikely. Often, the best we can do is make a reasonable conjecture based on prior knowledge about the system under investigation and on the data we have observed. It may not be feasible to consider all four cases for each aspect of the study. In these situations only cases 1 and 4 will be considered.

The MCAR mechanism is not being directly investigated because the literature suggests a complete case analysis will yield unbiased estimates. The problem should reduce to a mismeasurement problem. With missing data, one feature of interest which is not explored is how the proportion of missing information affects the quality of the estimator.
Given that the motivating example typically had proportions of missing data in the range of 10% to 20%, the goal when altering the missing data mechanism is to keep the proportion of missing data in this range. This decision reflects the most common structure of social-epidemiological data related to the motivating example.

4.3 Evaluative measures

The same set of evaluative measures will be used for both simulation studies. We will consider a variety of ways to summarize the simulation results to create a more complete picture of both the problem of ignoring imperfect data and the effect of using a maximum likelihood type method to adjust for the presence of imperfect variables.

4.3.1 Measures of bias and accuracy

In the current context, the parameter of interest is the estimated vector of regression coefficients of the response model. For simplicity, consider the $j$th parameter, $\beta_j$. The expectation of the $j$th coefficient of the response model is estimated by

$$\hat{E}(\hat\beta_j) = \frac{1}{S} \sum_{s=1}^{S} \hat\beta_{js}$$

where $s$ indexes the simulations in a particular scenario. The estimate of the variance is given by

$$\widehat{\operatorname{Var}}(\hat\beta_j) = \frac{1}{S-1} \sum_{s=1}^{S} \left(\hat\beta_{js} - \hat{E}(\hat\beta_j)\right)^2$$

with the estimate of the standard deviation across the $S$ simulated data sets given as

$$\widehat{SD}(\hat\beta_j) = \sqrt{\widehat{\operatorname{Var}}(\hat\beta_j)}.$$

The standard deviation of $\hat\beta_j$ will be reported since it puts the measure of variability on the same scale as the estimate itself.
The bias is

$$\operatorname{Bias}(\hat\beta_j) = E(\hat\beta_j) - \beta_j$$

which is estimated by

$$\widehat{\operatorname{Bias}}(\hat\beta_j) = \hat{E}(\hat\beta_j) - \beta_j$$

where the bias itself can be thought of as being drawn from its own sampling distribution, thus the variance of the estimated bias is

$$\operatorname{Var}\left(\widehat{\operatorname{Bias}}(\hat\beta_j)\right) = \operatorname{Var}\left(\hat{E}(\hat\beta_j) - \beta_j\right) = \frac{\operatorname{Var}(\hat\beta_j)}{S}$$

which is estimated by

$$\widehat{\operatorname{Var}}\left(\widehat{\operatorname{Bias}}(\hat\beta_j)\right) = \frac{\widehat{\operatorname{Var}}(\hat\beta_j)}{S}$$

with the standard deviation of the bias across the $S$ simulated data sets given by

$$\widehat{SD}\left(\widehat{\operatorname{Bias}}(\hat\beta_j)\right) = \sqrt{\frac{\widehat{\operatorname{Var}}(\hat\beta_j)}{S}}.$$

With these, we can construct the z-score for the bias under the assumption that the estimators are unbiased,

$$z_{\text{bias}} = \frac{\widehat{\operatorname{Bias}}(\hat\beta_j)}{\widehat{SD}\left(\widehat{\operatorname{Bias}}(\hat\beta_j)\right)}.$$

It is recognized that this is strictly a t-test statistic, but with sufficiently large $S$ and an associated $S - 1$ degrees of freedom, the normal distribution becomes a reasonable approximation. To this end we will use the notation $z_{\text{bias}}$ to emphasize the fact that we are using an approximation. A large z-score indicates a biased point estimator.

There is concern with the use of statistical testing within the context of simulation studies. Since many statistical tests can be considered to be a function of the number of simulations $S$, it is suggested that the use of this type of measure is dubious. Such statistics can be improved or penalized by running more or fewer simulations, respectively [10]. Burton suggests the standardized bias, which Demirtas proposed in 2004 without any referential antecedent [21]. It is defined as

$$\left(\frac{E(\hat\theta) - \theta}{SE(\hat\theta)}\right) \times 100.$$

A value of 100% means that the estimate, on average, falls one standard deviation above the parameter. It has been suggested by Demirtas that a standardized bias of 50% or less is practically significant [21]. The measure is used again in a simulation study concerning missing information within the context of longitudinal data, but that study cites only Demirtas' paper as the source of this measure [99]. With a wider search, a robust analog was found, but the measure of variability was directly related to the estimator of interest [50].
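The bias measures above can be computed directly from the $S$ simulated estimates of a single coefficient. The following is a minimal sketch under stated assumptions: the function name and the eight estimates are made up for illustration and are not results from the simulation studies, the variance uses an $S - 1$ denominator, and the standardized bias is scaled by the standard deviation of the estimates across simulations.

```python
import math
import statistics


def bias_summary(estimates, true_value):
    """Summarize S simulated estimates of one regression coefficient."""
    s = len(estimates)
    mean_estimate = statistics.fmean(estimates)
    sd_estimate = statistics.stdev(estimates)    # SD across the S simulations
    bias = mean_estimate - true_value
    sd_bias = sd_estimate / math.sqrt(s)          # SD of the estimated bias
    return {
        "bias": bias,
        "z_bias": bias / sd_bias,                 # grows with S through sd_bias
        "standardized_bias_pct": 100.0 * bias / sd_estimate,
    }


# Made-up estimates for illustration only; not results from the thesis.
estimates = [0.38, 0.47, 0.41, 0.52, 0.35, 0.44, 0.49, 0.40]
summary = bias_summary(estimates, true_value=0.41)
print(summary)
```

The two scaled measures differ only by a factor of $\sqrt{S}$: the z-score equals the standardized bias (as a proportion) multiplied by $\sqrt{S}$, which makes explicit why the z-score depends on the number of simulations while the standardized bias does not.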
In the proposed standardization, the standard error proposed for scaling is related to the sampling distribution of $\hat\beta_j$ and only indirectly related to the sampling distribution of the bias. Although there is an appreciation for what is intended with the standardized bias, the disconnect between the estimator and the standard error being used to standardize is problematic. The numerator is the estimated bias, which has its own sampling distribution and hence its own standard error. Even though the z-score has some subtle issues, it will be used with the caveat that no additional action was taken once the simulation study was completed to enhance the results, nor were preliminary results considered during the execution of the simulation. To this end, the z-score seems reasonable despite its dependence on $S$.

The mean squared error (MSE), often reported as the root mean squared error (RMSE) to put it on the same measurement scale as the parameter itself, offers an overall measure of accuracy. Incorporating both the bias and the variability associated with an estimator, it provides a succinct measure for comparing estimators. The MSE is given by

$$\operatorname{MSE}(\hat\beta_j) = E(\hat\beta_j - \beta_j)^2 = \operatorname{Var}(\hat\beta_j) + \operatorname{Bias}(\hat\beta_j)^2$$

which is estimated by

$$\widehat{\operatorname{MSE}}(\hat\beta_j) = \widehat{\operatorname{Var}}(\hat\beta_j) + \widehat{\operatorname{Bias}}(\hat\beta_j)^2.$$

The relative efficiency will be used as a comparative measure between the complete case estimate and the MCEM estimate. For the $j$th parameter, the relative efficiency is

$$e(\hat\beta_{CC,j}, \hat\beta_{MCEM,j}) = \frac{E(\hat\beta_{MCEM,j} - \beta_j)^2}{E(\hat\beta_{CC,j} - \beta_j)^2} = \frac{\operatorname{MSE}_{MCEM}(\hat\beta_j)}{\operatorname{MSE}_{CC}(\hat\beta_j)}$$

which is estimated as

$$\hat{e}(\hat\beta_{CC,j}, \hat\beta_{MCEM,j}) = \frac{\widehat{\operatorname{MSE}}_{MCEM}(\hat\beta_j)}{\widehat{\operatorname{MSE}}_{CC}(\hat\beta_j)}.$$

Recognizing that both $\widehat{\operatorname{MSE}}(\hat\beta_j)$ and $\hat{e}(\hat\beta_{CC,j}, \hat\beta_{MCEM,j})$ are themselves estimates which have associated sampling distributions, the bootstrap was used to estimate the respective associated standard deviations [26]. From the original estimate of the sampling distribution of $\hat\beta_j$ yielded by the simulation scenario, a bootstrap sample of size $B = 10{,}000$ was obtained for $\widehat{\operatorname{MSE}}(\hat\beta_j)$ and $\hat{e}(\hat\beta_{CC,j}, \hat\beta_{MCEM,j})$.
The bootstrap estimate of the standard deviation across the $S$ simulations for each estimator is

$$SD_B = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat\theta_b - \bar\theta_B\right)^2}$$

where $\bar\theta_B = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b$ and $\hat\theta_b$ is the $b$th bootstrap estimate of $\theta$. Here we let $\theta = \operatorname{MSE}(\hat\beta_j)$ for estimating $SD_{B,MCEM}(\operatorname{MSE}(\hat\beta_j))$, the standard deviation of the mean squared error across the $S$ simulated data sets; we let $\theta = e(\hat\beta_{CC,j}, \hat\beta_{MCEM,j})$ for estimating $SD_{B,e}(e(\hat\beta_{CC,j}, \hat\beta_{MCEM,j}))$, the standard deviation of the relative efficiency across the $S$ simulated data sets.

4.3.2 Confidence interval based measures

Each scenario within the simulation study consists of $S$ simulations. For each simulation, using the Louis estimate of the standard error, we can construct a confidence interval. From this we can directly compare the simulation-based coverage rate to the rate assumed in the construction of the confidence intervals. The confidence interval for the $j$th parameter and the $s$th simulation is constructed with the end points defined as

$$\hat\beta_{js} \pm z_{1-\alpha/2}\, SE_L(\hat\beta_{js})$$

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution and $SE_L(\hat\beta_{js})$ is the Louis standard error for the $j$th parameter and the $s$th simulation.

From the confidence interval we will consider two measures: mean length and coverage. The mean length can be used as an evaluative tool in simulation studies. If the parameter estimates perform well and exhibit a general unbiased property, then a narrower confidence interval suggests a more precise estimate as well as gains in efficiency and power [10, 19]. Coverage refers to the proportion of times the constructed confidence intervals contain the true value. Ideally, the coverage should be the same as the nominal coverage rate, that is, the coverage for the 95% confidence interval should be 95%. Over-coverage, where the coverage rate exceeds the proposed confidence level, indicates that the results are too conservative. In this situation, too many Type II errors are being made, which implies a decrease in statistical power.
On the other hand, under-coverage, where the coverage rate is less than the proposed confidence level, indicates that the results are too bold. In this situation, too many Type I errors are being made.

For assessing the coverage, Burton [10] suggests a rough yet inferentially based approach where an acceptable simulation-based assessment is one that does not fall outside two standard errors of the nominal coverage probability. The interesting feature is that Burton proposes a confidence interval based approach to assessing the coverage under the null hypothesis $H_0$: coverage $= 0.95$, but confidence interval construction with the intent of drawing inference is typically performed under the alternate hypothesis [56]. Casella [15] notes that in practice we will have both the null and alternate hypotheses in mind; this alternate will dictate the form of the acceptance region of a level-$\alpha$ test, which in turn influences the shape of the ensuing confidence set. Rather than taking a confidence interval approach where the $\alpha$-level is pre-set, we will use p-values based on a one-proportion z-test to test the sample-based coverage rate against the nominal rate used for its construction.

4.3.3 Robust measures

It has been observed that some of the simulations resulted in point estimates which could be classified as outliers. The median, denoted $m$, and the median of the absolute deviations from the median will be used as complements to $\hat{E}(\hat\beta_j)$ and $\widehat{\operatorname{Var}}(\hat\beta_j)$. The median of the absolute deviations from the median (MAD) is given by

$$\operatorname{MAD}(\hat\beta_j) = m\left(\left|\hat\beta_j - m(\hat\beta_j)\right|\right)$$

which is estimated with

$$\widehat{\operatorname{MAD}}(\hat\beta_j) = \hat{m}\left(\left|\hat\beta_{js} - \hat{m}(\hat\beta_j)\right|\right).$$

Furthermore, the difference between the median and the true value, which is a crude analog to the bias, will be given by

$$\delta_j = m(\hat\beta_j) - \beta_j$$

which is estimated by

$$\hat\delta_j = \hat{m}(\hat\beta_j) - \beta_j.$$

4.4
4.4 Simulation study 1: binary logistic regression with two imperfect explanatory variables

We will explore the situation where the two types of imperfection co-exist in a single data set, but do not occur simultaneously in the same variable. One covariate, $X_1$, will suffer from missing data while the other, $X_2$, will suffer from measurement error. To help us understand the effect of imperfect variables on parameter estimation we will break the simulation study into four smaller components. Each part constitutes a comparison where all the features of the simulation are held constant while allowing one feature to change. We will compare

• missing data mechanisms,
• the effect of the assumption the model makes about the missing data mechanism,
• the effect of sample size, and
• the effect of different distributional assumptions for the error term in the measurement error model.

The assumption of nondifferential, classical measurement error will remain constant across all scenarios in this simulation study, as will the assumption that $\epsilon \sim N(0, \tau^2)$. To prevent model misspecification, $\tau^2$ will be assumed known [40].

The target number of simulations for each scenario was 150 simulated data sets. For most components, eight runs of 25 simulations were executed. If the number of completed simulations fell below 150, then two additional runs of 25 simulations were executed in order to bring the total number of simulations to approximately 150. The number of simulations is therefore not constant across scenarios; this will be addressed in a subsequent section devoted to discussing the implementation problems of the MCEM algorithm within this context.

4.4.1 Likelihood model for simulation study 1

We will begin with equation 2.5.4 where

$$\ell_c(\theta \mid x_i, y_i) = \log p(R_i \mid x_i, y_i, \gamma) + \log p(Y_i \mid x_i, \beta, \phi) + \log p(X^{*}_i, X_i \mid \psi);$$

recall that $\log p(R_i \mid x_i, y_i, \gamma)$ is the model of the mechanism of imperfection, $\log p(Y_i \mid x_i, \beta, \phi)$ is the response model and $\log p(X^{*}_i, X_i \mid \psi)$ is the covariate model.
For the covariate model, we begin with equation 2.5.6, thus

$$p(X^{*}_{i}, X_{i} \mid \psi) = p(X^{*}_{i2} \mid x^{*}_{i1}, x_{i}, \psi)\, p(X^{*}_{i1} \mid x_{i}, \psi)\, p(X_{i2} \mid x_{i1}, \psi)\, p(X_{i1} \mid \psi).$$

Under the specifications of this simulation study, $x^{*}_{i1} = x_{i1,obs}$ if the realization is observed or $x^{*}_{i1} = x_{i1,miss}$ if the realization is not observed, thus the first covariate contributes only through $p(X_{i1} \mid \psi)$. The second covariate suffers only from measurement error, so

$$p(X^{*}_{i2} \mid x^{*}_{i1}, x_{i}, \psi) = p(X^{*}_{i2} \mid x_{i2}, \psi).$$

Assuming a bivariate normal distribution for the joint probability distribution of $(X_1, X_2)$, we have

$$(X_{i1}, X_{i2})^T \sim MVN(\mu, \Sigma),$$

where $\mu = (0, 0)^T$ and

$$\Sigma = \begin{bmatrix} 1 & 0.2 \\ 0.2 & 1 \end{bmatrix},$$

thus we have $X_{i1} \sim N(0, 1)$ and $X_{i2} \mid x_{i1} \sim N(0.2\,x_{i1}, 0.96)$. Under the assumption of an unbiased classical imperfection model, equation 2.5.10 becomes

$$x^{*}_{i2} = x_{i2} + \epsilon_{i2},$$

where $\epsilon_{i2} \sim N(0, \tau^2)$; therefore $X^{*}_{i2} \mid x_{i2}, \tau^2 \sim N(x_{i2}, \tau^2)$. Often it is assumed that social-epidemiological data sets are noisy, so we initially let $\tau^2 = 1$.

The response is a Bernoulli random variable with the binary logistic regression response model given in section 2.5.5, which is indexed by $\beta = (-1.39, 0.41, -0.69)^T$.

Table 4.2: Coefficient values for different missing data mechanisms used to generate simulated data sets. The systematic component is specified as $\gamma_0 + \gamma_1 x_{i1} + \gamma_2 x_{i2} + \gamma_3 y_i$.

Mechanism   MAR                        NMAR
A           $(-2, 0, 1.5, \ldots)^T$   $(-2, 0.75, 1.5, \ldots)^T$
B           $(-2, 0, 1.5, 0)^T$        $(-2, 0.75, 1.5, 0)^T$
C           …                          …

The mechanism of imperfection model is given in section 2.5.6 and we will retain the definition for the imperfection indicator given in section 2.3. It is clear from equation 2.5.14 that the conditional expression of the joint distribution of the imperfection indicators allows for a dependent structure between the indicator for missing data and the indicator for mismeasurement. Also, it allows for a dependent structure across the set of imperfection indicators.
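The covariate, measurement error and response models specified above can be simulated directly. The following Python sketch is an illustration only: the function name, the argument names and the use of NumPy are assumptions of the sketch, and the sample size is left as an argument.

```python
import numpy as np


def simulate_covariates_and_response(n, beta=(-1.39, 0.41, -0.69),
                                     corr=0.2, tau2=1.0, seed=0):
    """Draw (x1, x2, x2_star, y) for n subjects under the models of section 4.4.1."""
    rng = np.random.default_rng(seed)
    # True covariates: X1 ~ N(0, 1) and X2 | x1 ~ N(corr * x1, 1 - corr^2).
    x1 = rng.normal(0.0, 1.0, size=n)
    x2 = rng.normal(corr * x1, np.sqrt(1.0 - corr ** 2))
    # Classical, unbiased measurement error on the second covariate.
    x2_star = x2 + rng.normal(0.0, np.sqrt(tau2), size=n)
    # Bernoulli response from the logistic model with the true covariates.
    eta = beta[0] + beta[1] * x1 + beta[2] * x2
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return x1, x2, x2_star, y
```

With $\tau^2 = 1$ the recorded covariate has roughly twice the variance of the true one, which is the noisy setting described above.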
Two simplifying assumptions about the mechanism of imperfection will be made:

• the $j$th and $k$th indicators of imperfection are independent for all $j = 1, \ldots, p$ and $k = 1, \ldots, p$, and
• the indicator for missing data and the indicator for mismeasurement are independent.

Predicated on these two assumptions, equation 2.5.14 becomes

$$p(R_i \mid x_i, y_i, \gamma) = p(R_{i1} \mid x_i, y_i, \gamma),$$

where $R_{i1}$ is the missing data indicator for $X_1$, since all realizations from $X_1$ are accurately measured, $X_2$ is mismeasured for all $i = 1, \ldots, n$, and all realizations from $X_2$ are observable.

For the simulation, three different missing data mechanisms will be used (Table 4.2). The structure of the variables for the missing data mechanism model is the same for all three mechanisms, $\gamma = (\gamma_0, \gamma_1, \gamma_2, \gamma_3)^T$, where $\gamma_0$ corresponds to the intercept, $\gamma_1$ relates to $X_1$, $\gamma_2$ is associated with $X_2$, and $\gamma_3$ is the coefficient for $y$. With only one covariate suffering from missing data, the MAR mechanism has $\gamma_1 = 0$ and the NMAR mechanism has $\gamma_1 \neq 0$.

For mechanism A, the response was part of the missing data mechanism model. The simulation based mean proportion of data missing under mechanism A is 0.176 when it is MAR and 0.195 when it is NMAR, with respective ranges of 0.26 and 0.29. The specification of the missing data mechanism was motivated by the applied problem. With social-epidemiological research predicated on linked databases where one or more data sources includes census data for the expressed purpose of investigating vulnerable populations, it is reasonable to assume that for some cultural constructs, there will be a high proportion of missing information. The choice of having the proportion of missing data around 20% was to reflect this assumption.

Missing data mechanism B is a variation of mechanism A with only the dependence on the response being different. This will allow for an exploration of the effect of the response in the missing data mechanism on parameter estimation.
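The missing data mechanism can be sketched as a logistic model for the missing data indicator. The Python below is a hedged illustration: the function name is hypothetical, the indicator is coded as 1 = missing (an assumption of the sketch), and the mechanism is taken to act on the true $x_2$ rather than on its mismeasured version. The example values of $\gamma$ are those of mechanism B in Table 4.2.

```python
import numpy as np


def missing_indicator(x1, x2, y, gamma, rng):
    """Draw the missing data indicator for X1 (1 = missing, an assumed coding).

    The systematic component is gamma0 + gamma1*x1 + gamma2*x2 + gamma3*y,
    so gamma1 = 0 corresponds to MAR and gamma1 != 0 to NMAR.
    """
    eta = gamma[0] + gamma[1] * x1 + gamma[2] * x2 + gamma[3] * y
    p_missing = 1.0 / (1.0 + np.exp(-eta))
    return rng.binomial(1, p_missing)


# Example: proportion of missing X1 under mechanism B (the response plays no role).
rng = np.random.default_rng(2)
n = 200_000
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(0.2 * x1, np.sqrt(0.96))
y = np.zeros(n)  # placeholder: gamma3 = 0 for mechanism B
prop_mar = missing_indicator(x1, x2, y, (-2.0, 0.0, 1.5, 0.0), rng).mean()
prop_nmar = missing_indicator(x1, x2, y, (-2.0, 0.75, 1.5, 0.0), rng).mean()
```

Under these assumptions both proportions fall near the 20% target, with the NMAR version slightly higher.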
As with mechanism A, the proportions of missing information under the MAR and NMAR mechanisms are similar: 19% and 21% respectively.

With mechanism C, the roles of $\gamma_1$ and $\gamma_2$ were switched. For mechanisms A and B, $\gamma_2$ was twice as large as $\gamma_1$. Here, we kept the signs of the two coefficients the same as in the previous two mechanisms, but allowed $\gamma_1$ to be greater than $\gamma_2$. Additionally, the difference in size between the two coefficients was amplified. This was done to force a strong effect when the mechanism was NMAR. This resulted in a difference in the proportion of missing data for the MAR and NMAR cases: 12% and 19% respectively.

4.4.2 Simulation specific details for the Monte Carlo EM algorithm

Beginning with the approximation to 3.3.12, we have

$$Q_i(\theta \mid \theta^{(t)}) = \rho_i\,\ell_c(\theta \mid x_i, y_i) + (1 - \rho_i)\left[\tilde Q_i(\gamma \mid \gamma^{(t)}) + \tilde Q_i(\beta \mid \beta^{(t)}) + \tilde Q_i(\psi \mid \psi^{(t)})\right],$$

where $\rho_i$ denotes the product of the imperfection indicators of section 2.3 for subject $i$, which equals one only when every covariate of that subject is perfectly observed. Under the imperfection mechanism model assumptions given in the previous section the Monte Carlo estimate of $Q_i(\gamma \mid \gamma^{(t)})$ is

$$\tilde Q_i(\gamma \mid \gamma^{(t)}) = \frac{1}{m_i^{(t)}}\sum_{l=1}^{m_i^{(t)}} \log p\left(R_{i1} \mid x_{il}^{(t+1)}, y_i, \gamma\right).$$

Recall that in section 3.4.2 we discussed how the sample size used in the expectation step can be made to increase as the algorithm gets closer to satisfying the convergence criterion, but indicated that we would not implement it. It will be assumed that the Monte Carlo sample size is the same for all iterative steps and all subjects. Under these assumptions $\tilde Q_i(\gamma \mid \gamma^{(t)})$ becomes

$$\tilde Q_i(\gamma \mid \gamma^{(t)}) = \frac{1}{m}\sum_{l=1}^{m} \log p\left(R_{i1} \mid x_{il}^{(t+1)}, y_i, \gamma\right).$$

It is recognized that this will not translate to the most efficient algorithm since in the early iterations of the EM algorithm a cruder approximation, based on smaller Monte Carlo samples, would be sufficient [7, 44, 46, 48]. The computational cost of this decision was not investigated.
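With a fixed Monte Carlo sample size $m$, the estimate of the mechanism component is simply an average of Bernoulli log-likelihoods over the draws. The following Python sketch assumes the draws are stored as arrays with one row per subject and one column per draw; the function name and array layout are hypothetical.

```python
import numpy as np


def q_gamma_mc(r, x1_draws, x2_draws, y, gamma):
    """Monte Carlo estimate of sum_i E[log p(R_i1 | x_i, y_i, gamma)].

    r, y               : arrays of shape (n,), indicator and response
    x1_draws, x2_draws : arrays of shape (n, m), m draws per subject
    gamma              : (gamma0, gamma1, gamma2, gamma3)
    """
    eta = (gamma[0] + gamma[1] * x1_draws + gamma[2] * x2_draws
           + gamma[3] * y[:, None])
    # Bernoulli log-likelihood r*eta - log(1 + exp(eta)), evaluated stably.
    loglik = r[:, None] * eta - np.logaddexp(0.0, eta)
    # Average over the m draws for each subject, then sum over subjects.
    return loglik.mean(axis=1).sum()
```

Averaging over the second axis gives every draw the weight $1/m$, which is the equal-weight form obtained once the sample size no longer depends on the subject or the iteration.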
The approximation to the expectation estimate of the response model given by equation 3.4.6 becomes

$$\tilde Q_i(\beta \mid \beta^{(t)}) = \frac{1}{m}\sum_{l=1}^{m} \log p\left(y_i \mid x_{il}^{(t+1)}, \beta\right).$$

With the covariate model specified in the previous section the estimate of $Q_i(\psi \mid \psi^{(t)})$ is

$$\tilde Q_i(\psi \mid \psi^{(t)}) = \frac{1}{m}\sum_{l=1}^{m}\left[\log p\left(x^{*}_{i2} \mid x_{i2l}^{(t+1)}, \psi\right) + \log p\left(x_{i2l}^{(t+1)} \mid x_{i1l}^{(t+1)}, \psi\right) + \log p\left(x_{i1l}^{(t+1)} \mid \psi\right)\right].$$

4.4.2.1 Gibbs adaptive rejection sampling

Beginning with equations 3.4.8, the structure of the Gibbs sampler for this simulation is

$$x_{i1}^{(l+1)} \sim p\left(X_{i1} \mid x_{i2}^{(l)}, x^{*}_{i2}, r_{i1}, y_i\right) \quad\text{and}\quad x_{i2}^{(l+1)} \sim p\left(X_{i2} \mid x_{i1}^{(l+1)}, x^{*}_{i2}, r_{i1}, y_i\right).$$

Equation 3.4.9 gives the general form of the full conditionals which become

$$p(X_{i1} \mid x_{i2}, x^{*}_{i2}, r_{i1}, y_i) \propto p(R_{i1} \mid x_{i1}, x_{i2}, y_i, \gamma)\, p(Y_i \mid x_{i1}, x_{i2}, \beta)\, p(X_{i2} \mid x_{i1}, \psi)\, p(X_{i1} \mid \psi)$$

and

$$p(X_{i2} \mid x^{*}_{i2}, x_{i1}, r_{i1}, y_i) \propto p(R_{i1} \mid x_{i1}, x_{i2}, y_i, \gamma)\, p(Y_i \mid x_{i1}, x_{i2}, \beta)\, p(X^{*}_{i2} \mid x_{i2}, \psi)\, p(X_{i2} \mid x_{i1}, \psi).$$

Each component of the conditionals is log-concave (Appendix B.1.5, Table B.1), thus each full conditional is itself log-concave. This permits the use of the Gibbs adaptive rejection sampler. The statistical package R was used to perform Gibbs Adaptive Rejection Sampling (ARS) and the recently developed function ars in the ars library was used [87].

The literature which includes a sensitivity analysis for the Monte Carlo sample size for MCEM implementation within a logistic regression setting strongly suggests that the results are insensitive to sample sizes ranging from 2000 to 10,000 [44, 45, 48]. Lipsitz seems to prefer a larger sample size and regularly used a Monte Carlo sample of 10,000, but makes no mention of burn-in [65, 66]. When burn-in samples are discussed, they are exceptionally small. Stubbendick [95] uses a Monte Carlo sample of size 100 within the context of a random effects model with no burn-in. Also, it is suggested that no difference was found between using a Monte Carlo sample of size 100 with no burn-in and using a sample of size 1000 with a burn-in sample of size 100. The burn-ins appear to be rather small; this may be a reason for the negligible differences.
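The alternation between the two full conditionals can be illustrated for a single subject. The Python sketch below is not the adaptive rejection sampler used in this study: it swaps in a random-walk Metropolis step for each coordinate, which targets the same full conditionals but is simpler to write down. All function names, the 1 = missing coding of the indicator and the step size are assumptions of the sketch; the default of 6000 draws with 1000 discarded as burn-in follows the values used in this simulation study.

```python
import numpy as np


def log_bernoulli(outcome, eta):
    """Log-likelihood of a Bernoulli outcome under a logistic model."""
    return outcome * eta - np.logaddexp(0.0, eta)


def log_normal(x, mean, var):
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_full_conditional(x1, x2, x2_star, y, r, beta, gamma, tau2):
    """Log of the joint density, proportional to either full conditional."""
    return (log_bernoulli(r, gamma[0] + gamma[1] * x1 + gamma[2] * x2 + gamma[3] * y)
            + log_bernoulli(y, beta[0] + beta[1] * x1 + beta[2] * x2)
            + log_normal(x2_star, x2, tau2)      # measurement error model
            + log_normal(x2, 0.2 * x1, 0.96)     # X2 | x1
            + log_normal(x1, 0.0, 1.0))          # X1


def sample_subject(x1_obs, x2_star, y, r, beta, gamma, tau2=1.0,
                   n_draws=6000, burn_in=1000, step=1.0, seed=0):
    """Metropolis-within-Gibbs draws of (x1, x2) for one subject.

    r = 1 means x1 is missing (assumed coding); x2 is always mismeasured.
    """
    rng = np.random.default_rng(seed)
    x1 = 0.0 if r == 1 else x1_obs
    x2 = x2_star
    kept = []
    for t in range(n_draws):
        if r == 1:  # update x1 only when it is missing
            prop = x1 + step * rng.normal()
            diff = (log_full_conditional(prop, x2, x2_star, y, r, beta, gamma, tau2)
                    - log_full_conditional(x1, x2, x2_star, y, r, beta, gamma, tau2))
            if np.log(rng.uniform()) < diff:
                x1 = prop
        prop = x2 + step * rng.normal()
        diff = (log_full_conditional(x1, prop, x2_star, y, r, beta, gamma, tau2)
                - log_full_conditional(x1, x2, x2_star, y, r, beta, gamma, tau2))
        if np.log(rng.uniform()) < diff:
            x2 = prop
        if t >= burn_in:
            kept.append((x1, x2))
    return np.array(kept)
```

Because both full conditionals are log-concave, adaptive rejection sampling can draw from them exactly; the Metropolis step above trades that exactness for brevity and produces correlated draws.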
Recognizing that automating the Monte Carlo sample size over the entire simulation was beyond the scope of inquiry for this investigation, a Monte Carlo sample of size 6000 was drawn from the full conditionals with the first 1000 being burn-in for the sampler; all results are predicated on the remaining Monte Carlo sample of size 5000.

4.4.2.2 Analytic solutions for maximization

The maximization step of the EM algorithm maximizes $Q(\theta \mid \theta^{(t)})$. With each component of $Q(\theta \mid \theta^{(t)})$ uniquely indexed, its maximization reduces to the maximization of each component. Both $Q(\beta \mid \beta^{(t)})$ and $Q(\gamma \mid \gamma^{(t)})$ are characterized by the binary logistic regression model, thus they can be maximized using the standard iteratively re-weighted least squares method, with the observed realizations and the Monte Carlo sample realizations weighted according to their contribution. The observed realizations from the experiment are given a weight of one, with each realization from the Monte Carlo sample given the weight $1/m$. The Monte Carlo algorithm given by Wei and Tanner [109] allows us to use the EM algorithm by method of weights [48].

The remaining distributions are normal and have closed form solutions. Using the results found in Dwyer's [23] work on multivariate derivatives, the MCEM maximum-likelihood type estimate of the mean is

$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n}\left[\rho_i\, x_i + (1 - \rho_i)\,\bar x_{i\cdot}\right], \qquad (4.4.1)$$

where $\rho_i$ denotes the product of the imperfection indicators of section 2.3 for subject $i$, which equals one only when every covariate of that subject is perfectly observed, and $\bar x_{i\cdot} = \frac{1}{m}\sum_{l=1}^{m} x_{il}$ is the intra-Monte Carlo sample average, with the dot emphasizing the index over which the average is taken (Appendix C.1.1). The estimate for the covariance matrix is

$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}\left[\rho_i\,(x_i - \hat\mu)(x_i - \hat\mu)^T + (1 - \rho_i)\,\frac{1}{m} S_i^2\right], \qquad (4.4.2)$$

where $S_i^2 = \sum_{l=1}^{m}(x_{il} - \hat\mu)(x_{il} - \hat\mu)^T$ is the intra-Monte Carlo sample variance (Appendix C.1.1).

These two estimates are intuitively appealing since they have a general form which emulates that of the maximum likelihood estimates for a complete case scenario.
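The closed-form updates for the mean and covariance can be sketched as follows. This Python illustration assumes that every subject is represented by $m$ draws and that a perfectly observed value is stored as $m$ identical copies, so the two branches of each estimate collapse into a single average; the function name and array layout are hypothetical.

```python
import numpy as np


def mstep_normal(draws):
    """Closed-form MCEM update of the mean and covariance of the covariates.

    draws : array of shape (n, m, p); observed components are assumed to be
            replicated across the m draws of a subject.
    """
    n, m, _ = draws.shape
    # Intra-Monte Carlo averages per subject, then the average over subjects.
    mu = draws.mean(axis=1).mean(axis=0)
    centred = draws - mu
    # Sum of outer products over subjects and draws, each draw weighted 1/m.
    sigma = np.einsum('nmi,nmj->ij', centred, centred) / (n * m)
    return mu, sigma
```

When no subject has an imperfect covariate the draws are all copies of the data, and the function returns the usual complete-case maximum likelihood estimates, which is the resemblance noted above.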
When a subject has a covariate which has been classified as imperfect, but not all realizations suffer from the imperfection, then for a subject $i$ and covariate $j$ for which the imperfection does not manifest we have $x_{ijl} = x_{ij}$ for all $l$. When this occurs, the contribution of subject $i$ to the estimate of the mean is

$$\frac{1}{m}\sum_{l=1}^{m}\begin{bmatrix} x_{i1l} \\ x_{i2l} \end{bmatrix} = \begin{bmatrix} x_{i1} \\ \bar x_{i2\cdot} \end{bmatrix}.$$

The situation is similar for the estimate of the covariance,

$$\frac{1}{m}\sum_{l=1}^{m}\begin{bmatrix} (x_{i1l} - \mu_1)^2 & (x_{i1l} - \mu_1)(x_{i2l} - \mu_2) \\ (x_{i1l} - \mu_1)(x_{i2l} - \mu_2) & (x_{i2l} - \mu_2)^2 \end{bmatrix} = \begin{bmatrix} (x_{i1} - \mu_1)^2 & (x_{i1} - \mu_1)(\bar x_{i2\cdot} - \mu_2) \\ (x_{i1} - \mu_1)(\bar x_{i2\cdot} - \mu_2) & \frac{1}{m}s_{i2}^2 \end{bmatrix},$$

where $s_{i2}^2 = \sum_{l=1}^{m}(x_{i2l} - \mu_2)^2$ is the intra-Monte Carlo sample variation for the second covariate, $\bar x_{i2\cdot}$ is the intra-Monte Carlo sample average for the $i$th subject, and $(\bar x_{i2\cdot} - \mu_2)$ is the average deviation of the Monte Carlo sample from the grand mean.

Because much of the maximization is algorithmic, the computational gain from using the closed form solutions is expected to be minimal, but they provide some insight into the structure of the estimates. It is clear from equation 4.4.1 that when imperfection exists, the contribution for the mean is the estimated expectation of the joint distribution of $X_1$ and $X_2$ given the MCEM estimate of the parameters, with an analogous contribution for the variance in equation 4.4.2.

4.4.2.3 Louis standard error

Before proceeding with the derivation of the Louis standard errors, two observations will be made. Given that the likelihood is a product of uniquely parameterized probability distributions, the Hessian matrix will be block diagonal with each block pertaining to a single probability distribution. Also, since the object of interest in this investigation is parameter estimation of the response model, only the block pertaining to the response model will be discussed.
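Since only the response-model block is needed, the Louis information for $\beta$ can be approximated from the Gibbs draws alone. The Python sketch below assumes the standard form of the Louis identity for independent subjects (expected negative Hessian, minus the expected outer product of the score, plus the outer product of the expected score) with expectations replaced by averages over the draws; it is an illustration rather than a transcription of the equations in section 3.5, and the function name and array layout are hypothetical.

```python
import numpy as np


def louis_se_beta(y, x_draws, beta):
    """Louis-type standard errors for the logistic response coefficients.

    y       : array (n,), binary responses
    x_draws : array (n, m, 2), Gibbs draws of (x1, x2); observed values are
              assumed to be replicated across the m draws
    beta    : array (3,), current estimate (intercept, beta1, beta2)
    """
    beta = np.asarray(beta, dtype=float)
    n, m, _ = x_draws.shape
    z = np.concatenate([np.ones((n, m, 1)), x_draws], axis=2)   # (n, m, 3)
    p = 1.0 / (1.0 + np.exp(-(z @ beta)))                       # (n, m)
    score = (y[:, None] - p)[:, :, None] * z                    # (n, m, 3)
    w = p * (1.0 - p)
    # Monte Carlo expectation of the negative Hessian, summed over subjects.
    exp_neg_hessian = np.einsum('nm,nmi,nmj->ij', w, z, z) / m
    # Monte Carlo expectation of the score outer product, summed over subjects.
    exp_outer = np.einsum('nmi,nmj->ij', score, score) / m
    # Outer product of each subject's expected score, summed over subjects.
    mean_score = score.mean(axis=1)                             # (n, 3)
    outer_mean = mean_score.T @ mean_score
    info = exp_neg_hessian - exp_outer + outer_mean
    return np.sqrt(np.diag(np.linalg.inv(info)))
```

When every draw equals the observed data, the last two terms cancel and the result reduces to the usual logistic regression standard errors, so the correction reflects only the information lost to imperfection.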
From the general presentation of the Louis standard error, we have the relationship given in equation 3.5.3 with the Monte Carlo approximation based on the current Gibbs sample given by equation 3.5.5. Since the likelihood model components are uniquely parameterized, the resulting information matrix will be block diagonal, thus we need not compute the inverse for the entire information matrix in order to obtain the covariance matrix for the parameters of interest. Using equation 3.5.6 and the score and Hessian for the response model as given in appendix B.2.2, we are able to obtain the MCEM Louis standard error for the $s$th simulation estimate of $\beta$.

4.4.3 Results and discussions

Through the very act of completing the first simulation, we have begun to answer the first global question: it is computationally feasible to implement a solution to the problem of having both missing data and measurement error in the same data set. Although fraught with problems, which will be addressed in section 4.4.4, the proposed strategy can be implemented. From implementation, we move to an assessment of how the combined problem of missing data and mismeasurement impacts parameter estimation. Finally, the effect of the MCEM adjustment is considered.

For this simulation study, four experiments were considered. The first is a comparison of how the agreement or disagreement between the missing data mechanism that generated the missingness and the assumed mechanism for the model affects parameter estimation for the response model. Here we will consider four scenarios, denoted as generating mechanism-model mechanism: MAR-MAR, MAR-NMAR, NMAR-MAR, NMAR-NMAR. The second experiment explores the effect of different missing data mechanisms. This experiment will help to determine if our observations can be generalized to more missing data mechanisms than that used in the first experiment. The third experiment is to explore the effect of sample size on parameter estimation.
This is to begin considering the asymptotic properties of this approach. The final experiment is to consider the effect that the variance $\tau^2$ has on parameter estimation.

4.4.3.1 Comparing the effect of the missing data mechanism assumption

Two comparisons will be made in this experiment: comparing the naive complete-case approach with the MCEM method when the generating missing data mechanism and the modelled mechanism are matched, and when they are mismatched. Since we are using a logistic regression model, it is good to recall that both Vach [104] and Greenland [38] indicate that the MAR missing data mechanism with the response as a covariate may or may not result in negligible bias, but for measurement error with the logistic regression model we should expect attenuation, or bias towards zero [40].

For the first situation, we will consider a comparison across the methods for case 1 (MAR-MAR) and case 4 (NMAR-NMAR). The complete-case approach has large estimated standard deviations for both cases (Tables 4.3 and 4.4). The estimate of $\beta_1$ is unbiased as anticipated, with small z-scores. For both cases, the estimate of $\beta_2$ is biased with very large z-scores, and we see the expected attenuation. There is a general failure to reach the nominal coverage level for $\beta_2$ across the two cases, with case 4 failing for the intercept as well (Table 4.5). The results for $\hat\delta$ emulate those of the bias (Table 4.6).

The MCEM methodology also has large estimated standard deviations, and these are much larger than those of the complete-case approach (Tables 4.3 and 4.4). The bias associated with $\beta_0$ and $\beta_1$ is increased in both cases, with large increases in case 1. The trade-off for these increases is a reduction in the bias associated with $\beta_2$. Unfortunately, only in case 4 is there insufficient evidence against the null hypothesis of no bias.
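The summary columns of Tables 4.3 and 4.4 can be checked with simple arithmetic. The Python sketch below assumes that the z-score for the bias divides the bias by the standard deviation over the square root of the number of simulations, that the mean squared error is the squared standard deviation plus the squared bias, and that the relative efficiency is the ratio of the MCEM mean squared error to the complete-case one. These forms are consistent with the tabulated values; the function names are hypothetical, and small differences arise because the tabulated inputs are rounded.

```python
import math


def bias_z(bias, sd, n_sims):
    """z-score for the null hypothesis of no bias."""
    return bias / (sd / math.sqrt(n_sims))


def mse(bias, sd):
    """Mean squared error from the bias and the standard deviation."""
    return sd ** 2 + bias ** 2


# Case 1 (MAR-MAR, 153 simulations), values taken from Table 4.3(a).
mse_naive_b0 = mse(0.023, 0.290)        # about 0.085
mse_mcem_b0 = mse(-0.113, 0.417)        # about 0.187
rel_eff_b0 = mse_mcem_b0 / mse_naive_b0  # about 2.2
z_naive_b2 = bias_z(0.341, 0.190, 153)   # about 22
```

The relative efficiency above two for the intercept shows how the larger variability of the MCEM estimates dominates the mean squared error even when the bias is modest.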
The MSE is similar across the two mechanisms, but the MCEM based MSE is substantially larger for all parameter estimates than that of the complete-case approach; $\hat e$ indicates a general inefficiency for the MCEM method. The MCEM confidence intervals are consistently longer than those of the complete-case approach, but this is not unexpected due to the observed increase in the estimated standard error. Also, the lengths of the MCEM confidence intervals are much more variable than those of the naive ones. For both cases, the nominal coverage is reached for all parameter estimates (Table 4.5). The results for $\hat\delta_{MCEM}$ are similar to those of the bias except in case 4, where a decrease in $\hat\delta$ is seen moving from the naive approach to the MCEM method.

In general, similar trends are seen when moving from the complete-case method to the MCEM approach for cases 2 and 3 as we have seen for cases 1 and 4, thus we will focus only on the effect of misspecifying the missing data mechanism on the MCEM method when we move from correct specification to incorrect characterization. When moving from case 1 to case 2 there is a reduction in the estimated standard deviation for $\beta_2$ and a decrease in its bias (Table 4.3). The bias associated with $\beta_0$ and $\beta_1$ increases. Although we have some increase in bias, the MSE for all estimates is reduced since it is driven primarily by the size of the estimated standard deviation. Also, we see a reduction in $\hat e$ and a shortening of the confidence intervals. The attainment of the nominal coverage rate is lost for $\beta_0$ and $\beta_1$ (Table 4.5). Finally, we see a decrease in $\hat\delta_{MCEM}$ (Table 4.6).

When the missing data mechanism is under-modelled, that is, when a smaller missing data mechanism model than the true model is assumed (e.g. case 3), the estimated standard deviation of $\beta_0$ and $\beta_2$ is reduced, but increased for $\beta_1$ (Table 4.4). The bias is increased for the estimates of $\beta_1$ and $\beta_2$, as is the associated MSE.
Interestingly, $\hat e$ is reduced for all parameter estimates. There is little difference in the lengths of the confidence intervals and the attainment of the nominal coverage rate (Table 4.5). Finally, $\hat\delta_{MCEM}$ exhibits mixed results (Table 4.6).

Discussion

We see many features which would be anticipated if this was only a missing data problem or only a mismeasurement problem. For case 1, we see an unbiased estimate of $\beta_1$ when a complete-case analysis is implemented; for case 4, the estimate is biased. Also, we see that the estimates of $\beta_2$ for both cases are biased and we see the expected attenuation. The estimated standard deviations are large, but we notice a substantial increase in the variability of the MCEM estimates. This contributes to large MSE estimates for both cases. A positive feature of the MCEM approach is that it is able to reach the nominal coverage rate for all parameter estimates, which the complete-case analysis fails to do. In this situation, we see many of the expected features, but we do see a subtle trade-off emerge. It appears that sacrifices are made in the estimation of $\beta_0$ and $\beta_1$ in order to reduce the bias associated with $\beta_2$. This trade-off seems to work best when the matched missing data mechanism is NMAR.

When the missing data model is misspecified, the type of misspecification appears to have some effect on how the parameters are affected. When we over-model the missing data mechanism (case 2), we see a general reduction in the noise of the parameter estimates, but an increase in the bias associated with $\beta_0$ and $\beta_1$. Also, the nominal coverage is lost for these two parameter estimates. If we are only interested in $\beta_2$ then over-modelling has some advantages by reducing bias and variability and increasing its efficiency. When the missing data mechanism is under-specified, we see mixed results for $\beta_1$ and $\beta_2$. We have an increase in bias and an increase in the MSE, but the estimated standard deviation decreases for only $\beta_1$.
There is little change in the coverage and in the confidence interval lengths.

Table 4.3: Comparing the effect of the missing data mechanism (MAR, NMAR) on the point estimate, bias, and mean squared error when the mechanism that generates the missing data and the missing data mechanism in the model are matched and mismatched. Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$, case 1 and case 2 only.

(a) Case 1: MAR-MAR, simulation size = 153

Naive        $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    -1.367 (0.290)          0.023 (0.961)       0.085 (0.013)
$\beta_1$     0.419 (0.284)          0.009 (0.389)       0.080 (0.011)
$\beta_2$    -0.349 (0.190)          0.341 (22.174)      0.152 (0.011)

MCEM         $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)         $\hat e$ (SD)
$\beta_0$    -1.503 (0.417)          -0.113 (-3.344)     0.187 (0.051)    2.20 (0.58)
$\beta_1$     0.479 (0.347)           0.069 (2.462)      0.125 (0.020)    1.56 (0.23)
$\beta_2$    -0.840 (0.759)          -0.150 (-2.445)     0.599 (0.315)    3.93 (2.15)

(b) Case 2: MAR-NMAR, simulation size = 112

Naive        $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    -1.419 (0.282)          -0.029 (-1.075)     0.080 (0.014)
$\beta_1$     0.420 (0.309)           0.010 (0.335)      0.095 (0.016)
$\beta_2$    -0.333 (0.192)           0.357 (19.652)     0.164 (0.013)

MCEM         $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)         $\hat e$ (SD)
$\beta_0$    -1.524 (0.376)          -0.134 (-3.781)     0.159 (0.042)    1.98 (0.28)
$\beta_1$     0.495 (0.356)           0.085 (2.530)      0.134 (0.024)    1.40 (0.15)
$\beta_2$    -0.747 (0.543)          -0.057 (-1.112)     0.298 (0.073)    1.82 (0.50)

Table 4.4: Comparing the effect of the missing data mechanism (MAR, NMAR) on the point estimate, bias, and mean squared error when the mechanism that generates the missing data and the missing data mechanism in the model are matched and mismatched. Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$, case 3 and case 4 only.

(a) Case 3: NMAR-MAR, simulation size = 150

Naive        $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    -1.347 (0.311)          0.043 (1.690)       0.099 (0.013)
$\beta_1$     0.457 (0.365)          0.047 (1.577)       0.135 (0.017)
$\beta_2$    -0.348 (0.194)          0.342 (21.581)      0.154 (0.012)

MCEM         $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)         $\hat e$ (SD)
$\beta_0$    -1.458 (0.389)          -0.068 (-2.145)     0.156 (0.023)    1.58 (0.14)
$\beta_1$     0.490 (0.382)           0.080 (2.560)      0.152 (0.020)    1.12 (0.07)
$\beta_2$    -0.776 (0.473)          -0.086 (-2.227)     0.231 (0.031)    1.49 (0.25)

(b) Case 4: NMAR-NMAR, simulation size = 139

Naive        $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    -1.352 (0.308)          0.038 (1.463)       0.096 (0.012)
$\beta_1$     0.440 (0.329)          0.030 (1.092)       0.109 (0.011)
$\beta_2$    -0.341 (0.207)          0.349 (19.904)      0.164 (0.013)

MCEM         $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)         $\hat e$ (SD)
$\beta_0$    -1.478 (0.422)          -0.088 (-2.449)     0.186 (0.050)    1.93 (0.53)
$\beta_1$     0.459 (0.350)           0.049 (1.632)      0.125 (0.015)    1.15 (0.09)
$\beta_2$    -0.770 (0.641)          -0.080 (-1.464)     0.418 (0.188)    2.54 (1.15)

Table 4.5: Comparing the effect of the missing data mechanism (MAR, NMAR) on confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data are matched and mismatched. Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$.
(a) Case 1: MAR-MAR, simulation size = 153

             $L_{Naive}$ (SD)   $L_{MCEM}$ (SD)   Coverage Naive (p-value)   Coverage MCEM (p-value)
$\beta_0$    1.058 (0.123)      1.571 (2.866)     0.928 (0.296)              0.961 (0.492)
$\beta_1$    1.140 (0.135)      1.361 (0.728)     0.967 (0.228)              0.967 (0.228)
$\beta_2$    0.756 (0.101)      2.414 (6.359)     0.523 (< 0.001)            0.961 (0.492)

(b) Case 2: MAR-NMAR, simulation size = 112

             $L_{Naive}$ (SD)   $L_{MCEM}$ (SD)   Coverage Naive (p-value)   Coverage MCEM (p-value)
$\beta_0$    1.072 (0.135)      1.352 (0.576)     0.982 (0.012)              1.000 (< 0.001)
$\beta_1$    1.150 (0.143)      1.303 (0.367)     0.973 (0.128)              0.982 (0.012)
$\beta_2$    0.758 (0.091)      1.843 (0.878)     0.500 (< 0.001)            0.938 (0.585)

(c) Case 3: NMAR-MAR, simulation size = 150

             $L_{Naive}$ (SD)   $L_{MCEM}$ (SD)   Coverage Naive (p-value)   Coverage MCEM (p-value)
$\beta_0$    1.064 (0.131)      1.311 (0.395)     0.92 (0.176)               0.933 (0.412)
$\beta_1$    1.201 (0.177)      1.328 (0.282)     0.92 (0.176)               0.947 (0.856)
$\beta_2$    0.762 (0.095)      1.816 (0.638)     0.56 (< 0.001)             0.967 (0.255)

(d) Case 4: NMAR-NMAR, simulation size = 139

             $L_{Naive}$ (SD)   $L_{MCEM}$ (SD)   Coverage Naive (p-value)   Coverage MCEM (p-value)
$\beta_0$    1.063 (0.126)      1.454 (1.811)     0.899 (0.048)              0.935 (0.480)
$\beta_1$    1.202 (0.158)      1.325 (0.363)     0.964 (0.376)              0.957 (0.692)
$\beta_2$    0.755 (0.102)      1.990 (2.988)     0.568 (< 0.001)            0.935 (0.480)

Table 4.6: Comparing the effect of the missing data mechanism (MAR, NMAR) on the median and the difference between the median and true value ($\hat\delta$). Mechanism A was used to generate the data and the coefficient for the missing data variable is $\beta_1$ and for the mismeasured variable is $\beta_2$.

(a) Case 1: MAR-MAR, simulation size = 153

             $\hat m_{Naive}(\hat\beta_j)$ (MAD)   $\hat m_{MCEM}(\hat\beta_j)$ (MAD)   $\hat\delta_{Naive}$   $\hat\delta_{MCEM}$
$\beta_0$    -1.361 (0.230)                        -1.450 (0.276)                        0.029                  -0.060
$\beta_1$     0.428 (0.275)                         0.471 (0.351)                        0.018                   0.061
$\beta_2$    -0.325 (0.191)                        -0.696 (0.472)                        0.365                  -0.006

(b) Case 2: MAR-NMAR, simulation size = 112

             $\hat m_{Naive}(\hat\beta_j)$ (MAD)   $\hat m_{MCEM}(\hat\beta_j)$ (MAD)   $\hat\delta_{Naive}$   $\hat\delta_{MCEM}$
$\beta_0$    -1.369 (0.248)                        -1.439 (0.274)                         0.021                 -0.049
$\beta_1$     0.406 (0.289)                         0.443 (0.318)                        -0.004                  0.033
$\beta_2$    -0.325 (0.143)                        -0.684 (0.380)                         0.365                  0.006

(c) Case 3: NMAR-MAR, simulation size = 150

             $\hat m_{Naive}(\hat\beta_j)$ (MAD)   $\hat m_{MCEM}(\hat\beta_j)$ (MAD)   $\hat\delta_{Naive}$   $\hat\delta_{MCEM}$
$\beta_0$    -1.354 (0.313)                        -1.400 (0.366)                        0.036                  -0.010
$\beta_1$     0.418 (0.368)                         0.463 (0.373)                        0.008                   0.053
$\beta_2$    -0.360 (0.186)                        -0.794 (0.457)                        0.330                  -0.104

(d) Case 4: NMAR-NMAR, simulation size = 139

             $\hat m_{Naive}(\hat\beta_j)$ (MAD)   $\hat m_{MCEM}(\hat\beta_j)$ (MAD)   $\hat\delta_{Naive}$   $\hat\delta_{MCEM}$
$\beta_0$    -1.330 (0.290)                        -1.425 (0.356)                         0.060                 -0.035
$\beta_1$     0.409 (0.348)                         0.463 (0.379)                        -0.001                  0.053
$\beta_2$    -0.334 (0.210)                        -0.714 (0.466)                         0.356                 -0.024

4.4.3.2 Comparing missing data mechanisms

Three missing data mechanisms will be considered, with each taking an MAR and an NMAR form. These mechanisms will be considered for both case 1 and case 4, where the missing data mechanism that generated the data and the mechanism used to model the data are matched. We will focus on a comparison across the three mechanisms to see if the functional form of the missing data mechanism affects parameter estimation. Special note will be taken of differences between mechanisms A and B since they vary only in the inclusion or exclusion of the response as part of the mechanism.

Beginning with case 1, we see that the estimated standard deviations for the parameter estimates under the complete-case method are similar across the mechanisms (Table 4.7). For $\beta_2$ we see the expected bias, but we have mixed results for the other parameter estimates. For $\beta_0$, mechanisms A and B are unbiased and for $\beta_1$ only mechanism A is unbiased. These mixed results may be a manifestation of the ambiguity surrounding parameter estimation with missing data [38, 104]. The MSE is similar across the mechanisms, but probably a bit large with mechanism C. Table 4.9 suggests that the confidence interval lengths are of similar magnitude across the three mechanisms.
We see a similarity in the coverage rates for mechanisms A and B, but C shows a lower coverage rate for $\beta_0$. The noticeable difference in $\hat\delta_{Naive}$ is with mechanism C for $\beta_0$ and $\beta_1$, which are much larger than the estimates for the other mechanisms.

When the MCEM adjustment is applied, we notice that the estimated standard deviation of each $\hat\beta_j$ is similar across the mechanisms, but the magnitude exhibits the expected increase (Table 4.7). There is a correction for the bias, but this correction appears to be better when the response is not part of the mechanism. We see that the MSE is similar across mechanisms for $\beta_0$ and $\beta_1$ but varied for $\beta_2$. A noticeable feature is that the MSE for mechanism A is larger than that of B. The confidence intervals are longer and more variable, but this is similar to that seen in the previous experiment. The coverage is similar across all mechanisms for $\beta_1$ and yields the same conclusions as the naive approach. This is not true for the other estimates; $\beta_2$ improves under the MCEM approach and attains the nominal rate (Table 4.9). We see no clear pattern for $\hat\delta_{MCEM}$ across the mechanisms for $\beta_0$ and $\beta_1$, but we observe a general improvement for $\hat\delta(\hat\beta_2)$ (Table 4.10).

When the matched missing data mechanisms are NMAR, we see that the estimated standard deviations obtained from the naive complete-case analysis are similar across the three missing data mechanisms (Table 4.8). There is much variation in the biases, but we do observe that the bias of $\hat\beta_2$ is similar across mechanisms and exhibits the expected magnitude. The confidence interval lengths are similar across the mechanisms, but in terms of coverage we observe that the rate for $\beta_0$ is too low for mechanisms A and C (Table 4.9). We see for $\beta_0$ and $\beta_1$ that $\hat\delta_{Naive}$ is lowest for mechanism A and largest for C. For $\beta_2$, $\hat\delta_{Naive}$ is similar across the mechanisms (Table 4.10).
Under the MCEM methodology, we see the expected increase in the estimated standard deviation of $\hat\beta_2$, but we also notice that the mechanisms are similar within each set of parameter estimates (Table 4.8). There is much variation in the biases. We do observe a reduction in the bias of $\beta_2$ for all mechanisms, but the reduction appears to remove the bias for only mechanisms A and B. Although there is no clear pattern for $\beta_1$, the z-scores reveal that all estimates can be considered unbiased, which is not the case for the naive approach. For $\beta_0$, there is an improvement in the bias for mechanisms B and C. The MSE is similar across the mechanisms, although there is the suggestion that differences may emerge for $\beta_2$. From Table 4.9, we see that the confidence interval lengths are longer for the MCEM approach, but they are similar across the mechanisms. In terms of coverage, all mechanisms across all the parameter estimates attain the nominal coverage rate. Finally, for $\hat\delta_{MCEM}$ we see improvements for all estimates and mechanisms except in the single case of mechanism A and $\beta_1$ (Table 4.10). In this case we see a worsening.

Discussion

When the generating missing data mechanism is MAR and matched with the model assumption, we see much similarity across the mechanisms. Some notable deviations are with mechanism C, which has a larger MSE for the complete-case approach and does not perform as well as the other mechanisms for the complete-case approach in terms of coverage. We see that this problem with coverage vanishes with the MCEM methodology. Furthermore, we note that the MCEM approach does a reasonable job at correcting the bias, but tends to perform less well when the response is part of the missing data mechanism. We also see the largest MSE for all parameter estimates associated with mechanism A.
We may conclude, based on this limited simulation based investigation, that under the MAR assumption there is little difference across the three mechanisms, but the inclusion of the response in the mechanism may cause the parameter estimates to have less than desirable properties.

Under the NMAR assumptions, drawing general conclusions is much more difficult, for there are many points at which the three mechanisms are similar, but there are also points of divergence. The inclusion of the response in the missing data mechanism does not appear to have the same influence over the properties of the estimates as in the MAR situation. The MCEM approach appears to do a sufficient job at improving the parameter estimates and their properties over the naive complete-case approach.

Table 4.7: Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data are matched. Simulation sizes for case 1 (MAR-MAR) are 153, 168, 167 for mechanisms A, B, and C respectively.

Naive        Mech.   $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    A       -1.367 (0.290)           0.023 (0.961)      0.085 (0.013)
$\beta_0$    B       -1.355 (0.251)           0.035 (1.804)      0.064 (0.008)
$\beta_0$    C       -1.318 (0.285)           0.072 (3.253)      0.087 (0.010)
$\beta_1$    A        0.419 (0.284)           0.009 (0.389)      0.080 (0.011)
$\beta_1$    B        0.361 (0.309)          -0.049 (-2.062)     0.098 (0.011)
$\beta_1$    C        0.310 (0.310)          -0.100 (-4.174)     0.106 (0.012)
$\beta_2$    A       -0.349 (0.190)           0.341 (22.174)     0.152 (0.011)
$\beta_2$    B       -0.321 (0.188)           0.369 (25.457)     0.171 (0.012)
$\beta_2$    C       -0.330 (0.193)           0.360 (24.184)     0.167 (0.010)

MCEM         Mech.   $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    A       -1.503 (0.417)          -0.113 (-3.344)     0.187 (0.051)
$\beta_0$    B       -1.478 (0.315)          -0.088 (-3.618)     0.107 (0.015)
$\beta_0$    C       -1.436 (0.423)          -0.046 (-1.411)     0.181 (0.048)
$\beta_1$    A        0.479 (0.347)           0.069 (2.462)      0.125 (0.020)
$\beta_1$    B        0.448 (0.374)           0.038 (1.315)      0.141 (0.017)
$\beta_1$    C        0.411 (0.403)          -0.001 (-0.021)     0.163 (0.032)
$\beta_2$    A       -0.840 (0.759)          -0.150 (-2.445)     0.599 (0.315)
$\beta_2$    B       -0.742 (0.531)          -0.052 (-1.259)     0.285 (0.042)
$\beta_2$    C       -0.778 (0.652)          -0.088 (-1.743)     0.432 (0.175)

Table 4.8: Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data are matched. Simulation sizes for case 4 (NMAR-NMAR) are 139, 163, 168 for mechanisms A, B, and C respectively.

Naive        Mech.   $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    A       -1.352 (0.308)           0.038 (1.463)      0.096 (0.012)
$\beta_0$    B       -1.319 (0.265)           0.071 (3.418)      0.075 (0.007)
$\beta_0$    C       -1.288 (0.294)           0.102 (4.490)      0.097 (0.009)
$\beta_1$    A        0.440 (0.329)           0.030 (1.092)      0.109 (0.011)
$\beta_1$    B        0.402 (0.325)          -0.008 (-0.332)     0.106 (0.012)
$\beta_1$    C        0.330 (0.323)          -0.080 (-3.208)     0.111 (0.013)
$\beta_2$    A       -0.341 (0.207)           0.349 (19.904)     0.164 (0.013)
$\beta_2$    B       -0.319 (0.209)           0.371 (22.629)     0.181 (0.013)
$\beta_2$    C       -0.337 (0.196)           0.353 (23.349)     0.163 (0.012)

MCEM         Mech.   $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)   MSE (SD)
$\beta_0$    A       -1.478 (0.422)          -0.088 (-2.449)     0.186 (0.050)
$\beta_0$    B       -1.446 (0.374)          -0.056 (-1.914)     0.143 (0.020)
$\beta_0$    C       -1.420 (0.388)          -0.030 (-0.990)     0.151 (0.019)
$\beta_1$    A        0.459 (0.350)           0.049 (1.632)      0.125 (0.015)
$\beta_1$    B        0.431 (0.385)           0.021 (0.712)      0.148 (0.011)
$\beta_1$    C        0.409 (0.408)          -0.001 (-0.044)     0.167 (0.021)
$\beta_2$    A       -0.770 (0.641)          -0.080 (-1.464)     0.418 (0.188)
$\beta_2$    B       -0.717 (0.533)          -0.027 (-0.644)     0.285 (0.048)
$\beta_2$    C       -0.801 (0.573)          -0.111 (-2.502)     0.341 (0.048)

Table 4.9: Confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data are matched. Simulation sizes for case 1 (MAR-MAR) are 153, 168, 167 for mechanisms A, B, and C respectively; for case 4 (NMAR-NMAR), they are 139, 163, 168 respectively.
(a) Case 1: MAR-MAR model

Param.  Mech.    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0      A        1.058 (0.123)    1.571 (2.866)    0.928 (0.296)              0.961 (0.492)
β0      B        1.045 (0.102)    1.306 (0.392)    0.958 (0.588)              0.976 (0.024)
β0      C        1.044 (0.122)    1.433 (1.717)    0.898 (0.028)              0.940 (0.592)
β1      A        1.140 (0.135)    1.361 (0.728)    0.967 (0.228)              0.967 (0.228)
β1      B        1.134 (0.128)    1.315 (0.336)    0.940 (0.600)              0.946 (0.836)
β1      C        1.175 (0.167)    1.438 (0.962)    0.958 (0.604)              0.958 (0.604)
β2      A        0.756 (0.101)    2.414 (6.359)    0.523 (< 0.001)            0.961 (0.492)
β2      B        0.739 (0.083)    1.774 (0.796)    0.488 (< 0.001)            0.940 (0.600)
β2      C        0.745 (0.090)    2.100 (2.785)    0.461 (< 0.001)            0.970 (0.128)

(b) Case 4: NMAR-NMAR model

Param.  Mech.    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0      A        1.063 (0.126)    1.454 (1.811)    0.899 (0.048)              0.935 (0.480)
β0      B        1.045 (0.105)    1.315 (0.500)    0.933 (0.372)              0.963 (0.372)
β0      C        1.050 (0.111)    1.356 (0.498)    0.869 (< 0.001)            0.935 (0.416)
β1      A        1.202 (0.158)    1.325 (0.363)    0.964 (0.376)              0.957 (0.692)
β1      B        1.228 (0.139)    1.359 (0.280)    0.963 (0.372)              0.945 (0.772)
β1      C        1.280 (0.141)    1.475 (0.361)    0.940 (0.600)              0.958 (0.588)
β2      A        0.755 (0.102)    1.990 (2.988)    0.568 (< 0.001)            0.935 (0.480)
β2      B        0.748 (0.076)    1.749 (0.677)    0.509 (< 0.001)            0.920 (0.160)
β2      C        0.745 (0.087)    1.955 (0.898)    0.542 (< 0.001)            0.952 (0.884)

Table 4.10: Comparing the effect of the missing data mechanism (MAR, NMAR) on the median and the difference between the median and the true value (δ) when the mechanism that generated the missing data and the mechanism assumed by the model are matched. Simulation sizes for case 1 (MAR-MAR) are 153, 168, and 167 for mechanisms A, B, and C respectively; for case 4 (NMAR-NMAR), they are 139, 163, and 168 respectively.
(a) Case 1: MAR-MAR model

Param.  Mech.    m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0      A        -1.361 (0.230)       -1.450 (0.276)       0.029     -0.060
β0      B        -1.354 (0.242)       -1.463 (0.306)       0.036     -0.073
β0      C        -1.305 (0.231)       -1.364 (0.288)       0.085     0.026
β1      A        0.428 (0.275)        0.471 (0.351)        0.018     0.061
β1      B        0.378 (0.299)        0.436 (0.370)        -0.032    0.026
β1      C        0.290 (0.313)        0.374 (0.345)        -0.120    -0.036
β2      A        -0.325 (0.191)       -0.696 (0.472)       0.365     -0.006
β2      B        -0.312 (0.200)       -0.653 (0.427)       0.378     0.037
β2      C        -0.307 (0.183)       -0.651 (0.435)       0.383     0.039

(b) Case 4: NMAR-NMAR model

Param.  Mech.    m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0      A        -1.330 (0.290)       -1.425 (0.356)       0.060     -0.035
β0      B        -1.294 (0.269)       -1.372 (0.356)       0.096     0.018
β0      C        -1.252 (0.280)       -1.376 (0.369)       0.138     0.014
β1      A        0.409 (0.348)        0.463 (0.379)        -0.001    0.053
β1      B        0.388 (0.321)        0.417 (0.377)        -0.022    0.007
β1      C        0.343 (0.305)        0.429 (0.392)        -0.067    0.019
β2      A        -0.334 (0.210)       -0.714 (0.466)       0.356     -0.024
β2      B        -0.325 (0.223)       -0.636 (0.432)       0.365     0.054
β2      C        -0.357 (0.209)       -0.733 (0.514)       0.333     -0.043

4.4.3.3 Assessing the effect of sample size

The primary objective is to identify empirical evidence suggesting any asymptotic behaviour of the estimates using mechanism B (Table 4.2). For case 1, clear trends are exhibited. Considering first the complete-case approach, we see that as the sample size increases the bias worsens for β0 but stays about the same for β1 and β2 (Table 4.11). The mean squared error associated with the estimates decreases for all parameters as the sample size increases. The confidence interval coverage for the complete-case estimates of β2 degrades as the sample size increases (Table 4.12). Of the other two parameters, the coverage rate for one exhibits negligible differences, while that for the other appears to improve with increasing sample size. Finally, we see no noticeable trend in δ_Naive (Table 4.13). When the MCEM adjustment is implemented the story appears to change.
We see a reduction in the bias for all the parameter estimates as the sample size increases (Table 4.11). This pattern is exhibited again for the MSE and for the estimated relative efficiency. For β2 and n = 250 we observe ê = 0.67, which is statistically different from unity. In Table 4.12 we see the confidence interval coverage rates move towards the nominal value as the sample size increases. Finally, in Table 4.13 we see that δ_MCEM decreases as the sample size increases.

For case 4 we see trends similar to those of case 1. For the complete-case analysis, the bias associated with the parameters appears to be stable across sample sizes. When n = 100, the bias associated with the estimate of β1 is smaller than for the other sample sizes. Despite this difference, the magnitude of the bias is similar across the sample sizes and the z-scores tell the same story: β1 is unbiased and the others are biased. As before, the MSE and the confidence interval length decrease as the sample size increases. There is no noticeable trend for the coverage rates, and they exhibit values similar to those seen in case 1. The difference between the estimated median and the true value has some interesting trends. For the intercept and β2, the difference increases as the sample size increases, but the reverse is generally true for β1.

As with case 1, the MCEM approach for case 4 exhibits some desirable properties as the sample size increases. In general, across all parameters there is a trend of decreasing bias, MSE, ê, and confidence interval length as the sample size increases. The exception is β2 at n = 250, where the bias appears to increase. Firstly, this is not a statistically significant difference (p-value = 0.052 on 144 df). Secondly, the break in the trend could be a result of the small sample and the resulting crude empirical approximation to the sampling distribution produced by the simulation experiment when n = 250.
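The p-values reported next to the coverage rates (Tables 4.12 and 4.15) can be read as tests of the empirical coverage against the nominal 0.95 level. The sketch below shows one way such a p-value could be obtained. It assumes a two-sided Wald test that uses the estimated coverage in the standard error; this section does not state which test was used, so that choice is an assumption, although it reproduces several tabulated values to within rounding. The function name is hypothetical, and the interval counts are inferred from the reported rates and simulation sizes.

```python
import math


def coverage_p_value(covered, m, nominal=0.95):
    """Two-sided Wald test of an empirical coverage rate against a nominal rate.

    covered: number of simulated confidence intervals containing the true value
    m:       simulation size
    """
    p_hat = covered / m
    se = math.sqrt(p_hat * (1.0 - p_hat) / m)  # standard error from the estimated rate
    if se == 0.0:
        # All or none of the intervals covered: the Wald statistic is unbounded
        # unless the estimate equals the nominal rate.
        return 1.0 if p_hat == nominal else 0.0
    z = (p_hat - nominal) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


# A reported coverage of 0.940 with simulation size 168 corresponds to
# 158 covering intervals (an inferred count).
print(round(coverage_p_value(158, 168), 3))
```

Under these assumptions the example gives a p-value of about 0.60, in line with the 0.600 reported for a coverage of 0.940 at simulation size 168. A coverage of exactly 1.000 gives a zero standard error, which this sketch maps to a p-value of 0, matching the "< 0.001" entries.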
In general, as the sample size increases, the coverage rate reaches the nominal level. For δ_MCEM there is no clear trend, but this may be due to the small simulation size when n = 250. Special attention should be given to the trend of ê for case 4. We see that as the sample size increases, ê decreases. The reduction is rather quick: by the time the sample size has reached n = 250, the MCEM approach is on par with or better than the complete-case approach. For β1 the relative efficiency exhibits no statistical difference from the complete-case approach, but for β0 and β2 there is a difference, with that of β2 being the most noticeable.

Discussion

A general trend emerges for both case 1 and case 4. As we move from the complete-case analysis to the MCEM approach, and as we move from small to large sample sizes, there is a general improvement in the properties of the estimates. The bias tends to decrease, and balancing the trade-offs appears to be eased with more information. The MSE reduces, and in some cases the MCEM approach becomes the more efficient approach. These trends continue to be observed for the length of the confidence intervals and for the difference between the median and the true value for the MCEM approach. Furthermore, the MCEM approach exhibits the desirable property of reaching the nominal coverage rate as the sample size increases.

Given the evidence for emerging and desirable asymptotic properties of the MCEM approach, we can make a few observations. The first is that the properties we anticipated from using the EM algorithm do emerge at relatively modest sample sizes. The second is that, due to computational limitations, n = 100 was used for many of the simulation experiments, but this may obfuscate a lucid understanding of the ability of the MCEM approach to manage the dual problem of missing data and mismeasurement in other experiments.
That said, seeing the performance of the MCEM algorithm with a modest sample size of n = 100 will give valuable insight for substantive researchers who regularly deal with small and modestly sized samples. The importance lies in how it illuminates the fact that the complete-case approach may be attractive from an efficiency point of view, but is unable to reasonably manage the various biases, namely those caused by missing data and measurement error.

For case two there is a wide discrepancy in the number of simulations performed. It was found that in case 4, when dealing with NMAR missing data, the ARS Gibbs sampler regularly encountered numerical problems with the abscissa, causing the algorithm to abort frequently. Additionally, the algorithm was found to wander through the parameter space, suggesting potential ridges in the log-likelihood function.

Table 4.11: Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 1: MAR-MAR. Missing data mechanism B was used for this comparison.

(a) Case 1: Sample Size = 50, Simulation Size = 179

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.406 (0.420)   -0.016 (-0.510)   0.177 (0.021)    -1.811 (1.932)   -0.421 (-2.914)   3.909 (2.296)    22.08 (12.60)
β1       0.337 (0.448)    -0.073 (-2.176)   0.206 (0.025)    0.431 (0.803)    0.021 (0.351)     0.645 (0.172)    3.12 (0.820)
β2       -0.339 (0.301)   0.351 (15.609)    0.214 (0.017)    -1.072 (2.177)   -0.382 (-2.351)   4.883 (2.801)    22.81 (12.94)

(b) Case 1: Sample Size = 100, Simulation Size = 168

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.355 (0.251)   0.035 (1.804)     0.064 (0.008)    -1.478 (0.315)   -0.088 (-3.618)   0.107 (0.015)    1.66 (0.17)
β1       0.361 (0.309)    -0.049 (-2.062)   0.098 (0.011)    0.448 (0.374)    0.038 (1.315)     0.141 (0.017)    1.44 (0.15)
β2       -0.321 (0.188)   0.369 (25.457)    0.171 (0.012)    -0.742 (0.531)   -0.052 (-1.259)   0.285 (0.042)    1.67 (0.28)

(c) Case 1: Sample Size = 250, Simulation Size = 161

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.340 (0.172)   0.050 (3.651)     0.032 (0.003)    -1.433 (0.208)   -0.043 (-2.618)   0.045 (0.005)    1.40 (0.13)
β1       0.350 (0.186)    -0.060 (-4.094)   0.038 (0.004)    0.417 (0.210)    0.007 (0.417)     0.044 (0.005)    1.15 (0.08)
β2       -0.323 (0.125)   0.367 (37.175)    0.151 (0.007)    -0.711 (0.318)   -0.021 (-0.828)   0.101 (0.011)    0.67 (0.09)

Table 4.12: Comparing the effect of sample size on confidence interval length and coverage for case 1: MAR-MAR. Missing data mechanism B was used for this comparison.

(a) Case 1: Sample Size = 50, Simulation Size = 179

Param.   L_Naive (SD)     L_MCEM (SD)       Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.568 (0.293)    7.968 (47.757)    0.955 (0.732)              0.978 (0.012)
β1       1.743 (0.355)    3.664 (11.279)    0.966 (0.220)              0.972 (0.072)
β2       1.123 (0.221)    8.936 (48.990)    0.670 (< 0.001)            0.966 (0.220)

(b) Case 1: Sample Size = 100, Simulation Size = 168

Param.   L_Naive (SD)     L_MCEM (SD)       Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.045 (0.102)    1.306 (0.392)     0.958 (0.588)              0.976 (0.024)
β1       1.134 (0.128)    1.315 (0.336)     0.940 (0.600)              0.946 (0.836)
β2       0.739 (0.083)    1.774 (0.796)     0.488 (< 0.001)            0.940 (0.600)

(c) Case 1: Sample Size = 250, Simulation Size = 161

Param.   L_Naive (SD)     L_MCEM (SD)       Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       0.644 (0.041)    0.756 (0.123)     0.919 (0.152)              0.963 (0.392)
β1       0.704 (0.049)    0.779 (0.089)     0.951 (0.984)              0.944 (0.744)
β2       0.461 (0.034)    1.041 (0.230)     0.168 (< 0.001)            0.938 (0.524)

Table 4.13: Comparing the effect of sample size on the median and the difference between the median and the true value (δ) for case 1: MAR-MAR.
Missing data mechanism B was used for this comparison.

(a) Case 1: Sample Size = 50, Simulation Size = 179

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.365 (0.425)       -1.481 (0.473)       0.025     -0.091
β1       0.362 (0.442)        0.387 (0.501)        -0.048    -0.023
β2       -0.327 (0.307)       -0.658 (0.66)        0.363     0.032

(b) Case 1: Sample Size = 100, Simulation Size = 168

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.354 (0.242)       -1.463 (0.306)       0.036     -0.073
β1       0.378 (0.299)        0.436 (0.370)        -0.032    0.026
β2       -0.312 (0.200)       -0.653 (0.427)       0.378     0.037

(c) Case 1: Sample Size = 250, Simulation Size = 161

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.322 (0.192)       -1.399 (0.220)       0.068     -0.009
β1       0.351 (0.210)        0.419 (0.225)        -0.059    0.009
β2       -0.316 (0.124)       -0.674 (0.327)       0.374     0.016

Table 4.14: Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 4: NMAR-NMAR. Missing data mechanism B was used for this comparison.

(a) Case 4: Sample Size = 50, Simulation Size = 74

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.305 (0.378)   0.085 (1.923)     0.150 (0.022)    -1.512 (0.612)   -0.122 (-1.720)   0.389 (0.101)    2.59 (0.60)
β1       0.379 (0.495)    -0.031 (-0.534)   0.246 (0.048)    0.467 (0.648)    0.057 (0.758)     0.423 (0.110)    1.72 (0.28)
β2       -0.349 (0.308)   0.341 (9.536)     0.211 (0.030)    -0.932 (1.010)   -0.242 (-2.059)   1.080 (0.291)    5.11 (1.65)

(b) Case 4: Sample Size = 100, Simulation Size = 163

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.319 (0.265)   0.071 (3.418)     0.075 (0.007)    -1.446 (0.374)   -0.056 (-1.914)   0.143 (0.02)     1.90 (0.27)
β1       0.402 (0.325)    -0.008 (-0.332)   0.106 (0.012)    0.431 (0.385)    0.021 (0.712)     0.148 (0.018)    1.40 (0.11)
β2       -0.319 (0.209)   0.371 (22.629)    0.181 (0.013)    -0.717 (0.533)   -0.027 (-0.644)   0.285 (0.048)    1.57 (0.29)

(c) Case 4: Sample Size = 250, Simulation Size = 46

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.298 (0.166)   0.092 (3.742)     0.036 (0.006)    -1.362 (0.175)   0.028 (1.067)     0.031 (0.006)    0.87 (0.11)
β1       0.391 (0.185)    -0.019 (-0.705)   0.035 (0.009)    0.417 (0.198)    0.007 (0.253)     0.039 (0.009)    1.13 (0.09)
β2       -0.297 (0.117)   0.393 (22.690)    0.168 (0.013)    -0.621 (0.278)   0.069 (1.697)     0.082 (0.018)    0.49 (0.11)

Table 4.15: Comparing the effect of sample size on confidence interval length and coverage for case 4: NMAR-NMAR. Missing data mechanism B was used for this comparison.
(a) Case 4: Sample Size = 50, Simulation Size = 74

Param.   L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.526 (0.226)    2.481 (2.291)    0.946 (0.876)              0.959 (0.680)
β1       1.758 (0.321)    2.301 (1.485)    0.959 (0.680)              0.973 (0.224)
β2       1.094 (0.197)    3.646 (4.231)    0.635 (< 0.001)            0.946 (0.876)

(b) Case 4: Sample Size = 100, Simulation Size = 163

Param.   L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.045 (0.105)    1.315 (0.500)    0.933 (0.372)              0.963 (0.372)
β1       1.228 (0.139)    1.359 (0.280)    0.963 (0.372)              0.945 (0.772)
β2       0.748 (0.076)    1.749 (0.677)    0.509 (< 0.001)            0.920 (0.160)

(c) Case 4: Sample Size = 250, Simulation Size = 46

Param.   L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       0.640 (0.032)    0.718 (0.084)    0.935 (0.676)              0.957 (0.828)
β1       0.741 (0.060)    0.784 (0.076)    0.913 (0.372)              0.957 (0.828)
β2       0.459 (0.034)    0.973 (0.173)    0.087 (< 0.001)            0.957 (0.828)

Table 4.16: Comparing the effect of sample size on the median and the difference between the median and the true value (δ) for case 4: NMAR-NMAR. Missing data mechanism B was used for this comparison.

(a) Case 4: Sample Size = 50, Simulation Size = 74

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.306 (0.424)       -1.428 (0.441)       0.084     -0.038
β1       0.278 (0.402)        0.370 (0.409)        -0.132    -0.040
β2       -0.372 (0.292)       -0.780 (0.784)       0.318     -0.090

(b) Case 4: Sample Size = 100, Simulation Size = 163

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.294 (0.269)       -1.372 (0.356)       0.096     0.018
β1       0.388 (0.321)        0.417 (0.377)        -0.022    0.007
β2       -0.325 (0.223)       -0.636 (0.432)       0.365     0.054

(c) Case 4: Sample Size = 250, Simulation Size = 46

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.267 (0.168)       -1.352 (0.151)       0.123     0.038
β1       0.378 (0.186)        0.393 (0.265)        -0.032    -0.017
β2       -0.284 (0.098)       -0.589 (0.247)       0.406     0.101

Table 4.17: Study design for exploring the effect of τ.
Pattern   Data generating τ   Model assumption τ
1         1.0                 1.0
2         1.0                 0.5
3         0.5                 1.0
4         0.5                 0.5

4.4.3.4 Comparing the effect of the specification of τ

Under the assumption that we have correctly specified the missing data model, mechanism B, two aspects will be considered in this experiment: correctly specifying τ and incorrectly specifying it (Table 4.17).

When the value of τ is correctly specified, we can observe some general differences between a noisy measurement error model (pattern 1) and a much less noisy one (pattern 4). For the complete-case analysis we observe a reduction in the bias as we move from the noisy model to the less noisy one, and we see a similar trend in the MSE. The MSE associated with β0 is the exception to this trend (Tables 4.18 and 4.19). Here, the bias is reduced, but the variability of the estimator increases for β0 and β2 as τ goes from 1.0 to 0.5. We see the increased variability affect the MSE of β0. For the other coefficients we see a reduction in the MSE. We see longer confidence intervals as we move from the noisy to the less noisy model, although this difference may not be statistically significant (Table 4.20). There is no substantial difference in the coverage or in δ_Naive (Table 4.21). The exceptions are that the coverage rate for β2 is much better in the less noisy model than in the noisy model, and that δ_Naive for β2 is notably smaller in the less noisy model.

Considering the MCEM adjustment, we see some familiar features. As desired, we see a reduction in the bias of the estimator when moving from a noisy measurement error model to a less noisy one. We see little change in the MSE, but we do note that β2 has a smaller associated MSE for the less noisy model (Tables 4.18 and 4.19). Unlike the previous situation, the confidence interval length is smaller with the less noisy model.
We also see that the nominal level for the confidence intervals continues to be reached, and note that β0 goes from not reaching the nominal level in the noisy model to reaching it in the less noisy model (Table 4.20). We see a reduction of δ_MCEM as we move from pattern 1 to pattern 4 (Table 4.21). Finally, we observe that there is little reduction in ê as we move to a less noisy measurement error model.

When considering the problem of incorrectly specifying τ, we need to compare the changes with the correct model. Pattern 2, where we under-estimate the variability of the measurement error model, should be compared with pattern 1. Pattern 3, where we over-estimate the variability of the measurement error model, should be compared with pattern 4.

We see, with pattern 2, that the complete-case analysis bias is similar to that of pattern 1, as are the MSE, length, coverage, and δ_Naive. For the MCEM approach we see that the bias improves from the complete-case situation, but is larger for the β1 and β2 estimates than in pattern 1. The MSE is about the same between the two methods, but is smaller than that of pattern 1 (Table 4.18). The lengths are longer than the complete-case confidence interval lengths, but they are shorter than those seen in pattern 1. The coverage is the most complicated relationship in that there is no clear trend or pattern. For β2 the coverage for the associated confidence intervals is similar across estimation procedures, but worse than in pattern 1. For β1, the coverage is similar for both comparisons, and for β0 the coverage is similar across estimation methodologies and better than in pattern 1 (Table 4.20). For δ_MCEM the differences are in general larger than those found in pattern 1 (Table 4.21). Finally, the estimated relative efficiency is better for pattern 2 than in pattern 1, and is noticeably better for β2.
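The relative efficiency comparison can be checked with simple arithmetic. In the sketch below, ê is taken to be the ratio of the MCEM mean squared error to the naive mean squared error. That direction is inferred from the tabulated values rather than quoted from a definition in this section, so it should be treated as an assumption; the function names are hypothetical.

```python
def relative_efficiency(mse_mcem, mse_naive):
    """Assumed definition: MSE of the MCEM estimate over MSE of the naive estimate."""
    return mse_mcem / mse_naive


def z_against_unity(e_hat, se):
    """Distance of an estimated relative efficiency from 1, in standard-error units."""
    return (e_hat - 1.0) / se


# beta_2 under pattern 1 and pattern 2, using the MSE values in Table 4.18.
print(round(relative_efficiency(0.285, 0.171), 2))  # pattern 1
print(round(relative_efficiency(0.137, 0.167), 2))  # pattern 2
```

With these inputs the ratios round to 1.67 and 0.82, matching the ê column of Table 4.18. A value below one favours the MCEM estimate, which is the sense in which pattern 2 looks "noticeably better" for β2 despite its larger bias.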
For pattern 3, the results are not as favourable towards mismatching the true value of τ and the proposed value used for the model. In the complete-case situation, the parameter estimates are less variable than their matched counterparts, but the bias increases with the mismatch. The result is that the MSE exhibits negligible differences between the matched and mismatched estimates. We notice that the confidence interval lengths and coverage rates are similar for both pattern 3 and pattern 4, but δ_Naive is larger for the mismatched situation. For the MCEM adjustment, the bias is much greater than that seen in the complete-case analysis and is much worse than in pattern 4 (Table 4.19). This pattern is seen again for the MSE, the estimate of the relative efficiency, and the length of the confidence intervals. Coverage worsens for β0 and β1 when compared to both the complete-case approach and the matched situation. For β2 the coverage rate for the associated confidence interval reaches the nominal rate and is similar to that of pattern 4 (Table 4.20). Finally, the difference between the median and the true value is worse for both comparisons (Table 4.21).

Discussion

There are two main comparisons on which to comment. Firstly, we will consider the difference between a noisy and a less noisy model; secondly, we will consider the situations when τ is inaccurately specified. We see that the less noisy model has smaller bias and a greater ability to maintain the associated confidence intervals at the correct nominal level, but we see little gain in the MSE and in the estimated relative efficiency. This may be due in part to the general, and unexpected, increase in variability of the estimators in pattern 4 when compared to pattern 1. Pattern 2 under-estimates τ, assuming a less noisy model.
If we were to assess the performance solely on the MSE and relative efficiency, we might conclude that under-estimating the measurement error is beneficial to the estimation procedure. This hides a bias that is larger for the covariates of interest: β1 and β2. Furthermore, we see undesirable properties for the coverage rate and for δ_MCEM. Pattern 3 over-estimates τ, and we observe that many of the properties of the estimates are noticeably worse than for the correctly specified model. If we were to make a mistake in our assumptions, we would rather assume a smaller measurement error than one that is too big.

A final note concerns the execution of this experiment. It is clear from the number of successful simulations for pattern 3 that many computational problems existed. The primary and most frequently encountered problem was with the initialization and updating of the abscissa.

Table 4.18: Comparing how agreement and disagreement on the specification of τ when the true value of τ = 1.0 (Table 4.17) affect point estimation, bias, mean squared error, and relative efficiency when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR.

(a) Pattern 1: Simulation Size = 168

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.355 (0.251)   0.035 (1.804)     0.064 (0.008)    -1.478 (0.315)   -0.088 (-3.618)   0.107 (0.015)    1.66 (0.17)
β1       0.361 (0.309)    -0.049 (-2.062)   0.098 (0.011)    0.448 (0.374)    0.038 (1.315)     0.141 (0.017)    1.44 (0.15)
β2       -0.321 (0.188)   0.369 (25.457)    0.171 (0.012)    -0.742 (0.531)   -0.052 (-1.259)   0.285 (0.042)    1.67 (0.28)

(b) Pattern 2: Simulation Size = 179

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.352 (0.267)   0.038 (1.918)     0.073 (0.008)    -1.379 (0.278)   0.011 (0.527)     0.078 (0.008)    1.06 (0.03)
β1       0.367 (0.310)    -0.043 (-1.878)   0.098 (0.012)    0.365 (0.307)    -0.045 (-1.965)   0.097 (0.011)    0.99 (0.03)
β2       -0.327 (0.187)   0.363 (25.964)    0.167 (0.010)    -0.394 (0.222)   0.296 (17.901)    0.137 (0.010)    0.82 (0.02)

Table 4.19: Comparing how agreement and disagreement on the specification of τ when the true value of τ = 0.5 (Table 4.17) affect point estimation, bias, mean squared error, and relative efficiency when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR.

(a) Pattern 3: Simulation Size = 39

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.306 (0.217)   0.084 (2.425)     0.054 (0.011)    -1.824 (0.648)   -0.434 (-4.188)   0.608 (0.158)    11.19 (4.34)
β1       0.366 (0.265)    -0.044 (-1.025)   0.072 (0.012)    0.752 (0.549)    0.342 (3.892)     0.418 (0.109)    5.79 (1.73)
β2       -0.471 (0.246)   0.219 (5.559)     0.109 (0.023)    -2.304 (1.796)   -1.614 (-5.612)   5.828 (1.684)    53.67 (23.1)

(b) Pattern 4: Simulation Size = 150

         Naive                                               MCEM
Param.   E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         E(β̂j) (SD)      Bias (z_bias)     MSE (SE)         ê (SE)
β0       -1.383 (0.290)   0.007 (0.297)     0.084 (0.012)    -1.45 (0.321)    -0.06 (-2.268)    0.107 (0.018)    1.26 (0.08)
β1       0.387 (0.297)    -0.023 (-0.949)   0.089 (0.009)    0.429 (0.329)    0.019 (0.702)     0.109 (0.013)    1.23 (0.08)
β2       -0.527 (0.265)   0.163 (7.552)     0.097 (0.010)    -0.724 (0.383)   -0.034 (-1.077)   0.148 (0.028)    1.52 (0.21)

Table 4.20: Comparing confidence interval length and coverage for the four patterns of τ as given in Table 4.17 when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR.

(a) Pattern 1: Simulation Size = 168

Param.   L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.045 (0.102)    1.306 (0.392)    0.958 (0.588)              0.976 (0.024)
β1       1.134 (0.128)    1.315 (0.336)    0.940 (0.600)              0.946 (0.836)
β2       0.739 (0.083)    1.774 (0.796)    0.488 (< 0.001)            0.940 (0.600)

(b) Pattern 2: Simulation Size = 179

Param.   L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.044 (0.107)    1.084 (0.137)    0.944 (0.732)              0.955 (0.732)
β1       1.141 (0.135)    1.154 (0.149)    0.944 (0.732)              0.951 (0.988)
β2       0.747 (0.078)    0.898 (0.114)    0.464 (< 0.001)            0.737 (< 0.001)
(c) Pattern 3: Simulation Size = 39

Param.   L_Naive (SD)     L_MCEM (SD)       Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.031 (0.086)    2.717 (3.923)     1.000 (< 0.001)            1.000 (< 0.001)
β1       1.128 (0.121)    2.428 (2.576)     0.974 (0.336)              1.000 (< 0.001)
β2       0.938 (0.116)    6.331 (10.108)    0.821 (0.036)              0.923 (0.528)

(d) Pattern 4: Simulation Size = 150

Param.   L_Naive (SD)     L_MCEM (SD)       Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0       1.073 (0.137)    1.173 (0.240)     0.933 (0.412)              0.947 (0.856)
β1       1.148 (0.141)    1.225 (0.193)     0.947 (0.856)              0.960 (0.532)
β2       0.979 (0.145)    1.360 (0.325)     0.880 (0.008)              0.967 (0.256)

Table 4.21: Comparing the effect of sample size on the median and the difference between the median and the true value (δ) for the four patterns of τ as given in Table 4.17 when using missing data mechanism B with agreement between the mechanism generating the missing data and the assumed missing data mechanism for case 1: MAR-MAR.

(a) Pattern 1: Simulation Size = 168

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.365 (0.425)       -1.481 (0.473)       0.025     -0.091
β1       0.362 (0.442)        0.387 (0.501)        -0.048    -0.023
β2       -0.327 (0.307)       -0.658 (0.66)        0.363     0.032

(b) Pattern 2: Simulation Size = 179

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.348 (0.263)       -1.364 (0.26)        0.042     0.026
β1       0.374 (0.271)        0.356 (0.275)        -0.036    -0.054
β2       -0.316 (0.187)       -0.377 (0.206)       0.374     0.313

(c) Pattern 3: Simulation Size = 39

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.309 (0.190)       -1.551 (0.608)       0.081     -0.161
β1       0.353 (0.319)        0.671 (0.501)        -0.057    0.261
β2       -0.458 (0.274)       -1.883 (1.413)       0.232     -1.193

(d) Pattern 4: Simulation Size = 150

Param.   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)   δ_Naive   δ_MCEM
β0       -1.364 (0.264)       -1.437 (0.304)       0.026     -0.047
β1       0.378 (0.299)        0.399 (0.315)        -0.032    -0.011
β2       -0.497 (0.233)       -0.667 (0.316)       0.193     0.023

4.4.4 Simulation study 1 discussion

We will first consider the combined effect of having missing data and measurement error problems in the same data set on a complete-case analysis. The estimated standard deviation was large, but reduced in magnitude as the sample size increased. When the missing data mechanism is MAR, β1 is generally unbiased, but mechanism C produced large z-scores, and mechanism B produced z-scores which indicate that problems with bias may arise. As expected, β2 was biased for all mechanisms and in all considered scenarios. Increasing the sample size appears to have little effect on bias reduction, but smaller τ appears to yield smaller biases. In terms of coverage, the nominal rate was not attained for any mechanism for β2, but this was not unexpected. Mechanism C failed to attain the nominal rate for β0. Increasing the sample size did not improve the overall attainment of the nominal coverage rate. With an NMAR missing data mechanism, β1 was unbiased for mechanisms A and B and β2 was biased as expected. The bias of β0 became more problematic under the NMAR assumption.
As with MAR, increasing the sample size had little impact on the bias. The attainment of the nominal coverage rate exhibited the same pattern as with MAR, with no change in attainment as the sample size increased.

When compared with the naive complete-case analysis, we see that the MCEM approach attempts to mitigate the bias of β2, associated with the mismeasured covariate, by striking a set of trade-offs with the other estimates. The emergent empirical relationship shows the MCEM algorithm implicitly setting up a hierarchy amongst the three parameters, allowing the bias in both β0 and β1 to increase in order to reduce that of β2. It appears to put the overall quality of the model estimates above the quality of the individual estimates. This is dramatically seen as the sample size increases. Furthermore, there appears to be a subtle structure to this hierarchy, in which both β0 and β1 suffer for the sake of correcting β2. When benefit is brought back to the parameter estimates, one of these two is the first to undergo improvement.

The missing data mechanism, correctly or incorrectly specified, appears to make little impact on the results, but we do see that under-modelling the missing data mechanism is in general worse than over-modelling it. When the systematic component of the missing data mechanism is considered, we do see evidence that including the response in the systematic component influences the quality of the estimates (mechanism C). In this situation, the MCEM approach has a much more difficult time reducing the bias for all three parameter estimates and may sacrifice the accuracy of the estimates following the aforementioned priority.

In general, the larger the sample size, the better the performance of the MCEM adjustment. Although not unexpected, it is a desirable quality to observe. We also observe that the parameter estimates are in general better for less noisy measurement error models.
When ‘r is incorrectly specified, we see that it is better to use a less noisy value for r than the one in the true measurement error model. Although it was possible to execute the simulation study and obtain results, it was not without its obstacles. The barriers break-down to four types: storage, computational time, ridges, and the ARS Gibbs sampler. Given the structure of the mismeasurement problem of the motivating sub stantive context, no additional data was assumed to exist for modelling the measurement error model. A classical model was assumed as was the distri bution for the random noise, c, in the measurement error model. This model was then built into the likelihood function, thus we needed to draw samples for each subject in the data. With a data set of size 100 with two covariates and 5000 ARS Gibbs samples per subject, the resulting augmented data set yields a 500,000 by 2 data matrix. A data matrix of this size becomes non trivial when it needs to be held in memory for the duration of the algorithm. Although these algorithms often required between one and two gigabytes of memory, it was large enough to be noticed. When the sample size increased to 250 subjects, the increase in memory requirement is not a linear increase of 2.5 times that of the 100 subject memory requirements. Although it was possible to manage the programs so that they would run at non-peak times, they were sufficiently large as to warrant some concern about the practicality 128 of what could be called a “brute force” approach to memory management. The required memory was reduced by identifying only the key pieces of in formation necessary for the overall simulation study; these changes made a modest impact on the required memory. The second practical feature on which comment can be made is the computational time for the MCEM algorithm with imperfect covariates. Given the comments about the computational difficulties with the MCEM approach made by Ibrahim et al. 
[44], it was expected that we may only be able to reasonably attain simulation sizes of around 60, but we were able to reach sizes of approximately 150 and still maintain a reasonable computation time in doing so. For this particular mixture of imperfect covariates, we ran each scenario in batches of 25. The computational time for each batch was roughly 2 to 5 days depending on the sample size and the demands placed on the servers. Although this may seem unreasonably long for many simulation studies, it permitted enough time to assess the algorithm itself, manage computational difficulties, and run extra batches for those which encountered computational problems or failed. Furthermore, this permitted a wide-ranging look at the performance of the algorithm.

The computational problems fall into one of two types: potential ridges and the ARS Gibbs abscissa. Certain scenarios, such as the n = 50 scenario, exhibited behaviour highly suggestive of a ridge in the likelihood. At this point it is unclear if the likelihood associated with imperfect variables, with the assumptions levied in this thesis, has ridges and in these situations we were unlucky, or if the likelihood associated with only these scenarios contained ridges. The problem is best exhibited with two scenarios. The first was when n = 50. In this situation, the measure of convergence travelled in the range (0.03, 0.04) and reached over 200 iterations. This was a noticeably large number of iterations, for the majority of the simulations were ending around 30 or 40 iterations. From what could be assessed, it appeared as if the algorithm found a portion of the likelihood where it could continue to travel without changing the distance between the lagged estimates of the parameter vector. This behaviour was also exhibited for mechanism C with n = 100.
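The stalling behaviour described above, where the convergence measure stays inside a narrow band for many iterations, can be flagged automatically during a run. The following is a minimal sketch, assuming the convergence measure is the Euclidean distance between successive estimates of the parameter vector; the measure actually used is defined elsewhere in the thesis, and the function names, window, band and tolerance below are hypothetical.

```python
import math


def lagged_distance(theta_new, theta_old):
    """Euclidean distance between successive parameter estimates (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_new, theta_old)))


def is_stalled(distances, window=50, band=0.02, tol=1e-3):
    """Flag a run whose last `window` distances stay above `tol` yet vary by
    no more than `band`, as travel along a ridge or plateau would produce."""
    if len(distances) < window:
        return False
    recent = distances[-window:]
    return min(recent) > tol and (max(recent) - min(recent)) <= band


# A history that hovers between 0.03 and 0.04, as in the n = 50 scenario.
hovering = [0.03 + 0.01 * (i % 2) for i in range(200)]
# A history that decays steadily towards zero.
decaying = [1.0 / (i + 1) ** 2 for i in range(200)]
```

A check of this kind could be used to stop a batch early and restart it, rather than letting it run to several hundred iterations before failing.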
Here, the convergence measure stayed in (0.01, 0.1) for longer than expected, around 60 iterations, but eventually failed before reaching a very large number of iterations (more than 300) due to problems with updating the abscissa for the ARS Gibbs sampler.

When considering the third pattern for $\tau$, we encountered some interesting computational problems: failure and non-convergence. This particular scenario was highly prone to problems with initializing and updating the abscissa for the ARS Gibbs sampler. This problem was not isolated to this scenario and also occurred when n = 50 and when the missing data mechanisms were mismatched. Furthermore, it appears that many of the problems with updating the abscissa were found with simulations where the convergence criterion remained in a well defined area, appearing to travel along a ridge or a plateau. A final problem that was observed occurred only with pattern 3 of the $\tau$ experiment. In this situation, non-convergence was common and, in some cases, the convergence measure grew with successive iterations.

4.5 Simulation study 2: a random variable which suffers from both imperfections

In this simulation study, we will consider the situation where there are multiple covariates with one suffering from both imperfections. This study will investigate a comparison of the:
• effect of the model assumption about the missing data mechanism,
• missing data mechanisms,
• effect of sample size, and
• effect of different distributional assumptions for the error term of the measurement error model.

The assumption of a classical measurement error model will remain constant across all the scenarios in this simulation study, with a target of 200 simulations.
4.5.1 Likelihood model for simulation study 2

Beginning with the covariate model, we have $p(X_i \mid \psi)$, where $X_{i(-1)}$ is the vector of augmented random variables excluding the first augmented random variable, $X_{i1}$ is the associated random variable, and $\psi_{(-1)}$ is the vector of indexing parameters with the indexing parameter associated with $X_{i1}$ removed. For $X_{ij}$, $j = 2, 3, 4$, the observed and target versions coincide, $X^{E}_{ij} = X_{ij}$, so the augmented pair $(X^{E}_{ij}, X_{ij})$ reduces to $X_{ij}$ for $j = 2, 3, 4$. The covariate model becomes

\[
p(X_i \mid \psi) = p(X^{E}_{i1}, X_{i1} \mid \psi_1)\, p(X_{i(-1)} \mid \psi_{(-1)})
= p(X^{E}_{i1} \mid x_{i1}, \tau)\, p(X_{i1} \mid \psi_1)\, p(X_{i(-1)} \mid \psi_{(-1)}).
\]

The log-likelihood is now specified as

\[
\ell_c(\Phi \mid x^{U}, y) = \sum_{i=1}^{n} \left[ \log p(R_i \mid x_i, y_i, \gamma) + \log p(Y_i \mid x_i, \beta) + \log p(X^{E}_{i1} \mid x_{i1}, \tau) + \log p(X_{i1} \mid \psi_1) + \log p(X_{i(-1)} \mid \psi_{(-1)}) \right].
\]

For the covariates, we will assume a multivariate normal distribution for the joint probability distribution of $X$, where $X \sim \mathrm{MVN}(\mu, \Sigma)$ with $\mu = (0, 0, 0, 0)^T$ and

\[
\Sigma = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0.1 & 0.05 \\ 0 & 0.1 & 1 & 0.1 \\ 0 & 0.05 & 0.1 & 1 \end{pmatrix}.
\]

It is clear that the multivariate normal specification has $X_{i1}$ pairwise independent of $X_{i(-1)}$. Under the assumption of an unbiased imperfection model, when the mismeasurement indicator is 0 for all $i$, equation 2.5.8 for $X^{E}_{i1}$ is expressed in terms of $X^{*}_{i1,\mathrm{obs}}$ and $X^{*}_{i1,\mathrm{miss}}$, where $X^{*}_{i1,\Gamma} = X_{i1} + \epsilon_i$ for $\Gamma \in \{\mathrm{obs}, \mathrm{miss}\}$ and $\epsilon_i \sim N(0, \tau^{2})$, so $X^{*}_{i1} \mid x_{i1}, \tau \sim N(x_{i1}, \tau^{2})$ under the assumption that the $X^{*}_{i1}$ are identically distributed.

In the previous simulation study a noisy measurement error model was assumed. For this simulation study, we will assume a slightly less noisy model and specify $\tau = 0.7$ as the true value for all models. Unlike the previous simulation, we will not consider various levels of $\tau$, but instead we will focus on the problem of misspecifying the correct value of $\tau$. The change in focus for the $\tau$ experiment pertains to the similarity of the applied example to this simulation study: with the applied example we will not know the true value of $\tau$. In this respect it is of interest to understand how the naive complete-case approach and the MCEM methodology perform when the model is misspecified in terms of $\tau$.
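The data-generating assumptions for the covariates described above (a zero mean vector, the stated covariance matrix, and classical measurement error on the first covariate with $\tau = 0.7$) can be sketched as follows. This is an illustrative sketch rather than the thesis's simulation code: the function names are hypothetical, and $\tau$ is treated as the standard deviation of the noise, which is an assumption about the parameterization.

```python
import math
import random

MU = [0.0, 0.0, 0.0, 0.0]
SIGMA = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.1, 0.05],
    [0.0, 0.1, 1.0, 0.1],
    [0.0, 0.05, 0.1, 1.0],
]
TAU = 0.7  # assumed to be the noise standard deviation


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_subject(rng, low):
    """Return (x, x1_star): the true covariates and the mismeasured x1."""
    z = [rng.gauss(0.0, 1.0) for _ in MU]
    x = [MU[i] + sum(low[i][k] * z[k] for k in range(i + 1))
         for i in range(len(MU))]
    x1_star = x[0] + rng.gauss(0.0, TAU)  # classical model: X* = X + eps
    return x, x1_star


def simulate(n, seed=0):
    """Draw n independent subjects with a reproducible seed."""
    rng = random.Random(seed)
    low = cholesky(SIGMA)
    return [draw_subject(rng, low) for _ in range(n)]
```

Under these assumptions the mismeasured covariate has variance $1 + \tau^{2} = 1.49$, which gives a quick sanity check on any generated data set.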
The response is a Bernoulli random variable with the binary logistic regression response model given in section 2.5.5, which is indexed by $\beta = (-0.69, 0.41, -0.22, 0.10, -0.11)$. The mechanism of imperfection model is given in section 2.5.6 and we will retain the definition for the imperfection indicator given in section 2.3. It is clear from equation 2.5.14 that the conditional expression of the joint distribution of the imperfection indicators allows for a dependent structure between the indicator for missing data and the indicator for mismeasurement. Also, it allows for dependent structure across the set of imperfection indicators.

Two simplifying assumptions about the mechanism of imperfection will be made. The first is that the $j$th and $k$th indicators of imperfection are independent for all $j = 1, \ldots, p$ and $k = 1, \ldots, p$. The second assumption is that the indicator for missing data and the indicator for mismeasurement are independent, $R^{\mathrm{miss}}_{ij} \perp R^{\mathrm{err}}_{ij}$, for all $j = 1, \ldots, 4$. Predicated on these two assumptions and the model currently under investigation, equation 2.5.14 becomes

\[
p(R_i \mid x_i, y_i, \gamma) = p(R^{\mathrm{miss}}_{i1} \mid x_i, y_i, \gamma^{\mathrm{miss}})\, p(R^{\mathrm{err}}_{i1} \mid x_i, y_i, \gamma^{\mathrm{err}}).
\]

It is assumed that $X_1$ is mismeasured for all realizations, so the mismeasurement indicator is 0 for all subjects, which means that there is only one outcome possible for this random variable, and

\[
p(R_i \mid x_i, y_i, \gamma) = p(R^{\mathrm{miss}}_{i1} \mid x_i, y_i, \gamma^{\mathrm{miss}}).
\]

Table 4.22: Coefficient values for different missing data mechanisms used to generate simulated data sets. The systematic component is specified as $\gamma = (\gamma_0, \gamma_1, \gamma_2, \gamma_3, \gamma_4)^T$.

Mechanism   MAR                            NMAR
A           (-2, 0, 0.5, 0.25, 0.5)^T      (-2, 0.75, 0.5, 0.25, -0.5)^T
B           (-3, 0, 0, 1, 1)^T             (-3, 2, 0, 1, 1)^T
C           (-2.5, 0, 0, 0, 0)^T           (-2.5, …)^T

The complete log-likelihood becomes

\[
\ell_c(\Phi \mid x^{U}, y) = \sum_{i=1}^{n} \left[ \log p(R^{\mathrm{miss}}_{i1} \mid x_i, y_i, \gamma^{\mathrm{miss}}) + \log p(Y_i \mid x_i, \beta) + \log p(X^{E}_{i1} \mid x_{i1}, \tau) + \log p(X_{i1} \mid \psi_1) + \log p(X_{i(-1)} \mid \psi_{(-1)}) \right] \qquad (4.5.1)
\]

since we are not considering over- or under-dispersed models. For the simulation study, one experiment will consider three different missing data mechanisms (Table 4.22).
The structure for the missing data mechanism model is the same for all three, $\gamma = (\gamma_0, \gamma_1, \gamma_2, \gamma_3, \gamma_4)^T$, where $\gamma_0$ corresponds to the intercept, $\gamma_1$ corresponds to the imperfect covariate, and $(\gamma_2, \gamma_3, \gamma_4)$ correspond to the perfect covariates. For this simulation study, the response was not considered to be a covariate in the indicator of imperfection model.

For missing data mechanism A, the probability of missing information was dependent on $(x_2, x_3, x_4)$ if MAR and on all covariates if NMAR. The simulation-based mean proportion of missing data is 0.14 and 0.16 respectively, with respective ranges of 0.25 and 0.27. As with the previous simulation study, the proportion of missing information was kept high to reflect the assumption, and observed levels, of missing data often encountered in social-epidemiological research. As with mechanism A of the previous simulation study, the mechanism was constructed so that the MAR and NMAR proportions would be similar.

In contrast to mechanism A, mechanism B puts much more weight on the imperfect covariate and does not depend on all the perfect covariates. The simulation-based mean proportion of missing data is 0.096 and 0.162 for the MAR and NMAR mechanisms. The respective ranges are 0.21 and 0.28.

For mechanism C, we further explored the role of the imperfect covariate on EM adjustment. For MAR, we consider the situation where the missing data mechanism is MCAR by setting $\gamma_j = 0$ for $j = 1, 2, 3, 4$. As before, we compare this with NMAR. The simulation-based mean proportion of missing data is 0.076 for MAR and 0.173 for NMAR. The respective ranges are 0.2 and 0.3.
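To make the roles of $\beta$ and $\gamma$ concrete, the sketch below draws a binary response and a missingness indicator for one subject from two logistic models. It is an illustration under stated assumptions, not the thesis's code: the helper names are hypothetical, the $\gamma$ vector is the mechanism A, MAR entry as read from Table 4.22, and coding `missing == 1` to mean that $x_1$ is not recorded is an assumption about the direction of the indicator.

```python
import math
import random

BETA = (-0.69, 0.41, -0.22, 0.10, -0.11)    # response model coefficients from the text
GAMMA_A_MAR = (-2.0, 0.0, 0.5, 0.25, 0.5)   # mechanism A, MAR, as read from Table 4.22


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def linear_predictor(coef, x):
    """Intercept plus the inner product of the slopes with x = (x1, x2, x3, x4)."""
    return coef[0] + sum(c * xi for c, xi in zip(coef[1:], x))


def draw_response_and_missingness(x, rng, beta=BETA, gamma=GAMMA_A_MAR):
    """Return (y, missing) for one subject with true covariates x."""
    y = 1 if rng.random() < expit(linear_predictor(beta, x)) else 0
    missing = 1 if rng.random() < expit(linear_predictor(gamma, x)) else 0
    return y, missing


def missing_proportion(n, seed=0):
    """Proportion missing when covariates are independent N(0, 1) stand-ins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(4)]
        total += draw_response_and_missingness(x, rng)[1]
    return total / n
```

Because $\gamma_1 = 0$ in the MAR column, the probability of missingness does not change with $x_1$, which is what distinguishes it from the NMAR column where $\gamma_1 \neq 0$.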
4.5.2 Simulation specific details for the Monte Carlo EM algorithm

Beginning with the approximation to $Q(\Phi \mid \Phi^{(t)})$, the approximation is assembled from one term for each component of the likelihood in equation 4.5.1. Under the imperfection mechanism model assumptions given in the previous section, equation 4.5.1 and the functional form given in equation 3.3.10, we have

\[
Q(\gamma \mid \gamma^{(t)}) \approx \sum_{i=1}^{n} \frac{1}{m} \sum_{k=1}^{m} \log p\left(R_{i1} \mid x_i^{(t+1,k)}, y_i, \gamma\right),
\]

where $m = m_t$ as discussed in the previous simulation study. The expectation-step estimate of the response model given by equation 3.4.6 becomes

\[
Q(\beta \mid \beta^{(t)}) \approx \sum_{i=1}^{n} \frac{1}{m} \sum_{k=1}^{m} \log p\left(y_i \mid x_i^{(t+1,k)}, \beta\right).
\]

The covariate model is slightly more involved than that of the previous simulation. We approximate equation 3.3.11 in two parts. First we have the approximation for the observable random variable,

\[
\frac{1}{m} \sum_{k=1}^{m} \log p\left(x^{E}_{i1} \mid x_{i1}^{(t+1,k)}, \tau\right),
\]

and second we have the approximation for the unobservable target random variable,

\[
\frac{1}{m} \sum_{k=1}^{m} \left[ \log p\left(x_{i1}^{(t+1,k)} \mid \psi_1\right) + \log p\left(x_{i(-1)} \mid \psi_{(-1)}\right) \right],
\]

so

\[
Q(\psi \mid \psi^{(t)}) \approx \sum_{i=1}^{n} \frac{1}{m} \sum_{k=1}^{m} \left[ \log p\left(x^{E}_{i1} \mid x_{i1}^{(t+1,k)}, \tau\right) + \log p\left(x_{i1}^{(t+1,k)} \mid \psi_1\right) + \log p\left(x_{i(-1)} \mid \psi_{(-1)}\right) \right].
\]

4.5.2.1 Gibbs adaptive rejection sampling

Beginning with equations 3.4.8, the Gibbs sampler for this simulation alternates between a draw of the mismeasured value $x^{E}_{i1}$ and a draw of the unobservable target $x_{i1}$, each from its full conditional given $r_i$, $x_{i(-1)}$, $y_i$ and the current parameter estimates $\Phi^{(t)}$. Each full conditional is proportional to the product of the likelihood components that involve the variable being drawn; for the target variable,

\[
p\left(x_{i1} \mid x^{E(t+1)}_{i1}, r_i, x_{i(-1)}, y_i, \Phi^{(t)}\right) \propto p\left(r_i \mid x^{E(t+1)}_{i1}, x_{i1}, x_{i(-1)}, y_i, \gamma^{(t)}\right) p\left(y_i \mid x_{i1}, x_{i(-1)}, \beta^{(t)}\right) p\left(x^{E(t+1)}_{i1} \mid x_{i1}, \tau\right) p\left(x_{i1} \mid \psi_1^{(t)}\right).
\]

As with the previous simulation study, each full conditional is itself log-concave (Table B.1 and Appendix B.1.5). We will use the ARS Gibbs sampler under the sampling specifications outlined in simulation study one (Section 4.4.2.1).

4.5.2.2 Analytic solutions for maximization

The underlying concepts for maximization are identical to those in section 4.4.2.1, thus we proceed directly to the normal distribution, which has a closed form solution. Before we proceed, an important observation should be made.
The first is that $\tau$ is assumed to be given for the model to ensure it is identifiable. This is the only parameter which needs to be estimated for the conditional distribution $X^{E}_{i1} \mid x_{i1}$. Since we are assuming that $\tau$ is known, this distribution is fully characterized.

Again, following the results given by Dwyer [23] and beginning with equation C.1.3, the MCEM maximum-likelihood type estimate of the mean for $X_1$ simplifies because the imperfection indicator is 0 for all $i$, so

\[
\hat{\mu}_1 = \bar{x}_{1\cdot\cdot} = \frac{1}{mn} \sum_{i=1}^{n} \sum_{k=1}^{m} x_{i1k},
\]

where $\bar{x}_{1\cdot\cdot}$ is the grand average over the $mn$ simulated values. Turning to the estimation of the variance, we begin with equation C.1.4. Again the indicator is 0 for all $i$ and we are only estimating $\sigma_1^{2}$, so

\[
\hat{\sigma}_1^{2} = \frac{1}{n} \sum_{i=1}^{n} \left[ \left(\bar{x}_{i1\cdot} - \bar{x}_{1\cdot\cdot}\right)^{2} + \frac{m-1}{m}\, s_{i1}^{2} \right], \qquad s_{i1}^{2} = \frac{1}{m-1} \sum_{k=1}^{m} \left(x_{i1k} - \bar{x}_{i1\cdot}\right)^{2},
\]

with the result for large $m$ holding (Equation C.1.5).

4.5.2.3 Louis standard error

The computation of the Louis standard errors is essentially the same as in the previous simulation study. From the general presentation of the Louis standard error, we have the relationship given in equation 3.5.3, with the Monte Carlo approximation based on the current Gibbs sample given by equation 3.5.5. Since the likelihood model components are uniquely parameterized, the resulting information matrix will be block diagonal, with each block corresponding to a uniquely parameterized component of the likelihood. With this structure we can use the result that the inverse of a block diagonal matrix is a block diagonal matrix of the block inverses, thus we need not compute the inverse of the entire information matrix in order to obtain the covariance matrix for the parameters of interest. Since we are primarily interested in $\beta$, the parameter for the response model, we need only compute the block that corresponds to the response model to obtain the corresponding information, and then take its inverse for the variance estimate.
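The block diagonal result used above can be checked numerically. The sketch below builds a small block diagonal matrix from two hypothetical information blocks (the numbers are illustrative only and are not taken from the thesis), inverts the full matrix, and compares the relevant block of that inverse with the inverse of the response-model block computed alone.

```python
def invert(a):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def block_diag(blocks):
    """Assemble square blocks along the diagonal of a zero matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                out[offset + i][offset + j] = float(v)
        offset += len(b)
    return out


# Hypothetical information blocks: response model (2 x 2) and one other component (1 x 1).
info_beta = [[4.0, 1.0], [1.0, 3.0]]
info_other = [[2.0]]

full_inverse = invert(block_diag([info_beta, info_other]))
beta_cov = invert(info_beta)  # the only inverse needed for the response model
```

The top-left block of `full_inverse` agrees with `beta_cov`, so only the response-model block has to be inverted to obtain standard errors for $\beta$.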
Using equation 3.5.6 and the score and Hessian for the response model as given in appendix B.2.2, we are able to obtain the MCEM Louis standard error for the $s$th simulation estimate of $\beta$.

4.5.3 Results and discussions

With two completed simulations, each of which considers a different manifestation of imperfection in a data set, we have compelling evidence that it is computationally feasible to implement a solution to the problem of having both missing data and measurement error in the same data set. The caveat seems to be that we may be limited to small samples.

4.5.3.1 Comparing the effect of the missing data mechanism assumption

Three aspects of this simulation experiment will be considered: comparisons between the naive complete-case approach and the MCEM method within the matched scenarios, a comparison of correctly specified and misspecified missing data models, and a comparison between MAR and NMAR models when they are correctly specified.

We begin with Table 4.23. With the model correctly specified and having a MAR missing data mechanism, we can see that the MCEM approach produces a dramatic reduction in the bias of $\beta_1$, and there is no statistical evidence against the null hypothesis of equality with zero. This is not surprising, as this would be our best guess based on the missing data literature. An interesting feature is that $\beta_0$, $\beta_2$, and $\beta_3$ all retain biases which are significantly different from zero, while $\beta_4$ remains statistically non-significant. Furthermore, the MCEM approach does not appear to have made any noticeable adjustment for the $\beta_j$, $j = 0, 2, 3, 4$. The MSE for $\beta_1$ is much larger, with the others being much the same between the two methods. The efficiency for $\beta_1$ favours the complete-case method, but for all others the difference may not be significant. For all parameters, the lengths of the confidence intervals are similar between approaches (Table 4.25). The coverage is as expected for all parameters except $\beta_1$ and $\beta_3$.
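The summary measures reported in Tables 4.23 and 4.24 can be computed from a vector of simulation estimates along the following lines. This is a sketch under assumptions: the formal definitions are given elsewhere in the thesis, and the forms used here (z-score as bias divided by SD over the square root of the number of simulations, MSE as the mean squared deviation from the true value, efficiency as the ratio of MCEM to naive MSE) are chosen because they approximately reproduce the tabulated values. The function names are hypothetical.

```python
import math
import statistics


def summarize(estimates, truth):
    """Mean, SD, bias, z-score for the bias, and MSE of simulated estimates."""
    n_sim = len(estimates)
    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)
    bias = mean - truth
    z_bias = bias / (sd / math.sqrt(n_sim))  # assumed form of the z-score
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    return {"mean": mean, "sd": sd, "bias": bias, "z_bias": z_bias, "mse": mse}


def relative_efficiency(mse_mcem, mse_naive):
    """Assumed form: values above 1 favour the naive complete-case estimate."""
    return mse_mcem / mse_naive
```

For example, with the Case 1 values for $\beta_1$, `relative_efficiency(0.102, 0.060)` gives 1.70, matching the tabulated efficiency.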
The MCEM adjustment corrects the coverage problem for $\beta_1$ while $\beta_3$ remains uncorrected. Finally, with Table 4.27 we see that there is a good adjustment for $\beta_1$, which looks very similar to that observed for the non-robust measures. Additionally, we see that there may be a slight trade-off in the correction; the difference for $\beta_0$ is sacrificed in order to improve that of $\beta_1$. This is similar to that observed in the previous study.

The NMAR-NMAR scenario shows that the bias for $\beta_1$ undergoes a substantial correction under the MCEM model and the other parameters are relatively unchanged (Table 4.24). As expected, there is little change in the MSE for all parameters except that of $\beta_1$, which has increased. The efficiency is similar to that of the other scenarios. There is no significant difference in the lengths of the confidence intervals, but we do note longer MCEM intervals (Table 4.26). For $\beta_1$, the coverage improves and attains the nominal rate. There is no change in coverage for the other parameters. With $\delta$, there is a substantial improvement in the difference for $\beta_1$ with little change for the other parameters (Table 4.27).

A very real challenge with missing data is the correct specification of the missing data model. It is overly naive to assume that we would always get this correct, thus it is of interest to consider how misspecification changes the adjusted estimates. For these two cases we will focus on the MCEM estimates. In Table 4.23, we see, moving from assuming MAR to NMAR, that the bias for $\beta_2$ and $\beta_3$ changes from being significantly different from zero to non-significance. The MSE for each parameter is roughly the same, with a decrease for $\beta_1$, but this may not be a significant reduction of the MSE. We see a similar result for the efficiency. The lengths are very similar, but we see a worsening of the coverage for $\beta_1$ and an improvement for $\beta_3$
(Table 4.25). Finally, we see that $\delta_{\mathrm{MCEM}}$ worsens for $\beta_0$ and $\beta_1$, improves for $\beta_2$ and $\beta_3$, and stays the same for $\beta_4$.

When the data is generated by a NMAR mechanism but assumed to be MAR, we see that misspecification causes the bias to be reduced for $\beta_0$ and $\beta_1$ while increasing it for the other parameters (Table 4.24). The MSE increases for all parameters except $\beta_0$, for which it decreases. Furthermore, $\beta_0$ and $\beta_2$ exhibit significant differences in their associated MSEs across the two cases. There is a general decrease in efficiency except for $\beta_3$; $\beta_0$ exhibits a significant difference in efficiency. There is no significant difference in the lengths (Table 4.26). The coverage worsens for $\beta_0$, improves for $\beta_1$ and $\beta_2$, and shows no real change for $\beta_3$ and $\beta_4$. For $\delta_{\mathrm{MCEM}}$ we see that misspecification reduces $\delta$ for $\beta_0$ and $\beta_1$ while increasing it for the others.

Now we turn our attention to some comparisons between MAR and NMAR when the models are correctly specified, that is, we are comparing cases 1 and 4. The MCEM approach appears to do a better job with bias reduction for NMAR data compared to MAR when considering the entire parameter vector. The caveat here is that the MCEM approach does a better adjustment for $\beta_1$ if the missing data mechanism is MAR rather than NMAR, but performs rather poorly for the other parameters. Comparing the MSE across the two cases, we see little difference, with perhaps a difference for two of the parameters. The nominal coverage is reached for $\beta_0$ and $\beta_4$ for both mechanisms. For $\beta_1$ the coverage rate changes, but still attains the nominal rate. For $\beta_2$, there is a loss in the attainment of the nominal rate whereas $\beta_3$ attains it. We see no differences in the lengths of the confidence intervals. We see that the magnitude of $\delta_{\mathrm{MCEM}}$ increases for $\beta_1$ as we move from the MAR to the NMAR mechanism, but all others decrease.

Discussion

As before, we see subtle trade-offs occurring.
Frequently these trade-offs involve a worsening of the bias for the parameters associated with perfect covariates and the intercept, with the latter often suffering the most. In most cases, we see stability in the MSE, with only the MSE for $\beta_1$ having a substantial increase. There is little change in efficiency across the four cases. The lengths of the confidence intervals for the MCEM approach are similar across all cases. It is only reached in case 1 and case 3. A general observation is that MCEM adjustment does improve the coverage rate, but for cases 1, 2, and 4 there exists at least one parameter for which the nominal rate is not achieved. Finally, we see that $\delta$ for $\beta_1$ improves with adjustment, but this often comes at the expense of a degradation of the others, with a worsening for $\beta_0$ being the most obvious.

The effect of misspecification appears to depend on the type. Considering only the MCEM results, we see that over-modelling results in a general decrease of parameter biases whereas under-modelling tends to increase the biases. The inference about the biases changes only when going from case 1 to case 2. Here the over-modelled results appear to have no significant bias. Over-modelling appears to negatively affect the coverage associated with $\beta_1$, but has a positive effect on the coverage for $\beta_3$. Under-modelling brings an improvement to the coverage associated with $\beta_1$ and $\beta_2$, but negatively affects the intercept. From the point of view of bias and $\delta_{\mathrm{MCEM}}$ it is better to over-model, but in terms of coverage it would be better to under-model.

Stepping back to look at the overall trend across the mechanisms reveals no clear or obvious patterns. What is consistent is that the MCEM approach does decrease the magnitude of the bias for the parameter estimate associated with the imperfect covariate, $\beta_1$. The MSE for this parameter is larger for the MCEM approach.
The MCEM approach produces longer confidence intervals, but this does not translate to the attainment of the nominal coverage rate for all cases. As with the bias, $\delta$ improves with the application of the MCEM method on the imperfect data, but there is a general cost on $\delta$ for the other parameters which displays no clear pattern.

Table 4.23: Comparing the effect of the missing data mechanism when the missing data was generated with a MAR mechanism on the point estimate, bias, and mean squared error when a covariate has both missing data and mismeasured data. The mechanism that generates the missing data and the missing data mechanism used to model the data is matched and mismatched. Mechanism A was used to generate the missing data for $X_1$.

(a) Case 1: MAR-MAR, Simulation Size = 200

Naive
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)
$\beta_0$    -0.722 (0.230)    -0.032 (-1.992)    0.054 (0.005)
$\beta_1$    0.291 (0.214)     -0.119 (-7.833)    0.060 (0.006)
$\beta_2$    -0.260 (0.237)    -0.040 (-2.417)    0.058 (0.006)
$\beta_3$    0.133 (0.231)     0.033 (2.045)      0.055 (0.005)
$\beta_4$    -0.121 (0.253)    -0.011 (-0.629)    0.064 (0.006)

MCEM
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)         Efficiency (SD)
$\beta_0$    -0.738 (0.241)    -0.048 (-2.809)    0.060 (0.006)    1.12 (0.03)
$\beta_1$    0.418 (0.319)     0.008 (0.375)      0.102 (0.011)    1.70 (0.14)
$\beta_2$    -0.266 (0.242)    -0.046 (-2.662)    0.061 (0.007)    1.05 (0.01)
$\beta_3$    0.136 (0.236)     0.036 (2.136)      0.057 (0.005)    1.04 (0.01)
$\beta_4$    -0.125 (0.259)    -0.015 (-0.794)    0.067 (0.007)    1.05 (0.01)

(b) Case 2: MAR-NMAR, Simulation Size = 200

Naive
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)
$\beta_0$    -0.713 (0.242)    -0.023 (-1.361)    0.059 (0.006)
$\beta_1$    0.280 (0.196)     -0.130 (-9.373)    0.055 (0.005)
$\beta_2$    -0.241 (0.237)    -0.021 (-1.227)    0.057 (0.007)
$\beta_3$    0.089 (0.239)     -0.011 (-0.638)    0.057 (0.005)
$\beta_4$    -0.134 (0.250)    -0.024 (-1.351)    0.063 (0.008)

MCEM
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)         Efficiency (SD)
$\beta_0$    -0.728 (0.250)    -0.038 (-2.128)    0.064 (0.007)    1.08 (0.02)
$\beta_1$    0.405 (0.293)     -0.005 (-0.242)    0.086 (0.009)    1.55 (0.13)
$\beta_2$    -0.245 (0.242)    -0.025 (-1.455)    0.059 (0.007)    1.05 (0.01)
$\beta_3$    0.091 (0.244)     -0.009 (-0.505)    0.060 (0.005)    1.04 (0.01)
$\beta_4$    -0.136 (0.255)    -0.026 (-1.459)    0.066 (0.008)    1.05 (0.01)

Table 4.24: Comparing the effect of the missing data mechanism when the missing data was generated with a NMAR mechanism on the point estimate, bias, and mean squared error when a covariate has both missing data and mismeasured data. The mechanism that generates the missing data and the missing data mechanism used to model the data is matched and mismatched. Mechanism A was used to generate the missing data for $X_1$.

(a) Case 3: NMAR-MAR, Simulation Size = 200

Naive
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)
$\beta_0$    -0.700 (0.202)    -0.010 (-0.708)    0.041 (0.004)
$\beta_1$    0.280 (0.235)     -0.130 (-7.822)    0.072 (0.007)
$\beta_2$    -0.242 (0.249)    -0.022 (-1.223)    0.063 (0.007)
$\beta_3$    0.121 (0.259)     0.021 (1.119)      0.068 (0.008)
$\beta_4$    -0.133 (0.255)    -0.023 (-1.268)    0.065 (0.008)

MCEM
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)         Efficiency (SD)
$\beta_0$    -0.704 (0.207)    -0.014 (-0.938)    0.043 (0.005)    1.06 (0.02)
$\beta_1$    0.415 (0.369)     0.005 (0.183)      0.136 (0.019)    1.89 (0.20)
$\beta_2$    -0.246 (0.256)    -0.026 (-1.450)    0.066 (0.007)    1.05 (0.01)
$\beta_3$    0.124 (0.268)     0.024 (1.260)      0.073 (0.009)    1.07 (0.02)
$\beta_4$    -0.136 (0.261)    -0.026 (-1.429)    0.069 (0.008)    1.05 (0.01)

(b) Case 4: NMAR-NMAR, Simulation Size = 200

Naive
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)
$\beta_0$    -0.669 (0.228)    0.021 (1.331)      0.053 (0.005)
$\beta_1$    0.284 (0.228)     -0.126 (-7.847)    0.053 (0.005)
$\beta_2$    -0.223 (0.212)    -0.003 (-0.175)    0.045 (0.004)
$\beta_3$    0.111 (0.244)     0.011 (0.651)      0.060 (0.007)
$\beta_4$    -0.117 (0.236)    -0.007 (-0.407)    0.056 (0.008)

MCEM
Parameter    Mean (SD)         Bias (z_bias)      MSE (SD)         Efficiency (SD)
$\beta_0$    -0.672 (0.241)    0.018 (1.062)      0.058 (0.005)    1.11 (0.02)
$\beta_1$    0.427 (0.365)     0.017 (0.664)      0.134 (0.020)    1.97 (0.22)
$\beta_2$    -0.228 (0.218)    -0.008 (-0.487)    0.048 (0.004)    1.06 (0.02)
$\beta_3$    0.114 (0.249)     0.014 (0.799)      0.062 (0.008)    1.05 (0.01)
$\beta_4$    -0.119 (0.242)    -0.009 (-0.528)    0.059 (0.008)    1.05 (0.01)

Table 4.25: Comparing the effect of the missing data mechanism when the missing data was generated with a MAR mechanism on confidence interval length, and coverage where a single covariate suffers from both missing data and measurement error. Both matched and mismatched missing data mechanisms are considered.

(a) Case 1: MAR-MAR, Simulation Size = 200

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)    Coverage_MCEM (p-value)
$\beta_0$    0.900 (0.055)    0.936 (0.094)    0.945 (0.756)               0.950 (1.000)
$\beta_1$    0.812 (0.086)    1.207 (0.229)    0.880 (0.004)               0.955 (0.732)
$\beta_2$    0.915 (0.087)    0.936 (0.099)    0.955 (0.732)               0.955 (0.732)
$\beta_3$    0.917 (0.086)    0.937 (0.095)    0.975 (0.024)               0.975 (0.024)
$\beta_4$    0.907 (0.086)    0.927 (0.097)    0.930 (0.268)               0.930 (0.268)

(b) Case 2: MAR-NMAR, Simulation Size = 200

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)    Coverage_MCEM (p-value)
$\beta_0$    0.901 (0.055)    0.934 (0.083)    0.950 (1.000)               0.950 (1.000)
$\beta_1$    0.813 (0.079)    1.204 (0.207)    0.905 (0.028)               0.985 (<0.001)
$\beta_2$    0.914 (0.081)    0.933 (0.090)    0.960 (0.472)               0.960 (0.472)
$\beta_3$    0.917 (0.082)    0.935 (0.089)    0.965 (0.248)               0.965 (0.248)
$\beta_4$    0.914 (0.082)    0.932 (0.092)    0.945 (0.756)               0.945 (0.756)

Table 4.26: Comparing the effect of the missing data mechanism when the missing data was generated with a NMAR mechanism on confidence interval length, and coverage where a single covariate suffers from both missing data and measurement error. Both matched and mismatched missing data mechanisms are considered.

(a) Case 3: NMAR-MAR, Simulation Size = 200

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)    Coverage_MCEM (p-value)
$\beta_0$    0.898 (0.047)    0.929 (0.072)    0.970 (0.096)               0.975 (0.024)
$\beta_1$    0.844 (0.087)    1.294 (0.294)    0.875 (<0.001)              0.955 (0.732)
$\beta_2$    0.917 (0.080)    0.938 (0.095)    0.935 (0.388)               0.935 (0.388)
$\beta_3$    0.915 (0.084)    0.937 (0.107)    0.940 (0.552)               0.940 (0.552)
$\beta_4$    0.915 (0.083)    0.937 (0.097)    0.940 (0.552)               0.940 (0.552)

(b) Case 4: NMAR-NMAR, Simulation Size = 200

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)    Coverage_MCEM (p-value)
$\beta_0$    0.893 (0.052)    0.928 (0.089)    0.950 (1.000)               0.950 (1.000)
$\beta_1$    0.838 (0.091)    1.289 (0.311)    0.880 (0.004)               0.970 (0.096)
$\beta_2$    0.908 (0.078)    0.931 (0.095)    0.990 (<0.001)              0.990 (<0.001)
$\beta_3$    0.903 (0.086)    0.925 (0.101)    0.940 (0.552)               0.940 (0.552)
$\beta_4$    0.905 (0.088)    0.928 (0.103)    0.955 (0.732)               0.955 (0.732)

Table 4.27: Comparing the effect of the missing data mechanism on the median and difference between the median and true value ($\delta$) where a single covariate suffers from both missing data and measurement error.

(a) Case 1: MAR-MAR, Simulation Size = 200

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta_{\mathrm{Naive}}$    $\delta_{\mathrm{MCEM}}$
$\beta_0$    -0.698 (0.240)    -0.729 (0.266)    -0.008                       -0.039
$\beta_1$    0.296 (0.228)     0.419 (0.327)     -0.114                       0.009
$\beta_2$    -0.241 (0.234)    -0.246 (0.241)    -0.021                       -0.026
$\beta_3$    0.148 (0.216)     0.149 (0.217)     0.048                        0.049
$\beta_4$    -0.087 (0.248)    -0.087 (0.253)    0.023                        0.023

(b) Case 2: MAR-NMAR, Simulation Size = 200

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta_{\mathrm{Naive}}$    $\delta_{\mathrm{MCEM}}$
$\beta_0$    -0.727 (0.231)    -0.745 (0.250)    -0.037                       -0.055
$\beta_1$    0.273 (0.207)     0.391 (0.288)     -0.137                       -0.019
$\beta_2$    -0.232 (0.236)    -0.239 (0.239)    -0.012                       -0.019
$\beta_3$    0.093 (0.252)     0.094 (0.257)     -0.007                       -0.006
$\beta_4$    -0.133 (0.250)    -0.133 (0.255)    -0.023                       -0.023

(c) Case 3: NMAR-MAR, Simulation Size = 200

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta_{\mathrm{Naive}}$    $\delta_{\mathrm{MCEM}}$
$\beta_0$    -0.693 (0.185)    -0.703 (0.183)    -0.003                       -0.013
$\beta_1$    0.278 (0.231)     0.395 (0.346)     -0.132                       -0.015
$\beta_2$    -0.264 (0.223)    -0.266 (0.224)    -0.044                       -0.046
$\beta_3$    0.124 (0.245)     0.125 (0.250)     0.024                        0.025
$\beta_4$    -0.139 (0.21)     -0.143 (0.215)    -0.029                       -0.033

(d) Case 4: NMAR-NMAR, Simulation Size = 200

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta_{\mathrm{Naive}}$    $\delta_{\mathrm{MCEM}}$
$\beta_0$    -0.664 (0.232)    -0.666 (0.247)    0.026                        0.024
$\beta_1$    0.265 (0.212)     0.389 (0.324)     -0.145                       -0.021
$\beta_2$    -0.227 (0.204)    -0.227 (0.209)    -0.007                       -0.007
$\beta_3$    0.118 (0.222)     0.122 (0.223)     0.018                        0.022
$\beta_4$    -0.116 (0.229)    -0.116 (0.235)    -0.006                       -0.006

4.5.3.2 Comparing missing data mechanisms

Three missing data mechanisms will be considered (Table 4.22) and we will compare the effect of the different mechanisms on the summary measures for the situation where the generating missing data mechanism is correctly specified for both the MAR and the NMAR cases.

For the MAR naive complete-case analysis, little difference in the estimated standard deviation across the three missing data mechanisms is exhibited. For the parameter of primary interest, $\beta_1$, the bias across all three mechanisms is comparable. The large z-scores suggest significant biases for $\beta_1$, and for $\beta_2$ and $\beta_3$ for mechanism A.
For $\beta_0$ and one other parameter, mechanism C, which is MCAR, produces the worst biases. We can think of the mechanisms as providing a progression from MAR with all the covariates, except $x_1$, participating in the model, to MCAR for mechanism C. From this perspective we see that the bias associated with the intercept worsens as the mechanism progresses towards MCAR, but is reduced for $\beta_2$ and $\beta_3$. The MSE is similar in magnitude across mechanisms for each parameter estimate, as are the confidence interval lengths. For coverage, we see the expected significant difference for $\beta_1$, but we also see a difference from the nominal rate with mechanism A for $\beta_3$ and with mechanism C for all parameters except $\beta_2$ and one other. When we consider the difference $\delta$, we see that mechanism C tends to do the best job at reducing the differences whereas mechanisms A and B have varied results. This contrasts with the bias.

When we look to the MCEM summaries, we observe that the estimated standard deviations tend to be larger than for the naive approach, with the most noticeable difference being the estimated standard deviations for $\beta_1$. The bias is significant for $\beta_0$ across all mechanisms and for $\beta_2$ and $\beta_3$ with mechanism A. As we move from the naive approach to the MCEM method, we see that the bias, in general, tends to increase for all parameter estimates aside from $\beta_1$, which is associated with the imperfect covariate. It is interesting that the z-scores are similar to those of the naive approach. Although the biases have increased for all, the estimated standard deviations for each parameter have also increased, which has an almost combined null effect on the magnitude of the z-score.

The MSE increases for all parameter estimates across all mechanisms. It seems that the increase in MSE is primarily driven by the increase in the estimated standard deviation. For $\beta_1$ there is a dramatic increase in the estimated standard deviation which almost swamps any gains made by managing the bias.
For the other estimates, we have a combined effect of a slightly larger estimate of the standard deviation with a general increase in the bias. The efficiency is greater for the naive approach across all mechanisms and parameter estimates but, outside of $\beta_1$, the two are rather close to one another.

The confidence interval lengths are longer for the MCEM approach and they are similar across missing data mechanisms for each parameter estimate. The difference in length between the naive complete-case method and the MCEM approach may not be statistically significant. We see the usual trend from the naive approach to the MCEM method for the coverage. We can observe that mechanism C, the MCAR mechanism, struggles to achieve the nominal rate, leaving $\beta_2$ and two other parameters with coverage rates either too low or too high. Finally, we see that mechanism C has a general decrease in $\delta$ moving from the complete-case method to the MCEM, whereas the other two mechanisms have a general increase in $\delta$.

Turning our attention to case 4, NMAR-NMAR, we see that the estimates, for the complete-case situation, are similar in magnitude across the mechanisms and that the estimated standard deviations are large. The bias is not uniform across the three mechanisms, with mechanism C performing worse than the other two. We see the expected bias in $\beta_1$. The MSE tends to increase as we move from mechanism A to C. The larger bias of mechanism C is the main driver for the difference between the MSE of the parameters associated with mechanisms A and B and those of C. This suggests that the mechanisms may in fact be dissimilar in terms of their MSE, which appears to result from the differences in the biases.

We see little difference across the mechanisms for the confidence intervals. We observe that mechanism C struggles with the attainment of the nominal coverage rate for all parameters except $\beta_4$. Mechanism B is unable to attain the coverage rate for the intercept parameter and mechanism A fails for $\beta_2$.
Mechanism C consistently has the largest magnitudes for δ_Naive, and δ_Naive for β1 is large across all three mechanisms.

When we adjust for the imperfection using the MCEM approach, the point estimates are similar across mechanisms for each parameter, with large estimates of the standard deviation. The MCEM approach appears to perform well for mechanisms A and B, but struggles to adequately correct the biases of mechanism C. This leads to a noticeable dissimilarity across the mechanisms, which translates into some dissimilarities in the MSE of each parameter across the mechanisms. For all parameter estimates associated with mechanism C, except for the estimate of β3, the MSE is noticeably larger than for the other two mechanisms, and the difference is most likely statistically significant. As a general trend, the MSE tends to increase as the perfect covariates are removed from the missing data mechanism. We see no substantial differences across mechanisms in the length of the confidence intervals for any parameter estimate. The coverage rates hold some surprises. For example, β1 fails to reach the nominal rate for mechanism C. We also see a lack of attainment for mechanism C with β0 and β3 and for mechanism A with β2. For the difference between the median and the true value, we see the expected correction appear for β1. Aside from this, what is noticeable is a distinct similarity between δ_MCEM for mechanisms A and B but a dissimilarity with mechanism C.

Discussion

Under the MAR missing data model, if we are interested only in β1 and the other covariates are included in the model for adjustment only, then we see the desired corrections as we move from the naive complete-case approach to the MCEM methodology. If we are interested in all the covariates, then we have some mixed results.
We see the familiar trade-off, in that correcting the bias of the parameter associated with the imperfect covariate appears to increase the bias for the other covariates. This happened across all missing data mechanisms for β0, β2, and β3 and for mechanisms A and B with β4. Although there were increases, these have a minimal impact on the estimates themselves and would have even less impact when translated to the odds-ratio scale.

Although for most measures there was little to distinguish the three missing data mechanisms, the coverage rates attained do pose some concern. Mechanism C, which is MCAR, struggled to reach the nominal coverage rate for most parameters. This is a serious issue, for it will affect the sensibility of inference for the affected parameters. If we restrict attention to the parameter associated with the imperfect covariate, then we have little concern in this area.

In general, we observe few differences between the three missing data mechanisms under the MAR model. As noted above, there is a difference in coverage for mechanism C compared to mechanisms A and B. We also note a difference in how δ performs when moving from the complete-case situation to the MCEM approach. Mechanism C shows a general decrease in the magnitude of δ whereas the other two mechanisms tend to show an increase.

When we turn to the NMAR-NMAR case, we see in general a trend towards a similarity in results for mechanisms A and B, but a dissimilarity with C. This was surprising given the results for the MAR-MAR portion of this experiment. Mechanism C in the NMAR-NMAR situation has the missing data mechanism solely reliant on the unobservable realizations of the imperfect random variable.
The biased estimates, together with the evidence against the similarity of mechanism C with the other two, suggest that a synergistic effect may be occurring: the combined imperfection problem within a single covariate produces a dissonance between what is expected when missing data and mismeasurement are considered in isolation and the effect that manifests when the two problems are combined.

Table 4.28: Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data are matched (MAR-MAR) for a covariate with both missing and mismeasured data.

            Naive                                                 MCEM
Model  βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
A      β0   -0.722 (0.230)    -0.032 (-1.992)   0.054 (0.005)     -0.738 (0.241)    -0.048 (-2.809)   0.060 (0.006)     1.12 (0.03)
B      β0   -0.758 (0.239)    -0.068 (-4.031)   0.062 (0.008)     -0.776 (0.255)    -0.086 (-4.808)   0.072 (0.010)     1.17 (0.05)
C      β0   -0.793 (0.217)    -0.103 (-6.713)   0.058 (0.006)     -0.815 (0.231)    -0.125 (-7.662)   0.069 (0.006)     1.19 (0.03)
A      β1   0.291 (0.214)     -0.119 (-7.833)   0.060 (0.006)     0.418 (0.319)     0.008 (0.375)     0.102 (0.011)     1.70 (0.14)
B      β1   0.311 (0.227)     -0.099 (-6.187)   0.061 (0.007)     0.447 (0.354)     0.037 (1.470)     0.127 (0.024)     2.07 (0.25)
C      β1   0.282 (0.230)     -0.128 (-7.897)   0.069 (0.006)     0.398 (0.336)     -0.012 (-0.524)   0.113 (0.014)     1.63 (0.13)
A      β2   -0.260 (0.237)    -0.040 (-2.417)   0.058 (0.006)     -0.266 (0.242)    -0.046 (-2.662)   0.061 (0.007)     1.05 (0.01)
B      β2   -0.239 (0.249)    -0.019 (-1.097)   0.062 (0.006)     -0.245 (0.255)    -0.025 (-1.374)   0.065 (0.006)     1.05 (0.01)
C      β2   -0.228 (0.242)    -0.008 (-0.474)   0.058 (0.005)     -0.230 (0.245)    -0.010 (-0.583)   0.060 (0.005)     1.03 (< 0.01)
A      β3   0.133 (0.231)     0.033 (2.045)     0.055 (0.005)     0.136 (0.236)     0.036 (2.136)     0.057 (0.005)     1.04 (0.01)
B      β3   0.116 (0.239)     0.016 (0.959)     0.057 (0.006)     0.120 (0.248)     0.020 (1.123)     0.062 (0.006)     1.08 (0.03)
C      β3   0.113 (0.282)     0.013 (0.651)     0.080 (0.006)     0.117 (0.291)     0.017 (0.825)     0.085 (0.007)     1.06 (0.01)
A      β4   -0.121 (0.253)    -0.011 (-0.629)   0.064 (0.006)     -0.125 (0.259)    -0.015 (-0.794)   0.067 (0.007)     1.05 (0.01)
B      β4   -0.120 (0.248)    -0.010 (-0.568)   0.062 (0.007)     -0.123 (0.254)    -0.013 (-0.713)   0.065 (0.008)     1.05 (0.01)
C      β4   -0.097 (0.172)    0.013 (1.079)     0.030 (0.003)     -0.099 (0.176)    0.011 (0.913)     0.031 (0.003)     1.04 (< 0.01)

Table 4.29: Confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data are matched (MAR-MAR) for a covariate with both missing and mismeasured data.

Model  βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
A      β0   0.900 (0.055)   0.936 (0.094)   0.945 (0.756)              0.950 (1.000)
B      β0   0.909 (0.060)   0.951 (0.125)   0.935 (0.396)              0.935 (0.396)
C      β0   0.916 (0.058)   0.958 (0.107)   0.900 (0.020)              0.900 (0.020)
A      β1   0.812 (0.086)   1.207 (0.229)   0.880 (0.004)              0.955 (0.732)
B      β1   0.803 (0.087)   1.194 (0.291)   0.891 (0.008)              0.945 (0.768)
C      β1   0.798 (0.076)   1.162 (0.199)   0.900 (0.020)              0.950 (1.000)
A      β2   0.915 (0.087)   0.936 (0.099)   0.955 (0.732)              0.955 (0.732)
B      β2   0.919 (0.078)   0.942 (0.091)   0.945 (0.768)              0.945 (0.768)
C      β2   0.925 (0.091)   0.944 (0.095)   1.000 (< 0.001)            1.000 (< 0.001)
A      β3   0.917 (0.086)   0.937 (0.095)   0.975 (0.024)              0.975 (0.024)
B      β3   0.923 (0.082)   0.946 (0.101)   0.945 (0.768)              0.949 (0.988)
C      β3   0.915 (0.086)   0.936 (0.111)   0.950 (1.000)              0.950 (1.000)
A      β4   0.907 (0.086)   0.927 (0.097)   0.930 (0.268)              0.930 (0.268)
B      β4   0.913 (0.084)   0.936 (0.098)   0.945 (0.768)              0.945 (0.768)
C      β4   0.916 (0.075)   0.933 (0.074)   1.000 (< 0.001)            1.000 (< 0.001)

Table 4.30: Comparing the effect of the missing data mechanism on the median and difference between the median and the true value (δ) when the mechanism that generated the missing data and the mechanism assumed by the model are matched (MAR-MAR) for a covariate with both missing and mismeasured data.
Model  βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
A      β0   -0.698 (0.240)      -0.729 (0.266)      -0.008    -0.039
B      β0   -0.744 (0.215)      -0.761 (0.238)      -0.054    -0.071
C      β0   -0.794 (0.121)      -0.809 (0.136)      -0.104    -0.119
A      β1   0.296 (0.228)       0.419 (0.327)       -0.114    0.009
B      β1   0.307 (0.184)       0.428 (0.275)       -0.103    0.018
C      β1   0.291 (0.204)       0.405 (0.300)       -0.119    -0.005
A      β2   -0.241 (0.234)      -0.246 (0.241)      -0.021    -0.026
B      β2   -0.240 (0.272)      -0.243 (0.278)      -0.020    -0.023
C      β2   -0.193 (0.296)      -0.194 (0.298)      0.027     0.026
A      β3   0.148 (0.216)       0.149 (0.217)       0.048     0.049
B      β3   0.110 (0.252)       0.115 (0.255)       0.010     0.015
C      β3   0.069 (0.276)       0.070 (0.279)       -0.031    -0.030
A      β4   -0.087 (0.248)      -0.087 (0.253)      0.023     0.023
B      β4   -0.129 (0.235)      -0.131 (0.234)      -0.019    -0.021
C      β4   -0.095 (0.168)      -0.096 (0.171)      0.015     0.014

Table 4.31: Comparing different missing data mechanisms when the mechanism that generates the data and the mechanism used to model the data are matched (NMAR-NMAR) for a covariate with both missing and mismeasured data.

            Naive                                                 MCEM
Model  βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
A      β0   -0.669 (0.228)    0.021 (1.331)     0.053 (0.005)     -0.672 (0.241)    0.018 (1.062)     0.058 (0.005)     1.11 (0.02)
B      β0   -0.671 (0.261)    0.019 (1.049)     0.069 (0.006)     -0.658 (0.275)    0.032 (1.640)     0.077 (0.007)     1.12 (0.03)
C      β0   -0.508 (0.256)    0.182 (10.061)    0.098 (0.008)     -0.499 (0.289)    0.191 (9.344)     0.120 (0.010)     1.22 (0.02)
A      β1   0.284 (0.228)     -0.126 (-7.847)   0.053 (0.005)     0.427 (0.365)     0.017 (0.664)     0.134 (0.020)     1.97 (0.22)
B      β1   0.268 (0.233)     -0.142 (-8.607)   0.074 (0.006)     0.425 (0.380)     0.015 (0.563)     0.144 (0.017)     1.94 (0.20)
C      β1   0.325 (0.287)     -0.085 (-4.176)   0.089 (0.006)     0.544 (0.527)     0.134 (3.603)     0.296 (0.039)     3.31 (0.35)
A      β2   -0.223 (0.212)    -0.003 (-0.175)   0.045 (0.004)     -0.228 (0.218)    -0.008 (-0.487)   0.048 (0.004)     1.06 (0.02)
B      β2   -0.221 (0.238)    -0.001 (-0.039)   0.057 (0.006)     -0.226 (0.245)    -0.006 (-0.354)   0.060 (0.007)     1.06 (0.01)
C      β2   -0.134 (0.252)    0.086 (4.847)     0.071 (0.007)     -0.132 (0.267)    0.088 (4.674)     0.079 (0.009)     1.12 (0.02)
A      β3   0.111 (0.244)     0.011 (0.651)     0.060 (0.007)     0.114 (0.249)     0.014 (0.799)     0.062 (0.008)     1.05 (0.01)
B      β3   0.098 (0.253)     -0.002 (-0.123)   0.064 (0.007)     0.101 (0.259)     0.001 (0.056)     0.067 (0.007)     1.05 (0.01)
C      β3   0.102 (0.287)     0.002 (0.085)     0.082 (0.007)     0.106 (0.302)     0.006 (0.294)     0.091 (0.009)     1.10 (0.01)
A      β4   -0.117 (0.236)    -0.007 (-0.407)   0.056 (0.008)     -0.119 (0.242)    -0.009 (-0.528)   0.059 (0.008)     1.05 (0.01)
B      β4   -0.097 (0.263)    0.013 (0.722)     0.069 (0.008)     -0.098 (0.270)    0.012 (0.633)     0.073 (0.008)     1.05 (0.01)
C      β4   -0.065 (0.266)    0.045 (2.399)     0.073 (0.006)     -0.067 (0.276)    0.043 (2.229)     0.078 (0.007)     1.07 (0.01)

Table 4.32: Confidence interval length and coverage when the mechanism that generates the data and the mechanism used to model the data are matched (NMAR-NMAR) for a covariate with both missing and mismeasured data.

Model  βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
A      β0   0.893 (0.052)   0.928 (0.089)   0.950 (1.000)              0.950 (1.000)
B      β0   0.904 (0.059)   0.942 (0.081)   0.910 (0.048)              0.920 (0.116)
C      β0   0.889 (0.057)   0.963 (0.124)   0.750 (< 0.001)            0.750 (< 0.001)
A      β1   0.838 (0.091)   1.289 (0.311)   0.880 (0.004)              0.970 (0.096)
B      β1   0.882 (0.092)   1.391 (0.325)   0.885 (0.004)              0.970 (0.096)
C      β1   0.906 (0.121)   1.565 (0.613)   0.850 (< 0.001)            1.000 (< 0.001)
A      β2   0.908 (0.078)   0.931 (0.095)   0.990 (< 0.001)            0.990 (< 0.001)
B      β2   0.912 (0.083)   0.935 (0.095)   0.955 (0.732)              0.955 (0.732)
C      β2   0.873 (0.081)   0.912 (0.121)   0.900 (0.020)              0.950 (1.000)
A      β3   0.903 (0.086)   0.925 (0.101)   0.940 (0.552)              0.940 (0.552)
B      β3   0.923 (0.087)   0.946 (0.101)   0.935 (0.388)              0.935 (0.388)
C      β3   0.861 (0.084)   0.901 (0.120)   0.900 (0.020)              0.900 (0.020)
A      β4   0.905 (0.088)   0.928 (0.103)   0.955 (0.732)              0.955 (0.732)
B      β4   0.906 (0.085)   0.928 (0.095)   0.945 (0.756)              0.945 (0.756)
C      β4   0.899 (0.086)   0.938 (0.120)   0.950 (1.000)              0.950 (1.000)

Table 4.33: Comparing the effect of the missing data mechanism on the median and difference between the median and the true value (δ) when
the mechanism that generated the missing data and the mechanism assumed by the model are matched (NMAR-NMAR) for a covariate with both missing and mismeasured data.

Model  βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
A      β0   -0.664 (0.232)      -0.666 (0.247)      0.026     0.024
B      β0   -0.639 (0.241)      -0.619 (0.258)      0.051     0.070
C      β0   -0.513 (0.186)      -0.479 (0.222)      0.177     0.211
A      β1   0.265 (0.212)       0.389 (0.324)       -0.145    -0.021
B      β1   0.266 (0.237)       0.390 (0.360)       -0.144    -0.020
C      β1   0.284 (0.334)       0.416 (0.495)       -0.126    0.006
A      β2   -0.227 (0.204)      -0.227 (0.209)      -0.007    -0.007
B      β2   -0.192 (0.223)      -0.193 (0.227)      0.028     0.027
C      β2   -0.145 (0.198)      -0.147 (0.199)      0.075     0.073
A      β3   0.118 (0.222)       0.122 (0.223)       0.018     0.022
B      β3   0.110 (0.245)       0.110 (0.244)       0.010     0.010
C      β3   0.163 (0.308)       0.168 (0.314)       0.063     0.068
A      β4   -0.116 (0.229)      -0.116 (0.235)      -0.006    -0.006
B      β4   -0.103 (0.251)      -0.105 (0.259)      0.007     0.005
C      β4   -0.072 (0.308)      -0.073 (0.313)      0.038     0.037

4.5.3.3 Comparing the effect of sample size

For this experiment, we use missing data mechanism A and we are interested in observing how the summary measures change as the sample size increases from n = 50 to n = 250. Although the largest sample size is still modest, it will serve to indicate possible asymptotic behaviours for both the complete-case and MCEM approaches. We will consider case 1, MAR-MAR, and case 4, NMAR-NMAR.

For the complete-case method we see that the estimated standard deviation of the parameter estimates decreases as n increases (Table 4.34). We also see a decrease in bias for β0 and β3. For β2 we see the bias increase when we go from a sample size of 50 to that of 100, and then decrease when we move from a sample size of 100 to that of 250. For β4 we see the reverse. The bias associated with the parameter of interest, β1, increases as the sample size increases. Given that the estimated standard deviation is much larger than the bias, the MSE exhibits the same trend: decreasing as the sample size increases.
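The interplay between the standard deviation and the bias in the MSE can be checked directly against Table 4.34. The sketch below assumes the usual decomposition, MSE = SD² + bias², which is not restated in this section and is therefore an assumption here. The values are the naive (complete-case) entries for β1 at the three sample sizes; the names `rows` and `mse` are chosen only for illustration.

```python
# Naive (complete-case) entries for beta_1 from Table 4.34:
# (sample size, estimated SD, bias, reported MSE)
rows = [
    (50, 0.355, -0.071, 0.131),
    (100, 0.214, -0.119, 0.060),
    (250, 0.132, -0.140, 0.037),
]

def mse(sd, bias):
    """Mean squared error under the assumed decomposition: variance plus squared bias."""
    return sd ** 2 + bias ** 2

for n, sd, bias, reported in rows:
    total = mse(sd, bias)
    variance_share = sd ** 2 / total  # fraction of the MSE due to variability
    print(f"n={n}: MSE={total:.3f} (reported {reported:.3f}), "
          f"variance share={variance_share:.2f}")
```

Under this decomposition the recomputed values agree with the reported MSE to three decimals, and the MSE falls with n even though the bias for β1 grows, because the variance term shrinks faster than the squared bias increases.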
The lengths of the confidence intervals decrease as the sample size increases (Table 4.35). When it comes to the coverage rates, we have a mixed story. For β0 and β2 we attain the nominal coverage rate for all sample sizes. For the parameter of interest, we see a worsening of the coverage and an inability to attain the coverage rate, but this is not surprising. For β3 we see that the coverage rate improves with increasing sample size and moves from missing the nominal rate to achieving it. The reverse is true for β4, where the coverage rate actually degrades with increasing sample size but maintains the nominal rate. We see the same trend for β1, but it ceases to reach the nominal rate at n = 100. For δ we see that both β0 and β3 decrease in magnitude as the sample size increases, but we see the opposite for β1 (Table 4.36).

Turning now to the MCEM approach, we see some familiar trends. The estimate of the standard deviation decreases for all parameters as the sample size increases (Table 4.34). For β0 and β3, we see that the bias decreases with increasing n, but we see mixed results for the remaining parameters. Both β1 and β4 exhibit a decrease in bias as we move from n = 50 to n = 100 and then an increase as we move to n = 250. The reverse is true for the bias associated with the estimate of β2. Although there are convoluted trends for β1, β2 and β4, we do see some clarity in terms of possible statistical significance. For β1, the associated estimate of the bias moves from being significantly different from zero to showing no evidence of a difference as the sample size increases. This does not hold for β2, which moves from no evidence of a difference to a significant difference and back to no evidence of a difference. In the opposite fashion to the estimated bias for β1, the bias associated with β4 goes from showing no evidence of a difference to being significantly different from the null value of zero.
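The statements about significance refer to the z-scores reported beside each bias in Table 4.34. A minimal sketch of how such a z-score could be formed is given below. It assumes the bias is divided by its Monte Carlo standard error, SD/√(number of simulations); the function name is hypothetical and the formula is an assumption, although it appears to reproduce the tabulated z-score for the MCEM estimate of β1 at n = 50.

```python
import math

def bias_z_score(bias, sd, n_sim):
    """z-score for H0: bias = 0, using the Monte Carlo standard error sd / sqrt(n_sim).

    Assumption: this is the usual simulation-study form; it is not restated
    in this section of the thesis.
    """
    return bias / (sd / math.sqrt(n_sim))

# MCEM estimate of beta_1 at n = 50 (Table 4.34a): bias 0.114, SD 0.589, 200 simulations.
z = bias_z_score(0.114, 0.589, 200)
print(round(z, 3), "significant at the 5% level" if abs(z) > 1.96 else "not significant")
```

Recomputing from the rounded table entries gives slightly different z-scores for some rows, which is expected because the tabulated values were presumably computed from unrounded quantities.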
As with the complete-case method, we see that the MSE and ê decrease with an increase in the sample size.

The lengths of the confidence intervals exhibit the same trend as with the complete-case method, with the additional observation that the mean length of the MCEM confidence intervals constructed using the Louis standard errors appears to approach that of the complete-case intervals as the sample size increases (Table 4.35). As with the naive complete-case approach, we have a mixed story for the coverage rates. We see that both β0 and β2 achieve the nominal coverage for all sample sizes. For β1, the rate is too high, but achieves the nominal rate when n = 250. Both β3 and β4 have coverage rates that decrease with increasing n. In both cases they achieve the nominal coverage rate as the sample size increases.

Finally, we see that δ_MCEM improves for β0 and β3 as the sample size increases (Table 4.36). We also see that δ_MCEM for β0 becomes similar to the complete-case value, while δ_MCEM for β3 is similar to it for all sample sizes. With only three measurement points, there is no clear pattern for δ_MCEM associated with the remaining parameter estimates, but it is clear that MCEM does a good job of reducing the difference associated with the estimate of β1.

When the model generating the missing information and the model assumed are matched and NMAR, case 4, we see similar results as before. For both the complete-case and the MCEM approaches we see that the estimate of the standard deviation, the MSE, and the length of the estimated confidence interval for the parameter estimate decrease as the sample size increases (Table 4.37). The bias presents a mixed story. The intercept and β4 go from having a large z-score, suggesting a statistical difference in the bias, when n = 50 to a much lower z-score, which indicates a lack of evidence for any difference in the bias from the null value of zero, when n = 100.
When moving to n = 250, we see that the z-score is much larger and once again suggests a significant difference. In the naive situation, the bias associated with the estimate of β1 increases as n increases, but for the MCEM method there is no clear trend, other than a lack of evidence for a significant difference for all sample sizes. For β2 we see small z-scores for all sample sizes even though the bias travels in different directions for the naive and MCEM methods as the sample size increases. Finally, we see that the estimated bias for the estimate of β3 decreases as n increases.

Coverage in this situation is a rather complicated issue. There are no clear or general trends over both scenarios in this experiment. The trends for the coverage rates associated with the estimates of β2, β3, and β4 are identical for both the naive complete-case and MCEM methods (Table 4.38). For β2, we see the coverage oscillate from attaining the nominal rate at n = 50, to being different from it when n = 100, to attaining the rate once again when n = 250. We see with β3 that the coverage is rather stable and all values suggest the attainment of the nominal rate for all sample sizes. Turning to β4, the coverage rate declines for both approaches, but achieves the nominal rate for all sample sizes. For the naive approach, the coverage associated with β0 is stable and achieves the nominal rate for all sample sizes, but for the MCEM approach the coverage goes from being too high with n = 50 to attaining the coverage rate for n = 100 and n = 250. Finally, we see that the coverage rate for the parameter of interest in the naive situation declines as the sample size increases and never reaches the nominal rate. For the MCEM method, the nominal rate is achieved for β1 at n = 50 and n = 250, but is abandoned with n = 100.
For the differences, identical trends are observed with the complete-case and the MCEM methods for all the parameters except β2. Here we see opposing trends as we move from the naive to the MCEM approach; there is no clear overall trend for the differences (Table 4.39).

Discussion

Considering first case 1, we observe that as the sample size increases there is a general improvement in the bias associated with all the parameters except β1. The parameter of interest appears to become more biased as the sample size increases. We see the expected shrinkage of the estimated standard deviation and MSE, and a shortening of the confidence intervals, as the sample size increases. The coverage rate associated with β1 degrades as the sample size increases. There appears to be a general improvement in the attainment of the nominal coverage rate for the other parameters, but this is not a uniform result.

For the MCEM approach, we do not see such a clear improvement in the bias as with the naive approach. The bias associated with β1 shows little evidence against the null of no bias as the sample size increases, which is desirable, but the results for the remaining estimates are a bit of a quagmire. The reduction in estimated standard deviation, MSE, and ê, and the expected shortening of confidence interval lengths, are observed. There appears to be a general improvement in the attainment of the nominal coverage rate over all the estimates.

When the missing data mechanism is NMAR for the generating model, we see the expected reduction in the estimated standard deviation, MSE, and confidence interval length for the complete-case approach. We see the same trend of a general bias reduction for all parameters except β1 as the sample size increases, with the reverse for β1. The significance of the bias with respect to the z-score shows no clear pattern over all the parameter estimates, except for that already noted with β1.
For the coverage, the general trend is to reach the nominal rate except in the case of β1.

The MCEM approach exhibits the expected trends for the estimated standard deviation, MSE, ê, and confidence interval length. What appears to be a general trend in the z-scores towards a lack of evidence for a difference from the null of no bias is frustrated by the bias showing a significant difference for β0 and β4. This is a noticeable change in direction when moving from n = 100 to n = 250. The coverage performs much better, with a general trend towards the attainment of the nominal rate as the sample size increases.

Over both missing data mechanisms and both methodologies we see the desired improvements in the estimate of the standard deviation, MSE, ê, and confidence interval length as the sample size increases. In terms of bias and coverage the results are not as transparent. In general, there is an improvement in the bias of the estimates under the MCEM approach, with β1 exhibiting the most consistent trend. A peculiar result was a degradation of the bias associated with β1 for the complete-case analysis. This may be a manifestation of the combined effect of the imperfections on the estimate. The coverage is another murky set of results with little transparency other than an overall improvement in the coverage for the MCEM methodology when the missing data mechanism is NMAR.

Table 4.34: Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 1 (MAR-MAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison.
(a) Case 1: MAR-MAR, Sample Size = 50, Simulation Size = 200

     Naive                                                 MCEM
βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
β0   -0.817 (0.380)    -0.127 (-4.724)   0.161 (0.017)     -0.850 (0.421)    -0.160 (-5.372)   0.203 (0.024)     1.26 (0.05)
β1   0.339 (0.355)     -0.071 (-2.818)   0.131 (0.012)     0.524 (0.589)     0.114 (2.737)     0.359 (0.052)     2.74 (0.26)
β2   -0.236 (0.385)    -0.016 (-0.591)   0.149 (0.015)     -0.246 (0.402)    -0.026 (-0.929)   0.163 (0.017)     1.09 (0.02)
β3   0.177 (0.360)     0.077 (3.042)     0.135 (0.014)     0.183 (0.379)     0.083 (3.082)     0.150 (0.016)     1.11 (0.02)
β4   -0.132 (0.393)    -0.022 (-0.810)   0.155 (0.020)     -0.140 (0.413)    -0.030 (-1.018)   0.171 (0.023)     1.11 (0.04)

(b) Case 1: MAR-MAR, Sample Size = 100, Simulation Size = 200

     Naive                                                 MCEM
βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
β0   -0.722 (0.230)    -0.032 (-1.992)   0.054 (0.005)     -0.738 (0.241)    -0.048 (-2.809)   0.060 (0.006)     1.12 (0.03)
β1   0.291 (0.214)     -0.119 (-7.833)   0.060 (0.006)     0.418 (0.319)     0.008 (0.375)     0.102 (0.011)     1.70 (0.14)
β2   -0.260 (0.237)    -0.040 (-2.417)   0.058 (0.006)     -0.266 (0.242)    -0.046 (-2.662)   0.061 (0.007)     1.05 (0.01)
β3   0.133 (0.231)     0.033 (2.045)     0.055 (0.005)     0.136 (0.236)     0.036 (2.136)     0.057 (0.005)     1.04 (0.01)
β4   -0.121 (0.253)    -0.011 (-0.629)   0.064 (0.006)     -0.125 (0.259)    -0.015 (-0.794)   0.067 (0.007)     1.05 (0.01)

(c) Case 1: MAR-MAR, Sample Size = 250, Simulation Size = 200

     Naive                                                 MCEM
βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
β0   -0.662 (0.137)    0.028 (2.886)     0.019 (0.002)     -0.660 (0.140)    0.030 (3.025)     0.020 (0.002)     1.05 (0.01)
β1   0.270 (0.132)     -0.140 (-14.963)  0.037 (0.003)     0.391 (0.197)     -0.019 (-1.392)   0.039 (0.004)     1.06 (0.09)
β2   -0.213 (0.145)    0.007 (0.689)     0.021 (0.003)     -0.216 (0.148)    0.004 (0.391)     0.022 (0.003)     1.03 (0.01)
β3   0.099 (0.151)     -0.001 (-0.124)   0.023 (0.002)     0.100 (0.154)     0.000 (0.008)     0.024 (0.002)     1.03 (< 0.01)
β4   -0.135 (0.151)    -0.025 (-2.387)   0.023 (0.002)     -0.138 (0.153)    -0.028 (-2.546)   0.024 (0.002)     1.04 (< 0.01)

Table 4.35: Comparing the effect of sample size on confidence interval length and coverage for case 1 (MAR-MAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison.

(a) Case 1: MAR-MAR, Sample Size = 50, Simulation Size = 200

βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0   1.388 (0.167)   1.527 (0.401)   0.940 (0.552)              0.950 (1.000)
β1   1.277 (0.238)   2.103 (0.954)   0.920 (0.116)              0.975 (0.024)
β2   1.419 (0.228)   1.486 (0.284)   0.955 (0.732)              0.955 (0.732)
β3   1.391 (0.211)   1.462 (0.305)   0.975 (0.024)              0.975 (0.024)
β4   1.406 (0.205)   1.478 (0.315)   0.960 (0.472)              0.970 (0.096)

(b) Case 1: MAR-MAR, Sample Size = 100, Simulation Size = 200

βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0   0.901 (0.055)   0.934 (0.083)   0.950 (1.000)              0.950 (1.000)
β1   0.813 (0.079)   1.204 (0.207)   0.905 (0.028)              0.985 (< 0.001)
β2   0.914 (0.081)   0.933 (0.090)   0.960 (0.472)              0.960 (0.472)
β3   0.917 (0.082)   0.935 (0.089)   0.965 (0.248)              0.965 (0.248)
β4   0.914 (0.082)   0.932 (0.092)   0.945 (0.756)              0.945 (0.756)

(c) Case 1: MAR-MAR, Sample Size = 250, Simulation Size = 200

βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0   0.546 (0.015)   0.556 (0.020)   0.945 (0.756)              0.955 (0.732)
β1   0.506 (0.030)   0.747 (0.079)   0.800 (< 0.001)            0.965 (0.248)
β2   0.551 (0.030)   0.559 (0.032)   0.950 (1.000)              0.950 (1.000)
β3   0.551 (0.031)   0.559 (0.032)   0.940 (0.552)              0.940 (0.552)
β4   0.550 (0.032)   0.558 (0.034)   0.925 (0.180)              0.930 (0.268)

Table 4.36: Comparing the effect of sample size on the median and difference between the median and true value (δ) for case 1 (MAR-MAR) where a single covariate suffers from both missing data and measurement error.
Missing data mechanism A was used for this comparison.

(a) Case 1: MAR-MAR, Sample Size = 50, Simulation Size = 200

βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
β0   -0.778 (0.360)      -0.813 (0.399)      -0.088    -0.123
β1   0.314 (0.356)       0.451 (0.507)       -0.096    0.041
β2   -0.215 (0.347)      -0.222 (0.365)      0.005     -0.002
β3   0.160 (0.336)       0.161 (0.341)       0.060     0.061
β4   -0.094 (0.382)      -0.095 (0.401)      0.016     0.015

(b) Case 1: MAR-MAR, Sample Size = 100, Simulation Size = 200

βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
β0   -0.727 (0.231)      -0.745 (0.250)      -0.037    -0.055
β1   0.273 (0.207)       0.391 (0.288)       -0.137    -0.019
β2   -0.232 (0.236)      -0.239 (0.239)      -0.012    -0.019
β3   0.093 (0.252)       0.094 (0.257)       -0.007    -0.006
β4   -0.133 (0.250)      -0.133 (0.255)      -0.023    -0.023

(c) Case 1: MAR-MAR, Sample Size = 250, Simulation Size = 200

βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
β0   -0.654 (0.129)      -0.650 (0.134)      0.036     0.040
β1   0.266 (0.135)       0.372 (0.190)       -0.144    -0.038
β2   -0.211 (0.120)      -0.214 (0.122)      0.009     0.006
β3   0.101 (0.164)       0.103 (0.166)       0.001     0.003
β4   -0.124 (0.150)      -0.127 (0.152)      -0.014    -0.017

Table 4.37: Comparing the effect of sample size on point estimation, bias, mean squared error, and relative efficiency for case 4 (NMAR-NMAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison.
(a) Case 4: NMAR-NMAR, Sample Size = 50, Simulation Size = 200

     Naive                                                 MCEM
βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
β0   -0.771 (0.391)    -0.081 (-2.938)   0.159 (0.022)     -0.790 (0.427)    -0.100 (-3.299)   0.193 (0.029)     1.21 (0.04)
β1   0.293 (0.370)     -0.117 (-4.476)   0.151 (0.016)     0.462 (0.652)     0.052 (1.133)     0.428 (0.076)     2.85 (0.33)
β2   -0.218 (0.405)    0.002 (0.070)     0.164 (0.021)     -0.229 (0.444)    -0.009 (-0.285)   0.197 (0.032)     1.21 (0.06)
β3   0.149 (0.394)     0.049 (1.773)     0.158 (0.017)     0.162 (0.425)     0.062 (2.076)     0.184 (0.025)     1.17 (0.07)
β4   -0.188 (0.388)    -0.078 (-2.860)   0.156 (0.022)     -0.196 (0.405)    -0.086 (-3.001)   0.171 (0.026)     1.10 (0.03)

(b) Case 4: NMAR-NMAR, Sample Size = 100, Simulation Size = 200

     Naive                                                 MCEM
βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
β0   -0.669 (0.228)    0.021 (1.331)     0.053 (0.005)     -0.672 (0.241)    0.018 (1.062)     0.058 (0.005)     1.11 (0.02)
β1   0.284 (0.228)     -0.126 (-7.847)   0.053 (0.005)     0.427 (0.365)     0.017 (0.664)     0.134 (0.020)     1.97 (0.22)
β2   -0.223 (0.212)    -0.003 (-0.175)   0.045 (0.004)     -0.228 (0.218)    -0.008 (-0.487)   0.048 (0.004)     1.06 (0.02)
β3   0.111 (0.244)     0.011 (0.651)     0.060 (0.007)     0.114 (0.249)     0.014 (0.799)     0.062 (0.008)     1.05 (0.01)
β4   -0.117 (0.236)    -0.007 (-0.407)   0.056 (0.008)     -0.119 (0.242)    -0.009 (-0.528)   0.059 (0.008)     1.05 (0.01)

(c) Case 4: NMAR-NMAR, Sample Size = 250, Simulation Size = 200

     Naive                                                 MCEM
βj   E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          E(β̂j) (SD)       Bias (z_bias)     MSE (SD)          ê (SD)
β0   -0.662 (0.137)    0.028 (2.886)     0.019 (0.002)     -0.660 (0.140)    0.030 (3.025)     0.020 (0.002)     1.05 (0.01)
β1   0.270 (0.132)     -0.140 (-14.963)  0.037 (0.003)     0.391 (0.197)     -0.019 (-1.392)   0.039 (0.004)     1.06 (0.09)
β2   -0.213 (0.145)    0.007 (0.689)     0.021 (0.003)     -0.216 (0.148)    0.004 (0.391)     0.022 (0.003)     1.03 (0.01)
β3   0.099 (0.151)     -0.001 (-0.124)   0.023 (0.002)     0.100 (0.154)     0.000 (0.000)     0.024 (0.002)     1.03 (0.00)
β4   -0.135 (0.151)    -0.025 (-2.387)   0.023 (0.002)     -0.138 (0.153)    -0.028 (-2.546)   0.024 (0.002)     1.04 (0.00)

Table 4.38: Comparing the effect of sample size on confidence interval length and coverage for case 4 (NMAR-NMAR) when a covariate has both missing data and mismeasurement. Missing data mechanism A was used for this comparison.

(a) Case 4: NMAR-NMAR, Sample Size = 50, Simulation Size = 200

βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0   1.385 (0.197)   1.509 (0.386)   0.955 (0.732)              0.975 (0.024)
β1   1.294 (0.242)   2.178 (1.191)   0.895 (0.012)              0.955 (0.732)
β2   1.415 (0.252)   1.507 (0.457)   0.950 (1.000)              0.955 (0.732)
β3   1.404 (0.222)   1.495 (0.482)   0.935 (0.388)              0.945 (0.756)
β4   1.420 (0.242)   1.491 (0.326)   0.960 (0.472)              0.965 (0.248)

(b) Case 4: NMAR-NMAR, Sample Size = 100, Simulation Size = 200

βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0   0.893 (0.052)   0.928 (0.089)   0.950 (1.000)              0.950 (1.000)
β1   0.838 (0.091)   1.289 (0.311)   0.880 (0.004)              0.970 (0.048)
β2   0.908 (0.078)   0.931 (0.095)   0.990 (< 0.001)            0.990 (< 0.001)
β3   0.903 (0.086)   0.925 (0.101)   0.940 (0.552)              0.940 (0.552)
β4   0.905 (0.088)   0.928 (0.103)   0.955 (0.732)              0.955 (0.732)

(c) Case 4: NMAR-NMAR, Sample Size = 250, Simulation Size = 200

βj   L_Naive (SD)    L_MCEM (SD)     Coverage_Naive (p-value)   Coverage_MCEM (p-value)
β0   0.546 (0.015)   0.556 (0.020)   0.945 (0.756)              0.955 (0.732)
β1   0.506 (0.030)   0.747 (0.079)   0.800 (< 0.001)            0.965 (0.248)
β2   0.551 (0.030)   0.559 (0.032)   0.950 (1.000)              0.950 (1.000)
β3   0.551 (0.031)   0.559 (0.032)   0.940 (0.552)              0.940 (0.552)
β4   0.550 (0.032)   0.558 (0.034)   0.925 (0.180)              0.930 (0.268)

Table 4.39: Comparing the effect of sample size on the median and difference between the median and true value (δ) for case 4 (NMAR-NMAR) where a single covariate suffers from both missing data and measurement error. Missing data mechanism A was used for this comparison.

(a) Case 4: NMAR-NMAR, Sample Size = 50, Simulation Size = 200

βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
β0   -0.746 (0.385)      -0.758 (0.390)      -0.056    -0.068
β1   0.279 (0.337)       0.403 (0.516)       -0.131    -0.007
β2   -0.206 (0.365)      -0.215 (0.376)      0.014     0.005
β3   0.136 (0.370)       0.136 (0.384)       0.036     0.036
β4   -0.158 (0.311)      -0.165 (0.339)      -0.048    -0.055

(b) Case 4: NMAR-NMAR, Sample Size = 100, Simulation Size = 200

βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
β0   -0.664 (0.232)      -0.666 (0.247)      0.026     0.024
β1   0.265 (0.212)       0.389 (0.324)       -0.145    -0.021
β2   -0.227 (0.204)      -0.227 (0.209)      -0.007    -0.007
β3   0.118 (0.222)       0.122 (0.223)       0.018     0.022
β4   -0.116 (0.229)      -0.116 (0.235)      -0.006    -0.006

(c) Case 4: NMAR-NMAR, Sample Size = 250, Simulation Size = 200

βj   m_Naive(β̂j) (MAD)   m_MCEM(β̂j) (MAD)    δ_Naive   δ_MCEM
β0   -0.654 (0.129)      -0.650 (0.134)      0.036     0.040
β1   0.266 (0.135)       0.372 (0.190)       -0.144    -0.038
β2   -0.211 (0.120)      -0.214 (0.122)      0.009     0.006
β3   0.101 (0.164)       0.103 (0.166)       0.001     0.003
β4   -0.124 (0.150)      -0.127 (0.152)      -0.014    -0.017

4.5.3.4 Comparing the effect of the specification of τ

In this experiment, the true value of τ was kept at 0.7 for all scenarios. The value of τ assumed for the model was 0.5, 0.7, or 1.0. Rather than looking at situations where τ was matched and mismatched for different values of τ, we wanted to explore the effect of misspecification of τ. This better represents the “real-world” context where the true value of τ is unknown and a sensitivity analysis is done to see how the parameter estimates change as a function of τ.
Furthermore, we will only consider the NMAR case and restrict our observations to the MCEM adjustment.

We observe that the estimate of the standard deviation associated with the estimate of $\beta_1$ increases as $\tau$ increases, but there is much stability in the standard deviation estimates of the other parameter estimates (Table 4.40). For $\beta_1$, the bias is greater when $\tau$ is mismatched, with a z-score suggesting a statistical difference from the null value of zero. When $\tau$ is matched, there is little evidence that the bias associated with $\beta_1$ differs from the null. The biases associated with the other parameter estimates all have small z-scores.

The MSE associated with the parameter estimate of $\beta_1$ increases as $\tau$ increases. Since this is the same pattern as that of the standard deviations of the point estimates, which are much larger than the biases, this is not unexpected. Although there are varying patterns in the MSE for the other parameters, there may be little evidence suggesting a statistical difference among them.

We observe that the length of the confidence intervals increases as $\tau$ increases (Table 4.41). This too may not be surprising, as the increased noise in the measurement error model should translate to increased variability in the parameter estimates, which would be reflected in the associated confidence intervals. The coverage rate presents the familiar complexity seen in previous experiments. Unfortunately, in this case there is no overarching trend. We observe that $\beta_3$ and $\beta_4$ attain the nominal rate for all levels of $\tau$. For $\beta_2$ we see that the coverage is significantly different from the nominal rate for $\tau = 0.7$. For both $\beta_0$ and $\beta_1$ we observe an increase in the coverage rate as $\tau$ increases; for both parameters the movement is from attaining the nominal rate to being significantly different from it; this occurs at $\tau = 1.0$.
For $\delta$ we see no clear trend, but we do see that when the assumed $\tau$ matches that of the measurement error model, we have the smallest values for $\beta_1$ and $\beta_4$ (Table 4.42).

Discussion

We can observe a couple of trends in this experiment. We see an expected increase in measures of variability as $\tau$ increases, which manifests in the estimate of the standard deviation, MSE, efficiency, and confidence interval length. We observe that the bias is least for the parameter estimate associated with the imperfect covariate when the assumed value of $\tau$ matches that of the measurement error model. Misspecification of $\tau$ has no clear effect on the bias of the other parameter estimates. Although there is no clear story for the coverage rate, the evidence seems to suggest that specifying a $\tau$ that is smaller than the true value is better than specifying one that is too large. This seems to apply to the efficiency as well.

Table 4.40: Comparing the effect of $\tau$ on the point estimate, bias, and mean squared error when a covariate has both missing data and mismeasured data and when the missing data mechanism is NMAR. Mechanism A was used to generate the missing data for $\beta_1$; 200 simulations were run.

(a) $\tau_{True} = 0.7$, $\tau_{model} = 0.5$

             Naive
Parameter    $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)    MSE (SD)
$\beta_0$    -0.691 (0.253)          -0.001 (-0.031)      0.064 (0.009)
$\beta_1$     0.274 (0.189)          -0.136 (-10.151)     0.054 (0.005)
$\beta_2$    -0.196 (0.238)           0.024 (1.448)       0.057 (0.006)
$\beta_3$     0.12 (0.251)            0.020 (1.102)       0.064 (0.006)
$\beta_4$    -0.127 (0.248)          -0.017 (-0.957)      0.062 (0.006)

             MCEM
Parameter    $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)    MSE (SD)        Efficiency (SD)
$\beta_0$    -0.689 (0.256)           0.001 (0.029)       0.065 (0.009)   1.03 (0.01)
$\beta_1$     0.338 (0.236)          -0.072 (-4.349)      0.061 (0.007)   1.12 (0.05)
$\beta_2$    -0.197 (0.240)           0.023 (1.347)       0.058 (0.006)   1.02 (0.00)
$\beta_3$     0.121 (0.253)           0.021 (1.154)       0.064 (0.006)   1.01 (0.00)
$\beta_4$    -0.128 (0.250)          -0.018 (-1.022)      0.063 (0.007)   1.02 (0.00)

(b) $\tau_{True} = 0.7$, $\tau_{model} = 0.7$

             Naive
Parameter    $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)    MSE (SD)
$\beta_0$    -0.669 (0.228)           0.021 (1.331)       0.053 (0.005)
$\beta_1$     0.284 (0.228)          -0.126 (-7.847)      0.053 (0.005)
$\beta_2$    -0.223 (0.212)          -0.003 (-0.175)      0.045 (0.004)
$\beta_3$     0.111 (0.244)           0.011 (0.651)       0.060 (0.007)
$\beta_4$    -0.117 (0.236)          -0.007 (-0.407)      0.056 (0.008)

             MCEM
Parameter    $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)    MSE (SD)        Efficiency (SD)
$\beta_0$    -0.672 (0.241)           0.018 (1.062)       0.058 (0.005)   1.11 (0.02)
$\beta_1$     0.427 (0.365)           0.017 (0.664)       0.134 (0.020)   1.97 (0.22)
$\beta_2$    -0.228 (0.218)          -0.008 (-0.487)      0.048 (0.004)   1.06 (0.02)
$\beta_3$     0.114 (0.249)           0.014 (0.799)       0.062 (0.008)   1.05 (0.01)
$\beta_4$    -0.119 (0.242)          -0.009 (-0.528)      0.059 (0.008)   1.05 (0.01)

(c) $\tau_{True} = 0.7$, $\tau_{model} = 1.0$

             Naive
Parameter    $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)    MSE (SD)
$\beta_0$    -0.675 (0.216)           0.015 (0.957)       0.047 (0.004)
$\beta_1$     0.271 (0.206)          -0.139 (-9.543)      0.062 (0.005)
$\beta_2$    -0.238 (0.257)          -0.018 (-0.987)      0.066 (0.008)
$\beta_3$     0.098 (0.233)          -0.002 (-0.093)      0.054 (0.006)
$\beta_4$    -0.125 (0.257)          -0.015 (-0.834)      0.066 (0.008)

             MCEM
Parameter    $E(\hat\beta_j)$ (SD)   Bias ($z_{bias}$)    MSE (SD)        Efficiency (SD)
$\beta_0$    -0.695 (0.256)          -0.005 (-0.281)      0.065 (0.009)   1.39 (0.13)
$\beta_1$     0.595 (0.531)           0.185 (4.928)       0.316 (0.054)   5.14 (0.92)
$\beta_2$    -0.255 (0.286)          -0.035 (-1.752)      0.083 (0.013)   1.25 (0.10)
$\beta_3$     0.107 (0.249)           0.007 (0.390)       0.062 (0.007)   1.14 (0.03)
$\beta_4$    -0.135 (0.274)          -0.025 (-1.287)      0.076 (0.009)   1.14 (0.03)

Table 4.41: Comparing the effect of $\tau$ on confidence interval length and coverage where a single covariate suffers from both missing data and measurement error and when the missing data mechanism is NMAR. Mechanism A was used to generate the missing data for $\beta_1$; 200 simulations were run.

(a) $\tau_{True} = 0.7$, $\tau_{model} = 0.5$

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
$\beta_0$    0.895 (0.060)    0.905 (0.068)    0.930 (0.268)              0.930 (0.268)
$\beta_1$    0.831 (0.089)    1.033 (0.149)    0.910 (0.048)              0.960 (0.472)
$\beta_2$    0.920 (0.092)    0.927 (0.095)    0.970 (0.096)              0.970 (0.096)
$\beta_3$    0.912 (0.086)    0.920 (0.089)    0.940 (0.552)              0.940 (0.552)
$\beta_4$    0.902 (0.087)    0.910 (0.089)    0.955 (0.732)              0.955 (0.732)

(b) $\tau_{True} = 0.7$, $\tau_{model} = 0.7$

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
$\beta_0$    0.893 (0.052)    0.928 (0.089)    0.950 (1.000)              0.950 (1.000)
$\beta_1$    0.838 (0.091)    1.289 (0.311)    0.880 (0.004)              0.970 (0.096)
$\beta_2$    0.908 (0.078)    0.931 (0.095)    0.990 (< 0.001)            0.990 (< 0.001)
$\beta_3$    0.903 (0.086)    0.925 (0.101)    0.940 (0.552)              0.940 (0.552)
$\beta_4$    0.905 (0.088)    0.928 (0.103)    0.955 (0.732)              0.955 (0.732)

(c) $\tau_{True} = 0.7$, $\tau_{model} = 1.0$

Parameter    L_Naive (SD)     L_MCEM (SD)      Coverage_Naive (p-value)   Coverage_MCEM (p-value)
$\beta_0$    0.892 (0.049)    1.007 (0.262)    0.965 (0.248)              0.985 (< 0.001)
$\beta_1$    0.839 (0.082)    1.955 (0.913)    0.905 (0.028)              1.000 (< 0.001)
$\beta_2$    0.918 (0.089)    0.999 (0.261)    0.935 (0.388)              0.945 (0.756)
$\beta_3$    0.907 (0.079)    0.976 (0.164)    0.955 (0.732)              0.955 (0.732)
$\beta_4$    0.907 (0.086)    0.977 (0.167)    0.940 (0.552)              0.945 (0.756)

Table 4.42: Comparing the effect of $\tau$ on the median and the difference between the median and the true value ($\delta$) where a single covariate suffers from both missing data and measurement error and when the missing data mechanism is NMAR. Mechanism A was used to generate the missing data for $\beta_1$; 200 simulations were run.

(a) $\tau_{True} = 0.7$, $\tau_{model} = 0.5$

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta$_Naive   $\delta$_MCEM
$\beta_0$    -0.675 (0.212)    -0.68 (0.224)      0.015            0.010
$\beta_1$     0.278 (0.174)     0.337 (0.226)    -0.132           -0.073
$\beta_2$    -0.199 (0.237)    -0.202 (0.237)     0.021            0.018
$\beta_3$     0.12 (0.222)      0.121 (0.220)     0.020            0.021
$\beta_4$    -0.124 (0.233)    -0.126 (0.234)    -0.014           -0.016

(b) $\tau_{True} = 0.7$, $\tau_{model} = 0.7$

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta$_Naive   $\delta$_MCEM
$\beta_0$    -0.664 (0.232)    -0.666 (0.247)     0.026            0.024
$\beta_1$     0.265 (0.212)     0.389 (0.324)    -0.145           -0.021
$\beta_2$    -0.227 (0.204)    -0.227 (0.209)    -0.007           -0.007
$\beta_3$     0.118 (0.222)     0.122 (0.223)     0.018            0.022
$\beta_4$    -0.116 (0.229)    -0.116 (0.235)    -0.006           -0.006

(c) $\tau_{True} = 0.7$, $\tau_{model} = 1.0$

Parameter    m_Naive (MAD)     m_MCEM (MAD)      $\delta$_Naive   $\delta$_MCEM
$\beta_0$    -0.669 (0.214)    -0.68 (0.239)      0.021            0.010
$\beta_1$     0.246 (0.194)     0.482 (0.406)    -0.164            0.072
$\beta_2$    -0.225 (0.224)    -0.232 (0.261)    -0.005           -0.012
$\beta_3$     0.101 (0.209)     0.109 (0.230)     0.001            0.009
$\beta_4$    -0.118 (0.237)    -0.133 (0.246)    -0.008           -0.023

4.5.4 Simulation study 2 discussion

Having both missing data and measurement error affect the same covariate introduces bias to the intercept and to $\beta_1$. The bias is not restricted to just these two parameter estimates, for it was seen that with mechanism A, significant bias was introduced for $\beta_2$ and $\beta_3$. The presence of both problems within a single covariate appears to have the potential to cause problems with the attainment of the nominal coverage rate for accurately measured covariates, as seen with mechanisms A and C.

In the first two experiments, we see the familiar trade-off between reducing the bias associated with the parameter of interest, $\beta_1$, and the bias associated with the other parameters. As with the previous study, we see that the intercept is the most affected by this trade-off. Across the four cases in the first experiment, we see that for $\beta_1$ the MCEM approach reduces the bias and tends to improve the coverage rate, allowing the coverage to reach the nominal rate. Unfortunately, this is not a global result which can be made for all the parameters. We also see the familiar result of larger estimated standard deviations, larger MSE, and longer confidence intervals.

When the missing data mechanism is misspecified, we observe differing effects which are contingent on the type of disconnect between the generating model and the assumed model. When the mechanism is under-modelled, that is, assumed to be MAR when it is NMAR, there is a tendency for the bias and the MSE to be larger than if the model was correctly specified. When it is over-modelled, we observed that the point estimates are less biased than if it was correctly specified. In this situation, however, the nominal rate of coverage was difficult to attain.
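The coverage tables in this section report a p-value beside each empirical coverage rate, but the section does not state which test produced them. As a minimal sketch, assuming a two-sided Wald z-test of the empirical coverage against the nominal 0.95 rate, with the standard error taken from the observed proportion, a check of this kind could look like the following. The function name `coverage_p_value` and the choice of test are assumptions made for illustration, not the author's stated method.

```python
import math


def coverage_p_value(coverage, n_sim, nominal=0.95):
    """Two-sided Wald z-test of an empirical coverage rate against a nominal rate.

    Assumption: the standard error is computed from the observed coverage,
    not from the nominal rate.
    """
    diff = coverage - nominal
    se = math.sqrt(coverage * (1.0 - coverage) / n_sim)
    if se == 0.0:
        # All (or none) of the intervals covered the truth, so the Wald
        # statistic is undefined; treat any departure as maximal evidence.
        return 1.0 if diff == 0.0 else 0.0
    z = diff / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Example: a coverage of 0.955 observed over 200 simulated data sets.
p_example = coverage_p_value(0.955, n_sim=200)
```

Under this assumption, a coverage of 0.955 from 200 simulations gives a p-value near 0.73, in line with the tabulated 0.732. Not every tabulated p-value is reproduced this way, so the sketch should be read as illustrative only.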
When we compared different missing data mechanisms for cases 1 and 4, we saw some conflicting results. For $\beta_1$ we observed few differences across the three mechanisms when the mechanism was MAR. For the other covariates, we saw many similarities across the three mechanisms, but there was some suggestion that mechanism C, which was MCAR, may be behaving differently. This difference emerges in the estimated bias of the MCEM approach for $\beta_4$ and in the MCEM coverage rates associated with $\beta_0$, $\beta_2$ and one other coefficient. The difference of mechanism C fully emerges when we consider case 4, NMAR-NMAR. It is here that the similarity of mechanisms A and B solidifies against the dissimilarity of mechanism C.

Although there are peculiarities observed with this simulation study, by distancing ourselves from the minutiae of the tables, we see some trends for both the complete-case and MCEM approaches across both the MAR and NMAR missing data mechanisms. We observe decreasing estimates of the standard deviation, MSE, and mean length of the confidence intervals as the sample size increases. Furthermore, we see that the MCEM approach becomes more efficient as the sample size increases. When we consider the issue of bias, the details can create much confusion, but by taking a broader look at the trend we can see the possibility that for all parameters the bias is reduced as the sample size increases. Applying this perspective to the other murky results, we see familiar ground emerge, with the MCEM approach trending towards the attainment of the nominal coverage rate as the sample size increases.

Finally, when we consider the specification of $\tau$, we see the expected results, such as larger estimates of the standard deviation for larger values of $\tau$. This trend is seen again for the MSE and confidence interval length.
When the assumed value of $\tau$ matches that of the model, we see that the bias is minimized for $\beta_1$. We see a generally positive feature in that misspecification of $\tau$ seems to have no effect on the bias of the other parameters. Although there is no clear story for the coverage rate, the evidence seems to suggest that specifying a $\tau$ that is smaller than the true value is better than specifying one that is too large. This seems to apply to the efficiency as well.

Chapter 5

Does income affect the location of death for males, living in Aboriginal communities, who died of prostate cancer?

Although there are a myriad of ways to specify an individual's culture, for this investigation we will consider the socio-economic context. This is rarely measured directly and is not present in any of the linked data sources. As a common means to compensate for this deficiency, income is used as a surrogate [53-55]. Socio-economic status represents a context in which its members learn social coding; it is a cultural context which forms an understanding of self and others. Ecosocial theory suggests that people biologically incorporate their “context”, thus it is suggested that health is the physical realization of our cumulative experiences [52].

With this operating definition of a subject's socio-economic cultural context, we extend the original population definition to include all male adults, age 20 and older, who resided in and died in British Columbia, as identified on their death certificate, due to malignant prostate cancer between 1997 and 2003 inclusively, excluding death certificate only (DCO) diagnosis of cancer, who lived in a dissemination area where the self-reported aboriginal population was 20% or greater. With this definition, 215 patients were identified and the following covariates were used for the model: average income of the dissemination area, age, health authority, and aboriginal profile of the dissemination area.
The response is the location where the patient died. For the outcome, all locations of death are coded on the death certificate by the BCVS (Appendix A). In 2000, the place of death codes changed from following the International Classification of Diseases (ICD) version 9 to version 10. Coding for deaths at home (also called place of usual residence) remained the same across the change, but this was not true for other codes. For example, the hospital code was 7 prior to 2000 and 2 after. The change of code also changed the group of locations associated with hospital deaths. From the code alone, it is impossible to extract only hospital deaths, thus a supplementary file was obtained from the BCVS that indicated all patients who died in a hospital, where the facility code and the hospital name were given for this purpose. The set of codes has been reduced to indicate death at the place of usual residence, that is, home and free-standing facilities which constitute a place of residence (e.g. nursing home), and death on a hospital campus, which includes all facilities associated with the postal code of the hospital. The outcome was defined as

$$y_i = \begin{cases} 1 & \text{if the location of death was on a hospital campus} \\ 0 & \text{if the location of death was in the place of usual residence} \end{cases}$$

We have three perfect covariates: age, health authority, and aboriginal profile of the dissemination area. Although the age is constructed from multiple data sources (date of birth from the cancer registry and date of death from BC vital statistics), the population had no identifiable problems. The age was kept as a continuous variable, which is not standard for applied epidemiological analysis, and it was standardized. The health authority for the place of usual residence was used and obtained from the geocoding program. All the persons in the population were sufficiently geocoded as to correctly identify them with one health authority: Vancouver Island, Vancouver Coastal, Fraser, Interior, and Northern (Figure 5.1).
Vancouver Coastal is the reference Health Authority with $x_{32}$ = Fraser Health Authority, $x_{33}$ = Interior Health Authority, $x_{34}$ = Vancouver Island Health Authority, and $x_{35}$ = Northern Health Authority. The aboriginal profile of each area is predicated on the dissemination code which is obtained from the geocoding program. In British Columbia, dissemination areas which are in the 800 series, aside from the Nisga'a nation, are reservations. The dissemination area code was used to construct an indicator for reservations and the Nisga'a nation. Since all the patients were successfully geocoded, the indicator was complete and defined as

$$x_4 = \begin{cases} 0 & \text{if the location of residence is off a reservation} \\ 1 & \text{if the location of residence is on a reservation} \end{cases}$$

Figure 5.1: British Columbia health authorities and health service delivery areas

The age, health authority, and reservation indicator were included in the model as adjusting variables only, with no direct interest in assessing their utility to the overall model. We see in Figure 5.1 that there may be a spatial aspect to this problem since the health authorities have geographic relationships to each other. Although it is reasonable to consider a hierarchical model to account for this structure, we have chosen to adjust for any potential regionality in the place of death by including the health authority in the model. Furthermore, the health authority is the overarching governmental structure for the delivery of regional services. By including this in the model, we are not only adjusting for any regional differences, but also adjusting for administrative practices. The indicator for reservations was included to adjust for any differences that may arise due to living on or off of a reservation.
The average income is the surrogate for the socio-economic status and is the variable of primary interest for this investigation. Income is a census-based covariate which was inaccurately measured. Although there may be equivalent motivation for either a Berkson or classical measurement error model, for illustrative purposes a classical model is assumed. It is clear that this variable is mismeasured, but it also suffers from missing data. Of the 215 subjects, 38 did not have any income data, which translates to 17.7% of the income data being missing. This is due to the non-reporting of information in areas with low numbers of individuals (Chapter 1). Income was standardized using the empirical mean and standard deviation of the observable data.

Two assumptions will be made for the imperfect covariate. First, we will assume that the distribution of the unobservable experimental realizations is identical to that of the observed experimental realizations, that is, the covariate is identically distributed for all subjects. Assuming unbiased measurement error and $\epsilon_i \sim N(0, \tau^2)$, then $X_i^{F} \mid x_i \sim N(x_i, \tau^2)$ where $F = \{obs, miss\}$, so $X_i \mid x_i \sim N(x_i, \tau^2)$. Secondly, we assume that the variance of the measurement error model is pre-specified, that is, we will not be estimating it from the data.

Figure 5.2: Histogram of the normalized income variable compared with t distributions of 5, 25, and 100 degrees of freedom and the standard normal distribution.

In order to specify a likelihood model and apply the MCEM methodology to this example, we will need to assume a model for the income data. Given that we are assuming a normal distribution for the measurement error, if the surrogate is normally distributed as well, we can reasonably assume that the underlying unobservable measure of direct interest is normally distributed as well.
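To make the stated data structure concrete, the following is a minimal sketch of a classical measurement error model with missing surrogate values, standardized with the empirical mean and standard deviation of the observed values only. The function names, the value $\tau = 0.5$, the standard normal draw for the true covariate, and the completely-at-random selection of missing entries are all assumptions chosen for illustration; the analysis in this chapter considers MAR and NMAR mechanisms instead.

```python
import random
import statistics


def simulate_surrogate(n, tau, n_missing, seed=0):
    """Classical measurement error: surrogate = true value + N(0, tau^2) noise.

    Assumption: missing entries are chosen completely at random, purely for
    illustration.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    surrogate = [x + rng.gauss(0.0, tau) for x in x_true]
    missing_idx = set(rng.sample(range(n), n_missing))
    return [None if i in missing_idx else w for i, w in enumerate(surrogate)]


def standardize_observed(values):
    """Standardize using the mean and SD of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)
    return [None if v is None else (v - mean) / sd for v in values]


# 215 subjects with 38 missing surrogate values, as in the applied data set.
income = simulate_surrogate(n=215, tau=0.5, n_missing=38)
income_std = standardize_observed(income)
missing_fraction = sum(v is None for v in income_std) / len(income_std)
```

Here `missing_fraction` is 38/215, about 17.7%, and the standardized observed values have mean zero and unit standard deviation by construction.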
The empirical distribution of the standardized income does not immediately suggest that a normal assumption is tenable (Figure 5.2). Even the next best guess, such as a t-distribution, remains unconvincing. If we turn to the Q-Q plots for the $t_5$, $t_{25}$, $t_{100}$ and standard normal distributions, we gain some clarity (Figure 5.3). We see that the $t_{25}$, $t_{100}$ and standard normal distributions would be reasonable assumptions for a “real-world” data set, thus to retain simplicity, we will assume that the standardized income is distributed as a N(0,1) distribution. The caveat to this assumption is the recognition that the tails do not have the correct mass.

Figure 5.3: Q-Q plots for the standardized income covariate: t-distributions with 5, 25, and 100 degrees of freedom and the standard normal distribution.

Finally we have the outcome: place of death. We defined the place of death as being in a hospital or in the place of usual residence, where the place of usual residence includes both the home and any institutional place of usual residence such as a nursing home. Furthermore, a hospital designation is any death which occurred on a hospital campus as defined by the postal code. We identified location of death within the hospital campus first, then all places of usual residence. The proportion of men who died at the place of their usual residence indicates that the odds ratio is not a reasonable approximation to the relative risk (Table 5.1). There is a large difference in the proportion of individuals who die in hospital when comparing living on a reservation with living off of a reservation. The covariate of primary interest, income, shows little difference between those who died in hospital and those who died in their place of usual residence.

Table 5.1: Univariate summary of the applied data set. The overall column has the counts, proportions, means and standard deviation for all 215 subjects. The hospital and place of residence columns have the rates, proportions, means and standard deviations for the subjects contingent on being in the column category for the location of death.

Variable                       Hospital (%)       Residence (%)
Place of death                 141 (65.6%)        74 (34.4%)
Health Authority
  Vancouver Island             10 (7.1%)          7 (9.5%)
  Vancouver Coastal            20 (14.2%)         11 (14.9%)
  Fraser                       46 (32.6%)         15 (20.3%)
  Interior                     26 (18.3%)         30 (40.5%)
  Northern                     39 (27.7%)         11 (14.9%)
Reservation status
  Living off a reservation     124 (87.9%)        56 (75.7%)
  Living on a reservation      17 (12.1%)         18 (24.3%)

Variable                       Hospital (SD)      Residence (SD)
Income                         43,650 (12,690)    39,712 (11,435)
Age                            76.2 (8.7)         78.7 (9.2)

The model required for the analysis has been comprehensively discussed in section 4.5, thus we will only briefly review the components for the applied problem. The analysis will involve a binary outcome with a logit link function, an imperfect covariate, the assumption that $\epsilon \sim N(0, \tau^2)$, and three perfect covariates which have been included in the model for adjustment purposes and are not of primary interest. The variance of $\epsilon$ is assumed known and a sensitivity analysis for $\tau$ will be performed. Finally, the MAR and NMAR assumptions will be considered to see if the assumption about the missing data mechanism affects parameter estimation.

Assuming that the imperfection indicators for missing data and mismeasurement are independent and that they are binary random variables, we will use the logit link function with the basic model for the systematic component for the $i$th subject as

$$\gamma_0 + \gamma_1 x_{i,1} + \gamma_2 x_{i,2} + \gamma_{32} x_{i,32} + \gamma_{33} x_{i,33} + \gamma_{34} x_{i,34} + \gamma_{35} x_{i,35} + \gamma_4 x_{i,4}$$

where $\gamma_2$ is the parameter for age, $\gamma_{32}$, $\gamma_{33}$, $\gamma_{34}$, and $\gamma_{35}$ are the parameters associated with the four levels of the Health Authority covariate, and $\gamma_4$ is the parameter indexing the indicator of residence on a reservation.
The covariate of primary interest and suffering from imperfection is $X_1$, thus for MAR we set $\gamma_1 = 0$. For NMAR, we are assuming that $\gamma_1 \neq 0$.

5.1 Results and discussion

There are two aspects that need to be explored from a sensitivity analysis point of view: the assumption about the missing data mechanism and the variance of $\epsilon$, $\tau^2$. To begin, we will first consider the effect of $\tau$ when the missing data mechanism is MAR, then we will consider the NMAR case. Finally, we will consider the effect of changing the missing data assumption. When appropriate, comparisons against the complete-case analysis will be made (Table 5.2).

The striking feature under the MAR assumption is the stability of the parameter estimates across the various values of $\tau$ (Table 5.3). The MCEM estimates exhibit little variation across the specifications of $\tau$. We see stability in the parameter estimates, z-scores and the associated p-values. When considered in light of the associated standard errors, these differences would most likely not achieve any significance. When it comes to the significance of the parameters in modelling the outcome, we see that only $\beta_4$ has a statistical difference from the null value of zero, but there is some weak evidence in support of $\beta_2$. We also notice that the standard errors associated with the MCEM estimates are, in general, smaller than those derived from the naive complete-case approach.
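The z-scores and p-values reported alongside the parameter estimates are consistent with a standard Wald test, that is, the estimate divided by its standard error, referred to a standard normal distribution. A minimal sketch of that calculation follows; the function name `wald_test` is hypothetical, and the example values are the complete-case estimate and standard error for age reported in Table 5.2.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z-score and its two-sided normal p-value."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Complete-case estimate for age: 0.355 with standard error 0.178.
z_age, p_age = wald_test(0.355, 0.178)
```

With these inputs the z-score is about 1.99 and the two-sided p-value about 0.046, matching the tabulated entries for that parameter.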
Table 5.2: Naive complete-case results

(a) Parameter estimates, standard error, z-score, and associated p-value

Parameter       $\hat\beta_{Naive}$ (SE)   z-score   p-value
$\beta_0$       -0.580 (0.636)             -0.912    0.362
$\beta_1$       -0.207 (0.194)             -1.069    0.284
$\beta_2$        0.355 (0.178)              1.994    0.046
$\beta_{32}$    -0.425 (0.800)             -0.531    0.596
$\beta_{33}$    -0.760 (0.713)             -1.066    0.286
$\beta_{34}$     0.654 (0.686)              0.954    0.340
$\beta_{35}$    -0.414 (0.760)             -0.544    0.586
$\beta_4$        0.751 (0.659)              1.141    0.254

(b) Point estimate, 95% confidence interval, and confidence interval length on the odds ratio scale

Parameter       OR_Naive   95% CI         length
$\beta_0$       0.56       (0.16,1.95)    1.79
$\beta_1$       0.81       (0.56,1.19)    0.63
$\beta_2$       1.43       (1.01,2.02)    1.02
$\beta_{32}$    0.65       (0.14,3.14)    3.00
$\beta_{33}$    0.47       (0.12,1.89)    1.78
$\beta_{34}$    1.92       (0.50,7.37)    6.87
$\beta_{35}$    0.66       (0.15,2.93)    2.78
$\beta_4$       2.12       (0.58,7.71)    7.13

Table 5.3: Parameter estimates, standard error, z-score, and associated p-value for the MCEM methodology assuming that the missing data mechanism is MAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$.

(a) $\tau = 0.2$

Parameter       $\hat\beta_{MCEM}$ (SE)    z-score   p-value
$\beta_0$       -0.789 (0.561)             -1.408    0.160
$\beta_1$       -0.137 (0.231)             -0.593    0.554
$\beta_2$        0.298 (0.160)              1.855    0.064
$\beta_{32}$    -0.131 (0.647)             -0.203    0.838
$\beta_{33}$    -0.507 (0.615)             -0.824    0.410
$\beta_{34}$     0.783 (0.602)              1.303    0.192
$\beta_{35}$    -0.436 (0.684)             -0.637    0.524
$\beta_4$        0.804 (0.409)              1.968    0.050

(b) $\tau = 0.3$

Parameter       $\hat\beta_{MCEM}$ (SE)    z-score   p-value
$\beta_0$       -0.787 (0.563)             -1.398    0.162
$\beta_1$       -0.144 (0.262)             -0.550    0.582
$\beta_2$        0.298 (0.161)              1.854    0.064
$\beta_{32}$    -0.137 (0.646)             -0.213    0.832
$\beta_{33}$    -0.513 (0.615)             -0.833    0.404
$\beta_{34}$     0.780 (0.602)              1.296    0.194
$\beta_{35}$    -0.440 (0.689)             -0.639    0.522
$\beta_4$        0.805 (0.409)              1.969    0.048

(c) $\tau = 0.4$

Parameter       $\hat\beta_{MCEM}$ (SE)    z-score   p-value
$\beta_0$       -0.781 (0.574)             -1.359    0.174
$\beta_1$       -0.141 (0.341)             -0.413    0.680
$\beta_2$        0.298 (0.161)              1.852    0.064
$\beta_{32}$    -0.141 (0.649)             -0.217    0.828
$\beta_{33}$    -0.518 (0.621)             -0.835    0.404
$\beta_{34}$     0.776 (0.606)              1.280    0.200
$\beta_{35}$    -0.454 (0.712)             -0.638    0.524
$\beta_4$        0.805 (0.409)              1.969    0.048

(d) $\tau = 0.5$

Parameter       $\hat\beta_{MCEM}$ (SE)    z-score   p-value
$\beta_0$       -0.775 (0.635)             -1.220    0.222
$\beta_1$       -0.142 (0.638)             -0.222    0.824
$\beta_2$        0.297 (0.161)              1.841    0.066
$\beta_{32}$    -0.146 (0.665)             -0.220    0.826
$\beta_{33}$    -0.525 (0.648)             -0.810    0.418
$\beta_{34}$     0.771 (0.628)              1.228    0.220
$\beta_{35}$    -0.469 (0.832)             -0.563    0.574
$\beta_4$        0.807 (0.409)              1.973    0.048

Table 5.4: Point estimate, 95% confidence interval, and confidence interval length on the odds ratio scale for the MCEM methodology assuming that the missing data mechanism is MAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$.

(a) $\tau = 0.2$

Parameter       OR_MCEM    95% CI         length
$\beta_0$       0.45       (0.15,1.36)    1.21
$\beta_1$       0.87       (0.55,1.37)    0.82
$\beta_2$       1.35       (0.98,1.84)    0.86
$\beta_{32}$    0.88       (0.25,3.11)    2.87
$\beta_{33}$    0.60       (0.18,2.01)    1.83
$\beta_{34}$    2.19       (0.67,7.12)    6.44
$\beta_{35}$    0.65       (0.17,2.47)    2.30
$\beta_4$       2.23       (1.00,4.98)    3.97

(b) $\tau = 0.3$

Parameter       OR_MCEM    95% CI         length
$\beta_0$       0.46       (0.15,1.37)    1.22
$\beta_1$       0.87       (0.52,1.45)    0.93
$\beta_2$       1.35       (0.98,1.84)    0.86
$\beta_{32}$    0.87       (0.25,3.09)    2.85
$\beta_{33}$    0.60       (0.18,2.00)    1.82
$\beta_{34}$    2.18       (0.67,7.09)    6.42
$\beta_{35}$    0.64       (0.17,2.48)    2.32
$\beta_4$       2.24       (1.00,4.98)    3.98

(c) $\tau = 0.4$

Parameter       OR_MCEM    95% CI         length
$\beta_0$       0.46       (0.15,1.41)    1.26
$\beta_1$       0.87       (0.44,1.70)    1.25
$\beta_2$       1.35       (0.98,1.84)    0.86
$\beta_{32}$    0.87       (0.24,3.10)    2.86
$\beta_{33}$    0.60       (0.18,2.01)    1.83
$\beta_{34}$    2.17       (0.66,7.12)    6.46
$\beta_{35}$    0.63       (0.16,2.56)    2.41
$\beta_4$       2.24       (1.00,4.99)    3.98

(d) $\tau = 0.5$

Parameter       OR_MCEM    95% CI         length
$\beta_0$       0.46       (0.13,1.60)    1.47
$\beta_1$       0.87       (0.25,3.03)    2.79
$\beta_2$       1.35       (0.98,1.85)    0.87
$\beta_{32}$    0.86       (0.23,3.18)    2.95
$\beta_{33}$    0.59       (0.17,2.11)    1.94
$\beta_{34}$    2.16       (0.63,7.40)    6.77
$\beta_{35}$    0.63       (0.12,3.20)    3.07
$\beta_4$       2.24       (1.01,5.00)    3.99

Recalling that we are primarily interested in $\beta_1$ and are only using the other covariates to adjust the model, we will first consider $\beta_1$ then the other parameters.
From simulation study 2, we expect to see a substantial reduction in the bias, especially if the model is correctly specified across all model features. For all values of $\tau$ we see an approximate 33% reduction in the magnitude of the parameter. This is considerable, but not of the magnitude often seen in simulation study 2 when the model was correctly specified. It is reasonable to assume that we have not correctly specified the model in at least one, but perhaps more, ways. In simulation study 2, we also saw that the bias associated with the MCEM approach for $\beta_1$ was, in general, positive. Transferring to this situation, we can assume that the MCEM estimate is larger than the true value, but we may also assume that we are closer to it than the estimate from the complete-case analysis. Across the values of $\tau$, we see that $\hat\beta_1$ ranges from -0.137 to -0.144, so pulling everything together, it is reasonable to conclude that the true value of the parameter, conditional on this particular functional form of the model, would be around -0.13.

Considering the other covariates which have been included to adjust the model and the intercept, we see a variety of movements. For the intercept, we see the magnitude of the estimate increase, but from simulation study 2, we know that the bias associated with the estimate of the intercept tends to increase when going from the naive complete-case analysis to that of the MCEM. If we maintain that observation for this data set, then we would assume that the intercept has a larger bias than that of the naive analysis, thus by observing the direction of movement of the parameter estimate for the intercept we can make a reasonable guess as to where the true value should be. In this case, since the estimate becomes more negative, we would surmise that the true value for the intercept should be closer to zero than that observed for the complete-case analysis.
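Returning to the approximate 33% reduction in the magnitude of the income parameter noted above, the figure can be checked directly from the complete-case estimate (-0.207) and the MCEM estimates under MAR for each value of $\tau$. The sketch below is only an arithmetic check; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(naive, adjusted):
    """Proportional reduction in magnitude when moving from naive to adjusted."""
    return 1.0 - abs(adjusted) / abs(naive)


naive_beta1 = -0.207
# MCEM estimates of the income parameter under MAR, keyed by the assumed tau.
mcem_beta1 = {0.2: -0.137, 0.3: -0.144, 0.4: -0.141, 0.5: -0.142}

reductions = {tau: relative_reduction(naive_beta1, est)
              for tau, est in mcem_beta1.items()}
```

The resulting reductions all lie between 30% and 34%, which is the range summarized in the text as approximately 33%.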
Outside of the estimate of the parameter associated with the imperfect covariate and the intercept, there are few clear and decisive results from simulation study 2 about the relationship between the MCEM estimates, the complete-case estimates, and the true values. In general, we expect a slight increase in the bias for these covariates, but this is not uniform across all the scenarios implemented in simulation study 2. We can observe that some of the parameter estimates do have large changes, with $\beta_{32}$ and $\beta_{33}$ being most notable.

Table 5.4 converts the estimates to the odds-ratio scale. It is noted that we are not in a rare case situation, so the odds-ratio is not an approximation to the relative risk in this case and should not be interpreted in that manner. For the estimates of the odds-ratio, we constructed the 95% confidence interval for each parameter estimate and the length of the confidence interval. The first noticeable feature is that the lengths of the confidence intervals for the MCEM approach are smaller than those of the complete-case methodology. This is due to the smaller standard errors observed earlier. From simulation study 2, we observed that the large sample asymptotic properties appeared to be “kicking in” around a sample size of 250. The shorter intervals could be evidence that for this study we are beginning to experience some of these benefits. The difference in lengths is most noticeable for $\beta_4$, where the MCEM based length is almost half that of the naive approach. Also, we observe that the length of the confidence interval for $\beta_1$ increases as $\tau$ increases, but this is not unexpected since as we increase $\tau$ we are introducing more variability into the model and in particular more variability associated with $x_1$. Finally, from a confidence interval point of view, only $\beta_4$ presents substantive interest.
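The odds-ratio entries can be reproduced from the logit-scale estimates with the usual Wald construction: exponentiate the estimate and the endpoints of the estimate plus or minus 1.96 standard errors. The sketch below assumes that construction, which the tabulated intervals are consistent with; the function name `odds_ratio_ci` is hypothetical, and the inputs are the complete-case and MCEM ($\tau = 0.2$, MAR) estimates for income.

```python
import math


def odds_ratio_ci(beta, std_error, z_crit=1.96):
    """Odds ratio, Wald confidence limits, and interval length from a logit-scale estimate."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z_crit * std_error)
    upper = math.exp(beta + z_crit * std_error)
    return odds_ratio, lower, upper, upper - lower


# Income coefficient: complete-case estimate, then MCEM estimate (tau = 0.2, MAR).
naive_or = odds_ratio_ci(-0.207, 0.194)
mcem_or = odds_ratio_ci(-0.137, 0.231)
```

Rounded to two decimals, `naive_or` gives 0.81 with interval (0.56, 1.19) and length 0.63, and `mcem_or` gives 0.87 with interval (0.55, 1.37) and length 0.82, in agreement with the tables.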
For most parameters the difference on the odds-ratio scale between the complete-case and the MCEM approaches is slight and would result in little difference in substantive interpretation of the model. Focusing on $\beta_1$, we see that there is what appears to be a minor difference. For the complete-case analysis we have an odds-ratio of 0.81 and for the MCEM method we have 0.87. From our previous observations, we may assume that the true odds-ratio may be around 0.90. With the large confidence interval and the proximity of both estimates to the null value of one, we may quickly decide that there is little of interest in this covariate.

From a substantive point of view, the complete-case estimate is much more compelling than the MCEM estimate. The confidence interval for the naive estimate is (0.56, 1.19) with a point estimate of 0.81. Knowing that the data is fraught with problems, this may be enough evidence to divert resources to the acquisition of better data in order to further explore the effect of income on the location of death for this population. It may be argued that with better and more data, the confidence interval may be shortened, resulting in a significant finding. With the MCEM adjustment, which takes into account the data imperfection, we see that the point estimate is 0.87 with a potential true odds-ratio of around 0.9, based on transferring the results of an analogous simulation study to the substantive problem. Furthermore, the confidence interval is rather large. By diverting more resources to investigating the effect of income on the place of death for this population, we may be able to shorten the confidence interval, but now a shorter interval looks less likely to produce a significant finding, thus the diversion of resources becomes much more questionable.
In fact, in light of the MCEM adjustment, the question about the effect of living on or off a reservation, $\beta_4$, becomes much more compelling and may be a more sensible place to focus limited resources. Here we see that the MCEM approach, informed by an analogous simulation study, not only provides statistical methodological insight but can also provide a means for substantive researchers to make better data acquisition decisions.

Moving from the MAR to the NMAR assumption, we again see an overall stability in the parameter estimates (Table 5.5). We see a similar stability in the estimated standard deviations across the parameters, except for those associated with $\beta_1$ and $\beta_{35}$, the indicator for the Northern Health Authority. In both cases we see an increase in the associated estimate of the standard deviation. Perhaps, in this case, the underlying assumption that the imperfect random variable and the covariates are pairwise independent is untenable for at least $x_1$ and $x_{35}$. Finally, we see that $\beta_4$ is the only significant covariate. If we were exploring the data from a hypothesis generation point of view in order to determine the important subset of data in which to invest both time and money, it would be reasonable to focus on the location of residence (on/off reservation) rather than income. Furthermore, there is moderate evidence for the inclusion of $\beta_2$. On the odds ratio scale, we see similar conclusions, with particular note of the evidence supporting the odds ratio associated with $\beta_4$ (Table 5.6).

It is reasonable to assume that, although we have done a sensitivity analysis, we have not correctly specified the model in terms of $\tau$; thus, from simulation study 2, we can infer that the bias associated with $\beta_1$ will range in magnitude from 0.05 to 0.15. Predicated on the results from simulation study 2, it is feasible that the true value may be in the range of -0.13 to -0.03, which would have ramifications for the substantive conclusions of this study. Considering this on the odds-ratio scale, we would conclude that the MCEM and the complete-case approaches are similar enough to warrant the expenditure of resources in the acquisition of higher quality data, but a caveat exists in that it may be difficult to sufficiently shrink the confidence interval. If the expenditure of funds in order to have a non-significant finding was permissible within the research framework, then this would be a worthy venture. Otherwise, it would be more productive to relegate this investigation to a hypothesis generation category and use it to identify variables with greater potential.

Table 5.5: Parameter estimates, standard error, z-score, and associated p-value for the MCEM methodology assuming that the missing data mechanism is NMAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$.

(a) $\tau = 0.2$

| Parameter | MCEM estimate (SE) | z-score | p-value |
|---|---|---|---|
| $\beta_0$ | -0.812 (0.557) | -1.457 | 0.146 |
| $\beta_1$ | -0.189 (0.230) | -0.820 | 0.412 |
| $\beta_2$ | 0.299 (0.161) | 1.859 | 0.064 |
| $\beta_{32}$ | -0.112 (0.646) | -0.173 | 0.862 |
| $\beta_{33}$ | -0.482 (0.614) | -0.785 | 0.432 |
| $\beta_{34}$ | 0.803 (0.601) | 1.337 | 0.182 |
| $\beta_{35}$ | -0.377 (0.683) | -0.552 | 0.580 |
| $\beta_4$ | 0.821 (0.409) | 2.006 | 0.044 |

(b) $\tau = 0.3$

| Parameter | MCEM estimate (SE) | z-score | p-value |
|---|---|---|---|
| $\beta_0$ | -0.811 (0.560) | -1.448 | 0.148 |
| $\beta_1$ | -0.201 (0.261) | -0.768 | 0.442 |
| $\beta_2$ | 0.300 (0.161) | 1.860 | 0.062 |
| $\beta_{32}$ | -0.110 (0.647) | -0.170 | 0.866 |
| $\beta_{33}$ | -0.485 (0.615) | -0.788 | 0.430 |
| $\beta_{34}$ | 0.804 (0.603) | 1.333 | 0.182 |
| $\beta_{35}$ | -0.376 (0.691) | -0.545 | 0.586 |
| $\beta_4$ | 0.824 (0.410) | 2.012 | 0.044 |

(c) $\tau = 0.4$

| Parameter | MCEM estimate (SE) | z-score | p-value |
|---|---|---|---|
| $\beta_0$ | -0.801 (0.571) | -1.403 | 0.160 |
| $\beta_1$ | -0.191 (0.338) | -0.566 | 0.572 |
| $\beta_2$ | 0.300 (0.161) | 1.859 | 0.064 |
| $\beta_{32}$ | -0.122 (0.650) | -0.188 | 0.852 |
| $\beta_{33}$ | -0.496 (0.620) | -0.800 | 0.424 |
| $\beta_{34}$ | 0.796 (0.608) | 1.310 | 0.190 |
| $\beta_{35}$ | -0.403 (0.713) | -0.565 | 0.572 |
| $\beta_4$ | 0.821 (0.409) | 2.005 | 0.044 |

(d) $\tau = 0.5$

| Parameter | MCEM estimate (SE) | z-score | p-value |
|---|---|---|---|
| $\beta_0$ | -0.798 (0.613) | -1.302 | 0.192 |
| $\beta_1$ | -0.203 (0.568) | -0.358 | 0.720 |
| $\beta_2$ | 0.299 (0.162) | 1.846 | 0.064 |
| $\beta_{32}$ | -0.124 (0.663) | -0.187 | 0.852 |
| $\beta_{33}$ | -0.500 (0.639) | -0.783 | 0.434 |
| $\beta_{34}$ | 0.793 (0.626) | 1.266 | 0.206 |
| $\beta_{35}$ | -0.413 (0.800) | -0.517 | 0.606 |
| $\beta_4$ | 0.822 (0.410) | 2.004 | 0.046 |

Table 5.6: Point estimate, 95% confidence interval, and confidence interval length on the odds ratio scale for the MCEM methodology assuming that the missing data mechanism is NMAR. Four levels of $\tau$ are explored to check the sensitivity of the model to the assumption on $\tau$.

(a) $\tau = 0.2$

| Parameter | OR | 95% CI | Length |
|---|---|---|---|
| $\beta_0$ | 0.44 | (0.15, 1.32) | 1.17 |
| $\beta_1$ | 0.83 | (0.53, 1.30) | 0.77 |
| $\beta_2$ | 1.35 | (0.98, 1.85) | 0.87 |
| $\beta_{32}$ | 0.89 | (0.25, 3.17) | 2.92 |
| $\beta_{33}$ | 0.62 | (0.19, 2.06) | 1.87 |
| $\beta_{34}$ | 2.23 | (0.69, 7.25) | 6.56 |
| $\beta_{35}$ | 0.69 | (0.18, 2.62) | 2.44 |
| $\beta_4$ | 2.27 | (1.02, 5.07) | 4.05 |

(b) $\tau = 0.3$

| Parameter | OR | 95% CI | Length |
|---|---|---|---|
| $\beta_0$ | 0.44 | (0.15, 1.33) | 1.18 |
| $\beta_1$ | 0.82 | (0.49, 1.37) | 0.87 |
| $\beta_2$ | 1.35 | (0.98, 1.85) | 0.87 |
| $\beta_{32}$ | 0.90 | (0.25, 3.19) | 2.93 |
| $\beta_{33}$ | 0.62 | (0.18, 2.06) | 1.87 |
| $\beta_{34}$ | 2.23 | (0.69, 7.28) | 6.59 |
| $\beta_{35}$ | 0.69 | (0.18, 2.66) | 2.48 |
| $\beta_4$ | 2.28 | (1.02, 5.09) | 4.07 |

(c) $\tau = 0.4$

| Parameter | OR | 95% CI | Length |
|---|---|---|---|
| $\beta_0$ | 0.45 | (0.15, 1.37) | 1.23 |
| $\beta_1$ | 0.83 | (0.43, 1.60) | 1.18 |
| $\beta_2$ | 1.35 | (0.98, 1.85) | 0.87 |
| $\beta_{32}$ | 0.89 | (0.25, 3.16) | 2.92 |
| $\beta_{33}$ | 0.61 | (0.18, 2.05) | 1.87 |
| $\beta_{34}$ | 2.22 | (0.67, 7.29) | 6.62 |
| $\beta_{35}$ | 0.67 | (0.17, 2.70) | 2.54 |
| $\beta_4$ | 2.27 | (1.02, 5.07) | 4.05 |

(d) $\tau = 0.5$

| Parameter | OR | 95% CI | Length |
|---|---|---|---|
| $\beta_0$ | 0.45 | (0.14, 1.50) | 1.36 |
| $\beta_1$ | 0.82 | (0.27, 2.48) | 2.21 |
| $\beta_2$ | 1.35 | (0.98, 1.85) | 0.87 |
| $\beta_{32}$ | 0.88 | (0.24, 3.24) | 3.00 |
| $\beta_{33}$ | 0.61 | (0.17, 2.12) | 1.95 |
| $\beta_{34}$ | 2.21 | (0.65, 7.55) | 6.90 |
| $\beta_{35}$ | 0.66 | (0.14, 3.17) | 3.03 |
| $\beta_4$ | 2.28 | (1.02, 5.09) | 4.07 |
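The odds-ratio entries in Table 5.6 appear to be consistent with exponentiating the estimates in Table 5.5 and forming Wald-type intervals. The sketch below assumes a normal approximation with a two-sided 95% critical value of about 1.96; this is our assumption, not a statement of how the tables were produced. Applied to the estimate 0.821 and standard error 0.409 reported for $\beta_4$ at $\tau = 0.2$, it gives an odds ratio, interval, and length that agree with the tabled row after rounding to two decimals. The function name `wald_summary` is hypothetical.

```python
import math


def wald_summary(estimate, std_error, z_crit=1.959964):
    """Wald z-score, two-sided p-value, and odds-ratio interval.

    estimate and std_error are on the log-odds scale; z_crit is the
    normal critical value for a 95% interval (an assumed convention).
    """
    z = estimate / std_error
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    lower = math.exp(estimate - z_crit * std_error)
    upper = math.exp(estimate + z_crit * std_error)
    return {
        "z": z,
        "p": p_value,
        "or": math.exp(estimate),
        "ci": (lower, upper),
        "length": upper - lower,
    }


# Estimate and SE for beta_4 at tau = 0.2, as reported in Table 5.5.
summary = wald_summary(0.821, 0.409)
print(f"z = {summary['z']:.3f}, p = {summary['p']:.3f}")
print(f"OR = {summary['or']:.2f}, "
      f"95% CI = ({summary['ci'][0]:.2f}, {summary['ci'][1]:.2f}), "
      f"length = {summary['length']:.2f}")
```

Because the inputs are themselves rounded to three decimals, the z-score and p-value computed this way may differ from the tabled values in the third decimal place.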
There is much similarity in the conclusions that can be drawn when comparing the two missing data mechanisms. Both support the inclusion of $\beta_4$ in the model and both provide moderate evidence for $\beta_2$. From an exploratory analysis point of view, it would be worth the expenditure of time and money to acquire better data surrounding these two covariates: age and residence on a reserve. Both models suggest that income is not a statistically significant predictor of the location of death for this population, thus there may be little interest in pursuing substantive questions pertaining to income for this population.

5.2 Conclusions

Although the estimate for $\beta_1$ was smaller under the NMAR assumption than under the MAR assumption, the conclusions about the significance of $\beta_1$ in the model are the same. The variation between the two approaches is more of a nuance in the sensibility of acquiring higher quality data to pursue the substantive question. A better role for the income covariate is to adjust for socio-economic status in other investigations. If we change the perspective of the study from a designed investigation to one of exploration, the MCEM adjustment appears to better define research opportunities. For example, the MCEM adjustment strongly suggests that $\beta_2$ and $\beta_4$ would be good candidates for further investigation into their effect on the location of death.

We observed that for this study, the specification of $\tau$ appears to have less of an impact on the estimate of $\beta_1$ than the assumption about the missing data mechanism. Also, we observed little difference between the two missing data mechanisms. This suggests a degree of stability in the estimates even when the model is misspecified. Although we may not be able to identify the true model and compare the effect of misspecification, we were able to use results from a simulation for which the structure was identical in order to construct some plausible scenarios for the true parameter value.
Finally, the shorter confidence intervals and the smaller standard errors associated with the MCEM approach suggest that the asymptotic properties may have “kicked in”, which may result in the MCEM approach being more efficient.

Chapter 6
Conclusion

A commonality between missing data and mismeasurement is the inability to observe realizations from the desired or target random variable. Although the desired realizations may not be observed, something is observed through the course of the experiment: a realization or the absence of the realization. Recognizing this provides a means by which the two areas can be synthesized into a unified conceptual framework. In the spirit of both sub-disciplines, the idea of an experimentally observable random variable, $X^E$, was introduced. The random vector of the target random variable and the experimentally observable random variable, $(X, X^E)$, was associated with each subject, as was a functional relationship which linked the two. With the use of indicator variables, a naive model was used to glue together aspects of missing data and mismeasurement to ensure that the more general problem of imperfection would not only have both missing data and mismeasurement as subsets but also provide a mechanism to characterize imperfections which do not fall neatly into one of the two types of data deficiencies. Although the introduction of the experimentally observable random variable appeared to be superfluous, it provided much needed notational simplicity when specifying the particulars of the Monte Carlo Expectation-Maximization algorithm needed to obtain the adjusted maximum likelihood estimates.

Two situations were considered. The first had two covariates with each affected by a single deficiency.
The expected attenuation in the covariate suffering from mismeasurement was observed across various missing data mechanisms and specifications of $\tau$, but statistically significant bias was observed in the estimates associated with the covariate affected by missing data when the missing data mechanism was MAR. Also, an overly optimistic estimate of the variation in the data resulted. Furthermore, it appears as if only the covariate suffering from mismeasurement has problems with the attainment of the nominal coverage rate.

The second situation considered multivariable binary logistic regression as well, but there was one covariate which suffered from both problems. Across various missing data mechanisms we saw that the naive parameter estimates experienced attenuation and an overly optimistic estimation of the variation in the data. The estimation of the standard deviation for the imperfect covariate was similar, for both MAR and NMAR assumptions, to the estimated standard deviation for the accurate covariates. This was a feature not seen when the two problems affected different covariates. As expected, the coverage rate for the parameter associated with the imperfect covariate failed to reach the nominal rate, but other parameter estimates were also affected.

In the first simulation, with two imperfect covariates, each with a single deficiency, the MCEM approach attempts to mitigate the bias of $\beta_2$ associated with the mismeasured covariate by striking a set of trade-offs with the other estimates. The algorithm appears to adjust first for the mismeasured covariate, then for the covariate suffering from missing data, and finally for the intercept. The adjustment appears to be making a trade-off between the overall fit of the model and the accuracy of the individual estimates.
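The attenuation referred to above can be illustrated in a deliberately simplified setting. The sketch below uses a linear model with classical additive measurement error rather than the logistic model studied here, because in the linear case the naive slope is attenuated by the known reliability ratio $\sigma_x^2 / (\sigma_x^2 + \sigma_u^2)$. All names and parameter values (`error_sd`, the sample size, the seed) are hypothetical choices for illustration and do not correspond to the simulation settings of this thesis.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(2009)           # fixed seed so the run is repeatable
n, beta, error_sd = 20000, 1.0, 1.0  # hypothetical settings

x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]             # target covariate
w_obs = [x + rng.gauss(0.0, error_sd) for x in x_true]        # mismeasured version
y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]          # response

slope_true = ols_slope(x_true, y)    # uses the unobservable target covariate
slope_naive = ols_slope(w_obs, y)    # uses the mismeasured covariate
reliability = 1.0 / (1.0 + error_sd ** 2)  # var(x) / (var(x) + var(u)), var(x) = 1

print(f"slope using true covariate:        {slope_true:.3f}")
print(f"slope using mismeasured covariate: {slope_naive:.3f}")
print(f"beta * reliability ratio:          {beta * reliability:.3f}")
```

In logistic regression the attenuation is only approximately of this form, so the sketch conveys the direction of the effect rather than its exact size in the simulations discussed here.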
The specification of the missing data mechanism appears to have little impact on parameter estimation, but we do see that under-modelling the missing data mechanism is in general worse than over-modelling it. When the systematic component of the missing data mechanism is considered, we do see evidence that including the response in the systematic component influences the quality of the estimates (Mechanism C). In this situation, the MCEM approach has a much more difficult time reducing the bias for all three parameter estimates and may sacrifice the accuracy of the estimates following the aforementioned priority. Finally, it was observed that the larger the sample size, the better the performance of the MCEM adjustment.

For the second simulation study, MCEM trends are clearer if we take a slight step back from the details buried in the many tables. In general, the MCEM approach managed to reduce the bias associated with the parameter estimates, but this was not uniformly observed across all missing data mechanisms or missing data assumptions. An unexpected finding was that the relative efficiency of the MCEM approach was much better when both problems plagued a single covariate as opposed to the situation where the problems affected separate covariates. We observe decreasing estimates of the standard deviation, MSE, and mean length of the confidence intervals as the sample size increases. Furthermore, we see that the MCEM approach becomes more efficient as the sample size increases. When we consider the issue of bias, the details can create much confusion, but by taking a broader look at the trend we can see the possibility that for all parameters the bias is reduced as the sample size increases. Finally, the misspecification of $\tau$ appears to have a minimal effect on parameter estimation.

Across both simulation studies, we see the effect of attenuation on the imperfect covariates.
The exact “mix” of imperfections appears to have an effect on how this manifests. We also see that when missing data and mismeasurement problems are not co-mingled in a single covariate, we may be able to rely on standard results from each area. The caveat to this is that we do observe nuances which suggest that standard knowledge may break down, thus it serves only as a guide and ceases to hold its definitive explanatory role about the effects.

The MCEM adjustment worked well in general and has many promising features to warrant further exploration of the method. In particular, the emerging large sample properties were very attractive, thus it is reasonable to conclude that with a more sensitive application of the MCEM algorithm, estimates with better properties may emerge. Although the MCEM adjustment does not mark a novel algorithmic approach, a novel perspective about data was required in order to make the exposition of the details less foreboding. It was shown that standard approaches can work well in novel contexts, given the right perspective about the problem. This raises the question as to what is needed more: new algorithms or a better understanding of the basic problem?

From a substantive point of view, we see that there is utility in using this approach. Although current sample sizes are restricted by computational limitations, it is reasonable to use this method on a subset of data for the purpose of data exploration. Using simulation studies which emulate the structure of the data, it is possible to gain an understanding of how the biases may be working and the sensibility of inferences. Furthermore, such insights will be a valuable tool for researchers who are in a position to seek out more costly data. By providing a reasonable guide as to which aspects of the data may be most profitable in terms of meaningful outcomes, research resources can be administered in a much more focused manner.
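For readers who wish to see the mechanics of a Monte Carlo EM iteration, the following toy sketch may help. It is not the model or the implementation used in this thesis. It assumes a binary logistic regression with a single covariate that is missing completely at random for some subjects, no measurement error, and a known standard normal covariate distribution. The E-step draws the missing covariate from its conditional distribution given the response by rejection sampling, and the M-step is a weighted logistic fit by Newton-Raphson. All function names, the number of Monte Carlo draws `m`, and the simulation settings are hypothetical.

```python
import math
import random


def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def draw_missing_x(y, beta, rng):
    """Draw x from p(x | y, beta) under the assumed N(0, 1) covariate model.

    Rejection sampling: propose from N(0, 1) and accept with probability
    p(y | x, beta), which is bounded above by one.
    """
    while True:
        x = rng.gauss(0.0, 1.0)
        p = expit(beta[0] + beta[1] * x)
        if rng.random() < (p if y == 1 else 1.0 - p):
            return x


def weighted_logistic_fit(rows, beta, steps=25):
    """Newton-Raphson for a two-parameter logistic model on (x, y, weight) rows."""
    b0, b1 = beta
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = expit(b0 + b1 * x)
            r = w * (y - p)           # weighted score contribution
            v = w * p * (1.0 - p)     # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def mcem(data, m=20, iterations=15, seed=1):
    """Toy MCEM; data is a list of (x or None, y) pairs."""
    rng = random.Random(seed)
    beta = (0.0, 0.0)
    for _ in range(iterations):
        rows = []
        for x, y in data:
            if x is not None:
                rows.append((x, y, 1.0))
            else:
                # E-step: m Monte Carlo draws, each carrying weight 1/m.
                rows.extend((draw_missing_x(y, beta, rng), y, 1.0 / m)
                            for _ in range(m))
        # M-step: maximize the Monte Carlo estimate of the expected
        # complete-data log-likelihood.
        beta = weighted_logistic_fit(rows, beta)
    return beta


def simulate(n, beta, missing_rate, seed=0):
    """Simulate (x, y) pairs and delete x completely at random."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < expit(beta[0] + beta[1] * x) else 0
        data.append((None if rng.random() < missing_rate else x, y))
    return data


data = simulate(n=1000, beta=(-0.5, 1.0), missing_rate=0.3)
b0_hat, b1_hat = mcem(data)
print(f"MCEM estimates: b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
```

The fixed number of draws per subject is the simplest possible choice; an adaptive Monte Carlo sample size would replace the constant `m` in this sketch.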
6.1 Future work

Given the small number of peer reviewed articles which consider the combined problem of missing data and mismeasurement, there is much work to be done. From a computational point of view, if the MCEM approach were to be retained, future work would need to involve a detailed investigation into the automation of the Monte Carlo sample size for each subject and at each iterative step of the EM algorithm. Also, methods for managing very large design matrices would need to be explored in order to permit sample sizes greater than n = 250. Dempster et al. [22] indicate that computational efficiencies can be gained by understanding the structure of the likelihood. It may be worth pursuing a more analytic understanding of the likelihood in order to identify ridges or plateaux that may exist. Furthermore, it would be of interest to consider methodologies which allow for ridges, but utilize a pulse mechanism to “eject” the algorithm away from the ridge and into a more profitable area of the likelihood.

A natural progression of this work would be to consider hierarchical models, the modelling of causal pathways, and time-varying covariates. Given that the motivating substantive problem involved cultural covariates, which themselves are considered to vary with time, it would be feasible to consider an integration of the imperfect variable framework with the Andersen model of health services [2], and Liu and Wu’s work with time-varying covariates in the context of HIV trials [71]. Finally, any integration of hierarchical models and imperfect covariates would be an asset to substantive researchers utilizing geocoded aggregate data.

A final area of future research is to consider the basic problem of data imperfection and its impact. This is in the spirit of Gustafson’s work [40], where the impact of measurement error and misclassification is considered before turning to corrective methodologies.
Given the complexities of relationships that may arise with imperfection and the limited research on the topic, a more detailed look at the basic problem may be profitable.

Bibliography

[1] A. Agresti. Categorical Data Analysis. New York: Wiley, 1990.
[2] R M Andersen. Revisiting the behavioural model and access to medical care: Does it matter? Journal of Health and Social Behaviour, 36:1–10, 1995.
[3] Stuart G. Baker. A simple method for computing the observed information matrix when using the EM algorithm with categorical data. Journal of Computational and Graphical Statistics, 1(1):63–76, 1992.
[4] L E Baum and J A Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73:360–363, 1967.
[5] L E Baum and T Petrie. Statistical inference for probabilistic functions of finite Markov chains. Annals of Mathematical Statistics, 37:1554–1563, 1966.
[6] L E Baum, T Petrie, G Soules, and N Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171, 1970.
[7] J G Booth and J P Hobert. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo algorithm. Journal of the Royal Statistical Society. Series B, 61:265–285, 1999.
[8] S F Buck. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society. Series B, 22:302–306, 1960.
[9] Frederick Burge, Beverley Lawson, and Grace Johnston. Trends in the place of death of cancer patients, 1992–1997. CMAJ, 168(3):265–270, February 2003.
[10] Andrea Burton, Douglas G. Altman, Patrick Royston, and Roger L. Holder. The design of simulation studies in medical statistics. Statistics in Medicine, 25:4279–4292, 2006.
[11] Statistics Canada.
Postal Code Conversion File (PCCF): Reference Guide. Statistics Canada, Catalogue no. 92F0153GIE, Ottawa, September 2006.
[12] Statistics Canada. 2006 Census Dictionary. Statistics Canada, Catalogue no. 92-566-XWE, Ottawa, February 14 2007.
[13] Raymond J. Carroll, Douglas Midthune, Laurence S. Freedman, and Victor Kipnis. Seemingly unrelated measurement error models, with application to nutritional epidemiology. Biometrics, 62:75–84, 2006.
[14] Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman & Hall/CRC, 2006.
[15] George Casella and Robert L. Berger. Statistical Inference. Duxbury, second edition, 2002.
[16] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.
[17] Richard Chamberlayne, Bo Green, Morris L Barer, Clyde Hertzman, William J Lawrence, and Samuel B Sheps. Creating a population based linked health database: A new resource for health services research. Canadian Journal of Public Health, 89(4):270–273, 1998.
[18] K S Chan and J Ledolter. Monte Carlo estimation for time series models involving counts. Journal of the American Statistical Association, 90:242–252, 1995.
[19] H Y Chen and R J A Little. Proportional hazards regression with missing covariates. Journal of the American Statistical Association, 94:896–908, 1999.
[20] N E Day. Estimating the components of a mixture of normal distributions. Biometrika, 56:463–474, 1967.
[21] Hakan Demirtas. Simulation driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica, 58(4):466–482, 2004.
[22] A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
[23] Paul S. Dwyer.
Some applications of matrix derivatives in multivariate analysis. Journal of the American Statistical Association, 62(318):607–625, June 1967.
[24] Bradley Efron. The two sample problem with censored data. In Proceedings of the 5th Berkeley Symposium of Mathematical Statistics and Probability, volume 4, pages 831–853. University of California Press, 1967.
[25] Bradley Efron and David V. Hinkley. Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65(3):457–482, 1978.
[26] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1998.
[27] Julian J. Faraway. Extending the Linear Model with R. Chapman & Hall/CRC, 2006.
[28] R A Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22:700–725, 1925.
[29] L W Fung. Implementing the Patient Self-Determination Act (PSDA): How to effectively engage Chinese-American elderly persons in the decision of advance directives. Journal of Gerontological Social Work, 22(1/2):161–174, 1994.
[30] B Ganguli, J Staudenmayer, and M P Wand. Additive models with predictors subject to measurement error. Australian and New Zealand Journal of Statistics, 47:193–202, 2005.
[31] Alan E Gelfand and Adrian F M Smith. Sample based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[32] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2nd edition, 2004.
[33] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
[34] A T Geronimus, J Bound, and L J Neidert. On the validity of using census geocode characteristics to proxy individual socioeconomic characteristics. Journal of the American Statistical Association, 91(434):529–537, 1996.
[35] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41(2):337–348, 1992.
[36] R J Glynn and N M Laird. Regression estimates and missing data: complete-case analysis. Technical report, Harvard School of Public Health, Department of Biostatistics, 1986.
[37] Sander Greenland. Ecological versus individual-level sources of bias in ecological estimates of contextual health effects. International Journal of Epidemiology, 30:1343–1350, 2001.
[38] Sander Greenland and William D Finkle. A critical look at methods for handling missing covariates in epidemiological regression analysis. American Journal of Epidemiology, 142(12):1255–1264, December 1995.
[39] D Gu, G Liu, D A Vlosky, and Z Yi. Factors associated with place of death among the Chinese oldest old. Journal of Applied Gerontology, 26(1):34–57, 2007.
[40] Paul Gustafson. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman & Hall/CRC, 2004.
[41] W K Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
[42] C Hsiao and Q K Wang. Estimation of structural nonlinear errors-in-variables models by simulated least-squares method. International Economic Review, 41(2):523–542, 2000.
[43] Joseph G. Ibrahim. Incomplete data in generalized linear models. Journal of the American Statistical Association, 85(411):765–769, Sept 1990.
[44] Joseph G. Ibrahim, Ming-Hui Chen, and Stuart R. Lipsitz. Monte Carlo EM for missing covariates in parametric regression models. Biometrics, 55(2):591–596, June 1999.
[45] Joseph G. Ibrahim, Ming-Hui Chen, and Stuart R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika, 88(2):551–564, June 2001.
[46] Joseph G. Ibrahim, Ming-Hui Chen, Stuart R. Lipsitz, and Amy H. Herring. Missing-data methods for generalized linear models: A comparative review.
Journal of the American Statistical Association, 100(469):332–346, March 2005.
[47] Joseph G. Ibrahim and Stuart R. Lipsitz. Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable. Biometrics, 52(3):1071–1078, Sept 1996.
[48] Joseph G. Ibrahim, Stuart R. Lipsitz, and Ming-Hui Chen. Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. Journal of the Royal Statistical Society. Series B (Methodological), 61(1):173–190, 1999.
[49] M Jamshidian and R I Jennrich. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88:221–228, 1993.
[50] Daijin Ko. Estimation of the concentration parameter of the von Mises-Fisher distribution. The Annals of Statistics, 20(2):917–928, June 1992.
[51] Nancy Krieger. Overcoming the absence of socioeconomic data in medical records: validation and application of a census-based methodology. American Journal of Public Health, 82(5):703–710, 1992.
[52] Nancy Krieger. Theories for social epidemiology in the 21st century: an ecosocial perspective. International Journal of Epidemiology, 30:668–677, 2001.
[53] Nancy Krieger. Choosing area based socioeconomic measures to monitor social inequalities in low birth weight and childhood lead poisoning: The Public Health Disparities Geocoding Project (US). Journal of Epidemiology and Community Health, 57:186–199, 2003.
[54] Nancy Krieger. Defining and investigating social disparities in cancer: critical issues. Cancer Causes and Control, 16:5–14, 2005.
[55] Nancy Krieger. Race/ethnicity and breast cancer estrogen receptor status: impact of class, missing data, and modeling assumptions.
Cancer Causes and Control, 19:1305–1318, 2008.
[56] John M Lachin. Biostatistical Methods: The Assessment of Relative Risks. John Wiley and Sons, 2000.
[57] N. M. Laird. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73:805–811, 1978.
[58] K Lange. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society. Series B, 57:425–437, 1995.
[59] K Lange. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica, 5:1–18, 1995.
[60] D Lansky, George Casella, C E McCulloch, and D Lansky. Convergence and invariance properties of the EM algorithm. In American Statistical Association Proceedings of the Statistical Computing Section, pages 28–33. American Statistical Association, 1992.
[61] Richard A. Levine and George Casella. Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics, 10(3):422–439, Sept 2001.
[62] Richard A. Levine and Juanjuan Fan. An automated (Markov chain) Monte Carlo EM algorithm. Journal of Statistical Computation & Simulation, 74(5):349–360, May 2004.
[63] Hua Liang, Suojin Wang, and Raymond J. Carroll. Partially linear models with missing response variables and error-prone covariates. Biometrika, 94(1):185–198, 2007.
[64] Stuart R. Lipsitz and Joseph G. Ibrahim. A conditional model for incomplete covariates in parametric regression models. Biometrika, 83(4):916–922, Dec 1996.
[65] Stuart R. Lipsitz, Joseph G. Ibrahim, Ming-Hui Chen, and Harriet Peterson. Non-ignorable missing covariates in generalized linear models. Statistics in Medicine, 18:2435–2448, 1999.
[66] Stuart R. Lipsitz, Joseph G. Ibrahim, and Lue Ping Zhao. A weighted estimating equation for missing covariate data with properties similar to maximum likelihood. Journal of the American Statistical Association, 94(448):1147–1160, Dec 1999.
[67] Roderick J. A. Little and Donald B. Rubin.
Statistical Analysis with Missing Data. John Wiley and Sons, New Jersey, 2002.
[68] Roderick J. A. Little and Mark D. Schluchter. Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika, 72(3):497–512, 1985.
[69] Chuanhai Liu, Donald B. Rubin, and Ying Nian Wu. Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika, 85(4):755–770, 1998.
[70] Thomas A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 44(2):226–233, 1982.
[71] Wei Liu and Lang Wu. A semiparametric nonlinear mixed-effects model with non-ignorable missing data and measurement errors for HIV viral data. Computational Statistics and Data Analysis, 52:112–122, 2008.
[72] M Malstrom, J Sundquist, and S E Johansson. Neighbourhood environment and self-rated health status: a multilevel analysis. American Journal of Public Health, 89(8):1181–1186, 1999.
[73] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 2nd edition, 1989.
[74] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. John Wiley and Sons, 2008.
[75] I Meilijson. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society. Series B, 51:127–138, 1989.
[76] X L Meng and D B Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86:899–909, 1991.
[77] Xiao-Li Meng and David van Dyk. The EM algorithm: an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society. Series B (Methodological), 59(3):511–567, 1997.
[78] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1091, June 1953.
[79] Nicholas Metropolis and S Ulam.
The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.
[80] H. Morgenstern. Uses of ecological analysis in epidemiological research. American Journal of Public Health, 72(12):1336–1344, 1982.
[81] Simon Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, Aug 1886.
[82] David Oakes. Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61(2):479–482, 1999.
[83] Terence Orchard and Max A Woodbury. A missing information principle: Theory and applications. In Proceedings of the 6th Berkeley Symposium of Mathematical Statistics and Probability, volume 1, pages 697–715. University of California Press, 1972.
[84] M S Pepe. Inference using surrogate outcome data and a validation sample. Biometrika, 79(2):355–365, 1992.
[85] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer, second edition, 2004.
[86] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89:846–866, 1994.
[87] Paulino Perez Rodriguez. Adaptive Rejection Sampling, 2007.
[88] L L Roos, J Magoon, S Gupta, D Chateau, and P J Veugelers. Socioeconomic determinants of mortality in two Canadian provinces: Multilevel modelling and neighbourhood context. Social Science and Medicine, 59(7):1435–1447, 2004.
[89] A V Roux. The study of group-level factors in epidemiology: rethinking variables, study designs and analytic approaches. Epidemiologic Reviews, 26:104–111, 2004.
[90] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
[91] Donald B. Rubin. The analysis of transformed data: Comment. Journal of the American Statistical Association, 79(386):309–312, 1984.
[92] S Schwartz.
The fallacy of the ecological fallacy: the potential misuse of a concept and its consequences. American Journal of Public Health, 84(5):819-824, 1994.
[93] M R Segal, P Bacchetti, and N P Jewell. Variance for maximum penalized likelihood estimates obtained via the EM algorithm. Journal of the Royal Statistical Society. Series B, 56:345-352, 1994.
[94] P A O Strickland and B F Crabtree. Modelling effectiveness of internally heterogeneous organizations in the presence of survey non-response: An application to the ULTRA study. Statistics in Medicine, 26(8):1702-1711, 2007.
[95] Amy L. Stubbendick and Joseph G. Ibrahim. Maximum likelihood methods for nonignorable missing responses and covariates in random effects models. Biometrics, 59:1140-1150, December 2003.
[96] R Sundberg. Maximum likelihood theory for incomplete data from an exponential family. Scandinavian Journal of Statistics: Theory and Applications, 1:49-58, 1974.
[97] R Sundberg. An iterative method for solution of the likelihood equations for incomplete data from exponential families. Communications in Statistics - Simulations and Computations, 5:55-64, 1976.
[98] M Susser. The logic in ecological: I. The logic of analysis. American Journal of Public Health, 84(5):825-829, 1994.
[99] Lingqi Tang, Juwon Song, Thomas R. Belin, and Jurgen Unutzer. A comparison of imputation methods in a longitudinal randomized clinical trial. Statistics in Medicine, 24:2111-2128, 2005.
[100] S T Tang and R McCorkle. Determinants of place of death for terminal cancer patients. Cancer Invest., 19(2):165-180, 2001.
[101] C Thomas, S M Morris, and D Clark. Place of death preferences among cancer patients and their carers. Social Science & Medicine, 58(12):2431-2444, 2004.
[102] B W Turnbull. Nonparametric estimation of a survivorship function with doubly censored data. Journal of the American Statistical Association, 69:169-173, 1974.
[103] B W Turnbull.
The empirical distribution with arbitrarily grouped, censored, and truncated data. Journal of the Royal Statistical Society. Series B, 38:290-295, 1976.
[104] Werner Vach and Maria Blettner. Biased estimation of the odds-ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. American Journal of Epidemiology, 134(8):895-907, 1991.
[105] Werner Vach and Martin Schumacher. Logistic regression with incompletely observed categorical covariates: A comparison of three approaches. Biometrika, 80(2):353-362, 1993.
[106] C Y Wang, Yijian Huang, Edward C Chao, and Marjorie Jeffcoat. Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics, 64:85-95, March 2008.
[107] C Y Wang and M S Pepe. Expected estimating equations to accommodate covariate measurement error. Journal of the Royal Statistical Society. Series B (Methodological), 62:509-524, 2000.
[108] Jing Wang. EM algorithms for nonlinear mixed effects models. Computational Statistics and Data Analysis, 51:3244-3256, 2007.
[109] G. C. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85:699-704, 1990.
[110] G. C. Wei and M. A. Tanner. Posterior computations for censored regression data. Journal of the American Statistical Association, 85:829-839, 1990.
[111] P. Wild and W. R. Gilks. Algorithm AS 287: Adaptive rejection sampling from log-concave density functions. Applied Statistics, 42(4):701-709, 1993.
[112] D M Wilson, J C Anderson, R L Fainsinger, H C Northcott, S L Smith, and M J Stingi. Social and health care trends influencing palliative care and the location of death in twentieth-century Canada. Technical report, University of Alberta, 1998.
[113] C Wong. Personal communication with data analyst, March 2005.
[114] C F J Wu.
On the convergence properties of the EM algorithm. Annals of Statistics, 11:95-103, 1983.
[115] G Y Yi. A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates. Biostatistics, 9(3):501-512, 2008.

Appendix A
British Columbia Vital Statistics location of death

All locations of death are coded on the death certificate by the BCVS. In 2000, the place of death codes changed from following the International Classification of Diseases (ICD) version 9 to version 10. Coding for deaths at home (also called place of usual residence) remained the same across the change, but the codes for other locations changed. The hospital code was 7 prior to 2000 and 2 after. It is clear that these categories represent general locations of death based on some similarity across the locations. The place of usual residence, called home, is its own distinct category, but comparison with this location and others becomes problematic due to the composite nature of the other categories.

Prior to 2000, the location of death codes as reported on the death certificate are
• (0) HOME: includes apartment, boarding house, caravan (trailer) park, farmhouse, home premises, house (residential), non-institutional place of residence, private driveway to home, garage, garden of home, swimming pool in private house or garden
• (1) RESIDENTIAL INSTITUTION: includes children's home, dormitory, home for the sick, hospice, military camp, nursing home, old people's home, orphanage, pensioners' home, prison, reform school
• (2) SCHOOL, OTHER INSTITUTION, PUBLIC ADMINISTRATIVE AREA: building and adjacent grounds used by the general public or by a particular group of the public, which includes hospital, assembly hall, campus, church, cinema, institute for higher education, court-house, day nursery, post-office, public hall, school, library, youth centre (the vast majority of deaths with this place code are hospital deaths)
• (3) SPORTS AND ATHLETIC AREA: includes baseball field, basketball court, football field, stadium, public swimming pool, etc. (fairly self explanatory)
• (4) STREET AND HIGHWAY
• (5) TRADE AND SERVICE AREA: includes airport, bank, cafe, hotel/motel, office buildings, public transit stations, railway stations, shops, malls, service station, commercial garage, radio/television stations
• (6) INDUSTRIAL AND CONSTRUCTION AREA: includes any building under construction, dockyard, factory premises, industrial yard, mine, oil rig and other offshore installations, gravel/sand pit, power station, shipyard, tunnel under construction, workshop, commercial garage
• (7) FARM: includes farm buildings, land under cultivation, ranch
• (8) OTHER SPECIFIED PLACES: including but not limited to beach, campsite, derelict house, wilderness areas and natural bodies of water (mountain, forest, lake, marsh, river, sea, seashore, etc.), public park, parking lots, railway line, zoo, water reservoir, unspecified public place
• (9) UNSPECIFIED PLACE

After 2000, the location of death codes used for the British Columbia death certificates are
• (0) HOME: includes apartment, boarding house, caravan (trailer) park, farmhouse, home premises, house (residential), non-institutional place of residence, private driveway to home, garage, garden of home, swimming pool in private house or garden
• (1) FARM: includes farm buildings and land under cultivation. Excludes farm house and home premises of farm.
• (2) MINE AND QUARRY: including gravel and sand pits, tunnels under construction
• (3) INDUSTRIAL PLACE AND PREMISES: includes any building under construction, dockyard, factory premises, industrial yard, mine, oil rig and other offshore installations, gravel/sand pit, power station, shipyard, tunnel under construction, workshop, commercial garage
• (4) PLACE FOR RECREATION AND SPORT: includes lake and mountain resorts, vacation resorts, public parks, playgrounds, baseball field, basketball court, football field, stadium, public swimming pool, etc.
• (5) STREET AND HIGHWAY
• (6) PUBLIC BUILDING: includes assembly hall, campus, church, cinema, institute for higher education, court-house, day nursery, post-office, public hall, school, library, youth centre, airport, bank, cafe, hotel, office buildings, public transit stations, shops, malls, commercial parking garage, clinic
• (7) RESIDENTIAL INSTITUTION: includes hospitals, children's home, dormitory, home for the sick, hospice, military camp, nursing home, old people's home, orphanage, pensioners' home, prison, reform school
• (8) OTHER SPECIFIED PLACES: including but not limited to beach, campsite, derelict house, wilderness areas and natural bodies of water (mountain, forest, lake, marsh, river, sea, seashore, etc.), public park, parking lots, railway line, zoo, water reservoir
• (9) UNSPECIFIED PLACE

Appendix B
Supplemental material for the methodological development

This appendix contains two kinds of material: background information that is informative but not directly necessary for the main expository flow of the thesis, and methodological development which, if included in the main body of the thesis, would distract from the narrative points being made.

B.1 Adaptive Rejection Sampling

Gilks and Wild [35] proposed a method for rejection sampling from any univariate log-concave probability density function. Details for rejection sampling, adaptive rejection sampling and log-concavity follow.
The method is intended to be a black-box technique for sampling from any univariate log-concave distribution. An upper hull, called the rejection envelope, and a lower hull, called the squeezing function, are constructed about the target distribution. As sampling proceeds, the rejection envelope and squeezing function converge to the target distribution.

The formal goal is to generate a sample of independent realizations from a density, $f(x)$, which only needs to be known up to a constant of proportionality to another function $g(x)$, that is $g(x) = c f(x)$. This is particularly useful when $c = \int_{D} g(x)\,dx$ is not known in closed form, where $D$ is the domain of $f(x)$ for which $f(x) > 0$ for all $x \in D$.

B.1.1 Rejection sampling: Envelope Accept-Reject

One approach, called Envelope Accept-Reject, is to construct an envelope about $g(x)$ consisting of two functions and then use this envelope to determine if a sampled point is from the function under consideration, $f(x)$. First, define an envelope function $g_u(x)$ such that $g_u(x) \ge g(x)\ \forall x \in D$ and a squeezing function $g_l(x)$ such that $g_l(x) \le g(x)\ \forall x \in D$. Now, sample an observation, $x^*$, from $g_u(x)$ and independently sample an observation, $w$, from $U(0, 1)$. If we have defined $g_l(x)$ then we can perform the squeeze test,
\[ \text{if } w \le \frac{g_l(x^*)}{g_u(x^*)} \text{ then accept } x^*, \]
otherwise perform the following rejection test,
\[ \text{if } w \le \frac{g(x^*)}{g_u(x^*)} \text{ then accept } x^*, \]
otherwise reject $x^*$. This is repeated until a sample of $m$ points is obtained.

B.1.2 Adaptive Rejection Sampling

Frequently, only one sample is required when using Gibbs sampling, but often we need one sample from a large number of different probability densities.
Adaptive rejection sampling (ARS) reduces the number of evaluations required to obtain the desired sample size by
• assuming log-concavity, which avoids the need to find the supremum of the function, and
• reducing the probability of needing to evaluate $g(x)$ further because, after each rejection, the envelope and squeezing functions are updated to incorporate the new information obtained from the rejected observation.

Along with the log-concavity assumption of the function, adaptive rejection sampling requires the additional assumptions that
• the domain $D$ of the function is connected, that is, it cannot be represented as the disjoint union of two or more non-empty open subsets,
• $g(x)$ is continuous and differentiable everywhere in $D$, and
• $h(x) = \log g(x)$ is concave everywhere in $D$.

Given that we are working in a standard Cartesian coordinate system, the $x$ coordinates are termed the abscissae. In order to construct the upper and lower hulls, which are then squeezed towards the target function, a set of $k$ abscissae are selected where $x_1 \le x_2 \le \dots \le x_k$. We will retain the notation of Gilks and Wild [35] and denote the set of abscissae $T_k = \{x_i : i = 1, \dots, k\}$, where the subscript $k$ denotes the cardinality of the set $T$. The function $h(x)$ and its derivative $h'(x)$ are evaluated for $x_i \in T_k$, where $i = 1, \dots, k$ and $T_k \subset D$.

Now we construct the upper hull, known as the rejection envelope, on $T_k$. Define the rejection envelope as $\exp u_k(x)$, where $u_k(x)$ is a piece-wise linear upper hull formed by the tangents to $h(x)$ at $x_i \in T_k$ for $i = 1, \dots, k$. Notice that for each update of the abscissae, $T_k$ changes its cardinality by the introduction of the point that was rejected. This in turn results in the construction of a new upper hull, thus the upper hull function $u(x)$ is indexed by the cardinality of the abscissae, $u_k(x)$.

The upper hull, constructed from the tangent lines, is easy to construct using the point-slope form of a line. Given the function $h(x)$ we find the derivative $h'(x)$.
Furthermore, the tangent line, $u_k(x)$, and $h(x)$ share the point $(x_i, h(x_i))$ for $i = 1, \dots, k$. Finally we have the point-slope form of the tangent line associated with $x_i \in T_k$,
\[ u_k(x) = h(x_i) + (x - x_i)\,h'(x_i). \]
The $k - 1$ intersections between the $x_i$ and the $x_{i+1}$ tangent lines, denoted as $z_i$, are
\[ z_i = \frac{h(x_{i+1}) - h(x_i) - x_{i+1}\,h'(x_{i+1}) + x_i\,h'(x_i)}{h'(x_i) - h'(x_{i+1})}. \]
Now with the construction of the upper hull, we normalize $\exp u_k(x)$ to construct a density from the hull envelope function, thus
\[ s_k(x) = \frac{\exp u_k(x)}{\int_{D} \exp u_k(x')\,dx'}. \]
The squeezing function $l_k(x)$ is constructed in a similar manner to that of the envelope function, except that we use the $k - 1$ chords (secants) between $x_i, x_{i+1} \in T_k$. Notice that, as with the envelope function, the squeezing function is also indexed by the cardinality of the abscissae. The slope of the secant between the abscissae $x_i$ and $x_{i+1}$ is
\[ \frac{h(x_{i+1}) - h(x_i)}{x_{i+1} - x_i} \]
and with the point-slope form of a line we have
\[ l_k(x) - h(x_i) = \frac{h(x_{i+1}) - h(x_i)}{x_{i+1} - x_i}\,(x - x_i) \]
\[ l_k(x) = \frac{h(x_{i+1})(x - x_i) - h(x_i)(x - x_i)}{x_{i+1} - x_i} + h(x_i) = \frac{(x_{i+1} - x)\,h(x_i) + (x - x_i)\,h(x_{i+1})}{x_{i+1} - x_i}. \]
Under the assumption of concavity of $h(x)$ we ensure that $l_k(x) \le h(x) \le u_k(x)$ for all $x \in D$. The ARS algorithm, utilizing these functions, involves three steps: initialization, sampling, and updating [35, 111].

B.1.2.1 Initialization

We initialize the abscissae, $T_k$, by choosing a set of points in the domain of $f(x)$. If $D$ is unbounded from below, choose $x_1$ such that $h'(x_1) > 0$, and if $D$ is unbounded from above, then choose $x_k$ such that $h'(x_k) < 0$. Calculate the functions $u_k(x_i)$, $s_k(x_i)$, and $l_k(x_i)$ for $x_i \in T_k$ where $i = 1, \dots, k$.

B.1.2.2 Sampling

First sample an observation, $x^*$, from the distribution constructed from the upper hull, $s_k(x)$, then independently sample an observation, $w$, from the uniform distribution, $U(0, 1)$. Now perform the squeeze test,
\[ \text{if } w \le \exp\{l_k(x^*) - u_k(x^*)\} \text{ then accept } x^*, \]
otherwise evaluate $h(x^*)$ and $h'(x^*)$ and perform the rejection test,
\[ \text{if } w \le \exp\{h(x^*) - u_k(x^*)\} \text{ then accept } x^*, \]
otherwise reject $x^*$.
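The hull constructions and the two acceptance tests described above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function names are hypothetical, at least two abscissae are assumed, $h$ is assumed strictly concave between abscissae so that the tangent slopes differ, and the draw of $x^*$ from $s_k(x)$ is left to the caller.

```python
import math

def intersections(xs, h, dh):
    """z_i: points where the tangents at x_i and x_{i+1} meet (k - 1 values)."""
    return [
        (h(b) - h(a) - b * dh(b) + a * dh(a)) / (dh(a) - dh(b))
        for a, b in zip(xs, xs[1:])
    ]

def upper_hull(x, xs, zs, h, dh):
    """u_k(x): the tangent at x_i is the active piece between z_{i-1} and z_i."""
    i = sum(1 for z in zs if x > z)  # index of the active tangent
    return h(xs[i]) + (x - xs[i]) * dh(xs[i])

def lower_hull(x, xs, h):
    """l_k(x): chord between adjacent abscissae; -inf outside [x_1, x_k]."""
    if x < xs[0] or x > xs[-1]:
        return -math.inf
    for a, b in zip(xs, xs[1:]):
        if a <= x <= b:
            return ((b - x) * h(a) + (x - a) * h(b)) / (b - a)

def ars_tests(x_star, w, xs, h, dh):
    """Apply the squeeze test, then the rejection test, to a candidate x_star."""
    zs = intersections(xs, h, dh)
    u = upper_hull(x_star, xs, zs, h, dh)
    l = lower_hull(x_star, xs, h)
    if w <= math.exp(l - u):
        return "accept-squeeze"    # accepted without evaluating h(x_star)
    if w <= math.exp(h(x_star) - u):
        return "accept-rejection"  # accepted after evaluating h(x_star)
    return "reject"
```

For example, with the illustrative choice $h(x) = -x^2/2$ (a standard normal up to a constant) and abscissae $-1$, $0.5$, $2$, the tangent intersections fall at the midpoints of adjacent abscissae, and the ordering $l_k(x) \le h(x) \le u_k(x)$ can be checked numerically on any grid inside $[x_1, x_k]$.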
B.1.2.3 Updating

If the functions $h(x^*)$ and $h'(x^*)$ are evaluated, then include the abscissa $x^*$ in $T_k$, thus we have $T_{k+1} = \{x_1, \dots, x_k, x^*\}$. The abscissae in $T_{k+1}$ are ordered and the functions $u_{k+1}(x)$, $s_{k+1}(x)$ and $l_{k+1}(x)$ are evaluated with the new abscissae. If a sample of size $m$ is not yet obtained, then return to the sampling step.

B.1.3 Application to Gibbs sampling

With Gibbs sampling, we have a sequence of full conditional distributions from which samples are drawn. Adaptive rejection sampling becomes the process by which samples are drawn from the full conditional distributions as specified by the Gibbs sampler. Here, we require that the full conditional distribution is proportional to some function $g(\cdot)$. Unless the full conditional distribution is expressible in terms of conjugate distributions, it will not correspond to a common distribution and it will not be possible to obtain a closed form for the proportionality constant. Additionally, since the full conditional distribution will commonly be constructed from the product of many terms, it may be computationally expensive to sample from the full conditional distribution, hence the utility of the adaptive rejection sampling method for the Gibbs sampler [35].

B.1.4 Concavity of the likelihood

Although the adaptive rejection sampling algorithm requires several assumptions, the one of particular concern is the assumption of log-concave functions. A log-concave function is simply a function that is concave on the logarithmic scale. Using Gilks and Wild's [35] notation, recall that $h(x) = \log g(x)$, thus $g(x)$ is log-concave if $h(x)$ is concave. For differentiable functions, $h(x)$ is concave if the derivative $h'(x) = \frac{d}{dx} h(x)$ decreases monotonically with increasing $x \in D$. Furthermore, this can be modified to consider functions over a specified interval, thus a differentiable function is concave on an interval $D^*$ if $h'(x)$ is monotonically decreasing on that interval.
This translates to the familiar concept that a concave function has a negative second derivative, $f''(x) < 0$. This situation is known as a strictly concave function. Gilks and Wild consider concave functions in general and relax the condition to include straight lines, thus we have $f''(x) \le 0$. More formally, if $h(x)$ is twice differentiable, then $h(x)$ is concave if and only if $h''(x) \le 0$. A desirable property which emerges for concave functions is that the sum of concave functions is itself concave, thus if each component of the log-likelihood is concave, then the log-likelihood itself is concave and the likelihood is log-concave.

Gilks and Wild [35] provide a list of common probability distributions and their log-concavity properties. Table B.1 reproduces the concavity features for those distributions considered in this document.

Table B.1: Log-concave properties for probability distributions found in this document [35]
p(x) | Parameters | log p(x) concave with respect to | log p(x) not concave with respect to
Normal | Mean $\mu$, Variance $\sigma^2$ | $x$, $\mu$, $\log \sigma$ | $\sigma$
Bernoulli | Proportion $p$ | $p$, $\mathrm{logit}(p)$ |
Binomial | Proportion $p$ | $p$, $\mathrm{logit}(p)$ |

B.1.5 Concavity of the binary logistic regression model

Here, we will show that the exponential family for generalized linear models is log-concave in terms of the covariates. Specifically, for adaptive rejection Gibbs sampling, we need to generate a sample based on the full conditional of the $x$ variable, thus the response model and the model for the mechanism of imperfection need to be log-concave in terms of $x$. Recall from equation 2.5.11 that the form of the exponential family is
\[ p(Y = y \mid \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \]
with the log-likelihood, in terms of $(\theta, \phi)$, given by
\[ l(\theta, \phi \mid y) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \]
and for our context, the log-likelihood is
\[ l(\theta_i \mid y_i, x_i, \phi) = \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \]
where the canonical parameter is a function of the data, $\theta_i = g(\beta, x_i)$.

Recognizing that our outcome is distributed as a Bernoulli random variable, we will assume a binary logistic regression form of the generalized linear model.
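Recall that we assumed that the model is not over- or under-dispersed ($\phi = 1$), which resulted in $a(\phi) = 1$. Furthermore, the canonical link was assumed for the model, thus
\[ \theta = \log\left(\frac{\pi}{1 - \pi}\right) = \eta, \]
where $\pi$ is the probability of success as given in section 2.5.5. For the $i$th subject this relationship is given as
\[ \eta_i = x_i^T\beta = \sum_{j=0}^{p} x_{ij}\beta_j \]
where $x_{i0} = 1$ for all $i = 1, \dots, n$ and $\beta_0$ is the intercept. The function $b(\theta)$ for binary logistic regression is
\[ b(\theta_i) = -\log(1 - \pi_i) = \log(1 + e^{\theta_i}) \]
and
\[ c(y_i, \phi) = \log\binom{m_i}{y_i} = 0, \]
since $m_i = 1$ for Bernoulli random variables and $y_i = 0, 1$.

The concavity of a Bernoulli random variable under the generalized linear model is required, thus we want to determine whether $\frac{\partial^2}{\partial x_{ij}^2} l(\theta_i \mid y_i, x_i, \phi)$ is non-positive for all $x_{ij} \in D$. Obtaining $\frac{\partial^2}{\partial x_{ij}^2} l(\theta_i \mid y_i, x_i, \phi)$ requires the application of the chain rule,
\[ \frac{\partial l}{\partial x_{ij}} = \frac{\partial l}{\partial \theta_i}\,\frac{\partial \theta_i}{\partial \pi_i}\,\frac{\partial \pi_i}{\partial \eta_i}\,\frac{\partial \eta_i}{\partial x_{ij}} \]
where for the $i$th subject we have
\[ \frac{\partial l}{\partial \theta_i} = y_i - b'(\theta_i), \quad \text{where } b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}}, \]
\[ \frac{\partial \theta_i}{\partial \pi_i} = \frac{1}{\pi_i(1 - \pi_i)}, \]
\[ \frac{\partial \pi_i}{\partial \eta_i} = \frac{e^{\eta_i}}{(1 + e^{\eta_i})^2} = \pi_i(1 - \pi_i), \text{ and} \]
\[ \frac{\partial \eta_i}{\partial x_{ij}} = \beta_j, \]
thus
\[ \frac{\partial l}{\partial x_{ij}} = \left[ y_i - \frac{e^{\theta_i}}{1 + e^{\theta_i}} \right]\beta_j \]
for a binary logistic regression model. Applying the chain rule again for the second derivative yields
\[ \frac{\partial^2 l}{\partial x_{ij}^2} = -b''(\theta_i)\,\beta_j^2 \]
where
\[ b''(\theta_i) = \frac{e^{\theta_i}}{(1 + e^{\theta_i})^2}, \]
thus
\[ \frac{\partial^2 l}{\partial x_{ij}^2} = -\frac{e^{\theta_i}}{(1 + e^{\theta_i})^2}\,\beta_j^2 = -\pi_i(1 - \pi_i)\,\beta_j^2 \le 0. \]
It is clear that for $x_{ij}$, $i = 1, \dots, n$ and $j = 1, \dots, p$, the binary logistic model is log-concave.

B.2 EM maximization details

B.2.1 Component wise maximization

Recall that the maximization of $Q(\xi \mid \xi^{(t)})$ reduces to the maximization of each component of $Q(\xi \mid \xi^{(t)})$. Although this is intuitively satisfying, let us consider a more detailed verification of this proposition.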
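Before the formal argument, the proposition can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the two component functions are hypothetical stand-ins for the pieces of the expected complete-data log-likelihood, each depending on its own parameter, and a coarse grid search stands in for a proper optimizer.

```python
import itertools

def q_beta(b):
    """Hypothetical component that depends only on beta."""
    return -(b - 1.0) ** 2

def q_gamma(g):
    """Hypothetical component that depends only on gamma."""
    return -2.0 * (g + 0.5) ** 2

# Candidate values -3.0, -2.9, ..., 3.0 for each parameter.
grid = [i / 10 for i in range(-30, 31)]

# Joint maximization of the sum over every (beta, gamma) pair.
joint = max(
    itertools.product(grid, grid),
    key=lambda pair: q_beta(pair[0]) + q_gamma(pair[1]),
)

# Component-wise maximization: each parameter is handled on its own.
separate = (max(grid, key=q_beta), max(grid, key=q_gamma))
```

Because the sum separates into terms with distinct parameters, the joint search over 61 x 61 pairs and the two one-dimensional searches select the same point, while the component-wise version needs only 61 + 61 evaluations.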
The basic requirement of the EM algorithm is that for each step of the algorithm, the expectation of the complete log-likelihood does not decrease (Equation 3.3.5), that is
\[ Q(\xi^{(t+1)} \mid \xi^{(t)}) \ge Q(\xi^{(t)} \mid \xi^{(t)}). \]
To verify that the maximization of each component of equation 3.4.5 applied to equation 3.3.12 results in the maximization of $Q(\xi \mid \xi^{(t)})$, we will begin with the assumption that each component of equation 3.4.5 can be maximized, thus
\[ \beta^{(t+1)} = \arg\max_{\beta} Q(\beta \mid \beta^{(t)}) \]
\[ \gamma^{(t+1)} = \arg\max_{\gamma} Q(\gamma \mid \gamma^{(t)}) \]
\[ \psi^{(t+1)} = \arg\max_{\psi} Q(\psi \mid \psi^{(t)}). \]
Without loss of generality, begin with $Q(\beta \mid \beta^{(t)})$ and $Q(\gamma \mid \gamma^{(t)})$. Given that $\beta^{(t+1)}$ maximizes $Q(\beta \mid \beta^{(t)})$, then
\[ Q(\beta^{(t+1)} \mid \beta^{(t)}) \ge Q(\beta^{(t)} \mid \beta^{(t)}). \]
From this we have
\[ Q(\beta^{(t+1)} \mid \beta^{(t)}) + Q(\gamma^{(t+1)} \mid \gamma^{(t)}) \ge Q(\beta^{(t)} \mid \beta^{(t)}) + Q(\gamma^{(t+1)} \mid \gamma^{(t)}) \ge Q(\beta^{(t)} \mid \beta^{(t)}) + Q(\gamma^{(t)} \mid \gamma^{(t)}) \]
since $Q(\gamma^{(t+1)} \mid \gamma^{(t)}) \ge Q(\gamma^{(t)} \mid \gamma^{(t)})$. Finally, since $Q(\psi^{(t+1)} \mid \psi^{(t)}) \ge Q(\psi^{(t)} \mid \psi^{(t)})$, we have
\[ Q(\xi^{(t+1)} \mid \xi^{(t)}) \ge Q(\xi^{(t)} \mid \xi^{(t)}), \]
which is the requirement for the EM algorithm.

It is clear that the assumption that each probability distribution is uniquely parameterized, and that each density can be maximized in terms of its unique parameter, allows a complex maximization problem to be partitioned into a series of smaller and potentially easier ones. Furthermore, it is recognized that $Q(\gamma \mid \gamma^{(t)})$ and $Q(\psi \mid \psi^{(t)})$ are themselves sums of uniquely parameterized distributions. In this regard, proving that if we can maximize each of these distributions in terms of their unique parameters then we maximize $Q(\xi \mid \xi^{(t)})$ involves a more elaborate, but identical in spirit, version of that which has been given above.

B.2.2 Score and Hessian for the binary logistic regression model

Beginning with the binary regression model given in section B.1.5, and the response model as specified in equation 2.5.4, denote the log-likelihood of the response model as $l(\beta \mid x, y, \phi)$. With the application of the chain rule, as given in B.1.5, we can obtain the score function. We have
\[ \frac{\partial}{\partial \beta_j} l(\beta \mid x, y, \phi) = \sum_{i=1}^{n} \frac{\partial}{\partial \beta_j} l(\beta \mid x_i, y_i, \phi) = \sum_{i=1}^{n} \frac{\partial l(\beta \mid x_i, y_i)}{\partial \theta_i}\,\frac{\partial \theta_i}{\partial \pi_i}\,\frac{\partial \pi_i}{\partial \eta_i}\,\frac{\partial \eta_i}{\partial \beta_j} \]
where $\theta_i$ is ultimately a function of $\beta$.
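Since we are dealing with a binary logistic regression model as in section B.1.5, it is self-evident that the product of the middle two terms equals 1. Additionally, the first partial derivative yields the same form as in section B.1.5, thus the only new part is the final partial derivative,
\[ \frac{\partial \eta_i}{\partial \beta_j} = x_{ij}. \]
Pulling all the pieces together yields
\[ \frac{\partial}{\partial \beta_j} l(\beta \mid x, y) = \sum_{i=1}^{n} \left[ y_i - b'(\theta_i) \right] x_{ij} \]
which is the $j$th component of the score function. The $jk$th component of the Hessian matrix has the second and third components of the chain rule identical to those given in section B.1.5, which cancel, thus we are left with
\[ \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = \frac{\partial}{\partial \beta_k} \sum_{i=1}^{n} \left[ y_i - b'(\theta_i) \right] x_{ij} = -\sum_{i=1}^{n} b''(\theta_i)\,x_{ij}x_{ik} \]
and from section B.1.5 we have
\[ b''(\theta_i) = \frac{e^{\theta_i}}{(1 + e^{\theta_i})^2}, \]
thus
\[ \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = -\sum_{i=1}^{n} \frac{e^{\theta_i}}{(1 + e^{\theta_i})^2}\,x_{ij}x_{ik}. \]

B.3 Details for Louis standard errors for the EM algorithm

B.3.1 Derivation of the observed score

Here we present the derivation of equation 3.5.1. For the observed data, we have the log-likelihood
\[ l^*(\xi \mid x_i^E, r_i, y_i) = \log p(X_i^E, R_i, Y_i \mid \xi) \]
and the score is given by the first derivative of the log-likelihood,
\[ S^*(\xi \mid x_i^E, r_i, y_i) = \frac{\partial}{\partial \xi} l^*(\xi \mid x_i^E, r_i, y_i) = \frac{p'(X_i^E, R_i, Y_i \mid \xi)}{p(X_i^E, R_i, Y_i \mid \xi)} = \frac{\frac{\partial}{\partial \xi} \int \cdots \int p(X_i^U, Y_i \mid \xi)\,dx_i^U}{\int \cdots \int p(X_i^U, Y_i \mid \xi)\,dx_i^U} \]
and under the regularity conditions required for differentiation under the integral sign we have
\[ S^*(\xi \mid x_i^E, r_i, y_i) = \frac{\int \cdots \int p'(X_i^U, Y_i \mid \xi)\,dx_i^U}{\int \cdots \int p(X_i^U, Y_i \mid \xi)\,dx_i^U} = \frac{\int \cdots \int S(\xi \mid x_i^U, y_i)\,p(X_i^U, Y_i \mid \xi)\,dx_i^U}{\int \cdots \int p(X_i^U, Y_i \mid \xi)\,dx_i^U} = E\left[ S(\xi \mid x_i^U, y_i) \mid x_i^E \right] \]
so
\[ S^*(\xi \mid x_i^E, r_i, y_i) = E\left[ S(\xi \mid x_i^U, y_i) \mid x_i^E \right]. \]
Furthermore, the fixed point, $\hat{\xi}$, is the solution to
\[ S^*(\xi \mid x_i^E, r_i, y_i) = 0. \quad \text{(B.3.1)} \]

B.3.2 Derivation of the complete-data information matrix

The matrix of second derivatives is commonly known as the Hessian matrix. For the $i$th subject this is
\[ H_i(\xi) = \frac{\partial^2}{\partial \xi\,\partial \xi^T} l(\xi \mid x_i) = \frac{\partial}{\partial \xi^T} S(\xi \mid x_i) \]
where the Hessian over all subjects is $H(\xi) = \sum_{i=1}^{n} H_i(\xi)$. The Fisher information function for a $p$-dimensional vector of parameters is a $p \times p$ matrix defined as
\[ \mathcal{I}(\xi) = E\left( -\frac{\partial^2}{\partial \xi\,\partial \xi^T} l(\xi) \right) \]
with the $jk$th element
\[ \mathcal{I}(\xi)_{jk} = E\left( -\frac{\partial^2}{\partial \xi_j\,\partial \xi_k} l(\xi) \right). \]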
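An alternate expression for the information is
\[ \mathcal{I}(\xi) = E\left( -H(\xi) \right). \]
The observed information is defined as the negative of the Hessian, so
\[ I(\xi) = -H(\xi). \]
The Fisher information can then be written as
\[ \mathcal{I}(\xi) = E\left( I(\xi) \right). \]
Now we will consider the second derivative of $l^*(\xi \mid x_i^E, r_i, y_i)$, which is $\frac{\partial^2}{\partial \xi\,\partial \xi^T} l^*(\xi \mid x_i^E, r_i, y_i)$ and equals
\[ \frac{\partial}{\partial \xi^T} \left[ \frac{p'(X_i^E, R_i, Y_i \mid \xi)}{p(X_i^E, R_i, Y_i \mid \xi)} \right] = \frac{p(X_i^E, R_i, Y_i \mid \xi)\,p''(X_i^E, R_i, Y_i \mid \xi) - p'(X_i^E, R_i, Y_i \mid \xi)\,p'(X_i^E, R_i, Y_i \mid \xi)^T}{\left[ p(X_i^E, R_i, Y_i \mid \xi) \right]^2} \]
\[ = \frac{p''(X_i^E, R_i, Y_i \mid \xi)}{p(X_i^E, R_i, Y_i \mid \xi)} - S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
\[ = \frac{\int \cdots \int p''(X_i^U, Y_i \mid \xi)\,dx_i^U}{\int \cdots \int p(X_i^U, Y_i \mid \xi)\,dx_i^U} - S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
\[ = \frac{\int \cdots \int \left[ \frac{\partial^2}{\partial \xi\,\partial \xi^T} l(\xi \mid x_i^U, y_i) + S(\xi \mid x_i^U, y_i)\,S(\xi \mid x_i^U, y_i)^T \right] p(X_i^U, Y_i \mid \xi)\,dx_i^U}{\int \cdots \int p(X_i^U, Y_i \mid \xi)\,dx_i^U} - S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
then,
\[ = E\left[ \frac{\partial^2}{\partial \xi\,\partial \xi^T} l(\xi \mid x_i^U, y_i) \,\Big|\, x_i^E \right] + E\left[ S(\xi \mid x_i^U, y_i)\,S(\xi \mid x_i^U, y_i)^T \mid x_i^E \right] - S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
therefore,
\[ H^*(\xi \mid x_i^E, r_i, y_i) = E\left[ H(\xi \mid x_i^U, y_i) \mid x_i^E \right] + E\left[ S(\xi \mid x_i^U, y_i)\,S(\xi \mid x_i^U, y_i)^T \mid x_i^E \right] - S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
which is the Hessian matrix associated with the observed-data log-likelihood. The first term is the conditional expectation of the complete-data Hessian, the second term is the conditional expectation of the outer product of the complete-data score functions, and the final term is the outer product of the observed-data score function.

Louis uses the observed information, which under frequentist justification is preferred over the Fisher information [25]. Beginning with the definition of the observed information for the observed data we have
\[ I(\xi \mid x_i^E, r_i, y_i) = -H^*(\xi \mid x_i^E, r_i, y_i) \]
\[ = -E\left[ H(\xi \mid x_i^U, y_i) \mid x_i^E \right] - E\left[ S(\xi \mid x_i^U, y_i)\,S(\xi \mid x_i^U, y_i)^T \mid x_i^E \right] + S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
\[ = E\left[ I(\xi \mid x_i^U, y_i) \mid x_i^E \right] - E\left[ S(\xi \mid x_i^U, y_i)\,S(\xi \mid x_i^U, y_i)^T \mid x_i^E \right] + S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
\[ = I_c(\xi \mid x_i^E) - E\left[ S(\xi \mid x_i^U, y_i)\,S(\xi \mid x_i^U, y_i)^T \mid x_i^E \right] + S^*(\xi \mid x_i^E, r_i, y_i)\,S^*(\xi \mid x_i^E, r_i, y_i)^T \]
\[ = I_c(\xi \mid x_i^E) - I_m(\xi \mid x_i^E) \]
where $I_m(\xi \mid x_i^E)$ is the information matrix associated with the missing information (Equation 3.5.2).

The solution to the score, equation B.3.1, is the value $\hat{\xi}$ for which an estimate of variance is desired. The information function evaluated at $\hat{\xi}$ is the central feature in the derivation of the standard error of $\hat{\xi}$, where
\[ I(\hat{\xi} \mid x_i^E, r_i, y_i) = I_c(\hat{\xi} \mid x_i^E) - I_m(\hat{\xi} \mid x_i^E) = I_c(\hat{\xi} \mid x_i^E) - E\left[ S(\hat{\xi} \mid x_i^U, y_i)\,S(\hat{\xi} \mid x_i^U, y_i)^T \mid x_i^E \right] + S^*(\hat{\xi} \mid x_i^E, r_i, y_i)\,S^*(\hat{\xi} \mid x_i^E, r_i, y_i)^T \quad \text{(B.3.2)} \]
so, because the observed-data score vanishes at $\hat{\xi}$,
\[ I(\hat{\xi} \mid x_i^E, r_i, y_i) = I_c(\hat{\xi} \mid x_i^E) - E\left[ S(\hat{\xi} \mid x_i^U, y_i)\,S(\hat{\xi} \mid x_i^U, y_i)^T \mid x_i^E \right]. \]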
(B.3.3)

At the solution to the score equation, the information for the original incomplete-data problem is obtained from the gradient and Hessian (curvature) of the complete-data log-likelihood function that is used in the EM algorithm. Although the first and second order partial derivatives must still be derived, this relationship simplifies the incomplete-data problem by exchanging difficult derivations for much simpler ones. On a more pragmatic level, there is an implicit suggestion within the missing data literature that all three components should be used within the MCEM framework.

From a conceptual point of view, the solution to $S^*(\xi \mid x_i^E, r_i, y_i) = 0$ occurs at a local or global maximum, denoted $\hat{\xi}$. Within the MCEM framework, we have an algorithmically based solution to $S^*(\xi \mid x_i^E, r_i, y_i) = 0$, which we can denote $\xi^{(t)}$ on the $t$th iteration of the MCEM algorithm. Recall that the sequence of likelihood values, denoted $\{L(\xi^{(t)})\}$, being bounded above, converges monotonically to $L(\hat{\xi})$, where $\hat{\xi}$ is the local or global maximum. Wu [114], in his presentation of two convergence theorems for the EM algorithm, notes that if
\[ \| \xi^{(t+1)} - \xi^{(t)} \| \to 0 \text{ as } t \to \infty \]
is assumed to be a necessary condition for $\xi^{(t)}$ to tend to $\hat{\xi}$, then the assumption that the set of stationary points for the EM algorithm needs to consist of a single point can be relaxed. The relevant feature of this necessary condition is that it suggests one of the commonly used stopping criteria for the EM algorithm. As emphasized in section 3.3.1, the stopping criterion
\[ \| \xi^{(t)} - \xi^{(t-1)} \| < \delta \]
for some small $\delta$ is a measure of lack of progress, with the solution occurring only asymptotically. In a finite number of steps it may be unreasonable to assume that we have reached the target stationary point as suggested by Wu's theory on convergence.
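Recognizing the limitation of restricting our algorithm to a finite number of steps controlled by the size of $\delta$ suggests that when we reach the pragmatic convergence criterion, with $\xi^{(t)}$ on the $t$th iteration of the EM algorithm, then $S^*(\xi^{(t)} \mid x_i^E, r_i, y_i) \approx 0$ rather than exactly $0$. This understanding is implicit in the use of Louis standard errors for missing data problems within a generalized linear model framework, where equation B.3.2 is used instead of the asymptotic result given in equation B.3.3 [43, 44, 47, 48].

Appendix C
Supplemental material for the simulation studies

C.1 Simulation 1

C.1.1 Maximization of the multivariate normal

The MCEM based maximum-likelihood type estimators for a multivariate normal distribution with imperfect variables will be derived. To begin, we will assume that $X_i$ is identically and independently distributed as a multivariate normal of dimension $p$ with mean $\mu$ and variance-covariance $\Sigma$, $X_i \sim N(\mu, \Sigma)$. For a sample of size $n$, the likelihood for the $i$th subject is
\[ L(\mu, \Sigma \mid x_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right\} \]
with the log-likelihood
\[ l(\mu, \Sigma \mid x_i) = \frac{1}{2} \left[ -p \log 2\pi - \log |\Sigma| - (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right]. \quad \text{(C.1.1)} \]
From equation 2.5.4 we have
\[ l_c(\xi \mid x_i^U, y_i) = \log p(R_i \mid x_i, y_i, \gamma) + \log p(Y_i \mid x_i, \beta) + \log p(X_i \mid \psi) \]
where $\psi = (\mu, \Sigma, \tau)$. Write $\delta_i$ for the product of the imperfection indicators of subject $i$, so that $\delta_i = 1$ when every covariate of subject $i$ is perfectly observed and $\delta_i = 0$ otherwise, and write $x_{il}$, $l = 1, \dots, m$, for the Monte Carlo draws for subject $i$. The expectation step for imperfect variables, given by equation 3.3.12, becomes
\[ Q(\xi \mid \xi^{(t)}) = \sum_{i=1}^{n} \left[ \delta_i\,l_c(\xi \mid x_i, y_i) + (1 - \delta_i)\,\frac{1}{m} \sum_{l=1}^{m} l_c(\xi \mid x_{il}, y_i) \right] = Q(\gamma \mid \gamma^{(t)}) + Q(\beta \mid \beta^{(t)}) + Q(\tau \mid \tau^{(t)}) + Q(\mu, \Sigma \mid \mu^{(t)}, \Sigma^{(t)}), \]
thus it is possible to isolate the portion which pertains to the covariates. We also observe that one component pertains to $\tau$, which is assumed to be known, thus no estimation is required. This will result in a further simplification. For the joint distribution of $X_1$ and $X_2$ we have
\[ Q(\mu, \Sigma \mid \mu^{(t)}, \Sigma^{(t)}) = \sum_{i=1}^{n} \left[ \delta_i \log p(X_{i1}, X_{i2} \mid \mu, \Sigma) + (1 - \delta_i)\,\frac{1}{m} \sum_{l=1}^{m} \log p(X_{i1l}, X_{i2l} \mid \mu, \Sigma) \right]. \quad \text{(C.1.2)} \]
The two components of $Q(\mu, \Sigma \mid \mu^{(t)}, \Sigma^{(t)})$ are almost identical except for the indexing.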
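With the use of Dwyer's [23] results, the first partial derivatives in terms of $\mu$ and $\Sigma$ are reasonably straightforward to obtain for the multivariate normal, thus obtaining the MCEM maximum likelihood type estimators $\hat{\mu}$ and $\hat{\Sigma}$ will pose few challenges. Since the two components of equation C.1.2 come from a multivariate normal, the first order partial derivatives for the multivariate normal will be obtained before the first order derivative of $Q(\mu, \Sigma \mid \mu^{(t)}, \Sigma^{(t)})$ is found. Throughout, $\delta_i$ denotes the product of the imperfection indicators of subject $i$ ($\delta_i = 1$ when every covariate is perfectly observed) and $x_{il}$, $l = 1, \dots, m$, denote the Monte Carlo draws for subject $i$.

Taking equation C.1.1 as our starting point, for $\mu$ we have
\[ \frac{\partial}{\partial \mu} l(\mu, \Sigma \mid x_i) = \Sigma^{-1}(x_i - \mu) \]
and solving for the critical point yields
\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
The first order partial derivative with respect to $\Sigma$ is
\[ \frac{\partial}{\partial \Sigma} l(\mu, \Sigma \mid x_i) = -\frac{1}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} (x_i - \mu)(x_i - \mu)^T \Sigma^{-1} \]
and solving for the critical point yields
\[ \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T. \]
With these results, we have all the basic pieces for the derivation of the first order partial derivatives of equation C.1.2 and can comment on their form when compared with the complete case scenario. The derivative with respect to $\mu$ is
\[ \frac{\partial}{\partial \mu} Q(\mu, \Sigma \mid \mu^{(t)}, \Sigma^{(t)}) = \sum_{i=1}^{n} \left[ \delta_i\,\Sigma^{-1}(x_i - \mu) + (1 - \delta_i)\,\frac{1}{m} \sum_{l=1}^{m} \Sigma^{-1}(x_{il} - \mu) \right]. \]
Solving for the critical point yields
\[ \Sigma^{-1} \sum_{i=1}^{n} \left[ \delta_i\,x_i + (1 - \delta_i)\,\bar{x}_{i\cdot} \right] = n\,\Sigma^{-1}\mu \]
and with left multiplication by $\Sigma$, the estimate of $\mu$ is
\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \left[ \delta_i\,x_i + (1 - \delta_i)\,\bar{x}_{i\cdot} \right] \quad \text{(C.1.3)} \]
where $\bar{x}_{i\cdot} = \frac{1}{m} \sum_{l=1}^{m} x_{il}$ is the intra-Monte Carlo sample average, with the dot emphasizing the index over which the average is taken. The derivative with respect to $\Sigma$, $\frac{\partial}{\partial \Sigma} Q(\mu, \Sigma \mid \mu^{(t)}, \Sigma^{(t)})$, is equivalent to
\[ \sum_{i=1}^{n} \left[ \delta_i \left( -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x_i - \mu)(x_i - \mu)^T\Sigma^{-1} \right) + (1 - \delta_i)\,\frac{1}{m} \sum_{l=1}^{m} \left( -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x_{il} - \mu)(x_{il} - \mu)^T\Sigma^{-1} \right) \right]. \]
Solving for the critical point yields
\[ n\,\Sigma^{-1} = \sum_{i=1}^{n} \left[ \delta_i\,\Sigma^{-1}(x_i - \mu)(x_i - \mu)^T\Sigma^{-1} + (1 - \delta_i)\,\frac{1}{m} \sum_{l=1}^{m} \Sigma^{-1}(x_{il} - \mu)(x_{il} - \mu)^T\Sigma^{-1} \right] \]
and with pre- and post-multiplication by $\Sigma$, the estimate for $\Sigma$ is
\[ \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \left[ \delta_i\,(x_i - \hat{\mu})(x_i - \hat{\mu})^T + (1 - \delta_i)\,\frac{1}{m} \sum_{l=1}^{m} (x_{il} - \hat{\mu})(x_{il} - \hat{\mu})^T \right]. \]
Let $S_i^2 = \frac{1}{m - 1} \sum_{l=1}^{m} (x_{il} - \hat{\mu})(x_{il} - \hat{\mu})^T$ be the intra-Monte Carlo sample variance, then
\[ \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \left[ \delta_i\,(x_i - \hat{\mu})(x_i - \hat{\mu})^T + (1 - \delta_i)\,\frac{m - 1}{m}\,S_i^2 \right] \quad \text{(C.1.4)} \]
and with large $m$
\[ \hat{\Sigma} \approx \frac{1}{n} \sum_{i=1}^{n} \left[ \delta_i\,(x_i - \hat{\mu})(x_i - \hat{\mu})^T + (1 - \delta_i)\,S_i^2 \right]. \quad \text{(C.1.5)} \]

C.1.2 Example of variability in the point estimates and the associated confidence intervals for simulation 1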
With the first simulation study, a high degree of variability in the estimates was observed. This was not uniform over all the scenarios, but was frequent enough to warrant special comment. Figure C.1 is an example of the noise seen with the point estimates derived from the MCEM approach. Although many of the estimates remain within a well defined and reasonably tight boundary, some of the estimates fall outside this band. Enough of the estimates appear to be outliers that robust measures were used to see if there were substantial differences between the standard and the robust measures.

[Figure: point estimates with 95% confidence intervals (Louis method), plotted against index.]
Figure C.1: MCEM point estimates and Louis confidence intervals for case 4 using missing data mechanism A.
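The comparison of standard and robust measures mentioned above can be sketched as follows. The text does not state which robust measures were used, so the median and the scaled median absolute deviation (MAD) here are assumptions, as are the function names; the usual constant 1.4826 makes the MAD comparable to a standard deviation under normality.

```python
import statistics

def standard_summary(estimates):
    """Mean and sample standard deviation of a set of point estimates."""
    return statistics.mean(estimates), statistics.stdev(estimates)

def robust_summary(estimates):
    """Median and scaled MAD; both are resistant to a few extreme estimates."""
    centre = statistics.median(estimates)
    deviations = [abs(e - centre) for e in estimates]
    return centre, 1.4826 * statistics.median(deviations)
```

When a handful of estimates lie far outside the main band, the mean and standard deviation are pulled towards them while the median and MAD change little, so a large gap between the two summaries is a simple signal that outlying estimates are driving the standard measures.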
Title | Imperfect variables : the combined problem of missing data and mismeasured variables with application to generalized linear models |
Creator | Regier, Michael David |
Publisher | University of British Columbia |
Date | 2009 |
Date Issued | 2009-11-27T19:34:21Z |
Extent | 4407561 bytes |
Genre | Thesis/Dissertation |
Type | Text |
File Format | application/pdf |
Language | eng |
Collection | Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2009-11-27 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0068495 |
URI | http://hdl.handle.net/2429/15883 |
Degree | Doctor of Philosophy - PhD |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2009-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |