Bayesian Propensity Score Analysis for Observational Dat by Lawrence Cruikshank McCandless B.Sc, The University of British Columbia, 2000 M.Sc, The University of British Columbia, 2004 A THESIS SUBMITTED IN PARTIAL F U L F I L M E N T OF T H E REQUIREMENTS FOR T H E D E G R E E OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics) The University of British Columbia August 2007 © Lawrence Cruikshank McCandless 2007 Abstract Propensity scores analysis (PSA) involves regression adjustment for the estimated propensity scores, and the method can be used for estimating causal effects from observational data. However, confidence intervals for the treatment effect may be falsely precise because PSA ignores uncertainty in the estimated propensity scores. We propose Bayesian propensity score analysis (BPSA) for observational studies with a binary treatment, binary outcome and measured confounders. The method uses logistic regression models with the propensity score as a latent variable. The first regression models the relationship between the outcome, treatment and propensity score, while the second regression models the relationship between the propensity score and measured confounders. Markov chain Monte Carlo is used to study the posterior distribution of the exposure effect. We demonstrate BPSA in an observational study of the effect of statin therapy on all-cause mortality in patients discharged from Ontario hospitals following acute myocardial infarction. The results illustrate that BPSA and PSA may give different inferences despite the large sample size. We study performance using Monte Carlo simulations. Synthetic datasets are generated using competing models for the outcome variable and various fixed parameter values. The results indicate that if the outcome regression model is correctly specified, in the sense that the outcome risk within treatment groups is a smooth function of the propensity score, then BPSA permits more efficient estimation of the propensity scores compared to PSA. BPSA exploits prior information about the relationship be- ii tween the outcome variable and the propensity score. This information is ignored by PSA. Conversely, when the model for the outcome variable is misspecified, then BPSA generally performs worse than PSA. iii Table of Contents Abstract ii Table of Contents iv List of Tables vi List of Figures viii Acknowledgments x Dedication 1 xi Introduction 1.1 1 A motivating example: Estimating the effectiveness of statins in patients following acute myocardial infarction 2 Background: Control of confounding in observational studies -. 5 . . 9 2.1 Defining causal parameters 10 2.2 Estimating causal effects in randomized experiments and observational studies '. 13 2.3 Confounding bias 18 2.4 Propensity score analysis (PSA) for control of confounding 21 iv 3 Bayesian propensity score analysis (BPSA) 4 5 3.1 The method 30 3.2 Causal inference with BPSA 37 Analysis of the statin data 44 4.1 Conventional regression adjustment for confounding 45 4.2 PSA analysis of the statin data 45 4.3 BPSA analysis of the statin data 53 4.4 Decomposition of the posterior variance for @ and £ 60 Simulation studies of the performance of B P S A and PSA 5.1 5.2 63 Simulation study when the distribution of the outcome follows the propensity score model 6 29 64 Simulation study when the distribution of the outcome follows a conventional regression model for Y on X and C 76 5.3 Predictive performance in real and simulated data 88 5.4 A n investigation of covariate balance produced by B P S A versus PSA 98 Conclusion 112 6.1 Summary 112 6.2 Future Research 115 Bibliography 121 Appendices A Computing the limiting estimates for P S A v 127 List of Tables 1.1 Baseline characteristics of 4572 patients discharged alive from hospital following acute myocardial infarction 8 4.1 Log ORs for the association between Y and (X, C) 4.2 Point estimates for 7, the log ORs for the association between X and C. 47 4.3 Parameter estimates for the treatment effect [3, and the baseline risk 46 of Y within quintile groups ( £ 1 , £ , £ 3 , £ 4 , £ 5 ) , calculated from PSA . . 2 4.4 Means of measured covariates among treatment groups, within quintiles of the propensity score,' calculated from PSA 4.5 2 ; £ 3 , £ 4 , £ 5 ) , calculated from BPSA. 54 Means of measured covariates among treatment groups, within bins of the propensity score, calculated from BPSA 4.7 . ': . 62 Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #1 5.2 59 Posterior variances for [3 and £ from the analysis of the statin data using BPSA and PSA 5.1 52 Parameter estimates for the treatment effect /?, and.the baseline risk of Y within each of the five bins ( £ 1 , £ 4.6 51 70 Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #2 vi 71 5.3 Performance of point and interval estimators from B P S A and PSA, when data are simulated under Design #3 5.4 Performance of point and interval estimators from B P S A and PSA, when data are simulated under Design #4 5.5 84 Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #4 5.9 83 Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #3 5.8 82 Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #2 5.7 73 Performance of point and interval estimators from B P S A and PSA, when data are simulated under Design #1 5.6 72 The quantities 4>BPSA and <J>PSA 85 from five different random splittings of the statin data into build data (n = 4500) and test data (n = 4671). 93 5.10 The quantities 4>BPSA-, 4>TRUE and 4>PSA from ten simulated datasets generated under Designs #1, #2, #3 or #4 5.11 The quantities ^BPSA, 95 4>TRUE and 4>PSA from ten simulated datasets generated under Designs #1, #2, #3 or #4 using the conventional regression model of equation (5.1) 5.12 The quantities C^BPSA, 4>PSA 97 and 4>TRUE based on predictive models for treatment calculated from Datasets A or B vii 110 List of Figures 2.1 Directed acyclic graphs in which C is a confounder, X is treatment Y is the outcome, and U is an additional measured or unmeasured variable. 20 4.1 The distribution of propensity scores among the treated and the untreated, denoted f(z\x) for x = 0,1. The solid curve refers to the treated patients, while the dashed curve refers to the untreated patients. 49 4.2 Log odds of death as a function of x and z, denoted logit[P(y = l\x, z)]. The solid curve refers to the treated patients, while the dashed curve refers to the untreated patients. 50 4.3 BPSA sampler convergence for B and £ 55 4.4 BPSA sampler convergence for 7 56 5.1 Estimated risks of death in statin data for BPSA compared to PSA. (YBPSA 5.2 versus Y ) PSA •. 90 The density f(z\X = x) for x = 0,1 and logit[P(Y = l\x,z)] for ' £ = 0,1 and z e [0,1] for datasets A and B. Solid curves correspond to treated patients and dashed curves correspond to untreated patients. 103 5.3 Normal quantile plots for the z-statistics -&r, ohi\ ..., 0E11 in either 0E20 Dataset A or B. The symbols o, A and + correspond to regressing on ZPSA, ZBPSA or Z respectively 105 Vlll Histograms of (% - j ) for k = 0,1,...., 20 based on either BPSA of k PSA analyses of either Dataset A or B ix Acknowledgments I gratefully acknowledge the support of my thesis supervisory committee Paul Gustafson John Petkau, Rollin Brant and Peter Austin for their time and guidance. I am indebted to Paul Gustafson for his mentorship, insights and wisdom in helping me to direct of my thesis research and pursue an academic career. I especially thank John Petkau for his detailed readings of my thesis drafts, and also for his insight in navigating the job application process. Additionally, I would like to thank Peter Austin for his time and access to the E F F E C T data. My thesis research is supported by training awards from the Michael Smith Foundation for Health.Research. The E F F E C T study is funded by a Canadian Institutes of Health Research Team Grant in Cardiovascular Outcomes Research. Additional funding for this project came from an operating grant (Grant No. N A 5703) from the Heart and Stroke Foundation of Ontario. Lastly, I would like to thank my family, friends and fellow graduate students for their support during my time at U B C . x Dedication To Megan Chapter 1 Introduction The propensity score, defined as the probability of treatment given measured confounders, can be used as a tool to control confounding bias in observational studies [1, 2]. Patients with fixed propensity scores have identical distributions of measured confounders, irrespective of whether they are treated or not [1]. Thus when there are no unmeasured confounders, comparing patient outcomes between treatment groups conditional on the propensity score gives unconfounded estimates of causal treatment effects. Sampling patients with the same propensity score replicates a miniature randomized trial within a subset of the study population. To adjust for confounding, we can include the propensity score in a regression model for the outcome [3]. Because the functional form for the relationship between the outcome and the propensity score within treatment groups is usually unknown, we have a non-parametric regression problem. A common analytic strategy is to break the study sample into homogeneous groups, and then estimate the treatment effect within each group [2]. The groups are often constructed using quintiles of the empirical distribution of the propensity scores. Other adjustment techniques include treating the propensity score as a continuous covariate or fitting a regression using, cubic splines [2]. If matched pairs are available then conditional likelihood estimation is often used [2]. In observational studies, the propensity score is unknown and can be interpreted as missing data. A propensity score analysis (PSA) proceeds using a two step procedure. 1 First we estimate the propensity'score for each patient. Then we fit a regression model on the estimated propensity scores. In most applications, standard errors estimates are calculated from asymptotic approximation to the distribution of the maximum likelihood estimator based on the regression model of the outcome on treatment and estimated propensity scores [4]. This approach to interval estimation ignores uncertainty in the estimated propensity scores. Frequentist inferences are reported under hypothetical repeated sampling of patient outcomes conditional on the observed treatments and estimated propensity scores. If our objective is to report inferences while acknowledging uncertainty in the estimated propensity scores, then intuitively the conventional standard error calculations will yield interval estimates for the treatment effect which are falsely precise. PSA substitutes point estimates for parameters. We can draw parallels with stepwise model selection techniques used in regression modelling. A preliminary analysis selects one model among many competing models. The usual confidence intervals are calculated by conditioning on the selected model and they will be too narrow because they do not acknowledge model uncertainty [5]. There is little discussion in the literature about methods for incorporating uncertainty from the estimated propensity scores into uncertainty about the treatment effect. Some exceptions include [4, 6]. In contrast, there a large body of work describing the merits of using estimated propensity scores over true propensity scores [7]. Hahn and others [8-11] argue that treatment effect estimates calculated using the estimated propensity scores are generally more efficient than corresponding estimates calculated from true propensity scores. This has been demonstrated both analytically and in simulations in the context of propensity score weighting adjustments for confounding [8-11]. For matched sampling, Rubin and Thomas [12, 13] demonstrate that matching on the estimated propensity scores generally yields greater similarity in 2 empirical distribution of covariates for treated versus control subjects. Similar findings are reported for subclassification on quintiles of the estimated propensity score versus the true propensity score [3]. In this thesis, we develop a Bayesian approach to modelling uncertainty in the estimated propensity scores in observational studies with a dichotomous treatment, dichotomous outcome and a vector of measured confounders. We propose a Bayesian propensity score analysis (BPSA) which models the joint distribution of the data and parameters with the patient propensity scores as latent variables. Markov chain Monte Carlo (MCMC) is used to sample from the posterior distribution of model parameters. This estimation strategy yields interval estimates for the treatment effect which incorporate uncertainty in the estimated propensity scores. We show that BPSA has certain advantages compared to PSA. The method uses patient outcome data in order to calculate propensity score estimates. Instead of using a two step procedure which first estimates propensity scores and then estimates treatment effects, BPSA estimates both quantities simultaneously. The M C M C algorithm iteratively imputes the propensity scores and estimates the treatment effect. When imputing the propensity scores, posterior information about the outcome distribution flows through the algorithm in order to inform inferences about the propensity scores. Thus BPSA incorporates prior information about the relationship between the outcome and propensity score within treatment groups. This information is ignored by PSA when estimating propensity scores. With the exception of two articles [14, 15], there appears to be little research combining propensity score methods and Bayesian statistics. One possible reason is because PSA does not use conventional models for the risk of the outcome variable. The usual approach to adjust for confounding is to work with an assumed model for the risk of the outcome given treatment and measured confounders. Instead PSA 3 uses a model for the outcome given treatment and the propensity score. But the propensity score is not a typical covariate. It is a characteristic of the manner in which treatment and covariates were sampled. While the PSA method produces valid treatment effect estimates from a frequentist standpoint, it makes less sense within the Bayesian paradigm. How one chooses to design an experiment (i.e. the true values of propensity scores) should convey no prior information about the risk of the outcome in a given study subject. Robins and Ritov argue the propensity scores should not appear in the likelihood for the distribution of the outcome variable and are irrelevant in a Bayesian analysis [15]. We elaborate on this in more detail in Section 3.2. This thesis is organized as follows. In Section 1.1, we describe a case-study to motivate the methodology. We describe an observational study of the effectiveness of statin therapy in Ontario residents following myocardial infarction. In Chapter 2, we review the topic of confounding bias in observational studies, and analytic adjustments for confounding bias using propensity scores. Only observational studies with a binary treatment, binary response and a vector of measured covariates are considered. In Chapter 3, we describe the BPSA method which models the propensity score as a latent variable. We demonstrate that BPSA uses patient outcomes to estimate propensity scores by incorporating prior information about the relationship between the outcome and propensity scores. In Section 3.2, we discuss using BPSA as a tool for causal inference. We argue that while the BPSA model is not a realistic representation of the true data generating process, the method can nonetheless be used to construct Bayesian estimators with good frequentist properties. In Chapter 4 we apply our method to the statin data and compare the results to PSA. We illustrate that contrary to intuition, BPSA and PSA may give very different results even in large samples. In Chapter 5 we use simulations to study the frequentist performance 4 of point estimates, interval estimates and prediction error when applied to both the statin data and simulated data. The results demonstrate that if the outcome regression model is correctly specified, in the sense that the outcome risk within treatment groups is a smooth function of the propensity score, than BPSA permits more efficient estimation of the propensity scores compared to PSA. Conversely, when the model for the outcome variable is misspecified, then simulations show that BPSA will generally perform poorly relative to PSA. 1.1 A motivating example: Estimating the effectiveness of statins in patients following acute myocardial infarction. To motivate the propensity score methodology, we begin by considering the example of an observational study estimating the effectiveness of statin therapy in patients discharged alive from hospital following acute myocardial infarction. Statins are a class of lipid lowering medications which are commonly prescribed to persons with multiple risk factors for cardiovascular disease. Several large studies in the United States and elsewhere have demonstrated that statin therapy reduces morbidity and mortality in patients following myocardial infarction [16-18]. Nonetheless, the magnitude of the effectiveness of statins in large populations is less well understood. Randomized trials typically do not provide estimates of population level effects because they exclude vulnerable populations from study, such as the elderly and very sick. Observational studies indicate that statins may be highly effective in these vulnerable groups [19]. Detailed clinical data were obtained from a random sample of 4572 patients discharged alive from Ontario hospitals with myocardial infarction between April 1, 5 1999 and March 31, 2000. The data were collected in conjunction with the Enhanced Feedback for Effective Cardiac Treatment (EFFECT) study, an ongoing initiative to improve the quality of health services for Ontario residents with cardiovascular disease [2]. For each patient, medical charts were abstracted to obtain information on demographic characteristics, cardiac risk factors, comorbid conditions, vascular history, vital signs at hospital admission, and laboratory tests. Data on prescriptions for statins were also collected. More details about the dataset are given by Austin and Mamdani [2]. A complete list of the measured covariates is given in Table 1.1. To estimate the effect of statin therapy on mortality, patients were classified as statin users if they were prescribed a statin at hospital discharge, and they were classified as statin non-users otherwise. Death within three years of hospital discharge was established by linking patient records to the Ontario Registered Persons Database. Before comparing mortality rates among treated and untreated patients, we describe the study sample. Table 1.1 presents summary statistics for the baseline characteristics of treated and untreated patients. We use t-tests and chi-squared tests to compare the covariate distributions among treated and untreated patients for continuous and dichotomous variables respectively. The results indicate that the treatment groups differ systematically with respect to risk factors for mortality. Patients who were prescribed a statin at hospital discharge were typically younger and healthier than patients without a prescription [2]. These, results-are consistent with previous studies of physician prescribing habits for cardiovascular disease medications [19, 20]. Physicians tend to avoid giving statins to the elderly or to patients with multiple comorbidities, even though they are often indicated in such populations [19]. The results in Table 1.1 indicate that a crude comparison of mortality among treated and untreated patients may be biased due to confounding. Randomized trials demonstrate that statin therapy reduces mortality in patients following acute myocar6 dial infarction [16]. Consequently, we expect the protective effect of statin therapy on mortality to be exaggerated in a crude comparison of mortality rates. Treated patients have lower mortality because of the benefits of statins and because they are more healthy. The crude odds ratio for the association between statin therapy and mortality can be calculated from the 2 x 2 table: Died Survived Treated 193 1161 1354 Untreated 800 2418 3218 993 3579 4572 and is equal 0.50 with 95% confidence interval (0.42, 0.60). This estimate is far lower than previous estimates from randomized controlled trials or observational studies [16-18]. In light of prior information about physician prescribing habits of statins [19, 20], these data provide evidence that association between statin therapy and mortality is confounded. Analytic adjustments are required in order to account for the unequal distribution of mortality risk factors between treatment groups. 7 Table 1.1: Baseline characteristics of 4572 patients discharged alive from hospital following acute myocardial infarction. Statin prescribed Statin not prescribed (n= 1354) (n==3218) Number (%) or Mean ± SD Characteristic Demographic characteristics Age (mean) Female sex 63 398 ±12** (29)* 68 1201 < 5 (< 1) 24 525 459 122 548 459 794 (39)** (26) (9) (48)* (34) (59)** 973 1060 312 1386 1060 604 504 10 (37)** (1) 999 13 (31)** . (< 1) 149 85 81 20 ±31 ±18 ±23* ±5** 148 84 84 21 ±32 ±18 ±23* ±6** 10 141 139 9 101 ± ± ± ± ± 10 137 139 9 104 ± ± ± ± ± ±14** • (37)* ' Presenting characteristics'* Shock (1) AMI risk factors + Family history of C A D Diabetes CVA/TIA High B P Current smoker Hyperlipdaemia (30)** (26) (10) (43)* (33) (19)** Comorbidities'* Angina Renal disease Vital signs on admission^ Systolic B P Diastolic B P Heart rate Respiratory rate Laboratory values^ White blood count Haemoglobin Sodium Glucose Creatinine 5 17 ** 3* 6 54 * p < 0.05, ** p < 0.001 f Continous variables, + Dichotomous variables 8 5 19** 4* 5 60 Chapter 2 Background: Control of confounding in observational studies The concept of confounding is ubiquitous in epidemiology and observational research. See Rothman and Greenland [21], Pearl [22] and Greenland, Rothman and Pearl [23] for reviews. Confounding bias is the problem of mixing of the effects of the putative treatment of interest with that of extraneous outcome risk factors. For example, having yellow fingers does not affect risk of lung cancer. But yellow fingers and lung cancer will tend to be associated because yellow fingers are associated with smoking which also causes lung cancer. Conceptualizing confounding has been controversial. The definition of confounding varies from one reference to another and is an area of ongoing research [24]. In this chapter we review methods for control of confounding bias with emphasis on propensity score methods. This provides the setting for discussing our proposed Bayesian propensity score analysis. Because confounding implies biased estimation, we begin by reviewing the framework for understanding the competing targets of inference in the analysis of observational data. In Section 2.1, we review definitions of causal parameters. In Section 2.2, we discuss estimation of causal parameters in randomized experiments and observational studies. In Section 2.3, we define con9 founding bias and confounding variables. In Section 2.4, we review stratification on the propensity score for reducing confounding bias. We restrict our discussion to the setting of a binary treatment, binary response and a vector of measured covariates, where the log odds ratio (OR) is the measure of effect. 2.1 Denning causal parameters Suppose that our objective is to estimate the causal effect of applying a dichotomous treatment on the risk of a dichotomous outcome. To define causation, a popular approach is to use potential outcome models [25]. The idea is to model not only the data that is observed in an investigation, but also the data that would be observed under hypothetical treatment interventions that are not observed. To motivate the approach, we paraphrase an example from Pearl [22]. Imagine a barometer which records air pressure measurements dichotomously as B = High or Low each morning on a sequence of days. Additionally, we record the weather each afternoon as either W — Rain or No Rain. Suppose that our objective is to measure the causal effect of physically intervening to set the barometer to B — Low each morning on the frequency of rain each afternoon. Furthermore, suppose that this is an observational study in the sense that we cannot actually manipulate the values of the barometer. To define "causal effect", we model the distribution of W given B, and also the hypothetical distribution of W given the unobserved value of B. The causal effect of B on W is defined based on differences between these two distributions. To define the causal effect of a treatment on an outcome in human populations, we model the distribution of outcomes among patients who have been sampled from a large hypothetical population in which either all patients received the treatment, 10 or all patients did not receive the treatment. Mathematically, let denote a di- chotomous random variable which models the outcome of a patient had they been sampled from the treated population. The variable Y^y takes the value one if the patient has the outcome and zero otherwise. Similarly, let Y{ y denote the outcome for 0 the same patient had they been sampled from the untreated population. The sampled pair of random variables (Y^y,Y^ y) are potential outcomes (sometimes called 0 counterfactuals). Let X model the treatment received by the patient, taking value one if the patient was treated and zero otherwise. Let C denote a p x 1 vector of patient characteristics such as age and gender. Data for each patient is modeled by the collection of random variables (V{i}, Y^,X, C). Modelling both Y{\} and Y[oy may seem strange because we know the patient received treatment X and not 1 — X. But modelling outcomes under hypothetical treatment interventions allows us to define causation. Defining causal effects based on the distribution of observed data is problematic because it leads to ambiguity about the difference between association and causation. In the barometer example, were we to investigate the effect of B on W in an observational study of the joint distribution of B and W', then we might erroneously conclude that changing B does cause a change in the weather because B and W are dependent. But this dependence does not have a causal interpretation. Changes in air pressure affect both B and W. Suppose instead that we model the two distributions of W given that we intervene in the experiment by setting B to either Low or High each day. While we could never observe both distributions at the same time, common sense tells us they are identical. Thus B does not cause W. The key to defining causation statistically is to model not only the distribution of the data that we observe, but also the distribution of data that we might have observed under hypothetical interventions. Whether or not we can make inferences about these distributions from the data is a separate issue. 11 We denote causal parameters from standard definitions [22, 26]. Let f(y{i},y{o},x\c) = / ( x | y { 1 } ,y { 0 } (2.1) ,0/(y{i},y{o}|c) c model the joint probability density function of a sampled unit (Y{i},Y{ },X) given C. 0 Uppercase and lowercase letters denote random variables and realizations of random variables respectively. Define the causal log OR conditional on C as •P{Y = 1\C)/(1-P(Y {1} p(y = 1\C)Y {1} {0} = i|c)/(i-p(y { 0 } = i|c)). • Let P{Y{ ] = 1) — JP(Y{x} = l|c)/(c)dc for x = 0,1 where /(c) is the probability X density function of C. The quantity P(Y{ } — 1) can be interpreted as the standardX ized risk given treatment X = x, averaged with respect to the marginal density P(C). Define the average causal log OR as Plva = lOg = i)/(i - p ( y -P(Y {1} p(y {0} { 1 } = i))- = i ) / ( i - P ( y } = i)). ' { 0 The parameter ft* describes the causal effect of X on Y within the subgroup of the population with C = c. In contrast, /?* describes the causal effect assuming that all patients either have or have not been treated. In general /?* and J (3*f(c)dc are not equal. This is true even when f3 does c not depend on C, meaning that there is no effect modification [27]. This is because the log OR is a non-linear function of risks. The average of the log ORs conditional on C will typically not equal the log OR calculated from average risks. This property of ORs is called non-collapsibility by epidemiologists [23] and is related to the challenges of characterizing odds ratios in marginal and conditional models in longitudinal data analysis. In contrast, the risk difference is always collapsible, while 12 the risk ratio is collapsible when there is no effect modification [23]. We emphasize this point because when using the log OR as a measure of effect, we obtain different parameters depending on whether we condition on C. For example, as discussed in Section 2.4, conditioning on the propensity score for reducing confounding bias yields a third causal quantity which may differ from either f3 avg or B . Each parameter is a c valid causal quantity, but the fact that they differ from one another could lead one to erroneously conclude that a method is biased. This has been typically overlooked in the literature, particularly with respect to propensity score methods. Recent work on the relationship between causal quantities is investigated by [28-31]. 2.2 Estimating causal effects in randomized experiments and observational studies For each sampled unit, only one of Yji} or Y{o} is observed while the other is missing. Formalizing, let Y denote the observed outcome for a sampled patient, where Y=l Y if X = 1 Y if X = 0. {1} {0} Or more simply, define Y as Y — Y{x}- Therefore, if X = 1 then we observe Y = Y{i} and the potential outcome Y{ y is missing. If X = 0 then Y = F{o} and Y{i} is missing. 0 In observational studies, we observe (Y,X,C) for each subject. Following [26], define the associational parameter, ft = log P(Y = 1\X = 1, c)/(l -P(Y = 1\X = 1, c)) [P(Y = 1|X = 0, c)/(l -P{Y = l\X = 0, c)) 13 Letting \x = f P(Y = l\x, c)f(c)dc for x = 0,1, we define x A™, = log Mo/(l -Mo))J The parameter [3 is just the usual conditional association between Y and X given C, C and might be estimated from logistic regression of Y on X and C. The parameter B is a log OR calculated from standardized risks. The important issue in causal inference is to determine the relationship between B and 8*, and between [3 c avg and 8* . In other words, to determine when association equals causation. For identification of causal parameters, it is usually assumed that P(X = l\y y c) {lh = P(X = l\c) {oh or equivalently, that v (r ,y {1} {0} )jix|c, meaning that (Y/i}, Y{ }) are conditionally independent of X given C. This assump0 tion appears under different names in the causal inference literature, and we call-it the assumption of no unmeasured confounding. The assumption means that, within levels of C , the treatment variable X is not associated with the observed or unobserved potential outcomes. When there is no unmeasured confounding, the associational parameters are equal 14 to causal parameters, meaning that B =' P* and B c = P* . This is because avg P(Y = l\X = l,c) vg = P(Y = 1\X = 1, c) = P(Y m = l\c) P(Y = 1\X = 0,c) = P(Y {0} = l\X = 0,c) = P(Y {1} {0} = l\c). In each equation, the first equality follows because Y = Ypc}) while the second equality follows because (Y{iy,Y{ y) _LL X\C. Therefore P(Y = l\x,c) = P(Y{ y = 0 x l|c) and association equals causation. We have Pc = log - log P(Y = 1\X = 1, c)/(l -P(Y = 1\X = 1, c)) P(Y = 1\X = 0, c)/(l - P(Y = 1|X = 0, c)) p(y p(y { 1 } { 0 } = i|c)/(i-p(y = i|c)/(i-p(y { 1 } { 0 } = i| ))= i|c)). . c = P:. Similarly \i x = J P{Y = l\x,c)f{c)dc = J P(Y \c)f(c)dc {x} = P(Y {1} = 1) which means that avg log = log - Ml) W(l-^o))J P(y = !)/(! - P ( Y { 1 } p(y o = i ) / ( i - p ( y { } { 1 } = 1)) { 0 } = i))J = A To estimate causal effects, it suffices to estimate features of the distribution of P ( Y = l|x,c) using standard statistical techniques, and we can dispense with the potential framework. 15 The value of potential outcome models is not so much statistical as it is conceptual. In simple settings, the same methods of analysis are used regardless of whether or not we use associational or causal models. But causal modelling gives a clear definition of causal effects, and it characterizes the assumption that is needed to distinguish association from causation. Whether or not the assumption of no unmeasured confounders is valid becomes a separate question. In randomized controlled trials with perfect compliance, blinding and, no loss to follow up, this assumption of no unmeasured confounders is valid because the distribution P(X = l|c) is specified by the investigator. For example, we may choose P(X = l|c) = 0.5 for all subjects. Then we have P(X = Z/{o} c) = P[X = l|c) automatically. In ob5 servational studies, there can be no guarantee that the assumption is correct. Prior information is used to select a collection of covariates such that we have approximately P{X — l|y{i},y{0}»c)' ~ P(X = l|c). In other words, causal inference in observational studies necessitates efforts to classify patients into strata such that the treatment is assigned approximately at random. Causal inference using potential outcome models has also been developed in connection with models for missing data [25]. For each patient, we observe Y{x) while Y{i-x} is missing. The issue is to characterize the pattern of missing data, or rather, the way in which treatment is assigned to each patient. In the factorization of equation (2.1), the treatment assignment mechanism is modelled by P(X = l|y{i}, V{o}, c). If X = 1 then F{o} is missing, whereas if X = 0 then Y{i} is missing. When there is no unmeasured confounding, the missing data mechanism does not depend on the missing potential outcome Y{\-x}- For Bayesian inference, valid estimation of the causal effects defined in Section 2.1 can proceed by modelling only the observed patient outcomes Y given X and C, that is P(Y = l\x,c), without modelling the treatment assignment mechanism P(X = l|c). 16 We elaborate in more detail about Bayesian inference for causal effects because it is relevant to the discussion about Bayesian propensity score analysis in Chapter 3. Following Rubin [25], relabel the potential outcomes as Y{ } and Kfi-x}, where X Y{x} = Y is the observed outcome while ^ { I - X } is missing. Suppose we model the joint probability density function of Y{ y, and X given C parametrically as X f(y{x},y{i-x},x\c, 9,7) = f(x\y{x},y{i-x},c, 7 ) / ( y {x} , y i _ | c , 9). { x} The parameter 9 indexes the parametric model for the potential outcomes and it includes the causal parameter of interest. The quantity 7 parametrizes the missing data mechanism. Suppose that there are no unmeasured confounders, meaning that (Y ,Y{ ) AL X\C. Then we have f(x\y ,y _ ,c,j) {1} Q} {x} {1 x} = f(x\c,j). The probability density function of the observed quantities Y, X and C is given by f(y,x\c,9,-y) = f(y ,x\c,9,-y) {x} = J = J = f(x\c,l) J f(y{x},y{i-x}\c,9)dy i_ . f(,y{x},y{i-x},x\c,9,'y)dy _ {1 x} f{x\cn)P(y{x},y{i-x}\c,9)dy _ {1 { x} x} (2.2) Let f{9,j) = f{9)f(j) denote a prior density in which 9 and 7 are independent. Then the missing data mechanism is called ignorable [25]. Given y, x and c, Bayesian inference for 9 proceeds from the posterior distribution for 9 which obeys 17 the proportionality f{6\y,x,c) cx f(y,x\c,0, )f(9)f( ) 1 f(x\c,j) J oc « J 1 f(y{ },y i- \c,0)dy x { x} { 1-x} (2.3) f{y{x},y{i-x}\c,O)dy - f(0) {1 x} The parameter 7 does not appear in the posterior distribution. Hence for Bayesian inferences about 9, the treatment assignment mechanism f(x\c, 7) conveys no information about 9 and can be ignored when developing a model for the data. 2.3 Confounding bias The literature on confounding bias distinguishes between the notions of confounding and a confounder [24, 26]. We say that there is confounding if (Y Y ) {lh 4LX\C. {0} When ( Y , Y ) _)£. X\C, then this implies that P{Y {1} {0} {X} = l|c) ^ P{Y } = l\x, c) {X P(Y = l\x, c) for X = 0 or 1. Therefore, PI = log ± log p(y { 1 } = i| )/(i-p(y p(y { 0 } = i | c ) / ( i - p ( y } = i|c))J c { 1 } = i| ))c {0 p ( y = i [ x = i , ) / ( i - p ( y = \\x = i,c)) [ p ( y = i | x = 0, )/(i - P ( y = \\x = 0, ))J c c C = Pc Thus p* P , and the associational parameter P does not have a causal interpretac tion. Similarly, we have P* c vg ^ p . avg 18 This definition of confounding is appealing because it avoids reference to measures of effect, and it helps distinguish confounding from the unrelated notion of non-collapsibility described in Section 2.1. Lack of a clear separation between confounding and non-collapsibility when inferring causation created some controversy in early attempts to characterized confounding. For example, when analyzing data from randomized trials, conditioning on outcome risk factors in a stratified analysis will drive the log OR parameter away from zero when there is a treatment effect [23, 32]. But this change in the log OR is unrelated to confounding. See [22, 24] for discussion. Recent definitions of confounders, meaning variables which are responsible for confounding, assume that the causal relationship between variables in a population can be.modelled using directed acyclic graphs (DAGs). While we have discussed potential outcome models for causation, DAGs offer a different strategy that can be used for qualitative assessment of bias. For fixed time point treatments, such as in the case of the statin data example of Section 1.2, the approaches can be shown to be equivalent in the sense that both types of models give the same mathematical description of the same quantities [22]. A D A G consists of a set of variables connected by arrows in which no directed paths form loop's. A n arrow models direct causal effects of the parent variable on the child variable. A scalar variable C is defined as a confounder for the effect of X on Y if C connects to X and Y by forward pointing arrows. For example, Figure 2.1 presents examples of DAGs in which G is a confounding variable. Therefore, C is a confounder if it causes both X and Y. If C is smoking, Y is lung cancer, and X is yellow fingers, and their causal relationship can be modelled by any of the diagrams in Figure 2.1, then smoking is a confounder for the effect of yellow fingers on lung cancer. The definition of a confounder appearing in modern epidemiological textbooks 19 Figure 2.1: Directed acyclic graphs in which C is a confounder, X is treatment Y the outcome, and U is an additional measured or unmeasured variable. c X c —» u Y X Y U t c u <—c l v X Y j X 20 \ Y says that C is a confounder for the effect of X on Y under two conditions. Condition 1: The variable C must be associated with both X and Y. More technically, we require that Y AL C\X and X AL C\Y. Condition 2: The causal interpretation of dependencies between Y, X and C is restricted to require that C is not affected by X or Y. This definition and the one based on DAGs are usually equivalent [24]. But using DAGs to define a confounder formalizes and generalizes the intent of Condition 2 to more complex observational data (see [24] for details). To infer causation from observational data, we must identify a set of covariates C such that we have approximately (F{i},Y{o}) AL X\C. DAGs can be helpful in this process. Instead of focusing on whether individual covariates meet the traditional definition of being confounders, the investigator can instead attempt to elicit a D A G model for all relevant measured and unmeasured factors. Such a diagram may be elaborate, but simple criteria have been developed based on paths between the treatment and outcome variable that allow the investigator to verify if ^{o}) AL X\C [22, 26]. 2.4 Propensity score analysis (PSA) for control of confounding When there is no unmeasured confounding, the standard approach to estimating causal effects is to estimate P(Y = l\x,c) over levels of C and then calculate an estimate of f3 , the log OR conditional on C. We call this stratification on measured c confounders. When strata are sparse, model-based estimation is often used. We assume a parametric model for P(Y = l|a;,c), such as a logistic regression model of Y on X and C , and proceed by maximum likelihood estimation. A difficulty is 21 that this strategy requires accurate specification of P(Y = l\x,c). This may be difficult if C is high-dimensional with continuous components. In the statin data example of Section 1.1, we have rich patient information on quantities such as vital signs at hospital admission and laboratory values. But the functional form of the dependence between these variables and mortality is poorly understood. If the model for P(Y = l|a;,c) is misspecified, then estimates of B will be asymptotically biased. c An alternative technique to control confounding is to use propensity scores. The propensity score is defined as the probability that a subject is treated given measured confounders, or mathematically as the quantity Z = P(X = l|c). The propensity score can be used as a.tool to ensure that treatment-groups have similar distributions of measured confounders. Rosenbaum and Rubin [1], Theorem 1, showed that patients with fixed propensity score Z have the same distribution for C irrespective of X, or more technically that C 1L X\Z. This is because P(X = l|c, z) = z, which does not depend on C. Further Rosenbaum and Rubin [1], Theorem 3, showed that if there is no unmeasured confounding conditional on C, then this implies that there is no unmeasured confounding conditional on Z. Or rather, that (Y{iy,Y{ y) _LL X\Z. The 0 reason is because Ju z /(y{i},y{o}l )2 The integration is over the subsets of the support of C given by the set U — (c|/(c) = z z} for 0 < z < 1. The set U corresponds to the subset of the population who have z 22 propensity score z. The first line follows because there is no unmeasured confounding given C , while the second line follows because C AL X\Z. Thus P{X{\}-,Y{Q}\x,z) = P(Y ,Y \z) {1} and we have {Y ,Y ) {0} {i} AL X\Z. {0} Rosenbaum and Rubin used the result (Y{i}, ^{o}) X\Z to argue that stratifying on the propensity score eliminates confounding bias due to C. The association between X and Y within patients with the same propensity score has a causal interpretation. Since (Yji}, Y{ }) -II X\Z, we have P(Y = l\x,z) = P{Y{x} = because 0 P(Y = 1\X = l,z) = P(Y [l} P(Y = 1\X = 0, z) = P(Y {0} = 1\X = = P(Y {1} = l\X = 0,z) = P(Y {0} = l\z) = l\z). The associational log OR conditional on Z, written 0 = log P(Y = 1\X = l,z)/(l -P(Y = 1\X = 1, z)) [P(Y = 1|X = 0,z)/(l - P(Y = 1\X = 0,z))\ has a causal interpretation because it is equal to the causal log O R conditional on Z P* - log P(Y = l | ^ ) / ( l - P ( Y = l|z)) X 1 P{Y = \\z)/{\-P(Y Q Q = \\z)) Patients with fixed propensity scores have equal probability of treatment, irrespective of their characteristics. Therefore, comparing patient outcomes between treatment groups conditional on the propensity score removes confounding. The parameter B will generally differ from either P avg or B , the causal parameter defined in Section 2.2, c because of non-linearity of the log OR [28, 29]. But B can nonetheless be interpreted as a valid causal quantity [30, 31]. Point estimation of ft using propensity score analysis (PSA) 23 ( In observational studies, the propensity score is unknown and can be interpreted as missing data. A propensity score analysis (PSA) proceeds using a two step procedure. First the propensity scores are estimated for each patient. Then we fit a regression model on the estimated propensity scores. In most applications, standard error estimates are calculated from asymptotic approximation to the distribution of the maximum likelihood estimator based on the regression model of the outcome on treatment and estimated propensity scores [4]. Thus interval estimates ignore uncertainty in the estimated propensity scores. To illustrate in greater detail, suppose that we have an observational study with sample size n. To define the data, let yi and xi for i — 1,..., n, denote the observed values'of outcome Y and treatment X, and let y = ( y i , . . . , y ) and x = (x\,... ,x ). n n Let Cj for i = 1,... ,n, denote values of the p x 1 vector C of confounding variables, and let c denote an n x p design matrix where each row is given by Cj. To ease notation in modelling regression intercept terms, we assume that the first component of C is identically equal to one, or rather, that the matrix c contains a column of ones. The vector of propensity scores z = (zi,... ,z ), where Zj = P(X = 1|cj), is n unknown. To estimate /?, we first estimate z, typically via logistic regression of x on c using the model logit[P(X = l|c)] = Y c (2.4) The fitted values of the regression provide point estimates of z denoted z. -Next, the patients are stratified on the estimated propensity scores, and we estimate characteristics of P(Y = l\x, z) such as B. Exact stratification on the propensity score is often not possible because the data are sparse. Instead, we can propose a regression model 24 for P(Y = l\x, z), such as • - logit[P(Y = l\x, z)} = Bx + g(z)'f, • (2.5) and then use maximum likelihood estimation with z = z to calculate an estimate for B. The quantity g(Z) is a k x 1 covariate vector which is a known function of the propensity score and models the relationship between the propensity score and Y. A popular choice for g(Z) is based on five equal sized quintile groups. Then g(z) is a 5 x 1 vector in which the latter four components are "dummy" variables indicating quintile group membership, [1, 0, 0, 0, 0] if z < ci [1, 1, 0, 0, 0] if c < z < c g(z)' = { [l, 0, 1, 0, 0] if c < z < c x 2 2 [1,0,0,1,0] ifc <z<c 3 [ [1, 0, 0, 0, 1] 3 (2.6) 4 if c < z. 4 The parameter vector £ = (^, £ , £3, £4, £5) governs the risk of Y within each group. 2 The quantities C i , c , 0 3 , 0 4 which define the quintile groups are specified in advance 2 based on the empirical distribution of the estimated propensity scores zi,...,z . n When subclassifying on propensity score bins, we can also use the Mantel Haenszel method [4]. Alternatively we could model g(z)'£ using cubic splines 3 P + J2ZJ+P{Z ~ 9(z)'$ = 6 + 2^ j=i 3=1 where (u)+.= u I{u > 0) with knots C i , c , . . . c , or as a simple linear predictor p 2 9( )'t, z = p £0+6-2 [2, 33]. If matched pairs are available with matching on the estimated 25 propensity score, then we can use conditional likelihood estimation which does not assume an explicit functional form for g(z)'£. PSA is similar to conventional regression of Y on X and C, except that it substitutes a high dimensional covariate C with the scalar Z. Each of the functional forms for P(Y = l\x,z) assumes that there is no effect modification due to the propensity score. But this may be unrealistic in some settings and the method can be extended appropriately [31]. Interval estimation of (5 using PSA Interval estimates for j3 are typically calculated by estimating the asymptotic variance of B, denoted Var{/3}. We report J3 - 1.96y/Var{j3}, 0 + 1.96^Var0} as a 95% confidence interval for B. The quantity Var{(3} is calculated from the observed information matrix based on the parametric model for P(Y = 1 |x, z). When P(Y = l\x,z) follows equation (2.4), this is accomplished by evaluating the matrix W0 tT 2 Q2 • g ll+exp(/?x + ^ i ) 0 ) i n Y^ViiPxi + gWO - log(l + exp(/3a; + p(^)'O) i at the maximum likelihood estimates for B and £. This expression is calculated from z rather than z, and the standard error estimate of B does not acknowledge 26 the uncertainty in the estimated propensity scores. Thus we might expect that the interval estimates will be too narrow and not have nominal coverage probability! There is little discussion in the literature about methods for incorporating uncertainty from the estimated propensity scores into uncertainty about the treatment effect. Some exceptions include [4, 6]. In contrast, there a large body of work describing the merits of using estimated propensity scores over true propensity scores [7]. Hahn and others [8-11] argue that treatment effect estimates calculated from the estimated propensity scores are generally more efficient than corresponding estimates calculated from true propensity scores. This has been demonstrated both analytically and in simulations in the context of propensity score weighting adjustments for confounding [8-11]. A n accessible discussion is given by Hirano, Imbens and Ridder [11] for the case of a single dichotomous confounder. For matched sampling, Rubin and Thomas [12, 13] show that matching on the estimated propensity scores yields greater similarity in empirical distributions of covariates for treated versus control subjects relative to matching on true propensity scores. A similar argument is given by Rosenbaum and Rubin [3] based on the analysis of a dataset using PSA stratifying on quintiles of the estimated propensity scores. We return to this topic in more detail in Section 5.4. Nonetheless, it is unclear whether or not ignoring uncertainty in the estimated propensity scores is harmful to interval estimation for PSA. Many of the arguments in favor of using estimated propensity scores rely on large-sample theory. Interval estimates ignoring the estimated propensity scores may not have nominal coverage in finite samples. The arguments also tend to focus on propensity score weighting methods rather than PSA. The literature investigating interval estimation for PSA is fairly sparse and usually considers continuous outcomes (e.g. [3, 4, 34]). For dichotomous outcomes where the log OR is the measure of effect, the non-linearity 27 of the logit link causes ambiguity about the targets of inference. Thus it becomes difficult to make sense of notions like coverage probability because it depends on what the analyst is trying to estimate. For example, it is sometimes assumed PSA should be used to estimate either j3 avg or [3 , defined in Section 2.2, even though it C is asymptotically biased for both quantities [34, 35]. A detailed discussion of the targets of inference for PSA is given in [28, 29]. We discuss this topic in more detailin Section 5.2. 28 Chapter 3 Bayesian propensity score analysis In this chapter we present a Bayesian propensity score analysis (BPSA). The method is similar to PSA in the sense that it uses the same models for the data. But inferences are carried out using Bayes theorem rather than maximum likelihood. Because the propensity score for each subject is unknown, it is modelled as a latent covariate which is integrated out of the posterior distribution of model parameters. Unlike PSA which uses a two step procedure to first estimate the propensity scores and then estimate the treatment effect, BPSA estimates both quantities simultaneously. This can offer unique advantages to BPSA. We observe an "information flow" in which outcome data is used for efficient estimation of the propensity scores. In Section 3.1 we describe the BPSA method including the model, prior distributions and a strategy for posterior simulation. We show that during M C M C , the conditional distribution for the propensity scores depends on the distribution of patient outcomes. The BPSA method incorporates prior information about smoothness assumptions between the response distribution and propensity score, within treatment groups. This information is not used by PSA for estimation of the propensity scores. While this seems like a reasonable estimation strategy, it is a break with tradition [7, 36]. For example, Rubin writes [36]: 29 "Of substantial importance, the propensity score approach to causal inference ... focuses on the theme that the design of an observational study should parallel the design of a randomized experiment. That is, our propensity score approach is accomplished without any access to outcome data." These issues are explored further in Section 3.2, where we discuss using BPSA as a tool for causal inference. 3.1 The method Model Factorize the density of Y, X, and C as P(Y = y, X = x\c) = P(Y = y\x, c)P(X = x\c). We use models which are identical to the PSA model given in equations (2.4) and (2.5). We let logit[P(Y = l|o:,c)] = ' logit[P(X = l|c)] = /3x + 0(z(c, ))'£ (3.1) 'c, (3.2) 7 7 where 2(0,7) = expit^'c) and expit(a) = (1 + exp(—a)) . Here we write z = 2(0,7) -1 to acknowledge that the propensity score is a known analytic function of C and 7. Equation (3.1) is a logistic regression model for the risk of Y with treatment effect 30 B which does not interact with C or Z . The parameter B is the primary quantity of interest and models the causal effect of X on Y given Z. The quantity g(Z)'£ is a linear predictor relating the propensity score Z = z(c, 7) to the risk of Y via the parameters £. In this thesis, we restrict our attention the case where g(Z) is a 5x1 vector of indicator variables ' [1, 0, 0, 0, 0] if Z < ci [1,1,0,0,0] ifc!<Z<c 2 g(Z)' = <J [l, 0, 1, 0, 0] if c < Z < c 3 [1, 0, 0, 1, 0] if c < Z < c 4 2 3 [ [1, 0, 0, 0, 1] (3.3) if c < Z 4 The quantities C i , c , C 3 , c must be specified a priori. They define five separate 2 4 "bins" in which the risk of Y given X is assumed to be constant. The components of £ = (£i>£2,£3,£4,£5) model the risk of Y within the bins. To choose C i , c , c , c , 2 3 4 we use an approach motivated by PSA. We fit the logistic regression of X on C via maximum likelihood and then use the fitted values to obtain initial estimates of the propensity scores. The values of C i , c , c , c are selected to define five equal size 2 3 4 quintile groups from the empirical distribution of these estimates. It is important to emphasize that the bins [0,Ci], [c ,c ], [c ,c ], [c ,1] will not 2 3 3 4 4 define quintile groups when applying BPSA. The parameter 7 is modelled as a latent variable, and thus the covariate g(z(c, 7)) is also a latent quantity. We can only classify patients into quintile groups in an average sense, such as based on the posterior expectation of g(z(c, y)). Because the quantities C i , C 2 , c , c r 3 4 are chosen using PSA, we would not necessarily expect 20% of the study observations to fall within each bin. The choice of model for the outcome variable may not be suitable for the E F F E C T data. The idea that the risk of Y is constant within bins is clearly only an 31 approximation, and it it appears that there might be more suitable functional forms for g(Z). Nonetheless, we use the form of g(Z) given above in order to maintain similarity with the published litterature on PSA. In first proposing PSA, Rosenbaum and Rubin [1] advocated using five equal sized bins because previous work by Cochran demonstrated that this would elimianate 90% of the bias induced by a confounder. Furthermore, our method can be readily modified to incorporate more flexible choice of g(Z) through modification of the M C M C algorithm. Another possible limitation of the model is the choice of using the log OR B to model the causal effect of X on Y. As discussed in Chapter 2.1, the log OR has well know limitation compared to other effect measures. In this investigation we choose the log OR because of the flexibility of using logistic regression to model binary data. The logit link simplifies modelling when g{Z) takes on other forms such as cubic splines. Nonethless, in priciple, we could define an outcome model using other measures of effect. Prior distributions The model parameters /?, £ and 7 are all standard regression coefficients, and we use mean-zero diffuse independent normal distributions as prior distributions. 7 l ,..., B ~ JV(0,100), ~ 7V(0,100), ~ iV(0,100). l k This should yield similarity between BPSA and PSA, which uses no prior information for model parameters. There are also opportunities for more flexible modelling 32 strategies. Zheng and Little [37] propose a model similar to BPSA for estimating the population mean from a finite population where subjects have non-equal probability of selection. They model the distribution of the data conditional on the selection probabilities, and they use a hierarchical prior distribution for the subgroup means. The resulting inferences lie mid-way between assuming that the means are all identical versus all different. For BPSA, we could use a similar strategy and model the parameters £1,^2, •••£5 hierarchically and as conditionally independent and identi- cally distributed given an unknown hyperparameter. Posterior simulation Following the notation given in Section 2.3, recall that y, x denote vectors of length n of the observed responses, exposures in n study subjects, and c denotes an nx p matrix of measured covariates with a column of ones. Our objective is to study f(/3,£|y,x, c), the posterior distribution of model parameters given y, x and c. To accomplish this, we sample from the posterior using M C M C . To illustrate the main ideas, observe that if the propensity scores were known then posterior inferences could be accomplished by fitting the logistic regression models in equation (3.1). Hence posterior simulation may proceed by treating 7 as latent variable that is integrated out of the joint posterior distribution f([3, £, 7|y, x, c). We update successively from f(P, Z\y, x, c, 7) and /(7|y, x, c, B, £). To update from the conditional densities f(B, £|y, x, c, 7) and /(7|y, x, c, /?, £) we must appropriately condition the joint distribution of the data and parameters. Write this distribution as / ( y , x, 0, f, | c ) = /(yjx, c, B, £, ) / ( x | c , )/(/5, £,7). 7 7 33 7 where /(/?,£, 7) is the prior distribution for and 7. All of the density functions in the right-hand side of this equation are known from the modelling and prior assumptions. Therefore, /(/?, £|y, x, c, 7) obeys the proportionality /'(/U|y,x,c, ) 7 « /(y|x,c,/3,£, )/(AO e x p { ^ ( ^ + ^(z(c ,7)) £)} n . . l i=l 7 , i + exp{^ i + ^( C i ,7y£} X / ( A 0 - ( 3 " 4 ) This is the likelihood for logistic regression of y on x and the propensity score z(c, 7), multiplied by the prior for 3 and £. Updating from the density /(/3,£|y,x, c, 7) involves a single update from Bayesian logistic regression of y on x and z. Bayesian logistic regression can be accomplished using the Metropolis Hastings algorithm [38]. The algorithm is an iterative procedure for sampling from a target density f(0) = k~ n(9), where k is an unknown normalization constant. The iml plementation proceeds as follows: At iteration i, given a current sampled parameter value #M, a candidate value 9* is generated from a proposal density Q(9*\9^). We assign <— 9* with probability f{8*)Q(9W\9^) ' mm f{8^)Q{9*\8^Y or assign tions, the series <— #W otherwise. After discarding a suitable number of initial itera. are a dependent sample from the target density f(6). The choice of proposal density Q(9*\9^) impacts the performance of the Metropolis Hastings algorithm. For Bayesian logistic regression, a common choice is a multivariate normal density with mean equal to the maximum likelihood estimator and covariance matrix given by the inverse of the observed information. This proposal density approximates the posterior distribution in large samples and yields high acceptance 34 rates for candidate parameter values in posterior updating. The density of /(7|y, x, c,/3,£) obeys the proportionality /(7|y,x,c, P,0 oc /(y|x,c,/3,7,0/(x|c,7)/(7) f \[l = + exp{pxi + g(z(dn)^} 1 + exp{ 'c} 7 (3.5) This density is not proportional to /(x|c, 7)7(7). Therefore, updating the propensity scores z(c, 7) in BPSA does not consist of sampling from the posterior distribution of regression coefficients from logistic regression of x on c. Instead, information about patient outcomes is also involved in updating information about the propensity scores. We update from /(7|y, x, c, [3, £) using a Metropolis-Hastings step. Finding a suitable proposal distribution is challenging because the characteristics of this distribution are not obvious. For example, we found that a proposal based on the approximation /(7|y,x,c, /(xjc, 7)7(7) k (3-6) where A; is a normalization constant, led to unacceptably high rejection rates. Such a proposal approximates the target density using the asymptotic distribution of the maximum likelihood estimator from fitting the regression in equation (3.2). The fact that the approximation is poor indicates that the patient outcome distribution may supply a lot of information about 7. Instead, we use a proposal distribution based on a random walk Metropolis Hastings algorithm which updates the p components of 7, one at a time. This approach samples from /(7|y, x, c, /?, £) by sampling sequentially from /(7i|y,x, c, (3, £, 7(-i)), • • •, /(7fe|y, . c, P, f,7(-fc)), where 7 _ dex ( i) notes (71,..., 7j_i, 7.7+1 > • • •, Ik)- At each step, a proposal for f(jj\y, x, c, /?,.£, 7 _ ) ( 35 j) is given by a random draw from a univariate t-distribution with appropriate scale and degrees of freedom. The scale parameter is chosen so as to ensure fast mixing of the M C M C chain through the target distribution. If the scale is too small, the chain will move slowly. A large scale will give high rejection rates. Specifying a small degrees of freedom for the ^-distribution gives heavy tails to the proposal density, which allows greater flexibility for ensuring rapid movement through the target distribution. A difficulty with this updating scheme is that the computational cost will be substantial when the number of covariates p is large. For example, in the statin data described in Section 1.1 we have p = 20. A n alternative strategy is to use a multivariate random walk. We could update 7 using a proposal equalto 7 plus a draw from a multivariate ^-distribution of dimension p with mean zero, small degrees of freedom, and scale matrix equal to the inverse of the observed information from logistic regression of x on c. This proposal distribution has a similar shape as the density in equation (3.6), but is not constrained to one area of the parameter space. We assess sampler convergence using the CODA package which is available for the statistical software R [39] designed for output analysis and diagnostics for M C M C . The cumuplotQ function plots the evolution of sample quantiles over iterations, and it can be used to identify poor mixing. Diagnostic tools based on the analysis of multiple M C M C chains are also available [40]. To summarize, simulation from the posterior distribution /(/?, £, 7, |y, x, c) proceeds as follows: 1. Specify the quantities C i , C2, C3, C4 by constructing five quintile groups from the fitted values of a logistic regression of x on c fit by maximum likelihood. This defines the "bins" for BPSA. 2. Obtain a starting value for 7 ^ , by sampling once from a normal distribution 36 with mean and variance given by the maximum likelihood estimator and its asymptotic variance, respectively, calculated form the logistic regression of x on c from Step 1. Obtain starting values for (8^°\^) by sampling from the asymptotic distribution of the maximum likelihood estimators from logistic regression of y on x and z(c, 7^) using the regression model of equation (3.1). 3. Fort = 1,2,... (a) Update 7( )-by updating successively from each of /(7i|y, x, c, /?, £, 7(-i)), • • t+1 /(7fc|y, x, c, 3, £, 7(-fc)) using a random walk Metropolis Hastings step with proposal distribution that is univariate t with a suitable scale parameter and degrees of freedom. (b) Update and £ ( ) using a Metropolis Hastings step with target dent+1 sity f(8, £|y,x, c, 7( )) and proposal distribution obtained by logistic ret+1 gression of y on x and z(c, 7^ '). t+1 After discarding a suitable number of initial iterations, the sequence (B^ \ £ ^ , 7 ^ ) % for i = 1, 2,..., is a serially dependent sample from the required posterior distribution /(Afi7.|y,x,c). 3.2 Causal inference with B P S A The BPSA model described in Section 3.1 seems reasonable. BPSA essentially mimics PSA from a Bayesian perspective. We can generate data from the model. Intuition and standard large sample theory suggest that BPSA point and interval estimates will agree with PSA in large samples. This is confirmed in the simulations of Chapter 37 5. Nonetheless, BPSA is intended to be used as a tool for causal inference, and upon closer inspection we see some technicalities. Does it make sense to use outcome data in order to estimate propensity scores? In Section 3.1, we showed that when updating the propensity scores, the conditional distribution of 7, given in equation (3.5) depends on the observed values of y and the parameters £ and B. As we learn about the risk of Y as a function of g(Z) and X, this information flows back through the M C M C algorithm to impact estimation of 7. Unlike PSA, the BPSA method fits both regression models given in equations (3.1) and (3.2) simultaneously rather than one at a time. Thus the two regression models assist each other in order to produce a good fit for the data. But this estimation strategy is in conflict with the standard approaches to PSA. Rubin [7] emphasizes that when estimating the propensity scores, the investigator should not have access to outcome data. Rosenbaum and Rubin [3] advocate estimating the propensity scores by an iterative process where the investigator first estimates the propensity scores using a model for the distribution of X given C, and then checks for balance in the distribution of C within quintile groups. If the covariate distributions differ between treatment groups, this indicates that the model for P(X = l|c) is incorrect. The investigator can then estimate the propensity scores again using alternate models. PSA is a tool for constructing comparable treatment groups and the method should be blind to the values of the outcome. When there are no unmeasured confounders, PSA replicates randomized groups in a manner similar to actual randomization. Furthermore, there is the issue of the time ordering of when Y and X are measured. If the outcome variable is measured after the start of followup, then is it meaningful to use this quantity to make inferences of the probability of treatment? To reconcile these contradictory viewpoints, we need to establish if the regression 38 model given in equation (3.1) is a sensible model. If we truly believe that the risk of Y given X is determined by the linear predictor g(Z), then it makes sense to use Bayes rule to incorporate patient outcome information when estimating propensity scores. However, in most published articles on propensity score methods, PSA is evaluated assuming that the regression model given in equation (3.1) is misspecified (see for example [4, 28, 34, 35]). The ideal analysis estimates the association between X and Y conditional on Z exactly, while stratification of study units into equal sized bins is a only crude approximation of this procedure. Therefore PSA will tend to produce biased estimates because of residual confounding due to incomplete adjustment for Z. The justification of this perspective is that while the association between X and Y given Z has a causal interpretation, the interpretation ought not extend to units within the same quintile group. In Section 2.2, we showed that when there is no unmeasured confounding conditional on C, then this implies that (Y{iy,Y{ y) AL X\Z. 0 Therefore given Z, all patients have equal probability of X — 1 irrespective of C , and the association between X and Y is unconfounded. But among units with propensity score Z € [c/t, Ck+i), for some interval from c to Ck+i, treated and untreated units will k not have identical distributions of propensity scores. Consequently, it seems tenuous to argue that the association between X and Y within a fixed quintile has a causal interpretation. If we adhere to this logic, then it implies that equation (3.1) is not a realistic model for causal inference. Thus using Bayes theorem and equation (3.1) to estimate propensity scores BPSA is not a sensible strategy for analyzing observational data. An alternative perspective is that it may be appropriate to impose additional modelling'assumptions which treat the distribution of Y given X as a smooth function in Z. For example, if patients A and B have propensity scores equal to 0.75 and 0.80 respectively, then it may be reasonable to assume that they have similar baseline 39 risk of Y. In the statin data example, healthy patients are likely to be treated, and consequently, patients with high propensity scores have reduced risk of death. Thus while the distribution of propensity scores may differ among treated and untreated patients within quintile groups, all patients may have roughly equal risk of Y. This reasoning is similar to that used to justify standard regression models that categorize confounders. When adjusting for patient age using conventional regression, we may group patients into one year groups rather than one month groups. Inferences are then reported conditional on this smoothness assumption. If we can assume that • (Y Y ) {lh {0} AL X\g(Z) (3.7) then this implies that P(Y = 1\X = l,g(z)) = P(Y = 1\X = l,g(z)) = P(Y = l\g(z)) P(Y = 1\X = 0,g(z)) = P(Y = 1\X = 0,g(z)) = P(Y = l\g(z)). {1} {0} {1} {0} The log ORs calculated from P(Y = l\X, g(z)) will have a causal interpretation within subsets of the population with a given g(z). The assumption in equation (3.7) is very similar in spirit to the assumption of no unmeasured confounders from equation (2.2). But it in some sense it is made implicitly in conventional PSA. When we use PSA to calculate model based point and interval estimates for /?, we are assuming that equation (3.1) models the true distribution of the response variable. The choice of linear predictor g(.), whether quintile groups or cubic splines, is presumably guided by prior information about the smoothness of P(Y = l\x,z). As an aside, we note that the BPSA model, given in equation (3.1), is not 40 compatible with potential outcome models for causal inference. B P S A models the quantities P(Y = l\x,c) and P(X = l|c) as dependent a priori. We see this because the quantity P(Y = l\x,c) =expit{f3x + g(z(c,j))'(} is an explicit function of z(c, 7) = P(X = l|c) = expit(7'c). What this means is that, given X and C, we believe that the risk of Y depends on the manner in which treatment is assigned. However, causal inference using potential outcomes models requires that P(X — l|c) and P(Y — l|x,c) be independent a priori. As described in Section 2.2, in order for the treatment assignment mechanism to be ignorable in the sense of Rubin [25], a necessary condition is that the parameters governing the potential outcomes be independent a priori from the parameters which model the treatment assignment mechanism. To elaborate, in the Section 2.2 equation (2.2), we showed that the potential outcome approach to causation models the distribution of Y and X given C as the product of two parts; one which models the potential outcomes and one which models treatment assignment, The parameter 0 models the potential outcomes, while 7 models the treatment assignment. Provided that we assign independent prior distributions to 0 and 7, denoted /(0,7) = 7(0)/(7)) then the missing data mechanism for the potential outcomes is ignorable [25], and Bayesian inference for the parameter 0 proceeds from (3.8) According to this model for the observed data, knowledge of 7 conveys no information 41 about 9. If f(y{i], y{ }\ ) c x a n d f(x\c) are indexed by different parameters which are a priori independent, then this implies that P(X = l|c) and P(Y = l\x,c) are also a priori independent. This, in turn, invalidates the BPSA model given in equation (3.1). Potential outcome models for causal inference view observational data as arising from two separate processes, one which generates the potential outcomes, and one which assigns treatment and masks potential outcomes. Because the models have distinct parameters, the investigator can ignore the manner in which treatment is assigned. The intuitive explanation for this is that, in a randomized experiment, how one chooses to randomize should convey no prior information about causal effects. Furthermore, the assumption that f(y{i},y{ }\c) x a n d f(x\c) are indexed by dif- ferent parameters which are a priori independent, implies that propensity scores are irrelevant to a Bayesian analysis [14]. If we look at the expression for f(6\y,x,c) given above, we see that because of the way the model is specified, the parameter 7 which models the propensity score conveys no information about the parameter 9. This may explain why there is so little published work combining Bayesian statistics with propensity score methods. In order to make sense of BPSA, we should operate from the premise of using models which assume a relationship between P(X — l|c) and P(Y = l\x,c), even if such models seem unrealistic. This reasoning is used by Rubin [14] and Robins and Ritov [15] in the context of propensity score methods. For control of confounding in observational studies, Rubin [14] proposes BPSA when we have prior knowledge of the propensity scores. He argues that because specification of P(Y = l|:r,z) may be easier than specification of P(Y — \ \x, c), BPSA inferences may have good frequentist properties even though the model is not an accurate representation of the true data generating process. Similar logic is argued by Little in the context of survey sampling 42 [41]: "...in practice all models are simplifications, and the features of the population that are important to include in the model vary according to the choice of design.... One way of limiting the effects of model misspecification is to restrict attention to models that yield design-consistent estimates." Hence while the B P S A model may be implausible, it can be viewed as a tool for building good frequentist procedures. 43 Chapter 4 Analysis of the statin data In Section 1.1, we outlined a case-study of an observational study estimating the effect of statin therapy on mortality in Ontario patients discharged alive from hospital following acute myocardial infarction. We demonstrated that the crude association between statin therapy, and mortality was likely biased due to confounding. Statins were generally prescribed to younger and healthier patients. Consequently, the reduction in mortality associated with statin use is partly driven by systematic differences in patient characteristics. We apply BPSA and PSA to adjust for confounding. For each of the 4572 patients in the sample, we let Y equal one if the patient died within three years of discharge from hospital, and zero otherwise. We let X equal-one if the patient was prescribed a statin at hospital discharge, and zero otherwise. We let C equal a 21 x 1 vector of measured covariates listed in Table 1.1, where the first term is equal to one in order to include an intercept term in the regression modelling. In this investigation, we assume that each of the variables in Table 1.1 is a true confounding variable regardless of the observed associations between C, X and Y. 44 4.1 Conventional regression adjustment for confounding To reduce confounding due to C, we estimate the effect of X on Y using conventional regression on measured confounders. We fit a logistic regression model of Y on X and C with main effects and no interactions. Table 4.1 presents log ORs for the associations between measured covariates and mortality. The adjusted log OR for the association between of X and Y given C is -0.33 with 95% confidence interval (0.53, -0.13). The odds ratio, exp(-0.33) = 0.72, is closer to one compared to the crude OR, and is more consistent with the results of other population-based observational studies [17, 18]. This suggests that the analysis has reduced some of the confounding. 4.2 PSA analysis of the statin data We apply PSA to the statin data. The method uses a two step procedure where the propensity scores are estimated for each patient and then the estimated propensity scores are included in a regression model for the mortality based on five quintile groups. We estimate the propensity scores using the fitted values from the logistic regression model given in equation (2.4) with main effects and no interactions. The treatment effect is then estimated via the model given in equation (2.5). Table 4.2 gives estimates for the parameter 7 which is the log ORs for the association between X and C. Before moving on to estimation oi.B and £, we illustrate some of the properties of propensity scores. In Figure 4.1 we plot kernel density estimates of the density of propensity scores within treatment groups, given by f(z\x) for x — 0,1. The dashed and solid curves refer to the untreated group (x = 0) and 45 Table 4.1: Log ORs for the association between Y and (X, C). Characteristic Statin therapy lojI OR -0.33 (95% CI) (-0.53, -0.13)* Demographic characteristics Age in years (mean) Female sex 0.07 -0.19 (0.06, 0.08)** (-0.37, -0.01)* Presenting characteristics'* Shock 0.47 (-0.38, 1.31) -0.19 0.40 0.36 -0.08 0.23 (-0.39, 0.01) (0.21, 0.60)** (0.12, 0.60)* (-0.25, 0.10) (0.02/ 0.43)* 0.27 0.54 (0.10, 0.44)* (-0.67, 1.75) AMI risk factors'* Family history of C A D Diabetes CVA/TIA High B P Current smoker Comorbidities* Angina Renal disease Vital signs on admission 1 Systolic B P Diastolic B P Heart rate Respiratory rate -0.01 0.00 0.01 0.03 (-0.01, 0.00)** (-0.01, 0.00) (0.01, 0.02)** (0.02, 0.05)** Laboratory values ' 1 White blood count 0.02 (0.00, 0.03)* Haemoglobin -0.02 (-0.02, -0.01)** Sodium -0.02 (-0.04, 0.00) Glucose 0.02 (0.00, 0.04)* Creatinine 0.00 (0.00, 0.01)** * p< 0.05, ** p < 0.001 f Continous variables, + Dichotomous variables 46 Table 4.'2: Point estimates for 7, the log ORs for the association between X and C. Characteristic lojI OR (95% CI) Demographic characteristics Age (mean) Female sex -0.03 -0.17 (-0.04, -0.02)** (-0.33, -0.01)* Presenting characteristics* Shock -0.54 (-1.54, 0.45) 0.16 -0.05 0.17 0.31 -0.27 (0.02, 0.31)* (-0.22, 0.12) (-0.07, 0.40) (0.17, 0.44)** (-0.42, -0.12)** AMI risk factors* Family history of C A D Diabetes CVA/TIA High B P Current smoker Comorbidities* Angina Renal disease 0.37 1.06 (0.23, 0.51)** (0.01, 2.11)* 0.00 0.00 0.00 -0.01 (0.00, 0.00) (-0.01, 0.00) (-0.01, 0.00) (-0.02, 0.00) Vital signs on admission' Systolic B P Diastolic B P Heart rate Respiratory rate Laboratory values^ White blood count 0.00 (-0.01, 0.01) Haemoglobin 0.00 (0.00, 0.01) Sodium 0.01 (0.01, 0.03) Glucose 0.01 (0.00, 0.02) Creatinine 0.00 (0.00, 0.00) * p < 0.05, ** p < 0.001 f Continous variables, + Dichotomous variables 47 treated group (x = 1) respectively. The curves were generated via Gaussian kernel density estimation using the R function densityQ with default settings. In Figure 4.2, we plot the estimated log odds of death, as a function of x and z, given by logit[P(Y = 2 ) ] . This quantity is estimated by second-order local polynomial regression of Y on Z for fixed X using the R function loess () [39] with default settings. In Figure 4.2, the four vertical bars indicate the values of c i , 0 2 , 0 3 , 0 4 which define the five equal-sized quintile groups. Each of the intervals [0, 0.21), [0.21, 0.26), [0.26, 0.31), [0.31,0.38), [0.38, 1] contains an equal amount of the data. Consequently, estimates of logit[P(F = 2;)] in the outermost quintiles are more imprecise. In Figure 4.2, we see that the dashed line is systematically higher than the solid line within quintile groups #1, #2, #3, #4. This suggests that statin therapy reduces mortality, in patients with a given propensity score. The curves are roughly parallel, indicating that the effect of X on Y is not modified by Z. Figure 4.2 reveals that high propensity scores are associated with a reduced risk of death. This is consistent with the literature on statin prescribing in Ontario residents [19] and the results of Table 1.1. Healthy patients are more likely to be treated with statins. In Figure 4.1, we see that untreated patients have lower propensity scores than treated patients. This is expected because P(X = l\z) = z implying that X and Z are dependent. Because Z is simultaneously associated with Y given X, and is also associated with Y given X, this implies that Z acts like a confounder, and Figures 4.1 and 4.2 allow us to visualize the confounding action of C. To estimate the effect of X on Y while controlling for C, we apply PSA and stratify on quintiles of the propensity score. The results are given in Table 4.3. We see that risk of death is greatest in the 1st quintile and decreases in quintiles 2 though 5, as is illustrated in Figure 4.2. PSA assumes that there is no effect modification between X and Z and the log odds ratio of the effect of X on Y is -0.36 (-0.54, -0.18). 48 Figure 4.1: The distribution of propensity scores among the treated and the untreated, denoted f(z\x) for x = 0,1. The solid curve refers to the treated patients, while the dashed curve refers to the untreated patients. Figure 4.2: Log odds of death as a function of x and z, denoted logit[P(Y = l\x, z)]. The solid curve refers to the treated patients, while the dashed curve refers to the untreated patients. Table 4.3: Parameter estimates for the treatment effect B, and the baseline risk of Y within quintile groups (£1, £ , £3, £4, £5), calculated from PSA 2 Parameter B ii 6 £3 £4 £5 log OR -0.36 (95% CI) (-0.54, -0.18) -0.13 -0.78 -1.44 -1.69 -2.18 (-0.27, 0.00) (-0.98, -0.59) (-1.66, -1.22) (-1.93, -1.45) (-2.45, -1.90) Assessing the covariate balance produced by PSA To assess whether PSA succeeds in reducing confounding, we investigate the balancing properties of the propensity score. We examine the distribution of measured confounders among treated and untreated patients within quintile groups. The results are given in Table 4.4. The distributions of measured confounders within quintile groups are compared using two methods: by. comparing sample means and proportions using t-tests, and by comparing the standardized difference between the distributions, calculated as the mean difference divided by the pooled standard deviation of the distributions [2]. Table 4.4 illustrates the ability of PSA to reduce confounding bias. Consider patient age which is a strong risk factor for mortality. In Table 1.1 we see that younger patients are more likely to be prescribed a statin. But within quintile groups, much of the systematic differences in age between treated and untreated patients is removed. Hence PSA reduces confounding bias due to age. The same effect is observed for other covariates which are strongly associated with treatment, such as angina. 51 Table 4.4: Means of measured covariates among treatment groups, within quintiles of the propensity score, calculated from PSA. Quintile 1 n = 914 Statin No Yes Demographic characteristics 77** + Age 64%t Female sex AMI risk factors Family history of C A D 9.0% 32%t Diabetes CVA/TIA 11% High B P 29% 37%* t Current smoker Comorbidities Angina 7.6%* + Renal disease 0% Vital signs on admission Shock 3% Systolic B P 148+ Diastolic B P 86* + 94 Heart rate 24 Respiratory rate Laboratory values White blood count 11 Haemoglobin 132* + Sodium 138 Glucose 10+ Creatinine 110* + * p < 0.05, ** p < 0.001, | Standardized Quintile 2 n = 914 Statin Yes No Quintile 3 n = 914 Statin No Yes Quintile 4 n = 914 Statin Yes No Quintile 5 n = 916 Statin Yes No 80**+ 59%+ 72** + 48% 74**t 46% 67 33% 67 33% 62+ 21% 60+ 23% 53 11%+ 52 14%+ 8.6% 26%+ 11% 29% 27%* + 22% 28% 10% 41% 35%+ 21% 29% 13% 42% 29%+ 32% 26% 8.6% 42% 34% 31% 28% 9.7% 44% 34% 43% 24% 8.7% 51% 36% 41% 26% 8.5% 48% 40% 61% 24% 8.4%+ 61% 31%* + 61% 23% 5.3%+ 59% 37%* + 22%* + 0% 29%* + 0% 34% 0.3% 35% 39% 0.5% . 0.3% 39% 0.3% 57%* + 2.1% 45%** + 1.5% 13%* + 0% 2% ' 144+ 79* + 92 24 0% 149 82 85 21 0% 149 82 83 21 0% 147+ 83+ 82 20 0% 151+ 86+ . 83 20 0% 151 85+ 78* + 20 0% 151 86+ 81* + 20 0% 149 87 78+ 19 0% 149 87 76+ 19 11 127* + 138 9.8+ 121* + difference 10 134 139 9.5 103 > 10% 10 135 139 9.4 103 10 140 139 9.2 97 10 139 139 9.5 101 9.7 142 139 8+ 101* + 10 143 139 9.2+ 93* + 9.9 146 139+ 10 99 10 147 140+ 9.4 96 In Table 4.4 we see that there are'a number of statistically significant imbalances within quintile groups, particularly within the quintiles #1 and #5. But if the PSA model for P(Y = c) given in equation (3.1) is correct, then these imbalances do not necessarily indicate that the associations between X and Y within quintile groups are confounded. The difficulty with using Table 4.4 to identify residual confounding is that the table assumes nothing about the relationship between the propensity score and the outcome. Such an estimation strategy is valid in the sense that if we see balance in the distribution of confounders, then it suggests that there is no confounding. But the approach may be overly pessimistic. It focuses entirely on efforts to create comparable treatment and control groups, while ignoring prior information about the relationship between the outcome and the propensity score. If the risk of Y given X is a smooth function of Z within each interval [0, ci ], [ci, c ], 2 [ 2,C3], C [03,04], [c4,l], then systematic imbalances in the distribution of propensity scores within quintile groups do not necessarily indicate that there is confounding. To put this another way, covariate imbalances in Table 4.4 may cancel themselves out in such a way that all individuals within each interval have the same risk of Y. Furthermore, Figure 4.2 illustrates that smoothness assumptions may be reasonable. Modest change in the propensity score are associated with modest change in risk of mortality. 4.3 B P S A analysis of the statin data Before applying BPSA to the statin data, we first set the bins [0, c i ] , [^3,04], [01,02], [02,03], [c ,1] which define intervals of homogeneous risk of Y given X, using the 4 values c i — 0.21, c = 0.26, c = 0.31, c = 0.38 from Section 4.2. We then apply 2 3 4 BPSA to the statin data. We run a single M C M C chain of length 100 000 after 53 Table 4.5: Parameter estimates for the treatment effect 8, and the baseline risk of Y within each of the five bins £ 3 , £ 4 . £ 5 ) , calculated from BPSA. Characteristic 8 £1 cf 6 £ £5 2 4 log odds ratio (95% CI) -0.30 (-0.50, -0.10) 0.62 (0.38, 0.87) -1.04 (-1.39, -0=67) -1.93 (-2.27, -1.60) -3.04 (-3.45, -2.60) -3.98 (-4.43, -3.55) discarding the initial 10 000 iterations, and we then thin the chain by retaining every tenth iteration to obtain a sample of size 10 000. Thinning the chain is advantageous for computer storage of the analysis results. Sampler convergence for 7 was worse than for 3 or £. Figures 4.3 and 4.4 contains mixing plots for 3, £ and the first six components of 7 based on the last one thousand iterations of the thinned chain. We can see that samples for £ and 3 move more rapidly through the target distribution than for 7. To assess the effect of sampler convergence, we obtained three additional M C M C chains of length 100 000 iterations with overdispersed starting values, and we inspected the marginal distributions of the components of /?, £ and 7. The variation of sample means between M C M C chains was found to be small in relation to the variation within individual chains. Thus while mixing is not ideal, it does not appear to greatly affect estimation. Table 4.5 contains point and interval estimates for 8 and £ from BPSA. The log odds ratio for the effect of X on Y is similar to that of PSA and is given by -0.30 with 95% credible interval (-0.50, -0.10). But there are large differences in the estimates for £ which underscore the differences between BPSA and PSA. To illustrate why BPSA and PSA give different results, consider Table 4.6 which presents the distribution of 54 Figure 4.3: BPSA sampler convergence for B and £. 00 o 1—i—n—r 9000 9500 9500 t—i—i—i—r 9000 9500 10000 9000 10000 9000 9500 10000 \ i n i r 9000 9500 10000 \—i—i—r 9000 1—i—i—r 10000 55 9500 10000 Figure 4.4: B P S A sampler convergence for 7. 56 measured confounders among treated and untreated patients within propensity score bins. In the table, each study unit is assigned to a bin based on the posterior mean of the propensity score. This means that for a unit with covariate C, we assign a level of g(.) equal to g(E{z(C,j)\y, x, c}) where £ { z ( C , ) | y , x, c} = J e x p i t ( ' C ) / ( | y , x, c)dj. 7 7 7 A comparison of Table 4.4 and Table 4.6 reveals striking differences. While PSA assigns an equal number of patients into each bin (roughly 4572/5=914), this is not the case for BPSA. From Table 4.6, we see that bins #1, #2, #3, #4, #5, contain 591, 781, 945, 1180, 1075 patients respectively. PSA classifies patients into bins using only information about the relationship between C and X. If a confounder is strongly predictive of treatment, then this association is largely reduced after applying PSA because the variation in C is re-distributed across propensity score bins. In contrast, B P S A estimates propensity scores by incorporating modelling information about the relationship between Y and g(Z). The consequence is that BPSA assigns patients to bins based on how sick they are. For example, consider the indicator variable for diabetes. Diabetes is a strong risk factor for death with odds ratio 1.5 (1.2, 1.8) based on the conventional regression of Section 4.1 (see Table 4.1). For PSA, the prevalence of diabetes is fairly well balanced within each of the five bins because diabetes is not strongly associated with statin prescribing. In contrast, for BPSA, we see that the prevalence of diabetes is high in bins #1 and #2, while lower in bins #3, #4, and #5. In Figure 4.2, we see that patients with low propensity scores have the greatest risk of death. Consequently, when estimating the propensity scores, BPSA assigns sicker patients to bins #1 and #2. For every single covariate that is a strong risk factor for Y, we see a tendency for BPSA to assign patients 57 with these risk factors to lower bins. Another example is renal disease. For PSA all patients with renal disease are in bins #3, #4 and #5, whereas for BPSA these same patients are assigned to bin #1. What can we conclude about the differences between the BPSA and PSA analyses? Which analysis is more valid? Intuitively one might think that modelling uncertainty in the estimated propensity scores will only negligibly impact on inferences when the sample size is large. But this analysis indicates that the reverse may be true. As sample size increases, we get better estimates for £, the baseline risk of Y within bins. This information flows back to affect propensity score estimation. Furthermore, in Table 4.6 we see that BPSA yields fairly severe imbalances in the distribution of measured covariates between treated and untreated compared to PSA. The method does not appear to be producing homogeneous subgroups. What are the implications for the ability of BPSA to reduce confounding? We investigate these questions in Chapter 5. We study estimator performance in synthetic data under competing models for the outcome variable and various parameter values. We also consider the balance in C induced by using BPSA versus PSA. 58 Table 4.6: Means of measured covariates among treatment groups, within bins of the propensity score, calculated from BPSA. Bin 1 . n = 591 Statin Yes No Demographics .Age Female sex AMI risk factors Family history of C A D Diabetes CVA/TIA High B P Current smoker Comorbidities < Angina Renal disease Vital signs on admission Shock Systolic B P Diastolic B P Heart rate Respiratory rate Laboratory values White blood count Haemoglobin Sodium Glucose Creatinine * p < 0.05, ** p< 0.001, | Bin 2 n = 781 Statin No Yes Bin 3 n = 945 Statin Yes No B n 4 n = 1180 Statin No Yes B n 5 n = 1075 Statin Yes No 77* *t 48% 81**f 51% 75**t 47% 78**f 49% 71*f 38% 72*f 42% 63 26%f 63 31%f 50 15%* f 51 20%*f 12%f 63%** | 31%f 64%* f 24% 8%| 41%**f 26%| 46%* f 20% 19%f 46%* | 18% 55%t 20% 14%f 34%*f 15% 48%t 19% 29% 33% 13%f 53%* f 23%* | 25% 29% 9%f 44%* f 29%* t 39% 21% 5%*f 43% 36% 38% 22% 3%*| 41% 39% 59% 10% 1% 43% 48% 57% 11% 2% 38% 52% 40% 10%* t 38% 2%*f '50%*t 0% 40%* f 0% • 44%* | 0% 35%* f 0% 35%**| 0% 26%**f 0% 28%**f 0% 19%**t 0% 3% 140 80f 113**f 26 3% 139 78f 100**f 26 1% 147 81 93f 22 13| 14+ • lit 129 124f 119f 137 137 139 12*f 14* t Ht 165f 192f Hit Standardized difference > 10% 1% 145 80 90t 22 0% 148 82 83 21 0% 150 82' 81 20 0% 149t 84* f 76* | 19 0% 152f 86* | 79* f 20 0% 152 90 74 19 0% 152 90 74 19 10f 131 138 iot 107f 10 137 139 10.0 99t 10 137 139 9.4 95| 9.6 143 139 8.8 92**f 9.7 144 139 8.9 87**f 9.6 149 140*f 8.0 86 9.7 150 140*f 7.7 85 4.4 Decomposition of the posterior variance for f3 and £ One strategy for investigating the difference between B P S A and PSA is to study the effect of admitting uncertainty in the estimated propensity scores on the posterior variance of 8 and £. Gustafson and Clarke discuss factorizing the posterior variance of model parameters in the context of hierarchical models [42]. Using the relation Var[A] = £{Var[A|BJ} + Var{E[A\B}}, they note that for hierarchical models with data D, parameter 9 and hyperparameter </>, we can write the posterior variance for 9 as Var[0\D] = E{Var[6\D, <f>]\D} + Var{E[0\D, </>]\D}. (4.1) When the parameter 4> is known a priori, the model is not hierarchical because there is a single prior distribution for 9. Then the quantity Var{E[9\D, 0]|D} is equal to zero and the posterior variance for 9 is given by E{Var[9\D, 4>]\D}., Consequently, we can conceptualize the term Var{E[9\D,4>]\D} as modelling the extent to which admitting uncertainty in the hyperparameter <f> increases the posterior variance. This gives an "ANOVA" type representation for decomposing posterior variance in latent variable models. The first term on the right hand side of the equation models variation "within" specific priors, while the second term models variation in the posterior arising "between" different choices of priors. The BPSA method can be investigated within this framework. When estimating 8 and £, PSA uses a degenerate prior distribution for 7 which is equal to the maximum likelihood estimator from logistic regression of X on C. In contrast, BPSA estimates 8 and £ by admitting uncertainty in 7 and modelling 7 as a latent quantity. We can 60 write Var\j3,Z\D] = E{Var\j3,Z\D,7]\D} where D — (y, x, c). + Var{E\p,Z\Dn]\D}, (4.2) Letting 7 denote the maximum likelihood estimator for 7 from PSA, the expression £{Var[/3, £|.D, 7]|Z)} = Var[f3,£\D,j] equals the posterior variance for PSA. For BPSA, the quantity E{Var[p,j]\D} variation of B and £ given 7, while the quantity Var{E[8,j]\D} models the within models the between variation over levels of 7. Thus to characterize the different inferences for BPSA and PSA we can factorize the posterior variance for the statin data and calculate both Var[B,£\D,j] and the quantities in equation (4.2). We can calculate E{Var[p,f\D,j}\D} • E{Var\p,£\D;7]\D} + Var{E\p,£\D,i]\D} which is the variation in within levels of 7, divided by the posterior variance. If this quantity is close to one, then admitting uncertainty in 7 has little effect on posterior variance. A simple estimate of this ratio for BPSA is provided from the M C M C algorithm described in Section 4.1. When updating 8 and £, given some current value 7*, > the proposal distribution is given by the normal approximation to the asymptotic distribution of the maximum likelihood estimator from logistic regression of Y on X and the quintile group G. The mean and variance of this distribution are estimates of E[8, £|D, 7*] and Var[8, £|£>, 7*]. Thus if we collect and store these quantities during simulation along with the sample model parameters, we can take the empirical average across sampled values of 7 to estimate E { Var [8, £ | D, 7] | D } and Var { E [8, £ | D, 7] | D }. 61 Table 4.7: Posterior variances for 8 and £ from the analysis of the statin data using BPSA and PSA Posterior variance from PSA 6 6 U 6 Posterior variance from BPSA Decomposition of BPSA posterior variance 0.009 0.010 Within 0.010 0.005 0.010 0.013 0.015 0.020 0.016 0.034 0.030 0.049 0.050 0.008 0.013 0.015 0.021 0.043 Between 0.001 Within/Total 0.948 0.010 0.019 0.015 0.030 0.016 0.445 0.412 0.498 0.407 0.729 Table 4.7 presents a decomposition of the posterior variances of [3 and £ from BPSA and PSA. The first two columns are posterior variance for BPSA versus PSA. The final three columns give the decomposition of the posterior variance for BPSA. The main result is that the ratio of the within variation divided by total variation is estimated as 0.948 for the parameter 3 and is much smaller for each component of £. In other words, admitting uncertainty in the estimated propensity score greatly increases posterior uncertainty £, but only marginally increases uncertainty in 8. This makes sense when we compare the interval estimates from BPSA and PSA in Tables 4.3 and 4.5. Interval estimates for £ from BPSA are roughly twice as wide compared to PSA, whereas for the parameter /3, BPSA only modestly increases posterior uncertainty compared to PSA. 62 Chapter 5 Simulation studies of the performance of BPSA and PSA In Chapter 4, we applied BPSA to the statin data and contrasted the results with those from PSA. Because of the large sample size, intuition suggests that there should be little uncertainty in the estimated propensity score, and we would expect both methods to give similar inferences. But the statin data example shows that this may not be the case. The estimates for B are comparable, but we see large differences in the point estimates for £. BPSA groups patients into propensity score bins based on their health status, and the result is that the estimated components of £ are spread apart compared to PSA. Sick patients are grouped into bin #1 and this drove the estimate of £1 upwards to reflect the fact that this group had greater risk of death. Similarly, healthy patients were grouped into bin #5. A plot of the risk of death as a function of the propensity score, given in Figure 4.2, shows that this approach seems reasonable because patients with high propensity scores have lower mortality. But the question remains whether or not BPSA inferences are more valid. Can we expect that the frequentist inferences from BPSA will be superior in some settings? If so, then what are these settings and how do they compare to those for the statin data? In this chapter we investigate the frequentist performance of BPSA estimates using simulations and further analysis of the statin data. In Section 5.1, we use simulations to study the bias, relative efficiency and relative mean squared error (MSE) of point 63 estimators, and the coverage probability and length of interval estimates. BPSA has the potential drawback that it relies more heavily on modelling assumptions for the outcome variable than PSA. The method uses a model for P{Y = c) for propen- sity score estimation, while PSA does not. Consequently, we might expect that PSA inferences are more robust to model misspecification. To investigate further, Section 5.2 investigates the performance of BPSA and PSA when applied to synthetic data generated under competing models for the outcome. In Section 5.3, we consider prediction error. We use cross-validation techniques to study the ability of BPSA and PSA to accurately forecast the outcome variable when applied to real and synthetic data. Finally, Section 5.4 revisits the idea of covariate balance induced by BPSA versus PSA. Chapter 4 illustrated that BPSA appears to produce treatment and control groups which less similar with respect to measured confounders, compared to PSA. Thus the method does not appear to be effectively reducing confounding. We explore covariate balance using simulations. 5.1 Simulation study when the distribution of the outcome follows the propensity score model We use simulations to investigate the frequentist performance of point and interval estimators for 3, £, 7 when the data are generated according to the BPSA model given in equations (3.1) and (3.2), for fixed parameter values of 8,£ and 7. Simulation design We consider the case where C has four continuous components (meaning that C 64 is a 5 x 1 vector with first component equal to one), and we simulate datasets for four different choices of model parameters, Design Q I 7 #1 -1/2 (1/2, -1/2, 1/2,-1/2, 1/2) (1/2, -1/2, 1/2, -1/2, 1/2) #2 -1/2 (2, -2, 2, -2, 2) (1/2, -1/2, 1/2, -1/2, 1/2) #3 -1/2 (1/2, -1/2, 1/2, -1/2, 1/2) (2,-2,2,-2,2) #4 . -1/2 (2, -2, 2, -2, 2) (2,-2,2,-2,2) and sample size n = 1000. Design #1 models the setting where there are strong associations between between Y, X and bin membership, with odds ratios of either exp(—1/2) = 0.61 or exp(l/2) = 1.64. Designs #2, #3 and #4 are similar, but use more extreme odds ratios of either exp(—2) = 0.13 or exp(2) = 7.4. While these designs are less realistic, they can indicate settings in which BPSA or PSA break down. Design #2 is of particular interest. The components of £ are heterogeneous while the components of 7 are quite similar. We expect that PSA will misclassify patients into the wrong propensity score bins and this will adversely affect estimation of £ and B. For each design and fixed sample size, we generate and analyze 400 simulated datasets, to yield a sample of 400 point estimates and 80% interval estimates for /?, £ and 7, using the following algorithm: 1. Generate the n x 5 design matrix c. The first column is a column of ones. The latter four columns are the sampled covariates for the dataset of size n. Each ( element of each column is simulated as an independent draw from a N(0,1) random variable. 65 2. Generate the n x 1 vector x using the logistic regression model of equation (3.2), given by logit[P(X = l|c)] =ic where 7 = (70,71,72,73,74) is a 5 x . l vector. 3. Because of the way the simulation is designed, we have 7 ' C ~ -/V(7 , 2~Zt=i l i ) f ° r 0 fixed 7. Thus the values Ci, 02,03, c defining the true quintiles of the propensity 4 score are given exactly by Cfc = expit{7 +(^t=i li)Qk} 0 for k = 1, 2,3,4 and qk = $ (0.2fc), where $~ (.) is the quantile function of a N(0,1) random variable. _1 1 Generate the n x 1 vector y using the logistic regression model of (3.1) given by logit[P(y = z)\ = 3x + g(z(c, ))'£ 7 where [1,0,0,0,0] ifz(c, )< [1,1,0,0,0] if g(z) = { [1, 0, 1, 0, 0] [1,0,0,1,0] 7 C l < z(c,j) < c C l 2 if C2 < z(c,j) < c if c <z(c,j) { [1,0,0,0,1] 3 if c <z{c, 7) A <c 3 4 . and z(c, 7) = expit(7'c). 4. Analyze the datasets using PSA and BPSA with an M C M C chain of length 10 000 after discarding 500 initial iterations. Obtain point and 80% interval estimates for 3, ( and 7 from each method. Careful tuning of the M C M C sampler is needed for each simulation design, and this is accomplished using separate trial simulation runs. 66 Results Table 5.1 summarizes the performance of point and interval estimators for 6, £ and 7 from BPSA and PSA, in the case where datasets are simulated according to Design #1. The first two columns indicate of the magnitude of bias of each method. Each cell contains the sample mean of the collection of 400 sampled estimators, as well as the z-score for the' sample mean relative to the true. underlying parameter value. The z-score was calculated as „ sample mean - true parameter value z-score for mean-of simulated point estimators = -—-——— —— . sample standard deviation A large z-score indicates that the point estimator is biased.- The third and fourth columns in the table contain the estimated relative efficiency and the relative MSE of BPSA point estimators compared to PSA. These quantities are calculated as Sample variance of BPSA point estimates Estimated relative efficiency = — : : _ . : : Sample variance of PSA point estimates and Estimated relative MSE = Sample average squared error of B P S A point estimates Sample average squared error of PSA point estimates To aid with interpretation of results, we calculated simulation standard errors for the relative efficiency and relative MSE estimates via the bootstrap. In Table 5.1, estimates denoted with a "*" imply that a 90% bootstrapped confidence interval for the parameter excludes 1. If these estimates are significantly less than one, it indicates that BPSA point estimates have smaller variance or smaller MSE compared to PSA. The final four columns contain estimates of the coverage probability and average 67 length of interval estimates. The symbol t indicates that a 90% confidence intervals for the coverage probability excludes the nominal level of 80%. To clarify, we estimate the coverage probabilities which are nominally equal to 80%, and we construct 90% confidence interval for these quantities. We use an a-level of 0.1 for the analysis of simulation results in order to increase power at the expense of the Type I error rate. Tables 5.2, 5.3 and 5.4 are identical to Table 5.1, but correspond to data simulated under Designs #2-4. The results of the simulation study indicate that B P S A and PSA perform' very comparably in terms of estimation of 8. The quality of inferences for 8 from BPSA and PSA are so similar that any systematic differences in performance is swamped by variation from the Monte Carlo simulation. However, a qualitative assessment of the results across the four simulation designs suggests that PSA point estimators are more efficient than for BPSA. PSA interval estimators also appear to have lower coverage probability. Inferences for £ calculated from BPSA are generally superior to those from PSA. BPSA point estimates of £ have similar efficiency to those from PSA, but they have smaller bias and this reduces overall MSE. In Tables 5.1 through 5.4, the estimated relative efficiencies of point estimators for £ are generally not systematically different from one across the simulation designs. But for the parameters £, the z-scores calculated from BPSA are much smaller than for PSA. For example, under Design #2, PSA point estimators are biased with z-scores of greater than 20. This reduction in bias from using BPSA for estimating £ reduces overall M S E compared to PSA. In all four simulation designs, the estimated relative MSEs tend to be significantly less than one. The improvements of BPSA compared to PSA also applies to interval estimation of £. For each simulation design, BPSA credible intervals have greater average 68 length and estimated coverage probabilities which do not differ significantly from 80%. In contrast, PSA interval estimates always have lower coverage probability, and in Design #2, the estimated coverage probabilities are never greater than 50%. The increase in average interval length for BPSA is usually modest, but is occasionally quite substantial. Nonetheless, the increased length appears to be justifiable because the resulting interval estimates have proper coverage probability. BPSA point and interval estimates of 7 also appear to have better performance. Thus BPSA appears to do a better job of estimating the propensity scores. In terms of bias, BPSA and PSA are similar, but BPSA point estimators are much more efficient. Under each of the four simulation designs, the estimated relative efficiencies for BPSA compared to PSA are significantly less than one. Under Designs #2 and #4, the relative efficiencies were typically less than 0.2, meaning that BPSA point estimators of 7 are perhaps five times more efficient. Consequently, in the four simulation studies, the BPSA estimators of 7 have smaller MSE compared to PSA. The performance of interval estimates for 7 are also improved for BPSA. Under Design #4, BPSA interval estimates have length which is roughly one fifth of that of PSA, and yet they maintain roughly nominal coverage probability. BPSA interval estimates of 7 tend to have lower coverage probability compared to PSA. This is particularly true under Design #2. In this case, the empirical variance of the BPSA 1 point estimates of 7 is far greater than the posterior variances. Thus BPSA tends to modestly under report uncertainty in 7. One possible explanation for this undercoverage is poor M C M C mixing. The sampler may not fully explore the posterior distribution. 69 Table 5.1: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Desii #1Parameter B = -0.5 = & = £3 = & = £5 = 0.5 -0-5 0.5 -0.5 0.5 Point Estimation Interval Estimation B P S A Sample mean (z-score) -0.51 (-1.7) P S A Sample mean (z-score) -0.49 (0.9) Rel. efficiency Rel. M S E 1.05* 1.06* 0.49 (-1.0) -0.46 (3.6) 0.52 (1.5) -0.48 (1.6) 0.53 (2.4)- 0.45 (-7.4) -0.29 (18.4) 0.33 (-15.8) -0.26 (21.7) 0.47 (-2.1) 1.07 1.04 1.20* 0.92 0.96 0.94 0.58* 0.74* 0.43* 0.96 0.49 (-3.3) 0.16* 0.16* 0.50 (-1.0) -0.51 (-2.2) 0.22* 0.24* = -0.5 -0.51 (-5.8) 0.24* 0.23* 0.51 (4.5) 0.51 (1.8) 7 2 = 0.5 0.34* -0.51 (-4.7) -0.50 (-0.6) 0.32* 7 3 = -0.5 0.26* 0.26* 0.51 (3.9) 0.51 (1.9) 7 4 = 0.5 * Quantity differs from 1, p < 0.1. ' Coverage probability is less than 80% , p < 0 . 1 7 0 = 0.5 7 l BPSA Coverage Length 0.84 0.38 PSA Coverage Length 0.84 0.37 0.84 0.78 0.78 0.81 0.81 0.40 0.58 0.59 0.57 0.59 0.80 0.60+ 0.64t 0.57+ 0.78 0.38 0.53 0.54 0.55 0.57 0.79 0.77 0.77 0.76* . 0.76+ 0.07 0.09 0.09 0.09 0.09 0.78 0.79 0.78 0.81 0.80 0.19 0.19 0.19 0.19 0.19 Table 5.2: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #2. B = -0.5 €i = 2 £2 = -2 & =2 £4 = -2 ?5 = 2 B P S A Sample mean (z-score) -0.50 (0.4) 2.00 -1.96 2.19 -1.97 2.16 (0.0) (1.6) (5.5) (1.1) (5.1) P S A Sample mean (z-score) -0.42 (9.7) 1.71 -1.39 0.47 -1.34 1.30 Rel efficiency Rel. M S E 1.24* 1.00 0.73* 1.55 1.66* 1.72 0.71* 0.36* 0.47* 0.20* 0.38* 0.40* (-20.3) (30.6) (-56.9) (37.4) (-19.0) 0.50 (-0.3) 0.50 (-1.5) -0.50 (-1.2) -0.50 (-0.9) 0.51 (1.6) 0.50 (0.8) 7 2 = 0.5 -0.51 (-2.3) -0.50 (-0.8) 7 3 = -0.5 0.51 (1.8) 0.50 (1.1) 7 4 = 0.5 * Quantity differs from 1, p < 0.1, + Coverage probability is 7 0 = 0.5 7 1 = -0.5 Interval Estimation Point Estimation Parameter 0.59* 0.59* 0.14* 0.14* 0.14* 0.14* 0.12* 0.12* 0.13* 0.13* less than 80%, p < 0.1 BPSA Coverage Length 0.81 0.48 PSA Coverage Length 0.44 0.75+ 0.78 0.82 0.82 0.80 0.81 0.56 0.66 1.43 0.68 1.37 0.43+ 0.24+ 0.03+ 0.20+ 0.35+ 0.51 0.62 0.77 0.64 0.97 0.68+ 0.70+ 0.74+ 0.70+ 0.68+ 0.01 0.01 0.01 0.01 0.01 0.79 0.80 0.80 0.77 0.78 0.19 0.19 0.19 0.19 0.19 Table 5.3: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #3: B = -0.5 £ 1 = 0.5 6 =-0.5 & = 0.5 £ 4 = -0.5 £ 5 = 0.5 Interval Estimation Point Estimation Parameter P S A Sample mean (z-score) -0.49 (0.5) Rel efficiency Rel. M S E 1.03 1.03 0.49 (-0.9) -0.47 (2.6) 0.52 (1.1) -0.49 (1.0) 0.53 (1.7) 0.48 (-3.4) -0.41 (7.6) 0.44 (-4.4) -0.38 (8.2) 0.48 (-1.6) 1.01 1.02 1.01 1.01 0.95* 0.98 0.91* 0.97 0.87* 0.95* 0.81 0.77 0.80 0.82 0.80 0.38 0.60 0.73 0.76 0.77 0.79 0.74+ 0.76+ 0.78 0.79 0.37 0.57 0.71 0.74 0.75 0.-24* 0.34* 0.35* 0.41* 0.36* p < 0.1 0.78 0.78 0.82 0.75+ 0.79 0.19 0.25 0.25 0.26 0.25 0.80 0.81 0.82 0.80 0.81 0.42 0.44 0.44 0.44 0.44 2.03 (3.2) 2.00 (-0.2) 0.25* 0.34* -2.02 (-3.2) -2.03 (-3.1) 7 i = -2 2.03 (3.4) 2.01 (2.2) 0.36* 72 = 2 0.41* -2.02 (-4.1) -2.04 (-4.1) 7 3 = -2 -• 2.03(3.7) 0.37* 74 = 2 2.01 (1.9) * Quantity differs from 1, p < 0.1, + Coverage probability is less than 80%, 70 = 2 BPSA Coverage Length 0.80 0.58 PSA Coverage Length 0.81 0.57 B P S A Sample mean (z-score) -0.51 (-0.5) Table 5.4: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #4. 8 = -0.5 Ci = 2 . £2 = -2 & =2 U = -2 & =2 Interval Estimation Point Estimation Parameter B P S A Sample mean (z-score) -0.51 (-0.9) 2.03 -2.01 2.15 -2.00 2.14 (2.5) (-0.9) (4.3) (-0.2) (2.9) P S A Sample mean (z-score) -0.48 (1.6) Rel efficiency Rel. M S E 1.05 1.04 1.89 (-9.8) -1.75 (16.2) 1.09 (-30.3) -1.64 (19.5) 1.52 (-7) 0.95 0.92 1.32* 1.22 0.52 0.78* 0.56* 0.42* 0.63* 0.48 0.13* 2.04 (4.6) 0.13* 2.00 (-0.9) 0.08* 0.09* -2.00 (-0.1) -2.03 (-4.3) 7 i = -2 0.08* 0.08* 2.00 (-1.0) 2.04 (4.9) 7 2 =2 0.06* -2.04 (-4.4) 0.07* -2.00 (1.1) 73 = -2 0.06* 0.06* 2.03 (4.0) 7 4 =2 2.00 (-1.0) * Quantity differs from 1, p < 0.1, + Coverage probability is less than 80%, p < 0.1 •7o = 2 BPSA Coverage Length 0.66 0.78 PSA Coverage Length 0.77 0.65 0.77 0.82 0.80 0.81 0.81 0.56 0.71 1.46 0.90 1.46 0.65+ 0.60+ 0.26+ 0.55+ 0.46+ 0.54 0.70 1.06 0.87 1.19 0.80 0.73+ 0.78 0.79 0.79 0.04 0.05 0.05 0.05 0.05 0.82 0.83 0.81 0.79 0.79 0.42 0.44 0.44 0.44 0.44 Discussion The results of the simulation study indicate that BPSA does a better job of estimating £ and 7. For estimation of 0, the methods are very comparable, with BPSA point estimators appearing to perform slightly worse under the "realistic" Design #1. Contrary to intuition, differences in inferences between B P S A and PSA may be substantial. This is particularly true when the true data generating mechanism is given by Designs #2 or #4. To explain this behavior, we observe that Design #2 involves large values of £ and small values of 7. Because the components of 7 are similar, correct classification of patients to bins using PSA is error prone. Because the components of £ are large and heterogeneous, bin misclassification adversely affects estimation of £ since study subjects with very different outcome risks are being grouped together. Under Design #2, patient outcomes contribute a great deal of information about the propensity scores. BPSA has an advantage because it uses this information while PSA ignores it. While the simulations demonstrate that BPSA can perform well compared to PSA, the findings may not generally apply to the analysis'of typical epidemiologic data. In the statin data example of Section 4, the risk of Y as a function of the propensity score is not particularly heterogeneous. In Figure 4.2 we see that this risk function is a fairly smooth function of Z. Future simulations studies could try to more carefully mimic real observational studies. It is interesting that PSA point estimates of 8 are more efficient than for BPSA. Note that B is the primary parameter of interest because it models the treatment effect. The parameters £ and 7 are essentially nuisance parameters. Because BPSA does a better job of estimating the propensity scores, we would expect that this should improve control of confounding and estimation of B. BPSA and PSA both use 74 estimated propensity scores, but BPSA incorporates modelling information about the relationship between Y and Z given X. While the simulations indicate that this improves estimation of 7 and £, the BPSA approach may be harmful to the efficiency of point estimation of B. For interval estimation of (3, BPSA seems to perform favorably. For example, under Design #2, BPSA interval estimates of B. have nominal coverage probability while those calculated from PSA do not. A feature of our simulation study is that we have evaluated BPSA and PSA assuming that the quantities C i , 02,03,04 which define propensity score bins are known. Because of the manner in which the data are simulated, we can determine the quintiles of Z = P(X = 1|C) exactly. In contrast, in the practical application of BPSA, C i , C 2 , c , c are unknown. We recommend estimating the propensity scores for each 3 4 subject using PSA, and then calculating the quintiles of the empirical distribution. Thus a perceived limitation of our simulation is that we are not evaluating BPSA and PSA as they would be applied in practice. However, another perspective is that Ci,iC2,03, c form part of the specification for the model for P(Y — l\x, c). If we were 4 to estimate C i , C2, 03, c when applying BPSA and PSA to the synthetic datasets, then 4 the interpretation of the parameters (£1, £2, £3, £4, £5) would change from one dataset to the next. Averaging point estimates over repeated simulations would not be meaningful. Instead, we study BPSA and PSA under repeated application to data where C i , c , c , 04 take on specific values. 2 3 75 5.2 Simulation study when the distribution of the outcome follows a conventional regression model for Y on X and C A limitation of the BPSA method is that it relies on more modelling assumptions than PSA, in some sense. Both methods use the same models for the data, but they' handle information in' different ways. PSA estimates the propensity scores using the marginal model for P(X = l|c), whereas BPSA estimates propensity scores using models for both P(Y = l\x,c) and P(X — l|c). Consequently, we might expect that the PSA methodology is more robust. If the model for P(Y = l\x, c) does not follow equation (3.1), then this would adversely affect BPSA because the method would be classifying study units into propensity score bins based on an incorrect model. The improved performance of B P S A that is observed in simulations may be sensitive to modelling assumptions. To investigate this further, we repeat the simulations of Section 5.1 by generating synthetic datasets using a more conventional regression model for Y on X and C. Simulation design We consider the case where C has four continuous components (thus a 5 x 1 vector with first component is equal to one for the y-intercept), and we simulate datasets where the outcome variable Y follows the regression model logit{P(y = l\x,c)} 76 = rx + c'p, (5.1) rather than the propensity score model given in equation (3.1). Equation (5.1) is a logistic regression model of Y on X and C with treatment effect r and covariate effects p. We consider four different simulation designs with model parameters given by 7• Design r p #1 -1/2 (1/2, -1/2, 1/2, -1/2, 1/2) #2 -1/2 #3 -1/2 (1/2, -1/2, 1/2, -1/2, 1/2) (1, -1, 1, -1, 1) #4 -1/2- (1,-1,1,-1,1) (1,-1, 1,-1, 1) (1, -1, 1, -1, 1) . (1/2, -1/2, 1/2, -1/2, 1/2) (1/2, -1/2, 1/2, -1/2, 1/2) and datasets of fixed sample size n = 1000. The simulation designs parallel those of Section 5.1. Designs #1 models the case where the components of C are modestly associated with X and Y, while Designs #2, #3 and #4 model stronger associations. Thus we consider instances where C are strong or weak confounders for the effect of XonY.. For each design, we generate and analyze 400 synthetic datasets using the following algorithm: 1. Generate the n x 5 design matrix c. The first column is a column of ones. The latter four columns are the sampled covariates for the dataset of size n. Each element of each column is simulated as an independent draw from a N(0,1) random variable. 2. Generate the n x 1 vector x using the logistic regression model of equation (3.2), where 7 = (70,71,72,73,74) is a 5 x 1 vector. 3. Generate the n x 1 vector y using the logistic regression model given in equation (5.1). 77 4. Because of the way the simulation is designed, we have ' C ~ AT( o, Ylt=i ^f) f ° 7 r 7 fixed 7. Thus the values c\, c , C3, C4 defining the true quintiles of the propensity 2 score are given exactly by c — expit{7 -r-(^t=i jf)qk} f ° k — 1, 2,3,4 and r 0 k $ (0.2A;), where is the quantile function of a N(0,1) random variable. _1 Given Ci,02,03,04, = analyze the datasets using BPSA and PSA to obtain point and 80% interval estimates for 8, £ and 7 from each method. A feature of this simulation is that we are generating Y, X and C using one model but then analyzing the data using a different model. BPSA and PSA give limiting estimates which will differ from r, p and 7 used to generate the data. In large samples, PSA yields estimates of the quantity 7* which solves E { — logppQlQ)} =0, where p(Xi\Ci) — exp{Xi(j Ci)}/(l (5.2) + exp{7 Ci)} from equation (3.2), and the ex- T r pectation is with respect to the distribution of X^ given C{ used to generate the data. PSA also estimates (/3*,£*) which solves E\^jj \ogpiX \X C ) i h . =0, ii 7=7* where p ( y | X , d) = e x p f y ^ + £ i i T ( a , 7*))}/(l + exp{/3X, + £ ( Q , ) } ) from T 5 5 7 equation (3.1), with 7* as the solution to equation (5.2), and where the expectation is with respect to the distribution of Yi given Xi and Q in equation (5.1). The quantity 7* is identical to the true value of 7 used to generated the synthetic data because PSA estimates the propensity scores using the correct model specification for p(Xi\Ci). But 8* may differ from r in-equation (5.1) because the quantities parametrize different regression models for the outcome. 78 In contrast, BPSA estimates (/?*,£*, 7*) which solves E d W,£,7) { l o g p M X i . C O + logptXilCO} =0 The quantities (/?*,£*, 7*) give the best overall fit for the BPSA model, and they need not equal the limiting estimates obtained from PSA. In particular, 7* from BPSA need not equal 7 used to generate the synthetic data. Thus in this particular simulation, when the outcome variable follows equation (5.1), BPSA may give propensity score estimates which are asymptotically biased. This raises questions about identifying suitable targets of inference in simulation. We focus on the quantities estimated from PSA. The reason is because these parameters model the best fit for the model given in equation (3.1) in the case where the true propensity scores are know. Thus, for example, £* model the average outcome risks within the propensity score bins. To compute these quantities, we write the estimating equation for PSA as n(0*) = E{R[Y - expit(#0*)]} = 0 (5.3) where R is a 6 x 1 vector with first component equal to X and remaining five components equal to the 5 x 1 vector g(z(y,C)). Additionally, 9* = [/?*,£*,£*,£3,£4,£*]. The expectation is with respect to the true joint probability density function for Y,X and C. The solution to equation (5.3) yields the density P(Y = c) = exp\t{(3*x + g(z(j, C)'t,*)} with the smallest Kullback-Liebler distance from the density expit-fYz+c'p)}, where r and p are equal to the values from the specific simulation design [38]. Because R is categorical, we can calculate the solution to this estimating equation. 79 numerically using Monte Carlo integration and the Newton Raphson method. Equation (5.3) is the expectation of the score function for a single observation of logistic regression. It can be written as a finite sum of quantities which can be estimated by Monte Carlo, and which depend on 6*. If we take the derivative of this sum with respect to 9*, then we can use Newton Raphson to solve 0,(6*) = 0. Complete details are given in the Appendix. As discussed in Section 2.4, simulation studies of PSA in which a dichotomous outcome variable Y is generated using models like in equation (5.1) are common in the literature (see for example [28, 29, 34, 35, 43]). But investigators often treat the quantity /3*, calculated from PSA, as a point estimator for r rather than (3* [34, 35]. But Austin and others [28-30] show that in general we have r ^ (3* because of noncollapsibility of the odds ratio. Thus [3* will be asymptotically biased for r. The alternative approach that we adopt here to study the operating characteristics of [3* and £* as estimates of f3* and £* rather than r and p. Results Table 5.5 summarizes the performance of point and interval estimates for (3*, f* and 7 from PSA, in the case where datasets are simulated according to Design #1 with sample size n = 1000. For PSA, we have 7* = 7. The left most column gives the true values of B* and £* which are estimated using the Monte Carlo algorithm given in the Appendix. The uncertainty in these quantities is negligible with standard errors less than 1 0 . As we discuss in more detail below, the PSA point estimates computed -5 from analyzing the synthetic data appear to be essentially unbiased. This gives us some reassurance that the Monte Carlo algorithm has been implemented correctly. As in Section 5.1, the 'second and third columns in Table 5.5 provide descriptive 80 information about the magnitude of bias for each method, while the fourth and fifth columns in the table contain the estimated relative efficiencies and relative MSE of BPSA point estimators compared to PSA. Tables 5.6, 5.7 and 5.8 are identical to Table 5.5, but correspond to data simulated under Designs #2, #3 and #4. With respect to the treatment effect parameter /?*, both B P S A and PSA perform comparably well. BPSA point estimates appear to be slightly less efficient with larger overall MSE, although the difference in the estimated variances is not statistically significant. BPSA interval estimates are generally longer on average, but this does not appear to greatly impact coverage. The results of the simulation study indicate that point estimates for £* calculated from BPSA have inferior performance compared to PSA. They are less efficient and badly biased. In each of the four tables we see that the relative efficiencies are significantly greater than one, indicating that the variance of BPSA estimates of f* are greater than for PSA. The BPSA point estimates are biased with z-score values in the range of 5 to 10 or more. This increases mean squared error. B P S A interval estimates of £* also perform worse compared to PSA. In each of the tables, they have lower coverage probability and greater average length. 81 Table 5.5: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Desi; #1Point Estimation Parameter 3* = -0.44 & G Q £ = = = = = Interval Estimation P S A Sample mean (z-score) -0.44 (0.2) Rel efficiency Rel. M S E 1.02 1.04* -0.97 (-8.3) 0.92 (8.8) 1.43 (7.3) 1.98 (7.3) 2.93 (14.4) -0.87 (0.7) 0.83 (1.1) 1.33 (-0.7) 1.83 (-2.8) 2.65 (-0.7) 1.33* 1.15* 1.19* 1.27* 1.76* 1.56* 1.37* 1.35* 1.41* 2.67* 0.76+ 0.77 0.78 0.79 0.72+ 0.50 0.66 0.66 0.73 ' 0.85 0.76+ 0.80 0.78 0.77 0.78 0.42 0.56 0.58 0.60 0.67 1.37* 0.86* 0.86* 0.78* 0.77* < 0.1 0.60+ 0.67+ 0.66+ 0.66+ 0.68+ 0.16 0.15 0.15 0.14 0.14 0.77 0.82 0.80 0.80 0.79 0.19 0.19 .0.19 0.19 0.19 -0.87 0.81 1-33 1-87 2.66 To = 0.5 7i = -0.5 7 2 = 0.5 7 3 = -0.5 7 4 = 0.5 * Quantity differs 0.50 0.50 (0.1) -0.49 (3.9) -0.51 0.50 0.49 (-4.0) -0.49 (4.5) -0.51 0.51 0.48 (-4.9) from 1, p < 0.1, + Coverage (0.2) 1.37* 0.85* (-3.1) (1.2) 0.83* 0.75* (-2.0) 0.74* (2.1) probability is less than 80%, p BPSA Coverage Length 0.78 0.40 PSA Coverage Length 0.78 0.40 B P S A Sample mean (z-score) -0.46 (-3.1) Table 5.6: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #2Interval Estimation Point Estimation Parameter BPSA Coverage Length 0.84 0.46 PSA Coverage Length 0.82 0.45 B P S A Sample mean (z-score) -0.43 (-3.0) P S A Sample mean (z-score) -0.39 (1.1) Rel efficiency Rel. M S E 1.04* 1.06* & = 1-50 C3 = 2.52 a = 3.57 a =5-03 -1.78 (-12.8) 1.61 (9.1) 2.68 (10.3) 3.80 (11.5) 5.59 (16.9) -1,58 (1.3) 1.48 (-1.6) 2.48 (-3.1) 3.51 (-3.1) 5.08 (1.1) 1.33* 1.12 1.21* 1.26* 0.49 1.87* 1.35* 1.50* 1.63* 0.85 0.74+ 0.82 0.78 0.74+ 0.70+ 0.69 0.77 0.82 1.0.0 1.51 0.70+ 0.80 0.76+ 0.68+ 0.74+ 0.52 0.64 0.67 0.76 1.09 = 0.5 = -0.5 = 0.5 = -0.5 = 0.5 * Quantity differs 1.64* 0.53 (5.8) 0.51 (2.0) -0.50 (-0.5) 0.65* -0.48 (5.3) 0.50 (-0.6) 0.78* 0.49 (-3.8) -0.51 (-1.5) 0.69* -0.49 (4.7) 0.50 (-0.2) 0.63* 0.48 (-5) from 1, p < 0.1, + Coverage probability is less than 80%, p 1.76* 0.70* 0.80* 0.72* 0.67* < 0.1 0.44+ 0.56+ 0.57+ 0.59+ 0.55+ 0.13 0.11 0.11 0.11 0.11 0.80 0.82 0.82 0.82 0.81 0.19 0.19 0.19 0.19 0.19 8* = -0.40 Ci = -1.60 7o 7i 72 73 74 Table 5.7: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Design #3. B* = -0.41 = & = ?3 = Q = Q = -0.87 0.80 1-31 1.85 2.63 B P S A Sample mean (z-score) -0.43 (-1.8) -0.93 (-6.3) 0.85 (4.9) 1.38 (5.5) 1.90 (3.7) 2.81 (10.7) 1.00 -0.98 7 i = -1 0.98 72 = 1 -0.98 7 3 = -1 0.99 74 = 1 * Quantity differs from 7o = 1 Interval Estimation Point Estimation Parameter PSA Sample mean (z-score) -0.40 (1.1) -0.88 0.79 1.30 1.81 2.61 (-0.3) (-0.7) (-0.7) (-2.8) (-0.8) BPSA Coverage Length 0.80 0.46 PSA Coverage Length 0.79 0.46 Rel efficiency Rel. M S E 0.99 1.00 1.18* 1.09* 1.04 1.03 L22* 1.30* 1.16* 1.11* 1.04 1.56* 0.80 0.85 0.80 0.82 0.74+ 0.46 0.63 0.67 0.74 0.83 0.78 0.80 0.77 0.78 0.78 . 0.41 0.57 0.61 0.66 0.72 1.29* 1.24* 1.33* 1.18* 0.99 < 0.1 0.69+ 0.65+ 0.69+ 0.67+ 0.70+ 0.23 0.23 0.23 0.23 0.23 0.78 0.82 0.86 0.79 0.77 0.24 0.26 0.26 0.25 0.26 1.32* (-0.1) 1.01 (3.0) 1.19* -1.01 (-2.5) (4.7) 1.28* 1.01 (1.8) (-4.3) 1.16* (3.3) -1.01 (-1.6) 1.00 (-2.2) 1.01 (2.7) 1, p < 0.1, + Coverage probability is less than 80%, p Table 5.8: Performance of point and interval estimators from BPSA and PSA, when data are simulated under Desij #4. B* = -0.35 £* = -1.60 Q = 1.47 H = 2.48 41 = 3-52 S = -9 4 7 Rel. M S E BPSA Coverage Length 0.51 0.80 B P S A Sample mean (z-score) -0.38 (-3.5) PSA Sample mean (z-score) -0.35 (-0.3) Rel. efficiency 1.00 ^ 1.03 -1.70 (-8.6) 1.56 (6.9) 2.59 (7.1) 3.65 (6.6) 5.33 (12.7) -1.61 (-0.4) 1.49 (1.7) 2.48 (0.3) 3.49 (-2.1) 5.00 (1.3) 1.09 1.09* 1.11* 1.09 1.16* 1.29* 1.21* 1.24* 1.19* 1.63* 0.78 0.79 0.76+ 0.78 0.73+ 0.58 0.72 0.79 0.94 1.35 2.10* 2.20* (2.5) 1.15* 1.13* (-3.0) 1.01 1.00 (2.0) 1.15* 1.16* (-1.0) 0.98 0.98 (1.2) probability is less than 80%, p < 0.1 0.47+ 0.59+ 0.60+ 0.58+ 0.57+ 0.20 0.19 0.19 0.19 0.19 1.03 (4.9) 1.01 -1.00 (0.1) -1.01 7 i = -1 1.01 1.00 (-0.8) 72 = 1 -1.01 -0.99 (1.9) 7 3 = -1 74 = 1 0.99 (-1.2) 1.00 * Quantity differs from 1, p < 0.1, + Coverage 7o = 1 Interval Estimation Point Estimation Parameter PSA Coverage Length 0.76+ 0.50 0.73+ 0.78 . 0.77 0.76+ 0.70+ 0.50 0.65 0.70 0.80 1.11 0.80 0.80 0.78 0.81 0.76+ 0.24 0.26 0.26 0.25 0.26 The simulations indicate that BPSA point and interval estimates of 7 occasionally have better performance compared to PSA. When the true values of the components of 7 are small in magnitude (Designs #1 and #2), we see an improvement in estimation of 71,72,73,74 due to better efficiency. The relative efficiencies are significantly less than one. Whereas when the components of 7 are large, the efficiency of BPSA and PSA point estimates for 7 are essentially the same. Across all the simulation designs, BPSA point estimation of the quantity 70 is always poor. BPSA interval estimates for 7 perform much worse than PSA. They have smaller average length across all four simulation designs, but this is not accompanied by proper coverage probability. Thus BPSA appears to produce interval estimates which are falsely precise. In contrast, PSA interval estimates always have correct coverage levels. Discussion The simulations indicate that when the model for Y obeys equation (5.1), many of the advantages of BPSA over PSA disappear. Section 5.1 indicated that BPSA may yield improved point and interval estimation of £* and 7. Across each of the simulation designs in Tables 5.1 through 5.4, BPSA point estimates of 7 were more efficient compared to PSA and had smaller MSE. This led to improved classification of study units into propensity score bins and better estimation of £*. When the model for P(Y = l\x, c) follows the model of equation (5.1), we see nearly the reverse results. The estimates of the components of 7 are sometimes more efficient compared to PSA, but are generally highly biased. BPSA does not appear to give good estimates of the propensity scores, consequently, we also see poor estimation of £*. Inspection of Tables 5.5 through 5.8 reveals that PSA interval estimates of £* do not have nominal coverage probability. This would be expected even if 7 were known 86 because the model for P(Y = l|x,c) is incorrect. Standard practice for maximum likelihood estimation under misspecified models uses a robust "sandwich" estimate of the estimator variance [38]. PSA uses no such approach to interval estimation, and confidence intervals should not be expected to have the correct coverage levels. This reasoning does not apply to PSA interval estimation of 7 because the marginal model for P(X = 1|C) is correctly specified. Comparing the results from Section 5.1 and 5.2, we see that when the model for P(Y — l\x, c) is correctly specified, PSA yields interval estimates for f* which are too narrow, presumably because PSA ignores uncertainty in 7. In this case, Section 5.1 shows that using BPSA to model uncertainty in 7 yields interval estimates of £* with correct coverage levels. In contrast, if the model for P(Y = l|a:,c) is incorrect, then neither method can be expected to give interval estimates with proper coverage. One feature of Tables 5.5 through 5.8 is that BPSA estimates of the quantity £ 5 seems to be particularly bad. Under each of the four simulation designs, the z-score for the BPSA sample mean is between 15 and 20. This is a likely consequence of the simulation design involving sparse data within the fifth propensity score bin. In this case, PSA seems to produce point and interval estimates with better performance. As was the case in Section 5.1, BPSA and PSA perform comparably well in point and interval estimation of the treatment effect 3*. PSA point estimates are slightly more efficient across the four simulation designs. For interval estimation, we see a small increase in the average length for BPSA, and this is accompanied by an increase in coverage probability. But the differences between the two methods are sufficiently modest to be swamped by simulation error. 87 5.3 Predictive performance in real and simulated data Simulations studies have the disadvantage that they only describe estimator performance where the data are generated under certain specific circumstances. These may not be representative of real epidemiologic investigations such as the statin data example. Consequently, our ability to generalize about the performance of BPSA is somewhat limited. One strategy for characterizing performance for the statin data is to investigate prediction error using cross validation. The original dataset involved 4572 patients discharged from hospital between 1999-2001. However, data from an additional 4599 patients discharged the following year are also available for a total of 9171 observations. If we randomly split the the entire collection of data in half, we can use the first half (called the build data) to construct a predictive model to estimate the probability of death for a future patient. We can then study prediction error when applied to the other half of the data (called the test data). To average over variability in the choice of build data, we can replicate the random splitting of the data. Denote a random sample of data of size n — 4500 as (y, x, c). Let. ( Y * , X * , C*) denote the data for a future patient from the same population for whom only X* and C* are observed. Let (3, £ and j denote the point estimates for model parameters from PSA applied to (y, x, c), and define Y SA P = expit{[3X* + g(z(C*,j))'£} as the predictive model from PSA which estimates P(Y* — 1\X*, C*). The posterior 88 distribution for (/?,£,7) from BPSA of (y, x, c) is P ( / ? , £ , | y , x, c). Define 7 YBPSA = /II expit{/?X* + g{z{C\ ))'^P{B,i, \y, 1 1 x, c)dBdi<%. as the predictive model from BPSA for estimation of P{Y* = \\X*, C*). The quantity is prediction based on substitution of (f3, £, 7) into the propensity score model a YPSA for the outcome, while estimate YBPSA Y PSA B is the posterior predictive distribution for Y*. The acknowledges uncertainty in the bin group membership of the future patient, while YPSA does not. To give a sense of the extent that predictions from BPSA and PSA may differ, Figure 5.1 plots Y s P A versus YBPSA for the 4572 patients from the original dataset of Chapter 4. In other words, we analyze the data using BPSA and PSA, and we plot YPSA versus YBPSA based on predictions within the same data. In Figure 5.1, we see that there is disagreement between predictions. The estimates take on YPSA at most ten different values because patients can be classified into only one of ten different treatment-bin combinations. BPSA averages over uncertainty in 7, or rather, uncertainty in bin membership. We see that the distribution of Y PSA B given YPSA is highly variable. We quantify prediction error using the loss function L(Y, Y*) = -{Y* log(y).+ (1 - r*) log(l - Y)], where Y is a prediction. This is called the predictive log score [5, 44] because, for a sample of data Y i , . . . , Y and predictions Y i , . . . , Y , the quantity — N N L(Yi, Y*) is the log-likelihood for the data calculated from the predictive model. We define prediction error as the expected loss E[L(Y,Y*)], where the expectation is with re- 89 Figure 5.1: Estimated risks of death in statin data for B P S A compared to PSA. {YBPSA versus Y ) PSA spect to variability in both in Y and Y*. Randomness in Y arises from the random sampling of both the build data (y, x, c) and X*,C*. Because of the way the loss function is defined, small prediction errors are desirable. Predictions with small error have the appealing property that they assign high probability to the observed data, irrespective of how those data are generated. When comparing predictive models, the model with the smaller prediction error has the smaller Kullback-Leibler distance from the true distribution which generated the data. For a dichotomous variable Y with density /(y) and a predictive model /(y), this distance is given by E = £?[io (/(y))]-£;[iog(/(y))]. log g The first term does not depend on the choice of predictive model, while the second term is equal to the prediction error. This is because log{/(F)} = Ylog{/(l)} + (1 - Y) log{l - /(0)}. Hence minimizing prediction error gives a predictive model which approximates the true distribution of the data. We can estimate both E[L(Y PSA, B Y*)] and E[L(Y SA,Y*)} P for the statin data using a variation of 5-fold cross-validation. From the total sample of 9171 patients, we randomly select n = 4500 patients without replacement. We analyze the dataset to obtain predictive models, and we then evaluate these predictions using the remaining 91 i. 4671 patients by calculating ^ 4671 4>BPSA = -^^^L(Y psA,i,Y*) i=l 4671 B 1 = "4671 E K * M * W , i) + (1 - Y*) log(l - f B P S A 01 i=l and ^ 4671 <t>PSA = L S 1 = -^f[^2 (Yp A,i,Y*) i—1 4671 - ^ E ^ i=l l o s ( ^ , z ) + (i-F;)iog(i-y> )], 5 A t where the index i is over observations in the test data. The quantities ((>PSA are unbiased estimates of E[L(Y PSA-, Y*)\y, B 4>BPSA x, c] and E[L(Y A,Y*)\y, PS and x, c)] respectively, where the conditioning is with respect to the build data. Thus 4> PSA B and 4> SA provide "half the story" of predictive performance in the sense that they P quantify prediction error for specific build data. To fully characterize we can repeatedly split the data and examine the sequence of <J) PSA B E[L(Y PSA, B Y*)] and 4>PSA over the random splittings. We do not consider traditional 5-fold cross-validation where the data are split into five, parts. In the analysis, of the statin data, we found that the computational cost of analyzing four fifths of the 9171 observations for the statin data was high for BPSA. The large sample size combined with a fairly inefficient M C M C algorithm, which requires individual updating of the components of 7 one at a time over many measured covariates, reduced the efficiency of the algorithm. Instead our approach was to assess predictive performance using the "random splitting" approach described here. 92 Table 5.9: The quantities <J>BPSA and (f)psA from five different random splittings of the statin data into build data (n = 4500) and test data (n — 4671). &BPSA Mean Table 5.9 presents 4>BPSA 4>PSA 0.40 0.38 0.39 0.39 0.39 0.39 0.47 0.45 0.46 0.45 0.46 0.46 and (ppsA for five random splittings of the data. Each row in the table corresponds to C^BPSA and 4>PSA for one random splitting. Standard errors for the estimates are calculated as the sample standard deviation of the replicates of L(Y,Y*) over subjects in the test data, and they are less than 0.008. For each row in the table, estimates YBPSA 4>BPSA is significantly smaller than 4>PSA- Thus the prediction calculated by applying BPSA to the build data have smaller predic- tion error compared to the corresponding estimates calculated from using PSA. The improvement in prediction persists across random splittings of the data. To shed some insight into the results, we can attempt to confirm these findings using simulations. Because of the flexibility of simulation studies, we need not worry about correlation in estimated prediction error because of repeated re-analysis of the same data. Instead we can simulate a sequence of build datasets, analyze them to obtain predictive models, and then estimate prediction error using massive simulated test datasets. Furthermore, in simulations the quantity YTRUE = P(Y* = l\X*, C*) = expit(/3X* + g(z(C*, 7)'£)) is known exactly because we know 3, £ and 7. We can 93 calculate the quantity 1 4>TRUE = N . ^ L(XTRUE, i=l 1 -^D i* i,y*) N = y l o e(^^,i)+(1-^)10^1-^^,0], i=l where N is the size of a simulated test dataset. E[L(P(Y* — The quantity 4>TRUE estimates 1), y*)|y, x, c)] which is the smallest possible prediction error because the true model has Kullback-Leibler distance of zero from itself. Thus calculating 4>TRUE for each build dataset gives a benchmark of the best predictions and quanti- fies the extent that BPSA improves upon PSA. Accordingly, Table 5.10 presents estimates of 4>TRUE, 4>BPSA and (f>psA for sim- ulated build datasets of sample size n = 1000 under the four simulation designs described in Section 5.1. In other words, we generate ten synthetic datasets using the propensity score model given in equations (3.1) and (3.2). For each design, a row in the table corresponds to one simulated build dataset. We analyze the simulated build dataset using PSA and BPSA, and then study the predictive performances by calculating the quantities <\>BPSA and (f>psA using large test datasets (n = 40000). Standard errors for the quantities are less than 0.001 for Designs #1 and #3, and 0.003 for Designs #2 and #4.. In Table 5.10, 4>BPSA is always less than or equal to (fipsA, irrespective of the simulation design or build dataset used to obtain the predictive models. As expected, 4>TRUE is less than ()>BPSA because BPSA cannot have smaller prediction error than the true model. To give an indication of performance across the ten simulated build datasets, the bottom row of each table reports the averages across the ten build datasets. 94 Table 5.10: The quantities 4>BPSA, <PTRUE and generated under Designs #1, #2, #3 or #4. <J>PSA from ten simulated datasets Design #1 Design #3 't'TRUE 0BPSA 4>PSA Mean 0.67 0.67 0.67 0.67 0.67 0.67 0.67 0.67 0.67 0.67 0.67 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.68 . 0.68 0.68 0.68 0.68 0.68 0.68 0.67 0.67 0.68 0.68 4>TRUE 4>BPSA 4>PSA 0.67 0.67 0.66 0.66 0.67 0.67 0.67 0.66 0.67 0.66 0.67 Design #2 <pTRUE Mean 0.40 0.40 0.40 0.40 0.40 0.41 0.40 0.40, 0.39" 0.39 0.40 . 0.67 0.68 0.68 6.68 0.68 0.68 0.67 0.68 0.68 0.68 0.68 Design #4 4>BPSA 4>PSA 0.40 0.42 0.40 0.41 0.41 0.41 0.40 0.40 0.41 0.40 0.41 0.67 0.67 0.67 0.67 0.67 0.67 0.67 0.68 0.67 0.67 0.67 0.50 0.47 0.47 0.51 0.49 0.52 0.49 0.52 0.45 0.52 0.49 95 4>TRUE (pBPSA 0.40 0.40 0.39 0.40 0.39 0.40 0.40 ,0.40 0.40 0.40 0.40 0.40 0.41 0.40 0.41 0.40 0.40 0.40 0.40 0.40 0.40 0.40 &PSA 0.42 0.47 0.47 0.49 0.43 0.43 0.46 0.43 0.45 0.45 0.46 From Table 5.10 we see that BPSA predictions strongly outperform PSA under Designs #2 and #4. For Designs #1 and #3 we see more modest improvements. This fits well with the findings of Section 5.1, where we studied the performance of point and interval estimation using the same simulation designs. Design #2 corresponds to the case where the components of £ are large in magnitude and heterogeneous while 7 is small. Thus in model fitting the outcome variable supplies a lot of information about the propensity score. PSA is disadvantaged because it does not use this information. In Table 5.2, PSA point estimators of £ are badly biased because error in propensity score estimation causes study units to be misclassified into propensity score bins. Poor estimation of £ and 3 should be expected to yield unreliable predictions for Y given X and C. For Designs #1 and #3 we see smaller improvement in the quality of predictions for BPSA versus PSA. For these designs, B P S A point estimators for £ have only modestly improved mean squared error compared to PSA, so we would not expect large improvements in predictive performance. To study the sensitivity of these findings across different specifications of the outcome model, we study predictive performance in synthetic data generated according to the regression model of Section 5.2. Table 5.11 presents the quantities 4>TRUE ^BPSA, and ifipsA from datasets generated according to Designs #1 through #4 from Section 5.3. The layout of the table is identical to that of Table 5.10. Standard errors are less than 0.002 for Designs #1 and #3, and 0.003 for Designs #2 and #4. For each of the simulated datasets, we see a tendency for fa RUE < 4>BPSA < 4>PSA, indicating that the model from BPSA yields better predictions than PSA. The differences <PBPSA — (/>psA'®ce fairly modest compared to the case where the model for the outcome is correctly specified. The simulations show that the improvement in predictive performance for BPSA relative to PSA is sensitive to correct model specification for the outcome. Comparing 96 Table 5.11: The quantities (^BPSA, 4>TRUE and 4>PSA from ten simulated datasets generated under Designs #1, #2, #3 or using the conventional regression model of equation (5.1). Design #1 4>TRUE Mean 0.61 0.61 0.60 0.61 0.61 0.61 0.61 0.61 0.61 0.60 0.61 4>BPSA 0.62 0.62 0.61 0.62 0.62 0.61 0.61 0.62 0.62 0.61 0.62 Design #3 (f>PSA 0.62 0.62 0.62 0.63 0.62 0.62 0.62 0.62 0.62 0.62 0.62 PTRUE 0.61 0.62 0.61 0.62 0.62 0.62 0.62 0.61 0.62 0.61 0.62 Design #2 4>TRUE Mean 0.46 0.46 0.46 0.46 0.46 0.46 0.46 0.46 0.46 0.46 0.46 4>BPSA 0.47 0.47 0.48 0.47 0.47 0.48 0.47 0.48 0.48 0.47 0.47 PBPSA 0.62 0.63 0.62 0.62 0.63 0.62 0.62 0.63 0.63 0.62 0.62 (PPSA 0.63 0.63 0.62 0.63 0.63 0.63 0.63 0.63 0.63 0.63 0.63 Design #4 (f'PSA 0.48 0.49 0.48 0.48 0.48 0.48 0.48 0.49 0.48 0.48 0.48 97 4>TRUE . 0.47 0.47 0.47 0.47 0.47 0.46 0.47 0.47 0.47 0.47 0.47 4>BPSA 4>PSA 0.48 0.48 0.48 0.49 0.49 0.48 0.48 0.49 0.48 ' 0.48 0.48 0.48 0.48 0.49 0.49 0.49 0.48 0.49 0.50 0.49 0.48 0.49 Design #2 in Tables 5.10 and 5.11, we see sensitivity in predictive performance. In the bottom rows of the tables, we report the mean predictive performance when averaged across the build datasets. For Table 5.10, comparing BPSA to PSA, we have a mean difference of 0.49 - 0.41 = 0.08, whereas in Table 5.11 we have a mean difference of only 0.48 - 0.47 = 0.01. Thus correct model specification is important. In fact, when we look at prediction in the statin data, the improvement in performance for BPSA is better than would be expected from the simulation findings. 5.4 A n investigation of covariate balance produced by B P S A versus PSA In Chapter 4, we applied B P S A to the statin data and compared the results to those obtained with PSA. One of the findings was that PSA produced treatment and control groups with similar distributions of measured confounders, whereas BPSA produced treatment and control groups that are not particularly similar. PSA appeared to do a better job of reducing confounding bias. To illustrate, recall Tables 4.4 and 4.6. Such tables are commonly used to assess the adequacy of models for the propensity score. They give summary statistics for the distribution of the components of C, among treatment and control groups, within bins of the estimated propensity scores. As discussed in Section 2.3, stratifying on the true propensity score causes C to be identically distributed across treatment groups. The reason is because the distribution of X given the propensity score does not depend on C. Table 4.4 was generated using propensity scores estimates from PSA, and indicates that the method effectively reduces much of the confounding due to C. Within each 98 of the bins, the components of C are roughly evenly distributed in treatment and control. For example, while age is a strong confounding variable (see Table 1.1), the strength of the association between age and statin treatment is largely reduced within each of the bins. Table 4.6 presents identical summary statistics, based on propensity scores estimated from BPSA. Within most of the propensity score bins, there are numerous covariates with distributions that differ in treatment versus control. Thus BPSA produces worse covariate balance compared to PSA, and does not appear to effectively reduce confounding. These findings are surprising in light of the results of Sections 5.1 and 5.2. In simulations, point and interval estimates for the propensity scores calculated from BPSA perform very favorably compared to PSA when the model for P(Y = l|x,c) is correctly specified. The point estimates have smaller variance and MSE. Interval estimates have shorter length while retaining roughly nominal coverage probability. Based on these results, we might expect that stratifying on propensity scores estimated from BPSA should reduce covariate imbalances in treatment versus control and do a better job of reducing confounding compared to PSA. In this chapter we explore these apparently contradictory findings by examining the lack of covariate balance that arises in using BPSA for control of confounding. We analyze synthetic data which closely approximate the statin data, with the following two objectives: 1. To replicate the covariate imbalance that is observed when applying BPSA to the statin data. 2. To determine if poor covariate balance from BPSA may occur simultaneously with better estimation of the propensity scores. This investigation should shed insight into whether or not it is possible for improved 99 estimation of the propensity scores to result in treatment and control groups which are less similar in terms of the distribution of measured confounders. Simulation design Rather than sampling numerous datasets and studying the operating characteristics of our methods across datasets, we focus on the analysis of a pair of datasets which closely approximate the statin data. This should facilitate drawing a connection between previous simulations and the data analysis of Chapter 4. Moreover, Section 5.1 demonstrates that the improvement of point and interval estimation of 7 from using BPSA is sufficiently large that it should be detectable without repeated sampling of datasets. The two datasets of sample size n = 5000, which we denote as Dataset A and Dataset B, are generated using the following algorithm: 1. Generate c as a 5000 x 21 design matrix consisting of a column of ones and twenty columns of covariates drawn independently from N(0,1). 2. Generate y and x as 5000 x 1 response and treatment vectors for the n subjects, generated from either: • Dataset A: The propensity score model given in equations (3.1) and (3.2), with parameters B = -0.3, e = (1,0,-1,-2,-3), 7' = (-1,-0.1,0.1,-0.1,0.1,...,-0.1,0.1). 100 ] • Dataset B : The regression models given in equations (5.1) and (3.2), with parameters r = p = 7 ' = -0.3, (-2,-0.2, 0.2,...,-0.2,0.2), (-1,-0.1,0.1,-0.1,0.1,...,-0.1,0.1). The choice of design including sample size, number of covariates and parameter values is guided by the analysis results for the statin data. The values of B and £ are guided by Tables 4.3 and 4.5. The choice of r and p is guided by Table 4.1, and the choice of 7 is guided by Table 4.2. We generate the outcome under different models in order reflect the fact that the true model for P(Y = l\x, c) in the statin dataset is unknown. Results To illustrate rough similarity between Datasets A and B and the statin data, Figure 5.2 gives plots of the density f(z\X — x) for x = 0,1 and logit[P (Y = l\x, z)] for x = 0,1 and z G [0,1] for the pair of datasets. The plots are exactly comparable to Figures 4.1 and 4.2, and are produced in the same manner using the true propensity scores. Dashed and solid lines correspond to the untreated group and treated group, respectively. The plots in Figure 5.2 are fairly similar to those in Figures 4.1 and 4.2. The risk of Y within treatment groups is roughly decreasing in Z. This mimics the notion that healthy subjects are the most likely to receive treatment. Further, the untreated group has a lower distribution of propensity scores than the treated group. Thus Datasets A and B roughly approximate the statin data in the sense that they 101 have similar sample size, number of covariates, and dependencies between X, Y and C. We apply BPSA and PSA to each of the datasets. To investigate the distribution of C in treatment versus control when stratifying on competing propensity score estimates, we use a descriptive technique given by Imai and Van Dyk [45]. As was discussed in Section 2.3, Rosenbaum and Rubin [1] showed that X _LL C\Z because the distribution of X given the propensity score is equal to P(X = 1|C, z) = z which does not depend on C. This implies that X AL Cj\Z for j = 1,..., 20. Thus the following models are correct: logit[P(X = l\Cj, Z)\ = fa + OjCj + Wj-logitfZ] for j = 1,..., 20, (5.4) where fa — 6j — 0 and uij .= 1 for j = 1,..., 20, We can fit these regression models by substituting the propensity scores estimates ZBPSA or ZPSA for Z , where ZPSA ZBPSA = expit(7'C) = J expit(7'C)/(7|y,x, c)d7, and /(7|y, x, c) is the posterior distribution for 7. The point estimates for Q\, 62, • • •, #20, can be used as diagnostic tools to assess the similarity of the distribution of C in treatment versus control. If we fit the models using the true propensity scores Z = expit(7'C), then Imai and Van Dyk [45] point out that the resulting z-statistics for point estimates of #1, 62, • • •, #20 will be normally distributed with mean zero and variance equal to one. This is because the true values of 81, 62,. •., #20 are equal to zero since X AL Cj\Z. To elaborate, if we calculate the maximum likeli- 102 Figure 5.2: The density f(z\X = x) for x = 0,1 and logit[F(F = l\x, z)] for x = 0,1 and z G [0,1] for datasets A and B. Solid curves correspond to treated patients and dashed curves correspond to untreated patients. Dataset A Dataset B >^ .ti c 00 CD I * CD Q CN O 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Propensity score Propensity score Dataset A Dataset B T3 CO TCD 3 T3 T3 O CO T3 T3 O OJ O O) O CO CD I I n—i—r 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Propensity score Propensity score 103 hood estimates 61,62,..., 620, and we divide them by their estimated standard errors SE\, SE ,..., 2 SE o, then asymptotically we will have 2 -A_ SE\ SE2 ^L~jV( ,l). 0 SE20 Conditional on Z, any association between X and Cj is due to chance. We can study the distribution of the z-statistics calculated from competing estimates of the propensity scores, while using the N(0,1) density as a "benchmark" for comparison. Accordingly, we fit the regression models given in equation (5.4) for Datasets A or B, and competing propensity scores estimates Z, ZPSA and ZBPSA- In other words, we fit a total of 2 x 20 x 3 = 120 regression models, where each regression has a specific dataset (Dataset A or B), covariate Cj (j = 1,2,3... 20), and vector of propensity scores (Z, ZPSA or ZBPSA)- For each regression, we compute the z-statistic then produce normal quantile plots of -j-V, ohj\ Z or ZPSA or ..., 7^-. We for Dataset A or B and for 0E20 bhj-2 Z PSAB The results are given in Figure 5.3. The (o) symbols and (A) symbols correspond to z-statistics computed from regressing on ZPSA or ZBPSA, respectively. The (+) symbols correspond to z-statistics computed from regressing on the true propensity score Z. 104 Figure 5.3: Normal quantile plots for the z-statistics + correspond to regressing on ZPSA, ZBPSA Normal quantiles or Z ..., in either Dataset A or B. The symbols o, A and respectively. Normal quantiles Figure 5.3 illustrates that adjustment for ZPSA rather than Z produces treat- ment and control groups that are more similar with respect to C. The circle (o) symbols lie right along the horizontal axes for both Datasets A and B. The quantities , . . . , ^2Q_ , btj\ bh*2 small variance compared to a sample of size 20 from a n a v e SEj20 standard normal distribution. In contrast, when we adjust for Z we obtain the (+) symbols which, as expected, lie along the diagonal line indicating rough agreement with a standard normal density. Thus in Datasets A and B, stratifying on propensity scores estimated from PSA produces better covariate balance than stratifying on the true propensity scores. The empirical distributions of the confounders in treatment and control are more similar than would be expected if treatment were assigned at random. This does not mean that adjusting for ZPSA is a more effective strategy for reducing confounding bias in either dataset. Given Z , the distribution of C is identical in treatment and control, and we cannot have confounding. But PSA appears to reduce differences in the empirical distributions of C that arise by chance. The balancing properties of stratifying on true versus estimated propensity scores are investigated by Rosenbaum and Rubin [3] and by Rubin and Thomas [12, 13]. Rubin and Thomas [12, 13] study matched sampling in observational studies when matching on Z versus Zps A They derive analytic approximations of the variance of the difference of the sample means in the matching variables for treatment versus control in the case where C is multivariate normal. They demonstrate that matching on ZPSA can reduce the variance of the difference of the sample means by a factor of one half compared to matching on Z. Figure 5.3 illustrates that regression adjustment for ZBPSA produces greater dif- ferences in the distribution of C in treatment versus control compared to adjustment for ZPSA- The (A) symbols are much more variable than the (o) symbols. However, adjustment for ZBPSA does not appear to be any worse than adjustment for 106 Z. In Datasets A and B, the variation of the (A) symbols is not greater than that of the (+) symbols. We now investigate the quality of the propensity score estimates from applying BPSA or P S A to the data. Recall that the parameter 7 = (70,71, • • • ,720) models the propensity scores because it indexes the regression model logit[P(X = l|c)] = = 7 'c 7o + 7iCi + 72C2 + • • • + 72oC o2 Thus in Datasets A and B, the true propensity score for a subject with covariate vector C is equal to Z = expit(7 +71 C\ +72C2 + . • -+72oC2o)- Figure 5.4 plots histograms of 0 the quantities (7*, — 7fc) for /c = 0 , 1 , . . . , 20 where the jk are point estimates obtained from BPSA or PSA applied to either Datasets A, or B . For example, the top left figure has the heading "PSA applied to Dataset A " . This means that we apply PSA to Dataset A, obtain point estimates of 71,72,..., 720 denoted 71,72, • • •, 720, calculate the quantities (jk — Ik) for k = 0 , 1 , . . . , 20, and make a histogram of the result. If the histograms have small variance and are centered at zero, then this indicates that the quantities are high quality estimates of 7^. The histogram does not give us information about bias or efficiency because they are generated from the analysis of a single dataset. As expected, Figure 5.4 illustrates that in the Dataset A , B P S A provides better estimates of 7 compared to PSA. The histogram is visibly concentrated near zero because the point estimates of the components of 7 have small M S E compared to PSA. This fits well with the simulation results of Section 5.1 which indicate that BPSA performs favorably when the model for P(Y = l\x, c) is correct. In Dataset B , the model for P(Y — l\x, c) is misspecified, and Figure 5.4 indicates that BPSA point 107 Figure 5.4: Histograms of (j — j ) for k = 0 , 1 , . . . , 20 based on either BPSA of PSA analyses of either Dataset A or B. k k PSA applied to Dataset A BPSA applied Dataset A 00 o c co H o c CD CD 3 CT £. -, CT CD CM H 13 —I -0.10 1 1 0.00 0.05 0.10 -0.10 PSA applied to Dataset B o c CD H T T 1 0.00 0.05 0.10 BPSA applied to Dataset B o c CD CD 3 D" CD CT CD CM r -0.10 13 o T 0.00 0.05 0.10 -0.10 108 T 1 0.00 0.05 0.10 estimates do not appear to have any clear advantage over PSA. The variances of the histograms are roughly the same for both methods. This is also expected because the simulations in Section 5.2 illustrate that misspecification of P(Y = l|x,c) adversely affects the performance of estimates of 7. We can also study the quality of the propensity score estimates by estimating prediction error for treatment. Let (X*, C*) denote the data for a study unit drawn from the density function /(a;*|c*)/(c*), which is the same for Datasets A and B. We can study the performance of competing estimates for P(X* = 1|C*). Two estimates are the quantities Z P S A ZBPSA = = expit( 'C*) 7 J expit(YC*)/(7|y,x,c)d7, which are the estimated propensity scores for calculated from the dataset (X*,C*) used to build the predictive model for treatment. We study performance using the quantities 1 4>BPSA = N L( BPSA, Jj H Z X i) i=\ 1 $PSA N = —y^L(Z* ,X*) PSAI 1=1 1 4>TRUE = N —^2L(Z X*), I} 1=1 where L(Z,X*) is the predictive log score defined in Section 5.3, and indexing over i denotes observations in a test dataset of sample size N. Table 5.12 gives the values of 4>BPSAI 4>PSA and 4>TRUE calculated from the pre- dictive models obtained by analyzing Datasets A and B. For Dataset A , the quantity 109 Table 5.12: The quantities (f>BPSA,<t>PSA d treatment calculated from Datasets A or B. a n Dataset 4>BPSA A 0.572 B 0.574 SE's < 0.001 <I>BPSA is less than 4>PSA- 4>TRVE based on predictive models for (f>PSA 0.574 0.574 0.572 0.572 This implies that propensity scores estimates calculated from BPSA predict treatment assignment of future observations with smaller error than PSA. This is consistent with the results of Figure 5.3, and also the simulation results. In contrast, for Dataset B we see no improvement in prediction of treatment for BPSA compared to PSA. Discussion The motivation, for this section was to understand the seemingly contradictory results of Chapter 4 and 5. The simulations of Section 5.1 and 5.2, indicate that BPSA propensity score estimates perform favorably compared to PSA. We would expect that adjustment for propensity scores estimated from BPSA should be an effective approach to reducing confounding bias. But in Chapter 4, we applied BPSA the statin data and observed that the treatment and control groups differed systematically with respect to important outcome risk factors. The results of this investigation demonstrate that better estimation of the propen 1 sity scores does not imply better covariate balance between treatment versus control groups. When we fit the regression models in equation (5.4), adjustment for ZPSA rather than Z produces a greater level of similarity in the distribution of C in treat- 110 merit versus control. Conditional on Z , X and C are independent, but the empirical distributions of C given X will differ from one dataset to the next- In contrast, stratifying on ZPSA reduces differences in the empirical distributions that emerge by chance. A detailed discussion of this characteristic of PSA is given by Rubin and Thomas [12, 13]. Intuitively, BPSA may yield results which are a compromise between regression adjustment for Z SA P versus adjustment for Z. The quantities Z B P S A are the better estimates of Z. But in Datasets A and B, we appear to pay a price in terms of worse comparability in the distribution of C in treatment versus control. This may shed some light into the apparently poor performance of B P S A when applied to the statin data. The imbalances in Table 4.6 may be a consequence of better estimation of the propensity scores. To fully address this question, it would be useful to study the quality of the propensity score estimates in the statin data, perhaps using the cross-validation approach outlined above. Furthermore, the results of this section may shed light into why BPSA point estimates of 3 are less efficient in the simulations of Section 5.1 and 5.2 than those of PSA. Ill Chapter 6 Conclusion 6.1 Summary We have proposed an approach for combining propensity score methods with Bayesian inference for control of confounding bias in observational studies with a dichotomous outcome, dichotomous treatment, and measured confounders. The method models the propensity score as a latent variable and uses Bayes theorem to integrate the latent variable out of the posterior distribution for model parameters. This estimation strategy is common in data analysis applications with missing data or latent structure. We model the joint distribution for the data,' parameters and latent quantity, and we then study the marginal posterior distribution for model parameters. For BPSA, the posterior distribution for the treatment effect incorporates uncertainty about the propensity scores for each subject. In simulations and. in the analysis of the statin data, we demonstrate that BPSA yields interval estimates for the treatment effect parameter which are longer on average compared to PSA. This is a consequence of propagating uncertainty in the propensity scores through the analysis. Intuition says that BPSA and PSA should give similar answers. The methods use the same models. Asymptotically, uncertainty in the propensity scores should be small. Nonetheless, PSA is a two step procedure. It involves first estimating propensity scores and then including the estimates as a covariate in a regression model for the outcome. BPSA does both steps simultaneously. The method exploits prior 112 information about the relationship between the outcome and propensity score within treatment groups. The M C M C computational scheme involves iteratively updating the propensity scores and then fitting a complete data step from the results. During the updating, the algorithm is likely to yield propensity score estimates which cluster patients into groupings based on the outcome risk. The conditional distribution for the 7 parameter, given the data and (/?,£), contains a contribution from the model for the outcome variable. As we learn about £ and /?, information flows back through the algorithm to affect estimation of 7. The outcome variable is ignored by PSA when estimating the propensity scores. To put it another way, PSA estimates 7 from the marginal density f(x\c) whereas BPSA uses the joint model for f(y,x\c). The methods use the same models, but handle information in different ways. We demonstrate BPSA in an observational study of the effectiveness of statin therapy in Ontario patients discharged alive from hospital following acute myocardial infarction. Austin and Mamdani previously used this dataset to conduct a detailed case-study of propensity score methods [2]. We apply BPSA and compare the results to PSA. While treatment effect estimates are similar, we see large differences in point and interval estimates of £. Further examination reveals that the differences can be attributed to differences in the characteristics of patients classified to propensity score bins for either method. Patients with high propensity scores are healthier.in general because physicians are known to prescribe statins to patients who are young with fewer comorbidities [19, 20]. When we apply BPSA, study subjects in the upper bins with high propensity score have low prevalences of mortality risk factors. We see the reverse effect in lower propensity score bins. BPSA aggregates subjects into bins depending on how sick they are, whereas PSA only considers the relationship between the treatment variable and confounders. In the Monte Carlo simulations of Chapter 5, we demonstrate that BPSA may 113 yield more efficient estimates of 7 and therefore the propensity scores. When the model for the outcome variable follows equation (3.1), BPSA point estimates are more efficient relative to PSA and have lower mean squared error. Interval estimates for 7 have shorter average length and retain roughly nominal coverage probability. Point and interval estimates of £ calculated from BPSA also appear to have better performance. Improved estimation of 7 may allow more accurate classification of study subjects into propensity score bins, improving estimation of £. While BPSA and PSA use the same models, BPSA makes stronger assumptions in some sense. We might expect BPSA to be less robust to modelling assumptions for for P(Y — l\x,c). We study this issue in Section 5.2. Synthetic data are simulated for the case when the distribution of the outcome variable follows a conventional regression model of Y on X and C given in equation (5.1). We show that in this case the performance advantage of BPSA breaks down. Point and interval estimates of 7 and £ have fairly severe bias, and this increases MSE and harms interval estimation. Furthermore, stratifying on propensity score bins estimated from BPSA appears to produce treatment and control groups which are not particularly comparable. For the PSA methodology, it is standard practice to assess the quality of the propensity score model by looking for systematic differences in the covariate distributions for treatment versus control within each of the bins. In the statin data, PSA eliminates much of the confounding bias, whereas BPSA produces treatment and control groups which are not particularly comparable. This suggests that the BPSA method does not effectively reduce confounding in the statin data example. These findings are surprising in light of the simulations of Section 5.1 and 5.2. To explore this phenomenon further, Section 5.4 conducts a detailed analysis of two synthetic datasets which closely approximate the statin data in terms of sample size, covariates, and underlying parameter values. We show that BPSA may yield better estimation of the propensity scores which occurs 114 simultaneously with greater dissimilarity in the covariate distributions for treatment and control. The poor comparability observed in Table 4.6 for the statin data may be a consequence of better estimation of the propensity scores. 6.2 Future Research Implementing BPSA using the MCEM algorithm. There is nothing inherently Bayesian about fitting regression models for P(Y • — l\x, c) and P(X = 11c) simultaneously. In principle, it should be possible to use other, methods of estimation, such as maximum likelihood. A n advantage is that thiswould yield likelihood-based estimates which might have similar performance to estimates computed from BPSA. In this thesis, we consider the case where the likelihood for the data is given by L(/U|y,x,c) The quantity f(j) = £[/(y|x,c,/U,7)/(x|c,7)] = J /(y|x,c,/?,f,7)/(x|c,7)/(7)d7, is a prior distribution for 7. This expression states that the likeli- hood is equal to the likelihood given the propensity score, averaged over uncertainty in propensity score. Because the density /(y|x, c, /?, £, 7) depends on 7 only via the linear predictor g(C,7), given in equation (3.1), this amounts to saying that L([3, £|y, x, c) is the average of the likelihoods when we account for uncertainty in the propensity score bin classification for the study units. To maximize this likelihood with respect to 8 and £, we can use the Monte Carlo .Expectation Maximization (MCEM) algorithm [46]. The algorithm permits us to 115 maximize L(/3, £|y, x, c) while only working with the quantity /(y|x, c, /?, £, 7), often called the complete data likelihood, and the conditional probability density function /(7|y,x,c,/U) = / ( y | x , c , / U , ) / ( x | c , 7)7(7) £(/3,§|y,x,c) 7 In motivating the M C E M algorithm, Wei and Tanner [47] note that for fixed B^> and we can write logL(8, £|y, x, c) = £[log /(y|x, c, 8, £, )/(x|c, 7)] - £[log /( |y, x, c, B, £)], 7 7 where the expectations are with respect to /(7|y, x, c, B^\ £ ^ ) . To maximize logL(/3,£|y,x,c), we need only consider the first term on the RHS of the above equation. We use the following algorithm: 1. Initialize B^ and 2. For t = 1,2,..., • Expectation Step (E-step): Draw 7 ^ 7 ( t , 2 ) . . . ^(t,M) ; ; f r o m /( | 7 V ) x > C ) /5 - ,£ - ). ( t 1 ) ( t 1 ) • Maximization Step (M-step): M J>g/(y|x,c,/U,7 ( M ) ). The theoretical details are given by Wei and Tanner [47] who show that the sequence (6^,^), (B^ \^),... 2 converges to the maximum likelihood estimator. The M C E M algorithm has close parallels to the Metropolis Hastings algorithm detailed in Section 3.2 for posterior simulation for BPSA. The E-step imputes the missing data, in this case, the propensity scores for study units which are modelled by the parameter 7. The M-step maximizes the resulting mixture of likelihood functions. 116 This is conceptually similar to sampling for the posterior distribution of Q and £ given the data and 7. When implementing the M C E M algorithm, we cannot draw from the conditional density /(7|y,x, c,(3^~ \^ ~ ^) directly because it is of unknown form. l t 1 But we can sample from it using the Metropolis Hastings algorithm. Moreover, we have already identified suitable proposal distributions in Section 3.3. We can compute standard errors for the resulting estimates of B and £ using the approach of Oakes [48]. The variance of the maximum likelihood estimator for is approximated by ^ | ^ log L(B, £|y, x, c), and we can write 2 2 8 d 2 2 logL(/U|y,x,c) = E 3(/U) d Var W O logL(/?,£|y,x,c,7) + 2 log£(/3,£|y,x,c,7) This yields expression for computing standard errors. d 1 2 E Var d{6,0 2 d L(/?,f|y,x,c,7) logL(/3,f|y,x,c 7) = - M d 2 log/(y|x,c,/?,£,7 ) (M) Y M^d(P,0'' M ^ 1 M d log/(y|x,c,A£,7 I (M) ) i=l d /(y|x,c,A£,7 - - Y M f^i W ( M , ) ) 0 The quantities (3 and £ are the maximum likelihood estimators, and the first and second derivatives of /(y|x, c, [3, £, 7) are obtained from the standard output from logistic regression of y on x and z as determined by the choice of 7. The expression for the large sample variance has a similar form to the variance decomposition given in equation (4.2) from Section 4.4. Further study of the BPSA method could involve a detailed investigation of max- 117 imum likelihood estimators for 3 and £ computed using the M C E M algorithm. We could compare such estimates to those from PSA in much the same manner as the investigations of Sections 5.1 and 5.2. Because we have already studied computational algorithms for BPSA, little work would be needed to implement the M C E M algorithm. BPSA for hierarchically structured data The BPSA method has the advantage that it allows for the incorporation of standard Bayesian machinery in data analysis settings using propensity scores. We can use flexible modelling strategies, such as hierarchical models based on exchangeability assumptions, or incorporation of prior information from expert opinion or external validation data. BPSA also permits the use of Markov chain Monte Carlo methods for computing'point and interval estimates. This gives inferences which do not depend on asymptotic approximations, as is the case for PSA. One extension for BPSA would be to apply it in settings involving hierarchically structured data. Propensity score methods have generally been developed in settings involving cross-sectional data with individual-level covariates. However, epidemiological data often have a hierarchical structure. In the statin data example, prescribing practices might be driven by hospital-level covariates. For example, teaching hospitals may provide different quality services than hospitals in rural areas. Study subjects from the same hospital may have treatment levels which are correlated. Suppose that r is a m x 1 vector of dummy variables indicating hospital membership for each study unit. We could model the propensity scores as logit[P(X = l|c, r)] = ic + TV. 118 where the parameter r models the hospital effects. Since the components of r are likely to be similar, particularly if m is large, we could model them hierarchically Tl,T ,...,T 2 ~ m iV(0,CT ), 2 where a is a hyperparameter. If hospital level covariates are available, then they can 2 be incorporated into modelling variability in r. Sampling from the posterior distribution /(/?, £, 7, r|y, x, c) is an extension of the M C M C algorithm of Section 3.1. To fit the models logit[P(y = l|x,c)] = Bx + g(z(c ))'Z logit[P(A" = l|c)] = 7'c + r V r ~ n N(0,a I), 2 with suitable prior distributions for /3, £, 7 and r, we update sequentially from the conditional densities /(/?,f|y.x,c,7,r) /(7|y,x,c,/3,£,r) /My,x,c,/?,f,7) The first conditional density does not depend on r and is just the posterior step from the M C M C algorithm of Section 3.1. The second conditional distribution is identical to the imputation step of Section 3.1, but uses a prior for 7 rather than a N(0,T ) 2 diffuse Gaussian prior. The final conditional distribution is conditionally conjugate under an inverse gamma prior for the hyperparameter a . 2 119 In summary, BPSA seems to be a sensible alternative to PSA for reducing confounding in observational studies. While BPSA assumes more than PSA in estimating propensity scores, the method is merely exploiting the modelling assumptions that are built into PSA when selecting a linear predictor g(.) for modelling the outcome variable. PSA has the advantage that inferences may be more robust to model misspecification. But in many applications, this estimation strategy may be overly pessimistic in the sense that prior information is available. Furthermore, PSA implicitly uses modelling assumptions for the outcome when computing parameter estimates and standard errors. Our investigation helps to shed light on validity and relevance of modelling assumptions that underlie propensity score methods. 120 Bibliography [1] P.R. Rosenbaum and D.B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41-57, 1983. [2] P.C. Austin and M . M . Mamdani. A comparison of propensity score methods: A case-study estimating the effectiveness of post-AMI statin use. Stat Med, 25:2084-106,' 2005: [3] P.R. Rosenbaum and D.B. Rubin. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc, 79:516-24, 1984. [4] J.K. Lunceford and M . Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Stat Med, 23:2937-60, 2004. i [5] J.A. Hoeting, D. Madigan, A . E . Raftery, and C T . Volinsky. Bayesian model averaging: A tutorial. Statist Sci, 14:382-417, 1999. [6] K . Hirano and G.W. Imbens. The propensity score with continuous treatments. In. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, pages Ed. A. Gelman and X . Meng, New York, Wiley. (2004), 71-84. [7] D.B. Rubin. Estimating causal effects from large datasets using propensity scores. Ann Intern Med, 15:757-63, 1997. 121 [8] J. Hahn. On the role of the propensity score for efficient estimation of the average treatment effects. Econometrika, 66:315-31, 1998. [9] J.M. Robins and Newey W.K. Estimation of exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48:479-95, 1992. [10] P.R. Rosenbaum. Model-based direct adjustment. J Am Stat Assoc, 82:387-94, 1987. [11] K. Hirano, G.W. Imbens, and G. Ridder. Efficient, estimation of average treatment effects using the estimated propensity score. Econometrika, 71:1161-89, 2003. • [12] D.B. Rubin and N . Thomas. Matching using estimated propensity scores: Relating theory to practice. Biometrics, 52(l):249-264, 1996. [13] D.B. Rubin and N . Thomas. Characterizing the effect of matching using linear propensity score methods with normal distributions. Biometrika, 79:797-809, 1992. [14] D.B. Rubin. The use of propensity scores in applied Bayesian inference. Bayesian Statistics, 2:463-72, 1985. [15] J.M. Robins and Y. Ritov. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Stat Med, 16:285-318, 1997. [16] J.C. LaRosa, J. He, and S. Vupputuri. Effect of statins on risk of coronary disease: A meta-analysis of randomized controlled trials. J Am Med Assoc, 282:2340-2346, 1999. 122 [17] H.D. Aronow, E.J. Topol, M.T. Roe, and et al. Effect of lipid-lowering therapy on early mortality after acute coronary syndromes: A n observational study. Lancet, 357:1063-68, 2001. [18] U . Stenestrand and L. Wallentin. Early statin treatment following acute myocardial infarction and 1-year survival. J Am Med Assoc, 285:430-36, 2001. [19] D.T. Ko, M . Mamdani, and D.A. Alter. Lipid-lowering therapy with statins in high-risk elderly patients: The treatment-risk paradox. J Am Med Assoc, 291:1864-70, 2004. [20] R.J. Glynn, E.L. Knight, R. Levin, and J. Avorn. Paradoxical relations of drug treatment with mortality in older persons. Epidemiol, 12:682-689, 2001. [21] PvJ. Rothman and S. Greenland. Modern Epidemiology, 2nd ed. Lippincott, Philadelphia, 1998. [22] J. Pearl. Causality, models reasoning and inference. Cambridge University Press, New York, 1999. [23] S. Greenland, J. Pearl, and J.M. Robins. Confounding and collapsibility in causal inference. Statist Sci, 14:29-46, 1999. [24] M.A. Hernan, S. Hernandez-Diaz, M . M . Werler, and Mitchell A . A . Causal knowledge as a prerequisite for confounding evaluation: A n application to birth defects epidemiology. Am J Epidemiol, 155:176-84, 2002. [25] D.B. Rubin. Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism. Biometrics, 47:1213-34, 1991. 123 [26] L. Wasserman. All of Statistics. Springer, New York, 2004. [27] Rosenbaum PR. Propensity score. In. Encyclopedia of Bio statistics, pages Ed. P. Armitage and T. Colton, vol. 5. New York, Wiley. (1998), 3551-3555. [28] P.C. Austin and G . M . Normand, S.T.and Anderson. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: A Monte Carlo study. Stat Med, 26:754-68, 2006. [29] P.C. Austin. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med, 26:3078-94, 2006. [30] T. Stumer, K . J . Rothman, and R.J. Glynn. Insights into different results from different causal contrasts in the presence of effect-measure modification. Pharmacoepidemiol Drug Saf 15:698-709, 2006. [31] T. Kurth, A . M . Walker, R.J. Glynn, K . A . Chan, J.M. Gaziano, K . Berger, and J.M. Robins. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol, 163:262-270, 2006. [32] M.H. Gail, S. Wieand, and S. Piantadosi. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika, 71:431-44, 1984. [33] T. Stumer, S. Schneeweiss, M.A. Brookhart, K . J . Rothman, J. Avorn, and R.J. Glynn. Analytic strategies to adjust confounding using exposure propensity scores and disease risk scores: Nonsteroidal anti-inflammatory drugs and shortterm mortality in the elderly. Am J Epidemiol, 161:891-9, 2005. 124 [34] M.S. Cepeda, J.T. Boston, R.and Farrar, and B.L. Strom. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol, 158:280-7, 2003. [35] C. Drake. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics, 49:1231-1236, 1993. [36] D.B. Rubin and R.P. Waterman. Estimating the causal effects of marketing interventions using propensity score methodology. Stat Sci, 21:206-222, 2006. [37] H. Zheng and R.J.A. Little. Penalized spline model-based estimation of the finite populations total from probability-proportional-to-size samples. J Off Stat, 19:99-107, 2003. [38] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis, 2nd edition. Chapman Hall/CRC, New York, 2004. [39] R Development Core Team. R: A language and environment for statistical com- puting. R Foundation for Statistical Computing:Vienna, 2004. ISBN 3-90005100-3. U R L http://www.R-project.org. [40] A. Gelman and D.B. Rubin. Inference from iterative simulation using multiple sequences. Stat Sci, 7:457-511, 1992. [41] R.J.A. Little. To model or not to model? Competing modes of inference for finite population sampling. J Am Stat Assoc, 99:546-56, 2004. [42] P. Gustafson and B. Clarke. Decomposing posterior variance. J Stat Plan Inference, 119:311-27, 2004. 125 [43] P.C. Austin, P. Grootendorst, and G . M . Anderson. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med, 2006. [44] B. Efron and G . Gong. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat, 37(l):36-48, 1983. [45] K . Imai and D.A. van Dyk. Causal inference with general treatment regimes: Generalizing the propensity score. J Am Stat Assoc, 99:854-866, 2004. [46] C P . Robert and G . Casella. Monte Carlo Statistical Methods. Springer, New York, 2004. [47] G.C.G. Wei and M.A. Tanner. A Monte Carlo implementation of the E M algorithm and the poor man's data augmentation algorithms. J Am Stat Assoc, 85:699-704, 1990. [48] D. Oakes. Direct calculation of the information matrix via the E M algorithm. J R Stat Soc Ser B, 61:479-482, 1999. 126 Appendix A Computing the limiting estimates for PSA. In Section 5.2, we apply PSA to synthetic datasets where P(Y = l\x,c) follows the model given in equation (5.1). The PSA method estimates the quantities (3* and £* which solve the estimating equation n(9) = E{R[Y - expit(#0)]} = 0, where R is a 6 x 1 vector with the first component equal to X, and the last five components equal to g(z(C,7)) and where 6* — [/?*,£*,££,£3,£4,£5]. For PSA we have 7* equal to 7, the true values used to generate the data. We outline a method for calculating the quantity 6* numerically for a given sim- ulation design. We may write £2(0*) = = E{R[Y -expit(R'9*)}} E{R[E[Y\R] -expit(#(?*)]} = ^r^tylrl-expit^*)]/^) = 0 r where r is a realization of R and the summation is over the support of R. The 127 quantity f(r) is the probability density function of R. Because R is a discrete random variable, we can calculate the quantities £?[Y|r] and /(r) numerically by Monte Carlo. To illustrate, consider the realization r = [1,1,0,0,1,0], meaning that we have X = 1 and g(z(C,j)) = [1,0,0,1,0]. To calculate E[Y\r] and /(r), 1. Draw a large sample of vectors C from the simulation design. 2. Retain the observations such that g(z(j,c))'. = [1,0,0,1,0], and denote this sample as ci, Ci ... r ,c . n 3. Compute ^ E[Y\r] />) = 1 " - ] P expit(r x 1 + c » = P(x = l\g(z(C, ) ) = [1,0,0,1,0]) x 7 p[g(z(C, )) 1 = [1,0,0,1,0]) 1 1 -y^expit(c^7) x - . n = i=i where r, p and 7 are the true parameter values from the simulation design of interest. Here we have p(g(z(C,j)) = [1,0,0,1,0]) = | because of the way the data are simulated. Replicates of C have 20% probability of lying in any specific propensity score bin. Having computed i5[Y"|r] and /(r), we can find the solution to the equation Cl(9*) = 0 using the Newton Raphson method. 128 The derivative of f2(0*) is given by J-Q(0*) = I ) E[rexpit(r'r)][rexpit(-r'r)]7(r), . r where the summation is over the support of R. Thus we can calculate the 6 x 1 vector £2(0*) and 6 x 6 matrix ^£2(0*) as a function of 0*. To solve £2(0*) = 0, we apply the following algorithm: 1. Initialize 0*, for example with 0* = [0,0,0,0,0,0]. 2. Set inc<- [1,1,1,1,1,1] 3. While Max{|mci|, \inc \, |mc |, |mc |, |mc |, |mc |} > 1 0 2 3 4 Set inc^- £2(0*) [^£2(0*)] ~* Set 0* <- 0* + inc } 4. Return 0* 129 5 6 -5 {
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Bayesian propensity score analysis of observational...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Bayesian propensity score analysis of observational data McCandless, Lawrence Cruikshank 2007
pdf
Page Metadata
Item Metadata
Title | Bayesian propensity score analysis of observational data |
Creator |
McCandless, Lawrence Cruikshank |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Propensity scores analysis (PSA) involves regression adjustment for the estimated propensity scores, and the method can be used for estimating causal effects from observational data. However, confidence intervals for the treatment effect may be falsely precise because PSA ignores uncertainty in the estimated propensity scores. We propose Bayesian propensity score analysis (BPSA) for observational studies with a binary treatment, binary outcome and measured confounders. The method uses logistic regression models with the propensity score as a latent variable. The first regression models the relationship between the outcome, treatment and propensity score, while the second regression models the relationship between the propensity score and measured confounders. Markov chain Monte Carlo is used to study the posterior distribution of the exposure effect. We demonstrate BPSA in an observational study of the effect of statin therapy on all-cause mortality in patients discharged from Ontario hospitals following acute myocardial infarction. The results illustrate that BPSA and PSA may give different inferences despite the large sample size. We study performance using Monte Carlo simulations. Synthetic datasets are generated using competing models for the outcome variable and various fixed parameter values. The results indicate that if the outcome regression model is correctly specified, in the sense that the outcome risk within treatment groups is a smooth function of the propensity score, then BPSA permits more efficient estimation of the propensity scores compared to PSA. BPSA exploits prior information about the relationship between the outcome variable and the propensity score. This information is ignored by PSA. Conversely, when the model for the outcome variable is misspecified, then BPSA generally performs worse than PSA. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-02-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0100672 |
URI | http://hdl.handle.net/2429/31420 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2007-318867.pdf [ 5.01MB ]
- Metadata
- JSON: 831-1.0100672.json
- JSON-LD: 831-1.0100672-ld.json
- RDF/XML (Pretty): 831-1.0100672-rdf.xml
- RDF/JSON: 831-1.0100672-rdf.json
- Turtle: 831-1.0100672-turtle.txt
- N-Triples: 831-1.0100672-rdf-ntriples.txt
- Original Record: 831-1.0100672-source.json
- Full Text
- 831-1.0100672-fulltext.txt
- Citation
- 831-1.0100672.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0100672/manifest