Bayesian Methods for Alleviating Identification Issues with Applications in Health and Insurance Areas

by

Chaoxiong Xia

B.Econ. & B.Sc., Nankai University, 2002
M.Sc., University of Calgary, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Statistics)

The University of British Columbia (Vancouver)

June 2013

© Chaoxiong Xia, 2013

Abstract

In areas such as health and insurance, data limitations may cause identification problems in statistical modeling, and ignoring these issues may bias statistical inference. Bayesian methods have proven useful in alleviating identification issues by incorporating prior knowledge. In health applications, the existence of hard-to-reach populations in survey sampling will bias population estimates of disease prevalence, medical expenditures and health care utilization. For these three types of measures, we propose four Bayesian models based on the binomial, gamma, zero-inflated Poisson and zero-inflated negative binomial distributions. Large-sample limits of the posterior mean and standard deviation are obtained for the population estimators. Through extensive simulation studies, we demonstrate that the posteriors converge to their large-sample limits in a manner comparable to that of an identified model. In the regression context, the existence of hard-to-reach populations will bias the assessment of risk factors such as smoking. For the corresponding regression models, we obtain theoretical results on the limiting posteriors. Case studies are conducted on several well-known survey datasets. Our work confirms that sensible results can be obtained using Bayesian inference, despite the nonidentifiability caused by hard-to-reach populations. In insurance, there are specific issues, such as misrepresentation of risk factors, that may result in biased estimates of insurance premiums.
In particular, for a binary risk factor, the misclassification occurs only in one direction. We propose three insurance prediction models based on the Poisson, gamma and Bernoulli distributions to account for this effect. Through theoretical study of the form of the posterior distributions and of method-of-moments estimators, we confirm that model identification depends on the distribution of the response. Furthermore, we propose a binary model with the misclassified variable used as a response. Through simulation studies for the four models, we demonstrate that acknowledging the misclassification improves the accuracy of parameter estimation. For road collision modeling, measurement errors in annual traffic volumes may cause an attenuation effect in the regression coefficients. We propose two Bayesian models, and theoretically confirm that the gamma models are identified. Simulation studies are conducted for finite-sample scenarios.

Preface

This thesis was completed under the supervision of my PhD advisor, Professor Paul Gustafson. It consists of four papers coauthored or to be coauthored with Professor Gustafson, one of which has already been published. Chapter 2 and Appendix A are based on the published paper: M. Xia and P. Gustafson. A Bayesian method for estimating prevalence in the presence of a hidden sub-population. Statistics in Medicine, 31(21): 2386-2398, 2012 [59]. Professor Gustafson identified the research question and guided me through designing the research problems. Under his supervision, I derived the theoretical results, undertook simulation studies, applied the methodology to two real datasets, and wrote the manuscript. For the other three chapters (3, 4 and 5), I identified the real-life problems and designed the research questions. Professor Gustafson pointed me in the right direction when I encountered difficulties in deriving the theoretical results in Section 3.5.
I did all the work of deriving the theoretical results, performing simulation studies, undertaking case studies using real data, and writing the thesis chapters. Professor Gustafson pointed out the right tools when I had problems deriving theoretical results for Chapters 4 and 5.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication

1 Introduction
  1.1 Motivations
    1.1.1 M-Track Survey
    1.1.2 Medical Expenditure Panel Survey
    1.1.3 Misrepresentation in insurance underwriting
    1.1.4 Measurement errors in traffic volumes
  1.2 Bayesian inference for nonidentified models
  1.3 Summary of proposed research
    1.3.1 Prevalence estimation with a hidden sub-population
    1.3.2 Bias assessment in MEPS
    1.3.3 Misrepresentation modeling
    1.3.4 Mismeasured AADT in collision prediction

2 Prevalence Estimation with a Hidden Sub-population
  2.1 Introduction
  2.2 A Bayesian model
    2.2.1 The observed variables
    2.2.2 Prior on hidden proportion
    2.2.3 Distribution of sampling probabilities
    2.2.4 Variable of interest
  2.3 Large-sample limit of posterior distributions
    2.3.1 Posterior mean
    2.3.2 Posterior standard deviation
    2.3.3 The limiting posterior distribution
  2.4 Simulation study
    2.4.1 Model specification
    2.4.2 Posterior behavior on finite samples
    2.4.3 The limiting posterior
    2.4.4 Assumption of a smooth relationship
  2.5 Case studies on real data
    2.5.1 NHANES data
    2.5.2 Adolescent Health Study
  2.6 Conclusions

3 Never-respondents in Medical Expenditure Panel Survey
  3.1 Background
  3.2 A Bayesian approach
    3.2.1 The observed variables
    3.2.2 Prior on hidden proportion
    3.2.3 Distribution of sampling probabilities
    3.2.4 Variables of interest
  3.3 Large-sample limits of posterior distributions
    3.3.1 Posterior mean
    3.3.2 Posterior standard deviation
  3.4 Sensitivity analysis on MEPS
    3.4.1 Model specification
    3.4.2 Asymptotic behavior
    3.4.3 MEPS application
  3.5 Impact on regression
    3.5.1 Logistic regression
    3.5.2 Gamma regression
    3.5.3 Zero-inflated count regression
    3.5.4 Case study on MEPS
  3.6 Conclusions

4 Misrepresentation in Insurance Underwriting
  4.1 Motivation
  4.2 Poisson response model
    4.2.1 Model specification
    4.2.2 Model identification
    4.2.3 Computational study
  4.3 Gamma response model
    4.3.1 Model specification
    4.3.2 Model identification
    4.3.3 Computational study
  4.4 Binary response model
    4.4.1 Model specification
    4.4.2 Model identification
    4.4.3 Simulation study
  4.5 Misclassified variable as response
    4.5.1 Model specification
    4.5.2 Model identification
    4.5.3 Simulation study
  4.6 Conclusions

5 Mismeasurements in Traffic Volumes for Accident Prediction
  5.1 Introduction
    5.1.1 Mismeasurements in traffic volumes
  5.2 Lognormal measurement error model
  5.3 Gamma measurement error models
    5.3.1 Poisson response model
    5.3.2 Negative binomial response model
  5.4 Simulation study
    5.4.1 Metropolis-Hastings algorithm
    5.4.2 Simulation study
  5.5 Conclusions

6 Concluding Remarks
  6.1 Conclusions
  6.2 Future work
    6.2.1 Thesis related future work
    6.2.2 Other partially identified models

Bibliography

A Supporting Materials for Prevalence Estimation

List of Tables

Table 2.1  Impact of a non-smooth relationship on population prevalence. The three digits in the case number correspond to the three assumptions on the distribution of the observed sampling probability, the natural cubic spline and the prior on the hidden proportion. For example, '111' refers to the situation where we take the first set of values for the three assumptions given in the previous subsections. The notation θ̂ denotes the limiting posterior mean.
Table 2.2  Estimated prevalences in the NHANES example
Table 2.3  Estimated prevalences in the Adolescent Health example
Table 3.1  Comparison of estimated expenditures for different types of expenditures. Except for Medicaid, ignoring the existence of never-response causes an over-estimation of the population expenditures. In all cases, acknowledging never-response results in higher standard errors. The weighted estimates are very close to the model-based estimates assuming no never-respondents.
Table 3.2  Over-dispersion in utilization data. For emergency room visits, inpatient hospital stays and dental visits, the variance-to-mean ratio (unconditional on sampling probabilities) is below 2.9. For the other types of visits, the ratio is above 18.
Table 3.3  Comparison of estimated utilizations. For all types of health care visits, ignoring the existence of never-response causes an over-estimation of the population utilization. Acknowledging never-response results in higher standard errors. The weighted estimates are very close to the model-based estimates assuming no never-respondents.
Table 3.4  Comparison of estimated relative risk of incurring expenditure and utilization, using logistic regression on the insurance status. All the model-based estimates acknowledging the never-respondents are larger than the weighted estimates. Note that the standard errors are estimated using the bootstrap for the naive and weighted estimates.
Table 3.5  Comparison of estimated relative risk of incurring expenditure and utilization, using gamma, ZIP and ZINB regression on the insurance status. Similar to those from logistic regression, all the model-based estimates acknowledging the never-respondents are larger than the weighted estimates. Note that the standard errors are estimated using the bootstrap for the naive and weighted estimates.
Table A.1  Limiting posteriors for cases with mis-specified priors

List of Figures

Figure 2.1  Relationship of binary trait and unnormalized sampling probability
Figure 2.2  Sampling distribution of the posterior mean of prevalence θ with different sample sizes
Figure 2.3  Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes. In panel (d), the concentrated distribution of sampling probability results in difficulty in obtaining subjects with small sampling probability, which hampers finite-sample inference but not the asymptotic case.
Figure 2.4  Limiting posterior distribution of prevalence θ
Figure 2.5  Posterior samples of natural cubic splines based on 50 samples. The histograms are based on the frequency of subjects with unnormalized sampling probability within each range of width 0.1. The sample size of the asthma data is 9659, about 2 times larger than for the other two diseases.
Figure 2.6  Posterior samples of natural cubic splines based on 50 samples. The histograms are based on the frequency of subjects with unnormalized sampling probability within each range of width 0.03.
Figure 3.1  Distribution of expenditure by group of sampling probability. We observe that the distributions (i.e., the conditional distributions) are all right-skewed with a positive support.
Figure 3.2  Total charges - posterior samples of natural cubic splines (50 samples). The solid dots at the x-axis are the 5%, Q1, median, Q3 and 95% quantiles of the sampling probabilities.
Figure 3.3  Total expenditures - posterior samples of natural cubic splines (50 samples)
Figure 3.4  Medicaid - posterior samples of natural cubic splines (50 samples)
Figure 3.5  Medicare - posterior samples of natural cubic splines (50 samples)
Figure 3.6  Private insurance - posterior samples of natural cubic splines (50 samples)
Figure 3.7  Self-paid - posterior samples of natural cubic splines (50 samples)
Figure 3.8  Distribution of health care visit counts by group of sampling probability. For some types of visits, we have to perform a logarithm transformation in order to visualize the distributional body.
Figure 3.9  Office-based visits - posterior samples of natural cubic splines (50 samples). Both the mixture proportion and the mean of the ZINB are increasing with sampling probability.
Figure 3.10  Outpatient visits - posterior samples of natural cubic splines (50 samples)
Figure 3.11  Emergency visits - posterior samples of natural cubic splines (50 samples)
Figure 3.12  Inpatient stays - posterior samples of natural cubic splines (50 samples)
Figure 3.13  Prescription medicines - posterior samples of natural cubic splines (50 samples)
Figure 3.14  Dental visits - posterior samples of natural cubic splines (50 samples)
Figure 3.15  Home care visits - posterior samples of natural cubic splines (50 samples). We use the negative binomial model due to an over-dispersion of more than 100.
Figure 3.16  Logistic regression on insurance indicator - posterior samples of natural cubic splines (50 samples). The solid lines and dots are for individuals with medical insurance, while the dashed lines and unfilled dots are for uninsured individuals.
Figure 3.17  Binary regression for incurring expenditure or utilization - posterior samples of natural cubic splines (50 samples). The solid lines and dots are for individuals with medical insurance, while the dashed lines and unfilled dots are for uninsured individuals.
Figure 3.18  Gamma, ZIP and ZINB regression on expenditure and utilization - posterior samples of natural cubic splines (50 samples). The solid lines and dots are for individuals with medical insurance, while the dashed lines and unfilled dots are for uninsured individuals.
Figure 4.1  Distribution of posterior samples for α0 for the Poisson model. The results are based on one simulation with a sample size of 1000. The Bayesian models implemented in WinBUGS and R provide estimates that are very close to those from the true model using unobserved values of V.
Figure 4.2  Distribution of posterior samples for α1 for the Poisson model. The results are based on one simulation with a sample size of 1000.
Figure 4.3  Distribution of posterior samples for p for the Poisson model. The results are based on one simulation with a sample size of 1000. The posterior distributions from the Bayesian models implemented in WinBUGS and R are concentrated and cover the true value of p0.
Figure 4.4  Distribution of posterior samples for θ for the Poisson model. The results are based on one simulation with a sample size of 1000.
Figure 4.5  Distribution of posterior samples for α0 for the gamma model. The results are based on one simulation with a sample size of 1000. Acknowledging the misrepresentation results in more accurate estimates with higher uncertainty.
Figure 4.6  Distribution of posterior samples for α1 for the gamma model. The results are based on one simulation with a sample size of 1000. Acknowledging the misrepresentation results in more accurate estimates with higher uncertainty.
Figure 4.7  Distribution of posterior samples for p for the gamma model. The results are based on one simulation with a sample size of 1000. The relatively spread-out posterior distribution covers the true value.
Figure 4.8  Distribution of posterior samples for θ for the gamma model. The results are based on one simulation with a sample size of 1000. The relatively spread-out posterior distribution covers the true value.
Figure 4.9  Distribution of posterior samples for φ for the gamma model. The results are based on one simulation with a sample size of 1000. Misrepresentation does not seem to have a big impact on the dispersion parameter.
Figure 4.10  Probability density and cumulative distribution functions of the prior on the misclassified proportion p. The prior means and standard deviations are (0.25, 0.19), (0.5, 0.22) and (0.25, 0.14), respectively.
Figure 4.11  Convergence of posterior mean for α0 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior means based on three different priors for p, in the order given in the legend. We exclude the results with p0 = 0.5 and n = 40, due to the problem of numerical overflow/underflow.
Figure 4.12  Convergence of posterior mean for α1 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior means based on three different priors for p. We exclude the results with p0 = 0.5 and n = 40, due to the problem of numerical overflow/underflow.
Figure 4.13  Convergence of posterior mean for p with different true values and priors on p. For each sample size, the three boxplots (based on 20 simulations), from left to right, are the posterior means based on three different priors, in the order given in the legend. We exclude the results with p0 = 0.5 and n = 40, due to the problem of numerical overflow/underflow.
Figure 4.14  Convergence of posterior standard deviation for α0 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior standard deviations based on three different priors for p. We exclude the results with p0 = 0.5 and n = 40, due to the problem of numerical overflow/underflow.
Figure 4.15  Convergence of posterior standard deviation for α1 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior standard deviations based on three different priors for p. We exclude the results with p0 = 0.5 and n = 40, due to the problem of numerical overflow/underflow.
Figure 4.16  Convergence of posterior standard deviation for p with different true values and priors on p. For each sample size, the three boxplots (based on 20 simulations), from left to right, are the posterior standard deviations based on three different priors, in the order given in the legend. We exclude the results with p0 = 0.5 and n = 40, due to the problem of numerical overflow/underflow.
Figure 4.17  Trace plots for posterior samples of α0 from the Metropolis-Hastings algorithm when n = 160 and p0 = 0.25. We observe that the trace plot improves using five independent chains after thinning. In particular, the first 1000 samples are from the first chain, the next 1000 are from the second chain, etc.
Figure 4.18  Trace plots for posterior samples of p from the Metropolis-Hastings algorithm when n = 160 and p0 = 0.25. We observe that the trace plot improves using five independent chains after thinning. In particular, the first 1000 samples are from the first chain, the next 1000 are from the second chain, etc.
Figure 4.19  Convergence of posterior mean for β0 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the posterior means based on three different priors for p, in the order given in the legend. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.20  Convergence of posterior mean for β1 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the posterior means based on three different priors for p. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.21  Convergence of posterior mean for p with different true values and priors on p. For each sample size, the three boxplots (based on 20 simulations), from left to right, are the posterior means based on three different priors, in the order given in the legend. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.22  Convergence of posterior standard deviation for β0 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the posterior standard deviations based on three different priors for p. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.23  Convergence of posterior standard deviation for β1 with different true values and priors on p. For each sample size, the five boxplots (based on 20 simulations), from left to right, are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the posterior standard deviations based on three different priors for p. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.24  Convergence of posterior standard deviation for p with different true values and priors on p. For each sample size, the three boxplots (based on 20 simulations), from left to right, are the posterior standard deviations based on three different priors, in the order given in the legend. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 5.1  Solution of Equation (5.18). For all the cases, there exists a unique solution for β1.
Figure 5.2  Distribution of posterior mean for β0 with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.3  Distribution of posterior mean for β0 with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.4  Distribution of posterior mean for β1 with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.5  Distribution of posterior mean for β1 with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.6  Distribution of posterior mean for η² with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.7  Distribution of posterior mean for η² with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.8  Distribution of posterior mean for predicted accident counts Λ with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.9  Distribution of posterior mean for predicted accident counts Λ with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.10  Distribution of posterior standard deviation for β0 with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.11  Distribution of posterior standard deviation for β0 with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.12  Distribution of posterior standard deviation for β1 with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.13  Distribution of posterior standard deviation for β1 with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.14  Distribution of posterior standard deviation for η² with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.15  Distribution of posterior standard deviation for η² with a sample size of n = 40, based on simulations on 100 datasets.
Figure 5.16  Distribution of posterior standard deviation for predicted accident counts Λ with a sample size of n = 10, based on simulations on 100 datasets.
Figure 5.17  Distribution of posterior standard deviation for predicted accident counts Λ with a sample size of n = 40, based on simulations on 100 datasets.
Figure A.1  Sampling distribution of the posterior mean of prevalence θ with different sample sizes. The three digits in the panel name correspond to the three assumptions on the distribution of the observed sampling probability, the natural cubic spline and the prior on the hidden proportion. For example, '111' refers to the situation where we take the first set of values for the three assumptions given in Subsection 2.4.2 of the thesis.
Figure A.2  Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes
Figure A.3  Sampling distribution of the posterior mean of prevalence θ with different sample sizes
Figure A.4  Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes
Figure A.5  Sampling distribution of the posterior mean of prevalence θ with different sample sizes
Figure A.6  Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes
Figure A.7  Sampling distribution of the posterior mean of prevalence θ with different sample sizes
Figure A.8  Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes

Acknowledgments

First of all, I would like to express my deepest appreciation to my supervisor, Dr. Paul Gustafson, for his persistent guidance, inspiration and support during my PhD program. Dr. Gustafson has always been brilliant, encouraging and patient. Without his encouragement and inspiration at every stage of my program, this dissertation would not have been possible. In addition, through his own practice, he has demonstrated to me the precious values and philosophies of how to become a great teacher and researcher.

I am full of gratitude to my supervisory committee members, Dr. Rollin Brant and Dr. Lang Wu, for their precious time and their guidance throughout my program. Their pertinent comments have led to great improvements in the thesis proposal and the dissertation. Last but not least, I would like to thank all the other examining committee members, Dr. Martin Barlow (chair), Dr. Jason Sutherland (university examiner), Dr. Jennifer Bryan (university examiner) and Dr. Lawrence Joseph (external examiner), for their pertinent comments, which have greatly improved my dissertation. I am indebted to Dr. Bryan and Dr. Joseph for their precious time, their detailed comments and their corrections of typographical errors.

Dedication

Dedicated to my family, whom I love and who love me! This dissertation would not have been possible without the support of my big family. My mother came from China to care for my family, which enabled me to complete my dissertation. While my mother was in Vancouver, my father undertook all the duties of the home business and endured a long, lonely time by himself.
My grandparents, who are in their eighties, undertook the hard work of family duties in China. Furthermore, my husband has continually supported my pursuit of academic study and an academic career. At every difficult stage of my PhD program, he has been there supporting and encouraging me in my personal choices. My two kids, Claire and Daniel, have brought me enormous joy and fulfillment. Without all of you, I would not have reached where I am today.

Chapter 1
Introduction

1.1 Motivations

In areas such as health and insurance, there can be data issues that result in identification problems in statistical modeling, in which case we are not able to identify a unique set of parameter values based on the observed data. This thesis is motivated by the following four real-life problems and data issues that have not previously been addressed in the statistical literature.

1.1.1 M-Track Survey

One example of a partially identified model we will study involves disease prevalence estimation using data collected from weighted sampling, when there is a hidden sub-population. The particular interest arises from our direct involvement with the M-Track Survey [41], coordinated by the Public Health Agency of Canada for monitoring the prevalence of risk behaviors, HIV, and other sexually transmitted infections (STI) among gay, bisexual and other men who have sex with men (MSM), for which a venue-based sampling method is used (see e.g., [24]). Venue-based sampling is a time-space sampling method [27, 56] that has been used throughout the literature for studies involving MSM populations [32, 33, 60]. In venue sampling, interviewers visit participating venues (e.g., an event the target population attends) at randomly selected times and enroll eligible population members. Because subjects with more frequent venue visits are more likely to be recruited, venue-based sampling falls into the category of weighted sampling methods.
A weight can be estimated for each participant based on his or her probability of being sampled, using questionnaire information on venue attendance patterns, along with information on the sampling scheme. Those MSM who never attend the sampled venues will have a zero sampling probability, giving rise to a hidden sub-population in weighted sampling. The presence of a hidden sub-population in weighted sampling may cause a distortion in the final estimate of the population disease prevalence, particularly if the hidden sub-population exhibits different risk behaviors. The problem falls into the broad framework of nonidentified models that have been well studied in the literature [34].

1.1.2 Medical Expenditure Panel Survey

The Medical Expenditure Panel Survey (MEPS) [1] is a set of national surveys sponsored by the Agency for Healthcare Research and Quality (AHRQ). It provides comprehensive information on the U.S. population regarding the frequency, cost and source of payment for the health services that Americans use, as well as demographic, health and financial information, at both the household and individual levels. It provides comprehensive data for studying population expenditure, health utilization and health insurance, as well as various causal effects on healthcare expenditure and utilization. AHRQ provides numerous research reports each year based on MEPS data, on issues related to health care utilization and expenditure of the U.S. population. For example, Krauss et al. [29] studied the population percentages with various types of health service visits, using MEPS household component data. In earlier works such as [4, 28, 39], a total population average expense, as well as averages by service type and source of payment, were estimated using 1996 data.
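The weighting idea described above can be sketched numerically. The following is a minimal inverse-probability (Horvitz-Thompson style) prevalence estimate on simulated data; the data-generating model and all variable names are our hypothetical illustration, not taken from the M-Track Survey.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical venue-sampling data: participant i has a binary trait y_i and an
# (unnormalized) sampling probability q_i derived from attendance questions.
n = 5000
q = rng.uniform(0.05, 1.0, size=n)       # assumed-known sampling probabilities
y = rng.binomial(1, 0.2 + 0.1 * q)       # trait slightly more common among frequent attenders

# Inverse-probability weighting: each participant's weight is 1 / q_i.
w = 1.0 / q
prevalence_weighted = np.sum(w * y) / np.sum(w)
prevalence_naive = y.mean()
```

Note that the weighted estimate corrects selection bias among ever-attenders only: subjects with q = 0 can never enter the sums, which is precisely the hidden sub-population problem studied in this thesis.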
For diabetes, Sarpong and Miller [46] assessed trends in pharmaceutical treatments by comparing the average expenditure on medications as well as the proportions of individuals using different types of medicines. In Rohde and Machlin [43], costs were estimated for prenatal care of uncomplicated pregnancies by insurance status and type of care. Bernard and Banthin [2] evaluated healthcare expenditures and insurance premiums by factors such as insurance status and family size. More recent works include expenditure and utilization estimation for subcategories such as dental [42], depression [50], anticonvulsants [54], hypertension [7], top five therapeutic medications [51, 52], costly medical conditions [6], dermatological agents [53], and trauma-related disorders [62]. For MEPS, sampling weights are available at both the household and individual levels. MEPS base weights are initially obtained from the National Health Interview Survey (NHIS, [37]), as the households participating in MEPS are a subsample of those in NHIS. The MEPS sample design includes a feature of oversampling Hispanics, Blacks and Asians, in order to increase the precision of estimates for these sub-populations. The oversampling feature of the survey and the existence of nonresponse may cause a bias in the estimates of population expenditure and frequency of health care visits. In MEPS, the selection bias caused by oversampling and usual nonresponse (e.g., due to time inconvenience) can be adjusted for by using weighted estimates instead of the naive mean and variance for population quantities of interest (e.g., expenditures and utilizations in MEPS). However, a proportion of the nonrespondents may choose not to respond because of the occurrence of medical expenditure or health care visits. In this thesis, the term never-respondents will be used for the group of people who choose not to respond to the survey for reasons related to their medical expenditure or utilization.
This is another example of a hidden sub-population in weighted sampling that raises a concern of model identification.

1.1.3 Misrepresentation in insurance underwriting

In insurance underwriting, misrepresentation occurs when the applicant purposely provides a false statement on factors that may affect the insurability or the insurance premium. As the act benefits the applicant, it occurs frequently across insurance product lines, such as life insurance, auto insurance and home insurance. Due to the high cost of verifying the information provided by the applicant (e.g., through medical exams in life and health insurance), insurance companies have to rely on the applicant to provide accurate information on factors that are used for insurance pricing. Strategies aimed at reducing the impact of misrepresentation (e.g., through revising policy clauses) have been studied in the literature [58]. For example, insurance companies usually control the risk of misrepresentation by rejecting a claim made by the insured, once the applicant is found to have provided a false statement in the application. Much information (e.g., when a pre-existing health condition first started), however, is hard to verify afterward. Even if the misrepresentation may be found afterward (e.g., during a claim investigation), insurance companies still need information on how serious the problem of misrepresentation can be, and how it may impact the calculation of premiums. Using existing data on the claims occurrence of all the policies, insurance rate-making models accounting for misrepresentation can be helpful in this situation. If insurance companies could make an inference on the proportion of misrepresentation in existing policies, then they would be able to estimate the cost of misrepresentation, and conduct cost-benefit analyses for possible investigations of applications.
Compared to investigations and medical exams, statistical modeling of the existing policies is much less expensive. Despite the advantages, to the best of our knowledge, there is no existing work on statistical modeling of insurance misrepresentation. In this thesis, we will study the simple case of misrepresentation when the question of concern is binary. The general problem of misclassification in binary variables has been studied extensively in the literature. For example, Gustafson et al. [23] showed that incorporating uncertainty about the misclassification probabilities through priors results in estimates of odds ratios that are less affected by bias in the guess of the probabilities. Gustafson and Greenland [21] raised several issues in modeling the misclassification using Bayesian inference in unmatched case-control analysis with binary exposure, when confounders are not adjusted for. Liu et al. [31] studied Bayesian adjustment when there is information available on both the misclassification of exposure and the exposure-disease association. Chu et al. [3] demonstrated that when there are validation data on the misclassification, Bayesian methods have advantages over their non-Bayesian counterparts. Wang et al. [57] investigated the partial identification obtained for case-control studies when the binary exposure is misclassified. Since misrepresentation causes misclassification in risk factors, there may be model identification issues that require further investigation. In the usual random measurement error setting, we assume that the measurement is centered around the true value. For misrepresentation, on the other hand, we know the direction of the measurement error. That is, the applicant only has a motivation to make a false statement when it is to his or her benefit. Hence, it is interesting to study how this information can help in model inference.
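The one-directional nature of this misclassification can be made concrete with a small simulation sketch (our hypothetical illustration, not the thesis model): a true 1 may be reported as 0, but a true 0 is never reported as 1, so the observed prevalence of the risk factor is attenuated by a known factor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: binary risk factor v_true (e.g., smoker = 1) is
# misrepresented only in one direction -- a true 1 is reported as 0 with
# probability p_mis, while a true 0 is always reported truthfully.
n, true_prev, p_mis = 10_000, 0.3, 0.25
v_true = rng.binomial(1, true_prev, size=n)
lie = rng.binomial(1, p_mis, size=n)
v_obs = v_true * (1 - lie)          # one-directional: only 1 -> 0 flips occur

# Observed prevalence is attenuated: E[v_obs] = true_prev * (1 - p_mis)
print(v_obs.mean(), true_prev * (1 - p_mis))
```

Knowing the direction of the flips is exactly the extra information, beyond generic misclassification, that the models of Chapter 4 exploit.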
1.1.4 Measurement errors in traffic volumes

In certain jurisdictions, such as British Columbia in Canada, the auto insurance in the region is mainly carried by one insurance company. In such a situation, the insurance company spends part of its revenue on improving road conditions, in order to reduce accident counts and thereby save insurance claim costs. The evaluation of road safety measures (e.g., installing new cameras) depends on whether the measure reduces collisions, after adjusting for the impact of other factors such as the traffic volumes during the period. In practice, the average annual daily traffic (AADT) at a given road segment is a major factor in accident prediction models based on Poisson or negative binomial distributions (see e.g., [8, 10, 26]). In road safety modeling, annual data are usually used for accident counts, which requires covariate variables such as AADT to be measured for the whole year. In reality, due to the expensive cost, the engineering department of a city obtains AADT from snapshots taken over a selected period of time such as 24 hours. In some extreme cases, the observation period can be as short as several hours, depending on the available budget. Hence, the AADT measurements used in accident prediction models can be subject to large measurement errors. The impact of estimating AADT with short periods of observation time is studied in earlier work such as [13, 49]. In El-Basyouny and Sayed [9], the authors proposed a lognormal measurement error model component in accident prediction models, and compared the estimated regression coefficients with those from the models ignoring the measurement error. However, the identification of measurement error models in road safety areas has received little attention, despite existing findings on model identification when there are measurement errors in the covariates (see, e.g., [16]).
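The central concern here, that noise in a covariate biases its regression coefficient toward zero, can be illustrated with a generic classical measurement error sketch (our illustration using a linear model, not a road safety model): with covariate variance s2x and error variance s2e, the least-squares slope shrinks by the reliability ratio s2x / (s2x + s2e).

```python
import numpy as np

rng = np.random.default_rng(3)

# Classical measurement error: regress y on a noisy version of x and observe
# the attenuation of the slope toward zero.
n, beta, s2x, s2e = 100_000, 2.0, 1.0, 1.0
x = rng.normal(0.0, np.sqrt(s2x), size=n)
y = beta * x + rng.normal(0.0, 0.5, size=n)
x_obs = x + rng.normal(0.0, np.sqrt(s2e), size=n)   # mismeasured covariate

slope_true = np.polyfit(x, y, 1)[0]       # ~ beta = 2.0
slope_attn = np.polyfit(x_obs, y, 1)[0]   # ~ beta * s2x / (s2x + s2e) = 1.0
print(slope_true, slope_attn)
```

Chapter 5 studies the analogous effect, and its interaction with model identification, for Poisson and negative binomial collision models with mismeasured AADT.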
It has been well accepted that measurement errors in covariate variables will result in an attenuation effect that moves the estimated effect toward zero. In normal measurement error models, there will be identification issues when there is only one observation available for each subject. For the Poisson and negative binomial accident prediction models, it is of interest to study the model identification, as well as the potential impact on the regression coefficients and predicted collision counts.

1.2 Bayesian inference for nonidentified models

In all the motivating examples given in the previous section, there are likely to be identification issues in statistical modeling. In particular, there may be nonidentifiability, in which case multiple sets of parameter values correspond to the same distribution of observables. In certain situations involving data issues, the statistical models may be weakly identified. That is, although the model is theoretically identified, learning the values of parameters with reasonable precision requires an unreasonably high sample size. For nonidentified models, one may be able to obtain partial identification of a distribution, in which case an identification region may be obtained for the parameter of interest. An identification region is a set of values, containing the true value of the parameter, that is consistent with the available data as well as other assumptions concerning the missing information. Under partial identification, although the posterior distribution will not converge to a point-mass distribution at the true value asymptotically, its convergence to a distribution over the identification region may provide us with helpful information on the true value of the variable of interest. Inference on partially identified models has been well studied in the literature, using both Bayesian and non-Bayesian approaches. Earlier works such as [34] propose non-Bayesian solutions to some nonidentified model situations.
Bayesian adjustment for nonidentified models can be found in papers such as [14, 15, 22, 23, 47]. In particular, [23, 38, 61] focus on the amount of Bayesian learning arising from nonidentified models. Gustafson [17] discusses the analysis of nonidentified models and proposes a method for investigating the asymptotic behavior of the posterior mean. The authors of [18] put forward a sample size criterion based on the extent to which Bayesian learning can be obtained from a nonidentified model. Bayesian inference has the flexibility of incorporating expert knowledge via priors. Gustafson [18] shows that, despite the nonidentifiability, posterior learning can be quantified by obtaining the posterior distribution under a transparent parameterization, i.e., the posterior distribution of the nonidentified part of the parameters conditional on the identified ones. The term ‘transparent’ is used because we can learn some information about the nonidentified part of the parameters through the identified parameters, when there is dependence between them. Particularly for partially identified models, in addition to the identification region that is obtained, one may be able to study the shape of the limiting posterior distribution over the identification region. Gustafson [20] demonstrates that in some problems the limiting posterior distribution will be more concentrated around the true value than the prior, which may provide us with useful information. As pointed out by [20], the development of Bayesian inference has lagged behind that of non-Bayesian methods for partially identified models. Much remains to be done on Bayesian analysis for situations in real practice where data limitations lead to model nonidentifiability.
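A toy example (ours, not taken from the cited papers) makes the transparent parameterization concrete: observe X ~ N(a + b, 1), so only the sum μ = a + b is identified, and large-sample inference about a alone leans entirely on the prior for b.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy nonidentified model: X ~ N(a + b, 1).  Only mu = a + b is identified;
# b carries a prior.  Under the transparent parameterization (mu, b), the
# large-sample posterior mean of a = mu - b is mu_true - E_prior[b], which
# equals the true a only if the prior mean of b is correct.
a_true, b_true = 1.0, 0.5
prior_mean_b = 0.8                   # investigator's (here, slightly wrong) guess
x = rng.normal(a_true + b_true, 1.0, size=50_000)

mu_hat = x.mean()                    # the identified part is learned from data
a_post_mean = mu_hat - prior_mean_b  # large-sample posterior mean of a
print(a_post_mean)                   # ~ (a_true + b_true) - prior_mean_b = 0.7
```

The data sharpen knowledge of μ without limit, while the residual uncertainty about a never shrinks below the prior uncertainty about b; this is the pattern that recurs throughout the thesis.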
For the situations discussed in the previous section, Bayesian methods will be helpful for alleviating the potential identification problems in statistical modeling, particularly when there is some information available on the unobserved variables and measurement errors (e.g., the direction of misclassification in misrepresentation).

1.3 Summary of proposed research

In this thesis, we will propose Bayesian models to account for the data issues in the four motivating real-life problems presented in Section 1.1. The models will be evaluated by theoretical studies on the posterior distributions, simulation studies with finite samples, and case studies using real data.

1.3.1 Prevalence estimation with a hidden sub-population

In Chapter 2, we will propose a Bayesian model for the population disease prevalence using a weighted sample from the non-hidden portion of the population (e.g., the situation faced in the M-Track Survey). The prevalence in the hidden sub-population will be extrapolated by modeling the relationship between the disease status and the sampling probability. In particular, we will try to accomplish the following.

• In order to study the identification of our proposed Bayesian model, we will derive the analytical forms of the large-sample limits of the posterior mean and standard deviation of the prevalence estimator.

• For assessing the statistical learning of the model, the convergence of the posterior mean and standard deviation will be studied for various situations, using Markov chain Monte Carlo (MCMC) simulations based on self-coded Metropolis-Hastings algorithms.

• To illustrate how our model will perform under realistic situations, we will conduct case studies on two real datasets from the National Health and Nutrition Examination Survey (NHANES) [36] and the National Longitudinal Study of Adolescent Health [25].
In order to confirm the feasibility of statistical learning and the performance of the proposed model, we will perform theoretical, simulation and case studies.

1.3.2 Bias assessment in MEPS

Bias modeling is a topic that has been well studied in the epidemiology literature [12, 15, 30, 44, 55]. Due to the lack of any data on never-respondents, it is generally hard to conduct a traditional sensitivity analysis on the MEPS data. One possible method is to perform a Bayesian sensitivity analysis by incorporating different prior assumptions on the proportion of never-respondents. In Chapter 3, our specific scientific targets are as follows.

• For estimating population expenditure, we will propose a two-part model structure with a binary model for the existence of expenditure and a gamma model for the magnitude of expenditure. Analytical forms will be obtained for the limiting posterior mean and standard deviation of the estimators, in order to study the identification of our Bayesian model accounting for never-respondents.

• For estimating population health care utilization (i.e., visit counts), we will propose a zero-inflated Poisson (ZIP) model and a zero-inflated negative binomial (ZINB) model. The limiting posterior mean and standard deviation will be obtained explicitly.

• Models based on binary, gamma, ZIP and ZINB distributions will be proposed for assessing the impact of never-respondents on the regression analysis of a binary factor. Theoretical results will be obtained for the limiting posterior distribution of the estimators of the relative risk (and relative magnitude in the case of expenditure and utilization).

• Sensitivity analyses will be undertaken for the above models on MEPS (2010) data, using self-coded Metropolis-Hastings algorithms.

The research will provide theoretical insights on population inference for various types of variables, when there exist never-respondents in weighted sampling.
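To make the ZIP building block concrete, the following hedged sketch (our illustration, not the Chapter 3 model itself) shows that the zero-inflation probability and the Poisson mean are recoverable from the first two moments, the style of moment-based identification argument used in the chapters that follow.

```python
import numpy as np

rng = np.random.default_rng(5)

# Zero-inflated Poisson: with structural-zero probability pi and Poisson mean
# lam, E[Y] = (1 - pi) * lam and Var[Y] = (1 - pi) * lam * (1 + pi * lam).
# Hence Var/E = 1 + pi * lam, and both parameters follow by method of moments.
n, pi, lam = 200_000, 0.3, 2.0
zeros = rng.binomial(1, pi, size=n)
y = np.where(zeros == 1, 0, rng.poisson(lam, size=n))

m, v = y.mean(), y.var()
lam_hat = m + v / m - 1      # lam = E + (Var/E - 1), since pi*lam = Var/E - 1
pi_hat = 1.0 - m / lam_hat   # from E = (1 - pi) * lam
print(lam_hat, pi_hat)
```

With the illustrative values above, lam_hat and pi_hat recover roughly 2.0 and 0.3 from the simulated sample.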
1.3.3 Misrepresentation modeling

In Chapter 4 of this thesis, we will propose models accounting for misrepresentation in binary risk factors. The type of misclassification caused by misrepresentation is a special case for which we know the direction of false statements. In particular, the following models and results will be presented.

• For predicting the number of insurance claims for a given insured, we will propose a Poisson response model for rate-making, using the misclassified variable as a predictor. The model identifiability will be studied through the method of moments estimators. Using both the Metropolis-Hastings algorithm and WinBUGS, computational studies will be conducted to evaluate the performance of Bayesian modeling, as well as confirming the theoretical findings.

• For the claim severity (amount), a gamma response model will be proposed. The model identification and the feasibility of statistical learning using Bayesian analysis will be studied through both theoretical studies, including the method of moments estimators, and MCMC computational studies.

• In order to predict whether there is any claim, we will put forward a Bayesian binary response model. Similarly, the model identification and feasibility of statistical learning will be studied through theoretical studies and MCMC simulations.

• Another situation we will study is when we use the misclassified variable as a response and we have correctly measured covariate variables. Both theoretical and simulation studies will be undertaken in order to understand the model identification, and to assess the impact of the priors.

Through these intensive studies, we will be able to confirm whether the distribution of the response influences the identifiability of the model. If there is an identification issue, then we will be able to confirm whether priors on the misclassified proportion help alleviate the problem.
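The cost of ignoring misrepresentation in such a rating model can be previewed with a simulation sketch (our hypothetical illustration of the bias, not the Chapter 4 estimator): when some high-risk insureds report the low-risk category, a naive estimate of the risk factor's effect is pulled toward zero.

```python
import numpy as np

rng = np.random.default_rng(6)

# Claims Y | V ~ Poisson(exp(b0 + b1 * V)) for a true binary risk factor V,
# but a fraction p_mis of the V = 1 group misrepresents and reports V = 0.
n, b0, b1, p_mis = 200_000, 0.0, 1.0, 0.3
v = rng.binomial(1, 0.4, size=n)
y = rng.poisson(np.exp(b0 + b1 * v))
v_obs = v * (1 - rng.binomial(1, p_mis, size=n))   # one-directional flips only

# Log rate ratio estimated from group means, true vs observed risk factor:
b1_true_hat = np.log(y[v == 1].mean() / y[v == 0].mean())
b1_naive = np.log(y[v_obs == 1].mean() / y[v_obs == 0].mean())
print(b1_true_hat, b1_naive)   # the naive estimate is biased toward zero
```

The v_obs = 0 group is contaminated with misrepresenting high-risk insureds, inflating its average claim count and shrinking the apparent effect; the Chapter 4 models aim to correct exactly this bias.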
1.3.4 Mismeasured AADT in collision prediction

In Chapter 5, we will investigate three Bayesian measurement error models based on lognormal and gamma distributions, for predicting collision counts using AADT. The proposed study consists of the following.

• We will investigate the identification of the lognormal model, and the possible impact on the estimated regression coefficient as well as on accident prediction. Theoretical results will be presented.

• Two gamma measurement error models will be proposed, for predicting accident counts using Poisson and negative binomial regression. Theoretical studies based on the method of moments estimators will be undertaken, in order to understand the model identification.

• For the complex Poisson gamma model, we will derive the Metropolis-Hastings algorithm for MCMC simulations. Simulation studies are performed in order to evaluate the model performance.

With both the theoretical and simulation results, we will be able to confirm the identification of the model, assess the impact of the mismeasurements in AADT, and confirm the feasibility of statistical learning using Bayesian inference.

Chapter 2
Prevalence Estimation with a Hidden Sub-population

2.1 Introduction

In clinical studies, one may be interested in the population prevalence of a binary trait in order to better understand the health status of the population and plan for risk control measures. For hard-to-reach populations, there may exist a hidden sub-population that cannot be reached under the available sampling schemes. The presence of a hidden sub-population in weighted sampling may cause a distortion in the final estimate of the population prevalence, particularly if harder-to-reach subjects are thought to exhibit different behaviors. In this chapter, we are interested in estimating population prevalence using a weighted sample from the non-hidden portion of the population.
The primary assumption behind our method is that conditional trait prevalence (e.g., of risk behaviors or disease status) varies smoothly with sampling probability, as the sampling probability goes to zero. Thus we are conceptualizing the hidden sub-population via a smooth limit of the hardest-to-reach sub-populations. This would seem particularly appropriate for some study designs. One example would be designs based on venue sampling [27, 56]. For instance, such designs have been used to estimate various trait prevalences in men-who-have-sex-with-men (MSM) populations [32, 33, 35, 60], due to the fact that MSM populations are otherwise hard to reach. Interviewers visit participating venues at randomly selected times and enroll eligible population members. Two challenges arise, as (i) population members who attend venues more frequently have a greater chance of being sampled, and (ii) typically it is not possible to enlist a sufficiently diverse array of venues to ensure that all population members have positive probability of selection. In the venue-sampling context, sampling weights for study participants can be constructed from questionnaire items concerning the frequency of venue attendance [27]. Ordinary use of these sampling weights addresses selection bias amongst ever-attenders of the venues, but does not address the bias involved in ignoring the hidden sub-population of never-attenders. It seems plausible in certain contexts that never-attenders are more like very seldom attenders than they are like frequent attenders. Thus it makes sense to model a proportion of the population as having a zero sampling probability, with conditional trait prevalence obtained as the limit as the sampling probability goes to zero. More generally, comprehensive sampling frames are hard to achieve in many survey contexts, so the proposed method can be used to assess sensitivity to the assumption that the frame is indeed comprehensive.
The unobserved information on the hidden sub-population will cause model identification issues of a kind that has been well studied in the literature. Under the Bayesian framework, earlier works such as [38], [23] and [61] have studied the amount of Bayesian learning arising from nonidentified models. Gustafson [17] discussed the analysis of nonidentified models and proposed a method for investigating the asymptotic behavior of the posterior mean and variance. Bayesian inference has the flexibility of incorporating expert knowledge through priors. This author [17] has also shown that, despite the nonidentifiability, posterior learning can be quantified by obtaining the posterior distribution using a so-called transparent parameterization, under which the posterior mean and variance of a target parameter can be written in terms of the identified part of the parameters. Particularly for partially identified models, in addition to obtaining an identification region that strictly contains the true value of the target as the support of the limiting posterior, one can study the shape of the limiting posterior distribution over this region. In [20] the author showed that in some problems the limiting posterior distribution will be more concentrated around the true value than the prior truncated to the identification region, which may provide us with useful information. In this chapter, we propose a Bayesian model for the population trait prevalence using a weighted sample from the non-hidden portion of the population. The prevalence in the hidden sub-population will be extrapolated by modeling the relationship between the trait and the sampling probability, under the assumption that the conditional prevalence varies smoothly from low to zero sampling probability. Using a method similar to [17, 18], we will derive the analytical forms of the large-sample limiting posterior mean and variance for our parameter of interest.
The limiting posteriors we obtain have several nice properties that match our intuition. Through MCMC simulations, we will show that the posterior mean and standard deviation converge to their limits in a fashion comparable to identified models, with the proviso that the limiting posterior standard deviation is positive due to the lack of identification. Despite the nonidentifiability caused by the presence of the hidden sub-population, our work confirms that sensible results can be obtained using our Bayesian model. The chapter is organized as follows. In Section 2.2, a Bayesian model will be proposed with a flexible form using three sets of assumptions. In Section 2.3, general expressions for the posterior mean and variance as well as their large-sample limits are obtained. MCMC simulations are performed in Section 2.4 using a nonparametric model based on the Bayesian bootstrap. The prevalence in the hidden sub-population is estimated from logistic regression based on natural cubic splines. Two case studies will be undertaken in Section 2.5 using weighted samples from the National Health and Nutrition Examination Survey (NHANES) [36], and the National Longitudinal Study of Adolescent Health (Adolescent Health) [25]. Section 2.6 concludes the chapter.

2.2 A Bayesian model

2.2.1 The observed variables

Under the weighted sampling method, the trait is observed directly from survey responses, while typically the sampling probability can be obtained from information on the sampling scheme and certain characteristics of the participant. We let Y be the binary trait that we are interested in, and Q be proportional to the sampling probability. The binary trait and unnormalized sampling probability (Y, Q) are the variables that we assume to be observable from the weighted sample.
Note that we do not address the situation where Q is subject to measurement error, though in some situations it might be that only a surrogate Q∗ can be observed (for instance, if sampling probabilities are determined somewhat ‘roughly’ from questionnaire responses, in the venue sampling context). While it is beyond the scope of the present article, it should be possible to extend our approach via model and prior assumptions concerning the relationship between the actual and recorded sampling probabilities Q and Q∗. This would be in line with general strategies for Bayesian measurement error correction (see, for instance, [16]).

2.2.2 Prior on hidden proportion

For weighted sampling where there is a hidden sub-population inaccessible through the sampling scheme, we would typically not know the exact proportion of the subjects that are hidden. In Bayesian analysis, investigator-supplied knowledge on the proportion of the population that is hidden can be incorporated into the model through priors. Despite the nonidentifiability this causes, we can specify a prior on the hidden proportion with its mean representing a guess based on expert knowledge, and a standard deviation reflecting the certainty of the guess. For example, we can specify the prior for the hidden proportion p as f(p | ω), a specific parametric distribution with its mean and standard deviation depending on the hyperparameters ω. If there is little information available on the hidden proportion, we may want to use sufficiently vague priors.

2.2.3 Distribution of sampling probabilities

Let Q1, Q2, . . . , Qn denote the sampling probabilities that we obtain from weighted sampling for the n participants in the study. We model the observed sampling probabilities with a distribution f∗(q | λ), where the unknown vector of parameters λ can be assigned vague or non-informative priors to reflect our lack of information on the parameter values.
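As one concrete choice for the parametric prior f(p | ω), a Beta distribution could be used (our illustrative choice; the model leaves the family generic), with the hyperparameters solved from the investigator's desired prior mean and standard deviation.

```python
def beta_hyperparams(mean, sd):
    """Return (alpha, beta) of a Beta prior with the given mean and sd.

    Uses mean = a / (a + b) and var = ab / ((a + b)^2 (a + b + 1)),
    so a + b = mean * (1 - mean) / var - 1.
    """
    var = sd ** 2
    total = mean * (1 - mean) / var - 1   # alpha + beta
    return mean * total, (1 - mean) * total

# Hypothetical elicitation: guess 10% hidden, with standard deviation 0.05.
a, b = beta_hyperparams(0.10, 0.05)
print(a, b)   # -> (3.5, 31.5)
```

The prior mean encodes the expert's best guess at the hidden proportion, while the standard deviation encodes how certain that guess is, matching the elicitation described above.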
Note that we will use f∗(·) to denote the distribution of the observed sampling probability, which is different from the distribution of the sampling probability across the population. Accounting for the variation in sampling probabilities [40], the distribution of the sampling probability in the accessible population is

f(q | λ) = cλ f∗(q | λ) / q,   (2.1)

where cλ = [∫ (1/q) f∗(q | λ) dq]⁻¹ is a normalizing factor. Therefore the population distribution of the sampling probability can be written as

p δ0(q) + (1 − p) f(q | λ),   (2.2)

where the distribution of p is specified earlier in Subsection 2.2.2, and δ0(·) is the Dirac delta function, the ‘density’ of a point-mass at zero.

2.2.4 Variable of interest

For demonstration purposes, we will present our model in the specific situation where our parameter of interest θ is the population prevalence of a binary trait Y, i.e., the expectation E[Y], though our model can be extended to a non-binary variable of interest. We assume that the binary trait Y may depend on the sampling probability Q, and we can model the relationship with a parameter vector β that characterizes the dependence between Y and Q. The model can be parameterized in such a way that when β = 0 the trait Y does not depend on Q. We may use a logistic regression model with the coefficients of the regressors depending on Q denoted by β, in which the parameter vector β satisfies the above model assumptions. In the design matrix, examples of items related to Q could include Q and Q² for quadratic regression, or basis functions in the case of natural cubic splines. Assuming that the distribution of (Y | Q) also depends on another parameter vector η such as an intercept, we may denote the probability mass function of (Y | Q) as f(y | q, β, η).
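Equation (2.1) can be checked numerically. The following sketch takes f∗(q | λ) to be Uniform(0.1, 1) for illustration (our choice, not from the chapter) and verifies that dividing by q and renormalizing yields a proper density.

```python
import numpy as np

# Numerical sketch of Equation (2.1): observed density f* is Uniform(0.1, 1);
# the population density among accessible subjects is c * f*(q) / q, with c
# chosen so the result integrates to one (simple Riemann sum on a fine grid).
grid = np.linspace(0.1, 1.0, 100_001)
dq = grid[1] - grid[0]
f_star = np.full_like(grid, 1.0 / 0.9)     # Uniform(0.1, 1) density

unnormalized = f_star / grid
c = 1.0 / (unnormalized.sum() * dq)        # c = [ int (1/q) f*(q) dq ]^{-1}
f_pop = c * unnormalized
print(unnormalized.sum() * dq)             # ~ ln(10)/0.9, about 2.558
print(f_pop.sum() * dq)                    # ~ 1.0 by construction
```

Note how f_pop places more mass near small q than f_star does: subjects with small sampling probability are under-represented in the observed sample, and the 1/q factor undoes that size bias.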
Using the notation and assumptions given earlier in this section, we can obtain the prevalence in the accessible sub-population as

θ_acc(β, η, λ) = E[Y | Q > 0] = ∫ ∑_{y=0,1} y f(y | q, β, η) f(q | λ) dq,

where f(q | λ) is defined in Equation (2.1), and E[·] denotes the trait prevalence as an expectation with respect to the population distribution, which is different from the weighted sample. Here we use summations for the binary trait, although our model can be used for variables with other types of distributions. Similarly, the prevalence in the hidden sub-population can be written as

θ_hid(β, η) = E[Y | Q = 0] = ∑_{y=0,1} y f(y | 0, β, η).

The overall prevalence in the population, θ, is the weighted average of the prevalences in the hidden and accessible sub-populations. That is,

θ(β, η, λ, p) = E[Y] = p θ_hid(β, η) + (1 − p) θ_acc(β, η, λ).    (2.3)

We now have a simple structure for the three components of our Bayesian model. Equation (2.3) expresses the trait prevalence as a function of the three sets of parameters specified in these subsections. From Equation (2.3), we also observe that the limiting posterior distribution will be a linear transform of the prior distribution on the hidden proportion p. By assuming independent priors for the unknown parameters p, λ and (β, η), we can show that the three sets of parameters from the three components are a posteriori independent. That is, the joint posterior density of the three sets of parameters can be written as

f(β, η, λ, p | y, q) ∝ f(y, q | β, η, λ, p) f(β, η, λ, p)
= f(y | q, β, η) f(q | λ) f(β) f(η) f(λ) f(p | ω)
= [f(y | q, β, η) f(β) f(η)] [f(q | λ) f(λ)] f(p | ω)
∝ f(β, η | y, q) f(λ | q) f(p | ω),

where the prior and posterior for p are identical, as we can never learn anything about the true proportion of the hidden sub-population from the data.
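Equation (2.3) is just a two-component mixture; a minimal numeric sketch (the sub-population prevalences and hidden proportion below are made up, purely for illustration):

```python
# Hypothetical values, not taken from the thesis: prevalence in the hidden
# and accessible sub-populations, and the hidden proportion p.
theta_hid, theta_acc, p = 0.30, 0.16, 0.10

# Equation (2.3): overall prevalence is the p-weighted mixture of the two.
theta = p * theta_hid + (1 - p) * theta_acc
print(round(theta, 3))  # → 0.174
```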
2.3 Large-sample limit of posterior distributions

In this section we study the feasibility of statistical inference by examining the large-sample limits of the posterior mean and standard deviation of the prevalence θ. We use the methods and notation of [17, 18] for deriving the posterior mean and variance under our partially identified model, and for obtaining their large-sample limits.

2.3.1 Posterior mean

For the Bayesian model proposed in Section 2.2, the identified part of the parameter vector is φ_I = (λ, β, η) and the nonidentified part is φ_N = p. Hence, the prevalence θ can be written as θ = θ(φ_I, φ_N) = θ(λ, β, η, p). Denote by D_n = {(Y1, Q1), (Y2, Q2), ..., (Yn, Qn)} a sample of n observations. Following the transparent parameterization of [17, 18], we have E[θ | D_n] = E[θ̃(φ_I) | D_n], where for our model

θ̃(φ_I) = E[θ(φ_I, φ_N) | φ_I] = p̄ θ_hid(β, η) + (1 − p̄) θ_acc(β, η, λ) = θ(λ, β, η, p̄),    (2.4)

where p̄ is the prior mean of p. We use the prior mean p̄ (a function of ω indexing the assumed prior distribution), as it has a practical meaning and aids interpretation. Under the regularity conditions, as the sample size grows, the model gradually learns the true values of all parameters except the hidden proportion. Hence the large-sample limit of the posterior mean of θ is

lim_{n→∞} E[θ | D_n] = θ(λ0, β0, η0, p̄) = p̄ θ_hid(β0, η0) + (1 − p̄) θ_acc(β0, η0, λ0),    (2.5)

where λ0, β0 and η0 denote the true values of λ, β and η. For our Bayesian model, the asymptotic posterior mean is therefore a linear function of the prior mean p̄. It is straightforward to conclude that the accuracy of the posterior estimate depends on the true value of β and on the compatibility of the prior with the true value of p. When β0 = 0, the posterior mean is not affected by the prior on p, since the hidden sub-population has the same prevalence as the accessible one.
This is what we would expect when we can infer from the data that there is no relationship between the binary trait and the sampling probability.

2.3.2 Posterior standard deviation

Similarly, we can obtain the asymptotic posterior variance using the method of [17, 18]. The posterior variance can be written as

Var[θ | D_n] = E[Var{θ | φ_I, D_n} | D_n] + Var[E{θ | φ_I, D_n} | D_n]
= E[(θ(λ, β, η, p) − θ(λ, β, η, p̄))² | D_n] + Var[θ(λ, β, η, p̄) | D_n].

Under the regularity conditions, the limiting posterior variance is then

lim_{n→∞} Var[θ | D_n] = Var(p) (θ_hid(β0, η0) − θ_acc(β0, η0, λ0))²,    (2.6)

where Var(p) (a function of ω) is the prior variance of p. Hence the limiting posterior standard deviation of θ is simply the product of the prior standard deviation of p and the absolute difference between the prevalences in the hidden and accessible sub-populations. That is,

lim_{n→∞} SD[θ | D_n] = SD(p) |θ_hid(β0, η0) − θ_acc(β0, η0, λ0)|,    (2.7)

where SD(p) is the prior standard deviation of p. From Equation (2.7), we may conclude that the limiting posterior standard deviation is likely to be very small for most realistic settings of the prior on p and true values of the prevalences in the accessible and hidden sub-populations. This indicates that, with a large sample size, we may be able to obtain sensible inference on the trait prevalence using our proposed model, when a reasonable amount of expert knowledge is available on the proportion of the hidden sub-population.

2.3.3 The limiting posterior distribution

From Equation (2.3) in the previous section, we may conclude that the limiting posterior of θ is a linear transform of the prior on the hidden proportion p. This means that the limiting posterior distribution has the same shape as the prior on p. Therefore the limiting posterior has a calibration property.
That is, if the true value of the hidden proportion p falls at the α quantile of the prior, then the true value of θ falls at either the α or 1 − α quantile of the limiting posterior distribution. This shows the importance of specifying a reasonable prior for the hidden proportion. When β0 = 0, the posterior standard deviation converges to 0, in which case the limiting posterior is a point mass. This makes sense: when there is no association between the trait and the sampling probability, there is no difference between the prevalences in the accessible and hidden sub-populations.

2.4 Simulation study

In this section, we evaluate the model proposed in Section 2.2 through a simulation study. For each simulated dataset, the posterior distribution of the trait prevalence is evaluated through MCMC. A logistic regression model based on natural cubic splines is used to model the relationship between the trait and the sampling probability. For the distribution of the sampling probabilities, we use the nonparametric Bayesian bootstrap. Although we choose the Bayesian bootstrap for its simplicity and generality, we could also use a closely related nonparametric model based on Dirichlet process priors, or a parametric model based on an appropriate distribution. All three models can be used in situations where we are only able to obtain sampling probabilities that are proportional to the actual sampling probabilities. In our experience, the three models give essentially the same results.

2.4.1 Model specification

We propose a logistic model with natural cubic splines for the relationship between the binary trait and the sampling probability. The logistic model is specified as

f_Y(y | Q = q) = p_q^y (1 − p_q)^{1−y},    (2.8)
logit{p_q} = S(q),    (2.9)

where S(q) is a natural cubic spline with k uniformly spaced knots (x1, x2, ..., xk).
In our model, we take x1 = 0 in order to extrapolate the prevalence to the Q = 0 case for the hidden sub-population. The idea is similar to low-dose extrapolation in pharmaceutical investigations, in which the risks at exposures below the range of the data are estimated using statistical models. Similarly, the estimates from our model may be affected by the smooth relationship we assume between the prevalence and the sampling probabilities. The impact of a non-smooth relationship is evaluated in Subsection 2.4.4. The population prevalence can be obtained as

θ = p e^{S(0)} / (1 + e^{S(0)}) + (1 − p) E_{F*}[ c e^{S(Q)} / (Q (1 + e^{S(Q)})) ],    (2.10)

where c = 1 / E_{F*}[Q^{−1}] is a normalizing factor calculated over F*, the distribution of the observed sampling probability. Since for our model in Section 2.2 the posteriors of p and (β, η) are independent, we can draw MCMC samples of p from the prior distribution, and samples of the regression coefficients and intercept (β, η) directly from the posterior arising from the logistic regression model. Denote by q1, q2, ..., qn the n observed values of the sampling probability Q. The Bayesian bootstrap (BB) [45] can be regarded as modeling the distribution of Q as a discrete distribution with support points taken to be the observed sampling probabilities. The family of distributions can be parameterized via λ = (λ1, λ2, ..., λn), with λi being the probability mass associated with qi. The improper distribution Dirichlet(0, 0, ..., 0) is taken as the prior for λ, so that observing one observation at each support point yields the Dirichlet(1, 1, ..., 1) posterior. Uncertainty about the distribution of Q is then represented via uncertainty about how much mass is attached to each of the n observed values. Denote by p^(1), p^(2), ..., p^(K) a posterior sample of p of size K, and by S^(1)(·), S^(2)(·), ..., S^(K)(·) the natural cubic splines based on a posterior sample (β^(1), η^(1)), (β^(2), η^(2)), ...
, (β^(K), η^(K)) of (β, η). For a posterior draw S^(k)(·) of the natural cubic spline and a posterior draw (λ1^(k), λ2^(k), ..., λn^(k)) of λ, the corresponding posterior draw of the integral E_{F*}[ c e^{S(Q)} / (Q (1 + e^{S(Q)})) ] under BB can be written as

c^(k) ∑_{i=1}^{n} λi^(k) e^{S^(k)(qi)} / ( qi (1 + e^{S^(k)(qi)}) ),

where c^(k) = 1 / ∑_{i=1}^{n} λi^(k) / qi. It is then straightforward to obtain the posterior draw θ^(k) of the prevalence from all these quantities.

2.4.2 Posterior behavior on finite samples

In this subsection, we study the convergence of the posterior mean and variance of the trait prevalence based on a simulation study with finite samples. As an illustration, we consider the situation where the true value of the hidden proportion p is 10%. The performance of the model is affected by the quality of the prior information relative to the true hidden proportion. For the sampling probability, we assume that we are only able to obtain unnormalized sampling probabilities that are proportional to the true values. Hence, for generating the observed sampling probabilities, we use two Gamma distributions to represent situations where the distribution is relatively concentrated or spread out. The parameters (a, b) of the two Gamma distributions take the values (2, 0.5) and (10, 0.1) respectively. We use the R function ns() in the splines package to obtain the basis functions of a natural cubic spline. For data generation, three different natural cubic splines are used in our logistic model for the relationship between prevalence and sampling probability.

[Figure 2.1: Relationship of binary trait and unnormalized sampling probability; panels (a) logistic scale and (b) probability scale, showing the low-slope, non-monotone and large-slope splines.]
Two of the natural cubic splines give monotone relationships, one with a small slope and one with a large slope, while the third spline gives a non-monotone relationship. The coefficients of the three natural cubic splines, with 4 uniformly spaced knots (0, 1, 2, 3), are (−1.37, −1.14, −3.87, −0.64), (−1.82, −2.12, −4.25, −2.08) and (−1.17, −1.10, −3.72, −0.80) respectively. Figure 2.1 plots the spline relationships on both the logistic and probability scales. Five different prior distributions are used to reflect the diversity of information we may have on the hidden proportion. The priors are Beta distributions with mean and standard deviation (p̄, SD(p)) given as (7.5%, 2.5%), (15%, 2.5%), (7.5%, 15%), (20%, 2.5%) and (30%, 7.5%). The third case represents a situation where we have, in practical terms, very little information on the hidden proportion. The last two are extreme cases in which the prior contains incorrect information about the true proportion. Although evaluations of the asymptotic posterior mean and standard deviation under various scenarios indicate that sensible inferences can be obtained when we have some information on the hidden proportion, we study the convergence of the posteriors through simulations with finite samples, in order to get a sense of how our Bayesian model will work in realistic situations. The model is fitted to datasets of different sizes from the data-generating mechanisms, using the priors for p given earlier in this section. For model fitting, the boundary knots (x1, x4) of the natural cubic splines are 0 and the 95th percentile of the observed sampling probabilities. Hence there is a mild discrepancy, in that the knot locations assumed in the analysis differ from those used in data generation.
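The Beta hyperparameters implied by a stated prior mean and standard deviation follow from moment matching, and the limiting posterior summaries of Equations (2.5) and (2.7) can then be evaluated directly. A minimal sketch for the first prior above; the true sub-population prevalences used here are assumptions for illustration, not the simulation's exact values:

```python
def beta_from_mean_sd(m, s):
    """Moment-match Beta(alpha, beta) hyperparameters to mean m and SD s."""
    nu = m * (1 - m) / s ** 2 - 1        # nu = alpha + beta
    return m * nu, (1 - m) * nu

# First prior in the simulation study: mean 7.5%, SD 2.5%.
alpha, beta = beta_from_mean_sd(0.075, 0.025)

# Hypothetical true sub-population prevalences (illustrative assumptions).
theta_hid0, theta_acc0 = 0.30, 0.16
p_bar, sd_p = 0.075, 0.025

# Equation (2.5): limiting posterior mean is linear in the prior mean of p.
lim_mean = p_bar * theta_hid0 + (1 - p_bar) * theta_acc0
# Equation (2.7): limiting posterior SD is the prior SD of p times the
# absolute prevalence gap between the hidden and accessible groups.
lim_sd = sd_p * abs(theta_hid0 - theta_acc0)

print(round(alpha, 2), round(beta, 2))       # → 8.25 101.75
print(round(lim_mean, 4), round(lim_sd, 4))  # → 0.1705 0.0035
```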
For MCMC sampling of the coefficients of the natural cubic splines, the normal distribution matching the mode and curvature at the mode of the logistic likelihood is used as the proposal distribution for the Metropolis-Hastings algorithm. For the examples we examined, the acceptance rate of the Metropolis-Hastings algorithm is near one, and the posterior samples mix very well. The convergence of the posterior mean and standard deviation for six of the possible 2 × 3 × 5 = 30 scenarios described earlier is shown in Figures 2.2 and 2.3. The boxplots are based on 100 simulated datasets at each of the sample sizes shown, for each of the six cases. The posterior mean and standard deviation are calculated from 5000 posterior samples from the MCMC simulations. The base case refers to the situation where the distribution of the observed sampling probability, the natural cubic spline and the prior on the hidden proportion take the first set of values in the data-generation and prior assumptions. For the last panel, the prior standard deviation is set to 0.15 and the natural cubic spline with the large slope is chosen, so that the limiting posterior standard deviation is around 0.015. For each of the other four panels, we change only one of the distribution of the observed sampling probability, the natural cubic spline, or the prior on the hidden proportion at a time. For all the cases, we observe that both the posterior mean and standard deviation of the population prevalence converge to their asymptotic values. Despite the nonidentifiability, the posterior mean converges to its large-sample limit at a √n rate, the same as in the identified case.
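The proposal construction described at the start of this subsection can be sketched as follows: locate the mode and curvature of the logistic log-likelihood by Newton-Raphson, then run an independence Metropolis-Hastings sampler with the matching normal as proposal. The simulated data, the two-column design standing in for the spline basis, the flat prior, and all tuning constants here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Simulated data (hypothetical): a two-column design stands in for the
# natural-cubic-spline basis used in the thesis.
n = 2000
q = rng.gamma(2.0, 0.5, size=n)
X = np.column_stack([np.ones(n), q])
y = rng.binomial(1, expit(X @ np.array([-1.5, -0.4])))

def loglik(beta):
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Newton-Raphson for the mode of the logistic likelihood.
beta = np.zeros(2)
for _ in range(25):
    mu = expit(X @ beta)
    H = X.T @ (X * (mu * (1 - mu))[:, None])   # observed information
    beta = beta + np.linalg.solve(H, X.T @ (y - mu))
mu = expit(X @ beta)
H = X.T @ (X * (mu * (1 - mu))[:, None])       # curvature at the mode
mode, cov = beta, np.linalg.inv(H)
L = np.linalg.cholesky(cov)

def log_prop(b):
    # Log density of the N(mode, cov) proposal, up to an additive constant.
    d = b - mode
    return -0.5 * d @ np.linalg.solve(cov, d)

# Independence Metropolis-Hastings with the normal approximation as proposal.
cur, cur_ll, accept, draws = mode, loglik(mode), 0, []
for _ in range(2000):
    prop = mode + L @ rng.standard_normal(2)
    prop_ll = loglik(prop)
    if np.log(rng.random()) < (prop_ll - cur_ll) + (log_prop(cur) - log_prop(prop)):
        cur, cur_ll = prop, prop_ll
        accept += 1
    draws.append(cur)

acc_rate = accept / 2000  # typically close to one when the normal fits well
```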
The posterior standard deviation appears to exhibit (√n)^{−1} behavior when the limiting posterior standard deviation is small, but radically different behavior when its limit is large. In particular, the posterior standard deviation then decreases much more slowly, with the rate of convergence slowing as it approaches its large-sample limit.

[Figure 2.2: Sampling distribution of the posterior mean of prevalence θ with different sample sizes, for panels (a) base, (b) large slope, (c) non-monotone, (d) concentrated q, (e) large prior mean and (f) large posterior SD.]

For the cases with incorrectly specified priors (the last two prior distributions), the posterior mean and standard deviation still converge to their large-sample limits. However, the limiting posterior credible intervals will not cover the true prevalence, due to the calibration property shown in Section 2.3.
Hence, special precautions need to be taken when specifying priors for the hidden proportion.

[Figure 2.3: Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes, for the same six panels as Figure 2.2. In panel (d), the concentrated distribution of the sampling probability makes it difficult to obtain subjects with small sampling probabilities, which hampers finite-sample inference but not the asymptotic case.]

For situations where there is little information available on the hidden proportion, we recommend using sufficiently vague priors. For studies where useful information is available on the hidden proportion, our Bayesian model may help improve inference on the population prevalence. Additional results and figures are given in Appendix A.
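For reference, the per-iteration computation from Subsection 2.4.1 — a Bayesian-bootstrap draw of λ (Dirichlet(1, ..., 1), obtained as normalized Exponential(1) variables), the tilted weighted average giving the accessible-population term, and the mixture with a prior draw of p — can be sketched as below. The linear stand-in for the spline draw S^(k)(·), the Gamma setting for q and the Beta prior parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-ins for one MCMC iteration: a simple linear S(.) in
# place of a fitted natural cubic spline draw S^(k)(.), simulated observed
# probabilities q, and a draw of p from an illustrative Beta prior.
S = lambda x: -1.5 - 0.4 * x
q = rng.gamma(2.0, 0.5, size=1000)
p_k = rng.beta(8.25, 101.75)            # prior mean about 7.5%

# Bayesian-bootstrap draw: Dirichlet(1, ..., 1) mass over the n observed
# support points, via normalized Exponential(1) variables.
e = rng.exponential(size=q.size)
lam = e / e.sum()

# c^(k) = 1 / sum_i lambda_i / q_i; the BB estimate of
# E_{F*}[ c e^{S(Q)} / (Q (1 + e^{S(Q)})) ] is a weighted average of
# expit(S(q_i)) under the tilted weights proportional to lambda_i / q_i.
c_k = 1.0 / np.sum(lam / q)
theta_acc_k = c_k * np.sum(lam * expit(S(q)) / q)

# One posterior draw of the overall prevalence, as in Equation (2.10).
theta_k = p_k * expit(S(0.0)) + (1 - p_k) * theta_acc_k
```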
[Figure 2.4: Limiting posterior distribution of prevalence θ, for panels (a) base, (b) large slope, (c) non-monotone, (d) concentrated q, (e) large prior mean and (f) large posterior SD.]

2.4.3 The limiting posterior

In Figure 2.4, we plot the limiting posterior distributions for the six cases considered in Figures 2.2 and 2.3. As shown in the previous section, the shapes of the posterior distributions are inherited from those of the priors for the hidden proportion p.

2.4.4 Assumption of a smooth relationship

Our proposed model assumes a smooth relationship between the binary trait Y and the sampling probability Q. Here we study what happens when this assumption does not actually hold at Q = 0. Assume that at Q = 0 we instead have logit{p_0} = δ S(0), where δ ≠ 1. That is, the prevalence in the hidden sub-population is not the limit of the prevalence in the near-hidden sub-population. Then the true population prevalence in Equation (2.10) is replaced with

θ = p e^{δS(0)} / (1 + e^{δS(0)}) + (1 − p) E_{F*}[ c e^{S(Q)} / (Q (1 + e^{S(Q)})) ]
  = r p e^{S(0)} / (1 + e^{S(0)}) + (1 − p) E_{F*}[ c e^{S(Q)} / (Q (1 + e^{S(Q)})) ],    (2.11)

where r = e^{δS(0)} (1 + e^{S(0)}) / [ (1 + e^{δS(0)}) e^{S(0)} ] is the relative magnitude of the prevalence in the hidden sub-population. As δ is an unknown constant that we cannot learn from the weighted sample, the large-sample limit of the posterior mean from our model will differ from the one given in Equation (2.5), while that of the posterior standard deviation will not change. In Table 2.1, we evaluate the impact of a non-smooth relationship on the population prevalence.
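The factor r in Equation (2.11) simplifies to the ratio expit(δ S(0)) / expit(S(0)); a minimal numeric check, with assumed values of S(0) and δ (hypothetical, chosen only for illustration):

```python
from math import exp

def expit(x):
    return 1.0 / (1.0 + exp(-x))

# Hypothetical values: the spline at zero and a departure factor delta != 1.
S0, delta = -1.5, 1.2

# r = expit(delta * S(0)) / expit(S(0)): the relative magnitude of the
# hidden-group prevalence in Equation (2.11).
r = expit(delta * S0) / expit(S0)
print(round(r, 3))  # → 0.778
```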
We give the required values of r that move the true value of the prevalence outside the 2.5% and 97.5% percentiles of the limiting posterior distribution, for each of the 12 (out of 18) scenarios in which the hidden proportion lies inside these percentiles of the prior on p. We observe that only when the credible interval is very narrow is the result sensitive to the assumption of a smooth relationship.

Table 2.1: Impact of a non-smooth relationship on population prevalence. The three digits in the case number correspond to the three assumptions on the distribution of the observed sampling probability, the natural cubic spline and the prior on the hidden proportion. For example, '111' refers to the situation where we take the first set of values for the three assumptions given in the previous subsections. The notation θ̂ denotes the limiting posterior mean.

Case | θ      | θ̂     | Limiting posterior 2.5% | Limiting posterior 97.5% | Required r < | Required r >
111  | 0.1636 | 0.1635 | 0.1633 | 0.1637 | 0.983 | 1.008
113  | 0.1636 | 0.1635 | 0.1632 | 0.1654 | 0.974 | 1.109
121  | 0.2156 | 0.2136 | 0.2102 | 0.2182 | 0.814 | 1.089
123  | 0.2156 | 0.2136 | 0.2075 | 0.2495 | 0.720 | 2.173
131  | 0.1828 | 0.1829 | 0.1828 | 0.1829 | 0.999 | 1.002
133  | 0.1828 | 0.1829 | 0.1826 | 0.1829 | 0.988 | 1.002
211  | 0.1605 | 0.1603 | 0.1600 | 0.1607 | 0.969 | 1.015
213  | 0.1605 | 0.1603 | 0.1597 | 0.1637 | 0.954 | 1.195
221  | 0.1597 | 0.1562 | 0.1503 | 0.1643 | 0.673 | 1.156
223  | 0.1597 | 0.1562 | 0.1454 | 0.2194 | 0.503 | 3.063
231  | 0.1835 | 0.1835 | 0.1834 | 0.1836 | 0.998 | 1.004
233  | 0.1835 | 0.1835 | 0.1830 | 0.1836 | 0.973 | 1.007

2.5 Case studies on real data

2.5.1 NHANES data

For illustration purposes, we apply our model to a weighted sample from the National Health and Nutrition Examination Survey [36] by the U.S. Centers for Disease Control and Prevention. The sampling weights and disease status for asthma, arthritis and cancer are from the 2007 to 2008 surveys. We are not suggesting that there is a hidden sub-population in the NHANES sampling scheme.
Rather, we merely provide a 'what if' illustration of how inferences would change if we postulate that there is a hidden sub-population. This can be regarded as a form of sensitivity analysis. In particular, we ascribe a Beta distribution with mean 0.1 and standard deviation 0.05 as the prior for p, the proportion of the population that is hidden. In real applications, of course, the prior specification would be based on subject-area knowledge. The weights from the 2007 to 2008 surveys are unnormalized weights, ranging from 0.07 to 2.89. In our analysis, we exclude all observations with missing values on the disease status for each disease. The dataset for asthma contains 9659 observations, while for arthritis and cancer the sample sizes are 5924 and 5925 respectively. The acceptance rates of the Metropolis-Hastings algorithm are around 0.85, 0.99 and 0.94 for the asthma, arthritis and cancer data respectively. The relationships between prevalence and the unnormalized sampling probability are plotted in Figure 2.5, using 50 posterior samples from the MCMC simulation as an illustration. The bold line is the spline relationship based on the posterior mean of the coefficients.

[Figure 2.5: Posterior samples of natural cubic splines based on 50 samples, with panels for asthma, arthritis and cancer. The histograms are based on the frequency of subjects with the unnormalized sampling probability within each range of width 0.1. The sample size of the asthma data is 9659, about 2 times larger than for the other two diseases.]
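The raw and weighted estimates reported below differ only in how observations are weighted; with sampling probabilities q, the weighted proportion is the Hájek-type estimate Σ(y_i/q_i) / Σ(1/q_i). A tiny sketch with made-up numbers (not NHANES data):

```python
import numpy as np

# Tiny hypothetical weighted sample: sampling probabilities q, binary trait y.
q = np.array([0.2, 0.5, 1.0, 1.0, 2.0])
y = np.array([1, 1, 0, 1, 0])

raw = y.mean()                              # unweighted sample proportion
weighted = np.sum(y / q) / np.sum(1.0 / q)  # weighted (accessible) estimate

print(raw, round(weighted, 3))  # → 0.6 0.842
```

The weighted estimate upweights low-probability subjects, so the two can differ substantially when the trait is associated with the sampling probability.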
Table 2.2: Estimated prevalences in the NHANES example

Disease   |        | Raw Estimate | Weighted Est | Posterior Est
Asthma    | θ̂     | 0.1433 | 0.1461 | 0.1471
          | SE(θ̂) | 0.0036 | 0.0048 | 0.0053
Arthritis | θ̂     | 0.2962 | 0.2565 | 0.2663
          | SE(θ̂) | 0.0059 | 0.0073 | 0.0080
Cancer    | θ̂     | 0.0972 | 0.0897 | 0.0952
          | SE(θ̂) | 0.0038 | 0.0048 | 0.0052

Estimates are reported in Table 2.2. For each disease, the raw prevalence estimate is the unweighted sample proportion. The weighted estimate is the weighted proportion, which can be viewed as a prevalence estimate for the accessible sub-population. The posterior estimate uses the proposed methodology in tandem with the illustrative prior distribution for the hidden proportion. From our case study using the NHANES data, we observe that the inclusion of sampling weights can have a large impact on the estimate of prevalence, particularly in the case of arthritis. In practice, where there is an association between the sampling probability and disease status or risk behavior, the presence of a hidden sub-population that cannot be sampled will bias the prevalence estimate, with the size of the bias depending on the shape of the relationship. In the case of arthritis, we observe a fairly large slope, with prevalence varying from about 0.25 to 0.40. Looking at the standard error estimates, there are only modest increases in uncertainty from acknowledging a hidden sub-population.

2.5.2 Adolescent Health Study

In order to illustrate our method in a situation where there is evidence of a hidden sub-population, we apply it to another dataset, from the U.S. National Longitudinal Study of Adolescent Health [25]. In this case study, we are interested in whether the subject has been tested for the sexually transmitted diseases HIV, chlamydia and gonorrhea in the past 12 months. The variables are available in Wave 3, when the subjects were aged between 18 and 28.
Although in Wave 1 of the study there seems to be no evidence of a substantial hidden sub-population, we would expect there to be one in Wave 3, as the sample consists only of subjects who could be located six years after the Wave 1 interview. Conventionally, we might treat the drop-out as missing data, under certain assumptions. For example, we may assume that the drop-out is unrelated to the quantities of interest (i.e., the assumption of missing completely at random). Under this assumption, we may simply ignore the drop-out (as done by the study) or use imputation methods to fill in the missing information. For the Adolescent Health Study, it is plausible that the drop-out is associated with health behaviours. Hence, we may use our method to model the relationship between the prevalence and the sampling probabilities, and extrapolate the information about the drop-outs from those who are less likely to participate in the survey. By comparing the Wave 1 and Wave 3 samples, we observe that only 75% of the Wave 1 subjects were included in the Wave 3 study. Hence we would expect about 25% of the population to be hidden under the Wave 3 sampling scheme, due to reasons such as address changes and voluntary drop-out. Although we might expect the actual hidden proportion to be larger than 25% due to other factors, here we assume a conservative prior for the hidden proportion p: a Beta distribution with mean 0.25 and standard deviation 0.05. The weights from Wave 3 are unnormalized weights, ranging from 296 to 27327. They have been adjusted from Wave 1 to Wave 3 to reflect the change in sampling probabilities. In our analysis, we exclude all observations with missing values on the test status for each disease. The sample sizes for the test status of the three diseases HIV, chlamydia and gonorrhea are 4102, 4091 and 4104 respectively.
Similar to the NHANES study, the relationships between prevalence and the unnormalized sampling probability are plotted in Figure 2.6, and estimates are reported in Table 2.3.

[Figure 2.6: Posterior samples of natural cubic splines based on 50 samples, with panels for HIV, chlamydia and gonorrhea. The histograms are based on the frequency of subjects with the unnormalized sampling probability within each range of width 0.03.]

Table 2.3: Estimated prevalences in the Adolescent Health example

Tested Disease |        | Raw Estimate | Weighted Est | Posterior Est
HIV            | θ̂     | 0.1848 | 0.1751 | 0.1688
               | SE(θ̂) | 0.0061 | 0.0069 | 0.0094
Chlamydia      | θ̂     | 0.1723 | 0.1615 | 0.1567
               | SE(θ̂) | 0.0059 | 0.0067 | 0.0094
Gonorrhea      | θ̂     | 0.1562 | 0.1465 | 0.1440
               | SE(θ̂) | 0.0057 | 0.0064 | 0.0092

Note that the adjustments for the HIV and chlamydia testing prevalences are substantial, in the sense that the point estimates move by one standard error and the standard errors increase by about a half. The conclusions from this case study are similar to those from the NHANES data. For cases where there is a hidden sub-population that cannot be sampled, Bayesian adjustments can be helpful when prior information is available on the hidden proportion.

2.6 Conclusions

In the previous sections, we proposed a Bayesian model for inferring trait prevalence using a weighted sample from the non-hidden portion of the population. The prevalence in the hidden sub-population is inferred by extrapolating the relationship between the trait and the sampling probability, presuming that this relationship is smooth. The behavior of the posterior distribution of the population prevalence is studied using methods similar to those of [17, 18] for nonidentified models.
We obtain the limiting posteriors in simple analytical forms with intuitively expected properties. If the true value of the hidden proportion falls at the α quantile of the prior, then the true value of the prevalence falls at either the α or 1 − α quantile of the limiting posterior distribution. This demonstrates the efficacy of Bayesian inference with a reasonable prior for the hidden proportion. MCMC-based inference is performed on finite samples in order to evaluate the effectiveness of statistical learning. The model and results are applied to two real datasets from weighted sampling. Our work confirms that sensible results can be obtained with Bayesian analysis, despite the nonidentifiability caused by the presence of a hidden sub-population that cannot be sampled. It is an example of a partially identified model, in which the shape of the posterior distribution gives valuable information for statistical inference on the variable of interest (see [20]).

Chapter 3

Never-respondents in Medical Expenditure Panel Survey

3.1 Background

The Medical Expenditure Panel Survey (MEPS) [1] is a set of national surveys on the frequency, cost and source of payment for the health services that Americans use, as well as on the health insurance of U.S. workers. MEPS data have frequently been used in the literature for estimating population expenditures and utilization of certain health services. For example, Krauss, Machlin and Kass [29] reported the population percentages with various types of health service visits, based on MEPS household component data. In research reports such as [4, 28, 39], the total population average expense, as well as averages by service type and source of payment, were estimated using MEPS 1996 data. In Rohde and Machlin [43], costs were estimated for prenatal care of uncomplicated pregnancies by insurance status and type of care.
Bernard and Banthin [2] evaluated expenditures on health care and insurance premiums by factors such as insurance status and family size. For diabetes, Sarpong and Miller [46] assessed trends in pharmaceutical treatments by comparing the average expenditures on medications, as well as the proportions of individuals using different types of medicines. More recent works include expenditure and utilization estimation for subcategories such as dental care [42], depression [50], anticonvulsants [54], hypertension [7], the top five therapeutic medications [51, 52], costly medical conditions [6], dermatological agents [53] and trauma-related disorders [62]. For MEPS, sampling weights are available at both the household and individual levels. The initial base weights are those from the National Health Interview Survey (NHIS, [37]), as the households participating in MEPS are a subsample of those included in NHIS. In order to increase the precision of estimates for certain sub-populations, the MEPS sample design includes a feature of oversampling Hispanics, Blacks and Asians. The base weights are adjusted to reflect this oversampling. MEPS weights are then adjusted for nonresponse and calibrated to the Current Population Survey (CPS) population estimates. The post-stratification and raking of the sample weights is based on six variables: census region, MSA status indicator, race/ethnicity, sex, age and poverty status. Details on the MEPS sampling scheme and weight calculations can be found in earlier reports such as [5] and [11]. For MEPS data, the selection bias caused by oversampling and usual nonresponse (e.g., due to time inconvenience) can be adjusted for by using weighted estimates instead of the naive mean and variance for the population quantities of interest (e.g., health expenditure and utilization). However, some of the nonresponse may be related to the individuals' medical expenditure or utilization.
We use the term never-respondents for the group of people who would never choose to respond to the survey, for reasons related to their medical expenditure or utilization. For inference about the whole population including the never-respondents, weight adjustment based on demographic and financial factors will not be sufficient. Selection bias and bias modeling is a topic that has been well studied in the epidemiology literature (see, e.g., [12, 15, 30, 44, 55]). For the MEPS study, it is generally hard to conduct a traditional sensitivity analysis due to the lack of any data for never-respondents. Although other covariate variables are available from the NHIS, the variables of our interest, such as those on expenditures or utilizations, are not collected. Under such a situation, the statistical models for estimating the population expenditure and utilization will be nonidentified. One possible approach is to perform a Bayesian sensitivity analysis by assuming different priors on the proportion of never-respondents. Based on the sensitivity analysis, we gain more confidence about how the never-respondents may impact the variables of our interest (e.g., the population expenditure and utilization). In the previous chapter, a Bayesian model was proposed for estimating the population disease prevalence using a weighted sample from the non-hidden portion of the population. A model-based approach was employed, assuming a smooth relationship between the variable of interest and the sampling probability. The fitted values from the regression are used for obtaining the population estimates. The key assumption of the method is that the hidden sub-population is treated as the limiting case of unlikely respondents as the sampling probability approaches zero. Using a similar method, one can perform a sensitivity analysis on the impact of a hidden sub-population of never-respondents for MEPS.
In this chapter, we will extend the methodology to more complex situations where the variables of interest are non-binary, including continuous variables and discrete variables that are zero-inflated. Despite the complexity of the models, we are able to derive analytical forms for the limiting posterior mean and standard deviation of our variables of interest. We will show that the calibration property of the binary model in the previous chapter extends to the non-binary cases under consideration. A sensitivity analysis will be performed regarding population estimation of health expenditures and utilizations, using the 2010 MEPS individual data [1]. Based on the features of the expenditure data, we use a two-part approach, with a logistic regression model for the probability of zero expenditure and a gamma regression model for the distribution of positive expenditures. Zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) regression models are used for the utilization variables, depending on the over-dispersion in the data. Our analysis indicates that individuals with higher health expenditures and utilizations are more likely to be included in MEPS. Using our model-based approach, we are able to assess how the assumption on the proportion of never-respondents impacts the population estimates of medical expenditures and utilizations. Our analysis confirms that the existence of a small never-respondent sub-population can lead to a mild over-estimation in most of the expenditure and utilization variables that we include in the analysis. In addition, we study the possible impact of the hidden never-respondents on regression analysis with a binary covariate variable. In particular, we propose three models, based on logistic regression, gamma regression, and ZIP and ZINB regression, respectively, for the binary, expenditure and utilization variables.
Analytical forms are obtained for the large-sample limits of the posterior mean and standard deviation of the relative risk (for binary data) and the relative magnitude (for continuous and discrete data). In addition to the assumption of a smooth relationship between the expected expenditure (or expected utilization) and the sampling probability, we assume that the sampling probability has an impact on the binary covariate. Using the 2010 MEPS data [1], we apply our models to study the impact of insurance coverage on the expenditure and utilization variables. From a sensitivity study assuming different priors on the proportion of never-respondents, we confirm that acknowledging the existence of never-respondents results in a mild impact on the estimated relative risk and relative magnitude. The rest of the chapter is organized as follows. In Section 3.2, we propose a general framework for our Bayesian models of the population expenditures and utilizations, using data from the accessible sub-population. Section 3.3 presents the analytical expressions for the posterior mean and variance, and their large-sample limits, for our Bayesian models. In Section 3.4, we apply our approach to estimating the population expenditures and utilizations, using the 2010 MEPS data [1]. Section 3.5 studies the impact of the hidden never-respondents on regression analysis with a binary covariate. Section 3.6 concludes the chapter.

3.2 A Bayesian approach

3.2.1 The observed variables

For MEPS, the expenditure and utilization variables are observed directly from the survey responses, while the sampling weights (probabilities) are estimated from NHIS and adjusted for oversampling and nonresponse based on population characteristics. We denote M as the magnitude of the expenditure or frequency of health care visits, and Q as the sampling probability. (Mo, Qo) are the variables that we are able to observe from the weighted sample, in both the expenditure and the utilization case.
Note that in weighted sampling, (Mo, Qo) have distributions that are different from those of (M, Q) in the accessible sub-population. As opposed to the binary case in the previous chapter, the expenditure or visit variable M is zero-inflated relative to the continuous distributions (such as the gamma) and discrete distributions (such as the negative binomial) that are commonly used for expenditure and utilization data. For continuous expenditure variables, there is an additional indicator variable Y that we can observe (i.e., we can observe whether the expenditure is zero or not), as continuous parametric distributions generally have a positive support. For health care visits, the zero indicator Y is unobserved under popular discrete models, including the zero-inflated Poisson and zero-inflated negative binomial distributions, which have a non-negative support: an observed zero count may come from either the point mass or the count distribution.

3.2.2 Prior on hidden proportion

Similar to the previous chapter, we will specify the prior on the proportion of hidden never-respondents based on existing knowledge. Here the prior mean represents our best guess and the standard deviation reflects the certainty of the guess. For example, we can formulate the prior distribution as

p ∼ π(p | ω),   (3.1)

where π(p | ω) is a specific parametric distribution with its mean and standard deviation depending on the vector of parameters ω. If we have little information on the proportion (i.e., we only know it is between 0 and 1), then we could choose an uninformative prior with a large prior variance.

3.2.3 Distribution of sampling probabilities

Under the same framework as the previous chapter, we will need to specify the distribution of the sampling probabilities. Let Q1, Q2, ..., Qn denote a weighted sample of the sampling probabilities that we obtain for the n individuals in MEPS. We assume that the observed sampling probabilities are from an arbitrary distribution f(q | λ), with λ denoting a vector of parameters.
The distribution of the observed sampling probability can be written as

f*(Q = q | Q > 0) = f(q | λ),   (3.2)

where the parameter vector λ can be assigned a vague prior to reflect our lack of information on the parameter values. From earlier definitions, we know that Pr(Q = 0) = p. It is straightforward to obtain the population distribution of the sampling probability in the accessible sub-population as

f(q) = C f(q | λ) / q.   (3.3)

3.2.4 Variables of interest

For the MEPS study, our variables of interest are more complicated, including continuous and count variables that are zero-inflated. The population average of the expenditures or health care visits, κ, can be obtained using the law of total expectation. That is,

κ = E(M) = Pr(Y = 1) E(M | Y = 1) + Pr(Y = 0) E(M | Y = 0).

We may assume that both the expenditure or utilization indicator Y and the conditional magnitude (M | Y = 1) depend on the population sampling probability Q. Further assume that we can use a parameter vector β to characterize the dependence between Y and Q, with β = 0 corresponding to the case when they are independent. Similarly, we assume that the relationship between (M | Y = 1) and Q can be characterized by a parameter vector γ, with γ = 0 corresponding to independence. For MEPS, we want to assess whether the nonresponse is related to the magnitude of medical expenditures or the frequency of utilizations. So we may use a gamma regression for the relationship between the expenditure (or utilization) and the sampling probability, with the coefficients of the terms involving q denoted by γ. The parameter vector γ in gamma regression satisfies the above model assumptions. Assuming that the distributions of Y and (M | Y = 1) also depend on additional parameter vectors η and ζ, respectively, we may denote the distribution functions of Y and (M | Y = 1) as

fY(y | Q = q) = f(y | q, β, η),
fM(m | Y = 1, Q = q) = f(m | q, γ, ζ).
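The re-weighting in Equation (3.3) can be illustrated with a small numerical example; the three support points and their probabilities below are hypothetical, chosen only to make the arithmetic exact:

```python
import numpy as np

# Hypothetical discrete illustration of Equation (3.3): the observed sampling
# probability is equally likely to be 0.1, 0.2 or 0.4, i.e. f*(q) = 1/3 each.
q_vals = np.array([0.1, 0.2, 0.4])
f_star = np.array([1.0, 1.0, 1.0]) / 3.0

# Population law in the accessible sub-population: f(q) = C f*(q) / q,
# with C = 1 / E*(1/Q) so that the probabilities sum to one.
C = 1.0 / np.sum(f_star / q_vals)
f_pop = C * f_star / q_vals
# f_pop works out to (4/7, 2/7, 1/7): low-probability individuals are up-weighted.
```

Note that the normalizing constant C equals the population mean of Q, since the factor q cancels when computing that expectation.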
Despite the complication caused by zero inflation, it is straightforward to obtain the expected expenditure or visit frequency of the accessible and hidden (never-respondent) sub-populations as

κ_acc(λ, β, η, γ, ζ) = E(M | Q > 0)
  = Pr(Y = 1 | Q > 0) E(M | Y = 1, Q > 0)
  = [∫ Pr(Y = 1 | Q = q) (C f(q | λ)/q) dq] × [∫∫ m f(m | q, γ, ζ) (C f(q | λ)/q) dq dm],   (3.4)

κ_hid(β, η, γ, ζ) = E(M | Q = 0)
  = Pr(Y = 1 | Q = 0) E(M | Y = 1, Q = 0),   (3.5)

where E(·) denotes the expectation, and C f(q | λ)/q is the density function of the population sampling probability in the accessible sub-population. The overall expenditure (or visit frequency) of the population, κ, is the weighted average of those of the accessible and hidden (never-respondent) sub-populations. That is,

E(M) = p κ_hid(β, η, γ, ζ) + (1 − p) κ_acc(λ, β, η, γ, ζ) = κ(λ, β, η, γ, ζ, p).   (3.6)

Despite the complication in the variables of interest, we can still use the simple structure from the previous chapter. The three sets of parameters from the three components in Subsections 3.2.2, 3.2.3 and 3.2.4 will still be a posteriori independent, when independent priors are assumed.

3.3 Large-sample limits of posterior distributions

Using the method and notation from [17, 19] for deriving the posterior mean and variance of nonidentified models, we can obtain the large-sample limits of the posterior mean and variance for our quantities of interest.

3.3.1 Posterior mean

For the Bayesian model proposed in Section 3.2, the identified part of the parameters is φI = (λ, β, η, γ, ζ) and the nonidentified part is φN = p. Hence, the population expenditure (or utilization) κ can be written as κ = κ(φI, φN) = κ(λ, β, η, γ, ζ, p).
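Since κ in Equation (3.6) is linear in the nonidentified p, averaging κ over any prior on p is the same as plugging in the prior mean; a minimal sketch, with hypothetical values for κ_hid, κ_acc and the prior:

```python
import numpy as np

def kappa(p, kappa_hid, kappa_acc):
    """Overall population mean as a mixture of hidden and accessible means (Eq. 3.6)."""
    return p * kappa_hid + (1.0 - p) * kappa_acc

# kappa is linear in p, so averaging over a prior on p gives kappa evaluated
# at the prior mean -- the key step behind the posterior-mean result below.
rng = np.random.default_rng(5)
p_draws = rng.beta(14.3, 128.7, size=200_000)   # hypothetical prior, mean 0.1
lhs = kappa(p_draws, 1000.0, 3000.0).mean()     # E[kappa(p)] under the prior
rhs = kappa(p_draws.mean(), 1000.0, 3000.0)     # kappa at the prior mean
```

The two quantities agree up to floating-point error, regardless of the shape of the prior.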
According to [17, 19], we have

E[κ | data] = E[κ̃(λ, β, η, γ, ζ) | data],   (3.7)

where it is straightforward that for our model

κ̃(λ, β, η, γ, ζ) = ∫ κ(λ, β, η, γ, ζ, p) f(p | λ, β, η, γ, ζ) dp
  = ∫ κ(λ, β, η, γ, ζ, p) π(p) dp
  = p̄ κ_hid(β, η, γ, ζ) + (1 − p̄) κ_acc(λ, β, η, γ, ζ)
  = κ(λ, β, η, γ, ζ, p̄),

where p̄ is the prior mean of p. We can never learn any information about the true proportion of the sub-population of never-respondents. Hence we expect the posterior mean of the hidden (never-respondent) proportion to remain equal to the prior mean we specify, regardless of the sample size. That is,

E(p | data) = p̄.   (3.8)

Under the regularity conditions, as the sample size increases, the model will gradually learn the true values of all the parameters except the hidden proportion of never-respondents. Hence the large-sample limit of the posterior mean of κ can be obtained as

E[κ | data] → κ(λ0, β0, η0, γ0, ζ0, p̄) = p̄ κ_hid(β0, η0, γ0, ζ0) + (1 − p̄) κ_acc(λ0, β0, η0, γ0, ζ0),   (3.9)

where λ0, β0, η0, γ0, ζ0 denote the true values of λ, β, η, γ and ζ. For our Bayesian model, the asymptotic posterior mean of κ is a linear function of the prior mean p̄. It is straightforward to conclude that the accuracy of the posterior estimate will depend on the prior on p and the true values of β and γ. When β0 = 0 and γ0 = 0, the posterior mean is not affected by the prior on p, which is what we would expect when there is no relationship between the sampling probability and the expenditure (or utilization).

3.3.2 Posterior standard deviation

Similarly, we can obtain the posterior variance using the method from [17, 19].
The posterior variance can be written as

Var[κ | data] = E[(κ(λ, β, η, γ, ζ, p) − κ(λ, β, η, γ, ζ, p̄))² | data] + Var[κ(λ, β, η, γ, ζ, p̄) | data]
  = Var(p) E[(κ_hid(β, η, γ, ζ) − κ_acc(λ, β, η, γ, ζ))² | data] + Var[κ(λ, β, η, γ, ζ, p̄) | data],   (3.10)

where Var(p) is the prior variance of p. Under the regularity conditions, the limiting posterior variance is

Var[κ | data] → Var(p) (κ_hid(β0, η0, γ0, ζ0) − κ_acc(λ0, β0, η0, γ0, ζ0))²,   (3.11)

and the limiting posterior standard deviation is simply the product of the prior standard deviation of p and the absolute difference between the hidden and accessible sub-population expenditures (or utilizations). That is,

STD[κ | data] → STD(p) |κ_hid(β0, η0, γ0, ζ0) − κ_acc(λ0, β0, η0, γ0, ζ0)|,   (3.12)

where STD(p) is the prior standard deviation of p. We observe that the limiting posteriors possess the same calibration property as the binary model in the previous chapter. That is, the posterior credible interval will contain the true value of κ whenever we are able to specify a prior distribution for p that contains the true value of p at the same confidence level. And when β0 = 0 and γ0 = 0, the posterior standard deviation converges to 0, which makes sense: when there is no association between the expenditure (or utilization) and the sampling probability, there is no difference between the accessible and hidden (never-respondent) sub-populations. In the previous chapter, we showed through a simulation study that the speed of convergence of the posterior mean and standard deviation is comparable to that of identified models.

3.4 Sensitivity analysis on MEPS

In this section, we will apply our approach to the 2010 MEPS data [1], as a sensitivity analysis on the impact of never-respondents.
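The limiting formulas (3.9) and (3.12) reduce such a sensitivity analysis to two numbers per scenario; a minimal sketch with hypothetical sub-population means and prior moments:

```python
def limiting_posterior(prior_mean_p, prior_sd_p, kappa_hid, kappa_acc):
    """Large-sample limits of the posterior mean and SD of kappa (Eqs. 3.9 and 3.12)."""
    mean = prior_mean_p * kappa_hid + (1.0 - prior_mean_p) * kappa_acc
    sd = prior_sd_p * abs(kappa_hid - kappa_acc)
    return mean, sd

# Hypothetical true values: hidden mean 1000, accessible mean 3000, and a
# prior on p with mean 0.1 and standard deviation 0.025.
m, s = limiting_posterior(0.1, 0.025, 1000.0, 3000.0)
# When the two sub-populations agree, the limiting posterior SD collapses to zero.
m0, s0 = limiting_posterior(0.1, 0.025, 2000.0, 2000.0)
```

The sketch makes the calibration property concrete: the limiting spread is driven entirely by the prior on p and the gap between the hidden and accessible means.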
For the medical expenditure and utilization variables of interest, we will use a two-part gamma model and zero-inflated Poisson and negative binomial regressions for modeling the relationship with the sampling probabilities. A parametric model based on the gamma distribution, or a nonparametric model based on the Bayesian bootstrap, can be used for the distribution of the unnormalized sampling probabilities.

3.4.1 Model specification

Expenditure model

We assume that the observed (unnormalized) sampling probabilities follow a gamma distribution with shape parameter a and scale parameter b. That is, the parameter vector λ in Equation (3.2) has the form λ = (a, b). Equation (3.2) in Subsection 3.2.3 can now be explicitly written as

f*_{Q>0}(q) = f_{Γ(a, b)}(q),   (3.13)

where f_{Γ(a, b)}(·) is the probability density function of a gamma distribution with shape parameter a and scale parameter b. For our model, when the observed sampling probability has a gamma distribution, we can show that the population distribution of Q is also a gamma distribution. That is,

f(q) ∝ f*_{Q>0}(q) / q = f_{Γ(a−1, b)}(q).   (3.14)

In our simulations, we specify the prior distribution π(p) in Equation (3.1) as a beta distribution, to reflect our information on the hidden (never-respondent) proportion. For MEPS, we suspect that the never-response may be related to the individual's status on medical expenditures and utilizations. Here we assume that the expenditures and utilizations vary smoothly with the sampling probability. We propose to use a two-part gamma model with natural cubic splines for modeling the relationship between the expenditure and the sampling probability. The model is more complicated than that in the previous chapter due to zero-inflation.
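The closure property in Equation (3.14) can be checked pointwise: dividing the Γ(a, b) density by q and renormalizing gives the Γ(a−1, b) density. A small sketch, with hypothetical values a = 3 and b = 0.05:

```python
from math import gamma, exp

def gamma_pdf(q, a, b):
    """Density of a gamma distribution with shape a and scale b."""
    return q ** (a - 1.0) * exp(-q / b) / (gamma(a) * b ** a)

# The integral of f(q | a, b) / q over q equals 1 / ((a - 1) * b), so
# (a - 1) * b * f(q | a, b) / q should reproduce the Gamma(a - 1, b) density.
a, b = 3.0, 0.05      # hypothetical parameter values
checks = []
for q in (0.02, 0.1, 0.3):
    lhs = (a - 1.0) * b * gamma_pdf(q, a, b) / q
    rhs = gamma_pdf(q, a - 1.0, b)
    checks.append(abs(lhs - rhs))
```

The identity follows from Γ(a) = (a − 1)Γ(a − 1), and it requires a > 1 for the population distribution to remain proper.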
In particular, we can write the expenditure model as

(Y | Q = q) ∼ Bernoulli(p_q), logit(p_q) = S(q),   (3.15)
(M | Q = q, Y = 0) ∼ δ0(·),   (3.16)
(M | Q = q, Y = 1) ∼ Γ(α, β_q), log µ_q = log(α/β_q) = N(q),   (3.17)

where δ0(·) is the Dirac delta function (a point mass at zero), α and β_q are the shape and rate parameters of the gamma distribution, and S(·) and N(·) are natural cubic splines with 4 uniformly spaced knots (x1, x2, x3, x4). Here, we use a log link function for the gamma regression as it gives a better interpretation of the estimated effects. For the gamma model specified in Equations (3.15) to (3.17), we can now rewrite the expenditure in Equation (3.6) as

κ = p κ_hid(β, η, γ, ζ) + (1 − p) κ_acc(λ, β, η, γ, ζ)
  = p exp(S(0) + N(0))/(1 + exp(S(0))) + (1 − p) E_{Γ(a−1, b)}[exp(S(Q) + N(Q))/(1 + exp(S(Q)))],   (3.18)

where E_{Γ(a−1, b)}(·) denotes the expectation with respect to the gamma distribution of the population sampling probabilities. With Equation (3.18), we will be able to perform posterior analysis of the expenditure κ under finite sampling, using Markov chain Monte Carlo (MCMC) techniques.

Utilization model

For health care utilization, we will use a zero-inflated Poisson (ZIP) regression model for the relationship between the number of visits and the sampling probability. That is, we assume the observed number of visits M comes from a mixture of a point mass distribution at zero and a Poisson distribution. As opposed to the expenditure model, where all the observed zeros come from the point mass distribution, we are not able to distinguish the zeros of the point mass distribution from those of the Poisson distribution. The distributions of the observed variables can be specified as

Pr(M = 0 | Q = q) = (1 − p_q) + p_q exp(−λ_q),
Pr(M = m | Q = q) = p_q exp(−λ_q) λ_q^m / m!, m = 1, 2, ...,
logit(p_q) = S(q), log λ_q = N(q),   (3.19)

where λ_q is the mean of the Poisson distribution. Under the Poisson assumption, the count variable has a variance that is equal to its mean.
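The ZIP specification in Equation (3.19) can be verified numerically; a small sketch with hypothetical values of p_q and λ_q, checking that the pmf sums to one and that the implied mean is E(M | Q = q) = p_q λ_q:

```python
from math import exp, factorial

def zip_pmf(m, p, lam):
    """Zero-inflated Poisson pmf (Eq. 3.19): a point mass at zero with
    probability 1 - p, mixed with a Poisson(lam) with probability p."""
    pois = exp(-lam) * lam ** m / factorial(m)
    return (1.0 - p) + p * pois if m == 0 else p * pois

# Hypothetical values: mixture weight p = 0.6, Poisson mean lam = 2.5.
p, lam = 0.6, 2.5
probs = [zip_pmf(m, p, lam) for m in range(100)]     # tail beyond 100 is negligible
zip_mean = sum(m * pr for m, pr in enumerate(probs))  # should equal p * lam
```

Zeros from the point mass and zeros from the Poisson component are pooled in the first branch, which is exactly why the indicator Y is latent for count data.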
Hence, in cases where the variance of the count variable is considerably larger than the mean, a Poisson model is inappropriate. In these cases, a negative binomial model may be used, which introduces an additional parameter to account for possible over-dispersion. Zero-inflated negative binomial (ZINB) regression can be specified based on the parameterization of the mean and the dispersion parameter. In particular, the model can be written as

Pr(M = 0 | Q = q) = (1 − p_q) + p_q [α^{-1}/(α^{-1} + µ_q)]^{α^{-1}},
Pr(M = m | Q = q) = p_q [Γ(m + α^{-1}) / (Γ(m + 1) Γ(α^{-1}))] [α^{-1}/(α^{-1} + µ_q)]^{α^{-1}} [µ_q/(α^{-1} + µ_q)]^m, m = 1, 2, ...,
logit(p_q) = S(q), log µ_q = N(q),   (3.20)

where µ_q is the mean of the negative binomial distribution, and α is a dispersion parameter that is assumed to be fixed. Despite the zero indicator being unobserved, it is straightforward to verify that the population frequency of visits (utilization) has the same analytical form as Equation (3.18).

Bayesian bootstrap

We assume that the observed sampling probabilities Q1, Q2, ..., Qn are independent and identically distributed with an arbitrary distribution F*. That is, Equation (3.2) becomes

Qi | Qi > 0 ∼ F*,   (3.21)

where F* is the distribution of the observed sampling probability. The expenditure and utilization in Equation (3.18) can be rewritten as

κ = p exp(S(0) + N(0))/(1 + exp(S(0))) + (1 − p) c E_{F*}[exp(S(Q) + N(Q)) / (Q (1 + exp(S(Q))))],   (3.22)

where c = 1/E_{F*}(1/Q) is a normalizing factor calculated with respect to the distribution F* of the observed sampling probabilities. Since, for our model in Section 3.2, the posteriors of p and (β, η, γ, ζ) are independent of the posterior of the distribution of Q, we can draw Markov chain Monte Carlo (MCMC) samples of p from the prior distribution, and of (β, η, γ, ζ) directly from the logistic regression and the gamma (or Poisson, negative binomial) regression. Denote p(1), p(2), ..., p(K) as a posterior sample of p of size K; S(1)(·), S(2)(·), ..., S(K)(·) as the natural cubic splines based on a posterior sample (β(1), η(1)), (β(2), η(2)), ..., (β(K), η(K)); and N(1)(·), N(2)(·), ..., N(K)(·) as the natural cubic splines based on a posterior sample (γ(1), ζ(1)), (γ(2), ζ(2)), ..., (γ(K), ζ(K)). According to Rubin [45], posterior samples of κ can be generated using the Bayesian bootstrap (BB) as follows. With an observed sample of sampling probabilities q1, q2, ..., qn, for each k = 1, 2, ..., K we have

1. Generate (g1(k), g2(k), ..., gn(k)) from an n-dimensional Dirichlet distribution Dirichlet(1, 1, ..., 1).

2. The corresponding posterior sample of the integral E_{F*}[c exp(S(Q) + N(Q)) / (Q (1 + exp(S(Q))))] can be obtained as c(k) Σ_{i=1}^{n} gi(k) exp(S(k)(qi) + N(k)(qi)) / (qi (1 + exp(S(k)(qi)))), where c(k) = 1 / (Σ_{i=1}^{n} gi(k)/qi).

It is then straightforward to obtain the posterior sample of expenditure or utilization, κ(k).

3.4.2 Asymptotic behavior

Our model satisfies the regularity conditions of the asymptotic theory for Bayesian models, except for the nonidentifiability of the hidden (never-respondent) proportion. Denoting Fn* as the empirical distribution of the observed sampling probabilities, the posterior distribution of the distribution of Q under both the gamma model (if correctly specified) and the Bayesian bootstrap will converge to the true distribution. That is,

F* | Q1, Q2, ..., Qn → F0*,   (3.23)

where F0* is the true distribution of the observed sampling probability (Q | Q > 0). For the gamma model, Equation (3.23) ensures

E[a | data] → a0, E[b | data] → b0,   (3.24)

where a0 and b0 are the true values of the gamma parameters. Similar to Subsections 3.3.1 and 3.3.2, we will derive the posterior mean and variance of the expenditure or utilization κ, and their large-sample limits, for the parametric and nonparametric models proposed in this section, applying the method and notation from [17, 19] for nonidentified models.
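The two Bayesian bootstrap steps can be sketched as follows; for simplicity the spline factor is dropped, so each draw targets the population mean of Q itself rather than κ, and the Gamma(3, 0.05) sample of observed probabilities is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical observed sampling probabilities for n = 400 respondents.
q_obs = rng.gamma(3.0, 0.05, size=400)

def bb_draw(q, rng):
    """One Bayesian-bootstrap draw (Rubin, 1981) of the population mean of Q:
    draw Dirichlet(1, ..., 1) weights, then re-weight by 1/q and normalize,
    mirroring the normalizing factor c^(k) in the algorithm."""
    g = rng.dirichlet(np.ones(len(q)))
    w = g / q
    return np.sum(w * q) / np.sum(w)

draws = np.array([bb_draw(q_obs, rng) for _ in range(2000)])
# The BB posterior for the population mean of Q concentrates near (a - 1) b = 0.1.
```

Replacing the identity function with the spline factor exp(S(q) + N(q))/(1 + exp(S(q))) inside the weighted sum recovers the κ(k) draws described in the algorithm.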
Posterior mean

Using the explicit form of the expenditure or utilization κ, we can rewrite the posterior mean limit in Equation (3.9) as

E[κ | data] = E[ p̄ exp(S(0) + N(0))/(1 + exp(S(0))) + (1 − p̄) Cn E_{F*|Q1,...,Qn}( exp(S(Q) + N(Q)) / (Q (1 + exp(S(Q)))) ) | data ]
  → p̄ exp(S0(0) + N0(0))/(1 + exp(S0(0))) + (1 − p̄) C0 E_{F0*}( exp(S0(Q) + N0(Q)) / (Q (1 + exp(S0(Q)))) ),   (3.25)

where F0* is the true distribution of the observed sampling probabilities (Q | Q > 0), S0(·) and N0(·) describe the true relationship between the expenditure (utilization) and the sampling probability, and C0 = 1/E_{F0*}(1/Q) is the normalizing factor after accounting for the sampling weights.

Posterior variance

Similarly, we can obtain the specific expression of the posterior variance in Equations (3.10) and (3.11) as

Var[κ | data] = Var(p) E[ ( exp(S(0) + N(0))/(1 + exp(S(0))) − Cn E_{F*|Q1,...,Qn}( exp(S(Q) + N(Q)) / (Q (1 + exp(S(Q)))) ) )² | data ] + Var[g(F*|Q1, ..., Qn, S(·), N(·), p̄) | data]
  → Var(p) ( exp(S0(0) + N0(0))/(1 + exp(S0(0))) − C0 E_{F0*}( exp(S0(Q) + N0(Q)) / (Q (1 + exp(S0(Q)))) ) )².   (3.26)

The posterior standard deviation in Equation (3.12) now becomes

STD[κ | data] → STD(p) | exp(S0(0) + N0(0))/(1 + exp(S0(0))) − C0 E_{F0*}( exp(S0(Q) + N0(Q)) / (Q (1 + exp(S0(Q)))) ) |.   (3.27)

3.4.3 MEPS application

Here, we will apply our models to the health expenditure and visit variables from the 2010 MEPS, as a sensitivity analysis on the impact of never-response. The dataset contains records on sampling weights (i.e., inverses of the unnormalized sampling probabilities), expenditure and health care visit variables, as well as other variables such as insurance coverage and demographics, for 31,228 Americans.

Expenditure analysis

For our sensitivity analysis on the existence of a hidden sub-population of never-respondents, we choose six expenditure variables: total medical charge, total medical expenditure (the amount actually paid, which is usually smaller than the total charge), self-paid expenditure, and expenditures covered by Medicaid, Medicare and private insurance.
First, we examine the distributional assumptions for the expenditure variables in Figure 3.1. For all types of expenditures, the distribution is right-skewed with a positive support. Hence we may use distributions such as the gamma and the lognormal for our regression models. Although for some sources of payment, such as Medicaid, the distribution of the logarithm of the expenditures looks symmetric, expenditures predicted using lognormal regression are highly over-estimated, as a little left-skewness in the log-expenditure distribution has a huge impact on estimates on the original dollar scale. Hence, we use a gamma model-based approach, which gives results that are quite close to those from the weighted methods.

[Figure 3.1: Distribution of expenditure by group of sampling probability (q < 0.1, 0.1 < q < 0.2, 0.2 < q < 0.3, q > 0.3), with panels (a) total charge, (b) total expenditure, (c) Medicaid, (d) Medicare, (e) private insurance and (f) self-paid. We observe that the conditional distributions (M | M > 0) are all right-skewed with a positive support.]

In the Metropolis-Hastings algorithm, we use the correlation matrix of the regression coefficients from generalized linear models for our normal proposal distributions. The variance of the proposal distribution is chosen such that the acceptance rate is around 0.25-0.50 for all the parameters. We take every 10th posterior sample after discarding the first 1000, for a total posterior sample size of 6000. Fitted regression curves using the posterior mean (bold lines), along with 50 posterior samples, for the expenditures are given in Figures 3.2 to 3.7.
[Figure 3.2: Total charges - posterior samples of natural cubic splines (50 samples), with panels (a) indicator, (b) magnitude and (c) total amount, each plotted against the unnormalized sampling probability. The solid dots on the x-axis are the 5%, Q1, median, Q3 and 95% quantiles of the sampling probabilities.]

We observe that, in general, individuals with a higher sampling probability have a much higher average expenditure and a higher proportion with nonzero expenditures, for all sources of payment except Medicaid, for which people might not want to be included as they are receiving government welfare. The possibility of never-response will thus cause most types of expenditures to be over-estimated, and Medicaid expenditures to be under-estimated. Table 3.1 presents the population estimates of total charges, total expenditures, expenditures covered by Medicaid, Medicare and private insurance, and self-paid expenditures. As an illustration, the sensitivity analysis uses prior means of 0.1, 0.2 and 0.3, and a prior standard deviation of 0.025, for the proportion of never-respondents.
The naive estimates (arithmetic means), the weighted estimates, and the model-based estimates assuming no never-respondents are given for comparison. Except for Medicaid, ignoring the existence of never-response causes an over-estimation of the population expenditures. In all cases, acknowledging never-response results in higher standard errors. The weighted estimates are very close to the model-based estimates assuming no never-respondents.

[Figure 3.3: Total expenditures - posterior samples of natural cubic splines (50 samples), with panels (a) indicator, (b) magnitude and (c) total amount.]

[Figure 3.4: Medicaid - posterior samples of natural cubic splines (50 samples), with panels (a) indicator, (b) magnitude and (c) total amount.]
[Figure 3.5: Medicare - posterior samples of natural cubic splines (50 samples), with panels (a) indicator, (b) magnitude and (c) total amount.]

[Figure 3.6: Private insurance - posterior samples of natural cubic splines (50 samples), with panels (a) indicator, (b) magnitude and (c) total amount.]

[Figure 3.7: Self-paid - posterior samples of natural cubic splines (50 samples), with panels (a) indicator, (b) magnitude and (c) total amount.]

Table 3.1: Comparison of estimated expenditures for the different types of expenditures. Except for Medicaid, ignoring the existence of never-response will cause an over-estimation of the population expenditures. In all cases, acknowledging never-response will result in higher standard errors. The weighted estimates are very close to the model-based estimates assuming no never-respondents.
Source        Naive         Wgted         p = 0         p̄ = .1       p̄ = .2       p̄ = .3
              Mean   SE     Mean   SE     Mean   SE     Mean   SE     Mean   SE     Mean   SE
Total chg     6829   167    6001   218    5992   106    5832   141    5671   149    5512   161
Total exp     3446   63.7   2843   73.0   2869   46.0   2760   72.8   2651   75.0   2544   78.3
Medicare      902    32.7   759    37.9   750    36.8   727    41.5   704    45.4   681    50.0
Medicaid      535    32.6   628    47.0   634    21.5   650    27.6   667    32.5   684    38.1
Private ins   1156   34.3   783    28.5   810    17.5   756    32.6   702    31.7   647    32.2
Self-paid     449    7.1    329    6.7    335    5.8    317    10.9   298    11.2   278    11.5

Utilization analysis

For the health care utilization analysis, we choose the numbers of health care visits, including those that are office-based, outpatient, emergency, inpatient, prescription, dental and home care. The distributional assumptions for the visit variables are assessed in Figure 3.8 and Table 3.2, using the 31,228 records from the 2010 MEPS data. We observe from the histograms (Figure 3.8) that there are large values, ranging from 20 to 500, for the different types of visits. The variance-to-mean ratio varies dramatically across the types of health care visits. Since Poisson regression implies that the variance equals the mean conditional on the sampling probability Q, it is generally acceptable to have an unconditional ratio around two to three. Hence it is appropriate to use a ZIP regression model for types of visits such as emergency room visits, inpatient hospital stays and dental visits, for which the variance-to-mean ratios are below 2.9. For the other four types of health care visits, the variance-to-mean ratios are above 18, and it is more appropriate to use a ZINB regression model.

Table 3.2: Over-dispersion in utilization data. For emergency room visits, inpatient hospital stays and dental visits, the variance-to-mean ratio (unconditional on sampling probabilities) is below 2.9. For the other types of visits, the ratio is above 18.
Type           Including 0         Excluding 0
               Mean   Var/Mean     Mean   Var/Mean
Office based   4.00   19.7         6.1    17.6
Outpatient     0.34   18.8         2.9    16.7
Emergency      0.17   1.8          1.4    0.6
Inpatient      0.09   1.8          1.3    0.5
Prescription   9.45   42.8         17.0   35.2
Dental         0.76   2.9          2.1    1.6
Home care      1.62   222.4        80.5   143.6

Hence, for the health care utilizations, we carried out the analysis with ZIP and ZINB regression models.

[Figure 3.8: Distribution of health care visit counts by groups of sampling probability; panels (a) Office based (log), (b) Outpatient (log), (c) Emergency, (d) Inpatient, (e) Prescription (log), (f) Dental, (g) Home care (log). For some types of visits, we have to perform a logarithm transformation in order to visualize the body of the distribution.]

Similar to the expenditure model, we use the correlation matrix of the regression coefficients from generalized linear models for our normal proposal distributions in the Metropolis-Hastings algorithm. The variance of the proposal distribution is chosen such that the acceptance rate is around 0.25-0.50 for all the parameters. For ZIP regression, we take every 10th posterior sample after discarding the first 1000, for a total posterior sample size of 6000. For negative binomial regression, we have to run the simulations longer and take every 160th posterior sample, due to higher autocorrelations in the output.
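The variance-to-mean diagnostic behind Table 3.2 is straightforward to compute; below is a minimal sketch in Python (synthetic counts stand in for the MEPS visit variables, which are not reproduced here):

```python
import numpy as np

def dispersion_ratio(counts):
    """Unconditional variance-to-mean ratio, including and excluding zeros."""
    counts = np.asarray(counts, dtype=float)
    nonzero = counts[counts > 0]
    return {
        "mean_incl0": counts.mean(),
        "ratio_incl0": counts.var() / counts.mean(),
        "mean_excl0": nonzero.mean(),
        "ratio_excl0": nonzero.var() / nonzero.mean(),
    }

rng = np.random.default_rng(0)
# Poisson counts: ratio near 1, so a ZIP model is plausible
poisson_like = rng.poisson(2.0, size=10_000)
# negative binomial counts: ratio well above 1, so ZINB is preferred
nb_like = rng.negative_binomial(0.5, 0.05, size=10_000)
print(dispersion_ratio(poisson_like)["ratio_incl0"])  # close to 1
print(dispersion_ratio(nb_like)["ratio_incl0"])       # far above 1
```

The same two-column summary (including versus excluding zeros) mirrors the layout of Table 3.2.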
The utilization analysis results from the ZIP and ZINB models are given in Figures 3.9 to 3.15. For ZIP and ZINB regression, a portion of the zeros is attributed to the Poisson (or negative binomial) component, so the results for the non-zero indicator will not have a clean interpretation. We will therefore mainly rely on the final average number of visits. We observe that, for most of the utilization variables, participants with higher sampling probabilities have higher average health care utilizations.

[Figure 3.9: Office-based visits - posterior samples of natural cubic splines (50 samples); panels (a) Indicator, (b) Frequency, (c) Overall frequency. Both the mixture proportion and the mean of the ZINB distribution increase with the sampling probability.]

[Figure 3.10: Outpatient visits - posterior samples of natural cubic splines (50 samples).]

[Figure 3.11: Emergency visits - posterior samples of natural cubic splines (50 samples).]
[Figure 3.12: Inpatient stays - posterior samples of natural cubic splines (50 samples); panels (a) Indicator, (b) Frequency, (c) Overall frequency.]

[Figure 3.13: Prescription medicines - posterior samples of natural cubic splines (50 samples).]

[Figure 3.14: Dental visits - posterior samples of natural cubic splines (50 samples).]

[Figure 3.15: Home care visits - posterior samples of natural cubic splines (50 samples). We use the negative binomial model due to an overdispersion of more than 100.]

Table 3.3 presents the population estimates of health care visit frequencies, including office-based, outpatient, emergency, inpatient, prescription, dental and home care visits. The sensitivity analysis uses prior means of 0.1, 0.2 and 0.3, and a prior standard deviation of 0.025, for the proportion of never-respondents. The naive estimates (an arithmetic mean), the weighted estimates and the model-based estimates assuming no never-respondents are given as a comparison.
For all types of health care visits, ignoring the existence of never-response will cause an over-estimation of the population utilizations. Acknowledging never-response will result in higher standard errors. The weighted estimates are very close to the model-based estimates assuming no never-respondents.

Table 3.3: Comparison of estimated utilizations. For all types of health care visits, ignoring the existence of never-response will cause an over-estimation. Acknowledging never-response will result in higher standard errors. The weighted estimates are very close to the model-based estimates assuming no never-respondents.

Type     Office based  Outpatient  Emergency   Inpatient   Prescription  Dental      Home care
         Mean   SE     Mean  SE    Mean  SE    Mean  SE    Mean   SE     Mean  SE    Mean   SE
Naive    3.997  .050   .343  .014  .170  .003  .088  .002  9.448  .114   .762  .008  1.615  .107
Wgted    3.083  .050   .287  .015  .165  .004  .078  .002  8.206  .135   .633  .009  1.629  .136
p = 0    3.084  .052   .288  .014  .166  .005  .078  .003  8.171  .177   .634  .011  1.656  .274
p̄ = .1   2.859  .127   .279  .016  .163  .005  .076  .004  7.940  .228   .612  .016  1.616  .301
p̄ = .2   2.637  .129   .270  .018  .159  .006  .074  .004  7.705  .247   .591  .017  1.576  .330
p̄ = .3   2.413  .137   .261  .021  .156  .007  .072  .004  7.472  .276   .569  .018  1.536  .367

3.5 Impact on regression

In this section, we study the impact of never-respondents on regression analysis of the expenditure and utilization variables of interest. The estimating-equation methods for response-biased samples cannot handle the situation where there are hidden never-respondents under weighted sampling (see e.g., [48]). Here we use Bayesian methods to account for the existence of never-respondents.

3.5.1 Logistic regression

We first focus on the regression analysis of a binary response variable Y and a binary risk factor X.
The model can be written as

(Y | Q = q, X = x) ~ Bernoulli(p_{qx}),
logit(p_{qx}) = x S(q) + (1 − x) H(q),  x = 0, 1,
(X | Q = q) ~ Bernoulli(p_q),  logit(p_q) = R(q),
(Q | Q > 0) ~ Γ(d, e),    (3.28)

where S(·), H(·) and R(·) are smooth functions, d and e are the shape and scale parameters, and p_q is the probability of a Bernoulli trial. Here we assume that (Q | Q > 0) follows a parametric form, so that we can explicitly write out its distribution conditional on X. For data analysis, we may still use the Bayesian bootstrap. From the assumptions in Equation (3.28), the probability density function of (Q | X) can be written as

f(q | X = x) ∝ Pr(X = x | Q = q) [p δ_0(q) + (1 − p) f(q)]
            = [e^{R(q)}/(1 + e^{R(q)})]^x [1/(1 + e^{R(q)})]^{1−x} [p δ_0(q) + (1 − p) f(q)].    (3.29)

From Equation (3.29), both the shape of the logistic regression curve R(·) and the distribution of X have an impact on the conditional distribution of the sampling probabilities. Hence, we may write

f(q | X = x) = p_{xp} δ_0(q) + (1 − p_{xp}) f(q | x, R(·), f(·)),    (3.30)

where p_{xp} denotes the new mixture weight, which depends on x, p and R(·), and f(q | x, R(·), f(·)) denotes a density for q depending on x, R(·) and f(·). More explicitly, they can be obtained as

logit(p_{0p}) = logit(p) + log[1/(1 + e^{R(0)})] − log ∫ f(q)/(1 + e^{R(q)}) dq,
logit(p_{1p}) = logit(p) + log[e^{R(0)}/(1 + e^{R(0)})] − log ∫ e^{R(q)} f(q)/(1 + e^{R(q)}) dq,
f(q | 0, R(·), f(·)) = (1 − p) [f(q)/(1 + e^{R(q)})] / {(1 − p_{0p}) [p/(1 + e^{R(0)}) + (1 − p) ∫ f(q)/(1 + e^{R(q)}) dq]},
f(q | 1, R(·), f(·)) = (1 − p) [e^{R(q)} f(q)/(1 + e^{R(q)})] / {(1 − p_{1p}) [p e^{R(0)}/(1 + e^{R(0)}) + (1 − p) ∫ e^{R(q)} f(q)/(1 + e^{R(q)}) dq]}.    (3.31)

If there is no interaction between the effect of the covariate of interest and that of the sampling probability (i.e., if S(q) = H(q) = c for a constant c), then the existence of a hidden sub-population of never-respondents has no impact on the estimated effect of the covariate.
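Equation (3.31) can be evaluated numerically for given R(·), f(·) and p. A sketch in Python (the choices of R and the gamma density f are hypothetical illustrations, not the fitted MEPS quantities):

```python
import math
from scipy.integrate import quad
from scipy.stats import gamma

def mixture_weights(p, R, f_pdf, upper=1.0):
    """Numerically evaluate p_{0p} and p_{1p} from Equation (3.31)."""
    logit = lambda u: math.log(u / (1 - u))
    expit = lambda t: 1 / (1 + math.exp(-t))
    # integrals of f(q)/(1+e^{R(q)}) and e^{R(q)} f(q)/(1+e^{R(q)})
    i0, _ = quad(lambda q: f_pdf(q) / (1 + math.exp(R(q))), 0, upper)
    i1, _ = quad(lambda q: math.exp(R(q)) * f_pdf(q) / (1 + math.exp(R(q))), 0, upper)
    l0 = logit(p) + math.log(1 / (1 + math.exp(R(0)))) - math.log(i0)
    l1 = logit(p) + math.log(math.exp(R(0)) / (1 + math.exp(R(0)))) - math.log(i1)
    return expit(l0), expit(l1)

# hypothetical inputs: increasing R(q) = -1 + 5q, f = Gamma(2, scale 0.05)
f_pdf = gamma(2, scale=0.05).pdf
p0p, p1p = mixture_weights(0.2, lambda q: -1 + 5 * q, f_pdf)
# sanity check: with a constant R(q), both mixture weights reduce to p itself
w0, w1 = mixture_weights(0.2, lambda q: 0.3, f_pdf)
```

The constant-R check follows directly from (3.31): when R does not depend on q, the two log terms cancel and p_{0p} = p_{1p} = p.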
Otherwise, we may obtain the relative risk as

r = E(Y | X = 1) / E(Y | X = 0)
  = [p_{1p} e^{S(0)}/(1 + e^{S(0)}) + (1 − p_{1p}) E_{F(q|X=1)}(e^{S(Q)}/(1 + e^{S(Q)}))]
    / [p_{0p} e^{H(0)}/(1 + e^{H(0)}) + (1 − p_{0p}) E_{F(q|X=0)}(e^{H(Q)}/(1 + e^{H(Q)}))].    (3.32)

When the sample size n → ∞, all the parameters and smooth functions in Equation (3.32) attain their true values, except for the hidden proportion p. Hence, the limiting posterior distribution of r depends on the prior on p. Denoting θ_0 = (R_0(·), f_0(·)) as the true values of the functions (R(·), f(·)), the limiting posterior distribution of r is given by

lim_{n→∞} (r | D_n)
  = [p_{1p}(p, θ_0) e^{S_0(0)}/(1 + e^{S_0(0)}) + [1 − p_{1p}(p, θ_0)] E_{F_0(q|X=1,θ_0)}(e^{S_0(Q)}/(1 + e^{S_0(Q)}))]
    / [p_{0p}(p, θ_0) e^{H_0(0)}/(1 + e^{H_0(0)}) + [1 − p_{0p}(p, θ_0)] E_{F_0(q|X=0,θ_0)}(e^{H_0(Q)}/(1 + e^{H_0(Q)}))],    (3.33)

where p_{0p}(p, θ_0) and p_{1p}(p, θ_0) are the mixture weights based on the prior distribution of p and the true functions R_0(·) and f_0(·); F_0(q | X = 0, θ_0) and F_0(q | X = 1, θ_0) are the distributions of the sampling probabilities based on the true functions R_0(q) and f_0(q); and S_0(·) and H_0(·) are the true relationships in the logistic regression. From Equation (3.33), the limiting posterior distribution is a function of the prior distribution on p. Although we cannot obtain an analytical form for the limiting posterior mean and standard deviation, they can be evaluated numerically from the prior distribution of p and the true parameters. Note that, even when the relationships for the two groups are the same (i.e., S(q) = H(q) for all q > 0), the posterior mean of the relative risk may not converge to the true value as the sample size increases. This is because our model allows for different hidden proportions of never-respondents in the two groups.

3.5.2 Gamma regression

For a zero-inflated continuous response variable M, we again focus on the regression analysis of M and a binary covariate X.
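The limiting posterior in Equation (3.33) can be approximated by Monte Carlo: draw p from its prior and plug each draw into the relative-risk formula. The sketch below uses hypothetical stand-ins for the true functions, and (for brevity) takes the group-specific mixture weights equal to the prior draw p rather than computing them from Equation (3.31):

```python
import numpy as np
from scipy.stats import beta, gamma

rng = np.random.default_rng(1)
expit = lambda t: 1 / (1 + np.exp(-t))

# hypothetical true quantities (illustrative, not the fitted MEPS values)
S0 = lambda q: -1.0 + 4.0 * q     # logit prevalence, X = 1 group
H0 = lambda q: -2.0 + 4.0 * q     # logit prevalence, X = 0 group
q1 = gamma(2, scale=0.05).rvs(50_000, random_state=rng)  # draws from F0(q | X = 1)
q0 = gamma(2, scale=0.05).rvs(50_000, random_state=rng)  # draws from F0(q | X = 0)

# prior on the hidden proportion: Beta with mean 0.1, sd 0.025
prior = beta(14.3, 128.7)
r_draws = []
for p in prior.rvs(2_000, random_state=rng):
    num = p * expit(S0(0)) + (1 - p) * expit(S0(q1)).mean()
    den = p * expit(H0(0)) + (1 - p) * expit(H0(q0)).mean()
    r_draws.append(num / den)
r_draws = np.asarray(r_draws)
```

The spread of `r_draws` then reflects the residual uncertainty driven entirely by the prior on p, which is exactly the behavior Equation (3.33) describes.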
For a zero-inflated continuous response variable, we denote by Y the indicator of whether M > 0. The model can be written as

(Y | Q = q, X = x) ~ Bernoulli(p_{qx}),
logit(p_{qx}) = x S(q) + (1 − x) H(q),  x = 0, 1,    (3.34)
(M | Q = q, Y = 0, X = x) ~ δ_0(·),
(M | Q = q, Y = 1, X = x) ~ Γ(α, β_{qx}),
log(μ_{qx}) = log(α β_{qx}) = x N(q) + (1 − x) L(q),  x = 0, 1,    (3.35)
(X | Q = q) ~ Bernoulli(p_q),  logit(p_q) = R(q),
(Q | Q > 0) ~ Γ(d, e),    (3.36)

where N(·) and L(·) are smooth functions. If there is no interaction between the effect of the covariate of interest and that of the sampling probability (i.e., if S(q) = H(q) = c_1 and N(q) = L(q) = c_2 are constants), then the existence of a hidden sub-population of never-respondents has no impact on the estimated effect of the covariate. Otherwise, we may obtain the relative magnitude of health expenditures as

r = E(M | X = 1) / E(M | X = 0)
  = [p_{1p} e^{S(0)+N(0)}/(1 + e^{S(0)}) + (1 − p_{1p}) E_{F(q|X=1)}(e^{S(Q)+N(Q)}/(1 + e^{S(Q)}))]
    / [p_{0p} e^{H(0)+L(0)}/(1 + e^{H(0)}) + (1 − p_{0p}) E_{F(q|X=0)}(e^{H(Q)+L(Q)}/(1 + e^{H(Q)}))].    (3.37)

When the sample size n → ∞, all the parameters and smooth functions in Equation (3.37) attain their true values, except for the hidden proportion p. Hence, the limiting posterior distribution of r depends on the prior on p.
Denoting θ_0 = (R_0(·), f_0(·)) as the true values of the functions (R(·), f(·)), the limiting posterior distribution of r is given by

lim_{n→∞} (r | D_n)
  = [p_{1p}(p, θ_0) e^{S_0(0)+N_0(0)}/(1 + e^{S_0(0)}) + [1 − p_{1p}(p, θ_0)] E_{F_0(q|X=1,θ_0)}(e^{S_0(Q)+N_0(Q)}/(1 + e^{S_0(Q)}))]
    / [p_{0p}(p, θ_0) e^{H_0(0)+L_0(0)}/(1 + e^{H_0(0)}) + [1 − p_{0p}(p, θ_0)] E_{F_0(q|X=0,θ_0)}(e^{H_0(Q)+L_0(Q)}/(1 + e^{H_0(Q)}))],    (3.38)

where p_{0p}(p, θ_0) and p_{1p}(p, θ_0) are the mixture weights based on the prior distribution of p; F_0(q | X = 0, θ_0) and F_0(q | X = 1, θ_0) are the distributions of the sampling probabilities based on the true functions R_0(q) and f_0(q); and S_0(·), N_0(·), H_0(·) and L_0(·) are the true relationships in the logistic and gamma regressions. From Equation (3.38), the limiting posterior distribution is a function of the prior distribution on p. Although we cannot obtain an analytical form for the limiting posterior mean and standard deviation, they can be evaluated numerically from the prior distribution of p and the true parameters. Note that, even when the relationships for the two groups are the same (i.e., S(q) + N(q) = H(q) + L(q) for all q > 0), the posterior mean of the relative magnitude may not converge to the true value as the sample size increases. This is because our model allows for different hidden proportions of never-respondents in the two groups.

3.5.3 Zero-inflated count regression

For a zero-inflated count response variable M, we again focus on the regression analysis of M and a binary covariate X. As before, we denote by Y the indicator of whether M > 0.
The Poisson model can be written as

Pr(M = 0 | Q = q, X = x) = (1 − p_{qx}) + p_{qx} exp(−λ_{qx}),
Pr(M = m | Q = q, X = x) = p_{qx} exp(−λ_{qx}) λ_{qx}^m / m!,  m = 1, 2, ...,
logit(p_{qx}) = x S(q) + (1 − x) H(q),  x = 0, 1,
log(λ_{qx}) = x N(q) + (1 − x) L(q),  x = 0, 1,    (3.39)
(X | Q = q) ~ Bernoulli(p_q),  logit(p_q) = R(q),
(Q | Q > 0) ~ Γ(d, e),    (3.40)

where N(·) and L(·) are smooth functions, and λ_{qx} is the mean of the Poisson distribution. Similarly, we may write the negative binomial model as

Pr(M = 0 | Q = q, X = x) = (1 − p_{qx}) + p_{qx} [α^{−1}/(α^{−1} + μ_{qx})]^{α^{−1}},
Pr(M = m | Q = q, X = x) = p_{qx} [Γ(m + α^{−1})/(Γ(m + 1) Γ(α^{−1}))] [α^{−1}/(α^{−1} + μ_{qx})]^{α^{−1}} [μ_{qx}/(α^{−1} + μ_{qx})]^m,  m = 1, 2, ...,
logit(p_{qx}) = x S(q) + (1 − x) H(q),  x = 0, 1,
log(μ_{qx}) = x N(q) + (1 − x) L(q),  x = 0, 1,    (3.41)
(X | Q = q) ~ Bernoulli(p_q),  logit(p_q) = R(q),
(Q | Q > 0) ~ Γ(d, e),    (3.42)

where μ_{qx} is the mean of the negative binomial distribution, and α is an overdispersion parameter that is assumed to be fixed. If there is no interaction between the effect of the covariate of interest and that of the sampling probability (i.e., if S(q) = H(q) = c_1 and N(q) = L(q) = c_2 are constants), then the existence of a hidden sub-population of never-respondents has no impact on the estimated effect of the covariate. For both the Poisson and negative binomial models, we may obtain the relative frequency of health care visits as

r = E(M | X = 1) / E(M | X = 0)
  = [p_{1p} e^{S(0)+N(0)}/(1 + e^{S(0)}) + (1 − p_{1p}) E_{F(q|X=1)}(e^{S(Q)+N(Q)}/(1 + e^{S(Q)}))]
    / [p_{0p} e^{H(0)+L(0)}/(1 + e^{H(0)}) + (1 − p_{0p}) E_{F(q|X=0)}(e^{H(Q)+L(Q)}/(1 + e^{H(Q)}))].    (3.43)

When the sample size n → ∞, all the parameters and smooth functions in Equation (3.43) attain their true values, except for the hidden proportion p. Hence, the limiting posterior distribution of r depends on the prior on p.
Denoting θ_0 = (R_0(·), f_0(·)) as the true values of the functions (R(·), f(·)), the limiting posterior distribution of r is given by

lim_{n→∞} (r | D_n)
  = [p_{1p}(p, θ_0) e^{S_0(0)+N_0(0)}/(1 + e^{S_0(0)}) + [1 − p_{1p}(p, θ_0)] E_{F_0(q|X=1,θ_0)}(e^{S_0(Q)+N_0(Q)}/(1 + e^{S_0(Q)}))]
    / [p_{0p}(p, θ_0) e^{H_0(0)+L_0(0)}/(1 + e^{H_0(0)}) + [1 − p_{0p}(p, θ_0)] E_{F_0(q|X=0,θ_0)}(e^{H_0(Q)+L_0(Q)}/(1 + e^{H_0(Q)}))],    (3.44)

where p_{0p}(p, θ_0) and p_{1p}(p, θ_0) are the mixture weights based on the prior distribution of p; F_0(q | X = 0, θ_0) and F_0(q | X = 1, θ_0) are the distributions of the sampling probabilities based on the true functions R_0(q) and f_0(q); and S_0(·), N_0(·), H_0(·) and L_0(·) are the true relationships in the logistic and ZIP (or ZINB) regressions. From Equation (3.44), the limiting posterior distribution is a function of the prior distribution on p. Although we cannot obtain an analytical form for the limiting posterior mean and standard deviation, they can be evaluated numerically from the prior distribution of p and the true parameters. Note that, even when the relationships for the two groups are the same (i.e., S(q) + N(q) = H(q) + L(q) for all q > 0), the posterior mean of the relative frequency may not converge to the true value as the sample size increases. This is because our model allows for different hidden proportions of never-respondents in the two groups.

3.5.4 Case study on MEPS insurance prevalence

Here, we perform a case study using the MEPS 2010 data. The binary covariate X we consider is the indicator of whether the individual had any private or government health insurance coverage during the year. It is interesting to study how insurance coverage impacts medical expenditures and utilizations. Among the 31,228 individuals included in the study, 25,958 have some kind of insurance during the year.
Using the logistic regression model proposed in the previous chapter, the relationship between insurance status and sampling probability is given in Figure 3.16. We observe that individuals with insurance are more likely to be included in the study.

[Figure 3.16: Logistic regression on the insurance indicator - posterior samples of natural cubic splines (50 samples). The solid lines and dots are for individuals with medical insurance, while the dashed lines and unfilled dots are for uninsured individuals.]

As an illustration, we apply our logistic regression model to the indicators of whether there is a medical expenditure, an office-based health care visit and a dental visit. In the Metropolis-Hastings algorithm, a normal proposal distribution matching the first two moments is used; this results in an acceptance ratio around 0.7. A separate regression model is fitted for the sub-groups of individuals with and without any insurance during the course of the year. The fitted regression curves are given in Figure 3.17. We observe that, for individuals both with and without insurance, the probability of incurring a health expenditure or visit increases with the sampling probability. Having medical insurance clearly increases the probability of incurring an expenditure, regardless of the sampling probability. The relative risk appears to decrease with the sampling probability. There is more uncertainty in the regression curves for individuals without insurance coverage, due to the smaller sample size.
[Figure 3.17: Binary regression for incurring an expenditure or utilization - posterior samples of natural cubic splines (50 samples); panels (a) Expenditure, (b) Office-based, (c) Dental. The solid lines and dots are for individuals with medical insurance, while the dashed lines and unfilled dots are for uninsured individuals.]

Based on the fitted regression curves, we use the Bayesian bootstrap to obtain posterior samples of r. The process is similar to that given in Subsection 2.4.1 of the previous chapter for estimating prevalence. The only change is to adjust g_1^{(k)}, g_2^{(k)}, ..., g_n^{(k)} and p^{(k)} according to Equation (3.29), in order to obtain posterior samples from F(q | X = x) and p_{xq}. In particular, the steps are as follows (done separately for x = 1 and x = 0).

1. Obtain g_i^{(k)} as g_i^{(k)} [e^{R(q_i)}/(1 + e^{R(q_i)})]^x [1/(1 + e^{R(q_i)})]^{1−x}.
2. Obtain p^{(k)} as p^{(k)} [e^{R(0)}/(1 + e^{R(0)})]^x [1/(1 + e^{R(0)})]^{1−x}.
3. Normalize p^{(k)} and the g_i^{(k)} by applying a common factor such that p^{(k)} + (1 − p^{(k)}) ∑_{i=1}^n g_i^{(k)} = 1. p^{(k)} is then a sample of p_{xq}.
4. The adjusted posterior sample of g_i^{(k)} is g_i^{(k)} (1 − p^{(k)}) (i.e., the probability attached to q_i).

Then we may simply plug the adjusted samples of the bootstrap probabilities and p_{xq} into the steps given in Subsection 2.4.1 of the previous chapter, and obtain posteriors separately for the numerator and the denominator. In Table 3.4, we present the estimated relative risks using different methods. The naive method corresponds to the estimate based on the arithmetic mean in the sub-groups with and without insurance. The weighted estimate is based on the weighted average in each sub-group. The case p = 0 corresponds to the estimate from our model-based approach assuming the hidden proportion is zero.
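The reweighting in steps 1-4 above can be sketched as follows (a hypothetical R(·) and Dirichlet bootstrap weights stand in for the fitted quantities):

```python
import numpy as np

def adjust_weights(g, p, q, R, x):
    """Steps 1-4: tilt the point mass p (at q = 0) and the bootstrap weights g
    at q_1, ..., q_n by Pr(X = x | Q = q), then renormalize."""
    expit = lambda t: 1 / (1 + np.exp(-t))
    w_q = expit(R(q)) if x == 1 else 1 - expit(R(q))     # Pr(X = x | Q = q_i)
    w_0 = expit(R(0.0)) if x == 1 else 1 - expit(R(0.0)) # Pr(X = x | Q = 0)
    mass_0 = p * w_0                   # steps 1-2: tilted point mass at zero
    mass_q = (1 - p) * g * w_q         # tilted masses at the observed q_i
    total = mass_0 + mass_q.sum()      # step 3: common normalizing factor
    p_new = mass_0 / total             # sample of p_{xq}
    prob_qi = mass_q / total           # step 4: probability attached to q_i
    g_new = prob_qi / (1 - p_new)      # so p_new + (1 - p_new) * sum(g_new) = 1
    return p_new, g_new, prob_qi

rng = np.random.default_rng(2)
n = 500
g = rng.dirichlet(np.ones(n))          # Bayesian bootstrap weights, sum to 1
q = rng.gamma(2, 0.05, size=n)         # sampling probabilities
p_new, g_new, prob = adjust_weights(g, p=0.1, q=q, R=lambda t: -1 + 5 * t, x=1)
```

The renormalization enforces exactly the constraint in step 3, so the adjusted point mass and the adjusted weights form a proper distribution.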
Similar to the sensitivity analysis on the expenditures and utilizations in the previous section, prior means of 0.1, 0.2 and 0.3, and a prior standard deviation of 0.025, are assumed for the proportion of never-respondents. From the table, we observe that all the model-based estimates assuming a non-zero proportion of never-respondents are larger than the weighted estimates. This is due to the larger difference in the expenditure or utilization prevalence as the sampling probability approaches zero. At the same time, the model-based estimates assuming no never-respondents are very close to the weighted estimates.

Table 3.4: Comparison of estimated relative risks for incurring an expenditure or utilization, using logistic regression on the insurance status. We observe that all the model-based estimates acknowledging the never-respondents are larger than the weighted estimates. Note that the standard errors for the naive and weighted estimates are estimated using the bootstrap.

Type     Expenditure    Office based   Dental
         Mean   SE      Mean   SE      Mean   SE
Naive    1.707  .024    2.053  .039    2.817  .097
Wgted    1.748  .033    2.062  .052    2.868  .127
p = 0    1.749  .031    2.056  .050    2.852  .122
p̄ = .1   1.776  .042    2.076  .062    2.916  .152
p̄ = .2   1.799  .050    2.090  .076    2.971  .182
p̄ = .3   1.816  .060    2.095  .089    3.014  .213

Similarly, we perform separate gamma regressions on the total medical expenditure for the sub-groups of individuals with and without insurance coverage. For the utilization variables, ZIP and ZINB models are used for the dental visits and office-based visits, respectively. The details of the Metropolis-Hastings algorithms are similar to those in the previous section. The fitted regression curves are given in Figure 3.18. We observe that, for individuals both with and without insurance, the magnitude of the health expenditure or utilization increases with the sampling probability.
Having medical insurance clearly increases the expected expenditure and utilization, regardless of the sampling probability. The relative magnitude appears to decrease with the sampling probability. The smaller sample size causes more uncertainty in the regression curves for the sub-group without insurance coverage. In Table 3.5, we present the estimated relative magnitudes using different methods. Again, we observe that all the model-based estimates acknowledging the never-respondents are larger than the weighted estimates. This is consistent with the finding from the regression curves, in the sense that never-respondents seem to have a larger difference in utilization or expenditure.

[Figure 3.18: Gamma, ZIP and ZINB regression on expenditure and utilization - posterior samples of natural cubic splines (50 samples); panels (a) Expenditure, (b) Office-based, (c) Dental. The solid lines and dots are for individuals with medical insurance, while the dashed lines and unfilled dots are for uninsured individuals.]

Table 3.5: Comparison of estimated relative magnitudes for expenditure and utilization, using gamma, ZIP and ZINB regression on the insurance status. Similar to the logistic regression results, all the model-based estimates acknowledging the never-respondents are larger than the weighted estimates. Note that the standard errors for the naive and weighted estimates are estimated using the bootstrap.
Type     Expenditure    Office based   Dental
         Mean   SE      Mean   SE      Mean   SE
Naive    3.774  .254    3.175  .153    3.260  .158
Wgted    4.108  .309    3.291  .167    3.290  .207
p = 0    4.052  .209    3.257  .196    3.271  .231
p̄ = .1   4.258  .265    3.304  .228    3.366  .284
p̄ = .2   4.446  .305    3.319  .274    3.451  .342
p̄ = .3   4.620  .361    3.295  .332    3.519  .404

3.6 Conclusions

In this chapter, we proposed a Bayesian model-based approach to account for the impact of potential never-respondents in the Medical Expenditure Panel Survey (MEPS). A gamma model was presented for estimating the relationship between medical expenditures and sampling probabilities. For estimating population health care utilization, we proposed regression models based on the ZIP and ZINB distributions. The impact of the existence of never-respondents on the population estimates can be assessed by assuming a prior on the proportion of never-respondents. We obtained analytical forms of the large-sample limits of the posterior mean and standard deviation of the population expenditure and utilization. In addition, we proposed models for assessing the impact of never-respondents on the estimated effect of a binary covariate in the regression context. Large-sample limits of the posterior mean and variance were obtained for binary, as well as zero-inflated continuous and count, response variables. Our data analysis using the MEPS 2010 data confirmed that individuals with higher expenditures and more frequent health care visits are more likely to be included in the survey. Acknowledging a hidden sub-population of never-respondents results in an adjustment to the final estimates of the population expenditures and utilizations, as well as the relative risk and magnitude in regression. For population estimation of expenditures and utilizations, ignoring the potential never-respondents will generally result in an over-estimation.
Sensitivity analyses using different assumptions on the proportion of never-respondents offer better confidence in situations where the sampling scheme has limitations in reaching certain subjects.

Chapter 4

Misrepresentation in Insurance Underwriting

4.1 Motivation

In insurance underwriting, misrepresentation means that a policy applicant purposely provides a false statement on facts that may affect his or her eligibility or premium for obtaining the insurance. As verification of the information provided by an applicant (e.g., by a medical examination) can be very expensive, in most situations insurance companies have to rely on the applicant to provide reliable information. Since it is to the benefit of the applicant, misrepresentation happens frequently, bringing millions in extra costs to insurance companies and affecting the fairness of insurance premiums. Although there has been extensive study of how to minimize these effects through the design of insurance policies and the relevant regulations (see e.g., [58]), statistical models tailored to data with misrepresentation seem to have received little attention. The most intuitive, as well as the simplest, case of misrepresentation arises when the question of concern is binary. The general problem of misclassification of binary variables in case-control studies has been studied extensively in the literature. For example, Gustafson et al. [23] showed that incorporating uncertainty about the misclassification probabilities through priors results in estimates of odds ratios that are less affected by bias in the guess. Gustafson and Greenland [21] raised several issues in modeling misclassification using Bayesian inference in unmatched case-control analysis with a binary exposure, when confounders are not adjusted for. Liu et al. [31] studied Bayesian adjustment when information is available on both the misclassification of the exposure and the exposure-disease association. Chu et
al. [3] demonstrated that, when validation data on the misclassification are available, Bayesian methods have advantages over their non-Bayesian counterparts. Wang et al. [57] investigated the partial identification obtained in case-control studies when the binary exposure is misclassified. For misrepresentation of a binary risk factor (e.g., smoking status), we may formulate the model as follows. Denote by V the true binary status of the insured, which we are not able to observe, and by V* the observed variable, which includes a proportion of answers that are misrepresented. Then we can write the conditional probabilities as

P(V* = 0 | V = 0) = 1,
P(V* = 0 | V = 1) = p,
E[V] = θ,
E[V*] = θ(1 − p) = θ* ≤ θ,    (4.1)

where θ is the binomial mean of interest, and p is the unobserved proportion of people who provided a false negative answer due to misrepresentation. Here we assume that the misclassification is non-differential. In general, P(V* = 1 | V = 1) is called the sensitivity and P(V* = 0 | V = 0) is called the specificity. Our example is a special case where the specificity is one. One advantage of this type of misclassification is that we know with certainty that there is no misclassification for subjects who provide a negative answer (we know the direction of misclassification). In addition, we may have accurately measured or classified covariates on all the subjects, which may be used for predictive modeling of the misclassified variable. In general, misclassification may lead to a problem of model identification. For our model, it is interesting to study whether these special characteristics of misrepresentation affect model identification and statistical inference. In this chapter, we propose three Bayesian models based on the Poisson, gamma and Bernoulli distributions for predicting insurance claim frequency, severity and occurrence, when there is misrepresentation in a binary covariate (i.e., risk factor).
Theoretical results based on the forms of the posterior distributions, method of moments estimators and the Fisher information matrix will be provided for studying model identification. MCMC computational and simulation studies will be performed in order to evaluate the statistical learning and convergence of the posteriors. Our work confirms that the identification of the models depends on the distribution of the response variable. In addition, we study model identification when the misclassified variable is used as the response and correctly measured covariates are available. Through theoretical and simulation studies, we confirm that this model is weakly identified, and that learning the parameters is difficult due to the high posterior correlations between the parameters. For all the models, we confirm that acknowledging the existence and direction of misclassification improves the accuracy of parameter estimates.

The rest of the chapter is organized as follows. The three models based on Poisson, gamma and binary responses with a misclassified covariate are studied in Sections 4.2, 4.3 and 4.4, respectively. Section 4.5 presents the binary model with a misclassified response. Section 4.6 concludes the chapter.

4.2 Poisson response model

4.2.1 Model specification

In rate-making, generalized linear models (GLMs) based on Poisson distributions can be used to predict the frequency of insurance claims. Denoting Y as the number of claims in a policy year, we can specify the distribution of (Y |V) as
\[
(Y \mid V) \sim \text{Poisson}\left(\exp(\alpha_0 + \alpha_1 V)\right), \tag{4.2}
\]
where Poisson(λ) denotes a Poisson distribution with mean λ, and α0 and α1 are the intercept and slope of the regression. Here we assume that the misclassification is non-differential (i.e., Y and (V∗ |V) are independent). It is straightforward to obtain
\[
P(V = 1 \mid V^* = 1) = 1, \qquad P(V = 1 \mid V^* = 0) = \frac{\theta p}{1 - \theta(1 - p)}. \tag{4.3}
\]
The conditional distribution of (Y |V∗) can now be written as
\[
f_Y(y \mid V^*) = f_{Y,V}(y, 1 \mid V^*) + f_{Y,V}(y, 0 \mid V^*) = f_Y(y \mid V = 1)P(V = 1 \mid V^*) + f_Y(y \mid V = 0)P(V = 0 \mid V^*).
\]
Denoting f(·|λ) as the Poisson density function with mean parameter λ, we can obtain the conditional distribution of (Y |V∗) as
\[
\begin{aligned}
f_Y(y \mid V^* = 1) &= f(y \mid \exp(\alpha_0 + \alpha_1)) \\
f_Y(y \mid V^* = 0) &= \frac{\theta p}{1 - \theta(1-p)}\, f(y \mid \exp(\alpha_0 + \alpha_1)) + \frac{1-\theta}{1 - \theta(1-p)}\, f(y \mid \exp(\alpha_0)). \qquad (4.4)
\end{aligned}
\]
From Equation (4.4), we observe that the conditional distribution of (Y |V∗) follows a mixture of two Poisson distributions when V∗ = 0, and is a single Poisson distribution when V∗ = 1. From the form of the conditional distribution, the model may possibly be identified.

4.2.2 Model identification

Here we study the identification of the model through method of moments estimators, using the expectations and covariance matrix of the observed variables (V∗, Y). That is, we may obtain the first two moments as
\[
\begin{aligned}
E(V^*) &= \theta(1-p) &(4.5) \\
\operatorname{Var}(V^*) &= E[(V^*)^2] - E^2(V^*) = E(V^*) - E^2(V^*) &(4.6) \\
E(Y) &= E[E(Y \mid V)] = \theta\exp(\alpha_0+\alpha_1) + (1-\theta)\exp(\alpha_0) &(4.7) \\
\operatorname{Var}(Y) &= E[E(Y^2 \mid V)] - E^2(Y) = \theta\exp(2\alpha_0+2\alpha_1) + (1-\theta)\exp(2\alpha_0) + E(Y) - E^2(Y) &(4.8) \\
\operatorname{Cov}(V^*, Y) &= E(V^*Y) - E(V^*)E(Y) = \sum_{v^*=0,1}\sum_{y} v^* y\, f(y \mid v^*)\,[\theta(1-p)]^{v^*}[1-\theta(1-p)]^{1-v^*} - E(V^*)E(Y) \\
&= \theta(1-p)\exp(\alpha_0+\alpha_1) - E(V^*)E(Y). &(4.9)
\end{aligned}
\]
Here, we have Equations (4.5), (4.7), (4.8) and (4.9) for solving the four parameters p, θ, α0 and α1. In order to simplify the notation, we may denote
\[
A = \operatorname{Var}(Y) - E(Y) + E^2(Y) = \theta\exp(2\alpha_0+2\alpha_1) + (1-\theta)\exp(2\alpha_0) \tag{4.10}
\]
\[
B = \frac{\operatorname{Cov}(V^*,Y) + E(V^*)E(Y)}{E(V^*)} = \exp(\alpha_0+\alpha_1). \tag{4.11}
\]
Using Equations (4.5), (4.10) and (4.11), we may uniquely obtain θ as
\[
\theta = \frac{A - E^2(Y)}{B^2 + A - 2B\,E(Y)}.
\]
It is then straightforward to obtain the other parameters using Equations (4.5), (4.10) and (4.11). This suggests that we will be able to identify the model.
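The moment inversion above can be verified numerically. The Python sketch below (parameter values are illustrative) computes the population moments (4.5)–(4.9) implied by a chosen (p, θ, α0, α1) and then recovers the parameters through A, B and the closed-form expression for θ:

```python
import math

# illustrative true values (assumptions for the demo)
p, theta, a0, a1 = 0.25, 0.5, 1.2, 1.0
mu0, mu1 = math.exp(a0), math.exp(a0 + a1)

# population moments of (V*, Y) implied by the model
EVs  = theta * (1 - p)                                     # (4.5)
EY   = theta * mu1 + (1 - theta) * mu0                     # (4.7)
VarY = theta * mu1**2 + (1 - theta) * mu0**2 + EY - EY**2  # (4.8)
Cov  = theta * (1 - p) * mu1 - EVs * EY                    # (4.9)

# invert the moments
A = VarY - EY + EY**2            # (4.10)
B = (Cov + EVs * EY) / EVs       # (4.11), equals exp(a0 + a1)
theta_hat = (A - EY**2) / (B**2 + A - 2 * B * EY)
mu0_hat   = (EY - theta_hat * B) / (1 - theta_hat)
p_hat     = 1 - EVs / theta_hat

# recovers theta = 0.5, p = 0.25, a0 = 1.2 up to floating point
print(round(theta_hat, 6), round(p_hat, 6), round(math.log(mu0_hat), 6))
```

The recovery is exact in the population, consistent with the model being identified.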
In an identified model, we would be able to learn the true value of the misclassified proportion as the sample size goes to infinity.

4.2.3 Computational study

In order to illustrate the performance of the model under finite sample scenarios, we perform a computational study using MCMC simulations. A single sample of size n = 1000 is generated for V, using a Bernoulli trial with probability θ0 = 0.5. Two different values, p0 = 0.25 and p0 = 0.5, are used as the misclassified proportion for obtaining the corresponding samples of V∗. The corresponding samples of Poisson counts Y are then generated using the true values (1.2, 1) of the intercept and slope (α0, α1). In order to construct the Metropolis-Hastings algorithm, we may use the log likelihood of the parameters given by
\[
\begin{aligned}
l(p, \theta, \alpha_0, \alpha_1)
&= -\sum_{i=1}^{n}\log(y_i!) + \sum_{i=1}^{n} v_i^*\left[y_i(\alpha_0+\alpha_1) - \exp(\alpha_0+\alpha_1)\right] - \sum_{i=1}^{n}(1-v_i^*)\log[1-\theta(1-p)] \\
&\quad + \sum_{i=1}^{n}(1-v_i^*)\log\left\{\theta p \exp\!\left[y_i(\alpha_0+\alpha_1) - e^{\alpha_0+\alpha_1}\right] + (1-\theta)\exp\!\left[y_i\alpha_0 - e^{\alpha_0}\right]\right\}. \qquad (4.12)
\end{aligned}
\]
Using a uniform prior on the interval (0, 1) for p, we fit our model using both WinBUGS and a self-coded Metropolis-Hastings algorithm in R. For the Metropolis-Hastings algorithm written in R, normal proposal distributions centered at the current value are used for the regression coefficients (α0, α1). Maximum likelihood estimates of the correlation matrix from a fit of the generalized linear model are used for the bivariate proposal distribution of the regression coefficients. For the misclassified proportion p and probability θ, we use normal proposal distributions centered at the current value. The ranges of the proposal distributions are chosen in order to achieve an acceptance ratio around 0.7 for the regular Poisson regression, and between 0.2 and 0.7 for all the parameters of our model acknowledging the misrepresentation.
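As a sanity check on (4.12), the Python sketch below (standing in for the chapter's R implementation; toy data and parameter values are illustrative) evaluates the conditional log likelihood of (Y | V∗) in two ways: directly from (4.12), and by summing the Poisson mixture log densities implied by (4.4):

```python
import math

def pois_logpmf(y, lam):
    """Poisson log probability mass function."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def loglik_412(p, th, a0, a1, data):
    """Log likelihood (4.12): conditional likelihood of Y given observed V*."""
    l = 0.0
    for v, y in data:
        l -= math.lgamma(y + 1)                      # -log(y!)
        if v == 1:
            l += y * (a0 + a1) - math.exp(a0 + a1)
        else:
            l -= math.log(1 - th * (1 - p))
            l += math.log(th * p * math.exp(y * (a0 + a1) - math.exp(a0 + a1))
                          + (1 - th) * math.exp(y * a0 - math.exp(a0)))
    return l

def loglik_mixture(p, th, a0, a1, data):
    """Same quantity computed from the mixture form (4.4) of f(y | v*)."""
    mu1, mu0 = math.exp(a0 + a1), math.exp(a0)
    w = th * p / (1 - th * (1 - p))                  # P(V = 1 | V* = 0)
    l = 0.0
    for v, y in data:
        if v == 1:
            l += pois_logpmf(y, mu1)
        else:
            l += math.log(w * math.exp(pois_logpmf(y, mu1))
                          + (1 - w) * math.exp(pois_logpmf(y, mu0)))
    return l

data = [(1, 4), (0, 2), (0, 7), (1, 9), (0, 0)]      # toy (v*, y) pairs
a = loglik_412(0.25, 0.5, 1.2, 1.0, data)
b = loglik_mixture(0.25, 0.5, 1.2, 1.0, data)
print(abs(a - b) < 1e-9)
```

The two computations agree up to floating point, confirming that (4.12) is the log of the mixture density (4.4) evaluated at the observed data.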
We compare the results with naive estimates obtained by fitting a Poisson regression using the observed values of V∗, pretending there is no misrepresentation. We denote as the true model the Poisson regression using the corresponding values of V. For all the models, a normal prior with mean 0 and variance 10 is used for the regression coefficients. For the probability parameter θ, a uniform prior on (0, 1) is used. For the thinning, we take every 10th sample for the regular Poisson regression, and every 100th sample for our model acknowledging the misrepresentation.

The posterior distributions of the regression coefficients and the misclassified proportion from the different methods are provided in Figures 4.1 to 4.4. We observe that the naive estimates are biased downward compared with the estimates from the true models using the corresponding values of V, which will result in an underestimation of the risk. That is, the misrepresentation in the covariate causes an attenuation of the naive estimates. With a sample size of 1000, we observe that the estimates from WinBUGS and the Metropolis-Hastings algorithm are very close to those from the true models. It is surprising to observe that the posterior distributions of the regression coefficients from our proposed models are more concentrated than those from the true model. As the algorithms are implemented using two completely different methods (WinBUGS and a Metropolis-Hastings algorithm in R), this is unlikely to be caused by Monte Carlo errors. For both cases, p0 = 0.25 and p0 = 0.5, the posterior samples of the misclassified proportion p have a narrow distribution concentrated around the true values. This further confirms that the model is identified.
Figure 4.1: Distribution of posterior samples for α0 for the Poisson model, with (a) p0 = 0.25 and (b) p0 = 0.5. The results are based on one simulation with a sample size of 1000. The Bayesian models implemented in WinBUGS and R provide estimates that are very close to those from the true model using the unobserved values of V.

Figure 4.2: Distribution of posterior samples for α1 for the Poisson model. The results are based on one simulation with a sample size of 1000.

Figure 4.3: Distribution of posterior samples for p for the Poisson model. The results are based on one simulation with a sample size of 1000. The posterior distributions from the Bayesian models implemented in WinBUGS and R are concentrated and cover the true values of p0.

Figure 4.4: Distribution of posterior samples for θ for the Poisson model. The results are based on one simulation with a sample size of 1000.

4.3 Gamma response model

4.3.1 Model specification

For predicting the insurance claim (loss) amount, gamma generalized linear models (GLMs) are commonly used in actuarial practice. A log link function is typically used for gamma GLMs in the insurance industry, in order to obtain a multiplicative model with coefficients that are relatively easy to interpret.
Denoting Y as the loss amount, we can specify the distribution of (Y |V) as
\[
(Y \mid V) \sim \Gamma(\phi, \mu_V), \qquad \mu_V = \exp(\alpha_0 + \alpha_1 V), \tag{4.13}
\]
where φ and μV are the dispersion parameter and the mean of the gamma distribution, and α0 and α1 are the intercept and slope of the regression. Denoting f(·|φ, μ) as the gamma density function with dispersion parameter φ and mean μ, we can obtain the conditional distribution of (Y |V∗) as
\[
\begin{aligned}
f_Y(y \mid V^* = 1) &= f(y \mid \phi, \exp(\alpha_0+\alpha_1)) \\
f_Y(y \mid V^* = 0) &= \frac{\theta p}{1-\theta(1-p)}\, f(y \mid \phi, \exp(\alpha_0+\alpha_1)) + \frac{1-\theta}{1-\theta(1-p)}\, f(y \mid \phi, \exp(\alpha_0)). \qquad (4.14)
\end{aligned}
\]
From Equation (4.14), we observe that the conditional distribution of (Y |V∗) follows a mixture of two gamma distributions when V∗ = 0, and is a single gamma distribution when V∗ = 1. Based on the form of the conditional distribution, the model may possibly be identified, depending on the existence of a unique solution.

4.3.2 Model identification

Similarly, for the gamma response model we may study the model identification using method of moments estimators, based on the expectations and covariance matrix of the observed variables (V∗, Y). That is, we can obtain
\[
\begin{aligned}
E(V^*) &= \theta(1-p) &(4.15) \\
\operatorname{Var}(V^*) &= E(V^*) - E^2(V^*) &(4.16) \\
E(Y) &= E[E(Y \mid V)] = \theta\exp(\alpha_0+\alpha_1) + (1-\theta)\exp(\alpha_0) &(4.17) \\
\operatorname{Var}(Y) &= E[E(Y^2 \mid V)] - E^2(Y) = (\phi+1)\left[\theta\exp(2\alpha_0+2\alpha_1) + (1-\theta)\exp(2\alpha_0)\right] - E^2(Y) &(4.18) \\
\operatorname{Cov}(V^*, Y) &= E(V^*Y) - E(V^*)E(Y) = \sum_{v^*=0,1}\int v^* y\, f(y \mid v^*)\,[\theta(1-p)]^{v^*}[1-\theta(1-p)]^{1-v^*}\,dy - E(V^*)E(Y) \\
&= \theta(1-p)\exp(\alpha_0+\alpha_1) - E(V^*)E(Y). &(4.19)
\end{aligned}
\]
Hence, we have the four Equations (4.15), (4.17), (4.18) and (4.19) for solving the five parameters p, θ, α0, α1 and φ. In practice the model may thus be only weakly identified, as learning the values of all the parameters requires higher-order moments, which usually demands a very large sample size.
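The weak identification can be made concrete. The Python sketch below (parameter values are illustrative) constructs a second parameter vector, different from the first, that implies exactly the same four moments (4.15), (4.17), (4.18) and (4.19); these moments alone therefore cannot distinguish the two:

```python
import math

def moments(p, th, a0, a1, phi):
    """Population moments (4.15)-(4.19) implied by the gamma model."""
    mu1, mu0 = math.exp(a0 + a1), math.exp(a0)
    EVs  = th * (1 - p)
    EY   = th * mu1 + (1 - th) * mu0
    VarY = (phi + 1) * (th * mu1**2 + (1 - th) * mu0**2) - EY**2
    Cov  = th * (1 - p) * mu1 - EVs * EY
    return EVs, EY, VarY, Cov

# a reference parameter vector (illustrative values)
base = (0.25, 0.5, 1.2, 1.0, 2.0)          # (p, theta, a0, a1, phi)
EVs, EY, VarY, Cov = moments(*base)

# construct a different vector matching all four moments
th2 = 0.6                                   # pick theta freely
p2  = 1 - EVs / th2                         # match E(V*)
mu1 = (Cov + EVs * EY) / EVs                # exp(a0 + a1) is pinned down
mu0 = (EY - th2 * mu1) / (1 - th2)          # match E(Y)
phi2 = (VarY + EY**2) / (th2 * mu1**2 + (1 - th2) * mu0**2) - 1  # match Var(Y)
alt = (p2, th2, math.log(mu0), math.log(mu1) - math.log(mu0), phi2)

print([round(x, 6) for x in alt])           # a distinct parameter vector
```

Both vectors reproduce identical values of (4.15)–(4.19); distinguishing them requires information beyond the first two moments.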
4.3.3 Computational study

In order to illustrate the performance of the model under finite sample scenarios, we perform a computational study using MCMC simulations. A single sample of size n = 1000 is generated for V using a Bernoulli trial with probability θ0 = 0.5. Two different values of the misclassified proportion, p0 = 0.25 and p0 = 0.5, are used to obtain the corresponding samples of V∗. Samples of Y are then generated from the gamma regression model, using a dispersion parameter of φ0 = 2 and the true values (1.2, 1) of the regression intercept and slope (α0, α1). For constructing the Metropolis-Hastings algorithm, we may use the log likelihood of the parameters given by
\[
\begin{aligned}
l(p,\theta,\alpha_0,\alpha_1,\phi)
&= -\frac{n}{\phi}\left[\alpha_0 + \log(\phi)\right] - n\,\mathrm{lgamma}\!\left(\frac{1}{\phi}\right) + \left(\frac{1}{\phi}-1\right)\sum_{i=1}^{n}\log(y_i) \\
&\quad - \sum_{i=1}^{n} v_i^*\left[\frac{\alpha_1}{\phi} + \frac{y_i}{\phi\exp(\alpha_0+\alpha_1)}\right] - \sum_{i=1}^{n}(1-v_i^*)\log[1-\theta(1-p)] \\
&\quad + \sum_{i=1}^{n}(1-v_i^*)\log\left\{\theta p \exp\!\left[-\frac{\alpha_1}{\phi} - \frac{y_i}{\phi\exp(\alpha_0+\alpha_1)}\right] + (1-\theta)\exp\!\left[-\frac{y_i}{\phi\exp(\alpha_0)}\right]\right\}, \qquad (4.20)
\end{aligned}
\]
where lgamma(·) is the logarithm of the gamma function. Using a uniform prior on the interval (0, 1) for p, we fit our model using both WinBUGS and a self-coded Metropolis-Hastings algorithm in R. For the Metropolis-Hastings algorithm written in R, normal proposal distributions centered at the current value are used for the regression coefficients and the gamma dispersion parameter. Maximum likelihood estimates of the correlation matrix from generalized linear models using the misclassified variable are used for the bivariate proposal distribution of the regression coefficients. For the probability parameters p and θ, a uniform random walk centered at the current value is used, in order to achieve a reasonable acceptance ratio. The ranges of the proposal distributions are chosen in order to achieve an acceptance ratio around 0.7 for the regular gamma regression, and between 0.2 and 0.7 for our model acknowledging the misrepresentation.
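As with the Poisson case, (4.20) can be cross-checked against the mixture form (4.14). The Python sketch below (toy data and parameter values are illustrative) computes the conditional log likelihood of (Y | V∗) both ways, using the mean–dispersion parameterization of the gamma density (shape 1/φ, scale φμ):

```python
import math

def gamma_logpdf(y, phi, mu):
    """Gamma log density with dispersion phi and mean mu (shape 1/phi, scale phi*mu)."""
    k = 1.0 / phi
    return (k - 1) * math.log(y) - y / (phi * mu) - math.lgamma(k) - k * math.log(phi * mu)

def loglik_420(p, th, a0, a1, phi, data):
    """Log likelihood (4.20) for (Y | V*) under misrepresentation."""
    n = len(data)
    l = -(n / phi) * (a0 + math.log(phi)) - n * math.lgamma(1 / phi)
    l += (1 / phi - 1) * sum(math.log(y) for _, y in data)
    for v, y in data:
        if v == 1:
            l -= a1 / phi + y / (phi * math.exp(a0 + a1))
        else:
            l -= math.log(1 - th * (1 - p))
            l += math.log(th * p * math.exp(-a1 / phi - y / (phi * math.exp(a0 + a1)))
                          + (1 - th) * math.exp(-y / (phi * math.exp(a0))))
    return l

def loglik_mixture(p, th, a0, a1, phi, data):
    """Same quantity computed from the mixture form (4.14)."""
    mu1, mu0 = math.exp(a0 + a1), math.exp(a0)
    w = th * p / (1 - th * (1 - p))        # P(V = 1 | V* = 0)
    l = 0.0
    for v, y in data:
        if v == 1:
            l += gamma_logpdf(y, phi, mu1)
        else:
            l += math.log(w * math.exp(gamma_logpdf(y, phi, mu1))
                          + (1 - w) * math.exp(gamma_logpdf(y, phi, mu0)))
    return l

data = [(1, 3.2), (0, 0.8), (0, 12.5), (1, 7.1)]   # toy (v*, y) pairs
a = loglik_420(0.25, 0.5, 1.2, 1.0, 2.0, data)
b = loglik_mixture(0.25, 0.5, 1.2, 1.0, 2.0, data)
ok = abs(a - b) < 1e-9
print(ok)
```

The agreement confirms that (4.20) factors the common gamma terms out of the mixture density (4.14).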
We compare the results with naive estimates obtained by fitting a gamma regression using the observed values of V∗, pretending there is no misrepresentation. We denote as the true model the gamma regression using the corresponding values of V. For the dispersion parameter, we use a normal prior with mean 5.5 and variance 10. A uniform prior on (0, 1) is used for the probability parameter θ. Regarding the thinning, we take every 10th sample for the regular gamma regression, and every 1000th sample for our model acknowledging the misrepresentation.

The distributions of the posterior samples of the parameters from the gamma response model are plotted in Figures 4.5 to 4.9. From the figures, we observe that misrepresentation in the covariate causes an attenuation of the naive estimates. With a sample size of 1000, both WinBUGS and the Metropolis-Hastings algorithm provide estimates of the regression coefficients that are closer to those from the true model using the corresponding values of V. Acknowledging the misrepresentation results in higher variation in the posterior samples. For both true values, p0 = 0.25 and p0 = 0.5, the posterior samples of the misclassified proportion p have a distribution centered around the true values. The misrepresentation does not seem to have a big impact on the gamma dispersion parameter φ. Compared with the Poisson model, the posterior distributions are much more spread out for all the parameters, indicating slower learning caused by the weak identification. The results are consistent with those from the theoretical study of the method of moments estimators, in the sense that learning the true values requires a larger sample size.

Figure 4.5: Distribution of posterior samples for α0 for the gamma model.
The results are based on one simulation with a sample size of 1000. Acknowledging the misrepresentation results in more accurate estimates with higher uncertainty.

Figure 4.6: Distribution of posterior samples for α1 for the gamma model. The results are based on one simulation with a sample size of 1000. Acknowledging the misrepresentation results in more accurate estimates with higher uncertainty.

Figure 4.7: Distribution of posterior samples for p for the gamma model. The results are based on one simulation with a sample size of 1000. The relatively spread-out posterior distribution covers the true value.

Figure 4.8: Distribution of posterior samples for θ for the gamma model. The results are based on one simulation with a sample size of 1000. The relatively spread-out posterior distribution covers the true value.

Figure 4.9: Distribution of posterior samples for φ for the gamma model. The results are based on one simulation with a sample size of 1000. Misrepresentation does not seem to have a big impact on the dispersion parameter.
4.4 Binary response model

4.4.1 Model specification

For binary response variables (e.g., whether an insured has a loss), we can use logistic regression to model the relationship between the response and a covariate. Denoting Y as the indicator of whether there is any claim in a policy year, we can specify the distribution of (Y |V) as
\[
(Y \mid V) \sim \text{Bernoulli}\left(\frac{\exp(\alpha_0+\alpha_1 V)}{1+\exp(\alpha_0+\alpha_1 V)}\right), \tag{4.21}
\]
where Bernoulli(π) denotes a Bernoulli trial with probability π, and α0 and α1 are the intercept and slope of the regression. Denoting f(·|π) as the binomial density function with probability π and n = 1, it is straightforward to obtain the conditional distribution of (Y |V∗) as
\[
\begin{aligned}
f_Y(y \mid V^* = 1) &= f\!\left(y \,\middle|\, \frac{\exp(\alpha_0+\alpha_1)}{1+\exp(\alpha_0+\alpha_1)}\right) \\
f_Y(y \mid V^* = 0) &= \frac{\theta p}{1-\theta(1-p)}\, f\!\left(y \,\middle|\, \frac{\exp(\alpha_0+\alpha_1)}{1+\exp(\alpha_0+\alpha_1)}\right) + \frac{1-\theta}{1-\theta(1-p)}\, f\!\left(y \,\middle|\, \frac{\exp(\alpha_0)}{1+\exp(\alpha_0)}\right). \qquad (4.22)
\end{aligned}
\]
From Equation (4.22), we observe that the conditional distribution of (Y |V∗) follows a mixture of two Bernoulli distributions when V∗ = 0, and is a single Bernoulli distribution when V∗ = 1. The theoretical form is similar to that of the Poisson and gamma models. However, for binary data, higher-order moments do not contain extra information. This may cause potential issues in model identification.

4.4.2 Model identification

In particular, we can study the model identification using method of moments estimators, based on the expectations and covariance matrix of the observed variables (V∗, Y). For a binary response variable Y, we can obtain
\[
\begin{aligned}
E(V^*) &= \theta(1-p) &(4.23) \\
\operatorname{Var}(V^*) &= E(V^*) - E^2(V^*) &(4.24) \\
E(Y) &= E[E(Y \mid V)] = \theta\,\frac{\exp(\alpha_0+\alpha_1)}{1+\exp(\alpha_0+\alpha_1)} + (1-\theta)\,\frac{\exp(\alpha_0)}{1+\exp(\alpha_0)} &(4.25) \\
\operatorname{Var}(Y) &= E(Y^2) - E^2(Y) = E(Y) - E^2(Y) &(4.26) \\
\operatorname{Cov}(V^*, Y) &= E(V^*Y) - E(V^*)E(Y) = \sum_{v^*=0,1}\sum_{y=0,1} v^* y\, f(y \mid v^*)\,[\theta(1-p)]^{v^*}[1-\theta(1-p)]^{1-v^*} - E(V^*)E(Y) \\
&= \theta(1-p)\,\frac{\exp(\alpha_0+\alpha_1)}{1+\exp(\alpha_0+\alpha_1)} - E(V^*)E(Y). &(4.27)
\end{aligned}
\]
Here, we have only three informative equations from the first two moments, and we will not be able to solve for the four parameters p, θ, α0 and α1. Since both the response and the covariate are binary, higher-order moments are the same as the first and second moments. Hence, we will not be able to learn all the parameters from the data, which suggests that the model is nonidentified.

4.4.3 Simulation study

In order to illustrate the performance of the model under finite sample scenarios, we perform a simulation study using MCMC techniques. Again, we may use the Metropolis-Hastings algorithm for the MCMC simulation. The log likelihood of the parameters is given by
\[
\begin{aligned}
l(p,\theta,\alpha_0,\alpha_1)
&= \sum_{i=1}^{n} v_i^* y_i(\alpha_0+\alpha_1) - \sum_{i=1}^{n} v_i^*\log[1+\exp(\alpha_0+\alpha_1)] - \sum_{i=1}^{n}(1-v_i^*)\log[1-\theta(1-p)] \\
&\quad + \sum_{i=1}^{n}(1-v_i^*)\log\left[\frac{\theta p \exp\{y_i(\alpha_0+\alpha_1)\}}{1+\exp(\alpha_0+\alpha_1)} + \frac{(1-\theta)\exp\{y_i\alpha_0\}}{1+\exp(\alpha_0)}\right]. \qquad (4.28)
\end{aligned}
\]
When we use a uniform prior on the interval (0, 1) for the misclassified proportion p, we find that WinBUGS has a problem with algorithm convergence: three chains with different starting values do not mix for the parameter p. The Metropolis-Hastings algorithm coded in R seems to work better. Hence, we will study the convergence of the posterior mean and variance of the parameters with different priors on the misclassified proportion p. Thus we will be able to confirm whether more informative priors on p improve the model inference.

We use four different sample sizes of 40, 160, 640 and 2560 for studying the convergence of the posterior estimates of the parameters. Samples of V are generated using a Bernoulli trial with probability θ0 = 0.5. Two different values, p0 = 0.25 and p0 = 0.5, are used for the misclassified proportion, for obtaining the corresponding samples of V∗. Corresponding samples of Y are then generated from the binary regression model, using the true values (−0.5, 1.5) of the intercept and slope (α0, α1).
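The nonidentifiability can be exhibited directly. Since V∗ and Y are both binary, the joint distribution of (V∗, Y) is fully determined by E(V∗), E(Y) and Cov(V∗, Y). The Python sketch below (starting values are illustrative) constructs two distinct parameter vectors with identical values of all three, and hence identical observable distributions:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def logit(u):
    return math.log(u / (1 - u))

def moments(p, th, a0, a1):
    """Observable moments (4.23), (4.25), (4.27) for the binary model."""
    EVs = th * (1 - p)
    EY  = th * sigmoid(a0 + a1) + (1 - th) * sigmoid(a0)
    Cov = th * (1 - p) * sigmoid(a0 + a1) - EVs * EY
    return EVs, EY, Cov

base = (0.25, 0.5, -0.5, 1.5)          # (p, theta, a0, a1), illustrative
EVs, EY, Cov = moments(*base)

# build a second, different vector with identical observable moments
th2 = 0.6                              # choose theta freely
p2  = 1 - EVs / th2                    # match E(V*)
pi1 = (Cov + EVs * EY) / EVs           # sigmoid(a0 + a1) is pinned down
pi0 = (EY - th2 * pi1) / (1 - th2)     # match E(Y)
alt = (p2, th2, logit(pi0), logit(pi1) - logit(pi0))

print([round(x, 4) for x in alt])      # a distinct, observationally equivalent vector
```

No sample size can separate the two vectors, which is exactly the nonidentifiability argued from the moment equations above.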
For the Metropolis-Hastings algorithm written in R, normal proposal distributions centered at the current value are used for the regression coefficients α0 and α1. Maximum likelihood estimates of the correlation matrix from generalized linear models are used for the bivariate proposal distribution. For the probability parameter θ and the misclassified proportion p, we use a uniform random walk proposal centered at the current value, in order to achieve a reasonable acceptance ratio. The ranges of the proposal distributions are chosen in order to achieve an acceptance ratio around 0.7 for the regular logistic regression, and between 0.2 and 0.7 for our model acknowledging the misrepresentation. We take every 200th sample based on the autocorrelation, with a total of five chains using different starting values of p.

Here we compare the results with naive estimates obtained by fitting a logistic regression using the observed values of V∗, pretending there is no misrepresentation. We denote as the true model the logistic regression using the corresponding values of V. For all the models, vague priors with mean 0 and variance 10 are used for the regression coefficients. For the parameter θ, a uniform prior on (0, 1) is used. The three prior distributions for p are beta distributions with parameters (1, 3), (2, 2) and (2, 6). The prior means and standard deviations are (0.25, 0.19), (0.5, 0.22) and (0.25, 0.14), respectively. Using more informative priors (regardless of whether the guess is correct) causes a problem in the convergence of the Metropolis-Hastings algorithm. The three prior distributions are plotted in Figure 4.10.

Figure 4.10: Probability density and cumulative distribution functions of the priors on the misclassified proportion p.
The prior means and standard deviations are (0.25, 0.19), (0.5, 0.22) and (0.25, 0.14), respectively.

In Figures 4.11 to 4.16, we present the distribution of the posterior mean and standard deviation for the estimates from the five methods. Each boxplot is based on fitted models for 20 generated datasets, with the data generation mechanisms and model assumptions provided in the earlier paragraphs. From the figures, we observe that misrepresentation in the covariate causes an attenuation of the naive estimates. For all three prior distributions, our Bayesian model accounting for the existence of misrepresentation improves the accuracy of the estimates. That is, acknowledging the misrepresentation moves the estimates toward the corresponding estimates from the true models. For the case when the true misclassified proportion is p0 = 0.25, the prior distribution Beta(2, 6) seems to work better. For the case when p0 = 0.5, it is not obvious which prior distribution works better across the sample sizes. However, the estimates of the misclassified proportion p do not seem to converge to the true values as the sample size increases. The posterior mean of p seems to be driven by the prior, and the posterior standard deviation does not decrease when the sample size increases. This is consistent with the theoretical finding that the model is nonidentified.

Figure 4.11: Convergence of the posterior mean for α0 with different true values of and priors on p. For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior means based on the three different priors for p, in the order given in the legend.
We exclude the results with p0 = 0.5 and n = 40, due to a problem of numerical overflow/underflow.

Figure 4.12: Convergence of the posterior mean for α1 with different true values of and priors on p. For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior means based on the three different priors for p. We exclude the results with p0 = 0.5 and n = 40, due to a problem of numerical overflow/underflow.

Figure 4.13: Convergence of the posterior mean for p with different true values of and priors on p. For each sample size, the three boxplots (based on 20 simulations) from left to right are the posterior means based on the three different priors, in the order given in the legend. We exclude the results with p0 = 0.5 and n = 40, due to a problem of numerical overflow/underflow.

Figure 4.14: Convergence of the posterior standard deviation for α0 with different true values of and priors on p. For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior standard deviations based on the three different priors for p. We exclude the results with p0 = 0.5 and n = 40, due to a problem of numerical overflow/underflow.
Figure 4.15: Convergence of the posterior standard deviation for α1 with different true values of and priors on p. For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified covariate (naive), the 'unobserved' covariate (true), and the posterior standard deviations based on the three different priors for p. We exclude the results with p0 = 0.5 and n = 40, due to a problem of numerical overflow/underflow.

Figure 4.16: Convergence of the posterior standard deviation for p with different true values of and priors on p. For each sample size, the three boxplots (based on 20 simulations) from left to right are the posterior standard deviations based on the three different priors, in the order given in the legend. We exclude the results with p0 = 0.5 and n = 40, due to a problem of numerical overflow/underflow.

4.5 Misclassified variable as response

For example, in order to conduct a cost-benefit analysis of possible investigations into false statements, insurance companies may need an understanding of the proportion of misrepresentation in the existing policies. In this section, we investigate whether the availability of correctly measured covariates helps with inference on the misclassified proportion p.
4.5.1 Model specification

Suppose X is a correctly measured covariate (e.g., annual mileage in auto insurance) that is thought to have predictive power for our variable subject to misrepresentation (e.g., the binary indicator of whether the vehicle is used for commuting to work). We can use a logistic regression model to describe the relationship between V and X. That is,
\[
(V \mid X) \sim \text{Bernoulli}(\pi_X), \qquad \operatorname{logit}(\pi_X) = \beta_0 + \beta_1 X, \tag{4.29}
\]
where Bernoulli(πX) is a Bernoulli distribution with probability of success πX, and β0 and β1 are the intercept and slope of the logistic regression. Here we assume that the misclassification is non-differential with respect to X (i.e., X and (V∗ |V) are independent). From Equations (4.1) and (4.29), we can obtain the conditional distribution of (V∗ | X). That is,
\[
P(V^* = 1 \mid X) = P(V^* = 1, V = 1 \mid X) + P(V^* = 1, V = 0 \mid X) = (1-p)\,\frac{\exp(\beta_0+\beta_1 X)}{1+\exp(\beta_0+\beta_1 X)}. \tag{4.30}
\]
Equation (4.30) shows that a generalized linear model can be used to model the relationship between V∗ and X, with the link function related to the misclassified proportion. The model has the parameter p in addition to the regression coefficients β0 and β1.

4.5.2 Model identification

For the situation where there is only one covariate X, we denote f(x) as the probability density function of X. Here, we consider the case where X has a continuous parametric distribution; the case of a discrete distribution is similar. In order to study the model identification, we can obtain the method of moments estimators based on the expectation and covariance matrix of (V∗, X).
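The closed form (4.30) can be checked by marginalizing over the latent V by hand. In the Python sketch below (parameter values are illustrative), the two-step computation using (4.1) and (4.29) agrees with (4.30) at every design point:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def p_vstar_closed_form(x, p, b0, b1):
    """Closed form (4.30): P(V* = 1 | X = x) = (1 - p) * sigmoid(b0 + b1 x)."""
    return (1 - p) * sigmoid(b0 + b1 * x)

def p_vstar_two_step(x, p, b0, b1):
    """Same probability by summing over the latent V, using (4.1) and (4.29)."""
    pv1 = sigmoid(b0 + b1 * x)             # P(V = 1 | x)
    # V* = 1 requires V = 1 (specificity is one) and no misrepresentation:
    return pv1 * (1 - p) + (1 - pv1) * 0.0

p, b0, b1 = 0.25, 0.0, -1.5                # illustrative values
checks = [abs(p_vstar_closed_form(x, p, b0, b1) - p_vstar_two_step(x, p, b0, b1)) < 1e-12
          for x in [0.0, 0.5, 1.0, 2.0]]
print(all(checks))
```

The factor (1 − p) acts as a ceiling on P(V∗ = 1 | X), which is exactly how p enters the link function of the resulting GLM.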
That is,
\[
\begin{aligned}
E(V^*) &= E[E(V^* \mid X)] = \int (1-p)\,\frac{\exp(\beta_0+\beta_1 x)}{1+\exp(\beta_0+\beta_1 x)}\, f(x)\,dx &(4.31) \\
\operatorname{Var}(V^*) &= E[(V^*)^2] - E^2(V^*) = E(V^*) - E^2(V^*) &(4.32) \\
E(X) &= \int x f(x)\,dx &(4.33) \\
\operatorname{Var}(X) &= \int x^2 f(x)\,dx - E^2(X) &(4.34) \\
\operatorname{Cov}(V^*, X) &= E(V^*X) - E(V^*)E(X) = (1-p)\int x f(x)\,\frac{\exp(\beta_0+\beta_1 x)}{1+\exp(\beta_0+\beta_1 x)}\,dx - E(V^*)E(X). &(4.35)
\end{aligned}
\]
For the parameters p, β0 and β1, only Equations (4.31) and (4.35) are informative. For binary data in particular, higher moments of V∗ alone do not contain extra information on the parameters. Hence, learning all of the parameters p, β0 and β1 requires higher-order moments involving both V∗ and X, of the form E(V∗X^i), i ≥ 2. This suggests that the model would be weakly identified.

On the other hand, we may be able to study the model identification using the Fisher information matrix. Writing ηi = β0 + β1 xi for brevity, the log likelihood of the parameters is given by
\[
l(p, \beta_0, \beta_1) = \sum_{i=1}^{n} v_i^*\left[\eta_i + \log(1-p)\right] + \sum_{i=1}^{n}(1-v_i^*)\log[1+p\exp(\eta_i)] - \sum_{i=1}^{n}\log[1+\exp(\eta_i)]. \tag{4.36}
\]
Hence, we can obtain the gradient (score function) of the log likelihood as
\[
\begin{aligned}
\frac{\partial l}{\partial p} &= -\frac{\sum_{i=1}^{n} v_i^*}{1-p} + \sum_{i=1}^{n}(1-v_i^*)\frac{\exp(\eta_i)}{1+p\exp(\eta_i)} \\
\frac{\partial l}{\partial \beta_0} &= \sum_{i=1}^{n} v_i^* + \sum_{i=1}^{n}(1-v_i^*)\frac{p\exp(\eta_i)}{1+p\exp(\eta_i)} - \sum_{i=1}^{n}\frac{\exp(\eta_i)}{1+\exp(\eta_i)} \\
\frac{\partial l}{\partial \beta_1} &= \sum_{i=1}^{n} v_i^* x_i + \sum_{i=1}^{n}(1-v_i^*)\frac{p\exp(\eta_i)x_i}{1+p\exp(\eta_i)} - \sum_{i=1}^{n}\frac{\exp(\eta_i)x_i}{1+\exp(\eta_i)}. \qquad (4.37)
\end{aligned}
\]
The elements of the Hessian matrix of the log likelihood can be obtained as
\[
\begin{aligned}
\frac{\partial^2 l}{\partial p^2} &= -\frac{\sum_{i=1}^{n} v_i^*}{(1-p)^2} - \sum_{i=1}^{n}(1-v_i^*)\frac{\exp(2\eta_i)}{[1+p\exp(\eta_i)]^2} \\
\frac{\partial^2 l}{\partial p\,\partial\beta_0} &= \sum_{i=1}^{n}(1-v_i^*)\frac{\exp(\eta_i)}{[1+p\exp(\eta_i)]^2} \\
\frac{\partial^2 l}{\partial p\,\partial\beta_1} &= \sum_{i=1}^{n}(1-v_i^*)\frac{\exp(\eta_i)x_i}{[1+p\exp(\eta_i)]^2} \\
\frac{\partial^2 l}{\partial \beta_0^2} &= \sum_{i=1}^{n}(1-v_i^*)\frac{p\exp(\eta_i)}{[1+p\exp(\eta_i)]^2} - \sum_{i=1}^{n}\frac{\exp(\eta_i)}{[1+\exp(\eta_i)]^2} \\
\frac{\partial^2 l}{\partial \beta_0\,\partial\beta_1} &= \sum_{i=1}^{n}(1-v_i^*)\frac{p\exp(\eta_i)x_i}{[1+p\exp(\eta_i)]^2} - \sum_{i=1}^{n}\frac{\exp(\eta_i)x_i}{[1+\exp(\eta_i)]^2} \\
\frac{\partial^2 l}{\partial \beta_1^2} &= \sum_{i=1}^{n}(1-v_i^*)\frac{p\exp(\eta_i)x_i^2}{[1+p\exp(\eta_i)]^2} - \sum_{i=1}^{n}\frac{\exp(\eta_i)x_i^2}{[1+\exp(\eta_i)]^2}. \qquad (4.38)
\end{aligned}
\]
Taking expectations over (V_i^* | x_i), we can obtain the elements of the Fisher information matrix as
\[
\begin{aligned}
(\mathcal{I})_{11} &= \sum_{i=1}^{n}\frac{\exp(\eta_i)}{(1-p)[1+\exp(\eta_i)]} + \sum_{i=1}^{n}\frac{\exp(2\eta_i)}{[1+\exp(\eta_i)][1+p\exp(\eta_i)]} \\
(\mathcal{I})_{12} = (\mathcal{I})_{21} &= -\sum_{i=1}^{n}\frac{\exp(\eta_i)}{[1+\exp(\eta_i)][1+p\exp(\eta_i)]} \\
(\mathcal{I})_{13} = (\mathcal{I})_{31} &= -\sum_{i=1}^{n}\frac{\exp(\eta_i)x_i}{[1+\exp(\eta_i)][1+p\exp(\eta_i)]} \\
(\mathcal{I})_{22} &= -\sum_{i=1}^{n}\frac{p\exp(\eta_i)}{[1+\exp(\eta_i)][1+p\exp(\eta_i)]} + \sum_{i=1}^{n}\frac{\exp(\eta_i)}{[1+\exp(\eta_i)]^2} \\
(\mathcal{I})_{23} = (\mathcal{I})_{32} &= -\sum_{i=1}^{n}\frac{p\exp(\eta_i)x_i}{[1+\exp(\eta_i)][1+p\exp(\eta_i)]} + \sum_{i=1}^{n}\frac{\exp(\eta_i)x_i}{[1+\exp(\eta_i)]^2} \\
(\mathcal{I})_{33} &= -\sum_{i=1}^{n}\frac{p\exp(\eta_i)x_i^2}{[1+\exp(\eta_i)][1+p\exp(\eta_i)]} + \sum_{i=1}^{n}\frac{\exp(\eta_i)x_i^2}{[1+\exp(\eta_i)]^2}. \qquad (4.39)
\end{aligned}
\]
In order to evaluate the Fisher information matrix, we generate two sets of values for the design points Xi (i = 1, 2, ..., n) with a sample size of n = 1000, using a gamma distribution with a shape parameter of 2 and a scale parameter of 1/2. The true values of the intercept and slope of the logistic regression are β0 = 0 and β1 = −1.5. For p, we use the two different true values p0 = 0.25 and p0 = 0.5, respectively, for the two sets of values of Xi. We can then plug in the true values of the parameters and the observed values of Xi, and obtain the inverse of the Fisher information matrix as
\[
\mathcal{I}^{-1}_{p=0.25} = \begin{pmatrix} 0.736 & 1.687 & -0.426 \\ 1.687 & 3.902 & -1.009 \\ -0.426 & -1.009 & 0.292 \end{pmatrix}, \qquad
\mathcal{I}^{-1}_{p=0.5} = \begin{pmatrix} 0.514 & 1.767 & -0.445 \\ 1.767 & 6.127 & -1.581 \\ -0.445 & -1.581 & 0.458 \end{pmatrix}. \tag{4.40}
\]
Hence, we can obtain the correlation matrices as
\[
\operatorname{Cor}_{p=0.25} = \begin{pmatrix} 1.000 & 0.996 & -0.917 \\ 0.996 & 1.000 & -0.944 \\ -0.917 & -0.944 & 1.000 \end{pmatrix}, \qquad
\operatorname{Cor}_{p=0.5} = \begin{pmatrix} 1.000 & 0.996 & -0.919 \\ 0.996 & 1.000 & -0.946 \\ -0.919 & -0.946 & 1.000 \end{pmatrix}. \tag{4.41}
\]
From the correlation matrices, the posterior correlations of the parameters are very high, with correlation coefficients very close to 1 or −1. Hence, learning the true values of the parameters is hard even with a sample size of n = 1000. This confirms that the model is weakly identified.

4.5.3 Simulation study

In this subsection, we study the convergence of the posterior mean and variance of the parameters with different priors on the misclassified proportion p. For the uniform prior on the interval (0, 1), we found that WinBUGS has a problem with algorithm convergence: three chains with different starting values do not mix for the parameter p. This is probably owing to the identification problem caused by the binary feature of the data (for binary data, higher-order moments contain no extra information on the parameters). The convergence problem in WinBUGS seems to be related to the identification issues, as it occurs for the binary models both when the misclassified variable is used as the response and when it is used as the covariate. Here, we try to confirm whether more informative priors on p will improve the model inference.

For data generation, we use two different values, p0 = 0.25 and p0 = 0.5, for the misclassified proportion. The intercept and slope of the logistic regression are 0 and −1.5. The values of the covariate X are generated from a gamma distribution with a shape parameter of 2 and a scale parameter of 1/2. Then unobserved values of V are generated using the
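The Fisher information calculation (4.39)–(4.41) can be replicated with a short script. The Python sketch below uses the same kind of design as in the text (n = 1000 draws from a gamma distribution with shape 2 and scale 1/2, here with an arbitrary seed, so the draws differ from those used for (4.40)) and the true values (p, β0, β1) = (0.25, 0, −1.5); the near-unit correlations reappear:

```python
import math, random

def fisher_info(xs, p, b0, b1):
    """Fisher information matrix (4.39) for (p, b0, b1) at design points xs."""
    I = [[0.0] * 3 for _ in range(3)]
    for x in xs:
        e = math.exp(b0 + b1 * x)
        d1, dp = 1 + e, 1 + p * e
        I[0][0] += e / ((1 - p) * d1) + e * e / (d1 * dp)
        I[0][1] += -e / (d1 * dp)
        I[0][2] += -e * x / (d1 * dp)
        I[1][1] += -p * e / (d1 * dp) + e / d1**2
        I[1][2] += -p * e * x / (d1 * dp) + e * x / d1**2
        I[2][2] += -p * e * x * x / (d1 * dp) + e * x * x / d1**2
    I[1][0], I[2][0], I[2][1] = I[0][1], I[0][2], I[1][2]
    return I

def inverse3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

rng = random.Random(59)                                # arbitrary seed
xs = [rng.gammavariate(2, 0.5) for _ in range(1000)]   # gamma(shape 2, scale 1/2)
V = inverse3(fisher_info(xs, 0.25, 0.0, -1.5))
corr = [[V[i][j] / math.sqrt(V[i][i] * V[j][j]) for j in range(3)] for i in range(3)]
print([round(corr[0][1], 3), round(corr[0][2], 3), round(corr[1][2], 3)])
```

The off-diagonal correlations come out close to ±1, matching the pattern in (4.41) and confirming how slowly the data can separate p from the regression coefficients.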
The convergence problem in WinBUGS appears to be related to the identification issues, as it occurs for both binary models, whether the misclassified variable is used as the response or as the covariate. Here, we investigate whether more informative priors on p improve the model inference. For data generation, we use two different values, p0 = 0.25 and p0 = 0.5, for the misclassified proportion. The intercept and slope of the logistic regression are 0 and −1.5. The values of the covariate X are generated from a gamma distribution with shape parameter 2 and scale parameter 1/2. The unobserved values of V are then generated using the corresponding probabilities from the logistic regression and, according to the misclassification probabilities, adjusted to obtain the observed values of V*. We use four different sample sizes of 40, 160, 640 and 2560 to study the convergence of the parameter estimates. For the Metropolis-Hastings algorithm written in R, normal proposal distributions centered at the current value are used for all the parameters. The maximum likelihood estimate of the correlation matrix from a generalized linear model fit using the misclassified variable is used in the bivariate proposal distribution for the regression coefficients. For the simulation with a sample size of 40, we have to use a uniform random walk for the probability parameter p in order to achieve a reasonable acceptance ratio. The ranges of the proposal distributions are chosen to achieve an acceptance ratio around 0.7 for the regular logistic regression, and between 0.2 and 0.7 for our model acknowledging the misrepresentation. We take every 200th or 1000th sample based on the autocorrelation, with a total of five chains using different starting values of p.
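The sampler just described can be sketched as follows. This is a simplified Python analogue of our R implementation: a joint random-walk update rather than the tuned blockwise proposals, with illustrative step sizes, a Beta(2, 6) prior on p and N(0, variance 10) priors on the coefficients:

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(2)

# simulate data with one-directional misclassification in the response
n, p_true, b0_true, b1_true = 160, 0.25, 0.0, -1.5
x = rng.gamma(2.0, 0.5, size=n)
v = rng.random(n) < 1 / (1 + np.exp(-(b0_true + b1_true * x)))  # true status
vstar = v & (rng.random(n) > p_true)     # a true 1 is reported as 0 w.p. p

def log_post(th):
    """Log posterior: P(V*=1|x) = (1-p)*expit(b0+b1*x), Beta(2,6) prior on p,
    N(0, variance 10) priors on the regression coefficients."""
    p, b0, b1 = th
    if not 0 < p < 1:
        return -np.inf
    q = 1 / (1 + np.exp(-(b0 + b1 * x)))
    pr1 = (1 - p) * q
    loglik = np.sum(np.where(vstar, np.log(pr1), np.log1p(-pr1)))
    return loglik + beta_dist.logpdf(p, 2, 6) - (b0**2 + b1**2) / 20

th = np.array([0.3, 0.0, -1.0])          # starting values
lp = log_post(th)
draws = []
for s in range(2000):
    prop = th + rng.normal(scale=[0.05, 0.2, 0.2])   # illustrative step sizes
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:          # accept/reject
        th, lp = prop, lp_prop
    draws.append(th.copy())
draws = np.array(draws)
```

In practice one would run several such chains from different starting values of p and thin according to the autocorrelation, as described above.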
In the simulation study, we use five methods: the naive model, which fits a logistic regression using the misclassified variable as the response; the true model, which uses the 'unobserved' values of V; and the Bayesian model under three different priors on the misclassified proportion p. For all the models, a vague normal prior with mean 0 and variance 10 is used for the regression coefficients. The three prior distributions on p are beta distributions with parameters (1, 3), (2, 2) and (2, 6) (the same as those for the binary response model in the previous section), with prior means and standard deviations of (0.25, 0.19), (0.5, 0.22) and (0.25, 0.14) respectively. Using more informative priors (regardless of whether the guess is correct) causes a problem in the convergence of the Metropolis-Hastings algorithm. The probability and cumulative densities can be found in Figure 4.10 in the previous section. For the Metropolis-Hastings algorithm we coded in R, there are also some issues with convergence: the MCMC occasionally stays at one value for a long period of time, and running longer chains (e.g., around 5,000,000 iterations) makes such long stays at one value more likely. Hence, we run five independent chains with different starting values of p. After taking every 1000th sample, the posterior samples look much better, although we are not able to completely mitigate the problem. This is probably owing to the identification problem caused by binary data. Examples of trace plots for the parameters α0 and p are provided in Figures 4.17 and 4.18. In Figures 4.19 to 4.24, we present the distribution of the posterior mean and standard deviation of the estimates from the five methods. Each boxplot is based on fitted models for 20 datasets generated under the assumptions described in the earlier paragraphs. From the figures, we observe that acknowledging the existence of misrepresentation improves the accuracy of the estimates.
This is most obvious for the estimated intercept under all the priors, in the sense that acknowledging misrepresentation moves the estimates in the right direction. However, the estimates of the misclassified proportion p do not seem to converge to their true values. Instead, they converge to zero (with both the posterior mean and standard deviation converging in a manner comparable to identified models) as the effect of the prior starts to wear off with increasing sample size. This may be due to the high posterior correlations among the parameters, as shown by the Fisher information matrix. Given the theoretical finding that for binary data the model is weakly identified, however, the results are quite surprising. In the future, we may need to undertake a more comprehensive theoretical study in order to determine the cause of this simulation behavior.

4.6 Conclusions

In this chapter, we study the identification of models when there is misrepresentation in a binary covariate or response variable. In particular, we propose three models based on Poisson, gamma and Bernoulli distributions respectively for predicting insurance claim frequency, severity and occurrence. We confirm that the identification of the model depends on the distribution of the response variable.

Figure 4.17: Trace plots for posterior samples of α0 from the Metropolis-Hastings algorithm when n = 160 and p0 = 0.25, with panel (a) showing one chain before thinning and panel (b) five chains after thinning. We observe that the trace plot improves using five independent chains after thinning. In particular, the first 1000 samples are from the first chain, the next 1000 are from the second chain, and so on.
Figure 4.18: Trace plots for posterior samples of p from the Metropolis-Hastings algorithm when n = 160 and p0 = 0.25, with panel (a) before thinning and panel (b) after. We observe that the trace plot improves using five independent chains after thinning. In particular, the first 1000 samples are from the first chain, the next 1000 are from the second chain, and so on.

Figure 4.19: Convergence of the posterior mean for β0 with different true values of and priors on p (panel (a): p0 = 0.25; panel (b): p0 = 0.5). For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the posterior means based on the three different priors for p, in the order given in the legend. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.

Figure 4.20: Convergence of the posterior mean for β1 with different true values of and priors on p (panel (a): p0 = 0.25; panel (b): p0 = 0.5). For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the posterior means based on the three different priors for p. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.21: Convergence of the posterior mean for p with different true values of and priors on p (panel (a): p0 = 0.25; panel (b): p0 = 0.5). For each sample size, the three boxplots (based on 20 simulations) from left to right are the posterior means based on the three different priors, in the order given in the legend. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.

Figure 4.22: Convergence of the posterior standard deviation for β0 with different true values of and priors on p (panel (a): p0 = 0.25; panel (b): p0 = 0.5). For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the results based on the three different priors for p. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.

Figure 4.23: Convergence of the posterior standard deviation for β1 with different true values of and priors on p (panel (a): p0 = 0.25; panel (b): p0 = 0.5). For each sample size, the five boxplots (based on 20 simulations) from left to right are the estimates using the misclassified response (naive), the 'unobserved' response (true), and the results based on the three different priors for p. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.
Figure 4.24: Convergence of the posterior standard deviation for p with different true values of and priors on p (panel (a): p0 = 0.25; panel (b): p0 = 0.5). For each sample size, the three boxplots (based on 20 simulations) from left to right are based on the three different priors, in the order given in the legend. When n = 40, we exclude the results from the model using a beta prior with parameters 2 and 6, due to convergence problems.

For Poisson response variables, we are able to fully identify the model, as identifying a Poisson distribution requires only one parameter. For gamma and binary response variables, however, the models are weakly identified and non-identified, respectively. In addition, we study the model in which the misclassified variable is used as the response and the covariates are correctly measured. Due to the binary nature of the response, this model is weakly identified: it seems very difficult to learn the true parameters, as they are highly correlated, despite a large sample size. Through extensive computational and simulation studies, we confirm that our Bayesian models improve the accuracy of the estimates with vague priors, by acknowledging the direction and possibility of the misclassification caused by misrepresentation.

Chapter 5

Mismeasurements in Traffic Volumes for Accident Prediction

5.1 Introduction

In the context of road safety evaluation, the safety performance of a location such as a road segment is evaluated by the number of accidents or collisions in a certain period of time (e.g., a year). The evaluation of road safety measures (e.g., installing new cameras) is based on the predicted reductions in roadside collisions, after adjusting for the impact of other factors such as the traffic volumes during the study period.
The number of accidents at a site, the accident frequency Y, is often assumed to follow a distribution such as the Poisson or negative binomial (see e.g., [8, 10]). For example, we can assume that Y ∼ Poisson(Λ), where Poisson(Λ) denotes a Poisson distribution with mean Λ. Given an observed sample of accident frequencies (Y1, Y2, ..., Yn) from n different locations, we can use a generalized linear model (GLM) to model the accident frequency. The average annual daily traffic (AADT) at the road segment is a major factor affecting the number of accidents at the site. For n independent sites, a model frequently used for collision counts (see, for example, [26]) is given by

\[
\Lambda_i = a_0 V_i^{a_1} \quad (i = 1, 2, \ldots, n), \tag{5.1}
\]

where V1, V2, ..., Vn denote the AADT of the n road segments, and a0 and a1 are the parameters of the regression model.

5.1.1 Mismeasurements in traffic volumes

The measurements of AADT at time t (t = 1, 2, ..., T) for the n locations, (V*_{t1}, V*_{t2}, ..., V*_{tn}), are usually obtained from the engineering department of each city. The literature recommends that the data be collected for at least 48 hours in order to obtain reliable information. In practice, however, owing to the large number of road segments, the observation may only last a couple of hours at some locations. Hence, the values are estimates of AADT that are subject to measurement error. In order to model the mismeasurements, we may assume that the observed AADT (V*_{t1}, V*_{t2}, ..., V*_{tn}) have distributions centered around the true values (V1, V2, ..., Vn). We may further assume that the conditional standard deviation is proportional to the conditional mean. That is,

\[
E(V_{ti}^* \mid V_i) = V_i, \qquad \mathrm{SD}(V_{ti}^* \mid V_i) = \eta V_i \quad (i = 1, 2, \ldots, n; \; t = 1, 2, \ldots, T). \tag{5.2}
\]
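One concrete specification satisfying (5.2), used for the gamma models later in the chapter, takes \(V_{ti}^* \mid V_i\) as gamma with shape \(\eta^{-2}\) and scale \(\eta^2 V_i\), so that the conditional mean is \(V_i\) and the conditional standard deviation is \(\eta V_i\). A small Python sketch (the shape and rate used for the true AADT are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

n, T, eta = 5, 3, 0.3
v_true = rng.gamma(6.5, 1 / 0.0002, size=n)    # illustrative true AADT values

# gamma(shape = eta^-2, scale = eta^2 * v) has mean v and SD eta * v,
# which are exactly the two moment conditions in Equation (5.2)
v_obs = rng.gamma(eta**-2, eta**2 * v_true, size=(T, n))

print(v_obs.mean(axis=0))                      # columns fluctuate around v_true
print(v_true)
```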
Based on the relationships given in Equation (5.2), we may specify the measurement model using distributions with positive support, such as the gamma, exponential and lognormal distributions. The actual unobserved AADT (V1, V2, ..., Vn) (i.e., the exposure model) may also take similar distributional forms. In a perfect world, we might be certain that η has some known value (for example, based on the assumption of a Poisson process). In this case, the statistical model for the AADT measurements would be identified. In reality, however, traffic volumes exhibit patterns over the hours of the day: during rush hours on working days there may be many more vehicles, while at night there may be hardly any traffic. There is also seasonality, with more travel in summer and little traffic in winter and during bad weather. In practice, the observation period is not randomly selected, as counting is usually done on days when the weather is good. Hence, we can only obtain a guess of the value of η based on the length of observation used for estimating the AADT. For example, we may have prior knowledge that the observed values will be within a certain percentage of the true value, say 30%. Due to the possible nonidentifiability caused by mismeasurements in AADT when T = 1 (e.g., for normal models [16]), it is of interest to study the identifiability of the model under different distributional assumptions for the measurement and exposure models.

5.2 Lognormal measurement error model

In order to obtain analytical results, we may use the models from Subsection 6.3.3 of [16]. That is, we assume that all three components (the measurement, outcome and exposure models) take a lognormal form. Here, we use the original notation of [16] for consistency.
After taking log transformations of the traffic volumes and accident counts, the models can be written as

\[
\begin{aligned}
\log(V_{ti}^*) \mid \log(V_i), \log(Y_i) &\sim N(\log(V_i), \gamma \varepsilon^2), \\
\log(Y_i) \mid \log(V_i) &\sim N(\beta_0 + \beta_1 \log(V_i), \sigma^2), \\
\log(V_i) &\sim N(\mu, \varepsilon^2).
\end{aligned} \tag{5.3}
\]

Road sites are usually selected for an intervention (i.e., a road improvement program) if they have high collision frequencies. Here we assume that the road sites are selected according to some mechanism, so that the AADT of the selected sites follow a certain distribution (e.g., a lognormal distribution). When we have multiple measurements \(V_{ti}^*\) at the same location (i.e., \(T > 1\)), the measurement model in Equation (5.3) will be identified. That is, we will be able to separate the variance due to measurement error from that of the distribution of the underlying AADT (i.e., the parameter \(\gamma\) can be identified). Otherwise, we may not be able to separate out the variation due to measurement error. Based on results from [16], when \(T = 1\) we have

\[
\log(Y_i) \mid \log(V_{ti}^*) \sim N(\tilde{\beta}_0 + \tilde{\beta}_1 \log(V_{ti}^*), \tilde{\sigma}^2), \qquad
\log(V_{ti}^*) \sim N(\tilde{\mu}, \tilde{\varepsilon}^2), \tag{5.4}
\]

where the identified parts of the parameters are

\[
\tilde{\beta}_0 = \beta_0 + \frac{\mu \beta_1 \gamma}{1 + \gamma}, \quad \tilde{\beta}_1 = \frac{\beta_1}{1 + \gamma}, \quad \tilde{\mu} = \mu, \quad \tilde{\sigma}^2 = \sigma^2 + \frac{\beta_1^2 \varepsilon^2 \gamma}{1 + \gamma}, \quad \tilde{\varepsilon}^2 = \varepsilon^2 (1 + \gamma). \tag{5.5}
\]

Under the road safety context, we are interested in predicting accident counts in order to estimate the savings from accident reduction. The unidentified parameter \(\gamma\) is simply a nuisance parameter. Based on Equation (5.4), we may directly perform a regression analysis of the accident counts on the observed AADT \(V_{ti}^*\). All the parameters in Equation (5.4) are identified, so we can perform statistical inference on the model parameters and make predictions of the accident counts without complication.

5.3 Gamma measurement error models

Under realistic conditions, the lognormal assumptions of the previous section may not hold.
We may use distributions that take only positive values, such as the gamma distribution, to model the conditional distribution of \((V^* \mid V)\). For example, we can assume that the observed values follow a gamma distribution whose mean is the true AADT and whose standard deviation is proportional to the true value.

5.3.1 Poisson response model

Model specification

Here we specify the measurement and exposure models based on gamma distributions, with the outcome model taking the form of Equation (5.1). The models can be written as

\[
\begin{aligned}
(V_{ti}^* \mid V_i, Y_i) &\sim \mathrm{gamma}(\eta^{-2}, \eta^{-2} V_i^{-1}), & &(5.6) \\
(Y_i \mid V_i) &\sim \mathrm{Poisson}(\exp[\beta_0 + \beta_1 \log(V_i)]) \quad (i = 1, 2, \ldots, n), & &(5.7) \\
V_i &\sim \mathrm{gamma}(a, b), & &(5.8)
\end{aligned}
\]

where \(a\) and \(b\) refer to the shape and rate parameters of the gamma distribution. Here we assume that road sites are selected for an intervention (i.e., a road improvement program) in such a way that their AADT follow a gamma distribution.

Based on the results for the lognormal model in the previous section, we may make the following conjecture: when we have multiple measurements \(V_{ti}^*\) at the same location (i.e., \(T > 1\)), the measurement model in Equation (5.6) may be identified; otherwise, we may not be able to separate out the variation due to measurement error. Hence, we will attempt to verify this conjecture through theoretical study of the posterior distribution. Based on the model assumptions, we have

\[
\begin{aligned}
f(V_{ti}^* \mid v_i, Y_i) &= \frac{(\eta^{-2} v_i^{-1})^{\eta^{-2}}}{\Gamma(\eta^{-2})} \, (V_{ti}^*)^{\eta^{-2} - 1} \, e^{-\eta^{-2} V_{ti}^* / v_i}, \\
\Pr(Y_i \mid v_i) &= \frac{[\exp(\beta_0 + \beta_1 \log(v_i))]^{Y_i} \exp(-e^{\beta_0 + \beta_1 \log(v_i)})}{Y_i!}, \\
f(v_i) &= \frac{b^a}{\Gamma(a)} v_i^{a-1} e^{-b v_i}.
\end{aligned} \tag{5.9}
\]

Hence, we may obtain the conditional distribution of \((Y_i \mid V_{1i}^*, V_{2i}^*, \ldots, V_{Ti}^*)\) as

\[
\begin{aligned}
\Pr(Y_i \mid V_{1i}^*, \ldots, V_{Ti}^*) &\propto \int f(Y_i, v_i, V_{1i}^*, \ldots, V_{Ti}^*) \, dv_i
\propto \int \Pr(Y_i \mid v_i) \, f(v_i) \prod_{t=1}^T f(V_{ti}^* \mid v_i, Y_i) \, dv_i \\
&\propto \frac{\exp(\beta_0 Y_i) \prod_{t=1}^T (V_{ti}^*)^{\eta^{-2} - 1}}{Y_i!} \int v_i^{\beta_1 Y_i - T\eta^{-2} + a - 1} \exp\!\left(-e^{\beta_0} v_i^{\beta_1} - b v_i - \frac{\eta^{-2}}{v_i} \sum_{t=1}^T V_{ti}^*\right) dv_i.
\end{aligned} \tag{5.10}
\]

From Equation (5.10), we observe that the likelihood \(\Pr(Y_i \mid V_{1i}^*, \ldots, V_{Ti}^*)\) depends on the measurements only through \(\sum_{t=1}^T \log(V_{ti}^*)\) and \(\sum_{t=1}^T V_{ti}^*\). However, we may not be able to obtain an analytical form for the conditional distribution. Using Equation (5.10), we can evaluate the conditional distribution numerically (or sample from it).

Model identification

We now try to obtain an understanding of model identification through method of moments estimators, by deriving the mean and covariance matrix of \((V^*, Y)\). From the laws of total expectation, variance and covariance, it is straightforward to show that

\[
\begin{aligned}
E(V^*) &= E[E(V^* \mid V)] = \frac{a}{b} & &(5.11) \\
\mathrm{Var}(V^*) &= E[\mathrm{Var}(V^* \mid V)] + \mathrm{Var}[E(V^* \mid V)] = E(\eta^2 V^2) + \mathrm{Var}(V) = \frac{a(a\eta^2 + \eta^2 + 1)}{b^2} & &(5.12) \\
E(Y) &= E[E(Y \mid V)] = \exp(\beta_0) E(V^{\beta_1}) = \frac{\exp(\beta_0) \Gamma(a + \beta_1)}{\Gamma(a) b^{\beta_1}}, \quad a + \beta_1 > 0 & &(5.13) \\
\mathrm{Var}(Y) &= E[\mathrm{Var}(Y \mid V)] + \mathrm{Var}[E(Y \mid V)] = \exp(\beta_0) E(V^{\beta_1}) + \exp^2(\beta_0) \mathrm{Var}(V^{\beta_1}) \\
&= \frac{\exp(\beta_0) \Gamma(a + \beta_1)}{\Gamma(a) b^{\beta_1}} + \frac{\exp(2\beta_0)}{b^{2\beta_1}} \left[\frac{\Gamma(a + 2\beta_1)}{\Gamma(a)} - \frac{\Gamma^2(a + \beta_1)}{\Gamma^2(a)}\right], \quad a + 2\beta_1 > 0 & &(5.14) \\
\mathrm{Cov}(V^*, Y) &= E(V^* Y) - E(V^*) E(Y) = E[E(V^* \mid V) E(Y \mid V)] - E(V^*) E(Y) \\
&= \exp(\beta_0) E(V^{\beta_1 + 1}) - \frac{a}{b} \cdot \frac{\exp(\beta_0) \Gamma(a + \beta_1)}{\Gamma(a) b^{\beta_1}}
= \frac{\exp(\beta_0) [\Gamma(a + \beta_1 + 1) - a \Gamma(a + \beta_1)]}{\Gamma(a) b^{\beta_1 + 1}} \\
&= \frac{\exp(\beta_0) \beta_1 \Gamma(a + \beta_1)}{\Gamma(a) b^{\beta_1 + 1}}, \quad a + \beta_1 > 0. & &(5.15)
\end{aligned}
\]

Note that when \(a + \beta_1 \le 0\), the integral \(E(V^{\beta_1})\) does not converge, although Euler's gamma function is defined for negative non-integers. Similarly, when \(a + 2\beta_1 \le 0\), the integral \(E(V^{2\beta_1})\) does not converge. Hence, we can obtain all the elements of the mean and covariance matrix of \((V^*, Y)\) only when \(\beta_1 > -a/2\).

When \(\beta_1 > -a/2\), it is straightforward to obtain

\[
\frac{\mathrm{Cov}(V^*, Y)}{E(Y)} = \frac{\beta_1}{b} \tag{5.16}
\]
\[
\frac{\mathrm{Var}(Y) - E(Y) + E^2(Y)}{E^2(Y)} = \frac{\Gamma(a + 2\beta_1) \Gamma(a)}{\Gamma^2(a + \beta_1)}. \tag{5.17}
\]

Plugging Equations (5.11) and (5.16) into Equation (5.17), we have

\[
\frac{\mathrm{Var}(Y) - E(Y) + E^2(Y)}{E^2(Y)} =
\frac{\Gamma\!\left(\dfrac{\beta_1 E(V^*) E(Y)}{\mathrm{Cov}(V^*, Y)} + 2\beta_1\right) \Gamma\!\left(\dfrac{\beta_1 E(V^*) E(Y)}{\mathrm{Cov}(V^*, Y)}\right)}
{\Gamma^2\!\left(\dfrac{\beta_1 E(V^*) E(Y)}{\mathrm{Cov}(V^*, Y)} + \beta_1\right)}. \tag{5.18}
\]

Using Equation (5.18), we can solve for \(\beta_1\). Note that the sign of \(\beta_1\) can be determined from that of the covariance \(\mathrm{Cov}(V^*, Y)\); when \(\mathrm{Cov}(V^*, Y) > 0\), we search for the solution of \(\beta_1\) on the positive domain. It is then straightforward to obtain \(b\) using Equation (5.16), \(a\) using Equation (5.11), \(\eta\) using Equation (5.12), and \(\beta_0\) using Equation (5.13). However, Equation (5.18) involves gamma functions that may make the situation more complicated. We first obtain \(E(Y)\), \(\mathrm{Var}(Y)\) and \(\mathrm{Cov}(V^*, Y)\) using Equations (5.13), (5.14) and (5.15). Using the right-hand side of Equation (5.18), we can then plot the relationship to check whether there are multiple solutions. In Figure 5.1, we plot Equation (5.18) for cases with different values of \(\beta_1\), using different sets of true values for the parameters. For all the panels we use \(a = 6.5\), \(b = 0.0002\) and \(\eta = 0.1\). The true values of \((\beta_0, \beta_1)\) for panels (a)-(d) are (5.5, 0.25), (33.5, \(-3\)), (\(-0.25\), 0.25) and (\(-50\), 5) respectively. We observe that in all cases the equation is monotone in the neighborhood of the solution, and there exists a unique solution for \(\beta_1\). This suggests that the model is identified.

Figure 5.1: Solution of Equation (5.18). For all the cases, there exists a unique solution for \(\beta_1\).
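The method-of-moments route just described can be carried out numerically. The sketch below takes one of the parameter settings used for Figure 5.1 (a = 6.5, b = 0.0002, η = 0.1, (β0, β1) = (−0.25, 0.25)), computes the population moments from (5.11)-(5.15), and recovers all five parameters; the root-finding bracket is an illustrative choice for the positive domain:

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import brentq

a, b, eta, beta0, beta1 = 6.5, 0.0002, 0.1, -0.25, 0.25   # true values

# population moments, Equations (5.11)-(5.15)
EV = a / b
VarV = a * (a * eta**2 + eta**2 + 1) / b**2
EY = np.exp(beta0 + gammaln(a + beta1) - gammaln(a) - beta1 * np.log(b))
VarY = EY + np.exp(2 * beta0 - 2 * beta1 * np.log(b)) * (
    np.exp(gammaln(a + 2 * beta1) - gammaln(a))
    - np.exp(2 * (gammaln(a + beta1) - gammaln(a))))
Cov = beta1 * np.exp(beta0 + gammaln(a + beta1) - gammaln(a)
                     - (beta1 + 1) * np.log(b))

# Equation (5.18): left-hand side, and right-hand side as a function of beta1
lhs = (VarY - EY + EY**2) / EY**2
def rhs(b1):
    c = b1 * EV * EY / Cov          # c = b * E(V*) plays the role of a
    return np.exp(gammaln(c + 2 * b1) + gammaln(c) - 2 * gammaln(c + b1))

# Cov > 0 here, so search on the positive domain
b1_hat = brentq(lambda b1: rhs(b1) - lhs, 1e-6, 5.0)
b_hat = b1_hat * EY / Cov                                       # (5.16)
a_hat = b_hat * EV                                              # (5.11)
eta_hat = np.sqrt((VarV * b_hat**2 / a_hat - 1) / (a_hat + 1))  # (5.12)
beta0_hat = (np.log(EY) - gammaln(a_hat + b1_hat)
             + gammaln(a_hat) + b1_hat * np.log(b_hat))         # (5.13)
```

The recovered values agree with the true parameters, illustrating the identification argument above for this setting.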
5.3.2 Negative binomial response model

Model specification

In reality, the accident counts \(Y_i\) may be over-dispersed; that is, the variance may be larger than the mean after conditioning on the covariates. Models based on negative binomial regression are frequently used for road safety accident counts (see e.g., [10]). The negative binomial response model can be written as

\[
\begin{aligned}
(V_{ti}^* \mid V_i, Y_i) &\sim \mathrm{gamma}(\eta^{-2}, \eta^{-2} V_i^{-1}), & &(5.19) \\
(Y_i \mid V_i) &\sim \mathrm{NB}(\exp[\beta_0 + \beta_1 \log(V_i)], \alpha) \quad (i = 1, 2, \ldots, n), & &(5.20) \\
V_i &\sim \mathrm{gamma}(a, b), & &(5.21)
\end{aligned}
\]

where \(\mathrm{NB}(\exp[\beta_0 + \beta_1 \log(V_i)], \alpha)\) denotes the negative binomial distribution with mean \(\exp[\beta_0 + \beta_1 \log(V_i)]\) and over-dispersion parameter \(\alpha\), and \(a\) and \(b\) refer to the shape and rate parameters of the gamma distribution.

Model identification

The negative binomial response model is more complicated than the Poisson model. Hence, we again try to obtain an understanding of model identification through method of moments estimators, by deriving the mean and covariance matrix of \((V^*, Y)\). From the laws of total expectation, variance and covariance, we have

\[
\begin{aligned}
E(V^*) &= E[E(V^* \mid V)] = \frac{a}{b} & &(5.22) \\
\mathrm{Var}(V^*) &= \frac{a(a\eta^2 + \eta^2 + 1)}{b^2} & &(5.23) \\
E(Y) &= \frac{\exp(\beta_0) \Gamma(a + \beta_1)}{\Gamma(a) b^{\beta_1}}, \quad a + \beta_1 > 0 & &(5.24) \\
\mathrm{Var}(Y) &= E[\mathrm{Var}(Y \mid V)] + \mathrm{Var}[E(Y \mid V)] = \exp(\beta_0) E(V^{\beta_1}) + \alpha \exp(2\beta_0) E(V^{2\beta_1}) + \exp^2(\beta_0) \mathrm{Var}(V^{\beta_1}) & &(5.25) \\
\mathrm{Cov}(V^*, Y) &= \frac{\exp(\beta_0) \beta_1 \Gamma(a + \beta_1)}{\Gamma(a) b^{\beta_1 + 1}}, \quad a + \beta_1 > 0. & &(5.26)
\end{aligned}
\]

Here we have only five equations for the six parameters \(a\), \(b\), \(\eta\), \(\alpha\), \(\beta_0\) and \(\beta_1\). Learning all the parameters requires higher-order moments. This suggests that the model will be weakly identified.

5.4 Simulation study

For the gamma measurement error models, although it may be hard to obtain the limiting posteriors, we may use Markov chain Monte Carlo (MCMC) techniques (e.g., the Metropolis-Hastings algorithm) to study the convergence of the posteriors.
5.4.1 Metropolis-Hastings algorithm

Poisson response model

For the Metropolis-Hastings algorithm, the full conditional distribution of the hyper-parameters \(a\) and \(b\) can be obtained as

\[
f(a, b \mid v_1, v_2, \ldots, v_n) \propto \pi(a) \, \pi(b) \prod_{i=1}^n f(v_i \mid a, b)
\propto \frac{b^{na}}{(\Gamma(a))^n} \left(\prod_{i=1}^n v_i\right)^{a-1} e^{-b \sum_{i=1}^n v_i}, \quad a_L \le a \le a_U, \; b_L \le b \le b_U, \tag{5.27}
\]

where we assume uniform priors for the hyper-parameters \(a\) and \(b\); that is, \(a \sim U(a_L, a_U)\) and \(b \sim U(b_L, b_U)\). For the regression coefficients \(\beta_0\) and \(\beta_1\), we can obtain the full conditional distribution as

\[
\begin{aligned}
f(\beta_0, \beta_1 \mid y_1, \ldots, y_n, v_1, \ldots, v_n) &\propto \pi(\beta_0) \, \pi(\beta_1) \Pr(y_1, \ldots, y_n \mid \beta_0, \beta_1, v_1, \ldots, v_n) \\
&\propto \prod_{i=1}^n \frac{[\exp(\beta_0 + \beta_1 \log(v_i))]^{y_i} \exp(-e^{\beta_0 + \beta_1 \log(v_i)})}{y_i!} \\
&\propto \prod_{i=1}^n \exp\!\left(\beta_0 y_i + \beta_1 y_i \log(v_i) - e^{\beta_0 + \beta_1 \log(v_i)}\right),
\end{aligned} \tag{5.28}
\]

when we assume uniform priors. Similarly, the full conditional distribution of the measurement error parameter \(\eta^2\) under a uniform prior is

\[
f(\eta^2 \mid v_1, \ldots, v_n, v_{11}^*, \ldots, v_{Tn}^*) \propto \prod_{i=1}^n \prod_{t=1}^T \frac{(\eta^{-2} v_i^{-1})^{\eta^{-2}}}{\Gamma(\eta^{-2})} (v_{ti}^*)^{\eta^{-2} - 1} e^{-\eta^{-2} v_{ti}^* / v_i}. \tag{5.29}
\]

When assuming uniform priors, the full conditional distribution of the unobserved measurements can be obtained as

\[
f(v_1, \ldots, v_n \mid \beta_0, \beta_1, a, b, \eta^2, v_{11}^*, \ldots, v_{Tn}^*, y_1, \ldots, y_n)
\propto \prod_{i=1}^n \exp\!\left(\beta_1 y_i \log(v_i) - e^{\beta_0 + \beta_1 \log(v_i)}\right) v_i^{a-1} e^{-b v_i} \prod_{t=1}^T v_i^{-\eta^{-2}} e^{-\eta^{-2} v_{ti}^* / v_i}. \tag{5.30}
\]

Using Equations (5.27) to (5.30), we may construct the Metropolis-Hastings algorithm for step \(s+1\) as follows:

1. Update \((a, b)\): sample \((a^{(s+1)}, b^{(s+1)})\) according to \(f(a, b \mid v_1^{(s)}, \ldots, v_n^{(s)})\).
2. Update \(\eta^2\): sample \((\eta^2)^{(s+1)}\) according to \(f(\eta^2 \mid v_1^{(s)}, \ldots, v_n^{(s)}, v_{11}^*, \ldots, v_{Tn}^*)\).
3. Update \((\beta_0, \beta_1)\): sample \((\beta_0^{(s+1)}, \beta_1^{(s+1)})\) according to \(f(\beta_0, \beta_1 \mid y_1, \ldots, y_n, v_1^{(s)}, \ldots, v_n^{(s)})\).
4. Update \((v_1, \ldots, v_n)\): sample \((v_1^{(s+1)}, \ldots, v_n^{(s+1)})\) according to \(f(v_1, \ldots, v_n \mid \beta_0^{(s+1)}, \beta_1^{(s+1)}, a^{(s+1)}, b^{(s+1)}, (\eta^2)^{(s+1)}, v_{11}^*, \ldots, v_{Tn}^*, y_1, \ldots, y_n)\). \(\tag{5.31}\)

For the gamma distribution parameters \((a, b)\), we use a bivariate normal proposal distribution, matching its mean and variance with the method of moments estimates. For a negative proposed value, we use its absolute value. Due to the high positive correlation between the two parameters, we use a correlation coefficient of 0.75 for the covariance matrix of the proposal distribution. The Metropolis-Hastings algorithm then gives an acceptance ratio of around 30% to 45%. For the variance parameter \(\eta^2\) of the measurement error model, we use a uniform proposal distribution centered at the current value, again taking the absolute value of negative proposals. The range of the proposal distribution is tuned to obtain an acceptance ratio of around 50%. In order to achieve a desirable acceptance ratio for the regression coefficients \((\beta_0, \beta_1)\), we use a bivariate normal proposal distribution matching the mean and covariance of the maximum likelihood estimates (MLE) from a generalized linear model fit using the response and the average measurements of the AADT. For the Metropolis-Hastings algorithm with a sample size of 10, this proposal distribution results in an acceptance ratio ranging from 20% to 40%. Looking at the covariance of the MLE, the correlation between these two parameters is around \(-1\).
However, the acceptance ratio starts to drop as we increase the sample size, and changing the scale of the variance and/or covariance of the proposal distribution does not seem to help. For the unobserved AADT \((V_1, \ldots, V_n)\), we use a uniform proposal distribution centered at the current value, updating each of the \(n\) variables separately and taking the absolute value of negative proposals. The ranges of the uniform distributions are chosen to achieve an acceptance ratio of around 30% to 45%.

Negative binomial response model

For the regression coefficients \(\beta_0\) and \(\beta_1\), the full conditional distribution under the negative binomial model can be written as

\[
\begin{aligned}
f(\beta_0, \beta_1 \mid y_1, \ldots, y_n, v_1, \ldots, v_n, \alpha) &\propto \pi(\beta_0) \, \pi(\beta_1) \Pr(y_1, \ldots, y_n \mid \beta_0, \beta_1, \alpha, v_1, \ldots, v_n) \\
&\propto \prod_{i=1}^n \left(\frac{\alpha^{-1}}{\alpha^{-1} + \exp(\beta_0 + \beta_1 \log(v_i))}\right)^{\alpha^{-1}} \left(\frac{\exp(\beta_0 + \beta_1 \log(v_i))}{\alpha^{-1} + \exp(\beta_0 + \beta_1 \log(v_i))}\right)^{y_i} \\
&\propto \prod_{i=1}^n \frac{\exp(\beta_0 y_i + \beta_1 y_i \log(v_i))}{[\alpha^{-1} + \exp(\beta_0 + \beta_1 \log(v_i))]^{\alpha^{-1} + y_i}},
\end{aligned} \tag{5.32}
\]

when we assume uniform priors. Similarly, we can obtain the full conditional distribution of the over-dispersion parameter as

\[
f(\alpha \mid y_1, \ldots, y_n, v_1, \ldots, v_n, \beta_0, \beta_1) \propto \pi(\alpha) \prod_{i=1}^n \frac{\Gamma(y_i + \alpha^{-1}) \, \alpha^{-\alpha^{-1}} \exp(\beta_0 y_i + \beta_1 y_i \log(v_i))}{\Gamma(\alpha^{-1}) \, [\alpha^{-1} + \exp(\beta_0 + \beta_1 \log(v_i))]^{\alpha^{-1} + y_i}}, \tag{5.33}
\]

when we assume uniform priors. When assuming uniform priors, the full conditional distribution of the unobserved measurements can be obtained as

\[
f(v_1, \ldots, v_n \mid \beta_0, \beta_1, \alpha, a, b, \eta^2, v_{11}^*, \ldots, v_{Tn}^*, y_1, \ldots, y_n)
\propto \prod_{i=1}^n \frac{\exp(\beta_1 y_i \log(v_i) - b v_i) \, v_i^{a-1} \prod_{t=1}^T v_i^{-\eta^{-2}} e^{-\eta^{-2} v_{ti}^* / v_i}}{[\alpha^{-1} + \exp(\beta_0 + \beta_1 \log(v_i))]^{\alpha^{-1} + y_i}}. \tag{5.34}
\]

Using Equations (5.27), (5.29), (5.32), (5.33) and (5.34), we may construct the Metropolis-Hastings algorithm for step \(s+1\) as follows:

1. Update \((a, b)\): sample \((a^{(s+1)}, b^{(s+1)})\) according to \(f(a, b \mid v_1^{(s)}, \ldots, v_n^{(s)})\).
2. Update \(\eta^2\): sample \((\eta^2)^{(s+1)}\) according to \(f(\eta^2 \mid v_1^{(s)}, \ldots, v_n^{(s)}, v_{11}^*, \ldots, v_{Tn}^*)\).
3. Update \((\beta_0, \beta_1)\): sample \((\beta_0^{(s+1)}, \beta_1^{(s+1)})\) according to \(f(\beta_0, \beta_1 \mid y_1, \ldots, y_n, v_1^{(s)}, \ldots, v_n^{(s)}, \alpha^{(s)})\).
4. Update \(\alpha\): sample \(\alpha^{(s+1)}\) according to \(f(\alpha \mid y_1, \ldots, y_n, v_1^{(s)}, \ldots, v_n^{(s)}, \beta_0^{(s+1)}, \beta_1^{(s+1)})\).
5. Update \((v_1, \ldots, v_n)\): sample \((v_1^{(s+1)}, \ldots, v_n^{(s+1)})\) according to \(f(v_1, \ldots, v_n \mid \beta_0^{(s+1)}, \beta_1^{(s+1)}, \alpha^{(s+1)}, a^{(s+1)}, b^{(s+1)}, (\eta^2)^{(s+1)}, v_{11}^*, \ldots, v_{Tn}^*, y_1, \ldots, y_n)\). \(\tag{5.35}\)

For the gamma distribution parameters \((a, b)\), we use a bivariate normal proposal distribution, matching its mean and variance with the method of moments estimates and taking the absolute value of negative proposals. Due to the high positive correlation between the two parameters, we use a correlation coefficient of 0.75 for the covariance matrix of the proposal distribution, which gives an acceptance ratio of around 30% to 45%. For the variance parameter \(\eta^2\) of the measurement error model, we use a uniform proposal distribution centered at the current value, taking the absolute value of negative proposals; its range is tuned to obtain an acceptance ratio of around 50%.
In order to achieve a desirable acceptance ratio for the regression coefficients (β0, β1), we use a bivariate normal proposal distribution matching the mean and covariance of the maximum likelihood estimates (MLEs) from a generalized linear model fit using the response and the averaged AADT measurements. For the Metropolis-Hastings algorithm with a sample size of 10, this proposal distribution results in an acceptance ratio ranging from 20% to 40%. From the MLE covariance, the correlation between these two parameters is around −1. However, the acceptance ratio starts to drop as we increase the sample size, and rescaling the variance and/or covariance of the proposal distribution does not seem to help. For the unobserved AADT (V_1, V_2, ..., V_n), we use a uniform proposal distribution centered at the current value, updating each of the n variables separately and taking the absolute value of negative proposals. The ranges of the uniform distributions are chosen to achieve an acceptance ratio of around 30% to 45%.

5.4.2 Simulation study

We perform a simulation study in order to evaluate the performance of the model under finite sample scenarios. For data generation, we use the true parameter values a_0 = 6.5 and b_0 = 0.0002 for generating the unobserved AADT. Two values, 0.03² and 0.3², are used for η_0² to represent the two extreme cases of little and very large measurement error in the traffic volumes. In the Poisson regression, we use (−10, 1.3) for the intercept and slope (β0, β1). Because the number of variables (i.e., V_1, V_2, ..., V_n) drawn by MCMC increases almost linearly with the sample size, the Metropolis-Hastings algorithm we wrote in R is extremely slow (taking several months for larger sample sizes). For the purposes of this thesis, we therefore use only the two sample sizes n = 10 and n = 40.
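The data-generating setup just described can be sketched as follows. This is our reading of the generative model (true AADT from a gamma distribution, gamma-distributed measurements with mean v_i and squared coefficient of variation η², and Poisson counts with log-mean β0 + β1 log v_i); the function name and parameterization details are assumptions for illustration, and the thesis code itself was written in R.

```python
import numpy as np


def simulate_sites(n, T, a0=6.5, b0=2e-4, eta2=0.03**2,
                   beta0=-10.0, beta1=1.3, seed=1):
    """Generate one simulated dataset: true AADT v_i ~ Gamma(shape a0,
    rate b0); T noisy AADT measurements per site from a gamma
    measurement-error model with mean v_i and squared coefficient of
    variation eta2; Poisson accident counts with log-mean
    beta0 + beta1 * log(v_i)."""
    rng = np.random.default_rng(seed)
    v = rng.gamma(a0, 1.0 / b0, size=n)                # true AADT
    shape = 1.0 / eta2                                 # CV^2 of v* is eta2
    v_star = rng.gamma(shape, v / shape, size=(T, n))  # measurements v*_{ti}
    y = rng.poisson(np.exp(beta0 + beta1 * np.log(v))) # accident counts
    return v, v_star, y
```

With these true values the mean AADT is a_0/b_0 = 32,500, and the implied Poisson means fall in a realistic range for site-level accident counts.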
We investigate the two situations where T = 1 and T = 2 measurements of AADT are available for each site.

Uniform prior distributions are used for all the parameters, except for the regression coefficients of the naive models that use the observed AADT as the covariate. Normal prior distributions are assumed for the regression coefficients of the Poisson regression. The prior means and standard deviations used for (β0, β1) are (−9, 3.2) and (4.6, 1.7) respectively for all the models. For the gamma parameters a and b, uniform priors on (0.002, 40) and (0.000002, 0.0015) are used in order to achieve convergence. For η², a uniform prior on (0.015², 0.2²) is used. The algorithm does not seem to work well in WinBUGS: there is a numerical problem (i.e., splitting problems in the nodes of V_1, V_2, ..., V_n) that causes the algorithm to crash.

In the cases with η_0² = 0.03², a sample size of n = 10 and T = 1, 2, the MCMC algorithms seem to converge very quickly for all the parameters, but there is high autocorrelation in the MCMC samples for all of the variables. After thinning by taking every 40th sample, the autocorrelation is much smaller, and the effective sample sizes of the 1000 retained samples range from 600 to 1000, with (a, b) and (β0, β1) having the lowest effective sample sizes when T = 1. When we use the true value η_0² = 0.3² (i.e., high variability in the traffic volume measurements), the Metropolis-Hastings algorithm converges very slowly for n = 40, particularly when T = 1; we have to thin by factors of 320 and 160 for T = 1 and T = 2 respectively, in order to achieve a reasonable effective sample size. For both n = 10 and n = 40, we generated 100 datasets and use the same generated samples for all the methods. In particular, the adjusted model refers to our gamma model accounting for mismeasurements in AADT, while the naive model is the Poisson regression using the observed AADT as the covariate.
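The thinning and effective sample size diagnostics used above can be computed with a short routine such as the following. This is a crude sketch using the standard truncated-autocorrelation estimator n / (1 + 2 Σ ρ_k); in practice, packages such as coda in R provide more careful estimators.

```python
import numpy as np


def effective_sample_size(chain):
    """Crude effective sample size n / (1 + 2 * sum(rho_k)), summing the
    sample autocorrelations until the first nonpositive one (a common
    truncation rule)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:]
    acf = acf / (np.arange(n, 0, -1) * x.var())  # acf[0] == 1
    s = 0.0
    for rho in acf[1:]:
        if rho <= 0:
            break
        s += rho
    return n / (1.0 + 2.0 * s)


def thin(chain, k):
    """Keep every k-th draw, e.g. k = 40 as used above."""
    return np.asarray(chain)[::k]
```

For a nearly independent chain this returns a value close to the chain length, while for a strongly autocorrelated chain (as seen here before thinning) it is far smaller, which is what motivates the thinning factors of 40, 160 and 320 reported above.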
The true values are the parameter values used for data generation. Using the MCMC samples after thinning, the distributions of the posterior means and standard deviations for the different parameters are presented in Figures 5.2 to 5.17. We observe that when the sample size n is small and the measurement error parameter η² is small, the naive methods seem to work better than our model

[Figure 5.2: Distribution of posterior mean for β0 with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.3: Distribution of posterior mean for β0 with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.4: Distribution of posterior mean for β1 with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.5: Distribution of posterior mean for β1 with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.6: Distribution of posterior mean for η² with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.7: Distribution of posterior mean for η² with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.8: Distribution of posterior mean for predicted accident counts Λ with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.9: Distribution of posterior mean for predicted accident counts Λ with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.10: Distribution of posterior standard deviation for β0 with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.11: Distribution of posterior standard deviation for β0 with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.12: Distribution of posterior standard deviation for β1 with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.13: Distribution of posterior standard deviation for β1 with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.14: Distribution of posterior standard deviation for η² with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.15: Distribution of posterior standard deviation for η² with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.16: Distribution of posterior standard deviation for predicted accident counts Λ with a sample size of n = 10, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]
[Figure 5.17: Distribution of posterior standard deviation for predicted accident counts Λ with a sample size of n = 40, based on simulations on 100 datasets. Panels: (a) η = 0.03; (b) η = 0.3.]

acknowledging the mismeasurements. When η² is large, our proposed model starts to outperform the naive method in terms of the attenuation effect. When measurement errors are large, it also seems easier for our model to learn the true value of η², in terms of the accuracy of the posterior mean; however, the posterior standard deviation for η² does not seem to decrease when there are larger measurement errors. For the accident counts, ignoring the measurement errors in AADT does not seem to have an impact on the predicted value Λ.

Within the scope of this thesis, we were not able to obtain simulation results for sample sizes larger than 40, whereas sample sizes of several hundred are common in road safety studies. In the future, we will try to rewrite the algorithms in more efficient computer languages such as C and FORTRAN, so that we can use larger sample sizes to confirm the results from the theoretical study in the previous subsection.

5.5 Conclusions

In this chapter, we propose to model the mismeasurements in the traffic volume counts using Bayesian analysis.
Two measurement error models, based on the lognormal and gamma distributions, are proposed. For the lognormal model, we obtain theoretical results showing that mismeasurement in AADT does not affect the prediction of accident counts, despite the attenuation effect on the regression coefficients and the identification problem when there is only one measurement for each site. For the gamma measurement error model, we demonstrate through method-of-moments estimators that we may be able to identify all the parameters in the model. We propose Metropolis-Hastings algorithms for fitting the gamma measurement error models with both Poisson and negative binomial responses. Through a simulation study, we have shown that for the Poisson response model there seems to be no effect of mismeasurement of AADT on accident prediction. The model acknowledging the measurement errors is expected to perform better in terms of parameter estimation when there are moderate or large measurement errors and when the sample size is large. In the future, we plan to perform more intensive simulation studies with larger sample sizes (e.g., by developing more efficient MCMC schemes), in order to confirm the results suggested by the theoretical study.

Chapter 6

Concluding Remarks

6.1 Conclusions

In this thesis, we have proposed Bayesian methodologies applicable to four real-life problems in the health and insurance areas, where data limitations may cause model identification issues in statistical modeling. Through extensive theoretical and simulation studies, we have shown that, for both nonidentified and weakly identified models, acknowledging the data issues through Bayesian modeling helps improve the accuracy of model estimation. Particularly in cases where expert knowledge is available on the direction or magnitude of the data issues, we are able to obtain partial identification for nonidentified models.
That is, modeling the data issues through our proposed methodologies provides statistical inference that is more appropriate for these situations. Hence, the methods we propose are reasonable practical solutions for the relevant health and insurance problems, alleviating the identification problems caused by data limitations. When prior information is available on the hidden sub-population or never-respondents (e.g., through other studies), we may be able to provide more accurate population estimates and regression analyses for various types of variables, such as disease prevalence, medical expenditures and utilization of health care. For many large-scale surveys such as the Medical Expenditure Panel Survey (MEPS), it is usually feasible to obtain an estimate of the percentage of never-respondents by asking a simple question on why they choose not to participate in the study. Even when there are no data available on the hidden proportion, our Bayesian methodologies can be used to perform a sensitivity analysis on the impact of the never-respondents (or hidden sub-population). The convergence theory and simulation results ensure that the estimates behave comparably to those of identified models.

For insurance companies, we have provided Bayesian misrepresentation modeling tools that may help them identify the proportion of misrepresentation through rate-making models. We have confirmed that the identification of the models depends on the type of the response variable: a Poisson response model is fully identified, the gamma model is weakly identified, and a binary response model is nonidentified. For all the models, we have confirmed that accounting for the misrepresentation in modeling allows more accurate estimation of the regression coefficients (i.e., insurance premium factors). This ensures that insurance companies can obtain more accurate premiums.
For the gamma and Poisson models, insurance companies will be able to identify the proportion of misrepresentation, so they can estimate the potential cost of misrepresentation and take risk control measures if needed. This solution will be much more cost-effective than investigations of individual insurance policies.

For road safety accident prediction, we have proposed Bayesian models that account for the measurement errors in the traffic volume counts. Through theoretical studies of the posterior distribution as well as method-of-moments estimators, we have confirmed that the lognormal measurement error model is nonidentified, while the gamma measurement error models with Poisson and negative binomial responses are identified and weakly identified, respectively. For both the lognormal and gamma measurement error models, although there is an attenuation effect on the estimated regression coefficients, there seems to be no impact on the predicted accident counts. That is, despite the measurement error in the covariate, the safety evaluation will not be affected when we ignore the measurement error in modeling. Our research thus lends support to the common industry practice of ignoring measurement errors in traffic volumes in road safety evaluations.

Overall, our research provides practical models as well as theoretical confidence in using Bayesian inference to account for data issues in four real-life problems that have not previously been addressed in the statistical literature.

6.2 Future work

In this section, we provide some directions for potential future work on Bayesian inference for partially identified models.

6.2.1 Thesis-related future work

In Section 4.5, where we model the misrepresentation as the response, we can theoretically show that the model is weakly identified.
However, for different sample sizes, the estimated proportion of misrepresentation (i.e., misclassification) consistently converges to zero, with both the posterior mean and standard deviation converging in the manner of identified models. This may be due to the high posterior correlations among the parameters, as shown by the Fisher information matrix. In the future, it would be interesting to conduct more extensive investigations, both theoretically and by simulation, in order to study the mechanism behind the finite-sample behavior of this specific model when the misclassified variable is used as a response.

In addition to the binary questions asked in insurance underwriting (e.g., on smoking status, or on whether an insured person has a specific disease), there may be other types of questions for which the applicant may make a false statement, such as the number of kilometers a car is driven in auto insurance. Such a question may be formed as an ordinal (interval-based) or continuous response, and we may extend our models to these types of variables. For such variables the situation is more complicated, requiring more assumptions on the misrepresentation (e.g., the probabilities of jumping to different categories in the case of ordinal data, and the length of the jump in the case of continuous data). When the model involves more parameters, the conclusions concerning model identification may differ from those in this thesis. Hence, it would be interesting to study these models both theoretically and by simulation.

For the road safety models, due to the very slow speed of the Metropolis-Hastings algorithm written in R, we were not able to obtain simulation results for the Poisson model with large sample sizes. Hence, we were not able to fully show the advantage of our Bayesian model acknowledging the mismeasurements as the sample size increases.
In the future, we plan to develop more efficient MCMC schemes and to rewrite the algorithms in more efficient computer languages such as FORTRAN and C. Beyond the theoretical findings showing the advantages of our proposed models, this would give us more practical confidence in the convergence of the posterior distributions of the parameters and of the model predictions for accident counts. In addition to the Poisson model, we may investigate model identification for other models such as the negative binomial model [10]. That model is shown to be weakly identified, as it requires one more parameter to allow over-dispersion; MCMC simulations can be undertaken for it in order to confirm the theoretical findings.

6.2.2 Other partially identified models

As mentioned by [20], the development of Bayesian inference for partially identified models has lagged behind that of non-Bayesian methods. There are many data issues arising in practice in the health, social and financial fields that may lead to nonidentifiability of a statistical model. At the same time, there may exist other assumptions and information that can help to obtain partial identification. Bayesian methods can be applied to these problems for evaluating the usefulness of the information and studying the feasibility of statistical learning.

Some examples where corresponding Bayesian models can be developed include the treatment-response models in [34]. Gustafson [20] mentioned the possibility of future development in this area, but no work seems to have been done in the literature. In Chapters 7 to 10 of [34], the author studies the interesting problem of missing outcomes in certain treatment-response models. Treatment-response analysis tries to predict a patient's disease status or life expectancy under different choices of disease treatment for a population of patients.
In reality, however, only one treatment can be used for each patient, so it is impossible to observe the counterfactual outcomes had other treatment options been used. This non-observability of counterfactual outcomes leads to model nonidentifiability. Chapter 7 of [34] shows that point or partial identification can be achieved under various assumptions concerning treatment selection rules and missing data mechanisms. In Chapters 8 and 9, the author studies the identifying power of certain monotonicity assumptions, for instance on the treatment-response relationship, which may lead to point or partial identification of the outcome distribution. Chapter 10 relaxes the assumption of uniform treatment conditional on covariates, and studies the identification of the outcome when treatment varies among persons with the same value of the covariates. Identification regions are given for the cases where partial identification is obtained. Using Bayesian posterior analysis, such as the transparent parameterization proposed in [18] for nonidentified models, we will be able to study the shape of the limiting posterior distribution, which has been shown by [20] to be partially helpful in ruling out some values for the true value of the parameter. To show this, we would need to find out how concentrated the limiting posterior is on the identification region, and in particular whether it is more concentrated over the identification region than the prior.

Bibliography

[1] AHRQ. Agency for Healthcare Research and Quality (AHRQ). Medical Expenditure Panel Survey. Rockville (MD): U.S. Department of Health and Human Services, 2010.
[2] D. Bernard and J. Banthin. Family level expenditures on health care and insurance premiums among the nonelderly population. MEPS Research Findings, No. 29. March 2009. Agency for Healthcare Research and Quality, Rockville, MD, 2006.
[3] R. Chu, P. Gustafson, and N. Le.
Bayesian adjustment for exposure misclassification in case-control studies. Statistics in Medicine, 29:994–1003, 2010.
[4] J. Cohen, S. Machlin, and S. Zuvekas. Distribution of health care expenses. MEPS Research Findings, No. 11. AHRQ Pub. No. 00-0024, 2000.
[5] S. Cohen. Sample design of the 1997 Medical Expenditure Panel Survey household component. MEPS Methodology Report, No. 11. AHRQ Pub. No. 01-0001, 2000.
[6] S. Cohen. The concentration of health care expenditures and related expenses for costly medical conditions. Statistical Brief No. 359. February 2012. Agency for Healthcare Research and Quality, Rockville, MD, 2009.
[7] K. Davis. Expenditures for hypertension among adults age 18 and older: Estimates for the U.S. civilian noninstitutionalized population. Statistical Brief No. 371. June 2012. Agency for Healthcare Research and Quality, Rockville, MD, 2009.
[8] K. EI-Basyoungy and T. Sayed. Collision prediction models using multivariate Poisson-lognormal regression. Accident Analysis and Prevention, 41:820–828, 2009.
[9] K. EI-Basyoungy and T. Sayed. Safety performance functions with measurement errors in traffic volume. Safety Science, 48:1339–1344, 2010.
[10] K. EI-Basyoungy and T. Sayed. Application of generalized link functions in developing accident prediction models. Safety Science, 48:410–416, 2010.
[11] T. Ezzati-Rice, F. Rohde, and J. Greenblatt. Sample design of the Medical Expenditure Panel Survey household component, 1998-2007. MEPS Methodology Report, No. 22, 2008.
[12] L. Gordis. Epidemiology. Saunders Elsevier, Philadelphia, PA, 2009.
[13] S. Granato. The impact of factoring traffic counts for daily and monthly variation in reducing sample counting error. Transportation Conference Proceedings, pages 122–125, 1998.
[14] S. Greenland.
The impact of prior distributions for uncontrolled confounding and response bias: A case study of the relation of wire codes and magnetic fields to childhood leukemia. J. Amer. Statist. Assoc., 98(461):47–54, 2003.
[15] S. Greenland. Multiple-bias modelling for analysis of observational data. J. Roy. Statist. Soc. Ser. A, 168(2):267–306, 2005.
[16] P. Gustafson. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Interdisciplinary Statistics. Chapman & Hall/CRC, Boca Raton, FL, 2004.
[17] P. Gustafson. On model expansion, model contraction, identifiability and prior information: two illustrative scenarios involving mismeasured variables. Statist. Sci., 20(2):111–140, 2005. With comments and a rejoinder by the author.
[18] P. Gustafson. Sample size implications when biases are modelled rather than ignored. J. Roy. Statist. Soc. Ser. A, 169(4):865–881, 2006.
[19] P. Gustafson. Measurement error modelling with an approximate instrumental variable. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(5):797–815, 2007.
[20] P. Gustafson. Bayesian inference for partially identified models. International Journal of Biostatistics, 6(2):Article 17, 2010.
[21] P. Gustafson and S. Greenland. Curious phenomena in Bayesian adjustment for exposure misclassification. Statistics in Medicine, 25:87–103, 2006.
[22] P. Gustafson and S. Greenland. The performance of random coefficient regression in accounting for residual confounding. Biometrics, 62(3):760–768, 2006.
[23] P. Gustafson, N. D. Le, and R. Saskin. Case-control analysis with partial knowledge of exposure misclassification probabilities. Biometrics, 57(2):598–609, 2001.
[24] P. Gustafson, M. Gilbert, M. Xia, W. Michelow, W. Robert, T. Trussler, M.
McGuire, D. Paquette, D. Moore, and R. Gustafson. Impact of statistical adjustment for frequency of venue attendance in a venue-based survey of men who have sex with men. American Journal of Epidemiology, to appear.
[25] K. M. Harris and J. R. Udry. National Longitudinal Study of Adolescent Health. Inter-university Consortium for Political and Social Research [distributor], 1994-2008.
[26] E. Hauer, J. C. N. Ng, and J. Lovell. Estimation of safety at signalized intersections. Transportation Research Record, 1185:48–61, 1988.
[27] J. M. Karon. The analysis of time-location sampling study data. ASA Section on Survey Research Methods, 2005.
[28] D. Kashihara and K. Carper. National health care expenses in the U.S. civilian noninstitutionalized population. Statistical Brief No. 355. January 2012. Agency for Healthcare Research and Quality, Rockville, MD, 2009.
[29] N. Krauss, S. Machlin, and B. Kass. Use of health care services. MEPS Research Findings, No. 7. AHCPR Pub. No. 99-0018, 1996.
[30] T. L. Lash, M. P. Fox, and A. K. Fink. Applying Quantitative Bias Analysis to Epidemiologic Data. Springer Series in Statistics for Biology and Health. Springer, New York, 2009.
[31] J. Liu, P. Gustafson, N. Cherry, and I. Burstyn. Bayesian analysis of a matched case-control study with expert prior information on both the misclassification of exposure and the exposure-disease association. Statistics in Medicine, 28:3411–3423, 2009.
[32] D. MacKellar, L. Valleroy, J. Karon, G. Lemp, and R. Janssen. The Young Men's Survey: Methods for estimating HIV seroprevalence and risk factors among young men who have sex with men. Public Health Reports, 111(Suppl 1):138–144, 1996.
[33] R. Magnani, K. Sabin, T. Saidel, and D. Heckathorn. Review of sampling hard-to-reach and hidden populations for HIV surveillance. AIDS, 19(Suppl 2):S67–72, 2005.
[34] C. F. Manski. Partial Identification of Probability Distributions. Springer Series in Statistics. Springer-Verlag, New York, 2003.
[35] F. Muhib, L. Lin, A. Stueve, R. Miller, W. Ford, W. Johnson, and P. Smith. A venue-based method for sampling hard-to-reach populations. Public Health Rep, 116(Suppl 1):216–222, 2001.
[36] NCHS. Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, CDC, 2007.
[37] NCHS. National Center for Health Statistics (NCHS). National Health Interview Survey. National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, Maryland, 2009.
[38] A. A. Neath and F. J. Samaniego. On the efficacy of Bayesian inference for nonidentifiable models. Amer. Statist., 51(3):225–232, 1997.
[39] G. Olin and S. Machlin. Per capita health care expenses. MEPS Research Findings, No. 12. AHRQ Pub. No. 00-0026, 2000.
[40] G. P. Patil and C. R. Rao. Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics, 34(2):179–189, 1978.
[41] PHAC. Public Health Agency of Canada (PHAC). M-Track Survey. 2008.
[42] F. Rohde. Dental expenditures in the 10 largest states. Statistical Brief No. 353. December 2011. Agency for Healthcare Research and Quality, Rockville, MD, 2008.
[43] F. Rohde and S. Machlin. Health care expenditures for uncomplicated pregnancies. MEPS Research Findings, No. 32. June 2012. Agency for Healthcare Research and Quality, Rockville, MD, 2009.
[44] K. J. Rothman. Epidemiology: An Introduction. Oxford University Press, New York, 2002.
[45] D. B. Rubin. The Bayesian bootstrap. Ann. Statist., 9(1):130–134, 1981.
[46] E. Sarpong and G. Miller. Trends in the pharmaceutical treatment of diabetes. MEPS Research Findings, No. 30. September 2010. Agency for Healthcare Research and Quality, Rockville, MD, 2007.
[47] D. Scharfstein, M. Daniels, and J. Robins. Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics, 4:495–512, 2003.
[48] A. J. Scott and C. J. Wild. Fitting regression models with response-biased samples. 39:519–536, 2011.
[49] S. C. Sharma, P. Kilburn, and Y. Wu. The prediction of average annual daily traffic volume estimates from seasonal traffic counts: Alberta example. Can. J. Civ. Eng., 23:302–304, 1996.
[50] A. Soni. Trends in use and expenditures for depression among U.S. adults age 18 and older, civilian noninstitutionalized population. Statistical Brief No. 377. July 2012. Agency for Healthcare Research and Quality, Rockville, MD, 1999 and 2009.
[51] A. Soni. Expenditures for the top five therapeutic classes of outpatient prescription drugs, adults age 18 and older, U.S. civilian noninstitutionalized population. Statistical Brief No. 362. March 2012. Agency for Healthcare Research and Quality, Rockville, MD, 2009.
[52] A. Soni. Expenditures for the top five therapeutic classes of outpatient prescription drugs, Medicare beneficiaries, age 65 and older, U.S. civilian noninstitutionalized population. Statistical Brief No. 363. March 2012. Agency for Healthcare Research and Quality, Rockville, MD, 2009.
[53] M. N. Stagnitti. Trends in dermatological agents utilization and expenditures for the U.S. civilian noninstitutionalized population. Statistical Brief No. 356. February 2012. Agency for Healthcare Research and Quality, Rockville, MD, 1999-2008.
[54] M. N. Stagnitti. Trends in anticonvulsants utilization and expenditures for the U.S.
civilian noninstitutionalized population. Statistical Brief. No. 372. June 2012. Agency for Healthcare Research and Quality, Rockville, MD., 1999 and 2009. → pages 3, 37 [55] K. Steenland and S. Greenland. Monte carlo sensitivity analysis and bayesian analysis of smoking as an unmeasured confounder in a study of silica and lung cancer. American Journal of Epidemiology, 160(4): 384–392, 2004. → pages 9, 37 [56] A. Stueve, L. O’Donnell, R. Duran, A. San Doval, and J. Blome. Time-space sampling in minority communities: Results with latino young men who have sex with men. American Journal of Public Health, 91(6): 922–926, 2001. → pages 1, 12 161 [57] D. Wang, T. Shen, and P. Gustafson. Partial identification arising from nondifferential exposure misclassification: How informative are data on the unlikely, maybe, and likely exposed? The International Journal of Biostatistics, 8:1557–4679, 2012. → pages 5, 81 [58] R. S. Winsor. Misrepresentation and non Disclosure on Applications for Insurance. Blaney McMurtry LLP, 1995. → pages 4, 80 [59] M. Xia and P. Gustafson. A bayesian method for estimating prevalence in the presence of a hidden sub-population. Statistics in Medicine, 31: 2386–2398, 2012. → pages iv [60] Q. Xia, M. Tholandi, D. H. Osmond, L. M. Pollack, W. Zhou, J. D. Ruiz, and J. A. Catania. The effect of venue sampling on estimates of HIV prevalence and sexual risk behaviors in men who have sex with men. Sexually Transmitted Diseases, 33(9):545–550, 2006. → pages 1, 13 [61] Y. Xie and B. P. Carlin. Measures of Bayesian learning and identifiability in hierarchical models. J. Statist. Plann. Inference, 136(10):3458–3477, 2006. → pages 7, 13 [62] S. Yeh, N. Uberoi, and J. Cohen. Expenditures for trauma-related disorders among the elderly and non- elderly: Estimates for the u.s. civilian noninstitutionalized population. Statistical Brief No. 397. January 2013. Agency for Healthcare Research and Quality, Rockville, MD., 2009. 
→ pages 3, 37

Appendix A

Supporting Materials for Prevalence Estimation

For the simulation study in Section 2.4 of the thesis, the sampling distribution of the posterior mean and standard deviation for the 2 × 3 × 3 = 18 scenarios with reasonable priors (the first three priors) described in Subsection 2.4.2 is plotted in Figures A.1 to A.4. The results for the 2 × 3 × 2 = 12 cases with incorrectly specified priors are given in Figures A.5 to A.8. We observe that, despite the mis-specified priors, the posterior mean and standard deviation converge to their large-sample limits in all cases. The only change is that the limiting posterior credible intervals do not cover the true population prevalence, which is consistent with the theoretical results derived in Section 2.3 of the thesis. True values as well as the limiting posteriors of the prevalence for these 12 cases are given in Table A.1. Given the calibration property of the limiting posteriors, we recommend using sufficiently vague priors when no reliable information is available on the hidden proportion.
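The qualitative behavior documented in the figures, a posterior that settles at a large-sample limit even though the model is not identified, can be illustrated with a minimal Monte Carlo sketch. The model below is a hypothetical simplification, not the spline-based model of Chapter 2: the overall prevalence is a mixture of an observed-group prevalence, which the data inform, and a hidden-group prevalence, which retains its prior; the hidden proportion and hidden-group prior parameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simplification (not the thesis model): overall prevalence
# theta = (1 - pi) * theta_o + pi * theta_h, where pi is the hidden
# proportion and theta_h the hidden-group prevalence.  The data inform
# only theta_o; pi and theta_h keep their priors, so the model is
# nonidentified, yet the posterior of theta still settles down.
theta_o_true = 0.15  # true prevalence in the reachable sub-population

def posterior_summary(n, draws=100_000):
    """Monte Carlo posterior mean and SD of the overall prevalence theta."""
    x = rng.binomial(n, theta_o_true)            # observed case count
    theta_o = rng.beta(1 + x, 1 + n - x, draws)  # updated by the data
    pi = rng.beta(2, 18, draws)                  # prior only: mean 0.10
    theta_h = rng.beta(2, 8, draws)              # prior only: mean 0.20
    theta = (1 - pi) * theta_o + pi * theta_h
    return theta.mean(), theta.std()

# The posterior mean approaches (1 - 0.10) * 0.15 + 0.10 * 0.20 = 0.155,
# while the posterior SD flattens at a nonzero limit instead of
# shrinking to zero, mirroring the plateaus in Figures A.2 and A.4.
for n in (400, 1600, 6400, 25600, 102400):
    m, s = posterior_summary(n)
    print(f"n={n:6d}  mean={m:.4f}  sd={s:.4f}")
```

The nonzero SD plateau is exactly the fingerprint of partial identification: the data resolve the identified component of the parameter, and what remains is the residual prior uncertainty about the hidden sub-population.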
[Figure A.1: Sampling distribution of the posterior mean of prevalence θ with different sample sizes (400 to 102,400, and the large-sample limit). Panels (a)-(i) are cases 111, 112, 113, 121, 122, 123, 131, 132, 133. The three digits in the panel name correspond to the three assumptions on the distribution of the observed sampling probability, the natural cubic spline and the prior on the hidden proportion. For example, '111' refers to the situation where we take the first set of values for the three assumptions given in Subsection 2.4.2 of the thesis.]

[Figure A.2: Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes. Panels (a)-(i) are cases 111 through 133.]

[Figure A.3: Sampling distribution of the posterior mean of prevalence θ with different sample sizes. Panels (a)-(i) are cases 211 through 233.]

[Figure A.4: Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes. Panels (a)-(i) are cases 211 through 233.]

[Figure A.5: Sampling distribution of the posterior mean of prevalence θ with different sample sizes. Panels (a)-(f) are cases 114, 124, 134, 115, 125, 135.]

[Figure A.6: Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes. Panels (a)-(f) are cases 114, 124, 134, 115, 125, 135.]

[Figure A.7: Sampling distribution of the posterior mean of prevalence θ with different sample sizes. Panels (a)-(f) are cases 214, 224, 234, 215, 225, 235.]

[Figure A.8: Sampling distribution of the posterior standard deviation of prevalence θ with different sample sizes. Panels (a)-(f) are cases 214, 224, 234, 215, 225, 235.]

Table A.1: Limiting posteriors for cases with mis-specified priors. θ is the true prevalence; θ̂ and the 2.5% and 97.5% percentiles describe the limiting posterior.

Case    θ         θ̂         2.5% percentile   97.5% percentile
114     0.1636    0.1640     0.1638            0.1643
124     0.2156    0.2238     0.2200            0.2280
134     0.18285   0.18279    0.18276           0.18281
214     0.1605    0.1613     0.1609            0.1617
224     0.1597    0.1741     0.1675            0.1815
234     0.1835    0.1834     0.1833            0.1834
115     0.1636    0.1645     0.1639            0.1652
125     0.2156    0.2320     0.2215            0.2453
135     0.18285   0.18275    0.18267           0.18281
215     0.1605    0.1621     0.1611            0.1633
225     0.1597    0.1884     0.1700            0.2120
235     0.1835    0.1832     0.1830            0.1834
Bayesian methods for alleviating identification issues with applications in health and insurance areas Xia, Chaoxiong 2013
Item Metadata
Title | Bayesian methods for alleviating identification issues with applications in health and insurance areas
Creator | Xia, Chaoxiong
Publisher | University of British Columbia
Date Issued | 2013
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2013-06-13
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0071991
URI | http://hdl.handle.net/2429/44559
Degree | Doctor of Philosophy - PhD
Program | Statistics
Affiliation | Science, Faculty of; Statistics, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2013-11
Campus | UBCV
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/