UBC Theses and Dissertations

Bayesian adjustments for disease misclassification in epidemiological studies of health administrative… Högg, Tanja 2018

Full Text

Bayesian adjustments for disease misclassification in epidemiological studies of health administrative databases, with applications to multiple sclerosis research

by

Tanja Högg

B.Sc. Universität Augsburg, 2011
M.Sc. University of British Columbia - Okanagan, 2013

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate and Postdoctoral Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

November 2018

© Tanja Högg, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Bayesian adjustments for disease misclassification in epidemiological studies of health administrative databases, with applications to multiple sclerosis research

submitted by Tanja Högg in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics.

Examining Committee:

Paul Gustafson, Department of Statistics (Co-supervisor)
John Petkau, Department of Statistics (Co-supervisor)
Yinshan Zhao, BC Center for Improved Cardiovascular Health (Supervisory Committee Member)
Rollin Brant, Department of Statistics (University Examiner)
John J Spinelli, Population and Public Health (University Examiner)

Additional Supervisory Committee Members:

Helen Tremlett, Faculty of Medicine (Supervisory Committee Member)

Abstract

With disease information routinely established from diagnostic codes or prescriptions in health administrative databases, the topic of outcome misclassification is gaining importance in epidemiological research. Motivated by a Canada-wide observational study into the prodromal phase of multiple sclerosis (MS), this thesis considers the setting of a matched exposure-disease association study where the disease is measured with error.

We initially focus on the special case of a pair-matched case-control study.
Assuming non-differential misclassification of study participants, we give a closed-form expression for asymptotic biases in odds ratios arising under naive analyses of misclassified data, and propose a Bayesian model to correct association estimates for misclassification bias. For identifiability, the model relies on information from a validation cohort of correctly classified case-control pairs, and also requires prior knowledge about the predictive values of the classifier. In a simulation study, the model shows improved point and interval estimates relative to the naive analysis, but is also found to be overly restrictive in a real data application.

In light of these concerns, we propose a generalized model for misclassified data that extends to the case of differential misclassification and allows for a variable number of controls per case. Instead of prior information about the classification process, the model relies on individual-level estimates of each participant's true disease status, which were obtained from a counting process mixture model of MS-specific healthcare utilization in our motivating example.

Lastly, we consider the problem of assessing the non-differential misclassification assumption in situations where the exposure is suspected to impact the classification accuracy of cases and controls, but information on the true disease status is unavailable. Motivated by the non-identified nature of the problem, we consider a Bayesian analysis and examine the utility of Bayes factors to provide evidence against the null hypothesis of non-differential misclassification.
Simulation studies show that for a range of realistic misclassification scenarios, and under mildly informative prior distributions, posterior distributions of the exposure effect on classification accuracy exhibit sufficient updating to detect differential misclassification with moderate to strong evidence.

Lay Summary

British Columbia health administrative databases contain records on hospitalizations, filled prescriptions and physician billings for all residents covered under the provincial health care programs, including diagnostic codes for each physician contact. While previous research suggests that diagnostic codes can be error prone, they are increasingly relied upon in epidemiological studies to identify individuals affected by the disease under study. In this thesis, we examine how standard statistical analyses that ignore the inaccurate nature of the data can impact study findings, and develop new statistical tools that acknowledge the possibility that the disease may not be accurately assessed for some of the study participants. These tools allow researchers to make better use of the rich information captured in health administrative databases, without compromising the integrity of the results. All methods are applied to a motivating study investigating the symptoms preceding the first recognized sign of multiple sclerosis.

Preface

This thesis was completed under the joint supervision of Prof. Paul Gustafson, Prof. John Petkau and Dr. Yinshan Zhao. The research problems covered in Chapters 2 and 3 were proposed by Dr. Yinshan Zhao. The motivating dataset and epidemiological research question were provided through Prof. Helen Tremlett's Pharmacoepidemiology in Multiple Sclerosis (PiMS) Research Group at the University of British Columbia (study specific funding: the National Multiple Sclerosis Society, RG 5063A4/1, PI: Tremlett). Data access and linkages were facilitated by Population Data BC (http://www.popdata.bc.ca).
All inferences, opinions, and conclusions drawn in this thesis are those of the author, and do not reflect the opinions or policies of the Data Steward(s).

Chapter 2 of this thesis has been published as "Bayesian analysis of pair-matched case-control studies subject to outcome misclassification" in Statistics in Medicine [40]. As first author, I drafted the manuscript, developed and implemented the model, and conducted all data analysis.

The Prodromal Multiple Sclerosis (ProMS) study has been approved by the University of British Columbia Clinical Research Ethics Board (Certificate: H14-00448).

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Bayesian analysis of pair-matched case-control studies subject to outcome misclassification
  2.1 Introduction
  2.2 Methods
    2.2.1 Notation and preliminaries
    2.2.2 Misclassification probabilities under matched case-control sampling
    2.2.3 Analysis under perfect outcome classification
    2.2.4 Bias under outcome misclassification
    2.2.5 A model for the analysis of matched case-control data subject to outcome misclassification
    2.2.6 Prior distributions
    2.2.7 Determination of hyperparameters
    2.2.8 The case of perfect disease ascertainment
  2.3 Simulation study
  2.4 Application to ProMS study data
  2.5 Discussion
3 Relaxing the non-differential misclassification assumption in analyses of matched case-control studies with misclassified outcomes
  3.1 Introduction
  3.2 Methods
    3.2.1 Analysis under perfect outcome classification
    3.2.2 Modelling of exposure probabilities under outcome misclassification
    3.2.3 Using disease-specific healthcare utilization to discriminate between true and false positive cases
    3.2.4 A counting process mixture model for disease-specific claim times
    3.2.5 An exposure risk model with differential misclassification adjustment
  3.3 Application to ProMS data
  3.4 Discussion
4 Assessing the non-differential misclassification assumption: A Bayesian approach using non-identified models
  4.1 Introduction
  4.2 Testing the non-differential misclassification assumption
    4.2.1 A general model of differential misclassification
    4.2.2 Bayesian analysis of non-identified models
    4.2.3 Prior distributions
    4.2.4 Estimation under outcome related subject selection
  4.3 Simulation study
    4.3.1 Design
    4.3.2 Results for prospective modelling
    4.3.3 Results for retrospective modelling
  4.4 Application to ProMS data
    4.4.1 Data
    4.4.2 Results
  4.5 Discussion
5 Conclusion
Bibliography
A Appendix for Chapter 1
  A.1 The negative predictive value under matched sampling
  A.2 Details of the simulation study
  A.3 The case of odds ratios adjusted towards the null
B Appendix for Chapter 2
  B.1 Sensitivity analysis

List of Tables

Table 2.1 Median of the posterior distribution of OR, empirical coverage and length of the 95% equal-tailed posterior credible interval under naive and adjusted analyses of matched case-control data for various settings of exposure-disease odds ratio (OR), underestimation of pp in the prior input (∆), exposure prevalence (controlled via a0) and specificity (SP), averaged over R = 1000 simulation runs. Positive predictive values (pp) are displayed for each setting of SP. Other simulation parameters are n = 1000, m = 200, se(p̂p) = se(n̂p) = 0.02, SN = 1 (np = 1).

Table 2.2 Morbidity status of matched pairs in the administrative and validation cohort for ten morbidities under investigation. For morbidities with cell counts of less than five pairs in the validation cohort, two cells were suppressed to fulfill the privacy requirements related to data access.

Table 3.1 Disease status of 6830 apparent cases and 31714 controls in the ProMS cohort for six morbidities suspected to be part of the MS prodrome. Data are aggregated across matching strata.

Table 3.2 Posterior median, 2.5 and 97.5 percentile of counting process mixture model parameters estimated from MS-specific claim history of 6830 study participants in the ProMS case group.
Table 4.1 Counts of exposed (E = 1) and unexposed (E = 0) study participants with positive (M/> 40) and negative (F/≤ 40) confounder values in case and control group, as well as odds ratios (OR) between exposure and confounder.

Table B.1 Results of the sensitivity analysis for ProMS data: Median, 2.5 and 97.5 percentile of the posterior distribution under a normal, t, and skewed normal distribution for the strata-specific effect bk.

List of Figures

Figure 2.1 Relative size of OR* from pair-matched case-control data subject to outcome misclassification under three scenarios of exposure-disease association (OR = 0.7, OR = 1.5, OR = 3), stratified by positive (pp) and negative predictive values (np). Pairs are matched on a binary confounder with prevalence 0.2 and exposure-confounder odds ratio 1.5.

Figure 2.2 Median, 2.5th and 97.5th percentile of the OR posterior averaged over 1000 simulation runs for selected settings of specificity (SP), stratified by main cohort size (n), validation cohort size (m) and prior variance of pp (sd(p̂p)). The true value of OR is represented by the horizontal grey line. Other simulation parameters are a0 = log(0.01) in the exposure model (A.2), SN = 1 (np = 1) and ∆ = 0. Positive predictive values are pp = 0.39, 0.5, 0.76, 1 for SP = 0.95, 0.97, 0.99, 1, respectively.

Figure 2.3 Median, 2.5th and 97.5th percentile of the OR posterior averaged over 1000 simulation runs for selected settings of specificity (SP), stratified by main cohort size (n), validation cohort size (m) and exposure prevalence (controlled via exposure model intercept a0). The true value of OR is represented by the horizontal grey line. Simulation parameters are sd(p̂p) = 0.02, SN = 1 (np = 1) and ∆ = 0.
Positive predictive values are pp = 0.33, 0.45, 0.73, 1 (a0 = log(0.002)) and pp = 0.39, 0.5, 0.76, 1 (a0 = log(0.01)) for SP = 0.95, 0.97, 0.99, 1, respectively.

Figure 2.4 Posterior median and 95% equal-tailed credible interval for odds ratios of multiple sclerosis and presence of morbidity in the prodromal phase under the proposed model (adjusted) and the naive model based on the full cohort (naive) and the validation cohort only (validation).

Figure 3.1 MS-specific physician billing codes over time for three example participants in the ProMS case group. For each participant, the end of follow-up is represented by the vertical grey line. Time zero is defined as the index date, the time of the first MS-specific code. Gaussian noise has been added to each MS-specific claim time to fulfill the privacy requirements related to data access.

Figure 3.2 Time difference between first and last MS-specific claim relative to the total length of follow-up plotted against follow-up time in years for ProMS cases. Participants captured by the British Columbia Multiple Sclerosis database are highlighted in blue. Contours are obtained from a two-dimensional kernel density estimate.

Figure 3.3 Distribution of predicted misclassification probabilities pik for ProMS cases (left), ProMS cases captured by the British Columbia Multiple Sclerosis database (centre), and ProMS cases with at least one administrative record for a MS-specific disease modifying therapy (right).

Figure 3.4 Time difference between the first and last MS-specific claim relative to the total length of follow-up, plotted against follow-up time in years for ProMS cases. Points are coloured according to a participant's probability of being a true MS case, as estimated by the counting process mixture model.
Figure 3.5 Estimated log odds ratios of MS and six morbidities in the prodromal phase using the proposed model with differential misclassification adjustment (adjusted (diff)), a model with non-differential misclassification adjustment (adjusted (non-diff)) and a naive analysis where all study participants are assumed to be correctly classified (naive).

Figure 4.1 Relationship of confounder U, true and apparent disease status D and D* and exposure E under non-differential misclassification (left) and differential misclassification with respect to E (right).

Figure 4.2 Relationship between identified probabilities o_ij = P(D* = 1 | E = i, U = j) (black) and non-identified probabilities p_ik = P(D* = 1 | E = i, D = k) (blue) under misclassification model (4.2) when both disease groups are differentially misclassified. Key: † on the logistic probability scale.

Figure 4.3 Differential misclassification for cases and controls (top) and controls only (bottom): Average median, 2.5% and 97.5% percentiles of the posterior distribution of β1, for selected settings of sensitivity (SN0), specificity (SP0) and exposure effect (δ). In each row, the true value of δ is indicated by a cross (+). The respective values for the prior distribution of β1 are given in the last row. The right hand side table shows median Bayes factors (BF), average length (Length) and coverage (Cover) of 95% posterior credible intervals, and percentage of intervals with lower bound greater than 0 (LB>0). Other parameters are n = 7000, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

Figure 4.4 Prior to posterior updating for non-identified parameters of the prospective model for one set of simulated data, when cases and controls are differentially misclassified.
In each plot, posterior and prior distributions are given by solid and dashed lines, respectively. True parameter values are indicated by the dotted vertical lines. Model parameters are 1 − SP0 = 0.2, SN0 = 0.9, δ = 1, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

Figure 4.5 Differential misclassification for cases and controls (top) and controls only (bottom): Average median, 2.5% and 97.5% percentiles of the posterior distribution of β1 for selected settings of confounder prevalence (η1), conditional disease-confounder (OR(U, D | E)) and exposure-disease (OR(E, D | U)) associations. In each row, the true value of δ is indicated by a cross (+). The respective values for the prior distribution of β1 are indicated in the last row. Other parameter values are SN0 = 0.7, 1 − SP0 = 0.2, δ = 1, a0 = b0 = logit(0.2), a1 = log(1.5), n = 7000.

Figure 4.6 Differential misclassification for controls only under retrospective sampling: Average median, 2.5% and 97.5% percentiles of the posterior distribution of α1 for selected settings of specificity (SP0) and exposure effect (δ). In each row, the true value of δ is indicated by a cross (+). The respective values for the prior distribution of α1 are given in the last row. Other parameters are n = 7000, SN0 = SN1 = 1, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

Figure 4.7 Prior to posterior updating for non-identified parameters of the retrospective model for one set of simulated data when only controls are differentially misclassified. In each plot, posterior and prior distributions are given by solid and dashed lines, respectively. True parameter values are indicated by the dotted vertical lines. Model parameters are 1 − SP0 = 0.3, SN0 = SN1 = 1, δ = 1, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).
Figure 4.8 Differential misclassification for controls only under retrospective sampling: Average median, 2.5% and 97.5% percentiles of the posterior distribution of α1 for selected settings of confounder prevalence η1, disease-confounder (OR(U, D | E)) and exposure-disease (OR(E, D | U)) associations. The respective values for the prior distribution of α1 are indicated in the last row. Other parameter values are SN1 = SN0 = 1, 1 − SP0 = 0.3, δ = 1, n = 7000.

Figure 4.9 Median, 2.5% and 97.5% percentiles of the posterior distribution of α1, Bayes factors (BF) and lengths of 95% posterior credible intervals (Length) for a range of morbidities suspected to be part of the MS prodrome. In each panel, the top and bottom rows correspond to analyses using age and sex, respectively, as the confounding variable. Summary statistics for the prior distribution of α1 are indicated in the last row.

Acknowledgments

I am grateful to my co-supervisors, Prof. John Petkau, Prof. Paul Gustafson and Dr. Yinshan Zhao, and to my committee member Prof. Helen Tremlett for their guidance and support throughout my PhD studies. It has been an extraordinary privilege to know and work with such a dedicated, encouraging and kind group of individuals.

Chapter 1

Introduction

In many scientific fields, imperfect assessment of binary traits can lead to study participants being falsely assigned to one of two possible groups, and subsequently results in misclassification of the variables under study. Common examples include diseases diagnosed using imperfect laboratory tests, or assessment of past exposures from patient self-report. The impact of misclassification on statistical inferences has received considerable attention in the literature, and numerous methods to remedy its detrimental impact have been proposed [12, 32, 68].
If the misclassified trait is substituted for the truth without acknowledging its imperfect nature, consequences can include biases in association measures, loss of statistical power and increases in Type 1 error rates.

The effect of exposure misclassification on association measures has been well studied, and a variety of methods have been proposed to analyze misclassified exposure data from cohort studies [31], case-control studies [22, 56] and matched case-control studies [20, 30, 45, 58, 59]. In comparison, the topic of outcome (e.g. disease) misclassification has received little attention. Both frequentist [21, 48, 49] and Bayesian methods [28, 54] have been developed for the analysis of cohort studies, but only a few papers investigate this subject under study designs with outcome-related subject selection [7, 13]. Jurek et al. [42] propose a misclassification adjustment for the scenario of case-control sampling, but the problem of outcome misclassification under matched case-control sampling remains unexplored, despite the popularity of matched designs in association studies [60].

Following an increasing use of health administrative databases as cost-effective data sources for epidemiological research at the population level [4, 17, 24, 52, 65], the topic of outcome misclassification is gaining new importance. In Canada, health administrative databases are collated by the provincial governments and contain records on hospitalizations, physician billings and pharmaceutical prescriptions filled for all residents covered under the provincial health care programs. Among other variables, records include diagnostic information for each physician contact and hospital discharge, coded using the International Classification of Diseases, Ninth or Tenth Revision (ICD 9/10).
When selecting case and control groups for epidemiological research, investigators often rely on these diagnostic codes to determine presence or absence of the disease under study.

Although health administrative databases are a rich source of information, the validity of their diagnostic information has been questioned in the past. While physicians are required to provide at least one ICD code for claim reimbursement, diagnostic fields are not closely monitored for quality and thus can be prone to errors, missingness and lack of specificity [66]. Diagnostic information in hospital separation records suffers from similar limitations, although in Canada it is generally thought to provide a higher degree of accuracy. Furthermore, administrative databases only capture conditions for which a patient seeks medical assistance; conditions that are self-managed will not be adequately represented. Consequently, the use of diagnostic codes as proxies for the presence or absence of a disease under study can result in controls that are falsely assigned to a study's case group, as well as cases that are falsely assigned to the control group.

The work presented in this thesis is motivated by the Prodromal Multiple Sclerosis (ProMS) study, a Canadian multi-province observational study investigating the existence of a prodromal phase in multiple sclerosis (MS) [65]. MS is a chronic disease of the central nervous system of unknown cause and with no single diagnostic test. The diagnostic process is complicated by early symptoms that are highly heterogeneous among patients and can be confused with several other medical conditions.
To better understand the onset of MS and promote earlier recognition, the ProMS study focuses on the five-year time period prior to the first recognized symptom of MS. Primary outcomes include the annual utilization of health care services, measured by the number of physician contacts, hospitalizations and prescriptions filled, but also the presence of several morbidities prior to the first recognized sign of MS. Outcomes are compared between people with and without MS in a matched case-control design.

Study participants in four Canadian provinces (British Columbia, Manitoba, Nova Scotia, Saskatchewan) were identified from provincial health administrative records available between April 1, 1991 and December 31, 2013. Individuals were classified as having MS if administrative health records showed at least three MS-specific billing codes or pharmaceutical prescription claims, while eligible controls were required to have no records for MS or other demyelinating diseases, such as optic neuritis or transverse myelitis. Because privacy legislation prohibits individual-level data from leaving the provinces, we focus on data from British Columbia throughout this thesis.

The case definition of three or more MS-specific records has previously been validated in the Canadian provinces of Manitoba [51] and Nova Scotia [52]. In both studies, a total of 2000 individuals with administrative records for demyelinating diseases were identified from the provincial databases, and the disease status determined by the case definition was compared against a gold standard reference. In Nova Scotia, a provincial MS database with near-universal coverage served as the gold standard, while in Manitoba, permission for medical records review was requested from the selected individuals. Both studies reported a sizeable percentage of non-MS patients among individuals who fulfilled the case definition, with an estimated 17% in Nova Scotia and 24% in Manitoba.
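The case definition and the validation findings can be illustrated with a short sketch. It is only an illustration under stated assumptions: the record layout (a person identifier paired with a flag for MS-specific records), the function name and the toy records are hypothetical and are not taken from the ProMS data extraction. The final lines convert the reported shares of non-MS patients among definition-positive individuals into the positive predictive values they imply.

```python
from collections import Counter

MIN_MS_RECORDS = 3  # case definition: at least three MS-specific records


def apparent_cases(records, threshold=MIN_MS_RECORDS):
    """Return the ids that meet the administrative case definition.

    `records` is a hypothetical iterable of (person_id, is_ms_specific)
    pairs, one pair per billing code or prescription claim.
    """
    counts = Counter(pid for pid, is_ms in records if is_ms)
    return {pid for pid, n in counts.items() if n >= threshold}


# Toy records, invented for illustration only.
toy_records = [
    ("A", True), ("A", True), ("A", True),   # three MS-specific records
    ("B", True), ("B", False), ("B", True),  # only two MS-specific records
    ("C", False),                            # no MS-specific record
]
cases = apparent_cases(toy_records)

# Share of non-MS patients among definition-positive individuals, as
# reported by the two validation studies; implied PPV = 1 - that share.
non_ms_share = {"Nova Scotia": 0.17, "Manitoba": 0.24}
implied_ppv = {prov: 1.0 - share for prov, share in non_ms_share.items()}
```

In this toy input only participant "A" meets the definition, and the implied positive predictive values are 0.83 for Nova Scotia and 0.76 for Manitoba.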
Because of comparable health systems, we suspect a similar degree of contamination in the British Columbia cohort. Consequently, analyses of these data call for statistical techniques that acknowledge the potentially imperfect disease status of the study participants. A detailed description of the ProMS study can be found in Wijnands et al. [65].

Broadly, this thesis deals with problems of outcome misclassification that arise in epidemiological studies of health administrative databases. We consider the setting of an exposure-disease association study where the disease variable is measured with error, but the exposure variable is correctly classified. Motivated by ProMS study data, Chapters 2 and 3 of this thesis focus on the problem of outcome misclassification under matched case-control sampling.

Chapter 2 discusses the "simpler" case of a pair-matched case-control study. We give a closed-form expression for the bias in odds ratios that arises from naive analyses of misclassified data and propose a Bayesian model to correct association estimates for misclassification bias. For identifiability, the model relies on information from a validation cohort of correctly classified case-control pairs, and also requires some prior knowledge about the negative and positive predictive value of the classifier. In a simulation study, the model demonstrates its ability to provide improved point and interval estimates of the true association parameter relative to a naive analysis. As a real data example, we apply the model to British Columbia ProMS study data to investigate the presence of several morbidities in the prodromal phase of MS.

In Chapter 2, we assume that misclassification of cases and controls is non-differential, meaning that the chance of a positive disease label depends upon the true disease status only, and in particular is not influenced by the values of other variables under investigation.
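The effect of non-differential misclassification on a naive odds ratio can be sketched numerically. The example below is a simplified, unmatched illustration and not the closed-form expression for pair-matched data derived in Chapter 2. It assumes that, given the true disease status, the disease label is independent of exposure, so the exposure probability in each apparent group is a mixture weighted by the predictive values pp and np. The function names and input values are hypothetical.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def apparent_odds_ratio(true_or, p_exp_controls, pp, np_):
    """Exposure odds ratio comparing apparent cases with apparent controls.

    pp and np_ are the positive and negative predictive values of the
    disease label in the sampled groups. Under non-differential
    misclassification, exposure depends on the true status only, so each
    apparent group is a mixture of true cases and true controls.
    """
    odds_cases = true_or * odds(p_exp_controls)
    p_exp_cases = odds_cases / (1.0 + odds_cases)
    p_app_cases = pp * p_exp_cases + (1.0 - pp) * p_exp_controls
    p_app_controls = np_ * p_exp_controls + (1.0 - np_) * p_exp_cases
    return odds(p_app_cases) / odds(p_app_controls)


# Hypothetical inputs: true OR of 3, 20% exposure prevalence in controls.
or_clean = apparent_odds_ratio(3.0, 0.2, pp=1.0, np_=1.0)  # perfect labels
or_naive = apparent_odds_ratio(3.0, 0.2, pp=0.8, np_=1.0)  # 20% false cases
```

With perfect labels the true odds ratio is recovered, while contamination of the case group pulls the naive odds ratio towards the null in this simplified setting.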
While this assumption may be sound in many applied problems, it becomes particularly hard to verify when classification depends on the presence or absence of disease-specific diagnostic codes. Because coding practices may differ between physicians and could be influenced by other factors such as a patient's morbidity burden, the probability of misclassification is bound to vary across study participants, ultimately leading to a classification process with poorly understood properties.

Statistical methods dealing with misclassified variables are often sufficiently general to accommodate the case of differential misclassification [28, 45, 58], but their application to practical problems can still be challenged by a lack of understanding about the underlying classification process. To adjust statistical analyses for misclassification, methods tend to rely on information about the classifier's sensitivity and specificity for identifiability, either by assuming them to be exactly known or via informative prior distributions. If misclassification is assumed to be non-differential and thus independent of factors other than the true disease status, such knowledge may be readily available from expert opinion or the results from validation studies. For situations where these assumptions are unsuitable, elicitation of prior information becomes more complicated given that sensitivities and specificities for all factors influencing the chance of a positive disease label are required.

In light of these concerns, Chapter 3 focuses on relaxing the assumption of non-differential misclassification in the analysis of matched case-control studies. We propose a Bayesian model that does not require prior information about a possibly complicated classification mechanism, but instead relies on individual-level estimates of each participant's true disease status.
In contrast to Chapter 2, the model is not limited to pair-matched case-control data, but extends to a variable number of matched controls per case.

The rationale behind our approach stems from the type of information available to researchers in health administrative database studies. While data extraction by the relevant bodies governing data access can require case and control groups to be defined by a coarse, binary classification rule, it is possible for investigators to gain access to the cohort’s raw health administrative records once this process is complete. These records provide complete histories of healthcare utilization and thus contain a variety of information that can inform our understanding of a participant’s true disease status. Revisiting the data analyzed in Chapter 2, we identify unique temporal patterns of MS-specific healthcare utilization among ProMS participants and apply a counting process mixture model to estimate each participant’s probability of being a true MS case. Re-analyses of ProMS study data suggest that differential misclassification of MS cases may in fact be present.

Following these findings, Chapter 4 takes a closer look at the assumption of non-differential misclassification, and in particular the question of whether violations thereof can be detected in a formal hypothesis test when information about the true disease status is unavailable. In practice, this assumption is common [5, 29] as it greatly simplifies statistical analyses of misclassified data, but is often based upon theoretical considerations about the underlying classification mechanism rather than the data at hand.
While Thygesen and Ersbøll [63] have issued a broad statement about the validity of non-differential misclassification assumptions in administrative database studies, analyses of ProMS data in Chapters 2 and 3 raise concerns about their validity in our application.

In Chapter 4, we consider the case of an exposure-disease association study, where the exposure is suspected to influence the chance of receiving a positive disease label. In the absence of disease information, inference about this exposure effect yields a non-identified model, and thus prohibits frequentist-type testing of the non-differential misclassification assumption. Motivated by previous papers that have demonstrated the utility of posterior distributions in non-identified models [18, 33, 34, 46], we therefore consider Bayesian hypothesis testing, with the Bayes factor as a measure of evidence for or against the null. Simulation studies show that under mildly informative prior distributions for the non-identified model parameters, posterior distributions of the exposure effect show appreciable updating relative to the prior, and in some cases allow differential misclassification to be detected with moderate to strong evidence. Compared to related work dealing with unmeasured covariates in non-linear models [50], this approach does not require access to an instrumental variable and thus applies to administrative database studies where variables are limited to information collected by government agencies.
Application to British Columbia ProMS study data suggests that controls have a higher chance of misclassification if health administrative records show diagnostic codes for depression or migraine during the five-year prodromal period, reaffirming suspicions about differential misclassification amongst the cohort.

The thesis closes with a discussion in Chapter 5, where we summarize our main results and outline directions for future research.

Chapter 2

Bayesian analysis of pair-matched case-control studies subject to outcome misclassification

2.1 Introduction

In this chapter, we examine the impact of non-differential outcome misclassification on odds ratios estimated from pair-matched case-control studies and propose a Bayesian approach to adjust these estimates for misclassification biases. Following an introduction of preliminary concepts and notation in Section 2.2.1, we begin by examining the impact of pair-matched case-control sampling from an imperfect outcome variable on the misclassification probabilities in Section 2.2.2. Section 2.2.3 gives a brief introduction to the analysis of matched case-control studies under perfect outcome ascertainment. In Section 2.2.4, we quantify the bias in odds ratios when standard estimators are applied in the presence of misclassified outcomes. Section 2.2.5 outlines a Bayesian model to estimate exposure-outcome odds ratios for a binary exposure variable from misclassified data. The model’s performance is illustrated on simulated data in Section 2.3. Finally, we apply the model to ProMS study data in Section 2.4 to investigate the presence of ten morbidities prior to the first recognized sign of MS and conclude the chapter with a discussion.

2.2 Methods

2.2.1 Notation and preliminaries

Throughout this chapter, we consider the setting of a pair-matched case-control study with focus on the association between a binary exposure variable E and the outcome D. We consider the outcome to be a disease, although generally any binary variable is possible.
The true disease status of the study participants is not directly observable, but is assessed via a potentially imperfect classifier. Examples of such classifiers could include a diagnostic test or, in the context of administrative database studies, a case definition. In place of $D$, the classifier produces an apparent disease status, denoted $D^*$, which may not be consistent with the true disease status for a subset of study participants. The extent of mismeasurement in $D^*$ is commonly quantified by sensitivity ($SN$) and specificity ($SP$), defined as

\[ SN = P(D^* = 1 \mid D = 1), \qquad SP = P(D^* = 0 \mid D = 0), \]

or alternatively by the positive ($pp$) and negative ($np$) predictive value, defined as

\[ pp = P(D = 1 \mid D^* = 1), \qquad np = P(D = 0 \mid D^* = 0). \]

Given the disease prevalence in the population, $P(D = 1)$, the predictive values can be related to sensitivity and specificity via the following equations,

\[ pp = \frac{SN\, P(D = 1)}{SN\, P(D = 1) + (1 - SP)(1 - P(D = 1))}, \tag{2.1} \]

\[ np = \frac{SP\, (1 - P(D = 1))}{(1 - SN)\, P(D = 1) + SP\, (1 - P(D = 1))}. \tag{2.2} \]

In this chapter, the classification mechanism is assumed to be non-differential, that is, the misclassification probability does not depend on factors other than the true disease status. We will refer to a participant with positive apparent disease status as an apparent case, and to a participant with negative apparent disease status as an apparent control.

Under matched sampling using the imperfect disease status $D^*$, apparent cases are selected from the population of interest, and apparent controls are matched to cases on a set of possible confounders, $U$, thought to be associated with both the disease and the exposure. In this chapter, we focus on a fixed 1:1 matching ratio and assume that all measured confounders are used during the matching process. Exposure $E$ and matching factors $U$ are assumed to be measured without error.
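Equations (2.1) and (2.2) are straightforward to evaluate numerically. The following Python sketch is illustrative only: the function name `predictive_values` and the input values (sensitivity 0.9, specificity 0.99, prevalence 0.002) are assumptions chosen to mimic a rare disease, not quantities taken from the ProMS study.

```python
def predictive_values(sn, sp, prevalence):
    """Positive and negative predictive value from Equations (2.1) and (2.2)."""
    pp = sn * prevalence / (sn * prevalence + (1 - sp) * (1 - prevalence))
    npv = sp * (1 - prevalence) / ((1 - sn) * prevalence + sp * (1 - prevalence))
    return pp, npv


# Hypothetical rare-disease setting (illustrative values, not study estimates).
pp, npv = predictive_values(sn=0.9, sp=0.99, prevalence=0.002)
print(f"pp = {pp:.3f}, np = {npv:.5f}")
```

In a setting like this one, a highly specific classifier can still have a modest positive predictive value, while the negative predictive value stays close to 1.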
Let $(E_{1k}, E_{2k})$ denote the exposure status of the $k$-th of $n$ apparent case-control pairs, $(D_{1k}, D_{2k})$ the true disease status and $(D^*_{1k}, D^*_{2k})$ the apparent disease status. For all $k$, notice that $(D^*_{1k} = 1, D^*_{2k} = 0)$, but that $(D_{1k} = 1, D_{2k} = 0)$ may not hold when outcomes are subject to misclassification. Further, we denote $\theta_{ij}$ as the probability of observing a matched pair with $(E_1 = i, E_2 = j)$ and $n_{ij}$ as the number of pairs with $(E_1 = i, E_2 = j)$. We denote $I$ as an indicator variable taking a value of 1 if a participant is selected for the sample and 0 otherwise.

2.2.2 Misclassification probabilities under matched case-control sampling

Models adjusting statistical analyses for outcome misclassification commonly rely on knowledge about the sensitivity and specificity of the classification mechanism for identifiability, either by assuming them to be exactly known [49] or via informative prior distributions [54]. In practice, such knowledge may be available from validation studies, where the disease status produced by the imperfect classifier is compared against a gold standard reference for a subset of study participants. Ideally, these validation studies are based on a subset that is representative of the population, so that population estimates of $SN$ and $SP$ are produced. In this section, we examine the impact of matched sampling from an imperfect disease indicator on the classification probabilities in the sample, $P(D^* = i \mid D = i, I = 1)$, $i = 0, 1$.

When the selection of study participants is irrespective of the apparent disease status $D^*$, as is the case for cohort studies, sensitivity and specificity in the sample coincide with population values and estimates of $SN$ and $SP$ are valid to adjust inferences for outcome misclassification. However, as pointed out by Jurek et al. [42], the same does not hold true for studies involving outcome-related selection such as matched or unmatched case-control sampling.
Because the sampling fraction of apparent cases typically exceeds that of the apparent controls, participants with $D^* = 1$ are overrepresented in a case-control sample relative to the population, leading to higher sensitivity in the sample compared to the population value. Conversely, with $D^* = 0$ underrepresented, the specificity in the case-control sample decreases. As a result, population estimates of $SN$ and $SP$ are no longer valid to adjust statistical analyses of case-control data for misclassification.

In some applications, it may be possible to calculate sample values of sensitivity and specificity from the population values $SN$ and $SP$. Under unmatched sampling, such calculations require knowledge of the sampling fractions for apparent cases and controls, both of which may be readily available in administrative database studies. Denoting these quantities by $\pi_i = P(I = 1 \mid D^* = i)$, $i = 0, 1$, sample values are then obtainable via

\[ P(D^* = 1 \mid D = 1, I = 1) = \frac{\pi_1\, SN}{\pi_1\, SN + \pi_0 (1 - SN)}, \]

\[ P(D^* = 0 \mid D = 0, I = 1) = \frac{\pi_0\, SP}{\pi_1 (1 - SP) + \pi_0\, SP} \]

[42]. In the matched case, however, evaluation of the above expressions is not easily accomplished. With controls selected from the strata defined by the matching variables, the sampling fraction of controls varies among the strata and calculation of $\pi_0$ would require knowledge of each sampling fraction in addition to the strata distribution. For all but the simplest matching schemes, this soon becomes infeasible, making sample values of sensitivity and specificity unavailable for misclassification adjustments.

A closer look at these equations further highlights that judging the severity of misclassification in a case-control sample by population sensitivity and specificity can be severely misleading under outcome misclassification. Consider, for instance, a classifier with near-perfect accuracy of $SN = 1$ and $SP = 0.99$ applied in a study where 100% of apparent cases, but only 1% of apparent controls are selected.
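The two sample-level expressions can be evaluated for the classifier just described (SN = 1, SP = 0.99, all apparent cases and 1% of apparent controls sampled). The sketch below is a minimal illustration under unmatched sampling; the function name `sample_sn_sp` and its argument names are assumptions introduced here.

```python
def sample_sn_sp(sn, sp, pi1, pi0):
    """Sensitivity and specificity within an unmatched case-control sample.

    pi1 and pi0 are the sampling fractions of apparent cases (D* = 1)
    and apparent controls (D* = 0), respectively.
    """
    sn_sample = pi1 * sn / (pi1 * sn + pi0 * (1 - sn))
    sp_sample = pi0 * sp / (pi1 * (1 - sp) + pi0 * sp)
    return sn_sample, sp_sample


# Setting from the text: SN = 1, SP = 0.99, 100% of apparent cases
# and 1% of apparent controls are sampled.
sn_s, sp_s = sample_sn_sp(sn=1.0, sp=0.99, pi1=1.0, pi0=0.01)
print(f"sample SN = {sn_s:.2f}, sample SP = {sp_s:.2f}")
```

Setting both sampling fractions to the same value returns the population sensitivity and specificity, which corresponds to selection that does not depend on the apparent disease status.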
Evaluation of the above equations then gives a sample sensitivity of 1, suggesting that all apparent controls are correctly classified, yet a sample specificity of 0.50. Such a large degree of misclassification can quickly decimate the number of true cases in the sample and lead to a considerable decrease in efficiency.

Alternative measures of misclassification that are less impacted by outcome-related selection are the positive and negative predictive values. While predictive values are a function of the disease prevalence and are therefore considered less stable when transferred across populations, unmatched case-control sampling will not lead to changes in negative and positive predictive value relative to their population values [42]. Under matched sampling, the same remains true for the positive predictive value as matched and unmatched designs do not differ in the selection of the case group, but as shown in Appendix A.1, this result no longer holds for the negative predictive value.

To illustrate the difference, consider the case of a single binary confounder and a reasonably accurate classifier with $pp > (1 - np)$. Assuming a positive association between $U$ and $D$, matched sampling of controls increases the proportion of confounder-positive participants in the sample relative to the source population, and hence enriches the apparent control group with true cases in comparison to unmatched sampling. As a result, the negative predictive value among the matched apparent controls will decrease relative to the negative predictive value among the population of apparent controls. In general, the discrepancy between the two values will depend upon the $D$-$U$ association, $pp$, $np$ and the prevalence of the confounder $U$.

Although this result implies that population estimates of a classifier’s negative predictive value are also not applicable to adjust inferences under matched case-control designs, discrepancies between the two values are often small in practice.
Unless high $D$-$U$ associations are present, it is our experience that for rare diseases, the population value $np$ provides a close approximation to the sample value given that near-perfect values of $np$ are common under these circumstances. Therefore, population values of the negative predictive value may still be useful in misclassification adjustments of matched case-control data, particularly in a Bayesian context where possible shifts due to sampling can be acknowledged by increasing the variability of this parameter’s prior distribution.

While models in the misclassification literature tend to rely on sensitivity and specificity to adjust statistical inferences, better stability under matched case-control sampling makes predictive values the preferred measure of misclassification in the context of this chapter. The derivations following in Sections 2.2.4 and 2.2.5 also reveal that predictive values arise naturally in the parametrization of our proposed model, and we shall retain this parametrization given that sample values of sensitivity and specificity will generally not be available in studies involving administrative databases. For notational convenience, we will refer to the negative predictive value among matched controls as $np$ for the remainder of this chapter.

2.2.3 Analysis under perfect outcome classification

We begin by considering an analysis of pair-matched case-control data for the scenario of perfect disease ascertainment. In the standard case of $D$ being fully observable, that is $(D_{1k} = 1, D_{2k} = 0)$ for all $k$, pair-matched data is commonly represented via the following risk model

\[ \mathrm{logit}\{P(E_{ik} = 1 \mid D_{ik} = d, b_k)\} = b_k + \delta d, \qquad i = 1, 2, \tag{2.3} \]

where $b_k$ is a pair-specific random effect intended to capture the dependencies within the pairs. The odds ratio of exposure between cases and controls, $OR$, appears in the model as the parameter $\exp(\delta)$.
Prescott and Garthwaite [58] show that under the assumption of conditional independence of $E_{1k}$ and $E_{2k}$ given $b_k$, $OR$ can be expressed as

\[ OR = \frac{P(E_1 = 1, E_2 = 0)}{P(E_1 = 0, E_2 = 1)} = \frac{\theta_{10}}{\theta_{01}}, \tag{2.4} \]

without specification of a random effect distribution, and can be estimated by the ratio of discordant pairs,

\[ \widehat{OR} = \frac{n_{10}}{n_{01}}. \tag{2.5} \]

The same estimator also arises as the maximum likelihood estimator from conditional logistic regression [1]. In a Bayesian framework, a standard approach assumes a multinomial model for the cell counts $(n_{11}, n_{00}, n_{10}, n_{01})$ with a Dirichlet prior on the cell probabilities,

\[
\begin{aligned}
(n_{11}, n_{00}, n_{10}, n_{01}) &\sim \mathrm{Multinomial}\big(n, (\theta_{11}, \theta_{00}, \theta_{10}, \theta_{01})'\big), \\
(\theta_{11}, \theta_{00}, \theta_{10}, \theta_{01})' &\sim \mathrm{Dir}(\mathbf{1}).
\end{aligned}
\tag{2.6}
\]

2.2.4 Bias under outcome misclassification

When $D$ is unobserved and classification is based on the imperfect surrogate $D^*$, $(D_{1k} = 1, D_{2k} = 0)$ no longer holds, resulting in a biased estimate of $OR$ if the estimator of Equation (2.5) is applied. To quantify the magnitude and direction of this bias, we consider the representation of $OR$ in Equation (2.4) and begin by examining the numerator, $\theta_{10}$. Under outcome misclassification, the event captured by this probability, $(E_1 = 1, E_2 = 0)$, can now arise not only under the anticipated condition $(D_1 = 1, D_2 = 0)$, but also under $(D_1 = 1, D_2 = 1)$, $(D_1 = 0, D_2 = 0)$ and $(D_1 = 0, D_2 = 1)$.
We introduce the following notation

\[ \theta_{ij|lm} = P(E_1 = i, E_2 = j \mid D_1 = l, D_2 = m), \]

to differentiate between these four scenarios and note that under perfect disease ascertainment, the odds ratio of Equation (2.4) is equivalent to

\[ OR = \theta_{10|10} / \theta_{01|10}. \]

Notice that by dropping the apparent disease status, $D^*_1$ and $D^*_2$, from the conditioning statement in the definition of $\theta_{ij|lm}$, it is implicitly assumed that the misclassification mechanism is non-differential with respect to any factors influencing the probability of exposure, including measured and unmeasured confounders.

Under misclassification of the disease, the numerator of Equation (2.4) no longer equals $\theta_{10|10}$, but becomes a weighted sum of the four conditional probabilities $\theta_{10|lm}$, $l, m = 0, 1$,

\[
\begin{aligned}
\theta_{10} ={}& \theta_{10|10}\, P(D_1 = 1, D_2 = 0 \mid D^*_1 = 1, D^*_2 = 0) + \theta_{10|11}\, P(D_1 = 1, D_2 = 1 \mid D^*_1 = 1, D^*_2 = 0) \\
&+ \theta_{10|01}\, P(D_1 = 0, D_2 = 1 \mid D^*_1 = 1, D^*_2 = 0) + \theta_{10|00}\, P(D_1 = 0, D_2 = 0 \mid D^*_1 = 1, D^*_2 = 0).
\end{aligned}
\tag{2.7}
\]

Under the assumptions that

1. the true disease states $D_1$ and $D_2$ are independent conditional on the apparent disease states,
2. the true disease state of each individual does not depend on the apparent disease state of its matched pair member, given its own apparent disease state,

the weights can be manipulated to yield a function of the positive and negative predictive value among matched apparent cases and controls,

\[
\begin{aligned}
P(D_1 = i, D_2 = j \mid D^*_1 = 1, D^*_2 = 0) &\overset{1.}{=} P(D_1 = i \mid D^*_1 = 1, D^*_2 = 0)\, P(D_2 = j \mid D^*_1 = 1, D^*_2 = 0) \\
&\overset{2.}{=} P(D_1 = i \mid D^*_1 = 1)\, P(D_2 = j \mid D^*_2 = 0),
\end{aligned}
\]

for $i, j = 0, 1$. Using a similar argument for $\theta_{01}$, the numerator and denominator of Equation (2.4) can be expressed as

\[ \theta_{10} = pp\, np\, \theta_{10|10} + pp(1 - np)\, \theta_{10|11} + (1 - pp)(1 - np)\, \theta_{10|01} + (1 - pp)\, np\, \theta_{10|00}, \tag{2.8} \]

\[ \theta_{01} = pp\, np\, \theta_{01|10} + pp(1 - np)\, \theta_{01|11} + (1 - pp)(1 - np)\, \theta_{01|01} + (1 - pp)\, np\, \theta_{01|00}. \tag{2.9} \]

We denote this new ratio of $\theta_{10}$ and $\theta_{01}$ as $OR^*$ to emphasize the difference with $OR$.

The relationship between $OR$ and $OR^*$ can be brought to light by recognizing the connections among the conditional probabilities $\theta_{ij|lm}$.
For correctly classified pairs, the ratio of $\theta_{10|10}$ and $\theta_{01|10}$ yields $OR$ under the logistic model of Equation (2.3), resulting in

\[ \theta_{10|10} = \theta_{01|10}\, OR. \tag{2.10} \]

Further, among pairs with true disease states that are opposite to the apparent disease states, i.e. $(D_1 = 0, D_2 = 1)$, the odds ratio of exposure is equal to the inverse of $OR$, and

\[ \theta_{01|01} = \theta_{10|01}\, OR. \tag{2.11} \]

Lastly, following directly from the definition,

\[ \theta_{10|01} = \theta_{01|10}, \quad \theta_{10|10} = \theta_{01|01}, \quad \theta_{01|00} = \theta_{10|00} \quad \text{and} \quad \theta_{01|11} = \theta_{10|11}. \tag{2.12} \]

Evaluating Equation (2.4) with these equalities, we obtain

\[
OR^* = OR\; \frac{1 + \left( \frac{1 - np}{np}\, a + \frac{1 - pp}{pp}\, c \right) + \frac{(1 - pp)(1 - np)}{pp\, np}\, b}{1 + OR \left( \frac{1 - np}{np}\, a + \frac{1 - pp}{pp}\, c \right) + OR^2\, \frac{(1 - pp)(1 - np)}{pp\, np}\, b},
\tag{2.13}
\]

where

\[ a = \frac{\theta_{10|11}}{\theta_{10|10}}, \qquad b = \frac{\theta_{10|01}}{\theta_{10|10}} = \frac{1}{OR}, \qquad c = \frac{\theta_{10|00}}{\theta_{10|10}}. \]

From this expression, it is easy to see that $OR^* \le OR$ when $OR \ge 1$ and $OR^* > OR$ otherwise.

In the case of a reasonably accurate classifier with $pp > (1 - np)$, however, $OR$ and $OR^*$ will both be greater or less than 1, implying that the directions of association indicated by $OR$ and $OR^*$ are in agreement. For $OR > 1$, $\theta_{01|01} = \theta_{10|10} > \theta_{01|10} = \theta_{10|01}$, and the sum of the first and third summand of Equation (2.8) exceeds the sum of the first and third summand of Equation (2.9),

\[ pp\, np\, \theta_{10|10} + \underbrace{(1 - pp)(1 - np)}_{\,<\; pp\, np}\, \theta_{10|01} > pp\, np\, \theta_{01|10} + (1 - pp)(1 - np)\, \theta_{01|01}. \]

Recalling that the second and fourth summand of Equation (2.8) and Equation (2.9) take equal values, it follows that $OR^* > 1$. The result that $OR^* \le 1$ when $OR \le 1$ can be derived analogously. Therefore, the use of Equation (2.5) in the estimation of $OR$ will tend to produce estimates that are biased towards one in the presence of outcome misclassification.

The magnitude of the discrepancy between $OR^*$ and $OR$ increases with increasing $OR$ and decreasing $np$ and $pp$, but is generally difficult to elicit from Equation (2.13) due to the fraction’s complicated dependence on several parameters.
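The attenuation described above can be checked numerically. The following Python sketch computes OR* in two ways: as the ratio of Equations (2.8) and (2.9) after substituting the equalities (2.10) through (2.12), and via the closed form of Equation (2.13). The function names and all numerical inputs (in particular the conditional cell probabilities 0.10, 0.15 and 0.12) are hypothetical values chosen for illustration only.

```python
def or_star(odds_ratio, pp, npv, t01_10, t10_11, t10_00):
    """OR* = theta_10 / theta_01 from Equations (2.8)-(2.9), using (2.10)-(2.12)."""
    t10_10 = t01_10 * odds_ratio  # Equation (2.10)
    theta10 = (pp * npv * t10_10 + pp * (1 - npv) * t10_11
               + (1 - pp) * (1 - npv) * t01_10 + (1 - pp) * npv * t10_00)
    theta01 = (pp * npv * t01_10 + pp * (1 - npv) * t10_11
               + (1 - pp) * (1 - npv) * t10_10 + (1 - pp) * npv * t10_00)
    return theta10 / theta01


def or_star_closed_form(odds_ratio, pp, npv, t01_10, t10_11, t10_00):
    """The same quantity via the closed form of Equation (2.13)."""
    t10_10 = t01_10 * odds_ratio
    a, b, c = t10_11 / t10_10, 1 / odds_ratio, t10_00 / t10_10
    mixed = (1 - npv) / npv * a + (1 - pp) / pp * c
    both = (1 - pp) * (1 - npv) / (pp * npv) * b
    return (odds_ratio * (1 + mixed + both)
            / (1 + odds_ratio * mixed + odds_ratio**2 * both))


# Hypothetical inputs: true OR = 2 with an imperfect classifier.
args = dict(odds_ratio=2.0, pp=0.7, npv=0.99, t01_10=0.10, t10_11=0.15, t10_00=0.12)
print(or_star(**args), or_star_closed_form(**args))
```

For these inputs, both routes give the same value, which lies between 1 and the true odds ratio of 2, in line with the bias towards one noted above.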
To shed some light on the difference between the two odds ratios, the ratio $OR^*/OR$ is displayed in Figure 2.1 for three values of the odds ratio, $OR = 0.7$, $OR = 1.5$ and $OR = 3$, in the case of a binary confounder $U$ with prevalence 0.2 and $E$-$U$ odds ratio 1.5, for a range of positive predictive values and three negative predictive values.

[Figure 2.1 omitted: one panel per OR (0.7, 1.5, 3); x-axis pp (0.5 to 1.0); y-axis OR*/OR; one curve per np (0.7, 0.9, 1).]

Figure 2.1: Relative size of $OR^*$ from pair-matched case-control data subject to outcome misclassification under three scenarios of exposure-disease association ($OR = 0.7$, $OR = 1.5$, $OR = 3$), stratified by positive ($pp$) and negative predictive values ($np$). Pairs are matched on a binary confounder with prevalence 0.2 and exposure-confounder odds ratio 1.5.

2.2.5 A model for the analysis of matched case-control data subject to outcome misclassification

Based on the derivations of Section 2.2.4, we now propose a Bayesian model for the analysis of pair-matched studies in the presence of disease misclassification. Under the assumption of independence between pairs, cell counts $n_{ij}$ follow a multinomial model

\[ (n_{11}, n_{00}, n_{10}, n_{01}) \sim \mathrm{Multinomial}\big(n, (\theta_{11}, \theta_{00}, \theta_{10}, \theta_{01})'\big), \tag{2.14} \]

where, using Equations (2.8) and (2.9) along with Equations (2.10) through (2.12), the cell probabilities $\theta_{10}$ and $\theta_{01}$ take the form

\[ \theta_{10} = pp\, np\, \theta_{01|10}\, OR + pp(1 - np)\, \theta_{10|11} + (1 - pp)(1 - np)\, \theta_{01|10} + (1 - pp)\, np\, \theta_{10|00}, \tag{2.15} \]

\[ \theta_{01} = pp\, np\, \theta_{01|10} + pp(1 - np)\, \theta_{10|11} + (1 - pp)(1 - np)\, \theta_{01|10}\, OR + (1 - pp)\, np\, \theta_{10|00}. \tag{2.16} \]

Taking the difference between (2.15) and (2.16) gives

\[ \theta_{10} - \theta_{01} = \theta_{01|10}\, (OR - 1)\, \big(pp\, np - (1 - pp)(1 - np)\big), \tag{2.17} \]

and reveals that $OR$ can be expressed as a function of the difference between two cell probabilities, the predictive values and $\theta_{01|10}$. If $pp$, $np$ or $\theta_{01|10}$ are unknown in addition to $OR$, $OR$ is not identifiable from Equation (2.17).
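The lack of identifiability can be made concrete with a short numerical sketch. Solving Equation (2.17) for OR shows that a single value of the difference between the two discordant cell probabilities is compatible with different odds ratios, depending on the value assumed for pp. The function names and the inputs (a difference of 0.05, a conditional cell probability of 0.10, np = 1) are hypothetical and serve as an illustration only.

```python
def discordant_difference(odds_ratio, pp, npv, t01_10):
    """theta_10 - theta_01 as given by Equation (2.17)."""
    return t01_10 * (odds_ratio - 1) * (pp * npv - (1 - pp) * (1 - npv))


def implied_or(difference, pp, npv, t01_10):
    """Solve Equation (2.17) for OR, given values of pp, np and theta_{01|10}."""
    return 1 + difference / (t01_10 * (pp * npv - (1 - pp) * (1 - npv)))


diff = 0.05  # hypothetical value of theta_10 - theta_01
for pp in (1.0, 0.75, 0.5):
    # The same difference maps to a larger OR as the assumed pp decreases.
    print(pp, implied_or(diff, pp=pp, npv=1.0, t01_10=0.10))
```

Without outside information on the predictive values and on the conditional cell probability, the data alone cannot distinguish between these candidate odds ratios.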
Under the Bayesian framework, however, estimation of $OR$ is still possible, provided that knowledge about $pp$, $np$ and $\theta_{01|10}$ is available to inform the prior distributions of these parameters. For $pp$ and $np$, such information may be based on expert knowledge about the classifier’s properties or the results of a previous validation study; for $\theta_{01|10}$, it is obtainable from a validation cohort as this parameter represents the probability of $(E_1 = 0, E_2 = 1)$ among correctly classified pairs.

Note that the two remaining cell probabilities in Model (2.14), $\theta_{11}$ and $\theta_{00}$, cannot be simplified into an expression involving the same set of parameters as those appearing in Equation (2.17). Thus, cell counts $n_{00}$ and $n_{11}$ of Model (2.14) will be collapsed into a single cell and not utilized in the estimation of $OR$. This is similar to the standard estimator of Equation (2.5), which considers the counts of discordant pairs only.

2.2.6 Prior distributions

Because of summation constraints on the model parameters, that is $\sum_{ij} \theta_{ij|lk} = 1$ for $k, l = 0, 1$, prior distributions are specified for the parameterization of the cell probabilities given by Equations (2.8) and (2.9). Using the equalities of Equation (2.12) and collapsing cells with concordant pairs, the model is then parameterized by $(pp, np, \theta_{01|10}, \theta_{10|10}, \theta_{10|11}, \theta_{10|00})$, for which we set up a joint prior distribution as follows.

We assume that the parameters $np$, $pp$ and $\theta_{01|10}$ are independent a priori, and given the non-identifiability of the model, choose informative Beta distributions with hyperparameters $\alpha_k$, $\beta_k$ and $\gamma_k$, $k = 1, 2$, to be determined based on previous knowledge,

\[ pp \sim \mathrm{Beta}(\alpha_1, \alpha_2), \tag{2.18} \]

\[ np \sim \mathrm{Beta}(\beta_1, \beta_2), \tag{2.19} \]

\[ \theta_{01|10} \sim \mathrm{Beta}(\gamma_1, \gamma_2). \tag{2.20} \]

For the distributions of $\theta_{10|10}$, $\theta_{10|00}$ and $\theta_{10|11}$, we choose weakly informative uniform distributions,

\[ \theta_{10|10} \mid \theta_{01|10} \sim \mathrm{Unif}(0, 1 - \theta_{01|10}), \tag{2.21} \]

\[ \theta_{10|00} \sim \mathrm{Unif}(0, 0.5), \tag{2.22} \]

\[ \theta_{10|11} \sim \mathrm{Unif}(0, 0.5), \tag{2.23} \]

where the upper bound of 0.5 for the latter two distributions follows from $\sum_{ij} \theta_{ij|ll} = 1$ and $\theta_{10|ll} = \theta_{01|ll}$, $l = 0, 1$.

Notice that the choice of prior for $\theta_{01|10}$ and $\theta_{10|10}$ as a marginal and conditional Beta distribution provides more flexibility over a Dirichlet prior on $(\theta_{01|10}, \theta_{10|10}, \theta_{11|10} + \theta_{00|10})$ as it allows for one additional hyperparameter. Without this flexibility, specifying expectation and variance of the marginal distribution of $\theta_{01|10}$ constrains two of the three Dirichlet parameters, and will in turn lead to informed marginals of $\theta_{10|10}$ and $\theta_{11|10} + \theta_{00|10}$. This implied prior distribution on $\theta_{10|10}$ and $\theta_{11|10} + \theta_{00|10}$ may not be consistent with the data at hand, and hence has the potential to distort inferences about $OR$.

Because predictive values can be expressed as a function of disease prevalence, sensitivity and specificity, a re-parameterization of the model in terms of the misclassification probabilities could be accomplished by replacing the prior distribution of $pp$ and $np$ with priors on disease prevalence, sensitivity and specificity. However, this requires prior information about the sensitivity, specificity and prevalence in the matched sample, not the population, and may thus be of limited use in practice.

Model (2.14) and prior distributions (2.18) through (2.23) specify a posterior distribution of the model parameters $(pp, np, \theta_{01|10}, \theta_{10|10}, \theta_{10|11}, \theta_{10|00})$ and $OR$, which can be approximated using Markov chain Monte Carlo (MCMC) techniques.
An implementation of posterior sampling in JAGS [57] is available from the author’s GitHub repository [39].

2.2.7 Determination of hyperparameters

For the prior distribution of $pp$, availability of a point estimate $\widehat{pp}$ and standard error $\mathrm{se}(\widehat{pp})$ determines $\alpha_k$, $k = 1, 2$, via $E(pp) = \widehat{pp}$ and $\mathrm{sd}(pp) = \mathrm{se}(\widehat{pp})$ as

\[ \alpha_1 = \left( \frac{1 - \widehat{pp}}{\mathrm{se}(\widehat{pp})^2} - \frac{1}{\widehat{pp}} \right) \widehat{pp}^{\,2} \quad \text{and} \quad \alpha_2 = \alpha_1 \left( \frac{1}{\widehat{pp}} - 1 \right). \]

The hyperparameters for the distribution of $np$, $\beta_k$, $k = 1, 2$, can be calculated in a similar fashion, but recall from Section 2.2.2 that an increase in prior variance of $np$ may be required to accommodate a possible shift of $np$ from the population value due to case-control sampling.

Denoting the number of matched case-control pairs with exposure states $(E_1 = 0, E_2 = 1)$ in a validation cohort of size $m$ as $m_{01}$, the hyperparameters $\gamma_k$, $k = 1, 2$, for the prior distribution of $\theta_{01|10}$ are obtained from the model

\[
\begin{aligned}
m_{01} \mid \theta_{01|10} &\sim \mathrm{Bin}(m, \theta_{01|10}), \\
\theta_{01|10} &\sim \mathrm{Unif}(0, 1).
\end{aligned}
\tag{2.24}
\]

This implies a posterior distribution of $\theta_{01|10}$ of type $\mathrm{Beta}(m_{01} + 1, m - m_{01} + 1)$ and hence, $\gamma_1 = m_{01} + 1$ and $\gamma_2 = m - m_{01} + 1$.

2.2.8 The case of perfect disease ascertainment

In the special case of perfect disease ascertainment, the proposed model closely resembles the standard analysis of Model (2.6), provided that the counts of concordant pairs, $n_{11}$ and $n_{00}$, are also collapsed into a single cell. For $pp = np = 1$, Equations (2.15) and (2.16) take the form $\theta_{10} = \theta_{10|10}$ and $\theta_{01} = \theta_{01|10}$, respectively, and result in the same representation of $OR$ as given in Equation (2.4),

\[ OR = \frac{\theta_{10|10}}{\theta_{01|10}} = \frac{\theta_{10}}{\theta_{01}}. \]

The joint prior distribution of the proposed model, however, differs from the Dirichlet prior on the cell probabilities suggested by Model (2.6).
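As a brief aside on the hyperparameter calculations of Section 2.2.7, the moment-matching step and the validation-cohort update of Model (2.24) can be sketched in a few lines of Python. The function names and the inputs (a point estimate of 0.7 with standard error 0.02, and 20 pairs with exposure states (0, 1) among 200 validation pairs) are hypothetical and are not estimates from the ProMS study.

```python
import math


def beta_from_mean_sd(mean, sd):
    """Beta hyperparameters (alpha1, alpha2) with the given mean and standard deviation."""
    alpha1 = ((1 - mean) / sd**2 - 1 / mean) * mean**2
    alpha2 = alpha1 * (1 / mean - 1)
    return alpha1, alpha2


def beta_from_validation(m, m01):
    """Beta posterior of theta_{01|10} under Model (2.24): (m01 + 1, m - m01 + 1)."""
    return m01 + 1, m - m01 + 1


a1, a2 = beta_from_mean_sd(mean=0.7, sd=0.02)  # hypothetical estimate of pp
g1, g2 = beta_from_validation(m=200, m01=20)   # hypothetical validation counts

# Recover mean and standard deviation from (a1, a2) to confirm the match.
mean = a1 / (a1 + a2)
sd = math.sqrt(a1 * a2 / ((a1 + a2) ** 2 * (a1 + a2 + 1)))
print(a1, a2, mean, sd, g1, g2)
```

The moment-matching formulas return positive hyperparameters only when the squared standard error is smaller than the product of the point estimate and one minus the point estimate, so very diffuse inputs need to be checked before use.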
To recognize this difference, first note that a Dirichlet prior $\mathrm{Dir}(\alpha)$ on $(\theta_{11} + \theta_{00}, \theta_{10}, \theta_{01})'$ with concentration parameter $\alpha = (1, 1, 1)'$ can be represented by a marginal and conditional Beta distribution,

\[ \theta_{01} \sim \mathrm{Beta}(1, 2), \tag{2.25} \]

\[ \theta_{10} \mid \theta_{01} \sim \mathrm{Unif}(0, 1 - \theta_{01}), \tag{2.26} \]

and setting $\theta_{11} + \theta_{00} = 1 - \theta_{10} - \theta_{01}$ [27].

In our model, $\theta_{01} = \theta_{01|10}$ also implies a prior distribution of Beta type for $\theta_{01}$; however, the hyperparameters are determined by the validation data and are not fixed at $\gamma_1 = 1$, $\gamma_2 = 2$. With $\theta_{10} = \theta_{10|10}$ and $\theta_{01} = \theta_{01|10}$, the conditional prior distribution of $\theta_{10|10}$ given in (2.21) is consistent with distribution (2.26) of the standard analysis. Therefore, the joint prior distribution of the proposed model will generally differ from the standard analysis if the parameters $\gamma_1$ and $\gamma_2$ are determined from the validation cohort, but is equivalent if we set $\gamma_1 = 1$ and $\gamma_2 = 2$. Under $pp = np = 1$, the latter is possible as an informative prior for $\theta_{01|10}$ is no longer required for the estimation of $OR$.

2.3 Simulation study

In this section, we examine the model’s ability to adjust estimates of exposure-disease associations under outcome misclassification, and compare properties such as length and coverage of the 95% credible interval to those of the naive analysis of Model (2.6).

In our simulation study, we generated $n$ apparent case-control pairs and $m$ true case-control pairs, matched on discretized values of a continuous confounder $U$. We considered different settings of $OR$, cohort sizes $(n, m)$, prior uncertainty about the predictive values ($\mathrm{se}(\widehat{pp})$, $\mathrm{se}(\widehat{np})$), underestimation of $pp$ in the prior input ($\Delta = pp - \widehat{pp}$), exposure prevalence, controlled via intercept $a_0$ in the exposure model (A.2), as well as different specificities of the classifier ($SP$). Because the simulation parameters were chosen to generate a rare disease, even low values of $SN$ lead to only minor degrees of contamination in the apparent control group.
Therefore, we limited our simulation study to the setting of $SN = 1$, hence $np = 1$. The details of the data generation process are given in Appendix A.2. For both the naive and the adjusted analysis, we evaluated (1) the average median of the posterior distribution of $OR$, (2) the average length of the 95% equal-tailed posterior credible interval and (3) the coverage of the 95% posterior credible interval, calculated as the relative frequency of intervals containing the true value of $OR$.

The results of the simulation study are presented in Table 2.1 and Figures 2.2 and 2.3 for selected settings of the simulation parameters. From Table 2.1 we see that the naive analysis leads to the anticipated underestimation of $OR$ while the average posterior median of the adjusted analysis lies close to the true value of $OR$ for all parameter settings in the case of $\Delta = 0$. In contrast, the adjusted analysis leads to an overestimation of $OR$ when $\Delta = 0.04$ as $pp$ is underestimated in the prior input.

For all settings of the parameters, coverage of credible intervals is close to the nominal level of 95% for the adjusted analysis, even in the case of an underestimation of $pp$ with $\Delta = 0.04$. In contrast, the strong biases in the odds ratio of the naive analysis result in empirical coverage far below the nominal value, in particular under high misclassification where values approaching 0 are observed.

Table 2.1: Median of the posterior distribution of $OR$, empirical coverage and length of the 95% equal-tailed posterior credible interval under naive and adjusted analyses of matched case-control data for various settings of exposure-disease odds ratio ($OR$), underestimation of $pp$ in the prior input ($\Delta$), exposure prevalence (controlled via $a_0$) and specificity ($SP$), averaged over $R = 1000$ simulation runs. Positive predictive values ($pp$) are displayed for each setting of $SP$. Other simulation parameters are $n = 1000$, $m = 200$, $\mathrm{se}(\widehat{pp}) = \mathrm{se}(\widehat{np}) = 0.02$, $SN = 1$ ($np = 1$).

                                    naive                       adjusted
OR    Δ     exp(a0)  SP    pp     median  coverage  length    median  coverage  length
1.5   0     0.002    0.95  0.32   1.19    0.585     0.71      1.51    0.961     2.05
1.5   0     0.002    0.97  0.46   1.25    0.712     0.74      1.51    0.965     1.64
1.5   0     0.002    0.99  0.71   1.41    0.912     0.80      1.56    0.975     1.22
1.5   0     0.002    1.00  1.00   1.50    0.946     0.84      1.50    0.942     0.79
1.5   0     0.01     0.95  0.35   1.18    0.253     0.44      1.52    0.964     1.31
1.5   0     0.01     0.97  0.48   1.23    0.402     0.46      1.50    0.961     1.00
1.5   0     0.01     0.99  0.73   1.35    0.774     0.50      1.48    0.953     0.69
1.5   0     0.01     1.00  1.00   1.52    0.945     0.56      1.52    0.947     0.53
1.5   0.04  0.002    0.95  0.32   1.22    0.643     0.73      1.67    0.964     2.35
1.5   0.04  0.002    0.97  0.46   1.25    0.705     0.74      1.55    0.959     1.77
1.5   0.04  0.002    0.99  0.71   1.39    0.896     0.80      1.57    0.968     1.26
1.5   0.04  0.002    1.00  1.00   1.46    0.943     0.82      1.46    0.941     0.77
1.5   0.04  0.01     0.95  0.35   1.17    0.226     0.44      1.55    0.954     1.48
1.5   0.04  0.01     0.97  0.48   1.25    0.462     0.47      1.58    0.955     1.11
1.5   0.04  0.01     0.99  0.73   1.36    0.790     0.51      1.53    0.959     0.74
1.5   0.04  0.01     1.00  1.00   1.51    0.966     0.56      1.51    0.962     0.52
2     0     0.002    0.95  0.33   1.35    0.193     0.79      1.99    0.956     2.43
2     0     0.002    0.97  0.45   1.48    0.369     0.85      1.99    0.959     1.94
2     0     0.002    0.99  0.73   1.71    0.734     0.95      1.99    0.939     1.50
2     0     0.002    1.00  1.00   2.05    0.956     1.11      2.05    0.955     1.04
2     0     0.01     0.95  0.39   1.34    0.010     0.50      2.00    0.960     1.51
2     0     0.01     0.97  0.50   1.48    0.102     0.55      2.03    0.948     1.24
2     0     0.01     0.99  0.76   1.74    0.641     0.65      2.03    0.959     0.97
2     0     0.01     1.00  1.00   2.00    0.942     0.75      2.00    0.949     0.70
2     0.04  0.002    0.95  0.33   1.37    0.210     0.81      2.16    0.974     2.74
2     0.04  0.002    0.97  0.45   1.47    0.364     0.85      2.06    0.969     2.11
2     0.04  0.002    0.99  0.73   1.78    0.821     0.99      2.15    0.980     1.67
2     0.04  0.002    1.00  1.00   2.05    0.958     1.12      2.05    0.957     1.05
2     0.04  0.01     0.95  0.39   1.35    0.013     0.51      2.10    0.967     1.67
2     0.04  0.01     0.97  0.50   1.45    0.073     0.54      2.06    0.966     1.32
2     0.04  0.01     0.99  0.76   1.74    0.652     0.65      2.09    0.975     1.03
2     0.04  0.01     1.00  1.00   1.99    0.949     0.75      1.99    0.950     0.70
The lengths of the 95% posterior credible intervals are notably larger in the adjusted analysis as prior uncertainty about pp and θ01|10 is now reflected in the posterior distribution of OR.

As seen in Figures 2.2 and 2.3, the variability of the adjusted OR posterior distributions differs considerably across the different combinations of sample sizes m and n and specificity SP. The impact of cohort size n on posterior interval lengths is most notable for SP = 0.95 and 0.97, as high percentages of contamination in the apparent case group severely reduce the number of true cases in the sample. Changes in posterior variance due to increases in the validation cohort size m are smaller than those from increases in n, but are most prominent for low exposure prevalences given that the number of pairs with positive and negative exposure status, m01, tends to be small under these circumstances (Figure 2.3). In contrast, prior uncertainty about pp shows comparatively little impact on the length of posterior credible intervals (Figure 2.2).

2.4 Application to ProMS study data

We now illustrate the proposed method on our motivating dataset. MS cases were identified from British Columbia health administrative databases using an MS case definition of three or more MS-specific records. An MS-specific record may include a physician billing code for MS (ICD 9/10: 340/G35) in the Medical Services Plan database [8], a hospital separation record for MS in the Canadian Discharge Abstract database [11], or a record for an MS-specific immunomodulatory prescription in PharmaNet [9]. The three databases are linkable via a unique and lifelong identifier assigned to every individual covered under the Medical Services Plan, the medical insurance provider of approximately 95% of residents.
Data linkages were facilitated by Population Data BC, the pan-provincial data hub which allows researchers access to de-identified healthcare utilization records.

[Figure 2.2 (plot not reproduced): panels by SP = 0.95, 0.97, 0.99, 1 and by sd(pp) = 0.02, 0.04; x-axis: n = 200, 500, 1000, 4000; y-axis: posterior median; series: m = 100, 200, 500, 1000.]

Figure 2.2: Median, 2.5th and 97.5th percentile of the OR posterior averaged over 1000 simulation runs for selected settings of specificity (SP), stratified by main cohort size (n), validation cohort size (m) and prior standard deviation of pp (sd(p̂p)). The true value of OR is represented by the horizontal grey line. Other simulation parameters are a0 = log(0.01) in the exposure model (A.2), SN = 1 (np = 1) and ∆ = 0. Positive predictive values are pp = 0.39, 0.5, 0.76, 1 for SP = 0.95, 0.97, 0.99, 1, respectively.

The date of the first MS-specific or demyelinating disease-related record (ICD 9: 337, 323, 341; ICD 10: H46, G37.3, G36.9, G37.8, G36) is defined as a patient’s index date and marks the earliest recognition of MS, hence the end of the five-year prodromal phase under investigation. We focus our analysis on 6830 apparent cases with index dates between April 1, 1996, and December 31, 2013, to ensure data availability during the five years before the index date for all study participants. Eligible controls were required to have no demyelinating disease-related records, and up to five controls were matched to cases based on sex, birth year and three-digit postal code at index date. Individuals with one or two MS-specific records, or records for other demyelinating diseases, were ineligible for inclusion in the study. To minimize the chance of a healthcare contact outside of British Columbia, all study participants were further required to be enrolled with the Medical Services Plan for at least 90% of the year for each of the five years prior to the index date.
In the following analysis, we limit the number of controls per case to one in order to mimic the setting of a pair-matched study.

[Figure 2.3 (plot not reproduced): panels by SP = 0.95, 0.97, 0.99, 1 and by a0 = log(0.002), log(0.01); x-axis: n = 200, 500, 1000, 4000; y-axis labelled "Posterior mean"; series: m = 100, 200, 500, 1000.]

Figure 2.3: Median, 2.5th and 97.5th percentile of the OR posterior averaged over 1000 simulation runs for selected settings of specificity (SP), stratified by main cohort size (n), validation cohort size (m) and exposure prevalence (controlled via exposure model intercept a0). The true value of OR is represented by the horizontal grey line. Simulation parameters are sd(p̂p) = 0.02, SN = 1 (np = 1) and ∆ = 0. Positive predictive values are pp = 0.33, 0.45, 0.73, 1 (a0 = log(0.002)) and pp = 0.39, 0.5, 0.76, 1 (a0 = log(0.01)) for SP = 0.95, 0.97, 0.99, 1, respectively.

Presence or absence of a morbidity in the five-year prodromal phase was determined from administrative data using a list of administrative case definitions previously published and validated by Marrie et al. [53]. For example, hypertension was considered present if a study participant showed at least four hypertension-related physician billing codes or hospital separation records over a two-year period in the prodromal phase. Clearly, this approach suffers from similar limitations as the identification of MS cases from administrative records, but we shall consider the morbidity status of all participants as measured without error for the purpose of this analysis.

The positive predictive value of the MS case definition used in this study has been estimated as p̂p = 0.83 (95% CI: 0.82 - 0.85) in an external validation study [52].
The point estimate was used to determine the mean of the prior distribution of pp, while the prior standard deviation was increased to 0.03 to acknowledge possible differences between the validation cohort of Marrie et al. [52] and the British Columbia cohort, resulting in a Beta(130, 26.5) prior for pp. Motivated by the rarity of MS in British Columbia (prevalence of 0.2% reported by Kingwell et al. [43] for 2008) and the stringent eligibility criterion for the control group, we assumed that all apparent controls are correctly classified, and hence np = 1.

To inform the prior distribution of the model parameter θ01|10, we utilized a linkage between the administrative cohort and the British Columbia Multiple Sclerosis (BCMS) database, a provincial database maintained by the MS specialty clinics and housed at the University of British Columbia. The database contains information on individuals who came to one of the four MS clinics in British Columbia, and also includes individuals with definite MS, as confirmed by an MS specialist neurologist. Due to a sharp drop in the database’s capture of MS cases diagnosed after 2004, we restrict the following analysis to the subgroup of 3608 apparent MS cases in the administrative cohort with index dates between April 1, 1996, and December 31, 2004. Further, because of a lower average age at index date among BCMS cases relative to the administrative cohort, it was necessary to select a subgroup of 470 cases from the 854 cases captured by the BCMS database. Denoting p_j and m_j as the relative and absolute frequency of five-year age category j in the administrative and BCMS cohort, respectively, this was achieved by first determining the largest possible cohort size m as m = max{x ∈ ℕ : p_j x ≤ m_j for all j}, followed by random sampling of ⌊p_j m⌋ individuals in age category j. This subgroup, along with their matched controls, represents the validation cohort for the analysis.
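Two numerical steps described above can be sketched in a few lines: choosing Beta hyperparameters from a prior mean and standard deviation, and determining the age-proportional subsample sizes. This is an illustrative sketch only. It assumes the Beta prior was obtained by matching the first two moments (the text does not state the method), the function names are hypothetical, and the age-category counts are made-up numbers rather than study data.

```python
def beta_from_mean_sd(mean, sd):
    """Moment-match a Beta(a, b) distribution to a given mean and sd."""
    total = mean * (1 - mean) / sd**2 - 1  # a + b
    return mean * total, (1 - mean) * total


def age_matched_subsample_sizes(admin_counts, bcms_counts):
    """Largest m with p_j * m <= m_j for all j, and floor(p_j * m) per category.

    admin_counts: counts n_j per age category in the administrative cohort,
                  so that p_j = n_j / n.
    bcms_counts:  counts m_j per age category in the BCMS cohort.
    """
    n = sum(admin_counts)
    # p_j * x <= m_j  <=>  x <= m_j * n / n_j; integer division avoids
    # floating-point rounding at the boundary.
    m = min(m_j * n // n_j
            for n_j, m_j in zip(admin_counts, bcms_counts) if n_j > 0)
    sizes = [n_j * m // n for n_j in admin_counts]  # floor(p_j * m)
    return m, sizes


# Prior mean 0.83 and sd 0.03, as in the text.
a, b = beta_from_mean_sd(0.83, 0.03)

# Hypothetical counts for four age categories (not ProMS data).
m, sizes = age_matched_subsample_sizes([100, 300, 400, 200], [60, 150, 120, 40])
```

Under the moment-matching assumption the hyperparameters come out at roughly 129.3 and 26.5, close to the Beta(130, 26.5) prior quoted above. Because of the flooring, the category sizes can sum to slightly less than m in general.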
We compared the distributions of age, sex and year of index date among the administrative and validation cohort to rule out systematic differences for our primary confounders.

The morbidity status of the matched pairs in the administrative and validation cohort is presented in Table 2.2. For each morbidity, we generated MCMC samples in JAGS [57] from the posterior distribution of OR arising under Model (2.6) (naive), the proposed Model (2.14) (adjusted) and, for the sake of comparison, under Model (2.6) when only data on the validation cohort are considered (validation). For all analyses, the median, 2.5th and 97.5th percentile of the OR posterior are displayed in Figure 2.4 on a logarithmic scale.

Table 2.2: Morbidity status of matched pairs in the administrative and validation cohort for ten morbidities under investigation. For morbidities with cell counts of less than five pairs in the validation cohort, two cells were suppressed to fulfill the privacy requirements related to data access.

                             administrative cohort      validation cohort
Morbidity                    n00 + n11   n10   n01      m00 + m11   m10   m01
anxiety                      3486        88    34       > 455       10    < 5
bipolar disorder             3528        50    30       460         5     5
inflammatory bowel disease   3561        30    17       > 460       5     < 5
depression                   2839        512   257      361         65    44
diabetes                     3362        102   144      447         7     16
hyperlipidemia               3353        109   146      436         17    17
hypertension                 3040        287   281      407         32    31
chronic lung disease         3142        248   218      411         23    36
migraine                     3391        149   68       456         11    5
thyroid disease              3359        139   110      446         19    5

For most, but not all morbidities, the adjusted analysis leads to a shift of the posterior median away from the null, as would be expected following the results of Section 2.2.4. Differences are most notable for morbidities such as anxiety, inflammatory bowel disease and thyroid disease, where medians and 97.5 percentiles of the validation posterior are noticeably larger compared to those of the naive posterior. In contrast, differences between naive and adjusted analyses are surprisingly small for conditions such as depression and bipolar disorder.
A glance at the validation analysis for these morbidities reveals that exposure-disease odds ratios in the validation cohort and the administrative cohort are somewhat discrepant, in particular because it is expected that naive point estimates are biased towards the null.

[Figure 2.4 (plot not reproduced): one row per morbidity (thyroid disease, migraine, lung disease, hypertension, hyperlipidemia, diabetes, depression, bowel disease, bipolar disorder, anxiety); x-axis: log(OR) from −1 to 3; series: adjusted, naive, validation.]

Figure 2.4: Posterior median and 95% equal-tailed credible interval for odds ratios of multiple sclerosis and presence of morbidity in the prodromal phase under the proposed model (adjusted) and the naive model based on the full cohort (naive) and the validation cohort only (validation).

Possible issues of comparability between the administrative and validation cohorts are further highlighted in the case of chronic lung disease, where the naive point estimate indicates a positive exposure-disease association, while the point estimate from the validation analysis suggests a negative association. For this morbidity, the adjusted analysis produces a posterior median that is slightly shifted towards the null, in conflict with the expected result given the small, but positive association indicated by the naive point estimate. We expand on the reasons for this phenomenon in the discussion, but note here that we have confirmed that this is not an artifact of MCMC simulation error.

Posterior credible intervals under the adjusted analysis are wider relative to the naive analysis for all morbidities, with differences again ranging from minor, as in the case of depression, to considerable, as observed for anxiety. In contrast, the adjusted analysis produces posterior credible intervals that are typically shorter than credible intervals based only on the validation data.
This speaks for the merit of utilizing a larger and appropriately analyzed misclassified dataset, rather than restricting oneself to the correctly classified, yet smaller dataset of the validation cohort.

2.5 Discussion

In this chapter, we provided a closed-form expression for the asymptotic bias in odds ratios estimated from matched case-control data in the presence of outcome misclassification and proposed a simple Bayesian approach to adjust standard estimators for misclassification biases. The strength of the Bayesian methodology in this context lies in its ability to incorporate information from external and internal sources into the analysis, and thus facilitate estimation of an otherwise non-identifiable model. The model’s parameterization in terms of the predictive values, rather than sensitivity and specificity, avoids the need to adapt population misclassification probabilities for outcome-related sampling. Moreover, the model does not require any distributional assumptions for the pair-specific effects, an advantage given that inferences about OR can be sensitive to this choice [58].

In our simulation studies, the model demonstrated its ability to adjust association estimates for bias, and also revealed good properties regarding coverage of the 95% posterior credible interval. Its reliance on valid prior input about the predictive values also became evident and implies that care must be taken when specifying the hyperparameters for np and pp. For sample sizes and degrees of contamination that ensured a reasonably large number of true cases in the main cohort, the length of the adjusted posterior credible interval was within acceptable bounds for useful posterior inference about OR.

An important limitation of our approach is the need for non-differential misclassification with respect to any factor associated with the exposure, and analysts need to be mindful of this assumption in the context of administrative database studies.
When both the disease and the exposure are defined via administrative case definitions, as was the case in our motivating example, it is easy to imagine a scenario where dependence between the exposure and the disease label may occur. Consider, for instance, an application where both definitions include a prescription drug that is indicated for both the exposure and the outcome of interest. For the ProMS study, non-overlapping ICD 9 codes and prescription drugs between the case definitions of MS and the ten morbidities under investigation, as well as temporal separation between the prodromal and disease phase, should have limited such dependence.

However, when presence or absence of a disease is defined via disease-specific administrative records, it is important to keep in mind that classification depends in part on ICD 9 codes assigned by a physician, which may in turn be influenced by a variety of factors such as a patient’s risk profile (including age and sex), medical history, physician specialization or even coding practices. Therefore, it will generally be hard to fully rule out violations of the non-differential misclassification assumption, but the possibility of any of these factors having a considerable impact on the probability of exposure requires careful consideration. If the assumption is violated, adjusted odds ratios using the proposed approach are no longer valid and may either under- or overestimate the true exposure-disease association.

Because our model does not accommodate covariates, application of the proposed approach requires a setting where cases and controls are matched on all measured confounders, and further adjustment for matching variables is not required. This assumption may be limiting in many practical applications, but was not of major concern in our motivating problem.
For studies involving administrative databases, available covariates are often limited to basic, non-identifying information such as age, sex and geographic location, both because of the limited information accessible to researchers and because of privacy regulations. Given the source and type of this information, there is also little reason to anticipate mismeasurement of our matching variables. Further, generalizations to non-binary exposure variables or joint modelling of multiple exposures appear difficult given that the proposed model relies heavily on the relationship in Equation (2.4) between the cell probabilities and the odds ratio under the logistic model.

In our illustrative example, adjusted analyses produced point estimates that differed from those of the naive analysis by various degrees, but did not lead to different results for the hypothesis test of no association between morbidity and MS status. Similar observations were made in our simulation studies where naive and adjusted analyses showed near-identical statistical power to reject the null hypothesis of no association between disease and exposure. These results are in line with the bias-variance tradeoff previously discussed in the context of measurement error models: although adjusted analyses produce point estimates that are shifted away from the null, an increase in variability compared to naive analyses results in similar power for both approaches [12]. The adjusted analysis proposed in this chapter is therefore most beneficial for investigations aimed at estimating exposure-disease associations; if focus lies on testing hypotheses of no association between disease and exposure, there seems little to be gained.

The results of our illustrative example further stressed the importance of a representative validation cohort for valid posterior inference. In the simulation study of Section 2.3, the validation cohort was selected using random sampling of case-control pairs.
Such selection is ideal to prevent systematic differences between the main and validation cohorts, but was not feasible in our study of prodromal morbidities in MS. In fact, our choice of validation cohort may be to blame for a posterior that is shifted further towards the null than the posterior arising under the naive analysis, as was the case for one of the morbidities under investigation. As shown in Appendix A.3, such behaviour will occur if

\[
\frac{\theta_{01}}{\theta_{01|10}} < \mathit{pp}\,(1 - \mathit{pp}) - \mathit{np}\,(1 - \mathit{np}).
\]

This inequality does not hold for the large sample limits of the parameter estimates of θ01 and θ01|10, but can be fulfilled for finite sample estimates, in particular when the estimate of θ01|10 is biased upwards as the result of an unrepresentative validation cohort. When selecting a validation cohort by means other than random sampling, it is therefore important to examine the two cohorts for differences in the distribution of risk factors that are suspected to influence the probability of exposure. Comparing odds ratios estimated from the validation cohort with those produced by the adjusted analysis can also provide useful insights, as seen in Figure 2.4 for our motivating example.

Several limitations of our analysis need to be highlighted. The validation cohort considered to inform the prior distribution of θ01|10 is not an independent sample, but rather a subset of the main study cohort. This is in conflict with the Bayesian paradigm where prior distributions encapsulate information on the model parameters before the data are evaluated. The implications of this violation include an underestimation of the posterior variability of OR, resulting in lower coverage probabilities of the 95% credible interval. In a simulation study, we found this difference to be small, but nonetheless present.

Further, the morbidities studied in this example were determined from health administrative records and are also subject to misclassification.
Strictly speaking, estimated parameters represent the odds ratios of MS and morbidity-related health care contacts, as defined by the case definition, and not the presence of morbidity. When prior information about the sensitivity and specificity of the exposure classification is available, a possible approach to account for both outcome and exposure misclassification simultaneously is to combine our model with that of Liu et al. [45]. Because both models are based on a multinomial likelihood, this would entail modelling the cell probabilities for the apparent exposure, P(E*_1 = i, E*_2 = j), i, j = 0, 1, as a function of exposure sensitivity, specificity and θij according to Equation (4) in Liu et al. [45], followed by modelling of θij according to our proposed approach.

Lastly, the positive predictive value used to inform the prior distribution of pp was estimated in the Canadian province of Nova Scotia. Positive predictive values are a function of the disease prevalence in the underlying population and are therefore less stable when transferred across populations. We suspect that the change in pp is small as similar prevalences of MS have been reported for Nova Scotia and British Columbia (0.3% in 2010 [52] versus 0.2% in 2008 [43]) and both provinces have similar health systems and health-related administrative practices such as physician billing. To compensate for possible shifts in the positive predictive value, we increased the standard deviation in the prior distribution of pp to 0.03 for our analysis.

As highlighted in this chapter, the topic of disease misclassification is a central theme in analyses of administrative health data. We are the first to examine the impact on estimates from unadjusted matched analyses, and demonstrated that methods tailored towards this phenomenon are needed for valid estimation of exposure-disease associations.
The model proposed in this chapter will be a useful tool to better utilize health administrative databases as a comprehensive data source for epidemiological studies.

Chapter 3

Relaxing the non-differential misclassification assumption in analyses of matched case-control studies with misclassified outcomes

3.1 Introduction

The model proposed in the previous chapter presents a simple tool to adjust odds ratios for outcome misclassification in pair-matched case-control studies, but it may also be viewed as overly restrictive for some real-data applications. Extensions to non-binary exposures, settings that require additional covariate adjustment or matched studies with a variable number of controls per matching stratum are not easily accomplished under the previous framework. In fact, the model’s restriction to pair-matched data led us to discard all but one control per unique case from the analysis of our motivating dataset, and thus did not allow us to make efficient use of the available data. Most importantly, it required the hard-to-verify assumption that the classification process is non-differential.

In this chapter, we aim to develop a more flexible model for the analysis of matched case-control studies, and in particular focus on relaxing the non-differential misclassification assumption. Assuming that no prior information about the differential classification process is available, we propose to estimate each participant’s probability of being a true case from auxiliary data, and to use these estimates as weights in a Bayesian analysis of matched case-control data. This approach is particularly tailored towards the issue of disease misclassification in health administrative database studies, where prior information is often limited to marginalized estimates of the misclassification probabilities, but rich individual-level information on disease-specific healthcare utilization is available.
A similar approach was taken by Hahn and Xia [37] in the context of non-linear regression models with a non-differentially misclassified explanatory variable, where misclassification probabilities were estimated from covariate information alongside the regression parameters in a joint mixture-type model. Motivated by ProMS study data, the following work focuses on the scenario where misclassification only occurs among participants with a positive disease label, and where misclassification among the control group is negligible.

The remainder of this chapter is structured as follows. We begin with an introduction to the analysis of matched case-control studies under perfect outcome assessment in Section 3.2.1, and derive a general model for a binary exposure variable when cases are subject to differential misclassification in Section 3.2.2. In Section 3.2.3, we explore our motivating dataset and identify features of disease-specific health care utilization thought to be indicative of a participant falsely assigned to the study’s case group. Motivated by these features, we propose a counting process mixture model in Section 3.2.4 to estimate subject-specific misclassification probabilities. Lastly, we revisit our motivating dataset in Section 3.3, where we estimate odds ratios of MS and the presence of several morbidities in the five years before the first recognized sign of MS. We close with a discussion in Section 3.4.

3.2 Methods

Throughout this chapter, we consider the setting of an exposure-disease association study similar to Chapter 2, but now focus on a generalized 1 : (n_k − 1) matching ratio of cases and controls, with n_k denoting the number of participants in the k-th of N strata. Let E_jk denote the exposure status, D_jk the true disease status and D*_jk the apparent disease status of individuals in the k-th stratum, with indices j = 1 and j = 2, ..., n_k referring to the apparent case and the apparent controls, respectively. We further assume that the disease of interest is rare.
This implies a negative predictive value approximating one for most realistic classifiers, and consequently minimal contamination among apparent controls. For all k, it follows that D_jk = D*_jk = 0, j = 2, ..., n_k, but D*_1k = D_1k = 1 may not hold when apparent cases are subject to misclassification.

3.2.1 Analysis under perfect outcome classification

We begin with the analysis of matched case-control data for the scenario of perfect disease ascertainment. In the standard case of D being fully observable, that is D_1k = 1 for all k, exposure-disease odds ratios may be estimated from matched data via the following risk model,

\[
\operatorname{logit}\bigl(P(E_{jk} = 1 \mid D_{jk} = d, b_k, x_{jk})\bigr) = \gamma' x_{jk} + \delta I(d = 1) + b_k, \quad j = 1, \dots, n_k, \tag{3.1}
\]

where x_jk is a vector of covariates not considered in the matching process and b_k ∼ G is a stratum-specific random effect intended to capture the dependencies within the strata. We consider the conditional exposure-disease odds ratio, exp(δ), to be the parameter of interest in this investigation.

Conditional on b_k, within-strata exposures e_k = (e_1k, ..., e_{n_k k})′ are assumed to be independent, and the likelihood of the data e = (e_1, ..., e_N)′ takes the form

\[
\begin{aligned}
L(e \mid \gamma, \delta, b)
&= \prod_{k=1}^{N} P(E_{1k} = 1 \mid D_{1k} = 1, b_k, x_{1k})^{e_{1k}} \, P(E_{1k} = 0 \mid D_{1k} = 1, b_k, x_{1k})^{1 - e_{1k}} \\
&\quad \times \prod_{k=1}^{N} \prod_{j=2}^{n_k} P(E_{jk} = 1 \mid D_{jk} = 0, b_k, x_{jk})^{e_{jk}} \, P(E_{jk} = 0 \mid D_{jk} = 0, b_k, x_{jk})^{1 - e_{jk}} \\
&= \prod_{k=1}^{N} g^{-1}(\gamma' x_{1k} + \delta + b_k)^{e_{1k}} \bigl(1 - g^{-1}(\gamma' x_{1k} + \delta + b_k)\bigr)^{1 - e_{1k}} \\
&\quad \times \prod_{k=1}^{N} \prod_{j=2}^{n_k} g^{-1}(\gamma' x_{jk} + b_k)^{e_{jk}} \bigl(1 - g^{-1}(\gamma' x_{jk} + b_k)\bigr)^{1 - e_{jk}},
\end{aligned}
\tag{3.2}
\]

where g(·) denotes the logit link function, so that g^{-1} is the logistic function. In the frequentist framework, estimation of Model (3.1) entails a marginalization of the likelihood over the density G to eliminate the nuisance parameters b_k, followed by a maximization of the resulting marginal likelihood. In contrast, b_k is treated as an unobserved quantity similar to the remaining model parameters under the Bayesian paradigm, and the posterior distribution of (γ, δ, b | e) is obtained via Bayes’ theorem following specification of a prior distribution for (γ, δ, b)′.
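As a concrete reading of Equation (3.2), the conditional log-likelihood can be evaluated stratum by stratum on the logit scale. The sketch below is illustrative only: the function name `log_likelihood` and the data layout (each stratum as a list of (e, x) pairs with the case listed first) are assumptions, and no prior or sampler is included.

```python
import math


def log_likelihood(strata, gamma, delta, b):
    """Log of the likelihood in Equation (3.2) for given gamma, delta and b.

    strata: list of strata; each stratum is a list of (e, x) pairs, with the
            case first (j = 1) and its matched controls after it
            (hypothetical layout).
    gamma:  covariate coefficients; delta: log odds ratio; b: one effect
            per stratum.
    """
    total = 0.0
    for k, stratum in enumerate(strata):
        for j, (e, x) in enumerate(stratum):
            eta = sum(g * xi for g, xi in zip(gamma, x)) + b[k]
            if j == 0:  # the case contributes the additional term delta
                eta += delta
            # Bernoulli log-probability: e * eta - log(1 + exp(eta)),
            # with the softplus term written in a numerically stable form.
            softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
            total += e * eta - softplus
    return total


# One stratum: an exposed case and an unexposed control, one covariate at 0.
toy = [[(1, [0.0]), (0, [0.0])]]
ll_null = log_likelihood(toy, gamma=[0.5], delta=0.0, b=[0.0])
ll_or3 = log_likelihood(toy, gamma=[0.5], delta=math.log(3.0), b=[0.0])
```

With δ = 0 both exposure probabilities equal 0.5. With exp(δ) = 3 the exposure probability of the case rises to 0.75, which increases the log-likelihood of this toy stratum.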
In practice, sampling schemes such as Gibbs sampling [25] or the Metropolis-Hastings algorithm [38, 55] are often employed to generate samples from the posterior distribution, which then form the basis for approximate inference about the model parameters.

3.2.2 Modelling of exposure probabilities under outcome misclassification

Under misclassification of the disease, study participants are no longer sampled conditional on their true disease status D, but conditional on the imperfect label D*. The observed data are therefore realizations of E | D* = i rather than E | D = i, i = 0, 1, and we are unable to model the probability on the left hand side of Model (3.1) directly using the data at hand. Instead, we are required to express the exposure risk model via P(E_jk = 1 | D*_jk = i), which can be decomposed as

\[
\begin{aligned}
P(E_{jk} = 1 \mid D^*_{jk} = i) &= P(E_{jk} = 1 \mid D^*_{jk} = i, D_{jk} = 0)\, P(D_{jk} = 0 \mid D^*_{jk} = i) \\
&\quad + P(E_{jk} = 1 \mid D^*_{jk} = i, D_{jk} = 1)\, P(D_{jk} = 1 \mid D^*_{jk} = i), \quad j = 1, \dots, n_k.
\end{aligned}
\tag{3.3}
\]

In this expression, the first term of both summands closely resembles the exposure probability of Model (3.1), except for additional conditioning on the observed disease status D*, while the second terms act as weights indicating the chances that a participant belongs to either the case or control group, given their observed disease status. These weights are related to the classifier’s negative and positive predictive value, but are specific to participant j in the k-th stratum.

Before proceeding to a discussion of the probabilities appearing in Equation (3.3), some simplifications can be made. As stated previously, we focus our investigation on a scenario where the group of apparent cases contains a non-negligible proportion of false positive cases, but where there is negligible contamination in the apparent control group. From the latter, it follows that P(D_jk = 0 | D*_jk = 0) ≈ 1, and hence,

\[
P(E_{jk} = 1 \mid D^*_{jk} = 0) = P(E_{jk} = 1 \mid D^*_{jk} = 0, D_{jk} = 0), \quad j = 2, \dots, n_k. \tag{3.4}
\]

For apparent cases, however, P(D_1k = 1 | D*_1k = 1) is generally less than 1, and Equation (3.3) retains its form as a positive predictive value weighted sum of two exposure probabilities,

\[
\begin{aligned}
P(E_{1k} = 1 \mid D^*_{1k} = 1) &= P(E_{1k} = 1 \mid D^*_{1k} = 1, D_{1k} = 0)\, P(D_{1k} = 0 \mid D^*_{1k} = 1) \\
&\quad + P(E_{1k} = 1 \mid D^*_{1k} = 1, D_{1k} = 1)\, P(D_{1k} = 1 \mid D^*_{1k} = 1).
\end{aligned}
\tag{3.5}
\]

Two questions arise when parameterizing the quantities appearing on the right hand side of Equations (3.4) and (3.5):

• how to accommodate the observed disease status D* in an exposure risk model similar to that of Model (3.1), and
• how to deal with the weights, the probability of apparent case k being a true case, in Equation (3.5)?

Both questions are related, and their answers are strongly tied to the characteristics of the underlying classification mechanism.

If misclassification were non-differential, the disease label D* would not be associated with any factors other than the true disease status and could therefore be dropped from the conditioning statement of the exposure probabilities in Equations (3.4) and (3.5). Both probabilities are then equivalent to those appearing in Model (3.1) and can be parameterized accordingly. This, however, is a strong assumption in the context of administrative case definitions, and one which we are unwilling to make in this chapter.

Instead, we want to acknowledge that controls with a positive disease label, that is (D = 0, D* = 1), may be different from controls not suspected of the disease, (D = 0, D* = 0). Consider, for instance, our motivating example where participants in the former group constitute people falsely suspected of MS. Because the morbidities under investigation are suspected to be part of the MS prodrome, it is easy to imagine that their presence could have influenced a physician’s decision to assign an MS-specific billing code.
To accommodate possible differences between the two groups, we expand Model (3.1) with an additional parameter,

\[
\operatorname{logit}\bigl(P(E_{jk} = 1 \mid D_{jk} = d, D^*_{jk} = d^*, x_{jk}, b_k)\bigr) = \gamma' x_{jk} + \delta_1 I(d^* = 1) + \delta_2 I(d = 1) + b_k, \quad j = 1, \dots, n_k, \tag{3.6}
\]

to separately express the odds ratio for a false positive case, exp(δ1), and a true positive case, exp(δ1 + δ2).

With this adaptation, what remains to be justified is whether the exposure probabilities for true positive cases and their matched true negative controls are representative of the exposure probabilities for the general population of cases and matched controls,

\[
P(E_{1k} = 1 \mid D_{1k} = 1) \approx P(E_{1k} = 1 \mid D^*_{1k} = 1, D_{1k} = 1), \tag{3.7}
\]
\[
P(E_{jk} = 1 \mid D_{jk} = 0) \approx P(E_{jk} = 1 \mid D^*_{jk} = 0, D_{jk} = 0), \quad j = 2, \dots, n_k, \tag{3.8}
\]

or if the label-based inclusion criterion could have introduced selection bias into the cohort. Approximation (3.8) follows quite readily for rare diseases, as the rarity of the disease justifies the assumption that controls with a positive disease label constitute only a negligible fraction of the population of matched controls. Therefore,

\[
\begin{aligned}
P(E_{jk} = 1 \mid D_{jk} = 0) &= P(E_{jk} = 1 \mid D^*_{jk} = 0, D_{jk} = 0)\, P(D^*_{jk} = 0 \mid D_{jk} = 0) \\
&\quad + P(E_{jk} = 1 \mid D^*_{jk} = 1, D_{jk} = 0)\, \underbrace{P(D^*_{jk} = 1 \mid D_{jk} = 0)}_{\approx 0} \\
&\approx P(E_{jk} = 1 \mid D^*_{jk} = 0, D_{jk} = 0).
\end{aligned}
\]

The approximation of Equation (3.7) is less straightforward and requires a stronger assumption. Consider again our motivating example where MS cases require at least three MS-specific claims to be recognized by the case definition. People with disease onset towards the end of the study are less likely to have collected a sufficient number of claims, and are thus less likely to be included in the ProMS case group. This leads to a dependence between time and the case definition’s sensitivity, which may introduce bias if a similar dependence also exists between time and the probability of exposure.
When interpreting the final results, the assumption that exposure rates among the case group are representative of the exposure rates among the population of MS cases should thus be kept in mind.

To finalize the model, the probability weights appearing in Equation (3.5) still require attention. The commonly adopted approach in the misclassification literature is to replace misclassification probabilities with estimates from validation data or expert knowledge. In our motivating example, the positive predictive value of the administrative case definition has been estimated, and could thus be incorporated into the model of Equation (3.5) under the assumption that $P(D_{1k} = 1 \mid D^*_{1k} = 1)$ is constant across all strata. The problem with this approach is two-fold: it is in conflict with the differential nature of the misclassification mechanism previously assumed, and it is also not feasible, as under the expanded exposure model (3.6), the likelihood for apparent cases is no longer identified if constant weights are assumed for all $k$. Thus, to facilitate consistent estimation of the parameters in Model (3.6), we are required to acknowledge the individuality of each apparent case when expressing their probability of misclassification. This prompts a modelling approach for $P(D_{1k} = 1 \mid D^*_{1k} = 1)$, which we will denote as $\pi_k$, $k = 1, \ldots, N$, for the remainder of this chapter.

In the decomposition of Equation (3.5), we acknowledge the exposure's dependence on the true and observed disease status, but so far have not considered that the exposure model will also need to include a stratum-specific effect $b_k$ and possibly a vector of covariates $\mathbf{x}$. Incorporating these additional dependencies into the left-hand side of Equation (3.5), $b_k$ and $\mathbf{x}$ also appear in the conditioning statement of the probability weights and should therefore be considered as predictor variables in a model for $\pi_k$.
While $\mathbf{x}$ is fully observed and readily included in such a model, accounting for the latent stratum-specific effect $b_k$ is not straightforward. Because $b_k$ is only identified through the exposures of the $k$-th stratum, an immediate consequence is that exposure and misclassification models would need to be estimated in a joint model, rather than a two-step approach where the misclassification model is estimated first, followed by the estimation of the exposure model with estimates of $\pi_k$ "plugged in". A second question that arises is under which circumstances the probability of misclassification will depend on $b_k$, and in which settings this relationship might be ignorable. If $b_k$ is viewed as a representative of the matching variables, the probability of a positive disease status is generally not independent of $b_k$ given that the matching variables are confounders, and hence associated with the disease.

In our motivating example, apparent cases and controls are matched on sex, birth year and geographic location at index date. These variables are not strongly associated with MS when measured in terms of absolute risk, and thus do not have satisfactory ability to discriminate between true and false positive cases in the ProMS case group. When modelling the probability of a positive disease status, additional predictor variables need to be considered, and we identify such a variable in the following section from MS-specific healthcare utilization patterns. Generally, when dependencies between $D$ and the matching variables are thought to be weak, and variables with stronger dependence are available, it can then be justifiable to remove $b_k$ from the model for $\pi_k$ as the majority of the variability in $D$ is explained by the remaining predictors. This approach will be taken in the analysis of the ProMS study data.
Formally, we will assume that $D$ and $b_k$ are independent conditionally on the predictor variables and the apparent disease status.

3.2.3 Using disease-specific healthcare utilization to discriminate between true and false positive cases

To obtain a disease label from the number of disease-specific billing codes, hospital separations and pharmaceutical prescriptions, the administrative case definition of our motivating example applies a simple classification rule that assigns a positive label if an individual shows at least three MS-specific records. Once an individual fulfills this case definition, they enter the case group of the ProMS study, and information about the total number of records is not further considered. In essence, such a case definition represents a crude dichotomization of an underlying count variable and ultimately discards valuable information that could inform our understanding of the true disease states of the apparent cases. In our example, the number of MS-specific records varies considerably across the 6830 participants included in the original cohort, from a minimum of three for those just meeting the case definition's threshold, to counts exceeding 200 for some of the study participants.

Figure 3.1: MS-specific physician billing codes over time (years) for three example participants (ID 1, ID 2, ID 3) in the ProMS case group. For each participant, the end of follow-up is represented by the vertical grey line. Time zero is defined as the index date, the time of the first MS-specific code. Gaussian noise has been added to each MS-specific claim time to fulfill the privacy requirements related to data access.

While the total number of disease-specific physician billing codes, hospital separations and pharmaceutical prescriptions provides important information that should be reflected in the probability of misclassification, we argue that the temporal aspect of these claims also needs to be taken into consideration.
To illustrate this point, consider the time series of MS-specific physician billing codes displayed in Figure 3.1 for three example participants. In each plot, time zero represents the time of the first MS-specific code. Gaussian noise has been added to each MS-specific claim time to fulfill the privacy requirements related to data access.

Although all three IDs fulfill the case definition of the ProMS study, their MS-specific medical histories exhibit strong differences even at first glance. While example ID 1 shows a large number of claims throughout the entire follow-up period of ten years and thus leaves little doubt about being a true case, example IDs 2 and 3 show a very different pattern. For ID 2, all claims occur at the beginning of the observational period, followed by a long period of over nine years with no evidence of disease-specific physician contacts. Considering the progressive and disabling nature of MS, this seems unusual, and raises the suspicion that MS-specific claims observed during the first year may have been falsely assigned. If we were to express the chance of ID 2 being a true MS case, we would necessarily choose a low value to reflect these considerations.

The timing of MS-specific claims for example ID 3 is not unlike that of ID 2, but the length of follow-up time between the two participants differs considerably. Without access to claims data for ID 3 after approximately one and a half years, we are unable to determine if ID 3 will exhibit a claim pattern similar to ID 1, or similar to ID 2. As a result, there is little guidance to inform the probability of ID 3 being a true MS case, and the overall positive predictive value of the case definition may be the most appropriate choice.

From this illustrative example, it becomes evident that the total number of MS-specific claims may not discriminate well between true and false positive MS cases as participants differ considerably in the length of time over which these claims are observed.
Instead, one may suspect that the distribution of claims over time has better discriminatory ability, with true cases showing MS-specific records throughout follow-up while for false positive cases, records will tend to stop after some initial time period.

Further evidence for this hypothesis is given in Figure 3.2. In this plot, the time difference between the first and last MS-specific claim, expressed as a fraction of follow-up time, is plotted against follow-up time in years. From the overlaid density contours, we see that most participants fall along the $y = 1$ horizontal line, with a second, albeit less pronounced group along the $y = 0$ line. Highlighting participants captured by the BCMS database in blue reveals that true MS cases correspond nicely with the first group, and suggests that these participants may in fact represent the subset of true MS cases among the ProMS case group. These exploratory results lead to two assumptions for the following model development:

• the subset of apparent cases identified by the case definition contains two distinct groups, one of which corresponds to the group of true MS cases; and
• MS-specific claims for true positive cases tend to be evenly distributed across the entire length of follow-up, while claims for false positive cases tend to appear only during a short time period at the beginning of follow-up.

Figure 3.2: Time difference between first and last MS-specific claim relative to the total length of follow-up, plotted against follow-up time in years for ProMS cases. Participants captured by the British Columbia Multiple Sclerosis database are highlighted in blue. Contours are obtained from a two-dimensional kernel density estimate.

In the following section, we derive two counting process models to represent these two characteristic behaviours.

3.2.4 A counting process mixture model for disease-specific claim times

Let $\mathbf{t}_k = (t_{k1}, \ldots, t_{km_k})$ denote the vector of ordered MS-specific claim times and $\tau_k$ the follow-up time for apparent case $k$. The time of the first MS-specific claim, $t_{k0}$, defines time zero, but will not be included in the vector of event times $\mathbf{t}_k$ as it is fixed by design. Because modelling is restricted to the group of apparent cases, all probabilistic expressions in this section would require conditioning on $D^*_{1k} = 1$, but we will suppress this dependence in our notation in the interest of conciseness.

To model the temporal distribution of MS-specific claims across the follow-up period, we consider $\mathbf{t}_k$ to be realizations of a counting process $N_k(t) = |\{i : t_{ki} \le t\}|$ observed over the time period $[0, \tau_k]$. The behaviour of a counting process $N(t)$ is governed by its conditional intensity function defined as

$$
\lambda(t \mid H(t)) = \lim_{\Delta t \to 0} \frac{P(\Delta N(t) = 1 \mid H(t))}{\Delta t}, \tag{3.9}
$$

where $\Delta N(t)$ denotes the change in $N(t)$ over the time interval $(t, t + \Delta t]$ and $H(t) = \{N(s);\ 0 \le s < t\}$ represents the process history up to time $t$. Definition (3.9), together with the assumption that two events cannot occur at the same time, implies

$$
\begin{aligned}
P(\Delta N(t) = 1 \mid H(t)) &= \lambda(t \mid H(t))\,\Delta t + o(\Delta t), \\
P(\Delta N(t) = 0 \mid H(t)) &= 1 - \lambda(t \mid H(t))\,\Delta t + o(\Delta t), \\
P(\Delta N(t) > 1 \mid H(t)) &= o(\Delta t),
\end{aligned}
\tag{3.10}
$$

where $o(h)$ denotes any function $g(h)$ such that $\lim_{h \to 0} g(h)/h = 0$. To arrive at a joint distribution $f(\mathbf{t})$ for $m$ events occurring at times $\mathbf{t} = (t_1, \ldots, t_m)$, we first consider the cumulative distribution function of the $i$-th event time $T_i$, given the previous event times $(t_1, \ldots, t_{i-1})$. For a partition of $(t_{i-1}, t]$ into $l$ intervals of equal length $\Delta s = (s_j - s_{j-1})$, $j = 1, \ldots, l$, where $t_{i-1} = s_0 < s_1 < \ldots < s_l = t$, we have

$$
\begin{aligned}
P(T_i \le t \mid t_1, \ldots, t_{i-1}) &= 1 - P(T_i > t \mid t_1, \ldots, t_{i-1}) \\
&= 1 - \lim_{\Delta s \to 0} \prod_{j=1}^{l} P(\Delta N(s_j) = 0 \mid H(s_j)) \\
&= 1 - \lim_{\Delta s \to 0} \prod_{j=1}^{l} \big(1 - \lambda(s_j \mid H(s_j))\,\Delta s + o(\Delta s)\big) \\
&= 1 - \lim_{\Delta s \to 0} \prod_{j=1}^{l} \exp\Big(\log\big(1 - \lambda(s_j \mid H(s_j))\,\Delta s + o(\Delta s)\big)\Big) \\
&= 1 - \lim_{\Delta s \to 0} \prod_{j=1}^{l} \exp\big(-\lambda(s_j \mid H(s_j))\,\Delta s + o(\Delta s)\big) \\
&= 1 - \lim_{\Delta s \to 0} \exp\Big(\sum_{j=1}^{l} -\lambda(s_j \mid H(s_j))\,\Delta s + o(\Delta s)\Big) \\
&= 1 - \exp\Big(-\int_{t_{i-1}}^{t} \lambda(s \mid H(s))\,ds\Big),
\end{aligned}
$$

where the equality of lines four and five follows by Taylor expansion of $\log(1 - \lambda(s \mid H(s))\,\Delta s + o(\Delta s)) = -\lambda(s \mid H(s))\,\Delta s + o(\Delta s)$. Taking the derivative of the above expression with respect to $t$ gives the density for event time $i$,

$$
f(t_i \mid t_1, \ldots, t_{i-1}) = \lambda(t_i \mid H(t_i)) \exp\Big(-\int_{t_{i-1}}^{t_i} \lambda(s \mid H(s))\,ds\Big),
$$

and determines a joint density for events $\mathbf{t} = (t_1, \ldots, t_m)$ as

$$
\begin{aligned}
f(\mathbf{t}) &= \prod_{i=1}^{m} f(t_i \mid t_1, \ldots, t_{i-1})\, P(N(\tau) - N(t_m) = 0 \mid H(t_m)) \\
&= \prod_{i=1}^{m} \lambda(t_i \mid H(t_i)) \exp\Big(-\int_{0}^{\tau} \lambda(s \mid H(s))\,ds\Big).
\end{aligned}
\tag{3.11}
$$

For a detailed discussion of counting process models see Andersen et al. [3] or Cook and Lawless [14].

Following from the correspondence between $\lambda(t \mid H(t))$ and $f(\mathbf{t})$, models for counting processes are commonly specified via the conditional intensity function. Special cases include the homogeneous Poisson process with rate parameter $\lambda \ge 0$, defined by a constant intensity function that is independent of previous events,

$$
\lambda(t \mid H(t)) = \lambda,
$$

and the renewal process, where the intensity function depends upon the history $H(t)$ only through the elapsed time since the previous event time, $t^*$,

$$
\lambda(t \mid H(t)) = \lambda(t \mid t - t^*).
$$

In the previous section, we identified two distinct characteristics of the claim time distribution for true and false positive MS cases, which we now translate into models for their respective intensity processes, $\lambda_1(t \mid H(t))$ and $\lambda_0(t \mid H(t))$.
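As a brief numerical aside on the limiting argument behind Equation (3.11), the sketch below checks the product-to-integral step for the simplest special case mentioned above, the homogeneous Poisson process. The function names and the chosen rate, gap and event times are illustrative assumptions, not quantities from the analysis.

```python
import math


def survival_by_product(rate: float, gap: float, n_steps: int = 100_000) -> float:
    """Approximate P(T_i > t | history) by the product of (1 - rate * ds)
    over a partition of an interval of length `gap` into n_steps pieces."""
    ds = gap / n_steps
    # Sum of logs is numerically safer than multiplying many factors close to 1
    log_product = n_steps * math.log1p(-rate * ds)
    return math.exp(log_product)


def survival_closed_form(rate: float, gap: float) -> float:
    """Limit of the product: exp(-integral of the constant intensity)."""
    return math.exp(-rate * gap)


def log_density_poisson(times, tau: float, rate: float) -> float:
    """Log of Equation (3.11) for a constant intensity: m * log(rate) - rate * tau."""
    return len(times) * math.log(rate) - rate * tau


rate, gap = 2.0, 1.5  # hypothetical values
approx = survival_by_product(rate, gap)
exact = survival_closed_form(rate, gap)
print(f"product approximation: {approx:.6f}, closed form: {exact:.6f}")

example_times = [0.4, 1.1, 2.7]  # hypothetical ordered event times
print(f"log f(t) = {log_density_poisson(example_times, tau=3.0, rate=rate):.4f}")
```

For a constant intensity the factors of the product are identical, so the sketch reduces to the familiar limit of $(1 - \lambda \Delta s)^{l}$; a history-dependent intensity would require evaluating $\lambda(s_j \mid H(s_j))$ separately at each partition point.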
With the implied probability models $f_1(\mathbf{t}_k)$ and $f_0(\mathbf{t}_k)$, the likelihood of event times $\mathbf{t}_k$ can then be expressed as a two-component mixture,

$$
f(\mathbf{t}_k) = f_1(\mathbf{t}_k)\,\pi + f_0(\mathbf{t}_k)\,(1 - \pi), \tag{3.12}
$$

where $\pi = P(D = 1)$ denotes the mixing proportion. Recalling that $D^* = 1$ has been dropped from conditioning statements in this section, $\pi$ also represents the positive predictive value of the case definition. Under Model (3.12), the probability of apparent case $k$ being a true case given the subject's MS-specific claim history $\mathbf{t}_k$ is given by

$$
P(D_{1k} = 1 \mid \mathbf{t}_k) = \frac{\pi f_1(\mathbf{t}_k)}{\pi f_1(\mathbf{t}_k) + (1 - \pi) f_0(\mathbf{t}_k)}, \tag{3.13}
$$

and can be evaluated once the mixture model parameters are estimated from the data.

We begin with a model for the conditional intensity of true MS cases. We postulate that the chance of observing an MS-specific claim is influenced by two phenomena: (1) the underlying disease process which leads a patient to seek medical assistance and (2) the physician's request for a follow-up visit. These two processes are encapsulated in the following model,

$$
\lambda_1(t \mid H(t)) = \mu + \beta_1 \exp(-\beta_2 (t - t^*)), \tag{3.14}
$$

where $\mu > 0$, $\beta_1, \beta_2 \ge 0$ and $t^*$ denotes the time of the last claim prior to time $t$. The constant baseline intensity $\mu$ represents the intensity attributable to the underlying disease process and ensures that claims are continuously observed throughout the follow-up time. The second term accounts for history dependence due to physician recall. If a claim is observed at time $t$, the intensity jumps by $\beta_1$, followed by an exponential decay to the baseline intensity at a rate of $\beta_2$. Notice that even though the time of the first claim, $t_{k0}$, is not included in the vector $\mathbf{t}_k$, it is captured by the history $H(t)$ and leads to $\lambda_1(t = 0) = \mu + \beta_1$.

For false positive cases, we assume that MS-specific claims are not assigned due to an underlying disease process, but are initiated by the first, falsely assigned MS code at time $t = 0$.
The conditional intensity model therefore only includes a history dependence term,

$$
\lambda_0(t \mid H(t)) = \alpha_1 \exp(-\alpha_2 (t - t^*)), \tag{3.15}
$$

which implies that the probability of an MS-specific claim is highest after a physician reaffirms his suspicion of MS with a falsely assigned code, but decays to zero as time passes. Without a positive baseline intensity parameter, a counting process with this intensity will eventually cease to produce claims, and thus reflects the behaviour assumed to be indicative of a false positive case.

Lastly, we introduce two constraints on the parameters of models (3.14) and (3.15) by choosing $\mu + \beta_1 = \alpha_1$ and $\beta_2 = \alpha_2$. These constraints imply that the counting process behaviours for true and false positive cases are similar during an initial time period, and therefore that the mixture model is unable to distinguish true from false positive cases among participants with short periods of follow-up.

Evaluating expression (3.11) with the intensity models (3.14) and (3.15), the density functions for claim times $\mathbf{t}_k$ take the form

$$
\begin{aligned}
f_1(\mathbf{t}_k) ={}& \prod_{i=1}^{m_k} \Big(\mu + \beta_1 \exp\big(-\beta_2 (t_{ki} - t_{k(i-1)})\big)\Big) \\
&\times \exp\bigg(-\mu \tau_k + \frac{\beta_1}{\beta_2} \Big(\sum_{i=1}^{m_k} \big(\exp(-\beta_2 (t_{ki} - t_{k(i-1)})) - 1\big) + \big(\exp(-\beta_2 (\tau_k - t_{km_k})) - 1\big)\Big)\bigg), \\
f_0(\mathbf{t}_k) ={}& \prod_{i=1}^{m_k} \alpha_1 \exp\big(-\alpha_2 (t_{ki} - t_{k(i-1)})\big) \\
&\times \exp\bigg(\frac{\alpha_1}{\alpha_2} \Big(\sum_{i=1}^{m_k} \big(\exp(-\alpha_2 (t_{ki} - t_{k(i-1)})) - 1\big) + \big(\exp(-\alpha_2 (\tau_k - t_{km_k})) - 1\big)\Big)\bigg),
\end{aligned}
\tag{3.16}
$$

and fully determine the data likelihood under the mixture model (3.12).

3.2.5 An exposure risk model with differential misclassification adjustment

Combining the models of Sections 3.2.2 and 3.2.4, a model for the exposure status of the $k$-th stratum is given by

$$
\begin{aligned}
P(E_{1k} = 1 \mid D^*_{1k} = 1, \boldsymbol{\gamma}, \mathbf{x}_{1k}, b_k, \pi_k) &= (1 - \pi_k)\, g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{1k} + \delta_1 + b_k) + \pi_k\, g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{1k} + \delta_1 + \delta_2 + b_k), \\
P(E_{jk} = 1 \mid D^*_{jk} = 0, \boldsymbol{\gamma}, \mathbf{x}_{jk}, b_k) &= g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{jk} + b_k), \quad j = 2, \ldots, n_k,
\end{aligned}
\tag{3.17}
$$

where $g(\cdot)$ is a logit link function, $b_k \sim N(0, \sigma^2)$ is a stratum-specific random effect and

$$
\pi_k = \frac{\pi f_1(\mathbf{t}_k \mid \mu, \beta_1, \beta_2)}{\pi f_1(\mathbf{t}_k \mid \mu, \beta_1, \beta_2) + (1 - \pi) f_0(\mathbf{t}_k \mid \alpha_1, \alpha_2)}, \tag{3.18}
$$

with $f_0$ and $f_1$ as defined in Equations (3.16).
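To make Equations (3.16) and (3.18) concrete, the following minimal sketch evaluates the two log-densities and the resulting probability of being a true case for three hypothetical claim histories. It is an illustration under stated assumptions rather than the implementation used in the analysis: the parameter values are illustrative (of roughly the magnitude of the estimates reported in Table 3.2), the claim histories are invented, and the constraints $\alpha_1 = \mu + \beta_1$ and $\alpha_2 = \beta_2$ are imposed inside `prob_true_case`.

```python
import math


def log_f1(times, tau, mu, beta1, beta2):
    """Log of f1 in Equation (3.16). `times` holds the ordered claim times
    after the first claim, which sits at time zero and is not included."""
    prev, log_intensities, decay_sum = 0.0, 0.0, 0.0
    for t in times:
        decay = math.exp(-beta2 * (t - prev))
        log_intensities += math.log(mu + beta1 * decay)
        decay_sum += decay - 1.0
        prev = t
    # Contribution of the claim-free interval between the last claim and tau
    decay_sum += math.exp(-beta2 * (tau - prev)) - 1.0
    return log_intensities - mu * tau + (beta1 / beta2) * decay_sum


def log_f0(times, tau, alpha1, alpha2):
    """Log of f0 in Equation (3.16): the same form with zero baseline intensity."""
    return log_f1(times, tau, 0.0, alpha1, alpha2)


def prob_true_case(times, tau, pi, mu, beta1, beta2):
    """Equation (3.18) under the constraints alpha1 = mu + beta1, alpha2 = beta2."""
    log_true = math.log(pi) + log_f1(times, tau, mu, beta1, beta2)
    log_false = math.log1p(-pi) + log_f0(times, tau, mu + beta1, beta2)
    diff = log_true - log_false
    # Numerically stable logistic transform of the log posterior odds
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Illustrative parameter values (assumptions of this sketch)
pi, mu, beta1, beta2 = 0.76, 0.65, 4.70, 2.54

# Hypothetical claim histories (years since first claim, follow-up time)
p_sustained = prob_true_case([1, 2, 3, 4, 5, 6, 7, 8, 9], 10.0, pi, mu, beta1, beta2)
p_early_only = prob_true_case([0.2, 0.5], 10.0, pi, mu, beta1, beta2)
p_short_follow_up = prob_true_case([0.2, 0.5], 0.6, pi, mu, beta1, beta2)

print(f"sustained claims over 10 years: {p_sustained:.3f}")
print(f"early claims only, 10 years:    {p_early_only:.3f}")
print(f"early claims, 0.6 years:        {p_short_follow_up:.3f}")
```

Working on the log scale avoids underflow for participants with many claims. In this sketch, the history with sustained claims receives a probability close to one, the history with a long claim-free period receives a probability close to zero, and the history with short follow-up stays near the mixing proportion, which is the behaviour the parameter constraints are meant to produce.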
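We consider the odds of exposure for true positive cases relative to true negative controls, $\mathrm{OR} = \exp(\delta_1 + \delta_2)$, to be the parameter of interest.

To fit the model, we choose a two-step Bayesian approach in which the counting process mixture model is estimated first, followed by the estimation of the exposure model (3.17). This approach allows us to transfer uncertainty about $\pi_k$ into the second stage of the modelling as the posterior distributions of the mixture model parameters can be used as prior distributions for the parameters in Equation (3.18). Using the complete-data specification of the likelihood, the joint posterior distribution of the mixture model parameters takes the form

$$
f(\boldsymbol{\theta}, \mathbf{D} \mid \mathbf{t}) \propto \prod_{k=1}^{N} f_1(\mathbf{t}_k \mid \mu, \beta_1, \beta_2)^{d_k}\, f_0(\mathbf{t}_k \mid \mu, \beta_1, \beta_2)^{(1 - d_k)} \prod_{k=1}^{N} \pi^{d_k} (1 - \pi)^{1 - d_k}\; h(\boldsymbol{\theta}), \tag{3.19}
$$

where $\boldsymbol{\theta} = (\mu, \beta_1, \beta_2, \pi)'$, $\mathbf{t} = (\mathbf{t}_1, \ldots, \mathbf{t}_N)'$ and $h(\boldsymbol{\theta})$ represents the prior distribution of $\boldsymbol{\theta}$. We assume all parameters to be independent a priori, and choose a weakly informative uniform distribution on $[0, 1]$ for $\pi$, and truncated diffuse normal distributions with mean zero and a standard deviation of 20 for the remaining parameters.

Denoting $\boldsymbol{\psi} = (\delta_1, \delta_2, \boldsymbol{\gamma}, \sigma)'$, the posterior distribution of the exposure risk model (3.17) is given by

$$
\begin{aligned}
f(\boldsymbol{\psi}, \boldsymbol{\theta} \mid b_k, \mathbf{x}, \mathbf{e}, \mathbf{t}) \propto{}& \prod_{k=1}^{N} \bigg(\prod_{j=2}^{n_k} g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{jk} + b_k)^{e_{jk}} \big(1 - g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{jk} + b_k)\big)^{1 - e_{jk}}\bigg) \\
&\times \prod_{k=1}^{N} \Big((1 - \pi_k)\, g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{1k} + \delta_1 + b_k) + \pi_k\, g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{1k} + \delta_1 + \delta_2 + b_k)\Big)^{e_{1k}} \\
&\times \Big(1 - (1 - \pi_k)\, g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{1k} + \delta_1 + b_k) - \pi_k\, g^{-1}(\boldsymbol{\gamma}'\mathbf{x}_{1k} + \delta_1 + \delta_2 + b_k)\Big)^{1 - e_{1k}} \\
&\times h(\boldsymbol{\psi})\, h(\boldsymbol{\theta}),
\end{aligned}
\tag{3.20}
$$

where $h(\boldsymbol{\psi})$ denotes the prior distribution of $\boldsymbol{\psi}$. Prior $h(\boldsymbol{\psi})$ is chosen as a product of independent, diffuse mean-zero normal priors with a standard deviation of 20 for $\delta_1$, $\delta_2$ and all components of $\boldsymbol{\gamma}$, and a uniform distribution on $[0, 10]$ for $\sigma$ [26].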
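The likelihood part of Equation (3.20) factorizes over strata, and the contribution of a single stratum can be sketched as follows. This is a hedged illustration only: it assumes an intercept-only covariate vector (so that $\boldsymbol{\gamma}'\mathbf{x}$ reduces to a single parameter, here called `gamma0`), treats $\pi_k$ and $b_k$ as given, and uses hypothetical parameter values; the function name `stratum_log_lik` is not from the original analysis.

```python
import math


def expit(z: float) -> float:
    """Inverse of the logit link g."""
    return 1.0 / (1.0 + math.exp(-z))


def stratum_log_lik(e_case, e_controls, pi_k, gamma0, delta1, delta2, b_k):
    """Log-likelihood contribution of one matched stratum under Model (3.17).

    e_case:      exposure indicator (0/1) of the apparent case
    e_controls:  exposure indicators (0/1) of the matched controls
    pi_k:        probability that the apparent case is a true case
    """
    p_false_pos = expit(gamma0 + delta1 + b_k)
    p_true_pos = expit(gamma0 + delta1 + delta2 + b_k)
    # Mixture over the unknown true disease status of the apparent case
    p_case = (1.0 - pi_k) * p_false_pos + pi_k * p_true_pos
    p_control = expit(gamma0 + b_k)

    log_lik = math.log(p_case) if e_case == 1 else math.log1p(-p_case)
    for e in e_controls:
        log_lik += math.log(p_control) if e == 1 else math.log1p(-p_control)
    return log_lik


# Hypothetical parameter values and one stratum with five controls
params = dict(gamma0=-3.0, delta1=0.6, delta2=-0.2, b_k=0.1)
ll = stratum_log_lik(e_case=1, e_controls=[0, 0, 1, 0, 0], pi_k=0.9, **params)
print(f"stratum log-likelihood: {ll:.4f}")
```

In a full analysis this quantity would be summed over all strata and combined with the priors and the random effect distribution for $b_k$ inside an MCMC sampler; the sketch only shows how the probability weights $\pi_k$ enter the case contribution.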
To implement the second stage of our two-step Bayesian approach, distribution $h(\boldsymbol{\theta})$ in Model (3.20) is specified as the product of independent, univariate normal distributions with means and variances estimated from the respective posterior distributions of the mixture model parameters.

Table 3.1: Disease status of 6830 apparent cases and 31714 controls in the ProMS cohort for six morbidities suspected to be part of the MS prodrome. Data are aggregated across matching strata.

| Morbidity | Controls: absent | Controls: present | Apparent cases: absent | Apparent cases: present |
|---|---|---|---|---|
| anxiety | 30369 | 1345 | 6243 | 587 |
| irr. bowel syndrome | 31315 | 399 | 6644 | 186 |
| depression | 28353 | 3361 | 5423 | 1407 |
| diabetes | 30136 | 1578 | 6497 | 333 |
| hypertension | 28009 | 3705 | 5904 | 926 |
| lung disease | 29814 | 1900 | 6277 | 553 |

3.3 Application to ProMS data

We now apply the proposed model to the motivating dataset. We focus our analysis on 6830 apparent case-control strata identified from British Columbia databases with index dates between April 1, 1996, and December 31, 2013. Among these, 4887 apparent cases were matched with five peers from the population of eligible controls; the remaining 26, 66, 283 and 1568 apparent cases were matched with one to four peers, respectively, leading to a total cohort size of 38544 study participants. Notice that this represents a larger cohort compared to that of Chapter 2, as we can now accommodate all controls in the analysis, and also include participants with index dates after December 31, 2004.

Similarly to Chapter 2, interest lies in estimating the odds ratio of MS and a list of several morbidities suspected to be part of the MS prodrome. Presence or absence of these morbidities in the five-year prodromal window was determined from administrative data as previously outlined in Section 2.4.
The data considered in the following analysis are displayed in Table 3.1.

For participants included in the ProMS case group, MS-specific claim times $\mathbf{t}_k$ were defined as the time differences in years between the first and each subsequent MS-specific physician billing code or hospital separation record. To fulfill the assumptions of the counting process model, each apparent case can contribute only one MS-specific claim per calendar day; additional claims appearing on the same day are treated as duplicates and are removed from the analysis. The distribution of the total number of claims is highly skewed with a median of eleven claims, lower and upper quartiles of five and 25, respectively, and a maximum exceeding 250. Note that records for MS-specific prescription drugs are not included in the vector of claim times $\mathbf{t}_k$. Instead, these records are used for validation purposes given that the use of disease-modifying therapy (DMT) is a near-perfect proxy for a positive MS status, but absence of these drugs is not a good indicator for a negative MS status.

The follow-up time $\tau_k$ is defined as the time difference in years between the first MS-specific record and the end of the study (December 31, 2013), the time of death or the time of last registration in the province, whichever occurs first. The average follow-up time among apparent cases is 8.4 years with a standard deviation of 4.8 years.

Because all primary confounders were used during the matching of cases and controls, we do not consider further covariate adjustment in the estimation of exposure-disease associations. Parameter vector $\boldsymbol{\gamma}$ in Model (3.17) therefore reduces to an intercept parameter $\gamma_0$ for this application. MCMC samples from the posterior distributions (3.19) and (3.20) were generated in WinBUGS [47] using two Markov chains with different initial values.

We begin with the results of the counting process mixture model for apparent MS cases.
For all model parameters, posterior medians, 2.5 and 97.5 percentiles are displayed in Table 3.2. We consider posterior medians to be point estimates of the model parameters. The posterior median of the mixing proportion, $\pi$, indicates that 76% of apparent cases are correctly classified, and lies close to the positive predictive value of the ProMS case definition reported by Marrie et al. [51], $\widehat{pp} = 0.75$. This is reassuring, and provides some evidence that the two groups identified by the mixture model may in fact be representative of true and false positive MS cases.

Table 3.2: Posterior median, 2.5 and 97.5 percentile of counting process mixture model parameters estimated from MS-specific claim history of 6830 study participants in the ProMS case group.

| Percentile | $\mu$ (cases) | $\beta_1$ (cases) | $\beta_2$ (cases) | $\alpha_1$ (controls) | $\alpha_2$ (controls) | $\pi$ (mixing prop.) |
|---|---|---|---|---|---|---|
| 2.5 | 0.62 | 4.66 | 2.46 | 5.30 | 2.46 | 0.74 |
| 50 | 0.65 | 4.70 | 2.54 | 5.36 | 2.54 | 0.76 |
| 97.5 | 0.68 | 4.75 | 2.62 | 5.41 | 2.62 | 0.77 |

For true positive MS cases, an average of 0.65 claims per year are attributed to the constant baseline intensity, while the remaining claims occur as the result of a high degree of history dependence in the time series. After a claim is observed, intensities jump by $\hat\beta_1 = 4.70$ and decay to the baseline level at a rate of $\hat\beta_2 = 2.54$. This decay rate corresponds to a 72%, 92% and 98% intensity reduction after 6 months, one year and 18 months, respectively, from the time of the last claim. For false positive cases, intensity jumps are slightly higher with $\hat\alpha_1 = 5.36$, but decay to zero at the same rate due to the constraints imposed on the model parameters. To better understand the process characteristics for this group, we simulated 10,000 claim time series from the fitted counting process model $f_0$ for a follow-up time of $\tau = 10$. On average, we observed a total of 3.2 claims per time series, with an average difference between the first and last claim of 0.6 years, and a standard deviation of 0.4 years.
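The intensity reductions quoted above for the decay rate $\hat\beta_2 = 2.54$ follow directly from the exponential term of the intensity models: after an elapsed time $u$ since the last claim, the history-dependent part of the intensity has decayed by a fraction $1 - \exp(-\hat\beta_2 u)$. The short sketch below reproduces this arithmetic; the function name `intensity_reduction` is introduced here purely for illustration.

```python
import math

beta2_hat = 2.54  # posterior median of the decay rate, Table 3.2


def intensity_reduction(decay_rate: float, years_since_claim: float) -> float:
    """Fraction by which the post-claim intensity jump has decayed."""
    return 1.0 - math.exp(-decay_rate * years_since_claim)


reductions = {gap: intensity_reduction(beta2_hat, gap) for gap in (0.5, 1.0, 1.5)}
for gap, reduction in reductions.items():
    print(f"{gap:.1f} years after the last claim: {100 * reduction:.0f}% reduction")
```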
With only 5% of these time series showing a time difference exceeding one and a half years, claims for false positive cases tend to stop shortly after the first claim is observed.

Using the fitted mixture model, we evaluated Equation (3.18) at the posterior medians of the model parameters to obtain point estimates of the misclassification probabilities $\pi_k$. In Figure 3.3, the distribution of these estimated probabilities is displayed for the full ProMS case group, the subgroup of 910 apparent cases captured by the BCMS database and the subgroup of 1755 apparent cases with at least one administrative record for a DMT. In the density plot of $\pi_k$ for the full ProMS case group, we see a bimodal distribution with peaks at both ends of the $[0, 1]$ interval, suggesting a high degree of certainty about the negative and positive disease status for the majority of ProMS cases. In contrast, distributions for BCMS cases and DMT users are concentrated at the upper limit of the probability spectrum. Recalling that an inclusion in the BCMS database and the use of disease-modifying therapy are near-perfect proxies for a positive MS status, participants in both groups tend to be correctly assigned to the group of true MS cases based on their estimated probabilities $\pi_k$.

To further examine if the two claim patterns discerned by the mixture model correspond to those identified in Section 3.2.3, we replicated the scatterplot of Figure 3.2, but with points now coloured according to their predicted misclassification probability $\pi_k$. The result is shown in Figure 3.4. With darker colours concentrated mainly along the bottom right-hand side of the plot, low values of $\pi_k$ are assigned to individuals with a long gap between their last claim and the end of follow-up. Conversely, lighter colours tend to appear for those with claims close to the end of follow-up, a pattern assumed to be indicative of a true positive case.
Vertical colour gradients are most pronounced for participants with long follow-up times, whereas participants with about three years or less tend to exhibit probability values close to the mixing proportion $\hat\pi = 0.76$. Overall, we see high discrimination among apparent cases with long follow-up, but little for those who were followed for only a short time period. This probability pattern corresponds nicely with our discussion of the example time series in Section 3.2.3 and confirms that the two claim patterns have been adequately captured by the counting process models.

Figure 3.3: Distribution of predicted misclassification probabilities $\pi_k$ for ProMS cases (left), ProMS cases captured by the British Columbia Multiple Sclerosis database (centre), and ProMS cases with at least one administrative record for an MS-specific disease-modifying therapy (right). Horizontal axis: $P(D = 1 \mid \mathbf{t})$; vertical axis: density.

For the proposed risk model (3.17) with probability weights estimated by the counting process mixture model, posterior medians and 95% equal-tailed credible intervals of the log odds ratios of MS and six morbidities in the prodromal phase are shown in Figure 3.5. For the purpose of comparison, we also display the results of a naive analysis where all study participants are assumed to be correctly classified, as well as the results of an adjusted analysis that relies on the assumption of non-differential outcome misclassification. The latter analysis is a special case of Model (3.17) with $\delta_1 = 0$ and constant weights $\pi_k = pp$ that correspond to the positive predictive value of the MS case definition. This value has been estimated as $\widehat{pp} = 0.83$ (95% CI: 0.82 - 0.85) in an external validation study [52] and was used to determine the hyperparameters of a Beta prior distribution for $pp$ as outlined in Section 2.2.7.
Posterior summary statistics for the remaining parameters of Model (3.17) are displayed in the Appendix in Table B.1.

As seen in Figure 3.5, posterior medians of all three analyses are in agreement on the direction of association between MS and each of the six morbidities, but tend to show considerable differences in size. In comparison to the naive analysis, odds ratios adjusted for non-differential misclassification lie further away from the null hypothesis for all morbidities. This is in line with results of Chapter 2 showing that naive analyses of matched case-control data lead to an attenuation of odds ratios towards the null hypothesis in the case of non-differential misclassification. For all morbidities with positive MS association, analyses adjusted for differential and non-differential misclassification produce posterior medians that lie on opposite sides of those obtained from naive analyses, raising concerns that the assumption of non-differential misclassification may not be appropriate in this context. Instead, with differentially-adjusted odds ratios lower than those from naive analyses, these results suggest that apparent cases with low probability values $\pi_k$ tend to show a higher burden of morbidity compared to apparent cases with high values.

Figure 3.4: Time difference between the first and last MS-specific claim relative to the total length of follow-up, plotted against follow-up time in years for ProMS cases. Points are coloured according to a participant's probability of being a true MS case, as estimated by the counting process mixture model.

In addition to largely different point estimates obtained by the three analyses, a comparison of the 95% credible intervals further reveals that depending on the type of analysis chosen, one may arrive at different conclusions for a test of no association between MS and presence of morbidity. For hypertension, 95% credible intervals under differential adjustment cover $\log \mathrm{OR} = 0$ while those of the naive and non-differentially adjusted analysis do not. The opposite is true for diabetes, where the credible interval of the log odds ratio from the proposed model is removed from the null, yet credible intervals from naive and non-differential analyses include the value of $\log \mathrm{OR} = 0$. This is concerning, and highlights the possibility of committing both Type I and II errors if non-differential misclassification of the disease is falsely assumed.

Figure 3.5: Estimated log odds ratios of MS and six morbidities in the prodromal phase using the proposed model with differential misclassification adjustment (adjusted (diff)), a model with non-differential misclassification adjustment (adjusted (non-diff)) and a naive analysis where all study participants are assumed to be correctly classified (naive).

3.4 Discussion

In this chapter, we developed a flexible model to estimate exposure-disease associations from matched case-control studies under differential misclassification in the case group. Assuming that prior knowledge about a potentially complex misclassification mechanism is unavailable, we proposed a modelling approach to individually express the probability of misclassification for each participant in the case group, and to use these probabilities as weights in a Bayesian analysis.
As predictor variables in this model, we derived features from disease-specific health care utilization history, which were found to have good discriminatory ability in our motivating example. Although we focused on a binary exposure variable with a logistic link, the model generalizes easily to non-binary exposures or other link functions.

To estimate the probability of misclassification, we applied a mixture model with two underlying counting process models for MS-specific claim patterns of true and false positive cases. Because of the chronic and debilitating nature of MS, we suspect that these claim patterns are unique to our application and would not expect that a mixture model of the same form is generally applicable to other diseases. Rather, our work shows that in addition to aggregated claim counts, administrative databases capture a variety of important temporal information, and that this information can be exploited to arrive at an improved evaluation of a participant's true disease status.

With only three model parameters, the counting process models used to describe the claim patterns of true and false positive cases are very simple. Despite their simplicity, or perhaps because of it, we found that these models performed best in discriminating between the two temporal claim patterns we identified in our exploratory analysis. More complicated models with parametric baseline intensity functions and random effects to account for heterogeneity often provided better fits to the data, but also tended to assign very high or low probability weights to participants for whom we were unwilling to make a distinction due to short follow-up. Because the estimation of misclassification probabilities was the primary goal of this analysis, we preferred a model that reflected our exploratory insights over a better-fitting model.

In the proposed exposure model, within-strata dependence was accounted for by a strata-specific random effect.
In our motivating example, we chose this effect to follow a mean-zero normal distribution in line with common practice, but should note that there is no particular evidence to support this choice. In a sensitivity analysis, we considered two additional random effect distributions as suggested by Lee and Thompson [44] and found little change in association estimates between the three analyses. Details and results of the sensitivity analysis can be found in Appendix B.1.

In the application of the proposed model to our motivating dataset, a comparison with odds ratios estimated under naive and non-differentially adjusted analyses was particularly interesting. Odds ratios produced under the proposed model were consistently lower than those of the naive analysis. This suggests an increased risk of all morbidities for participants falsely suspected of MS and leads to the conclusion that the assumption of non-differential misclassification may not be appropriate in this context. Instead, controls with high morbidity burden appear to be more likely to receive a falsely assigned MS code compared to the general population. These results demonstrate the difficulty of judging the characteristics of a classification mechanism when case and control status is determined from disease-specific billing codes, and also highlight that unadjusted analyses can lead to false insights about exposure-disease associations.

Some limitations of our analysis need to be highlighted. Despite the exposure model’s dependence on the stratum-specific random effect bk, we did not include bk in the counting process mixture model, based on the hard-to-verify assumption that MS and the matching variables of the study are independent conditionally on time series of MS-specific healthcare utilization.
Further, all insights drawn from the differentially-adjusted analysis of ProMS data rely on the assumption that the two groups identified by the mixture model are in fact true MS cases and those falsely suspected of the disease. Because the true disease status is unavailable, we are unable to fully verify this assumption, but the analysis of misclassification probabilities among BCMS cases and DMT users provided some evidence towards its validity.

Overall, the work presented in this chapter illustrates the need for misclassification adjustments in the analysis of health administrative databases to arrive at valid inferences. We demonstrated how supplementary information captured by administrative databases may be utilized to avoid the need to rely on simplified assumptions about the underlying classification mechanism, and to arrive at subject-specific probabilities of misclassification. The models proposed in this chapter, along with possible extensions in future work, provide an example of how to better utilize health administrative databases as a comprehensive data source for epidemiological studies.

Chapter 4

Assessing the non-differential misclassification assumption: A Bayesian approach using non-identified models

4.1 Introduction

In epidemiological studies where misclassification of a binary trait is suspected, investigators often assume that classification is non-differential with respect to the other variables under study [5, 29]. This assumption can be attractive for two different reasons. First, if the investigator chooses to address the misclassification issue in the statistical analysis, specification of a model relating the observed label and the true underlying trait is required. In the case of non-differential misclassification, this relationship is relatively simple and requires only two additional parameters (sensitivity and specificity) that are typically well understood.
Differential misclassification, on the other hand, will introduce at least four additional parameters, a rather undesirable scenario given that substantive knowledge about each parameter is required to facilitate analyses. Second, the investigator may choose to ignore the issue in the statistical analysis, but instead rely on the rule of thumb that non-differential misclassification will lead to an attenuation of association measures towards the null [60]. Because no increase in Type I error is feared, tests of hypotheses are considered valid at the nominal α level and results are reported. Of course, several authors have warned that this need not be true under differential misclassification [16, 41].

Perhaps surprisingly, the reliance upon the non-differential misclassification assumption has not spurred the development of methodology with which it can be examined. In the setting of an exposure-disease association study with misclassified disease status D, the crux of the problem is that we wish to estimate the exposure effect on the classification accuracy, while adjusting for the true yet unobserved disease status. If exposure E is assumed to affect the odds of a positive disease label equally for cases and controls, this may entail inference about β1 in

$$\operatorname{logit}\{P(D^* = 1 \mid D, E)\} = \beta_0 + \beta_1 E + \beta_2 D. \tag{4.1}$$

With three parameters to be estimated, but only two degrees of freedom provided by the number of positive labels for each exposure level, this model is non-identified.

In related work dealing with unmeasured covariates in non-linear models, several authors have achieved identifiability by use of instrumental variables and imperfect surrogates [2, 10, 50]. When covariates are binary (e.g. an unobservable disease status), Mahajan [50] establishes a set of relatively general conditions for identifiability.
Translated into our specific context, we require (1) information on a correlate of the disease that need not be independent of the exposure and (2) information on an instrumental variable conditionally independent of the correlate given the disease and the exposure.

In epidemiological research, a natural candidate to play the role of the former is a known confounding variable. It is associated with both the disease and the exposure by definition, while also being routinely collected and therefore readily available to investigators. On the other hand, finding a suitable instrumental variable to fulfill the latter condition is likely to be more challenging, in particular when variables are limited to information collected by government agencies, such as vital statistics and demographics.

Meanwhile, several papers have illustrated the value of Bayesian analyses of non-identified models [18, 33, 34, 46]. Using two examples, Gustafson [34] showed that despite non-identifiability, limiting posterior distributions can sometimes be sufficiently narrow to provide useful insights into the parameters of interest. As we shall see, one of these examples is closely related to the problem considered here. In the absence of an instrumental variable and faced with non-identifiability, it is therefore interesting to consider how well one can detect violations of the non-differential misclassification assumption using a Bayesian analysis of a non-identified model.

In the following work, we aim to explore this question in the setting of an epidemiological investigation about an exposure-disease relationship, where the binary exposure is suspected to influence the chance of observing a positive disease label. In line with Mahajan [50], we assume that information on a binary confounder variable has been recorded, but we do not assume that an instrumental variable is available.
In contrast to Gustafson [34], who focused on the limits of posterior distributions, we explore their finite sample properties in simulation studies with practically relevant settings. Lastly, we apply the proposed approach to ProMS data to examine if previous diagnostic codes for morbidities suspected to be part of the MS prodrome increase the chance of a falsely assigned MS code.

[Figure 4.1: two diagrams, each with nodes D, E, U and D∗.]

Figure 4.1: Relationship of confounder U, true and apparent disease status D and D∗ and exposure E under non-differential misclassification (left) and differential misclassification with respect to E (right).

4.2 Testing the non-differential misclassification assumption

4.2.1 A general model of differential misclassification

In line with previous chapters, let D denote the true but unobserved disease status, D∗ the assigned disease label and E the binary exposure under investigation. A binary confounder, denoted U, is taken to be associated with both D and E, but is assumed to have no direct effect on the disease label D∗. A diagram displaying the relationships among the four variables is shown in Figure 4.1.

With additional information on U, we may now model the distribution of (D∗, E | U), and augment Model (4.1) as follows

$$
\begin{aligned}
\operatorname{logit}\{P(D^* = 1 \mid E, D, U)\} &= \beta_0 + \beta_1 E + \beta_2 D + \beta_3 DE \\
\operatorname{logit}\{P(D = 1 \mid E, U)\} &= \alpha_0 + \alpha_1 E + \alpha_2 U \\
\operatorname{logit}\{P(E = 1 \mid U)\} &= \gamma_0 + \gamma_1 U,
\end{aligned}
\tag{4.2}
$$

where SN0 = expit(β0 + β2) and SN1 = expit(β0 + β1 + β2 + β3) represent the sensitivity for unexposed and exposed cases, and similarly, SP0 = 1 − expit(β0) and SP1 = 1 − expit(β0 + β1) the specificity for unexposed and exposed controls. Non-differential misclassification with respect to E corresponds to β1 = β3 = 0. Notice that this represents a generalization of Model (4.1) as we allow exposure effects to differ between cases and controls, and also that P(D∗ = 1 | E, D, U) = P(D∗ = 1 | E, D) is assumed.
The latter is comparable to the assumptions of an instrumental variable analysis, where the instrument is required to have no direct effect on the outcome [34].

Motivated by the real data example of Section 4.4, we focus the following investigation on two scenarios that represent realistic mechanisms of differential misclassification for practical applications: (a) the exposure E increases the odds of a positive label equally for both cases and controls, and (b) the exposure E increases the chance of a positive label for controls only, and cases are known to be unaffected. In the first case, we assume that β3 = 0 is known a priori, while in the second case, we constrain β3 = −β1. The target of inference, β1, is thus a main effect in setting (a), but an interaction effect in setting (b). Both resulting models contain eight unknown parameters while the data provide six degrees of freedom, three per cross-classification table of (D∗, E) for each level of U, and are therefore non-identified.

Denoting pik = P(D∗ = 1 | E = i, D = k), qij = P(D = 1 | E = i, U = j) and rj = P(E = 1 | U = j), the density arising under Model (4.2) is given by

$$
f(x \mid \theta) = \prod_{i,j=0,1} \bigl(p_{i0}(1 - q_{ij}) + p_{i1} q_{ij}\bigr)^{y_{ij}} \bigl((1 - p_{i0})(1 - q_{ij}) + (1 - p_{i1}) q_{ij}\bigr)^{n_{ij} - y_{ij}} \times \prod_{j=0,1} r_j^{n_{1j}} (1 - r_j)^{n_{0j}},
\tag{4.3}
$$

where θ = (β0, β1, β2, α0, α1, α2, γ0, γ1)′ and x = (n00, n01, n10, n11, y00, y01, y10, y11)′ represents the observed data, with nij denoting the number of individuals with (E = i, U = j) and yij denoting the number of positive labels among those with (E = i, U = j). Notice that factors involving γ0 and γ1 are isolated in the likelihood from those involving the remaining model parameters. The parameter of interest, β1, therefore depends on the data only via four of the six estimable quantities, denoted oij = P(D∗ = 1 | E = i, U = j), i, j = 0, 1, while (γ0, γ1) are identified by r0 and r1.
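The dependence of the likelihood on the parameters only through the cell probabilities oij can be made concrete with a small sketch. The Python code below maps the parameters of Model (4.2) to pik, qij and oij and evaluates the label part of (4.3); the function names and the counts are hypothetical and chosen for illustration only. The two parameter vectors are a deliberately extreme case (β2 = β3 = 0, so the label carries no information about D), picked because it is easy to verify by hand that they imply the same oij.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def label_probs(beta0, beta1, beta2, beta3, alpha0, alpha1, alpha2):
    """Return o[i][j] = P(D* = 1 | E = i, U = j) implied by Model (4.2)."""
    o = [[0.0, 0.0], [0.0, 0.0]]
    for i in (0, 1):
        p_i0 = expit(beta0 + beta1 * i)                      # P(D* = 1 | E = i, D = 0)
        p_i1 = expit(beta0 + beta1 * i + beta2 + beta3 * i)  # P(D* = 1 | E = i, D = 1)
        for j in (0, 1):
            q_ij = expit(alpha0 + alpha1 * i + alpha2 * j)   # P(D = 1 | E = i, U = j)
            o[i][j] = p_i0 * (1.0 - q_ij) + p_i1 * q_ij
    return o


def label_loglik(o, n, y):
    """Log of the first product in (4.3): y[i][j] positive labels out of n[i][j]."""
    return sum(
        y[i][j] * math.log(o[i][j]) + (n[i][j] - y[i][j]) * math.log(1.0 - o[i][j])
        for i in (0, 1)
        for j in (0, 1)
    )


# Hypothetical counts, indexed as [E][U]; not taken from any real data set.
n = [[3000, 1200], [900, 500]]
y = [[600, 300], [250, 170]]

# Two distinct parameter vectors with identical o_ij, hence identical likelihood.
theta_a = dict(beta0=-1.0, beta1=0.5, beta2=0.0, beta3=0.0,
               alpha0=-2.0, alpha1=0.3, alpha2=0.4)
theta_b = dict(beta0=-1.0, beta1=0.5, beta2=0.0, beta3=0.0,
               alpha0=-0.5, alpha1=-1.0, alpha2=1.5)
ll_a = label_loglik(label_probs(**theta_a), n, y)
ll_b = label_loglik(label_probs(**theta_b), n, y)
```

Here `ll_a` and `ll_b` agree up to floating-point error even though the α parameters differ, which is the defining feature of a non-identified model. For β2 ≠ 0 the set of parameter values sharing a likelihood value is less obvious, but the same function can be used to explore it numerically.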
In comparison to Model (4.1), in which β1 depends upon P(D∗ = 1 | E = i), i = 0, 1 only, model augmentation using confounder information U adds two additional degrees of freedom, while introducing only one additional non-identifiable parameter, α2.

[Figure 4.2: for E = 1 and for E = 0, an axis of P(D∗ = 1) running from 0 to 1 with pi0, oi0, oi1 and pi1 marked in that order; distances between points are labelled β1† and β2†.]

Figure 4.2: Relationship between identified probabilities oij = P(D∗ = 1 | E = i, U = j) (black) and non-identified probabilities pik = P(D∗ = 1 | E = i, D = k) (blue) under misclassification model (4.2) when both disease groups are differentially misclassified. Key: † on the logistic probability scale.

To further illustrate how confounder U impacts the underlying structure of the problem, Figure 4.2 displays the relationship between the four identified probabilities oij, i, j = 0, 1, and the non-identified probabilities pik, i, k = 0, 1, for one example setting of model parameters α0 through β2 when both disease groups are differentially misclassified and β3 = 0. Identified and non-identified quantities are shown in black and blue, respectively. The target parameter β1 and the (D∗, D) association parameter β2 are given by

$$
\begin{aligned}
\beta_1 &= \operatorname{logit}(p_{10}) - \operatorname{logit}(p_{00}) = \operatorname{logit}(p_{11}) - \operatorname{logit}(p_{01}), \\
\beta_2 &= \operatorname{logit}(p_{01}) - \operatorname{logit}(p_{00}) = \operatorname{logit}(p_{11}) - \operatorname{logit}(p_{10}).
\end{aligned}
\tag{4.4}
$$

Accounting for confounder U introduces two main constraints on the parameters in line one of Model (4.2). First, because oi0 and oi1 are probability-weighted averages of the two unidentified probabilities pi0 and pi1 and therefore must lie between pi0 and pi1 for each level of E, oi1 − oi0 induces a bound on the (D∗, D) association parameter, β2. Second, following from the same argument and the relationships of Equation (4.4),

$$
\begin{cases}
p_{00} \le o_{00},\, o_{01} \le p_{01}, \quad p_{10} \le o_{10},\, o_{11} \le p_{11} & \text{if } \beta_2 \ge 0, \\
p_{00} > o_{00},\, o_{01} > p_{01}, \quad p_{10} > o_{10},\, o_{11} > p_{11} & \text{otherwise},
\end{cases}
$$

implying that the difference between o0j and o1j informs the size of β1.
In the specific example of Figure 4.2 where β2 > 0, positive values for β1 are favoured given that o10 − o00 and o11 − o01 are both positive.

4.2.2 Bayesian analysis of non-identified models

Denoting x as a vector of observed data, a model f(·) parameterized by θ ∈ Θ is said to be non-identified if there exist θ1, θ2 ∈ Θ such that θ1 ≠ θ2 and

$$f(x \mid \theta_1) = f(x \mid \theta_2).$$

In other words, the likelihood function of a non-identified model does not have a unique global maximum with respect to Θ. While this property prohibits maximum likelihood inference about θ, Bayesian inference does not suffer a similar fate. Proper posterior distributions f(θ | x) are always available provided that prior distributions are taken to be proper, and one can proceed with posterior inference about θ as in the identified case. Specifically, posterior credible intervals will retain their calibration property; that is, frequentist-type coverage, averaged over the distribution of the true θ, remains at the nominal value [36].

In contrast, posterior distributions f(θ | x) no longer obey the regular asymptotic results of the identified case. Posterior distributions will no longer converge to a point mass at the true parameter value, but instead converge to a non-degenerate distribution whose shape depends upon the prior distribution. Consequently, there is no longer asymptotic equivalence between Bayesian and frequentist interval estimators, and as a result, Bayesian credible intervals generally do not have approximate frequentist coverage properties in the non-identified context [35].

A direct consequence of this behaviour is that frequentist-type testing of simple null hypotheses based on the posterior credible interval’s coverage of the null value is no longer valid. To test the assumption of non-differential misclassification, we therefore consider Bayes factors as Bayesian measures of evidence against the null [6].
Generally, for hypotheses H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1, with Θ0 and Θ1 denoting disjoint subsets of Θ, the Bayes factor is given by

$$B_{10} = \frac{f(x \mid H_1)}{f(x \mid H_0)}, \tag{4.5}$$

where

$$f(x \mid H_i) = \int_{\Theta_i} f(x \mid \theta)\, \pi_i(\theta)\, d\theta$$

and πi represents the prior restricted to the subset of the parameter space Θi. The value of B10 is a measure of strength of the data’s support for H1 relative to H0, with values above or below 1 suggesting evidence for or against the alternative, respectively.

With f(x | Hi) being a marginal likelihood, direct evaluation of Equation (4.5) can be complicated in the case of a high-dimensional parameter space Θ. However, for situations where the testing problem can be posed as a comparison of two models, one nested within the other, the Savage-Dickey density ratio [19] provides a useful tool for easier computation of the Bayes factor. Specifically, if θ can be decomposed as θ = (ω, ψ) and we wish to test H0 : ω = ω0 versus H1 : ω ≠ ω0, the Bayes factor can be expressed as the ratio of the prior to the posterior density of ω under the alternative, both evaluated at ω0,

$$B_{10} = \frac{\pi_1(\omega_0)}{f_1(\omega_0 \mid x)}, \tag{4.6}$$

provided that

$$\pi_1(\psi \mid \omega_0) = \pi_0(\psi). \tag{4.7}$$

Computation of B10 then reduces to an evaluation of the marginal posterior f1(ω | x) at ω = ω0, and with posterior MCMC samples and a univariate density estimation method at hand, this value can be readily approximated. For our specific question, provided that (4.7) holds, the relevant hypotheses to test the assumption of non-differential misclassification are H0 : β1 = 0 versus H1 : β1 ≠ 0, leading to

$$B_{10} = \frac{\pi_1(\beta_1 = 0)}{f_1(\beta_1 = 0 \mid x)}. \tag{4.8}$$

Clearly, assumption (4.7) holds when ψ and ω are independent a priori, but may not under certain specifications of dependent priors. Consider, for example, the case of a bivariate parameter vector (α, β)′ with N(0, Σ) prior under the full model and a null hypothesis of H0 : β = 0.
Under the alternative, we have α | β ∼ N(βρσα/σβ, (1 − ρ²)σα²), which simplifies to N(0, (1 − ρ²)σα²) when β = 0 and represents the left hand side of Equation (4.7), π1(α | β = 0). On the other hand, the marginal prior of α under the restricted model, π0(α), is N(0, σα²), and thus differs from π1(α | β = 0). For such situations where assumption (4.7) is violated, Verdinelli and Wasserman [64] provide a correction factor for the right hand side of Equation (4.6).

4.2.3 Prior distributions

With the prior’s impact on the posterior no longer vanishing as n increases, Bayesian analyses of non-identified models require particular care when specifying π(θ). Going back to Model (4.2) and the dependencies among the four variables depicted in Figure 4.1, parameters α0 through γ1 correspond to distinct quantities of the data generating mechanism, so it is reasonable to consider them independent a priori. Prior distributions are therefore chosen as follows:

$$
\begin{aligned}
\alpha_0 &\sim N^{-}(0, \sigma_1^2), & \alpha_1 &\sim N(0, \sigma_2^2), & \alpha_2 &\sim N(0, \sigma_2^2), \\
\beta_0 &\sim N(\mu_1, \sigma_3^2), & \beta_1 &\sim N(0, \sigma_2^2), & \beta_2 &\sim N(\mu_2, \sigma_4^2), \\
\gamma_0 &\sim N(0, \sigma_1^2), & \gamma_1 &\sim N(0, \sigma_2^2),
\end{aligned}
\tag{4.9}
$$

where N⁻(·) denotes a half-normal distribution restricted to the negative real line. Recall that β3 is completely specified as β3 = 0 or β3 = −β1 under the two scenarios considered here, and thus does not require a prior distribution in Model (4.2).

In line with Gustafson [34], we set 2σ1 = logit(0.98) to achieve a weakly informative prior distribution for the prevalence of E and D, and 2σ2 = log(6) to downweight extreme E−U, E−D and D−U associations. Further, we set µ1 = logit(0.25), 2σ3 = logit(0.5) − logit(0.25) to rule out small values of SP0, and µ2 = logit(0.75) − logit(0.25), 2σ4 = logit(0.75) − logit(0.25) to reflect that most realistic classifiers are expected to perform better than chance, that is SN0 > 1 − SP0. In general, these values need to be carefully chosen, and the commonly used default setting of µ1 = µ2 = 0 for vague priors will not reflect a realistic misclassification problem.
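To see what these hyperparameter choices imply on a more familiar scale, the following Python sketch solves each "2σ" statement for σ and converts central 95% prior intervals back to probabilities where that is meaningful. The variable names and the helper `central_interval` are illustrative assumptions; only the numerical settings come from the text.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hyperparameters as stated in the text, each "2 * sigma = ..." solved for sigma.
sigma1 = logit(0.98) / 2.0                   # intercepts alpha0 and gamma0
sigma2 = math.log(6.0) / 2.0                 # log odds ratios alpha1, alpha2, beta1, gamma1
mu1 = logit(0.25)                            # prior mean of beta0
sigma3 = (logit(0.5) - logit(0.25)) / 2.0    # prior sd of beta0
mu2 = logit(0.75) - logit(0.25)              # prior mean of beta2
sigma4 = (logit(0.75) - logit(0.25)) / 2.0   # prior sd of beta2


def central_interval(mean, sd, z=1.959964):
    """Central 95% interval of a N(mean, sd^2) distribution."""
    return mean - z * sd, mean + z * sd


# Prior interval for the target parameter beta1 on the log odds ratio scale.
beta1_lo, beta1_hi = central_interval(0.0, sigma2)
beta1_length = beta1_hi - beta1_lo

# Implied central 95% prior range of the false positive rate 1 - SP0 = expit(beta0).
fp0_lo, fp0_hi = (expit(b) for b in central_interval(mu1, sigma3))
```

Under these settings the prior interval for β1 has a length of roughly 3.5 on the log odds scale, and the prior for β0 places most of its mass on false positive rates between about 0.10 and just under 0.50, consistent with the stated aim of ruling out small values of SP0.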
Lastly, we assume that the disease under investigation has a prevalence that is small to moderate in size and restrict α0 to take negative values only. This will prevent the label-switching phenomenon that commonly occurs during MCMC sampling.

4.2.4 Estimation under outcome related subject selection

In epidemiological studies of rare diseases, study participants are often sampled retrospectively based on presence or absence of the disease under study. With D being unobserved in our context, sampling would thus proceed using the disease label D∗ rather than D. In place of the prospective specification in (4.2), a model more reflective of this particular sampling strategy is given by

$$
\begin{aligned}
\operatorname{logit}\{P(U = 1 \mid E, D, D^*)\} &= \beta_0 + \beta_1 E + \beta_2 D \\
\operatorname{logit}\{P(E = 1 \mid D, D^*)\} &= \alpha_0 + \alpha_1 D^* + \alpha_2 D + \alpha_3 D^* D \\
P(D = 1 \mid D^*) &= \gamma_0 I(D^* = 0) + \gamma_1 I(D^* = 1),
\end{aligned}
\tag{4.10}
$$

where the overall negative and positive predictive value of the classifier correspond to 1 − γ0 and γ1, respectively. If α1 = α3 = 0, the disease label D∗ contains no additional information about E that is not explained by D, and misclassification is non-differential with respect to E.

In light of the motivating example that we will encounter in Section 4.4, we assume in the following investigation that D is a rare disease. This implies a near-perfect negative predictive value for a range of sensitivities and specificities and justifies the assumption of γ0 = 0. Consequently, because occurrence of (D = 1, D∗ = 0) is ruled out a priori, we set α3 = 0 to avoid overparameterization in Model (4.10) and focus on the scenario where E is suspected to influence the classification accuracy for controls only.
As a result, α1 is the parameter of interest, and parameters β0, β1 and α0 are fully identified given that a negative disease label is indicative of a negative disease status.

To finalize the Bayesian specification of Model (4.10), we set the following independent prior distributions for the parameters,

$$
\begin{aligned}
\beta_0 &\sim N(0, \sigma_1^2), & \beta_1 &\sim N(0, \sigma_2^2), & \beta_2 &\sim N(0, \sigma_2^2), \\
\alpha_0 &\sim N(0, \sigma_1^2), & \alpha_1 &\sim N(0, \sigma_2^2), & \alpha_2 &\sim N(0, \sigma_2^2), \\
\gamma_1 &\sim \mathrm{Unif}(l_1, u_1),
\end{aligned}
\tag{4.11}
$$

with hyperparameters σ1, σ2 as specified above to impose mild prior information on the parameters in the confounder and exposure model. We further assume that prior knowledge on the positive predictive value of the classifier is available to constrain the values of γ1 to the interval [l1, u1].

4.3 Simulation study

In this section, we explore the extent to which the posterior distribution of β1 and α1 arising under prospective model (4.2) and retrospective model (4.10), respectively, can provide insights about violations of the non-differential misclassification assumption.

4.3.1 Design

Confounder, exposure and disease information was generated from the following models

$$
\begin{aligned}
U &\sim \mathrm{Bin}(1, \eta_1), \\
E \mid U &\sim \mathrm{Bin}(1, \eta_2), \\
D \mid E, U &\sim \mathrm{Bin}(1, \eta_3),
\end{aligned}
\tag{4.12}
$$

where

$$\operatorname{logit}(\eta_2) = a_0 + a_1 u, \qquad \operatorname{logit}(\eta_3) = b_0 + b_1 u + b_2 e.$$

Sensitivities and specificities to generate the observed disease label are chosen depending on whether exposure affects the classification accuracy of both disease groups or of controls only. Under differential misclassification for cases and controls, we set logit(1 − SP1) − logit(1 − SP0) = δ and logit(SN1) − logit(SN0) = δ. Under differential misclassification for controls only, we set logit(1 − SP1) − logit(1 − SP0) = δ and SN1 = SN0. In both cases, δ controls the exposure effect on the classification accuracy and measures the severity of differential misclassification. Notice that δ corresponds to β1 in Model (4.2).

For investigations involving the prospective model (4.2), new data on n individuals were generated in each simulation run.
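The prospective data-generating mechanism described above can be sketched in a few lines of Python. This is an illustrative reimplementation rather than the code used for the thesis; the function names are hypothetical, the default values mirror the constant settings of the simulation design (η1 = 0.3, a0 = b0 = logit(0.2), a1 = b1 = b2 = log(1.5)), and a sensitivity of exactly 1 is left unchanged by the logit shift, which is an assumption made here to accommodate the setting SN0 = 1.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def shift_on_logit_scale(p, delta):
    """Shift probability p by delta on the logit scale; p = 0 or 1 stays unchanged."""
    if p <= 0.0 or p >= 1.0:
        return p
    return expit(logit(p) + delta)


def simulate_prospective(n, sn0, fp0, delta, controls_only=False,
                         eta1=0.3, a0=logit(0.2), a1=math.log(1.5),
                         b0=logit(0.2), b1=math.log(1.5), b2=math.log(1.5),
                         seed=None):
    """Draw (U, E, D, D*) for n subjects following (4.12) and return the
    observable counts n_ij and y_ij, indexed as [E][U]. fp0 denotes 1 - SP0."""
    rng = random.Random(seed)
    n_ij = [[0, 0], [0, 0]]
    y_ij = [[0, 0], [0, 0]]
    for _ in range(n):
        u = int(rng.random() < eta1)
        e = int(rng.random() < expit(a0 + a1 * u))
        d = int(rng.random() < expit(b0 + b1 * u + b2 * e))
        if d == 0:
            p_label = shift_on_logit_scale(fp0, delta * e)   # 1 - SP0 or 1 - SP1
        elif controls_only:
            p_label = sn0                                    # SN1 = SN0
        else:
            p_label = shift_on_logit_scale(sn0, delta * e)   # SN0 or SN1
        d_star = int(rng.random() < p_label)
        n_ij[e][u] += 1
        y_ij[e][u] += d_star                                 # D itself is never recorded
    return n_ij, y_ij


# Example: one data set with both disease groups differentially misclassified.
n_ij, y_ij = simulate_prospective(7000, sn0=0.9, fp0=0.2, delta=1.0, seed=1)
```

Only the counts `n_ij` and `y_ij` are returned because the true disease status is unobserved in the analysis; these are the quantities that enter the likelihood in (4.3).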
For investigations involving the retrospective model (4.10), we initially generated data on a population of size N = 1,000,000, from which n/2 apparent cases with D∗ = 1 and n/2 apparent controls with D∗ = 0 were randomly selected in each simulation run.

As a first goal of the simulation study, we examine what degree of differential misclassification can be detected, similarly to a power analysis in a frequentist setting. In the case of prospective modelling, we consider a range of sensitivities and specificities for unexposed cases and controls with SN0 = 0.7, 0.9, 1 and 1 − SP0 = 0.1, 0.2, and set δ = 0, 0.5, 1 to examine the model’s performance under two degrees of differential misclassification as well as non-differential misclassification. Sensitivities and specificities are set to SN0 = SN1 = 1, 1 − SP0 = 0.1, 0.2, 0.3, 0.4 and δ = 0, 0.5, 1 for the case of retrospective modelling. The remaining parameters are held constant at η1 = 0.3, a0 = b0 = logit(0.2) and a1 = b1 = b2 = log(1.5). To mitigate variability due to small cell counts in the (E, U, D∗) cross classification table, the sample size was set to a comparatively large value of n = 7000.

As a second goal, we aim to investigate to what extent the degree of association between D, E and U affects the informativeness of the posterior distribution of the target parameters. In situations where one can choose between several confounding variables, or in the case of study design, the question of how the properties of U impact inferences about β1 or α1 is of particular interest.
Parameters a1, b1 and η1 are set to b1 = −log(1.5), 0, log(1.5), a1 = −log(1.5), 0, log(1.5) and η1 = 0.3, 0.5 to examine posterior distributions under various degrees of confounding, including the case when U is uncorrelated with D, and under different values of the confounder prevalence. The remaining parameters appearing in Model (4.12) are set to SN0 = 0.7, 1 − SP0 = 0.2 and SN0 = SN1 = 1, 1 − SP0 = 0.3 for prospective and retrospective modelling, respectively, and to δ = 1, n = 7000, a0 = b0 = logit(0.2), b2 = −log(1.5), log(1.5) for this investigation.

Prior distributions for parameters in Models (4.2) and (4.10) were specified as stated in (4.9) and (4.11), respectively. For the retrospective model, new hyperparameters for the uniform prior of γ1 were generated in each simulation run such that the support was of length 0.15 and also included the true underlying value. Specifically, denoting pp as the true positive predictive value, we generated ρ ∼ Unif(0, 0.15), and set u1 = min(1, pp + ρ) and l1 = u1 − 0.15. This approach was intended to incorporate uncertainty in prior point and interval estimates of pp into the simulation study.

For each parameter setting, 500 datasets were generated and a total of 150,000 MCMC samples, thinned at a rate of 10, were generated in JAGS [57] following a burn-in period of 2000 samples. For each dataset, Bayes factors were calculated from the posterior samples of the target parameter using the spline-based density estimation method of Stone et al. [62], implemented in the R package logspline.
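The step from posterior samples to a Bayes factor can be illustrated with a small, self-contained Python sketch. It is not the procedure used in the thesis: a Gaussian kernel density estimate with a rule-of-thumb bandwidth stands in for the logspline estimator, and draws from a conjugate normal toy model (assumed here purely so that the exact answer is known) stand in for JAGS output. With B10 defined as in (4.5), that is, evidence for H1 relative to H0, the Savage-Dickey ratio is the prior density at the null value divided by the posterior density at the null value. All function and variable names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def kde_at(samples, point):
    """Gaussian kernel density estimate at `point` (rule-of-thumb bandwidth)."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    h = 1.06 * sd * n ** (-0.2)
    return sum(normal_pdf(point, s, h) for s in samples) / n


def savage_dickey_b10(posterior_samples, prior_density_at_null, null_value=0.0):
    """B10 = prior density / estimated posterior density, both at the null value."""
    return prior_density_at_null / kde_at(posterior_samples, null_value)


# Toy model (an assumption for illustration): prior beta1 ~ N(0, tau^2) and a
# single estimate est ~ N(beta1, se^2), so the posterior of beta1 is normal.
tau, est, se = math.log(6.0) / 2.0, 0.6, 0.35
post_var = 1.0 / (1.0 / tau ** 2 + 1.0 / se ** 2)
post_mean = post_var * est / se ** 2
post_sd = math.sqrt(post_var)

rng = random.Random(2018)
draws = [rng.gauss(post_mean, post_sd) for _ in range(20000)]  # stand-in for MCMC samples

# Three routes to the same Bayes factor.
b10_kde = savage_dickey_b10(draws, normal_pdf(0.0, 0.0, tau))
b10_exact = normal_pdf(0.0, 0.0, tau) / normal_pdf(0.0, post_mean, post_sd)
b10_marginal = (normal_pdf(est, 0.0, math.sqrt(tau ** 2 + se ** 2))
                / normal_pdf(est, 0.0, se))
```

In this toy setting `b10_exact` and `b10_marginal` coincide, confirming that the density ratio reproduces the ratio of marginal likelihoods, while `b10_kde` approximates them from samples alone. The quality of that approximation depends on how far the null value lies in the tail of the posterior, which is one reason a density estimator suited to tails is preferable in practice.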
In addition to summary statistics of the posterior distribution of β1 in (4.2) and α1 in (4.10), we report the average length of the 95% equal-tailed posterior credible intervals (Length), the intervals’ coverage proportion of the true value of δ (Cover), the median Bayes factors (BF) as well as the percentage of intervals with 2.5th percentile above the null value (LB>0).

4.3.2 Results for prospective modelling

The results of the simulation study are displayed in Figures 4.3, 4.4 and 4.5 for investigations involving the prospective Model (4.2). When exposure effects on classification accuracy are correctly assumed to be the same for both disease groups (Figure 4.3, top panel), we see considerable updating in the posterior distributions of β1 compared to the prior, both with respect to length of the credible intervals, but also in terms of the posterior median when misclassification is differential (δ = 0.5, 1). When δ = 1, intervals tend to be well removed from β1 = 0, and median Bayes factors range from 4:1 to 9:1 odds in favour of the alternative. Evidence against the null is weaker in the case of δ = 0.5; while posterior medians tend to agree nicely with the true value, intervals are not sufficiently removed from zero to warrant a claim of differential misclassification. In the case of non-differential misclassification (δ = 0), medians remain close to zero, and Bayes factors correctly indicate odds in favour of the null hypothesis. With the average median being lower than the true value in all settings with δ = 1, and higher when δ = 0, biases in posterior medians are clearly visible and tend to increase as SN0 and 1 − SP0 are increasingly at odds with the means of the prior distributions for β2 and β0.
[Figure 4.3: two panels (top and bottom) showing posterior intervals of β1 for each combination of SN0, 1 − SP0 and δ, with the prior in the last row; the accompanying table has columns Length, Cover, BF and LB>0.]

Figure 4.3: Differential misclassification for cases and controls (top) and controls only (bottom): Average median, 2.5% and 97.5% percentile of posterior distribution of β1, for selected settings of sensitivity (SN0), specificity (SP0) and exposure effect (δ). In each row, the true value of δ is indicated by a cross (+). The respective values for the prior distribution of β1 are given in the last row. The right hand side table shows median Bayes factors (BF), average length (Length) and coverage (Cover) of 95% posterior credible intervals, and percentage of intervals with lower bound greater than 0 (LB>0). Other parameters are n = 7000, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

[Figure 4.4: six density plots, one each for α0, α1, α2, β0, β1 and β2.]

Figure 4.4: Prior to posterior updating for non-identified parameters of the prospective model for one set of simulated data, when cases and controls are differentially misclassified. In each plot, posterior and prior distributions are given by solid and dashed lines. True parameter values are indicated by the dotted vertical lines.
Model parameters are 1 − SP0 = 0.2, SN0 = 0.9, δ = 1, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

Under differential misclassification for the control group only, the bottom panel of Figure 4.3 shows fair prior updating, but also an increase in average length of the credible intervals in comparison to the first panel. Posterior distributions tend to be more skewed with a longer left tail that extends beyond β1 = 0 in the majority of simulations. Consequently, frequentist-type power, if measured via coverage of β1 = 0, would be low in most cases. Bayes factors, on the other hand, seem to take the skewness into account and give moderately strong odds in favour of the alternative when δ = 1.

To illustrate the degree of prior updating for the remaining non-identified parameters in Model (4.2), Figure 4.4 shows prior and posterior distributions for one set of simulated data with 1 − SP0 = 0.2, SN0 = 0.9 and δ = 1 when both disease groups are differentially misclassified. While the posterior of β2 shows little change relative to the prior, we see considerable updating for α2 where the posterior correctly identifies the direction of the D−U association. Posteriors for α0 and α1 remain similar to the prior with respect to width, but show a shift in modes towards the true values. The characteristic long left tail of the β1 posterior, a feature particularly important in the context of testing, is also clearly visible. When only the control group is subject to differential misclassification, the degree and direction of updating is similar to Figure 4.4, but left tails of the β1 posterior tend to extend further across the β1 = 0 vertical line.

[Figure 4.5: two panels (top and bottom) showing posterior intervals of β1 for each combination of OR(E,D) = 0.67, 1.5, η1 = 0.3, 0.5 and OR(U,D) = 0.67, 1, 1.5, with the prior in the last row; the accompanying table has columns Length, Cover, BF and LB>0.]

Figure 4.5: Differential misclassification for cases and controls (top) and controls only (bottom): Average median, 2.5% and 97.5% percentiles of the posterior distribution of β1 for selected settings of confounder prevalence (η1), conditional disease-confounder (OR(U,D | E)) and exposure-disease (OR(E,D | U)) associations. In each row, the true value of δ is indicated by a cross (+). The respective values for the prior distribution of β1 are indicated in the last row. Other parameter values are SN0 = 0.7, 1 − SP0 = 0.2, δ = 1, a0 = b0 = logit(0.2), a1 = log(1.5), n = 7000.

Figure 4.5 shows the impact of the confounder prevalence η1 as well as conditional D−E | U and D−U | E associations on the posterior distributions of β1.
When classification accuracy of only the control group is impacted by the exposure (bottom panel), increasing confounder-disease associations can be seen to decrease prior updating. Posterior intervals are shortest and Bayes factors tend to show the highest odds against the null when D and U are negatively correlated conditionally on E. Posterior medians show notable underestimation of the true values when D and E are negatively associated, but credible intervals maintain coverage for all simulated datasets. The prevalence of the confounder, η1, does not show any notable effect on posterior intervals for our choice of values for this parameter.

When both disease groups are subject to differential misclassification (Figure 4.5, top panel), average lengths are largest for positive D − U associations, similar to the results of the bottom panel, but are similar when U is uncorrelated, or negatively correlated, with D conditionally on E. There seems little to be gained when the confounder contains information about the unknown disease status, and in the case of positive exposure-disease associations, posterior inferences even appear to suffer from increasingly positive and negative D − U associations. As touched on previously in Section 4.2.1, associations between D and U impose a lower bound on the D − D∗ association parameter that tends to rule out lower values of β2. Because β2 and β1 are negatively correlated a posteriori, additional density for lower values of β1 is added as a result, and Bayes factors tend to decrease.

Under both misclassification scenarios, changes in the U − E association parameter a1 did not impact posterior intervals of β1 in a noteworthy way, and simulation results are not shown in Figure 4.5.
This is not surprising, given that β1 depends on (E,U) only conditionally via P(D∗ = 1 | E = i, U = j), i, j = 0, 1.

Results for retrospective modelling

Simulation results for the retrospective model specification are displayed in Figures 4.6, 4.7 and 4.8. As seen in Figure 4.6, posterior updating relative to the prior varies considerably across the values of SP0, ranging from virtually none in the case of 1 − SP0 = 0.1 for all settings of δ, to modest when 1 − SP0 = 0.4. The two settings translate to positive predictive values of approximately 0.85 and 0.50, respectively, and highlight that an increasing number of misclassified controls in the sample leads to stronger evidence against the null. Biases in posterior medians decrease with increasing 1 − SP0, and average posterior medians agree closely with the true values when 1 − SP0 = 0.4. Despite improvements in posterior updating, median Bayes factors remain quite low, with maximum odds of 3:1 against the null.

Figure 4.7 illustrates the degree of prior to posterior updating of non-identified parameters for one set of simulated data. The change in the posterior distribution of β2, which corresponds to the conditional U − D association, is most striking, both in terms of posterior length and, as seen in this particular example, agreement with the true underlying value. In contrast, updating in α1 and α2 is less pronounced, with shifts in medians corresponding to the direction of the true values, but smaller changes in posterior uncertainty.

Lastly, results for investigations about the impact of confounder properties on posterior intervals are shown in Figure 4.8. Compared to the results of the prospective modelling, where non-zero D − U associations had little positive or even negative impact on posterior lengths, the opposite is true under retrospective modelling. Intervals are longest and median Bayes factors lowest when D is uncorrelated with U.
[Figure 4.6: plot omitted. Rows are indexed by 1 − SP0, δ and 1 − SP1, with the prior in the last row; the horizontal axis shows the posterior of α1, and summary columns include Length and Cover.]

Figure 4.6: Differential misclassification for controls only under retrospective sampling: Average median, 2.5% and 97.5% percentiles of the posterior distribution of α1 for selected settings of specificity (SP0) and exposure effect (δ). In each row, the true value of δ is indicated by a cross (+). The respective values for the prior distribution of α1 are given in the last row. Other parameters are n = 7000, SN0 = SN1 = 1, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

[Figure 4.7: plot omitted. Three panels with horizontal axes α1, α2 and β2.]

Figure 4.7: Prior to posterior updating for non-identified parameters of the retrospective model for one set of simulated data when only controls are differentially misclassified. In each plot, posterior and prior distributions are given by solid and dashed lines. True parameter values are indicated by the dotted vertical lines. Model parameters are 1 − SP0 = 0.3, SN0 = SN1 = 1, δ = 1, a1 = b1 = b2 = log(1.5), a0 = b0 = logit(0.2).

The direction of the conditional association between D and E can be seen to impact the average interval length, with negative E − D associations leading to higher Bayes factors when U and D are correlated conditional on E.
There appear to be no consistent changes in length or median Bayes factors resulting from changes in the confounder prevalence η1 or E − U association.

4.4 Application to ProMS data

In this section, we apply the method proposed in Section 4.2.4 to ProMS study data in order to investigate the question of whether healthcare utilization for several morbidities suspected to be part of the MS prodrome leads to an increased chance of falsely assigned MS codes, and consequently an increased chance of misclassification for non-MS patients. The motivation for this investigation stems from findings of Chapter 3 suggesting that individuals suspected to be falsely classified as having MS have a higher prevalence of several morbidities compared to those that are correctly classified.

4.4.1 Data

For illustrative purposes, we consider two different confounder variables U in this analysis: age at index date, dichotomized using a cutoff of 40 years, and sex. With women outnumbering men in a ratio of 3:1 among people with MS, sex is highly associated with the disease. On the other hand, morbidities considered in this example show a range of age and sex associations, making this a good example to illustrate the model’s performance under various settings of E − U and D − U associations. Odds ratios between the confounding variables and the morbidities considered in this analysis are displayed alongside the data in Table 4.1.

Because apparent cases and controls were originally sampled in a matched fashion, we subsampled the ProMS cohort with the goal of removing strata-specific dependencies among the data. This is necessary given that Model (4.10) does not apply to matched data.
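
The exposure-confounder odds ratios reported in Table 4.1 can be checked directly against the cell counts. The sketch below does this for the anxiety row with sex as the confounder. It assumes that the apparent case and control groups are pooled and that the interval is a Wald interval on the log scale; neither is stated in the text, and the function name `odds_ratio_wald` is illustrative only.

```python
import math


def odds_ratio_wald(n00, n01, n10, n11, z=1.96):
    """Odds ratio and Wald interval for a 2x2 table.

    n_eu denotes the count with exposure E = e and confounder U = u.
    """
    or_hat = (n11 * n00) / (n10 * n01)
    se_log = math.sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Anxiety row of Table 4.1 (confounder: sex), counts given as (U = 0, U = 1).
controls_e0, controls_e1 = (1476, 1522), (258, 144)  # D* = 0
cases_e0, cases_e1 = (1882, 779), (605, 134)         # D* = 1

# Assumption: pool apparent cases and controls to form the E-by-U table.
n00 = controls_e0[0] + cases_e0[0]
n01 = controls_e0[1] + cases_e0[1]
n10 = controls_e1[0] + cases_e1[0]
n11 = controls_e1[1] + cases_e1[1]

or_hat, lower, upper = odds_ratio_wald(n00, n01, n10, n11)
print(f"OR = {or_hat:.2f} ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the result agrees with the tabulated value of 0.47 (0.41, 0.54) for anxiety and sex.
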
From 6830 apparent cases included in the original cohort, 3400 were selected at random and assigned to the apparent case group for this analysis. Controls matched to these cases were then discarded, and 3400 subjects were selected from the pool of remaining controls with sampling weights chosen such that their distribution among the four categories defined by the confounder variables (male > 40, male ≤ 40, female > 40, female ≤ 40) is similar to that of the general British Columbia population [61]. The proportions for the four categories are 22.3%, 27.0%, 24.5% and 26.2%, respectively.

[Figure 4.8: plot omitted. Rows are indexed by OR(E,D | U), η1, OR(U,E) and OR(U,D | E), with the prior in the last row; the horizontal axis shows the posterior of α1, with summary columns Length, Cover, BF and LB>0.]

Figure 4.8: Differential misclassification for controls only under retrospective sampling: Average median, 2.5% and 97.5% percentiles of the posterior distribution of α1 for selected settings of confounder prevalence η1, disease-confounder (OR(U,D | E)) and exposure-disease (OR(E,D | U)) associations. The respective values for the prior distribution of α1 are indicated in the last row. Other parameter values are SN1 = SN0 = 1, 1 − SP0 = 0.3, δ = 1, n = 7000.

Table 4.1: Counts of exposed (E = 1) and unexposed (E = 0) study participants with positive (M/> 40) and negative (F/≤ 40) confounder values in case and control group, as well as odds ratios (OR) between exposure and confounder.

                             D∗ = 0                  D∗ = 1
                       E = 0       E = 1       E = 0       E = 1            U − E
Morbidity              U=0   U=1   U=0   U=1   U=0   U=1   U=0   U=1     OR   2.5%  97.5%
Confounder: sex
Anxiety               1476  1522   258   144  1882   779   605   134   0.47   0.41   0.54
Bipolar               1691  1637    43    29  2376   880   111    33   0.65   0.48   0.87
Infl. Bowel Disease   1711  1648    23    18  2445   900    42    13   0.78   0.50   1.19
Depression            1202  1406   532   260  1360   643  1127   270   0.40   0.36   0.45
Diabetes              1645  1565    89   101  2328   824   159    89   1.27   1.05   1.55
Epilepsy              1712  1652    22    14  2402   877    85    36   0.76   0.54   1.06
Fibromyalgia          1573  1557   161   109  2057   806   430   107   0.56   0.48   0.66
Hyperlipidemia        1629  1514   105   152  2302   811   185   102   1.48   1.24   1.77
Hypertension          1479  1418   255   248  2006   724   481   189   0.97   0.85   1.10
Ischemic Heart        1650  1570    84    96  2312   813   175   100   1.26   1.04   1.52
Lung                  1557  1532   177   134  2125   819   362    94   0.66   0.56   0.78
Migraine              1532  1651   202    15  2069   876   418    37   0.12   0.09   0.16
Confounder: age
Anxiety               1586  1412   216   186  1015  1646   279   460   1.11   1.11   1.11
Bipolar               1769  1559    33    39  1248  2008    46    98   1.47   1.47   1.47
Infl. Bowel Disease   1784  1575    18    23  1284  2061    10    45   2.04   2.04   2.04
Depression            1399  1209   403   389   787  1216   507   890   1.27   1.27   1.27
Diabetes              1748  1462    54   136  1257  1895    37   211   3.41   3.41   3.41
Epilepsy              1779  1585    23    13  1250  2029    44    77   1.13   1.13   1.13
Fibromyalgia          1661  1469   141   129  1124  1739   170   367   1.38   1.38   1.38
Hyperlipidemia        1751  1392    51   206  1252  1861    42   245   4.47   4.47   4.47
Hypertension          1694  1203   108   395  1203  1527    91   579   5.19   5.19   5.19
Ischemic Heart        1771  1449    31   149  1264  1861    30   245   5.91   5.91   5.91
Lung                  1653  1436   149   162  1129  1815   165   291   1.23   1.23   1.23
Migraine              1665  1518   137    80  1090  1855   204   251   0.79   0.79   0.79

For all morbidities, we fitted the retrospective Model (4.10) with prior distributions as specified in Section 4.2.4. In line with earlier analyses of these data, we assume that all controls are perfectly classified, but that the case group includes a percentage of people without MS. To reflect the former, we assume γ0 = 0 to be known, and specify γ1 ∼ Beta(55, 18) based on the validation study of Marrie et al. [51].

4.4.2 Results

The posterior distributions of α1, along with Bayes factors for H0 : α1 = 0 versus H1 : α1 ≠ 0, are shown in Figure 4.9 for a range of morbidities suspected to be part of the MS prodrome. Conclusions about the presence of differential misclassification are in agreement between the age and sex-based analyses for the majority of morbidities, with the exception of depression, migraine and, although to a lesser extent, ischemic heart disease.
Posterior intervals tend to be wider for analyses based on age compared to those using sex as a confounder, a result that is in agreement with findings from the simulation study given that sex has a stronger association with MS compared to age. With 27:1 odds against the null, there is strong evidence that non-MS patients with depression have a higher chance of receiving a false MS code compared to those without depression. This result is in line with findings from earlier analyses of these data, in particular the fact that MS patients in the BCMS database showed lower rates of depression before the index date compared to those identified using the case definition of three or more MS-related claims. A similar conclusion can be drawn for migraine, where a Bayes factor of 64 suggests that people with a migraine-related code have a higher chance of falsely assigned MS codes. Previous studies have reported that migraines are strongly correlated with MS and may also be a presenting sign of disease onset [67]. Neurologist consultation upon presentation with migraines, followed by rule-out codes for MS, may thus be a plausible explanation for this finding. Moderate evidence for an association between the presence of diagnostic codes for ischemic heart disease and falsely assigned MS codes is harder to explain and may be the result of confounding between both variables due to increased healthcare utilization among apparent MS cases.

[Figure 4.9: plot omitted. One panel per disease (anxiety, bipolar disorder, bowel disease, depression, diabetes, epilepsy, fibromyalgia, hyperlipidemia, hypertension, ischemic heart disease, lung disease, migraine), with the prior in the last row; the horizontal axis shows the posterior of α1, with summary columns Length and BF.]

Figure 4.9: Median, 2.5% and 97.5% percentiles of the posterior distribution of α1, Bayes factors (BF) and lengths of 95% posterior credible intervals (Length) for a range of morbidities suspected to be part of the MS prodrome. In each panel, the top and bottom rows correspond to analyses using age and sex, respectively, as the confounding variable. Summary statistics for the prior distribution of α1 are indicated in the last row.

4.5 Discussion

In statistical analyses dealing with misclassified data, the question of whether the classification mechanism is non-differential is of central importance. If valid, it allows the analyst to anticipate the direction of bias in association measures, implies validity of Type I error rates when testing hypotheses of no association between exposure and disease, and greatly simplifies analyses that adjust for misclassification bias. In practice, assumptions thereof are often based upon theoretical considerations about the underlying classification mechanism rather than the data at hand.

In this chapter, we aimed to examine how well the data can speak towards violations of the non-differential misclassification assumption. To allow progress despite the non-identified nature of the problem being posed, we considered the Bayesian framework of hypothesis testing, and also considered two sampling designs common in epidemiological research.
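
To make the Bayes factors reported in Section 4.4.2 more concrete, the following sketch converts a Bayes factor in favour of the alternative into a posterior probability of the alternative. It assumes equal prior odds for H0 and H1 unless another value is supplied; the function name `posterior_prob_alternative` is illustrative and not part of the analysis described here.

```python
def posterior_prob_alternative(bayes_factor, prior_odds=1.0):
    """Posterior probability of H1 given a Bayes factor for H1 versus H0.

    posterior odds = Bayes factor * prior odds (H1 : H0).
    """
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Bayes factors quoted in the text for depression and migraine.
for label, bf in [("depression", 27.0), ("migraine", 64.0)]:
    print(label, round(posterior_prob_alternative(bf), 3))
```

With equal prior odds, Bayes factors of 27 and 64 correspond to posterior probabilities of roughly 0.96 and 0.98 for the alternative, whereas a Bayes factor of 1 leaves the prior probability of 0.5 unchanged.
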
Because this investigation was motivated by a study relying on data recorded in government registries, the models impose minimal assumptions about auxiliary variables being available, and only require that investigators have access to a confounder.

Simulation results showed that the power to detect deviations from non-differential misclassification varies greatly depending on the extent of misclassification in the sample, the degree of dependence between disease and exposure, the properties of the confounding variable and the sampling design, among other factors. In most scenarios considered, we found good prior updating despite non-identifiability, and while posterior credible intervals may be considered too wide for useful posterior estimation, updating was sufficient to reject the null of non-differential misclassification under a range of parameter settings. In other cases, Bayes factors were small and credible intervals not sufficiently removed from zero to give strong evidence against the null. Our application to ProMS study data suggested that controls have a higher chance of receiving a positive MS label if diagnosis codes for two morbidities are present, a result that is in line with previous analyses of these data.

Investigations into retrospective and prospective modelling revealed some interesting structural differences that particularly manifested themselves in the effect of the confounding variable on posterior inferences. In the retrospective model, confounder U added a second dimension to the mixture component of the model, with mixing weights held constant at pp, while in the prospective model, U contributed variability among the mixing weights, P(D = 1 | U, E).
As seen in the simulation results, increasing confounder-disease associations were always beneficial for posterior inference in the former case, but could lead to either wider or narrower intervals for the latter depending on D − U associations being negative or positive (and thus mixing weights moving towards or away from 0). Further, confounders that were increasingly associated with the exposure led to shorter posterior intervals in the retrospective model, but did not lead to noteworthy changes in the prospective model.

In practice, the models considered here are relatively simple and can be implemented easily using standard MCMC samplers, but sampling efficiency tends to suffer from high autocorrelation among the Markov chains. Fortunately, models run comparatively fast, and increasing the number of samples, possibly followed by thinning, will not lead to dramatic increases in computational time. Re-parameterization of the models in terms of the fully identified parameters, such as P(D∗ = 1 | E = i, U = j), i, j = 0, 1 in the prospective case, could improve computational efficiency, but dependencies among these quantities may also necessitate a more complicated specification of a joint prior.

Perhaps the biggest hurdle surrounding the use of non-identified models in practice is the specification of the prior distributions. Priors that are at odds with the true underlying values can lead to biases in posterior point estimates in the best case, and to false conclusions about the presence of differential misclassification in the worst. Making matters worse, increasing prior variances may reduce occurrence of the latter, but comes at the cost of failing to detect differential misclassification when it is in fact present. In the prospective model, parameters most at risk of inducing shifts in the posterior of β1 away from a true null are those related to the indirect effect between E and D∗, namely β2 and α2.
Indications thereof could for instance be seen in Figure 4.3 for situations when δ = 0, but posterior medians are shifted away from the null. If the prior mean of the sensitivity parameter β2 were to be chosen sufficiently small, this behaviour could be exacerbated to the point where one would confidently reject H0.

In conclusion, the approach investigated in this chapter may not have shown adequate power to detect differential misclassification in all of the examined settings, but it allowed us to reject the null of non-differential misclassification in some scenarios with moderate to strong evidence. Thus, it is a worthwhile approach in situations where limited data prohibit inferences from fully identified models, but expert knowledge is available to inform the prior distributions of parameters at play.

Chapter 5

Conclusion

With disease information routinely established from diagnostic codes in electronic health records, the topic of outcome misclassification is of current interest in epidemiological research. The work presented in this thesis is the first to examine the problem of misclassified outcomes in the context of matched case-control studies, and thus contributes important theoretical results and novel modelling approaches for a widely-used study design. Further, we discussed the question of assessing the non-differential misclassification assumption with a formal test of hypotheses, a central yet largely unexplored problem in the literature.

In Chapter 2, we established that naive analyses of non-differentially misclassified, pair-matched data using standard conditional logistic regression techniques lead to an attenuation of odds ratios towards the null, the degree of which depends upon the negative and positive predictive values, among other quantities.
This result is particularly important considering that naive analyses of misclassified data are often justified citing high values of sensitivity and specificity, but that low positive predictive values can still arise under near-perfect specificity if the disease under study is rare. Further, we showed that consistent estimation of odds ratios from misclassified data is possible, provided that negative and positive predictive values are known, and that the investigator has access to a validation cohort of correctly classified pairs. We proposed a novel Bayesian model to integrate these internal and external sources of information in a single analysis, and found that our approach produced posterior distributions with improved point estimates and coverage probabilities for simulated data. The contents of Chapter 2 have also been published in Högg et al. [40].

In Chapter 3, we investigated the problem of outcome misclassification in the specific context of health administrative database studies. We argued that in comparison to “traditional” contexts plagued by misclassification, these studies benefit from access to complete healthcare utilization records for study participants, and that this information should be utilized to weight each individual in the analysis according to our certainty about their true disease status. We proposed a weighted Bayesian model for matched case-control data in settings where controls are correctly classified, and applied a point process mixture model to our motivating dataset to estimate each individual’s probability of being a true MS case. Given that case-control studies are typically employed to investigate rare diseases, our assumption of a near-perfect negative predictive value, and therefore ignorable contamination among controls, does not appear to be overly limiting in practice.
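
The remarks on predictive values for rare diseases can be illustrated with a short calculation. The sketch below applies Bayes' rule to hypothetical values of prevalence, sensitivity and specificity, chosen purely for illustration and not taken from the ProMS data; the function name `predictive_values` is likewise an assumption.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive values via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical rare disease: 1 in 1000 prevalence, SN = 0.90, SP = 0.99.
ppv, npv = predictive_values(prevalence=0.001, sensitivity=0.90, specificity=0.99)
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
```

For these hypothetical inputs, fewer than one in ten apparent cases is a true case even though specificity is 0.99, while the negative predictive value stays above 0.999, which is the pattern that makes contamination among controls ignorable.
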
Our real data analysis highlighted that adjustments for non-differential and differential misclassification can lead to contradictory inferences about exposure-disease associations, indicating that simplifying assumptions about the properties of diagnostic code-based case definitions should be made with caution. Therefore, statistical techniques extending to differential misclassification should be applied in the analysis of health administrative database studies.

In Chapter 4, we discussed the problem of assessing the non-differential misclassification assumption with a formal test of hypotheses. Assumptions about the classification process are central to the analysis of misclassified variables, but a lack of statistical tools largely prevents the investigator from exploring these assumptions by use of the available data. Motivated by the non-identified nature of the problem, we explored the value of Bayes factors as measures of evidence for or against the null hypothesis of non-differential misclassification. Simulation studies showed that despite non-identifiability, analyses with mildly informative prior distributions result in sufficient prior-to-posterior updating to provide convincing evidence against the null for a range of realistic parameter settings. Consequently, the proposed approach is a valuable tool when absence of validation data or instrumental variables prohibits inferences from fully-identified models.

In addition to contributions to the statistical literature, the work presented in this thesis also contributes to current research into the epidemiology of MS. All research questions investigated in Chapters 2 to 4 were motivated by real study data, and proposed models were developed to further our understanding of the symptoms preceding the first recognized sign of MS. Our work revealed several unexpected aspects of the ProMS cohort.
Specifically, our results suggest that controls falsely classified as MS cases have an overall higher morbidity burden compared to those correctly classified, and that a previous diagnostic code for depression or migraine increases the chance of falsely assigned codes for MS. Assumptions of non-differential misclassification, and thus attenuation of association measures towards the null, can therefore not be relied upon for this study. Our exploratory analyses of temporal claim patterns in Chapter 3 further illustrated how the currently used case definition of three or more MS-specific records may be improved to achieve more accurate capture of true MS cases.

As highlighted throughout this thesis, epidemiological studies relying on health administrative databases encounter a variety of challenges that appear in both the design and the analysis stages of the study. When determining which case definition to use, investigators are faced with the dilemma of choosing between a larger study cohort that captures virtually all cases in the population, but requires advanced statistical techniques due to inevitable contamination, or a smaller cohort that could suffer from selection bias, but can be analyzed using standard methodology because of ignorable contamination. Motivated by the large, but contaminated group of ProMS cases, our work largely focussed on the first scenario. However, analyses of these data with the models proposed in Chapters 2 and 3 required several assumptions, which ultimately raises the question of whether a smaller but cleaner cohort would have provided more reliable results.

A possible direction for future research is therefore the question of how to arrive at such a “cleaner” cohort. Specifically, how can one make better use of the myriad of information contained in health administrative databases to arrive at a better classification process compared to a mere dichotomization of disease-specific claim counts?
In our motivating dataset, information that has yet to be exploited is a unique physician identifier assigned to every claim, as well as the specialization information for every physician. Preliminary analyses suggest that it is possible to identify MS-specialized neurologists based on this additional information, which could ultimately be used to distinguish among MS-specific ICD codes according to the physician’s experience in diagnosing MS. As such, better discrimination of true and false cases among individuals with low MS-specific claim counts, or those with short follow-up times, may be possible. Because joint statistical modelling of physician- and individual-level claims data would be complicated by several sources of dependence, mixed-type variables and unequal follow-up times, distribution-free machine learning methods such as unsupervised clustering could be worth pursuing.

The approach of Chapter 3 may be viewed as a middle ground between the analysis of a “large but dirty” and a “small but clean” cohort. Instead of assigning a hard disease label using an improved classifier, the mixture model supplied a soft disease label for each individual in the form of a probability. As such, we are able to make use of a large cohort, but can also downweight certain individuals in the estimation of association measures. In the analysis of the ProMS data, we ignored the weights’ dependence on the strata-specific random effect bk on the grounds of weak associations between the matching variables and MS, but better ways to handle this dependence in general settings should be explored in future work. Furthermore, extending the point process mixture model to incorporate physician-level variables could be pursued to provide better discrimination of ProMS cases with short follow-up times. Multi-state models as discussed by Cook et al.
[15], with state i defined as observing an MS-specific claim from physician specialty i, and stayer-mover features to capture claim patterns of true and false MS cases, could provide the necessary extension to a mixture of one-dimensional point processes.

Lastly, models proposed in Chapters 2 and 3 assumed that the exposure variable was measured without error. Yet, this assumption clearly does not hold if a morbidity identified from health administrative data is the exposure of interest, as was the case in our motivating example. In future work, models should thus be extended to account for both disease and exposure misclassification in matched case-control studies.

In conclusion, the strength of this thesis lies in its relevance to both theory and application. The presented work not only provides novel results and modelling approaches for an important open problem in the literature, but also contributes to current epidemiological research into the characteristics of the MS prodrome. Outcome misclassification is a central problem in analyses of administrative health data, and our statistical tools tailored towards these applications will be beneficial for future epidemiological studies of MS and other diseases.

Bibliography

[1] A. Agresti. Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley & Sons, Hoboken, NJ, 2002.
[2] Y. Amemiya. Two-stage instrumental variables estimators for the nonlinear errors-in-variables model. Journal of Econometrics, 44(3):311–332, 1990.
[3] P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding. Statistical models based on counting processes. Springer Series in Statistics. Springer Science & Business Media, New York, 2012.
[4] J. A. Aviña-Zubieta, M. Abrahamowicz, M. A. De Vera, H. K. Choi, E. C. Sayre, M. M. Rahman, M.-P. Sylvestre, W. Wynant, J. M. Esdaile, and D. Lacaille.
Immediate and past cumulative effects of oral glucocorticoids on the risk of acute myocardial infarction in rheumatoid arthritis: a population-based study. Rheumatology, 52(1):68–75, 2013.
[5] L. E. Bautista and V. M. Herrera. An assessment of public health surveillance of Zika virus infection and potentially associated outcomes in Latin America. BMC Public Health, 18(1):656, 2018.
[6] J. O. Berger. Statistical Decision Theory and Bayesian analysis. Springer Series in Statistics. Springer Science & Business Media, New York, 2013.
[7] H. Brenner and D. A. Savitz. The effects of sensitivity and specificity of case selection on validity, sample size, precision, and power in hospital-based case-control studies. American Journal of Epidemiology, 132(1):181–192, 1990.
[8] British Columbia Ministry of Health [creator] (2015). Medical Services Plan (MSP) Payment Information File. Population Data BC [publisher]. Data Extract. MOH (2015). https://www.popdata.bc.ca/data.
[9] British Columbia Ministry of Health [creator] (2016). PharmaNet. BC Ministry of Health [publisher]. Data Extract. Data Stewardship Committee (2015). https://www.popdata.bc.ca/data.
[10] J. S. Buzas and L. A. Stefanski. Instrumental variable estimation in generalized linear measurement error models. Journal of the American Statistical Association, 91(435):999–1006, 1996.
[11] Canadian Institute for Health Information [creator] (2015). Discharge Abstract Database (Hospital Separations). Population Data BC [publisher]. Data Extract. MOH (2015). https://www.popdata.bc.ca/data.
[12] R. J. Carroll, D. Ruppert, and L. A. Stefanski. Measurement Error in Nonlinear Models. CRC Press, Boca Raton, 1995.
[13] P.-H. Chyou. Patterns of bias due to differential misclassification by case–control status in a case–control study. European Journal of Epidemiology, 22(1):7–17, 2007.
[14] R. J. Cook and J. Lawless. The Statistical Analysis of Recurrent Events. Statistics for Biology and Health. Springer Science & Business Media, New York, 2007.
[15] R. J. Cook, J. D. Kalbfleisch, and G. Y. Yi. A generalized mover–stayer model for panel data. Biostatistics, 3(3):407–420, 2002.
[16] K. T. Copeland, H. Checkoway, A. J. McMichael, and R. H. Holbrook. Bias due to misclassification in the estimation of relative risk. American Journal of Epidemiology, 105(5):488–495, 1977.
[17] B. B. Dean, J. Lam, J. L. Natoli, Q. Butler, D. Aguilar, and R. J. Nordyke. Review: Use of electronic medical records for health outcomes research: A literature review. Medical Care Research and Review, 66(6):611–638, 2009.
[18] N. Dendukuri and L. Joseph. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics, pages 158–167, 2001.
[19] J. M. Dickey. The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics, pages 204–223, 1971.
[20] S. W. Duffy, T. E. Rohan, R. Kandel, T. C. Prevost, K. Rice, and J. P. Myles. Misclassification in a matched case-control study with variable matching ratio: application to a study of c-erbb-2 overexpression and breast cancer. Statistics in Medicine, 22(15):2459–2468, 2003.
[21] J. K. Edwards, S. R. Cole, M. A. Troester, and D. B. Richardson. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. American Journal of Epidemiology, 177(9):904–912, 2013.
[22] R. Elton and S. Duffy. Correcting for the effect of misclassification bias in a case-control study using data from two different questionnaires. Biometrics, pages 659–663, 1983.
[23] C. Fernández and M. F. Steel. On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association, 93(441):359–371, 1998.
[24] W. Q. Gan, J. M. FitzGerald, C. Carlsten, M. Sadatsafavi, and M. Brauer. Associations of ambient air pollution with chronic obstructive pulmonary disease hospitalization and mortality. American Journal of Respiratory and Critical Care Medicine, 187(7):721–727, 2013.
[25] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990. ISSN 01621459. URL http://www.jstor.org/stable/2289776.
[26] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.
[27] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, volume 2. Chapman & Hall, Boca Raton, 2014.
[28] R. Gerlach and J. Stamey. Bayesian model selection for logistic regression with misclassified outcomes. Statistical Modelling, 7(3):255–273, 2007.
[29] E. D. Gorham, C. F. Garland, F. C. Garland, W. B. Grant, S. B. Mohr, M. Lipkin, H. L. Newmark, E. Giovannucci, M. Wei, and M. F. Holick. Vitamin D and prevention of colorectal cancer. The Journal of Steroid Biochemistry and Molecular Biology, 97(1-2):179–194, 2005.
[30] S. Greenland. The effect of misclassification in matched-pair case-control studies. American Journal of Epidemiology, 116(2):402–406, 1982.
[31] S. Greenland and D. G. Kleinbaum. Correcting for misclassification in two-way tables and matched-pair studies. International Journal of Epidemiology, 12(1):93–97, 1983.
[32] P. Gustafson. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. CRC Press, Boca Raton, 2003.
[33] P. Gustafson. Measurement error modelling with an approximate instrumental variable. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 69(5):797–815, 2007.
[34] P. Gustafson.
What are the limits of posterior distributions arising from nonidentifiedmodels, and why should we care? Journal of the American Statistical Association, 104(488):1682–1695, 2009. → pages 6, 64, 66, 71[35] P. Gustafson. On the behaviour of Bayesian credible intervals in partially identified models.Electronic Journal of Statistics, 6:2107–2124, 2012. → page 6997[36] P. Gustafson and S. Greenland. Interval estimation for messy observational data. StatisticalScience, 24:328–342, 2009. → page 68[37] P. R. Hahn and M. Xia. A finite mixture model approach to regression under covariatemisclassification. ArXiv e-prints, Nov. 2016. → page 36[38] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications.Biometrika, 57(1):97–109, 1970. URL http://www.jstor.org/stable/2334940. → page 38[39] T. Hoegg. https://github.com/tanjahoegg, 2017. → page 20[40] T. Ho¨gg, J. Petkau, Y. Zhao, P. Gustafson, J. Wijnands, and H. Tremlett. Bayesian analysisof pair-matched case-control studies subject to outcome misclassification. Statistics inMedicine, 36(26):4196–4213, 2017. → pages vi, 91[41] A. M. Jurek, S. Greenland, and G. Maldonado. Brief report: how far from non-differentialdoes exposure or disease misclassification have to be to bias measures of association awayfrom the null? International Journal of Epidemiology, 37(2):382–385, 2008. → page 63[42] A. M. Jurek, G. Maldonado, and S. Greenland. Adjusting for outcome misclassification: theimportance of accounting for case-control sampling and other forms of outcome-relatedselection. Annals of Epidemiology, 23(3):129–135, 2013. → pages 1, 10, 11[43] E. Kingwell, F. Zhu, R. A. Marrie, J. D. Fisk, C. Wolfson, S. Warren, J. Profetto-McGrath,L. W. Svenson, N. Jette, V. Bhan, et al. High incidence and increasing prevalence of multiplesclerosis in British Columbia, Canada: findings from over two decades (1991–2010). Journalof Neurology, 262(10):2352–2363, 2015. → pages 27, 34[44] K. J. Lee and S. G. Thompson. 
Flexible parametric models for random-effects distributions.Statistics in Medicine, 27(3):418–434, 2008. → pages 60, 106[45] J. Liu, P. Gustafson, N. Cherry, and I. Burstyn. Bayesian analysis of a matched case–controlstudy with expert prior information on both the misclassification of exposure and theexposure–disease association. Statistics in Medicine, 28(27):3411–3423, 2009. → pages1, 4, 33[46] Y. Lu, N. Dendukuri, I. Schiller, and L. Joseph. A Bayesian approach to simultaneouslyadjusting for verification and reference standard bias in diagnostic test studies. Statistics inMedicine, 29(24):2532–2543, 2010. → pages 6, 64[47] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS - A Bayesian modellingframework: concepts, structure, and extensibility. Statistics and Computing, 10(4):325–337,2000. → page 5498[48] R. H. Lyles and J. Lin. Sensitivity analysis for misclassification in logistic regression vialikelihood methods and predictive value weighting. Statistics in Medicine, 29(22):2297–2309,2010. → page 1[49] L. S. Magder and J. P. Hughes. Logistic regression when the outcome is measured withuncertainty. American Journal of Epidemiology, 146(2):195–203, 1997. → pages 1, 9[50] A. Mahajan. Identification and estimation of regression models with misclassification.Econometrica, 74(3):631–665, 2006. → pages 6, 63, 64[51] R. A. Marrie, N. Yu, J. Blanchard, S. Leung, and L. Elliott. The rising prevalence andchanging age distribution of multiple sclerosis in Manitoba. Neurology, 74(6):465–71, 2010.→ pages 3, 54, 85[52] R. A. Marrie, J. D. Fisk, K. J. Stadnyk, B. N. Yu, H. Tremlett, C. Wolfson, S. Warren, andV. Bhan. The incidence and prevalence of multiple sclerosis in Nova Scotia, Canada. TheCanadian Journal of Neurological Sciences, 40(06):824–831, 2013. → pages 2, 3, 27, 34, 56[53] R. A. Marrie, J. D. Fisk, K. J. Stadnyk, H. Tremlett, C. Wolfson, S. Warren, V. Bhan, B. 
N.Yu, and CIHR Team in the Epidemiology and Impact of Comorbidity on Multiple Sclerosis.Performance of administrative case definitions for comorbidity in multiple sclerosis inManitoba and Nova Scotia. Chronic Diseases and Injuries in Canada, 34(2-3):145–153, 2014.→ page 26[54] P. McInturff, W. O. Johnson, D. Cowling, and I. A. Gardner. Modelling risk when binaryoutcomes are subject to error. Statistics in Medicine, 23(7):1095–1109, 2004. → pages 1, 9[55] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation ofstate calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. → page 38[56] P. Mu¨ller and K. Roeder. A Bayesian semiparametric model for case-control studies witherrors in variables. Biometrika, 84(3):523–537, 1997. → page 1[57] M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbssampling. In K. Hornik, F. Leisch, and A. Zeileis, editors, Proceedings of the 3rdInternational Workshop on Distributed Statistical Computing, volume 124, pages 1–10, 2003.→ pages 20, 28, 75[58] G. J. Prescott and P. H. Garthwaite. Bayesian analysis of misclassified binary data from amatched case–control study with a validation sub-study. Statistics in Medicine, 24(3):379–401, 2005. → pages 1, 4, 13, 30[59] K. Rice. Full-likelihood approaches to misclassification of a binary exposure in matchedcase-control studies. Statistics in Medicine, 22(20):3177–3194, 2003. → page 199[60] K. J. Rothman, S. Greenland, and T. L. Lash. Modern Epidemiology. Lippincott Williams &Wilkins, Philadelphia, 2008. → pages 2, 63[61] Statistics Canada. British Columbia [province] and Canada [country] (table). census profile.https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/index.cfm?Lang=E,2017. Accessed: June 25, 2018. → page 85[62] C. J. Stone, M. H. Hansen, C. Kooperberg, and Y. K. Truong. Polynomial splines and theirtensor products in extended linear modeling. 
The Annals of Statistics, 25(4):1371–1470,1997. → page 75[63] L. C. Thygesen and A. K. Ersbøll. When the entire population is the sample: strengths andlimitations in register-based epidemiology. European Journal of Epidemiology, 29(8):551–558,2014. → page 6[64] I. Verdinelli and L. Wasserman. Computing Bayes factors using a generalization of theSavage-Dickey density ratio. Journal of the American Statistical Association, 90(430):614–618, 1995. → page 70[65] J. M. Wijnands, E. Kingwell, F. Zhu, Y. Zhao, T. Ho¨gg, K. Stadnyk, O. Ekuma, X. Lu,C. Evans, J. D. Fisk, R. A. Marrie, and H. Tremlett. Health-care use before a firstdemyelinating event suggestive of a multiple sclerosis prodrome: a matched cohort study.The Lancet Neurology, 16(6):445–451, 2017. → pages 2, 4[66] M. Wilchesky, R. M. Tamblyn, and A. Huang. Validation of diagnostic codes within medicalservices claims. Journal of Clinical Epidemiology, 57(2):131–141, 2004. → page 2[67] J. Wilcox, D. Tabby, M. H. Majeed, and B. Youngman. Headache in multiple sclerosis:Features and implications for disease management (p6.163). Neurology, 82(10), 2014. → page86[68] G. Yi. Statistical Analysis with Measurement Error or Misclassification. Springer Series inStatistics. Springer, New York, 2017. 
→ page 1

Appendix A

Appendix for Chapter 1

A.1 The negative predictive value under matched sampling

We consider the case of a single binary confounder $U$ and assume that the sampling from the population of apparent controls with positive and negative confounder status is non-informative, and also that the misclassification process among controls is non-differential with respect to $U$. Specifically, this implies that

\[
\begin{aligned}
P(D = i \mid D^{*} = 0, U = l, I = 1) &= P(D = i \mid D^{*} = 0, U = l), \quad i, l = 0, 1; \\
P(D^{*} = j \mid D = 0, U = l) &= P(D^{*} = j \mid D = 0), \quad j, l = 0, 1.
\end{aligned}
\]

Under the assumption of a positive disease-confounder association, hence $P(U = 1 \mid D = 1) > P(U = 1 \mid D = 0)$ and $P(D = 1 \mid U = 1) > P(D = 1 \mid U = 0)$, misclassification by a classifier with $\mathit{pp} > 1 - \mathit{np}$ implies a higher proportion of confounder-positive subjects in the apparent case group compared to the apparent control group,

\[
\begin{aligned}
q := P(U = 1 \mid D^{*} = 1) &= P(U = 1 \mid D^{*} = 1, D = 1)\,P(D = 1 \mid D^{*} = 1) \\
&\quad + P(U = 1 \mid D^{*} = 1, D = 0)\,P(D = 0 \mid D^{*} = 1) \\
&= P(U = 1 \mid D = 1)\,\mathit{pp} + P(U = 1 \mid D = 0)\,(1 - \mathit{pp}) \\
&> P(U = 1 \mid D = 1)\,(1 - \mathit{np}) + P(U = 1 \mid D = 0)\,\mathit{np} \\
&= P(U = 1 \mid D^{*} = 0, D = 1)\,P(D = 1 \mid D^{*} = 0) \\
&\quad + P(U = 1 \mid D^{*} = 0, D = 0)\,P(D = 0 \mid D^{*} = 0) \\
&= P(U = 1 \mid D^{*} = 0) =: p.
\end{aligned}
\]

Moreover, the negative predictive value among subjects with negative confounder status exceeds that among subjects with positive confounder status,

\[
\begin{aligned}
P(D = 0 \mid D^{*} = 0, U = 1)
&= \frac{P(D^{*} = 0 \mid D = 0, U = 1)\,P(D = 0 \mid U = 1)}{P(D^{*} = 0 \mid D = 0, U = 1)\,P(D = 0 \mid U = 1) + P(D^{*} = 0 \mid D = 1, U = 1)\,P(D = 1 \mid U = 1)} \\
&= \frac{P(D^{*} = 0 \mid D = 0)\,P(D = 0 \mid U = 1)}{P(D^{*} = 0 \mid D = 0)\,P(D = 0 \mid U = 1) + P(D^{*} = 0 \mid D = 1)\,P(D = 1 \mid U = 1)} \\
&< \frac{P(D^{*} = 0 \mid D = 0)\,P(D = 0 \mid U = 0)}{P(D^{*} = 0 \mid D = 0)\,P(D = 0 \mid U = 0) + P(D^{*} = 0 \mid D = 1)\,P(D = 1 \mid U = 0)} \\
&= P(D = 0 \mid D^{*} = 0, U = 0).
\end{aligned}
\]

Consequently, the negative predictive value in the population exceeds that among matched apparent controls as

\[
\begin{aligned}
P(D = 0 \mid D^{*} = 0, I = 1) &= P(D = 0 \mid D^{*} = 0, U = 1, I = 1)\,P(U = 1 \mid D^{*} = 0, I = 1) \\
&\quad + P(D = 0 \mid D^{*} = 0, U = 0, I = 1)\,P(U = 0 \mid D^{*} = 0, I = 1) \\
&= \underbrace{P(D = 0 \mid D^{*} = 0, U = 1)}_{< P(D = 0 \mid D^{*} = 0, U = 0)}\, q + P(D = 0 \mid D^{*} = 0, U = 0)\,(1 - q) \\
&< P(D = 0 \mid D^{*} = 0, U = 1)\, p + P(D = 0 \mid D^{*} = 0, U = 0)\,(1 - p) \\
&= P(D = 0 \mid D^{*} = 0, U = 1)\,P(U = 1 \mid D^{*} = 0) \\
&\quad + P(D = 0 \mid D^{*} = 0, U = 0)\,P(U = 0 \mid D^{*} = 0) \\
&= P(D = 0 \mid D^{*} = 0).
\end{aligned}
\]

Line three uses that matching on $U$ gives the matched apparent controls the confounder distribution of the apparent cases, so that $P(U = 1 \mid D^{*} = 0, I = 1) = q$. The inequality in line four follows because the expressions in lines three and four are weighted averages of the same two probabilities and the larger probability has a larger weight in line four, since $1 - p > 1 - q$.

A.2 Details of the simulation study

For a population of size $N = 1{,}000{,}000$, we generate a continuous confounder $U_i$, binary exposure $E_i$ and the true disease status $D_i$ of subject $i$, $i = 1, \ldots, N$, via the following models

\[
\begin{aligned}
U_i &\sim N(40, 4^2), && \text{(A.1)} \\
E_i \mid U_i &\sim \mathrm{Bin}(1, \pi_{2i}) \quad \text{where } \mathrm{logit}(\pi_{2i}) = a_0 + a_1 u_i, && \text{(A.2)} \\
D_i \mid E_i, U_i &\sim \mathrm{Bin}(1, \pi_{3i}) \quad \text{where } \mathrm{logit}(\pi_{3i}) = b_0 + b_1 u_i + b_2 e_i. && \text{(A.3)}
\end{aligned}
\]

Given the subject's true disease status, the apparent disease status is generated by

\[
D_i^{*} \mid D_i = 1 \sim \mathrm{Bin}(1, SN) \quad \text{and} \quad D_i^{*} \mid D_i = 0 \sim \mathrm{Bin}(1, 1 - SP),
\]

where $SN$ and $SP$ denote the sensitivity and specificity of the classification mechanism. From this population, we randomly select

1. a study cohort containing $n$ matched apparent case-control pairs with $(D_1^{*} = 1, D_2^{*} = 0)$,
2. a validation substudy containing $m$ matched case-control pairs with $(D_1 = 1, D_2 = 0)$,

where subjects are matched on confounder values $U$ that are truncated to the nearest integer. The simulation parameters are chosen to create the following scenarios:

• several degrees of contamination in the apparent case group with $SP = 0.95, 0.97, 0.99, 1$;
• moderate and high disease-exposure association with $OR = \exp(b_2) = 1.5, 2$;
• low and moderate exposure prevalence with $a_0 = \log(0.002), \log(0.01)$;
• various combinations of cohort sizes with $n = 200, 500, 1000, 4000$ and $m = 100, 200, 500, 1000$ where $m < n$.

The parameters of Models (A.1) through (A.3) were set to $a_1 = \log(1.1)$, $b_0 = \log(0.0005)$ and $b_1 = \log(1.1)$. Because $b_0$ is chosen to generate a rare disease with prevalence of approximately 0.02, even low values of $SN$ lead to only minor degrees of contamination in the apparent control group.
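The data-generating steps of Models (A.1) through (A.3) and the misclassification mechanism can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the code used for the thesis: the function names, the seed, and the reduced population size of 50,000 are assumptions made here to keep the example fast, and the matched sampling of pairs is omitted.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_population(n_subjects, a0, b2, sn, sp, seed=2018):
    """Draw (u, e, d, d_star) per subject; names and seed are illustrative."""
    rng = random.Random(seed)
    # Fixed parameter values stated in the text.
    a1, b0, b1 = math.log(1.1), math.log(0.0005), math.log(1.1)
    rows = []
    for _ in range(n_subjects):
        u = rng.gauss(40.0, 4.0)                             # (A.1): mean 40, sd 4
        e = int(rng.random() < expit(a0 + a1 * u))           # (A.2)
        d = int(rng.random() < expit(b0 + b1 * u + b2 * e))  # (A.3)
        # Apparent status: SN for true cases, 1 - SP for true non-cases.
        d_star = int(rng.random() < (sn if d == 1 else 1.0 - sp))
        rows.append((u, e, d, d_star))
    return rows


def predictive_values(rows):
    """Empirical positive and negative predictive values (pp, np)."""
    apparent_pos = sum(1 for _, _, _, ds in rows if ds == 1)
    true_pos = sum(1 for _, _, d, ds in rows if ds == 1 and d == 1)
    true_neg = sum(1 for _, _, d, ds in rows if ds == 0 and d == 0)
    apparent_neg = len(rows) - apparent_pos
    return true_pos / apparent_pos, true_neg / apparent_neg


population = simulate_population(
    50_000, a0=math.log(0.01), b2=math.log(2.0), sn=1.0, sp=0.95
)
pp_hat, np_hat = predictive_values(population)
print(f"pp = {pp_hat:.3f}, np = {np_hat:.3f}")
```

Rerunning the sketch with a value of SN below 1 gives a rough sense of how little the empirical np moves when the simulated disease is this rare.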
Therefore, we limited our simulation study to the setting of $SN = 1$, hence $\mathit{np} = 1$.

For each setting of the simulation parameters, posterior distributions of $OR$ were generated from $R = 1000$ simulated datasets using the proposed Model (2.14) and the naive approach of Model (2.6). For the adjusted analysis, hyperparameters of the prior distributions of $\mathit{np}$ and $\mathit{pp}$ were chosen to examine the following scenarios:

• two degrees of prior uncertainty about $\mathit{pp}$ with $\mathrm{se}(\widehat{\mathit{pp}}) = 0.02, 0.04$;
• presence ($\Delta = 0.04$) and absence ($\Delta = 0$) of a difference between the true positive predictive value and its point estimate, i.e. choose $\widehat{\mathit{pp}} = \mathit{pp} - \Delta$.

When $SP = 1$, the positive predictive value was set to $\mathit{pp} = 1$, and no prior uncertainty was included in the model. For two chains with initial values sampled from the prior distribution of the model parameters, a total of 15000 MCMC samples from the posterior distribution of $OR$ were generated following a burn-in period of 1000 samples. Convergence of the Markov chains was assessed using the Gelman-Rubin diagnostic [26].

A.3 The case of odds ratios adjusted towards the null

Using Equation (2.17), $OR$ can be restated as

\[
\begin{aligned}
OR &= \frac{\theta_{10} - \theta_{01}}{\theta_{01|10}\,\bigl(\mathit{pp}\,\mathit{np} - (1 - \mathit{pp})(1 - \mathit{np})\bigr)} + 1 \\
&= \frac{1}{\mathit{pp}\,\mathit{np} - (1 - \mathit{pp})(1 - \mathit{np})} \Bigl( \frac{\theta_{01}}{\theta_{01|10}} \underbrace{\frac{\theta_{10}}{\theta_{01}}}_{= OR^{*}} - \frac{\theta_{01}}{\theta_{01|10}} \Bigr) + 1 \\
&= \frac{1}{\mathit{pp}\,\mathit{np} - (1 - \mathit{pp})(1 - \mathit{np})}\,\frac{\theta_{01}}{\theta_{01|10}}\,(OR^{*} - 1) + 1.
\end{aligned}
\]

Assuming $OR > 1$, and hence $OR^{*} > 1$, $OR^{*}$ will exceed $OR$ if

\[
\frac{\theta_{01}}{\theta_{01|10}} < \mathit{pp}\,\mathit{np} - (1 - \mathit{pp})(1 - \mathit{np}).
\]

The same condition can also be established for $OR > OR^{*}$ in the case of $OR < 1$.

Appendix B

Appendix for Chapter 2

B.1 Sensitivity analysis

In a sensitivity analysis, all analyses outlined in Section 3.2.5 were repeated for two additional distributions of the strata-specific effect $b_k$.
As suggested by Lee and Thompson [44], we consider a mean-zero t distribution $t(\nu, l)$ with scale parameter $\nu = \sigma^2 / (l/(l - 2))$ and $l$ degrees of freedom, as well as the skewed mean-zero normal distribution of Fernández and Steel [23],

\[
p(\beta_k \mid \rho) = \frac{2}{\rho + 1/\rho} \Bigl( f(\beta_k / \rho)\, I_{[0, \infty)}(\beta_k) + f(\beta_k \rho)\, I_{(-\infty, 0)}(\beta_k) \Bigr), \qquad \text{(B.1)}
\]

which introduces a skewness parameter $\rho$ to achieve asymmetry around zero for the normal density $f(\cdot)$. For analyses using the t distribution, we consider a vague uniform prior on $[0, 10]$ for $\sigma$ and a truncated $\mathrm{Exp}(0.1)$ prior for $l$ with $l > 2.5$ as suggested by Lee and Thompson [44]. For the skewed normal distribution, we use a $\mathrm{Gamma}(0.5, 0.318)$ prior for $\rho^2$ as suggested by Fernández and Steel [23], which corresponds to a prior expectation and variance of 1 and 0.57 for $\rho$. Results are displayed in Table B.1.

Table B.1: Results of the sensitivity analysis for ProMS data: Median, 2.5 and 97.5 percentile of the posterior distribution under a normal, t, and skewed normal distribution for the strata-specific effect $b_k$.

| Morbidity | Parameter | Normal 2.5 | Normal 50 | Normal 97.5 | t 2.5 | t 50 | t 97.5 | Skewed normal 2.5 | Skewed normal 50 | Skewed normal 97.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| anxiety | γ0 | -3.34 | -3.25 | -3.17 | -3.29 | -3.21 | -3.14 | -3.88 | -3.27 | -2.56 |
| | δ1 + δ2 | 0.52 | 0.65 | 0.77 | 0.52 | 0.65 | 0.76 | 0.52 | 0.65 | 0.77 |
| | δ1 | 0.87 | 1.07 | 1.25 | 0.87 | 1.06 | 1.24 | 0.88 | 1.07 | 1.25 |
| | σ | 0.41 | 0.54 | 0.66 | 0.33 | 0.45 | 0.58 | 0.14 | 0.37 | 0.60 |
| | l or ρ | – | – | – | 2.74 | 6.83 | 59.90 | 0.19 | 0.94 | 2.98 |
| irr. bowel syndrome | γ0 | -4.58 | -4.44 | -4.32 | -4.59 | -4.45 | -4.33 | -4.74 | -4.37 | -3.94 |
| | δ1 + δ2 | 0.49 | 0.71 | 0.91 | 0.49 | 0.71 | 0.91 | 0.49 | 0.71 | 0.91 |
| | δ1 | 0.65 | 1.00 | 1.30 | 0.66 | 1.00 | 1.30 | 0.65 | 1.00 | 1.31 |
| | σ | 0.13 | 0.39 | 0.60 | 0.24 | 0.42 | 0.63 | 0.08 | 0.36 | 0.68 |
| | l or ρ | – | – | – | 4.06 | 10.50 | 25.14 | 0.57 | 1.14 | 2.41 |
| depression | γ0 | -2.24 | -2.19 | -2.15 | -2.23 | -2.19 | -2.14 | -2.62 | -2.07 | -1.76 |
| | δ1 + δ2 | 0.65 | 0.74 | 0.82 | 0.65 | 0.74 | 0.82 | 0.65 | 0.74 | 0.82 |
| | δ1 | 0.83 | 0.97 | 1.11 | 0.82 | 0.97 | 1.11 | 0.83 | 0.97 | 1.11 |
| | σ | 0.30 | 0.39 | 0.48 | 0.26 | 0.38 | 0.46 | 0.19 | 0.36 | 0.47 |
| | l or ρ | – | – | – | 5.02 | 15.47 | 47.83 | 0.32 | 1.23 | 2.26 |
| diabetes | γ0 | -3.36 | -3.27 | -3.18 | -3.33 | -3.24 | -3.16 | -4.13 | -3.37 | -2.49 |
| | δ1 + δ2 | -0.43 | -0.24 | -0.07 | -0.42 | -0.24 | -0.07 | -0.42 | -0.25 | -0.08 |
| | δ1 | 0.20 | 0.45 | 0.67 | 0.20 | 0.45 | 0.67 | 0.20 | 0.45 | 0.68 |
| | σ | 0.76 | 0.86 | 0.96 | 0.72 | 0.82 | 0.93 | 0.12 | 0.80 | 1.00 |
| | l or ρ | – | – | – | 7.12 | 18.11 | 48.81 | 0.11 | 0.91 | 1.81 |
| hypertension | γ0 | -2.66 | -2.60 | -2.53 | -2.65 | -2.58 | -2.52 | -3.44 | -1.92 | -1.55 |
| | δ1 + δ2 | -0.04 | 0.07 | 0.18 | -0.04 | 0.07 | 0.18 | -0.04 | 0.07 | 0.18 |
| | δ1 | 0.32 | 0.51 | 0.69 | 0.32 | 0.51 | 0.68 | 0.32 | 0.50 | 0.69 |
| | σ | 1.27 | 1.34 | 1.40 | 1.26 | 1.33 | 1.39 | 0.88 | 1.45 | 1.53 |
| | l or ρ | – | – | – | 15.50 | 33.64 | 77.07 | 0.53 | 1.42 | 1.71 |
| lung disease | γ0 | -2.89 | -2.82 | -2.76 | -2.88 | -2.82 | -2.75 | -3.19 | -2.81 | -2.38 |
| | δ1 + δ2 | 0.13 | 0.25 | 0.37 | 0.13 | 0.25 | 0.37 | 0.13 | 0.25 | 0.37 |
| | δ1 | 0.32 | 0.53 | 0.72 | 0.32 | 0.53 | 0.72 | 0.33 | 0.53 | 0.72 |
| | σ | 0.28 | 0.40 | 0.52 | 0.24 | 0.38 | 0.49 | 0.15 | 0.31 | 0.50 |
| | l or ρ | – | – | – | 4.63 | 13.51 | 53.61 | 0.41 | 0.99 | 2.46 |
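To make the role of the skewness parameter ρ in Equation (B.1) concrete, the density can be evaluated directly. The following is a minimal Python sketch under the assumption that f is a normal density with mean zero; the function names, the optional `sigma` argument, and the simple trapezoidal integrator are illustrative choices made here and are not taken from the thesis.

```python
import math


def normal_pdf(x, sigma=1.0):
    """Density of a N(0, sigma^2) variable; plays the role of f in (B.1)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def skew_normal_pdf(beta, rho, sigma=1.0):
    """Two-piece skewed density of (B.1) with skewness parameter rho > 0."""
    scale = 2.0 / (rho + 1.0 / rho)
    if beta >= 0.0:
        return scale * normal_pdf(beta / rho, sigma)  # right side is stretched by rho
    return scale * normal_pdf(beta * rho, sigma)      # left side is compressed by rho


def trapezoid(func, lower, upper, steps=60_000):
    """Plain trapezoidal rule, sufficient for a quick numerical check."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * width)
    return total * width


rho = 2.0
total_mass = trapezoid(lambda b: skew_normal_pdf(b, rho), -30.0, 30.0)
right_mass = trapezoid(lambda b: skew_normal_pdf(b, rho), 0.0, 30.0)
print(f"total mass = {total_mass:.4f}, mass above zero = {right_mass:.4f}")
```

For ρ = 1 the two branches coincide and the density reduces to the symmetric normal density; for ρ > 1 more probability mass sits above zero, with the share above zero equal to ρ²/(1 + ρ²).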

