Extensions to the Multiplier Method for InferringPopulation SizebyVivian Yun MengB.Sc. Applied Science, Queen’s University, 2008A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University Of British Columbia(Vancouver)August 2014c© Vivian Yun Meng, 2014AbstractEstimating population size is an important task for epidemiologists and ecologistsalike, for purposes of resource planning and policy making. One method is the“multiplier method” which uses information about a binary trait to infer the sizeof a population. The first half of this thesis presents a likelihood-based estimatorwhich generalizes the multiplier method to accommodate multiple traits as wellas any number of categories (strata) in a trait. The asymptotic variance of thislikelihood-based estimator is obtained through the Fisher Information and its be-haviour with varying study designs is determined. The statistical advantage ofusing additional traits is most pronounced when the traits are uncorrelated and oflow prevalence, and diminishes when the number of traits becomes large. The useof highly stratified traits however, does not appear to provide much advantage overusing binary traits. Finally, a Bayesian implementation of this method is appliedto both simulated data and real data pertaining to an injection-drug user popula-tion. The second half of this thesis is a first systematic approach to quantifying theuncertainty in marginal count data that is an essential component of the multipliermethod. A migration model that captures the stochastic mechanism giving rise touncertainty is proposed. The migration model is applied, in conjunction with themulti-trait multiplier method, to real-data from the British Columbia Centre forDisease Control.iiPrefaceThis dissertation is original, unpublished, independent work by the author, V. Meng.Chapter 3, Appendix A, Appendix B, Appendix C, Appendix D and Appendix Econtain work submitted to the Annals of Applied Statistics, titled “Inferring Pop-ulation Size: Extending the Multiplier Method to Incorporate Multiple Traits with aLikelihood-Based Approach” (under review), co-authored with Prof. Paul Gustafsonwho provided guidance for the research which led to said journal submission.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Problem with using multiple traits in the multiplier method . . . . 62.2 Problem with capturing uncertainties in marginal counts . . . . . 62.3 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . 73 Extended Multiplier Method for Multiple Traits . . . . . . . . . . . 83.1 A Likelihood Model for Estimating N . . . . . . . . . . . . . . . 93.1.1 The Reparameterized Multinomial Likelihood Model forModelling Two Binary Traits . . . . . . . . . . . . . . . . 93.1.2 Generalizing the Reparameterization for k Traits . . . . . 113.2 Uncertainty about the Maximum Likelihood Estimator of N . . . . 123.2.1 The Effect of n and N on Estimation Uncertainty . . . . . 13iv3.2.2 The Effect of Additional Traits of Given Prevalence andDegree of Association . . . . . . . . . . . . . . . . . . . 133.2.3 The Effect of Increased Stratification . . . . . . . . . . . 143.3 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . 163.4 Application: San Francisco Injection-Drug User Study . . . . . . 193.4.1 Obtaining a Single Estimate of the Size of IDU PopulationUsing Two Traits . . . . . . . . . . . . . . . . . . . . . . 203.4.2 Alternative Analysis with Marginal Prevalences Only . . . 223.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 Uncertainty in Marginal Counts – Accounting for Migration . . . . 254.1 A Model for Migration . . . . . . . . . . . . . . . . . . . . . . . 264.2 Combining the Migration Model with the Multiplier Method . . . 284.3 Application: Estimating the Size of MSM Population in GVRD . 294.3.1 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . 294.3.2 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . 304.3.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37A On the Inappropriateness of Using Capture-Recapture Inspired Method-ology to Analyse Data for the Multiplier Method . . . . . . . . . . . 41B Bench-Marking with Equi-Correlation and Equi-Prevalence for theGeneral Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44C Deriving the Fisher Information . . . . . . . . . . . . . . . . . . . . 47C.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47C.2 Defining a Reparameterization for the Cell Probabilities . . . . . . 48C.3 The Fisher Information . . . . . . . . . . . . . . . . . . . . . . . 48C.4 Examining the Effect of Changing N on Var(NˆMLE) . . . . . . . . 49vD Implementing Bayesian Inference . . . . . . . . . . . . . . . . . . . 51D.1 Model Specification for Inferring N with Information from TwoTraits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51D.2 Model Specification for Inferring N with Information from ThreeTraits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52E Properties of the Bayesian Estimator using Multiple Traits with theMultiplier Method – a Simulation Study . . . . . . . . . . . . . . . . 56E.1 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 57F JAGS Model File for Combined Modelling of Migration and Multi-ple Trait Multiplier Method . . . . . . . . . . . . . . . . . . . . . . . 59viList of TablesTable 3.1 Contingency table for two traits, binary categories . . . . . . . 9Table 3.2 Estimates and prevalences from Johnston et al. (2013) study . . 20Table 3.3 RDS-adjusted proportions in the contingency table for IDU’s inSan Francisco in 2009. . . . . . . . . . . . . . . . . . . . . . . 21Table 3.4 Pseudo-data for estimating size of IDU population with the mul-tiplier method . . . . . . . . . . . . . . . . . . . . . . . . . . 21Table 3.5 Multiplier method under multiple scenarios analysis with marginaldata from Johnston et al. (2013) . . . . . . . . . . . . . . . . . 22Table 4.1 Selected hyper-parameter values for parameters relating to mi-gration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31Table 4.2 Result from inferring size of MSM population with the com-bined model . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Table A.1 Difference in summary statistics between capture recapture stud-ies and multiplier method studies . . . . . . . . . . . . . . . . 42Table E.1 Performance of Bayesian inference under select populations andprior distributions . . . . . . . . . . . . . . . . . . . . . . . . 58viiList of FiguresFigure 3.1 Effect of changing design parameters on the precision of Nˆ . . 15Figure 3.2 Schematic of the simulation study exploring the effect of strat-ification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16Figure 3.3 Effect of stratification on precision of Nˆ . . . . . . . . . . . . 17Figure B.1 Benchmarking any combination of traits with equi-correlationand equi-prevalence . . . . . . . . . . . . . . . . . . . . . . . 46viiiAcknowledgmentsI would like to thank Prof. Paul Gustafson for his guidance in my research, andNSERC for financial support. Dr. Mark Gilbert and Travis Salway Hottes havebeen a major source of motivation for this research. I would also like to thankDr. Lisa Johnston and H. Fisher Raymond for providing data that complements thetheoretical analysis in Chapter 3. Lastly, thanks mom, dad, and Johnty for yourcontinued support.ixChapter 1IntroductionPopulation size estimation is a fundamental interest for ecologists and epidemiol-ogists alike. This knowledge is used for resource allocation, program planning,and estimating disease incidence, amongst other important applications. Popularmethods for estimating population size include: capture-recapture, “direct” sur-vey, network scale-up, and the multiplier method. The capture-recapture methoduses identified membership of a sampled individual across multiple lists to inferpopulation size. It was originally developed for applications in ecology (Pollocket al., 1990), but applications for epidemiological purposes can be found in, as ex-amples, Paz-Bailey et al. (2011) and Hook and Regal (1995). The “direct” surveymethod uses information on the size of a super-population which encompasses thetarget population together with the proportion of people belonging to the targetpopulation in the inference (Purcell et al., 2012; Lieb et al., 2011). The networkscale-up method uses sampled individuals from a super-population whereby eachrespondent is asked to provide information on the size of his/her personal networkand the number of his/her acquaintances belonging to the target population (Ezoeet al., 2012; Salganik et al., 2011). Lastly, the multiplier method (UNAIDS/WHOWorking Group on Global HIV/AIDS and STI Surveillance, 2010; Johnston et al.,2013) uses knowledge of the exact count (marginal count) of people with a certaintrait (Nt) in the target population and a sample proportion of this trait ( pˆt) to inferthe size of the target population, N, based on the simple relationship pt = Nt/N.The World Health Organization has a publication that gives a good account of the1advantages and disadvantages of the various methods listed above (UNAIDS/WHOWorking Group on Global HIV/AIDS and STI Surveillance, 2010). This documentalso points out the importance of considering data-availability and data-reliabilityin the selection of an appropriate estimation method.The work documented in this dissertation is motivated by an application incollaboration with the British Columbia Centre for Disease Control (BCCDC),whereby estimating the size of a “hard-to-reach population” (Magnani et al., 2005)is of importance to public health. When inferring the size of hard-to-reach pop-ulations, extra constraints are imposed in the method selection. It is often noted(Fendrich et al., 1999; Colo´n et al., 2001; Delaney-Black et al., 2010) that membersof the hard-to-reach population do not self-report on their behaviour readily whichresults in biased estimates; as a result, the reliability of direct survey and network-scale-up methods is compromised due to relying on self-reporting of membershipin the hard-to-reach population in a survey of a super-population. As a secondconstraint, privacy regulations in Canada often result in data free of personal iden-tifying information, which makes capture-recapture studies difficult to implementfor human populations. In contrast, the multiplier method is unaffected by the twoconstraints above.The highly popular multiplier method was, however, developed at a time whendata were hard to come by. The method prescribes a way to estimate popula-tion size based on one binary trait only, whereas recently, data on the marginalcounts and sample prevalences for numerous health related traits are available topublic health agencies (Okal et al., 2013; Raymond et al., 2013; Johnston et al.,2013). Without a statistically sound prescription to incorporate data from multipletraits, researchers resort to constructing multiple estimates of N with the multipliermethod, one from each trait, while using sample prevalences captured in a singlesurvey. This attempt is statistically unsatisfactory because it ignores the correlationbetween multiple estimates derived from the same survey. Additionally, it does notresult in a unified statistical conclusion about the size of the target population.Furthermore, the importance of accounting for uncertainty in the marginalcounts has rarely been discussed in literature. This uncertainty may be attributedto various mechanisms, yet previous attempts to capture uncertainty on marginalcounts have resorted to using a simple parametric distribution based on the intuition2of the practitioner (Archibald et al., 2001; Johnston et al., 2013). In these instances,a scientific evaluation of the quality of the selected distribution is difficult when therationale for selection is not formulated explicitly based on stochastic mechanisms.These serious problems were encountered in collaborative work with the BritishColumbia Center for Disease Control (BCCDC), and motivated the methodologiespresented in this thesis. These methodologies advance the multiplier method to-ward a statistically correct utilization of data on multiple traits based on a popu-lar data collection scheme, and a tractable formulation of uncertainty in marginalcounts; this is approached using a likelihood-based and model-based extension ofthe multiplier method. As an additional feature, the new methodology now al-lows categorical traits in the analysis without having to collapse them into binarycategories as before. This thesis contains work that can impact the practice ofepidemiology immediately and widely, given current interests in the epidemiologyliterature on estimating population with the multiplier method using multiple traits.3Chapter 2BackgroundThe original multiplier method developed out of the epidemiology literature, withlittle use of statistical language in its definition even in an authoritative referencedocument by the UNAIDS/WHO Working Group on Global HIV/AIDS and STISurveillance (2010). The method has acquired many aliases, including the servicemultiplier method and the indirect method. The method relies on the followingdefinition for a trait proportion:pt =NtN(2.1)where N, Nt , pt are as defined in Chapter 1. If Nt and pt are known exactly, arearrangement of Equation 2.1 leads to the exact population size, with no need forestimation.However, neither the exact proportion nor marginal count of people with aparticular trait is known exactly in the majority of cases and so are inferred fromdata. A survey is conducted within the target population to obtain the sample traitprevalence. While the target population has no defined sampling frame, samplingdesigns exist, e.g. venue-based sampling (Muhib et al., 2001) or respondent-drivensampling (Heckathorn, 1997), to give estimators of prevalences that are unbiasedunder certain mathematical assumption and adjustments.As for the marginal count for a trait, it is inferred from pre-existing records inpublic health agencies. These records provide a list, and hence a head-count, ofpeople with the trait in question. The raw head-count may differ from the actual4marginal count of a trait in the target population if, for example, trait misidentifi-cation occurs during record-keeping (Archibald et al., 2001) or if the population“catchable” by the record keeping procedure is not the same as the target popula-tion in question, as noted by UNAIDS/WHO Working Group on Global HIV/AIDSand STI Surveillance (2010). However, the raw head-count is commonly acceptedas an estimate of the marginal count without formal justification.Nevertheless, a substitution of pt and Nt with their estimates in Equation 2.1gives an estimate of N. While the majority of studies consulted do not calculateuncertainty on the estimated population size (Raymond et al., 2013; Okal et al.,2013; Luan et al., 2005), the papers by Archibald et al. (2001) and Johnston et al.(2013) give two approaches to generating the confidence interval for the followingestimator,Nˆ =Nˆtpˆt. (2.2)Archibald et al. (2001) treats Nˆ as a function of two random variables, anduses a Monte Carlo simulation to obtain the distribution of Nˆ. Archibald assumeda Normal distribution for both Nˆt , based on ad hoc a priori information, and pˆt ,based on information from survey sampling.Johnston et al. (2013) obtains the confidence interval for the estimator in Equa-tion 2.2 by assuming its convergence to the Normal distribution. The variance isapproximated by the Delta Method (Taylor series expansion),Var(Nˆ)≈Var(Nˆt)E[pˆt ]2+E[Nˆt ]2E[pˆt ]4Var(pˆt).The 95% confidence interval is hence95%CI = Nˆ±1.96Var(Nˆ).Both methods in (Archibald et al., 2001) and in (Johnston et al., 2013) dealwith quantifying the statistical uncertainty from estimating population size withthe multiplier method using a single binary trait. However, these formulas 1) donot extend naturally to estimation with multiple traits , and 2) lack a systematic wayto specify the uncertainty in the inferred marginal count. The following subsections5highlight each of these problems in detail.2.1 Problem with using multiple traits in the multipliermethodIn recent studies, epidemiologists have reported access to information on a vastnumber of “traits” with which to carry out their estimation. Generally there isaccepted intuition that using more information provides better estimates. Suchbelief has lead epidemiologists to synthesize k estimates of the same populationsize using information from k traits, i.e.Nˆ1 =Nˆt1pˆt1,Nˆ2 =Nˆt2pˆt2,...Nˆk =Nˆtkpˆtk,see (Raymond et al., 2013) for example. Should estimators Nˆ1, · · · , Nˆk be indepen-dent, standard procedures, e.g. inverse variance weighting, exist to combine theminto a single best estimate of N.However, these estimates are not independent in the studies involving multipletraits, due to a data collection scheme commonly adopted for practicality. In thesestudies, a single survey querying k traits is conducted, such that pˆt1 , · · · , pˆtk areestimated from a single sample of the target population. Currently, no statisticalmethod exists to account for dependency that arises from the particular design forsurveying the population only once for multiple traits.2.2 Problem with capturing uncertainties in marginalcountsThe uncertainty in marginal counts may be due to several important factors, e.g.population migration (“mobility” in Saidel et al. (2010)), record under-counting/over-6counting due to various reasons, that occur in conjunction or in isolation. In thepast, uncertainty in inferred marginal counts is often specified in an ad hoc man-ner. As an example, Johnston et al. (2013) treated the inferred marginal countas variable according to a single parameter Poisson distribution with mean valueequal to the raw head-count, without further justification. Archibald et al. (2001)assumes inferred marginal count to be Normally distributed with mean value equalto the raw head-count, and lower and upper 95th percentile value determined basedon “the extent of duplicates, risk-category misclassification together with [their]subjective assessment of the likely magnitude of uncertainty.” In either case, it isdifficult to support nor rebut the parametric distributions selected when they arechosen in an ad hoc manner. A systematic way to describing the mechanism maybe the key toward better assessing model misspecification.2.3 Thesis organizationThis thesis makes two contributions to the statistical literature on the multipliermethod. It firstly outlines a likelihood-based approach to address current method-ological limitations due to the “one survey multiple traits” design in Chapter 3, andsecondly presents the first systematic approach to capture uncertainty in marginalcounts in Chapter 4. However, the scope of Chapter 4 is limited to addressing asingle mechanism, population migration, as the source of uncertainty in inferredmarginal counts, which is motivated by the needs of a collaboration in progress.This thesis concludes with remarks on limitations and future work in Chapter 5.7Chapter 3Extended Multiplier Method forMultiple TraitsIn this chapter a likelihood-based method is developed for estimating the size ofa population with multi-trait data that is currently inadequately addressed by thesingle-trait multiplier method due to employing a particular data collection design.A likelihood-based estimator is presented in Section 3.1 to represent the designwith which the data are obtained. Properties of this estimator are explored in Sec-tion 3.2 under varying study design parameters. In Section 3.3, considerations forBayesian inference with the proposed likelihood are outlined. A real data exampleof Bayesian inference with the proposed likelihood is shown in Section 3.4. Fi-nally, a discussion on the advantages and limitations of the method developed inthis chapter is found in Section 3.5.8Table 3.1: Contingency table for two traits, binary categoriesX2absent (0) present (1)X1absent (0) n00 n01present(1) n10 n11n3.1 A Likelihood Model for Estimating N3.1.1 The Reparameterized Multinomial Likelihood Model forModelling Two Binary TraitsData and AssumptionsIn the case that information on two binary traits, X1 and X2, are collected for thepurpose of inferring population size, the following likelihood is suitable for dataresulting from a data collection scheme that results in• a 2 by 2 contingency table (Table 3.1) that cross-classifies respondents of asample survey based on two traits,• marginal count N1·, the number of people in the target population with X1 =1,• marginal count N·1, the number of people in the target population with X2 =2.It is assumed that no personal identifying information is available where theabove data come from due to privacy regulations. Secondly, the target populationis assumed closed without births, deaths, or migrations. The marginal counts areassumed to be exact, i.e. known without uncertainty – this restriction is relaxedin Chapter 4. The sample survey is obtained through simple random samplingwithout replacement (SRSWOR). For target populations without a sampling frame,the assumption of sampling through SRSWOR cannot be met, but is used here tosimplify the theoretical development. Application of the method in violation ofSRSWOR assumption is addressed in Section 3.4 and Section 3.5.9The above data may at first appear equivalent to multiple “lists” that can beanalysed with capture-recapture methods but this is not so, as persons appearingin each data source may not be cross-identified due to privacy regulations – one ofmany reasons why such data ought not be analysed with capture-recapture (Pollocket al., 1990) or similar methods (Olkin et al., 1981) of analysis (see Appendix Afor details).A likelihood for the stochastic mechanismThe mechanism that gave rise to the data at hand is analogous to drawing marblesout of a bag where the total number of marbles in the bag, N, is unknown. Eachmarble in the bag may be either clear, red, sparkly, or red and sparkly. Additionally,we know exactly the number of marbles that are red, and the number of marbles thatare sparkly in the bag. Finally, a random sample of n marbles are drawn withoutreplacement, their traits observed and recorded in a contingency table.For this process, a multinomial likelihood is appropriate for approximatingthe distribution of the sample frequencies of every trait combinations obtained viaSRSWOR. The usual multinomial likelihood has parameters p00, p01, p10, p11 thatdetermine the frequencies observed in Table 3.1. Note that only three of theseparameters are free to vary in the support space because the parameter space isconstrained by ∑ pi j = 1.Additionally, knowledge of marginal counts N1· and N·1 presents extra infor-mation that further constrains the support space. Necessarily in the overall targetpopulation,N1·N= p10 + p11, (3.1)N·1N= p01 + p11. (3.2)The additional constraints on the support are incorporated by reparameterizingvia p10 = N1·/N− p11, and p01 = N·1/N− p11. The reparameterized likelihood ofsurvey data is, given fixed marginal counts,10L (p11,N|data) =(∑ni j)!n00!n10!n01!n11!pn1111(N1·N− p11)n10(N·1N− p11)n01(1−N1·+N·1N+ p11)n00,(3.3)where the statistics ni j = the number of people in the survey with status i for thefirst trait, and status j for the second trait.As a consequence of the above reparameterization, the population size, N, hasbecome a parameter in the likelihood (Eq.(3.3)). Estimation of N can be carriedout via several standard techniques, e.g. maximum likelihood or Bayesian infer-ence. As a side note, this reparameterized multinomial likelihood must not be con-fused with the multinomial models in capture-recapture inspired methodology –the stochastic mechanisms described by these methods are fundamentally different(Appendix A).3.1.2 Generalizing the Reparameterization for k TraitsLet a set of k traits be denoted by {X1, . . . ,Xk}. A trait Xi has li categories (strata),coded from 0 to li−1. Throughout this thesis, a trait with more strata than anotheris referred to as being more “stratified”. Let stratum 0 always represent the absenceof a trait. As an example, if “positive test result for disease a” is a trait of interest,then stratum 0 necessarily corresponds to the absence of a positive result and stra-tum 1 to stratum li− 1 may be defined to represent a positive result diagnosed indifferent time periods.As data, a k-dimensional contingency table is observed. The contingency tableis populated by cross-classifying a sample of people from the target populationbased on the status of k traits. A set of marginal counts is also known to us. Let theobserved marginal count be denoted by Mis, where i ∈ {1, . . . ,k} indexes the trait,and s ∈ {1, . . . , li−1} specifies the stratum/sub-type of this trait. Let M denote thecollection of marginal counts observed, where the size of the collection, |M |, is[k∑i=1(li−1)].In this case the development of the reparameterized likelihood follows a similar11procedure to that in the previous section. Beginning with the standard multinomiallikelihood, the parameters of this multinomial likelihood are reparameterized basedon |M | known marginal counts using equations similar to Eq.(3.1) and Eq.(3.2)to incorporate constraints on the support space. A systematic reparameterizationscheme has been chosen, which rearranges each marginalization equation in theformMisN=l1−1∑j1=0· · ·li−1−1∑ji−1=0li+1−1∑ji+1=0· · ·lk−1∑jk=0Pr{X1 = ji, . . . ,Xi−1 = ji−1,Xi = s,Xi+1 = ji+1, . . . ,Xk = jk}(3.4)for the probability of having only 1 trait.The exact expression of the reparameterized likelihood is found in Appendix C.Note that the resulting reparameterized likelihood will have {[(k∏i=1li)−1]−|M |+1} free parameters. The first term, [(k∏i=1li)−1], is the number of free parameters ofa standard multinomial likelihood. The second term, −|M |, is a reduction in freeparameters equal to the number of observed marginal counts. The last term, +1,results from the inclusion of N as a parameter in the reparameterization process.With N being a parameter of the likelihood, its estimation may be carried out usingstandard likelihood-based methods.3.2 Uncertainty about the Maximum LikelihoodEstimator of NThe following section assesses the uncertainty in the proposed estimator (MLE)obtained from maximum likelihood theory. The variance of the MLE based on thereparameterized likelihood is approximated asymptotically from the Fisher Infor-mation, the derivation of which is documented in Appendix C.An important outcome from this section is understanding how multiple traitsmay best be used to provide a desired level of precision for the population sizeestimate from a study-design perspective. Thus, the Fisher Information is used to12explore the behaviour of the estimation uncertainty, SD(NˆMLE), as important pa-rameters and design factors are varied. In specific, the parameter N is investigatedalong with the following design factors: sample size, the number of traits used, andthe level of trait stratification. The effects of trait prevalence and strength of traitassociation are also explored. Note that “trait prevalence” and “strength of traitassociation” are two functions of the cell probability parameters in the reparame-terized multinomial likelihood. They are chosen for investigation instead of cellprobability parameters due to better interpretability and better chance that a prioriinformation exist about them. While they are not entirely controllable by the studydesigner, the use of a priori information about prevalence and association leadsto better design selection in light of the results shown in the following sections.Hence, in this thesis, any reference to the set of design factors will also includeprevalence and trait association.3.2.1 The Effect of n and N on Estimation UncertaintyThe SWSWOR assumption on the survey data implies that SD(NˆMLE) ∝ 1/√n, awell-known classical result. The derivation of the effect of N is, however, non-trivial and thus reserved for Appendix C. There it is shown that SD(NˆMLE) ∝ Nwhen all other design parameters are kept fixed. An extension of this result with theDelta Method shows SD(log NˆMLE) to be unaffected by the size of N as well. Theproportionality between population size and estimation uncertainty allows the ef-fect of design parameters to be studied on the scale of relative error (SD(NˆMLE)/N),or equivalently, SD(log NˆMLE), in the sections to come.3.2.2 The Effect of Additional Traits of Given Prevalence and Degreeof AssociationIn this section, designs with varying number of traits, prevalence and correlationare mapped to cell probabilities to obtain the Fisher Information. This explorationis restricted to binary traits of equi-prevalence and equal pairwise correlation (equi-correlation). The equi-prevalence, equi-correlation restriction is used to reduce thecomplexity of the design parameters under exploration, and may be extrapolatedto represent general situations– this is justified Appendix B. Appendix B also pro-13vides guidance to benchmarking estimator precision from a study-design perspec-tive using the equi-correlation, equi-prevalence assumption. Note that situationsbeyond binary stratification are explored in the next section.Figure 3.1 provides a summary of the effect of increasing the number of binarytraits, prevalence and correlation. One makes the observations that (a), increasingthe number of traits has a beneficial but diminishing effect for lowering the un-certainty, (b), the addition of traits is most effective at lowering uncertainty whentraits are not correlated, (c), using more prevalent traits is effective for lowering theuncertainty, and (d), when the prevalence is high (≈ 10%), the impact of additionaltraits becomes negligible.3.2.3 The Effect of Increased StratificationStratification, as previously mentioned, refers to the fineness of the categorizationfor a trait. When the stratification is beyond binary, it is challenging to map a givenset of design factor values to a point in the parameter space because trait prevalenceand trait association are typically defined for binary traits (indicators). Instead, aMonte Carlo, generative approach is taken to explore the parameter space directlyfor a given level of stratification. One will see shortly that a distribution for theestimation uncertainty results for a given set of design parameters based on thisapproach.This Monte Carlo simulation is motivated by the situation where a subject-areaexpert may have some expectation of the strength of association and prevalence fora set of binary traits but is unsure if obtaining further detailed information (strat-ification) for each trait will provide useful information to lower estimation uncer-tainty, given there is no knowledge concerning what the stratification will look like.To provide practical advice to this hypothetical subject-area expert, test scenarioswere generated randomly in a way that captures the variability in the estimationuncertainty assuming a priori any stratification is equally likely to happen.Let us begin by assuming a set of two binary traits as the starting point forexploring traits with three strata. Take a population with two binary traits of givenprevalence and association. This specification translates to some cell probabilityparameters when sampling and observing trait combinations with SRSWOR. When14prevalence = 0.01 prevalence = 0.05 prevalence = 0.10.20.40.60.20.40.6n= 200n= 8000 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1fsd(N^)/NNumber of traits2345Figure 3.1: Examining the effect of increasing the number of binary traitswith changing prevalence and association on the relative error of esti-mating N. The association between traits is measured by φ , rangingfrom 0 (no association) to 1 (complete association). One makes the ob-servation that the relative error decreases with (1) increasing numberof traits (2) increasing trait prevalence and (3) decreasing associationbetween traits. The advantage of additional traits disappears as traitsbecome completely associated, or as prevalence becomes high. Onemay further note that the relative error doubles as n is decreased by 4folds, reflecting its inversely proportional relationship with√n.the binary category for “trait present” is stratified into two mutually exclusive cate-gories, each cell probability parameter, pre-stratification, is necessarily equivalentto the sum of a set of d parameters post-stratification (see Figure 3.2 for an illus-tration). One may see that the mapping from pre-stratification cell probabilities topost-stratification cell probabilities is not unique.Thus in the Monte Carlo experiment, to generate one possible instance of strat-ification, a sample from the d-dimensional Dirichlet(1,...,1) distribution is drawn to15X2 = 0 X2 = 1X1 = 0X1 = 1p00 p01p10 p11X2 = 0 X 2 = 1X1 = 0X1 = 1p00 p01p10 p11X 2 = 2X1 = 2 p20 p21p02p12p22Figure 3.2: Showing how binary traits are further stratified in the simulationin Section 3.2.3.provide the weights with which a cell probability, pre-stratification, is distributedinto the post-stratified cells. Estimator error is then calculated using the post-stratified cell probabilities. This process is repeated 100 times for a given set ofbinary traits to characterize the variation in estimator error in the space of possiblescenarios for stratification. A similar process is applied to k starting binary traitsof equi-prevalence and equi-correlation up to k = 4 traits.The result of this simulation study is shown in Figure 3.3, where each box-plotsummarizes the estimation uncertainty of all randomly generated scenarios thatcorrespond to the same starting k binary traits. Generally speaking, the reductionin estimation uncertainty from increased stratification is small compared with thereduction due to an increased number of traits, given an uninformative a prioribelief about stratification. While there are a few special cases where stratificationprovides reduction comparable to that from additional traits, it is difficult for thesubject-area expert to evaluate if a set of traits of interest might constitute such aspecial case.3.3 Bayesian InferenceIn recent years, both in the research of statistical methodologies and in subject-areaapplications, Bayesian inference has grown to be increasingly useful and popular.Bayesian estimation has several strengths for inferring population size with the162 traits3 traits4 traits0.150.200.250.300 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.99fsd(N^)/NFigure 3.3: The effect of stratification: smooth curves show estimator uncer-tainty obtained with binary traits with varying strength of association.The boxplots show the empirical distributions of estimator uncertaintywhen binary traits are increased to three strata at varying strengths ofassociation from φ = 0 to φ = 1 in increments of 0.1. Given k traits atsome the level of association (along the x-axis), the vertical differencebetween the location of the “box” and the curve for shows the effect ofstratification. Compare this with the vertical difference between curves,which marks the effect of increased number of traits. The comparisonshows that stratification has a small effect for reducing estimation un-certainty relative to the effect of using more traits.17reparameterized multinomial model described in this chapter. Firstly, one is likelyto observe a sparse contingency table, based on surveying the absence/presence ofmultiple traits in a population, should the trait prevalences be low. This sparsityposes a problem in MLE-based inference but not in Bayesian inference. Secondly,with limited computational expertise, Bayesian inference is easier to implementthan the MLE for complex models are specified through standard distributions andsimple deterministic functions due to availability of many MCMC software pack-ages. Lastly, while the Bayes estimator in theory agrees asymptotically with theMLE as n→∞, in subject-area applications, a Bayes estimator may achieve a lowerlevel of uncertainty compared with the MLE when the use of informative priors canbe justified.However, one recognizes the challenge to find an easy to implement, validjoint-prior for a large set of parameters in a constrained space. The following para-graphs outline some suggestions on an appropriate prior and its implementation.The proposed prior is evaluated with a simulation study and is shown to have goodestimation properties (Appendix E).The key requirement of a suitable prior is that it allows expert knowledge toenter the analysis in a natural way. The set of parameters in the reparameterizedlikelihood function is, according to Appendix C, η = {θ ,N}. Rather than defin-ing a prior pi(θ ,N) multivariately, factorizing the joint-prior into smaller modulesresults in a simpler implementation and is easier to interpret. Two factorizationsare considered: pi(N)×pi(θ |N) and pi(θ )×pi(N|θ ). Substantive prior knowledgeof pi(θ ) is unlikely to exist because this requires knowledge of how the traits areassociated with each other and prior knowledge of every cell probability, and forthis reason the use of pi(θ )×pi(N|θ ) is not pursued. On the other hand, knowl-edge about pi(N) is relatively easy to elicit from an expert in the form of perceivednatural limits to population sizes. Thus the formulation pi(N)×pi(θ |N) is a usefulchoice for carrying out Bayesian inference.In the case of data consisting of two binary traits (k = 2, l1 = l2 = 2), recall thenotations in Section 3.1.1. One possible formulation of the prior with factorization18pi(p11|N)×pi(N) is to letN ∼Uni f orm(L,U), (3.5)p11|N ∼Uni f (0,bN), (3.6)where L≥max(N·1,N1·,n), and U is the upper bound on the possible size of N re-flecting the range of possible values a priori, bN = min(N·1/N,N1·/N) – the small-est value appearing in the set of marginal probabilities. Further details on the exactimplementation of these priors can be found in Appendix D.In the case of inference with three or more binary traits, once again, the nota-tion ({θ ,N}), as given by Appendix C, is adopted for the set of parameters. LetN ∼Uni f orm(L,U) with L,U defined as in the case of two binary traits. For theparameter θ , a uniform prior prior is placed in the constrained parameter spacegiven N, that is, pi(θ |N) = 1/VN . The quantity VN can be interpreted as the “vol-ume” of the constraint space for θ defined by the inequalities 0 < pc < 1; here pcis a cell probability and a function of parameters θ and N. Appendix D lists thedetails for implementing this prior for inferring N with three binary traits.3.4 Application: San Francisco Injection-Drug UserStudyA recent study using the multiplier method to infer the population size of SanFrancisco Injection-Drug Users (IDUs) in 2009 had been documented in Johnstonet al. (2013). In this study, information about a number of traits, ranging fromservice usages to disease diagnoses, were used to construct several estimates ofpopulation size via the traditional multiplier method, without taking into accountcorrelation between traits. Note that a single Respondent-Driven Sampling (RDS)survey was conducted that asked each respondent to self-report on all traits.An illustration of the likelihood-based multi-trait multiplier method shall bemade with with two binary traits – usage of a particular substance use treatmentcenter (Walden House) and status of being a reported HIV case. Table 3.2 shows,for each trait, the marginal count, the RDS adjusted prevalence estimate, and pop-ulation size estimate made with the Multiplier Method. No respondent indicated19Table 3.2: Marginal count and estimated prevalences of two traits in the IDUpopulation in San Francisco in 2009. The table also lists two estimates onthe size of IDU population obtained by applying the Multiplier Methodto each trait.Trait Marginal Count pˆRDS s.e.(pˆRDS) NˆMultiplier 95% C.I.Using Walden House 104 2% 0.008 5,200 1,003 - 9,398Reported HIV cases 3,308 7.3% 0.017 45,315 24,576 - 66,057to being a reported HIV case and also using Walden House, based on a personalcommunication with the authors of Johnston et al. (2013).3.4.1 Obtaining a Single Estimate of the Size of IDU PopulationUsing Two TraitsTo estimate the population size of IDU population in San Francisco in 2009 (N)with two traits, a contingency table that cross-classifies the survey data according tousage of Walden House and reported HIV status is required, on top of the marginalcounts appearing in Table 3.2. However, pseudo-data, synthesized based on theraw data with some adjustments, is used in place of a raw contingency table of theRDS sample, to account for one unmet requirement that the survey be carried outthrough SRSWOR. The RDS design differs from SRSWOR in having a selectionbias and a reduced efficiency; hence, two adjustments are built in to the pseudo-data.To adjust for bias in RDS sampling, unbiased RDS adjusted proportion esti-mates is used to construct the pseudo-data. Let the answer to a survey question becoded 1 for “yes” and 0 for “no”, and let pˆi jRDS indicate the RDS adjusted estimatefor the proportion of IDUs who answer i for using Walden House and answer jfor being a reported HIV case. Given that no respondents in the survey indicatedbeing both a reported HIV case and a user of Walden House, pˆ00RDS must be zero– the rest of the RDS-adjusted proportion estimates are calculated using this facttogether with the RDS-adjusted marginal prevalences in Table 3.2. Table 3.3 showsthe RDS adjusted sample proportions as per calculation.To adjust for the RDS sampling being less efficient that SRSWOR, the effectivesample size (ESS) is used in constructing the pseudo-data. ESS represents the20Table 3.3: RDS-adjusted proportions in the contingency table for IDU’s inSan Francisco in 2009.Reported HIV case = no Reported HIV case = yesUses Walden House = no 90.7% 2%Uses Walden House = yes 7.3% 0%Table 3.4: Pseudo-data: adjusted contingency table for estimating the size ofIDU population in San Francisco in 2009 with the proposed method ofanalysis.Reported HIV case = no Reported HIV case = yesUses Walden House = no 245 5Uses Walden House = yes 20 0sample size required of a SRSWOR survey to yield the same estimation uncertaintyas the given RDS survey, and is defined (Kish, 1965) in the case of estimating aproportion asESS =nDe f f=n(var(pˆ)RDS/var(pˆ)SRSWOR)=p(1− p)var(pˆ)RDS. (3.7)The ESS is estimated by substituting the RDS-adjusted prevalence estimate and itsstandard error for each trait into Eq. (3.7). This results in two estimates of ESS,234 and 306, that are similar in magnitude; an average value ( ˆESS = 270) is usedto construct the pseudo-data.Finally, the pseudo-data consists of “cell counts”, n∗i j = ESS× pˆi jRDS , roundedto the nearest whole number. The adjusted contingency table containing this pseudo-data is shown Table 3.4.A Bayesian approach is taken to infer the size of an IDU population in SanFrancisco in 2009. Note that the presence of a 0 count in the pseudo-data posesno difficulty for the use of Bayesian inference. Following the recommendationsin Section 3.3, a prior distribution where N ∼ Uni f (3308,1e10) and p11|N ∼Uni f (0,104/N) is chosen to reflect the lack of prior knowledge about N, given themarginal counts of 104 and 3308. The posterior distribution of logN is obtained viaMarkov Chain Monte Carlo (MCMC) in the software JAGS. To set-up MCMC, the21Table 3.5: Analysis of plausible scenarios with proposed method. Each rowcontains hypoethetical pseudo-data that fit the required ESS and esti-mated marginal prevalences (2% and 7.3%), followed by an estimatedpopulation size based on analysing this pseudo-data.n∗00;n∗01;n∗10;n∗11 NˆBayes 95% CI245; 5; 20; 0 38966 [27059, 58561]246; 4; 19; 1 40511 [27843, 61688]247; 3; 18; 2 42206 [28795, 64631]248; 2; 17; 3 44078 [29828, 68462]249; 1; 16; 4 46180 [30898, 72729]250; 0; 15; 5 48430 [32023, 77213]model-file appearing in Appendix D, the marginal counts and the pseudo contin-gency table are supplied to the software. The MCMC is run for a total of 500,000iterations and demonstrates good convergence. The population size is estimatedwith NˆBayes defined as the exponentiated posterior expectation of logN. The 95%CI is obtained through exponentiating the HPD 95%CI for the posterior of logN.Based on this analysis, the population size of IDU in San Francisco in 2009 was38,966 people with a 95% CI of [27059, 58561].3.4.2 Alternative Analysis with Marginal Prevalences OnlyIn the above application, extraneous information beyond what is readily availablein the published study is required in order to complete the analysis. However, thereis merit in considering the likelihood-based multi-trait multiplier method even ifonly marginal prevalences and marginal counts are accessible, i.e. the informationcontained in Table 3.2. To demonstrate, Table 3.5 lists, in each row, pseudo-datawith an ESS of 270 that fit the RDS-adjusted marginal prevalences of 7.3% and2% along with an estimate of N made by analysing that pseudo-data with the repa-rameterized likelihood method to combine multiple traits. One may contrast theestimates in Table 3.5 with that of the original study in Table 3.2; by attempting theanalysis in a “what-if” approach, one may still obtain a sense of what combiningmultiple traits says about N that is otherwise difficult to replicate with intuition andmarginal estimates alone.223.5 DiscussionThis chapter outlines a likelihood based method for estimating population size,N, based on extending the multiplier method to multiple traits. This is achievedthrough reparameterizing the multinomial likelihood to incorporate N as a param-eter. The statistical advantage of using additional traits, of varying association andprevalence, is explored by comparing the asymptotic standard deviation of the es-timator for N. This understanding of how different parameters affect the efficiencyis useful to subject-area experts for obtaining the desired precision on N througha well-planned study design. One may find in Appendix B some suggestions re-garding the approach to the design process. Figure 3.1 may also serve as a startingpoint for visually extrapolating design values that provide the required estimationprecision.In many applications, Bayesian inference presents important advantages overmaximum likelihood methods, for example when the observed data table containsa 0 frequency or when prior knowledge exists. A joint prior outlined in this chapteremphasizes considering the prior knowledge on population size marginally. Onemay find in Appendix D details on how to obtain the posterior distribution in con-venient MCMC software with the recommended prior. Appendix E documentsa simulation study which verifies the performance of estimating population sizeusing this recommended Bayesian approach.An application of the multi-trait multiplier method is provided which estimatesthe population size of IDUs in San Francisco in 2009, based on data published inJohnston et al. (2013). The data requirement outlined in Section 3.1.1 is trans-lated to this particular context. Note that, in practice, the survey data may requireadjustments as in this application when it is not collected through SRSWOR.As mentioned in the introduction, this chapter addresses the gap in methodol-ogy for combining multiple correlated estimates arising from the multiplier method.Certainty one may approach this problem in other ways, e.g. by minimizing thevariance of linear combination of correlated estimators. A likelihood-based ap-proach is chosen because not only does it accomplish the original task of com-bining multiple estimates with asymptotically optimal properties, it also providesan easy transition to the Bayesian framework. A likelihood approach further en-23ables simple model extensions to deal with problems that arise with real data. Asan example, the concern of marginal counts being inexact was noted in (Johnstonet al., 2013). In this situation, the reparameterized likelihood model can be easilyextended to model population migration, or other mechanisms that causes uncer-tainty in the marginal counts; the subsequent chapter addresses this issue in depth.24Chapter 4Uncertainty in Marginal Counts –Accounting for MigrationThis chapter presents a systematic approach to accounting for uncertainty in marginalcounts that result from population migration through the use of a migration model.The migration mechanism is motivated by an application of the multiplier methodto data collected with the disease surveillance system employed by the BritishColumbia Center for Disease Control (BCCDC) and the Public Health Agency ofCanada (PHAC). The model outlined in this chapter may be generalized to coun-tries with a similar model of disease monitoring for extended applications.Central to the problem of migration is the requirement of defining a “closed”target population as the target of inference in the multiplier method. A closedpopulation (Lukacs, 2009), is one that is fixed in composition, with no migration,births nor deaths. However, the composition of human populations is constantlychanging. Thus to achieve a truly closed population, one must consider the targetpopulation at a single instance in time (target time) in a certain well-defined region(target region). Thus, the two essential descriptors necessary to define a humantarget population are geography and time.The previous chapter demonstrated inference under the assumption that themarginal counts are known exactly. As a reminder, a marginal count is the num-ber of individual with a certain trait (marked individuals) in the target population.However, in reality, data on the exact number of marked individuals in the target25region at the exact target time is impossible to obtain, since collecting such data isnot instantaneous.During the time between enrolling a marked individual in the record and thetarget time, an individual is subject to migration. Should recorded individuals mi-grate out of the target region before the target time without notifying the recordkeeping agency, the records present an over-counting of actual marginal counts.Similarly, should marked individuals who lived outside the target region migrateinto the region without being noted in the relevant record, the records produce anunder-counting of actual marginal counts. However, this nuance has been over-looked in previous literature and no adjustment method has been proposed.This problem was observed in a collaboration with the British Columbia Centrefor Disease Control (BCCDC) to infer the population size of Men-who-have-sex-with-men (MSM) living in the Greater Vancouver Regional District (GVRD) at theend of year 2008. In this application, the disease surveillance database, whose dataare collected over the years, appear initially as a source for tabulating “marginalcounts” for traits related to disease diagnosis. However, as individuals diagnosedwith diseases are not tracked over time in the Canadian disease surveillance system,the surveillance data does not provide the actual marginal count at the target time.While the true marginal count for a trait is not observed, fuzzy information onthe marginal count can be obtained if one is to make modelling assumptions. Inthe following sections a simple model is outlined, followed by details on how tointegrate the migration model with the multiplier method via Bayesian inference.This is followed by an application of the proposed model and inference to real data.The chapter concludes with a discussion of the proposed method.4.1 A Model for MigrationThe proposed migration model contains two basic components, emigration and im-migration. The unobserved marginal count is composed of individuals who stayedin the region of interest until the target time, J, after a being recorded at a earliertime j. The model simplifies the continuous time migration process into discretetime process (with time indexed incrementally by integers). A limitation of theproposed model is that it may be used with the single trait multiplier method, or26multi-trait multiplier method with only one marginal count affected by migration.Let• Nt be the marginal count, i.e. number of marked individuals in the targetpopulation, i.e. living in the target region at the target time,• S j be the count of marked individuals who stay in the target region at thetarget time after being listed in the record as residing in target region at timej,• I j be the count of marked individuals who immigrate to the target regionby the target time after being listed in the record as residing outside targetregion at time j,• D j be the count of marked individuals enlisted in the record as residing intarget region at time j,• O j be the count of marked individuals enlisted in the record as residing out-side target region at time j,• ps be the probability that a marked individual residing in the target region attime j stays in the target region at time j+1,• pm be the probability that a marked individual residing outside the targetregion at time j move in to the target region at time j+1.A model, Mm for Nt is the following:Nt =∑jS j + I j (4.1)S j ∼ Binomial(p(J− j)s ,D j) (4.2)I j ∼ Binomial(1− (1− pm)(J− j),O j). (4.3)The above model makes the following assumptions about migration:1. Migration is a discrete time process.2. The probabilities ps, pm that dictate the chance of staying/moving do notvary with time.273. The probability of moving more than once is negligible.4. An individual’s probability of being in the target population depends on abinary residency descriptor (in the target region or not) at the time he/shewas enlisted in the records.4.2 Combining the Migration Model with the MultiplierMethodTo infer N with uncertainty in one of the marginal counts with the migration modelMm is to use the migration model as a prior for Nt in the likelihood-based multipliermethod of Chapter 3 in a Bayesian inference, creating a two-level combined model.The parameters pm and ps inMm are effectively hyper-parameters in the combinedmodel.To infer N in the case of two binary traits, the reparameterized likelihood forthe survey data follows the parameterization based on the Section 3.1.1. Where Nt1is affected by migration, the posterior distribution of the combined model isf (p11,N,Nt1, pm, ps|data) ∝ f (data|p11,N,Nt1, pm, ps)× f (p11,N,Nt1, pm, ps).(4.4)An intuitive prior that allows a subject area expert to prioritize a-priori knowl-edge on N over p11, can be obtained by factoring f (p11,N,Nt1, pm, ps) as f (p11|N,Nt1, pm, ps)×f (N|Nt1, pm, ps)× f (Nt1|pm, ps)× f (pm|ps)× f (ps).For example, one may choose the following:• ps ∼Uni f orm(a,b), with a,b reflecting subject area knowledge• pm|ps = pm ∼Uni f orm(c,d), with c,d reflecting subject area knowledge• Nt1|pm, ps is the model Mm• N|Nt1, pm, ps = N|Nt1 ∼Uni f orm(L,U), with L ≥ max(N·1,N1·,n) , and Uan upper bound on the possible size of N• p11|N,Nt1, pm, ps = p11|N,Nt1∼Uni f orm(0,bN), wtih bN =min(N·1/N,N1·/N)28Note that the above example highlights some important observations about in-tegrating uncertainty on marginal count with performing the Bayesian inferenceon N with the method in Chapter 3. Should Nt1 be given, the variables N, p11 donot depend on the mechanism that determines Nt1. In this respect, the migrationmodel can be thought of as “standing-alone”. Therefore the prior on N, p11 can beformulated according to Section 3.3, without consideration of migration.4.3 Application: Estimating the Size of MSM Populationin GVRD4.3.1 The DataAs a potential source of marginal counts, the British Columbia Centre for DiseaseControl (BCCDC) monitors new diagnoses of sexually transmitted infections suchas HIV, syphilis, etc. Beginning in 2004, any first time diagnosis of either HIV orSyphilis, a reportable disease, must be reported to the BCCDC. The surveillancedatabase also contains detailed demographic information, e.g. age, sex, residence,etc., on each patient. As a potential source for the trait prevalence, the BCCDC alsohas data from The ManCount Project (Moore et al., 2011), a survey undertaken tounderstand the sexual health of MSM in Vancouver, Canada. The survey includedquestions regarding time of an individual’s first diagnosis of HIV and Syphilis, ifany. This survey was conducted at the end of 2008 using Venue-Based Sampling(VBS). The survey was restricted to MSM who were age 19 or older at the timeof the survey. The survey captures information regarding the sexual behaviour,disease diagnosis, disease testing behavior, together with basic demographic infor-mation (e.g. age, residence, etc.). A total of 1012 respondents from the GreaterVancouver Regional District (GVRD) were surveyed.An appropriate definition of the target population is “the population of MSMabove age 19 living in GVRD at the end of year 2008,” based on the informationcontained in the data. Secondly, based on the above two sources, two traits ofinterest are identified as the following:• Trait 1: First diagnosis of HIV between 2004 and 2008.29• Trait 2: First diagnosis of Syphilis in year 2008.Based on the definition of the target population, a perfect marginal count for eachtrait needs to be tabulated at the end year 2008. However, the BCCDC surveil-lance database enrols people into the database at the time of their diagnosis with-out tracking the location of these individuals after enrolment in database. As such,the marginal count for trait 1 is suspected to be inaccurate as the oldest records arefrom 2004 giving those individuals plenty of time to migrate out of GVRD. Further-more, other provincial health agencies also do not report emigration of individualsdiagnosed in their provinces to BCCDC, such that anyone who was diagnosed inanother province but moved to GVRD by 2008 would not be present in the BCCDCrecord. Epidemiologists agree that international migration does not seriously af-fect marginal count for trait 1 due to stringent immigration rules surrounding HIVinfected individuals.In order to model the effect of inter-provincial migration on the surveillancecounts, further data are needed. At a minimum, data on the number of new HIVdiagnosis within Canada each year is required. These data are readily availablefrom the Public Health Agency of Canada (PHAC), in publicly accessible reports(Surveillance and Risk Assessment Division, Centre for Communicable Diseasesand Infection Control, 2010). This report also provides new HIV diagnoses strat-ified by province. However, in the following model all outside-GVRD diagnosesare aggregated to reduce model complexity. Further notes about this modellingdecision are given in the discussion.4.3.2 Bayesian InferenceAs in Section 4.2, the posterior distribution of model parameters is, f (p11,N,Nt1, pm, ps|data)∝f (data|p11,N,Nt1, pm, ps)× f (p11,N,Nt1, pm, ps). To implement the reparametrizedlikelihood with pre-defined distributions in popular MCMC software, e.g. JAGS,30Table 4.1: Selected hyper-parameter values for parameters relating to migra-tiona b c dSet 1 0.9 1 0 0.06Set 2 0.978 1 0 0.004it may be written as,x∼Multinomial(p,n)p10 = Nt1/N− p11p01 = Nt2/N− p11p00 = 1− p10− p01− p11.The prior distribution is specified according to Section 4.2, with migrationhyper-parameters as in Table 4.1. Table 4.1 lists two sets of values. Hyper-parameter values in Set 1 reflects a conservative (wide) prior upon consulting epi-demiologists at BCCDC. Hyper-parameters in Set 2 are based on inter-provincialmigration statistics (BC Stats, 2009; Statistics Canada, Demography Division, 2013)for the overall BC population, with the mean migration probability matching theaverage migration proportion in 2004-2008 and upper cut-off extending to 1 forps and lower cut-off extending to 0 for pm. The prior for f (N|Nt1) is capped atmaximum of 270,000 which is roughly 30% of the adult male population in 2008according to Statistics Canada.Population size, N, is estimated based on the posterior distribution of its loga-rithm, i.e. log(N), a standard practice for estimating variables that are necessarilypositive. The Bayes estimate, NˆBayes, is defined as the exponentiated mean of theposterior distribution for log(N). A 95% credibility interval (CI) for N is obtainedby exponentiating the end-points of the 95% HPD CI for log(N). Note that theposterior is obtained with 50,000 MCMC runs in JAGS. A complete model file forthe model is found in Appendix F.31Table 4.2: Result from inferring size of MSM population with the combinedmodelmigration (parameter set 1) migration (parameter set 2)NˆBayes +22% no change95%CI NˆBayes +32% no changemean(Nt1) +27% no changelength C.I.(Nt1) 500 people 60 people4.3.3 ResultDue to confidentiality, the exact estimate on the size of MSM population in 2008cannot be disclosed in this thesis. Detailed results are planned for publication inan epidemiology journal in the near future. Nevertheless, the difference in keyquantities between inference with migration versus without migration are providedin Table 4.2.By using migration hyper-parameter set 1 (see Table 4.1), which is a conser-vative (wide) prior, migration resulted in a large amount of uncertainty on Nt1,with 95% CI range of 500 people. Both the estimate of N and its standard er-ror increased by a large percentage (+22% and +32%). Using the second set ofhyper-parameters, a tight prior, in the migration model resulted in a range for 95%CI for the unobserved Nt1 of about 60 people. This uncertainty, however, had anegligible effect on the population size estimate and its standard error. Thoughnot displayed due to confidentiality, both sets of hyper-parameters resulted in theconfidence interval for NˆBayes after adjusting for migration overlapping with the CIwithout adjusting for migration, indicating no large inconsistency in estimating Nwithout accounting for migration and versus accounting for migration.4.4 DiscussionThis chapter introduced a migration model in attempt to systematically capturethe uncertainty in marginal count data due to migration. The model can be easilyintegrated with the multi-trait multiplier method of Chapter 3 under a Bayesianframework. The Bayesian analysis is demonstrated with an example to estimatethe population size of MSM in GVRD in 2008 with data hosted by the BCCDC. In32this application extra data from outside the region are required, but these extra dataare easily found using the data from Canada wide disease surveillance database.Section 4.1 noted four important assumptions in the proposed migration model.Firstly, the proposed model describes migration in discrete time steps. While re-ducing the length of each time-step may increase the resemblance of the discretetime model to one of continuous time, there are disadvantages. As the time-intervaldecreases, the probabilities ps and pm may become too small for epidemiologiststo formulate an appropriate prior for them. Furthermore, there is an advantagein using a time-interval based on a natural time division in human activities, e.g.monthly or yearly. In many organizations records are summarized periodically bythe year, or by the month so that supplementary data may be easily acquired. Inthe exemplary application, by using time-steps by the year, inter-provincial migra-tion statistics (BC Stats, 2009) can be used to derive a suitable prior for ps and pm.However, the need to explore trade off between model fit and ease of implementa-tion due to varying time-interval length remains.Secondly, the model assumes ps and pm as invariant in time. This allows agreater degree of parsimony. This assumption can be empirically validated in thecase of the example application. One may obtain, again, from the inter-provincialmigration statistics the proportion of people leaving and entering BC each year; thisnumber is roughly the same over the period of 2004-2008 so there is no good reasonto suspect the necessity of time-varying parameters. Furthermore, as the proportionof overall inter-provincial migration in Canada is small (BC Stats, 2009; Statis-tics Canada, Demography Division, 2013), the assumption regarding the chance ofmoving more than once as negligible is likely adequate.Lastly, in this model, the only “covariate” that influences the probability ofmigration is the binary indicator of a person being in the target region or not at thetime enlisted in the records. This is again a simplification of migration mechanismfor 1) parsimony, and 2) the ease of generating prior information. In the context ofthe example application, data on migration into and out of BC exist, but these datastratified by the province from which an immigrant comes from are not publiclyaccessible. As such, a complicated model which incorporate finer geographicalinformation may be difficult to implement due to the lack of prior informationon ps and pm for small regions. On a similar note, other covariates (e.g. age,33ethnicity, etc.) may also have an important role in the probability of migration, butas such information is not well documented in migration records in Canada, priorknowledge on migration with respect to any combination of covariates would beextremely difficult to obtain.In the application, two sets of hyper-parameter values were tested with resultsshown in Table 4.2. The choice of hyper-parameter values has a significant effecton how much uncertainty is place on the unobserved Nt1. Based on the two testcases, increased uncertainty on Nt1 translates into uncertainty about Nˆ, but whenthe uncertainty about Nt1 is small, the amount of uncertainty surrounding Nˆ maynot be affected at all. The two sets of scenarios tested represent very extremecases; one likely ought to adjust the final hyper-parameter choice to be somewherein between the values used in the two sets. According to the collaborators, theGVRD is attractive to MSM due to lifestyle and the quality of care for STI diag-nosed individuals, such that an influx of MSM over time to GVRD is a reasonableexpectation.Finally, the chapter has not included any validation of the migration model, norcomparison with the ad-hoc method to account for uncertainty in marginal count.The method described here also is also suitable for when only one marginal countis affected by migration. These shortcomings will hopefully be addressed in futurework.34Chapter 5ConclusionsThis thesis has highlighted two important improvements on the multiplier method.Firstly, a likelihood-based method is presented to incorporate information frommultiple traits for a widely practised data collection scheme. This likelihood-basedextension not only enables the multiplier method to use data from multiple traitssimultaneously, but also allows the use of categorical traits should one desire. Inaddition, the exploration on the effect of study design on inference precision (Sec-tion 3.2 and Appendix B) enables the prediction of inference precision associatedwith choices made in study design, such that resources may be spent in a pre-dictable way.Secondly, Chapter 4 is the first body of work to account for uncertainty inthe marginal count data based on an explicitly formulated migration model. Anexplicit specification of the migration model allows distributional parameters tobe chosen or critiqued based on publicly available data, as in the application inSection 4.3. This model can be conveniently integrated into the likelihood-modelpresented in Chapter 3 under a Bayesian framework. Furthermore, an implemen-tation of Bayesian inference is also outlined which may be carried out in freelyavailable MCMC software.As final remarks, this work is by no means a comprehensive solution that coversevery problem with the multiplier method. The method in Chapter 3 is is tailoredto the data collection scheme where a single survey queries multiple traits. Whilethis particular data collection scheme is the only one employed at the moment,35when new data collection designs surface in the public health system, develop-ment of new statistical methods to infer population size will be necessary. Also,the migration model in Chapter 4 falls under what is completely model-based in-ference, where model misspecification could seriously undermine the estimation.Thus, work to examine the robustness of the migration model is prioritized for theimmediate future. Future work may also include validating the practicality of themigration model, and comparison with previous methods to capturing uncertaintyin the marginal counts.36BibliographyArchibald, C., Jayaraman, G., Major, C., and Patrick, D. (2001). Estimating thesize of hard-to-reach populations: a novel method using HIV testing datacompared to other methods. Aids 15, S41–S48.BC Stats (2009). Migration review 2008. Technical report, BC Stats.Carroll, R. J. and Lombard, F. (1985). A note on n estimators for the binomialdistribution. Journal of the American Statistical Association 80, 423–426.Chao, A., Tsay, P., Lin, S. H., Shau, W. Y., and Chao, D. Y. (2001). Theapplications of capture-recapture models to epidemiological data. Statistics inMedicine 20, 3123–3157.Colo´n, H. M., Robles, R. R., and Sahai, H. (2001). The validity of drug useresponses in a household survey in puerto rico: comparison of survey responsesof cocaine and heroin use with hair tests. International Journal ofEpidemiology 30, 1042–1049.Delaney-Black, V., Chiodo, L. M., Hannigan, J. H., Greenwald, M. K., Janisse, J.,Patterson, G., Huestis, M. A., Ager, J., and Sokol, R. J. (2010). Just say i don’t:lack of concordance between teen report and biological measures of drug use.Pediatrics 126, 887–893.Ezoe, S., Morooka, T., Noda, T., Sabin, M. L., and Koike, S. (2012). Populationsize estimation of men who have sex with men through the network scale-upmethod in Japan. PLoS ONE 7, e31184.Fendrich, M., Johnson, T. P., Sudman, S., Wislar, J. S., and Spiehler, V. (1999).Validity of drug use reporting in a high-risk community sample: a comparisonof cocaine and heroin survey reports with hair tests. American Journal ofEpidemiology 149, 955–962.37Heckathorn, D. D. (1997). Respondent-driven sampling: a new approach to thestudy of hidden populations. Social Problems 44, 174–199.Hook, E. B. and Regal, R. R. (1995). Capture-recapture methods inepidemiology: methods and limitations. Epidemiologic Reviews 17, 243–64.Johnston, L. G., Prybylski, D., Raymond, H. F., Mirzazadeh, A., Manopaiboon,C., and McFarland, W. (2013). Incorporating the service multiplier method inrespondent-driven sampling surveys to estimate the size of hidden andhard-to-reach populations: case studies from around the world. SexuallyTransmitted Diseases 40, 304–10.Kish, L. (1965). Survey sampling. John Wiley and Sons.Lieb, S., Fallon, S. J., Friedman, S. R., Thompson, D. R., Gates, G. J., Liberti,T. M., and Malow, R. M. (2011). Statewide estimation of racial/ethnicpopulations of men who have sex with men in the U.S. Public Health Reports126, 60–72.Luan, R., Zeng, G., Zhang, D., Luo, L., Yuan, P., Liang, B., and Li, Y. (2005). Astudy on methods of estimating the population size of men who have sex withmen in Southwest China. European Journal of Epidemiology 20, 581–585.Lukacs, P. (2009). Closed population capture-recapture models. In Cooch, E. G.and White, G. C., editors, Program MARK: a gentle introduction.http://www.phidot.org.Magnani, R., Sabin, K., Saidel, T., and Heckathorn, D. (2005). Review ofsampling hard-to-reach and hidden populations for hiv surveillance. Aids 19,S67–S72.Moore, D. M., Kanters, S., Michelow, W., Gustafson, R., Hogg, R. S., Kwag, M.,Trussler, T., McGuire, M., Robert, W., Gilbert, M., et al. (2011). Implicationsfor HIV prevention programs from a serobehavioural survey of men who havesex with men in Vancouver, British Columbia: the ManCount study. CanadianJournal of Public Health 103, 142–46.Muhib, F. B., Lin, L. S., Stueve, A., Miller, R. L., Ford, W. L., Johnson, W. D.,Smith, P. J., for Youth Study Team, C. I. T., et al. (2001). A venue-basedmethod for sampling hard-to-reach populations. Public Health Reports 116,216.38Okal, J., Geibel, S., Muraguri, N., Musyoki, H., Tun, W., Broz, D., Kuria, D.,Kim, A., Oluoch, T., and Raymond, H. F. (2013). Estimates of the size of keypopulations at risk for HIV infection: men who have sex with men, female sexworkers and injecting drug users in Nairobi, Kenya. Sexually TransmittedInfections 89, 366–71.Olkin, I., Petkau, A. J., and Zidek, J. V. (1981). A comparison of n estimators forthe binomial distribution. Journal of the American Statistical Association 76,637–642.Paz-Bailey, G., Jacobson, J. O., Guardado, M. E., Hernandez, F. M., Nieto, A. I.,Estrada, M., and Creswell, J. (2011). How many men who have sex with menand female sex workers live in El Salvador? Using respondent-driven samplingand capture-recapture to estimate population sizes. Sexually TransmittedInfections 87, 279–82.Pollock, K. H., Nichols, J. D., Brownie, C., and Hines, J. E. (1990). Statisticalinference for capture-recapture experiments. Wildlife Monographs pages 3–97.Purcell, D. W., Johnson, C. H., Lansky, A., Prejean, J., Stein, R., Denning, P.,Gau, Z., Weinstock, H., Su, J., and Crepaz, N. (2012). Estimating thepopulation size of men who have sex with men in the United States to obtainHIV and syphilis rates. The Open AIDS Journal 6, 98–107.Raymond, H. F., Bereknyei, S., Berglas, N., Hunter, J., Ojeda, N., and McFarland,W. (2013). Estimating population size, HIV prevalence and HIV incidenceamong men who have sex with men: a case example of synthesising multipleempirical data sources and methods in San Francisco. Sexually TransmittedInfections 89, 383–7.Saidel, T., Loo, V., Salyuk, T., Emmanuel, F., Morineau, G., and Lyerla, R.(2010). Applying current methods in size estimation for high risk groups in thecontext of concentrated epidemics: lessons learned. jHASE-Journal ofHIV/AIDS Surveillance & Epidemiology 2,.Salganik, M. J., Fazito, D., Bertoni, N., Abdo, A. H., Mello, M. B., and Bastos,F. I. (2011). Assessing network scale-up estimates for groups most at risk ofHIV/AIDS: evidence from a multiple-method study of heavy drug users inCuritiba, Brazil. American Journal of Epidemiology 174, 1190–6.Statistics Canada, Demography Division (2013). Annual estimates of populationfor canada, provinces and territories, from july 1, 1971 to july 1, 2013.Technical report, Statistics Canada.39Surveillance and Risk Assessment Division, Centre for Communicable Diseasesand Infection Control (2010). Hiv and aids in canada: Surveillance report todecember 31, 2009. Technical report, Public Health Agency of Canada.Sutherland, J. M. (2003). Multi-list methods in closed populations with stratifiedor incomplete information. PhD thesis, Simon Fraser University.UNAIDS/WHO Working Group on Global HIV/AIDS and STI Surveillance(2010). Guidelines on estimating the size of populations most at risk to HIV.Technical report, World Health Organization.40Appendix AOn the Inappropriateness ofUsing Capture-RecaptureInspired Methodology to AnalyseData for the Multiplier MethodThe following discussion is restricted to applying the multiplier method with atleast two traits. Recall that in the multiplier method, one has data on the exactnumber of people in the target population with a trait for t traits, and a surveyquerying the status of all t traits in a sample of the target population. One maybe inclined to draw parallels between the formation of a trait with “capturing andtagging” individuals in the target population which forms the basis of capture-recapture type inference. Should one take this view, data on t marginal counts givesrise to t “lists” in capture-recapture terminology, Lm1 , . . . ,Lmt . However, unlike truecapture-recapture, individuals cannot be matched at all across these first t lists, dueto privacy regulations.The survey component of data for multiplier method gives rise to another list,Ls, of individuals sampled from the target population. By design, survey respon-dents self-report on t traits for which marginal counts are available. This meansonly for members of Ls can their membership in Lm1 , . . . ,Lmt be ascertained. Thus41Table A.1: Comparison of observable statistics from a two trait multipliermethod study to a true three list capture-recapture study. N000 is neverobservable even in true capture-recapture designs.Observable summary statistics for capture-recapture analysisN000 N001 N010 N011 N100 N101 N110 N111data frommultiplier-method studyX X X X X X X Xdata fromtrue capture-recapture studyX X X X X X X Xdata for the multiplier method, when cast under the “capture-recapture” framework,produces t +1 lists with some unmatchable member. Table A.1 illustrates the criti-cal information missing from casting a two trait multiplier method study in a threelist capture-recapture analysis. While connections between capture-recapture andmultiplier method likely exists, the following paragraphs emphasize the limitationsof forcing capture-recapture analysis on multi-trait multiplier method data.Capture-recapture methodologies for lists with unmatchable members are under-developed for application in epidemiology. Sutherland (2003) argues that classicalmodels motivated by ecological applications for lists with unmatchable membersare unsuitable given the mechanism in epidemiology. A method motivated by epi-demiological applications is described in Sutherland (2003) for two-list capture-recapture. However without extension to multi-list this method cannot be appliedto the problem at hand.Furthermore, key assumptions in analysis tailored for capture-recapture ex-periments cannot be justified in the multiplier data. A fundamental flaw is therequirement that any person in the target population must have non-zero proba-bility of being on any list (Chao et al., 2001). Say for example, that one of thetraits is “positive diagnosis of HIV.” Given the epidemiological practice of usingconfirmatory testing to eliminate false positives, people without the HIV virus aresurely excluded from the list of people diagnosed with HIV. According to Chaoet al. (2001), an epidemiologist must modify the definition of the target populationto exclude those with zero probability of being jointly captured in all lists. How-42ever, in the example of using HIV diagnosis as one of the traits, this advice leadsto exclusion of the majority of the intended target population as HIV prevalenceis expected to be low - rendering the inference useless in practice. Furthermore,given the wide variety of traits that can be employed in the multiplier method, e.g.,disease diagnosis, service usage, etc., it is difficult for epidemiologists to justifychoosing any particular structure for dependencies between traits when employ-ing either the ecological or log-linear models of capture-recapture. While Chaoet al. (2001) note a third category of analysis in capture-recapture, the “samplecoverage” approach, which bypasses this requirement, this method does rely on anuntestable assumption.A second type of analysis – known as “capture-recapture without recapture”(Carroll and Lombard, 1985; Olkin et al., 1981) – does not require cross-identificationof subjects across lists. However, these methods makes a key assumption thatthe probability of members of the target population appearing on a list being thesame across lists, which does not apply to the data at hand. Furthermore, this typeof analysis assumes equal catchability of all members and having no uncatchablemembers, which, similar to classical capture-recapture. These assumptions are areunwarranted in the epidemiology data used in the multiplier method for the reasonsgiven in the previous paragraph.In contrast, the reparameterized likelihood multiplier method of Chapter 3 doesnot share the above concerns of capture-recapture due to having been truly moti-vated by the data at hand. An analogy of the reparameterized likelihood methodis drawing marbles out of a bag where one observes each marble having up to ttypes of markings (traits), conditional on the number of each type of marking inthe bag being known exactly. The only assumptions required, besides having aclosed-population, is that the survey be a representative probability survey of thetarget population.43Appendix BBench-Marking withEqui-Correlation andEqui-Prevalence for the GeneralCaseIn Section 3.2, the relationship between estimation uncertainty and properties ofthe traits being used was explored under equi-prevalence and equi-correlation sce-narios only. However, intuition suggests that the observed behaviour of estimationuncertainty may be extrapolated to apply more generally. In this section, this intu-ition is verified by showing that the performance in general falls between bench-marking numbers generated under the equi-correlation, equi-prevalence restriction.This is done via a Monte Carlo simulation.The simulation is restricted to study designs with between 3 to 5 binary traits,each with 100 test cases. To generate a random case of k binary traits, q, a k di-mensional vector of marginal prevalences is sampled one dimension at a time fromUni f orm(0, pmax), and ρ , representing(k2)correlation parameters is sampled fromUni f (0,1) one dimension at a time. This selected case is mapped to cell prob-abilities by using the cumulative distribution for a k-variate Normal distribution,MV Nk(0,Σ), with a vector of cut-points which correspond to marginal trait preva-44lences q. The diagonal elements of Σ are equal to 1, and off-diagonal elementsof Σ are set as randomly selected correlations ρ . The cut-points are the desiredtrait prevalences are specified through. The matrix Σ is confirmed to be positive-definite, otherwise ρ is re-drawn.The estimation uncertainty for this randomly generated set of traits is calculatedupon obtaining the contingency table. Three bench-marks for this uncertainty areinvestigated: bl , bm, bu. Each benchmark is obtained from a set of k traits of equi-correlation and equi-prevalence for which the correlation is a function of ρ , andprevalence is a function of q. The first benchmark, bl , uses max(q) and min(ρ ) forprevalence and correlation. The second, bm, uses avg(q) and avg(ρ ) for prevalenceand correlation, Lastly, bu uses min(q) and max(ρ ) for prevalence and correlation.The result of the study shows that the estimator uncertainty arising from ageneral set of traits is always bounded by the maximum and minimum of the 3benchmarks. The result from this simulation study is visualized in Figure B.1. Inparticular, bm appear to be highly predictive of the actual estimator error. Thisproperty suggests only the a priori knowledge on the average prevalence and aver-age correlation is required of an subject-area expert to produce good predictions ofthe study precision.450.51.01.50.51.01.50.51.01.53 traits4 traits5 traits0.000 0.025 0.050 0.075average prevalencesd(N^)/NFigure B.1: The case of unequal association and unequal prevalence. Solid-circles in the plot represent the actual estimation uncertainty for eachrandomly selected test-scenario. Each circle is accompanied by a verti-cal line extending from the lower bench-mark (bl) to the upper bench-mark (bu). Each vertical line is accompanied by a tick-mark which lo-cate the middle bench-mark (bm). Each bench-mark is obtained with theassumption of equi-correlation (chosen to be a function of(K2)originalcorrelations) and equi-prevalence (chosen to be a function of K originalprevalences). One observes that the actual estimator error is always lessthan bu. This suggests an a priori knowledge of the maximum plausiblecorrelation and minimum plausible prevalence will lead to a conserva-tive prediction of the estimation uncertainty. However, in some cases,this usage may be overly conservative. Instead, bm appears highly pre-dictive of the actual estimation uncertainty which suggests the use ofaverage prevalence and average correlation in study design.46Appendix CDeriving the Fisher InformationC.1 NotationsIn this section some new notations are defined for describing k traits building onthose appearing in Section 2.3 of the main body . Let x be a vector which denotesthe statuses of k traits, x = (x1, . . . ,xk). Next, viewing the k traits as forming amulti-dimensional contingency table, let C denote the number of cells in this con-tingency table, C =K∏i=1li. Intuitively, each cell in the contingency table correspondsto a unique combination of the status of k traits, i.e. there exist a 1-1 function gwhich maps x to cell index, c. In this thesis the following mapping is followed:c = g(x) = 1+ x1 + l1x2 + l1l2x3 + . . .+(k−1∏i=1li)xk (C.1)It is easy to see that cell indexing begins from 1, and that cell 1 corresponds to thesituation where all k traits are of status 0 – an absence of every trait. Using g, anew random variable Z = g(X ) is defined, and it is interpreted as the multinomialvariable indicating the cell in the k dimensional table a sampled individual fallsunder. Let p = (p1, . . . , pC) be the cell probabilities that govern the frequencies inthe k dimensional table, i.e., pc = Pr[Z = c]. Lastly, define x(c) = g−1(c) whichreturns the combination of the k statuses that correspond to cell c.47C.2 Defining a Reparameterization for the CellProbabilitiesThe reparameterization of a particular cell probability pc, c ∈ {1, . . . ,C}, dependson the combination of trait statuses given by x(c). Because this systematic ap-proach to reparameterization relies on expressing the cells that correspond to 1 orless traits being present as a function of the rest of the probabilities, the followingsets of cell indices are defined to help express this mapping:• S = {1, . . . ,C}: the complete set of possible index numbers• A ={c :k∑i=1I [x(c)i > 0] = 1}: the index of the cells that correspond to traitcombinations where only one trait is non-zero• B =S \ (A ∪{1}): the index of the cells that correspond to trait combina-tions where more than one trait is non-zeroSecondly, the cell probabilities that do not need reparameterizing are definedas identity mapped to a vector θ . The elements of θ are part of the parameters inthe reparameterized likelihood.Next the function h is defined, such that pc = h(θ ,N,M ,c):pc =θc if c ∈Bk∑i=1li−1∑s=1I [x(c)i = s](MisN− ∑m∈BθmI [x(m)i = x(c)i])if c ∈A1−k∑i=1li−1∑s=1MisN+ ∑m∈Bθm(k∑j=1I [x(m) j > 0]−1)if c = 1(C.2)Note that θ = {θc : c ∈B}. This means the size of θ is |B|, and θc is namedwith indices that identify to which pc it is identity mapped, which also identify theunique combination of k trait values it represents.C.3 The Fisher InformationUsing the multinomial likelihood approximation for simple random sampling with-out replacement from a large population, a likelihood model for the survey data is:48log f (Z1, . . . ,Zn) =C∑c=1(n∑i=1I [Zi = c])log pc. Here pc is used for simplicity but itis understood to be a function of θ , N, M and c as in Equation (C.2). Note that Mis known in this case.To obtain the Fisher Information with respect to vector of parameters η ={θ ,N} of length |B|+1, letA =∂ p1∂θ1· · ·∂ p1∂N.... . ....∂ pC∂θ1· · ·∂ pC∂N.As is standard for multinomial models, it is hereby stated without proof the formof the Fisher information matrix:I = EZ[Atdiag(I [Z = 1]p21, . . . ,I [Z =C]p2C)A]= Atdiag(1p1, . . . ,1pC)AC.4 Examining the Effect of Changing N on Var(NˆMLE)Assuming each person is sampled independently, the variance-covariance matrixVCOV (ηˆMLE)≈ I−11 /n, where I−11 is the inverse of the Fisher Information for onesample. The variance component of interest is given as Var(NˆMLE)≈ (I−11 )N,N/n.In the last section, it was shown that I = Atdiag(1p1, . . . ,1pC)A. In order toexamine the effect of increasing N on Var(NˆMLE), while keeping all other param-eters the same, an expression for I−1 that isolates out all appearances of N in theexpression is desired. Let qis = Mis/N be the marginal prevalence of a trait, wherei ∈ {1, . . . ,k} ,s ∈ {1, . . . , li−1}. Note that qis is kept constant.In the matrix of partial derivatives, A, according to the parameterization of pcin Eqn. (C.2),∂ pc∂θldoes not contain N no matter which c or l. As for the elements49∂ pc∂N in the matrix A, the partial derivatives are:∂ pc∂N =0 if c ∈B−k∑i=1li−1∑s=1I [x(c)i = s](qisN)if c ∈Ak∑i=1li−1∑s=1qisNif c = 1∝1NWith the above knowledge I may be factored into the following matrix sub-blocks:I =[B1 1N B21N Bt21N2 B3]where sub-block B1 is of dimension (|B|× |B|), B2 is of dimension (|B|×1) andB3 of dimension (1×1).Using the block inversion matrix formula, the element (I−1)(N,N) = (1N2 B3−1N Bt2×B1×1N B2)−1 ∝N2. Hence asymptotically, Var(NˆMLE)∝N2, and SD(NˆMLE)∝N. 50Appendix DImplementing Bayesian InferenceThis appendix outlines the details regarding implementing Bayesian inference ap-pearing in Section 3.3 of the main body with Markov Chain Monte Carlo (MCMC)in the software package OpenBUGS.D.1 Model Specification for Inferring N withInformation from Two TraitsIn the case that inference is carried out with information from two traits, the spec-ification of the reparameterized likelihood and priors can be done with built-indistribution functions in OpenBUGS. In this case the chain for logN is directlygenerated by OpenBUGS and inference from MCMC can be performed with thecoda() package in R. Please see the following for the model file:#uninformative (Uniform prior) #logN#requires x, n, Marg1, Marg2# U is the upper limit ...model{logN <- log(N)x ˜ dmulti(pi, n)pi[1] <- p00pi[2] <- p0151pi[3] <- p10pi[4] <- p11p00 <- 1-p01-p10-p11p01 <- Marg2/N - p11p10 <- Marg1/N - p11p11 ˜ dunif(0, min(Marg1/N, Marg2/N))N ˜ dunif(max(n,Marg1,Marg2),U)}D.2 Model Specification for Inferring N withInformation from Three TraitsIn the main body, an argument was presented for preferring a prior of the formpi(N)× pi(θ |N). However, specifying a joint prior in a convenient MCMC soft-ware (e.g. OpenBUGS) of the form pi(N)×pi(θ |N) in the case of K > 2 traits isnon-trivial. This is because the sampled θ vector must satisfy marginal probabili-ties (constraints) given N, which will likely involve custom distribution functionsand MCMC samplers. A simple work-around is to first obtain the posterior MCMCchains under a “convenience prior”, pi(θ )×pi(N|θ ), then post-process these chainswith importance sampling to represent the actual posterior distribution. This ap-proach simplifies the sampling because the convenience prior involves a uncon-strained multivariate distribution, pi(θ ), and a constrained univariate distribution,pi(N|θ ). Constraints on univariate distributions are readily implemented in mostconvenient MCMC software. The convenience prior chosen for this simulationconsists of pi(θ ) taken to be the first |θ | dimensions of a Dirichlet(α ) distribution,and pi(N|θ ) ∼Uni f orm(Lθ ,Uθ ) with Lθ ,Uθ being the upper and lower limits asdetermined by θ , based on the constraints that 0 < pc < 1 for all pc a function ofθ ,N (i.e. Equation C.2).Let g′(N,θ) be the posterior density under the convenience prior, and g(N,θ)be the posterior density up to a proportionality constant, i.e. without the normaliza-tion constant. Let f ′(N,θ) be the actual posterior density, and f (N,θ) the actualposterior density up to a proportionality constant. An importance sampling esti-mate of the posterior expected value of a function of parameters h(N,θ ), based on52a MCMC chain of length T , is E f ′ [h(N,θ )]≈ h˜T whereh˜T =1TT∑t=1[h(N(t),θ (t)) f (N(t),θ (t))g(N(t),θ (t))]1TT∑i=1[f (N(t),θ (t))g(N(t),θ (t))]f (N(t),θ (t)) = 1U−L1VN(t)pi(data|p,N)g(N(t),θ (t)) =(|θ |∏j=1(θ (t)j )α j−1)(1−|θ |∑j=1θ (t)j)α|θ |+1−1B(α )1Uθ (t)−Lθ (t)pi(data|p,N)The above re-weighting of the MCMC samples can be carried out completelyin R, by using the R packages geometry and rcdd to obtain the approximate“volume” of the constraint space, VN , for a given sample of N. To obtain the per-formance measures appearing in Table 2 in the main body with importance sam-pling, let h1(N,θ ) = logN and h2(N,θ ) = (logN)2 which gives the 1st and 2ndmoment of logN, and let h3(N,θ ) = I[logN < t] for obtaining the equal tailed95% credibility interval for logN.The model file for obtaining the posterior MCMC chain under the convenienceprior is provided below:#uninformative (Uniform prior) -> f(pi)f(N|pi)#requires x, n, Marg1, Marg2, Marg3, alphamodel{logN <- log(N)x[1:8] ˜ dmulti(pp[1:8], n)pp[1] <- 1- sum(pp[2:8])pp[2] <- Marg1/N - a-b-dpp[3] <- Marg2/N - a-c-d53pp[4] <- app[5] <- Marg3/N - b-c-dpp[6] <- bpp[7] <- cpp[8] <- dtheta[1:5]˜ ddirich(alpha[1:5]) # let alpha be c(1,1,1,1,4)a <- theta[1]b <- theta[2]c <- theta[3]d <- theta[4]U2 <- Marg1/(a+b+d)U3 <- Marg2/(a+c+d)U4 <- Marg3/(b+c+d)U1 <- (Marg1+Marg2+Marg3)/(a+b+c+2*d)L2 <- Marg1/(1+a+b+d)L3 <- Marg2/(1+a+c+d)L4 <- Marg3/(1+b+c+d)L1 <- (Marg1+Marg2+Marg3)/(1+a+b+c+2*d)#specify new prior for N|pdummy <- 0dummy ˜ dloglik(phi)phi <- log(S) #logLik of NN ˜ dflat()S <- step(UpperLim- N)*step(N- LowerLim)*Y/ZZ <- UpperLim-LowerLim#define some vars#check to make sure the upper limit > lower limitY <- step(UpperLim-LowerLim)54LowerLim <- max(n,max(L1,max(L2,max(L3,L4))))UpperLim <- min(U1,max(U2,max(U3,U4)))}55Appendix EProperties of the BayesianEstimator using Multiple Traitswith the Multiplier Method – aSimulation StudyThe purpose of this simulation study is twofold – to demonstrate that Bayesian in-ference, in terms of estimation uncertainty, performs similarly to the MLE with theuse of an uninformative prior, and outperforms the MLE with the use of an informa-tive prior. For each selected scenario, a random sample of size n from is drawn froma simulated population B = 1000 times. With each sample of size n, Bayesian in-ference is obtained from logN with ˆlogNBayes =E[logN|Data] and SD[logN|Data]playing the role of standard error. The estimator error, SD( ˆlogNBayes), is numeri-cally approximated with the standard deviation of B values of ˆlogNBayes.In the summary of results,(ˆlogNBayes)= 1/B∑Bi=1( ˆlogNBayes)i is providedto assess bias, SD( ˆlogNBayes) is listed alongside SD( ˆlogNMLE) for comparison,SD [logN|Data] = 1/BB∑i=1SD[logN|Data]i is provided to assess the quality of us-ing SD[logN|Data] as an substitute for estimator standard deviation- a leap of faiththat is necessary in real applications. The coverage probability from equal-tailed95% credibility interval is also presented as a indication of the performance of56finite-sample inference.E.1 Simulation ResultsTable E.1 summarises the performance of the Bayes estimator in select hypotheticalpopulations. Comparing ˆlogNBayes with logN gives an approximation of bias. Notethat in all cases the numerically obtained bias indicates a multiplicative bias of afew percent on the scale of N.The standard deviation of the Bayes estimator is numerically approximated bySD(ˆlogNBayes). A comparison of the standard deviation of the Bayes estimator thatuses an uninformative prior with the theoretical asymptotic standard deviation ofthe MLE reveals close agreement. This gives us confidence in the results obtainedfrom the theoretical exploration in previous sections. Furthermore, empirically, again in efficiency is observed with the use of informative priors, as seen in the tophalf and bottom half of Table E.1.In practice, the posterior standard deviation of log(N) plays the role of standarderror. From Table E.1, the standard deviation of posterior density, SD [logN|Data],on average agrees with the standard deviation of the Bayes estimator. Furthermore,the equal-tailed 95% credibility interval provides good coverage probability. Over-all, this simulation has demonstrated many good properties that support the useof Bayesian inference with the proposed reparameterized multinomial likelihood(Chapter 3) to infer population size.57Table E.1: Performance of Bayesian inference under select populations and prior distributionsDescription of Population Prior distribution logN(ˆlogNBayes)SD(ˆlogNMLE)SD(ˆlogNBayes)SD [logN|Data] coverage prob.N = 20000, p= 0.05, uninformative:9.903 9.869 0.218 0.21 0.221 0.95n= 200, k = 2, φ = 0 ∼Uniform(L, 1e6)N = 20000, p= 0.05, informative:9.903 9.83 0.218 0.159 0.191 0.958n= 200, k = 2, φ = 0 ∼Uniform(L, 30000)N = 40000, p= 0.1, uninformative:10.597 10.588 0.077 0.075 0.079 0.96n= 500, k = 3, φ = 0 ∼Uniform(L, 1e6)N = 40000, p= 0.1, informative:10.597 10.584 0.077 0.069 0.076 0.964n= 500, k = 3, φ = 0 ∼Uniform(L, 50000)58Appendix FJAGS Model File for CombinedModelling of Migration andMultiple Trait Multiplier Method#in JAGS syntaxmodel{logN <- log(N)x ˜ dmulti(pi, n)pi[1] <- p00pi[2] <- p01pi[3] <- p10pi[4] <- p11p00 <- 1-p01-p10-p11p01 <- N_t2/N - p11p10 <- N_t1/N - p11#priorp11 ˜ dunif(0, min(N_t1/N, N_t2/N))N ˜ dunif(max(n,N_t1,N_t2),270000)N_t1 <- sum(S) + sum(I)59S[1] ˜ dbin(p_sˆ4, D[1])S[2] ˜ dbin(p_sˆ3, D[2])S[3] ˜ dbin(p_sˆ2,D[3])S[4] ˜ dbin(p_s, D[4])S[5] <- D[5] # S[j], where j = 5 represent year 2008I[1] ˜ dbin( 1- (1-p_m)ˆ4, O[1])I[2] ˜ dbin( 1- (1-p_m)ˆ3, O[2])I[3] ˜ dbin( 1- (1-p_m)ˆ2, O[3])I[4] ˜ dbin( 1- (1-p_m), O[4])I[5] = 0## hyper priorp_m ˜ dunif(0,0.06)p_s ˜ dunif(0.9,1)}60
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Extensions to the multiplier method for inferring population...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Extensions to the multiplier method for inferring population size Meng, Vivian Yun 2014
pdf
Page Metadata
Item Metadata
Title | Extensions to the multiplier method for inferring population size |
Creator |
Meng, Vivian Yun |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | Estimating population size is an important task for epidemiologists and ecologists alike, for purposes of resource planning and policy making. One method is the "multiplier method" which uses information about a binary trait to infer the size of a population. The first half of this thesis presents a likelihood-based estimator which generalizes the multiplier method to accommodate multiple traits as well as any number of categories (strata) in a trait. The asymptotic variance of this likelihood-based estimator is obtained through the Fisher Information and its behaviour with varying study designs is determined. The statistical advantage of using additional traits is most pronounced when the traits are uncorrelated and of low prevalence, and diminishes when the number of traits becomes large. The use of highly stratified traits however, does not appear to provide much advantage over using binary traits. Finally, a Bayesian implementation of this method is applied to both simulated data and real data pertaining to an injection-drug user population. The second half of this thesis is a first systematic approach to quantifying the uncertainty in marginal count data that is an essential component of the multiplier method. A migration model that captures the stochastic mechanism giving rise to uncertainty is proposed. The migration model is applied, in conjunction with the multi-trait multiplier method, to real-data from the British Columbia Centre for Disease Control. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-09-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
IsShownAt | 10.14288/1.0135548 |
URI | http://hdl.handle.net/2429/50295 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2014-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2014_november_meng_vivian.pdf [ 487.59kB ]
- Metadata
- JSON: 24-1.0135548.json
- JSON-LD: 24-1.0135548-ld.json
- RDF/XML (Pretty): 24-1.0135548-rdf.xml
- RDF/JSON: 24-1.0135548-rdf.json
- Turtle: 24-1.0135548-turtle.txt
- N-Triples: 24-1.0135548-rdf-ntriples.txt
- Original Record: 24-1.0135548-source.json
- Full Text
- 24-1.0135548-fulltext.txt
- Citation
- 24-1.0135548.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0135548/manifest