PHYLOGENETIC ESTIMATION OF CONTACT NETWORK PARAMETERSWITH APPROXIMATE BAYESIAN COMPUTATIONbyRosemary Martha McCloskeyB.Sc., Simon Fraser University, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THEDEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Bioinformatics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)July 2016©Rosemary Martha McCloskey, 2016AbstractModels of the spread of disease in a population often make the simplifying assumption that the popula-tion is homogeneously mixed, or is divided into homogeneously mixed compartments. However, humanpopulations have complex structures formed by social contacts, which can have a significant influenceon the rate and pattern of epidemic spread. Contact networks capture this structure by explicitly repre-senting each contact that could possibly lead to a transmission. Contact network models parameterizethe structure of these networks, but estimating their parameters from contact data requires extensive,often prohibitive, epidemiological investigation.We developed a method based on approximate Bayesian computation (ABC) for estimating struc-tural parameters of the contact network underlying an observed viral phylogeny. The method combinesadaptive sequential Monte Carlo for ABC, Gillespie simulation for propagating epidemics though net-works, and a previously developed kernel-based tree similarity score. Our method offers the potential toquantitatively investigate contact network structure from phylogenies derived from viral sequence data,complementing traditional epidemiological methods.We applied our method to the Baraba´si-Albert network model. This model incorporates the prefer-ential attachment mechanism observed in real world social and sexual networks, whereby individualswith more connections attract new contacts at an elevated rate (“the rich get richer”). Using simulateddata, we found that the strength of preferential attachment and the number of infected nodes could of-ten be accurately estimated. However, the mean degree of the network and the total number of nodesappeared to be weakly- or non-identifiable with ABC.Finally, the Baraba´si-Albert model was fit to eleven real world HIV datasets, and substantial het-erogeneity in the parameter estimates was observed. Posterior means for the preferential attachmentpower were all sub-linear, consistent with literature results. We found that the strength of preferentialattachment was higher in injection drug user populations, potentially indicating that high-degree “su-perspreader” nodes may play a role in epidemics among this risk group. Our results underscore theimportance of considering contact structures when investigating viral outbreaks.iiPrefaceThe initial idea to use approximate Bayesian computation (ABC) to infer contact network model pa-rameters was Dr. Poon’s, based on his previous work using ABC to infer parameters of populationgenetic models. The tree kernel was originally developed by Dr. Poon, but the version used here wasimplemented by me to improve computational efficiency. The idea to apply sequential Monte Carlo wasmine, but Dr. Alexandre Bouchard-Coˆte´ made me aware of the adaptive version used in this work. Dr.Sarah Otto suggested the experiments involving a network with a heterogeneous α parameter and peer-driven sampling. Dr. Richard Liang provided guidance in the development of the Gillespie simulationalgorithm and statistical advice. The netabc program, and all supplementary analysis programs, werewritten by me.A version of chapter 2 has been submitted for publication with the title “Reconstructing networkparameters from viral phylogenies.” An oral presentation entitled “Phylodynamic inference of contactnetwork parameters with kernel-ABC” was given based on chapter 2 to the 23rd HIV Dynamics andEvolution meeting on April 25, 2016, in Woods Hole, Massachusetts, USA (the presentation was de-livered remotely). A poster based on chapter 2 entitled “Likelihood-free estimation of contact networkparameters from viral phylogenies” was presented at the Intelligent Systems for Molecular Biologymeeting on July 8, 2016, in Orlando, Florida, USA.Use of the BC data is in accordance with an ethics application that was reviewed and approved bythe UBC/Providence Health Care Research Ethics Board (H07-02559).Source code for the netabc program is freely available at https://github.com/rmcclosk/netabc underthe GPL-3 license. Scripts to run all computational experiments, as well as the source code for thisthesis, are available at https://github.com/rmcclosk/thesis.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Phylogenetics and phylodynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.1 Phylogenetic trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.2 Transmission trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2.3 Phylodynamics: linking evolution and epidemiology . . . . . . . . . . . . . . 61.2.4 Tree shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Contact networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.3.2 Scale-free networks and preferential attachment . . . . . . . . . . . . . . . . . 111.3.3 Relationship between network structure and transmission trees . . . . . . . . . 131.4 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.4.1 Overview and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.4.2 Sequential importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . 151.4.3 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161.4.4 The sequential Monte Carlo sampler . . . . . . . . . . . . . . . . . . . . . . . 171.5 Approximate Bayesian computation . . . . . . . . . . . . . . . . . . . . . . . . . . . 201.5.1 Overview and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20iv1.5.2 Algorithms for ABC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 Reconstructing contact network parameters from viral phylogenies . . . . . . . . . . . 272.1 Netabc: a computer program for estimation of contact network parameters with ABC . 272.1.1 Simulation of transmission trees over contact networks . . . . . . . . . . . . . 292.1.2 Phylogenetic kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.1.3 Adaptive sequential Monte Carlo for Approximate Bayesian computation . . . 312.1.4 Justification for approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322.2 Analysis of Baraba´si-Albert model with synthetic data . . . . . . . . . . . . . . . . . 352.2.1 Why study the Baraba´si-Albert model? . . . . . . . . . . . . . . . . . . . . . 352.2.2 Using synthetic data to investigate identifiability and sources of estimation error 372.2.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.3 Application to real world HIV data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532.3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642.4.1 Netabc: uses, limitations, and possible extensions . . . . . . . . . . . . . . . . 642.4.2 Analysis of Baraba´si-Albert model with synthetic data . . . . . . . . . . . . . 652.4.3 Application to real world HIV data . . . . . . . . . . . . . . . . . . . . . . . 693 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76Appendix A Mathematical models, likelihood, and Bayesian inference . . . . . . . . . . . 89Appendix B Additional plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92vList of Tables2.1 Values of parameters and meta-parameters used in classifier experiments to investigateidentifiability of BA model parameters from tree shapes. . . . . . . . . . . . . . . . . 422.2 Values of parameters and meta-parameters used in grid search experiments to furtherinvestigate identifiability of BA model parameters. . . . . . . . . . . . . . . . . . . . 422.3 Parameter values used in simulation experiments to test accuracy of Baraba´si-Albert(BA) model fitting with netabc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.4 Average posterior mean point estimates and 95% highest posterior density (HPD) inter-val widths for BA model parameter estimates obtained with netabc on simulated data.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512.5 Parameters of a fitted generalized linear model (GLM) relating error in estimated α totrue values of BA parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512.6 Parameters of a fitted GLM relating error in estimated I to true values of BA parameters. 522.7 Characteristics of published human immunodeficiency virus (HIV) datasets analyzedwith netabc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572.8 Estimated power law exponents for six HIV datasets based on posterior mean pointestimates of BA model parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . 592.9 Values of power law exponent γ previously reported in the literature. . . . . . . . . . . 71viList of Figures1.1 Illustration of a rooted, ultrametric, time-scaled phylogeny. . . . . . . . . . . . . . . . 41.2 Illustration of a contact network and transmission tree. . . . . . . . . . . . . . . . . . 51.3 Examples of Baraba´si-Albert networks with preferential attachment power α = 0, 1, and2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.1 Graphical schematic of the ABC-SMC algorithm implemented in netabc. . . . . . . . 282.2 Illustration of an estimated transmission tree without labels and two possible underlyingcomplete transmission trees with labels. . . . . . . . . . . . . . . . . . . . . . . . . . 332.3 Schematic of classifier experiments investigating identifiability of BA model parametersfrom tree shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412.4 Schematic of grid search experiment to further investigate identifiability of BA modelparameters from tree shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442.5 Simulated transmission trees under three different values of preferential attachmentpower (α) parameter of BA model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 462.6 Kernel-principal components analysis (PCA) projections of simulated trees under vary-ing BA parameter values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472.7 Cross-validation accuracy of kernel-SVM, nLTT-based SVM, and Sackin’s SVM clas-sifiers for BA model parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482.8 Grid search estimates of BA model parameters for one replicate experiment with treesof size 500. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492.9 Posterior mean point estimates for BA model parameters α and I obtained by runningnetabc on simulated data, stratified by true parameter values. . . . . . . . . . . . . . . 502.10 One-dimensional marginal posterior distributions of BA model parameters estimatedwith low error by netabc from a simulated transmission tree. . . . . . . . . . . . . . . 532.11 Two-dimensional marginal posterior distributions of BA model parameters estimatedwith low error by netabc from a simulated transmission tree. . . . . . . . . . . . . . . 542.12 One-dimensional marginal posterior distributions of BA model parameters estimatedwith high error by netabc from a simulated transmission tree. . . . . . . . . . . . . . 552.13 Two-dimensional marginal posterior distributions of BA model parameters estimatedwith high error by netabc from a simulated transmission tree. . . . . . . . . . . . . . 562.14 Marginal posterior mean estimates and 50%/95% HPD intervals of BA model parame-ters when some parameters were fixed. . . . . . . . . . . . . . . . . . . . . . . . . . 60vii2.15 Comparison of BA parameter estimates obtained with netabc on the same dataset withdifferent sequential Monte Carlo (SMC) settings. . . . . . . . . . . . . . . . . . . . . 612.16 Posterior means and 50%/95% HPD intervals for parameters of the BA network model,fitted to eleven HIV datasets with netabc. . . . . . . . . . . . . . . . . . . . . . . . . 622.17 Posterior means and 50%/95% HPD intervals for parameters of the BA network model,fitted to two HIV datasets where both gag and env genes were sequenced. . . . . . . . 632.18 First and second time derivatives of epidemic growth curves at time of sampling forvarious values of I and N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67B.1 Reproduction of Figure 1A from Leventhal et al. (2012) used to check the accuracy ofour implementation of Gillespie simulation. . . . . . . . . . . . . . . . . . . . . . . . 92B.2 Approximation of mixture of Gaussians used by Del Moral et al. (2012) and Sisson etal. (2009) to test adaptive approximate Bayesian computation (ABC)-SMC. . . . . . . 93B.3 Approximation of mixture of two Gaussians used to test convergence of adaptive ABC-SMC algorithm to a bimodal distribution. . . . . . . . . . . . . . . . . . . . . . . . . 94B.4 Simulated transmission trees under three different values of BA parameter I . . . . . . 95B.5 Simulated transmission trees under three different values of BA parameter m . . . . . 96B.6 Simulated transmission trees under three different values of BA parameter N . . . . . 97B.7 Cross validation accuracy of classifiers for BA model parameter α for eight epidemicscenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98B.8 Cross validation accuracy of classifiers for BA model parameter I for eight epidemicscenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99B.9 Cross validation accuracy of classifiers for BA model parameter m for eight epidemicscenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100B.10 Cross validation accuracy of classifiers for BA model parameter N for eight epidemicscenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101B.11 Kernel principal components projection of trees simulated under three different valuesof BA parameter α , for eight epidemic scenarios. . . . . . . . . . . . . . . . . . . . . 102B.12 Kernel principal components projection of trees simulated under three different valuesof BA parameter I, for eight epidemic scenarios. . . . . . . . . . . . . . . . . . . . . 103B.13 Kernel principal components projection of trees simulated under three different valuesof BA parameter m, for eight epidemic scenarios. . . . . . . . . . . . . . . . . . . . . 104B.14 Kernel principal components projection of trees simulated under three different valuesof BA parameter N, for eight epidemic scenarios. . . . . . . . . . . . . . . . . . . . . 105B.15 Grid search kernel scores for testing trees simulated under various α values. The otherBA parameters were fixed at I = 1000, N = 5000, and m = 2. . . . . . . . . . . . . . . 106B.16 Grid search kernel scores for testing trees simulated under various I values. The otherBA parameters were fixed at α = 1.0, N = 5000, and m = 2. . . . . . . . . . . . . . . . 107B.17 Grid search kernel scores for testing trees simulated under various m values. The otherBA parameters were fixed at α = 1.0, I = 1000, and N = 5000. . . . . . . . . . . . . . 108viiiB.18 Grid search kernel scores for testing trees simulated under various N values. The otherBA parameters were fixed at α = 1.0, I = 1000, and m = 2. . . . . . . . . . . . . . . . 109B.19 Point estimates of preferential attachment power α of BA network model, obtained onsimulated trees with kernel-score-based grid search. . . . . . . . . . . . . . . . . . . 110B.20 Point estimates of prevalence at time of sampling I of BA network model, obtained onsimulated trees with kernel-score-based grid search. . . . . . . . . . . . . . . . . . . 111B.21 Point estimates of number of edges per vertex m of BA network model, obtained onsimulated trees with kernel-score-based grid search. . . . . . . . . . . . . . . . . . . 112B.22 Point estimates of number of edges per vertex N of BA network model, obtained onsimulated trees with kernel-score-based grid search. . . . . . . . . . . . . . . . . . . 113B.23 Posterior mean point estimates for BA model parameters m and N obtained by runningnetabc on simulated data, stratified by true parameter values. . . . . . . . . . . . . . . 114B.24 Relationship between preferential attachment power parameter α and fitted power lawexponent γ for networks simulated under the BA network model with N = 5000 andm = 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114B.25 Best fit power law and stretched exponential curves for degree distributions of simulatedBaraba´si-Albert networks for several values of α and m. . . . . . . . . . . . . . . . . 115B.26 Posterior mean point estimates and 95% HPD intervals for parameters of the BA net-work model, fitted to eleven HIV datasets with netabc using the prior m ∼ DiscreteUni-form(2, 5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116ixList of SymbolsI number of infected nodes in a contact network at the time of transmission tree sampling.M number of simulated datasets per particle in ABC-SMC.N total number of nodes in a contact network.R0 basic reproductive number; average number of infections ultimately caused by one infected individ-ual.T a transmission tree.Θ parameter space for a given mathematical model.αESS per-iteration rate of expected sample size decay in adaptive ABC-SMC.α preferential attachment power parameter in Baraba´si-Albert networks.β transmission rate in susceptible-infected and susceptible-infected-removed epidemiological models.γ exponent of power-law degree distribution in scale-free networks.λ decay factor meta-parameter for tree kernel.D set of discordant edges in Gillespie simulation.I set of infected nodes in Gillespie simulation.R set of recovered nodes in Gillespie simulation.S set of recovered nodes in Gillespie simulation.ν recovery rate in the susceptible-infected-removed epidemiological model.ρ distance function for approximate Bayesian computation.σ radial basis function variance meta-parameter for tree kernel.ε distance function for approximate Bayesian computation.m number of edges added per vertex when constructing a Baraba´si-Albert network.xn number of particles used for sequential Monte Carlo.q proposal function for Metropolis-Hastings kernel.tmax user-defined cutoff time at which to stop Gillespie simulation.t time since initial infection of index case in an epidemic.xiList of AbbreviationsABC approximate Bayesian computation.AIC Akaike information criterion.BA Baraba´si-Albert.ER Erdo˝s-Re´nyi.ERGM exponential random graph model.ESS expected sample size.GLM generalized linear model.GSL GNU scientific library.GTR generalized time-reversible.HDI highest density interval.HIV human immunodeficiency virus.HMM hidden Markov model.HPD highest posterior density.IDU injection drug users.IQR interquartile range.IS importance sampling.kPCA kernel principal components analysis.kSVM kernel support vector machine.kSVR kernel support vector regression.LTT lineages-through-time.MAP maximum a posteriori.MCMC Markov chain Monte Carlo.MH Metropolis-Hastings.ML maximum likelihood.MPI message passing interface.MSM men who have sex with men.nLTT normalized lineages-through-time.PA preferential attachment.PCA principal components analysis.pdf probability density function.POSIX Portable Operating System Interface.SARS severe acute respiratory syndrome.SI susceptible-infected.SIR susceptible-infected-recovered.SIS sequential importance sampling.SMC sequential Monte Carlo.SVM support vector machine.SVR support vector regression.TasP treatment as prevention.WS Watts-Strogatz.xiiAcknowledgementsFirst and foremost I am deeply indebted to my supervisor, Dr. Art Poon, for mentoring me through themajority of my academic career, and somehow always managing to make time for me. My time in hislab has been challenging and rewarding, and I feel well prepared for opportunities to come. I also thankmy labmates, especially Jeff, Richard, and Thuy. I’ll miss our Thursday lunches.I am grateful for the guidance of my committee members Dr. Alexandre Bouchard-Coˆte´, Dr. RichardHarrigan, and Dr. Sarah Otto. Thanks also to my rotation supervisors Dr. Sara Mostafavi and Dr. RyanMorin, who accepted me into their labs and always had time for my many questions.I owe thanks to my fellow graduate students of the bioinformatics training program, especially Ogan,who helped keep me on an even keel when times got tough.Finally, I dedicate this thesis to my parents, who have supported me morally and financially, enduredmy frustrations, and celebrated my successes.xiiiChapter 1Introduction1.1 ObjectiveThe spread of a disease is most often modelled by assuming either a homogeneously mixed popula-tion [1, 2], or a population divided into a small number of homogeneously mixed groups [3]. Thisassumption, also called mass action [4], or panmixia, implies that any two individuals in the same com-partment are equally likely to come into contact making transmission possible at some predefined rate.Although this provides a reasonable approximation in many cases [5], the error introduced by assuminga panmictic population can be substantial when significant contact heterogeneity exists in the underly-ing population [6–8]. Contact network models provide an alternative to compartmental models whichdo not require the assumption of panmixia. In addition to more accurate predictions, the parametersof the networks themselves may be of interest from a public health perspective. For example, certainvaccination strategies may be more or less effective in curtailing an epidemic depending on the under-lying network’s degree distribution [9, 10]. Phylodynamic methods, which link viruses’ evolutionaryand epidemiological dynamics, have been used to fit many different types of models to phylogeneticdata [11, 12]. However, these models generally assume a panmictic population. The primary objectiveof this work is to develop a method to fit contact network models, and thereby relax the assumption ofhomogeneous mixing, in a phylodynamic framework.In this work, we take a Bayesian approach: our goal is to estimate the posterior distribution on modelparameters given our data,pi(θ | T ) = f (T | θ)pi(θ)∫Θ f (T | θ)pi(θ)dθ, (1.1)where f (T | θ) is the likelihood of the parameters given T , pi(θ) is the prior on θ , and Θ is the spaceof possible model parameters. The denominator on the right-hand side is the marginal probability ofT which acts as a normalizing constant on the posterior (see appendix A for a review of mathematicalmodelling and Bayesian inference, including definitions of these concepts). As we shall show (sec-tion 2.1.4), estimating this distribution presents computational challenges beyond those usually encoun-tered in Bayesian inference. Both the likelihood f (T | θ) and the normalizing constant seem to be1intractable, which rules out the use of most common maximum likelihood and Bayesian methods.An alternative to likelihood-based methods, which could be applied to any network model, is pro-vided by approximate Bayesian computation (ABC) [13–16]. All of the ingredients required to applyABC to this problem are readily available. Simulating networks is straightforward under a variety ofmodels. Epidemics on those networks, and the corresponding transmission trees, can also be easilysimulated. As mentioned above, contact networks can profoundly affect transmission tree shape. Thoseshapes can be compared using a highly informative similarity measure called the “tree kernel” [17]; sim-ilar kernel functions have been demonstrated to work well as distance functions in ABC [18]. ABC canbe implemented with several algorithms, but sequential Monte Carlo (SMC) has advantages over oth-ers, including improved accuracy in low-density regions and parallelizability [19]. A recently-developedadaptive algorithm requiring minimal tuning on the part of the user makes SMC an even more attractiveapproach [20]. In summary, our method to infer contact network parameters will combine the follow-ing: stochastic simulation of epidemics on networks, the tree kernel, and adaptive ABC-SMC. Ourmethod will expand on the framework developed by [21], who combined ABC with the tree kernel toinfer parameters of population genetic models from viral phylogenies using Markov chain Monte Carlo(MCMC).Empirical studies of sexual contact networks have found that these networks tend to be scale-free [22–25], meaning that their degree distributions follow a power law (although there has been somedisagreement, see [6, 26]). Preferential attachment has been postulated as a mechanism by which scale-free networks could be generated [27]. The Baraba´si-Albert (BA) model [27] is one of the simplestpreferential attachment models, which makes it a natural choice to explore with our method. The sec-ond aim of this work is to use simulations to investigate the parameters of the Baraba´si-Albert model,including whether they have a detectable impact on tree shape, and whether they can be accuratelyrecovered using ABC.Due to its high global prevalence and fast mutation rate, HIV is one of the most commonly-studiedviruses in a phylodynamic context. Consequently, a large volume of HIV sequence data is publiclyavailable, more than for any other pathogen, and including sequences sampled from diverse geographicand demographic settings. At the time of this writing, there were 635,400 HIV sequences publiclyavailable in GenBank, annotated with 172 distinct countries of origin. Since HIV is almost always spreadthrough either sexual contact or sharing of injection drug supplies, the contact networks underlying HIVepidemics are driven by social dynamics and are therefore likely to be highly structured [25]. Moreover,since no cure yet exists, efforts to curtail the progression of an epidemic have relied on preventingfurther transmissions through measures such as treatment as prevention (TasP) and education leadingto behaviour change. The effectiveness of this type of intervention can vary significantly based on theunderlying structure of the network and the particular nodes to whom the intervention is targeted [28,29]. Due to this combination of data availability and potential public health impact, HIV is an obviouscontext in which our method could be applied. Therefore, the third and final aim of this work is to applyABC to fit the Baraba´si-Albert model to existing HIV outbreaks.To summarize, this work has three objectives. First, we will develop a method which uses ABC2to infer parameters of contact network models from observed transmission trees. Second, we will usesimulations to characterize the parameters of the BA network model in terms of their effect on tree shapeand how accurately they can be recovered with ABC. Finally, we will apply the method to fit the BAmodel to several real-world HIV datasets.The remainder of this background chapter is organized in four sections. The first section introducesphylogenies and transmission trees, which are the input data from which our method aims to makestatistical inferences. This section also introduces phylodynamics, a family of methods that, like ours,aim to infer epidemiological parameters from evolutionary data. The second section focuses on contactnetworks and network models, whose parameters we are attempting to infer. The relationship betweencontact networks and transmission trees is also discussed. The third and fourth sections introduce SMCand ABC respectively, which are the two algorithmic components of the method we will implement. Inparticular, ABC refers to the general approach of using simulations to replace likelihood calculations ina Bayesian setting, while SMC is a particular algorithm which can be used to implement ABC.1.2 Phylogenetics and phylodynamics1.2.1 Phylogenetic treesIn evolutionary biology, a phylogeny, or phylogenetic tree, is a graphical representation of the evolution-ary relationships among a group of organisms or species (generally, taxa) [30]. The tips of a phylogeny,that is, the nodes without any descendants, correspond to extant, or observed, taxa. The internal nodescorrespond to their common ancestors, usually extinct (although occasionally the internal nodes may beobserved as well, e.g.[31]). The edges or branches of the phylogeny connect ancestors to their descen-dants. Phylogenies may have a root, which is a node with no parent distinguished as the most recentcommon ancestor of all the extant taxa [32]. When such a root exists, the tree is referred to as beingrooted; otherwise, it is unrooted. The structural arrangement of nodes and edges in the tree is referredto as its topology [33].The branches of the tree may have associated lengths, representing either evolutionary distance orcalendar time between ancestors and their descendants. The term “evolutionary distance” is used hereimprecisely to mean any sort of quantitative measure of evolution, such as the number of differencesbetween the DNA sequences of an ancestor and its descendant, or the difference in average body massor height. A phylogeny with branch lengths in calendar time units is often referred to as time-scaled.In a time-scaled phylogeny, the internal nodes can be mapped onto a timeline by using the tips of thetree, which usually correspond to the present day, reference points [34]. The corresponding points onthe timeline are called branching times, and the rate of their accumulation is referred to as the branchingrate. Rooted trees whose tips are all the same distance from the root are called ultrametric trees [35].These concepts are illustrated in fig. 1.1.3extant taxatimepresentpastbranching timesinternal nodesrootFigure 1.1: Illustration of a rooted, ultrametric, time-scaled phylogeny. The tips of the tree, which rep-resent extant taxa, are placed at the present day on the time axis. Internal nodes, representing extinctcommon ancestors to the extant taxa, fall in the past. The topology of the tree indicates that cats anddogs are the most closely related pair of species, whereas fish is most distantly related to any other taxonin the tree.1.2.2 Transmission treesIn epidemiology, a transmission tree is a graphical representation of an epidemic’s progress througha population [36]. Like phylogenies, transmission trees have tips, nodes, edges, and branch lengths.However, rather than recording an evolutionary process (speciation), they record an epidemiologicalprocess (transmission). The tips of a transmission tree represent the removal by sampling of infectedhosts, while internal nodes correspond to transmissions from one host to another. Transmission treesgenerally have branch lengths in units of calendar time, with branching times indicating times of trans-mission. The root of a transmission tree corresponds to the initially infected patient who introducedthe epidemic into the network, also known as the index case. The internal nodes may be labelled withthe donor of the transmission pair, if this is known. The tips of the tree, rather than being fixed at thepresent day, are placed at the time at which the individual was removed from the epidemic, such asby death, recovery, isolation, behaviour change, or migration [37]. Consequently, the transmission treemay not be ultrametric, but may have tips located at varying distances from the root. Such trees aresaid to have heterochronous taxa [38], in contrast to the isochronous taxa found in most phylogeniesof macro-organisms. A transmission tree is illustrated in fig. 1.2 (right). The object on the right of thefigure is called a contact network, which depicts the entire susceptible population along with all possibleroutes of disease transmission. Contact networks, and their relationships to transmission trees, will be4abcdcaatimea infects ba infects cb is removedc infects da is removedc, d are removedabecdFigure 1.2: Illustration of epidemic spread over a contact network, and the corresponding transmis-sion tree. (Left) A contact network with five hosts, labelled a through e. Thick shaded edges indicatesymmetric contacts among the hosts. The transmission network is indicated by coloured arrows. Theepidemic began with node a, who transmitted to nodes b and c. Node c further transmitted to node d.Node e was not infected. (Right) The transmission tree corresponding to this scenario, with a timelineof transmission and removal times.discussed further in section 1.3.Each infected individual in an epidemic may appear at nodes of the transmission tree more thanonce. This is different from the transmission network, in which each infected individual appears exactlyonce, and edges are in one-to-one correspondence with transmissions [8, 39]. The distinction betweenthe two objects is illustrated in fig. 1.2. However, since transmission networks generally have no cycles(unless re-infection occurs), they are trees in the graph theoretical sense, and hence are sometimes alsoreferred to as transmission trees [e.g. 40]. In this work, we reserve the term “transmission tree” forthe objects depicted on the right side of fig. 1.2, following e.g. [37]. The term “transmission network”is taken to mean the subgraph of the contact network along which transmissions occurred, followinge.g. [8, 39].Since transmission trees are essentially a detailed record of an epidemic’s progress, they containsubstantial epidemiological information. As a basic example, the lineages-through-time (LTT) plot [34],which plots the number of lineages in a phylogeny against time, can be used to quantify the incidenceof new infections over the course of an epidemic [41]. However, in all but the most well-studied ofepidemics, transmission trees are not possible to assemble through traditional epidemiological meth-ods [39]. The time and effort to conduct detailed interviews and contact tracing of a sufficient numberof infected individuals is usually prohibitive, and may additionally be confounded by misreporting andother challenges [42]. However, it turns out that for viral epidemics, some of the epidemiological in-formation contained in the transmission tree leaves a mark on the viral genetic material circulating inthe population. A family of methods called phylodynamics [43] addresses the challenge of estimatingepidemiological parameters from viral sequence data [12].51.2.3 Phylodynamics: linking evolution and epidemiologyThe basis of phylodynamics is the fact that, for RNA viruses, epidemiological and evolutionary pro-cesses occur on similar time scales [38]. In fact, these two processes interact, such that it is possibleto detect the influence of host epidemiology on the evolutionary history of the virus as recorded in aninter-host viral phylogeny. Phylodynamic methods aim to detect and quantify the signatures of epidemi-ological processes in these phylogenies [11, 12], which relate one representative viral genotype fromeach host in an infected population. These methods have been used to investigate parameters such astransmission rate, recovery rate, and basic reproductive number [11, 12]. The majority of phylodynamicstudies attempt to infer the parameters of an epidemiological model for which the likelihood of an ob-served phylogeny can be calculated. Most often, this is some variation of the birth-death [44, 45] orcoalescent [46, 47] models. These methods either assume the viral phylogeny is known, as we do in thiswork, or (more commonly) integrate over phylogenetic uncertainty in a Bayesian framework. Phyloge-netic inference is a complex topic which we shall not discuss here; see e.g. [48] for a full review.Due to the relationship between the aforementioned processes, there is a degree of correspondencebetween viral phylogenies and transmission trees [36, 40, 49, 50]. In particular, the transmission processis quite similar to allopatric speciation [51], where genetic divergence follows the geographic isolationof a sub-population of organisms. Thus, transmission, which is represented as branching in the transmis-sion tree, causes branching in the viral phylogeny as well [52]. Similarly, the removal of an individualfrom the transmission tree causes the extinction of their viral lineage in the phylogeny. Consequently,the topology of the viral phylogeny is sometimes used as a proxy for the topology of the transmis-sion tree [53]. Modern likelihood-based methods of phylogenetic reconstruction [e.g. 54, 55] produceunrooted trees whose branch lengths measure genetic distance in units of expected substitutions persite. On the other hand, transmission trees are rooted, and have branches measuring calendar time [11].Therefore, estimating a transmission tree from a viral phylogeny requires the phylogeny to be rootedand time-scaled. Methods for performing this process include root-to-tip regression [56–58], which weapply in this work, and least-square dating [59]. Alternatively, the tree may be rooted separately with anoutgroup [60] before time-scaling.A caveat of estimating transmission trees in this manner is that the correspondence between thetopologies of the viral phylogeny and transmission tree is far from exact [36, 61]. Due to intra-hostdiversity, the viral strain which is transmitted may have split from another lineage within the donorlong before the transmission event occurred. Hence, the branching point in the viral phylogeny may bemuch earlier than that in the transmission tree. Another possibility is that one host transmitted to twoor more recipients in one order, but the transmitted lineages originated within the donor in a differentorder. In this case, the topology of the transmission tree and the viral phylogeny will be mismatched. Inpractice, this discordance has not proven an insurmountable problem: for example, Leitner et al. [62] andParaskevis et al. [63] were able to accurately recover known transmission trees using viral phylogenies.The problem of accurately estimating transmission trees is an ongoing area of research [31, 53, 64–67].For example, Hall, Woolhouse, and Rambaut [53] developed a Bayesian method to jointly estimate atransmission tree and viral phylogeny by combining models of agent-based transmission, within-host6population dynamics, and sequence evolution.1.2.4 Tree shapesTo perform phylodynamic inference, we must be able to extract quantitative information from viralphylogenies. What is informative about a phylogeny, beyond the demographic characteristics of theindividuals it relates, is its shape. The shape of a phylogeny has two components: the topology, and thedistribution of branch lengths [68]. Methods of quantifying tree shape fall into two categories: summarystatistics, and pairwise measures. Summary statistics assign a numeric value to each individual tree,while pairwise measures quantify the similarity between pairs of trees.One of the most widely used tree summary statistics is Sackin’s index [69], which measures theimbalance or asymmetry in a rooted tree. For the ith tip of the tree, we define Ni to be the number ofbranches between that tip and the root. The unnormalized Sackin’s index is defined as the sum of allNi. It is called unnormalized because it does not account for the number of tips in the tree. Amongtwo trees having the same number of tips, the least-balanced tree will have the highest Sackin’s index.However, among two equally balanced trees, the larger tree will have a higher Sackin’s index. Thismakes it challenging to compare balances among trees of different sizes. To correct this, Kirkpatrickand Slatkin [70] derive the expected value of Sackin’ index under the Yule model [71], under whicheach lineage has an equal probability of splitting (creating a branching in the tree) at all times. Dividingby this expected value normalizes Sackin’s index, so that it can be used to compare trees of differentsizes. An example of a pairwise measure is the normalized lineages-through-time (nLTT) [72], whichcompares the LTT [34] plots of two trees. Specifically, the two LTT plots are normalized so that theybegin at (0,0) and end at (1,1), and the absolute difference between the two plots is integrated between0 and 1. In the context of infectious diseases, the LTT is related to the prevalence [41], so large valuesmay indicate that the trees being compared were produced by different epidemic trajectories [72].Poon et al. [17] developed an alternative pairwise measure which applies the concept of a ker-nel function to phylogenies. Kernel functions, whose best known use is in support vector machines(SVMs) [73], compare objects in a space X by mapping them into a feature space F of high or infinitedimension via a function ϕ . The similarity between the objects is defined asK(x,x′) = 〈ϕ(x),ϕ(x′)〉,that is, the inner product of the objects’ representations in the feature space. Computing ϕ(x) may becomputationally prohibitive due to the dimension of F . The utility of a kernel function is that it isconstructed in such a way that it can compute the inner product without explicitly computing ϕ(x). Thekernel function developed in [17] will henceforth be referred to as the tree kernel. This kernel maps treesinto the space of all possible subset trees, which are subtrees that do not necessarily extend all the way tothe tips. The subset-tree kernel was originally developed for comparing parse trees in natural languageprocessing [74] and did not incorporate branch length information. The version developed by Poon etal. [17] includes a radial basis function to compare the differences in branch lengths, thus incorporating7both the trees’ topologies and their branch lengths in a single similarity score.The kernel score of a pair of trees, denoted K(T1,T2), is defined as a sum over all pairs of nodes(n1,n2), where n1 is a node in T1 and n2 is a node in T2. Following Poon et al. [17], let N(T ) denote theset of all nodes in T , nc(n) be the number of children of node n, c jn be the jth child of node n, and ln bethe ordered vector of branch lengths connecting node n to its nc(n) children. Furthermore, let nl(n) bethe number of children of n which are leaves (we always have nl(n)≤ nc(n)). The production rule of nis the pair (nc(n),nl(n)). That is, if two nodes have the same number of children and among these, thesame number of leaves, then they have the same production rule. Let kG(x,y) be a Gaussian radial basisfunction of the vectors x and y,kG(x,y) = exp(− 12σ‖x− y‖22),where ‖·‖2 is the Euclidean norm and σ is a variance parameter. The tree kernel is defined asK(T1,T2) = ∑n1∈N(T1)∑n2∈N(T2)∆(n1,n2), (1.2)where∆(n1,n2) =λ n1 and n2 are leavesλkG(ln1 , ln2)nc(n1)∏j=1(1+∆(c jn1 ,cjn2)) n1 and n2 have the sameproduction rule0 otherwise.The parameter λ in the above expression, is called the decay factor [75], and takes a value between 0and 1. Without this parameter, terms in the sum 1.2 corresponding to large subset trees with the sametopology would be similarly large and tend to dominate the kernel score. λ penalizes ∆ more stronglyas the number of recursive calls increases, which downweights the largest matching substructures andallows smaller matches to contribute more to the kernel score. In this work, we refer to the parametersλ and σ as meta-parameters, to avoid confusing them with model parameters we are trying to estimate.When evaluating the tree kernel, it is helpful to reorder the children of each internal node such thatthe larger of the two subtrees is on the right-hand side. If the two subtrees have equal sizes, then thechild with the longer branch length can be put on the right-hand side. This operation is referred toas ladderizing. Since the ordering of children is arbitrary in phylogenies, this operation ensures that amaximal number of matching subset trees are counted by the tree kernel without making meaningfulchanges to the trees.The tree kernel was later shown to be highly effective in differentiating trees simulated under acompartmental model with two risk groups of varying contact rates [21]. In that paper, Poon used thetree kernel as the distance function in approximate Bayesian computation (ABC) (see section 1.5), to fitepidemiological models to observed trees.81.3 Contact networks1.3.1 OverviewEpidemics spread through populations of hosts through contacts between those hosts. The definitionof contact depends on the mode of transmission of the pathogen in question. For an airborne pathogenlike influenza, a contact may be simple physical proximity, while for human immunodeficiency virus(HIV), contact could be via unprotected sexual relations or blood-to-blood contact (such as throughneedle sharing). A contact network is a graphical representation of a host population and the contactsamong its members [8, 76, 77]. The nodes in the network represent hosts, and edges or links representcontacts between them. A contact network is shown in fig. 1.2 (left). Contact networks are a particulartype of social network [78, 79], which is a network in which edges may represent any kind of social oreconomic relationship. Social networks are frequently used in the social sciences to study phenomenawhere relationships between people or entities are important [for a review see 80].Edges in a contact networks may be directed, representing one-way transmission risk, or undirected,representing symmetric transmission risk. For example, a network for an airborne epidemic would useundirected edges, because the same physical proximity is required for a host to infect or to become in-fected. However, an infection which may be spread through blood-to-blood contact through transfusionswould use directed edges, since the recipient has no chance of transmitting to the donor. Directed edgesare also useful when the transmission risk is not equal between the hosts, such as with HIV transmissionamong men who have sex with men (MSM), where the receptive partner carries a higher risk of infectionthan the insertive partner [81]. In this case, a contact could be represented by two directed edges, one ineach direction between the two hosts, with the edges annotated by what kind of risk they imply [80]. Anundirected contact network is equivalent to a directed network where each contact is represented by twosymmetric directed edges. The degree of a node in the network is how many contacts it has. In directednetworks, we may make the distinction between in-degree and out-degree, which count respectively thenumber incoming and outgoing edges. The degree distribution of a network denotes the probability thata node has any given number of links. The set of edges attached to a node are referred to as its incidentedges.Epidemiological models most often assume some form of contact homogeneity. The simplest mod-els, such as the susceptible-infected-recovered (SIR) model [5], assume a completely homogeneouslymixed population, where every pair of contacts is equally likely. More sophisticated models partition thepopulation into groups with different contact rates between and among each group [82]. However, thesemodels still assume that every possible contact between a member of group i and a member of groupj is equally likely. Real human communuties are not necessarily homogeneously mixed; therefore, thisassumption can lead to errors in predicted epidemic trajectories when there is substantial heterogeneitypresent [6, 83, 84]. Contact networks provide a way to relax this assumption by representing individualsand their contacts explicitly. It is important to note that, although panmixia is an unrealistic modellingassumption, it has not proven a substantial hurdle to epidemic modelling in practice [5]. Using thisassumption, researchers have been able to derive estimates of the transmission rate and the basic repro-9ductive number of various outbreaks, which have agreed with values obtained by on-the-ground datacollection [85]. Therefore, if one is interested only in these population-level variables, the additionalcomplexity of contact network models may not be warranted. Rather, these models are most usefulwhen we are interested in properties of the network itself, such as centrality, structural balance, andtransitivity [80].From a public health perspective, knowledge of contact networks has the potential to be extremelyuseful. On a population level, network structure can dramatically affect the speed and pattern of epi-demic spread [e.g. 7, 86]. For example, epidemics are expected to spread more rapidly in networkshaving the “small world” property, where the average path length between two nodes in the network isrelatively low [87]. Some sexually transmitted infections would not be expected to survive in a homoge-neously mixed population, but their long-term persistence can be explained by contact heterogeneity [5,88]. Hence, the contact network can provide an idea of what to expect as an epidemic unfolds. In termsof actionable information, the efficacy of different vaccination strategies may depend on the topologyof the network [8–10, 89]. On a local level, contact networks can be informative about the groups orindividuals who are at highest risk of acquiring or transmitting infection who would therefore benefitmost from public health interventions [28, 29].Contact networks are a challenging type of data to collect, requiring extensive epidemiological in-vestigation in the form of contact tracing [8, 39, 42, 77]. Therefore, it has been necessary to exploreless resource-intensive alternatives which still contain information about population structure. For in-stance, it is possible to obtain limited information about the contact network by individual interviewswithout contact tracing. Variables which can be estimated in this fashion are referred to as node-levelmeasures [80]. One of the most well-studied of these is the degree distribution mentioned above, whichcan theoretically be estimated by simply asking each person how many contacts they had in some inter-val of time. However, the degree distributions often observed in real-world sexual networks are heavy-tailed [22–24], so dense or respondent-driven sampling [90] would be needed to capture the high-degreenodes characterizing the tail of the distribution.An alternative approach has been the analysis of other types of network, which can be directlyestimated with phylogenetic methods from viral sequence data. Some work focuses on the phylogeneticnetwork, in which two nodes are connected if the genetic distance between their viral sequences is belowsome threshold. Primarily, this work has focused on the detection of phylogenetic clusters, which aregroups of individuals whose viral sequences are significantly more similar to each other’s than to thegeneral population’s. The phylogenetic network is informative about “hotspots” of transmission and canbe used to identify demographic groups to whom targeted interventions are likely to have the greatesteffect [91]. However, this network may show little to no agreement with contact data obtained throughepidemiological methods [92–94] and therefore may be a poor proxy for the contact network. Otherstudies [95] have investigated the transmission network, which is the subgraph of the contact networkconsisting of infected nodes and the edges that led to their infections [39] (fig. 1.2, left). It is possibleto estimate the transmission network phylogenetically, although the methods required for doing so aremore sophisticated than for estimating the phylogenetic network [95]. These studies again mostly focus10on clustering and also on degree distributions.Other statistical methods have been developed to infer contact network parameters strictly from thetimeline of an epidemic, using neither genetic data nor reported contacts. Britton and O’Neill [96] de-veloped a Bayesian method to infer the p parameter of an Erdo˝s-Re´nyi (ER) network, along with thetransmission and removal rate parameters of the susceptible-infected (SI) model, using observed infec-tion and optionally removal times. However, it was designed for only a small number of observations,and was unable to estimate p independently from the transmission rate. Groendyke, Welch, and Hunter[97] significantly updated and extended the methodology of Britton and O’Neill and applied it to ameasles outbreak affecting 188 individuals. They were able to obtain a much more informative estimateof p, although this data set included both symptom onset and recovery times for all individuals and wasunusual in that the entire contact network was presumed to be infected. Volz [86] developed differen-tial equations describing the dynamics of the SIR model on a wide variety of random networks definedby their degree distributions. Although the topic of estimation was not addressed in the original paper,Volz’s method could in principle be used to fit such models to observed epidemic trajectories, similar towhat is done with the ordinary SIR model. Volz and Meyers [83] later extended the method to dynamiccontact networks and applied it to a sexual network relating 99 individuals investigated during a syphilisoutbreak.1.3.2 Scale-free networks and preferential attachmentA scale-free network is one whose degree distribution follows a power law, meaning that the number ofnodes in the network with degree k is proportional to k−γ for some constant γ [27]. Scale-free networksare characterized by a large number of nodes of low degree, with relatively few “hub” nodes of very highdegree. Epidemiological surveys have indicated that human sexual networks tend to be scale-free [22–25]. Interestingly, many other types of network, including computer networks [88], biological metabolicnetworks [98], and academic co-author networks [99], also have the scale-free property.Several properties of scale-free networks are relevant in epidemiology. The high-degree hub nodesare known as superspreaders [100], which have been postulated to contribute in varying degree to thespread of diseases such as HIV [37] and severe acute respiratory syndrome (SARS) [101]. Scale-freenetworks have no epidemic threshold [88], meaning that diseases with arbitrarily low transmissibilityhave a chance, however small, of persisting at low levels indefinitely. This is in contrast with homo-geneously mixed populations, in which transmissibility below the epidemic threshold would result inexponential decay in the number of infected individuals and eventual extinction of the pathogen [5].One mechanism which has been shown to lead to scale-free networks is preferential attachment [27,102]. The simplest preferential attachment model is known as the Baraba´si-Albert (BA) model after itsinventors [27]. Under this model, networks are formed by starting with a small number m0 of nodes.New nodes are added one at a time until there are a total of N in the network. Each time a new nodeis added, m ≥ 1 edges are added from it to other nodes in the graph. In the original formulation, thepartners of the new node are chosen with probability linearly proportional to their degree plus one.There has been some contention over the idea that contact networks are scale-free. Handcock and11●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●Figure 1.3: Examples of Baraba´si-Albert networks with preferential attachment power α = 0 (left),1 (centre), and 2 (right). All networks have N = 50 nodes and were constructed with m = 2 edgesper vertex. When α = 0, attachments are formed at random and most nodes have low degree. Whenα = 1, preferential attachment is linear and several higher-degree nodes are observable. When α = 2,preferential attachment is quadratic and nearly every vertex is attached to a small number of hub nodes.Jones [26] fit several stochastic models of partner formation to empirical degree distributions derivedfrom population surveys of sexual behaviour. They found that a negative binomial distribution, ratherthan a power law, was the best fit to five out of six datasets, although the difference in goodness offit was extremely small in four out of these five. Bansal, Grenfell, and Meyers [6] found that an expo-nential distribution, rather than a power law, was the best fit to degree distributions of six social andsexual networks. Dombrowski et al. [103] contend that sexual networks are shaped more by homophily(“like attract like”) than by preferential attachment, but find that injection drug users (IDU) network dodemonstrate a scale-free structure.In the paper describing the BA model, Baraba´si and Albert suggest an extension where the proba-bility of choosing a partner of degree d is proportional to dα +1 for some constant α . When α 6= 1, thedegree distribution no longer follows a power law [104]. For α < 1, the distribution is a stretched expo-nential, meaning that the number of nodes of degree k is proportional to exp(−kβ ) for some constant β .For α > 1, the distribution takes on a characteristic called gelation, where a one or a few high-degree hubnodes are connected to nearly every other node in the graph. We do not believe these departures from thepower law affect the applicability of the model to real world networks. In fact, de Blasio, Svensson, andLiljeros [105] were able to estimate the preferential attachment power from partner count data collectedfrom the same individuals for consecutive time intervals, and found a value less than one in all cases.It is also worth noting that, in addition to the BA model, other investigations of the interaction betweencontact networks and transmission trees have studied the Erdo˝s-Re´nyi and Watts-Strogatz models [106],whose degree distributions do not generally follow a power law under any parameter settings.When m = 1, the network takes on the distinctive shape of a tree, that is, it does not contain anycycles. Cycles are present in the network for all other m values. Examples of BA networks with threedifferent values of the preferential attachment power α are shown in fig. 1.3.121.3.3 Relationship between network structure and transmission treesThe contact network underlying an epidemic constrains the shape of the transmission network, which inturn determines the topology of the transmission tree relating the infected hosts (fig. 1.2). The index casewho introduces the epidemic into the network becomes the root of the tree. Each time a transmissionoccurs, the lineage corresponding to the donor host in the tree splits into two, representing the recipientlineage and the continuation of the donor lineage. Figure 1.2 illustrates this correspondence. It must beemphasized that, although the order and timing of transmissions determines the tree topology uniquely,the converse does not hold. That is, for any given topology, there are in general many transmission net-works which would lead to that topology. In other words, it is impossible to distinguish who transmittedto whom from a transmission tree alone [107].A number of studies have made progress in quantifying the relationship between contact networksand transmission trees. O’Dea and Wilke [108] simulated epidemics over networks with four types ofdegree distribution. They then estimated the Bayesian skyride [109] population size trajectory in twoways: from the phylogeny, using MCMC; and from the incidence and prevalence trajectories, using themethod developed by Volz et al. [52]. The concordance between the two skyrides, as well as the relation-ship between the skyride and prevalence curve, was qualitatively different for each degree distribution.Leventhal et al. [106] investigated the relationship between transmission tree imbalance and severalepidemic parameters under four contact network models and found that these relationships varied con-siderably depending on which model was being considered. The authors also investigated a real-worldHIV phylogeny and found a level of imbalance inconsistent with a randomly mixing population. Welch[110] simulated transmission trees over networks with varying degrees of community structure. Theyfound that transmission trees simulated under networks with low clustering could not generally be dis-tinguished from those simulated under highly clustered networks and concluded that contact networkclusters do not affect transmission tree shape. However, more recently, Villandre et al. [111] investi-gated the correspondence between contact network clusters and transmission tree clusters and did find amoderate correspondence between the two in some cases. Goodreau [112] combined a dynamic contactnetwork model with a model of within-host viral evolution to simulate viral phylogenies over eight typesof contact network. Estimates of prevalence and effective population size were calculated for each sim-ulated phylogeny under three models of epidemic growth. The author found that estimates for networkswith a small high-risk subgroup and networks involving commercial sex workers were substantiallydifferent than estimates for random networks or networks with segregated equal-risk groups.1.4 Sequential Monte Carlo1.4.1 Overview and notationRecall that the primary objective of our work is to develop a statistical inference method for estimatingcontact network parameters from transmission trees. For a network model with parameters θ and aninput transmission tree T , our goal is to estimate statistics, such as means and credible intervals, of the13posterior distribution (1.1). As alluded to in section 1.1, both the likelihood f (T | θ) and the normalizingconstant are likely computationally intractable (this will be discussed further in section 2.1.4). Hence,rather than computing the posterior distribution analytically, we will approximate it using a Monte Carloapproach. The fundamental idea behind Monte Carlo methods is succinctly expressed by Liu, Chen, andLogvinenko [113]:Monte Carlo’s view of the world is that any probability distribution pi , regardless of itscomplexity, can always be represented by a discrete sample from it. By “represented”, wemean that any computation of expectations using pi can be replaced to an acceptable degreeof accuracy by using the empirical distribution resulting from the discrete sample.In other words, if we are able to sample enough points from a distribution of interest, we will be able tomake reasonably accurate statements about the distribution itself. For example, the expected value of thedistribution can be estimated by the sample’s population mean. The reason Monte Carlo methods will beuseful in this work is that algorithms exist for obtaining samples from distributions that are analyticallyintractable and from which direct sampling is not possible (for a review see [114]). Sequential MonteCarlo (SMC) [115–117] is one such algorithm.SMC considers a population of points or “particles”, here denoted{x(k)}and indexed by an integerk. The particles are associated with weights,{w(x(k))}representing their probability density. After thealgorithm has been run, the weighted particles are a Monte Carlo representation for the target distribu-tion. For example, the expected value is approximated by1nn∑k=1x(k)w(x(k)),where n is the number of particles. The weights of the whole population always sum to 1 so that theirweighted histogram defines a true probability density function integrating to 1. Initially, the particles donot represent the target distribution but rather a more tractable distribution from which direct samplingis straightforward. The word “sequential” is used to describe the iterative process of perturbation, re-sampling, and reweighting applied to the particles in such a way that they converge, collectively, to arepresentation of the target. SMC is also known as the particle filter.In this work, the distribution of interest is the posterior distribution ??. The particles are particularvalues of the parameters θ of the contact network model being studied. If we were taking a typicalBayesian Monte Carlo approach to this problem, the particles would end up weighted by their posteriorprobability and distributed in such a way that the weighted population was a reasonable representationof pi(θ | T ). In our case, due to the intractable likelihood, we will need to consider an approximation tothe posterior (see section 1.5). However, for now, nothing is lost by assuming that our target distributionis the posterior itself.In this section, we review an algorithm called the SMC sampler, developed by Del Moral, Doucet,and Jasra [118], which forms the basis of the adaptive ABC-SMC algorithm we apply toward the mainobjective of this work. We begin by describing sequential importance sampling (SIS), which is a precur-sor to SMC that samples from a sequence of distributions defined on spaces of increasing dimension.14We then describe SMC itself, which extends SIS with a resampling step to fight particle degeneracy.Finally, we outline the SMC sampler, which allows SMC to be applied to sequences of distributions alldefined on the same space. This terminology will become clear as the methods are described.To fully describe SMC, we will introduce some notation and terminology. For a sequence x1, . . . ,xd ,we will write xi to mean the partial sequence x1, . . . ,xi. The subscript (k) will be used to indicate the kthparticle in a population. To ease the notational burden we will omit the superscripts and subscripts onthe weight functions w. We define a Markov kernel as the continuous analogue of the transition matrixin a finite-state Markov model. For some spaces X and Y , K : X×Y → [0,1] such that∫YK(x,y)dy = 1 (1.3)for all x∈X . This is an “operational” definition of Markov kernel which will be suitable for our purposes.A more rigorous definition can be found in e.g. [119]. Note that Markov kernels have nothing to do withthe kernel functions defined in section 1.2.4, other than sharing a name (the word “kernel” is ubiquitousin mathematics). Also, the variable pi refers to a generic distribution in the following descriptions ofSMC and related algorithms, and does not refer to the prior or posterior distributions we will eventuallywant to work with.1.4.2 Sequential importance samplingSequential importance sampling (SIS) [120] is a particle-based method whose aim is to sample from adistribution pi on an high-dimensional space, say pi(x) = pi(x1, . . . ,xd). The basis of SIS is importancesampling (IS), which is a method of estimating summary statistics of distributions which are knownonly up to a normalizing constant, and therefore cannot be sampled from directly. That is, if pi is such adistribution and f is any real-valued function, IS is concerned with estimatingpi( f ) =∫f (x)pi(x)dx =∫f (x)γ(x)Zdx,where the integral is over the space on which pi is defined, γ(x) is known pointwise, and Z =∫γ(x)dxis the unknown normalizing constant. Suppose we have at hand another distribution η , called the im-portance distribution, from which we are able to sample. Define the importance weight as the ratiow(x) = γ(x)/η(x). We can write the expectation of interest as∫f (x)pi(x)dx =1Z∫w(x)η(x) f (x)dx. (1.4)Since η can be sampled from exactly, and γ and f can both be evaluated pointwise, the integral∫w(x)η(x) f (x)dx can be approximated by a Monte Carlo estimate. We simply take a sample fromη , multiply each point in the sample by f and γ evaluated at that point, and sum the results. More-over, the normalizing constant Z can be expressed in terms of the importance weight and distribution,Z =∫w(x)η(x)dx. Therefore, we have all the ingredients we need to obtain an estimate of pi( f ) usingeq. (1.4). Although this is a simple and elegant approach, the drawback is that the variance of the esti-15mate is proportional to the variance of the importance weights [117], which may be quite large if η andγ are very different. Therefore, the practical use of IS on its own is limited, since it depends on findingan importance distribution similar to pi , which we usually know very little about a priori.The objective of SIS is to build up an importance distribution η for pi sequentially. By the generalproduct rule, pi(x) can be decomposed aspi(x) = pi(x1)pi(x2 | x1) · · ·pi(xd−1 | xd−2)pi(xd | xd−1).This decomposition is natural in many contexts, particularly for on-line estimation. For example, in astateful model like an hidden Markov model (HMM), xi may represent the state at time i, with pi(x) beingthe posterior distribution over possible paths. The importance distribution η for pi will be constructedusing a similar decomposition,η(x) = η(x1)η(x2 | x1) · · ·η(xd−1 | xd−2)η(xd | xd−1).The importance weights for η can be written recursively asw(xi) =pi(xi)η(xi)definition of importance weight=pi(xi | xi−1)pi(xi−1)η(xi | xi−1)η(xi−1) definition of conditional probability=pi(xi | xi−1)η(xi | xi−1) ·w(xi−1) definition of importance weight.(1.5)Thus, we can choose η(xi | xi−1) such that the variance of the importance weights is as small as possibleat every step, eventually arriving at a full importance distribution. This choice is made on a problem-specific basis, taking any available information about pi(xi | xi−1) into account (see e.g. [117, 121] formany examples). One potential choice for η(xi | xi−1) is simply pi(xi | xi−1), if it is possible to compute.In a Bayesian setting, the prior distribution may be used. The exact form of η(xi | xi−1) which minimizesthe variance of the weights is called the optimal kernel [122], the name deriving from the fact thatk(xi,xi−1) = η(xi | xi−1) is a Markov kernel. In some applications, it is possible to approximate theoptimal kernel or even compute it explicitly.The recursive definition 1.5 suggests an algorithm for obtaining a sample from η and using it toobtain an approximate sample from pi by IS (algorithm 1). We begin with n particles, which have beensampled from the importance distribution η(x1) for pi(x1). Each particle is extended by drawing x2 fromη(x2 | x1), and the importance weights are updated by eq. (1.5). This procedure is repeated d times untilthe particles have dimension equal to that of pi .1.4.3 Sequential Monte CarloThe importance distribution η constructed with SIS is merely an approximation to pi , and may be afairly poor one in practice depending on the application. Try as we might to keep the variances of the16Algorithm 1 Sequential importance sampling.for k = 1 to n doSample x(k)1 from η(x1) . Initialize the kth particlew(k)← pi(x(k)1 )/η(x(k)1 ) . Initialize importance weightend forfor i = 2 to d dofor k = 1 to n doSample x(k)i from η(xi | x(k)i−1) . Extend the kth particlew(k)← [pi(x(k)i | x(k)i−1)/η(x(k)i | x(k)i−1)] ·w(k) . Update importance weightend forNormalize the weights so that ∑w = 1end forweights low, the cumulative errors at each sequential step tend to push many of the weights to verylow values [115]. This results in a poor approximation to pi , since only a few particles retain highimportance weights after all d sequential steps, a problem known as particle degeneracy. To mitigatethis problem, Gordon, Salmond, and Smith [120] introduced technique they called the bootstrap filter,which involves a resampling of the population of particles after each sequential step in accordance withtheir importance weights. A similar idea, termed particle rejuvenation, was proposed by Liu and Chen[123]. These approaches cause particles with high importance weights to be replicated in the population,while particles with low weights may be removed. After each resampling step, the importance weightsfor all particles are set equal.The resampling step was formally integrated with SIS by Doucet, Godsill, and Andrieu [115] toform the first SMC algorithm (algorithm 2). Rather than resample at every step as the bootstrap filterproposed, the authors use a criterion based on the expected sample size (ESS) the particle population todetermine when resampling is necessary. The ESS of the population of particles is defined asESS(w) =n1+Var(w),where n is the number of particles in the population. Resampling is triggered when the ESS drops belowthe threshold (conventionally n/2 [117]). Various resampling strategies beyond basic sampling withreplacement have been proposed [124], but we will not discuss those here.1.4.4 The sequential Monte Carlo samplerThe SIS and SMC algorithms described above aim to sample from a high-dimensional distribution pi(x),by sequentially sampling from d distributions of lower but increasing dimension. Del Moral, Doucet,and Jasra [118] developed the SMC sampler with an alternative objective: to sample sequentially fromd distributions pi1, . . . ,pid , all of the same dimension and defined on the same space. The pii are assumedto form a related sequence, such as posterior distributions attained by sequentially considering newevidence. As with SIS, we assume that pii(x) = γi(x)/Zi, where each γi is known pointwise and thenormalizing constants Zi are unknown.17Algorithm 2 Sequential Monte Carlo [115].for k = 1 to n doSample x(k)1 from η(x1) . Initialize the kth particlew(k)← pi(x(k)1 )/η(x(k)1 ) . Initialize importance weightend forfor i = 2 to d dofor k = 1 to n doSample x(k)i from η(xi | x(k)i−1) . Extend the kth particlew(k)← [pi(x(k)i | x(k)i−1)/η(x(k)i | x(k)i−1)] ·w(k) . Update importance weightend forif ESS(w)< T then . T is a user-defined thresholdResample the particles according to wfor k = 1 to n dow(k)← 1/nend forend ifend forBoth algorithms involve progression through a sequence of related distributions. For SIS and SMC,these distributions are lower-dimensional marginals of the target distribution, while for the SMC sam-pler, they are of the same dimension and constitute a smooth progression from an initial to a finaldistribution. In both cases, the neighbouring distributions in the sequence are related to each other insome way, and we can take advantage of that relationship to create a sequence of importance distri-butions alongside the sequence of targets. In SIS, the neighbouring marginals pi(xi) and pi(xi+1) wererelated by the conditional density pi(xi | xi−1), which we used to inform the importance distribution. InSMC, the relationship between subsequent distributions is less explicit, but it is assumed that they arerelated closely enough that an importance distribution for pii can be easily transformed into one for pii+1.In particular, the sequence of importance distributions ηi is constructed asηi(x′) =∫ηi−1(x)Ki(x,x′)dx, (1.6)where Ki is a Markov kernel and the integral is over the space on which the pii are defined. The choiceof Ki should be based on the perceived relationship between pii−1 and pii. Del Moral, Doucet, and Jasra[118] propose the use of a MCMC kernel with equilibrium distribution pii. That is,Ki(x,x′) = max(1,q(x′,x)pii(x)q(x,x′)pii(x′)),where q(x,x′) is a proposal function such as a Gaussian distribution centred at x from which x′ is drawn(see appendix A).Although this method of building up η appears straightforward, the drawback is that the importancedistribution itself becomes intractable. In particular, evaluating ηi(x) involves an i-dimensional integralof the type in eq. (1.6). As it is necessary to evaluate η(x) pointwise to perform IS, this construction18appears to have defeated the purpose of providing an importance distribution for each pii. Del Moral,Doucet, and Jasra [118] overcome this problem with two “artificial” objects. First, they propose theexistence of backward Markov kernels Li−1(xi,xi−1). For now, these kernels are arbitrary; they will laterbe precisely defined on a problem-specific basis. Second, the authors define an alternative sequence oftarget distributionsp˜ii(xi) = pii(xi)i−1∏k=1Lk(xk+1,xk)of increasing dimension. That is, p˜ii has dimension equal to the original dimension of pii raised to thepower of i. This brings us back to the SIS setting described above (section 1.4.2), namely of building upan importance distribution sequentially through lower-dimensional distributions. The dimension of theimportance distributions η is similarly augmented with the forward kernels,ηi(xi) = η1(x1)i−1∏k=1Kk(xk,xk+1) (1.7)Thanks to the backwards kernels, we can write p˜ii in terms of p˜ii−1 as follows.p˜ii(xi) = pii(xi)i−1∏k=1Lk(xk+1,xk) definition of ˜i= pii(xi)Li−1(xi−1,xi)i−2∏k=1Lk(xk+1,xk) pull out k = i−1 term from product=pii(xi)Li−1(xi−1,xi)pii−1(xi−1)·pii−1(xi−1)i−2∏k=1Lk(xk+1,xk) multiply and divide by pii−1(xi−1)=pii(xi)Li−1(xi−1,xi)pii−1(xi−1)· p˜ii−1(xi−1) definition of p˜ii−1.The importance distributions can also be expressed recursively using the forward kernels, which followsdirectly from eq. (1.7),ηi(xi) = ηi−1(xi−1)Ki(xi−1,xi).Therefore, the importance weights for these new targets are defined recursively asw(xi) =p˜ii(xi)ηi(xi)definition of importance weight=p˜ii(xi)ηi−1(xi−1)Ki(xi−1,xi)recursive definition of ηi=p˜ii−1(xi−1)pii(xi)Li−1(xi,xi−1)ηi−1(xi−1)pii−1(xi−1)Ki(xi−1,xi)recursive definition of pii= w(xi−1) · pii(xi)Li−1(xi,xi−1)pii−1(xi−1)Ki(xi−1,xi) definition of importance weight∝ w(xi−1) · γi(xi)Li−1(xi,xi−1)γi−1(xi−1)Ki(xi−1,xi) remove normalizing constant Zi/Zi−1 (1.8)19The final key piece of information is to notice that, because the Li are Markov kernels, pii is simplythe marginal in xi−1 of p˜i . Therefore, a sample from p˜ii automatically gets us a sample from pii, byconsidering only the ith component of xi. In fact, since the weight update eq. (1.8) depends only onthe ith and i− 1st components of each particle, we do not even need to keep track of the completeparticles if we are only interested in the final distribution. The sequences of kernels L and K shouldbe chosen based on the problem at hand to minimize the variance in the importance weights as wellas possible. For a fixed choice of K, the backward kernels which minimize this variance are called theoptimal backward kernels. The full SMC sampler algorithm is presented as algorithm 3. A resamplingstep is applied whenever the ESS of the population drops too low, as discussed in the previous section.Algorithm 3 Sequential Monte Carlo sampler of Del Moral, Doucet, and Jasra [118].for k = 1 to n doSample x(k)1 from η1(x1) . Initialize the kth particlew(k)← γ1(x(k)1 )/η1(x(k)1 ) . Initialize the importance weightsNormalize the weights so that ∑w = 1end forfor i = 2 to d dofor k = 1 to n doSample x(k)i from K(x(k)i−1,xi) . Extend the kth particlew(k)← w(k) · γi(xi)Li−1(xi,xi−1)γi−1(xi−1)Ki(xi−1,xi). Update the importance weightsend forNormalize the weights so that ∑w = 1if ESS(w)< T then . T is a user-defined thresholdResample the particles according to wfor k = 1 to n dow(k)← 1/nend forend ifend for1.5 Approximate Bayesian computation1.5.1 Overview and motivationSequential Monte Carlo, and the SMC sampler, were developed for sampling from distributions whichcan be evaluated up to a normalizing constant. We claim, and shall argue more thoroughly below (sec-tion 2.1.4) that the posterior distributionpi(θ | T ) ∝ f (T | θ)pi(θ)for a contact network model with parameters θ and an input transmission tree T does not fall in thiscategory. Therefore, SMC, and other Bayesian and maximum likelihood (ML) techniques for fitting20mathematical models (see appendix A), cannot be directly applied to our problem. In particular, MCMCand the SMC sampler are designed for distributions pi which can be evaluated up to a normalizingconstant Z, that is, pi(x) = γ(x)/Z. Both algorithms calculate a ratio of the form pi(x)/pi(x′)∝ γ(x)/γ(x′)for a current value x and proposed updated value x′ - for MCMC, this is part of the Metropolis-Hastingsratio, while for the SMC sampler, a similar ratio is required to calculate the importance weights. Inthe context of Bayesian inference, this ratio contains a likelihood ratio, which must be calculated bycomputing the individual likelihoods and dividing them. If the likelihood is intractable, this is clearlynot a viable approach.Approximate Bayesian computation (ABC) [13–15] was developed to estimate posterior distribu-tions with intractable likelihoods, which have arisen frequently in the domain of population genetics [16,125]. ABC navigates around the intractable likelihood by replacing the posterior as the target of infer-ence by an approximate posterior. This distribution is constructed in such a way that the ratios requiredfor MCMC and the SMC sampler can be computed, conveniently allowing us to apply those algorithmswith minimal changes. In the next section, we shall demonstrate how this is done, but first we give thedefinition of the approximate posterior.By targeting the posterior distribution, Bayesian inference makes the assertion that model parame-ters with higher posterior density are “better” in the sense that they offer a more credible explanationfor the observed data. The approximate posterior targeted by ABC uses an alternative metric for pa-rameter credibility: the similarity of simulated datasets to the observed data. If datasets simulated underthe model closely resemble the real data, it follows that the model is a reasonable approximation to thereal-world process generating the observed data. More formally, let y be the observed data to which weare trying to fit a model with parameters θ . In the case of this work, the data is a transmission tree T ,but we shall stick with the generic variable y for now. Suppose we have a distance measure ρ definedon the space of all possible data our model could generate. ABC aims to sample from the joint posteriordistribution of model parameters and simulated datasets z which are within some small distance ε of theobserved data y,piε(θ ,z | y) =pi(θ) f (z | θ)IAε,y(z)∫Aε,y×Θpi(θ) f (z | θ)dθ. (1.9)Here, Aε,y is an ε-ball around y with respect to ρ ,Θ is the space of all possible model parameters, and I isthe indicator function [126] (that is, IAε,y(x) = 1 if x ∈ Aε,y or 0 otherwise). The distribution piε(θ ,z | y)will be referred to as the ABC target distribution. The term f (z | θ) appears to be the bothersomelikelihood again, but this will turn out not to be a problem because we are simulating z ourselves.To return to the context of this thesis, the observed data y is an estimated transmission tree for a viralepidemic under investigation. The model in question is a contact network model with parameters θ . Thesimulated dataset z is a transmission tree, obtained by first generating a contact network under the model,and then simulating the spread of an epidemic over that network. A transmission tree can be constructedby keeping track of who infected whom during the simulated epidemic (further details will be given insection 2.1.1). The distance function ρ must compare two transmission trees - the observed tree, which21takes the place of y, and the simulated tree z. We will define this distance function using the tree kerneldiscussed in section 1.2.4. To write eq. (1.9) in words, the approximate posterior we consider here is adistribution which assigns a joint probability density to model parameters and simulated transmissiontrees under those parameters. The probability density is proportional to the product of the prior on theparameters, and the likelihood of the simulated transmission tree under those parameters, but only if thesimulated transmission tree is sufficiently close to the observed tree. Otherwise, the probability densityis zero.In fact, it is not the ABC target distribution itself, but rather its marginal in z, which approximatesthe posterior distribution. In other words, we claim that∫piε(θ ,z | y)dz≈ pi(θ | y).The intuition for why this approximation might be reasonable comes from considering the case whenε = 0 and ρ has the property that ρ(z,y) if and only if x = y. In that case, the ε-ball around y shouldcontain only y itself, hence the integral on the left is exactly equal to the posterior. Thus, by taking εsmall, we should attain something close to the posterior if ρ captures the similarity between datasetsreasonably well. However, the accuracy of the ABC approximation depends heavily on the choice ofdistance function [127, 128].Distance functions and summary statistics in ABCIn many applications (eg. [15, 129]), ρ is defined as ρ(S(·),S(·)) where S is a function which mapsdata points into a vector of summary statistics. In the context of ABC, a summary statistic S is calledsufficient ifpi(θ | y) = pi(θ | S(y)).That is, sufficiency implies that the data can be replaced with the summary statistic without losing anyinformation about the posterior distribution [130]. For most problems, it is not possible to find sufficientsummary statistics [130]. A number of sophisticated methods have been developed for selecting andweighting summary statistics based on various optimality criteria [127, 128, and references therein]. Wedo not apply these methods in this work, instead focusing on a distance function which is not based onsummary statistics.Although summary statistics are often presented as a fundamental part of ABC, they are not strictlynecessary. For instance, if the data are numeric and of low dimension, the distance function may simplybe the Euclidean distance [131]. Park et al. [18] proposed the use of a kernel function (as defined insection 1.2.4) in place of a distance function. The authors referred to their approach as “double-kernelABC” due to the use of a second (unrelated) kernel function to compute the weights of the particles. Thework by Poon [21], upon which ours is based, employed a similar approach, replacing the likelihoodratio in Bayesian MCMC with a ratio of kernel scores.221.5.2 Algorithms for ABCAlgorithms for performing ABC can be grouped into three categories: rejection, MCMC, and SMC [126].To simplify the notation, we shall restrict the descriptions of these algorithms to the case of one sim-ulated dataset per parameter particle (the meaning of this will become clear shortly). The extension tomultiple datasets per particle is straightforward and will be given at the end of the section. We use thevariable x to refer to the pair (θ ,z), so that the ABC target distribution can be written piε(x | y). Thismakes our notation consistent with section 1.4.Rejection ABC is the simplest method, and also the one that was first proposed [13, 14]. The al-gorithm, outlined in algorithm 4, repeats the following steps until a desired number of samples fromthe target distribution are obtained. Parameter values θ are sampled according to the prior distributionpi(θ). Then, a simulated dataset z is generated from the model with the sampled parameter values. Bydefinition, the probability density of obtaining the particular dataset z is f (z | θ). Finally, the parametersare sampled if the distance of z from the observed data y is less than ε , that is, with probability IAε,y(z).Putting this all together, the parameters θ are sampled with probability proportional topi(θ) f (z | θ)IAε,y(z),which is exactly the numerator of the ABC target distribution. Thus, θ represents an unbiased samplefrom the approximate posterior.Algorithm 4 Rejection ABC.loopDraw θ according to pi(θ)Simulate a dataset z from the model with parameters θif ρ(y,z)< ε thenSample θend ifend loopRejection ABC is easy to understand and implement, but it is not generally computationally feasible.If the posterior is very different from the prior, a very large number of samples may need to be takenin order to find a simulated dataset that is close to z. The inefficiency is compounded by the curseof dimensionality - the measure of the ε-ball around y decreases exponentially with the number ofdimensions. ABC-MCMC (algorithm 5) was designed to overcome these hurdles [132]. The approach issimilar to ordinary Bayesian MCMC (appendix A), except that a distance cutoff replaces the likelihoodratio. That is, the transition probability between states x and x′ is defined asmin(1,pi(θ ′)q(θ ′,θ)pi(θ)q(θ ,θ ′)· IAε,y(z′)).Some of the same computational inefficiencies arise with ABC-MCMC as with rejection. For ex-ample, in regions of low posterior density, the probability to simulate a dataset proximal to the observed23Algorithm 5 ABC-MCMC.Draw θ according to pi(θ)loopPropose θ ′ according to q(θ ,θ ′)Simulate a dataset z′ according to the model with parameters θAccept θ ← θ ′ with probability min(1,pi(θ ′)q(θ ′,θ)pi(θ )q(θ ,θ ′)· IAε,y(z′))end loopdata is low. Various strategies have been developed to mitigate this, including reducing the tolerancelevel ε as the chain progresses [133].The most recently developed class of algorithm for ABC is ABC-SMC [131, 134]. As with ABC-MCMC, the algorithm is a straightforward modification of an existing Bayesian inference method, in thiscase the SMC sampler (section 1.4.4). The sequence of target distributions is defined as pii(x) = piεi(x | y)for a decreasing sequence of tolerances εi. The intention is for the algorithm to progress smoothlythrough a sequence of target distributions which ends at the ABC approximation to the posterior. Theinitial value ε1 is set to ∞, which makes the first distribution in the sequencepi1(θ ,z) =pi(θ) f (z | θ)∫R×Θpi(θ) f (z | θ)dθdz.This initial distribution does not depend on the observed data y. In the terminology of the SMC sampler(algorithm 2), the numerator is the first of the γ’s, that is, γ1 = pi(θ) f (z | θ). Sampling in proportion toγ1 is straightforward and was already demonstrated for rejection ABC above. Because the sampling isexact, the initial importance weights are all set equal to 1 and normalized to 1/n where n is the numberof particles.As discussed in section 1.4.4, the choices of the kernels K and L are problem-specific, and so ap-propriate kernels must be chosen for ABC. Several options have been proposed [20, 131, 134]. With anappropriately chosen kernel, the weight update (eq. (1.8)) will simplify into a computable expression.Sisson, Fan, and Tanaka [131] and Beaumont et al. [134] both suggest random walk kernels for K, whereeach particle is perturbed according to a Gaussian distribution. The backwards kernels L are chosen toapproximately minimize the variance in importance weights. In this thesis, we use an MCMC kernel, asproposed by Del Moral, Doucet, and Jasra [20]. The associated backwards kernels and weight updatesare discussed in the next chapter (section 2.1.3). With generic kernels K and L, the ABC-SMC algorithmis almost identical to the SMC sampler (algorithm 3), with γi replaced with pii. Therefore we will notrepeat the full algorithm here.All the algorithms discussed in this section can be straightforwardly extended to sample from thejoint distributionpiε(θ ,z1, . . . ,zM | y),which is equivalent to associating M simulated datasets to each parameter particle instead of just one.The simulated dataset z is replaced by z = z1, . . . ,zM, and the indicator function for the ε-ball around y24is replaced byM∑k=1IAε,y(zi).For ABC-MCMC and ABC-SMC, the proposal distribution q(θ ,θ ′) f (z | θ ′) is replaced byqi(θ ,θ ′)M∏k=1f (zi | θ ′).1.6 SummaryIn section 1.1, we outlined three research objectives that will be addressed in this thesis. First, we aimto develop a method for fitting contact network models to estimated transmission trees. Although trans-mission trees can be estimated for any epidemic by thorough contact tracing, viral diseases are the mostuseful context for this method due to the possibility of inferring the trees from sequence data. Transmis-sion and sequence evolution occur on similar time scales for RNA viruses, resulting in viral phylogenieswhose shapes are heavily constrained by the transmission process. The study of this interaction betweenevolution and epidemiology is called phylodynamics; phylodynamic methods make it possible to es-timate transmission trees from viral sequence data. These estimated trees form the input data for ourmethod.The desired output of our method is a posterior distribution of the parameters of a contact networkmodel. Rather than assuming a homogeneously mixed population, as most epidemiological models do,network models take the more realistic view that human populations are structured. For example, con-tacts may tend to occur more often between “like” individuals (with respect to geography, socioeco-nomic status, or other factors) than at random. A network model provides a formal representation of thisstructure. In particular, our second research objective is to characterize our ability to fit the Baraba´si-Albert (BA) model, which incorporates preferential attachment to generate networks with realistic de-gree distributions.To fit these models, our method will apply approximate Bayesian computation (ABC), a simulation-based approach. Simulating a transmission tree according to a network model is straightforward: anetwork can be generated according to the model, and the spread of an epidemic can be simulated overthe network and recorded in a transmission tree. ABC uses the concordance between these simulatedtransmission trees and the “true” estimated tree as an indicator of parameter credibility. The closerthe simulated transmission trees appear to the true tree, the more weight is assigned to the associatednetwork parameters.ABC can be implemented by at least three classes of algorithm, but the one we choose to applyin this work is sequential Monte Carlo (SMC). SMC uses a population of parameter “particles” toapproximate a distribution of interest, in this case the approximate posterior distribution targeted byABC. After running the ABC-SMC algorithm, statistics on the model parameters can be approximatedby the weighted population of particles. For example, a weighted average would give an approximateexpected value for each parameter.25Thus, our method integrates four research topics: phylogenetics, contact networks, sequential MonteCarlo, and approximate Bayesian computation. The first two topics together form the problem domain.Phylogenetic data are the input to our method, while estimates of the parameters of contact networkmodels are the desired output. The latter two topics define the algorithm and statistical framework thatour inference method will use.26Chapter 2Reconstructing contact networkparameters from viral phylogeniesIn this chapter, we will address the three research aims of this thesis introduced in section 1.1. First, insection 2.1, we describe netabc, a computer program that implements an approximate Bayesian compu-tation-based algorithm to fit contact network models to phylogenetic data. We also provide a justificationfor the use of ABC for this problem by arguing that the likelihood functions required to fit these modelsby more conventional means are likely to be computationally intractable. Second, in section 2.2, weperform a simulation study to investigate the Baraba´si-Albert network model, which uses a preferentialattachment mechanism to generate networks with the power law degree distributions observed in realworld social and sexual networks. We progress through two exploratory analyses testing the identifi-ability of the model’s parameters, and conclude by testing netabc’s ability to recover the parametersfrom simulated transmission trees. Third, in section 2.3, we apply netabc to fit the BA model to six realworld HIV datasets, with the understanding of the model parameters’ identifiability gained through thesimulation experiments. We conclude the chapter with a unified discussion of the three research aims,including interpretation of the results of both the simulated and real data experiments, as well as anexamination of the limitations of our approach and opportunities for future investigation.2.1 Netabc: a computer program for estimation of contact network pa-rameters with ABCNetabc is a computer program to perform statistical inference of contact network parameters from anestimated transmission tree using ABC. As discussed in section 1.1, the principal statistical algorithmused by netabc is adaptive ABC-SMC [20]. In addition, there are two supplementary components whichare specific to the domain of phylogenetics and contact networks: Gillespie simulation [135], to simulatetransmission trees on contact networks; and the tree kernel [17], which is used as the distance function inABC to compare transmission trees [21] (see section 1.5). We give a high-level overview of the programhere, before describing these components in detail. Netabc takes as input an estimated transmission tree,27θf 1(θ|D)●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●θf 2(θ|D)●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●θf 3(θ|D)●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●Figure 2.1: Graphical schematic of the ABC-SMC algorithm implemented in netabc. Particles areinitially drawn from their prior distributions, making the initial population a Monte Carlo approximationto the prior. At each iteration, particles are perturbed, and a distance threshold around the input treecontracts. Particles are rejected, and eventually resampled, when all their associated simulated trees lieoutside the threshold. As the algorithm progresses, the population smoothly approaches a Monte Carloapproximation of the ABC target distribution, which is assumed to resemble the posterior.which can be derived from a viral phylogeny by rooting and time-scaling as described in section 1.2.3or estimated by other methods [31, 53, 64–67]. We variously refer to this estimated transmission tree asthe observed tree, estimated tree, or input tree.As described in section 1.4, netabc keeps track of a population of particles x(k) indexed by an integerk, each of which contains particular parameter values θ (k) for the contact network model we are tryingto fit to the input tree. A small number M of contact networks z(k,i), 1≤ i≤M, are generated under themodel for each particle in accordance with that particle’s parameters. An epidemic is simulated over eachof these networks using Gillespie simulation, and by keeping track of its progress, a transmission tree isobtained. Thus, each particle becomes associated with several simulated transmission trees. These treesare compared to the input tree using the tree kernel. Particles are weighted according to the similarityof their associated simulated trees with the input tree, with more similar trees receiving higher weights.The particles are iteratively perturbed to explore the parameter space, and particles with simulated treestoo distant from the input tree are periodically dropped and resampled. Once a convergence criterion isattained, the final set of particles is used as a Monte Carlo approximation to the target distribution ofABC, which is assumed to resemble the posterior distribution on model parameters (see section 1.5). Agraphical schematic of this algorithm is shown in fig. 2.1.Netabc is written in the C programming language. The igraph library [136] is used to generate and28store contact networks and phylogenies. Judy arrays [137] are used for hash tables and dynamic pro-gramming matrices. The GNU scientific library (GSL) [138] is used to generate random draws fromprobability distributions, and to solve for the next ε by bisection in the adaptive ABC-SMC algorithm.Parallelization is implemented with Portable Operating System Interface (POSIX) threads [139]. Inaddition to the netabc binary to perform ABC, we provide three additional stand-alone utilities: treek-ernel, to calculate the tree kernel; nettree, to simulate a transmission tree over a contact network; andtreestat, to compute various summary statistics of phylogenies. The programs are freely available athttps://github.com/rmcclosk/netabc.To check that our implementation of Gillespie simulation was correct, we reproduced Figure 1A ofLeventhal et al. [106] (our fig. B.1), which plots the imbalance of transmission trees simulated over fournetwork models at various levels of pathogen transmissibility. Our implementation of adaptive ABC-SMC was tested by applying it to the same mixture of Gaussians used by Del Moral, Doucet, and Jasra[20] to demonstrate their method (originally used by Sisson, Fan, and Tanaka [131]). We were able toobtain a close approximation to the function (see fig. B.2), and attained the stopping condition used bythe authors in a comparable number of steps. To check that the algorithm would converge to a bimodaldistribution, we also applied it to a mixture of two Gaussians with means ±4 and variances 1. Thealgorithm was able to recover both peaks (fig. B.3).2.1.1 Simulation of transmission trees over contact networksThe simulation of epidemics, and the corresponding transmission trees, over contact networks is per-formed in netabc using the Gillespie simulation algorithm [135]. This method has been independentlyimplemented and applied by several authors [e.g. 94, 97, 106, 108, 111]. Groendyke, Welch, and Hunter[97] published their implementation as an R package, but since the SMC algorithm is quite computa-tionally intensive, we chose to implement our own version in C as part of netabc.Let G= (V,E) be a directed contact network. We assume the individual nodes and edges of G followthe dynamics of the SIR model [2]. Each directed edge e = (u,v) in the network is associated with atransmission rate βe, which indicates that, once u becomes infected, the waiting time until u infectsv is distributed as Exponential(β e). Note that v may become infected before this time has elapsed, ifv has other incoming edges. v also has a removal rate νv, so that the waiting time until removal of vfrom the population is Exponential(νv). Removal may correspond to death or recovery with immunity,or a combination of both, but in our implementation recovered nodes never re-enter the susceptiblepopulation. We define a discordant edge as an edge (u,v) where u is infected and v has never beeninfected. In the epidemiology literature, the symbol γ is usually used in place of ν ; we use ν here todistinguish the recovery rate from the power law exponent of scale free networks (see section 1.3.2).To describe the algorithm, we introduce some notation and variables. Let inc(v) be the set of in-coming edges to v, and out(v) be the set of outgoing edges from v. Let I be the set of infected nodesin the network, R be the set of removed nodes, S be the set of susceptible nodes, and D be the set ofdiscordant edges in the network. Let β be the total transmission rate over all discordant edges, and ν be29the total removal rate of all infected nodes,β = ∑e∈Dβ e, ν = ∑v∈Iνv.The variables S, I,R,D, β , and ν are all updated as the simulation progresses. When a node v becomesinfected, it is deleted from S and added to I. Any formerly discordant edges in inc(v) are deleted fromD, and edges in out(v) to nodes in S are added toD. If v is later removed, it is deleted from I and addedtoR, and any discordant edges in out(v) are deleted from D. At the time of either infection or removal,the variables β and ν are updated to reflect the changes in the network.The Gillespie simulation algorithm is given as section 2.1.1. The transmission tree T is simulatedalong with the epidemic. We keep a map called “tip”, which maps infected nodes in I to the tips ofT . The simulation continues until either there are no discordant edges left in the network, or we reacha user-defined cutoff of time (tmax) or number of infections (I). We use the notation Uniform(0,1) toindicate a number drawn from a uniform distribution on (0,1), and Exponential(λ ) to indicate a numberdrawn from an exponential distribution with rate λ . The combined number of internal nodes and tips inT is denoted |T |. The updates to S, I,R,D, β , and ν described in the previous paragraph are not writtenexplicitly in section 2.1.1, as they are quite straightforward and would only obfuscate the pseudocode.Algorithm 6 Simulation of an epidemic and transmission tree over a contact networkinfect a node v at random, updating S, I, D, β , and νT ← a single node with label 1tip[v]← 1t← 0while D 6=∅ and |I|+ |R|< I and t < tmax dos←min(tmax− t, Exponential(β +ν))for v ∈ tip doextend the branch length of tip[v] by send fort← t+ sif t < tmax thenif Uniform(0,β +ν)< β thenchoose an edge e = (u,v) from D with probability β e/β and infect vtip[v]← |T |+1 . add new tips to tree and tip arraytip[u]← |T |+2 . corresponding to u and vadd tips with labels (|T |+1) and (|T |+2) to Tconnect the new nodes to tip[v] in T , with branch lengths 0elsechoose a node v from I with probability νv/ν and remove vdelete v from tipend ifupdate S, I,R, D, β , and νend ifend while302.1.2 Phylogenetic kernelThe tree kernel developed by Poon et al. [17] provides a comprehensive similarity score between twophylogenetic trees, via the dot-product of the two trees’ feature vectors in the space of all possiblesubset trees with branch lengths (see section 1.2.4). Because the branch lengths are continuous, thereare infinitely many possible subset trees; hence, the feature space is infinite-dimensional. The kernelwas implemented using the fast algorithm developed by Moschitti [75]. First, the production rule ofeach node, which is the total number of children and the number of leaf children, is recorded. The nodesof both trees are ordered by production rule, and a list of pairs of nodes sharing the same production ruleis created. These are the nodes for which the value of the tree kernel must be computed - all other pairshave a value of zero. The pairs to be compared are then re-ordered so that the child nodes are alwaysevaluated before their parents. Due to its recursive definition, ordering the pairs in this way allowsthe tree kernel to be computed by dynamic programming. The complexity of this implementation isO(|T 1||T 2|) for the two trees T 1 and T 2 being compared.The tree kernel cannot be used directly as a distance measure for ABC, since it is maximized, notminimized, when the two trees being compared are the same. Therefore, we defined the distance betweentwo trees asρ(T 1,T 2) = 1− K(T 1,T 2)√K(T 1,T 1)K(T 2,T 2),which is a number between 0 and 1 that is minimized when T 1 = T 2. This is similar to the normalizationused by Poon et al. [17] and Collins and Duffy [74].2.1.3 Adaptive sequential Monte Carlo for Approximate Bayesian computationWe implemented the adaptive SMC algorithm for ABC developed by Del Moral, Doucet, and Jasra [20].This algorithm is similar to the reference ABC-SMC algorithm described in section 1.5.2, except that thesequence of tolerances ε i is automatically determined rather than specified in advance. The tolerancesare chosen such that the ESS of the particle population, which indicates the quality of the Monte Carloapproximation (see section 1.4.2), decays at a controlled rate. A sudden precipitous drop in ESS wouldindicate that only a small number of particles had non-zero importance weights, which would result ina very poor Monte Carlo approximation to the target distribution. This situation is referred to as thecollapse of the approximation or particle degeneracy (see section 1.4.3) and is mitigated by the adaptiveapproach. A single parameter controls the decay rate. In the original paper of Del Moral, Doucet, andJasra [20], the parameter is called α , but to avoid confusion with the BA parameter of the same namewe will refer to it here as αESS. The tolerance ε i is chosen to satisfyESS(wi) = αESS ESS(wi−1),where, wi is the vector of weights at the ith step. Note that, since wi depends on ε i, this equation solvesfor the updated weights and the updated tolerance simultaneously. As pointed out by Del Moral, Doucet,and Jasra [20], the equation has no analytic solution, but can be solved numerically by bisection. The31forward kernels Ki are taken to be MCMC kernels with stationary distributions piε i and proposal distri-butionsqi(θ (k),θ (k)′) M∏j=1f(z( j,k)′∣∣∣θ (k)′) ,where θ (k) is the vector of model parameters associated with particle x(k) and z( j,k)′, 1 ≤ j ≤ M, areM datasets simulated according to θ (k)′ [20]. In our implementation, the qi are constructed component-wise for θ out of Gaussian proposals for continuous parameters [20] and Poisson proposals for discreteparameters. For the Poisson proposals, the number of discrete steps to move the particle is drawn froma Poisson distribution, and the direction in which to move the particle is chosen uniformly at random.The variance of each proposal distribution is set equal to twice the empirical variance of the particles,following [20, 134]. The backwards kernels areLi−1(x′,x) =pii(x)Ki(x,x′)pii(x′).Here we have written x′ for xi and x for xi−1 to emphasize that xi−1 is the current value of the particleand xi is the proposed value. When substituted into eq. (1.8), the forward kernels Ki(x,x′) and densitiespii(x′) = piεi(x′) cancel out, and we are left with the following weight update.wi(x) ∝ wi−1(x)pii(x′)Li−1(x′,x)pii−1(x)Ki(x,x′)importance weight update from SMC= wi−1(x)pii(x′)pii(x)Ki(x,x′)pii−1(x)Ki(x,x′)pii(x′)substitute Li−1= wi−1(x)pii(x)pii−1(x)cancel Ki(x,x′) and pii(x′)= wi−1(x)pi(θ)∏Mj=1 f (z( j)′ | θ)∑Mj=i IAεi ,y(z( j))pi(θ)∏Mj=1 f (z( j)′ | θ)∑Mj=i IAεi−1 ,y(z( j))definition of pii(x)= wi−1(x)∑Mj=i IAεi ,y(z( j))∑Mj=i IAεi−1 ,y(z( j))cancel prior and likelihood.In other words, when the distance threshold ε i−1 is contracted to ε i, the particles’ weights are multipliedby the proportion of simulated datasets that are still inside the new threshold. The user may specifya final tolerance ε , or a final acceptance rate of the MCMC kernel, and the algorithm will be stoppedwhen either of these termination conditions is reached. The latter condition stops the algorithm whenthe particles are not moving around very much, implying little change in the estimated target.2.1.4 Justification for approachWe present here a non-rigorous justification for the use of ABC for the problem at hand, as opposed tomore frequently-used approaches for fitting mathematical models (see appendix A). Consider a contactnetwork model with parameters θ , and an estimated transmission tree T . Taking a Bayesian approach,32aaabcbcabcabcFigure 2.2: Illustration of an estimated transmission tree without labels (left) and two possible under-lying complete transmission trees with labels (right). In the top right scenario, the epidemic began withnode a who transmitted first to b and then to c. In the bottom right scenario, b was the index case; binfected c, who went on to infect a. A transmission tree estimated from a viral phylogeny would havethe same topology and tip labels in both cases.our aim is to obtain a sample from the posterior distribution on the model’s parameters given our data,pi(θ | T ) = f (T | θ)pi(θ)∫Θ f (T | θ)pi(θ)dθ.For all but the simplest models, the normalizing constant in the denominator is an intractable integral.What we shall argue here is that, in contrast to most commonly studied mathematical models, the like-lihood f (T | θ) is also likely to be intractable in our case.As discussed in section 1.2.2, the internal nodes of transmission trees represent transmission events,and are labelled with the donor in the associated transmission pair. However, when we estimate a trans-mission tree from viral sequence data, we generally only know the labels of the tips of the tree, notthe labels of the internal nodes. In viral phylogenies, the transmissions are at least partially preservedthrough the evolutionary relationships among the viruses, but the directionality of those transmissionsis unknown. Thus, a single estimated transmission tree can correspond to many possible pathways ofthe epidemic through the network. Figure 2.2 illustrates this concept for a simple transmission treewith three tips. When calculating a likelihood given a transmission tree, we must sum over all possiblelabellings of the internal nodes. Let L be the set of such labellings. Thenf (T | θ) = ∑l∈Lf (T , l | θ). (2.1)33A contact network model assigns a probability density to each possible contact network. Transmissiontrees are realized over particular contact networks, not over the model itself. Therefore, we must alsosum over all contact networks which could be generated by the model. Let G be the set of all possiblecontact networks. Summing eq. (2.1) over G givesf (T | θ) = ∑G∈G∑l∈Lf (T , l | G,θ) f (G | θ). (2.2)This can be simplified somewhat by noticing that, given a specific contact network, the labelled trans-mission tree depends only on that network and not on the model that generated it. That is, f (T ,ν |G,θ) = f (T , l | G), andf (T | θ) = ∑G∈Gf (G | θ)∑l∈Lf (T ,ν | G) (2.3)Under the assumption that both transmission and removal are Poisson processes, calculating f (T , l |G)can be accomplished by a straightforward modification of the Gillespie simulation algorithm (sec-tion 2.1.1). Rather than choosing transmission or removal events according to their probabilities, theevents would be deterministically chosen to mimic the events recorded in the transmission tree. Assum-ing efficient data structures for storing lists of nodes and edges, the complexity of this calculation wouldbe O(|T |). The number of possible labellings of internal nodes of T is easily seen to be 2(|T |−1)/2 bynoticing that each of the (|T | − 1)/2 internal nodes must be labelled with the same label as either itsright child or its left child. Although exponential calculations of this nature can often be simplified ontrees using dynamic programming (e.g.[140]), it cannot be straightforwardly applied in this case becausethe subtrees’ probabilities depend on the existing epidemic progress (their parents and siblings). Hence,calculating the inner sum over labels may take time O(2(|T |−1)/2).The outer sum, over all contact networks, is also difficult to evaluate in general. There are 2N(N−1)directed graphs on N nodes [141]. There must be at least as many nodes in the contact network as thenumber of tips in the tree, which is (|T |+ 1)/2. Of course, it is very likely that there are more nodesin the network than observed tips because some individuals are never infected and/or some infectedindividuals are never sampled. The complexity of calculating f (G | θ) is obviously dependent on theparticular model being investigated. For the BA model, we might have to sum over all possible ordersin which the nodes could be added, and all assignments of edges to the nodes which generated them.However, even in the case that calculating f (G | θ) can be done in constant time, the sum (2.3) still hasat least O(2|T |2) terms.We have shown that both the normalizing constant∫Θ f (T | θ)pi(θ)dθ and the likelihood f (T | θ)are likely computationally prohibitive to calculate. If this is the case, the problem of fitting contactnetwork models to phylogenies seems to be of the doubly-intractable type [142], which would imply thatthese models are amenable to neither ML nor Bayesian inference techniques. Although both methods areable to cope with an intractable normalizing constant (for example, by local search for ML or BayesianMCMC), neither can avoid the intractable likelihood calculations. This justifies the use of ABC, which34is a likelihood-free method.We have not proven here that eq. (2.3) is impossible to calculate in polynomial time - it couldbe possible to algebraically simplify the sum into a tractable expression. Furthermore, under certainmodels, a large proportion of G may have zero probability, which would enable the simplification of theouter sum on a model-specific basis. It should also be noted that extensions of Bayesian MCMC havebeen developed for doubly-intractable problems [142, 143], which might be adaptable to the problem athand. These have not been as widely used as ABC, nor are they as easily parallelizable as SMC.2.2 Analysis of Baraba´si-Albert model with synthetic data2.2.1 Why study the Baraba´si-Albert model?We developed netabc with the objective of extracting useful, quantitative information about networkstructures from viral phylogenies. An important aspect of “usefulness” is model specification and thebiological or epidemiological interpretation of the parameters. We want the model to be realistic, butnot so complicated that it becomes difficult to interpret. At least some of the parameters should be ofinterest from a theoretical or practical perspective, or there would be no point in estimating them. Sincenetabc is a phylodynamic method, intended to be used with viral sequence data, we would also liketo choose parameters which may be difficult to estimate with more standard methods. Otherwise, ourmethod provides no advantage. The Baraba´si-Albert (BA) model (section 1.3.2) satisfies these criteria,albeit some better than others. The purported realism of the model stems from the fat-tailed degreedistributions it produces, which are similar to those observed in real world sexual networks [22–25,144]. Moreover, the “rich get richer” phenomenon, where popular individuals attract new connectionsat an elevated rate, is intuitively reasonable for both sexual [105] and IDU [103] networks. However, themodel is very simple, assuming that all nodes form the same number of links when added to the networkand share the same preference for popular individuals.In this thesis, we consider four parameters related to the BA model, denoted N, m, α , and I (see sec-tion 1.3.2). The first three of these parameterize the network structure, while I is related to the simulationof transmission trees over the network. However, we will refer to all four as BA parameters. N denotesthe total number of nodes in the network, or equivalently, susceptible individuals in the population; m isthe number of new undirected edges added for each new vertex, or equivalently one-half of the averagedegree; α is the power of preferential attachment – new nodes are attached to existing nodes of degree dwith probability proportional to dα +1; finally, I is the number of infected individuals at the time whensampling occurs. The α parameter is unitless, while m has units of edges or connections per vertex, andN and I both have units of nodes or individuals.From a public health standpoint, all four parameters are of some interest. The prevalence I can beused to estimate the resources required to combat an ongoing epidemic, while total susceptible popu-lation size N provides a similar metric for preventative measures. The average degree of the network,in this case 2m, is directly related to R0, the basic reproductive number [96]. R0 quantifies the aver-age number of secondary infections ultimately caused by one infected individual; higher R0 generally35indicates faster epidemic growth and/or larger eventual epidemic size [5]. In a homogeneously mixedpopulation, the theoretically expected proportion of the population which must be vaccinated to con-trol an epidemic (the vaccination threshold) can be expressed in terms of R0 [145]. Although optimalvaccination strategies may differ in heterogeneous contact structures [10], it is reasonable to supposethat there would still be a relationship between m, R0, and the vaccination threshold. The preferentialattachment power α quantifies the degree to which high-degree nodes, also called superspreaders [100],characterize the network structure. Superspreaders have been hypothesized to play a disproportionatelygreater role in the spread of several diseases [37, 101]. If so, network-based interventions [28, 29] maybe worth considering as part of an epidemic control strategy. α can also offer some insight into how thenetwork would react to the removal of nodes. Dombrowski et al. [103] found evidence of preferentialattachment in IDU networks, and suggested that as a consequence of this characteristic, the removal ofrandom nodes (such as through a police crackdown) might inadvertently make it easier for epidemics tospread. When individuals with only one or two connections lose them, they might tend to seek out well-known (that is, high-degree) members of the community, thus increasing those individuals’ connectivityeven further. Although we do not consider a dynamic network in this work, it is not unreasonable that thenetwork could be close to static over short periods of time, and this static structure might be informativeof future dynamic behaviour.Though it is theoretically possible to estimate all four BA parameters using more conventional ap-proaches, the cost of doing so may be high, in addition to other situation-specific challenges. In theory,any network parameter can be estimated by explicitly constructing the contact network, although thisis highly resource intensive and is hampered by misreporting and other challenges [42]. All parametersbecome more difficult to estimate when the infected population is “hidden” due to illegal or stigmatizedbehaviour, as is sometimes the case with HIV outbreaks among IDU, MSM, or sex workers. In suchcases, phylodynamic methods may offer the advantage of providing information about the whole pop-ulation from only a small sample. A survey, for example, will not tell us if there are large groups ofthe population that we simply have not sampled. Our hope is that estimating N and I phylogeneticallymight provide this additional information. The average degree of the network, 2m, is also estimable bya survey, although individuals may be unwilling or unable to disclose how many contacts they have had.Sequence data also provide an advantage in this respect – they are objective and do not share the samebiases as self-reported partner counts. The estimation of α is more complex, as this parameter is moststrongly reflected in the connectivity of very high degree nodes, who are rare in the population. Locatingthem might require contact tracing, or respondent-driven sampling [90]. Even if the full degree distri-bution of a network is available, there are models other than preferential attachment which can producescale-free networks [e.g. 146]. de Blasio, Svensson, and Liljeros [105] were able to estimate α by max-imum likelihood using partner count data from several sequential time intervals, but they admit suchdetailed data are not usually available. Moreover, their dataset was constructed via a random survey,which would likely miss the few high-degree nodes characterizing a power law degree distribution. Insummary, each of the BA parameters may be estimated without phylodynamics, but there are sufficientdifficulties that we believe an alternative method using sequence data is warranted.362.2.2 Using synthetic data to investigate identifiability and sources of estimation errorWe have argued that the parameters of the BA model are interesting and worth estimating with phylody-namic methods. However, these estimates will only constitute “useful” information about the network ifthe parameters are identifiable from phylogenetic data. Roughly speaking, the identifiability of a param-eter says how much information about that parameter can possibly be obtained from the observed data.If the parameters of the BA model do not influence tree shape at all, then we cannot possibly estimatethem – the posterior distribution will exactly resemble the prior, no matter how accurate a representationof the posterior we are able to produce. Hence, before proceeding with a full validation of netabc onsimulated data, we undertook two experiments designed to assess the identifiability of the BA param-eters. These experiments only investigated one parameter of the BA model at a time while holding allothers fixed, a strategy commonly used when performing sensitivity analyses of mathematical models.This allowed us to perform a fast preliminary analysis without dealing with the “curse of dimensional-ity” of the full parameter space. The experiments are motivated and described on a high level here, withmore detail provided in the next section.First, we simulated trees under three different values of each parameter, and asked how well we couldtell the different trees apart. The better we are able to distinguish the trees, the more identifiability wemight expect for the corresponding parameter when we attempt to estimate it with ABC. This experimentalso had the secondary purpose of validating our choice of the tree kernel as a distance measure in ABC.To tell the trees apart, we used a classifier based on the tree kernel, but we also tested two other tree shapestatistics: one which considers only the topology, and another which considers only branching times.Since the tree kernel incorporates both of these sources of information, we expected it to outperform theother two statistics. Finally, the tree kernel can be “tuned” by adjusting the values of the meta-parametersλ and σ . The results of this experiment were used to select values for these meta-parameters to carryforward to the rest of the thesis, based on their accuracy in distinguishing the different trees.A second experiment was designed to test whether we could actually estimate the parameters nu-merically, rather than just telling three different values apart, and also to assess how identifiability variedin different regions of the parameter space. An individual tree was compared to simulated trees on a one-dimensional grid of values of one BA parameter, to obtain a “distribution” of kernel score values. Fromthis distribution, estimates and credible intervals of the parameter could be calculated. Repeating thisexperiment with trees located throughout the parameter space allowed us to better quantify the identifi-ability. Furthermore, doing marginal estimation (that is, estimating one parameter with all others fixed)can provide insight into any biases observed when doing joint estimation with ABC. If grid search isinaccurate, it indicates a lack of parameter identifiability. However, if the marginal grid search estimatesare accurate but the estimates obtained with ABC are biased, this points to confounding between theparameters which could only be observed when they are all estimated jointly.After these preliminary experiments, the strategy we used for testing netabc was a standard simulation-based validation. Transmission trees were simulated under several combinations of parameter values,and we tried to recover these values with netabc. We then used a multivariable analysis to investigatehow accuracy of these estimates was influenced by the true parameter values.37In the previous section, we argued that the BA model parameters were worth investigating, and herewe have presented several computational experiments designed to assess their identifiability. There isa final, more technical aspect of our method’s “usefulness” to consider, which is the accuracy of theABC approximation to the posterior. As discussed in section 1.5, ABC does not target the posteriordistribution directly, but rather an approximate posterior derived from simulated data and a distancefunction. ABC assumes that this approximate posterior resembles the true posterior, and it is criticalfor our estimates’ relevance that this assumption holds. There are two potential causes of an inaccurateABC approximation [147]. First, the Monte Carlo approximation to the ABC target distribution maybe poor, due to the settings used for ABC-SMC. Second, the ABC target distribution may not resemblethe posterior, due to a poor choice of distance function. We designed two experiments to investigate theimpact of these sources of error.The Monte Carlo approximation error is fairly easily quantified by simply increasing the computingpower used for SMC. We ran one simulation using a larger number of particles, more simulated datasetsper particle, and a higher value for αESS (defined in section 2.1.3). A substantial improvement in accu-racy resulting from these changes would likely indicate a high Monte Carlo error with the lower settings.The second issue, the resemblance of the ABC target distribution to the true posterior, is somewhat moredifficult to investigate. We do not have access to the true posterior, even for simulations where the trueparameter values are known. To address this source of error, we performed marginal parameter esti-mation with ABC by informing netabc of some of the true parameter values. Any inaccuracy or biasobserved only in the joint estimation results, but not the marginal estimates, is most easily explained byinterdependence between parameters in the true posterior. However, errors observed in both marginaland joint estimates could be due to either the shape of the true posterior or an inaccurate ABC approxi-mation, and we have no way to distinguish one from the other. In other words, this experiment providedonly an upper bound on the error due to an inaccurate ABC approximation.2.2.3 MethodsFor all simulations, we assumed that all contacts had symmetric transmission risk, which was imple-mented by replacing each undirected edge in the network with two directed edges (one in each di-rection). Nodes in our networks followed simple SI dynamics, meaning that they became infected ata rate proportional to their number of infected neighbours, and never recovered. We did not considerthe time scale of the transmission trees in these simulations, only their shape. Therefore, the transmis-sion rate along each edge in the network was set to 1, the removal rate of each node was set to 0,and all transmission trees’ branch lengths were scaled by their mean. The igraph library’s implemen-tation of the BA model [136] was used to generate the graphs. The analyses were run on Westgrid(https://www.westgrid.ca/) and a local computer cluster. With the exception of our own C programs, allanalyses were done in R, and all packages listed below are R packages. Code to run all experiments isfreely available at https://github.com/rmcclosk/thesis.38Classifiers for BA model parameters based on tree shapeOur first computational experiment was designed as an exploratory analysis of the four BA model pa-rameters defined above: α , I, m, and N. The objective of this experiment was to determine whether anyof the four parameters were identifiable from the shape of the transmission tree, as quantified by the treekernel. Each of the BA model parameters was varied one at a time, while holding the other parametersat fixed, known values. Contact networks were generated according to each set of parameter values, andtransmission trees were simulated over the networks. We then evaluated how well a classifier based onthe tree kernel could differentiate the trees simulated under distinct parameter values. If the classifier’scross-validation accuracy was high, this could be taken as an indication that the parameter in questionwas identifiable in the range of values considered. A caveat of this preliminary analysis is that, sinceall parameters but one were held at known values, nothing could be said about the identifiability ofcombinations of parameters; this issue will be explored later by jointly estimating all parameters withABC.In addition to testing for identifiability, a secondary objective of this analysis was to validate theuse of the tree kernel as a distance measure for ABC in our context. As discussed in section 1.5, thechoice of distance function is extremely important for the accuracy of the ABC approximation to theposterior. Therefore, we evaluated two additional tree statistics in the same manner as we evaluated thetree kernel (that is, by constructing and testing a classifier). First, we considered Sackin’s index [69],which measures the degree of imbalance or asymmetry in a phylogeny (see section 1.2.4). Sackin’sindex is widely used for characterizing phylogenies [148] and has been demonstrated to vary betweentransmission trees simulated under different contact network types [106]. Sackin’s index does not takebranch lengths into account, considering only the tree’s topology. The other statistic we consideredwas the normalized lineages-through-time (nLTT) [72], which compares two trees based on normalizeddistributions of their branching times (see section 1.2.4). In contrast with Sackin’s index, the nLTT doesnot explicitly consider the trees’ topologies, but it does use their normalized branch lengths. While thenLTT is a newly developed statistic not yet in widespread use, the unnormalized LTT [34] was the basisof seminal early work extracting epidemiological information from phylogenies [41]. We expected thetree kernel to classify the BA parameters more accurately than either Sackin’s index or the nLTT, sincethe tree kernel takes both topology and branch lengths into account.This experiment involved a large number of variables that were varied combinatorially. For easeof exposition, we will describe a single experiment first, then enumerate the values of all variables forwhich the experiment was repeated. The parameters of the tree kernel, λ and σ (section 1.2.4), will bereferred to as meta-parameters to distinguish them from the parameters of the BA model.The attachment power parameter α was varied among three values: 0.5, 1.0, and 1.5. For eachvalue, the sample pa function in the igraph package was used to simulate 100 networks, with the otherparameters set to N = 5000 and m = 2. This step yielded a total of 300 networks. An epidemic wassimulated on each network using our nettree binary until I = 1000 nodes were infected, at which point500 of them were sampled to form a transmission tree. A total of 300 transmission trees were thusobtained, comprised of 100 trees for each of the three values of α . The trees were “ladderized” so39that the subtree descending from the left child of each node was not smaller than that descending fromthe right child. Summary statistics, such as Sackin’s index and the ratio of internal to terminal branchlengths, were computed for each simulated tree using our treestat binary. The trees were visualizedusing the ape package [149]. Both the tree kernel and the nLTT are pairwise statistics, and the SVMsclassifiers we used to investigate them operate on pairwise distance matrices. Our treekernel binary wasused to calculate the value of the kernel for each pair of trees, with the meta-parameters set to λ = 0.3and σ = 4. These values were stored in a symmetric 300 × 300 kernel matrix. Similarly, we computedthe nLTT statistic between each pair of trees using our treestat binary, and stored them in a second300×300 matrix.To investigate the identifiability of α from tree shape, we constructed classifiers for α based on thethree tree shape statistics discussed above. The kernlab package [150] was used to create a kernel supportvector machine (kSVM) classifier using the computed kernel matrix, and the e1071 package [151] wasused to create ordinary SVM classifiers using the pairwise nLTT matrix and Sackin’s index values. Eachof these classifiers was evaluated with 1000 two-fold cross-validations with equally-sized folds. We alsoperformed a kernel principal components analysis (kPCA) projection of the kernel matrix, and used it tovisualize the separation of the different α values in the tree kernel’s feature space. A schematic of thisexperiment is presented in fig. 2.3.Similar experiments were performed with the values shown in table 2.1. The other three BA param-eters, N, m, and I, were each varied while holding the others fixed. The experiments for α , m, and Nwere repeated with three different values of I. All experiments were repeated with trees having threedifferent numbers of tips. Kernel matrices were computed for all pairs of the meta-parameters λ = {0.2,0.3, 0.4} and σ = {1/8, 1/4, 1/2, 1, 2, 4, 8}.403 parameter valuesθ1θ2 θ3300 networks× 100 × 100 × 100300 epidemics× 100 × 100 × 100300 transmission trees× 100 × 100 × 1003 tree shape statistics300 × 300 kernel matrix300 × 300 nLTT matrix300 Sackin’s index values3 cross-validationskernel-PCA projectionFigure 2.3: Schematic of classifier experiments investigating identifiability of BA model parametersfrom tree shapes. The parameters of the BA model were varied one at a time while holding all othersfixed. Transmission trees were simulated under three different values of each parameter. Classifiers wereconstructed for each parameter based on three tree shape statistics, and their accuracy was evaluated bycross-validation. Kernel-PCA projections were used to visually examine the separation of the trees inthe feature space defined by the tree kernel.41varied parameter N α m I tips λ σN 3000, 5000, 8000 1.0 2 500, 1000, 2000 100, 500, 1000 0.2, 0.3, 0.4 1/8, 1/4, 1/2, 1, 2, 4, 8α 5000 0.5, 1.0, 1.5 2 500, 1000, 2000 100, 500, 1000 0.2, 0.3, 0.4 1/8, 1/4, 1/2, 1, 2, 4, 8m 5000 1.0 2, 3, 4 500, 1000, 2000 100, 500, 1000 0.2, 0.3, 0.4 1/8, 1/4, 1/2, 1, 2, 4, 8I 5000 1.0 2 500, 1000, 2000 100, 500 0.2, 0.3, 0.4 1/8, 1/4, 1/2, 1, 2, 4, 8Table 2.1: Values of parameters and meta-parameters used in classifier experiments to investigate identifiability of BA model parameters from treeshapes. Each row corresponds to one of the BA model parameters. One kernel matrix was created for every combination of values except the oneindicated in the “varied parameter” column, which was varied when producing simulated trees.parameter grid values test values N α m I tipsN 1050, 1125, . . . , 15000 1000, 3000, . . . , 15000 - 1.0 2 1000 100, 500, 1000α 0, 0.01, . . . , 2 0, 0.25, . . . 2 5000 - 2 1000 100, 500, 1000m 1, 2, . . . , 6 1, 2, . . . 6 5000 1.0 - 1000 100, 500, 1000I 500, 525, . . . , 5000 500, 1000, 1500, 2000 5000 1.0 2 - 100, 500Table 2.2: Values of parameters and meta-parameters used in grid search experiments to further investigate identifiability of BA model parameters.Trees were simulated under the test values, and compared to a grid of trees simulated under the grid values. Kernel scores were used to calculate pointestimates and credible intervals for each parameter, which were compared to the test values.42Grid searchThe previous experiment was an exploratory analysis intended to determine which of the BA parameterswere identifiable, and whether the tree kernel could potentially be used to distinguish different parametervalues when all others were held fixed. In this experiment, which was still of an exploratory nature, wecontinued to consider one parameter at a time while fixing the other three. However, rather than checkingfor identifiability, we were now interested in quantifying the accuracy and precision of kernel score-based estimates. This was done by examining the distribution of kernel scores on a grid of parametervalues, when trees simulated according to those values were compared with a single simulated test tree.As in the previous section, we will begin by describing a single experiment, and then list the vari-ables for which similar experiments were performed. We varied α along a narrowly spaced grid of val-ues: 0, 0.01, . . . , 2. For each value, fifteen networks were generated with igraph, and transmission treeswere simulated over each using nettree. These trees will be referred to as “grid trees”. Next, one furthertest tree was simulated with the test value α = 0. Both the grid trees and the test tree had 500 tips, andwere simulated with the other BA parameters set to the known values N = 5000, m = 2, and I = 1000.The test tree was compared to each of the grid trees using the tree kernel, with the meta-parameters setto λ = 0.3 and σ = 4, using the treekernel binary. The median kernel score was calculated for eachgrid value, and the scores were normalized such that the area under the curve was equal to 1. For allparameters except m, 50% and 95% highest density intervals were obtained using the hpd function inthe TeachingDemos package [152]. Since the hpd function assumes a continuous distribution, we im-plemented our own version for discrete distributions to use for m.Each experiment of the type just described was repeated ten times with the same test value. Similarexperiments were performed for each of the four BA parameters, with several test values and trees ofvarying sizes. The variables are listed in table 2.2. A graphical schematic of the grid search experimentsis shown in fig. 2.4.Approximate Bayesian computationOur final synthetic data experiment was designed to test the full ABC-SMC algorithm by jointly esti-mating the four parameters of the BA model. We used the standard validation approach of simulatingtransmission trees under the model with known parameter values and attempting to recover those val-ues with netabc. The algorithm was not informed of any of the true parameter values for the main setof simulations. Despite the fact that the parameter values used to generate the simulated transmissiontrees were known, the true posterior distributions on the BA parameters were unknown. Therefore, anyapparent errors or biases in the estimates could be due to either poor performance of our method, or toreal features of the posterior distribution. The latter type of error reflects on the suitability of the model,but does not invalidate the use of our method in cases where the parameters are more identifiable. Tworetrospective experiments were performed to disambiguate some of the observed errors: one where weran a simulation with increased computational power to test for an increase in accuracy, and a secondwhere we estimated parameters marginally to remove confounding from the other parameters in the43k test values n grid values10× k test trees× 10 · · · × 1015×n grid trees× 15 · · · × 1510× k kernel score vectors●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●training valuescore10× k parameter estimatespoint estimate = grid value withhighest median kernel scoreFigure 2.4: Schematic of grid search experiment to further investigate identifiability of BA model pa-rameters from tree shapes. Trees were simulated along a narrowly spaced grid of values of one parameter(“grid trees”) with all other parameters fixed to known values. Separate trees were simulated for a smallsubset of the grid values (“test trees”), also holding the other parameters fixed. Each test tree was com-pared to every grid tree using the tree kernel, and the resulting kernel scores were normalized to resemblea probability density from which the mode and 95% highest density interval were calculated.joint posterior.The parameter values and priors used are listed in table 2.3. The tree kernel meta-parameters wereset to λ = 0.3 and σ = 4. The SMC algorithm was run with 1000 particles, five sampled datasets perparticle, and αESS set to 0.95. The algorithm was stopped when the acceptance rate of the MCMC kernel(see section 2.1.3) dropped below 1.5%, the same criterion used by Del Moral, Doucet, and Jasra [20].For visualization, approximate marginal posterior densities for each parameter were calculated usingthe density function in R applied to the final weighted population of particles. Posterior means obtainedfor each parameter using the wtd.mean function in the Hmisc package [153]. Credible intervals wereobtained using the hpd function in the TeachingDemos package [152] for α I, and N, and using our ownimplementation for discrete distributions for m.To evaluate the effects of the true parameter values on the accuracy of the posterior mean estimates,we analyzed the α and I parameters individually using GLMs. The response variable was the error of44parameter or variable test values priorN 5000 Uniform(500, 15000)α 0, 0.5, 1, 1.5 Uniform(0, 2)m 2, 3, 4 DiscreteUniform(1, 5)I 1000, 2000 Uniform(500, 5000)tips 500 -Table 2.3: Parameter values used in simulation experiments to test accuracy of BA model fitting withnetabc. Trees were simulated under the test values, and netabc was used to estimate posterior distribu-tions on the BA parameters for each simulated tree. Netabc was naı¨ve to the true parameter values.the point estimate, and the predictor variables were the true values of α , I, and m. We did not test for dif-ferences across true values of N, because N was not varied in these simulations. The distribution familyand link function for the GLMs were chosen as Gaussian and inverse, respectively, by examination ofresidual plots and Akaike information criterion (AIC). The p-values of the estimated GLMs coefficientswere corrected using Holm-Bonferroni correction [154] with n = 6 (two GLMs with three predictorseach). Because there was clearly little to no identifiability of N and m with ABC (see results in nextsection), we did not construct GLMs for those parameters.Two further simulations were performed to assess the possible impact of two types of model mis-specification. To consider the effect of heterogeneity among nodes, we generated a network where halfthe nodes were attached with power α = 0.5 and the other half with power α = 1.5. The other parametersfor this network were N = 5000, I = 1000, and m = 2. To investigate the effects of potential samplingbias [155], we simulated a transmission tree where the tips were sampled in a peer-driven fashion, ratherthan at random. That is, the probability to sample a node was twice as high if any of that node’s networkpeers had already been sampled. The parameters of this network were N = 5000, I = 2000, m = 2, andα = 0.5.To assess the impact of the SMC settings on netabc’s accuracy, we ran netabc twice on the samesimulated transmission tree. For the first run, the SMC settings were the same as in the other simu-lations: 1000 particles, 5 simulated transmission trees per particle, and αESS = 0.95. The second runwas performed with 2000 particles, 10 simulated transmission trees per particle, and αESS = 0.97. Toinvestigate the extent to which errors in the estimated BA parameters were due to true features of theposterior, rather than an inaccurate ABC approximation, we performed marginal estimation for one setof parameter values. Each combination of 1, 2, or 3 model parameters (14 combinations total) was fixedto their known values, and the remaining parameters were estimated with netabc. The parameter valueswere α = 0.0, m = 2, I = 2000, and N = 5000.2.2.4 ResultsClassifiers for BA model parameters based on tree shapeTrees simulated under different values of α were visibly quite distinct (fig. 2.5). In particular, highervalues of α produce networks with a small number of highly connected nodes, which, once infected,45are likely to transmit to many other nodes. This results in a more unbalanced, ladder-like structure inthe phylogeny, compared to networks with lower α values. None of the other three parameters producedtrees that were as easily distinguished from each other (figs. B.4 to B.6). Sackin’s index, which measurestree imbalance, was significantly correlated with all four parameters (for α , I, m, and N respectively:Spearman’s rho = 0.85, −0.12, −0.13, 0.09; p-values <10−5, 0.003, <10−5, <10−5). The ratio ofinternal to terminal branch lengths was negatively correlated with α and I, and positively correlatedwith m and N (Spearman’s rho −0.8, −0.69, 0.09, 0.17; all p < 10−5).α = 0.5 α = 1 α = 1.5Figure 2.5: Simulated transmission trees under three different values of preferential attachment power(α) parameter of BA model. Epidemics were simulated on BA networks of 5000 nodes, with α equal to0.5, 1.0, or 1.5, until 1000 individuals were infected. Transmission trees were created by randomly sam-pling 500 infected nodes. Higher α values produced networks with a small number of highly-connectednodes, resulting in highly unbalanced, ladder-like trees.Figure 2.6 shows kPCA projections of the simulated trees onto the first two principal componentsof the kernel matrix. The figure shows only the simulations with 500-tip trees and 1000 infected nodes.The three α and I values considered are well separated from each other in the feature space mapped toby the tree kernel. On the other hand, the three N values overlap significantly, and the three m valuesare virtually indistinguishable. Similar observations can be made for other values of I and the numberof tips (figs. B.11 to B.14). The values of I and N separated more clearly with larger numbers of tips,and in the case of N, with larger epidemic sizes (figs. B.12 and B.14).The accuracy of each classifier, which is the proportion of trees assigned the correct parameter value,is shown in fig. 2.7. Since there were three possible values, random guessing would produce an accuracyof 0.33. Classifiers based on the nLTT and Sackin’s index generally exhibited worse performance thanthe tree kernel, although the magnitude of the disparity varied between the parameters (fig. 2.7, centre46●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●α● 0.511.5●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●I● 50010002000●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●m● 234 ●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●N● 300050008000first principal componentsecond principal componentFigure 2.6: Each parameter of the BA model was individually varied to produce 300 simulated treeswith 500 tips each. Kernel matrices were formed from all pairwise kernel scores among each set of 300trees. The trees were projected onto the first two principal components of the kernel matrix calculatedusing kPCA. The other parameters, which were not varied, were set to α = 1, I = 1000, m = 2, andN = 5000. The tree kernel meta-parameters were λ = 0.3 and σ = 4.and right). The results were largely robust to variations in the tree kernel meta-parameters λ and σ ,although accuracy varied between different epidemic and sampling scenarios (figs. B.7 to B.10). Forall parameters except m, the absolute number of tips in the tree had a much greater impact on accuracythan the proportion of infected individuals these tips represented. However, for m, both the number andproportion of sampled tips had a strong impact or the accuracy of the kSVM (fig. B.9).The kSVM classifier for α had an average accuracy of 0.92, compared to 0.6 for the nLTT-basedSVM, and 0.77 for Sackin’s index. There was little variation about the mean for different tree and epi-demic sizes. No classifier could accurately identify m in any epidemic scenario, with average accuracyvalues of 0.35 for kSVM, 0.32 for the nLTT, and 0.38 for Sackin’s index. Again, there was little vari-ation in accuracy between epidemic scenarios, although the accuracy of the kSVM was slightly higheron 1000-tip trees (average 0.33, 0.35, 0.38 for 100, 500, and 1000 tips respectively).The accuracy of classifiers for I appeared to be strongly influenced by the number of tips in the47tree kernel nLTT Sackin's indexlllllll llllll lllllllllll lll lll0.40.60.81.0α I m N α I m N α I m Nparameteraccuracytipsl 1005001000Illl50010002000Figure 2.7: Cross-validation accuracy of kernel-SVM classifier (left), SVM classifier using nLTT (cen-tre), and SVM using Sackin’s index (right) for BA model parameters. Kernel meta-parameters wereset to λ = 0.3 and σ = 4. Each point was calculated based on 300 simulated transmission trees overnetworks with three different values of the parameter being tested, assuming perfect knowledge of theother parameters. Vertical lines are empirical 95% confidence intervals based on 1000 two-fold cross-validations. The classifiers for I were not evaluated with 1000-tip trees, because one of the tested Ivalues was 500, and it is not possible to sample a tree of size 1000 from 500 infected individuals.tree. For 100-tip trees, the average accuracy values were 0.59, 0.58, and 0.34 for the tree kernel, nLTT,and Sackin’s index respectively. For 500-tip trees, the values increased to 0.99, 0.76, and 0.37. Finally,the performance of classifiers for N depended heavily on the epidemic scenario. The accuracy of thekSVM classifier ranged from 0.36 for the smallest epidemic and smallest sample size, to 0.81 for thelargest. Likewise, accuracy for the nLTT-based SVM ranged from 0.33 to 0.63. Sackin’s index did notaccurately classify N in any scenario, with an average accuracy of 0.35 and little variation betweenscenarios.Marginal parameter estimates with grid searchFigure 2.8 shows point estimates and 50% and 95% highest density intervals for each of the BA param-eters, for one replicate experiment with 500-tip trees. Plots showing the point estimates for all replicatescan be found in figs. B.19 to B.22. For all parameters except m, the error of point estimates was nega-tively correlated with the number of sampled tips in the tree (for α , I, and N respectively: Spearman’s ρ= −0.22, −0.51, −0.16; p-values 4×10−4, <10−5, 0.01). The 95% highest density intervals obtainedfor all parameters were extremely wide, occupying >75% of the grid in all cases (fig. 2.8).Across all replicates, R2 values for the correlations between the estimated and true values were 0.91,0.91, 0.25, and 0.54 for α , I, m, and N respectively. The mean absolute errors of the point estimates were0.14, 310, 1.31, and 2419, representing 7%, 8%, 26%, and 17% of the respective grids. Note that fig. 2.8contains only one replicate data point per parameter value, out of ten total. For I and N, relative errorsmay be more appropriate to consider. An overestimate of 100 individuals would be very misleading ifthe true population was only of size 100, but almost negligible in a population of size 5000. The meanrelative errors for I and N respectively were 18% and 31%.Qualitatively, α and I exhibited weak identifiability within particular sections of the grid (fig. 2.8).48●●●●●●●● ●0.00.51.01.52.00.0 0.5 1.0 1.5 2.0true αestimated α●● ●●●●●●●100020003000400050001000 2000 3000 4000true Iestimated I●●● ● ● ●2462 4 6true mestimated m●●●● ●●●●40008000120000 5000 10000 15000true Nestimated NFigure 2.8: Grid search estimates of BA model parameters for one replicate experiment with trees of size500. Point estimates (dots), 95% HDIs (lines), and 50% HDIs (notches) for each BA model parameter,obtained using grid search. Networks and transmission trees were simulated over a grid of values foreach parameter while holding the others fixed to known values. For a subset of the grid values (x-axis),test networks and trees were created and compared to each tree on the grid using the tree kernel. Thekernel scores along the grid were normalized to resemble a probability distribution, from which themode and highest density interval were calculated.That is, the posterior distributions for several different grid values were extremely similar. The 50%HDIs for α were similar for the values α ≤ 0.5 (on average 0.02 - 0.84) and for α ≥ 1.5 (1.26 - 2).For I, similar 50% HPD were observed for I ≤ 1500 (average [650 - 2846]) and for I > 3000 ([2596 -4802]). No similar patterns were observed for m or N (although the 50% HDIs appear to be identical form > 2 in fig. 2.4, this was not consistent across replicates).Figures B.15 to B.18 show kernel score distributions of kernel scores along the grid for each pa-rameter. The distributions for some values, such as α = 1.25, I = 500 and 4500, m = 1, and N = 1000,exhibited distinct peaks around the true value. This indicates that these values produce distinctivelyshaped trees that can be identified with the tree kernel, when the other parameter values are fixed andknown. However, for the majority of values of each parameter, the score distributions were fairly flataround the true value. This means there is a range of values which produce similarly shaped trees,and the parameter is less identifiable within that range. The exception was I, whose score distributions49exhibited a more or less rounded shape with the highest point near the true value.Joint parameter estimates with netabcFigure 2.9 shows stratified posterior mean point estimates of the BA model parameters α and I obtainedwith ABC on simulated data. The parameters m and N were not identifiable with ABC for any parametercombinations (fig. B.23). Average boundaries of 95% HPD intervals are given in table 2.4.llll0.00.51.01.52.0αstratified by αl llllstratified by Ill llllstratified by ml l12345I (thousands)llllα 0 0.5 1 1.5 I 1000 2000 m 2 3 4Figure 2.9: Posterior mean point estimates for BA model parameters α and I obtained by running netabcon simulated data, stratified by true parameter values. First row of plots contains true versus estimatedvalues of α; second row contains true versus estimated values of I. Columns are stratified by α , I, andm respectively. Dashed lines indicate true values.Across all simulations, the median [interquartile range (IQR)] absolute errors of the parameter esti-mates obtained with netabc were 0.11 [0.03 - 0.25] for α , 492 [294 - 782] for I, 1 [0 - 1] for m, and 4153[3660 - 4489] for N. These errors comprised, respectively, 6%, 11%, 17%, and 29% of the regions ofnonzero prior density. For I and N, relative errors were 38% [20 - 50%] and 83% [73 - 90%]. Average95% HPD interval widths were 0.68, 2454, 3.01, and 12046, representing 34%, 55%, 50%, and 83%of the nonzero prior density regions. Point estimates of I were upwardly biased: I was overestimated in69 out of 72 simulations (96%). The estimates for m and N were similar across all simulations (median[IQR] point estimates 3 [3 - 3] and 9153 [8660 - 9489]) regardless of the true values of any of the BAparameters.50Parameter True value Mean pointestimateMean HPDlower boundMean HPDupper boundα 0.0 0.36 0.01 0.810.5 0.43 0.04 0.831.0 0.90 0.51 1.091.5 1.52 1.26 1.81I 1000 1450 651 25922000 2622 1114 4080m 2 2.96 2.00 5.003 3.04 2.04 4.964 3.17 1.88 5.00N 5000 9041 2613 14659Table 2.4: Average posterior mean point estimates and 95% HPD interval widths for BA model param-eter estimates obtained with netabc on simulated data. Three transmission trees were simulated undereach combination of the listed parameter values, and the parameters were estimated with ABC withouttraining.To analyze the effects of the true parameter values on the accuracy our estimates of α and I, wefitted one GLM for each of these two parameters, with error rate as the dependent variable and the trueparameter values as independent variables. Since the estimates of m and N were roughly equal acrossall simulations (fig. B.23), GLMs were not fitted for these parameters. The estimated coefficients areshown in tables 2.5 and 2.6. The GLM were fitted using the inverse link function. That is, if p is the truevalue of the parameter and pˆ is a random variable representing our estimate of the parameter, the GLMposits a relationship of the formE(|p− pˆ|) = (β0+βαα+βII+βmm)−1,where the β ’s are coefficients to be fitted. If the true value α , say, is increased by one, the inverse of theexpected absolute error will increase by βα . If βα is positive, it means that the absolute error decreasesas the true value of α increases.Parameter Estimate Standard error p-value(Intercept) 2 0.6 0.01α 10 2 <10−5I −3×10−4 2×10−4 0.7m 0.5 0.2 0.01Table 2.5: Parameters of a fitted GLM relating error in estimated α to true values of BA parameters.GLM was fitted with a Gaussian distribution and inverse link function. Coefficients are interpretable asadditive effects on the inverse of the mean error.The GLM analysis indicated that the error in estimates of α decreased with larger true values of α(p<10−5) and m (p= 0.01) but was not significantly affected by I (table 2.5). Qualitatively, α seemed tobe only weakly identifiable between the values of 0 and 0.5 (fig. 2.9). The error in the estimated preva-lence I was slightly lower for smaller values of α (p<10−5) and I (p = 0.05), but was not significantly51Parameter Estimate Standard error p-value(Intercept) 0.004 5×10−4 <10−5α −0.001 2×10−4 <10−5I −4×10−7 2×10−7 0.05m −7×10−5 8×10−5 1Table 2.6: Parameters of a fitted GLM relating error in estimated I to true values of BA parameters.GLM was fitted with a Gaussian distribution and inverse link function. Coefficients are interpretable asadditive effects on the inverse of the mean error.affected by the true value of m (table 2.6).The dispersion of the ABC approximation to the posterior also varied between the parameters (ta-ble 2.4). HPD intervals around α and I were often narrow relative to the region of nonzero prior density,whereas the intervals for m and N were more widely dispersed. Figures 2.10 and 2.11 show one- andtwo-dimensional marginal distributions for a simulation with relatively low error. Each parameter waschosen based on its mean error rate across all simulations. The parameters for this simulation wereα = 1.5, I = 1000, m = 4, and N = 5000. Figures 2.12 and 2.13 show the equivalent marginals for adifferent simulation with relatively high error rates for each individual parameter. The parameters wereα = 0, I = 2000, m = 2, and N = 5000. The two-dimensional marginals indicate some dependence be-tween pairs of parameters, particularly I and N which show a diagonally shaped region of high posteriordensity.To test the effect of model misspecification, we simulated one network where the nodes exhibitedheterogeneous preferential attachment power (half 0.5, the other half 1.5), with m = 2, N = 5000, andI = 1000. The posterior mean [95% HPD] estimates for each parameter were: α , 1.03 [0.67 - 1.18]; I,1474 [511 - 2990]; m, 3 [1 - 5]; N, 9861 [3710- 14977]. To test the effect of sampling bias, we sampledone transmission tree in a peer-driven fashion, where the probability to sample a node was twice ashigh if one of its peers had already been sampled. The parameters for this experiment were N = 5000,m = 2, α = 0.5, and I = 2000. The estimated values were: α , 0.3 [0 - 0.63]; I, 2449 [1417 - 3811]; m,3 [2 - 5]; N, 9132 [2852 - 14780]. Both of these results were in line with estimates obtained on othersimulated datasets (table 2.4), although the estimate of peer-driven sampling for α was somewhat lowerthan typical.Figure 2.14 shows the effect of performing marginal ABC estimation of each of the BA parameterson the same simulated transmission tree. The estimates of m were apparently unaffected by marginaliz-ing out the other parameters, corroborating the previous experiments’ findings that m is not an identifi-able parameter from scaled tree shapes. Compared to allowing all parameters to vary, estimates of α , I,and N were improved by 41%, 59%, and 46% when all other parameters were fixed. Figure 2.15 showsthe impact of increasing the number of particles, simulated datasets, and αESS parameter on the accuracyof a single simulation. The results of the two simulations were similar, but surprisingly, the results withhigher SMC settings were slightly worse (by 10%, 8%, and 11% for α , I, and N respectively). However,the 50% HPD interval for I was closer to the true value of 2000 with the improved settings (2338 - 3423,vs. 2810 - 3767 with basic settings). The estimate of m, 3 in both cases, was unaffected by the settings.520.0 0.5 1.0 1.5 2.0α1000 2000 3000 4000 5000I1 2 3 4 5m0 5000 10000 15000Napproximate posterior densityFigure 2.10: One-dimensional marginal posterior distributions of BA model parameters estimated withlow error by netabc from a simulated transmission tree. Dashed lines indicate true values, solid linesindicate posterior means, and shaded areas show 95% highest posterior density intervals.2.3 Application to real world HIV data2.3.1 MethodsBecause the BA model assumes a single connected contact network, it is most appropriate to applyto groups of individuals who are epidemiologically related. Therefore, we searched for published HIVdatasets which originated from existing clusters, either phylogenetically or geographically defined. Toidentify datasets fitting these criteria, we used the Entrez module in the BioPython library [156] toidentify all studies which were linked to at least 150 sequences of the same HIV gene in GenBank (635at the time when the data was collected). We manually curated a subset of these articles which, basedon their title and abstract, appeared to have sampled one sequence per individual in a group likely tobe epidemiologically related. For example, studies were excluded if they were investigating response toa particular drug, pregnant women or pediatric HIV patients, intra-host evolution, or a multi-region ormulti-country cohort. We acknowledge that our perusal of the complete set of articles was rather cursory,often limited to reading the title, and it is quite possible that we failed to identify other studies whichwould have been suitable to include. Each potential dataset was revisited, and those without samplingtime annotation in GenBank were excluded. We also excluded studies where all sequences were sampledat the same timepoint, which was necessary because the method we used to time scale the tree requires53Figure 2.11: Two-dimensional marginal posterior distributions of BA model parameters estimated withlow error by netabc from a simulated transmission tree. White circles indicate true values, magentadiamonds indicate posterior means.non-contemporaneous tips. The datasets are summarized in table 2.7. In addition to the published data,we analyzed an in-house dataset sampled from HIV-positive individuals in British Columbia, Canada.For clarity, we will refer to each dataset by its risk group and location of origin in the text. For example,the Zetterberg et al. [157] data will be referred to as IDU/Estonia.We downloaded all sequences associated with each published study from GenBank. For the IDU/Romaniadata, only sequences from IDU (whose sequence identifiers included the letters “DU”) were includedin the analysis. Kao et al. [163] (MSM/Taiwan) found a strong association in their study populationbetween subtype and risk group - subtype B was most often associated with MSM, whereas IDU wereusually infected with a circulating recombinant form. Since there were many more subtype B sequencesin their data than sequences of other subtypes, we restricted our analysis to the subtype B sequences andlabelled this dataset as MSM. Two datasets (HET/Uganda and HET/Malawi) included both env and gagsequences. Each gene was analyzed separately to assess the robustness of netabc to the particular HIVgene sequence used to estimate a transmission tree. The IDU/Estonia data also sequenced both genes,but the highly variable coverage and high homology of the gag sequences made it impossible to obtain asufficiently large block of non-identical sequences to analyze. Therefore, we analyzed only env for this540.0 0.5 1.0 1.5 2.0α1000 2000 3000 4000 5000I1 2 3 4 5m0 5000 10000 15000Napproximate posterior densityFigure 2.12: One-dimensional marginal posterior distributions of BA model parameters estimated withhigh error by netabc from a simulated transmission tree. Dashed lines indicate true values, solid linesindicate posterior means, and shaded areas show 95% highest posterior density intervals.dataset.Each env sequence was aligned pairwise to the HXB2 reference sequence (GenBank accession num-ber K03455), and the hypervariable regions were clipped out with BioPython version 1.66+ [156]. Se-quences were multiply aligned using MUSCLE version 3.8.31 [166], and alignments were manuallyinspected with Seaview version 4.4.2 [167]. Duplicated sequences were removed with BioPython. Phy-logenies were constructed from the nucleotide alignments by approximate maximum likelihood usingFastTree2 version 2.1.7 [54] with the generalized time-reversible (GTR) model [168]. Transmissiontrees were estimated by rooting and time-scaling the phylogenies by root-to-tip regression, using a mod-ified version of Path-O-Gen (distributed as part of BEAST [169]) as described previously [21]. Due tothe removal of duplicated sequences, all estimated transmission trees were fully binary.To check if our results were robust to the choice of phylogenetic reconstruction method, we built andreanalyzed phylogenies for the datasets with the lowest and highest estimated α values (mixed/Spain andIDU/Estonia) with RAxML [55] with the GTR+Γ model of sequence evolution and rate heterogeneity.The trees were rooted and time-scaled with Least Square Dating [LSD, 59]. For expediency, the analysiswas run with the prior m ∼ DiscreteUniform(2,5), which defines a smaller total search space than theprior allowing m = 1.Four of the datasets (MSM/Shanghai, HET/Botswana, HET/Uganda, and MSM/USA) were initially55Figure 2.13: Two-dimensional marginal posterior distributions of BA model parameters estimated withhigh error by netabc from a simulated transmission tree. White circles indicate true values, magentadiamonds indicate posterior means.larger than the others, containing 1265, 1299, 1026/915 (env/gag), and 648 sequences respectively.To ensure that our results would not be affected by the dataset size, we reduced these to a numberof sequences similar to the smaller datasets. For the MSM/Shanghai dataset, we detected a cluster ofsize 280 using a patriotic distance cutoff of 0.02 as described previously [91]. Only sequences withinthis cluster were carried forward. For the HET/Uganda, HET/Botswana, and MSM/USA datasets, nolarge clusters were detected using the same cutoff, so we analyzed subsets of size 225, 180, and 180respectively. The subset of the HET/Uganda data was chosen by eye such that the individuals weremonopolistic in both the gag and env trees. The other subsets were arbitrarily chosen subtrees fromphylogenies of the complete datasets.For all datasets, we used the priors α ∼ Uniform(0,2) and N and I jointly uniform on the region{n≤ N ≤ 10000, n≤ I ≤ 10000, I ≤ N}, where n is the number of tips in the tree (see table 2.7). Sincethe value m = 1 produces networks with no cycles, which we considered fairly implausible, we ran oneanalysis with the prior m ∼ DiscreteUniform(1,5), and one with the prior m ∼ DiscreteUniform(2,5).The other parameters to the SMC algorithm were the same as used for the simulation experiments,except that we used 10000 particles instead of 1000 to increase the accuracy of the estimated posterior.56Reference Sequences (n) Location Risk group GeneZetterberg et al. [157] 171 Estonia IDU envNiculescu et al. [158] 136 Romania IDU polunpublished 399 British Columbia, Canada IDU polNovitsky et al. [159]180 Mochudi, Botswana HET envNovitsky et al. [160]McCormack et al. [161] 141/154 Karonga District, Malawi HET env/gagGrabowski et al. [162] 225 Rakai District, Uganda HET env/gagWang et al. [29] 173 Beijing, China MSM polKao et al. [163] 275 Taiwan MSM polLittle et al. [28] 180 San Fransisco, USA MSM polLi et al. [164] 280 Shanghai, China MSM polCuevas et al. [165] 287 Basque Country, Spain mixed polTable 2.7: Characteristics of published HIV datasets analyzed with netabc. Abbreviations: MSM, menwho have sex with men; HET, heterosexual; IDU, injection drug users. The HET data were sampled froma primarily heterosexual risk environment but did not explicitly exclude other risk factors. The numberof sequences column indicates how many sequences were included in our analysis; there may have beenadditional sequences linked to the study which we excluded for various reasons (see methods).This was computationally feasible due to the small number of runs required for this analysis.Empirical studies of contact networks often report the exponent γ of the power law degree distribu-tion [22–25, 95, 144]. Although the BA model does not produce networks with true power law degreedistributions except when α = 1, the power law still provides a reasonable approximation to the slopewhen α 6= 1 (fig. B.25). To compare our results to existing network literature we simulated 100 net-works each according to the posterior mean parameter estimates obtained for each investigated dataset.Although the BA model does not produce power law networks except when α = 1, simulations showthat the power law fit still captures the slope of the degree distribution reasonably well (fig. B.25). γ wascalculated for each network using the fit power law function in igraph, with the ‘R.mle’ implementa-tion. The median of the 100 γ values was taken as a point estimate for the associated dataset.2.3.2 ResultsPosterior mean point estimates and 50% and 95% HPD intervals for each parameter are shown infig. 2.16. Figure B.26 shows point estimates and HPD intervals obtained when the value m = 1 wasdisallowed by the prior. Since the results indicated that m = 1 was the most credible value for severaldatasets, all results discussed henceforth are for the prior m ∼ DiscreteUniform(1,5) unless otherwisestated.Posterior mean point estimates for the preferential attachment power α were all sub-linear, rangingfrom 0.83 for the IDU/Estonia data to 0.27 for the mixed/Spain data. When aggregated by risk group,the average estimates were 0.73 for IDU, 0.41 for primarily heterosexual risk, and 0.37 for MSM.These values were obtained with gag for the datasets where both gag and env were sequenced, butthe estimates for α did not change appreciably between the two genes (fig. 2.17). There was a large57amount of uncertainty associated with these estimates. 95% HPD intervals were very wide for mostdatasets, often encompassing almost the entire range from 0 to 1 (fig. 2.16). For all the datasets exceptHET/Malawi and MSM/Beijing, the posterior mean was either outside, or very close to the border of, the50% HPD intervals. This indicates that the posterior distributions were diffuse with heavy tails, ratherthan having most of their mass around the mode.For all but the HET/Botswana data, the posterior mean estimates for the prevalence I were between373 (IDU/Estonia) and 1391 (MSM/Shanghai). The HET/Botswana data had a much higher estimatedI value (5432) than the other datasets, with a very wide 95% HPD interval covering almost the entireprior region (fig. 2.16). There was no significant correlation between the number of sequences in thetree and the estimated prevalence (Spearman correlation, p = 0.7), indicating that the higher prevalenceestimates were not simply due to increased sampling density. When both gag and env sequences wereanalyzed, the estimates from the env data were higher (HET/Uganda, 939 for gag vs. 1615 for env;HET/Malawi, 724 for gag vs. 845 for env). As with α , many of the posterior means were outside the50% HPD intervals, suggesting diffuse posterior distributions.The posterior means of m were equal to one for zero of the datasets analyzed. The widths of the 95%HPD intervals varied from 0 (all the mass on the estimated value) to 5 (the entire prior region). Estimatesof N were mostly uninformative, with very similar estimates for all datasets (mean 6212, range 5881 -6882). This was similar to the pattern observed for the synthetic data, where the posterior mean alwaysfell around the upper two-thirds mark of the region allowed by the prior (fig. B.23).When the value m = 1 was disallowed by the prior, the separation in α between the IDU datasetsand the others became more striking (fig. B.26). All three IDU datasets had estimated α values at orabove 1. The estimate for the MSM/Beijing data was slightly lower (0.85) and the estimates for theseven remaining non-IDU datasets were bounded above by 0.58. The values of I were fairly robustto the choice of prior (compare figs. 2.16 and B.26), although the 95% HPD intervals were slightlynarrower (average width 2244 for m≥ 1 and 1848 for m≥ 2). The posterior means of m for all but theHET/Botswana data took on the value 3 with this prior, with the HPD intervals spanning the entire priorregion. This is very similar to the results observed for m on simulated data (table 2.4), and suggeststhat m is not identifiable from these data with this prior. The results for N did not change appreciablybetween the two choices of prior.For the two datasets we reanalyzed using RAxML [55] and LSD [59], α was relatively robust to thechoice of method (??, posterior means 0.48 vs. 0.48 for mixed/Spain and 1.02 vs. 1.12 for IDU/Estonia).However, the estimates of I were about twice as high when RAxML was used instead of FastTree toreconstruct the trees (228 vs. 437 for IDU/Estonia, 816 vs. 1949 for mixed/Spain).To compare our results to the existing network literature, we estimated values of the power lawexponent γ for each of the datasets investigated, using the posterior median estimates. The results aresummarized in table 2.8. All of the estimated exponents were in the range 2≤ γ ≤ 2.5, which is on thelower end of the range 2≤ γ ≤ 4 reported in the literature. Because the estimates of N were uninforma-tive, we also simulated another set of networks with the estimated I value, rather than N, as the numberof nodes. However, none of the estimated γ values were changed by this operation (not shown).58Dataset γmixed/Spain (Cuevas et al. 2009) 2.09MSM/Shanghai (Li et al. 2015) 2.10MSM/USA (Little et al. 2014) 2.10MSM/Taiwan (Kao et al. 2011) 2.42MSM/Beijing (Wang et al. 2015) 2.12HET/Uganda (Grabowski et al. 2014) 2.09HET/Malawi (McCormack et al. 2002) 2.32HET/Botswana (Novitsky et al. 2013 & 2014) 2.44IDU/Canada (unpublished) 2.12IDU/Romania (Niculescu et al. 2015) 2.14IDU/Estonia (Zetterberg et al. 2004) 2.41Table 2.8: Estimated power law exponents for six HIV datasets based on posterior mean point estimatesof BA model parameters. 100 networks were simulated using posterior mean parameter estimates ob-tained with netabc. The power law exponent γ was estimated for each, and the median of those estimateswas used as a point estimate for the corresponding dataset.59l lll l l ll0.00.51.01.52.0none I m N I+m I+N m+NI+m+Nfixed parametersαllll llll10002000300040005000none α m Nα+mα+N m+Nα+m+Nfixed parametersIlll lll ll12345none α I Nα+ Iα+N I+Nα+ I+Nfixed parametersmllllllll050001000015000none α I mα+ Iα+m I+mα+ I+mfixed parametersNFigure 2.14: Marginal posterior mean estimates (points), 50% HPD intervals (notches), and 95% HPDintervals (lines) of BA model parameters when some parameters (x-axis) were fixed. Parameters werefixed by specifying a Dirac-delta prior at the true value. True values are indicated by horizontal dashedlines.60ll0.0 0.5 1.0 1.5 2.0αll1000 2000 3000 4000 5000Ill1 2 3 4 5mll0 5000 10000 15000NSMC settingslllowhighFigure 2.15: Comparison of BA parameter estimates obtained with netabc on the same dataset withdifferent SMC settings. Points are posterior means, notches are 50% HPD intervals, and lines are 95%HPD intervals.61ll0.0 0.5 1.0 1.5 2.0αll0 2500 5000 7500 10000Ill1 2 3 4 5mll0 2500 5000 7500 10000Nllmixed/Spain (Cuevas et al. 2009)MSM/Shanghai (Li et al. 2015)MSM/USA (Little et al. 2014)MSM/Taiwan (Kao et al. 2011)MSM/Beijing (Wang et al. 2015)HET/Uganda (Grabowski et al. 2014)HET/Malawi (McCormack et al. 2002)HET/Botswana (Novitsky et al. 2013 & 2014)IDU/Canada (unpublished)IDU/Romania (Niculescu et al. 2015)IDU/Estonia (Zetterberg et al. 2004)Figure 2.16: Posterior means (points), 50% HPD intervals (notches), and 95% HPD intervals (lines)for parameters of the BA network model, fitted to eleven HIV datasets with netabc. Legend labelsindicate risk group and country of origin. Abbreviations: IDU, injection drug users; MSM, men whohave sex with men; HET, heterosexual. Note that posterior means can fall outside of HPD intervals ifthe distribution is diffuse.62ll0.0 0.5 1.0 1.5 2.0αll0 2500 5000 7500 10000Ill1 2 3 4 5mll0 2500 5000 7500 10000NdatasetlHET/Uganda (Grabowski et al. 2014)HET/Malawi (McCormack et al. 2002)genellenvgagFigure 2.17: Posterior means (points), 50% HPD intervals (notches), and 95% HPD intervals (lines) forparameters of the BA network model, fitted to two HIV datasets where both gag and env genes weresequenced.632.4 Discussion2.4.1 Netabc: uses, limitations, and possible extensionsContact networks can have a strong influence on epidemic progression, and are potentially useful as apublic health tool [28, 29]. Despite this, few methods exist for investigating contact network parametersin a phylodynamic framework [although see 86, 95, 97, 106, 170, for related work]. Kernel-ABC is amodel-agnostic method which can be used to investigate any quantity that affects tree shape [21]. Inthis work, we developed netabc, a method based on kernel-ABC to infer the parameters of a contactnetwork model. The method can be used to infer parameters of any network model, as long as it allowssimulated networks to be easily generated. We have included generators for the BA model discussedhere, as well as the Erdo˝s-Re´nyi (ER) [171] and Watts-Strogatz (WS) [87] network models. Instructionsfor adding additional models are available in the project’s online documentation. We have made netabcpublicly available at github.com/rmcclosk/netabc under a permissive open source license, to encourageother researchers to apply and extend our method.Several alternative network models and modelling frameworks have been developed which mayprovide useful future targets for ABC. Waring models [26, 172] are a more flexible type of preferentialattachment model which permit a subset of attachments to be formed non-preferentially. These modelswere used by Leigh Brown et al. [95] to characterize the transmission network in the United Kingdom.Exponential random graph models (ERGMs) [173] are a flexible and expressive parameterization ofcontact networks in terms of statistics of network features such as pairs and triads. Goodreau [112] eval-uated the effect of several different ERGM parameterizations on transmission tree shape and effectivepopulation size, and suggested the use of ERGMs as a general framework for estimation of epidemi-ological quantities related to HIV transmission. Except for a few special cases, simulating a networkaccording to an ERGM generally requires MCMC, which would be too computationally intensive tointegrate into netabc as it currently stands. To fit ERGM with ABC, one possibility would be to con-sider the network itself as a parameter to be modified by the MCMC kernel. Other network modellingframeworks include the partnership-centric formulation developed by Eames and Keeling [174] and thelog-linear adjacency matrix parameterization applied by Morris [77].The two-step process of simulating a contact network and subsequently allowing an epidemic tospread over that network carries with it the assumption that the contact network is static over the durationof the epidemic. Clearly this assumption is false, as people make and break partnerships on a regularbasis. Addressing the impact of this simplifying assumption is outside the scope of this work. However,the same assumption is made by most studies using contact network models in an epidemiologicalcontext [6, 39]. In principle, ABC could be adapted to dynamic contact networks by using a method suchas that developed by Robinson, Cohen, and Colijn [175] to simulate such a network, while concurrentlysimulating the spread of an epidemic.It is important to note that netabc takes a transmission tree as input, rather than a viral phylogeny. Inreality, true transmission trees are not available and must be estimated; these estimates are often basedon the viral phylogeny. Although this has been demonstrated to be a fair approximation [e.g. 62], and is64frequently used in practice [e.g. 37], the topologies of a viral phylogeny and transmission tree can differsignificantly [36, 53] due to within-host evolution and the sampling process. We have left the estimationof a transmission tree up to the user. In theory, it is possible to incorporate the process by which aviral phylogeny is generated along with a transmission tree into our method, for example by simulatingwithin-host dynamics. Indeed, progress has very recently been demonstrated on this front in a talk byGiardina [176], who have independently developed in a method similar to ours to fit contact networkmodels to phylogenetic data that additionally incorporates a within-host evolutionary model. Althoughthis may be an avenue for future extension, we felt that it would obscure the primary purpose of thiswork, which is to study contact network parameters. In addition, there are a number of different methodsavailable for inferring transmission trees [31, 53, 64, 65, 67], some of which incorporate geographicand/or epidemiological data not accommodated by our method. We therefore felt it would be best toallow researchers to use their own preferred method of constructing a transmission tree.Our implementation of SMC uses a simple multinomial scheme to sample particles from the pop-ulation according to their weights. Several other sampling strategies have been developed [124], and itis possible that the use of a more sophisticated technique might increase the algorithm’s accuracy. Fi-nally, the ABC-SMC algorithm is computationally intensive, taking about a day when run on 20 cores inparallel with the settings described in the methods section. Implementing parallelization using messagepassing interface (MPI), rather than POSIX threads as we have done here, would allow the program tobe run over a larger number of cores on multiple CPUs in parallel.2.4.2 Analysis of Baraba´si-Albert model with synthetic dataThe preferential attachment power α had a very strong influence on tree shape in the range of valueswe considered (figs. 2.5 and 2.6). Although the tree kernel was the most effective classifier for α , aSackin’s index of tree imbalance performed nearly as well (fig. 2.7). This result was intuitive: high αvalues produce networks with few well-connected “superspreader” nodes which are involved in a largenumber of transmissions, resulting in a highly unbalanced ladder-like tree structure (fig. 2.5). Thereappeared to be weaker identifiability for α < 1 than for α ≥ 1 (fig. 2.9 and table 2.4), which may bepartially explained by the relationship between α and the power law exponent γ (fig. B.24). Althoughthe degree distributions do not truly follow a power law for α 6= 1, the fitted exponent still capturesthe slope of the degree distribution reasonably well (fig. B.25), and distributions with similar fitted γvalues appear qualitatively similar in other respects. The γ values fitted to α = 0 and α = 0.5 are nearlyidentical (about 2.28 f orα = 0 and 2.33 f orα = 0.5 with N = 5000 and m = 2). In other words, thedegree distributions of networks with α < 1 are similar to each other, which may result in similarity ofcorresponding transmission trees as well.The I parameter, representing the prevalence at the time of sampling, was also generally identifiable,although it was slightly over-estimated for both cases we considered with ABC. Sackin’s index wasbetter able to capture the effect of I on tree shape than the nLTT (figs. 2.7 and B.8), suggesting thatthis parameter impacts the distribution of branching times in the tree more than the topology. In ahomogeneously-mixed population, branching times can be modelled by the coalescent process [46], in65our case under the SI model [52]. Although our populations are not homogeneously mixed, the forceswhich affect the distributions of branching times still apply. In our simulations, we assumed that alldiscordant edges shared the same transmission rate, so that the waiting time until the next transmissionin the entire network was always inversely proportional to the number of discordant edges. In the initialphase of the epidemic, when I is small, each new transmission results in many new discordant edges.Hence, there is an early exponential growth phase, producing many short branches near the root of thetree. As the epidemic gets closer to saturating the network, the number of discordant edges decays,causing longer waiting times.The number of nodes in the network, N, exhibited the most variation in terms of its effect on treeshape. There was almost no difference between trees simulated under different N values when the num-ber of infected nodes I was small (figs. B.10 and B.14). There is an intuitive explanation for this result,namely that adding additional nodes does not change the edge density or overall shape of a BA net-work. In fact, this is why the degree distributions of these networks are called “scale-free.” This canbe illustrated by imagining that we add a small number of nodes to a network after the epidemic sim-ulation has already been completed. It is possible that none of these new nodes attains a connection toany infected node. Thus, running the simulation again on the new, larger network could produce theexact same transmission tree as before. On the other hand, when I is large relative to N, the coalescentdynamics discussed above also apply. The waiting times until the next infection increase, resulting inlonger coalescence times toward the tips. The relative accuracy of the nLTT in these situations (figs. 2.7and B.10) corroborates this hypothesis, as the nLTT uses only information about the coalescence times.When all BA parameters were simultaneously estimated with ABC, N was nearly always over-estimatedby approximately a factor of two (fig. B.23 and table 2.4). One factor which may have contributed tothis bias was our choice of prior distribution. Since the prior for I and N was jointly uniform on a regionwhere I ≤ N, more prior weight was assigned to higher N values. Another contributing factor relates tothe dynamics of the SI model and the coalescent process and is discussed below.I and N were both systematically over-estimated by netabc, although the bias was more severe for Nthan for I. The number of infected individuals follows a logistic growth curve under the SI model. Thiskind of growth curve has three qualitative phases: a slow ramp-up, an exponential growth phase, and aslow final phase when the susceptible population is almost depleted. The waiting times until the nexttransmission, which determine the coalescence times in the tree, are dependent on the growth phase ofthe epidemic. Therefore, we hypothesize that it is the growth phase at the time of sampling which mostaffects tree shape, rather than the specific values of I or N. To investigate this hypothesis, we simulatedtransmission trees over networks on a grid of I and N values in the region of uniform prior density. Wefit logistic growth curves to the proportion of infected individuals over time, and calculated the first andsecond derivatives of these curves (with respect to time) at the time of transmission tree sampling. Thesederivatives give us an indication of the growth rates of the epidemics at the time of sampling. As shownin fig. 2.18, there are bands along which both derivatives are similar which contain the values we tested.These bands span mostly higher values of N and I than the true values. Therefore, if N and I are free tovary (as is the case in ABC), and our hypothesis is true, both parameters will tend to be overestimated66due to being less identifiable within their own band. However, when N is fixed at 5000, the derivativesvary substantially along the I-axis, which explains why the grid search estimates of I were accurate andunbiased (figs. 2.8 and B.20). We also note the resemblance of the contour surface of fig. 2.18 to thetwo-dimensional marginal posterior distributions on I and N obtained with simulated data (figs. 2.11and 2.13). In addition, both the accuracy and precision of the ABC estimates of I improved substantiallywhen N was marginalized out (fig. 2.14).100020003000400050000 5000 10000 15000NI−1.0−0.50.00.51.0d2I/dt20.10.20.30.40.50.60.7dI/dtFigure 2.18: First and second time derivatives of epidemic growth curves at time of sampling for variousvalues of I and N. Networks were simulated under the BA model with α = 1.0, m = 2, and N variedalong the values shown on the x-axis. Transmission trees were sampled at the time when I nodes wereinfected (y-axis). Logistic growth curves were fit to epidemic trajectories derived from the transmissiontrees, and their first and second derivatives were calculated at the time of sampling. Contours show firstderivatives, while colours indicate second derivatives. Values of I and N used in simulation experimentswith ABC are indicated by diamonds.The m parameter, which controls the number of connections added to the network per vertex, didnot have a measurable impact on tree shape and was not identifiable with ABC, nor with grid searchwhen the other parameters were held fixed. The exception to this was the value m = 1, which producesnetworks without cycles whose associated trees were more easily distinguished (figs. 2.8 and B.17).However, all the analyses presented here did not take the absolute size of the transmission trees intoaccount, as the branch lengths were rescaled by their mean. Because higher m values imply higher edgedensity, an epidemic should spread more quickly for higher m than lower m with the same per-edgetransmission probability. Hence, considering the absolute height of the trees may improve our method’sability to reconstruct m. It was also pointed out to us by an anonymous reviewer that for a fixed I, an67infected node may only end up transmitting along a fraction of its outgoing edges, which could maskthe presence of the extra edges associated with higher m.In addition to the tree height, many summary statistics have been developed to capture particulardetails of tree shape [177]. Two of these, Sackin’s index and the ratio of internal to terminal branchlengths, were correlated with every BA parameter. Classifiers based on Sackin’s index and the nLTTsimilarity measure performed well in some cases, though poorly in others. ABC is often applied usinga vector of summary statistics [126, 178], rather than a kernel-based similarity score as we have donehere. Methods have been developed to select an optimal combination of summary statistics for a giveninference task [147]. Hence, an avenue for future improvement of our method may be the inclusion ofadditional summary statistics to supplement the tree kernel. In addition, all four BA parameters weremore accurately classified when the number of tips in the transmission trees was larger, underscoringthe importance of adequate sampling for accurate phylodynamic inference [160].For α and I, the credible intervals attained from the marginal ABC target distributions were narrowerthan those obtained through grid search. This was likely due to the fact that SMC employs importancesampling to approximate the posterior distribution, while grid search simply calculates a distance met-ric which may not have any resemblance to the posterior. Our method of finding credible intervals fromkernel scores along the grid, by normalizing the scores to resemble a probability distribution, was some-what ad hoc, which may also have played a role. Regardless, this result indicates that there is benefit toapplying the more sophisticated method, even if values for some of the parameters are known a priori,and especially if credible intervals are desired on the parameters of interest.As noted by Lintusaari et al. [179], uniform priors on model parameters may translate to highlyinformative priors on quantities of interest. We observed a non-linear relationship between the prefer-ential attachment power α and the fitted power law exponent γ (fig. B.24, although see the discussionin section 1.3.2). Therefore, placing a uniform prior on α between 0 and 2 is equivalent to placing aninformative prior that γ is close to 2. If we were primarily interested in γ rather than α , a more sensi-ble choice of prior might have a shape informed by fig. B.24 and be bounded above by approximatelyα = 1.5. This would bound γ in the region 2≤ γ ≤ 4 commonly reported in the network literature [22–25, 95]. We note however that Jones and Handcock [180] estimated γ values greater than four for somedatasets, in one case as high as 17, indicating that consideration of a wider range of permitted γ valuesmay be warranted.The combination of method, model, and priors we employed did not produce perfect estimates of anyof the parameters. The estimates of α were the most accurate, although the variance of the estimates washigh and the confidence intervals were wide for α < 1 (fig. 2.9 and table 2.4). The estimates of N and Iwere both upwardly biased, and the estimates of m were largely uninformative. Despite these limitations,a major result of our investigation is that some contact network parameters have a measurable impacton tree shape which can be used to perform statistical inference. Further refinements to netabc, as wellas the use of more sophisticated network models, may improve the accuracy and precision of theseestimates.682.4.3 Application to real world HIV dataOur investigation of published HIV datasets indicated heterogeneity in the contact network structuresunderlying several distinct local epidemics. When interpreting these results, we caution that the BAmodel is quite simple and most likely misspecified for these data. In particular, the average degree of anode in the network is equal to 2m, and therefore is constrained to be a multiple of 2. Furthermore, weconsidered the case m = 1, where the network has no cycles, to be implausible and therefore assigned itzero prior probability in one set of analyses. This forced the average degree to be at least four, which maybe unrealistically high for sexual networks. The fact that the estimated values of α differed substantiallyfor several datasets depending on whether or not m = 1 was allowed by the prior is further evidence ofthis potential misspecification. However, we note that for two of the datasets, the estimated values ofα did not change much between priors, and the estimates of I were robust to the choice of prior for alldatasets studied. More sophisticated models, for example models incorporating heterogeneity in nodebehaviour, are likely to provide a better fit to these data.Preferential attachment power is sub-linear and higher for IDUFor all datasets we examined, the posterior mean estimates for α were sub-linear, ranging from 0.27to 0.83. The sub-linearity is consistent with the results of de Blasio, Svensson, and Liljeros [105], whodeveloped a statistical inference method to estimate the parameters of a more sophisticated preferentialattachment model incorporating heterogeneous node behaviour. When used to analyze population-levellongitudinal partner count data, they found α values ranging from 0.26 to 0.62 depending on the genderand time period considered. Our estimates of α for the IDU/Romania was above this range under bothpriors, as were the estimates for the MSM/Beijing data and the BC data when m = 1 was disallowed bythe prior. The dataset investigated by de Blasio, Svensson, and Liljeros [105] was derived from a surveyof a random sample of the Norwegian population, whereas our investigation focused on datasets fromknown phylogenetic or geographic clusters of HIV-infected persons. It is therefore unsurprising thatwe detected stronger preferential attachment dynamics in some cases. For instance, random samplingis much less likely to discover the high-degree nodes characterizing the tail of the degree distribution,simply because those individuals are rare in the general population. In addition, it is plausible that HIV-positive individuals are more likely to be highly connected in their sexual networks, as the odds ofacquiring HIV increase with the number of unprotected sexual contacts.Both de Blasio, Svensson, and Liljeros [105] and the HET/Botswana data studied populations whoseprimary risk factor for HIV infection was heterosexual contact. de Blasio, Svensson, and Liljeros explic-itly excluded reported homosexual contacts; Novitsky et al. did not, but noted that heterosexual contactis the primary mode of transmission in Botswana where the study was done. In the first of the two pa-pers describing the Botswana study [159], the authors noted that their sample was gender-biased, beingcomposed of approximately 75% women. Our estimate of α for these data was 0.55 or 0.53, dependingon the prior on m. Similarly, de Blasio, Svensson, and Liljeros [105] estimated 0.54, 0.57, and 0.29 for3-year, 5-year, and lifetime partnership networks respectively for the female portion of their sample.69For both choices of prior on m, the datasets derived from IDU populations had a higher estimatedpreferential attachment power than the other datasets (figs. 2.16 and B.26). This finding is in line withDombrowski et al. [103], who reanalyzed a network of IDUs in Brooklyn, USA, collected between1991 and 1993 [181]. They found that the IDU network resembled a BA network much more closelythan other social and sexual networks, and offered sociological explanations for the apparent preferen-tial attachment dynamics in this population. Importantly, from a public health perspective, the authorsasserted that the removal of random individuals from IDU networks may have the paradoxical effect ofincreasing the network’s epidemic susceptibility. When low-degree nodes are removed, as would occurduring a police crackdown, their network neighbours may turn to well-known community members foradvice or supplies, thus increasing the connectivity of these high-degree nodes.Unfortunately, the sub-linear region for α identified by both de Blasio, Svensson, and Liljeros [105]and netabc is also the region of poorest identifiability (figs. 2.8 and 2.9). This was reflected in the highlevel of uncertainty in the estimates, with most 95% HPD intervals covering the majority of the range[0, 1]. The value α = 0.5 was contained in the 95% HPD interval for every dataset; consequently, itis not possible to say with high confidence that any of the α values are different from each other. Insynthetic data, the confidence intervals around α narrowed when other parameters were marginalizedout (fig. 2.14). Thus, it is possible that estimates of α could be made more precise by specifying eitherexact values or informative priors on the other BA parameters. We argued informally in section 2.2.1 thatI, m, and N can sometimes be estimated by standard epidemiological methods, although the difficultymay depend on the specific epidemic under consideration.Power law exponent is within reported range, but hard to interpretIn order to compare our results to existing literature on networks and distributions of partner counts,we have reported estimated values for the power law exponent γ of the real data sets we evaluated.However, the posterior means for α for all six datasets were less than one; the degree distributionsin this parameter range are stretched exponential, not power law [104]. As we show in fig. B.25, thepower law fit does capture the slope of the degree distribution fairly well, but the results should stillbe interpreted cautiously. Krapivsky, Redner, and Leyvraz [104] showed that the power law distributioncan be maintained, with γ tuned to any desired value, by a straightforward modification of the BA model.The authors define the “connection kernel” Ak the probability of a new connection to a node of degreek, up to a normalizing constant. In the BA model as we have presented it here, Ak = kα + 1. TakingAk as any asymptotically linear function will result in a power law distribution, with the exponent γdetermined by the properties of Ak. Implementing such a model would be straightforward and seems anatural next step toward improving the realism of the BA model.Table 2.9 shows some estimated γ values reported in the network literature. Our estimates are mostsimilar to those of Liljeros et al. [23] and Rothenberg and Muth [144], but interpretation of the exactvalues is challenging. Estimates from the literature mainly fall between 2 and 4, although substantialvariation within this range exists, even for networks with apparently similar demographics. For example,Schneeberger et al. [24] estimate higher γ values for heterosexual females than heterosexual males, while70Cle´menc¸on et al. [25] find the opposite. Some of these differences are likely due to the differing timeintervals considered in each study – some investigate the distribution of lifetime partners, while othersconsider only recent partnerships, such as in the past year. However, the effect of the time period on γ isnot readily obvious from the available estimates.reference risk group observation period estimated γColgate et al. [22] MSM 1 year 3Cle´menc¸on et al. [25] MSM 20 years 3.02Schneeberger et al. [24] MSM1 year 1.57-1.75six years 1.85lifetime 1.64-1.75Leigh Brown et al. [95] MSM 7 years 2.7Cle´menc¸on et al. [25]HET, females 20 years 2.71HET, males 20 years 3.36Schneeberger et al. [24]HET, females1 year 3.10-3.68lifetime 2.99-3.09HET, males1 year 2.48-2.85lifetime 2.46-2.47Liljeros et al. [23]mixed sexual, females1 year 2.54lifetime 2.1mixed sexual, males1 year 2.31lifetime 1.6Rothenberg and Muth [144] mixed sexual 1-15 years 2.12-2.65Jones and Handcock [180]mixed sexual, females not stated 3.84-17.04mixed sexual, males not stated 3.03-5.43Latora et al. [182]mixed sexual, females 1 year 3.9mixed sexual, males 1 year 2.93Rothenberg and Muth [144]IDU 1-2 years 2.08-2.87mixed sexual and IDU 3-5 years 2.20-4.74Dombrowski et al. [103] IDU 2 years 1.8Table 2.9: Values of power law exponent γ previously reported in the literature. In the risk group col-umn, “mixed sexual” indicates participants were not selected on the basis of sexual orientation; thesevalues likely reflect primarily heterosexual dynamics. Leigh Brown et al. [95] studied a phylogeneti-cally estimated transmission network, rather than a contact network. The values presented by Jones andHandcock [180] were in the context of arguing that the power law distribution is not suitable for usewith network data.HIV prevalence estimates are often discordant with epidemiological informationThe true HIV prevalence in a population can be difficult to estimate for several reasons. HIV-infectedindividuals may be asymptomatic for months or years, possibly delaying their awareness of their status.In many contexts, the risk factors for acquisition of HIV are illegal or stigmatized, which may representa barrier to testing, treatment, and/or disclosure of status. Several of the studies whose data we ana-lyzed report the number of new infections over the study period. We expect these numbers to be lower,perhaps substantially, than the true prevalence. We also note that our simulation study showed that the71estimates of I obtained with netabc were upwardly biased, the severity increasing with the true valueof I. In addition, our initial exploratory analysis showed that the identifiability of I decreases with thenumber of sampled tips (figs. B.8 and B.12). In real world studies, the proportion of infected individualssampled is usually low, which may impede our ability to estimate I. In the following, we discuss anyinformation available about the HIV prevalence in the analyzed study populations, and relate this to ourestimates. However, due to the aforementioned challenges, this information cannot be taken as absolute“ground truth”. Discrepancies between our results and the available data may indicate that the estimatesproduced by netabc are inaccurate, but they may also indicate that the available epidemiological datawas not sufficient to estimate the true prevalence. Since the estimates of I were largely robust to thechoice of prior on m (figs. 2.16 and B.26), we discuss here only the results obtained with the priorm∼ DiscreteUniform(1,5).The prevalence estimates for two IDU epidemics in Eastern Europe, IDU/Romania and IDU/Estonia,were 747 and 373. The authors of the IDU/Romania study [158] reported that 494 new HIV infectionswere diagnosed among the studied population (IDUs in Bucharest) during the study period, and thisvalue was contained in our 95% HPD interval (136 - 2379). On the other hand, the IDU/Estonia datawere collected from several locations in the country. Using their reported incidence estimates, therewere approximately 2000 new infections in the country during the study period, a value outside our95% HPD interval (171 - 1012). The authors indicated that their sample was “convenience based”; it ispossible that other, harder-to-reach individuals had no contacts with those sampled, reducing the appar-ent network size. In contrast, the IDU/Canada data was collected primarily from IDU in Vancouver’sdowntown east side, a population that has been intensively targeted by public health interventions. Com-bined with the routine clinical genotyping performed in British Columbia for all new HIV infections, itis not unreasonable to suppose that the sampled individuals (399) comprise the majority of HIV-positivenetwork members. This value was the lower bound of our estimated 95% HPD interval (384 - 3484),but it seems likely that the posterior mean of 1356 is an overestimate.The prevalence estimates for two MSM populations in major Chinese cities (MSM/Beijing andMSM/Shanghai) were 676 and 1391. The individuals sampled for the MSM/Beijing study were recruitedby following a prospective cohort of size 2000. 179 new infections were identified in the cohort duringthe study period, which is much lower than the estimated prevalence obtained with netabc. However, theauthors did not claim to have sampled the entire sexual network. In a nationwide survey, Wu et al. [183]estimated that there were 24,198 MSM living in Beijing, of whom 5.7%, or 1379, were HIV-infected,which is close to the upper limit of the 95% confidence interval of our estimate of I (175 - 1401). Thesame survey estimated the size of the MSM population in Shanghai 14,511, with an HIV prevalenceof 6.8% or approximately 987 individuals. This was within the 95% confidence interval of the estimateobtained with netabc for the MSM/Shanghai data (311 - 2822). It is worth noting that the authors ofthe IDU/Shanghai study [164] claimed that there were 80,000 MSM living in Shanghai, which is asignificantly different estimate than that obtained by Wu et al. [183]. The authors of the MSM/Taiwanstudy [163] stated that there were 18,378 HIV-infected individuals in Taiwan, of which approximately44% (8086 individuals) were MSM. This value is well outside of our 95% HPD (268 - 607), suggesting72that the entire MSM population does not comprise a single network. In fact, our estimate of I for thesedata (410), was closer to the number of MSM enrolled in the study (301) than to the epidemiologicallyestimated prevalence. The study focused exclusively on newly diagnosed individuals. On the other hand,our estimated prevalence for the MSM/USA data (482) was clearly an underestimate. The study enrolledalmost 648 HIV-infected individuals, which should have been a lower bound on I, although the valuewas contained in the 95% HPD interval (180 - 1112).The HET/Uganda, HET/Malawi, and HET/Botswana data all originated from populations in sub-Saharan Africa, where the dominant mode of transmission is believed to be heterosexual. The HET/Ugandastudy population included 1434 seropositive individuals. Since the majority of inhabitants of the studydistrict were enrolled in the study, it is likely that this number represents the majority of HIV-infectedpersons. Our estimate was slightly lower (939), but the value 1434 was contained within the 95% HPDinterval (226 - 2661). The HET/Malawi data represents possibly the oldest population-level HIV se-quencing effort, carried out on samples collected for other purposes between 1981 and 1989. Everyindividual in the study district was sampled; among these, a total of 200 were HIV-positive. Althoughthe HET/Malawi data had the lowest estimated I value among the heterosexual datasets, the posteriormean (724), was much higher than this value. Due to the thorough population-level screening, it is mostlikely that the posterior mean represents an overestimate, especially considering the upward bias ofABC estimates of I observed on simulated data (fig. B.23). It is also possible that the true sexual net-work extended outside the study’s geographic area to include a larger number of infected individuals.The authors of the HET/Botswana data [159, 160] estimated that there were 1731 HIV-positive individ-uals in the study area. HIV sequences were obtained from approximately 70% of these individuals. Theestimated prevalence we obtained was much higher (5432), with a 95% HPD interval spanning nearlythe entire prior region. The authors noted that the town where the study took place was located prox-imally to the country’s capital city, and suggested that frequent travel between the two locations mayhave facilitated linking of their sexual networks. Thus, the high estimate of I we obtained for these datamay include a larger network component based in the capital city.Cuevas et al. [165] reported that 620 newly infected and 1500 chronically infected patients had beensampled during the study period, for a total of 2120. This is substantially greater than the point estimateof 702 we obtained with ABC. This error may in part be due to model misspecification – it is highlyunlikely that a mixed risk group population would form a single, well-connected contact network of thetype generated by the BA model.Modelling assumptionsOur use of the BA model makes several simplifying assumptions. First, we assume homogeneity acrossthe network with respect to node behaviour and transmission risk. In reality, the attraction to high-degree nodes seems likely to vary among individuals, as does their risk of transmitting or contractingthe virus. We have also assumed that all transmission risks are symmetric, which is clearly false for allknown modes of HIV transmission, and that infected individuals never recover but remain infectiousindefinitely. These assumptions were made for the purpose of keeping the model as simple as possible,73since this is the very first attempt to fit a contact network model in a phylodynamic context. However,the Gillespie simulation algorithm built into netabc can handle arbitrary transmission and removal rateswhich need not be homogeneous across the network. Moreover, it is possible to use ABC to fit a modelwhich relaxes some or all of these assumptions, which may be a fruitful avenue for future investigation.Despite the possible misspecification, our estimates of the power law exponent γ were within the rangeof values reported in the literature (table 2.8), and our estimates of α were in rough agreement with aprevious study using survey-based epidemiology [105].74Chapter 3ConclusionDue to the rapid advancement of nucleotide sequencing technology, viral sequence data have becomeincreasingly feasible to collect on a population level. Through phylodynamic methods, these data offera window into epidemiological processes that would otherwise be virtually impossible to study on arealistic scale.This thesis developed netabc, a computer program implementing a statistical inference method forcontact network parameters from viral phylogenetic data. Netabc brings together the areas of viral phylo-dynamics and network epidemiology, which have only intersected in a very limited fashion thus far [39].The use of kernel-ABC, a likelihood-free method, makes it possible to fit network models to phylogenieswithout calculating intractable likelihoods.Although phylodynamic methods have been developed to fit a wide variety of epidemiological mod-els to phylogenetic data assuming homogeneous mixing [47, 184], our method is able to fit models notrequiring this assumption. We believe this capability will be of broad interest to the molecular evolutionand epidemiology community, as it widens the field of epidemiological parameters that may be investi-gated through viral sequence data. In addition, the characterization of local contact networks could bevaluable from a public health perspective, such as for investigating optimal vaccination strategies [8–10,89]. This information could assist in curtailing current epidemics, as well as preventing future epidemicsof different diseases over the same contact network.The particular model we have investigated uses a preferential attachment mechanism to generatescale-free networks resembling real-world social and sexual networks [22–24]. Of the four parameterswe considered, the preferential attachment power α was the most readily estimable. Estimating α withtraditional epidemiological methods is challenging due to the requirement of sampling the high-degreenodes making up the tail of the power law distribution, although approaches such as respondent-drivensampling [90] or collection of longitudinal partner count data [105] may be effective.In closing, netabc combines phylodynamics, contact network epidemiology, approximate Bayesiancomputation, and sequential Monte Carlo to provide a source of insight into network structures comple-mentary to traditional epidemiology.75Bibliography[1] William Heaton Hamer. The Milroy Lectures on Epidemic Disease in England: the Evidence ofVariability and of Persistency of Type. Bedford Press, 1906.[2] William O Kermack and Anderson G McKendrick. “A contribution to the mathematical theoryof epidemics”. In: Proceedings of the Royal Society of London A: Mathematical, Physical andEngineering Sciences. Vol. 115. 772. The Royal Society. 1927, pp. 700–721.[3] S Rushton and AJ Mautner. “The deterministic model of a simple epidemic for more than onecommunity”. In: Biometrika 42.1-2 (1955), pp. 126–132.[4] JAP Heesterbeek. Mathematical Epidemiology of Infectious Diseases: Model Building, Analysisand Interpretation. Vol. 5. John Wiley & Sons, 2000.[5] Roy M Anderson, Robert M May, and B Anderson. Infectious diseases of humans: dynamicsand control. Vol. 28. Wiley Online Library, 1992.[6] Shweta Bansal, Bryan T Grenfell, and Lauren Ancel Meyers. “When individual behaviour mat-ters: homogeneous and network models in epidemiology”. In: Journal of the Royal Society In-terface 4.16 (2007), pp. 879–891.[7] Marc Barthe´lemy et al. “Dynamical patterns of epidemic outbreaks in complex heterogeneousnetworks”. In: Journal of Theoretical Biology 235.2 (2005), pp. 275–288.[8] Matt J Keeling and Ken TD Eames. “Networks and epidemic models”. In: Journal of the RoyalSociety Interface 2.4 (2005), pp. 295–307.[9] Xiao-Long Peng et al. “Vaccination intervention on epidemic dynamics in networks”. In: Phys-ical Review E 87.2 (2013), p. 022813.[10] Junling Ma, P van den Driessche, and Frederick H Willeboordse. “The importance of contactnetwork topology for the success of vaccination strategies”. In: Journal of Theoretical Biology325 (2013), pp. 12–21.[11] Oliver G Pybus and Andrew Rambaut. “Evolutionary analysis of the dynamics of viral infectiousdisease”. In: Nature Reviews Genetics 10.8 (2009), pp. 540–550.[12] Erik M Volz, Katia Koelle, and Trevor Bedford. “Viral phylodynamics”. In: PLoS Computa-tional Biology 9.3 (2013), e1002947.[13] Donald B Rubin. “Bayesianly justifiable and relevant frequency calculations for the appliedstatistician”. In: The Annals of Statistics 12.4 (1984), pp. 1151–1172.76[14] Simon Tavare´ et al. “Inferring coalescence times from DNA sequence data”. In: Genetics 145.2(1997), pp. 505–518.[15] Yun-Xin Fu and Wen-Hsiung Li. “Estimating the age of the common ancestor of a sample ofDNA sequences”. In: Molecular Biology and Evolution 14.2 (1997), pp. 195–199.[16] Mark A Beaumont, Wenyang Zhang, and David J Balding. “Approximate Bayesian computationin population genetics”. In: Genetics 162.4 (2002), pp. 2025–2035.[17] Art FY Poon et al. “Mapping the shapes of phylogenetic trees from human and zoonotic RNAviruses”. In: PLoS ONE 8.11 (2013), e78122.[18] Mijung Park et al. “K2-ABC: Approximate Bayesian Computation with Kernel Embeddings”.In: stat 1050 (2015), p. 24.[19] Trevelyan McKinley, Alex R Cook, and Robert Deardon. “Inference in epidemic models withoutlikelihoods”. In: The International Journal of Biostatistics 5.1 (2009), pp. 1–40.[20] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. “An adaptive sequential Monte Carlo methodfor approximate Bayesian computation”. In: Statistics and Computing 22.5 (2012), pp. 1009–1020.[21] Art FY Poon. “Phylodynamic inference with kernel ABC and its application to HIV epidemiol-ogy”. In: Molecular Biology and Evolution 32.9 (2015), pp. 2483–2495.[22] Stirling A Colgate et al. “Risk behavior-based model of the cubic growth of acquired immunod-eficiency syndrome in the United States”. In: Proceedings of the National Academy of Sciences86.12 (1989), pp. 4793–4797.[23] Fredrik Liljeros et al. “The web of human sexual contacts”. In: Nature 411.6840 (2001), pp. 907–908.[24] Anne Schneeberger et al. “Scale-free networks and sexually transmitted diseases: a descriptionof observed patterns of sexual contacts in Britain and Zimbabwe”. In: Sexually TransmittedDiseases 31.6 (2004), pp. 380–387.[25] Ste´phan Cle´menc¸on et al. “A statistical network analysis of the HIV/AIDS epidemics in Cuba”.In: Social Network Analysis and Mining 5.1 (2015), pp. 1–14.[26] Mark S Handcock and James Holland Jones. “Likelihood-based inference for stochastic modelsof sexual network formation”. In: Theoretical Population Biology 65.4 (2004), pp. 413–422.[27] Albert-La´szlo´ Baraba´si and Re´ka Albert. “Emergence of scaling in random networks”. In: Sci-ence 286.5439 (1999), pp. 509–512.[28] Susan J Little et al. “Using HIV networks to inform real time prevention interventions”. In:PLoS ONE 9.6 (2014), e98443.[29] Xicheng Wang et al. “Targeting HIV prevention based on molecular epidemiology among deeplysampled subnetworks of men who have sex with men”. In: Clinical Infectious Diseases 61.9(2015), p. 1462.77[30] Ernst Heinrich Haeckel. Generelle Morphologie der Organismen. Verlag von Georg Reimer,1866.[31] T Jombart et al. “Reconstructing disease outbreaks from genetic data: a graph approach”. In:Heredity 106.2 (2011), pp. 383–390.[32] EF Harding. “The probabilities of rooted tree-shapes generated by random bifurcation”. In:Advances in Applied Probability (1971), pp. 44–77.[33] Luigi L Cavalli-Sforza and Anthony WF Edwards. “Phylogenetic analysis. Models and estima-tion procedures”. In: American Journal of Human Genetics 19.3 Pt 1 (1967), p. 233.[34] Sean Nee, Arne O Mooers, and Paul H Harvey. “Tempo and mode of evolution revealed frommolecular phylogenies”. In: Proceedings of the National Academy of Sciences 89.17 (1992),pp. 8322–8326.[35] Peter Buneman. “A note on the metric properties of trees”. In: Journal of Combinatorial Theory,Series B 17.1 (1974), pp. 48–50.[36] Rolf JF Ypma, W Marijn van Ballegooijen, and Jacco Wallinga. “Relating phylogenetic trees totransmission trees of infectious disease outbreaks”. In: Genetics 195.3 (2013), pp. 1055–1062.[37] Tanja Stadler and Sebastian Bonhoeffer. “Uncovering epidemiological dynamics in heteroge-neous host populations using phylogenetic methods”. In: Philosophical Transactions of theRoyal Society B: Biological Sciences 368.1614 (2013).[38] Alexei J Drummond et al. “Measurably evolving populations”. In: Trends in Ecology & Evolu-tion 18.9 (2003), pp. 481–488.[39] David Welch, Shweta Bansal, and David R Hunter. “Statistical inference to advance networkmodels in epidemiology”. In: Epidemics 3.1 (2011), pp. 38–45.[40] Eben Kenah et al. “Algorithms linking phylogenetic and transmission trees for molecular infec-tious disease epidemiology”. In: arXiv preprint arXiv:1507.04178 (2015).[41] Eddie C Holmes et al. “Revealing the history of infectious disease epidemics through phy-logenetic trees”. In: Philosophical Transactions of the Royal Society B: Biological Sciences349.1327 (1995), pp. 33–40.[42] K Eames et al. “Six challenges in measuring contact networks for use in modelling”. In: Epi-demics 10 (2015), pp. 72–77.[43] Bryan T Grenfell et al. “Unifying the epidemiological and evolutionary dynamics of pathogens”.In: Science 303.5656 (2004), pp. 327–332.[44] David G Kendall et al. “On the generalized ”birth-and-death” process”. In: The Annals of Math-ematical Statistics 19.1 (1948), pp. 1–15.[45] Tanja Stadler et al. “Estimating the basic reproductive number from viral sequence data”. In:Molecular Biology and Evolution 29.1 (2012), pp. 347–357.78[46] John Frank Charles Kingman. “The coalescent”. In: Stochastic Processes and their Applications13.3 (1982), pp. 235–248.[47] Erik M Volz. “Complex population dynamics and the coalescent under neutrality”. In: Genetics190.1 (2012), pp. 187–201.[48] Masatoshi Nei and Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford UniversityPress, 2000.[49] Thomas Leitner. The Molecular Epidemiology of Human Viruses. Springer Science & BusinessMedia, 2002.[50] Eben Kenah et al. “Molecular infectious disease epidemiology: survival analysis and algo-rithms linking phylogenies to transmission trees”. In: PLOS Computational Biology 12.4 (2016),e1004869.[51] Jerry A Coyne and H Allen Orr. Speciation. Vol. 37. Sinauer Associates Sunderland, MA, 2004.[52] Erik M Volz et al. “Phylodynamics of infectious disease epidemics”. In: Genetics 183.4 (2009),pp. 1421–1430.[53] Matthew Hall, Mark Woolhouse, and Andrew Rambaut. “Epidemic reconstruction in a phylo-genetics framework: transmission trees as partitions of the node set”. In: PLoS ComputationalBiology 11.12 (2015), e1004613.[54] Morgan N Price, Paramvir S Dehal, and Adam P Arkin. “FastTree 2–approximately maximum-likelihood trees for large alignments”. In: PloS ONE 5.3 (2010), e9490.[55] Alexandros Stamatakis. “RAxML version 8: a tool for phylogenetic analysis and post-analysisof large phylogenies”. In: Bioinformatics 30.9 (2014), p. 1312.[56] Raj Shankarappa et al. “Consistent viral evolutionary changes associated with the progressionof human immunodeficiency virus type 1 infection”. In: Journal of Virology 73.12 (1999),p. 10489.[57] Bette Korber et al. “Timing the ancestor of the HIV-1 pandemic strains”. In: Science 288.5472(2000), pp. 1789–1796.[58] Alexei Drummond, G Oliver, Andrew Rambaut, et al. “Inference of viral evolutionary rates frommolecular sequences”. In: Advances in Parasitology 54 (2003), pp. 331–358.[59] Thu-Hien To et al. “Fast Dating Using Least-Squares Criteria and Algorithms”. In: SystematicBiology 65.1 (2016), p. 82.[60] Wen-Hsiung Li, Masako Tanimura, and Paul M Sharp. “Rates and dates of divergence betweenAIDS virus nucleotide sequences”. In: Molecular Biology and Evolution 5.4 (1988), pp. 313–330.[61] Ethan Romero-Severson et al. “Timing and order of transmission events is not directly reflectedin a pathogen phylogeny”. In: Molecular biology and evolution (2014), msu179.79[62] Thomas Leitner et al. “Accurate reconstruction of a known HIV-1 transmission history by phy-logenetic tree analysis”. In: Proceedings of the National Academy of Sciences 93.20 (1996),pp. 10864–10869.[63] Dimitrios Paraskevis et al. “Phylogenetic reconstruction of a known HIV-1 CRF04 cpx trans-mission network using maximum likelihood and Bayesian methods”. In: Journal of molecularevolution 59.5 (2004), pp. 709–717.[64] Eleanor M Cottam et al. “Integrating genetic and epidemiological data to determine transmissionpathways of foot-and-mouth disease virus”. In: Proceedings of the Royal Society of London B:Biological Sciences 275.1637 (2008), pp. 887–895.[65] RJF Ypma et al. “Unravelling transmission trees of infectious diseases by combining genetic andepidemiological data”. In: Proceedings of the Royal Society of London B: Biological Sciences279.1728 (2012), pp. 444–450.[66] Marco J Morelli et al. “A Bayesian inference framework to reconstruct transmission trees usingepidemiological and genetic data”. In: PLoS Computational Biology 8.11 (2012), e1002768.[67] Xavier Didelot, Jennifer Gardy, and Caroline Colijn. “Bayesian inference of infectious diseasetransmission from whole-genome sequence data”. In: Molecular Biology and Evolution 31.7(2014), pp. 1869–1879.[68] Arne O Mooers and Stephen B Heard. “Inferring evolutionary process from phylogenetic treeshape”. In: Quarterly Review of Biology (1997), pp. 31–54.[69] Kwang-Tsao Shao. “Tree balance”. In: Systematic Biology 39.3 (1990), pp. 266–276.[70] Mark Kirkpatrick and Montgomery Slatkin. “Searching for evolutionary patterns in the shapeof a phylogenetic tree”. In: Evolution (1993), pp. 1171–1181.[71] G Udny Yule. “A mathematical theory of evolution, based on the conclusions of Dr. JC Willis,FRS”. In: Philosophical Transactions of the Royal Society of London. Series B, ContainingPapers of a Biological Character 213 (1925), pp. 21–87.[72] Thijs Janzen, Sebastian Ho¨hna, and Rampal S Etienne. “Approximate Bayesian computation ofdiversification rates from molecular phylogenies: introducing a new efficient summary statistic,the nLTT”. In: Methods in Ecology and Evolution 6.5 (2015), pp. 566–575.[73] Christopher JC Burges. “A tutorial on support vector machines for pattern recognition”. In: DataMining and Knowledge Discovery 2.2 (1998), pp. 121–167.[74] Michael Collins and Nigel Duffy. “New ranking algorithms for parsing and tagging: kernelsover discrete structures, and the voted perceptron”. In: Proceedings of the 40th Annual Meetingof the Association for Computational Linguistics. Association for Computational Linguistics.2002, pp. 263–270.[75] Alessandro Moschitti. “Making tree kernels practical for natural language learning”. In: Euro-pean Chapter of the Association for Computational Linguistics. Vol. 113. 120. 2006, p. 24.80[76] Alden S Klovdahl. “Social networks and the spread of infectious diseases: the AIDS example”.In: Social Science & Medicine 21.11 (1985), pp. 1203–1216.[77] Martina Morris. “Epidemiology and social networks: modeling structured diffusion”. In: Socio-logical Methods & Research 22.1 (1993), pp. 99–126.[78] Jacob L Moreno. Who Shall Survive? Beacon House Inc., 1934.[79] JA Barnes. “Class and Committees in a Norwegian Island Parish”. In: Human Relations 7.1(1954), pp. 39–58.[80] Stanley Wasserman and Katherine Faust. Social Network Analysis: Methods and Applications.Vol. 8. Cambridge University Press, 1994.[81] Rebecca F Baggaley, Richard G White, and Marie-Claude Boily. “HIV transmission risk throughanal intercourse: systematic review, meta-analysis and implications for HIV prevention”. In: In-ternational journal of epidemiology (2010), dyq057.[82] John A Jacquez et al. “Modeling and analyzing HIV transmission: the effect of contact patterns”.In: Mathematical Biosciences 92.2 (1988), pp. 119–199.[83] Erik Volz and Lauren Ancel Meyers. “Susceptible–infected–recovered epidemics in dynamiccontact networks”. In: Proceedings of the Royal Society of London B: Biological Sciences274.1628 (2007), pp. 2925–2934.[84] David A Rolls et al. “A simulation study comparing epidemic dynamics on exponential ran-dom graph and edge-Triangle configuration type contact network models”. In: PloS ONE 10.11(2015), e0142181.[85] Tanja Stadler et al. “Insights into the early epidemic spread of Ebola in Sierra Leone providedby viral sequence data”. In: PLoS Currents Outbreaks 10 (2014).[86] Erik Volz. “SIR dynamics in random networks with heterogeneous connectivity”. In: Journal ofMathematical Biology 56.3 (2008), pp. 293–310.[87] Duncan J Watts and Steven H Strogatz. “Collective dynamics of ‘small-world’ networks”. In:Nature 393.6684 (1998), pp. 440–442.[88] Romualdo Pastor-Satorras and Alessandro Vespignani. “Epidemic spreading in scale-free net-works”. In: Physical review letters 86.14 (2001), p. 3200.[89] Julie Rushmore et al. “Network-based vaccination improves prospects for disease control inwild chimpanzees”. In: Journal of The Royal Society Interface 11.97 (2014), p. 20140349.[90] Douglas D Heckathorn. “Respondent-driven sampling: a new approach to the study of hiddenpopulations”. In: Social problems (1997), pp. 174–199.[91] Art FY Poon et al. “The impact of clinical, demographic and risk factors on rates of HIV trans-mission: a population-based phylogenetic analysis in British Columbia, Canada”. In: The Jour-nal of Infectious Diseases 211.6 (2015), pp. 926–935.81[92] David L Yirrell et al. “Molecular epidemiological analysis of HIV in sexual networks in Uganda”.In: AIDS 12.3 (1998), pp. 285–290.[93] Sonia Resik et al. “Limitations to contact tracing and phylogenetic analysis in establishing HIVtype 1 transmission networks in Cuba”. In: AIDS Research and Human Retroviruses 23.3 (2007),pp. 347–356.[94] Katy Robinson et al. “How the dynamics and structure of sexual contact networks shape pathogenphylogenies”. In: PLoS Computational Biology 9.6 (2013), e1003105.[95] Andrew J Leigh Brown et al. “Transmission network parameters estimated from HIV sequencesfor a nationwide epidemic”. In: The Journal of Infectious Diseases 204.9 (2011), p. 1463.[96] Tom Britton and Philip D O’Neill. “Bayesian inference for stochastic epidemics in populationswith random social structure”. In: Scandinavian Journal of Statistics 29.3 (2002), pp. 375–390.[97] Chris Groendyke, David Welch, and David R Hunter. “Bayesian inference for contact networksgiven epidemic data”. In: Scandinavian Journal of Statistics 38.3 (2011), pp. 600–616.[98] Hawoong Jeong et al. “The large-scale organization of metabolic networks”. In: Nature 407.6804(2000), pp. 651–654.[99] Albert-La´szlo´ Baraba´si et al. “Evolution of the social network of scientific collaborations”. In:Physica A: Statistical mechanics and its applications 311.3 (2002), pp. 590–614.[100] John T Kemper. “On the identification of superspreaders for infectious disease”. In: Mathemat-ical Biosciences 48.1 (1980), pp. 111–127.[101] Zhuang Shen et al. “Superspreading SARS events, Beijing, 2003”. In: Emerging Infectious Dis-eases 10.2 (2004), pp. 256–260.[102] Herbert A Simon. “On a class of skew distribution functions”. In: Biometrika 42.3/4 (1955),pp. 425–440.[103] Kirk Dombrowski et al. “Topological and historical considerations for infectious disease trans-mission among injecting drug users in bushwick, Brooklyn (USA)”. In: World Journal of AIDS3.1 (2013), p. 1.[104] Paul L Krapivsky, Sidney Redner, and Francois Leyvraz. “Connectivity of growing randomnetworks”. In: Physical Review Letters 85.21 (2000), p. 4629.[105] Birgitte Freiesleben de Blasio, A˚ke Svensson, and Fredrik Liljeros. “Preferential attachment insexual networks”. In: Proceedings of the National Academy of Sciences 104.26 (2007), pp. 10762–10767.[106] Gabriel E Leventhal et al. “Inferring epidemic contact structure from phylogenetic trees”. In:PLoS Computational Biology 8.3 (2012), e1002413.[107] Edwin J Bernard et al. “HIV forensics: pitfalls and acceptable standards in the use of phyloge-netic analysis as evidence in criminal investigations of HIV transmission”. In: HIV medicine 8.6(2007), pp. 382–387.82[108] Eamon B O’Dea and Claus O Wilke. “Contact heterogeneity and phylodynamics: how con-tact networks shape parasite evolutionary trees”. In: Interdisciplinary Perspectives on InfectiousDiseases (2011), p. 238743.[109] Vladimir N Minin, Erik W Bloomquist, and Marc A Suchard. “Smooth skyride through a roughskyline: Bayesian coalescent-based inference of population dynamics”. In: Molecular Biologyand Evolution 25.7 (2008), pp. 1459–1471.[110] David Welch. “Is network clustering detectable in transmission trees?” In: Viruses 3.6 (2011),pp. 659–676.[111] Luc Villandre et al. “Assessment of overlap of phylogenetic transmission clusters and commu-nities in simple sexual contact networks: applications to HIV-1”. In: PloS ONE 11.2 (2016),e0148459.[112] Steven M Goodreau. “Assessing the effects of human mixing patterns on human immunodefi-ciency virus-1 interhost phylogenetics through social network simulation”. In: Genetics 172.4(2006), pp. 2033–2045.[113] Jun S Liu, Rong Chen, and Tanya Logvinenko. “A theoretical framework for sequential impor-tance sampling with resampling”. In: Sequential Monte Carlo Methods in Practice. Springer,2001, pp. 225–246.[114] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science &Business Media, 2004.[115] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. “On sequential Monte Carlo samplingmethods for Bayesian filtering”. In: Statistics and Computing 10.3 (2000), pp. 197–208.[116] Arnaud Doucet, Nando De Freitas, and Neil Gordon. “An introduction to sequential MonteCarlo methods”. In: Sequential Monte Carlo methods in practice. Springer, 2001, pp. 3–14.[117] Jun S Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media,2008.[118] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. “Sequential Monte Carlo samplers”. In: Jour-nal of the Royal Statistical Society: Series B (Statistical Methodology) 68.3 (2006), pp. 411–436.[119] Olav Kallenberg. Foundations of Modern Probability. Springer Science & Business Media,2006.[120] Neil J Gordon, David J Salmond, and Adrian FM Smith. “Novel approach to nonlinear/non-Gaussian Bayesian state estimation”. In: Radar and Signal Processing, IEE Proceedings F.Vol. 140. 2. IET. 1993, pp. 107–113.[121] Adrian Smith et al. Sequential Monte Carlo Methods in Practice. Springer Science & BusinessMedia, 2013.[122] Olivier Cappe´, Simon J Godsill, and Eric Moulines. “An overview of existing methods andrecent advances in sequential Monte Carlo”. In: Proceedings of the IEEE 95.5 (2007), pp. 899–924.83[123] Jun S Liu and Rong Chen. “Blind deconvolution via sequential imputations”. In: Journal of theAmerican Statistical Association 90.430 (1995), pp. 567–576.[124] Randal Douc and Olivier Cappe´. “Comparison of resampling schemes for particle filtering”. In:Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis.IEEE. 2005, pp. 64–69.[125] Mark A Beaumont. “Approximate Bayesian computation in evolution and ecology”. In: AnnualReview of Ecology, Evolution, and Systematics 41 (2010), pp. 379–406.[126] Jean-Michel Marin et al. “Approximate Bayesian computational methods”. In: Statistics andComputing 22.6 (2012), pp. 1167–1180.[127] Simon Aeschbacher, Mark A Beaumont, and Andreas Futschik. “A novel approach for choosingsummary statistics in approximate Bayesian computation”. In: Genetics 192.3 (2012), pp. 1027–1047.[128] Michael GB Blum et al. “A comparative review of dimension reduction methods in approximateBayesian computation”. In: Statistical Science 28.2 (2013), pp. 189–208.[129] Mark M Tanaka et al. “Using approximate Bayesian computation to estimate tuberculosis trans-mission parameters from genotype data”. In: Genetics 173.3 (2006), pp. 1511–1520.[130] Paul Marjoram and Simon Tavare´. “Modern computational approaches for analysing moleculargenetic variation data”. In: Nature Reviews Genetics 7.10 (2006), pp. 759–770.[131] Scott A Sisson, Yanan Fan, and Mark M Tanaka. “Sequential Monte Carlo without likelihoods”.In: Proceedings of the National Academy of Sciences 104.6 (2007), pp. 1760–1765.[132] Paul Marjoram et al. “Markov chain Monte Carlo without likelihoods”. In: Proceedings of theNational Academy of Sciences 100.26 (2003), pp. 15324–15328.[133] Oliver Ratmann et al. “Using likelihood-free inference to compare evolutionary dynamics of theprotein networks of H. pylori and P. falciparum”. In: PLoS Computational Biology 3.11 (2007),e230.[134] Mark A Beaumont et al. “Adaptive approximate Bayesian computation”. In: Biometrika 96.4(2009), pp. 983–990.[135] Daniel T Gillespie. “A general method for numerically simulating the stochastic time evolutionof coupled chemical reactions”. In: Journal of Computational Physics 22.4 (1976), pp. 403–434.[136] Gabor Csardi and Tamas Nepusz. “The igraph software package for complex network research”.In: InterJournal, Complex Systems 1695.5 (2006), pp. 1–9.[137] Doug Baskins. Judy arrays. 2004.[138] Brian Gough. GNU Scientific Library Reference Manual. Third edition. Network Theory Ltd.,2009.[139] Blaise Barney. POSIX Threads Programming. 2009. URL: https://computing.llnl.gov/tutorials/pthreads/ (visited on 09/05/2016).84[140] Tal Pupko et al. “A fast algorithm for joint reconstruction of ancestral amino acid sequences”.In: Molecular Biology and Evolution 17.6 (2000), pp. 890–896.[141] Frank Harary and Edgar M Palmer. Graphical Enumeration. Elsevier, 2014.[142] Iain Murray, Zoubin Ghahramani, and David MacKay. “MCMC for doubly-intractable distribu-tions”. In: arXiv preprint arXiv:1206.6848 (2012).[143] Faming Liang. “A double Metropolis–Hastings sampler for spatial models with intractablenormalizing constants”. In: Journal of Statistical Computation and Simulation 80.9 (2010),pp. 1007–1022.[144] Richard Rothenberg and Stephen Q Muth. “Large-network concepts and small-network charac-teristics: fixed and variable factors”. In: Sexually Transmitted Diseases 34.8 (2007), pp. 604–612.[145] Roy M Anderson and Robert M May. “Directly transmitted infections diseases: control by vac-cination”. In: Science 215.4536 (1982), pp. 1053–1060.[146] Ravi Kumar et al. “Stochastic models for the web graph”. In: Foundations of Computer Science,2000. Proceedings. 41st Annual Symposium on. IEEE. 2000, pp. 57–65.[147] Paul Fearnhead and Dennis Prangle. “Constructing summary statistics for approximate Bayesiancomputation: semi-automatic approximate Bayesian computation”. In: Journal of the Royal Sta-tistical Society: Series B (Statistical Methodology) 74.3 (2012), pp. 419–474.[148] Simon DW Frost and Erik M Volz. “Modelling tree shape and structure in viral phylodynamics”.In: Philosophical Transactions of the Royal Society of London B: Biological Sciences 368.1614(2013).[149] Emmanuel Paradis, Julien Claude, and Korbinian Strimmer. “APE: analyses of phylogeneticsand evolution in R language”. In: Bioinformatics 20.2 (2004), pp. 289–290.[150] Achim Zeileis et al. “kernlab-an S4 package for kernel methods in R”. In: Journal of StatisticalSoftware 11.9 (2004), pp. 1–20.[151] David Meyer et al. e1071: Misc Functions of the Department of Statistics, Probability TheoryGroup (Formerly: E1071), TU Wien. R package version 1.6-7. 2015.[152] Greg Snow. TeachingDemos: Demonstrations for Teaching and Learning. R package version2.10. 2016.[153] Frank E Harrell Jr, with contributions from Charles Dupont, and many others. Hmisc: HarrellMiscellaneous. R package version 3.17-4. 2016. URL: https://CRAN.R-project.org/package=Hmisc.[154] Sture Holm. “A simple sequentially rejective multiple test procedure”. In: Scandinavian Journalof Statistics (1979), pp. 65–70.[155] Michael D. Karcher et al. “Quantifying and mitigating the effect of preferential sampling onphylodynamic inference”. In: PLoS Computational Biology 12.3 (Mar. 2016), pp. 1–19.85[156] Peter JA Cock et al. “Biopython: freely available Python tools for computational molecularbiology and bioinformatics”. In: Bioinformatics 25.11 (2009), pp. 1422–1423.[157] Veera Zetterberg et al. “Two viral strains and a possible novel recombinant are responsible forthe explosive injecting drug use-associated HIV type 1 epidemic in Estonia”. In: AIDS Researchand Human Retroviruses 20.11 (2004), pp. 1148–1156.[158] Iulia Niculescu et al. “Recent HIV-1 outbreak among intravenous drug users in Romania: evi-dence for cocirculation of CRF14 BG and subtype F1 strains”. In: AIDS Research and HumanRetroviruses 31.5 (2015), pp. 488–495.[159] Vladimir Novitsky et al. “Phylogenetic relatedness of circulating HIV-1C variants in Mochudi,Botswana”. In: PLoS ONE 8.12 (2013), e80589.[160] Vlad Novitsky et al. “Impact of sampling density on the extent of HIV clustering”. In: AIDSResearch and Human Retroviruses 30.12 (2014), pp. 1226–1235.[161] Grace P McCormack et al. “Early evolution of the human immunodeficiency virus type 1 sub-type C epidemic in rural Malawi”. In: Journal of Virology 76.24 (2002), pp. 12890–12899.[162] Mary K Grabowski et al. “The role of viral introductions in sustaining community-based HIVepidemics in rural Uganda: evidence from spatial clustering, phylogenetics, and egocentrictransmission models”. In: PLoS Medicine 11.3 (2014), e1001610.[163] Cheng-Feng Kao et al. “Surveillance of HIV type 1 recent infection and molecular epidemiologyamong different risk behaviors between 2007 and 2009 after the HIV type 1 CRF07 BC outbreakin Taiwan”. In: AIDS Research and Human Retroviruses 27.7 (2011), pp. 745–749.[164] Xiaoyan Li et al. “HIV-1 genetic diversity and its impact on baseline CD4+ T cells and viralloads among recently infected men who have sex with men in Shanghai, China”. In: PLoS ONE10.6 (2015), e0129559.[165] MT Cuevas et al. “HIV-1 transmission cluster with T215D revertant mutation among newlydiagnosed patients from the Basque Country, Spain”. In: Journal of Acquired Immune DeficiencySyndromes 51.1 (2009), p. 99.[166] Robert C Edgar. “MUSCLE: multiple sequence alignment with high accuracy and high through-put”. In: Nucleic Acids Research 32.5 (2004), pp. 1792–1797.[167] Manolo Gouy, Ste´phane Guindon, and Olivier Gascuel. “SeaView version 4: a multiplatformgraphical user interface for sequence alignment and phylogenetic tree building”. In: MolecularBiology and Evolution 27.2 (2010), pp. 221–224.[168] Simon Tavare´. “Some probabilistic and statistical problems in the analysis of DNA sequences”.In: Lectures on Mathematics in the Life Sciences 17 (1986), pp. 57–86.[169] Alexei J Drummond and Andrew Rambaut. “BEAST: Bayesian evolutionary analysis by sam-pling trees”. In: BMC Evolutionary Biology 7.1 (2007), p. 214.86[170] Gili Greenbaum, Alan R. Templeton, and Shirli Bar-David. “Inference and analysis of popu-lation structure using genetic data and network theory”. In: Genetics 202.4 (2016), pp. 1299–312.[171] Paul Erdo˝s and Alfred Re´nyi. “On the evolution of random graphs”. In: Publications of theMathematical Institute of the Hungarian Academy of Sciences 5 (1960), pp. 17–61.[172] Joseph Oscar Irwin. “The place of mathematics in medical and biological statistics”. In: Journalof the Royal Statistical Society 126.Pt. 1 (1963), pp. 1–41.[173] Garry Robins et al. “An introduction to exponential random graph (p*) models for social net-works”. In: Social networks 29.2 (2007), pp. 173–191.[174] Ken TD Eames and Matt J Keeling. “Modeling dynamic and network heterogeneities in thespread of sexually transmitted diseases”. In: Proceedings of the National Academy of Sciences99.20 (2002), pp. 13330–13335.[175] Katy Robinson, Ted Cohen, and Caroline Colijn. “The dynamics of sexual contact networks:effects on disease spread and control”. In: Theoretical Population Biology 81.2 (2012), pp. 89–96.[176] Federica Giardina. “Inference of epidemic contact networks from HIV phylogenetic trees”. Oralpresentation at HIV Dynamics and Evolution. 2016.[177] Nicolas Bortolussi et al. “apTreeshape: statistical analysis of phylogenetic tree shape”. In: Bioin-formatics 22.3 (2006), pp. 363–364.[178] Mikael Sunna˚ker et al. “Approximate Bayesian computation”. In: PLoS Computational Biology9.1 (2013), e1002803.[179] Jarno Lintusaari et al. “On the identifiability of transmission dynamic models for infectiousdiseases”. In: Genetics (2016).[180] James Holland Jones and Mark S Handcock. “An assessment of preferential attachment as amechanism for human sexual network formation”. In: Proceedings of the Royal Society of Lon-don B: Biological Sciences 270.1520 (2003), pp. 1123–1128.[181] Samuel R Friedman et al. Social Networks, Drug Injectors’ Lives, and HIV/AIDS. SpringerScience & Business Media, 2006.[182] Vito Latora et al. “Network of sexual contacts and sexually transmitted HIV infection in BurkinaFaso”. In: Journal of Medical Virology 78.6 (2006), pp. 724–729.[183] Z Wu et al. “HIV and syphilis prevalence among men who have sex with men: a cross-sectionalsurvey of 61 cities in China”. In: Clinical Infectious Diseases 57.2 (2013), p. 298.[184] David A Rasmussen, Erik M Volz, and Katia Koelle. “Phylodynamic inference for structuredepidemiological models”. In: PLoS Computational Biology 10.4 (2014), e1003570.[185] Nicholas Metropolis et al. “Equation of state calculations by fast computing machines”. In: Thejournal of chemical physics 21.6 (1953), pp. 1087–1092.87[186] W Keith Hastings. “Monte Carlo sampling methods using Markov chains and their applica-tions”. In: Biometrika 57.1 (1970), pp. 97–109.88Appendix AMathematical models, likelihood, andBayesian inferenceThis appendix reviews some fundamental statistical concepts that are referred to throughout the bodyof this thesis. A mathematical model is a formal description of a hypothesized relationship betweensome observed data, y and covariates x. A parametric model defines a family of possible relationshipsbetween data and outcomes, parameterized by one or more numeric parameters θ . A statistical modeldescribes the relationship between data and outcomes in terms of probabilities. Statistical models define,either explicitly or implicitly, the probability of observing y given x and, if the model is parametric, θ .Note that it is entirely possible to have no data x, only observed outcomes y. In this case, a model woulddescribe the process by which y is generated.To illustrate these concepts, consider the well-known linear model. For clarity, we will restrict ourattention to the case of one-dimensional data and outcomes where x = {x1, . . . ,xn} and y = {y1, . . . ,yn}are vectors of real numbers. The linear model postulates that the outcomes are linearly related to thedata, modulo some noise introduced by measurement error, environmental fluctuations, and other exter-nal factors. Formally, yi = βxi + εi, where β is the slope of the linear relationship, and εi is the errorassociated with measurement i. We can make this model a statistical one by hypothesizing a distributionfor the error terms εi; most commonly, it is assumed that they are normally distributed with variance σ .In mathematical terms, yi ∼ βxi+N (0,σ2), where “∼” means “is distributed as”. We can see from thisformulation that the model is parametric, with parameters θ = (β , σ ). Moreover, we can write down theprobability density fof observing outcome yi given the parameters,f (y | β ,σ) =n∏i=1fN (0,σ2)(yi−βxi),where fN (0,σ2) is the probability density of the normal distribution with mean zero and variance σ2.Note that we are treating the xi as fixed quantities and therefore have not conditioned the probabilitydensity on x. Also, we have assumed that all the yi are independent.For a general model, the probability density of y given the parameters θ is also known as the likeli-89hood, written L, of θ . That is, L(θ | y) = f (y | θ) for the model’s probability density function (pdf) f .The higher the value of the likelihood, the more likely the observations y are under the model. Thus, thelikelihood provides a natural criterion for fitting the model parameters: we want to pick θ such that theprobability density of our observed outcomes y is as high as possible. The parameters that optimize thelikelihood are known as the ML estimates, denoted θˆ . That is,θˆ = argmaxθL(θ | y).ML estimation is usually performed with numerical optimization. In the simplest terms, many possiblevalues for θ are examined, L(θ | y) is calculated for each, and the parameters that produce the highestvalue are accepted. Many sophisticated numerical optimization methods exist, although they may not beguaranteed to find the true ML estimates if the likelihood function is multi-modal.ML estimation makes use only of the data and outcomes to estimate the model parameters θ . How-ever, it is frequently the case that the investigator has some additional information or belief about whatθ are likely to be. For example, in the linear regression case, the instrument used to measure the out-comes may have a well-known margin of error, or the sign of the slope may be obvious from previousexperiments. The Bayesian approach to model fitting makes use of this information by codifying the in-vestigator’s beliefs as a prior distribution on the parameters, denoted pi(θ). Instead of considering onlythe likelihood, Bayesian inference focuses on the product of the likelihood and the prior, f (y | θ)pi(θ).Bayes’ theorem tells us that this product is related to the posterior distribution on θ ,pi(θ | y) = f (y | θ)pi(θ)∫f (y | θ)pi(θ)dθ . (A.1)In principle, pi(y | θ)pi(θ) can be optimized numerically just like L(θ | y), which would also optimizethe posterior distribution. The resulting optimal parameters are called the maximum a posteriori (MAP)estimates. However, from a Bayesian perspective, θ is not a fixed quantity to be estimated, but rather arandom variable with an associated distribution (the posterior). Therefore, the MAP estimate by itselfis of limited value without associated statistics about the posterior distribution, such as the mean orcredible intervals. Unfortunately, to calculate such statistics, it is necessary to evaluate the normalizingconstant in the denominator of eq. (A.1), which is almost always an intractable integral.A popular method for circumventing the normalizing constant is the use of MCMC to obtain asample from the posterior distribution. MCMC works by defining a Markov chain on the space ofpossible model parameters. The transition density from parameters θ1 to θ2 is taken to bemin(1,f (y | θ2)pi(θ2)q(θ2,θ1)f (y | θ1)pi(θ2)q(θ1,θ2)),where q(θ ,θ ′) is a symmetric proposal distribution used in the algorithm to generate the chain. Thestationary distribution of this Markov chain is equal to the posterior distribution on θ . Therefore, if a longenough random walk is performed on the chain, the distribution of states visited will be a Monte Carloapproximation of pi(θ | y), from which we can calculate statistics of interest. Actually performing this90random walk is straightforward and can be accomplished via the Metropolis-Hastings algorithm [185,186] (algorithm A.1).Algorithm A.1 Metropolis-Hastings algorithm for Markov chain Monte Carlo.Draw θ according to the prior pi(θ)loopPropose θ ′ according to q(θ ,θ ′)Accept θ ← θ ′ with probability min(1,f (y | θ ′)pi(θ ′)q(θ ′,θ)f (y | θ )pi(θ )q(θ ,θ ′))end loop91Appendix BAdditional plots0e+002e+054e+056e+050.00 0.25 0.50 0.75 1.00transmissibilitySackin's indexnetwork typeBAERWSFigure B.1: Reproduction of Figure 1A from Leventhal et al. (2012) used to check the accuracy of ourimplementation of Gillespie simulation. Transmission trees were simulated over three types of network,with pathogen transmissibility varying from 0 to 1. Sackin’s index was calculated for each simulatedtransmission tree. Lines indicate median Sackin’s index values, and shaded areas are interquartile ranges.92−4 −2 0 2 40.00.51.01.52.0thetadensityFigure B.2: Approximation of mixture of Gaussians used by Del Moral et al. (2012) and Sisson etal. (2009) to test SMC. Solid black line indicates true distribution. Grey shaded area shows ABC ap-proximation obtained with our implementation of adaptive ABC-SMC, using 10000 particles with onesimulated data point per particle.93−10 −5 0 5 100.000.050.100.150.20thetadensityFigure B.3: Approximation of mixture of two Gaussians used to test convergence of SMC algorithm toa bimodal distribution. Solid black line indicates true distribution. Grey shaded area shows ABC-SMCapproximation obtained with our implementation, using 10000 particles with one simulated data pointper particle.94I = 500 I = 1000 I = 2000Figure B.4: Simulated transmission trees under three different values of BA parameter I. Epidemicswere simulated on BA networks with parameters α = 1.0, m = 2, and N = 5000. Epidemics weresimulated until I = 500, 1000, or 2000 nodes were infected. Transmission trees were created by sampling500 infected nodes. For higher I values, the network was closer to saturation at the time of sampling,resulting in longer terminal branches as the waiting time until the next transmission increased.95m = 2 m = 3 m = 4Figure B.5: Simulated transmission trees under three different values of BA parameter m. Epidemicswere simulated on BA networks with parameters α = 1.0, n = 5000, and m = 2, 3, or 4. Epidemicswere simulated until I = 1000 nodes were infected. Transmission trees were created by sampling 500infected nodes.96N = 3000 N = 5000 N = 8000Figure B.6: Simulated transmission trees under three different values of BA parameter N. Epidemicswere simulated on BA networks with parameters α = 1.0, m = 2, and N = 3000, 5000, or 8000. Epi-demics were simulated until I = 1000 nodes were infected. Transmission trees were created by sampling500 infected nodes. For lower N values, the network was closer to saturation at the time of sampling,resulting in longer waiting times until the next transmission and longer terminal branch lengths.97I: 500 I: 1000 I: 2000ll llllll l lll l lll ll l l ll l ll ll ll lllll lllll ll ll l llll ll lll ll l0.60.70.80.91.00.60.70.80.91.00.60.70.80.91.0tips: 100tips: 500tips: 10000.1250.25 0.5 1 2 4 80.1250.25 0.5 1 2 4 80.1250.25 0.5 1 2 4 8radial basis function varianceaccuracydecay factorl 0.20.30.4classifierkernelnLTTSackinFigure B.7: Cross validation accuracy of classifiers for BA model parameter α for eight epidemicscenarios. Solid lines and points are R2 of tree kernel kSVR under various kernel meta-parameters.Dashed and dotted lines are R2 of linear regression against Sackin’s index, and SVR using nLTT.98lllll llll l l lll0.40.60.81.00.40.60.81.0tips: 100tips: 5000.1250.25 0.5 1 2 4 8radial basis function varianceaccuracydecay factorl 0.20.30.4classifierkernelnLTTSackinFigure B.8: Cross validation accuracy of classifiers for BA model parameter I for eight epidemic sce-narios. Solid lines and points are R2 of tree kernel kSVR under various kernel meta-parameters. Dashedand dotted lines are R2 of linear regression against Sackin’s index, and SVR using nLTT.99I: 500 I: 1000 I: 2000ll l llllllllllllll ll lllllll llll ll llll llll l llll l llllllll ll0.300.350.400.450.300.350.400.450.300.350.400.45tips: 100tips: 500tips: 10000.1250.25 0.5 1 2 4 80.1250.25 0.5 1 2 4 80.1250.25 0.5 1 2 4 8radial basis function varianceaccuracydecay factorl 0.20.30.4classifierkernelnLTTSackinFigure B.9: Cross validation accuracy of classifiers for BA model parameter m for eight epidemicscenarios. Solid lines and points are R2 of tree kernel kSVR under various kernel meta-parameters.Dashed and dotted lines are R2 of linear regression against Sackin’s index, and SVR using nLTT.100I: 500 I: 1000 I: 2000ll ll l l ll l llll ll l l ll l ll lllllll lll llll ll l llll ll lll llll llll0.30.40.50.60.70.80.30.40.50.60.70.80.30.40.50.60.70.8tips: 100tips: 500tips: 10000.1250.25 0.5 1 2 4 80.1250.25 0.5 1 2 4 80.1250.25 0.5 1 2 4 8radial basis function varianceaccuracydecay factorl 0.20.30.4classifierkernelnLTTSackinFigure B.10: Cross validation accuracy of classifiers for BA model parameter N for eight epidemicscenarios. Solid lines and points are R2 of tree kernel kSVR under various kernel meta-parameters.Dashed and dotted lines are R2 of linear regression against Sackin’s index, and SVR using nLTT.101I: 500 I: 1000 I: 2000●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●tips: 100tips: 500tips: 1000first principal componentsecond principal componentα●●●0.511.5Figure B.11: Kernel principal components projection of trees simulated under three different values ofBA parameter α , for eight epidemic scenarios.102●● ●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●● ●●●●●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●● ●●●●●●●●●●● ●●●●tips: 100tips: 500first principal componentsecond principal componentI●●●50010002000Figure B.12: Kernel principal components projection of trees simulated under three different values ofBA parameter I, for eight epidemic scenarios.103I: 500 I: 1000 I: 2000●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●● ●● ●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●●●●● ●●●●● ●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●●●● ● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●● ●●● ●●●●●●●● ●●●● ●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●tips: 100tips: 500tips: 1000first principal componentsecond principal componentm●●●234Figure B.13: Kernel principal components projection of trees simulated under three different values ofBA parameter m, for eight epidemic scenarios.104I: 500 I: 1000 I: 2000lllllllllllllllllllllllllllllll llllll llllll llllllllllllllllllllllllllll llllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllllll lllllll lll lllllllll lllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllll lllllll llllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllll llllll llllllllllllllllllllllllllllllllllllllllllllllllllll llll lllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllll lllllllllllllllllll lllllllllllllllllllllll lllllllll llll llll lllllllllllllllll l lll ll lllllllllllllllllll llllllllllllllllllllll llllll llllllllllllllllllllllll llllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllll lllllllllllllllllllllllllllllllll lllllll llllllllllll lllllll lll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lll lllllllllllllllllll lllll llllllllllllllllllllll lllllllllllllllllllllllllll llllllllllllll ll llllllllll llllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllll llllll ll l lllllllllllllll llllllllllllllllllllll l ll lll llllllllll lllllllllllllllllllll llllllllllltips: 100tips: 500tips: 1000first principal componentsecond principal componentNlll300050008000Figure B.14: Kernel principal components projection of trees simulated under three different values ofBA parameter N, for eight epidemic scenarios.105α = 0 α = 0.25 α = 0.5α = 0.75 α = 1 α = 1.25α = 1.5 α = 1.75 α = 20.0 0.5 1.0 1.5 2.00.0 0.5 1.0 1.5 2.00.0 0.5 1.0 1.5 2.0simulated αnormalized kernel scorenumber of tips1005001000Figure B.15: Grid search kernel scores for testing trees simulated under various α values. The other BAparameters were fixed at I = 1000, N = 5000, and m = 2.106I = 500 I = 1000 I = 1500I = 2000 I = 2500 I = 3000I = 3500 I = 4000 I = 45001000 3000 50001000 3000 50001000 3000 5000simulated Inormalized kernel scorenumber of tips100500Figure B.16: Grid search kernel scores for testing trees simulated under various I values. The other BAparameters were fixed at α = 1.0, N = 5000, and m = 2.107m = 1 m = 2 m = 3m = 4 m = 5 m = 62 4 6 2 4 6 2 4 6simulated mnormalized kernel scorenumber of tips1005001000Figure B.17: Grid search kernel scores for testing trees simulated under various m values. The other BAparameters were fixed at α = 1.0, I = 1000, and N = 5000.108N = 1000 N = 3000 N = 5000N = 7000 N = 9000 N = 11000N = 13000 N = 150003000 8000 13000 3000 8000 13000simulated Nnormalized kernel scorenumber of tips1005001000Figure B.18: Grid search kernel scores for testing trees simulated under various N values. The other BAparameters were fixed at α = 1.0, I = 1000, and m = 2.109● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●0.00.51.01.52.00.0 0.5 1.0 1.5 2.0true αestimated αFigure B.19: Point estimates of preferential attachment power α of Baraba´si-Albert network model,obtained on simulated trees with kernel-score-based grid search. Test trees were simulated according toseveral values of α (x-axis) with other model parameters fixed at m = 2, N = 5000, and I = 1000. The testtrees were compared to trees simulated along a narrowly spaced grid of α values using the tree kernel,with the same values of the other parameters. The grid value with the highest median kernel score wastaken as a point estimate for α (y-axis).110●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●100020003000400050001000 2000 3000 4000true Iestimated IFigure B.20: Point estimates of prevalence at time of sampling I of BA network model, obtained onsimulated trees with kernel-score-based grid search. Test trees were simulated according to several val-ues of I (x-axis) with other model parameters fixed at α = 1, m = 2, and N = 5000. The test trees werecompared to trees simulated along a narrowly spaced grid of I values using the tree kernel, with thesame values of the other parameters. The grid value with the highest median kernel score was taken asa point estimate for I (y-axis).111●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●2462 4 6true mestimated mFigure B.21: Point estimates of number of edges per vertex m of BA network model, obtained on sim-ulated trees with kernel-score-based grid search. Test trees were simulated according to several valuesof m (x-axis) with other model parameters fixed at α = 1, I = 1000, and N = 5000. The test trees werecompared to trees simulated along a narrowly spaced grid of m values using the tree kernel, with thesame values of the other parameters. The grid value with the highest median kernel score was taken asa point estimate for m (y-axis).112●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●● ●●●●●● ●●●●●● ●●●●●40008000120004000 8000 12000true Nestimated NFigure B.22: Point estimates of number of edges per vertex N of BA network model, obtained on sim-ulated trees with kernel-score-based grid search. Test trees were simulated according to several valuesof N (x-axis) with other model parameters fixed at α = 1, m = 2, and I = 1000. The test trees werecompared to trees simulated along a narrowly spaced grid of N values using the tree kernel, with thesame values of the other parameters. The grid value with the highest median kernel score was taken asa point estimate for m (y-axis).113lllll12345mstratified by αllstratified by Illstratified by mll051015N (thousands)l lα 0 0.5 1 1.5 I 1000 2000 m 2 3 4Figure B.23: Posterior mean point estimates for BA model parameters m and N obtained by runningnetabc on simulated data, stratified by true parameter values. First row of plots contains true versusestimated values of m; second row contains true versus estimated values of N. Columns are stratified byα , I, and m respectively. Dashed lines indicate true values.0.0 0.5 1.0 1.5 2.02468αγFigure B.24: Relationship between preferential attachment power parameter α and fitted power lawexponent γ for networks simulated under the BA network model with N = 5000 and m = 2.114alpha: 0 alpha: 0.5 alpha: 1 alpha: 1.5●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●● ●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●● ● ●●101000101000101000101000m: 1m: 2m: 3m: 41 10 1 10 1 10 100 10 1000degreefrequency methodpower lawstretched exponentialFigure B.25: Best fit power law and stretched exponential curves for degree distributions of simulatedBaraba´si-Albert networks for several values of α and m.115ll0.0 0.5 1.0 1.5 2.0αll0 2500 5000 7500 10000Ill2 3 4 5mll0 2500 5000 7500 10000Nllmixed/Spain (Cuevas et al. 2009)MSM/Shanghai (Li et al. 2015)MSM/USA (Little et al. 2014)MSM/Taiwan (Kao et al. 2011)MSM/Beijing (Wang et al. 2015)HET/Uganda (Grabowski et al. 2014)HET/Malawi (McCormack et al. 2002)HET/Botswana (Novitsky et al. 2013 & 2014)IDU/Canada (unpublished)IDU/Romania (Niculescu et al. 2015)IDU/Estonia (Zetterberg et al. 2004)Figure B.26: Posterior mean point estimates and 95% HPD intervals for parameters of the BA networkmodel, fitted to five published HIV datasets with netabc using the prior m ∼ DiscreteUniform(2,5).x-axes indicate regions of nonzero prior density.116
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Phylogenetic estimation of contact network parameters...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Phylogenetic estimation of contact network parameters with approximate Bayesian computation McCloskey, Rosemary Martha 2016
pdf
Page Metadata
Item Metadata
Title | Phylogenetic estimation of contact network parameters with approximate Bayesian computation |
Creator |
McCloskey, Rosemary Martha |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | Models of the spread of disease in a population often make the simplifying assumption that the population is homogeneously mixed, or is divided into homogeneously mixed compartments. However, human populations have complex structures formed by social contacts, which can have a significant influence on the rate and pattern of epidemic spread. Contact networks capture this structure by explicitly representing each contact that could possibly lead to a transmission. Contact network models parameterize the structure of these networks, but estimating their parameters from contact data requires extensive, often prohibitive, epidemiological investigation. We developed a method based on approximate Bayesian computation (ABC) for estimating structural parameters of the contact network underlying an observed viral phylogeny. The method combines adaptive sequential Monte Carlo for ABC, Gillespie simulation for propagating epidemics though networks, and a previously developed kernel-based tree similarity score. Our method offers the potential to quantitatively investigate contact network structure from phylogenies derived from viral sequence data, complementing traditional epidemiological methods. We applied our method to the Barabási-Albert network model. This model incorporates the preferential attachment mechanism observed in real world social and sexual networks, whereby individuals with more connections attract new contacts at an elevated rate (“the rich get richer”). Using simulated data, we found that the strength of preferential attachment and the number of infected nodes could often be accurately estimated. However, the mean degree of the network and the total number of nodes appeared to be weakly- or non-identifiable with ABC. Finally, the Barabási-Albert model was fit to eleven real world HIV datasets, and substantial heterogeneity in the parameter estimates was observed. Posterior means for the preferential attachment power were all sub-linear, consistent with literature results. We found that the strength of preferential attachment was higher in injection drug user populations, potentially indicating that high-degree “superspreader” nodes may play a role in epidemics among this risk group. Our results underscore the importance of considering contact structures when investigating viral outbreaks. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-08-04 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
IsShownAt | 10.14288/1.0307348 |
URI | http://hdl.handle.net/2429/58663 |
Degree |
Master of Science - MSc |
Program |
Bioinformatics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2016_september_mccloskey_rosemary.pdf [ 1.98MB ]
- Metadata
- JSON: 24-1.0307348.json
- JSON-LD: 24-1.0307348-ld.json
- RDF/XML (Pretty): 24-1.0307348-rdf.xml
- RDF/JSON: 24-1.0307348-rdf.json
- Turtle: 24-1.0307348-turtle.txt
- N-Triples: 24-1.0307348-rdf-ntriples.txt
- Original Record: 24-1.0307348-source.json
- Full Text
- 24-1.0307348-fulltext.txt
- Citation
- 24-1.0307348.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0307348/manifest