DrSMC: A Sequential Monte Carlo Sampler forDeterministic Relationships on Continuous RandomVariablesbyNeil SpencerB.Sc. with Honours in Mathematics and Statistics, Acadia University, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)August 2015c© Neil Spencer, 2015AbstractComputing posterior distributions over variables linked by deterministic constraintsis a recurrent problem in Bayesian analysis. Such problems can arise due to cen-soring, identifiability issues, or other considerations. It is well-known that standardimplementations of Monte Carlo inference strategies break down in the presenceof these deterministic relationships. Although several alternative Monte Carlo ap-proaches have been recently developed, few are applicable to deterministic rela-tionships on continuous random variables. In this thesis, I propose Deterministicrelationship Sequential Monte Carlo (DrSMC), a new Monte Carlo method for con-tinuous variables possessing deterministic constraints. My exposition focuses ondeveloping a DrSMC algorithm for computing the posterior distribution of a con-tinuous random vector given its sum. I derive optimal settings for this algorithmand compare its performance to that of alternative approaches in the literature.iiPrefaceThis dissertation is original, unpublished, independent work by the author, NeilSpencer.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Importance Sampling and Sequential Monte Carlo . . . . . . . . . 42.2 Sequential Monte Carlo Samplers . . . . . . . . . . . . . . . . . 72.3 Hamiltonian Monte Carlo . . . . . . . . . . . . . . . . . . . . . . 103 DrSMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.1 Problem Formulation and General Idea . . . . . . . . . . . . . . . 153.2 Intermediate Densities . . . . . . . . . . . . . . . . . . . . . . . 163.2.1 Comparison to SCMC . . . . . . . . . . . . . . . . . . . 173.3 Backward Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 183.4 Proposal Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 193.4.1 Metropolis-Hastings Proposals . . . . . . . . . . . . . . . 19iv3.4.2 HMC Proposals . . . . . . . . . . . . . . . . . . . . . . . 203.4.3 Split-HMC Proposal for Deterministic Sums . . . . . . . 223.5 DrSMC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 243.6 Exact Enforcement . . . . . . . . . . . . . . . . . . . . . . . . . 254 Demonstration and Experiments . . . . . . . . . . . . . . . . . . . . 274.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . 274.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.3 Numerical Comparisons to Previous Methods . . . . . . . . . . . 304.3.1 DYSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314.3.2 SCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . 344.3.3 Comparison of Algorithms . . . . . . . . . . . . . . . . . 355 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395.1 Novel Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 395.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42A Supporting Derivations . . . . . . . . . . . . . . . . . . . . . . . . . 45A.1 Analytically Integrating Hamiltonian Dynamics of the Random La-grangian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45A.2 Functional Form for DrSMC Intermediate Distributions . . . . . . 48A.3 Exact Distribution of Multivariate Normal given its Sum . . . . . 50vList of TablesTable 4.1 Configuration of the DrSMC with HMC proposals used to ap-proximate the distribution described in Section 4.1. . . . . . . 29Table 4.2 Configuration of the SCMC with MH proposals used to approx-imate the distribution described in Section 4.1. . . . . . . . . . 35viList of FiguresFigure 4.1 A comparison of DrSMC posterior density estimates (shownin red) and the true posterior densities (shown in green) forthe 15-dimensional multivariate normal problem described inSection 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Figure 4.2 An illustration of the effective sample size trajectory (shownin blue) across 31 steps of a DrSMC approximation for the15-dimensional multivariate normal problem described in Sec-tion 4.1. The red line depicts the number surviving particlesafter each resampling, and the green line reports the number ofHMC proposals acceptance. . . . . . . . . . . . . . . . . . . 31Figure 4.3 A plot demonstrating the trajectory of f (X) for a 500 parti-cle, 31 step DrSMC algorithm applied to the 15-dimensionalmultivariate normal problem described in Section 4.1. . . . . 32Figure 4.4 A density plot (shown in green) of f (X) at the 30th step (thefinal step before the exact enforcement step) of a 500 particleDrSMC algorithm applied to the 15-dimensional multivariatenormal problem described in Section 4.1. The theoretical dis-tribution for f (X) at this step is shown in red. . . . . . . . . . 33Figure 4.5 A plot overlaying the DYSC posterior density estimates for250000 particles (shown in red) and the true theoretical values(shown in green) for the 15-dimensional multivariate normalproblem described in Section 4.1 . . . . . . . . . . . . . . . . 34viiFigure 4.6 A plot overlaying the SCMC posterior density estimates for3500 particles (shown in red) and the true theoretical values(shown in green) for the 15-dimensional multivariate normalproblem described in Section 4.1. . . . . . . . . . . . . . . . 36Figure 4.7 A dot plot illustrating the estimated means (odd-indexed vari-ables) for the multivariate normal problem described in Sec-tion 4.1. The estimates are DrSMC (green), DYSC (red), andSCMC (blue), respectively. The true means (obtained analyti-cally) are shown in yellow. . . . . . . . . . . . . . . . . . . . 37Figure 4.8 A dot plot illustrating the estimated means (even-indexed vari-ables) for the multivariate normal problem described in Sec-tion 4.1. The estimates are DrSMC (green), DYSC (red), andSCMC (blue), respectively. The true means (obtained analyti-cally) are shown in yellow. . . . . . . . . . . . . . . . . . . . 38Figure 4.9 A plot demonstrating the the mean squared error for the pos-terior means for five runs of DrSMC (green), DYSC (red), andSCMC (blue) on problem described in Section 4.1. . . . . . . 38viiiAcknowledgmentsThroughout my time at UBC, I have had the pleasure of being surrounded by manytalented friends and mentors. Not only did their support and guidance make thisthesis possible, it transformed a daunting and difficult task into a delightful andrewarding experience.First and foremost, I would like to express my gratitude to my thesis supervisor,Alexandre Bouchard-Coˆte´. Thank you for your confidence, cheerfulness, insight-fulness, and for having the patience to tolerate my dabbling in several projectsbefore picking a thesis topic. In particular, thank you for introducing me to andleading such an engaging reading group. To Vincent, Seong, Camila, Crystal, Sean,Sohrab, Reza, Creagh, Elena, Henry, and the rest of the reading group, thank youfor your effort and enthusiasm. Much of the work in this thesis was inspired bygroup presentations and discussions.Thank you very much to those academics outside of UBC whose correspon-dences helped make this thesis possible: Shirin Golchi, Liam Paninski, ArnaudDoucet, Michael Betancourt, and Lei Li. Special thanks to Dave Campbell atSimon Fraser, whose helpful comments, encouragement, and patience were veryhelpful in the writing of this thesis. Thanks to Harry Joe for his insights regardingmultivariate normal distributions. Thank you to Pritam Ranjan for getting me intostatistics, and for his continued support and encouragement over the years.Many thanks to my officemates for the countless vibrant discussions in the pasttwo years. We had our share of good times (nights out, suppers in the village, spir-ited debates), and difficult times (conference deadlines, the “cold wars”). Thanksto Sohrab for his kindness and conscientiousness, Daniel for his enthusiasm andmany compliments, Vivian for her pleasant conversations, Andres (honorary of-ixficemate and great excursion planner) for the many long walks (and for making theterm “the doctor” stick), and Jack for his friendliness and go-karting skills. Thanksto Creagh for his collaborations, for being a fellow office night-owl, and for hisclever jokes. Thanks to Sean for being my collaborator from day one, for his con-tinuous help with technology, for having the patience to put up with my “intuition”and “tastelessness”, and for the Paris grand tour.Thank you to the rest of my cohort, Chiara, Huiting, Vanny, Jinyuan, Ken, andEric for their camaraderie. Special thanks to Danny for being my first friend inVancouver (I think we single-handedly kept the Chinese food places open in ourfirst year). Thank you to Tyler and Vincenzo for hosting delightful parties. Thankyou to the rest of the faculty and staff of the Department of Statistics for creat-ing such a comfortable learning environment. Thank you to NSERC for partiallyfunding my research.Thank you to my dear friends Ben Callaghan and Margaux Ross. You guysprovided a delightful escape when the academic life got overwhelming. Ben, Ialready miss taco/sloppy Joe nights. Thank you to the rest of AVFR for a greatfrisbee season. Finally, thank you to my family for your continued support and en-couragement. This is directed at both my Nova Scotia family: my parents, siblings,grandparents and extended relatives, as well my surrogate Vancouver family: theRoss clan, including Cooper.xChapter 1IntroductionMonte Carlo methods provide a rich class of probabilistic inference tools for prob-lems arising in statistics and machine learning. Methods such as Importance Sam-pling ([21]), Sequential Monte Carlo (SMC), and Markov Chain Monte Carlo(MCMC) ([21]) have fuelled significant advancements in the realm of statisticalcomputation. This is especially true in the realm of Bayesian statistics; MonteCarlo methods are standard for approximating posterior distributions for complexmodels. This thesis introduces Deterministic Relationship Sequential Monte Carlo(DrSMC), a Monte Carlo method for continuous random variables which are de-terministically related.Deterministic relationships arise in a variety of modelling situations, such aslinkage analysis [14], transportation [16], and medical diagnosis [27]. An instancewhere determinism arises naturally is described by [20], in which a mobile oper-ator wants to simulate the lengths of a sequence of k individual phone calls giventhat their aggregated length is s. In other cases, deterministic relationships areartificially introduced for practical reasons. For example, when inferring the ratematrix of a continuous time Markov chain, it is common to restrict the space of ratematrices to those in which the expected number of jumps is one per unit of time inorder to maintain model identifiability (e.g. [13]).It is well-known that standard implementations of Monte Carlo methods arenot directly applicable in the presence of deterministic relationships ([5, 31]). Insuch situations, approximation bounds become arbitrarily large and MCMC mixing1rates become arbitrarily slow. To avoid this, [5] recommend transforming the net-work to remove the deterministic dependencies. However, these transformationscan render a network intractably large and be computationally infeasible to carryout [20]. For this reason, there has been a multitude of recent work proposing al-ternative inference procedures to deal with deterministic relationships. Almost allof this work focuses on discrete variables (e.g. [12, 15, 18]) or other complicatedcombinatorial variables (such as graphs [3], [30]). The literature for the case ofcontinuous variables is much more sparse, with the applicability of the proposedmethods being limited in scope (e.g. deterministic sums with independent andeasily sampled marginals [20], unit vectors and the unit simplex [29]).To address the deficiency in the literature, I introduce DrSMC, a new SMCsampler [9] whose scope consists of sampling problems in which the deterministicrelationship can be expressed as f (X) = s, where, X = (X1, . . . ,Xk) is a vectorof continuous random variables, f is a continuous almost everywhere function ofX , and s is a deterministic value in the range of f . DrSMC involves a gradualenforcement of the constraint f (X) = s using Markov Kernels to move through aseries of intermediate distributions in a sequential sampling procedure.In high-dimensional problems in which f is differentiable, it is useful to em-ploy Hamiltonian Monte Carlo (HMC [22]) when developing the Markov Kernelproposals. HMC’s use of the gradient information avoids the dimensionality prob-lems associated with random walks, thereby expediting the rate at which proposedmoves satisfy the deterministic relationship.Despite being developed independently, the general DrSMC algorithm sharesmany similarities with a special instance of a recently proposed family of samplerscalled Sequentially Constrained Monte Carlo (SCMC [17]) samplers. Determinis-tic relationships fall within a broad class of problems to which SCMC is applicable.However, a key difference between DrSMC and SCMC is the functional form usedto gradually enforce the constraint. A consequence of this is that, unlike DrSMC,SCMC is unable to utilize HMC kernels. I further elaborate on the differencesbetween DrSMC and SCMC for deterministic problems in Chapter 3.The remainder of this thesis is organized as follows. Chapter 2 provides rele-vant background information on importance sampling, SMC, SMC samplers, andHamiltonian Monte Carlo. Section 3 introduces the general DrSMC procedure,2discusses implementation using both Metropolis Hastings (MH) and HMC kernels.Emphasis is placed a specialized proposal kernel which facilitates tuning DrSMCfor deterministic sums. Chapter 4 illustrates DrSMC for deterministic sums onasynthetic problem, comparing its performance to that of DYSC [20] and SCMC[17]. Chapter 5 makes some concluding remarks. The appendix contains somemathematical details that are too lengthy for the main text.3Chapter 2BackgroundDeterministic relationship Sequential Monte Carlo algorithm (DrSMC) falls withina broad class of sampling algorithms referred to as Sequential Monte Carlo sam-plers ([9] SMCS). As a result, exposition and correctness of this algorithm reliesheavily on results and insights from the SMCS framework. Section 2.1 reviewsimportance sampling (IS), sequential importance sampling (SIS), and sequentialMonte Carlo (SMC), the foundations of the SMCS framework. Section 2.2 presentsthe SMCS framework, and highlights its connections with SMC.Following the discussion of SMCS, Section 2.3 provides a brief review ofHamiltonian Monte Carlo (HMC), a Markov Chain Monte Carlo strategy whichcan be used to develop effective DrSMC kernels. For ease of presentation, thischapter shall follow [9] and assume any probability measure pi admits a densitypi(x) (with a slight abuse of notation) with respected to a σ -finite dominating mea-sure denoted dx. Throughout this thesis, y`1:`2 is used to denote the sequence ofvariables y`1 ,y`+1, . . . ,y`2 .2.1 Importance Sampling and Sequential Monte CarloThe objective of Monte Carlo methods is to approximate a target probability mea-sure pi on a measurable space {E,E } with a set of points x1, . . . ,xN ∈ E. Thisis useful for approximating properties of pi , namely the expectation Epi(ψ) of afunction ψ on E, (e.g. ψ(x) = x for first moment). In simple cases where pi can4be sampled from directly, this value can be estimated by sampling a set of pointsx1, . . . ,xN independently from pi and invoking the strong law of large numbers.That is, if Epi(|ψ(x)|)< ∞,∑Ni=1ψ(xN)N→∫Eψ(x)pi(x)dx,where → denotes almost-sure convergence. However, when pi represents a mul-tivariate distribution with complicated dependencies (such as deterministic rela-tionships), it typically cannot be directly sampled, and an alternative procedure isnecessary. One option is importance sampling (IS), a method in which a distribu-tion g on E (referred to as the importance distribution) is sampled from instead ofpi and the points x1, . . . ,xN are assigned weights wi ∝ pi(xi)/g(xi) to compensate forthe differences. These weighted points are referred to as particles. More formally,IS exploits the identitiesEpi(ψ) = Z−1∫Eψ(x)w(x)g(x)dx,Z =∫Ew(x)g(x)dx,to approximate Epi(ψ) usingEpi(ψ)≈∑Ni=1 wiψ(xi)∑ki=1 wi. (2.1)IS is also applicable in instances where only γ(x) = pi(x) ·Z can be evaluated as afactor of Z would be present in both numerator and denominator of Equation 2.1.These scenarios arise often in Bayesian inference.The success of IS depends heavily upon the choice of g; it is beneficial forg to be as close to pi as possible, as the variance of an estimation is approxi-mately proportional to 1+ var(wi(xi)) [21]. Finding an appropriate g which canbe tractably sampled from is often difficult, especially when X is non-standard andhigh-dimensional.A possible solution is sequential importance sampling (SIS), an IS strategyin which the proposal is developed via recursive sampling of each component of5X conditionally on earlier components. That is, the variable X = (X1, . . . ,Xk) issequentially sampled by individually proposing each Xi conditionally on X1:(i−1)using some sequence of kernels qi. The result is an importance distribution g com-posed of the proposal distributions g(X1:k) = q1(x1)∏ki=2 qi(xi|x1:(i−1)).Another possible solution is to use to use a variant of SIS called SequentialMonte Carlo (SMC) which simultaneously performs SIS on all N particles as asingle batch. This facilitates particle resampling at intermediate steps, allowingthe algorithm to focus on particles that have high weights (i.e. are representativeof the target distribution), and discard those with low weights (i.e. are not rep-resentative of the target distribution). Resampling requires calculation of particleweights at these intermediate steps, thereby necessitating intermediate distributionspi1, . . . ,pik = pi defined with the support of pii being x1:i.The simplest resampling procedure involves multinomial sampling using theparticle weights. However, other resampling mechanisms are available [11]. Often,resampling is not performed at every step. Instead the effective sample size (ESS[21]) is monitored and resampling is performed only when it drops below a pre-specified threshold. This generalization is sometimes given a separate name (suchas SMC with adaptive resampling [11]). Here, no such distinction is necessary.To ensure acceptable performance during the resampling step, it is desirableto use intermediate distributions pii’s which are similar to pi , but on a restricteddomain. In many popular applications of SMC (e.g. online problems such as parti-cle filtering or time series), good intermediate distributions pi1, . . . ,pik are naturallyprovided by the layout of the problem (the posterior distribution up to time i). Oth-erwise, the user must design the intermediate distributions themselves. Like IS, theapplicability of SMC extends to cases in which only γ(x) ∝ pi(x) and γi(x) ∝ pii(x)can be evaluated point-wise.Algorithm 1 provides an algorithmic formulation of SMC employing N parti-cles to approximate a k-dimensional distribution pi . Here, qn(·|x1:n−1) denotes thesequence of proposal distributions (n ∈ 1 : k), the unnormalized intermediate den-sities are denoted γ1:k, and γk = γ represents the unnormalized target density. Thevariables W (1:N)1:n are the normalized weights of the particles at each step, with w˜(i)ndenoting the weight update for particle i at step n.We are now prepared to advance to Section 2.2, where Sequential Monte Carlo6Algorithm 1 : SMCfor i = 1, . . . ,N doX (i)1 ∼ q1(·)w(i)1 ← γ1(X(i)1 )/q1(X(i)1 )W (i)1 ← w(i)1 /∑Nj=1 w( j)1end forfor n = 2, . . . ,k doif the effective sampling size, (∑Ni (W(i)n−1)2)−1, is smaller than a threshold T :thenResample the particlesW (i)n−1← 1/N for all iend iffor i = 1, . . . ,N doX (i)n ∼ qn(·|X (i)1:(n−1))w˜(i)n ← γn(X (i)1:n)/(γn−1(X (i)1:n−1)qn(X(i)n |X(i)1:(n−1)))W (i)n ←W(i)n−1w˜(i)n /∑Nj=1W( j)n−1w˜( j)nend forend forsamplers, a non-standard special case of SMC algorithms, are discussed.2.2 Sequential Monte Carlo SamplersIn classical Sequential Monte Carlo (SMC) setups, the dimensionality of the parti-cles increases at each SMC iteration. For example, in a state-space or HMM modelwhere each latent state is a univariate random variable, the particles are sequencesof latent states that grow by one dimension at each step. A new particles at iterationn is constructed from an old one from iteration n− 1 by copying the exact samefirst n−1 latent variables, and proposing one additional latent variable.In contrast to classical SMC, SMC samplers [9] (SMCS) allow the dimen-sionality of the particles to be constant across SMC iterations. This facilitates theapplication of SMC to instances in which the proposal is intended to move com-plete particles through a sequence of distributions, such as in annealed importancesampling [23]. This becomes critical when one wants to use proposals inspired7from the Markov Chain Monte Carlo literature, where latent variables are modifiedrather than appended. Because of this flexibility, implementation of SMCS differsin an important way compared to classical SMC; the weight calculation is alteredby an auxiliary backward kernel.The general goal of SMCS is to sample from a probability measure dx definedon a probability space (E,E ). To do so, auxiliary distributions pin, n ∈ {1, . . . , p}are introduced, where pip = pi . Abusing notation, assume that pin has a densitypin(xn) = γn(xn)/Zn, xn ∈ E, with respect to a σ -finite dominating measure denoteddx. Note that each xn might itself be a vector, xn = (xn(1), . . . ,xn(k)).At a high-level, SMCS follow the structure of standard SMC algorithms veryclosely, with the important difference of a different weight update. As in the stan-dard SMC case: We initialize the first iteration of the algorithm using the outputof an importance sampling algorithm, x(1:N)1 = (x(1)1 , . . . ,x(N)1 ) with correspondingunnormalized weights W (1:N)1 , where N denotes the number of particles. After ini-tialization, we go from one iteration to the next using a proposal Kn, followed by are-weighting, followed by an optional resampling step.In the remainder of this section, we focus on the motivation behind the alter-nate weight update in SMCS. The intuition is that since the space does not grow ateach iteration, it is no longer true that a particle at a given iteration has a uniquepossible ancestor in the previous generation. Unique ancestry provides a signifi-cant computational convenience for SMC algorithms. Otherwise, the calculationof importance weights involves a marginalization (integration) over all possibleancestor particles. Unique ancestry is true in classical SMC because particles areconstructed by adding a suffix. It no longer holds once MCMC-type moves areused to perturb particles, as the several sequences of perturbations can result in thesame destination.SMCS circumvents the issue of non-unique ancestry by introducing auxiliaryvariables. This allows SMCS to be interpreted and analyzed as a standard SMCalgorithm on a transformed space. More precisely, SMCS employs a new sequenceof extended intermediate distributions {p˜in} defined such that p˜in has support En butadmits pin as a marginal. These intermediate distributions can be conceptualizedas distributions over particle trajectories. Each perturbation increases the lengthof a particle trajectory. As a result, a problem with constant dimension has been8transformed into one of increasing dimension, a situation in which traditional SMCis applicable. The integral part of SMCS is that the sequence {p˜in} is defined suchthat pin is admitted as a marginal. This ensures that the final weights of the particletrajectories approximate pin without any additional computation or integration.These intermediate distributions are designed using artificial backward Markovkernels Ln−1 : E×E → [0,1] with densities Ln−1(xn,xn−1). Specifically,p˜in(x1:n) = γ˜n(x1:n)/Znγ˜n(x1:n) = γn(xn)n−1∏j=1L j(x j+1,x j).Notice that p˜in admits pin as a marginal. Each trajectory x(i)1:(n−1) is propagated tox(i)1:n using a proposal kernel Kn−1(xn−1,xn). This results in a recursive expressionfor particle weights given bywn(x1:n) = wn−1(x1:(n−1)) · w˜n(xn−1,xn) (2.2)w˜n(xn−1,xn) =γn(xn)Ln−1(xn,xn−1)γn−1(xn−1)Kn(xn−1,xn). (2.3)Given this formulation, we can now express the SMCS framework algorithmi-cally, as given by [9], as Algorithm 2. Here, η1 denotes the initial distribution fromwhich the particles are drawn, with η1(x) denoting its density.The output of Algorithm 2 is a collection of particles X (1:N)1:p ,W(1:N)p with eachparticle consisting of a trajectory and a weight. Because p˜ip admits pip as a marginal,pip can be unbiasedly approximated using the marginalized particles X (1:N)p ,W (1:N)p .This concludes the SMCS review.It is important to recognize the power of the generality of the formulation ofSMCS. Given this flexible framework, Chapter 3 is primarily a description andmotivation of a particular formulation of proposals, intermediate distributions, andbackward kernels which result in a particular class of SMCS’s (referred to asDrSMC) which can handle problems involving deterministic relationships. Be-fore moving into this detailed exposition, it is useful to review Hamiltonian MonteCarlo, as it plays a key role in the development of efficient proposals for DrSMC.9Algorithm 2 : SMCSfor i = 1, . . . ,N doX (i)1 ∼ η1w1(X(i)1 )← γ1(X(i)1 )/η1(X(i)1 )end forfor n = 2, . . . , p doif the effective sampling size, (∑Ni (W(i)n−1)2)−1, is smaller than a threshold T :thenResample the particlesW (i)n−1← 1/N for all iend iffor i = 1, . . . ,N doX (i)n ∼ Kn(X(i)n−1, ·)w˜(i)n ← w˜n(X (i)) (Equation (2.3))W (i)n ←W(i)n−1w˜(i)n /∑Nj=1W( j)n−1w˜( j)n .end forend for2.3 Hamiltonian Monte CarloConsider a k-dimensional continuous random variable X1:k whose distribution piadmits a differentiable density. Random walk Metropolis-Hastings (MH) is a tra-ditionally popular approach for approximating such distributions. Unfortunately,the performance of random walk Metropolis-Hastings (MH) worsens as k grows.The larger the dimension, the higher the likelihood that the proposal is poor in atleast one of the dimensions of pi . To compensate, the variance of the random walkis typically decreased as k grows, resulting in the speed of the exploration beingunfeasibly slow. The computational cost per independent sample in random walkMH is O(k2) [8]).In many of these cases, this curse of dimensionality can be mitigated throughthe use of Hamiltonian Monte Carlo (HMC) [22]. Let L (X1:k) denote the log-arithm of this density. Hamiltonian Monte Carlo (HMC) is a specialized auxil-iary variable Markov Chain Monte Carlo (MCMC) which incorporates informationabout the gradient ofL (X1:k) to construct a proposal which favours moves towardshigh density areas. The results is far-off proposals that maintain high acceptance10probability. The cost per independent sample in HMC is approximately O(k5/4)[19], allowing for better scaling to high dimensions than random walk MH. In ad-dition, HMC’s use of gradient information results in proposals which can capturehighly correlated variables and other complicated dependencies.HMC is based on a concept from physics known as Hamiltonian dynamics,meaning it has a physical interpretation which provides intuition for the workingsof the algorithm. Consider the following analogy for HMC sampling. Suppose aball is rolling around in an irregularly shaped bowl. An observer stands over thebowl taking photos of the ball at regular intervals. The camera is designed suchthat when a photo is snapped, a poof of air is expelled from the camera, randomlyaltering the ball’s momentum. HMC is an algorithm that leverages the fact that,under certain conditions of the system, the proportion of time the ball spends ina region of the bowl is exponentially proportional to that region’s volume. Thus,given an appropriately shaped bowl, the location of the ball resultant series of pho-tos could be used to approximate a distribution of interest. The “certain conditions”are that the movement of the ball in the bowl is governed by Hamiltonian dynam-ics, and that the poofs of air update the momentum according to a standard normaldistribution.The above system can be viewed as an MCMC algorithm over an augmentedspace. Consider a more general setting where bowl exists in a (k+1)-dimensionalspace (with the k+1th dimension being the depth of the bowl). At any given point,the state of the ball can be characterized by k real-valued location variables, and kreal-valued momentum variables. Let X t1:k denote the location variables at time t,and St1:k denote the corresponding momentum variables. The depth of the bowl ata location X1:k is given by L (X1:k). According to Hamiltonian dynamics, the total“energy” of the ball at time t is given byH (X t1:k,St1:k) =−L (Xt1:k)+K (St1:k), (2.4)where the summands represent “potential energy” (height of the ball) and “kineticenergy” (a function of momentum), respectively. Typically, the energy function isgiven by K (St1:k) = ∑kn=1 X2n /2. The potential energy is the negative log densityof pi , and the kinetic energy is the negative log density of a standard k-dimensional11normal distribution. Thus, Equation 2.4 can be interpreted as the negative logdensity of a joint distribution on X t1:k and St1:k. It admits pi as a marginal, and is thetarget density of the HMC algorithm.The movement of the ball in between photos can be interpreted as a determin-istic MCMC proposal. It is governed by a system of partial differential equationscalled Hamilton’s equations. These are shown below as Equations 2.5 and 2.6.dStndt=−∂H∂X tn(2.5)dX tndt=∂H∂Stn(2.6)This movement conserves energy, meaning that H (X t1:k,St1:k) (and thereforethe target density) is invariant to the ball’s rolling. This is a desirable property fora MCMC proposal, as it eliminates the need of a MH rejection step. However,Hamiltonian motion alone is not enough to build a valid MCMC proposal, as thechain would be restricted to a single contour of the density. HMC achieves fullergodicity through the second momentum refreshing (“air poof”) step. That is,at fixed time points (after each photo) a Gibbs’ step is applied to the momentumvariables, resampling them from their standard normal marginals.In situations where the Hamiltonian mechanics can be integrated analytically,the result is an MCMC algorithm capable of far off proposals with no MH re-jection step. Unfortunately, except for special cases such as in [24], Hamilton’sequations cannot be integrated exactly. Instead, standard practice is to discretizethe movement using a “leapfrog” integrator [22]. Given a discretization step-sizeε , a leapfrog update is given by the following function,Leapfrog(X1:n,S1:n,ε){S˜1:n← S1:n +(ε/2)∇X1:nL (X1:n)X˜1:n← X1:n + ε S˜1:nS˜1:n← S˜1:n +(ε/2)∇X˜1:nL (X˜1:n)return (X˜1:n, S˜1:n) } where ∇θ denotes the gradient with respect to θ . The inter-val of “time” between samples is controlled by ε and the number of leapfrog steps.Note that “time” is an artificial computational construct which bears no direct re-lationship with physical time.12A standard HMC proposal consists of a Gibbs resampling the momentum fol-lowed by a simulation of the particle moving according to Hamiltonian dynamicsfor L leapfrog steps. The inexact nature of the leapfrog steps necessitates the intro-duction of a MH rejection step to maintain invariance of exp(H(θ , p)). Algorithm3 summarizes a standard implementation of an HMC proposal. Here, N (0, I)denotes a multivariate normal distribution (with mean 0 and identity covariancematrix) and unif(0,1) denotes a uniform distribution on the interval [0,1].Algorithm 3 : HMCInputs: X1:n, ε , L, H ;X˜1:n← X1:nDraw S1:n ∼N (0, I)S˜1:n← S1:nfor i = 1, . . . ,L doX˜1:n, S˜1:n← Leapfrog(X˜1:n, S˜1:n, ε)end forDraw u∼unif(0,1)MH← exp(H (X1:n,S1:n)−H (X˜1:n, S˜1:n))if u < MH thenreturn (X˜1:n, S˜1:n)elsereturn (X1:n,S1:n)end ifReversibility of the HMC proposal holds due to the negation the momentum atthe end of the simulation. In practice, this negation need not be performed becauseit is immediately followed by a Gibbs resampling of S1:n. HMC can still be valid ifthe support of θ is constrained, as long as the support is connected (e.g. θd ≥ 0). Ifa leapfrog step violates the constraint, it is standard to negate p and reflect θ acrossthe violated constraint boundary.The performance of HMC depends heavily on choice of ε (the step size) and L(the number of steps). If ε is chosen too large, the associated discretization errorresults in an excessive proposals rejection rate. Moreover, a choice of ε that istoo small leads to wasted computation time. A too small value of L will resultin successive proposed values being too close to each other. If L is too large, theparticle’s trajectory form a loop, also resulting in wasted computation.13Heuristics for choosing ε and L, as well as strategies for tuning them by hand,are discussed in [22]. Unfortunately, this tuning is time consuming, often involv-ing multiple preliminary runs and may require expert knowledge. Recently, therehas been progress in automatically tuning HMC proposals [19, 32]. As will be-come apparent in Chapter 3, DrSMC represents a particularly difficult case fortuning ε . Fortunately, I have developed a specialized leapfrog integrator based onsplit Hamiltonian Monte Carlo [26] for DrSMC on problems involved deterministicsums. I elaborate on this approach in Chapter 3.14Chapter 3DrSMCDrSMC is a new Sequential Monte Carlo Sampler (SMCS) for continuous randomvectors conditioned on a deterministic constraint. This chapter is a presentationof the DrSMC algorithm which is organized as follows. Section 3.1 outlines abroad class of deterministically constrained Bayesian inference problems to whichDrSMC can be applied. Sections 3.2 – 3.4 define DrSMC in terms of its SMCScomponents. Section 3.1 formulates the intermediate distributions, Section 3.2provides the form of backward kernels, and Section 3.4 discusses several choicesfor the proposal kernels. In this Section, emphasis is placed on the exposition ofa specialized proposal kernel developed specifically for problems involving deter-ministic sums. A general algorithmic statement of DrSMC is provided in Sec-tion 3.5. Section 3.6 discusses how to exactly (rather than approximately) enforcedeterministic constraints using DrSMC for deterministic sum problems.3.1 Problem Formulation and General IdeaConsider the problem of conducting inference on a model containing latent vari-ables X1:k with a joint density p(x). We are interested in computing expectations ofsome test functions h (for example, indicator variables, moments, etc), that condi-tion on satisfaction of a deterministic constraint: E[h(X)|Y,G], where G= f (X)−s,and G = 0.1 Here, f : Rk → R is a function encoding a constraint, assumed to be1Formally defining this conditional expectation requires some care to avoid conditioning on anevent of zero probability. To be more precise, what we wish to compute is a version of a density15continuous, and s ∈ R is assumed to be known. We call G the random Lagrangian,in parallel to the form and function of the classical Lagrangian in the method ofLagrange multipliers from optimization.DrSMC approximates the posterior distribution X |G by generating a collectionof N particles{(x(1),w(1)), . . . ,(x(N),w(N))}which satisfy the deterministic rela-tion, either exactly or approximately, depending on the precise form of f —this isdiscussed in Section 3.6. The intuition behind DrSMC is that the particles in thefirst SMC iteration approximate the distribution p(x) (drawn exactly from p(x) if itis easy to sample, otherwise obtained through importance sampling), and then theseparticles are propagated through a series of intermediate distributions to graduallyintroduce the constraint G = 0.Note that it is possible to use SMCS (and therefore DrSMC) to gradually in-troduce other attributes of p(x), such as a likelihood term [7], multi-modality [23],or monotone constraints [17]. In theory, these approaches can be combined withDrSMC to simultaneously introduce several aspects the target distribution. How-ever, tuning such algorithms is a very difficult problem in practice. Empirically, wehave achieved promising results in some small examples through time-consumingpilot runs. However, developing an automatic tuning strategy which generalizesto many instances is still an open problem. For this reason, here I focus on anexposition of DrSMC which solely introduces a constraint.3.2 Intermediate DensitiesLet ϕσ2(x) denote a normal density with mean 0 and variance σ2. DrSMC grad-ually enforces the deterministic relationship by defining intermediate distributionspin with the following densities:γn(x) = p(x)ϕbn( f (x)− s)where b1:p is a positive decreasing sequence. This annealing approach ensuresthat a variety of particles following p(x) are considered, but as the algorithm pro-gresses through the intermediate distributions, the weight updates and resamplingof E[h(X)|G], evaluated at G = 0, and selected in the same fashion as in the standard argument forjustifying conditioning on a continuous observation in Bayesian models.16steps cause DrSMC to hone in on particles which approximately satisfy the con-straint and have high probability. DrSMC’s performance is sensitive to choice ofb1:p. Annealing too quickly causes the ESS to plummet which results in particledegeneracy. Annealing too slowly results in wasted computation. In Section A.2,it is shown that under reasonable assumptions, a good functional form for b1:p isbn = αβ−n,where α and β are both real numbers such that α > 0 and β > 1. The use of thenormal density is valid even if the range of f is only a subset of the real numbers;it amounts to an truncated normal density.Notice that, for finite p, γp is not the exact posterior distribution. Particles fol-lowing these densities do not exactly satisfy the deterministic relationship. How-ever, they can be made arbitrarily close to the exact target distribution by making bplarge enough. There are situations in which it is possible to enforce the relationshipexactly with a final rounding step, while preserving the correct target distribution.We detail this procedure in Section 3.6.3.2.1 Comparison to SCMCThe use of SMCS intermediate distributions to enforce a deterministic constraintwas independently developed in [17] as a specialized application of SequentiallyConstrained Monte Carlo (SCMC). However, SCMC uses a different formulationof intermediate distributions. There, the intermediate distributions are given byγ ′n(x) = p(x)Φ(−τn| f (x)− s|),where Φ(·) denotes the cumulative distribution function of the standard normal dis-tribution, and τ1:p is an increasing sequence of non-negative numbers. However,this series of intermediate distributions lack some of the advantages of those em-ployed in DrSMC. The functional form of the optimal sequence of τ1:p is unknownfor problems involving deterministic constraints, making it much more difficultand time-consuming to determine a sequence τ1:p which controls the descent ofthe effective sample size. Additionally, HMC-based proposal kernels discussed in17Section 3.4 would require additional developments, because γ ′n is not differentiableeverywhere.3.3 Backward KernelsTypically, a good measure of the performance of a SMCS is the variance of theunnormalized importance weights wn, with a lower variance indicating better per-formance. Using this measure, given the proposal kernels, the optimal backwardkernel [9] of an SMCS is of the formLoptn−1(xn,xn−1) =ηn−1(xn−1)Kn(xn−1,xn)ηn(xn)where ηn is the marginal importance distribution at time n and K1:p is the se-quence of proposal kernels. Unfortunately, calculating ηn(xn) point-wise is oftenintractable as it involves marginalizing over all possible paths that end in xn. It istherefore customary to use suboptimal backward kernels in practice [9].DrSMC employs a backward kernel of the formLn−1(xn,xn−1) =γn(xn−1)Kn(xn−1,xn)γn(xn).Assuming that γn(xn−1)/γn(xn) is well approximated by ηn−1(xn−1)/ηn(xn), such abackward kernel is a tractable approximation of the optimal kernel. This backwardkernel first described by [23] as a justification of the validity of Annealed Impor-tance Sampling. It was also given special attention by [9] because it reduces theweight update calculation in Equation 2.3 tow˜(xn−1,xn) = γn(xn−1)/γn−1(xn−1).Given the intermediate distributions γ1:p provided in Section 3.2, it can be furtherreduced tow˜n(xn−1,xn) =exp((b−1n−1−b−1n)( f (xn−1− s)2) . (3.1)Such an expression indicates that the weight of a particle at step n is indepen-18dent of its current value xn given xn−1. As is discussed in Section 3.5, there arecomputational benefits of such an independence. Note that DrSMC’s backwardkernel formulation is only valid when Kn(xn−1,xn) leaves pin invariant. Otherwise,DrSMC may not admit the target distribution as a marginal. This necessitates thatDrSMC employ proposal kernels which satisfy this property. Section 3.4 elabo-rates on developing such kernels.3.4 Proposal KernelsBuilding suitable pin invariant proposal kernels Kn is the most difficult part of de-veloping a DrSMC. Here, we consider two directions used to formulate kernels:Metropolis-Hastings and Hamiltonian Monte Carlo. The success of these propos-als rely heavily on tuning parameters. For many instances, developing efficienttuning approaches remains an open problem. Instead of focusing on the negative,the emphasis in this section is placed on a specialized kernel formulation for prob-lems involving deterministic sums.3.4.1 Metropolis-Hastings ProposalsGiven their popularity for Markov chain Monte Carlo (MCMC), it is reasonableto first consider Metropolis-Hastings (MH) when developing pin invariant kernels.Indeed, such proposals are the standard in many SMCS (such as SCMC). Moving aparticle according to a MH kernel consists of two steps. First, a new value xn+1 forparticle is proposed (conditional on the current value xn) according to the densityρn+1(xn+1|xn). Then, the proposal is accepted if and only ifu <γn+1(xn+1)ρn+1(xn|xn+1)γn+1(xn)ρn+1(xn+1|xn),where u is randomly generated from a uniform(0,1) distribution. This accept/rejectstep ensures that Kn leaves pin invariant. Point-wise evaluation of such Kn ker-nels can be computationally intensive, mainly due to the rejection step. EvaluatingKn(xn,xn) involves marginalizing over all possible rejected proposals given a start-ing point. Our choice of backward kernel results in a SCMC weight update whichdoes not involve point-wise evaluation of Kn thereby avoiding these potential dif-19ficulties. However, there are other obstacles which can result in poor performanceof DrSMC under a MH proposal kernel.Typically, ρn+1(xn+1|xn) consists of a step in a random walk2. Due to theirsimplicity, random walk MH kernels yield algorithms for which the computationalcost per particle is low. However, many particles and steps are usually needed inpractice since MH moves are relatively uninformed.As DrSMC progresses (n increases), the density of γn is increasingly concen-trated on a small region where the constraint is approximately satisfied. This meansthat the acceptance rate of uninformed far-off proposals is decreasing, thus neces-sitating smaller and smaller suggested moves as n progresses. This requires tuningof the sequence of ρn. In addition, smaller and smaller proposed moves result inthe proposal’s exploration of the space being very inefficient once n gets large, re-gardless of how well it is tuned. As the dimension k of a problem increases, theseproblems become more and more pronounced. Correlated variables in X furtherexacerbate the issue. This motivates another direction of proposals which are moreinformed.3.4.2 HMC ProposalsScaling up to high-dimensional problems can be made much easier through theuse of informed moves. If the target density is differentiable, informed movescan be developed by leveraging gradient information, such as in HMC. While theusage of HMC in an MCMC framework is increasingly popular, so far the use ofHMC in a SMC framework has been limited. There has been previous work [6, 25,28] in which HMC has been employed in a sequential sampling framework, butnone within an SMCS framework. In [6, 25], HMC is implemented within SMC(not SMCS; i.e. an expanding state), and [28] describe HMC within AnnealedImportance Sampling which contains no resampling step. As far as I know, HMCwithin SMCS is a novel contribution of this thesis.My formulation of HMC as a SMCS proposal kernel follows a similar formatto an implementation of HMC in an MCMC context (as described in Algorithm 3).First, the auxiliary momentum variables sn are instantiated by drawing from a mo-2Here, our discussion concerns such proposals. Other choices from the MCMC literature can beused, and may negate such issues, depending on the specifics.20mentum distribution. Then, the Hamiltonian movement is simulated (typically us-ing the leapfrog integrator) to obtain the proposed (x′n,s′n). Finally, a MH rejectionstep is applied. The variable xn+1 is obtained by discarding the momentum vari-ables after the MH step. Note that the new instantiation and then discard of themomentum variables within each step is what separates HMC within SMCS fromHMC within MCMC.The above process amounts to a non-traditional MH kernel. Therefore, manyof the benefits of a MH kernel are retained, such as invariance of pin and the avoid-ance of evaluating Kn. In addition, these kernels enjoy the benefits of HMC, suchas far-off proposals with high acceptance probability. In terms of the bowl analogyintroduced in Section 2.3, the sequence of distributions described by γ1:p corre-spond to bowls which become increasingly deep at values of x where the randomLagrangian G = 0. Therefore, Hamiltonian proposals have the benefit of systemat-ically preferring moves which advance the particle toward satisfying the constraint.As a result, DrSMC with HMC proposal kernels require far fewer steps to satisfythe constraint in practice than those with MH proposal kernels. In addition, thegradient of the log density of the random Lagrangian is often trivial to compute.Unfortunately, Hamiltonian motion is usually very computationally intensiveto simulate relative to random walks. This means that far fewer particles can beused than in DrSMC with MH proposal kernels. In addition, tuning HMC withinSMCS is very difficult. The existing tuning methods in[19, 32] are intended fortuning HMC as part of a MCMC strategy, not within a SMCS strategy such as inDrSMC. Tuning a HMC proposal for SMCS is a fundamentally different problem;MCMC strategies involve simulating a single Markov Chain from a single targetdistribution, whereas SMC samplers involve simulating multiple particles across aseries of different intermediate distributions. The existing SMC with HMC kernelliterature [6, 28] contains no discussion of automatic tuning procedures.In practice, the appropriate value for ε will differ across intermediate distribu-tions. If a naive leapfrog integrator is used, the latter steps of DrSMC require avalue of ε so small that decent proposals are effectively impossible. This is due tothe fact that as n increases, the Hessian of the random Lagrangian becomes moreand more extreme. This necessitates smaller and smaller values of ε as n increasesin order to keep the approximation accurate enough to have accepted MH moves.21This “vanishing epsilon” problem means that HMC kernels based solely on theleapfrog integrator are impractical at the latter steps of the DrSMC algorithm.The desire to eliminate the need of vanishing epsilons motivated my devel-opment of the split HMC proposal kernel for DrSMC, which I now describe inSection 3.4.3. Note that thus far, my developments are only applicable to problemsfor which the deterministic relationship can be expressed as a sum.3.4.3 Split-HMC Proposal for Deterministic SumsTo facilitate the split-HMC formulation which eliminates the vanishing epsilonproblem, I will briefly summarize the general mechanics of split-HMC, and thendiscuss the ingredients of the specialized split-HMC formulation.Although exact simulation of Hamiltonian motion is typically intractable, thereare special cases (such as in [24]) for which it is possible to analytically solveEquations 2.5 and 2.6, allowing one to simulate the motion exactly. When such asolution is available, simulation is very efficient and there is no need to include anMH rejection step. Unfortunately, these special cases rarely arise in applications.However, a recent advance in the literature, called split Hamiltonian Monte Carlo,has made it possible to exploit the computational efficiency associated with exactsimulation even when the target density is not one of the special cases.In the split HMC [26] strategy for simulating Hamiltonian motion, the Hamil-tonian energy function is “split” into two components, one of which admits ananalytic solution. That is, H (X t1:k,St1:k) in Equation 2.4 is decomposed into twosummands,H (X t1:k,St1:k) =H1(Xt1:k,St1:k)+H2(Xt1:k,St1:k), (3.2)where H2(X t1:k,St1:k) is a special case for which Hamiltonian motion can be simu-lated exactly. Typically, the decomposition is such thatH1(X t1:k,St1:k) =−L1(Xt1:k)and H2(X t1:k,St1:k) =−L2(Xt1:k)+K (St1:k), with −L1(Xt1:k)−L2(Xt1:k) being thenegative log likelihood of the target distribution. It is shown in [26] that in suchcases, it is valid to integrate the Hamiltonian motion by replacing each leapfrogstep with a hybrid of exact simulation of H2(X t1:k,St1:k) and a leapfrog step onH1(X t1:k,St1:k). This is expressed in the SplitLeapfrog(X1:n,S1:n,ε) function shownbelow. Note that here, EH2ε (·) is a function that returns the result of perfectly sim-22ulating Hamiltonian motion according to H2 for an ε time interval.SplitLeapfrog(X1:n,S1:n,ε){S˜1:n← S1:n +(ε/2)∇X1:nL1(X1:n)(X˜1:n, S˜1:n)← EH2ε (X˜1:n, S˜1:n)S˜1:n← S˜1:n +(ε/2)∇X˜1:nL1(X˜1:n)return (X˜1:n, S˜1:n) }.Because of the approximate component of the SplitLeapfrog integrator, it isstill necessary to implement an MH rejection step after the proposed move. In [26],it is claimed that this approach performs well if H2 is a reasonable approximationof H . Often, much larger values for ε can be employed in split-HMC than for thenaive leapfrog integrator. This was my motivation for investigating split-HMC toeliminate the vanishing epsilon problem in DrSMC.In the appendix (Section 6.1), I demonstrate that Hamiltonian motion can besimulated exactly for random Lagrangian terms which are expressed as ∑ki=1 Xi−s.Therefore, for DrSMC problems involving deterministic sum constraints, it is pos-sible to employ a split-HMC kernels as outlined above, with L1 being the logdensity of the prior andL2 being−G2/bn. This results in a SplitLeapfrog functionwhich alternates between updating the momentum based on the prior and simulat-ing motion according to the constraint.Using this integrator instead of the naive leapfrog within a HMC-based Knhas several practical benefits. The expression of H1 does not involve bn. As aresult, ε need not depend on bn, as ε need only be tuned to accommodate H1.Since b1:p is the only varying parameter in the sequence γ1:p, tuning ε for DrSMCis no longer a moving target; H1 is the same from step to step. This allows asingle ε to be used for the DrSMC full algorithm run, eliminating the vanishingepsilon problem. The proposal kernels continue to make far off proposals in latterstages of the algorithm without any increased computational cost or low acceptanceprobabilities. In addition, ε can now be tuned using the existing methods in theHMC for MCMC literature, e.g. [32].This negates the major problems associated with HMC proposal kernels pre-viously discussed. For this reason, I advocate the use of such split-Hamiltonianproposal kernels for instances in which random walk kernels perform poorly (e.g.high-dimensional problems with correlated variables).233.5 DrSMC AlgorithmsWith the SMCS components (intermediate distributions, backward kernels, andproposal kernels) which characterize DrSMC defined, I can now state DrSMC asAlgorithm 4. This statement is for a generic pin invariant proposal kernel Kn, anyof the three (or others) discussed in Section 3.4 are valid.The choice of backward kernel described in Section 3.3 resulted in a weightupdate given by Equation 3.1. Such a weight update indicates that w˜n(X) is inde-pendent of Xn given Xn−1. As recommended by [9], this independence allows fora modification of the SMCS algorithm such that resampling occurs before propa-gation of the particles. This improves efficiency of the algorithm, as time is notwasted propagating particles that do not survive the resampling step. Algorithm 4includes this modification.Algorithm 4 : DrSMCfor i = 1, . . . ,N doX (i)1 ∼ γ1w1(X(i)1 )← 1/N for all i.end forfor n = 2, . . . , p dofor i = 1, . . . ,N doX (i)n ∼ Kn−1(X(i)n−1, ·)w˜(i)n ← w˜n(X(i)n−1) (Equation (3.1))W (i)n ←W(i)n−1w˜(i)n /∑Nj=1W( j)n−1w˜( j)n .end forif If the effective sampling size, (∑Ni (W(i)n )2)−1, is smaller than a threshold T :thenResample the particlesW (i)n ← 1/N for all iend ifend forNote that the above DrSMC results in particles that approximately satisfy theconstraint G = 0. For deterministic sum constraints, it is possible to ensure that theconstraint is exactly satisfied. I elaborate on this point in Section 3.6.243.6 Exact EnforcementIn DrSMC, as particles are propagated through the intermediate distributions γ1:p,the values of the random Lagrangians associated with the particles converge to 0.However, unless additional steps are taken the constraint is not perfectly satisfiedin finite time (i.e. f (X) ∼ N(s,bp)). When f (X) encodes a sum constraint, aug-menting the DrSMC to satisfy the constraint exactly is straightforward through theintroduction of a (p+1)th SMCS step. However, this final step does not utilize thesame form of intermediate distribution, backward kernel, or proposal kernel as theprevious steps. Instead, they are defined as follows:γp+1(x) = p(x)1( f (x) = s),where 1(·) is defined to be 1 if its argument is true, and 0 otherwise.Lp+1(xp+1,xp) = ϕbp(xp+1(k)− xp(k))where xn(k) denotes the value of the kth variable in xn.Kp+1(xp,xp+1) = δ (xp+1 = y(xp)),where δ (·) is the dirac-delta distribution andy(xp) =(xp(1),xp(2), . . . ,yp(k−1),s−k−1∑i=1xp(i)).Note that this proposal kernel deterministically enforces the constraint. Althoughnon-standard, this configuration of backward kernel and proposal distribution arevalid, meeting the criteria outlined in [9]. They also result in several cancellationsin the weight update, resulting in the final expressionw˜p+1(xp,xp+1 = p(xp+1)/p(xp).For DrSMC for deterministic sums, this final step can be implemented at the endof Algorithm 4. If desired, the ESS can be recalculated, followed by a resampling25step. Chapter 4 demonstrates DrSMC with a split-HMC proposal on a syntheticdeterministic sum problem.26Chapter 4Demonstration and ExperimentsThis chapter showcases the performance of DrSMC with split-HMC on a syn-thetic deterministic sum problem. The organization of the chapter is as follows.Section 4.1 formulates the problem. Section 4.2 both describes a DrSMC algo-rithm configuration for attacking the problem and illustrates experimental resultsobtained from simulation. Section 4.3 compares the performance of the DrSMCalgorithm to that of two existing algorithms, Dynamic Scaled Sampling (DYSC[20]), and Sequentially Constrained Monte Carlo (SCMC [17])).4.1 Problem FormulationConsider a 15-dimensional random vector X with multivariate normal distributionMV N(0,Σ). Here, Σ= DΩD where Ω is a 15×15 correlation matrix defined suchthatΩi, j =1 if i = j−0.6 if i− j is odd0.6 otherwise,and D is a 15×15 diagonal matrix with Di,i =√(16− i). Essentially, Ω encodestwo distinct groups ({X1,X3,X5, . . . ,X15} and {X2,X4, . . . ,X14}) of variables withpositive within-group correlation (0.6) and negative between-group correlations (-0.6). The marginal variances of X1:15 decrease sequentially such that Xi possessesa marginal variance of 16− i.27The investigation in this chapter focuses on inferring the posterior distributionof the vector X described above given that ∑15i=1 Xi = 20. This synthetic problem isan example of a deterministic sum problem for which DrSMC with split HMC pro-posals is well-suited. It is relatively high-dimensional, possesses correlation amongvariables, and involves inhomogeneity among the variables. An added bonus isthat the true posterior distribution for this problem can be obtained analyticallyusing special properties of the multivariate normal distribution (details regardingthis derivation are provided in the appendix). The availability of the exact solutionfacilitates assessment of DrSMC’s performance.4.2 ResultsFor the problem outlined in Section 4.1, consider the DrSMC with split-HMC pro-posals algorithm configuration outlined in Table 4.1. Here, the number of steps andnumber of particles are balanced to achieve good performance in a computationalruntime of five minutes. Note that the 31st step is an exact enforcement step asdescribed in Section 3.6. The resampling threshold of 250 (half the number of par-ticles) is standard practice in the SMC literature. The value of α is specified suchthat b1 is approximately equal to the variance of the unconstrained ∑15i=1 Xi, andthe parameter β is assigned a value which was chosen using a short pilot run (itcorresponds to C = 0.8 in Section 6.2). The HMC parameters are assigned valueswhich were determined using pilot runs on the prior. They balance computationalburden, high acceptance rate, and efficiency in exploring the space.Diagnostic plots and indicators of performance for a run of the algorithm areprovided as Figures 4.1– 4.4. Figure 4.1 provides a comparison of the approxi-mated posterior with the true theoretical values. Section 6.3 provides the detailson the derivation of the theoretical values. This plot visually demonstrates that theDrSMC algorithm accurately captures both the shape and mean of the marginaldistributions of X1:15.Figure 4.2 shows the trajectory of effective sample size across steps, as well asthe number of particles which survive each resampling step. Note that the effec-tive sample size is reset each time resampling is performed. As desired, the ESSdescends gradually. This evidence supports the adoption of an exponential decay28Table 4.1: Configuration of the DrSMC with HMC proposals used to approx-imate the distribution described in Section 4.1.SMCS ParametersNumber of Particles 500Resample Threshold 250Number of Steps 31Determinants of b1:pα 14.5β 1.2026HMC parametersε 0.3L 3schedule of b1:p. The plot also demonstrates the number of HMC proposals thatare accepted at each step. Using the split-HMC proposal, one would expect theacceptance rate to remain relatively constant. The volatility of the acceptance rateof the proposal at later steps is unexpected, and requires further investigation.Figure 4.3 shows the trajectory of f (X) of the particles as they are propagatedthrough the 30 steps. At step one, f (X) < 20 for all particles. This is to be ex-pected, as the prior is centred at f (X) = 0. Despite this poor initialization withrespect to the constraint, a combination of split-HMC proposals and resamplingensures that all particles approximately satisfy f (X) = 20 by step 30. The gradualconvergence shown in Figure 4.3 is the hallmark of a healthy DrSMC algorithm. Itdemonstrates that both the proposal and resampling steps are helping to push theparticles toward the constraint. This is in contrast to situations in which the pro-posal is poor. In such instances, almost all movement toward the constraint occursdue to resampling. In such cases, the f (X) trajectory plot has a block-like structurewith large declines at the resample steps. As a result, DrSMC performs poorlybecause the approximation is based on only a few initial ancestor particles.Figure 4.4 provides a more complete view of how the particles satisfy the con-straint at step 30. The density estimate of constraint before the exact enforcementstep is as expected; it closely resembles a normal density centred at 20 with vari-ance b30. More measures of performance of DrSMC are available for multiple runsin Section 4.3.3, where it is contrasted with the performance of existing algorithms.29 1 2 3 4 5 6 7 8 910 11 1213 14 150.00.10.20.30.40.00.10.20.30.40.00.10.20.30.40.00.10.20.30.40.00.10.20.30.4−10 0 10 −10 0 10 −10 0 10Valuedensity EstimatorTruthDrSMCFigure 4.1: A comparison of DrSMC posterior density estimates (shownin red) and the true posterior densities (shown in green) for the 15-dimensional multivariate normal problem described in Section 4.1.4.3 Numerical Comparisons to Previous MethodsIn order to fully appreciate the strength of DrSMC with split-HMC proposals, itis useful to contrast performance with that of approaches in the literature. In thissection, I compare the results of running DrSMC to that of two existing algorithms.The results of running each algorithm five times on the problem outlined in Sec-tion 4.1 are available in Section 4.3.3. Before presenting these results, I will brieflydescribe the configurations of the two existing algorithms in Sections 4.3.1 and4.3.2.30llllllllllllllllllllllllllllllll lll lll l l l l l l l l l l l lll l l l l l llll l l l l l l l l l l l l l l l llll lllllllll01002003004005001 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31StepNumber of Particles MeasurementslllESSSurvivorsAcceptsFigure 4.2: An illustration of the effective sample size trajectory (shown inblue) across 31 steps of a DrSMC approximation for the 15-dimensionalmultivariate normal problem described in Section 4.1. The red line de-picts the number surviving particles after each resampling, and the greenline reports the number of HMC proposals acceptance.4.3.1 DYSCDynamic scaled sampling (DYSC) [20] is an importance sampling framework whichtargets the posterior distribution of a sequence of random variables X1:k conditioned∑ki=1 Xi = s. The DYSC procedure is characterized by sequentially transforming theproposal distributions to ensure that the expected sum satisfies the constraint. Al-though it is primarily intended for sums of independent (and identically distributed)non-negative random variables, the DYSC philosophy easily extends to more gen-eral settings, such as our synthetic problem described in Section 4.1.The recipe for the general DYSC framework considered here is described asAlgorithm 5, with γ(·) denoting a (potentially unnormalized) target distribution andq j(·|η j) denoting the marginal distribution of X j translated such that its expectationis η j. In [20], the sequential DYSC proposals include rejection sampling to ensure31−20−100102010 20 30StepConstraint Value102030StepFigure 4.3: A plot demonstrating the trajectory of f (X) for a 500 particle,31 step DrSMC algorithm applied to the 15-dimensional multivariatenormal problem described in Section 4.1.that the sum does not exceed s at intermediate steps. Here, no such step is necessaryas we are not dealing with strictly-positive variables.Algorithm 5 : DYSCfor i = 1, . . . ,N dow˜i← 1for j = 1, . . . ,k doη j← (s−∑ j−1m=1 Xm)/(k+1− j)q¯ j(·)← q j(·|η j)Draw X j ∼ q¯ j(·)w˜i← w˜i/q¯ j(X j)end forX (i)← (X1, . . . ,Xk)wi← w˜iγ(X (i))end forw1:N ← w1:N/∑Ni=1 wiConsider running Algorithm 5 with N = 250000 particles to approximate the32024619.8 19.9 20.0 20.1 20.2 20.3Constraintsdensity typeTheoretical Constraint SatisfactionObserved Constraint SatisfactionFigure 4.4: A density plot (shown in green) of f (X) at the 30th step (thefinal step before the exact enforcement step) of a 500 particle DrSMCalgorithm applied to the 15-dimensional multivariate normal problemdescribed in Section 4.1. The theoretical distribution for f (X) at thisstep is shown in red.synthetic target distribution described by Section 4.1. This value of N was cho-sen to achieve a computation time of five minutes per run. Figure 4.5 illustratesan overlay of the DYSC posterior estimates and the true posterior distributions. Itis a DYSC analog to Figure 4.1 for DrSMC. Comparing Figure 4.5 to Figure 4.1,DrSMC clearly outperforms DYSC on our synthetic problem. The DYSC approx-imation fails to capture the unimodal structure of the posterior. This failure isprimarily due to the high variability in the importance weights which results infew particles accounting for the majority of the the normalized weight. Seeing asthe importance distribution in DYSC did not capture the correlated structure of theprior, this was not unexpected. Further comparisons of DYSC and DrSMC, as wellas SCMC, are available in Section 4.3.3.33 1 2 3 4 5 6 7 8 910 11 1213 14 150.000.250.500.751.000.000.250.500.751.000.000.250.500.751.000.000.250.500.751.000.000.250.500.751.00−10 0 10 20 −10 0 10 20 −10 0 10 20Valuedensity EstimatorTruthDYSCFigure 4.5: A plot overlaying the DYSC posterior density estimates for250000 particles (shown in red) and the true theoretical values (shownin green) for the 15-dimensional multivariate normal problem describedin Section 4.14.3.2 SCMCSequentially constrained Monte Carlo (SCMC) [17] is a recently developed classof SMCS for sampling from distributions which possess constraints. SCMC is ageneral framework capable of enforcing a wide variety of constraints, including thedeterministic constraints considered in this thesis. Like DrSMC, SCMC targets aconstrained distribution by initializing particles from an unconstrained distributionthen propagating them through a series of intermediate distributions to graduallyenforce the constraint. For constraints such as the one described in Section 4.1,[17] recommend that the series of intermediate distributions follow the form which34Table 4.2: Configuration of the SCMC with MH proposals used to approxi-mate the distribution described in Section 4.1.SMCS ParametersNumber of Particles 3500Resample Threshold 1750Number of Steps 100Density Parameterτ(i) exp(i−1)Proposal Parameter (sd)σ(i) i−1was described in Section 3.2.1.We follow this structure when specifying a SCMC algorithm for approximat-ing the posterior distribution described in Section 4.1. Table 4.2 summarizes theparameters of SCMC used in the approximation. Each proposal is a random walkMH step in a randomly selected dimension. In order to maintain a reasonable MHacceptance rate, the standard deviation of the random walk decreases as the inverseof the step number. A similar strategy is employed for the intermediate distribu-tions. In order to maintain a reasonable effective sample decline, the parameter τgrows exponentially in the step size. Figure 4.6 illustrates an overlay of the SCMCposterior estimates and the true posterior distributions (an analog of Figures 4.1 and4.5). It is visually evident by comparing Figure 4.6 to Figure 4.1 that SCMC doesnot perform as well as DrSMC. Further comparisons are available in Section 4.3.3.4.3.3 Comparison of AlgorithmsThe results of Sections 4.2-4.3.2 suggest that DrSMC markedly outperforms bothDYSC and SCMC on the problem being considered. Although these illustrationsare illuminating and compelling, they are based on a single run of each algorithm.This is not enough data to make a conclusion as to the optimal algorithm for thisproblem. In this section, five independent runs of each algorithm are considered toverify that DrSMC performs better than its counterparts. Each algorithm’s perfor-mance is assessed based on its ability to approximate the true means for each ofthe fifteen variables.35 1 2 3 4 5 6 7 8 910 11 1213 14 150.000.250.500.751.000.000.250.500.751.000.000.250.500.751.000.000.250.500.751.000.000.250.500.751.00−10 0 10 20 −10 0 10 20 −10 0 10 20Valuedensity EstimatorTruthDYSCFigure 4.6: A plot overlaying the SCMC posterior density estimates for 3500particles (shown in red) and the true theoretical values (shown in green)for the 15-dimensional multivariate normal problem described in Sec-tion 4.1.Figures 4.7 and 4.8 illustrate the accuracy of the estimates for {X1,X3, . . . ,X15}and {X2,X4, . . . ,X14} respectively. The split of the variables into evens and odds(i.e. the two clusters), as well as the dotplot binning (to avoid point overlap), is per-formed to facilitate visualization. The true mean value for each variable is shown(yellow), as well as the five estimates associated with DrSMC (green), DYSC (red),and SCMC (blue). These two plots demonstrate that across multiple trials, DrSMCconsistently performs at least as well as (usually better than) DYSC and SCMC.Figure 4.9 further clarifies this point by showing the mean squared errors (MSE)in mean estimation across trials. The performance of DrSMC dominates that of36DYSC, and four of five of the DrSMC runs outperform all of the SCMC runs.Employing a Welch Two Sample t-test results in a significant difference (at anα = 0.05 significance level) between the mean MSE for DrSMC and the meanMSE for both SCMC and DYSC.ll lll ll l llll l l l lllllllll lllllll lllll l ll ll l ll ll l llllll lllll lll ll lllll l ll l ll llll l llllllll l l l lllll lllllllll l l llllllX2X4X6X8X10X12X14−3 −2 −1 0 1 2valueEstimatorllllTruthDrSMCDYSCSCMCFigure 4.7: A dot plot illustrating the estimated means (odd-indexed vari-ables) for the multivariate normal problem described in Section 4.1. Theestimates are DrSMC (green), DYSC (red), and SCMC (blue), respec-tively. The true means (obtained analytically) are shown in yellow.37ll lll ll l llll l l l lllllllll lllllll lllll l ll ll l ll ll l llllll lllll lll ll lllll l ll l ll llll l llllllll l l l lllll lllllllll l l llllllX2X4X6X8X10X12X14−3 −2 −1 0 1 2valueEstimatorllllTruthDrSMCDYSCSCMCFigure 4.8: A dot plot illustrating the estimated means (even-indexed vari-ables) for the multivariate normal problem described in Section 4.1. Theestimates are DrSMC (green), DYSC (red), and SCMC (blue), respec-tively. The true means (obtained analytically) are shown in yellow.lll lll ll lll llDrSMCDYSCSCMC0 20 40MSEEstimator EstimatorlllDrSMCDYSCSCMCFigure 4.9: A plot demonstrating the the mean squared error for the posteriormeans for five runs of DrSMC (green), DYSC (red), and SCMC (blue)on problem described in Section 4.1.38Chapter 5ConclusionIn this chapter, I make some brief concluding remarks regarding the contents ofthis thesis. The remarks are organized into two sections. Section 5.1 highlightssome of the novel contributions in this thesis. Section 5.2 discusses some potentialdirections for future work in the area of DrSMC.5.1 Novel ContributionsThis thesis made several new methodological contributions on the way to devel-oping the DrSMC algorithm and the split HMC proposal kernels for deterministicsums. Here, I briefly summarize some of the substantial contributions.HMC within SMCS: The existing literature is sparse on using HMC to build pro-posals within SIS or SMC. HMC within SMC is employed in [6, 25], and HMCfor annealed importance sampling is employed [28]. As far as I know, this thesiscontains the first general development of HMC within a SMCS framework. Notethat it is possible to formulate HMC within SMCS differently than in this thesis.For example, the algorithm could be altered such that the momentum variables arenot discarded at the end of each step.The derivation of a new Split-HMC kernel: In [26], the only types of Hamil-tonian splits considered are those which exploit exact simulation. In this thesis, I39developed a new splitting scheme which exploited something other than the normaldistribution (although similar to a normal distribution, the analog to the covariancematrix was not positive-definite). I derived an analytic solution to this new type ofsplit, which resulted in a new form of split-HMC. This new kernel eliminated theproblem of the vanishing epsilon for HMC kernels in deterministic sum DrSMCapplications. This development creates an opportunity for HMC to be used in ap-plications for which it was previously intractable.Normal density formulation of constraint annealing: The general statement ofDrSMC and the development of SCMC for deterministic constraints developed in[17] share many similarities. A crucial difference, however, is the functional formused to anneal in the constraints. The use of a differentiable normal density facil-itates the use of HMC proposal kernels and the novel development of a suitableannealing schedule. It also allowed the development of a tractable exact enforce-ment step for deterministic sums.5.2 Future WorkIn the future, there are several areas that could be explored to improve the perfor-mance and/or extend the applicability of DrSMC. In this section, I propose someprospective areas for future developments which could further improve the perfor-mance of DrSMC.HMC proposal kernel improvements: There are several possible avenues to ex-plore for improving the proposal kernel for deterministic sums, as well as to ex-tend split-HMC kernels to other constraints. For example, [2] developed a newsplit-HMC strategy called an exponential integrator. It is possible that elements ofthis integrator could be used to improve the split-HMC proposal kernels. Anotheravenue to explore is the use of different distributions for the momentum variables.Currently, a standard multivariate normal is used. However, multivariate normalswith inhomogeneous variances could be useful when X possesses marginal vari-ances which differ by orders of magnitude. There may also be a benefit to se-quentially scaling the momentum variable distribution throughout the DrSMC al-40gorithm. Since DrSMC is the first use of HMC within SMCS, it represents anopportunity to develop new methods for tuning the HMC ε and L in sequentialframeworks. As noted in the body of the thesis, further investigation is requiredinto the volatility of the HMC acceptance rate at latter stages of the DrSMC algo-rithm.Extending the applicability of DrSMC: Because of the many benefits special-ized split-HMC proposal kernel, DrSMC performs well when the deterministicconstraint is a sum. An extension of the split-HMC proposal kernel to a broaderclass of constraints would effectively extend the applicability of DrSMC. Such anextension would likely involve developing exact simulation of HMC for a broaderclass of distributions, or a clever re-parameterization of constraints which trans-forms them into sums. Similarly, it would be beneficial to extend the final exactenforcement step to situations other than sums. Another avenue to explore is aneffective tuning algorithm for simultaneously introducing multiple components ofthe target density, such as two constraints, or a constraint and a likelihood. Whenintroducing a likelihood, it may be worth investigating the use of a stochastic gra-dient for the HMC proposal kernel, as in [4]. Although there have been criticismsof use of stochastic gradients within an MCMC framework [1], it is worth investi-gating whether these criticisms extend to a SMCS framework.Finally, due to the success of the split-HMC kernel, it is worth investigatingits use for other sequential annealing approaches, such as approximate Bayesiancomputation [10] or simulated annealing.41Bibliography[1] M. Betancourt. The fundamental incompatibility of scalable HamiltonianMonte Carlo and naive data subsampling. In International Conference onMachine Learning (ICML), volume (In Press), 2015. → pages 41[2] W. Chao, J. Solomon, D. Michels, and F. Sha. Exponential integration forHamiltonian Monte Carlo. In International Conference on MachineLearning (ICML), volume (In Press), 2015. → pages 40[3] S. Chatterjee, P. Diaconis, A. Sly, et al. Random graphs with a given degreesequence. The Annals of Applied Probability, 21(4):1400–1435, 2011. →pages 2[4] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian MonteCarlo. In Proc. International Conference on Machine Learning, June 2014.→ pages 41[5] H. L. Chin and G. F. Cooper. Bayesian belief network inference usingsimulation. In UAI, pages 129–148, 1987. → pages 1, 2[6] K. Choo and D. J. Fleet. People tracking using hybrid Monte Carlo filtering.In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEEInternational Conference on, volume 2, pages 321–328. IEEE, 2001. →pages 20, 21, 39[7] N. Chopin. A sequential particle filter method for static models. Biometrika,89(3):539–552, 2002. → pages 16[8] M. Creutz. Global Monte Carlo algorithms for many-fermion systems.Physical Review D, 38(4):1228, 1988. → pages 10[9] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers.Journal of the Royal Statistical Society: Series B (Statistical Methodology),68(3):411–436, 2006. → pages 2, 4, 7, 9, 18, 24, 2542[10] P. Del Moral, A. Doucet, and A. Jasra. An adaptive sequential Monte Carlomethod for approximate Bayesian computation. Statistics and Computing,22(5):1009–1020, 2012. → pages 41[11] A. Doucet and A. M. Johansen. A tutorial on particle filtering andsmoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12:656–704, 2009. → pages 6[12] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. A flat histogrammethod for inference with probabilistic and deterministic constraints. InNIPS Workshop on Monte Carlo Methods for Modern Applications, 2010.→ pages 2[13] J. Felsenstein. Inferring phylogenies, volume 2. Sinauer AssociatesSunderland, 2004. → pages 1[14] M. Fishelson and D. Geiger. Optimizing exact genetic linkage computations.Journal of Computational Biology, 11(2-3):263–275, 2004. → pages 1[15] V. Gogate and R. Dechter. Samplesearch: Importance sampling in presenceof determinism. Artificial Intelligence, 175(2):694–729, 2011. → pages 2[16] V. G. Gogate. Sampling Algorithms for Probabilistic Graphical Models withDeterminism DISSERTATION. PhD thesis, University of California, Irvine,2009. → pages 1[17] S. Golchi and D. A. Campbell. Sequentially constrained Monte Carlo. arXivpreprint arXiv:1410.8209v2, 2014. → pages 2, 3, 16, 17, 27, 34, 40[18] O. Gries and R. Mo¨ller. Gibbs sampling in probabilistic description logicswith deterministic dependencies. In UniDL. Citeseer, 2010. → pages 2[19] M. D. Hoffman and A. Gelman. The no-U-turn sampler: Adaptively settingpath lengths in Hamiltonian Monte Carlo. Journal of Machine LearningResearch, 15:1593–1623, 2014. → pages 11, 14, 21[20] L. Li, B. Ramsundar, and S. Russell. Dynamic scaled sampling fordeterministic constraints. In Proceedings of the Sixteenth InternationalConference on Artificial Intelligence and Statistics, pages 397–405, 2013.→ pages 1, 2, 3, 27, 31[21] J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2008. →pages 1, 5, 643[22] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov ChainMonte Carlo, 2, 2011. → pages 2, 10, 12, 14[23] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001. → pages 7, 16, 18[24] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncatedmultivariate Gaussians. Journal of Computational and Graphical Statistics,23(2):518–542, 2014. → pages 12, 22[25] E. Poon and D. J. Fleet. Hybrid Monte Carlo filtering: Edge-based peopletracking. In Motion and Video Computing, 2002. Proceedings. Workshop on,pages 151–158. IEEE, 2002. → pages 20, 39[26] B. Shahbaba, S. Lan, W. O. Johnson, and R. M. Neal. Split HamiltonianMonte Carlo. Statistics and Computing, 24(3):339–349, 2014. → pages 14,22, 23, 39, 45[27] M. A. Shwe, B. Middleton, D. Heckerman, M. Henrion, E. Horvitz,H. Lehmann, and G. Cooper. Probabilistic diagnosis using a reformulationof the INTERNIST-1/QMR knowledge base. Methods of information inMedicine, 30(4):241–255, 1991. → pages 1[28] J. Sohl-Dickstein and B. J. Culpepper. Hamiltonian annealed importancesampling for partition function estimation. arXiv preprint arXiv:1205.1925,2012. → pages 20, 21, 39[29] Stan Development Team. Stan Modeling Language Users Guide andReference Manual, Version 2.5.0, 2014. URL http://mc-stan.org/. → pages 2[30] L. Tabourier, C. Roth, and J.-P. Cointet. Generating constrained randomgraphs using multiple edge switches. Journal of Experimental Algorithmics(JEA), 16:1–7, 2011. → pages 2[31] D. Venugopal and V. Gogate. Giss: Combining Gibbs sampling andsamplesearch for inference in mixed probabilistic and deterministicgraphical models. In AAAI, 2013. → pages 1[32] Z. Wang, S. Mohamed, and N. de Freitas. Adaptive Hamiltonian andRiemann manifold Monte Carlo samplers. In International Conference onMachine Learning (ICML), pages 1462–1470, 2013. → pages 14, 21, 2344Appendix ASupporting DerivationsA.1 Analytically Integrating Hamiltonian Dynamics ofthe Random LagrangianIn Section 3.4, I discuss separating the energy function underlying DrSMC forsum constraints into two distinct summands, H1(X t1:k) and H2(Xt1:k,St1:k), with H2possessing the following form:H2(Xt1:k,St1:k) =(∑Dj=1 Xtj− s)22b2+∑Dj=1(Stj)22.This division of the energy function facilitates the use of a split Hamiltonianintegrator [26], with H2 being integrated analytically. Here, I provide the mathe-matical derivation for the analytic integration.Recall that Hamiltonian motion is governed by the following system of partialdifferential equations,∂H(X t1:k,St1:k)∂Xi=−dStidt,∂H(X t1:k,St1:k)∂Sti=dX tidt.45Differentiating H2 yields∂H(X t1:k,St1:k)∂X ti=∑Dj=1 Xtj− sb2,∂H(X t1:k,St1:k)∂Sti= Sti .It follows thatd2X tidt2=−∑Dj=1 Xtj− sb2,which in turn implies thatX ti = Xt1 +(S0i −S01)t +(X0i −X01 ).Therefore, our system can be expressed as a set of independent equations of theformd2X tidt2=−DX ti +αit +βi− sb2for constants αi and βi whereαi =D∑j=1S0j −DS0i ,βi =D∑j=1X0j −DX0i .The solution to such an equation yields an expression for X is of the form:X ti = k1 cos(√Dt/b)+ k2 sin(√Dt/b)−αit +βi− sD.Differentiating with respect to t yields an expression for the momentumSti =−√Dbk1 sin(√Dt/b)+√Dbk2 cos(√Dt/b)−αiD.46Expressions for k1 and k2 can be derived from the initial conditions as follows.X0i = k1− (βi− s)/D⇒ k1 = X0i +(βi− s)/D.S0i =√Dbk2−αi/√D⇒ k2 =b√D(S0i +αi/D).Using these expressions, X t1:k and St1:n depend on the initial conditions X01:k and S01:nin the following manner:X ti =(X¯0− s/D)cos(√Dt/b)+bS¯0√Dsin(√Dt/b)− (S¯0−S0i )t− (X¯0−X0i )+sD,Sti =−√D(X¯0−C/D)sin(√Dt/b)b+ S¯0 cos(√Dt/b)− (S¯0−S0i ),where X¯0 = ∑ j=1 X0j /D and S¯0 = ∑ j=1 S0j/D.47A.2 Functional Form for DrSMC IntermediateDistributionsIn SMCS, it is crucial for the sequence of intermediate distributions to introducechange gradually. Otherwise, the sampler can quickly devolve into a state of “par-ticle degeneracy”, in which essentially all of the weight is assigned to a singleparticle. Propagating forward, all final particles are descendants of this same an-cestor, resulting in poor approximation of the target distribution. Degeneracy canbe avoided by controlling the effective sample size (ESS) between steps. This sec-tion provides an approach for controlling ESS in DrSMC.In DrSMC, the dynamics of the intermediate distributions are driven by thesequence of constraint satisfaction variances (annealing schedule), given by b1:p.This sequence is the mechanism through which changes in ESS can be controlled.Recall that the normal variance of the relationship violation at step n is bn−1, andthe weight update at is given by w˜(Xn−1,Xn) = γn(Xn−1)/γn−1(Xn−1). In the ab-sence of any likelihood annealing, this weight update reduces tow˜(Xn−1,Xn) =bn−1bnexp(−( f (Xn−1)− s)22b2n+( f (Xn−1)− s)22b2n−1).To prevent rapid ESS drops, I propose a functional form of the sequence whichexplicitly controls the weight updates of particles, proportional to the degree towhich they violate the deterministic relationship f (X) = s. Specifically, bn is de-fined such that at step n, the weight update of a particle X ′ which perfectly satisfiesthe relationship is a factor of C greater than that of a particle X which is bn−1 (onestandard deviation) away from satisfying the relationship (for some 0 <C < 1).Without loss of generality, suppose f (X ′) = s and f (X) = s+bn−1. The ratioof the weight update for X and the weight update for X ′ at step n is given by:w˜(X ′n−1,X′n)w˜(Xn−1,Xn)= exp(12(b2n−1b2n−1))Setting this equal to C yields the relationshipbn = bn−1(1+2log(C))−1/248which justifies a geometric decay for b1:p. With this derivation, tuning b1:p be-comes relatively painless. The only two parameters to tune are b1 and C.49A.3 Exact Distribution of Multivariate Normal given itsSumConsider an k-dimensional multivariate random normal distribution X ∼ N(µ,Σ).Now consider applying the transformation Yi = Xi for i = 1, . . . ,k− 1 and Yk =X1 +X2 + · · ·+Xk. This transformation can be expressed as Y = AX , whereA =[Ik−1 0k−11Tk−1 1].Here, Ik−1 denotes a (k−1)×(k−1) identity matrix, 0k−1 is a column vector of k−1 zeroes and 1Tk−1 is a row vector of k−1 ones. Since the transformation is linear,it follows from the properties of the multivariate normal that Y ∼ N(Aµ,AΣAT ).Conditioning on Yk = s, the joint distribution of Y1:(k−1) = X1:(k−1) is given byN(µ ′,Σ′), where µ ′ = µ + Σ1:(k−1),k(s−∑ki=1 µi)/Σk,k and Σ′ = Σ1:(k−1),1:(k−1)−Σ1:(k−1),kΣk,1:(k−1)/Σk,k. Finally, Xk = s is then specified by the condition.50
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- DrSMC : a sequential Monte Carlo sampler for deterministic...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
DrSMC : a sequential Monte Carlo sampler for deterministic relationships on continuous random variables Spencer, Neil 2015
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | DrSMC : a sequential Monte Carlo sampler for deterministic relationships on continuous random variables |
Creator |
Spencer, Neil |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | Computing posterior distributions over variables linked by deterministic constraints is a recurrent problem in Bayesian analysis. Such problems can arise due to censoring, identifiability issues, or other considerations. It is well-known that standard implementations of Monte Carlo inference strategies break down in the presence of these deterministic relationships. Although several alternative Monte Carlo approaches have been recently developed, few are applicable to deterministic relationships on continuous random variables. In this thesis, I propose Deterministic relationship Sequential Monte Carlo (DrSMC), a new Monte Carlo method for continuous variables possessing deterministic constraints. My exposition focuses on developing a DrSMC algorithm for computing the posterior distribution of a continuous random vector given its sum. I derive optimal settings for this algorithm and compare its performance to that of alternative approaches in the literature. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-08-26 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0166643 |
URI | http://hdl.handle.net/2429/54647 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_november_spencer_neil.pdf [ 967.03kB ]
- Metadata
- JSON: 24-1.0166643.json
- JSON-LD: 24-1.0166643-ld.json
- RDF/XML (Pretty): 24-1.0166643-rdf.xml
- RDF/JSON: 24-1.0166643-rdf.json
- Turtle: 24-1.0166643-turtle.txt
- N-Triples: 24-1.0166643-rdf-ntriples.txt
- Original Record: 24-1.0166643-source.json
- Full Text
- 24-1.0166643-fulltext.txt
- Citation
- 24-1.0166643.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0166643/manifest