UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Particle Markov chain Monte Carlo Holenstein, Roman 2009

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2009_spring_holenstein_roman.pdf [ 2.34MB ]
Metadata
JSON: 24-1.0051665.json
JSON-LD: 24-1.0051665-ld.json
RDF/XML (Pretty): 24-1.0051665-rdf.xml
RDF/JSON: 24-1.0051665-rdf.json
Turtle: 24-1.0051665-turtle.txt
N-Triples: 24-1.0051665-rdf-ntriples.txt
Original Record: 24-1.0051665-source.json
Full Text
24-1.0051665-fulltext.txt
Citation
24-1.0051665.ris

Full Text

Particle Markov Chain Monte Carlo by Roman Holenstein B.Sc. Physics/Computer Science, University of Northern British Columbia, 2002 M.Sc. Electrical Engineering, University of Alberta, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) March 2009 c© Roman Holenstein, 2009 Abstract Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods have emerged as the two main tools to sample from high-dimensional probability distributions. Although asymptotic convergence of MCMC algorithms is ensured under weak assumptions, the performance of these latters is unreliable when the proposal distributions used to explore the space are poorly chosen and/or if highly correlated variables are updated independently. In this thesis we propose a new Monte Carlo framework in which we build efficient high-dimensional proposal distributions using SMC methods. This allows us to design effective MCMC algo- rithms in complex scenarios where standard strategies fail. We demonstrate these algorithms on a number of example problems, including simulated tempering, non- linear non-Gaussian state-space model, and protein folding. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Review of Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 5 2.1 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1 Algorithm and Use of Its Output . . . . . . . . . . . . . . 8 2.2.2 Implementation Issues . . . . . . . . . . . . . . . . . . . 11 2.2.3 Discussion of Convergence Results . . . . . . . . . . . . 12 2.3 Configurational Bias Monte Carlo . . . . . . . . . . . . . . . . . 14 2.4 Inference in State-Space Models . . . . . . . . . . . . . . . . . . 15 2.4.1 Model Description . . . . . . . . . . . . . . . . . . . . . 15 iii 2.4.2 Markov Chain Monte Carlo for State-Space Models . . . 18 2.4.3 Sequential Monte Carlo for State-Space Models . . . . . . 19 3 Particle Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . 21 3.1 Particle Metropolis-Hastings Sampler . . . . . . . . . . . . . . . 22 3.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1.2 Proof of Validity . . . . . . . . . . . . . . . . . . . . . . 23 3.1.3 Extensions and Variations . . . . . . . . . . . . . . . . . 27 3.2 Particle Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.2 Proof of Validity . . . . . . . . . . . . . . . . . . . . . . 33 3.2.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Particle Marginal Metropolis-Hastings Sampler . . . . . . . . . . 37 3.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.2 Proof of Validity . . . . . . . . . . . . . . . . . . . . . . 39 3.3.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.4.1 Dynamic and Optimal Resampling . . . . . . . . . . . . . 41 3.4.2 Arbitrary Target Distribution . . . . . . . . . . . . . . . . 42 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Nonlinear State-Space Model . . . . . . . . . . . . . . . . . . . . 45 4.2 Protein Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2.1 Lattice Model . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3 Dirichlet Mixture Model . . . . . . . . . . . . . . . . . . . . . . 61 4.4 Markov Jump Processes (Lotka-Volterra) . . . . . . . . . . . . . 67 4.5 Lévy-Driven Stochastic Volatility Model . . . . . . . . . . . . . . 73 4.5.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 76 4.5.2 Standard & Poor’s 500 . . . . . . . . . . . . . . . . . . . 77 4.6 Tempering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 iv 5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 A Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 B Nonlinear State-Space Model . . . . . . . . . . . . . . . . . . . . . . 108 B.1 Gibbs Sampling for State-Space Model . . . . . . . . . . . . . . 108 B.1.1 Sampling Parameters . . . . . . . . . . . . . . . . . . . . 108 B.1.2 Sampling State Variables . . . . . . . . . . . . . . . . . . 109 C Dirichlet Mixture Model: Derivation . . . . . . . . . . . . . . . . . 111 C.1 Gaussian Mixture . . . . . . . . . . . . . . . . . . . . . . . . . . 111 C.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 v List of Tables 4.1 Protein sequences for 2D HP model . . . . . . . . . . . . . . . . 56 4.2 Protein sequences for 3D HP model . . . . . . . . . . . . . . . . 57 4.3 Protein folding results for HP model . . . . . . . . . . . . . . . . 58 4.4 Comparison of protein folding performance (data set 1) . . . . . . 59 4.5 Comparison of protein folding performance (data set 2) . . . . . . 60 4.6 Results for Galaxy data set . . . . . . . . . . . . . . . . . . . . . 63 4.7 Posterior statistics of parameters for S&P 500 data . . . . . . . . 85 4.8 Simulation results for a mixture of Gaussians with 4292 modes . . 89 5.1 Guidelines for algorithm choice . . . . . . . . . . . . . . . . . . 95 vi List of Figures 2.1 Illustration of particle ancestry in SMC . . . . . . . . . . . . . . . 10 2.2 Construction of the proposal in CBMC . . . . . . . . . . . . . . . 14 2.3 State-space model . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Generating a proposal using SMC . . . . . . . . . . . . . . . . . 24 4.1 Acceptance probabilities for PMH and CBMC . . . . . . . . . . . 48 4.2 PMH acceptance rates for varying # of observations and particles . 49 4.3 Hidden states (x) and observations (y). . . . . . . . . . . . . . . . 49 4.4 ACF of the output of the PG and MH algorithms . . . . . . . . . . 50 4.5 Estimates of p (σV | y1:T ) and p (σW | y1:T ) . . . . . . . . . . . . 51 4.6 ACF of σV and σW for PG and PMMH . . . . . . . . . . . . . . 52 4.7 Illustration of protein folding in HP model . . . . . . . . . . . . . 54 4.8 Sampled parameter α for PG and PMMH . . . . . . . . . . . . . 64 4.9 Acceptance rates for various versions of the PMH algorithm . . . 65 4.10 ACF for parameter α and estimate of pi (α| y1:T ) in Dirichlet model 66 4.11 Data set for LV model . . . . . . . . . . . . . . . . . . . . . . . . 69 4.12 Histogram and trace plots of LV parameters . . . . . . . . . . . . 70 4.13 ACF of sampled Lotka-Volterra parameters . . . . . . . . . . . . 71 4.14 Histograms of Lotka-Volterra states . . . . . . . . . . . . . . . . 72 4.15 Lévy-driven SV with synthetic data: ACFs . . . . . . . . . . . . . 77 4.16 Lévy-driven SV with synthetic data: Histogram & scatter plots . . 78 4.17 Lévy-driven SV with synthetic data: Trace plots . . . . . . . . . . 79 4.18 S&P 500 data set for Lévy-driven SV . . . . . . . . . . . . . . . 80 4.19 Lévy-driven SV with S&P 500 data: Histogram & scatter plots . . 81 vii 4.20 Lévy-driven SV with S&P 500 data: ACF of sampled parameters . 82 4.21 Lévy-driven SV with S&P 500 data: Trace plots of parameters . . 82 4.22 Lévy-driven SV with S&P 500 data: Histogram & scatter plots . . 83 4.23 Lévy-driven SV with S&P 500 data: ACF of sampled parameters . 84 4.24 Lévy-driven SV with S&P 500 data: Trace plots of parameters . . 84 4.25 Mixture of Gaussians posterior distribution using PMH . . . . . . 90 4.26 Data set and posterior for 4-component mixture of Gaussians . . . 92 4.27 Acceptance rates for finite mixture model . . . . . . . . . . . . . 93 viii List of Algorithms 2.1 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Sequential Monte Carlo Algorithm . . . . . . . . . . . . . . . . . 9 2.4 Configurational Bias Monte Carlo . . . . . . . . . . . . . . . . . . 16 3.1 Particle Metropolis Hastings Sampler . . . . . . . . . . . . . . . . 23 3.2 Particle Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Conditional Sequential Monte Carlo Algorithm . . . . . . . . . . . 32 3.4 Conditional Multinomial Resampling . . . . . . . . . . . . . . . . 32 3.5 Conditional Stratified Resampling . . . . . . . . . . . . . . . . . . 33 3.6 Alternate Move for Particle Gibbs Sampler . . . . . . . . . . . . . 35 3.7 Marginal Maximisation using Particle Gibbs . . . . . . . . . . . . 36 3.8 Particle Marginal Metropolis-Hastings Sampler . . . . . . . . . . . 38 3.9 PMMH with m Independent SMC Algorithms . . . . . . . . . . . 40 4.1 Gibbs Sampler for Dirichlet Mixture Model. . . . . . . . . . . . . 66 4.2 Tempered Transitions . . . . . . . . . . . . . . . . . . . . . . . . . 86 ix Glossary Notation X a random variable (upper case) x a value (lower case) X,x a vector B(E) Borel σ-algebra on E P(A) probability of event A pi(dx) probability distribution pi(x) probability density (E,F) measurable space (σ-field) on set E and σ-algebra F Epi(·) expectation with respect to distribution pi varpi(·) variance with respect to distribution pi δx(dx) Dirac delta function N (x;µ, σ2) Normal (Gaussian) distribution with mean µ and variance σ2 B(x;N, p) Binomial distribution for N trials with success probability p M(x;N,p) Multinomial distribution for N trials with event probabilities p = (p1, ..., pk) Ga(x; k, θ) Gamma distribution with shape k > 0 and scale θ > 0 IGa(x;α, β) Inverse Gamma distribution with shape α > 0 and scale β > 0 Be(x;α, β) Beta distribution with shape parameters α > 0 and β > 0 TS(κ, δ, γ) tempered-stable distribution x Acronyms AC auto-correlation ACF auto-correlation function ACO ant colony optimisation CBMC configurational bias Monte Carlo DP Dirichlet process DPERM dynamic prune-enriched Rosenbluth method GPU graphics processing unit EKF extended Kalman filter ESS effective sample size FRESS fragment regrowth via energy-guided sequential sampling [76] G Gibbs HMM hidden Markov model LV Lotka-Volterra, predator-prey model MC Monte Carlo MCMC Markov chain Monte Carlo MH Metropolis Hastings MJP Markov jump process MTM multiple-try Metropolis NLSS non-linear state-space model NTT Neal’s tempered transitions, MCMC algorithm to sample from multimodal distributions developed by Radford Neal [66] OU Ornstein-Uhlenbeck process PDF probability distribution function PERM prune-enriched Rosenbluth method xi PG particle Gibbs PMCMC particle Markov chain Monte Carlo PMH particle Metropolis Hastings PMMH particle marginal Metropolis Hastings REMC replica exchange Monte Carlo RG recoil-growth RJMCMC reversible-jump MCMC S&P 500 Standard & Poor’s 500 index SA simulated annealing SAW self-avoiding walk SDE stochastic differential equation SIMCMC sequentially interacting Markov chain Monte Carlo SMC sequential Monte Carlo SSM state-space model SV stochastic volatility UKS unscented Kalman smoother xii Acknowledgements I am extremely grateful to my supervisor Arnaud Doucet. His enthusiasm, his support, his honest critiques, his sound advice, and his deep insight have helped me greatly and this thesis would not have been possible without him. He is an inspiration and I am very honoured to call him my teacher and friend. I would like to thank Christophe Andrieu who has also been integral to the development of the methodology in this thesis. Many thanks go to the numerous people who have been my teachers and men- tors along my academic journey so far. I would especially like to express my gratitude to: Mark Shegelski (undergraduate research supervisor), Robert Fedose- jevs and Ying Tsui (M.Sc. supervisors), Raphael Gottardo (committee member), as well as Nando de Freitas (committee member), who introduced me to the fun of Monte Carlo and machine learning. I am very grateful for their generous support, advise, and help with various projects. Furthermore I would like to thank my student colleagues and friends for great discussions and for providing a fun and stimulating learning environment. Special thanks go to Aline Tabet, Emtiyaz Khan, François Caron, Frank Hutter, Gareth Peters, Glen Goodvin, Hoyt Koepke, Luke Bornn, Mark Schmidt, Mark Crowley, Matt Hoffman, Mike Chiang, and Mike Whitwick. The secretaries at the CS department deserve my gratitude for providing assis- tance and administrative support, with special mention to Joyce Poon, Lara Hall, and Kath Imhiran. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the University of British Columbia, the Department of Computer Science, and the Province of British Columbia. I would also like to xiii acknowledge the Western Canada Research Grid (WestGrid) for providing compu- tational resources. Last but not least, I am forever grateful to my parents and siblings for their continuous love and support, especially my mother who has always been there for me and helped me grow as a person. Finally, I want to thank my darling Elaine Qing Chang for her unbounded encouragement, support, and love. xiv Chapter 1 Introduction Monte Carlo methods have become the standard tool to solve many problems in statistics and scientific computing. Examples are abound, and include instances in Bayesian statistics (posterior estimation), statistical physics (Ising model), particle physics (simulation of high energy particles interacting with a detector), biology (protein folding, Lotka-Volterra model), chemistry (chemical reaction networks), and finance (option valuation), to name a few. The problems are becoming more and more sophisticated and Monte Carlo methods are now expected to deal with high dimensionalities and complex interactions between the model variables. In this thesis we present a new framework [3] that allows us to expand the class of problems that can routinely be addressed with Monte Carlo methods. It is based on a non-trivial and novel combination of existing sampling strategies and takes advantage of their strengths. We then apply the method to problems in Bayesian statistics and biology and make comparisons with competitive algorithms. Assume we are given a probability distribution pi(dx) defined on a measurable space (E,F) and that we are interested in sampling from this distribution, usually to compute analytically intractable expectations of interest with respect to pi(dx) by invoking the law of large numbers [61, 68]. For ease of notation, we shall further assume that pi(dx) admits a density pi(x) with respect to a σ-finite dominating measure denoted dx. Monte Carlo sampling is generally done by proposing samples from some in- strumental distribution. A weighing and/or selection mechanism then corrects for 1 the bias from the proposal distribution. As the dimension increases, finding a good proposal distribution becomes harder. A good proposal distribution should take into account key features of the target, such as multimodality, and ideally should be as close as possible to the target. There are two classes of Monte Carlo algorithms that are often used for sam- pling from high-dimensional distributions: Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). MCMC relies on sampling a realization of a Markov chain with invariant distribution pi(dx). SMC is a sequential implemen- tation of importance sampling, employing a population of samples (particles) and a sequence of probability distributions (of increasing dimension), with the final distribution being the target distribution to generate samples from pi(dx). When the consecutive distributions in the sequence are not too different one can find good proposals to move the particles from one distribution to the next. An in- troduction to this method is given in section 2.2. For a book-length review see [27]. While initially designed for on-line inference in dynamic models, usually re- ferred to as “particle filters” in that context, they are now also used in static models [15, 32, 65]. SMC samplers have found a wide range of applications in fields as diverse as econometrics [16], [67], ecology [12], and systems biology [48]. It was realised later that the interest of SMC methods is not limited to dynamic models and they are now increasingly used to perform inference for a wide range of mod- els including contingency tables [15], mixtures models [32, 65], graphical models [53], and population genetics [61, Section 4.1.2]. Where traditional importance sampling would try to directly produce weighted samples on E to approximate pi (x), and most likely fail for the same reason that an independent Metropolis Hastings (MH) algorithm would fail, an SMC algorithm will adopt a more pro- gressive approach which can be beneficial in large dimensional scenarios. The first ingredient of such an algorithm is a sequence of intermediate “bridging” proba- bility distributions of increasing dimensions {pin (dxn) , n = 1, . . . , p−1} with xn = (x1, x2, . . . , xn) ∈ X1 × · · · × Xn and such that pin (dxn) = pin (xn) dxn where dxn = dxn−1 × dxn and pip (xp) = pi (x). However, sampling sequentially comes at a price. Indeed, if p is too large, then the SMC approximation of the joint density pi (x) deteriorates as components imputed at any time n < p are not rejuvenated at subsequent time steps. As a 2 result, when p − n is too large the approximation of the marginal pi (xn) is likely to be rather poor as the successive resampling steps deplete the number of distinct particles. However, despite the potential drawbacks outlined above, SMC methods have had a major impact on inference in dynamic systems in the last fifteen years, and are now showing great promise for more classical “static” inference problems e.g. [15], [65], [32], [53] and [61, Section 4.1.2]. This thesis proposes a novel sampling strategy which aims to combine the ad- vantages of MCMC and SMC methods. As we shall see, this opens up the possibil- ity to routinely tackle problems that cannot be satisfactorily addressed using either of these approaches on its own. Several algorithms combining both approaches have already been proposed in the literature. In particular MCMC algorithms have been successfully used as updating mechanisms or proposals within SMC methods [45]; see also [17], [65]. Our approach is entirely different because on the con- trary it proposes to use SMC algorithms as a proposal mechanism within MCMC methods. We use MCMC methods to allow us to “break” the complex task of sampling from pi (x) into a series of simulations from lower-dimensional distribu- tions whereas we use SMC ideas to efficiently update very large sub-blocks of x. Although this idea might seem natural, it is important to note that its realisation is far from obvious. Indeed, a direct implementation of such a strategy is impos- sible as the marginal distribution of a particle generated by an SMC algorithm is not available in closed-form but would be required for the implementation. To by- pass this problem we introduce a non-standard auxiliary target distribution on an extended space which allows us to define a valid MH update and from which infer- ence about pi(x) can be carried out. We show that the resulting algorithm enjoys attractive properties and can be used as a component of more advanced algorithms. The rest of this thesis is organised as follows. In Chapter 2 we give a brief review of MCMC and SMC and provide some convergence results to motivate our new method. Note that SMC is presented in a non-standard notation in order to sim- plify the proofs for the particle Markov chain Monte Carlo (PMCMC) algorithms. Chapter 3 introduces the PMCMC framework, starting with the particle Metropolis Hastings (PMH) sampler, followed by the particle Gibbs (PG) and particle marginal Metropolis Hastings (PMMH) samplers. The PMH sampler (Section 3.1) is a novel Metropolis Hastings sampler which uses SMC algorithms as proposals to sample 3 from pi (x). The PG sampler (Section 3.2) extends the PMH algorithm to allow sampling from distributions of the form pi (θ,x) (defined on some space Θ × E), which can be understood as being an approximation of the standard Gibbs sampler (which alternates sampling from the full conditionals pi (θ|x) and pi (x|θ)), but has the property that it leaves pi(θ,x) invariant, and is hence not biased. In cases where θ and x are strongly correlated, this approach is likely inefficient. This is addressed by the PMMH algorithm (Section 3.3), which is an unbiased approximation of a MH algorithm that works directly on the marginal space Θ. In Chapter 4 we present a variety of applications to which this methodology can be applied and make com- parisons with competitive alternative algorithms. In Section 4.1 we apply PMCMC to a non-linear state-space model. Using PMH with simulated annealing we find optimal configurations in a protein folding model in Section 4.2. We perform clus- tering using Dirichlet mixture models in Section 4.3. We estimate parameters for the Lotka-Volterra (LV) model in Section 4.4. In Section 4.5 we consider a Lévy- driven stochastic volatility (SV) model for the Standard & Poor’s 500 (S&P 500) data set. And in Section 4.6 we use PMH with tempering to sample from highly multimodal distributions. Finally Chapter 5 summarises the contributions of this thesis and outlines further extensions. 4 Chapter 2 Review of Monte Carlo Methods 2.1 Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) methods are a class of algorithms based on constructing a Markov chain which has the desired target distribution as its equilibrium distribution. Two very popular MCMC algorithms are the Metropolis Hastings (MH) and the Gibbs (G) sampler [68]. While originally mainly used in physics, these algorithms soon found widespread use in many other disciplines. Often they are used to solve integration and optimisation problems, for example computing expectations in Bayesian statistics by invoking the law of large num- bers. In MCMC, a Markov chain is constructed from a transition kernelK(xn,xn+1). This kernel is defined on (E,B(E)) such that K(x, ·) is a probability measure for all X ∈ E. In the discrete domain the kernel is simply a transition matrix. In order to establish that the Markov chain converges to the required target distribution we require the following three properties: • The invariant distribution of the Markov kernel is pi(·) • The Markov chain is irreducible: any subset of A ⊂ E for which pi(A) > 0 is reachable from any starting point x(0) ∈ E • The Markov chain is aperiodic 5 Next we will (very) briefly introduce the MH and Gibbs samplers. 2.1.1 Metropolis-Hastings The MH sampler uses an instrumental proposal distribution to generate candidates that are accepted or rejected such that the resulting Markov chain has the target pi(x) = γ(x)/Z as the invariant distribution. Algorithm 2.1: Metropolis-Hastings Initialise X(0)1 For iteration i ≥ 12 Sample a candidate from a proposal distribution: X∗ ∼ q(X(i− 1), ·)3 With probability4 1 ∧ γ(X ∗) γ(X(i− 1)) q(X∗,X(i− 1)) q(X(i− 1),X∗) (2.1) set X(i) = X∗, otherwise set X(i) = X(i− 1) The case where the proposal distribution is symmetric, q(x,y) = q(y,x), ∀y,x ∈ E, corresponds to the Metropolis algorithm, also known as random walk Metropolis. When the proposal is independent of the current state of the Markov chain we get the independent MH sampler. Many more variations and special cases exist. For a thorough survey of these methods see [68]. We may for example use multiple transition kernels and combine them through mixture or composition. 2.1.2 Gibbs Sampler The Gibbs sampler is in fact a special case of the Metropolis-Hastings algorithm. If we consider multiple kernels, one for each block of variables, and use them in a composition, and additionally these kernels use the conditional distribution of the target as the proposal, then the acceptance rate is 1 and we obtain the Gibbs 6 sampler. Algorithm 2.2: Gibbs Sampler Initialise X(0)1 For iteration i ≥ 12 For n = 1, . . . , p do3 Sample Xn(i) ∼ pi(·|X1:n−1(i),Xn+1:p(i− 1))4 2.2 Sequential Monte Carlo In this section, we briefly review the principle of SMC methods to sample from a given target distribution pi (x). As explained in the introduction the method re- quires one to introduce a sequence of bridging probability densities {pin (xn) , n = 1, . . . , p} of increasing dimension such that pip (xp) = pi (x); see Section 4 for detailed examples. For ease of presentation, we will assume that Xi = X for any i = 1, . . . , p, implying that pin (xn) is defined on the product space X n. Each density is assumed known up to a normalising constant, i.e. for n = 1, . . . , p pin (xn) = Z−1n γn (xn) , where γn : X n → R+ can be evaluated pointwise, but the normalising constant Zn is unknown. We will use the notation Z for Zp. We describe below a generic SMC algorithm which encompasses numerous variants proposed in the literature. Section 2.2.1 below is important and sufficient to understand the description of the new algorithms presented in the paper. It is worth noting for readers already familiar with SMC methods that our description outlined below might appear slightly unorthodox in that it does not exactly match standard computer implementations. It however enjoys the same important the- oretical and practical properties and has the advantage of greatly simplifying the mathematical developments required to show the validity of the algorithms pre- sented throughout the paper. Section 2.2.2 focuses on such subtle issues, is only 7 required to understand the theoretical justifications of our algorithms, and might be skipped on a first reading. 2.2.1 Algorithm and Use of Its Output An SMC algorithm requires one to specify an importance density M1 (x1) on X in order to initialise the recursion at time 1 and a family of transition kernels with associated densities {Mn (xn−1, xn) , n = 2, ..., p} in order to extend xn−1 ∈ X n−1 by sampling xn ∈ X conditional upon xn−1 at time instants n = 2, ..., p. Guidelines on how to best select {Mn (xn−1, xn)} are well known, and the main recommendation is that an approximation of the conditional density pin (xn|xn−1) should be used [27], [61]. As mentioned earlier an SMC algorithm also involves a resampling procedure of the N particles, which relies on a family of probability distributions on {1, . . . , N}N , {r( ·|w),w ∈ [0, 1]N}. The resampling step is usually necessary as in most applications the variance of the importance weights would otherwise increase over time. This is an undesirable feature which rapidly manifests itself in practice in that the majority of particles have negligible weights. The algorithm proceeds as shown in Algorithm 2.3 in order to produce a se- quence of samples {Xin, i = 1, ..., N} for n = 1, . . . , p. Note that in order to alle- viate the notational burden we adopt below the convention that whenever the index i is used we mean “for all i ∈ {1, ..., N}.” Further on, we also use the standard convention whereby capital letters are used for random variables while lower case letters are used for their values. We have used the notation Wn := ( W 1n , ...,W N n ) and An := ( A1n, ..., A N n ) . The variable Ain−1 plays an important rôle in our for- mulation of SMC methods, and represents the index of the “parent” at time n − 1 of particle Xin for n = 2, . . . , p (see Figure 2.1). The vector An is thus a random mapping defined on {1, . . . , N} → {1, . . . , N}N , and the standard resampling procedure is hence interpreted here as being the operation by which child particles at time n choose their parent particles at time n − 1 according to a probability r(·|Wn−1) dependent on the parents’ weights Wn−1, or “fitness.” Let Oin := ∑N k=1 I ( Akn = i ) be the number of offspring of particle i at time n. A desirable property of a resampling scheme is that it satisfies the following 8 Algorithm 2.3: Sequential Monte Carlo Algorithm At n = 11 Sample Xi1 ∼M1(·)2 Update and normalise the weights3 w1 ( Xi1 ) := γ1(Xi1) M1(Xi1) , W i1 = w1 ( Xi1 )∑N k=1w1 ( Xk1 ) . For n = 2, ..., p do4 Sample An−1 ∼ r (·|Wn−1)5 Sample Xin ∼Mn(X Ain−1 n−1 , ·) and set Xin = (X Ain−1 n−1 , X i n)6 Update and normalise the weights7 wn ( Xin ) := γn ( Xin ) γn−1 ( X Ain−1 n−1 ) Mn ( X Ain−1 n−1 , Xin ) , (2.2) W in = wn ( Xin )∑N k=1wn (Xkn) (2.3) unbiasedness condition E [ Oin|Wn ] = NW in . (2.4) The unbiasedness condition (2.4) ensures that future computational effort is con- centrated on the most promising particles, while guaranteeing that the SMC algo- rithm described earlier yields consistent approximations of the distributions {pin (dxn) , n = 1, . . . , p} and of their normalising constants {Zn, n = 1, . . . , p}. In particular, conditional upon the sampled particles, an approximation of the target distribution pi (dx) is given by p̂iN (dx) := N∑ i=1 W ipδXip (dx) , (2.5) 9 A11=3 A 2 1=6 A 3 1=6 A 4 1=3 A 5 1=1 A 6 1=5 A12=2 A 2 2=4 A 6 2=4A 5 2=5A 4 2=3 A13=2 A 2 3=2 A 3 3=4 A 4 3=6 A 5 3=4 A 6 3=1 A32=4 X61X 5 1X 4 1X 3 1X 2 1X 1 1 X62X 5 2X 4 2 X43 X 5 3 X 6 3 X64X 4 4 X 5 4 X32X 2 2X 1 2 X13 X 2 3 X 3 3 X34X 1 4 X 2 4 p̂i1(dx1) p̂i2(dx1:2) p̂i3(dx1:3) p̂i4(dx1:4) Figure 2.1: Illustration of particle ancestry in SMC. from which expectations can be easily computed, but also an estimate of its nor- malising constant Z ẐN := p∏ n=1 [ 1 N N∑ i=1 wn ( Xin )] . (2.6) The derivation for the estimate of the normalising constant is as follows: Z := Zp = Z1 p∏ n=2 Zn Zn−1 (2.7) where Z1 and the ratio of normalising constants Zn/Zn−1 are given by Z1 = ∫ γ1(x1) M1(x1) M1(x1) dx1 = ∫ wn (x1)M1(x1) dx1 ≈ 1 N N∑ i=1 w1 ( Xi1 ) 10 Zn Zn−1 = 1 Zn−1 ∫ γn(xn) dxn = ∫ γn(xn) pin−1(xn−1) γn−1(xn−1) dxn = ∫ γn(xn) γn−1(xn−1)Mn(xn−1, xn) pin−1(xn−1)Mn(xn−1, xn) dxn = ∫ wn (xn) pin−1(xn−1)Mn(xn−1, xn) dxn ≈ 1 N N∑ i=1 wn ( Xin ) Now substituting the Monte Carlo estimates of these integrals into Eqn. 2.7 yields the estimate of the normalising constant given in Eqn. 2.6. 2.2.2 Implementation Issues In fact in practice, for computational efficiency, On is drawn first (i.e. without explicit reference to An) according to a probability distribution s(·|Wn) such that (2.4) holds and the offspring then matched to their parents. For example, the sim- plest unbiased resampling algorithm consists of sampling On according to a multi- nomial distribution of parameter (N,Wn). More sophisticated schemes such as residual resampling [61], stratified resampling [57] and minimum entropy resam- pling [23] also satisfy (2.4). Once On has been sampled, this is followed by a deterministic allocation procedure of the child particles to the parents, which de- fines “computer” indices e.g. theO1n first child particles are associated to the parent particle number 1, i.e. A1n = 1, ..., A O1n n = 1, likewise for the O2n following child particles and the parent particle number 2, i.e. AO 1 n+1 n = 2, ..., A O1n+O 2 n n = 2 etc. Further on, we will impose the slightly stronger unbiasedness condition. (A1) For any i = 1, . . . , N and n = 1, . . . , p the resampling scheme satisfies E [ Oin|Wn ] = NW in and r ( Ain = k|Wn ) = W kn . (2.8) Note that even if (2.4) holds then (2.8) is not necessarily satisfied, for example by the standard deterministic allocation procedure, but this property can be easily enforced by the addition of a random permutation of these “computer” indices. As 11 we shall see our indexing system makes the writing of the probability distributions underpinning our algorithms extremely simple. 2.2.3 Discussion of Convergence Results The theoretical properties of SMC are now well understood, and we restrict our- selves to some of the simplest assumptions required to ensure the validity of our algorithms; see [64] for a full treatment and more sophisticated assumptions. The following notation will be needed Sn = {xn ∈ X n : pin (xn) > 0} , Q1 = {x1 ∈ X : M1 (x1) > 0} , Qn = {xn ∈ X n : pin−1 (xn−1)Mn (xn−1, xn) > 0} for n ≥ 2. (A2) For n = 1, ..., p, we have Sn ⊆ Qn. (A3) For any n = 1, ..., p, there exists {Bn <∞} such that for any xn ∈ Sn wn (xn) ≤ Bn . (2.9) (A4) There exist µ (·) a probability density on X and 0 < w,w, ε, ε < ∞ such that for any n = 1, ..., p and any xn ∈ Sn, w ≤ wn (xn) ≤ w and ε µ (xn) ≤Mn (xn−1, xn) ≤ ε µ (xn) . Under assumption (A3) it is possible to establish results such as Lp bounds, show that p̂iN (dx) converges (weakly) almost surely (a.s.) as N → ∞ towards pi (dx) and that ẐN converges a.s. to Z. Under additional mixing assumptions on the Feynman-Kac semi-group associated to {pin} and {Mn}, stronger results can be established. In particular, it can be shown using a proof similar to [18, Theorem 5], [64, Section 7.4.3] that for multinomial and residual resampling ẐN satisfies a central limit theorem and that, under (A4), there exists a finite constant 12 C, depending on w,w, ε, ε, µ and the dimension of X but not p, such that the variance var (N) of ẐN/Z satisfies for any N ≥ 1 var (N) ≤ C p N (2.10) The so-called degeneracy phenomenon for SMC algorithms is well documented, and manifests itself in practice by the fact that the approximation p̂iN (dxn) of the marginal pi(dxn) of pi (dx) is concentrated on a very limited number of particles whenever p− n is large for a given number N of particles; this is the consequence of the successive resampling steps between times n and p. As a result the approx- imation of p̂iN (dxn) is poor and cannot be used for inferential purposes. This however does not necessarily mean that the distribution of the surviving path(s) is not “close” to pi (dxn). Under (A4), it can in fact be established [64, Section 8.3] that there exists a finite constant D, depending on w,w, ε, ε but not p, such that the unconditional distribution of a sample from p̂iN (dx), denoted qN (dx), satisfies ∥∥qN (·)− pi (·)∥∥ ≤ D p N , (2.11) where ‖·‖ is the total variation norm. Note that the constants C and D above typi- cally increase with the dimensionality of X and decrease as the ergodic properties of the Feynman-Kac semi-group improve. These results suggest that, under sufficient mixing conditions, the performance of SMC degrades “gracefully” linearly and not exponentially with p as is often the case with other sampling strategies. Note however, that (A4) is a very strong as- sumption not satisfied in most realistic scenarios and that it is difficult in practice to obtain useful quantitative bounds on C and D that could guide the choice of N . This generally results in a bias of unknown magnitude. Nevertheless, empirical evidence from numerous researchers suggests that SMC can be designed to pro- duce random samples whose distribution is “close” to pi even for large p in realistic scenarios where (A4) is not satisfied. This motivates using samples provided by an SMC algorithm as a proposal to feed a MH based algorithm in order to correct for the fact that qN (dx) 6= pi (dx). As we shall see in the next section, this is not straightforward. 13 Figure 2.2: Construction of the proposal in CBMC. For each component sev- eral candidates are sampled, of which one is selected with probability proportional to it’s importance weight. 2.3 Configurational Bias Monte Carlo We now briefly introduce the configurational bias Monte Carlo (CBMC) algorithm, which is one of the methods used to tackle problems that we are interested in, as mentioned in the introduction, and we will use it to compare against the perfor- mance of PMCMC. CBMC is based on the scheme introduced by Rosenbluth and Rosenbluth in 1955 [70] and was developed by J. Siepmann to sample the con- formational space of linear chain molecules [73]. In this section we will briefly introduce the basic CBMC algorithm. We will later compare the performance of our new method with this algorithm. CBMC is a Markov chain Monte Carlo algorithm where the proposal is built up sequentially. In the case of a long molecule – or high dimensional state space – only a subset of the variables may be updated at an iteration. As in SMC, we need to define a sequence of distributions pin(·) (n = 1, . . . , p) of increasing dimension. In the case of sampling chain molecules, this sequence is naturally given by the Boltzmann distributions obtained when growing the chain. The proposal configuration is built as follows. For each variable (element in the chain) a population of candidate values are sampled, of which one is selected with probability proportional to its importance weight. This yields a candidate chain X∗ with the importance weight ẐN,∗ given in Eqn. 2.12, as illustrated in Figure 2.2. To compute the importance weight ẐN for the current configuration, this procedure is repeated while ensuring that one of the candidates matches the current configura- tion X. The candidate is then accepted with probability 1∧ ẐN,∗/ẐN . The details are given in Algorithm 2.4. This algorithm is quite greedy (as illustrated in Figure 2.2), and hence only 14 works for moderately large chains. Several other extensions to the Rosenbluth method have been proposed [40], such as PERM [51], DPERM [19], and recoil- growth [20][40, Sec. 13.7]. PERM, like the Rosenbluth method, is a static Monte Carlo method, and addresses the weight degeneracy problem of the Rosenbluth method by pruning conformations with weights below some preset threshold and replicating ones with weights above some threshold (adjusting the weights accord- ingly). DPERM is the dynamic (Markov chain) generalisation of PERM. However, the performance of (D)PERM is quite sensitive to the thresholds [19], and unlike SMC, the particles (conformations) are not interacting. The recoil-growth (RG) method uses a multi-step look-ahead in an effort to overcome the dead-alley prob- lem of CBMC, which potentially spends much of the simulation time exploring dead ends or evaluating importance weights of low probability candidates. In- stead of generating N candidates at each step, RG proposes something similar to a “depth-first” search (of finite depth) to find “open” and “closed” trial directions, where “open” denotes directions that are deemed to have non-zero probability of generating a complete chain (at least for the finite depth look ahead), and “closed” indicates that the trial direction will result in a dead-alley, e.g. as in sampling self- avoiding walks on a grid. 2.4 Inference in State-Space Models We now take a look at how MCMC and SMC are applied to state-space models (SSMs), and motivate the need for PMCMC. SSM is a very popular class of mod- els and has broad application in many disciplines; for a thorough discussion of SSM see [13, 29, 41]. Some examples include target tracking models [50], change- point models [36], population dynamics models [12], stochastic volatility models (Sec. 4.5), partially observed diffusions [37], and models appearing in systems bi- ology (Sec. 4.4). 2.4.1 Model Description We consider the following SSM, also known as a hidden Markov model (HMM), where {Xn}n≥1 is an unobserved Markov process with initial density X1 ∼ µθ (·) 15 Algorithm 2.4: Configurational Bias Monte Carlo Initialise X(0) (arbitrarily)1 For iteration s ≥ 12 Sample proposal X∗ and compute importance weight:3 At n = 14 For k = 1, . . . , N , sample Xk,∗1 ∼M1(·)5 Compute the weights: w1 ( Xk,∗1 ) := γ1 ( Xk,∗1 ) /M1 ( Xk,∗1 ) ,6 and normalise W k,∗1 = w1 ( Xk,∗1 ) / ∑N k=1w1 ( Xk1 ) Sample X∗1 ∼ p̂iN1 (dx) := ∑N k=1W k,∗ 1 δXk,∗1 (dx)7 For n = 2, . . . , p do8 For k = 1, . . . , N , sample Xk,∗n ∼Mn(X∗n−1, ·) and set9 Xk,∗n = (Xn−1, X k,∗ n ) Compute the weights wn ( Xk,∗n ) := γn “ Xk,∗n ” γn−1 “ Xk,∗n−1 ” Mn “ Xn−1,Xk,∗n ” 10 and normalise W k,∗n = wn “ Xk,∗n ”∑N k=1wn “ Xk,∗n ” Sample X∗n ∼ p̂iNn (dx) := ∑N k=1W k,∗ n δXk,∗n (dx) and set11 X∗n = (X∗n−1, X∗n) Compute importance weight:12 ẐN,∗ = p∏ n=1 [ 1 N N∑ k=1 wn ( Xk,∗n )] (2.12) Sample auxiliary variables and compute importance weight:13 At n = 114 Set X11 := X1(s) For k = 2, . . . , N , sample X k 1 ∼M1(·)15 Compute the weights w1 ( Xk1 ) as above16 For n = 2, . . . , p do17 Set X1n := Xn(s) For k = 2, . . . , N , sample18 Xkn ∼Mn(Xn−1(s), ·) and set Xkn = (Xn−1, Xkn) Compute the weights wn ( Xkn ) as above19 Compute importance weight: ẐN = ∏p n=1 [ 1 N ∑N k=1wn ( Xkn )] 20 With probability 1 ∧ ( ẐN,∗/ẐN (i− 1) ) set X(i) = X∗, otherwise set21 X(i) = X (i− 1) 16 Y3Y1 Y2 X1 X2 X3 XT YT ... Figure 2.3: State-space model, also known as a hidden Markov model (HMM). The Markov process {Xn}n≥1 is not observed directly but through a conditionally independent observation process {Yn}n≥1. and transition probability density (with n > 1) Xn|Xn−1 ∼ fθ ( ·|Xn−1) where θ is a static parameter of the model. The process {Xn}n≥1 cannot be ob- served directly, but only through observations {Yn}n≥1, which are assumed inde- pendent conditional upon {Xn}n≥1 with Yn|Xn ∼ gθ ( ·|Xn) . The parameter θ may also be unknown, in which case it follows the prior distribu- tion p (θ). The goal is now to perform Bayesian inference in this context. First consider the case where the static model parameter θ is known. Given some observations y1:T inference relies on the following posterior density pθ (x1:T | y1:T ) ∝ µθ (x1) gθ (y1|x1) T∏ n=2 fθ (xn|xn−1) gθ (yn|xn) . (2.13) If we do not know θ, we assign a prior density p(θ) to θ and perform Bayesian inference using the joint posterior p (θ, x1:T | y1:T ) ∝ p(θ) pθ (x1:T | y1:T ) (2.14) When the model is non-linear non-Gaussian, the posterior densities pθ (x1:T | y1:T ) and p (θ, x1:T | y1:T ) do not admit standard forms. This makes inference difficult and we need to resort to approximations. Here we consider Monte Carlo methods, 17 which have shown to be a flexible and efficient tool for inference in these types of models. As SSMs are ubiquitous in many areas of science there are literally thousands of papers published on this in the past decade alone. Instead of a thor- ough review, we instead highlight the underlying principles of applying MCMC and SMC in this context, as well as point out some of their limitations, which will motivate the use of PMCMC. 2.4.2 Markov Chain Monte Carlo for State-Space Models In order to perform inference in SSM using Monte Carlo (MC) we need to be able to generate samples from the posteriors pθ (x1:T | y1:T ) or p (θ, x1:T | y1:T ) in the case of fixed (known) or unknown parameter θ, respectively. It is generally impossible to sample exactly from pθ (x1:T | y1:T ), except for lin- ear Gaussian models and finite state-space HMMs. As outlined in Section 2.1 the Metropolis Hastings (MH) sampler makes use of a proposal distribution q(x1:T , x∗1:T ) to generate candidates x∗1:T , which are then accepted or rejected with some prob- ability. The high dimensionality of x1:T makes it impossible to design good pro- posal distributions to update all states jointly, and a popular strategy consists of updating only a subset of components at a time. For example we can divide the T components into adjacent blocks of length K and iteratively update all the blocks, conditioned on the components outside the current block and the observations. A block xn:(n+K−1) is then updated according to an MCMC step of invariant density pθ ( xn:(n+K−1)|y1:T , x1:(n−1), x(n+K+1):T ) ∝ n+K∏ k=n fθ (xk|xk−1) n+K−1∏ k=n gθ (yk|xk) (2.15) As long as K is not too large, it may be possible to design good proposals to be used in a MH update. For example if the transition density fθ is linear Gaussian and the observation density gθ is log-concave, then we may design a proposal density based on a Gaussian approximation of Eqn. 2.15, as illustrated in [71]. However, as K increases a Gaussian approximation may become insufficient. Even seemingly benign differences between the proposal and target distribution in low dimension can accumulate in higher dimensions and result in the acceptance rate dropping to zero, which limits the number of components K that can be updated simultane- 18 ously. This is a serious limitation, as it can severely constrain and slow down the exploration of the support of pθ (x1:T |y1:T ) and introduce dependence between the samples produced by the MCMC algorithm. If the parameter is unknown we need to be able to sample from the joint density p (θ, x1:T | y1:T ). A common approach is to alternately update the state components x1:T conditional on θ and the parameter θ conditional upon x1:T . Sampling exactly from p (θ| y1:T , x1:T ) can often be performed efficiently due to its small or mod- erate size. However, when the parameter θ and states x1:T are strongly correlated, this approach can result in poor mixing and thus yield an inefficient algorithm. In this case we would like to be able to update the parameter and states jointly. How- ever, this would require sampling all the states jointly, which in general we cannot do as outlined above. 2.4.3 Sequential Monte Carlo for State-Space Models The SMC algorithm decomposes the problem of sampling from pθ (x1:T | y1:T ) into a series of simpler sub-problems, as illustrated in Section 2.2, where we define the sequence of distributions {pin(xn) = pθ (x1:n| y1:n) , n = 1 . . . , p = T} , as well as transition densities M1(x1) = qθ (x1| yn) and Mn(xn−1, xn) = qθ (xn|xn−1, yn) , for n = 2 . . . , p = T . The incremental particle weights (Eqn. 2.2) in this case are: w1 (x1) = µθ (x1) gθ (y1|x1) qθ (x1| y1) and wn (xn) = fθ (xn|xn−1) gθ (yn|xn) qθ (xn|xn−1, yn) This allows us to efficiently generate a population of samples (particles) from pθ (x1:T | y1:T ). However, SMC also suffers from well-known drawbacks. If T is too large, then the SMC approximation of the joint density deteriorates as com- ponents at earlier times are not rejuvenated at subsequent time steps. This is due to 19 the resampling which diminishes the diversity of the samples at earlier times. As a result the approximation of the marginals pθ(xn|y1:T ) is likely very poor when T − n is large. This is also the reason why it is very difficult, although possi- ble [2, 31, 43], perform inference on static parameters using SMC, i.e. to sample from the joint distribution p (θ, x1:T | y1:T ). This naturally suggest to combine MCMC with SMC and leverage the strengths of each by using SMC to build an approximation of the joint posterior distribution which can then be used to generate proposals to be used in MCMC. In the next Chapter we will present the framework to do this and also perform parameter esti- mation, i.e. to sample from the joint posterior. Chapter 4 demonstrates this novel methodology on several applications, including SSMs. 20 Chapter 3 Particle Markov Chain Monte Carlo As discussed previously, a simple independent MH update which leaves pi (x) in- variant requires a proposal density q(x). In order to sample a realisation {X(i)} of the associated Markov chain, the MH update proceeds as follows at iteration i: (a) sample X∗ ∼ q (·), (b) set X(i) = X∗ with probability 1 ∧ pi (X ∗) pi (X(i− 1)) q (X(i− 1)) q (X∗) , otherwise set X(i) = X(i−1). We could suggest using as proposal density q (x) = qN (x), i.e. the density of a particle generated by an SMC algorithm targeting pi(x). This is likely to result in an efficient independent MH algorithm following the discussion of the previous section. This would however require being able to evaluate qN (x) in order to compute the acceptance ratio above. This quantity is unfortunately not available in closed-form. Indeed, for n = 1, ..., p let us denote Xn := ( X1n, ..., X N n ) ∈ XN the set of N simulated X -valued random variables at time n, then it is not difficult to establish that the joint density of all the variables 21 generated by the SMC algorithm described in Section 2.2 is ψ (x̄1, ..., x̄p,a1, ...,ap−1) = ψ (x̄1) ∏p n=2 ψ (an−1|xn−1)× ψ (x̄n|x̄n−1,an−1) (3.1) = ( N∏ i=1 M1 ( xi1 )) p∏ n=2 ( r (an−1|wn−1) N∏ i=1 Mn(x ain−1 n−1 , x i n) ) , (3.2) which is defined on X pN × {1, . . . , N}(p−1)N . Now, we deduce that the marginal distribution of a particle drawn according to p̂iN (dx) given in (2.5) is, denoting Eψ the expectation with respect to ψ, qN (dx) = Eψ ( p̂iN (dx) ) = Eψ (∑N i=1W i pδXip (dx) ) , which cannot be computed analytically in most situations of interest. In the next section we present a general methodology to circumvent this difficulty. 3.1 Particle Metropolis-Hastings Sampler In order to illustrate the simplicity of the implementation of our approach we first describe a particular instance of the methodology in order to sample from pi(x), where x is updated in one single block. The particle Metropolis Hastings (PMH) algorithm by itself does not offer any advantage over SMC, but would instead be used as a component of more complex MCMC algorithms. For example in the pro- tein folding application (Sec. 4.2) we use PMH in simulated annealing to perform block updates. We can show that this particular MCMC update is nothing but an in- dependent MH sampler with an auxiliary target distribution defined on an extended space with the output of an SMC algorithm as a proposal. However, the principle underlying the construction of the auxiliary target distribution is not classical. 3.1.1 Algorithm In order to sample from pi(x) the particle Metropolis Hastings (PMH) sampler proceeds as shown in Algorithm 3.1 (with the notation of Section 2.2, in particu- 22 Algorithm 3.1: Particle Metropolis Hastings Sampler Initialisation i = 01 Run an SMC algorithm targeting pi(x)2 sample X(0) ∼ p̂iN (·) and compute ẐN (0)3 For iteration i ≥ 14 Run an SMC algorithm targeting pi(x), sample X∗ ∼ p̂iN (·) and5 compute ẐN,∗ With probability6 1 ∧ Ẑ N,∗ ẐN (i− 1) , (3.3) set X(i) = X∗ and ẐN (i) = ẐN,∗, otherwise set X(i) = X (i− 1) and ẐN (i) = ẐN (i− 1) lar (2.5) and (2.6)). Note that the acceptance probability (3.3) is independent of X∗ which makes it possible to sample X∗ only once the move is accepted. The acceptance probability (3.3) also enjoys the attractive property that, under (A3), it converges to 1 as N → ∞ as both ẐN,∗ and ẐN (i− 1) are consistent estimates of Z, the unknown normalising constant of pi(dx). In addition, under mixing as- sumptions of the Feynman-Kac semi-group, the variance of this acceptance ratio can be shown to be proportional to p/N . 3.1.2 Proof of Validity In this section we establish the validity of the PMH algorithm by showing that it is a standard independent MH update with specific target and proposal distribu- tions defined on an extended space (Theorem 1). This results in straightforward convergence properties, see Theorem 2. At first sight, one might think that the sequence {X (i)} generated by the PMH will have pi (dx) as the desired equilibrium distribution only when N → ∞. We can show that this is in fact the case for all N ≥ 1. The key to establish this result is to reformulate the PMH as a standard independent MH sampler defined on an extended state-space with a suitable invariant distribution. In the following we will need the following notation related to particle ances- 23 A11=3 A 2 1=6 A 3 1=6 A 4 1=3 A 5 1=1 A 6 1=5 A12=2 A 2 2=4 A 6 2=4A 5 2=5A 4 2=3 A13=2 A 2 3=2 A 3 3=4 A 4 3=6 A 5 3=4 A 6 3=1 A32=4 X61X 5 1X 4 1X 3 1X 2 1X 1 1 X62X 5 2X 4 2 X43 X 5 3 X 6 3 X64X 4 4 X 5 4 X32X 2 2X 1 2 X13 X 2 3 X 3 3 X34X 1 4 X 2 4 p̂i1(dx1) p̂i2(dx1:2) p̂i3(dx1:3) p̂i4(dx1:4) Figure 3.1: Illustration of running an SMC algorithm and selection of a path {X61 , X32 , X43 , X34} (yellow) to use as a proposal within PMCMC. The arrows show the resampling and resulting assignment of ancestry vari- ables Ain. tries. Assume that we have selected particle Xkp at time p. For k = 1, . . . , N and n = 1, . . . , p, let ikn denote the index of the ancestor particle of x k p at generation n. More formally we define ikp := k, i k p−1 := akp−1 (with the notation of Section 2.2) and for n = p−1, . . . , 1 we have the backward recursive relation ikn := a ikn+1 n . As a result we can rewrite xkp = (x ik1 1 , x ik2 2 , . . . , x ikp−1 p−1 , x ikp p ). The matrix a = {akn} is sampled row by row at the resampling steps of the SMC algorithm and allows us to represent the ancestries of all Xin (e.g. see arrows in Figure 3.1 which point from parent to offspring), while ik lets us easily index the path of particle k. For example in Figure 3.1 the highlighted path is i34 = 3, i 3 3 = 4, i 3 2 = 3, i 3 1 = 6. Theorem 1 Assume (A1 )-(A2) then for any N ≥ 1 the PMH sampler is an inde- 24 pendent MH sampler defined on the extended space defined on X pN × {1, . . . , N}(p−1)N+1, with target density p̃iN (k, x̄1, ..., x̄p,a1, . . . ,ap−1) = (3.4) 1 Np pi ( xkp ) M1(x ik1 1 ) ∏p n=2 r(i k n−1|wn−1)Mn(x ikn−1 n−1 , x ikn n ) ψ (x̄1, ..., x̄p,a1, ...,ap−1) , with ψ defined in (3.1), and proposal density qN (k, x̄1, ..., x̄p,a1, . . . ,ap−1) := wkp × ψ (x̄1, ..., x̄p,a1, . . . ,ap−1) , (3.5) where wkp is a realization of the normalised importance weight defined in (2.3). The proof is given below. Here xkp is distributed according to pi(dx). As a result the sequence {X(i)} = {XK(i)p (i)} is of interest, because from standard MH theory it will leave pi(dx) invariant. Proof of Theorem 1. We can easily check that (3.4) and (3.5) are both positive and sum to one; the factor 1/Np corresponds to the uniform distribution on the set {1, ..., N}p for the random variables K,AIK21 , . . . , A IKp p−1, i.e. there are N p equiv- alent orderings of the indices in the particle path. Now the acceptance ratio of an independent MH algorithm is known to depend on the following “importance weight” p̃iN (k,x1, ...,xp,a1, . . . ,ap−1) qN (k,x1, ...,xp,a1, . . . ,ap−1) = 1 Np pi ( xkp ) wkp ×M1(xi k 1 1 ) p∏ n=2 r(ikn−1|wn−1)Mn(x ikn−1 n−1 , x ikn n ) First we make use of (A1), i.e. r(ikn|wn) = wi k n n = wn(x ikn n )/ ∑N i wn(x i n), with 25 some abuse of notation. This yields: p̃iN (·) qN (·) = 1 Np pi ( xkp ) M1(x ik1 1 ) p∏ n=2 Mn(x ikn−1 n−1 , x ikn n ) p∏ n=1 w ikn n = γ(xkp) Z 1 Np p∏ n=1 ∑N i=1wn ( xin ) M1(x ik1 1 ) p∏ n=2 Mn(x ikn−1 n−1,n, x ikn n ) p∏ n=1 wn(x ikn n ) In the above manipulations we have also substituted γ ( xkp ) /Z for pi ( xkp ) . p̃iN (·) qN (·) = γ(xkp) Z 1 Np p∏ n=1 ∑N i=1wn ( xin ) M1(x ik1 1 ) p∏ n=2 Mn(x ikn−1 n−1,n, x ikn n ) γ1(x ik1 1 ) M1(x ik1 1 ) p∏ n=2 γn(x ikn n ) γn−1(x ikn−1 n−1 )Mn(x ikn−1 n−1 ,x ikn n ) = ẐN Z The final result is obtained thanks to the definitions of the incremental weights (2.2) w1 ( xi1 ) := γ1(xi1) M1(xi1) and wn ( xin ) := γn ( xin ) γn−1 ( x ikn−1 n−1 ) Mn ( x ikn−1 n−1 , xin ) , and of the normalising constant estimate (2.6) ẐN := p∏ n=1 [ 1 N N∑ i=1 wn ( xin )] . It should now be clear that the PMH algorithm described above corresponds to sam- pling particles according to qN defined in (3.5) and that the acceptance probability (3.3) corresponds to that of an independent MH algorithm with target distribution p̃iN given by (3.4). As a result, Theorem 1 is powerful since by showing that our PMH algorithm is a MH algorithm in disguise, it allows us to use standard results concerning the 26 convergence properties of the independent MH algorithm to prove the following theorem. LetLN (X (i) ∈ ·) denote the distribution of X (i) generated by the PMH algorithm with N ≥ 1 particles. Theorem 2 Assume (A1)-(A2) then for any N ≥ 1 the PMH sampler generates a sequence of probability distributions {LN (X (i) ∈ ·)} such that ∥∥LN (X (i) ∈ ·)− pi(·)∥∥→ 0 as i→∞ . In addition, under (A3), there exists ρ ∈ [0, 1) such that for any i ≥ 1 and N ≥ 1, ∥∥LN (X (i) ∈ ·)− pi(·)∥∥ ≤ ρi . The proof can be found in Appendix A. Note that the second result might ap- pear negative since from the proof we do not observe an improvement on the rate of convergence of the algorithm as N increases. However, as a particular case of [1, Theorem 1] it is possible to show that for any , η > 0 there exists N0 such that for any N ≥ N0 and any i ≥ 1∥∥LN∗ (X (i) ∈ ·)− pi(·)∥∥ ≤  with ψ−probability larger than 1−η, whereLN∗ (X (i) ∈ ·) denotes the conditional distribution of X (i) conditional upon the random variables generated at iteration 0 by the SMC algorithm. 3.1.3 Extensions and Variations As pointed out earlier, Theorem 1 is fundamental since it highlights the standard MH nature of the PMH algorithm and allows us to easily establish its validity. In this section we show how this property can also suggest further improvements. Using All the Particles One criticism of PMH is that a lot of work seems wasted by only selecting one can- didate from the SMC proposal, especially considering that the importance weight (estimate of the normalisation constant) is independent of the selected path. We 27 propose to use all the particles in order to carry out inference using a strategy which shares some characteristics with [39]. The ‘standard’ estimate of Epi(f) for L MCMC iterations is 1L ∑L i=1 f(X(i)). We will show in this thesis that the following estimator, which utilises all the par- ticles, converges towards Epi(f) as L→∞ for any N ≥ 1: Theorem 3 Assume (A1)-(A2) and Epi(|f |) <∞. Then for any N ≥ 1, 1. the estimate 1 L L∑ i=1 ( N∑ k=1 W kp (i) f(X k 1:p(i)) ) (3.6) converges almost surely towards Epi(f) asL→∞ where {W kp (i) , Xk1:p(i)} corresponds to the set of normalised weights and particles used to compute Ẑ (i), 2. denoting {W ∗kp (i) , X∗k1:p(i), k = 1, . . . , N} the set of proposed particles at iteration i (i.e. before deciding whether or not to accept this population) 1 L L∑ i=1 ( 1 ∧ Ẑ N,∗ (i) ẐN (i− 1) )( N∑ k=1 W ∗kp (i) f(X ∗k 1:p(i)) ) (3.7) + ( 1− 1 ∧ Ẑ N,∗ (i) ẐN (i− 1) )( N∑ k=1 W kp (i− 1) f(Xk1:p(i− 1)) ) (3.8) converges almost surely towards Epi(f) as L→∞. The proof is given in Appendix A. The estimate (3.6) is an average of impor- tance sampling estimates. Each of these estimates is biased but the MH mecha- nism allows us to obtain asymptotically consistent estimates by rejecting some of the populations of particles proposed. Following [39] is also possible to propose an estimate which recycles the candidate populations of particles rejected by the PMH. Subblock Updates of pi We have focused on the case where all the components of x = (x1, ..., xp) are updated simultaneously. Again, if p is too large compared to the number N of par- 28 ticles, then the particle approximation of pi produced by the SMC algorithm might be poor, resulting in an inefficient MH algorithm. In such situations we can sug- gest the use of mixtures/compositions of PMH transition probabilities that update subblocks of the form xa:b for 1 ≤ a < b ≤ p, effectively targeting conditional dis- tributions of the type pi (xa:b|x−a:b). The theoretical results of Section 2.2 suggest that under favourable conditions the number of particles required to approximate pi (xa:b|x−a:b) for a given precision decreases linearly with the size b−a+ 1 of the subblock. For such a strategy there is a tradeoff between the accuracy of the pro- posal distribution for the conditional update, and the dependence structure on the complementary block which might result in long correlation time in the sequence {X(i)}. Partial Particle Updates Although the structure of the target distribution p̃iN naturally lends itself to full updates of all the particles using SMC algorithms we might consider rejuvenating some of the particles conditional upon the others for computational purposes. For example, given xkp and its “ancestral lineage” (i k 1, i k 2, . . . , i k p) which is such that xkp = (x ik1 1 , x ik2 2 , . . . , x ikp−1 p−1 , x ikp p ) then we could propose to update the N −1 remain- ing particles and their “ancestral lineages” according to a Gibbs step under p̃iN ; this “conditional SMC sampling” strategy is detailed in Section 3.2. Any classical MCMC updates that leave p̃iN invariant can also obviously be used. Another pos- sible suggestion in order to fight the aforementioned degeneracy for large p’s is to use possibly standard MCMC transition probabilities in order to update subblocks xa:b of x for 1 ≤ a < b ≤ p such that xa:b corresponds to a region of x where particle depletion is severe. 3.2 Particle Gibbs Sampler Assume now that we are interested in sampling from a distribution pi (θ,x) = γ (θ,x) Z (3.9) 29 where γ : Θ × E → R+ is known pointwise. We show here again how SMC algorithms can be used as a component of an MCMC algorithm, namely a Gibbs sampler, to sample from distributions of the type pi(θ,x) in situations where sam- pling from pi(x|θ) is difficult without resorting to SMC methods. A standard strategy to sample from pi (θ,x) consists of using the Gibbs sam- pler; that is sampling iteratively from the full conditionals pi (θ|x) and pi (x|θ). In numerous situations it is possible to sample exactly from pi (θ|x), and we will assume here that this is the case. Otherwise an MH update of invariant density pi (θ|x) might be used. For models of practical interest x can be high dimensional (e.g. a vector of latent variables of the size of a large dataset) and the conditional distribution pi (x|θ) non-standard, precluding practical exact sampling. We have pi (x|θ) = γ (θ,x) γ (θ) , (3.10) where γ (θ) := ∫ E γ (θ,x) dx is assumed unknown. It is therefore natural to suggest the use of an SMC algorithm in order to pro- pose approximate samples from this conditional distribution. Hence we consider a family of bridging distributions {pin (xn|θ) ;n = 1, . . . , p} where pin (xn| θ) = γn (θ,xn) Zθn , (3.11) such that pip (xp|θ) = pi (x|θ) and a family of transition probability densities {M θn (xn−1, xn)} that defines sampling of xn ∈ X conditional upon xn−1 ∈ X n−1; note that γp (θ,xp) = γ (θ,x) and Zθp = γ (θ). 3.2.1 Algorithm The particle Gibbs (PG) sampler is an approximation of the “ideal” Gibbs sampler where we approximately sample from pi (x|θ) using an SMC algorithm. Contrary to the PMH algorithm, sampling from the approximation p̂iN (dx|θ) of pi (x|θ) using the PG algorithm requires us to keep track of the “ancestral lineage” I := (I1, I2, . . . , Ip) of the random variable X which, we point out again, is such that 30 X = (XI11 , ..., X Ip p ). The PG sampler is shown in Algorithm 3.2. Algorithm 3.2: Particle Gibbs Sampler Initialisation i = 01 Set randomly θ (0)2 Run an SMC algorithm targeting pi (x|θ (0))3 Sample X (0) ∼ p̂iN (·|θ (0)) and denote I (0) its ancestral lineage.4 For iteration i ≥ 15 Sample θ (i) ∼ pi (·|X (i− 1))6 Run a conditional SMC algorithm for θ (i) consistent with7 X (i− 1) , I (i− 1) Sample X (i) ∼ p̂iN (·|θ (i)) and denote I (i) its ancestral lineage8 The conditional SMC step is the non-standard step of the algorithm which, given (θ, I,X = XIpp ), yields an SMC approximation of pi (x|θ) by using the “frozen” particle X and sampling N − 1 “free” particles consistent with this particle and its ancestral lineage. This is shown in Algorithm 3.3, where we have used the notation A−Inn−1 = An−1\{AInn−1}. To sample from the discrete-valued conditional distribution r(a−Inn−1|wn−1, aInn−1), we can use the following procedure. 1. Sample the number of offspring On−1 ∼ s ( ·|Wn−1, OInn−1 ≥ 1 ) . 2. Sample the indices of the N − 1 “free” offspring uniformly on the set {1, ..., N} \ {In}. When sampling from s ( ·|Wn−1, OInn−1 ≥ 1 ) cannot be performed in closed- form, we can use a rejection sampling procedure by sampling On−1∼s ( ·|Wn−1) until OInn−1 ≥ 1. Note however that it is possible to sample directly from s(·|Wn−1, OInn−1≥1) in important cases. We consider multinomial and stratified resampling below. Remark 1 Note that the conditional SMC algorithm does not correspond to sam- pling from a conditional distribution of the distribution of the SMC algorithm ψθ (given in (3.16)). However this becomes true as N →∞. 31 Algorithm 3.3: Conditional Sequential Monte Carlo Algorithm At n = 11 For i 6= I1, sample Xi1 ∼M θ1 (·)2 Update and normalise the weights3 w1 ( Xi1 ) = γ1(θ,Xi1) M θ1 (X i 1) , W i1 = w1 ( Xi1 )∑N k=1w1 ( Xk1 ) . For n = 2, ..., p do4 Sample A−Inn−1 ∼ r(·|Wn−1, AInn−1)5 For i 6= In, sample Xin ∼M θn(X Ain−1 n−1 , ·) and set Xin = (X Ain−1 n−1 , X i n)6 Update and normalise the weights7 wn ( Xin ) = γn ( θ,Xin ) γn−1 ( θ,X Ain−1 n−1 ) M θn ( X Ain−1 n−1 , Xin ) , (3.12) W in = wn ( Xin )∑N k=1wn (Xkn) . (3.13) Multinomial Resampling For multinomial resampling, we can factorise the probability of the number of offsprings as P(On−1|Wn−1, OInn−1≥1) = P(OInn−1|Wn−1, OInn−1≥1)P(O−Inn−1|Wn−1, OInn−1) We then have the following procedure to perform multinomial resampling: Algorithm 3.4: Conditional Multinomial Resampling Sample OInn−1 ∼ B+ ( N,W Inn−1 ) 1 Compute W in−1 ∝W in−1 with ∑ i 6=InW i n−1 = 1 and set2 Wn−1 = ( W 1 n−1, . . . ,W In−1 n−1 ,W In+1 n−1 , . . . ,W N n−1 ) Sample O−Inn−1 ∼M ( N,Wn−1 ) 3 32 where B+ ( N,W Inn−1 ) is the Binomial distribution of parameters (N,W Inn−1) re- stricted to the support {1, 2, . . .}. Stratified Resampling Algorithm 3.5: Conditional Stratified Resampling Compute adjusted weights W i,∗n−1 as given in Eqn.3.141 if W In,∗n−1 ≥ 1/N then2 sample On−1 ∼ s ( ·|W∗n−1)3 else4 sample U1 uniformly on (∑In−1 k=1 W k,∗ n−1, ∑In k=1W k,∗ n−1 ) 5 compute Uj = U1 + j−1N for j ∈ Z and set6 Oin−1 = # { Uj : i−1∑ k=1 W k,∗n−1 ≤ Uj ≤ i∑ k=1 W k,∗n−1 } . For the popular stratified resampling algorithm [57], we can use the method given in Algorithm 3.5. The condition of having at least one offspring of particle In changes the resample distribution and results in the following adjusted weights: W In,∗n−1 = W Inn−1 1− (1−W Inn−1)N , W i 6=In,∗n−1 = W i n−1 ( 1−W In,∗n−1 1−W Inn−1 ) (3.14) 3.2.2 Proof of Validity In this section we establish the validity of the PG algorithm under mild assump- tions. The following notation will be needed, for any θ ∈ Θ, let Sθn = {xn ∈ X n : pin (xn| θ) > 0} , Qθ1 = { x1 ∈ X : M θ1 (x1) > 0 } , Qθn = { xn ∈ X n : pin−1 (xn−1| θ)M θn (xn−1, xn) > 0 } for n ≥ 2, 33 and S = {θ ∈ Θ : pi (θ) > 0} . (A5) For any θ ∈ S, we have Sθn ⊆ Qθn for n = 1, ..., p. (A6) The ‘ideal’ Gibbs (G hereafter) sampler defined by the conditionals pi (θ|x) and pi (x|θ) is irreducible and aperiodic (and hence converges for pi-almost all starting points). We have the following result. Theorem 4 Assume (A1)-(A5), then for any N ≥ 2 the PG sampler defines an MCMC kernel on the extended space Θ×X pN ×{1, . . . , N}(p−1)N+1 with target density p̃iN (θ, k,x1, ...,xp,a1, . . . ,ap−1) := (3.15) 1 Np pi ( θ,xkp ) M θ1 (x ik1 1 ) ∏p n=2 r(i k n−1|wn−1)M θn(x ikn−1 n−1 , x ikn n ) ψθ(x1, ...,xp,a1, ...,ap−1) , where ψθ (x1, ...,xp,a1, ...,ap−1) := (3.16)( N∏ i=1 M θ1 ( xi1 )) p∏ n=2 ( r (an−1|wn−1) N∏ i=1 M θn ( x ain−1 n−1 , x i n )) . Additionally assume (A6) holds then for any N ≥ 2 the PG sampler generates a sequence {θ (i) ,X (i)} whose marginal distributions {LN ((θ (i) ,X (i)) ∈ ·)} satisfy ∥∥LN ((θ (i) ,X (i)) ∈ ·)− pi(·)∥∥→ 0 as i→∞ . The proof can be found in Appendix A. 34 3.2.3 Extensions Alternative Moves The PG algorithm is an appealing algorithm which has the property that it con- verges to the standard Gibbs sampler asN →∞. However, given p and for a finite N for which the degeneracy problem is severe, the fact that the PG sampler forces the “frozen” path I,X = XIpp to survive during the conditional PMH step can have detrimental effects and result in a highly dependent Markov chain. Indeed, in such situations most of the particles at time p generated by the conditional SMC algorithm coalesce with the ancestral lineage of the “frozen” path I,X = XIpp . To limit this problem, various alternatives to update of X (i) proposed above can be suggested. For example, we can suggest to update X (i) using the PMH algorithm, which yields Algorithm 3.6. Algorithm 3.6: Alternate Move for Particle Gibbs Sampler For iteration i ≥ 11 Sample θ (i) ∼ pi (·|X (i− 1))2 Run a conditional SMC algorithm for θ (i) consistent with3 X (i− 1) , I (i− 1) and set γ̂N (θ (i)) = ẐN Run an SMC algorithm targeting pi (x|θ (i)), sample X∗ ∼ p̂iN (·|θ (i))4 and set γ̂N,∗ (θ (i)) = ẐN With probability5 1 ∧ γ̂ N,∗ (θ (i)) γ̂N (θ (i)) set X(i) = X∗ and I(i) = I∗, otherwise set X(i) = X (i− 1) and I(i) = I (i− 1) Following the developments concerning the PG sampler, we can easily show that under mild assumptions (A1-A5) and for any N ≥ 1 this algorithm has the desired invariant distribution. 35 Optimisation In some situations we might be interested in maximising pi (θ). As pi (θ) is typ- ically unknown, even up to a normalising constant, it is not possible to use an algorithm such as the simulated annealing algorithm. However, we can introduce the following distribution, defined on Θ× Em pim ( θ,x1, ...,xm ) ∝∏m i=1 pi ( θ,xi ) (3.17) whose marginal distribution pim (θ) is proportional to [pi(θ)] m [28]. This distri- bution concentrates itself on the set of global maxima of pi (θ) as m increases. It is straightforward to apply a modified version of the PG sampler to sample from pim ( θ,x1, ...,xm ) by sampling m independent conditional SMC algorithms. At iteration i, it proceeds as follows: Algorithm 3.7: Marginal Maximisation using Particle Gibbs Sample θ (i) ∼ pim (·|X1 (i− 1) , ...,Xm (i− 1))1 For k = 1, ...,m2 Run a conditional SMC algorithm for θ (i) consistent with3 Xk (i− 1) , Ik (i− 1) Sample Xk (i) ∼ p̂iN (·|θ (i)) and denote Ik (i) its ancestral lineage4 It can be shown under mild assumptions (A1-A5) and for any N ≥ 2 that the Markov kernel associated to this algorithm admits pim ( θ,x1, ...,xm ) as invariant distribution. An alternative consists of considering the following extended distribution p̃iN (θ, k1, ..., km,x1, ...,xp,a1, . . . ,ap−1) ∝ ∏m i=1 pi ( θ,xkip ) N∏ i=1,i 6=ik1:km1 M θ1 ( xi1 ) (3.18) × p∏ n=2 r(a−ik1:km1n−1 ∣∣∣∣wn−1, aik1:kmnn−1 ) N∏ i=1,i 6=ik1:kmn M θn ( x ain−1 n−1 , x i n ) 36 where ka 6= kb for a 6= b and ik1:kmn = ( ik1n , ..., i km n ) . Note that even if ka 6= kb, these paths might have a common ancestor; that is there exists n such that ikan = ikbn . To sample from (3.18) and thus from pim ( θ,x1, ...,xm ) , we could propose the following iterative algorithm. At each iteration, sample θ ∼ pim ( ·|xk1p , ...,xkmp ) then run a conditional SMC algorithm for θ consistent with xk1p , ...,x km p and their ancestral lineages. Finally, sample (k1, ..., km) from p̃iN (k1, ..., km| θ,x1, ...,xp,a1, . . . ,ap−1) . However, in this scenario the full conditional distribution of (k1, ..., km) does not take a simple form because of the potential coalescences of the paths and an MH step is required. 3.3 Particle Marginal Metropolis-Hastings Sampler Assume we are interested in sampling from pi (θ,x) defined in (3.9). In cases where x and θ are highly correlated, the Gibbs sampling strategy described in the previ- ous section might be inefficient. In this case, we might want to sample directly from the marginal pi (θ) using say an MH algorithm. However, if pi (θ) is unknown even up to a normalising constant, this is impossible. We present here a particle approximation of this “ideal” marginal MH algorithm. Our algorithm is based on the following remark. Consider a MH algorithm of target density pi (θ,x) and proposal density q ((θ,x) , (θ∗,x∗)) = q (θ, θ∗)pi (x∗| θ∗) . Then the acceptance ratio is given by 1 ∧ pi (θ ∗,x∗) pi (θ,x) q ((θ∗,x∗) , (θ,x)) q ((θ,x) , (θ∗,x∗)) = 1 ∧ pi (θ ∗) pi (θ) q (θ∗, θ) q (θ, θ∗) . Consequently, provided that it is possible to sample the component x from a good approximation of pi (x| θ) then the resulting MH algorithm will essentially approx- imate the “marginal” MH algorithm. 37 To build a proposal approximating pi (x| θ), we use an SMC method and in- troduce as in the PG sampler a sequence of distributions {pin (xn| θ)} defined in (3.11) and some transition kernels {M θn (xn−1, xn)}. The key point is that the normalising constant estimate obtained at time p is an estimate of γ (θ), the unnor- malised version of pi (θ), i.e. recall that γ(θ,x) = pi(x|θ)γ(θ) from (3.10). 3.3.1 Algorithm Algorithm 3.8: Particle Marginal Metropolis-Hastings Sampler Initialisation i = 01 Set randomly θ (0)2 Run an SMC algorithm targeting pi (x|θ (0))3 Sample X (0) ∼ piN ( ·| θ (0)) and set γ̂N (θ (0)) = ẐN4 For iteration i ≥ 15 Sample θ∗ ∼ q (θ (i− 1) , ·)6 Run an SMC algorithm targeting pi (x|θ∗) sample X∗ ∼ piN ( ·| θ∗) and7 set γ̂N (θ∗) = ẐN With probability8 1 ∧ γ̂ N (θ∗) γ̂N (θ (i− 1)) q (θ∗, θ (i− 1)) q (θ (i− 1) , θ∗) (3.19) set θ (i) = θ∗, X (i) = X∗, γ̂N (θ (i)) = γ̂N (θ∗), otherwise set θ (i) = θ (i− 1), X (i) = X (i− 1), γ̂N (θ (i)) = γ̂N (θ (i− 1)) The particle marginal Metropolis Hastings (PMMH) sampler is given in Al- gorithm 3.8. Note that the acceptance probability (3.19) is independent of X∗ so that it is possible to sample X∗ only once the move is accepted and, if we are only interested in estimating pi (θ), then this simulation step can always be omit- ted. The computational complexity of each iteration is of orderO (N). Under mild assumptions the acceptance probability of this sampler converges to the acceptance 38 probability of the “ideal” marginal MH algorithm as γ̂N (θ∗) γ̂N (θ (i− 1)) → γ (θ∗) γ (θ (i− 1)) = pi (θ∗) pi (θ (i− 1)) when N → ∞. Moreover, under mixing assumptions, the variance of the accep- tance ratio is proportional to p/N . We can also show under mild assumptions (A1- A5,A7) that for anyN ≥ 1 the PMMH sampler generates a sequence {θ (i) ,X (i)} whose sequence of distributions {LN ((θ (i) ,X (i)) ∈ ·)} satisfies ∥∥LN ((θ (i) ,X (i)) ∈ ·)− pi(·)∥∥→ 0 as i→∞ . 3.3.2 Proof of Validity We show here that the PMMH is a standard MCMC algorithm on an extended space and that for all N ≥ 1 it generates a sequence converging towards the target distribution of interest. The following assumption will guarantee convergence. (A7) The “ideal” MH sampler of target density pi (θ) and proposal density q (θ, θ∗) is irreducible and aperiodic (and hence converges for pi-almost all starting points). Theorem 5 Assume (A1 )-(A5), then for any N ≥ 1 the PMMH sampler is an MH sampler defined on the extended space Θ×{1, . . . , N}×X pN×{1, . . . , N}(p−1)N+1 with target density (3.15) and proposal density q (θ, θ∗)× w∗kp × ψθ ∗ ( x∗1, ...,x ∗ p,a ∗ 1, ...,a ∗ p−1 ) where ψθ ∗ is defined in (3.16) and {w∗kp } consists of the normalised importance weights associated to the proposed population of particles as defined in (2.3). Assume in addition that (A7) holds. Then for any N ≥ 1 the PMMH sampler generates a sequence {θ (i) ,X (i)} whose sequence of distributions {LN ((θ (i) ,X (i)) ∈ ·)} satisfies ∥∥LN ((θ (i) ,X (i)) ∈ ·)− pi(·)∥∥→ 0 as i→∞ . 39 The proof can be found in Appendix A. 3.3.3 Extensions Alternative Proposals It can be difficult to select the proposal distribution for θ. However, our framework is flexible enough that it allows us to design proposal distributions for θ relying on the set of current particles. For example, if pi (θ|x) is a standard distribution dependent on a set of sufficient statistics of x then we could use an independent proposal of the same functional form with sufficient statistics computed using the set of current particles with possibly an inflated variance, as long as the dimension of θ is not too large. Optimisation To sample from (3.17), we can use straightforwardly the PMMH by sampling m independent SMC algorithms at each iteration i. This is shown in Algorithm 3.9. Algorithm 3.9: PMMH with m Independent SMC Algorithms For iteration i ≥ 11 Sample θ∗ ∼ q (θ (i− 1) , ·)2 Run m independent SMC targeting pi (x|θ∗) and let3 γ̂Nm (θ ∗) = m∏ i=1 ẐNi where ẐNi is the normalising constant estimate provided by the ith SMC algorithm. With probability4 1 ∧ γ̂ N m (θ ∗) γ̂Nm (θ (i− 1)) q (θ∗, θ (i− 1)) q (θ (i− 1) , θ∗) set θ (i) = θ∗, γ̂Nm (θ (i)) = γ̂Nm (θ∗), otherwise set θ (i) = θ (i− 1), γ̂Nm (θ (i)) = γ̂Nm (θ (i− 1)) 40 Under (A1)-(A5), the invariant distribution of the Markov kernel associated to this algorithm is pim (θ) ∝ pim(θ) for any N ≥ 1. 3.4 Extensions In this section, we present various extensions of the PMCMC methodology and dis- cuss connections with earlier work. The content of this section is not required to understand the examples in Section 4, where applications of the PMCMC method- ology are presented. 3.4.1 Dynamic and Optimal Resampling In many applications of SMC, the resampling step is only performed when the accuracy of the estimator is poor. Practically, this is assessed by looking at the variability of the weights using the so-called effective sample size (ESS) criterion [61, pp. 35-36] given at time n by ESS = ( N∑ i=1 ( W in )2)−1 . Its interpretation is that inference based on the N weighted samples is approxi- mately equivalent to inference based on ESS perfect samples from the target. The ESS takes values between 1 andN and we resample only when it is below a thresh- old NT . All the strategies presented in the previous sections can also be applied in this context. The PMH and PMMH can be implemented in the dynamic resampling context without any modification. However, the PG is more difficult to imple- ment as the conditional SMC step requires simulating a set of N − 1 particles not only consistent with a ‘frozen’ path but also consistent with the resampling times of the SMC method used to generate the ‘frozen’ path. For example assume that the ‘frozen’ path has been generated by an SMC algorithm resampling at time n1, n2, ..., nm. Then the conditional SMC method has to generate a set of N − 1 particles which are such that when the ESS is computed with this N − 1 ‘free’ par- ticles plus the frozen one, then it is below the threshold NT at times n1, n2, ..., nm 41 and only at these time indices. This implies that the N − 1 particles cannot be simulated independently anymore. The easiest way to implement the conditional SMC consists of using a rejection method. However the acceptance rate of this rejection step is expected to be only reasonable when N is large - as the variance of ESS is inversely proportional to N - and if p is moderate. The SMC algorithm proposed in [32], [34] for discrete-valued variables is not of the form described in Section 2.2 as the ‘sampling’ step is deterministic and the proposed so-called optimal resampling step does not attribute equal weights to all particles. It can be shown though that we can use it directly within the PMH and PMMH algorithms. However the peculiar structure of the optimal resampling step seems to prevent the development of a simple PG-type algorithm. 3.4.2 Arbitrary Target Distribution For the time being, we have considered it possible to devise a sequence of distri- butions of increasing dimensions which progressively evolves towards the target distribution of interest. However, there are many problems where such a decompo- sition might not be obvious to define. In this case, it remains possible to use SMC to sample from any target distribution using the general methodology presented in [65]; see also [17], [45], [56]. Assume now that the target p̃i (x) is defined on X . To sample from p̃i (x) using SMC-type ideas, we introduce a sequence of p − 1 intermediate distribu- tions {p̃in (x)} on X moving smoothly from p̃i1 (x) – an easy to sample distribu- tion – to p̃ip (x) = p̃i (x), where p̃in (x) = Z−1n γ̃n (xn). Here γ̃n : X → R+ is known pointwise but Zn might be unknown. For example, we could have p̃in (x) ∝ [p̃i (x)]ηn where 0 < η1 < η2 < · · · < ηp = 1 or, in a Bayesian context, p̃in (x) could be the posterior distribution of X given observations until time n [17]. To move from p̃in−1 (xn−1) to p̃in (xn) we use a sequence of Markov kernels Mn (xn−1, xn) = Mn (xn−1, xn). In most applications, Mn is an MCMC kernel of invariant distribution p̃in although this is not necessary. We then build the distributions pin (xn) on X n for n = 2, ..., p using γn (xn) = γ̃n (xn) n−1∏ k=1 Lk (xk+1, xk) 42 where {Ln} is a sequence of auxiliary (backward) Markov kernels giving the prob- ability density to move from xn+1 to xn. By construction, we have p̃i (xp) = ∫ pip (xp) dxp−1 . Once {p̃in} , {Mn} and {Ln} have been selected (see [65] for guidelines), then we can straightforwardly implement the algorithms proposed in Section 3.1 to sample from pip (xp) and hence from its marginal pip (xp) = pi (xp). We can also directly adapt the PMMH presented in Section 3.3 to sample from p̃i (θ, x) = Z−1γ̃ (θ, x) on some space Θ×X by defining pin (xn| θ) = γn (θ,xn) Zθn where γn (θ,xn) = γ̃n (θ, xn) n−1∏ k=1 Lθk (xk+1, xk) and γ̃p (θ, xp) = γ̃ (θ, xp). However, even if we can sample from p̃i (θ|x), the PG sampler proposed in Section 3.2 cannot be applied as it does not require sampling from p̃i (θ|xp) but from pip (θ|xp) which is usually intractable. To bypass the simulation step from pip (θ|xp), we can use a collapsed Gibbs sampler strategy where we first sample θ from p̃i (θ|xp) then sample Xp−1, ..., X1 from n−1∏ k=1 Lθk (xk+1, xk) . 3.5 Discussion The PMH algorithm presented in Section 3.1 is related to the configurational bias Monte Carlo (CBMC) method which is a very popular method in molecular simu- lation used to sample long proteins [40, 74]. However, on the contrary to the PMH algorithm, the CBMC algorithm does not propagateN particles in parallel. Indeed, at each time step n, the CBMC algorithm samples N particles but the resampling step is such that a single particle survives, to which a new set ofN offsprings is then 43 attached. Using our notation, this means that the CBMC algorithm corresponds to the case where Ain = A j n for all i, j = 1, . . . , N and A1n ∼ r(·|Wn) i.e. at any time n, all the children share the same and unique parent particle. The problem with this approach is that it is somewhat too greedy and that if a “wrong” decision is taken too prematurely then the proposal will be most likely rejected. It can be shown that the acceptance probability of the CBMC algorithm does not converge to 1 for p > 1 as N →∞ contrary to that of the PMH algorithm. It has been more recently proposed in [19] to improve the CBMC algorithm by propagating forward several particles simultaneously in the spirit of the PMH algorithm. However, con- trary to us, the authors in [19] propose to kill or multiply particles by comparing their unnormalised weights wn ( Xin ) with respect to some pre-specified lower and upper thresholds; i.e. the particles are not interacting and their number is a ran- dom variable. In simulations, they found that the performance of this algorithm was very sensitive to the values of these thresholds. Our approach has the great advantage of bypassing the delicate choice of such thresholds. In statistics, a variation of the CBMC algorithm known as the multiple-try Metropolis (MTM) has been introduced in the specific case where p = 1 in [60]. The key of our methodology is to build efficient proposals using sequential and interacting mechanisms for cases where p  1: the sequential structure might be natural for some models but can also be induced in other scenarios in order to take advantage of the potential improvement brought in by the interacting mechanism. In this respect, both methods do not apply to the same class of problems. The idea of approximating the marginal MH algorithm which samples directly from pi (θ), by approximately integrating out the latent variables x, was proposed in [6] then generalised and studied theoretically in [1]. The present work is a sim- ple mechanism which opens up the possibility to make this approach viable in large dimensional setups. Indeed the PMMH approach is expected to lead to approxi- mations of pi (θ) (up to a normalising constant) with a variance which for high- dimensional problems is likely to scale favourably compared to estimates based, for example, on standard non-sequential and non-interacting importance sampling. The results in [1] suggest that this is a property of paramount importance in order to design efficient marginal MCMC algorithms. 44 Chapter 4 Applications 4.1 Nonlinear State-Space Model Consider the following non-linear non-Gaussian dynamic model where {Xn}n≥1 is an unobserved Markov process defined by X1 ∼ µθ (·) and for n > 1 Xn|Xn−1 ∼ fθ ( ·|Xn−1) where the observations {Yn}n≥1 are assumed independent conditional upon {Xn}n≥1 with Yn|Xn ∼ gθ ( ·|Xn) . The parameter θ is also unknown and follows the prior distribution p (θ). Having observed y1:T , we are interested in sampling from the target distribution p (θ, x1:T | y1:T ) ∝ p (θ)µθ (x1) gθ (y1|x1) T∏ n=2 fθ (xn|xn−1) gθ (yn|xn) . Although sampling from p (θ| y1:T , x1:T ) can typically be performed using stan- dard techniques, sampling from p (x1:T | y1:T , θ) is impossible except for lin- ear Gaussian models and finite state-space Markov chains [14], [41]. The standard approach consists of sampling alternatively from the full conditional p(xn|yn, xn−1, xn+1, θ) directly or using a MH step but this strategy can be ex- 45 tremely inefficient due to the strong dependence between xn and (xn−1, xn+1). A more efficient strategy would consists of sampling blocks of variables say p (xn:n+L−1| yn:n+L−1, xn−1, xn+L, θ). When this distribution is log-concave, it is possible to design efficient proposal distributions [71]. However for more complex scenarios, this can be very challenging. We propose here to use the PMH algorithm to sample from p (x1:T | y1:T , θ) or from subblocks p (xk:k+L−1| yk:k+L−1, xk−1, xk+L, θ) whenever T is too large. In this case, the “natural” sequence of bridging distributions to consider is pin (xn) = p (xn| y1:n, θ). We illustrate our methodology on the following example Xn = Xn−1 2 + 25 Xn−1 1 +X2n−1 + 8 cos(1.2n) + Vn (4.1) Yn = X2n 20 +Wn (4.2) where X1 ∼ N (0, 5), Vn iid∼ N ( 0, σ2V ) , Wn iid∼ N (0, σ2W ); N (m,σ2) denotes the Gaussian distribution of mean m and variance σ2. We set θ = (σV , σW ). This example is often used in the literature in order to assess the performance of particle methods; e.g. [27], [47], [57]. The posterior distribution p (xT | y1:T , θ) for this non-linear state-space (NLSS) model is highly multimodal as there is uncertainty about the sign of the state Xn which is only observed through its square. We generated a set of observations y1:200 according to the model (4.1)-(4.2) for θ = (σV , σW ) = (√ 10, 1 ) . We first compared the acceptance rates for the PMH and CBMC algorithm [40] when sampling from p (xT | y1:T , θ) for various lengths T of the observations and a varying number of particles N . To sample from p (xT | y1:T ) we used for the proposal M θn (xn−1, xn) an ap- proximation of the conditional distribution p (xn| yn, xn−1, θ) ∝ gθ (yn|xn) fθ (xn|xn−1) given by the extended Kalman filter (local-linearisation); see [26] for details. We also used the most basic resampling scheme; i.e. multinomial resampling. The results are displayed in Figure 4.1. The PMH algorithm performs consistently 46 better than the CBMC algorithm, and is able to successfully sample large blocks of variables for a reasonable number of particles. As expected, the PMH algorithm can achieve acceptance rates close to 1 for a sufficiently large number of particles. We then compared the acceptance rates for proposal from prior (P) versus pro- posal using local-linearisation (LL). Here we used θ = ( √ 15, √ 2). The simula- tions with the P proposal used multinomial resampling at every time step (boot- strap) while the simulations with the LL proposal used dynamic and stratified re- sampling. The result is shown in Figure 4.2. We found that for low number of par- ticles the LL proposal does slightly better, but as the number of particles increases proposing from the prior does as well or better than LL. This can be explained by the fact that the LL proposal doesn’t handle multimodalities well. At low parti- cles it is able to steer more particles into regions of higher posterior density than the much more diffuse prior. However, this advantage becomes less important and eventually the property of being a unimodal Gaussian proposal for a multimodal posterior becomes a handicap for LL as the number of particles increases and the prior is able to propose sufficient number of particles into high probability regions. We then set the following prior on θ = (θ1, θ2) = (σV , σW ) θ21 ∼ IGa (a, b) , θ22 ∼ IGa (a, b) , where IGa is the Inverse-Gamma distribution with a = b = 0.01. To sample from p (xT , θ| y1:T ), we used the PG sampler described in Sec- tion 3.2 with for M θn (xn−1, xn) an approximation of the conditional distribution p (xn| yn, xn−1, θ) ∝ gθ (yn|xn) fθ (xn|xn−1) given by the extended Kalman filter as before (LL proposal). The full conditional distribution p (θ| y1:T ,xT ) is of a standard form. We compared the PG sampler to a standard algorithm where one updates the states Xn one-at-a-time using a MH step of invariant distribution p (xn| yn, xn−1, xn+1, θ) and proposal distribu- tion M θn (xn−1, xn). In the one-at-a-time algorithm, we updated N times the state variables XT at each iteration before updating θ. Hence the PG sampler and the 47  0  0.2  0.4  0.6  0.8  1 5 10 25 50 100 200  5   10   25   50   100   200  0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 State Space Size       # of Particles  0  0.2  0.4  0.6  0.8  1 5 10 25 50 100 200  5   10   25   50   100   200  0  0.05  0.1  0.15  0.2  0.25  0.3 State Space Size       # of Particles Figure 4.1: Acceptance probabilities for PMH (top) and CBMC (bottom) as a function of T and N . Each point was computed over an average of 250,000 iterations. 48  0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1  0  200  400  600  800  1000  1200  1400  1600  1800  2000 Ac ce pt an ce  R at e Number of Particles Proposal from Prior, multinomial always resampling T= 10 T= 25 T= 50 T=100  0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1  0  200  400  600  800  1000  1200  1400  1600  1800  2000 Ac ce pt an ce  R at e Number of Particles Proposal using local linearization, stratified dynamic resampling T= 10 T= 25 T= 50 T=100 Figure 4.2: PMH acceptance rates for sampling from prior (left) and proposal using local-linearisation (right) for varying number L of observations and number of particles. -30 -20 -10  0  10  20  30  40  0  50  100  150  200  250  300  350  400  450  500 x y Figure 4.3: Hidden states (x) and observations (y). standard algorithm have approximately the same computational complexity. Note that in this case, it is difficult to design efficient block proposal distributions to sample from p (xk:k+L−1| yk:k+L−1, xk−1, xk+L, θ) for L > 1 as this distribution can be multimodal. We used T = 500 and N = 1000. For θ fixed to its real value, the average ac- 49  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100 PG σvPG σwMH one-at-a-time, σvMH one-at-a-time, σw Figure 4.4: ACF of the output of the PG and MH algorithms. ceptance rate of our PMH algorithm of Section 3.1 using the ‘improved’ proposal and a multinomial resampling scheme was equal to 0.43. This is quite remarkable as this effectively corresponds to updating 500 real-valued dependent variables si- multaneously. We then ran the PG algorithm and the standard one-at-a-time sam- pler for 50,000 iterations with a burn-in of 10,000 iterations. In Figure 4.4, we display the auto-correlation functions (ACFs) of the parameters (σV , σW ) for the PG and the MH one-at-a-time algorithms obtained using N = 1000. The autocor- relation sequences for the various parameters consistently vanish faster for the PG sampler. In Figure 4.5, we present the estimates of the marginal distributions of the parameters for both algorithms. Contrary to the PG sampler, the standard algorithm clearly misses the main mode of the posterior distribution and over-estimates the parameter σV . The block proposal for the states XT of the PG sampler allows us to avoid getting trapped in this secondary mode. Out of 100 independent runs of both al- gorithms on the same dataset, the PG algorithm never experienced being trapped in such a secondary mode of the target distribution and produced consistent results 50  0  0.5  1  1.5  2  2.5  3  0  0.5  1  1.5  2  2.5  3  2.8  3  3.2  3.4  3.6  3.8  4  4.2  4.4  0  1  2  3  4  5  6  0  1  2  3  4  5  6  0.6  0.7  0.8  0.9  1  1.1  1.2  1.3 Figure 4.5: Estimates of p (σV | y1:T ) (left) and p (σW | y1:T ) (right) for the PG (top) and the MH one-at-a-time (bottom). The vertical lines corre- spond to the true values. whereas the standard algorithm, initialised at θ = (10, 10) for data with ground truth ( √ 10, 1), was trapped 67 times. In practice, we can obviously combine both strategies by only occasionally updating the state variables with the PG sampler to avoid such traps while using more standard and cheaper updates for large propor- tion of the computational time. In simulations not presented here, we also applied our algorithm to the stochastic volatility model presented in [67], [71]. This model is somewhat ‘easier’ than the toy example presented here and the particle MCMC algorithms perform very well in this case. We also compared the performance of the PG sampler with the PMMH sam- pler. In this case we used the prior as a proposal in the SMC with multinomial resampling. The PMMH used a normal random walk proposal with a diagonal covariance matrix. The standard deviation was equal to 0.15 for σV and 0.08 for σW . We present in Figure 4.6 the auto-correlation function for (V,W ) for the PG and PMMH samplers and various number of particles N . Clearly the performance improves as N increases, and while dependent on the test function f considered 51  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100  120  140  160  180  200   500 particles 1000 particles 2000 particles 5000 particles  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100  120  140  160  180  200   500 particles 1000 particles 2000 particles 5000 particles (a) ACF for σV – PG sampler (b) ACF for σW – PG sampler  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100  120  140  160  180  200   500 particles 1000 particles 2000 particles 5000 particles  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100  120  140  160  180  200   500 particles 1000 particles 2000 particles 5000 particles (a) ACF for σV – PMMH sampler (b) ACF for σW – PMMH sampler Figure 4.6: ACF of parameters σV and σW for the PG sampler and the PMMH sampler. (as in Eqn. 3.6), it is possible to choose N in order to optimise the ratio integrated autocorrelation function/particle number N . In this scenario, it appears necessary to use at least 2000 particles to get the ACF to drop sharply, whereas increasing N beyond 5000 does not improve performance. That is for N > 5000 we observe that the ACFs (not presented here) are very similar toN = 5000 and probably very close to that of the corresponding “ideal” MMH algorithm. 52 4.2 Protein Folding One of the most important and challenging scientific problems is simulating molec- ular structures. Protein folding is one of the more well known open problems in structural biology and biophysics. Proteins are the final products of most genes and perform a variety of roles: they serve as structural units, control metabolic pathways, participate in the catalysis or inhibition of various chemical reactions, and serve many other functions. Proteins are chain-like polymers, composed of small subunits (amino acids). There are 20 different types of these amino acids, with a few less common modified versions thereof, which can be loosely grouped into classes based on their chemical properties, such as hydrophobic, hydrophilic (or polar), and charged [8]. The protein folding problem consists of predicting the structure of the protein given its sequence of amino acids. This may include the prediction of the folding pathways to arrive at the “final” shape. The protein structure is dictated by the free energy landscape, lower energy structures being preferred. The native state of a protein is usually at the minimum free energy, and we will consider proteins where this is the case. Exceptions to this can occur when the folding takes place under kinetic control [22]. The function of a protein is primarily determined by its structure, and the sequence of amino acids dictates the structure (Central Dogma of genomics) [22]. There are several different approaches that have been taken to investigate these types of problems, including SMC, CBMC, prune-enriched Rosenbluth method (PERM), recoil-growth (and variations thereof) [40, 61, 74], replica exchange Monte Carlo (parallel tempering) [75], ant colony optimisation [72], and model based search [11]. Next we apply the PMCMC method to this class of problems. We will focus only on the lattice approximations, in particular the HP model, and leave the off- lattice case for future studies. 4.2.1 Lattice Model Instead of modelling the full protein with all the interacting forces, it is often pos- sible to imitate the protein folding using a simple 2-D or 3-D lattice-bead model. 53 → Figure 4.7: Illustration of protein folding in HP model. The left configuration is in a high energy state (U = 0), while on the right the protein sequence is folded into into the lowest energy configuration (U = −9). The positions of the amino acids (now beads) in the protein sequence are thus re- stricted to a lattice and we only need to account for interactions of non-covalent bonds and ensure a self-avoiding walk (SAW), i.e. no two beads can occupy the same position and neighbouring elements in the sequence must be neighbours on the lattice. One of the simplest models used is an Ising model and has only two types of beads: white and black, corresponding to hydrophilic and hydrophobic amino acids, respectively [61]. The energy function for this model is Un(xn) = − ∑ |i−j|>1 c(xi, xj) (4.3) where c(xi, xj) = 1 if xi and xj are non-bonding neighbours and both beads are black (hydrophobic), and c(xi, xj) = 0 otherwise. 4.2.2 Implementation We used PMH with sub-block updating and simulated annealing [68, Chapter 5] to optimise the protein configuration for the lowest energy. The inverse temperature for the annealing was increased linearly from βmin = 0.1 to βmax = 5 with incre- ments 4β, using 50×L iterations at each temperature, where L is the length of the protein sequence. The subblocks were chosen at random positions with sizesB sampled from [2, 20], either uniformly or “reverse-linearly”, that is, with a distribu- tion Prl(B) ∝ 21−B. We also used an adaptive number of particles which scaled 54 linearly with the current block size. This might break detailed balance, however, since this is an optimisation problem, we can afford this approximation and benefit from a more efficient algorithm. To further improve the efficiency of the sampler we implemented the resampling scheme suggested by Fearnhead & Clifford [35]. The method works well for discrete state models, especially when the number of possible states is small. 4.2.3 Results In order to do comparison with other methods, we ran our method on the data sets given in [75] and [76]. The protein sequences and corresponding lowest (known) energies are given in Tables 4.1 and 4.2, for folding in 2D and 3D, respectively. The results are summarised in Table 4.3. The time is given in seconds. kr spec- ifies the scaling factor used for adapting the number of particles to the block size (i.e. N = 1 +kr×B). 4β is the increment of the inverse-temperature for the annealing. “bsd” is the block size distribution, where “u” indicates uniform and “rl” is reverse-linear. We show both the best performance as well as the median performance over 15 runs at the given settings. The performance of our method is currently not as good as some of the competing methods, such as replica ex- change Monte Carlo (REMC) [75], fragment regrowth via energy-guided sequen- tial sampling (FRESS) [76], or specialised versions of PERM (nPERMis [55] and PERMtExp [55, 75]). However, the results are encouraging. We are able to find the lowest known energy for most of the protein sequences in a reasonable amount of time. However, we do struggle with the 3D sequences from the data set of [76]. Ta- bles 4.4 and 4.5 compare our results to the other methods for the data sets from [75] and [76], respectively. Further tuning of the method, plus adding some specialised moves could make this a competitive approach. 55 ID Length Emin Protein Sequence S1-1 20 -9 (HP)2 PH(HP)2 (PH)2 HP(PH)2 S1-2 24 -9 H2 (P2H)7 H S1-3 25 -8 (P2H)2 HP4H2P4H2P4H2 S1-4 36 -14 P3 (H2P2)2 P3H5 (H2P2)2 P2H(HP2)2 S1-5 48 -23 P2HP2 (H2P2)2 P3H10P6 (H2P2)2 HP2H5 S1-6 50 -21 H(HP)4 H4PH(P3H)2 P(P3H)3 PH3 (HP)4 H2 S1-7 60 -36 P2H3PH8P3H9 (HP)2 P2H12P4H4 (H2P)2 HP S1-8 64 -42 H11 (HP)3 P(H2P2)2 HP2 (H2P2)2 HP2 (H2P2)2 (HP)2 H12 S1-9 85 -53 H4P4H12P6H12P3H12P3H12P3HP2 (H2P2)2 HPH S1-10 100 -50 P(P2H2)2 H2P2H(H2P)3 H4P8H6P2H6P9H(PH2)2 H9P2H(H2P)2 HP(PH)2 H2P6H3 S1-11 100 -48 P5 (PH)2 HP5H3PH3 (H2P)2 P(P2H2)2 PH5PH8 (H2P)2 H7P11H7P(PH)2 H2P5 (PH)2 H 2D50 50 -21 H(HP)4 H4PH(P3H)2 P(P3H)3 PH3 (HP)4 H2 2D60 60 -36 P2H3PH8P3H9 (HP)2 P2H12P4H4 (H2P)2 HP 2D64 64 -42 H11 (HP)3 P(H2P2)2 HP2 (H2P2)2 HP2 (H2P2)2 (HP)2 H12 2D85 85 -53 H4P4H12P6H12P3H12P3H12P3HP2 (H2P2)2 HPH 2D100a 100 -48 P5 (PH)2 HP5H3PH3 (H2P)2 P(P2H2)2 PH5PH8 (H2P)2 H7P11H7P(PH)2 H2P5 (PH)2 H 2D100b 100 -50 P(P2H2)2 H2P2H(H2P)3 H4P8H6P2H6P9H(PH2)2 H9P2H(H2P)2 HP(PH)2 H2P6H3 Table 4.1: Benchmark collection of protein sequences for 2D HP model. Hi and Pi indicate strings of i consecutive H’s and P’s and (s)i denotes i repetitions of string s. Emin is the lowest known energy. 56 ID Length Emin Protein Sequence S2-1 48 -32 HPH2P2H(H3P)2 PH2P(PH)2 H(HP)2 (H2P2)2 PHP8H2 S2-2 48 -34 H2 (H2P)2 H5 (P2H)2 (HP2)2 P2 (P2H)2 P3H(P2H2)2 HPH S2-3 48 -34 PH(PH2)2 H4P2 (HP)2 (PH)2 (HP)3 P2HP2 (H2P2)2 (HP)2 PHP S2-4 48 -33 (PH)2 HP(PH)2 H2P2 (H2P)2 P2H5P(PH)2 (HP)3 P (P2H)2 PHP S2-5 48 -32 P2HP2 (PH)2 H3P2H2 (H2P)2 H3P2 (HP)2 (HP2)2 P4 (H2P)2 H S2-6 48 -32 H3P3H(HP)2 (H2P)3 HP7 (HP)2 PHP(P2H)2 H5PH S2-7 48 -32 PHP3 (PH)2 H(HP)2 H2 (H2P)3 P2 (HP)2 P2H(H2P2)3 PH S2-8 48 -31 (PH2)2 HPH(H3P2)2 P3 (PH)2 HP2H(HP)2 P2H(HP)3 H2P3 S2-9 48 -34 P(HP)2 P3 (HP)3 (PH)2 H5P2H2 (HP)2 (PH)2 HP (PH)2 H2P4H S2-10 48 -33 PH2P3 (P3H2)2 (HP)2 (PH)2 H(P2H)2 (P2H2)2 H5P2H2 3D58 58 -44 PHP(H3P)2 PH2PH(PH2)2 (HP)3 H2P2H3P2 (HP)2 P (P2H)3 H(P2H)2 3D64 64 -56 P(H2P)2 H3P2 (HP)2 P(HP)2 PH(H2P)3 P(H2P)2 H3P2 (HP)2 P(HP)2 PH(H2P)3 3D67 67 -56 PHP(H2P)2 HP2H3P3HP(H2P)2 HP2H3P3HP(H2P)2 HP2H3P3HP(H2P)2 HP2H3P 3D88 88 -72 PHP(H2P)2 HP2H2 (P2H)6 HP2 (H3P2)4 HP(H2P)2 H (P2H)3 H(P2H)3 HP2HP 3D103 103 -57 P2H2P3 (P2H2)2 P(HP2)2 P2 (P3H)2 HPH2P4 (P2H)2 P (HP2)2 P3H3P4 (H2P)2 P4H2P4H3 (HP)2 P7H4 (HP2)2 3D124 124 -75 P3H2 (HP)2 P3HP5H2P2 (P2H2)2 (P4H)2 (P2H)2 HP3H(HP)2 H3P4H3P6H2 (P2H)2 P(HP2)2 P3 (P2H)2 H2P(P3H)2 H4P4H(HP)4 H 3D136 136 -83 HP(P4H)2 P(H2P)2 P2 (PH)2 H2P4 (HP)2 H4P9 (P2H)2 P2 (PH)2 HP3H2 (P2H)2 P(HP)2 P4 (P3H)2 H5P (P2H2)2 HP2 (PH2)2 H3P5 (P4H)2 PHP4 Table 4.2: Benchmark collection of protein sequences for 3D HP model. Hi and Pi indicate strings of i consecutive H’s and P’s and (s)i denotes i repetitions of string s. Emin is the lowest known energy. 57 Best performance Median performance Setting Protein Emin E Time #iter E Time #iter kr 4β bsd S1-1 -9 -9 0.0 0 K -9 0.1 0 K 2 .100 u S1-2 -9 -9 0.1 1 K -8 0.2 2 K 1 .100 rl S1-3 -8 -8 0.1 1 K -7 0.3 2 K 1 .100 u S1-4 -14 -14 0.4 1 K -12 0.1 1 K 2 .100 u S1-5 -23 -23 0.4 4 K -20 0.8 6 K 1 .100 rl S1-6 -21 -21 0.4 2 K -17 0.2 1 K 2 .100 rl S1-7 -36 -36 5.8 56 K -35 2.4 24 K 1 .010 rl S1-8 -42 -42 22.5 80 K -38 9.8 58 K 2 .010 u S1-9 -53 -53 242.9 1535 K -51 71.2 725 K 1 .001 u S1-10 -50 -49 182.0 696 K -47 189.1 440 K 3 .001 rl S1-11 -48 -47 4.2 43 K -45 2.5 26 K 1 .010 rl 2D50 -21 -21 0.5 2 K -19 1.0 4 K 2 .100 u 2D60 -36 -36 536.4 1170 K -35 151.9 571 K 3 .001 u 2D64 -42 -42 6.6 72 K -40 8.5 59 K 1 .010 rl 2D85 -53 -53 108.5 366 K -52 77.0 409 K 2 .001 u 2D100a -48 -48 112.3 737 K -46 67.7 664 K 1 .001 u 2D100b -50 -49 52.5 303 K -46 79.1 283 K 2 .001 rl S2-1 -32 -32 45.5 216 K -31 60.1 181 K 2 .001 rl S2-2 -34 -34 7.1 35 K -31 4.9 16 K 2 .010 rl S2-3 -34 -34 8.0 44 K -31 2.6 23 K 1 .010 u S2-4 -33 -33 45.5 240 K -31 18.0 154 K 1 .001 u S2-5 -32 -32 2.1 9 K -29 1.3 4 K 2 .100 rl S2-6 -32 -32 3.7 36 K -30 4.9 46 K 1 .010 rl S2-7 -32 -32 117.0 978 K -31 80.2 454 K 1 .001 rl S2-8 -31 -31 4.8 41 K -29 4.3 24 K 1 .010 rl S2-9 -34 -33 12.8 25 K -31 22.2 40 K 3 .010 u S2-10 -33 -33 362.3 1179 K -33 362.3 1179 K 3 .001 rl 3D58 -44 -43 8.1 24 K -39 2.7 14 K 2 .010 u 3D64 -56 -52 260.0 835 K -51 631.8 1200 K 3 .001 rl 3D67 -56 -50 128.8 1130 K -49 177.8 995 K 1 .001 rl 3D88 -72 -67 607.1 1708 K -64 223.4 1053 K 2 .001 u 3D103 -57 -55 5.9 36 K -48 4.9 46 K 1 .010 u 3D124 -75 -70 150.8 515 K -68 237.2 498 K 3 .001 rl 3D136 -83 -78 66.7 613 K -74 97.8 884 K 1 .001 rl Table 4.3: Protein folding results for HP model using PMH with subblock updating and tempering. The time is given in seconds. 58 PM C M C PE R M t E xp A C O -H PP FP -3 R E M C m Pr ot ei n E m in E T im e E T im e E T im e E T im e S1 -1 -9 -9 0. 03 -9 < 1 -9 < 1 -9 < 1 S1 -2 -9 -9 0. 07 -9 < 1 -9 < 1 -9 < 1 S1 -3 -8 -8 0. 14 -8 2 -8 2 -8 < 1 S1 -4 -1 4 -1 4 0. 43 -1 4 < 1 -1 4 4 -1 4 < 1 S1 -5 -2 3 -2 3 0. 37 -2 3 2 -2 3 60 -2 3 < 1 S1 -6 -2 1 -2 1 0. 38 -2 1 3 -2 1 15 -2 1 < 1 S1 -7 -3 6 -3 6 5. 79 -3 6 4 -3 6 12 00 -3 6 13 S1 -8 -4 2 -4 2 22 .5 0 -4 2 28 08 00 -4 2 54 00 -4 2 6 S1 -9 -5 3 -5 3 24 2. 94 -5 3 60 -5 3 1d ay -5 3 38 S1 -1 0 -5 0 -4 9 18 2. 00 -5 0 12 00 -4 9 43 20 0 -5 0 48 0 S1 -1 1 -4 8 -4 7 4. 23 -4 8 48 0 -4 7 36 00 0 -4 8 72 S2 -1 -3 2 -3 2 45 .5 4 -3 2 6 -3 2 18 00 -3 2 6 S2 -2 -3 4 -3 4 7. 11 -3 4 18 -3 4 25 20 0 -3 4 12 S2 -3 -3 4 -3 4 8. 03 -3 4 6 -3 4 72 00 -3 4 6 S2 -4 -3 3 -3 3 45 .4 7 -3 3 12 0 -3 3 18 00 0 -3 3 6 S2 -5 -3 2 -3 2 2. 08 -3 2 30 -3 2 90 0 -3 2 6 S2 -6 -3 2 -3 2 3. 70 -3 2 6 -3 2 43 20 0 -3 2 6 S2 -7 -3 2 -3 2 11 6. 97 -3 2 30 -3 2 43 20 0 -3 2 18 S2 -8 -3 1 -3 1 4. 85 -3 1 18 -3 1 72 00 -3 1 6 S2 -9 -3 4 -3 3 12 .7 6 -3 4 30 0 -3 4 27 00 0 -3 4 54 S2 -1 0 -3 3 -3 3 36 2. 34 -3 3 0. 6 -3 3 36 00 -3 3 6 Ta bl e 4. 4: C om pa ri so n of pr ot ei n fo ld in g pe rf or m an ce fo rt he da ta se tf ro m [7 5] .T he tim e is gi ve n in se co nd s. 59 PMCMC FRESS nPERMis Protein Emin E Time E Time E Time 2D50 -21 -21 0.53 -21 – – – 2D60 -36 -36 536.41 -36 – -36 – 2D64 -42 -42 6.62 -42 – -42 – 2D85 -53 -53 108.51 -53 – -52 – 2D100a -48 -48 112.34 -48 – -48 – 2D100b -50 -49 52.54 -50 – -50 – 3D58 -44 -43 8.07 -44 5.4 -44 11.4 3D64 -56 -52 259.98 -56 32.4 -56 27 3D67 -56 -50 128.81 -56 84.6 -56 66 3D88 -72 -67 607.10 -72 301.8 -69 3D103 -57 -55 5.94 -57 268.2 -55 187.2 3D124 -75 -70 150.82 -75 – -71 738 3D136 -83 -78 66.65 -83 – -80 6600 Table 4.5: Comparison of protein folding performance for the data set from [76]. The time is given in seconds. 60 4.3 Dirichlet Mixture Model We consider a Dirichlet process (DP) mixture model for the observations Y1:T . We have the following hierarchical model G ∼ DP (α,G0) , Un|G i.i.d.∼ G, Yn|Un ∼ gUn (·) , where DP (α,G0) is a Dirichlet process of base measure G0 and scale parameter α. By integrating out G and using the Polya urn representation of the predictive distribution of Un given U1:n−1, we can equivalently reformulate the model by introducing a vector of cluster labels X1:n which satisfy P (Xn = j|x1:n−1) = { mjn−1/ (n− 1 + α) for j = 1, ..., kn−1 α/ (n− 1 + α) j = kn−1 + 1 (4.4) where kn−1 is the number of clusters in the assignment x1:n−1 and m j n−1 is the number of observations that x1:n−1 assigns to cluster j. The cluster locations are such that θk i.i.d.∼ G0 and Yn| θXn ∼ gθXn (·) . Assuming α is also unknown with a suitable prior pi (α), the posterior of in- terest is given by pi (xT , θ1:kT , α| y1:T ) . We will consider here the case where the base measure G0 and the likelihood gθ are conjugate so that it is possible to com- pute the marginal pi (xT , α| y1:T ) up to a normalising constant. In such situations, several SMC methods have already been proposed to sample from pi (xT | y1:T , α) [32], [61, Section 4.4.2]. We focus here on the following Gaussian mixture model for real-valued observations where θ = ( µ, σ2 ) and G0 ( µ, σ2 ) = IG (σ2; a, b)N (µ; η, τσ2) . 61 For the scale parameter, we use α ∼ Ga (c, d) . (4.5) Using standard calculations [32], we can show that pi (y1:n|x1:n, α) = kn∏ j=1  baΓ ( a+mjn/2 ) Γ (a) √ mjnτ + 1 b+ mjn2 (σ̂jn)2 + ( yjn − η )2 1 +mjnτ   −(a+mjn/2)  where yjn = 1 mjn n∑ k=1 I (xk = j) yk, ( σ̂jn )2 = 1 mjn n∑ k=1 I (xk = j) ( yk − yjn )2 . We cannot sample from pi (α| y1:T ,xT ) directly. However, following [30], we can introduce an auxiliary variable υ ∈ (0, 1) such that α| y1:T ,xT , υ ∼ γυ Ga (c+ kT , d− log υ)+(1− γυ)Ga (c+ kT − 1, d− log υ) with γυ/ (1− γυ) = (c+ kT − 1) / (T (d− log υ)) and υ| y1:T ,xT , α ∼ Be (α+ 1, T ) where Be is the Beta distribution. We first applied our algorithm to the Galaxy dataset [68, p. 426] and used the version of the data in [32] with T = 82. We used a = b = 1, η = 20, τ = 152 [32] and c = 1, d = 0.5. To sample from pi (xT , α| y1:T ), we ran the PG algo- rithm of Section 3.2 with pi (xn|α) = pi (xn| y1:n, α) and the stratified resampling scheme at each iteration. We also used the PMMH algorithm of Section 3.3 with the same sequence pi (xn|α) and using both stratified and dynamic resampling 62 scale parameter number of clusters S E (α| y1:T ) V (α| y1:T ) E (kT | y1:T ) V (kT | y1:T ) PG 9000 0.911 0.267 5.342 2.686 PMMH 9000 0.916 0.273 5.416 2.683 PMH 49000 fixed to α = 1 5.754 1.843 SMC [32] 50,000 fixed to α = 1 5.75 – Gibbs [32] 50,000 fixed to α = 1 5.75 – Gibbs [30] 10,000 ∼1.0 ∼0.2 7.01 3.41 Table 4.6: Results for Galaxy data set. S is the number of samples used in the Monte Carlo estimate. Both PG and PMMH used 100 parti- cles. The Gibbs sampler in Escobar & West [30] used the prior p(α) = Ga(2, 0.25). schemes; resampling being performed when ESS< N/2. The proposal for α is a Gaussian random walk proposal of standard deviation 0.5. We ran the algorithms for 10, 000 iterations and a burn-in of 1, 000. For N = 50, we found that the PG and the PMMH algorithms provided similar results in terms of posterior mean and posterior variance of α and kT ; the results for the galaxy data set are sum- marised in Table 4.6, and the histogram, trace, and ACF for the PG and PMMH samplers are shown in Figure 4.8. Increasing the number of particles did not sig- nificantly improve the performance of the algorithms. For example for N = 50 the average acceptance rate of the PMMH algorithm was 0.62 while for N ≥ 100 it stabilised at 0.71. This suggests that in this scenario N = 100 is sufficient to approximate the marginal MH algorithm. Fearnhead [32] performed inference on the number of clusters using Gibbs sampling and SMC. The reported mean was 5.75 and agrees with our results, however we should note that in [32] the scaling parameter was fixed at α = 1 while in our simulations α was also sampled and had mean 0.91 with variance 2.7. Escobar & West [30] analysed the Galaxy data using a Ga(2, 0.25) as a prior for α and obtained a posterior for the scale parameter with E(α) ≈ 1 and V(α) ≈ 0.2, which is a bit larger and more peaked than what we found. Note that they also learn the prior parameters η and τ and we have used a slightly different prior for α. We also applied our algorithm to a large simulated dataset of 10, 000 real- valued observations generated from the Dirichlet process Gaussian mixture model. 63  0  0.2  0.4  0.6  0.8  1  0  0.5  1  1.5  2  2.5  3  3.5  4  0  0.2  0.4  0.6  0.8  1  0  0.5  1  1.5  2  2.5  3  3.5  4  0  0.5  1  1.5  2  2.5  3  3.5  4  4.5  1000  2000  3000  4000  5000  6000  7000  8000  9000 10000  0  0.5  1  1.5  2  2.5  3  3.5  4  1000  2000  3000  4000  5000  6000  7000  8000  9000 10000 -0.2  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100   10   50  100  500 1000 -0.2  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100   10   50  100  500 1000 Figure 4.8: Histogram (top), trace (middle), and ACF (bottom) of parame- ter α for PG (left) and PMMH (right). The histogram and trace plots were obtained using 100 particles. The PG resampled at every time step, while the PMMH used dynamic resampling. Both used stratified resampling. 64  0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1  10  100  1000 Ac ce pt an ce  R at e Number of Data Points multinomial resampling (always) stratified resampling (always) multinomial resampling (dynamic) stratified resampling (dynamic) Figure 4.9: Average acceptance rate for various versions of the PMH algo- rithm for varying number of data points and N = 100 For α fixed to its true value, α = 1, we compared the performance of the PMH sampler for varying SMC schemes and numbers of data points for N = 100 par- ticles. The results are displayed in Figure 4.9. Even for only N = 100 particles, the acceptance rate of the PMH algorithms based on SMC proposals with dynamic resampling is equal to 0.31 for T = 10, 000 observations. For T = 10, 000 we also implemented the PG sampler to sample from pi (xT , α| y1:T ) based on an SMC proposal using stratified resampling at each it- eration . We ran the algorithm for 10, 000 iterations with 1, 000 iterations for the burn-in. In Figure 4.10, we display the autocorrelation function for the simulated values { α(i) } and the estimate of pi (α| y1:T ). The PMMH algorithms for stratified and dynamic resampling yield similar results. We also compared our method to Gibbs sampling. In order to keep the com- putational complexity the same, we allow the Gibbs sampler to run N full Gibbs updates for the state variables between each parameter update, whereN is the num- ber of particles used in the PMMH and PG samplers (see Algorithm 4.1). For this problem however, we found that particle MCMC does not give any improvement in 65  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100  0  0.2  0.4  0.6  0.8  1  0  0.5  1  1.5  2  2.5  3  3.5  4 Figure 4.10: Auto-correlation function for parameter α (left) and estimate of pi (α| y1:T ) (right). The vertical line corresponds to the true value. performance over Gibbs sampling. As there is only weak dependence between the label assignments Xk and between the parameter α and Xk, the PMCMC method has only limited advantage over Gibbs. Given the greater complexity of PMCMC, the simpler Gibbs sampler would be the preferred method to use on this model. Algorithm 4.1: Gibbs Sampler for Dirichlet Mixture Model. // see Appendix C.1 for details sample α(0) from prior1 initialise state vector X(0)2 For iteration i ≥ 13 sample α(i) ∼ pi(·|X(i− 1))4 set X(i) = X(i− 1)5 for k = 1, .., N do6 for n = 1, .., T do7 sample Xn(i) ∼ pi(Xn(i)|X−n(i),y1:T )8 66 4.4 Markov Jump Processes (Lotka-Volterra) We consider here a discretely observed stochastic kinetic Lotka-Volterra (LV) model [9]; see [49] for a description of such models and their applications in sys- tems biology. For this model it is not possible to compute the prior between two given times; i.e. P (Zt+T |Zt) (integrating out the paths between times t and t+T ), which prevents us from designing proposals with the observations taken into ac- count. We are thus required to propose from the prior. As we shall see, PMCMC allows us to still obtain good performance despite this limitation. The LV model describes the evolution of two species Z1t (prey) and Z 2 t (preda- tor) which are non-negative integer-valued processes. In a small time interval (t, t+ dt], there are three possible transitions for the Markov jump process (MJP) Zt = ( Z1t , Z 2 t ) P ( Z1t+dt = z 1 t + 1, Z 2 t+dt = z 2 t ∣∣ z1t , z2t ) = α z1t dt + o (dt) , P ( Z1t+dt = z 1 t − 1, Z2t+dt = z2t + 1 ∣∣ z1t , z2t ) = β z1t z2t dt + o (dt) , P ( Z1t+dt = z 1 t , Z 2 t+dt = z 2 t − 1 ∣∣ z1t , z2t ) = γ z2t dt + o (dt) , corresponding respectively to prey reproduction, predator reproduction & prey death, and predator death. We assume that we only observe the process Yn =( Y 1n , Y 2 n ) at discrete time points τn which is given by Y in = Z i τn +W i n , (4.6) where W in i.i.d.∼ N (0, σ2I) for i = 1, 2. The observation times are assumed to be evenly spaced at τn = n∆. We are interested here in making inference about the parameters θ = (α, β, γ) which are assumed to be a priori distributed as α ∼ Ga(1, 10), β ∼ Ga(1, 0.25), γ ∼ Ga(1, 7.5) where Ga is the Gamma distribution. For the initial populations we used x0 ∼ U(20, 80), Although it is possible to analyse such models directly using reversible-jump 67 MCMC (RJMCMC) [9], a very popular alternative consists of using a diffusion ap- proximation instead [49, pp. 188-189] which typically leads to ‘simpler’ inference algorithms [48]. For the Lotka-Volterra model, one can obtain a two-dimensional diffusion processXt = ( X1t , X 2 t ) with drift and instantaneous variance-covariance matrix given by [33]( αX1t − βX1tX2t βX1tX 2 t − γX2t ) and ( αX1t + βX 1 tX 2 t βX 1 tX 2 t −βX1tX2t βX1tX2t + γX2t ) . (4.7) Most approaches rely on data augmentation (by discretising time using the Euler-Maruyama approximation) to solve the stochastic differential equations (SDEs). We applied PMCMC to this problem and have developed a Gaussian proposal for the SMC using an unscented Kalman smoother (UKS). However, we only had limited success with this approach and have found that a simpler method works much better (and faster) for this model. Instead of transforming the problem into an SDE, we would prefer to stay in the discrete state space and use the exact model. Unfortunately we cannot evaluate the prior (transition) probability P(Z1(n+1)4, Z 2 (n+1)4|z1n4, z2n4). However, we are able to sample exactly from the prior using Gillespie’s algorithm [46]. And since we are able to evaluate the likelihood, we can use SMC to sample the states, and PMCMC to sample the parameters (α, β, γ). Of course for this to work efficiently, the likelihood must be sufficiently diffuse in order to get enough particles in regions of high probability. We generated T = 50 observations y1:50 by simulating the MJPZt = ( Z1t , Z 2 t ) using Gillespie’s algorithm [46, 49] and (4.6) with parameters α = 2, β = 0.05, and γ = 1.5, ∆ = 0.2, and σ2 = 4; see Figure 4.11. Applying PMMH, the parameters were sampled using a Gaussian random walk proposal (independent for each component). In order to improve mixing, random subsets of the parameters were updated at each iteration. The number of compo- nents n to update was sampled uniformly. The variances used for the random walk proposal are s2α = 0.2 2/n, sβ = 0.0052/n, and sγ = 0.152/n. The acceptance rate for the sampler was 36%, with an average ESS/N of 17.1% and resample rate of 95%. The simulation took 4hrs (17412.8 seconds) using a sin- 68 -20  0  20  40  60  80  100  120  140  0  1  2  3  4  5  6  7  8  9  10 prey predator Figure 4.11: Synthetic data generated from the Lotka-Volterra model using Gillespie’s algorithm. The prey (Z1t ) and predator (Z 2 t ) are shown in green and red, respectively. The squares and circles indicate the obser- vations Y 1n and Y 2 n , respectively. gle core of an Intel Core2 Duo 6300 @ 1.86GHz. Figure 4.12 shows the histogram and trace plots of the parameters. The algorithm mixes well, and the posterior is centred around the true value. The auto-correlation functions of the parameters are shown in Figure 4.13. PMMH also gives posterior estimates of the hidden states at the observation times, as shown in Figure 4.14. Boys et al. [9] perform exact inference (i.e. without SDE approximation) on this type of model using RJMCMC and a block updating (BU) method. They use a different observation model, where the prey is observed without error, but the predator is unobserved. In their BU method, the latent process (reaction events) is updated in blocks consisting of pairs of intervals between observations, keeping the endpoints (i.e. population sizes at the beginning and end of the block) fixed and using a random walk proposal on the number of reactions. The parameters are up- dated conditional on sampled events. The authors note that, while RJMCMC mixes much worse than BU and requires significantly more CPU time, both “algorithms 69 Fi gu re 4. 12 :H is to gr am an d tr ac e pl ot s of th e sa m pl ed pa ra m et er s. 70 -0.2  0  0.2  0.4  0.6  0.8  1  0  20  40  60  80  100  120  140  160  180  200 α β γ Figure 4.13: ACFs of the sampled parameters. suffered significant mixing problems” when the predators where unobserved, and due to CPU constraints no satisfactory results were obtained using RJMCMC. As evident from the trace plots in Figure 4.12 PMCMC does not suffer from this (al- though note that a direct comparison should be taken with a grain of salt due to the difference in the observation model). 71  0  10  20  30  40  50  60  70  80  0  0.5  1  1.5  2  2.5  3  3.5  4  4.5  5  0  10  20  30  40  50  60  70  80  90  0  0.5  1  1.5  2  2.5  3  3.5  4  4.5  5 Figure 4.14: Histograms of prey (top) and predator (bottom) states. The ground truth trajectory and observations are shown in green and blue, respectively. 72 4.5 Lévy-Driven Stochastic Volatility Model We now consider another model in which the prior cannot be evaluated, but we can sample from it. This model has the additional challenge of complex dependence between the parameters. The logarithm of an asset price y∗ (t) is assumed to be determined by the fol- lowing stochastic differential equation dy∗ (t) = ( µ+ βσ2 (t) dt ) + σ (t) dB (t) where µ is the drift parameter, β the risk premium andB (t) is a Brownian motion. The instantaneous latent variance/volatility σ2 (t) is assumed to be stationary and independent from B (t). It is modelled by the following Lévy-driven Ornstein- Uhlenbeck process proposed in [4] dσ2 (t) = −λσ2 (t) dt+ dz (λt) (4.8) where λ > 0 and z (t) is a purely non-Gaussian Lévy process with positive incre- ments with z (0) = 0. We define the integrated volatility σ2∗ (t) = ∫ t 0 σ2 (u) du which satisfies from (4.8) σ2∗ (t) = λ−1 { z (λt)− σ2 (t) + σ2 (0)} . Let ∆ denote the length of time between two periods of interests, then the incre- ments of the integrated volatility satisfy σ2n = σ 2∗ (n∆)− σ2∗ ((n− 1) ∆) = λ−1 { z (λn∆)− σ2 (n∆)− z (λ (n− 1) ∆) + σ2 ((n− 1) ∆)} 73 where ( σ2 (n∆) z (λn∆) ) = ( exp (−λ∆)σ2 ((n− 1) ∆) z (λ (n− 1) ∆) ) + ηn with ηn d= ( exp (−λ∆) ∫ ∆0 exp (λu) dz (λu)∫ ∆ 0 dz (λu) ) . (4.9) By aggregating returns over a time interval of length ∆, we have yn = ∫ n∆ (n−1)∆ dy∗ (t) = y∗ (n∆)− y∗ ((n− 1) ∆) thus, conditional on the volatility, we have yn ∼ N ( µ∆ + βσ2n, σ 2 n ) . Many publications have restricted themselves to the case where σ2 (t) follows marginally a Gamma distribution in which cases the stochastic integrals appear- ing in 4.9 can be represented by a finite number of random variables. In this case sophisticated MCMC schemes were developed so as to perform Bayesian infer- ence on λ and the parameters of the Gamma [42, 52, 69]. However, as outlined in [44] the use of Gamma models seems to have been mostly motivated by com- putational tractability. We address here the case where σ2(t) follows a tempered stable marginal distribution TS(κ, δ, γ) which includes the class of inverse Gaus- sian distributions for κ = 12 ; this class of models has been successfully used to fit the returns from exchange rate time series [5]. The tempered stable distribution does not admit a closed-form expression, but it is shown in [5] that σ2 (0) d= ∞∑ i=1 {( aiκ A0 )−1/κ ∧ eiv1/κi } (4.10) where {ai}, {ei}, {vi} are independent of one another. The {ei} are i.i.d. expo- nential with mean 1/B, the {vi} are standard uniforms whereas a1 < a2 < . . . are 74 arrival times of a Poisson process of intensity 1 and A0 = δ2κ κ Γ (1− κ) , B = 1 2 γ1/κ. It is also established that to ensure a TS(κ, δ, γ) marginal for σ2(t) then z (t) has to be the sum of an infinite activity Lévy process and of a compound Poisson process such that ηn d= ∞∑ i=1 ( exp (−λ∆ri) 1 ){( aiκ Aλ∆ )−1/κ∧ eiv1/κi }+N(λ∆)∑ i=1 ( exp (−λ∆r∗i ) 1 ) ci (4.11) where A = δ2κ κ2 Γ (1− κ) , B = 1 2 γ1/κ. In (4.11), {ai} , {ei}, {ri}, {r∗i }, {vi} are independent of one another. The {ai} , {ei}, {vi} follow the same distributions as in (4.10), the {ci} are i.i.d. Ga (1− κ,B), and {ri}, {r∗i } are standard uniforms. Finally N (λ∆) is a Pois- son random variable of mean λ∆δγκ. It was shown experimentally in [5] that the infinite series appearing in (4.10)-(4.11) are dominated by the first few terms, “although as κ goes to one this becomes less sharp”. Performing inference in this context is extremely challenging as, although it is possible to sample approximately from the prior, it is not possible to evaluate it pointwise. In [44] the authors have proposed an MCMC algorithm for Bayesian inference. This requires sampling the volatility process { σ2n } ; this is achieved by updating {ηn} one at a time with an independent MH sampler using the prior as proposal (and relying on an alternative infinite series expansion of ηn). In [21] an SMC algorithm was proposed to estimate { σ2n } when the parameters of the models are fixed: it uses the prior as a proposal as any alternative proposal would require a pointwise evaluation of the prior. We applied PMCMC to this model in order to perform Bayesian inference on the parameters {κ, δ, γ, λ} and we set here µ = β = 0 as in [21]. First we ran PMMH on a synthetic data set and then used the daily Standard & Poor’s 500 (S&P 500) closing prices from January 12, 2002 to December 30, 2005. 75 4.5.1 Synthetic Data We generated a synthetic data set with T = 400 observations using parameters θtrue = (κ, δ, γ, λ) = ( 0.5, √ 2, √ 8, 0.1 ) . The MH step in the PMMH used a Gaussian random walk proposal using the following covariance matrix ΣRW, which was found by running the algorithm for some time using a diagonal covariance matrix and then computing the sample co- variance: ΣRW =  0.00834501 −0.0602253 0.0346668 −0.00188407 −0.0602253 0.694237 0.0921531 0.000170793 0.0346668 0.0921531 1.82894 −0.121654 −0.00188407 0.000170793 −0.121654 0.199951  We found that the parameters κ and δ were quite correlated. To improve mixing, we alternated between the following two proposals: q1(θ, θ∗) = N(θ∗; θ, 1.5 2 4 Σ RW) q2(θ1:2, θ∗1:2) = N(θ∗1:2; θ1:2, 1.52 2 Σ RW 1:2 ), with θ ∗ 3:4 = θ3:4 The simulation is quite sensitive to κ, so we use a fairly informative prior. This was also observed in [44]. We used a Beta prior for κ and Gamma priors for the other three parameters: pi(κ) = Be(10, 10) pi(δ) = Ga(1, 7) pi(γ) = Ga(1, 14) pi(λ) = Ga(1, 0.5) We ran simulations using using 50, 100, and 200 particles. The autocorrelation of the parameters is shown in Figure 4.15), the histograms and scatter plots of the parameters (using 200 particles) are given in Figure 4.16, and the trace is plots are shown in Figure 4.17. The posterior for κ is almost the same as the prior, suggest- ing that the prior is either too informative, or that there is very little information about κ in the data. We also tried a broader prior and found that the posterior is also similar to the prior, but the simulation does not mix as well, in particular when κ gets close to 1. The acceptance rate for the PMMH was 10%. We used dynamic 76 00.2 0.4 0.6 0.8 1 N=  50 N=100 N=200 0 0.2 0.4 0.6 0.8 1 0 200 400 600 800 10000 200 400 600 800 1000 κ δ γ λ Figure 4.15: Lévy-driven SV model using synthetic data: Comparison of auto-correlation functions for different number of particles. resampling in the SMC proposal and resampled on average at 12% of the time steps. 4.5.2 Standard & Poor’s 500 The Standard & Poor’s 500 (S&P 500) data set has 1000 observations, which are the daily increments in the log-prices from 12/01/2002 to 30/12/2005, and rescaled to have unit variance (the mean is 0.0079). The data is shown in Figure 4.18. Next we show simulations results using two different priors; the first one being a bit more informative than the second. Prior 1 The MH step in the PMMH used a Gaussian random walk proposal using the fol- lowing covariance matrix ΣRW, which again was generated by running the algo- rithm for some time using a diagonal covariance matrix and then computing the 77 0.3 0.4 0.5 0.6 0.7 0.8 2 4 6 8 10 12 2 4 6 8 10 12 0 0.5 1 1.5 2 2.5 κ δ γ λ Figure 4.16: Lévy-driven SV model using synthetic data: Histogram and 2D scatter plots of sampled parameter values. The prior is shown as dashed curve and the true value is shown by the vertical dashed line: θtrue = (κ, δ, γ, λ) = ( 0.5, √ 2, √ 8, 0.1 ) . The data set has 400 obser- vation and the simulation used 200 particles. sample covariance: ΣRW =  0.00133842 −0.104184 9.80768e−05 1.24963e−05 −0.104184 16.3029 0.0881985 −0.000947038 9.80768e−05 0.0881985 0.00670481 7.94049e−06 1.24963e−05 −0.000947038 7.94049e−06 2.25697e−06  The same proposal was used as before (i.e. alternating between updating first two parameters, κ and δ, and updating all parameters jointly). The prior was chosen quite informative as follows: pi(κ) = Be(4, 36) pi(δ) = Ga(1, 7) pi(γ) = Ga(1, 14) pi(λ) = Ga(1, 0.5) 78 0.3 0.4 0.5 0.6 0.7 0.8 κ δ γ λ 2 4 6 8 10 12 2 4 6 8 10 12 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 0 0.5 1 1.5 2 2.5 Figure 4.17: Lévy-driven SV model with synthetic data: Trace plots. where Be(α, β) is the Beta distribution, and Ga(k, θ) is the Gamma distribution with shape k and scale θ. The histograms and scatter plots of the parameters are shown in Figure 4.19. As evident from the trace and auto-correlation plots given in Figure 4.21 and 4.20, the Markov chain mixes quite well given the complex structure of the posterior. The SMC proposal performs very well with a resample rate of only 4% and average ESS of about 0.75N , where N = 500 is the number of particles used. The main difficulty lies is proposing good candidates for the parameters. The acceptance rate was 21%. Prior 2 For this simulation we broadened the priors1 as follows: pi(κ) = Be(1, 15) pi(δ) = Ga(1, 20) pi(γ) = Ga(1, 10) pi(λ) = Ga(1, 1) 1The prior for γ was tightened, but this change had negligible effect on the posterior 79  750  800  850  900  950 1000 1050 1100 1150 1200 1250 1300 Cl os in g Pr ice   -4   -3   -2   -1    0    1    2    3    4    5    6 0 100 200 300 400 500 600 700 800 900 1000 Lo g Re tu rn Figure 4.18: S&P 500 data set for Lévy-driven SV. Top: daily prices from from 12/01/2002 to 30/12/2005. Bottom: difference of log of prices, scaled to have unit variance. The same covariance matrix as before was used for the random walk proposal, and we again alternated between updating the first two components and updating all parameters jointly. The acceptance rate was 23% with an average resampling rate of 4% and ESS=0.75N (again using N = 500 particles). The histograms and scatter plots of the samples are given in Figure 4.22. Figures 4.23 and 4.24 show the autocorrelation and trace, respectively. Discussion As expected from the simulations using the synthetic data, the simulation using the more informative prior 1 mixes much better than using prior 2. In both cases, we get similar results for the posterior of the parameters (see Table 4.5.2 for details). With this application we showcased a model in which we cannot evaluate the prior, but can only sample from it. This forced us to propose from the prior, which 80 0.12 0.24 0.36 5 1015202530354045 0.6 0.8 1 1.2 1.4 1.6 0.0057 0.0114 0.0171 κ δ γ λ Figure 4.19: Lévy-driven SVs model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Histogram and 2D scatter plots of sam- pled parameter values. The prior is shown as the dashed curve. The data set has 1000 observation and the simulation used 500 particles. can be very inefficient, especially when the observations are informative. However, using PMCMC we are still able to achieve good performance. 81 -0.2  0  0.2  0.4  0.6  0.8  1  0  50  100  150  200  250  300  350  400  450  500 κ δ γ λ Figure 4.20: Lévy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: autocorrelation function of sampled param- eter values. Figure 4.21: Lévy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Trace plots of sampled parameter values. 82 0.1 0.2 0.3 0.4 0.5 20 40 60 80 0.6 0.8 1 1.2 1.4 1.6 0.0067 0.0134 0.0201 κ δ γ λ Figure 4.22: Lévy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Histogram and 2D scatter plots of sam- pled parameter values. The priors are shown as dashed curves and are pi(κ) = Be(1, 15), pi(δ) = Ga(1, 20), pi(γ) = Ga(1, 10), pi(λ) = Ga(1, 1) 83 -0.2  0  0.2  0.4  0.6  0.8  1  0  50  100  150  200  250  300  350  400  450  500 κ δ γ λ Figure 4.23: Lévy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: autocorrelation function of sampled param- eter values. Figure 4.24: Lévy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Trace plots of sampled parameter values. 84 Prior 1 parameter κ δ γ λ prior Be(4, 36) Ga(1, 7) Ga(1, 14) Ga(1, 0.5) mean 0.1555 8.939 1.163 0.008472 std 0.0524 5.034 0.126 0.002055 10th percentile 0.0923 3.877 1.027 0.006461 median 0.1502 7.883 1.151 0.007924 90th percentile 0.2260 15.254 1.322 0.011302 Prior 2 parameter κ δ γ λ prior Be(1, 15) Ga(1, 20) Ga(1, 10) Ga(1, 1) mean 0.1452 14.004 1.155 0.008661 std 0.0794 11.523 0.134 0.002337 10th percentile 0.0555 3.630 1.030 0.006501 median 0.1301 10.443 1.136 0.008071 90th percentile 0.2555 30.212 1.326 0.011558 Table 4.7: Posterior statistics of parameters for S&P 500 data set. 85 4.6 Tempering Following the idea outlined in [65, Sec. 4.2.2] we use particle Metropolis Hastings to sample from a multimodal target distribution. By using a sequence of tempered distributions we sample a chain of variables such that the marginal of the last vari- able in the chain is our target distribution. The sequence of distributions is built such that each two consecutive distributions are similar. An alternative method is Tempered Transitions by Neal [66], where the first distribution in the chain is the target distribution, and the following distributions are progressively easier to sample from (i.e. tempered) until the centre of the chain, at which point the distri- butions start approaching the target distribution, with the last distribution being the target. The method is shown in Algorithm 15. Note that for this case it is actually preferable to use SMC, as in [65], instead of PMH. In fact the only potential benefit of using PMH is its iterative nature, which allows us run it for an unspecified amount of time, e.g. until some time limit or a sufficient number of samples have been generated. However, this can also be accomplished using the sequentially interacting Markov chain Monte Carlo (SIMCMC) sampler [10], though we have not yet made any comparison with this method. Algorithm 4.2: Tempered Transitions Initialise X(s = 0)1 At iteration s ≥ 12 Set X∗0 = X(s− 1)3 For n = 1, . . . , 2p: Sample X∗n ∼ Tn(·|X∗n−1)4 With probability:5 1 ∧ 2p∏ j=0 pin+1(Xn) pin(Xn) (4.12) set X(s) = X∗2p, otherwise set X(s) = X(s− 1) The target density is pi(·) = pi0(·) and the sequence of distributions pi0, . . . , pi2p with pip−n = pip+n, such that for n ≤ p, pin(·) is easier to sample from than pin−1. We also define a sequence of transition kernels T1, . . . , T2p, where Tn(X, ·) has pim(·) as an invariant distribution, with m = n for n ≤ p and m = 2p− n+ 1 for 86 n > p. The following reversibility condition must hold: pin(x)Tn(x, x′) = pin(x′)T2p−n+1(x′, x), for n < p (4.13) Note that we may choose Tn to be the same as T2p−n+1. Tempered Sampling with PMH and CBMC For the PMH and CBMC samplers we only need the second half of the sequence defined above. In this context we label them p̄i1, . . . , p̄ip, with p̄ip = pi0. We also choose different transition kernels more suitable for the particle method. This gives rise to joint distributions as given in [65] using the artificial backwards transitions Ln(xn+1, xn): p̃in(x1:n) = γ̃n(x1:n)/Zn (4.14) where γ̃n(x1:n) = γn(xn) n−1∏ j=1 Ln(xn+1, xn) (4.15) and γj(xj) ∝ p̄ij(xj). The transition kernels Tn(xn|xn−1) have p̄in as invariant distribution. Using the second half this sequence of target distributions we can sample “chains” of variables from the joint target distribution and obtain the marginal by simply discarding all but the last variable in the chain. Mixture of Gaussians We used the above algorithm to sample from a highly multimodal distribution. The example was taken from [66]. The probability distribution function (PDF) was a mixture of over 4000 Gaussians in 2-D with means separated by many times their widths. Despite being low dimensional, this is a difficult distribution to sample from, using MCMC and pretending we only have access to pin(x, y) (Eqn. 4.16) pointwise and up to an unknown normalising constant, and serves well to demon- 87 strate tempering using PMH. pi0(x, y) = ∑5 i=−5 ∑5 j=−5 exp ( − |(x,y)−µ1,i,j |2 2σ2 ) ) + ∑5 i=−5 ∑5 j=−5 exp ( − |(x,y)−µ2,i,j |2 2σ2 ) ) + ∑22 i=−22 ∑22 j=−22 exp ( − |(x,y)−µ3,i,j |2 2σ2 ) ) + ∑22 i=−22 ∑22 j=−22 exp ( − |(x,y)−µ4,i,j |2 2σ2 ) ) (4.16) with σ = 0.001 and µ1,i,j = (0.0025i+ 15, 0.0025j + 15) µ2,i,j = (0.1500i− 15, 0.1500j + 15) µ3,i,j = (0.0025i− 15, 0.0025j − 15) µ4,i,j = (0.1500i+ 15, 0.1500j − 15) (4.17) We then define a tempering schedule by choosing a sequence β1, . . . , βp with βp = 1 and 0<βn<βn+1≤1. The distributions are then p̄in = piβn0 . The performance of the PMH sampler for this example was compared with Neal’s tempered transitions (NTT) and CBMC, the results are given in Table 4.8. Each algorithm was run 3 times for 8 hours (the results shown are averages). For each algorithm we chose 5 settings (number of particles and temperatures) that gave the best performance in terms of mean and standard deviation of the target distribution. PMH gives comparable performance as both CBMC and tempered transitions. Figure 4.25 shows the trace of the last 3000 samples for a PMH run using 5 particles and 20 temperatures. As we can see, there is a tradeoff between the number of particles N , the number of distributions p, and the number of samples to generate, given a fixed computation budget. In this model the best setting, among the ones tried, was to use only a small number of particles (N = 5), a medium number of tempered distributions (p = 20), and opt for a larger number of samples. Unfortunately we have no theoretical result to tell us this “sweet” spot in advance and one should perform a few short simulations at various settings to determine a “good” setting using standard MCMC and SMC diagnostics to measure performance. 88 A lg . sa m pl es N p µ x va r( x ) µ y va r( y ) er r( µ ) er r( va r) A cc R PM H 78 40 0 5 20 -0 .0 87 4 22 6. 6 -1 3. 31 49 .1 6 0. 08 74 0. 57 9 12 23 PM H 14 20 0 5 10 0 -0 .0 91 8 22 6. 8 -1 3. 21 51 .7 5 0. 13 4 2. 07 0 33 1 PM H 87 60 0 10 10 0. 31 0 22 6. 8 -1 3. 21 52 .0 0 0. 32 6 2. 31 9 14 75 PM H 42 60 0 20 10 0. 40 7 22 6. 7 -1 3. 31 49 .6 8 0. 40 7 0. 10 2 24 76 PM H 38 00 0 10 20 -0 .1 60 22 5. 7 -1 3. 11 54 .9 3 0. 25 7 5. 36 2 22 36 C B M C 29 32 20 50 0. 09 48 22 7. 4 -1 3. 33 48 .4 1 0. 09 75 1. 39 3 44 - C B M C 61 77 5 10 0 0. 11 9 22 6. 4 -1 3. 33 47 .8 3 0. 12 1 1. 88 4 37 - C B M C 16 02 1 10 20 0. 02 03 22 6. 1 -1 3. 43 47 .7 2 0. 11 9 2. 09 3 31 - C B M C 78 36 20 20 0. 01 58 22 6. 6 -1 3. 37 47 .0 8 0. 06 49 2. 60 3 37 - C B M C 33 24 1 5 20 -0 .0 94 3 22 5. 5 -1 3. 14 54 .3 4 0. 19 2 4. 82 3 20 - N ea l 27 68 9 1 10 0 -0 .1 21 22 7. 0 -1 3. 27 51 .2 2 0. 12 6 1. 55 9 16 - N ea l 14 20 6 1 20 0 0. 22 7 22 8. 7 -1 3. 29 50 .5 6 0. 22 8 2. 07 5 27 - N ea l 53 00 0 1 50 -0 .6 06 22 5. 4 -1 3. 45 43 .3 3 0. 62 1 6. 50 9 64 - N ea l 14 55 00 1 20 1. 65 0 22 1. 4 -1 3. 58 43 .6 6 1. 67 2 8. 09 1 0. 72 - N ea l 30 00 00 1 10 0. 33 7 21 6. 5 -1 2. 27 72 .3 0 1. 08 9 24 .8 6 0. 07 - Ta bl e 4. 8: Si m ul at io n re su lts fo ra m ix tu re of G au ss ia ns w ith 42 92 m od es .E ac h al go ri th m w as ru n fo rt he sa m e am ou nt of tim e (3 tim es 8 hr s) an d us ed 10 M H st ep s at ea ch te m pe ra tu re .N is th e nu m be ro fp ar tic le s an d p is th e nu m be r of di st ri bu tio ns (t em pe ra tu re s) us ed .A cc is th e av er ag e ac ce pt an ce ra te an d R is th e re sa m pl in g ra te fo rP M H . 89 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0 0.2 0.4 0.6 0.8 1 1.2 -20 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 20 y x algorithm  = PMH #samples   = 26201 #particles = 5 #modes     = 4293 nMC  = 10 numT = 20 maxT = 2.68435e+08 Figure 4.25: Trace and marginal posterior distributions for mixture of Gaus- sians distribution using PMH with 5 particles and numT=20 tempera- tures (length of sequence of distributions), a maximum temperature of 228=2.7×108, and 10 Metropolis-Hastings steps within each temper- ature. The large (bottom-left) frame shows the trace of the last 1000 iterations (lines and small blue circles) and the locations of the 4292 modes of the target distribution (larger red circles). The top and the right frames show the histograms of the marginals using 26201 sam- ples. Density Estimation using Finite Mixture of Gaussians Given data y1, . . . , yc, which are independent and identically distributed, we want to infer the distribution from which they arise. We model the distribution as a finite mixture of Normal distributions as in [65]: yi|θr ∼ r∑ j=1 ωj N (µj , λ−1j ) (4.18) where θr=(µ1:r, λ1:r, ω1:r), 2≤r<∞ and the number of Gaussians, r, is known. The weights ωj must sum to one. The priors are exchangeable for each mixture 90 component: µj ∼ N (ξ, κ−1) (4.19) λj ∼ Ga(ν, χ), ν = shape, χ = scale (4.20) ω1:r−1 ∼ D(ρ) (4.21) The parameters ξ and κ are set to the midpoint of the data range R and a small multiple ι of 1/R2: ξ = min(yi) + max(yi) 2 κ = ι [max(yi)−min(yi)]2 (4.22) The parameters for the Gamma distribution are set as follows: (2 =)ν > 1 > g = 0.2 h = 10 [max(yi)−min(yi)]−2 χ = h/g The Dirichlet prior has parameter ρ = (1, . . . , 1). The log prior is given by: log f(θr) = r∑ j=1 log {N (µj |ξ, κ−1) Ga(λj |ν, χ) }+ logD(ω1:r|ρ) We define a sequence of distributions which starts with the prior and ends with the full posterior: pip(θr) ∝ l(y1:c|θr)φp f(θr), p = 1, . . . , n (4.23) with 0 ≤ φ1 < . . . < φn = 1 We have compared the performance of PMH with NTT and with CBMC. The data set was taken from [65] and is shown in Figure 4.26. The acceptance rates for the three different algorithms are presented in Figure 4.27. The algorithm set- tings are chosen such that all have comparable computational complexity. PMH achieves at least double the acceptance rate over the other methods, using 20 par- ticles and a sequence of 50 distributions (temperatures). The tempered transitions 91  0  0.1  0.2  0.3  0.4  0.5  0.6 -10 -5  0  5  10 D en si ty µ1 Posterior for µ1 Input data (for likelihood) -4 -2  0  2  4  6  8  0  1000  2000  3000  4000  5000  6000  7000 µ 1 Iteration Figure 4.26: Data set for 4-component mixture of Gaussians (green) and pos- terior probability for parameter µ1 (red) using PMH with p = 50 and 20 particles. The temperatures (1/βn) schedule was linear with a max- imum temperature of 50. Bottom: trace of µ1. algorithm requires a much finer temperature scale with a sequence of 1000 distri- butions. CBMC fails to achieve acceptance rates of more than a few percent. 92  0  0.05  0.1  0.15  0.2  0.25  0.3  0.35 CBM C  N=30  p=30 CBM C  N=50  p=20 CBM C  N=20  p=50 PM H    N =50  p=20 N eal  p=1000 PM H    N =30  p=30 PM H    N =20  p=50 Acceptance rate: quartiles Figure 4.27: Acceptance rates for finite mixture model using PMH, CBMC, or NTT 93 Chapter 5 Conclusion We have presented a new class of MCMC algorithms which rely on proposal dis- tributions built using SMC methods. One of the major advantages of this approach is that it “automatically” builds efficient high-dimensional proposal distributions whilst requiring the practitioner to design only low-dimensional proposal distri- butions. It offers the possibility to simultaneously update large vectors of depen- dent random variables. This strategy is computationally expensive but to some extent unavoidable and useful in complex scenarios for which standard proposals are likely to fail. Indeed, as demonstrated in simulation, they can allow algorithms to better explore the posterior distribution in multimodal cases for example. In ad- dition, in practice these algorithms can be combined with “standard” local moves which are computationally cheaper. It is difficult to specify in advance when PMCMC will yield better performance than some alternative, possibly simpler, methods. However, we can take some lessons from the applications presented in this thesis and provide a few (rough) guidelines, as shown in Table 5.1. We only list the most commonly used algo- rithms, namely Gibbs, MH, and SMC, as well as the the methods introduced in this thesis. As a rule of thumb, the simpler methods should be tried first (unless it is clear that they will be inefficient). If they do not perform well, then implement- ing a PMCMC algorithm requires only little extra work, as it is based on MCMC and SMC, so part of the work would have already been done. For models in which the prior cannot be evaluated, but only sampled from (e.g. 94 x low dimensional x high dimensional dependence weak strong weak strong in x G or MH MH G or SMC SMC or PMH between θ and x G or MH MH PG PMMH Table 5.1: These are rough guidelines for algorithm choice. θ represents the parameters and x the states. G stands for Gibbs. PMH would be used for subblock updating. MJP or Lévy-driven SV models with certain marginals), PMCMC may be the only feasible method. We believe that many problems where SMC methods have already been suc- cessfully used could benefit from the particle MCMC methodology. These in- clude contingency tables [15], graphical models [53], changepoint models [36], population dynamic models in ecology [12], volatility models in financial econo- metrics [16, 25], partially observed diffusions [37], population genetics [24],[61, Section 4.1.2], systems biology [48, 49], and experimental design problems [59]. The CBMC method, to which our approach is related, is a very popular method in computational chemistry and physics which has been widely used for molecular and polymer simulation [40], and particle MCMC algorithms might also prove a useful alternative in these areas. In this thesis we have addressed some of these top- ics to demonstrate the use of PMCMC. We are already aware of recent successful applications of PMCMC methods in econometrics [38] and statistics [7]. There are also numerous possible extensions to the PMCMC framework. Some have already been added, such as stratified sampling. Other resampling techniques could potentially also be added, for example antithetic sampling. In practice, the performance of the particle MCMC are closely related to the variance of the SMC estimates of the normalising constants. The SMC literature has many more tech- niques that could be used within PMCMC. For example we might investigate whether the SMC smoothing techniques in [47, 58] could be used to design bet- ter proposal distributions. We can also add adaptive strategies to set the algorithm parameters in PMCMC. It would nice to automatically adjust the number of parti- cles used in the proposal to ensure a reasonable acceptance rate. In order to speed up the computation, a parallel implementation of the SMC component would also 95 be very useful. For example [63] and [54] have already demonstrated a parallel particle filter running on the GPU of a graphics card. From a theoretical point of view, it is possible to study how “close” the particle MCMC algorithms are from the ideal MCMC algorithms they are approximating (corresponding to N →∞) using the techniques developed in [1]. 96 Bibliography [1] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient computation. Annals Statist., 2007. URL http://www.maths.bris.ac.uk/∼maxca/preprints/andrieu roberts 2006.pdf. in press. → pages 27, 44, 96, 107 [2] C. Andrieu, J. De Freitas, and A. Doucet. Sequential MCMC for bayesian model selection. In Proceedings IEEE Workshop Higher Order Statistics, pages 130–134, 1999. → pages 20 [3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo. Submitted to Journal of the Royal Statistical Society, November 2007. → pages 1 [4] O. Barndorff-Nielsen and N. Shephard. Non-Gaussian Orstein-Uhlenbeck-based models and some of their uses in financial economics. J. Royal Statist. Soc., 63:167–241, 2001. → pages 73 [5] O. Barndorff-Nielsen and N. Shephard. Normal modified stable processes. Theory Proba. Math. Statist., 65:1–19, 2001. → pages 74, 75 [6] M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164:1139–1160, 2003. → pages 44 [7] M. Belmonte, O. Papaspiliopoulos, and M. Pitt. Particle filter estimation of duration-type models. Technical report, Department of Statistics, Warwick University, 2008. → pages 95 [8] P. Bourne and H. Weissig. Structural Bioinformatics. John Wiley & Sons Publishing, New York, 2003. → pages 53 [9] R. J. Boys, D. Wilkinson, and T. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 18 97 (2):125–135, 2008. ISSN 0960-3174. doi:10.1007/s11222-007-9043-x. → pages 67, 68, 69 [10] A. Brockwell, P. D. Moral, and A. Doucet. Sequentially interacting Markov chain Monte Carlo. unpublished, 2008. → pages 86 [11] T. Brunette and O. Brock. Improving protein structure prediction with model-based search. Bioinformatics, 21:i66–i74, 2005. doi:10.1093/bioinformatics/bti1029. → pages 53 [12] S. Buckland, K. Newman, C. Fernández, L. Thomas, and J. Harwood. Embedding population dynamic models in inference. Statist. Sci., 22:44–5, 2007. → pages 2, 15, 95 [13] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer-Verlag, New York, 2005. → pages 15 [14] C. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81:541–553, 1994. → pages 45 [15] Y. Chen, P. Diaconis, S. Holmes, and J. Liu. Sequential Monte Carlo methods for statistical analysis of tables. J. Amer. Statist. Assoc., 100: 109–120, 2004. → pages 2, 3, 95 [16] S. Chib, M. K. Pitt, and N. Shephard. Likelihood based inference for diffusion driven state space models. unpublished, pages 1–33, 2006. URL http://www.olin.wustl.edu/faculty/chib/absdenew.htm. → pages 2, 95 [17] N. Chopin. A sequential particle filter method for static models. Biometrika, 89:539–552, 2002. → pages 3, 42 [18] N. Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Annals Statist., 32:2385–2411, 2004. → pages 12 [19] N. Combe, T. J. H. Vlugt, P. R. T. Wolde, and D. Frenkel. Dynamic pruned-enriched Rosenbluth method. Molecular Physics, 101:1675–1682, 2003. → pages 15, 44 [20] S. Consta, N. B. Wilding, D. Frenkel, and Z. Alexandrowicz. Recoil growth: An efficient simulation method for multi-polymer systems. Journal of Chemical Physics, 110:3220–3228, 1999. → pages 15 98 [21] D. Creal. Analysis of filtering and smoothing algorithms for Lévy-driven stochastic volatility models. Comp. Statist. Data Analy., 52:2863–2876, 2008. → pages 75 [22] T. Creighton. Protein Folding, pages 1–55. W.H. Freeman and Company, New York, USA, 1992. → pages 53 [23] D. Crisan and T. Lyons. Minimal entropy approximations and optimal algorithms for the filtering problem. Monte Carlo Meth. Appli., 8:343–356, 2002. → pages 11 [24] M. De Iorio and R. Griffiths. Importance sampling on coalescent histories. Adv. Appl. Proba., 36:417–433, 2004. → pages 95 [25] P. Dellaportas, D. Denison, and C. Holmes. Flexible threshold models for modelling interest rate volatiliy. Econometrics Review, 26:419–437, 2007. → pages 95 [26] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000. → pages 46 [27] A. Doucet, J. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001. → pages 2, 8, 46 [28] A. Doucet, S. J. Godsill, and C. P. Robert. Marginal maxiumum a posteriori estimation using Markov chain Monte Carlo. Statistics and Computing, 12: 77–84, 2002. → pages 36 [29] J. Durbin and S. Koopman. Time Series Analysis by State Space Methods. Oxford University Press, 2001. → pages 15 [30] M. Escobar and M. West. Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc., 90:577–588, 1995. → pages 62, 63, 114, 115 [31] P. Fearnhead. MCMC, sufficient statistics and particle filter. J. Comp. Graph. Stat., 11:848–862, 2002. → pages 20 [32] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Journal of Statistics and Computing, 14:11–21, 2004. doi:10.1023/B:STCO.0000009418.04621.cd. → pages 2, 3, 42, 61, 62, 63 99 [33] P. Fearnhead. Computational methods for complex stochastic systems: Alternatives to MCMC. Technical report, Department of Mathematics and Statistics, Lancaster University, 2007. → pages 68 [34] P. Fearnhead. Online inference for multiple changepoint problems. J. Royal Statist. Soc. B, 69:589–605, 2007. → pages 42 [35] P. Fearnhead and J. R. Clifford. On-line inference for hidden Markov models via particle filters. Statist. Soc. B, 65:887–899, 2003. → pages 55 [36] P. Fearnhead and L. Meligkotsidou. Filtering methods for mixture models. Journal of Computational & Graphical Statistics, 16:586–607, September 2007. → pages 15, 95 [37] P. Fearnhead, O. Papaspiliopoulos, and G. Roberts. Particle filters for partially-observed diffusion. J. Royal Statist. Soc. B, 69:589–605, 2008. doi:10.1111/j.1467-9868.2008.00661.x. → pages 15, 95 [38] T. Flury and N. Shephard. Bayesian inference based only on simulated likelihood: particle filter analysis of dynamic economic models. Technical report, Nuffield College, Oxford University, 2008. → pages 95 [39] D. Frenkel. Waste-recycling Monte Carlo, pages 127–138. Springer-Verlag, New York, December 2006. ISBN 354035283X. URL http://www.springer.com/materials/book/978-3-540-35283-9. → pages 28 [40] D. Frenkel and B. Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press (Elsevier), 2nd edition edition, 2002. → pages 15, 43, 46, 53, 95 [41] S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Verlag, New York, 2006. → pages 15, 45 [42] S. Frühwirth-Schnatter and L. Sögner. Bayesian estimation of stochastic volatility models based on OU processes with marginal gamma law. Ann. Instit. Statist. Math. to appear. → pages 74 [43] S. G. Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Processing, 50:281–289, 2002. → pages 20 [44] M. Gander and D. Stephens. Stochastic volatility modelling in continuous time with general marginal distributions: Inference, prediction and model selection. J. Statist. Plann. Inf., 137:3068–3081, 2007. → pages 74, 75, 76 100 [45] W. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. J. Royal Statist. Soc. B, 63: 127–146, 2001. → pages 3, 42 [46] D. T. Gillespie. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem., 81:2340–2361, 1977. → pages 68 [47] S. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for nonlinear time series. J. Amer. Statist. Assoc., 99:156–168, 2004. → pages 46, 95 [48] A. Golightly and D. J. Wilkinson. Bayesian sequential inference for stochastic kinetic biochemical network models. J. Comp. Biology, 13: 838–851, 2006. → pages 2, 68, 95 [49] A. Golightly and D. J. Wilkinson. Bayesian sequential inference for nonlinear multivariate diffusions. Statistics and Computing, 16:323–338, 2006. → pages 67, 68, 95 [50] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE-Proceedings-F 140, pages 107–113, 1993. → pages 15 [51] P. Grassberger. Pruned-enriched Rosenbluth method: Simulations of θ polymers of chain length up to 1 000 000. Phys. Rev. E, 56:3682 – 3693, 1997. → pages 15 [52] J. Griffin and M. Steel. Inference with non-Gaussian ornstein-unlhenbeck processes for stochastic volatility. J. Econometrics, 134:605–644, 2006. → pages 74 [53] F. Hamze and N. de Freitas. Hot coupling: a particle approach to inference and normalization on pairwise undirected graphs. In Adv. Neural Info. Proc. Sys., volume 18. MIT Press, 2005. → pages 2, 3, 95 [54] G. Hendeby, J. D. Hol, R. Karlsson, and F. Gustafsson. A graphics processing unit implementation of the particle filter. In European Signal Processing Conference (EUSIPCO), 2007. → pages 96 [55] H. Hsu, V. Mehra, W. Nadler, and P. Grassberger. Growth-based optimization algorithm for lattice heteropolymers. Physical Review E, 68(2): 21113, 2003. → pages 55 [56] C. Jarzynski. Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78:2690–2693, 1997. → pages 42 101 [57] G. Kitagawa. Monte Carlo filter and smoother for non-gaussian nonlinear state space models. J. Comp. Graph. Statist., 5:1–25, 1996. → pages 11, 33, 46 [58] G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space models, chapter 9. Springer-Verlag, New York, 2001. → pages 95 [59] H. Kueck, N. de Freitas, and A. Doucet. SMC samplers for Bayesian optimal nonlinear design. In IEEE Nonlinear Statistical Signal Processing Workshop, pages 99–102, Cambridge, UK, 13-15 Sept. 2006. ISBN 978-1-4244-0581-7. doi:10.1109/NSSPW.2006.4378829. → pages 95 [60] J. Liu, F. Liang, and W. Wong. The use of multiple-try method and local optimization in metropolis sampling. J. Amer. Statist. Assoc., 95:121–134, 2000. → pages 44 [61] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Verlag, New York, 1st edition edition, 2001. → pages 1, 2, 3, 8, 11, 41, 53, 54, 61, 95, 106 [62] K. Mengersen and R. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Annals Statist., 24:101–121, 1996. → pages 104 [63] A. S. Montemayor, J. J. Pantrigo, Ángel Sánchez, and F. Fernández. Particle filter on GPUs for real-time tracking. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Posters, page 94, New York, NY, USA, 2004. ACM. ISBN 1-58113-896-2. doi:10.1145/1186415.1186526. → pages 96 [64] P. D. Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer-Verlag, New York, 2004. → pages 12, 13 [65] P. D. Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68:411–436, 2006. → pages 2, 3, 42, 43, 86, 87, 90, 91 [66] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353–366, 1996. → pages xi, 86, 87 [67] M. K. Pitt and N. Shephard. Filtering via simulation: auxiliary particle filter. J. Am. Statist. Assoc., 94:590–599, 1999. → pages 2, 51 102 [68] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2nd edition edition, 2004. → pages 1, 5, 6, 54, 62 [69] G. Roberts, O. Papaspiliopoulos, and P. Dellaportas. Bayesian inference for non-Gaussian Orstein-Uhlenbeck stochastic volatility processes. J. Royal Statist. Soc. B, 66:369–393, 2004. → pages 74 [70] M. Rosenbluth and A. Rosenbluth. Monte Carlo simulations of the average extension of molecular chains. J. Chem. Phys., 23:356–359, 1955. → pages 14 [71] N. Shephard and M. K. Pitt. Likelihood analysis of non-Gaussian measurement time series. Biometrika, 84:653–667, 1997. → pages 18, 46, 51 [72] A. Shmygelska and H. H. Hoos. An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. Bioinformatics, 6:30, 2005. → pages 53 [73] J. Siepmann. A method for the direct calculation of chemical potentials for dense chain systems. Mol. Phys., 70:1145–1158, 1990. → pages 14 [74] J. Siepmann and D. Frenkel. Configurational-bias Monte Carlo: a new sampling scheme for flexible chains. Mol. Phys., 75:59, 1992. → pages 43, 53 [75] C. Thachuk, A. Shmygelska, and H. H. Hoos. A replica exchange Monte Carlo algorithm for protein folding in the HP model. Bioinformatics, 8:342, 2007. → pages 53, 55, 59 [76] J. Zhang, S. Kou, and J. S. Liu. Biopolymer structure simulation and optimization via fragment regrowth Monte Carlo. J. Chem. Phys., 126(22): 225101, 2007. → pages xi, 55, 60 103 Appendix A Proofs Proof of Theorem 2. Assumption (A2) ensures that qN covers the support of p̃iN and hence that the PMH defines an irreducible and aperiodic Markov chain with invariant distribution p̃iN from Theorem 1. It follows that the marginal distribution of the sequence of random variables generated by the algorithm converges to p̃iN . Since x(i) = xK(i)p (i) we have established the first result. Now under (A3) we have for all x1, ...,xp,a1, . . . ,ap−1 p̃iN (k,x1, ...,xp,a1, . . . ,ap−1) qN (k,x1, ...,xp,a1, . . . ,ap−1) = ẐN Z < Z−1 p∏ n=1 Bn < +∞. For an independent MH algorithm this implies uniform geometric ergodicity to- wards p̃iN , with a rate at least 1 − Z/ ( p∏ n=1 Bn ) ; see for example [62, Theorem 2.1]. This, together with a reasoning similar to above concerning x(i), we deduce the second statement of the theorem. Proof of Theorem 3. To simplify notation, we note z = (x̄1, ..., x̄p,a1, . . . ,ap−1). The transition probability of the PMH is of the form P ( (k, z) , ( k̇, ż )) = ẇk̇p × ψ (ż)× α (z, ż) (A.1) + ( N∑ m=1 ∫ w̌mp × ψ (ž)× (1− α (z, ž)) dž ) . δ(k,z) ( k̇, ż ) 104 where α (z, ż) = 1 ∧ Z N (ż) ZN (z) . Note the important fact used below that the acceptance probability is independent of k and k̇ (which, as pointed out earlier, means that k̇ needs to be sampled only when a new set of particles has been accepted). Now consider the function F (k, z) := N∑ l=1 f ( xlp ) I {l = k} , and note the fact that since by construction p̃iN (k, z) is invariant with respect to P , N∑ k,k̇=1 ∫ p̃iN (k, z)P ( (k, z) , ( k̇, ż )) F ( k̇, ż ) dzdż (A.2) = N∑ k̇=1 ∫ p̃iN ( k̇, ż ) F ( k̇, ż ) dż = Epi(f) Let Z ∼ p̃iN , Ż ∼ ψ (defined in (3.2)) and U be a random variable uniformly distributed on [0, 1]. Then, using (A.1), the following quantity is an estimate of the left hand side of (A.2) N∑ k̇=1 Ẇ k̇p f(Ż k̇ p)  I{U<α(Z, Ż)}+I{U>α(Z, Ż)} N∑ k=1 F (k,Z) p̃iN (k|Z) , (A.3) and it can be checked using (3.4) that p̃iN (k| z) = p̃i N (k, z) p̃iN (z) = wkp . The result of the theorem follows straightforwardly from this and the fact that under (A1)-(A2), the simulated Markov chain is ergodic i.e. as i → ∞, (K(i),Z(i)) ∼ p̃iN (k, z). 105 Proof of Theorem 4. To establish the result, we first rewrite (3.15) as p̃iN (θ, k,x1, ...,xp,a1, . . . ,ap−1) = 1 Np pi ( θ,xkp ) N∏ i=1,i 6=ik1 M θ1 ( xi1 ) p∏ n=2 r ( a−i k n n−1 ∣∣∣wn−1, aiknn−1) N∏ i=1,i 6=ikn M θn ( x ain−1 n−1 , x i n ) (A.4) Notice that with ik = (ik1, i k 2, . . . , i k p) the ancestral lineage of x k p , we have p̃iN ( θ,xkp, i k ) = 1 Np pi ( θ,xkp ) . Now the PG sampler can be interpreted as a standard Gibbs sampler of invariant distribution (3.15) which iterates the following steps (a) θ|(xkp, ik) ∼ p̃iN (·|k,xkp, ik) = pi (θ|xkp) (b) (x−i k 1 1 , ...,x −ikp p ,a −ik2 1 , . . . ,a −ikp p−1) ∼ p̃iN (·|θ, k,xkp, ik) (c) k|(θ,x1, ...,xp,a1, . . . ,ap−1) ∼ p̃iN (·|θ,x1, ...,xp,a1, . . . ,ap−1) with p̃iN (k|θ,x1, ...,xp,a1, . . . ,ap−1) = wkp . Note that (a) might appear unusual but leaves (3.15) invariant and is known in the literature under the name “collapsed Gibbs sampling” [61, Section 6.7]. Let A ∈ B(Θ), B ∈ B(X p), C ∈ B(X (N−1)p × {1, . . . , N}(N−1)p), k ∈ {1, . . . , N} and i ∈ {1, . . . , N}p be such that p̃iN ({k} ×A×B × {i} × C) > 0. From (A5) it is possible to show that accessible sets for the G sampler are also marginally accessible by the PG sampler i.e. more precisely if A × B ∈ B(Θ) × X p is such that LG ((θ(j),x (j)) ∈ A×B) > 0 for some finite j > 0 then also LPG ((K(j), θ(j),x (j) , I(j)) ∈ {k} ×A×B × {i}) > 0 for all k ∈ {1, . . . , N} and i ∈ {1, . . . , N}p. From this and the assumed irreducibility of the G sampler in (A6), we deduce that if pi((θ,x) ∈ A × B) > 0 then there exists a finite j such that LPG ((K(j), θ(j),x (j) , I(j)) ∈ {k} ×A×B × {i}) > 0 for all k ∈ {1, . . . , N} and i ∈ {1, . . . , N}p. Now because pi((θ,x) ∈ A × B) > 0 and step (b) corresponds to sampling from the conditional distribution of p̃iN , we 106 deduce that LPG ( (K(j+1), θ(j+1),x (j+1) , I(j+1), x−I K(j+1) 1 1 (j+1), ...,x −IK(j+1)p p (j+1), A−I K(j+1) 2 1 (j+1), . . . ,A −IK(j+1)p p−1 (j+1) ) ∈ {k} ×A×B × {i} × C ) > 0 and the irreducibility of the PG sampler follows. Aperiodicity can be proved by contradiction. Indeed from (A5) we deduce that if the PG sampler is periodic, then so is the G sampler, which contradicts (A6). Proof of Theorem 5. The proof of the first part of the theorem is similar to the proof of Theorem 1 and is shown below. The second part of the proof is a direct consequence of [1, Theorem 1] and (A5)-(A7). p̃iN (·) qN (·) = 1 Np pi ( θ,xkp ) q (θ∗, θ)× wkp ×M θ1 (xi k 1 1 ) p∏ n=2 r(ikn−1|wn−1)M θn(x ikn−1 n−1 , x ikn n ) = pi ( θ,xkp ) q (θ∗, θ)M θ1 (x ik1 1 ) p∏ n=2 M θn(x ikn−1 n−1 , x ikn n ) 1 Np p∏ n=1 w ikn n = pi ( θ,xkp ) q (θ∗, θ)M θ1 (x ik1 1 ) p∏ n=2 M θn(x ikn−1 n−1,n, x ikn n ) 1 Np p∏ n=1 ∑N i=1wn ( xin ) p∏ n=1 wn(x ikn n ) =  γ ( θ,xkp ) /Z q (θ∗, θ) γ̂N (θ)  γ ( θ,xkp ) = 1 Z γ̂N(θ) q (θ∗, θ) where γ̂N (θ) = ẐN is given in (2.6). In the manipulations above we have used (A1) on the second line, whereas the final result is obtained thanks to the definitions of the incremental weights (2.2) w1 ( xi1 ) := γ1(θ,xi1) M θ1 (x i 1) , wn ( xin ) := γn ( θ,xin ) γn−1 ( θ,x Ain−1 n−1 ) M θn ( x Ain−1 n−1 , Xin ) , and of the normalising constant estimate (2.6). 107 Appendix B Nonlinear State-Space Model B.1 Gibbs Sampling for State-Space Model We want to sample the time series using Gibbs sampling. Thus we need the condi- tional probabilities pi(θ|x1:T ) and pi(xj |x−j ,y1:T ). B.1.1 Sampling Parameters For now, assume that we are interested in the variance of the noise in the state transition, i.e. σ2V . pi(σ2V |x1:T ) ∝ pi(x1:T |σ2V )pi(σ2V ) = pi(σ2V )pi(x1) ∏T j=2 1√ 2piσ2V exp { − [xj−f(xj−1)]2 2σ2V } ∝ pi(σ2V )σ−(T−1)V exp { − ( 1 2 ∑T j=2[xj − f(xj−1)]2 ) σ−2V } = pi(σ2V ) σ −(T−1) V e −βV σ−2V where we set βV = 12 ∑T j=2[xj − f(xj−1)]2. Setting κV = σ−2V we get: pi(κV |x1:T ) ∝ pi(κV )κ(T−1)/2V e−βV κV = pi(κV ) κ(T−1)/2V e−βV κV Now we need to choose a prior for κV . A convenient one is the Gamma distribution 108 with shape ξv and scale γv: pi(κV |x1:T ) ∝ κ(T−1)/2+ξV −1V e − “ βV + 1 γv ” κV (B.1) ∝ Ga ( κV ; T − 1 2 + ξV , ( βV + 1 γV )−1) (B.2) Note that this implies pi(σ2W ) = IGa(ξv, 1/γv) Similarly, for the variance of the observation error we get: pi(σ2W |x1:T ) ∝ pi(x1:T |σ2V ,y1:T )pi(σ2W ) (B.3) = pi(σ2W )pi(x0) T∏ j=1 1√ 2piσ2W exp { − [yj − g(xj)] 2 2σ2W } (B.4) ∝ pi(σ2W )σ−TW e−βW σ −2 W (B.5) where βW = 12 ∑T j=1[yj − g(xj)]2. Using the same approach as above, we sample κW = σ−2W using a Gamma prior with shape ξW and scale γW : pi(κW |x1:T ) ∝ κT/2+ξW−1W e − “ βW+ 1 γW ” κW ∝ Ga ( κW ; T 2 + ξW , ( βW + 1 γW )−1) (B.6) B.1.2 Sampling State Variables The conditional distribution for the latent variables xi is as follows: pi(xj |x−j ,y1:T ) =  pi(xj+1|xj) pi(xj |yj), for j = 1 pi(xj |xj−1) pi(xj+1|xj)pi(xj |yj), for 1 < j < T pi(xj |xj−1) pi(xj |yj), for j = T 109 with pi(xj |xj−1) = N ( xj ; f(xj−1), σ2V ) pi(xj |yj) = N ( yj ; g(xj), σ2W ) In general we cannot sample from this directly (except for some simple func- tions f and g). We therefore use a few MH steps within the Gibbs sampler. The proposal for this is the same as for the PMCMC (i.e. proposing from prior or using the extended Kalman filter (EKF) proposal). 110 Appendix C Dirichlet Mixture Model: Derivation C.1 Gaussian Mixture The posterior is proportional to the prior times the likelihood: P(x1:n|y1:n, α) ∝ P(x1:n|α)pi(y1:n|x1:n, α) = P(x1:n|α)pi(y1:n|x1:n) (C.1) The prior P(x1:n|α) is given by equation 4.4 P(x1:n|α) = n∏ i=1 P(xi|x1:i−1, α) = αkn ∏kn j=1(m j n − 1)! α (α+ 1)...(α+ n− 1) (C.2) The likelihood factorises since the observations are independent given the cluster: pi(y1:n|x1:n, α) = kn∏ j=1 pi(y{s|xs=j}|x1:n, α) = kn∏ j=1 ∫ pi(y{s|xs=j}, θj |x1:n) dθj = kn∏ j=1 ∫  ∏ {i|xi=j} pi(yi|θj) pi(θj) dθj = kn∏ j=1 ∫  ∏ {i|xi=j} gθj (yi)  dG0(θj) (C.3) 111 We can now substitute for gθj and G0: pi(y1:n|x1:n, α) = kn∏ j=1 ∫  ∏ {i|xi=j} N (yi; µj , σ2j )  IGa(σ2j ; a, b)N (µj ; η, τσ2j ) dµj d(σ2j ) = kn∏ j=1 ∫  ∏ {i|xi=j} exp { − (yi−µj)2 2σ2j } √ 2piσ2j  ×  b a exp { −b σ2j } Γ(a) ( σ2j )a+1  exp { − (µj−η)2 2τσ2j } √ 2piτσ2j  dµj d(σ2j ) = kn∏ j=1 ( ba Γ(a) )∫ exp{−P{i|xi=j}(yi−µj)2 2σ2j − b σ2j − (µj−η)2 2τσ2j } ( 2piσ2j )mjn/2 ( σ2j )a+1 ( 2piτσ2j )1/2 dµj d(σ2j ) ∝ kn∏ j=1 ba Γ(a) ∫ (σ2j)−(1+a+mjn+12 )√ τ × exp − 1σ2j b+ 1 2 ∑ {i|xi=j} (yi−µj)2+ 12τ (µj−η) 2  dµj d(σ2j ) We now have to complete the square in order to integrate out µj : Φjn = ∑ {i|xi=j} (yi − µj)2 + 1 τ (µj − η)2 = µ2j ( mjn + 1 τ ) − 2µj  ∑ {i|xi=j} yi − η τ +  ∑ {i|xi=j} y2i + η2 τ 112 Now define the first and second moment: ỹjn,1 = 1 mjn ∑ {i|xi=j} yi ≡ ȳjn (C.4) ỹjn,2 = 1 mjn ∑ {i|xi=j} y2i ≡ (σ̂jn)2 + (ȳjn)2 (C.5) We continue to complete the square and further simplify the expression Φjn: Φjn = µ 2 j ( mjn + 1 τ ) − 2µj ( mjnỹ j n,1 − η τ ) +mjnỹ j n,2 + η2 τ = ( mjn + 1 τ )( µj − mjnỹ j n,1 − η/τ mjn + 1τ )2 + mjn [( ỹjn,2 − (ỹjn,1)2 ) + (ỹjn,1 − η)2 1 +mjnτ ] Now we end up with an expression that has the form of a Gaussian in µj : If we now reinsert this into the above exponential, we can easily marginalise out µj . pi(y1:n|x1:n, α) ∝ kn∏ j=1 ba Γ(a) √ τ ∫ ( σ2j )−(1+a+mjn+1 2 ) × exp { − 1 σ2j ( b+ mjn 2 [( ỹjn,2 − (ỹjn,1)2 ) + (ỹjn,1 − η)2 1 +mjnτ ])} × exp − 12σ2j ( mjn + 1 τ )( µj − mjnỹ j n,1 − η/τ mjn + 1τ )2 dµj d(σ2j ) For ease of notation, we define Ψjn = ( b+ mjn 2 [( ỹjn,2 − (ỹjn,1)2 ) + (ỹjn,1 − η)2 1 +mjnτ ]) (C.6) 113 The integral over µj simply evaluates to the normalising constant of a Gaussian: pi(y1:n|x1:n, α) ∝ kn∏ j=1 ba Γ(a) √ τ ∫ ( σ2j )−(1+a+mjn+1 2 ) √√√√ 2piσ2j mjn + 1τ e−Ψ j n/σ 2 j d(σ2j ) = kn∏ j=1 ba Γ(a) √ τ √ 2pi mjn + 1τ ∫ ( σ2j )−(1+a+mjn 2 ) e−Ψ j n/σ 2 j d(σ2j ) This integral has the form of the Inverse-Gamma distribution, thus we can solve it analytically and the result is the normalising constant:∫ x−(α+1) exp (−β x ) dx = Γ(α) β−α (C.7) Thus we get pi(y1:n|x1:n, α) ∝ kn∏ j=1 ( ba Γ(a) √ τ √ 2pi mjn + 1τ ) Γ ( a+ mjn 2 ) (Ψjn) −(a+m j n 2 ) ∝ kn∏ j=1  ba Γ ( a+mjn/2 ) Γ(a) √ mjnτ + 1 ( b+ mjn 2 [( σ̂jn )2 + (ȳjn − η)2 1 +mjnτ ])−(a+mjn 2 )  (C.8) The derivation for sampling α conditional on the current state (number of clus- ters) is outlined in [30]. We repeat it here for convenience: We first determine the probability distribution of the number of clusters given α and the number of data points. P(kn|α, n) = cn(kn)n!αkn Γ(α)Γ(α+ n) (C.9) where cn(kn) = P(kn|α = 1, n). It is possible to write the above fraction of Gamma functions in terms of a Beta function, which has the following definition 114 and property: β(x, y) = ∫ 1 0 tx−1(1− t)y−1dt = Γ(x)Γ(y) Γ(x+ y) Thus rewriting this and inserting it into the equation of P(kn|α, n) gives (where we let x = α+ 1, y = n): P(kn|α, n) = cn(kn)n!αkn Γ(α+ 1)/αΓ(α+ 1 + n)/(α+ n) = cn(kn)n!αkn (α+ n)β(α+ 1, n) αΓ(n) We now replace the Beta function with its definition as an integral: P(kn|α, n) = cn(kn)n!Γ(n) α kn−1(α+ n) ∫ 1 0 tα(1− t)n−1dt By Bayes’ formula the posterior for α conditional on kn is thus: pi(α|kn, n) ∝ pi(α)pi(kn|α, n) ∝ pi(α)αkn−1(α+ n) ∫ 1 0 tα(1− t)n−1dt This suggests that we can interpret pi(α|kn, n) as the marginal of a joint distribution for α and another variable v defined on the continuous space v ∈ (0, 1): pi(α, v|kn, n) ∝ pi(α)αkn−1(α+ n) vα(1− v)n−1 Thus we can now establish conditional posteriors pi(α|v, kn, n) and pi(v|α, kn, n). Under a Gamma prior for α (Eqn. 4.5) we have: 1 pi(α|v, kn, n) ∝ Ga(α; c, d)αkn−1 (α+ n) vα ∝ αc−1 e−α/d αkn−1(α+ n) vα = (α+ n)αc+kn−2 e−α/d+α log(v) = αc+kn−1 e−α(1/d−log(v)) + nαc+kn−2 e−α(1/d−log(v)) 1Note that Escobar and West [30] define Ga(a, b) with b being the scale parameter on pg. 584, which is a typo and b is actually the rate parameter (inverse of scale). 115 This can be written as a mixture of two Gamma distributions: (α|v, kn, n) ∼ piv Ga (c+ kn, θv) + (1− piv)Ga (c+ kn − 1, θv) (C.10) where θv = ( 1 d − log(v) )−1 is the scale parameter, and the weights piv are given by the ratio of the normalising constants of the two Gamma distributions. Recall the density of the Gamma distribution is defined as: Ga(x; k, θ) = xk−1 e −x/θ θk Γ(k) for x > 0 and k, θ > 0 (C.11) where k and θ denote the shape and scale, respectively. piv n 1− piv = θc+knv Γ(c+ kn) θc+kn−1v Γ(c+ kn − 1) = c+ kn − 1 1 d − log(v) (C.12) Thus we get: piv = { 1 + n ( 1 d − log(v) ) c+ kn − 1 }−1 (C.13) Now consider the conditional distribution for v|α, kn, n: pi(v|α, kn, n) ∝ vα(1− v)n−1 (0 < v < 1) This is simply a Beta distribution with mean (α+ 1)/(α+ n+ 1): pi(v|α, kn, n) = B(v; α+ 1, n) (C.14) Thus we can sample α by first sampling v ∼ B(α+ 1, n) and then sample α from the mixture of Gamma distributions defined above. 116 C.2 Implementation The optimal proposal for the particle filter is q(xn|x1:n−1, y1:n, α) = pin(xn|xn−1): q(xn|x1:n−1, y1:n, α) = P(xn|x1:n−1, y1:n, α) = pi(y1:n|x1:n, α)P(xn|x1:n−1, α) pi(y1:n|x1:n−1, α) = pi(y1:n|x1:n, α)P(xn|x1:n−1, α)∑ x′n pi(y1:n|x′n, x1:n−1, α)P(x′n|x1:n−1, α) (C.15) Since we have a finite (and generally low) number of clusters, we can compute q(xn|x1:n−1, y1:n, α) for all possible values of xn. Thus we can compute the nor- malising constant and sample from P(xn|x1:n−1, y1:n, α) exactly. xn ∼ ∑kn−1+1 j=1 pi(y1:n|xn, x1:n−1, α)P(xn|x1:n−1, α) δj(xn)∑kn−1+1 x′n=1 pi(y1:n|x′n, x1:n−1, α)P(x′n|x1:n−1, α) As noted above, in SMC the optimal proposal for xn is pin(xn|x1:n−1), which gives the following (incremental) weight (the dependence on α is suppressed for ease of notation): wn(xn) = pin(x1:n) pin−1(x1:n−1) pin(xn|x1:n−1) = pin(x1:n−1)((((( (( pin(xn|x1:n−1) pin−1(x1:n−1)((((( (( pin(xn|x1:n−1) = P(x1:n−1|y1:n) P(x1:n−1|y1:n−1) = ∑ x′n P(x′n, x1:n−1|y1:n) P(x1:n−1|y1:n−1) = ∑ x′n pi(y1:n|x′n, x1:n−1)P(x′n, x1:n−1) / pi(y1:n) pi(y1:n−1|x1:n−1)P(x1:n−1) / pi(y1:n−1) ∝ ∑ x′n pi(y1:n|x′n, x1:n−1) P(x′n|x1:n−1) pi(y1:n−1|x1:n−1) (C.16) = ∑ x′n ∏kn j=1 f(m j n, ȳ j n, (σ̂ j n)2|x1:n−1, x′n, y1:n)∏kn−1 j=1 f(m j n−1, ȳ j n−1, (σ̂ j n−1)2|x1:n−1, y1:n−1) P(x′n|x1:n−1) = αf(1, yn, 0) n− 1 + α + kn−1∑ j=1 mjn−1 n− 1 + α f(mjn, ȳ j n, (σ̂ j n)2|x1:n−1, xn=j, y1:n) f(mjn−1, ȳ j n−1, (σ̂ j n−1)2|x1:n−1, y1:n−1) (C.17) 117 where f is the likelihood contribution from a cluster at time n (the sufficient statis- tics obviously depend on xn): f(mjn, ȳ j n, (σ̂ j n) 2|x1:n, y1:n) = ba Γ ( a+mjn/2 ) Γ(a) √ mjnτ + 1 ( b+ mjn 2 [( σ̂jn )2 + (ȳjn − η)2 1 +mjnτ ])−(a+mjn 2 ) Note that the weight for the particle is independent of the choice of xn. We also see that the numerator in equation (C.16) is simply the normalisation factor used in equation (C.15). This allows for an efficient computation of the particle weight since both the normalisation constant and the likelihood of the previous time steps have already been computed. 118

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0051665/manifest

Comment

Related Items