Particle Markov Chain Monte Carlo by Roman Holenstein B.Sc. Physics/Computer Science, University of Northern British Columbia, 2002 M.Sc. Electrical Engineering, University of Alberta, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) March 2009 c Roman Holenstein, 2009 Abstract Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods have emerged as the two main tools to sample from high-dimensional probability distributions. Although asymptotic convergence of MCMC algorithms is ensured under weak assumptions, the performance of these latters is unreliable when the proposal distributions used to explore the space are poorly chosen and/or if highly correlated variables are updated independently. In this thesis we propose a new Monte Carlo framework in which we build efficient high-dimensional proposal distributions using SMC methods. This allows us to design effective MCMC algorithms in complex scenarios where standard strategies fail. We demonstrate these algorithms on a number of example problems, including simulated tempering, nonlinear non-Gaussian state-space model, and protein folding. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Review of Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 5 2.1 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . 6 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1 Algorithm and Use of Its Output . . . . . . . . . . . . . . 8 2.2.2 Implementation Issues . . . . . . . . . . . . . . . . . . . 11 2.2.3 Discussion of Convergence Results . . . . . . . . . . . . 12 2.3 Configurational Bias Monte Carlo . . . . . . . . . . . . . . . . . 14 2.4 Inference in State-Space Models . . . . . . . . . . . . . . . . . . 15 2.4.1 15 2.2 Model Description . . . . . . . . . . . . . . . . . . . . . iii 3 Markov Chain Monte Carlo for State-Space Models . . . 18 2.4.3 Sequential Monte Carlo for State-Space Models . . . . . . 19 Particle Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . 21 Particle Metropolis-Hastings Sampler . . . . . . . . . . . . . . . 22 3.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1.2 Proof of Validity . . . . . . . . . . . . . . . . . . . . . . 23 3.1.3 Extensions and Variations . . . . . . . . . . . . . . . . . 27 Particle Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.2 Proof of Validity . . . . . . . . . . . . . . . . . . . . . . 33 3.2.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 35 Particle Marginal Metropolis-Hastings Sampler . . . . . . . . . . 37 3.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.2 Proof of Validity . . . . . . . . . . . . . . . . . . . . . . 39 3.3.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 40 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.4.1 Dynamic and Optimal Resampling . . . . . . . . . . . . . 41 3.4.2 Arbitrary Target Distribution . . . . . . . . . . . . . . . . 42 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Nonlinear State-Space Model . . . . . . . . . . . . . . . . . . . . 45 4.2 Protein Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2.1 Lattice Model . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3 Dirichlet Mixture Model . . . . . . . . . . . . . . . . . . . . . . 61 4.4 Markov Jump Processes (Lotka-Volterra) . . . . . . . . . . . . . 67 4.5 L´evy-Driven Stochastic Volatility Model . . . . . . . . . . . . . . 73 4.5.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 76 4.5.2 Standard & Poor’s 500 . . . . . . . . . . . . . . . . . . . 77 Tempering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.1 3.2 3.3 3.4 3.5 4 2.4.2 4.6 iv 5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 A Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 B Nonlinear State-Space Model . . . . . . . . . . . . . . . . . . . . . . 108 B.1 Gibbs Sampling for State-Space Model . . . . . . . . . . . . . . 108 B.1.1 Sampling Parameters . . . . . . . . . . . . . . . . . . . . 108 B.1.2 Sampling State Variables . . . . . . . . . . . . . . . . . . 109 C Dirichlet Mixture Model: Derivation . . . . . . . . . . . . . . . . . 111 C.1 Gaussian Mixture . . . . . . . . . . . . . . . . . . . . . . . . . . 111 C.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 v List of Tables 4.1 Protein sequences for 2D HP model . . . . . . . . . . . . . . . . 56 4.2 Protein sequences for 3D HP model . . . . . . . . . . . . . . . . 57 4.3 Protein folding results for HP model . . . . . . . . . . . . . . . . 58 4.4 Comparison of protein folding performance (data set 1) . . . . . . 59 4.5 Comparison of protein folding performance (data set 2) . . . . . . 60 4.6 Results for Galaxy data set . . . . . . . . . . . . . . . . . . . . . 63 4.7 Posterior statistics of parameters for S&P 500 data . . . . . . . . 85 4.8 Simulation results for a mixture of Gaussians with 4292 modes . . 89 5.1 Guidelines for algorithm choice . . . . . . . . . . . . . . . . . . 95 vi List of Figures 2.1 Illustration of particle ancestry in SMC . . . . . . . . . . . . . . . 10 2.2 Construction of the proposal in CBMC . . . . . . . . . . . . . . . 14 2.3 State-space model . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Generating a proposal using SMC . . . . . . . . . . . . . . . . . 24 4.1 Acceptance probabilities for PMH and CBMC . . . . . . . . . . . 48 4.2 PMH acceptance rates for varying # of observations and particles . 49 4.3 Hidden states (x) and observations (y). . . . . . . . . . . . . . . . 49 4.4 ACF of the output of the PG and MH algorithms . . . . . . . . . . 50 4.5 Estimates of p ( σV | y1:T ) and p ( σW | y1:T ) . . . . . . . . . . . . 51 4.6 ACF of σV and σW for PG and PMMH . . . . . . . . . . . . . . 52 4.7 Illustration of protein folding in HP model . . . . . . . . . . . . . 54 4.8 Sampled parameter α for PG and PMMH . . . . . . . . . . . . . 64 4.9 Acceptance rates for various versions of the PMH algorithm . . . 65 4.10 ACF for parameter α and estimate of π ( α| y1:T ) in Dirichlet model 66 4.11 Data set for LV model . . . . . . . . . . . . . . . . . . . . . . . . 69 4.12 Histogram and trace plots of LV parameters . . . . . . . . . . . . 70 4.13 ACF of sampled Lotka-Volterra parameters . . . . . . . . . . . . 71 4.14 Histograms of Lotka-Volterra states . . . . . . . . . . . . . . . . 72 4.15 L´evy-driven SV with synthetic data: ACFs . . . . . . . . . . . . . 77 4.16 L´evy-driven SV with synthetic data: Histogram & scatter plots . . 78 4.17 L´evy-driven SV with synthetic data: Trace plots . . . . . . . . . . 79 4.18 S&P 500 data set for L´evy-driven SV . . . . . . . . . . . . . . . 80 4.19 L´evy-driven SV with S&P 500 data: Histogram & scatter plots . . 81 vii 4.20 L´evy-driven SV with S&P 500 data: ACF of sampled parameters . 82 4.21 L´evy-driven SV with S&P 500 data: Trace plots of parameters . . 82 4.22 L´evy-driven SV with S&P 500 data: Histogram & scatter plots . . 83 4.23 L´evy-driven SV with S&P 500 data: ACF of sampled parameters . 84 4.24 L´evy-driven SV with S&P 500 data: Trace plots of parameters . . 84 4.25 Mixture of Gaussians posterior distribution using PMH . . . . . . 90 4.26 Data set and posterior for 4-component mixture of Gaussians . . . 92 4.27 Acceptance rates for finite mixture model . . . . . . . . . . . . . 93 viii List of Algorithms 2.1 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Sequential Monte Carlo Algorithm . . . . . . . . . . . . . . . . . 9 2.4 Configurational Bias Monte Carlo . . . . . . . . . . . . . . . . . . 16 3.1 Particle Metropolis Hastings Sampler . . . . . . . . . . . . . . . . 23 3.2 Particle Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Conditional Sequential Monte Carlo Algorithm . . . . . . . . . . . 32 3.4 Conditional Multinomial Resampling . . . . . . . . . . . . . . . . 32 3.5 Conditional Stratified Resampling . . . . . . . . . . . . . . . . . . 33 3.6 Alternate Move for Particle Gibbs Sampler . . . . . . . . . . . . . 35 3.7 Marginal Maximisation using Particle Gibbs . . . . . . . . . . . . 36 3.8 Particle Marginal Metropolis-Hastings Sampler . . . . . . . . . . . 38 3.9 PMMH with m Independent SMC Algorithms . . . . . . . . . . . 40 4.1 Gibbs Sampler for Dirichlet Mixture Model. . . . . . . . . . . . . 66 4.2 Tempered Transitions . . . . . . . . . . . . . . . . . . . . . . . . . 86 ix Glossary Notation X a random variable (upper case) x a value (lower case) X, x a vector B(E) Borel σ-algebra on E P(A) probability of event A π(dx) probability distribution π(x) probability density (E, F) measurable space (σ-field) on set E and σ-algebra F Eπ (·) expectation with respect to distribution π varπ (·) variance with respect to distribution π δx (dx) Dirac delta function N (x; µ, σ 2 ) Normal (Gaussian) distribution with mean µ and variance σ 2 B(x; N, p) Binomial distribution for N trials with success probability p M(x; N, p) Multinomial distribution for N trials with event probabilities p = (p1 , ..., pk ) Ga(x; k, θ) Gamma distribution with shape k > 0 and scale θ > 0 IGa(x; α, β) Inverse Gamma distribution with shape α > 0 and scale β > 0 Be(x; α, β) Beta distribution with shape parameters α > 0 and β > 0 TS(κ, δ, γ) tempered-stable distribution x Acronyms AC auto-correlation ACF auto-correlation function ACO ant colony optimisation CBMC configurational bias Monte Carlo DP Dirichlet process DPERM dynamic prune-enriched Rosenbluth method GPU graphics processing unit EKF extended Kalman filter ESS effective sample size FRESS fragment regrowth via energy-guided sequential sampling [76] G Gibbs HMM hidden Markov model LV Lotka-Volterra, predator-prey model MC Monte Carlo MCMC Markov chain Monte Carlo MH Metropolis Hastings MJP Markov jump process MTM multiple-try Metropolis NLSS non-linear state-space model NTT Neal’s tempered transitions, MCMC algorithm to sample from multimodal distributions developed by Radford Neal [66] OU Ornstein-Uhlenbeck process PDF probability distribution function PERM prune-enriched Rosenbluth method xi PG particle Gibbs PMCMC particle Markov chain Monte Carlo PMH particle Metropolis Hastings PMMH particle marginal Metropolis Hastings REMC replica exchange Monte Carlo RG recoil-growth RJMCMC reversible-jump MCMC S&P 500 Standard & Poor’s 500 index SA simulated annealing SAW self-avoiding walk SDE stochastic differential equation SIMCMC sequentially interacting Markov chain Monte Carlo SMC sequential Monte Carlo SSM state-space model SV stochastic volatility UKS unscented Kalman smoother xii Acknowledgements I am extremely grateful to my supervisor Arnaud Doucet. His enthusiasm, his support, his honest critiques, his sound advice, and his deep insight have helped me greatly and this thesis would not have been possible without him. He is an inspiration and I am very honoured to call him my teacher and friend. I would like to thank Christophe Andrieu who has also been integral to the development of the methodology in this thesis. Many thanks go to the numerous people who have been my teachers and mentors along my academic journey so far. I would especially like to express my gratitude to: Mark Shegelski (undergraduate research supervisor), Robert Fedosejevs and Ying Tsui (M.Sc. supervisors), Raphael Gottardo (committee member), as well as Nando de Freitas (committee member), who introduced me to the fun of Monte Carlo and machine learning. I am very grateful for their generous support, advise, and help with various projects. Furthermore I would like to thank my student colleagues and friends for great discussions and for providing a fun and stimulating learning environment. Special thanks go to Aline Tabet, Emtiyaz Khan, Franc¸ois Caron, Frank Hutter, Gareth Peters, Glen Goodvin, Hoyt Koepke, Luke Bornn, Mark Schmidt, Mark Crowley, Matt Hoffman, Mike Chiang, and Mike Whitwick. The secretaries at the CS department deserve my gratitude for providing assistance and administrative support, with special mention to Joyce Poon, Lara Hall, and Kath Imhiran. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the University of British Columbia, the Department of Computer Science, and the Province of British Columbia. I would also like to xiii acknowledge the Western Canada Research Grid (WestGrid) for providing computational resources. Last but not least, I am forever grateful to my parents and siblings for their continuous love and support, especially my mother who has always been there for me and helped me grow as a person. Finally, I want to thank my darling Elaine Qing Chang for her unbounded encouragement, support, and love. xiv Chapter 1 Introduction Monte Carlo methods have become the standard tool to solve many problems in statistics and scientific computing. Examples are abound, and include instances in Bayesian statistics (posterior estimation), statistical physics (Ising model), particle physics (simulation of high energy particles interacting with a detector), biology (protein folding, Lotka-Volterra model), chemistry (chemical reaction networks), and finance (option valuation), to name a few. The problems are becoming more and more sophisticated and Monte Carlo methods are now expected to deal with high dimensionalities and complex interactions between the model variables. In this thesis we present a new framework [3] that allows us to expand the class of problems that can routinely be addressed with Monte Carlo methods. It is based on a non-trivial and novel combination of existing sampling strategies and takes advantage of their strengths. We then apply the method to problems in Bayesian statistics and biology and make comparisons with competitive algorithms. Assume we are given a probability distribution π(dx) defined on a measurable space (E, F) and that we are interested in sampling from this distribution, usually to compute analytically intractable expectations of interest with respect to π(dx) by invoking the law of large numbers [61, 68]. For ease of notation, we shall further assume that π(dx) admits a density π(x) with respect to a σ-finite dominating measure denoted dx. Monte Carlo sampling is generally done by proposing samples from some instrumental distribution. A weighing and/or selection mechanism then corrects for 1 the bias from the proposal distribution. As the dimension increases, finding a good proposal distribution becomes harder. A good proposal distribution should take into account key features of the target, such as multimodality, and ideally should be as close as possible to the target. There are two classes of Monte Carlo algorithms that are often used for sampling from high-dimensional distributions: Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). MCMC relies on sampling a realization of a Markov chain with invariant distribution π(dx). SMC is a sequential implementation of importance sampling, employing a population of samples (particles) and a sequence of probability distributions (of increasing dimension), with the final distribution being the target distribution to generate samples from π(dx). When the consecutive distributions in the sequence are not too different one can find good proposals to move the particles from one distribution to the next. An introduction to this method is given in section 2.2. For a book-length review see [27]. While initially designed for on-line inference in dynamic models, usually referred to as “particle filters” in that context, they are now also used in static models [15, 32, 65]. SMC samplers have found a wide range of applications in fields as diverse as econometrics [16], [67], ecology [12], and systems biology [48]. It was realised later that the interest of SMC methods is not limited to dynamic models and they are now increasingly used to perform inference for a wide range of models including contingency tables [15], mixtures models [32, 65], graphical models [53], and population genetics [61, Section 4.1.2]. Where traditional importance sampling would try to directly produce weighted samples on E to approximate π (x), and most likely fail for the same reason that an independent Metropolis Hastings (MH) algorithm would fail, an SMC algorithm will adopt a more progressive approach which can be beneficial in large dimensional scenarios. The first ingredient of such an algorithm is a sequence of intermediate “bridging” probability distributions of increasing dimensions {πn (dxn ) , n = 1, . . . , p − 1} with xn = (x1 , x2 , . . . , xn ) ∈ X1 × · · · × Xn and such that πn (dxn ) = πn (xn ) dxn where dxn = dxn−1 × dxn and πp (xp ) = π (x). However, sampling sequentially comes at a price. Indeed, if p is too large, then the SMC approximation of the joint density π (x) deteriorates as components imputed at any time n < p are not rejuvenated at subsequent time steps. As a 2 result, when p − n is too large the approximation of the marginal π (xn ) is likely to be rather poor as the successive resampling steps deplete the number of distinct particles. However, despite the potential drawbacks outlined above, SMC methods have had a major impact on inference in dynamic systems in the last fifteen years, and are now showing great promise for more classical “static” inference problems e.g. [15], [65], [32], [53] and [61, Section 4.1.2]. This thesis proposes a novel sampling strategy which aims to combine the advantages of MCMC and SMC methods. As we shall see, this opens up the possibility to routinely tackle problems that cannot be satisfactorily addressed using either of these approaches on its own. Several algorithms combining both approaches have already been proposed in the literature. In particular MCMC algorithms have been successfully used as updating mechanisms or proposals within SMC methods [45]; see also [17], [65]. Our approach is entirely different because on the contrary it proposes to use SMC algorithms as a proposal mechanism within MCMC methods. We use MCMC methods to allow us to “break” the complex task of sampling from π (x) into a series of simulations from lower-dimensional distributions whereas we use SMC ideas to efficiently update very large sub-blocks of x. Although this idea might seem natural, it is important to note that its realisation is far from obvious. Indeed, a direct implementation of such a strategy is impossible as the marginal distribution of a particle generated by an SMC algorithm is not available in closed-form but would be required for the implementation. To bypass this problem we introduce a non-standard auxiliary target distribution on an extended space which allows us to define a valid MH update and from which inference about π(x) can be carried out. We show that the resulting algorithm enjoys attractive properties and can be used as a component of more advanced algorithms. The rest of this thesis is organised as follows. In Chapter 2 we give a brief review of MCMC and SMC and provide some convergence results to motivate our new method. Note that SMC is presented in a non-standard notation in order to simplify the proofs for the particle Markov chain Monte Carlo (PMCMC) algorithms. Chapter 3 introduces the PMCMC framework, starting with the particle Metropolis Hastings (PMH) sampler, followed by the particle Gibbs (PG) and particle marginal Metropolis Hastings (PMMH) samplers. The PMH sampler (Section 3.1) is a novel Metropolis Hastings sampler which uses SMC algorithms as proposals to sample 3 from π (x). The PG sampler (Section 3.2) extends the PMH algorithm to allow sampling from distributions of the form π (θ, x) (defined on some space Θ × E), which can be understood as being an approximation of the standard Gibbs sampler (which alternates sampling from the full conditionals π (θ|x) and π (x|θ)), but has the property that it leaves π(θ, x) invariant, and is hence not biased. In cases where θ and x are strongly correlated, this approach is likely inefficient. This is addressed by the PMMH algorithm (Section 3.3), which is an unbiased approximation of a MH algorithm that works directly on the marginal space Θ. In Chapter 4 we present a variety of applications to which this methodology can be applied and make comparisons with competitive alternative algorithms. In Section 4.1 we apply PMCMC to a non-linear state-space model. Using PMH with simulated annealing we find optimal configurations in a protein folding model in Section 4.2. We perform clustering using Dirichlet mixture models in Section 4.3. We estimate parameters for the Lotka-Volterra (LV) model in Section 4.4. In Section 4.5 we consider a L´evydriven stochastic volatility (SV) model for the Standard & Poor’s 500 (S&P 500) data set. And in Section 4.6 we use PMH with tempering to sample from highly multimodal distributions. Finally Chapter 5 summarises the contributions of this thesis and outlines further extensions. 4 Chapter 2 Review of Monte Carlo Methods 2.1 Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) methods are a class of algorithms based on constructing a Markov chain which has the desired target distribution as its equilibrium distribution. Two very popular MCMC algorithms are the Metropolis Hastings (MH) and the Gibbs (G) sampler [68]. While originally mainly used in physics, these algorithms soon found widespread use in many other disciplines. Often they are used to solve integration and optimisation problems, for example computing expectations in Bayesian statistics by invoking the law of large numbers. In MCMC, a Markov chain is constructed from a transition kernel K(xn , xn+1 ). This kernel is defined on (E, B(E)) such that K(x, ·) is a probability measure for all X ∈ E. In the discrete domain the kernel is simply a transition matrix. In order to establish that the Markov chain converges to the required target distribution we require the following three properties: • The invariant distribution of the Markov kernel is π(·) • The Markov chain is irreducible: any subset of A ⊂ E for which π(A) > 0 is reachable from any starting point x(0) ∈ E • The Markov chain is aperiodic 5 Next we will (very) briefly introduce the MH and Gibbs samplers. 2.1.1 Metropolis-Hastings The MH sampler uses an instrumental proposal distribution to generate candidates that are accepted or rejected such that the resulting Markov chain has the target π(x) = γ(x)/Z as the invariant distribution. Algorithm 2.1: Metropolis-Hastings 1 2 Initialise X(0) For iteration i ≥ 1 3 Sample a candidate from a proposal distribution: X∗ ∼ q(X(i − 1), ·) 4 With probability 1∧ γ(X∗ ) q(X∗ , X(i − 1)) γ(X(i − 1)) q(X(i − 1), X∗ ) (2.1) set X(i) = X∗ , otherwise set X(i) = X(i − 1) The case where the proposal distribution is symmetric, q(x, y) = q(y, x), ∀y, x ∈ E, corresponds to the Metropolis algorithm, also known as random walk Metropolis. When the proposal is independent of the current state of the Markov chain we get the independent MH sampler. Many more variations and special cases exist. For a thorough survey of these methods see [68]. We may for example use multiple transition kernels and combine them through mixture or composition. 2.1.2 Gibbs Sampler The Gibbs sampler is in fact a special case of the Metropolis-Hastings algorithm. If we consider multiple kernels, one for each block of variables, and use them in a composition, and additionally these kernels use the conditional distribution of the target as the proposal, then the acceptance rate is 1 and we obtain the Gibbs 6 sampler. Algorithm 2.2: Gibbs Sampler 1 Initialise X(0) For iteration i ≥ 1 2 For n = 1, . . . , p do 3 Sample Xn (i) ∼ π(·|X1:n−1 (i), Xn+1:p (i − 1)) 4 2.2 Sequential Monte Carlo In this section, we briefly review the principle of SMC methods to sample from a given target distribution π (x). As explained in the introduction the method requires one to introduce a sequence of bridging probability densities {πn (xn ) , n = 1, . . . , p} of increasing dimension such that πp (xp ) = π (x); see Section 4 for detailed examples. For ease of presentation, we will assume that Xi = X for any i = 1, . . . , p, implying that πn (xn ) is defined on the product space X n . Each density is assumed known up to a normalising constant, i.e. for n = 1, . . . , p πn (xn ) = Zn−1 γn (xn ) , where γn : X n → R+ can be evaluated pointwise, but the normalising constant Zn is unknown. We will use the notation Z for Zp . We describe below a generic SMC algorithm which encompasses numerous variants proposed in the literature. Section 2.2.1 below is important and sufficient to understand the description of the new algorithms presented in the paper. It is worth noting for readers already familiar with SMC methods that our description outlined below might appear slightly unorthodox in that it does not exactly match standard computer implementations. It however enjoys the same important theoretical and practical properties and has the advantage of greatly simplifying the mathematical developments required to show the validity of the algorithms presented throughout the paper. Section 2.2.2 focuses on such subtle issues, is only 7 required to understand the theoretical justifications of our algorithms, and might be skipped on a first reading. 2.2.1 Algorithm and Use of Its Output An SMC algorithm requires one to specify an importance density M1 (x1 ) on X in order to initialise the recursion at time 1 and a family of transition kernels with associated densities {Mn (xn−1 , xn ) , n = 2, ..., p} in order to extend xn−1 ∈ X n−1 by sampling xn ∈ X conditional upon xn−1 at time instants n = 2, ..., p. Guidelines on how to best select {Mn (xn−1 , xn )} are well known, and the main recommendation is that an approximation of the conditional density πn ( xn | xn−1 ) should be used [27], [61]. As mentioned earlier an SMC algorithm also involves a resampling procedure of the N particles, which relies on a family of probability distributions on {1, . . . , N }N , {r( ·| w), w ∈ [0, 1]N }. The resampling step is usually necessary as in most applications the variance of the importance weights would otherwise increase over time. This is an undesirable feature which rapidly manifests itself in practice in that the majority of particles have negligible weights. The algorithm proceeds as shown in Algorithm 2.3 in order to produce a sequence of samples {Xin , i = 1, ..., N } for n = 1, . . . , p. Note that in order to alleviate the notational burden we adopt below the convention that whenever the index i is used we mean “for all i ∈ {1, ..., N }.” Further on, we also use the standard convention whereby capital letters are used for random variables while lower case letters are used for their values. We have used the notation Wn := Wn1 , ..., WnN i and An := A1n , ..., AN ole in our forn . The variable An−1 plays an important rˆ mulation of SMC methods, and represents the index of the “parent” at time n − 1 of particle Xin for n = 2, . . . , p (see Figure 2.1). The vector An is thus a random mapping defined on {1, . . . , N } → {1, . . . , N }N , and the standard resampling procedure is hence interpreted here as being the operation by which child particles at time n choose their parent particles at time n − 1 according to a probability r(·|Wn−1 ) dependent on the parents’ weights Wn−1 , or “fitness.” Let Oni := N k=1 I Akn = i be the number of offspring of particle i at time n. A desirable property of a resampling scheme is that it satisfies the following 8 Algorithm 2.3: Sequential Monte Carlo Algorithm 1 2 3 At n = 1 Sample Xi1 ∼ M1 (·) Update and normalise the weights w1 Xi1 := 4 5 γ1 (Xi1 ) , W1i = M1 (Xi1 ) w1 Xi1 N k=1 w1 7 . For n = 2, ..., p do Sample An−1 ∼ r (·|Wn−1 ) Ai Ai 6 Xk1 n−1 n−1 , Xni ) , ·) and set Xin = (Xn−1 Sample Xni ∼ Mn (Xn−1 Update and normalise the weights γn Xin wn Xin := Ain−1 Xn−1 γn−1 Mn Ain−1 , Xni Xn−1 wn Xin Wni = N k k=1 wn (Xn ) , (2.2) (2.3) unbiasedness condition E Oni |Wn = N Wni . (2.4) The unbiasedness condition (2.4) ensures that future computational effort is concentrated on the most promising particles, while guaranteeing that the SMC algorithm described earlier yields consistent approximations of the distributions {πn (dxn ) , n = 1, . . . , p} and of their normalising constants {Zn , n = 1, . . . , p}. In particular, conditional upon the sampled particles, an approximation of the target distribution π (dx) is given by N π ˆ N (dx) := Wpi δXip (dx) , i=1 9 (2.5) π ˆ1 (dx1 ) X11 π ˆ2 (dx1:2 ) X21 π ˆ3 (dx1:3 ) X31 π ˆ4 (dx1:4 ) X41 X12 A1 1=3 A1 2=2 A1 3=2 X22 X32 X42 X13 A2 1=6 A2 2=4 A2 3=2 X14 A3 1=6 X23 A3 2=4 X33 X43 A3 3=4 X24 X34 X44 X15 A4 1=3 A4 2=3 A4 3=6 X25 X35 X45 X16 A5 1=1 A5 2=5 A5 3=4 X26 X36 X46 A6 1=5 A6 2=4 A6 3=1 Figure 2.1: Illustration of particle ancestry in SMC. from which expectations can be easily computed, but also an estimate of its normalising constant Z p Zˆ N := n=1 1 N N wn Xin . (2.6) i=1 The derivation for the estimate of the normalising constant is as follows: p Z := Zp = Z1 n=2 Zn Zn−1 (2.7) where Z1 and the ratio of normalising constants Zn /Zn−1 are given by Z1 = γ1 (x1 ) M1 (x1 ) dx1 = M1 (x1 ) 1 wn (x1 ) M1 (x1 ) dx1 ≈ N 10 N w1 Xi1 i=1 1 Zn = Zn−1 Zn−1 γn (xn ) dxn = γn (xn ) πn−1 (xn−1 ) dxn γn−1 (xn−1 ) = γn (xn ) πn−1 (xn−1 )Mn (xn−1 , xn ) dxn γn−1 (xn−1 )Mn (xn−1 , xn ) = wn (xn ) πn−1 (xn−1 )Mn (xn−1 , xn ) dxn ≈ 1 N N wn Xin i=1 Now substituting the Monte Carlo estimates of these integrals into Eqn. 2.7 yields the estimate of the normalising constant given in Eqn. 2.6. 2.2.2 Implementation Issues In fact in practice, for computational efficiency, On is drawn first (i.e. without explicit reference to An ) according to a probability distribution s(·|Wn ) such that (2.4) holds and the offspring then matched to their parents. For example, the simplest unbiased resampling algorithm consists of sampling On according to a multinomial distribution of parameter (N, Wn ). More sophisticated schemes such as residual resampling [61], stratified resampling [57] and minimum entropy resampling [23] also satisfy (2.4). Once On has been sampled, this is followed by a deterministic allocation procedure of the child particles to the parents, which defines “computer” indices e.g. the On1 first child particles are associated to the parent O1 particle number 1, i.e. A1n = 1, ..., An n = 1, likewise for the On2 following child O1 +1 particles and the parent particle number 2, i.e. An n 2 O1 +On = 2, ..., An n = 2 etc. Further on, we will impose the slightly stronger unbiasedness condition. (A1) For any i = 1, . . . , N and n = 1, . . . , p the resampling scheme satisfies E Oni |Wn = N Wni and r Ain = k|Wn = Wnk . (2.8) Note that even if (2.4) holds then (2.8) is not necessarily satisfied, for example by the standard deterministic allocation procedure, but this property can be easily enforced by the addition of a random permutation of these “computer” indices. As 11 we shall see our indexing system makes the writing of the probability distributions underpinning our algorithms extremely simple. 2.2.3 Discussion of Convergence Results The theoretical properties of SMC are now well understood, and we restrict ourselves to some of the simplest assumptions required to ensure the validity of our algorithms; see [64] for a full treatment and more sophisticated assumptions. The following notation will be needed Sn = {xn ∈ X n : πn (xn ) > 0} , Q1 = {x1 ∈ X : M1 (x1 ) > 0} , Qn = {xn ∈ X n : πn−1 (xn−1 ) Mn (xn−1 , xn ) > 0} for n ≥ 2. (A2) For n = 1, ..., p, we have Sn ⊆ Qn . (A3) For any n = 1, ..., p, there exists {Bn < ∞} such that for any xn ∈ Sn wn (xn ) ≤ Bn . (2.9) (A4) There exist µ (·) a probability density on X and 0 < w, w, ε, ε < ∞ such that for any n = 1, ..., p and any xn ∈ Sn , w ≤ wn (xn ) ≤ w and ε µ (xn ) ≤ Mn (xn−1 , xn ) ≤ ε µ (xn ) . Under assumption (A3) it is possible to establish results such as Lp bounds, show that π ˆ N (dx) converges (weakly) almost surely (a.s.) as N → ∞ towards π (dx) and that Zˆ N converges a.s. to Z. Under additional mixing assumptions on the Feynman-Kac semi-group associated to {πn } and {Mn }, stronger results can be established. In particular, it can be shown using a proof similar to [18, Theorem 5], [64, Section 7.4.3] that for multinomial and residual resampling Zˆ N satisfies a central limit theorem and that, under (A4), there exists a finite constant 12 C, depending on w, w, ε, ε, µ and the dimension of X but not p, such that the variance var (N ) of Zˆ N /Z satisfies for any N ≥ 1 var (N ) ≤ Cp N (2.10) The so-called degeneracy phenomenon for SMC algorithms is well documented, and manifests itself in practice by the fact that the approximation π ˆ N (dxn ) of the marginal π(dxn ) of π (dx) is concentrated on a very limited number of particles whenever p − n is large for a given number N of particles; this is the consequence of the successive resampling steps between times n and p. As a result the approximation of π ˆ N (dxn ) is poor and cannot be used for inferential purposes. This however does not necessarily mean that the distribution of the surviving path(s) is not “close” to π (dxn ). Under (A4), it can in fact be established [64, Section 8.3] that there exists a finite constant D, depending on w, w, ε, ε but not p, such that the unconditional distribution of a sample from π ˆ N (dx), denoted q N (dx), satisfies q N (·) − π (·) ≤ Dp , N (2.11) where · is the total variation norm. Note that the constants C and D above typically increase with the dimensionality of X and decrease as the ergodic properties of the Feynman-Kac semi-group improve. These results suggest that, under sufficient mixing conditions, the performance of SMC degrades “gracefully” linearly and not exponentially with p as is often the case with other sampling strategies. Note however, that (A4) is a very strong assumption not satisfied in most realistic scenarios and that it is difficult in practice to obtain useful quantitative bounds on C and D that could guide the choice of N . This generally results in a bias of unknown magnitude. Nevertheless, empirical evidence from numerous researchers suggests that SMC can be designed to produce random samples whose distribution is “close” to π even for large p in realistic scenarios where (A4) is not satisfied. This motivates using samples provided by an SMC algorithm as a proposal to feed a MH based algorithm in order to correct for the fact that q N (dx) = π (dx). As we shall see in the next section, this is not straightforward. 13 Figure 2.2: Construction of the proposal in CBMC. For each component several candidates are sampled, of which one is selected with probability proportional to it’s importance weight. 2.3 Configurational Bias Monte Carlo We now briefly introduce the configurational bias Monte Carlo (CBMC) algorithm, which is one of the methods used to tackle problems that we are interested in, as mentioned in the introduction, and we will use it to compare against the performance of PMCMC. CBMC is based on the scheme introduced by Rosenbluth and Rosenbluth in 1955 [70] and was developed by J. Siepmann to sample the conformational space of linear chain molecules [73]. In this section we will briefly introduce the basic CBMC algorithm. We will later compare the performance of our new method with this algorithm. CBMC is a Markov chain Monte Carlo algorithm where the proposal is built up sequentially. In the case of a long molecule – or high dimensional state space – only a subset of the variables may be updated at an iteration. As in SMC, we need to define a sequence of distributions πn (·) (n = 1, . . . , p) of increasing dimension. In the case of sampling chain molecules, this sequence is naturally given by the Boltzmann distributions obtained when growing the chain. The proposal configuration is built as follows. For each variable (element in the chain) a population of candidate values are sampled, of which one is selected with probability proportional to its importance weight. This yields a candidate chain X∗ with the importance weight Zˆ N,∗ given in Eqn. 2.12, as illustrated in Figure 2.2. To compute the importance weight Zˆ N for the current configuration, this procedure is repeated while ensuring that one of the candidates matches the current configuration X. The candidate is then accepted with probability 1 ∧ Zˆ N,∗ /Zˆ N . The details are given in Algorithm 2.4. This algorithm is quite greedy (as illustrated in Figure 2.2), and hence only 14 works for moderately large chains. Several other extensions to the Rosenbluth method have been proposed [40], such as PERM [51], DPERM [19], and recoilgrowth [20][40, Sec. 13.7]. PERM, like the Rosenbluth method, is a static Monte Carlo method, and addresses the weight degeneracy problem of the Rosenbluth method by pruning conformations with weights below some preset threshold and replicating ones with weights above some threshold (adjusting the weights accordingly). DPERM is the dynamic (Markov chain) generalisation of PERM. However, the performance of (D)PERM is quite sensitive to the thresholds [19], and unlike SMC, the particles (conformations) are not interacting. The recoil-growth (RG) method uses a multi-step look-ahead in an effort to overcome the dead-alley problem of CBMC, which potentially spends much of the simulation time exploring dead ends or evaluating importance weights of low probability candidates. Instead of generating N candidates at each step, RG proposes something similar to a “depth-first” search (of finite depth) to find “open” and “closed” trial directions, where “open” denotes directions that are deemed to have non-zero probability of generating a complete chain (at least for the finite depth look ahead), and “closed” indicates that the trial direction will result in a dead-alley, e.g. as in sampling selfavoiding walks on a grid. 2.4 Inference in State-Space Models We now take a look at how MCMC and SMC are applied to state-space models (SSMs), and motivate the need for PMCMC. SSM is a very popular class of models and has broad application in many disciplines; for a thorough discussion of SSM see [13, 29, 41]. Some examples include target tracking models [50], changepoint models [36], population dynamics models [12], stochastic volatility models (Sec. 4.5), partially observed diffusions [37], and models appearing in systems biology (Sec. 4.4). 2.4.1 Model Description We consider the following SSM, also known as a hidden Markov model (HMM), where {Xn }n≥1 is an unobserved Markov process with initial density X1 ∼ µθ (·) 15 Algorithm 2.4: Configurational Bias Monte Carlo 1 2 3 4 5 6 Initialise X(0) (arbitrarily) For iteration s ≥ 1 Sample proposal X∗ and compute importance weight: At n = 1 For k = 1, . . . , N , sample X1k,∗ ∼ M1 (·) Compute the weights: w1 Xk,∗ := γ1 Xk,∗ /M1 X1k,∗ , 1 1 and normalise W1k,∗ = w1 Xk,∗ / 1 7 8 9 10 12 := Compute the weights wn Xk,∗ n Sample Xn∗ ∼ π ˆnN (dx) ∗ ∗ Xn = (Xn−1 , Xn∗ ) 16 (dx) γn Xk,∗ n ” “ ” k,∗ Xk,∗ M n Xn−1 ,Xn n−1 (dx) and set Compute importance weight: n=1 15 “ γn−1 “ ” k,∗ wn Xn “ ” N k,∗ k=1 wn Xn k,∗ := N k=1 Wn δXk,∗ n p 14 Xk1 For n = 2, . . . , p do For k = 1, . . . , N , sample Xnk,∗ ∼ Mn (X∗n−1 , ·) and set k,∗ Xk,∗ n = (Xn−1 , Xn ) “ ” Zˆ N,∗ = 13 k,∗ N k=1 W1 δXk,∗ 1 Sample X∗1 ∼ π ˆ1N (dx) := and normalise Wnk,∗ = 11 N k=1 w1 1 N N wn Xk,∗ n (2.12) k=1 Sample auxiliary variables and compute importance weight: At n = 1 Set X11 := X1 (s) For k = 2, . . . , N , sample X1k ∼ M1 (·) Compute the weights w1 Xk1 as above 19 For n = 2, . . . , p do Set Xn1 := Xn (s) For k = 2, . . . , N , sample Xnk ∼ Mn (Xn−1 (s), ·) and set Xkn = (Xn−1 , Xnk ) Compute the weights wn Xkn as above 20 Compute importance weight: Zˆ N = 17 18 21 p n=1 1 N N k=1 wn Xkn With probability 1 ∧ Zˆ N,∗ /Zˆ N (i − 1) set X(i) = X∗ , otherwise set X(i) = X (i − 1) 16 X1 X2 X3 Y1 Y2 Y3 ... XT YT Figure 2.3: State-space model, also known as a hidden Markov model (HMM). The Markov process {Xn }n≥1 is not observed directly but through a conditionally independent observation process {Yn }n≥1 . and transition probability density (with n > 1) Xn | Xn−1 ∼ fθ ( ·| Xn−1 ) where θ is a static parameter of the model. The process {Xn }n≥1 cannot be observed directly, but only through observations {Yn }n≥1 , which are assumed independent conditional upon {Xn }n≥1 with Yn | Xn ∼ gθ ( ·| Xn ) . The parameter θ may also be unknown, in which case it follows the prior distribution p (θ). The goal is now to perform Bayesian inference in this context. First consider the case where the static model parameter θ is known. Given some observations y1:T inference relies on the following posterior density T pθ ( x1:T | y1:T ) ∝ µθ (x1 ) gθ ( y1 | x1 ) fθ ( xn | xn−1 ) gθ ( yn | xn ) . (2.13) n=2 If we do not know θ, we assign a prior density p(θ) to θ and perform Bayesian inference using the joint posterior p (θ, x1:T | y1:T ) ∝ p(θ) pθ ( x1:T | y1:T ) (2.14) When the model is non-linear non-Gaussian, the posterior densities pθ ( x1:T | y1:T ) and p (θ, x1:T | y1:T ) do not admit standard forms. This makes inference difficult and we need to resort to approximations. Here we consider Monte Carlo methods, 17 which have shown to be a flexible and efficient tool for inference in these types of models. As SSMs are ubiquitous in many areas of science there are literally thousands of papers published on this in the past decade alone. Instead of a thorough review, we instead highlight the underlying principles of applying MCMC and SMC in this context, as well as point out some of their limitations, which will motivate the use of PMCMC. 2.4.2 Markov Chain Monte Carlo for State-Space Models In order to perform inference in SSM using Monte Carlo (MC) we need to be able to generate samples from the posteriors pθ ( x1:T | y1:T ) or p (θ, x1:T | y1:T ) in the case of fixed (known) or unknown parameter θ, respectively. It is generally impossible to sample exactly from pθ ( x1:T | y1:T ), except for linear Gaussian models and finite state-space HMMs. As outlined in Section 2.1 the Metropolis Hastings (MH) sampler makes use of a proposal distribution q(x1:T , x∗1:T ) to generate candidates x∗1:T , which are then accepted or rejected with some probability. The high dimensionality of x1:T makes it impossible to design good proposal distributions to update all states jointly, and a popular strategy consists of updating only a subset of components at a time. For example we can divide the T components into adjacent blocks of length K and iteratively update all the blocks, conditioned on the components outside the current block and the observations. A block xn:(n+K−1) is then updated according to an MCMC step of invariant density n+K pθ xn:(n+K−1) |y1:T , x1:(n−1) , x(n+K+1):T ∝ n+K−1 fθ (xk |xk−1 ) k=n gθ (yk |xk ) k=n (2.15) As long as K is not too large, it may be possible to design good proposals to be used in a MH update. For example if the transition density fθ is linear Gaussian and the observation density gθ is log-concave, then we may design a proposal density based on a Gaussian approximation of Eqn. 2.15, as illustrated in [71]. However, as K increases a Gaussian approximation may become insufficient. Even seemingly benign differences between the proposal and target distribution in low dimension can accumulate in higher dimensions and result in the acceptance rate dropping to zero, which limits the number of components K that can be updated simultane18 ously. This is a serious limitation, as it can severely constrain and slow down the exploration of the support of pθ (x1:T |y1:T ) and introduce dependence between the samples produced by the MCMC algorithm. If the parameter is unknown we need to be able to sample from the joint density p (θ, x1:T | y1:T ). A common approach is to alternately update the state components x1:T conditional on θ and the parameter θ conditional upon x1:T . Sampling exactly from p ( θ| y1:T , x1:T ) can often be performed efficiently due to its small or moderate size. However, when the parameter θ and states x1:T are strongly correlated, this approach can result in poor mixing and thus yield an inefficient algorithm. In this case we would like to be able to update the parameter and states jointly. However, this would require sampling all the states jointly, which in general we cannot do as outlined above. 2.4.3 Sequential Monte Carlo for State-Space Models The SMC algorithm decomposes the problem of sampling from pθ ( x1:T | y1:T ) into a series of simpler sub-problems, as illustrated in Section 2.2, where we define the sequence of distributions {πn (xn ) = pθ ( x1:n | y1:n ) , n = 1 . . . , p = T } , as well as transition densities M1 (x1 ) = qθ ( x1 | yn ) and Mn (xn−1 , xn ) = qθ ( xn | xn−1 , yn ) , for n = 2 . . . , p = T . The incremental particle weights (Eqn. 2.2) in this case are: w1 (x1 ) = µθ (x1 ) gθ ( y1 | x1 ) qθ ( x 1 | y 1 ) and wn (xn ) = fθ ( xn | xn−1 ) gθ ( yn | xn ) qθ ( xn | xn−1 , yn ) This allows us to efficiently generate a population of samples (particles) from pθ ( x1:T | y1:T ). However, SMC also suffers from well-known drawbacks. If T is too large, then the SMC approximation of the joint density deteriorates as components at earlier times are not rejuvenated at subsequent time steps. This is due to 19 the resampling which diminishes the diversity of the samples at earlier times. As a result the approximation of the marginals pθ (xn |y1:T ) is likely very poor when T − n is large. This is also the reason why it is very difficult, although possible [2, 31, 43], perform inference on static parameters using SMC, i.e. to sample from the joint distribution p (θ, x1:T | y1:T ). This naturally suggest to combine MCMC with SMC and leverage the strengths of each by using SMC to build an approximation of the joint posterior distribution which can then be used to generate proposals to be used in MCMC. In the next Chapter we will present the framework to do this and also perform parameter estimation, i.e. to sample from the joint posterior. Chapter 4 demonstrates this novel methodology on several applications, including SSMs. 20 Chapter 3 Particle Markov Chain Monte Carlo As discussed previously, a simple independent MH update which leaves π (x) invariant requires a proposal density q(x). In order to sample a realisation {X(i)} of the associated Markov chain, the MH update proceeds as follows at iteration i: (a) sample X∗ ∼ q (·), (b) set X(i) = X∗ with probability 1∧ π (X∗ ) q (X(i − 1)) , π (X(i − 1)) q (X∗ ) otherwise set X(i) = X(i−1). We could suggest using as proposal density q (x) = q N (x), i.e. the density of a particle generated by an SMC algorithm targeting π(x). This is likely to result in an efficient independent MH algorithm following the discussion of the previous section. This would however require being able to evaluate q N (x) in order to compute the acceptance ratio above. This quantity is unfortunately not available in closed-form. Indeed, for n = 1, ..., p let us denote Xn := Xn1 , ..., XnN ∈ X N the set of N simulated X -valued random variables at time n, then it is not difficult to establish that the joint density of all the variables 21 generated by the SMC algorithm described in Section 2.2 is ψ (¯ x1 , ..., x ¯p , a1 , ..., ap−1 ) = ψ (¯ x1 ) p n=2 ψ (an−1 |xn−1 ) × ψ (¯ xn |¯ xn−1 , an−1 ) p N M1 xi1 = i=1 N ai n−1 Mn (xn−1 , xin ) r (an−1 |wn−1 ) n=2 (3.1) , i=1 (3.2) which is defined on X pN × {1, . . . , N }(p−1)N . Now, we deduce that the marginal distribution of a particle drawn according to π ˆ N (dx) given in (2.5) is, denoting Eψ the expectation with respect to ψ, q N (dx) = Eψ π ˆ N (dx) = Eψ N i i=1 Wp δXip (dx) , which cannot be computed analytically in most situations of interest. In the next section we present a general methodology to circumvent this difficulty. 3.1 Particle Metropolis-Hastings Sampler In order to illustrate the simplicity of the implementation of our approach we first describe a particular instance of the methodology in order to sample from π(x), where x is updated in one single block. The particle Metropolis Hastings (PMH) algorithm by itself does not offer any advantage over SMC, but would instead be used as a component of more complex MCMC algorithms. For example in the protein folding application (Sec. 4.2) we use PMH in simulated annealing to perform block updates. We can show that this particular MCMC update is nothing but an independent MH sampler with an auxiliary target distribution defined on an extended space with the output of an SMC algorithm as a proposal. However, the principle underlying the construction of the auxiliary target distribution is not classical. 3.1.1 Algorithm In order to sample from π(x) the particle Metropolis Hastings (PMH) sampler proceeds as shown in Algorithm 3.1 (with the notation of Section 2.2, in particu22 Algorithm 3.1: Particle Metropolis Hastings Sampler 1 2 3 4 5 6 Initialisation i = 0 Run an SMC algorithm targeting π(x) sample X(0) ∼ π ˆ N (·) and compute Zˆ N (0) For iteration i ≥ 1 Run an SMC algorithm targeting π(x), sample X∗ ∼ π ˆ N (·) and compute Zˆ N,∗ With probability Zˆ N,∗ , 1∧ Zˆ N (i − 1) (3.3) set X(i) = X∗ and Zˆ N (i) = Zˆ N,∗ , otherwise set X(i) = X (i − 1) and Zˆ N (i) = Zˆ N (i − 1) lar (2.5) and (2.6)). Note that the acceptance probability (3.3) is independent of X∗ which makes it possible to sample X∗ only once the move is accepted. The acceptance probability (3.3) also enjoys the attractive property that, under (A3), it converges to 1 as N → ∞ as both Zˆ N,∗ and Zˆ N (i − 1) are consistent estimates of Z, the unknown normalising constant of π(dx). In addition, under mixing assumptions of the Feynman-Kac semi-group, the variance of this acceptance ratio can be shown to be proportional to p/N . 3.1.2 Proof of Validity In this section we establish the validity of the PMH algorithm by showing that it is a standard independent MH update with specific target and proposal distributions defined on an extended space (Theorem 1). This results in straightforward convergence properties, see Theorem 2. At first sight, one might think that the sequence {X (i)} generated by the PMH will have π (dx) as the desired equilibrium distribution only when N → ∞. We can show that this is in fact the case for all N ≥ 1. The key to establish this result is to reformulate the PMH as a standard independent MH sampler defined on an extended state-space with a suitable invariant distribution. In the following we will need the following notation related to particle ances23 π ˆ1 (dx1 ) X11 π ˆ2 (dx1:2 ) X21 π ˆ3 (dx1:3 ) X31 π ˆ4 (dx1:4 ) X41 X12 A1 1=3 A1 2=2 A1 3=2 X22 X32 X42 X13 A2 1=6 A2 2=4 A2 3=2 X23 X33 X43 X14 A3 1=6 A3 2=4 A3 3=4 X24 X34 X44 X15 A4 1=3 A4 2=3 A4 3=6 X25 X35 X45 X16 A5 1=1 A5 2=5 A5 3=4 X26 X36 X46 A6 1=5 A6 2=4 A6 3=1 Figure 3.1: Illustration of running an SMC algorithm and selection of a path {X16 , X23 , X34 , X43 } (yellow) to use as a proposal within PMCMC. The arrows show the resampling and resulting assignment of ancestry variables Ain . tries. Assume that we have selected particle Xkp at time p. For k = 1, . . . , N and n = 1, . . . , p, let ikn denote the index of the ancestor particle of xkp at generation n. More formally we define ikp := k, ikp−1 := akp−1 (with the notation of Section ik 2.2) and for n = p − 1, . . . , 1 we have the backward recursive relation ikn := ann+1 . ik ik ik ik p−1 As a result we can rewrite xkp = (x11 , x22 , . . . , xp−1 , xpp ). The matrix a = {akn } is sampled row by row at the resampling steps of the SM C algorithm and allows us to represent the ancestries of all Xni (e.g. see arrows in Figure 3.1 which point from parent to offspring), while ik lets us easily index the path of particle k. For example in Figure 3.1 the highlighted path is i34 = 3, i33 = 4, i32 = 3, i31 = 6. Theorem 1 Assume (A1 )-(A2) then for any N ≥ 1 the PMH sampler is an inde- 24 pendent MH sampler defined on the extended space defined on X pN × {1, . . . , N }(p−1)N +1 , with target density π ˜ N (k, x ¯1 , ..., x ¯p , a1 , . . . , ap−1 ) = 1 N p M (xik1 ) 1 1 (3.4) π xkp ikn−1 ikn p k n=2 r(in−1 |wn−1 )Mn (xn−1 , xn ) ψ (¯ x1 , ..., x ¯p , a1 , ..., ap−1 ) , with ψ defined in (3.1), and proposal density q N (k, x ¯1 , ..., x ¯p , a1 , . . . , ap−1 ) := wpk × ψ (¯ x1 , ..., x ¯p , a1 , . . . , ap−1 ) , (3.5) where wpk is a realization of the normalised importance weight defined in (2.3). The proof is given below. Here xkp is distributed according to π(dx). As a K(i) result the sequence {X(i)} = {Xp (i)} is of interest, because from standard MH theory it will leave π(dx) invariant. Proof of Theorem 1. We can easily check that (3.4) and (3.5) are both positive and sum to one; the factor 1/N p corresponds to the uniform distribution on the set IK IK p , i.e. there are N p equiv{1, ..., N }p for the random variables K, A12 , . . . , Ap−1 alent orderings of the indices in the particle path. Now the acceptance ratio of an independent MH algorithm is known to depend on the following “importance weight” π ˜ N (k, x1 , ..., xp , a1 , . . . , ap−1 ) q N (k, x1 , ..., xp , a1 , . . . , ap−1 ) = 1 Np π xkp ik wpk × M1 (x11 ) p ik n=2 ik ik First we make use of (A1), i.e. r(ikn |wn ) = wnn = wn (xnn )/ 25 ik n−1 r(ikn−1 |wn−1 )Mn (xn−1 , xnn ) N i wn (xin ), with some abuse of notation. This yields: π ˜ N (·) 1 = p N q (·) N π xkp ik M1 (x11 ) p n=2 γ (xkp ) 1 Z Np = ik ik p n=2 N i=1 wn n=1 p M1 (x11 ) p ik n−1 Mn (xn−1 , xnn ) ik ik n−1 Mn (xn−1,n , xnn ) ik wnn n=1 xin p ik wn (xnn ) n=1 In the above manipulations we have also substituted γ xkp /Z for π xkp . γ (xkp ) 1 Z Np π ˜N (·) = N q (·) = ik1 M1 (x1 ) p ikn−1 n=2 p N i=1 wn n=1 ik ikn Mn (xn−1,n , xn ) γ1 (x11 ) xin p ik γn (xnn ) ik ik ik k n−1 n−1 in M1 (x11 ) n=2 γn−1 (xn−1 )Mn (xn−1 ,xn ) Zˆ N Z The final result is obtained thanks to the definitions of the incremental weights (2.2) w1 xi1 := γ1 (xi1 ) M1 (xi1 ) and γn xin wn xin := γn−1 ikn−1 xn−1 Mn ikn−1 , xin xn−1 , and of the normalising constant estimate (2.6) p Zˆ N := n=1 1 N N wn xin . i=1 It should now be clear that the PMH algorithm described above corresponds to sampling particles according to q N defined in (3.5) and that the acceptance probability (3.3) corresponds to that of an independent MH algorithm with target distribution π ˜ N given by (3.4). As a result, Theorem 1 is powerful since by showing that our PMH algorithm is a MH algorithm in disguise, it allows us to use standard results concerning the 26 convergence properties of the independent MH algorithm to prove the following theorem. Let LN (X (i) ∈ ·) denote the distribution of X (i) generated by the PMH algorithm with N ≥ 1 particles. Theorem 2 Assume (A1)-(A2) then for any N ≥ 1 the PMH sampler generates a sequence of probability distributions LN (X (i) ∈ ·) such that LN (X (i) ∈ ·) − π(·) → 0 as i → ∞ . In addition, under (A3), there exists ρ ∈ [0, 1) such that for any i ≥ 1 and N ≥ 1, LN (X (i) ∈ ·) − π(·) ≤ ρi . The proof can be found in Appendix A. Note that the second result might appear negative since from the proof we do not observe an improvement on the rate of convergence of the algorithm as N increases. However, as a particular case of [1, Theorem 1] it is possible to show that for any , η > 0 there exists N0 such that for any N ≥ N0 and any i ≥ 1 LN ∗ (X (i) ∈ ·) − π(·) ≤ with ψ−probability larger than 1−η, where LN ∗ (X (i) ∈ ·) denotes the conditional distribution of X (i) conditional upon the random variables generated at iteration 0 by the SMC algorithm. 3.1.3 Extensions and Variations As pointed out earlier, Theorem 1 is fundamental since it highlights the standard MH nature of the PMH algorithm and allows us to easily establish its validity. In this section we show how this property can also suggest further improvements. Using All the Particles One criticism of PMH is that a lot of work seems wasted by only selecting one candidate from the SMC proposal, especially considering that the importance weight (estimate of the normalisation constant) is independent of the selected path. We 27 propose to use all the particles in order to carry out inference using a strategy which shares some characteristics with [39]. The ‘standard’ estimate of Eπ (f ) for L MCMC iterations is 1 L L i=1 f (X(i)). We will show in this thesis that the following estimator, which utilises all the particles, converges towards Eπ (f ) as L → ∞ for any N ≥ 1: Theorem 3 Assume (A1)-(A2) and Eπ (|f |) < ∞. Then for any N ≥ 1, 1. the estimate 1 L L N k Wpk (i) f (X1:p (i)) i=1 (3.6) k=1 k (i)} converges almost surely towards Eπ (f ) as L → ∞ where {Wpk (i) , X1:p corresponds to the set of normalised weights and particles used to compute Zˆ (i), ∗k (i), k = 1, . . . , N } the set of proposed particles at 2. denoting {Wp∗k (i) , X1:p iteration i (i.e. before deciding whether or not to accept this population) 1 L L i=1 + Zˆ N,∗ (i) 1∧ Zˆ N (i − 1) 1−1∧ N ∗k Wp∗k (i) f (X1:p (i)) (3.7) k=1 N Zˆ N,∗ (i) Zˆ N (i − 1) k Wpk (i − 1) f (X1:p (i − 1)) (3.8) k=1 converges almost surely towards Eπ (f ) as L → ∞. The proof is given in Appendix A. The estimate (3.6) is an average of importance sampling estimates. Each of these estimates is biased but the MH mechanism allows us to obtain asymptotically consistent estimates by rejecting some of the populations of particles proposed. Following [39] is also possible to propose an estimate which recycles the candidate populations of particles rejected by the PMH. Subblock Updates of π We have focused on the case where all the components of x = (x1 , ..., xp ) are updated simultaneously. Again, if p is too large compared to the number N of par28 ticles, then the particle approximation of π produced by the SMC algorithm might be poor, resulting in an inefficient MH algorithm. In such situations we can suggest the use of mixtures/compositions of PMH transition probabilities that update subblocks of the form xa:b for 1 ≤ a < b ≤ p, effectively targeting conditional distributions of the type π (xa:b |x−a:b ). The theoretical results of Section 2.2 suggest that under favourable conditions the number of particles required to approximate π (xa:b |x−a:b ) for a given precision decreases linearly with the size b − a + 1 of the subblock. For such a strategy there is a tradeoff between the accuracy of the proposal distribution for the conditional update, and the dependence structure on the complementary block which might result in long correlation time in the sequence {X(i)}. Partial Particle Updates Although the structure of the target distribution π ˜ N naturally lends itself to full updates of all the particles using SMC algorithms we might consider rejuvenating some of the particles conditional upon the others for computational purposes. For example, given xkp and its “ancestral lineage” (ik1 , ik2 , . . . , ikp ) which is such that ik ik ik ik p−1 , xpp ) then we could propose to update the N − 1 remainxkp = (x11 , x22 , . . . , xp−1 ing particles and their “ancestral lineages” according to a Gibbs step under π ˜N ; this “conditional SMC sampling” strategy is detailed in Section 3.2. Any classical MCMC updates that leave π ˜ N invariant can also obviously be used. Another possible suggestion in order to fight the aforementioned degeneracy for large p’s is to use possibly standard MCMC transition probabilities in order to update subblocks xa:b of x for 1 ≤ a < b ≤ p such that xa:b corresponds to a region of x where particle depletion is severe. 3.2 Particle Gibbs Sampler Assume now that we are interested in sampling from a distribution π (θ, x) = 29 γ (θ, x) Z (3.9) where γ : Θ × E → R+ is known pointwise. We show here again how SMC algorithms can be used as a component of an MCMC algorithm, namely a Gibbs sampler, to sample from distributions of the type π(θ, x) in situations where sampling from π(x|θ) is difficult without resorting to SMC methods. A standard strategy to sample from π (θ, x) consists of using the Gibbs sampler; that is sampling iteratively from the full conditionals π (θ|x) and π (x|θ). In numerous situations it is possible to sample exactly from π (θ|x), and we will assume here that this is the case. Otherwise an MH update of invariant density π (θ|x) might be used. For models of practical interest x can be high dimensional (e.g. a vector of latent variables of the size of a large dataset) and the conditional distribution π (x|θ) non-standard, precluding practical exact sampling. We have π (x|θ) = where γ (θ) := E γ (θ, x) , γ (θ) (3.10) γ (θ, x) dx is assumed unknown. It is therefore natural to suggest the use of an SMC algorithm in order to propose approximate samples from this conditional distribution. Hence we consider a family of bridging distributions {πn (xn |θ) ; n = 1, . . . , p} where πn ( xn | θ) = γn (θ, xn ) , Znθ (3.11) such that πp (xp |θ) = π (x|θ) and a family of transition probability densities {Mnθ (xn−1 , xn )} that defines sampling of xn ∈ X conditional upon xn−1 ∈ X n−1 ; note that γp (θ, xp ) = γ (θ, x) and Zpθ = γ (θ). 3.2.1 Algorithm The particle Gibbs (PG) sampler is an approximation of the “ideal” Gibbs sampler where we approximately sample from π (x|θ) using an SMC algorithm. Contrary to the PMH algorithm, sampling from the approximation π ˆ N (dx|θ) of π (x|θ) using the PG algorithm requires us to keep track of the “ancestral lineage” I := (I1 , I2 , . . . , Ip ) of the random variable X which, we point out again, is such that 30 I X = (X1I1 , ..., Xpp ). The PG sampler is shown in Algorithm 3.2. Algorithm 3.2: Particle Gibbs Sampler 1 2 3 4 5 6 7 8 Initialisation i = 0 Set randomly θ (0) Run an SMC algorithm targeting π (x|θ (0)) Sample X (0) ∼ π ˆ N (·|θ (0)) and denote I (0) its ancestral lineage. For iteration i ≥ 1 Sample θ (i) ∼ π (·|X (i − 1)) Run a conditional SMC algorithm for θ (i) consistent with X (i − 1) , I (i − 1) Sample X (i) ∼ π ˆ N (·|θ (i)) and denote I (i) its ancestral lineage The conditional SMC step is the non-standard step of the algorithm which, I given (θ, I, X = Xpp ), yields an SMC approximation of π (x|θ) by using the “frozen” particle X and sampling N − 1 “free” particles consistent with this particle and its ancestral lineage. This is shown in Algorithm 3.3, where we have used the notation −In n An−1 }. = An−1 \{AIn−1 In n To sample from the discrete-valued conditional distribution r(a−I n−1 |wn−1 , an−1 ), we can use the following procedure. In ≥1 . 1. Sample the number of offspring On−1 ∼ s ·| Wn−1 , On−1 2. Sample the indices of the N − 1 “free” offspring uniformly on the set {1, ..., N } \ {In }. In ≥1 When sampling from s ·| Wn−1 , On−1 cannot be performed in closed- form, we can use a rejection sampling procedure by sampling On−1 ∼ s ( ·| Wn−1 ) In until On−1 ≥ 1. Note however that it is possible to sample directly from In s(·|Wn−1 , On−1 ≥ 1) in important cases. We consider multinomial and stratified resampling below. Remark 1 Note that the conditional SMC algorithm does not correspond to sampling from a conditional distribution of the distribution of the SMC algorithm ψ θ (given in (3.16)). However this becomes true as N → ∞. 31 Algorithm 3.3: Conditional Sequential Monte Carlo Algorithm 1 2 3 At n = 1 For i = I1 , sample Xi1 ∼ M1θ (·) Update and normalise the weights w1 Xi1 = 4 5 γ1 (θ, Xi1 ) , W1i = M1θ (Xi1 ) w1 Xi1 N k=1 w1 For n = 2, ..., p do In n Sample A−I n−1 ∼ r(·|Wn−1 , An−1 ) Ai Ai 6 7 . Xk1 n−1 n−1 , Xni ) , ·) and set Xin = (Xn−1 For i = In , sample Xni ∼ Mnθ (Xn−1 Update and normalise the weights γn θ, Xin wn Xin = Ai , Ai (3.12) n−1 n−1 , Xni γn−1 θ, Xn−1 Mnθ Xn−1 Wni = wn Xin N k k=1 wn (Xn ) . (3.13) Multinomial Resampling For multinomial resampling, we can factorise the probability of the number of offsprings as In In In In n P(On−1 |Wn−1 , On−1 ≥ 1) = P(On−1 |Wn−1 , On−1 ≥ 1) P(O−I n−1 |Wn−1 , On−1 ) We then have the following procedure to perform multinomial resampling: Algorithm 3.4: Conditional Multinomial Resampling 1 In In Sample On−1 ∼ B + N, Wn−1 2 i Compute W n−1 ∝ Wn−1 with i Wn−1 = 3 i i=In W n−1 = 1 and 1 In −1 In +1 N W n−1 , . . . , W n−1 , W n−1 , . . . , W n−1 −In Sample On−1 ∼ M N, Wn−1 32 set In In ) rewhere B + N, Wn−1 is the Binomial distribution of parameters (N, Wn−1 stricted to the support {1, 2, . . .}. Stratified Resampling Algorithm 3.5: Conditional Stratified Resampling 1 2 3 4 i,∗ Compute adjusted weights Wn−1 as given in Eqn.3.14 In ,∗ if Wn−1 ≥ 1/N then ∗ sample On−1 ∼ s ·| Wn−1 else 5 sample U1 uniformly on 6 compute Uj = U1 + j−1 N k,∗ In −1 k=1 Wn−1 , k,∗ In k=1 Wn−1 for j ∈ Z and set i i−1 k,∗ Wn−1 k,∗ ≤ Uj ≤ Wn−1 i On−1 = # Uj : k=1 . k=1 For the popular stratified resampling algorithm [57], we can use the method given in Algorithm 3.5. The condition of having at least one offspring of particle In changes the resample distribution and results in the following adjusted weights: In ,∗ Wn−1 = 3.2.2 In Wn−1 In N ) 1 − (1 − Wn−1 , In ,∗ 1 − Wn−1 i=In ,∗ i Wn−1 = Wn−1 In 1 − Wn−1 (3.14) Proof of Validity In this section we establish the validity of the PG algorithm under mild assumptions. The following notation will be needed, for any θ ∈ Θ, let Snθ = {xn ∈ X n : πn ( xn | θ) > 0} , Qθ1 = x1 ∈ X : M1θ (x1 ) > 0 , Qθn = xn ∈ X n : πn−1 ( xn−1 | θ) Mnθ (xn−1 , xn ) > 0 33 for n ≥ 2, and S = {θ ∈ Θ : π (θ) > 0} . (A5) For any θ ∈ S, we have Snθ ⊆ Qθn for n = 1, ..., p. (A6) The ‘ideal’ Gibbs (G hereafter) sampler defined by the conditionals π (θ|x) and π (x|θ) is irreducible and aperiodic (and hence converges for π-almost all starting points). We have the following result. Theorem 4 Assume (A1)-(A5), then for any N ≥ 2 the PG sampler defines an MCMC kernel on the extended space Θ × X pN × {1, . . . , N }(p−1)N +1 with target density π ˜ N (θ, k, x1 , ..., xp , a1 , . . . , ap−1 ) := π 1 N p M θ (xik1 ) 1 (3.15) θ, xkp k ikn p k θ in−1 n=2 r(in−1 |wn−1 )Mn (xn−1 , xn ) 1 ψ θ (x1 , ..., xp , a1 , ..., ap−1 ) , where ψ θ (x1 , ..., xp , a1 , ..., ap−1 ) := p N M1θ i=1 (3.16) xi1 N n=2 ai n−1 Mnθ xn−1 , xin r (an−1 |wn−1 ) . i=1 Additionally assume (A6) holds then for any N ≥ 2 the PG sampler generates a sequence {θ (i) , X (i)} whose marginal distributions {LN ((θ (i) , X (i)) ∈ ·)} satisfy LN ((θ (i) , X (i)) ∈ ·) − π(·) → 0 as i → ∞ . The proof can be found in Appendix A. 34 3.2.3 Extensions Alternative Moves The PG algorithm is an appealing algorithm which has the property that it converges to the standard Gibbs sampler as N → ∞. However, given p and for a finite N for which the degeneracy problem is severe, the fact that the PG sampler forces I the “frozen” path I, X = Xpp to survive during the conditional PMH step can have detrimental effects and result in a highly dependent Markov chain. Indeed, in such situations most of the particles at time p generated by the conditional SMC I algorithm coalesce with the ancestral lineage of the “frozen” path I, X = Xpp . To limit this problem, various alternatives to update of X (i) proposed above can be suggested. For example, we can suggest to update X (i) using the PMH algorithm, which yields Algorithm 3.6. Algorithm 3.6: Alternate Move for Particle Gibbs Sampler 1 2 3 4 5 For iteration i ≥ 1 Sample θ (i) ∼ π (·|X (i − 1)) Run a conditional SMC algorithm for θ (i) consistent with X (i − 1) , I (i − 1) and set γ N (θ (i)) = Z N Run an SMC algorithm targeting π (x|θ (i)), sample X∗ ∼ π ˆ N (·|θ (i)) and set γ N,∗ (θ (i)) = Z N With probability γ N,∗ (θ (i)) 1∧ N γ (θ (i)) set X(i) = X∗ and I(i) = I ∗ , otherwise set X(i) = X (i − 1) and I(i) = I (i − 1) Following the developments concerning the PG sampler, we can easily show that under mild assumptions (A1-A5) and for any N ≥ 1 this algorithm has the desired invariant distribution. 35 Optimisation In some situations we might be interested in maximising π (θ). As π (θ) is typically unknown, even up to a normalising constant, it is not possible to use an algorithm such as the simulated annealing algorithm. However, we can introduce the following distribution, defined on Θ × E m m πm θ, x1 , ..., xm ∝ i=1 π θ, xi (3.17) whose marginal distribution πm (θ) is proportional to [π(θ)]m [28]. This distribution concentrates itself on the set of global maxima of π (θ) as m increases. It is straightforward to apply a modified version of the PG sampler to sample from πm θ, x1 , ..., xm by sampling m independent conditional SMC algorithms. At iteration i, it proceeds as follows: Algorithm 3.7: Marginal Maximisation using Particle Gibbs 1 2 Sample θ (i) ∼ πm ·|X1 (i − 1) , ..., Xm (i − 1) For k = 1, ..., m Run a conditional SMC algorithm for θ (i) consistent with 3 Xk (i − 1) , I k (i − 1) Sample Xk (i) ∼ π ˆ N (·|θ (i)) and denote I k (i) its ancestral lineage 4 It can be shown under mild assumptions (A1-A5) and for any N ≥ 2 that the Markov kernel associated to this algorithm admits πm θ, x1 , ..., xm as invariant distribution. An alternative consists of considering the following extended distribution π ˜ N (θ, k1 , ..., km , x1 , ..., xp , a1 , . . . , ap−1 ) ∝ m i=1 N π θ, xkpi M1θ xi1 (3.18) k :k i=1,i=i1 1 m p × n=2 r −i k1 :km 1 an−1 k1 :km N ain−1 Mnθ xn−1 , xin n wn−1 , ain−1 k :km i=1,i=in1 36 where ka = kb for a = b and ikn1 :km = ikn1 , ..., iknm . Note that even if ka = kb , these paths might have a common ancestor; that is there exists n such that ikna = iknb . To sample from (3.18) and thus from πm θ, x1 , ..., xm , we could propose the following iterative algorithm. At each iteration, sample θ ∼ πm ·| xkp1 , ..., xkpm then run a conditional SMC algorithm for θ consistent with xkp1 , ..., xkpm and their ancestral lineages. Finally, sample (k1 , ..., km ) from π ˜ N ( k1 , ..., km | θ, x1 , ..., xp , a1 , . . . , ap−1 ) . However, in this scenario the full conditional distribution of (k1 , ..., km ) does not take a simple form because of the potential coalescences of the paths and an MH step is required. 3.3 Particle Marginal Metropolis-Hastings Sampler Assume we are interested in sampling from π (θ, x) defined in (3.9). In cases where x and θ are highly correlated, the Gibbs sampling strategy described in the previous section might be inefficient. In this case, we might want to sample directly from the marginal π (θ) using say an MH algorithm. However, if π (θ) is unknown even up to a normalising constant, this is impossible. We present here a particle approximation of this “ideal” marginal MH algorithm. Our algorithm is based on the following remark. Consider a MH algorithm of target density π (θ, x) and proposal density q ((θ, x) , (θ∗ , x∗ )) = q (θ, θ∗ ) π ( x∗ | θ∗ ) . Then the acceptance ratio is given by 1∧ π (θ∗ ) q (θ∗ , θ) π (θ∗ , x∗ ) q ((θ∗ , x∗ ) , (θ, x)) = 1 ∧ . π (θ, x) q ((θ, x) , (θ∗ , x∗ )) π (θ) q (θ, θ∗ ) Consequently, provided that it is possible to sample the component x from a good approximation of π ( x| θ) then the resulting MH algorithm will essentially approximate the “marginal” MH algorithm. 37 To build a proposal approximating π ( x| θ), we use an SMC method and introduce as in the PG sampler a sequence of distributions {πn ( xn | θ)} defined in (3.11) and some transition kernels {Mnθ (xn−1 , xn )}. The key point is that the normalising constant estimate obtained at time p is an estimate of γ (θ), the unnormalised version of π (θ), i.e. recall that γ(θ, x) = π(x|θ)γ(θ) from (3.10). 3.3.1 Algorithm Algorithm 3.8: Particle Marginal Metropolis-Hastings Sampler 1 2 3 4 5 6 7 8 Initialisation i = 0 Set randomly θ (0) Run an SMC algorithm targeting π (x|θ (0)) Sample X (0) ∼ π N ( ·| θ (0)) and set γ N (θ (0)) = Z N For iteration i ≥ 1 Sample θ∗ ∼ q (θ (i − 1) , ·) Run an SMC algorithm targeting π (x|θ∗ ) sample X∗ ∼ π N ( ·| θ∗ ) and set γ N (θ∗ ) = Z N With probability 1∧ γ N (θ∗ ) q (θ∗ , θ (i − 1)) γ N (θ (i − 1)) q (θ (i − 1) , θ∗ ) (3.19) set θ (i) = θ∗ , X (i) = X∗ , γ N (θ (i)) = γ N (θ∗ ), otherwise set θ (i) = θ (i − 1), X (i) = X (i − 1), γ N (θ (i)) = γ N (θ (i − 1)) The particle marginal Metropolis Hastings (PMMH) sampler is given in Algorithm 3.8. Note that the acceptance probability (3.19) is independent of X∗ so that it is possible to sample X∗ only once the move is accepted and, if we are only interested in estimating π (θ), then this simulation step can always be omitted. The computational complexity of each iteration is of order O (N ). Under mild assumptions the acceptance probability of this sampler converges to the acceptance 38 probability of the “ideal” marginal MH algorithm as γ N (θ∗ ) γ (θ∗ ) π (θ∗ ) → = γ N (θ (i − 1)) γ (θ (i − 1)) π (θ (i − 1)) when N → ∞. Moreover, under mixing assumptions, the variance of the acceptance ratio is proportional to p/N . We can also show under mild assumptions (A1A5,A7) that for any N ≥ 1 the PMMH sampler generates a sequence {θ (i) , X (i)} whose sequence of distributions {LN ((θ (i) , X (i)) ∈ ·)} satisfies LN ((θ (i) , X (i)) ∈ ·) − π(·) → 0 as i → ∞ . 3.3.2 Proof of Validity We show here that the PMMH is a standard MCMC algorithm on an extended space and that for all N ≥ 1 it generates a sequence converging towards the target distribution of interest. The following assumption will guarantee convergence. (A7) The “ideal” MH sampler of target density π (θ) and proposal density q (θ, θ∗ ) is irreducible and aperiodic (and hence converges for π-almost all starting points). Theorem 5 Assume (A1 )-(A5), then for any N ≥ 1 the PMMH sampler is an MH sampler defined on the extended space Θ×{1, . . . , N }×X pN ×{1, . . . , N }(p−1)N +1 with target density (3.15) and proposal density q (θ, θ∗ ) × wp∗k × ψ θ ∗ x∗1 , ..., x∗p , a∗1 , ..., a∗p−1 ∗ where ψ θ is defined in (3.16) and {wp∗k } consists of the normalised importance weights associated to the proposed population of particles as defined in (2.3). Assume in addition that (A7) holds. Then for any N ≥ 1 the PMMH sampler generates a sequence {θ (i) , X (i)} whose sequence of distributions {LN ((θ (i) , X (i)) ∈ ·)} satisfies LN ((θ (i) , X (i)) ∈ ·) − π(·) → 0 as i → ∞ . 39 The proof can be found in Appendix A. 3.3.3 Extensions Alternative Proposals It can be difficult to select the proposal distribution for θ. However, our framework is flexible enough that it allows us to design proposal distributions for θ relying on the set of current particles. For example, if π (θ|x) is a standard distribution dependent on a set of sufficient statistics of x then we could use an independent proposal of the same functional form with sufficient statistics computed using the set of current particles with possibly an inflated variance, as long as the dimension of θ is not too large. Optimisation To sample from (3.17), we can use straightforwardly the PMMH by sampling m independent SMC algorithms at each iteration i. This is shown in Algorithm 3.9. Algorithm 3.9: PMMH with m Independent SMC Algorithms 1 2 3 For iteration i ≥ 1 Sample θ∗ ∼ q (θ (i − 1) , ·) Run m independent SMC targeting π (x|θ∗ ) and let m N γm (θ∗ ) = ZiN i=1 4 where ZiN is the normalising constant estimate provided by the ith SMC algorithm. With probability 1∧ N (θ ∗ ) γm q (θ∗ , θ (i − 1)) N (θ (i − 1)) q (θ (i − 1) , θ ∗ ) γm N (θ (i)) = γ N (θ ∗ ), set θ (i) = θ∗ , γm m N (θ (i)) = γ N (θ (i − 1)) otherwise set θ (i) = θ (i − 1), γm m 40 Under (A1)-(A5), the invariant distribution of the Markov kernel associated to this algorithm is πm (θ) ∝ π m (θ) for any N ≥ 1. 3.4 Extensions In this section, we present various extensions of the PMCMC methodology and discuss connections with earlier work. The content of this section is not required to understand the examples in Section 4, where applications of the PMCMC methodology are presented. 3.4.1 Dynamic and Optimal Resampling In many applications of SMC, the resampling step is only performed when the accuracy of the estimator is poor. Practically, this is assessed by looking at the variability of the weights using the so-called effective sample size (ESS) criterion [61, pp. 35-36] given at time n by N ESS = −1 2 Wni . i=1 Its interpretation is that inference based on the N weighted samples is approximately equivalent to inference based on ESS perfect samples from the target. The ESS takes values between 1 and N and we resample only when it is below a threshold NT . All the strategies presented in the previous sections can also be applied in this context. The PMH and PMMH can be implemented in the dynamic resampling context without any modification. However, the PG is more difficult to implement as the conditional SMC step requires simulating a set of N − 1 particles not only consistent with a ‘frozen’ path but also consistent with the resampling times of the SMC method used to generate the ‘frozen’ path. For example assume that the ‘frozen’ path has been generated by an SMC algorithm resampling at time n1 , n2 , ..., nm . Then the conditional SMC method has to generate a set of N − 1 particles which are such that when the ESS is computed with this N − 1 ‘free’ particles plus the frozen one, then it is below the threshold NT at times n1 , n2 , ..., nm 41 and only at these time indices. This implies that the N − 1 particles cannot be simulated independently anymore. The easiest way to implement the conditional SMC consists of using a rejection method. However the acceptance rate of this rejection step is expected to be only reasonable when N is large - as the variance of ESS is inversely proportional to N - and if p is moderate. The SMC algorithm proposed in [32], [34] for discrete-valued variables is not of the form described in Section 2.2 as the ‘sampling’ step is deterministic and the proposed so-called optimal resampling step does not attribute equal weights to all particles. It can be shown though that we can use it directly within the PMH and PMMH algorithms. However the peculiar structure of the optimal resampling step seems to prevent the development of a simple PG-type algorithm. 3.4.2 Arbitrary Target Distribution For the time being, we have considered it possible to devise a sequence of distributions of increasing dimensions which progressively evolves towards the target distribution of interest. However, there are many problems where such a decomposition might not be obvious to define. In this case, it remains possible to use SMC to sample from any target distribution using the general methodology presented in [65]; see also [17], [45], [56]. Assume now that the target π ˜ (x) is defined on X . To sample from π ˜ (x) using SMC-type ideas, we introduce a sequence of p − 1 intermediate distributions {˜ πn (x)} on X moving smoothly from π ˜1 (x) – an easy to sample distribution – to π ˜p (x) = π ˜ (x), where π ˜n (x) = Zn−1 γ˜n (xn ). Here γ˜n : X → R+ is known pointwise but Zn might be unknown. For example, we could have π ˜n (x) ∝ [˜ π (x)]ηn where 0 < η1 < η2 < · · · < ηp = 1 or, in a Bayesian context, π ˜n (x) could be the posterior distribution of X given observations until time n [17]. To move from π ˜n−1 (xn−1 ) to π ˜n (xn ) we use a sequence of Markov kernels Mn (xn−1 , xn ) = Mn (xn−1 , xn ). In most applications, Mn is an MCMC kernel of invariant distribution π ˜n although this is not necessary. We then build the distributions πn (xn ) on X n for n = 2, ..., p using n−1 γn (xn ) = γ˜n (xn ) Lk (xk+1 , xk ) k=1 42 where {Ln } is a sequence of auxiliary (backward) Markov kernels giving the probability density to move from xn+1 to xn . By construction, we have π ˜ (xp ) = πp (xp ) dxp−1 . Once {˜ πn } , {Mn } and {Ln } have been selected (see [65] for guidelines), then we can straightforwardly implement the algorithms proposed in Section 3.1 to sample from πp (xp ) and hence from its marginal πp (xp ) = π (xp ). We can also directly adapt the PMMH presented in Section 3.3 to sample from π ˜ (θ, x) = Z −1 γ˜ (θ, x) on some space Θ × X by defining πn ( xn | θ) = γn (θ, xn ) Znθ where n−1 Lθk (xk+1 , xk ) γn (θ, xn ) = γ˜n (θ, xn ) k=1 and γ˜p (θ, xp ) = γ˜ (θ, xp ). However, even if we can sample from π ˜ ( θ| x), the PG sampler proposed in Section 3.2 cannot be applied as it does not require sampling from π ˜ ( θ| xp ) but from πp ( θ| xp ) which is usually intractable. To bypass the simulation step from πp ( θ| xp ), we can use a collapsed Gibbs sampler strategy where we first sample θ from π ˜ ( θ| xp ) then sample Xp−1 , ..., X1 from n−1 Lθk (xk+1 , xk ) . k=1 3.5 Discussion The PMH algorithm presented in Section 3.1 is related to the configurational bias Monte Carlo (CBMC) method which is a very popular method in molecular simulation used to sample long proteins [40, 74]. However, on the contrary to the PMH algorithm, the CBMC algorithm does not propagate N particles in parallel. Indeed, at each time step n, the CBMC algorithm samples N particles but the resampling step is such that a single particle survives, to which a new set of N offsprings is then 43 attached. Using our notation, this means that the CBMC algorithm corresponds to the case where Ain = Ajn for all i, j = 1, . . . , N and A1n ∼ r(·|Wn ) i.e. at any time n, all the children share the same and unique parent particle. The problem with this approach is that it is somewhat too greedy and that if a “wrong” decision is taken too prematurely then the proposal will be most likely rejected. It can be shown that the acceptance probability of the CBMC algorithm does not converge to 1 for p > 1 as N → ∞ contrary to that of the PMH algorithm. It has been more recently proposed in [19] to improve the CBMC algorithm by propagating forward several particles simultaneously in the spirit of the PMH algorithm. However, contrary to us, the authors in [19] propose to kill or multiply particles by comparing their unnormalised weights wn Xin with respect to some pre-specified lower and upper thresholds; i.e. the particles are not interacting and their number is a random variable. In simulations, they found that the performance of this algorithm was very sensitive to the values of these thresholds. Our approach has the great advantage of bypassing the delicate choice of such thresholds. In statistics, a variation of the CBMC algorithm known as the multiple-try Metropolis (MTM) has been introduced in the specific case where p = 1 in [60]. The key of our methodology is to build efficient proposals using sequential and interacting mechanisms for cases where p 1: the sequential structure might be natural for some models but can also be induced in other scenarios in order to take advantage of the potential improvement brought in by the interacting mechanism. In this respect, both methods do not apply to the same class of problems. The idea of approximating the marginal MH algorithm which samples directly from π (θ), by approximately integrating out the latent variables x, was proposed in [6] then generalised and studied theoretically in [1]. The present work is a simple mechanism which opens up the possibility to make this approach viable in large dimensional setups. Indeed the PMMH approach is expected to lead to approximations of π (θ) (up to a normalising constant) with a variance which for highdimensional problems is likely to scale favourably compared to estimates based, for example, on standard non-sequential and non-interacting importance sampling. The results in [1] suggest that this is a property of paramount importance in order to design efficient marginal MCMC algorithms. 44 Chapter 4 Applications 4.1 Nonlinear State-Space Model Consider the following non-linear non-Gaussian dynamic model where {Xn }n≥1 is an unobserved Markov process defined by X1 ∼ µθ (·) and for n > 1 Xn | Xn−1 ∼ fθ ( ·| Xn−1 ) where the observations {Yn }n≥1 are assumed independent conditional upon {Xn }n≥1 with Yn | Xn ∼ gθ ( ·| Xn ) . The parameter θ is also unknown and follows the prior distribution p (θ). Having observed y1:T , we are interested in sampling from the target distribution T p ( θ, x1:T | y1:T ) ∝ p (θ) µθ (x1 ) gθ ( y1 | x1 ) fθ ( xn | xn−1 ) gθ ( yn | xn ) . n=2 Although sampling from p ( θ| y1:T , x1:T ) can typically be performed using standard techniques, sampling from p ( x1:T | y1:T , θ) is impossible except for linear Gaussian models and finite state-space Markov chains [14], [41]. The standard approach consists of sampling alternatively from the full conditional p(xn |yn , xn−1 , xn+1 , θ) directly or using a MH step but this strategy can be ex- 45 tremely inefficient due to the strong dependence between xn and (xn−1 , xn+1 ). A more efficient strategy would consists of sampling blocks of variables say p ( xn:n+L−1 | yn:n+L−1 , xn−1 , xn+L , θ). When this distribution is log-concave, it is possible to design efficient proposal distributions [71]. more complex scenarios, this can be very challenging. However for We propose here to use the PMH algorithm to sample from p ( x1:T | y1:T , θ) or from subblocks p ( xk:k+L−1 | yk:k+L−1 , xk−1 , xk+L , θ) whenever T is too large. In this case, the “natural” sequence of bridging distributions to consider is πn (xn ) = p ( xn | y1:n , θ). We illustrate our methodology on the following example Xn = Xn−1 Xn−1 + 25 + 8 cos(1.2n) + Vn 2 2 1 + Xn−1 (4.1) Yn = Xn2 + Wn 20 (4.2) iid iid 2 ; N m, σ 2 denotes where X1 ∼ N (0, 5), Vn ∼ N 0, σV2 , Wn ∼ N 0, σW the Gaussian distribution of mean m and variance σ 2 . We set θ = (σV , σW ). This example is often used in the literature in order to assess the performance of particle methods; e.g. [27], [47], [57]. The posterior distribution p ( xT | y1:T , θ) for this non-linear state-space (NLSS) model is highly multimodal as there is uncertainty about the sign of the state Xn which is only observed through its square. We generated a set of observations y1:200 according to the model (4.1)-(4.2) for √ θ = (σV , σW ) = 10, 1 . We first compared the acceptance rates for the PMH and CBMC algorithm [40] when sampling from p ( xT | y1:T , θ) for various lengths T of the observations and a varying number of particles N . To sample from p ( xT | y1:T ) we used for the proposal Mnθ (xn−1 , xn ) an approximation of the conditional distribution p ( xn | yn , xn−1 , θ) ∝ gθ ( yn | xn ) fθ ( xn | xn−1 ) given by the extended Kalman filter (local-linearisation); see [26] for details. We also used the most basic resampling scheme; i.e. multinomial resampling. The results are displayed in Figure 4.1. The PMH algorithm performs consistently 46 better than the CBMC algorithm, and is able to successfully sample large blocks of variables for a reasonable number of particles. As expected, the PMH algorithm can achieve acceptance rates close to 1 for a sufficiently large number of particles. We then compared the acceptance rates for proposal from prior (P) versus pro√ √ posal using local-linearisation (LL). Here we used θ = ( 15, 2). The simulations with the P proposal used multinomial resampling at every time step (bootstrap) while the simulations with the LL proposal used dynamic and stratified resampling. The result is shown in Figure 4.2. We found that for low number of particles the LL proposal does slightly better, but as the number of particles increases proposing from the prior does as well or better than LL. This can be explained by the fact that the LL proposal doesn’t handle multimodalities well. At low particles it is able to steer more particles into regions of higher posterior density than the much more diffuse prior. However, this advantage becomes less important and eventually the property of being a unimodal Gaussian proposal for a multimodal posterior becomes a handicap for LL as the number of particles increases and the prior is able to propose sufficient number of particles into high probability regions. We then set the following prior on θ = (θ1 , θ2 ) = (σV , σW ) θ12 ∼ IGa (a, b) , θ22 ∼ IGa (a, b) , where IGa is the Inverse-Gamma distribution with a = b = 0.01. To sample from p ( xT , θ| y1:T ), we used the PG sampler described in Section 3.2 with for Mnθ (xn−1 , xn ) an approximation of the conditional distribution p ( xn | yn , xn−1 , θ) ∝ gθ ( yn | xn ) fθ ( xn | xn−1 ) given by the extended Kalman filter as before (LL proposal). The full conditional distribution p ( θ| y1:T , xT ) is of a standard form. We compared the PG sampler to a standard algorithm where one updates the states Xn one-at-a-time using a MH step of invariant distribution p ( xn | yn , xn−1 , xn+1 , θ) and proposal distribution Mnθ (xn−1 , xn ). In the one-at-a-time algorithm, we updated N times the state variables XT at each iteration before updating θ. Hence the PG sampler and the 47 1 0.8 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 5 0.6 0.4 0.2 0 200 100 25 10 25 50 State Space Size 100 50 # of Particles 10 200 5 1 0.3 0.8 0.25 0.6 0.2 0.4 0.15 0.2 0 0.1 0.05 200 100 0 5 25 10 25 50 State Space Size 100 50 # of Particles 10 200 5 Figure 4.1: Acceptance probabilities for PMH (top) and CBMC (bottom) as a function of T and N . Each point was computed over an average of 250,000 iterations. 48 Proposal using local linearization, stratified dynamic resampling 1 0.9 0.9 0.8 0.8 Acceptance Rate Acceptance Rate Proposal from Prior, multinomial always resampling 1 0.7 0.6 0.5 0.4 0.3 T= 10 T= 25 T= 50 T=100 0.2 0.1 0 0 200 400 600 0.7 0.6 0.5 0.4 0.3 T= 10 T= 25 T= 50 T=100 0.2 0.1 0 800 1000 1200 1400 1600 1800 2000 0 200 400 Number of Particles 600 800 1000 1200 1400 1600 1800 2000 Number of Particles Figure 4.2: PMH acceptance rates for sampling from prior (left) and proposal using local-linearisation (right) for varying number L of observations and number of particles. 40 x y 30 20 10 0 -10 -20 -30 0 50 100 150 200 250 300 350 400 450 500 Figure 4.3: Hidden states (x) and observations (y). standard algorithm have approximately the same computational complexity. Note that in this case, it is difficult to design efficient block proposal distributions to sample from p ( xk:k+L−1 | yk:k+L−1 , xk−1 , xk+L , θ) for L > 1 as this distribution can be multimodal. We used T = 500 and N = 1000. For θ fixed to its real value, the average ac- 49 1 PG σv PG σw MH one-at-a-time, σv MH one-at-a-time, σw 0.8 0.6 0.4 0.2 0 0 20 40 60 80 100 Figure 4.4: ACF of the output of the PG and MH algorithms. ceptance rate of our PMH algorithm of Section 3.1 using the ‘improved’ proposal and a multinomial resampling scheme was equal to 0.43. This is quite remarkable as this effectively corresponds to updating 500 real-valued dependent variables simultaneously. We then ran the PG algorithm and the standard one-at-a-time sampler for 50,000 iterations with a burn-in of 10,000 iterations. In Figure 4.4, we display the auto-correlation functions (ACFs) of the parameters (σV , σW ) for the PG and the MH one-at-a-time algorithms obtained using N = 1000. The autocorrelation sequences for the various parameters consistently vanish faster for the PG sampler. In Figure 4.5, we present the estimates of the marginal distributions of the parameters for both algorithms. Contrary to the PG sampler, the standard algorithm clearly misses the main mode of the posterior distribution and over-estimates the parameter σV . The block proposal for the states XT of the PG sampler allows us to avoid getting trapped in this secondary mode. Out of 100 independent runs of both algorithms on the same dataset, the PG algorithm never experienced being trapped in such a secondary mode of the target distribution and produced consistent results 50 3 6 2.5 5 2 4 1.5 3 1 2 0.5 1 0 0 3 6 2.5 5 2 4 1.5 3 1 2 0.5 1 0 0 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 Figure 4.5: Estimates of p ( σV | y1:T ) (left) and p ( σW | y1:T ) (right) for the PG (top) and the MH one-at-a-time (bottom). The vertical lines correspond to the true values. whereas the standard algorithm, initialised at θ = (10, 10) for data with ground √ truth ( 10, 1), was trapped 67 times. In practice, we can obviously combine both strategies by only occasionally updating the state variables with the PG sampler to avoid such traps while using more standard and cheaper updates for large proportion of the computational time. In simulations not presented here, we also applied our algorithm to the stochastic volatility model presented in [67], [71]. This model is somewhat ‘easier’ than the toy example presented here and the particle MCMC algorithms perform very well in this case. We also compared the performance of the PG sampler with the PMMH sampler. In this case we used the prior as a proposal in the SMC with multinomial resampling. The PMMH used a normal random walk proposal with a diagonal covariance matrix. The standard deviation was equal to 0.15 for σV and 0.08 for σW . We present in Figure 4.6 the auto-correlation function for (V, W ) for the PG and PMMH samplers and various number of particles N . Clearly the performance improves as N increases, and while dependent on the test function f considered 51 1 1 500 particles 1000 particles 2000 particles 5000 particles 0.8 500 particles 1000 particles 2000 particles 5000 particles 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 20 40 60 80 100 120 140 160 180 200 0 (a) ACF for σV – PG sampler 1 40 60 80 100 120 140 160 180 200 (b) ACF for σW – PG sampler 1 500 particles 1000 particles 2000 particles 5000 particles 0.8 20 500 particles 1000 particles 2000 particles 5000 particles 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 20 40 60 80 100 120 140 160 180 200 (a) ACF for σV – PMMH sampler 0 20 40 60 80 100 120 140 160 180 200 (b) ACF for σW – PMMH sampler Figure 4.6: ACF of parameters σV and σW for the PG sampler and the PMMH sampler. (as in Eqn. 3.6), it is possible to choose N in order to optimise the ratio integrated autocorrelation function/particle number N . In this scenario, it appears necessary to use at least 2000 particles to get the ACF to drop sharply, whereas increasing N beyond 5000 does not improve performance. That is for N > 5000 we observe that the ACFs (not presented here) are very similar to N = 5000 and probably very close to that of the corresponding “ideal” MMH algorithm. 52 4.2 Protein Folding One of the most important and challenging scientific problems is simulating molecular structures. Protein folding is one of the more well known open problems in structural biology and biophysics. Proteins are the final products of most genes and perform a variety of roles: they serve as structural units, control metabolic pathways, participate in the catalysis or inhibition of various chemical reactions, and serve many other functions. Proteins are chain-like polymers, composed of small subunits (amino acids). There are 20 different types of these amino acids, with a few less common modified versions thereof, which can be loosely grouped into classes based on their chemical properties, such as hydrophobic, hydrophilic (or polar), and charged [8]. The protein folding problem consists of predicting the structure of the protein given its sequence of amino acids. This may include the prediction of the folding pathways to arrive at the “final” shape. The protein structure is dictated by the free energy landscape, lower energy structures being preferred. The native state of a protein is usually at the minimum free energy, and we will consider proteins where this is the case. Exceptions to this can occur when the folding takes place under kinetic control [22]. The function of a protein is primarily determined by its structure, and the sequence of amino acids dictates the structure (Central Dogma of genomics) [22]. There are several different approaches that have been taken to investigate these types of problems, including SMC, CBMC, prune-enriched Rosenbluth method (PERM), recoil-growth (and variations thereof) [40, 61, 74], replica exchange Monte Carlo (parallel tempering) [75], ant colony optimisation [72], and model based search [11]. Next we apply the PMCMC method to this class of problems. We will focus only on the lattice approximations, in particular the HP model, and leave the offlattice case for future studies. 4.2.1 Lattice Model Instead of modelling the full protein with all the interacting forces, it is often possible to imitate the protein folding using a simple 2-D or 3-D lattice-bead model. 53 → Figure 4.7: Illustration of protein folding in HP model. The left configuration is in a high energy state (U = 0), while on the right the protein sequence is folded into into the lowest energy configuration (U = −9). The positions of the amino acids (now beads) in the protein sequence are thus restricted to a lattice and we only need to account for interactions of non-covalent bonds and ensure a self-avoiding walk (SAW), i.e. no two beads can occupy the same position and neighbouring elements in the sequence must be neighbours on the lattice. One of the simplest models used is an Ising model and has only two types of beads: white and black, corresponding to hydrophilic and hydrophobic amino acids, respectively [61]. The energy function for this model is Un (xn ) = − c(xi , xj ) (4.3) |i−j|>1 where c(xi , xj ) = 1 if xi and xj are non-bonding neighbours and both beads are black (hydrophobic), and c(xi , xj ) = 0 otherwise. 4.2.2 Implementation We used PMH with sub-block updating and simulated annealing [68, Chapter 5] to optimise the protein configuration for the lowest energy. The inverse temperature for the annealing was increased linearly from βmin = 0.1 to βmax = 5 with increments β, using 50 × L iterations at each temperature, where L is the length of the protein sequence. The subblocks were chosen at random positions with sizes B sampled from [2, 20], either uniformly or “reverse-linearly”, that is, with a distribution Prl (B) ∝ 21 − B. We also used an adaptive number of particles which scaled 54 linearly with the current block size. This might break detailed balance, however, since this is an optimisation problem, we can afford this approximation and benefit from a more efficient algorithm. To further improve the efficiency of the sampler we implemented the resampling scheme suggested by Fearnhead & Clifford [35]. The method works well for discrete state models, especially when the number of possible states is small. 4.2.3 Results In order to do comparison with other methods, we ran our method on the data sets given in [75] and [76]. The protein sequences and corresponding lowest (known) energies are given in Tables 4.1 and 4.2, for folding in 2D and 3D, respectively. The results are summarised in Table 4.3. The time is given in seconds. kr specifies the scaling factor used for adapting the number of particles to the block size (i.e. N = 1 + kr × B). β is the increment of the inverse-temperature for the annealing. “bsd” is the block size distribution, where “u” indicates uniform and “rl” is reverse-linear. We show both the best performance as well as the median performance over 15 runs at the given settings. The performance of our method is currently not as good as some of the competing methods, such as replica exchange Monte Carlo (REMC) [75], fragment regrowth via energy-guided sequential sampling (FRESS) [76], or specialised versions of PERM (nPERMis [55] and PERMtExp [55, 75]). However, the results are encouraging. We are able to find the lowest known energy for most of the protein sequences in a reasonable amount of time. However, we do struggle with the 3D sequences from the data set of [76]. Tables 4.4 and 4.5 compare our results to the other methods for the data sets from [75] and [76], respectively. Further tuning of the method, plus adding some specialised moves could make this a competitive approach. 55 ID S1-1 S1-2 S1-3 S1-4 S1-5 S1-6 S1-7 S1-8 Length 20 24 25 36 48 50 60 64 Emin -9 -9 -8 -14 -23 -21 -36 -42 S1-9 S1-10 85 100 -53 -50 S1-11 100 -48 2D50 2D60 2D64 50 60 64 -21 -36 -42 2D85 2D100a 85 100 -53 -48 2D100b 100 -50 Protein Sequence (HP)2 PH(HP)2 (PH)2 HP(PH)2 H2 (P2 H)7 H (P2 H)2 HP4 H2 P4 H2 P4 H2 P3 (H2 P2 )2 P3 H5 (H2 P2 )2 P2 H(HP2 )2 P2 HP2 (H2 P2 )2 P3 H10 P6 (H2 P2 )2 HP2 H5 H(HP)4 H4 PH(P3 H)2 P(P3 H)3 PH3 (HP)4 H2 P2 H3 PH8 P3 H9 (HP)2 P2 H12 P4 H4 (H2 P)2 HP H11 (HP)3 P(H2 P2 )2 HP2 (H2 P2 )2 HP2 (H2 P2 )2 (HP)2 H12 H4 P4 H12 P6 H12 P3 H12 P3 H12 P3 HP2 (H2 P2 )2 HPH P(P2 H2 )2 H2 P2 H(H2 P)3 H4 P8 H6 P2 H6 P9 H(PH2 )2 H9 P2 H(H2 P)2 HP(PH)2 H2 P6 H3 P5 (PH)2 HP5 H3 PH3 (H2 P)2 P(P2 H2 )2 PH5 PH8 (H2 P)2 H7 P11 H7 P(PH)2 H2 P5 (PH)2 H H(HP)4 H4 PH(P3 H)2 P(P3 H)3 PH3 (HP)4 H2 P2 H3 PH8 P3 H9 (HP)2 P2 H12 P4 H4 (H2 P)2 HP H11 (HP)3 P(H2 P2 )2 HP2 (H2 P2 )2 HP2 (H2 P2 )2 (HP)2 H12 H4 P4 H12 P6 H12 P3 H12 P3 H12 P3 HP2 (H2 P2 )2 HPH P5 (PH)2 HP5 H3 PH3 (H2 P)2 P(P2 H2 )2 PH5 PH8 (H2 P)2 H7 P11 H7 P(PH)2 H2 P5 (PH)2 H P(P2 H2 )2 H2 P2 H(H2 P)3 H4 P8 H6 P2 H6 P9 H(PH2 )2 H9 P2 H(H2 P)2 HP(PH)2 H2 P6 H3 Table 4.1: Benchmark collection of protein sequences for 2D HP model. Hi and Pi indicate strings of i consecutive H’s and P’s and (s)i denotes i repetitions of string s. Emin is the lowest known energy. 56 ID S2-1 Length 48 Emin -32 S2-2 48 -34 S2-3 48 -34 S2-4 48 -33 S2-5 48 -32 S2-6 S2-7 48 48 -32 -32 S2-8 48 -31 S2-9 48 -34 S2-10 48 -33 3D58 58 -44 3D64 64 -56 3D67 67 -56 3D88 88 -72 3D103 103 -57 3D124 124 -75 3D136 136 -83 Protein Sequence HPH2 P2 H(H3 P)2 PH2 P(PH)2 H(HP)2 (H2 P2 )2 PHP8 H2 H2 (H2 P)2 H5 (P2 H)2 (HP2 )2 P2 (P2 H)2 P3 H(P2 H2 )2 HPH PH(PH2 )2 H4 P2 (HP)2 (PH)2 (HP)3 P2 HP2 (H2 P2 )2 (HP)2 PHP (PH)2 HP(PH)2 H2 P2 (H2 P)2 P2 H5 P(PH)2 (HP)3 P (P2 H)2 PHP P2 HP2 (PH)2 H3 P2 H2 (H2 P)2 H3 P2 (HP)2 (HP2 )2 P4 (H2 P)2 H H3 P3 H(HP)2 (H2 P)3 HP7 (HP)2 PHP(P2 H)2 H5 PH PHP3 (PH)2 H(HP)2 H2 (H2 P)3 P2 (HP)2 P2 H(H2 P2 )3 PH (PH2 )2 HPH(H3 P2 )2 P3 (PH)2 HP2 H(HP)2 P2 H(HP)3 H2 P3 P(HP)2 P3 (HP)3 (PH)2 H5 P2 H2 (HP)2 (PH)2 HP (PH)2 H2 P4 H PH2 P3 (P3 H2 )2 (HP)2 (PH)2 H(P2 H)2 (P2 H2 )2 H5 P2 H2 PHP(H3 P)2 PH2 PH(PH2 )2 (HP)3 H2 P2 H3 P2 (HP)2 P (P2 H)3 H(P2 H)2 P(H2 P)2 H3 P2 (HP)2 P(HP)2 PH(H2 P)3 P(H2 P)2 H3 P2 (HP)2 P(HP)2 PH(H2 P)3 PHP(H2 P)2 HP2 H3 P3 HP(H2 P)2 HP2 H3 P3 HP(H2 P)2 HP2 H3 P3 HP(H2 P)2 HP2 H3 P PHP(H2 P)2 HP2 H2 (P2 H)6 HP2 (H3 P2 )4 HP(H2 P)2 H (P2 H)3 H(P2 H)3 HP2 HP P2 H2 P3 (P2 H2 )2 P(HP2 )2 P2 (P3 H)2 HPH2 P4 (P2 H)2 P (HP2 )2 P3 H3 P4 (H2 P)2 P4 H2 P4 H3 (HP)2 P7 H4 (HP2 )2 P3 H2 (HP)2 P3 HP5 H2 P2 (P2 H2 )2 (P4 H)2 (P2 H)2 HP3 H(HP)2 H3 P4 H3 P6 H2 (P2 H)2 P(HP2 )2 P3 (P2 H)2 H2 P(P3 H)2 H4 P4 H(HP)4 H HP(P4 H)2 P(H2 P)2 P2 (PH)2 H2 P4 (HP)2 H4 P9 (P2 H)2 P2 (PH)2 HP3 H2 (P2 H)2 P(HP)2 P4 (P3 H)2 H5 P (P2 H2 )2 HP2 (PH2 )2 H3 P5 (P4 H)2 PHP4 Table 4.2: Benchmark collection of protein sequences for 3D HP model. Hi and Pi indicate strings of i consecutive H’s and P’s and (s)i denotes i repetitions of string s. Emin is the lowest known energy. 57 Protein Emin S1-1 -9 S1-2 -9 S1-3 -8 S1-4 -14 S1-5 -23 S1-6 -21 S1-7 -36 S1-8 -42 S1-9 -53 S1-10 -50 S1-11 -48 2D50 -21 2D60 -36 2D64 -42 2D85 -53 2D100a -48 2D100b -50 S2-1 -32 S2-2 -34 S2-3 -34 S2-4 -33 S2-5 -32 S2-6 -32 S2-7 -32 S2-8 -31 S2-9 -34 S2-10 -33 3D58 -44 3D64 -56 3D67 -56 3D88 -72 3D103 -57 3D124 -75 3D136 -83 Best performance E Time #iter -9 0.0 0K -9 0.1 1K -8 0.1 1K -14 0.4 1K -23 0.4 4K -21 0.4 2K -36 5.8 56 K -42 22.5 80 K -53 242.9 1535 K -49 182.0 696 K -47 4.2 43 K -21 0.5 2K -36 536.4 1170 K -42 6.6 72 K -53 108.5 366 K -48 112.3 737 K -49 52.5 303 K -32 45.5 216 K -34 7.1 35 K -34 8.0 44 K -33 45.5 240 K -32 2.1 9K -32 3.7 36 K -32 117.0 978 K -31 4.8 41 K -33 12.8 25 K -33 362.3 1179 K -43 8.1 24 K -52 260.0 835 K -50 128.8 1130 K -67 607.1 1708 K -55 5.9 36 K -70 150.8 515 K -78 66.7 613 K Median performance E Time #iter -9 0.1 0K -8 0.2 2K -7 0.3 2K -12 0.1 1K -20 0.8 6K -17 0.2 1K -35 2.4 24 K -38 9.8 58 K -51 71.2 725 K -47 189.1 440 K -45 2.5 26 K -19 1.0 4K -35 151.9 571 K -40 8.5 59 K -52 77.0 409 K -46 67.7 664 K -46 79.1 283 K -31 60.1 181 K -31 4.9 16 K -31 2.6 23 K -31 18.0 154 K -29 1.3 4K -30 4.9 46 K -31 80.2 454 K -29 4.3 24 K -31 22.2 40 K -33 362.3 1179 K -39 2.7 14 K -51 631.8 1200 K -49 177.8 995 K -64 223.4 1053 K -48 4.9 46 K -68 237.2 498 K -74 97.8 884 K kr 2 1 1 2 1 2 1 2 1 3 1 2 3 1 2 1 2 2 2 1 1 2 1 1 1 3 3 2 3 1 2 1 3 1 Setting β bsd .100 u .100 rl .100 u .100 u .100 rl .100 rl .010 rl .010 u .001 u .001 rl .010 rl .100 u .001 u .010 rl .001 u .001 u .001 rl .001 rl .010 rl .010 u .001 u .100 rl .010 rl .001 rl .010 rl .010 u .001 rl .010 u .001 rl .001 rl .001 u .010 u .001 rl .001 rl Table 4.3: Protein folding results for HP model using PMH with subblock updating and tempering. The time is given in seconds. 58 59 PMCMC E T ime -9 0.03 -9 0.07 -8 0.14 -14 0.43 -23 0.37 -21 0.38 -36 5.79 -42 22.50 -53 242.94 -49 182.00 -47 4.23 -32 45.54 -34 7.11 -34 8.03 -33 45.47 -32 2.08 -32 3.70 -32 116.97 -31 4.85 -33 12.76 -33 362.34 PERMtExp E T ime -9 <1 -9 <1 -8 2 -14 <1 -23 2 -21 3 -36 4 -42 280800 -53 60 -50 1200 -48 480 -32 6 -34 18 -34 6 -33 120 -32 30 -32 6 -32 30 -31 18 -34 300 -33 0.6 ACO-HPPFP-3 E T ime -9 <1 -9 <1 -8 2 -14 4 -23 60 -21 15 -36 1200 -42 5400 -53 1day -49 43200 -47 36000 -32 1800 -34 25200 -34 7200 -33 18000 -32 900 -32 43200 -32 43200 -31 7200 -34 27000 -33 3600 REMCm E T ime -9 <1 -9 <1 -8 <1 -14 <1 -23 <1 -21 <1 -36 13 -42 6 -53 38 -50 480 -48 72 -32 6 -34 12 -34 6 -33 6 -32 6 -32 6 -32 18 -31 6 -34 54 -33 6 Table 4.4: Comparison of protein folding performance for the data set from [75]. The time is given in seconds. Protein Emin S1-1 -9 S1-2 -9 S1-3 -8 S1-4 -14 S1-5 -23 S1-6 -21 S1-7 -36 S1-8 -42 S1-9 -53 S1-10 -50 S1-11 -48 S2-1 -32 S2-2 -34 S2-3 -34 S2-4 -33 S2-5 -32 S2-6 -32 S2-7 -32 S2-8 -31 S2-9 -34 S2-10 -33 Protein Emin 2D50 -21 2D60 -36 2D64 -42 2D85 -53 2D100a -48 2D100b -50 3D58 -44 3D64 -56 3D67 -56 3D88 -72 3D103 -57 3D124 -75 3D136 -83 PMCMC E T ime -21 0.53 -36 536.41 -42 6.62 -53 108.51 -48 112.34 -49 52.54 -43 8.07 -52 259.98 -50 128.81 -67 607.10 -55 5.94 -70 150.82 -78 66.65 FRESS E T ime -21 – -36 – -42 – -53 – -48 – -50 – -44 5.4 -56 32.4 -56 84.6 -72 301.8 -57 268.2 -75 – -83 – nPERMis E T ime – – -36 – -42 – -52 – -48 – -50 – -44 11.4 -56 27 -56 66 -69 -55 187.2 -71 738 -80 6600 Table 4.5: Comparison of protein folding performance for the data set from [76]. The time is given in seconds. 60 4.3 Dirichlet Mixture Model We consider a Dirichlet process (DP) mixture model for the observations Y1:T . We have the following hierarchical model G ∼ DP (α, G0 ) , i.i.d. Un | G ∼ G, Yn | Un ∼ gUn (·) , where DP (α, G0 ) is a Dirichlet process of base measure G0 and scale parameter α. By integrating out G and using the Polya urn representation of the predictive distribution of Un given U1:n−1 , we can equivalently reformulate the model by introducing a vector of cluster labels X1:n which satisfy P ( Xn = j| x1:n−1 ) = mjn−1 / (n − 1 + α) for j = 1, ..., kn−1 α/ (n − 1 + α) j = kn−1 + 1 (4.4) where kn−1 is the number of clusters in the assignment x1:n−1 and mjn−1 is the number of observations that x1:n−1 assigns to cluster j. The cluster locations are such that i.i.d. θk ∼ G0 and Yn | θXn ∼ gθXn (·) . Assuming α is also unknown with a suitable prior π (α), the posterior of interest is given by π ( xT , θ1:kT , α| y1:T ) . We will consider here the case where the base measure G0 and the likelihood gθ are conjugate so that it is possible to compute the marginal π ( xT , α| y1:T ) up to a normalising constant. In such situations, several SMC methods have already been proposed to sample from π ( xT | y1:T , α) [32], [61, Section 4.4.2]. We focus here on the following Gaussian mixture model for real-valued observations where θ = µ, σ 2 and G0 µ, σ 2 = IG σ 2 ; a, b N µ; η, τ σ 2 . 61 For the scale parameter, we use α ∼ Ga (c, d) . (4.5) Using standard calculations [32], we can show that π ( y1:n | x1:n , α) = a kn j b Γ a + mjn /2 mn j b+ σn j 2 Γ (a) m τ + 1 n j=1 −(a+mjn /2) − η 2 + 1 + mjn τ 2 y jn where y jn σnj 2 = = 1 mjn 1 n I (xk = j) yk , k=1 n mjn k=1 I (xk = j) yk − y jn 2 . We cannot sample from π ( α| y1:T , xT ) directly. However, following [30], we can introduce an auxiliary variable υ ∈ (0, 1) such that α| y1:T , xT , υ ∼ γυ Ga (c + kT , d − log υ)+(1 − γυ ) Ga (c + kT − 1, d − log υ) with γυ / (1 − γυ ) = (c + kT − 1) / (T (d − log υ)) and υ| y1:T , xT , α ∼ Be (α + 1, T ) where Be is the Beta distribution. We first applied our algorithm to the Galaxy dataset [68, p. 426] and used the version of the data in [32] with T = 82. We used a = b = 1, η = 20, τ = 152 [32] and c = 1, d = 0.5. To sample from π ( xT , α| y1:T ), we ran the PG algorithm of Section 3.2 with π ( xn | α) = π ( xn | y1:n , α) and the stratified resampling scheme at each iteration. We also used the PMMH algorithm of Section 3.3 with the same sequence π ( xn | α) and using both stratified and dynamic resampling 62 PG PMMH PMH SMC [32] Gibbs [32] Gibbs [30] S 9000 9000 49000 50,000 50,000 10,000 scale parameter E ( α| y1:T ) V ( α| y1:T ) 0.911 0.267 0.916 0.273 fixed to α = 1 fixed to α = 1 fixed to α = 1 ∼1.0 ∼0.2 number of clusters E ( kT | y1:T ) V ( kT | y1:T ) 5.342 2.686 5.416 2.683 5.754 1.843 5.75 – 5.75 – 7.01 3.41 Table 4.6: Results for Galaxy data set. S is the number of samples used in the Monte Carlo estimate. Both PG and PMMH used 100 particles. The Gibbs sampler in Escobar & West [30] used the prior p(α) = Ga(2, 0.25). schemes; resampling being performed when ESS < N/2. The proposal for α is a Gaussian random walk proposal of standard deviation 0.5. We ran the algorithms for 10, 000 iterations and a burn-in of 1, 000. For N = 50, we found that the PG and the PMMH algorithms provided similar results in terms of posterior mean and posterior variance of α and kT ; the results for the galaxy data set are summarised in Table 4.6, and the histogram, trace, and ACF for the PG and PMMH samplers are shown in Figure 4.8. Increasing the number of particles did not significantly improve the performance of the algorithms. For example for N = 50 the average acceptance rate of the PMMH algorithm was 0.62 while for N ≥ 100 it stabilised at 0.71. This suggests that in this scenario N = 100 is sufficient to approximate the marginal MH algorithm. Fearnhead [32] performed inference on the number of clusters using Gibbs sampling and SMC. The reported mean was 5.75 and agrees with our results, however we should note that in [32] the scaling parameter was fixed at α = 1 while in our simulations α was also sampled and had mean 0.91 with variance 2.7. Escobar & West [30] analysed the Galaxy data using a Ga(2, 0.25) as a prior for α and obtained a posterior for the scale parameter with E(α) ≈ 1 and V(α) ≈ 0.2, which is a bit larger and more peaked than what we found. Note that they also learn the prior parameters η and τ and we have used a slightly different prior for α. We also applied our algorithm to a large simulated dataset of 10, 000 realvalued observations generated from the Dirichlet process Gaussian mixture model. 63 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 0 4.5 4 4 3.5 3.5 3 3 0.5 1 1.5 2 2.5 3 3.5 4 2.5 2.5 2 2 1.5 1.5 1 1 0.5 0.5 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 1 1 10 50 100 500 1000 0.8 10 50 100 500 1000 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 -0.2 -0.2 0 20 40 60 80 100 0 20 40 60 80 100 Figure 4.8: Histogram (top), trace (middle), and ACF (bottom) of parameter α for PG (left) and PMMH (right). The histogram and trace plots were obtained using 100 particles. The PG resampled at every time step, while the PMMH used dynamic resampling. Both used stratified resampling. 64 1 0.9 Acceptance Rate 0.8 0.7 0.6 0.5 0.4 0.3 0.2 multinomial resampling (always) stratified resampling (always) multinomial resampling (dynamic) stratified resampling (dynamic) 0.1 0 10 100 1000 Number of Data Points Figure 4.9: Average acceptance rate for various versions of the PMH algorithm for varying number of data points and N = 100 For α fixed to its true value, α = 1, we compared the performance of the PMH sampler for varying SMC schemes and numbers of data points for N = 100 particles. The results are displayed in Figure 4.9. Even for only N = 100 particles, the acceptance rate of the PMH algorithms based on SMC proposals with dynamic resampling is equal to 0.31 for T = 10, 000 observations. For T = 10, 000 we also implemented the PG sampler to sample from π ( xT , α| y1:T ) based on an SMC proposal using stratified resampling at each iteration . We ran the algorithm for 10, 000 iterations with 1, 000 iterations for the burn-in. In Figure 4.10, we display the autocorrelation function for the simulated values α(i) and the estimate of π ( α| y1:T ). The PMMH algorithms for stratified and dynamic resampling yield similar results. We also compared our method to Gibbs sampling. In order to keep the computational complexity the same, we allow the Gibbs sampler to run N full Gibbs updates for the state variables between each parameter update, where N is the number of particles used in the PMMH and PG samplers (see Algorithm 4.1). For this problem however, we found that particle MCMC does not give any improvement in 65 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 20 40 60 80 100 0 0.5 1 1.5 2 2.5 3 3.5 4 Figure 4.10: Auto-correlation function for parameter α (left) and estimate of π ( α| y1:T ) (right). The vertical line corresponds to the true value. performance over Gibbs sampling. As there is only weak dependence between the label assignments Xk and between the parameter α and Xk , the PMCMC method has only limited advantage over Gibbs. Given the greater complexity of PMCMC, the simpler Gibbs sampler would be the preferred method to use on this model. Algorithm 4.1: Gibbs Sampler for Dirichlet Mixture Model. 1 2 3 4 5 6 7 8 // see Appendix C.1 for details sample α(0) from prior initialise state vector X(0) For iteration i ≥ 1 sample α(i) ∼ π(·|X(i − 1)) set X(i) = X(i − 1) for k = 1, .., N do for n = 1, .., T do sample Xn (i) ∼ π(Xn (i)|X−n (i), y1:T ) 66 4.4 Markov Jump Processes (Lotka-Volterra) We consider here a discretely observed stochastic kinetic Lotka-Volterra (LV) model [9]; see [49] for a description of such models and their applications in systems biology. For this model it is not possible to compute the prior between two given times; i.e. P (Zt+T |Zt ) (integrating out the paths between times t and t + T ), which prevents us from designing proposals with the observations taken into account. We are thus required to propose from the prior. As we shall see, PMCMC allows us to still obtain good performance despite this limitation. The LV model describes the evolution of two species Zt1 (prey) and Zt2 (predator) which are non-negative integer-valued processes. In a small time interval (t, t + dt], there are three possible transitions for the Markov jump process (MJP) Zt = Zt1 , Zt2 2 1 = zt2 zt1 , zt2 = zt1 + 1, Zt+dt P Zt+dt 2 1 = zt2 + 1 zt1 , zt2 = zt1 − 1, Zt+dt P Zt+dt 2 1 = zt2 − 1 zt1 , zt2 = zt1 , Zt+dt P Zt+dt = α zt1 dt + o (dt) , = β zt1 zt2 dt + o (dt) , = γ zt2 dt + o (dt) , corresponding respectively to prey reproduction, predator reproduction & prey death, and predator death. We assume that we only observe the process Yn = Yn1 , Yn2 at discrete time points τn which is given by Yni = Zτin + Wni , (4.6) i.i.d. where Wni ∼ N 0, σ 2 I for i = 1, 2. The observation times are assumed to be evenly spaced at τn = n∆. We are interested here in making inference about the parameters θ = (α, β, γ) which are assumed to be a priori distributed as α ∼ Ga(1, 10), β ∼ Ga(1, 0.25), γ ∼ Ga(1, 7.5) where Ga is the Gamma distribution. For the initial populations we used x0 ∼ U(20, 80), Although it is possible to analyse such models directly using reversible-jump 67 MCMC (RJMCMC) [9], a very popular alternative consists of using a diffusion approximation instead [49, pp. 188-189] which typically leads to ‘simpler’ inference algorithms [48]. For the Lotka-Volterra model, one can obtain a two-dimensional diffusion process Xt = Xt1 , Xt2 with drift and instantaneous variance-covariance matrix given by [33] αXt1 − βXt1 Xt2 βXt1 Xt2 − γXt2 and αXt1 + βXt1 Xt2 βXt1 Xt2 −βXt1 Xt2 βXt1 Xt2 + γXt2 . (4.7) Most approaches rely on data augmentation (by discretising time using the Euler-Maruyama approximation) to solve the stochastic differential equations (SDEs). We applied PMCMC to this problem and have developed a Gaussian proposal for the SMC using an unscented Kalman smoother (UKS). However, we only had limited success with this approach and have found that a simpler method works much better (and faster) for this model. Instead of transforming the problem into an SDE, we would prefer to stay in the discrete state space and use the exact model. Unfortunately we cannot evaluate 1 2 the prior (transition) probability P(Z(n+1) , Z(n+1) |zn1 , zn2 ). However, we are able to sample exactly from the prior using Gillespie’s algorithm [46]. And since we are able to evaluate the likelihood, we can use SMC to sample the states, and PMCMC to sample the parameters (α, β, γ). Of course for this to work efficiently, the likelihood must be sufficiently diffuse in order to get enough particles in regions of high probability. We generated T = 50 observations y1:50 by simulating the MJP Zt = Zt1 , Zt2 using Gillespie’s algorithm [46, 49] and (4.6) with parameters α = 2, β = 0.05, and γ = 1.5, ∆ = 0.2, and σ 2 = 4; see Figure 4.11. Applying PMMH, the parameters were sampled using a Gaussian random walk proposal (independent for each component). In order to improve mixing, random subsets of the parameters were updated at each iteration. The number of components n to update was sampled uniformly. The variances used for the random walk proposal are s2α = 0.22 /n, sβ = 0.0052 /n, and sγ = 0.152 /n. The acceptance rate for the sampler was 36%, with an average ESS/N of 17.1% and resample rate of 95%. The simulation took 4hrs (17412.8 seconds) using a sin- 68 140 prey predator 120 100 80 60 40 20 0 -20 0 1 2 3 4 5 6 7 8 9 10 Figure 4.11: Synthetic data generated from the Lotka-Volterra model using Gillespie’s algorithm. The prey (Zt1 ) and predator (Zt2 ) are shown in green and red, respectively. The squares and circles indicate the observations Yn1 and Yn2 , respectively. gle core of an Intel Core2 Duo 6300 @ 1.86GHz. Figure 4.12 shows the histogram and trace plots of the parameters. The algorithm mixes well, and the posterior is centred around the true value. The auto-correlation functions of the parameters are shown in Figure 4.13. PMMH also gives posterior estimates of the hidden states at the observation times, as shown in Figure 4.14. Boys et al. [9] perform exact inference (i.e. without SDE approximation) on this type of model using RJMCMC and a block updating (BU) method. They use a different observation model, where the prey is observed without error, but the predator is unobserved. In their BU method, the latent process (reaction events) is updated in blocks consisting of pairs of intervals between observations, keeping the endpoints (i.e. population sizes at the beginning and end of the block) fixed and using a random walk proposal on the number of reactions. The parameters are updated conditional on sampled events. The authors note that, while RJMCMC mixes much worse than BU and requires significantly more CPU time, both “algorithms 69 70 Figure 4.12: Histogram and trace plots of the sampled parameters. 1 α β γ 0.8 0.6 0.4 0.2 0 -0.2 0 20 40 60 80 100 120 140 160 180 200 Figure 4.13: ACFs of the sampled parameters. suffered significant mixing problems” when the predators where unobserved, and due to CPU constraints no satisfactory results were obtained using RJMCMC. As evident from the trace plots in Figure 4.12 PMCMC does not suffer from this (although note that a direct comparison should be taken with a grain of salt due to the difference in the observation model). 71 80 70 60 50 40 30 20 10 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 90 80 70 60 50 40 30 20 10 0 Figure 4.14: Histograms of prey (top) and predator (bottom) states. The ground truth trajectory and observations are shown in green and blue, respectively. 72 4.5 L´evy-Driven Stochastic Volatility Model We now consider another model in which the prior cannot be evaluated, but we can sample from it. This model has the additional challenge of complex dependence between the parameters. The logarithm of an asset price y ∗ (t) is assumed to be determined by the following stochastic differential equation dy ∗ (t) = µ + βσ 2 (t) dt + σ (t) dB (t) where µ is the drift parameter, β the risk premium and B (t) is a Brownian motion. The instantaneous latent variance/volatility σ 2 (t) is assumed to be stationary and independent from B (t). It is modelled by the following L´evy-driven OrnsteinUhlenbeck process proposed in [4] dσ 2 (t) = −λσ 2 (t) dt + dz (λt) (4.8) where λ > 0 and z (t) is a purely non-Gaussian L´evy process with positive increments with z (0) = 0. We define the integrated volatility t σ 2 (u) du σ 2∗ (t) = 0 which satisfies from (4.8) σ 2∗ (t) = λ−1 z (λt) − σ 2 (t) + σ 2 (0) . Let ∆ denote the length of time between two periods of interests, then the increments of the integrated volatility satisfy σn2 = σ 2∗ (n∆) − σ 2∗ ((n − 1) ∆) = λ−1 z (λn∆) − σ 2 (n∆) − z (λ (n − 1) ∆) + σ 2 ((n − 1) ∆) 73 where σ 2 (n∆) z (λn∆) exp (−λ∆) σ 2 ((n − 1) ∆) = with exp (−λ∆) d ηn = + ηn z (λ (n − 1) ∆) ∆ 0 exp (λu) dz (λu) ∆ 0 dz (λu) . (4.9) By aggregating returns over a time interval of length ∆, we have n∆ dy ∗ (t) = y ∗ (n∆) − y ∗ ((n − 1) ∆) yn = (n−1)∆ thus, conditional on the volatility, we have yn ∼ N µ∆ + βσn2 , σn2 . Many publications have restricted themselves to the case where σ 2 (t) follows marginally a Gamma distribution in which cases the stochastic integrals appearing in 4.9 can be represented by a finite number of random variables. In this case sophisticated MCMC schemes were developed so as to perform Bayesian inference on λ and the parameters of the Gamma [42, 52, 69]. However, as outlined in [44] the use of Gamma models seems to have been mostly motivated by computational tractability. We address here the case where σ 2 (t) follows a tempered stable marginal distribution TS(κ, δ, γ) which includes the class of inverse Gaussian distributions for κ = 1 2; this class of models has been successfully used to fit the returns from exchange rate time series [5]. The tempered stable distribution does not admit a closed-form expression, but it is shown in [5] that ∞ d σ 2 (0) = i=1 ai κ A0 −1/κ 1/κ ∧ ei vi (4.10) where {ai }, {ei }, {vi } are independent of one another. The {ei } are i.i.d. exponential with mean 1/B, the {vi } are standard uniforms whereas a1 < a2 < . . . are 74 arrival times of a Poisson process of intensity 1 and A0 = δ2κ κ 1 , B = γ 1/κ . Γ (1 − κ) 2 It is also established that to ensure a TS(κ, δ, γ) marginal for σ 2 (t) then z (t) has to be the sum of an infinite activity L´evy process and of a compound Poisson process such that ∞ d ηn = i=1 exp (−λ∆ri ) 1 ai κ Aλ∆ −1/κ ∧ N (λ∆) 1/κ ei vi + i=1 exp (−λ∆ri∗ ) 1 ci (4.11) where A = δ2κ κ2 1 , B = γ 1/κ . Γ (1 − κ) 2 In (4.11), {ai } , {ei }, {ri }, {ri∗ }, {vi } are independent of one another. The {ai } , {ei }, {vi } follow the same distributions as in (4.10), the {ci } are i.i.d. Ga (1 − κ, B), and {ri }, {ri∗ } are standard uniforms. Finally N (λ∆) is a Poisson random variable of mean λ∆δγκ. It was shown experimentally in [5] that the infinite series appearing in (4.10)-(4.11) are dominated by the first few terms, “although as κ goes to one this becomes less sharp”. Performing inference in this context is extremely challenging as, although it is possible to sample approximately from the prior, it is not possible to evaluate it pointwise. In [44] the authors have proposed an MCMC algorithm for Bayesian inference. This requires sampling the volatility process σn2 ; this is achieved by updating {ηn } one at a time with an independent MH sampler using the prior as proposal (and relying on an alternative infinite series expansion of ηn ). In [21] an SMC algorithm was proposed to estimate σn2 when the parameters of the models are fixed: it uses the prior as a proposal as any alternative proposal would require a pointwise evaluation of the prior. We applied PMCMC to this model in order to perform Bayesian inference on the parameters {κ, δ, γ, λ} and we set here µ = β = 0 as in [21]. First we ran PMMH on a synthetic data set and then used the daily Standard & Poor’s 500 (S&P 500) closing prices from January 12, 2002 to December 30, 2005. 75 4.5.1 Synthetic Data We generated a synthetic data set with T = 400 observations using parameters √ √ θtrue = (κ, δ, γ, λ) = 0.5, 2, 8, 0.1 . The MH step in the PMMH used a Gaussian random walk proposal using the following covariance matrix ΣRW , which was found by running the algorithm for some time using a diagonal covariance matrix and then computing the sample covariance: ΣRW −0.00188407 −0.0602253 0.694237 0.0921531 0.000170793 = 0.0346668 0.0921531 1.82894 −0.121654 −0.00188407 0.000170793 −0.121654 0.199951 0.00834501 −0.0602253 0.0346668 We found that the parameters κ and δ were quite correlated. To improve mixing, we alternated between the following two proposals: 2 q1 (θ, θ∗ ) = N (θ∗ ; θ, 1.54 ΣRW ) 2 ∗ ) = N (θ ∗ ; θ , 1.5 ΣRW ), q2 (θ1:2 , θ1:2 1:2 1:2 2 1:2 ∗ =θ with θ3:4 3:4 The simulation is quite sensitive to κ, so we use a fairly informative prior. This was also observed in [44]. We used a Beta prior for κ and Gamma priors for the other three parameters: π(κ) = Be(10, 10) π(δ) = Ga(1, 7) π(γ) = Ga(1, 14) π(λ) = Ga(1, 0.5) We ran simulations using using 50, 100, and 200 particles. The autocorrelation of the parameters is shown in Figure 4.15), the histograms and scatter plots of the parameters (using 200 particles) are given in Figure 4.16, and the trace is plots are shown in Figure 4.17. The posterior for κ is almost the same as the prior, suggesting that the prior is either too informative, or that there is very little information about κ in the data. We also tried a broader prior and found that the posterior is also similar to the prior, but the simulation does not mix as well, in particular when κ gets close to 1. The acceptance rate for the PMMH was 10%. We used dynamic 76 1 κ 0.8 δ N= 50 N=100 N=200 0.6 0.4 0.2 0 1 γ 0.8 λ 0.6 0.4 0.2 0 0 200 400 600 800 10000 200 400 600 800 1000 Figure 4.15: L´evy-driven SV model using synthetic data: Comparison of auto-correlation functions for different number of particles. resampling in the SMC proposal and resampled on average at 12% of the time steps. 4.5.2 Standard & Poor’s 500 The Standard & Poor’s 500 (S&P 500) data set has 1000 observations, which are the daily increments in the log-prices from 12/01/2002 to 30/12/2005, and rescaled to have unit variance (the mean is 0.0079). The data is shown in Figure 4.18. Next we show simulations results using two different priors; the first one being a bit more informative than the second. Prior 1 The MH step in the PMMH used a Gaussian random walk proposal using the following covariance matrix ΣRW , which again was generated by running the algorithm for some time using a diagonal covariance matrix and then computing the 77 κ δ γ λ 0.3 0.4 0.5 0.6 0.7 0.8 2 4 6 8 10 12 2 4 6 8 10 12 0 0.5 1 1.5 2 2.5 Figure 4.16: L´evy-driven SV model using synthetic data: Histogram and 2D scatter plots of sampled parameter values. The prior is shown as dashed curve and the true value √ √is shown by the vertical dashed line: θtrue = (κ, δ, γ, λ) = 0.5, 2, 8, 0.1 . The data set has 400 observation and the simulation used 200 particles. sample covariance: RW Σ 0.00133842 −0.104184 9.80768e−05 1.24963e−05 −0.104184 16.3029 0.0881985 −0.000947038 = 9.80768e−05 0.0881985 0.00670481 7.94049e−06 1.24963e−05 −0.000947038 7.94049e−06 2.25697e−06 The same proposal was used as before (i.e. alternating between updating first two parameters, κ and δ, and updating all parameters jointly). The prior was chosen quite informative as follows: π(κ) = Be(4, 36) π(δ) = Ga(1, 7) π(γ) = Ga(1, 14) π(λ) = Ga(1, 0.5) 78 κ 0.8 0.7 0.6 0.5 0.4 0.3 δ 12 10 8 6 4 2 γ 12 10 8 6 4 2 λ 2.5 2 1.5 1 0.5 0 50000 5000 10000 15000 20000 25000 30000 35000 40000 45000 Figure 4.17: L´evy-driven SV model with synthetic data: Trace plots. where Be(α, β) is the Beta distribution, and Ga(k, θ) is the Gamma distribution with shape k and scale θ. The histograms and scatter plots of the parameters are shown in Figure 4.19. As evident from the trace and auto-correlation plots given in Figure 4.21 and 4.20, the Markov chain mixes quite well given the complex structure of the posterior. The SMC proposal performs very well with a resample rate of only 4% and average ESS of about 0.75N , where N = 500 is the number of particles used. The main difficulty lies is proposing good candidates for the parameters. The acceptance rate was 21%. Prior 2 For this simulation we broadened the priors1 as follows: π(κ) = Be(1, 15) π(δ) = Ga(1, 20) π(γ) = Ga(1, 10) π(λ) = Ga(1, 1) 1 The prior for γ was tightened, but this change had negligible effect on the posterior 79 Closing Price Log Return 1300 1250 1200 1150 1100 1050 1000 950 900 850 800 750 6 5 4 3 2 1 0 -1 -2 -3 -4 0 100 200 300 400 500 600 700 800 900 1000 Figure 4.18: S&P 500 data set for L´evy-driven SV. Top: daily prices from from 12/01/2002 to 30/12/2005. Bottom: difference of log of prices, scaled to have unit variance. The same covariance matrix as before was used for the random walk proposal, and we again alternated between updating the first two components and updating all parameters jointly. The acceptance rate was 23% with an average resampling rate of 4% and ESS=0.75N (again using N = 500 particles). The histograms and scatter plots of the samples are given in Figure 4.22. Figures 4.23 and 4.24 show the autocorrelation and trace, respectively. Discussion As expected from the simulations using the synthetic data, the simulation using the more informative prior 1 mixes much better than using prior 2. In both cases, we get similar results for the posterior of the parameters (see Table 4.5.2 for details). With this application we showcased a model in which we cannot evaluate the prior, but can only sample from it. This forced us to propose from the prior, which 80 κ δ γ λ 0.12 0.24 0.36 5 10 15 20 25 30 35 40 45 0.6 0.8 1 1.2 1.4 1.6 0.0057 0.0114 0.0171 Figure 4.19: L´evy-driven SVs model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Histogram and 2D scatter plots of sampled parameter values. The prior is shown as the dashed curve. The data set has 1000 observation and the simulation used 500 particles. can be very inefficient, especially when the observations are informative. However, using PMCMC we are still able to achieve good performance. 81 1 κ δ γ λ 0.8 0.6 0.4 0.2 0 -0.2 0 50 100 150 200 250 300 350 400 450 500 Figure 4.20: L´evy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: autocorrelation function of sampled parameter values. Figure 4.21: L´evy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Trace plots of sampled parameter values. 82 κ δ γ λ 0.1 0.2 0.3 0.4 0.5 20 40 60 80 0.6 0.8 1 1.2 1.4 1.6 0.0067 0.0134 0.0201 Figure 4.22: L´evy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Histogram and 2D scatter plots of sampled parameter values. The priors are shown as dashed curves and are π(κ) = Be(1, 15), π(δ) = Ga(1, 20), π(γ) = Ga(1, 10), π(λ) = Ga(1, 1) 83 1 κ δ γ λ 0.8 0.6 0.4 0.2 0 -0.2 0 50 100 150 200 250 300 350 400 450 500 Figure 4.23: L´evy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: autocorrelation function of sampled parameter values. Figure 4.24: L´evy-driven SV model using daily S&P 500 prices from 12/01/2002 to 30/12/2005: Trace plots of sampled parameter values. 84 parameter prior mean std 10th percentile median 90th percentile κ Be(4, 36) 0.1555 0.0524 0.0923 0.1502 0.2260 parameter prior mean std 10th percentile median 90th percentile κ Be(1, 15) 0.1452 0.0794 0.0555 0.1301 0.2555 Prior 1 δ γ Ga(1, 7) Ga(1, 14) 8.939 1.163 5.034 0.126 3.877 1.027 7.883 1.151 15.254 1.322 Prior 2 δ γ Ga(1, 20) Ga(1, 10) 14.004 1.155 11.523 0.134 3.630 1.030 10.443 1.136 30.212 1.326 λ Ga(1, 0.5) 0.008472 0.002055 0.006461 0.007924 0.011302 λ Ga(1, 1) 0.008661 0.002337 0.006501 0.008071 0.011558 Table 4.7: Posterior statistics of parameters for S&P 500 data set. 85 4.6 Tempering Following the idea outlined in [65, Sec. 4.2.2] we use particle Metropolis Hastings to sample from a multimodal target distribution. By using a sequence of tempered distributions we sample a chain of variables such that the marginal of the last variable in the chain is our target distribution. The sequence of distributions is built such that each two consecutive distributions are similar. An alternative method is Tempered Transitions by Neal [66], where the first distribution in the chain is the target distribution, and the following distributions are progressively easier to sample from (i.e. tempered) until the centre of the chain, at which point the distributions start approaching the target distribution, with the last distribution being the target. The method is shown in Algorithm 15. Note that for this case it is actually preferable to use SMC, as in [65], instead of PMH. In fact the only potential benefit of using PMH is its iterative nature, which allows us run it for an unspecified amount of time, e.g. until some time limit or a sufficient number of samples have been generated. However, this can also be accomplished using the sequentially interacting Markov chain Monte Carlo (SIMCMC) sampler [10], though we have not yet made any comparison with this method. Algorithm 4.2: Tempered Transitions 1 2 3 4 5 Initialise X(s = 0) At iteration s ≥ 1 Set X0∗ = X(s − 1) ∗ ) For n = 1, . . . , 2p: Sample Xn∗ ∼ Tn (·|Xn−1 With probability: 2p 1∧ j=0 πn+1 (Xn ) πn (Xn ) (4.12) ∗ , otherwise set X(s) = X(s − 1) set X(s) = X2p The target density is π(·) = π0 (·) and the sequence of distributions π0 , . . . , π2p with πp−n = πp+n , such that for n ≤ p, πn (·) is easier to sample from than πn−1 . We also define a sequence of transition kernels T1 , . . . , T2p , where Tn (X, ·) has πm (·) as an invariant distribution, with m = n for n ≤ p and m = 2p − n + 1 for 86 n > p. The following reversibility condition must hold: πn (x) Tn (x, x ) = πn (x ) T2p−n+1 (x , x), for n < p (4.13) Note that we may choose Tn to be the same as T2p−n+1 . Tempered Sampling with PMH and CBMC For the PMH and CBMC samplers we only need the second half of the sequence defined above. In this context we label them π ¯1 , . . . , π ¯p , with π ¯p = π0 . We also choose different transition kernels more suitable for the particle method. This gives rise to joint distributions as given in [65] using the artificial backwards transitions Ln (xn+1 , xn ): π ˜n (x1:n ) = γ˜n (x1:n )/Zn where (4.14) n−1 γ˜n (x1:n ) = γn (xn ) Ln (xn+1 , xn ) (4.15) j=1 and γj (xj ) ∝ π ¯j (xj ). The transition kernels Tn (xn |xn−1 ) have π ¯n as invariant distribution. Using the second half this sequence of target distributions we can sample “chains” of variables from the joint target distribution and obtain the marginal by simply discarding all but the last variable in the chain. Mixture of Gaussians We used the above algorithm to sample from a highly multimodal distribution. The example was taken from [66]. The probability distribution function (PDF) was a mixture of over 4000 Gaussians in 2-D with means separated by many times their widths. Despite being low dimensional, this is a difficult distribution to sample from, using MCMC and pretending we only have access to πn (x, y) (Eqn. 4.16) pointwise and up to an unknown normalising constant, and serves well to demon- 87 strate tempering using PMH. π0 (x, y) = + + + 5 i=−5 5 i=−5 22 i=−22 22 i=−22 |(x,y)−µ1,i,j |2 5 ) j=−5 exp − 2σ 2 2 |(x,y)−µ2,i,j | 5 ) j=−5 exp − 2σ 2 |(x,y)−µ3,i,j |2 22 ) j=−22 exp − 2σ 2 2 |(x,y)−µ4,i,j | 22 ) j=−22 exp − 2σ 2 (4.16) with σ = 0.001 and µ1,i,j = (0.0025i + 15, 0.0025j + 15) µ2,i,j = (0.1500i − 15, 0.1500j + 15) µ3,i,j = (0.0025i − 15, 0.0025j − 15) µ4,i,j = (0.1500i + 15, 0.1500j − 15) (4.17) We then define a tempering schedule by choosing a sequence β1 , . . . , βp with βp = 1 and 0 < βn < βn+1 ≤ 1. The distributions are then π ¯n = π0βn . The performance of the PMH sampler for this example was compared with Neal’s tempered transitions (NTT) and CBMC, the results are given in Table 4.8. Each algorithm was run 3 times for 8 hours (the results shown are averages). For each algorithm we chose 5 settings (number of particles and temperatures) that gave the best performance in terms of mean and standard deviation of the target distribution. PMH gives comparable performance as both CBMC and tempered transitions. Figure 4.25 shows the trace of the last 3000 samples for a PMH run using 5 particles and 20 temperatures. As we can see, there is a tradeoff between the number of particles N , the number of distributions p, and the number of samples to generate, given a fixed computation budget. In this model the best setting, among the ones tried, was to use only a small number of particles (N = 5), a medium number of tempered distributions (p = 20), and opt for a larger number of samples. Unfortunately we have no theoretical result to tell us this “sweet” spot in advance and one should perform a few short simulations at various settings to determine a “good” setting using standard MCMC and SMC diagnostics to measure performance. 88 89 2932 6177 16021 7836 33241 27689 14206 53000 145500 300000 CBMC CBMC CBMC CBMC CBMC Neal Neal Neal Neal Neal 1 1 1 1 1 20 5 10 20 5 N 5 5 10 20 10 100 200 50 20 10 50 100 20 20 20 p 20 100 10 10 20 -0.121 0.227 -0.606 1.650 0.337 0.0948 0.119 0.0203 0.0158 -0.0943 µx -0.0874 -0.0918 0.310 0.407 -0.160 227.0 228.7 225.4 221.4 216.5 227.4 226.4 226.1 226.6 225.5 var(x) 226.6 226.8 226.8 226.7 225.7 -13.27 -13.29 -13.45 -13.58 -12.27 -13.33 -13.33 -13.43 -13.37 -13.14 µy -13.31 -13.21 -13.21 -13.31 -13.11 51.22 50.56 43.33 43.66 72.30 48.41 47.83 47.72 47.08 54.34 var(y) 49.16 51.75 52.00 49.68 54.93 0.126 0.228 0.621 1.672 1.089 0.0975 0.121 0.119 0.0649 0.192 err(µ) 0.0874 0.134 0.326 0.407 0.257 1.559 2.075 6.509 8.091 24.86 1.393 1.884 2.093 2.603 4.823 err(var) 0.579 2.070 2.319 0.102 5.362 16 27 64 0.72 0.07 44 37 31 37 20 Acc 12 33 14 24 22 - - R 23 1 75 76 36 Table 4.8: Simulation results for a mixture of Gaussians with 4292 modes. Each algorithm was run for the same amount of time (3 times 8 hrs) and used 10 MH steps at each temperature. N is the number of particles and p is the number of distributions (temperatures) used. Acc is the average acceptance rate and R is the resampling rate for PMH. samples 78400 14200 87600 42600 38000 Alg. PMH PMH PMH PMH PMH 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 algorithm = PMH #samples = 26201 #particles = 5 #modes = 4293 nMC = 10 numT = 20 maxT = 2.68435e+08 15 10 y 5 0 -5 -10 -15 -20 20 1.2 15 1 10 0.8 5 0.6 0 0.4 -5 0.2 -10 0 -15 x Figure 4.25: Trace and marginal posterior distributions for mixture of Gaussians distribution using PMH with 5 particles and numT=20 temperatures (length of sequence of distributions), a maximum temperature of 228 =2.7×108 , and 10 Metropolis-Hastings steps within each temperature. The large (bottom-left) frame shows the trace of the last 1000 iterations (lines and small blue circles) and the locations of the 4292 modes of the target distribution (larger red circles). The top and the right frames show the histograms of the marginals using 26201 samples. Density Estimation using Finite Mixture of Gaussians Given data y1 , . . . , yc , which are independent and identically distributed, we want to infer the distribution from which they arise. We model the distribution as a finite mixture of Normal distributions as in [65]: r ωj N (µj , λ−1 j ) yi |θr ∼ (4.18) j=1 where θr = (µ1:r , λ1:r , ω1:r ), 2 ≤ r < ∞ and the number of Gaussians, r, is known. The weights ωj must sum to one. The priors are exchangeable for each mixture 90 component: µj ∼ N (ξ, κ−1 ) λj ∼ Ga(ν, χ), (4.19) ν = shape, χ = scale ω1:r−1 ∼ D(ρ) (4.20) (4.21) The parameters ξ and κ are set to the midpoint of the data range R and a small multiple ι of 1/R2 : ξ= min(yi ) + max(yi ) 2 κ= ι [max(yi ) − min(yi )]2 (4.22) The parameters for the Gamma distribution are set as follows: (2 =)ν > 1 > g = 0.2 h = 10 [max(yi ) − min(yi )]−2 χ = h/g The Dirichlet prior has parameter ρ = (1, . . . , 1). The log prior is given by: r log N (µj |ξ, κ−1 ) Ga(λj |ν, χ) log f (θr ) = + log D(ω1:r |ρ) j=1 We define a sequence of distributions which starts with the prior and ends with the full posterior: πp (θr ) ∝ l(y1:c |θr )φp f (θr ), p = 1, . . . , n (4.23) with 0 ≤ φ1 < . . . < φn = 1 We have compared the performance of PMH with NTT and with CBMC. The data set was taken from [65] and is shown in Figure 4.26. The acceptance rates for the three different algorithms are presented in Figure 4.27. The algorithm settings are chosen such that all have comparable computational complexity. PMH achieves at least double the acceptance rate over the other methods, using 20 particles and a sequence of 50 distributions (temperatures). The tempered transitions 91 0.6 Posterior for µ1 Input data (for likelihood) 0.5 Density 0.4 0.3 0.2 0.1 0 -10 -5 0 5 10 µ1 8 6 µ1 4 2 0 -2 -4 0 1000 2000 3000 4000 5000 6000 7000 Iteration Figure 4.26: Data set for 4-component mixture of Gaussians (green) and posterior probability for parameter µ1 (red) using PMH with p = 50 and 20 particles. The temperatures (1/βn ) schedule was linear with a maximum temperature of 50. Bottom: trace of µ1 . algorithm requires a much finer temperature scale with a sequence of 1000 distributions. CBMC fails to achieve acceptance rates of more than a few percent. 92 0.35 Acceptance rate: quartiles 0.3 0.25 0.2 0.15 0.1 0.05 0 PMH N=20 p=50 PMH N=30 p=30 Neal p=1000 PMH N=50 p=20 CBMC N=20 p=50 CBMC N=50 p=20 CBMC N=30 p=30 Figure 4.27: Acceptance rates for finite mixture model using PMH, CBMC, or NTT 93 Chapter 5 Conclusion We have presented a new class of MCMC algorithms which rely on proposal distributions built using SMC methods. One of the major advantages of this approach is that it “automatically” builds efficient high-dimensional proposal distributions whilst requiring the practitioner to design only low-dimensional proposal distributions. It offers the possibility to simultaneously update large vectors of dependent random variables. This strategy is computationally expensive but to some extent unavoidable and useful in complex scenarios for which standard proposals are likely to fail. Indeed, as demonstrated in simulation, they can allow algorithms to better explore the posterior distribution in multimodal cases for example. In addition, in practice these algorithms can be combined with “standard” local moves which are computationally cheaper. It is difficult to specify in advance when PMCMC will yield better performance than some alternative, possibly simpler, methods. However, we can take some lessons from the applications presented in this thesis and provide a few (rough) guidelines, as shown in Table 5.1. We only list the most commonly used algorithms, namely Gibbs, MH, and SMC, as well as the the methods introduced in this thesis. As a rule of thumb, the simpler methods should be tried first (unless it is clear that they will be inefficient). If they do not perform well, then implementing a PMCMC algorithm requires only little extra work, as it is based on MCMC and SMC, so part of the work would have already been done. For models in which the prior cannot be evaluated, but only sampled from (e.g. 94 dependence in x between θ and x x low dimensional weak strong G or MH MH G or MH MH x high dimensional weak strong G or SMC SMC or PMH PG PMMH Table 5.1: These are rough guidelines for algorithm choice. θ represents the parameters and x the states. G stands for Gibbs. PMH would be used for subblock updating. MJP or L´evy-driven SV models with certain marginals), PMCMC may be the only feasible method. We believe that many problems where SMC methods have already been successfully used could benefit from the particle MCMC methodology. These include contingency tables [15], graphical models [53], changepoint models [36], population dynamic models in ecology [12], volatility models in financial econometrics [16, 25], partially observed diffusions [37], population genetics [24],[61, Section 4.1.2], systems biology [48, 49], and experimental design problems [59]. The CBMC method, to which our approach is related, is a very popular method in computational chemistry and physics which has been widely used for molecular and polymer simulation [40], and particle MCMC algorithms might also prove a useful alternative in these areas. In this thesis we have addressed some of these topics to demonstrate the use of PMCMC. We are already aware of recent successful applications of PMCMC methods in econometrics [38] and statistics [7]. There are also numerous possible extensions to the PMCMC framework. Some have already been added, such as stratified sampling. Other resampling techniques could potentially also be added, for example antithetic sampling. In practice, the performance of the particle MCMC are closely related to the variance of the SMC estimates of the normalising constants. The SMC literature has many more techniques that could be used within PMCMC. For example we might investigate whether the SMC smoothing techniques in [47, 58] could be used to design better proposal distributions. We can also add adaptive strategies to set the algorithm parameters in PMCMC. It would nice to automatically adjust the number of particles used in the proposal to ensure a reasonable acceptance rate. In order to speed up the computation, a parallel implementation of the SMC component would also 95 be very useful. For example [63] and [54] have already demonstrated a parallel particle filter running on the GPU of a graphics card. From a theoretical point of view, it is possible to study how “close” the particle MCMC algorithms are from the ideal MCMC algorithms they are approximating (corresponding to N → ∞) using the techniques developed in [1]. 96 Bibliography [1] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient computation. Annals Statist., 2007. URL http://www.maths.bris.ac.uk/∼maxca/preprints/andrieu roberts 2006.pdf. in press. → pages 27, 44, 96, 107 [2] C. Andrieu, J. De Freitas, and A. Doucet. Sequential MCMC for bayesian model selection. In Proceedings IEEE Workshop Higher Order Statistics, pages 130–134, 1999. → pages 20 [3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo. Submitted to Journal of the Royal Statistical Society, November 2007. → pages 1 [4] O. Barndorff-Nielsen and N. Shephard. Non-Gaussian Orstein-Uhlenbeck-based models and some of their uses in financial economics. J. Royal Statist. Soc., 63:167–241, 2001. → pages 73 [5] O. Barndorff-Nielsen and N. Shephard. Normal modified stable processes. Theory Proba. Math. Statist., 65:1–19, 2001. → pages 74, 75 [6] M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164:1139–1160, 2003. → pages 44 [7] M. Belmonte, O. Papaspiliopoulos, and M. Pitt. Particle filter estimation of duration-type models. Technical report, Department of Statistics, Warwick University, 2008. → pages 95 [8] P. Bourne and H. Weissig. Structural Bioinformatics. John Wiley & Sons Publishing, New York, 2003. → pages 53 [9] R. J. Boys, D. Wilkinson, and T. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 18 97 (2):125–135, 2008. ISSN 0960-3174. doi:10.1007/s11222-007-9043-x. → pages 67, 68, 69 [10] A. Brockwell, P. D. Moral, and A. Doucet. Sequentially interacting Markov chain Monte Carlo. unpublished, 2008. → pages 86 [11] T. Brunette and O. Brock. Improving protein structure prediction with model-based search. Bioinformatics, 21:i66–i74, 2005. doi:10.1093/bioinformatics/bti1029. → pages 53 [12] S. Buckland, K. Newman, C. Fern´andez, L. Thomas, and J. Harwood. Embedding population dynamic models in inference. Statist. Sci., 22:44–5, 2007. → pages 2, 15, 95 [13] O. Capp´e, E. Moulines, and T. Ryd´en. Inference in Hidden Markov Models. Springer-Verlag, New York, 2005. → pages 15 [14] C. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81:541–553, 1994. → pages 45 [15] Y. Chen, P. Diaconis, S. Holmes, and J. Liu. Sequential Monte Carlo methods for statistical analysis of tables. J. Amer. Statist. Assoc., 100: 109–120, 2004. → pages 2, 3, 95 [16] S. Chib, M. K. Pitt, and N. Shephard. Likelihood based inference for diffusion driven state space models. unpublished, pages 1–33, 2006. URL http://www.olin.wustl.edu/faculty/chib/absdenew.htm. → pages 2, 95 [17] N. Chopin. A sequential particle filter method for static models. Biometrika, 89:539–552, 2002. → pages 3, 42 [18] N. Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Annals Statist., 32:2385–2411, 2004. → pages 12 [19] N. Combe, T. J. H. Vlugt, P. R. T. Wolde, and D. Frenkel. Dynamic pruned-enriched Rosenbluth method. Molecular Physics, 101:1675–1682, 2003. → pages 15, 44 [20] S. Consta, N. B. Wilding, D. Frenkel, and Z. Alexandrowicz. Recoil growth: An efficient simulation method for multi-polymer systems. Journal of Chemical Physics, 110:3220–3228, 1999. → pages 15 98 [21] D. Creal. Analysis of filtering and smoothing algorithms for L´evy-driven stochastic volatility models. Comp. Statist. Data Analy., 52:2863–2876, 2008. → pages 75 [22] T. Creighton. Protein Folding, pages 1–55. W.H. Freeman and Company, New York, USA, 1992. → pages 53 [23] D. Crisan and T. Lyons. Minimal entropy approximations and optimal algorithms for the filtering problem. Monte Carlo Meth. Appli., 8:343–356, 2002. → pages 11 [24] M. De Iorio and R. Griffiths. Importance sampling on coalescent histories. Adv. Appl. Proba., 36:417–433, 2004. → pages 95 [25] P. Dellaportas, D. Denison, and C. Holmes. Flexible threshold models for modelling interest rate volatiliy. Econometrics Review, 26:419–437, 2007. → pages 95 [26] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000. → pages 46 [27] A. Doucet, J. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001. → pages 2, 8, 46 [28] A. Doucet, S. J. Godsill, and C. P. Robert. Marginal maxiumum a posteriori estimation using Markov chain Monte Carlo. Statistics and Computing, 12: 77–84, 2002. → pages 36 [29] J. Durbin and S. Koopman. Time Series Analysis by State Space Methods. Oxford University Press, 2001. → pages 15 [30] M. Escobar and M. West. Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc., 90:577–588, 1995. → pages 62, 63, 114, 115 [31] P. Fearnhead. MCMC, sufficient statistics and particle filter. J. Comp. Graph. Stat., 11:848–862, 2002. → pages 20 [32] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Journal of Statistics and Computing, 14:11–21, 2004. doi:10.1023/B:STCO.0000009418.04621.cd. → pages 2, 3, 42, 61, 62, 63 99 [33] P. Fearnhead. Computational methods for complex stochastic systems: Alternatives to MCMC. Technical report, Department of Mathematics and Statistics, Lancaster University, 2007. → pages 68 [34] P. Fearnhead. Online inference for multiple changepoint problems. J. Royal Statist. Soc. B, 69:589–605, 2007. → pages 42 [35] P. Fearnhead and J. R. Clifford. On-line inference for hidden Markov models via particle filters. Statist. Soc. B, 65:887–899, 2003. → pages 55 [36] P. Fearnhead and L. Meligkotsidou. Filtering methods for mixture models. Journal of Computational & Graphical Statistics, 16:586–607, September 2007. → pages 15, 95 [37] P. Fearnhead, O. Papaspiliopoulos, and G. Roberts. Particle filters for partially-observed diffusion. J. Royal Statist. Soc. B, 69:589–605, 2008. doi:10.1111/j.1467-9868.2008.00661.x. → pages 15, 95 [38] T. Flury and N. Shephard. Bayesian inference based only on simulated likelihood: particle filter analysis of dynamic economic models. Technical report, Nuffield College, Oxford University, 2008. → pages 95 [39] D. Frenkel. Waste-recycling Monte Carlo, pages 127–138. Springer-Verlag, New York, December 2006. ISBN 354035283X. URL http://www.springer.com/materials/book/978-3-540-35283-9. → pages 28 [40] D. Frenkel and B. Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press (Elsevier), 2nd edition edition, 2002. → pages 15, 43, 46, 53, 95 [41] S. Fr¨uhwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Verlag, New York, 2006. → pages 15, 45 [42] S. Fr¨uhwirth-Schnatter and L. S¨ogner. Bayesian estimation of stochastic volatility models based on OU processes with marginal gamma law. Ann. Instit. Statist. Math. to appear. → pages 74 [43] S. G. Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Processing, 50:281–289, 2002. → pages 20 [44] M. Gander and D. Stephens. Stochastic volatility modelling in continuous time with general marginal distributions: Inference, prediction and model selection. J. Statist. Plann. Inf., 137:3068–3081, 2007. → pages 74, 75, 76 100 [45] W. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. J. Royal Statist. Soc. B, 63: 127–146, 2001. → pages 3, 42 [46] D. T. Gillespie. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem., 81:2340–2361, 1977. → pages 68 [47] S. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for nonlinear time series. J. Amer. Statist. Assoc., 99:156–168, 2004. → pages 46, 95 [48] A. Golightly and D. J. Wilkinson. Bayesian sequential inference for stochastic kinetic biochemical network models. J. Comp. Biology, 13: 838–851, 2006. → pages 2, 68, 95 [49] A. Golightly and D. J. Wilkinson. Bayesian sequential inference for nonlinear multivariate diffusions. Statistics and Computing, 16:323–338, 2006. → pages 67, 68, 95 [50] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE-Proceedings-F 140, pages 107–113, 1993. → pages 15 [51] P. Grassberger. Pruned-enriched Rosenbluth method: Simulations of θ polymers of chain length up to 1 000 000. Phys. Rev. E, 56:3682 – 3693, 1997. → pages 15 [52] J. Griffin and M. Steel. Inference with non-Gaussian ornstein-unlhenbeck processes for stochastic volatility. J. Econometrics, 134:605–644, 2006. → pages 74 [53] F. Hamze and N. de Freitas. Hot coupling: a particle approach to inference and normalization on pairwise undirected graphs. In Adv. Neural Info. Proc. Sys., volume 18. MIT Press, 2005. → pages 2, 3, 95 [54] G. Hendeby, J. D. Hol, R. Karlsson, and F. Gustafsson. A graphics processing unit implementation of the particle filter. In European Signal Processing Conference (EUSIPCO), 2007. → pages 96 [55] H. Hsu, V. Mehra, W. Nadler, and P. Grassberger. Growth-based optimization algorithm for lattice heteropolymers. Physical Review E, 68(2): 21113, 2003. → pages 55 [56] C. Jarzynski. Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78:2690–2693, 1997. → pages 42 101 [57] G. Kitagawa. Monte Carlo filter and smoother for non-gaussian nonlinear state space models. J. Comp. Graph. Statist., 5:1–25, 1996. → pages 11, 33, 46 [58] G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space models, chapter 9. Springer-Verlag, New York, 2001. → pages 95 [59] H. Kueck, N. de Freitas, and A. Doucet. SMC samplers for Bayesian optimal nonlinear design. In IEEE Nonlinear Statistical Signal Processing Workshop, pages 99–102, Cambridge, UK, 13-15 Sept. 2006. ISBN 978-1-4244-0581-7. doi:10.1109/NSSPW.2006.4378829. → pages 95 [60] J. Liu, F. Liang, and W. Wong. The use of multiple-try method and local optimization in metropolis sampling. J. Amer. Statist. Assoc., 95:121–134, 2000. → pages 44 [61] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Verlag, New York, 1st edition edition, 2001. → pages 1, 2, 3, 8, 11, 41, 53, 54, 61, 95, 106 [62] K. Mengersen and R. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Annals Statist., 24:101–121, 1996. → pages 104 ´ [63] A. S. Montemayor, J. J. Pantrigo, Angel S´anchez, and F. Fern´andez. Particle filter on GPUs for real-time tracking. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Posters, page 94, New York, NY, USA, 2004. ACM. ISBN 1-58113-896-2. doi:10.1145/1186415.1186526. → pages 96 [64] P. D. Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer-Verlag, New York, 2004. → pages 12, 13 [65] P. D. Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68:411–436, 2006. → pages 2, 3, 42, 43, 86, 87, 90, 91 [66] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353–366, 1996. → pages xi, 86, 87 [67] M. K. Pitt and N. Shephard. Filtering via simulation: auxiliary particle filter. J. Am. Statist. Assoc., 94:590–599, 1999. → pages 2, 51 102 [68] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2nd edition edition, 2004. → pages 1, 5, 6, 54, 62 [69] G. Roberts, O. Papaspiliopoulos, and P. Dellaportas. Bayesian inference for non-Gaussian Orstein-Uhlenbeck stochastic volatility processes. J. Royal Statist. Soc. B, 66:369–393, 2004. → pages 74 [70] M. Rosenbluth and A. Rosenbluth. Monte Carlo simulations of the average extension of molecular chains. J. Chem. Phys., 23:356–359, 1955. → pages 14 [71] N. Shephard and M. K. Pitt. Likelihood analysis of non-Gaussian measurement time series. Biometrika, 84:653–667, 1997. → pages 18, 46, 51 [72] A. Shmygelska and H. H. Hoos. An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. Bioinformatics, 6:30, 2005. → pages 53 [73] J. Siepmann. A method for the direct calculation of chemical potentials for dense chain systems. Mol. Phys., 70:1145–1158, 1990. → pages 14 [74] J. Siepmann and D. Frenkel. Configurational-bias Monte Carlo: a new sampling scheme for flexible chains. Mol. Phys., 75:59, 1992. → pages 43, 53 [75] C. Thachuk, A. Shmygelska, and H. H. Hoos. A replica exchange Monte Carlo algorithm for protein folding in the HP model. Bioinformatics, 8:342, 2007. → pages 53, 55, 59 [76] J. Zhang, S. Kou, and J. S. Liu. Biopolymer structure simulation and optimization via fragment regrowth Monte Carlo. J. Chem. Phys., 126(22): 225101, 2007. → pages xi, 55, 60 103 Appendix A Proofs Proof of Theorem 2. Assumption (A2) ensures that q N covers the support of π ˜N and hence that the PMH defines an irreducible and aperiodic Markov chain with invariant distribution π ˜ N from Theorem 1. It follows that the marginal distribution of the sequence of random variables generated by the algorithm converges to π ˜N . K(i) Since x(i) = xp (i) we have established the first result. Now under (A3) we have for all x1 , ..., xp , a1 , . . . , ap−1 π ˜ N (k, x1 , ..., xp , a1 , . . . , ap−1 ) Zˆ N = < Z −1 q N (k, x1 , ..., xp , a1 , . . . , ap−1 ) Z p Bn < +∞. n=1 For an independent MH algorithm this implies uniform geometric ergodicity top wards π ˜ N , with a rate at least 1 − Z/ Bn ; see for example [62, Theorem n=1 2.1]. This, together with a reasoning similar to above concerning x(i), we deduce the second statement of the theorem. Proof of Theorem 3. To simplify notation, we note z = (¯ x1 , ..., x ¯p , a1 , . . . , ap−1 ). The transition probability of the PMH is of the form ˙ z˙ P (k, z) , k, ˙ = w˙ pk × ψ (˙z) × α (z, z˙ ) (A.1) N ˙ z˙ w ˇpm × ψ (ˇ z) × (1 − α (z, ˇ z)) dˇ z . δ(k,z) k, + m=1 104 where α (z, z˙ ) = 1 ∧ Z N (˙z) . Z N (z) Note the important fact used below that the acceptance probability is independent of k and k˙ (which, as pointed out earlier, means that k˙ needs to be sampled only when a new set of particles has been accepted). Now consider the function N f xlp I {l = k} , F (k, z) := l=1 and note the fact that since by construction π ˜ N (k, z) is invariant with respect to P , N ˙ z˙ π ˜ N (k, z) P (k, z) , k, ˙ z˙ dzd˙z F k, (A.2) ˙ k,k=1 N ˙ z˙ F k, ˙ z˙ d˙z π ˜ N k, = = Eπ (f ) ˙ k=1 ˙ ∼ ψ (defined in (3.2)) and U be a random variable uniformly Let Z ∼ π ˜N , Z distributed on [0, 1]. Then, using (A.1), the following quantity is an estimate of the left hand side of (A.2) N ˙ pk˙ f (Z ˙ kp˙ ) I W N ˙ U< α Z, Z ˙ +I U> α Z, Z ˙ k=1 F (k, Z) π ˜ N (k|Z) , k=1 (A.3) and it can be checked using (3.4) that π ˜ N ( k| z) = π ˜ N (k, z) = wpk . π ˜ N (z) The result of the theorem follows straightforwardly from this and the fact that under (A1)-(A2), the simulated Markov chain is ergodic i.e. as i → ∞, (K(i), Z(i)) ∼ π ˜ N (k, z). 105 Proof of Theorem 4. To establish the result, we first rewrite (3.15) as π ˜ N (θ, k, x1 , ..., xp , a1 , . . . , ap−1 ) = 1 π θ, xkp Np p N M1θ xi1 r −ikn an−1 n=2 i=1,i=ik1 N ikn wn−1 , an−1 ai n−1 Mnθ xn−1 , xin i=1,i=ikn (A.4) Notice that with ik = (ik1 , ik2 , . . . , ikp ) the ancestral lineage of xkp , we have π ˜ N θ, xkp , ik = 1 π θ, xkp Np . Now the PG sampler can be interpreted as a standard Gibbs sampler of invariant distribution (3.15) which iterates the following steps (a) θ|(xkp , ik ) ∼ π ˜ N ·|k, xkp , ik = π θ|xkp −ik −ik −ik −ik p )∼π ˜ N ·|θ, k, xkp , ik (b) (x1 1 , ..., xp p , a1 2 , . . . , ap−1 (c) k|(θ, x1 , ..., xp , a1 , . . . , ap−1 ) ∼ π ˜ N (·|θ, x1 , ..., xp , a1 , . . . , ap−1 ) with π ˜ N (k|θ, x1 , ..., xp , a1 , . . . , ap−1 ) = wpk . Note that (a) might appear unusual but leaves (3.15) invariant and is known in the literature under the name “collapsed Gibbs sampling” [61, Section 6.7]. Let A ∈ B(Θ), B ∈ B(X p ), C ∈ B(X (N −1)p × {1, . . . , N }(N −1)p ), k ∈ {1, . . . , N } and i ∈ {1, . . . , N }p be such that π ˜ N ({k} × A × B × {i} × C) > 0. From (A5) it is possible to show that accessible sets for the G sampler are also marginally accessible by the PG sampler i.e. more precisely if A × B ∈ B(Θ) × X p is such that LG ((θ(j), x (j)) ∈ A × B) > 0 for some finite j > 0 then also LP G ((K(j), θ(j), x (j) , I(j)) ∈ {k} × A × B × {i}) > 0 for all k ∈ {1, . . . , N } and i ∈ {1, . . . , N }p . From this and the assumed irreducibility of the G sampler in (A6), we deduce that if π((θ, x) ∈ A × B) > 0 then there exists a finite j such that LP G ((K(j), θ(j), x (j) , I(j)) ∈ {k} × A × B × {i}) > 0 for all k ∈ {1, . . . , N } and i ∈ {1, . . . , N }p . Now because π((θ, x) ∈ A × B) > 0 and step (b) corresponds to sampling from the conditional distribution of π ˜ N , we 106 deduce that K(j+1) −I1 LP G (K(j +1), θ(j +1), x (j +1) , I(j +1), x1 K(j+1) −I2 −I (j +1), K(j+1) (j +1), . . . , Ap−1p A1 K(j+1) −Ip (j +1), ..., xp ∈ {k} × A × B × {i} × C (j +1) >0 and the irreducibility of the PG sampler follows. Aperiodicity can be proved by contradiction. Indeed from (A5) we deduce that if the PG sampler is periodic, then so is the G sampler, which contradicts (A6). Proof of Theorem 5. The proof of the first part of the theorem is similar to the proof of Theorem 1 and is shown below. The second part of the proof is a direct consequence of [1, Theorem 1] and (A5)-(A7). π ˜ N (·) 1 = p N q (·) N π θ, xkp ik q (θ∗ , θ) × wpk × M1θ (x11 ) π = ik θ, xkp p q (θ∗ , θ) M1θ (x11 ) π = ik q (θ∗ , θ) M1θ (x11 ) k✘/Z γ✘ θ,✘ x✘ ✘ p = q (θ∗ , θ) n=2 ik n=2 ik n−1 , xnn ) r(ikn−1 |wn−1 )Mnθ (xn−1 1 Np ik p ik n−1 , xnn ) Mnθ (xn−1 ik wnn n=1 1 Np θ, xkp p n=2 p ik ik p n−1 , xnn ) Mnθ (xn−1,n γ N (θ) k✘ γ✘ θ,✘ x✘ ✘ p = N i=1 wn n=1 p xin ik wn (xnn ) n=1 1 γ N(θ) Z q (θ∗ , θ) where γ N (θ) = Z N is given in (2.6). In the manipulations above we have used (A1) on the second line, whereas the final result is obtained thanks to the definitions of the incremental weights (2.2) w1 xi1 := γ1 (θ, xi1 ) , M1θ (xi1 ) γn θ, xin wn xin := γn−1 and of the normalising constant estimate (2.6). 107 Ain−1 θ, xn−1 Mnθ Ain−1 xn−1 , Xni , Appendix B Nonlinear State-Space Model B.1 Gibbs Sampling for State-Space Model We want to sample the time series using Gibbs sampling. Thus we need the conditional probabilities π(θ|x1:T ) and π(xj |x−j , y1:T ). B.1.1 Sampling Parameters For now, assume that we are interested in the variance of the noise in the state transition, i.e. σV2 . π(σV2 |x1:T ) ∝ π(x1:T |σV2 ) π(σV2 ) = π(σV2 ) π(x1 ) T j=2 exp − −(T −1) e−βV σV = π(σV2 ) σV 1 2 1 2 2πσV −(T −1) ∝ π(σV2 ) σV where we set βV = √ T j=2 [xj [xj −f (xj−1 )]2 2 2σV T j=2 [xj 1 2 − f (xj−1 )]2 σV−2 −2 − f (xj−1 )]2 . Setting κV = σV−2 we get: (T −1)/2 π(κV |x1:T ) ∝ π(κV ) κV exp − e−βV κV = (T −1)/2 π(κV ) κV e−βV κV Now we need to choose a prior for κV . A convenient one is the Gamma distribution 108 with shape ξv and scale γv : π(κV |x1:T ) ∝ (T −1)/2+ξV −1 κV ∝ Ga κV ; e “ ” − βV + γ1 κV (B.1) v T −1 + ξV , 2 βV + 1 γV −1 (B.2) Note that this implies 2 π(σW ) = IGa(ξv , 1/γv ) Similarly, for the variance of the observation error we get: 2 2 π(σW |x1:T ) ∝ π(x1:T |σV2 , y1:T ) π(σW ) T 1 2 ) π(x0 ) = π(σW j=1 (B.3) exp − 2 2πσW [yj − g(xj )]2 2 2σW −2 −T −βW σW 2 ∝ π(σW ) σW e where βW = κW = −2 σW 1 2 T j=1 [yj (B.5) − g(xj )]2 . Using the same approach as above, we sample using a Gamma prior with shape ξW and scale γW : π(κW |x1:T ) ∝ T /2+ξW −1 κW ∝ Ga κW ; B.1.2 (B.4) e “ ” − βW + γ 1 κW T + ξW , 2 W βW + 1 γW −1 Sampling State Variables The conditional distribution for the latent variables xi is as follows: for j = 1 π(xj+1 |xj ) π(xj |yj ), π(xj |x−j , y1:T ) = π(xj |xj−1 ) π(xj+1 |xj )π(xj |yj ), for 1 < j < T π(xj |xj−1 ) π(xj |yj ), for j = T 109 (B.6) with π(xj |xj−1 ) = N xj ; f (xj−1 ), σV2 2 π(xj |yj ) = N yj ; g(xj ), σW In general we cannot sample from this directly (except for some simple functions f and g). We therefore use a few MH steps within the Gibbs sampler. The proposal for this is the same as for the PMCMC (i.e. proposing from prior or using the extended Kalman filter (EKF) proposal). 110 Appendix C Dirichlet Mixture Model: Derivation C.1 Gaussian Mixture The posterior is proportional to the prior times the likelihood: P(x1:n |y1:n , α) ∝ P(x1:n |α) π(y1:n |x1:n , α) = P(x1:n |α) π(y1:n |x1:n ) (C.1) The prior P(x1:n |α) is given by equation 4.4 n P(x1:n |α) = P(xi |x1:i−1 , α) = i=1 j kn j=1 (mn α kn − 1)! α (α + 1)...(α + n − 1) (C.2) The likelihood factorises since the observations are independent given the cluster: kn π(y1:n |x1:n , α) = kn π(y{s|xs =j} |x1:n , α) = j=1 kn = j=1 π(y{s|xs =j} , θj |x1:n ) dθj j=1 kn π(yi |θj ) π(θj ) dθj = gθj (yi ) dG0 (θj ) j=1 {i|xi =j} {i|xi =j} (C.3) 111 We can now substitute for gθj and G0 : π(y1:n |x1:n , α) kn = N (yi ; µj , σj2 ) IGa(σj2 ; a, b) N (µj ; η, τ σj2 ) dµj d(σj2 ) j=1 {i|xi =j} exp kn = j=1 (y −µ )2 − i2σ2j j 2πσj2 {i|xi =j} (µj −η)2 a exp −b b σj2 exp − 2τ σj2 × a+1 2 2πτ σj2 Γ(a) σj 2 {i|xi =j} (yi −µj ) 2 2σj P kn = j=1 kn ∝ j=1 ba exp − Γ(a) ba Γ(a) 2πσj2 σj2 −(1+a+ mjn /2 σj2 − a+1 b σj2 − dµj d(σj2 ) (µj −η)2 2τ σj2 2πτ σj2 1/2 dµj d(σj2 ) j mn +1 ) 2 √ τ 1 1 × exp − 2 b+ σj 2 1 2 2 (yi −µj ) + (µj −η) dµj d(σj2 ) 2τ {i|xi =j} We now have to complete the square in order to integrate out µj : 1 (yi − µj )2 + (µj − η)2 τ {i|xi =j} 1 = µ2j mjn + − 2µj yi − τ Φjn = {i|xi =j} 112 η + τ {i|xi =j} yi2 + η2 τ Now define the first and second moment: j y˜n,1 = j y˜n,2 = 1 yi mjn {i|x =j} i 1 yi2 j mn {i|x =j} i ≡ y¯nj (C.4) ≡ (ˆ σnj )2 + (¯ ynj )2 (C.5) We continue to complete the square and further simplify the expression Φjn : Φjn = µ2j mjn + mjn = 1 τ 1 + τ + mjn j − 2µj mjn y˜n,1 − µj − η η2 j + mjn y˜n,2 + τ τ 2 j mjn y˜n,1 − η/τ mjn + j j )2 + − (˜ yn,1 y˜n,2 1 τ j − η)2 (˜ yn,1 1 + mjn τ Now we end up with an expression that has the form of a Gaussian in µj : If we now reinsert this into the above exponential, we can easily marginalise out µj . kn π(y1:n |x1:n , α) ∝ j=1 ba √ Γ(a) τ −(1+a+ σj2 j mn +1 ) 2 j (˜ yn,1 − η)2 mjn 1 j j 2 yn,1 ) + y˜n,2 − (˜ × exp − 2 b + 2 σj 1 + mjn τ 2 j j 1 − η/τ m y ˜ n n,1 1 × exp − 2 mjn + µj − dµj d(σj2 ) j 1 2σj τ mn + τ For ease of notation, we define Ψjn = mjn b+ 2 j y˜n,2 − j (˜ yn,1 )2 113 + j (˜ yn,1 − η)2 1 + mjn τ (C.6) The integral over µj simply evaluates to the normalising constant of a Gaussian: kn π(y1:n |x1:n , α) ∝ j=1 kn = j=1 ba √ Γ(a) τ ba √ Γ(a) τ σj2 −(1+a+ mjn 2π mjn + 2πσj2 j mn +1 ) 2 σj2 1 τ + j m −(1+a+ 2n ) j 2 e−Ψn /σj d(σj2 ) 1 τ j 2 e−Ψn /σj d(σj2 ) This integral has the form of the Inverse-Gamma distribution, thus we can solve it analytically and the result is the normalising constant: x−(α+1) exp −β x dx = Γ(α) β −α (C.7) Thus we get kn π(y1:n |x1:n , α) ∝ j=1 kn ∝ ba √ Γ(a) τ ba Γ a + mjn /2 Γ(a) j=1 mjn τ + 1 2π mjn + b+ 1 τ mjn 2 Γ a+ σ ˆnj 2 mjn 2 + (¯ ynj (Ψjn )−(a+ − η)2 1 + mjn τ j mn ) 2 −(a+ j mn ) 2 (C.8) The derivation for sampling α conditional on the current state (number of clusters) is outlined in [30]. We repeat it here for convenience: We first determine the probability distribution of the number of clusters given α and the number of data points. P(kn |α, n) = cn (kn ) n! αkn Γ(α) Γ(α + n) (C.9) where cn (kn ) = P(kn |α = 1, n). It is possible to write the above fraction of Gamma functions in terms of a Beta function, which has the following definition 114 and property: 1 tx−1 (1 − t)y−1 dt = β(x, y) = 0 Γ(x)Γ(y) Γ(x + y) Thus rewriting this and inserting it into the equation of P(kn |α, n) gives (where we let x = α + 1, y = n): Γ(α + 1)/α Γ(α + 1 + n)/(α + n) (α + n) β(α + 1, n) = cn (kn ) n! αkn αΓ(n) P(kn |α, n) = cn (kn ) n! αkn We now replace the Beta function with its definition as an integral: P(kn |α, n) = cn (kn ) n! kn −1 α (α + n) Γ(n) 1 tα (1 − t)n−1 dt 0 By Bayes’ formula the posterior for α conditional on kn is thus: 1 π(α|kn , n) ∝ π(α)π(kn |α, n) ∝ π(α) αkn −1 (α + n) tα (1 − t)n−1 dt 0 This suggests that we can interpret π(α|kn , n) as the marginal of a joint distribution for α and another variable v defined on the continuous space v ∈ (0, 1): π(α, v|kn , n) ∝ π(α) αkn −1 (α + n) v α (1 − v)n−1 Thus we can now establish conditional posteriors π(α|v, kn , n) and π(v|α, kn , n). Under a Gamma prior for α (Eqn. 4.5) we have: 1 π(α|v, kn , n) ∝ Ga(α; c, d) αkn −1 (α + n) v α ∝ αc−1 e−α/d αkn −1 (α + n) v α = (α + n) αc+kn −2 e−α/d+α log(v) = αc+kn −1 e−α(1/d−log(v)) + n αc+kn −2 e−α(1/d−log(v)) 1 Note that Escobar and West [30] define Ga(a, b) with b being the scale parameter on pg. 584, which is a typo and b is actually the rate parameter (inverse of scale). 115 This can be written as a mixture of two Gamma distributions: (α|v, kn , n) ∼ πv Ga (c + kn , θv ) + (1 − πv ) Ga (c + kn − 1, θv ) where θv = 1 d − log(v) −1 (C.10) is the scale parameter, and the weights πv are given by the ratio of the normalising constants of the two Gamma distributions. Recall the density of the Gamma distribution is defined as: Ga(x; k, θ) = xk−1 e−x/θ θk Γ(k) for x > 0 and k, θ > 0 (C.11) where k and θ denote the shape and scale, respectively. πv n 1 − πv = θvc+kn Γ(c + kn ) θvc+kn −1 Γ(c + kn − 1) = c + kn − 1 1 d − log(v) (C.12) Thus we get: πv = 1+ n 1 d − log(v) c + kn − 1 −1 (C.13) Now consider the conditional distribution for v|α, kn , n: π(v|α, kn , n) ∝ v α (1 − v)n−1 (0 < v < 1) This is simply a Beta distribution with mean (α + 1)/(α + n + 1): π(v|α, kn , n) = B(v; α + 1, n) (C.14) Thus we can sample α by first sampling v ∼ B(α + 1, n) and then sample α from the mixture of Gamma distributions defined above. 116 C.2 Implementation The optimal proposal for the particle filter is q(xn |x1:n−1 , y1:n , α) = πn (xn |xn−1 ): π(y1:n |x1:n , α) P(xn |x1:n−1 , α) π(y1:n |x1:n−1 , α) π(y1:n |x1:n , α) P(xn |x1:n−1 , α) (C.15) π(y 1:n |xn , x1:n−1 , α) P(xn |x1:n−1 , α) xn q(xn |x1:n−1 , y1:n , α) = P(xn |x1:n−1 , y1:n , α) = = Since we have a finite (and generally low) number of clusters, we can compute q(xn |x1:n−1 , y1:n , α) for all possible values of xn . Thus we can compute the normalising constant and sample from P(xn |x1:n−1 , y1:n , α) exactly. xn ∼ kn−1 +1 π(y1:n |xn , x1:n−1 , α) P(xn |x1:n−1 , α) δj (xn ) j=1 kn−1 +1 xn =1 π(y1:n |xn , x1:n−1 , α) P(xn |x1:n−1 , α) As noted above, in SMC the optimal proposal for xn is πn (xn |x1:n−1 ), which gives the following (incremental) weight (the dependence on α is suppressed for ease of notation): wn (xn ) = = = = P(x1:n−1 |y1:n ) = P(x1:n−1 |y1:n−1 ) P(x1:n−1 |y1:n−1 ) xn π(y1:n |xn , x1:n−1 ) P(xn , x1:n−1 ) / π(y1:n ) π(y1:n−1 |x1:n−1 ) P(x1:n−1 ) / π(y1:n−1 ) xn π(y1:n |xn , x1:n−1 ) P(xn |x1:n−1 ) ∝ π(y1:n−1 |x1:n−1 ) = xn j kn ¯nj , (ˆ σnj )2 |x1:n−1 , xn , y1:n ) j=1 f (mn , y kn−1 j j j ¯n−1 , (ˆ σn−1 )2 |x1:n−1 , y1:n−1 ) j=1 f (mn−1 , y kn−1 = ✭✭ ✭ ) ✭1:n−1 πn✭ (x✭ πn (x1:n−1 ) ✭ n |x ✭✭✭ ✭1:n−1 πn✭ (x✭ ) πn−1 (x1:n−1 ) ✭ n |x xn P(xn , x1:n−1 |y1:n ) πn (x1:n ) πn−1 (x1:n−1 ) πn (xn |x1:n−1 ) αf (1, yn , 0) + n−1+α j=1 (C.16) P(xn |x1:n−1 ) mjn−1 f (mjn , y¯nj , (ˆ σnj )2 |x1:n−1 , xn=j, y1:n ) j j n − 1 + α f (mjn−1 , y¯n−1 , (ˆ σn−1 )2 |x1:n−1 , y1:n−1 ) (C.17) 117 where f is the likelihood contribution from a cluster at time n (the sufficient statistics obviously depend on xn ): f (mjn , y¯nj , (ˆ σnj )2 |x1:n , y1:n ) = ba Γ a + mjn /2 Γ(a) mjn τ b+ +1 mjn 2 σ ˆnj 2 + (¯ ynj − η)2 −(a+ j mn ) 2 1 + mjn τ Note that the weight for the particle is independent of the choice of xn . We also see that the numerator in equation (C.16) is simply the normalisation factor used in equation (C.15). This allows for an efficient computation of the particle weight since both the normalisation constant and the likelihood of the previous time steps have already been computed. 118
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Particle Markov chain Monte Carlo
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Particle Markov chain Monte Carlo Holenstein, Roman 2009
pdf
Page Metadata
Item Metadata
Title | Particle Markov chain Monte Carlo |
Creator |
Holenstein, Roman |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods have emerged as the two main tools to sample from high-dimensional probability distributions. Although asymptotic convergence of MCMC algorithms is ensured under weak assumptions, the performance of these latters is unreliable when the proposal distributions used to explore the space are poorly chosen and/or if highly correlated variables are updated independently. In this thesis we propose a new Monte Carlo framework in which we build efficient high-dimensional proposal distributions using SMC methods. This allows us to design effective MCMC algorithms in complex scenarios where standard strategies fail. We demonstrate these algorithms on a number of example problems, including simulated tempering, nonlinear non-Gaussian state-space model, and protein folding. |
Extent | 2454345 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-04-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
IsShownAt | 10.14288/1.0051665 |
URI | http://hdl.handle.net/2429/7319 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2009-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2009_spring_holenstein_roman.pdf [ 2.34MB ]
- Metadata
- JSON: 24-1.0051665.json
- JSON-LD: 24-1.0051665-ld.json
- RDF/XML (Pretty): 24-1.0051665-rdf.xml
- RDF/JSON: 24-1.0051665-rdf.json
- Turtle: 24-1.0051665-turtle.txt
- N-Triples: 24-1.0051665-rdf-ntriples.txt
- Original Record: 24-1.0051665-source.json
- Full Text
- 24-1.0051665-fulltext.txt
- Citation
- 24-1.0051665.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0051665/manifest