Towards Smooth Particle Filters for Likelihood Estimation with Multivariate Latent Variables

by Anthony Lee
B.Sc., The University of British Columbia, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
August, 2008
© Anthony Lee 2008

Abstract

In parametrized continuous state-space models, one can obtain estimates of the likelihood of the data for fixed parameters via the Sequential Monte Carlo methodology. Unfortunately, even if the likelihood is continuous in the parameters, the estimates produced by practical particle filters are not, even when common random numbers are used for each filter. This is because the same resampling step which drastically reduces the variance of the estimates also introduces discontinuities in the particles that are selected across filters when the parameters change. When the state variables are univariate, the methodology of [23] gives an estimator of the log-likelihood that is continuous in the parameters. We present a non-trivial generalization of this method using tree-based $o(N^2)$ (and as low as $O(N \log N)$) resampling schemes that induce significant correlation amongst the selected particles across filters. In turn, this reduces the variance of the difference between the likelihood evaluated for different values of the parameters, and the resulting estimator is considerably smoother than naively running the filters with common random numbers. Importantly, in practice our methods require only a change to the resample operation in the SMC framework without the addition of any extra parameters, and can therefore be used for any application in which particle filters are already used. In addition, excepting the optional use of interpolation in the schemes, there are no regularity conditions for their use, although certain conditions make them more advantageous.
In this thesis, we first introduce the relevant aspects of the SMC methodology to the task of likelihood estimation in continuous state-space models and present an overview of work related to the task of smooth likelihood estimation. Following this, we introduce theoretically correct resampling schemes that cannot be implemented and the practical tree-based resampling schemes that were developed instead. After presenting the performance of our schemes in various applications, we show that two of the schemes are asymptotically consistent with the theoretically correct but unimplementable methods introduced earlier. Finally, we conclude the thesis with a discussion.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Algorithms
Acknowledgements
1 Introduction
2 Sequential Monte Carlo Methods and Likelihood Evaluation
  2.1 Likelihood Estimation
  2.2 Monte Carlo Methods
  2.3 State-Space Models
  2.4 Sequential Monte Carlo
    2.4.1 Sequential Importance Sampling
    2.4.2 Sequential Importance Sampling/Resampling
    2.4.3 Likelihood Evaluation
3 Smooth Likelihood Estimation
  3.1 Desired Estimator Properties
  3.2 Common Random Numbers
  3.3 Related Work
    3.3.1 Smooth Likelihood Estimation with One-Dimensional State Variables
    3.3.2 Smooth Likelihood Estimation with Multi-Dimensional State Variables in $O(N^2)$ time
    3.3.3 Smooth Likelihood Estimation Using Fixed Selection Indices
  3.4 Theoretical Solutions
    3.4.1 Regularity Conditions and Notation
    3.4.2 A Generalized CDF
    3.4.3 Median-Cuts
    3.4.4 Practical Notes
4 Fast and Approximately Smooth Likelihood Estimation
  4.1 A Weighted Tree Structure for $\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$
  4.2 Splitting via the Unweighted Median (Weighted Binary Trees)
    4.2.1 From $O(N(\log N)^2)$ to $O(N \log N)$ time
    4.2.2 From $u \in \mathbb{R}^k$ to $u \in \mathbb{R}^d$
    4.2.3 Interpolation
    4.2.4 Remarks
  4.3 Splitting via the Weighted Median (Unweighted Binary Trees)
    4.3.1 Running Time
    4.3.2 Interpolation
    4.3.3 Remarks
  4.4 An Unweighted k-ary Tree
    4.4.1 Running Time
    4.4.2 Interpolation
    4.4.3 Remarks
  4.5 Overall Remarks
5 Applications
  5.1 Gaussian State-Space Model
  5.2 Factor Stochastic Volatility
  5.3 Stochastic Kinetic Model
  5.4 Dynamic Stochastic General Equilibrium Model
  5.5 Remarks
6 Analysis
  6.1 Preliminaries
    6.1.1 Quantile Estimation
    6.1.2 The Mean Value Theorem for Integration
    6.1.3 Taylor's Theorem
    6.1.4 Quantile Derivatives
  6.2 An approximation of the generalized cdf
  6.3 The Unweighted k-ary Tree
  6.4 The Unweighted Binary Tree
  6.5 Changing Parameters
  6.6 Remarks
7 Discussion
Bibliography

List of Tables

5.1 Mean and standard deviation of the log-likelihood as N increases
5.2 Running time (in seconds) for the vanilla, tree-based and $O(N^2)$ vanilla filter for various values of N

List of Figures

3.1 Median-cut partitioning of the space for two different densities
4.1 $u_{j+d}$ against $u_j$ for two different values of $w_{j,0}$
4.2 Binary tree representation of the sets as a function of β
5.1 2D Gaussian state-space model log-likelihood plots using the various filters
5.2 2D Gaussian state-space model log-likelihood plots comparing the tree-based resampling schemes
5.3 2D Gaussian state-space model statistical error plots using the various filters
5.4 2D Gaussian state-space model log-likelihood plots using the various filters with the locally optimal proposal distribution
5.5 2D Gaussian state-space model log-likelihood plots comparing the tree-based resampling schemes with the locally optimal proposal distribution
5.6 2D Gaussian state-space model statistical error plots using the various filters with the locally optimal proposal distribution
5.7 3D Gaussian state-space model log-likelihood plots using the various filters
5.8 3D Gaussian state-space model log-likelihood plots comparing the tree-based resampling schemes
5.9 3D Gaussian state-space model statistical error plots using the various filters
5.10 FSV log-likelihood plots using the various filters
5.11 FSV log-likelihood plots comparing the tree-based resampling schemes
5.12 Estimated expected predator-prey population levels for the partially observed Lotka-Volterra model
5.13 Partially observed Lotka-Volterra model log-likelihood plots using the various filters
5.14 Partially observed Lotka-Volterra model log-likelihood plots comparing the tree-based resampling schemes
5.15 DSGE log-likelihood plots using the various filters
5.16 DSGE log-likelihood plots comparing the tree-based resampling schemes
5.17 DSGE log-likelihood plots for the vanilla particle filter with N = 20000
5.18 2D Gaussian state-space model log-likelihood surface maps using the various filters
5.19 2D Gaussian state-space model statistical error surface maps using the various filters

List of Algorithms

1 The generic sequential importance sampling algorithm
2 The generic sequential importance sampling/resampling algorithm
3 An algorithm to evaluate the inverse of the generalized cdf $F^{-1}(u_1, \ldots, u_d)$
4 The mapping function $f_n(u)$
5 Algorithm to convert a number u in [0, 1] to a string of bits of length n
6 Constructing the weighted binary tree: constructq(particles, level)
7 Selecting a particle from the weighted binary tree: selectq(node, level, $u_1, \ldots, u_d$)
8 Constructing the unweighted binary tree: constructp(particles, level)
9 Selecting a particle from the unweighted binary tree: selectp(node, level, u)
10 Constructing the unweighted k-ary tree: constructk(particles, level)
11 Selecting a particle from the unweighted k-ary tree: selectk(node, level, $u_1, \ldots, u_d$)
12 An algorithm to approximate the inverse of the generalized cdf $F^{-1}(u_1, \ldots, u_d)$

Acknowledgements

I thank my supervisor, Arnaud Doucet, for his help, advice and wisdom during the course of this degree. I also thank Nando de Freitas, who in addition to being my second reader also introduced me to the field of machine learning when I was an undergraduate and impressed upon me the notion that statistics provides a meaningful way to build and measure machines that learn. More broadly, I would like to thank anyone and everyone who has taught me something about anything. I am grateful for the financial support I received from the Natural Sciences and Engineering Research Council of Canada, which has allowed me to pursue this work. Finally, I would like to thank my family and friends for their love and support throughout the years.

Chapter 1

Introduction

Sequential Monte Carlo (SMC) methods are effective tools for sampling from high-dimensional probability distributions.
However, in many situations we are interested in parameter estimation and not the latent variables generated by such methods. While SMC methods are still useful in these situations, a major drawback is that the likelihood estimates they give are far from smooth.

Assume we have data y, latent variables x and parameters θ along with a model that specifies the probability density p(x, y|θ). Furthermore, assume that the marginal likelihood p(y|θ) is continuous in θ. For an N-sample Monte Carlo estimator $\hat{L}_N(\theta)$ of this marginal likelihood to be smooth, we require it to be continuous in θ as well. Additionally, we want $\hat{L}_N(\theta)$ to be consistent.

In the SMC framework, we often obtain an estimate of log p(y|θ) by running a particle filter with fixed parameter θ. Unfortunately, the interaction of the particles in the resampling step renders such estimates non-smooth in θ even if each filter is run with common random numbers. In particular, sequential importance sampling without resampling provides smooth but prohibitively high-variance estimates of the log-likelihood using common random numbers. The resampling step, on the other hand, can typically be viewed as the application of the inverse transform to the empirical cumulative distribution function of the particle indices. As the weights of the particles change with θ, the same random numbers will lead to the selection of a particle with a different but 'close' index. Unfortunately, particles with close indices are not generally close themselves, and so this introduces large differences in the resampled particles as θ changes. Of course, it is this same resampling step that allows us to estimate the log-likelihood with reasonable variance, and so discarding this crucial element of SMC methodology for the sake of smoothness is not really an option. It may be impossible to produce a tractable estimator with both these properties in the general case.
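As a toy illustration of this discontinuity (a hedged sketch, not from the thesis: the particles, weights and the size of the weight perturbation are invented for the example), consider two filters sharing a common uniform random number whose θ-dependent weights differ only slightly. The inverse transform on the index CDF can then flip the selected index, jumping to a particle that is far away in state space:

```python
import numpy as np

def select_index(weights, u):
    """Inverse transform on the empirical CDF of particle indices:
    returns min{j : F(j) >= u} (0-based)."""
    cdf = np.cumsum(weights)
    return int(np.searchsorted(cdf, u))  # side='left' gives the min such j

# Hypothetical particles: neighbouring indices are far apart in state space.
particles = np.array([-5.0, 3.2, -0.7, 8.9])

u = 0.5  # common random number shared across both filters

# Normalized weights for two nearby parameter values (illustrative numbers).
w_theta1 = np.array([0.30, 0.21, 0.25, 0.24])  # CDF: 0.30, 0.51, 0.76, 1.00
w_theta2 = np.array([0.30, 0.19, 0.25, 0.26])  # CDF: 0.30, 0.49, 0.74, 1.00

i1 = select_index(w_theta1, u)  # first CDF value >= 0.5 is 0.51 -> index 1
i2 = select_index(w_theta2, u)  # first CDF value >= 0.5 is 0.74 -> index 2
print(particles[i1], particles[i2])  # 3.2 vs -0.7: a large jump in state space
```

A small change in the weights moved the selected index by one, but the resampled particle moved from 3.2 to -0.7, which is exactly the non-smoothness described above.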
For continuous state-space models, a particle filter has already been produced that possesses these properties in the case where the state variables are univariate [23]. This thesis attempts to tackle the problem in the case where the state variables are multivariate. While the algorithms presented do not satisfy the smoothness criteria defined above, they provide significant improvements in smoothness over the generic particle filter algorithms they extend and run in $o(N^2)$ time, which is faster than other methods that also attempt to solve this problem.

The rest of this thesis is organized as follows. Chapter 2 provides a brief overview of the statistical concepts required to understand the material that follows. Chapter 3 begins the treatment of smooth likelihood estimation in earnest, with a discussion of related work as well as the introduction of theoretical algorithms we would like to approximate. In Chapter 4, we present novel practical algorithms utilizing tree-based resampling. In Chapter 5, we apply the new algorithms to a variety of application domains and show our results. Chapter 6 provides some analysis of the algorithms and Chapter 7 concludes the thesis with a discussion.

Chapter 2

Sequential Monte Carlo Methods and Likelihood Evaluation

This chapter provides the reader with a brief overview of the statistical concepts required to understand the rest of the thesis. While the reader is expected to be familiar with the most basic concepts in probability, advanced knowledge of Monte Carlo methods is not required as the relevant details are covered here. If a more rigorous treatment of the material is desired, such treatment is available in [29] and [26].

2.1 Likelihood Estimation

A parametric model is a set of distributions that can be parametrized by a finite number of parameters.
For continuous distributions, each member of such a set can be represented by a density p(·|θ), where θ belongs to a finite-dimensional parameter space Θ. Given a set of parametrized densities and some fixed data y, the likelihood function p(y|θ) is a function of θ. It is often the case that we want to evaluate p(y|θ) for various values of θ. Unfortunately, in many models it is impossible to compute this quantity directly. However, in some cases it is possible to introduce a latent random variable x and evaluate p(x, y|θ). Integrating out x can yield an expression for p(y|θ):

$$p(\mathbf{y}\mid\theta) = \int_{\mathcal{X}} p(\mathbf{x},\mathbf{y}\mid\theta)\,d\mathbf{x}$$

If we could compute this integral analytically we would have an exact expression for p(y|θ). In general, however, this integral can only be evaluated numerically. When x is high-dimensional, Monte Carlo methods are often used to approximate integrals of this type.

2.2 Monte Carlo Methods

Monte Carlo integration is based on the understanding that the expectation of a function of a random variable is an integral.
Indeed, the expectation of the function $\varphi : \mathcal{X} \to \mathbb{R}$ evaluated at random variable $X \in \mathcal{X}$ with probability density function π is

$$\mathbb{E}_\pi[\varphi(X)] = \int_{\mathcal{X}} \varphi(x)\,\pi(x)\,dx$$

Importantly, we can use samples $\{x_i\}_{i=1}^N$ from the distribution defined by π to approximate this integral with the empirical average

$$I_1 = \frac{1}{N}\sum_{i=1}^{N} \varphi(x_i)$$

which has variance

$$\operatorname{var}[I_1] = \frac{1}{N}\left\{\mathbb{E}_\pi[\varphi(x)^2] - \mathbb{E}_\pi[\varphi(x)]^2\right\}$$

An extension of classical Monte Carlo integration is importance sampling, in which we sample instead from a proposal distribution γ with the requirement that $\gamma(x) > 0$ whenever $\pi(x) > 0$:

$$\mathbb{E}_\pi[\varphi(X)] = \int_{\mathcal{X}} \varphi(x)\,\frac{\pi(x)}{\gamma(x)}\,\gamma(x)\,dx$$

This yields the approximation

$$I_2 = \frac{1}{N}\sum_{i=1}^{N} \varphi(x_i)\,\frac{\pi(x_i)}{\gamma(x_i)}$$

The variance of this estimate is given by

$$\operatorname{var}[I_2] = \mathbb{E}_\gamma[I_2^2] - \mathbb{E}_\gamma[I_2]^2 = \frac{1}{N}\left\{\int_{\mathcal{X}} \varphi(x)^2\,\frac{\pi(x)^2}{\gamma(x)^2}\,\gamma(x)\,dx - \left(\int_{\mathcal{X}} \varphi(x)\,\frac{\pi(x)}{\gamma(x)}\,\gamma(x)\,dx\right)^2\right\} = \frac{1}{N}\left\{\mathbb{E}_\pi\!\left[\varphi(x)^2\,\frac{\pi(x)}{\gamma(x)}\right] - \mathbb{E}_\pi[\varphi(x)]^2\right\}$$

For both classical Monte Carlo and importance sampling, the estimates are unbiased and converge almost surely via the Strong Law of Large Numbers. In practice, importance sampling can be used as a variance reduction technique or because γ is much easier to sample from. For our purposes, the latter is most often the case, and care should be taken to ensure that $\mathbb{E}_\pi[\varphi(x)^2\,\pi(x)/\gamma(x)] < \infty$ so that the estimate has finite variance.

It is important to note that often we only know π and even γ up to a normalizing constant, i.e. we know $\pi^*(x) \propto \pi(x)$ and $\gamma^*(x) \propto \gamma(x)$ where $\int_{\mathcal{X}} \pi(x)\,dx = \int_{\mathcal{X}} \gamma(x)\,dx = 1$. In this case we can use the estimate

$$I_2^* = \frac{\sum_{i=1}^{N} \varphi(x_i)\,w^*(x_i)}{\sum_{i=1}^{N} w^*(x_i)} \quad\text{where}\quad w^*(x_i) = \frac{\pi^*(x_i)}{\gamma^*(x_i)}$$

Alternatively, we can write this as

$$I_2^* = \sum_{i=1}^{N} \varphi(x_i)\,W^{(i)} \quad\text{where}\quad W^{(i)} = \frac{w^*(x_i)}{\sum_{j=1}^{N} w^*(x_j)}$$

This estimate is not unbiased but it is asymptotically consistent since the squared bias is of order $O(N^{-2})$ and the variance is of order $O(N^{-1})$.
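The self-normalized estimator $I_2^*$ above can be sketched as follows. The Gaussian target (known only up to a constant) and the wider Gaussian proposal are illustrative choices made for this sketch, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_star(x):
    # Unnormalized target pi*: a standard normal known only up to a constant.
    return np.exp(-0.5 * x**2)

def gamma_pdf(x, scale=2.0):
    # Proposal density gamma = N(0, scale^2), easy to sample from.
    return np.exp(-0.5 * (x / scale)**2) / (scale * np.sqrt(2 * np.pi))

def phi(x):
    return x**2  # test function; E_pi[phi] = 1 for the standard normal target

N = 100_000
x = rng.normal(0.0, 2.0, size=N)       # samples from the proposal gamma
w_star = pi_star(x) / gamma_pdf(x)     # unnormalized weights w*(x_i)
W = w_star / w_star.sum()              # normalized weights W^(i)
I2_star = np.sum(phi(x) * W)           # self-normalized estimate I_2*
print(I2_star)  # close to 1
```

Note that only the ratio pi*/gamma is needed, which is why the unknown normalizing constant of π cancels out of the estimate.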
Let us also define the empirical density

$$P_N(dx) = \sum_{i=1}^{N} W^{(i)}\,\delta_{x_i}(dx)$$

When x is univariate, we can define the weighted empirical distribution function

$$F_N(x) = \sum_{i=1}^{N} W^{(i)}\,\mathbb{I}(x_i \le x)$$

where $\mathbb{I}(x_i \le x) = 1$ if $x_i \le x$ and 0 if $x_i > x$. In fact, since we can set $\varphi(x) = \mathbb{I}(x \le a)$, we have that

$$F_N(a) = \sum_{i=1}^{N} W^{(i)}\,\mathbb{I}(x_i \le a)$$

is an asymptotically consistent estimator of

$$F(a) = \int \mathbb{I}(x \le a)\,\pi(x)\,dx$$

i.e. $P_N$ converges in distribution to π. In the case of multivariate x, we also have convergence in distribution since

$$P_N(x \in S) = \sum_{i=1}^{N} W^{(i)}\,\mathbb{I}(x_i \in S)$$

is an asymptotically consistent estimator of

$$\Pr_\pi[x \in S] = \int \mathbb{I}(x \in S)\,\pi(x)\,dx = \int_S \pi(x)\,dx$$

2.3 State-Space Models

In this thesis, we will restrict our discussion to a particular class of models that is widely applicable and commonly associated with the sequential Monte Carlo methodology. It should be stressed that sequential Monte Carlo is not itself restricted to models of this form. Consider a time-homogeneous Markovian state-space model with hidden states $\{\mathbf{x}_t : t \in \{0,\ldots,T\}\}$, $\mathbf{x}_t \in \mathcal{X}$, and observations $\{\mathbf{y}_t : t \in \{0,\ldots,T\}\}$, $\mathbf{y}_t \in \mathcal{Y}$. The model is given by

- $p(\mathbf{x}_0\mid\theta)$ (initial state)
- $p(\mathbf{x}_t\mid\mathbf{x}_{t-1},\theta)$ for $1 \le t \le T$ (evolution)
- $p(\mathbf{y}_t\mid\mathbf{x}_t,\theta)$ for $0 \le t \le T$ (observation)

The defining characteristic of such models is that the observation $\mathbf{y}_t$ is conditionally independent of the other observations given the latent variable $\mathbf{x}_t$. In practice, this allows for the expression of a large number of possible systems. Intuitively, this type of model is natural if $\mathbf{x}_t$ represents the state of the system at time t, the transition from $\mathbf{x}_t$ to $\mathbf{x}_{t+1}$ is captured in the evolution distribution, and the observation at time t depends only on the state of the system at that time.

Likelihood Evaluation

For this class of models, likelihood evaluation is complicated by the presence of latent variables $\mathbf{x}_{0:T}$.
A first attempt at computing $p(\mathbf{y}_{0:T}\mid\theta)$ could be to decompose this probability as

$$p(\mathbf{y}_{0:T}\mid\theta) = \int p(\mathbf{x}_{0:T},\mathbf{y}_{0:T}\mid\theta)\,d\mathbf{x}_{0:T} = \int p(\mathbf{y}_{0:T}\mid\mathbf{x}_{0:T},\theta)\,p(\mathbf{x}_{0:T}\mid\theta)\,d\mathbf{x}_{0:T} = \int \left(\prod_{i=0}^{T} p(\mathbf{y}_i\mid\mathbf{x}_i,\theta)\right) p(\mathbf{x}_{0:T}\mid\theta)\,d\mathbf{x}_{0:T}$$

and proceed with a Monte Carlo estimate by sampling $\mathbf{x}_{0:T} \sim p(\mathbf{x}_{0:T}\mid\theta) = p(\mathbf{x}_0\mid\theta)\prod_{i=1}^{T} p(\mathbf{x}_i\mid\mathbf{x}_{i-1},\theta)$. Unfortunately, while theoretically valid, this estimate has prohibitively high variance whenever the data $\mathbf{y}_{0:T}$ is informative. In general, we would like to incorporate the data when proposing samples $\mathbf{x}_{0:T}$. Using importance sampling, we could conceivably reduce the variance by sampling $\mathbf{x}_{0:T} \sim q(\mathbf{x}_{0:T}\mid\mathbf{y}_{0:T},\theta)$ since

$$p(\mathbf{y}_{0:T}\mid\theta) = \int \left(\prod_{i=0}^{T} p(\mathbf{y}_i\mid\mathbf{x}_i,\theta)\right) \frac{p(\mathbf{x}_{0:T}\mid\theta)}{q(\mathbf{x}_{0:T}\mid\mathbf{y}_{0:T},\theta)}\,q(\mathbf{x}_{0:T}\mid\mathbf{y}_{0:T},\theta)\,d\mathbf{x}_{0:T}$$

However, it is difficult to come up with a high-dimensional proposal density q that is easy to sample from and that would sufficiently reduce the variance. An alternative decomposition of the likelihood is attained by noticing that

$$p(\mathbf{y}_{0:T}\mid\theta) = p(\mathbf{y}_0\mid\theta)\prod_{k=1}^{T} p(\mathbf{y}_k\mid\mathbf{y}_{0:k-1},\theta)$$

where, dropping θ for clarity,

$$p(\mathbf{y}_k\mid\mathbf{y}_{0:k-1}) = \int p(\mathbf{y}_k,\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})\,d\mathbf{x}_{k-1} = \int p(\mathbf{y}_k\mid\mathbf{x}_{k-1})\,p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})\,d\mathbf{x}_{k-1} = \int\!\!\int p(\mathbf{y}_k,\mathbf{x}_k\mid\mathbf{x}_{k-1})\,d\mathbf{x}_k\,p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})\,d\mathbf{x}_{k-1} = \int\!\!\int p(\mathbf{y}_k\mid\mathbf{x}_k)\,p(\mathbf{x}_k\mid\mathbf{x}_{k-1})\,d\mathbf{x}_k\,p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})\,d\mathbf{x}_{k-1}$$

Therefore, if we can produce samples from $p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})$ we can estimate each $p(\mathbf{y}_k\mid\mathbf{y}_{0:k-1})$ via

$$\hat{p}(\mathbf{y}_k\mid\mathbf{y}_{0:k-1}) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} p(\mathbf{y}_k\mid\mathbf{x}_k^{(i,j)}) \qquad (2.1)$$

where we define $\mathbf{x}_{k-1}^{(i)}$ as the ith sample from $p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})$ and $\mathbf{x}_k^{(i,j)}$ as the jth sample from the distribution with density $p(\mathbf{x}_k\mid\mathbf{x}_{k-1}^{(i)})$. We will see that sequential Monte Carlo allows us to sample approximately from $p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})$.
Alternatively, we can use an importance distribution $q(\mathbf{x}_k\mid\mathbf{y}_k,\mathbf{x}_{k-1})$ within the inner integral:

$$p(\mathbf{y}_k\mid\mathbf{y}_{0:k-1}) = \int\!\!\int p(\mathbf{y}_k\mid\mathbf{x}_k)\,\frac{p(\mathbf{x}_k\mid\mathbf{x}_{k-1})}{q(\mathbf{x}_k\mid\mathbf{y}_k,\mathbf{x}_{k-1})}\,q(\mathbf{x}_k\mid\mathbf{y}_k,\mathbf{x}_{k-1})\,d\mathbf{x}_k\,p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})\,d\mathbf{x}_{k-1}$$

yielding

$$\hat{p}(\mathbf{y}_k\mid\mathbf{y}_{0:k-1}) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} \frac{p(\mathbf{y}_k\mid\mathbf{x}_k^{(i,j)})\,p(\mathbf{x}_k^{(i,j)}\mid\mathbf{x}_{k-1}^{(i)})}{q(\mathbf{x}_k^{(i,j)}\mid\mathbf{y}_k,\mathbf{x}_{k-1}^{(i)})} \qquad (2.2)$$

where again $\mathbf{x}_{k-1}^{(i)}$ is the ith sample from $p(\mathbf{x}_{k-1}\mid\mathbf{y}_{0:k-1})$ but $\mathbf{x}_k^{(i,j)}$ is the jth sample from the distribution with density $q(\mathbf{x}_k\mid\mathbf{y}_k,\mathbf{x}_{k-1}^{(i)})$.

2.4 Sequential Monte Carlo

In order to appreciate the problem of smooth likelihood estimation using particle filters, it is necessary to specify the particle filtering method so that we can highlight the difficulties that are faced. Details that are not directly relevant to smooth likelihood estimation for state-space models are omitted, and so the reader is directed to [8] for a more general overview of sequential Monte Carlo methods.

Given a target distribution $p(\mathbf{x}_{0:T}\mid\mathbf{y}_{0:T})$, the central idea in sequential Monte Carlo algorithms is to define a sequence of intermediate target distributions $p(\mathbf{x}_0\mid\mathbf{y}_0), p(\mathbf{x}_{0:1}\mid\mathbf{y}_{0:1}), \ldots, p(\mathbf{x}_{0:T}\mid\mathbf{y}_{0:T})$ that move 'smoothly' towards the target distribution. Importantly, we can evaluate each intermediate distribution up to a normalizing constant since

$$p(\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t}) = \frac{p(\mathbf{x}_{0:t},\mathbf{y}_{0:t})}{p(\mathbf{y}_{0:t})}$$

and $p(\mathbf{x}_{0:t},\mathbf{y}_{0:t})$ is readily computable. We utilize an importance distribution of the form

$$q(\mathbf{x}_{0:T}\mid\mathbf{y}_{0:T}) = q(\mathbf{x}_0\mid\mathbf{y}_0)\prod_{t=1}^{T} q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{y}_t)$$

2.4.1 Sequential Importance Sampling

The generic sequential importance sampling (SIS) algorithm using these distributions is given in Algorithm 1.

Algorithm 1 The generic sequential importance sampling algorithm

1. At time t = 0:
   • For i = 1, . . . , N, sample $\mathbf{x}_0^{(i)} \sim q(\mathbf{x}_0\mid\mathbf{y}_0)$
   • For i = 1, . . . , N, evaluate the importance weights:
     $$w_0(\mathbf{x}_0^{(i)}) = \frac{p(\mathbf{x}_0^{(i)},\mathbf{y}_0)}{q(\mathbf{x}_0^{(i)}\mid\mathbf{y}_0)} = \frac{p(\mathbf{y}_0\mid\mathbf{x}_0^{(i)})\,p(\mathbf{x}_0^{(i)})}{q(\mathbf{x}_0^{(i)}\mid\mathbf{y}_0)}$$
2. For times t = 1, . . . , T:
   • For i = 1, . . .
, N, sample $\mathbf{x}_t^{(i)} \sim q(\mathbf{x}_t\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})$ and set $\mathbf{x}_{0:t}^{(i)} \overset{\text{def}}{=} (\mathbf{x}_{0:t-1}^{(i)},\mathbf{x}_t^{(i)})$
   • For i = 1, . . . , N, evaluate the importance weights:
     $$w_t(\mathbf{x}_{0:t}^{(i)}) = w_{t-1}(\mathbf{x}_{0:t-1}^{(i)})\,\frac{p(\mathbf{x}_{0:t}^{(i)},\mathbf{y}_{0:t})}{p(\mathbf{x}_{0:t-1}^{(i)},\mathbf{y}_{0:t-1})\,q(\mathbf{x}_t^{(i)}\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})} = w_{t-1}(\mathbf{x}_{0:t-1}^{(i)})\,\frac{p(\mathbf{y}_t\mid\mathbf{x}_t^{(i)})\,p(\mathbf{x}_t^{(i)}\mid\mathbf{x}_{t-1}^{(i)})}{q(\mathbf{x}_t^{(i)}\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})}$$

We can prove by induction that the importance weights in this algorithm satisfy

$$w_t(\mathbf{x}_{0:t}^{(i)}) = \frac{p(\mathbf{x}_{0:t}^{(i)},\mathbf{y}_{0:t})}{q(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}$$

where $\mathbf{x}_{0:t}^{(i)} \sim q(\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$. Indeed, we have

$$w_t(\mathbf{x}_{0:t}^{(i)}) = w_{t-1}(\mathbf{x}_{0:t-1}^{(i)})\,\frac{p(\mathbf{x}_{0:t}^{(i)},\mathbf{y}_{0:t})}{p(\mathbf{x}_{0:t-1}^{(i)},\mathbf{y}_{0:t-1})\,q(\mathbf{x}_t^{(i)}\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})} = \frac{p(\mathbf{x}_{0:t-1}^{(i)},\mathbf{y}_{0:t-1})}{q(\mathbf{x}_{0:t-1}^{(i)}\mid\mathbf{y}_{0:t-1})}\cdot\frac{p(\mathbf{x}_{0:t}^{(i)},\mathbf{y}_{0:t})}{p(\mathbf{x}_{0:t-1}^{(i)},\mathbf{y}_{0:t-1})\,q(\mathbf{x}_t^{(i)}\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})} = \frac{p(\mathbf{x}_{0:t}^{(i)},\mathbf{y}_{0:t})}{q(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}$$

Note also that

$$w_t(\mathbf{x}_{0:t}^{(i)}) = \frac{p(\mathbf{x}_{0:t}^{(i)},\mathbf{y}_{0:t})}{q(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})} = p(\mathbf{y}_{0:t})\,\frac{p(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}{q(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})} \propto \frac{p(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}{q(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}$$

Since $p(\mathbf{y}_{0:t})$ is an unknown normalizing constant, we are forced to normalize the importance weights as in normalized importance sampling. This gives the following empirical distribution to approximate $p(\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$:

$$\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t}) = \sum_{i=1}^{N} W_t^{(i)}\,\delta_{\mathbf{x}_{0:t}^{(i)}}(d\mathbf{x}_{0:t}) \quad\text{where}\quad W_t^{(i)} = \frac{w_t(\mathbf{x}_{0:t}^{(i)})}{\sum_{j=1}^{N} w_t(\mathbf{x}_{0:t}^{(j)})}$$

While theoretically valid, for practical values of N this algorithm suffers from degeneracy for even small values of t due to the unavoidable increase in the variance of the importance weights over time. This means that the estimate $\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$ has significant mass on only a few particles, regardless of the shape of the distribution $p(\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$, and this leads to prohibitively high variance of Monte Carlo estimates using these particles.

2.4.2 Sequential Importance Sampling/Resampling

To alleviate the problem of degeneracy, we can replace the intermediate weighted empirical distributions with unweighted empirical distributions such that none of the relative weights are negligible.
More formally, we replace

$$\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t}) = \sum_{i=1}^{N} W_t^{(i)}\,\delta_{\mathbf{x}_{0:t}^{(i)}}(d\mathbf{x}_{0:t})$$

with

$$\tilde{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t}) = \frac{1}{N}\sum_{i=1}^{N} n_t^{(i)}\,\delta_{\mathbf{x}_{0:t}^{(i)}}(d\mathbf{x}_{0:t})$$

where $n_t^{(i)} \in \{0, 1, \ldots, N\}$ and $\sum_{i=1}^{N} n_t^{(i)} = N$. (If we could compute the normalizing constant $p(\mathbf{y}_{0:t})$, we could evaluate $p(\mathbf{y}_{0:T}) = p(\mathbf{y}_{0:T}\mid\theta)$ without using SMC.)

A replacement, or resampling, scheme is acceptable if for any function φ(·) we have

$$\mathbb{E}\left[\left(\int \varphi(\mathbf{x}_{0:t})\,\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t}) - \int \varphi(\mathbf{x}_{0:t})\,\tilde{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})\right)^2\right] \xrightarrow{N\to\infty} 0$$

since convergence of the expectation of φ with respect to the empirical density $\tilde{P}_N$ to the true expectation is then guaranteed as $N \to \infty$. A variety of common resampling techniques are discussed in [15]. The general idea of many resampling algorithms is to generate a set of N uniform random variables $u_1, \ldots, u_N$ (not necessarily identically distributed) and use them to sample N values of $\mathbf{x}_{0:t}$ from $\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$ such that

$$\mathbb{E}[n_t^{(i)}] = N W_t^{(i)}$$

The variable $n_t^{(i)}$ corresponds to the number of times $\mathbf{x}_{0:t}^{(i)}$ is selected.

It is vital to note that the resampling process is a source of flexibility within the sequential importance sampling/resampling framework. In fact, we will later see that it is this process that is responsible for the lack of smoothness in likelihood estimation and which can be modified to alleviate this problem. The sequential importance sampling algorithm augmented with a resampling (SIR) procedure is given in Algorithm 2. Although we no longer have

$$w_t(\mathbf{x}_{0:t}^{(i)}) \propto \frac{p(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}{q(\mathbf{x}_{0:t}^{(i)}\mid\mathbf{y}_{0:t})}$$

we know that our resulting empirical distribution still converges in distribution to $p(\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$ and that Monte Carlo estimates of expectations of test statistics (such as the likelihood estimators in Equations 2.1 and 2.2) converge in probability to their true values. Furthermore, at each step we have a greater diversity of particles than with the original algorithm. The problem of degeneracy is not removed: at time t the marginals $p(\mathbf{x}_i\mid\mathbf{y}_{0:t})$ for $i \ll t$ are poorly represented.
However, the marginal $p(\mathbf{x}_t\mid\mathbf{y}_{0:t})$ is well represented, and this is the marginal that is used for likelihood estimation and the generation of particles from $p(\mathbf{x}_{t+1}\mid\mathbf{y}_{0:t})$. In practice, the resampling step is what permits the importance sampling/resampling algorithm to give reasonable estimates for practical values of N. (This thesis is not particularly concerned with extolling the virtues of particle filters in general, and the algorithms presented are only intended to be used when SMC methods are effective. As such, more sophisticated results about the convergence of particle filters are not discussed. However, [6] provides a survey of convergence results for those who are interested.)

Algorithm 2 The generic sequential importance sampling/resampling algorithm

1. At time t = 0:
   • For i = 1, . . . , N, sample $\tilde{\mathbf{x}}_0^{(i)} \sim q(\mathbf{x}_0\mid\mathbf{y}_0)$
   • For i = 1, . . . , N, evaluate the importance weights:
     $$w_0(\tilde{\mathbf{x}}_0^{(i)}) = \frac{p(\tilde{\mathbf{x}}_0^{(i)},\mathbf{y}_0)}{q(\tilde{\mathbf{x}}_0^{(i)}\mid\mathbf{y}_0)} = \frac{p(\mathbf{y}_0\mid\tilde{\mathbf{x}}_0^{(i)})\,p(\tilde{\mathbf{x}}_0^{(i)})}{q(\tilde{\mathbf{x}}_0^{(i)}\mid\mathbf{y}_0)}$$
   • For i = 1, . . . , N, normalize the importance weights:
     $$W_0^{(i)} = \frac{w_0(\tilde{\mathbf{x}}_0^{(i)})}{\sum_{j=1}^{N} w_0(\tilde{\mathbf{x}}_0^{(j)})}$$
   • Resample (with replacement) N particles $\{\mathbf{x}_0^{(i)} : i = 1,\ldots,N\}$ from $\{\tilde{\mathbf{x}}_0^{(i)} : i = 1,\ldots,N\}$ according to the importance weights $\{W_0^{(i)} : i = 1,\ldots,N\}$. Set $w_0^{(i)} = \frac{1}{N}$ for $i = 1,\ldots,N$.
2. For times t > 0:
   • For i = 1, . . . , N, sample $\tilde{\mathbf{x}}_t^{(i)} \sim q(\mathbf{x}_t\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})$ and set $\tilde{\mathbf{x}}_{0:t}^{(i)} \overset{\text{def}}{=} (\mathbf{x}_{0:t-1}^{(i)},\tilde{\mathbf{x}}_t^{(i)})$
   • For i = 1, . . . , N, evaluate the importance weights:
     $$w_t(\tilde{\mathbf{x}}_{0:t}^{(i)}) = w_{t-1}^{(i)}\,\frac{p(\mathbf{y}_t\mid\tilde{\mathbf{x}}_t^{(i)})\,p(\tilde{\mathbf{x}}_t^{(i)}\mid\mathbf{x}_{t-1}^{(i)})}{q(\tilde{\mathbf{x}}_t^{(i)}\mid\mathbf{y}_t,\mathbf{x}_{t-1}^{(i)})}$$
   • For i = 1, . . . , N, normalize the importance weights:
     $$W_t^{(i)} = \frac{w_t(\tilde{\mathbf{x}}_{0:t}^{(i)})}{\sum_{j=1}^{N} w_t(\tilde{\mathbf{x}}_{0:t}^{(j)})}$$
   • Resample (with replacement) N particles $\{\mathbf{x}_{0:t}^{(i)} : i = 1,\ldots,N\}$ from $\{\tilde{\mathbf{x}}_{0:t}^{(i)} : i = 1,\ldots,N\}$ according to the importance weights $\{W_t^{(i)} : i = 1,\ldots,N\}$. Set $w_t^{(i)} = \frac{1}{N}$ for $i = 1,\ldots,N$.
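A minimal sketch of Algorithm 2 for a hypothetical one-dimensional linear-Gaussian state-space model, taking the proposal q to be the evolution density $p(\mathbf{x}_t\mid\mathbf{x}_{t-1})$ (the 'bootstrap' choice, so the incremental weights reduce to $p(\mathbf{y}_t\mid\mathbf{x}_t)$) and using multinomial resampling at every step; the model and its parameter values are invented for illustration and are not the thesis's examples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: x_t = 0.9 x_{t-1} + v_t, y_t = x_t + e_t, v_t, e_t ~ N(0, 1).
phi_ar, sig_v, sig_e = 0.9, 1.0, 1.0

def obs_density(y, x):
    # Observation density p(y_t | x_t).
    return np.exp(-0.5 * ((y - x) / sig_e) ** 2) / (sig_e * np.sqrt(2 * np.pi))

def sir_filter(y, N=500):
    """Bootstrap SIR filter returning an estimate of log p(y_{0:T})."""
    # Sample from the stationary initial distribution p(x_0).
    x = rng.normal(0.0, sig_v / np.sqrt(1.0 - phi_ar**2), size=N)
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:
            # Propagate via the proposal q = p(x_t | x_{t-1}).
            x = phi_ar * x + rng.normal(0.0, sig_v, size=N)
        w = obs_density(y[t], x)       # weights: p(y_t | x_t) when q is the prior
        loglik += np.log(w.mean())     # accumulate log p-hat(y_t | y_{0:t-1})
        W = w / w.sum()                # normalized weights W_t^(i)
        idx = rng.choice(N, size=N, p=W)  # multinomial resampling of indices
        x = x[idx]                     # equally weighted particles after resampling
    return loglik

y = np.array([0.3, -0.5, 1.1, 0.8, -0.2])
print(sir_filter(y))
```

Accumulating the log of each one-step average rather than multiplying the averages avoids the numerical underflow that the product form suffers from over long series.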
Multinomial Resampling

In resampling, we wish to replace the weighted empirical distribution $\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$ with an unweighted empirical distribution $\tilde{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$. One of the most common resampling schemes employed by practitioners is multinomial resampling. In this scheme, we construct the empirical cumulative distribution function

$$\hat{F}_{t,N}(j) = \sum_{i=1}^{j} W_t^{(i)}$$

It is important to note that this is a cumulative distribution function where the random variable of interest is not $\mathbf{X}_{0:t}$ but instead the index of the particle we wish to choose. We can sample an index according to its weight by defining the inverse of this function as

$$\hat{F}_{t,N}^{-1}(u) = \min\{j : \hat{F}_{t,N}(j) \ge u\}$$

generating a uniform random number on [0, 1] and evaluating the inverse at this value. If we repeat this process N times and increment $n_t^{(i)}$ (with each $n_t^{(i)}$ starting at 0) each time the index i is sampled, we end up with a valid $\tilde{P}_N(d\mathbf{x})$. If we denote each uniform random number as $u_i$, we can define

$$k_t^{(i)} = \hat{F}_{t,N}^{-1}(u_i)$$

and call this the selection index that is picked during resampling. Indeed, with the selection index made explicit we have

$$\mathbf{x}_{0:t}^{(i)} = \tilde{\mathbf{x}}_{0:t}^{(k_t^{(i)})}$$

The most relevant issue for the topic at hand is that particles whose indices are close are not necessarily close themselves. The computational complexity of resampling naively is in $O(N^2)$. However, one can reduce this complexity by generating pre-ordered uniform random numbers as in [1] and traversing an array of cumulative weights without backtracking, giving a time complexity of O(N).

An Alternative View of Resampling

Instead of viewing resampling as the replacement of the weighted empirical distribution $\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$ with an unweighted empirical distribution $\tilde{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$, we can view it as the approximation of the true density $p(\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$ with $\hat{P}_N(d\mathbf{x}_{0:t}\mid\mathbf{y}_{0:t})$.
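The multinomial resampling scheme described above can be sketched as follows. Sorted uniforms are generated in O(N) via normalized exponential spacings, one standard construction assumed here for illustration (not necessarily the exact method of [1]); a single forward sweep over the cumulative weights then assigns each uniform to its selection index without backtracking:

```python
import numpy as np

rng = np.random.default_rng(2)

def sorted_uniforms(N):
    # Uniform order statistics in O(N): cumulative sums of exponential
    # spacings, normalized by their total, are distributed as sorted U(0,1).
    e = rng.exponential(1.0, size=N + 1)
    return np.cumsum(e)[:-1] / e.sum()

def multinomial_resample(weights, N=None):
    """Return counts n_i with E[n_i] = N * W_i, using one O(N) CDF sweep."""
    W = np.asarray(weights, dtype=float)
    W = W / W.sum()                   # normalize defensively
    N = len(W) if N is None else N
    u = sorted_uniforms(N)            # pre-ordered: no backtracking needed
    cdf = np.cumsum(W)
    counts = np.zeros(len(W), dtype=int)
    j = 0
    for ui in u:                      # single forward pass over the cdf
        while cdf[j] < ui:            # advance to min{j : cdf[j] >= ui}
            j += 1
        counts[j] += 1                # increment n_j for selection index j
    return counts

counts = multinomial_resample([0.1, 0.2, 0.3, 0.4], N=1000)
print(counts)  # roughly proportional to [100, 200, 300, 400]
```

Because the uniforms arrive already sorted, the index pointer j only moves forward, so the sweep is O(N) rather than the O(N log N) of N independent binary searches or the O(N²) of naive scanning.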
In particular, in the context of likelihood evaluation in time-homogeneous Markovian state-space models we are only interested in the marginal $p(x_t|y_{0:t})$, which is approximated by
$\hat{P}_N(dx_t|y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{x_t^{(i)}}(dx_t)$
The predictive distribution $p(x_{t+1}|y_{0:t})$ is thus approximated by the density
$\hat{p}_N(x_{t+1}|y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, p(x_{t+1}|x_t^{(i)})$
We can think of this density as the marginal of a joint distribution
$\hat{p}_N(k, x_{t+1}|y_{0:t}) = W_t^{(k)}\, p(x_{t+1}|x_t^{(k)})$
The proposal distribution used in sequential importance sampling can then be defined as
$q_N(x_{t+1}|x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1}) = \sum_{i=1}^N W_t^{(i)}\, q(x_{t+1}|y_{t+1}, x_t^{(i)})$
This density is the marginal of a joint distribution
$q_N(k, x_{t+1}|x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1}) = W_t^{(k)}\, q(x_{t+1}|y_{t+1}, x_t^{(k)})$
The resampling step then corresponds to (at each stage $i$) sampling the auxiliary variable $k$, which is the index into both the mixture representation of $\hat{p}_N(x_{t+1}|y_{0:t})$ and $q_N(x_{t+1}|x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1})$. This is followed by sampling
$x_{t+1}^{(i)} \sim q_N(x_{t+1}|k, x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1}) = q(x_{t+1}|x_t^{(k)}, y_{t+1})$
and then evaluating the weight
$w_{t+1}(x_{t+1}^{(i)}) = \frac{p(y_{t+1}|x_{t+1}^{(i)})\, \hat{p}_N(k, x_{t+1}^{(i)}|y_{0:t})}{q_N(k, x_{t+1}^{(i)}|x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1})} = \frac{p(y_{t+1}|x_{t+1}^{(i)})\, p(x_{t+1}^{(i)}|x_t^{(k)})}{q(x_{t+1}^{(i)}|y_{t+1}, x_t^{(k)})}$
It may seem strange to compute the weights using the joint distributions involving the auxiliary variable $k$, since we are essentially sampling $x_{t+1}^{(i)} \sim q_N(x_{t+1}|y_{0:t+1})$. However, evaluating the importance weights using the marginals via
$w_{t+1}(x_{t+1}^{(i)}) = \frac{p(y_{t+1}|x_{t+1}^{(i)})\, \hat{p}_N(x_{t+1}^{(i)}|y_{0:t})}{q_N(x_{t+1}^{(i)}|x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1})} = \frac{p(y_{t+1}|x_{t+1}^{(i)}) \sum_{j=1}^N W_t^{(j)}\, p(x_{t+1}^{(i)}|x_t^{(j)})}{\sum_{j=1}^N W_t^{(j)}\, q(x_{t+1}^{(i)}|y_{t+1}, x_t^{(j)})}$
as in the marginal particle filter of [17] would take $O(N)$ time per particle³, giving an $O(N^2)$ time complexity for weighting $N$ particles. It is easily shown that the expectation of any test function $\phi$ is unchanged by weighting the particles using the joint distributions as opposed to the marginals.
³It is possible in some situations to speed up this computation using $N$-body methods, but usually this requires approximation and sometimes it provides no gains in efficiency at all for a given tolerance.

Letting $\psi = (x_t^{(1:N)}, W_t^{(1:N)}, y_{t+1})$, we have:
$\int\!\!\int \phi(x_{t+1}) \frac{p(y_{t+1}|x_{t+1})\, \hat{p}_N(k, x_{t+1}|y_{0:t})}{q_N(k, x_{t+1}|\psi)}\, q_N(k, x_{t+1}|\psi)\, dx_{t+1}\, dk$
$= \int\!\!\int \phi(x_{t+1}) \frac{p(y_{t+1}|x_{t+1})\, \hat{p}_N(x_{t+1}|y_{0:t})\, \hat{p}_N(k|x_{t+1}, y_{0:t})}{q_N(x_{t+1}|\psi)\, q_N(k|x_{t+1}, \psi)}\, q_N(x_{t+1}|\psi)\, q_N(k|x_{t+1}, \psi)\, dx_{t+1}\, dk$
$= \int \phi(x_{t+1}) \frac{p(y_{t+1}|x_{t+1})\, \hat{p}_N(x_{t+1}|y_{0:t})}{q_N(x_{t+1}|\psi)}\, q_N(x_{t+1}|\psi) \int \frac{\hat{p}_N(k|x_{t+1}, y_{0:t})}{q_N(k|x_{t+1}, \psi)}\, q_N(k|x_{t+1}, \psi)\, dk\, dx_{t+1}$
$= \int \phi(x_{t+1}) \frac{p(y_{t+1}|x_{t+1})\, \hat{p}_N(x_{t+1}|y_{0:t})}{q_N(x_{t+1}|\psi)}\, q_N(x_{t+1}|\psi)\, dx_{t+1}$
since $\int \hat{p}_N(k|x_{t+1}, y_{0:t})\, dk = 1$.

2.4.3 Likelihood Evaluation

It should be clear that the estimators for the likelihood given in Equations 2.1 and 2.2 are particularly suitable when particle filters are used to generate the latent variables $x_{0:T}$. This is because it is the marginals $p(x_k|y_{0:k})$, for $k = 0, \ldots, T$, that are used at each step $k$, and these marginals do not suffer from degeneracy in the sequential importance sampling/resampling algorithm. A computational issue associated with the evaluation of the likelihood as proposed, however, is that the repeated multiplication of terms is prone to numerical inaccuracy. A way to circumvent this problem is to evaluate instead the log-likelihood $\log p(y_{0:t})$. A biased estimate of this quantity is
$\hat{l}_1(y_{0:t}) = \log \hat{p}(y_0) + \sum_{i=1}^t \log \hat{p}(y_i|y_{0:i-1})$
An approximate bias correction by Taylor expansion, as done in [23], gives the improved estimate:
$\hat{l}_2(y_{0:t}) = \tilde{l}(y_0) + \sum_{i=1}^t \tilde{l}(y_i|y_{0:i-1})$
where
$\tilde{l}(y_0) = \log \hat{p}(y_0) + \frac{1}{2} \frac{\hat{\sigma}_0^2}{NM\, \hat{p}(y_0)^2}$ and $\tilde{l}(y_k|y_{0:k-1}) = \log \hat{p}(y_k|y_{0:k-1}) + \frac{1}{2} \frac{\hat{\sigma}_k^2}{NM\, \hat{p}(y_k|y_{0:k-1})^2}$
and $\hat{\sigma}_0^2$ is the sample variance of $\hat{p}(y_0)$ and $\hat{\sigma}_k^2$ is the sample variance of $\hat{p}(y_k|y_{0:k-1})$.

Rationale

Let $X$ be an unbiased estimate of the quantity $\mu$. Then $\bar{X}$, the sample mean of $X$ with $R$ samples, is also an unbiased estimate of $\mu$.
Furthermore, if the variance of $X$ is $\sigma^2$ then the variance of $\bar{X}$ is $\frac{\sigma^2}{R}$. The second-order Taylor expansion of $\log \bar{X}$ at $\mu$ is
$\log \bar{X} \approx \log \mu + \frac{1}{\mu}(\bar{X} - \mu) - \frac{1}{\mu^2} \frac{(\bar{X} - \mu)^2}{2}$
so we have
$E[\log \bar{X}] \approx \log \mu - \frac{1}{\mu^2} \frac{E[(\bar{X} - \mu)^2]}{2}$
and therefore $\log \bar{X} + \frac{\hat{\sigma}^2}{2R\bar{X}^2}$ is an estimate which is good for large $R$.

Chapter 3 Smooth Likelihood Estimation

3.1 Desired Estimator Properties

We are now ready to consider the smooth likelihood problem. First, we define formally the three properties that we would like a smooth estimator to have. We would like the estimator $\hat{L}_N(\theta)$ of $p(y|\theta)$ to be consistent. This means $\hat{L}_N(\theta) \stackrel{P}{\to} p(y|\theta)$ as $N \to \infty$. In addition, we would like the estimator to vary smoothly with $\theta$ if $p(y|\theta)$ does, i.e. if
$\forall \theta, \forall \epsilon > 0, \exists \delta > 0 \text{ s.t. } \forall \theta' \in B_\delta(\theta),\; p(y|\theta') \in B_\epsilon(p(y|\theta))$
then
$\forall \theta, \forall \epsilon > 0, \exists \delta > 0 \text{ s.t. } \forall \theta' \in B_\delta(\theta),\; \hat{L}_N(\theta') \in B_\epsilon(\hat{L}_N(\theta))$
where $B_r(\theta)$ denotes the ball of radius $r$ around $\theta$ under some appropriate metric (e.g. the metric induced by the $L_2$-norm on $\mathbb{R}^d$). Finally, we want $\hat{L}_N(\theta)$ to have a computational complexity in $o(TN^2)$ that permits practical values of $N$ to be used for accurate estimation.

It might seem that the smoothness criterion is a strange one. However, there are two reasons to believe that a smooth likelihood estimator is advantageous. First, it captures the continuous nature of the true likelihood $p(y|\theta)$ and can better facilitate likelihood maximization [23]. In particular, it leads to reduced-variance estimates of the difference in likelihood for some $\theta_1$ and $\theta_2$:
$\hat{L}_N(\theta_2) - \hat{L}_N(\theta_1) \approx p(y|\theta_2) - p(y|\theta_1)$
Indeed, assume we have two consistent estimators $\tilde{L}_N(\theta)$ and $\hat{L}_N(\theta)$, both with variance given by $\frac{A(\theta)}{N}$, but $\text{cov}(\tilde{L}_N(\theta_1), \tilde{L}_N(\theta_2)) = 0$ while $\hat{L}_N(\theta)$ is smooth as defined above. Necessarily, we have $\text{cov}(\hat{L}_N(\theta_1), \hat{L}_N(\theta_2)) > 0$ to satisfy smoothness when $N$ is finite and $A(\theta_1)$ or $A(\theta_2)$ is nonzero.
Then we have
$\text{var}(\hat{L}_N(\theta_2) - \hat{L}_N(\theta_1)) = \text{var}(\hat{L}_N(\theta_1)) + \text{var}(\hat{L}_N(\theta_2)) - 2\,\text{cov}(\hat{L}_N(\theta_1), \hat{L}_N(\theta_2))$
$< \text{var}(\hat{L}_N(\theta_1)) + \text{var}(\hat{L}_N(\theta_2)) = \text{var}(\tilde{L}_N(\theta_2) - \tilde{L}_N(\theta_1))$
Note that this reduction in variance does not come at the cost of consistency, since the difference between two consistent estimators gives a consistent estimate of the difference between the two values they estimate. This is because the expectation of a sum of random variables is equal to the sum of their expectations, even if they are dependent.

A caution at this point is that smooth estimators are not useful in all endeavours. In particular, if in a Bayesian framework one wished to estimate
$p(y) = \int_\Theta p(y|\theta)\, p(\theta)\, d\theta$
by sampling $\{\theta_i\}_{i=1}^R$ from a prior $p(\theta)$ and computing
$\hat{p}(y) = \frac{1}{R} \sum_{i=1}^R \hat{L}_N(\theta_i)$
then this estimate would have higher variance by the same reasoning, since
$\text{var}(\hat{L}_N(\theta_1) + \hat{L}_N(\theta_2)) = \text{var}(\hat{L}_N(\theta_1)) + \text{var}(\hat{L}_N(\theta_2)) + 2\,\text{cov}(\hat{L}_N(\theta_1), \hat{L}_N(\theta_2))$
For such estimates, lower variance would be attained by using negatively correlated (and hence non-smooth) estimates of the likelihood.

3.2 Common Random Numbers

One way in which we can try to attain a smooth estimator of the likelihood is by using common random numbers. For example, imagine we want to compute
$p(y|\theta) = \int p(y, x|\theta)\, dx = \int \frac{p(y|x, \theta)\, p(x|\theta)}{q(x)}\, q(x)\, dx$
via the Monte Carlo estimate
$\hat{I}(\theta) = \frac{1}{N} \sum_{i=1}^N \frac{p(y|x_i, \theta)\, p(x_i|\theta)}{q(x_i)}$
where $x_i \sim q(x)$, for $i = 1, \ldots, N$. Using common random numbers $x_i$ and allowing $\theta$ to vary, we have that $\hat{I}(\theta)$ is a continuous function of $\theta$ as long as $p(x|\theta)$ and $p(y|x, \theta)$ are continuous in $\theta$, since the product and quotient of continuous functions is continuous⁴. In practice, however, we often use proposal distributions of the form $q(x|\theta)$. In this scenario it is not possible to use samples $x_i$ that are identical. However, the utility of common random numbers can be retained by generating $x_i$'s that vary continuously with $\theta$.
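As a toy numerical illustration of this variance reduction (not an example from the thesis), consider estimating $E[\exp(\theta Z)]$ with $Z \sim N(0,1)$ at two nearby values of $\theta$, once reusing the same draws and once with independent draws; the function `mc_estimate` and the chosen values are illustrative assumptions.

```python
import numpy as np

def mc_estimate(theta, z):
    """Plain Monte Carlo estimate of E[exp(theta * Z)] from draws z."""
    return np.exp(theta * z).mean()

rng = np.random.default_rng(1)
diffs_crn, diffs_indep = [], []
for _ in range(200):
    z1 = rng.normal(size=500)
    z2 = rng.normal(size=500)
    diffs_crn.append(mc_estimate(1.1, z1) - mc_estimate(1.0, z1))    # same draws
    diffs_indep.append(mc_estimate(1.1, z1) - mc_estimate(1.0, z2))  # fresh draws
```

The sample variance of `diffs_crn` comes out far smaller than that of `diffs_indep`, because reusing the draws makes the two estimates positively correlated, exactly as in the covariance identity above.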
In particular, assume we can sample a random variable $z$ from $g(z)$ and we have a deterministic function $f(z; \theta)$, continuous in $\theta$, that transforms $z$, given $\theta$, into a sample from a distribution $q(x|\theta)$ that we can evaluate pointwise. That is, we require $z \sim g(\cdot) \Rightarrow f(z; \theta) \sim q(\cdot|\theta)$. Therefore, if we use common random numbers $z_i$ with $\theta$ varying, we still have that $\hat{I}(\theta)$ is continuous in $\theta$, since the composition of continuous functions is continuous.

For example, imagine we want to sample random vectors from a $d$-dimensional multivariate Gaussian distribution $q(x|\theta)$ with $\theta = (\mu, \Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix of the distribution. Furthermore, define $A$ such that it satisfies $AA^T = \Sigma$. We can sample a standard multivariate Gaussian random vector $z \in \mathbb{R}^d$ by sampling each component from the standard univariate Gaussian distribution $N(0,1)$. Then we can set $x = f(z; \theta) = \mu + Az$ and we have that $x$ is a sample from $q(x|\theta)$. Furthermore, since $f(z; \theta)$ is continuous in $\theta$, this method can be used to produce 'common' random vectors from $q(x|\theta)$ by using common random numbers from $N(0,1)$. Indeed, if we compute
$p(y|\theta) = \int p(y, x|\theta)\, dx = \int \frac{p(y|x, \theta)\, p(x|\theta)}{q(x|\theta)}\, q(x|\theta)\, dx$
via the Monte Carlo estimate
$\hat{I}(\theta) = \frac{1}{N} \sum_{i=1}^N \frac{p(y|x_i, \theta)\, p(x_i|\theta)}{q(x_i|\theta)}$
where $x_i \sim q(x|\theta)$ for $i = 1, \ldots, N$, we still have that $\hat{I}(\theta)$ is a continuous function of $\theta$, since the composition of continuous functions is continuous. In addition, for this case we require continuity in $x$ as well as $\theta$ for $p(x|\theta)$ and $p(y|x, \theta)$, continuity in both $x$ and $\theta$ for $q(x|\theta)$, and the existence of a function $f(z; \theta)$ as detailed above.

⁴An additional requirement is that $q(x) \neq 0$, but this is guaranteed by sampling from $q$, and we require $q(x) > 0$ whenever $p(x|\theta) > 0$ for importance sampling to be valid anyway.
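The Gaussian transformation $x = f(z; \theta) = \mu + Az$ described above can be sketched as follows, taking $A$ to be the Cholesky factor of $\Sigma$ (one valid choice satisfying $AA^T = \Sigma$); the helper name is illustrative.

```python
import numpy as np

def gaussian_transform(z, mu, Sigma):
    """x = f(z; theta) = mu + A z with A the Cholesky factor of Sigma,
    mapping standard normal draws z to samples from N(mu, Sigma)."""
    A = np.linalg.cholesky(Sigma)
    return mu + A @ z
```

Holding `z` fixed and varying `mu` and `Sigma` continuously produces samples that vary continuously with $\theta$, which is the property needed for common random numbers.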
On a practical note, when we say that we can use common random numbers or refer to their use as a 'technique', we are referring to the ability to run algorithms with varying values of $\theta$ but with the same random numbers in all runs. On a computer, this can often be achieved by resetting the seed of the random number generator to a fixed value at the beginning of each run. Naturally, this only applies when the number of random numbers sampled does not vary across runs and each one is used for the same purpose in each run. This implies, for example, that we cannot use rejection sampling in such an algorithm when the acceptance probability depends on $\theta$.

CRN and Particle Filters

Now that common random numbers have been introduced and motivated, it is worth mentioning that this technique alone cannot produce a smooth likelihood estimator within the particle filter framework. First, it is worth noting that using CRN within the sequential importance sampling framework presented in Algorithm 1 and computing an estimate of the log-likelihood using the expression given in Section 2.4.3 will indeed produce a smooth estimate. However, this method is hopelessly slow, as any practical⁵ value of $N$ gives very high variance estimates of the likelihood. Unfortunately, the sequential importance sampling/resampling framework presented in Algorithm 2 that provides lower variance estimation simultaneously makes the likelihood estimator discontinuous, as remarked in [23]. In fact, the smoothness of SIS is a result of the smoothness of the sampled latent variables $\{x_{0:k}^{(i)}\}_{i=1}^N$ at every time-step $k$. The introduction of the resampling step in SIR introduces discontinuities in the latent variables $\{x_{0:k}^{(i)}\}_{i=1}^N$. Indeed, from the multinomial sampling algorithm in Section 2.4.2, it is apparent that small changes in $\theta$ can cause drastically different values of $x_{0:k}$ to be chosen.
For example, if particle index $j$ is chosen for a given $\theta$ but particle index $j+1$ is chosen for a $\theta'$ close to $\theta$, there is no guarantee that $x_{0:k}^{(j)}$ will be at all close to $x_{0:k}^{(j+1)}$. Additionally, even if we could construct a resampling procedure that made such a guarantee, the resulting particles would not be continuous in $\theta$ without additional interpolation.

In summary, it is a non-trivial task to produce an estimator that satisfies the properties of consistency, smoothness and $o(N^2)$ time complexity for practical values of $N$. However, it is trivial to produce an estimator that has any two of these properties. If we are willing to sacrifice smoothness, the standard estimate using sequential importance sampling/resampling suffices. On the other hand, if we are willing to sacrifice reasonable time complexity, the standard estimate using sequential importance sampling (without resampling) suffices. It is hardly worth considering constructing an estimator that is not consistent, and it is trivial to construct one that is both fast and smooth.

⁵Although this estimator runs in $O(N)$ time, the variance increases exponentially with $T$, since SIS is just importance sampling with a special proposal distribution. As such, for variance comparable to SIR, impractically large values of $N$ need to be chosen.

3.3 Related Work

The task of smooth likelihood estimation using sequential Monte Carlo methods is a relatively new endeavour, mostly because particle filters have only recently become popular. However, there has already been substantial work in this area, and one method in particular is the main inspiration for the new ideas presented here. A brief survey of related advances in the field follows.
3.3.1 Smooth Likelihood Estimation with One-Dimensional State Variables

In [23], a resampling scheme is proposed that, coupled with common random numbers, gives all three estimator properties we desire in the case where the latent state variables are one-dimensional. The idea is quite simple and follows naturally from the problem statement outlined in Section 2.4.2: particles whose indices are close are not necessarily close themselves. Indeed, it is straightforward to guarantee that particles whose indices are close are themselves close, simply by sorting the particles and reassigning their indices.

More formally, assume that at stage $t$ we have a weighted empirical distribution $\hat{P}_N(dx_{0:t}|y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{x_{0:t}^{(i)}}(dx_{0:t})$. Since the models we are working with are first-order Markov in $x$, for the purposes of filtering and likelihood estimation we are only interested in the $t$th component of $x_{0:t}$ when resampling. As such, we can reorder the particles such that $\forall i, j \in \{1, \ldots, N\},\; i < j \Rightarrow x_t^{(i)} < x_t^{(j)}$. An important result is that instead of using the empirical cumulative distribution function for indices
$\hat{F}_{t,N}(j) = \sum_{i=1}^j W_t^{(i)}$
introduced in Section 2.4.2, we can instead use the empirical distribution function for particles
$\hat{F}_{t,N}(x) = \sum_{x_t^{(i)} \leq x} W_t^{(i)}$
The corresponding inverse of this empirical cdf is given by
$\hat{F}_{t,N}^{-1}(u) = \min\{x_t^{(i)} : \hat{F}_{t,N}(x_t^{(i)}) \geq u\}$
The cost of sorting each set of $N$ particles by their last component is in $O(N \log N)$. Since this is done at each time-step of the particle filter, and the cost of resampling at each step is in $O(N)$, the time complexity of the algorithm is in $O(TN \log N)$. Of course, this method does not give continuous resampled particles without the addition of an interpolation scheme to transform the sorted discrete cdf into a continuous cdf. In [23], a linear interpolation scheme is proposed to attain continuity.
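The sorted resampling scheme of [23] can be sketched as follows (without the interpolation step, so the output is still piecewise constant in $\theta$); the helper name is illustrative.

```python
import numpy as np

def sorted_resample(particles, weights, u):
    """1-D smooth-style resampling: sort particles, build the empirical cdf
    over particle *values*, and invert it at the uniforms u."""
    order = np.argsort(particles)
    x, w = particles[order], weights[order]
    cdf = np.cumsum(w)
    cdf[-1] = 1.0                  # guard against floating-point round-off
    return x[np.searchsorted(cdf, u, side='left')]
```

Because the cdf is indexed by sorted particle values rather than arbitrary particle indices, nearby uniforms now select nearby particles, which is the closeness guarantee discussed above.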
While interpolation introduces a bias, this bias becomes negligible as $N$ increases. As impressive as this particle filtering framework is, it unfortunately cannot be trivially extended to the multivariate case, since only when the latent state variables are one-dimensional can we sort them in this manner and guarantee closeness.

3.3.2 Smooth Likelihood Estimation with Multi-Dimensional State Variables in $O(N^2)$ Time

In our discussion we often think of the empirical distribution that approximates $p(x_{0:t}|y_{0:t})$:
$\hat{P}_N(dx_{0:t}|y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{x_{0:t}^{(i)}}(dx_{0:t})$
which is a mixture of point masses. However, we can instead think of the empirical density that approximates $p(x_{t+1}|y_{0:t})$:
$\hat{p}_N(x_{t+1}|y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, p(x_{t+1}|x_t^{(i)})$
which is a mixture of densities (see Section 2.4.2). Indeed, the resampling process followed by the transition step provides samples from this distribution. Furthermore, this density is continuous when $p(x_{t+1}|x_t)$ is continuous, since it is a weighted sum of continuous densities. If we could produce smoothly varying (with $\{W^{(i)}\}_{i=1}^N$ and $\{x_t^{(i)}\}_{i=1}^N$) samples from this mixture distribution, we could use those samples to make a smooth likelihood estimator. Unfortunately, even if it is easy to sample smoothly from $p(x_{t+1}|x_t)$, it is not generally feasible to sample smoothly from a mixture of such distributions when the weights vary with $\theta$.

A solution to this problem, proposed in [7], is to approximate $\hat{p}_N(x_{t+1}|y_{0:t})$ with another distribution $g(x_{t+1}; \alpha_{t+1})$ from which it is easy to produce smoothly varying (in $\alpha_{t+1}$) samples. The density $g(x_{t+1}; \alpha_{t+1})$ is a member of the parametric class of densities $\{g(x_{t+1}; \alpha_{t+1}) : \alpha_{t+1} \in A\}$. In particular, the parameter $\alpha_{t+1}$ is obtained in a way that minimizes some distributional distance between $g(x_{t+1}; \alpha_{t+1})$ and $\hat{p}_N(x_{t+1}|y_{0:t})$. For this to be useful, it must be true that the parameter $\alpha_{t+1}$ obtained by this minimization varies smoothly with $\theta$.
Using $g(x_{t+1}; \alpha_{t+1})$ as a proposal distribution, we can approximate $p(y_{t+1}|y_{0:t})$ as follows:
$\hat{p}(y_{t+1}|y_{0:t}) = \frac{1}{N} \sum_{i=1}^N p(y_{t+1}|x_{t+1}^{(i)})\, \frac{\hat{p}_N(x_{t+1}^{(i)}|y_{0:t})}{g(x_{t+1}^{(i)}; \alpha_{t+1})}$
The drawback is that $\hat{p}_N(x_{t+1}^{(j)}|y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, p(x_{t+1}^{(j)}|x_t^{(i)})$ takes $O(N)$ time to compute, and so $\hat{p}(y_{t+1}|y_{0:t})$ takes $O(N^2)$ time to compute⁶. In addition, it can be difficult to define parametric families of densities that can both be smoothly sampled from and represent well the target distribution $p(x_{t+1}|y_{0:t})$ when this target has certain properties, e.g. multiple modes. In summary, this method satisfies smoothness and consistency but requires time in $O(TN^2)$. In general, while proposal distributions that can be smoothly sampled from do provide smooth estimation, the evaluation of each importance weight will always take $O(N)$ time unless some kind of cancellation occurs.

⁶As with the marginal particle filter, it may be possible to speed up this computation in some circumstances. In general, however, the complexity is $O(N^2)$.

3.3.3 Smooth Likelihood Estimation Using Fixed Selection Indices

An approach introduced in [5] is to run a particle filter for a given $\theta = \theta^*$ and to store the selection indices $\{k_t^{(i)} : i \in \{1, \ldots, N\},\, t \in \{0, \ldots, T\}\}$ and weights $\{W_{t,\theta^*}^{(k_t^{(i)})} \stackrel{\text{def}}{=} W_t^{(k_t^{(i)})} : i \in \{1, \ldots, N\},\, t \in \{0, \ldots, T\}\}$ (see Section 2.4.2). Then, for $\theta$ close to $\theta^*$, the 'filter' is run such that the same indices are selected. In practice, this means that the weights of resampled particles $w_t^{(i)}$ are no longer equal. Indeed, if $\theta$ causes the weight of a selected particle to increase relative to its weight under $\theta^*$, then its weight will increase, and if the opposite is true, its weight will decrease. In Algorithm 2 this can be seen as replacing the resampling step with:

• For $i = 1, \ldots, N$:
  – Set $x_{0:t}^{(i)} = \tilde{x}_{0:t}^{(k_t^{(i)})}$
  – Set $w_t^{(i)} = \frac{W_t^{(k_t^{(i)})}}{W_{t,\theta^*}^{(k_t^{(i)})}}$
• Renormalize the weights $w_t^{(i)}$ so that $\sum_{i=1}^N w_t^{(i)} = 1$.

This method does provide smooth estimates of the likelihood when coupled with common random numbers.
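The replacement resampling step above can be sketched as follows; this is an illustrative reading of the scheme in [5], with a hypothetical helper name.

```python
import numpy as np

def reweight_fixed_indices(W_theta, W_star, k):
    """Reuse the selection indices k stored from the run at theta*:
    each 'resampled' particle keeps weight W_theta[k] / W_star[k],
    renormalized to sum to one."""
    w = W_theta[k] / W_star[k]
    return w / w.sum()
```

When $\theta = \theta^*$ every ratio is 1 and the weights are uniform, recovering ordinary resampling; as $\theta$ moves away from $\theta^*$ the ratios drift apart, which is the weight degeneracy discussed next.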
However, this method is restricted to a small neighbourhood around $\theta^*$, as many of the weights become negligible when the discrepancy in the weights increases. Of particular concern is the fact that without conventional resampling (based on the actual weights of the particles in question), the effective sample size of the particle 'swarm' decreases over time, which can significantly increase the variance of the estimates. In this scenario, the high positive correlation in log-likelihood estimates can be offset by the increase in the variance of the estimates themselves (see Section 3.1) when considering the variance of the estimated difference in the log-likelihood. In addition, the numerical stability of this method on a computer is not guaranteed, due to the repeated use of products in the algorithm. Of course, this method is appealing for use in a small neighbourhood of $\theta^*$ if $T$ is not too large. In particular, it is used in [5] to estimate the partial derivatives of $p(y_{0:t}|\theta)$ with respect to $\theta$.

3.4 Theoretical Solutions

This chapter is concluded with the introduction of two theoretical solutions to the problem of smooth resampling. These can both be seen as extensions of [23] to the case where the state variables are multivariate. The main idea in that work is that we can, through sorting the particles, construct an empirical cdf for $x_t$
$\hat{F}_{t,N}(z) = \sum_{x_t^{(i)} \leq z} w_t^{(i)}$
that approximates the cdf
$F_t(z) = \int_{-\infty}^z p(x_t|y_{0:t})\, dx_t$
When $x_t$ is multi-dimensional, however, this no longer applies. After introducing some regularity conditions and some modifications to notation, we will present two solutions that inspire new approaches.

3.4.1 Regularity Conditions and Notation

To deal with the case when $x_t \in \mathbb{R}^d$ is multi-dimensional, we first simplify our notation.
Instead of referring to the density $p(x_t|y_{0:t})$, we will drop explicit references to $y_{0:t}$ and the time index $t$. Thus from this point on the density $p(x) = p(x_{1:d})$ refers implicitly to $p(x_t|y_{0:t})$. Using this notation, we will refer to marginal distributions using
$p_j(x_j) = \int\!\!\int p(x_{1:j-1}, x_j, x_{j+1:d})\, dx_{1:j-1}\, dx_{j+1:d}$
and
$p_{1:j}(x_{1:j}) = \int p(x_{1:j}, x_{j+1:d})\, dx_{j+1:d}$
We will also refer to conditional distributions in a limited context with
$p_j(x_j|x_{1:j-1}) = \int p(x_j, x_{j+1:d}|x_{1:j-1})\, dx_{j+1:d}$
Finally, we can express the cdf of $x_1$ as
$F_1(z) = \int_{-\infty}^z p_1(x_1)\, dx_1$
and the conditional cdf of $x_j$ given $x_{1:j-1}$ as
$F_j(z|x_{1:j-1}) = \int_{-\infty}^z p_j(x_j|x_{1:j-1})\, dx_j$
In addition to the change in notation, we restrict our attention to situations in which the set $\{x : p(x) > 0\}$ is convex⁷. The reason for this condition is that when it is satisfied, the cdf $F_1(z)$ and the conditional cdfs $F_j(z|x_{1:j-1})$ are strictly monotonic for all $z$ such that these functions are in $(0,1)$. As such, the quantile function, or inverse cdf, $F_1^{-1}$ is given by
$F_1^{-1}(u) = z : F_1(z) = u$
Similarly, the conditional quantile function $F_j^{-1}$ is given by
$F_j^{-1}(u|x_{1:j-1}) = z : F_j(z|x_{1:j-1}) = u$
Note that when one of these cdfs is not strictly monotonic, there exist values of $u \in (0,1)$ and $x_{1:j-1}$ such that $z$ above is not determined uniquely by the constraint $F_j(z|x_{1:j-1}) = u$. Furthermore, the cdf of any marginal distribution over any convex subset of $\{x : p(x) > 0\}$ is strictly monotonic.

3.4.2 A Generalized CDF

We can define a bijective generalized cdf $F : \mathbb{R}^d \to \mathbb{R}^d$, where the state variables are $d$-dimensional. First let us define
$F_j(z; z_{1:j-1}) = F_j(z|z_{1:j-1}) = \int_{-\infty}^z p_j(x_j|x_{1:j-1} = z_{1:j-1})\, dx_j$
The inverse of this function is
$F_j^{-1}(u; u_{1:j-1}) = x_j : F_j(x_j; F_1^{-1}(u_1), F_2^{-1}(u_2; u_1), \ldots$
$, F_{j-1}^{-1}(u_{j-1}; u_{1:j-2})) = u$
$= x_j : F_j(x_j; z_{1:j-1}) = u$
$= F_j^{-1}(u|z_{1:j-1})$

⁷It is possible that this condition can be relaxed, but it is sufficient for all the marginal distributions of interest to be non-zero only on a convex subset (interval) of $\mathbb{R}$.

where $z_1 = F_1^{-1}(u_1)$, $z_2 = F_2^{-1}(u_2; u_1) = F_2^{-1}(u_2|z_1)$ and, in general, $z_i = F_i^{-1}(u_i; u_{1:i-1}) = F_i^{-1}(u_i|z_{1:i-1})$. With these definitions, we can define a bijection $F$ as
$F(z_1, z_2, \ldots, z_d) = (F_1(z_1), F_2(z_2; z_1), \ldots, F_d(z_d; z_{1:d-1}))$
The inverse of $F$ is given by
$F^{-1}(u_1, u_2, \ldots, u_d) = (F_1^{-1}(u_1), F_2^{-1}(u_2; u_1), \ldots, F_d^{-1}(u_d; u_{1:d-1}))$
While this formulation might appear daunting, the intuition is simple when the function $F(z_{1:d}) = u_{1:d}$ is considered sequentially. We can view $u_1$ as the probability that $x_1 \leq z_1$, $u_2$ as the probability that $x_2 \leq z_2$ given that $x_1 = z_1$, and so on until $u_d$ is the probability that $x_d \leq z_d$ given that $x_{1:d-1} = z_{1:d-1}$. For the evaluation of the inverse $F^{-1}$, an algorithmic approach is presented in Algorithm 3. It should also be clear that if $u$ is drawn uniformly from the unit hypercube of dimension $d$, then the output of the algorithm is a corresponding sample from the distribution $p(x)$. This is because $z_1 \sim p(x_1)$, $z_2 \sim p(x_2|x_1 = z_1)$, $\ldots$, $z_d \sim p(x_d|x_{1:d-1} = z_{1:d-1})$.

Algorithm 3 An algorithm to evaluate the inverse of the generalized cdf $F^{-1}(u_1, \ldots, u_d)$
1. Set $z_1 = F_1^{-1}(u_1) = z : F_1(z) = u_1$
2. For $j = 2, \ldots, d$
   • Set $z_j = F_j^{-1}(u_j; u_{1:j-1}) = z : F_j(z|z_{1:j-1}) = u_j$
3. Output $(z_1, \ldots, z_d)$.

3.4.3 Median-Cuts

While the generalized CDF proposed above is in theory a clean extension of the approach in [23], we will see later that there are practical issues involved in trying to approximate conditional quantiles ($F_j^{-1}$ for $j > 1$) directly using an empirical distribution. In addition, many of the properties of the generalized CDF are unnecessary for smooth resampling.
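For intuition, Algorithm 3 can be carried out exactly when the conditional quantile functions have closed forms. The toy density below is an illustrative assumption (not a model from the thesis): $x_1 \sim \text{Exp}(1)$ and $x_2 \mid x_1 \sim \text{Exp}(\text{rate}=x_1)$, so both $F_1^{-1}$ and $F_2^{-1}(\cdot|z_1)$ are available analytically.

```python
import math

def generalized_inverse_cdf(u):
    """Algorithm 3 for a toy 2-D density: x1 ~ Exp(1), x2|x1 ~ Exp(rate=x1).
    Sequentially inverts the marginal cdf, then the conditional cdf."""
    u1, u2 = u
    z1 = -math.log1p(-u1)          # F1^{-1}(u1) for Exp(1)
    z2 = -math.log1p(-u2) / z1     # F2^{-1}(u2 | z1) for Exp(rate=z1)
    return z1, z2
```

Feeding in uniforms $(u_1, u_2)$ from the unit square yields exact samples from the joint density, as the text argues; the practical obstacle is that for a particle approximation these conditional cdfs have no such closed form.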
In particular, we are looking for a function $f$ such that if a random number $u$ is sampled uniformly from $[0,1]$, $f(u)$ returns a point sampled according to the distribution with density $p(x)$. We require this function to be continuous in $\theta$ (recall that this density does depend on $\theta$) but not in $u$, and it need not have an inverse. Nevertheless, the mapping function $f_n$ described in Algorithm 4 is largely inspired by the inverse transform technique. Furthermore, since $\lim_{n\to\infty} f_n(u)$ almost surely⁸ returns a set containing a single point in $\mathbb{R}^d$, we can think of $f(u)$, defined as the unique element of $\lim_{n\to\infty} f_n(u)$, as the type of function described above.

Algorithm 4 The mapping function $f_n(u)$
1. • Abusing notation, set $a_1^{(1)} = a_2^{(1)} = \cdots = a_d^{(1)} = -\infty$ and $b_1^{(1)} = b_2^{(1)} = \cdots = b_d^{(1)} = \infty$
   • Let $\mathcal{X}^{(1)} \stackrel{\text{def}}{=} \prod_{i=1}^d [a_i^{(1)}, b_i^{(1)}]$
   • Let $\mathcal{X}_{-1}^{(1)} \stackrel{\text{def}}{=} \prod_{i=2}^d [a_i^{(1)}, b_i^{(1)}]$
   • Compute $t_1$ satisfying $\int_{a_1^{(1)}}^{t_1} \int_{\mathcal{X}_{-1}^{(1)}} p(x)\, dx_{2:d}\, dx_1 = 0.5$
   • For $i = 1, \ldots, d$: set $a_i^{(2)} = a_i^{(1)}$ and $b_i^{(2)} = b_i^{(1)}$
   • If $u < 0.5$, set $b_1^{(2)} = t_1$ and $u = 2u$; else set $a_1^{(2)} = t_1$ and $u = 2(u - 0.5)$
2. For $j = 2, \ldots, n$
   • Let $k = j \bmod d$. If $k = 0$, let $k = d$.
   • Let $\mathcal{X}^{(j)} \stackrel{\text{def}}{=} \prod_{i=1}^d [a_i^{(j)}, b_i^{(j)}]$
   • Let $\mathcal{X}_{-k}^{(j)} \stackrel{\text{def}}{=} \prod_{i=1}^{k-1} [a_i^{(j)}, b_i^{(j)}] \prod_{i=k+1}^d [a_i^{(j)}, b_i^{(j)}]$
   • Compute $t_j$ satisfying $2^{j-1} \int_{a_k^{(j)}}^{t_j} \int_{\mathcal{X}_{-k}^{(j)}} p(x)\, dx_{1:k-1}\, dx_{k+1:d}\, dx_k = 0.5$
   • For $i = 1, \ldots, d$: set $a_i^{(j+1)} = a_i^{(j)}$ and $b_i^{(j+1)} = b_i^{(j)}$
   • If $u < 0.5$, set $b_k^{(j+1)} = t_j$ and $u = 2u$; else set $a_k^{(j+1)} = t_j$ and $u = 2(u - 0.5)$
3. Return $\mathcal{X}^{(n+1)} = \prod_{i=1}^d [a_i^{(n+1)}, b_i^{(n+1)}]$.

Algorithm 5 Algorithm to convert a number $u$ in $[0,1]$ to a string of bits of length $n$
1. For $j = 1, 2, \ldots, n$:
   • If $u < 0.5$, set $s_j = 0$ and $u = 2u$; else set $s_j = 1$ and $u = 2(u - 0.5)$.
2. Return $s$.

Correctness

The function $f_n$ is a mapping from $[0,1]$ to a region in $\mathbb{R}^d$.
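The bit expansion performed by Algorithm 5 can be sketched as follows; the helper name is illustrative.

```python
def bits_of(u, n):
    """Algorithm 5: expand u in [0, 1) into its first n binary digits."""
    s = []
    for _ in range(n):
        if u < 0.5:
            s.append(0)
            u = 2 * u
        else:
            s.append(1)
            u = 2 * (u - 0.5)
    return s
```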
In the algorithm we essentially treat the number $u$ as an $n$-bit string $s$ of 0's and 1's, where each $s_j$ is independent and identically distributed according to the Bernoulli distribution with success probability 0.5 (without conditioning on $u$). Algorithm 5 shows how to produce such a string of arbitrary length $n$. It is sometimes easier to think of the function $f_n$ as taking an argument $s_{1:n}$ when $n$ is finite (instead of conditioning on $u < 0.5$ at step $j$, one can condition directly on $s_j = 0$).

⁸By this we mean for almost all $u \in [0,1]$.

For any $n$ and $s = s_{1:n} \in \{0,1\}^n$, the region $\mathcal{X}^{(n+1)}$ returned satisfies
$\int_{\mathcal{X}^{(n+1)}} p(x)\, dx = \frac{1}{2^n}$
by construction. This corresponds to the probability of the $n$-bit binary string $s$ used to select the region. Furthermore, we can see that
$\bigcup_{s \in \{0,1\}^n} f_n(s) = \mathbb{R}^d$
In other words, each equally probable uniform random $n$-bit $s \in \{0,1\}^n$ maps to an equally probable (under $p$) region $f_n(s) \subset \mathbb{R}^d$. In addition, the intersection of any two regions $f_n(s_1)$ and $f_n(s_2)$ for $s_1 \neq s_2$ is a subset of $\mathbb{R}^d$ with dimension less than $d$, and hence it has null Lebesgue measure in $\mathbb{R}^d$. As such, there are no overlapping regions of non-zero probability for distinct $s$ and we have
$\sum_{s \in \{0,1\}^n} \int_{f_n(s)} p(x)\, dx = 1$
In sum, the regions that can be returned for a given $n$ each have equal probability mass $\frac{1}{2^n}$ and cover the entire space $\mathbb{R}^d$, while the intersection of the interiors of any two regions is empty. Given a uniform random variable $u$, this variable is treated as an $n$-bit binary string $s$ drawn from the uniform distribution over $n$-bit strings (each possible string has probability $\frac{1}{2^n}$) and used to select one of the regions. It should be emphasized that if we think of the $2^n$ regions that $f_n$ can return as a partitioning⁹ of $\mathbb{R}^d$, the relative positions of the regions are fixed, and so the relative position of the region picked by $u$ (via $s$) does not depend on $p$ (although the coordinates of the region do). Take, for example, $d = 2$ and $s_{1:4} = 1001$.
This maps $s_{1:4}$ to the region $[t_1, t_3] \times [t_4, t_2]$ with $a_1^{(2)} = t_1$, $b_2^{(3)} = t_2$, $b_1^{(4)} = t_3$ and $a_2^{(5)} = t_4$. Clearly the values of $t_i$ depend on $p$, but the position of the partition is fixed with respect to the other partitions. This is probably best understood by looking at Figure 3.1, which shows how the function $f_n$ partitions $\mathbb{R}^d$ for two different densities $p(x)$. In the context where $p(x)$ is continuous in $\theta$, it is thus the coordinates of the orthotope $\mathcal{X}^{(n+1)}$ that change as $\theta$ varies with $u$ fixed. In fact, for $p(x)$ continuous in $\theta$ and $x$, each $t_j$ is continuous in $\theta$, and so the coordinates of the orthotope are continuous in $\theta$.

⁹As noted above, the intersection of two regions is not always null, but the intersection of their interiors is.

[Figure 3.1: Median-cut partitioning of the space for (a) $p(x)$ uniform and (b) $p(x)$ non-uniform, with each cell labelled by its binary string.]

Characterizing the Output

Note that for fixed $u$, the sequences $\{a_i^{(n)}\}$ for $i \in \{1, \ldots, d\}$ are weakly monotone increasing. Similarly, the sequences $\{b_i^{(n)}\}$ for $i \in \{1, \ldots, d\}$ are weakly monotone decreasing. Furthermore, for any $i, j, k$, $a_i^{(k)} < b_i^{(j)}$ and $b_i^{(k)} > a_i^{(j)}$. It can be shown that the set $\{u : \exists i \text{ s.t. } \forall j,\; b_i^{(j)} = \infty \text{ or } a_i^{(j)} = -\infty\}$ has measure zero, since this corresponds to the infinite binary string representation of $u$ having infinitely repeated 1's or 0's at fixed intervals. Thus $\{a_i^{(n)}\}$ and $\{b_i^{(n)}\}$ are almost always bounded above and below respectively.
Since they are weakly monotone, this implies that they converge, and their limits are given by
$a_i \stackrel{\text{def}}{=} \lim_{n\to\infty} a_i^{(n)} = \sup_n \{a_i^{(n)}\}$ and $b_i \stackrel{\text{def}}{=} \lim_{n\to\infty} b_i^{(n)} = \inf_n \{b_i^{(n)}\}$
By the nested interval theorem, we know that
$\mathcal{X}^* \stackrel{\text{def}}{=} \lim_{n\to\infty} \mathcal{X}^{(n+1)} = \prod_{i=1}^d [a_i, b_i]$
is a non-empty subset of $\mathbb{R}^d$. In fact, we know that for at least one $i$, $a_i = b_i$. To show this, assume that $a_i < b_i$ for all $i$. Then $\int_{\mathcal{X}^*} p(x)\, dx = c$ where $c$ is a positive constant. However, we know $\int_{\mathcal{X}^{(j)}} p(x)\, dx = \frac{1}{2^{j-1}}$, so there exists a $j$ such that
$\int_{\mathcal{X}^{(j)}} p(x)\, dx < \int_{\mathcal{X}^*} p(x)\, dx$
which is a contradiction since $\mathcal{X}^* \subseteq \mathcal{X}^{(j)}$. The following shows that for every $i$, $a_i = b_i$.

Claim: For any $i \in \{1, \ldots, d\}$, $a_i = b_i$ for almost all $u \in [0,1]$.

Proof: Consider the sequence of natural numbers defined by $m_k = ki$ for $k \in \mathbb{N}$. Then consider the subsequences $\{a_i^{(m_j)}\}$, $\{b_i^{(m_j)}\}$ and $\{t_{m_j}\}$. Note that these subsequences satisfy $t_{m_j} \in (a_i^{(m_j)}, b_i^{(m_j)})$ and
$b_i^{(m_{j+1})} = t_{m_j}$ if $s_{m_j} = 0$, and $b_i^{(m_{j+1})} = b_i^{(m_j)}$ if $s_{m_j} = 1$
$a_i^{(m_{j+1})} = a_i^{(m_j)}$ if $s_{m_j} = 0$, and $a_i^{(m_{j+1})} = t_{m_j}$ if $s_{m_j} = 1$
Now assume that there exists an $h > 0$ such that $b_i - a_i \geq h$. This implies that $b_i^{(m_j)} - a_i^{(m_j)} \geq h$ for all $j$. We already know that $a_i^{(m_j)}$ and $b_i^{(m_j)}$ are finite almost surely as $j \to \infty$. Let $k$ be the first integer such that $a_i^{(m_k)}$ and $b_i^{(m_k)}$ are both finite, and let $d_j = b_i^{(m_j)} - a_i^{(m_j)}$. Then for any $j \geq k$ we require $d_j \geq h$. For any $j$, $t_{m_j} \in (a_i^{(m_j)}, b_i^{(m_j)})$. If $b_i^{(m_j)} - t_{m_j} \geq h$ and $t_{m_j} - a_i^{(m_j)} \geq h$, then regardless of the value of $s_{m_j}$ we will have $d_{j+1} \leq d_j - h$. Since $d_k$ is finite, this can occur only finitely many times for $j > k$. If, on the other hand, $b_i^{(m_j)} - t_{m_j} \leq h$ or $t_{m_j} - a_i^{(m_j)} \leq h$, we require $s_{m_j}$ to be equal to 1 or 0 respectively in order for $d_{j+1} \geq h$. Since the previous case can occur only finitely many times, this case must occur infinitely many times, as there are infinitely many $j > k$. Note that the values of $t_{m_j}$ depend on $p(x)$ and $s_{1:m_j-1}$ but not on $s_{m_j}$.
In fact, $s_{m_j}$ can be thought of as drawn from the Bernoulli distribution with success probability 0.5. As such, we are requiring that an infinite number of independent random bits of $s$ coincide with infinitely many values of $t_{m_j}$, which has probability zero. Hence we have that $h = 0$ almost surely, and so $a_i = b_i$ almost surely.

Of course, this means that for almost all $u$,
\[ \mathcal{X}^* \stackrel{\text{def}}{=} \lim_{n\to\infty} \mathcal{X}^{(n+1)} = \prod_{i=1}^d [a_i, b_i] = \{c\} \]
where $c \in \mathbb{R}^d$ and $c_i = \sup_n\{a^{(n)}_i\} = \inf_n\{b^{(n)}_i\}$ for $i \in \{1, \ldots, d\}$. Thus we can define (for almost all $u$) $f(u)$ as the single member of $\lim_{n\to\infty} f_n(u)$.

3.4.4 Practical Notes

While the theoretical resampling algorithms described above would yield smoothly varying particles, they are impossible to implement in practice because they require the repeated evaluation of integrals that have no analytical solution in general. As such, in the next chapter we focus on solutions that are implementable on a computer.

Chapter 4
Fast and Approximately Smooth Likelihood Estimation

In this chapter, we present a tree-based representation of the particles for resampling. This representation is itself flexible and can be used within a wide variety of resampling schemes. However, we restrict our discussion to three different realizations.

4.1 A Weighted Tree Structure for $\hat{P}^N(dx_{0:t}|y_{0:t})$

A valid resampling scheme for
\[ \hat{P}^N(dx_{0:t}|y_{0:t}) = \sum_{i=1}^N W^{(i)}_t \delta_{x^{(i)}_{0:t}}(dx_{0:t}) \]
can be constructed by partitioning the particles $x^{(i)}_{0:t}$ into $k_1$ subsets, where each subset is given a weight equal to the normalized sum of the weights of the particles within it. Each of the $k_1$ sets is then further partitioned into some number of subsets, where weights are again assigned so that the weight of a subset is given by the normalized sum of the weights of the particles within it. This process continues until each particle is in only one partition at the lowest depth of the tree.
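The general construction above can be made concrete with a minimal sketch (our own illustration, not the thesis's implementation): particles are grouped into nested subsets, each subset carries the sum of its particles' weights, and one uniform per level picks a branch, so the per-level conditional probabilities telescope.

```python
import random

# Sketch of the generic weighted-tree resampler of Section 4.1.
# Function names and the particular grouping are ours, for illustration.

def build_tree(indices, weights, fanout=2):
    """Recursively partition `indices` into `fanout` groups; each node
    stores its children and the total weight of its subtree."""
    if len(indices) == 1:
        return {"leaf": indices[0], "w": weights[indices[0]]}
    step = max(1, len(indices) // fanout)
    groups = [indices[i:i + step] for i in range(0, len(indices), step)]
    children = [build_tree(g, weights) for g in groups]
    return {"children": children, "w": sum(c["w"] for c in children)}

def select(node, us, level=0):
    """Walk down the tree: at each level, pick a child with probability
    proportional to its (normalized) subtree weight."""
    if "leaf" in node:
        return node["leaf"]
    u = us[level] * node["w"]          # rescale the uniform to this node's mass
    for child in node["children"]:
        if u < child["w"]:
            return select(child, us, level + 1)
        u -= child["w"]
    return select(node["children"][-1], us, level + 1)

weights = [0.1, 0.4, 0.2, 0.3]
tree = build_tree(list(range(4)), weights)
# The per-level conditional probabilities telescope, so overall
# Pr[select i] = weight_i / (sum of all weights).
counts = [0, 0, 0, 0]
rng = random.Random(1)
for _ in range(100_000):
    counts[select(tree, [rng.random(), rng.random()])] += 1
print([c / 100_000 for c in counts])   # ≈ [0.1, 0.4, 0.2, 0.3]
```

Here a two-level binary grouping is used for four particles; any branching factors and any partitioning of the particles give the same marginal selection probabilities, which is what leaves the construction free to be chosen for smoothness.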
Let the depth of the tree be $d$ and assume we have a $d$-dimensional uniform random number $(u_1, \ldots, u_d)$. Then we can use $u_1$ to select the subset on the first level according to the weights of each partition, $u_2$ to select a subset of that subset according to the weights of each of these subsets, and so on until we use some $u_j$ to select a single particle.10

Let the $k$th level subset containing the particle with index $i$ be denoted $S_k(i)$ and the sum of the weights in $S_k(i)$ be denoted $W(S_k(i))$. The probability of selecting any individual particle index $i$ is given by
\[ \Pr[\text{select index } i] = \frac{W(S_1(i))}{\text{sum of all weights}} \cdot \frac{W(S_2(i))}{W(S_1(i))} \cdot \frac{W(S_3(i))}{W(S_2(i))} \cdots \frac{W(S_d(i))}{W(S_{d-1}(i))} = \frac{W(S_d(i))}{\text{sum of all weights}} = \frac{\text{weight of particle } i}{\text{sum of all weights}}, \]
i.e. the probability of selecting a particle from such a tree is proportional to its weight. It is clear that this general description allows for a wide variety of possible trees to be constructed. Our main interest is in constructing a tree such that, for fixed $u$, small perturbations in the weights of the particles give rise to small perturbations in the selected particle.

10 $u_{j+1}, \ldots, u_d$ might not be needed since, in this general construction, the leaves of the tree do not necessarily all first appear on the $d$th level.

4.2 Splitting via the Unweighted Median (Weighted Binary Trees)

Let us assume we have $N = 2^k$ for some $k \in \mathbb{N}$. Then one possible construction of a tree is to have each parent have two child nodes, each containing half the number of particles of its parent, i.e. a perfect binary tree. The depth of this tree is $k + 1 = \log_2 N + 1$. To partition each block of particles at each level, we propose to sort the particles along one dimension, cycling through each dimension as we traverse the levels.
For example, we can sort the $N$ particles according to the first dimension of $x_t$, then sort each subset of $N/2$ particles on the second level of the tree by the second dimension of $x_t$, etc., returning to the first dimension after sorting by the $d$th dimension of $x_t$ if $x_t$ is $d$-dimensional. The cost to construct this tree is given by:
\[ T(N) = N\log(N) + 2\,\frac{N}{2}\log\!\left(\frac{N}{2}\right) + 4\,\frac{N}{4}\log\!\left(\frac{N}{4}\right) + \cdots + \frac{N}{2}\,\frac{N}{N/2}\log\!\left(\frac{N}{N/2}\right) \]
\[ = Nk + N(k-1) + N(k-2) + \cdots + N(2) + N = N\left(k + (k-1) + (k-2) + \cdots + 2 + 1\right) = N\,\frac{k(k+1)}{2} \]
\[ \Rightarrow T(N) \in O(N(\log N)^2). \]
The cost to select a particle from this tree is in $O(\log N)$, so the cost to evaluate it $N$ times is in $O(N\log N)$. This means the total cost of both constructing the tree and resampling $N$ particles is in $O(N(\log N)^2)$.

Algorithm 6 Constructing the weighted binary tree: constructq(particles, level)
1. Set dimension r = level mod d. If r = 0, set r = d.
2. Set particle m = the median of particles in the rth dimension.
3. Create node node, empty sets left and right, and set wleft = wright = 0.
4. If level = k + 1, set node.particle to the single particle in particles and return node.
5. For each particle p in particles
   – If the rth coordinate of p is less than or equal to the rth coordinate of m,
     ∗ Add p to the set left.
     ∗ Set wleft = wleft + weight(p).
   – If the rth coordinate of p is greater than the rth coordinate of m,
     ∗ Add p to the set right.
     ∗ Set wright = wright + weight(p).
6. – Set node.wleft = wleft/(wleft + wright)
   – Set node.wright = wright/(wleft + wright)
7. – Set node.left = constructq(left, level + 1)
   – Set node.right = constructq(right, level + 1)
8. Return node.

Algorithm 7 Selecting a particle from the weighted binary tree: selectq(node, level, u1, . . . , ud)
1. If level = k + 1, return node.particle.
2. Set dimension j = level mod d. If j = 0, set j = d.
3. If uj < node.wleft
   – Set uj = uj/node.wleft.
   – Return selectq(node.left, level + 1, u1, . . . , ud)
   Else
   – Set uj = (uj − node.wleft)/(1 − node.wleft).
   – Return selectq(node.right, level + 1, u1, . . . , ud)

While the unweighted median along a particular dimension is used to split the particles into two groups, the weights associated with the two groups are in general unequal, except in the case where the weights of all the particles are equal. As such, splitting along the unweighted median gives rise to a weighted binary tree. While this tree might resemble the theoretical median-cuts approach in Section 3.4.3, there are important differences. First, the particles are approximately split along the median of the proposal distribution $q$ and not of $p$. Second, the values of the weights of each node depend on $\theta$, since the weights $W^{(i)}_t$ depend on $\theta$. This means that while using the same $u$ to traverse two similarly weighted trees of even continuously varying particles will often lead to selecting the same intermediate nodes, there will be times when the change in weights causes a different node to be picked (this is also what makes the selection scheme correct). Obviously, the specific level of the tree at which this occurs greatly impacts how far apart the corresponding particles will be.

4.2.1 From O(N(log N)^2) to O(N log N) time

To construct each level of the tree, it is not necessary to sort the particles completely. In particular, we only need to find the median particle in order to split the particles into two blocks with equal numbers of particles and to sum their weights. There exists a deterministic algorithm to find the median in linear time ("Median of Medians") [2], although in practice its performance is slower than some randomized algorithms (e.g. Hoare's selection algorithm [14]) which run in expected linear time [18]. Upon finding the median particle, the rest of the work at each level of the tree also takes linear time, so we can construct the tree in O(N log N) time.
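As an illustration of this selection idea, the following sketch (our own code, using Hoare-style quickselect rather than the deterministic median-of-medians) finds the median along one dimension and splits a block of particles without fully sorting it:

```python
import random

def quickselect(items, k, key, rng=random.Random(0)):
    """Return the k-th smallest element of `items` under `key` (k is
    0-indexed), in expected O(len(items)) time, Hoare-style."""
    items = list(items)
    while True:
        pivot = key(rng.choice(items))
        lo = [x for x in items if key(x) < pivot]
        eq = [x for x in items if key(x) == pivot]
        hi = [x for x in items if key(x) > pivot]
        if k < len(lo):
            items = lo
        elif k < len(lo) + len(eq):
            return eq[0]
        else:
            k -= len(lo) + len(eq)
            items = hi

# Split 2^k particles into two equal halves along dimension r without
# sorting: find the (N/2)-th order statistic and partition around it.
particles = [(random.Random(i).random(), i) for i in range(8)]
r = 0
m = quickselect(particles, len(particles) // 2 - 1, key=lambda p: p[r])
left = [p for p in particles if p[r] <= m[r]]
right = [p for p in particles if p[r] > m[r]]
print(len(left), len(right))
```

Summing the weights in `left` and `right` during the same pass gives everything needed for one level of the construction in linear time.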
This means the overall resampling step takes $O(N\log N)$ time.

4.2.2 From $u \in \mathbb{R}^k$ to $u \in \mathbb{R}^d$

It should be clear that the use of $k$ uniform random numbers to select a single particle is unnecessary, since we traverse the tree at level $j$ by checking only whether $u_j$ is less than the sum of weights in a level $j$ subpartition. Let the level $j$ subset reached by using $u_1, \ldots, u_{j-1}$ be denoted $B_j(u_{1:j-1})$. Let
\[ w_{j,0} = \frac{W(B_{j+1}(u_{1:j-1}, 0))}{W(B_j(u_{1:j-1}))} \qquad \text{and} \qquad w_{j,1} = \frac{W(B_{j+1}(u_{1:j-1}, 1))}{W(B_j(u_{1:j-1}))}, \]
i.e. $w_{j,i} \propto W(B_{j+1}(u_{1:j-1}, i))$ and $\sum_{i\in\{0,1\}} w_{j,i} = 1$. Then, conditioned only on $u_j < w_{j,0}$, the value $\frac{u_j}{w_{j,0}}$ is uniformly distributed on $[0, 1)$. Similarly, conditioned only on $u_j \ge w_{j,0}$, the value $\frac{u_j - w_{j,0}}{1 - w_{j,0}}$ is uniformly distributed on $[0, 1]$. Thus we can have $u_1, \ldots, u_d$ given and then at every level $j$ set $u_{j+d} = \frac{u_j}{w_{j,0}}$ or $u_{j+d} = \frac{u_j - w_{j,0}}{1 - w_{j,0}}$ according to whether $u_j < w_{j,0}$ or $u_j \ge w_{j,0}$, respectively.

This is not just a trick to reduce the number of uniform random variables we generate, especially since this generation is not costly. Instead, it provides a mechanism by which we ensure that when we are 'near' the boundary $w_{j,0}$, the next time we choose a node after splitting along the same dimension we use a consistent value $u_{j+d}$. Intuitively, imagine $u_j = w_{j,0} - \epsilon$. Then $u_{j+d} = 1 - \frac{\epsilon}{w_{j,0}}$, so the next time we choose a node after splitting along the same dimension we are highly likely to pick the second one. This is consistent with being very close to having picked the second partition at level $j$. Of course, this only mitigates the problem by trying to reduce the distance between selected particles when the weights change: the distance between two selected particles can only be strictly bounded by the size of their last common ancestor.
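A sketch of this recycling rule (our own code; `w0` plays the role of $w_{j,0}$), including an empirical check that the recycled value remains uniform:

```python
import random

def step(u, w0):
    """One level of the traversal: branch on u < w0, then recycle u so the
    next split along this dimension sees a fresh Uniform[0,1) value that is
    consistent with how close u was to the boundary w0."""
    if u < w0:
        return 0, u / w0                 # went left; rescale [0, w0) -> [0, 1)
    return 1, (u - w0) / (1.0 - w0)      # went right; rescale [w0, 1) -> [0, 1)

# Near-boundary behaviour: u just below w0 maps to a value just below 1,
# so the next same-dimension split will very likely branch right.
branch, u_next = step(0.4 - 1e-6, 0.4)
print(branch, u_next)                    # 0, a value close to 1

# Uniformity check: the recycled value is still Uniform[0,1), so a second
# split with left mass 0.3 should branch left about 30% of the time.
rng = random.Random(7)
trials = 200_000
lefts = 0
for _ in range(trials):
    _, u2 = step(rng.random(), 0.4)
    b2, _ = step(u2, 0.3)
    lefts += (b2 == 0)
print(lefts / trials)                    # ≈ 0.3
```

The mixture argument is exactly the one in the text: with probability $w_{j,0}$ the recycled value is $u_j/w_{j,0}$, which is conditionally uniform, and with probability $1 - w_{j,0}$ it is $(u_j - w_{j,0})/(1 - w_{j,0})$, also conditionally uniform.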
[Figure 4.1: Plots of $u_{j+d}$ against $u_j$ for (a) $w_{j,0} = 0.4$ and (b) $w_{j,0} = 0.7$.]

At first glance, it might appear that the manipulation of each $u_j$ to produce $u_{j+d}$ introduces a bias in the selection of the particles. This is not so. The fact that $u_j = w_{j,0} - \epsilon \Rightarrow u_{j+d} = 1 - \frac{\epsilon}{w_{j,0}}$ just shows that if we know the actual value of $u_j$ then $u_{j+d}$ is not random at all. However, in the algorithm the only use of $u_j$ is to check $u_j < w_{j,0}$ and to compute $u_{j+d}$. We never explicitly check whether $u_j$ is by some measure close to $w_{j,0}$. Thus the distribution of $u_{j+d}$ is always uniform on either $[0, 1)$ or $[0, 1]$, and the probability of selecting the left node at stage $j + d$ is equal to $w_{j+d,0}$ for $u_{1:d}$ drawn from the uniform distribution on $[0, 1]^d$. Figure 4.1 shows the impact of the weight and the value of $u_j$ on the value of $u_{j+d}$. Inspection of these graphs should reassure the reader that $\Pr[u_{j+d} < w_{j+d,0} \mid u_j < w_{j,0}] = w_{j+d,0}$.

4.2.3 Interpolation

The construction of the tree could provide even smoother estimates of the likelihood if we are willing to introduce some bias by interpolating between particles at the last (few) layer(s) of the tree. One strategy is to interpolate between the last $2^d$ particles, which for $N$ large should be quite close together. Since each node of the tree has two weighted children, we need to interpolate between two weighted points $p_1$ and $p_2$ with weights $w$ and $1 - w$. We first define the interpolated point:
\[ p^\star = c(u, w)\,p_1 + (1 - c(u, w))\,p_2. \]
We would like $c(u, w) = 1$ for $u = 0$, $c(u, w) = 0$ for $u = 1$, $c$ continuous and monotonic in $u$ and $w$, and $\mathbb{E}_u[c(u, w)] = w$. We can satisfy these constraints with
\[ c(u, w) = \begin{cases} (1-u)^{\frac{1-w}{w}} & \text{if } w < \frac{1}{2}, \\[4pt] 1 - u^{\frac{w}{1-w}} & \text{if } w \ge \frac{1}{2}. \end{cases} \]
We also have $c(u, w) + c(1-u, 1-w) = 1$, which means that the choice of which point is labelled $p_1$ and which point is labelled $p_2$ does not matter. To interpolate $2^d$ points, we can interpolate the $2^{d-1}$ pairs of points on the bottom level to produce $2^{d-1}$ points, then interpolate the $2^{d-2}$ pairs of these points, and so on, using the weights of each node.

4.2.4 Remarks

A cause for concern with this method is that arbitrarily large discontinuities can be induced even for large $N$. Recall that in the median-cuts algorithm we have
\[ \bigcup_{s \in \{0,1\}^n} f_n(s) = \mathbb{R}^d. \]
For the weighted binary tree, we can define a string $\beta_{1:k}$ via
\[ \beta_j = \begin{cases} 0 & \text{if } u_j < w_{j,0}, \\ 1 & \text{if } u_j \ge w_{j,0}. \end{cases} \]
Note that $\beta_{1:k}$ depends not only on $u_{1:d}$ but also on the values of the weights of the nodes that are traversed. As such, a fixed $u_{1:d}$ does not correspond to the same $\beta_{1:k}$ when the weights change. Let us define the set of particles in the node at level $j+1$ reached by a given $\beta_{1:j}$ as $b_j(\beta_{1:j})$. Nevertheless, we have the analogous result that
\[ \bigcup_{\beta \in \{0,1\}^j} b_j(\beta) = \bigcup_{i=1}^N \{x^{(i)}_t\}. \]
Furthermore, for any $\beta_{1:i} \in \{0,1\}^i$ we have
\[ \bigcup_{\beta_{i+1:j} \in \{0,1\}^{j-i}} b_j(\beta_{1:j}) = b_i(\beta_{1:i}). \]
With fixed $u$ across different runs of the filter, we can expect many $\beta_{1:k}$ to share the prefix $\beta_{1:i}$ for some $i$. The maximum distance of these particles from each other is obviously bounded by the size of the region $b_i(\beta_{1:i})$, and it shrinks as $i$ increases. However, for any change in $\theta$ there exist $u$ such that $\beta_1$ will be different across different runs. As such, arbitrarily large distances between selected particles across different runs can occur. The use of only $d$-dimensional uniform random vectors tempers the effects of this, but it cannot be escaped. The problem is exacerbated when the density of $x_t$ is very different in different regions,11 and it is made more frequent when the weights change drastically for even small changes in $\theta$.
The former is a property solely of the target density, but the latter depends on the proposal density used to sample the particles.

11 If this is true, the use of $u \in \mathbb{R}^d$ makes little difference, since trying to be close to the region the next time we split along the same dimension does not affect the large differences in the regions selected in between.

[Figure 4.2: The binary tree representation of the sets as a function of $\beta$, with the root labelled by the empty string $\lambda$.]

There are many ways of visualizing the sets of particles as a function of $\beta$. One such way is to view them as a partitioning of the particles similar to that presented in Figure 3.1, with the understanding that the location of the hyperplanes (in arbitrary dimension $d$) that separate the space is determined by the proposal distribution and not the target distribution. Alternatively, one can think of the sets as nodes in a binary tree where each set at level $i$ is given by the binary string $\beta_0 + \beta_{1:i}$ with $\beta_0 = \lambda$.12 A visualization of this binary tree is given in Figure 4.2.

12 $\lambda$ is the empty string.

4.3 Splitting via the Weighted Median (Unweighted Binary Trees)

An alternative to the algorithm presented in Section 4.2 is to split the particles such that the weight of each subset is always equal. The weighted median along a particular dimension is thus used to split the particles into two equally weighted sets of particles. Alternatively, one can see the binary tree that is constructed as being unweighted. Of course, the number of particles in each node is not equal. In addition, one particle must be split into two particles every time two child nodes are created so that the weight of each node is equal. However, the general construction of the tree is identical.

Algorithm 8 Constructing the unweighted binary tree: constructp(particles, level)
1.
Create node node, empty sets left and right, and set wleft = wright = 0.
2. If particles contains only one particle, set node.particle to that particle and return node.
3. Set dimension r = level mod d. If r = 0, set r = d.
4. Set particle m = the weighted median of particles in the rth dimension.
5. For each particle p in particles
   – If the rth coordinate of p is less than the rth coordinate of m,
     ∗ Add p to the set left.
     ∗ Set wleft = wleft + weight(p).
   – If the rth coordinate of p is greater than the rth coordinate of m,
     ∗ Add p to the set right.
     ∗ Set wright = wright + weight(p).
6. Create two particles m1 and m2 identical to m.
7. – Set weight(m1) = (wright + weight(m) − wleft)/2
   – Set weight(m2) = weight(m) − weight(m1)
   – Add m1 to the set left.
   – Add m2 to the set right.
8. – Set node.left = constructp(left, level + 1)
   – Set node.right = constructp(right, level + 1)
9. Return node.

Algorithm 9 Selecting a particle from the unweighted binary tree: selectp(node, level, u)
1. If node contains only one particle, return node.particle.
2. If u ≤ node.wleft
   – Set u = 2u.
   – Return selectp(node.left, level + 1, u)
   Else
   – Set u = 2(u − 0.5).
   – Return selectp(node.right, level + 1, u)

A key feature of this tree is that the same relative region is selected each time; it is the particles that make up the region that vary when the weights change. This is due to the unweighted nature of the nodes in the binary tree that is constructed. A drawback is that the imbalance in the number of particles in sibling nodes, together with the introduction of an extra particle at every branch, makes the time complexity analysis more complicated.

4.3.1 Running Time

To analyze the running time, we first remark that the weighted median can also be computed in $O(N)$ time. Given points $x_1, \ldots, x_N$ with associated (normalized) weights $w_1, \ldots, w_N$, the weighted median is the point $x_i$ such that
\[ \sum_{x_j < x_i} w_j < 0.5 \qquad \text{and} \qquad \sum_{x_j \le x_i} w_j \ge 0.5. \]
Note that we can find the unweighted median of $x_1, \ldots, x_N$, split the particles into two groups of $\frac{N}{2}$ particles using this median, and compute the sum of the weights in each group in $O(N)$ time. We can determine which group contains the weighted median by comparing the weights of the two groups, and then repeat this process on that group. This recursion continues until the weighted median is found. The running time is thus given by the solution to the recursion $T(N) = O(N) + T(N/2)$, which is satisfied by $T(N) \in O(N)$.

Of course, in the case where we split by the unweighted median, the running time is in $O(N\log N)$ because the depth of the tree is in $O(\log N)$. However, in the worst case the tree constructed when splitting by the weighted median has depth $N$. Nevertheless, in most situations we expect the tree to have depth in $O(\log N)$, since the worst case is only achieved in pathological cases. In addition, the number of particles increases with the depth due to the replication of the median particle at each split. The running time is thus in
\[ O(N) + O(N+1) + O(N+2) + O(N+4) + O(N+8) + \cdots + O(N + 2^{\text{depth}}) \]
where we have $O(\log N)$ terms and $2^{\text{depth}} \in O(N)$ if the depth is in $O(\log N)$. Thus the overall running time for tree construction is in $O(N\log N)$. If we again assume that the depth of the tree is in $O(\log N)$, the selection of $N$ particles also takes $O(N\log N)$ time.

4.3.2 Interpolation

Due to the imbalanced numbers of particles in the branches of the binary tree, a sophisticated interpolation scheme is a little more difficult to implement. Therefore, for this structure we only interpolate when we reach a node with exactly two particles, using the same method as in Section 4.2.3. In the case where the node reached has only one particle (which always results from an uneven split) we do not interpolate at all.
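The recursive weighted-median computation described above can be sketched as follows (our own illustration; for brevity the inner unweighted median is found by sorting, so this version runs in $O(N\log N)$, and substituting a linear-time selection gives the $O(N)$ bound):

```python
def weighted_median(points, weights, lower_acc=0.0):
    """Return the weighted median: the point x_i with
    sum_{x_j < x_i} w_j < 0.5 and sum_{x_j <= x_i} w_j >= 0.5.
    `lower_acc` tracks the weight known to lie strictly below the
    current subset; we recurse on the half containing the median."""
    if len(points) == 1:
        return points[0]
    order = sorted(range(len(points)), key=lambda i: points[i])
    half = len(order) // 2
    left, right = order[:half], order[half:]
    w_left = sum(weights[i] for i in left)
    if lower_acc + w_left >= 0.5:
        # cumulative weight reaches 0.5 within the lower half
        return weighted_median([points[i] for i in left],
                               [weights[i] for i in left], lower_acc)
    return weighted_median([points[i] for i in right],
                           [weights[i] for i in right], lower_acc + w_left)

pts = [1.0, 2.0, 3.0, 4.0]
wts = [0.1, 0.1, 0.7, 0.1]
print(weighted_median(pts, wts))  # 3.0: mass below it is 0.2 < 0.5, mass up to it is 0.9 >= 0.5
```

Because the recursion discards half of the points at each step, replacing the sort with an expected-linear-time selection (as in Section 4.2.1) yields the recursion $T(N) = O(N) + T(N/2) \in O(N)$ from the text.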
4.3.3 Remarks

This method is more involved implementation-wise than splitting by the unweighted median. However, it has the same computational complexity in non-pathological cases, and it has asymptotic properties that we will show in Chapter 6 but outline below.

This approach can be seen as an approximation to the median-cuts approach introduced in Section 3.4.3. It is difficult to compare the two approaches directly, since the median-cuts algorithm partitions the space for a given $u \in \mathbb{R}$. While the calculation of the splitting hyperplanes does not depend on $u$, that algorithm does not partition the parts of $\mathbb{R}^d$ that it is not interested in: it calculates only $n$ conditional medians $t_{1:n}$ despite selecting a region of probability $\frac{1}{2^n}$. In contrast, for reasons of computational complexity the unweighted binary tree is fully constructed so that each selection takes only $O(\log N)$ time. Indeed, this involves the computation of $O(N)$ conditional medians of the empirical distribution. However, if we consider the selection of a single particle using the same $u$ with the unweighted binary tree, we can denote by $\hat{t}_{1:k}$ the $k$ conditional medians that split each node that is visited. Then we can show that for any $m$, as $N \to \infty$, $\sqrt{N}(\hat{t}_{1:m} - t_{1:m}) \stackrel{D}{\to} N(0, \Sigma)$ for some $\Sigma$ independent of $N$.

The major difference between this tree and the weighted binary tree is that the relative region that is chosen has no dependency on the weights of the particles; it is the constituent particles that change. In terms of the binary string representation of $u$, we can again visualize the partitioning of the particles using Figures 3.1 and 4.2, but since the string is independent of the weights of the particles, the same relative region is picked.
Of course, the effect of the changing membership of particles in each node of the unweighted binary tree should not be underestimated for practical values of $N$, as this can lead to large discontinuities in the selected particles.

4.4 An Unweighted k-ary Tree

The last tree construction algorithm we consider here is similar to the weighted-median-splitting algorithm in that it produces an unweighted tree. However, unlike that approach, each parent node in this tree has $k$ children. Furthermore, we let $N = k^d$, where $d$ is the dimension of the latent state variables. Therefore, as we increase $N$ the number of children per node increases. In fact, sampling from this tree provides an approximation to the theoretical approach outlined in Algorithm 3.

Algorithm 10 Constructing the unweighted k-ary tree: constructk(particles, level)
1. Create node node.
2. If level = d,
   – Sort particles by dimension d.
   – Normalize the particle weights.
   – Set node.particles = particles and return node.
3. Sort by dimension level and then partition particles into k sets S1, . . . , Sk each with equal normalized weight of 1/k (replicating particles as necessary).
4. For i = 1, . . . , k
   – Set node.child(i) = constructk(Si, level + 1).
5. Return node.

Algorithm 11 Selecting a particle from the unweighted k-ary tree: selectk(node, level, u1, . . . , ud)
1. If level = d,
   – Find the index i of node.particles = p1, p2, . . . such that $\sum_{j=1}^{i-1} \text{weight}(p_j) < u_d$ and $\sum_{j=1}^{i} \text{weight}(p_j) \ge u_d$.
   – Return particle pi.
2. Return selectk(node.child(⌈k·u_level⌉), level + 1, u1, . . . , ud).

4.4.1 Running Time

The computational time to construct the tree is clearly in $O(N\log N)$, since the computational effort at each level of construction is in $O(N\log N)$ and there are only $d$ (which is independent of $N$) levels. However, the time to select a particle is in expected $O(k)$ time, since there are $k$ particles on average in each leaf node (each of which is equally likely to be chosen).
As such, the cost to select $N$ particles is in $O(Nk) = O(N^{\frac{d+1}{d}})$. Of course, this is still in $o(N^2)$ but is slower than $O(N\log N)$.

4.4.2 Interpolation

The interpolation scheme used in this approach is slightly different from that of the binary trees, because we now have a variable number of particles in the leaf nodes of the tree. Let us assume there are $r$ particles in the leaf node reached. We can split these $r$ particles into $r + 1$ blocks such that the first block contains only the first particle $p_1$, the $(r+1)$th block contains only the last particle $p_r$, and every other block $i$ contains the two particles $p_{i-1}$ and $p_i$. Furthermore, the weight associated with each block is the sum of the weights of its constituent particles divided by 2. As such, we proceed as normal to select a block and then, within each block, we use the remaining weight to interpolate between the two constituent particles (in the case where the block selected is either the first or the last, no interpolation is done).

4.4.3 Remarks

The unweighted k-ary tree approach we have just seen has the defining characteristic that the number of children increases with $N$. This allows each coordinate of $u$ to specify a shrinking region of the space as $N$ increases, which allows the sampled particles to more closely approximate the particles that would be returned by the theoretical Algorithm 3. In fact, we will show in Chapter 6 that if $F^{-1}(u) = \mu$ and the particle returned by the unweighted k-ary tree using $u$ is denoted $\hat{x}$, then as $N \to \infty$, $\hat{x} \stackrel{P}{\to} \mu$.

A practical disadvantage is that there is an imbalance in the accuracy of each dimension of the selected particle, since fewer particles are used for estimation in successive levels of the tree. At the same time, for practical values of $N$ it is very difficult to make the regions very small. For example, if $d = 3$ then using $N = 10^3 = 1000$ only allows us to partition the first dimension of the particles into 10 intervals.
In addition, while we use all 1000 particles to approximate the interval values in the first dimension, only 100 particles are used to approximate the interval values in the second dimension, and only 10 particles are available to represent the third dimension. Moreover, since the particles are not partitioned very finely for practical values of $N$, interpolation can also introduce considerable bias into the procedure.

4.5 Overall Remarks

We have seen three different types of trees that can be constructed and sampled from in $o(N^2)$ time. Of course, there are many other possible variants, since any properly weighted and constructed tree can be used to resample particles according to $\hat{P}^N(dx_{0:t}|y_{0:t})$. Indeed, it is possible that a different type of tree could provide much better results than those presented above while still retaining $o(N^2)$ time complexity for construction and selection. Such a tree could utilize, for example, different and possibly variable numbers of child nodes, more aggressive replication of particles, or a different method for partitioning the particles. Nevertheless, the three trees we have seen each showcase a different tree-based approach with unique characteristics.

The weighted binary tree approach is the quickest of the methods presented. While it is not asymptotically faster than the unweighted binary tree in non-pathological cases,13 the latter's reliance on the more computationally intensive weighted median, in addition to the imbalanced nature of the constructed tree, means it is slower in practice. As such, the fact that the weighted binary tree does not give smooth samples in the asymptotic case does not necessarily make it worse, especially since the asymptotic case is not necessarily interesting in practice.14

13 The weighted binary tree approach's worst-case performance is asymptotically faster than that of the unweighted binary tree.
In addition, while the k-ary tree spends unequal numbers of particles and amounts of computation time on the different dimensions of the sampled particles, both binary tree approaches spend roughly equal time on each dimension, since they cycle through the dimensions used to partition the particles. The unweighted binary tree and the k-ary tree are of interest mainly because of their asymptotic properties. In fact, empirical results suggest that the weighted binary tree outperforms the other two trees with respect to smoothness for practical values of $N$.

As a final note, it should be clear that all of the proposed trees can be seen as generalizations of the sorting idea in [23]: in the case where $d = 1$, each of these trees reduces to sorting along one dimension.

14 As $N \to \infty$, we can use standard multinomial resampling, since the likelihood estimate will converge towards the true likelihood, which is a smooth function of $\theta$. Alternatively, we can use CRN sequential importance sampling without resampling to guarantee continuity.

Chapter 5
Applications

To investigate the effectiveness of the various tree-based resampling schemes in practice, we compare filters using the weighted binary tree, the unweighted binary tree and the unweighted k-ary tree with the CRN15 and plain vanilla particle filters in different applications. In all instances, the number of particles used for the vanilla filter is selected such that the time taken by the vanilla filter is slightly more than that taken by the smooth filter. Unless explicitly stated, we use the importance distribution $q(x_t|y_t, x^{(i)}_{t-1}) = p(x_t|x^{(i)}_{t-1})$.

5.1 Gaussian State-Space Model

We consider the model:
\[ x_t \sim N(Ax_{t-1}, \Sigma_1), \quad t = 1, \ldots, T \]
\[ y_t \sim N(x_t, \Sigma_2), \quad t = 1, \ldots, T \]
\[ x_0 = 0 \]
where $y_t$ is observed for $t \in \{1, \ldots, T\}$.

Two-Dimensional Case

To generate the observations we use
\[ \Sigma_1 = \begin{pmatrix} v_{11} & \rho\sqrt{v_{11}v_{12}} \\ \rho\sqrt{v_{11}v_{12}} & v_{12} \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} v_{21} & 0 \\ 0 & v_{22} \end{pmatrix} \]

15 The CRN vanilla filter is simply a vanilla filter that uses the same random numbers each time.
and
\[ A = \begin{pmatrix} \phi & 0 \\ 0 & \phi \end{pmatrix} \]
with $v_{11} = v_{12} = 1$, $\rho = 0.8$, $\phi = 0.5$, $v_{21} = v_{22} = 0.5$ and $T = 200$. We ran the Kalman filter, the tree-based particle filters with 1024 particles, and the vanilla particle filters with 1536 particles, for 500 values of $v_{11}$ from 0.5 to 1.5 with the other parameters set to the values used to generate the data. Figures 5.1-5.2 show the results. Each run of each particle filter took about 2 seconds on our machine. Since we have access to the true likelihood via the Kalman filter for this model, we also plot the statistical errors of the log-likelihood in Figure 5.3.

[Figure 5.1: Plot of the estimated log-likelihood for the 2D Gaussian state-space model as a function of $v_{11}$ using the modified filters (blue) and the vanilla particle filter (red), in panels (a) weighted binary tree filter, (b) unweighted binary tree filter, (c) unweighted k-ary tree filter and (d) CRN vanilla filter. The true log-likelihood given by the Kalman filter is shown in black. For the CRN vanilla filter, we show estimates for only one set of common random numbers.]
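The exact benchmark curve in these figures comes from the Kalman filter. A minimal sketch (not the thesis code) that simulates the 2D model above and evaluates the exact log-likelihood via the prediction-error decomposition, with the observation matrix equal to the identity as in the model:

```python
import numpy as np

def kalman_loglik(y, A, S1, S2):
    """Exact log-likelihood of the linear-Gaussian state-space model
    x_t ~ N(A x_{t-1}, S1), y_t ~ N(x_t, S2), x_0 = 0, computed via the
    prediction-error decomposition."""
    d = A.shape[0]
    m, P = np.zeros(d), np.zeros((d, d))    # filtering mean/cov (x_0 = 0 exactly)
    ll = 0.0
    for yt in y:
        m_pred = A @ m                      # one-step-ahead state prediction
        P_pred = A @ P @ A.T + S1
        S = P_pred + S2                     # innovation covariance (H = I)
        e = yt - m_pred                     # innovation
        ll += -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                      + e @ np.linalg.solve(S, e))
        K = P_pred @ np.linalg.inv(S)       # Kalman gain
        m = m_pred + K @ e
        P = (np.eye(d) - K) @ P_pred
    return ll

# Simulate the 2D model with the parameter values used in this section.
rng = np.random.default_rng(0)
phi, rho = 0.5, 0.8
A = phi * np.eye(2)
S1 = np.array([[1.0, rho], [rho, 1.0]])     # v11 = v12 = 1
S2 = 0.5 * np.eye(2)                        # v21 = v22 = 0.5
x = np.zeros(2)
ys = []
for _ in range(200):                        # T = 200
    x = rng.multivariate_normal(A @ x, S1)
    ys.append(rng.multivariate_normal(x, S2))
print(kalman_loglik(np.array(ys), A, S1, S2))
```

Sweeping $v_{11}$ over a grid and re-evaluating this function reproduces the smooth black curve against which the particle filter estimates are compared.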
Gaussian State-Space Model 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −628 −627 −626 −625 −624 −623 −622 −621 −620 −619 −618 v11 lo g− lik el ih oo d Figure 5.2: The plots of the estimated log-likelihood for the 2D Gaussian state-space model as a function of v11 using the weighted binary tree particle filter (blue), the unweighted binary tree particle filter (red) and the unweighted k-ary tree particle filter (green) together with the true log-likelihood given by the Kalman filter (black). 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −3 −2 −1 0 1 2 3 v11 e rr o r (a) weighted binary tree filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −3 −2 −1 0 1 2 3 v11 e rr o r (b) unweighted binary tree filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −3 −2 −1 0 1 2 3 v11 e rr o r (c) unweighted k-ary tree filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −3 −2 −1 0 1 2 3 v11 e rr o r (d) CRN vanilla filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −3 −2 −1 0 1 2 3 v11 e rr o r (e) vanilla filter Figure 5.3: Plot of the estimated log-likelihood errors for the 2D Gaussian state-space model as a function of v11 using the modified filters 47 5.1. Gaussian State-Space Model 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −628 −627 −626 −625 −624 −623 −622 −621 −620 −619 −618 v11 lo g− lik el ih oo d (a) weighted binary tree filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −628 −627 −626 −625 −624 −623 −622 −621 −620 −619 −618 v11 lo g− lik el ih oo d (b) unweighted binary tree filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −627 −626 −625 −624 −623 −622 −621 −620 −619 −618 v11 lo g− lik el ih oo d (c) unweighted k-ary tree filter 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 −628 −627 −626 −625 −624 −623 −622 −621 −620 −619 −618 v11 lo g− lik el ih oo d (d) CRN vanilla filter Figure 5.4: Plot of the estimated log-likelihood for the 2D Gaussian state-space model as a function of v11 using the modified filters (blue) and the vanilla particle filter (red). 
The true log-likelihood given by the Kalman filter is shown in black. For the particle filters, the locally optimal proposal distribution is used. We show estimates for only one set of common random numbers using the CRN vanilla filter.

We also ran the filters using the 'locally optimal' proposal distribution
$$q(x_t|y_t, x_{t-1}^{(i)}) \propto p(y_t|x_t)\, p(x_t|x_{t-1}^{(i)})$$
with the same values of the parameters. Figures 5.4-5.5 show the results. The statistical errors of the log-likelihood are plotted in Figure 5.6. The estimates of the log-likelihood given by the various filters are slightly biased due to the use of logarithms. In addition, there is a small amount of bias from the interpolation schemes in the tree-based filters, which is reduced quickly as N increases. Table 5.1 shows that as N increases the estimates of the log-likelihood indicate very little bias and also that the sample variance decreases approximately as 1/N.

Figure 5.5: The plots of the estimated log-likelihood for the 2D Gaussian state-space model as a function of v11 using the weighted binary tree particle filter (blue), the unweighted binary tree particle filter (red) and the unweighted k-ary tree particle filter (green), together with the true log-likelihood given by the Kalman filter (black). For the particle filters, the locally optimal proposal distribution is used.
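For a linear-Gaussian model the locally optimal proposal is available in closed form, which is what makes this comparison possible. A minimal sketch (assuming, as in the Kalman filter sketch, transitions x_t ~ N(A x_{t-1}, Σ1) and observations y_t ~ N(x_t, Σ2); the function name is ours): the product of the two Gaussian densities in x_t is again Gaussian, with precision Σ1⁻¹ + Σ2⁻¹.

```python
import numpy as np

def locally_optimal_proposal(x_prev, y, A, S1, S2):
    """Mean and covariance of q(x_t | y_t, x_{t-1}) ∝ p(y_t|x_t) p(x_t|x_{t-1})
    for the assumed linear-Gaussian model x_t ~ N(A x_{t-1}, S1),
    y_t ~ N(x_t, S2): a standard product-of-Gaussians computation."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    C = np.linalg.inv(P1 + P2)                 # proposal covariance
    m = C @ (P1 @ (A @ x_prev) + P2 @ y)       # proposal mean
    return m, C
```

Sampling each particle from N(m, C) and weighting by the predictive density p(y_t|x_{t-1}) = N(y_t; A x_{t-1}, Σ1 + Σ2) gives the locally optimal filter used for Figures 5.4-5.6.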
Figure 5.6: Plot of the estimated log-likelihood errors for the 2D Gaussian state-space model as a function of v11 using the various filters with the locally optimal proposal distribution. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter; (e) vanilla filter]

Table 5.1: Mean and standard deviation of the log-likelihood as N increases, with θ fixed. For each value of N, 100 filters were run with different random seeds to collect the data. The true value of the log-likelihood given by the Kalman filter is −618.8168.

    N         vanilla   weighted binary   unweighted binary   k-ary
    1024  l̂   -619.02   -619.11           -618.67             -618.67
          σ̂     1.01      0.97              1.05                1.06
    2048  l̂   -618.83   -618.82           -618.85             -618.86
          σ̂     0.76      0.70              0.75                0.76
    4096  l̂   -618.85   -618.78           -618.82             -618.82
          σ̂     0.51      0.50              0.54                0.53
    8192  l̂   -618.80   -618.81           -618.85             -618.84
          σ̂     0.34      0.38              0.34                0.34
    16384 l̂   -618.81   -618.81           -618.79             -618.77
          σ̂     0.27      0.24              0.26                0.26

Three-Dimensional Case

To generate the observations we use
$$\Sigma_1 = \begin{pmatrix} v_{11} & \rho_{12}\sqrt{v_{11}v_{12}} & \rho_{13}\sqrt{v_{11}v_{13}} \\ \rho_{12}\sqrt{v_{11}v_{12}} & v_{12} & \rho_{23}\sqrt{v_{12}v_{13}} \\ \rho_{13}\sqrt{v_{11}v_{13}} & \rho_{23}\sqrt{v_{12}v_{13}} & v_{13} \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} v_{21} & 0 & 0 \\ 0 & v_{22} & 0 \\ 0 & 0 & v_{23} \end{pmatrix}$$
and
$$A = \begin{pmatrix} \phi & 0 & 0 \\ 0 & \phi & 0 \\ 0 & 0 & \phi \end{pmatrix}$$
with v11 = v12 = v13 = 1, ρ12 = 0.8, ρ13 = 0.4, ρ23 = 0.4, φ = 0.5, v21 = v22 = v23 = 0.5 and T = 200.
We ran the Kalman filter, the tree-based particle filters with 2048 particles and the vanilla particle filters with 5120 particles, for 500 values of v11 from 0.5 to 1.5 and the other parameters set to the values used to generate the data. Figures 5.7-5.8 show the results. Each run of each particle filter took about 6.5 seconds on our machine. Since we have access to the true likelihood via the Kalman filter for this model, we also plot the statistical errors of the log-likelihood in Figure 5.9.

Figure 5.7: Plot of the estimated log-likelihood for the 3D Gaussian state-space model as a function of v11 using the modified filters (blue) and the vanilla particle filter (red). The true log-likelihood given by the Kalman filter is shown in black. For the CRN vanilla filter we show estimates for only one set of common random numbers. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter]

Figure 5.8: The plots of the estimated log-likelihood for the 3D Gaussian state-space model as a function of v11 using the weighted binary tree particle filter (blue), the unweighted binary tree particle filter (red) and the unweighted k-ary tree particle filter (green), together with the true log-likelihood given by the Kalman filter (black).
Figure 5.9: Plot of the estimated log-likelihood errors for the 3D Gaussian state-space model as a function of v11 using the modified filters. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter; (e) vanilla filter]

5.2 Factor Stochastic Volatility

We consider the factor stochastic volatility model most similar to that proposed in [20]:
$$y_t \sim N(Bf_t, \Psi), \qquad f_t \sim N(0, H_t), \qquad \alpha_t \sim N(\Phi\alpha_{t-1}, U)$$
where
$$\Psi \overset{\mathrm{def}}{=} \mathrm{diag}(\psi_1,\ldots,\psi_M), \qquad H_t \overset{\mathrm{def}}{=} \mathrm{diag}(\exp(\alpha_t)), \qquad \Phi \overset{\mathrm{def}}{=} \mathrm{diag}(\phi_1,\ldots,\phi_K)$$
Here, f_t is K-dimensional, y_t is M-dimensional and B is an M × K factor loading matrix. The latent variable at each time step t is the K-dimensional vector α_t. Importantly, U is allowed to have non-zero off-diagonal entries to allow for correlation in the changes in log factor variances over time. The likelihood of the data, y_t, given α_t is Gaussian:
$$y_t|\alpha_t \sim N(0,\, BH_tB^T + \Psi)$$
For identifiability, the factor loading matrix has the form
$$B = \begin{pmatrix} 1 & 0 & 0 \\ b_{21} & 1 & 0 \\ b_{31} & b_{32} & 1 \\ b_{41} & b_{42} & b_{43} \\ b_{51} & b_{52} & b_{53} \end{pmatrix}$$
We generate data using α0 = 0, ψi = 0.5 for i ∈ {1, . . . , M}, φi = 0.9 for i ∈ {1, . . . , K},
$$U = \begin{pmatrix} 0.5 & 0.2 & 0.1 \\ 0.2 & 0.5 & 0.2 \\ 0.1 & 0.2 & 0.5 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.5 & 0.5 & 1 \\ 0.2 & 0.6 & 0.3 \\ 0.8 & 0.7 & 0.5 \end{pmatrix}$$
We do not consider time-varying covariances as per [24] here because this would increase the dimension of the latent variable space to M + K as opposed to the current dimension of K.
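To make the model concrete, data can be simulated directly from it, since y_t | α_t is Gaussian with covariance B H_t Bᵀ + Ψ. The following sketch (function name and interface are ours) propagates the latent log-variances and draws each observation from its conditional distribution:

```python
import numpy as np

def simulate_fsv(T, B, Phi, U, psi, rng):
    """Simulate the FSV model: latent log-variances alpha_t ~ N(Phi alpha_{t-1}, U)
    with alpha_0 = 0, and observations y_t | alpha_t ~ N(0, B H_t B^T + Psi)
    where H_t = diag(exp(alpha_t)) and Psi = diag(psi)."""
    K, M = Phi.shape[0], B.shape[0]
    Lu = np.linalg.cholesky(U)
    alpha = np.zeros(K)
    ys = np.empty((T, M))
    for t in range(T):
        alpha = Phi @ alpha + Lu @ rng.normal(size=K)
        cov = B @ np.diag(np.exp(alpha)) @ B.T + np.diag(psi)
        ys[t] = np.linalg.cholesky(cov) @ rng.normal(size=M)
    return ys
```

With the parameter values above (K = 3, M = 5, Φ = 0.9 I, ψ_i = 0.5), this reproduces the data-generating setup used in the experiments.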
While the same smooth filtering algorithm would be applicable with higher-dimensional latent variables, the number of particles required to approach smoothness increases with dimension.

We ran the tree-based particle filters with 1024 particles and the vanilla particle filters with 1536 particles, for 500 values of b42 from 0.3 to 0.8 and the other parameters set to the values used to generate the data. Figures 5.10-5.11 show the results. Each run of each particle filter took about 6 seconds on our machine.

Figure 5.10: Plot of the estimated log-likelihood for the FSV model as a function of b42 using the modified filters (blue) and the vanilla particle filter (red). For the CRN vanilla filter we show estimates for only one set of common random numbers. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter]

Figure 5.11: The plots of the estimated log-likelihood for the FSV model as a function of b42 using the weighted binary tree particle filter (blue), the unweighted binary tree particle filter (red) and the unweighted k-ary tree particle filter (green).

5.3 Stochastic Kinetic Model

Stochastic kinetic models are important in systems biology [4].
One of the simpler models in this class is the Lotka-Volterra model, which describes the evolution of prey (Z_t^{(1)}) and predator (Z_t^{(2)}) populations over time. In particular, in a small time interval (t, t + dt] the process evolves according to
$$\Pr(Z^{(1)}_{t+dt} = z^{(1)}_t + 1,\; Z^{(2)}_{t+dt} = z^{(2)}_t \mid z^{(1)}_t, z^{(2)}_t) = \alpha z^{(1)}_t\, dt + o(dt)$$
$$\Pr(Z^{(1)}_{t+dt} = z^{(1)}_t - 1,\; Z^{(2)}_{t+dt} = z^{(2)}_t + 1 \mid z^{(1)}_t, z^{(2)}_t) = \beta z^{(1)}_t z^{(2)}_t\, dt + o(dt)$$
$$\Pr(Z^{(1)}_{t+dt} = z^{(1)}_t,\; Z^{(2)}_{t+dt} = z^{(2)}_t - 1 \mid z^{(1)}_t, z^{(2)}_t) = \gamma z^{(2)}_t\, dt + o(dt)$$
where each equation corresponds to prey reproduction, predator reproduction and predator death, respectively. This model, and stochastic kinetic models in general, are straightforward to simulate using the Gillespie algorithm with time complexity linear in the number of reactions. However, it is often the case that we wish to estimate the posterior distribution of θ def= (α, β, γ) given discrete (i.e. incomplete) observations of the reactants {Z_t : t ∈ I ⊂ R}. The main obstacle to doing this in a reasonable amount of time is that it is difficult to generate low-variance estimates of p(z_j|z_i, θ) in acceptable time. For example, the Monte Carlo estimate
$$\hat{p}(z_j|z_i) = \frac{1}{N}\sum_{k=1}^{N} \delta_{z_j}(x^{(k)}), \qquad x^{(k)} \sim p(\cdot|z_i)$$
has unacceptable variance in general when many reactions occur in the time interval between i and j. Even with recent methods proposed in the 'likelihood-free' inference literature [21, 3, 28], such estimates remain unacceptable because the simulation process is naturally diffuse. It is, however, possible to sample from this posterior effectively using Markov chain Monte Carlo methods by including the intermediate (unobserved) population levels as auxiliary variables in an extended state-space [4], but this approach can lack computational efficiency for large models.
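The Gillespie algorithm mentioned above simulates the jump process exactly: wait an exponential time with rate equal to the total event rate, then select which of the three reactions fired in proportion to its rate. A minimal sketch (interface is ours):

```python
import numpy as np

def gillespie_lv(z0, alpha, beta, gamma, T, rng):
    """Exact simulation of the stochastic Lotka-Volterra model on [0, T]
    by the Gillespie algorithm; returns the piecewise-constant path as
    a list of (time, prey, predator) triples."""
    t, (z1, z2) = 0.0, z0
    path = [(t, z1, z2)]
    while t < T:
        rates = np.array([alpha * z1, beta * z1 * z2, gamma * z2])
        total = rates.sum()
        if total == 0:            # both reactions extinguished: absorbing state
            break
        t += rng.exponential(1.0 / total)
        r = rng.choice(3, p=rates / total)
        if r == 0:                # prey reproduction
            z1 += 1
        elif r == 1:              # predator reproduction (consumes one prey)
            z1 -= 1; z2 += 1
        else:                     # predator death
            z2 -= 1
        path.append((t, z1, z2))
    return path
```

Each iteration costs O(1) here since there are only three reactions, giving the linear-in-reactions complexity noted in the text.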
A promising alternative for dealing with the stochastic kinetic model is to approximate the true evolution process with a nonlinear diffusion process observed with noise for the purposes of parameter estimation [12]. This diffusion approximation has been used to speed up MCMC methods for approximate inference of the posterior of θ, since the number of intermediate population values is far lower. In particular, the number of intermediate values is T/ε − 1, where [0, T] is the time interval of interest and ε is the time discretization step; this is advantageous computationally when the number of expected reactions in time T far exceeds T/ε. Another recent advance in inference with nonlinear diffusion models involves utilising a modified diffusion bridge [13] in the spirit of [9].

The approach proposed here is to use sequential Monte Carlo to estimate p({Z_t : t ∈ I}|θ) smoothly for given values of θ, using the diffusion approximation at each step of a particle filter for the discretized diffusion process. In addition, we use observations of only Z_t^{(1)}, corresponding to what is called the partially observed case in the literature. This is because SMC seems particularly well-suited to inference in this context, since each trajectory includes an estimate of X^{(2)} ≈ Z^{(2)} at each time. The diffusion process follows the model:
$$X_t \sim N(X_{t-\varepsilon} + \varepsilon\mu,\; \varepsilon\Sigma_1), \qquad t = \varepsilon, 2\varepsilon, \ldots, T-\varepsilon, T$$
$$Z_t \sim N(X_t, \Sigma_2), \qquad t = \Delta, 2\Delta, \ldots, T-\Delta, T$$
where µ and Σ1 are the drift and instantaneous variance-covariance respectively. For the Lotka-Volterra model we use the drift and instantaneous variance-covariance given in [10]:
$$\mu = \begin{pmatrix} \alpha X^{(1)} - \beta X^{(1)}X^{(2)} \\ \beta X^{(1)}X^{(2)} - \gamma X^{(2)} \end{pmatrix} \quad \text{and} \quad \Sigma_1 = \begin{pmatrix} \alpha X^{(1)} + \beta X^{(1)}X^{(2)} & -\beta X^{(1)}X^{(2)} \\ -\beta X^{(1)}X^{(2)} & \beta X^{(1)}X^{(2)} + \gamma X^{(2)} \end{pmatrix}$$
Σ2, which encodes the noisiness of the observations, typically has non-zero entries only on the diagonal. To generate the data we use the Gillespie algorithm with X0 = (100, 100), α = 0.5, β = 0.0025, γ = 0.3 and T = 20. Observations are taken with ∆ = 0.1.
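One step of the discretized diffusion is an Euler-Maruyama draw from a Gaussian whose drift and covariance are evaluated at the current state. A sketch (the ε-scaling of µ and Σ1 is our assumption about the discretization, and the function name is ours):

```python
import numpy as np

def lv_diffusion_step(x, alpha, beta, gamma, eps, rng):
    """One Euler-Maruyama step of the Lotka-Volterra diffusion
    approximation: X_{t+eps} ~ N(X_t + eps*mu(X_t), eps*Sigma1(X_t)),
    with the drift and instantaneous covariance given in the text."""
    x1, x2 = x
    pred = beta * x1 * x2                         # predation rate
    mu = np.array([alpha * x1 - pred, pred - gamma * x2])
    S1 = np.array([[alpha * x1 + pred, -pred],
                   [-pred, pred + gamma * x2]])
    return x + eps * mu + np.linalg.cholesky(eps * S1) @ rng.normal(size=2)
```

Chaining such steps between observation times yields the proposal used inside the particle filter for this model.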
The particle filter uses the diffusion approximation with ε = 0.01 and Σ2 = I. The total number of time steps in the filter is 200, but each sample from the proposal involves sampling from 10 diffusion distributions. We ran the tree-based particle filters with 1024 particles and the vanilla particle filters with 1536 particles, for 500 values of γ from 0.2 to 0.4 and the other parameters set to the values used to generate the data. Figures 5.13-5.14 show the results. Each run of each particle filter took about 11 seconds on our machine. In Figure 5.12 we show the mean population estimates at each time from the particle filter with the parameters used to generate the data. Since the predator levels are not observed, their estimates have more variance with respect to the true levels.

Figure 5.12: Left: the true prey population level (red) and the estimated expected prey population level (blue) at each time. Right: the true predator population level (red) and the estimated expected predator population level (blue) at each time.

5.4 Dynamic Stochastic General Equilibrium Model

Dynamic stochastic general equilibrium (DSGE) models are used to explain macroeconomic phenomena like growth and business cycles. These models specify the preferences of agents, the actions they can take and the (stochastic) dynamics of the environment. Although they are difficult to solve in practice, these models allow us to compute distributions over various quantities of interest like output, consumption and employment, as well as to predict the effect of changing structural parameters of the model. In addition, given data from such a system, it is possible to compute the likelihood of the data for various values of the structural parameters of the model.
The defining characteristic of such models is that all agents seek to maximize their expected utility whenever choosing an action, and it is this rationality condition that makes such models so difficult to solve in practice. Following [7], we restrict our attention to a simple case of the DSGE model: a single-sector real business cycle (RBC) model. This type of model explains short-run fluctuations in economic indicators via stochastic shocks to the system [19]. Let q_t, k_t, c_t, i_t and a_t represent output, capital, consumption, investment and total factor productivity respectively. To simplify the model, it is assumed that labour is fixed at 1. The model assumes we observe q_t and i_t noisily for a representative agent who at every time t maximizes expected lifetime utility:
$$U_t = E\left[\left.\sum_{s=t}^{\infty} \beta^{s-t}\log(c_s)\,\right|\,q_{0:t}, k_{0:t}\right]$$

Figure 5.13: Plot of the estimated log-likelihood for the partially observed Lotka-Volterra model as a function of γ using the modified filters (blue) and the vanilla particle filter (red). For the CRN vanilla filter we show estimates for only one set of common random numbers. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter]
Figure 5.14: The plots of the estimated log-likelihood for the partially observed stochastic kinetic model as a function of γ using the weighted binary tree particle filter (blue), the unweighted binary tree particle filter (red) and the unweighted k-ary tree particle filter (green).

The maximization is subject to the dynamics
$$q_t = a_t^{1-\alpha}k_t^{\alpha}, \qquad c_t = q_t - i_t, \qquad k_{t+1} = i_t + (1-\delta)k_t$$
$$\log(a_{t+1}) = \rho\log(a_t) + v_{t+1}, \qquad v_t \sim N(0, \sigma^2)$$
and given the parameters α, β, δ, ρ and σ. Here α corresponds to capital's share of output, β is the discount factor, δ is capital depreciation and (ρ, σ) determines the type of shocks in the system. The observations are given by
$$y_t \sim N\left(\begin{pmatrix} q_t \\ i_t \end{pmatrix}, \begin{pmatrix} \sigma_q^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}\right)$$
where σ_q and σ_i are also structural parameters. We further assume δ = 1, i.e. full depreciation. This simplifies the model so that the consumption policy c_t(q_t, a_t) that maximizes expected utility can be found analytically. In this case, the solution is given by c_t(q_t, a_t) = (1 − αβ)q_t (see, for example, [27]), which implies i_t = αβq_t. As such, we have the identities
$$k_{t+1} = \alpha\beta q_t, \qquad q_{t+1} = a_{t+1}^{1-\alpha}(\alpha\beta q_t)^{\alpha}$$
This suggests using s_t = (z_t, q_t) as the state variable in the state-space model, where z_t = log(a_t). The dynamics are fully captured by
$$z_{t+1} = \rho z_t + v_{t+1}, \qquad q_{t+1} = (e^{z_{t+1}})^{1-\alpha}(\alpha\beta q_t)^{\alpha}$$
with the distribution over (z_0, q_0) given. The observation equations are then
$$y_t \sim N\left(\begin{pmatrix} 1 \\ \alpha\beta \end{pmatrix} q_t, \begin{pmatrix} \sigma_q^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}\right)$$
We simulated observations using parameters (α, β, ρ, σ, σ_q, σ_i) = (0.33, 0.96, 0.8, 0.05, 0.014, 0.02). We also set (z_0, q_0) = (0, 1). We ran the tree-based particle filters with 1024 particles and the vanilla particle filters with 1536 particles, for 500 values of α between 0.32 and 0.34 and the other parameters set to the values used to generate the data. Figures 5.15-5.16 show the results.
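The state-space form above is easy to simulate, which is how the observations were generated. A sketch (function name and interface are ours):

```python
import numpy as np

def simulate_dsge(T, alpha, beta, rho, sigma, sigma_q, sigma_i, rng,
                  z0=0.0, q0=1.0):
    """Simulate the reduced RBC state-space model: z_{t+1} = rho z_t + v,
    q_{t+1} = exp(z_{t+1})^(1-alpha) (alpha beta q_t)^alpha, with noisy
    observations of output q_t and investment alpha*beta*q_t."""
    z, q = z0, q0
    ys = np.empty((T, 2))
    for t in range(T):
        z = rho * z + sigma * rng.normal()
        q = np.exp(z) ** (1 - alpha) * (alpha * beta * q) ** alpha
        ys[t] = (q + sigma_q * rng.normal(),
                 alpha * beta * q + sigma_i * rng.normal())
    return ys
```

With the shocks switched off (σ = 0 and z0 = 0), the output process contracts to the deterministic fixed point q* = (αβ)^{α/(1−α)}, which is a useful sanity check on the dynamics.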
Each run of each particle filter took about 1.5 seconds on our machine. To show how large the improvement in smoothness is, we also ran the vanilla particle filter with 20000 particles to compare with the results from the weighted binary tree filter with only 1024 particles, despite the large difference in computation time. The results are shown in Figure 5.17. For the graphs associated with this simple DSGE model, we plot estimates of the log-likelihood for three different sets of common random numbers to emphasise that the variance of any individual log-likelihood estimate is not improved by our resampling techniques. It is the variance in the difference between log-likelihood estimates for different values of θ that is reduced.

5.5 Remarks

The applications we have seen all show a marked improvement in the smoothness of log-likelihood estimates for the tree-based particle filters over the CRN vanilla and vanilla particle filters. The most impressive algorithm is the weighted binary tree filter, despite the fact that it does not have interesting asymptotic properties like the unweighted tree filters. This highlights the fact that asymptotic properties (and their lack) can be misleading when assessing the efficacy of finite-sample methods.

One characteristic of the tree-based resampling schemes is that while they are not much slower than the vanilla filter, they always require more time to run. As such, for a fixed computational budget the number of particles they can use is smaller. Naturally, this means that their estimates of the log-likelihood have higher variance than their non-smooth counterparts given fixed time.

It is worth mentioning that in the process of applying the tree-based resampling approaches to these
Figure 5.15: Plot of the estimated log-likelihood for the DSGE model as a function of α using the modified filters (blue) and the vanilla particle filter (red). For the CRN vanilla filter we show estimates for only one set of common random numbers. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter]

Table 5.2: Running time (in seconds) for the vanilla, tree-based and O(N²) vanilla filters for various values of N.

    N      vanilla   tree-based   O(N²) vanilla
    64     0.07      0.14         2.5
    128    0.125     0.2          10
    256    0.26      0.5          40
    512    0.5       1            150
    1024   1         2            600
    2048   2.3       5            2500
    4096   4.6       10           9000

Figure 5.16: The plots of the estimated log-likelihood for the DSGE model as a function of α using the weighted binary tree particle filter (blue), the unweighted binary tree particle filter (red) and the unweighted k-ary tree particle filter (green).

Figure 5.17: The plots of the estimated log-likelihood for the DSGE model as a function of α using the weighted binary tree particle filter with 1024 particles (blue) and the vanilla particle filter with 20000 particles (red).
application domains, we considered the effect of adding only the O(N²) computation
$$\hat{p}_N(x^{(j)}_{t+1}|y_{0:t}) = \sum_{i=1}^{N} W^{(i)}_t\, p(x^{(j)}_{t+1}|x^{(i)}_t)$$
to the vanilla filter for the 2D Gaussian state-space model. This gives insight into the efficiency ramifications of the approach discussed in Section 3.3.2. Table 5.2 shows a comparison of the running times for the vanilla filter, our tree-based methods and the O(N²) vanilla filter for some small values of N. (While the tree-based methods do exhibit different time complexities, for small values of N these differences are almost negligible.) Indeed, for N > 128 the running time of the O(N²) filter quickly blows up to levels that would generally be unmanageable in real-world applications. At the same time, N ≤ 128 severely limits the types of distributions that can be represented accurately. Nevertheless, the applications in [7] usually use N = 100 particles to represent p(x_{0:t}|y_{0:t}) for the purpose of filtering, and this leads to reasonable results for the applications they consider. For estimation of the log-likelihood more particles are used, since they need not be resampled. In contrast, for our tree-based methods we found that using more particles only for estimation of the log-likelihood was not as helpful as increasing the number of particles for both filtering and estimation (keeping the computation time constant).

Lastly, we have shown log-likelihood plots with only one parameter changing, since this allows the differences in smoothness to be seen very easily. For the two-dimensional Gaussian state-space model, Figure 5.18 shows surface maps of the log-likelihood where v11 and v12 each range from 0.5 to 1.5, with 10000 different data points in total. Figure 5.19 shows the statistical errors.
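The O(N²) cost of the computation above arises because evaluating the mixture at one point is O(N), and it must be evaluated at all N proposed particles. A per-point sketch (here `trans_logpdf` is a placeholder for the transition log-density; names are ours), using a log-sum-exp for numerical stability:

```python
import numpy as np

def predictive_density(x_new, particles, weights, trans_logpdf):
    """Evaluate p_hat(x_{t+1} | y_{0:t}) = sum_i W_t^(i) p(x_{t+1} | x_t^(i))
    at one point x_new; O(N) per point, hence O(N^2) over all N particles."""
    logp = np.array([trans_logpdf(x_new, xi) for xi in particles])
    m = logp.max()
    # factor out the max before exponentiating to avoid underflow
    return np.exp(m) * np.sum(weights * np.exp(logp - m))
```

The tree-based schemes exist precisely to avoid paying this quadratic cost at every resampling step.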
Figure 5.18: Surface maps of the estimated log-likelihood for the 2D Gaussian state-space model as a function of v11 and v12 for the various filters. [panels: (a) Kalman filter; (b) weighted binary tree filter; (c) unweighted binary tree filter; (d) unweighted k-ary tree filter; (e) CRN vanilla filter; (f) vanilla filter]

Figure 5.19: Surface maps of the estimated log-likelihood errors for the 2D Gaussian state-space model as a function of v11 and v12 using the various filters. [panels: (a) weighted binary tree filter; (b) unweighted binary tree filter; (c) unweighted k-ary tree filter; (d) CRN vanilla filter; (e) vanilla filter]

Chapter 6

Analysis

In this chapter we show the asymptotic properties of the unweighted binary and k-ary tree algorithms. To begin, we first present some results that are used in our analysis.

6.1 Preliminaries

6.1.1 Quantile Estimation

In all of our methods, we have used empirical estimates of quantiles to partition the particles. Therefore, in order to understand the asymptotic properties of our methods it is necessary to show the asymptotic properties of these quantile estimates.
In particular, the analysis of quantile estimates is complicated by the fact that the quantile is not a linear statistical functional. However, it has been shown that such estimates do satisfy asymptotic normality results under very weak regularity conditions.

Classical Monte Carlo Estimates

Assume we want to compute α = F⁻¹(p), where F is a c.d.f. and p ∈ [0, 1], so that α is the pth quantile of F. Let f(x) = F′(x) be the p.d.f. of the distribution of interest. The classical Monte Carlo estimate of α with N i.i.d. samples {x_i}_{i=1}^N from f is given by
$$\alpha_N = F_N^{-1}(p) \overset{\mathrm{def}}{=} \inf\{x : F_N(x) \ge p\}$$
where F_N is the empirical distribution function
$$F_N(x) = \frac{1}{N}\sum_{i=1}^{N} I(x_i < x)$$
It is well known that √N(α_N − α) converges in distribution to N(0, σ²) as N → ∞, where
$$\sigma^2 = \frac{p(1-p)}{f(\alpha)^2}$$
as long as f exists and is continuous at α.

Importance Sampling Estimates

Of course, we are not typically in the situation where we can sample from f. It is shown in [16] that this result generalizes to an importance sampling estimate of the quantile. In particular, we now sample {x_i}_{i=1}^N from a distribution with density q. The importance sampling estimate of α is again given by
$$\alpha_N = F_N^{-1}(p) \overset{\mathrm{def}}{=} \inf\{x : F_N(x) \ge p\}$$
but F_N is now the weighted empirical distribution function
$$F_N(x) = \frac{1}{N}\sum_{i=1}^{N} I(x_i < x)\,\frac{f(x_i)}{q(x_i)}$$
With this estimate we still have √N(α_N − α) converging in distribution to N(0, σ²) as N → ∞, but now
$$\sigma^2 = \frac{1}{f(\alpha)^2}\left\{\int_{-\infty}^{\alpha}\frac{f(x)^2}{q(x)}\,dx - p^2\right\}$$

Normalized Importance Sampling Estimates

The above variance is applicable only for an estimate which uses normalized (true) densities for f and q. In SMC methods we generally have access only to unnormalized forms of these densities. In [11] it is shown that asymptotic normality still holds when the estimate uses normalized importance sampling, i.e.
$$F_N(x) = \frac{1}{\sum_{i=1}^{N}\frac{f(x_i)}{q(x_i)}}\sum_{i=1}^{N} I(x_i < x)\,\frac{f(x_i)}{q(x_i)}$$
in which case the estimate is unchanged by rescaling f and q by any amount.
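The normalized importance sampling quantile estimate is exactly the quantity the tree-based schemes compute at each partition. In code (a sketch; names are ours), the estimate sorts the samples, self-normalizes the weights, and inverts the weighted empirical cdf at p:

```python
import numpy as np

def is_quantile(xs, log_w, p):
    """Self-normalized importance sampling estimate of the p-th quantile:
    invert the weighted empirical cdf, i.e. return inf{x : F_N(x) >= p}.
    log_w are unnormalized log-weights log f(x_i) - log q(x_i)."""
    order = np.argsort(xs)
    xs, w = xs[order], np.exp(log_w[order] - log_w.max())  # stabilized weights
    cdf = np.cumsum(w) / w.sum()
    return xs[min(np.searchsorted(cdf, p), len(xs) - 1)]
```

Because the weights are self-normalized, scaling f or q by any constant leaves the returned value unchanged, as noted above.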
In this case, √N(α_N − α) converges in distribution to N(0, σ²) as N → ∞, with
$$\sigma^2 = \frac{1}{f(\alpha)^2}\left\{\int_{-\infty}^{\infty}\frac{f(x)^2}{q(x)}\,[I(x \le \alpha) - p]^2\,dx\right\} \qquad (6.1)$$

6.1.2 The Mean Value Theorem for Integration

The first mean value theorem for integration can be stated as follows. Given continuous F : [a, b] → R and integrable, positive p : [a, b] → R, there exists a number c ∈ [a, b] such that
$$\int_a^b F(t)p(t)\,dt = F(c)\int_a^b p(t)\,dt$$
In fact, a more general version of this theorem holds for functions of multiple variables, which we will use in our analysis. Given continuous F : X → R, where X is a connected space, and integrable, positive p : X → R, there exists a vector c ∈ X such that
$$\int_X F(t)p(t)\,dt = F(c)\int_X p(t)\,dt$$
It should be noted that orthotopes are convex and thus connected. When the function p(t) is a probability density function which is nonzero only when t ∈ X, then ∫_X p(t)dt = 1 and so ∫_X F(t)p(t)dt = F(c) for some c ∈ X.

6.1.3 Taylor's Theorem

The analysis presented relies heavily on Taylor's theorem, which we state without proof. Let f be a real-valued function on [a, b], let n be a positive integer, let f^(n−1) be continuous on [a, b], and let f^(k)(t) exist for all t ∈ [a, b] and k ∈ {1, . . . , n}. Then for any x and y in [a, b], there exists a c between x and y such that
$$f(x) = \sum_{k=0}^{n-1}\frac{f^{(k)}(y)}{k!}(x-y)^k + \frac{f^{(n)}(c)}{n!}(x-y)^n$$
This is of particular interest when |f^(n)(t)| is bounded for all t ∈ [a, b], as then we can say that
$$r(x, y) = \frac{f^{(n)}(c)}{n!}(x-y)^n \quad \text{satisfies} \quad \frac{r(x, y)}{|x-y|^{n-1}} \to 0 \text{ as } (x-y) \to 0$$
In our analysis we will use exclusively the case where n = 2, which is also called the first-order case. This gives
$$f(x) = f(y) + f'(y)(x-y) + r(x, y)$$
The multivariate case is also needed here. For n = 2, with similar regularity conditions, this is
$$f(\mathbf{x}) = f(\mathbf{y}) + \nabla f(\mathbf{y})^T(\mathbf{x}-\mathbf{y}) + r(\mathbf{x}, \mathbf{y})$$
where
$$r(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d}\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{c}_{i,j})(x_i-y_i)(x_j-y_j)$$
with each c_{i,j} falling on the line between x and y.
In particular, if all the second partial derivatives of f in a neighbourhood of y are bounded, it can be shown that r(x, y)/‖x − y‖ → 0 as (x − y) → 0, so the remainder term is insignificant when (x − y) → 0.

6.1.4 Quantile Derivatives

We will refer numerous times to the inverse cumulative distribution function of a marginal distribution in our analysis. Indeed, the function F_j⁻¹(u|z_{1:j−1}) is such a function. Recall that
$$F_j(z|z_{1:j-1}) = \int_{\mathbb{R}^{d-j}}\int_{-\infty}^{z} p(x_j, x_{j+1:d}|x_{1:j-1} = z_{1:j-1})\,dx_j\,dx_{j+1:d}$$
and
$$p_j(z|z_{1:j-1}) = \int_{\mathbb{R}^{d-j}} p(x_j = z, x_{j+1:d}|x_{1:j-1} = z_{1:j-1})\,dx_{j+1:d}$$
so that
$$F_j(z|z_{1:j-1}) = \int_{-\infty}^{z} p_j(x_j|x_{1:j-1} = z_{1:j-1})\,dx_j$$
We are also limiting our discussion to situations in which {x : p(x) > 0} is convex. This implies that the inverse cdf F_j⁻¹(u|z_{1:j−1}) is unique for u ∈ (0, 1):
$$F_j^{-1}(u|z_{1:j-1}) = z \text{ such that } F_j(z|z_{1:j-1}) = u$$
It is a strictly monotonic function (in u) by its definition and is thus differentiable (in u) almost everywhere, with partial derivative
$$\frac{\partial}{\partial u}F_j^{-1}(u|z_{1:j-1}) = \frac{1}{p_j(z|x_{1:j-1} = z_{1:j-1})}$$
where z = F_j⁻¹(u|z_{1:j−1}). Furthermore, for fixed u this inverse function is defined implicitly via the equation F_j(z|z_{1:j−1}) − u = 0. If we denote this function by g(x_{1:j−1}), it satisfies F_j(g(z_{1:j−1})|z_{1:j−1}) − u = 0. By the implicit function theorem (note that dF = 0 = Σ_i ∂_{x_i}F dx_i and solve for dx_j/dx_i with all other x_k constant; see [22] for more details), the partial derivatives of g are given by
$$\frac{\partial g}{\partial x_i}(z_{1:j-1}) = -\frac{\partial F_j(z|z_{1:j-1})/\partial x_i}{\partial F_j(z|z_{1:j-1})/\partial x_j} = -\frac{\int_{-\infty}^{z}\frac{\partial}{\partial x_i}\, p_j(x_j|x_{1:j-1} = z_{1:j-1})\,dx_j}{p_j(z|x_{1:j-1} = z_{1:j-1})}$$
as long as p_j(x_j|x_{1:j−1}) is differentiable for all x_j when x_{1:j−1} = z_{1:j−1} and the partial derivatives ∂p_j(x_j|z_{1:j−1})/∂x_i are integrable. One can similarly express ∂²g/∂x_i∂x_k (z_{1:j−1}) if p is twice differentiable for all x_j when x_{1:j−1} = z_{1:j−1} and these derivatives are integrable as well.
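The quantile-derivative identity ∂F⁻¹(u)/∂u = 1/f(F⁻¹(u)) is easy to verify numerically on a distribution with a closed-form inverse cdf. A small self-check on the Exponential(λ) distribution (the example and function names are ours, chosen for illustration only):

```python
import math

def exp_inv_cdf(u, lam):
    """Inverse cdf of Exponential(lam): F^{-1}(u) = -log(1 - u) / lam."""
    return -math.log1p(-u) / lam

def check_quantile_derivative(u, lam, h=1e-6):
    """Compare a central finite difference of F^{-1} at u against the
    analytic value 1 / f(F^{-1}(u)) = 1 / (lam * exp(-lam * F^{-1}(u)))."""
    fd = (exp_inv_cdf(u + h, lam) - exp_inv_cdf(u - h, lam)) / (2 * h)
    x = exp_inv_cdf(u, lam)
    analytic = 1.0 / (lam * math.exp(-lam * x))
    return fd, analytic
```

For the exponential both sides reduce to 1/(λ(1 − u)), so the two numbers agree to finite-difference accuracy.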
6.2 An approximation of the generalized cdf

Before moving to an analysis of the unweighted k-ary and binary tree-based algorithms, we first analyse a simpler algorithm that is a more direct (though intractable) approximation to Algorithm 3. This approximation algorithm is described in Algorithm 12. For our analysis, let us define µ = F⁻¹(u), i.e. µ is the true value of the inverse of the generalized cdf at u = u_{1:d}. For our proof we will introduce a new set of functions of the form G_j : R^j → R to simplify notation.

Algorithm 12 An algorithm to approximate the inverse of the generalized cdf F⁻¹(u_1, . . . , u_d)

1. • Sample N_1 points {x^{(i)}}_{i=1}^{N_1} from the proposal distribution q(x).
   • For i = 1, . . . , N_1, evaluate the importance weights: w_1(x^{(i)}) = p(x^{(i)})/q(x^{(i)})
   • Normalize the importance weights: W_1^{(i)} = w_1(x^{(i)}) / Σ_{k=1}^{N_1} w_1(x^{(k)})
   • Define x̂_1 def= inf{z : F̂_1(z) ≥ u_1}, where F̂_1(z) = Σ_{i=1}^{N_1} W_1^{(i)} I(x_1^{(i)} ≤ z)

2. For j = 2, . . . , d:
   • Sample N_j points {x_{j:d}^{(i)}}_{i=1}^{N_j} from the proposal distribution q(x_{j:d}|x̂_{1:j−1}) and set x^{(i)} = (x̂_{1:j−1}, x_{j:d}^{(i)}).
   • For i = 1, . . . , N_j, evaluate the importance weights: w_j(x^{(i)}) = p(x^{(i)})/q(x^{(i)})
   • Normalize the importance weights: W_j^{(i)} = w_j(x^{(i)}) / Σ_{k=1}^{N_j} w_j(x^{(k)})
   • Define x̂_j def= inf{z : F̂_j(z; x̂_{1:j−1}) ≥ u_j}, where F̂_j(z; x̂_{1:j−1}) = Σ_{i=1}^{N_j} W_j^{(i)} I(x_j^{(i)} ≤ z), and set x̂_{1:j} = (x̂_{1:j−1}, x̂_j).

3. Output x̂_{1:d}.

In particular, we define G_j(u_j; x_{1:j−1}) as follows:
$$G_j(u_j; x_{1:j-1}) = \inf\{x_j : F_j(x_j|x_{1:j-1}) \ge u_j\} = F_j^{-1}(u_j|x_{1:j-1})$$

Claim: Let N_j = c_j N for j ∈ {1, . . . , d} and let u ∈ [0, 1]^d be given. Define µ = F⁻¹(u). Assume ∂²G_j(u_j; x_{1:j−1})/∂x_i∂x_k is bounded for all i, k < j in a neighbourhood of µ_{1:j−1} and all j ∈ {1, . . . , d}. Also, assume the set {x : p(x) > 0} is convex.
Then as $N \to \infty$, $\sqrt{N}(\hat{x} - \mu) \stackrel{D}{\to} N(0, \Sigma)$ for some $\Sigma$ independent of $N$.

Proof: We know $\hat{x}_1$ is asymptotically distributed as $N(\mu_1, \sigma_1^2/N_1)$, where $\sigma_1$ is of the form given in Equation 6.1. Let $v_1 = \hat{x}_1 - \mu_1 \stackrel{D}{\to} N(0, \sigma_1^2/N_1)$. We also know $\hat{x}_2$ is distributed as $N(\hat{\mu}_2, \sigma_2^2/N_2)$, where $\hat{\mu}_2 \stackrel{\text{def}}{=} G_2(u_2; \hat{x}_1)$ and $\sigma_2$ is of the form given in Equation 6.1. A Taylor expansion of $G_2$ around $\mu_1$ gives
\[
\hat{\mu}_2 = G_2(u_2; \hat{x}_1) = G_2(u_2; \mu_1) + \frac{\partial G_2(u_2; \mu_1)}{\partial x_1}(\hat{x}_1 - \mu_1) + r_2(\hat{x}_1, \mu_1) \approx \mu_2 + \frac{\partial G_2(u_2; \mu_1)}{\partial x_1} v_1.
\]
If we let $v_2 = \hat{x}_2 - \hat{\mu}_2 \stackrel{D}{\to} N(0, \sigma_2^2/N_2)$, we have asymptotically
\[
\hat{x}_2 = \mu_2 + \frac{\partial G_2(u_2; \mu_1)}{\partial x_1} v_1 + v_2,
\]
so
\[
\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} =
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} +
\begin{bmatrix} 1 & 0 \\ \frac{\partial G_2(u_2;\mu_1)}{\partial x_1} & 1 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \end{bmatrix}.
\]
For $j = 3$ we have that $\hat{x}_3$ is distributed as $N(\hat{\mu}_3, \sigma_3^2/N_3)$, where $\hat{\mu}_3 \stackrel{\text{def}}{=} G_3(u_3; \hat{x}_{1:2})$ and $\sigma_3$ is of the form given in Equation 6.1. A Taylor expansion of $G_3$ around $\mu_{1:2}$ gives
\[
\hat{\mu}_3 = G_3(u_3; \hat{x}_{1:2}) = G_3(u_3; \mu_{1:2}) + \frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_1}(\hat{x}_1 - \mu_1) + \frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_2}(\hat{x}_2 - \mu_2) + r_3(\hat{x}_{1:2}, \mu_{1:2})
\]
\[
\approx \mu_3 + \frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_1} v_1 + \frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_2}\left[\frac{\partial G_2(u_2; \mu_1)}{\partial x_1} v_1 + v_2\right].
\]
If we let $v_3 = \hat{x}_3 - \hat{\mu}_3 \stackrel{D}{\to} N(0, \sigma_3^2/N_3)$, we have asymptotically
\[
\hat{x}_3 = \mu_3 + \left[\frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_1} + \frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_2}\frac{\partial G_2(u_2; \mu_1)}{\partial x_1}\right] v_1 + \frac{\partial G_3(u_3; \mu_{1:2})}{\partial x_2} v_2 + v_3,
\]
so
\[
\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \\ \hat{x}_3 \end{bmatrix} =
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} +
\begin{bmatrix}
1 & 0 & 0 \\
\frac{\partial G_2(u_2;\mu_1)}{\partial x_1} & 1 & 0 \\
\frac{\partial G_3(u_3;\mu_{1:2})}{\partial x_1} + \frac{\partial G_3(u_3;\mu_{1:2})}{\partial x_2}\frac{\partial G_2(u_2;\mu_1)}{\partial x_1} & \frac{\partial G_3(u_3;\mu_{1:2})}{\partial x_2} & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}.
\]
In general, $\hat{x}_j$ is distributed as $N(\hat{\mu}_j, \sigma_j^2/N_j)$, where $\hat{\mu}_j \stackrel{\text{def}}{=} G_j(u_j; \hat{x}_{1:j-1})$ and $\sigma_j$ is of the form given in Equation 6.1. A Taylor expansion of $G_j$ around $\mu_{1:j-1}$ gives
\[
\hat{\mu}_j = G_j(u_j; \hat{x}_{1:j-1}) = G_j(u_j; \mu_{1:j-1}) + \sum_{i=1}^{j-1} \frac{\partial G_j(u_j; \mu_{1:j-1})}{\partial x_i}(\hat{x}_i - \mu_i) + r_j(\hat{x}_{1:j-1}, \mu_{1:j-1})
\approx \mu_j + \sum_{i=1}^{j-1} \frac{\partial G_j(u_j; \mu_{1:j-1})}{\partial x_i}(\hat{x}_i - \mu_i).
\]
Note that the terms $\hat{x}_i - \mu_i$ get progressively more complex in terms of $v_{1:i}$ as $i$ increases.
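These increasingly complex coefficients can be generated mechanically: writing $\hat{x}_i - \mu_i = \sum_{j=1}^{i} \lambda_{i,j} v_j$, the chain above gives $\lambda_{i,i} = 1$ and $\lambda_{i,j} = \sum_{l=j}^{i-1} \frac{\partial G_i}{\partial x_l}\lambda_{l,j}$ for $j < i$. A small sketch of this recursion (the array `g[i][l]` is an arbitrary stand-in for the problem-specific partial derivatives $\partial G_i/\partial x_l$):

```python
def build_lambda(g, d):
    """lam[i][j] is the coefficient of v_j in x_hat_i - mu_i (1-indexed)."""
    lam = [[0.0] * (d + 1) for _ in range(d + 1)]
    for i in range(1, d + 1):
        lam[i][i] = 1.0
        for j in range(1, i):
            # chain rule: the v_j content of x_hat_i - mu_i comes through
            # every intermediate term x_hat_l - mu_l with j <= l < i
            lam[i][j] = sum(g[i][l] * lam[l][j] for l in range(j, i))
    return lam

# arbitrary example values for the partial derivatives dG_i/dx_l
g = {i: {l: 0.1 * i + 0.01 * l for l in range(1, i)} for i in range(1, 5)}
lam = build_lambda(g, 4)

# the coefficient of v_1 in x_hat_4 - mu_4, written out term by term
expanded = (g[4][1] + g[4][2] * g[2][1] + g[4][3] * g[3][1]
            + g[4][3] * g[3][2] * g[2][1])
print(lam[4][1], expanded)  # identical
```

For $i = 4$, $j = 1$ this reproduces exactly the expanded coefficient of $v_1$ written out in the text.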
For example, $\hat{x}_4 - \mu_4$ has the form
\[
\hat{x}_4 - \mu_4 = \left[\frac{\partial G_4}{\partial x_1} + \frac{\partial G_4}{\partial x_2}\frac{\partial G_2}{\partial x_1} + \frac{\partial G_4}{\partial x_3}\frac{\partial G_3}{\partial x_1} + \frac{\partial G_4}{\partial x_3}\frac{\partial G_3}{\partial x_2}\frac{\partial G_2}{\partial x_1}\right] v_1 + \left[\frac{\partial G_4}{\partial x_2} + \frac{\partial G_4}{\partial x_3}\frac{\partial G_3}{\partial x_2}\right] v_2 + \frac{\partial G_4}{\partial x_3} v_3 + v_4.
\]
Nevertheless, each $\hat{x}_i - \mu_i$ can be expressed as a linear combination of the $v_j$'s (with $j \leq i$), since every partial derivative of $G_k$ is evaluated at $(u_k; \mu_{1:k-1})$. If we let $\lambda_{i,j}$ denote the coefficient of $v_j$ in the expression $\hat{x}_i - \mu_i = \sum_{j=1}^{i} \lambda_{i,j} v_j$, then we can write $\hat{x} = \mu + \Lambda v$, with the matrix $\Lambda$ defined by $\Lambda(i,j) = \lambda_{i,j}$.

Note that each
\[
v_j \sim N\!\left(0, \frac{\sigma_j^2}{N_j}\right) = N\!\left(0, \frac{\sigma_j^2}{c_j N}\right).
\]
As such, we can write $\hat{x} = \mu + \frac{1}{\sqrt{N}} A z$, where the matrix $A$ is defined by $A(i,j) = \frac{\sigma_j}{\sqrt{c_j}} \lambda_{i,j}$ and $z \sim N(0, I)$. In other words, as $N \to \infty$, $\sqrt{N}(\hat{x} - \mu) \stackrel{D}{\to} N(0, AA^T)$, where $\Sigma \stackrel{\text{def}}{=} AA^T$ is independent of $N$.

6.3 The Unweighted k-ary Tree

Having proven asymptotic normality of the sampled points using a direct approximation to Algorithm 3, we turn to the k-ary tree sampling scheme, detailed in Section 4.4. This algorithm draws its efficiency gains from the use of partitions as opposed to sampling many vectors with identical components. The drawback is that it is more complex to analyze. For our proof we will introduce a new set of functions of the form $H_j : \mathbb{R}^{2j-1} \to \mathbb{R}$ to simplify notation. In particular, writing $\underline{x}_{1:j-1}$ and $\overline{x}_{1:j-1}$ for the lower and upper endpoints of a box, we define $H_j(u; \underline{x}_{1:j-1}, \overline{x}_{1:j-1})$ as follows:
\[
H_j(u; \underline{x}_{1:j-1}, \overline{x}_{1:j-1}) = \inf\left\{x_j : \int_{-\infty}^{x_j} p_j\!\left(t \,\Big|\, x_{1:j-1} \in \prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]\right) dt \geq u\right\}
= F_j^{-1}\!\left(u \,\Big|\, x_{1:j-1} \in \prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]\right),
\]
where
\[
p_j\!\left(x_j \,\Big|\, x_{1:j-1} \in \prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]\right) = \frac{\int_{\prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]} p(x_{1:j}) \, dx_{1:j-1}}{\int_{\prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]} p(x_{1:j-1}) \, dx_{1:j-1}}.
\]
Note that we can use the mean value theorem to show that $H_j(u; \underline{x}_{1:j-1}, \overline{x}_{1:j-1}) = F_j^{-1}(u \mid c_{1:j-1})$
for some $c_{1:j-1} \in I = \prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]$:
\[
\int_{-\infty}^{x_j} p_j(t \mid x_{1:j-1} \in I) \, dt = u
\;\Rightarrow\; \frac{\int_{-\infty}^{x_j} \int_{I} p(x_{1:j-1}, t) \, dx_{1:j-1} \, dt}{\int_{I} p(x_{1:j-1}) \, dx_{1:j-1}} = u
\;\Rightarrow\; \frac{\int_{-\infty}^{x_j} \int_{I} p(t \mid x_{1:j-1}) \, p(x_{1:j-1}) \, dx_{1:j-1} \, dt}{\int_{I} p(x_{1:j-1}) \, dx_{1:j-1}} = u
\]
\[
\;\Rightarrow\; \frac{\int_{-\infty}^{x_j} p(t \mid c_{1:j-1}) \, dt \, \int_{I} p(x_{1:j-1}) \, dx_{1:j-1}}{\int_{I} p(x_{1:j-1}) \, dx_{1:j-1}} = u
\;\Rightarrow\; \int_{-\infty}^{x_j} p(t \mid c_{1:j-1}) \, dt = u
\;\Rightarrow\; H_j(u; \underline{x}_{1:j-1}, \overline{x}_{1:j-1}) = F_j^{-1}(u \mid c_{1:j-1}).
\]
Recall that $G_j(u; c_{1:j-1}) = F_j^{-1}(u \mid c_{1:j-1})$, so we have $H_j(u; \underline{x}_{1:j-1}, \overline{x}_{1:j-1}) = G_j(u; c_{1:j-1})$.

Claim: Let $N = k^d$ and let $u \in [0,1]^d$ be given. Define $\mu = F^{-1}(u)$. Assume $\frac{\partial^2 G_j(u; x_{1:j-1})}{\partial x_i \partial x_k}$ is bounded for all $i, k < j$ in a neighbourhood of $\mu_{1:j-1}$ and all $j \in \{1, \ldots, d\}$. Also, assume the set $\{x : p(x) > 0\}$ is convex. Then as $k \to \infty$, $\hat{x} \stackrel{P}{\to} \mu$.

Proof: Let $\underline{u}_j = \frac{\lceil k u_j \rceil - 1}{k}$ and $\overline{u}_j = \frac{\lceil k u_j \rceil}{k}$. We denote the number of particles at each stage of the algorithm as $N_1, \ldots, N_d$. Note first that $N_1, \ldots, N_d$ are all related. We will prove later that each of these values increases as $k \to \infty$. In Algorithm 11, the first subset that is chosen is the $\lceil k u_1 \rceil$-th subset on the second level. Let us define $\underline{x}_1$ and $\overline{x}_1$ as the left and right endpoints of the selected subset. Since $\underline{u}_1$ (which is not a random variable) converges to $u_1$, sample quantiles are asymptotically unbiased, and their asymptotic variance decreases as $1/N$, we know that $\underline{x}_1 \stackrel{P}{\to} \mu_1$ (one can pick $N$ large enough that the mean of the estimate is close enough to $\mu_1$ and the variance is small enough for a Chebyshev inequality to show convergence). Similarly, $\overline{x}_1 \stackrel{P}{\to} \mu_1$. It does not matter that $\underline{x}_1$ and $\overline{x}_1$ are dependent, since two variables that converge in probability marginally must also converge in probability jointly. Now define $c_1^{(2)}$ such that it satisfies $G_2(u_2; c_1^{(2)}) = H_2(u_2; \underline{x}_1, \overline{x}_1)$. Define $\hat{\mu}_2 = G_2(u_2; c_1^{(2)})$.
By the same reasoning as above, the left and right endpoints of the second partition, $\underline{x}_2$ and $\overline{x}_2$, satisfy $\underline{x}_2 \stackrel{P}{\to} \hat{\mu}_2$ and $\overline{x}_2 \stackrel{P}{\to} \hat{\mu}_2$ (as long as $N_2 \to \infty$ as $k \to \infty$), where
\[
\hat{\mu}_2 = \mu_2 + \frac{\partial G_2(u_2; \mu_1)}{\partial x_1}(c_1^{(2)} - \mu_1) + r_2(c_1^{(2)}, \mu_1).
\]
Since $c_1^{(2)} \in [\underline{x}_1, \overline{x}_1]$ with $\underline{x}_1 \stackrel{P}{\to} \mu_1$ and $\overline{x}_1 \stackrel{P}{\to} \mu_1$, we have $c_1^{(2)} \stackrel{P}{\to} \mu_1$. Therefore $\underline{x}_2 \stackrel{P}{\to} \mu_2$ and $\overline{x}_2 \stackrel{P}{\to} \mu_2$.

In general we have $c_{1:j-1}^{(j)} \in \prod_{i=1}^{j-1} [\underline{x}_i, \overline{x}_i]$ satisfying $G_j(u_j; c_{1:j-1}^{(j)}) = H_j(u_j; \underline{x}_{1:j-1}, \overline{x}_{1:j-1})$ and
\[
\hat{\mu}_j \approx \mu_j + \sum_{i=1}^{j-1} \frac{\partial G_j(u_j; \mu_{1:j-1})}{\partial x_i}(c_i^{(j)} - \mu_i),
\]
where $c_{1:j-1}^{(j)} \stackrel{P}{\to} \mu_{1:j-1}$, so $\underline{x}_j \stackrel{P}{\to} \mu_j$ and $\overline{x}_j \stackrel{P}{\to} \mu_j$ as long as $N_j \to \infty$ as $k \to \infty$. Thus we can be assured that $\hat{x}_{1:d-1} \stackrel{P}{\to} \mu_{1:d-1}$ for any $\hat{x}_{1:d-1} \in \prod_{i=1}^{d-1} [\underline{x}_i, \overline{x}_i]$.

We do not compute $\underline{x}_d$ and $\overline{x}_d$. Instead, we set $\hat{x}_d$ to be the sample conditional $u_d$-quantile of the remaining particles in the $d$-th dimension. We have some $c_{1:d-1}^{(d)} \in \prod_{i=1}^{d-1} [\underline{x}_i, \overline{x}_i]$ satisfying $G_d(u_d; c_{1:d-1}^{(d)}) = H_d(u_d; \underline{x}_{1:d-1}, \overline{x}_{1:d-1})$, with $c_{1:d-1}^{(d)} \stackrel{P}{\to} \mu_{1:d-1}$. Let $\hat{\mu}_d = G_d(u_d; c_{1:d-1}^{(d)})$. Then $\sqrt{N_d}(\hat{x}_d - \hat{\mu}_d) \stackrel{D}{\to} N(0, \sigma_d^2)$ with
\[
\hat{\mu}_d \approx \mu_d + \sum_{i=1}^{d-1} \frac{\partial G_d(u_d; \mu_{1:d-1})}{\partial x_i}(c_i^{(d)} - \mu_i)
\]
and $\sigma_d$ of the form given in Equation 6.1. Therefore $\hat{\mu}_d \stackrel{P}{\to} \mu_d$ and so $\hat{x}_d \stackrel{P}{\to} \mu_d$.

All that remains to show is that as $k \to \infty$, $N_j \to \infty$ for $j \in \{1, \ldots, d\}$. Let us denote $N_j$ as a function of $k$ by $N_j(k)$. Recall that $q(x)$ is the proposal distribution used to sample the particles $\{x^{(i)}\}$. Note first that $N(k) = k^d$ and $N_1(k) = N(k)$, while
\[
N_2(k) \to N_1(k) \int_{F_1^{-1}(\underline{u}_1)}^{F_1^{-1}(\overline{u}_1)} q(x_1) \, dx_1.
\]
We have that
\[
F_1^{-1}(\overline{u}_1) = F_1^{-1}(u_1) + \frac{\partial F_1^{-1}(u_1)}{\partial u}(\overline{u}_1 - u_1) + r_1(\overline{u}_1, u_1)
\]
and
\[
F_1^{-1}(\underline{u}_1) = F_1^{-1}(u_1) + \frac{\partial F_1^{-1}(u_1)}{\partial u}(\underline{u}_1 - u_1) + r_1(\underline{u}_1, u_1),
\]
which implies
\[
F_1^{-1}(\overline{u}_1) - F_1^{-1}(\underline{u}_1) \approx \frac{\partial F_1^{-1}(u_1)}{\partial u}(\overline{u}_1 - \underline{u}_1) = \frac{1}{k}\frac{\partial F_1^{-1}(u_1)}{\partial u} = \frac{1}{k \, p(\mu_1)}.
\]
Therefore, as $k \to \infty$ we have
\[
\int_{F_1^{-1}(\underline{u}_1)}^{F_1^{-1}(\overline{u}_1)} q(x_1) \, dx_1 \approx \frac{1}{k}\frac{\partial F_1^{-1}(u_1)}{\partial u} \, q(\mu_1),
\]
and so
\[
N_2(k) = N_1(k) \frac{q(\mu_1)}{k \, p(\mu_1)}.
\]
In general, we will have
\[
N_j(k) \to N_{j-1}(k) \int_{F_j^{-1}(\underline{u}_j \mid \mu_{1:j-1})}^{F_j^{-1}(\overline{u}_j \mid \mu_{1:j-1})} q(x_j \mid x_{1:j-1} = \mu_{1:j-1}) \, dx_j,
\]
which yields
\[
N_j(k) = N_{j-1}(k) \frac{1}{k}\frac{\partial F_j^{-1}(u_j \mid \mu_{1:j-1})}{\partial u} \, q(\mu_j \mid \mu_{1:j-1}) = N_{j-1}(k) \frac{q(\mu_j \mid \mu_{1:j-1})}{k \, p(\mu_j \mid \mu_{1:j-1})},
\]
or
\[
N_j(k) = N_1(k) \prod_{i=1}^{j-1} \frac{q(\mu_i \mid \mu_{1:i-1})}{k \, p(\mu_i \mid \mu_{1:i-1})} = \frac{N_1(k)}{k^{j-1}} \prod_{i=1}^{j-1} \frac{q(\mu_i \mid \mu_{1:i-1})}{p(\mu_i \mid \mu_{1:i-1})}.
\]
Since $N_1(k) = N(k) = k^d$,
\[
N_j(k) = k^{d-j+1} \prod_{i=1}^{j-1} \frac{q(\mu_i \mid \mu_{1:i-1})}{p(\mu_i \mid \mu_{1:i-1})} \propto k^{d-j+1},
\]
i.e. $N_j \to \infty$ as $k \to \infty$ for all $j \in \{1, \ldots, d\}$.

6.4 The Unweighted Binary Tree

The resampling scheme using an unweighted binary tree described in Algorithm 8 is a fairly direct approximation of Algorithm 4. To analyze it, we introduce a set of functions $Q_j : \mathbb{R}^j \to \mathbb{R}$. For any $j$, let $k = j \bmod d$ (and if this makes $k = 0$, let $k = d$). Then $Q_j(z_{1:j-1}, u)$ is the value $z_k$ such that
\[
2^{j-1} \int_{a_1^{(j)}}^{b_1^{(j)}} \cdots \int_{a_{k-1}^{(j)}}^{b_{k-1}^{(j)}} \int_{a_k^{(j)}}^{z_k} \int_{a_{k+1}^{(j)}}^{b_{k+1}^{(j)}} \cdots \int_{a_d^{(j)}}^{b_d^{(j)}} p(x) \, dx = 0.5,
\]
where the value of $u$ determines the value of each $a_i^{(j)}$ and $b_i^{(j)}$ given $z_{1:j-1}$. In particular, as a representation of an infinite number of bits, $u$ determines for each $(i, j)$ whether $a_i^{(j)}$ is equal to $-\infty$ or $z_{i+rd}$ for some $r \in \mathbb{N}$. Similarly, $u$ determines whether $b_i^{(j)}$ is equal to $\infty$ or $z_{i+rd}$ for some $r \in \mathbb{N}$.

In order to analyze the selection scheme, we pretend that the medians $\hat{t}_1, \ldots, \hat{t}_n$ are computed only as needed by Algorithm 9. Since the tree is unweighted, it is not the case that $N_j = N/2^{\,j-1}$. Let $I = \prod_{i=1}^{d} [a_i^{(j)}, b_i^{(j)}]$ be the region determined by $(t_{1:j-1}, u)$. Then $E[N_j] = N \int_I q(x) \, dx$, and furthermore $N_j/N$ converges in probability to $\int_I q(x)\,dx$. As such, $N_j$ converges in probability to a linear function of $N$.

Claim: Let $u \in [0,1]$ and $m \in \mathbb{N}$ be given and let $t_{1:m} = f_m(u)$. Assume $\frac{\partial^2 Q_j(z_{1:j-1}, u)}{\partial z_i \partial z_k}$ is bounded for all $i, k < j$ in a neighbourhood of $t_{1:j-1}$ and all $j \in \{1, \ldots, m\}$.
Also, assume the set $\{x : p(x) > 0\}$ is convex. Then as $N \to \infty$, $\sqrt{N}(\hat{t}_{1:m} - t_{1:m}) \stackrel{D}{\to} N(0, \Sigma)$ for some $\Sigma$ independent of $N$.

Proof: We have that $\hat{t}_1 \stackrel{D}{\to} N(t_1, \sigma_1^2/N_1)$, where $\sigma_1$ is of the form given in Equation 6.1. Let $v_1 = \hat{t}_1 - t_1 \stackrel{D}{\to} N(0, \sigma_1^2/N_1)$. We have also that $\hat{t}_2 \stackrel{D}{\to} N(\tilde{t}_2, \sigma_2^2/N_2)$, where $\tilde{t}_2 \stackrel{\text{def}}{=} Q_2(\hat{t}_1, u)$ and $\sigma_2$ is of the form given in Equation 6.1. A Taylor expansion of $Q_2$ around $t_1$ is given by
\[
\tilde{t}_2 = Q_2(\hat{t}_1, u) = Q_2(t_1, u) + \frac{\partial Q_2(t_1, u)}{\partial z_1}(\hat{t}_1 - t_1) + r_2(\hat{t}_1, t_1, u) \approx t_2 + \frac{\partial Q_2(t_1, u)}{\partial z_1} v_1.
\]
If we let $v_2 = \hat{t}_2 - \tilde{t}_2 \stackrel{D}{\to} N(0, \sigma_2^2/N_2)$, we have asymptotically
\[
\hat{t}_2 = t_2 + \frac{\partial Q_2(t_1, u)}{\partial z_1} v_1 + v_2.
\]
In general, $\hat{t}_j$ is distributed as $N(\tilde{t}_j, \sigma_j^2/N_j)$, where $\tilde{t}_j \stackrel{\text{def}}{=} Q_j(\hat{t}_{1:j-1}, u)$ and $\sigma_j$ is of the form given in Equation 6.1. A Taylor expansion of $Q_j$ around $t_{1:j-1}$ gives
\[
Q_j(\hat{t}_{1:j-1}, u) = Q_j(t_{1:j-1}, u) + \sum_{i=1}^{j-1} \frac{\partial Q_j(t_{1:j-1}, u)}{\partial z_i}(\hat{t}_i - t_i) + r_j(\hat{t}_{1:j-1}, t_{1:j-1}, u) \approx t_j + \sum_{i=1}^{j-1} \frac{\partial Q_j(t_{1:j-1}, u)}{\partial z_i}(\hat{t}_i - t_i).
\]
Note that at most $2d$ of these partial derivatives are nonzero, since $Q_j$ does not depend on every $z_i$. Similar to the analysis of the approximation to the generalized cdf in Section 6.2, the terms $\hat{t}_i - t_i$ get progressively more complex as $i$ increases. Again, however, each $\hat{t}_i - t_i$ can be expressed as a linear combination of the $v_j$'s (with $j \leq i$). If we let $\lambda_{i,j}$ denote the coefficient of $v_j$ in the expression $\hat{t}_i - t_i = \sum_{j=1}^{i} \lambda_{i,j} v_j$, then we can write $\hat{t}_{1:m} = t_{1:m} + \Lambda_m v_{1:m}$, with the $m \times m$ matrix $\Lambda_m$ defined by $\Lambda_m(i,j) = \lambda_{i,j}$.

Let $I_j = \prod_{i=1}^{d} [a_i^{(j)}, b_i^{(j)}]$ be the region determined by $(t_{1:j-1}, u)$ for $j = 1, \ldots, m$. Then set $c_j = \int_{I_j} q(x) \, dx$, so that $N_j/N \stackrel{P}{\to} c_j$. Note that each
\[
v_j \sim N\!\left(0, \frac{\sigma_j^2}{N_j}\right) = N\!\left(0, \frac{\sigma_j^2}{c_j N}\right).
\]
As such, we can write $\hat{t}_{1:m} = t_{1:m} + \frac{1}{\sqrt{N}} A \rho_{1:m}$, where the matrix $A$ is defined by $A(i,j) = \frac{\sigma_j}{\sqrt{c_j}} \lambda_{i,j}$ and $\rho_{1:m} \sim N(0, I)$. In other words, as $N \to \infty$, $\sqrt{N}(\hat{t}_{1:m} - t_{1:m}) \stackrel{D}{\to} N(0, AA^T)$, where $\Sigma \stackrel{\text{def}}{=} AA^T$ is independent of $N$.
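The median chain analysed above can be made concrete with a toy simulation: at step $j$ we split the surviving particles at their sample median in coordinate $j \bmod d$ and keep the half indicated by the $j$-th bit of $u$. This is our own illustration of the idea behind the unweighted binary tree, not the exact Algorithms 8 and 9; the bit pattern and sample sizes are arbitrary:

```python
import random

random.seed(3)
d, n_steps = 2, 14
# toy particle cloud: 2^16 independent 2-d standard normal points
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1 << 16)]
bits = [1, 0] * (n_steps // 2)   # plays the role of the bits of u

subset = pts
medians = []                      # the chain t_hat_1, t_hat_2, ...
for j in range(n_steps):
    c = j % d                     # coordinate to split on
    vals = sorted(p[c] for p in subset)
    med = vals[len(vals) // 2]    # sample median of the surviving particles
    medians.append(med)
    keep_right = bool(bits[j])    # bit of u selects a half
    subset = [p for p in subset if (p[c] > med) == keep_right]

# a handful of particles remain; medians[0] is near the true median, 0
print(len(subset), medians[:4])
```

Each split roughly halves the surviving set, so after $\log_2 N$ splits only a few particles remain and the chain of medians pins down the selected point.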
Let us define $\mu$ to be (almost surely) the unique point in $\lim_{n \to \infty} f_n(u)$, where $f$ is specified in Algorithm 4. Since the above result shows that for any $m$, $\sqrt{N}(\hat{t}_{1:m} - t_{1:m}) \stackrel{D}{\to} N(0, \Sigma)$ for some $\Sigma$ independent of $N$, it follows that $\hat{x} \stackrel{P}{\to} \mu$.

6.5 Changing Parameters

The above analyses show that the resampling mechanisms provide consistent sampled particles given the same inputs, at least marginally. As $N \to \infty$, it is expected that the correlation between sampled particles due to the use of the same $N$ particles each time a particle is selected is negligible. To show that each sampled particle is close to its corresponding particle when the parameters change is straightforward. If $\hat{x} \stackrel{P}{\to} \mu$ using target distribution $p(\cdot \mid \theta)$ and $\hat{x}' \stackrel{P}{\to} \mu'$ using target distribution $p(\cdot \mid \theta')$, then
\[
\hat{x}' = \hat{x}' - \mu' + \mu' - \mu + \mu - \hat{x} + \hat{x} \;\Rightarrow\; \hat{x}' \stackrel{P}{\to} \hat{x} + \mu' - \mu.
\]
If we have asymptotic normality, i.e. $\sqrt{N}(\hat{x} - \mu) \stackrel{D}{\to} N(0, \Sigma_1)$ and $\sqrt{N}(\hat{x}' - \mu') \stackrel{D}{\to} N(0, \Sigma_2)$, we have
\[
\sqrt{N}\big(\hat{x}' - \hat{x} - (\mu' - \mu)\big) \stackrel{D}{\to} N(0, \Sigma_1 + \Sigma_2).
\]
In the event that $\mu$ is continuous in $\theta$, this implies that $\hat{x}'$ will be close to $\hat{x}$ if $\theta'$ is close to $\theta$.

6.6 Remarks

In the above analysis, we have shown asymptotic convergence for particles selected from a distribution $p(x) = p(x_t \mid y_{0:t})$, whereas we are interested in convergence for particles selected from the distribution $\hat{P}_N(dx_t \mid y_{0:t}) = \sum_{i=1}^{N} w_t^{(i)} \delta_{x_t^{(i)}}(dx_t)$. However, it is known that $\hat{P}_N(x_t \mid y_{0:t})$ converges in distribution to $p(x_t \mid y_{0:t})$ as $N \to \infty$, so this is not an issue.

It is worth noting that there are no regularity conditions required to guarantee that any of the tree-based resampling algorithms selects a particle $x^{(i)}$ (without interpolation) according to its weight $W^{(i)}$, as shown in Chapter 4. The conditions used above are only needed to show that the particle selected using common random numbers converges asymptotically to a particular particle.
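The coupling in Section 6.5 can be seen exactly in a univariate Gaussian toy case: with a common random number $u$ and exact inverse-cdf sampling (so the variance terms vanish), the draw under $\theta'$ differs from the draw under $\theta$ by exactly $\mu' - \mu$, which is why the estimator moves smoothly with the parameter. A minimal sketch, with hypothetical parameter values:

```python
from statistics import NormalDist

u = 0.7317                       # common random number
theta, theta_prime = 1.0, 1.05   # two nearby parameter values (the means)

# inverse-cdf sampling with the same u under both parameter values
x = NormalDist(theta, 1.0).inv_cdf(u)
x_prime = NormalDist(theta_prime, 1.0).inv_cdf(u)

print(x_prime - x)  # exactly theta_prime - theta = 0.05 (up to rounding)
```

With practical particle filters the inverse cdf is only approximated, so the difference has the additional $N(0, (\Sigma_1 + \Sigma_2)/N)$ noise discussed above.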
Chapter 7

Discussion

This thesis has presented the current state of the art for smooth likelihood estimation using particle filters, discussing recent approaches in the literature as well as proposing a novel tree-based approach. In particular, we have seen that in continuous state-space models, it is possible to evaluate the log-likelihood of data given fixed parameters using particle filters. In addition, we can reduce the variance of differences in this log-likelihood by inducing correlation among the particles selected by each filter through the use of common random numbers.

The main contribution of this thesis is the introduction of tree-based resampling schemes to induce correlation amongst selected particles across filters when the state variables are multi-dimensional. This is accomplished with no additional constraints on the general filtering framework: any proposal distribution can be used and there are no additional parameters. In fact, the use of a tree-based resampling scheme involves only a change to the resample operation in the SMC framework. In addition, the running time of a filter utilising such a resampling scheme is in $o(TN^2)$, where $N$ is the number of particles and $T$ is the number of discrete time steps in the filter. The most empirically compelling scheme, that involving the weighted binary tree, leads to a filter that runs in expected $O(TN \log N)$ time. This is in contrast to other approaches which run in $O(TN^2)$ time and require a family of approximating distributions (see Section 3.3.2), or approaches which generally require small $T$ and close $\theta$ (see Section 3.3.3). Furthermore, the tree-based resampling scheme can be seen as a generalization of the 'solved' case (see [23]) where the state variables are univariate.
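To see why a weighted binary tree yields $O(\log N)$ selection, here is a sketch of the underlying idea (our own illustration, not the thesis's algorithm): store partial weight sums at the internal nodes of a complete binary tree, then select an index proportional to its weight by descending from the root. Building the tree is $O(N)$ and each selection is $O(\log N)$:

```python
import random

class WeightTree:
    """Complete binary tree over weights; internal nodes hold subtree sums."""
    def __init__(self, weights):
        self.n = len(weights)
        size = 1
        while size < self.n:
            size *= 2
        self.size = size
        self.tree = [0.0] * (2 * size)
        self.tree[size:size + self.n] = weights   # leaves
        for i in range(size - 1, 0, -1):          # fill internal sums, O(N)
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def select(self, u):
        """Return index i with probability W_i, for u in [0, 1)."""
        target = u * self.tree[1]     # scale u by the total weight
        i = 1
        while i < self.size:          # O(log N) descent
            if target <= self.tree[2 * i]:
                i = 2 * i             # go left
            else:
                target -= self.tree[2 * i]
                i = 2 * i + 1         # go right
        return i - self.size

random.seed(4)
tree = WeightTree([0.1, 0.2, 0.3, 0.4])
counts = [0] * 4
for _ in range(100_000):
    counts[tree.select(random.random())] += 1
print([c / 100_000 for c in counts])  # approximately [0.1, 0.2, 0.3, 0.4]
```

The thesis's weighted binary tree additionally organizes the particles spatially so that nearby common random numbers select nearby particles; this sketch only demonstrates the logarithmic cost of weight-proportional selection.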
In the analysis of the unweighted binary and k-ary trees that we have proposed, we have shown that as $N \to \infty$, the particle sampled using a common random number converges in probability to a point that would be returned by a theoretical procedure involving the computation of numerous intractable integrals. While this assures us of a high degree of correlation for very large $N$, it should not be confused with a result for practical values of $N$. In our applications, it is found that the weighted binary tree resampling scheme (for which no asymptotic results are shown) gives the smoothest log-likelihood function. In addition, the variance of a particle filter estimate of the log-likelihood is proportional to $1/N$. Since we are concerned with situations where the true log-likelihood is continuous in the parameters, runs of uncorrelated particle filters will themselves become nearly smooth as $N \to \infty$. Furthermore, uncorrelated filters run in $O(N)$ time and will always provide lower variance estimates of the log-likelihood given a fixed computational budget. Alternatively, even the SIS algorithm, which is naturally smooth when common random numbers are used, will provide accurate estimation of the log-likelihood as $N \to \infty$, since its variance, despite being exponential in the fixed time index $T$, decreases as $1/N$. This does not mean that our methods are not useful; rather, it highlights the potential risk of using asymptotic properties to justify the use of a particular resampling scheme.

In practical applications, the use of the tree-based resampling schemes shows a distinct improvement over the vanilla and CRN vanilla filters with respect to the smoothness of plots of the log-likelihood for changing parameters. In order to benefit from this smoothness, one could employ curve fitting or surface fitting techniques to obtain approximate maximum likelihood estimates of the parameters.
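As an illustration of that last point (our own sketch, not a procedure from the thesis), one can fit a quadratic by least squares to smooth log-likelihood estimates over a grid of parameter values and take the fitted maximizer as an approximate MLE; the toy "log-likelihood" below is a hypothetical stand-in:

```python
def quad_fit_argmax(thetas, lls):
    """Least-squares fit of a*t^2 + b*t + c, then argmax of the fit."""
    n = len(thetas)
    s = [sum(t ** k for t in thetas) for k in range(5)]              # power sums
    y = [sum(ll * t ** k for t, ll in zip(thetas, lls)) for k in range(3)]
    # normal equations for [a, b, c], solved with Cramer's rule
    M = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], n]]
    rhs = [y[2], y[1], y[0]]
    det = lambda m: (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                     - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                     + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    D = det(M)
    def repl(col):
        m = [row[:] for row in M]
        for i in range(3):
            m[i][col] = rhs[i]
        return m
    a, b = det(repl(0)) / D, det(repl(1)) / D
    return -b / (2 * a)          # argmax of a concave quadratic (a < 0)

# smooth toy "log-likelihood" peaked near theta = 2
thetas = [1.0 + 0.1 * i for i in range(21)]
lls = [-(t - 2.0) ** 2 + 0.01 * (t - 1.5) for t in thetas]
print(quad_fit_argmax(thetas, lls))  # close to 2.0
```

In practice the grid values would be nearly smooth particle filter estimates rather than an analytic function, and a locally quadratic fit near the peak (or a more flexible surface fit in higher dimensions) would be used.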
In an online setting, one could assign one filter to each of $M$ parameter values and run each filter for one step upon receipt of a new data point. This would take time in $O(MN \log N)$ and could easily be parallelized to $O(N \log N)$ on $M$ computing machines. If the surface fitting procedure is not computationally intensive given the $M$ estimates of the log-likelihood, a smooth approximating surface could additionally be computed at each time step. In contrast, methods like those in [5] are not able to produce an accurate and almost smooth surface for large ranges of parameters, but are better suited to maximum-likelihood parameter estimation via stochastic gradient ascent for fixed $T$.

Future Research

The tree-based resampling schemes we have presented are not necessarily the most effective. It is possible that different trees could have better properties. In addition, the methods we have proposed select indices by processing the particles without any kind of modification. In some cases, dimension reduction or some transformation of the particles could be used to build the tree and select indices such that the smoothness of the likelihood estimator is increased.

There is clearly work to be done in analyzing the correlation between particles in tree-based resampling schemes for finite $N$. If we could estimate the amount of correlation given a particular tree structure or set of structures, we would have better grounds for selecting the type of tree used for resampling.

Finally, the most appropriate way to utilize nearly smooth particle filters is unknown at present. They seem well-suited to the task of likelihood maximization, but not necessarily via stochastic gradient ascent, for which the methods in [25] and [5] are readily applicable. Curve or surface fitting as a precursor to maximization, as mentioned above, is only one possible option.

Bibliography

[1] Jon Louis Bentley and James B. Saxe. Generating sorted lists of random numbers. ACM Trans. Math.
Softw., 6(3):359–364, 1980.

[2] M. Blum, R. W. Floyd, V. Pratt, R. Rivest, and R. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.

[3] P. Bortot, S. G. Coles, and S. A. Sisson. Inference for stereological extremes. Journal of the American Statistical Association, 102:84–92, 2007.

[4] R. J. Boys, D. J. Wilkinson, and T. B. L. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 2007.

[5] Pierre-Arnaud Coquelin, Romain Deguest, and Rémi Munos. Perturbation analysis for parameter estimation in continuous space HMMs. Submitted to IEEE Transactions on Signal Processing, 2008.

[6] D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–746, 2002.

[7] David N. DeJong, Hariharan Dharmarajan, Roman Liesenfeld, and Jean-François Richard. An efficient approach to analyzing state-space representations. SSRN eLibrary, 2007.

[8] Arnaud Doucet. On sequential simulation-based methods for Bayesian filtering. Technical Report CUED/F-INFENG/TR. 310, Cambridge University, Department of Engineering, 1998.

[9] Garland B. Durham and A. Ronald Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–316, 2002.

[10] Paul Fearnhead. Computational methods for complex stochastic systems: a review of some alternatives to MCMC. Statistics and Computing, 2007.

[11] P. Glynn. Importance sampling for Monte Carlo estimation of quantiles. Technical report, Stanford University, Department of Operations Research, 1996.

[12] A. Golightly and D. J. Wilkinson. Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics, 61:781–788, 2005.

[13] A. Golightly and D. J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models observed with error.
Comput. Stat. Data Anal., 52(3):1674–1693, 2008.

[14] C. A. R. Hoare. Algorithm 65: find. Commun. ACM, 4(7):321–322, 1961.

[15] Jeroen D. Hol, Thomas B. Schön, and Fredrik Gustafsson. On resampling algorithms for particle filters. In Nonlinear Statistical Signal Processing Workshop, Cambridge, United Kingdom, 2006.

[16] M. Vernon Johns. Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association, 83(403):709–714, 1988.

[17] Mike Klaas, Nando de Freitas, and Arnaud Doucet. Toward practical N² Monte Carlo: the marginal particle filter. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 308–315, Arlington, Virginia, 2005. AUAI Press.

[18] Donald E. Knuth. Mathematical analysis of algorithms. Technical Report STAN-CS-71-206, Stanford University, Department of Computer Science, 1971.

[19] Finn E. Kydland and Edward C. Prescott. Time to build and aggregate fluctuations. Econometrica, 50(6):1345–1370, 1982.

[20] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2000.

[21] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. U. S. A., 100:15324–15328, 2003.

[22] Jerrold E. Marsden and Anthony J. Tromba. Vector Calculus. W. H. Freeman, 4th edition, 1996.

[23] Michael K. Pitt. Smooth particle filters for likelihood evaluation and maximisation. The Warwick Economics Research Paper Series (TWERPS) 651, University of Warwick, Department of Economics, 2002. Available at http://ideas.repec.org/p/wrk/warwec/651.html.

[24] Michael K. Pitt and Neil Shephard. Time-varying covariances: A factor stochastic volatility approach. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6.
Oxford University Press, 1999.

[25] George Poyiadjis, Arnaud Doucet, and Sumeetpal S. Singh. Particle methods for optimal filter derivative: Application to parameter estimation. In IEEE ICASSP, pages 925–928, 2005.

[26] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Second Edition). Springer-Verlag, New York, 2004.

[27] David Romer. Advanced Macroeconomics. McGraw-Hill/Irwin, May 2001.

[28] S. A. Sisson, Y. Fan, and Mark M. Tanaka. Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. U. S. A., 104:1760–1765, 2007.

[29] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer-Verlag, New York, 2004.