"Science, Faculty of"@en .
"Computer Science, Department of"@en .
"DSpace"@en .
"UBCV"@en .
"Lee, Anthony"@en .
"2008-08-28T17:26:39Z"@en .
"2008"@en .
"Master of Science - MSc"@en .
"University of British Columbia"@en .
"In parametrized continuous state-space models, one can obtain estimates of the likelihood of the data for fixed parameters via the Sequential Monte Carlo methodology. Unfortunately, even if the likelihood is continuous in the parameters, the estimates produced by practical particle filters are not, even when common random numbers are used for each filter. This is because the same resampling step which drastically reduces the variance of the estimates also introduces discontinuities in the particles that are selected across filters when the parameters change.\n\nWhen the state variables are univariate, a method exists that gives an estimator of the log-likelihood that is continuous in the parameters. We present a non-trivial generalization of this method using tree-based o(N\u00B2) (and as low as O(N log N)) resampling schemes that induce significant correlation amongst the selected particles across filters. In turn, this reduces the variance of the difference between the likelihood evaluated for different values of the parameters and the resulting estimator is considerably smoother than naively running the filters with common random numbers.\n\nImportantly, in practice our methods require only a change to the resample operation in the SMC framework without the addition of any extra parameters and can therefore be used for any application in which particle filters are already used. In addition, excepting the optional use of interpolation in the schemes, there are no regularity conditions for their use although certain conditions make them more advantageous.\n\nIn this thesis, we first introduce the relevant aspects of the SMC methodology to the task of likelihood estimation in continuous state-space models and present an overview of work related to the task of smooth likelihood estimation. Following this, we introduce theoretically correct resampling schemes that cannot be implemented and the practical tree-based resampling schemes that were developed instead. After presenting the performance of our schemes in various applications, we show that two of the schemes are asymptotically consistent with the theoretically correct but unimplementable methods introduced earlier. Finally, we conclude the thesis with a discussion."@en .
"https://circle.library.ubc.ca/rest/handle/2429/1547?expand=metadata"@en .
"4870474 bytes"@en .
"application/pdf"@en .
"Towards Smooth Particle Filters for Likelihood Estimation with Multivariate Latent Variables by Anthony Lee B.Sc., The University of British Columbia, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August, 2008 c\u00C2\u00A9 Anthony Lee 2008 Abstract In parametrized continuous state-space models, one can obtain estimates of the likelihood of the data for fixed parameters via the Sequential Monte Carlo methodology. Unfortunately, even if the likelihood is continuous in the parameters, the estimates produced by practical particle filters are not, even when common random numbers are used for each filter. This is because the same resampling step which drastically reduces the variance of the estimates also introduces discontinuities in the particles that are selected across filters when the parameters change. When the state variables are univariate, the methodology of [23] gives an estimator of the log- likelihood that is continuous in the parameters. We present a non-trivial generalization of this method using tree-based o(N2) (and as low as O(N logN)) resampling schemes that induce signifi- cant correlation amongst the selected particles across filters. In turn, this reduces the variance of the difference between the likelihood evaluated for different values of the parameters and the resulting estimator is considerably smoother than naively running the filters with common random numbers. Importantly, in practice our methods require only a change to the resample operation in the SMC framework without the addition of any extra parameters and can therefore be used for any applica- tion in which particle filters are already used. In addition, excepting the optional use of interpolation in the schemes, there are no regularity conditions for their use although certain conditions make them more advantageous. In this thesis, we first introduce the relevant aspects of the SMCmethodology to the task of likelihood estimation in continuous state-space models and present an overview of work related to the task of smooth likelihood estimation. Following this, we introduce theoretically correct resampling schemes that cannot be implemented and the practical tree-based resampling schemes that were developed instead. After presenting the performance of our schemes in various applications, we show that two of the schemes are asymptotically consistent with the theoretically correct but unimplementable methods introduced earlier. Finally, we conclude the thesis with a discussion. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Sequential Monte Carlo Methods and Likelihood Evaluation . . . . . . . . . . . 3 2.1 Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 State-Space Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4.1 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4.2 Sequential Importance Sampling/Resampling . . . . . . . . . . . . . . . . . . 10 2.4.3 Likelihood Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Smooth Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Desired Estimator Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 iii Table of Contents 3.2 Common Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.1 Smooth Likelihood Estimation with One-Dimensional State Variables . . . . 21 3.3.2 Smooth Likelihood Estimation with Multi-Dimensional State Variables in O(N2) time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.3 Smooth Likelihood Estimation Using Fixed Selection Indices . . . . . . . . . 23 3.4 Theoretical Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.4.1 Regularity Conditions and Notation . . . . . . . . . . . . . . . . . . . . . . . 24 3.4.2 A Generalized CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.4.3 Median-Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4.4 Practical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4 Fast and Approximately Smooth Likelihood Estimation . . . . . . . . . . . . . . 32 4.1 A Weighted Tree Structure for P\u00CC\u0082N (dx0:t|y0:t) . . . . . . . . . . . . . . . . . . . . . . 32 4.2 Splitting via the Unweighted Median (Weighted Binary Trees) . . . . . . . . . . . . 33 4.2.1 From O(N(logN)2) to O(N logN) time . . . . . . . . . . . . . . . . . . . . 35 4.2.2 From u \u00E2\u0088\u0088 Rk to u \u00E2\u0088\u0088 Rd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2.3 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.3 Splitting via the Weighted Median (Unweighted Binary Trees) . . . . . . . . . . . . 38 4.3.1 Running Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.2 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4 An Unweighted k-ary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4.1 Running Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.4.2 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.4.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5 Overall Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 iv Table of Contents 5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1 Gaussian State-Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.2 Factor Stochastic Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.3 Stochastic Kinetic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.4 Dynamic Stochastic General Equilibrium Model . . . . . . . . . . . . . . . . . . . . 57 5.5 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.1.1 Quantile Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.1.2 The Mean Value Theorem for Integration . . . . . . . . . . . . . . . . . . . . 67 6.1.3 Taylor\u00E2\u0080\u0099s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.1.4 Quantile Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.2 An approximation of the generalized cdf . . . . . . . . . . . . . . . . . . . . . . . . 69 6.3 The Unweighted k-ary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.4 The Unweighted Binary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 6.5 Changing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.6 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 v List of Tables 5.1 Mean and standard deviation of the log-likelihood as N increases . . . . . . . . . . . 50 5.2 Running time (in seconds) for the vanilla, tree-based and O(N2) vanilla filter for various values of N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 vi List of Figures 3.1 Median-cut partitioning of the space for two different densities . . . . . . . . . . . . 29 4.1 uj+d against uj for two different values of wj,0 . . . . . . . . . . . . . . . . . . . . . 36 4.2 Binary tree representation of the sets as a function of \u00CE\u00B2 . . . . . . . . . . . . . . . . 38 5.1 2D Gaussian state-space model log-likelihood plots using the various filters . . . . . 46 5.2 2D Gaussian state-space model log-likelihood plots comparing the tree-based resam- pling schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.3 2D Gaussian state-space model statistical error plots using the various filters . . . . 47 5.4 2D Gaussian state-space model log-likelihood plots using the various filters with the locally optimal proposal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.5 2D Gaussian state-space model log-likelihood plots comparing the tree-based resam- pling schemes with the locally optimal proposal distribution . . . . . . . . . . . . . . 49 5.6 2D Gaussian state-space model statistical error plots using the various filters with the locally optimal proposal distribution . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.7 3D Gaussian state-space model log-likelihood plots using the various filters . . . . . 51 5.8 3D Gaussian state-space model log-likelihood plots comparing the tree-based resam- pling schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.9 3D Gaussian state-space model statistical error plots using the various filters . . . . 52 5.10 FSV log-likelihood plots using the various filters . . . . . . . . . . . . . . . . . . . . 54 5.11 FSV log-likelihood plots comparing the tree-based resampling schemes . . . . . . . . 55 5.12 Estimated expected predator-prey population levels for the partially observed Lotka- Volterra model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.13 Partially observed Lotka-Volterra model log-likelihood plots using the various filters 58 vii List of Figures 5.14 Partially observed Lotka-Volterra model log-likelihood plots comparing the tree-based resampling schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.15 DSGE log-likelihood plots using the various filters . . . . . . . . . . . . . . . . . . . 61 5.16 DSGE log-likelihood plots comparing the tree-based resampling schemes . . . . . . . 62 5.17 DSGE log-likelihood plots for the vanilla particle filter with N = 20000 . . . . . . . 62 5.18 2D Gaussian state-space model log-likelihood surface maps using the various filters . 63 5.19 2D Gaussian state-space model statistical error surface maps using the various filters 64 viii List of Algorithms 1 The generic sequential importance sampling algorithm . . . . . . . . . . . . . . . . . 9 2 The generic sequential importance sampling/resampling algorithm . . . . . . . . . . 12 3 An algorithm to evaluate the inverse of the generalized cdf F\u00E2\u0088\u00921(u1, . . . , ud) . . . . . 26 4 The mapping function fn(u) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5 Algorithm to convert a number u in [0,1] to a string of bits of length n . . . . . . . . 27 6 Constructing the weighted binary tree: constructq(particles, level) . . . . . . . . . . 34 7 Selecting a particle from the weighted binary tree: selectq(node, level, u1, . . . , ud) . . 34 8 Constructing the unweighted binary tree: constructp(particles, level) . . . . . . . . 39 9 Selecting a particle from the unweighted binary tree: selectp(node, level, u) . . . . . 39 10 Constructing the unweighted k-ary tree: constructk(particles, level) . . . . . . . . . 42 11 Selecting a particle from the unweighted k-ary tree: selectk(node, level, u1, . . . , ud) . 42 12 An algorithm to approximate the inverse of the generalized cdf F\u00E2\u0088\u00921(u1, . . . , ud) . . . 70 ix Acknowledgements I thank my supervisor, Arnaud Doucet, for his help, advice and wisdom during the course of this degree. I also thank Nando de Freitas, who in addition to being my second reader also introduced me to the field of machine learning when I was an undergraduate and impressed upon me the notion that statistics provides a meaningful way to build and measure machines that learn. More broadly, I would like to thank anyone and everyone who has taught me something about anything. I am grateful for the financial support I received from the Natural Sciences and Engineering Research Council of Canada, which has allowed me to pursue this work. Finally, I would like to thank my family and friends for their love and support throughout the years. x Chapter 1 Introduction Sequential Monte Carlo (SMC) methods are effective tools for sampling from high-dimensional probability distributions. However, in many situations we are interested in parameter estimation and not the latent variables generated by such methods. While SMC methods are still useful in these situations, a major drawback is that the likelihood estimates given are far from smooth. Assume we have data y, latent variables x and parameters \u00CE\u00B8 along with a model that specifies the probability density p(x,y|\u00CE\u00B8). Furthermore, assume that the marginal likelihood p(y|\u00CE\u00B8) is continuous in \u00CE\u00B8. For an N -sample Monte Carlo estimator L\u00CC\u0082N (\u00CE\u00B8) of this marginal likelihood to be smooth we require it to be continuous in \u00CE\u00B8 as well. Additionally, we want L\u00CC\u0082N (\u00CE\u00B8) to be consistent. In the SMC framework, we often obtain an estimate of log p(y|\u00CE\u00B8) by running a particle filter with fixed parameter \u00CE\u00B8. Unfortunately, the interaction of the particles or resampling step renders such estimates non-smooth in \u00CE\u00B8 even if each filter is run with common random numbers. In particular, sequential importance sampling without resampling provides smooth but prohibitively high-variance estimates of the log-likelihood using common random numbers. The resampling step, on the other hand, can typically be viewed as the use of the inverse transform on the empirical cumulative distribution function of the particle indices. As the weights of the particles change with \u00CE\u00B8, the same random numbers will lead to the selection of a particle with a different but \u00E2\u0080\u0098close\u00E2\u0080\u0099 index. Unfortunately, particles with close indices are not generally close themselves and so this introduces large differences in the resampled particles as \u00CE\u00B8 changes. Of course, it is this same resampling step that allows us to estimate the log-likelihood with reasonable variance and so discarding this crucial element of SMC methodology for the sake of smoothness is not really an option. It may be impossible to produce a tractable estimator with both these properties in the general case. For continuous state-space models, a particle filter has already been produced that possesses these properties in the case where the state variables are univariate [23]. This thesis attempts to tackle the problem in the case where the state variables are multivariate. While the algorithms presented do not satisfy the smoothness criteria defined above, they provide significant improvements in smoothness over the generic particle filter algorithms they extend and run in o(N2) time, which is faster than other methods that also attempt to solve this problem. The rest of this thesis is organized as follows. Chapter 2 provides a brief overview of the statistical 1 Chapter 1. Introduction concepts required to understand the material that follows. Chapter 3 begins the treatment smooth likelihood estimation in earnest, with a discussion of related work as well as the introduction of theoretical algorithms we would like to approximate. In chapter 4, we present novel practical algorithms utilizing tree-based resampling. In chapter 5, we apply the new algorithms to a variety of application domains and show our results. Chapter 6 provides some analysis of the algorithms and chapter 7 concludes the thesis with a discussion. 2 Chapter 2 Sequential Monte Carlo Methods and Likelihood Evaluation This chapter provides the reader with a brief overview of the statistical concepts required to under- stand the rest of the thesis. While the reader is expected to be familiar with the most basic concepts in probability, advanced knowledge of Monte Carlo methods is not required as the relevant details are covered here. If a more rigorous treatment of the material is desired, such treatment is available in [29] and [26]. 2.1 Likelihood Estimation A parametric model is a set of distributions that can be parametrized by a finite number of parame- ters. For continuous distributions, each member of such a set can be represented by a density p(\u00C2\u00B7|\u00CE\u00B8), where \u00CE\u00B8 belongs to a finite-dimensional parameter space \u00CE\u0098. Given a set of parametrized densities and some fixed data y, the likelihood function p(y|\u00CE\u00B8) is a function of \u00CE\u00B8. It is often the case that we want to evaluate p(y|\u00CE\u00B8) for various values of \u00CE\u00B8. Unfortunately, in many models it is impossible to compute this quantity directly. However, in some cases it is possible to introduce a latent random variable x and evaluate p(x,y|\u00CE\u00B8). Integrating out x can yield an expression for p(y|\u00CE\u00B8) p(y|\u00CE\u00B8) = \u00E2\u0088\u00AB X p(x,y|\u00CE\u00B8)dx If we could compute this integral analytically we would have an exact expression for p(y|\u00CE\u00B8). In general, however, this integral can only be evaluated numerically. When x is high-dimensional, Monte Carlo methods are often used to approximate integrals of this type. 3 2.2. Monte Carlo Methods 2.2 Monte Carlo Methods Monte Carlo integration is based on the understanding that the expectation of function of a random variable is an integral. Indeed, the expectation of the function \u00CF\u0086 : X \u00E2\u0086\u0092 R evaluated at random variable X \u00E2\u0088\u0088 X with probability density function pi is Epi[\u00CF\u0086(X)] = \u00E2\u0088\u00AB X \u00CF\u0086(x)pi(x)dx Importantly, we can use samples {xi}Ni=1from the distribution defined by pi to approximate this integral with the empirical average I1 = 1 N N\u00E2\u0088\u0091 i=1 \u00CF\u0086(xi) which has variance var[I1] = 1 N { Epi[\u00CF\u0086(x) 2]\u00E2\u0088\u0092 Epi[\u00CF\u0086(x)]2 } An extension of classical Monte Carlo integration is importance sampling, in which we sample instead from a proposal distribution \u00CE\u00B3 with the requirement that \u00CE\u00B3(x) > 0 whenever pi(x) > 0: Epi[\u00CF\u0086(X)] = \u00E2\u0088\u00AB X \u00CF\u0086(x) pi(x) \u00CE\u00B3(x) \u00CE\u00B3(x)dx This yields the approximation I2 = 1 N N\u00E2\u0088\u0091 i=1 \u00CF\u0086(xi) pi(xi) \u00CE\u00B3(xi) The variance of this estimate is given by var[I2] = E\u00CE\u00B3 [I 2 2 ]\u00E2\u0088\u0092 E\u00CE\u00B3 [I2]2 = 1 N {\u00E2\u0088\u00AB X \u00CF\u0086(x)2 pi(x)2 \u00CE\u00B3(x)2 \u00CE\u00B3(x)dx\u00E2\u0088\u0092 (\u00E2\u0088\u00AB X \u00CF\u0086(x) pi(x) \u00CE\u00B3(x) \u00CE\u00B3(x)dx )2} = 1 N { Epi[\u00CF\u0086(x) 2 pi(x) \u00CE\u00B3(x) ]\u00E2\u0088\u0092 Epi[\u00CF\u0086(x)]2 } For both classical Monte Carlo and importance sampling, the estimates are unbiased and converge almost surely via the Strong Law of Large Numbers. In practice, importance sampling can be used as a variance reduction technique or because \u00CE\u00B3 is much easier to sample from. For our purposes, the 4 2.2. Monte Carlo Methods latter is most often the case, and care should be taken to ensure that Epi[\u00CF\u0086(x) 2 pi(x) \u00CE\u00B3(x) ] < \u00E2\u0088\u009E so that the estimate has at least finite variance. It is important to note that often we only know pi and even \u00CE\u00B3 up to a normalizing constant, ie. we know pi\u00E2\u0088\u0097(x) \u00E2\u0088\u009D pi(x) and \u00CE\u00B3\u00E2\u0088\u0097(x) \u00E2\u0088\u009D \u00CE\u00B3(x) where \u00E2\u0088\u00ABX pi(x)dx = \u00E2\u0088\u00ABX \u00CE\u00B3(x)dx = 1. In this case we can use the estimate I\u00E2\u0088\u00972 = \u00E2\u0088\u0091N i=1 \u00CF\u0086(xi)w \u00E2\u0088\u0097(xi)\u00E2\u0088\u0091N i=1 w \u00E2\u0088\u0097(xi) where w\u00E2\u0088\u0097(xi) = pi\u00E2\u0088\u0097(xi) \u00CE\u00B3\u00E2\u0088\u0097(xi) Alternatively, we can write this as I\u00E2\u0088\u00972 = N\u00E2\u0088\u0091 i=1 \u00CF\u0086(xi)W (i) where W (i) = w\u00E2\u0088\u0097(xi)\u00E2\u0088\u0091N j=1 w \u00E2\u0088\u0097(xj) This estimate is not unbiased but it is asymptotically consistent since the squared bias is of order O(N\u00E2\u0088\u00922) and the variance is of order O(N\u00E2\u0088\u00921). Let us also define the empirical density PN (dx) = N\u00E2\u0088\u0091 i=1 W (i)\u00CE\u00B4xi(dx) When x is univariate, we can define the weighted empirical distribution function FN (x) = N\u00E2\u0088\u0091 i=1 W (i)I(xi \u00E2\u0089\u00A4 x) where I(xi \u00E2\u0089\u00A4 x) = \u00EF\u00A3\u00B1\u00EF\u00A3\u00B2\u00EF\u00A3\u00B31 if xi \u00E2\u0089\u00A4 x,0 if xi > x. 5 2.3. State-Space Models In fact, since we can set \u00CF\u0086(x) = I(xi \u00E2\u0089\u00A4 a), we have that FN (a) = N\u00E2\u0088\u0091 i=1 W (i)I(xi \u00E2\u0089\u00A4 a) is an asymptotically consistent estimator of F (a) = \u00E2\u0088\u00AB I(x \u00E2\u0089\u00A4 a)pi(x)dx ie. PN converges in distribution to pi. In the case of multivariate x, we also have convergence in distribution since PN (x \u00E2\u0088\u0088 S) = N\u00E2\u0088\u0091 i=1 W (i)I(xi \u00E2\u0088\u0088 S) is an asymptotically consistent estimator of Prpi[x \u00E2\u0088\u0088 S] = \u00E2\u0088\u00AB I(x \u00E2\u0088\u0088 S)pi(x)dx = \u00E2\u0088\u00AB S pi(x)dx 2.3 State-Space Models In this thesis, we will restrict our discussion to a particular class of models that is widely applicable and commonly associated with the sequential Monte Carlo methodology. It should be stressed that sequential Monte Carlo is not itself restricted to models of this form. Consider a time-homogeneous Markovian state-space model with hidden states {xt : t \u00E2\u0088\u0088 {0, . . . , T}}, xt \u00E2\u0088\u0088 X and observations {yt : t \u00E2\u0088\u0088 {0, . . . , T}},yt \u00E2\u0088\u0088 Y. The model is given by p(x0|\u00CE\u00B8) (initial state) p(xt|xt\u00E2\u0088\u00921, \u00CE\u00B8) for 1 \u00E2\u0089\u00A4 t \u00E2\u0089\u00A4 T (evolution) p(yt|xt, \u00CE\u00B8) for 0 \u00E2\u0089\u00A4 t \u00E2\u0089\u00A4 T (observation) The defining characteristic in such models is that the observation yt is conditionally independent of the other observations given the latent variable xt. In practice, this allows for the expression of a large number of possible systems. Intuitively, this type of model is natural if xt represents the state of the system at time t, the transition from xt to xt+1 is captured in the evolution distribution and the observation at time t depends only on the state of the system at that time. 6 2.3. State-Space Models Likelihood Evaluation For this class of models, likelihood evaluation is complicated by the presence of latent variables x0:T . A first attempt at computing p(y0:T |\u00CE\u00B8) could be to decompose this probability as p(y0:T |\u00CE\u00B8) = \u00E2\u0088\u00AB p(x0:T ,y0:T |\u00CE\u00B8)dx0:T = \u00E2\u0088\u00AB p(y0:T |x0:T , \u00CE\u00B8)p(x0:T |\u00CE\u00B8)dx0:T = \u00E2\u0088\u00AB ( T\u00E2\u0088\u008F i=0 p(yi|xi, \u00CE\u00B8) ) p(x0:T |\u00CE\u00B8)dx0:T and proceed with a Monte Carlo estimate by sampling x0:T \u00E2\u0088\u00BC p(x0:T |\u00CE\u00B8) = p(x0) \u00E2\u0088\u008FT i=1 p(xi|xi\u00E2\u0088\u00921). Unforunately, while theoretically valid, this estimate has prohibitively high variance whenever the data y0:T is informative. In general, we would like to incorporate the data when proposing samples x0:T . Using importance sampling, we could conceivably reduce the variance by sampling x0:T \u00E2\u0088\u00BC q(x0:T |y0:T , \u00CE\u00B8) since p(y0:T |\u00CE\u00B8) = \u00E2\u0088\u00AB ( T\u00E2\u0088\u008F i=0 p(yi|xi, \u00CE\u00B8) ) p(x0:T |\u00CE\u00B8) q(x0:T |y0:T , \u00CE\u00B8)q(x0:T |y0:T , \u00CE\u00B8)dx0:T However, it is difficult to come up with a high-dimensional proposal density q that is easy to sample from that would sufficiently reduce the variance. An alternative decomposition of the likelihood is attained by noticing that p(y0:T |\u00CE\u00B8) = p(y0|\u00CE\u00B8) T\u00E2\u0088\u008F k=1 p(yk|y0:k\u00E2\u0088\u00921, \u00CE\u00B8) where, dropping \u00CE\u00B8 for clarity p(yk|y0:k\u00E2\u0088\u00921) = \u00E2\u0088\u00AB p(yk,xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921)dxk\u00E2\u0088\u00921 = \u00E2\u0088\u00AB p(yk|xk\u00E2\u0088\u00921)p(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921)dxk\u00E2\u0088\u00921 = \u00E2\u0088\u00AB \u00E2\u0088\u00AB p(yk,xk|xk\u00E2\u0088\u00921)dxkp(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921)dxk\u00E2\u0088\u00921 = \u00E2\u0088\u00AB \u00E2\u0088\u00AB p(yk|xk)p(xk|xk\u00E2\u0088\u00921)dxkp(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921)dxk\u00E2\u0088\u00921 7 2.4. Sequential Monte Carlo Therefore, if we can produce samples from p(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921) we can estimate each p(yk|y0:k\u00E2\u0088\u00921) via p\u00CC\u0082(yk|y0:k\u00E2\u0088\u00921) = 1 NM N\u00E2\u0088\u0091 i=1 M\u00E2\u0088\u0091 j=1 p(yk|x(i,j)k ) (2.1) where we define x (i) k\u00E2\u0088\u00921 as the ith sample from p(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921) and x(i,j)k as the jth sample from the distribution with density p(xk|x(i)k\u00E2\u0088\u00921). We will see that sequential Monte Carlo allows us to sample approximately from p(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921). Alternatively, we can use an importance distribution q(xk|yk,xk\u00E2\u0088\u00921) within the inner integral: p(yk|y0:k\u00E2\u0088\u00921) = \u00E2\u0088\u00AB \u00E2\u0088\u00AB p(yk|xk)p(xk|xk\u00E2\u0088\u00921)dxkp(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921)dxk\u00E2\u0088\u00921 = \u00E2\u0088\u00AB \u00E2\u0088\u00AB p(yk|xk)p(xk|xk\u00E2\u0088\u00921) q(xk|yk,xk\u00E2\u0088\u00921) q(xk|yk,xk\u00E2\u0088\u00921)dxkp(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921)dxk\u00E2\u0088\u00921 yielding p\u00CC\u0082(yk|y0:k\u00E2\u0088\u00921) = 1 NM N\u00E2\u0088\u0091 i=1 M\u00E2\u0088\u0091 j=1 p(yk|x(i,j)k )p(x(i,j)k |x(i)k\u00E2\u0088\u00921) q(x (i,j) k |yk,x(i)k\u00E2\u0088\u00921) (2.2) where again x (i) k\u00E2\u0088\u00921 is the ith sample from p(xk\u00E2\u0088\u00921|y0:k\u00E2\u0088\u00921) but x(i,j)k is the jth sample from the distribution with density q(xk|yk,x(i)k\u00E2\u0088\u00921). 2.4 Sequential Monte Carlo In order to appreciate the problem of smooth likelihood estimation using particle filters, it is neces- sary to specify the particle filtering method so that we can highlight the difficulties that are faced. Details are omitted that are not directly relevant to smooth likelihood estimation for state-space models and so the reader is directed to [8] for a more general overview of sequential Monte Carlo methods. Given a target distribution p(x0:T |y0:T ), the central idea in sequential Monte Carlo algorithms is to define a sequence of intermediate target distributions p(x0|y0), p(x0:1|y0:1), . . . , p(x0:T |y0:T ) that move \u00E2\u0080\u0098smoothly\u00E2\u0080\u0099 towards the target distribution. Importantly, we can evaluate each intermediate distribution up to a normalizing constant since p(x0:t|y0:t) = p(x0:t,y0:t) p(y0:t) 8 2.4. Sequential Monte Carlo and p(x0:t,y0:t) is readily computable. We utilize an importance distribution of the form q(x0:T |y0:T ) = q(x0|y0) T\u00E2\u0088\u008F t=1 q(xt|xt\u00E2\u0088\u00921,yt) 2.4.1 Sequential Importance Sampling The generic sequential importance sampling (SIS) algorithm using these distributions is given in Algorithm 1. Algorithm 1 The generic sequential importance sampling algorithm 1. At time t = 0. \u00E2\u0080\u00A2 For i = 1, . . . , N , sample x(i)0 \u00E2\u0088\u00BC q(x0|y0) \u00E2\u0080\u00A2 For i = 1, . . . , N , evaluate the importance weights: w0(x (i) 0 ) = p(x (i) 0 ,y0) q(x (i) 0 |y0) = p(y0|x(i)0 )p(x(i)0 ) q(x (i) 0 |y0) 2. For times t = 1, . . . , T . \u00E2\u0080\u00A2 For i = 1, . . . , N , sample x(i)t \u00E2\u0088\u00BC q(xt|yt,x(i)t\u00E2\u0088\u00921) and set x(i)0:t def= (x(i)0:t\u00E2\u0088\u00921,x(i)t ) \u00E2\u0080\u00A2 For i = 1, . . . , N , evaluate the importance weights: wt(x (i) 0:t) = wt\u00E2\u0088\u00921(x (i) 0:t\u00E2\u0088\u00921) p(x (i) 0:t,y0:t) p(x (i) 0:t\u00E2\u0088\u00921,y0:t\u00E2\u0088\u00921)q(x (i) t |yt,x(i)t\u00E2\u0088\u00921) = wt\u00E2\u0088\u00921(x (i) 0:t\u00E2\u0088\u00921) p(yt|x(i)t )p(x(i)t |x(i)t\u00E2\u0088\u00921) q(x (i) t |yt,x(i)t\u00E2\u0088\u00921) We can prove by induction that the importance weights in this algorithm satisfy wt(x (i) 0:t) = p(x (i) 0:t,y0:t) q(x (i) 0:t|y0:t) where x (i) 0:t \u00E2\u0088\u00BC q(x0:t|y0:t). Indeed, we have wt(x (i) 0:t) = wt\u00E2\u0088\u00921(x (i) 0:t\u00E2\u0088\u00921) p(x (i) 0:t,y0:t) p(x (i) 0:t\u00E2\u0088\u00921,y0:t\u00E2\u0088\u00921)q(x (i) t |yt,x(i)t\u00E2\u0088\u00921) = p(x (i) 0:t\u00E2\u0088\u00921,y0:t\u00E2\u0088\u00921) q(x (i) 0:t\u00E2\u0088\u00921|y0:t\u00E2\u0088\u00921) p(x (i) 0:t,y0:t) p(x (i) 0:t\u00E2\u0088\u00921,y0:t\u00E2\u0088\u00921)q(x (i) t |yt,x(i)t\u00E2\u0088\u00921) = p(x (i) 0:t,y0:t) q(x (i) 0:t|y0:t) 9 2.4. Sequential Monte Carlo Note also that wt(x (i) 0:t) = p(x (i) 0:t,y0:t) q(x (i) 0:t|y0:t) = p(y0:t) p(x (i) 0:t|y0:t) q(x (i) 0:t|y0:t) \u00E2\u0088\u009D p(x (i) 0:t|y0:t) q(x (i) 0:t|y0:t) Since p(y0:t) is an unknown normalizing constant 1, we are forced to normalize the importance weights as in normalized importance sampling. This gives the following empirical distribution to approximate p(x0:t|y0:t): P\u00CC\u0082N (dx0:t|y0:t) = N\u00E2\u0088\u0091 i=1 W (i) t \u00CE\u00B4x(i)0:t (dx0:t) where W (i) t = wt(x (i) 0:t)\u00E2\u0088\u0091N j=1 wt(x (j) 0:t ) While theoretically valid, for practical values of N this algorithm suffers from degeneracy for even small values of t due to the unavoidable increase in variance of the importance weights over time. This means that the estimate P\u00CC\u0082N (dx0:t|y0:t) has significant mass on only a few particles, regardless of the shape of the distribution p(x0:t|y0:t), and this leads to prohibitively high variance of Monte Carlo estimates using these particles. 2.4.2 Sequential Importance Sampling/Resampling To alleviate the problem of degeneracy, we can replace intermediate weighted empirical distributions with unweighted empirical distributions such that none of the relative weights are negligible. More formally, we replace P\u00CC\u0082N (dx0:t|y0:t) = N\u00E2\u0088\u0091 i=1 W (i) t \u00CE\u00B4x(i)0:t (dx0:t) with P\u00CC\u0083N (dx0:t|y0:t) = 1 N N\u00E2\u0088\u0091 i=1 n (i) t \u00CE\u00B4x(i)0:t (dx0:t) where n (i) t \u00E2\u0088\u0088 {0, 1, . . . , N} and \u00E2\u0088\u0091N i=1 n (i) t = N . 1If we could compute this, we could evaluate p(y0:T ) = p(y0:T |\u00CE\u00B8) without using SMC. 10 2.4. Sequential Monte Carlo A replacement, or resampling, scheme is acceptable if for any function \u00CF\u0086(\u00C2\u00B7) we have E [(\u00E2\u0088\u00AB \u00CF\u0086(x0:t)P\u00CC\u0082N (dx0:t|y0:t)\u00E2\u0088\u0092 \u00E2\u0088\u00AB \u00CF\u0086(x0:t)P\u00CC\u0083N (dx0:t|y0:t) )2] N\u00E2\u0086\u0092\u00E2\u0088\u009E\u00E2\u0088\u0092\u00E2\u0086\u0092 0 since convergence of the expectation of \u00CF\u0086 with respect to the empirical density P\u00CC\u0083N to the true expectation is then guaranteed as N \u00E2\u0086\u0092\u00E2\u0088\u009E. A variety of common resampling techniques are discussed in [15]. The general idea of many re- sampling algorithms is to generate a set of N uniform random variables u1, . . . , uN (not necessarily identically distributed) and use them to sample N values of x0:t from P\u00CC\u0082N (dx0:t|y0:t) such that E[n (i) t ] = NW (i) t The variable n (i) t corresponds to the number of times x (i) 0:t is selected. It is vital to note that the resampling process is a source of flexibility within the sequential importance sampling/resampling framework. In fact, we will later see that it is this process that is responsible for the lack of smoothness in likelihood estimation and which can be modified to alleviate this problem. The sequential importance sampling algorithm augmented with a resampling (SIR) procedure is given in Algorithm 2. Although we no longer have wt(x (i) 0:t) \u00E2\u0088\u009D p(x (i) 0:t|y0:t) q(x (i) 0:t|y0:t) we know that our resulting empirical distribution still converges in distribution2 to p(x0:t|y0:t) and that Monte Carlo estimates of expectations of test statistics (such as the likelihood estimators in equations 2.1 and 2.2) converge in probability to their true values. Furthermore, at each step we have a greater diversity of particles than with the original algorithm. The problem of degeneracy is not removed: at time t the marginals p(xi|y0:t) for i << t are poorly represented. However, the marginal p(xt|y0:t) is well represented and this is the marginal that is used for likelihood estimation and the generation of particles from p(xt+1|y0:t). In practice, the resampling step is what permits the importance sampling/resampling algorithm to give reasonable estimates for practical values of N . 2This thesis is not particularly concerned with extolling the virtues of particle filters in general, and the algorithms presented are only intended to be used when SMC methods are effective. As such, more sophisticated results about the convergence of particle filters are not discussed. However, [6] provides a survey of convergence results for those who are interested. 11 2.4. Sequential Monte Carlo Algorithm 2 The generic sequential importance sampling/resampling algorithm 1. At time t = 0. \u00E2\u0080\u00A2 For i = 1, . . . , N , sample x\u00CC\u0083(i)0 \u00E2\u0088\u00BC q(x0|y0) \u00E2\u0080\u00A2 For i = 1, . . . , N , evaluate the importance weights: w0(x\u00CC\u0083 (i) 0 ) = p(x\u00CC\u0083 (i) 0 ,y0) q(x\u00CC\u0083 (i) 0 |y0) = p(y0|x\u00CC\u0083(i)0 )p(x\u00CC\u0083(i)0 ) q(x\u00CC\u0083 (i) 0 |y0) \u00E2\u0080\u00A2 For i = 1, . . . , N , normalize the importance weights: W (i) 0 = w0(x\u00CC\u0083 (i) 0 )\u00E2\u0088\u0091N j=1 w0(x\u00CC\u0083 (j) 0 ) \u00E2\u0080\u00A2 Resample (with replacement) N particles {x(i)0 : i = 1, . . . , N} from {x\u00CC\u0083(i)0 : i = 1, . . . , N} according to the importance weights {W (i)0 : i = 1, . . . , N}. Set w(i)0 = 1N for i = 1, . . . , N . 2. For times t > 0. \u00E2\u0080\u00A2 For i = 1, . . . , N , sample x\u00CC\u0083(i)t \u00E2\u0088\u00BC q(xt|yt,x(i)t\u00E2\u0088\u00921) and set x\u00CC\u0083(i)0:t def= (x(i)0:t\u00E2\u0088\u00921, x\u00CC\u0083(i)t ) \u00E2\u0080\u00A2 For i = 1, . . . , N , evaluate the importance weights: wt(x\u00CC\u0083 (i) 0:t) = w (i) t\u00E2\u0088\u00921 p(yt|x\u00CC\u0083(i)t )p(x\u00CC\u0083(i)t |x(i)t\u00E2\u0088\u00921) q(x\u00CC\u0083 (i) t |yt,x(i)t\u00E2\u0088\u00921) \u00E2\u0080\u00A2 For i = 1, . . . , N , normalize the importance weights: W (i) t = wt(x\u00CC\u0083 (i) 0:t)\u00E2\u0088\u0091N j=1 wt(x\u00CC\u0083 (j) 0:t ) \u00E2\u0080\u00A2 Resample (with replacement) N particles {x(i)0:t : i = 1, . . . , N} from {x\u00CC\u0083(i)0:t : i = 1, . . . , N} according to the importance weights {W (i)t : i = 1, . . . , N}. Set w(i)t = 1N for i = 1, . . . , N . Multinomial Resampling In resampling, we wish to replace the weighted empirical distribution P\u00CC\u0082N (dx0:t|y0:t) with an un- weighted empirical distribution P\u00CC\u0083N (dx0:t|y0:t). One of the most common resampling schemes em- ployed by practitioners is that of multinomial resampling. In this scheme, we construct the empirical cumulative distribution function F\u00CC\u0082t,N (j) = j\u00E2\u0088\u0091 i=1 W (i) t 12 2.4. Sequential Monte Carlo It is important to note that this is a cumulative distribution function where the random variable of interest is not X0:t but instead the index of the particle we wish to choose. We can sample an index according to its weight by defining the inverse of this function as F\u00CC\u0082\u00E2\u0088\u00921t,N (u) = min{j : F\u00CC\u0082t,N (j) \u00E2\u0089\u00A5 u}, generating a uniform random number on [0, 1] and evaluating the inverse at this value. If we repeat this process N times and increment n (i) t (with each n (i) t starting at 0) each time the index i is sampled we end up with a valid P\u00CC\u0083N (dx). If we denote each uniform random number as ui we can define k (i) t = F\u00CC\u0082 \u00E2\u0088\u00921 t,N (ui) and call this the selection index that is picked during resampling. Indeed, with the selection index made explicit we have x (i) 0:t = x\u00CC\u0083 (k (i) t ) 0:t The most relevant issue for the topic at hand is that particles whose indices are close are not necessarily close themselves. The computational complexity of resampling naively is in O(N2). However, one can reduce this complexity by generating pre-ordered uniform random numbers as in [1] and traversing an array of cumulative weights without backtracking to give a time complexity of O(N). An Alternative View of Resampling Instead of viewing resampling as the replacement of the weighted empirical distribution P\u00CC\u0082N (dx0:t|y0:t) with an unweighted empirical distribution P\u00CC\u0083N (dx0:t|y0:t), we can view it as the approximation of the true density p(x0:t|y0:t) with P\u00CC\u0082N (dx0:t|y0:t). In particular, in the context of likelihood evaluation in time-homogeneous Markovian state-space models we are only interested in the marginal p(xt|y0:t) which is approximated by P\u00CC\u0082N (dxt|y0:t) = N\u00E2\u0088\u0091 i=1 W (i) t \u00CE\u00B4x(i)t (dxt) The predictive distribution p(xt+1|y0:t) is thus approximated by the density p\u00CC\u0082N (xt+1|y0:t) = N\u00E2\u0088\u0091 i=1 W (i) t p(xt+1|x(i)t ) 13 2.4. Sequential Monte Carlo We can think of this density as the marginal of a joint distribution p\u00CC\u0082N (k,xt+1|y0:t) =W (k)t p(xt+1|x(k)t ) The proposal distribution used in sequential importance sampling can then be defined as qN (xt+1|x(1:N)t ,W (1:N)t ,yt+1) = N\u00E2\u0088\u0091 i=1 W (i) t q(xt+1|yt+1,x(i)t ) This density is the marginal of a joint distribution qN (k,xt+1|x(1:N)t ,W (1:N)t ,yt+1) =W (k)t q(xt+1|yt+1,x(k)t ) The resampling step then corresponds to (at each stage i) sampling the auxiliary variable k, which is the index into both the mixture representation of p\u00CC\u0082N (xt+1|y0:t) and qN (xt+1|x(1:N)t ,W (1:N)t ,yt+1). This is followed by sampling x (i) t+1 \u00E2\u0088\u00BC qN (xt+1|k,x(1:N)t ,W (1:N)t ,yt+1) = q(xt+1|x(k)t ,yt+1) and then evaluating the weight wt+1(x (i) t+1) = p(yt+1|x(i)t+1)p\u00CC\u0082N (k,x(i)t+1|y0:t) qN (k,x (i) t+1|x(1:N)t ,W (1:N)t ,yt+1) = p(yt+1|x(i)t+1)p(x(i)t+1|x(k)t ) q(x (i) t+1|yt+1,x(k)t ) It may seem strange to compute the weights using the joint distributions involving the auxiliary variable k since we are essentially sampling x (i) t+1 \u00E2\u0088\u00BC qN (xt+1|y0:t+1). However, evaluating the im- portance weights using the marginals via wt+1(x (i) t+1) = p(yt+1|x(i)t+1)p\u00CC\u0082N (x(i)t+1|y0:t) qN (x (i) t+1|x(1:N)t ,W (1:N)t ,yt+1) = p(yt+1|x(i)t+1) \u00E2\u0088\u0091N j=1W (j) t p(x (i) t+1|x(j)t )\u00E2\u0088\u0091N j=1W (j) t q(x (i) t+1|yt+1,x(j)t ) as in the marginal particle filter of [17] would take O(N) time per particle3, giving an O(N2) time complexity for weighting N particles. It is easily shown that the expectation of any test function \u00CF\u0086 is unchanged by weighting the particles using the joint distributions as opposed to the marginals. 3It is possible in some situations to speed up this computation using N -body methods, but usually this requires approximation and sometimes it provides no gains in efficiency at all for a given tolerance. 14 2.4. Sequential Monte Carlo Letting \u00CF\u0088 = (x (1:N) t ,W (1:N) t ,yt+1), we have:\u00E2\u0088\u00AB \u00E2\u0088\u00AB \u00CF\u0086(xt+1) p(yt+1|xt+1)p\u00CC\u0082N (k,xt+1|y0:t) qN (k,xt+1|\u00CF\u0088) qN (k,xt+1|\u00CF\u0088)dxt+1dk = \u00E2\u0088\u00AB \u00E2\u0088\u00AB \u00CF\u0086(xt+1) p(yt+1|xt+1)p\u00CC\u0082N (xt+1|y0:t)p\u00CC\u0082N (k|xt+1,y0:t) qN (xt+1|\u00CF\u0088)qN (k|xt+1, \u00CF\u0088) qN (xt+1|\u00CF\u0088)qN (k|xt+1, \u00CF\u0088)dxt+1dk = \u00E2\u0088\u00AB \u00CF\u0086(xt+1) p(yt+1|xt+1)p\u00CC\u0082N (xt+1|y0:t) qN (xt+1|\u00CF\u0088) qN (xt+1|\u00CF\u0088) \u00E2\u0088\u00AB p\u00CC\u0082N (k|xt+1,y0:t) qN (k|xt+1, \u00CF\u0088) qN (k|xt+1, \u00CF\u0088)dkdxt+1 = \u00E2\u0088\u00AB \u00CF\u0086(xt+1) p(yt+1|xt+1)p\u00CC\u0082N (xt+1|y0:t) qN (xt+1|\u00CF\u0088) qN (xt+1|\u00CF\u0088)dxt+1 since \u00E2\u0088\u00AB p\u00CC\u0082N (k|xt+1,y0:t)dk = 1. 2.4.3 Likelihood Evaluation It should be clear that the estimators for the likelihood given in Equations 2.1 and 2.2 are particularly suitable when particle filters are used to generate the latent variables x0:T . This is because it is the marginals p(xk|y0:k), for k = 0, . . . , T that are used at each step k and these marginals do not suffer from degeneracy in the sequential importance sampling/resampling algorithm. A computational issue associated with the evaluation of the likelihood as proposed, however, is that the repeated multiplication of terms is prone to numerical inaccuracy. A way to circumvent this problem is to evaluate instead the log-likelihood log p(y0:t). A biased estimate of this quantity is l\u00CC\u00821(y0:t) = log p\u00CC\u0082(y0) + t\u00E2\u0088\u0091 i=1 log p\u00CC\u0082(yi|y0:i\u00E2\u0088\u00921) An approximate bias correction by Taylor expansion, as done in [23] gives the improved estimate: l\u00CC\u00822(y0:t) = l\u00CC\u0083(y0) + t\u00E2\u0088\u0091 i=1 l\u00CC\u0083(yi|y0:i\u00E2\u0088\u00921) where l\u00CC\u0083(y0) = log p\u00CC\u0082(y0) + 1 2 \u00CF\u0083\u00CC\u008220 NMp\u00CC\u0082(y0)2 and l\u00CC\u0083(yk|y0:k\u00E2\u0088\u00921) = log p\u00CC\u0082(yk|y0:k\u00E2\u0088\u00921) + 1 2 \u00CF\u0083\u00CC\u00822k NMp\u00CC\u0082(yk|y0:k\u00E2\u0088\u00921)2 and \u00CF\u0083\u00CC\u008220 is the sample variance of p\u00CC\u0082(y0) and \u00CF\u0083\u00CC\u0082 2 k is the sample variance of p\u00CC\u0082(yk|y0:k\u00E2\u0088\u00921). 15 2.4. Sequential Monte Carlo Rationale Let X be an unbiased estimate of the quantity \u00C2\u00B5. Then X\u00CC\u0084, the sample mean of X with R samples, is also an unbiased estimate of \u00C2\u00B5. Furthermore, if the variance of X is \u00CF\u00832 then the variance of X\u00CC\u0084 is \u00CF\u00832 R . The second order Taylor expansion of log X\u00CC\u0084 at \u00C2\u00B5 is log X\u00CC\u0084 \u00E2\u0089\u0088 log \u00C2\u00B5+ 1 \u00C2\u00B5 (X\u00CC\u0084 \u00E2\u0088\u0092 \u00C2\u00B5)\u00E2\u0088\u0092 1 \u00C2\u00B52 (X\u00CC\u0084 \u00E2\u0088\u0092 \u00C2\u00B5)2 2 so we have E[log X\u00CC\u0084] \u00E2\u0089\u0088 log \u00C2\u00B5\u00E2\u0088\u0092 1 \u00C2\u00B52 E[(X\u00CC\u0084 \u00E2\u0088\u0092 \u00C2\u00B5)2] 2 and therefore log X\u00CC\u0084 + \u00CF\u0083\u00CC\u0082 2 2RX\u00CC\u00842 is an estimate which is good for large R. 16 Chapter 3 Smooth Likelihood Estimation 3.1 Desired Estimator Properties We are now ready to consider the smooth likelihood problem. First, we should define formally the three properties that we would like a smooth estimator to have. We would like the estimator L\u00CC\u0082N (\u00CE\u00B8) of p(y|\u00CE\u00B8) to be consistent. This means L\u00CC\u0082N (\u00CE\u00B8) P\u00E2\u0086\u0092 p(y|\u00CE\u00B8) as N \u00E2\u0086\u0092\u00E2\u0088\u009E. In addition, we would like the estimator to vary smoothly with \u00CE\u00B8 if p(y|\u00CE\u00B8) does, ie. if \u00E2\u0088\u0080\u00CE\u00B8,\u00E2\u0088\u0080\u000F > 0,\u00E2\u0088\u0083\u00CE\u00B4 > 0 s.t. \u00E2\u0088\u0080\u00CE\u00B8\u00E2\u0080\u00B2 \u00E2\u0088\u0088 B\u00CE\u00B4(\u00CE\u00B8), p(y|\u00CE\u00B8\u00E2\u0080\u00B2) \u00E2\u0088\u0088 B\u000F(p(y|\u00CE\u00B8)) then \u00E2\u0088\u0080\u00CE\u00B8,\u00E2\u0088\u0080\u000F > 0,\u00E2\u0088\u0083\u00CE\u00B4 > 0 s.t. \u00E2\u0088\u0080\u00CE\u00B8\u00E2\u0080\u00B2 \u00E2\u0088\u0088 B\u00CE\u00B4(\u00CE\u00B8), L\u00CC\u0082N (\u00CE\u00B8\u00E2\u0080\u00B2) \u00E2\u0088\u0088 B\u000F(L\u00CC\u0082N (\u00CE\u00B8)) where Br(\u00CE\u00B8) denotes the ball of radius r around \u00CE\u00B8 under some appropriate metric (eg. the metric induced by the L2-norm on Rd). Finally, we want L\u00CC\u0082N (\u00CE\u00B8) to have a computational complexity in o(TN2) and which permits practical values of N to be used for accurate estimation. It might seem like the smoothness criterion is a strange one. However, there are two reasons to believe that a smooth likelihood estimator is advantageous. First, it captures the continuous nature of the true likelihood p(y|\u00CE\u00B8) and can better facilitate likelihood maximization [23]. In particular, it leads to reduced variance estimates of the difference in likelihood for some \u00CE\u00B81 and \u00CE\u00B82: L\u00CC\u0082N (\u00CE\u00B82)\u00E2\u0088\u0092 L\u00CC\u0082N (\u00CE\u00B81) \u00E2\u0089\u0088 p(y|\u00CE\u00B82)\u00E2\u0088\u0092 p(y|\u00CE\u00B81) Indeed, assume we have two consistent estimators L\u00CC\u0083N (\u00CE\u00B8) and L\u00CC\u0082N (\u00CE\u00B8) both with variance given by 17 3.2. Common Random Numbers A(\u00CE\u00B8) N but cov(L\u00CC\u0083N (\u00CE\u00B81), L\u00CC\u0083N (\u00CE\u00B82)) = 0 and L\u00CC\u0082N (\u00CE\u00B8) is smooth as defined above. Necessarily, we have cov(L\u00CC\u0082N (\u00CE\u00B81), L\u00CC\u0082N (\u00CE\u00B82)) > 0 to satisfy smoothness when N is finite and A(\u00CE\u00B81) or A(\u00CE\u00B82) are nonzero. Then we have var(L\u00CC\u0082N (\u00CE\u00B82)\u00E2\u0088\u0092 L\u00CC\u0082N (\u00CE\u00B81)) = var(L\u00CC\u0082N (\u00CE\u00B81)) + var(L\u00CC\u0082N (\u00CE\u00B82))\u00E2\u0088\u0092 2cov(L\u00CC\u0082N (\u00CE\u00B81), L\u00CC\u0082N (\u00CE\u00B82)) < var(L\u00CC\u0082N (\u00CE\u00B81)) + var(L\u00CC\u0082N (\u00CE\u00B82)) = var(L\u00CC\u0083N (\u00CE\u00B82)\u00E2\u0088\u0092 L\u00CC\u0083N (\u00CE\u00B81)) Note that this reduction in variance does not come at the cost of consistency since the difference between two consistent estimators gives a consistent estimate of the difference between the two values they estimate. This is because the expectation of sums of random variables is equal to the sum of the expectations of those random variables, even if they are dependent. A caution at this point is that smooth estimators are not useful in all endeavours. In particular, if in a Bayesian framework one wished to estimate p(y) = \u00E2\u0088\u00AB \u00CE\u0098 p(y|\u00CE\u00B8)p(\u00CE\u00B8)d\u00CE\u00B8 by sampling {\u00CE\u00B8i}Ri=1 from a prior p(\u00CE\u00B8) and computing p\u00CC\u0082(y) = R\u00E2\u0088\u0091 i=1 L\u00CC\u0082N (\u00CE\u00B8i) then this estimate would have higher variance by the same reasoning since var(L\u00CC\u0082N (\u00CE\u00B81) + L\u00CC\u0082N (\u00CE\u00B82)) = var(L\u00CC\u0082N (\u00CE\u00B81)) + var(L\u00CC\u0082N (\u00CE\u00B82)) + 2cov(L\u00CC\u0082N (\u00CE\u00B81), L\u00CC\u0082N (\u00CE\u00B82)) For such estimates, lower variance would be attained by using negatively correlated (and hence non-smooth) estimates of the likelihood. 3.2 Common Random Numbers One way in which we can try to attain a smooth estimator of the likelihood is by using common random numbers. For example, imagine we want to compute p(y|\u00CE\u00B8) = \u00E2\u0088\u00AB p(y,x|\u00CE\u00B8)dx = \u00E2\u0088\u00AB p(y|x, \u00CE\u00B8)p(x|\u00CE\u00B8) q(x) q(x)dx 18 3.2. Common Random Numbers via the Monte Carlo estimate I\u00CC\u0082(\u00CE\u00B8) = 1 N N\u00E2\u0088\u0091 i=1 p(y|xi, \u00CE\u00B8)p(xi|\u00CE\u00B8) q(xi) where xi \u00E2\u0088\u00BC q(x), for i = 1, . . . , N . Using common random numbers xi and allowing \u00CE\u00B8 to vary, we have that I\u00CC\u0082(\u00CE\u00B8) is a continuous function in \u00CE\u00B8 as long as p(x|\u00CE\u00B8) and p(y|x, \u00CE\u00B8) are continuous in \u00CE\u00B8, since the product and quotient of continuous functions is continuous4. In practice, however, we often use proposal distributions of the form q(x|\u00CE\u00B8). In this scenario it is not possible to use samples xi that are identical. However, the utility of common random numbers can be retained by generating xi\u00E2\u0080\u0099s that vary continuously with \u00CE\u00B8. In particular, assume we can sample a random variable z from g(z) and we have a continuous (in \u00CE\u00B8) deterministic function f(z; \u00CE\u00B8) that transforms z, given \u00CE\u00B8, to a sample from a distribution q(x|\u00CE\u00B8) that we can evaluate pointwise. This means we require z \u00E2\u0088\u00BC g(\u00C2\u00B7)\u00E2\u0087\u0092 f(z; \u00CE\u00B8) \u00E2\u0088\u00BC q(\u00C2\u00B7|\u00CE\u00B8). Therefore, if we use common random numbers zi with \u00CE\u00B8 varying, we still have that I\u00CC\u0082(\u00CE\u00B8) is continuous in \u00CE\u00B8 since the composition of continuous functions is continuous. For example, imagine we want to sample random vectors from a d-dimensional multivariate Gaus- sian distribution q(x|\u00CE\u00B8) with \u00CE\u00B8 = (\u00C2\u00B5,\u00CE\u00A3) where \u00C2\u00B5 is the mean and \u00CE\u00A3 is the covariance matrix of the distribution. Furthermore, define A such that it satisfies AAT = \u00CE\u00A3. We can sample a standard multivariate Gaussian random vector z \u00E2\u0088\u0088 Rd by sampling each component from the standard uni- variate Gaussian distribution N(0, 1). Then we can set x = f(z; \u00CE\u00B8) = \u00C2\u00B5 + Az and we have that x is a sample from q(x|\u00CE\u00B8). Furthermore, since f(z; \u00CE\u00B8) is continuous in \u00CE\u00B8, this method can be used to produce \u00E2\u0080\u0098common\u00E2\u0080\u0099 random vectors from q(x|\u00CE\u00B8) by using common random numbers from N(0, 1). Indeed, if we compute p(y|\u00CE\u00B8) = \u00E2\u0088\u00AB p(y,x|\u00CE\u00B8)dx = \u00E2\u0088\u00AB p(y|x, \u00CE\u00B8)p(x|\u00CE\u00B8) q(x|\u00CE\u00B8)q(x|\u00CE\u00B8)dx via the Monte Carlo estimate I\u00CC\u0082(\u00CE\u00B8) = 1 N N\u00E2\u0088\u0091 i=1 p(y|xi, \u00CE\u00B8)p(xi|\u00CE\u00B8) q(xi|\u00CE\u00B8) where xi \u00E2\u0088\u00BC q(x|\u00CE\u00B8) for i = 1, . . . , N , we still have that I\u00CC\u0082(\u00CE\u00B8) is a continuous function in \u00CE\u00B8 since the composition of continuous functions is continuous. In addition, for this case we require continuity in x as well as \u00CE\u00B8 for p(x|\u00CE\u00B8) and p(y|x, \u00CE\u00B8), continuity in both x and \u00CE\u00B8 for q(x|\u00CE\u00B8) and the existence of a function f(z|\u00CE\u00B8) as detailed above. 4An additional requirement is that q(x) 6= 0 but this is guaranteed by sampling from q and we require q(x) > 0 whenever p(x|\u00CE\u00B8) > 0 for importance sampling to be valid anyway 19 3.2. Common Random Numbers On a practical note, when we say that we can use common random numbers or refer to their use as a \u00E2\u0080\u0098technique\u00E2\u0080\u0099, we are referring to the ability to run algorithms with varying values of \u00CE\u00B8 but using the same random numbers in all runs. On a computer, this can often be achieved by resetting the random seed of the random number generator to a fixed number at the beginning of each run. Naturally, this only applies when the number of random numbers that are sampled does not vary across runs and each one is used for the same purpose in each run. This implies, for example, that we cannot use rejection sampling in such an algorithm when the acceptance probability depends on \u00CE\u00B8. CRN and Particle Filters Now that common random numbers have been introduced and motivated, it is worth mentioning that this technique alone cannot produce a smooth likelihood estimator within the particle filter framework. First, it is worth noting that using CRN within the sequential importance sampling framework presented in Algorithm 1 and computing an estimate of the log-likelihood using the expression given in Section 2.4.3 will indeed produce a smooth estimate. However, this method is hopelessly slow as any practical5 value of N gives very high variance estimates of the likelihood. Unfortunately, the sequential importance sampling/resampling framework presented in Algorithm 2 that provides lower variance estimation simultaneously makes the likelihood estimator discontinuous, as remarked in [23]. In fact, smoothness of SIS is a result of the smoothness of the sampled latent variables {x(i)0:k}Ni=1 at every time-step k. The introduction of the resampling step in SIR introduces discontinuities in the latent variables {x(i)0:k}Ni=1. Indeed, from the multinomial sampling algorithm in Section 2.4.2, it is apparent that small changes in \u00CE\u00B8 can cause drastically different values of x0:k to be chosen. For example, if particle index j is chosen for a given \u00CE\u00B8 but particle index j+1 is chosen for \u00CE\u00B8\u00E2\u0080\u00B2 that is close to \u00CE\u00B8, there is no guarantee that x(j)0:k will be at all close to x (j+1) 0:k . Additionally, even if we could construct a resampling procedure that made such a guarantee the resulting particles would not be continuous in \u00CE\u00B8 without additional interpolation. In summary, it is a non-trivial task to produce an estimator that satisfies the properties of consis- tency, smoothness and o(N2) time complexity for practical values of N . However, it is trivial to produce an estimator that has any two of these properties. If we are willing to sacrifice smooth- ness, the standard estimate using sequential importance sampling/resampling suffices. On the other hand, if we are willing to sacrifice reasonable time complexity, the standard estimate using sequen- tial importance sampling (without resampling) suffices. It is hardly worth considering constructing an estimator that is not consistent, and it is trivial to construct one that is both fast and smooth. 5Although this estimator runs in O(N) time, the variance increases exponentially with T since SIS is just impor- tance sampling with a special proposal distribution. As such, for comparable variance to SIR, impractical values of N need to be chosen to compensate for the higher variance. 20 3.3. Related Work 3.3 Related Work The task of smooth likelihood estimation using sequential Monte Carlo methods is a relatively new endeavour, mostly because particle filters have only recently become popular. However, there has already been substantial work in this area and one method in particular is the main inspiration for the new ideas presented here. A brief survey of related advances in the field follows. 3.3.1 Smooth Likelihood Estimation with One-Dimensional State Variables In [23], a resampling scheme is proposed that, coupled with common random numbers, gives us all three estimator properties we desire in the case where the latent variables for state are one- dimensional. The idea is quite simple and follows naturally from the problem statement outlined in Section 2.4.2: that particles whose indices are close are not necessarily close themselves. Indeed, it is straightforward to guarantee that particles whose indices are close are themselves close simply by sorting the particles and reassigning their indices. More formally, assume that at stage t we have a weighted empirical distribution P\u00CC\u0082N (dx0:t|y0:t) =\u00E2\u0088\u0091N i=1W (i) t \u00CE\u00B4x(i)0:t (dx0:t). Since the models we are working with are first-order Markov in x, for the purposes of filtering and likelihood estimation we are only interested in the tth component of x0:t when resampling. As such, we can reorder the particles such that \u00E2\u0088\u0080i, j \u00E2\u0088\u0088 {1, . . . , N}, i < j \u00E2\u0087\u0092 x(i)t < x (j) t . An important result is that instead of using the empirical cumulative distribution function for indices F\u00CC\u0082t,N (j) = j\u00E2\u0088\u0091 i=1 W (i) t introduced in Section 2.4.2 we can instead use the empirical distribution function for particles F\u00CC\u0082t,N (x) = \u00E2\u0088\u0091 x (i) t \u00E2\u0089\u00A4x W (i) t The corresponding inverse of this empirical cdf is given by F\u00CC\u0082\u00E2\u0088\u00921t,N (u) = min{x(i)t : F\u00CC\u0082N (x(i)t ) \u00E2\u0089\u00A5 u} The cost of sorting each set of N particles by their last component is in O(N logN). Since this is done at each time-step of the particle filter and the cost of resampling at each step is in O(N) the time complexity of the algorithm is in O(TN logN). Of course, this method does not give continuous resampled particles without the addition of an interpolation scheme to transform the 21 3.3. Related Work sorted discrete cdf into a continuous cdf. In [23], a linear interpolation scheme is proposed to attain continuity. While interpolation introduces a bias, this bias becomes negligible as N increases. As impressive as this new particle filtering framework is, unfortunately it cannot be trivially extended to the multivariate case since only when the latent state variables are one-dimensional can we sort them in this manner and guarantee closeness. 3.3.2 Smooth Likelihood Estimation with Multi-Dimensional State Variables in O(N2) time While we often think of the empirical distribution that approximates p(x0:t|y0:t) in our discussion: P\u00CC\u0082N (dx0:t|y0:t) = N\u00E2\u0088\u0091 i=1 W (i) t \u00CE\u00B4x(i)0:t (dx0:t) which is a mixture of point masses. However, we can instead think of the empirical density that approximates p(xt+1|y0:t): p\u00CC\u0082N (xt+1|y0:t) = N\u00E2\u0088\u0091 i=1 W (i) t p(xt+1|x(i)t ) which is a mixture of densities (see Section 2.4.2). Indeed, the resampling process followed by the transition step provides samples from this distribution. Furthermore, this density is continuous when p(xt+1|xt) is continuous since it is a weighted sum of continuous densities. If we could produce smoothly varying (with {W (i)}Ni=1 and {x(i)t }Ni=1) samples from this mixture distribution we could use those samples to make a smooth likelihood estimator. Unfortunately, even if it is easy to sample smoothly from p(xt+1|xt), it is not generally feasible to sample smoothly from a mixture of such distributions when the weights vary with \u00CE\u00B8. A solution to this problem, proposed in [7], is to approximate p\u00CC\u0082N (xt+1|y0:t) with another distribution g(xt+1;\u00CE\u00B1t+1), which is easy to produce smoothly varying (in \u00CE\u00B1t+1) samples from. The density g(xt+1;\u00CE\u00B1t+1) is a member from the parametric class of densities {g(xt+1|\u00CE\u00B1t+1) : \u00CE\u00B1t+1 \u00E2\u0088\u0088 A}. In particular, the parameter \u00CE\u00B1t+1 is obtained in a way that minimizes some distributional distance between g(xt+1;\u00CE\u00B1t+1) and p\u00CC\u0082N (xt+1|y0:t). For this to be useful, it must be true that the parameter \u00CE\u00B1t+1 obtained by this minimization varies smoothly with \u00CE\u00B8. Using g(xt+1;\u00CE\u00B1t+1) as a proposal distribution, we can approximate p(yt+1|y0:t) as follows p\u00CC\u0082(yt+1|y0:t) = 1 N N\u00E2\u0088\u0091 i=1 p(yt+1|x(i)t+1) p\u00CC\u0082N (x (i) t+1|y0:t) g(x (i) t+1;\u00CE\u00B1t+1) 22 3.3. Related Work The drawback is that p\u00CC\u0082N (x (j) t+1|y0:t) = \u00E2\u0088\u0091N i=1W (i) t p(x (j) t+1|x(i)t ) takes O(N) time to compute and so p\u00CC\u0082(yt+1|y0:t) takes O(N2) time to compute6. In addition, it can be difficult to define parametric families of densities that can both be smoothly sampled from and can represent well the target distribution p(xt+1|y0:t) when this target has certain properties, eg. multiple modes. In summary, this method satisfies smoothness and consistency but requires time in O(TN2). In general, while proposal distributions that can be smoothly sampled from do provide smooth esti- mation, the evaluation of each importance weight will always take O(N) time unless some kind of cancellation occurs. 3.3.3 Smooth Likelihood Estimation Using Fixed Selection Indices An approach introduced in [5] is to run a particle filter for a given \u00CE\u00B8 = \u00CE\u00B8\u00E2\u0088\u0097 and to store the selection indices {k(i)t : i \u00E2\u0088\u0088 {1, . . . , N}, t \u00E2\u0088\u0088 {0, . . . , T}} and weights {W (k (i) t ) t,\u00CE\u00B8\u00E2\u0088\u0097 def = W (k (i) t ) t : i \u00E2\u0088\u0088 {1, . . . , N}, t \u00E2\u0088\u0088 {0, . . . , T}} (see Section 2.4.2). Then, for \u00CE\u00B8 close to \u00CE\u00B8\u00E2\u0088\u0097, the \u00E2\u0080\u0098filter\u00E2\u0080\u0099 is run such that the same indices are selected. In practice, this means that the weights of resampled particles w (i) t are no longer equal. Indeed, if \u00CE\u00B8 causes the weight of a selected particle to increase relative to the weight under \u00CE\u00B8\u00E2\u0088\u0097 then the weight will increase and if the opposite is true, the weight will decrease. In Algorithm 2 this can be seen as replacing the resampling step with: \u00E2\u0080\u00A2 For i = 1, . . . , N : \u00E2\u0080\u0093 Set x (i) 0:t = x\u00CC\u0083 (k (i) t ) 0:t \u00E2\u0080\u0093 Set w (i) t = W (k (i) t ) t W (k (i) t ) t,\u00CE\u00B8\u00E2\u0088\u0097 \u00E2\u0080\u00A2 Renormalize the weights w(i)t so \u00E2\u0088\u0091N i=1 w (i) t = 1. This method does provide smooth estimates of the likelihood when coupled with common random numbers. However, this method is restricted to a small neighbourhood around \u00CE\u00B8 as many of the weights will become negligible when the discrepancy in the weights increases. Of particular concern is the fact that without conventional resampling (based on the actual weights of the particles in question) the effective sample size of the particle \u00E2\u0080\u0098swarm\u00E2\u0080\u0099 decreases over time, which can reduce significantly the variance of the estimates. In this scenario, the high positive correlation in log- likelihood estimates can be offset by the increase in the variance of estimates themselves (see Section 3.1) when considering the variance of the estimated difference in the log-likelihood. In addition, the numerical stability of this method on a computer is not guaranteed due to the repeated use of 6As with the marginal particle filter, it may be possible to speed up this computation in some circumstances. In general, however, the complexity is O(N2) 23 3.4. Theoretical Solutions products in the algorithm. Of course, this method is appealing for use in a small neighbourhood of \u00CE\u00B8\u00E2\u0088\u0097 if T is not too large. In particular, it is used in [5] to estimate the partial derivatives of p(y0:t|\u00CE\u00B8) with respect to \u00CE\u00B8. 3.4 Theoretical Solutions This chapter is concluded with the introduction of two theoretical solutions to the problem of smooth resampling. These can both be seen as extensions of [23] to the case where the state variables are multivariate. The main idea in that work is that we can, through sorting the particles, construct an empirical cdf for xt F\u00CC\u0082t,N (z) = \u00E2\u0088\u0091 x (i) t \u00E2\u0089\u00A4z w (i) t that approximates the cdf Ft(z) = \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E p(xt|y0:t)dxt When xt is multi-dimensional, however, this no longer applies. After introducing some regular- ity conditions and some modifications to notation, we will present two solutions that inspire new approaches. 3.4.1 Regularity Conditions and Notation To deal with the case when xt \u00E2\u0088\u0088 Rd is multi-dimensional, we first simplify our notation. Instead of referring to the density p(xt|y0:t) we will drop explicit references to y0:t and the time index t. Thus from this point on the density p(x) = p(x1:d) refers implicitly to p(xt|y0:t). Using this notation, we will refer to marginal distributions using pj(xj) = \u00E2\u0088\u00AB \u00E2\u0088\u00AB p(x1:j\u00E2\u0088\u00921, xj , xj+1:d)dx1:j\u00E2\u0088\u00921dxj+1:d and p1:j(x1:j) = \u00E2\u0088\u00AB p(x1:j , xj+1:d)dxj+1:d We will also refer to conditional distributions in a limited context with pj(xj |x1:j\u00E2\u0088\u00921) = \u00E2\u0088\u00AB p(xj , xj+1:d|x1:j\u00E2\u0088\u00921)dxj+1:d 24 3.4. Theoretical Solutions Finally, we can express the cdf of x1 as F1(z) = \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E p1(x1)dx1 and the conditional cdf of xj given x1:j\u00E2\u0088\u00921 as Fj(z|x1:j\u00E2\u0088\u00921) = \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E pj(xj |x1:j\u00E2\u0088\u00921)dxj In addition to the change in notation, we restrict our attention to situations in which the set {x : p(x) > 0} is convex7. The reason for this condition is that when it is satisfied, the cdf F1(z) and the conditional cdfs Fj(z|x1:j\u00E2\u0088\u00921) are strictly monotonic for all z such that these functions are in (0, 1). As such, the quantile function, or inverse cdf, F\u00E2\u0088\u009211 is given by F\u00E2\u0088\u009211 (u) = z : F1(z) = u Similarly, the conditional quantile function F\u00E2\u0088\u00921j is given by F\u00E2\u0088\u00921j (u|x1:j\u00E2\u0088\u00921) = z : Fj(z|x1:j\u00E2\u0088\u00921) = u Note that when one of these cdfs is not strictly monotonic, there exist values of u \u00E2\u0088\u0088 (0, 1) and x1:j\u00E2\u0088\u00921 such that z above is not determined uniquely by the constraint Fj(z|x1:j\u00E2\u0088\u00921) = u. Furthermore, the cdf of any marginal distribution over any convex subset of {x : p(x) > 0} is strictly monotonic. 3.4.2 A Generalized CDF We can define a bijective generalized cdf F : Rd \u00E2\u0086\u0092 Rd, where the state variables are d-dimensional. First let us define Fj(z; z1:j\u00E2\u0088\u00921) = Fj(z|z1:j\u00E2\u0088\u00921) = \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E pj(xj |x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921)dxj The inverse of this function is F\u00E2\u0088\u00921j (u;u1:j\u00E2\u0088\u00921) = xj : Fj(xj ;F \u00E2\u0088\u00921 1 (u1), F \u00E2\u0088\u00921 2 (u2;u1), . . . , F \u00E2\u0088\u00921 j\u00E2\u0088\u00921(uj\u00E2\u0088\u00921;u1:j\u00E2\u0088\u00922)) = u = xj : Fj(xj ; z1:j\u00E2\u0088\u00921) = u = F\u00E2\u0088\u00921j (u|z1:j\u00E2\u0088\u00921) 7It is possible that this condition can be relaxed but this is sufficient for all the marginal distributions of interest to be non-zero only on a convex subset (interval) of R. 25 3.4. Theoretical Solutions where z1 = F \u00E2\u0088\u00921 1 (u1), z2 = F \u00E2\u0088\u00921 2 (u2;u1) = F \u00E2\u0088\u00921 2 (u2|z1) and in general zi = F\u00E2\u0088\u00921i (ui;u1:i\u00E2\u0088\u00921) = F\u00E2\u0088\u00921i (ui|z1:i\u00E2\u0088\u00921). With these definitions, we can define a bijection F as F (z1, z2, . . . , zd) = (F1(z1), F2(z2; z1), . . . , Fd(zd; z1:d\u00E2\u0088\u00921)) The inverse of F is given by F\u00E2\u0088\u00921(u1, u2, . . . , ud) = ( F\u00E2\u0088\u009211 (u1), F \u00E2\u0088\u00921 2 (u2;u1), . . . , F \u00E2\u0088\u00921 d (ud;u1:d\u00E2\u0088\u00921) ) While this formulation might appear daunting, the intuition is simple when the function F (z1:d) = u1:d is considered sequentially. We can view u1 as the probability that x1 \u00E2\u0089\u00A4 z1, u2 as the probability that x2 \u00E2\u0089\u00A4 z2 given that x1 = z1 and so on until ud is the probability that xd \u00E2\u0089\u00A4 zd given that x1:d\u00E2\u0088\u00921 = z1:d\u00E2\u0088\u00921. For the evaluation of the inverse F\u00E2\u0088\u00921, an algorithmic approach is presented in Algorithm 3. It should also be clear that if u is drawn uniformly from the unit hypercube of dimension d, then the output of the algorithm is a corresponding sample from the distribution p(x). This is because z1 \u00E2\u0088\u00BC p(x1), z2 \u00E2\u0088\u00BC p(x2|x1 = z1), . . . , zd \u00E2\u0088\u00BC p(xd|x1:d\u00E2\u0088\u00921 = z1:d\u00E2\u0088\u00921). Algorithm 3 An algorithm to evaluate the inverse of the generalized cdf F\u00E2\u0088\u00921(u1, . . . , ud) 1. Set z1 = F \u00E2\u0088\u00921 1 (u1) = z : F1(z) = u1 2. For j = 2, . . . , d \u00E2\u0080\u00A2 Set zj = F \u00E2\u0088\u00921 j (uj ;u1:j\u00E2\u0088\u00921) = z : Fj(uj |z1:j\u00E2\u0088\u00921) = uj 3. Output (z1, . . . , zd). 3.4.3 Median-Cuts While the generalized CDF proposed above is in theory a clean extension of the approach in [23], we will see later that there are practical issues involved in trying to approximate conditional quantiles (F\u00E2\u0088\u00921j for j > 1) directly using an empirical distribution. In addition, many of the properties of the generalized CDF are unnecessary for smooth resampling. In particular, we are looking for a function f such that if a random number u is sampled uniformly from [0, 1], f(u) returns a point sampled according to the distribution with density p(x). We require this function to be continuous in \u00CE\u00B8 (recall that this density does depend on \u00CE\u00B8) but not u and it need not have an inverse. Nevertheless, the mapping function fn described in Algorithm 4 is largely inspired by the inverse transform technique. Furthermore, we can think of f(u) as equal to unique element of limn\u00E2\u0086\u0092\u00E2\u0088\u009E fn(u) as the 26 3.4. Theoretical Solutions type of function described above since limn\u00E2\u0086\u0092\u00E2\u0088\u009E fn(u) almost surely8 returns a set containing a single point in Rd. Algorithm 4 The mapping function fn(u) 1. \u00E2\u0080\u00A2 Abusing notation, set a (1) 1 = a (1) 2 = \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 = a (1) d = \u00E2\u0088\u0092\u00E2\u0088\u009E and b (1) 1 = b (1) 2 = \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 = b (1) d = \u00E2\u0088\u009E \u00E2\u0080\u00A2 Let X (1) def = \u00E2\u0088\u008Fd i=1[a (1) i , b (1) i ] \u00E2\u0080\u00A2 Let X (1) \u00E2\u0088\u00921 def = \u00E2\u0088\u008Fd i=2[a (1) i , b (1) i ] \u00E2\u0080\u00A2 Compute t1 satisfying \u00E2\u0088\u00AB t1 a (1) 1 \u00E2\u0088\u00AB X (1) \u00E2\u0088\u00921 p(x)dx2:ddx1 = 0.5 \u00E2\u0080\u00A2 For i = 1, . . . , d - Set a (2) i = a (1) i and b (2) i = b (1) i \u00E2\u0080\u00A2 If u < 0.5, set b (2) 1 = t1 and u = 2u, else set a (2) 1 = t1 and u = 2(u\u00E2\u0088\u0092 0.5) 2. For j = 2, . . . , n \u00E2\u0080\u00A2 Let k = j mod d. If k = 0, let k = d. \u00E2\u0080\u00A2 Let X (j) def = \u00E2\u0088\u008Fd i=1[a (j) i , b (j) i ] \u00E2\u0080\u00A2 Let X (j) \u00E2\u0088\u0092k def = \u00E2\u0088\u008Fk\u00E2\u0088\u00921 i=1 [a (j) i , b (j) i ] \u00E2\u0088\u008Fd i=k+1[a (j) i , b (j) i ] \u00E2\u0080\u00A2 Compute tj satisfying 2 j\u00E2\u0088\u00921 \u00E2\u0088\u00AB tj a (j) k \u00E2\u0088\u00AB X (j) \u00E2\u0088\u0092k p(x)dx1:k\u00E2\u0088\u00921dxk+1:ddxk = 0.5 \u00E2\u0080\u00A2 For i = 1, . . . , d - Set a (j+1) i = a (j) i and b (j+1) i = b (j) i \u00E2\u0080\u00A2 If u < 0.5, set b (j+1) k = tj and u = 2u, else set a (j+1) k = tj and u = 2(u\u00E2\u0088\u0092 0.5) 3. Return X (n+1) = \u00E2\u0088\u008Fd i=1[a (j+1) i , b (j+1) i ]. Algorithm 5 Algorithm to convert a number u in [0,1] to a string of bits of length n 1. For j = 1, 2, . . . , n If u < 0.5 \u00E2\u0080\u00A2 Set sj = 0 and u = 2u. Else \u00E2\u0080\u00A2 Set sj = 1 and u = 2(u\u00E2\u0088\u0092 0.5). 3. Return s. Correctness The function fn is a mapping from [0, 1] to a region in R d. In the algorithm we essentially treat the number u as an n-bit string s of 0\u00E2\u0080\u0099s and 1\u00E2\u0080\u0099s, where each sj is independent and identically distributed 8By this we mean for almost all u \u00E2\u0088\u0088 [0, 1]. 27 3.4. Theoretical Solutions from the Bernoulli distribution with success probability 0.5 (without conditioning on u). Algorithm 5 shows how to produce such a string of arbitrary length n. It is sometimes easier to think of the function fn taking an argument s1:n when n is finite (instead of conditioning on u < 0.5 at step j, one can condition directly on sj = 0). For any n and s = s1:n \u00E2\u0088\u0088 {0, 1}n, the region X (n+1) returned satisfies\u00E2\u0088\u00AB X (n+1) p(x)dx = 1 2n by construction. This corresponds to the probability of the n-bit binary string s used to select the region. Furthermore, we can see that \u00E2\u008B\u0083 s\u00E2\u0088\u0088{0,1}n fn(s) = R d In other words, each equally probable uniform random n-bit s \u00E2\u0088\u0088 {0, 1}n maps to an equally probable (under p) region fn(s) \u00E2\u0088\u0088 Rd. In addition, the intersection of any two regions fn(s1) and fn(s2) for s1 6= s2 is a subset of Rd with dimension less than d and hence it has null Lebesgue measure in Rd. As such, there are no overlapping regions of non-zero probability for distinct s and we have \u00E2\u0088\u0091 s\u00E2\u0088\u0088{0,1}n \u00E2\u0088\u00AB fn(s) p(x)dx = 1 In sum, each of the regions that can be returned for a given n have equal probability mass 12n and cover the entire space Rd while the intersection of the interior of any two regions is empty. Given a uniform random variable u, this variable is treated as an n-bit binary string s drawn from the uniform distribution over n-bit strings (each possible string has probability 12n ) and used to select one of the regions. It should be emphasized that if we think of the 2n regions that fn can return as a partitioning 9 of Rd, the relative positions of the regions are fixed and so the relative position of the region that is picked by u (via s) is not dependent on p (although the coordinates of the region are). Take, for example, d = 2 and s1:4 = 1001. This maps s1:4 to the region [t1, t3] \u00C3\u0097 [t4, t2] with a(2)1 = t1, b (3) 2 = t2, b (4) 1 = t3 and a (5) 2 = t4. Clearly the values of ti depend on p, but the position of the partition is fixed with respect to the other partitions. This is probably best understood by looking at Figure 3.1, which shows how the function fn partitions R d for two different densities p(x). In the context where p(x) is continuous in \u00CE\u00B8, it is thus the coordinates of the orthotope X (n+1) that change as \u00CE\u00B8 varies with u fixed. In fact, for p(x) continuous in \u00CE\u00B8 and x, each tj is continuous in \u00CE\u00B8 and so the coordinates of the orthotope are continuous in \u00CE\u00B8. 9As noted above, the intersection of two regions is not always null, but the intersection of their interiors are. 28 3.4. Theoretical Solutions 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 (a) uniform 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 (b) non-uniform Figure 3.1: Median-cut partitioning of the space for p(x) uniform and p(x) non-uniform Characterizing the Output Note that for fixed u, the sequences {a(n)i } for i \u00E2\u0088\u0088 {1, . . . , d} are weakly monotone increasing. Similarly, the sequences {b(n)i } for i \u00E2\u0088\u0088 {1, . . . , d} are weakly monotone decreasing. Furthermore, for any i, j, k, a (k) i < b (j) i and b (k) i > a (j) i . It can be shown that the set {u : \u00E2\u0088\u0083i s.t. \u00E2\u0088\u0080j, b(j)i =\u00E2\u0088\u009E or a(j)i = \u00E2\u0088\u0092\u00E2\u0088\u009E} has measure zero since this corresponds to the infinite binary string representation of u having infinitely repeated 1\u00E2\u0080\u0099s or 0\u00E2\u0080\u0099s at fixed intervals. Thus {a(n)i } and {b(n)i } are almost always bounded above and below respectively. Since they are weakly monotone this implies that they converge and their limits are given by ai def = lim n\u00E2\u0086\u0092\u00E2\u0088\u009E a (n) i = sup n {a(n)i } bi def = lim n\u00E2\u0086\u0092\u00E2\u0088\u009E b (n) i = infn {b(n)i } By the nested interval theorem, we know that X \u00E2\u0088\u0097 def= lim n\u00E2\u0086\u0092\u00E2\u0088\u009E X (n+1) = d\u00E2\u0088\u008F i=1 [ai, bi] is a non-empty subset of Rd. In fact, we know that for at least one i, ai = bi. To show this, assume that ai < bi for all i. Then \u00E2\u0088\u00AB X\u00E2\u0088\u0097 p(x)dx = c where c is a constant. However, we know \u00E2\u0088\u00AB X (j) p(x)dx = 1 2j\u00E2\u0088\u00921 so there exists a j such that\u00E2\u0088\u00AB X j p(x)dx < \u00E2\u0088\u00AB X\u00E2\u0088\u0097 p(x)dx which is a contradiction since X \u00E2\u0088\u0097 \u00E2\u008A\u0086 X (j). The following shows that 29 3.4. Theoretical Solutions for every i, ai = bi. Claim: For any i \u00E2\u0088\u0088 {1, . . . , d}, ai = bi for almost all u \u00E2\u0088\u0088 [0, 1]. Proof: Consider the sequence of natural numbers defined by mk = ki for k \u00E2\u0088\u0088 N. Then consider the subsequences {a(mj)i }, {b(mj)i } and {tmj}. Note that these subsequences satisfy tmj \u00E2\u0088\u0088 (a(mj)i , b(mj)i ) b (mj+1) i = \u00EF\u00A3\u00B1\u00EF\u00A3\u00B2\u00EF\u00A3\u00B3tmj if smj = 0,b(mj)i if smj = 1 a (mj+1) i = \u00EF\u00A3\u00B1\u00EF\u00A3\u00B2\u00EF\u00A3\u00B3a (mj) i if smj = 0, tmj if smj = 1 Now assume that there exists an h > 0 such that bi \u00E2\u0088\u0092 ai \u00E2\u0089\u00A5 h. This implies that b(mj)i \u00E2\u0088\u0092 a(mj)i \u00E2\u0089\u00A5 h for all j. We already know that a (mj) i and b (mj) i are finite almost surely as j \u00E2\u0086\u0092\u00E2\u0088\u009E. Let k be the first integer such that a (mk) i and b (mk) i are both finite and let dj = b (mj) i \u00E2\u0088\u0092 a(mj)i . Then for any j \u00E2\u0089\u00A5 k we require dj \u00E2\u0089\u00A5 h. For any j, tmj \u00E2\u0088\u0088 (a(mj)i , b(mj)i ). If b(mj)i \u00E2\u0088\u0092 tmj \u00E2\u0089\u00A5 h and tmj \u00E2\u0088\u0092 a(mj)i \u00E2\u0089\u00A5 h, then regardless of the value of smj we will have dj+1 \u00E2\u0089\u00A4 dj \u00E2\u0088\u0092 h. Since dk is finite, there are only finitely many times this can occur for j > k. If, on the other hand, b (mj) i \u00E2\u0088\u0092 tmj \u00E2\u0089\u00A4 h or tmj \u00E2\u0088\u0092 a(mj)i \u00E2\u0089\u00A4 h, we require smj to be equal to 1 or 0 respectively, in order for dj+1 \u00E2\u0089\u00A5 h. Since the previous case can only occur finitely many times, this case must occur infinitely many times as there are infinitely many j > k. Note that the values of tmj depend on p(x) and s1:mj\u00E2\u0088\u00921 but not smj . In fact, smj can be thought of as drawn from the Bernoulli distribution with success probability 0.5. As such, we are requiring that an infinite number of independent random bits of s coincide with infinitely many values of tmj , which has probability zero. Hence we have that h = 0 almost surely and so ai = bi almost surely \u0003. Of course, this means that for almost all u, X \u00E2\u0088\u0097 def= lim n\u00E2\u0086\u0092\u00E2\u0088\u009E X (n+1) = d\u00E2\u0088\u008F i=1 [ai, bi] = {c} where c \u00E2\u0088\u0088 Rd and ci = supn{a(n)i } = infn{b(n)i } for i \u00E2\u0088\u0088 {1, . . . , d}. Thus we can define (for almost all u) f(u) as the single member of limn\u00E2\u0086\u0092\u00E2\u0088\u009E fn(u). 30 3.4. Theoretical Solutions 3.4.4 Practical Notes While the theoretical resampling algorithms described above would yield smoothly varying particles, they are impossible to implement in practice because they require the repeated evaluation of integrals that have no analytical solution in general. As such, in the next chapter we focus on solutions that are implementable on a computer. 31 Chapter 4 Fast and Approximately Smooth Likelihood Estimation In this chapter, we present a tree-based representation of the particles for resampling. This repre- sentation is itself flexible and can be used within a wide variety of resampling schemes. However, we restrict our discussion to three different realizations. 4.1 A Weighted Tree Structure for P\u00CC\u0082N(dx0:t|y0:t) A valid resampling scheme for P\u00CC\u0082N (dx0:t|y0:t) = \u00E2\u0088\u0091N i=1W (i) t \u00CE\u00B4x(i)0:t (dx0:t) can be constructed by parti- tioning the particles x (i) 0:t into k1 subsets where each subset is given a weight equal to the normalized sum of the weights of the particles within it. Each of the k1 sets is then further partitioned into some number of subsets where weights are again assigned so that the weight of a subset is given by the normalized sum of the weights of the particles within it. This process continues until each particle is in only one partition at the lowest depth of the tree. Let the depth of the tree be d and assume we have a d-dimensional uniform random number (u1, . . . , ud). Then we can use u1 to select the subset on the first level according to the weights of each partition, u2 to select a subset of that subset according to the weights of each of these subsets, and so on until we use some uj to select a single particle 10. Let the kth level subset containing the particle with index i be denoted Sk(i) and the sum of the weights in Sk(i) be denoted W (Sk(i)). The probability of selecting any individual particle index i 10uj+1, . . . , ud might not be needed since in this general construction the leaves of the tree do not necessarily all first appear on the dth level. 32 4.2. Splitting via the Unweighted Median (Weighted Binary Trees) is given by: Pr[select index i] = W (S1(i)) sum of all weights \u00C2\u00B7 W (S2(i)) W (S1(i)) \u00C2\u00B7 W (S3(i)) W (S2(i)) \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7 W (Sd(i)) W (Sd\u00E2\u0088\u00921(i)) = W (Sd(i)) sum of all weights = weight of particle i sum of all weights ie. the probability of selecting a particle from such a tree is proportional to its weight. It is clear that this general description allows for a wide variety of possible trees to be constructed. Our main interest is in constructing a tree such that for fixed u, small perturbations in the weights of the particles give rise to small perturbations in the selected particle. 4.2 Splitting via the Unweighted Median (Weighted Binary Trees) Let us assume we have N = 2k for some k \u00E2\u0088\u0088 N. Then one possible construction of a tree is to have each parent have two child nodes each containing half the number of particles of their parent, ie. a perfect binary tree. The depth of this tree is k+1 = log2N +1. To partition each block of particles at each level, we propose to sort the particles along one dimension, cycling through each dimension as we traverse the levels. For example, we can sort the N particles according to the first dimension of xt, then sort each subset of N/2 particles on the second level of the tree by the second dimension of xt, etc., and we return to the first dimension after sorting by the dth dimension of xt if xt is d-dimensional. The cost to construct this tree is given by: T (N) = N log(N) + 2 N 2 log( N 2 ) + 4 N 4 log( N 4 ) + \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7+ N 2 N N 2 log( N N 2 ) = Nk +N(k \u00E2\u0088\u0092 1) +N(k \u00E2\u0088\u0092 2) + \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7+N(2) +N = N(k + (k \u00E2\u0088\u0092 1) + (k \u00E2\u0088\u0092 2) + \u00C2\u00B7 \u00C2\u00B7 \u00C2\u00B7+ 2 + 1) = N k(k + 1) 2 \u00E2\u0087\u0092 T (N) \u00E2\u0088\u0088 O(N(logN)2) The cost to select a particle from this tree is in O(logN) so the cost to evaluate it N times is in O(N logN). This means the total cost of both constructing the tree and resampling N particles is 33 4.2. Splitting via the Unweighted Median (Weighted Binary Trees) in O(N(logN)2). Algorithm 6 Constructing the weighted binary tree: constructq(particles, level) 1. Set dimension r = level mod d. If r = 0, set r = d. 2. Set particle m = the median of particles in the rth dimension. 3. Create node node, empty sets left and right, and set wleft = wright = 0. 4. If level = k + 1, set node.particle = p and return node. 5. For each particle p in particles \u00E2\u0080\u0093 If the rth coordinate of p is less than or equal to the rth coordinate of m, \u00E2\u0088\u0097 Add p to the set left. \u00E2\u0088\u0097 Set wleft = wleft+weight(p). \u00E2\u0080\u0093 If the rth coordinate of p is greater than the rth coordinate of m \u00E2\u0088\u0097 Add p to the set right. \u00E2\u0088\u0097 Set wright = wright+weight(p). 6. \u00E2\u0080\u0093 Set node.wleft = wleft/(wleft + wright) \u00E2\u0080\u0093 Set node.wright = wright/(wleft + wright) 7. \u00E2\u0080\u0093 Set node.left = constructq(left, level + 1) \u00E2\u0080\u0093 Set node.right = constructq(right, level + 1) 8. Return node. Algorithm 7 Selecting a particle from the weighted binary tree: selectq(node, level, u1, . . . , ud) 1. If level = k + 1, return node.particle. 2. Set dimension j = level mod d. If j = 0, set j = d. 3. If uj < node.wleft \u00E2\u0080\u0093 Set uj = uj/node.wleft. \u00E2\u0080\u0093 Return selectq(node.left, level + 1, u1, . . . , ud) Else \u00E2\u0080\u0093 Set uj = (uj \u00E2\u0088\u0092 node.wleft)/(1\u00E2\u0088\u0092 node.wleft). \u00E2\u0080\u0093 Return selectq(node.right, level + 1, u1, . . . , ud) While the unweighted median along a particular dimension is used to split the particles into two groups, there are obviously unequal weights associated with each group except in the case where the weights of all the particles are equal. As such, splitting along the unweighted median gives rise to a weighted binary tree. While this tree might resemble the theoretical median-cuts approach in Section 3.4.3 there are important differences. First, the particles are approximately split along the median of the proposal distribution q and not p. Second, the values of the weights of each node are dependent on \u00CE\u00B8 since the weights W (i) t depend on \u00CE\u00B8. This means that while using the same u to traverse two similarly weighted trees of even continuously varying particles will often lead to 34 4.2. Splitting via the Unweighted Median (Weighted Binary Trees) selecting the same intermediate nodes, there will be times when the change in weights causes a different node to be picked (this is also what makes the selection scheme correct). Obviously, the specific level of the tree that this occurs in impacts greatly how far apart the corresponding particles will be. 4.2.1 From O(N(log N)2) to O(N log N) time To construct each level of the tree, it is not necessary to sort the particles completely. In particular, we only need to find the median particle to split the particles into two blocks of equal numbers of particles and sum their weights. There exists a deterministic algorithm to find the median in linear time (\u00E2\u0080\u009CMedian of Medians\u00E2\u0080\u009D) [2], although in practice its performance is slower than some randomized algorithms (eg. Hoare\u00E2\u0080\u0099s selection algorithm [14]) which run in expected linear time [18]. Upon discovering the median particle, the rest of the work at each level of the tree also takes linear time so we can construct the tree in O(N logN) time. This means the overall resampling step takes O(N logN) time. 4.2.2 From u \u00E2\u0088\u0088 Rk to u \u00E2\u0088\u0088 Rd It should be clear that the use of k uniform random numbers to select a single particle is unnecessary since we traverse the tree at level j by checking only if uj is less than the sum of weights in a level j subpartition. Let the level j subset reached by using u1, . . . , uj\u00E2\u0088\u00921 be denoted Bj(u1:j\u00E2\u0088\u00921). Let wj,0 = W (Bj+1(u1:j\u00E2\u0088\u00921, 0)) W (Bj(u1:j\u00E2\u0088\u00921)) and wj,1 = W (Bj+1(u1:j\u00E2\u0088\u00921, 1)) W (Bj(u1:j\u00E2\u0088\u00921)) ie. wj,i \u00E2\u0088\u009DW (Bj+1(u1:j\u00E2\u0088\u00921, i)) and \u00E2\u0088\u0091 i\u00E2\u0088\u0088{0,1} wj,i = 1. Then conditioned only on uj < wj,0, the value uj wj,0 is uniformly distributed on [0, 1). Similarly, conditioned only on uj \u00E2\u0089\u00A5 wj,0 the value uj\u00E2\u0088\u0092wj,01\u00E2\u0088\u0092wj,0 is uniformly distributed on [0, 1]. Thus we can have u1, . . . , ud given and then at every level j we set uj+d = uj wj,0 or uj+d = uj\u00E2\u0088\u0092wj,0 1\u00E2\u0088\u0092wj,0 if uj < wj,0 or uj \u00E2\u0089\u00A5 wj,0 respectively. This is not just a trick to reduce the number of uniform random variables we generate, especially since this generation is not costly. Instead, it provides a mechanism by which we ensure that when we are \u00E2\u0080\u0098near\u00E2\u0080\u0099 the boundary wj,0, the next time we choose a node after splitting along the same 35 4.2. Splitting via the Unweighted Median (Weighted Binary Trees) dimension we use a consistent value uj+d. Intuitively, imagine uj = wj,0\u00E2\u0088\u0092 \u000F. Then uj+d = 1\u00E2\u0088\u0092 \u000Fwj,0 , so the next time we choose a node after splitting along the same dimension we are highly likely to pick the second one. This is consistent with being very close to having picked the second partition at level j. Of course, this only mitigates the problem by trying to reduce the distance between selected particles when the weights change - the distance between two particles can only be strictly bounded by the size of their last common ancestor. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 uj u j+d (a) wj,0 = 0.4 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 uj u j+d (b) wj,0 = 0.7 Figure 4.1: Plots of uj+d against uj for wj,0 = 0.4 and wj,0 = 0.7 At first glance, it might appear that the manipulation of each uj to produce uj+d introduces a bias in the selection of the particles. This is not so. The fact that uj = wj,0 \u00E2\u0088\u0092 \u000F \u00E2\u0087\u0092 uj+d = 1 \u00E2\u0088\u0092 \u000Fwj,0 just shows that if we know the actual value of uj then uj+d is not random at all. However, in the algorithm the only use of uj is to check uj < wj,0 and compute uj+d. We never explicitly check if uj is by some measure close to wj,0. Thus the distribution of uj+d is always uniform on either [0, 1) or [0, 1] and the probability of selecting the left node at stage j + d is equal to wj+d,0 for u1:d drawn from the uniform distribution on [0, 1] d. Figure 4.1 shows the impact of the weight and the value of uj on the value of uj+d. Inspection of these graphs should reassure the reader that Pr[uj+d < wj+d,0|uj < wj,0] = wj+d,0. 4.2.3 Interpolation The construction of the tree could provide even smoother estimates of the likelihood if we are willing to introduce some bias by interpolating between particles at the last (few) layer(s) of the tree. One strategy is to interpolate between the last 2d particles, which for N large should be quite close. Since each node of the tree has two weighted children, we need to interpolate between two weighted points p1 and p2 with weights w and 1\u00E2\u0088\u0092 w. We first define the interpolated point: p? = c(u,w)p1 + (1\u00E2\u0088\u0092 c(u,w))p2 36 4.2. Splitting via the Unweighted Median (Weighted Binary Trees) We would like c(u,w) = 1 for u = 0, c(u,w) = 0 for u = 1, c continuous and monotonic in u and w, and Eu[c(u,w)] = w. We can satisfy these constraints with c(u,w) = \u00EF\u00A3\u00B1\u00EF\u00A3\u00B2\u00EF\u00A3\u00B3(1\u00E2\u0088\u0092 u) 1\u00E2\u0088\u0092w w if w < 12 , 1\u00E2\u0088\u0092 u w1\u00E2\u0088\u0092w if w \u00E2\u0089\u00A5 12 . We also have c(u,w)+ c(1\u00E2\u0088\u0092 u, 1\u00E2\u0088\u0092w) = 1 which means that the choice of which point is labelled p1 and which point is labelled p2 does not matter. To interpolate 2d points we can interpolate the 2d\u00E2\u0088\u00921 pairs of points on the bottom level to produce 2d\u00E2\u0088\u00922 points, then interpolate the 2d\u00E2\u0088\u00923 pairs of these points and so on, using the weights of each node. 4.2.4 Remarks A cause for concern with this method is that it arbitrarily large discontinuities can be induced even for large N . Recall that in the median-cuts algorithm we have\u00E2\u008B\u0083 s\u00E2\u0088\u0088{0,1}n fn(s) = R d For the weighted binary tree, we can define a string \u00CE\u00B21:k via \u00CE\u00B2j = \u00EF\u00A3\u00B1\u00EF\u00A3\u00B2\u00EF\u00A3\u00B30 if uj < wj,0,1 if uj \u00E2\u0089\u00A5 wj,0. Note that \u00CE\u00B21:k depends not only on u1:d but also the values of the weights of the nodes that are traversed. As such, fixed u1:d does not correspond to the same \u00CE\u00B21:k when the weights change. Let us define the set of particles in the node at level j + 1 reached by a given \u00CE\u00B21:j as bj(\u00CE\u00B21:j). Nevertheless, we have the analogous result that \u00E2\u008B\u0083 \u00CE\u00B2\u00E2\u0088\u0088{0,1}j bj(\u00CE\u00B2) = N\u00E2\u008B\u0083 i=1 {x(i)t } Furthermore, we have for any \u00CE\u00B21:i \u00E2\u0088\u0088 {0, 1}i\u00E2\u008B\u0083 \u00CE\u00B2i+1:j\u00E2\u0088\u0088{0,1}j\u00E2\u0088\u0092i bj(\u00CE\u00B21:j) = bi(\u00CE\u00B21:i) With fixed u across different runs of the filter, we can expect many \u00CE\u00B21:k to share the prefix \u00CE\u00B21:i for 37 4.3. Splitting via the Weighted Median (Unweighted Binary Trees) some i. These particles maximum distance from each other is obviously bounded by the size of the region bi(\u00CE\u00B21:i) and shrinks as i increases. However, there exist u for any change in \u00CE\u00B8 such that \u00CE\u00B21 will be different across different runs. As such, arbitrarily large distances between selected particles across different runs can occur. The use of only d-dimensional uniform random vectors is used to temper the effects of this but it cannot be escaped. The problem is exacerbated when the density of xt is very different in different regions 11 and it is made more frequent when the weights change drastically for even small changes in \u00CE\u00B8. The former is a property solely of the target density but the latter depends on the proposal density used to sample the particles. l a m b d a 0 1 0 0 1 11 00 1 0 0 0 0 0 1 0 1 0 1 0 10 1 1 1 0 0 1 1 0 1 1 1 Figure 4.2: The binary tree representation of the sets as a function of \u00CE\u00B2 There are many ways of visualizing the sets of particles as a function of \u00CE\u00B2. One such way is to view them as a partitioning of the particles similar to that presented in Figure 3.1 with the understanding that the location of the hyperplanes (in arbitrary dimension d) that separate the space is determined by the proposal distribution and not the target distribution. Alternatively, one can think of the sets as nodes in a binary tree where each set at level i is given by the binary string \u00CE\u00B20 + \u00CE\u00B21:i with \u00CE\u00B20 = \u00CE\u00BB 12 A visualization of this binary tree is given in Figure 4.2. 4.3 Splitting via the Weighted Median (Unweighted Binary Trees) An alternative to the algorithm presented in Section 4.2 is to split the particles such that the weight of each subset is always equal. The weighted median along a particular dimension is thus used to split the particles into two equally weighted sets of particles. Alternatively, one can see the binary 11If this is true, the use of u \u00E2\u0088\u0088 Rd makes little difference since trying to be close to the region the next time we split along the same dimension does not affect the large differences to the regions selected in between. 12\u00CE\u00BB is the empty string. 38 4.3. Splitting via the Weighted Median (Unweighted Binary Trees) tree that is constructed as being unweighted. Of course, the number of particles in each node is not equal. In addition, one particle must be split into two particles every time two child nodes are created so that the weight of each node is equal. However, the general construction of the tree is identical. Algorithm 8 Constructing the unweighted binary tree: constructp(particles, level) 1. Create node node, empty sets left and right, and set wleft = wright = 0. 2. If particles contains only one particle, set node.particle = p and return node. 3. Set dimension r = level mod d. If r = 0, set r = d. 4. Set particle m = the weighted median of particles in the rth dimension. 5. For each particle p in particles \u00E2\u0080\u0093 If the rth coordinate of p is less than the rth coordinate of m, \u00E2\u0088\u0097 Add p to the set left. \u00E2\u0088\u0097 Set wleft = wleft+weight(p). \u00E2\u0080\u0093 If the rth coordinate of p is greater than the rth coordinate of m \u00E2\u0088\u0097 Add p to the set right. \u00E2\u0088\u0097 Set wright = wright+weight(p). 6. Create two particles m1 and m2 identical to m. 7. \u00E2\u0080\u0093 Set weight(m1) = (wright + weight(m) - wleft) / 2 \u00E2\u0080\u0093 Set weight(m2) = weight(m) - weight(m1) \u00E2\u0080\u0093 Add m1 to the set left. \u00E2\u0080\u0093 Add m2 to the set right. 8. \u00E2\u0080\u0093 Set node.left = constructp(left, level + 1) \u00E2\u0080\u0093 Set node.right = constructp(right, level + 1) 9. Return node. Algorithm 9 Selecting a particle from the unweighted binary tree: selectp(node, level, u) 1. If node contains only one particle, return node.particle. 2. If u \u00E2\u0089\u00A4 node.wleft \u00E2\u0080\u0093 Set u = 2u. \u00E2\u0080\u0093 Return selectp(node.left, level + 1, u) Else \u00E2\u0080\u0093 Set u = 2(u\u00E2\u0088\u0092 0.5). \u00E2\u0080\u0093 Return selectp(node.right, level + 1, u) A key feature of this tree is that the same relative region is selected each time - it is the particles that make up the region that vary when the weights change. This is due to the unweighted nature of the nodes in the binary tree that is constructed. A drawback is that the imbalance in the number 39 4.3. Splitting via the Weighted Median (Unweighted Binary Trees) of particles in sibling nodes and the introduction of an extra particle at every branch makes time complexity analysis more complicated. 4.3.1 Running Time To analyze the running time, we first remark that the weighted median can also be computed in O(N) time. Given points x1, . . . , xN with associated (normalized) weights w1, . . . , wN , the weighted median is the point xi such that \u00E2\u0088\u0091 xj 128 the running time quickly blows up to levels that would generally be unmanageable in real-world applications. At the same time, N \u00E2\u0089\u00A4 128 severely limits the types of distributions that can be represented accurately. Nevertheless, the applications in [7] usually use N = 100 particles to represent p(x0:t|y0:t) for the purpose of filtering and this leads to reasonable results for the applications they consider. For estimation of the log-likelihood more particles are used since they need not be resampled. In constrast, for our tree-based methods we found that using more particles only for estimation of the log-likelihood was not as helpful as increasing the number of particles for both filtering and estimation (keeping the computation time constant). Lastly, we have shown log-likelihood plots with only one parameter changing since this allows the differences in smoothness to be seen very easily. For the two-dimensional Gaussian state-space model, Figure 5.18 shows surface maps for the log-likelihood where v11 and v12 each range from 0.5 to 1.5 with 10000 different data points in total. Figure 5.19 shows the statistical errors. 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u0092640 \u00E2\u0088\u0092635 \u00E2\u0088\u0092630 \u00E2\u0088\u0092625 \u00E2\u0088\u0092620 \u00E2\u0088\u0092615 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (a) Kalman filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u0092640 \u00E2\u0088\u0092635 \u00E2\u0088\u0092630 \u00E2\u0088\u0092625 \u00E2\u0088\u0092620 \u00E2\u0088\u0092615 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (b) weighted binary tree filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u0092640 \u00E2\u0088\u0092635 \u00E2\u0088\u0092630 \u00E2\u0088\u0092625 \u00E2\u0088\u0092620 \u00E2\u0088\u0092615 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (c) unweighted binary tree filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u0092640 \u00E2\u0088\u0092635 \u00E2\u0088\u0092630 \u00E2\u0088\u0092625 \u00E2\u0088\u0092620 \u00E2\u0088\u0092615 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (d) unweighted k-ary tree filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u0092640 \u00E2\u0088\u0092635 \u00E2\u0088\u0092630 \u00E2\u0088\u0092625 \u00E2\u0088\u0092620 \u00E2\u0088\u0092615 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (e) CRN vanilla filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u0092640 \u00E2\u0088\u0092635 \u00E2\u0088\u0092630 \u00E2\u0088\u0092625 \u00E2\u0088\u0092620 \u00E2\u0088\u0092615 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (f) vanilla filter Figure 5.18: Surface maps of the estimated log-likelihood for the 2D Gaussian state-space model as a function of v11 and v12 for the various filters. 63 5.5. Remarks 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u00924 \u00E2\u0088\u00923 \u00E2\u0088\u00922 \u00E2\u0088\u00921 0 1 2 3 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (a) weighted binary tree filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u00924 \u00E2\u0088\u00923 \u00E2\u0088\u00922 \u00E2\u0088\u00921 0 1 2 3 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (b) unweighted binary tree filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u00924 \u00E2\u0088\u00923 \u00E2\u0088\u00922 \u00E2\u0088\u00921 0 1 2 3 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (c) unweighted k-ary tree filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u00924 \u00E2\u0088\u00923 \u00E2\u0088\u00922 \u00E2\u0088\u00921 0 1 2 3 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (d) CRN vanilla filter 0.5 1 1.5 0.5 1 1.5 \u00E2\u0088\u00924 \u00E2\u0088\u00923 \u00E2\u0088\u00922 \u00E2\u0088\u00921 0 1 2 3 v11v12 lo g\u00E2\u0088\u0092 lik el ih oo d (e) vanilla filter Figure 5.19: Surface map of the estimated log-likelihood errors for the 2D Gaussian state-space model as a function of v11 and v12 using the various filters 64 Chapter 6 Analysis In this chapter we show the asymptotic properties of the unweighted binary and k-ary tree algo- rithms. To begin, we first present some results that are used in our analysis. 6.1 Preliminaries 6.1.1 Quantile Estimation In all of our methods, we have used empirical estimates of quantiles to partition the particles. Therefore, in order to understand the asymptotic properties of our methods it is necessary to show the asymptotic properties of these quantile estimates. In particular, the analysis of quantile estimates is complicated by the fact that the quantile is not a linear statistical functional. However, it has been shown that such estimates do satisfy asymptotic normality results under very weak regularity conditions. Classical Monte Carlo Estimates Assume we want to compute \u00CE\u00B1 = F\u00E2\u0088\u00921(p) where F is a c.d.f. and p \u00E2\u0088\u0088 [0, 1]. Thus \u00CE\u00B1 is the pth quantile of F . Let f(x) = F \u00E2\u0080\u00B2(x) be the p.d.f. of the distribution of interest. The classical Monte Carlo estimate of \u00CE\u00B1 with N i.i.d. samples {xi}Ni=1 from f is given by \u00CE\u00B1N = F \u00E2\u0088\u00921 N (p) def = inf{x : FN (x) \u00E2\u0089\u00A5 p} where FN is the empirical distribution function FN (x) = 1 N N\u00E2\u0088\u0091 i=1 I(xi < x) 65 6.1. Preliminaries It is well-known that \u00E2\u0088\u009A N(\u00CE\u00B1N \u00E2\u0088\u0092 \u00CE\u00B1) D\u00E2\u0086\u0092 N(0, \u00CF\u00832) as N \u00E2\u0086\u0092\u00E2\u0088\u009E, where \u00CF\u00832 = p(1\u00E2\u0088\u0092 p) f(\u00CE\u00B1)2 as long as f exists and is continuous at \u00CE\u00B1. Importance Sampling Estimates Of course, we are not typically in the situation where we can sample from f . It is shown in [16] that we can generalize this result when using an importance sampling estimate of the quantile. In particular, we now sample {xi}Ni=1 from a distribution with density given by q. The importance sampling estimate of \u00CE\u00B1 with N i.i.d. samples {xi}Ni=1 from q is again given by \u00CE\u00B1N = F \u00E2\u0088\u00921 N (p) def = inf{x : FN (x) \u00E2\u0089\u00A5 p} but FN is now the weighted empirical distribution function FN (x) = 1 N N\u00E2\u0088\u0091 i=1 I(xi < x) f(xi) q(xi) With this estimate we still have \u00E2\u0088\u009A N(\u00CE\u00B1N \u00E2\u0088\u0092 \u00CE\u00B1) D\u00E2\u0086\u0092 N(0, \u00CF\u00832) as N \u00E2\u0086\u0092\u00E2\u0088\u009E, but now \u00CF\u00832 = 1 f(\u00CE\u00B1)2 {\u00E2\u0088\u00AB \u00CE\u00B1 \u00E2\u0088\u0092\u00E2\u0088\u009E f(x)2 q(x) dx\u00E2\u0088\u0092 p2 } Normalized Importance Sampling Estimates Of course, the above variance is applicable only for estimate which uses normalized (true) densities for f and q. In SMC methods we generally only have access to unnormalized forms of these den- sities. In [11] it is shown that asymptotic normality still holds when the estimate uses normalized importance sampling, ie. FN (x) = 1\u00E2\u0088\u0091N i=1 f(xi) q(xi) N\u00E2\u0088\u0091 i=1 I(xi < x) f(xi) q(xi) in which case the estimate is unchanged by rescaling f and q by any amount. In this case, \u00E2\u0088\u009A N(\u00CE\u00B1N\u00E2\u0088\u0092 \u00CE\u00B1) D\u00E2\u0086\u0092 N(0, \u00CF\u00832) as N \u00E2\u0086\u0092\u00E2\u0088\u009E, with \u00CF\u00832 = 1 f(\u00CE\u00B1)2 {\u00E2\u0088\u00AB \u00E2\u0088\u009E \u00E2\u0088\u0092\u00E2\u0088\u009E f(x)2 q(x) [I(x \u00E2\u0089\u00A4 \u00CE\u00B1)\u00E2\u0088\u0092 p]2 dx } (6.1) 66 6.1. Preliminaries 6.1.2 The Mean Value Theorem for Integration The first mean value theorem for integration can be stated as follows: Given continuous F : [a, b]\u00E2\u0086\u0092 R and integrable, positive p : [a, b]\u00E2\u0086\u0092 R there exists a number c \u00E2\u0088\u0088 [a, b] such that \u00E2\u0088\u00AB b a F (t)p(t)dt = F (c) \u00E2\u0088\u00AB b a p(t)dt In fact, a more general version of this theorem is true for functions on multiple variables, which we will use in our analysis. Given continuous F : X \u00E2\u0086\u0092 R where X is a connected space and integrable, positive p : X \u00E2\u0086\u0092 R there exists a vector c \u00E2\u0088\u0088 X such that\u00E2\u0088\u00AB X F (t)p(t)dt = F (c) \u00E2\u0088\u00AB X p(t)dt It should be noted that orthotopes are convex and thus connected. When the function p(t) is a probability density function which is nonzero only when t \u00E2\u0088\u0088 X then\u00E2\u0088\u00AB X p(t)dt = 1 and so \u00E2\u0088\u00AB X F (t)p(t)dt = F (c) for some c \u00E2\u0088\u0088 X . 6.1.3 Taylor\u00E2\u0080\u0099s Theorem The analysis presented relies heavily on Taylor\u00E2\u0080\u0099s theorem, which we state without proof. Let f be a real-valued function on [a, b], n a positive integer, f (n\u00E2\u0088\u00921) be continuous on [a, b], f (k)(t) exists for all t \u00E2\u0088\u0088 [a, b] and k \u00E2\u0088\u0088 {1, . . . , n}. Then for any x and a in [a, b], there exists a c between x and y such that f(x) = n\u00E2\u0088\u00921\u00E2\u0088\u0091 k=1 f (k)(y) k! (x\u00E2\u0088\u0092 y)k + f (n)(c) n! (x\u00E2\u0088\u0092 y)n This is of particular interest when |f (n)(t)| is bounded for all t \u00E2\u0088\u0088 [a, b] as then we can say r(x, y) = f (n)(c) n! (x\u00E2\u0088\u0092 y)n satisfies r(x, y) ||x\u00E2\u0088\u0092 y||n\u00E2\u0088\u00921 \u00E2\u0086\u0092 0 as (x\u00E2\u0088\u0092 y)\u00E2\u0086\u0092 0 In our analysis we will use exclusively the case where n = 2, which is also called the first-order case. 67 6.1. Preliminaries This gives f(x) = f(y) + f \u00E2\u0080\u00B2(y)(x\u00E2\u0088\u0092 y) + r(x, y) The multivariate case is also needed here. For n = 2 with similar regularity conditions this is f(x) = f(y) +\u00E2\u0088\u0087f(y)T (x\u00E2\u0088\u0092 y) + r(x,y) where r(x,y) = 1 2 d\u00E2\u0088\u0091 i=1 d\u00E2\u0088\u0091 j=1 \u00E2\u0088\u00822f \u00E2\u0088\u0082xi\u00E2\u0088\u0082xj (ci,j)(xi \u00E2\u0088\u0092 yi)(xj \u00E2\u0088\u0092 yj) with each ci,j falling on the line between x and y. In particular, if all the second partial derivatives of f in a neighbourhood of y are bounded it can be shown that r(x,y)||x\u00E2\u0088\u0092y|| \u00E2\u0086\u0092 0 as (x\u00E2\u0088\u0092 y) \u00E2\u0086\u0092 0 so the remainder term is insignificant when (x\u00E2\u0088\u0092 y)\u00E2\u0086\u0092 0. 6.1.4 Quantile Derivatives We will refer numerous times to the inverse cumulative distribution function of a marginal distri- bution in our analysis. Indeed, the function F\u00E2\u0088\u00921j (u|z1:j\u00E2\u0088\u00921) is such a function. Recall that Fj(z|z1:j\u00E2\u0088\u00921) = \u00E2\u0088\u00AB Rd\u00E2\u0088\u0092j \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E p(xj , xj+1:d|x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921)dxjdxj+1:d and pj(z|z1:j\u00E2\u0088\u00921) = \u00E2\u0088\u00AB Rd\u00E2\u0088\u0092j p(xj = z, xj+1:d|x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921)dxj+1:d so that Fj(z|z1:j\u00E2\u0088\u00921) = \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E pj(xj |x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921)dxj We are also limiting our discussion to situations in which {x : p(x) > 0} is convex. This implies that the inverse cdf F\u00E2\u0088\u00921j (u|z1:j\u00E2\u0088\u00921) is unique for u \u00E2\u0088\u0088 (0, 1). F\u00E2\u0088\u00921j (u|z1:j\u00E2\u0088\u00921) = z s.t. Fj(z|z1:j\u00E2\u0088\u00921) = u 68 6.2. An approximation of the generalized cdf It is a strictly monotonic function (in u) by its definition and is thus differentiable (in u) almost everywhere with partial derivative \u00E2\u0088\u0082 \u00E2\u0088\u0082u F\u00E2\u0088\u00921j (u|z1:j\u00E2\u0088\u00921) = 1 pj(z|x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921) where z = F\u00E2\u0088\u00921j (u|z1:j\u00E2\u0088\u00921). Furthermore, for fixed u this inverse function is defined implicitly via the equation Fj(z|z1:j\u00E2\u0088\u00921)\u00E2\u0088\u0092 u = 0 If we denote this function as g(x1:j\u00E2\u0088\u00921), it satisfies Fj(g(z1:j\u00E2\u0088\u00921)|z1:j\u00E2\u0088\u00921)\u00E2\u0088\u0092 u = 0 By the implicit function theorem19, the partial derivatives of g are given by \u00E2\u0088\u0082g \u00E2\u0088\u0082xi (z1:j\u00E2\u0088\u00921) = \u00E2\u0088\u0092 \u00E2\u0088\u0082F (z|z1:j\u00E2\u0088\u00921)/\u00E2\u0088\u0082xi \u00E2\u0088\u0082F (z|z1:j\u00E2\u0088\u00921)/\u00E2\u0088\u0082xj = \u00E2\u0088\u0092 \u00E2\u0088\u0082 \u00E2\u0088\u0082xi \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E pj(xj |x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921)dxj pj(z|x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921) = \u00E2\u0088\u0092 \u00E2\u0088\u00AB z \u00E2\u0088\u0092\u00E2\u0088\u009E \u00E2\u0088\u0082 \u00E2\u0088\u0082xi pj(xj |x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921)dxj pj(z|x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921) as long as pj(xj |x1:j\u00E2\u0088\u00921) is differentiable for all xj when x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921 and the partial derivatives \u00E2\u0088\u0082 \u00E2\u0088\u0082xi pj(xj |z1:j\u00E2\u0088\u00921) are integrable. One can similarly express \u00E2\u0088\u0082 2g \u00E2\u0088\u0082xi\u00E2\u0088\u0082xk (z1:j\u00E2\u0088\u00921) if p is twice differentiable for all xj when x1:j\u00E2\u0088\u00921 = z1:j\u00E2\u0088\u00921 and these derivatives are integrable as well. 6.2 An approximation of the generalized cdf Before moving to an analysis of the unweighted k-ary and binary tree-based algorithms, we first analyse a simpler algorithm that is a more direct (though intractable) approximation to Algorithm 3. This approximation algorithm is described in Algorithm 12. For our analysis, let us define \u00C2\u00B5 = F\u00E2\u0088\u00921(u), ie. \u00C2\u00B5 is the true value of the inverse of the generalized cdf at u = u1:d. For our proof we will introduce a new set of functions of the form Gj : R j \u00E2\u0086\u0092 R to simplify notation. 19Note that dF = 0 = \u00E2\u0088\u0091j i=0 \u00E2\u0088\u0082xiFdxi and solve for dxj dxi with all other xk constant. See [22] for more details. 69 6.2. An approximation of the generalized cdf Algorithm 12 An algorithm to approximate the inverse of the generalized cdf F\u00E2\u0088\u00921(u1, . . . , ud) 1. \u00E2\u0080\u00A2 Sample N1 points {x (i))}Ni=1 from proposal distribution q(x). \u00E2\u0080\u00A2 For i = 1, . . . , N , evaluate the importance weights: w1(x (i)) = p(x(i)) q(x(i)) \u00E2\u0080\u00A2 Normalize the importance weights W (i) 1 = w1(x (i))\u00E2\u0088\u0091N1 k=1 w1(x (k)) \u00E2\u0080\u00A2 Define x\u00CC\u00821 def = inf{z : F\u00CC\u00821(z) \u00E2\u0089\u00A5 u1} where F\u00CC\u00821(z) = N1\u00E2\u0088\u0091 i=1 W (i) 1 I(x (i) 1 \u00E2\u0089\u00A4 z) 2. For j = 2, . . . , d. \u00E2\u0080\u00A2 Sample Nj points {x (i) j:d} Nj i=1 from proposal distribution q(xj:d|x\u00CC\u00821:j\u00E2\u0088\u00921) and set x (i) = (x\u00CC\u00821:j\u00E2\u0088\u00921, x (i) j:d). \u00E2\u0080\u00A2 For i = 1, . . . , Nj , evaluate the importance weights: wj(x (i)) = p(x(i)) q(x(i)) \u00E2\u0080\u00A2 Normalize the importance weights W (i) j = wj(x (i))\u00E2\u0088\u0091Nj k=1 wj(x (k)) \u00E2\u0080\u00A2 Define x\u00CC\u0082j def = inf{z : F\u00CC\u0082j(z|x\u00CC\u00821:j\u00E2\u0088\u00921) \u00E2\u0089\u00A5 uj} where F\u00CC\u0082j(z; x\u00CC\u00821:j\u00E2\u0088\u00921) = Nj\u00E2\u0088\u0091 i=1 W (i) j I(x (i) j \u00E2\u0089\u00A4 z) and set x\u00CC\u00821:j = (x\u00CC\u00821:j\u00E2\u0088\u00921, x\u00CC\u0082j) 3. Output x\u00CC\u00821:d. 70 6.2. An approximation of the generalized cdf In particular, we define Gj(uj ;x1:j\u00E2\u0088\u00921) as follows: Gj(uj ;x1:j\u00E2\u0088\u00921) = inf{xj : Fj(xj |x1:j\u00E2\u0088\u00921) \u00E2\u0089\u00A5 uj} = xj : Fj(xj |x1:j\u00E2\u0088\u00921) = uj = F\u00E2\u0088\u00921j (uj |x1:j\u00E2\u0088\u00921) Claim: Let Nj = cjN for j \u00E2\u0088\u0088 {1, . . . , d} and u \u00E2\u0088\u0088 [0, 1]d be given. Define \u00C2\u00B5 = F\u00E2\u0088\u00921(u). Assume \u00E2\u0088\u00822Gj(u;x1:j\u00E2\u0088\u00921) \u00E2\u0088\u0082xi\u00E2\u0088\u0082xk is bounded for all i, k < j in a neighbourhood of \u00C2\u00B51:j\u00E2\u0088\u00921 and all j \u00E2\u0088\u0088 {1, . . . , d}. Also, assume the set {x : p(x) > 0} is convex. Then as N \u00E2\u0086\u0092 \u00E2\u0088\u009E, \u00E2\u0088\u009AN(x\u00CC\u0082 \u00E2\u0088\u0092 \u00C2\u00B5) D\u00E2\u0086\u0092 N(0,\u00CE\u00A3) for some \u00CE\u00A3 independent of N . Proof: We know x\u00CC\u00821 is asymptotically distributed as N(\u00C2\u00B51, \u00CF\u008321 N1 ) where \u00CF\u00831 is of the form given in Equation 6.1. Let v1 = x\u00CC\u00821 \u00E2\u0088\u0092 \u00C2\u00B51 D\u00E2\u0086\u0092 N(0, \u00CF\u0083 2 1 N1 ). We also know x\u00CC\u00822 is distributed as N(\u00C2\u00B5\u00CC\u00822, \u00CF\u008322 N2 ), where \u00C2\u00B5\u00CC\u00822 def = G2(u2; x\u00CC\u00821) and \u00CF\u00832 is of the form given in Equation 6.1. A Taylor expansion of G2 around \u00C2\u00B51 gives \u00C2\u00B5\u00CC\u00822 = G2(u2; x\u00CC\u00821) = G2(u2;\u00C2\u00B51) + \u00E2\u0088\u0082Gj(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 (x\u00CC\u00821 \u00E2\u0088\u0092 \u00C2\u00B51) + r2(x\u00CC\u00821, \u00C2\u00B51) \u00E2\u0089\u0088 \u00C2\u00B52 + \u00E2\u0088\u0082Gj(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 v1 If we let x\u00CC\u00822 \u00E2\u0088\u0092 \u00C2\u00B5\u00CC\u00822 = v2 D\u00E2\u0086\u0092 N(0, \u00CF\u0083 2 2 N2 ), we have asymptotically x\u00CC\u00822 = \u00C2\u00B52 + \u00E2\u0088\u0082G2(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 v1 + v2 so [ x\u00CC\u00821 x\u00CC\u00822 ] = [ \u00C2\u00B51 \u00C2\u00B52 ] + [ 1 0 \u00E2\u0088\u0082G2(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 1 ][ v1 v2 ] For j = 3 we have x\u00CC\u00823 is distributed as N(\u00C2\u00B5\u00CC\u00823, \u00CF\u008323 N3 ), where \u00C2\u00B5\u00CC\u00823 def = G3(u3; x\u00CC\u00821:2) and \u00CF\u00833 is of the form given in Equation 6.1. 71 6.2. An approximation of the generalized cdf A Taylor expansion of G3 around \u00C2\u00B51:2 gives \u00C2\u00B5\u00CC\u00823 = G3(u3; x\u00CC\u00821:2) = G3(u3;\u00C2\u00B51:2) + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x1 (x\u00CC\u00821 \u00E2\u0088\u0092 \u00C2\u00B51) + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x2 (x\u00CC\u00822 \u00E2\u0088\u0092 \u00C2\u00B52) + r3(x\u00CC\u00821:2, \u00C2\u00B51:2) \u00E2\u0089\u0088 \u00C2\u00B53 + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x1 v1 + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x2 [ \u00E2\u0088\u0082G2(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 v1 + v2 ] If we let x\u00CC\u00823 \u00E2\u0088\u0092 \u00C2\u00B5\u00CC\u00823 = v3 D\u00E2\u0086\u0092 N(0, \u00CF\u0083 2 3 N3 ), we have asymptotically x\u00CC\u00823 = \u00C2\u00B53 + [ \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x1 + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x2 \u00E2\u0088\u0082G2(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 ] v1 + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x2 v2 + v3 so \u00EF\u00A3\u00AE\u00EF\u00A3\u00AF\u00EF\u00A3\u00B0 x\u00CC\u00821x\u00CC\u00822 x\u00CC\u00823 \u00EF\u00A3\u00B9\u00EF\u00A3\u00BA\u00EF\u00A3\u00BB = \u00EF\u00A3\u00AE\u00EF\u00A3\u00AF\u00EF\u00A3\u00B0 \u00C2\u00B51\u00C2\u00B52 \u00C2\u00B53 \u00EF\u00A3\u00B9\u00EF\u00A3\u00BA\u00EF\u00A3\u00BB+ \u00EF\u00A3\u00AE\u00EF\u00A3\u00AF\u00EF\u00A3\u00B0 1 0 0\u00E2\u0088\u0082G2(u2;\u00C2\u00B51)\u00E2\u0088\u0082x1 1 0 \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x1 + \u00E2\u0088\u0082G3(u3;\u00C2\u00B51:2) \u00E2\u0088\u0082x2 \u00E2\u0088\u0082G2(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 \u00E2\u0088\u0082G3(u3;\u00C2\u00B52) \u00E2\u0088\u0082x2 1 \u00EF\u00A3\u00B9\u00EF\u00A3\u00BA\u00EF\u00A3\u00BB \u00EF\u00A3\u00AE\u00EF\u00A3\u00AF\u00EF\u00A3\u00B0 v1v2 v3 \u00EF\u00A3\u00B9\u00EF\u00A3\u00BA\u00EF\u00A3\u00BB In general, we have x\u00CC\u0082j is distributed as N(\u00C2\u00B5\u00CC\u0082j , \u00CF\u00832j Nj ), where \u00C2\u00B5\u00CC\u0082j def = Gj(uj ; x\u00CC\u00821:j\u00E2\u0088\u00921) and \u00CF\u0083j is of the form given in Equation 6.1. A Taylor expansion of Gj around \u00C2\u00B51:j\u00E2\u0088\u00921 gives \u00C2\u00B5\u00CC\u0082j = Gj(uj ; x\u00CC\u00821:j\u00E2\u0088\u00921) = Gj(uj ;\u00C2\u00B51:j\u00E2\u0088\u00921) + j\u00E2\u0088\u00921\u00E2\u0088\u0091 i=1 \u00E2\u0088\u0082Gj(uj ;\u00C2\u00B51:j\u00E2\u0088\u00921) \u00E2\u0088\u0082xi (x\u00CC\u0082i \u00E2\u0088\u0092 \u00C2\u00B5i) + rj(x\u00CC\u00821:j\u00E2\u0088\u00921, \u00C2\u00B51:j\u00E2\u0088\u00921) \u00E2\u0089\u0088 \u00C2\u00B5j + j\u00E2\u0088\u00921\u00E2\u0088\u0091 i=1 \u00E2\u0088\u0082Gj(uj ;\u00C2\u00B51:j\u00E2\u0088\u00921) \u00E2\u0088\u0082xi (x\u00CC\u0082i \u00E2\u0088\u0092 \u00C2\u00B5i) Note that the terms x\u00CC\u0082i \u00E2\u0088\u0092 \u00C2\u00B5i get progressively more complex in terms of v1:i as i increases. For example, x\u00CC\u00824 \u00E2\u0088\u0092 \u00C2\u00B54 has the form x\u00CC\u00824 \u00E2\u0088\u0092 \u00C2\u00B54 = [ \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x1 + \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x2 \u00E2\u0088\u0082G2 \u00E2\u0088\u0082x1 + \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x3 \u00E2\u0088\u0082G3 \u00E2\u0088\u0082x1 + \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x3 \u00E2\u0088\u0082G3 \u00E2\u0088\u0082x2 \u00E2\u0088\u0082G2 \u00E2\u0088\u0082x1 ] v1 + [ \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x2 + \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x3 \u00E2\u0088\u0082G3 \u00E2\u0088\u0082x2 ] v2 + \u00E2\u0088\u0082G4 \u00E2\u0088\u0082x3 v3 + v4 Nevertheless, each x\u00CC\u0082i \u00E2\u0088\u0092 \u00C2\u00B5i can be expressed as a linear combination of vj \u00E2\u0080\u0099s (with j \u00E2\u0089\u00A4 i) since every partial derivative of Gk is evaluated at (uk;\u00C2\u00B51:k\u00E2\u0088\u00921). If we let \u00CE\u00BBi,j denote the coefficient of vj in the expression x\u00CC\u0082i \u00E2\u0088\u0092 \u00C2\u00B5i = \u00E2\u0088\u0091i j=1 \u00CE\u00BBi,jvj then we can write x\u00CC\u0082 = \u00C2\u00B5 + \u00CE\u009Bv with the matrix \u00CE\u009B defined by \u00CE\u009B(i, j) = \u00CE\u00BBi,j . 72 6.3. The Unweighted k-ary Tree Note that each vj \u00E2\u0088\u00BC N(0, \u00CF\u0083 2 j Nj ) = N(0, \u00CF\u00832j cjN ) As such, we can write x\u00CC\u0082 = \u00C2\u00B5 + 1\u00E2\u0088\u009A N Az where the matrix A is defined by A(i, j) = \u00E2\u0088\u009A cj \u00CF\u0083j \u00CE\u00BBi,j and z \u00E2\u0088\u00BC N(0, I). In other words, as N \u00E2\u0086\u0092 \u00E2\u0088\u009E,\u00E2\u0088\u009A N(x\u00CC\u0082\u00E2\u0088\u0092 \u00C2\u00B5) D\u00E2\u0086\u0092 N(0, AAT ) where \u00CE\u00A3 def= AAT is independent of N \u0003. 6.3 The Unweighted k-ary Tree Having proven asymptotic normality of the sampled points using direct approximation to Algorithm 3, we turn to the k-ary tree sampling scheme, detailed in Section 4.4. This algorithm draws its efficiency gains from the use of partitions as opposed to sampling many vectors with identical components. The drawback is that it is more complex to analyze. For our proof we will introduce a new set of functions of the form Hj : R 2j\u00E2\u0088\u00921 \u00E2\u0086\u0092 R to simplify notation. In particular, we define Hj(u;x1:j\u00E2\u0088\u00921) as follows: Hj(u;x1:j\u00E2\u0088\u00921, x1:j\u00E2\u0088\u00921) = inf{xj : \u00E2\u0088\u00AB xj \u00E2\u0088\u0092\u00E2\u0088\u009E pj ( t \u00E2\u0088\u00A3\u00E2\u0088\u00A3\u00E2\u0088\u00A3\u00E2\u0088\u00A3\u00E2\u0088\u00A3x1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 [xi, xi] ) dt \u00E2\u0089\u00A5 u} = xj : Fj(xj |x1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 [xi, xi]) = u = F\u00E2\u0088\u00921j (u|x1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 [xi, xi]) where pj ( xj \u00E2\u0088\u00A3\u00E2\u0088\u00A3\u00E2\u0088\u00A3\u00E2\u0088\u00A3\u00E2\u0088\u00A3x1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 [xi, xi] ) = \u00E2\u0088\u00AB\u00E2\u0088\u008F j\u00E2\u0088\u00921 i=1 [xi,xi] p(x1:j)dx1:j\u00E2\u0088\u00921\u00E2\u0088\u00AB\u00E2\u0088\u008F j\u00E2\u0088\u00921 i=1 [xi,xi] p(x1:j\u00E2\u0088\u00921)dx1:j\u00E2\u0088\u00921 Note that we can use the mean value theorem to show that Hj(u;x1:j\u00E2\u0088\u00921, x1:j\u00E2\u0088\u00921) = F\u00E2\u0088\u00921j (u|c1:j\u00E2\u0088\u00921) 73 6.3. The Unweighted k-ary Tree for some c1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 I = \u00E2\u0088\u008Fj\u00E2\u0088\u00921 i=1 [xi, xi]:\u00E2\u0088\u00AB xj \u00E2\u0088\u0092\u00E2\u0088\u009E pj (t|x1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 I) dt = u \u00E2\u0087\u0092 \u00E2\u0088\u00AB xj \u00E2\u0088\u0092\u00E2\u0088\u009E \u00E2\u0088\u00AB I p(x1:j\u00E2\u0088\u00921, t)dx1:j\u00E2\u0088\u00921dt\u00E2\u0088\u00AB I p(x1:j\u00E2\u0088\u00921)dx1:j\u00E2\u0088\u00921 = u \u00E2\u0087\u0092 \u00E2\u0088\u00AB xj \u00E2\u0088\u0092\u00E2\u0088\u009E \u00E2\u0088\u00AB I p(t|x1:j\u00E2\u0088\u00921)p(x1:j\u00E2\u0088\u00921)dx1:j\u00E2\u0088\u00921dt\u00E2\u0088\u00AB I p(x1:j\u00E2\u0088\u00921)dx1:j\u00E2\u0088\u00921 = u \u00E2\u0087\u0092 \u00E2\u0088\u00AB xj \u00E2\u0088\u0092\u00E2\u0088\u009E p(t|c1:j\u00E2\u0088\u00921)dt \u00E2\u0088\u00AB I p(x1:j\u00E2\u0088\u00921)dx1:j\u00E2\u0088\u00921\u00E2\u0088\u00AB I p(x1:j\u00E2\u0088\u00921)dx1:j\u00E2\u0088\u00921 = u \u00E2\u0087\u0092 \u00E2\u0088\u00AB xj \u00E2\u0088\u0092\u00E2\u0088\u009E p(t|c1:j\u00E2\u0088\u00921)dt = u \u00E2\u0087\u0092Hj(u;x1:j\u00E2\u0088\u00921, x1:j\u00E2\u0088\u00921) = F\u00E2\u0088\u00921j (u|c1:j\u00E2\u0088\u00921) Recall that Gj(u; c1:j\u00E2\u0088\u00921) = F\u00E2\u0088\u00921j (u|c1:j\u00E2\u0088\u00921) so we have Hj(u;x1:j\u00E2\u0088\u00921, x1:j\u00E2\u0088\u00921) = Gj(u; c1:j\u00E2\u0088\u00921). Claim: Let N = kd and let u \u00E2\u0088\u0088 [0, 1]d be given. Define \u00C2\u00B5 = F\u00E2\u0088\u00921(u). Assume \u00E2\u0088\u00822Gj(u;x1:j\u00E2\u0088\u00921) \u00E2\u0088\u0082xi\u00E2\u0088\u0082xk is bounded for all i, k < j in a neighbourhood of \u00C2\u00B51:j\u00E2\u0088\u00921 and all j \u00E2\u0088\u0088 {1, . . . , d}. Also, assume the set {x : p(x) > 0} is convex. Then as k \u00E2\u0086\u0092\u00E2\u0088\u009E, x\u00CC\u0082 P\u00E2\u0086\u0092 \u00C2\u00B5. Proof: Let uj = dkuje\u00E2\u0088\u00921 k and uj = dkuje k . We denote the number of particles at each stage of the algorithm as N1, . . . , Nd. Note first that N1, . . . , Nd are all related. We will prove later that each of these values increases as k \u00E2\u0086\u0092\u00E2\u0088\u009E. In Algorithm 11, the first subset that is chosen is the dku1eth subset on the second level. Let us define x1 and x1 as the left and right endpoints of the selected subset. Since u1 converges (this is not a random variable) to u1, sample quantiles are asymptotically unbiased and their asymptotic variance decreases linearly in N , we know that x1 P\u00E2\u0086\u0092 \u00C2\u00B51 (one can pick N large enough such that the mean of the estimate is close enough to \u00C2\u00B51 and the variance is small enough for a Chebyshev inequality to show convergence). Similarly, x1 P\u00E2\u0086\u0092 \u00C2\u00B51. It doesn\u00E2\u0080\u0099t matter that x1 and x1 are dependent since two variables that converge in probability marginally must also converge in probability jointly. Now define c (2) 1 such that is satisfies G2(u2; c (2) 1 ) = H2(u2;x1, x1). Define \u00C2\u00B5\u00CC\u00822 = G2(u2; c (2) 1 ). By the same reasoning as above, the left and right endpoints of the second partition x2 and x2 satisfy x2 P\u00E2\u0086\u0092 \u00C2\u00B5\u00CC\u00822 and x2 P\u00E2\u0086\u0092 \u00C2\u00B5\u00CC\u00822 (as long as N2 \u00E2\u0086\u0092\u00E2\u0088\u009E as k \u00E2\u0086\u0092\u00E2\u0088\u009E) where \u00C2\u00B5\u00CC\u00822 = \u00C2\u00B52 + \u00E2\u0088\u0082G2(u2;\u00C2\u00B51) \u00E2\u0088\u0082x1 (c (2) 1 \u00E2\u0088\u0092 \u00C2\u00B51) + r2(c(2)1 , \u00C2\u00B51) Since c (2) 1 \u00E2\u0088\u0088 [x1, x1] with x1 P\u00E2\u0086\u0092 \u00C2\u00B51 and x1 P\u00E2\u0086\u0092 \u00C2\u00B51, c(2)1 P\u00E2\u0086\u0092 \u00C2\u00B51. Therefore x2 P\u00E2\u0086\u0092 \u00C2\u00B52 and x2 P\u00E2\u0086\u0092 \u00C2\u00B52. 74 6.3. The Unweighted k-ary Tree In general we have c (j) 1:j\u00E2\u0088\u00921 \u00E2\u0088\u0088 \u00E2\u0088\u008Fj\u00E2\u0088\u00921 i=1 [xi, xi] satisfying Gj(uj ; c (j) 1:j\u00E2\u0088\u00921) = Hj(uj ;x1:j\u00E2\u0088\u00921, x1:j\u00E2\u0088\u00921) and \u00C2\u00B5\u00CC\u0082j \u00E2\u0089\u0088 \u00C2\u00B5j + j\u00E2\u0088\u00921\u00E2\u0088\u0091 i=1 \u00E2\u0088\u0082Gj(uj ;\u00C2\u00B51:j\u00E2\u0088\u00921) \u00E2\u0088\u0082xi (c (j) i \u00E2\u0088\u0092 \u00C2\u00B5i) where c (j) 1:j\u00E2\u0088\u00921 P\u00E2\u0086\u0092 \u00C2\u00B51:j\u00E2\u0088\u00921 so xj P\u00E2\u0086\u0092 \u00C2\u00B5j and xj P\u00E2\u0086\u0092 \u00C2\u00B5j as long as Nj \u00E2\u0086\u0092 \u00E2\u0088\u009E as k \u00E2\u0086\u0092 \u00E2\u0088\u009E. Thus we can be assured that x\u00CC\u00821:d\u00E2\u0088\u00921 P\u00E2\u0086\u0092 \u00C2\u00B51:d\u00E2\u0088\u00921 for any x\u00CC\u00821:d\u00E2\u0088\u00921 \u00E2\u0088\u0088 \u00E2\u0088\u008Fd\u00E2\u0088\u00921 i=1 [xi, xi]. We do not compute xd and xd. Instead, we set x\u00CC\u0082d to be the sample conditional ud-quantile of the remaining particles in the dth dimension. We have some c (d) 1:d\u00E2\u0088\u00921 \u00E2\u0088\u0088 \u00E2\u0088\u008Fd\u00E2\u0088\u00921 i=1 [xi, xi] satisfying Gd(ud; c (d) 1:d\u00E2\u0088\u00921) = Hd(ud;x1:d\u00E2\u0088\u00921, x1:d\u00E2\u0088\u00921), with c (j) 1:j\u00E2\u0088\u00921 P\u00E2\u0086\u0092 \u00C2\u00B51:d\u00E2\u0088\u00921. Let \u00C2\u00B5\u00CC\u0082d = Gd(ud; c(d)1:d\u00E2\u0088\u00921). Then\u00E2\u0088\u009A Nd(x\u00CC\u0082d \u00E2\u0088\u0092 \u00C2\u00B5\u00CC\u0082d) D\u00E2\u0086\u0092 N(0, \u00CF\u00832d) with \u00C2\u00B5\u00CC\u0082d \u00E2\u0089\u0088 \u00C2\u00B5d + d\u00E2\u0088\u00921\u00E2\u0088\u0091 i=1 \u00E2\u0088\u0082Gd(ud;\u00C2\u00B51:j\u00E2\u0088\u00921) \u00E2\u0088\u0082xi (c (d) i \u00E2\u0088\u0092 \u00C2\u00B5i) and \u00CF\u0083d is of the form given in Equation 6.1. Therefore \u00C2\u00B5\u00CC\u0082d P\u00E2\u0086\u0092 \u00C2\u00B5d and so x\u00CC\u0082d P\u00E2\u0086\u0092 \u00C2\u00B5d. All that remains to show is that as k \u00E2\u0086\u0092 \u00E2\u0088\u009E, Nj \u00E2\u0086\u0092 \u00E2\u0088\u009E for j \u00E2\u0088\u0088 {1, . . . , d}. Let us denote Nj as a function of k by Nj(k). Recall that q(x) is the proposal distribution used to sample the particles {x(i)}. Note first that N(k) = kd and N1(k) = N(k). N2(k)\u00E2\u0086\u0092 N1(k) \u00E2\u0088\u00AB F\u00E2\u0088\u009211 (u1) F\u00E2\u0088\u009211 (u1) q(x1)dx1. We have that F\u00E2\u0088\u009211 (u1) = F \u00E2\u0088\u00921 1 (u1) + \u00E2\u0088\u0082F\u00E2\u0088\u009211 (u1) \u00E2\u0088\u0082u (u1 \u00E2\u0088\u0092 u1) + r1(u1, u1) and F\u00E2\u0088\u009211 (u1) = F \u00E2\u0088\u00921 1 (u1) + \u00E2\u0088\u0082F\u00E2\u0088\u009211 (u1) \u00E2\u0088\u0082u (u1 \u00E2\u0088\u0092 u1) + r1(u1, u1) which implies F\u00E2\u0088\u009211 (u1)\u00E2\u0088\u0092 F\u00E2\u0088\u009211 (u1) \u00E2\u0089\u0088 \u00E2\u0088\u0082F\u00E2\u0088\u009211 (u1) \u00E2\u0088\u0082u (u1 \u00E2\u0088\u0092 u1) = 1 k \u00E2\u0088\u0082F\u00E2\u0088\u009211 (u1) \u00E2\u0088\u0082u = 1 kp(\u00C2\u00B51) 75 6.4. The Unweighted Binary Tree Therefore, as k \u00E2\u0086\u0092\u00E2\u0088\u009E we have \u00E2\u0088\u00AB F\u00E2\u0088\u009211 (u1) F\u00E2\u0088\u009211 (u1) q(x1)dx1 = 1 k \u00E2\u0088\u0082F\u00E2\u0088\u009211 (u1) \u00E2\u0088\u0082u q(\u00C2\u00B51) and so N2(k) = N1(k) q(\u00C2\u00B51) kp(\u00C2\u00B51) In general, we will have Nj(k)\u00E2\u0086\u0092 Nj\u00E2\u0088\u00921(k) \u00E2\u0088\u00AB F\u00E2\u0088\u00921 j (u1|\u00C2\u00B51:j\u00E2\u0088\u00921) F\u00E2\u0088\u00921 j (uj |\u00C2\u00B51:j\u00E2\u0088\u00921) q(xj |x1:j\u00E2\u0088\u00921 = \u00C2\u00B51:j\u00E2\u0088\u00921)dxj , which yields Nj(k) = Nj\u00E2\u0088\u00921(k) 1 k \u00E2\u0088\u0082F\u00E2\u0088\u00921j (uj |\u00C2\u00B51:j\u00E2\u0088\u00921) \u00E2\u0088\u0082u q(\u00C2\u00B5j |\u00C2\u00B51:j\u00E2\u0088\u00921) = Nj\u00E2\u0088\u00921(k) q(\u00C2\u00B5j |\u00C2\u00B51:j\u00E2\u0088\u00921) kp(\u00C2\u00B5j |\u00C2\u00B51:j\u00E2\u0088\u00921) or Nj(k) = N1(k) j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 q(\u00C2\u00B5i|\u00C2\u00B51:i\u00E2\u0088\u00921) kp(\u00C2\u00B5i|\u00C2\u00B51:i\u00E2\u0088\u00921) = N1(k) kj\u00E2\u0088\u00921 j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 q(\u00C2\u00B5i|\u00C2\u00B51:i\u00E2\u0088\u00921) p(\u00C2\u00B5i|\u00C2\u00B51:i\u00E2\u0088\u00921) Since N1(k) = N(k) = k d Nj(k) = k d\u00E2\u0088\u0092j+1 j\u00E2\u0088\u00921\u00E2\u0088\u008F i=1 q(\u00C2\u00B5i|\u00C2\u00B51:i\u00E2\u0088\u00921) p(\u00C2\u00B5i|\u00C2\u00B51:i\u00E2\u0088\u00921) \u00E2\u0088\u009D kd\u00E2\u0088\u0092j+1 ie. Nj \u00E2\u0086\u0092\u00E2\u0088\u009E as k \u00E2\u0086\u0092\u00E2\u0088\u009E for all j \u00E2\u0088\u0088 {1, . . . , d}\u0003. 6.4 The Unweighted Binary Tree The resampling scheme using an unweighted binary tree described in Algorithm 8 is a fairly direct approximation of Algorithm 4. To analyze it, we introduce a set of functions Qj : R j \u00E2\u0086\u0092 R. For any j, let k = j mod d (and if this makes k = 0, let k = d). Then Qj(z1:j\u00E2\u0088\u00921, u) = zk : 2j\u00E2\u0088\u00921 \u00E2\u0088\u00AB b(j)1 a (j) 1 . . . \u00E2\u0088\u00AB b(j) k \u00E2\u0088\u00921 a (j) k \u00E2\u0088\u00921 \u00E2\u0088\u00AB zk a (j) k \u00E2\u0088\u00AB b(j) k +1 a (j) k +1 . . . \u00E2\u0088\u00AB b(j) d a (j) d p(x)dx = 0.5 where the value of u determines the value of each a (j) i and b (j) i given z1:j\u00E2\u0088\u00921. In particular, as a representation of an infinite number of bits, u determines for each (i, j) whether a (j) i is equal to \u00E2\u0088\u0092\u00E2\u0088\u009E or zi+rd for some r \u00E2\u0088\u0088 N. Similarly, u determines if b(j)i is equal to \u00E2\u0088\u009E or zi+rd for some r \u00E2\u0088\u0088 N. In order to analyze the selection scheme, we pretend that the medians t\u00CC\u00821, . . . , t\u00CC\u0082n are computed only as needed by Algorithm 9. Since the tree is unweighted, it is not the case that Nj = 2 k\u00E2\u0088\u0092j+1. Let 76 6.4. The Unweighted Binary Tree I = \u00E2\u0088\u008Fd i=1[a (j) i , b (j) i ] be the region determined by (t1:j\u00E2\u0088\u00921, u). Then E[Nj(k)] = N(k) \u00E2\u0088\u00AB I q(x)dx and furthermore Nj(k) converges in probability to this value. As such, Nj converges in probability to a linear function of N . Claim: Let u \u00E2\u0088\u0088 [0, 1] and m \u00E2\u0088\u0088 N be given and let t1:m = fm(u). Assume \u00E2\u0088\u0082 2Qj(z1:j\u00E2\u0088\u00921,u) \u00E2\u0088\u0082zi\u00E2\u0088\u0082zk is bounded for all i, k < j in a neighbourhood of t1:j\u00E2\u0088\u00921 and all j \u00E2\u0088\u0088 {1, . . . ,m}. Also, assume the set {x : p(x) > 0} is convex. Then as N \u00E2\u0086\u0092\u00E2\u0088\u009E, \u00E2\u0088\u009AN(t\u00CC\u00821:m \u00E2\u0088\u0092 t1:m) D\u00E2\u0086\u0092 N(0,\u00CE\u00A3) for some \u00CE\u00A3 independent of N . Proof: We have that t\u00CC\u00821 D\u00E2\u0086\u0092 N(t1, \u00CF\u0083 2 1 N1 ) where \u00CF\u00831 is of the form given in Equation 6.1. Let v1 = t\u00CC\u00821 \u00E2\u0088\u0092 t1 D\u00E2\u0086\u0092 N(0, \u00CF\u0083 2 1 N1 ). We have also that t\u00CC\u00822 D\u00E2\u0086\u0092 N(t2, \u00CF\u0083 2 2 N2 ), where t2 def = Q2(t\u00CC\u00821, u) and \u00CF\u00832 is of the form given in Equation 6.1. A Taylor expansion of Q2 around t1 is given by t2 = Q2(t\u00CC\u00821, u) = Q2(t1, u) + \u00E2\u0088\u0082Q2(t1, u) \u00E2\u0088\u0082z1 (t\u00CC\u00821 \u00E2\u0088\u0092 t1) + r2(t\u00CC\u00821, t1, u) \u00E2\u0089\u0088 t2 + \u00E2\u0088\u0082Q2(t1, u) \u00E2\u0088\u0082z1 v1 If we let t\u00CC\u00822 \u00E2\u0088\u0092 t2 = v2 D\u00E2\u0086\u0092 N(0, \u00CF\u0083 2 2 N2 ), we have asymptotically t\u00CC\u00822 = t2 + \u00E2\u0088\u0082Q2(t1, u) \u00E2\u0088\u0082z1 v1 + v2 In general, we have t\u00CC\u0082j is distributed as N(tj , \u00CF\u00832j Nj ), where tj def = Qj(t1:j\u00E2\u0088\u00921, u) and \u00CF\u0083j is of the form given in Equation 6.1. A Taylor expansion of Qj around t1:j\u00E2\u0088\u00921 gives Qj(t\u00CC\u00821:j\u00E2\u0088\u00921, u) = Qj(t1:j\u00E2\u0088\u00921, u) + j\u00E2\u0088\u00921\u00E2\u0088\u0091 i=1 [ \u00E2\u0088\u0082Qj(t1:j\u00E2\u0088\u00921, u) \u00E2\u0088\u0082zi ] (t\u00CC\u0082i \u00E2\u0088\u0092 ti) + rj(t\u00CC\u00821:j\u00E2\u0088\u00921, t1:j\u00E2\u0088\u00921, u) \u00E2\u0089\u0088 tj + j\u00E2\u0088\u00921\u00E2\u0088\u0091 i=1 [ \u00E2\u0088\u0082Qj(t1:j\u00E2\u0088\u00921, u) \u00E2\u0088\u0082zi ] (t\u00CC\u0082i \u00E2\u0088\u0092 ti) It should be obvious that at most 2d of the partial derivatives are nonzero since Qj does not depend 77 6.5. Changing Parameters on every zi. Similar to the analysis for the approximation to the generalized cdf in Section 6.2, the terms t\u00CC\u0082i \u00E2\u0088\u0092 ti get progressively more complex as i increases. Again, however, each t\u00CC\u0082i \u00E2\u0088\u0092 ti can be expressed as a linear combination of vj \u00E2\u0080\u0099s (with j \u00E2\u0089\u00A4 i). If we let \u00CE\u00BBi,j denote the coefficient of vj in the expression t\u00CC\u0082i \u00E2\u0088\u0092 ti = \u00E2\u0088\u0091i j=1 \u00CE\u00BBi,jvj then we can write t\u00CC\u00821:m = t1:m + \u00CE\u009Bmv1:m with the m\u00C3\u0097m matrix \u00CE\u009Bm defined by \u00CE\u009Bm(i, j) = \u00CE\u00BBi,j . Let Ij = \u00E2\u0088\u008Fd i=1[a (j) i , b (j) i ] be the region determined by (t1:j\u00E2\u0088\u00921, u) for j = 1, . . . ,m. Then set cj =\u00E2\u0088\u00AB Ij q(x)dx so that Nj P\u00E2\u0086\u0092 cjN . Note that each vj \u00E2\u0088\u00BC N(0, \u00CF\u0083 2 j Nj ) = N(0, \u00CF\u00832j cjN ) As such, we can write t\u00CC\u00821:m = t1:m + 1\u00E2\u0088\u009A N A\u00CF\u00811:m where the matrix A is defined by A(i, j) = \u00E2\u0088\u009A cj \u00CF\u0083j \u00CE\u00BBi,j and \u00CF\u00811:m \u00E2\u0088\u00BC N(0, I). In other words, as N \u00E2\u0086\u0092\u00E2\u0088\u009E,\u00E2\u0088\u009A N(t\u00CC\u00821:m \u00E2\u0088\u0092 t1:m) D\u00E2\u0086\u0092 N(0, AAT ) where \u00CE\u00A3 def= AAT is independent of N \u0003. Let us define \u00C2\u00B5 to be (almost surely) the unique point in limn\u00E2\u0086\u0092\u00E2\u0088\u009E fn(u), where f is specified in Algorithm 4. Since the above result shows that for any m, \u00E2\u0088\u009A N(t\u00CC\u00821:m \u00E2\u0088\u0092 t1:m) D\u00E2\u0086\u0092 N(0,\u00CE\u00A3) for some \u00CE\u00A3 independent of N , it follows that x\u00CC\u0082 P\u00E2\u0086\u0092 \u00C2\u00B5. 6.5 Changing Parameters The above analyses show that the resampling mechanisms provide consistent sampled particles given the same inputs, at least marginally. As N \u00E2\u0086\u0092\u00E2\u0088\u009E, it is expected that the correlation between sampled particles due to the use of the same N particles each time a particle is selected is negligible. To show that each sampled particle is close to its corresponding particle when the parameters change is straightforward. If x\u00CC\u0082 P\u00E2\u0086\u0092 \u00C2\u00B5 using target distribution p(\u00C2\u00B7|\u00CE\u00B8) and x\u00CC\u0082\u00E2\u0080\u00B2 P\u00E2\u0086\u0092 \u00C2\u00B5\u00E2\u0080\u00B2 using target distribution p(\u00C2\u00B7|\u00CE\u00B8\u00E2\u0080\u00B2), then x\u00CC\u0082\u00E2\u0080\u00B2 = x\u00CC\u0082\u00E2\u0080\u00B2 \u00E2\u0088\u0092 \u00C2\u00B5\u00E2\u0080\u00B2 + \u00C2\u00B5\u00E2\u0080\u00B2 \u00E2\u0088\u0092 \u00C2\u00B5 + \u00C2\u00B5\u00E2\u0088\u0092 x\u00CC\u0082+ x\u00CC\u0082 \u00E2\u0087\u0092 x\u00CC\u0082\u00E2\u0080\u00B2 P\u00E2\u0086\u0092 x\u00CC\u0082+ \u00C2\u00B5\u00E2\u0080\u00B2 \u00E2\u0088\u0092 \u00C2\u00B5 78 6.6. Remarks If we have asymptotic normality, ie. \u00E2\u0088\u009A N(\u00C2\u00B5\u00E2\u0088\u0092 x\u00CC\u0082) D\u00E2\u0086\u0092 N(0,\u00CE\u00A31) and \u00E2\u0088\u009A N(x\u00CC\u0082\u00E2\u0080\u00B2\u00E2\u0088\u0092\u00C2\u00B5\u00E2\u0080\u00B2) D\u00E2\u0086\u0092 N(0,\u00CE\u00A32) we have \u00E2\u0088\u009A N(x\u00CC\u0082\u00E2\u0080\u00B2 \u00E2\u0088\u0092 x\u00CC\u0082) D\u00E2\u0086\u0092 N(\u00C2\u00B5\u00E2\u0080\u00B2 \u00E2\u0088\u0092 \u00C2\u00B5,\u00CE\u00A31 +\u00CE\u00A32) In the event that \u00C2\u00B5 is continuous in \u00CE\u00B8, this implies that x\u00CC\u0082\u00E2\u0080\u00B2 will be close to x\u00CC\u0082 if \u00CE\u00B8\u00E2\u0080\u00B2 is close to \u00CE\u00B8. 6.6 Remarks In the above analysis, we have shown asymptotic convergence for selected particles from a dis- tribution p(x) = p(xt|y0:t) whereas we are interested in convergence for selected particles from a distribution P\u00CC\u0082N (dxt|y0:t) = \u00E2\u0088\u0091N i=1 w (i) t \u00CE\u00B4x(i)t (dxt). However, it is known that P\u00CC\u0082N (xt|y0:t) converges in distribution to p(xt|y0:t) as N \u00E2\u0086\u0092\u00E2\u0088\u009E so this is not an issue. It is worth noting that there are no regularity conditions required to guarantee that any of the tree- based resampling algorithms select a particle (without interpolation) x(i) according to its weight W (i), as shown in Chapter 4. The conditions used above are only needed to show that the selected particle using common random numbers converges asymptotically to a particular particle. 79 Chapter 7 Discussion This thesis has presented the current state of the art for smooth likelihood estimation using particle filters, discussing recent approaches in the literature as well as proposing a novel tree-based approach. In particular, we have seen that in continuous state-space models, it is possible to evaluate the log- likelihood of data given fixed parameters using particle filters. In addition, we can reduce the variance of differences in this log-likelihood by inducing correlation among the particles selected by each filter through the use of common random numbers. The main contribution of this thesis is the introduction of tree-based resampling schemes to induce correlation amongst selected particles across filters when the state variables are multi-dimensional. This is accomplished with no additional constraints on the general filtering framework - any proposal distribution can be used and there are no additional parameters. In fact, the use of a tree-based resampling scheme involves only a change to the resample operation in the SMC framework. In addition, the running time of a filter utilising such a resampling scheme is in o(TN2), where N is the number of particles and T is the number of discrete time steps in the filter. The most empirically compelling scheme, that involving the weighted binary tree, leads to a filter that runs in expected O(TN logN) time. This is in contrast to other approaches which run in O(TN2) time and require a family of approximating distributions (see Section 3.3.2) or approaches which generally require small T and close \u00CE\u00B8 (see Section 3.3.3). Furthermore, the tree-based resampling scheme can be seen as a generalization of the \u00E2\u0080\u0098solved\u00E2\u0080\u0099 case (see [23]) where the state variables are univariate. In the analysis of the unweighted binary and k-ary trees that are proposed, we have shown that as N \u00E2\u0086\u0092 \u00E2\u0088\u009E, the particle sampled using a common random number converges in probability to a point that would be returned by a theoretical procedure involving the computation of numerous intractable integrals. While this assures us of a high degree of correlation for very large N , it should not be confused with a result for practical values of N . In our applications, it is found that the weighted binary tree resampling scheme (for which no asymptotic results are shown) gives the smoothest log-likelihood function. In addition, the variance of a particle filter estimate of the log-likelihood is proportional to 1 N . Since we are concerned with situations where the true log- likelihood is continuous in the parameters, runs of uncorrelated particle filters as N \u00E2\u0086\u0092 \u00E2\u0088\u009E will themselves become nearly smooth. Furthermore, uncorrelated filters run in O(N) time and will 80 Chapter 7. Discussion always provide lower variance estimates of the log-likelihood given a fixed computational budget. Alternatively, even the SIS algorithm which is naturally smooth using common random numbers will provide accurate estimation of the log-likelihood as N \u00E2\u0086\u0092 \u00E2\u0088\u009E since the variance, despite being exponential in the fixed time index T , decreases as 1 N . This obviously does not mean that our methods are not useful - they just highlight the potential risk of using asymptotic properties to justify the use of a particular resampling scheme. In practical applications, the use of the tree-based resampling schemes show a distinct improvement over the vanilla and CRN vanilla filters with respect to smoothness of plots of the log-likelihood for changing parameters. In order to benefit from this smoothness, one could employ curve fitting or surface fitting techniques to obtain approximate maximum likelihood estimates of the parameters. In an online setting, one could assign one filter to each of M parameter values and run each filter for one step upon receipt of new data point. This would take time in O(MN logN) and could be easily parallelized to O(N logN) on M computing machines. If the surface fitting procedure is not computationally intensive given the M estimates of the log-likelihood, a smooth approximating surface could additionally be computed at each time step. In contrast, methods like those in [5] are not able to produce an accurate and almost smooth surface for large ranges of parameters but are better suited to maximum-likelihood parameter estimation via stochastic gradient ascent for fixed T . Future Research The tree-based resampling schemes we have presented are not necessarily the most effective. It is possible that different trees could have better properties. In addition, the methods we have proposed select indices by processing the particles without any kind of modification. In some cases, dimension reduction or some transformation of the particles could be used to build the tree and select indices such that the smoothness of the likelihood estimator is increased. There is clearly work to be done in analyzing the correlation between particles in tree-based resam- pling schemes for finite N . If we could estimate the amount of correlation given a particular tree structure or set of structures, we would have better grounds for selecting the type of tree used for resampling. Finally, the most appropriate way to utilize nearly smooth particle filters is unknown at present. They seem well-suited to the task of likelihood maximization but not necessarily via stochastic gradient ascent, for which the methods in [25] and [5] are readily applicable. Curve or surface fitting as a precursor to maximization as mentioned above is only one possible option. 81 Bibliography [1] Jon Louis Bentley and James B. Saxe. Generating sorted lists of random numbers. ACM Trans. Math. Softw., 6(3):359\u00E2\u0080\u0093364, 1980. [2] M. Blum, R.W. Floyd, V. Pratt, R. Rivest, and R. Tarjan. Time bounds for selection. J. Comput. System Sci. 7, pages 448\u00E2\u0080\u0093461, 1973. [3] P. Bortot, S. G. Coles, and S. A. Sisson. Inference for stereological extremes. Journal of the American Statistical Association, 102:84\u00E2\u0080\u009392, 2007. [4] R. J. Boys, D. J. Wilkinson, and T. B. L. Kirkwood. Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 2007. [5] Pierre-Arnaud Coquelin, Romain Deguest, and Remi Munos. Perturbation analysis for pa- rameter estimation in continuous space HMMs. Submitted to IEEE Transactions on Signal Processing, 2008. [6] D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736\u00E2\u0080\u0093746, 2002. [7] David N. Dejong, Hariharan Dharmarajan, Roman Liesenfeld, and Jean-Francois Richard. An efficient approach to analyzing state-space representations. SSRN eLibrary, 2007. [8] Arnaud Doucet. On sequential simulation-based methods for Bayesian filtering. Technical Report CUED/F-INFENG/TR. 310, Cambridge University, Department of Engineering, 1998. [9] Garland B Durham and A Ronald Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297\u00E2\u0080\u0093316, 2002. [10] Paul Fearnhead. Computational methods for complex stochastic systems: a review of some alternatives to MCMC. Statistics and Computing, 2007. [11] P. Glynn. Importance sampling for Monte Carlo estimation of quantiles. Technical report, Stanford University, Department of Operations Research, 1996. 82 Chapter 7. Bibliography [12] A. Golightly and D. J. Wilkinson. Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics, 61:781\u00E2\u0080\u0093788, 2005. [13] A. Golightly and D. J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models observed with error. Comput. Stat. Data Anal., 52(3):1674\u00E2\u0080\u00931693, 2008. [14] C. A. R. Hoare. Algorithm 65: find. Commun. ACM, 4(7):321\u00E2\u0080\u0093322, 1961. [15] Jeroen D. Hol, Thomas B. Schn, and Fredrik Gustafsson. On resampling algorithms for particle filters. In Nonlinear Statistical Signal Processing Workshop, Cambridge, United Kingdom, 2006. [16] M. Vernon Johns. Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association, 83(403):709\u00E2\u0080\u0093714, 1988. [17] Mike Klaas, Nando de Freitas, and Arnaud Doucet. Toward practical N2 Monte Carlo: the marginal particle filter. In Proceedings of the 21th Annual Conference on Uncertainty in Arti- ficial Intelligence (UAI-05), pages 308\u00E2\u0080\u009331, Arlington, Virginia, 2005. AUAI Press. [18] Donald E. Knuth. Mathematical analysis of algorithms. Technical Report STAN-CS-71-206, Stanford University, Department of Computer Science, 1971. [19] Finn E. Kydland and Edward C. Prescott. Time to build and aggregate fluctuations. Econo- metrica, 50(6):1345\u00E2\u0080\u00931370, 1982. [20] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors, Sequential Monte Carlo Methods in Practice. New York. Springer-Verlag, New York, 2000. [21] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavar. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. U. S. A., 100:15324\u00E2\u0080\u009315328, 2003. [22] Jerrold E. Marsden and Anthony J. Tromba. Vector Calculus. W. H. Freeman, 4th edition, 1996. [23] Michael K. Pitt. Smooth particle filters for likelihood evaluation and maximisation. The War- wick Economics Research Paper Series (TWERPS) 651, University of Warwick, Department of Economics, 2002. Available at http://ideas.repec.org/p/wrk/warwec/651.html. [24] Michael K. Pitt and Neil Shephard. Time-varying covariances: A factor stochastic volatility approach. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6. Oxford University Press, 1999. [25] George Poyiadjis, Arnaud Doucet, and Sumeetpal S. Singh. Particle methods for optimal filter derivative: Application to parameter estimation. In IEEE ICASSP, pages 925\u00E2\u0080\u0093928, 2005. [26] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Second Edition). Springer-Verlag, New York, 2004. 83 Chapter 7. Bibliography [27] David Romer. Advanced Macroeconomics. McGraw-Hill/Irwin, May 2001. [28] S. A. Sisson, Y. Fan, and Mark M. Tanaka. Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. U. S. A., 104:1760\u00E2\u0080\u00931765, 2007. [29] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer-Verlag, New York, 2004. 84"@en .
"Thesis/Dissertation"@en .
"2008-11"@en .
"10.14288/1.0051412"@en .
"eng"@en .
"Computer Science"@en .
"Vancouver : University of British Columbia Library"@en .
"University of British Columbia"@en .
"Attribution-NonCommercial-NoDerivatives 4.0 International"@en .
"http://creativecommons.org/licenses/by-nc-nd/4.0/"@en .
"Graduate"@en .
"Sequential Monte Carlo"@en .
"Likelihood estimation"@en .
"State-space models"@en .
"Towards smooth particle filters for likelihood estimation with multivariate latent variables"@en .
"Text"@en .
"http://hdl.handle.net/2429/1547"@en .