Exorcising N2 Stigmata in Sequential Monte Carlo by Mike Klaas B.Sc, Dalhousie University A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA August 2005 © Mike Klaas, 2005 Abstract Abstract Sequential Monte Carlo (SMC) has, since being "rediscovered" in the early 1990's, become one of the most important inference techniques in machine learning. It en-joys a prominent place in statistics, robotics, quantum physics, as well as control and other industrial applications. SMC methods represent probability densities as a dicrete set of N Dirac masses called particles. This non-parametric representation provably converges to the true distribution of interest and is effective in high dimensional ap-plications. Many sophisticated SMC algorithms require 0(N2) computation, which is viewed as impractically-expensive by researchers in the field. Hence, N2 SMC algorithms possess what we call "N2 stigmata"—they are simply "too slow." This thesis aims to exorcise these stigmata. We present a survey of areas where these algorithms occur and show that in every case their expense results from having to compute a sum-kernel or max-kernel operation. Both are of a class of operations called iV-body problems which occur in physics, statistics, and machine learning. We show how the techniques developed in this field can be used to accelerate sum- and max-kernel (and consequently the SMC algorithms), reducing the cost to 0(Nlog N)—in some cases O(N). Using these methods, iV 2 Monte Carlo algorithms should be applicable in a much wider range of settings. Along the way, we introduce new SMC algorithms for marginal filtering and a novel algorithm for drastically accelerating max-kernel. Contents iii Contents Abstract 1 1 Contents . • iii List of Tables vii List of Figures viii Acknowledgements x 1 Preamble 1 1.1 Notational remark 3 1 N2 algorithms in Sequential Monte Carlo 4 2 SMC and Bayesian Inference 5 2.1 Bayesian state estimation 6 2.1.1 Prediction, filtering, and smoothing 7 2.1.2 Bayesian filtering 8 2.1.3 Bayesian smoothing 8 2.1.3.1 Forward-backward smoothing 9 2.1.3.2 Two-filter smoothing 9 2.1.4 Nonanalytic example 10 2.2 Sequential Monte Carlo 10 2.2.1 Resampling 12 Contents iv 2.3 Particle smoothing 15 2.3.1 Forward-backward smoother 16 2.3.2 Two-filter smoother 17 2.3.3 Maximum a posteriori (MAP) smoother 18 2.4 Experiments 21 2.4.1 MAP smoothing • 21 2.4.1.1 Main example 21 2.4.1.2 Real-world application 22 2.4.2 Forward-backward smoothing 25 3 The Marginal Particle Fi l ter 26 3.1 Marginal Particle Filter 26 3.1.1 Auxiliary Variable MPF 28 3.1.2 Discussion 31 3.1.3 Cases of equivalence 33 3.2 Efficient Implementation • 34 3.2.1 Fast methods for AMPF 34 3.3 Experiments 35 3.3.1 Multi-modal non-linear time series 35 3.3.2 Stochastic volatility 37 3.3.3 Fast methods . 38 3.4 Summary . 40 II Acceleration methods 41 4 Fast methods 42 4.1 The problem 42 4.1.1 Sum-Kernel 43 4.1.2 Max-Kernel . . . 43 4.2 Preliminaries 44 Contents 4.2.1 Assumptions 44 4.2.2 Spatial indices , 45 4.2.2.1 Kd-trees 46 4.2.2.2 Anchors hierarchy and metric trees 46 4.2.3 Dual-tree recursion 47 4.2.3.1 Single-tree recursion 47 4.2.3.2 From one tree to two 48 4.3 Fast' methods for sum-kernel 50 4.3.1 Fast Gauss Transform 51 4.3.2 Improved Fast Gauss Transform 52 4.3.3 Dual-tree sum-kernel 52 4.4 Fast methods for max-kernel 54 4.4.1 The distance transform 54 4.4.1.1 Extension to irregular grids in 1-D 55 5 Dual-tree max-kernel 57 5.1 The algorithm 59 5.1.1 Influence bounding 59 5.2 Performance in iV 62 5.3 The effect of other parameters 67 5.3.1 Distribution and dimensionality 67 5.3.2 Kernel bandwidth 69 5.4 Conclusion and an application 72 5.4.1 Maximum a posteriori belief propagation 72 5.4.2 Relaxation of kernel assumption 73 5.4.3 Summary 73 6 A better(?) algorithm 74 6.1 Bounding in dual-tree algorithms • . . . 75 6.1.1 A bounds-tightening regime for max-kernel 76 Contents vi 6.1.2 Solving the optimization 77 6.2 The algorithm • 8 0 6.2.1 Discussion 81 6.3 Results 81 6.3.1 STDT results 82 6.3.2 DTDT results 82 6.4 Summary 87 7 Postamble 8 8 7.1 ' JV2 Sequential Monte Carlo 88 7.2 Fast methods 90 Bibliography 92 List of Tables vii List of Tables 3.1 1-D time series; RMS error and weight variance 37 3.2 Fast methods applied to the marginal particle filter 40 4.1 Summary of sum-kernel methods 54 List of Figures viii List of Figures 2.1 A Markov model 7 2.2 Example 2.1.4 11 2.3 Sequential importance sampling 13 2.4 Comparison of SIS, SIR, and SIS with occasional resampling 14 2.5 Particle filtering algorithm at time t 15 2.6 M A P smoothing algorithm . 20 2.7 M A P smoothing results; state estimate 21 2.8 M A P smoothing results; baseline view 22 2.9 1-D time series results; efficiency. 23 2.10 Beat-tracking results: time v. particle count 24 2.11 Forward-backward smoothing on synthetic data 24 3.1 The M P F algorithm at time t 28 3.2 The A M P F algorithm at time t 31 3.3 M P F mixture predictive density 32 3.4 Variance reduction with MPF and A M P F , . . . . 33 3.5 1-D time series; state estimate 36 3.6 1-D time series; importance weight variance 36 3-7 1-D time series; unique particle count 37 3.8 Stochastic volatility model; state estimate 38 3.9 Stochastic volatility model; importance weight variance. 39 3.10 Stochastic volatility, unique particle count 39 4.1 Inclusion and exclusion (single-tree) 48 List of Figures ix 4.2 Dual-tree recusion 49 4.3 Illustration of the fast Gauss transform 51 4.4 Dual-tree bounding for spherical nodes •. . . 53 4.5 Lower envelope computed with the distance transform 55 5.1 Max-kernel bounding 60 5.2 Pseudocode for dual-tree max-kernel algorithm, part 1 63 5.3 Pseudocode for dual-tree max-kernel algorithm, part 2 64 5.4 Dual-tree pruning 65 5.5 Plots for dual-tree max-product example . 66 5.6 Synthetic data (N = 20000) with c = 20 clusters 67 5.7 Dual-tree max-kernel; time v. dimensionality. 68 5.8 Dual-tree max-kernel; distance computations v. dimenionality 68 5.9 Time v. dimensionality; ratio to kd-tree = 1 70 5.10 Dual-tree max-kernel; distance computations v. bandwidth 71 6.1 An upper bound on maximum influence obtained by rotating the par-ticles in X 78 6.2 Lower envelope using the distance transform 78 6.3 An lower bound on maximum influence obtained by rotating the parti-cles in X 79 6.4 Bounding in a metric space 80 6.5 STDT, d = 3, time v. N 83 6.6 STDT, h = 1, d = 3, DCs/leaves v. TV 83 6.7 STDT, h = l, N = 10000, performance v. d 84 6.8 DTDT, h — 0.1/1, d = 3, cpu time v. N 84 6.9 DTDT, h = 0.1/1, d = 3, distance computations v. N 85 6.10 DTDT, h=l,N= 10000, performance v. d. 85 6.11 DTDT, h = l,N = 10000, relative comparison 86 6.12 DTDT, pruned nodes v. tree depth 86 Acknowledgements x Acknowledgements Working alone, this thesis would not have occurred. I must especially thank two people I worked closely with in the course of this research. First, my supervisor, Nando de Freitas, who has been an invaluable, source of ideas, advice, and support throughout my degree. Second, Dustin Lang, with whom I met practically daily working on fast methods. It is not an exaggeration to say that there are no ideas in Part II of this thesis untouched by his influence. Both truly deserve co-authorship credit for this work. I would generally like to thank all my collaborators—a partial list of whom can be found on the preceeding page. They have been wonderful to work with, and I hope they forgive the bits of our co-authored papers that have found their way into this thesis. Finally, I am indebted to my two readers, Nando de Freitas and Daniel Hutten-locher, for their helpful comments and suggestions. Chapter 1 Preamble Ouvrez la tete. Gnossiene No. 3 E R I C S A T I E This thesis is about two very different things, and is hence organized into two parts. The first part is about a certain class of algorithms that have 0(N2) cost. These algorithms have many advantages over cheaper alternatives,1 but are largely ignored due to this cost, which stems from a core computation which is similar among all the algorithms. The second part of this thesis is concerned with a class of algorithms which can accelerate this core computation. In both parts, you will find contributions to the state-of-the-art. The first part is about Sequential Monte Carlo (SMC) [10, 40, 34]. Monte Carlo algorithms approximate probability distributions by a set of samples, called particles, and can be used to numerically estimate integrals which cannot be evaluated analyt-ically. In the domain of Bayesian filtering (or smoothing), where the task is to infer the distribution of a latent state given a sequence of observations, these integrals arise frequently and as such SMC methods are a popular inference technique [2]. These are, in fact, the most widely-known application of SMC techniques; most of the examples we consider are those of particle filters and particle smoothers. In Part I, N will refer to the number of particles used by an SMC algorithm. The use of large particle populations is desirable as the quality of the Monte Carlo approximation increases in the number of particles used. Practitioners generally use as many particles as is computationally feasible, and thus the sensitivity of the cost of an algorithm to the size of the particle population is of paramount importance when using SMC. For this reason, techniques that exhibit superlinear (generally quadratic) cost in N and have reputations of impracticality. These stigmata are the topic of this 1 Assuming cheaper alternatives even exist 1 Chapter 1. Preamble 2 thesis. N2 algorithms, though costly, can accomplish things impossible with linear-time algorithms (such as smoothing; improving the estimate of past states given future observations), or do similar things with greater precision and less variance. The first part will introduce Sequential Monte Carlo in the context of Bayesian state estimation, and discuss areas in which TV2 algorithms arise in this setting. Further, we present in Chapter 3 the Marginal Particle Filter, which is a novel filtering algorithm that outperforms existing techniques in several respects, but exhibits 0(N2) cost. Throughout Part I, we show that all the N2 algorithms presented owe their cost to the same core computation, of which there are two variations. One is the sum-kernel computation: N for j = l,...,N fj = £ wiKfayj). ( 1 . 1 ) i=i Here, source particles {XJ} with weights {u>i} exert an influence on target particles {y^} given by UiK (xj,yj). Sum-kernel aims to determine the sum of influence exerted on each target particle by all source particles. The other is known as the max-kernel problem, for j = 1,...,N ft = max w { ( x ; , y^), ( 1 . 2 ) where we wish to determine the source particle of maximum influence at each target particle. Both of these operations cost 0(N2) when implemented straightforwardly, which we will call the naive implementation. We will not directly discuss acceleration strate-gies for N2 algorithms in Part I. Instead, we will show how the algorithms reduce to equations ( 1 . 1 ) and ( 1 . 2 ) , and devote the entirety of Part II to methods for perform-ing these two operations efficiently. We review existing acceleration techniques and in Chapter 5 present a novel algorithm for performing max-kernel in Monte Carlo state spaces, for which no previous algorithm exists. Our goal is to demonstrate that N2 methods in Sequential Monte Carlo are prac-tical when implemented using these fast methods, and consequently undeserving of stigmata, which we hope to exorcise. Chapter 1. Preamble 3 1.1 Notational remark The conventional particle filtering notation uses x to denote latent state, y to denote observations, and w is used for importance weights. In fast methods, x and y are often used to denote source and target points (particles), and w to denote source weights. These variables are not completely unrelated: a Monte Carlo particle x will sometimes correspond to x in fast methods, and an importance weight w will often correspond to a source particle weight w, but not always. The symmetry breaks down further in the case of y, as a filtering observation y is completely unrelated to y in fast methods. Since we do not trust our ability to safely navigate the quagmire that would result from succumbing to the temptation to abuse notation in this case, we deviate slightly from conventional notation in both parts. We will use u to denote latent state, z for observations, and w for importance weights when discussing SMC, and use x, y, and UJ when speaking of fast methods. Al l mathematical notation in this thesis is standard, except for one eccentricity from Sequential Monte Carlo: when a sequence is indexed by a subscript variable, such as {vt}, we will use va:b, (with a < b) to denote the set {va, va+i,..., Vb-i,Vb}. Part I algorithms in Sequential Monte Carlo Chapter 2 Munissez-vous de clairvoyance. Gnossiene No. 3 E R I C S A T I E S M C and Bayesian Inference Bayesian state estimation is ubiquitous in the A l community, being one of the most popular techniques for performing inference in dynamic models. Examples of its use include tracking [42], diagnosis [7], and control [36]. An unobserved signal (latent states Ut € U) is assumed to exist, and evolves according to a (typically Markovian) dynamic model. Additionally, Bayesian filtering assumes the existence of observations (zt) which are conditionally independent given the process. An observation model specifies the generation of observations given a specified latent state. Of interest is estimating a distribution of the latent state up to time t given the observations up to that time; either the joint filtering distribution p ( u i : t | z i ; t ) or the marginal filtering distribution p(ut\zi:t). If all observations up to time T are available, the state at time t < T can be estimated more accurately; determining p ( u t | z i : r ) is known as smoothing. In some cases this model can be solved using exact inference, for instance using the Kalman or H M M filters [26]. Unfortunately, real-world models are rarely simple enough to be solved exactly, often containing non-linearity, non-Gaussianity, or hybrid combinations of discrete and continuous variables which, all of which lead to intractable integrals. Since these integrals cannot be solved analytically, approximation techniques are required. One of the most successful and popular approximation techniques is Sequential Monte Carlo (SMC), which is referred to as Particle Filtering (PF) in the Bayesian filtering domain [34, 10, 40, 11]. In its most basic form, particle filters work by starting with a sample from the posterior at time t — 1, predicting the state at time t, then updating the importance weights based on the observation z t . These samples form an 5 Chapter 2. SMC and Bayesian Inference 6 approximation of the joint density p(ui : t |z i : t ) at time t. Often, however, it is the filter-ing distribution p(ut|zi : t) that is desired. This is approximated by dropping samples of the states ui . . . ut-i at time t (often implicitly). Particle smoothing algorithms are more expensive, but can provide an SMC approximation to the smoothed density 7j (ui |z i : r ) . In this chapter, we review Sequential Monte Carlo in the setting of Bayesian state estimation. Along the way, we will encounter several N2 algorithms. In each case, we will show how to implement these algorithms efficiently by reducing them to a sum-kernel (equation (1.1)), or max-kernel (equation (1.2)) problem. We assume algorithms exist to solve these operations efficiently and with little (or no) error. Precisely what algorithms can be applied and in what cases will be dealt with the part II. 2.1 Bayesian state estimation The unobserved signal {ut}, Ut € U, is modelled as a Markov process of initial dis-tribution p(ui) and transition prior p(uj |u£_i). The observations {z^}, zt 6 Z, are assumed to be conditionally independent given the process {u t} and of marginal dis-tribution p (zt \ Ut). Hence, the model is described by p(u t|ut_i) t > 1, and p(zi|u t) t > 1. We denote by u i : t = {ui,...,u t} and z i : t = {zi, ...;z t}, respectively, the signal and the observations up to time t, and define p(ui|un) = p(ui) for notational convenience. Figure 2.1 depicts" this model graphically. The goal of Bayesian filtering is to estimate sequentially in time either the marginal filtering distribution p (ut\zi:t) or the joint filtering distribution p(ui : t |z i : i ) . Often the aim is estimate the expectation of a function over the filtering distribu-tion, i.e., ^( /0=Ep(u t |« l ! t )[ / t (u t )]= f ft(nt)p(ut\z1:t)dut (2.1) Chapter 2. SMC and Bayesian Inference 7 o o o Figure 2.1: The well-nigh ubiquitous figure of a Hidden Markov Model (HMM). The clear nodes are random variables representing the latent state at time t, and the shaded nodes are observed variables. Note that the observations at time t are d-separated from the rest of the graph by the latent state at time t, and said latent state is solely dependent on the preceeding state. for some function of interest ft : tC —> R n ^ t integrable with respect to p ( u t | z i : t ) . Examples of appropriate functions include the conditional mean, in which case 2.1.1 Prediction, filtering, and smoothing As mentioned in the previous section, Bayesian state estimation involves determining the distribution of the latent state at time t, u t . There is specific nomenclature for the precedure of performing this estimation given different sets of observations z. The three facets of state estimation are:1 Prediction Estimating state in the future of available observations: p(u t | z i : fc) k < t. Filtering Estimating state up to the time for which we have data: p(ut\zi:k) k = t. 1If the intuition behind these names is not immediately obvious, it is because the nomenclature has been borrowed from signal processing literature where it originated. Filtering is the process of removing noise from a corrupted signal, and smoothing reduces the "jumpiness" of the signal. These terms now exist without less confusing alternatives in the Bayesian state estimation literature. ft ( u t ) = U t or the conditional covariance of u j where ft(ut) = utuTt - E p ( u t | S l : t ) [ u t ] E j ( u t | a i ! t ) [ut] • Chapter 2. SMC and Bayesian Inference 8 Smoothing Estimating state for which future observations exist: p(ui|zi:fc) k > t. This list is in ascending order of accuracy; it is clear that an increased number of ob-servations will improve our estimate of state. The list is also ordered by computational complexity. This is not particularly surprising as prediction is needed for filtering, and filtering is needed for smoothing. 2.1.2 Bayesian filtering General Bayesian filtering is a two-step procedure. Given an estimate of p(ui_i|zi : t _i), we can marginalize over u t_i to obtain an formula for p(u t|zi : t_i): p(u t|zi : f_i) = J p(u f|u t_i)p(ut_i|zi : f_i)du t_i. (2.2) This is known as the prediction step, and the density of equation (2.2) is hence the predictive density. We then update this density with the observation from time t, zt, using Bayes' rule: R U t zl:tJ - — — • ——. (2.3) Jp(z t|u t)p(u t |z 1 : t_ 1)du t It is only in rare cases that the integrals in equations (2.2) and (2.3) can be com-puted analytically.2 Approximation strategies have therefore been devised to solve this model in more realistic settings. Such strategies include the Extended Kalman Filter, the Unscented Kalman Filter [25], and the Gaussian-mixture Filter, which are convered in-depth by Fearnhead [11]. 2.1.3 Bayesian smoothing Assume we have observations up to time T, and we wish to estimate the state at time i < T, i.e., determine p(ut|zi :x). 2Such as the case of a linear-Gaussian model, which can be solved exactly using the Kalman filter [26], and the case of solely discrete distributions, in which case the integrals are in fact sums and can be computed directly (albeit often only at prohibitive computational expense). Chapter 2. SMC and Bayesian Inference 9 2.1.3.1 Forward-backward smoothing The smoothed density p(u t |zi :T) can be factored as follows: p(u t |z i : T ) = J p(ut,ut+i\zhT)dut+i = J p ( u t + i | z i : T ) p ( u t | U t + i , Z i : T ) d U t + l = J p ( u t + l | z i : T ) p ( u t | u t + l , Z i : t ) d U t + i smoothed dynamics / ~ * s = p(u t z 1 : t ) / -j- r — ^ dut+i. (2.4) v v 'J p (u t + i | z 1 : t ) filtered v •>» ' prediction Hence, we obtain a recursive formula for the smoothed density p(u t |zi :x) in terms of the filtering density at time t p(ut|zi : t), the prediction density p(u 4 + i | z i ; t ) , and the smoothed density at time t+1 p(ut+i|zi :r). The algorithm proceeds by making first a forward filtering pass to compute the filtered distribution at each timestep, and then a backward smoothing pass to determine the smoothing distribution. 2.1.3.2 Two-filter smoothing A smoothed marginal distribution can also be obtained by combining the result of two independent filter-like procedures; one running forward in time and another backward. We use the following factorization: P ( u t | z i : T ) =p (u t | z i : t _ i ,Zt : r ) = P(utNl:t-l)p(Zt:r |zi:t-l, Ut) P(z t : r |Z l :«- l ) O C p ( u t | z i : t _ i ) p ( z t : r | u t ) ' o c p ( u t | z 1 : t ) p ( z t + 1 : r | u i ) . (2.5) V v " v ' filter one filter two Chapter 2. SMC and Bayesian Inference 10 Filter one is our familiar Bayesian filter. Filter two can be computed sequentially using the Backward Information Filter [33], noting that: p(zt:r|u t) = J p (z t , z t + i : r ,u t + i | u t )du t + i / reverse dynamics p ( z t + i : T | u t + i ) p(ut+i\ut) p(zt\ut)dut+1. (2.6) filtered pdf at 4+1 likelihood Remark. As Briers notes [4], p(z t :r|ut) is not a proper probability density in u^. An artificial prior to deal with this is presented in that reference. 2.1.4 Nonanalytic example We motivate the Bayesian state estimation problem with an example, which is a stan-dard example of a difficult filtering workload [15]. The dynamics are specified by: ui ~N(0,ov) ut = i u t _ i + 25 Ut~l + 8 cos (1.2i) +vt t>l 2 1 + u t _ 1 Z t = -L+Wt t > 1 20 . ~ where vt ~ A/"(0, ov) and wt ~ A/"(0, ow). The filtering distribution for ov = I, ow = 10 is shown in Figure 2.2. Note that while the noise in the model is Gaussian, it is non-linear and multimodal, due to the quadratic term in the observation model. It is thus impossible to derive analytic expressions for the posterior. We will return to this example several times in the course of this thesis. 2.2 Sequential Monte Carlo The approximation techniques mentioned near the end of Section 2.1.2 all operate on the principle of replacing the model with a tractable (typically linear-Gaussian) approximation, and performing exact inference on this approximate model. In Se-quential Monte Carlo, we do not use an approximation to the orginial model, but instead perform approximate inference on the exact model. Chapter 2. SMC and Bayesian Inference 11 Time (t) Figure 2.2: Filtering distribution p(u t |zi : t) for the example model of Section 2.1.4. If we had a set of samples (or particles) | u ^ | from p(u t |zi : t ) , we could ap-proximate the distribution with the Monte Carlo histogram estimator 1 N p(dut\z1:t) = -^J2^u(i){dut) where 5 (t) (dut) denotes the delta Dirac function.3 This can be used to approximate the expectations of interest in equation (2.1) with N '</<) = ^ E/<(0 i=l This estimate converges almost surely to the true expectation as N goes to infinity. Unfortunately, one cannot easily sample from the marginal distribution p(ut|zi :j) di-rectly. Instead, we draw particles from p(ui ;t|zi :t) and samples U i : t _ i are ignored. To draw samples from p(ui : t | z i : t ) , we sample from a proposal distribution q and weight the particles according to the following importance ratio: P (Ul: t |z i : t ) p(ui : t | z 1 : t ) q(ui:t_l |zi :t_i) (?(Ul : t Z l : t ) q{U1:t Z i : t ) p ( U l : t _ l Z i : t _ i ) 'Defined as <5 <•;> (dut) = 1 u < ° g dut, 0 otherwise. Chapter 2. SMC and Bayesian Inference 12 The proposal distribution is constructed sequentially g ( U l : t | z i : t ) = < 7 ( u i : t _ i | z i : t _ i ) g ( u t | z t , U l : t _ l ) . and, hence, the importance weights can be updated recursively . P(ui:t|zi:t) / n „x Wt = — — : r W t - l . (2.7) P(ui:t_i |z 1 : t_i) q{ut\zt, U i : t _ i ) Given a set of N particles j u ^ ^ j , we obtain a set of particles j u ^ j by sam-pling from g(uj|u^?i_1, z i : t ) and applying the weights of equation (2.7). An estimate of p(dut\zi:t) is then N p(dut\z1:t) = ^w t W<5 u(i)(u t). i = i ' The familiar particle filtering equations for this model are obtained by remarking that t P(ui:t|z 1 : t) O C p ( u l : t , Z i : t ) = JJ p(z f e |u f c )p(u f c | u f c _ l ) , fc=l given which, equation (2.7) becomes ~(0 p(z< ( (») )P{Ut Q ( (*) Z t , U t - l ) ..(0 (2.8) ^ ~(j) This iterative scheme produces a weighted measure |u^ : i,u;j j and is known as Sequential Importance Sampling (SIS) [41]. Note that SIS performs importance sam-pling in the joint space, as samples from p(ui : t_i |zi : t_i) are augmented to produce samples from p(ui : t |z i ; i ) ; it is intuitive to think of particles as paths through the state space which grow at each iteration. Figure 2.3 illustrates this point. 2.2.1 Resampling Unfortunately, the SIS algorithm often leads to unbounded variance in the importance weights (equation (2.8)), which leads to nonconvergence. This phenomenon is known Chapter 2. SMC and Bayesian Inference 13 u ,(0 l:t - 0 U xU x W • • • =Ul Figure 2.3: Sequential importance sampling; particles at time t can be thought of as a path inU1. as degeneracy of the importance weights. Figure 2.4(a) demonstrates a situation in which degeneracy is fatal to the SIS algorithm. To achieve an estimate of minimum variance, we would like a set of samples from the exact posterior distribution, in which case the importance weights are all = 1, and thus have zero variance. Generally, the further our samples are from this target distribution, the greater resulting importance weight variance. This quantity is hence used as one measure of quality for particle filter algorithms. The importance weights are also used to calculate a measure called effective sample size, which is an estimate of the number of non-degenerate particles in the current population. It is defined as follows: To overcome importance weight degeneracy, an effective solution is to resample the particles according to the discrete measure implied by the importance weights. This results in particles with high weight being multiplied, while particles with low weight get discarded, ensuring that the population remains viable. If this resampling step is performed at every timestep, the Sequential Importance Sampling/Resampling (SIR) algorithm is obtained. Resampling should not be performed if it is not necessary, however, as it adds variance to the resulting state estimate. Hence, it is often advocated that resampling only be performed when the effective sample size (equation (2.9)) drops below a certain threshold, say N/3. The manner in which resampling is performed affects the variance of the result. These include multinomial (convential) resampling, as well as deterministic, system-atic, and residual resampling [28]. Convergence of the SIR algorithm has only been (2.9) Chapter 2. SMC and Bayesian Inference 14 State estimate 20 15 10 5 0 -5 -10 -15 -20. - SIS • - SIR - SIS + thres. resamp. »••• True state - Observations 10 20 30 Timestep 40 50 0.04 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0. Importance weight variance SIS SIR SIS + thres. resamp. ii 10 20 30 Timestep 40 50 (a) State estimate (b) Importance weight variance 700 600 500 400 300 200 100 0. Effective sample s ize S IS SIR S IS + thres. resamp. 0 t i > J M •» i 500 400 300 I 200 100 10 20 30 Timestep 40 50 Unique particles (max = 700) — SIS • - SIR - S IS + thres. resamp. it ii 11 i i 1 ) I 1 1 » I ' . • » I P ' s .' A 1 » ' 1 • 10 20 30 Timestep 40 50 (c) Effective sample size (d) Unique particle count Figure 2.4: Comparison of SIS, SIR, and SIS with thresholded resampling on the example in Section 2.1.4. The SIS importance weight variance grows until a single particle has all the mass (Figure (6)), and when machine precision is exhausted, the particle filter is no longer updated (Figure (a)). This is also why the importance weight variance goes to zero. This can also be seen in Figure (c), as the sample size quickly drops to zero. Resampling allows the particle filter to survive the effects of high-dimensional sampling. Chapter 2. SMC and Bayesian Inference 15 theoretically proven in the case of multinomial resampling, but the other methods ap-pear to be as stable, and can reduce the variance caused by resampling considerably. Figure 2.5 contains pseudo-code for the SIS and SIR algorithms. Sequential importance sampling step • For i = 1,N, sample f rom the proposal • For i = 1,N, evaluate the importance, weights Wl = —. '—. i-r —Vll • Normal ise the impor tance weights Selection step ~ ( i ) If Nen < Cr, resample the discrete weighted measure | u j l ) , w ^ j to obta in an unweighted measure yu[l\ jjj of Af new particles. Figure 2.5: Particle filtering algorithm at time t. When Cr = N, this is the SIR algo-rithm (always resample); when Cr = —oo, this is the SIS algorithm (never resample); when 0 < Cr < N, this is SIS with thresholded resampling (resample when needed). 2.3 Particle smoothing SMC also affords solutions to the Bayesian smoothing problem. Generally these al-gorithms are only slightly more complicated than their filtering brethren, but can provide substantially more accurate estimates. Regrettably, these algorithms tend to be under-used, as they all involve computing an interaction between an individ-ual particle estimate and a probability density, which is typically represented by a population of N particles. Since this is done for each particle estimate, the cost is 0(N2)—substantially higher than filtering. In this section, we present three Af2 SMC Chapter 2. SMC and Bayesian Inference 16 smoothing algorithms and show how their cost can be substantially reduced. 2.3.1 Forward-backward smoother We can derive an SMC smoothing algorithm by using the following Monte Carlo ap-proximation to equation (2.4): N p(ut\z1:T) oc ^2 w (i) i=l JV E t+l\T (»)\ u! f c ) )] <S1i(i)(ut). (2.10) We maintain the original particle locations—thus depend on the filtered distribution having support where the smoothed density is significant—but reweight the particles to obtain an approximation to the smoothed density. The weight for particle is then (i) A (i) ™\\T = Wt N E 3=1 (j) t+l\T P(USI ut ) Z^ fc=l wt P \ut+l «?>)] (2.11) The smoothing algorithm is straightforward: first, perform filtering to obtain a weighted f (i) (i) 1 N measure ju£ , wl' j for each t. Next, we recurse from t = T — 1,..., 1 using equa-tion (2.11) to calculate w^T (we define w^T = w^). The set j u f ^ u ^ j is now a weighted measure approximation of p(ut|zi :x) for all t: N p(ut\zhT) w p ( u t | z 1 : T ) = 5^wt(|t*u(o(ut)-i=l Equation (2.11) costs 0(iV 3) operations to evaluate directly, but can easily be per-formed in'0(iV 2) by observing that the denominator of the fraction in equation (2.11) is independent of i, hence can be performed independently from i for each j. 0(N2) is still too expensive, however, so we demonstrate how to evaluate this equation using sum-kernel fast methods. Recall the sum-kernel expression: N for ; = 1,. ..,N ft = £ u{ K ( X i , y , ) . (1.1) Chapter 2. SMC and Bayesian Inference 17 We assign iV . r m i N r •) N . r N perform sum-kernel using these parameters, and assign the results to |o4+i } • Next, I J j=i we assign W : M C ' {^};^K}::, H,1i M ^ M ' ' } , ! ' ^u<i>.|u<'>), perform sum-kernel, and assign the results to |/3 t j . I t is now clear that In the first sum-kernel operation, the u^^ particles are the target particles, while in the second call, they are the source particles. Using these two sum-kernel calls, we can reduce the computational complexity of equation (2.11) (and hence the entire algorithm) from 0(N2) to O(NlogN) or lower. Remark. It should be pointed out that the mapping we have performed cannot be done for arbitrary densities p(-). The kernel K(-) must obey certain properties to be able to apply the fast methods. Different methods are suitable for different kernels; the Fast Gauss Transform, for instance, requires that K(-) be a Gaussian kernel. Some methods additionally impose restrictions on the distribution of the source and target points. We will ignore these details in this part—they will be covered in depth in part II. 2.3.2 Two-filter smoother The computationally-intensive component to implementing the two-filter smoother is calculating the Monte Carlo approximation to equation (2.6). Let ju j^ .u;^ j be the weighted measure approximation produced by filter one (conventional forward Chapter 2. SMC and Bayesian Inference 18 filter), and | u f \ u ; ^ | be the measure obtained by running filter two4 (backward information filter) at time t. The Monte Carlo approximation to the smoothing density is [3]: p ( u t | z i : r ) ( X /\w. N i = l 3 = 1 It <J.(o(ut). (2.12) Applying sum-kernel algorithms to compute equation (2.12) proceeds similarly to the case of forward-backward smoothing. 2.3.3 Maximum a posteriori (MAP) smoother In practical applications, it is often not necessary to estimate the entire distribution p(ui : r |zi : r) . Instead, the goal is to determine the sequence for which the aforemen-tioned distribution is maximized; this is the maximum o posteriori sequence u ^ p — argmaxp(ui :r |zi :r). (2-13) "1:T For continuous state spaces, equation (2.13) almost never admits an analytic form. In [15], Godsill et al. develop an SMC approximation to (2.13). First, standard particle filtering is performed. At time t, the set of particles | u ^ | can be viewed as a discretization of the state space. Since particles are more likely to survive in regions of high-probability, this discretization will be sensible.5 We can approximate (2.13) by finding the sequence of maximum probability on this grid, namely: - M A P A argmax p(ui : r |z i : r) • (2-14) u i ^ e f c L i M 0 } , ^ This approximation still requires exponential computation. However, it can be solved efficiently using dynamic programming. We will present the algorithm in significant detail, as it enjoys a prominent place in the remainder of this thesis. 4Here, we assume that filter two produces an approximation p(ut|z t :r) using an artificial prior 7t according to [3]. This avoids the dangerous implicit assumption that Jp(zi:T|ut) dut < oo as is common in previous literature [24]. 5No claim is made about the optimality of this discretization, but it will be refined in areas where U j f ^ ^ 3 is likely to lie and coarse in areas where the solution is unlikely. Thus it is hoped that the error induced by the discretization is minimal. Chapter 2. SMC and Bayesian Inference 19 Due to the Markov property of the model (Section 2.1), the density in equa-tion (2.13) factors as T p ( u i : T | z i : r ) OC JJ P (zt\ut) P (ut | u t _ l ) , t = l hence equation (2.13) is additive in log space: u ^ p = a r g m a x ^ [logp(z t|u t) + l o g p ( u t | u 4 _ i ) ] . "1:T t = l Thus the Viterbi algorithm [43] can be used to compute (2.14) efficiently. Figure 2.6 shows pseudocode for the algorithm. Step (2) in Figure 2.6 is the cause of the algorithm's N2 cost. This 0(N2T) computation can be implemented in 0(T • NlogN) time by using a max-kernel fast method described in Chapter 5. We can re-write the maximization as 6fc(j) = iogp(z f c u j ^ ) +max <5fc_i(i) + logp(u^ u j ^ ) es"^=p{zk\u^) max [ e ^ « - p j u ^ u ^ ) ] , and thus use max-kernel (equation (1.2)) by setting 1 3 and assigning the results (which are, in the max-kernel case, a vector of maximum influence for each particle and a vector of indices of the source of maximum influence) f 1 N t o |/fc(j))*fc(j) | • It is now clear that, for all j, 5k(j) = log p ( z f c | u ^ ) + log [rk(j) and Chapter 2. SMC and Bayesian Inference 20 0. Filtering. For t = 1,..., T Perform particle filtering to obtain the weighted measure 1. Initialization. For i = 1,... ,N (5i(i) = logp ( u ^ ) + logp(zi u^ ) 2. Recursion. For k = 2,... ,T and j = 1,..., N VfeO') = argmax <5fe-i(i) + logp( 3. Termination. IT = argmax Sx(i) i*-i(0 + logp(u^|u£,) - M A P = 4^ For k = T -1, ..,1 5. Aggregation. ik = ^fc+i (« j fc+i) - M A P _ „ ( i f c ) u f c - u f c - M A P A / - M A P - M A P — M A P \ U 1 : T — \ u l > U 2 ) • • • ) U T / Figure 2.6: Maximum a posteriori particle smoothing algorithm. The algorithm pro-duces an approximation to u ^ p by employing the Viterbi algorithm on the discretized state space induced by the particle filter approximation at time t: ^u[l\w[^ j . Note that the importance weights are ignored—because of this it is useless to use resam-pled (unweighted) particles, as this will result in a strictly coarser discretization. The computational expense of the algorithm is 0(N2T) due to step (2). Chapter 2. SMC and Bayesian Inference 21 2.4 Experiments 2.4.1 M A P smoothing 2.4.1.1 M a i n example We turn to the main example of Section 2.1.4. Figures 2.7 and 2.8 show the gain achieved by performing M A P smoothing in this setting. RMSE was significantly re-duced over the filtered estimate. The particle filter estimate is also more prone to be led astray by outliers in the observations, as having access to the future observations provides evidence that a previous data point was an outlier. 30 20 10 0 -10 -20 -30 0 10 20 30 40 50 Time step Figure 2.7: Particle filter and M A P estimates of the latent state in the 1-D time series experiment (Section 2.1.4). Mean error for the particle filter was 4.05, while the M A P solution achieved a mean error of 1.74. . Figure 2.9 shows efficiency results for implementing the M A P smoother on the main example of Section 2.1.4. Since the state space is one-dimensional, both the distance transform and dual-tree algorithms can be used. Both provide orders of mag-nitude improvement in speed. We give the results in terms of number of distance computations required, because in some applications (particularly in vision), this as-pect computationally dominates the cost of the procedure. Chapter 2. SMC and Bayesian Inference 22 5 0 4 0 3 0 2 0 10 0 - 1 0 - 2 0 N o i s y o b s . T rue state P F es t imate M A P es t imate 10 2 0 3 0 T i m e s tep 4 0 5 0 Figure 2.8: The results of the experiment of Figure 2.7, transformed so that the true state lies on the x axis. This shows the stark contrast between the two algorithms. 2.4.1.2 Real-world application Beat-tracking is the process of determining the time slices in a raw song file that correspond to musical beats. This is a challenging problem: both the tempo and phase of the beats must be estimated throughout the song. This is an typical example of cases where the single best result is desired rather than a distribution over states; the objective is the sequence of maximum probability. M A P smoothing after particle filtering has achieved impressive results in this setting [30]. We omit the details of the probability model for the sake of brevity, but a full explanation is found in [30]. Since the state space of this model is a three-dimensional Monte Carlo grid, the distance transform cannot be used. Figure 2.10 summarizes the results: songs can be processed in seconds rather than hours with this method. Using the fast method also enables more particles to be used, which results in a better solution: the probability of the M A P sequence with 50000 particles was p = 0.87, while using 1000 particles resulted in a M A P sequence of probability p = 0.53. Chapter 2. SMC and Bayesian Inference 23 Particles Figure 2.9: 1-D time series results, displayed in log-log scale. The dual-tree algo-rithm became more efficient than naive computation after approximately 70ms of compute time. Both dual-tree and distance transform methods show similar asymp-totic growth, although the constants in the distance transform are approximately three times smaller. Chapter 2. SMC and Bayesian Inference 24 Particles Particles Figure 2.10: Beat-tracking results: time v. particle count. The dual-tree method becomes more efficient at t = 10ms, and thereafter dominates the naive method. Note the log-log scale. Figure 2.11: Particle-smoothing results on synthetic data, shown on a log-log scale for clarity. For the same computation cost, FGT achieves an RMSE two orders of magnitude lower. We are able to smooth with as many as 4,000,000 particles with this method! Chapter 2. SMC and Bayesian Inference 25 2.4.2 Forward-backward smoothing Forward-backward particle smoothing was tested on a three-dimensional linear-Gaussian state space model populated with synthetic data. By keeping the model sufficiently simple, the particle smoother can be compared to the analytic solution (in this case, a Kalman smoother). Since this is a sum-kernel problem, the acceleration method necessarily introduces error. Thus, we report both the cpu time for the naive and fast method (in this case, the Fast Gauss Transform (FGT)) and the error of both algorithms after a given amount of cpu time. Figure 2.11 houses the results. The data verify that using more particles, smoothed approximately using the FGT, is better in terms of RMSE than using fewer particles smoothed exactly (and lethargically). Chapter 3 P l u s i n t i m e m e n t . Gnossiene No. 2 E R I C S A T I E The Marginal Particle Filter We have seen how Sequential Monte Carlo techniques are useful for state estimation in non-linear, non-Gaussian dynamic models. These methods allow us to approximate the joint posterior distribution using sequential importance sampling. In this framework, the dimension of the target distribution grows with each time step, and thus it is necessary to introduce some resampling steps to ensure that the estimates provided by the algorithm have a reasonable variance. In many applications, we are only interested in the marginal filtering distribution which is defined on a space of fixed dimension. In this chapter, we develop a novel particle filtering algorithm which performs importance sampling directly in the marginal space U of p ( u t | z i ; i ) , hence avoiding the need to perform importance sampling in a space of growing dimension. Using this idea, we also derive an improved version of the auxiliary particle filter (ASIR). We show using synthetic and real-world experiments that these algorithms improve significantly over Sequential Importance Sampling/Resampling (SIR) and Auxiliary Particle Filtering (ASIR) in terms of importance weight variance, and provide theoretical results that confirm this reduction. The marginal particle filter is another example of an N2 SMC algorithm. As usual, fast methods can be applied to make this technique substantially more efficient. 3.1 Marginal Particle Filter SIS estimates p ( u i : t | z i : t ) by taking an estimate of p ( u i : t _ i | z i ; t _ i ) and augmenting it with a new sample u t at time t. Each particle at time t is a draw over the joint space p(ui:t|zi : t), sampled sequentially, thus can be thought of as a path through the state 2 6 Chapter 3. The Marginal Particle Filter 27 space at times 1... t (Figure 2.3). At each time step, the dimension of the sampled paths is increased by the dimension of the state space at time i , quickly resulting in a very high dimensional space. The sequential nature of the algorithm means that the variance is high, leading to most paths having vanishingly small probability. This problem is known as degeneracy of the weights, and usually leads to weights whose variance tends to increase without bound. Remark. It is not trivial to observe that all SIS-like algorithms operate in the joint space. Figure 2.5 contains the pseudocode for SIS; note that this is the algorithm in use by practitioners. Bereft of its derivation, it is not immediately apparent that the algorithm performs importance sampling in p(ui :t|zi :t). One strategy employed to combat degeneracy is to use a resampling step after updating the particle weights to multiply particles (paths) with high probability and prune particles with negligible weight (the SIR algorithm—Section 2.2.1). The Marginal Particle Filter (MPF) uses a somewhat more principled approach. We perform particle filtering directly on the marginal distribution p(ut|zi : t) instead of on the joint space. The predictive density is obtained by marginalizing p(u t |z i : t _i) = J p(ut\ut-i)p(ut-i\zi:t-i)dut-i (3.1) hence, the filtering update becomes p(ui|z 1 : t) oc p(z t |u t )p(u t |z i : t _i) = p(z«|ut) JP(ut|ut-i)p(uf_i|zi;t_i)dut_i. The integral in equation (3.1) is generally not solveable analytically, but since we have a particle approximation of p(u t _i|zi : t _i) ^namely, j u t - i i ^ t - i } ) ' we can approximate (3.1) as the weighted kernel estimate J2jLi wi-iP ( u * | u t - i ) • While we are free to choose any proposal distribution that has appropriate support, it is convenient to assume that the proposal takes a similar form, namely ?(u t |zi : t ) = J2 Wt-lQ (U*|Z*> U i - l ) • (3-2) 3=1 Chapter 3. The Marginal Particle Filter 28 The importance weights are now on the marginal space: p(u t |zi : t ) wt <?(ut|zi:i) Pseudo-code for the algorithm is given in Figure 3.1. Marginal Particle Filter (MPF) • F o r % = 1, ...,N, s a m p l e f r o m t h e p roposa l TV u;' ~ j = l • F o r i = 1,N, e v a l u a t e t he i m p o r t a n c e w e i g h t s (i) • N o r m a l i s e t he i m p o r t a n c e w e i g h t s w ~(i) (») _ wt ' ~ti) Figure 3.1: The MPF algorithm at time t. 3.1.1 Auxiliary Variable M P F The auxiliary particle filter (ASIR), introduced by Pitt and Shepard [39, 1], is de-signed to improve the performance of sequential Monte Carlo in models with peaked likelihoods (which is another source of importance weight variance). In this section, we derive an algorithm that combines both approaches. When the likelihood is narrow, it is desirable to choose a proposal distribution that samples particles which will be in high-probability regions of the observation model. The auxiliary particle filter works by re-weighting the particles at time t — 1 to boost them in these regions. Chapter 3. The Marginal Particle Filter 29 We are interested in sampling from the following target distribution N p(ut\z1:t) o c p ( z i | u t ) ^ u ; J i ) 1 p ^ u t uji\) J'=I (3-3) j=l The auxiliary PF uses the following joint distribution: p{k, ut|zi:t) = p (zt uj*\) p (u t u £ \ , z t) . /c is known as an auxiliary variable and is an index into the mixture of equation (3.3). Thus, p(k\zi..t) ex wf\p[zt u £ \ ) = wf\ Jp(zt\\xt)p{ut u £ \ ) dut (3.4) Since the exact evaluation of (3.4) is usually impossible, we approximate this via what is known as a simulation step. For each index k at time t — 1, we choose u.^ associated with the distribution p(ut|u^\) in some deterministic fashion (/J,^ could be the expected value, for instance). We define the simulation weight1 for index k to be At-i ~ Using these weights, the auxiliary particle filter defines the following joint proposal distribution: q(k,ut\z1:t) = q(k\zi:t)q(ut\z1:t,k) where <"i-lp(* 9(A:|z1:t) = A S , 9(ui|zi:t,A;) = q(ut ( f c ) x To prevent this step from introducing bias into the estimate, the simulation weights must be chosen independantly from u^'. Chapter 3. The Marginal Particle Filter 30 The importance weight is given by p(fc,Ut Z l : t ) ( v w (k, ut) - —— : r (3.5) q(k,ut\z1:t) (k) v.-oc -- l P (z*l US) P (u*l u!-i>z*) In the marginal particle filter, we use the same importance distribution but instead of performing importance sampling between p(k, u t | z i : t ) and q (k, Ut\zi:t), we di-rectly perform importance sampling between p ( u t | z i : t ) and q(ut\zi:t) to compute the weights p(ut\z1:t) w (u*) = —f—i V (3-6) q{ut\zi:t) OC E f = i ^ ( ^ i u a ) p ( u i | u a , z t ) E f = 1 A a g ( u t | U a , z t ) This leads to the auxiliary marginal particle filter (AMPF) which is described in Fig-ure 3.2. We expect that performing importance sampling directly between the distributions of interest will lead to a reduction in variance. It is not hard to show that it can be no worse. Proposition 3.1.1. The variance of the AMPF importance sampling weights w(ut) is less than or equal to ASIR's importance weights w(k,Ut). Proof. By the variance decomposition lemma, we have var [w (k, ut)] = var [E(w (k, ut)\ ut)} + E [var (w (k, ut)\ ut)] = var [w (ut)] + E [var (w (k, ut)| ut)]. Hence, as E[var(ttf(fc,ut)|ut)] > 0 it follows that var [w (ut)] < var [w (k, Ut)]. • Chapter 3. The Marginal Particle Filter 31 Auxiliary Marginal Particle Filter (AMPF) • F o r i = 1 , N , c h o o s e s i m u l a t i o n a n d c a l c u l a t e m i x t u r e w e i g h t s (*) d ( (.*) \ hi <-p[ut J T(i) E ; (*K>) T(i) A t - 1 • F o r i = l , ...,N, s a m p l e f r o m t he p roposa l N U (i) 3 = 1 • F o r i = 1 , T V , e v a l u a t e t h e i m p o r t a n c e w e i g h t s ~ ( » ) to; = • N o r m a l i s e t he i m p o r t a n c e w e i g h t s v ^ / V ~ ( j ) Figure 3.2: The A M P F algorithm at time t. The <— symbol denotes the deterministic selection of a likely value from the distribution, such as the mean or a mode of the density. 3.1.2 Discussion Since the particles at time t are sampled from a much lower dimensional space in the marginal filter algorithms, we expect that the variance in the weights will be signif-icantly less than that of (A)SIR for the same number of particles. Proposition 3.1.1 proves that it is no greater in the auxiliary variable setting, and a similar result holds for SIR. Figure 3.3 demonstrates this with an example. Consider a multi-modal state esti-mate at time t — 1, and a Gaussian transition prior. In (A)SIR, a particle's transition likelihood is relative to the tail end of a path (3.3(a)), while the marginal PF calculates Chapter 3. The Marginal Particle Filter 32 (a) (A)SIR samples a mixture component (b) Marginal filtering uses the entire mix-and uses this to compute importance sam- ture. pling weights. Figure 3.3: Predictive density p(u t |z 1 : t _i) in (A)SIR and Marginal PF. By using a single mixture component, the (A)SIR estimate ignores important details of the dis-tribution; a particle lying in the left mode should be given less weight than one lying in the right mode. the true marginal transition density (3.3(b)). The marginal PF improves over (A) SIR whenever a particle has high marginal probability but low joint probability along its path. This can occur due to heavy-tailed importance distributions or models with narrow or mis-specified transition pri-ors. In contrast, the improvement of MPF over SIR will not be very pronounced if the observation model is peaked (i.e., if likelihood is highly concentrated), as this will in-fluence the importance weights more than the effect of sampling in the joint space. In these cases, A M P F should be used. Figure 3.4 demonstrates the two types of variance reduction. Finally, it should be noted that Sequential Monte Carlo applies to domains outside of Bayesian filtering, and an analogous marginal SMC algorithm can be straightfor-wardly derived in a general SMC context. Note that the evaluation of the proposal (equation (3.2)) must be performed for each sample, thus both MPF and A M P F incur an JV2 cost. As we later show, this can be improved substantially. Chapter 3. The Marginal Particle Filter 33 , x 1 0 " 1.5 0.5 Particle weight var iance 10 j SIR j! — Auxil iary marginal P F Marginal P F | j - - AS IR \t • ii :. ? :. *:' jt • i 20 Timestep 30 40 Figure 3.4: Importance weight variance reduction. "Spikes" in weight variance are caused by unlikely observations. ASIR (red) is successful in smoothing these occur-rences when using SIR (green). The marginal PF (black) reduces overall variance by sampling in a smaller-dimensional space, but still suffers from spikes. A M P F (blue) gains the advantages of both approaches. 3.1.3 Cases of equivalence There is one case where SIR and marginal PF are equivalent. When the transition prior is used as the proposal distribution, then the MPF weight update equation becomes: ..(0 p(*t «&) = p(zt uj l )) . When using SIR, particles are resampled after being weighted, but this is precisely equivalent to sampling from the marginal proposal distribution E j ^ w t - i p ( u t u t - i ) -The SIR weight update equation is ( » ) u r j p p[uKt • 2 . ) w. (0 t - i = p(zt uj l )) (after renormalizing) Chapter 3. The Marginal Particle Filter 34 since w^_l is set to N~l after resampling. In both cases, the conventional likelihood-weighted filter is recovered. Similarly, when performing auxiliary filtering, it may be possible to sample exactly from the optimal proposal q(ut\zt, k) — p(ut\zt,u[k^) and exactly evaluate p(zt\u(j.k^l) (equation (3.4)). In this case the importance weight variance is zero, thus no marginal filtering improvement is possible. 3.2 Efficient Implementation The mapping to marginal filtering is direct. For instance, equation (3.2) can be for-mulated as a sum-kernel operation (equation (1.1)) as follows: K(x.i,yj) = q(ut = yj|u t_i = X j , z t ) . Note that while we have reduced MPF to the same asymptotic complexity as SIR, the constants (while not prohibitive) are certainly higher than in SIR. The use of M P F is preferrable when there is uncertainty about the choice of transition prior p(u t|ut_i), which is common in industrial applications [36]. In many applications (such as vision), the likelihood is much more complex and expensive to evaluate than the transition prior. In these cases, it if often more efficient to use fewer particles with the marginal filter than using many particles with SIR. 3.2.1 Fast methods for A M P F In the auxiliary particle filter, we approximate the integral the following way: J P(2t\ut)p(ut u ^ ) dut^p(zt /44)) (3.7) where /4° 4- p ( u j u j ^ ) . (3.8) This is basically a single-particle Monte Carlo approximation. We can do better using fast methods. First, we draw M samples from the marginal unweighted predictive Chapter 3. The Marginal Particle Filter 35 density: (fc) l £ > ( u ( | u « ) , i = l . . . M (3.9) To evaluate (3.7), we actually want a set of samples distributed according to p ^ U j ,^ not eq. (3.9). Hence, we treat draws from the latter as draws obtained via importance sampling. Our estimate of (3.7) is then: Jp(zt|ut)p(ut u ^ ) du, M p(zt (k)\ ( ( f c ) Pi )V\Pt -420 M We can evaluate this operation efficiently using two sum-kernel operation, as we demonstrated for particle smoothing (Section 2.3.1). Alternatively, it is possible to perform the computation exactly, but using M <^ N. 3.3 Experiments We present several examples of using the marginal particle filter for Bayesian filtering. We compare the algorithms in several respects: the error of the state estimate to the ground truth (when known); the variance of the importance weights; and the unique particle count. High variance is indicative of degeneracy of the importance sampling weights, and affects the precision and variance of the estimator. Unique particle count is a measure of the diversity of the particles. The latter two are both important: it is trivial to construct an algorithm which performs well under either of these measures individually, but the construction will behave pathologically under the other measure. 3.3.1 Multi-modal non-linear time series We apply the marginal particle filter to the main example from Section 2.1.4. Table 3.1 and Figures 3.5, 3.6, and 3.7 summarize the results. MPF improves over SIR slightly in terms of RMSE, and produces a substantial reduction in importance weight variance. Chapter 3. The Marginal Particle Filter 36 P a r t i c l e w e i g h t v a r i a n c e S I R M a r g P F 1 0 2 0 3 0 T i m e s t e p 4 0 5 0 Figure 3.6: 1-D time series; variance of the importance weights. The spikes are the result of unlikely data—MPF does particularly better than SIR in these cases. Chapter 3. The Marginal Particle Filter 37 Table 3.1: 1-D time series; RMS error and weight variance. Method RMS Error (variance) Importance weight variance SIR 2.902 1.03 0.000163 MPF 2.344 0.06 0.000025 600 500 400 300 200 100 0. 0 Unique particles (max = 600) 10 i i 1 SIR 1 i M a r g P F • t t II , 1 11 1 1 1 i '' 1 1 , 1 ,1 , i i 1 i - I I , 1 ,1 1 ^ 1, 1 , i * i ' i i • . i i » , 1 1 1 • > i / 1 i 1 • i i . i • 1 1 " • i 1 1 1 1 'A 1 7 t I v > I \ A''J\ ' . • ' 'A1 1 •A, M ily i r \» ' 'JV r / V r±J V / -20 30 Timestep 40 50 Figure 3.7: 1-D time series; unique particle count. Although the marginal PF generally has better diversity, an unlikely observation can still cause problems (such as at t — 8). 3.3.2 Stochastic volatility Monte Carlo methods are often applied to the analysis of the variance of financial returns as the models involved cannot be solved analytically. One popular model is known as stochastic volatility [27], which can be summarized as: zt = et /3 exp {ut/2} ut = (j>ut-i + rjt where r\t ~ A/"(0,0V,), et ~ N(0,1), and x\ ~ A/"(0,<7 2/(l -(f)1))- We analyze the weekday close of the U.K. Sterling/U.S. Dollar exchange rate from 1/10/81 to 28/6/85. There are 946 timesteps, but we limit analysis to the first 200 for readability, and use Chapter 3. The Marginal Particle Filter 38 the model parameters fit to the data using M C M C 2 in [27]. We use as proposal the transition prior with heavier tails to test the marginal filter's ability to compensate for poor proposals. State estimate 2.5 2 1.5 1 0.5 0 -0.5 SIR Auxiliary marginal PF Marginal PF ASIR Observations X X X '" ' x*> 5 0 100 Timestep 150 200 Figure 3.8: Stochastic volatility model; state estimate. The results are summarized in Figures 3.8, 3.9, and 3.10. Al l results are means over five runs. 3.3.3 Fast methods We compared the MPF implemented naively to an implementation using the fast Gauss Transform (FGT), as it is conceivable that the error introduced by the FGT would offset the variance-reduction benefits that MPF provides. The results in Ta-ble 3.2 indicate that we can increase the precision sufficiently to render this issue moot. (Additionally, we performed all the experiments in the previous section using the approximation techniques.) 2That is, Markov Chain Monte Carlo, which along with SMC comprise the two main classes of Monte Carlo algorithms. Chapter 3. The Marginal Particle Filter 39 ,x 10" Particle weight variance 1.5 0.5 SIR — Auxiliary marginal P F Marginal PF ASIR I I ) !. i: •' ! . . . bi'v1 -'vA..-ii,VV.yjA.^^Vvi,^\ j 50 100 Timestep 150 200 Figure 3.9: Stochastic volatility model; importance weight variance. Marginal filter-ing consistently achieves lower variance, and the best result is obtained by the AMPF algorithm, which combines the advantages of both ASIR and marginal PF. See Fig-ure 3.4 for a closer view of the first twenty timesteps, which more clearly illustrates the interaction among SIR, AMPF, and ASIR. Unique particles 400 350 300 250 200 150. L 1 SIR — Auxiliary marginal PF Marginal PF ASIR •!• •i : ( ! :, [% s if*/ J *Sj7 V 50 100 Timestep 150 200 Figure 3.10: Stochastic volatility, unique particle count. Marginal PF consistently has higher diversity. Chapter 3. The Marginal Particle Filter 40 e N time(s) speedup RMSE le-3 500 naive FGT 0.300 0.181 1.66 1.2684 1.2685 le-3 1500 naive FGT 2.568 0.310 8.28 1.2469 1.2542 le-7 5000 naive FGT 28.14 1.482 19.0 1.2466 1.2466 Table 3.2: Fast methods applied to the marginal particle filter. Al l reported results are the mean values over ten runs. The FGT enables a substantial improvement in speed at the cost of a slight increase in error. For the last test, we substantially decreased the error tolerance of the FGT approximation (which increases the runtime), and were still able to achieve a considerable speedup and RMSE within the variance of the test to the naive. This suggests that the error introduced by the FGT approximation can be made less than the inherent error caused by Monte Carlo variance. 3.4 Summary Particle filtering involves importance sampling in the high-dimensional joint distri-bution even when the lower-dimensional marginal distribution is desired. We have introduced marginal importance sampling which overcomes this deficiency, and have derived two new particle filtering algorithms using marginal importance sampling that improve over SIR and ASIR, respectively. We have presented theoretical and em-pirical results which show that MPF and A M P F achieve a significant reduction in importance weight variance over the joint-space algorithms, and have shown how the ensuing computational burden can be drastically reduced. Part II Acceleration methods Chapter 4 Du bout de la pensee. Gnossiene No. 1 E R I C S A T I E Fast methods To recap: we have explored a variety of N2 algorithms in Sequential Monte Carlo, and have shown how to map the expensive part of their computation to base operations of a certain form. The remainder of this thesis will be about performing those operations quickly. Not all the methods we present will be interchangeable; we will be explicit in detailing the conditions where each method can be used. 4.1 The problem We are given a set of source particles1 X = { X J } ^ 1 with associated scalar weights {i^i}iLi- We are additionally given a set of target particles Y = {yj}^Lv and an influence function (or kernel) that acts on a pair of target and source particles and determines (along with the target weight) the influence between the two particles. The goal is to determine the sum of influence from all source particles on all target particles, or determine the source particle of maximum influence on each target. These are examples of generalized iV-body problems [16] which originated with the computation of the force between astronomical bodies in physics, and deal with the computation of some sort of N2 interaction among N entities. xNote that the term particle here is overloaded with Monte Carlo particles from part I. We apologize for any resulting confusion. 42 Chapter 4. Fast methods 43 4.1.1 Sum-Kernel Sum-kernel (or Sum-Product) is the first problem. To reiterate equation (1.1), we wish to determine N forj = l , . . . , M /,- = X>itf(xi>yj)- (1-1) i = i Note that j runs from 1 to M, rather than 1,.. . , N as was the case in part I; although all the instances of this problem that occur in the former part had the same number of source and target particles, there is no reason that the sum-kernel problem needs to be constrained in such a manner. The naive cost of this algorithm is in this case O(MN). None of the methods we present compute equation (1.1) exactly. Instead, they r ~ i M require an additional parameter e, and have as their output a vector | fj j , where each fj is within a certain error tolerance of the true kernel sum. This can be absolute error, in which case or relative error, where V j , V j , max fi-fi <e, m i n < 1 + e. The sum-kernel problem is sometimes called weighted Kernel Density Estimation (KDE). 4.1.2 Max-Kernel The max-kernel (or Max-Product; weighted maximum kernel) problem aims to deter-mine two quantities for each target particle: the source particle exerting maximum influence, and the influence that particle exerts. I.e., N forj = l , . . . , M /* = max u;jK( X j ,yj) N i* = arg max cji K ( X i , y^). i=l (1.2) Chapter 4. Fast methods 44 The output of a max-kernel algorithm will the two vectors . x> . 1" Unlike sum-kernel, the exact answer is returned (although in some applications might suffice to have a approximate answer, we do not consider such cases in this work). Remark 4.1.1. In the case that all the weights are identical, and K is interpreted as an inverse distance function (such as K = —||x—y||2), then max-kernel is min-distance, i.e., the all-nearest-neighbour problem of statistics. 4.2 Preliminaries 4.2.1 Assumpt ions Al l the methods we cover require that the source and target particle exist in a metric space. Definition 4.2.1. A metric space M. = (V, S) is a tuple containing a set of points V, and a metric 5 : V x V —> [0,oo), such that the following hold for all x , y £ P : 1. <S(x,y) = 0 x = y 2. <5(x,x) = 0 3. <5(x,y) >0 4. £(x, z) < <5(x,y) + S(y, z) (triangle inequality) We will refer to the metric S(-) as the distance function. In some cases, it is not required that the distance function obey the triangle inequality (definition 4.2.1, condition 4). We will indicate when this is the case. Remark. All vector spaces are metric spaces, as a metric is implied by the existence of a norm. For instance, in Euclidean R n space, the norm ||x||2 induces a corre-sponding metric J(x, y) = | | x - y | | | . In Sequential Monte Carlo for Bayesian inference, the particles usually lie in the model state space, which are multi-dimensional vector spaces. Chapter 4. Fast methods 45 We can now state the main assumption underlying fast methods: Assumption 4.2.1 (Kernel assumption). The kernel function K(x,y) is param-eterized by a metric 5, that is, K(x.,y) = K(S{x.,y)). Further, K is non-strictly decreasing or increasing in this parameter. For instance, the Gaussian kernel K(x,y) = exp { -^ | | x - y|| 2} (and trucated versions thereof) is clearly parameterized by the L 2 metric, and decreases as the L2 distance grows. This assumption may seem restrictive, but it encompasses many commonly-used kernels. Besides obvious distance functions like the Lp family of norms, inner-products are also metrics. Hence kernels such as polynomial K(x, y) = (x • y + b)p and sigmoid K(x.,y) = tanh(ax • y — 3) satisfy assumption 4.2.1. From this point on, we will assume that K is decreasing, not increasing, in 8. Remark. An important point to note when using fast methods with SMC is that an abitrary discrete potential function does not satisfy assumption 4.2.1, and hence cannot be accelerated. It is still the case that some types of discrete potentials may be accelerated, if they have some spatial structure. 4.2.2 Spat ia l indices Spatial indices (sometimes called spatial access methods) intelligently subdivide a space into regions of high locality given some concept of distance. The index consists of a number of cells that contain points. To make access efficient, each the points in a cell should satisfy some easily-computable property. For instance, this property when using kd-trees is that the points' value in a certain dimension is less than a certain splitting value. It is extremely common that the cells that partition the particles can be further subdivided to create a more refined partition; spatial indices often consist of a hierarchy of paritions which form a tree. Spatial indices originated in computational geometry, but techniques have also been developed in the database literature to facilitate high-dimensional nearest-neighbour search. In the past five years, this topic has also been pursued in the machine learning Chapter 4. Fast methods 46 community [16, 35, 44]. A full review of spatial indices is beyond the scope of this work; Chavez presents a unifying view in [5]. Instead, we briefly cover the two most-used examples. 4.2.2.1 Kd-trees A kd-tree operates on a multi-dimensional vector field. Cells are created by recur-sively choosing a dimension and a splitting value in that dimension; particles having value less than or equal to that value are placed in one cell, and the remainder in the other. This procedure is now performed recursively on the two cells (note: the two cells may choose different splitting dimensions and/or values). The dimension of largest spread is typically chosen as splitting dimension. Kd-trees are effective in low dimensional settings; a 2-D example is given in Figure 5.4 (not all levels are shown). For dimensionality greater than ten,2 other methods may be more appropriate. 4.2.2.2 Anchors hierarchy and metric trees Metric trees are more relaxed in their requirements than kd-trees: they need only a defined distance metric. Nodes in a metric tree consist of a pivot (a point lying at the centre of the node), and radius. Al l points belonging to the node must have a distance to the pivot smaller than the radius of the node.3 The Anchors hierarchy introduced, by Moore [35] is an efficient means of constructing a metric tree. Unlike kd-trees, metric tree construction and access costs do not have factors that explicitly depend on dimension. Note that they are still vulnerable to the distributional effects of dimen-sionality (the distance histogram tends to be more peaked in high dimensions). Thus they may not outperform kd-trees, even in high dimensions. The Anchors hierarchy construction technique is particularly vulnerable to peaked distance distributions. A 2That is, according to [17]. In our experiments, we have found this depends heavily on the distribu-tion of the data; cases exist where kd-trees outperform supposed "dimensionality-independent" spatial indices for all tested dimensions, and cases where a dimensionality-independent index outperforms kd-trees in as few as six dimensions. We recommend testing both to determine their appropriateness for a given application. 3Note: it is not the case that all points within the radius of the pivot belong to the node. Chapter 4. Fast methods 47 theoretical discussion of this effect can be found in [5]; experiments demonstrative of this effect can be found in Section 5.3.1. This phenomenon is also one of the main reasons why fast methods vary in their performance as the dataset is varied. 4.2.3 Dual-tree recursion Dual-tree recursion, introduced by Gray and Moore [16], is an example of so-called higher-order divide and conquer. It is rather important to iV-body methods, so we will devote some space to its description. We will consider only dual-tree recursion applied to ./V-body simulation (though it could conceivably be used in other ways as well). Hence, we have a set of source particles (X), and a set of target particles (y). Further, there is some concept of influence between source and target particles. The particles lie in a metric space, and we presume that the source particles are indexed in a tree-based spatial index which we call X. A node (cell) in this tree will be called an X-node. 4.2.3.1 Single-tree recursion The preceding setup is sufficient to describe single-tree algorithms, which are another name for the class of algorithms evoked by the mention of "tree-based recursion". We assume the existence of a operation that compares a query point y £ ^ to an X-node, and outputs either include, exclude, or neither. The algorithm proceeds as follows: first, compare y to a node X (initialized to the root of the X tree). If the output is either include or exclude, then the algorithm terminates. Otherwise, the algorithm recurses on the children of X. If there is more than one target particle, then this algorithm will be performed independently for each y € y. The outputs of the comparison operators deserve explanation: • inclusion (or subsumption) The entire node X can be included in the results of the algorithm. • exclusion The entire node X can be excluded from the search algorithm. Chapter 4. Fast methods 48 Example 4.2.1. (Range-Counting) Assume we wish to count the number of source particles within a certain range ry of y. An X-node can be included when the maximum distance from y to the node is less than ry (see Figure 4.1). Hence, all particles in X are within distance ry of y, and we can add the count of particles in X (which we assume is precomputed), to a running count of the result of the algorithm. Similarly, if the minimum distance from y to X is greater than ry, we can immediately exclude all points in X from consideration. If neither of these conditions hold, then X must be further refined—we recurse on its children. It is possible to reach the leaves of the tree (individual particles). It is common that inclusion and exclusion are not both part of a given algorithm. Consider binary search, which is easily seen to be a single-tree algorithm as we have described them. At each step, a node is excluded and one is expanded; no concept of inclusion makes sense in this setting. o o \ o o B - " " a Figure 4.1: Inclusion and exclusion in example 4.2.1 (range counting). Shown are target point y, query radius ry, and three X-nodes, A, B, and C. A can be included and C can be excluded, but B must be further refined (recursed on). 4.2.3.2 From one tree to two In the range-counting example, a very similar set of inclusion and exclusion decisions will be made for two query points that are spatially proximate. Consider adding Chapter 4. Fast methods 49 (c) Dual-tree comparison (d) Dual-tree recursion (one of four) Figure 4.2: Dual-tree and single-tree recursion; 4-2(a), 4-2(b): Single-tree recursion processes a single target particle at a time, and recurses on the children of the X-node. Dotted lines indicate lower bounds on distance, and dashes lines show upperbounds. 4.2(c), 4-2(d): Dual-tree recursion considers several target particles at a time by using node-node bounds. Recursion proceeds on the children of both nodes. Only one pair of the four recursive calls is shown. Chapter 4. Fast methods 50 another query point y' to Figure 4.1 such that y' is quite close to y. It is likely that A would still be included and C excluded for y'. This is central idea of dual-tree recursion: whenever possible, we want to share pruning decisions among target particles. We now assume that there is a tree-based spatial index built on the set of target particles y as well as the source tree on X. Instead of particle-node comparisons, basic operations in dual-tree algorithms will be node-node comparisons—specially, a comparison between a Y-node and an X-node. The comparison results in precisely the same possible outputs (inclusion, exclusion, or neither), and the result of the comparison must hold for every y in said Y-node. If the X-node cannot be included or excluded, then the algorithm recurses on the cross-product of the children of the Y-and X-nodes. In the case of binary trees, the recursion on Y-node A, and X-node B is comprised of four recursive calls, on (A.left, B.left), (A.left, B.right), (A.right, B.left), and (A.right, B.right) (Figure 4.2). This induces the problem of selecting which recur-sive call to perform first; usually this requires resorting to heuristics (the problem of determining the optimal recursion order is exponential for many dual-tree algorithms). Alternatively, all four pairs of nodes can be pushed on a global priority queue, and the pairs evaluated on the basis of a heuristical priority value [17]. Intuitively this ap-proach seems ideal, but maintaining the queue requires substantial overhead, and its use increases the memory consumption of the dual-tree algorithm enormously. Some solutions to this problem are found in [31]. 4.3 Fast methods for sum-kernel We will give relatively short shrift to the description of sum-kernel methods, as we do not make a contribution to this area in this thesis. For more information, see our tech report that presents an empirical comparison of sum-kernel algorithms [32]. For a more in-depth and relaxed presentation, see Dustin Lang's thesis [31]. Chapter 4. Fast methods 51 4.3.1 Fast Gauss Transform If the kernel is Gaussian, then the fast Gauss Transform (FGT) [19, 20] can be used to perform sum-kernel in O(N). This algorithm is an instance of more general fast multipole methods for solving iV-body interactions [18]. We present a brief overview of this algorithm subsequently. Figure 4.3: Illustration of the fast Gauss transform. The contribution of points xi :3 in box B is summarized by a single Hermite expansion about xg. This expansion is then translated to x"c and Taylor expanded to X4 :7. The intuition behind the FGT is illustrated by Figure 4.3. To evaluate the inter-action between N points, we first partition the space. Then instead of considering the individual contribution of each point in a partition, we only consider a single aggre-gated contribution at the centroid of the partition. In this way, if there is a cluster of points far away, this cluster can be interpreted as a single point summarizing the contribution of all the points in the cluster. As shown in Figure 4.3, the partition could be a square grid. This is acceptable if the problem is low dimensional, say x^ g R 3 . More precisely, the FGT works by carrying out a Hermite expansion about the centroid in each partition and then transferring the aggregated effect to other partitions via a Taylor series expansion. Hence, at each source box B, we expand the Gaussian field with a multivariate Hermite series of p terms: 5 Chapter 4. Fast methods 52 where a is a multidimensional index,4 ha(-) is a Hermite basis function, NB is the number of points in partition B, XB is the centroid of partition B and Aa(B) are the series coefficients given by Aa(B) — ^ ^2f=i Qj (^"^p") • We c a n precompute these Hermite expansions for all boxes in 0(pdN) operations where d is the dimension of x and p is the number of terms in the expansion. For each target box C, we transform the Hermite expansions into a single Taylor expansion: B i = l 0<V V V 6 J where = £ £ ^(B)^ (^=p) • Evaluating these Taylor series takes again 0(dPN) operations. 4.3.2 Improved Fast Gauss Transform The improved Fast Gauss Transform (IFGT) [44] replaces the Hermite expansion of the FGT with a Taylor series expansion. This yields numerous advantages, such as improved performance (dramatically so in high dimensions) and removing the require-ment of a single variance parameter. The main disadvantage over the FGT is that the error bound is considerably looser, and if parameters are chosen to fulfill this bound then the algorithm is much slower than even naive computation for dimensions greater than d = 2 [32]. Parameter selection is a major open problem for this method. Further, the error bound in [44] is subtly wrong. Both issues are tackled in [32]. 4.3.3 Dual-tree sum-kernel Dual-tree sum-kernel is more flexible than the previous methods, requiring only as-sumption 4.2.1 to be met (the particles need not even lie in a vector space). This encompasses most continuous densities and some discrete distributions (in the latter 4 A multi-index a = (cu,..., ad) 6 N d is a d-tuple of nonnegative integers such that for any x € Rd, we have x a = x ^ x ? 2 • •- x^", |a| = ai + a2 + ... + ad, a\ = cn\a2\• • • ad\ and ha(x) = hai ( X i ) / l „ 2 ( X 2 ) • • • had (Xrf). Chapter 4. Fast methods 53 case, it depends on the parameters of the distribution). This algorithm is an instance of generalized dual-tree recursion described in Section 4.2.3. As in all dual-tree algorithms, this method first builds a spatial tree on both the source and target points. The algorithm is distinguished in the manner in which node-node comparisons are performed; here we introduce influence bounding, which is the main tool used in kernel-influence-based dual-tree algorithms. Assume we have a node X of source points, and a node Y of target points. We can easily state the kernel influence that all the particles in X have on a particle y in Y: Let W~x be the sum of the weights of the particles in node X. Further, assume that 5min and 5max are the minimum and maximum distances between nodes X and Y, respectively (Figure 4.4). Since K is decreasing in S by assumption 4.2.1, we can bound equation (4.1) for any y by N (4.1) i = i upper bound WxK(8max) < ^ UiKiSfayW^ex < WxK(5min). (4.2) lower bound These bounds can be used to estimate (4.1) as fy(X) « fy(X) = WX K(6min) + K(Smax) 2 (4.3) with absolute error bounded by ey(X) = Wx 2 (4.4) 5, max 6, mm Figure 4.4: Dual-tree bounding for spherical nodes. The bounds of equation (4.2) can be tightened by recursing on the X and Y node using dual-tree recursion as described in Section 4.2.3. At each stage of the algorithm, Chapter 4. Fast methods 54 Method Kernel data dimension error speed FGT Gaussian x G R n up to 3-D absolute O(N) IFGT Gaussian up to 10-D n/a a O(N) Dual-Tree Ass. 4.2.1 x € M thousands abs. or rel. 0(Nlog N)b "Absolute error, but if guaranteed then algorithm is no longer O(N). B[17] claims O(N) recursion time, but it remains unproven to our knowledge. Table 4.1: Summary of sum-kernel methods. each particle y will be associated with a series of X nodes {Xiiy,X2,y, • • •, Xg^y} and thus a bound on the total error introduced by approximating fy as 2~2i=i / y P Q *s> D v (4.4): e ey = J2ey(Xt). (4.5) i=i When equation (4.4) drops below a specified desire error tolerance e, equation (4.3) is used to estimate the sum of influences at that point. A dual-tree sum-kernel algo-rithm using relative error (rather than absolute) is analogous. 4 . 4 Fast methods for max-kernel 4.4.1 The distance t ransform In [12, 14], Felzenszwalb and Huttenlocher derive a fast algorithm for computing the distance transform, which is equivalent to computing max-kernel for a certain class of kernels. This achieves O(N) cost and is very efficient in practice^—the constants hiding in that asymptotic cost are gloriously low. Additionally, the problem is separable in dimensionality, so a d-dimensional transform of Nd points costs 0(dNd). Consider the case of a Gaussian kernel in one dimension. We wish to solve ij = arg max i = arg mm Wi exp - Xi x,; logUi (4.6) Chapter 4. Fast methods 55 Felzenszwalb et al. note that equation (4.6) is upperbounded by a parabola (whose shape is determined by cr, but is the same for all points) anchored at ( X J , /(«)), where f{i) = — loga>;. Hence the distance transform at a point yj is the lowest parabola induced by an x point at yj , and the entire distance transform is defined by the lower envelope over M of all such parabolas. Figure 4.5: Lower envelope computed with the distance transform. The algorithm has two steps: First, determine which parabolas comprise the lower envelope and where they occur (Figure 4.5). Next, sample each query point to deter-mine the distance transform at that point. The distance transform was originally developped for regular discrete grids—in that setting it achieves O(N) runtime. It also extends to a exponential Nd regular grid in d dimensions for a cost of 0(dNd). Finally, a similar algorithm can be used for kernels exponential in the L\ distance. In this case, the intersecting geometric shapes are cones rather than parabolas. It is easy to calculate distance transforms for truncated versions of the kernels we mention, as well as linear combinations of these kernels. 4.4.1.1 Extension to irregular grids in 1-D While the distance transform was designed to work exclusively on regular grids, it is easily extended to irregular grids in the one-dimensional case, for a small increase in cost. 0(N log N) algorithms for computing the lower envelope of one-dimensional axis-aligned parabolas of arbitrary shape exist [8] and can be applied in this case. Chapter 4. Fast methods 56 Since the parabolas in the distance transform setting are identically-shaped, we can modify the original algorithm to obtain a slightly simpler procedure compared to [8] (though having the same computational complexity). Assume we are given source particles { x i , . . . , x^r} and target particles { y i , . . . , YM}-The first step of the algorithm is to compute the lower envelope of the parabolas an-chored at {XJ} . This step is unchanged, save that the x particles need to be pre-sorted at a cost of 0(N log N). The second step is to calculate the value of the lower envelope at each y particle. This can be done by either pre-sorting the y particles, or employing binary search on the lower-envelope, which costs 0(M log M) or 0(M log N) respec-tively. Unfortunately, this extension only applies to the one-dimensional case. Other means must be used to compute higher-dimensional max-kernel problems on Monte Carlo (irregular) grids. A final item is needed to use this method with arbitrary Gaussian kernels with bandwidth cr2. The x value of intersection between two parabolas anchored at (p, f(p)) and (q,f(q)) respectively is given by: T _ (<?2f(p)+P2)-(v2f(q)-q2) 2(P-? ) a minor change from the formula in [14]. Chapter 5 Dans une grande bonte. Gnossiene No. 2 E R I C S A T I E Dual-tree max-kernel In Chapter 4, we reviewed a variety of methods for performing fast sum- and max-kernel operations. The reader may have noted the relative paucity of the max-kernel section compared to the sum-kernel algorithms. Fewer methods exist for the max case, in part due to the smaller number of applications. Max-kernel problems arise in many settings of maximum a posteriori inference—this includes the M A P particle smoothing algorithm described in Section 2.3.3, but is also applicable to the Viterbi algorithm for HMMs, and M A P belief propagation. We also note (remark 4.1.1) that max-kernel is a generalization of the all-nearest-neighbour problem, and hence acceleration methods for the former can apply to the latter.1 The most important existing prior work on accelerating max-kernel are the tech-niques developped by Felzenszwalb and Huttenlocher using the distance transform (Section 4.4.1, [12, 13, 14]). These techniques are extremely efficient, but suffer from two major limitations: • They are confined to a certain class of kernel functions. • They are only applicable to state spaces embedded in a regular grid of parame-ters.2 Monte Carlo methods, such as MCMC and particle filters, have been shown to effectively adapt to examine interesting regions of the state space, and can achieve better results than regular discretizations using fewer support points [2, 40]. Problems requiring high-dimensional inference are ubiquitous in machine learning and are best Although there are usually more specialized (and hence more efficient) algorithms for solving nearest-neighbour problems, the asymptotic complexity is rarely improved. 2This limitation is not present in the one-dimensional case. 57 Chapter 5. Dual-tree max-kernel 58 attacked with Monte Carlo techniques, as regular discretizations grow exponentially and quickly become intractable. In this chapter, we address the need of fast algorithms for computing weighted max-kernel on so-called Monte Carlo or irregular grids. In particular, we develop a new algorithm based on dual-tree recursion to exactly solve the max-kernel problem. We derive the method in the context of kernels pa-rameterized by a distance function,3 which represent a broad class of frequently used kernel functions, including Gaussians, Epanechnikov, spherical, and linear kernels, as well as thresholded versions of the same. Our method can also be used to accelerate other spatial-based kernels (such as K(pc,y) = x-y) , and problems that have multiple kernels over different regions of the state space, but we restrict our attention to the simpler and more common case in this work. Our empirical results show that our algorithm provides a speedup of several orders of magnitude over the naive method, becoming more efficient after as little as 10ms of cpu time. Further, we stress that our method is a more general solution to the max-kernel problem that the distance transform, as it strictly subsumes the class of problems to which the latter algorithm is applicable. However, while the dual-tree algorithm still compares favorably to naive computation on discrete grids where both algorithms can be applied, we find that the distance transform algorithm boasts smaller constants in these cases, being 3 to 4 times faster. The performance of algorithms based on dual-tree recursion as N grows is relatively well-understood; see Gray and Moore [17] and Ihler [23]. However, we have found that the performance of this family of techniques also depends heavily on other variables, such as the data distribution, the dimensionality of the problem, and the choice of spatial index and kernel. We present several experiments to investigate these effects, and we believe that the conclusions can be generalized to other pruning-based dual-tree algorithms. 3That is, kernels satisfying assumption 4.2.1. Chapter 5. Dual-tree max-kernel 59 5.1 The algorithm Recalling our objective, we wish to compute equation (1.2): N for j = 1,..., M f* = max u>i K ( X J , yj) J i=i N ij = arg max UiK{xi,yj), (1.2) where X j G X are the source particles, yj € y the target particles, and K a kernel function. In this section we develop an algorithm for solving max-kernel when the following conditions are met: 1. (X U y, 5) is a metric space, for some distance function S. 2. K satisfies assumption 4.2.1. The algorithm is an instance of dual-tree recursion techniques, hence requires tree-based spatial indices built on X and y, called X and Y respectively. For each X-node, we assume the existence of cached statistics on the weights: w*(X) (the maximum weight of a source particle in X) and a sorted list of the particle weights (only needed for leaf nodes). These can be computed efficiently while building the spatial index. The spatial indices provide upper and lower bounds on the distance between an X -node and Y-node, denoted 5U(X,Y), 8l(X,Y) = 6^(X,Y) . Like the dual-tree sum-kernel algorithm, it is based on influence bounding. 5.1.1 Influence bounding Consider a source node X and target particle y. If we had an upper bound on the influence of any particle in X on y, we could potentially prune X from consideration (if, for instance, we had already found a particle whose influence was greater than the upper bound). We know that all particles in X have distance at least Sl(X,y) from y, and have weight no more than u*(X). Since K decreases in 5, we can upper-bound the influence Chapter 5. Dual-tree max-kernel 60 Figure 5.1: Given a node X and a particle y, we derive bounds on the maximum influence of a particle in y. on y by a particle in X as follows: max cjiK (8(xi,y)) < u*(X)K(5l(X,y)). (5.1) i ; x ,€X If this upper bound is less than the pruning threshold for y (which we call r y ) , then X can be pruned. The pruning threshold can be obtained by lower-bounding the influence any particle in X can exert on y. We can do slightly better than that by lower-bounding the influence of the particle of maximum influence on y in X. Less convolutingly, we want a lower bound on the influence of the source particle of maximum influence on y (/*), given that the source particle lies in X. This is given by max .UiK(5(xi,y)) > u*(X)K(6u(X,y)). (5.2) The intuition for the bounds in equations (5.1) and (5.2) is that the particle of maxi-mum weight in X4 can be moved freely in the space defined by the node (which may be a hypersphere, hyper-rectangle, or a ball in a metric space). The bounds are obtained when this particle is as close (or far) from y as possible within this region. The lower bound of equation (5.2) can be interpreted as a guarantee that we can find a particle in X that exerts influence on y at least as great the bound. Hence we 4Which is not necessarily the particle of maximum influence on y in X. Chapter 5. Dual-tree max-kernel 61 can set y's pruning theshold to that bound and discard X nodes which have an upper bound on influence that does not meet this threshold. This is sufficient to derive a single-tree algorithm; a dual-tree algorithm is obtained by considering bounds on node-node, rather than node-particle, distances. The algorithm proceeds by doing a depth-first recursion down the Y tree. For each node, we maintain a list of X-nodes that could contain the best particle (candidates). For each X-node we compute the lower and upper bounds of the influence of the maximum particle in the node on all points in the Y-node (f^l'Uf(X, Y)) by evaluating the kernel at the upper and lower bound on the distances to particles in the node ( 0 W } ( x , Y ) ) . In each recursive step, we choose one Y child on which to recurse. Initially, the set of X candidates is the set of candidates of the parent. We sort the candidates by their lower bound, which allows us to explore the most promising nodes first. For each of the candidates' children, we compute the lower bound on distance and hence the upper bound on influence. Any candidates that have upper bound less than the pruning threshold are pruned. For those that are kept, the lower influence bound is computed; these nodes have the potential to become the new best candidate. The influence bounds tighten as we descend the tree, allowing an increasingly number of nodes to be pruned. Once we reach the leaf nodes, we begin looking at individual particles. The candidate nodes are sorted by lower influence bound, and the particles are sorted by weight, so we examine the most promising particles first and minimize the number of individual particles examined. In many cases, we only have to examine the first few particles in the first node, since the pruning threshold often increases sufficiently to prune the remaining candidate nodes. Figures 5.2 and 5.3 contain pseudo-code for the algorithm. For a given node in the Y tree, the list of candidate nodes in the X tree is valid for all the points within the Y-node, which is the secret behind the efficiency of dual-tree recursion. In this way, pruning decisions are shared among Y points when possible. Of course, this limits the quality of the bound. At this point, a smaller set (children of Y) is considered to refine the bound. Chapter 5. Dual-tree max-kernel 62 It is clear that the faster the lower bound can be tightened, the more opportunities for pruning nodes will be afforded. Hence, any time we evaluate a set of nodes or points, we first sort the set in descending order based on its lower bound. In many cases, this quickly improves the bound and allows many of the less-promising candidates to be pruned without being expanded. 5.2 Performance in N We turn to empirical evaluation of the dual-tree algorithm. In this section, we focus on performance in synthetic and real-world settings as ./V grows; comparisons are made both in settings where the distance transform is applicable and where it is not. We present results in terms of both cpu time and number of distance computations (kernel evaluations) performed. This is important as in some applications the kernel evaluation is extremely expensive and thus dominates the runtime of the algorithm. We use the dual-tree algorithm to accelerate the maximum a posteriori particle smoothing algorithm described in Section 2.3.3. We performed experiments both in a synthetic model and real-world (beat-tracking) setting. The figures appear in Sec-tion 2.4.1. For the synthetic experiment, we chose a one-dimensional setting so that the dual-tree algorithm could be directly compared against the distance transform (using the modified algorithm from Section 4.4.1.1). Figure 2.9 summarizes the results. It is clear that the distance transform is superior in this setting, although the dual-tree algorithm is still quite usable, being several orders of magnitude faster than the naive method. The real-world setting uses a three-dimensional state space, so the distance trans-form cannot be applied. Figure 2.10 shows that the dual-tree method significantly improves runtime in this application. Chapter 5. Dual-tree max-kernel 63 I N P U T S : root nodes of X and Y trees: Xr, Yr. A L G O R I T H M : 1 leaves = {} 2 candidates = {Xr} 3 max_recursive(Yr, leaves, candidates, —oo) F U N C T I O N max_recursive(Y, leaves, candidates, 7 y ) : 4 if (isleaf (Y) AND candidates = {}) //Base Case: reached leaves (see figure 5.3). 5 max_base_case(Y, leaves) 6 else // Recursive case: recurse on each Y child. 7 foreach y G children(Y) 8 Ty = T y 9 valid = {} 10 foreach p € candidates // Check if we can prune parent node p. 11 il(u*(p)K(6l(p,y)) <Ty) 12 continue 13 foreach x € children (p) // Compute child bounds. 14 / { u ' ' } (x) = UJ*(X) K (a:, y)) // Set pruning threshold. 15 Ty = max (ry , max ^ / ' (x )^ 16 valid = valid U {x G children (p) : fu(x) > ry} 17 valid = {x G valid : fu(x) > ry} 18 leaves^ = {x € valid : isleaf (x)} 19 candidates^ = { i £ valid : N O T (isleaf (x))} 20 sort(leavesy by / ') 21 max-recursive (y, leaves ,^ candidates^, ry) Figure 5.2: Dual-tree max-kernel algorithm, part 1. Given a Y-node and pruning threshold, we compute the upper and lower bounds on influence /^ u ,^(x) for each x. This allows us to prune X-nodes and tighten r. If we encounter a leaf X-node, we do not examine it further until we've reached a leaf in the Y tree (Figure 5.3). Chapter 5. Dual-tree max-kernel 64 FUNCTION max_base_case(Yieaf, leaves): 1 foreach x G leaves 2 /W>( x ) = u*{x)K (5^ [x,Y)) 3 T y = max ( / ' ( x ) ) 4 leaves — {x G leaves : / u (x) > r} 5 sort (leaves by / ') // Examine individual y points. 6 foreach y € Y 7 T y = r y 8 foreach x € leaves // Prune nodes by Y (cached), then by y. 9 if (/V) < Ty O R u*{x) K (6l (x, y)) < r y ) 10 continue // Examine individual x particles. 11 foreach € x // Prune by weight. 1 2 i f ( r w ^ R < T y ) 13 break 14 / y (xi) = Wi • K(5(xi ,y)) 15 if ( / y (xi) > r„) // i is the new best particle. 16 T y = / ; = / y ( x i ) 17 «* = z Figure 5.3: Dual-tree max-kernel algorithm, part 2. We first calculate the influence bounds of the candidates, find the pruning threshold, and prune. We then examine individual y points. We prune based on the cached bound (for the parent node Y) and then on y. Next, we look at individual particles in the node x. The particles are sorted by weight, so when a particle is reached whose weight leads to an influence below threshold, we can skip the remaining particles. Finally, we compute the influence of the particle and decide if it is the new best. Chapter 5. Dual-tree max-kernel 6 5 Figure 5.4: Example of pruning the max-kernel algorithm (for a single Y particle). The candidate (dark) and non-candidate (light) nodes are shown. In the bottom-right plot, a close-up of the six final candidate nodes is shown (dashed). The single box whose particles are examined is shown in black. The subset of the particles that are examined individually is shown in black. There were 2000 particles in X, of which six nodes (containing 94 particles total) were candidate leaf nodes. Of these, only six particles from the first node were examined individually. Chapter 5. Dual-tree max-kernel 66 o rt <u eg -50 rt i—i bO O -100 66 Candidate Boxes 0> O I — I tuO O J - 3 It -©- Examined Boxes -*- Examined Particles - A - Skipped Particles •a- Skipped Boxes Objects Figure 5.5: Dual-tree max-kernel example. Top: the influence bounds for the nodes shown in Figure 5.4. The pruning threshold at each level is shown (dashed line), along with the bounds for each candidate node. Bottom: pruning at the leaf level: in the example, six leaf nodes are candidates. We begin examining particles in the first box. As it happens, the first particle we examine is the best particle (the correct answer). Pruning by particle weight (the upper marker) allows us to ignore all but the first six particles. The pruning threshold is then sufficiently high that we can prune the remaining candidate nodes without having to examine any of their particles. Chapter 5. Dual-tree max-kernel 67 5.3 The effect of other parameters 5.3.1 Distribution and dimensionality To examine the effects of other parameters on the behaviour of the dual-tree algo-rithm, we ran several experiments varying dimensionality, distribution, and spatial index while keeping N constant. We used two spatial indices: kd-trees and metric trees (built using the Anchors hierarchy) as described in Section 4.2.2. We generated synthetic data by drawing points from a mixture of Gaussians distributed uniformly in the space. Figure 5.6 shows a typical clustered data distribution. In all runs the number of particles was taken to be N — 20000, and the dimension was varied to a maximum of d = 40. Figures 5.7 and 5.8 show the results for CPU time and distance computations, respectively. In these figures, the solid line represents a uniform dis-tribution of particles, the dashed line represents a 4-cluster distribution, the dash-dot line has 20 clusters, and the dotted line has 100. As results shown are means over ten runs. We expect methods based on spatial indexing to fare poorly given uniform data • ••.••.•'!-,I:;'i1.V"--.r1i,f.-,i%S-t' ••' ••:<^$si^3&:--::-Figure 5.6: Synthetic data (JV = 20000) with c = 20 clusters. since the average distance between points quickly becomes a very peaked distribution in high dimensions, reducing the value of distance as a measure of contrast among points. The results are consistent with this expectation: for uniform data, both the Chapter 5. Dual-tree max-kernel G8 Dimens ion Figure 5.7: Time v. dimensionality. For clarity, only the uniform distribution and one level of clustered data (c — 100) are shown. This experiment demonstrates that some structure is required to accelerate max-kernel in high dimensions. Figure 5.8: Distance computations v. dimensionality. The data is less clustered (c = 20) than in Fi gure 5.7. For kd-trees, clustering hurts performance when d < 8. Chapter 5. Dual-tree max-kernel 69 kd-tree and Anchors methods exceeded 0(N2) distance computations when d > 12. More surprising is that the kd-tree method consistently outperformed the anchors hi-erarchy on uniform data even up to d = 40; the depth of a balanced binary kd-tree of 20000 particles and leaf size 25 is ten, so for d > 10 there are many dimensions that are not split even a single time! This is likely due to the data distribution having such a low variance that for every query point, the entire tree needs to be expanded (i.e., virtually no pruning occurs). Of more practical interest are the results for clustered data. It is clear that the distribution vastly affects the runtime of dual-tree algorithms; at d — 20, performing max-kernel with the anchors method was six times faster on clustered compared to uniform data. We expect this effect to be even greater on real data sets, as the clustering should exist on many scales rather than simply on the top level as is the case with our synthetic data. It is also interesting to note the different effect that clustering had on kd-trees compared to metric trees. For metric trees, clustering always improved the runtime, albeit by only a marginal amount in low dimensions. For kd-trees, clustering hurt performance in low dimensions, only providing gains after about d = 8. The difference in the two methods is shown in Figure 5.9. 5.3.2 Kernel bandwidth To measure the effect of different kernels, we test both methods on a 1-D uniform distribution of 200,000 points, and use a Gaussian kernel with bandwidth (a) varying over several orders of magnitude. The source weights were generated randomly from U([0,1]). The number of distance computations required was reduced by an order of magnitude over this range (Figure 5.10). This result was the inverse of the trend we were expecting. Intuitively, we expected the pruning to become more efficient as the bandwidth narrowed, since near points will have much stronger influence compared to farther points in this case. The explanation for this result soon became clear. Consider unweighted max-kernel. In this case, the strength of the kernel between two points is only important relative to the strength Chapter 5. Dual-tree max-kernel 70 1*- - S" - J _ — f - -e-§ - -<>-- * - anchors -o- kd-tree BK--4 1*-•-*•• anchors -e • kd-tree CD — • -B>- -133 © " " r * " ' ' — f ^ - - ^ . _ _ - f f i - _ ~ ^ > Dimension Figure 5.9: Time v. dimensionality; ratio to kd-tree = 1. Metric trees are better able to properly index clusters: the more clustered the data, the smaller dimensionality required for metric trees to outperform kd-trees (d = 30 for somewhat-clustered data, d = 15 for moderately-clustered data, and d = 6 for significantly-clustered data). The cross-over point is indicated with a pale gray vertical line. Chapter 5. Dual-tree max-kernel 71 Figure 5.10: Effect of kernel choice: distance computations v. bandwidth of a Gaussian kernel. The work done by the algorithm drops to 10% of its maximum value as o is varied, attaining a point where it is almost indistinguishable from the distance transform. The distance transform is effectively independent of this variation. of the kernel between other points. Thus the magnitude is irrelevant, and unweighted max-kernel will perform identically under different kernel bandwidths. In the weighted case, however, there is interplay between the source weights and the kernel; nodes with a low maximum weight can more easily be pruned. Thus, as the kernel bandwidth is increased, the importance of the weights relative to the kernel strength in determining max influence is increased, and thus more pruning opportunities are afforded. Our choice of a uniform distribution of source weights reduces the contrast be-tween the nodes as well, as only the maximum weight in each node is meaningful for making pruning decisions. But the maximum value of N uniform draws from [0,1] is distributed with mean and variance as follows:5 N N N2 mAX = WTi CTMAX = >v^"(ivTi? ; the maximum weight in most nodes is very close to 1, making it hard to use the weights Chapter 5. Dual-tree max-kernel 72 to prune nodes. 5.4 Conclusion and an application 5.4.1 M a x i m u m a posteriori belief propagat ion Although we have focused on applications of max-kernel in Sequential Monte Carlo, we present its use in M A P belief propagation due to its importance and widespread use. Given a graphical model with latent variables u i : t , observations z i : t , and potentials ipkh 4>ki the joint probability distribution is admitted: k,l k • We are interested in computing the maximum a posteriori estimate, M A P / i N u i : t = argmaxp(ui :t|zi : t). U l : t We can use the standard max-product belief propagation equations for message passing and marginal (belief) computation [38]. The message from node / to node k is given by: mik{uki) = mb<pk(uij,zi)ihkl(uki,uij) JJ mri(uij), (5.3) 3~ reU{l)-k where J\f(l) — k denotes the neighbours of node / excluding k. We can re-write equa-tion (5.3) as a max-kernel problem by setting N . ( ( -\N A r i J V {Ui\i=i ~ { ^ ( " ' i ' 2 ' ) II mn(uij)} = 1 , K(*i,yj) = ^ w ( « f c i , « y ) -reM{l)-k 5Let be uniform draws from [0,1]. We want [AMAX = E {maxij} = x Pr(x = max a;,) da;. 1 Jo ( N \ N~1 J JJ a; = NxN~l. Hence, Nx dx = pj ^ • The variance calculation is similar. Chapter 5. Dual-tree max-kernel 73 5.4.2 Relaxation of kernel assumption A wide range of kernels are covered by assumption 4.2.1, but it still limits the applica-bility of dual-tree methods. Fortunately, there are cases in which this assumption can be relaxed. The requirement of the presence of the triangle inequality and monotonic-ity in a distance function are actually just requirements that influence can be bounded in some manner. For an efficient algorithm, these bounds should be cheap to compute. An example is the case of multiple kernels, which can all be evaluated at the endpoints of a node, and the maximum (or minimum) taken at each endpoint to achieve bounds. Another example would be a situation in which different kernels operate in different regions of the state space (a common requirement in vision applications). 5.4.3 Summary Weighted maximum-kernel problems are common in statistical inference, being used, for instance, in belief propagation and M A P particle filter sequence estimation. We develop an exact algorithm based on dual-tree recursion that substantially reduces the computational burden of this procedure for a wide variety of kernels. It is particularly important when the state space lies on a multi-dimensional Monte Carlo grid, where, to our knowledge, no existing acceleration methods can be applied. The method we present speeds up the inner loop of belief propagation, which means that it can be combined with other acceleration methods such as node pruning and dynamic quantization [6] to achieve even faster results, albeit at the expense of the loss of accuracy that those methods entail. The techniques we present could also be integrated seamlessly into hierarchical BP [12]. We also look at the other variables that affect the performance of dual-tree re-cursion, such as dimensionality, data distribution, spatial index, and kernel. These parameters have dramatic effects on the runtime of the algorithm, and our results suggest that more exploration is warranted into these effects—behaviour as N varies is only a small part of the story. Chapter 6 Portez cela plus loin. Gnossiene No. 3 E R I C S A T I E A better(?) algorithm Algorithms based on dual-tree recursion all have a similar structure. A spatial index is built on the source and target particles, and each node is decorated with data that describes the properties of the particles contained in the node. Sum-kernel, for instance, requires that each source node maintain the value of the sum the weights of the contained particles. This data, called sufficient statistics [35], is used along with the constraint that the particles lie in the spatial region defined by the node to produce bounds, which enable nodal inclusion or exclusion (Section 4.2.3). For the sum- and max-kernel algorithms, these bounds are bounds on the kernel influence exerted on target particles by source particles. In many cases the bound calculations are relatively rudimentary; little aside from the spatial constraint of the node is used. One route for improving the performance of a dual-tree algorithm is thus to make use of more comprehensive sufficient statistics to calculate better node-node bounds. We expect that these better bounds will be more expensive to compute, but the hope is that fewer nodes will ultimately need to be expanded to achieve the desired error bounds (sum-kernel) or find the particle of maximum influence (max-kernel). In the chapter we develop a "better bounds" max-kernel algorithm. At each node, we store the distance from each particle to the node's centre of mass. When calculating bounds, we enforce these distances as a constraint (i.e., we assume a particle x lies on a shell of distance dx from the centre of mass), hence achieving tighter bounds. This might sound expensive, but in the Gaussian kernel case, a one-dimensional distance transform can be used to drastically accelerate calculation of bounds; the new bounds are less than twice as costly to compute as the traditional dual-tree bounds. We present 74 Chapter 6. A better (?) algorithm 75 two new algorithms that use the improved bounds: single-tree distance transform (STDT) and dual-tree distance transform (DTDT). Unfortunately, although the new bounds can be up orders of magnitude better, the impact of this in terms of improving the runtime of the algorithm is marginal. We show that the new bounds reduce the number of kernel evaluations by as much as 50%, though the total computation time is not significantly reduced in most cases due to the increase in overhead. Based on this discouraging result and a similar failed attempt at an improved sum-kernel algorithm [31], we conjecture that improving node-node bounds in dual-tree algorithms is an exercise unlikely to produce significant algorithmic improvement. It does, however, shed significant light on the behaviour of dual-tree algorithms. 6.1 Bounding in dual-tree algorithms Dual-tree and single-tree recursion operate by using distance-related bounds to in-clude or exclude nodes in the source and target spatial index trees. In general, these bounds are the solutions to optimization problems given a set of constraints. Consider the range-counting example (example 4.2.1) with metric trees as spatical indices. We wish to compare an X-node with centre cx and radius rx to a Y-node with cen-tre cy and radius ry. Determining the bounds is equivalent to solving the following optimization problem: Optimize maxo"(x, y) and min<5(x,y) x,y x,y Subject to the constraints S(y:,cx) < rx and o~(y,cy) < ry. It is reasonable to assume that additional constraints will lead to bounds which are closer to the true value determined by the actual distribution of particles in the nodes. In the limit, we could store the position of every particle in a node, and "constrain" the optimization problem to particles lying at these positions.1 To usefully improve 1Such an exercise would yield the exact answer, but the resulting optimization problem will have complexity equivalent to the original iV-body problem, hence achieving no computational gain. Chapter 6. A better(?) algorithm 76 bounds in dual-tree algorithms, we must find sufficient statistics which are both cheap to compute and which result in optimization problems which are efficiently-solved. 6.1.1 A bounds-tightening regime for max-kernel The dual-tree max-kernel algorithm described in Chapter 5 has as sufficient statistic the maximum weight of a particle in an X-node, u*(X). The optimization problem is in this case (assuming X and Y are rx- and ry-balls, respectively): Optimize2 extrema < max u5j • K(5(5ci,y)) > Subject to the constraints X — {-x.i,Ui}f=l X is an arbitrary set of weighted particles; (6.1) 5(%,cx) < rx all X j lie in X; (6.2) Vxi , Zi<oj*(X) all X j have weight less than a maximum; (6.3) 3 X j , uij = u*(X) the maximum weight is attained; (6.4) ^(y,Cy) < ry y lies in Y. (6.5) Instead of allowing that the set of source particle be of arbitrary cardinality, with arbitrary weights (less than a maximum), and arbitrary position within X, we will impose significant constraints on the position of the points. First, we calculate the centre of mass3 of the node, and the distance of each source particle to the centre of mass (di = <5(x;,mx))- We also constrain the weights to be those of the particles in X. These distances and weights become spatial constraints for the optimization: 2We switch to using the notation extrema to mean finding both the global maximum and minimum of this function. 3We use the unweighted centre of mass, which minimizes the distances to this point. The weighted centroid may be more effective in the case of high weight variance in the node, as it should force high-weight particles to be closer to the centroid. If the spatial index is a metric tree, the anchor (node centre) could also be used, but using the centre of mass leaves us the freedom to use any spatial index. Chapter 6. A better(?) algorithm 77 Optimize extrema { max Ui • K(S(xi,y)) Subject to the constraints X — { X J , u)i}, X is a particle set of size exactly Nx; (6.6) Vxi , 5(xi,mx) = 5(xi,mx) Xi is distance di from mx; (6.7) V X j , tDj = LOi particle weights are exactly specified; (6.8) <Ky,Cy) < rY y lies in Y. (6.9) These constraints restrict the possible data distribution considerably; we exactly spec-ify the number of particles rather than allow unboundedly many, exactly specify their weights, and significantly constrain their position.4 In fact, rotation on a hypershell is the only freedom remaining. 6.1.2 Solving the optimization The constraints we have chosen induce a more difficult optimization problem than does vanilla max-kernel. It remains, however, efficiently solveable. Consider the following geometric interpretation, which assumes the particles exists in a vector space. Consider a two-dimensional X-node containing particles that can rotate around the centroid. Let a be an line originating at a query point q and intersecting the node's centroid. Any particle x can be rotated about the centroid toward q to lie on a. This rotation cannot increase the distance to from x to q (and thus correspondingly decrease the influence between x and q). Consequently, the maximum influence on q after rotating all the particles cannot be less in this configuration. Finding the max-influence in this configuration gives a upper bound on the max-influence in the original node. Figure 6.1 demonstrates this graphically. 4The new constraints are not strictly more restrictive, however, as constraint (6.7) may not imply constraint (6.2) in some cases (for instance, a metric tree node where the centroid is far from the node's centre). In our experiments we have not found this to be problematic, but contraint (6.2) can be added to the new constraint set if desired. Chapter 6. A better (?) algorithm 78 Figure 6.1: An upper bound on maximum influence obtained by rotating the particles in X. We are now required to solve a one-dimensional max-kernel problem to obtain this bound, at a cost of O(Nx)- Further, the rotation required by each particle depends on the position of q. However, the one-dimensional position of the particles along a is the same for any q! The only change is the distance from q along a. But this is a case of many queries to the same set of one-dimensional source points, which is precisely the situation where the distance transform can be applied. Hence, we store for each X-node the lower envelope of parabolas precomputed using the distance transform (Figure 6.2). When we need to compute a bound, we query the cached lower-envelope based on the distance of the query point to the centroid <5(mx,q). This reduces the cost from 0(Nx) to 0(logit!), where R < Nx is the number of points that comprise the lower envelope, often significantly less than the number of points in the node. Figure 6.2: Lower envelope using the distance transform. To obtain a lower bound, one can similarly visualize the rotation of particles away Chapter 6. A better (?) algorithm 79 from q, but still onto a (Figure 6.3). • q Figure 6.3: An lower bound on maximum influence obtained by rotating the particles in X. This produces a one-dimensional distribution on o which is the inverse of of the distribution in Figure 6.1. Thus there is no need to compute a lower-envelope of parabolas, instead we can just query the upper-bound envelope with -(5(mx,q) as the distance.5 Remark. The, intuition to which we have appealed is plausible in Euclidean space, but the concept of "rotation" is nonsensical in an abstract metric space. We can, serendipitously, make a similar argument in the general case by invoking the triangle inequality. Consider a centroid c, source particle x and target particle q. Rotating x on the axis intersecting q and c is, essentially, assigning x to a new position, x', where <S(x', c) = 5(x, c) and r5(x', q) = 6(q, c) - <5(x, c). The distance of the new position x' to q must be no greater than S(x, q), as 5(q, x') < S(q, x) + <J(x, c) + S(c, x') (triangle inequality) < <*(q,x), hence the influence on q must be no less than x (see Figure 6.4). The proof for rotating away is analogous. 5In the common case that the influence of the kernel dominates the effect of the source weights, the answer to this query is almost always the particle closest to the centroid. In these cases, linear scan can be faster than binary search for querying. Chapter 6. A better (?) algorithm 80 c X Figure 6.4: Bounding in a metric space 6.2 The algorithm We used the bounding strategy of the previous section to implement single-tree and dual-tree versions of the max-kernel algorithm. The preprocessing step for both is the same: after the spatial index is built on the source particles, it is decorated with the relevant sufficient statistics. Specifically, for each X-node we compute • the centre of mass (or centroid): cx • the distances of all particles X{ G X to cx '• di • the one-dimensional lower envelope of the distance transform for points {di, This additional preprocessing does not add additional asymptotic complexity to the cost of building and decorating the tree in the original algorithm; the overhead is approximately 10-20% (see experiments). The changes to the recursion phase of the algorithm are minimal. Instead of using bounds on the distance between a target particle and an X-node, the target's distance to the centroid is computed, and the upper and lower bounds on the maximum influence are computed using the distance transform. In the dual-tree case, we have an X-node and a Y-node. For the upper-bounding, the query point is taken to be the closest point to X in Y, and for lower-bounding, the farthest. This algorithm is not significantly more complicated to implement than the original dual-tree max-kernel algorithm. Chapter 6. A better (?) algorithm 81 6.2.1 Discussion Sharper upper bounds should lead to nodes being pruned earlier in the search, which saves their children being expanded and considered separately. Tighter lower bounds lead to a faster increase in the pruning theshold which also leads to more aggressive pruning. Additionally, it is reasonable to expect that these two effects synergize. It is difficult to quantify how much improvement in bounds will be attained, or (more importantly) how much additional pruning will occur over the original algo-rithm. There are two scenarios in which we can see the potential for large gains. One is the case of a large spread of particle weights in a node: in this case, the weight is not distributed evenly in the node, and constraining the position of the high-weight particles should prove effective. Another scenario is a node with relatively evenly-distributed weight; distance is more important than weight in this case. Here, the lower bound is improved significantly, as any particle close to the centroid will still be close to the centroid after rotation away from the query point. An example of this behaviour is in Figure 6.3—the distance used in the bound is reduced by about half the span of the node. Unfortunately, the effects of dimensionality reduce the effectiveness of the improved bounding. First, the area of a d-dimensional hypershell grows exponentially in d, which makes the bounds increasingly loose. Further, for a uniform d-dimensional particle distribution, an increasing fraction of a node's particles will be located near the boundary of the node. There will be, in expectation, 2d~l times more particles of distance greater than r/2 from the centre than particles of distance less than r/2 from the centre. The only consolation is that real datasets typically exist on a low-dimensional manifold; it is not uncommon for a hundred-dimensional real dataset to behave like a uniform distribution of dimension < 10 [35]. 6.3 Results We tested the STDT and DTDT algorithms in a variety of synthetic settings. Al l data is generated uniformly distributed in [0, l]d with source weights drawn from U ([0,1]). Chapter 6. A better (?) algorithm 82 We used a Gaussian kernel with two bandwidths. In all tests, we used metric trees created using the Anchors hierarchy method. We measured cpu time, distance compu-tations (kernel evaluations), the total number of leaf nodes examined, and number of pruned X-nodes. In one experiment, we additionally measured the number of pruned node by their tree depth. Legend labels "dual-tree" and "single-tree" always refer to the original algorithm and "STDT" / "DTDT" are used for the better bounds algo-rithms described in this chapter. In some tests, we distinguish between the preprocessing time and recursion time. The preprocessing time includes building the spatial indices and decorating the nodes with the relevant sufficient statistics. 6.3.1 STDT results The most improvement was seen in the single-tree case. Distance computations are reduced by half; cpu time is reduced by up to 30%. The better bounds algorithm was faster for dimensionality up to d — 12 (of uniformly-distributed data6). 6.3.2 DTDT results The results for DTDT are considerably less impressive. The gain in distance compu-tations is considerably more modest, and the algorithm is only faster than the original algorithm in a few cases. This is due to two factors: First, as dual-tree recursion is more efficient than single-tree, the preprocessing takes up more time relative to the whole, hence the increase in preprocessing runtime is that more significant. Also, since dual-tree bounding is performed on a pair of nodes, we are only tightening one "half" of the bounds, so to speak. Dual-tree bounds must be looser, as they are shared among a group of target particles. 6While this may seem low, real data is rarely on a manifold of dimensionality this high. As shown in Figure 5.7, even the original dual-tree algorithm is slower than naive computation at uniformly-distributed data at d = 12. Chapter 6. A better (?) algorithm 83 ,x 10 Single-tree, narrow bandwidth a Single-tree (total time) j 8 STDT (total time) A — Single-tree (recursion) / ,<\ 7 — STDT (recursion) / ' ' ' <D 6 Single-tree(preproc) / - ' / -E STDT(preproc) / *' ^ <D 5 l 4 " 3 2 1 0 0.5 1 1.5 2 N y m 5 8 7 6 •I 5 0) 3 4 Q. E o 3 2 1 0. x 10 Single-tree, wide bandwidth Single-tree (total time) STDT (total time) — Single-tree (recursion) STDT (recursion) — Single-tree(preproa) STDT(preproc) / A /* A T /y / „ ± = : : : : : : : : ' : ) 0.5 1 N 1.5 2 y m 5 Figure 6.5: STDT, d = 3, h = 0.1/1, time v. N; The recursion time is reduced by as much as 30%, while incurring minimal additional preprocessing cost. The gain is less in the case of a stronger kernel, which indicates that better bounds produce more improvement when the particle weights are important relative to the kernel. 10' §io 6 o O 10 Single-tree, wide bandwidth Single-tree (dist comps.) -STDT (dist comps) Single-tree (leaves examined STDT (leaves examined) 10' 10' Figure 6.6: STDT, h = 1, d = 3, DCs/leaves v. N; Better bounds reduces total distance computations by half, and examines but a third of the leaf nodes considered by the original algorithm (note log/log scale). Chapter 6. A better (?) algorithm 84 10' jjj 10 ° 1 0 3 Single-tree, wide bandwidth Single-tree, wide bandwidth Single-tree (total time) STDT (total time) Single-tree (recursion) STDT (recursion) Single-tree(preproc) STDT(preproc) 10 10 10 10' Single-tree (DCs) STDT (DCs) Single-tree (leaves) - - - STDT (leaves) Single-tree(particles) STDT(particles) x -,-•'* ^ 10 10 10 10 Figure 6.7: STDT, h = 1, N = 10000, performance v. d\ Left: cpu time as dimension-ality increases. For d > 2, the better bounds algorithm is faster (note log/log scale). In higher dimensions, the recursion time dominates the total run-time of the algo-rithm. At d — 12, the original algorithm is more efficient. Right: The relative savings in distance computations, leaves, and particles examined decreased at dimensionality increases, as we expect. Figure 6.8: DTDT, h = 1, d = 3, cpu time v. TV; The performance gains are signifi-cantly less pronounced than for single-tree. This is partially due to the preprocessing being a more important part of the algorithm's runtime. Chapter 6. A better (?) algorithm 85 . X 10 Dual-tree, narrow bandwidth o o Dual-tree (DCs) DTDT (DCs) 0.5 1.5 2.5 x10 Dual-tree, wide bandwidth 1.5 o O 0.5 Dual-tree (DCs) DTDT (DCs) 0.5 1.5 2 v m 5 Figure 6.9: DTDT, h = 1, d = 3, distance computations v. JV; The DTDT algorithm achieves slightly less than half as many distance computations compared to original algorithm in this setting. D D .0 Figure 6.10: DTDT, h = 1, N = 10000, performance v. d; Left: DTDT is only more efficient in terms of cpu time when d = 3,4 (note log/log scale). Right: The gain is distance computations is modest (which explains the anemic cpu-time performance). After d — 6, there is effectively no gain in distance computations or number of exam-ined leaf nodes (though there is a slight gain near d = 10; see Figure 6.11). Chapter 6. A better(?) algorithm 86 Figure 6.11: DTDT, h = 1, TV = 10000, relative comparison; Improvement (in terms of distance computations) of DTDT compared to the original algorithm. Included for comparison are versions of the DTDT algorithm implemented using the new bounds for upper-bounding and lower-bounding only. The effect of using both is greater than the sum of using the bounds individually (for instance, at d = 3). The improvement diminishes significantly as dimension increases. 1 0 - i , , , , 1 1fy i , , , , 1 0 5 10 15 20 25 0 5 10 15 20 25 depth depth Figure 6.12: DTDT, pruned nodes v. tree depth; Left: A single run. Right: Mean over 10 runs. In both cases, the DTDT algorithm is able to prune nodes effectively at a depth one or two higher in the tree than the original algorithm. The bimodality is an artifact of the algorithm avoiding processing leaf X-nodes until a leaf of the Y tree is reached. Chapter 6. A better (?) algorithm 87 6.4 Summary We demonstrate a strategy for obtaining tighter bounds in the dual-tree max-kernel algorithm. In particular, we show a specific set of constraints on the composition of a node of source points leads to an optimization problem that can be solved efficiently using the distance transform. We find that the sharper bounds do not afford an appreciable increase in perfor-mance. The dual-tree version was particularly resistant to efficiency gains. At best, half as many distance computations were performed, which resulted in a cpu time gain of 30% (less for dual-tree). We have only presented synthetic results, but the algorithm was additionally tested on the particle smoothing applications of Chapter 2, and there was little-to-no difference between using the orginal algorithm and DTDT (we did not test the performance of single-tree algorithms on real-world examples, as they are dominated by their dual-tree counterparts). We believe that significant performance gains for dual-tree algorithms will be diffi-cult to attain via better bounds. Dual-tree recursion is, essentially, a bounds-tightening strategy: loose-bounds are improved by considering smaller nodes. In this chapter, we use sufficient statistics that result in bounds that are considerably tighter while being only slightly more expensive to evaluate; the improvement gained was, at best, a constant factor. Chapter 7 C o n s e i l l e z - v o u s s o i g n e u s e m e n t . Gnossiene No. 3 E R I C S A T I E Postamble This work is centred around two key goals: first, an exposition of a computationally expensive class of algorithms that are often ignored due to their cost—despite providing more accurate results than their less expensive brethren; next, a discussion of a set of fast methods for accelerating the aforementioned class of algorithms. Hopefully this will lead to a wider recognition of the value of the techniques, and allow them to shed the stigmata associated with them. 7.1 N2 Sequential Monte Carlo All the algorithms we present are Sequential Monte Carlo (SMC) sampling algorithms. SMC techniques represent probability distributions empirically, using a set of N par-ticles that are draws from the distribution. Monte Carlo methods are of great interest among academics and practitioners as they provide means for solving a great deal of thorny problems that do not admit analytic solutions, and boast theoretical con-vergence proofs in many cases. Further, Monte Carlo sampling is essential for high-dimensional applications, as it is one of the few techniques that escapes the curse of dimensionality. Probably one of the most important and common applications of SMC is in the setting of Bayesian inference in state-space models. In Chapter 2, we introduced Bayesian state estimation and SMC techniques for performing inference in this setting. Bayesian filtering has an SMC analogue called the particle filter, which runs in 0(N). Bayesian smoothing can provide a much more accurate estimate of state as it takes future observations into account. Unfortunately, all particle smoothing algorithms 88 Chapter 7. Postamble 89 suffer from 0(N2) cost. This cost arises because each particle interacts with a density represented by N particles. Chapter 3 present a new particle filtering strategy called marginal filtering. The standard sequential approach to obtain a sample from the filtered state Ut is to per-form importance sampling in the joint space (which yields a sample u i ; t ) , and drop-ping the indices for = 1, . . . ,£ — 1 to obtain the desired sample. Sampling from high-dimensional spaces is inefficient, so this procedure adds variance to the resulting estimate. We derive an algorithm for sampling directly in the marginal space (which does not grow in time as the joint space does), and show both empirically and theo-retically that this algorithm provides a reduction in importance weight variance. To sample in the marginal space, it is necessary to integrate out the state in the previous timestep. This requires a particle-distribution interaction, which costs 0(N2). In these two chapters, we concentrated on the Bayesian state estimation prob-lem for clarity of presentation and to appeal to a broader audience. It is, however, straightforward to apply the same ideas in the context of general SMC [37, 22, 21]. The marginal sampling techniques also apply equally-well here, but the same types of N2 algorithms arise. Further, although we have not discussed implementation details associated with individual models, the form of the models in some applications can also induce N2 algorithms. An example is particle filtering in game A l [29], where the algorithm is unnecessarily weakened by the choice of a poor proposal to avoid 0(N2) computation. Model parameter estimation can also be accomplished with SMC meth-ods; this also involves N2 cost [9]. In every one of these cases, the N2 cost is the result of one of two base operations, called sum-kernel and max-kernel. We show how the SMC algorithms can be inter-preted in the context of these operations. If the cost of these operations is reduced, the algorithms' complexity is also reduced. In part II we present a gamut of techniques which can reduce the cost of sum-kernel and max-kernel to 0(N'logN)—in some cases 0(N). We believe that this revolutionizes the potential usefulness of the algorithms cursed with iV 2 stigmata. Further, we expect that this will encourage research into better SMC algorithms by Chapter 7. Postamble 90 reducing the need for concerns about the higher complexity that these algorithms generally entail. We close with a cautionary note. First, the fast methods have restrictions on their applicability. Although covering a wide range of common settings, they do not apply to the most general sum-kernel or max-kernel problem. The most important example is a discrete-state H M M with arbitrary discrete transition potentials. Second, due to the overhead of fast methods, the resulting accelerated algorithms are still slower than the notoriously-efficient O(N) particle filter. Thus, it is not an automatic decision to use iV 2 algorithms (though it should be no longer an automatic decision to reject them!). 7.2 Fast methods In Part II, we focused on sum- and max-kernel and means of accelerating them. A central idea is that of dual-tree recursion, which describes a family of techniques that efficiently evaluate iV-body problems (i.e., problems which are described by a set of source particles exerting an influence on a set of target particles). Dual-tree recursion works by locating nodes of source and target particles that are spatially proximate at various levels of refinement. Source nodes are compared to target nodes; it may be possible to deal with the influence of all the particles in the nodes simultaneously, saving the need to examine individual particles. If not, nodes at a finer level of granularity are considered. For sum-kernel, the fast Gauss transform and improved fast Gauss transform can be used when the kernel is Gaussian. Both exploit series expansions to achieve perfor-mance gains. Alternatively, a more general class of kernels can be accelerated using a dual-tree algorithm based on bounding node-node influence based on the distance between the nodes. Max-kernel for state spaces lying on a regular grid can be computed efficiently using techniques based on the distance transform. This is generally not applicable to SMC algorithms, as their entire purpose is to generate a particle set which non-uniformly Chapter 7. Postamble 91 covers the state space. We show how the distance transform can be simply extended to handle irregular grids, but only in one dimension. Accelerating higher-dimensional max-kernel is left unsolved. As such, in Chapter 5 we present a novel algorithm for max-kernel in general state spaces. We can upper-bound the influence of a node of target particles to a node of source particles by considering the minimum distance between them. If this bound is lower than a threshold (determined by lower-bounding the max influence), the entire node can be pruned. Since we can evaluate node-node interactions, we can use dual-tree recursion. Our experiments show that this algorithm enables substantial performance gains in synthetic and real-world experiments, though it is less efficient than the distance transform where the latter algorithm can be applied. Lastly, we develop an enhanced bounding strategy for max-kernel. By increasingly constraining the distribution of particles in a node, the upper and lower node-node bounds can be tightened considerably. Further, the distance transform can be applied to compute the bounds extremely quickly. We found that the new algorithm reduces the distance computations performed by as much as 50%, which is not terribly impres-sive: the new bounds have higher cost, and thus the new strategy is only occasionally faster than the original dual-tree max-kernel algorithm, and then only slightly. This is a gloomy result. Worse, in [31], we developed a bound-improving strategy for sum-kernel (which was similar but based on a different type of spatial constraint), and the results were similarly anemic: though the bounds were up to orders of mag-nitude tighter, the overall runtime of the algorithm was not significantly affected. A possible explanation is that dual-tree recursion is, in an essential respect, a means for taking loose node-node bounds and tightening them by considering smaller nodes. Thus improving node-node bounds in other ways does not yield notable gains. We conjecture that any attempt at improving bounding by leveraging richer sufficient statistics will yield at best marginal improvements in runtime. Of course, to be wrong would be delightful. Bibliography 92 Bibliography [1] C Andrieu, M Davy, and A Doucet. Improved auxiliary particle filtering: Appli-cation to time-varying spectral analysis. In IEEE SCP 2001, Signapore, August 2001. [2] N Bergman. Recursive Bayesian Estimation: Navigation and Tracking Applica-tions. PhD thesis, Department of Electrical Engineering, Linkoping University, Sweeden, 1999. [3] M Briers, A Doucet, and S Maskell. Generalised two-filter smoothing for state-space models. In Submitted, 2005. [4] Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state space models. In Submitted, 2005. [5] E. Chavez, G. Navarro, R. Baeza-Yates, and J.L. Marroquin. Searching in metric spaces. ACM Computing Surveys, 33(3):273-321, September 2001. [6] J M Coughlan and H Shen. Shape matching with belief propagation: Using dynamic quantization to accomodate occlusion and clutter. In GMBV, 2004. [7] N de Freitas. Rao-Blackwellised particle filtering for fault diagnosis. In IEEE Aerospace Conference, 2001. [8] O. Devillers and M . Golin. Incremental algorithms for finding the convex hulls of circles and the lower envelopes of parabolas. Inform. Process. Lett., 56(3):157-164, 1995. [9] A Doucet, C Andrieu, and V B Tadic. Online sampling for parameter estimation in general state space models. In Proc. IFAC SYSID, 2003. Bibliography 93 [10] A Doucet, N de Freitas, and N J Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001. [11] P Fearnhead. Sequential Monte Carlo Methods in Filter Theory. PhD thesis, Department of Statistics, Oxford University, England, 1998. [12] P Felzenszwalb and D Huttenlocher. Efficient belief propagation for early vision. In CVPR, 2004. [13] P Felzenszwalb, D Huttenlocher, and J Kleinberg. Fast algorithms for large-state-space HMMs with applications to web usage analysis. NIPS, 2003. [14] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Distance transforms of sam-pled functions. Technical Report TR2004-1963, Cornell Computing and Informa-tion Science, 2004. [15] S J Godsill, A Doucet, and M West. Maximum a posteriori sequence estimation using Monte Carlo particle filters. Ann. Inst. Stat. Math., 53(l):82-96, March 2001. [16] A Gray and A Moore. 'N-Body' problems in statistical learning. In NIPS, pages 521-527, 2000. [17] A Gray and A Moore. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining, 2003. [18] L Greengard and V Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325-348, 1987. [19] L Greengard and J Strain. The fast Gauss transform. SIAM Journal of Scientific Statistical Computing, 12(l):79-94, 1991. [20] L Greengard and X Sun. A new version of the Fast Gauss transform. Documenta Mathematica, ICM(3):575-584, 1998. Bibliography 94 [21] Firas Hamze and Nando de Freitas. A sequential monte carlo approach to infer-ence and normalization on pairwise undirected graphs of arbitrary topology. In Submitted, 2005. [22] Yukito Iba. Population Monte Carlo algorithms. In Transactions of the Japanese Society for Artificial Intelligence, volume 16, pages 279-286, 2001. [23] A T Ihler, E B Sudderth, W T Freeman, and A S Willsky. Efficient multiscale sampling from products of Gaussian mixtures. In NIPS 16, 2003. [24] M Isard and A Blake. A smoothing filter for condensation. In Proceedings of the 5th European Conference on Computer Vision (ECCV), volume 1, pages 767-781, 1998. [25] S J Julier and J K Uhlmann. A new extension of the Kalman filter to non-linear systems. In Proc. of AeroSense: The 11th International Symposium on Aero space/Defence Sensing, Simulation and Controls, Orlando, Florida, volume Multi Sensor Fusion, Tracking and Resource Management II, 1997. [26] R E Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Eng., 82:35-45, 1960. [27] S Kim, N Shephard, and S Chib. Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies, 65(3):361-93, 1998. [28] G Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5:1-25, 1996. [29] Mike Klaas, Tristram Southey, and Warren Cheung. Particle-based communica-tion among game agents. In AIIDE, 2005. [30] D Lang and N de Freitas. Beat tracking the graphical model way. In NIPS 17, 2004. [31] Dustin Lang. Fast methods for inference in graphical models (and beat tracking the graphical model way). Master's thesis, University of British Columbia, 2004. Bibliography 95 [32] Dustin Lang, Mike Klaas, and Nando de Freitas. Empirical testing of fast kernel density estimation algorithms. Technical Report TR-2005-03, Dept of Computer Science, UBC, February 2005. [33] D Q Mayne. A solution of the smoothing problem for linear dynamic systems. In Automatica, volume 4, pages 73-92, 1966. [34] N Metropolis and S Ulam. The Monte Carlo method. JASA, 44(247):335-341, 1949. [35] A Moore. The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data. In 11 Al 12, pages 397-405, 2000. [36] R Morales-Menendez, Nando de Freitas, and David Poole. Estimation and control of industrial processes with particle filters. In American Control Conference, 2003. [37] Pierre Del Morel, Arnaud Doucet, and Gareth Peters. Sequential Monte Carlo samples. In in revision, 2005. [38] J Pearl. Probabilistic reasoning in intelligent systems: networks of plausible in-ference. Morgan-Kaufmann, 1988. [39] M K Pitt and N Shephard. Filtering via simulation: Auxiliary particle filters. JASA, 94(446):590-599, 1999. [40] C P Robert and G Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 1999. [41] M N Rosenbluth and A W Rosenbluth. Monte Carlo calculation of the average extension of molecular chains. Journal of Chemical Physics, 23:356-359, 1955. [42] Jaco Vermaak, Arnaud Doucet, and Patrick Perez. Maintaining multi-modality through mixture tracking. In International Conference for Computer Vision (ICCV), 2003. Bibliography 96 [43] A J Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. In IEEE Trans. Info. Theory, volume IT-13, pages 260-269, 1967. [44] C Yang, R Duraiswami, N A Gumerov, and L S Davis. Improved fast Gauss transform and efficient kernel density estimation. In ICCV, Nice, 2003. 97 Index N2 stigma, i i , 1, 88 Anchors hierarchy, 46, 66, 68, 80 ball tree, see metric tree Bayesian filtering, 1, 5, 7-8, 35, 87 smoothing, 1, 5, 8-10, 87 state estimation, ii , 2, 5, 87 Belief Propagation (BP), 71 distance function, 44, 72 distance transform, 53-57, 61 distribution joint, see filtering, joint distribu-tion dual-tree recursion, 47-49, 51, 52, 57, 58, 73, 74, 89 effective sample size, 13, 14 Fast Gauss Transform, 22, 49-51, 89 Improved (IFGT), 51, 89 fast methods, 87 filtering distribution, 5, 6, 26 joint, 5, 6 marginal distribution, see filtering distribution particle, see particle filter Gaussian-mixture filter, 8 importance weights, 3, 5, 12, 20, 35 influence, 42, 47, 72, 73 function, 42 joint distribution, see filtering, joint dis-tribution Kalman filter, 5, 8 Extended, 8 Unscented, 8 kd-tree, 45-46, 66, 68 kernel, 42, 52 assumption, 45, 72 Kernel Density Estimation, 43 marginal distribution, see filtering distribu-tion Marginal Particle Filter, 2, 26-28 Markov process, 6 max-kernel, 2, 19, 43-44, 53-72, 88 unweighted, 68 Max-Product, see max-kernel Index 98 MCMC, 38 metric space, 44, 58, 59, 78 metric tree, 46, 66, 74, 80 nearest-neighbour, 56 all-, 44, 56 particle fast methods, 3 filter, 1, 5, 10, 14, 56, 87 marginal filtering, 2 Monte Carlo, 1, 11 smoother, 1, 6, 15-19, 86, 87 source, 17, 42, 54, 58 target, 17, 42, 54, 58 predictive density, 8 Sequential Importance Sampling, 12 Sequential Monte Carlo, 1, 5 single-tree recursion, 47-48, 60 smoothing Bayesian, see Bayesian smoothing forward-backward, 9, 16-17, 22-25 MAP, 18-22, 56, 61 two-filter, 9-10, 17-18 spatial index, 45-46, 73 sufficient statistics, 73, 79, 81, 86, 90 sum-kernel, 2, 17, 22, 43, 49-52, 56, 58, 73, 88 Sum-Product, see sum-kernel Viterbi algorithm, 19, 20, 56
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Exorcising N2 stigmata in Sequential Monte Carlo
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Exorcising N2 stigmata in Sequential Monte Carlo Klaas, Mike 2005
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Exorcising N2 stigmata in Sequential Monte Carlo |
Creator |
Klaas, Mike |
Publisher | University of British Columbia |
Date Issued | 2005 |
Description | Sequential Monte Carlo (SMC) has, since being "rediscovered" in the early 1990's, become one of the most important inference techniques in machine learning. It enjoys a prominent place in statistics, robotics, quantum physics, as well as control and other industrial applications. SMC methods represent probability densities as a dicrete set of N Dirac masses called particles. This non-parametric representation provably converges to the true distribution of interest and is effective in high dimensional applications. Many sophisticated SMC algorithms require 0(N2) computation, which is viewed as impractically-expensive by researchers in the field. Hence, N2 SMC algorithms possess what we call "N2 stigmata"—they are simply "too slow." This thesis aims to exorcise these stigmata. We present a survey of areas where these algorithms occur and show that in every case their expense results from having to compute a sum-kernel or max-kernel operation. Both are of a class of operations called iV-body problems which occur in physics, statistics, and machine learning. We show how the techniques developed in this field can be used to accelerate sum- and max-kernel (and consequently the SMC algorithms), reducing the cost to 0(Nlog N)—in some cases O(N). Using these methods, iV2 Monte Carlo algorithms should be applicable in a much wider range of settings. Along the way, we introduce new SMC algorithms for marginal filtering and a novel algorithm for drastically accelerating max-kernel. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2009-12-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051326 |
URI | http://hdl.handle.net/2429/16586 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2005-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2005-0509.pdf [ 5.4MB ]
- Metadata
- JSON: 831-1.0051326.json
- JSON-LD: 831-1.0051326-ld.json
- RDF/XML (Pretty): 831-1.0051326-rdf.xml
- RDF/JSON: 831-1.0051326-rdf.json
- Turtle: 831-1.0051326-turtle.txt
- N-Triples: 831-1.0051326-rdf-ntriples.txt
- Original Record: 831-1.0051326-source.json
- Full Text
- 831-1.0051326-fulltext.txt
- Citation
- 831-1.0051326.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051326/manifest