Nonparametric Bayesian Models for Markov Jump Processes

by

Ardavan Saeedi

B.Sc., Sharif University of Technology, 2007
M.Sc., Sharif University of Technology, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Statistics)

The University of British Columbia (Vancouver)

August 2012

© Ardavan Saeedi, 2012

Abstract

Markov jump processes (MJPs) have been used as models in various fields such as disease progression, phylogenetic trees, and communication networks. The main motivation behind this thesis is the application of MJPs to data modeled as having complex latent structure. In this thesis we propose a nonparametric prior, the gamma-exponential process (GEP), over MJPs. Nonparametric Bayesian models have recently attracted much attention in the statistics community, due to their flexibility, adaptability, and usefulness in analyzing complex real world datasets. The GEP is a prior over infinite rate matrices which characterize an MJP; this prior can be used in Bayesian models where an MJP is imposed on the data but the number of states of the MJP is unknown in advance. We show that the GEP model we propose has some attractive properties such as conjugacy and simple closed-form predictive distributions. We also introduce the hierarchical version of the GEP model; sharing statistical strength can be considered as the main motivation behind the hierarchical model. We show that our hierarchical model admits efficient inference algorithms. We introduce two inference algorithms: 1) a “basic” particle Markov chain Monte Carlo (PMCMC) algorithm, which is an MCMC algorithm with sequences proposed by a sequential Monte Carlo (SMC) algorithm; 2) a modified version of this PMCMC algorithm with an “improved” SMC proposal. Finally, we demonstrate the algorithms on the problems of estimating disease progression in multiple sclerosis and RNA evolutionary modeling. In both domains, we found that our model outperformed the standard rate matrix estimation approach.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 INTRODUCTION
2 BASIC NOTIONS
  2.1 Markov Jump Process (MJP)
  2.2 Gamma Process
    2.2.1 MGP and Dirichlet process (DP)
  2.3 DP Definition and Properties
    2.3.1 Definition
    2.3.2 Conjugacy and posterior distribution
    2.3.3 Stick-breaking construction
    2.3.4 CRP definition
    2.3.5 DP mixture model
  2.4 Summary
3 RELATED MODELS
  3.1 Hierarchical Dirichlet Process
    3.1.1 Motivation
    3.1.2 Definition
    3.1.3 Chinese Restaurant Franchise
    3.1.4 Stick-breaking representation for the HDP
  3.2 HDP-HMM
    3.2.1 Background on HMM
    3.2.2 Definition
  3.3 Sticky HDP-HMM
  3.4 Dependent Dirichlet Process (DDP)
  3.5 Summary
4 GAMMA EXPONENTIAL PROCESS
  4.1 Definition
  4.2 Conjugacy Property
  4.3 Predictive Distribution
5 HIERARCHICAL GAMMA EXPONENTIAL PROCESS
  5.1 Motivation
  5.2 Definition
  5.3 Connection to CRF
  5.4 Predictive Distribution
6 INFERENCE
  6.1 MCMC Algorithms
  6.2 PMCMC Algorithm
  6.3 SMC Algorithm
  6.4 SMC Algorithm Proposal Distributions
    6.4.1 The predictive distribution as the proposal distribution
    6.4.2 An “improved” proposal distribution based on a modified predictive distribution
7 LIKELIHOOD MODEL
8 EXPERIMENTS
  8.1 Qualitative Analysis of the Prior
  8.2 Quantitative Analysis
    8.2.1 Synthetic data experiment
    8.2.2 RNA evolution modeling
    8.2.3 Multiple sclerosis (MS) disease progression
    8.2.4 Application in estimating disease progression in MS
  8.3 Comparison of Inference Algorithms
9 CONCLUSION
Bibliography

List of Tables

Table 3.1: Variables used in the CRF
Table 6.1: Notation used in the PMCMC algorithm
Table 8.1: A sample dataset with symbols A, T, C and G having the same structure as datasets used in the experiments
Table 8.2: Summary statistics and mean error results for the experiments. All experiments were repeated 5 times.

List of Figures

Figure 2.1: An illustration of notation for samples from MJPs. We assume the state space (Ω) is countable. The notation for the observations Y(t1), . . . , Y(tG) is described in Chapter 6.
Figure 2.2: A realization G of a DP with base measure G0 and support Ω. A partition (A1, A2, · · · , AK) and a point y sampled from G are also illustrated.
Figure 3.1: A graphical representation for an HMM
Figure 3.2: Generative mechanism for the state transitions in the iHMM model
Figure 4.1: Generating samples of µθ from MGP(H0, β0). We use superscript i instead of subscript N to distinguish the ith atom from the state at the Nth jump. We present the row corresponding to state θi in the rate matrix. As an example, if θN = θi, then the probability of transition to θ2 is proportional to µθi({θ2}).
Figure 5.1: An illustration of generating the next state given the current state θN in the HGEP model
Figure 6.1: Pseudocode for the PMCMC algorithm
Figure 6.2: An illustration of the SMC algorithm with three particles and three observation times up to generation 1.
Figure 6.3: An illustration of the SMC algorithm with three particles and three observation times up to generation 2.
Figure 6.4: An illustration of the SMC algorithm with three particles and three observation times up to generation 3.
Figure 6.5: Pseudocode for the SMC step of the PMCMC algorithm
Figure 6.6: Pseudocode for sampling from the “improved” proposal distribution
Figure 6.7: An illustration of sampling the last jump for a generation of particles in the SMC algorithm. For simplicity we show only one particle.
Figure 6.8: Sampling procedure for the “modified” predictive distribution (case 1).
Figure 6.9: Sampling procedure for the “modified” predictive distribution (case 2).
Figure 6.10: Sampling procedure for the “modified” predictive distribution (case 3).
Figure 6.11: Computing the predictive probability for a sampled path
Figure 7.1: Sampling from the predictive distribution of the HGEP model used in the experiments
Figure 8.1: Qualitative behavior of the prior: (a) β0 = 10, ‖H0‖ = 5, γ0 = 100; (b) β0 = 100, ‖H0‖ = 5, γ0 = 10; (c) β0 = 10, ‖H0‖ = 500, γ0 = 10; (d) β0 = 1, ‖H0‖ = 5000, γ0 = 1
Figure 8.2: Mean reconstruction error on the held-out data as a function of the number of Gibbs scans for the experiments.
Figure 8.3: (a) A sample path in a reconstructed phylogenetic tree; (b) Evolution of the sequences for the reconstructed sample path
Figure 8.4: Mean predictive probability of reaching category 3 starting in category 1 (95% credible interval is provided by the dashed lines)
Figure 8.5: Comparison of inference algorithms for the RNA dataset: mean acceptance rate per iteration for a proposal using 100 particles as a function of MCMC iteration using (a) predictive distribution, (b) modified predictive distribution in the SMC proposal.
Glossary

Acronyms

CRF    Chinese restaurant franchise
CRP    Chinese restaurant process
DDP    dependent Dirichlet process
DP     Dirichlet process
EDSS   expanded disability status scale
GEP    gamma exponential process
HDP    hierarchical Dirichlet process
HGEP   hierarchical gamma exponential process
HMM    hidden Markov model
IHMM   infinite hidden Markov model
MCMC   Markov chain Monte Carlo
MGP    Moran gamma process
MH     Metropolis-Hastings
MJP    Markov jump process
MS     multiple sclerosis
PIMH   particle independent Metropolis-Hastings
PMCMC  particle Markov chain Monte Carlo
SMC    sequential Monte Carlo

Acknowledgments

It is difficult to overstate my gratitude to my supervisors, Drs. Alexandre Bouchard-Côté and John Petkau. Alex’s enthusiasm, his patience, his guidance, and his support have helped me greatly, and this thesis would not have been possible without him. He is an inspiration and I am very honored to call him my teacher. John’s great advice, support, and friendship, especially his great efforts to explain things clearly and simply, have been invaluable on both an academic and a personal level, for which I am extremely grateful.

Many thanks go to the numerous people who have been my teachers and mentors along my academic journey so far. I would especially like to express my gratitude to Dr. Alireza Haji (B.Sc. supervisor) and Dr. Kourosh Eshghi (M.Sc. supervisor) at Sharif University of Technology. I am very grateful for their generous support, advice, and help with various projects.

I would like to thank my student colleagues and friends for great discussions and for providing a stimulating and fun environment in which to learn and grow. Special thanks go to Chao Xiong, Hongyang Zhang, Liangliang Wang, Hossein Bashashati, Reza Skandari, Shahram Pourazadi, Sajjad Fayazi, Abbas Javaherian, and Ehsan Esfahanian.

The secretaries at the Statistics department deserve my gratitude for providing assistance and administrative support, with special mention to Peggy, Andrea, and Elaine.

Last but not least, I am forever grateful to my parents for their love and support, especially my mother, who has always been there for me and helped me grow as a person. To her I dedicate this thesis.

Chapter 1

INTRODUCTION

Inside every non-Bayesian, there is a Bayesian struggling to get out. — Dennis V. Lindley

Markov jump processes (MJPs), continuous-time Markov processes with piecewise constant paths, have been used as models in many different fields including disease progression [5], phylogenetic trees [41], and communication networks [44]. The main motivation behind this thesis is the application of MJPs to data modeled as having complex latent structure.

As a concrete example, disease progression in multiple sclerosis (MS) can be considered as an MJP [31]. MS is one of the most complex disabling diseases affecting the central nervous system. A typical evolution of the disease begins with periods of relapses and remissions. At this stage of the disease, patients recover from the neurological symptoms that appear during a relapse period in the following remission period. However, most patients eventually move to another stage of the disease in which disability accumulates. The course of the disease can be assessed using disability scales such as the expanded disability status scale (EDSS), an ordinal score ranging from 0 to 10 in half-point steps.
Since the patient’s state (measured by the EDSS score) usually remains stable for a relatively long period of time, an MJP model is a reasonable choice for modeling the EDSS data.

In MS studies, EDSS scores are analyzed with the aim of describing or predicting the disease progression. The available EDSS data are partially observed; measurements are only taken at certain time points. It is not reasonable to assume that the EDSS scores, collected over time, are independent of each other; thus, they can be thought of as sequential data for each individual patient. Several authors have proposed Markov models for analyzing EDSS scores; see, for example, [1] and [32]. However, due to the complex nature of the disease and also the partial observability of the data, more sophisticated models may be required. A hidden Markov model (HMM) is proposed in [31] as a possible model for disease progression. In an HMM, the states are not directly observed, but are reflected by the observations (e.g., EDSS scores). That is, we only observe a sequence of outputs which depend on an underlying Markov process. The currently available methods for carrying out inference in an HMM with an underlying MJP assume the number of hidden states is either known (see for example [43] or [5]) or can be estimated using a model selection method (e.g., AIC, BIC, or penalized minimum-distance [30] methods). To successfully apply these methods to a real-world problem, we need a domain expert or a model selection method to specify the number of hidden states.

In complex settings, such as disease progression in MS, it is seldom the case that one has a defensible parametric model. Nonparametric Bayesian models, Bayesian models with nonparametric priors, have recently attracted much attention in the statistics community, due to their flexibility, adaptability, usefulness in analyzing complex real-world datasets, and their ability to sidestep model selection. Nonparametric Bayesian models can automatically infer an adequate model complexity from the data, without needing explicit Bayesian model comparison. In addition, while being highly adaptable, these models maintain some advantages of a fully model-based probabilistic framework. Exact inference in these models is usually impossible; thus, statistical inference is typically done using Markov chain Monte Carlo (MCMC), a general approach used in approximate inference algorithms for Bayesian models.

In this thesis we propose a nonparametric prior, the gamma exponential process (GEP), over MJPs. In particular, we propose a prior over infinite rate matrices which characterize an MJP. These priors can be used in Bayesian models where an MJP is imposed on the data but the number of states of the MJP is unknown in advance. Thus, if we consider a Bayesian MJP model for analyzing the MS data, then these priors are a reasonable choice. We will briefly provide some background on MJPs and their rate matrices in Chapter 2. We will also review some basic nonparametric priors such as the gamma process and the Dirichlet process (DP) in this chapter.

There are a number of nonparametric Bayesian models available for sequential data. One class of these models is developed for the discrete-time framework; examples of these models are the infinite HMM (iHMM) [3] and the sticky hierarchical Dirichlet process-HMM [13]. Another class of nonparametric Bayesian models is developed for continuous-time transient processes.
Transient processes are processes where the effect of an observation at time point t1 on a prediction at time point t2 should decrease as |t1 − t2| increases. Examples of these models are the order-based dependent Dirichlet process (DDP) [17] and the stick-breaking autoregressive process [18]. We will review these and related models in Chapter 3. Our GEP model fills the gap for nonparametric priors over continuous-time recurrent (i.e., non-transient) processes.

The GEP model that we propose has some attractive properties such as conjugacy and simple closed-form predictive distributions. The simple predictive distributions that arise with the DP are one of the factors behind its widespread adoption in nonparametric Bayesian statistics; moreover, a simple predictive distribution can simplify the process of making statistical inferences. We discuss the properties of GEPs in Chapter 4.

We extend the basic model of Chapter 4 to a hierarchical version in Chapter 5. Sharing statistical strength can be considered as the main motivation behind a hierarchical model. Informally, we want the rows of the rate matrix to share information on what states are frequently visited. We introduce the hierarchical gamma exponential process (HGEP) and the motivation behind it in Chapter 5.

After building an appropriate model (i.e., GEP and HGEP) for a complex latent structure, we need to infer this structure from the observed data. Typically we are not able to observe the whole sequence of states; that is, we only have partially observed sequences of data. In Chapter 6, we show that the HGEP model admits efficient inference algorithms. We introduce two inference algorithms: 1) a particle Markov chain Monte Carlo (PMCMC) algorithm, which is an MCMC algorithm with sequences proposed by a sequential Monte Carlo (SMC) algorithm; 2) a modified version of this PMCMC algorithm with an “improved” SMC proposal.

In order to link the observed data to the latent structure we need a likelihood model; we assume a simple multinomial-Dirac likelihood model. Using this likelihood model in several experiments, we observe that despite its simplicity the results are reasonable. The multinomial-Dirac likelihood model is described in Chapter 7.

We apply the model to three different datasets in Chapter 8: 1) a synthetic dataset generated from a random rate matrix; 2) an MS dataset obtained from a phase III clinical trial of a drug; and 3) an RNA dataset containing aligned RNA of species from the three domains of life. We evaluate our model on these datasets based on a held-out task. For the held-out task we construct three new datasets where each observation is held out with 10% probability; then we reconstruct the observations at the held-out times and measure the mean error. The results show that our model outperforms the currently available methods. We also apply the model to estimating disease progression in MS. We conclude the thesis in Chapter 9 with some final remarks and directions for future research.

Chapter 2

BASIC NOTIONS

In this chapter, we introduce some basic notions and theories used in the following chapters of the thesis. We start by defining a Markov jump process (MJP); next, we present the definitions and properties of gamma and Dirichlet processes as the important elements of our model.

2.1 Markov Jump Process (MJP)

An MJP can be described as a continuous-time process with the Markov property on a discrete state space; a more formal definition is as follows:
Definition 1. An MJP is a stochastic process {S_t, t ∈ T} with a discrete state space Ω and a continuous index set T = R+ that satisfies the Markov property:

p(S_{t+s} = j | S_u; u ≤ t) = p(S_{t+s} = j | S_t).   (2.1)

If, moreover, the right hand side of Equation 2.1 only depends on the time increment s, but not on t, then the MJP is called homogeneous. Here, we only consider homogeneous MJPs; thus, the term MJP will always refer to a homogeneous MJP.

An MJP is often defined through its instantaneous transition rate matrix Q:

$$Q = \begin{pmatrix} q_{1,1} & q_{1,2} & q_{1,3} & \cdots \\ q_{2,1} & q_{2,2} & q_{2,3} & \cdots \\ q_{3,1} & q_{3,2} & q_{3,3} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}.$$

The off-diagonal element q_{i,j} corresponds to the instantaneous transition rate from state i to state j:

$$q_{i,j} = \lim_{h \downarrow 0} \frac{P(S_{t+h} = j \mid S_t = i)}{h};$$

hence, these elements are non-negative. As a consequence of the definition of q_{i,j}, the waiting rate at state i, denoted by q_{i,i}, is determined from q_{i,i} = −∑_{j≠i} q_{i,j}. The convention is to place q_{i,i} on the diagonal of Q.

One particular probability of interest in an MJP is the transition probability from one state to another given a time interval t. The transition probability matrix P(t) = (p_{i,j}(t)), where p_{i,j}(t) = p(S_t = j | S_0 = i), is calculated by matrix exponentiation; that is, P(t) = e^{Qt}. Another probability of interest is the initial probability distribution π, where π_i = p(S_0 = i).

Given the transition rate matrix Q, the behavior of a sample path from an MJP is specified by the following theorem [27]:

Theorem 2. Assume an MJP with transition rate matrix Q = (q_{i,j}) is in state i at time point t. The sample path behavior of the MJP can be described as follows:

1. The waiting time in the current state i is distributed according to Exp(−q_{i,i}).
2. Given its past, if a transition occurs at a particular time, the probability that this transition is from the current state i to state j is given by M_{i,j} = q_{i,j}/|q_{i,i}|.

Note that in the second statement of Theorem 2, M is the transition probability matrix for a Markov chain without self-transitions (i.e., M_{i,i} = 0 for all i ∈ Ω). This Markov chain is called the embedded Markov chain of the MJP.

A sample from an MJP can be described as a list of pairs X = (θ_n, J_n)_{n=1}^N, where J_n is the waiting time at jump n−1 (J_1 = 0) and θ_n is the state at jump n. We assume the initial state of the sequence (i.e., θ_1 or S_0) is distributed according to the initial distribution of the MJP. For finite sequences with length N, we can assume a special absorbing state θ_end for the end of a finite sequence. We would then condition on (θ_{N+1} = θ_end) and (θ_n ≠ θ_end, n ∈ {1, . . . , N}), and set the total rate for the row corresponding to θ_end to zero. In the following, we only consider distributions over infinite sequences.

Based on Theorem 2, we can use the Doob-Gillespie algorithm [8] to sample from an MJP. Denoting the jump times by T_1, T_2, . . ., this algorithm can be described as follows (a runnable sketch is given below):

1. For T_0 = 0 sample θ_1 from π. Set J_1 = 0, S_0 = θ_1, and n = 1.
2. Sample J_{n+1} from Exp(−q_{θ_n,θ_n}).
3. Update the jump time: T_n = T_{n−1} + J_{n+1}.
4. For T_{n−1} ≤ t < T_n set S_t = θ_n.
5. Sample θ_{n+1} from the row corresponding to θ_n in the transition probability matrix M. Set n = n+1 and go to step 2.

A sample path from an MJP is sketched in Figure 2.1.

[Figure 2.1: An illustration of notation for samples from MJPs. We assume the state space (Ω) is countable. The notation for the observations Y(t_1), . . . , Y(t_G) is described in Chapter 6.]
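The thesis gives no reference implementation of this procedure, so the following is a minimal sketch for a finite state space, assuming Q is a NumPy rate matrix with rows summing to zero and pi0 an initial distribution; the function name and interface are ours.

```python
import numpy as np

def sample_mjp_path(Q, pi0, t_max, rng):
    """Doob-Gillespie sampler: returns jump times and visited states up to t_max."""
    n_states = Q.shape[0]
    state = rng.choice(n_states, p=pi0)       # step 1: initial state from pi
    times, states = [0.0], [state]
    t = 0.0
    while True:
        rate = -Q[state, state]               # total exit rate -q_{i,i}
        if rate <= 0:                         # absorbing state: no further jumps
            break
        t += rng.exponential(1.0 / rate)      # step 2: waiting time ~ Exp(-q_{i,i})
        if t >= t_max:
            break
        M_row = Q[state].copy()               # step 5: embedded-chain row
        M_row[state] = 0.0
        M_row /= rate                         # M_{i,j} = q_{i,j} / |q_{i,i}|
        state = rng.choice(n_states, p=M_row)
        times.append(t)
        states.append(state)
    return times, states

rng = np.random.default_rng(0)
Q = np.array([[-1.0, 0.6, 0.4],
              [ 0.3, -0.8, 0.5],
              [ 0.2, 0.7, -0.9]])
times, states = sample_mjp_path(Q, np.array([1.0, 0.0, 0.0]), t_max=10.0, rng=rng)
```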
2.2 Gamma Process

The nonparametric prior that we propose is based on the gamma process, a stochastic process with independent, gamma distributed increments. Recall that the gamma distribution with shape parameter a > 0 and rate parameter b > 0 has a density given by:

$$\text{Gamma}(x \mid a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx} \, I_{(0,\infty)}(x),$$

where $\Gamma(a) = \int_0^\infty z^{a-1} e^{-z} \, dz$ is the gamma function for a > 0, and I_A(x) = 1 if x ∈ A and I_A(x) = 0 otherwise.

We use the terminology in [25] for defining the gamma process; thus we use the term Moran gamma process (MGP) and denote a gamma process by MGP. The MGP can be defined through three parameters:

1. a concentration parameter, also called the shape parameter, α0 > 0;
2. a base probability distribution P0 : F_Ω → [0,1], where F_Ω is the σ-algebra over the sample space Ω;
3. a rate parameter β0 > 0.

We can combine the first two parameters into one new parameter called the base measure parameter H0, where H0 = α0 P0 is a measure on the space (Ω, F_Ω). With this parameterization we can characterize an MGP by just two parameters, H0 and β0.

Informally, a sample from a gamma process is an atomic measure with random weighted point masses. The point masses ω ∈ Ω and their corresponding weights x > 0 are points in a product space Ω ⊗ [0,∞). A formal definition of the MGP can be provided based on Poisson processes. Consider a Poisson process with intensity

$$\lambda(d\omega \, dx) = H_0(d\omega) \, x^{-1} e^{-\beta_0 x} \, dx$$

over the product space Ω ⊗ [0,∞). A realization of this Poisson process is an infinite set of atoms, {ω_i, x_i}_{i=1}^∞, where ω_i ∈ Ω can be interpreted as the point mass of an atomic measure and x_i as its corresponding weight. A realization of an MGP(H0, β0) is then defined as:

$$\mu = \sum_{i=1}^{\infty} x_i \delta_{\omega_i} \sim \text{MGP}(H_0, \beta_0). \tag{2.2}$$

Note that in contrast to the usual definition of the one-dimensional Poisson process, where the process is defined over a line (usually time), here we define the process over a product space Ω ⊗ [0,∞). In such a spatial Poisson process, the counts of the number of events inside each of a number of non-overlapping finite sub-regions of the space are Poisson distributed and independent of each other.

It can be shown [25] that the total mass µ(A) = ∑_{i=1}^∞ x_i 1(ω_i ∈ A) of any measurable subset A ⊂ Ω is gamma distributed with shape parameter H0(A) and rate parameter β0. In other words, a gamma process takes positive values independently distributed as gamma at each event generated by a Poisson process. Hence, we have an equivalent definition for the MGP based on its marginals. Recall that by the Kolmogorov consistency theorem, in order to guarantee the existence of a stochastic process on a probability space (Ω′, F_Ω′), it is enough to provide a consistent definition of the marginals of this stochastic process. As already mentioned, in the case of an MGP, the marginals are gamma distributions:

Definition 3 (MGP). Let H0, β0 be of the types listed above. We say that µ : F_Ω′ → (F_Ω → [0,∞)) is distributed according to the MGP distribution, denoted by µ ∼ MGP(H0, β0), if for all measurable partitions (A1, . . . , AK) of Ω we have (using the rate parameterization for the gamma density throughout):

(µ(A1), µ(A2), . . . , µ(AK)) ∼ Gamma(H0(A1), β0) × · · · × Gamma(H0(AK), β0).
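Definition 3 can be exercised directly: over a fixed partition, an MGP assigns independent gamma-distributed masses. Below is a minimal numerical sketch; the partition masses and parameter values are our illustrative choices, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base measure masses H0(A_1), ..., H0(A_K) over a fixed partition, and rate beta0.
H0_masses = np.array([0.5, 1.5, 3.0])
beta0 = 2.0

# By Definition 3, (mu(A_1), ..., mu(A_K)) are independent Gamma(H0(A_k), beta0)
# draws (NumPy parameterizes by scale, so we pass 1/beta0).
mu_masses = rng.gamma(shape=H0_masses, scale=1.0 / beta0)

# The total mass mu(Omega) is Gamma(||H0||, beta0) by gamma additivity;
# a quick Monte Carlo check of its mean ||H0|| / beta0 = 5.0 / 2.0:
totals = rng.gamma(shape=H0_masses.sum(), scale=1.0 / beta0, size=100_000)
print(mu_masses, totals.mean())  # mean should be close to 2.5
```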
2.2.1 MGP and Dirichlet process (DP)

There is a connection between the MGP and the DP which will be important in the following chapters. As mentioned in Section 2.2, for any measurable subset A ⊂ Ω we have:

$$\mu(A) = \sum_{i=1}^{\infty} x_i 1(\omega_i \in A) \sim \text{Gamma}(H_0(A), \beta_0). \tag{2.3}$$

Dividing a sample µ from an MGP by µ(Ω), we get a random probability measure which is in fact a sample from a DP with concentration parameter H0(Ω) and base probability measure H0/H0(Ω). We will explain these parameters in the following section.

2.3 DP Definition and Properties

In the previous section, we mentioned that by normalizing the MGP we obtain a new stochastic process called the DP. In this section we describe the DP, some of its important properties, and different ways to represent it.

2.3.1 Definition

Intuitively, the DP is a generalization of the Dirichlet distribution to an infinite number of dimensions. In fact, the Dirichlet distribution is itself the multivariate generalization of the beta distribution. Recall that the beta distribution is a continuous probability distribution defined on the interval (0,1) with two parameters α > 0 and β > 0. This distribution is typically used in modelling proportions. The probability density function of the beta distribution is:

$$\text{Beta}(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} I_{(0,1)}(x).$$

By extending a beta distribution to more than two dimensions we obtain a Dirichlet distribution. A K-dimensional Dirichlet distribution with positive parameters α1, · · · , αK is defined by the following density function:

$$\text{Dir}(x_1, \cdots, x_{K-1}; \alpha_1, \cdots, \alpha_K) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}$$

for all x1, · · · , x_{K−1} > 0 satisfying ∑_{k=1}^{K−1} x_k < 1, where x_K = 1 − ∑_{k=1}^{K−1} x_k.

This distribution is often used in the Bayesian framework as a prior for the multinomial distribution, because the Dirichlet distribution is a conjugate prior for the multinomial distribution: if a random variable has a multinomial distribution with a Dirichlet prior over the parameters of the distribution, then the posterior over the parameters is also Dirichlet. Denoting N independent observations from a multinomial distribution by Y = (y1, · · · , yN), we have

P = (p1, · · · , pK) ∼ Dir(α1, · · · , αK)
Y | P ∼ Mult(p1, · · · , pK)
⇒ P | Y ∼ Dir(α1 + n1, · · · , αK + nK),

where nk is the number of yis in category k.
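This Dirichlet-multinomial update is easy to verify numerically; the following small sketch uses illustrative numbers of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([1.0, 1.0, 2.0])   # Dirichlet prior parameters alpha_1..alpha_K
p = rng.dirichlet(alpha)            # P ~ Dir(alpha)
counts = rng.multinomial(20, p)     # category counts n_k from N = 20 draws

# The posterior over P is Dir(alpha_1 + n_1, ..., alpha_K + n_K):
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(counts, posterior_mean)
```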
The DP is the result of extending the Dirichlet distribution to infinite dimensions. A DP, denoted by G, can be thought of as a probability distribution over probability measures (i.e., non-negative functions that integrate to one) and is defined by two parameters:

1. Base measure (G0): a probability distribution G0 : F_Ω → [0,1] which is the mean of the DP; that is, for any measurable set A ⊂ Ω we have E[G(A)] = G0(A). Here, F_Ω is the σ-algebra over the sample space Ω.
2. Concentration parameter (α0): a positive real number that controls the variability around the mean G0.

A realization from a DP(α0, G0) is a random distribution with values drawn from G0. In other words, the support of the sample is the same as that of the base measure. Similar to the definition of the MGP, by the Kolmogorov consistency theorem, we can define a DP on a probability space (Ω′, F_Ω′) by providing a consistent definition of its marginals. Following this approach we have the following definition for the DP:

Definition 4. G : F_Ω′ → (F_Ω → [0,1]) is DP distributed with base measure G0 and concentration parameter α0, denoted by G ∼ DP(α0, G0), if for any finite measurable partition (A1, · · · , AK) of Ω we have:

(G(A1), G(A2), · · · , G(AK)) ∼ Dir(α0G0(A1), α0G0(A2), · · · , α0G0(AK)).   (2.4)

Note that, similarly to the gamma process, Definition 4 suggests that the two parameters α0 and G0 can be treated as a single parameter H0 = α0G0; this results in a notationally convenient parameterization G ∼ DP(H0).

2.3.2 Conjugacy and posterior distribution

A conjugacy property holds for the DP. Let G ∼ DP(α0, G0) and let y be a single realization from G, y ∼ G. Note that since G is a measure-valued random variable, we can sample random variables from a realization of G. Roughly speaking, G can be thought of as an infinite-dimensional multinomial distribution. Due to the Dirichlet-multinomial conjugacy, for a Dirichlet distribution induced by a fixed partition (A1, A2, · · · , AK) of Ω (i.e., the support of G0) and y ∈ Ak, we have:

G | α0, G0 ∼ DP(α0, G0)
y | G ∼ G
(G(A1), G(A2), · · · , G(AK)) | y ∼ Dir(α0G0(A1), α0G0(A2), · · · , α0G0(Ak) + 1, · · · , α0G0(AK)).

That is, the single (data) point y in the state space only affects the single element Ak of the partition to which it belongs. Figure 2.2 illustrates an example where the point belongs to the element A2 of the partition.

[Figure 2.2: A realization G of a DP with base measure G0 and support Ω. A partition (A1, A2, · · · , AK) and a point y sampled from G are also illustrated.]

Generally, for a sequence of N independent realizations y1, · · · , yN from G we have:

(G(A1), G(A2), · · · , G(AK)) | y1, · · · , yN ∼ Dir(α0G0(A1) + n1, α0G0(A2) + n2, · · · , α0G0(AK) + nK),

where nk is the number of points in the kth element of the partition: nk = |{i : yi ∈ Ak}|. Thus, the posterior distribution over G must be a DP as well, since the above is true for any finite partition. Formally, we have the following proposition for the posterior of a DP measure given a set of points y1, · · · , yN.

Proposition 5. Suppose G ∼ DP(α0, G0). Given N independent observations yi ∼ G, the posterior distribution also follows a DP:

$$G \mid y_1, \cdots, y_N, \alpha_0, G_0 \sim \text{DP}\!\left(\alpha_0 + N, \; \frac{\alpha_0 G_0 + \sum_{i=1}^{N} \delta_{y_i}}{\alpha_0 + N}\right),$$

where δ_{yi} is a point mass located at yi. The proof provided in [12] follows directly from the Dirichlet-multinomial conjugacy.

2.3.3 Stick-breaking construction

Definition 4 is not constructive; that is, it does not provide a mechanism for sampling from a DP. Here we provide an alternative, constructive definition of the DP.

Definition 6 (Stick-breaking construction [40]). Let βc ∼ Beta(1, α0) be independent. Define the stick lengths π = {πc}_{c=1}^∞ as follows:

π1 = β1,
πc = βc ∏_{1≤c′<c} (1 − βc′) for c > 1.

We denote this construction by π ∼ GEM(α0). (GEM is named after Griffiths, Engen and McCloskey; the abbreviation was first introduced in [36].)

In words, we start from a stick with length 1; at step c a proportion βc of the remaining stick is broken off and used as the stick length πc. At each step, the rest of the stick is kept for the remaining steps; thus, if at some step c the remaining stick has length L, the piece broken off has length βcL. It is shown in [40] that π constructed by this mechanism is a probability distribution. The following theorem provides an alternative definition of the DP which is constructive.

Theorem 7. Given a base measure G0, consider the random measure

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k},$$

where θk iid∼ G0 and π ∼ GEM(α0). Then G ∼ DP(α0, G0).

The proof is provided in [40]. That is, the two alternative definitions of the DP provided in Definition 4 and Theorem 7 are equivalent. Theorem 7 shows that the random probability measures generated from a DP are discrete with probability one [12].
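Definition 6 translates directly into a truncated sampler; the truncation level K and the standard normal base measure in the sketch below are illustrative choices of ours, not part of the thesis.

```python
import numpy as np

def stick_breaking_dp(alpha0, base_sampler, K, rng):
    """Truncated draw from DP(alpha0, G0): weights pi and atoms theta."""
    betas = rng.beta(1.0, alpha0, size=K)                      # beta_c ~ Beta(1, alpha0)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pi = betas * remaining                  # pi_c = beta_c * prod_{c'<c} (1 - beta_{c'})
    theta = base_sampler(K, rng)            # theta_k iid from G0
    return pi, theta

rng = np.random.default_rng(2)
pi, theta = stick_breaking_dp(5.0, lambda K, r: r.normal(size=K), K=100, rng=rng)
print(pi.sum())   # close to 1 for large K; the remaining tail mass is truncated
```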
This discreteness leads to the predictive distribution of the DP, the distribution of a new point y_{N+1} given a set of N existing i.i.d. points Y = {y1, · · · , yN}, known as the Chinese restaurant process (CRP).

2.3.4 CRP definition

As mentioned in Section 2.3.3, the DP produces discrete random measures; thus, there is a positive probability of multiple points yi taking identical values and hence forming clusters. Recall that the points y1, · · · , yN are i.i.d. draws from G, a realization of a DP with some concentration parameter α0 and a base measure G0; that is, G ∼ DP(α0, G0). Since the values of the draws are repeated, let y∗1, · · · , y∗M be the unique values among y1, · · · , yN and nk be the number of repeats of y∗k. The predictive distribution for the (N+1)th point y_{N+1} given all the previous observations y1, · · · , yN is given by the following theorem:

Theorem 8. Let y1, · · · , yN be an i.i.d. sample from G ∼ DP(α0, G0), let y∗1, · · · , y∗M be the unique values among y1, · · · , yN, and let nk be the number of repeats of y∗k. Then the predictive distribution for y_{N+1} is given by:

$$y_{N+1} \mid y_1, \cdots, y_N, \alpha_0, G_0 \sim \sum_{i=1}^{M} \frac{n_i}{N + \alpha_0} \delta_{y_i^*} + \frac{\alpha_0}{N + \alpha_0} G_0. \tag{2.5}$$

For a proof of Theorem 8 see, for example, [23]. The CRP refers to the above distribution (i.e., the predictive distribution of a DP). The interpretation of Equation 2.5 in terms of the CRP is as follows. Consider a restaurant with an infinite set of tables; tables correspond to clusters, the ith customer corresponds to the ith point yi, and the kth dish corresponds to the kth unique value y∗k that points can attain. The customers sequentially enter the restaurant and sit either at a previously occupied table or at a new table. Customer i enters and sits at an already occupied table k with probability proportional to the number of customers already at that table; that is, P(Yi = y∗k) = nk/(α0 + N). Alternatively, the customer sits at a new table with probability proportional to α0; that is, with probability α0/(α0 + N), in which case Yi ∼ G0. When the table is occupied, the new customer shares the same dish as the first customer at that table; when the table is empty, a new dish is sampled, y∗ ∼ G0.

2.3.5 DP mixture model

One of the most common applications of the DP is the DP mixture model, which can be used for clustering data. In fact, the DP mixture model is a mixture model with countably infinite mixture components. Consider a set of points y1, y2, · · · , yN, a set of latent parameters θ1, · · · , θN, and a parametric distribution F. Then the DP mixture model can be written in the following form:

yi | θi ∼ F(θi)
θi | G ∼ G
G | α0, G0 ∼ DP(α0, G0)

Each observation yi is sampled from F(θi) and each θi is sampled i.i.d. from G. As G is discrete, samples θi from G can have identical values; therefore, observations sampled from F with the shared parameter θi belong to the same mixture component (i.e., cluster).
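Equation 2.5 yields a simple sequential sampler, which is also the collapsed view used when simulating the latent parameters of a DP mixture. A sketch, with an illustrative standard normal base measure standing in for G0:

```python
import numpy as np

def crp_sample(N, alpha0, base_sampler, rng):
    """Draw y_1, ..., y_N sequentially from the CRP predictive (Equation 2.5)."""
    values, counts = [], []        # unique values y*_k and their multiplicities n_k
    ys = []
    for n in range(N):
        # new value with probability alpha0 / (n + alpha0)
        if rng.random() < alpha0 / (n + alpha0):
            values.append(base_sampler(rng))
            counts.append(0)
            k = len(values) - 1
        else:                      # existing value k with probability n_k / (n + alpha0)
            probs = np.array(counts) / n
            k = rng.choice(len(values), p=probs)
        counts[k] += 1
        ys.append(values[k])
    return ys

rng = np.random.default_rng(3)
ys = crp_sample(50, alpha0=2.0, base_sampler=lambda r: r.normal(), rng=rng)
```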
2.4 Summary

In this chapter we reviewed some of the basic notions that will be useful in developing our model. In the subsequent chapters, we will show how our nonparametric prior over MJPs is built upon the MGP that we described in this chapter. Moreover, we will present the properties of our model that we believe can be better understood in connection with the properties of the DP explained in the current chapter. In the following chapter, we review some Bayesian nonparametric models and explain how our proposed model is different from these currently available models.

Chapter 3

RELATED MODELS

Our model can be considered as a prior over MJPs, which are a specific type of time series. In this chapter, we present the currently available Bayesian nonparametric models that are used in time series modelling; we describe their properties and group them into three main categories.

3.1 Hierarchical Dirichlet Process

In this section, we introduce the hierarchical Dirichlet process (HDP) [42], a nonparametric prior over a set of random probability measures over (Ω, F_Ω). We first explain the motivation behind the HDP from the perspective of mixture models; next, we describe different representations of the HDP, which are analogous to the different representations of the DP.

3.1.1 Motivation

Consider a clustering problem where there are multiple groupings of data and we want to be able to share the clusters across the multiple groupings. As a concrete example, consider a document modeling problem where the goal is to model a corpus of documents. A common simplification in document modeling is the “bag of words” assumption, by which the words are assumed to be exchangeable and the order of the words in the documents is ignored. Another common assumption in document modeling is that the words in each document are sampled from the distribution of a topic [4]; each topic has a multinomial distribution over a basic vocabulary. For example, in a document on the applications of machine learning in genetics, the words might be sampled from the topics “genetics” and “machine learning”. Now, considering a collection of documents (i.e., a corpus), we would like to share the topics among the documents. We might also be interested in sharing among different corpora.

3.1.2 Definition

The HDP is defined by a set of hyperparameters: a baseline probability measure H, and concentration parameters γ and α. The HDP is defined as follows:

G0 | γ, H ∼ DP(γ, H)   (3.1)
Gj | α, G0 iid∼ DP(α, G0)   (3.2)
θij | Gj iid∼ Gj   (3.3)

According to this definition, a global measure G0 is sampled from a DP with baseline measure H and concentration parameter γ; Gj is the random probability measure for group j, sampled from a DP with base measure G0 and concentration parameter α. Given G0, the Gjs are independent of each other. In other words, through the global measure we are linking the different groups together, which will allow us to share statistical strength between them. The connection of the HDP with mixture models in general, and topic models in particular, is completed with the specification of a parametric distribution F for the observations:

xij | θij ∼ F(θij).   (3.4)

Here we assume that given θij, the parameter of the ith cluster in the jth group, the observation xij is independent of all other observations. In comparison with the DP mixture model (Section 2.3.5), in the HDP mixture model we have a collection of mixture models that are connected through a global base measure G0. Note that we are not limited to a single layer of hierarchy in the HDP. In fact, we can have as many layers as required for modeling, through a recursive structure where the base measure for each layer is itself sampled from another DP. An example of a situation where we might need more than one layer would be sharing topics among different corpora.
3.1.3 Chinese Restaurant Franchise

In Section 2.3.4 we introduced the CRP: a representation for the predictive distribution of a DP where we marginalize out the random probability measure. There is an analogous representation for the HDP, called the Chinese restaurant franchise (CRF). In a CRF we have multiple restaurants with a shared menu. For each restaurant, a customer entering the restaurant can sit at an occupied table or at a new table. A customer sitting at an occupied table orders the same dish as the first customer at that table; for a new table, a dish is ordered from the menu shared across restaurants. Each restaurant corresponds to a group and customers correspond to the θijs in Equation 3.3 (i.e., θij is the ith customer at the jth restaurant). Moreover, following the notation in [42], we denote K i.i.d. samples from the baseline measure H by φ1, · · · , φK; here H corresponds to the global menu of dishes and φk is the kth sampled dish from the global menu. The notation needed for defining the predictive distribution of an HDP is provided in Table 3.1.

Table 3.1: Variables used in the CRF

  Notation   Definition
  θij        ith customer in restaurant j
  φk         kth sampled dish from the global menu H
  ψjt        the dish served at table t in restaurant j
  njtk       number of customers in restaurant j at table t eating dish k
  njt.       number of customers in restaurant j at table t
  nj.k       number of customers in restaurant j eating dish k
  mjk        number of tables in restaurant j serving dish k
  m.k        number of tables serving dish k
  mj.        number of occupied tables in restaurant j
  m..        total number of occupied tables

Similar to the DP, in order to obtain the predictive distribution for the HDP we have to marginalize out the random probability measures G0 and the Gjs. First we marginalize out Gj. From Theorem 8, the predictive distribution for θij (the ith latent parameter in group j) given θ1j, θ2j, · · · , θ_{i−1,j} is:

$$\theta_{ij} \mid \theta_{1j}, \cdots, \theta_{i-1,j}, \alpha, G_0 \sim \sum_{t=1}^{m_{j\cdot}} \frac{n_{jt\cdot}}{i-1+\alpha_0} \delta_{\psi_{jt}} + \frac{\alpha_0}{i-1+\alpha_0} G_0. \tag{3.5}$$

Note the similarity to Equation 2.5. That is due to the fact that in both we are marginalizing out realizations of a DP. Marginalizing out G0, using Theorem 8 again, we can obtain the predictive distribution for the dishes. Any time we need to sample a new dish in Equation 3.5 (i.e., sample from G0), we sample from the predictive distribution of the dishes. The dishes are sampled in the order the tables are occupied in each restaurant; that is, we start from table 1 in restaurant j and assign a dish to it (ψj1), next we move to the second table (ψj2), and so on. The predictive distribution for ψjt given the dishes for all occupied tables in all restaurants, ψ11, ψ12, · · · , ψ21, · · · , ψ_{j,t−1}, is:

$$\psi_{jt} \mid \psi_{11}, \psi_{12}, \cdots, \psi_{21}, \cdots, \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma} \delta_{\phi_k} + \frac{\gamma}{m_{\cdot\cdot}+\gamma} H. \tag{3.6}$$

Equation 3.5 is a mixture for the customers; a customer entering restaurant j: 1) sits at a previously occupied table with probability proportional to the number of customers sitting at that table, or 2) sits at a new table with probability proportional to α0. Equation 3.6 is a mixture for the dishes; whenever a new dish is needed for a new table: 1) it is chosen from the previously chosen dishes with probability proportional to the number of tables serving that dish, or 2) a new dish is sampled with probability proportional to γ. In summary, to obtain a new sample θij we first use Equation 3.5; if a new dish is needed from G0, we proceed to Equation 3.6, sample a new dish ψjt, and set θij = ψjt.

We can combine Equations 3.5 and 3.6 as follows:

$$\theta_{ij} \mid \theta_{1j}, \cdots, \theta_{i-1,j}, \alpha \sim \sum_{k=1}^{K} \frac{n_{j\cdot k}}{i-1+\alpha_0} \delta_{\phi_k} + \frac{\alpha_0}{i-1+\alpha_0} \mu, \qquad \mu = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma} \delta_{\phi_k} + \frac{\gamma}{m_{\cdot\cdot}+\gamma} H. \tag{3.7}$$

This can be done due to the fact that

$$\sum_{t=1}^{m_{j\cdot}} n_{jt\cdot} \delta_{\psi_{jt}} = \sum_{k=1}^{K} \sum_{t=1}^{m_{jk}} n_{jtk} \delta_{\psi_{jt}} = \sum_{k=1}^{K} n_{j\cdot k} \delta_{\phi_k}.$$

This representation will be useful in Chapter 5.
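Equations 3.5 and 3.6 can be turned into a sequential seating procedure. The sketch below is our own illustrative bookkeeping, not code from the thesis: tables_j holds the n_jt. counts for one restaurant, dish_counts holds the m.k counts, and H_sampler stands in for H.

```python
import numpy as np

def crf_next_customer(tables_j, dish_of_table_j, dish_counts, dishes,
                      alpha0, gamma, H_sampler, rng):
    """Seat one customer in restaurant j (Eq. 3.5); open a table via Eq. 3.6 if needed."""
    n_j = sum(tables_j)
    # Eq. 3.5: existing table t w.p. n_jt/(n_j + alpha0), new table w.p. alpha0/(n_j + alpha0)
    probs = np.array(tables_j + [alpha0]) / (n_j + alpha0)
    t = rng.choice(len(probs), p=probs)
    if t == len(tables_j):                        # new table: draw its dish from Eq. 3.6
        m = sum(dish_counts)
        dprobs = np.array(dish_counts + [gamma]) / (m + gamma)
        k = rng.choice(len(dprobs), p=dprobs)
        if k == len(dish_counts):                 # brand-new dish phi ~ H
            dishes.append(H_sampler(rng))
            dish_counts.append(0)
        dish_counts[k] += 1
        tables_j.append(0)
        dish_of_table_j.append(k)
    tables_j[t] += 1
    return dishes[dish_of_table_j[t]]             # the customer's theta_ij
```

Calling this repeatedly, with one (tables_j, dish_of_table_j) pair per restaurant but shared dish_counts and dishes lists, reproduces the sharing of dishes across restaurants that motivates the HDP.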
3.1.4 Stick-breaking representation for the HDP

Similar to the DP, there is a stick-breaking representation for the HDP. We have this representation for the global measure G0 and also for the other random probability measures Gj, as all are sampled from a DP. Assume φk iid∼ H; it can be shown [42] that the stick-breaking representation for G0 and the Gjs is as follows:

G0 = ∑_{k=1}^∞ βk δφk   (3.8)
Gj = ∑_{k=1}^∞ πjk δφk   (3.9)

where β = (βk)_{k=1}^∞ ∼ GEM(γ) and the πj = (πjk)_{k=1}^∞ ∼ DP(α0, β) are mutually independent. Note that the weights πj are independent given β, as the Gj are independent given G0. In this representation, the support of G0 is φ = (φk)_{k=1}^∞; each Gj has the same support, since each Gj is a realization of a DP with base measure G0. In summary, the stick-breaking representation of the HDP can be written as follows:

φk iid∼ H   (3.10)
β | γ ∼ GEM(γ)   (3.11)
πj | α0, β iid∼ DP(α0, β)   (3.12)
G0 = ∑_{k=1}^∞ βk δφk   (3.13)
Gj = ∑_{k=1}^∞ πjk δφk   (3.14)
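A degree-K truncation of Equations 3.10-3.14 is straightforward: once β is truncated to K components, each πj ∼ DP(α0, β) reduces to a finite Dirichlet draw with parameters α0βk. The sketch below uses illustrative hyperparameters and a standard normal stand-in for H.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, alpha0, K, J = 5.0, 1.0, 50, 3         # truncation level K, J groups

# beta ~ GEM(gamma), truncated at K sticks
sticks = rng.beta(1.0, gamma, size=K)
beta = sticks * np.concatenate([[1.0], np.cumprod(1.0 - sticks)[:-1]])
beta /= beta.sum()                             # absorb the truncated tail mass

phi = rng.normal(size=K)                       # shared atoms phi_k ~ H (illustrative H)

# Under the truncation, pi_j ~ DP(alpha0, beta) becomes Dir(alpha0 * beta_1, ..., alpha0 * beta_K)
pi = rng.dirichlet(alpha0 * beta, size=J)      # one row of weights per group G_j
```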
However, the DP priors for all the hidden states need to be linked as we want the chain to be irreducible. One approach for linking the DP priors for the states could be using a shared base measure H among all of the priors, µθ iid∼ DP(α,H) ∀θ ∈ Ω. This approach may work if the base measure has a countable support; otherwise, samples from the DPs will have different supports with probability one. Another approach, that works even with uncountable support for the base mea- sure, is using the HDP framework; we first sample a global measure µ0 from a DP with base measure H and then use the global measure as a shared base measure for the DPs of the hidden states. Let θN be the current state of the chain, yN+1 23 the observation emitted by the hidden state θN+1, and F a parametric distribution. Then, formally, the HDP-HMM model has the following form: µ0 ∼ DP(γ,H) (3.15) µθ iid∼ DP(α,µ0) ∀θ ∈Ω (3.16) θN+1 | {θn}Nn=1,{µθ} ∼ µθN (3.17) yN+1 | θN+1 ∼ F(θN+1) (3.18) It is worth mentioning that a model was introduced in [3] that allows count- ably infinite hidden states for an HMM. This model, known as the infinite hidden Markov model (IHMM), is in fact a CRF representation of an HDP-HMM. This model can be described as a two level hierarchy. Given the current state θN = i, there could be: 1. a self transition with probability proportional to the number of self transitions that were observed before for state i, nii plus a parameter α (i.e., nii+α), 2. a transition to an already visited state j with probability proportional to ni j, the number of times this transition was observed before, 3. a move to a higher level of the hierarchy where an “oracle” process is in- voked, with probability proportional to a parameter β . At this level of the hierarchy there are two possibilities: (a) a transition to an already visited state j, with probability proportional to n0j the number of times state j was previously chosen by the oracle regardless of the previous state, (b) a transition to a novel state with probability proportional to a parameter γ . The role of the oracle is similar to the role of µ0 in the definition of the HDP- HMM (Equation 3.15); that is, it ties together the transition models to have destina- tion states in common. Figure 3.2 illustrates the transition generative mechanism. The equivalence of this model and a CRF representation of HDP-HMM is shown in [42]. 24 Figure 3.2: Generative mechanism for the state transitions in the iHMM model 3.3 Sticky HDP-HMM Although the HDP-HMM was successful in various applications (see for exam- ple [26] or [45]) it has limitations. One limitation of the HDP-HMM is inadequacy in modeling temporal persistence of states; the HDP-HMM creates unnecessarily rapid transitions between states in order to explain all the transitions. By using µθ iid∼ DP(α,µ0) ∀θ ∈ Ω as in Equation 3.16, the HDP-HMM cannot differenti- ate between self-transitions and transitions to other states. Thus, although all states have similar transition distributions, there could be unnecessary transitions to other states. Particularly, in data sequences with persistent states, the flexibility of HDP- HMM will cause the sequence samples with rapid state switches to have high pos- terior probability. In other words, HDP-HMM will generate new redundant states instead of having larger probability of self transition. The sticky HDP-HMM model introduced in [13] provides a solution to the problem of state-persistence in the HDP-HMM. The model simply includes a self- transition bias parameter. 
3.3 Sticky HDP-HMM

Although the HDP-HMM has been successful in various applications (see for example [26] or [45]), it has limitations. One limitation of the HDP-HMM is its inadequacy in modeling temporal persistence of states; the HDP-HMM creates unnecessarily rapid transitions between states in order to explain all the transitions. By using µθ iid∼ DP(α, µ0) for all θ ∈ Ω, as in Equation 3.16, the HDP-HMM cannot differentiate between self-transitions and transitions to other states. Thus, although all states have similar transition distributions, there can be unnecessary transitions to other states. In particular, in data sequences with persistent states, the flexibility of the HDP-HMM will cause sequence samples with rapid state switches to have high posterior probability. In other words, the HDP-HMM will generate new redundant states instead of assigning larger probability to self-transitions.

The sticky HDP-HMM model introduced in [13] provides a solution to the problem of state persistence in the HDP-HMM. The model simply includes a self-transition bias parameter. For presenting the sticky HDP-HMM model in this section, we assume that all the hidden states are observed; hence, by the data sequence we mean the sequence of the states. The model proposed in [13] is a slight modification of the model in Section 3.1.4. Using the same notation as Sections 3.1.4 and 3.2, the model can be defined as follows:

φk iid∼ H   (3.19)
β | γ ∼ GEM(γ)   (3.20)
πθ | α0, β, η iid∼ DP(α0 + η, (α0β + ηδθ)/(α0 + η))   (3.21)
µ0 = ∑_{k=1}^∞ βk δφk   (3.22)
µθ = ∑_{k=1}^∞ πθk δφk, ∀θ ∈ Ω   (3.23)
θ_{N+1} | {θn}_{n=1}^N, {µθ} ∼ µ_{θN}   (3.24)

In this definition η is the self-transition bias parameter introduced to the HDP; here α0β + ηδθ means that η > 0 is added to the θth component of α0β. This parameter encourages self-transitions in the model; note that η = 0 reduces the model to the HDP-HMM.

3.4 Dependent Dirichlet Process (DDP)

DDPs are a general framework for modeling a collection of random probability measures G_T = {Gt : t ∈ T ⊂ R^d}. Here Gt is a random probability measure on some space (Ω, F_Ω) which is indexed by elements t ∈ T. We only consider the time domain in the following; hence, T = R+. As in the previous section, for presenting the DDP model we assume that all the states are observed. In the DDP we start from the stick-breaking construction of the DP (Section 2.3.3) and replace the weights and/or atoms with appropriate stochastic processes [29]. The general definition for a DDP provided in [29] is:

Definition 9. A DDP is a collection of random probability measures G_T = {Gt : t ∈ T}; Gt is defined as

$$G_t = \sum_{k=1}^{\infty} \pi_k(t) \, \delta_{\theta_k(t)} \tag{3.26}$$

where {θk(t) : t ∈ T}, for k = 1, 2, · · ·, are independent realizations from a stochastic process G0 defined on T, and the πk(t)s are distributed independently across t according to GEM(αt).

Note that in the above definition both the set of atoms and the set of weights depend on t. The DP and HDP can be considered as special cases of the DDP; for any fixed t, the DDP yields a DP. Moreover, the HDP is an example of a DDP where we have dependence in the weights but not in the atoms (see Section 3.1.4). The simplest construction for the DDP is obtained when the atoms are replaced with a stochastic process but the weights remain constant across different values of t. The resulting model is called a single-p DP [29], which is simply a DP mixture of stochastic processes. The advantage of this model is that it is simple and fairly flexible; however, the disadvantage is that the model lacks “locality” due to the global sharing of π. More general DDP models discuss how π should vary across T. A model proposed in [17], the order-based DDP, uses a common collection of stick-breaking proportions β = {βi}_{i=1}^∞ for all Gt, t ∈ T. Since the βis are i.i.d., for any permutation σ the weights πc = β_{σ(c)} ∏_{1≤c′<c} (1 − β_{σ(c′)}) are again GEM distributed.

[...]

0), and drop these conditioning events from the notation. The initial distribution is then the set of transition probabilities out of θbeg.

4.2 Conjugacy Property

In this section, we show that the posterior of each row, µθ | X, is also MGP distributed, with updated parameters. The sufficient statistics for the parameters of µθ | X, the empirical transition measures Fθ and the empirical waiting times Tθ, are defined as follows:

$$F_\theta = \sum_{n=1}^{N} 1[\theta_{n-1} = \theta] \, \delta_{\theta_n}, \tag{4.1}$$

$$T_\theta = \sum_{n=1}^{N} 1[\theta_{n-1} = \theta] \, J_n. \tag{4.2}$$
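Equations 4.1 and 4.2 reduce to simple counts along a path. A sketch of computing them, assuming the path representation X = (θ_n, J_n) of Chapter 2; the function and container choices are ours:

```python
from collections import Counter, defaultdict

def sufficient_stats(path):
    """Empirical transition measures F and waiting times T (Eqs. 4.1-4.2).

    path: list of (state, waiting_time) pairs (theta_n, J_n), n = 1..N, J_1 = 0."""
    F = defaultdict(Counter)    # F[theta][theta'] = number of theta -> theta' transitions
    T = Counter()               # T[theta] = total waiting time spent in theta
    for (prev, _), (cur, J) in zip(path, path[1:]):
        F[prev][cur] += 1       # transition theta_{n-1} -> theta_n
        T[prev] += J            # J_n is the time waited in theta_{n-1}
    return F, T
```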
The main result of this section is as follows:

Proposition 12. The GEP is a conjugate family: µθ | X ∼ MGP(µ′θ, β′θ), where µ′θ = Fθ + H0 and β′θ = Tθ + β0.

For the proof of Proposition 12, we will need the following elementary lemma:

Lemma 13. If V ∼ Beta(a, b) and W ∼ Gamma(a + b, c) are independent, then VW ∼ Gamma(a, c).

See for example [9] for a survey of standard beta-gamma results such as the one stated in this lemma. We now prove the proposition:

Proof. For simplicity, fix an arbitrary state θ and drop the index (this is without loss of generality since the rows are i.i.d.); hence, µ = µθ ∼ MGP(H0, β0), µ′ = µ′θ, and β′ = β′θ. Let (A1, . . . , AK) be a measurable partition of Ω. By the Kolmogorov consistency theorem, it is enough to show that for all such partitions,

(µ(A1), µ(A2), . . . , µ(AK)) | X ∼ Gamma(µ′(A1), β′) × · · · × Gamma(µ′(AK), β′).   (4.3)

Assume for simplicity that K = 2 (the argument can be generalized to K > 2 without difficulty), and let V = µ(A1)/‖µ‖, W = ‖µ‖. Then, by elementary properties of gamma distributed vectors:

V ∼ Beta(H0(A1), H0(A2)),   (4.4)
W ∼ Gamma(α0, β0),   (4.5)

and V, W are independent (both conditionally on X = (θn, Jn)_{n=1}^N and unconditionally). By beta-multinomial conjugacy, we have

V | X = V | (θn, Jn)_{n=1}^N =d V | θ1, . . . , θN ∼ Beta(µ′(A1), µ′(A2)),

and by gamma-exponential conjugacy, we have

W | X =d W | J1, · · · , JN ∼ Gamma(‖µ′‖, β′).

Using Lemma 13 with a = µ′(A1), b = µ′(A2), c = β′, we finally get that µ(A1) | X = VW | X ∼ Gamma(µ′(A1), β′), which concludes the proof.

Note that µ′θ is the mean parameter of an unnormalized DP, as µθ | X ∼ MGP(µ′θ, β′θ), which is an unnormalized DP.

4.3 Predictive Distribution

In this section we find an expression for the distribution of (θ_{N+1}, J_{N+1}) | X, which is the predictive distribution. We will need the following family of densities:

Definition 14 (Translated Pareto). Let α > 0, β > 0. We say that a random variable T is translated Pareto, denoted T ∼ TP(α, β), if it has density:

$$f(t) = 1[t > 0] \, \frac{\alpha \beta^{\alpha}}{(t+\beta)^{\alpha+1}}. \tag{4.6}$$

Proposition 15. The predictive distribution of the GEP is given by:

$$(\theta_{N+1}, J_{N+1}) \mid X \sim \bar{\mu}'_{\theta_N} \times \text{TP}(\|\mu'_{\theta_N}\|, \beta'_{\theta_N}). \tag{4.7}$$

Proof. By Proposition 12, it is enough to show that if µ ∼ MGP(H0, β0), θ | µ ∼ µ̄, and J | µ ∼ Exp(‖µ‖), then (θ, J) ∼ µ̄ × TP(α0, β0), where α0 = ‖H0‖. Note first that (J | θ) =d J, because the minimum and argmin of independent exponential random variables are independent. To get the distribution of J, we need to integrate over the rate parameter of the exponential waiting time distribution. Here we denote the density of the waiting time given the rate by p(t | x) and the density of the rate parameter by p(x). From Definition 10 and the properties of the gamma process described in Section 2.2, we know the rate parameter has a gamma distribution. Thus, to obtain the distribution of J we compute the following integral:

$$p(t) = \int_{x>0} p(t \mid x)\, p(x)\, dx = \int_{x>0} \text{Exp}(t; x) \cdot \text{Gamma}(x; \alpha_0, \beta_0)\, dx = \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)} \int_{x>0} x e^{-xt} \cdot x^{\alpha_0-1} e^{-\beta_0 x}\, dx$$

$$= \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)} \int_{x>0} x^{\alpha_0} e^{-(\beta_0+t)x}\, dx = \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)} \cdot \frac{\Gamma(\alpha_0+1)}{(\beta_0+t)^{\alpha_0+1}} = \frac{\alpha_0 \beta_0^{\alpha_0}}{(\beta_0+t)^{\alpha_0+1}}.$$

Hence J ∼ TP(α0, β0).
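The translated Pareto has the closed-form CDF F(t) = 1 − (β/(t+β))^α, so the predictive waiting time can be drawn by inversion; combined with one categorical draw from µ̄′, this gives a full sample from Equation 4.7. A minimal sketch:

```python
import numpy as np

def sample_tp(alpha, beta, rng):
    """Draw T ~ TP(alpha, beta) by inverting F(t) = 1 - (beta / (t + beta))**alpha."""
    u = rng.random()
    return beta * (u ** (-1.0 / alpha) - 1.0)

# Sanity check: E[T] = beta / (alpha - 1) for alpha > 1
rng = np.random.default_rng(6)
draws = np.array([sample_tp(3.0, 2.0, rng) for _ in range(100_000)])
print(draws.mean())   # close to 2.0 / (3.0 - 1.0) = 1.0
```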
A useful property of predictive distributions is exchangeability. We can show that these predictive distributions are indeed exchangeable:

Proposition 16. Let J_{j(θ,1)}, J_{j(θ,2)}, ..., J_{j(θ,K)} be the subsequence of waiting times following state θ. Then the random variables J_{j(θ,1)}, J_{j(θ,2)}, ..., J_{j(θ,K)} are exchangeable. Moreover, the joint density of the sequence of waiting times (J_{j(θ,1)} = j_1, J_{j(θ,2)} = j_2, ..., J_{j(θ,K)} = j_K) is given by:

$$p(j_1, j_2, \ldots, j_K) = \mathbf{1}[j_k > 0,\; k \in \{1, \ldots, K\}]\,\frac{(\alpha_0)_K\,\beta_0^{\alpha_0}}{(\beta_0 + j_1 + \cdots + j_K)^{\alpha_0 + K}}, \tag{4.8}$$

where the Pochhammer symbol (x)_n is defined as (x)_n = x(x+1)⋯(x+n−1).

Proof. From Proposition 15, we have:

$$\begin{aligned}
p(j_1, j_2, \ldots, j_K) &= p(j_1)\, p(j_2 \mid j_1) \times \cdots \times p(j_K \mid j_1, \ldots, j_{K-1}) \\
&= \mathrm{TP}(\alpha_0, \beta_0)\,\mathrm{TP}(\alpha_0 + 1, \beta_0 + j_1) \times \cdots \times \mathrm{TP}(\alpha_0 + K - 1, \beta_0 + j_1 + \cdots + j_{K-1}) \\
&= \mathbf{1}[j_k > 0, k \in \{1, \ldots, K\}]\, \frac{\alpha_0\,\beta_0^{\alpha_0}}{(\beta_0 + j_1)^{\alpha_0+1}} \cdot \frac{(\alpha_0 + 1)(\beta_0 + j_1)^{\alpha_0+1}}{(\beta_0 + j_1 + j_2)^{\alpha_0+2}} \times \cdots \times \frac{(\alpha_0 + K - 1)(\beta_0 + j_1 + \cdots + j_{K-1})^{\alpha_0+K-1}}{(\beta_0 + j_1 + \cdots + j_K)^{\alpha_0+K}} \\
&\propto \mathbf{1}[j_k > 0, k \in \{1, \ldots, K\}]\,(\beta_0 + j_1 + \cdots + j_K)^{-\alpha_0 - K}.
\end{aligned}$$

The second line follows from the first because each jump out of state θ updates the sufficient statistics F_θ and T_θ; consequently, the parameters of the TP (i.e., ‖µ′_θ‖ and β′_θ) are updated. Telescoping the product shows that the normalizing constant of the above expression is indeed 1/((α0)_K β0^{α0}), giving Equation 4.8.

Chapter 5

HIERARCHICAL GAMMA EXPONENTIAL PROCESS

5.1 Motivation

In this section, we present a hierarchical version of the GEP, in which the rate matrix has exchangeable rather than i.i.d. rows. Informally, the motivation behind this construction is to let the rows share information about which states are frequently visited. This is similar to the motivation behind the HDP-HMM model presented in Section 3.2. Recall that in order to link the rows of the transition probability matrix in the HDP-HMM model, we used a random shared base probability measure distributed according to a DP.

As with HDPs, the hierarchical construction is especially important when Ω is uncountable. For such spaces, since each GEP sample has a random countable support, any two independent GEP samples have disjoint supports with probability one. Therefore, a GEP alone cannot be used to construct recurrent Markov processes when Ω is uncountable. Fortunately, the hierarchical model introduced in this chapter addresses this issue: it yields a recurrent model for MJPs over both countable and uncountable spaces Ω.

5.2 Definition

The hierarchical gamma exponential process (HGEP) is constructed by making the base measure parameter of the rows shared and random. Formally, the model has the following form:

$$\begin{aligned}
\mu_0 &\sim \mathrm{MGP}(H_0, \gamma_0) \\
\mu_\theta \mid \mu_0 &\overset{iid}{\sim} \mathrm{MGP}(\mu_0, \beta_0) \\
\theta_{N+1} \mid X, \{\mu_\theta\}_{\theta\in\Omega} &\sim \bar{\mu}_{\theta_N} \\
J_{N+1} \mid X, \{\mu_\theta\}_{\theta\in\Omega} &\sim \mathrm{Exp}(\|\mu_{\theta_N}\|).
\end{aligned} \tag{5.1}$$

In order to get a tractable predictive distribution, we introduce a set of auxiliary variables. As we will see shortly, these auxiliary variables are closely related to the variables used in the CRF metaphor presented in Section 3.1.3 to indicate when new tables are created in a given restaurant (i.e., m_jk).

5.3 Connection to CRF

From the connection between the DP and the gamma process described in Section 2.2.1, we know that if X ∼ DP(α0, H0) and Y ∼ Gamma(γ0, β0) then XY ∼ MGP(γ0 H0, β0). Hence, we can rewrite Definition 5.1 as:

$$\|\mu_0\| \sim \mathrm{Gamma}(\|H_0\|, \gamma_0) \tag{5.2}$$
$$\bar{\mu}_0 \sim \mathrm{DP}(\|H_0\|, \bar{H}_0) \tag{5.3}$$
$$\|\mu_\theta\| \mid \|\mu_0\| \overset{iid}{\sim} \mathrm{Gamma}(\|\mu_0\|, \beta_0) \tag{5.4}$$
$$\bar{\mu}_\theta \mid \bar{\mu}_0, \|\mu_0\| \overset{iid}{\sim} \mathrm{DP}(\|\mu_0\|, \bar{\mu}_0) \tag{5.5}$$
$$\theta_{N+1} \mid X, \{\mu_\theta\}_{\theta\in\Omega} \sim \bar{\mu}_{\theta_N} \tag{5.6}$$
$$J_{N+1} \mid X, \{\mu_\theta\}_{\theta\in\Omega} \sim \mathrm{Exp}(\|\mu_{\theta_N}\|). \tag{5.7}$$

Note the similarity between Equations 5.3 and 3.1, and between Equations 5.5 and 3.2. The only difference is in Equation 5.5, where we condition on a random variable ‖µ0‖, the normalization of the top-level random measure. Since we now have a structure similar to the HDP model, we can construct the predictive distribution for the states θ in the HGEP along the lines of the CRF model of Section 3.1.3.

In the HGEP, a restaurant can be understood as a row in the rate matrix, the tables as groups of transitions to the same destination state, and a dish as a destination state. We use an auxiliary variable A_n to indicate when the n-th transition creates a new table: A_n = 1 means the n-th transition opens a new table (i.e., it is drawn from the back-off distribution rather than from the empirical transitions out of the current state, and hence may even land on a previously visited state); A_n = 0 otherwise.
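To make the representation in Equations 5.2-5.7 concrete, the sketch below (Python; function names are ours) forward-simulates an HGEP-distributed MJP using a finite stick-breaking truncation of the DPs. It relies on the fact that when the base measure is discrete with K atoms of weights w, a draw from DP(α, w) restricted to those atoms is exactly Dirichlet(αw); the only approximation of the model is the truncation itself, and the initial state draw is an arbitrary choice of ours.

```python
import numpy as np

def sample_hgep_truncated(K, H0_mass, gamma0, beta0, T_end, rng):
    """Forward-simulate an HGEP-distributed MJP, truncating the DPs to K atoms."""
    # Top level: total mass ||mu_0|| and stick-breaking weights of mu_0-bar (Eqs 5.2-5.3).
    mu0_mass = rng.gamma(shape=H0_mass, scale=1.0 / gamma0)
    v = rng.beta(1.0, H0_mass, size=K)
    mu0_bar = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    mu0_bar /= mu0_bar.sum()  # renormalize the truncated sticks
    # Rows: per-state total rates and transition distributions (Eqs 5.4-5.5).
    row_mass = rng.gamma(shape=mu0_mass, scale=1.0 / beta0, size=K)
    rows = rng.dirichlet(mu0_mass * mu0_bar + 1e-12, size=K)
    # Simulate the jump process (Eqs 5.6-5.7); initial state drawn from mu_0-bar.
    t, state, path = 0.0, rng.choice(K, p=mu0_bar), []
    while t < T_end:
        J = rng.exponential(scale=1.0 / row_mass[state])  # sojourn in current state
        path.append((state, J))
        t += J
        state = rng.choice(K, p=rows[state])
    return path

rng = np.random.default_rng(1)
print(sample_hgep_truncated(K=20, H0_mass=5.0, gamma0=1.0, beta0=1.0, T_end=10.0, rng=rng)[:5])
```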
We augment the sufficient statistics (i.e., F_θ and T_θ defined in Section 4.2) with empirical counts of the number of tables, across all restaurants, that serve a given dish: G = ∑_{n=1}^N A_n δ_{θ_n}. In other words, G({θ}) is the number of table-creating transitions into state θ, pooled over all source states. We also need to introduce one additional auxiliary variable, the normalization of the top-level random measure, ‖µ0‖; this auxiliary variable has no equivalent in CRFs.

Based on the construction of the CRF in Section 3.1.3 and the DP-based representation of the HGEP described above, we can write the predictive distribution for the state θ_{N+1} in the form of Equation 3.7:

$$\begin{aligned}
\mu'' &= G + H_0 \\
\bar{\mu}'' &= \frac{G}{\|G + H_0\|} + \frac{\|H_0\|}{\|G + H_0\|}\,\bar{H}_0 \\
\theta_{N+1} \mid \theta_1, \ldots, \theta_N, \|\mu_0\|, \bar{\mu}'' &\sim \frac{F_{\theta_N}}{\|F_{\theta_N} + \mu_0\|} + \frac{\|\mu_0\|}{\|F_{\theta_N} + \mu_0\|}\,\bar{\mu}''
\end{aligned} \tag{5.8}$$

Note that F_{θ_N}({θ_i}) is the analogue of n_{N·i}, the number of customers eating dish i in restaurant N, in Equation 3.7. Likewise, G({θ_i}) is the analogue of m_{·i}, the number of tables serving dish i in the CRF model in Equation 3.7.

We denote the predictive distribution for the next state presented in Equation 5.8 by µ̄′(H)_{θ_N}; the superscript (H) distinguishes it from the non-hierarchical case. Figure 5.1 illustrates an example of generating a state θ_{N+1} from the predictive distribution µ̄′(H)_{θ_N}. Assume that we are at the N-th jump and have observed two distinct states a and b so far. The idea is to view the predictive distribution for the next state θ_{N+1} in the hierarchical model as a mixture of two possibilities, one of which is selected with probability proportional to (‖F_{θ_N}‖, ‖µ0‖):

[Figure 5.1: An illustration of generating the next state given the current state θ_N in the HGEP model. Paths P1a, P1b follow the empirical transition counts F_{θ_N}; paths P2a, P2b follow the table counts G; path P3 falls through to the base measure H̄0.]

1. Generate from F̄_{θ_N}, the empirical distribution over the transitions starting at θ_N. In the CRF analogy this is "joining one of the existing tables in the current restaurant (i.e., the current state θ_N)". In the example in Figure 5.1, this case corresponds to path P1a or P1b. When this alternative is selected, the successor state θ_{N+1} is chosen with probability proportional to F_{θ_N} ("the new customer picks the dish of the selected existing table"). In the example, the next state is a or b with probability proportional to F_{θ_N}({a}) or F_{θ_N}({b}) respectively.

2. Generate recursively from a "back-off" distribution, µ̄″ ("creating a new table in the current restaurant"). When this alternative is selected, the new table picks a dish. The dish can be picked from G, the empirical distribution over the dishes picked by tables across all restaurants (corresponding to paths P2a or P2b), or it can be picked from the "back-off" distribution, the normalized base measure H̄0 (path P3).

It remains to provide a formal definition of the table-creation auxiliary variable A_n. By augmenting the sufficient statistics with an indicator of which of the two alternatives (1, 2) is selected at each transition, the predictive distribution takes a tractable form. Formally, the table-creation auxiliary variables are defined as follows:

$$\begin{aligned}
P(A_{N+1} = a \mid \|\mu_0\|, X) &\propto \|\mu_0\|^a\, \|F_{\theta_N}\|^{1-a}\, \mathbf{1}[a \in \{0, 1\}] \\
\theta_{N+1} \mid A_{N+1}, X &\sim (1 - A_{N+1})\,\bar{F}_{\theta_N} + A_{N+1}\,\bar{\mu}''.
\end{aligned}$$
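As a concrete illustration, here is a minimal sketch (Python; the dictionary-of-counts containers are ours, and for simplicity the base measure H0 is treated as a finite list of atoms drawn from uniformly) of one draw from the predictive distribution in Equation 5.8:

```python
import numpy as np

def sample_next_state(F_cur, G, H0_atoms, H0_mass, mu0_mass, rng):
    """One draw of theta_{N+1} from Eq. 5.8. Returns (state, A) where A=1 marks a new table.
    F_cur: dict state -> count of transitions out of the current state.
    G:     dict state -> number of tables serving that state across all restaurants."""
    F_total = sum(F_cur.values())
    # Alternative 1 vs 2, chosen with probability proportional to (||F||, ||mu_0||).
    if rng.uniform() < F_total / (F_total + mu0_mass):
        states = list(F_cur)
        w = np.array([F_cur[s] for s in states], dtype=float)
        return states[rng.choice(len(states), p=w / w.sum())], 0
    # Alternative 2: new table; dish from G or, backing off, from H0-bar.
    G_total = sum(G.values())
    if rng.uniform() < G_total / (G_total + H0_mass):
        states = list(G)
        w = np.array([G[s] for s in states], dtype=float)
        return states[rng.choice(len(states), p=w / w.sum())], 1
    return H0_atoms[rng.integers(len(H0_atoms))], 1  # fresh draw from the base measure

rng = np.random.default_rng(2)
print(sample_next_state({'a': 3, 'b': 1}, {'a': 2, 'b': 1}, ['a', 'b', 'c'], 1.0, 0.5, rng))
```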
5.4 Predictive Distribution

In the previous section we showed that the predictive distribution for the next state given the past states and ‖µ0‖ has a CRF representation. In this section we present the predictive distribution for the next event given the auxiliary variables and the past events. The main result of this section is as follows:

Proposition 17. The predictive distribution of the HGEP is given by:

$$(\theta_{N+1}, J_{N+1}) \mid \big(X, \{A_n\}_{n=1}^N, \|\mu_0\|\big) \sim \bar{\mu}'^{(H)}_{\theta_N} \times \mathrm{TP}(\|\mu'^{(H)}_{\theta_N}\|, \beta'_{\theta_N}).$$

Proof. Conditionally on ‖µ0‖, θ_{N+1} and J_{N+1} are independent. As mentioned, (θ_{N+1} | X, {A_n}_{n=1}^N, ‖µ0‖) can be viewed as an HDP with concentration ‖H0‖ for the top-level DP and ‖µ0‖ for the lower-level DPs; hence µ̄′(H)_{θ_N} is the predictive distribution of θ_{N+1}. For J_{N+1} | X, ‖µ0‖, we know from Proposition 15 that the predictive distribution is a translated Pareto distribution with parameters ‖F_{θ_N} + µ0‖, the normalization constant of the updated base measure, and T_{θ_N} + β0, the updated rate parameter. That is, J_{N+1} | X, ‖µ0‖ ∼ TP(‖µ′(H)_{θ_N}‖, β′_{θ_N}).

The conditional distribution of the auxiliary variable ‖µ0‖ given all the other random variables will be useful for the inference algorithm introduced in Chapter 6. It can be shown that, given all the other random variables, ‖µ0‖ has a gamma distribution:

Proposition 18. The conditional distribution of ‖µ0‖ given the other variables is a gamma distribution:

$$\|\mu_0\| \mid X, \{A_n\}_{n=1}^N \sim \mathrm{Gamma}(a, b), \qquad a = \|H_0 + G\|, \quad b = \gamma_0 + \sum_{\theta\in\Omega} \log(\beta'_\theta / \beta_0).$$

Proof. The conditional distribution has density p(x) proportional to:

$$p(\|\mu_0\| = x \mid (\theta_n, j_n)_{n=1}^N, \{A_n\}_{n=1}^N) \propto p(\|\mu_0\| = x)\; p(j_1, \ldots, j_N \mid \|\mu_0\| = x)\; p(\theta_1, \ldots, \theta_N \mid \|\mu_0\| = x).$$

For p(θ_1, ..., θ_N | ‖µ0‖), the joint density of the states given ‖µ0‖, we use the predictive distribution of the states given by Equation 5.8. Let i_ω be the set of indices of the events for which θ_i = ω, i.e., i_ω = {i : θ_i = ω}. We denote the size of the set i_ω = {1_ω, ..., N_ω} by |i_ω|, and the number of times a transition from state ω results in a new table by G_ω. We have:

$$\begin{aligned}
p(\theta_{1_\omega}, \ldots, \theta_{N_\omega} \mid \|\mu_0\| = x)
&= p(\theta_{1_\omega} \mid \|\mu_0\| = x)\, p(\theta_{2_\omega} \mid \theta_{1_\omega}, \|\mu_0\| = x) \times \cdots \times p(\theta_{N_\omega} \mid \theta_{1_\omega}, \ldots, \theta_{(N-1)_\omega}, \|\mu_0\| = x) \\
&\propto \frac{x^{G_\omega}}{x(x+1)\cdots(x + |i_\omega| - 1)} = \frac{x^{G_\omega}}{(x)_{\|F_\omega\|}}.
\end{aligned}$$

Therefore, for the sequence of states θ_1, ..., θ_N we have:

$$p(\theta_1, \ldots, \theta_N \mid \|\mu_0\| = x) \propto \prod_{\theta\in\Omega} \frac{x^{G_\theta}}{(x)_{\|F_\theta\|}} = x^{\|G\|} \prod_{\theta\in\Omega} \big((x)_{\|F_\theta\|}\big)^{-1}.$$

Finally, using Equations 5.2, 4.8, and 5.8, we have for p(x):

$$\begin{aligned}
p(x) &\propto \Big(x^{\alpha_0 - 1} \exp(-\gamma_0 x)\Big) \left(\prod_{\theta\in\Omega} \frac{(x)_{\|F_\theta\|}\,\beta_0^x}{(T_\theta + \beta_0)^{x + \|F_\theta\|}}\right) \left(x^{\|G\|} \prod_{\theta\in\Omega} \big((x)_{\|F_\theta\|}\big)^{-1}\right) \\
&\propto x^{\alpha_0 + \|G\| - 1} \exp\!\left(-x\Big(\gamma_0 + \sum_{\theta\in\Omega} \log(\beta'_\theta / \beta_0)\Big)\right).
\end{aligned}$$
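In a sampler, Proposition 18 becomes a one-line Gibbs update for ‖µ0‖. A minimal sketch (Python; the variable names are ours) assuming the table count ‖G‖ and the updated per-state rates β′_θ have already been tallied:

```python
import numpy as np

def resample_mu0_mass(H0_mass, gamma0, beta0, G_total, beta_prime, rng):
    """Gibbs update for ||mu_0|| per Proposition 18.
    G_total: total number of tables ||G||; beta_prime: dict state -> T_theta + beta0."""
    a = H0_mass + G_total                                         # ||H0 + G||
    b = gamma0 + sum(np.log(bp / beta0) for bp in beta_prime.values())
    return rng.gamma(shape=a, scale=1.0 / b)  # numpy's gamma takes scale = 1/rate

rng = np.random.default_rng(3)
print(resample_mu0_mass(H0_mass=5.0, gamma0=1.0, beta0=1.0,
                        G_total=7, beta_prime={'a': 2.5, 'b': 4.0}, rng=rng))
```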
Chapter 6

INFERENCE

In the previous two chapters we assumed the sequence of events is observed. We now relax this assumption and use the models described in the last two chapters as a prior distribution over a sequence of hidden events. The goal of this chapter is to propose an inference algorithm that approximates the posterior of the hidden events X given the observations Y. In particular, we are interested in approximating E[h(X)|Y], the expectation of a general function of the events X under the posterior distribution of GEPs.

In most applications, the sequence of states is neither directly nor fully observed. First, instead of observing the states θ, we observe X-valued random variables Y_n distributed according to a parametric family P indexed by the states θ of the chain, P = {L_θ : F_X → [0, 1], θ ∈ Ω}. Second, the observations are generally available only for a finite set of times T. For instance, in Figure 2.1, instead of observing the MJP denoted by X, we only have observations at time points t_1, ..., t_G (we will discuss multiple sequences in the next section). To specify the random variables in question, we need notation for the hidden event index at a given time t:

$$I(t) = \min\Big\{N : \sum_{n=1}^{N+1} J_n > t\Big\}$$

(see Figure 2.1, where I(t*) = N − 1). Thus, assuming conditional independence, for an individual observation at time point t we have Y(t) | X =_d Y(t) | θ_{I(t)} ∼ L_{θ_{I(t)}}. The set of all observed random variables is then defined as

$$Y = \big(Y(t_1), Y(t_2), \ldots, Y(t_G) : t_g < t_{g+1}, \{t_i\} = \mathcal{T}\big).$$

We then denote the list of hidden events from time t = 0 to t_G by X_{1:G} = (θ_n, J_n)_{n=1}^{I(t_G)}, where t_G ∈ T.

We will describe the general inference framework for the model of Chapter 4. The extension to hierarchical models is direct (by keeping track of an additional sufficient statistic G, as well as the auxiliary variables A_n and ‖µ0‖). For simplicity, we assume in this section that P is a conjugate family with respect to H0. Non-conjugate models can be handled by incorporating the auxiliary variables of Algorithm 8 in [33].

As mentioned, our goal is to approximate expectations under the posterior distribution of GEPs. However, as the posterior distribution is intractable, we cannot evaluate posterior expectations directly. Thus, we rely on Monte Carlo methods and sample from approximations to the posterior distribution of the hidden events given the observations, p(X_{1:G}|Y).

Commonly used algorithms for sampling (approximately) from a posterior distribution in Bayesian statistics are based on Markov chain Monte Carlo (MCMC) methods. To approximately sample from the posterior distribution of our model we use a particle Markov chain Monte Carlo (PMCMC) algorithm [2], a special type of MCMC algorithm well suited to sampling sequential data.

6.1 MCMC Algorithms

In this section we briefly review MCMC methods; for a detailed discussion and proofs of the results see for example [28]. MCMC is a general approach for generating samples from a target distribution p(x) using a Markov chain mechanism. The samples x(i) generated by an MCMC-based algorithm form a Markov chain that converges to p(x). We use MCMC when we cannot sample from p(x) directly but can evaluate it up to a normalizing constant.

Recall that the evolution of a Markov chain depends only on its current state and a probability transition kernel, which we denote by T. As long as the Markov chain is irreducible and aperiodic, there is a distribution p(x), called the invariant distribution, to which the chain converges. Informally, an aperiodic irreducible Markov chain is one that has no cyclic behavior and, from any of its states, can reach every other state in a finite number of steps with positive probability.¹ The detailed balance condition, a condition that guarantees p(x) is the invariant distribution of the chain, is:

$$p(x^{(i)})\, T(x^{(i-1)} \mid x^{(i)}) = p(x^{(i-1)})\, T(x^{(i)} \mid x^{(i-1)});$$

an MCMC sampler can be obtained by satisfying the detailed balance condition.

The most popular MCMC algorithm is the Metropolis-Hastings (MH) algorithm; most MCMC algorithms can be viewed as special cases or extensions of it. In every step of this algorithm, given the current state of the Markov chain x_c, a candidate state x* is sampled from a proposal distribution q(·|x_c).

¹For a formal definition see for example [19].
Then x* is accepted with probability

$$\min\left\{1, \frac{p(x^*)\, q(x_c \mid x^*)}{p(x_c)\, q(x^* \mid x_c)}\right\};$$

if x* is accepted, the Markov chain moves to x*; otherwise it remains at x_c. Note that we only need p(x) up to a normalizing constant to compute the acceptance probability. Although the MH algorithm is simple, finding a proposal distribution that yields a fast convergence rate can be challenging. The MH algorithm converges if the chain has no cycles and every state with positive probability can be reached in a finite number of steps. Under these conditions the chain converges to the target distribution p(x) after a sufficient burn-in period, the number of initial steps the chain takes to approach p(x).

By construction, every sample depends on the previous sample, so consecutive draws from the Markov chain are dependent. To reduce this dependence and get closer to i.i.d. samples, some authors suggest thinning: keeping only every r-th sample generated by the chain.

A variant of the MH algorithm is the independent MH algorithm, for which the proposal is independent of the current state, q(x*) = q(x*|x_c). The acceptance probability is then min{1, p(x*)q(x_c) / (p(x_c)q(x*))}.

Another MCMC algorithm that can be considered a special case of MH is the Gibbs sampler; it can be shown that the Gibbs sampler is an MH algorithm with acceptance probability equal to 1. Suppose the goal is to sample from an n-dimensional joint probability distribution p(x_1, x_2, ..., x_n) from which direct sampling is hard. If samples from the full conditionals p(x_i | x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) can easily be obtained, then we can produce samples from the joint distribution by sampling from the conditional distributions. We initialize the Gibbs sampler with an arbitrary sample (x_1(0), x_2(0), ..., x_n(0)). Then, for iterations i = 1 to N, the algorithm consists of n steps, each conditioning on the most recently sampled values of the other coordinates:

Sample x_1(i) ∼ p(x_1 | x_2(i−1), x_3(i−1), ..., x_n(i−1))
Sample x_2(i) ∼ p(x_2 | x_1(i), x_3(i−1), ..., x_n(i−1))
...
Sample x_k(i) ∼ p(x_k | x_1(i), ..., x_{k−1}(i), x_{k+1}(i−1), ..., x_n(i−1))
...
Sample x_n(i) ∼ p(x_n | x_1(i), x_2(i), ..., x_{n−1}(i))
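As an illustration of the MH recipe above, here is a minimal, self-contained sketch (Python; the toy target and random-walk proposal are choices of ours, for which the q terms in the acceptance ratio cancel):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_iters, step, rng):
    """Random-walk MH: log_p evaluates the target up to an additive constant."""
    x, samples = x0, []
    for _ in range(n_iters):
        x_star = x + step * rng.normal()        # symmetric proposal q(x* | x_c)
        log_accept = log_p(x_star) - log_p(x)   # q terms cancel for a symmetric q
        if np.log(rng.uniform()) < log_accept:
            x = x_star                          # accept and move; otherwise stay at x_c
        samples.append(x)
    return np.array(samples)

rng = np.random.default_rng(4)
# Toy target: unnormalized standard normal.
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=5.0, n_iters=20_000, step=1.0, rng=rng)
print(chain[2_000:].mean(), chain[2_000:].std())  # discard burn-in; approximately (0, 1)
```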
With any MCMC sampler, the empirical distribution of the generated samples yields p̂(x), the approximation to the target distribution p(x); in Bayesian contexts, the target distribution is a posterior distribution. In addition to approximating the posterior distribution p(x), the samples obtained from an MCMC sampler can be used to approximate the mode (or other characteristics) of the posterior distribution. For instance, given N samples, we can approximate the maximum a posteriori (MAP) value by argmax_{x(i); i=1,...,N} p̂(x(i)). In applications we usually only consider the samples after the burn-in period.

The theory guarantees that the chain converges to the target distribution, but it does not say how long the chain must run to obtain convergence. In particular, there are two practical issues to address when using an MCMC sampler: how long the burn-in period should be, and how long to run the chain after that. Different diagnostics are available for monitoring the chain's convergence. The simplest diagnostic is to visualize the state of the chain across iterations; the absence of any evident trend in the plot suggests that (approximate) convergence has been achieved. There are also quantitative diagnostics such as Geweke's diagnostic [15], Heidelberger and Welch's diagnostic [21], and Gelman and Rubin's diagnostic [14]. Although these diagnostics are informative guides, they cannot definitively tell whether the chain has converged; judging the adequacy of convergence of MCMC samplers necessarily retains a subjective element.

If we are convinced that the chain has converged, we then need to decide how long to run it to get a reasonable approximation. The answer depends on the objective of the study: the greater the precision required, the longer the chain needs to run. We can use Raftery and Lewis's diagnostic [37] to estimate how long the chain must run in order to estimate quantiles to within a specified accuracy with a specified probability.

6.2 PMCMC Algorithm

As mentioned, we use a special type of MCMC algorithm, the PMCMC algorithm, to approximately sample from the posterior distribution in our model. There are different flavors of PMCMC algorithms; we use the particle independent Metropolis-Hastings (PIMH) algorithm [2], an MCMC algorithm whose proposals are sequential Monte Carlo (SMC) approximations of p(X_{1:G}|Y). We explain the SMC algorithm in detail in the following section; for now it can be thought of as an algorithm that provides an approximation of the posterior density p(X_{1:G}|Y) and of the marginal likelihood p(Y).

As mentioned, to sample from p(X_{1:G}|Y) in a standard MCMC algorithm we sample from a proposal density q(X_{1:G}|Y). Given a current state X^c_{1:G}, the proposed sample X*_{1:G} is accepted with probability:

$$\min\left\{1, \frac{p(X^*_{1:G} \mid Y)\, q(X^c_{1:G} \mid Y)}{p(X^c_{1:G} \mid Y)\, q(X^*_{1:G} \mid Y)}\right\}.$$

In the PMCMC algorithm we use an SMC approximation of p(X_{1:G}|Y), denoted p̂(X_{1:G}|Y), as our proposal density. Note that at each iteration of the PMCMC algorithm, the sample from p̂(X_{1:G}|Y) is a list of hidden events from t_1 to t_G. It can be shown [2] that with an appropriate acceptance ratio this proposal is a valid MCMC move; the algorithm then converges to the desired posterior density p(X_{1:G}|Y). The PMCMC algorithm takes the following simple form:

1. Iteration i = 0: run an SMC algorithm to obtain an approximation of p(X_{1:G}|Y); sample X_{1:G}(0) ∼ p̂(·|Y) and denote the corresponding marginal likelihood estimate p̂(Y) by L(0).

2. Iteration i ≥ 1:
(a) run the SMC algorithm and sample X*_{1:G} ∼ p̂(·|Y), denoting the corresponding marginal likelihood estimate by L*;
(b) with probability min{1, L*/L(i−1)} accept X*_{1:G}: set X_{1:G}(i) = X*_{1:G} and L(i) = L*. Otherwise, set X_{1:G}(i) = X_{1:G}(i−1) and L(i) = L(i−1).

Denoting the total number of iterations by iter and the number of burn-in iterations by B, the collection of samples {X_{1:G}(i)}_{i=B+1}^{iter} are approximately samples from the posterior density p(X_{1:G}|Y), for large enough B (i.e., a sufficient burn-in period). Table 6.1 summarizes the notation used in the PMCMC and SMC algorithms.
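The PIMH loop above is short enough to state directly in code. The sketch below (Python) assumes a callable `run_smc(y, rng)` as a stand-in for the SMC algorithm of Section 6.3, returning one sampled trajectory together with its marginal likelihood estimate; only the accept/reject logic is spelled out.

```python
import numpy as np

def pimh(run_smc, y, n_iters, rng):
    """Particle independent MH: SMC output proposes, marginal likelihoods decide."""
    x, L = run_smc(y, rng)                 # iteration i = 0
    samples = [x]
    for _ in range(1, n_iters):
        x_star, L_star = run_smc(y, rng)   # fresh SMC proposal
        if rng.uniform() < min(1.0, L_star / L):
            x, L = x_star, L_star          # accept the proposed trajectory
        samples.append(x)
    return samples

# Usage with a toy stand-in for run_smc (dummy path and likelihood):
rng = np.random.default_rng(5)
dummy_smc = lambda y, rng: ([("state", 1.0)], float(np.exp(rng.normal())))
print(len(pimh(dummy_smc, y=None, n_iters=100, rng=rng)))
```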
In general, there may be several sequences of observations; for instance, we can have sequences of EDSS values from several MS patients, with different observation times for different sequences. We denote the number of time series by K, each of the form

$$Y^{(k)} = \big(Y^{(k)}(t^{(k)}_1), Y^{(k)}(t^{(k)}_2), \ldots, Y^{(k)}(t^{(k)}_{G^{(k)}}) : t^{(k)}_g < t^{(k)}_{g+1}, \{t^{(k)}_i\} = \mathcal{T}^{(k)}\big), \quad k \in \{1, \ldots, K\};$$

moreover, we use the superscript \k to indicate all the sequences except k. In the case of multiple sequences, we resample the hidden events X^(k)_{1:G^(k)} for one sequence k given the sufficient statistics of all the other sequences, (F^(\k)_θ, T^(\k)_θ), and the sufficient statistics of the likelihood model, denoted S^(\k)_θ. The role of the sufficient statistics in the algorithm and their updating procedure will be explained in the next section, where we describe the SMC algorithm. To simplify the notation we denote X^(k)_{1:G^(k)} by X^(k). The pseudocode for the PMCMC algorithm in the GEP model (with multiple sequences) is provided in Figure 6.1.

Table 6.1: Notation used in the PMCMC algorithm

- I(t): the hidden event index at time t: I(t) = min{N : ∑_{n=1}^{N+1} J_n > t}
- X_{1:g}: the list of hidden events from time t = 0 to t_g: X_{1:g} = (θ_n, J_n)_{n=1}^{I(t_g)}
- Y(t): an observation at time t
- Y_{1:g}: the observations from time t_1 to time t_g: Y(t_1), Y(t_2), ..., Y(t_g)
- X_{m,g}: a particle, a list of hidden events indexed by n covering the first g observations: X_{m,g} = (θ_{m,n}, J_{m,n})_{n=0}^{N_{m,g}}
- N_{m,g}: the smallest number of events required to cover the first g observations in the m-th particle: t_g ≤ ∑_{n=0}^{N_{m,g}} J_{m,n}
- X′_{m,g+1}: a new particle extended from X_{m,g}
- X̃^m_{g+1}: an event (a pair of state and waiting time) sampled from the proposal q(·|Y(t_{g+1}), X′_{m,g+1})
- X^m_{g+1}: the list of events appended to X_{m,g}
- W_{m,g}: the importance weight of the particle X_{m,g}
- F^k_{θ,m,n}: empirical transition measures for the m-th particle after the n-th jump in the k-th sequence
- T^k_{θ,m,n}: empirical waiting times for the m-th particle after the n-th jump in the k-th sequence
- G^k_{m,n}: empirical counts of the number of transitions from all states to the same destination after the n-th jump for the m-th particle in the k-th sequence
- S^k_{θ,m,n}: sufficient statistics of the likelihood model for the m-th particle after the n-th jump in the k-th sequence
- \k: all the sequences except the k-th (used as a superscript for the sufficient statistics)

Algorithm 1 PMCMC algorithm for inference in the GEP model
K: number of sequences; k: sequence number, k ∈ {1, ..., K}
For i = 0 to #iterations do
  For k = 1 to K do
    - Run an SMC algorithm to obtain an approximation of p(X^(k) | Y^(k), F^(\k)_θ, T^(\k)_θ, S^(\k)_θ)
    - Sample X^(k)_* ∼ p̂(· | Y^(k), F^(\k)_θ, T^(\k)_θ, S^(\k)_θ) and estimate the corresponding marginal likelihood L^(k)_*
    - Sample u ∼ Unif[0, 1]
    If u < min{1, L^(k)_* / L^(k)(i−1)} or i = 0
      - Set X^(k)(i) = X^(k)_*, L^(k)(i) = L^(k)_*
      - Update (F^(k)_θ, T^(k)_θ, S^(k)_θ)
    else
      - Set X^(k)(i) = X^(k)(i−1) and L^(k)(i) = L^(k)(i−1)
    End If
  End For
End For

Figure 6.1: Pseudocode for the PMCMC algorithm

6.3 SMC Algorithm

SMC algorithms provide an approximation of p(X_{1:G}|Y), the posterior density, and of p(Y), the marginal likelihood [2]. These algorithms approximate the posterior density p(X_{1:G}|Y) sequentially²: an SMC algorithm first approximates p(X_{1:1}|Y(t_1)), then p(X_{1:2}|Y(t_1), Y(t_2)), and so on. We denote the observations from time t_1 to time t_g (i.e., Y(t_1), Y(t_2), ..., Y(t_g)) by Y_{1:g}. Given the set of all observations Y_{1:G}, the approximation of the posterior is obtained from a set of M weighted random samples:

$$\hat{p}(dX_{1:G} \mid Y_{1:G}) := \sum_{m=1}^{M} W_{m,G}\, \delta_{X_{m,G}}(dX_{1:G}), \tag{6.1}$$

where W_{m,G} is the importance weight of the random sample X_{m,G}. Each random sample X_{m,g}, m ∈ {1, ..., M}, is called a particle; it consists of a list of hidden events indexed by n, containing both (hidden) states and waiting times: X_{m,g} = (θ_{m,n}, J_{m,n})_{n=0}^{N_{m,g}}.

²To simplify the description of the algorithm we begin by explaining it for the case of a single sequence; we then generalize to the case of multiple sequences.
We use N_{m,g} to denote the smallest number of events required to cover the first g observations in the m-th particle; that is, t_g ≤ ∑_{n=0}^{N_{m,g}} J_{m,n}. For g = 0 we let N_{m,0} = 0.

The algorithm propagates the particles X_{m,g} and updates the weights W_{m,g} from generation g = 0 up to generation G, where generation g corresponds to the observation time t_g. A proposal density q is used to propagate the particles and evaluate the weights. Two different proposal densities will be introduced in Section 6.4; evaluation of the weights for each proposal will also be discussed there.

Initially, all the particles in generation g = 0 (i.e., time = 0) are set to contain only the special beginning state: X_{m,0} = (θ_beg, 0) for all m ∈ {1, ..., M}.³ For generation g = 1, a proposal (i.e., importance) density q(·|Y(t_1)) is used to approximate p(X_{1:1}|Y_{1:1}). We generate M particles from q(·|Y(t_1)); each particle X_{m,1} is constructed so that its list of events contains just enough events (i.e., the smallest number of events) to cover the first observation time t_1; in other words, t_1 ≤ ∑_{n=1}^{N_{m,1}} J_{m,n}.

An illustration of the SMC algorithm is provided in Figures 6.2, 6.3, and 6.4, where we have three observations (EDSS = 1.5, 1.5, 2 at time points t_1, t_2, t_3). The first generation of three particles at observation time t_1 is shown in Figure 6.2(a). To account for the discrepancy between the two densities, we assign importance weights W_{m,1} to the particles (details to follow); this is shown in Figure 6.2(b), where we assign weights W_{1,1}, W_{2,1}, and W_{3,1} to the particles.

Next, in a resampling step, we sample M times from the weighted particle density p̂(dX_{1:1}|Y_{1:1}), which is an approximation of p(X_{1:1}|Y_{1:1}). In Figure 6.3(a) we assume we have sampled three times from the weighted particle density and obtained the same particles as before: X_{1,1}, X_{2,1}, and X_{3,1}.

At time t_2, we extend each particle X_{m,1} into a new particle X′_{m,2} through the proposal density q(·|Y(t_2), X_{m,1}) to obtain an approximation of p(X_{1:2}|Y_{1:2}). To extend a particle X_{m,1} into a particle X′_{m,2}, we first copy the events of X_{m,1} into X′_{m,2} and then append newly sampled events (from q(·|Y(t_2), X_{m,1})) until the observation at time t_2 is covered. Figure 6.3(b) shows the extended particles X′_{1,2}, X′_{2,2}, and X′_{3,2}. We then compute the importance weights W_{m,2} of the extended particles shown in Figure 6.3(b), since we are not sampling directly from p(X_{1:2}|Y_{1:2}). A new population of extended particles (X_{m,2})_{m=1}^M is then obtained by resampling M times from

$$\hat{p}(dX_{1:2} \mid Y_{1:2}) = \sum_{m=1}^{M} W_{m,2}\, \delta_{X'_{m,2}}(dX_{1:2}).$$

³As this event is present at the beginning of each particle, whenever g > 0 we index the sequence of hidden events of each particle from the first event n = 1; that is, for g > 0 we define each particle as X_{m,g} = (θ_{m,n}, J_{m,n})_{n=1}^{N_{m,g}}.

[Figure 6.2: An illustration of the SMC algorithm with three particles and three observation times, up to generation 1.]
[Figure 6.3: An illustration of the SMC algorithm with three particles and three observation times, up to generation 2.]
[Figure 6.4: An illustration of the SMC algorithm with three particles and three observation times, up to generation 3.]
The new population of random particles X_{1,2}, X_{2,2}, and X_{3,2} is illustrated in Figure 6.4(a), where we have two samples from X′_{3,2} and one sample from X′_{1,2}. This procedure is repeated until time t_G is covered; for instance, in Figure 6.4(b) we stop as soon as the observation time t_3 is covered. This completes the generation of the particles; the approximation of the posterior distribution p(X_{1:G}|Y_{1:G}) is then obtained from p̂(dX_{1:G}|Y_{1:G}) (e.g., ∑_{m=1}^M W_{m,3} δ_{X′_{m,3}}(dX_{1:3}) for the example in Figure 6.4(b)).

The SMC algorithm can be summarized as follows (for details see [2]). To simplify the notation, we omit writing ∀m ∈ {1, ..., M}; we also denote the sufficient statistics for the likelihood model P by S_θ.

1. At time = 0, set X_{m,0} = (θ_beg, 0) and W_{m,0} = 1/M.

2. For g = 0 to G − 1:

(a) Extend X_{m,g} to a new particle X′_{m,g+1}:

i. Copy the contents of X_{m,g} into X′_{m,g+1} and create X^m_{g+1}, the list of events appended to X_{m,g}. Note that the list X^m_{g+1} contains no events at this stage, as we have only copied the contents of X_{m,g} into X′_{m,g+1} (see Figure 6.4(b) for an example of the lists of events appended to the particles X_{1,2}, X_{2,2}, and X_{3,2}: X^1_3, X^2_3, and X^3_3).

ii. Sample from q(·|Y(t_{g+1}), X′_{m,g+1}); append the newly sampled event X̃^m_{g+1} to X′_{m,g+1} and also to X^m_{g+1} (i.e., X′_{m,g+1} := (X′_{m,g+1}, X̃^m_{g+1}) and X^m_{g+1} := (X^m_{g+1}, X̃^m_{g+1})), and update the sufficient statistics (e.g., F, T, and S for the GEP model). Continue appending new events and updating the sufficient statistics; stop as soon as the first g + 1 observations are covered, that is, t_{g+1} ≤ ∑_{n=1}^{N_{m,g+1}} J_{m,n}.

(b) Compute and normalize the weights of the extended particles. For an observation dy ∈ F_X, let L(dy|S_θ) denote the predictive likelihood given S_θ. It can be shown [2] that the appropriate importance weights are

$$w_{m,g+1} = \frac{L(dy \mid S_\theta)\, p(X^m_{g+1} \mid X_{m,g})}{q(X^m_{g+1} \mid Y(t_{g+1}), X_{m,g})}; \tag{6.2}$$

the normalized weights are then W_{m,g+1} := w_{m,g+1} / ∑_{n=1}^M w_{n,g+1}.

(c) After obtaining the extended particles X′_{m,g+1} and their weights, generate the new generation of particles X_{m,g+1} by resampling M times from

$$\hat{p}(dX_{1:g+1} \mid Y_{1:g+1}) = \sum_{m=1}^{M} W_{m,g+1}\, \delta_{X'_{m,g+1}}(dX_{1:g+1}). \tag{6.3}$$

3. The approximation of the posterior density p(X_{1:G}|Y) is then:

$$\hat{p}(dX_{1:G} \mid Y_{1:G}) := \sum_{m=1}^{M} W_{m,G}\, \delta_{X_{m,G}}(dX_{1:G}). \tag{6.4}$$

The SMC algorithm also provides an estimate of the marginal likelihood p(Y), given by

$$\hat{p}(Y) = \prod_{g=1}^{G} \frac{1}{M} \sum_{n=1}^{M} w_{n,g}. \tag{6.5}$$

As mentioned, we use the marginal likelihood estimates to compute the acceptance probability in the PMCMC algorithm.

For the case of multiple sequences we need to keep track of the sufficient statistics of each sequence. In the GEP model, for the k-th sequence the sufficient statistics are the empirical transition measures (Equation 4.1), the empirical waiting times (Equation 4.2), and the sufficient statistics of the likelihood model for the m-th particle after the n-th jump; we denote these by F^(k)_{θ,m,n}, T^(k)_{θ,m,n}, and S^(k)_{θ,m,n} respectively. Note that the sufficient statistics for the likelihood model depend on the chosen likelihood model; for instance, in the likelihood model we will introduce in Chapter 7, they are the frequencies of each observation.
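The steps above translate into a compact generic SMC skeleton. The sketch below (Python) abstracts the model-specific parts behind an assumed callable `extend(particle, t_next, rng)`, which appends proposed events until `t_next` is covered and returns the extended particle together with its unnormalized weight (Equation 6.2); the resampling step is Equation 6.3.

```python
import numpy as np

def smc(extend, init_particle, obs_times, M, rng):
    """Generic SMC skeleton: propagate, weight (Eq. 6.2), resample (Eq. 6.3).
    Returns the final particle population and the estimate of p(Y) (Eq. 6.5)."""
    particles = [init_particle() for _ in range(M)]
    marginal_lik = 1.0
    for t_next in obs_times:
        extended, weights = [], np.empty(M)
        for m in range(M):
            x_ext, w = extend(particles[m], t_next, rng)  # propose until t_next is covered
            extended.append(x_ext)
            weights[m] = w
        marginal_lik *= weights.mean()                    # the factor (1/M) sum_n w_{n,g}
        W = weights / weights.sum()                       # normalized weights
        ancestors = rng.choice(M, size=M, p=W)            # multinomial resampling
        particles = [extended[a] for a in ancestors]
    return particles, marginal_lik
```

A lower-variance scheme such as systematic resampling could replace the multinomial draw without changing the rest of the skeleton.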
The pseudocode of the SMC algorithm in the GEP model (in the case of multiple sequences) is presented in Figure 6.5.

Algorithm 2 SMC algorithm to construct a proposal for the PMCMC
G: the number of observations in the current sequence (k)
g: particle generation, g ∈ {0, 1, ..., G}
M: number of particles; m: particle number, m ∈ {1, ..., M}
K: the total number of sequences; k: sequence number, k ∈ {1, ..., K}
(We omit writing ∀m ∈ {1, ..., M} and the superscript (k) on every X to avoid excessive notation)
For a given sequence k:
- Set X_{m,0} = (θ_beg, 0)
For g = 0 to G − 1 do
  - Extend X_{m,g} to a new particle X′_{m,g+1}:
    Copy the events of X_{m,g} into X′_{m,g+1} and create X^m_{g+1}, the list of events appended to X_{m,g}
    Loop until t^(k)_{g+1} ≤ ∑_{n=1}^{N_{m,g+1}} J_{m,n} is satisfied
      sample a pair (θ, J) from q(·|Y(t_{g+1}), X′_{m,g+1})
      append the new sampled event X̃^m_{g+1} to X′_{m,g+1}; let X′_{m,g+1} := (X′_{m,g+1}, X̃^m_{g+1})
      append the new sampled event X̃^m_{g+1} to X^m_{g+1}; let X^m_{g+1} := (X^m_{g+1}, X̃^m_{g+1})
      update the sufficient statistics F^(k)_{θ,m,n}, T^(k)_{θ,m,n}, and S^(k)_{θ,m,n} for the events generated so far
    End Loop
  - Compute the weight of the particles:
    w_{m,g+1} = L(Y(t_{g+1}) | θ_{I(t_{g+1})}, S^(\k)_θ + S^(k)_{θ,m,n}) p(X^m_{g+1} | X_{m,g}) / q(X^m_{g+1} | Y(t_{g+1}), X_{m,g})
  - Generate the new population of particles X_{m,g+1} by resampling M times from p̂(dX_{1:g+1}|Y_{1:g+1}) (Equation 6.3)
End For
- Approximate the posterior density p(X_{1:G}|Y) and the marginal likelihood p(Y) with Equations 6.4 and 6.5 respectively.

Figure 6.5: Pseudocode for the SMC step of the PMCMC algorithm

6.4 SMC Algorithm Proposal Distributions

In this section we describe two proposal distributions for the SMC algorithm: the predictive distribution, and a modified version of the predictive distribution.

6.4.1 The predictive distribution as the proposal distribution

If we use the predictive distribution as the proposal q in the SMC algorithm of Figure 6.5, then to extend the particle X_{m,g} to X′_{m,g+1} we sample the hidden event X̃^m_{g+1} from Equation 4.7. The sufficient statistics for this sampling step come from two sources: (1) the other sequences, which are held fixed (i.e., (F^(\k)_θ, T^(\k)_θ)), and (2) the events generated so far in the current particle, X′_{m,g+1,n} := (θ_{m,i}, J_{m,i})_{i=1}^n, whose sufficient statistics are denoted (F^(k)_{θ,m,n}, T^(k)_{θ,m,n}). Since these statistics are additive, we can write the sampling step as:

$$(\theta_{n+1}, J_{n+1}) \mid F^{(\backslash k)}_\theta, T^{(\backslash k)}_\theta, X'_{m,g+1,n} \sim \bar{\mu}'_{\theta_n} \times \mathrm{TP}(\|\mu'_{\theta_n}\|, \beta'_{\theta_n}),$$

where µ′_θ = F^(\k)_θ + F^(k)_{θ,m,n} + H0 and β′_θ = T^(\k)_θ + T^(k)_{θ,m,n} + β0. The weights of the particles then have the simple form w_{m,g+1} = L(dy | S^(\k)_θ + S^(k)_{θ,m,n}), since the proposal q(X^m_{g+1}|Y_{g+1}, X_{m,g}) is the same as the predictive distribution p(X^m_{g+1}|X_{m,g}).
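Concretely, extending a particle under this proposal is a short loop. The sketch below (Python) reuses `sample_tp_inversion` from the Section 4.3 sketch; the nested-dictionary bookkeeping for F and T is ours, sufficient statistics of the likelihood model are omitted, and the base measure is again treated as a finite list of atoms drawn from uniformly.

```python
import numpy as np

def extend_predictive(particle, t_next, F, T, H0_atoms, H0_mass, beta0, rng):
    """Append events drawn from the GEP predictive (Eq. 4.7) until t_next is covered.
    F: dict source -> dict dest -> count; T: dict state -> total waiting time.
    These are the combined sufficient statistics (other sequences + this particle);
    the caller must pass per-particle copies."""
    events = list(particle)                       # particle starts with (theta_beg, 0)
    t = sum(j for _, j in events)
    while t < t_next:
        state = events[-1][0]
        row = F.setdefault(state, {})
        n_out = sum(row.values())
        alpha = n_out + H0_mass                   # ||mu'_state|| = ||F_state|| + ||H0||
        beta = T.get(state, 0.0) + beta0          # beta'_state = T_state + beta0
        J = sample_tp_inversion(alpha, beta, rng) # sojourn in `state` before the jump
        if rng.uniform() < n_out / alpha:         # reuse an observed transition
            dests = list(row)
            w = np.array([row[d] for d in dests], dtype=float)
            nxt = dests[rng.choice(len(dests), p=w / w.sum())]
        else:                                     # back off to the base measure
            nxt = H0_atoms[rng.integers(len(H0_atoms))]
        row[nxt] = row.get(nxt, 0) + 1            # update F
        T[state] = T.get(state, 0.0) + J          # update T
        events.append((nxt, J))
        t += J
    return events
```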
6.4.2 An "improved" proposal distribution based on a modified predictive distribution

For the experiments discussed in Chapter 8, we use a modified version of the hierarchical model of Chapter 5 and a relatively simple likelihood model. Therefore, in this section, in contrast to the previous sections of this chapter, we describe the proposal distribution for the HGEP model, which is the one used in our experiments. For the likelihood model, we assume each hidden state θ emits an observation deterministically; that is, given the hidden state, the observation is known. However, multiple hidden states can emit the same observation. More formally, we assume H0 in the HGEP model of Section 5.2 is a product of uniform and multinomial distributions (i.e., H0 = Unif × Mult), and for the likelihood model L_θ we use a Dirac delta.

Each sample from H0 is a hidden state θ in the form of a pair (u, y), where y ∈ Σ, the set of possible observations, and u ∈ [0, 1]. The uniform distribution can be thought of as generating unique identifiers for hidden states. Given a hidden state θ = (u, y) at time t, the observation at that time step is a deterministic function of θ: L_{(u,y)} = δ_y. Further details of the likelihood model are provided in Chapter 8.

Any probability distribution that we can easily sample from, and for which we can compute the importance weights using Equation 6.2, is a candidate proposal distribution. In this section we define a proposal q(X^m_{g+1}|Y(t_{g+1}), X_{m,g}) which, in contrast to the previous proposal, takes the observation Y(t_{g+1}) into account. Under the likelihood model just defined, knowing Y(t), the observation at time t, determines the set of possible hidden states at that time; this new proposal is therefore better suited to our experiments, where we use the described likelihood model. We call it the "improved" proposal distribution.

We denote the set of possible hidden states for observation Y(t) at time t by Ω_{Y(t)} := {(u, y) ∈ Ω : y = Y(t), L_{(u,y)}(Y(t)) = 1}. We are interested in a proposal distribution which, given X_{m,g} = (θ_{m,i}, J_{m,i})_{i=1}^n and Y(t_{g+1}), proposes a sample of hidden events (θ_{m,i}, J_{m,i})_{i=n+1}^N with the property that θ_{I(t_{g+1})} ∈ Ω_{Y(t_{g+1})}. Informally, we are only interested in samples that end up in a hidden state that can emit Y(t_{g+1}).

To obtain the desired proposal, we first define a probability distribution which we call the "modified" predictive distribution. As the name suggests, we define it by modifying the predictive distribution of the model in Section 5.4; it is the key element of our new proposal. Next, we introduce the "improved" proposal distribution itself. Finally, we explain how the importance weights can be computed.

"Modified" predictive distribution

We change the predictive distribution for the hidden states of the model in Section 5.4 so that it only generates samples θ ∈ Ω_{Y(t_{g+1})}. We denote the modified quantities by a superscript M. To obtain the modified predictive distribution, we change the base measure H0 defined above (i.e., H0 = Unif × Mult) and redefine the sufficient statistics F_θ and G. We define the modified base measure as H^M_0 = Unif × δ_{Y(t_{g+1})}; hence, every hidden state sampled from H^M_0 is a pair (u, y) with y always equal to Y(t_{g+1}). In other words, all hidden states sampled from H^M_0 emit the observation Y(t_{g+1}) at time t_{g+1}. Next we define the modified sufficient statistics:

$$F^M_\theta = \sum_{i=1}^{n} \mathbf{1}[\theta_{i-1} = \theta \;\&\; \theta_i \in \Omega_{Y(t_{g+1})}]\, \delta_{\theta_i} \tag{6.6}$$

$$G^M = \sum_{i=1}^{n} \mathbf{1}[\theta_i \in \Omega_{Y(t_{g+1})}]\, A_i\, \delta_{\theta_i}. \tag{6.7}$$

For F^M_θ, the empirical transition measure out of state θ, we set the mass of all atoms with θ_i ∉ Ω_{Y(t_{g+1})} to zero and keep the mass of all other atoms equal to their mass in F_θ. Similarly for G^M, we keep only the mass at atoms with θ_i ∈ Ω_{Y(t_{g+1})} and assign zero mass to all other atoms. Using the same normalization constant as in the original predictive distribution (i.e., ‖µ0‖), we define the modified predictive distribution µ̄′(M)_θ based on:

$$\mu''^{(M)} = G^M + H^M_0 \tag{6.8}$$

$$\mu'^{(M)}_\theta = F^M_\theta + \|\mu_0\|\, \bar{\mu}''^{(M)} \tag{6.9}$$

This modified predictive distribution is used in our proposal. Note that any state sampled from it will emit only Y(t_{g+1}), the observation at time t_{g+1}.
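A sketch of how the modified statistics might be tallied in code (Python; states are (u, y) pairs as above, and the count containers are ours):

```python
def modified_statistics(events, A, y_next):
    """Build F^M (per-source transition counts, Eq. 6.6) and G^M (table counts, Eq. 6.7),
    keeping only transitions into states that emit the upcoming observation y_next.
    events: list of states theta_0..theta_n; A: table-creation indicators for
    transitions 1..n, stored as A[0..n-1]."""
    F_M, G_M = {}, {}
    for i in range(1, len(events)):
        u, y = events[i]
        if y != y_next:            # theta_i outside Omega_{Y(t_{g+1})}: zero mass
            continue
        src = events[i - 1]
        F_M.setdefault(src, {})
        F_M[src][(u, y)] = F_M[src].get((u, y), 0) + 1
        if A[i - 1] == 1:          # this transition created a new table
            G_M[(u, y)] = G_M.get((u, y), 0) + 1
    return F_M, G_M

# Toy usage: two hidden states emitting 'a' and 'b'.
ev = [(0.1, 'a'), (0.7, 'b'), (0.1, 'a'), (0.7, 'b')]
print(modified_statistics(ev, A=[1, 1, 0], y_next='b'))
```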
Probability density value for a sampled state

Before proceeding further, it is useful to review how to compute the probability density value of a sampled state. This value is needed to obtain the probability of a path sampled from the proposal, which in turn is required to compute the importance weights. For a state θ* sampled from the predictive distribution µ̄′(H)_{θ_N}, we aim to compute p(θ_{N+1} = θ* | X, {A_n}_{n=1}^N, ‖µ0‖). We denote this probability density value by CRF(θ*; µ̄′(H)_{θ_N}). Using Equation 5.8 (see Figure 5.1), we can compute it from the following formula:

$$\mathrm{CRF}(\theta^*; \bar{\mu}'^{(H)}_{\theta_N}) = \begin{cases}
\dfrac{F_{\theta_N}(\{\theta^*\})}{\|F_{\theta_N} + \mu_0\|} & A_N = 0 \\[2mm]
\dfrac{G(\{\theta^*\})}{\|G + H_0\|} \cdot \dfrac{\|\mu_0\|}{\|F_{\theta_N} + \mu_0\|} & A_N = 1 \;\&\; \theta^* \in \{\theta_{1:N}\} \\[2mm]
\dfrac{H_0(\theta^*)}{\|G + H_0\|} \cdot \dfrac{\|\mu_0\|}{\|F_{\theta_N} + \mu_0\|} & \text{otherwise}
\end{cases} \tag{6.10}$$

The probability density value for a state θ* sampled from the "modified" predictive distribution µ̄′(M)_θ is defined similarly:

$$\mathrm{MCRF}(\theta^*; \bar{\mu}'^{(M)}_{\theta_N}) = \begin{cases}
\dfrac{F^M_{\theta_N}(\{\theta^*\})}{\|F^M_{\theta_N} + \mu_0\|} & A_N = 0 \\[2mm]
\dfrac{G^M(\{\theta^*\})}{\|G^M + H^M_0\|} \cdot \dfrac{\|\mu_0\|}{\|F^M_{\theta_N} + \mu_0\|} & A_N = 1 \;\&\; \theta^* \in \{\theta_{1:N}\} \\[2mm]
\dfrac{H^M_0(\theta^*)}{\|G^M + H^M_0\|} \cdot \dfrac{\|\mu_0\|}{\|F^M_{\theta_N} + \mu_0\|} & \text{otherwise}
\end{cases} \tag{6.11}$$
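A direct transcription of Equation 6.10 (Python; same dictionary-of-counts containers as in the earlier sketches, with the base density supplied as a callable, which is an assumption of this sketch):

```python
def crf_density(theta_star, A_N, F_row, G, H0_density, H0_mass, mu0_mass):
    """Probability (density) value of theta_star under Eq. 6.10.
    F_row: counts out of the current state; G: table counts per dish;
    H0_density: callable returning the base density at theta_star."""
    denom = sum(F_row.values()) + mu0_mass            # ||F_{theta_N} + mu_0||
    if A_N == 0:
        return F_row.get(theta_star, 0) / denom
    backoff = mu0_mass / denom
    if theta_star in G:                               # a previously seen dish
        return G[theta_star] / (sum(G.values()) + H0_mass) * backoff
    return H0_density(theta_star) / (sum(G.values()) + H0_mass) * backoff

# MCRF (Eq. 6.11) is identical with F_row, G, and H0 replaced by their modified versions.
print(crf_density('a', 0, {'a': 3, 'b': 1}, {'a': 2}, lambda s: 1.0, 1.0, 0.5))
```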
Sampling procedure for the "improved" proposal distribution

Now that we have defined the key element of our proposal distribution, we introduce the "improved" proposal distribution itself. The pseudocode for sampling from it, in the case of multiple sequences, is provided in Figure 6.6. We define the proposal via its sampling procedure for the case of a single sequence. Assume we are at observation time t_g and the current hidden state of the m-th particle, θ_{m,I(t_g)}, is given. We should emphasize that, although we sample until we cover the next observation time, we do not assume knowledge of the time of the next transition after t_{g+1}, nor of the state to which that transition is made. In other words, we only know there is a transition after t_{g+1}; the state and the time of the transition are unknown (see Figure 6.7 for an illustration). For every generation of the particles we always start sampling from an observation time. This point will become clear when we describe how to sample from the proposal and how to compute the probability of a sampled path.

To simplify the notation we omit the subscript m in describing the procedure and denote I(t_g) by num. We use l to index the events occurring after time t_g; for instance, the index of the first jump after t_g is l = num + 1, and we initialize the procedure with l = num + 1. The sampling procedure is described below; Figure 6.8(a) illustrates the initial state of the (g+1)-th generation of a single particle.

Algorithm 3 Modified predictive distribution as the proposal for inference in the HGEP model
Given the current hidden state at time point t_g (i.e., θ_{m,I(t_g)}):
(We omit writing ∀m ∈ {1, ..., M} and drop the particle index m from the notation)
K: number of sequences; k: sequence number, k ∈ {1, ..., K}
num: I(t_g), the number of events up to time step t_g
l = num + 1; J_l = 0
While ∑_{i=num+1}^{l} J_i < t_{g+1} − t_g
  - Sample the time to the next jump J_l from TP(‖µ′(H)_{θ_{l−1}}‖, β′_{θ_{l−1}})
  - If ∑_{i=num+1}^{l} J_i < t_{g+1} − t_g then
    - sample the next state θ_l from µ̄′(H)_{θ_{l−1}}
    - l = l + 1
    - update the sufficient statistics F^k_{θ,l}, T^k_{θ,l}, S^k_{θ,l}, and G^k_l
  - else if θ_{l−1} ∈ Ω_{Y(t_{g+1})}
    - J_l = t_{g+1} − ∑_{i=1}^{l−1} J_i and θ_l = θ_{l−1}
    - update the sufficient statistic T^k_{θ,l}
  - else
    - sample J_l from Unif[0, t_{g+1} − ∑_{i=1}^{l−1} J_i]
    - sample θ_l from µ̄′(M)_{θ_{l−1}}, the modified predictive distribution
    - update the sufficient statistics F^k_{θ,l}, T^k_{θ,l}, S^k_{θ,l}, and G^k_l
    - l = l + 1
    - J_l = t_{g+1} − ∑_{i=1}^{l−1} J_i and θ_l = θ_{l−1}
    - update the sufficient statistic T^k_{θ,l}
  End If
End While

Figure 6.6: Pseudocode for sampling from the "improved" proposal distribution

[Figure 6.7: An illustration of sampling the last jump for a generation of particles in the SMC algorithm; for simplicity, only one particle is shown.]

[Figure 6.8: Sampling procedure for the "modified" predictive distribution (case 1).]

Given the current state θ_{l−1}, we sample J_l, the time to the next jump, from the translated Pareto distribution TP(‖µ′(H)_{θ_{l−1}}‖, β′_{θ_{l−1}}). Depending on the sampled value, there are three different cases:

1. The sampled path does not pass the next observation time: ∑_{i=num+1}^{l} J_i < t_{g+1} − t_g. In this case, we sample the next state from µ̄′(H)_{θ_{l−1}} defined in Equation 5.8, set l := l + 1, and update the sufficient statistics F_{θ,l}, T_{θ,l}, S_{θ,l}, and G_l. This case is depicted in Figure 6.8(b); in this example J_{num+1} < t_{g+1} − t_g and θ_{num+1} ∼ µ̄′(H)_{θ_num}.

2. The sampled path passes the next observation time, and the observation emitted by the current state is consistent with Y(t_{g+1}) (i.e., the current state emits Y(t_{g+1}): θ_{l−1} ∈ Ω_{Y(t_{g+1})}). In this case, we discard the current J_l and instead set it equal to the remaining time to the observation time t_{g+1} (i.e., J_l = t_{g+1} − ∑_{i=1}^{l−1} J_i). Moreover, we set the next state equal to the current state, θ_l = θ_{l−1}, and update the sufficient statistic T_{θ,l}. Note that we do not need to update the other sufficient statistics, as the process remains in its current state without any transitions. An example of this case is given in Figure 6.9: the observation emitted by θ_num (i.e., EDSS = 1.5) is consistent with Y(t_{g+1}), so we have J_{num+1} = t_{g+1} − t_g and θ_{num+1} = θ_num.

[Figure 6.9: Sampling procedure for the "modified" predictive distribution (case 2).]

3. The sampled path passes the next observation time, and the current state does not emit Y(t_{g+1}) (i.e., θ_{l−1} ∉ Ω_{Y(t_{g+1})}). In this case, we take the following two steps:

(a) First we discard the current J_l and instead sample J_l from Unif[0, t_{g+1} − ∑_{i=1}^{l−1} J_i]; that is, we sample the time of the next jump uniformly between the current time, ∑_{i=1}^{l−1} J_i, and the next observation time t_{g+1}. Then we sample the next state θ_l from µ̄′(M)_{θ_{l−1}}, the modified predictive distribution, and update the sufficient statistics F_{θ,l}, T_{θ,l}, S_{θ,l}, and G_l.

(b) We set l = l + 1; next we set J_l = t_{g+1} − ∑_{i=1}^{l−1} J_i and θ_l = θ_{l−1}. That is, we do not sample the last jump; instead, we deterministically set the time to the last jump equal to t_{g+1} − ∑_{i=1}^{l−1} J_i, the remaining time until t_{g+1}. Finally, we only update T_{θ,l}, as we do not assume knowledge of the state after t_{g+1}, and for the next generation of particles we will sample from time t_{g+1} forward.

For an example see Figure 6.10. In the example, the sample for J_{num+1} is discarded and a new J_{num+1} is sampled from Unif[0, t_{g+1} − t_g]; the new jump time is denoted by t*. For the next state we have θ_{num+1} ∼ µ̄′(M)_{θ_num}; hence θ_{num+1} ∈ Ω_{Y(t_{g+1})}. Finally, the time to the last jump, J_{num+2}, is set equal to t_{g+1} − (J_{num+1} + t_g).

[Figure 6.10: Sampling procedure for the "modified" predictive distribution (case 3).]
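The three cases can be sketched as follows (Python), with the model-specific draws abstracted behind assumed callables: `sample_hgep_state`, `sample_modified_state`, and `tp_sample` are stand-ins for draws from µ̄′(H), µ̄′(M), and the translated Pareto respectively; sufficient-statistic updates are omitted for brevity.

```python
def improved_proposal_step(theta_cur, t_cur, t_next, omega_next,
                           sample_hgep_state, sample_modified_state, tp_sample, rng):
    """Propose events from t_cur up to the next observation time t_next (Figure 6.6).
    omega_next: predicate telling whether a state can emit Y(t_next)."""
    events = []
    while True:
        J = tp_sample(theta_cur)                          # where does t_cur + J land?
        if t_cur + J < t_next:                            # case 1: before t_next
            theta_cur = sample_hgep_state(theta_cur)
            events.append((theta_cur, J))
            t_cur += J
        elif omega_next(theta_cur):                       # case 2: current state emits Y(t_next)
            events.append((theta_cur, t_next - t_cur))    # stay put until t_next
            return events
        else:                                             # case 3: force a jump before t_next
            J = rng.uniform(0.0, t_next - t_cur)          # uniform jump time
            theta_cur = sample_modified_state(theta_cur)  # state from modified predictive
            events.append((theta_cur, J))
            t_cur += J
            events.append((theta_cur, t_next - t_cur))    # deterministic last segment
            return events
```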
The sampling procedure for a generation stops as soon as we cover the observation time t_{g+1}, that is, as soon as ∑_{i=num+1}^{l} J_i ≥ t_{g+1} − t_g. A path sampled from this proposal always ends up in a state θ_{I(t_{g+1})} ∈ Ω_{Y(t_{g+1})}: whenever the path passes t_{g+1}, we check whether it is in the right group of states Ω_{Y(t_{g+1})}; if not, we discard the sampled jump, sample the time of the jump from a uniform distribution, and sample the state from the modified predictive distribution µ̄′(M)_θ. Therefore, the sampled path has the desired property.

Computing the importance weights

In order to compute the weights of the particles, we need the proposal and predictive probabilities of a sampled path. We denote the cumulative distribution function of the translated Pareto distribution by F_TP(t; α, β); the updated MGP parameters after the i-th transition by β^(i)_θ and µ^(i)_θ; and the predictive distribution of the i-th event given events 1 to i−1 in the HGEP model by p(θ_i, J_i | θ_{1:i−1}, J_{1:i−1}). Setting n = I(t_g), N = I(t_{g+1}), and ∆ = t_{g+1} − ∑_{k=1}^{N−1} J_k, the predictive probability of a sampled path (θ_i, J_i)_{i=n+1}^N (dropping the particle index m and the superscript H) is given by:

$$\begin{aligned}
p\big((\theta_i, J_i)_{i=n+1}^N \mid X_g = (\theta_i, J_i)_{i=1}^n, \|\mu_0\|\big)
&= \prod_{i=n+1}^{N} p(\theta_i, J_i \mid \theta_{1:i-1}, J_{1:i-1}) \cdot \big(1 - F_{TP}(\Delta; \|\mu^{(N)}_{\theta_N}\|, \beta^{(N)}_{\theta_N})\big) \\
&= \prod_{i=n+1}^{N} p(\theta_i \mid \theta_{1:i-1}, J_{1:i-1})\, p(J_i \mid \theta_{1:i-1}, J_{1:i-1}) \cdot \big(1 - F_{TP}(\Delta; \|\mu^{(N)}_{\theta_N}\|, \beta^{(N)}_{\theta_N})\big) \\
&= \prod_{i=n+1}^{N} \mathrm{CRF}(\theta_i; \bar{\mu}^{(i-1)}_{\theta_{i-1}}) \cdot \mathrm{TP}(J_i; \|\mu^{(i-1)}_{\theta_{i-1}}\|, \beta^{(i-1)}_{\theta_{i-1}}) \cdot \big(1 - F_{TP}(\Delta; \|\mu^{(N)}_{\theta_N}\|, \beta^{(N)}_{\theta_N})\big).
\end{aligned}$$

The factorization in the second line holds because, given the current state, the next state and the jump time are independent. Figure 6.11 provides an example that clarifies this computation; in the figure, the parameters of the distributions are omitted to keep the notation uncluttered. The figure shows a sampled path beginning from the n-th event: each state is sampled from the predictive distribution for the states (denoted CRF in the figure) given all the previous events, and each waiting time is sampled from the translated Pareto distribution. Finally, for the transition after t_{g+1} we only know that it happens at a time greater than ∆; hence the factor 1 − F_TP(∆).

[Figure 6.11: Computing the predictive probability of a sampled path.]
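In code, the predictive path probability is a running product of CRF and TP factors plus one survival term. A sketch (Python) on the log scale, with the TP density and survival function written out from Definition 14; the argument layout (precomputed CRF values and per-event parameter pairs) is a bookkeeping choice of ours.

```python
import numpy as np

def tp_logpdf(t, alpha, beta):
    """Log density of TP(alpha, beta) from Definition 14."""
    return np.log(alpha) + alpha * np.log(beta) - (alpha + 1) * np.log(t + beta)

def tp_logsf(t, alpha, beta):
    """Log survival function: 1 - F_TP(t) = (beta / (beta + t))**alpha."""
    return alpha * (np.log(beta) - np.log(beta + t))

def path_log_predictive(path, delta, crf_terms, tp_params, tp_final):
    """Log predictive probability of a sampled path (theta_i, J_i), i = n+1..N.
    crf_terms: CRF(theta_i; .) values; tp_params: the updated (||mu||, beta')
    used for each J_i draw; tp_final: the parameters after the N-th transition,
    used for the survival factor 1 - F_TP(delta)."""
    logp = sum(np.log(c) + tp_logpdf(J, a, b)
               for (theta, J), c, (a, b) in zip(path, crf_terms, tp_params))
    return logp + tp_logsf(delta, *tp_final)
```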
Based on the algorithm presented in Figure 6.6, and assuming the proposed hidden events are (θ_{m,i}, J_{m,i})_{i=n+1}^N, the probability of a sampled path under the proposal, q((θ_i, J_i)_{i=n+1}^N | Y(t_{g+1}), X_g = (θ_i, J_i)_{i=1}^n, ‖µ0‖), can be obtained by considering two scenarios:

1. If the (N−1)-th (hidden) state, the state at the penultimate jump, belongs to Ω_{Y(t_{g+1})} (i.e., θ_{N−1} ∈ Ω_{Y(t_{g+1})}), then we are certain that the events at both jumps N−1 and N were sampled from the predictive distribution. In other words, the path was sampled from the predictive distribution at every jump.

2. If the (N−1)-th state does not belong to Ω_{Y(t_{g+1})}, then there are two indistinguishable possibilities:

(a) The N-th event was sampled from the predictive distribution and the N-th sampled state belongs to Ω_{Y(t_{g+1})}; thus, as in the previous case, all the hidden events were sampled from the predictive distribution.

(b) The N-th jump was sampled from the uniform distribution and the N-th state from the modified CRF.

Denoting the modified predictive distribution by p*, the proposal probability is therefore:

$$\begin{aligned}
&q\big((\theta_i, J_i)_{i=n+1}^N \mid Y(t_{g+1}), X_g = (\theta_i, J_i)_{i=1}^n, \|\mu_0\|\big) \\
&\quad= \mathbf{1}[\theta_{N-1} \in \Omega_{Y(t_{g+1})}] \prod_{i=n+1}^{N} p(\theta_i, J_i \mid \theta_{1:i-1}, J_{1:i-1}) \cdot \big(1 - F_{TP}(\Delta; \|\mu^{(N)}_{\theta_N}\|, \beta^{(N)}_{\theta_N})\big) \\
&\qquad+ \big(1 - \mathbf{1}[\theta_{N-1} \in \Omega_{Y(t_{g+1})}]\big) \Bigg( \prod_{i=n+1}^{N} p(\theta_i, J_i \mid \theta_{1:i-1}, J_{1:i-1}) \cdot \big(1 - F_{TP}(\Delta; \|\mu^{(N)}_{\theta_N}\|, \beta^{(N)}_{\theta_N})\big) \\
&\qquad\qquad+ \prod_{i=n+1}^{N-1} p(\theta_i, J_i \mid \theta_{1:i-1}, J_{1:i-1}) \cdot p^*(\theta_N, J_N \mid \theta_{N-1}, J_{N-1}) \cdot \big(1 - F_{TP}(\Delta; \|\mu^{(N)}_{\theta_N}\|, \beta^{(N)}_{\theta_N})\big) \Bigg),
\end{aligned}$$

where each predictive factor expands as p(θ_i, J_i | θ_{1:i−1}, J_{1:i−1}) = CRF(θ_i; µ̄^(i−1)_{θ_{i−1}}) · TP(J_i; ‖µ^(i−1)_{θ_{i−1}}‖, β^(i−1)_{θ_{i−1}}), and the modified factor expands as p*(θ_N, J_N | θ_{N−1}, J_{N−1}) = MCRF(θ_N; µ̄^(N−1,M)_{θ_{N−1}}) · 1/(t_{g+1} − ∑_{i=1}^{N−1} J_i).

The weights of the particles now take the form

$$w_{m,g+1} = \frac{p(X^m_{g+1} \mid X_{m,g})}{q(X^m_{g+1} \mid Y(t_{g+1}), X_{m,g})}, \tag{6.12}$$

since we now always have L(Y(t_{g+1}) | θ_{I(t_{g+1})}, S^(\k)_θ + S^(k)_{θ,m,n}) = 1.

Chapter 7

LIKELIHOOD MODEL

The likelihood model that we use in our experiments was introduced in Section 6.4.2. In this chapter we review the likelihood model and explain the changes we make to the HGEP model in order to apply it in our experiments.

As mentioned, we assume that the observation y is a deterministic function of the hidden state, and that H0 is a product of multinomial and uniform distributions, i.e., H0 = Unif × Mult. Thus, every hidden state sampled from H0 has the form of a pair θ = (u, y): u ∈ [0, 1] is a unique identifier for the hidden state, sampled from the uniform distribution, while the observation y takes a value in the set of possible observations Σ, which consists of the finite set of symbols we can observe. Given θ = (u, y) at a time step t, we have a Dirac delta for L_θ: L_{(u,y)} = δ_y. Moreover, we add a Dirichlet distribution over the parameters of the multinomial distribution; we denote these parameters p_1, ..., p_{|Σ|} by p_{1:|Σ|}. This likelihood model can therefore be thought of as a Dirichlet-Multinomial-Dirac model.

For our experiments we add another level of hierarchy to the HGEP. Informally, this level is responsible for counting the number of times a transition is made from hidden states emitting the same observation y ∈ Σ to another hidden state. This allows the model to share statistical strength between states emitting the same observation. The model can be represented similarly to Equation 5.1:

$$\begin{aligned}
p_{1:|\Sigma|} &\sim \mathrm{Dir}(\alpha_3) \\
H_0 &= \mathrm{Mult}(p_{1:|\Sigma|}) \times \mathrm{Unif} \\
\mu_0 &\sim \mathrm{MGP}(H_0, \|H_0\|) \\
\mu_y \mid \mu_0 &\overset{iid}{\sim} \mathrm{MGP}(\mu_0, \gamma_0) \\
\mu_\theta \mid \mu_y &\overset{iid}{\sim} \mathrm{MGP}(\mu_y, \beta_0) \\
\theta_{N+1} \mid X, \{\mu_\theta\}_{\theta\in\Omega} &\sim \bar{\mu}_{\theta_N} \\
J_{N+1} \mid X, \{\mu_\theta\}_{\theta\in\Omega} &\sim \mathrm{Exp}(\|\mu_{\theta_N}\|).
\end{aligned} \tag{7.1}$$

The predictive distribution for this model can be obtained recursively; however, we need to introduce an additional auxiliary variable, K_y, which we explain momentarily. Similarly to the HGEP predictive distribution (Equation 5.8) of Section 5.3, the predictive distribution of the hidden states is:

$$\begin{aligned}
\bar{\mu}''' &= \frac{G}{\|G + H_0\|} + \frac{\|H_0\|}{\|G + H_0\|}\,\bar{H}_0 \\
\bar{\mu}'' &= \frac{K_y}{\|K_y + \mu_0\|} + \frac{\|\mu_0\|}{\|K_y + \mu_0\|}\,\bar{\mu}''' \\
\theta_{N+1} \mid \theta_1, \ldots, \theta_N, \|\mu_y\|, \bar{\mu}'' &\sim \frac{F_{\theta_N}}{\|F_{\theta_N} + \mu_y\|} + \frac{\|\mu_y\|}{\|F_{\theta_N} + \mu_y\|}\,\bar{\mu}''
\end{aligned} \tag{7.2}$$

Figure 7.1 illustrates an example of sampling from this predictive distribution, with α2 = ‖H0‖ and α1 = ‖µ0‖. Note that in this model the hyperparameters are α1, α2, α3, β0, and γ0. The bottom level of the hierarchy is the same as in Figure 5.1 of Section 5.3.
From the bottom level we can move to the higher-level DP with probability proportional to ‖µ_y‖, the normalization of the middle-level random measure. For the middle level we define auxiliary variables B_n analogous to A_n: the event B_n = 1 means we have moved from the bottom level to the middle level of the hierarchy at transition n. Furthermore, at this level of the hierarchy we augment the sufficient statistics with the empirical counts of the transitions already made from a state that emits the observation y; that is,

$$K_y = \sum_{n=2}^{N} \mathbf{1}[L_{\theta_{n-1}}(y) = 1]\, B_n\, \delta_{\theta_n}.$$

[Figure 7.1: Sampling from the predictive distribution of the HGEP model used in the experiments; the three levels correspond to the F_θ, K_y, and G counts, with back-off weights proportional to ‖µ_y‖, α1, and α2, and the indicator patterns (A_N, B_N) = (0,0), (0,1), (1,1) marking the level at which the draw is resolved.]

As at the bottom level, there are two possibilities: choose the state at this level with probability proportional to K_y, or default to another DP with probability proportional to α1. As in Section 5.3, we use the auxiliary variable A_n to determine when we default to the top-level DP (note that A_n = 1 implies B_n = 1). Finally, the top level is the same as the top of the hierarchy in the model of Section 5.3. At this level, we can have a transition to an entirely new state with probability proportional to α2 (i.e., ‖H0‖). We generate a unique identifier for this new state by sampling from Unif[0, 1], and to decide on the observation for the new state we use a multinomial distribution over the possible observations Σ.

Chapter 8

EXPERIMENTS

In this chapter we present the results of our experiments. First, we demonstrate the behavior of state trajectories and sojourn times sampled from the prior, to give a qualitative idea of the range of hidden-state time series that our model can capture. Next, we evaluate the model quantitatively by applying it to held-out tasks on three different datasets: synthetic, multiple sclerosis (MS) patients, and RNA evolution. Each dataset consists of multiple time series with observations at different time points. Note that in each dataset there is a finite number of possible observations, which we call "symbols". A sample dataset with symbols A, C, T, and G, having the same structure as our three datasets, is provided in Table 8.1; we explain each of these datasets in detail later in this chapter.

Table 8.1: A sample dataset with symbols A, T, C and G having the same structure as the datasets used in the experiments

Sequence 1    Time points:   0    1    1.5   2.3   4.1
              Observations:  A    A    T     C     C
Sequence 2    Time points:   1.1  1    3.1   4
              Observations:  T    C    T     G
...
Sequence M    Time points:   0    2    2.5   4.3   7
              Observations:  G    C    T     A     A

Furthermore, we apply the model to the problem of comparing disease progression in the two arms of a clinical trial in MS; evaluating the effectiveness of a drug is a typical problem in MS clinical research. In order to compare the arms of a clinical trial we need a model of MS disease progression; however, as mentioned in Chapter 1, MS can be considered a disease with complex latent structure. Using an HGEP as the model of the latent structure, we compare the disease progression in the treatment and placebo arms. Finally, we compare the performance of the two proposal distributions introduced in Section 6.4.

8.1 Qualitative Analysis of the Prior

The behavior of the prior changes as a function of the three parameters of the HGEP model: β0, ‖H0‖, and γ0.
We sampled a sequence of length T = 800 and present the state-time plots. At least four types of behavior can be distinguished:

1. short sojourn times and high volatility of states (Figure 8.1(a));
2. long sojourn times with low volatility (Figure 8.1(b));
3. many new states with short sojourn times (Figure 8.1(c));
4. a high tendency to create new states, with long sojourn times (Figure 8.1(d)).

[Figure 8.1: Qualitative behavior of the prior: (a) β0 = 10, ‖H0‖ = 5, γ0 = 100; (b) β0 = 100, ‖H0‖ = 5, γ0 = 10; (c) β0 = 10, ‖H0‖ = 500, γ0 = 10; (d) β0 = 1, ‖H0‖ = 5000, γ0 = 1.]

The concentration parameter ‖H0‖ has the same interpretation as the concentration parameters of HDPs; in addition, in conjunction with the rate parameter β0, it controls the waiting times. These behaviors show that the GEP can describe various kinds of stochastic processes. For all of our experiments we chose (arbitrarily) values of 1 for all hyperparameters.

8.2 Quantitative Analysis

In this section, we use the likelihood model introduced in Chapter 7 for discrete observations to evaluate our method on three held-out tasks. We considered three evaluation datasets obtained by holding out each observed datapoint with probability 10% (see Table 8.2 for the properties of the datasets). We then reconstructed the observations at the held-out times and measured the mean error. The time series were scanned for 1000 iterations of the PMCMC algorithm; that is, we ran the PMCMC algorithm on each dataset for 1000 iterations.

Table 8.2: Summary statistics and mean error results for the experiments. All experiments were repeated 5 times.

             # sequences   # datapoints   # heldout   # characters   Baseline   EM      HGEP
Synthetic    1000          10000          878         4              0.703      0.404   0.446
MS           72            384            31          3              0.516      0.355   0.277
RNA          1000          6167           508         4              0.648      0.596   0.426

For the HGEP, at each iteration of the PMCMC algorithm we sample a latent path from an approximation to the posterior p(X^(k)|Y^(k)) for every sequence k. Hence, for each held-out time of that sequence we have a sampled latent state, and by our likelihood model we know which observation that latent state emits. We reconstruct the held-out observation using this emitted observation at each iteration: we count the number of times each symbol is reconstructed up to that iteration and assign the symbol with the highest count to that held-out time. In other words, reconstruction uses the Bayes estimator approximated from 1000 posterior samples (one after each scan through all the sequences). We repeated all experiments 5 times with different random seeds, which control the randomness of sampling from the posterior. To compute the error at each iteration, we count the number of correct estimates and divide by the total number of held-out observations; we take the error at the 1000th iteration as the error for that repetition of the experiment. The mean error is the average of the computed errors over the 5 repetitions.

We compared against the standard maximum likelihood rate matrix estimator, estimated by the EM algorithm described in [22]. This method also estimates the transitions between two observations; as a result, we can estimate the whole trajectory of a sequence given only observations at a finite number of time points.
For the EM estimator, we reconstruct each held-out observation directly from this estimated trajectory, since the state of the sequence is then known at every time point. We also report in Table 8.2 the mean error for a simple estimate that reconstructs each held-out observation as the most common symbol in the whole dataset; we call this the baseline estimate. Figure 8.2 shows the mean error as a function of the number of iterations for all three datasets.

[Figure 8.2: Mean reconstruction error on the held-out data as a function of the number of Gibbs scans, for the Synthetic, RNA and MS experiments (one panel per dataset, each comparing HGEP and EM).]

8.2.1 Synthetic data experiment

We used the Erdős-Rényi model [11], a model for generating random graphs, to generate a random rate matrix. The model has two parameters: the number of nodes and the probability of having an edge between two nodes in the graph. We used edge probability 1/5 and 10 nodes to generate a random graph; this graph corresponds to a 10×10 binary matrix with a 1 wherever the graph has an edge between two nodes. The off-diagonal zeros of this matrix were replaced by rates drawn from Unif(0, 1/100), and the off-diagonal ones by rates drawn from Unif(0, 1/2 + 1/100). Each diagonal entry was then set to the negative of the sum of the off-diagonal entries in its row, and each of the 10 states was set to deterministically emit one of the symbols A, B, C or D, chosen at random. To create the synthetic dataset, we generated 1000 sequences of time length 10 from this rate matrix using the Doob-Gillespie algorithm (described in Section 2.1).

Both the HGEP and the EM-learned maximum likelihood estimator outperformed the baseline. In contrast to the next two tasks, the EM approach slightly outperformed the HGEP model here. We believe this is because the synthetic data was not sufficiently rich to highlight the advantages of HGEPs.

8.2.2 RNA evolution modeling

MJPs can be used to describe the evolution of nucleotide sequences in a phylogenetic tree, which shows the evolutionary relationships among different species. Each node of the tree corresponds to the nucleotide sequence of a species. While nucleotide sequences can be observed at the leaves of the tree (modern species), the branch lengths (time since divergence from another species) and the topology of the tree cannot be observed.

We used the dataset from Cannone et al. [6] containing aligned 16S ribosomal RNA of species from the three domains of life. Aligned RNA sequences are sequences arranged based on their similarities. When two RNA sequences are aligned, the relation between the two nucleotides at any given position can only be an identity, a mismatch or a gap. For instance, if sequence AGGC evolves to AA-C and these sequences are aligned, then there are two identical sites (1 and 4), a mismatch at the second site and a deletion at the third site; the sketch below makes this classification explicit.
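The classification just described can be transcribed directly; this helper is our own, purely for illustration.

```python
def site_relations(seq1, seq2):
    """Classify each aligned site as an identity, a mismatch, or a gap.

    >>> site_relations("AGGC", "AA-C")
    ['identity', 'mismatch', 'gap', 'identity']
    """
    relations = []
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            relations.append("gap")       # insertion/deletion site
        elif a == b:
            relations.append("identity")
        else:
            relations.append("mismatch")
    return relations
```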
As a preprocessing step, a tree was constructed on a random subset of 30 species using PhyML [20], and the nucleotides at speciation events (i.e., branching events in the tree) were reconstructed using a K2P rate matrix [24] and the sum-product algorithm on trees. We then considered, for each site, the time series consisting of the path from one modern leaf to the root. Figure 8.3(a) illustrates a sample path in the reconstructed phylogenetic tree; node r is the root, nodes e, f, g and h are the leaves, and nodes a, b, c and d are the internal nodes reconstructed by the above procedure. At each of these nodes there is a nucleotide sequence. However, we study each site of these sequences independently, with the nucleotide as the observed state; thus, there are 4 possible observations (A, T, C, G) (see Figure 8.3(b)).

[Figure 8.3: (a) A sample path in a reconstructed phylogenetic tree with root r, internal nodes a, b, c, d and leaves e, f, g, h; (b) evolution of the nucleotide sequences (e.g., ATCCG, CTACG, CGACT, AGTCT, CGGCT, TGGAT) along the reconstructed sample path, where each site, e.g. site 3, evolves independently of all other sites according to a Markov chain.]

For the RNA dataset, we considered 1000 sequences of nucleotides from the root to the modern leaves in the reconstructed tree. A sample sequence is denoted by a dashed rectangle in Figure 8.3(b). The numbers of datapoints and held-out points are given in Table 8.2. For this dataset, the task is to reconstruct held-out nucleotides using only the data in the paths. Again, both HGEP and EM outperformed the baseline, and our model outperformed EM with a relative error reduction of 29%.

8.2.3 MS disease progression

The dataset used for this experiment comes from a phase III clinical trial of a drug given to MS patients. It tracks the progression of MS (i.e., EDSS scores) in 72 patients of the placebo arm of the trial over 3 years. This is the subsample of the original cohort consisting of patients who enrolled early in the trial and completed 3 magnetic resonance imaging scans [39]. As in Mandel [31], the observed EDSS score of a patient at a given time is binned into three categories (sketched in code below): 1 (EDSS ≤ 1.5); 2 (EDSS = 2, 2.5); and 3 (EDSS ≥ 3). Here, we assume these categorical observations can be described by an HMM with an underlying, possibly infinite, hidden state space that is not directly observed.

In the held-out task experiment, both HGEP and EM outperformed the baseline by a large margin. The HGEP model outperformed EM with a relative error reduction of 22%.

8.2.4 Application in estimating disease progression in MS

In this section we apply our model to the problem of comparing the disease progression in the treatment and placebo arms of a clinical trial. We use the placebo-arm data from the previous section along with the data from the treatment arm of the same trial. Several tools are currently used in MS to describe or predict the course of the disease. Examples include the expected time to a certain EDSS score [32], the time to confirmed progression (i.e., reaching a certain EDSS and staying there for 6 months or longer), and plots of the probability of confirmed progression against time [31].
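Throughout this analysis, raw EDSS scores enter the model only through the three categories of Section 8.2.3; stated as code (an illustrative helper of ours, not code from the trial analysis):

```python
def edss_category(edss):
    """Bin an observed EDSS score into the three categories used here:
    1 for EDSS <= 1.5, 2 for EDSS in {2, 2.5}, 3 for EDSS >= 3."""
    if edss <= 1.5:
        return 1
    elif edss < 3:
        return 2
    else:
        return 3
```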
We focus on the plots of the probability of progression against time; these show the (mean predictive) probability of reaching a particular EDSS state as a function of time, for patients who start at the same baseline EDSS state. To construct these plots, after each of the 1000 PMCMC iterations we sample (without rejection) 100 times from the predictive distribution of the time to the particular state, conditioned on the starting state. Then, using these 100 samples, for each PMCMC iteration we estimate the probability of having hit the specific EDSS state at a given time point or earlier. We use 11 time points, ranging from 1 to 6 months in steps of 15 days. Finally, we compute the mean of these probabilities at each of the 11 time points to summarize the results; a code sketch of this estimate is given below.

As mentioned, observed EDSS scores are binned into three categories. Figure 8.4 shows the predicted probability of reaching category 3 starting from category 1, for both arms of the clinical trial, along with the 95% credible intervals. The plots show that, for instance, after 3 years the mean predicted probabilities of reaching category 3 are 0.37 (95% C.I.: 0.12–0.61) for the treatment arm and 0.55 (95% C.I.: 0.22–0.73) for the placebo arm.

[Figure 8.4: Mean predictive probability of reaching category 3 starting in category 1, for the placebo and treatment arms, as a function of time (in units of 6 months); the 95% credible intervals are shown as dashed lines.]
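A minimal sketch of this Monte Carlo estimate follows; the names are hypothetical, and in the experiments 100 hitting times are drawn per PMCMC iteration and the resulting curves are averaged.

```python
import numpy as np

def hitting_prob_curve(sampled_hit_times, grid):
    """Estimate P(reach the target category by time t) on a grid of
    time points from sampled hitting times; np.inf can encode a
    trajectory that never reaches the target."""
    hits = np.asarray(sampled_hit_times, dtype=float)
    return [(hits <= t).mean() for t in grid]

# e.g., 11 time points from 1 to 6 months in steps of 15 days:
grid = np.linspace(1.0, 6.0, 11)
```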
[Figure 8.5: Comparison of inference algorithms for the RNA dataset: mean acceptance rate per iteration for a proposal using 100 particles, as a function of MCMC iteration, using (a) the predictive distribution and (b) the modified predictive distribution in the SMC proposal.]

8.3 Comparison of Inference Algorithms

In this section we analyze the effect of using, in the PMCMC algorithm, the two different proposal distributions described in Section 6.4. The most important effect is on the acceptance rate of the samples in the PMCMC algorithm. Figure 8.5 shows that the acceptance rate for the “improved” proposal distribution is greater than that for the original proposal distribution. We only provide results for the RNA dataset; similar results hold for the other datasets.

Moreover, with the “improved” proposal, all particles in the SMC algorithm are always accepted, so the particle failure rate is zero. Recall that the weight of each particle is computed from the likelihood of the observation, and under our likelihood model this likelihood can only be 0 or 1. Without the “improved” proposal, long sequences can produce many zero-weight particles (i.e., a high particle failure rate), since a longer sequence has a higher chance of containing at least one zero likelihood value.

Chapter 9

CONCLUSION

In this thesis, we introduced a model for MJPs with a (potentially) infinite state space, and we showed how it can be used as a prior in a Bayesian approach to the analysis of multiple sequences of data observed at discrete time points. We reviewed some basic notions and closely related models, built our model (the GEP) on the gamma process, and extended it to a hierarchical model (the HGEP). We showed that the GEP has attractive properties such as conjugacy and closed-form predictive distributions.
Finally, we applied the model to real-world datasets; the results showed that the model outperformed the maximum likelihood rate matrix estimator despite using a simple likelihood model (i.e., a Dirac delta function).

However, in our model, as in many nonparametric Bayesian models, the data likelihood (i.e., p(Y)) is nontrivial and hard to compute. As a result, performing Bayesian model comparison and computing the Bayes factor, both of which require the data likelihood, is a challenging problem in our model. An example is the experiment of Section 8.2.4, where we want to compare the two arms of the trial; in some situations the presence of a treatment effect may be less obvious from the plots, and a statistical test for the treatment effect would be needed. Moreover, our inference algorithm is computationally intensive; its efficiency might be improved, for example, by using a slice sampler.

We are currently working on applications and generalizations of our model. Applying the method to models with a continuous state space for the observations, proposing more efficient inference algorithms, calculating the Bayes factor for estimating the treatment effect in clinical trials, and incorporating covariates in the hierarchical GEP are issues that we plan to address in future research. We are also interested in adding a layer of decision making to our model; this would make the model suitable for sequential decision-making problems in which we have partial observability of the data and the complexity of the latent structure is unknown.

Bibliography

[1] P. S. Albert. A Markov model for sequences of ordinal data from a relapsing-remitting disease. Biometrics, 50(1):51–60, 1994.
[2] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[3] M. Beal, Z. Ghahramani, and C. Rasmussen. The infinite hidden Markov model. Advances in Neural Information Processing Systems, 14:577–584, 2002.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[5] A. Bureau, S. Shiboski, and J. P. Hughes. Applications of continuous time hidden Markov models to the study of misclassified disease outcomes. Statistics in Medicine, 22(3):441–462, 2003.
[6] J. Cannone, S. Subramanian, M. Schnare, J. Collett, L. D'Souza, Y. Du, B. Feng, N. Lin, L. Madabusi, K. Müller, et al. The comparative RNA web (CRW) site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3(1):2, 2002.
[7] Y. Chung and D. Dunson. The local Dirichlet process. Annals of the Institute of Statistical Mathematics, 63(1):59–80, 2011.
[8] J. L. Doob. Markoff chains–denumerable case. Transactions of the American Mathematical Society, 58(3):455–473, 1945.
[9] D. Dufresne. G distributions and the beta-gamma algebra. Electronic Journal of Probability, 15(71):2163–2199, 2010.
[10] D. Dunson. Bayesian dynamic modeling of latent trait distributions. Biostatistics, 7(4):551–568, 2006.
[11] P. Erdős and A. Rényi. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61, 1960.
[12] T. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.
[13] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pages 312–319. ACM, 2008.
[14] A. Gelman and D. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.
[15] J. Geweke. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4, pages 169–193. Oxford University Press, 1992.
[16] J. Griffin. The Ornstein-Uhlenbeck Dirichlet process and other time-varying processes for Bayesian nonparametric inference. Journal of Statistical Planning and Inference, 141(11):3648–3664, 2011.
[17] J. Griffin and M. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101(473):179–194, 2006.
[18] J. Griffin and M. Steel. Stick-breaking autoregressive processes. Journal of Econometrics, 162(2):383–396, 2011.
[19] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.
[20] S. Guindon and O. Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52(5):696–704, 2003.
[21] P. Heidelberger and P. Welch. Simulation run length control in the presence of an initial transient. Operations Research, 31(6):1109–1144, 1983.
[22] A. Hobolth and J. Jensen. Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
[23] M. Jordan. Dirichlet processes, Chinese restaurant processes and all that. Tutorial presented at the NIPS Conference. URL: http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps.
[24] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111–120, 1980.
[25] J. Kingman. Poisson Processes. Oxford University Press, 1993.
[26] J. Kivinen, E. Sudderth, and M. Jordan. Learning multiscale representations of natural scenes using Dirichlet processes. In IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[27] D. Kroese, T. Taimre, and Z. Botev. Handbook of Monte Carlo Methods. Wiley, 2011.
[28] J. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, 2008.
[29] S. MacEachern. Dependent nonparametric processes. In Section on Bayesian Statistical Science, American Statistical Association, pages 50–55, 1999.
[30] R. MacKay. Estimating the order of a hidden Markov model. Canadian Journal of Statistics, 30(4):573–589, 2002.
[31] M. Mandel. Estimating disease progression using panel data. Biostatistics, 11(2):304–316, 2010.
[32] M. Mandel and R. A. Betensky. Estimating time-to-event from longitudinal ordinal data using random-effects Markov models: application to multiple sclerosis progression. Biostatistics, 9(4):750–764, 2008.
[33] R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[34] R. Nelson. Probability, Stochastic Processes, and Queueing Theory: The Mathematics of Computer Performance Modelling. Springer-Verlag, 1995.
[35] L. Nieto-Barajas, P. Müller, Y. Ji, Y. Lu, and G. Mills. Time series dependent Dirichlet process. Preprint, 2008.
[36] J. Pitman. Combinatorial Stochastic Processes. Springer-Verlag, 2006.
[37] A. Raftery and S. Lewis. Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7(4):493–497, 1992.
[38] A. Rodriguez, D. Dunson, and A. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154, 2008.
[39] R. A. Rudick, E. Fisher, J. C. Lee, J. Simon, and L. Jacobs. Use of the brain parenchymal fraction to measure whole brain atrophy in relapsing-remitting multiple sclerosis. Neurology, 53(8):1698–1704, 1999.
[40] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[41] A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology, 11(2-3):413–428, 2004.
[42] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[43] A. C. Titman and L. D. Sharples. Semi-Markov models with phase-type sojourn distributions. Biometrics, 66(3):742–752, 2010.
[44] W. Wei, B. Wang, and D. Towsley. Continuous-time hidden Markov models for network performance evaluation. Performance Evaluation, 49:129–146, 2002.
[45] E. Xing and K. Sohn. Hidden Markov Dirichlet process: Modeling genetic inference in open ancestral space. Bayesian Analysis, 2(3):501–528, 2007.