Nonparametric Bayesian Models for Markov Jump Processes by Ardavan Saeedi B.Sc., Sharif University of Technology, 2007 M.Sc., Sharif University of Technology, 2009 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Statistics) The University Of British Columbia (Vancouver) August 2012 © Ardavan Saeedi, 2012 Abstract Markov jump processes (MJPs) have been used as models in various fields such as disease progression, phylogenetic trees, and communication networks. The main motivation behind this thesis is the application of MJPs to data modeled as having complex latent structure. In this thesis we propose a nonparametric prior, the gamma-exponential process (GEP), over MJPs. Nonparametric Bayesian models have recently attracted much attention in the statistics community, due to their flexibility, adaptability, and usefulness in analyzing complex real world datasets. The GEP is a prior over infinite rate matrices which characterize an MJP; this prior can be used in Bayesian models where an MJP is imposed on the data but the number of states of the MJP is unknown in advance. We show that the GEP model we propose has some attractive properties such as conjugacy and simple closed-form predictive distributions. We also introduce the hierarchical version of the GEP model; sharing statistical strength can be considered as the main motivation behind the hierarchical model. We show that our hierarchical model admits efficient inference algorithms. We introduce two inference algorithms: 1) a “basic” particle Markov chain Monte Carlo (PMCMC) algorithm which is an MCMC algorithm with sequences proposed by a sequential Monte Carlo (SMC) algorithm; 2) a modified version of this PMCPC algorithm with an “improved” SMC proposal. Finally, we demonstrate the algorithms on the problems of estimating disease progression in multiple sclerosis and RNA evolutionary modeling. In both domains, we found that our model outperformed the standard rate matrix estimation approach. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1 I NTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 BASIC N OTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Markov Jump Process (MJP) . . . . . . . . . . . . . . . . . . . . 5 2.2 Gamma Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1 2.3 2.4 MGP and Dirichlet process (DP) . . . . . . . . . . . . . . 10 DP Definition and Properties . . . . . . . . . . . . . . . . . . . . 10 2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.2 Conjugacy and posterior distribution . . . . . . . . . . . . 12 2.3.3 Stick-breaking construction . . . . . . . . . . . . . . . . 13 2.3.4 CRP definition . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.5 DP mixture model . . . . . . . . . . . . . . . . . . . . . 15 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 iii 3 R ELATED M ODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Hierarchical Dirichlet Process . . . . . . . . . . . . . . . . . . . 17 3.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.3 Chinese Restaurant Franchise . . . . . . . . . . . . . . . 19 3.1.4 Stick-breaking representation for the HDP . . . . . . . . . 21 HDP-HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.1 Background on HMM . . . . . . . . . . . . . . . . . . . 22 3.2.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Sticky HDP-HMM . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.4 Dependent Dirichlet Process (DDP) . . . . . . . . . . . . . . . . 26 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 G AMMA E XPONENTIAL P ROCESS . . . . . . . . . . . . . . . . . . . . 30 4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Conjugacy Property . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 Predictive Distribution . . . . . . . . . . . . . . . . . . . . . . . 35 H IERARCHICAL G AMMA E XPONENTIAL P ROCESS . . . . . . . . . . 37 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.3 Connection to CRF . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.4 Predictive Distribution . . . . . . . . . . . . . . . . . . . . . . . 41 I NFERENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 6.1 MCMC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.2 PMCMC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.3 SMC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 6.4 SMC Algorithm Proposal Distributions . . . . . . . . . . . . . . 58 6.4.1 The predictive distribution as the proposal distribution . . 58 6.4.2 An “improved” proposal distribution based on a modified 3.2 4 5 6 7 predictive distribution . . . . . . . . . . . . . . . . . . . 59 . . . . . . . . . . . . . . . . . . . . . . . . . . 69 L IKELIHOOD M ODEL iv 8 E XPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 8.1 Qualitative Analysis of the Prior . . . . . . . . . . . . . . . . . . 73 8.2 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . 73 8.2.1 Synthetic data experiment . . . . . . . . . . . . . . . . . 75 8.2.2 RNA evolution modeling . . . . . . . . . . . . . . . . . . 76 8.2.3 multiple sclerosis (MS) disease progression . . . . . . . . 77 8.2.4 Application in estimating disease progression in MS . . . 78 Comparison of Inference Algorithms . . . . . . . . . . . . . . . . 80 C ONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 8.3 9 v List of Tables Table 3.1 Variables used in the CRF . . . . . . . . . . . . . . . . . . . . 19 Table 6.1 Notation used in the PMCMC algorithm . . . . . . . . . . . . 50 Table 8.1 A sample dataset with symbols A, T, C and G having the same structure as datasets used in the experiments . . . . . . . . . . Table 8.2 73 Summary statistics and mean error results for the experiments. All experiments were repeated 5 times. . . . . . . . . . . . . . vi 75 List of Figures Figure 2.1 An illustration of notation for samples from MJPs. We assume the state space (Ω) is countable. The notation for the observations Y (t1 ), . . . ,Y (tG ) is described in Chapter 6. . . . . . . . . Figure 2.2 8 A realization G of a DP with base measure G0 and support Ω. A partition (A1 , A2 , · · · , AK ) and a point y sampled from G are also illustrated. . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 3.1 A graphical representation for an HMM . . . . . . . . . . . . 23 Figure 3.2 Generative mechanism for the state transitions in the iHMM model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 4.1 25 Generating samples of µθ from MGP(H0 , β0 ). We use superscript i instead of subscript N to distinguish the ith atom from the state at the Nth jump. We present the row corresponding to state θ i in the rate matrix. As an example, if θN = θ i , then the probability of transition to θ 2 is proportional to µθ i ({θ 2 }). . . Figure 5.1 31 An illustration of generating the next state given the current state θN in the HGEP model . . . . . . . . . . . . . . . . . . 40 Figure 6.1 Pseudocode for the PMCMC algorithm . . . . . . . . . . . . 51 Figure 6.2 An illustration of the SMC algorithm with three particles and three observation times up to generation 1. . . . . . . . . . . Figure 6.3 53 An illustration of the SMC algorithm with three particles and three observation times up to generation 2. . . . . . . . . . . vii 54 Figure 6.4 An illustration of the SMC algorithm with three particles and three observation times up to generation 3. . . . . . . . . . . 55 Figure 6.5 Pseudocode for the SMC step of the PMCMC algorithm . . . 58 Figure 6.6 Pseudocode for sampling from the “improved” proposal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 6.7 63 An illustration of sampling the last jump for a generation of particles in the SMC algorithm. For simplicity we show only one particle. . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 6.8 Sampling procedure for the “modified” predictive distribution (case 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 6.9 63 64 Sampling procedure for the “modified” predictive distribution (case 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 6.10 Sampling procedure for the “modified” predictive distribution (case 3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Figure 6.11 Computing the predictive probability for a sampled path . . . 67 Figure 7.1 Sampling from the predictive distribution of the HGEP model used in the experiments . . . . . . . . . . . . . . . . . . . . . Figure 8.1 71 Qualitative behavior of the prior: (a) β0 = 10, H0 = 5, γ0 = 100, (b) β0 = 100, H0 = 5, γ0 = 10, (c) β0 = 10, H0 = 500, γ0 = 10, (d) β0 = 1, H0 = 5000, γ0 = 1 . . . . . . . . . . . Figure 8.2 Mean reconstruction error on the held-out data as a function of the number of Gibbs scans for the experiments. . . . . . . . . Figure 8.3 76 (a) A sample path in a reconstructed phylogenetic tree, (b) Evolution of the sequences for the reconstructed sample path . Figure 8.4 74 78 Mean predictive probability of reaching category 3 starting in category 1 (95% credible interval is provided by the dashed lines) 79 Figure 8.5 Comparison of inference algorithms for the RNA dataset: mean acceptance rate per iteration for a proposal using 100 particles as a function of MCMC iteration using (a) predictive distribution (b) modified predictive distribution in the SMC proposal. viii 80 Glossary Acronyms CRF Chinese restaurant franchise CRP Chinese restaurant process DDP dependent Dirichlet process DP Dirichlet process EDSS expanded disability status scale GEP gamma exponential process HDP hierarchical Dirichlet process HGEP hierarchical gamma exponential process HMM hidden Markov model IHMM infinite hidden Markov model MCMC Markov chain Monte Carlo MGP Moran gamma process MH Metropolis - Hastings MJP Markov jump process MS multiple sclerosis ix PIMH particle independent Metropolis - Hastings PMCMC particle Markov chain Monte Carlo SMC sequential Monte Carlo x Acknowledgments It is difficult to overstate my gratitude to my supervisors Drs. Alexandre BouchardCˆot´e and John Petkau. Alex’s enthusiasm, his patience, his guidance, and his support have helped me greatly and this thesis would not have been possible without him. He is an inspiration and I am very honored to call him my teacher. John’s great advice, support, friendship, especially his great efforts to explain things clearly and simply has been invaluable on both an academic and a personal level, for which I am extremely grateful. Many thanks go to the numerous people who have been my teachers and mentors along my academic journey so far. I would especially like to express my gratitude to Dr. Alireza Haji (B.Sc. supervisor) and Dr. Kourosh Eshghi (M.Sc. supervisor) at Sharif University of Technology. I am very grateful for their generous support, advise, and help with various projects. I would like to thank my student colleagues and friends for great discussions and for providing a stimulating and fun environment in which to learn and grow. Special thanks go to Chao Xiong, Hongyang Zhang, Liangliang Wang, Hossein Bashashati, Reza Skandari, Shahram Pourazadi, Sajjad Fayazi, Abbas Javaherian, and Ehsan Esfahanian. The secretaries at the Statistics department deserve my gratitude for providing assistance and administrative support, with special mention to Peggy, Andrea, and Elaine. Last but not least, I am forever grateful to my parents for their love and support, especially my mother who has always been there for me and helped me grow as a person. To her I dedicate this thesis. xi Chapter 1 I NTRODUCTION Inside every non-Bayesian, there is a Bayesian struggling to get out. — Dennis V. Lindley Markov jump processes (MJPs), continuous-time Markov processes with piecewise constant paths, have been used as models in many different fields including disease progression [5], phylogenetic trees [41], and communication networks [44]. The main motivation behind this thesis is the application of MJPs to data modeled as having complex latent structure. As a concrete example, disease progression in multiple sclerosis (MS) can be considered as an MJP [31]. MS is one of the most complex disabling diseases affecting the central nervous system. A typical evolution of the disease begins with periods of relapses and remissions. At this stage of the disease, patients recover from the neurological symptoms that appear during a relapse period in the following remission period. However, most of the patients eventually move to another stage of the disease in which disability accumulates. The course of the disease can be assessed using disability scales such as the expanded disability status scale (EDSS), an ordinal score ranging from 0 to 10 in half point steps. Since the patient’s state (measured by the EDSS score) usually remains stable for a relatively long period of time, an MJP model is a reasonable choice for modeling the EDSS data. In MS studies, EDSS scores are analyzed with the aim of describing or predicting the disease progression. The EDSS data available are partially observed 1 data; measurements are only taken at certain time points. It is not reasonable to assume the EDSS scores, collected in time, are independent of each other; thus, they can be thought of as sequential data for each individual patient. Several authors have proposed Markov models for analyzing EDSS scores; see, for example [1] and [32]. However, due to the complex nature of the disease and also the partial observability of the data, more sophisticated models may be required. A hidden Markov model (HMM) is proposed in [31] as a possible model for disease progression. In an HMM, the states are not directly observed, but are reflected by the observations (e.g., EDSS scores). That is, we only observe a sequence of outputs which are dependent on an underlying Markov process. The currently available methods for carrying out inference in an HMM with an underlying MJP assume the number of the hidden states is either known (see for example [43] or [5]) or can be estimated using a model selection method (e.g., AIC, BIC, or penalized minimum-distance [30] methods). To successfully apply these methods to a realworld problem, we need a domain expert or a model selection method to specify the number of hidden states. In complex settings, such as disease progression in MS, it is seldom the case that one has a defensible parametric model. Nonparametric Bayesian models, Bayesian models with nonparametric priors, have recently attracted much attention in the statistics community, due to their flexibility, adaptability, usefulness in analyzing complex real world datasets, and their ability to sidestep the model selection. Nonparametric Bayesian models can automatically infer an adequate model complexity from the data, without needing explicit Bayesian model comparison. In addition, while being highly adaptable, these models maintain some advantages of a fully model-based probabilistic framework. Exact inference in these models is usually impossible; thus, statistical inference typically is done using Markov chain Monte Carlo (MCMC), a general approach used in approximate inference algorithms for Bayesian models. In this thesis we propose a nonparametric prior, the gamma exponential process (GEP), over MJPs. In particular, we propose a prior over infinite rate matrices which characterize an MJP. These priors can be used in Bayesian models where an MJP is imposed on the data but the number of states of the MJP is unknown in advance. Thus, if we consider a Bayesian MJP model for analyzing the MS data 2 then these priors are a reasonable choice. We will briefly provide some background on MJPs and their rate matrices in Chapter 2. We will also review some basic nonparametric priors such as the gamma process and the Dirichlet process (DP) in this chapter. There are a number of nonparametric Bayesian models available for sequential data. One class of these models are developed for the discrete-time framework; examples of these models are the infinite HMM (iHMM) [3] and the sticky hierarchical Dirichlet process-HMM [13]. Another class of nonparametric Bayesian models are developed for continuous-time transient processes. Transient processes are processes where the effect of an observation at time point t1 , on a prediction at time point t2 should decrease as |t1 − t2 | increases. Examples of these models are the order-based dependent Dirichlet process (DDP) [17] and the stick-breaking autoregressive process [18]. We will review these and related models in Chapter 3. Our GEP model fills the gap for nonparametric priors over continuous-time recurrent (i.e., non-transient) processes. The GEP model that we propose has some attractive properties such as conjugacy and simple closed-form predictive distributions. The simple predictive distributions that arise with the DP is one of the factors behind its widespread adoption in nonparametric Bayesian statistics; moreover, a simple predictive distribution can simplify the process of making statistical inference. We discuss the properties of GEPs in Chapter 4. We extend the basic model of Chapter 4 to a hierarchical version in Chapter 5. Sharing statistical strength can be considered as the main motivation behind a hierarchical model. Informally, we want the rows of the rate matrix to share information on what states are frequently visited. We introduce the hierarchical gamma exponential process (HGEP) and the motivation behind it in Chapter 5. After building an appropriate model (i.e., GEP and HGEP) for a complex latent structure, we need to infer this structure from the observed data. Typically we are not able to observe the whole sequence of states; that is, we only have partially observed sequences of data. In Chapter 6, we show that the HGEP model admits efficient inference algorithms. We introduce two inference algorithms: 1) a particle Markov chain Monte Carlo (PMCMC) algorithm which is an MCMC algorithm with sequences proposed by a sequential Monte Carlo (SMC) algo3 rithm; 2) a modified version of this PMCPC algorithm with an “improved” SMC proposal. In order to link the observed data to the latent structure we need a likelihood model; we assume a simple multinomial-Dirac likelihood model. Using this likelihood model in several experiments, we observe that despite the simplicity of this likelihood model the results are reasonable. The multinomial-Dirac likelihood model is described in Chapter 7. We apply the model to three different datasets in Chapter 8: 1) a synthetic dataset generated from a random rate matrix; 2) an MS dataset obtained from a phase III clinical trial of a drug; and 3) an RNA dataset containing aligned RNA of species from the three domains of life. We evaluate our model on these datasets based on a held-out task. For the held-out task we construct three new datasets where each observation is held-out with 10% probability; then we reconstruct the observations at the held-out times and measure the mean error. The results show that our model outperforms the current available methods. We also apply the model to estimating disease progression in MS. We conclude the thesis in Chapter 9 with some final remarks and directions for future research. 4 Chapter 2 BASIC N OTIONS In this chapter, we introduce some basic notions and theories used in the following chapters of the thesis. We start by defining a Markov jump process (MJP); next, we present the definitions and properties of gamma and Dirichlet processes as the important elements of our model. 2.1 Markov Jump Process (MJP) An MJP can be described as a continuous time process with the Markov property on a discrete state space; a more formal definition is as follows: Definition 1. An MJP is a stochastic process {St ,t ∈ T } with a discrete state space Ω and a continuous index set T = R+ that satisfies the Markov property: p(St+s = j|Su ; u ≤ t) = p(St+s = j|St ). (2.1) If, moreover, the right hand side of Equation 2.1 only depends on the time increment s, but not on t, then the MJP is called homogeneous. Here, we only consider homogeneous MJPs; thus, the term MJP will always refer to a homogeneous MJP. 5 An MJP is often defined through its instantaneous transition rate matrix Q: q1,1 q1,2 q1,3 · · · q2,1 q2,2 q2,3 · · · . Q= q3,1 q3,2 q3,3 · · · .. .. .. .. . . . . The off-diagonal element qi, j corresponds to the instantaneous transition rate from state i to state j: qi, j = lim h↓0 P(St+h = j|St = i) ; h hence, they are non-negative. As a consequence of the definition of qi, j , the waiting rate at state i denoted by qi,i is determined from qi,i = − ∑ j=i qi, j . The convention is to place qi,i in the diagonals of Q. One particular probability of interest in an MJP is the transition probability from one state to another given a time interval t. The transition probability matrix P(t) = (pi, j (t)) where: pi, j (t) = p(St = j|S0 = i) is calculated by matrix exponentiation; that is, P(t) = eQt . Another probability of interest is the initial probability distribution π where πi = p(S0 = i). Given the transition rate matrix Q, the behavior of a sample path from an MJP is specified by the following theorem [27]: Theorem 2. Assume an MJP process with transition rate matrix Q = (qi, j ), is in state i at time point t. The sample path behavior of the MJP can be described as follows: 1. The waiting time in the current state i is distributed according to Exp(−qi,i ). 2. Given its past,if a transition occurs at a particular time, the probability that this transition is from the current state i to state j is given by Mi, j = qi, j /|qi,i |, 6 Note that in the second statement of Theorem 2, M is the transition probability matrix for a Markov chain without self-transitions (i.e., Mi,i = 0, ∀i ∈ Ω). This Markov chain is called the embedded Markov chain of the MJP. A sample from an MJP can be described as a list of pairs X = (θn , Jn )Nn=1 , where Jn is the waiting time at jump n-1 (J1 = 0) and θn is the state at jump n . We assume the initial state of the sequence (i.e., θ1 or S0 ) is distributed according to the initial distribution of the MJP. For finite sequences with length N, we can assume a special absorbing state for the end of a finite sequence θend . We would then condition on (θN+1 = θend ) and (θn = θend , n ∈ {1, . . . , N}), and set the total rate for the row corresponding to θend to zero. In the following, we only consider distributions over infinite sequences. Based on Theorem 2, we can use Doob-Gillespie algorithm [8] to sample from an MJP. Denoting the jump times by T1 , T2 , . . ., this algorithm can be described as follows: 1. For T0 = 0 sample θ1 from π. Set J1 = 0, S0 = θ1 , and n = 1. 2. Sample Jn+1 from Exp(−qθn ,θn ). 3. Update the jump time: Tn = Tn−1 + Jn+1 . 4. For Tn−1 ≤ t < Tn set St = θn . 5. Sample θn+1 from the row corresponding to the θn row of the probability transition matrix M. Set n = n + 1 and go to step 2. A sample path from an MJP is sketched in Figure 2.1. 2.2 Gamma Process The nonparametric prior that we propose is based on the gamma process, a stochastic process with independent, identically gamma distributed increments. Recall that the gamma distribution with shape parameter a > 0 and rate parameter b > 0 has a density given by: Gamma(x|a, b) = ba a−1 −bx x e I(0,∞) (x), Γ(a) 7 Ω J =0 θ θ J J θ J θ Y θ ... X t t∗ t=0 Y(t ) Y(t ) ... Y(t ) Y(t ) Y(t ) Figure 2.1: An illustration of notation for samples from MJPs. We assume the state space (Ω) is countable. The notation for the observations Y (t1 ), . . . ,Y (tG ) is described in Chapter 6. where Γ(a) = ∞ a−1 −z e dz z=0 z is the gamma function for a > 0, and IA (x) = 1 if x ∈ A and IA (x) = 0 otherwise. We use the terminology in [25] for defining the gamma process; thus we use the term Moran gamma process (MGP) and denote a gamma process by MGP. The MGP can be defined through three parameters: 1. concentration parameter also called the shape parameter α0 > 0 2. base probability distribution, P0 : FΩ → [0, 1], where FΩ is the σ − algebra over the sample space Ω 3. rate parameter β0 > 0 We can combine the first two parameters into one new parameter called the base measure parameter H0 where H0 = α0 P0 is a measure on the space (Ω, FΩ ). By this parameterization we can characterize an MGP just by two parameters H0 and β0 . Informally a sample from a gamma process is an atomic measure with random weighted point masses. The point masses ω ∈ Ω and their corresponding weights x > 0 are points in a product space Ω ⊗ [0, ∞). A formal definition of the MGP can be provided based on Poisson processes. 8 Consider a Poisson process with intensity λ (dωdx) = H0 (dω)x−1 e−β0 x dx over the product space Ω ⊗ [0, ∞). A realization of this Poisson process is an infinite set of atoms, {ωi , xi }∞ i=1 where ωi ∈ Ω can be interpreted as the point mass of an atomic measure and xi as its corresponding weight. A realization of an MGP(H0 , β0 ) is then defined as: ∞ µ = ∑ xi δωi ∼ MGP(H0 , β0 ). (2.2) i=1 Note that in contrast to the usual definition of the one-dimensional Poisson process where the process is defined over a line (usually time), here we define the process over a product space Ω ⊗ [0, ∞). In such a spatial Poisson process the counts of the number of events inside each of a number of non-overlapping finite sub-regions of the space are Poisson distributed and independent of each other. It can be shown [25] that the total mass µ(A) = ∑∞ i=1 xi 1(ωi ∈ A) of any measurable subset A ⊂ Ω is gamma distributed with shape parameter H0 (A) and rate parameter β0 . In other words, a gamma process takes positive values independently distributed as gamma at each event generated by a Poisson process. Hence, we have an equivalent definition for the MGP based on its marginals. Recall that by the Kolmogorov consistency theorem, in order to guarantee the existence of a stochastic process on a probability space (Ω , FΩ ), it is enough to provide a consistent definition of the marginals of this stochastic process. As already mentioned, in the case of an MGP, the marginals are gamma distributions: Definition 3 (MGP). Let H0 , β0 be of the types listed above. We say that µ : FΩ → (FΩ → [0, ∞)) is distributed according to the MGP distribution, denoted by µ ∼ MGP(H0 , β0 ), if for all measurable partitions of Ω, (A1 , . . . , AK ), we have: 1 (µ(A1 ), µ(A2 ), . . . , µ(AK )) ∼ Gamma(H0 (A1 ), β0 ) × · · · × Gamma(H0 (Ak ), β0 ). 1 We use the rate parameterization for the gamma density throughout. 9 2.2.1 MGP and Dirichlet process (DP) There is a connection between the MGP and the DP which will be important in the following chapters. As mentioned in Section 2.2 for any measurable subset A ⊂ Ω we have: ∞ µ(A) = ∑ xi 1(ωi ∈ A) ∼ Gamma(H0 (A), β0 ). (2.3) i=1 Dividing a sample from an MGP, µ by µ(Ω) we get a random probability measure which is in fact a sample from a DP with concentration parameter H0 (Ω) and the base probability measure H0 /H0 (Ω). We will explain these parameters in the following section. 2.3 DP Definition and Properties In the previous section, we mentioned that by normalizing the MGP we obtain a new stochastic process called the DP. In this section we describe the DP, some of its important properties, and different ways to represent it. 2.3.1 Definition Intuitively the DP is a generalization of the Dirichlet distribution to an infinite number of dimensions. In fact, the Dirichlet distribution is itself the multivariate generalization of the beta distribution. Recall that the beta distribution is a continuous probability distribution defined on the interval (0, 1) with two parameters α > 0 and β > 0. This distribution is typically used in modelling proportions. The probability density function of the beta distribution is: Beta(x; α, β ) = Γ(α + β ) α−1 x (1 − x)β −1 I(0,1) (x) Γ(α)Γ(β ) By extending a beta distribution to more than two dimensions we obtain a Dirichlet distribution. A K-dimensional Dirichlet distribution with positive parameters 10 α1 , · · · , αK is defined by the following density function: Dir(x1 , · · · , xK−1 ; α1 , · · · , αK ) = Γ(∑Kk=1 αk ) K αk −1 ∏ xk ∏Kk=1 Γ(αk ) k=1 K−1 for all x1 , · · · , xK−1 > 0 satisfying ∑K−1 k=1 xk < 1 where xK = 1 − (∑k=1 xk ). This distribution is often used in the Bayesian framework as a prior for the multinomial distribution. That is due to the fact that the Dirichlet distribution is a conjugate prior over the multinomial distribution. That is, if a random variable has a multinomial distribution with a Dirichlet prior over the parameters of the distribution, then the posterior over the parameters is also Dirichlet. Denoting N independent observations from a multinomial distribution by Y = (y1 , · · · , yN ), we have P = (p1 , · · · , pK ) ∼ Dir(α1 , · · · , αK ) Y |P ∼ Mult(p1 , · · · , pK ) ⇒ P|Y ∼ Dir(α1 + n1 , · · · , αK + nK ) where nk is the number of yi s in category k. The DP is the result of extending the Dirichlet distribution to infinite dimensions. A DP, denoted by G, can be thought of as a probability distribution over probability measures (i.e., non-negative functions that integrate to one) and is defined by two parameters: 1. Base measure (G0 ): a probability distribution G0 : FΩ → [0, 1] which is the mean of the DP; that is, for any measurable set A ⊂ Ω we have E[G(A)] = G0 (A). Here, FΩ is the σ − algebra over the sample space Ω. 2. Concentration parameter (α0 ): a positive real number that controls the variability around the mean G0 . A realization from a DP(α0 , G0 ) is a random distribution with values drawn from G0 . In other words, the support of the sample is the same as the base measure. Similar to the definition of the MGP, by the Kolmogrov consistency theorem, we can define a DP on a probability space (Ω , FΩ ) by providing a consistent defini11 tion of its marginals. Following this approach we have the following definition for the DP: Definition 4. G : FΩ → (FΩ → [0, 1]) is DP distributed with base measure G0 and concentration parameter α0 , denoted by G ∼ DP(α0 , G0 ), if for any finite measurable partition (A1 , · · · , AK ) of Ω we have: (G(A1 ), G(A2 ), · · · , G(AK )) ∼ Dir(α0 G0 (A1 ), α0 G0 (A2 ), · · · , α0 G0 (AK )). (2.4) Note that, similarly to the gamma process, Definition 4 suggests that the two parameters α0 and G0 can be treated as a single parameter H0 = α0 G0 ; this will result in a notationally convenient parameterization G ∼ DP(H0 ). 2.3.2 Conjugacy and posterior distribution A conjugacy property holds for the DP. Let G ∼ DP(α0 , G0 ) and y be a single realization from G, y ∼ G. Note that since G is a measure valued random variable, we can sample random variables from a realization of G. Roughly speaking G can be thought of as an infinite-dimensional multinomial distribution. Due to the Dirichlet-multinomial conjugacy, for a Dirichlet distribution induced by a fixed partition (A1 , A2 , · · · , AK ) of Ω (i.e., the support of G0 ) and y ∈ Ak , we have: G|α0 , G0 ∼ DP(α0 , G0 ) y|G ∼ G (G(A1 ), G(A2 ), · · · , G(AK ))|y ∼ Dir(α0 G0 (A1 ), α0 G0 (A2 ), · · · , α0 G0 (Ak ) + 1, · · · , α0 G0 (AK )) That is, the single (data) point in the state space y only affects the single element of the partition Ak to which it belongs. Figure 2.2 illustrates an example where the point belongs to the element A2 of the partition. Generally, for a sequence of N independent realizations y1 , · · · , yN from G we have: (G(A1 ), G(A2 ), · · · , G(AK ))|y1 , · · · , yN ∼ Dir(α0 G0 (A1 ) + n1 , α0 G0 (A2 ) + n2 , · · · , α0 G0 (AK ) + nK ) where nk is the number of points in the kth element of the partition: nk = |{i : yi ∈ Ak }|. Thus, the posterior distribution over G must be a DP as well, since the above 12 Figure 2.2: A realization G of a DP with base measure G0 and support Ω. A partition (A1 , A2 , · · · , AK ) and a point y sampled from G are also illustrated. is true for any finite partition. Formally, we have the following proposition for the posterior of a DP measure given a set of points y1 , · · · , yN . Proposition 5. Suppose G ∼ DP(α0 , G0 ). Given N independent observations yi ∼ G the posterior distribution also follows a DP: G|y1 , · · · , yN , α0 , G0 ∼ DP(α0 + N, α0 G0 + ∑Ni=1 δyi ). α0 + N where δyi is a point mass located at yi . The proof provided in [12] follows directly from the Dirichlet-multinomial conjugacy. 2.3.3 Stick-breaking construction Definition 4 is not constructive; that is, it does not provide a mechanism for sampling from a DP. Here we provide an alternative constructive definition for the DP. Definition 6 (Stick-breaking construction [40]). Let βc ∼ Beta(1, α0 ) be independent. Define the stick lengths π = {πc }∞ c=1 , as follows: π1 = β1 πc = βc ∏1≤c <c (1 − βc ) = βc (1 − ∑cc =1 πc ) c > 1. We denote this construction by π ∼ GEM(α0 ) 2 . 2 GEM named after Griffiths, Engen and McCloskey; the abbreviation was first introduced in [36]. 13 In words, we start from a stick with length 1; at step c a proportion βc is broken off and used as the stick length. At each step, the rest of the stick is kept for the remaining steps; thus, if at some step c we have a stick with length L the new stick will have length βc L. It is shown in [40] that π constructed by this mechanism is a probability distribution. The following theorem provides an alternative definition for the DP which is constructive. Theorem 7. Given a base measure G0 consider the random measure ∞ G= ∑ πk δθ , k k=1 iid where θk ∼ G0 and π ∼ GEM(α0 ). Then G ∼ DP(α0 , G0 ). The proof is provided in [40]. That is, the two alternative definitions of DP provided in Definition 4 and Theorem 7 are equivalent. Theorem 7 shows that the random probability measures generated from a DP are discrete with probability one [12]. This leads to the predictive distribution, the distribution of a new point yN+1 given a set of N existing i.i.d. points Y = {y1 , · · · , yN }, for the DP known as the Chinese restaurant process (CRP). 2.3.4 CRP definition As mentioned in Section 2.3.3, the DP produces discrete random measures; thus, there is a positive probability of multiple points yi taking identical values and hence forming clusters. Recall that the points y1 , · · · , yN are i.i.d. draws from G, a realization of a DP with some concentration parameter α0 and a base measure G0 ; that is, G ∼ DP(α0 , G0 ). Since the values of the draws are repeated, let y∗1 , · · · , y∗M be the unique values among y1 , · · · , yN and nk be the number of repeats of y∗k . The predictive distribution for the N + 1th point yN+1 given all the previous observations y1 , · · · , yN is given by the following theorem: Theorem 8. Let y1 , · · · , yN be an i.i.d. sample from G ∼ DP(α0 , G0 ), y∗1 , · · · , y∗M be the unique values among y1 , · · · , yN , and nk be the number of repeats of y∗k . Then 14 the predictive distribution for yN+1 is given by: M α0 ni δy∗i + G0 . N + α0 i=1 N + α0 yN+1 |y1 , · · · , yN , α0 , G0 ∼ ∑ (2.5) For a proof of Theorem 8 see for example [23]. The CRP refers to the above distribution (i.e., predictive distribution of a DP). The interpretation of Equation 2.5 in terms of the CRP is as follows. Consider a restaurant with an infinite set of tables; tables correspond to clusters, the ith customer corresponds to the ith point yi , and the kth dish corresponds to the kth unique value that points can attain y∗k . The customers sequentially enter the restaurant and sit either at a previously occupied table or at a new table. Customer i enters and sits at an already occupied table k with probability proportional to the number of customers already at that table nk ; that is, P(Yi = y∗k ) = nk δy∗ k α0 +N , or sits at a new table with probability proportional to α0 ; that is, P(Yi = y∗ ) = α0 G0 α0 +N . When the table is occupied the new customer shares the same dish as the first customer at that table, and when the table is empty a new dish is sampled y∗ ∼ G0 . 2.3.5 DP mixture model One of the most common applications of the DP is the DP mixture model which can be used for clustering data. In fact, the DP mixture model is a mixture model with countably infinite mixture components. Consider a set of points y1 , y2 , · · · , yN , a set of latent parameters θ1 , · · · , θn , and a parametric distribution F. Then the DP mixture model can be written in the following form: yi |θi ∼ F(θi ) θi |G ∼ G G|α0 , G0 ∼ DP(α0 , G0 ) Each observation yi is sampled from F(θi ) and each θi is sampled i.i.d. from G. As G is discrete, samples θi from G can have identical values; therefore, obser15 vations sampled from F with the shared parameter θi belong to the same mixture component (i.e., cluster). 2.4 Summary In this chapter we reviewed some of the basic notions that will be useful in developing our model. In the subsequent chapters, we will show how our nonparametric prior over MJPs is built upon the MGP that we described in this chapter. Moreover, we will present the properties of our model that we believe can be better understood in connection with the properties of the DP explained in the current chapter. In the following chapter, we review some Bayesian nonparametric models and explain how our proposed model is different from these currently available models. 16 Chapter 3 R ELATED M ODELS Our model can be considered as a prior over MJPs, which are a specific type of time series. In this chapter, we present the currently available Bayesian nonparametric models that are used in time series modelling; we describe their properties and group them into three main categories. 3.1 Hierarchical Dirichlet Process In this section, we introduce the hierarchical Dirichlet process (HDP) [42], a nonparametric prior over a set of random probability measures over (Ω, FΩ ). We first explain the motivation behind the HDP from the perspective of mixture models; next, we describe different representations of the HDP which are in fact analogous to different representations of the DP. 3.1.1 Motivation Consider a clustering problem where there are multiple groupings of data; we want to be able to share the clusters across the multiple groupings of data. As a concrete example, consider a document modeling problem where the goal is to model a corpus of documents. A common simplification in document modeling is “the bag of words” assumption by which the words are assumed to be exchangeable and the order of the words in the documents is ignored. Another common assumption in document modeling is that the words in each document are sampled from the 17 distribution of a topic [4]; each topic has a multinomial distribution over a basic vocabulary. For example, in a document on the applications of machine learning in genetics, the words might be sampled from the topics “genetics” and “machine learning”. Now, considering a collection of documents (i.e., corpus) we would like to share the topics among the documents. We might also be interested in sharing among different corpus. 3.1.2 Definition The HDP is defined by a set of hyperparameters: a baseline probability measure H, and concentration parameters γ and α. The HDP is defined as follows: G0 |γ, H ∼ DP(γ, H) iid G j |α, G0 ∼ DP(α, G0 ) iid θi j |G j ∼ G j (3.1) (3.2) (3.3) According to this definition a global measure G0 is sampled from a DP with the baseline measure H and concentration parameter γ; G j is the random probability measure for group j sampled from a DP with base measure G0 and concentration parameter α. Given G0 , the G j s are independent from each other. In other words, through the global measure we are linking different groups together which will allow us to share statistical strength between them. The connection of the HDP with mixture models in general, and topic models in particular, is completed with the specification of a parametric distribution F for the observations: xi j |θi j ∼ F(θi j ). (3.4) Here we assume given θi j , the parameter of the ith cluster in the jth group, the observation xi j is independent of all other observations. In comparison with the DP mixture model (Section 2.3.5), in the HDP mixture model we have a collection of mixture models that are connected through a global base measure G0 . Note that we are not limited to a single layer of hierarchy in the HDP. In fact, we can have as many layers as required for modeling through a recursive structure where the 18 Notation θi j φk ψ jt n jtk n jt. n j.k m jk m.k m j. m.. Definition ith customer in restaurant j kth sampled dish from the global menu H the dish served at table t in restaurant j number of customers in restaurant j at table t eating dish k number of customers in restaurant j at table t number of customers in restaurant j eating dish k number of tables in restaurant j serving dish k number of tables serving dish k number of occupied tables in restaurant j total number of occupied tables Table 3.1: Variables used in the CRF base measure for each layer is itself sampled from another DP. An example of a situation where we might need more than one layer would be sharing topics among different corpus. 3.1.3 Chinese Restaurant Franchise In Section 2.3.4 we introduced the CRP: a representation for the predictive distribution of a DP where we marginalize out the random probability measure. There is an analogous representation for the HDP which is called a Chinese restaurant franchise (CRF). In a CRF we have multiple restaurants with a shared menu. For each restaurant, a customer entering the restaurant can sit at an occupied table or at a new table. A customer sitting at an occupied table orders the same dish as that for the first customer at that table; for a new table a dish is ordered from the shared menu across restaurants. Each restaurant corresponds to a group and customers correspond to the θi j s in Equation 3.3 (i.e., θi j is the ith customer at the jth restaurant). Moreover, following the notation in [42], we denote K i.i.d. samples from the baseline measure H by φ1 , · · · , φK ; here H corresponds to the global menu of the dishes and φk is the kth sampled dish from the global menu. The notation needed for defining the predictive distribution of an HDP is provided in Table 3.1. 19 Similar to the DP, in order to obtain the predictive distribution for the HDP we have to marginalize out the random probability measures G0 and the G j s. First we marginalize out G j . From Theorem 8, the predictive distribution for θi j (ith latent parameter in group j) given θ1 j , θ2 j , · · · , θi−1, j is: m j. n jt. α0 δψ jt + G0 . i − 1 + α0 t=1 i − 1 + α0 θi j |θ1 j , · · · , θi−1, j , α, G0 ∼ ∑ (3.5) Note that the similarity to Equation 2.5. That is due to the fact that in both we are marginalizing out realizations of a DP. Marginalizing out G0 , using Theorem 8 again, we can obtain the predictive distribution for the dishes. Anytime we need to sample a new dish in Equation 3.5 (i.e., sampling from G0 ), we sample from the predictive distribution of the dishes. The dishes are sampled in order of the tables being occupied in each restaurant; that is, we start from table 1 in restaurant j and assign a dish to it (ψ j1 ), next we move to the second table (ψ j2 ), and so on. The predictive distribution for ψ jt given the dishes for all occupied tables in all restaurants, ψ11 .ψ12 , · · · , ψ21 , · · · , ψ j,t−1 , is: K ψ jt |ψ11 , ψ12 , · · · , ψ21 , · · · , ψ j,t−1 , γ, H ∼ m.k ∑ m.. + γ δφ k k=1 + γ H. m.. + γ (3.6) Equation 3.5 is a mixture for the customers; a customer entering restaurant j: 1) sits at a preoccupied table with probability proportional to the number of customers sitting at that table or 2) sits at a new table with probability proportional to α0 . Equation 3.6 is a mixture for the dishes; whenever a new dish is needed for a new table: 1) it is chosen from the previously chosen dishes with probability proportional to the number of tables serving that dish or 2) a new dish is sampled with probability proportional to γ. In summary, to obtain a new sample θi j we first use Equation 3.5; if a new dish is needed from G0 we proceed to Equation 3.6 and sample a new dish ψ jt and set θi j = ψ jt . 20 We can combine Equations 3.5 and 3.6 as follows: K θi j |θ1 j , · · · , θi−1, j , α ∼ n j.k ∑ i − 1 + α0 δφ k + k=1 K α0 µ i − 1 + α0 (3.7) γ m.k δφk + H. µ=∑ m + γ m .. + γ k=1 .. This can be done due to the fact that m j. ∑ n jt. δψ jt = t=1 K m jk ∑ K ∑ n jtk δψ jt = k=1 t=1 ∑ n j.k δφ . k k=1 This representation will be useful in Chapter 5. 3.1.4 Stick-breaking representation for the HDP Similar to the DP, there is a stick-breaking representation for the HDP. We have this representation for the global measure G0 and also for the other random probability iid measures G j s, as all are sampled from a DP. Assume φk ∼ H; it can be shown [42] the stick-breaking representation for G0 and the G j s is as follows: ∞ G0 = (3.8) ∑ βk δφ k k=1 ∞ Gj = ∑ π jk δφ k (3.9) k=1 ∞ where β = (βk )∞ k=1 ∼ GEM(γ) are mutually independent, and π j = (π jk )k=1 ∼ DP(α0 , β ). Note that the weights π j are independent given β as the G j are independent given G0 . In this representation, the support of G0 is φ = (φk )∞ k=1 ; each G j has the same support since each G j is a realization of a DP with base measure G0 . In summary, the stick-breaking representation of the HDP can be written as follows: 21 iid φk ∼ H (3.10) β |γ ∼ GEM(γ) iid π j |α0 , β ∼ DP(α0 , β ) (3.11) (3.12) ∞ G0 = ∑ βk δφ (3.13) k k=1 ∞ Gj = ∑ π jk δφ k (3.14) k=1 3.2 HDP-HMM In this section we describe a closely related model to the HDP which is called the hierarchical Dirichlet process hidden Markov model (HDP-HMM) [42]. Before introducing the model we briefly review hidden Markov model (HMM). 3.2.1 Background on HMM HMMs are doubly stochastic Markov chains. That is, there are finite number of observations at discrete time points; we denote the set of observations by Y = (Y1 ,Y2 , . . . ,YN ). Each of these observations is a X -valued random variable; given its hidden state θ the observation is independent of all the other observations. In other words, there is a hidden state at each of the discrete time points; these hidden states follow a Markov chain on the space of hidden states Ω. HMMs are defined by a transition probability matrix P that characterizes the dynamics of the hidden states, and a parametric family P indexed by the hidden states θ of the chain, P = {Lθ : FX → [0, 1], θ ∈ Ω}. The parametric family determines the probability of “emitting” an observation by a hidden state. A graphical representation of an HMM is sketched in Figure 3.1. An HMM can be viewed as a dynamic mixture model; the choice of the mixture component for each observation is not independent of the other observations but depends on choice of mixture component for the previous observation. That is, for each hidden state θ there is a corresponding random probability measure µθ which 22 Figure 3.1: A graphical representation for an HMM can be thought of as a multinomial distribution over mixture components. More concretely, assume the current hidden state is θN where θN ∈ Ω. Since the hidden states follow a Markov chain, the next hidden state is a random variable distributed as a multinomial distribution that corresponds to the θN th row of P, denoted θN+1 ∼ PθN . The next observation would be a random variable sampled from a parametric family indexed by θN+1 . 3.2.2 Definition The dynamic mixture view of the HMM allows us to extend its definition to an infinite number of hidden states which yields the infinite mixture model. The result, an HDP-HMM [42], is a Bayesian model for the HMM with a nonparametric prior over an infinite-dimension transition probability matrix. We assume a DP prior over the random probability measure µθ (equivalent to G j in 3.2) corresponding to each hidden state of the HMM; informally, we are assuming an infinite-dimension multinomial distribution for each hidden state. However, the DP priors for all the hidden states need to be linked as we want the chain to be irreducible. One approach for linking the DP priors for the states could be using a shared iid base measure H among all of the priors, µθ ∼ DP(α, H) ∀θ ∈ Ω. This approach may work if the base measure has a countable support; otherwise, samples from the DPs will have different supports with probability one. Another approach, that works even with uncountable support for the base measure, is using the HDP framework; we first sample a global measure µ0 from a DP with base measure H and then use the global measure as a shared base measure for the DPs of the hidden states. Let θN be the current state of the chain, yN+1 23 the observation emitted by the hidden state θN+1 , and F a parametric distribution. Then, formally, the HDP-HMM model has the following form: µ0 ∼ DP(γ, H) iid µθ ∼ DP(α, µ0 ) θN+1 | {θn }Nn=1 , {µθ } (3.15) ∀θ ∈ Ω ∼ µθN yN+1 | θN+1 ∼ F(θN+1 ) (3.16) (3.17) (3.18) It is worth mentioning that a model was introduced in [3] that allows countably infinite hidden states for an HMM. This model, known as the infinite hidden Markov model (IHMM), is in fact a CRF representation of an HDP-HMM. This model can be described as a two level hierarchy. Given the current state θN = i, there could be: 1. a self transition with probability proportional to the number of self transitions that were observed before for state i, nii plus a parameter α (i.e., nii + α), 2. a transition to an already visited state j with probability proportional to ni j , the number of times this transition was observed before, 3. a move to a higher level of the hierarchy where an “oracle” process is invoked, with probability proportional to a parameter β . At this level of the hierarchy there are two possibilities: (a) a transition to an already visited state j, with probability proportional to n0j the number of times state j was previously chosen by the oracle regardless of the previous state, (b) a transition to a novel state with probability proportional to a parameter γ. The role of the oracle is similar to the role of µ0 in the definition of the HDPHMM (Equation 3.15); that is, it ties together the transition models to have destination states in common. Figure 3.2 illustrates the transition generative mechanism. The equivalence of this model and a CRF representation of HDP-HMM is shown in [42]. 24 Figure 3.2: Generative mechanism for the state transitions in the iHMM model 3.3 Sticky HDP-HMM Although the HDP-HMM was successful in various applications (see for example [26] or [45]) it has limitations. One limitation of the HDP-HMM is inadequacy in modeling temporal persistence of states; the HDP-HMM creates unnecessarily rapid transitions between states in order to explain all the transitions. By using iid µθ ∼ DP(α, µ0 ) ∀θ ∈ Ω as in Equation 3.16, the HDP-HMM cannot differentiate between self-transitions and transitions to other states. Thus, although all states have similar transition distributions, there could be unnecessary transitions to other states. Particularly, in data sequences with persistent states, the flexibility of HDPHMM will cause the sequence samples with rapid state switches to have high posterior probability. In other words, HDP-HMM will generate new redundant states instead of having larger probability of self transition. The sticky HDP-HMM model introduced in [13] provides a solution to the problem of state-persistence in the HDP-HMM. The model simply includes a selftransition bias parameter. For presenting the sticky HDP-HMM model in this section, we assume that all the hidden states are observed; hence, by the data sequence 25 we mean the sequence of the states. The model proposed in [13] is a slight modification of the model in Section 3.1.4. Using the same notation as Sections 3.1.4 and 3.2 the model can be defined as follows: iid φk ∼ H (3.19) β | γ ∼ GEM(γ) (3.20) iid πθ | α0 , β , η ∼ DP(α0 + η, α0 β + ηδθ ) α0 + η (3.21) ∞ µ0 = (3.22) ∑ βk δφ k k=1 ∞ µθ = ∑ πθ k δφ ∀θ ∈ Ω k (3.23) k=1 θN+1 | {θn }Nn=1 , {µθ } ∼ µθN (3.24) (3.25) In this definition η is the self-transition bias parameter introduced to the HDP; here α0 β + ηδθ means that η > 0 is added to the θ th component of α0 β . This parameter will encourage self-transitions in the model; note that η = 0 reduces the model to the HDP-HMM. 3.4 Dependent Dirichlet Process (DDP) DDPs are a general framework for modeling a collection of random probability measures GT = {Gt : t ∈ T ⊂ Rd }. Here Gt is a random probability measure on some space (Ω, FΩ ) which is indexed by elements t ∈ T . We only consider time space in the following; hence, T = R+ . As in the previous section, for presenting the DDP model in this section, we assume that all the states are observed. In the DDP we start from the stick-breaking construction of the DP (Section 2.3.3) and replace the weights and/or atoms with appropriate stochastic processes [29]. The general definition for a DDP provided in [29] is: Definition 9. A DDP is a collection of random probability measures GT = {Gt : 26 t ∈ T }; Gt is defined as ∞ Gt = ∑ πk (t)δθ (t) k (3.26) k=1 where {θk (t) : t ∈ T }, for k = 1, 2, · · · are independent realizations from a stochastic process G0 defined on T and the πk (t)s are distributed independently across t according to GEM(αt ). Note that in the above definition both the set of atoms and the set of weights depend on t. The DP and HDP can be considered as special cases of the DDP; for any fixed t, the DDP yields a DP. Moreover, the HDP is an example of the DDP where we have dependence in the weights but not in the atoms (see Section 3.1.4). The simplest construction for the DDP is obtained when the atoms are replaced with a stochastic process, but the weights remain constant at different values of t. The resulting model is called a single-p DP [29] which is simply a DP mixture of stochastic processes. The advantage of this model is that it is simple and fairly flexible; however, the disadvantage is that the model has a lack of “locality” due to the global sharing of π. More general DDP models discuss how π should vary across T . A model proposed in [17], the order-based DDP, uses a common collection of stick-breaking proportions β = {βi }∞ i=1 for all Gt , t ∈ T . Since the βi s are i.i.d. for any permutation σ , πc = βσ (c) ∏1≤c <c (1 − βσ (c ) ) is valid. The model allows the permutation σt to vary with t. Note that at any time we have the usual stick breaking construction; thus, the marginal over t of the defined process is DP. Another model which deals with a π varying across T is proposed in [10]; this model is in the discrete-time framework. The evolution of the random probability measure in the model is defined as Gt = νGt−1 + (1 − ν)εt where ν is a parameter and εt is a sample from a DP. The continuous-time version of the previous model, called the stick-breaking autoregressive process, is introduced in [18]. The process is defined as Gt = G˜ N(t) where N(t) is a Poisson process with rate λ and G˜ i is defined as follows: G˜ i = βi G˜ i−1 + (1 − βi )δθi 27 (3.27) where βi ∼ Beta(1, α0 ) and θi ∼ G0 . That is, there are jumps in the distributions at the arrival times of the Poisson process. At the i-th arrival time a new atom is introduced and the effect of the previous atoms decay with the rate βi . It can be shown that both Gt and G˜ i follow a DP with concentration parameter α0 and base measure G0 . Moreover, for any set ω: Corr(Gt (ω), Gt+s (ω)) = ρ s (3.28) where ρ = exp(− α0λ+1 ); thus, the dependence between the random measures decreases exponentially with their separation in time. The common theme among all of the continuous-time DDP models presented in this section is that the correlation between the marginals of Gt1 and Gt2 should decay smoothly with |t1 −t2 |. That is, if we have points at times t1 and t2 then these models are transient in the sense that the effect of a point at time t1 , on the point at time t2 should decrease as t2 − t1 increases. 3.5 Summary In this chapter we discussed some models that are closely related to our proposed model; we also mentioned that the DDP can be considered as a general framework that covers all of these models. There are other related models that we will not explain in detail as they have less importance in the following chapters of the thesis. However, for the sake of completeness we list these models here. Other models that can be considered as examples of the DDP are: the Ornstein-Uhlenbeck DP [16], a DDP model in the continuous-time framework with fixed atoms (i.e., the atoms do not depend on time); the time series dependent DP [35], a DDP in discrete time; the nested DP [38], a model which clusters similar distributions together, in contrast to the HDP which shares clusters among all distributions; and the local DP introduced in [7] that provides a distribution for a collection of random probability measures indexed by predictors. All the models presented in this chapter can be grouped into four categories based on being 1) continuous/discrete-time and 2) transient/non-transient. All the models in Sections 3.1, 3.2, 3.3 are in the discrete-time framework, whereas the 28 models in Section 3.4 are in the continuous-time framework (unless specified otherwise). In Sections 3.2 and 3.3 the models are non-transient in contrast to the models introduced in Section 3.4 which are transient. In this thesis we are going to propose a model for continuous-time non-transient processes 29 Chapter 4 G AMMA E XPONENTIAL P ROCESS In this chapter, we first present the formal definition of the gamma exponential process (GEP). Next, we show that this model has some attractive properties such as conjugacy and a closed form expression for the predictive distribution. 4.1 Definition In a GEP, we first obtain the rows of the rate matrix Q by a transformation of i.i.d. samples from an MGP; then, we generate the states from Q with the Doob-Gillespie algorithm described in the Section 2.1. We will denote the normalization constant of a measure µ (i.e., µ(Ω) where Ω is the support of µ) by µ and the normalized measure by µ¯ = µµ . Moreover, we assume that all the states are observed for now, and treat the partially observed case in Chapter 6. That is, using the notation of Section 2.1, we assume X = (θn , Jn )Nn=1 is the list of observed events. Definition 10 (GEP). Let H0 be a base measure on a countable support Ω with H0 < ∞. The GEP is formally defined as follows: iid µθ ∼ MGP(H0 , β0 ) ∀θ ∈ Ω θN+1 X, {µθ }θ ∈Ω ∼ µ¯ θN JN+1 X, {µθ }θ ∈Ω ∼ Exp ( µθN ) We will relax the countable base measure support assumption in the next section. 30 Figure 4.1: Generating samples of µθ from MGP(H0 , β0 ). We use superscript i instead of subscript N to distinguish the ith atom from the state at the Nth jump. We present the row corresponding to state θ i in the rate matrix. As an example, if θN = θ i , then the probability of transition to θ 2 is proportional to µθ i ({θ 2 }). Figure 4.1 shows an example of generating samples of µθ from MGP(H0 , β0 ). Note that Definition 10 has positive self-transition rates as the diagonal elements so self-transitions are allowed. However, we will use this definition since for computing predictive distributions, it will be simpler to allow positive self-transition rates. The next proposition shows we can remove the self-transitions using a transformation and obtain an equivalent process; by equivalent we mean the two processes have the same distribution of waiting time in each state, and the same transition probability from a state to a different state. Proposition 11. Let St be the MJP obtained from a GEP. Then we can obtain an equivalent MJP (St∗ ) by arbitrarily ordering Ω = θ (1) , θ (2) , . . . , and setting: q∗θ (i) ,θ ( j) = µθ (i) ({θ ( j) }) µθ (i) µ¯ θ (i) ({θ (i) }) − 1 if i = j o.w. St∗ generated from Q∗ has no self-transitions. Proof. We can prove that the two processes are equivalent by showing given the 31 current state, the distribution of the waiting time in that state and the transition probability to the next different state for the two processes are the same. Pθ∗(i) ,θ ( j) , the probability of transition from θ (i) to a different state θ ( j) in the process St∗ , is equal to µθ (i) ({θ ( j) }) µθ (i) −µθ (i) ({θ (i) }) . Assuming the process is at the Nth jump, the same probability in St can be obtained as follows: Pθ (i) ,θ ( j) = P(θN+1 = θ ( j) |θN = θ (i) , i = j) = µ (i) ({θ ( j) }) θ µ (i) θ µ (i) ({θ (i) }) 1− θ µ θ (i) = µθ (i) ({θ ( j) }) µθ (i) −µθ (i) ({θ (i) }) . Thus, the probability of transition from one state to another is the same for St and St∗ . Now we show the waiting time T in state θ (i) for both processes is distributed as Exp( µθ (i) − µθ (i) ({θ (i) })). By definition, this is true for St∗ . For St we have: (i) P(T < t) =P(∪∞ n=1 {leave state θ at jump n & jump n occurs before time t}) = ∞ ∑ P{leave state θ (i) at jump n}× n=1 P{jump n occurs before time t|leave state θ (i) at jump n} = ∞ ∑ pn P{jump n occurs before time t|leave state θ (i) at jump n} n=1 µ (i) ({θ (i) }) µθ (i) ({θ ( j) }) n−1 ) (1 − θ µ ) is the µθ (i) θ (i) µ (i) distribution for a geometric random variable with psuccess = 1 − µθ . θ (i) Moreover, P{jump n occurs before time t|leave state θ (i) at jump n} is the cumu- where pn = P{leave state θ (i) at jump n}= ( lative distribution function evaluated at time t for the sum of n i.i.d. exponential random variables. Note that, conditioned on the event that state θ (i) has self-transitions up to jump n, the waiting time to each jump (up to jump n) has an Exp( µθ (i) ) distribution. It is known that the geometric sum of i.i.d. exponential random variables with rate µθ (i) has an exponential distribution with rate psuccess × µθ (i) (see for example [34]). Hence, we have proved the waiting times have the same distribution in both processes. To understand the connection with the Doob-Gillespie algorithm presented in 32 Section 2.1 note that we can use Proposition 11 to obtain a rate matrix Q from the matrix generated from µθ in Definition 10. As mentioned in Section 2.1, to completely define an MJP we also need the initial distribution. Instead of considering an initial distribution π we can equivalently assume that there is a special state θbeg always present at the beginning of the sequence, and only at the beginning. That is, we always condition on (θ0 = θbeg , J0 = 0) and (θn = θbeg , n > 0), and drop these conditioning events from the notation. The initial distribution is then the set of transition probabilities out of θbeg . 4.2 Conjugacy Property In this section, we show that the posterior of each row, µθ |X, is also MGP distributed with updated parameters. The sufficient statistics for the parameters of µθ |X, the empirical transition measures Fθ and the empirical waiting times Tθ , are defined as follows: N Fθ = ∑ 1[θn−1 = θ ] δθ , n (4.1) n=1 N Tθ = ∑ 1[θn−1 = θ ] Jn . (4.2) n=1 The main result of this section is as follows: Proposition 12. The GEP is a conjugate family: µθ |X ∼ MGP µθ , βθ , where µθ = Fθ + H0 and βθ = Tθ + β0 . For the proof of Proposition 12, we will need the following elementary lemma: Lemma 13. If V ∼ Beta(a, b) and W ∼ Gamma(a + b, c) are independent, then VW ∼ Gamma(a, c). See for example [9] for a survey of standard beta-gamma results such as that stated in this lemma. We now prove the proposition: 33 Proof. For simplicity, fix an arbitrary state θ and drop the index (this is without loss of generality since the rows are iid); hence, µ = µθ ∼ MGP(H0 , β0 ), µ = µθ , and β = βθ . Let (A1 , . . . , AK ) be a measurable partition of Ω. By the Kolmogorov consistency theorem, it is enough to show that for all such partitions, (µ(A1 ), µ(A2 ), . . . , µ(AK )) | X ∼ Gamma(µ (A1 ), β ) × · · · × Gamma(µ (Ak ), β ). (4.3) Assume for simplicity that K = 2 (the argument can be generalized to K > 2 without difficulty), and let V = µ(A1 )/ µ , W = µ . Then, by elementary properties of gamma distributed vectors: V ∼ Beta(H0 (A1 ), H0 (A2 )), (4.4) W ∼ Gamma(α0 , β0 ), (4.5) and V,W are independent (both conditionally on X = (θn , Jn )Nn=1 and unconditionally). By beta-multinomial conjugacy, we have d V | X = V | (θn , Jn )Nn=1 = V | θ1 , . . . , θN ∼ Beta(µ (A1 ), µ (A2 )), and by gamma-exponential conjugacy, we have d W | X = W | J1 , · · · , JN ∼ Gamma( µ , β ). Using Lemma 13 with a = µ (A1 ), b = µ (A2 ), c = β , we finally get that µ(A1 )|X = VW |X ∼ Gamma(µ (A1 ), β ), which concludes the proof. Note that µθ is the mean parameter of an unnormalized DP, as µθ |X ∼ MGP µθ , βθ which is an unnormalized DP. 34 4.3 Predictive Distribution In this section we find an expression for the distribution of (θN+1 , JN+1 )|X which is the predictive distribution. We will need the following family of densities: Definition 14 (Translated Pareto). Let α > 0, β > 0. We say that a random variable T is translated-Pareto, denoted T ∼ TP(α, β ), if it has density: f (t) = 1[t > 0]αβ α . (t + β )α+1 (4.6) Proposition 15. The predictive distribution of the GEP is given by: (θN+1 , JN+1 ) | X ∼ µ¯ θN × TP( µθN , βθN ). (4.7) ¯ Proof. By Proposition 12, it is enough to show that if µ ∼ MGP(H0 , β0 ), θ |µ ∼ µ, and J|µ ∼ Exp( µ ), then (θ , J) ∼ µ¯ × TP(α0 , β0 ), where α0 = H0 . d Note first that because (J|θ ) = J the minimum and argmin of independent exponential random variables are independent. To get the distribution of J, we need to integrate over the rate parameter of the exponential waiting time distribution. Here we denote the distribution of the waiting time given the rate by p(t|x) and the distribution of the rate parameter by p(x). From Definition 10 and the properties of the gamma process described in Section 2.2, we know the rate parameter has a gamma distribution. Thus, to obtain the distribution of J we compute the following integral: p(t) = p(t|x)p(x) dx x>0 Exp(t; x) · Gamma(x; α0 , β0 ) dx = = x>0 β0α0 Γ(α0 ) = x>0 x>0 x exp(−xt) · xα0 −1 exp(−β0 x) dx xα0 exp (−(β0 + t)x) dx = 35 α0 β0α0 (β0 + t)α0 +1 Hence J ∼ TP(α0 , β0 ). A useful property of predictive distributions is exchangeability. We can show that these predictive distributions are indeed exchangeable: Proposition 16. Let J j(θ ,1) , J j(θ ,2) , . . . , J j(θ ,K) be the subsequence of waiting times following state θ . Then the random variables J j(θ ,1) , J j(θ ,2) , . . . , J j(θ ,K) are exchangeable. Moreover, the joint density of the sequence of waiting times (J j(θ ,1) = j1 , J j(θ ,2) = j2 , . . . , J j(θ ,K) = jK ) is given by: p( j1 , j2 , . . . , jK ) = 1[ jk > 0, k ∈ {1, . . . , K}](α0 )K β0α0 (β0 + j1 + · · · + jK )α0 +K (4.8) where the Pochhammer symbol (x)n is defined as (x)n = x(x + 1) · · · (x + n − 1). Proof. From Proposition 15, we have: p( j1 , j2 , . . . , jK ) = p( j1 )p( j2 | j1 ) × · · · × p( jK | j1 , . . . , jK ) = T P(α0 , β0 )T P(α0 + 1, β0 + j1 ) × · · · × T P(α0 + K − 1, β0 + j1 + · · · + jK−1 ) α0 β0α0 (α0 + 1)(β0 + j1 )α0 +1 (β0 + j1 )α0 +1 (β0 + j1 + j2 )α0 +2 (α0 + K − 1)(β0 + j1 + · · · + jK−1 )α0 +K−1 ×···× (β0 + j1 + · · · + jK )α0 +K = 1[ jk > 0, k ∈ {1, . . . , K}] ∝ 1[ jk > 0, k ∈ {1, . . . , K}](β0 + j1 + · · · + jK )−α0 −K . The second line is obtained from the first by considering the fact that each jump to state θ updates the sufficient statistics Fθ and Tθ ; consequently, the parameters of the TP (i.e., µθ and βθ ) are updated. The normalization of the above expression is indeed equal to 1/((α0 )K β0α0 ). 36 Chapter 5 H IERARCHICAL G AMMA E XPONENTIAL P ROCESS 5.1 Motivation In this section, we present a hierarchical version of the GEP, where the rate matrix, instead of i.i.d. rows, has exchangeable rows. Informally, the motivation behind this construction is to have the rows share information on what states are frequently visited. This is similar to the motivation behind the HDP-HMM model presented in Section 3.2. Recall that in order to link the rows of the transition probability matrix in the HDP-HMM model, we used a random shared base probability measure distributed according to a DP. As with HDPs, the hierarchical construction is especially important when Ω is uncountable. For such spaces, since each GEP sample has a random countable support, any two independent GEP samples will have disjoint supports with probability one. Therefore, a GEP alone cannot be used to construct recurrent Markov processes when Ω is uncountable. Fortunately, the hierarchical model introduced in this chapter addresses this issue: it yields a recurrent model for MJPs over both countable and uncountable spaces Ω. 37 5.2 Definition The hierarchical gamma exponential process (HGEP) is constructed by making the base measure parameter of the rows shared and random. Formally, the model has the following form: µ0 ∼ MGP(H0 , γ0 ) iid µθ | µ0 ∼ MGP(µ0 , β0 ) θN+1 X, {µθ }θ ∈Ω ∼ µ¯ θN (5.1) JN+1 X, {µθ }θ ∈Ω ∼ Exp( µθN ). In order to get a tractable predictive distribution, we introduce a set of auxiliary variables. As we will see shortly, these auxiliary variables are closely related to the variables used in the CRF metaphor presented in Section 3.1.3 to indicate when new tables are created in a given restaurant (i.e., m jk ). 5.3 Connection to CRF From the connection between the DP and the gamma process described in Section 2.2.1, we know that if X ∼ DP(α0 , H0 ) and Y ∼ Gamma(γ0 , β0 ) then XY ∼ MGP(γ0 H0 , β0 ). Hence, we can rewrite Definition 5.1 as: µ0 ∼ Gamma( H0 , γ0 ) µ¯ 0 ∼ DP( H0 , H¯ 0 ) µθ (5.2) (5.3) iid (5.4) iid (5.5) µ0 ∼ Gamma( µ0 , β0 ) µ¯ θ µ¯ 0 , µ0 ∼ DP( µ0 , µ¯ 0 ) θN+1 X, {µθ }θ ∈Ω ∼ µ¯ θN JN+1 X, {µθ }θ ∈Ω ∼ Exp( µθN ). (5.6) (5.7) Note the similarity between Equations 5.3 and 3.1, and between Equations 5.5 and 3.2. The only difference is in Equation 5.5, where we condition on a random variable µ0 , the normalization of the top level random measure. Now, as we have 38 a similar structure to the HDP model, we can construct the predictive distribution for the states θ in the HGEP similarly to the CRF model of Section 3.1.3. In the HGEP, a restaurant can be understood as a row in the rate matrix, the tables as groups of transitions to the same destination state, and a dish as a destination state. We can use an auxiliary variable An to indicate when the n-th transition creates a new table; An = 1 means the n-th transition is to a state not previously visited. The variable takes value An = 0 otherwise. We augment the sufficient statistics (i.e., Fθ and Tθ defined in Section 4.2) with empirical counts for the number of tables across all restaurants that share a given dish, G = ∑Nn=1 An δθn . In other words, G({θ }) indicates the number of transitions to the state θ from all the states. We need to introduce one additional auxiliary variable, the normalization of the top level random measure, µ0 . This auxiliary variable has no equivalent in CRFs. Based on the construction of the CRF in Section 3.1.3 and the DP-based representation of the HGEP described above, we can write the predictive distribution for the state θN+1 in the form of Equation 3.7: µ = G + H0 G H0 + H¯ 0 G + H0 G + H0 FθN µ0 µ¯ θN+1 | θ1 , · · · , θN , µ0 , µ¯ ∼ + FθN + µ0 FθN + µ0 µ¯ = (5.8) Note that FθN ({θi }) is equivalent to nN.i , the number of customers eating dish i in restaurant N, in Equation 3.7. In addition, G({θi }) is equivalent to m.i , the number of tables serving dish i in the CRF model in Equation 3.7. We denote the (H) predictive distribution for the next state presented in Equation 5.8 by µ¯ θN . We use the superscript (H) to distinguish from the non-hierarchical case. Figure 5.1 illustrates an example of generating a state θN+1 from the predictive (H) distribution µ¯ θN . Assume that we are at the Nth jump and we have observed two distinct states a and b up to this time. The idea is to view the predictive distribution for the next state θN+1 in the hierarchical model as a mixture of two possibilities. One of these two is selected with probability proportional to ( FθN , µ0 ). The two possibilities are: 39 New state H Existing states ||H || ||G + H || G({ }) ||G + H || G({ }) ||G + H || =1 Existing transitions ||F ||µ || + µ || F ||F ({ }) + µ || F ||F ({ }) + µ || =0 Figure 5.1: An illustration of generating the next state given the current state θN in the HGEP model 1. to generate from FθN FθN +µ0 , the empirical distribution over the transitions starting at θN . This is “joining one of the existing tables in the current restaurant (i.e. the current state θN )” in the CRF analogy. In the example in Figure 5.1, this case corresponds to path P1a or P1b . When this alternative is selected, the successor state θN+1 is determined with probability proportional to FθN (“the new customer picks the dish of the selected existing table”). In the example, the next state is a or b with probability proportional to FθN ({a}) or FθN ({b}) correspondingly. 2. to generate recursively from a “back-off” distribution, µ¯ (“creating a new table in the current restaurant”). When this alternative is selected, the new table picks a dish. The dish can be picked from G, the empirical distribution over the dishes picked by tables across all restaurants (corresponding to 40 paths P2a or P2b ), or it can be picked from the “back-off” distribution, the normalized base measure, H¯ 0 (path P3 ). It remains to provide a formal definition for the table creation auxiliary variable An . By augmenting the sufficient statistics with an indicator over which of the two alternatives (1,2) is selected at each transition, the predictive distribution takes a tractable form. Formally, the definition of the table creation auxiliary variables is therefore as follows: P(AN+1 = a | µ0 , X) ∝ µ0 a FθN 1−a 1[a ∈ {0, 1}] θN+1 AN+1 , X ∼ (1 − AN+1 )F¯θN + AN+1 µ¯ . 5.4 Predictive Distribution In the previous section we showed that the predictive distribution for the next state given the past states and µ0 has a CRF representation. In this section we present the predictive distribution for the next event given the auxiliary variables and the past events. The main result of this section is as follows: Proposition 17. The predictive distribution of the HGEP is given by: (H) (H) (θN+1 , JN+1 ) (X, {An }Nn=1 , µ0 ) ∼ µ¯ θN × TP( µθN , βθN ). Proof. Conditioning on µ0 , we have that θN+1 and JN+1 are independent. As mentioned (θN+1 X, {An }Nn=1 , µ0 ) can be viewed as an HDP with concentrations H0 for the top level DP, and µ0 for the lower level DPs. Hence, the distribution (H) µ¯ θN is then the predictive distribution for θN+1 . For JN+1 X, µ0 , we know from Proposition 15 that the predictive distribution is a translated Pareto distribution with parameters equal to FθN + µ0 , the normalization constant of the updated base measure, and TθN + β0 , the updated rate (H) parameter. That is, JN+1 X, µ0 ∼ TP( µθN 41 , βθN ). The conditional distribution of the auxiliary variable µ0 given all the other random variables will be useful for the inference algorithm that we introduce in Chapter 6. It can be shown that, given all the other random variables, µ0 has a gamma distribution. Proposition 18. The conditional distribution of µ0 given the other variables is a gamma distribution: µ0 X, {An }Nn=1 ∼ Gamma(a, b), where a = H0 + G , b = γ0 + ∑θ ∈Ω log(βθ /β0 ). Proof. The conditional distribution has density p(x) proportional to: p( µ0 = x | (θn , jn )Nn=1 , {An }Nn=1 ) ∝p( µ0 = x) · p( j1 , j2 , · · · , jN | µ0 = x) · p(θ1 , · · · , θN | µ0 = x) For p(θ1 , · · · , θN | µ0 ), the joint density of the states given µ0 , we must use the predictive distribution for the states given by Equation 5.8. Let iω be the set of indexes of the events for which we have θi = ω; iω = {i : θi = ω}. We denote the size of the set iω = {1ω , · · · , Nω } by |iω | and the number of times the transition from state ω results in a new table by Gω . We have: p(θ1ω , · · · , θNω | µ0 = x) = p(θ1ω | µ0 = x)p(θ2ω |θ1ω , µ0 = x)× · · · × p(θNω |θ1ω , · · · , θ(N−1)ω , µ0 = x) xGω (x)(x + 1) · · · (x + |iω | − 1) xGω ∝ (x) Fθ ω ∝ Therefore, for the sequence of states θ1 , · · · , θN we have: p(θ1 , · · · , θN | µ0 = x) ∝ x Gθ ∏ θ ∈Ω (x) Fθ ∝x G ∏ (x) −1 Fθ θ ∈Ω Finally, using Equations 5.2, 4.8, and 5.8, for p(x) we have: 42 (x) Fθ β0x ∏ x+ Fθ θ ∈Ω (Tθ + β0 ) p(x) ∝ xα0 −1 exp(−γ0 x) × x G ∏ (x) −1 Fθ θ ∈Ω ∝ xα0 + G −1 exp −x γ0 + ∑ log θ ∈Ω 43 βθ /β0 Chapter 6 I NFERENCE In the previous two chapters we assumed the sequence of events is observed. Now we relax this assumption and use the models described in the last two chapters as a prior distribution over a sequence of hidden events. The goal of this chapter is to propose an inference algorithm to approximate the posterior of the hidden events X given the observations Y . In particular, we are interested in approximating E[h(X)|Y ], the expectation of a general function of the events X under the posterior distribution of GEPs. In most applications, the sequence of states is not directly nor fully observed. First, instead of observing the states θ , we observe X -valued random variables Yn distributed according to a parametric family P indexed by the states θ of the chain, P = {Lθ : FX → [0, 1], θ ∈ Ω}. Second, the observations are generally available only for a finite set of times T . For instance, in Figure 2.1 instead of observing the MJP denoted by X, we only have observations at time points t1 , · · · ,tG (we will discuss multiple sequences in the next section). To specify the random variables in question, we need a notation for the hidden event index at a given time ∗ t, I(t) = min N : ∑N+1 n=1 Jn > t (see Figure 2.1, where I(t ) = N − 1). Thus, as- suming conditional independence, for an individual observation at time point t we d have Y (t)|X = Y (t)|θI(t) ∼ LθI(t) . The set of all observed random variables is then defined as Y = (Y (t1 ),Y (t2 ), . . . ,Y (tG ) : tg < tg+1 , {ti } = T ) . 44 I(t ) G We then denote the list of hidden events from time t = 0 to tG by X1:G = (θn , Jn )n=1 where tG ∈ T . We will describe the general inference framework for the model of Chapter 4. Extension to hierarchical models is direct (by keeping track of an additional sufficient statistic G, as well as the auxiliary variables An , µ0 ). For simplicity, we assume in this section that P is a conjugate family with respect to H0 . Non-conjugate models can be handled by incorporating the auxiliary variables of Algorithm 8 in [33]. As mentioned our goal is to approximate expectations under the posterior distribution of GEPs. However, as the posterior distribution is intractable we cannot evaluate posterior expectations directly. Thus, we need to rely on the Monte Carlo methods and sample from approximations to the posterior distribution of the hidden events given the observations; that is, p(X1:G |Y ). Commonly used algorithms for sampling (approximately) from a posterior distribution in Bayesian statistics are based on Markov chain Monte Carlo (MCMC) methods. In order to approximately sample from the posterior distribution for our model we use a particle Markov chain Monte Carlo (PMCMC) algorithm [2], a special type of the MCMC algorithm well-suited for sampling sequential data. 6.1 MCMC Algorithms In this section we briefly review the MCMC methods; for a detailed discussion and proofs of the results see for example [28]. MCMC is a general approach for generating samples from a target distribution p(x) using a Markov chain mechanism. The samples x(i) generated from an MCMC based algorithm form a Markov chain that converges to p(x). We use MCMC when we cannot directly sample from p(x) but we can evaluate it up to a normalizing constant. Recall that the evolution of a Markov chain depends only on its current state and a probability transition matrix which we denote by T . As long as the Markov chain is irreducible and aperiodic there is a distribution p(x) called the invariant distribution to which the Markov chain converges. Informally, an aperiodic irreducible Markov chain is a Markov chain that does not have a cyclic behavior and from any 45 of its states there is a positive probability that all other states can be visited in finite number of steps1 . The detailed balance condition, a condition that guaranties p(x) is the invariant distribution for a Markov chain, is defined as: p(x(i)) T (x(i − 1)|x(i)) = p(x(i − 1)) T (x(i)|x(i − 1)); an MCMC sampler can be obtained by satisfying the detailed balance condition. The most popular MCMC algorithm is the Metropolis - Hastings (MH) algorithm; most MCMC algorithms can be viewed as a special case or an extension of this algorithm. In every step of this algorithm, given the current state of the Markov chain xc , a candidate state x∗ is sampled from a proposal distribution q(·|xc ). Then x∗ is accepted with the probability: min{1, p(x∗ )q(xc |X ∗ ) }; p(xc )q(x∗ |xc ) if x∗ is accepted the Markov chain moves to x∗ , otherwise it remains at xc . Note that we only need p(x) up to a normalizing constant in order to compute the acceptance probability. Although the MH algorithm is simple, finding a good proposal distribution that has fast convergence rate can be a challenging task. The MH algorithm converges if we can show that the chain has no cycles and every state that has positive probability can be reached in a finite number of steps. Under these conditions the chain converges to the target distribution p(x) after sufficient burn-in period which is the number of steps until the chain approaches p(x). In this definition every sample is dependent on the previous sample; as a result we have dependence between draws in the Markov chain. To reduce this dependence and get closer to i.i.d. samples, some have suggested thinning: keeping only the samples generated after every r-th iteration of the chain. A variant of the MH algorithm is the independent MH algorithm for which the proposal is independent of the current state, q(x∗ ) = q(x∗ |xc ). Thus, the acceptance ∗ c )q(x ) probability is min{1, p(x p(xc )q(x∗ ) }. Another MCMC algorithm that can be considered as a special case of the MH 1 For a formal definition see for example [19]. 46 is the Gibbs sampler. It can be shown that the Gibbs sampler is an MH algorithm with acceptance probability equal to 1. Suppose the goal is sampling from the ndimensional joint probability distribution p(x1 , x2 , · · · , xn ) and sampling from this distribution is hard. If samples from the full conditionals p(xi |x1 , · · · , xi−1 , xi+1 , · · · , xn ) can be easily obtained then we can produce samples from the joint distribution by sampling from the conditional distributions. We initialize the Gibbs sampler with an arbitrary sample (x1 (0), x2 (0), · · · , xn (0)). Then for iterations i = 1 to N the algorithm consists of n steps as follows: Sample x1 (i) ∼ p(x1 |x2 (i − 1), x3 (i − 1), · · · , xn (i − 1)) Sample x2 (i) ∼ p(x2 |x1 (i − 1), x3 (i − 1), · · · , xn (i − 1)) .. . Sample xk (i) ∼ p(xk |x1 (i − 1), x2 (i − 1), · · · , xk−1 (i − 1), xk+1 (i − 1), · · · , xn (i − 1)) .. . Sample xn (i) ∼ p(xn |x1 (i − 1), x2 (i − 1), · · · , xn−1 (i − 1)) With any MCMC sampler, the empirical distribution of the generated samples yields p(x), ˆ the approximation to the target distribution p(x); in Bayesian contexts, the target distribution will be a posterior distribution. In addition to approximating the posterior distribution p(x), the samples obtained from an MCMC sampler can be used to approximate the mode (or other characteristics) of the posterior distribution. For instance, to approximate the maximum a posteriori (MAP), if we have N samples, we find argmax ( p(x(i))). ˆ x(i);i=1,··· ,N In applications we usually only consider the samples after the burn-in period. The theory guarantees that the chain converges to the target distribution, but it does not say anything about how long we need to run the chain to obtain convergence. In particular, there are two practical issues that we need to address when using an MCMC sampler: how long should the burn-in period be and how long we need to run the chain after that. Different diagnostics are available for monitoring the chain’s convergence. The easiest diagnostic is to simply visualize the state 47 of the chain with respect to iterations; if there is no evident trend in the plot this suggests (approximate) convergence has been achieved. There are also some quantitative diagnostics such as Geweke’s diagnostic [15], Heidelberger and Welch’s diagnostic [21], and Gelman and Rubin’s diagnostic [14]. Although these diagnostics are informative guides, they cannot definitely tell whether the chain has converged; judging the adequacy of convergence of MCMC samplers necessarily retains a subjective element. If we are convinced that the chain has converged, then we need to know how long we need to run the chain to get a reasonable approximation. The answer depends on the objective of the study: the greater the precision needed for the study, the longer we need to run the chain. We can use Raftery and Lewis’s diagnostic [37] to estimate how long our chain needs to run in order to estimate quantiles within a specified accuracy with some specified probability. 6.2 PMCMC Algorithm As mentioned we use a special type of the MCMC algorithm, the PMCMC algorithm, to approximately sample from the posterior distribution in our model. There are different flavors of PMCMC algorithms; we use the particle independent Metropolis - Hastings (PIMH) algorithm [2], an MCMC algorithm with proposals that are sequential Monte Carlo (SMC) approximations of p(X1:G |Y ). We will explain the SMC algorithm in detail in the following section; for now it can be thought of as an algorithm which provides an approximation for the posterior density p(X1:G |Y ) and the marginal likelihood p(Y ). As mentioned to sample from p(X1:G |Y ) in a standard MCMC algorithm we c , the proposed sample from a proposal density q(X1:G |Y ). Given a current state X1:G ∗ is then accepted with probability: sample X1:G min{1, ∗ |Y )q(X c |Y ) p(X1:G 1:G c |Y )q(X ∗ |Y ) }. p(X1:G 1:G In the PMCMC algorithm we use an SMC approximation to p(X1:G |Y ) which we denote by p(X ˆ 1:G |Y ) as our proposal density. Note that at each iteration of the PMCMC algorithm, the sample from the p(X ˆ 1:G |Y ) is a list of hidden events from 48 t1 to tg . It can be shown [2] that using an appropriate acceptance ratio we can make this proposal a valid MCMC move; the algorithm then converges to the desired posterior density p(X1:G |Y ). The PMCMC algorithm follows the following simple form: 1. iteration i = 0: Run an SMC algorithm to obtain an approximation for p(X1:G |Y ); sample X1:G (0) ∼ p(·|Y ˆ ) and denote the corresponding marginal likelihood estimate p(Y ˆ ) by L(0). 2. iteration i ≥ 1: ∗ ∼ p(·|Y (a) run the SMC algorithm and sample X1:G ˆ ), and denote the cor- responding marginal likelihood estimate by L∗ L∗ ∗ : set X ∗ (b) with probability min{1, L(i−1) } accept X1:G 1:G (i) = X1:G and L(i) = L∗ . Otherwise, set X1:G (i) = X1:G (i − 1) and L(i) = L(i − 1). Denoting the total number of iterations by iter and the number of burn-in iterations by B, the collection of samples {X1:G (i)}iter i=B+1 are approximately samples from the posterior density p(X1:G |Y ) for large enough B (i.e., sufficient burn-in period). Table 6.1 summarizes the notation used in the PMCMC and the SMC algorithms. In general, there may be several sequences of observations; for instance, we can have sequences of EDSS values from several MS patients. We can have different observation times for different sequences. We denote the number of time series by K, each of the form (k) (k) (k) (k) (k) (k) Y (k) = Y (k) (t1 ),Y (k) (t2 ), . . . ,Y (k) (tG(k) ) : tg < tg+1 , {ti } = T (k) , k ∈ {1, . . . , K}; moreover, we use the superscript \k to indicate all the sequences except k. (k) In case of multiple sequences, we resample the hidden events X1:G(k) for one (\k) sequence k given the sufficient statistics of all the other sequences (Fθ and the sufficient statistics for the likelihood model denoted by (\k) Sθ . (\k) , Tθ ), The role of the sufficient statistics in the algorithm and their updating procedure will be explained in the next section where we describe the SMC algorithm. To simplify 49 Notation I(t) X1:g Y (t) Y1:g Xm,g Nm,g Xm,g+1 m X˜g+1 m Xg+1 Wm,g Fθk,m,n Tθk,m,n Gkm,n Sθk ,m,n \k Definition the hidden event index at a given time t: I(t) = min N : ∑N+1 J > t n n=1 I(tg ) the list of hidden events from time t = 0 to tg : X1:g = (θn , Jn )n=1 an observation at time t the observations from time t1 to time tg : Y (t1 ),Y (t2 ), . . . ,Y (tg ) a particle, a list of hidden events indexed by n covering the first g Nm,g observations: Xm,g = (θm,n , Jm,n )n=0 the smallest number of events required to cover the first g obserNm,g vations in the m-th particle: tg ≤ ∑n=0 Jm,n a new particle extended from Xm,g an event (a pair of state and waiting time) sampled from the proposal q(·|Y (tg+1 ), Xm,g+1 ) a list of events appended to Xm,g the importance weight of the particle Xm,g empirical transition measures for the m-th particle after the n-th jump in the k-th sequence empirical waiting times for the m-th particle after the n-th jump in the k-th sequence empirical counts for the number of transitions from all states to the same destination after the n-th jump for the m-th particle in the k-th sequence sufficient statistics of the likelihood model for the m-th particle after the n-th jump in the k-th sequence all the sequences except the k-th (used as superscript for the sufficient statistics) Table 6.1: Notation used in the PMCMC algorithm (k) the notation we denote X1:G(k) by X (k) . The pseudocode for the PMCMC algorithm in the GEP model (with multiple sequences) is provided in Figure 6.1. 6.3 SMC Algorithm SMC algorithms provide an approximation of p(X1:G |Y ), the posterior density, and p(Y ), the marginal likelihood [2]. These algorithms approximate the pos- 50 Algorithm 1 PMCMC algorithm for inference in the GEP model K: number of sequences k: sequence number k ∈ {1, . . . , K} For i = 0 to #iterations do For k = 1 to K do (\k) (\k) (\k) - Run an SMC algorithm to obtain an approximation for p(X (k) |Y (k) , Fθ , Tθ , Sθ ) (k) (\k) (\k) (\k) (k) (k) - Sample X∗ ∼ p(·|Y ˆ , Fθ , Tθ , Sθ ), and estimate the corresponding marginal likelihood L∗ - Sample u ∼ Uni f [0, 1] (k) ∗ If u < min{1, L(k)L(i−1) } or i = 0 (k) (k) - Set X (k) (i) = X∗ , L(k) (i) = L∗ (k) (k) (k) - Update (Fθ , Tθ , Sθ ) else - Set X (k) (i) = X (k) (i − 1) and L(k) (i) = L(k) (i − 1) End If End For End For Figure 6.1: Pseudocode for the PMCMC algorithm terior density p(X1:G |Y ) sequentially2 . In other words, a SMC algorithm first approximates p(X1:1 |Y (t1 )), then p(X1:2 |Y (t1 ),Y (t2 )) and so on. We denote the observations from time t1 to time tg (i.e., Y (t1 ),Y (t2 ), . . . ,Y (tg )) by Y1:g . Given the set of all observations Y1:G , the approximation for the posterior is obtained from a set of M weighted random samples: M p(dX ˆ 1:G |Y1:G ) := ∑ Wm,G δX m,G (dX1:G ), (6.1) m=1 where Wm,G is the importance weight of the random sample Xm,G . Each random sample Xm,g , m ∈ {1, . . . , M} is called a particle; it consists of a list of hidden events indexed by n, containing both (hidden) states and waiting times: Xm,g = N m,g (θm,n , Jm,n )n=0 . We use Nm,g to denote the smallest number of events required to N m,g cover the first g observations in the m-th particle; that is, tg ≤ ∑n=0 Jm,n . For g = 0 we let Nm,0 = 0. The algorithm propagates the particles Xm,g and updates the weights Wm,g from generation g = 0 up to generation G, where generation g corresponds to the obser2 To simplify the description of the algorithm we begin by explaining it for the case where there is only one sequence; next, we generalize the algorithm for the case of multiple sequences 51 vation time tg . A proposal density q is used to propagate the particles and evaluate the weights. Two different proposal densities will be introduced in Section 6.4; evaluation of the weights for each of the proposals will also be discussed. Initially, we assume all the particles in generation g = 0 (i.e., time = 0) are set to contain only the beginning special state, Xm,0 = (θbeg , 0) for all m ∈ {1, . . . , M}3 . For generation g = 1, a proposal (i.e., importance) density q(·|Y (t1 )) is used to approximate p(X1:1 |Y1:1 ). We generate M particles from q(·|Y (t1 )); each particle Xm,1 will be constructed such that the list of events contains enough events (i.e., smallest number of events) to cover the first observation time, t1 . In other words, N m,1 t1 ≤ ∑n=1 Jm,n . An illustration of the SMC algorithm is provided in Figures 6.2, 6.3, and 6.4 where we have three observations (i.e., EDSS= 1.5, 1.5, 2 at time points t1 ,t2 ,t3 ). The first generation of three particles at observation time t1 is shown in Figure 6.2(a). In order to take into account the discrepancy between the two densities we assign importance weights Wm,1 to the particles (details to follow). This is shown in Figure 6.2(b) where we assign weights W1,1 , W2,1 , and W3,1 to the particles. Next, in a resampling step, we sample M times from the weighted particle density p(dX ˆ 1:1 |Y1:1 ) which is an approximation of p(X1:1 |Y1:1 ). In Figure 6.3(a) we assumed we have sampled three times from the weighted particle density and obtained the same particles as before; we have three particles X1,1 , X2,1 , and X3,1 . At time t2 , we extend each particle Xm,1 into a new particle Xm,2 through the proposal density q(·|Y (t2 ), Xm,1 ) to obtain an approximation for p(X1:2 |Y1:2 ). To extend a particle Xm,1 into a particle Xm,2 , we first copy the events of Xm,1 into Xm,2 , and then append new sampled events (from q(·|Y (t2 ), Xm,1 )) until the observation at time t2 is covered. Figure 6.3(b) shows the extended particles x1,2 , x2,2 , and x3,2 . We need to compute the importance weights Wm,2 for the extended particles shown in Figure 6.3(b) as we are not directly sampling from p(X1:2 |Y1:2 ). A new population of extended particles (Xm,2 )M m=1 is then obtained by resampling M times from p(dX ˆ 1:2 |Y1:2 ), where 3 As this event is present at the beginning of each particle, whenever g > 0 we initialize the sequence of hidden events for each particle from the first event n = 1. That is, for g > 0 we define Nm,g each particle as Xm,g = (θm,n , Jm,n )n=1 . 52 (a) (b) Figure 6.2: An illustration of the SMC algorithm with three particles and three observation times up to generation 1. 53 (a) (b) Figure 6.3: An illustration of the SMC algorithm with three particles and three observation times up to generation 2. 54 (a) (b) Figure 6.4: An illustration of the SMC algorithm with three particles and three observation times up to generation 3. 55 M p(dX ˆ 1:2 |Y1:2 ) = ∑ Wm,2 δX m=1 m,2 (dX1:2 ). The new population of random particles X1,2 , X2,2 , and X3,2 is illustrated in Figure 6.4(a) where we have two samples from X3,2 and one sample from X1,2 . This procedure is then repeated until time tG is covered; for instance, in Figure 6.4(b) we stop the procedure as soon as the observation time t3 is covered. This completes generating the particles; the approximation for the posterior distribution M p(X1:G |Y1:G ) is then obtained from p(dX ˆ 1:G |Y1:G ) (e.g., ∑m=1 Wm,3 δXm,3 (dX1:3 ) for the example in Figure 6.4(b)). The SMC algorithm can be summarized as below (for details see [2]). To simplify the notation, we omit writing ∀m ∈ {1, . . . , M}; we also denote the sufficient statistics for the likelihood model P by Sθ . 1. at time = 0 set Xm,0 = (θbeg , 0) and Wm,0 = 1/M. 2. for g = 0 to G − 1 (a) extend Xm,g to a new particle Xm,g+1 : m a list of i. copy the contents of Xm,g into Xm,g+1 and create Xg+1 m events that have been appended to Xm,g . Note that the list Xg+1 contains no events at this stage as we have only copied the contents of Xm,g into Xm,g+1 (see Figure 6.4(b) for an example of the list of appended events to the particles X1,2 , X2,2 , and X2,3 : X31 , X32 and X33 ). ii. sample from q(·|Y (tg+1 ), Xm,g+1 ), append the new sampled event m to X m ˜m X˜g+1 m,g+1 and also to Xg+1 (i.e., Xm,g+1 := (Xm,g+1 , Xg+1 ) and m := (X m , X ˜m Xg+1 g+1 g+1 )). Finally, update the sufficient statistics (e.g., F, T and S for the GEP model). Continue appending new events, and updating the sufficient statistics; stop as soon as the first g + 1 N m,g observations are covered; that is, tg+1 ≤ ∑n=1 Jm,n . (b) compute and normalize the weights for the extended particles; to do this, for an observation dy ∈ FX , let L( dy|Sθ ) denote the predictive 56 likelihood given Sθ . It can be shown [2] that the appropriate importance weights are: wm,g+1 = m |X L( dy|Sθ )p(Xg+1 m,g ) m |Y (t q(Xg+1 g+1 ), Xm,g ) ; (6.2) then the normalized weights are: Wm,g+1 := wm,g+1 . M ∑n=1 wn,g+1 (c) after obtaining extended particles Xm,g+1 and their weights, generate the new generation of particles Xm,g+1 by resampling M times from M p(dX ˆ 1:g+1 |Y1:g+1 ) = ∑ Wm,g+1 δX m,g+1 m=1 (dX1:g+1 ). (6.3) 3. the approximation for the posterior density p(X1:G |Y ) is then: M p(dX ˆ 1:G |Y1:G ) := ∑ Wm,G δX m,G (dX1:G ). (6.4) m=1 The SMC algorithm also provides an estimate of the marginal likelihood p(Y ) given by 1 M ∑ wn,g . g=1 M n=1 G p(Y ˆ )=∏ (6.5) As mentioned we use the marginal likelihood estimates in computing the acceptance probability in the PMCMC algorithm. For the case of multiple sequences we need to keep the track of the sufficient statistics for each sequence. In the GEP model, for the k-th sequence the sufficient statistics are the empirical transition measures (Equation 4.1), the empirical waiting times (Equation 4.2), and the sufficient statistics of the likelihood model for the m(k) (k) th particle after the n-th jump; we denote these sufficient statistics by Fθ ,m,n , Tθ ,m,n , (k) and Sθ ,m,n correspondingly. Note that the sufficient statistics for the likelihood 57 Algorithm 2 SMC algorithm to construct a proposal for the PMCMC G: the number of observations in the current sequence (k) g: particles generation g ∈ {0, 1, . . . , G} M: number of particles m: particle number m ∈ {1, . . . , M} K: the total number of sequences k: sequence number k ∈ {1, . . . , K} (We omit writing ∀m ∈ {1, . . . , M} and superscript (k) for every X to avoid excessive notation) For a given sequence k: - Set Xm,0 = (θbeg , 0) For g = 0 to g = G − 1 do - Extend Xm,g to a new particle Xm,g+1 : m a list of events appended to X Copy the events of Xm,g into Xm,g+1 and create Xg+1 m,g (k) N m,g+1 Jm,n is satisfied Loop until tg+1 ≤ ∑n=1 sample a pair (θ , J) from q(·|Y (tg+1 ), Xm,g+1 ) m to X ˜m append the new sampled event X˜g+1 m,g+1 ; let Xm,g+1 := (Xm,g+1 , Xg+1 ) m to X m ; let X m := (X m , X m ) ˜ append the new sampled event X˜g+1 g+1 g+1 g+1 g+1 (k) (k) (k) update the sufficient statistics Fθ ,m,n , Tθ ,m,n and Sθ ,m,n for events that have been generated so far End Loop (\k) - Compute the weight of the particles wm,g+1 = (k) m |X L(Y (tg+1 )|θI(tg+1 ) ,Sθ +Sθ ,m,n )p(Xg+1 m,g ) m |Y (t q(Xg+1 g+1 ),Xm,g ) - Generate the new population of particles Xm,g+1 by resampling M times from p(dX ˆ 1:g+1 |Y1:g+1 ) (Equation (6.3)) End For - Approximate the posterior density p(X1:G |Y ) and the marginal likelihood p(Y ) with Equations 6.4 and 6.5 correspondingly. Figure 6.5: Pseudocode for the SMC step of the PMCMC algorithm model depend on the chosen likelihood model; for instance, in the likelihood model that we will introduce in Chapter 7 the sufficient statistics are the frequencies of each observation. The pseudocode of the SMC algorithm in the GEP model (in case of multiple sequences) is presented in Figure 6.5. 6.4 SMC Algorithm Proposal Distributions We describe two proposal distributions for the SMC algorithm in this section: the predictive distribution and a modified version of the predictive distribution. 6.4.1 The predictive distribution as the proposal distribution If we use the predictive distribution as the proposal q in the SMC algorithm of Figure 6.5, then to extend the particle Xm,g to Xm,g+1 we need to sample the hidden m from Equation 4.7. event X˜g+1 58 The sufficient statistics for this sampling step come from two sources: (1) the (\k) other sequences, which are held fixed (i.e., (Fθ (\k) , Tθ )), and (2) the events that have been generated so far in the current particle, Xm,g+1,n := (θm,i , Jm,i )ni=1 . The (k) (k) corresponding sufficient statistics are denoted by (Fθ ,m,n , Tθ ,m,n ). Since these statistics are additive, we can write the sampling step as: (\k) (θn+1 , Jn+1 )|Fθ (\k) where: µθ = Fθ (\k) , Tθ , Xm,g+1,n ∼ µ¯ θn × TP( µθn , βθn ), (\k) (k) + Fθ ,m,n + H0 and βθ = Tθ (k) + Tθ ,m,n + β0 . The weights of (\k) the particles have the simple form wm,g+1 = L( dy|Sθ (k) + Sθ ,m,n ) as the proposal m |Y m q(Xg+1 g+1 , Xm,g ) is the same as the predictive distribution p(Xg+1 |Xm,g ). 6.4.2 An “improved” proposal distribution based on a modified predictive distribution For our experiments discussed in Chapter 8, we use a modified version of the hierarchical model of Chapter 5 and a relatively simple likelihood model. Therefore, in this section as opposed to the previous sections of this chapter we describe the proposal distribution for the HGEP model, which is used in our experiments. For the likelihood model, we assume each hidden state θ emits an observation deterministically; that is, given the hidden state the observation is known. However, multiple hidden states can emit the same observation. More formally, we assume H0 in the HGEP model of Section 5.2 is a product of uniform and multinomial distributions, (i.e., H0 = Unif × Mult). Moreover, for the likelihood model Lθ we use a Dirac delta. Each sample from H0 is a hidden state θ in the form of a pair (u, y) where y ∈ Σ the set of possible observations, and u ∈ [0, 1]. The uniform distribution can be thought as being responsible for generating unique identifiers for hidden states. Given a hidden state θ = (u, y) at time t the observation at that time step is a deterministic function of θ : L(u,y) = δy . Further details of the likelihood model are provided in Chapter 8. Any probability distribution that we can easily sample from and for which we can compute the importance weights using Equation 6.2 is a candidate for the proposal m |Y (t distribution. In this section we define a proposal q(Xg+1 g+1 ), Xm,g ) which, in contrast to the previous proposal, takes the observation Y (tg+1 ) into account. Based 59 on the defined likelihood model, if we know Y (t), the observation at time step t, then we know the set of possible hidden states for that time step. Thus, this new proposal is more suitable for our experiments where we use the described likelihood model. We call this the “improved” proposal distribution. We denote the set of possible hidden states for observation Y (t) at time step t by ΩY (t) where ΩY (t) := {(u, y) ∈ Ω : y = Y (t), L(u,y) = 1}. We are interested in a proposal distribution which, given Xm,g = (θm,i , Jm,i )ni=1 and Y (tg+1 ), proposes a sample of hidden events (θm,i , Jm,i )Ni=n+1 with the property that θI(tg+1 ) ∈ ΩY (tg+1 ) . Informally, we are only interested in samples which end up in a hidden state that can emit Y (tg+1 ). To obtain the desired proposal, we first define a probability distribution which we call the “modified” predictive distribution. As the name suggests, we define this distribution by modifying the predictive distribution of the model in Section 5.4. This probability distribution is the key element in our new proposal. Next, we introduce the “improved” proposal distribution. Finally, we explain how the importance weights can be computed. “Modified” predictive distribution We change the predictive distribution for the hidden states of the model in Section 5.4 to generate only samples from θ ∈ ΩY (tg+1 ) . We denote the modified predictive distribution by variables with superscript M. In order to obtain the modified predictive distribution, we change the base measure H0 , defined above (i.e., H0 = Unif × Mult); and redefine the sufficient statistics Fθ and G. We define the modified base measure as H0M = Unif × Y (tg+1 ); hence, every hidden state sampled from H0M is a pair (u, y) where y is always equal to Y (tg+1 ). In other words all the hidden states sampled from H0M always emit the observation Y (tg+1 ) at time tg+1 . Next we define the modified sufficient statistics as: n FθM = ∑ 1[θi−1 = θ & θi ∈ ΩY (tg+1 ) ] δθi (6.6) GM = ∑ 1[θi ∈ ΩY (tg+1 ) ]Ai δθi . (6.7) i=1 n i=1 60 For FθM , the empirical transition measures for state θ , we make the mass at all atoms where θi ∈ / ΩY (tg+1 ) equal to zero and keep the mass for all the other atoms the same as their mass in Fθ . Similarly for the GM , we only keep the mass at atoms where θi ∈ ΩY (tg+1 ) and assign zero mass to all the other atoms. Using the same normalization constant as in the original predictive distribution (M) (i.e., µ0 ), we define the modified predictive distribution µ¯ θ based on: µ (M) = GM + H0M (M) = FθM + µ0 µ¯ µθ (6.8) (M) (6.9) This modified predictive distribution is used in our proposal. Note that any state sampled from this probability distribution will only emit Y (tg+1 ), the observation at time tg+1 . Probability density value for a sampled state Before proceeding further, it is useful to briefly review computing the probability density value of a sampled state. This value is needed in obtaining the probability of a sampled path from the proposal which in turn is required for computing the importance weights. (H) For a state θ ∗ sampled from the predictive distribution µ¯ θN , we aim to com- pute p(θN+1 = θ ∗ X, {An }Nn=1 , µ0 ). We denote this probability density value by (H) CRF(θ ∗ ; µ¯ θN ). Using Equation 5.8 (see Figure 5.1), we can compute this value from the following formula: (H) CRF(θ ∗ ; µ¯ θN ) = FθN ({θ ∗ }) FθN +µ0 ; AN = 0 G({θ ∗ }) µ0 G+H0 FθN +µ0 ; AN = 1 & θ ∗ ∈ {θ1:N } H0 (θ ∗ ) G+H0 ; o.w. µ0 FθN +µ0 (6.10) The probability density value for a sampled state θ ∗ from the “modified” pre61 (M) dictive distribution µ¯ θ (M) MCRF(θ ∗ ; µ¯ θN ) = is defined similarly: FθM ({θ ∗ }) ; AN = 0 N FθM +µ0 N GM ({θ ∗ }) GM +H0M µ0 FθM +µ0 ; AN = 1 & θ ∗ ∈ {θ1:N } H0M (θ ∗ ) GM +H0M µ0 FθM +µ0 ; o.w. (6.11) N N Sampling procedure for the “improved” proposal distribution Now that we have defined the key element of our proposal distribution, we introduce the “improved” proposal distribution. The pseudocode for sampling from the proposal distribution, in case of multiple sequences, is provided in Figure 6.6. We define the proposal by defining its sampling procedure for the case of a single sequence. We assume that we are at observation time tg and the current hidden state for the m-th particle is given: θm,I(tg ) . For this proposal we should emphasize that although we sample until we cover the next observation time, we do not assume knowledge of the time to the next transition after tg+1 nor of the state to which that transition is made. In other words, we only know there is a transition after tg+1 but the state and the time to the transition are unknown (see Figure 6.7 for an illustration). For all generations of the particles we always start sampling from an observation time. This point will become clear when we describe how to sample from the proposal and compute the probability of a sampled path. To simplify the notation we omit the subscript m in describing the procedure and denote I(tg ) by num. We use l to index the events occurring after time tg ; for instance, the index of the first jump after tg is l = num + 1. For initializing the procedure we let l = num + 1. The sampling procedure is described below. Figure 6.8(a) illustrates the initial state of the g + 1-th generation of a single particle. Given the current state θl−1 , we sample Jl , the time to the next jump, from the (H) translated Pareto distribution TP( µθl−1 , βθl−1 ). Depending on the sampled value 62 Algorithm 3 Modified predictive distribution as the proposal for inference in the HGEP model Given the current hidden state at time point tg (i.e., θm,I(tg ) ): (From now on, we omit writing ∀m ∈ {1, . . . , M} to avoid excessive notation and also drop particle index m from the notation) K: number of sequences k: sequence number k ∈ {1, . . . , K} num: I(tg ), the number of events up to time step tg l = num + 1 Jl = 0 While ∑li=num+1 Ji < tg+1 − tg (H) - Sample the time to next jump Jl from TP( µθl−1 , βθl−1 ) l - If ∑i=num+1 Ji < tg+1 − tg then (H) - sample the next state θl from µ¯ θl−1 - l = l +1 - update the sufficient statistics Fθk,l , Tθk,l , Sθk ,l , and Gkl - else if θl−1 ∈ ΩY (tg+1 ) - Jl = tg+1 − ∑l−1 i=1 Ji and θl = θl−1 - update the sufficient statistic Tθk,l - else - sample Jl from Uni f [0,tg+1 − ∑l−1 i=1 Ji ] (M) - sample θl from µ¯ θl−1 , the modified predictive distribution - update the sufficient statistics Fθk,l , Tθk,l , Sθk ,l , and Gkl - l = l +1 - Jl = tg+1 − ∑l−1 i=1 Ji and θl = θl−1 - update the sufficient statistic Tθk,l End If End While Figure 6.6: Pseudocode for sampling from the “improved” proposal distribution Figure 6.7: An illustration of sampling the last jump for a generation of particles in the SMC algorithm. For simplicity we show only one particle. 63 (a) (b) Figure 6.8: Sampling procedure for the “modified” predictive distribution (case 1). there are three different cases: 1. the sampled path does not pass the next observation time, ∑li=num+1 Ji < (H) tg+1 − tg . In this case, we sample the next state from µ¯ θl−1 defined in Equation 5.8; set l := l + 1 and update the sufficient statistics Fθ ,l , Tθ ,l , Sθ ,l , and Gl . This case is depicted in Figure 6.8(b); in this example Jnum+1 < tg+1 −tg and (H) θnum+1 ∼ µ¯ θnum . 2. the sampled path passes the next observation time, and the observation emitted from the current state is consistent with Y (tg+1 ) (i.e., the current state emits Y (tg+1 ): θl−1 ∈ ΩY (tg+1 ) ). In this case, we discard the current Jl and instead set it equal to the remaining time to the observation time tg+1 (i.e., Jl = tg+1 − ∑l−1 i=1 Ji ). Moreover, we set the next state equal to the current state: θl = θl−1 and update the sufficient statistic Tθ ,l . Note that we do not need to update the other sufficient statistics as the process remains in its current state without any transitions. An example of this case is given in Figure 6.9. In this example the observation emitted from θnum (i.e., EDSS = 1.5) is consistent with Y (tg+1 ); we have Jnum+1 = tg+1 − tg and θnum+1 = θnum . 3. the sampled path passes the next observation time, and the current state does not emit Y (tg+1 ) (i.e., θl−1 ∈ / ΩY (tg+1 ) ). In this case, we take the following two steps: 64 Figure 6.9: Sampling procedure for the “modified” predictive distribution (case 2). (a) first we discard the current Jl and instead sample Jl from Uni f [0,tg+1 − ∑l−1 i=1 Ji ]; that is, we sample the time of the next jump uniformly between the current time, ∑l−1 i=1 Ji , and the next observation time tg+1 . Then, we (M) sample the next state θl from µ¯ θl−1 , the modified predictive distribution, and update the sufficient statistics Fθ ,l , Tθ ,l , Sθ ,l , and Gl . (b) we set l = l + 1, next we set Jl = tg+1 − ∑l−1 i=1 Ji and θl = θl−1 . That is, we do not sample the last jump; instead, we deterministically set the time to the last jump equal to tg+1 − ∑l−1 i=1 Ji which is the remaining time till tg+1 . Finally, we only update Tθ ,l as we do not assume knowledge of the state after tg+1 and for the next generation of the particles we are going to sample from time tg+1 forward. For an example see Figure 6.10. In the example the sample for Jnum+1 is discarded and a new Jnum+1 is sampled from Uni f [0,tg+1 −tg ]; the new jump (M) time is denoted by t ∗ . For the next state we have θnum+1 ∼ µ¯ θnum ; hence, we have θnum+1 ∈ ΩY (tg+1 ) . Finally, the time to the last jump Jnum+2 is set equal to tg+1 − (Jnum+1 + tg ). The sampling procedure for this generation is stopped as soon as we cover the observation time tg+1 ; that is, ∑li=num+1 Ji < tg+1 − tg . A sampled path from this proposal always ends up in a state θI(tg+1 ) ∈ ΩY (tg+1 ) . Whenever the path passes tg+1 we check whether the path is in the right group of states ΩY (tg+1 ) ; if not then we discard the sampled jump and sample the time of the jump from a uniform distribution and the state from the modified predictive (M) distribution µ¯ θ . Therefore, the sampled path has the desired property. 65 Figure 6.10: Sampling procedure for the “modified” predictive distribution (case 3). Computing the importance weights In order to compute the weights of the particles, we need the proposal and predictive probabilities for a sampled path. We denote the cumulative distribution of the translated Pareto distribution by FT P (t; α, β ), the updated MGP parameters (i) (i) after the ith transition by βθ and µθ ; and the predictive distribution of the ith event given events 1 to i − 1 in the HGEP model by p(θi , Ji |θ1:i−1 , J1:i−1 ). Setting N−1 Jk , the predictive probability for a samn = I(tg ), N = I(tg+1 ), and ∆ = tg+1 − ∑k=1 pled path (θi , Ji )Ni=n+1 (after dropping the particle index m and the superscript H) is given by: p((θi , Ji )Ni=n+1 |Xg = (θi , Ji )ni=1 , µ0 ) N = ∏ i=n+1 (N) N = ∏ i=n+1 N = ∏ i=n+1 (N) p(θi , Ji |θ1:i−1 , J1:i−1 ) · (1 − FT P (∆; µθN , βθN )) (N) (N) p(θi |θ1:i−1 , J1:i−1 ) · p(Ji |θ1:i−1 , J1:i−1 ) · (1 − FT P (∆; µθN , βθN )) (i−1) (i−1) CRF(θi ; µ¯ θi−1 ) · T P(Ji ; µθi−1 (i−1) (N) The third equality is due to the fact that given the current state the next state and jump time are independent. Figure 6.11 provides an example which clarifies the above computation. In the figure to keep the notation uncluttered we drop the parameters of the distributions. The figure is an example of a sampled path beginning from the n-th event; each state is sampled from the predictive distribution for 66 (N) , βθi−1 ) · (1 − FT P (∆; µθN , βθN )). Figure 6.11: Computing the predictive probability for a sampled path the states (denoted by CRF in the figure) given all the previous events. Moreover, each waiting time is sampled from the translated Pareto distribution. Finally, for the transition after tg+1 we know it happens at a time greater than ∆; hence, we use 1 − FT P (∆). Based on the algorithm presented in Figure 6.6, assuming the proposed hidden events are (θm,i , Jm,i )Ni=n+1 , the probability of a sampled path q((θi , Ji )Ni=n+1 |Y (tg+1 ), Xg = (θi , Ji )ni=1 , µ0 ) for the proposal can be obtained for two different scenarios: 1. If the N − 1th (hidden) state, the state at the penultimate jump, belongs to ΩY (tg+1 ) (i.e., θN−1 ∈ ΩY (tg+1 ) ), then we are certain that the events at both jumps N − 1 and N are sampled from the predictive distribution. In other words, the path is sampled from the predictive distribution at all the jumps. 2. If the N − 1th state does not belong to ΩY (tg+1 ) then there could be two indistinguishable cases: (a) The Nth event is sampled from the predictive distribution and the Nth sampled state belongs to ΩY (tg+1 ) ; thus, similar to the previous case all the hidden events are sampled from the predictive distribution. 67 (b) The Nth jump is sampled from the uniform distribution and the Nth state is sampled from the modified CRF. Denoting the modified predictive distribution by p∗ , for the proposal probability we have: q((θi , Ji )Ni=n+1 |Y (tg+1 ), Xg = (θi , Ji )ni=1 , µ0 ) N = 1[θN−1 ∈ ΩY (tg+1 ) ] ∏ i=n+1 (N) N + (1 − 1[θN−1 ∈ ΩY (tg+1 ) ])(( ∏ (N) i=n+1 N−1 + (N) p(θi , Ji |θ1:i−1 , J1:i−1 ) · (1 − FT P (∆; µθN , βθN )) (N) p(θi , Ji |θ1:i−1 , J1:i−1 ) · (1 − FT P (∆; µθN , βθN ))) p(θi , Ji |θ1:i−1 , J1:i−1 ) · p∗ (θN , JN |θN−1 , JN−1 ) ∏ i=n+1 (H) · (1 − FT P (∆; µθN N = 1[θN−1 ∈ ΩY (tg+1 ) ]( ∏ i=n+1 (i−1) (i−1) CRF(θi ; µ¯ θi−1 ) · T P(Ji ; µθi−1 N + (1 − 1[θN−1 ∈ ΩY (tg+1 ) ])( ∏ i=n+1 ∏ (i−1) i=n+1 (i−1) (N) (i−1) CRF(θi ; µ¯ θi−1 ) · T P(Ji ; µθi−1 (N−1,M) · MCRF(θN ; µ¯ θN−1 (N) )· (i−1) , βθi−1 ) 1 N−1 Ji tg+1 − ∑i=1 (N) · (1 − FT P (∆; µθN , βθN ))). The weights of the particles now have the form: wm,g+1 = (N) , βθi−1 ) · (1 − FT P (∆; µθN , βθN ))) (i−1) (i−1) (i−1) (N) (N) CRF(θi ; µ¯ θi−1 ) · T P(Ji ; µθi−1 , βθi−1 ) · (1 − FT P (∆; µθN , βθN )) N−1 + , β (N)θN ))) m |X p(Xg+1 m,g ) (6.12) m |Y (t q(Xg+1 g+1 ), Xm,g ) (\k) due to the fact that now we always have L(Y (tg+1 )|θI(tg+1 ) , Sθ 68 (k) + Sθ ,m,n ) = 1. Chapter 7 L IKELIHOOD M ODEL The likelihood model that we use in our experiments was introduced in Section 6.4.2. In this section we review the likelihood model and explain the changes that we make in the HGEP model in order to apply it in our experiments. As mentioned, we assume that the observation y is a deterministic function of the hidden state and H0 is a product of multinomial and uniform distributions, i.e. H0 = Unif × Mult. Thus, every hidden state sampled from H0 is in the form of a pair θ = (u, y): u is a unique identifier for every hidden state u ∈ [0, 1] which is sampled from the uniform distribution while the observations y take a value in the set of possible observations Σ. The set of possible observations consists of the finite number of symbols that we can observe. Given θ = (u, y) at a time step t, we have a Dirac delta for Lθ : L(u,y) = δy . Moreover, we add a Dirichlet distribution over the parameters of the multinomial distribution; we denote these parameters p1 , · · · , p|Σ| by p1:|Σ| . This likelihood model can therefore be thought as a Dirichlet-Multinomial-Dirac model. For our experiments we add another level of hierarchy to the HGEP. Informally, this level is responsible for counting the number of times a transition is made from hidden states emitting the same observation y ∈ Σ to another hidden state. This allows the model to share statistical strength between states emitting the same ob- 69 servation. The model can be represented similarly to Equation 5.1: p1:|Σ| ∼ Dir(α3 ) H0 = Mult(p1:|Σ| ) × Unif µ0 ∼ MGP(H0 , H0 ) iid µy |µ0 ∼ MGP(µ0 , γ0 ) (7.1) iid µθ |µy ∼ MGP(µy , β0 ) θN+1 X, {µθ }θ ∈Ω ∼ µ¯ θN JN+1 X, {µθ }θ ∈Ω ∼ Exp( µθN ). The predictive distribution for the above model can be obtained recursively. However, we need to introduce a new additional auxiliary variable Ky which we will explain momentarily. Similar to the HGEP predictive distribution (Equation 5.8) of Section 5.3 we have the following equations for the predictive distribution of the hidden states: µ¯ = G H0 + H¯ 0 G + H0 G + H0 µ¯ = Ky µ0 + µ¯ Ky + µ0 Ky + µ0 θN+1 |θ1 , · · · , θN , µy , µ¯ ∼ (7.2) µy FθN + µ¯ FθN + µy FθN + µy Figure 7.1 illustrates an example of sampling from this predictive distribution. Here α2 = H0 and α1 = µ0 . Note that in this model the hyperparameters are α1 , α2 , α3 , β0 , and γ0 . The bottom level of the hierarchy is the same as in Figure 5.1 of Section 5.3. From the bottom level we can move to a higher level DP with probability proportional to µy which is the normalization of the middle level random measure. For the middle level we define auxiliary variables Bn similar to An . That is, the 70 H || || ||G +α || G ||G +α || α ||K +α || A = 1, B = 1 K ||K +α || ||µ || ||F + µ || A = 0, B = 1 F ||F + µ || A = 0, B = 0 Figure 7.1: Sampling from the predictive distribution of the HGEP model used in the experiments event Bn = 1 means we have moved from the bottom level to the middle level of the hierarchy at transition n. Furthermore, at this level of hierarchy we augment the sufficient statistics with the empirical counts for the number of transitions already made from a state that emits the observation y; that is, Ky = ∑Nn=2 1[L(θn−1 ,y) = 1] Bn δθn . As at the bottom level, we have two possibilities: to choose the state at this level with probability proportional to Ky or to default to another DP with probability proportional to α1 . As in Section 5.3, we use the auxiliary variable An to determine when we default to the top level DP (note that An = 1 implies Bn = 1). Finally, the top level is the same as the top of the hierarchy in the model of Section 5.3. At this level, we can have a transition to an entirely new state with probability proportional to α2 (i.e. H0 ). We generate a unique identifier for this new state by sampling from Uni f [0, 1]. Furthermore, to decide on the observation for this new state, we use a multinomial distribution over the possible observations Σ. 71 Chapter 8 E XPERIMENTS In this section we present the results of our experiments. First, we demonstrate the behavior of state trajectories and sojourn times sampled from the prior to give a qualitative idea of the range of time series of hidden states that can be captured by our model. Next, we evaluate our model quantitatively by applying it to held-out tasks for three different datasets: synthetic, multiple sclerosis (MS) patients, and RNA evolutionary. Each dataset consists of multiple time series with observations at different time points. Note that in each dataset we have finite number of possible observations which we call “symbols”. A sample dataset with symbols A, C, T and G having the same structure as our three datasets is provided in Table 8.1. We will explain each of these datasets in detail later in this chapter. Furthermore, we apply the model to the problem of comparing disease progression on the two arms of a clinical trial in MS; evaluating the effectiveness of a drug is a typical problem in MS clinical research. In order to compare the arms of a clinical trial we need a model for MS disease progression; however, as mentioned in Chapter 1, MS can be considered as a disease with complex latent structure. By using an HGEP as the model for the latent structure we compare the disease progression in the treatment and the placebo arms. Finally, we compare the performance of the two proposal distributions we introduced in Section 6.4. 72 Sequence 1 Time points Observations Sequence 2 Time points Observations 0 A 1 A 1.5 T 2.3 C 1.1 T 1 C 3.1 T .. . 4 G Sequence M Time points Observations 0 G 2 C 2.5 T 4.3 A 4.1 C 7 A Table 8.1: A sample dataset with symbols A, T, C and G having the same structure as datasets used in the experiments 8.1 Qualitative Analysis of the Prior The behavior of the prior can change as a function of the three parameters of the HGEP model: β0 , H0 and γ0 . We sampled a sequence of length T = 800 and present the state-time plots. At least four types of behavior can be distinguished: 1. short sojourn times and high volatility of states (Figure 8.1(a)) 2. long sojourn times with low volatility (Figure 8.1(b)) 3. many new states with short sojourn times (Figure 8.1(c)) 4. high tendency to create new states, long sojourn times (Figure 8.1(d)) The concentration parameter H0 has the same interpretation as the concentration parameters of HDPs. In addition, in conjunction with the rate parameter β0 , it controls the waiting times. These behaviors show that the GEP process can describe various kinds of stochastic processes. For all of our experiments we chose (arbitrarily) values of 1 for all hyper-parameters. 8.2 Quantitative Analysis In this section, we use the likelihood model introduced in Chapter 7 for discrete observations to evaluate our method on three held-out tasks. We considered three 73 12 10 25 4 6 State 8 20 15 State 10 2 5 0 0 200 400 600 800 0 200 Time 600 800 600 800 Time (b) 100 0 0 100 50 200 State 300 150 400 (a) State 400 0 200 400 600 800 0 200 Time 400 Time (c) (d) Figure 8.1: Qualitative behavior of the prior: (a) β0 = 10, H0 = 5, γ0 = 100, (b) β0 = 100, H0 = 5, γ0 = 10, (c) β0 = 10, H0 = 500, γ0 = 10, (d) β0 = 1, H0 = 5000, γ0 = 1 evaluation datasets obtained by holding out each observed datapoint with a 10% probability (see Table 8.2 for the properties of the datasets). We then reconstructed the observations at these held-out times, and measured the mean error. The time series were scanned for 1000 iterations of the PMCMC algorithm; that is, we carried the PMCMC algorithm for each dataset for 1000 iterations. For HGEP, at each iteration of the PMCMC algorithm, we sample a latent path from an approximation to the posterior p(X (k) |Y (k) ) for every sequence k. Hence, for each held-out time of that sequence we have a sampled latent state. Based on our defined likelihood model, we know the observation the latent state emits. We reconstruct the held-out observation using this emitted observation for each iteration; we count the number of times each symbol is reconstructed up to that iteration and assign the symbol with the highest count to that held-out time. In other words, reconstruction is done by using the Bayes estimator approximated 74 Datasets Name Synthetic MS RNA Results (mean error) # sequences # datapoints # heldout # characters Baseline EM HGEP 1000 72 1000 10000 384 6167 878 31 508 4 3 4 0.703 0.516 0.648 0.404 0.355 0.596 0.446 0.277 0.426 Table 8.2: Summary statistics and mean error results for the experiments. All experiments were repeated 5 times. from 1000 posterior samples (one after each scan through all the sequences). We repeated all experiments 5 times with different random seeds which control the randomness of sampling from the posterior. To compute the error for each iteration, we count the number of correct estimates and divide it by the total number of held-out observations; we only consider the error for the 1000th iteration as the error for that repetition of the experiment. The mean error is the average computed error for the 5 repetitions. We compared against the standard maximum likelihood rate matrix estimator, estimated by the EM described in [22]. This method also estimates the transitions between two observations. As a result, we can estimate the whole trajectory for a sequence given only observations at finite time points. We reconstruct each heldout observation from this estimated trajectory, as we know the state of the sequence at every time point. We also report in Table 8.2 the mean error for a simple estimate where we reconstruct each held-out observation using the most common symbol in the whole dataset; we call this estimate the baseline estimate. Figure 8.2 shows the mean error as a function of the number of iterations for all three datasets. 8.2.1 Synthetic data experiment We used the Erd¨os-R´enyi model [11], a model for generating random graphs, to generate a random rate matrix. The model is defined by two parameters: the number of nodes and the probability of having an edge between two nodes in the graph. We used the probability parameter 1/5 and 10 nodes to generate a random graph. This random graph corresponds to a matrix of size 10 × 10 with elements equal to 1 where we have an edge between two nodes in the graph. 75 RNA ● 0.7 ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 200 400 600 800 0.40 0.5 0.4 ● ● 0.35 0.50 0.6 HGEP EM ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● 0.45 HGEP EM 0.60 ● MS 0 200 400 600 800 0.25 0.8 Synthetic HGEP EM ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 200 400 600 800 Figure 8.2: Mean reconstruction error on the held-out data as a function of the number of Gibbs scans for the experiments. The non-diagonal zeros in this matrix correspond to entries with Unif(0, 1/100) rate, and the non-diagonal ones in this matrix correspond to entries with Unif(0, 1/2+ 1/100) rate. The diagonal entries were filled with minus the value of the sum of the non-diagonal ones, and each of the 10 states is set to deterministically emit one of the symbols A, B, C or D at random. To create the synthetic dataset, we generate 1000 sequences with time length 10 from the rate matrix using the Doob-Gillespie algorithm (described in Section 2.1). Both HGEP and the EM-learned maximum likelihood outperformed the baseline. In contrast to the next two tasks, the EM approach slightly outperformed the HGEP model here. We believe this is because the synthetic data was not sufficiently rich to highlight the advantages of HGEPs. 8.2.2 RNA evolution modeling MJPs can be used for describing the evolution of nucleotide sequences in a phylogenetic tree that shows the evolutionary relationships among different species. Each node of the tree corresponds to the nucleotide sequence of a species. While nucleotide sequences can be observed at the leaves of the tree (modern species), the information on the branch length (time since divergence from another species) and the topology of the tree cannot be observed. We used the dataset from Cannone et al. [6] containing aligned 16S ribosomal RNA of species from the three domains of life. Aligned RNA sequences are 76 sequences arranged based on their similarities. When two RNA sequences are aligned, the relation between two nucleotides at the same position can only be an identity, a mismatch or a gap. For instance, if we assume sequence AGGC evolves to AA-C and these sequences are aligned, then there are two similar sites (1 and 4), one mismatch at the second site and one deletion at the third site. As a preprocessing step, a tree was constructed on a random subset of 30 species using PhyML [20], and the nucleotides at speciation events (i.e., branching events in the tree) were reconstructed using a K2P rate matrix [24] and the sum-product algorithm on trees. We then considered the time series consisting of paths from one modern leaf to the root for each site. Figure 8.3(a) illustrates a sample path in the reconstructed phylogenetic tree. In this figure, node r is the root, nodes e, f , g, h are the leaves, and nodes a, b, c, d are the internal nodes reconstructed by the explained procedure. At each of these nodes there is a nucleotide sequence. However, we study each site of these sequences independently where the observed state is the nucleotide; thus, there are 4 possible observations (i.e. A, T, C, G) (see Figure 8.3(b)). For the RNA dataset, we considered 1000 sequences of nucleotides from the root to the modern leaves in the reconstructed tree. A sample sequence is denoted by a dashed rectangle in Figure 8.3(b). The number of datapoints and heldouts are indicated in Table 8.2. For this dataset, the task is to reconstruct held-out nucleotides using only the data in the paths. Again, both HGEP and EM outperformed the baseline, and our model outperformed EM with a relative error reduction of 29%. 8.2.3 MS disease progression The dataset we used for this experiment is obtained from a phase III clinical trial of a drug given to MS patients. The dataset tracks the progression of MS (i.e., EDSS scores) in 72 patients of the placebo arm of the trial over 3 years. This is the subsample of the original cohort of patients who enrolled early to the trial and completed 3 magnetic resonance imaging scans [39]. As in Mandel [31], the observed EDSS score of a patient at a given time is binned into three categories: 1 (EDSS ≤ 1.5); 2 (EDSS = 2, 2.5); and 3 for (EDSS ≥ 3). Here, we are assuming these 77 time r ATCCG a b CTACG CTAC G CGACT c AGTCT CGGCT AT C C G C GACT AGTCT d CGGCT TGGAT e f g h T G GAT (a) (b) Figure 8.3: (a) A sample path in a reconstructed phylogenetic tree, (b) Evolution of the sequences for the reconstructed sample path categorical observations can be described by an HMM with underlying, possibly, infinite hidden states that are not directly observed. In the held-out task experiment, both HGEP and EM outperformed the baseline by a large margin. The HGEP model outperformed EM with a relative error reduction of 22%. 8.2.4 Application in estimating disease progression in MS In this section we apply our model to the problem of comparing the disease progression in treatment and placebo arms of a clinical trial. We use the data of the placebo arm data from the previous section along with the data from the treatment arm of the same trial. Several tools are currently used in MS to describe or predict the course of the disease. Examples include expected time to a certain EDSS score [32], time to confirmed progression (i.e. reaching a certain EDSS and staying there 78 E.g. Site 3 evolves independently from all other sites, according to a Markov chain 0.8 0.6 Prob. 0.0 0.2 0.4 0.6 0.4 0.0 0.2 Prob. 0.8 1.0 Treatment Arm 1.0 Placebo Arm 1 2 3 4 5 6 1 Time in 6 months 2 3 4 5 6 Time in 6 months Figure 8.4: Mean predictive probability of reaching category 3 starting in category 1 (95% credible interval is provided by the dashed lines) for 6 months or longer), and plots for the probability of confirmed progression against time [31]. We focus on the plots of probability of progression against time; these plots show the (mean predictive) probability of reaching a particular EDSS state against time, for patients who start at the same baseline EDSS state. To construct these plots, after each of the 1000 PMCMC iterations, we sample (without rejection) 100 times from the predictive distribution of the time to the particular state conditioned on the starting state. Then, using the 100 samples, for each PMCMC iteration we estimate the probability of having hit a specific EDSS state at a time point or earlier. We use 11 time points, ranging from 1 to 6 months in 15 days steps. Finally, we compute the mean of the probabilities at each of the 11 time points to summarize the results. As mentioned observed EDSS scores are binned into three categories. Figure 8.4 shows the plot of the predicted probability of reaching category 3 starting from category 1 for the both arms of a clinical trial, along with their 95% credible intervals. The plots depict that, for instance, after 3 years the mean predicted probability of reaching category 3 for the treatment and the placebo arms are 0.37 (95% C.I.: 0.12-0.61) and 0.55 ((95% C.I.: 0.22-0.73)), correspondingly. 79 200 400 600 800 0 ● 0.85 0.80 200 400 600 800 ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ● ●● ● ● ● ●● ● ●●● ●● ● ● ● ● ●●●● ● ●● ●●● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ●● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ●●●●● ● ●● ● ●● ●●● ●● ● ●● ●● ● ●● ●● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ●● ● ●● ● ● ●● ● ●●●●●●● ●● ● ●●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●● ● ●●● ●● ● ●● ● ●● ●● ●● ●● ●● ●●● ● ● ●● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ●●●● ●●● ●● ●● ●●● ●●● ●●● ●●●● ●●●●●● ●● ● ● ●● ●● ● ●● ●● ●●●●● ●●●● ● ● ● ●● ● ● ● ● ● ●●● ●● ●● ●● ● ● ● ●●● ●● ● ●●● ●● ●●●●●● ●● ●● ● ●●● ●● ● ●● ●● ●●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●●●● ●● ●●● ● ● ●● ● ● ● ● ●● ●●●●● ● ● ●●●● ● ●● ● ●● ●● ● ● ● ● ● ●●● ● ● ●●●● ●●● ●● ● ●● ● ● ● ●●●● ● ● ●●● ● ● ●● ●●● ●●●● ●● ●●● ●● ● ●● ● ●● ●●●● ● ● ● ● ●● ●● ● ● ● ● ●● ● ● ● ●● ● ●●● ●● ● ● ● ● ●● ● ● ●●●●● ●● ●●● ●●●●●●● ●● ● ● ●● ●● ●●● ●●●●●●● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●●● ● ● ●●● ●●●● ● ● ●● ●●● ●● ● ●● ●● ● ●●● ●● ● ●●●●● ● ● ● ● ● ●●● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ●●● ● ●●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ●● ●●●●● ● ● ●● ●●●● ● ● ● ● ●●● ● ●● ● ● ●● ● ●●●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 (a) 0.85 0.60 ● ● 0.50 ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ●● ●● ● ●● ● ● ● ●●●●● ●● ● ● ● ●●●● ● ● ●● ●● ●● ● ● ●● ● ● ● ●● ●● ●● ● ●● ● ●● ● ● ●●●● ●● ● ●● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ●● ● ● ●●●●● ● ●●● ● ● ●●●●● ● ● ● ● ● ●● ●● ●● ● ●● ● ● ●● ●● ● ● ● ● ● ●● ● ●● ● ●● ●● ● ●●● ● ●● ●● ● ●● ●●● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ●●● ● ●●●● ● ●● ● ●● ● ●●● ● ● ● ● ● ●● ● ● ● ●●● ●● ●●●●● ●● ●● ●●●● ● ● ●●● ● ●●●● ● ●● ● ● ●● ●● ● ●●● ●● ●● ● ●● ● ●●● ●●● ●● ●●● ● ●● ● ● ● ●● ● ●● ● ●●●● ● ● ● ●● ● ●● ● ●●● ●●● ●● ●● ● ● ● ●● ● ● ●● ● ●● ● ●● ● ● ●● ●●● ● ● ● ●● ●●●● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●●● ●● ● ●●● ● ● ●● ●●● ●●● ● ● ●● ●● ● ● ●●● ● ● ●●● ●● ●● ●●● ●● ● ● ●● ●●●● ● ● ●● ● ● ● ●●● ●● ●● ●● ● ● ●● ● ● ● ●●● ●● ● ● ●●● ● ● ● ● ●●● ● ● ● ●●●●● ● ●●● ● ●● ●●● ●●●● ● ●● ●● ●●● ● ● ●● ● ●● ●● ● ●●●●●● ● ●●● ●● ●● ●●● ● ● ● ● ● ● ●●● ●● ● ● ●● ●● ●● ● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●●● ●● ● ● ● ●●●● ● ● ●● ●●● ●●●● ● ●● ●● ●● ● ●● ●●● ● ●●●● ●● ●● ●● ● ● ●●● ● ●● ● ● ●●●● ● ● ●● ●● ●● ● ● ●●● ●●● ● ●● ● ● ● ●●● ● ● ●●● ● ●● ● ● ●●● ●● ● ● ●● ●● ● ●●●●● ● ● ● ●● ●●● ● ●● ● ● ●● ●●● ● ● ●● ●●●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ●● ●● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● 0.50 0.60 ● ● ● 0 0.90 ● ● 0.90 ● 0.70 0.70 ● 200 400 600 800 0.80 0 200 400 600 800 (b) Figure 8.5: Comparison of inference algorithms for the RNA dataset: mean acceptance rate per iteration for a proposal using 100 particles as a function of MCMC iteration using (a) predictive distribution (b) modified predictive distribution in the SMC proposal. 8.3 Comparison of Inference Algorithms In this section to analyze the effect of using the two different proposal distributions described in Section 6.4 in the PMCMC algorithm for our experiments. The most important effect is on the acceptance rate of the samples in the PMCMC algorithm. Figure 8.5 shows that the acceptance rate in the “improved” proposal distribution is greater than that for the original proposal distribution. We have only provided the results for the RNA dataset; similar results hold for other datasets. Moreover, for the “improved” proposal we always accept all the particles in the SMC algorithm so we have zero failure rate for the particles. Recall that the weight of each particle is computed based on the likelihood of the observation. For our defined likelihood the likelihood can only be 0 or 1. Without the “improved” proposal we can have many zero weighted particles in case of long sequences (i.e., high failure rate for the particles) due to the fact that for long sequences, we have higher chance of having at least one zero likelihood value. 80 Chapter 9 C ONCLUSION In this thesis, we introduced a model for MJPs with (potentially) infinite state space; we showed how this can be used as a prior in a Bayesian approach to the analysis of multiple sequences of data observed at discrete time points. We reviewed some basic notions and closely related models, built our model (GEP) based on the gamma process, and extended it to a hierarchical model (HGEP). We showed the GEP has some attractive properties such as conjugacy and closed form predictive distributions. Finally, we applied the model to some real world datasets; the results showed the model outperformed the maximum likelihood rate matrix estimator despite using a simple likelihood model (i.e., a Dirac delta function). However, in our model, as in many nonparametric Bayesian models, the data likelihood (i.e., p(Y )) is nontrivial and hard to compute. As a result, performing Bayesian model comparison and computing the Bayes factor, which requires computing the data likelihood, is a challenging problem in our model. An example of this problem is the experiment of Section 8.2.4 where we want to compare the two arms of the trial; there may be some situations in which the presence of a treatment effect is less obvious from the plots and we need a statistical test for testing the treatment effect. Moreover, our inference algorithm is computationally intensive; we may improve it in terms of computational efficiency by possibly using a slice sampler. We are currently working on some of the applications and generalizations of our model. Application of the method in models with continuous state space for 81 the observations, proposing more efficient inference algorithms, calculating the Bayes factor for estimating the treatment effect in clinical trials, and incorporating covariates in the hierarchical GEP are issues that we are plan to address in future research. We are also interested in adding a layer of decision making to our model. This can make the model suitable for sequential decision making problems where we have partial observability of the data and the complexity of the latent structure is unknown. 82 Bibliography [1] P. S. Albert. A Markov model for sequences of ordinal data from a relapsing-remitting disease. Biometrics, 50(1):51–60, 1994. → pages 2 [2] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010. → pages 45, 48, 49, 50, 56, 57 [3] M. Beal, Z. Ghahramani, and C. Rasmussen. The infinite hidden Markov model. Advances in Neural Information Processing Systems, 14:577–584, 2002. → pages 3, 24 [4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003. → pages 18 [5] A. Bureau, S. Shiboski, and J. P. Hughes. Applications of continuous time hidden Markov models to the study of misclassified disease outcomes. Statistics in Medicine, 22(3):441–462, 2003. → pages 1, 2 [6] J. Cannone, S. Subramanian, M. Schnare, J. Collett, L. D’Souza, Y. Du, B. Feng, N. Lin, L. Madabusi, K. M¨uller, et al. The comparative RNA web (CRW) site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3 (1):2, 2002. → pages 76 [7] Y. Chung and D. Dunson. The local Dirichlet process. Annals of the Institute of Statistical Mathematics, 63(1):59–80, 2011. → pages 28 [8] J. L. Doob. Markoff chains–denumerable case. Transactions of the American Mathematical Society, 58(3):455–473, 1945. → pages 7 [9] D. Dufresne. G distributions and the beta-gamma algebra. Electronic Journal of Probability, 15(71):2163–2199, 2010. → pages 33 83 [10] D. Dunson. Bayesian dynamic modeling of latent trait distributions. Biostatistics, 7(4):551–568, 2006. → pages 27 [11] P. Erd˝os and A. R´enyi. On the evolution of random graphs. Evolution, 5(1): 17–61, 1960. → pages 75 [12] T. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973. → pages 13, 14 [13] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pages 312–319. ACM, 2008. → pages 3, 25, 26 [14] A. Gelman and D. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992. → pages 48 [15] J. Geweke. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4, pages 169–193. Oxford University Press, 1992. → pages 48 [16] J. Griffin. The Ornstein-Uhlenbeck Dirichlet process and other time-varying processes for Bayesian nonparametric inference. Journal of Statistical Planning and Inference, 141(11):3648–3664, 2011. → pages 28 [17] J. Griffin and M. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101(473):179–194, 2006. → pages 3, 27 [18] J. Griffin and M. Steel. Stick-breaking autoregressive processes. Journal of Econometrics, 162(2):383–396, 2011. → pages 3, 27 [19] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001. → pages 46 [20] S. Guindon and O. Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52 (5):696–704, 2003. → pages 77 [21] P. Heidelberger and P. Welch. Simulation run length control in the presence of an initial transient. Operations Research, 31(6):1109–1144, 1983. → pages 48 [22] A. Hobolth and J. Jensen. Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005. → pages 75 84 [23] M. Jordan. Dirichlet processes, chinese restaurant processes and all that. In Tutorial presentation at the NIPS Conference. URL http://www.cs.berkeley.edu/∼jordan/nips-tutorial05.ps. → pages 15 [24] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111–120, 1980. → pages 77 [25] J. Kingman. Poisson Processes. Oxford University Press, 1993. → pages 8, 9 [26] J. Kivinen, E. Sudderth, and M. Jordan. Learning multiscale representations of natural scenes using Dirichlet processes. In IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. → pages 25 [27] D. Kroese, T. Taimre, and Z. Botev. Handbook of Monte Carlo Methods. Wiley, 2011. → pages 6 [28] J. Liu. Monte Carlo Strategies in Scientific Computing. Springer Verlag, 2008. → pages 45 [29] S. MacEachern. Dependent nonparametric processes. In Section on Bayesian Statistical Science, American Statistical Association, pages 50–55, 1999. → pages 26, 27 [30] R. MacKay. Estimating the order of a hidden Markov model. Canadian Journal of Statistics, 30(4):573–589, 2002. → pages 2 [31] M. Mandel. Estimating disease progression using panel data. Biostatistics, 11(2):304–316, 2010. → pages 1, 2, 77, 79 [32] M. Mandel and R. A. Betensky. Estimating time-to-event from longitudinal ordinal data using random-effects Markov models: application to multiple sclerosis progression. Biostatistics, 9(4):750–764, 2008. → pages 2, 78 [33] R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000. → pages 45 [34] R. Nelson. Probability, Stochastic Processes, and Queueing Theory: The Mathematics of Computer Performance Modelling. Springer-Verlag, 1995. → pages 32 [35] L. Nieto-Barajas, P. M¨uller, Y. Ji, Y. Lu, and G. Mills. Time series dependent Dirichlet process. Preprint, 2008. → pages 28 85 [36] J. Pitman. Combinatorial Stochastic Processes. Springer-Verlag, 2006. → pages 13 [37] A. Raftery and S. Lewis. Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7(4):493–497, 1992. → pages 48 [38] A. Rodriguez, D. Dunson, and A. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154, 2008. → pages 28 [39] R. A. Rudick, E. Fisher, J. C. Lee, J. Simon, and L. Jacobs. Use of the brain parenchymal fraction to measure whole brain atrophy in relapsing-remitting multiple sclerosis. Neurology, 53(8):1698–1704, 1999. → pages 77 [40] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994. → pages 13, 14 [41] A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology, 11(2-3): 413–428, 2004. → pages 1 [42] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006. → pages 17, 19, 21, 22, 23, 24 [43] A. C. Titman and L. D. Sharples. Semi-Markov models with phase-type sojourn distributions. Biometrics, 66(3):742–752, 2010. → pages 2 [44] W. Wei, B. Wang, and D. Towsley. Continuous-time hidden Markov models for network performance evaluation. Performance Evaluation, 49:129–146, 2002. → pages 1 [45] E. Xing and K. Sohn. Hidden Markov Dirichlet process: Modeling genetic inference in open ancestral space. Bayesian Analysis, 2(3):501–528, 2007. → pages 25 86
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Nonparametric Bayesian models for Markov jump processes
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Nonparametric Bayesian models for Markov jump processes Saeedi, Ardavan 2012
pdf
Page Metadata
Item Metadata
Title | Nonparametric Bayesian models for Markov jump processes |
Creator |
Saeedi, Ardavan |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | Markov jump processes (MJPs) have been used as models in various fields such as disease progression, phylogenetic trees, and communication networks. The main motivation behind this thesis is the application of MJPs to data modeled as having complex latent structure. In this thesis we propose a nonparametric prior, the gamma-exponential process (GEP), over MJPs. Nonparametric Bayesian models have recently attracted much attention in the statistics community, due to their flexibility, adaptability, and usefulness in analyzing complex real world datasets. The GEP is a prior over infinite rate matrices which characterize an MJP; this prior can be used in Bayesian models where an MJP is imposed on the data but the number of states of the MJP is unknown in advance. We show that the GEP model we propose has some attractive properties such as conjugacy and simple closed-form predictive distributions. We also introduce the hierarchical version of the GEP model; sharing statistical strength can be considered as the main motivation behind the hierarchical model. We show that our hierarchical model admits efficient inference algorithms. We introduce two inference algorithms: 1) a “basic” particle Markov chain Monte Carlo (PMCMC) algorithm which is an MCMC algorithm with sequences proposed by a sequential Monte Carlo (SMC) algorithm; 2) a modified version of this PMCPC algorithm with an “improved” SMC proposal. Finally, we demonstrate the algorithms on the problems of estimating disease progression in multiple sclerosis and RNA evolutionary modeling. In both domains, we found that our model outperformed the standard rate matrix estimation approach. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-08-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073015 |
URI | http://hdl.handle.net/2429/42963 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2012_fall_saeedi_ardavan.pdf [ 2.43MB ]
- Metadata
- JSON: 24-1.0073015.json
- JSON-LD: 24-1.0073015-ld.json
- RDF/XML (Pretty): 24-1.0073015-rdf.xml
- RDF/JSON: 24-1.0073015-rdf.json
- Turtle: 24-1.0073015-turtle.txt
- N-Triples: 24-1.0073015-rdf-ntriples.txt
- Original Record: 24-1.0073015-source.json
- Full Text
- 24-1.0073015-fulltext.txt
- Citation
- 24-1.0073015.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0073015/manifest