Bayesian Inference on Change Point Problems

by

Xiang Xuan

B.Sc., The University of British Columbia, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia

March, 2007

© Xiang Xuan 2007

Abstract

Change point problems concern the detection of heterogeneity in temporal or spatial data. They have applications in many areas, such as DNA sequences, financial time series, and signal processing. A large number of techniques have been proposed to tackle these problems. One of the most difficult issues is estimating the number of change points. As in other examples of model selection, the Bayesian approach is particularly appealing, since it automatically captures a trade-off between model complexity (the number of change points) and model fit. It also allows one to express uncertainty about the number and location of change points.

In a series of papers [13, 14, 16], Fearnhead developed efficient dynamic programming algorithms for exactly computing the posterior over the number and location of change points in one dimensional series. This improved upon earlier approaches, such as [12], which relied on reversible jump MCMC. We extend Fearnhead's algorithms to the case of multidimensional series. This allows us to detect changes in correlation structure, as well as changes in mean, variance, etc. We also model the correlation structure using Gaussian graphical models. This allows us to estimate the changing topology of dependencies among series, in addition to detecting change points. This is particularly useful in high dimensional cases because of sparsity.
Table of Contents

Abstract
Table of Contents
List of Figures
Acknowledgements
1 Introduction
  1.1 Problem Statement
  1.2 Related Works
    1.2.1 Hidden Markov Models
    1.2.2 Reversible Jump MCMC
  1.3 Contribution
  1.4 Thesis Outline
2 One Dimensional Time Series
  2.1 Prior on Change Point Location and Number
  2.2 Likelihood Functions for Data in Each Partition
    2.2.1 Linear Regression Models
    2.2.2 Poisson-Gamma Models
    2.2.3 Exponential-Gamma Models
  2.3 Basis Functions
    2.3.1 Polynomial Basis
    2.3.2 Autoregressive Basis
  2.4 Offline Algorithms
    2.4.1 Backward Recursion
    2.4.2 Forward Simulation
  2.5 Online Algorithms
    2.5.1 Forward Recursion
    2.5.2 Backward Simulation
    2.5.3 Viterbi Algorithms
    2.5.4 Approximate Algorithm
  2.6 Implementation Issues
  2.7 Experimental Results
    2.7.1 Synthetic Data
    2.7.2 British Coal Mining Disaster Data
    2.7.3 Tiger Woods Data
  2.8 Choices of Hyperparameters
3 Multiple Dimensional Time Series
  3.1 Independent Sequences
  3.2 Dependent Sequences
    3.2.1 A Motivating Example
    3.2.2 Multivariate Linear Regression Models
  3.3 Gaussian Graphical Models
    3.3.1 Structure Representation
    3.3.2 Sliding Windows and Structure Learning Methods
    3.3.3 Likelihood Functions
  3.4 Experimental Results
    3.4.1 Synthetic Data
    3.4.2 U.S. Portfolios Data
    3.4.3 Honey Bee Dance Data
4 Conclusion and Future Work
Bibliography

List of Figures

1.1 Examples of possible changes over consecutive segments
2.1 Graphical model of linear regression models
2.2 Graphical model of Poisson-Gamma models
2.3 Results of synthetic data Blocks
2.4 Results of synthetic data AR1
2.5 Results of coal mining disaster data
2.6 Results of Tiger Woods data
2.7 Results of synthetic data Blocks under different values of hyperparameter λ
2.8 Results of synthetic data Blocks under different values of hyperparameter γ
2.9 Results of synthetic data AR1 under different values of hyperparameter γ
3.1 An example to show the importance of modeling correlation structures
3.2 Graphical model of multivariate linear regression models
3.3 An example of graphical representation of Gaussian graphical models
3.4 Graphical representation for independent model
3.5 Graphical representation for fully dependent model
3.6 An example to show the sliding window method
3.7 Results of synthetic data 2D
3.8 Results of synthetic data 10D
3.9 Candidate list of graphs generated by sliding windows in 10D data
3.10 Results of synthetic data 20D
3.11 Results of U.S. portfolios data of 5 industries
3.12 Results of U.S. portfolios data of 30 industries
3.13 Results of honey bee dance sequence 4
3.14 Results of honey bee dance sequence 6

Acknowledgements

I would like to thank all the people who gave me help and support throughout my degree. First of all, I would like to thank my supervisor, Professor Kevin Murphy, for his constant encouragement and rewarding guidance throughout this thesis work. Kevin introduced me to this interesting problem. I am grateful to Kevin for giving me a Bayesian education and directing me to research in machine learning. I am amazed by Kevin's broad knowledge and his many quick and brilliant ideas.
Secondly, I would like to thank Professor Nando de Freitas for dedicating his time and effort to reviewing my thesis. Thirdly, I would like to extend my appreciation to my colleagues for their friendship and help, especially the following people: Sohrab Shah, Wei-Lwun Lu and Chao Yan. Last but not least, I would like to thank my parents and my wife for their endless love and support.

XIANG XUAN

The University of British Columbia
March 2007

Chapter 1

Introduction

1.1 Problem Statement

Change point problems concern the detection of heterogeneity in temporal or spatial data. Given a sequence of data over time or space, change points split the data into a set of disjoint segments. It is then assumed that the data from the same segment come from the same model. If we assume

  data = model + noise

then the data on two successive segments can differ in the following ways:

• different models (model types or model orders)
• same model with different parameters
• same model with different noise levels
• in multiple sequences, different correlation among sequences

Figure 1.1 shows four examples of changes over successive segments. In all four examples, there is one change point at location 50 (black vertical solid line) which separates 100 observations into two segments. The top left panel shows an example of different model orders: the 1st segment is a 2nd order autoregressive model and the 2nd segment is a 4th order autoregressive model. The top right panel shows an example of the same model with different parameters: both segments are linear models, but the 1st segment has a negative slope while the 2nd segment has a positive slope. The bottom left panel shows an example of the same model with different noise levels: both segments are constant models with mean 0, but the noise level (the standard deviation) of the 2nd segment is three times as large as that of the 1st segment.
The bottom right panel is an example of different correlation between two series: the two series are positively correlated in the 1st segment, but negatively correlated in the 2nd segment.

Figure 1.1: Examples of possible changes over successive segments. The top left panel shows changes in AR model order. The top right panel shows changes in parameters. The bottom left panel shows changes in noise level. The bottom right panel shows changes in correlation between two series.

The aim of change point problems is to make inference about the number and location of change points.

1.2 Related Works

Much work has been done by researchers in different areas. This thesis is an extension of Fearnhead's work, which is a special case of the Product Partition Models (PPM) (defined later) and uses dynamic programming algorithms. Here I would like to mention two approaches that are closely related to our work. The first approach is based on a different model, the Hidden Markov model (HMM), and the second approach uses a different algorithm, reversible jump Markov chain Monte Carlo.

1.2.1 Hidden Markov Models

An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with hidden states. In a regular Markov model, the state is directly visible. In an HMM, the state is not directly visible, but variables influenced by the hidden states are visible. The challenge is to determine the hidden states from the observed variables. In change point problems, we view a change as being due to a change in the hidden state. Hence, by inference on all hidden states, we can segment the data implicitly. In an HMM, we need to fix the number of hidden states, and often the number of states cannot be too large.

1.2.2 Reversible Jump MCMC

Reversible jump is an MCMC algorithm which has been extensively used in change point problems. It starts with an initial set of change points.
At each step, it can make the following three kinds of moves:

• a death move, to delete a change point or merge two consecutive segments;
• a birth move, to add a new change point or split a segment into two;
• an update move, to shift the position of a change point.

At each step, we accept a move based on a probability calculated by the Metropolis-Hastings algorithm. We run until the chain converges. Then we can find the posterior probability of the number and positions of change points. The advantage of reversible jump MCMC is that it can handle a large family of distributions even when we only know the distribution up to a normalizing constant. The disadvantages are that it is slow and that it is difficult to diagnose convergence of the chain.

1.3 Contribution

The contributions of this thesis are:

• We extend Fearnhead's work to the case of multidimensional series, which allows us to detect changes in correlation structure, as well as changes in mean, variance, etc.
• We further model the correlation structure using Gaussian graphical models, which allows us to estimate the changing topology of dependencies among series, in addition to detecting change points.
• We illustrate the algorithms by applying them to some synthetic and real data sets.

1.4 Thesis Outline

The remainder of the thesis is organized as follows. In Chapter 2, we review Fearnhead's work on one dimensional series in detail, and provide experimental results on some synthetic and real data sets. In Chapter 3, we show how to extend Fearnhead's work to multidimensional series and how to use Gaussian graphical models to model and learn correlation structures. We also provide experimental results on some synthetic and real data sets. Finally, our conclusions are stated in Chapter 4, along with a number of suggestions for future work.
Chapter 2

One Dimensional Time Series

Let's consider change point problems with the following conditional independence property: given the position of a change point, the data before that change point is independent of the data after the change point. These models are exactly the Product Partition Models (PPM) in one dimension [3, 4, 11]. Here a dataset Y_{1:N} is partitioned into K partitions, where the number of partitions K is unknown. The data Y_k on the k-th segment is assumed to be independent of the data on the other segments, given a set of parameters θ_k for that partition. Hence, given the segmentation (partition) S_{1:K}, we can write

  P(Y_{1:N} | K, S_{1:K}, θ_{1:K}) = ∏_{k=1}^{K} P(Y_k | θ_k)   (2.1)

Fearnhead [13, 14, 16] proposed three algorithms (offline, online exact and online approximate) to solve change point problems under PPM. They all first calculate the joint posterior distribution of the number and positions of change points, P(K, S_{1:K} | Y_{1:N}), using dynamic programming, then sample change points from this posterior distribution by perfect sampling [22]:

  K, S_{1:K} ~ P(K, S_{1:K} | Y_{1:N})   (2.2)

After sampling change points, making inference on the models and their parameters over segments is straightforward. The offline algorithm and the online exact algorithm run in O(N²), and the online approximate algorithm runs in O(N).

2.1 Prior on Change Point Location and Number

To express the uncertainty in the number and positions of change points, we put the following prior distribution on change points. We assume that change points occur at discrete time points, and we model the change point positions by a Markov process. Let the transition probabilities of this Markov process be

  P(next change point at t | change point at s) = g(|t − s|)   (2.3)

That is, (2.3) depends only on the distance between the two change points.
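To make this prior concrete, the short sketch below (our own illustration, not code from the thesis; `sample_change_points` is a hypothetical helper) simulates change point positions from the Markov process (2.3), taking g() to be a Geometric distribution with parameter λ, the running example later in this chapter, and checks empirically that each interior position is then a change point with probability λ.

```python
import numpy as np

def sample_change_points(N, lam, rng):
    """Draw one set of change point positions from the prior (2.3)
    with Geometric g(d) = (1 - lam)^(d - 1) * lam.  Position 0 is a
    change point by default; positions 1..N-1 may be change points."""
    points, pos = [], 0
    while True:
        pos += rng.geometric(lam)   # distance to the next change point
        if pos >= N:
            return points
        points.append(pos)

rng = np.random.default_rng(0)
N, lam, trials = 8, 0.2, 40000
counts = np.zeros(N)
for _ in range(trials):
    for p in sample_change_points(N, lam, rng):
        counts[p] += 1.0
freq = counts[1:] / trials   # empirical P(C_i = 1) for i = 1..N-1
# Each interior position should be a change point with probability lam.
assert np.all(np.abs(freq - lam) < 0.02)
```

Since the positions behave as independent Bernoulli(λ) trials, the total number of change points among positions 1, ..., N − 1 is Binomial(N − 1, λ), matching the claim proved below.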
Also, we let the probability mass function for the distance between two successive change points s and t be g(|t − s|). Furthermore, we define the cumulative distribution function for the distance as

  G(l) = ∑_{i=1}^{l} g(i)   (2.4)

and assume that g() is also the probability mass function for the position of the first change point. In general, g() can be any probability mass function with domain over 1, 2, ..., N − 1. Then g() and G() imply a prior distribution on the number and positions of change points. For example, if we use the Geometric distribution as g(), then our model implies a Binomial distribution for the number of change points and a Uniform distribution for the locations of change points.

To see this, suppose there are N data points and we use a Geometric distribution with parameter λ. We write P(C_i = 1) for the probability that location i is a change point. By default, position 0 is always a change point. That is,

  P(C_0 = 1) = 1   (2.5)

First, we show by induction that the distribution for the location of change points is Uniform. That is, for all i = 1, ..., N,

  P(C_i = 1) = λ   (2.6)

When i = 1, we only have one case: positions 0 and 1 are both change points, so the length of the segment is 1. We have

  P(C_1 = 1) = g(1) = λ

Suppose for all i ≤ k we have

  P(C_i = 1) = λ   (2.7)

Now when i = k + 1, conditioning on the last change point before position k + 1, we have

  P(C_{k+1} = 1) = ∑_{j=0}^{k} P(C_{k+1} = 1 | C_j = 1) P(C_j = 1)   (2.8)

where

  P(C_{k+1} = 1 | C_j = 1) = g(k + 1 − j) = λ(1 − λ)^{k−j}   (2.9)

By (2.5), (2.7) and (2.9), (2.8) becomes

  P(C_{k+1} = 1) = P(C_{k+1} = 1 | C_0 = 1) P(C_0 = 1) + ∑_{j=1}^{k} P(C_{k+1} = 1 | C_j = 1) P(C_j = 1)
               = λ(1 − λ)^k + ∑_{j=1}^{k} λ(1 − λ)^{k−j} λ     (let t = k − j)
               = λ(1 − λ)^k + λ² ∑_{t=0}^{k−1} (1 − λ)^t
               = λ(1 − λ)^k + λ(1 − (1 − λ)^k)
               = λ   (2.10)

By induction, this proves (2.6).

Next we show that the number of change points follows a Binomial distribution. Consider each position as a trial with two outcomes: either it is a change point or not. By (2.6), we know the probability of being a change point is λ. Then we only need to show that the trials are independent. That is, for all i, j = 1, ..., N with i ≠ j,

  P(C_i = 1) = P(C_i = 1 | C_j = 1) = λ   (2.11)

When i < j, this holds by default, since the future does not change history. When i > j, we show it by induction on j. When j = i − 1, we only have one case: positions j and i are both change points, so the length of the segment is 1. We have

  P(C_i = 1 | C_{i−1} = 1) = g(1) = λ

Suppose for all j ≥ k we have

  P(C_i = 1 | C_j = 1) = λ   (2.12)

Now when j = k − 1, conditioning on the next change point after position k − 1 (call its position t), we have

  P(C_i = 1 | C_{k−1} = 1) = ∑_{t=k}^{i} P(C_i = 1 | C_t = 1, C_{k−1} = 1) P(C_t = 1 | C_{k−1} = 1)   (2.13)

where, by (2.12),

  P(C_i = 1 | C_t = 1, C_{k−1} = 1) = λ if t < i,  and 1 if t = i,

and, since t is the first change point after k − 1,

  P(C_t = 1 | C_{k−1} = 1) = g(t − k + 1) = λ(1 − λ)^{t−k}   (2.14)

Hence (2.13) becomes

  P(C_i = 1 | C_{k−1} = 1) = ∑_{t=k}^{i−1} λ · λ(1 − λ)^{t−k} + λ(1 − λ)^{i−k}     (let s = t − k)
                          = λ² ∑_{s=0}^{i−1−k} (1 − λ)^s + λ(1 − λ)^{i−k}
                          = λ(1 − (1 − λ)^{i−k}) + λ(1 − λ)^{i−k}
                          = λ   (2.15)

By induction, this proves (2.11). Hence the number of change points follows a Binomial distribution.

2.2 Likelihood Functions for Data in Each Partition

If s and t are two successive change points, then the data Y_{s+1:t} forms one segment. We use the likelihood function P(Y_{s+1:t}) to evaluate how well the data Y_{s+1:t} can be fit by one segment. Here there are two levels of uncertainty. The first is the uncertainty over model types. There are many possible candidate models and we certainly cannot enumerate all of them. As a result, we will only consider a finite set of models.
For example, polynomial regression models up to order 2, or autoregressive models up to order 3. Then if we put a prior distribution π() on the set of models, and define P(Y_{s+1:t} | q) as the likelihood function of Y_{s+1:t} conditional on a model q, we can evaluate P(Y_{s+1:t}) as

  P(Y_{s+1:t}) = ∑_q P(Y_{s+1:t} | q) π(q)   (2.16)

The second is the uncertainty over model parameters. If the parameter on this segment conditional on model q is θ_q, and we put a prior distribution π(θ_q) on the parameters, we can evaluate P(Y_{s+1:t} | q) as

  P(Y_{s+1:t} | q) = ∫ ( ∏_{i=s+1}^{t} f(Y_i | θ_q, q) ) π(θ_q | q) dθ_q   (2.17)

We assume that P(Y_{s+1:t} | q) can be efficiently calculated for all s, t and q, where s < t. In practice, this requires either conjugate priors on θ_q, which allow us to work out the likelihood function analytically, or fast numerical routines which are able to evaluate the required integration. In general, for any data and models, as long as we can evaluate the likelihood function (2.17), we can use Fearnhead's algorithms. Our extensions are mainly based on this. Now let's look at some examples.

2.2.1 Linear Regression Models

Linear regression models are among the most widely used models. Here we assume

  Y_{s+1:t} = Hβ + ε   (2.18)

where H is a (t − s) by q matrix of basis functions, β is a q by 1 vector of regression parameters, and ε is a (t − s) by 1 vector of iid Gaussian noise with mean 0 and variance σ². Since we assume conjugate priors, σ² has an Inverse Gamma distribution with parameters ν/2 and γ/2, and the j-th component of the regression parameter, β_j, has a Gaussian distribution with mean 0 and variance σ²δ_j². This hierarchical model is illustrated in Figure 2.1.

Figure 2.1: Graphical representation of the hierarchical structure of linear regression models.

For simplicity, we write Y_{s+1:t} as Y and let n = t − s. Then we have

  P(Y | β, σ²) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) (Y − Hβ)ᵀ(Y − Hβ) )   (2.19)

  P(β | D, σ²) = (2πσ²)^{−q/2} |D|^{−1/2} exp( −(1/(2σ²)) βᵀ D^{−1} β )   (2.20)

  P(σ² | ν, γ) = ((γ/2)^{ν/2} / Γ(ν/2)) (σ²)^{−ν/2−1} exp( −γ/(2σ²) )   (2.21)

where D = diag(δ_1², ..., δ_q²) and I_n is the n by n identity matrix. By (2.17), we have

  P(Y_{s+1:t} | q) = P(Y | D, ν, γ) = ∫∫ P(Y, β, σ² | D, ν, γ) dβ dσ²
                  = ∫∫ P(Y | β, σ²) P(β | D, σ²) P(σ² | ν, γ) dβ dσ²

Multiplying (2.19) and (2.20), we have

  P(Y | β, σ²) P(β | D, σ²) ∝ exp( −(1/(2σ²)) { (Y − Hβ)ᵀ(Y − Hβ) + βᵀD^{−1}β } )
                            ∝ exp( −(1/(2σ²)) { YᵀY − 2YᵀHβ + βᵀHᵀHβ + βᵀD^{−1}β } )

Now let

  M = (HᵀH + D^{−1})^{−1}
  P = I_n − HMHᵀ
  ||Y||²_P = YᵀPY
  (*) = YᵀY − 2YᵀHβ + βᵀHᵀHβ + βᵀD^{−1}β

Then

  (*) = βᵀ(HᵀH + D^{−1})β − 2YᵀHβ + YᵀY
      = βᵀM^{−1}β − 2YᵀHMM^{−1}β + YᵀY
      = βᵀM^{−1}β − 2YᵀHMM^{−1}β + YᵀHMM^{−1}MHᵀY − YᵀHMM^{−1}MHᵀY + YᵀY

Using the fact that M = Mᵀ,

  (*) = (β − MHᵀY)ᵀ M^{−1} (β − MHᵀY) + YᵀY − YᵀHMHᵀY
      = (β − MHᵀY)ᵀ M^{−1} (β − MHᵀY) + YᵀPY
      = (β − MHᵀY)ᵀ M^{−1} (β − MHᵀY) + ||Y||²_P

Hence

  P(Y | β, σ²) P(β | D, σ²) ∝ exp( −(1/(2σ²)) { (β − MHᵀY)ᵀ M^{−1} (β − MHᵀY) + ||Y||²_P } )

So the posterior for β is still Gaussian, with mean MHᵀY and variance σ²M:

  β | Y, D, σ² ~ N(MHᵀY, σ²M)   (2.22)

Then, integrating out β, we have

  P(Y | D, σ²) = ∫ P(Y | β, σ²) P(β | D, σ²) dβ = (2πσ²)^{−n/2} (|M| / |D|)^{1/2} exp( −||Y||²_P / (2σ²) )   (2.23)

Now we multiply (2.21) and (2.23):

  P(Y | D, σ²) P(σ² | ν, γ) ∝ (σ²)^{−(n+ν)/2−1} exp( −(γ + ||Y||²_P) / (2σ²) )

So the posterior for σ² is still Inverse Gamma, with parameters (n + ν)/2 and (γ + ||Y||²_P)/2:

  σ² | Y, D, ν, γ ~ IG( (n + ν)/2, (γ + ||Y||²_P)/2 )   (2.24)

Then, integrating out σ², we have

  P(Y_{s+1:t} | q) = P(Y | D, ν, γ) = ∫ P(Y | D, σ²) P(σ² | ν, γ) dσ²
                  = π^{−n/2} (|M| / |D|)^{1/2} ( γ^{ν/2} / (γ + ||Y||²_P)^{(n+ν)/2} ) ( Γ((n + ν)/2) / Γ(ν/2) )   (2.25)
In implementation, we rewrite (2.25) in log space as

  log P(Y_{s+1:t} | q) = −(n/2) log π + (1/2)(log|M| − log|D|) + (ν/2) log γ − ((n + ν)/2) log(γ + ||Y||²_P) + log Γ((n + ν)/2) − log Γ(ν/2)   (2.26)

To speed this up, −(n/2) log π, (ν/2) log γ, −(1/2) log|D|, log Γ((n + ν)/2) and log Γ(ν/2) can be pre-computed. At each iteration, M and ||Y||²_P can be computed by the following rank one updates:

  Hᵀ_{1:t+1} H_{1:t+1} = Hᵀ_{1:t} H_{1:t} + Hᵀ_{t+1} H_{t+1}
  Hᵀ_{1:t+1} Y_{1:t+1} = Hᵀ_{1:t} Y_{1:t} + Hᵀ_{t+1} Y_{t+1}

We use the following notation: if Y is a vector, then Y_{s:t} denotes the entries from position s to t inclusive. If Y is a matrix, then Y_{s:t,·} denotes the s-th through t-th rows inclusive, and Y_{·,s:t} denotes the s-th through t-th columns inclusive.

2.2.2 Poisson-Gamma Models

In Poisson models, each observation Y_i is a non-negative integer which follows a Poisson distribution with parameter λ. With the conjugate prior, λ follows a Gamma distribution with parameters α and β. This hierarchical model is illustrated in Figure 2.2.

Figure 2.2: Graphical representation of the hierarchical structure of Poisson-Gamma models. Note that the Exponential-Gamma models have the same graphical representation of their hierarchical structure.

Hence we have

  P(Y_i | λ) = λ^{Y_i} e^{−λ} / Y_i!   (2.27)

  P(λ | α, β) = β^α λ^{α−1} e^{−βλ} / Γ(α)   (2.28)

For simplicity, we write Y_{s+1:t} as Y and let n = t − s. By (2.17), we have

  P(Y_{s+1:t} | q) = P(Y | α, β) = ∫ P(Y, λ | α, β) dλ = ∫ ( ∏_i P(Y_i | λ) ) P(λ | α, β) dλ

Multiplying (2.27) and (2.28), and letting SY = ∑_i Y_i, we have

  P(Y | λ) P(λ | α, β) ∝ e^{−(n+β)λ} λ^{SY+α−1}

So the posterior for λ is still Gamma, with parameters SY + α and n + β:

  P(λ | Y, α, β) = (n + β)^{SY+α} λ^{SY+α−1} e^{−(n+β)λ} / Γ(SY + α)   (2.29)

Then, integrating out λ, we have

  P(Y_{s+1:t} | q) = P(Y | α, β) = ( ∏_i 1/Y_i! ) ( β^α / Γ(α) ) ( Γ(SY + α) / (n + β)^{SY+α} )   (2.30)

Similarly, in log space, we have

  log P(Y_{s+1:t} | q) = −∑_i log(Y_i!) + log Γ(SY + α) − log Γ(α) + α log β − (SY + α) log(n + β)   (2.31)

where ∑_i log(Y_i!), log Γ(α) and α log β can be pre-computed, and SY can use the following rank one update:

  SY_{i+1} = SY_i + Y_{i+1}

2.2.3 Exponential-Gamma Models

In Exponential models, each observation Y_i is a positive real number which follows an Exponential distribution with parameter λ. With the conjugate prior, λ follows a Gamma distribution with parameters α and β. This hierarchical model is the same as the one in the Poisson-Gamma models, illustrated in Figure 2.2. Hence we have

  P(Y_i | λ) = λ e^{−λ Y_i}   (2.32)

  P(λ | α, β) = β^α λ^{α−1} e^{−βλ} / Γ(α)   (2.33)

For simplicity, we write Y_{s+1:t} as Y and let n = t − s. By (2.17), we have

  P(Y_{s+1:t} | q) = P(Y | α, β) = ∫ P(Y, λ | α, β) dλ = ∫ ( ∏_i P(Y_i | λ) ) P(λ | α, β) dλ

Multiplying (2.32) and (2.33), and letting SY = ∑_i Y_i, we have

  P(Y | λ) P(λ | α, β) ∝ e^{−(SY+β)λ} λ^{n+α−1}

So the posterior for λ is still Gamma, with parameters n + α and SY + β:

  P(λ | Y, α, β) = (SY + β)^{n+α} λ^{n+α−1} e^{−(SY+β)λ} / Γ(n + α)   (2.34)

Then, integrating out λ, we have

  P(Y_{s+1:t} | q) = P(Y | α, β) = ( β^α / Γ(α) ) ( Γ(n + α) / (SY + β)^{n+α} )   (2.35)

Similarly, in log space, we have

  log P(Y_{s+1:t} | q) = α log β − log Γ(α) + log Γ(n + α) − (n + α) log(SY + β)   (2.36)

where log Γ(α) and α log β can be pre-computed, and SY can use the rank one update mentioned earlier.

2.3 Basis Functions

In the linear regression model (2.18), we have a basis matrix H. Some common basis functions are the following.

2.3.1 Polynomial Basis

A polynomial model of order r is defined as

  Y_i = β_0 + β_1 X_i + β_2 X_i² + ... + β_r X_i^r   (2.37)

Hence the polynomial basis H for rows i through j is

  H_{i:j} = [ 1  X_i  X_i²  ...  X_i^r ]
            [ .   .    .          .   ]
            [ 1  X_j  X_j²  ...  X_j^r ]   (2.38)

2.3.2 Autoregressive Basis

An autoregressive model of order r is defined as

  Y_i = β_1 Y_{i−1} + β_2 Y_{i−2} + ... + β_r Y_{i−r}   (2.39)
Hence the autoregressive basis H for rows i through j is

  H_{i:j} = [ Y_{i−1}  Y_{i−2}  ...  Y_{i−r} ]
            [    .        .            .    ]
            [ Y_{j−1}  Y_{j−2}  ...  Y_{j−r} ]   (2.40)

There are many other possible basis functions as well (e.g., a Fourier basis). Here we just want to point out that the basis function does provide a way to extend Fearnhead's algorithms (e.g., a kernel basis).

2.4 Offline Algorithms

Now we review the offline algorithm in detail. As its name suggests, it works in batch mode. It first recurses backward, then simulates change points forward.

2.4.1 Backward Recursion

Let's define, for s = 2, ..., N,

  Q(s) = P(Y_{s:N} | location s − 1 is a change point)   (2.41)

and Q(1) = P(Y_{1:N}), since location 0 is a change point by default. The offline algorithm calculates Q(s) for all s = 1, 2, ..., N.

When s = N,

  Q(N) = ∑_q P(Y_{N:N} | q) π(q)   (2.42)

When s < N,

  Q(s) = ∑_{t=s}^{N−1} ∑_q P(Y_{s:t} | q) π(q) Q(t + 1) g(t − s + 1) + ∑_q P(Y_{s:N} | q) π(q) (1 − G(N − s))   (2.43)

where π(q) is the prior probability of model q, and g() and G() are defined in (2.4).

The reason behind (2.43) is that (dropping the explicit conditioning on a change point at s − 1 for notational convenience)

  Q(s) = ∑_{t=s}^{N−1} P(Y_{s:N}, next change point is at t) + P(Y_{s:N}, no further change points)   (2.44)

For the first part we have

  P(Y_{s:N}, next change point is at t)
    = P(next change point is at t) P(Y_{s:t} | s, t form a segment) P(Y_{t+1:N} | next change point is at t)
    = g(t − s + 1) ∑_q P(Y_{s:t} | q) π(q) Q(t + 1)

For the second part we have

  P(Y_{s:N}, no further change points)
    = P(length of the segment > N − s) P(Y_{s:N} | s, N form a segment)
    = (1 − G(N − s)) ∑_q P(Y_{s:N} | q) π(q)

We calculate Q(s) backward for s = N, ..., 1 by (2.42) and (2.43). Since s runs from N to 1, and at step s we compute a sum over t from s to N, the whole algorithm runs in O(N²).

2.4.2 Forward Simulation

After we calculate Q(s) for all s = 1, ..., N, we can simulate all change points forward. To simulate one realisation, we do the following:

1. Set τ_0 = 0 and k = 0.
2. Compute the posterior distribution of τ_{k+1} given τ_k as follows. For τ_{k+1} = τ_k + 1, τ_k + 2, ..., N − 1,

     P(τ_{k+1} | τ_k, Y_{1:N}) = ∑_q P(Y_{τ_k+1:τ_{k+1}} | q) π(q) Q(τ_{k+1} + 1) g(τ_{k+1} − τ_k) / Q(τ_k + 1)

   For τ_{k+1} = N, which means no further change points,

     P(τ_{k+1} | τ_k, Y_{1:N}) = ∑_q P(Y_{τ_k+1:N} | q) π(q) (1 − G(N − τ_k)) / Q(τ_k + 1)

3. Simulate τ_{k+1} from P(τ_{k+1} | τ_k, Y_{1:N}), and set k = k + 1.
4. If τ_k < N, return to step 2; otherwise output the set of simulated change points, τ_1, τ_2, ..., τ_k.

2.5 Online Algorithms

Now we discuss the online algorithms, which have two versions (exact and approximate). Both work in online mode, and in the same way: they first calculate a recursion forward, then simulate change points backward. We first review the online exact algorithm in detail.

Let's introduce C_t as a state at time t, defined to be the time of the most recent change point prior to t. If C_t = 0, then there is no change point before time t. The state C_t can take values 0, 1, ..., t − 1, and C_1, C_2, ..., C_t satisfy the Markov property, since we assume conditional independence over segments. Conditional on C_{t−1} = i, either t − 1 is not a change point, which leads to C_t = i, or t − 1 is a change point, which leads to C_t = t − 1. Hence the transition probabilities of this Markov chain can be calculated as

  P(C_t = j | C_{t−1} = i) = (1 − G(t − i − 1)) / (1 − G(t − i − 2))                 if j = i
                           = (G(t − i − 1) − G(t − i − 2)) / (1 − G(t − i − 2))     if j = t − 1
                           = 0                                                      otherwise   (2.45)

where G is defined in (2.4).

The reason behind (2.45) is: given C_{t−1} = i, we know there is no change point between i and t − 2. This also means that the length of the segment must be longer than t − 2 − i. At the same time, G(n) is the cumulative probability that the distance between two consecutive change points is no more than n, which can also
A t the same time, G(n) means the cumulative probability of the distance between two consecutive change points no more than n, which can also 19 be considered as the cumulative probability of the length of segment no more Chapter 2. One Dimensional Time Series than n. Hence We have P{C -\ = i) = 1 - G(t -2 — i). Then if j = i, we have t P(C = i) = 1 — G(t - 1 — i). If j = t — 1, which means there is a change point t at £ — 1, and by definition of <?(), the length of two change points is t — 1 — i, and this is just g(t — 1 — i), which is also G(t — i — 1) — G(t — i — 2). If we use a Geometric distribution as g(), then g(i) = (1 — X)'~ X and G(i) = 1 1 - (1 - A ) \ Now if j = i, TP{C =j\C -i=i) t \-G{t-i-l) = t _ i-G{t-i-2) 1- A (l-A)'-'- 1 (l-xy-i- 2 A n d if j = t - 1 G(t - i - 1) - G{t - i - 2) _ (1 - xy- - - (1 - A)' { TP(C = j|Ct_i = : l-G(f-i-2) t 2 (l-A) 4 A Hence ( 2.45) becomes, 1-A TP{C — j\C -i t t = i) = if j = i { X if j = t - 1 0 otherwise where A is the parameter of the Geometric distribution. 2.5.1 Forward Recursion Let's define the filtering density P(C = j\Yi- ) as: given data Y i , the probat t : J bility of the last change point is at position j. Online algorithm will compute P(C = j\Y ) for all t,j, such that t = 1,2,..., N and j = 0 , 1 , . . . , t - 1. t 1:t W h e n t = 1, we have j = 0, hence P(G 1 = 0|y 1 : 1 ) = 1 (2.46) W h e n t > 1, by standard filtering recursions, we have p(.c = j\Y ) t 1:t = P(c =j|y yi _i) t t ) : t p(gt = j , y | y i : - i ) t P(Vt|Vl:t-l) 20 Chapter 2. One Dimensional Time Series P{Y \C = j,Y ^)P(C t t 1:t =j\Y - ) t 1:t 1 P(Y \Y -i) t oc ut P{Y \C =3,Yv.t-x)P{Ct=j\Y ..t-i) t t 1 (2.47) and t-2 P(C = J l J W i ) = t J2 (°t = 3\C -i = i)P(C -i = *|Ki=.-i) (2.48) TP t t If we define w{ = P(Y \C = j, V ^ t - i ) , and with ( 2.45), we have t P(C = j\Y ) t lu t ex wiiz^ztllPiCt-^m-.t-i) if i < < - 1 zTJ ( ^ i r - y ^ m - i M = i|Ki:.-o) if i =*-1 (2.49) Now W( can be calculated as following. 
W h e n j < t - 1, , P(y l:«|C =j) p(y _ |c7 =j) i+ J+1:t t 1 t E,^(yj+i:«|g)^(q) E,P(y,+i -ik)^(g) : t W h e n j = t — 1, = ^P(y : t |gW9) (2.50) where 7r(g) is the prior probability of model q. We will calculate P{C \Y ) forward for t = 1, • • •, TV by ( 2.46) and ( 2.49). t l:t Since t is from 1 to TV, and at step t we will compute C = j where j is from 0 t to t — 1, the whole algorithm runs in 0(N ). 2 2.5.2 Backward Simulation After we calculate the filtering densities P{C \Yi-t) for all t = 1 , . . . , T V , we t can use idea i n [9] to simulate all change points backward. To simulate one realisation from this joint density, we do the following, 1. Set r = TV, and k = 0. 0 21 Chapter 2. One Dimensional Time Series 2. Simulated T^+I from the filtering density P(C \Yi ), Tk :Tk and set k = k + 1. 3. If Tfc > 0 return to step (2); otherwise output the set of simulated change points backward, T^-I, Tk-2, • • •, 2.5.3 Viterbi Algorithms We can also obtain an online Viterbi algorithm for calculating the maximum a posterior ( M A P ) estimate of positions of change points and model parameters for each segment as following. We define M j to be the event that given a change point at time i, the M A P choice of change points and model parameters occurs prior to time i. P (i,q) = t P(Ct = i,modelq,Mi,Y ) 1:t and pMAr = p ( h a n g e point at c t,M ,Y ) t 1:t Then we can compute P (i,q) using the following: t P (i,q) t = (l-G(t-j-l))P(Y .,\q)n(q)P 3+1 MAP 3 (2-51) and P ^ = r n a ^ f ^ ' J l (2.52) A t time t, the M A P estimates of C and the current model parameters are given t respectively by the values of j and q which maximize P (i,q). Given a M A P t estimate of C , we can then calculate the M A P estimates of the change point t prior to C and the model parameters of that segment by the value of j and q t that maximized the right hand side of ( 2.52). This procedure can be repeated to find the M A P estimates of all change points and model parameters. 22 Chapter 2. 
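To make the online recursion concrete, here is a minimal sketch of the forward recursion (2.46)-(2.50), specialized to the Geometric prior, where the transition is "stay" with probability 1 − λ and "new change point at t − 1" with probability λ. This is our own illustration, not the thesis code; `seg_loglik(s, t)` is a placeholder for any segment log marginal likelihood log P(Y_{s:t}), such as (2.26), (2.31) or (2.36).

```python
import numpy as np

def forward_filter(N, lam, seg_loglik):
    """Forward recursion (2.46)-(2.49) with a Geometric prior.
    Returns a list whose entry t-1 is the filtering density
    P(C_t = j | Y_{1:t}) over j = 0..t-1."""
    filt = [np.array([1.0])]                  # P(C_1 = 0 | Y_{1:1}) = 1
    for t in range(2, N + 1):
        prev = filt[-1]
        pred = np.empty(t)                    # P(C_t = j | Y_{1:t-1})
        pred[:t - 1] = (1.0 - lam) * prev     # stay: no change at t-1
        pred[t - 1] = lam * prev.sum()        # jump: change point at t-1
        # predictive weights w_t^j of eq. (2.50)
        w = np.array([np.exp(seg_loglik(j + 1, t) - seg_loglik(j + 1, t - 1))
                      if j < t - 1 else np.exp(seg_loglik(t, t))
                      for j in range(t)])
        post = w * pred
        filt.append(post / post.sum())
    return filt

# With an uninformative likelihood (all segment scores equal), the
# filtering density reduces to the prior on C_t implied by (2.45).
filt = forward_filter(5, 0.3, lambda s, t: 0.0)
expected = [0.7 ** 4] + [0.3 * 0.7 ** (4 - j) for j in range(1, 5)]
assert np.allclose(filt[4], expected)
```

A real implementation would work in log space with logsumexp, as discussed in Section 2.6, rather than exponentiating the weights directly.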
2.5.4 Approximate Algorithm

Now we look at the online approximate algorithm. It works in almost the same way as the online exact algorithm; the only difference is the following. At step t, the online exact algorithm stores the complete posterior distribution P(C_t = j | Y_{1:t}) for j = 0, 1, …, t − 1. We approximate P(C_t = j | Y_{1:t}) by a discrete distribution with fewer points. This approximate distribution can be described by a set of support points, which we call particles. We impose a maximum number (e.g., M) of particles to be stored at each step. Whenever we have M particles, we perform resampling to reduce the number of particles to L, where L < M. Hence the maximum computational cost per iteration is proportional to M. Since M does not depend on the size of the data N, the approximate algorithm runs in O(N). There are many ways to perform resampling [7, 12, 15, 16]. Here, we use the simplest one: we let L = M − 1, so at each step we just drop the particle with the lowest support. We will show later that this approximation hardly affects the accuracy of the results, but runs much faster.

2.6 Implementation Issues

There are two implementation issues that need to be mentioned. The first is numerical stability. We perform all calculations in log space to prevent overflow and underflow, using the functions logsumexp and logdet. In all three algorithms, we introduce a threshold (e.g., 1 × 10⁻³⁰). During the recursions, whenever a quantity that we need to compute is less than the threshold, we set it to −∞, so that it is not computed in the next iteration. This not only stabilizes the calculation but also speeds it up, because fewer terms are computed in each iteration. At the end of each iteration, we rescale the quantities to prevent underflow.

The second issue is the rank one update, which is very important in terms of speed.
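As a sketch of the idea (for a linear regression likelihood; a hypothetical helper, not the thesis code): when a segment grows by one observation, the sufficient statistics H^T H and H^T y can be updated in O(q²) rather than recomputed from scratch in O(nq²).

```python
import numpy as np

def rank_one_update(HtH, Hty, h_new, y_new):
    """Add one observation (basis row h_new, response y_new) to the
    running sufficient statistics H^T H and H^T y."""
    return HtH + np.outer(h_new, h_new), Hty + h_new * y_new

# usage: statistics for the first 4 rows, then extend with the 5th
H = np.arange(10.0).reshape(5, 2)
y = np.arange(5.0)
HtH, Hty = H[:4].T @ H[:4], H[:4].T @ y[:4]
HtH, Hty = rank_one_update(HtH, Hty, H[4], y[4])  # now equals the full H^T H, H^T y
```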
How to perform the rank one update depends on the likelihood function we use.

2.7 Experimental Results

Now we show experimental results on some synthetic and real data sets.

2.7.1 Synthetic Data

We will look at two synthetic data sets. The first one is called Blocks. It has been previously analyzed in [10, 13, 21]. It is a very simple data set, and the results are shown in Figure 2.3. Each segment is a piecewise constant model, with the mean level shifting over segments. We set the hyperparameters ν = 2 and γ = 2 on σ². The top panel shows the raw data with the true change points (red vertical lines). The bottom three panels show the posterior probability of being a change point at each position and the number of segments, calculated (from top to bottom) by the offline, the online exact and the online approximate algorithms. We can see that the results are essentially the same.

The second data set is called AR1, and the results are shown in Figure 2.4. Each segment is an autoregressive model. We set the hyperparameters ν = 2 and γ = 0.02 on σ². The top panel shows the raw data with the true change points (red vertical lines). The bottom three panels show the posterior probability of being a change point at each position and the number of segments, calculated (from top to bottom) by the offline, the online exact and the online approximate algorithms. Again, the results from the three algorithms are essentially the same. Henceforth we only use the online approximate method.

2.7.2 British Coal Mining Disaster Data

Now let's look at a real data set which records British coal-mining disasters [17] by year during the 112-year period from 1851 to 1962. This is a well-known data set which has been studied by many researchers [6, 14, 18, 25]. Here Y_i is the number of disasters in the i-th year, which follows a Poisson distribution with parameter (rate) λ. Hence the natural conjugate prior is a Gamma distribution.
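Under this conjugate pair, the marginal likelihood of a segment of counts is available in closed form. The following is a minimal sketch of the standard Poisson–Gamma conjugate calculation (not the thesis code):

```python
from math import lgamma, log

def log_marglik_poisson(y, a, b):
    """Log marginal likelihood of counts y under rate ~ Gamma(a, b):
    the integral of prod_i Poisson(y_i | lam) * Gamma(lam | a, b) d lam."""
    n, s = len(y), sum(y)
    return (a * log(b) - lgamma(a)                  # prior normalizer
            + lgamma(a + s) - (a + s) * log(b + n)  # posterior normalizer
            - sum(lgamma(k + 1) for k in y))        # sum of log(y_i!)
```

For instance, with a = 1, b = 1 and a single observation y = [0], the marginal probability is 1/2.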
We set the hyperparameters α = 1.66 and β = 1, since we let the prior mean of λ equal the empirical mean (α/β = Ȳ = 1.66) and the prior strength β be weak. Then we use Fearnhead's algorithms to analyse the data, and the results are shown in Figure 2.5. The top left panel shows the raw data as a Poisson sequence. The bottom left panel shows the posterior distribution of the number of segments; the most probable number of segments is four. The bottom right panel shows the posterior probability of being a change point at each location. Since we have four segments, we pick the three most probable change points, at locations 41, 84 and 102, which correspond to the years 1891, 1934 and 1952. The top right panel shows the resulting segmentation (red vertical lines) and the posterior estimates of the rate λ on each segment (red horizontal lines). On the four segments, the posterior rates are roughly 3, 1, 1.5 and 0.5.

2.7.3 Tiger Woods Data

Now let's look at another interesting example. In sports, when an individual player or team enjoys periods of good form, it is called 'streakiness'. It is interesting to detect a streaky player or a streaky team in many sports [1, 5, 26], including baseball, basketball and golf. The opposite of a streaky competitor is a player or team with a constant rate of success over time. A streaky player is different, since the associated success rate is not constant over time. Streaky players might have a large number of successes during one or more periods, with fewer or no successes during other periods. More streaky players tend to have more change points. Here we use Tiger Woods as an example. Tiger Woods is one of the most famous golf players in the world, and the first ever to win all four professional major championships consecutively.
Woods turned professional at the Greater Milwaukee Open on August 29, 1996, and played 230 tournaments, winning 62 championships, in the period September 1996 to December 2005. Let Y_i = 1 if Woods won the i-th tournament and Y_i = 0 otherwise. Then the data can be expressed as a Bernoulli sequence with parameter (rate) λ. We put a conjugate Beta prior with hyperparameters α = 1 and β = 1 on λ, since this is the weakest prior. Then we run Fearnhead's algorithms to analyse the data, and the results are shown in Figure 2.6. The top left panel shows the raw data as a Bernoulli sequence. The bottom left panel shows the posterior distribution of the number of segments; the most probable number of segments is three. The bottom right panel shows the posterior probability of being a change point at each location. Since we have three segments, we pick the two most probable change points, at locations 73 and 167, which correspond to May 1999 and March 2003. The top right panel shows the resulting segmentation (red vertical lines) and the posterior estimates of the rate λ on each segment (red horizontal lines). We can see clearly that Tiger Woods's career from September 1996 to December 2005 can be split into three periods. Initially, from September 1996 to May 1999, he was a golf player with a winning rate lower than 0.2. From May 1999 to March 2003, he was in his peak period, with a winning rate of nearly 0.5. From March 2003 until December 2005, his winning rate dropped back to lower than 0.2. The data come from Tiger Woods's official website http://www.tigerwoods.com/, and were previously studied by [26]; in [26], the data only cover September 1996 to June 2001.

2.8 Choices of Hyperparameters

The hyperparameter λ is the rate of change points and can be set as follows:

    λ = (the total number of expected segments) / (the total number of data points)
    (2.53)

For example, if there are 1000 data points and we expect 10 segments, then we set λ = 0.01. If we increase λ, we encourage more change points. We use the synthetic data Blocks as an example. Results obtained by the online approximate algorithm under different values of λ are shown in Figure 2.7. From top to bottom, the values of λ are 0.5, 0.1, 0.01, 0.001 and 0.0002. We see that when λ = 0.5 (this prior says the expected segment length is 2, which is too short), the result is clearly oversegmented. Under the other values of λ, the results are fine.

Different likelihood functions have different hyperparameters. In linear regression, we have hyperparameters ν and γ on the variance σ². Since ν represents the strength of the prior, we normally set ν = 2 so that it is a weak prior. Then we can set γ to reflect our belief about how large the variance will be within each segment. (Note: we parameterize the Inverse-Gamma in terms of ν/2 and γ/2.) When ν = 2, the mean does not exist, so we use the mode γ/(ν + 2) to set γ = 4σ̂², where σ̂² is the expected variance within each segment. For example, if we believe the variance is 0.01, then we set γ = 0.04.

Now we show results from the synthetic data Blocks and AR1, obtained by the online approximate algorithm under different values of γ, in Figure 2.8 and Figure 2.9. From top to bottom, the values of γ are 100, 20, 2, 0.2 and 0.04. We see that the data Blocks is very robust to the choice of γ. For the data AR1, when γ = 100 we only detect 3 change points instead of 5. For other values of γ, the results are fine.

Figure 2.3: Results on synthetic data Blocks (1000 data points). The top panel is the Blocks data set with true change points.
The rest are the posterior probability of being a change point at each position and the number of segments calculated by (from top to bottom) the offline, the online exact and the online approximate algorithms. Results are generated by 'showBlocks'.

Figure 2.4: Results on synthetic data AR1 (1000 data points). The top panel is the AR1 data set with true change points. The rest are the posterior probability of being a change point at each position and the number of segments calculated by (from top to bottom) the offline, the online exact and the online approximate algorithms. Results are generated by 'showAR1'.

Figure 2.5: Results on Coal Mining Disaster data. The top left panel shows the raw data as a Poisson sequence. The bottom left panel shows the posterior distribution of the number of segments. The bottom right panel shows the posterior probability of being a change point at each position. The top right panel shows the resulting segmentation and the posterior estimates of the rate λ on each segment. Results are generated by 'showCoalMining'.

Figure 2.6: Results on Tiger Woods data. The top left panel shows the raw data as a Bernoulli sequence. The bottom left panel shows the posterior distribution of the number of segments. The bottom right panel shows the posterior probability of being a change point at each position. The top right panel shows the resulting segmentation and the posterior estimates of the rate λ on each segment. Results are generated by 'showTigerWoods'.
These data come from Tiger Woods's official website http://www.tigerwoods.com/.

Figure 2.7: Results on synthetic data Blocks (1000 data points) under different values of hyperparameter λ. The top panel is the Blocks data set with true change points. The rest are the posterior probability of being a change point at each position and the number of segments calculated by the online approximate algorithm under different values of λ. From top to bottom, the values of λ are 0.5, 0.1, 0.01, 0.001 and 0.0002. Large values of λ encourage more segments. Results are generated by 'showBlocksLambda'.

Figure 2.8: Results on synthetic data Blocks (1000 data points) under different values of hyperparameter γ. The top panel is the Blocks data set with true change points. The rest are the posterior probability of being a change point at each position and the number of segments calculated by the online approximate algorithm under different values of γ. From top to bottom, the values of γ are 100, 20, 2, 0.2 and 0.04. Large values of γ allow higher variance within one segment, hence encourage fewer segments. Results are generated by 'showBlocksGamma'.

Figure 2.9: Results on synthetic data AR1 (1000 data points) under different values of hyperparameter γ. The top panel is the AR1 data set with true change points. The rest are the posterior probability of being a change point at each position and the number of segments calculated by the online approximate algorithm under different values of γ. From top to bottom, the values of γ are 100, 20, 2, 0.2 and 0.04.
Large values of γ allow higher variance within one segment, hence encourage fewer segments. Results are generated by 'showAR1Gamma'.

Chapter 3

Multiple Dimensional Time Series

We extend Fearnhead's algorithms to the case of multiple dimensional series. Now Y_i, the observation at time i, is a d-dimensional vector. We can then detect change points based on changing correlation structure. We model the correlation structures using Gaussian graphical models. Hence we can estimate the changing topology of the dependencies among the series, in addition to detecting change points.

3.1 Independent Sequences

The first obvious extension is to assume that each dimension is independent. Under this assumption, the marginal likelihood function (2.16) can be written as the following product:

    P(Y_{s+1:t}) = ∏_{j=1}^d P(Y^j_{s+1:t})    (3.1)

where P(Y^j_{s+1:t}) is the marginal likelihood of the j-th dimension. Since each Y^j_{s+1:t} is one dimensional, P(Y^j_{s+1:t}) can be any likelihood function discussed in the previous chapter. The independent model is simple to use. Even when the independence assumption is not valid, it can be used as an approximate model, similar to Naive Bayes, when we cannot model the correlation structures among the dimensions.

3.2 Dependent Sequences

Correlation structures only exist when we have multiple series. This is the main difference when we go from one dimension to multiple dimensions. Hence we should model correlation structures whenever we are able to do so.

3.2.1 A Motivating Example

Let's use the following example to illustrate the importance of modeling correlation structures. As shown in Figure 3.1, we have two series.

Figure 3.1: A simple example to show the importance of modeling changing correlation structures. We are unable to identify any change if we look at each series individually. However, their correlation structures change over segments (positive, independent, negative).
Results are generated by 'plot2DExample'.

The data on the three segments are generated from the following Gaussian distributions. On the k-th segment, Y_i ~ N(0, Σ_k), where

    Σ_1 = [ 1  0.75 ; 0.75  1 ],  Σ_2 = [ 1  0 ; 0  1 ],  Σ_3 = [ 1  −0.75 ; −0.75  1 ]

As a result, the marginal distribution of each dimension is N(0, 1), the same over all three segments. Hence if we look at each dimension individually, we are unable to identify any changes. However, if we consider them jointly, we find that their correlation structures change: in the first segment they are positively correlated; in the second segment they are independent; in the last segment they are negatively correlated.

3.2.2 Multivariate Linear Regression Models

For multivariate linear regression models, we know how to model correlation structures, and we can calculate the likelihood function (2.16) analytically. On segment Y_{s+1:t} we still have

    Y_{s+1:t} = Hβ + ε    (3.2)

For simplicity, we write Y_{s+1:t} as Y, and let n = t − s. Now Y is an n by d matrix of data, H is an n by q matrix of basis functions, β is a q by d matrix of regression parameters and ε is an n by d matrix of noise. Here we need to introduce the Matrix-Gaussian distribution. A random m by n matrix A is Matrix-Gaussian distributed with parameters M_A, V and W if the density function of A is

    P(A) = N(M_A, V, W)
         = (2π)^{−mn/2} |V|^{−n/2} |W|^{−m/2} exp(−(1/2) trace[(A − M_A)^T V^{−1} (A − M_A) W^{−1}])

where M_A is an m by n matrix representing the mean, V is an m by m matrix representing the covariance among rows, and W is an n by n matrix representing the covariance among columns. As before, we assume conjugate priors. ε has a Matrix-Gaussian distribution with mean 0 and covariances I_n and Σ, since we assume ε is independent across time (rows) but dependent across features (columns). And β has a Matrix-Gaussian distribution with mean 0 and covariances D and Σ.
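The Matrix-Gaussian density defined above can be evaluated directly. The following is a minimal numpy sketch (a direct transcription of the formula, using explicit inverses for clarity rather than speed; not the thesis code):

```python
import numpy as np

def matrix_normal_logpdf(A, M, V, W):
    """Log density of the Matrix-Gaussian N(M, V, W), where V is the
    m x m row covariance and W the n x n column covariance."""
    m, n = A.shape
    R = A - M
    quad = np.trace(R.T @ np.linalg.inv(V) @ R @ np.linalg.inv(W))
    _, ldV = np.linalg.slogdet(V)
    _, ldW = np.linalg.slogdet(W)
    return -0.5 * (m * n * np.log(2 * np.pi) + n * ldV + m * ldW + quad)
```

For m = n = 1 and V = W = 1 this reduces to the standard normal log density.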
Finally, the covariance Σ has an Inverse-Wishart distribution with parameters N_0 and Σ_0. This hierarchical model is illustrated in Figure 3.2. Here D = diag(δ²_1, …, δ²_q) and I_n is the n by n identity matrix. Hence we have

Figure 3.2: Graphical representation of the hierarchical structure of the multivariate linear regression models.

    P(Y | β, Σ) = (2π)^{−nd/2} |I_n|^{−d/2} |Σ|^{−n/2} exp(−(1/2) trace((Y − Hβ)^T (Y − Hβ) Σ^{−1}))    (3.3)

    P(β | D, Σ) = (2π)^{−qd/2} |D|^{−d/2} |Σ|^{−q/2} exp(−(1/2) trace(β^T D^{−1} β Σ^{−1}))    (3.4)

    P(Σ | N_0, Σ_0) = |Σ_0|^{N_0/2} / (Z(N_0, d) 2^{N_0 d/2}) · |Σ|^{−(N_0+d+1)/2} exp(−(1/2) trace(Σ_0 Σ^{−1}))    (3.5)

where N_0 ≥ d and

    Z(n, d) = π^{d(d−1)/4} ∏_{i=1}^d Γ((n + 1 − i)/2)
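The normalizer Z(n, d) just defined (the multivariate Gamma function, up to parameterization) is best evaluated in log space to avoid overflow; a minimal sketch:

```python
from math import lgamma, log, pi

def log_Z(n, d):
    """log Z(n, d) = (d(d-1)/4) log(pi) + sum_{i=1}^d log Gamma((n+1-i)/2)."""
    return (d * (d - 1) / 4) * log(pi) + sum(
        lgamma((n + 1 - i) / 2) for i in range(1, d + 1))
```

For d = 1 this is just log Γ(n/2); for n = d = 2 it evaluates to log π, since Γ(1/2) = √π.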
2 e a _ ItroceCY^pyE- )) 1 p ( y\ \J E 2 v D 2 Now multiply ( 3.5) and ( 3.7) P(y|AE)P(E|/Vo,E ) | K 0 /2 S h | s | (N i)/2 ^(-2 e 0 + r f + t r a c e (( y T p y + S °) " )) S esp(-^roce((F Py + EoJE )) T ~ |£|(n+JV +<i+l)/2 ^ 0 - 1 2 So the posterior for E is still Inverse-Wishart with parameter n + N 0 y T 1 and p y + E , 0 P(E) ~ IW(n + N ,Y PY + E ) T 0 (3.8) 0 Then integrating out E , we have P(Y \q) S+Ut = P(Y\D,N ,Z ) 0 = . 0 Jp(Y\D,Z)P(Z\N ,Z )dZ 0 f\M\\i 1 \D\) 7T - /2 |E 1 °/ N (2ix) / W h e n d = 1 and by setting N 0 Z(n + 2 0 nd f\M\Y \\D\J nd 0 2 Z{N ,d)2 ° / N \Y PY d 2 T 0 N ,d)2( + °W n N 2 0 + Y, \( + °)/ n N 2 0 |E r°/ Z ( n + 7V ,d) -)-Eo|("+ °)/ Z(/V ,d) 2 0 \Y PY T = v and E 0 N 2 (3.9) 0 0 = 7 , ( 2.25) is just a special case of ( 3.9), since InverseWishart(v,^) = InverseGamma(%, ^ ) . In implementation, we rewrite ( 3.9) in log space as following, log(P(Y \ )) s+1:t q = , -yM*) ^log(\Z \)-^^log(\Y PY T 0 d + J2 °9( ((n l + X \)-l(log\M\-logm 0 d r + N + l-i)/2))-J2log(T((N 0 0 + l-i)/2)) (3.10) 40 Chapter 3. Multiple Dimensional Time Series To speed up, -flog(n), ^log(\E \),-*log\D\, £ t i log(T((n+N + l-i)/2)) 0 and ^>2i ilog(T((No = and Y PY 7 3.3 + 1 — i)/2)) 0 can be pre-computed. A t each iteration, M can be computed by the following rank one update, ^l:i+l,:-^l:>+l>: = ^lTi+l,:^l:> + l , : = + * l T i , : l:i,: Y + Gaussian Graphical Models Gaussian graphical models are more general tools to model correlation structures, especially conditional independence structures. There are two kinds of graphs: directed and undirected. Here we use undirected graphs as examples, and we will talk about directed graphs later. 3.3.1 Structure Representation In an undirected graph, each node represents a random variable. Define the precision matrix A = E _ 1 , which is the inverse of covariance matrix E . There is no edge between node i and node j if and only if A y = 0. A s shown in Figure 3.3, there are four nodes. 
There is an edge between node 1 and node 2, since Λ_12 ≠ 0. ("X" denotes a non-zero entry in the matrix.) The independent model and the multivariate linear regression model are just two special cases of Gaussian graphical models. In the independent model, Λ is a diagonal matrix, which represents the graph with all nodes isolated. In the multivariate linear regression model, Λ is a full matrix, which represents the fully connected graph. Unlike independent models, which are too simple, full covariance models can be too complex when the dimension d is large, since the number of parameters needed is O(d²). Gaussian graphical models provide a more flexible solution between these two extremes, especially when d becomes large, since in higher dimensions sparse structures are more important.

Figure 3.3: A simple example of using a graph to represent correlation structures. We first compute the precision matrix Λ. Then, from Λ, if Λ_ij = 0, there is no edge between node i and node j. "X" represents a non-zero entry in the matrix.

3.3.2 Sliding Windows and Structure Learning Methods

In order to use Fearnhead's algorithms, we need to do the following two things:

• generate a list of possible graph structures,
• compute the marginal likelihood given a graph structure.

For the first, the number of all possible undirected graphs is O(2^{d²}), so it is infeasible to try all possible graphs; we can only choose a subset. How to choose them is a chicken and egg problem: if we knew the segmentation, we could run a fast structure learning algorithm on each segment, but we need to know the structures in order to compute the segmentation. Hence we propose the following sliding window method to generate the set of graphs. We slide a window of width w = 0.2N across the data, shifting
Figure 3.4: Gaussian graphical model of the independent model (diagonal precision matrix; all nodes isolated).

Figure 3.5: Gaussian graphical model of the full model (full precision matrix; fully connected nodes).

by s = 0.1w at each step. This will generate a set of about 50 candidate segmentations. We can repeat this for different settings of w and s. Then we run a fast structure learning algorithm on each windowed segment, and sort the resulting set of graphs by frequency of occurrence. Finally, we pick the top M = 20 to form the set of candidate graphs. We hope this set contains the true graph structures, or at least ones that are very similar.

As shown in Figure 3.6, suppose the vertical pink dotted lines are the true segmentation. When we run a sliding window inside a true segment (e.g., the red window), we hope the structure we learn from this window is similar to the true structure of this segment. When we shift the window one step (e.g., to the blue window), if it is still inside the same segment, we hope the structure we learn is the same, or at least very similar, to the one learned in the previous window. Of course, some windows will overlap two segments (e.g., the black window); the structure learned from such a window represents neither segment. However, this brings no harm, since these "wrong" graph structures will later receive negligible posterior probability. We can choose the number of graphs we want to consider based on how fast we want the algorithms to run.

Figure 3.6: An example of the sliding window method.

Now on each windowed segment, we need to learn the graph structure. The simplest method is the thresholding method, which works as follows:

1. compute the empirical covariance Σ̂,
2. compute the precision Λ = Σ̂^{−1},
3. compute the partial correlation coefficients ρ_ij = −Λ_ij / √(Λ_ii Λ_jj),
4.
set edge G_ij = 0 if |ρ_ij| < θ for some threshold θ (e.g., θ = 0.2).

The thresholding method is simple and fast, but it may not give a good estimator. We can also use the shrinkage method discussed in [23] to get a better estimator of Σ, which helps regularize the problem when the segment is too short and d is large. If we further impose sparsity on the graph structure, we can use the convex optimization techniques discussed in [2] to compute the MAP estimator of the precision Λ under a prior that encourages many entries to go to 0. We form the following problem:

    max_{Λ ≻ 0} log(det(Λ)) − trace(Σ̂ Λ) − ρ ||Λ||_1    (3.11)

where Σ̂ is the empirical covariance, ||Λ||_1 = Σ_ij |Λ_ij|, and ρ > 0 is a regularization parameter which controls the sparsity of the graph. We can solve (3.11) by block coordinate descent algorithms.

We summarize the overall algorithm as follows:

1. Input: data Y_{1:N}, hyperparameters λ for the change point rate and θ for the likelihood functions, observation model obslik, and parameter ρ for the graph structure learning method.
2. S = make overlapping segments from Y_{1:N} by sliding windows
3. G = estimate graph structures estG(Y_s, ρ) : s ∈ S
4. while not converged do
5.   (K, S_{1:K}) = segment(Y_{1:N}, obslik, λ, θ)
6.   G = estG(Y_s, ρ) : s ∈ S_{1:K}
7. end while
8. Infer the model m_i = argmax_m P(m | Y_{s_i}) for i = 1 : K
9. Output: the number of segments K, the set of segments S_{1:K}, the model inferences on each segment m_{1:K}

3.3.3 Likelihood Functions

After we get the set of candidate graphs, we need to be able to compute the marginal likelihood for each graph. However, among undirected graphs we can only do so for decomposable graphs. For non-decomposable graphs, we use an approximation: we add the minimum number of edges needed to make the graph decomposable. Given a decomposable graph, we will assume the following conjugate priors.
Compared with the multivariate linear regression models, everything is the same, except that the prior on Σ is now a Hyper-Inverse-Wishart instead of an Inverse-Wishart. Hence we still have the following model (writing Y_{s+1:t} as Y and n = t − s):

    Y = Hβ + ε,  ε ~ N(0, I_n, Σ)

The priors are

    β ~ N(0, D, Σ)
    Σ ~ HIW(b_0, Σ_0)    (3.12)

where D and I_n are defined as before, and b_0 = N_0 + 1 − d > 0.

When a graph G is decomposable, by standard graph theory [8], G can be decomposed into a list of components and a list of separators, each of which is a clique. Then the joint density of the Hyper-Inverse-Wishart can be decomposed as follows:

    P(Σ | b_0, Σ_0) = ∏_C P(Σ_C | b_0, Σ_0^C) / ∏_S P(Σ_S | b_0, Σ_0^S)    (3.13)

where C ranges over the components and S over the separators. For each C, we have Σ_C ~ IW(b_0 + |C| − 1, Σ_0^C), where Σ_0^C is the block of Σ_0 corresponding to Σ_C. The same applies to each S.

By (2.17), we have

    P(Y_{s+1:t} | q) = P(Y | D, b_0, Σ_0)
                     = ∫∫ P(Y, β, Σ | D, b_0, Σ_0) dβ dΣ
                     = ∫∫ P(Y | β, Σ) P(β | D, Σ) P(Σ | b_0, Σ_0) dβ dΣ

Since P(Y | β, Σ) and P(β | D, Σ) are the same as before, integrating out β gives

    P(Y | D, Σ) = ∫ P(Y | β, Σ) P(β | D, Σ) dβ
                = (|M| / |D|)^{d/2} (2π)^{−nd/2} |Σ|^{−n/2} exp(−(1/2) trace(Y^T P Y Σ^{−1}))    (3.14)

where M = (H^T H + D^{−1})^{−1} and P = (I − H M H^T) are defined as before.

When P is positive definite, P has the Cholesky decomposition Q^T Q. (In practice, we can let the prior D = c·I, where I is the identity matrix and c is a scalar. By setting c to a proper value, we can make sure P is always positive definite. This is very important in high dimensions.) Now let X = QY; then (3.14) can be rewritten as

    P(Y | D, Σ) = (|M| / |D|)^{d/2} (2π)^{−nd/2} |Σ|^{−n/2} exp(−(1/2) trace(X^T X Σ^{−1}))
                = (|M| / |D|)^{d/2} P(X | Σ)    (3.15)

where P(X | Σ) ~ N(0, I_n, Σ), or just P(X | Σ) ~ N(0, Σ).
Since we use decomposable graphs, this likelihood P(X | Σ) can be decomposed as follows [8]:

    P(X | Σ) = ∏_C P(X_C | Σ_C) / ∏_S P(X_S | Σ_S)    (3.16)

where C ranges over the components and S over the separators. For each C, we have X_C ~ N(0, Σ_C), where Σ_C is the block of Σ corresponding to X_C. The same applies to each S.

Multiplying (3.13) and (3.16), we have

    P(X | Σ) P(Σ | b_0, Σ_0) = ∏_C P(X_C | Σ_C) P(Σ_C | b_0, Σ_0^C) / ∏_S P(X_S | Σ_S) P(Σ_S | b_0, Σ_0^S)    (3.17)

Then for each C, the posterior of Σ_C ~ IW(b_0 + |C| − 1 + n, Σ_0^C + X_C^T X_C), and the same holds for each S. As a result, the posterior of Σ ~ HIW(b_0 + n, Σ_0 + X^T X). Hence, integrating out Σ, we have

    P(Y_{s+1:t} | q) = P(Y | D, b_0, Σ_0)
                     = (|M| / |D|)^{d/2} ∫ P(X | Σ) P(Σ | b_0, Σ_0) dΣ
                     = π^{−nd/2} (|M| / |D|)^{d/2} h(G, b_0, Σ_0) / h(G, b_n, Σ_n)    (3.18)

where

    h(G, b, Σ) = ∏_C |Σ_C|^{(b+|C|−1)/2} Z(b + |C| − 1, |C|)^{−1} / ∏_S |Σ_S|^{(b+|S|−1)/2} Z(b + |S| − 1, |S|)^{−1}

    Z(n, d) = π^{d(d−1)/4} ∏_{i=1}^d Γ((n + 1 − i)/2)

    Σ_n = Σ_0 + Y^T P Y,  b_n = b_0 + n

In log space, (3.18) can be rewritten as

    log P(Y_{s+1:t} | q) = −(nd/2) log(π) + (d/2)(log|M| − log|D|) + log(h(G, b_0, Σ_0)) − log(h(G, b_n, Σ_n))    (3.19)

We also use a rank one update similar to the one mentioned earlier. Notice that h(G, b, Σ) contains many local terms. When we evaluate h(G, b, Σ) over different graphs, we cache all local terms; when we later meet the same term, we do not need to re-evaluate it.

3.4 Experimental Results

Now we show experimental results on some synthetic and real data sets.

3.4.1 Synthetic Data

First, we revisit the 2D synthetic data introduced at the beginning of this chapter. We run it with all three models (independent, full and Gaussian graphical). We set the hyperparameters ν = 2 and γ = 2 on σ² for each dimension in the independent model; N_0 = d and Σ_0 = I on Σ, where d is the dimension (in this case d = 2) and I is the identity matrix, in the full model; and b_0 = 1 and Σ_0 = I on Σ in the Gaussian graphical model. We know there are three segments. In Figure 3.7, the raw data are shown in the first two rows.
The independent model thinks there is only one segment, since the posterior probability of the number of segments concentrates at 1; hence it detects no change points. The other two models both think there are three segments, and both detect positions of change points that are close to the ground truth, with some uncertainty.

Figure 3.7: Results on the synthetic 2D data. The top two rows are the raw data. The 3rd row is the ground truth of the change points. From the 4th row to the 6th row, the results are the posterior probabilities of being a change point at each position and the number of segments, generated by the independent model, the full model and the Gaussian graphical model respectively. Results are generated by 'show2D'.

Now let's look at a 10D synthetic data set. We generate the data in a similar way, but this time we have 10 series, and we set the hyperparameters in a similar way. To save space, we only show the first two dimensions, since the rest are similar. From Figure 3.8, we see that, as before, the independent model thinks there is only one segment and hence detects no change point. The full model thinks there might be one or two segments. Also, it detects the position of a change point close to
Note, we plot a graph structure by its adjacency matrix: if nodes i and j are connected, the corresponding entry is a black square; otherwise it is a white square. We plot all candidate graph structures selected by sliding windows in Figure 3.9. In this case, we have 30 graphs in total. For the 1st segment, we notice that the true graph structure is not included in the candidate list. As a result, the GGM picks the graphs that are closest to the true structure; in this case, there are two such graphs, and hence we see some uncertainty over the edges. However, on the 3rd segment, the true structure is included in the candidate list. In this case, the GGM can identify it correctly, and we see very little uncertainty in the posterior probability. As mentioned earlier, the candidate list contains 30 graphs. Some are very different from the true structures; these might be generated by a window overlapping two segments. However, we find these graphs only slow the algorithms down, without hurting the results, because these "useless" graphs all get very low posterior probabilities.

Finally, let's look at a 20D synthetic data set. We generate data in a similar way and set the hyperparameters in a similar way. To save space, we only show the first two dimensions, since the rest are similar. From Figure 3.10, we see that, as before, the independent model thinks there is only one segment and hence detects no change point. The full model clearly oversegments the data. Again, only the Gaussian graphical model correctly segments the data, and it detects positions that are very close to the ground truth. Comparing with the results from the 2D and 10D cases, we find that the independent model fails in all cases, since the changes are on the correlation structures, which the independent model cannot represent. The full model can detect changes in the low dimensional case, but fails in high dimensional cases, because it has more parameters to learn. The Gaussian graphical model can detect changes in all cases, and it is more confident in high dimensional cases, since sparse structures matter more in high dimensions.

3.4.2 U.S. Portfolios Data

Now let's look at two real data sets which record annually rebalanced, value-weighted monthly returns of U.S. portfolios from July 1926 to December 2001, 906 months in total. The first data set has five industries (manufacturing, utilities, shops, finance and other) and the second has thirty industries. The first data set has been previously studied by Talih and Hengartner in [24]. We set the hyperparameters ν = 2 and γ = 2 on σ² for each dimension in the independent model; N₀ = 5 and Σ₀ = I on Σ in the full model; and b₀ = 1 and Σ₀ = I on Σ in the Gaussian graphical model. The raw data are shown in the first five rows of Figure 3.11. Talih's result is shown in the 6th row. Rows 7 to 9 show the results from the independent model, the full model and the GGM respectively. We find that the full model and the GGM have similar results: both think there are roughly 7 segments, and they agree on 4 out of 6 change points. The independent model seems to oversegment. In this problem, we do not know the ground truth or the true graph structures. Note that only 2 of the change points we discovered coincide with the results of Talih (namely 1959 and 1984). There are many possible reasons for this. First, they assume a different prior over models (the graph changes by one arc at a time between neighboring segments); second, they use reversible jump MCMC; third, their model requires pre-specifying the number of change points, and in this case the positions of the change points are very sensitive to the number of change points.
We think their results could be oversegmented. In this data set, sliding windows generate 17 graphs. As usual, based on the segmentation estimated by the GGM, we plot the posterior over all 17 graph structures, P(G|Y_{s:t}), the MAP graph structure G_MAP = argmax_G P(G|Y_{s:t}), and the marginal edge probability, P(G_ij = 1|Y_{s:t}), computed using Bayesian model averaging (gray squares represent edges about which we are uncertain), in the bottom three rows of Figure 3.11. From the MAP graph structures in our results, we find that Talih's assumption that the graph structure changes by one arc at a time between neighboring segments may not hold. For example, between the third and the fourth segment in our results, the graph structure changes by three arcs, rather than one.

The second data set is similar to the first one, but has 30 series. Since the results on synthetic data show that in higher dimensions the Gaussian graphical model performs much better than the independent model and the full model, we only show the result from the Gaussian graphical model, in Figure 3.12. We show the raw data of the first two industries in the top two rows. In the third row, we show the resulting posterior probability of a change point at each position and the posterior on the number of segments generated by the Gaussian graphical model. We still do not know the ground truth or the true graph structures. We also show the MAP graph structures detected on three consecutive regions. We can see that the graph structures are very sparse, and that they differ by more than one arc.

3.4.3 Honey Bee Dance Data

Finally, we analyse the honey bee dance data set used in [19, 20]. It consists of the x and y coordinates of a honey bee, and its head angle θ, as it moves around an enclosure, observed by an overhead camera.
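The head angle θ is circular, so a trajectory crossing the ±π boundary produces a large jump in the raw series that is purely an artifact of the representation; this is why the data are preprocessed by replacing θ with the pair (sin θ, cos θ). A minimal sketch of that preprocessing (the function name is ours):

```python
import math
import numpy as np

def angle_features(theta):
    # Map a circular angle series (radians) to the smooth pair (sin, cos):
    # angles just below +pi and just above -pi are nearby in feature space
    # even though the raw values differ by almost 2*pi.
    theta = np.asarray(theta, dtype=float)
    return np.column_stack([np.sin(theta), np.cos(theta)])
```

With this encoding, a wrap-around no longer looks like a change point to the segmentation model.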
Two examples of the data, together with a ground truth segmentation (created by human experts), are shown in Figures 3.13 and 3.14. We also show the results of segmenting this data with a first-order autoregressive AR(1) model, using either the independent model or the full covariance model. We preprocessed the data by replacing θ with sin θ and cos θ, to overcome the discontinuity as the bee moves between −π and π. We set the hyperparameters ν = 2 and γ = 0.02 on σ² for each dimension in the independent model, and N₀ = 4 and Σ₀ = 0.01·I on Σ in the full model. In both figures, the results from the full covariance model are very close to the ground truth, but the results from the independent model are clearly oversegmented.

Figure 3.8: Results on the 10D synthetic data. The top two rows are the first two dimensions of the raw data. The 3rd row is the ground truth of change points. Rows 4 to 6 show the posterior probability of a change point at each position and the posterior on the number of segments generated by the independent model, the full model and the Gaussian graphical model respectively. In the bottom 2 rows, we plot the posterior over all graph structures, P(G|Y_{s:t}), the true graph structure, the MAP structure G_MAP = argmax_G P(G|Y_{s:t}), and the marginal edge probability, P(G_ij = 1|Y_{s:t}), on the 3 segments detected by the Gaussian graphical model. Results are generated by 'show10D'.

Figure 3.9: Candidate list of graphs generated by sliding windows in the 10D data. We plot each graph structure by its adjacency matrix: if nodes i and j are connected, the corresponding entry is a black square; otherwise it is a white square.
Results are generated by 'show10D'.

Figure 3.10: Results on the 20D synthetic data. The top two rows are the first two dimensions of the raw data. The 3rd row is the ground truth of change points. Rows 4 to 6 show the posterior probability of a change point at each position and the posterior on the number of segments generated by the independent model, the full model and the Gaussian graphical model respectively. In the bottom 2 rows, we plot the posterior over all graph structures, P(G|Y_{s:t}), the true graph structure, the MAP structure G_MAP = argmax_G P(G|Y_{s:t}), and the marginal edge probability, P(G_ij = 1|Y_{s:t}), on the 3 segments detected by the Gaussian graphical model. Results are generated by 'show20D'.

Figure 3.11: Results on the U.S. portfolios data of 5 industries. The top five rows are the raw data, representing annually rebalanced, value-weighted monthly returns in 5 industries from July 1926 to December 2001, 906 months in total. The 6th row is the result by Talih. Rows 7 to 9 show the posterior probability of a change point at each position and the posterior on the number of segments generated by the independent model, the full model and the Gaussian graphical model respectively. In the bottom three rows, we plot the posterior over all graph structures, P(G|Y_{s:t}), the MAP graph structure G_MAP = argmax_G P(G|Y_{s:t}), and the marginal edge probability, P(G_ij = 1|Y_{s:t}). Results are generated by 'showPortfolios'.

Figure 3.12: Results on the U.S. portfolios data of 30 industries. We show the raw data of the first two industries in the top two rows.
The third row shows the posterior probability of a change point at each position and the posterior on the number of segments generated by the Gaussian graphical model. In the fourth row, we show the MAP graph structures in 3 consecutive regions detected by the Gaussian graphical model. Results are generated by 'showPort30'.

Figure 3.13: Results on honey bee dance data 4. The top four rows are the raw data, representing the x, y coordinates of the honey bee and the sin and cos of its head angle θ. The 5th row is the ground truth. The 6th row is the result from the independent model and the 7th row is the result from the full covariance model. Results are generated by 'showBees(4)'.

Figure 3.14: Results on honey bee dance data 6. The top four rows are the raw data, representing the x, y coordinates of the honey bee and the sin and cos of its head angle θ. The 5th row is the ground truth. The 6th row is the result from the independent model and the 7th row is the result from the full covariance model. Results are generated by 'showBees(6)'.

Chapter 4

Conclusion and Future Work

In this thesis, we extended Fearnhead's algorithms to the case of multidimensional series. This allowed us to detect changes in correlation structures, as well as changes in mean, variance, etc. We modeled the correlation structures using undirected Gaussian graphical models. This allowed us to estimate the changing topology of dependencies among series, in addition to detecting change points. This is particularly useful in high dimensional cases because of sparsity. We could also model the correlation structures by directed graphs, for which we can compute the marginal likelihood of any graph structure.
However, we do not have a fast way to generate a candidate list of graph structures, since all current structure learning algorithms for directed graphs are too slow. Hence, if a fast way to learn the structure of directed graphs becomes available, we could use directed graphs as well. Finally, the conditional independence property might sometimes not be reasonable, or we might want to model other constraints on changes across consecutive segments. This requires us to come up with new models.

Bibliography

[1] Jim Albert and Patricia Williamson. Using model/data simulation to detect streakiness. The American Statistician, 55:41-50, 2001.

[2] Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont, and Georges Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. Proceedings of the 23rd International Conference on Machine Learning, pages 89-96, 2006.

[3] Daniel Barry and J. A. Hartigan. Product partition models for change point problems. The Annals of Statistics, 20(1):260-279, 1992.

[4] Daniel Barry and J. A. Hartigan. A Bayesian analysis for change point problems. Journal of the American Statistical Association, 88(421):309-319, 1993.

[5] Albright S. C., Albert J., Stern H. S., and Morris C. N. A statistical analysis of hitting streaks in baseball. Journal of the American Statistical Association, 88:1175-1196, 1993.

[6] Bradley P. Carlin, Alan E. Gelfand, and Adrian F. M. Smith. Hierarchical Bayesian analysis of changepoint problems. Applied Statistics, 41(2):389-405, 1992.

[7] J. Carpenter, Peter Clifford, and Paul Fearnhead. An improved particle filter for non-linear problems. IEE Proceedings of Radar and Sonar Navigation, 146:2-7, 1999.

[8] Carlos Marinho Carvalho. Structure and sparsity in high-dimensional multivariate analysis. PhD thesis, Duke University, 2006.

[9] Nicolas Chopin. Dynamic detection of change points in long time series.
The Annals of the Institute of Statistical Mathematics, 2006.

[10] D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B, 60:333-350, 1998.

[11] David G. T. Denison, Christopher C. Holmes, Bani K. Mallick, and Adrian F. M. Smith. Bayesian Methods for Nonlinear Classification and Regression. John Wiley & Sons, Ltd, Chichester, West Sussex, England, 2002.

[12] Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.

[13] Paul Fearnhead. Exact Bayesian curve fitting and signal segmentation. IEEE Transactions on Signal Processing, 53:2160-2166, 2005.

[14] Paul Fearnhead. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203-213, 2006.

[15] Paul Fearnhead and Peter Clifford. Online inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society, Series B, 65:887-899, 2003.

[16] Paul Fearnhead and Zhen Liu. Online inference for multiple change point problems. 2005.

[17] R. G. Jarrett. A note on the intervals between coal-mining disasters. Biometrika, 66(1):191-193, 1979.

[18] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711-732, 1995.

[19] Sang Min Oh, James M. Rehg, Tucker Balch, and Frank Dellaert. Learning and inference in parametric switching linear dynamic systems. IEEE International Conference on Computer Vision, 2:1161-1168, 2005.

[20] Sang Min Oh, James M. Rehg, and Frank Dellaert. Parameterized duration modeling for switching linear dynamic systems. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:1694-1700, 2006.

[21] Elena Punskaya, Christophe Andrieu, Arnaud Doucet, and William J. Fitzgerald.
Bayesian curve fitting using MCMC with application to signal segmentation. IEEE Transactions on Signal Processing, 50:747-758, 2002.

[22] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer, New York, 2004.

[23] Juliane Schafer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, 2005.

[24] Makram Talih and Nicolas Hengartner. Structural learning with time-varying components: tracking the cross-section of financial time series. Journal of the Royal Statistical Society, Series B, 67:321-341, 2005.

[25] T. Y. Yang and L. Kuo. Bayesian binary segmentation procedure for a Poisson process with multiple changepoints. Journal of Computational and Graphical Statistics, 10:772-785, 2001.

[26] Tae Young Yang. Bayesian binary segmentation procedure for detecting streakiness in sports. Journal of the Royal Statistical Society, Series A, 167:627-637, 2004.