On-line EM and Quasi-Bayes or: How I Learned to Stop Worrying and Love Stochastic Approximation. Kejie Bao, 2003

On-line EM and Quasi-Bayes or: How I Learned to Stop Worrying and Love Stochastic Approximation

by Kejie Bao
B.E., Zhejiang University, 2001

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard.

The University of British Columbia
August 2003
© Kejie Bao, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

The EM algorithm is one of the most popular statistical learning algorithms. Unfortunately, it is a batch learning method. For large data sets and real-time systems, we need to develop on-line methods. In this thesis, we present a comprehensive study of on-line EM algorithms. We use Bayesian theory to propose a new on-line EM algorithm for multinomial mixtures. Based on this theory, we show that there is a direct connection between the setting of Bayes priors and the so-called learning rates of stochastic approximation algorithms, such as on-line EM and quasi-Bayes. Finally, we present extensive simulations, comparisons and parameter sensitivity studies on both synthetic data and documents with text, images and music.

Contents

Abstract
Contents
Publications
Acknowledgements
Dedication
1 Introduction
2 Background: The EM Algorithm
  2.1 General Derivation for EM
  2.2 Batch EM for Mixtures of Multinomials with ML Estimates
3 On-line EM Algorithms
  3.1 On-line EM Algorithm via Weighted Means
    3.1.1 Weighted-Mean Expression for Batch Updates
    3.1.2 Weighted-Mean Expression for On-line Updates
    3.1.3 Forgetting Factor
    3.1.4 On-line EM for Mixtures of Multinomials
  3.2 On-line EM Algorithm within the Bayesian Framework
    3.2.1 Coherent Bayes Solution for Sequential Learning of Mixtures
    3.2.2 Quasi-Bayes Sequential Learning for Mixtures of Multinomials
4 Convergence Issues
  4.1 Learning Rate Schedules
5 Experiments
  5.1 Experiments on Synthetic Data
  5.2 Modelling Documents with Text, Images and Music
    5.2.1 Model Specification
    5.2.2 On-line EM Algorithms
    5.2.3 Experiments and Results
6 Conclusion
7 Future Work
Bibliography
Appendix A: $Q^{ML}$ Function Expansion
Appendix B: M Step for Mixtures of Multinomials with ML Estimates

Publications

Jim Little, Nando de Freitas, Jesse Hoey, Kejie Bao and Eric Brochu. Analysis of driver behavior with computer vision. NISSAN-UBC (FY2002) partnership, Department of Computer Science, The University of British Columbia, 2003.

Eric Brochu, Nando de Freitas and Kejie Bao. The Sound of an Album Cover: A Probabilistic Approach to Multimedia. Artificial Intelligence and Statistics, 2003.

Acknowledgements

I owe an immense debt of gratitude to my supervisor, Nando, a very kind and enthusiastic person to work with. His inspiring guidance from the formative stages, his stimulating ideas and sound advice were invaluable to me.
Without him, the thesis would not have been completed. I am grateful to Will for his careful reading and helpful comments. I would also like to thank Eric for his cooperation and timely help. Thanks to Naizhi, for his patience in reading the first draft and for giving me many useful suggestions. Thanks to Qi for her kindness all along; she is truly a friend who's always there. I also want to thank my good friends, Jihong Ren, Rui Li, Juan Li, Kan Cai, Zhimin Chen, Lin Zhong, Jian Chen, Xiaohong Zhao, Yan Zhao, Shaofeng Bu and Peng Peng, for endowing me with unforgettable memories in Vancouver.

KEJIE BAO
The University of British Columbia
August 2003

Dedicated to my parents, who endowed me with encouragement, inspiration, love and care throughout the course of the thesis.

Chapter 1: Introduction

A great many learning problems, such as filling in missing data, inferring latent variables and estimating parameters of statistical models, fall back on the expectation-maximization (EM) algorithm, which performs maximum likelihood (ML) or maximum a posteriori (MAP) estimation for incomplete data. The earliest work dates back several decades, to when McKendrick [1] applied the idea to medical problems. Hartley [2] first explicitly introduced maximum likelihood estimation (MLE) from incomplete data in the general case. Baum et al. [3] applied the algorithm to the special case of hidden Markov models. The quintessential work by Dempster, Laird and Rubin [4] (DLR) formalized EM, proved its convergence and demonstrated its broad applicability by unifying strands of ostensibly unrelated applications under the banner of EM. The scope of these applications is surveyed in a recent book by McLachlan and Krishnan [5].

From an intuitive perspective, the EM algorithm alternates between two procedures after an initial guess. It consists of an expectation (E) step, which fills in the missing data by computing their distribution given the observed variables and the current parameter estimates, and a maximization (M) step, which re-estimates the parameters to maximize the likelihood under the premise that the distribution computed in the E step is correct. It was proven that the true likelihood never decreases as the iterations proceed [4]; thus convergence to a local maximum of the likelihood function is guaranteed.

However, batch EM fails to scale well when the data set becomes huge, or when a real dynamic environment demands on-line processing, so that the data arrive one after another rather than in a batch and the model parameters are updated using just the current data at each time slice. Applications such as on-line querying on music, images and text, and on-line motion analysis, naturally fit this scenario. Regarded as a stochastic approximation method [6], on-line EM algorithms have been proposed for various restricted domains [7][8][9][10]. A general presentation for exponential family models with forgetting factors appeared in [11][12]. Neal and Hinton [13] introduced a free energy formulation of the EM algorithm and also presented the incremental EM algorithm and its variants with a partial E step, which save computation over batch EM for big data sets. Yet these methods are still inappropriate for real-time learning where new data arrive at each time frame, inasmuch as they require additional storage for either all the data or sufficient statistics computed from all the data at each iteration. Sato also proposed in [11] an appropriate choice for the learning rate, the (1/t)-schedule.
He compared the performance of batch EM, on-line EM and stochastic gradient algorithms on simple examples with mixtures of Gaussians, and showed that on-line EM beats its batch and stochastic gradient counterparts in both convergence rate and accuracy. The reason is that on-line EM can avoid being trapped at a local maximum, thanks to its stochastic nature in the early stage of learning. It does not, however, guarantee convergence to the global maximum.

A much earlier treatment of on-line EM appeared under the name quasi-Bayes in [14]. There, it was shown that the algorithm can also be interpreted as a stochastic approximation procedure, and therefore its convergence properties were established using stochastic approximation theory. In that article, a series of simulations compared the quasi-Bayes procedure with the exact Bayes solution. The results of the former were quite close to those of the latter, while the former greatly reduced the computational complexity.

We extend the quasi-Bayes procedure of [14] to the case of mixtures of multinomials. We derive the on-line update equations for all the parameters, with the learning rates explained by priors in posterior mean estimation (PME). The connection between the learning rates in the update equations and Bayesian prior models motivated our research on learning rate schedules.

In this thesis, a detailed recipe of learning rate schedules for the proposed on-line EM algorithm is presented, based on performance analysis and parameter sensitivity analysis. We carry out comprehensive comparisons between batch EM and the on-line EM algorithms derived from the two different views. The results show that the on-line EM algorithms, which exploit this stochastic property, empirically outperform their batch counterpart. The real-data application reveals that the proposed on-line EM performs better than Sato's on-line EM. Moreover, our experiments attest that approximately the same assignment of the PME priors and of the schedule parameters for the proposed on-line EM generates similar performance. This implies that researchers can use ideas from Bayesian statistics to come up with reasonable learning rates for stochastic approximation schemes.

This thesis is organized as follows: we first review batch EM algorithms for mixtures of multinomials. We then introduce on-line EM algorithms and derive them from two different perspectives. The convergence issue, including learning rate schedules, is discussed. Finally, we present extensive simulations, comparisons and parameter sensitivity studies on both synthetic and real data for mixtures of multinomials.

Chapter 2: Background: The EM Algorithm

The EM algorithm is a general approach for maximum likelihood or maximum a posteriori estimation in statistical models with latent variables. It can be derived in many different ways. We adopt one of the most insightful explanations, in terms of lower bound maximization [13][15]. In this derivation, the E-step can be interpreted as computing a local lower bound of the likelihood function or posterior distribution, whereas the M-step maximizes the bound, improving the estimates for the parameters.

2.1 General Derivation for EM

Let $x$ denote the observed variables, and let $z$ denote the latent variables. We wish to find the MLE for the parameters of a probability model $p(x, z|\theta)$.
The marginal probability of the data $x$ is $p(x|\theta) = \sum_z p(x, z|\theta)$. Given the observed data $x$, our goal is to find the value of the parameters $\theta$ that maximizes the log-likelihood, $l(\theta) = \log p(x|\theta) = \log \sum_z p(x, z|\theta)$. The pivotal problem encountered while maximizing it is the logarithm of a big sum, which cannot be decoupled. We therefore construct a tractable lower bound $\mathcal{L}(q, \theta)$, where $q$ refers to some distribution over $z$ so that $\sum_z q(z|x) = 1$. It is proven in [13] that if a local maximum of $\mathcal{L}$ occurs at $q^*$ and $\theta^*$, a local maximum of $l$ is also reached at $\theta^*$. Therefore, we can solve the original problem by finding a local maximum of $\mathcal{L}$. We define the lower bound function $\mathcal{L}(q, \theta)$ as follows:

$$\mathcal{L}(q, \theta) \triangleq E_{q(z|x)}[\log p(x, z|\theta)] + H(q) \qquad (2.1)$$

where $H(q) = -E_q[\log q]$ is the entropy of the distribution $q$, and $E_q$ denotes expectation with respect to $q$. The function $\mathcal{L}$ can also be interpreted as the "variational free energy" of statistical physics, with the physical states being $z$ and the energy of a state being $-\log p(x, z|\theta)$ [13]. We can also show that the difference between $l(\theta)$ and $\mathcal{L}(q, \theta)$ is the Kullback-Leibler (KL) divergence between $q(z|x)$ and $p(z|x, \theta)$:

$$
\begin{aligned}
l(\theta) - \mathcal{L}(q, \theta) &= \log p(x|\theta) - \sum_z q(z|x) \log \frac{p(x, z|\theta)}{q(z|x)} \\
&= \sum_z q(z|x) \log p(x|\theta) - \sum_z q(z|x) \log \frac{p(x, z|\theta)}{q(z|x)} \\
&= \sum_z q(z|x) \log \frac{p(x|\theta)\, q(z|x)}{p(x, z|\theta)} \\
&= \sum_z q(z|x) \log \frac{q(z|x)}{p(z|x, \theta)} = \mathcal{D}(q(z|x) \,\|\, p(z|x, \theta)) \qquad (2.2)
\end{aligned}
$$

Since the KL divergence is non-negative, and is zero only between identical distributions, we maximize $\mathcal{L}(q, \theta)$ by minimizing the difference between $l(\theta)$ and $\mathcal{L}(q, \theta)$. As proven in [13], for a fixed value of $\theta$ there is a unique distribution, $q(z|x) = p(z|x, \theta)$, that maximizes $\mathcal{L}(q, \theta)$; if $q(z|x) = p(z|x, \theta)$, then $\mathcal{L}(q, \theta) = l(\theta)$. The EM algorithm first finds an optimal lower bound $\mathcal{L}(q, \theta^{(t)})$ at the current estimate $\theta^{(t)}$ by taking $q^{(t+1)} = p(z|x, \theta^{(t)})$ in the "expectation" (E) step. It then maximizes this bound to obtain the new estimate in the "maximization" (M) step. Therefore, the EM algorithm can be formalized as follows:

- E step: Compute $q^{(t+1)} = p(z|x, \theta^{(t)})$.
- M step: $\theta^{(t+1)} = \arg\max_\theta \mathcal{L}(q^{(t+1)}, \theta)$.

In Eqn. (2.1), $H(q)$ is independent of $\theta$, thus maximizing $E_{q(z|x)}[\log p(x, z|\theta)]$ with respect to $\theta$ has the same effect as maximizing $\mathcal{L}(q, \theta)$ with respect to $\theta$. An alternative, equivalent view of the EM algorithm is as follows:

- E step: Compute the expected complete log-likelihood with respect to the distribution of the latent variables, $Q^{ML} = E_{p(z|x, \theta^{(t)})}[\log p(x, z|\theta)]$.
- M step: Perform the maximization $\theta^{(t+1)} = \arg\max_\theta Q^{ML}$.

2.2 Batch EM for Mixtures of Multinomials with ML Estimates

Mixture-density parameter estimation for finite mixture models [16] is a common application of EM. Finite mixture models provide a simple framework for modelling population heterogeneity. They have proven to be sound tools for data analysis, such as mixtures of factor analyzers [17], principal component analyzers [18], mixtures of HMMs (switching hidden Markov models) [30] and mixtures of Kalman filters (switching state-space models) [20]. Let $X = (x_1, \ldots, x_{n_x})$ denote a collection of $n_x$ random variables (for simplicity, we use $x_i$ to denote both the random variable and its realization), each belonging to one of the classes $C_1, C_2, \cdots, C_{n_c}$. Given the non-negative mixing weights $\lambda = (\lambda_1, \lambda_2, \cdots, \lambda_{n_c})$, with $\sum_{c=1}^{n_c} \lambda_c = 1$, and the corresponding density functions $f_1, f_2, \cdots, f_{n_c}$, the random variables $x_i$ ($i = 1, 2, \cdots, n_x$) have the following distribution:

$$f(x_i|\varphi, \lambda) = \sum_{c=1}^{n_c} \lambda_c\, f_c(x_i|\varphi_c) \qquad (2.3)$$

where $\varphi_c$ denotes the density parameter for component $c$.
Let $\theta = (\lambda, \varphi)$ denote the parameters of this model. Learning the mixture amounts to estimating $\theta$, which is usually done with the EM algorithm. For concreteness, we show the derivation of the EM algorithm for mixtures of multinomials with MLE. In this case, each variable $x_i$ is drawn from a mixture of multinomials, and each $x_i$ can be represented as $x_i = (x_{i,1}, \ldots, x_{i,n_a})$, a collection of integer-valued random variables representing event counts; $x_{i,a}$ represents the number of times the $a$th event occurs in a set of $n_i$ independent trials. Therefore $\sum_{a=1}^{n_a} x_{i,a} = n_i$ ($n_i$ may vary for different $i$). Each variable is drawn from the following finite probabilistic mixture model:

$$p(x_i|\theta) = \sum_{c=1}^{n_c} \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \qquad (2.4)$$

The model parameters are $\theta = (\lambda, \varphi)$. (We use italic Roman letters to refer to both individual and compound items; for example, $\varphi = (\varphi_{1,1}, \ldots, \varphi_{n_a, n_c})$.) $\varphi$ denotes the parameters of the mixture component densities, with $\sum_{a=1}^{n_a} \varphi_{a,c} = 1$. $\lambda$ denotes the $n_c$ mixing coefficients $\lambda_c$ of the $n_c$ mixed component densities, with $\sum_{c=1}^{n_c} \lambda_c = 1$. To generate a variable from the mixture density, we first choose a mixture component with probability $\lambda_c$ and then sample the variable from that component's density.

For convenience in the inference problems of such a model, we introduce latent variables $z_i \in \{1, \ldots, n_c\}$ indicating which component density generated each variable. Obviously, $p(z_i = c) = \lambda_c$; thus $z_i = c$ if the $i$th sample was generated by the $c$th mixture component. In other words, $z_i$ is assumed to be drawn from a multinomial distribution, $z_i \sim \mathcal{M}_{n_c}(1, \lambda)$, which admits the density

$$p(z_i|\lambda) = \prod_{c=1}^{n_c} \lambda_c^{I_c(z_i)} \qquad (2.5)$$

where $I_c(z_i) = 1$ if $x_i$ belongs to component $c$ and $I_c(z_i) = 0$ otherwise. Assuming $x_i|(z_i = c, \theta) \sim \mathcal{M}_{n_a}(n_i, \varphi_c)$, we can now express the mixture model in the following form, which facilitates our derivation for parameter estimation:

$$p(x_i, z_i|\theta) = p(x_i|z_i, \theta)\, p(z_i|\theta) = \prod_{c=1}^{n_c} \left( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \right)^{I_c(z_i)} \qquad (2.6)$$

By marginalizing over $z_i$ in Eqn. (2.6), we obtain Eqn. (2.4). Since the data are i.i.d., multiplying the mixture densities together gives the complete likelihood:

$$p(x, z|\theta) = \prod_{i=1}^{n_x} \prod_{c=1}^{n_c} \left( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \right)^{I_c(z_i)} \qquad (2.7)$$

Under our modelling assumptions, the expected complete log-likelihood at time $t$ is given by (see Appendix A):

$$Q^{ML} = E_{p(z|x, \theta^{(t)})}[\log p(x, z|\theta)] = \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} p(z_i = c|x_i, \theta^{(t)}) \log \left( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \right) \qquad (2.8)$$

Let $\xi_{i,c} = p(z_i = c|x_i, \theta^{(t)})$ for notational simplicity. Eqn. (2.8) is what the E step computes. In the M step, we need to maximize $Q^{ML}$ subject to the constraints $\sum_{c=1}^{n_c} \lambda_c = 1$ and $\sum_{a=1}^{n_a} \varphi_{a,c} = 1$. Appendix B accomplishes this by introducing Lagrange multipliers. The results are:

$$\lambda_c = \frac{1}{n_x} \sum_{i=1}^{n_x} \xi_{i,c} \qquad (2.9)$$

$$\varphi_{a,c} = \frac{\sum_{i=1}^{n_x} \xi_{i,c}\, x_{i,a}}{\sum_{i=1}^{n_x} \xi_{i,c}\, n_i} \qquad (2.10)$$

Now we can summarize the EM algorithm for mixtures of multinomials:

- E step: Compute the distribution of the latent variables:

$$\xi_{i,c} = p(z_i = c|x_i, \theta^{(t)}) = \frac{\lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}}}{\sum_{c'=1}^{n_c} \lambda_{c'} \prod_{a=1}^{n_a} \varphi_{a,c'}^{x_{i,a}}} \qquad (2.11)$$

- M step: Perform the maximization $\theta^{(t+1)} = \arg\max_\theta Q^{ML}$ under the constraints $\sum_{c=1}^{n_c} \lambda_c = 1$ and $\sum_{a=1}^{n_a} \varphi_{a,c} = 1$, yielding Eqns. (2.9) and (2.10).
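As a concrete illustration of Eqns. (2.9)-(2.11), the following minimal NumPy sketch implements the batch E and M steps. The function and variable names (batch_em, lam, phi, xi) are ours, not the thesis's, and the log-space E step is a standard numerical-stability choice rather than part of the derivation.

```python
# A minimal sketch of batch EM for a mixture of multinomials (Eqns. 2.9-2.11).
import numpy as np

def batch_em(X, n_c, n_iter=50, seed=0):
    """X: (n_x, n_a) array of event counts; n_c: number of clusters."""
    rng = np.random.default_rng(seed)
    n_x, n_a = X.shape
    lam = np.full(n_c, 1.0 / n_c)                  # mixing weights lambda_c
    phi = rng.dirichlet(np.ones(n_a), size=n_c).T  # phi[a, c], columns sum to 1
    for _ in range(n_iter):
        # E step (Eqn. 2.11), in log space; the multinomial coefficient
        # is constant across c and cancels in the normalization.
        log_resp = np.log(lam) + X @ np.log(phi)            # (n_x, n_c)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        xi = np.exp(log_resp)
        xi /= xi.sum(axis=1, keepdims=True)                 # xi[i, c]
        # M step (Eqns. 2.9 and 2.10)
        lam = xi.mean(axis=0)
        phi = (X.T @ xi) / (X.sum(axis=1) @ xi)             # denom: sum_i xi n_i
    return lam, phi, xi
```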
Chapter 3: On-line EM Algorithms

Batch EM fails to scale well when the number of data points becomes very large or when a real dynamic environment demands on-line processing. To surmount this problem, on-line EM algorithms are called for. In this situation, the training data $x$ are supplied one by one and the model parameters $\theta$ are updated within each time frame using the current datum. Applications such as on-line querying on music, images and text, and on-line motion analysis, naturally fit this scenario. In this chapter, on-line EM algorithms are derived in two ways, as suggested by Sato [11][12] and by Smith and Makov [14]. The former expresses the sufficient statistics as weighted means, while the latter, by introducing priors, casts the algorithm in the Bayesian framework.

3.1 On-line EM Algorithm via Weighted Means

3.1.1 Weighted-Mean Expression for Batch Updates

As Eqns. (2.9) and (2.10) show, we have estimates of the parameters in the M step. We can also express them in terms of weighted means of some sufficient statistics. We define $\langle f(x) \rangle_c(n_x)$, the weighted mean of $f(x)$ with respect to the posterior probability $\xi_{i,c}$, by

$$\langle f(x) \rangle_c(n_x) = \frac{1}{n_x} \sum_{i=1}^{n_x} f(x_i)\, \xi_{i,c} \qquad (3.1)$$

Here $n_x$ denotes the total number of data points. Therefore, we can write the M step updates as:

$$\lambda_c = \langle 1 \rangle_c(n_x) \qquad (3.2)$$

$$\varphi_{a,c} = \frac{\langle x_a \rangle_c(n_x)}{\langle \sum_{a'} x_{a'} \rangle_c(n_x)} \qquad (3.3)$$

3.1.2 Weighted-Mean Expression for On-line Updates

In on-line processing, we supply one data point per iteration, and the estimates change after each observation. We use superscripts to denote time stamps: the first $t$ data points are written $x^{(1)}, x^{(2)}, \cdots, x^{(t)}$, and $\theta^{(t)}$ denotes the parameter estimate after the $t$th observation $x^{(t)}$. We define another weighted mean of $f(x)$ at time $t$, $\langle f(x) \rangle_c^{(t)}$, as follows:

$$\langle f(x) \rangle_c^{(t)} \triangleq \eta(t) \sum_{\tau=1}^{t} \left( \prod_{s=\tau+1}^{t} \lambda(s) \right) f(x^{(\tau)})\, p(z^{(\tau)} = c\,|\,x^{(\tau)}, \theta^{(\tau-1)}) \qquad (3.4)$$

$$\eta(t) = \left( \sum_{\tau=1}^{t} \prod_{s=\tau+1}^{t} \lambda(s) \right)^{-1} \qquad (3.5)$$

Here the parameter $\lambda(t)$ ($0 \le \lambda(t) \le 1$, $t = 2, 3, \cdots$) is a time-dependent forgetting factor, discussed in Section 3.1.3, and $\eta(t)$ ($1/t \le \eta(t) \le 1$) is a normalizing coefficient which acts as a learning rate. In the following, we derive the discounted average from Eqn. (3.4) and Eqn. (3.5). Splitting off the $\tau = t$ term, and using the fact that the empty product $\prod_{s=t+1}^{t} \lambda(s) = 1$,

$$\langle f(x) \rangle_c^{(t)} = \eta(t)\lambda(t) \sum_{\tau=1}^{t-1} \left( \prod_{s=\tau+1}^{t-1} \lambda(s) \right) f(x^{(\tau)})\, p(z^{(\tau)} = c\,|\,x^{(\tau)}, \theta^{(\tau-1)}) + \eta(t)\, f(x^{(t)})\, p(z^{(t)} = c\,|\,x^{(t)}, \theta^{(t-1)})$$

Since $\langle f(x) \rangle_c^{(t-1)} = \eta(t-1) \sum_{\tau=1}^{t-1} \left( \prod_{s=\tau+1}^{t-1} \lambda(s) \right) f(x^{(\tau)})\, p(z^{(\tau)} = c\,|\,x^{(\tau)}, \theta^{(\tau-1)})$, we have:

$$\langle f(x) \rangle_c^{(t)} = \frac{\eta(t)\lambda(t)}{\eta(t-1)} \langle f(x) \rangle_c^{(t-1)} + \eta(t)\, f(x^{(t)})\, p(z^{(t)} = c\,|\,x^{(t)}, \theta^{(t-1)}) \qquad (3.6)$$

From Eqn. (3.5), we know that

$$\frac{1}{\eta(t)} = \sum_{\tau=1}^{t} \prod_{s=\tau+1}^{t} \lambda(s) = 1 + \lambda(t) \sum_{\tau=1}^{t-1} \prod_{s=\tau+1}^{t-1} \lambda(s) = \frac{\lambda(t)}{\eta(t-1)} + 1 \qquad (3.7)$$

From Eqn. (3.6) and Eqn. (3.7), we obtain

$$\langle f(x) \rangle_c^{(t)} = \langle f(x) \rangle_c^{(t-1)} + \eta(t) \left[ f(x^{(t)})\, p(z^{(t)} = c\,|\,x^{(t)}, \theta^{(t-1)}) - \langle f(x) \rangle_c^{(t-1)} \right]$$

If we write this expression as

$$\langle f(x) \rangle_c^{(t)} = (1 - \eta(t)) \langle f(x) \rangle_c^{(t-1)} + \eta(t)\, f(x^{(t)})\, p(z^{(t)} = c\,|\,x^{(t)}, \theta^{(t-1)})$$

we can see that the current estimate is a weighted sum of the previous estimate and the expectation from the current data point.
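The recursion of Eqns. (3.6) and (3.7) is simple to implement. The sketch below updates one discounted statistic for one cluster; the function and argument names are illustrative, not from the thesis.

```python
# A minimal sketch of the discounted weighted-mean recursion (Eqns. 3.6, 3.7).
def discounted_mean_step(stat, eta_prev, lam_t, f_xt, resp_t):
    """One on-line update of <f(x)>_c for a single cluster c.

    stat:     previous weighted mean <f(x)>_c^{(t-1)}
    eta_prev: previous learning rate eta(t-1)
    lam_t:    forgetting factor lambda(t)
    f_xt:     f(x^{(t)}) for the current datum
    resp_t:   posterior responsibility p(z^{(t)} = c | x^{(t)}, theta^{(t-1)})
    """
    eta_t = 1.0 / (lam_t / eta_prev + 1.0)          # Eqn. (3.7)
    stat = stat + eta_t * (f_xt * resp_t - stat)    # Eqn. (3.6)
    return stat, eta_t
```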
3.1.3 Forgetting Factor

In Eqn. (3.4) and Eqn. (3.5), a time-dependent forgetting factor $\lambda(t)$ is introduced for forgetting the contributions of the earlier estimates. Suppose $\lambda(t) = 0.999$ (in practice $\lambda(t)$ approaches 1 over time). At time 10001, the effect of the first-round estimate on the current estimate is reduced to less than $0.999^{10000}/10000 \times 100\% \approx 4.5\mathrm{e}{-7}\,\%$, while the 9001st estimate contributes almost $3.6\mathrm{e}{-3}\,\%$. The largest weight, 0.01%, is contributed by the 10000th estimate. If there were no such factor, i.e., $\lambda(t) = 1$, each observed datum would weigh equally in the weighted-mean expression for the sufficient statistics, and the learning rate would reduce to $\eta(t) = 1/t$. Intuitively, we would then see slow updates of the estimates, because the earlier, inaccurate estimates have not been forgotten and still affect the later learning cycles. The forgetting factor thus plays an important role in achieving fast convergence. An appropriate schedule for the forgetting factor is discussed in [11] and recapitulated in Section 4.1.

Seeking equivalence between the on-line EM algorithm and its batch counterpart, for the sake of comparability of the two algorithms, the value of $\lambda(t)$ is chosen as in [12]. It is assumed that the same set of data is repeatedly supplied to the on-line learning system, i.e., $x(t + T) = x(t)$, where $T$ is the period of an epoch. The time $t$ is represented by an epoch index $k$ ($= 1, 2, \ldots$) and a data index $i$ within the epoch, i.e., $t = (k-1)T + i$. Assume the model parameter $\theta$ is updated at the end of each epoch, i.e., when $i = T$. The on-line EM algorithm then equals its batch counterpart when $\lambda$ is given by

$$\lambda(t) = \begin{cases} 0 & \text{if } t = (k-1)T + 1 \\ 1 & \text{otherwise} \end{cases}$$

Thus, when we compare the performance of the batch and on-line EM algorithms, we run both for the same number of epochs, updating the parameters at each time slice within each epoch in the on-line case and at the end of each epoch in the batch case.

3.1.4 On-line EM for Mixtures of Multinomials

Based on the recursion of Section 3.1.2, we derive the on-line EM for mixtures of multinomials.

- E step: Compute the distribution of the latent variables:

$$\xi_c^{(t)} = p(z^{(t)} = c\,|\,x^{(t)}, \theta^{(t-1)}) = \frac{\lambda_c^{(t-1)} \prod_{a=1}^{n_a} (\varphi_{a,c}^{(t-1)})^{x_a^{(t)}}}{\sum_{c'=1}^{n_c} \lambda_{c'}^{(t-1)} \prod_{a=1}^{n_a} (\varphi_{a,c'}^{(t-1)})^{x_a^{(t)}}} \qquad (3.8)$$

- M step: Update the estimates in terms of weighted means:

$$\langle 1 \rangle_c^{(t)} = \langle 1 \rangle_c^{(t-1)} + \eta(t)\left( \xi_c^{(t)} - \langle 1 \rangle_c^{(t-1)} \right) \qquad (3.9)$$

$$\langle x_a \rangle_c^{(t)} = \langle x_a \rangle_c^{(t-1)} + \eta(t)\left( x_a^{(t)} \xi_c^{(t)} - \langle x_a \rangle_c^{(t-1)} \right) \qquad (3.10)$$

$$\left\langle \textstyle\sum_a x_a \right\rangle_c^{(t)} = \left\langle \textstyle\sum_a x_a \right\rangle_c^{(t-1)} + \eta(t)\left( \textstyle\sum_a x_a^{(t)}\, \xi_c^{(t)} - \left\langle \textstyle\sum_a x_a \right\rangle_c^{(t-1)} \right) \qquad (3.11)$$

$$\lambda_c^{(t)} = \langle 1 \rangle_c^{(t)} \qquad (3.12)$$

$$\varphi_{a,c}^{(t)} = \frac{\langle x_a \rangle_c^{(t)}}{\langle \sum_{a'} x_{a'} \rangle_c^{(t)}} \qquad (3.13)$$

In the remainder, we refer to this algorithm as Sato's on-line EM (SOEM).
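A minimal sketch of one SOEM iteration (Eqns. 3.8-3.13) might look as follows, assuming the discounted statistics are kept in a dictionary. The names are ours, and the learning rate eta_t is supplied by the schedule of Section 4.1.

```python
# A sketch of one step of Sato's on-line EM for multinomial mixtures.
import numpy as np

def soem_step(x_t, lam, phi, stats, eta_t):
    """x_t: (n_a,) counts; lam: (n_c,); phi: (n_a, n_c);
    stats: dict holding <1>_c, <x_a>_c and <sum_a x_a>_c."""
    # E step (Eqn. 3.8), in log space for stability
    log_r = np.log(lam) + x_t @ np.log(phi)
    xi = np.exp(log_r - log_r.max())
    xi /= xi.sum()
    # M step: discounted averages (Eqns. 3.9-3.11)
    stats["one"] += eta_t * (xi - stats["one"])
    stats["xa"] += eta_t * (np.outer(x_t, xi) - stats["xa"])
    stats["n"] += eta_t * (x_t.sum() * xi - stats["n"])
    # New parameter estimates (Eqns. 3.12, 3.13)
    lam = stats["one"]
    phi = stats["xa"] / stats["n"]
    return lam, phi, xi
```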
3.2 On-line EM Algorithm within the Bayesian Framework

3.2.1 Coherent Bayes Solution for Sequential Learning of Mixtures

Consider the problem of unsupervised sequential learning and classification. Let $x^{(1)}, x^{(2)}, \cdots$ denote the sequential observations, each belonging to one of the classes $C_1, C_2, \cdots, C_{n_c}$. In the sequential learning scenario, for each $t$ the $t$th observation is classified on, and only on, the basis of the previous observations $x^{(1)}, \cdots, x^{(t-1)}$ and the current datum $x^{(t)}$. Given the non-negative mixing weights $\lambda = (\lambda_1, \cdots, \lambda_{n_c})$, with $\sum_{k=1}^{n_c} \lambda_k = 1$, and the density functions $f_1, \cdots, f_{n_c}$, the random variables $x^{(t)}$ have the following distribution:

$$f(x^{(t)}|\lambda) = \sum_{k=1}^{n_c} \lambda_k\, f_k(x^{(t)}) \qquad (3.14)$$

Here we assume, as in [14], that the $f_k$ are known; we will soon relax this assumption and extend the method to parameterized density functions. Let $p(\lambda)$ denote a prior for $\lambda$, $p(\lambda|x^{(1)}, \cdots, x^{(t)})$ the posterior density of $\lambda$ given $x^{(1)}, \cdots, x^{(t)}$, and $p_c(\lambda|x^{(1)}, \cdots, x^{(t)})$ the posterior density of $\lambda$ given $x^{(1)}, \cdots, x^{(t)}$ when $x^{(t)}$ belongs to the $c$th cluster. By Bayes' theorem, for $t \ge 1$,

$$p(\lambda|x^{(1)}, \cdots, x^{(t)}) \propto f(x^{(t)}|\lambda)\, p(\lambda|x^{(1)}, \cdots, x^{(t-1)}) \qquad (3.15)$$

Again, we have allocation variables $z^{(1)}, \cdots, z^{(t)}$, such that $z^{(t)} = c$ if and only if $x^{(t)}$ belongs to the $c$th cluster. From Eqn. (3.14) and Eqn. (3.15), the posterior is

$$p(\lambda|x^{(1)}, \cdots, x^{(t)}) = \sum_{c=1}^{n_c} p(z^{(t)} = c\,|\,x^{(1)}, \cdots, x^{(t)})\; p_c(\lambda|x^{(1)}, \cdots, x^{(t)}) \qquad (3.16)$$

where, for $c = 1, 2, \cdots, n_c$,

$$p(z^{(t)} = c\,|\,x^{(1)}, \cdots, x^{(t)}) \propto f_c(x^{(t)})\, \lambda_c^{(t-1)} \qquad (3.17)$$

and

$$\lambda_c^{(t-1)} = \int \lambda_c\, p(\lambda|x^{(1)}, \cdots, x^{(t-1)})\, d\lambda \qquad (3.18)$$

Eqn. (3.16), together with Eqn. (3.17) and Eqn. (3.18), carries out the sequential learning recursively; Eqn. (3.18) computes the posterior mean estimate (PME) of $\lambda$ when the $\lambda_c$ are unknown. The learning procedure seems straightforward, yet it is impractical: recursive computation of Eqn. (3.16) keeps the mixture form, producing a linear combination of component posterior densities whose number grows exponentially, which renders precise computation time-consuming. Consequently, an approximate solution called quasi-Bayes was proposed in [14], which takes both the convergence and computational issues into account while retaining the form of the Bayes procedure.

3.2.2 Quasi-Bayes Sequential Learning for Mixtures of Multinomials

In this section, we derive a new on-line algorithm for mixtures of multinomials while explicitly obtaining expressions for the learning rates. Considering the learning problem for mixtures of multinomials discussed in the previous sections, we place a conjugate Dirichlet prior on the mixing coefficients, $\lambda \sim \mathcal{D}_{n_c}(\alpha)$, with density

$$p(\lambda|\alpha) = \frac{\Gamma(\alpha_0)}{\prod_{c=1}^{n_c} \Gamma(\alpha_c)} \prod_{c=1}^{n_c} \lambda_c^{\alpha_c - 1} \qquad (3.19)$$

where $\Gamma(\cdot)$ denotes the Gamma function and $\alpha_0 = \sum_c \alpha_c$. Similarly, we place a Dirichlet prior on each $\varphi_c$ and assume these priors are independent:

$$p(\varphi|\beta) \sim \prod_{c=1}^{n_c} \mathcal{D}_{n_a}(\beta_{1,c}, \cdots, \beta_{n_a,c}) \propto \prod_{c=1}^{n_c} \prod_{a=1}^{n_a} \varphi_{a,c}^{\beta_{a,c} - 1} \qquad (3.20)$$

The posterior distribution $p(z, \theta|x)$ is obtained by multiplying the priors, Eqn. (3.19) and Eqn. (3.20), with the likelihood, Eqn. (2.7), and normalizing:

$$p(z, \theta|x) \propto p(x, z|\theta)\, p(\theta) \propto \prod_{c=1}^{n_c} \lambda_c^{\sum_{i=1}^{n_x} I_c(z_i) + \alpha_c - 1} \prod_{c=1}^{n_c} \prod_{a=1}^{n_a} \varphi_{a,c}^{\sum_{i=1}^{n_x} x_{i,a} I_c(z_i) + \beta_{a,c} - 1} \qquad (3.21)$$

According to this posterior,

$$\lambda|(z, x, \alpha) \sim \mathcal{D}_{n_c}\!\left( \sum_{i=1}^{n_x} I_1(z_i) + \alpha_1,\; \cdots,\; \sum_{i=1}^{n_x} I_{n_c}(z_i) + \alpha_{n_c} \right) \qquad (3.22)$$

$$\varphi_c|(z, x, \beta) \sim \mathcal{D}_{n_a}\!\left( \sum_{i=1}^{n_x} x_{i,1} I_c(z_i) + \beta_{1,c},\; \cdots,\; \sum_{i=1}^{n_x} x_{i,n_a} I_c(z_i) + \beta_{n_a,c} \right) \qquad (3.23)$$

The PME takes the mean of the posterior over the domain of the parameters $\theta$:

$$\theta^{(PME)} = \int_\theta \theta\, p(\theta|x, z)\, d\theta \qquad (3.24)$$

where the integral is applied to every component of $\theta$. From the standard properties of the Dirichlet distribution, the posterior mean estimates of $\lambda$ and $\varphi$ after observing the sequence $x_1, \cdots, x_{n_x}$ are given by:

$$\lambda_c^{(PME)} = \frac{\alpha_c + \sum_{i=1}^{n_x} I_c(z_i)}{\alpha_0 + \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} I_c(z_i)} \qquad (3.25)$$

$$\varphi_{a,c}^{(PME)} = \frac{\beta_{a,c} + \sum_{i=1}^{n_x} x_{i,a} I_c(z_i)}{\beta_{0,c} + \sum_{i=1}^{n_x} \sum_{a'=1}^{n_a} x_{i,a'} I_c(z_i)} \qquad (3.26)$$

where $\beta_{0,c} = \sum_a \beta_{a,c}$. The indicator $I_c(z_i)$ tells whether $z_i = c$ and is unknown, yet $\xi_{i,c}$ is the expectation of $I_c(z_i)$ [14]. Approximating the estimates by substituting $\xi_{i,c}$ for $I_c(z_i)$, Eqn. (3.25) and Eqn. (3.26) become:

$$\lambda_c^{(PME)} = \frac{\alpha_c + \sum_{i=1}^{n_x} \xi_{i,c}}{\alpha_0 + \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} \xi_{i,c}} \qquad (3.27)$$

$$\varphi_{a,c}^{(PME)} = \frac{\beta_{a,c} + \sum_{i=1}^{n_x} x_{i,a}\, \xi_{i,c}}{\beta_{0,c} + \sum_{i=1}^{n_x} \sum_{a'=1}^{n_a} x_{i,a'}\, \xi_{i,c}} \qquad (3.28)$$
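For reference, the batch PME estimates of Eqns. (3.27) and (3.28) reduce to two array expressions once the responsibility matrix is available; this sketch and its names are illustrative.

```python
# A sketch of the batch posterior-mean estimates (Eqns. 3.27, 3.28).
import numpy as np

def pme_estimates(X, xi, alpha, beta):
    """X: (n_x, n_a) counts; xi: (n_x, n_c) responsibilities;
    alpha: (n_c,) and beta: (n_a, n_c) Dirichlet hyperparameters."""
    # Rows of xi sum to 1, so xi.sum() equals n_x.
    lam = (alpha + xi.sum(axis=0)) / (alpha.sum() + xi.sum())
    phi = (beta + X.T @ xi) / (beta.sum(axis=0) + X.sum(axis=1) @ xi)
    return lam, phi
```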
We now start from time 0 for our approximation schedule, with superscripts denoting time stamps:

$$\lambda|(x^{(1)}, \alpha) \sim \mathcal{D}_{n_c}\!\left( \alpha_1^{(0)} + \xi_1^{(1)},\; \cdots,\; \alpha_{n_c}^{(0)} + \xi_{n_c}^{(1)} \right) \qquad (3.29)$$

$$\varphi_c|(x^{(1)}, \beta) \sim \mathcal{D}_{n_a}\!\left( \beta_{1,c}^{(0)} + x_1^{(1)} \xi_c^{(1)},\; \cdots,\; \beta_{n_a,c}^{(0)} + x_{n_a}^{(1)} \xi_c^{(1)} \right) \qquad (3.30)$$

Subsequently, the estimates are updated within the Dirichlet family: $p(\lambda|x^{(t)}, \alpha)$ is Dirichlet with parameters $\alpha_c^{(t)} = \alpha_c^{(t-1)} + \xi_c^{(t)}$, and $p(\varphi_c|x^{(t)}, \beta)$ is Dirichlet with parameters $\beta_{a,c}^{(t)} = \beta_{a,c}^{(t-1)} + x_a^{(t)} \xi_c^{(t)}$, where $\xi_c^{(t)} = p(z^{(t)} = c\,|\,x^{(t)}, \theta^{(t-1)})$. According to Eqn. (3.29) and Eqn. (3.30), the quasi posterior mean estimates of $\theta$ are:

$$\lambda_c^{(t)} = \frac{\alpha_c^{(t-1)} + \xi_c^{(t)}}{\alpha_0 + t} \qquad (3.31)$$

$$\varphi_{a,c}^{(t)} = \frac{\beta_{a,c}^{(t-1)} + x_a^{(t)} \xi_c^{(t)}}{\beta_{0,c} + \sum_{k=1}^{t} \sum_{a'=1}^{n_a} x_{a'}^{(k)} \xi_c^{(k)}} \qquad (3.32)$$

where $\alpha_0 = \alpha_1^{(0)} + \cdots + \alpha_{n_c}^{(0)}$ and $\beta_{0,c} = \beta_{1,c}^{(0)} + \cdots + \beta_{n_a,c}^{(0)}$. From Eqn. (3.31), we have equations for $\lambda_c$ at times $t$ and $t-1$:

$$\lambda_c^{(t)} = \frac{\alpha_c^{(t-1)} + \xi_c^{(t)}}{\alpha_0 + t} \qquad (3.33)$$

$$\lambda_c^{(t-1)} = \frac{\alpha_c^{(t-1)}}{\alpha_0 + t - 1} \qquad (3.34)$$

Substituting Eqn. (3.34) into Eqn. (3.33), we get:

$$\lambda_c^{(t)} = \frac{(\alpha_0 + t - 1)\, \lambda_c^{(t-1)} + \xi_c^{(t)}}{\alpha_0 + t} \qquad (3.35)$$

We can rewrite Eqn. (3.35) in recurrence form as follows:

$$\lambda_c^{(t)} = \lambda_c^{(t-1)} + r(t)\left( \xi_c^{(t)} - \lambda_c^{(t-1)} \right) \qquad (3.36)$$

where $r(t) = \frac{1}{\alpha_0 + t}$ serves as the learning rate for all the clusters. This is the same form as Eqn. (3.12), but this time we know how to set the learning rate. For the initialization of $\lambda_c$, we exploit the prior information to set the initial component probability, $\lambda_c^{(0)} = \alpha_c^{(0)}/\alpha_0$, which also satisfies Eqn. (3.31) at $t = 0$.

Similarly, let

$$a_c(t) = \left( \beta_{0,c} + \sum_{k=1}^{t} \sum_{a'=1}^{n_a} x_{a'}^{(k)} \xi_c^{(k)} \right)^{-1} \qquad (3.37)$$

Eqn. (3.32) can then be written at times $t$ and $t-1$ as:

$$\varphi_{a,c}^{(t)} = a_c(t) \left[ \beta_{a,c}^{(t-1)} + x_a^{(t)} \xi_c^{(t)} \right] \qquad (3.38)$$

$$\varphi_{a,c}^{(t-1)} = a_c(t-1)\, \beta_{a,c}^{(t-1)} \qquad (3.39)$$

From Eqn. (3.38) and Eqn. (3.39), we have:

$$\varphi_{a,c}^{(t)} = \frac{a_c(t)}{a_c(t-1)}\, \varphi_{a,c}^{(t-1)} + a_c(t)\, x_a^{(t)} \xi_c^{(t)} \qquad (3.40)$$

To put Eqn. (3.40) in the same form as Eqn. (3.36), let $\eta_c(t) = a_c(t) \sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$, and note that $a_c(t)/a_c(t-1) = 1 - \eta_c(t)$. Consequently, we have the recursive equation for $\varphi_{a,c}$:

$$\varphi_{a,c}^{(t)} = \varphi_{a,c}^{(t-1)} + \eta_c(t)\left( \frac{x_a^{(t)} \xi_c^{(t)}}{\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}} - \varphi_{a,c}^{(t-1)} \right) \qquad (3.41)$$

where the learning rate $\eta_c(t)$ is calculated as follows:

$$\eta_c(t) = \frac{\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}}{\beta_{0,c} + \sum_{k=1}^{t} \sum_{a'} x_{a'}^{(k)} \xi_c^{(k)}} \qquad (3.42)$$

The initialization of $\varphi_{a,c}$ can also employ the prior information:

$$\varphi_{a,c}^{(0)} = \frac{\beta_{a,c}^{(0)}}{\beta_{0,c}} \qquad (3.43)$$

which satisfies Eqn. (3.32) at $t = 0$. Thus the step-wise equations for $\lambda_c$ and $\varphi_{a,c}$ are

$$\lambda_c^{(t)} = \lambda_c^{(t-1)} + r(t)\left( \xi_c^{(t)} - \lambda_c^{(t-1)} \right), \qquad \varphi_{a,c}^{(t)} = \varphi_{a,c}^{(t-1)} + \eta_c(t)\left( \frac{x_a^{(t)} \xi_c^{(t)}}{\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}} - \varphi_{a,c}^{(t-1)} \right)$$

with $r(t) = 1/(\alpha_0 + t)$, $\eta_c(t)$ as in Eqn. (3.42), and initialization $\lambda_c^{(0)} = \alpha_c^{(0)}/\alpha_0$, $\varphi_{a,c}^{(0)} = \beta_{a,c}^{(0)}/\beta_{0,c}$. In the remainder, we refer to this algorithm as Bayesian on-line EM (BOEM).
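Putting the pieces together, a hedged sketch of the BOEM recursions (Eqns. 3.36, 3.41 and 3.42) follows. Note that the target term $x_a^{(t)} \xi_c^{(t)} / \sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$ in Eqn. (3.41) reduces to the normalized count vector $x_a^{(t)} / \sum_{a'} x_{a'}^{(t)}$ whenever $\xi_c^{(t)} > 0$, which the code exploits. All function and variable names are ours.

```python
# A sketch of the quasi-Bayes / BOEM recursions (Eqns. 3.36, 3.41, 3.42).
import numpy as np

def boem_init(alpha, beta):
    """alpha: (n_c,) prior on mixing weights; beta: (n_a, n_c) priors."""
    return {
        "lam": alpha / alpha.sum(),       # lambda_c^{(0)} = alpha_c / alpha_0
        "phi": beta / beta.sum(axis=0),   # phi_{a,c}^{(0)} = beta_{a,c} / beta_{0,c}
        "alpha0": alpha.sum(),
        "acc": beta.sum(axis=0).copy(),   # beta_{0,c} plus accumulated counts
        "t": 0,
    }

def boem_step(x_t, s):
    lam, phi = s["lam"], s["phi"]
    # E step, identical in form to Eqn. (3.8)
    log_r = np.log(lam) + x_t @ np.log(phi)
    xi = np.exp(log_r - log_r.max())
    xi /= xi.sum()
    s["t"] += 1
    r_t = 1.0 / (s["alpha0"] + s["t"])    # r(t) = 1/(alpha_0 + t)
    w = x_t.sum() * xi                    # sum_a' x_a'^{(t)} xi_c^{(t)}
    s["acc"] += w
    eta = w / s["acc"]                    # eta_c(t), Eqn. (3.42)
    x_hat = x_t / x_t.sum()               # the xi_c factor cancels in the target
    s["lam"] = lam + r_t * (xi - lam)                  # Eqn. (3.36)
    s["phi"] = phi + eta * (x_hat[:, None] - phi)      # Eqn. (3.41)
    return xi
```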
Chapter 4: Convergence Issues

Both SOEM and BOEM resort to stochastic approximation theory [6] for their convergence guarantees. Sato pointed out in [11] that the on-line EM algorithm is identical to a stochastic gradient method with the inverse of the Fisher information matrix as a coefficient matrix. If the effective learning rate $\eta(t) > 0$ satisfies the conditions

$$\eta(t) \to 0, \qquad \sum_{t=1}^{\infty} \eta(t) = \infty, \qquad \sum_{t=1}^{\infty} \eta(t)^2 < \infty,$$

stochastic approximation theory guarantees that the estimates converge to a local maximum of the likelihood function of the data. The (1/t)-schedule has been designed to match these conditions, as will be shown later.

In the quasi-Bayes procedure defined by Smith and Makov [14], presented in Section 3.2.2, convergence is again established using standard asymptotic results from stochastic approximation. The assumptions in their theorem can be used to prove the asymptotic convergence of BOEM. However, here we are not interested in the limit of the data set becoming infinite; rather, we are concerned with coming up with good, automatic learning rates for finite data sets.

4.1 Learning Rate Schedules

Many algorithms employing on-line learning and stochastic gradient search, such as on-line EM, least mean squares (LMS), on-line backpropagation and adaptive k-means clustering [21], raise the issue of learning rate scheduling.

The easiest way to schedule the learning rate is to hold it constant, as is done in LMS and backpropagation. A constant learning rate incurs fluctuation and unpredictable instability in the learning process, not to mention a possible lack of convergence. Small constants may reduce the fluctuation, but they slow convergence.

In the theory of stochastic approximation [6], the standard learning rate is $\eta(t) = c/t$. It may result in slow convergence and non-optimal solutions for small $c$, and it may cause parameter blow-up at small $t$ if $c$ is too large [22][23].

In the stochastic gradient literature, Darken and Moody proposed a class of "search then converge" (STC) learning rate schedules [22][23]. They showed a theoretically optimal asymptotic convergence rate and good results in escaping local maxima. However, in STC the user has to specify the key parameters used to compute $\eta$ at each time slice. To facilitate automation, several heuristic adaptive schedules, in which $\eta$ depends on the time and the previously processed data, have been developed [24][25][26][27].

In the on-line EM literature, Sato proposed in [11] a non-adaptive schedule for $\eta$ ($\eta$ is a function of time only), the (1/t)-schedule, which achieves an effective learning rate and enjoys a reasonable physical interpretation. He also compared the performance of batch EM, on-line EM and on-line gradient ascent on simple examples with mixtures of Gaussians, and showed that on-line EM outperforms the batch and on-line gradient counterparts in both convergence speed and accuracy of the learned estimates. This is because on-line EM can avoid being trapped at a local maximum, thanks to its stochastic nature in the early stage of learning. There, the following forgetting factor schedule is employed:

$$\lambda(t) = 1 - \frac{1}{(t-2)\kappa + t_0} \qquad (4.1)$$

We can obtain an alternative step-wise form of Eqn. (4.1):

$$1 - \lambda(t) = \frac{1 - \lambda(t-1)}{1 + \kappa\,[1 - \lambda(t-1)]} \qquad (4.2)$$

where $t > 2$; note that $\lambda(2)$ does not have to be defined. According to Eqn. (3.7), the learning rate satisfies the recursive form

$$\eta(t) = \frac{1}{1 + \lambda(t)/\eta(t-1)} \qquad (4.3)$$

Eqn. (3.5), together with the condition $0 \le \lambda(t) \le 1$, implies $1/t \le \eta(t) \le 1$, with $\eta(1) = 1$. With Eqn. (4.2) and Eqn. (4.3), we can obtain the learning rate function $\eta(t)$ for $t > 2$ (Eqn. (4.4)); for large $t$, $\eta(t)$ is approximately $\frac{1+\kappa}{\kappa}\, \frac{1}{t}$.
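A sketch of this schedule as code (Eqns. 4.1 and 4.3); the closure-based interface and default parameter values are our choices, matching the setting used later in the experiments.

```python
# A sketch of Sato's (1/t)-schedule for the learning rate.
def make_schedule(eta0=0.2, t0=100, kappa=0.1):
    eta = eta0
    t = 0
    def step():
        nonlocal eta, t
        t += 1
        if t == 1:
            return eta                                # eta(1) = eta0
        lam = 1.0 - 1.0 / ((t - 2) * kappa + t0)      # Eqn. (4.1)
        eta = 1.0 / (1.0 + lam / eta)                 # Eqn. (4.3)
        return eta
    return step

# Usage: next_eta = make_schedule(); rates = [next_eta() for _ in range(5)]
```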
Meaningful interpretations of the schedule parameters, and how they affect the different stages of the learning process, were elaborated in [11]. Here we give a concise description of the physical meaning of each parameter. $t_0$ denotes the number of samples contributing to the discounted average of the sufficient statistics in the early learning phase. $\kappa$ controls the asymptotic decrease rate of the learning rate. Eqn. (3.5) determines $\eta(1) = 1$, meaning the initial estimate does not depend on any previous data. Instead, to achieve faster convergence in the early stage, a smaller number $\eta_0$ is set as the initial value for $\eta$. $\eta_0$ represents the initial uncertainty of the estimates, i.e., the initial estimates behave as if they depend on $1/\eta_0 - 1$ data points; this simultaneously avoids singular values in the sufficient statistics computed in the early stage. With $\eta_0$ as the initial value, $\eta(t)$ can be computed from Eqn. (4.1) and Eqn. (4.3); it behaves according to Eqn. (4.4) as long as $\eta_0 < 1$.

For $0 < \kappa < 1$, there are three learning stages. In the first stage ($1 \le t \le t_0$), $\eta$ drops off swiftly to about $1/t_0$, no matter what initial value of $\eta$ has been chosen. This is because $\eta(t)$ is a monotone decreasing function, and

$$\eta(t_0) = \frac{1}{(t_0 - 1)\kappa + t_0 - \kappa} \approx \frac{1}{t_0} \qquad (4.5)$$

if $t_0$ is not too small. During this period, a rough estimate is built up rapidly. In the second stage ($t_0 < t < t_0/\kappa$), where the main part of the learning process takes place, $\eta$ remains almost at $1/t_0$. In this stage, fairly good estimates are obtained. In the last stage ($t_0/\kappa < t$), as $t$ goes to infinity, the learning rate is approximately $\frac{1+\kappa}{\kappa}\,\frac{1}{t}$, which stabilizes the values gained in the second stage.

In this thesis, we propose another learning rate schedule for on-line learning within the Bayesian framework, taking mixtures of multinomials as the example. As introduced in Section 3.2.2, the recursive update equations for the estimates are

$$\lambda_c^{(t)} = \lambda_c^{(t-1)} + r(t)\left( \xi_c^{(t)} - \lambda_c^{(t-1)} \right), \qquad \varphi_{a,c}^{(t)} = \varphi_{a,c}^{(t-1)} + \eta_c(t)\left( \frac{x_a^{(t)} \xi_c^{(t)}}{\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}} - \varphi_{a,c}^{(t-1)} \right)$$

with $r(t) = 1/(\alpha_0 + t)$ and $\eta_c(t)$ given by Eqn. (3.42). Different from Sato's updates in terms of weighted means of sufficient statistics, quasi-Bayes uses a pseudo posterior mean to approximate the estimates as the data are fed one by one. The parameters are learned directly, with a learning rate schedule tied to the priors. Therefore, the key parameters that control the learning rate can be designed wisely according to what prior information we want to place on the distribution. For this learning problem with mixtures of multinomials, we discuss the two learning rates in turn.

$r(t)$ follows a fixed functional schedule, quite similar in form to what Sato suggested, with forgetting factor $\lambda(t) = 1$, i.e., without discounting the previous estimates. Exploiting results from stochastic approximation theory, [11][14] give convergence proofs and properties for this schedule. The only parameter in the function is $\alpha_0$, from the prior on the mixing weights. The estimate of each mixing weight therefore follows the same schedule, with initial value $\alpha_c^{(0)}/\alpha_0$. $\alpha_0$, the sum of the $\alpha_c$, controls the behaviour of the rate and decides the number of data points that contribute to the early learning stage.

$\eta_c(t)$ adopts an adaptive schedule, which depends not only on time and the preset parameters, but also on the processed data. Unlike $r(t)$, $\eta_c(t)$ may differ across clusters at the same time step. It is controlled by the following quantities: the prior information on $\varphi_c$, namely $\beta_{0,c}$; the expected count of all events belonging to cluster $c$, $\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$; and the accumulated expected count of all events belonging to cluster $c$, $\sum_{k=1}^{t} \sum_{a'} x_{a'}^{(k)} \xi_c^{(k)}$. The ratio of the accumulated expected count to the current expected count suggests an effective time at every time step. If $\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$ is considerably larger than the past values, the effective time will be smaller, making the learning rate larger for quick adaptation. Therefore the learning rate depends not only on time and preset parameters: the incoming data also carry weight, adjusting the learning in a more automatic way. $\beta_{0,c} / \sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$ approximates the number of data points contributing to the early learning stage. The initialization of $\varphi_{a,c}$ uses the prior information: $\varphi_{a,c}^{(0)} = \beta_{a,c}^{(0)}/\beta_{0,c}$. Since $\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$ fluctuates over time, the learning rate is not a monotone decreasing function, yet it eventually goes to 0.

Chapter 5: Experiments

In this chapter, we examine the on-line EM algorithms by simulations over finite multinomial mixture models on both synthetic and real data.

5.1 Experiments on Synthetic Data

We generated 500 data points from a mixture of multinomials with 3 components. The mixture weights were set to $(\lambda_1, \lambda_2, \lambda_3) = (1/5, 1/2, 3/10)$. The mixture parameters $\varphi_{a,c}$ were randomly generated, with event indices $a = 1, \ldots, 30$.
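A sketch of this data-generation step follows; the thesis does not state the number of trials per data point, so n_trials below is an assumption.

```python
# A sketch of the synthetic data generation described above.
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_trials = 500, 30, 50                       # n_trials is assumed
weights = np.array([1/5, 1/2, 3/10])
phi_true = rng.dirichlet(np.ones(n_a), size=3).T       # (n_a, 3), columns sum to 1
z = rng.choice(3, size=n_x, p=weights)                 # true component labels
X = np.stack([rng.multinomial(n_trials, phi_true[:, c]) for c in z])
```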
Having the data set in hand, we started the training procedure with three algorithms: batch EM (EM), Sato's on-line EM (SOEM) and our Bayesian on-line EM (BOEM). We repeated each experiment 10 times with different initializations of the mixture parameters. For each algorithm, the parameters $\varphi$ were initialized at random, and the mixing weights were initially set to 1 over the number of clusters. We set the number of clusters to 6 (the regularizer in the on-line EM algorithms should enable us to recover the true number of clusters: 3). The number of passes through the entire data set (epochs) was set to 50.

To measure the generalization error of the learned distribution from the true one, we adopted the KL divergence:

$$\mathcal{D}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

where $p$ denotes the true multinomial mixture and $q$ is our estimated mixture. This measure, though very useful, is hard to calculate exactly because of the high-dimensional sum. We approximate it via Monte Carlo simulation with 10,000 samples.
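A sketch of the Monte Carlo approximation of the KL divergence; sample_p, log_p and log_q are assumed to be caller-supplied functions for sampling from and evaluating the true and estimated mixtures.

```python
# A sketch of the Monte Carlo estimate of D(p || q), with x_i ~ p.
import numpy as np

def kl_monte_carlo(sample_p, log_p, log_q, n_samples=10_000):
    """D(p||q) ~ (1/N) sum_i [log p(x_i) - log q(x_i)]."""
    xs = sample_p(n_samples)
    return float(np.mean([log_p(x) - log_q(x) for x in xs]))
```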
For SOEM, we tried several parameter settings. The best setting encountered was $\eta_0 = 0.2$, $t_0 = 100$, $\kappa = 0.1$; we kept this setting throughout the experiments. Learning curves for the three methods are plotted in Fig. 5.1. We observe that the on-line algorithms outperform batch EM, as they do not get trapped at local maxima. BOEM converges a bit faster and achieves more accurate results than SOEM on this data set.

Figure 5.1: Learning curves (approximate KL divergence) using EM, SOEM and BOEM. Each panel plots 10 learning curves lasting 50 epochs from different initializations.

The number of clusters is initially set to 6. We set $\alpha_c^{(0)} = 1$, so that $\alpha_0 = 6$, and set each $\beta_{a,c}^{(0)}$ to a random number such that $\beta_{0,c} = 120$. The parameter $\alpha_c^{(0)}$ is set to 1 to indicate that we have no a priori preference for any mixture component to be on (the Dirichlet distribution with $\alpha_c^{(0)} = 1$ gives a uniform prior). The hyperparameter $\beta$ controls the amount of regularization; it is a penalization term. Larger values of $\beta$ lead to fewer clusters. If the prior for $\beta$ forces all discrete items of a component to be in one single cluster with equal probability, the estimate of the corresponding cluster probability $\lambda$ goes to zero. This ensures that the model still provides a good explanation of the data. This is a well-known shrinkage method for model selection in discrete mixtures.

Figure 5.2: Evolution of the mixing weights. Top left: $\alpha_c^{(0)} = 1$, $\beta_{0,c} = 120$; top right: $\alpha_c^{(0)} = 1$, $\beta_{0,c} = 300$; bottom left: $\alpha_c^{(0)} = 1$, $\beta_{0,c} = 1200$; bottom right: PME.

To study the sensitivity of the results with respect to the setting of $\beta$, we carried out several experiments. We also included batch posterior mean estimation (PME), as in Eqn. (3.27) and Eqn. (3.28). Since BOEM is derived from the Bayesian framework, specifically from PME, the relation between the priors in PME and the schedule parameters of BOEM, $\alpha$ and $\beta$, is of great importance. Fig. 5.2 plots the evolution of the mixing weights $\lambda_c$ during learning using PME and BOEM with different learning rate schedules. The priors of PME are set to the same values as the corresponding learning rate parameters of BOEM1. BOEM1 demonstrates rapid adaptation in the early learning stage and converges in fewer than 3 epochs. BOEM2 and BOEM3 need longer to converge. PME enhances the accuracy and stability of batch EM, but takes longer to converge than the BOEMs.

Figure 5.3: Box plots of final KL divergences using EM, PME, SOEM and BOEM with different learning rate schedules. KL divergences are obtained over 10 runs. EM denotes batch EM with ML estimates; PME denotes batch PME with parameters $\alpha_c = 1$, $\beta_{a,c} = 4$. The SOEM learning rate schedule is $\eta_0 = 0.2$, $t_0 = 100$, $\kappa = 0.1$. The BOEM schedules are: BOEM1, $\alpha_c^{(0)} = 1$, $\beta_{0,c} = 120$; BOEM2, $\alpha_c^{(0)} = 1$, $\beta_{0,c} = 300$; BOEM3, $\alpha_c^{(0)} = 1$, $\beta_{0,c} = 1200$.

The box plots in Fig. 5.3 show the final KL divergences between the true distribution and the learned one over 10 runs, using EM, PME, SOEM and BOEM with the 3 learning rate schedules described above. Considering the error distributions, we summarize as follows. EM yields the highest error mean among the 6 methods, with two outliers corresponding to local optima. SOEM presents a fairly compact error distribution with small error mean and variance, as do most of the BOEMs. BOEM1 gives the best error distribution over the 10 runs, with the smallest error mean and variance; it is thus empirically the most stable and least error-prone method of all. BOEM2, quite similar to BOEM1, outperforms SOEM. The larger variance of BOEM3 reveals the instability or slow convergence that can result from inappropriate learning rate scheduling. PME, with its priors set to the same values as the parameters of BOEM1, does not outperform the other BOEMs the way BOEM1 does, but predictable good behaviour is obtained for this deterministic algorithm by introducing priors.

Let us turn to the study of some parameter combinations for a more detailed comparison. We fix $\alpha_c^{(0)}$ at 1. The parameters and the corresponding performance are compared in Table 5.1.

Schedule  $\alpha_c^{(0)}$  $\beta_{0,c}^{(0)}$  Performance
S1        1                 120                  converges most rapidly in the early stage; smallest error mean and variance
S2        1                 300                  converges moderately rapidly; small error mean and variance
S3        1                 1200                 converges slowly; moderate error mean and large variance

Table 5.1: Sensitivity analysis for schedules with fixed $\alpha_c^{(0)}$.

In the experiments, we varied $\beta_{0,c}^{(0)}$ from 30 to 6000. Values larger than 600 engendered slow convergence. Very small values led to instability of the results, since less than enough prior information was placed on the parameter $\varphi$; the data then govern the whole learning process, resulting in unpredictable clusters. The choice of $\beta$ is closely tied to the performance of the corresponding PME. With $\alpha_c$ fixed at 1, if each $\beta_{a,c}$ is picked from $[1, 20]$, the performance of PME is quite good. The range $[1, 20]$ corresponds exactly to $[30, 600]$ in BOEM for this data set, which explains the overall nice behaviour BOEM achieves with $\beta_{0,c}^{(0)}$ in this range. Moreover, within these ranges, the performance is insensitive to parameter tuning.

From the above analysis, we know that both $\alpha_c^{(0)}$ and $\beta_{0,c}^{(0)}$ act as schedule parameters during learning. From another perspective, they can be expressed as prior information on the distribution to be learned, placed in order to smooth the learning process and obtain a desired distribution. Within this Bayesian framework, the quasi-Bayes procedure turns coherent Bayes sequential learning into a stochastic approximation process, by transforming the PME into a recursive estimation for on-line processing with the original priors translated into learning rates.
Thus batch PME can be simulated in an on-line fashion, with better performance. From the comparison between the two on-line EM algorithms, we find that under appropriate schedules BOEM exhibits equivalently good, if not better, performance than SOEM. From the study of the learning rate schedules, we see that their schedule parameters are indeed related. In BOEM's learning rate schedule there are two quantities: the expected count of all events belonging to cluster $c$, $\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$, and the accumulated expected count of all events belonging to cluster $c$, $\sum_{k=1}^{t} \sum_{a'} x_{a'}^{(k)} \xi_c^{(k)}$. The ratio of the accumulated expected count to the current one suggests an effective time at every time step. If $\sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$ is considerably larger than past values, the effective time will be smaller, rendering the learning rate larger for quick adaptation. This characteristic can be compared to $\kappa$, the decay ratio of the learning rate in SOEM. $\beta_{0,c} / \sum_{a'} x_{a'}^{(t)} \xi_c^{(t)}$ in BOEM plays a similar role to $t_0$ in SOEM, outlining the number of data points contributing to the early learning stage.

5.2 Modelling Documents with Text, Images and Music

Modelling and analyzing images, text and other web resources have flourished in the information era, boosting developments in multimedia database browsing, document retrieval and computer vision. Several models, such as PLSA [34], CAM [35] and Bayesian LSA [36], have led the study to new heights. In [37], we presented a novel, flexible, statistical approach to modelling music, images and text jointly. The learned models can be used to browse multimedia databases, to query a multimedia database using any combination of music, images and text (lyrics and other contextual information), to annotate documents with music and images, and to find documents in a database similar to input text, music and/or graphics files.

5.2.1 Model Specification

Our current model is based on documents with text (lyrics or information about the song), musical scores in GUIDO notation [38] (GUIDO is a powerful language for representing musical scores in an HTML-like notation; MIDI files, plentiful on the World Wide Web, can be easily converted to this format), and JPEG image files. We model the data with a Bayesian multi-modal mixture model. Words, images and scores are assumed to be conditionally independent given the mixture component label.

We model musical scores with first-order Markov chains, in which each state corresponds to a note, rest, or the start of a new voice. Notes' pitches are represented by the interval change (in semitones) from the previous note, rather than by absolute pitch, so that a score or query transposed to a different key will still have the same Markov chain. Rhythm is represented using the standard fractional musical measurement of whole-note, half-note, quarter-note, and so on. Rest states are represented similarly, save that pitch is not represented. See Figure 5.4 for an example. Polyphonic scores are represented by chaining the beginning of a new voice to the end of the previous one. In order to ensure that the first note in each voice appears in both the row and the column of the Markov transition matrix, a special "new voice" state with no interval or rhythm serves as a dummy state marking the beginning of a new voice. The first note of a voice has a distinguishing "first note" interval value.

Figure 5.4: Sample melody - the opening notes to "The Yellow Submarine" by The Beatles - in different notations. From top: standard musical notation (generated from GUIDO notation), GUIDO notation ([_*3/4 b&1*3/16 b1/16 c#2*11/16 b&1/16 a&1*3/16 b&1/16 f#1/2 ]), and as a series of states in a first-order Markov chain (also generated from GUIDO notation):

State  Interval    Duration
0      new voice   0
1      rest        3/4
2      first note  3/16
3      +1          1/16
4      +2          11/16
5      -2          1/16
6      -2          3/16
7      +3          1/16
8      -5          1/2
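For illustration, a melody's state sequence can be mapped to the transition-frequency representation described next with a few lines of code; the (interval, duration) tuple encoding of states below is our simplification, not the thesis's implementation.

```python
# A sketch of mapping a melody's state sequence to transition counts.
from collections import Counter

def transition_table(states):
    """states: list of (interval, duration) tuples, e.g. parsed from GUIDO.

    Returns the initial state (M_{k,0}) and a Counter of
    (state_i, state_j) -> count, i.e. the entries M_{i,j,k}."""
    counts = Counter(zip(states[:-1], states[1:]))
    return states[0], counts
```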
The Markov chain representation of a piece of music $k$ is then mapped to a transition frequency table $M_k$, where $M_{i,j,k}$ denotes the number of times we observe the transition from state $i$ to state $j$ in document $k$. We use $M_{k,0}$ to denote the initial state of the Markov chain. In essence, this Markovian approach is analogous to a text bigram model, save that the states are musical notes and rests rather than words.

Images are represented as image intensity histograms of 256 equally-spaced bins, each representing a range of colour values. Each bin's value is initially equal to the number of pixels of the image that fall into that range; the entire histogram is then normalized to give the relative frequencies of the bins, $G_k$, where $G_{b,k}$ indicates the relative frequency of bin $b$ in image $k$.

The associated text is modelled using a standard term frequency vector $T_k$, where $T_{w,k}$ denotes the number of times word $w$ appears in document $k$.

For notational simplicity, we group together the image, music and text variables as follows: $X_k = \{G_k, M_k, T_k\}$. Note that there is no requirement of uniqueness in the database for the elements of different $X_k$: one graphic document can be associated with any number of text or music documents, though under this model each association requires a separate instance of $X$. We rely on a human expert to provide the groupings for each instance of $X$ in the database.

Our multi-modal mixture model is as follows:

$$p(X_k|\theta) = \sum_{c=1}^{n_c} p(c) \prod_{b=1}^{n_b} p(b|c)^{G_{b,k}} \prod_{j=1}^{n_s} p(j|c)^{I_j(M_{k,0})} \prod_{j=1}^{n_s} \prod_{i=1}^{n_s} p(j|i,c)^{M_{i,j,k}} \prod_{w=1}^{n_w} p(w|c)^{T_{w,k}} \qquad (5.1)$$

where $\theta = \{p(c), p(b|c), p(j|c), p(j|i,c), p(w|c)\}$ encompasses all the model parameters, and where $I_j(M_{k,0}) = 1$ if the first entry of $M_k$ belongs to state $j$ and is 0 otherwise. The three-dimensional matrix $p(j|i,c)$ denotes the estimated probability of transitioning from state $i$ to state $j$ in cluster $c$; the matrix $p(j|c)$ denotes the initial probability of being in state $j$, given membership in cluster $c$. The vector $p(c)$ denotes the probability of each cluster. The two-dimensional matrices $p(b|c)$ and $p(w|c)$ denote the probabilities of bin $b$ and word $w$ in cluster $c$. The mixture model is defined on the standard probability simplex $\{p(c) \ge 0 \text{ for all } c \text{ and } \sum_{c=1}^{n_c} p(c) = 1\}$. We introduce latent allocation variables $z_k \in \{1, \ldots, n_c\}$ to indicate which cluster a particular document belongs to. These indicator variables $\{z_k;\ k = 1, \ldots, n_x\}$ correspond to an i.i.d. sample from the distribution $p(z_k = c) = p(c)$.

Regarding the prior specification, we follow a hierarchical Bayesian strategy, where the unknown parameters $\theta$ and the allocation variables $z$ are regarded as drawn from appropriate prior distributions. We acknowledge our uncertainty about the exact form of the prior by specifying it in terms of some unknown parameters (hyperparameters). The allocation variables $z_k$ are assumed to be drawn from a multinomial distribution, $z_k \sim \mathcal{M}_{n_c}(1; p(c))$. We place a conjugate Dirichlet prior on the mixing coefficients, $p(c) \sim \mathcal{D}_{n_c}(\alpha)$.
Similarly, we place Dirichlet prior distributions $\mathcal{D}_{n_s}(\beta)$ on each $p(j|c)$, $\mathcal{D}_{n_s}(\gamma)$ on each $p(j|i,c)$, $\mathcal{D}_{n_w}(\rho)$ on each $p(w|c)$, and $\mathcal{D}_{n_b}(\upsilon)$ on each $p(b|c)$, and assume that these priors are independent. The posterior for the allocation variables will be required. It can be obtained easily using Bayes' rule:

$$\xi_{ck} \triangleq p(z_k = c\,|\,X_k, \theta) \propto p(c) \prod_{b=1}^{n_b} p(b|c)^{G_{b,k}} \prod_{j=1}^{n_s} p(j|c)^{I_j(M_{k,0})} \prod_{j=1}^{n_s} \prod_{i=1}^{n_s} p(j|i,c)^{M_{i,j,k}} \prod_{w=1}^{n_w} p(w|c)^{T_{w,k}} \qquad (5.2)$$

normalized so that $\sum_{c=1}^{n_c} \xi_{ck} = 1$.

5.2.2 On-line EM Algorithms

We derive a very efficient on-line EM algorithm (SOEM) based on batch EM with MLE. In the thesis, by introducing PME, we derive another on-line EM algorithm (BOEM). Batch EM with MLE iterates between the following two steps after initialization:

1. E step: Compute the expectation of the complete log-likelihood with respect to the distribution of the allocation variables, $Q^{ML} = E_{p(z|G,M,T,\theta^{(\text{old})})}[\log p(z, G, M, T|\theta)]$, where $\theta^{(\text{old})}$ represents the value of the parameters at the previous iteration.

2. M step: Maximize over the parameters: $\theta^{(\text{new})} = \arg\max_\theta Q^{ML}$.

The $Q^{ML}$ function expands to

$$Q^{ML} = \sum_{k=1}^{n_x} \sum_{c=1}^{n_c} \xi_{ck} \log \left[ p(c) \prod_{b=1}^{n_b} p(b|c)^{G_{b,k}} \prod_{j=1}^{n_s} p(j|c)^{I_j(M_{k,0})} \prod_{j=1}^{n_s} \prod_{i=1}^{n_s} p(j|i,c)^{M_{i,j,k}} \prod_{w=1}^{n_w} p(w|c)^{T_{w,k}} \right]$$

In the E step, we compute $\xi_{ck}$ using Eqn. (5.2). The corresponding M step requires maximizing $Q^{ML}$ subject to the constraints that all the parameter probabilities sum to 1. This constrained maximization can be carried out by introducing Lagrange multipliers. The resulting parameter estimates are:

$$p(c) = \frac{1}{n_x} \sum_{k=1}^{n_x} \xi_{ck}, \qquad p(b|c) = \frac{\sum_{k=1}^{n_x} G_{b,k}\, \xi_{ck}}{\sum_{k=1}^{n_x} \sum_{b'} G_{b',k}\, \xi_{ck}}, \qquad p(j|c) = \frac{\sum_{k=1}^{n_x} I_j(M_{k,0})\, \xi_{ck}}{\sum_{k=1}^{n_x} \xi_{ck}},$$

$$p(j|i,c) = \frac{\sum_{k=1}^{n_x} M_{i,j,k}\, \xi_{ck}}{\sum_{k=1}^{n_x} \sum_{j'} M_{i,j',k}\, \xi_{ck}}, \qquad p(w|c) = \frac{\sum_{k=1}^{n_x} T_{w,k}\, \xi_{ck}}{\sum_{k=1}^{n_x} \sum_{w'} T_{w',k}\, \xi_{ck}}$$

The above estimates are used to derive SOEM, whose sufficient statistics are updated as discounted averages with $\xi_c^{(t)} = p(z^{(t)} = c\,|\,X^{(t)}, \theta^{(t-1)})$. As in the batch scenario, the E step computes the posterior of the allocation variables:

$$\xi_c^{(t)} \propto p(c)^{(t-1)} \prod_{b=1}^{n_b} \left( p(b|c)^{(t-1)} \right)^{G_b^{(t)}} \prod_{j=1}^{n_s} \left( p(j|c)^{(t-1)} \right)^{I_j(M_0^{(t)})} \prod_{j=1}^{n_s} \prod_{i=1}^{n_s} \left( p(j|i,c)^{(t-1)} \right)^{M_{i,j}^{(t)}} \prod_{w=1}^{n_w} \left( p(w|c)^{(t-1)} \right)^{T_w^{(t)}}$$

In the M step, we compute the parameters from the on-line weighted means:

$$p(c)^{(t)} = \langle 1 \rangle_c^{(t)}, \quad p(b|c)^{(t)} = \frac{\langle G_b \rangle_c^{(t)}}{\langle \sum_{b'} G_{b'} \rangle_c^{(t)}}, \quad p(j|c)^{(t)} = \frac{\langle I_j(M_0) \rangle_c^{(t)}}{\langle 1 \rangle_c^{(t)}}, \quad p(j|i,c)^{(t)} = \frac{\langle M_{i,j} \rangle_c^{(t)}}{\langle \sum_{j'} M_{i,j'} \rangle_c^{(t)}}, \quad p(w|c)^{(t)} = \frac{\langle T_w \rangle_c^{(t)}}{\langle \sum_{w'} T_{w'} \rangle_c^{(t)}}$$

where each on-line expectation follows the recursion of Eqn. (3.6), for example

$$\langle G_b \rangle_c^{(t)} = \langle G_b \rangle_c^{(t-1)} + \eta(t) \left[ G_b^{(t)} \xi_c^{(t)} - \langle G_b \rangle_c^{(t-1)} \right]$$

To ensure quick convergence for SOEM, we choose the decaying learning rate parameters as follows: $\eta_0 = 0.2$, $t_0 = 100$, $\kappa = 0.1$.

From the standard properties of the Dirichlet distribution, the PME parameter estimates are:

$$p(c) = \frac{\alpha_c + \sum_{k=1}^{n_x} I_c(z_k)}{\sum_{c'} \alpha_{c'} + \sum_{k=1}^{n_x} \sum_{c'} I_{c'}(z_k)}, \qquad p(b|c) = \frac{\upsilon_{b,c} + \sum_{k=1}^{n_x} G_{b,k} I_c(z_k)}{\sum_{b'} \upsilon_{b',c} + \sum_{k=1}^{n_x} \sum_{b'} G_{b',k} I_c(z_k)}, \qquad p(j|c) = \frac{\beta_{j,c} + \sum_{k=1}^{n_x} I_j(M_{k,0}) I_c(z_k)}{\sum_{j'} \beta_{j',c} + \sum_{k=1}^{n_x} I_c(z_k)},$$

$$p(j|i,c) = \frac{\gamma_{i,j,c} + \sum_{k=1}^{n_x} M_{i,j,k} I_c(z_k)}{\sum_{j'} \gamma_{i,j',c} + \sum_{k=1}^{n_x} \sum_{j'} M_{i,j',k} I_c(z_k)}, \qquad p(w|c) = \frac{\rho_{w,c} + \sum_{k=1}^{n_x} T_{w,k} I_c(z_k)}{\sum_{w'} \rho_{w',c} + \sum_{k=1}^{n_x} \sum_{w'} T_{w',k} I_c(z_k)}$$

By substituting $\xi_{ck}$ for $I_c(z_k)$ throughout, the estimates are approximated in a form quite similar to MAP estimation. These expressions will be used to derive BOEM.
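As an illustration of Eqn. (5.2), the following sketch computes the allocation posterior for one document in log space; the array shapes and names are assumptions, not from the thesis.

```python
# A sketch of the allocation posterior (Eqn. 5.2) for one document k.
import numpy as np

def allocations(G, M0, M, T, log_pc, log_pb, log_pj0, log_pj, log_pw):
    """G: (n_b,) image histogram; M0: initial-state index;
    M: (n_s, n_s) transition counts; T: (n_w,) term counts.
    log_p*: log-parameters of each modality, last axis indexes the cluster."""
    s = (log_pc
         + G @ log_pb                                        # image bins
         + log_pj0[M0]                                       # initial Markov state
         + np.tensordot(M, log_pj, axes=([0, 1], [0, 1]))    # note transitions
         + T @ log_pw)                                       # words
    s -= s.max()                       # normalize in log space for stability
    xi = np.exp(s)
    return xi / xi.sum()
```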
According to the derivation in Section 3.2.2, we derive BOEM with the same E step as SOEM and the following M step:

$$p(c)^{(t)} = p(c)^{(t-1)} + \eta_1(t)\left[ \xi_c^{(t)} - p(c)^{(t-1)} \right]$$

$$p(b|c)^{(t)} = p(b|c)^{(t-1)} + \eta_{2c}(t)\left[ \frac{G_b^{(t)}}{\sum_{b'} G_{b'}^{(t)}} - p(b|c)^{(t-1)} \right]$$

$$p(j|c)^{(t)} = p(j|c)^{(t-1)} + \eta_{3c}(t)\left[ I_j(M_0^{(t)}) - p(j|c)^{(t-1)} \right]$$

$$p(j|i,c)^{(t)} = p(j|i,c)^{(t-1)} + \eta_{4ic}(t)\left[ \frac{M_{i,j}^{(t)}}{\sum_{j'} M_{i,j'}^{(t)}} - p(j|i,c)^{(t-1)} \right]$$

$$p(w|c)^{(t)} = p(w|c)^{(t-1)} + \eta_{5c}(t)\left[ \frac{T_w^{(t)}}{\sum_{w'} T_{w'}^{(t)}} - p(w|c)^{(t-1)} \right]$$

where the learning rates, by analogy with Eqns. (3.36) and (3.42), are

$$\eta_1(t) = \frac{1}{\alpha_0 + t}, \qquad \eta_{2c}(t) = \frac{\sum_{b'} G_{b'}^{(t)} \xi_c^{(t)}}{\upsilon_{0,c} + \sum_{k=1}^{t} \sum_{b'} G_{b'}^{(k)} \xi_c^{(k)}}, \qquad \eta_{3c}(t) = \frac{\xi_c^{(t)}}{\beta_{0,c} + \sum_{k=1}^{t} \xi_c^{(k)}},$$

$$\eta_{4ic}(t) = \frac{\sum_{j'} M_{i,j'}^{(t)} \xi_c^{(t)}}{\gamma_{i,0,c} + \sum_{k=1}^{t} \sum_{j'} M_{i,j'}^{(k)} \xi_c^{(k)}}, \qquad \eta_{5c}(t) = \frac{\sum_{w'} T_{w'}^{(t)} \xi_c^{(t)}}{\rho_{0,c} + \sum_{k=1}^{t} \sum_{w'} T_{w'}^{(k)} \xi_c^{(k)}}$$

The estimates can be initialized from the prior information:

$$p(c)^{(0)} = \frac{\alpha_c}{\alpha_0}, \quad p(b|c)^{(0)} = \frac{\upsilon_{b,c}}{\upsilon_{0,c}}, \quad p(j|c)^{(0)} = \frac{\beta_{j,c}}{\beta_{0,c}}, \quad p(j|i,c)^{(0)} = \frac{\gamma_{i,j,c}}{\gamma_{i,0,c}}, \quad p(w|c)^{(0)} = \frac{\rho_{w,c}}{\rho_{0,c}}$$

In the remaining experiments, we initialize the parameters as follows: we set $\alpha_c^{(0)} = 1$ and take random values for $\upsilon_{b,c}^{(0)}$, $\beta_{j,c}^{(0)}$, $\gamma_{i,j,c}^{(0)}$, $\rho_{w,c}^{(0)}$, such that $\upsilon_{0,c} = 400$, $\beta_{0,c} = 300$, $\gamma_{i,0,c} = 300$ and $\rho_{0,c} = 2000$ for each $c$.

Figure 5.5: Probability mass distribution across clusters discovered from our test dataset using SOEM.

5.2.3 Experiments and Results

To test the model with text, images and music, we clustered a database of musical scores with associated text documents and JPEG images. The database is composed of various types of musical scores (jazz, classical, television theme songs, and contemporary pop and electronic music), each of which has an associated text file and image file, as represented by the combined media variable $X_k$. The scores are represented in GUIDO notation. The associated text files are a song's lyrics, where applicable, or textual commentary on the score for instrumental pieces, all extracted from the World Wide Web. The image file for each piece is an image of the cover of the CD on which the song appears.

The experimental database contains 100 scores, each with a single associated text document and image. There is nothing in the model, however, that requires this one-to-one association of text documents and scores; this was done solely for testing and pedagogical simplicity. In a deployment such as the World Wide Web, one would routinely expect one-to-many or many-to-many mappings between the scores and text.

Cluster  Song                                    $\xi_{ck}$
1        The Beatles - Good Day Sunshine         0.1667
1        other - 'The Addams Family' theme       0.0043
2        J. S. Bach - Invention #1               1.0000
2        J. S. Bach - Invention #2               1.0000
2        other - 'The Jetsons' theme             1.0000
3        Nine Inch Nails - Down In It            1.0000
3        Nine Inch Nails - The Perfect Drug      0.9998
3        Nine Inch Nails - Wish                  1.0000
4        The Cure - 10:15 Saturday Night         1.0000
4        Moby - Flower                           0.6667
4        other - 'The Addams Family' theme       0.9957
5        The Smiths - Girlfriend in a Coma       1.0000
5        The Cure - Push                         0.9753
5        Nine Inch Nails - The Perfect Drug      0.0002
7        The Prodigy - One Love                  1.0000
7        PJ Harvey - Down by the Water           1.0000
7        Rogers & Hart - Blue Moon               1.0000
8        Soft Cell - Tainted Love                1.0000

Figure 5.6: Representative probabilistic cluster allocations using SOEM.

We tested our database with batch EM, SOEM and BOEM. We found that standard batch EM gave the least satisfactory results, distributing probability mass across the maximum number of clusters in a nondeterministic fashion. SOEM and BOEM gave better results and regularized to a smaller number of more intuitive clusters. The two on-line EM algorithms also took significantly less time (2 epochs) than batch EM. BOEM broke the songs down into more distinct and reasonable clusters than SOEM. Fig. 5.5 and Fig. 5.6 show representative cluster probability assignments obtained with SOEM.
Fig. 5.7 and Fig. 5.8 show the representative cluster probability assignments obtained with the BOEM.

[Figure 5.7: Probability mass distribution across clusters discovered from our test dataset using BOEM.]

By and large, the clusters generated by both on-line EM algorithms are intuitive. From the result of the SOEM, the 15 pieces by J. S. Bach each have very high (p > 0.999) probabilities of membership in the same cluster, as do the 13 pieces from the band Nine Inch Nails. Rock and pop songs are scattered over different clusters. A few curious anomalies exist. The theme song to the television show The Jetsons is included in the same cluster as the Bach pieces, for example. There are also some noise clusters containing one or two songs. Regarding the result of the BOEM, almost all the songs are deterministically assigned to one cluster with very high probability. The 17 pieces by J. S. Bach, including Toccata and Fugue and Piano Sinfonia, are grouped together in cluster 1 with probability 1.0000. Cluster 2 mainly contains the TV themes and the chorales (classical singing). Most rock and pop songs with similar beats and lyrics are grouped into one big cluster: cluster 7. Cluster 10 mostly consists of dance music and jazz.

CLUSTER   SONG                                     ξ_ck
1         J. S. Bach - Toccata and Fugue           1.0000
1         J. S. Bach - Invention #1                1.0000
1         J. S. Bach - Invention #2                1.0000
1         J. S. Bach - Piano Sinfonia              1.0000
2         J. S. Bach - Choral #1                   1.0000
2         J. S. Bach - Choral #2                   1.0000
2         TV Theme - The Jeffersons                1.0000
2         TV Theme - Wheel of Fortune              1.0000
2         TV Theme - The Wonder Years              1.0000
7         The Beatles - Yellow Submarine           1.0000
7         The Beatles - And Your Bird Can Sing     1.0000
7         The Beatles - Doctor Robert              1.0000
7         The Cure - Killing an Arab               1.0000
7         The Cure - Fascination Street            1.0000
7         The Cure - A Forest                      1.0000
10        Jazz Standard - Blue Moon                1.0000
10        Jazz Standard - Tuxedo Junction          1.0000
10        Nine Inch Nails - Head Like a Hole       1.0000

Figure 5.8: Representative probabilistic cluster allocations using BOEM.

Chapter 6

Conclusion

In this thesis, we presented a comprehensive study of on-line EM algorithms. We derived on-line EM for mixtures of multinomials from two different views. We used Bayesian theory to propose a new on-line EM algorithm. Based on this theory, we showed that there is a direct connection between the setting of Bayes priors and the learning rates of stochastic approximation algorithms.

The performance of each on-line EM algorithm was examined by simulations on both synthetic and real data. The simulation results demonstrated that the on-line EM algorithms with appropriate learning rate schedules converged faster in the early stages and were less prone to becoming trapped in local maxima than their batch counterpart. The results of the real-data application strengthened this conclusion. Moreover, the BOEM performed better than the SOEM on the real data, as the former produced more distinct and reasonable clusters.

One of the prominent contributions of this thesis is to relate the prior information in Bayesian learning to the learning rate schedule parameters in the BOEM. As shown in Chapter 5, the Dirichlet hyper-parameters, such as $\alpha^{(0)}$ and $\beta^{(0)}$, serve as the schedule parameters during the learning process. They can alternatively be expressed as prior information, in order to smooth the learning process and to obtain a desired distribution.
Our experiments attest that approximately the same assignment of the priors for the PME and of the schedule parameters for the BOEM generates similar performance. Therefore, our approach turns Bayes sequential learning into a stochastic approximation process, by transforming the posterior mean estimation into on-line estimation and translating the priors and the data into the learning rates. This implies that researchers can now use ideas from Bayesian statistics to derive reasonable learning rates for stochastic approximation schemes.

Chapter 7

Future Work

In the future, we want to generalize the BOEM to mixtures of Gaussians and to more complicated models that involve more intricate hyper-parameters. This would provide a universal switchover from Bayesian batch learning to on-line processing, with the learning rates representing the priors of the former. It would greatly benefit large-scale real-time systems such as on-line video processing and on-line learning of robot behaviors.

In the current derivation and implementation, we consume one datum at a time and update the parameters n times within each epoch, where n is the total number of data points. If the computation for each update involves complex operations, it becomes prohibitive to proceed with the current version of on-line EM. To reach a trade-off between efficiency and accuracy, we can adjust the number of data points processed on-line at each step; that is, we build a window and update the estimates using the data within the current window and the previous estimates. Suppose the window size is w: at time w, we update the estimates using the first w data points; we then move w time steps ahead and update again using the next w data points and the previous estimates. Setting w = n recovers batch EM. A sketch of this windowed scheme is given below.
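The following is a rough illustration of the proposed windowed scheme; `em_block_update` is a hypothetical routine, not something defined in this thesis, standing for one EM refinement of the previous estimates on a block of data:

```python
def windowed_online_em(data, theta, w, em_block_update):
    """Sketch of window-based on-line EM.  w = 1 gives the fully
    on-line algorithms of this thesis; w = len(data) recovers batch
    EM (one full sweep per parameter update)."""
    for start in range(0, len(data), w):
        # refine the previous estimates using the next block of w data
        theta = em_block_update(theta, data[start:start + w])
    return theta
```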
Appendix A

$Q_{ML}$ Function Expansion

In the case of a mixture of multinomials, the $Q_{ML}$ function can be expanded as follows:

\[
Q_{ML} = \mathbb{E}_{p(z \mid x, \theta^{(t)})}\left[ \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} I_c(z_i) \log\Big( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \Big) \right]
= \sum_{c_1=1}^{n_c} \cdots \sum_{c_{n_x}=1}^{n_c} \left[ \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} I_c(z_i) \log\Big( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \Big) \right] \prod_{j=1}^{n_x} p(z_j \mid x_j, \theta^{(t)})
\]
\[
= \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} \log\Big( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \Big) \sum_{c_1=1}^{n_c} \cdots \sum_{c_{n_x}=1}^{n_c} I_c(z_i) \prod_{j=1}^{n_x} p(z_j \mid x_j, \theta^{(t)}) \tag{A.1}
\]

Expression (A.1) can be simplified by noticing that

\[
\sum_{c_1=1}^{n_c} \cdots \sum_{c_{n_x}=1}^{n_c} I_c(z_i) \prod_{j=1}^{n_x} p(z_j \mid x_j, \theta^{(t)})
= \left[ \prod_{j=1,\, j \neq i}^{n_x} \sum_{c_j=1}^{n_c} p(z_j = c_j \mid x_j, \theta^{(t)}) \right] p(z_i = c \mid x_i, \theta^{(t)})
= p(z_i = c \mid x_i, \theta^{(t)}).
\]

Therefore, $Q_{ML}$ reduces to:

\[
Q_{ML} = \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} p(z_i = c \mid x_i, \theta^{(t)}) \log\Big( \lambda_c \prod_{a=1}^{n_a} \varphi_{a,c}^{x_{i,a}} \Big).
\]

When it comes to mixtures of Gaussians, everything is identical except for the component density, which is now a Gaussian:

\[
Q_{ML} = \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} p(z_i = c \mid x_i, \theta^{(t)}) \log \lambda_c
+ \sum_{i=1}^{n_x} \sum_{c=1}^{n_c} p(z_i = c \mid x_i, \theta^{(t)}) \left[ -\tfrac{1}{2}\log|\Sigma_c| - \tfrac{1}{2}(x_i - \mu_c)'\Sigma_c^{-1}(x_i - \mu_c) \right].
\]

Appendix B

M Step for Mixtures of Multinomials with ML Estimates

In the M step, we need to maximize $Q_{ML}$ subject to the constraints $\sum_{c=1}^{n_c} \lambda_c = 1$ and $\sum_{a=1}^{n_a} \varphi_{a,c} = 1$. To compute $\lambda$, we introduce a Lagrange multiplier $\mu$ and maximize the Lagrangian:

\[
\mathcal{L} = Q_{ML} + \mu\Big(1 - \sum_{c=1}^{n_c} \lambda_c\Big) \tag{B.1}
\]

Then we differentiate Eqn. (B.1) with respect to $\lambda_c$ and equate to zero:

\[
\frac{\partial}{\partial \lambda_c}\left[ \sum_{i=1}^{n_x}\sum_{c=1}^{n_c} p(z_i = c \mid x_i, \theta^{(t)}) \log\Big(\lambda_c \prod_{a=1}^{n_a}\varphi_{a,c}^{x_{i,a}}\Big) + \mu\Big(1 - \sum_c \lambda_c\Big) \right] = 0
\;\Longrightarrow\;
\sum_{i=1}^{n_x} p(z_i = c \mid x_i, \theta^{(t)})\,\frac{1}{\lambda_c} - \mu = 0 \tag{B.2}
\]

Summing both sides of Eqn. (B.2) over all values of $c$, we have $\mu = n_x$. Letting $\xi_{i,c} = p(z_i = c \mid x_i, \theta^{(t)})$, we get

\[
\lambda_c = \frac{1}{n_x}\sum_{i=1}^{n_x} \xi_{i,c}.
\]

We now turn to the computation of $\varphi$:

\[
\frac{\partial}{\partial \varphi_{a,c}}\left[ \sum_{i=1}^{n_x}\sum_{c=1}^{n_c} \xi_{i,c} \log\Big(\lambda_c \prod_{a=1}^{n_a}\varphi_{a,c}^{x_{i,a}}\Big) + \mu_c\Big(1 - \sum_a \varphi_{a,c}\Big) \right] = 0
\;\Longrightarrow\;
\sum_{i=1}^{n_x} \xi_{i,c}\,\frac{x_{i,a}}{\varphi_{a,c}} - \mu_c = 0 \tag{B.3}
\]

Summing both sides of Eqn. (B.3) over all values of $a$, we have $\mu_c = \sum_{i=1}^{n_x} \xi_{i,c}\, n_i$, where $n_i = \sum_{a=1}^{n_a} x_{i,a}$. Thus we get

\[
\varphi_{a,c} = \frac{\sum_{i=1}^{n_x} \xi_{i,c}\, x_{i,a}}{\sum_{i=1}^{n_x} \xi_{i,c}\, n_i}.
\]
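As a quick numerical check of these closed-form estimates, the following sketch (with illustrative names and synthetic counts) verifies that the resulting $\lambda$ and $\varphi$ satisfy the normalization constraints:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nc, na = 6, 3, 4                        # data points, clusters, symbols
x = rng.integers(1, 5, size=(nx, na))       # multinomial counts x_{i,a}
xi = rng.dirichlet(np.ones(nc), size=nx)    # responsibilities xi_{i,c}

lam = xi.sum(axis=0) / nx                   # lambda_c = (1/nx) sum_i xi_{i,c}

num = xi.T @ x                              # sum_i xi_{i,c} x_{i,a}, shape (nc, na)
phi = num / num.sum(axis=1, keepdims=True)  # row sum equals sum_i xi_{i,c} n_i

assert np.isclose(lam.sum(), 1.0)
assert np.allclose(phi.sum(axis=1), 1.0)
```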
