On-line EM and Quasi-Bayes or: How I Learned to Stop Worrying and Love Stochastic Approximation

by

Kejie Bao
B.E., Zhejiang University, 2001

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 2003
© Kejie Bao, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives.
It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

The EM algorithm is one of the most popular statistical learning algorithms. Unfortunately, it is a batch learning method. For large data sets and real-time systems, we need to develop on-line methods. In this thesis, we present a comprehensive study of on-line EM algorithms. We use Bayesian theory to propose a new on-line EM algorithm for multinomial mixtures. Based on this theory, we show that there is a direct connection between the setting of Bayes priors and the so-called learning rates of stochastic approximation algorithms, such as on-line EM and quasi-Bayes. Finally, we present extensive simulations, comparisons and parameter sensitivity studies on both synthetic data and documents with text, images and music.

Contents

Abstract
Contents
Publications
Acknowledgements
Dedication
1 Introduction
2 Background: The EM Algorithm
  2.1 General Derivation for EM
  2.2 Batch EM for Mixtures of Multinomials with ML Estimates
3 On-line EM Algorithms
  3.1 On-line EM Algorithm via Weighted Means
    3.1.1 Weighted-Mean Expression for Batch Updates
    3.1.2 Weighted-Mean Expression for On-line Updates
    3.1.3 Forgetting Factor
    3.1.4 On-line EM for Mixtures of Multinomials
  3.2 On-line EM Algorithm within Bayesian Framework
    3.2.1 Coherent Bayes Solution for Sequential Learning of Mixtures
    3.2.2 Quasi-Bayes Sequential Learning for Mixtures of Multinomials
4 Convergence Issues
  4.1 Learning Rate Schedules
5 Experiments
  5.1 Experiments on Synthetic Data
  5.2 Modelling Documents with Text, Images and Music
    5.2.1 Model Specification
    5.2.2 On-line EM Algorithms
    5.2.3 Experiments and Results
6 Conclusion
7 Future Work
Bibliography
Appendix A Q^ML Function Expansion
Appendix B M Step for Mixtures of Multinomials with ML Estimates

Publications

Jim Little, Nando de Freitas, Jesse Hoey, Kejie Bao and Eric Brochu. Analysis of driver behavior with computer vision. NISSAN-UBC (FY2002) partnership, Department of Computer Science, The University of British Columbia, 2003.

Eric Brochu, Nando de Freitas and Kejie Bao. The Sound of an Album Cover: A Probabilistic Approach to Multimedia. Artificial Intelligence and Statistics, 2003.

Acknowledgements

I owe an immense debt of gratitude to my supervisor, Nando, a very kind and enthusiastic person to work with. His inspiring guidance from the formative stages, his stimulating ideas and sound advice were invaluable to me. Without him, this thesis would not have been completed. I am grateful to Will for his careful reading and helpful comments. I would also like to thank Eric for his cooperation and timely help. Thanks to Naizhi, for his patience in reading the first draft and for giving me many useful suggestions. Thanks to Qi for her kindness all along. She is truly a friend who's always there. I also want to thank my good friends, Jihong Ren, Rui Li, Juan Li, Kan Cai, Zhimin Chen, Lin Zhong, Jian Chen, Xiaohong Zhao, Yan Zhao, Shaofeng Bu and Peng Peng, for endowing me with unforgettable memories in Vancouver.

KEJIE BAO
The University of British Columbia
August 2003

Dedicated to my parents, who endowed me with encouragement, inspiration, love and care throughout the course of this thesis.

Chapter 1

Introduction

A great many learning problems, such as filling in missing data, inferring latent variables and estimating the parameters of statistical models, fall back on the expectation-maximization (EM) algorithm, which performs maximum likelihood (ML) or maximum a posteriori (MAP) estimation for incomplete data. The earliest work dates back several decades, when McKendrick [1] applied the idea to medical problems.
Hartley [2] first explicitly introduced maximum likelihood estimation (MLE) for incomplete data in the general case. Baum et al. [3] applied the algorithm in a special case of hidden Markov models. The quintessential work by Dempster, Laird and Rubin [4] (DLR) formalized EM, proved its convergence and demonstrated its broad applicability by unifying strands of ostensibly unrelated applications under the banner of EM. The scope of these applications is surveyed in a recent book by McLachlan and Krishnan [5].

From an intuitive perspective, the EM algorithm alternates between two procedures after an initial guess. It consists of an expectation (E) step, which fills in the missing data by computing their distribution given the observed variables and the current parameter estimates, and a maximization (M) step, which re-estimates the parameters to maximize the likelihood under the premise that the distribution computed in the E step is correct. It was proven that the true likelihood never decreases as the iterations proceed [4], so convergence to a local maximum of the likelihood function is guaranteed.

However, batch EM fails to scale well as the data set becomes huge, or when a real dynamic environment demands on-line processing, so that the data arrive one after another rather than in a batch and the model parameters are updated using just the current data at each time slice. Applications such as on-line querying of music, images and text, and on-line motion analysis naturally fit this scenario. Regarded as a stochastic approximation method [6], on-line EM algorithms have been proposed for various restricted domains [7][8][9][10]. A general presentation for exponential family models with forgetting factors appeared in [11][12]. Neal and Hinton [13] introduced a free energy formulation of the EM algorithm and also presented the incremental EM algorithm and its variants with a partial E step, which save computation over batch EM for big data sets.
Yet these methods are still inappropriate for real-time learning, where new data arrive at each time frame, inasmuch as they require additional storage for either all the data or sufficient statistics computed from all the data at each iteration.

Sato also proposed in [11] an appropriate choice for the learning rate, the (1/t)-schedule. He compared the performance of batch EM, on-line EM and stochastic gradient algorithms on some simple examples of mixtures of Gaussians, and showed that on-line EM outperforms its batch and stochastic gradient counterparts in both convergence rate and accuracy. The reason lies in the fact that on-line EM can avoid being trapped at a local maximum thanks to its stochastic nature in the early stage of learning. Yet it does not guarantee convergence to the global maximum.

A much earlier treatment of on-line EM appeared under the name quasi-Bayes in [14]. There, it was shown that the algorithm can also be interpreted as a stochastic approximation procedure, and therefore its convergence properties were established using stochastic approximation theory. In that article, a series of simulations compared the quasi-Bayes procedure with the exact Bayes solution. The results of the former were quite close to those of the latter, while the former greatly reduced the computational complexity.

We extend the quasi-Bayes procedure of [14] to the case of mixtures of multinomials. We derive the on-line update equations for all the parameters, with the learning rates explained by priors in posterior mean estimation (PME). The connection between the learning rates in the update equations and Bayesian prior models motivated our research on learning rate schedules. In this thesis, a detailed recipe of learning rate schedules for the proposed on-line EM algorithm is presented, based upon performance analysis and parameter sensitivity analysis.
We carry out comprehensive comparisons between batch EM and on-line EM algorithms derived from two different views. The results of the comparisons show that the on-line EM algorithms, which exploit stochasticity, empirically outperform their batch counterpart. The real-data application reveals that the proposed on-line EM performs better than Sato's on-line EM. Moreover, our experiments attest that approximately the same assignment of both the PME priors and the schedule parameters for the proposed on-line EM generates similar performance. This implies that researchers can use ideas from Bayesian statistics to come up with reasonable learning rates for stochastic approximation schemes.

This thesis is organized as follows: we first review batch EM algorithms for mixtures of multinomials. We then introduce on-line EM algorithms and derive them from two different perspectives. The convergence issue, including learning rate schedules, is discussed. Finally, we present extensive simulations, comparisons and parameter sensitivity studies on both synthetic and real data for mixtures of multinomials.

Chapter 2

Background: The EM Algorithm

The EM algorithm is a general approach for maximum likelihood or maximum a posteriori estimation in statistical models with latent variables. It can be derived in many different ways. We adopt one of the most insightful explanations, in terms of lower bound maximization [13][15]. In this derivation, the E-step can be interpreted as computing a local lower bound of the likelihood function or posterior distribution, whereas the M-step maximizes the bound, improving the estimates for the parameters.

2.1 General Derivation for EM

Let x denote the observed variables, and let z denote the latent variables. We wish to find the MLE for the parameters of a probability model p(x, z|θ).
The marginal probability of the data x is p(x|θ) = Σ_z p(x, z|θ). Given the observed data x, our goal is to find the value of the parameters θ that maximizes the log-likelihood, l(θ) = log p(x|θ) = log Σ_z p(x, z|θ). The pivotal problem encountered while maximizing it is the logarithm of a big sum, which cannot be decoupled. We therefore construct a tractable lower bound L(q, θ), where q refers to some distribution over z so that Σ_z q(z|x) = 1. It is proven in [13] that if a local maximum of L occurs at q* and θ*, a local maximum of l is also reached at θ*. Therefore, we can solve the original problem by finding a local maximum of L. We define the lower bound function L(q, θ) as follows:

  L(q, θ) ≜ E_{q(z|x)}[log p(x, z|θ)] + H(q)   (2.1)

where H(q) = −E_q[log q] is the entropy of the distribution q and E_q denotes expectation with respect to the distribution q. The function L can also be interpreted as the "variational free energy" of statistical physics, with the physical states being z and the energy of state z being −log p(x, z|θ) [13].

We can also show that the difference between l(θ) and L(q, θ) is the Kullback-Leibler (KL) divergence between q(z|x) and p(z|x, θ):

  l(θ) − L(q, θ) = l(θ) − Σ_z q(z|x) log [p(x, z|θ)/q(z|x)]
                 = Σ_z q(z|x) log p(x|θ) − Σ_z q(z|x) log [p(x, z|θ)/q(z|x)]
                 = Σ_z q(z|x) log [q(z|x) p(x|θ) / p(x, z|θ)]
                 = Σ_z q(z|x) log [q(z|x) / p(z|x, θ)]
                 = D(q(z|x) || p(z|x, θ))   (2.2)

Since the KL divergence is non-negative, and is zero between identical distributions, we maximize L(q, θ) by minimizing the difference between l(θ) and L(q, θ). As proven in [13], for a fixed value of θ, there is a unique distribution, q(z|x) = p(z|x, θ), that maximizes L(q, θ). If q(z|x) = p(z|x, θ), then L(q, θ) = l(θ).
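The identity in Eqn. (2.2) can be checked numerically on a toy two-state example. The following sketch (all numbers are illustrative, not from the thesis) verifies that l(θ) − L(q, θ) equals the KL divergence, and that the bound is tight when q is the exact posterior:

```python
import numpy as np

# Joint p(x, z | theta) for a single observed x, over z in {0, 1}.
joint = np.array([0.3, 0.1])        # p(x, z=0 | theta), p(x, z=1 | theta)
marginal = joint.sum()              # p(x | theta)
posterior = joint / marginal        # p(z | x, theta)

q = np.array([0.6, 0.4])            # an arbitrary distribution over z

log_lik = np.log(marginal)                                   # l(theta)
entropy = -(q * np.log(q)).sum()                             # H(q)
lower_bound = (q * np.log(joint)).sum() + entropy            # L(q, theta), Eqn (2.1)
kl = (q * np.log(q / posterior)).sum()                       # KL(q || p(z|x, theta))

# Eqn (2.2): the gap between the log-likelihood and the bound is the KL divergence.
assert np.isclose(log_lik - lower_bound, kl)

# With q = posterior, the bound is tight: L(q*, theta) = l(theta).
q = posterior
lower_bound = (q * np.log(joint)).sum() - (q * np.log(q)).sum()
assert np.isclose(lower_bound, log_lik)
```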
The EM algorithm first finds an optimal lower bound L(q, θ^(t)) at the current estimate θ^(t) by taking q^(t+1) = p(z|x, θ^(t)) in the "expectation" (E) step. It then maximizes this bound to obtain the new estimate in the "maximization" (M) step. Therefore, the EM algorithm can be formalized as follows:

• E step: Compute q^(t+1) = p(z|x, θ^(t)).
• M step: θ^(t+1) = argmax_θ L(q^(t+1), θ).

In Eqn. (2.1), H(q) is independent of θ, thus maximizing E_{q^(t+1)}[log p(x, z|θ)] with respect to θ has the same effect as maximizing L(q, θ) with respect to θ. An alternative, equivalent view of the EM algorithm is as follows:

• E step: Compute the expected complete log-likelihood with respect to the distribution of the latent variables, Q^ML = E_{p(z|x, θ^(t))}[log p(x, z|θ)].
• M step: Perform the maximization θ^(t+1) = argmax_θ Q^ML.

2.2 Batch EM for Mixtures of Multinomials with ML Estimates

Mixture-density parameter estimation for finite mixture models [16] is a common application of EM. Finite mixture models provide a simple framework to model population heterogeneity. They have proven to be sound tools for data analysis, such as mixtures of factor analyzers [17], principal component analyzers [18], mixtures of HMMs (switching hidden Markov models) [30] and mixtures of Kalman filters (switching state-space models) [20].
Let x = (x_1, ..., x_{n_x})¹ denote a collection of n_x random variables, each belonging to one of the classes C_1, C_2, ..., C_{n_c}. Given the non-negative mixing weights λ = (λ_1, λ_2, ..., λ_{n_c}), with Σ_{c=1}^{n_c} λ_c = 1, and the corresponding density functions f_1, f_2, ..., f_{n_c}, the random variables x_i (i = 1, 2, ..., n_x) have the following distribution:

  f(x_i|φ, λ) = Σ_{c=1}^{n_c} λ_c f_c(x_i|φ_c)   (2.3)

where φ_c denotes the density parameter for component c.

Let θ = (λ, φ) denote the parameters of this model. Learning the mixture is simply estimating θ, which is usually done with the EM algorithm. For concreteness, we show the derivation of the EM algorithm for mixtures of multinomials with MLE. In this case, each variable x_i is drawn from a mixture of multinomials, and each x_i can be represented as x_i = (x_{i,1}, ..., x_{i,n_a}), a collection of integer-valued random variables representing event counts; x_{i,a} represents the number of times the a-th event occurs in a set of n_i independent trials. Therefore Σ_{a=1}^{n_a} x_{i,a} = n_i (n_i may vary for different i). Each variable is drawn from the following finite probabilistic mixture model:

  p(x_i|θ) = Σ_{c=1}^{n_c} λ_c p(x_i|z_i = c, θ)   (2.4)

The model parameters are θ = (λ, φ), with Σ_{a=1}^{n_a} φ_{a,c} = 1 for each c. λ denotes the n_c mixing coefficients λ_c for the n_c mixed component densities, with Σ_{c=1}^{n_c} λ_c = 1. To generate a variable from the mixture density, we first choose a mixture component with probability λ_c and then sample the variable from this mixture component according to its component density.

For convenience in inference problems for such a model, we introduce the latent variables z_i ∈ {1, ..., n_c} to indicate which component density generated each variable.

¹For simplicity, we use x_i to denote both the random variable and its realization.
Obviously, p(z_i = c) = λ_c. Thus z_i = c if the i-th sample was generated by the c-th mixture component. In other words, z_i is assumed to be drawn from a multinomial distribution, z_i ~ M_{n_c}(1, λ), which admits the density

  p(z_i|λ) = Π_{c=1}^{n_c} λ_c^{I_c(z_i)}   (2.5)

where I_c(z_i) = 1 if x_i belongs to component c and I_c(z_i) = 0 otherwise. Assuming x_i|(z_i = c, θ) ~ M_{n_a}(n_i, Π_{a=1}^{n_a} φ_{a,c}^{x_{i,a}}), we can now express the mixture model in complete-data form:²

  p(x, z|θ) = Π_{i=1}^{n_x} Π_{c=1}^{n_c} (λ_c Π_{a=1}^{n_a} φ_{a,c}^{x_{i,a}})^{I_c(z_i)}   (2.6)

  log p(x, z|θ) = Σ_{i=1}^{n_x} Σ_{c=1}^{n_c} I_c(z_i) (log λ_c + Σ_{a=1}^{n_a} x_{i,a} log φ_{a,c})   (2.7)

  Q^ML = E_{p(z|x, θ^(t))}[log p(x, z|θ)] = Σ_{i=1}^{n_x} Σ_{c=1}^{n_c} p(z_i = c|x_i, θ^(t)) (log λ_c + Σ_{a=1}^{n_a} x_{i,a} log φ_{a,c})   (2.8)

Let ξ_{i,c} = p(z_i = c|x_i, θ^(t)) for notational simplicity. Eqn. (2.8) is what the E step computes. In the M step, we need to maximize Q^ML subject to the constraints Σ_{c=1}^{n_c} λ_c = 1 and Σ_{a=1}^{n_a} φ_{a,c} = 1. Appendix B accomplishes this by introducing Lagrange multipliers. The results are:

  λ_c = (1/n_x) Σ_{i=1}^{n_x} ξ_{i,c}   (2.9)

  φ_{a,c} = Σ_{i=1}^{n_x} ξ_{i,c} x_{i,a} / Σ_{i=1}^{n_x} ξ_{i,c} n_i   (2.10)

Now we can summarize the EM algorithm for mixtures of multinomials:

• E step: Compute the distribution of the latent variables:

  ξ_{i,c} = p(z_i = c|x_i, θ^(t)) = λ_c^(t) Π_{a=1}^{n_a} (φ_{a,c}^(t))^{x_{i,a}} / Σ_{c'=1}^{n_c} λ_{c'}^(t) Π_{a=1}^{n_a} (φ_{a,c'}^(t))^{x_{i,a}}   (2.11)

• M step: Perform the maximization θ^(t+1) = argmax_θ Q^ML under the constraints Σ_{c=1}^{n_c} λ_c = 1 and Σ_{a=1}^{n_a} φ_{a,c} = 1, using Eqns (2.9) and (2.10).

Chapter 3

On-line EM Algorithms

Batch EM fails to scale well when the number of data points becomes very large, or when a real dynamic environment demands on-line processing. To surmount this problem, on-line EM algorithms are called for. In this setting, the training data x are supplied one by one and the model parameters θ are updated at each time frame using the current data point. Applications such as on-line querying of music, images and text, and on-line motion analysis naturally fit this scenario. In this chapter, on-line EM algorithms are derived in two ways, following Sato [11][12] and Smith & Makov [14]. The former is derived with sufficient statistics expressed as weighted means, while the latter, by introducing priors, is cast in the Bayesian framework.

3.1 On-line EM Algorithm via Weighted Means

3.1.1 Weighted-Mean Expression for Batch Updates

As Eqns (2.9) and (2.10) show, we have estimates of the parameters in the M step. We can also express them in terms of weighted means of some sufficient statistics.

²We use italic Roman letters to refer to both individual and compound items.
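The batch updates of Eqns (2.9)-(2.11) can be collected into a short routine. The following is a sketch, not the thesis's own code; the function name and initialization scheme are our choices:

```python
import numpy as np

def batch_em_multinomial(X, n_c, n_iter=50, seed=0):
    """Batch EM for a mixture of multinomials.

    X   : (n_x, n_a) matrix of event counts, one row per data point x_i
    n_c : number of mixture components
    Returns the mixing weights lam (n_c,) and event probabilities phi (n_a, n_c).
    """
    rng = np.random.default_rng(seed)
    n_x, n_a = X.shape
    lam = np.full(n_c, 1.0 / n_c)                   # uniform mixing weights
    phi = rng.dirichlet(np.ones(n_a), size=n_c).T   # random (n_a, n_c) start
    for _ in range(n_iter):
        # E step, Eqn (2.11): xi[i, c] = p(z_i = c | x_i, theta),
        # computed in log space for numerical stability.
        log_resp = np.log(lam) + X @ np.log(phi)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        xi = np.exp(log_resp)
        xi /= xi.sum(axis=1, keepdims=True)
        # M step, Eqns (2.9)-(2.10).
        lam = xi.mean(axis=0)
        phi = (X.T @ xi) / (xi * X.sum(axis=1, keepdims=True)).sum(axis=0)
    return lam, phi
```

By construction the M step keeps each column of phi normalized, since the denominator sums the numerator over the events a.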
We define ⟨f(x)⟩_c(n_x), the weighted mean of f(x) with respect to the posterior probability ξ_{i,c}, by

  ⟨f(x)⟩_c(n_x) = (1/n_x) Σ_{i=1}^{n_x} f(x_i) ξ_{i,c}   (3.1)

Here n_x denotes the total number of data points. Therefore, we can write the M step updates as:

  λ_c = ⟨1⟩_c(n_x)   (3.2)

  φ_{a,c} = ⟨x_a⟩_c(n_x) / ⟨Σ_a x_a⟩_c(n_x)   (3.3)

3.1.2 Weighted-Mean Expression for On-line Updates

In the case of on-line processing, we supply one data point per iteration and the estimates are changed after each observation. For on-line processing, we use superscripts to denote time stamps. Thus the first t data points are presented as x^(1), x^(2), ..., x^(t); ξ_c^(τ) denotes the posterior p(z^(τ) = c|x^(τ), θ^(τ−1)); and θ^(t) denotes the parameter estimate after the t-th observation x^(t). We define another weighted mean of f(x) at time t, ⟨f(x)⟩_c^(t), as follows:

  ⟨f(x)⟩_c^(t) = η(t) Σ_{τ=1}^{t} (Π_{s=τ+1}^{t} λ(s)) f(x^(τ)) ξ_c^(τ)   (3.4)

with the normalization coefficient

  η(t) = (Σ_{τ=1}^{t} Π_{s=τ+1}^{t} λ(s))^{-1}   (3.5)

Separating the τ = t term in Eqn. (3.4) gives the recursive form

  ⟨f(x)⟩_c^(t) = η(t) [λ(t) η(t−1)^{-1} ⟨f(x)⟩_c^(t−1) + f(x^(t)) ξ_c^(t)]   (3.6)

From Eqn. (3.5), we know that

  (Σ_{τ=1}^{t} Π_{s=τ+1}^{t} λ(s))^{-1} = (1 + λ(t) Σ_{τ=1}^{t−1} Π_{s=τ+1}^{t−1} λ(s))^{-1} = (λ(t) η(t−1)^{-1} + 1)^{-1}

Thus we get:

  η(t) = (λ(t)/η(t−1) + 1)^{-1}   (3.7)

From Eqn. (3.6) and Eqn. (3.7), we have:

  ⟨f(x)⟩_c^(t) = ⟨f(x)⟩_c^(t−1) + η(t) [f(x^(t)) p(z^(t) = c|x^(t), θ^(t−1)) − ⟨f(x)⟩_c^(t−1)]

If we write this expression as

  ⟨f(x)⟩_c^(t) = (1 − η(t)) ⟨f(x)⟩_c^(t−1) + η(t) f(x^(t)) p(z^(t) = c|x^(t), θ^(t−1))

we can see that the current estimate is a weighted sum of the previous estimate and the contribution of the current data point.

3.1.3 Forgetting Factor

In Eqn. (3.4) and Eqn. (3.5), a time-dependent forgetting factor λ(t) is introduced to discount the contributions of the earlier estimates. Suppose λ(1) = 0.999, and λ(t) approaches 1 over time. At time 10001, the effect of the first-round estimates on the current estimates is reduced to less than 0.999^10000/10000 × 100% = 4.5e-7%, while the 9001st estimates contribute almost 3.6e-3%. The largest weight, 0.01%, is contributed by the 10000th estimates.
If there were no such factor, i.e., λ(t) = 1, each observed datum would weigh equally in the weighted-mean expression for the sufficient statistics, and the learning rate expression would reduce to η(t) = 1/t. Intuitively, we may then see slow updates of the estimates, in that the earlier, inaccurate estimates have not been forgotten and still affect later learning cycles. The forgetting factor plays an important role in achieving fast convergence. An appropriate schedule for the forgetting factor is discussed in [11], and is recapitulated in Section 4.1.

Seeking equivalence between the on-line EM algorithm and its batch counterpart, for the sake of comparability of the two algorithms, the value of λ(t) is chosen as in [12]. It is assumed that the same set of data is repeatedly supplied to the on-line learning system, i.e., x(t + T) = x(t), where T is the period of an epoch. The time t is represented by an epoch index k (= 1, 2, ...) and a data number index i within the epoch, i.e., t = (k − 1)T + i. Assume that the model parameter θ is updated at the end of each epoch, i.e., when i = T; then the on-line EM algorithm equals its batch counterpart when λ is given by

  λ(t) = 0 if t = (k − 1)T + 1, and λ(t) = 1 otherwise.

Thus, when we compare the performance of the batch and on-line EM algorithms, we run the algorithms for the same number of iterations (epochs), updating the parameters at each time slice within each epoch in the on-line case and at the end of each epoch in the batch case.

3.1.4 On-line EM for Mixtures of Multinomials

Based on the recursion in Section 3.1.2, we derive the on-line EM algorithm for mixtures of multinomials.
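The recursion of Eqn. (3.7) makes the role of the forgetting factor concrete: with λ(t) = 1 it collapses to the 1/t schedule, while λ(t) < 1 keeps the rate larger, discounting old data faster. A few lines of Python (our own sketch, not from the thesis) verify this:

```python
def eta_sequence(lam_fn, T, eta1=1.0):
    """Iterate eta(t) = (lambda(t)/eta(t-1) + 1)^(-1), Eqn (3.7), from eta(1) = eta1.

    lam_fn : forgetting-factor schedule t -> lambda(t)
    Returns [eta(1), ..., eta(T)].
    """
    etas = [eta1]
    for t in range(2, T + 1):
        etas.append(1.0 / (lam_fn(t) / etas[-1] + 1.0))
    return etas

# With no forgetting (lambda = 1) every datum weighs equally: eta(t) = 1/t.
uniform = eta_sequence(lambda t: 1.0, 5)      # -> [1, 1/2, 1/3, 1/4, 1/5]

# A forgetting factor below 1 keeps eta(t) larger at every step.
discounted = eta_sequence(lambda t: 0.9, 5)
```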
• E step: Compute the distribution of the latent variables:

  ξ_c^(t) = p(z^(t) = c|x^(t), θ^(t−1)) = λ_c^(t−1) Π_{a=1}^{n_a} (φ_{a,c}^(t−1))^{x_a^(t)} / Σ_{c'=1}^{n_c} λ_{c'}^(t−1) Π_{a=1}^{n_a} (φ_{a,c'}^(t−1))^{x_a^(t)}   (3.8)

• M step: Update the estimates in terms of weighted means:

  ⟨1⟩_c^(t) = ⟨1⟩_c^(t−1) + η(t) (ξ_c^(t) − ⟨1⟩_c^(t−1))   (3.9)

  ⟨x_a⟩_c^(t) = ⟨x_a⟩_c^(t−1) + η(t) (x_a^(t) ξ_c^(t) − ⟨x_a⟩_c^(t−1))   (3.10)

  ⟨Σ_a x_a⟩_c^(t) = ⟨Σ_a x_a⟩_c^(t−1) + η(t) ((Σ_a x_a^(t)) ξ_c^(t) − ⟨Σ_a x_a⟩_c^(t−1))   (3.11)

  λ_c^(t) = ⟨1⟩_c^(t)   (3.12)

  φ_{a,c}^(t) = ⟨x_a⟩_c^(t) / ⟨Σ_a x_a⟩_c^(t)   (3.13)

In the remainder, we refer to this algorithm as Sato's on-line EM (SOEM).

3.2 On-line EM Algorithm within Bayesian Framework

3.2.1 Coherent Bayes Solution for Sequential Learning of Mixtures

Consider the problem of unsupervised sequential learning and classification. Let x^(1), x^(2), ... denote the sequential observations, each belonging to one of the classes C_1, C_2, ..., C_{n_c}. In the scenario of sequential learning, for each t, the t-th observation is classified on, and only on, the basis of the previous observations x^(1), x^(2), ..., x^(t−1) and the current datum x^(t). Given the non-negative mixing weights λ = (λ_1, λ_2, ..., λ_{n_c}) and the density functions f_1, f_2, ..., f_{n_c}, the random variables x^(t) have the following distribution:

  f(x^(t)|λ) = Σ_{k=1}^{n_c} λ_k f_k(x^(t))   (3.14)

where Σ_{k=1}^{n_c} λ_k = 1. Here we assume, as in [14], that the f_k are known. We will soon relax this assumption and extend the method to parameterized density functions. Let p(λ) denote a prior for λ, let p(λ|x^(1), x^(2), ..., x^(t)) denote the posterior density of λ given x^(1), x^(2), ..., x^(t), and let p_c(λ|x^(1), x^(2), ..., x^(t)) denote the posterior density of λ given x^(1), x^(2), ..., x^(t) when x^(t) belongs to the c-th cluster.
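The SOEM steps above map directly onto a small update routine. The following is a sketch (the function name and state layout are our own, not the thesis's): it carries the three weighted means ⟨1⟩_c, ⟨x_a⟩_c and ⟨Σ_a x_a⟩_c as state and applies Eqns (3.8)-(3.13) for one datum:

```python
import numpy as np

def soem_step(x, lam, phi, stats, t, eta_fn):
    """One step of Sato-style on-line EM (SOEM) for a multinomial mixture.

    x      : (n_a,) count vector observed at time t
    stats  : (one_c, xa_c, n_c) = weighted means <1>_c, <x_a>_c, <sum_a x_a>_c
    eta_fn : learning-rate schedule t -> eta(t)
    """
    one_c, xa_c, n_c = stats
    eta = eta_fn(t)
    # E step, Eqn (3.8): responsibilities xi_c prop. to lam_c * prod_a phi_ac^x_a.
    log_r = np.log(lam) + x @ np.log(phi)
    xi = np.exp(log_r - log_r.max())
    xi /= xi.sum()
    # M step, Eqns (3.9)-(3.11): move each weighted mean toward the new datum.
    one_c += eta * (xi - one_c)
    xa_c += eta * (np.outer(x, xi) - xa_c)
    n_c += eta * (x.sum() * xi - n_c)
    # Eqns (3.12)-(3.13): read the parameters off the sufficient statistics.
    return one_c.copy(), xa_c / n_c, (one_c, xa_c, n_c)
```

Note that λ^(t) = ⟨1⟩^(t) stays normalized because the responsibilities sum to one, and the columns of φ^(t) stay normalized as long as the statistics are initialized consistently.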
By Bayes' theorem, we have, for t ≥ 1,

  p(λ|x^(1), x^(2), ..., x^(t)) ∝ f(x^(t)|λ) p(λ|x^(1), x^(2), ..., x^(t−1))   (3.15)

Again, we have allocation variables z^(1), z^(2), ..., z^(t), such that z^(t) = c if and only if x^(t) belongs to the c-th cluster. From Eqn. (3.14) and Eqn. (3.15), the posterior is calculated as follows:

  p(λ|x^(1), x^(2), ..., x^(t)) = Σ_{c=1}^{n_c} p(z^(t) = c|x^(1), x^(2), ..., x^(t)) p_c(λ|x^(1), x^(2), ..., x^(t))   (3.16)

where, for c = 1, 2, ..., n_c,

  p(z^(t) = c|x^(1), x^(2), ..., x^(t)) ∝ f_c(x^(t)) λ_c^(t−1)   (3.17)

and

  λ_c^(t−1) = ∫ λ_c p(λ|x^(1), x^(2), ..., x^(t−1)) dλ   (3.18)

Eqn. (3.16), together with Eqn. (3.17) and Eqn. (3.18), carries out the sequential learning recursively. Eqn. (3.18) computes the posterior mean estimate (PME) of λ_c in case the λ_c are unknown. The learning procedure seems straightforward, yet it is impractical. Recursive computation of Eqn. (3.16) in mixture form causes a linear combination of component posterior densities growing at exponential rate, which renders precise computation prohibitively time-consuming. Consequently, an approximate solution called quasi-Bayes was proposed in [14], which takes both the convergence and computational issues into account, while retaining the form of the Bayes procedure.

3.2.2 Quasi-Bayes Sequential Learning for Mixtures of Multinomials

In this section, we derive a new on-line algorithm for mixtures of multinomials while explicitly obtaining expressions for the learning rates. Considering the learning problem for mixtures of multinomials discussed in previous sections, we place a conjugate Dirichlet prior on the mixing coefficients, λ ~ D_{n_c}(α), with the following density:

  p(λ) = [Γ(α_0) / Π_{c=1}^{n_c} Γ(α_c)] Π_{c=1}^{n_c} λ_c^{α_c − 1}   (3.19)

where Γ(·) denotes the Gamma function and α_0 = Σ_c α_c.
Similarly, we place a Dirichlet prior on each φ_c:

  p(φ) = Π_{c=1}^{n_c} [Γ(β_{0,c}) / Π_{a=1}^{n_a} Γ(β_{a,c})] Π_{a=1}^{n_a} φ_{a,c}^{β_{a,c} − 1}   (3.20)

where β_{0,c} = Σ_{a=1}^{n_a} β_{a,c}. The posterior distribution p(z, θ|x) can be obtained by multiplying the prior functions Eqn. (3.19), Eqn. (3.20) and the likelihood function Eqn. (2.7), and then normalizing the product:

  p(z, θ|x) ∝ p(x, z|θ) p(θ)
            ∝ [Π_{i=1}^{n_x} Π_{c=1}^{n_c} (λ_c Π_{a=1}^{n_a} φ_{a,c}^{x_{i,a}})^{I_c(z_i)}] Π_{c=1}^{n_c} λ_c^{α_c − 1} Π_{c=1}^{n_c} Π_{a=1}^{n_a} φ_{a,c}^{β_{a,c} − 1}
            = Π_{c=1}^{n_c} λ_c^{Σ_{i=1}^{n_x} I_c(z_i) + α_c − 1} Π_{c=1}^{n_c} Π_{a=1}^{n_a} φ_{a,c}^{Σ_{i=1}^{n_x} x_{i,a} I_c(z_i) + β_{a,c} − 1}   (3.21)

According to this posterior, we draw

  λ|(z, x, α) ~ D_{n_c}(Σ_{i=1}^{n_x} I_1(z_i) + α_1, ..., Σ_{i=1}^{n_x} I_{n_c}(z_i) + α_{n_c})   (3.22)

  φ_c|(z, x, β) ~ D_{n_a}(Σ_{i=1}^{n_x} x_{i,1} I_c(z_i) + β_{1,c}, ..., Σ_{i=1}^{n_x} x_{i,n_a} I_c(z_i) + β_{n_a,c})   (3.23)

The PME takes the mean of the posterior with respect to the domain of the parameters θ:

  θ^(PME) = ∫_θ θ p(θ|x, z) dθ   (3.24)

The integral is applied to every component of θ. From the standard properties of the Dirichlet distribution, the posterior mean estimates for λ and φ after observing the sequence x_1, ..., x_{n_x} are easily given by:

  λ_c^(PME) = (α_c + Σ_{i=1}^{n_x} I_c(z_i)) / (α_0 + Σ_{i=1}^{n_x} Σ_{c'=1}^{n_c} I_{c'}(z_i))   (3.25)

  φ_{a,c}^(PME) = (β_{a,c} + Σ_{i=1}^{n_x} x_{i,a} I_c(z_i)) / (β_{0,c} + Σ_{i=1}^{n_x} Σ_{a'=1}^{n_a} x_{i,a'} I_c(z_i))   (3.26)

where β_{0,c} = Σ_a β_{a,c}. I_c(z_i) indicates whether z_i = c, and is unknown, yet ξ_{i,c} serves as the expectation of I_c(z_i) [14]. Approximating the estimates by substituting ξ_{i,c} for I_c(z_i), Eqn. (3.25) and Eqn. (3.26) are replaced by:

  λ_c^(PME) = (α_c + Σ_{i=1}^{n_x} ξ_{i,c}) / (α_0 + Σ_{i=1}^{n_x} Σ_{c'=1}^{n_c} ξ_{i,c'})   (3.27)

  φ_{a,c}^(PME) = (β_{a,c} + Σ_{i=1}^{n_x} x_{i,a} ξ_{i,c}) / (β_{0,c} + Σ_{i=1}^{n_x} Σ_{a'=1}^{n_a} x_{i,a'} ξ_{i,c})   (3.28)

We now start from time 0 for our approximation schedule, and the superscripts on the parameters denote time stamps.
After observing the first datum x^(1), the posteriors are:

  λ|(x^(1), α) ~ D_{n_c}(α_1^(0) + ξ_1^(1), ..., α_{n_c}^(0) + ξ_{n_c}^(1))   (3.29)

  φ_c|(x^(1), β) ~ D_{n_a}(β_{1,c}^(0) + x_1^(1) ξ_c^(1), ..., β_{n_a,c}^(0) + x_{n_a}^(1) ξ_c^(1))   (3.30)

Subsequently, the estimates are updated within the Dirichlet family, i.e., p(λ|x^(t), α)^(t) is Dirichlet with parameters α_c^(t) = α_c^(t−1) + ξ_c^(t), and p(φ_c|x^(t), β)^(t) is Dirichlet with parameters β_{a,c}^(t) = β_{a,c}^(t−1) + x_a^(t) ξ_c^(t), where ξ_c^(t) = p(z^(t) = c|x^(t), θ^(t−1)). According to Eqn. (3.29) and Eqn. (3.30), we give the quasi posterior mean estimates for θ as follows:

  λ_c^(t) = (α_c^(t−1) + ξ_c^(t)) / (α_0 + t)   (3.31)

  φ_{a,c}^(t) = (β_{a,c}^(t−1) + x_a^(t) ξ_c^(t)) / (β_{0,c} + Σ_{k=1}^{t} Σ_{a'=1}^{n_a} x_{a'}^(k) ξ_c^(k))   (3.32)

where α_0 = α_1^(0) + ... + α_{n_c}^(0) and β_{0,c} = β_{1,c}^(0) + ... + β_{n_a,c}^(0).

From Eqn. (3.31), we have equations for λ_c at times t and t − 1:

  λ_c^(t) = (α_c^(t−1) + ξ_c^(t)) / (α_0 + t)   (3.33)

  λ_c^(t−1) = α_c^(t−1) / (α_0 + t − 1)   (3.34)

Substituting Eqn. (3.34) into Eqn. (3.33), we get:

  λ_c^(t) = [(α_0 + t − 1) λ_c^(t−1) + ξ_c^(t)] / (α_0 + t)   (3.35)

We can rewrite Eqn. (3.35) in recurrence form as follows:

  λ_c^(t) = λ_c^(t−1) + (1/(α_0 + t)) (ξ_c^(t) − λ_c^(t−1))   (3.36)

where 1/(α_0 + t) serves as the learning rate for all the clusters. This is the same result as in Eqn. (3.12), but this time we know how to set the learning rate. Regarding the initialization of λ_c, we exploit the prior information to denote the initial component probability, λ_c^(0) = α_c^(0)/α_0, which satisfies Eqn. (3.31) when t = 0 as well.
Similarly, let a_c(t) = (β_{0,c} + Σ_{k=1}^{t} Σ_{a'=1}^{n_a} x_{a'}^(k) ξ_c^(k))^{-1}   (3.37)

Then Eqn. (3.32) can be written as the following equations at times t and t − 1:

  φ_{a,c}^(t) = a_c(t) [β_{a,c}^(t−1) + x_a^(t) ξ_c^(t)]   (3.38)

  φ_{a,c}^(t−1) = a_c(t−1) β_{a,c}^(t−1)   (3.39)

From Eqn. (3.38) and Eqn. (3.39), we have:

  φ_{a,c}^(t) = a_c(t) [φ_{a,c}^(t−1)/a_c(t−1) + x_a^(t) ξ_c^(t)]   (3.40)

To put Eqn. (3.40) in the same form as Eqn. (3.36), let η_c(t) = a_c(t) Σ_{a'} x_{a'}^(t) ξ_c^(t). Consequently, we have the recursive equation for φ_{a,c}:

  φ_{a,c}^(t) = φ_{a,c}^(t−1) + η_c(t) [x_a^(t)/Σ_{a'} x_{a'}^(t) − φ_{a,c}^(t−1)]   (3.41)

with the following initialization:

  λ_c^(0) = α_c^(0)/α_0,  φ_{a,c}^(0) = β_{a,c}^(0)/β_{0,c}

In the remainder, we refer to this algorithm as Bayesian on-line EM (BOEM).

Chapter 4

Convergence Issues

Both SOEM and BOEM resort to stochastic approximation theory [6] for convergence issues. Sato pointed out in [11] that the on-line EM algorithm is identical to a stochastic gradient method with the inverse of the Fisher information matrix as a coefficient matrix. If the effective learning rate η(t) (> 0) satisfies the conditions:

  η(t) → 0,  Σ_{t=1}^{∞} η(t) = ∞,  Σ_{t=1}^{∞} η(t)² < ∞

stochastic approximation theory guarantees that the estimates converge to a local maximum of the likelihood function of the data. The (1/t)-schedule has been designed to match these conditions, as will be shown later. In the quasi-Bayes procedure defined by Smith and Makov [14], shown in Section 3.2.2, convergence is again established using standard asymptotic results from stochastic approximation. The assumptions in their theorem can be used to prove the asymptotic convergence of BOEM. However, here we are not interested in the limit of the data set becoming infinite, but are rather concerned with coming up with good, automatic learning rates for finite data sets.
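The BOEM updates above differ from SOEM only in where the learning rates come from: the Dirichlet priors supply them. The following sketch (function name and state layout are ours, not the thesis's) applies Eqns (3.36) and (3.41) for one datum, with B_c holding 1/a_c(t), i.e., β_{0,c} plus the accumulated weighted counts:

```python
import numpy as np

def boem_step(x, lam, phi, alpha0, B, t):
    """One quasi-Bayes (BOEM) step for a multinomial mixture.

    lam moves with rate 1/(alpha_0 + t), Eqn (3.36); column c of phi moves
    with rate eta_c(t) = n_t*xi_c / (B_c + n_t*xi_c), Eqn (3.41), where
    n_t = sum_a x_a and B_c accumulates beta_{0,c} plus past weighted counts.
    """
    # E step: responsibilities under the current estimates.
    log_r = np.log(lam) + x @ np.log(phi)
    xi = np.exp(log_r - log_r.max())
    xi /= xi.sum()
    # Eqn (3.36): mixing weights, one shared rate 1/(alpha_0 + t).
    lam = lam + (xi - lam) / (alpha0 + t)
    # Eqn (3.41): event probabilities, one rate per component.
    n_t = x.sum()
    eta_c = n_t * xi / (B + n_t * xi)
    phi = phi + eta_c * ((x / n_t)[:, None] - phi)
    B = B + n_t * xi        # advance 1/a_c(t) for the next step
    return lam, phi, B
```

Both updates are convex combinations, so lam stays normalized and each column of phi stays normalized, provided they start from the normalized prior means λ_c^(0) = α_c^(0)/α_0 and φ_{a,c}^(0) = β_{a,c}^(0)/β_{0,c}.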
4.1 Learning Rate Schedules

Many algorithms employing on-line learning and stochastic gradient search, such as on-line EM, least mean square (LMS), on-line backpropagation and adaptive k-means clustering [21], have raised the issue of learning rate scheduling. The easiest way of scheduling the learning rate is to set it to a constant, as applied in LMS and backpropagation. A constant learning rate incurs fluctuation and unpredictable instability in the learning process, not to mention a possible lack of convergence. Small constants may reduce the fluctuation, but slow the convergence. In the theory of stochastic approximation [6], the standard learning rate is η(t) = c/t. It may result in slow convergence and non-optimal solutions for small c, and it may cause parameter blow-up at small t if c is too large [22][23]. In the literature on stochastic gradient algorithms, Darken and Moody proposed a class of search-then-converge (STC) learning rate schedules [22][23]. They showed the theoretically optimal asymptotic convergence rate and good results in escaping from local maxima. However, in STC, the user has to specify the key parameters for the calculation of η at each time step. To facilitate automation, several heuristic adaptive schedules (where η depends on the time and the previously processed data) have been developed [24][25][26][27]. In the on-line EM learning area, Sato also proposed in [11] a non-adaptive schedule for η (η is a function of time only), the (1/t)-schedule, which achieves an effective learning rate and enjoys a reasonable physical explanation. He also compared the performance of batch EM, on-line EM and on-line gradient ascent algorithms using some simple examples for mixtures of Gaussians, and revealed that on-line EM outperforms the batch and on-line gradient counterparts in terms of convergence speed and accuracy of the learned estimates.
This is because on-line EM can avoid being trapped at a local maximum thanks to its stochastic nature in the early stage of learning. There, the following forgetting factor schedule is employed:

λ(t) = 1 - 1 / ( (t - 2)κ + t_0 )   (4.1)

We can obtain the alternative step-wise form of Eqn.(4.1):

1 - λ(t) = ( 1 - λ(t-1) ) / ( 1 + κ [ 1 - λ(t-1) ] )   (4.2)

where t > 2. Note that λ(1) does not have to be defined. According to Eqn.(3.7), the learning rate satisfies the following recursive form:

η(t) = η(t-1) / ( λ(t) + η(t-1) )   (4.3)

Eqn.(3.5) together with the condition 0 < λ(t) ≤ 1 implies 1/t ≤ η(t) ≤ 1, with η(1) = 1. With Eqn.(4.2) and Eqn.(4.3), we can obtain the learning rate function:

η(t) = (1 + κ) / ( (t - 1)κ + t_0 )   (4.4)

where t > 2. For large t, η(t) is approximately ((1 + κ)/κ)(1/t). A meaningful interpretation of how the schedule parameters affect the learning process in its different stages was elaborated in [11]. Here, a concise description of the physical meanings of the parameters is given. t_0 denotes the number of samples contributing to the discounted average of the sufficient statistics in the early learning phase. κ controls the asymptotic decrease ratio of the learning rate. Eqn.(3.5) determines η(1), the initial value for η, which means the initial uncertainty does not depend on any data. Instead, to achieve faster convergence in the early stage, a smaller number η_0 is set as the initial value for η. η_0 represents the initial uncertainty of the estimates, i.e., the initial estimates depend on 1/η_0 - 1 data. It can simultaneously avoid singular values in the sufficient statistics calculation in the early stage. With a different η_0 denoting the initial value for η, η(t) can be computed from Eqn.(4.1) and Eqn.(4.3); it behaves according to Eqn.(4.4) as long as η_0 < 1. For 0 < κ < 1, there are three learning stages. In the first stage (1 < t < t_0), η drops off swiftly to 1/t_0 no matter what initial value of η has been chosen.
This is because η(t) is a monotonically decreasing function, and

η(t_0) = (1 + κ) / ( (t_0 - 1)κ + t_0 ) ≈ 1/t_0   (4.5)

if t_0 is not too small. During this period, a rough estimate is built up rapidly. In the second stage (t_0 < t < t_0/κ), where the main part of the learning process takes place, η remains almost constant at 1/t_0. In this stage, fairly good estimates are obtained. In the last stage (t_0/κ < t), as t goes to infinity, the learning rate is approximately ((1 + κ)/κ)(1/t), which stabilizes the values gained in the second stage.

In this thesis, we propose another learning rate schedule for on-line learning within the Bayesian framework, taking mixtures of multinomials as an example. As introduced in Section 3.2.2, we derived the recursive update equations for the estimates:

λ_c^(t) = λ_c^(t-1) + r_t ( ξ_c^(t) - λ_c^(t-1) ),    r_t = 1/(α_0 + t)

φ_{a,c}^(t) = φ_{a,c}^(t-1) + η_c(t) ξ_c^(t) ( x_a^(t) / Σ_{a'} x_{a'}^(t) - φ_{a,c}^(t-1) ),    η_c(t) = Σ_{a'} x_{a'}^(t) / α_c(t)

Different from Sato's updates in terms of weighted means of sufficient statistics, quasi-Bayes uses a pseudo posterior mean to approximate the estimates as the data are fed in one by one. The parameters are learned directly, with a proper learning rate schedule related to the priors. Therefore, the key parameters that control the learning rate can be wisely designed according to what prior information we want to place on the distribution. For this learning problem for mixtures of multinomials, we discuss the two learning rates in turn. r_t takes a fixed function schedule, displaying a form quite similar to what Sato has suggested, with forgetting factor λ(t) = 1, i.e., without discounting the previous estimates. Exploiting results in approximation theory, [11][14] have given the convergence proof and properties of this schedule. The only parameter in the function is α_0, the prior on the mixing weights.
Therefore, the estimate of each mixing weight follows the same schedule, with initial value α_c^(0)/α_0. α_0, the sum of the α_c^(0), controls the behavior of r_t and decides the number of data that contribute to the early learning stage. η_c adopts an adaptive schedule, which depends not only on time and the preset parameters, but on the processed data as well. Unlike r_t, η_c may differ for each cluster at the same time step. It is controlled by the prior information β_{0,c} placed on φ_c and by the counts in the processed data.

Schedule   α_c^(0)   β_{0,c}   Performance
S1         1         120       converges most rapidly in the early stage; smallest error mean and variance
S2         1         300       converges moderately rapidly; small error mean and variance
S3         1         1200      converges slowly; moderate error mean and large variance

Table 5.1: Sensitivity analysis for schedules with fixed α_c

In the experiments, we varied β_{0,c} from 30 to 6000. Values larger than 600 engendered slow convergence. Very small values led to instability in the results, since less than enough prior information was placed on the parameter φ. This means that the data governs the whole learning process, resulting in unpredictable clusters. The choice of β is closely tied to the performance of the corresponding PME. With α_c fixed at 1, if each β_{a,c} is picked from [1, 20], the performance of the PME is quite good. The range [1, 20] corresponds to [30, 600] in BOEM for this data set, which explains the overall good behaviour BOEM achieves with β_{0,c} in this range. Moreover, within these ranges, the performance is insensitive to parameter tuning. From the above analysis, we know that both α^(0) and β^(0) act as schedule parameters during learning. From another perspective, they can be expressed as the prior information on the distribution to be learned, in order to smooth the learning process and obtain a desired distribution.
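The two schedules discussed above can be compared numerically. The sketch below is our own illustrative code (with Sato's parameter values from the experiments, η_0 = 0.2, t_0 = 100, κ = 0.1, and an assumed α_0 for BOEM); it computes Sato's η(t) via the recursions (4.2)-(4.3) and BOEM's fixed mixing-weight rate 1/(α_0 + t):

```python
import numpy as np

def sato_eta(T, kappa=0.1, t0=100, eta0=0.2):
    """Sato's (1/t)-schedule via the recursions (4.1)-(4.3)."""
    inv_eta = 1.0 / eta0                 # 1/eta(1) = 1/eta_0
    etas = [eta0]
    for t in range(2, T + 1):
        lam_t = 1.0 - 1.0 / ((t - 2) * kappa + t0)   # forgetting factor (4.1)
        inv_eta = 1.0 + lam_t * inv_eta              # 1/eta(t) = 1 + lambda(t)/eta(t-1)
        etas.append(1.0 / inv_eta)
    return np.array(etas)

def boem_eta(T, alpha0=10.0):
    """BOEM's mixing-weight learning rate r_t = 1/(alpha_0 + t)."""
    t = np.arange(1, T + 1)
    return 1.0 / (alpha0 + t)
```

Plotting or inspecting the two sequences shows the behaviour described in the text: Sato's rate drops to roughly 1/t_0 in the first stage, plateaus there, and finally decays like ((1 + κ)/κ)(1/t), while BOEM's r_t is a pure 1/t-type schedule whose early behaviour is set entirely by the prior mass α_0.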
Within this Bayesian framework, the quasi-Bayes procedure turns coherent Bayes sequential learning into a stochastic approximation process, by transforming the PME into a recursive estimate for on-line processing with the original priors translated into learning rates. Thus the batch PME can be simulated in an on-line fashion, with better performance. From the comparison between the two on-line EM algorithms, we discover that, under appropriate schedules, BOEM exhibits performance as good as, if not better than, SOEM. From the study of the learning rate schedules, we find that their schedule parameters are indeed related. In BOEM learning rate scheduling, there are two quantities: the expected count of all events belonging to cluster c, Σ_{a'} x_{a'}^(t) ξ_c^(t), and the accumulated expected count of all events belonging to cluster c, Σ_{k=1}^t Σ_{a'} x_{a'}^(k) ξ_c^(k). The ratio of Σ_{k=1}^t Σ_{a'} x_{a'}^(k) ξ_c^(k) to Σ_{a'} x_{a'}^(t) suggests an effective time for each time step. If Σ_{a'} x_{a'}^(t) ξ_c^(t) is considerably larger than the past values, the effective time will be smaller, rendering the learning rate larger for quick adaptation. This characteristic can be compared to κ, the decay ratio of the learning rate in SOEM. β_{0,c}/Σ_{a'} x_{a'} in BOEM plays a role similar to t_0 in SOEM, determining the number of data contributing to the early learning stage.

5.2 Modelling Documents with Text, Images and Music

Modeling and analyzing images, text and other web resources has flourished in the information era, boosting developments in multimedia database browsing, document retrieval and computer vision. Models such as PLSA [34], CAM [35] and Bayesian LSA [36] have led this study to new heights. In [37], we presented a novel, flexible, statistical approach to modeling music, images and text jointly.
The learned models can be used to browse multimedia databases, to query a multimedia database using any combination of music, images and text (lyrics and other contextual information), to annotate documents with music and images, and to find documents in a database similar to input text, music and/or graphics files.

5.2.1 Model Specification

Our current model is based on documents with text (lyrics or information about the song), musical scores in GUIDO notation¹ [38], and JPEG image files. We model the data with a Bayesian multi-modal mixture model. Words, images and scores are assumed to be conditionally independent given the mixture component label. We model musical scores with first-order Markov chains, in which each state corresponds to a note, rest, or the start of a new voice. Notes' pitches are represented by the interval change (in semitones) from the previous note, rather than by absolute pitch, so that a score or query transposed to a different key will still have the same Markov chain. Rhythm is represented using the standard fractional musical measurement of whole-note, half-note, quarter-note, etc. Rest states are represented similarly, save that pitch is not represented. See Figure 5.4 for an example. Polyphonic scores are represented by chaining the beginning of a new voice to the end of a previous one. In order to ensure that the first note in each voice

¹ GUIDO is a powerful language for representing musical scores in an HTML-like notation. MIDI files, plentiful on the World Wide Web, can be easily converted to this format.

[ _*3/4 b&1*3/16 b1/16 c#2*11/16 b&1/16 a&1*3/16 b&1/16 f#1/2 ]

STATE   INTERVAL    DURATION
0       newvoice    0
1       rest        3/4
2       firstnote   3/16
3       +1          1/16
4       +2          11/16
5       -2          1/16
6       -2          3/16
7       +3          1/16
8       -5          1/2

Figure 5.4: Sample melody - the opening notes to "The Yellow Submarine" by The Beatles - in different notations.
From top: standard musical notation (generated from GUIDO notation), GUIDO notation, and as a series of states in a first-order Markov chain (also generated from GUIDO notation).

appears in both the row and column of the Markov transition matrix, a special "new voice" state with no interval or rhythm serves as a dummy state marking the beginning of a new voice. The first note of a voice has a distinguishing "first note" interval value. The Markov chain representation of a piece of music k is then mapped to a transition frequency table M_k, where M_{i,j,k} denotes the number of times we observe the transition from state i to state j in document k. We use M_{k,0} to denote the initial state of the Markov chain. In essence, this Markovian approach is analogous to a text bigram model, save that the states are musical notes and rests rather than words. Images are represented as image intensity histograms of 256 equally-spaced bins, each representing a range of colour values. Each bin's value is initially equal to the number of pixels of the image that fall into that range, and the entire histogram is then normalized to give the relative frequencies of the bins, G_k, where G_{b,k} indicates the relative frequency value of bin b in image k. The associated text is modeled using a standard term frequency vector T_k, where T_{w,k} denotes the number of times word w appears in document k. For notational simplicity, we group together the image, music and text variables as follows: X_k = {G_k, M_k, T_k}. Note that there is no requirement of uniqueness in the database for the elements of different X_k. One graphic document can be associated with any number of text or music documents, though under this model, each association requires a separate instance of X. We rely on a human expert to provide the groupings for each instance of X in the database.
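The melody-to-transition-table mapping just described can be sketched in a few lines. This is our own illustrative code; the state encoding (tuples of interval string and duration string, plus the 'newvoice' dummy state) is an assumed concrete choice, not the thesis's implementation:

```python
def melody_to_states(notes):
    """Map a voice, given as a list of (pitch, duration) pairs with
    pitch None for rests, to the interval/duration states described above."""
    states = ['newvoice']                 # dummy state opening the voice
    prev_pitch = None
    for pitch, dur in notes:
        if pitch is None:
            states.append(('rest', dur))
        elif prev_pitch is None:
            states.append(('firstnote', dur))   # distinguishing first-note value
            prev_pitch = pitch
        else:
            interval = '%+d' % (pitch - prev_pitch)  # semitones from previous note
            states.append((interval, dur))
            prev_pitch = pitch
    return states

def transition_counts(states):
    """Build the transition frequency table M_k: counts[(i, j)] counts i -> j."""
    counts = {}
    for i, j in zip(states, states[1:]):
        counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts
```

Because pitches enter only through interval differences, transposing the input melody by a constant number of semitones leaves the state sequence, and hence M_k, unchanged.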
Our multi-modal mixture model is as follows:

p(X_k | θ) = Σ_{c=1}^{n_c} p(c) Π_{b=1}^{n_b} p(b|c)^{G_{b,k}} Π_{j=1}^{n_s} p(j|c)^{I_j(M_{k,0})} Π_{j=1}^{n_s} Π_{i=1}^{n_s} p(j|i,c)^{M_{i,j,k}} Π_{w=1}^{n_w} p(w|c)^{T_{w,k}}   (5.1)

where θ = {p(c), p(b|c), p(j|c), p(j|i,c), p(w|c)} encompasses all the model parameters, and where I_j(M_{k,0}) = 1 if the first entry of M_k belongs to state j and is 0 otherwise. The three-dimensional matrix p(j|i,c) denotes the estimated probability of transitioning from state i to state j in cluster c, and the matrix p(j|c) denotes the initial probability of being in state j, given membership in cluster c. The vector p(c) denotes the probability of each cluster. The two-dimensional matrices p(b|c) and p(w|c) denote the probabilities of bin b and word w in cluster c. The mixture model is defined on the standard probability simplex {∀c, p(c) > 0 and Σ_{c=1}^{n_c} p(c) = 1}. We introduce the latent allocation variables z_k ∈ {1, ..., n_c} to indicate that a particular sequence belongs to a specific cluster c. These indicator variables {z_k; k = 1, ..., n_x} correspond to an i.i.d. sample from the distribution p(z_k = c) = p(c). Regarding the prior specification, we follow a hierarchical Bayesian strategy, where the unknown parameters θ and the allocation variables z are regarded as being drawn from appropriate prior distributions. We acknowledge our uncertainty about the exact form of the prior by specifying it in terms of some unknown parameters (hyperparameters). The allocation variables z_k are assumed to be drawn from a multinomial distribution, z_k ~ M_{n_c}(1; p(c)). We place a conjugate Dirichlet prior on the mixing coefficients, p(c) ~ D_{n_c}(α). Similarly, we place Dirichlet prior distributions D_{n_s}(β) on each p(j|c), D_{n_s}(γ) on each p(j|i,c), D_{n_w}(ρ) on each p(w|c), and D_{n_b}(ν) on each p(b|c), and assume that these priors are independent. The posterior for the allocation variables will be required. It can be obtained easily using Bayes' rule:
ξ_{ck} ≜ p(z_k = c | X_k, θ) = [ p(c) Π_{b=1}^{n_b} p(b|c)^{G_{b,k}} Π_{j=1}^{n_s} p(j|c)^{I_j(M_{k,0})} Π_{j=1}^{n_s} Π_{i=1}^{n_s} p(j|i,c)^{M_{i,j,k}} Π_{w=1}^{n_w} p(w|c)^{T_{w,k}} ] / [ Σ_{c'=1}^{n_c} ( the same expression with c replaced by c' ) ]   (5.2)

The batch EM algorithm for this model iterates between the following two steps:

1. E step: Compute the expectation of the complete log-likelihood, Q_ML = E[log p(z, G, M, T | θ)], where the expectation is taken with respect to p(z | G, M, T, θ^(old)) and θ^(old) represents the value of the parameters at the previous time step.

2. M step: Maximize over the parameters: θ^(new) = argmax_θ Q_ML.

The Q_ML function expands to

Q_ML = Σ_{k=1}^{n_x} Σ_{c=1}^{n_c} ξ_{ck} log [ p(c) Π_{b=1}^{n_b} p(b|c)^{G_{b,k}} Π_{j=1}^{n_s} p(j|c)^{I_j(M_{k,0})} Π_{j=1}^{n_s} Π_{i=1}^{n_s} p(j|i,c)^{M_{i,j,k}} Π_{w=1}^{n_w} p(w|c)^{T_{w,k}} ]

In the E step, we have to compute ξ_{ck} using Eqn.(5.2). The corresponding M step requires that we maximize Q_ML subject to the constraints that all probabilities for the parameters sum to one. This constrained maximization can be carried out by introducing Lagrange multipliers. The resulting parameter estimates are:

p(c) = (1/n_x) Σ_{k=1}^{n_x} ξ_{ck}
p(b|c) = Σ_{k=1}^{n_x} G_{b,k} ξ_{ck} / Σ_{k=1}^{n_x} Σ_{b'} G_{b',k} ξ_{ck}
p(j|c) = Σ_{k=1}^{n_x} I_j(M_{k,0}) ξ_{ck} / Σ_{k=1}^{n_x} ξ_{ck}
p(j|i,c) = Σ_{k=1}^{n_x} M_{i,j,k} ξ_{ck} / Σ_{k=1}^{n_x} Σ_{j'} M_{i,j',k} ξ_{ck}
p(w|c) = Σ_{k=1}^{n_x} T_{w,k} ξ_{ck} / Σ_{k=1}^{n_x} Σ_{w'} T_{w',k} ξ_{ck}

The above estimates will be used to derive SOEM. The updates for the sufficient statistics for SOEM are of the form

⟨f⟩_c^(t) = ⟨f⟩_c^(t-1) + η(t) [ f(X^(t)) ξ_c^(t) - ⟨f⟩_c^(t-1) ]

where ξ_c^(t) = p(z^(t) = c | X^(t), θ^(t-1)). As in the batch scenario, the E step involves computing the posterior of the allocation variables:

ξ_c^(t) ∝ p(c)^(t) Π_{b=1}^{n_b} (p(b|c)^(t))^{G_b^(t)} Π_{j=1}^{n_s} (p(j|c)^(t))^{I_j(M_0^(t))} Π_{j=1}^{n_s} Π_{i=1}^{n_s} (p(j|i,c)^(t))^{M_{i,j}^(t)} Π_{w=1}^{n_w} (p(w|c)^(t))^{T_w^(t)}

In the M step, we compute the parameters

p(c)^(t) = ⟨1⟩_c^(t) / Σ_{c'} ⟨1⟩_{c'}^(t)
p(b|c)^(t) = ⟨G_b⟩_c^(t) / Σ_{b'} ⟨G_{b'}⟩_c^(t)
p(j|c)^(t) = ⟨I_j(M_0)⟩_c^(t) / ⟨1⟩_c^(t)
p(j|i,c)^(t) = ⟨M_{i,j}⟩_c^(t) / Σ_{j'} ⟨M_{i,j'}⟩_c^(t)
p(w|c)^(t) = ⟨T_w⟩_c^(t) / Σ_{w'} ⟨T_{w'}⟩_c^(t)

where the on-line expectations are given by

⟨1⟩_c^(t) = ⟨1⟩_c^(t-1) + η(t) [ ξ_c^(t) - ⟨1⟩_c^(t-1) ]
⟨G_b⟩_c^(t) = ⟨G_b⟩_c^(t-1) + η(t) [ G_b^(t) ξ_c^(t) - ⟨G_b⟩_c^(t-1) ]
⟨I_j(M_0)⟩_c^(t) = ⟨I_j(M_0)⟩_c^(t-1) + η(t) [ I_j(M_0^(t)) ξ_c^(t) - ⟨I_j(M_0)⟩_c^(t-1) ]
⟨M_{i,j}⟩_c^(t) = ⟨M_{i,j}⟩_c^(t-1) + η(t) [ M_{i,j}^(t) ξ_c^(t) - ⟨M_{i,j}⟩_c^(t-1) ]
⟨T_w⟩_c^(t) = ⟨T_w⟩_c^(t-1) + η(t) [ T_w^(t) ξ_c^(t) - ⟨T_w⟩_c^(t-1) ]

To ensure quick convergence for SOEM, we choose the parameters of the decaying learning rate as follows: η_0 = 0.2, t_0 = 100, κ = 0.1. From the standard properties of the Dirichlet distribution, the PME parameter estimates are:

p(c) = ( α_c + Σ_{k=1}^{n_x} I_c(z_k) ) / ( Σ_{c'} α_{c'} + n_x )
p(b|c) = ( ν_{b,c} + Σ_{k=1}^{n_x} G_{b,k} I_c(z_k) ) / ( Σ_{b'} ν_{b',c} + Σ_{k=1}^{n_x} Σ_{b'} G_{b',k} I_c(z_k) )
p(j|c) = ( β_{j,c} + Σ_{k=1}^{n_x} I_j(M_{k,0}) I_c(z_k) ) / ( Σ_{j'} β_{j',c} + Σ_{k=1}^{n_x} I_c(z_k) )
p(j|i,c) = ( γ_{i,j,c} + Σ_{k=1}^{n_x} M_{i,j,k} I_c(z_k) ) / ( Σ_{j'} γ_{i,j',c} + Σ_{k=1}^{n_x} Σ_{j'} M_{i,j',k} I_c(z_k) )
p(w|c) = ( ρ_{w,c} + Σ_{k=1}^{n_x} T_{w,k} I_c(z_k) ) / ( Σ_{w'} ρ_{w',c} + Σ_{k=1}^{n_x} Σ_{w'} T_{w',k} I_c(z_k) )
By substituting ξ_{ck} for I_c(z_k), the estimates are approximated as follows:

p(c) ≈ ( α_c + Σ_{k=1}^{n_x} ξ_{ck} ) / ( Σ_{c'} α_{c'} + n_x )
p(b|c) ≈ ( ν_{b,c} + Σ_{k=1}^{n_x} G_{b,k} ξ_{ck} ) / ( Σ_{b'} ν_{b',c} + Σ_{k=1}^{n_x} Σ_{b'} G_{b',k} ξ_{ck} )
p(j|c) ≈ ( β_{j,c} + Σ_{k=1}^{n_x} I_j(M_{k,0}) ξ_{ck} ) / ( Σ_{j'} β_{j',c} + Σ_{k=1}^{n_x} ξ_{ck} )
p(j|i,c) ≈ ( γ_{i,j,c} + Σ_{k=1}^{n_x} M_{i,j,k} ξ_{ck} ) / ( Σ_{j'} γ_{i,j',c} + Σ_{k=1}^{n_x} Σ_{j'} M_{i,j',k} ξ_{ck} )
p(w|c) ≈ ( ρ_{w,c} + Σ_{k=1}^{n_x} T_{w,k} ξ_{ck} ) / ( Σ_{w'} ρ_{w',c} + Σ_{k=1}^{n_x} Σ_{w'} T_{w',k} ξ_{ck} )

These approximate posterior means can then be turned into BOEM recursions in the same way as in Section 3.2.2. In the rest of the experiments, we initialize the parameters as follows: we set α_c^(0) = 1 and take random values for ν^(0), β^(0), γ^(0) and ρ^(0), such that ν_{0,c} = 400, β_{0,c} = 300, γ_{i,0,c} = 300 and ρ_{0,c} = 2000 for each c.

Figure 5.5: Probability mass distribution across clusters discovered from our test dataset using SOEM.

5.2.3 Experiments and Results

To test the model with text, images and music, we clustered a database of musical scores with associated text documents and JPEG images. The database is composed of various types of musical scores - jazz, classical, television theme songs, and contemporary pop and electronic music - each of which has an associated text file and image file, as represented by the combined media variable X_k. The scores are represented in GUIDO notation. The associated text files are a song's lyrics, where applicable, or textual commentary on the score for instrumental pieces, all of which were extracted from the World Wide Web. The image file for each piece is an image of the cover of the CD on which the song appears. The experimental database contains 100 scores, each with a single associated text document and image. There is nothing in the model, however, that requires this one-to-one association of text documents and scores - this was done solely for testing and pedagogical simplicity.
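In the experiments, the responsibilities of Eqn.(5.2) have to be computed for every document; since the exponents G_{b,k}, M_{i,j,k} and T_{w,k} can be large, it is numerically safer to work in the log domain. The sketch below is our own illustrative implementation, with an assumed array layout for the parameters (in particular, p_jic[i, j, c] is taken to hold p(j|i,c)):

```python
import numpy as np

def responsibilities(G, M, T, m0_state, theta):
    """Posterior cluster responsibilities xi_ck (Eqn. 5.2) for one document.

    G: (n_b,) image histogram, M: (n_s, n_s) transition counts,
    T: (n_w,) term counts, m0_state: index of the initial Markov state.
    theta: dict of arrays p_c (n_c,), p_bc (n_b, n_c), p_jc (n_s, n_c),
    p_jic (n_s, n_s, n_c), p_wc (n_w, n_c); probabilities assumed > 0."""
    log_xi = (np.log(theta['p_c'])
              + G @ np.log(theta['p_bc'])                          # bins
              + np.log(theta['p_jc'][m0_state])                    # initial state
              + np.einsum('ij,ijc->c', M, np.log(theta['p_jic']))  # transitions
              + T @ np.log(theta['p_wc']))                         # words
    log_xi -= log_xi.max()        # guard against underflow before exponentiating
    xi = np.exp(log_xi)
    return xi / xi.sum()
```

The same routine serves both batch EM and the on-line variants, since all of them need ξ_{ck} (or ξ_c^(t)) in their E step.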
In a deployment such as the World Wide Web, one would routinely expect one-to-many or many-to-many mappings between the scores and text.

CLUSTER   SONG                                    ξ_{ck}
1         The Beatles - Good Day Sunshine         0.1667
1         other - 'The Addams Family' theme       0.0043
2         J. S. Bach - Invention #1               1.0000
2         J. S. Bach - Invention #2               1.0000
2         other - 'The Jetsons' theme             1.0000
3         Nine Inch Nails - Down In It            1.0000
3         Nine Inch Nails - The Perfect Drug      0.9998
3         Nine Inch Nails - Wish                  1.0000
4         The Cure - 10:15 Saturday Night         1.0000
4         Moby - Flower                           0.6667
4         other - 'The Addams Family' theme       0.9957
5         The Smiths - Girlfriend in a Coma       1.0000
5         The Cure - Push                         0.9753
5         Nine Inch Nails - The Perfect Drug      0.0002
7         The Prodigy - One Love                  1.0000
7         PJ Harvey - Down by the Water           1.0000
7         Rogers & Hart - Blue Moon               1.0000
8         Soft Cell - Tainted Love                1.0000

Figure 5.6: Representative probabilistic cluster allocations using SOEM.

We tested our database with the batch EM, SOEM and BOEM. We found that standard batch EM gave the least satisfactory results, distributing probability mass across the maximum number of clusters in a nondeterministic fashion. SOEM and BOEM gave better results and regularized to a smaller number of more intuitive clusters. The two on-line EM algorithms also took significantly less time (2 epochs) than the batch EM. BOEM broke the songs down into more distinct and reasonable clusters than SOEM. Fig.5.5 and Fig.5.6 show some representative cluster probability assignments obtained with SOEM. Fig.5.7 and Fig.5.8 show the representative cluster probability assignments obtained with BOEM.

Figure 5.7: Probability mass distribution across clusters discovered from our test dataset using BOEM.

By and large, the clusters generated by both on-line EM algorithms are intuitive. From the result of SOEM, the 15 pieces by J. S.
Bach each have very high (p > 0.999) probabilities of membership in the same cluster, as do the 13 pieces from the band Nine Inch Nails. Rock and pop songs are scattered over different clusters. A few curious anomalies exist. The theme song to the television show The Jetsons is included in the same cluster as the Bach pieces, for example. There are also some noise clusters containing one or two songs. Regarding the result of BOEM, almost all the songs are deterministically assigned to one cluster with very high probability. The 17 pieces by J. S. Bach, including Toccata and Fugue and Piano Sinfonia, are grouped together as cluster 1 with very high probability (1.0000). Cluster 2 mainly contains the TV themes and the chorals (classical singing). Most rock and pop songs with similar beats and lyrics are grouped into one big cluster, cluster 7. Cluster 10 mostly consists of dance music and jazz.

CLUSTER   SONG                                    ξ_{ck}
1         J. S. Bach - Toccata and Fugue          1.0000
1         J. S. Bach - Invention #1               1.0000
1         J. S. Bach - Invention #2               1.0000
1         J. S. Bach - Piano Sinfonia             1.0000
2         J. S. Bach - Choral #1                  1.0000
2         J. S. Bach - Choral #2                  1.0000
2         TV Theme - The Jeffersons               1.0000
2         TV Theme - Wheel of Fortune             1.0000
2         TV Theme - The Wonder Years             1.0000
7         The Beatles - Yellow Submarine          1.0000
7         The Beatles - And Your Bird Can Sing    1.0000
7         The Beatles - Doctor Robert             1.0000
7         The Cure - Killing an Arab              1.0000
7         The Cure - Fascination Street           1.0000
7         The Cure - A Forest                     1.0000
10        Jazz Standard - Blue Moon               1.0000
10        Jazz Standard - Tuxedo Junction         1.0000
10        Nine Inch Nails - Head Like a Hole      1.0000

Figure 5.8: Representative probabilistic cluster allocations using BOEM.

Chapter 6

Conclusion

In this thesis, we presented a comprehensive study of on-line EM algorithms. We derived on-line EM from two different views for mixtures of multinomials. We used Bayesian theory to propose a new on-line EM algorithm.
Based on this theory, we showed that there is a direct connection between the setting of Bayes priors and the learning rates of stochastic approximation algorithms. The performance of each on-line EM algorithm was examined by simulations on both synthetic and real data. The simulation results demonstrated that the on-line EM algorithms with appropriate learning rate schedules converged faster in the early stage and were less prone to getting trapped in a local maximum than their batch counterpart. The results of the real-data application strengthened this conclusion. Moreover, BOEM performed better than SOEM on the real data, as the former achieved more distinct and reasonable clusters. One of the prominent contributions of the thesis is to relate the prior information in Bayesian learning to the learning rate schedule parameters in BOEM. As we have seen, α^(0) and β^(0) serve as the schedule parameters during the learning process. They can alternatively be expressed as prior information, in order to smooth the learning process and to obtain a desired distribution. Our experiments attest that approximately the same assignment of the priors of the PME and the schedule parameters for BOEM generates similar performance. Therefore, our approach turns Bayes sequential learning into a stochastic approximation process, by transforming the posterior mean estimation into on-line estimation and translating the priors and the data into learning rates. This implies that researchers can use ideas from Bayesian statistics to come up with reasonable learning rates for stochastic approximation schemes.

Chapter 7

Future Work

In the future, we want to generalize BOEM to mixtures of Gaussians and more complicated models, which involve more intricate hyper-parameters. We can then provide a universal switchover from Bayesian batch learning to on-line processing, with the learning rates representing the priors of the former.
This will greatly benefit larger-scale real-time systems such as on-line video processing, on-line learning of robot behaviours, etc. In the current derivation and implementation, we consume one datum at a time and update the parameters n times within each epoch, with n the total number of data. If the computation for each update involves complex operations, it will be prohibitive to proceed with the current version of on-line EM. To reach a trade-off between efficiency and accuracy, we can adjust the number of data processed on-line at each step; that is, we build a window and update the estimates using the current data within the window and the previous estimates. Suppose the window size is w. At time w, we update the estimates using the first w data; we then move w time steps ahead and update again using the next w data and the previous estimates. Setting w equal to n recovers batch EM.

Bibliography

[1] A.G. McKendrick. Application of mathematics to medical problems. Proc. Edinburgh Math. Soc., 44:98-130, 1926.
[2] H. Hartley. Maximum likelihood estimation from incomplete data. Biometrics, 14:174-194, 1958.
[3] L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164-171, 1970.
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B), 1977.
[5] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, 1997.
[6] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
[7] S.J. Nowlan. Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Technical Report CS-91-126, CMU, 1991.
[8] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm.
Neural Computation, 6:181-214, 1994.
[9] Xavier Boyen and Daphne Koller. Approximate learning of dynamic models. In Proceedings of the 11th Annual Conference on Neural Information Processing Systems (NIPS-98), pages 396-402, Denver, Colorado, December 1998.
[10] N. Petrovic, N. Jojic, B.J. Frey, and T.S. Huang. Real-time on-line learning of transformed hidden Markov models from video. In AISTATS, January 2003.
[11] Masa-Aki Sato. Convergence of on-line EM algorithm. ICONIP, 1:476-481, 2000.
[12] Masa-Aki Sato and Shin Ishii. On-line EM algorithm for the normalized Gaussian network, 1998.
[13] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models, 1998.
[14] A.F.M. Smith and U.E. Makov. A quasi-Bayes sequential procedure for mixtures. Journal of the Royal Statistical Society, 40(B), 1978.
[15] T.P. Minka. Expectation-maximization as lower bound maximization, 1998.
[16] G.J. McLachlan and D. Peel. Finite Mixture Models. Wiley, New York, 2000.
[17] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.
[18] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482, 1999.
[19] James J. Little, Nando de Freitas, Jesse Hoey, Kejie Bao, and Eric Brochu. The EM algorithm for mixtures of factor analyzers. Technical report for the NISSAN-UBC (FY 2000) partnership, University of British Columbia, 2003.
[20] Zoubin Ghahramani and Geoffrey E. Hinton. Switching state-space models. Technical report, University of Toronto, 6 King's College Road, Toronto M5S 3H5, Canada, 1998.
[21] C. Darken and J. Moody. Fast adaptive k-means clustering: some empirical results. In Proceedings of the IEEE IJCNN Conference, pages 2:233-238, Piscataway, New Jersey, 1990. IEEE Press.
[22] C. Darken and J. Moody.
Note on learning rate schedules for stochastic optimization. In Advances in Neural Information Processing Systems 3, pages 832-838, San Mateo, California, 1990. Morgan Kaufmann.
[23] C. Darken and J. Moody. Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems 4, pages 1009-1016, San Mateo, California, 1991. Morgan Kaufmann.
[24] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307, 1988.
[25] H. Kesten. Accelerated stochastic approximation. Annals of Mathematical Statistics, 29:41-59, 1958.
[26] S. Urasiev. Adaptive stochastic quasigradient procedures. In Y. Ermoliev and R. Wets, editors, Numerical Techniques for Stochastic Optimization. Springer Verlag, 1988.
[27] C. Darken, J. Chang, and J. Moody. Learning rate schedules for faster stochastic gradient search. In D.A. White and D.A. Sofge, editors, Neural Networks for Signal Processing 2, 1992.
[28] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, New York, 1991.
[29] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael Jordan. An introduction to MCMC for machine learning. Machine Learning, 2001.
[30] Jim Little, Nando de Freitas, Jesse Hoey, Kejie Bao, and Eric Brochu. Analysis of driver behavior with computer vision, NISSAN-UBC (FY2002) partnership. Technical report, Department of Computer Science, University of British Columbia, 2003.
[31] E.P. Simoncelli, E.H. Adelson, and D.J. Heeger. Probability distributions of optical flow. In Proc. Conf. on Computer Vision and Pattern Recognition, pages 310-315, Maui, Hawaii, 1991. IEEE Computer Society.
[32] Aluizio Prata and W.V.T. Rusch. Algorithms for computation of Zernike polynomials expansion coefficients. Applied Optics, 28(4):749-754, 1989.
[33] Jesse Hoey and James J. Little. Representation and recognition of complex human motion. IEEE CVPR, 1, 2000.
[34] T. Hofmann.
Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, 1999.
[35] Thomas Hofmann. The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. In IJCAI, pages 682-687, 1999.
[36] N. de Freitas and K. Barnard. Bayesian latent semantic analysis of multimedia databases. Technical Report 2001-15, University of British Columbia, 2001.
[37] Eric Brochu, Nando de Freitas, and Kejie Bao. The sound of an album cover: A probabilistic approach to multimedia. AISTATS, 2003.
[38] H.H. Hoos, K. Renz, and M. Gorg. GUIDO/MIR - an experimental musical information retrieval system based on GUIDO music notation. In International Symposium on Music Information Retrieval, 2001.

Appendix A

Q_ML Function Expansion

In the case of a mixture of multinomials, the Q_ML function can be expanded as follows:

Q_ML = E_{p(z | x, θ^(t))} [ Σ_{i=1}^{n_x} Σ_{c=1}^{n_c} I_c(z_i) log ( λ_c Π_{a=1}^{n_a} φ_{a,c}^{x_{i,a}} ) ]
     = Σ_{z_1=1}^{n_c} ... Σ_{z_{n_x}=1}^{n_c} [ Π_{j=1}^{n_x} p(z_j | x_j, θ^(t)) ] Σ_{i=1}^{n_x} Σ_{c=1}^{n_c} I_c(z_i) log ( λ_c Π_{a=1}^{n_a} φ_{a,c}^{x_{i,a}} )   (A.1)

Expression (A.1) can be simplified by noticing that

Σ_{z_1=1}^{n_c} ... Σ_{z_{n_x}=1}^{n_c} I_c(z_i) Π_{j=1}^{n_x} p(z_j | x_j, θ^(t))
   = [ Σ_{z_1} ... Σ_{z_{i-1}} Σ_{z_{i+1}} ... Σ_{z_{n_x}} Π_{j=1, j≠i}^{n_x} p(z_j | x_j, θ^(t)) ] p(z_i = c | x_i, θ^(t))
   = p(z_i = c | x_i, θ^(t))

since each of the remaining sums equals one. Therefore, Q_ML reduces to:

Q_ML = Σ_{i=1}^{n_x} Σ_{c=1}^{n_c} p(z_i = c | x_i, θ^(t)) log ( λ_c Π_{a=1}^{n_a} φ_{a,c}^{x_{i,a}} )