MAXIMUM LIKELIHOOD ESTIMATION FOR MIXTURE DISTRIBUTIONS AND HIDDEN MARKOV MODELS By BRIAN LEROUX B.Sc, Carleton University, 1982 M.Sc, The University of British Columbia, 1985 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF T H E REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in T H E FACULTY OF GRADUATE STUDIES Department of Statistics We accept this thesis as conforming to the required standard T H E UNIVERSITY OF BRITISH COLUMBIA August 1989 Â© Brian Leroux In p resen t i ng this thesis in partial fu l f i lment of the requ i remen ts for an a d v a n c e d d e g r e e at the Univers i ty of Brit ish C o l u m b i a , I agree that the Library shal l m a k e it f reely avai lable fo r re fe rence and s tudy. I fur ther agree that p e r m i s s i o n fo r ex tens ive c o p y i n g of this thesis fo r scho lar ly p u r p o s e s may be g ran ted by the h e a d o f m y depa r tmen t o r by his o r her representa t ives . It is u n d e r s t o o d that c o p y i n g o r p u b l i c a t i o n of this thes is for f inancia l ga in shall no t b e a l l o w e d w i t hou t m y wr i t ten p e r m i s s i o n . D e p a r t m e n t of S t a t i s t i c s T h e Un ivers i ty of Brit ish C o l u m b i a V a n c o u v e r , C a n a d a D E - 6 (2/88) This thesis deals with computational and theoretical aspects of maximum likelihood esti-mation for data from a mixture model and a hidden Markov model. A maximum penalized likelihood method is proposed for estimating the number of com-ponents in a mixture distribution. This method produces a consistent estimator of the un-known mixing distribution, in the sense of weak convergence of distribution functions. The proof of this result consists of establishing consistency results concerning maximum likeli-hood estimators (which have unrestricted number of components) and constrained maximum likelihood estimators (which assume a fixed finite number of components). In particular, a new proof of the consistency of maximum likelihood estimators is given. Also, the large sample limits of a sequence of constrained maximum likelihood estimators are identified as those distributions minimizing Kullback-Leibler divergence from the true distribution. If the number of components of the true mixture distribution is not greater than the assumed number, the constrained maximum likelihood estimator is consistent in the sense of weak convergence. If the assumed number of components is exactly correct, the estimators of the parameters which define the mixing distribution are also consistent (in a certain sense). A n algorithm for computation of maximum likelihood estimates (and the maximum penalized likelihood estimate) is given. The E M algorithm is used to locate local maxima of the likelihood function and a method of automatically generating "good" starting values for each possible number of components is incorporated. The estimation of a Poisson mixture distribution is illustrated using a distribution of traffic accidents in a population and a sequence of observations of fetal movements. One way of looking at the finite mixture model is as a random sample of "states" from a mixing distribution and a sequence of conditionally independent observed variables with distributions determined by the states. In the hidden Markov model considered here, the i i sequence of states is modelled by a Markov chain. The use of the E M algorithm for finding local maxima of the likelihood function for the hidden Markov model is described. Prob-lems arising in the implementation of the algorithm are discussed, including the automatic generation of starting values and a necessary adjustment to the forward-backward equa-tions. The algorithm is applied, with Poisson component distributions, to the sequence of observations of fetal movements. The consistency of the maximum likelihood estimator for the hidden Markov model is proved. The proof requires the consideration of identifiability, ergodicity, entropy, cross-entropy, and convergence of the log-likelihood function. For instance, the conclusion of the Shannon-McMillan-Breiman theorem on entropy convergence is established for hidden Markov models. A class of doubly stochastic Poisson processes which corresponds to a continuous time version of the hidden Markov model is also considered. We discuss some preliminary work on the extension of the E M algorithm to these processes, and also the possibility of applying our method of proof of consistency of maximum likelihood estimators. i i i C o n t e n t s Abstract ii Acknowledgements vii 1 Introduction 1 2 Maximum likelihood estimation for mixture distributions 5 2.1 Existence and characterization of the maximum likelihood estimator . . . . 9 2.2 Computational aspectsâ€”the EM algorithm 13 2.3 Identifiability and Kullback-Leibler divergence 17 2.4 Consistency 19 3 Consistency results for estimators of a mixing distribution 23 3.1 Choosing the number of components in a mixing distribution 23 3.2 Consistency of the maximum likelihood estimator 28 3.3 Approximation of mixing distributions 33 3.4 Consistency of constrained maximum likelihood estimators 40 3.5 Consistency of the maximum penalized likelihood estimator 42 4 Implementation and numerical examples for mixtures of Poisson distri-iv butions 47 4.1 Computational methods for maximization of mixture likelihoods 47 4.2 Simar's accident data 51 4.3 Fetal movement data 54 5 Maximum likelihood estimation for hidden Markov models 57 5.1 Calculation of maximum likelihood estimates using the E M algorithm . . . 60 5.2 The forward-backward algorithm 63 5.3 Starting values for the E M algorithm 68 5.4 Apphcation to fetal movement data 69 6 Consistency of maximum likelihood estimators for hidden Markov models 72 6.1 Conditions 74 6.2 Identifiability 76 6.3 The ergodic property 78 6.4 Entropy 79 6.5 Convergence of the log-likelihood function 83 6.6 Cross-entropy 86 6.7 Consistency of the maximum likelihood estimator 93 7 Future research 96 7.1 Doubly stochastic Poisson processes 96 7.2 Inference for hidden Markov models 98 7.3 Hypothesis testing in mixture models 98 References 100 v A p p e n d i x A A p p e n d i x B Acknowledgements I thank m y supervisor , Professor M a r t y P u t e r m a n , who has been a constant source of inspirat ion and guidance to me i n complet ing this work. I have had m a n y helpful discussions w i t h Professor H a r r y Joe and received valuable com-ments on this work f rom h i m . Professor J o h n Petkau has provided much encouragement and he lpful advice throughout m y graduate career. I thank Professor D . R u r a k for supplying the fetal movement data . T h e support of the N a t u r a l Sciences and Engineering Research C o u n c i l is gratefully ac-knowledged b o t h for Postgraduate Scholarships and par t ia l support of this work under grant A5527 . I also acknowledge the support of the Univers i ty of B r i t i s h C o l u m b i a through a U n i v e r s i t y Graduate Fel lowship. v i i v m Chapter 1 I n t r o d u c t i o n T h e work i n this thesis was motivated by a study designed to examine the possible changes i n the amount and pattern of fetal lamb ac t iv i ty dur ing gestation ( W i t t m a n , R u r a k , and Tay lor , 1984). D a t a is obtained by observing a fetal l a m b using ul trasound and recording the number of movements i n each of several consecutive t ime intervals. T h e statist ical problem treated here is that of f i t t ing a model to these counts which describes the pat tern of fetal ac t iv i ty at a fixed point in the gestation per iod . T h e degree of clustering of the movements is a part icular ly important feature, and so the Poisson process model , which implies random occurrence of events i n t ime, does not provide enough f lexibi l i ty . However, the Poisson process assumptions seem to be reasonable i n short t ime intervals, and thus, we are led to consider the class of doubly stochastic Poisson processes, which al low the event rate to change over t ime i n a random fashion. W e apply several models based on the doubly stochastic Poisson process to the observed data ; m a x i m u m l ikel ihood est imation is used i n each case for parameter est imation. F i r s t , the counts are regarded as a random sample f rom a Poisson mixture d i s t r ibut ion . In this model , the counts are condit ional ly Poisson dis tr ibuted, w i t h the rates drawn at random 1 independently. N e x t we consider the hidden M a r k o v model , which preserves the feature of random rates, while a l lowing the counts to be dependent. U n d e r this model , at any given t ime the fetus is in one of several discrete states w i t h different rates of movement; the sequence of states is generated according to a M a r k o v chain. H i d d e n M a r k o v models provide a r i ch class of fa ir ly realistic models for the pattern of movements. W e also consider the continuous t ime version, namely, the doubly stochastic Poisson process w i t h a M a r k o v intensity process. M a x i m u m l ikel ihood est imation based on a sample f rom a mixture d is t r ibut ion is re-viewed i n Chapter 2. Est imators w i t h a fixed finite number of components (called con-strained m a x i m u m l ikel ihood estimators) and the estimator which maximizes the l ikel ihood over the entire space of m i x i n g distr ibutions are considered. W e discuss the identif iabi l i ty of mixture distr ibut ions ; the K u l l b a c k - L e i b l e r divergence; L indsay 's (1983a) characterization of the unrestricted m a x i m u m l ikel ihood estimate as a finite d is t r ibut ion ; the computat ion of the estimates and observed informat ion, using the E M algor i thm in part icular ; and large sample properties. In Chapter 3, t h e m i x i n g dis t r ibut ion est imation problem is cast as a problem of model selection, w i t h the models represented by the possible numbers of components. A maxi -m u m penalized l ikel ihood method is proposed for selecting a model . T h e m a i n result of this chapter establishes the consistency of the result ing estimator of the u n k n o w n m i x i n g dis tr i -b u t i o n , i n the sense of weak convergence of d is t r ibut ion functions. T h e proof uses several consistency results which are also of interest. In part icular , a new proof of the consistency of the m a x i m u m l ikel ihood estimator is given. A l s o , the large sample l imi ts of a sequence of constrained m a x i m u m l ikel ihood estimators are identified as those distr ibutions m i n i m i z i n g the K u l l b a c k - L e i b l e r divergence f rom the true m i x i n g d is t r ibut ion . If the true mixture dis-2 t r i b u t i o n is contained i n the impl ied model (that is, the true number of components is not greater t h a n the assumed number) , the constrained m a x i m u m l ikel ihood estimator is con-sistent i n the sense of weak convergence. If the assumed number of components is exactly correct, the estimators of the parameters which define the m i x i n g dis t r ibut ion are consistent i n the quotient topology of Redner (1981); this topology must be considered because of the lack of ident i f iabi l i ty of the parameters. A n a lgor i thm for computat ion of m a x i m u m l ikel ihood estimates (and thus the m a x i m u m penalized l ike l ihood estimate) is presented i n Chapter 4 (For t ran code for implementat ion of the a lgor i thm is provided i n A p p e n d i x A ) . For each possible number of components, the a lgor i thm automat ica l ly generates s tar t ing values and executes the E M algori thm to locate a local m a x i m u m of the l ikel ihood funct ion. T h e use of the a lgor i thm for the estimation of Poisson mixture distr ibutions is i l lustrated using two data sets. For the dis tr ibut ion of traffic accidents presented by S imar (1976), a solution w i t h a higher l ikel ihood value than the solut ion presented by S imar is found. For the fetal movement data described above, the estimated mixture distr ibutions are interpretable i n terms of various states w i t h different rates of act ivi ty . In Chapter 5 we t u r n to the general hidden M a r k o v model . T h e relationship w i t h other time-series models is discussed, w i t h reference to problems of state est imation or reconstruction. T h e E M algor i thm can be used to find l o c a l m a x i m a of the l ikel ihood funct ion for these models. T h e implementat ion of the a lgor i thm using the forward-backward equations of B a u m et al (1970) is discussed; an adjustment to circumvent the inherent computat ional problem w i t h these equations and a method of automatical ly generating " g o o d " s tar t ing values are given. (Fortran code is provided in A p p e n d i x B. ) T h e a lgori thm is appl ied, w i t h Poisson component distr ibutions, to the fetal movement data. T h e results 3 support the previous interpretation of the mixture model as defining states with different rates of activity, and give a more detailed picture of the process. In Chapter 6, we prove the consistency of the maximum likelihood estimator for hidden Markov models. This requires the definition of an equivalence relation on the parameter space and an identifiability property relative to this equivalence relation; consistency of the estimated equivalence class is then established. Preliminary results of interest include the verification of the ergodic property and the conclusion of the Shannon-McMillan-Breiman theorem on entropy convergence, and a definition of cross-entropy as a limit of the log-likelihood function. Chapter 7 concerns future research. Maximum likelihood estimation for the doubly stochastic Poisson process with a Markov intensity process (a continuous time version of the hidden Markov model) is considered. We describe the apphcation of the E M algorithm to these processes, the analogs of the forward-backward equations, and the complications involved in proving consistency of maximum likelihood estimators for this model. Another problem concerns inference in hidden Markov models, including a test of independence. Finally, we mention some hypothesis testing problems in mixture models, their apphcation to tests of over-dispersion, and the distribution of likelihood ratio test statistics. 4 Chapter 2 M a x i m u m l i k e l i h o o d e s t i m a t i o n f o r m i x t u r e d i s t r i b u t i o n s The Poisson distribution is commonly used to model data which are counts. For instance, the number of events of a particular type which occur in an interval of time might be taken to be Poisson distributed if the axioms of the Poisson process appear to be satisfied by the event process. The Poisson distribution is also used as an approximation to the binomial distribution. It often happens that counts which are a priori thought to be Poisson distributed exhibit over-dispersion relative to the Poisson distribution, that is, the sample variance (s 2 = Dr=i(2/i - y)2/(n - 1)) is larger than the sample mean (y = 23?=i 2/t/71)- A potential source of over-dispersion is heterogeneity among the means of several Poisson distributions. Consider, for example, the numbers of automobile accidents incurred by members of some population in a given year. Although each person may have a Poisson distributed number of accidents, there might be subgroups of the population with different accident rates. A random sample from this population will tend to produce an over-dispersed set of counts. More specifically, the distribution of the number of accidents in the population could be represented by a mixed Poisson distribution or Poisson mixture, having a density of the 5 fol lowing form: m (2.1) j=i where Xj is the accident rate i n the jth subgroup and otj is the proport ion of the populat ion i n the jth subgroup. M o r e general Poisson mixtures arise f rom al lowing the rate to vary continuously throughout a (hypothetical) popula t ion , producing a density of the form In this case F is called the mixing distribution. Other m i x t u r e distr ibutions commonly used i n applications are mixtures of b inomials , normals , and exponentials. Appl ica t ions can be divided among at least three categories. F i r s t , the m i x i n g dis t r ibut ion may represent a physical reality; for instance, M a c d o n a l d (1975) models the length o f f i s h i n a populat ion containing fish of different ages as a normal mix ture w i t h components corresponding to the age groups. Secondly, i n cases such as the d is t r ibut ion of accidents considered above, well-defined subgroups do not exist, but hypothet ica l subgroups can lead to an interpretive model for the d is t r ibut ion . T h i r d l y , a m i x t u r e d is t r ibut ion may be used s imply because i t provides a good fit to data owing to some general feature such as over-dispersion for counts. T h e number of components in the mixture is well-defined in the first case, not so well-defined i n the second, and completely arbi t rary i n the th i rd case. W e are concerned w i t h the problem of l ikel ihood-based inference for the m i x i n g dis tr i -but ion , based on an observed random sample y\,..., yn; the log-l ikel ihood funct ion is given Â«=i T h e parameter space may be taken to be the entire set of d is t r ibut ion functions on [0,oo) 6 by n (2.2) (for Poisson mixtures) or some smaller class of d i s t r ibut ion functions. F o r instance, we might restrict the parameter space to include only the measures w i t h at most m points of support . A d is t r ibut ion w i t h a finite number of support points is called a finite mixing distribution; i t is denoted by F = Y^?=\aj&\j a n d t n e derived d is t r ibut ion PF{-), which reduces to p(y) in (2.1), is then called a finite mixture distribution. Depending on whether the m i x i n g d is t r ibut ion represents a well-defined physical model , the selection of the number of components may present an addit ional est imation problem. In this sett ing, a maximum likelihood estimator is a d i s t r ibut ion funct ion which maxi -mizes the log-l ikel ihood over the unrestricted parameter space. It was introduced by Kiefer and Wol fowi tz (1956) as part of a method of est imation for problems involv ing nuisance pa-rameters. T h e first a lgor i thm for comput ing this estimate (specialized to the Poisson case) was given by S imar (1976). Other work on this subject includes L a i r d (1978), who showed the estimator is a discrete d is t r ibut ion , and L i n d s a y (1983a), who showed the estimator has at most n components, where n is the number of observations. M u c h of the recent research on m a x i m u m l ikel ihood est imation for finite mixture dis-tr ibutions concerns the E M algor i thm, which provides the most appeal ing, i f not the most efficient, means of computat ion . T h e E M a lgor i thm is a useful tool for l ike l ihood m a x i m i z a -t ion i n models which either expl ic i t ly involve missing data or can be formulated i n terms of missing informat ion. Some of the early derivations of the a lgor i thm were based on mixture models. In what is believed to be the first appl icat ion of mix ture dis tr ibut ions , Newcomb (1886) modelled observations in astronomy by a finite m i x t u r e w i t h n o r m a l components h a v i n g different variances and a common mean. Newcomb gave the equations of the E M a lgor i thm for est imation of the common mean w i t h fixed variances and m i x i n g proportions. Jeffreys (1932) used the E M algor i thm to obtain estimates of the parameters of a mix ture of 7 two n o r m a l distr ibutions w i t h different means and variances. (See B u t l e r , 1986 for further discussion of the papers of Newcomb and Jeffreys.) In addi t ion to its uses i n special models, a general theory has emerged around the E M algor i thm. B a u m et al. (1970) proved that the l ikel ihood is increased at each i terat ion of the a lgor i thm and applied i t to hidden M a r k o v models (a generalization of mixture models i n which a sequence of "state" variables fol lowing a M a r k o v chain determines the sampling distr ibutions of the observations; see Chapters 5-6) . O r c h a r d and W o o d b u r y (1971) devel-oped some of the general theory behind the E M algor i thm and Sundberg (1976) developed some of the general theory for incomplete data f rom exponential families. Dempster , L a i r d , and R u b i n (1977) made a grand unif icat ion of the previous work, named the E M algor i thm, and demonstrated a general theoretical treatment w i t h such diverse areas of appl icat ion as grouped, censored, t runcated, and otherwise incomplete, data; finite mixtures; variance components models; empir ica l Bayes methods; i terat ively reweighted least squares; and factor analysis. E m p i r i c a l Bayes inference is one of the important fields of appl icat ion of mix ture dis tr i -butions; a t y p i c a l problem i n this area is the fol lowing one. Le t (<?i, Yi),..., (9n,Yn) be n independent and ident ical ly dis tr ibuted random vectors such that the d is t r ibut ion funct ion of 0i is F and the condit ional density of Y\ given &i is / ( - ,#i) . T h e problem is to estimate one or more of 8\,..., 6n given the observation of Y\,..., Yn. If this problem were posed from a Bayesian point of view, F would be a (known) prior d i s t r ibut ion , and one solution (opt imal for squared-error loss) would be to estimate 6{ by its posterior mean given Yf. Ep{9% 1 Vt] ~ IKviW One solution to the empir ica l Bayes problem derives f rom the observation that Y\,..., Yn form a random sample f r o m the mixture dis t r ibut ion / f(y,6)dF(0); thus Y i , . . . ,Yn could be used to estimate F and the result ing estimate used in the posterior mean as i f it were the true F. O n e theoretical question that arises is the degree to which the estimated posterior mean approximates the true posterior mean for large samples. T h e empir ical Bayes approach to stat ist ical inference was ini t iated by Robbins (1956); m i x i n g d is t r ibut ion estimators have been developed specifically for use i n this approach (e.g. E d e l m a n , 1988). In the next section we outl ine the argument for the existence of m a x i m u m l ikel ihood estimators and the characterization of the unrestricted m a x i m u m l ikel ihood estimator as a discrete measure w i t h at most n components. T h e computat ion of the estimators and their informat ion m a t r i x by the E M algor i thm is discussed i n Section 2.2. Section 2.3 contains a discussion of the ident i f iabi l i ty of the m i x i n g d is t r ibut ion , an essential property for the existence of consistent estimators, and the K u l l b a c k - L e i b l e r divergence between m i x i n g dis tr ibut ions . A review of some published results on consistency, and their relation to the new results to be presented i n Chapter 3, is provided i n Section 2.4. 2.1 Existence and characterization of the m a x i m u m likeli-hood estimator T h e question of existence is whether the log-l ikel ihood funct ion is bounded and attains its m a x i m u m value at some d is t r ibut ion funct ion. W e now formal ly define a general model under which this question can be answered. For each i, let {p,(-,#) : 0 G 0 } be a fami ly of densities on a set E{ w i t h respect to a measure fa. We assume 0 is a Bore l subset of a E u c l i d e a n space. A l t h o u g h the parameter space of interest consists of d is t r ibut ion functions, we work w i t h F, the set of subdis tr ibut ion functions on 0 (which have total mass not more than one), because this set has a compactness property der iving from the Helley selection theorem (Bil l ingsley, 1979, Theorem 25.9 and the proof of Theorem 29.3), which w i l l be exploited i n Chapter 3. For 9 F G F, let PF(V) = JPi(y,0)dF(6), yeEi,i = l,...,n. (2.3) U n d e r the measurabi l i ty of pi(-, â€¢) on E x 0 , and more specifically, under the continuity of Pi(yu 0) P F 1S a density on E{ w i t h respect to /J,-. NOW let Y i , . . . ,Yn be independent, w i t h Y i hav ing density p^? ; the log-l ikel ihood funct ion is defined on F by e(F) = -Â£iog$(yi). i-l T h e fol lowing theorem states the existence of a m a x i m u m l ike l ihood estimator and its characterization as a discrete d is t r ibut ion w i t h at most n components. W e include a proof for completeness. Theorem 2.1 (Lindsay, 1983a) If the set M1 = {(pi(yi, 9),... ,pn(yn, &)) â€¢ 0 G 0 } is compact, then there exists a distribution function F G F, which has at most n components and maximizes the log-likelihood over F. Proof : Caratheodory 's theorem (Roberts and Varberg , 1973, p . 76) states that any element of c o n v ( A ^ i ) , the convex h u l l of M\, is the convex combinat ion of n + 1 or fewer points of M\; this is strengthened to n or fewer, for points on the boundary of M\ (Silvey, 1980, A p p e n d i x 2). A s a corollary (Roberts and Varberg, 1973, p. 78), c o n v ( A ^ i ) is compact . N o w , the compactness of M\ implies that (Phelps, 1966, P r o p o s i t i o n 1.2) the set M = {(p(p(yi),...,p(Z\yn)): F e F}, is the closed convex h u l l of Mi, w h i c h is just c o n v ( M i ) in this case. M a x i m i z i n g the log-l ikel ihood is now equivalent to m a x i m i z i n g r ( p ( 1 ) , . . . , 2 , H ) = ^ 1 Â° g p ( i ) 10 over conv(A^x)- Since the funct ion Â£* is s tr ic t ly concave, i t achieves its m a x i m u m value over the compact convex set c o n v ( . M i ) at a unique point on the boundary (Roberts and Varberg , 1973, p . 124). (It is easy to see Â£* attains its m a x i m u m on the boundary because any neighbourhood around an interior point contains points w i t h values of i* larger than the value at the interior point . ) U s i n g Caratheodory's theorem, the point at which the m a x i m u m occurs is the convex combinat ion of n or fewer elements of M\, and hence is equal to pp for some dis t r ibut ion funct ion F G T, w i t h n or fewer support points. â€¢ R e m a r k s 1. A l t h o u g h the point of M at which Â£* achieves its m a x i m u m is unique, i t does not follow that F is unique; the vector in M at which the m a x i m u m occurs might correspond to two different m i x i n g distr ibut ions . However uniqueness has been proved for Poisson mixtures (S imar , 1976) and exponential mixtures (Jewell , 1982). L i n d s a y (1983b) has studied the uniqueness problem for general exponential family mixtures . 2. T h e method of the theorem can be used to establish the existence, for each ro > 1, of a d i s t r ibut ion funct ion Fm w i t h m components, which maximizes the l ikel ihood over Fm, the subset of T consisting of the measures w i t h at most m support points . Fm w i l l be called a constrained maximum likelihood estimator. A g a i n , the m a x i m i z i n g d is t r ibut ion might not be unique; i n fact, i n this case the question of uniqueness seems more difficult to answer than for F. 3. In case the densities pi are the same for a l l i, the number of support points can be bounded above by the number of distinct points i n the sample. F o r samples f rom a discrete d i s t r ibut ion , this bound can be much smaller than the sample size. Le t t ing Mi = {(p(x\,0),.. .,P(XK,8)) : 0 G 0 } , where XI,...,XJC are the distinct points i n the sample, the arguments i n the proof of the theorem show the l ikel ihood is maximized by a 11 distribution function with K or fewer support points. 4. The theorem, as stated, is not very useful because for many cases the compactness does not hold. However, as Lindsay mentioned, the compactness holds for the closure of Mi, which is obtained by adding points to 0. It then remains to check that these points do not contribute to F. This idea lies behind the following applicable result. C o r o l l a r y 2 .1 Assume that, for every i, the mapping 9 â€”> Pi(yi,9) is continuous and "vanishes at infinity", that is, for every e > 0 there is a compact Kt C 0 for which PiiVijQ) < e) 0 & KÂ£. Then there exists a distribution function F Â£ F, which has at most n components and maximizes the log-likelihood over F. Proof: First we show the closure of the set Mi, defined in Theorem 2.1, is simply Mi U {(0,...,0)}. If ( p W , . . . , ^ 7 1 ) ) is a limit point of Mi, i.e. {pi{yi,9k),... ,pn{yn,6k)) -* some sequence {9k} in 0, then there are two possibilities: 1. {8k} C K for some compact K C 0. In this case 9kt â€”> 9 Â£ K for some subsequence {6kl} and hence (pC 1 ) , . . . , p W ) = 9),.. .,pn(yn,9)) Â£ Mx. 2. {9k} does not lie totally within any compact set. Then there is a subsequence {9kt} for which pi(yi, 9kl) â€”> 0 for all i, implying = 0 for all i. Since Pi(yi,9) is a bounded function of 9 for each i, A4\ is a bounded set, and hence Mi U {(0,..., 0)} is compact. Now let p* maximize r ( p W , . . . ,p^) = YA=\ l o g p ( l ) over the convex hull of Mi U {(0,..., 0)}. Caratheodory's theorem, as in the proof of the theorem, implies p* is a convex combination of n or fewer elements of A^iU{(0,... ,0)}. But (0,..., 0) is certainly not one of these points; therefore p* = pp for some distribution function F Â£ F, with n or fewer support points. â€¢ 12 N e x t we show the hypotheses of this corollary are satisfied for three families of dis t r i -butions commonly used i n applications of mixtures . E x a m p l e s 1. (Poisson). Pi(yi,6) = (8ti)yie~6ti/y,!, 8 > 0. In this case pi(yi,8) â€”â€¢ 0 for every i as 8 oo. 2. ( N o r m a l mean). pi(yi,8) = e~^-6? I2"* I sjlito2, â€”co < 8 < oo. In this case pi{yi, 8) â€”* 0 for every i as 0 â€”> Â± o o . 3. (Exponent ia l ) . Pi(yi,9) = 9e~eyi, 0 < # < oo. In this case pi(yi,8) â€”â€¢ 0 for every i as 0 â€”> 0 or as 0 â€”*â€¢ oo. 2.2 Computat ional aspectsâ€”the E M algorithm T h e problem of est imating a m i x i n g dis t r ibut ion w i t h a known number of components, be-sides being of interest for its own sake, leads to one method for obta ining the unrestricted m a x i m u m l ike l ihood estimator. B y this method, the l ike l ihood is maximized by f inding the constrained m a x i m u m l ikel ihood estimators w i t h increasing numbers of components; the characterizat ion given i n Section 2.1 guarantees the l ikel ihood cannot continue increasing after the number of components reaches n. One means of calculat ing the finite component estimates is the E M algor i thm to be described below. L indsay (1981) recommended a differ-ent a lgor i thm for comput ing the unrestricted m a x i m u m l ike l ihood estimator. His method i n i t i a l l y uses the E M a lgor i thm w i t h a moderately large number of components. It then alternates between a d d i n g components, by choosing points of support w h i c h w i l l lead to an increase i n l ike l ihood, and using the E M algor i thm w i t h the new support set. T h e E M a lgor i thm is one of the more popular methods for obta in ing an estimate of a finite m i x i n g d is t r ibut ion . Its appeal lies i n its s implic i ty , its interpretat ion as the iterative 13 appl icat ion of the l ikel ihood equations, and the fact i t produces a sequence of parameter values w i t h increasing values of the l ike l ihood, s tar t ing f rom an arbi t rary i n i t i a l estimate. Despite a l l of this , the E M algor i thm suffers from the one potent ia l ly serious flaw which plagues most numerical opt imizat ion routines. Th is is the fact that convergence of the algo-r i t h m produces a local m a x i m u m of the l ike l ihood which may not be the global m a x i m u m . In Chapter 4 we w i l l discuss means of f inding a global m a x i m u m , v i a an appropriate choice of s tar t ing values for the E M algor i thm. Here we describe the E M a lgor i thm as one useful way of f inding local m a x i m a . A l t e r n a t i v e methods include N e w t o n - R a p h s o n and quasi -Newton methods such as scoring; Redner and Walker (1984) indicate that these methods use more operations per i terat ion than the E M algor i thm, but tend to require fewer iterations to achieve convergence. Denote an m-component m i x i n g d is t r ibut ion by m ^ = E " A - . (2-4) 3=1 where aj > 0, 9j Â£ 0 , j = 1 , . . . , m , ^Zctj = 1, and Sg denotes a point mass at 9. Define the complete data to be (yÂ»,u,-), i â€” 1 , . . . ,ra, where u t- is an indicator vector of length m w i t h a 1 i n the posi t ion corresponding to the component of the mixture w h i c h generated yi, i.e. u , = (un,... , Â« , m ) and U{j is 1 i f yi came from the j t h component and 0 otherwise. T h e jo int density of the complete data is n m ^)=nn(Â«iÂ«(j 'Â»w i . t=ij=i where <f> = ( Â« i , . . . ,ctm, 9\,... ,9m), and the joint density of the observations is obtained by s u m m i n g Lc over a l l possible sequences u = ( u i , . . . , u n ) , g iv ing m m n 14 T h e E M algor i thm exploits the fact that the jo int density of the complete data is m u c h easier to work w i t h than that of the observations, since tak ing the l o g a r i t h m reduces i t to a sum. If <f>w is the current estimate, the updated estimate ^>w+1 is chosen to maximize EpilogLc(r+1) I yu...,yn], (2.5) with. E$* denoting the condit ional expectation calculated under 4>w. In fact, a l l that is necessary is Â£ ^ [ l o g Z c ( < Â£ u ; + 1 ) | yi,...,yn] > E^[\ogLc(4>w) | yi,...,yn], which implies , by an appl icat ion of Jensen's inequality, S Hp) o g E u i c ( ^ ) Lc((pw+1) Lc(<t>w) = logE 7/ Lc^w) E u ^ c ( ^ ) Lc(<j>u+1) = l o g Â£ ^ [ | yi,...,yn] > *V [ l o g o f f I yi,...,yn] > 0- (2.6) T h u s , at each step the hkel ihood is increased, u n t i l a local m a x i m u m is found. E a c h i terat ion of the E M algor i thm consists of two steps: the E-step, i n which the expectat ion (2.5) is calculated, and the M - s t e p , i n which the m a x i m i z a t i o n is performed. T h e equations for performing these steps for the case that Pi(-,8) is the probabi l i ty mass funct ion of the Poisson dis t r ibut ion w i t h mean 6 are given i n the fol lowing description of the a lgor i thm. E M a l g o r i t h m (Poisson mixture) : 1. Choose <j>Â° = ( a Â° , . . . , a Â° , 0 Â° , . . . , 6Â°m). Set u - 0. 15 2. Repeat u n t i l every coordinate of <f>w+1 â€” <f>w is smal l : (i) E-step. For i = 1 , . . . , n, j = 1 , . . . , m , compute (ii) M-s tep . For j = 1 , . . . , m , compute k=l 2^i=l uik 0' (i i i) Set u *â€” OJ + 1. R e m a r k Since Uij is the condit ional probabi l i ty that observation i came f rom component j, that is , Ej,w[uij | y\,... yn], the condit ional expectation i n (2.5) is equal to can be obtained i f the Poisson fami ly is replaced by any mult iparameter exponential family, or even i f the observations are assumed to come from different exponential families; for instance, a sample of Poisson and b inomia l observations could be modelled as a mixture w i t h 9 representing an under ly ing rate. A potent ia l advantage of other methods of l ikel ihood m a x i m i z a t i o n , such as the scoring and N e w t o n - R a p h s o n methods, is the abi l i ty to compute the informat ion m a t r i x . However, Louis (1982) has shown how to accomplish this (for general missing in format ion problems) when using the E M algor i thm. We i l lustrate this method below for the est imation of finite mixtures . For this purpose, redefine <f> by removing , say, i n order to obta in a non-singular m n m n (2.7) j = l i = l j = l i = l f rom w h i c h the two equations i n the M-s tep are easily derived. S imi lar formulas for 0j 16 informat ion matr ix . Let a vector of length 2 m â€” 1, and d2 bij(Vi,<j>) = d ^ r [ l o g < * j + logPt(y,-, Oj)], a (2m â€” 1) X ( 2m â€” 1) matr ix . T h e expression for the observed informat ion m a t r i x i n equation (3.2') of Louis (1982) becomes n m n m m m E E ^A J - E E uasijsJj - 2 E ( E Â«v 5Â«)(E *hjhj)T, i=l j=l i=l j= l i<h j=\ j=l where T denotes transpose, b\j = 6,j(y,-,^), S{j = Sij(yi,<j>), <j> is the m a x i m u m l ikel ihood estimate, and w,j is the value of the condit ional probabi l i ty used i n the E M a lgor i thm, evaluated using the m a x i m u m l ike l ihood estimate. B o t h 0 and Uij are available after the f ina l i terat ion of the E M algor i thm. 2.3 Identifiability and Kullback-Leibler divergence Let {p(-,0): 6 G 0 } be a fami ly of densities on a set E w i t h respect to a measure fi. T h i s fami ly is said to have the identifiability property (be identifiable) i f Jp{y,6)dFx(6) = Jp(y,e)dF2(6) a.e.d^y) =â€¢ F1 = F2. (2.8) Identif iabil i ty makes the est imation of the m i x i n g d is t r ibut ion an unambiguous problem, and i t is a necessary condit ion for consistent estimators to exist. T h e identi f iabi l i ty property is satisfied in many impor tant cases, i n c l u d i n g the ones considered i n Section 2.1. However the b inomia l family {b(k,p): 0 < p < 1} satisfies only a l i m i t e d identif iabi l i ty. A b inomial mixture density has the form l { y ) p V { 1 ~ r*k~VdFw ={y)kf(-ir(k~iy)j py+idnP) 17 and so is determined by the first k moments of F; these moments cannot uniquely determine the 2 m â€” 1 parameters of an m-component mixture i f 2m â€” 1 > k. However (2.8) holds for m i x i n g distr ibutions w i t h m < (k + l ) / 2 components (Teicher, 1963). W h e n (2.8) holds for m i x i n g distr ibutions w i t h m or fewer components, we say the fami ly {p(-,0) : 6 â‚¬ 0} satisfies the m-identifiability property. In such a case, a corresponding constraint on the m i x i n g d is t r ibut ion must be made when obta ining estimates; consistent estimators m a y exist i f the true m i x i n g d is t r ibut ion satisfies the constraint on the number of parameters. A n o t h e r important class of examples is the class of mixtures of products of densities f rom a given fami ly ; these mixtures provide a means of defining mult ivariate distr ibutions for specific types of data , such as counts. For example, suppose r a n d o m variables Y i , . . . , Yv are condit ional ly independent given A i , . . . , A p , w i t h Poisson distr ibutions having means A i , . . . , A p , and A i , . . . , A p have the jo int d is t r ibut ion F. T h e n the joint density of Y j , . . . ,YP is T [ ^ - r - d F ( \ 1 , . . . , \ P ) . Such mixtures are studied i n Teicher (1967), where it is shown the ident i f iabi l i ty property carries over f rom a fami ly of densities to the fami ly of products of members f rom this family. Teicher's proof also applies to the m-ident i f iabi l i ty property, and we have the fol lowing result. L e m m a 2.1 If {/(y,#): 0 â‚¬ 0} satisfies the identifiability property, then so does the class of products of p members from this family. The same implication holds for m-identifiability. A n impor tant concept for the theory of est imation, connected w i t h identi f iabi l i ty , is the 18 K u l l b a c k - L e i b l e r divergence f rom Fi to F2 (or between Fi and F2), w h i c h is defined by it can be thought of as a measure of "distance" between ppx and pp2. T h e identi f iabi l i ty property says different m i x i n g distr ibutions produce different mix ture densities. Fur thermore , the next result shows that the K u l l b a c k - L e i b l e r divergence between different m i x i n g distr ibutions is posit ive. L e m m a 2.2 For any F\, F2 G F, K(F\,F2) > 0. If the identifiability property holds, then K(F\,F2) > 0 if F2 / F\. If F\ and F2 each has m or fewer components and the m-identifiability property holds, then K{F\,F2) > 0 if F2 ^ F\. Proof : B y Jensen's inequality, K{F\,F2) is always non-negative, and posit ive unless pp1 = PF2 a.e.d\x. If the identi f iabi l i ty property (2.8) is satisfied, the latter holds only i f F\ = F2. T h e same argument applies i f the m-ident i f iabi l i ty property holds and F\ and F2 each has m or fewer components. â€¢ T h e K u l l b a c k - L e i b l e r divergence K(F\,F2) can be extended to the case that F2 is a subdis-tribution function (which corresponds to a measure w i t h tota l mass at most 1); the conclu-sions of L e m m a 2.2 remain val id i f F\ is a d is t r ibut ion funct ion and F2 is a subdis t r ibut ion funct ion. 2.4 Consistency In this section we review the l i terature on consistency of m a x i m u m l ike l ihood estimators of m i x i n g distr ibutions and preview some of the new results to be presented in Chapter 3. C o n -R e m a r k 19 sistency results per ta in either to the m a x i m u m l ikel ihood estimator F or to the constrained m a x i m u m l ike l ihood estimator Fm; we consider Fm f irst. For a fixed given number of components, say m , the unknown m i x i n g dis t r ibut ion de-pends on a finite set of parameters: a\,... ,am,0i,... ,6m. T h u s , classical large sample results for m a x i m u m l ikel ihood estimators can be applied to the parameters of a finite mix-ture. Redner and Walker (1984) obta in a consistent and asymptot ical ly n o r m a l sequence of solutions of the l ike l ihood equations i n this way. Sundberg (1974) proves results of this type for incomplete observations f rom an exponential family , w i t h observations f rom an exponential fami ly mixture as a special case. T h e usefulness of this approach is somewhat l i m i t e d by the complicated form of the l ikel ihood funct ion, which makes verification of reg-u lar i ty conditions diff icult . M o r e important ly , the results apply only i f the assumed number of components is correct, that is, i f a l l of the aj are posit ive. Redner (1981) applies the result of W a l d (1949) to obta in the consistency ( in a certain sense, see Section 3.4) of the estimator w h i c h maximizes the l ikel ihood w i t h a l l Oj restricted to lie i n a compact set, as-suming the true number of components is not larger than the assumed number. Sundberg (1974) has s imi lar results for incomplete exponential fami ly observations. Hathaway (1985) proved consistency for a class of n o r m a l mixtures and removed the compactness restrict ion. W e w i l l give a general consistency result for finite mixtures in Section 3.4. A n impor tant question is the convergence of the m-component constrained m a x i m u m l ikel ihood estimator when the true m i x i n g dis t r ibut ion does not have m components. There are two cases to consider. F i r s t , if the true dis tr ibut ion has more than m components, the true model lies outside of the parameter space under consideration and we must first answer the question: what might Fm be consistent for? T h e second possibi l i ty is that the true m i x i n g d is t r ibut ion has fewer than m components. A concrete example of this 20 s i tuat ion arises i n the problem of testing the hypothesis of a " p u r e " d is t r ibut ion against alternatives specifying a mixture w i t h two components. In this context one might want the n u l l d is t r ibut ion of the l ikel ihood rat io test statistic and this leads to the asymptotic behaviour of the estimator of a two-component m i x i n g dis t r ibut ion when there is only one component. Several authors have noted that the classical theory of l ikel ihood rat io test statistics does not apply to the test for the presence of a mixture . In fact the m a x i m u m l ikel ihood estimators of the parameters of a 2-component d is t r ibut ion are not consistent in the usual sense when there is actually only 1 component, because of the lack of identi f iabi l i ty of the parameters. However, there is a sense i n which the constrained m a x i m u m l ikel ihood estimator Fm is consistent; i t was first applied to the unconstrained m a x i m u m l ikel ihood estimator F. For F, consistency is defined by imposing the topology of weak convergence on the parameter space of a l l m i x i n g distr ibutions. T h u s we say F is consistent for F* if, wi th probabi l i ty one, F converges weakly to F* as n â€”> oo, which means F(6) - Â» F*(6) for a l l continuity points 9 of F*\ (2.9) we w i l l denote this by F â€”> F*. T h i s concept of convergence has the fol lowing impor tant impl i ca t ion . U n d e r the condi t ion of Corol lary 2.1, consistency i n the sense of weak conver-gence implies that the estimated density pp converges pointwise to the true density pp*. T h i s fact provides weak convergence w i t h a pract ical significance and makes i t the natura l basis for a definit ion of consistency. T h e first proof of consistency for F was given by Kiefer and Wol fowitz (1956), under conditions adapted f rom those which W a l d (1949) used for a f inite-dimensional parameter (Heckman and Singer, 1984, verify these conditions for a fami ly of life dis tr ibut ions , but the conditions are difficult to verify) . Other work on this problem includes proofs of consistency 21 by S imar (1976) for mixtures of Poisson distr ibutions and by Jewel l (1982), who adapted Simar's proof to mixtures of exponential distr ibut ions . Recently, P fanzagl (1988) has proved some general results on consistency for m a x i m u m l ike l ihood estimators i n nonparametric models w i t h mix ture distr ibutions as a specific example. W e w i l l apply the concept of weak convergence to finite mixtures i n order to obtain consistency results for any number of components. If more components are estimated than are actual ly present, a l though the estimated parameters cannot be consistent i n the usual sense, we prove the consistency of the estimated m i x i n g d is t r ibut ion . If fewer components are estimated than are necessary, we show weak convergence to a dis t r ibut ion w h i c h approx-imates the true m i x i n g d is t r ibut ion i n the sense of m i n i m u m K u l l b a c k - L e i b l e r divergence. In order to be able to infer results on convergence of estimated parameters, we provide i n Section 3.4 a connection between the two notions of convergence for finite mixtures. In part icular this result is used to prove consistency of the m a x i m u m l ikel ihood estimators of the parameters of a finite m i x i n g d is t r ibut ion w i t h the number of components correctly specified. Procedures for selecting the number of components i n a mixture have received rela-t ive ly l i t t le at tention i n the published l i terature. ( A n exception is Henna , 1985, where a consistent est imator of m based on m i n i m u m distance estimators of F is presented.) Th is problem provided a focus for the present research; Chapter 3 begins w i t h a proposed proce-dure for es t imat ing m based on constrained m a x i m u m l ikel ihood estimators of the m i x i n g d i s t r ibut ion , and ends w i t h a result on the large sample behaviour of this procedure. 22 Chapter 3 C o n s i s t e n c y r e s u l t s f o r e s t i m a t o r s o f a m i x i n g d i s t r i b u t i o n 3.1 Choosing the number of components in a mixing dis-tr ibution In this chapter we will show that, in many cases, a maximum likelihood estimator F is consistent in the sense of weak convergence, as defined in Section 2.4. One consequence of consistency is that, if the true mixing distribution F* is actually a finite distribution, then F has, in the limit as the sample size n increases, a larger number of components than F*. More precisely, if F* has m* components, and F has rhn components, then l i m i n f n _ + 0 0 fhn > m* ; this holds also if F* is not finite: m â€”Â»â€¢ oo in this case. Thus consistency provides a guarantee that all existing components will be "detected" with a large enough sample. On the other hand, the inclusion of too many components, as tends to occur for the maximum likelihood estimator, reduces the usefulness of the estimate as a simple explanatory model for the data. It may be possible to improve on F by reducing the estimated number of components, while preserving consistency. This leads us to consider procedures for explicitly estimating the number of components rather than accepting the number determined by the maximum likelihood procedure. Notice the present situation is one in which the well known "optimality" properties of maximum likelihood estimators do not necessarily hold; these properties concern finite dimensional parameters while the 23 m i x i n g d i s t r i b u t i o n w i t h unrestricted number of components requires an infinite dimensional parameter space. T h e procedures we w i l l consider for est imating the number of components are based on the general theory of " m o d e l selection" (see, for example, L i n h a r t and Zucch in i , 1986). We have a sequence J:\,T2,-.. of nested hypotheses about the parameter F*, or a sequence of nested models, w i t h Tm representing the set of m i x i n g distr ibutions (actually subdis-t r i b u t i o n functions) w i t h at most m components. If F* is a finite d is t r ibut ion w i t h m* components, that is , F* G Tm+, F* Â£ Tm*-\i then Tm* is the smallest true hypothesis. We propose m be chosen to maximize a cr i ter ion (or objective function) obtained by subtract ing f rom Â£(Fm), the log-l ikel ihood maximized over Tm, a penalty term which discourages the selection of a model w i t h an excessive number of components. W e w i l l write the cr i ter ion i n the fol lowing form: Choose m to maximize Â£(Fm) â€” amn, (3-1) where n is the sample size and amn are pre-determined constants. (It is to be understood that m = oo i f no m a x i m u m is obtained.) T h e result ing m i x i n g dis t r ibut ion estimator Ffh (= F, i f m = oo), is called a maximum penalized likelihood estimator. W e w i l l consider the fo l lowing two cri ter ia : A k a i k e informat ion cri terion ( A I C ) . Choose m to maximize Â£(Fm) â€” dim( jF T O ) . Bayesian informat ion cri terion ( B I C ) . Choose m to maximize Â£(Fm) â€” | ( logn) d i m ( ^ " m ) . In these cr i ter ia , d i m ( J " m ) = m â€” 1 + mp, where p is the dimension of 9, e.g. p = 1 for Poisson mixtures . W i t h o u t compl icat ing the theoretical development, amn could be made dependent on the data . A n important example which results is the cross-validatory criterion ( C V C ) : choose fa to maximize 53â„¢=i lÂ°gPg'(>)(?/Â»)> where Fm is the constrained m a x i m u m r m 24 l ike l ihood estimator based on the sample y\,... ,yn after yi is deleted. Thus C V C has the form given i n (3.1) w i t h amn = Â£ ? = 1 logp* m (y i ) /PMi)(y i ) -T h e use of cross-validation for model selection was developed by Stone (1974) and Geisser (1975). T h e cross-validatory cr i ter ion i m p l i c i t l y penalizes the selection of an overly complex m o d e l , because the increase i n the value of Y^i=\ lÂ°gPe.(')(2/i) 4 8 parameters are added to the model w i l l tend to be somewhat offset by the decrease i n precision of param-eter estimates. U n d e r certain condit ions, related to the classical regularity conditions for m a x i m u m l ikel ihood est imation, the cross-validatory cr i ter ion is asymptot ica l ly equivalent to A I C (see Stone, 1978). However, these conditions are not satisfied for mixtures , as was pointed out by T i t t e r i n g t o n , S m i t h , and M a k o v (1985) (see also Section 2.4). Schwarz (1978) derived B I C as a model selection cr i ter ion by placing a prior d is t r ibut ion on the parameter space ( inc luding prior probabil i t ies of the various model-defining subsets) and showing B I C is asymptot ica l ly equivalent to the posterior probabi l i ty that the parameter belongs to the mth subset. These arguments also depend on strict regularity conditions and do not carry over to the present s i tuat ion. In practice, the choice of a specific cr i ter ion wiU be based on the relative costs of over-and under-est imation of the number of components. Since B I C contains a larger penalty t e rm (for n > 7), B I C selects a smaller number of components than A I C . B o t h A I C and B I C produce consistent estimators of the unknown m i x i n g d is t r ibut ion , but i t is not k n o w n whether the cross-validatory cr i ter ion does also. (Theorem 3.4 provides a sufficient condit ion for consistency of the m a x i m u m penalized l ikel ihood estimator, invo lv ing the rate of growth of a m n w i t h n.) W e now describe i n detai l the sett ing under which we w i l l carry out our analyses. Le t {Â»(-,#) : 0 â‚¬ 0 } be a fami ly of densities on a B o r e l subset E of Rq, w i t h respect to a 25 measure fx, where 0 is an interval i n Rp. (Intervals i n a mul t id imensional Euc l idean space are products of intervals on the real line.) For each F Â£ T (T is the set of subdistr ibut ion functions on 0 ) , define P F ( y ) = / p(y,0)dF(9),yeE. Let yi,...,yn be an observed random sample f rom the dis t r ibut ion pp*, where F* is a d is t r ibut ion funct ion i n T. T h e log-l ikel ihood funct ion is defined by Â£(F) = ^ogpF(yi), F t T . i=\ A maximum likelihood estimator is a dis t r ibut ion funct ion, denoted F, which maximizes Â£ over T\ a constrained maximum likelihood estimator is a dis t r ibut ion funct ion, denoted Fm, w h i c h maximizes Â£ over Tm (see Section 2.1). Throughout this chapter we w i l l refer to the fol lowing conditions: 1. p(y,0) is continuous on E X 0 . 2. F o r any compact C C E and e > 0, there exist a,b Â£ 0 such that p(y,9) < e, 9 Â£ 0 \ [a,b], y Â£ C. (Note : A \ B denotes the set of points i n A and not i n B.) 3. There is a measurable set Z C E and an interval fl C 0 such that n(Z) > 0, fQ dF* > 0, p(y,9) = 0 on Z x ( 0 \ ft), and p(y,9) > 0 on E x ft. 4. p(y,9) < h(y), 0 e O, y Â£ E, where h is continuous on Â£ and fPF'(y) |logfc(y)| < Â°o-5- IPF*(y)0-ogp(y,9))~diJ,(y) < oo, 0 Â£ ft, where a ; - = max{â€”x,0} . 6. For each y Â£ E, \ogp(y, â€¢) is concave on ft. Condi t ions 1 and 2 are sl ightly stronger than the conditions used i n C o r o l l a r y 2.1 to obtain the existence and finite characterization of F. C o n d i t i o n 3 is automat ica l ly satisfied for 26 densities w h i c h are s tr ic t ly posit ive over their entire range. T h e integrabi l i ty conditions (4 and 5) are s l ight ly stronger than the requirement of finite entropy, a frequently imposed condi t ion i n large sample studies of m a x i m u m l ikel ihood estimators. T h e above conditions were imposed w i t h exponential families i n m i n d (especially condi t ion 6, which is satisfied for exponential families) . N e x t we show a l l the conditions are satisfied for the examples of 1-parameter exponential families considered at the end of Section 2.1, w i t h certain restrictions imposed by the integrabi l i ty conditions. E x a m p l e s 1. (Poisson). p(y, 9) - 9ye~e, 0 > 0, y - 0 , 1 , . . . , and dfi(y) = l/yl (We define p i n this way to s impl i fy checking one of the conditions.) For Poisson mixtures , the identi f iabi l i ty property was proved by Feller (1943). For condit ion 2 we set o = 0; then y < y* and 9 > b > y* i m p l y p(y,9) < bye~b < by*e~b, and the latter is < e for large enough b. C o n d i t i o n 3 is satisfied w i t h Z = {1 ,2 , . . . } , f i = (0 ,oo) , provided dF*(0) < 1. C o n d i t i o n 4 is satisfied w i t h h(y) = yye~y, if Y^PF'{y)y l o g y < oo; a sufficient condi t ion for this is / 92dF*(9) < oo. C o n d i t i o n 5 is then also satisfied. 2. (Exponent ia l ) . p(y, 9) = 9e~y0, y > 0, 9 > 0, and /J. is Lebesgue measure. For exponential mixtures , the ident i f iabi l i ty property was proved by Teicher (1961). For condi t ion 2, i f V* < V < V* and 9 > b > 1/y*, then p(y,9) < be~yb < be~y*b < e, for large enough b; s imi lar ly , i f y* < y < y* and 9 < a < 1/y*, then p(y,9) < ae~ya < ae~y*a < â‚¬, for smal l enough a. C o n d i t i o n 3 is satisfied w i t h Z â€” (0 ,oo) , f i = (0 ,oo) . C o n d i t i o n 4 is satisfied w i t h h(y) = e - 1 m a x { y _ 1 , 1 } , i f pF*(y)\og(l/y)dy < oo; a sufficient condit ion for this is j9dF*(9) < oo. C o n d i t i o n 5 is satisfied i f / p F ' ( y ) y d y < oo, or equivalently, f(l/9)dF*(9) < oo. 3. ( N o r m a l mean). p(y,9) = e-(y-e)2/2<r^ y 9 e ( _ o o , o o ) , and fi is ( 2 7 r a 2 ) - 1 / 2 times 27 Lebesgue measure. For normal mean mixtures, the identifiability property was proved by Teicher (1960) . For condition 2, if ym < y < y* and 8 > b > y*, then p(y,8) < e-(2>-*>)2/2<r2 < e - ( y * - 6 ) 2 / 2 < r 2 < e^ fQT j a r g e enough fr- similarly, if y* < y < y* and 8 < a < y*, then P(y,9) < e-(y-Â°)212Â°2 < e-(y*-a)2liÂ°2 < e , for small enough a. Condition 3 is satisfied with Z = = (â€”oo,oo). Condition 4 is satisfied with h(y) = 1 and Condition 5 is also satisfied for any F*. We now give a brief outline of the results presented in this chapter; a summary of these results is provided in the table at the end of the chapter (although their exact meaning will not be clear at this stage). In the next section we prove the consistency of maxi-mum likelihood estimators. Following that, in Section 3.3, we consider the approximation of an arbitrary mixing distribution by one with a specified number of components. This approximation is used in Section 3.4 to study the general asymptotic behaviour of con-strained maximum likelihood estimators when nothing is assumed about the true mixing distribution-â€”it may have a different number of components than the number which are estimated or it may not even be a finite distribution. These results will be used to prove the consistency of maximum penalized likelihood estimators in Section 3.5. 3.2 Consistency of the m a x i m u m likelihood estimator This section contains a proof of the consistency of F. The proof depends on the following technical result. Lemma 3.1 If conditions 1 and 2 hold, then the following are true: (i) For any subdistribution function F on Q, pp is continuous on E. 28 (ii) If Fk and F are subdistribution functions on 0 and Fk(9) â€”> F(9) for all continuity points 9 of F, then ppk â€”Â»â€¢ pp uniformly on compact subsets of E. Proof : (i) Let C be a compact subset of E and e > 0. B y condit ion 2 there exist a,b Â£ 0 for w h i c h p(y, 9) < e, 9 Â£ 0 \ [a, b], y Â£ C. Since C X [a, b] is compact , p is uni formly continuous on this set; i n part icular there is a 6C > 0 such that T h i s means pp is uni formly continuous on C, and it follows that pp is continuous on E. (ii) A s a first step, we show limfc-vco PFk{y) = PF(y) for any y Â£ E. Not ice that this part of the proof is a proof of a mult id imensional version of the Extended Hel ley-Bray L e m m a (Loeve, 1977, p. 183). In order to carry this out , we extend F and Fk to a l l of Rp, if necessary, i n the obvious way (by assigning zero mass to Rp \ 0 ) , and extend p(y, â€¢) to a l l of Rp, preserving its properties of continuity and vanishing off a compact set. Let e > 0. B y condit ion 2 there exist a, b such that P(V,0) - p(y',9) | < â‚¬, if || y -y'\\< 6C, y,y'eC,9e [a,6], where || â€¢ || denotes Euc l idean distance. N o w < 36, i f || y -y' \\<6e, y,y'eC. P(y,0)<e, 0$[a,b] (3.2) w h i c h implies Therefore i t suffices to prove (3.3) 29 W e w i l l use the fact that the points a and b i n (3.2) can be taken to be cont inuity points of F. (a is a cont inuity point i f i t satisfies F(a) = F(a~), where F(a~) is the left l i m i t of F at a.) W e now prove this is true for a (a s imilar argument applies to b). F o r each r = 1 , . . . ,p let F^ denote the rth marginal d is t r ibut ion of F; then F(a)-F(a-)<J2[F{r\ar)-F^(a;)}, r = l i f a = ( a i , . . . ,ap). B u t a subdis tr ibut ion funct ion on the real line has at most countably m a n y discontinuity pointsâ€”therefore, we can choose a = (ai,...,ap) so that F^r\ar) = F^r\a~) for each r and (3.2) is satisfied; then a is a cont inuity point of F. T h u s , by the condit ion of the l e m m a , Fk(a) â€”Â»â€¢ F(a) and Fk(b) â€”â€¢ F(b); next we prove that these i m p l y (3.3). There are two cases to consider. Case 1. IiF(b)-F(a) = 0, then Fk(b)-Fk(a) < c for large k and hence f[aib]p(y,0)dFk(9) < Me for large k, where M is an upper bound for p(y, â€¢) on [a, b], which is finite by continuity. Case 2. If F(b) - F(a) > 0, define Gk(6) = (Fk(9) - Fk(a))/(Fk(b) - Fk(a)) and G(9) = (F(9) - F(a))/(F(b) - F(a)) for 9 Â£ [a, b]. T h e n Gk -> G at a l l cont inuity points of G; since G is a d is t r ibut ion funct ion , Gk G, which implies limk^ f[ab]p(y,9)dGk(9) = f[atb]p(y,9)dG(9) and hence l i m ^ o o / [ 0 i 6 ] p(y, 9)dFk{9) = fM p(y, 9)dF{9). T h e pointwise convergence of ppk is thus established; that is , for each y 6 E there is an integer k(e, y) for w h i c h I PFk(y) - PF{y) | < c, if k > fc(e, y). 30 N o w , i f C is a compact subset of E and c > 0, i n the same way as i n ( i ) , there is a Se > 0 for w h i c h I pFk(y) - PFM I < ^ II y - y' II < ^ y,y' e c. (3.4) B y the compactness of C, there is a finite set {yi,... , y^} of points in C for which C C uf=1Ni(6â‚¬), where N{(St) is an open bal l of radius Se centered at y ; ; let kt â€” max{fc(e, y i ) , . . . , k(e,yj)}- T h e n , for any y Â£ C , y G Ni(Se) for some i , and we have I P n ( y ) - PF(y) I < I - PFk(yt) I + I PFk(yi) - PFivi) I + I PF{yd - P F { V ) I < 2(3e) + e, i f Ar > fcÂ£. T h i s establishes the uni form convergence on C. â€¢ R e m a r k E q u a t i o n (3.4) states that the sequence {pFk} is equicontinuous on C and the argument used to obta in the uni form convergence mimics the proof of the A s c o l i - A r z e l a theorem. Next we prove the m a i n result of this section. T h e o r e m 3.1 If conditions 1-5 and the identifiability property (2.8) hold, then F ^ F* as n â€”>â€¢ oo, with probability one. Proof : A theorem of Varadara jan (1958) states that the empir ica l d is t r ibut ion based on a r a n d o m sample f rom a d is t r ibut ion funct ion on a Euc l idean space converges weakly to that d is t r ibut ion funct ion , w i t h probabi l i ty one. Therefore, w i t h probabi l i ty one, the empir ica l d is t r ibut ion Hn based on Y i , . . . ,Yn converges weakly to the dis t r ibut ion w i t h density pp* w i t h respect to n. A l s o w i t h probabi l i ty one, floghdIIn â€”> /pp* loghdp, flogpp>dHn â€”> j pp* log pp* dfjL, and fzdHn â€”>â€¢ fZPF*dfJ,, by the strong law of large numbers. (Note that conditions 4 and 5 together i m p l y / p p * | logp.p* | d/x < co.) We show that these facts i m p l y F ^ F*. 31 Assume there is a subsequence of {F} and F Â£ F, such that JP â€”â€¢ F a long this sub-sequence at a l l cont inuity points of F. (Notice that subsequences do exist w i t h these convergence properties, by the HeUey selection theorem.) In the fol lowing a l l l imits over n are assumed to be taken along this subsequence. W e first show P F ( V ) > 0 for every y. A c c o r d i n g to condit ion 3, pF(y) can be 0 for some y only i f JQ dF = 0, that is, i f JQ dF â€”* 0. Suppose this is true. W e have U{F) = Jz log (jf p{y, 6)dF(B)) dHn{y) + J log Pp(y)dHn(y) < Jz log (Ja h{y)dF{0)j dHn(y) + J log h(y)dHn(y) = I log (h(y) f dF(8)) dHn(y) + [ logh(y)dHn(y) Jz V JQ J JE\Z = I \ogh(y)dHn(y) + log(f dF) f dHn. JE JQ JZ Hence, using JloghdHn -* f pp* loghdf i Â£ (â€”00,00) and fzdHn â€”â€¢ \zpp*d\x > 0, we obta in Â£(F)/n â€”* â€”00. B u t this easily leads to a contradict ion, for Â£(F)/n > l(F*)/n J pF. \ogpF.dn Â£ (-00,00). (3.5) L e t C be compact . W i t h pp > 0, L e m m a 3.1 implies that logp^ , â€”Â»â€¢ \ogpp uni formly on C, and hence | fc\ogppdHn - fc\ogpFdHn | < fc | l o g p ^ - l o g p F | dHn -Â»â€¢ 0. Since Iclogpp is bounded and upper semicontinuous (Ic is the indicator funct ion of C), we have Urn s u p n fc log pp dHn < fc pp* log pp dfj,, by the weak convergence of Hn (Bi l l ingsley, 1979, p. 340). Therefore l i m s u p n / c l o g p p d H n < fc pp* logp Fdfj,, and this proves, u s i n g p F < h, that h m s u p \og(pp/h)dHn < l i m s u p / \og(pp/h)dHn n J n Jc < / pp*\og(pF/h)diJ, Jc 32 for every compact C. N o w consider a sequence of compact sets w h i c h increases to E. T a k i n g l imits a long this sequence, Fatou's l e m m a gives l i m s u p / pp* log(pp/h) dfi < / pp*log(pp/h)dLi. c\E Jc J T o summarize , we have shown l i m s u p Â£ ( F ) / n = l i m s u p / \ogppdHn < / pp* logppd/j,. n n J J B u t , by (3.5), l iminfâ€ž Â£(F)/n > f pp* logpp*dfi. Therefore J pF* log pFdft > J pF* log pF*dn, that is , K(F*,F) < 0, which implies F = F*, by L e m m a 2.2. Since F = F* holds for any convergent subsequence of {F}, the result follows. â€¢ R e m a r k s 1. In the case of mixtures of Poisson distr ibut ions, S imar (1976) obtained the consistency of F wi thout the integrabi l i ty conditions 4 and 5. Simar's proof can also be extended to mixtures of other families of discrete distr ibutions. P fanzagl (1988) also proves a general consistency result for discrete distr ibutions. 2. Jewel l (1982) proved consistency for mixtures of exponential distr ibut ions, by extending Simar's proof. 3. P fanzagl (1988) proved the consistency of F w i t h s l ightly weaker assumptions than Theorem 3.1. O u r proof w i l l y ie ld further results i n the rest of this chapter. 3.3 Approximat ion of mixing distributions In order to discuss the convergence of the constrained m a x i m u m l ikel ihood estimator i n general, we must produce a candidate l i m i t d is tr ibut ion. If the true m i x i n g dis t r ibut ion has 33 three components but we are est imating two, what are we estimating? T h i s question leads us to consider the error i n approximat ion of F* by a d is t r ibut ion w i t h a specified number of components. One measure of this approximat ion error is the K u l l b a c k - L e i b l e r divergence f rom F* to F, w h i c h was defined i n Section 2.3 by for any subdis t r ibut ion funct ion F. N o w let K(F*,Tm) be the K u l l b a c k - L e i b l e r divergence f rom F* to the set Tm, that is , W h e n fpF*logpF*dfi < oo, m i n i m i z i n g K(F*,F) is equivalent to m a x i m i z i n g fpp*logpFdfi and so is entirely analogous to the calculat ion of the m a x i m u m l ikel ihood estimator. T h e only difference is that an average over observed data values has been replaced by integration against the true d is t r ibut ion . In fact, the E M algor i thm could be applied to this new opt imiza t ion problem as wel l . It is clear that , i f the m i n i m u m distance is at tained, i t is attained by a d is t r ibut ion funct ion (that is w i t h to ta l mass 1). T h e following result shows the m i n i m u m distance is atta ined, and that the m i n i m u m distance is s tr ic t ly decreasing in m , at least up to the number of components of F*. Lemma 3.2 // conditions 1, 2, and 4 hold, then for each m > 1, there is a distribution function Fm â‚¬ Tm for which K(F*,F^l) = K(F*,Tm). If, in addition, the identifiability property holds, then for every m > 1, K(F*,Fm+1)<K(F*,Fm) if F* g" Tm. Proof : (i) Let {Fk} be a sequence i n Tm for which K{F*,Fk) -+ K{F*,Tm). B y the Helley selection theorem, there is an F Â£ Fm and a subsequence, along which Fk converges to F K{F*,?m)= in f K(F*,F). 34 at all continuity points of F. Taking limits along this subsequence, Fatou's lemma gives l iminf J pp* log(h/pFk)dfi > J PF* ^og(h/pF)dfx and hence l iminf J pF* log(pF*/pFk)dfi > J pF* \og(pF* /pF)dfi. But this means K(F*,F) = K(F*,Tm). (ii) Assume K(F*,Fm+1) = K(F*,Fm). Then K(F*,F) > K(F*,Fm) for every F G Fm+i; in particular /p,-(rtiÂ°g ( ( 1 ' ( ) V F ; ^ g p M ) ) Mv) < o, * 6 e, e > 0. , Therefore o > mninf lpF.(y)-'oB ( ( 1 - e ) i " J " ( ' ) f *Â»â€¢*>) My) where we have used Fatou's lemma, which applies because I l o g ( ( 1- t^M + ^"-*)) > i i o g f l - Â«) > 21ogl/2, , 6 JS. Â« < 1/2. Now, from /p F *p(y,0) / 'pp^dfi < 1 for every 0, we get J(pF*)2/'pF^dLi < 1 and K(F*,Fm) = JpF* log(pF*/PFgl)dii < j PF*{PF>/PFm - l)dn < 0, which implies Fm = F*, by Lemma 2.2. â€¢ R e m a r k s 1. In the following, Fm will denote any distribution function satisfying K(F*,Fm) = K(F*,Fm). 35 2. T h e proof of part (i) consists of showing the funct ion F â€”â€¢ K(F*,F) is lower semicon-tinuous and that a lower semicontinuous funct ion attains its i n f i m u m over a compact set. Reca l l the analogy we drew i n Section 3.1 between the choice of the number of compo-nents and the general " m o d e l selection" problem. In the choice of a model (a subset Fm), a modelling error is made, as measured by K(F* ,F^). In addi t ion , an estimation error is made i n the est imation of the parameters of F^; this w i l l be considered i n the next section. Essential ly, consistency of the resulting estimator of the m i x i n g dis t r ibut ion (for example the m a x i m u m penalized l ikel ihood estimator) says both of these errors are negligible for large samples. W e next consider the question of whether the model l ing error can be made arbi trar i ly smal l w i t h a large enough number of components. Clear ly , for a finite m i x i n g dis t r ibut ion F*, K(F*,F^) is 0 when m is equal to the number of components of F*. It can also be shown, using the construction described below, that K(F*,F^) approaches 0 i f F* contains a discrete mass at some point , but this is not a very useful fact. However, we w i l l be able to say something about general m i x i n g distr ibutions. W e first construct a sequence {Fk} of finite distr ibut ions, to be used i n the proof of the m a i n result below, for which Fk F*. Let {A\,..., Ap} be a finite par t i t ion of 0 by intervals, that is, each A,- is an interval i n 0 and U^=1Ai = 0 . (Recal l that intervals i n a mul t id imens ional Euc l idean space are products of intervals on the real l ine.) T h e n we can form a finite d is t r ibut ion funct ion on 0 by F = J2ai^6i, where a,- = JA. dF* and , for the posit ive Oi = fA.0dF*(9)/cti. B y replacing 0 by the support of F*, we can assume, without loss of generality, that a l l of the ai are posit ive for any par t i t ion . W e construct a sequence of distr ibutions Fk i n this way, corresponding to a sequence of 36 finite part i t ions {A[ , APk'} having the fol lowing two properties: A . For each k > 1, . . . , AjÂ£j^} is a subpart i t ion of { A J ^ , . . . , A J ^ } , ( that is , each ^( f c+i) j s c o n t a i n e d }n s o m e A \ ^ ) . (k) B . For each 0 6 0 and e > 0, there is a par t i t ion w i t h an element A) which contains 0 and has diameter less t h a n e. (This means the parti t ions get arbi trar i ly fine.) A n example of such a sequence of parti t ions of the posit ive real l ine is given by A ^ = [i2~k,(i + l)2-k), i = -k2k,...,k2k, A ^ , , = ( - o o , * ) , and = [fc + 2 " * , o o ) . T h i s example also yields appropriate part i t ions of any interval of the real l ine , and can be adapted to an interval i n higher dimensional Euc l idean space by forming products of the above par t i t ion intervals. Le t Fk be denoted by X^=i a \ &g{k) where the a\ ' and 9\ ' are determined by the k par t i t ion i n the same way as above. T h e n L e m m a 3.3 If Fk is defined as above, in terms of a sequence of partitions having the properties A and B, then Fk F*. Proof : W e first prove the fol lowing relations: {A â‚¬ 0: A < 0} C l i m i n f A^(0) C l i m s u p A^(0) C {A G 0: A < 0}. (3.6) where l i m i n f A^(0) = U f c o nk>ko A^k\0) and l i m s u p A^(0) = nko U k > k o A<fc)(0). (The Fk(9) Pk where AW(0) = U{A\k): 6(k) < 0}. middle relat ion of (3.6) holds by definition.) 37 N o w , i f A â‚¬ 0 and A < 9, by property B there is a par t i t ion w i t h an element A J * 0 ^ which contains A and every element of which is < 9, that is A\kÂ°^ < 9. T h e n , for every k > ko, by property A there is an element A ^ w i t h A G A^ and A^ C A\kÂ°\ S O A^ < 9 and 0\k) < 9 (using the fact that o\k) must be i n A ^ ) . Therefore A 6 A( f c)(0) for every k > k0, and so the first relation i n (3.6) is proved. N e x t assume A â‚¬ 0 and A ^ 9. In the same way as before, w i t h % replacing < (notice that i n mult id imensional space jÂ£ is not the same as >), it can be shown that A Â£ l i m i n f ( A ( f c ) ( 0 ) ) c . B u t this is equal to ( l i m s u p A^k\9))c and so the f inal relation i n (3.6) holds. N o w , using Fatou's l emma, l i m i n f / dF* â€” l i m i n f / IA(k)in\dF* JAW (9) J A (B> > j l i m i n f IAW(g)dF* = [ dF*, (3.7) and by t a k i n g complements i n this relat ion, we obtain also l i m s u p / dF* < f dF*. (3.8) JA(.K){6) ilim sup A(K) (<?) N o w , for points 9 at which F* is continuous, /{^ ee-A<0} ^ * = /{AeÂ©-A<e} dF*, and i n this case, (3.6), (3.7), and (3.8) i m p l y lim Fk(9) = l i m / dF* = [ dF* = F*(9). JA<.k)(8) J{\ee-.\<e} â€¢ In the next result, the weak convergence of Fk to F* (and the special structure of these approx imat ing d is t r ibut ion functions) w i l l be used to prove the modeUing error converges to 0. 38 L e m m a 3.4 If conditions 1-6 hold, then K(F*,F^) â€”Â» 0 as m â€”Â»â€¢ oo. Proof : We w i l l show K(F*,Fk) â€”â€¢ 0 for the sequence {Fk} of finite distr ibutions constructed above. Let A be a bounded interval of the klh par t i t ion w i t h a = fA dF* > 0 and the closure of A contained i n fi, which exists by condi t ion 3. T h e n , by property A , for every k > k0, U{A\k): A(k) C A} = A and Z{i:A(Â»)cA} a [ k ) = a. Therefore, l Â°gPF f c (2 / ) /a > log (k) {i:A\k)CA} by Jensen's inequality. A n o t h e r applicat ion of Jensen's inequality, this t ime applied to the concave funct ion g(0) = logp(y,0), gives logP(y,e\k)) = g{ j 0dF*/a^) > J g(9)dF*/a^ = J log p(y, 9)dF*/af. Therefore logpFk(y)/a > Â£ [ ( k ) logp(y,e)dF*/a = f logp(y,9)dF*/a. rt\ J A) ' JA {i-.A^cA} ' T h e next step is to show that fAlogp(y,0)dF* is pi?Â»d/i-integrable. In one dimension, because log p(y, 0) is concave, logp(y,9) > mm{logp(y,a),logp(y,b)}, 0 e [a,b]. In part icular this shows logp(y, â€¢) is bounded below on A by the m i n i m u m of the values at the endpoints of A. For two dimensions, A is a rectangle and logp(y, â€¢) is bounded below on A by the m i n i m u m of the values at the four vertices. This argument clearly applies to any number of dimensions. T h u s , using condit ion 5 (and the fact that the vertices of A he i n fi), fAlogp(y,0)dF* is integrable. Th is means {logpFk} is dominated below by an integrable 39 funct ion; but it is also dominated above by condit ion 4. Therefore, using the convergence PFK â€”â€¢ PF', we get / p p * logPF kdfi â€”* j PF* logpp*d^i and the proof is complete. â€¢ 3.4 Consistency of constrained m a x i m u m likelihood esti-mators N o w we consider the consistency of Fm. In order for Fm to be consistent, i t is necessary that Fm be uniquely defined; this means Fm uniquely satisfies the fol lowing two requirements: Fm G Tm and K(F*,Fm) = K(F*,fm). T h i s clearly holds i f f 1 * G Tm and the m-ident i f iabi l i ty property of Section 2.3 holds, but general statements on this question seem very difficult to obtain. Fortunately, this issue is not cr i t ica l for the study of the m a x i m u m penalized l ikel ihood estimator. However the general asymptot ic behaviour of Fm is of interest i n its o w n right and we present the fol lowing result. Theorem 3.2 Assume conditions 1-5 hold. With probability one, every limit F of {Fm} (that is, Fm â€”â€¢ F at all continuity points of F) satisfies K(F*,F) = K(F*,F^). If F^ is unique, then Fm Fm as n â€”* oo, with probability one. Proof : A s i n the proof of Theorem 3.1 we get that , w i t h probabi l i ty one, the l i m i t F of any convergent subsequence of {Fm} must satisfy J pF' log pFdii> J PF' log pFm dpi or, equivalently, K(F*,F) < K(F*,F^). T h e results follow f rom this. (The arguments used i n the proof of Theorem 3.1, when applied to Fm, require / p p * \ \ogppm \ d\i < oo, which follows f rom conditions 4 and 5.) â€¢ T h e above consistency theorem enables us to prove the consistency of the m a x i m u m l ikel ihood estimators of the parameters of a finite m i x i n g d is t r ibut ion , when the number of 40 components is k n o w n (this result was referred to i n Section 2.4). F i rs t we establish in the fol lowing l e m m a the equivalence of weak convergence of m i x i n g distr ibutions and conver-gence of the associated parameters i n the quotient topology considered by Redner (1981). T h e quotient topology is defined relative to the equivalence relation under which two sets of parameters are equivalent i f they define the same m i x i n g d is t r ibut ion . T h e topology of weak convergence has one advantage over the quotient topology, namely the automatic compactness of the parameter space which results from the Helley selection theorem; com-pactness is , of course, very useful for establishing the consistency of estimators (Redner, 1981, assumed a compact parameter space and Hathaway, 1985, used a compactif icat ion device). L e m m a 3.5 Let F* = YJj!=\ aj^Sj where aj > 0 for every j, aj â€” 1, and the 6j are distinct points of Â©. Let {Fk} be an arbitrary sequence in Tm, that is, Fk = Â£â„¢ ! ^ where af] > 0, of] Â£ 0 , j = l , . . . , m . Then Fk F* if and only if (a[k\ ... ,am\o[k\... ,9$) â€”â€¢ (a\,... ,am,6i,... ,9m) in the quotient topology. Proof : Assume Fk F*. T o show the convergence in the quotient topology we must show that , for large k, every parametric representation of Fk lies i n a sphere of radius e around one of the parametric representations of F*. Let Nj be a sphere of radius S centered at Qj, j â€” 1 , . . . , m. If 6 is smal l enough, J V i , . . . , Nm are dis joint, so Nj is a cont inuity set of F* and fN. dFk -* fN. dF* â€” aj. Since aj > 0 and Ylj aj = 1 ; f Â° r large k each of o[k\..., 9$ is contained i n one of N\,... ,Nm; let j' be the index for which 9^ Â£ Nji. T h e n , for large k, 9(k) is w i t h i n 8 oiOy and af] is w i t h i n S of a j >, that is (a[k\... ,a$ ,o[k\... ,0$) is i n a neighbourhood of ( a i > , . . . ,ami,8\i,...,9mi). For the converse, assume the convergence of parameters i n the quotient topology. T h e n , for large k (a[k\.. .,a$,0[k\ .. .,9$) is near (av,... ,ami,9y,.. .,9m>) for some permu-41 tat ion 1 ' , . . . , m ' , depending on k. W e must show m m X > f I{0f < x} ^J2aiH0j <x} = F*(x), 3=1 3=1 for a l l cont inui ty points x of F*. N o w F* is continuous at x i f and only i f no 9j is on the boundary of the set {9 Â£ 0 : 9 < x}. (In one dimension this is more s imply stated as x ^ 9j for every j.) B u t for such x, i f 9^ is near 9y, then I{9^ < x} is near I{9j> < x}; the result follows. â€¢ Theorem 3.3 Let F* = Y7jL\ ajf>0j where ctj > 0 for every j, J2]Li aj â€” 1> and the dj are distinct points ofQ. Assume that the m-identifiability property of Section 2.3 and conditions 1-5 are satisfied. Let &j,9j be maximum likelihood estimators of ctj,9j, j â€” 1 , . . . , m. Then, with probability one, ( & i , . . . , am, Q\,..., 9m) â€”â€¢ ( a i , . . . i &m 5 9\,..., 9m) in the quotient topology, as n â€”> oo. Proof : In this case F^ = F* is unique (see the paragraph before Theorem 3.2). Therefore T h e o r e m 3.2 gives Fm ^ F*, where Fm = Y^j=\aj^e â€¢ T h e conclusion follows by the hypotheses on the parameters, using the above lemma. â€¢ Remark T h e conditions 1-5 s impl i fy greatly i f F* is a finite d is t r ibut ion ; for instance, for the three examples of Section 3.1, a l l of the conditions are met for any F* as given i n Theorem 3.3. 3.5 Consistency of the maximum penalized likelihood esti-mator W e now return to the m a x i m u m penalized l ikel ihood estimator Fjnn, where mn maximizes Â£(Fm) â€” a m n over m. B y combining the results of the previous sections, we can prove Ffnn is consistent; we w i l l need one more pre l iminary result. 42 Lemma 3.6 Assume conditions 1-5 hold. Then, with probability one, (i) l i m ^ o o 1(F)I'n = J pp* log pp* du; and (ii) l i m ^ o o Â£(Fm)/n = f pp. logpF^dfi, for every m > 1. Proof: (i) was proved in the proof of Theorem 3.1. (ii) is proved in the same way (see the proof of Theorem 3.2). â€¢ We now prove the main result. Theorem 3.4 Assume conditions 1-6 and the identifiability property hold. (i) If F* has m* < oo components and for every m < m*, am+i,n > a m n for all n and l imsup n a m n /n = 0, then, with probability one, liminf TIâ€”t-oo mn > m* and Ffnn -* F* as n â€”Â»â€¢ oo. (ii) If F* is not a finite distribution, and for every m < oo, a m + i ( â€ž > a m n for all n and lim supn amn/n = 0, then, with probability one, mn â€”> oo and Fjnn â€”> F*, as n â€”> oo. Proof: (i) First assume F* is a finite distribution with m* components. By Lemma 3.6, since F^* - F*, ]im[Â£(Fm.) - Â£(Fm)]/n = J pp. log(pF. / p p ^ d u = K(F*, F*), for all m < m*, (3.9) with probability one. According to the condition on the rate of growth of amn and the fact that K(F*,Fm) > 0, (3.9) implies Â£(Fm*) - l(Fm) > (am*>n - amn), for all m < m*, for large n, and this means mn > m* for large n. In particular we have proved liminfn mn > m* with probabihty one. By the definition of mn, Vihn) afnnn ^ Â£(F ) â€” am*n. 43 B u t w i t h probabi l i ty one, for large n, we have mn > m* and hence Â£(.Fmâ€ž) > Â£(F*). T h e argument of the proof of Theorem 3.1 can now be used to obtain Ffn^ â€”â€¢ F*. (ii) N o w assume F* is not a finite d is t r ibut ion . B y L e m m a 3.6 we obta in hm[i(Fm+1) -Â£(Fm)]/n = jpF*log(pFm+JpF^)diM = K(F*,Fm)-K(F*,Fm+1), f o r a l l m > l , (3.10) w i t h probabi l i ty one. B u t K(F*,Fm+1) < K(F*,F^) for every m by L e m m a 3.2, and so (3.10) implies Â£(Fm+i) - Â£{Fm) > (am+1>n - amn), for a l l m > 1, for large n. T h i s inequal i ty implies that for each m > 1, for large enough n, mn > m and Â£(Fmn) > Â£(Fm). Therefore we have proved mn â€”> oo w i t h probabi l i ty one; furthermore, we have that , w i t h probabi l i ty one, for every m > 1, Â£(Fmn) > Â£(Fm) holds for large n. Let Fmn converge to F along a subsequence. A s in the proof of Theorem 3.1, Â£(Fmn) > Â£{Fm) for large n implies J PF* logPFd/J, > J pF* logpp^dfi. B u t , by L e m m a 3.4, /PF* logpF^dn â€”Â»â€¢ /p F * logpF*dji as m â€”Â» oo. Therefore F = F* and the proof is complete. â€¢ Since the conditions of the theorem are satisfied for A I C and B I C , we have the fol lowing corollary. Corollary 3.1 Assume conditions 1-6 and the identifiability property hold. If rhn is se-lected by AIC or BIC, then l i m i n f n - c c o > m* (â„¢>n â€”^ oo if F* not finite) and Fmn ^ F*, as n â€”>â€¢ oo, with probability one. AA It is clear f r o m the proof of Theorem 3.4 that a cri terion defined by a sequence amn depending on the sample produces a consistent estimate, provided the conditions on amn hold w i t h probabi l i ty one. Th is provides one potent ial means of proving consistency for C V C , but it is not clear whether the conditions hold for the impl ied amn (see Section 3.1); checking the conditions on a m n may be simplified by the fact that only the cases m < m * must be considered. T h e results of the above theorem state that the estimator m â€ž , in the l i m i t , does not underestimate the number of components of the true m i x i n g d is t r ibut ion , and i n case the true number of components is inf inite , the estimator is consistent in the sense that it con-verges to inf ini ty . T h e overestimation i n the finite case is perhaps not a serious deficiency, especially i f one is main ly interested i n other features of F* aside from the number of sup-port points , because the mass attached to the "ex t ra " components is very smal l for large samples. (Indeed this is necessary for the consistency to hold.) However, the question of whether rhn can be shown to be consistent for a finite number of components (possibly under an addi t iona l condit ion on amn), is worthy of further research. In order to help explain the relationships among the results of this chapter, we summa-rize them i n the fol lowing table. For each general approach to the selection of the number of components (corresponding to the unrestricted m a x i m u m l ikel ihood estimator, the max-i m u m penalized l ikel ihood estimator, and the constrained m a x i m u m l ikel ihood estimator) , we display the consistency results which have been obtained. 45 Number of components conditions Consistency property unrestricted fi A F* selected by maximum penalized likelihood am+l,n ^ amn and l i m s u p n a m n / n = 0 for m < m* only l i m i n f mn > m* fi JE> F* m fixed in advance m > m* fi A F* m = m* m < m* 46 Chapter 4 I m p l e m e n t a t i o n a n d n u m e r i c a l e x a m p l e s f o r m i x t u r e s o f P o i s s o n d i s t r i b u t i o n s In this chapter we describe numerical procedures for the computat ion of the various estima-tors discussed i n Chapter 3. F o r t r a n code wri t ten for this purpose is provided i n A p p e n d i x A . We specialize the procedures to mixtures of Poisson distr ibutions al though they carry over, w i t h simple modificat ions, to mixtures of other families of distr ibutions as wel l . W e also apply the est imation procedures for Poisson mixtures to two examples. 4.1 Computat ional methods for maximization of mixture likelihoods F i r s t we briefly ment ion some of the methods which have been suggested for calculat ing the m a x i m u m l ikel ihood estimator. S imar (1976) proposed an a lgor i thm which alternates between m a x i m i z i n g the l ikel ihood over the component probabilit ies for a fixed support set (the set of points on which the m i x i n g dis tr ibut ion puts mass), and adding to the support set the point for which the direct ional derivative of the l ikel ihood is largest, while deleting a point whenever certain bounds on the number of support points are exceeded. T h e convergence properties of algorithms of this type are studied i n B b h n i n g (1982, 1985). L i n d s a y (1981) suggested a modif icat ion to this strategy, according to which one or more 47 points w i t h large direct ional derivatives are added to the support set, and the E M algor i thm is then used to find a loca l m a x i m u m . A l t h o u g h the above methods offer efficient means of f inding a m a x i m u m l ikel ihood estimate, they cannot be used for f inding the m a x i m u m penalized l ikel ihood estimate, which requires m a x i m i z a t i o n of the l ikel ihood for several different numbers of components. A n outl ine of the strategy we use to accomplish this for Poisson mixtures follows. T h e data are denoted by y\,..., yn, w i t h sample mean y = yi/n, or alternately by Nk, k = 1 , . . . , K, where the dist inct observed counts are x\,..., XK, and Nk is the number of times appears i n the sample. 1. Set m <â€” 1. C o m p u t e Â£(F\) â€” max> Y^k=i Nk{xklogX â€” A) = n(ylogy â€” y). 2. Set ro <- ro + 1. F i n d Â£(Fm) = m a x Â£ Â£ i i Nk l o g ( E ^ i ajXfe'^) over ctj, Xj > o,EÂ«i = i-3. If Â£(Fm) â€” Â£(Fm-\) < e stop; otherwise, go to 2. Step 3 is designed to test for the condit ion Â£(Fm) = Â£(Fm-i); of course, this can only be done w i t h i n machine accuracy. Impl ic i t ly used in this strategy is the fact that i t is not always necessary to search through a l l possible numbers of componentsâ€”if Â£(Fm) = Â£(Fm-i) then Â£(Fm..i) = Â£(F). Th is property can be proved using the method i n the proof of L e m m a 3.2, where a s imilar property is demonstrated for the K u l l b a c k - L e i b l e r divergence ( in fact the argument is s impler for the l ikel ihood funct ion, which is defined i n terms of a finite sum rather than an integral) . A l s o i m p l i c i t l y used is the fact that the l ikel ihood funct ion does not continue to increase indefinitely, and so the above procedure wiU stop after a finite number of iterations; i n fact, i t is not necessary to consider numbers of components larger than K (recall R e m a r k 3 fol lowing Theorem 2.1). 48 W e now consider i n detai l the problem of automatical ly locat ing the m a x i m u m value of the l ike l ihood funct ion for a fixed number of components. W e use the E M algor i thm w i t h several sets of s tar t ing values to locate points w i t h large values of the l ike l ihood, and then choose the one w i t h the largest value. Not ice that , because a sequence of l ikel ihood values generated by the E M a lgor i thm is non-decreasing and bounded, i t must converge. T h i s strategy raises the need of a method for automatical ly generating s tar t ing values for the E M a lgor i thm, which can be expected to ensure that the m a x i m u m value of the l ikel ihood funct ion w i l l then be identi f ied. L a i r d (1978) suggested an equally spaced gr id for the support points , without saying how the choices of the grid spacings and s tar t ing values for the component probabilit ies are to be made. W e w i l l use a different method, to be described next. T h e proposed method of generating s tar t ing values arises f rom the fol lowing simple intu-i t ive approach to the est imation of a m i x i n g d is t r ibut ion . G i v e n the distinct observed counts xi,... ,XKI form m clusters w i t h counts considered to arise f rom the same under lying rate grouped together. A n estimate of the m i x i n g d is t r ibut ion can be obtained f rom the clusters as follows: (i) component probabilit ies are estimated by the relative sizes of the clusters; (ii) component parameters are estimated by a p p l y i n g the m a x i m u m l ikel ihood method to the observations w i t h i n each cluster. T h u s , for the Poisson case, the component rates are obtained by calculat ing means w i t h i n each cluster. These calculations can be considered as an execution of the M-step of the E M algor i thm w i t h the condit ional probabil i t ies iiij (see Section 2.2) replaced by indicators of cluster membership. M c L a c h l a n and Basford (1988, Section 1.7) suggest choosing start ing values i n this way, w i t h clusters generated by a clustering a lgor i thm. A simple way of forming the clusters is v i a disjoint intervals w i t h the observations 49 classified according to which of the intervals they fal l in to . T h e estimate w i l l clearly be very much influenced by the choice of intervals, and the set of a l l possible choices w i l l generate a set of estimates covering a wide range of m i x i n g distr ibutions. W e take this set as our set of s tar t ing values. In the case that one or more zero counts are observed, there w i l l be among the s tar t ing values some w i t h a component having zero rate. T h i s is i m p o r t a n t because it helps to locate the local m a x i m a w i t h a zero rate component (the E M algor i thm preserves the zero rate). Generat ing the set of start ing values presents the task of automat ica l ly forming a l l possible clusters of a set of objects into m disjoint intervals. T h i s operat ion can be performed by using an a lgor i thm for forming a l l possible compositions of N into m parts , that is , compositions of the form N = r\ + â€¢ â€¢ â€¢ + rm w i t h r ; > 0, for w h i c h order counts. Such an a lgor i thm can be found i n Nijenhuis and W i l f (1978, Chapter 5). W e let N be K â€” m and translate a composit ion of N in to a clustering of the N + m dist inct observed counts as follows: s tart ing f rom the smallest count, the first r j + 1 dist inct observed counts are assigned to the first cluster, the next r 2 + 1 are assigned to the second cluster, et cetera. T h e tota l number of operations required to maximize the l ikel ihood for m components w i l l be the number of operations required to generate a l l clusters times the average number for the execution of the E M algor i thm w i t h given s tar t ing values. T h e number of operations required for the execution of the E M algor i thm to f ind one global m a x i m u m depends on m a n y factors, inc luding the choice of s tart ing value, but clearly a potent ia l ly very large number of operations is required to f ind the m a x i m u m penalized l ikel ihood estimator. F o r each of the examples presented next, we report the t ime required for the computations; a l l computations were performed on the U B C Department of Statistics M i c r o v a x II computer. 50 4.2 S i m a r ' s acc ident d a t a T o i l lustrate the m a x i m u m l ikel ihood estimator of a m i x i n g d is t r ibut ion , S imar (1976) uses data on the number of automobile accident claims i n a single year for 9461 policies of L a Royale Beige Insurance Company. T h e frequency d is t r ibut ion follows: number of accident claims 0 1 2 3 4 5 6 7 number of individuals 7840 1317 239 42 14 4 4 1 S imar reports the m a x i m u m l ikel ihood estimator to be the dis t r ibut ion given by the fol lowing parameter values: probabi l i ty rate 0.75997 0.08854 0.23617 0.58020 0.00370 3.17606 0.00016 3.66871 Notice that the mean of this d is t r ibut ion , 0.21665, is not equal to the sample mean 0.21435. A c c o r d i n g to a property of l o c a l m a x i m a first proved by L i n d s a y (1981), this implies that Simar's solut ion cannot be a local m a x i m u m of the l ike l ihood. F o r mixtures of an exponential family i n standard form, a local m a x i m u m must define a mixture dis t r ibut ion w i t h mean value equal to the observed sample mean. L i n d s a y obtained this property f rom a more general result called self-consistency and required that the local m a x i m u m define a point i n the interior of the parameter space. A simple proof without this restrict ion can be obtained by t h i n k i n g of a local m a x i m u m as a fixed point of the E M algor i thm (this was pointed out i n T i t t e r i n g t o n , S m i t h , and M a k o v , 1985, p . 86). A proof for the Poisson case follows f rom the expressions for the updated parameter estimates i n the M-s tep which were given i n the description of the E M algor i thm in Section 2.2: 51 and \ _ E"=i wijyi 3 ' E t i wn ' w h i c h i m p l y since E j ^ 'j = 1- Therefore the results of just one step of the E M algor i thm adjust the mixture d is t r ibut ion to have the correct mean; thus a fixed point ( in part icular a local m a x i m u m of the l ikel ihood) must have this property. Therefore, if the E M algor i thm is executed s tar t ing f rom Simar's solut ion, i t w i l l move to points w i t h higher values of the l ike l ihood. In fact, the l ike l ihood funct ion is maximized overall by a 3-component dis tr ibut ion; we present below the constrained m a x i m u m l ike l i -hood estimates for 1, 2, and 3 components (using 4 components d i d not produce a larger l ikel ihood than the 3-component estimate) and the corresponding values of the log-l ikel ihood and the penalized log-l ikelihoods, calculated using A I C and B I C . N u m b e r of Est imates log-Components probabi l i ty rate l ikel ihood A I C B I C 1 1 0.2144 -5151.38 -5152.38 -5155.96 2 0.9378 0.1469 -5008.56 -5011.56 -5022.29 0.0622 1.2307 3 0.4183 0 -5001.30 -5006.30 -5024.19 0.5730 0.3355 0.0087 2.5450 T h e numbers of E M iterations required were 3345, 16034, and 27502 for 2, 3, and 4 com-ponents, respectively; the total comput ing t ime was 19 minutes. W e see that B I C leads to a choice of two components, while three are indicated by A I C . Simar's solution has log-l ikel ihood equal to â€”5002.13, close to the m a x i m u m value. Interestingly, one of the components of the m a x i m u m l ikel ihood estimate has mean zero. 52 In this example, a component w i t h mean zero might be interpreted as representing the class of persons who hold a driver's license, but who, for various reasons, do not drive . Unfor tunate ly the part icular solution obtained here probably does not represent this class, as it has a mean zero component containing about 42 percent of the popula t ion ! In a case such as the present one, where the penalized l ike l ihood cr i ter ia A I C and B I C do not yield a clear choice of the number of components, this choice might be based on a direct comparison of the fitted frequency distr ibut ions . T h e table below lists the fitted frequencies for the fol lowing estimates: (1) Poisson dis t r ibut ion ; (2) constrained m a x i m u m l ikel ihood estimate w i t h 2 components; (3) m a x i m u m l ikel ihood estimate; and (4) Simar's solut ion. count observed frequency (1) (2) (3) (4) 0 7840 7635.65 7831.90 7839.98 7833.12 1 1317 1636.70 1337.13 1316.95 1313.16 2 239 175.41 212.88 239.06 243.95 3 42 12.53 57.46 42.11 49.60 4 14 0.67 16.58 13.31 12.41 5 4 0.03 4.05 5.87 4.84 6 4 0.00 0.83 2.44 2.28 7 1 0.00 0.15 0.89 1.02 8+ 0 0.00 0.03 0.39 0.62 T h e fitted frequency d is t r ibut ion calculated using the best f i t t ing Poisson dis t r ibut ion makes i t clear that 1 component is certainly not sufficient. T h e 2-component estimate might also be judged inadequate for this data because i t underiits the number of individuals w i t h 6 and more claims, i f these large numbers of claims are considered especially important . T h e 3-component estimate provides a remarkably good fit to the observed d i s t r ibut ion , even considering the relatively large number of parameters used (five) compared w i t h the number of cells i n the d is t r ibut ion . 53 4.3 F e t a l m o v e m e n t d a t a T h e second example concerns the observation of the numbers of movements i n 240 con-secutive 5-second intervals, by a fetal lamb observed through ul trasound (data supplied by Professor D a n i e l W . R u r a k ) . These observations were part of a s tudy of breathing and b o d y movements i n fetal lambs designed to examine the possible changes i n the amount and pattern of fetal ac t iv i ty d u r i n g the last two-thirds of the approximate ly 21 week gestation per iod . These changes may be due to physical factors such as reduct ion i n amniot ic f luid volume and empty space w i t h i n the uterus, or the development of the central nervous sys-tem (see W i t t m a n , R u r a k , and Tay lor , 1984). T h e analyses we w i l l perform i n this section and Section 5.4 provide a description of the pattern of fetal ac t iv i ty at a f ixed point i n the gestation per iod . Here we fit Poisson mixture distr ibutions to the d is t r ibut ion of the n u m -ber of moves, as one means of assessing the pattern of ac t iv i ty . A pure Poisson d is t r ibut ion indicates that the movements occur randomly i n t ime, as the events i n a Poisson process, while a mixture w i t h components having widely vary ing rates indicates that the fetus exists i n different states characterized by high and low levels of ac t iv i ty . T h e latter model suggests that the counts i n adjacent t ime intervals are related in some way; in Chapter 5 we w i l l consider a more general mixture model for these data to capture this feature and remove the assumption of independence. For the present we ignore the order of the counts and use only their frequency dis t r ibu-t ion , presented in the fol lowing table: number of movements 0 1 2 3 4 5 6 7 number of intervals 182 41 12 2 2 0 0 1 T h e results for these d a t a follow. 54 N u m b e r of Est imates log-Components probabi l i ty rate l ikel ihood A I C B I C 1 1 0.3583 -174 .26 -175 .26 -177 .00 2 0.9388 0.2302 -160 .21 -163 .21 -168 .43 0.0612 2.3242 3 0.4380 0 -159 .01 -164 .01 -172 .71 0.5447 0.5320 0.0173 3.9683 4 0.4201 0 -159 .00 -166 .00 -178 .18 0.5462 0.4918 0.0214 1.6615 0.0123 4.4074 T h e numbers of E M iterations required were 983, 9177, 14808, and 12502 for 2, 3, 4, and 5 components, respectively; to ta l comput ing t ime was 13 minutes. A l t h o u g h four components are required to maximize the l ike l ihood, A I C and B I C i n -dicate two is the best choice. A s in the previous example estimates w i t h a mean zero component arise; a zero rate represents a state i n which no movements can occur. C u -riously, again the zero rates belong to the estimates w i t h larger numbers of components (three and four) and account for almost half of the mass. T h e estimates w i t h 2, 3, and 4 components a l l assign between 94 and 98 per-cent of the mass to smal l rates (.532 or less); these rates might be interpreted as representing the normal background rates of movement by the fetus. T h e rest of the mass (between 2 and 6 per-cent) is assigned to rates much larger and these might be interpreted as representing states i n which movement is triggered by s t i m u l i (such as movement by the mother, outside noises, etc.). T h e fol lowing table contains the fitted frequency distr ibutions for the constrained max-i m u m l ikel ihood estimates w i t h 1, 2, and 3 components (columns labelled (1), (2), and (3)), and the m a x i m u m l ikel ihood estimate (column (4)). 55 observed count frequency (1) (2) (3) (4) 0 182 167.72 180.42 182.00 182.00 1 41 60.10 44.54 41.16 41.20 2 12 10.77 8.62 11.48 11.39 3 2 1.29 3.37 2.74 2.85 4 2 0.12 1.77 1.07 1.07 5 0 0.01 0.81 0.67 0.62 6 0 0.00 0.31 0.43 0.40 7 1 0.00 0.10 0.24 0.24 8+ 0 0.00 0.04 0.21 0.24 T h e best f i t t ing Poisson frequency dis t r ibut ion provides a very inadequate fit , and the 2-component estimate provides a large improvement i n f i t , as i n the previous example. In fact, bo th A I C and B I C select only 2 components, a l though there is some further improvement provided by 3 components w h i c h might be thought to be i m p o r t a n t . T h e fitted frequency distr ibutions for the estimates w i t h 3 and 4 components are pract ical ly ident ical , and thus 4 components are definitely more than are necessary. 56 Chapter 5 M a x i m u m l i k e l i h o o d e s t i m a t i o n f o r h i d d e n M a r k o v m o d e l s Consider a sequence of counts Y j , . . . , Yn thought to have some dependence structure. A n example is the sequence of numbers of fetal movements in consecutive time intervals, mod-elled in Section 4.3 as independent and distributed as a Poisson mixture. The mixture model entails the following sampling scheme for the counts: rates A i , . . . , A n are indepen-dently sampled from a distribution F, and conditional on the values obtained, the counts are independently distributed, Yi with the Poisson(A,) distribution. A much more general class of models results (the so-called hidden Markov models) if it is assumed only that A i , . . . , A n form a finite state Markov chain, with stationary transition probabilities. The general hidden Markov model for a sequence of random variables Y\,..., Yn is now described. Let X\,..., Xn be a Markov chain with states 1 , . . . , m, stationary transition probabilities ajk â€” P(Xi â€” j | = k), and initial distribution given by = P(X\ = j). Let / i , . . . , / m (one for each state of the Markov chain) be densities on a common set, and let Y i , . . . , Yn be conditionally independent, given Xi = the conditional density of Yi being fXi. Typically the component densities are /(â€¢, 8i),..., /(â€¢, 0m) for some parametric family {/(â€¢,#) : 0 Â£ Â© } . If the parametric family is the family of Poisson distributions, the model for Y i , . . . , Y n will be called a hidden Markov model with Poisson components. 57 W e now consider briefly the early development of h idden M a r k o v models, their rela-t ionship to other t ime series models ( in part icular , w i t h respect to filtering and smoothing equations), and some areas i n which they have been appl ied. T h e special case of the hidden M a r k o v model i n which the observed variables have finitely m a n y values is commonly referred to as a probabil ist ic funct ion of a M a r k o v chain. F o r this case, the component distr ibutions are m u l t i n o m i a l and the component parameters are probabi l i ty vectors (thus, i n our terminology, i t is a hidden M a r k o v m o d e l w i t h m u l t i n o m i a l components) . A s y m p t o t i c properties of m a x i m u m l ikel ihood estimators i n this model were studied by B a u m and Petr ie (1966) and Petr ie (1969). B a u m and E a g o n (1967) describe an a lgor i thm for locat ing a local m a x i m u m of the l ikel ihood for this model . T h e general h idden M a r k o v model was introduced i n B a u m et al. (1970), where the a lgor i thm of B a u m and E a g o n (1967) was extended to the general model and rederived as an example of a general l ikel ihood m a x i m i z a t i o n a lgor i thm; this general a lgor i thm was later named the E M a lgor i thm by Dempster , L a i r d , and R u b i n (1977). W e give a detailed description of the use of this a lgor i thm for hidden M a r k o v models i n the fol lowing sections. H i d d e n M a r k o v models are closely related to the so-called state-space models. A com-m o n model of this type is the l inear state-space model : Xn = FXn-i + Vn, Yn = HXn + Wn, where Xn is an unobserved "state" variable w h i c h determines the d is t r ibut ion of the obser-vat ion Yn and Vn and Wn are independent sequences of independent, ident ical ly distr ibuted variables regarded as "noise" . In many applications, the goal is " reconstruct ion" of a value Xn based on an observation set Y\,... , Y;v; this can take the more specific form of filteringif n = N ( in which case the calculations are usual ly done i n "real - t ime") ; smoothing i f n < N; 58 or prediction if n > N. In the classical model , w i t h n o r m a l errors, the reconstruction is performed using the K a l m a n filter. Recently, the analysis of non-normal and non-linear state-space models has been considered; for example, K i t a g a w a (1987) gives recursive equa-tions for f i l ter ing, smoothing, and predict ion w h i c h are val id quite generally (see K o h n and Ansley , 1987). T h e key elements seem to be a state process w h i c h is M a r k o v and an ob-servation sequence constructed f rom a condit ional ly independent sequence, given the state process. T h u s we find overlap w i t h hidden M a r k o v models. Reconstruct ion has also been a pr ime concern i n the study of hidden M a r k o v models. T h e forward-backward algorithm contained i n the iterative l ikel ihood m a x i m i z a t i o n a lgor i thm of B a u m et al. (1970) can be used for reconstruction of the under lying M a r k o v chain ( w i t h a minor adjustment; see Sec-t ion 5.2). A l s o , versions of the smoothing and f i l tering equations of K i t a g a w a (1987) were derived i n A s k a r and D e r i n (1981) and L i n d g r e n (1978) for hidden M a r k o v models. F i n a l l y , we mention a different generalization of K a l m a n f i l ter ing, namely, Bayesian forecasting i n the dynamic generalized linear model of West , Harr i son , and M i g o n (1985). H i d d e n M a r k o v models have been applied i n a diverse collection of fields, of w h i c h we ment ion two. C h u r c h i l l (1989) applies the 2-component model to sequences of bases of a D N A molecule and , by smoothing the under lying M a r k o v chain , indicates regions which are r i ch i n certain bases. H i d d e n M a r k o v models have been extensively applied to automatic speech recognition, serving as a model for a speech generating source (see, for example, Levinson , Rabiner , and Sondhi , 1983). In the next section, we describe the a lgor i thm of B a u m et al. (1970) for the calcu-la t ion of m a x i m u m l ikel ihood estimates i n hidden M a r k o v models. Fo l lowing that , the forward-backward a lgor i thm for comput ing the condit ional probabilit ies ar is ing in the i m -plementat ion of the E M a lgor i thm is given i n Section 5.2. Because of the possibi l i ty of 59 m a n y local m a x i m a of the l ikel ihood funct ion, the E M algor i thm must be started f rom several points i n the parameter space; the automatic choice of s tar t ing values is considered i n Section 5.3. F i n a l l y , the methods are applied to the fetal movement data i n Section 5.4. 5.1 Calculation of m a x i m u m likelihood estimates using the E M algorithm O u r s tudy of m a x i m u m l ikel ihood est imation w i l l focus on the h idden M a r k o v model w i t h a f ixed fami ly of component distr ibutions and a fixed number of components. In practice, m a x i m u m l ike l ihood estimates for a few values of m might be calculated, and A I C , B I C , or some other penalized l ikel ihood cri ter ion (see Section 3.1) might be used as a guide for the selection of m . Define the parameter <f> by the vector w i t h coordinates a u , a i 2 , . . . ,amm, 0 j , . . . ,0m. T h e i n i t i a l probabil i t ies are not inc luded, a l though the est imation method to be described can be extended in a fair ly obvious way to accommodate them. C o m p a r e d to the other parameters, there is very l i t t le informat ion available to estimate the i n i t i a l probabil i t ies , and so we w i l l set them a l l equal to l / m . A l i m i t theorem given i n Chapter 6 provides some just i f icat ion for this, showing that the effect of the i n i t i a l probabilit ies (if they are a l l positive) vanishes as the number of observations increases, but determinat ion of the sensit ivity of the estimates to this choice seems prudent i n applications to short series. T h e l ikel ihood funct ion for the parameter <j> is m m n i ( > | y i , . . . , y â€ž ) = Â£ â€¢ â€¢ â€¢ Â£ a^f(yueXl(cp))Y[aXx_^xmyuOxm, x\â€”\ xn=\ i=2 and the m a x i m u m l ike l ihood estimate <^> is the value of <j> which maximizes L(<f> \ yi,..., yn) over $ = { ( a i i , a i 2 , . . . ,amm,6i,... ,6m) :a3k. > 0, Ylk ajk = 1> 0j G & } . M a n y extensions of this model present no addi t ional difficulties for parameter est imat ion, i n c l u d i n g models 60 w i t h some transi t ion probabil i t ies set to zero or set to a c o m m o n value. D u e to the form of the l ikel ihood funct ion, the computat ion of m a x i m u m l ikel ihood estimates seems to be a difficult task i n any case. However, the E M a lgor i thm can be applied to this problem without much more effort than for the case of independent observations f rom a mixture d is t r ibut ion , discussed i n Section 2.2. T o describe the E M algor i thm i n this s i tuat ion, we begin by displaying the complete data l ike l ihood, that is, the l ikel ihood funct ion for the observations (xi, yi), i = 1,... ,n (<f> is suppressed for ease of notat ion) : \o&Lc(<t> | xx,yi,... ,xn,yn) n = log{a jj? f(yi, 0*i) II .*Â« f(y*> 6*<)} 4=2 n n = log aj j j + Y l o S axi-x + Â£ l o g f(vÂ» Â°xi) i=2 t'=l Tl m 771 71 771 = l o g a g ) + Â£ Â£ Â£ Vjk{i) log a j k + J2 Â£ l o Â§ f(Vi, ^ i ) . Â»'=2 j=l fc=l 4=1 j=l where Uj(i) is 1 i f a;,- = j and 0 otherwise, and Vjk(i) is 1 i f a transi t ion f rom j to k occurred at i (i.e. Xi-\ = j, Xi = k) and 0 otherwise. T h i s log- l ikel ihood funct ion consists of two parts, namely, the log-l ikel ihood for a M a r k o v chain , depending only on the transi t ion probabil i t ies , and the log-l ikel ihood for independent observations, depending only on 0\,... ,6m. T h e m a x i m i z a t i o n of the complete-data l ikel ihood can be done i n closed form i n most applicat ions. T h e values of the transi t ion probabil i t ies w h i c h maximize Lc are easily seen to be a3k=^^flyj,k = l,...,rn. (5.1) Ei=2 2^1 vjM These equations are analogous to equations given in Section 2.2 for the m i x i n g proportions i n a mixture d is t r ibut ion . T h e m a x i m i z i n g value of 6 w i l l usuaUy be found by solving the 61 fol lowing set of equations: n Â£ t t j ( * ) 0 1 Â° g / ( y . - . f y ) / 0 0 j = 0, j = l , . . . , m . 1=1 These equations can be solved i n closed form i f / has the exponential family form; for example, i f f(y, 0) is a Poisson density w i t h mean 6, then the values of the 0j which maximize the complete-data l ike l ihood are given by the fol lowing weighted averages: ~ EEWO ' ( 5 - 2 ) These equations are analogous to equations for the rates i n a Poisson mixture d is t r ibut ion i n the context of independent data (see Section 2.2). W e w i l l use (5.1) and (5.2) i n the remainder of our description of the E M algor i thm, for concreteness, a l though many other cases are handled wi thout any addi t ional difficulty. T h e M-s tep of the E M algor i thm requires m a x i m i z a t i o n of the condi t ional mean E{\ogL%4>)\yu...,yn}. T h i s is easily accomplished by replacing the components of the missing data which appear i n (5.1) and (5.2) by their condit ional means, namely vjk(i) = E{vjk(i) I yi,...,yn} = -P{^,-i = j, Xt = k \ yi,...,yn}. and Uj(i) = E{uj(i) I j / i , . . . ,yn} = P{Xi = j \ yi,.. .,yn}-W e leave the details of the computat ion of these condi t ional means to the next section. W e can now describe the E M algor i thm expl ic i t ly . E M a l g o r i t h m : (hidden M a r k o v model w i t h Poisson components) 1. Choose <t>Â° = (aÂ°n, a?2,..., aÂ°mm, 6$,..., 0Â° ). S ^ af] = l / m , j = 1 , . . . , m and u <- 0. 62 2. Repeat u n t i l a l l coordinates of <f>w+1 â€” <f>w are smal l : E-step. (i) F o r i = 1 , . . . , n, j â€” 1 , . . . , ro, compute u 1 (i) = P<t>Â»{Xi = j \ th,.. â€¢ ,yn}-(ii) For i = 2 , . . . , n, j, k = 1 , . . . , ro, compute % ( 0 = P^{Xi-i = j,Xi = k\ y1,..., yn}. M - s t e p . (i) For j, k = 1 , . . . , ro, compute 71 71 7Tb ^ + 1 = E ^ ( 0 / E E Â« ) -t=2 i=2 Z=l (ii) For j = 1 , . . . , ro, compute i=l i=l Set w Â«â€” a; + 1. T h e sequence {4>w} produces an increasing sequence of l ike l ihood values. T h i s fact can be proved i n the same way as i n the independent observation case; i n fact the first proof of this property of the E M algor i thm ( B a u m et a/., 1970) inc luded the hidden M a r k o v model as a special case (the proof uses the argument of (2.6) i n Section 2.2). 5.2 T h e forward-backward algorithm In this section we present the forward-backward a lgor i thm of B a u m et al. (1970) for calcu-la t ing the condit ional probabil i t ies Uj(i) = P^>{Xi = j \ y\,..., yn} and 63 Vjk(i) = P${Xi-\ = j, Xi = k | 2/1,.. . , y n } . T h e computat ional problem which arises i n the implementat ion of this a lgor i thm is described and a simple adjustment to overcome this problem is given. T h e parameter value <f> is fixed throughout this section and w i l l be sup-pressed to keep the notat ion simple. T h e calculations are accomplished by first expressing these condit ional probabilit ies i n terms of more basic probabilit ies for w h i c h simple recursive formulae exist. W e w i l l use quantities defined i n terms of joint densities of subsets of the Yi and the Xi, for example, m m m m p ( y u . . . , y n , X t = j) = Â£â€¢â€¢â€¢ Â£ Â£ â€¢â€¢â€¢Â£ a % , B X l ) Â£1=1 I i - F l l i + l = l xn=\ iâ€” 1 n J[c(Xh_uXhf(yh,exh)aXi_ujf(yi,ej) JJ aXh_uXhf(yh,9Xh); h=2 h=i+l other expressions of this type are defined s imilar ly . (In case the Yi are discrete random variables there is an obvious meaning as the probabi l i ty of an event.) N o w , by definit ion, Wj(i) = p(yi,...,yn,Xi = j)/p(yi,...,yn) = p(yi,...,yn,Xi = j)lYjP(y^---iyn,Xi = k) k and we have the fol lowing decomposit ion: P{yi,--- ,yn,Xi - j) = p ( y 1 , . . . , y i , X i = j ) p ( y i + i , . . . , y n | yx ...,yi,Xi = j) = â€¢â€¢â€¢,yi,Xi = j)p(yi+i,...,yn\ Xi = j) = aj(0&j(*)> where a3{i) = p{yu Xt = j) for * = 2 , . . . , n, a , ( l ) = f{yi,9j), &?'(0 = PiVi+i, â€¢ â€¢ â€¢ > Vn I X, = j) for i = 1 , . . . , n - 1, and 6j(n) = 1. F o r vjk(i), we have fy*(0 = p(yi>â€¢â€¢â€¢>Â»Â»Â», ^ - i = i, -^i = , â€¢ â€¢ â€¢ > yn) 64 P(yi,-- â€¢ , yn , -Xt - i = j,Xi = k) = p(j/x,.. X , _ i = j)ctjkp(yi,...,yn \Xi = k) - <Xjkf(yi, 9k)a3{i - l)bk(i). The fact that a,j(i) and 6j(i) can be calculated recursively in i (see below) allows efficient calculation of and Vjk(i). The following recursive formulae for a,j(i) and bj(i) hold: m a i ( 0 = Â£ â€¢ â€¢â€¢â€¢Â»%, = k,Xi= j) k=l m = Â£ P(Â»iÂ»â€¢ â€¢ â€¢ Â» J / i - i Â» = k)akjf(yi,6j) (5.3) fc=l m = i)akjf(yi,9j) fc=i and m &?(0 = Â£ ,---,yn, xi+1 = k\Xi = j) m = Â£ajfc/(yf+i,*fc)p(2/if2,---,Â»n I JQ+i = *) (5.4) fc=l m k=i The ay(i) are computed by initially setting fflj(l) = f(yi,&j), j = 1,... ,m (recall that a^p are the initial probabilities for the Markov chain X), while the bj(i) are computed by working backwards, starting with bj(n) = 1, j = 1 , . . . , m. The E-step of the E M algorithm is given in detail below to show how the recursive equations are applied. E-s tep of the E M a l g o r i t h m : 1. Compute aj(l) = f(yi,0j), j - 1 , . . . ,ro, and for i = 2 , . . . , n , j = 1 , . . . , m , m a j ( 0 = Â£ ak{i ~ l)a*j7(Â»ii 65 2. Set bj(n) = 1, j = 1 , . . . , m , and , for i = n - 1 , . . . , 1, j = 1 , . . . , m , compute TO = Â£ajfc/(2/i+iA)&fc(* + !)â€¢ 3. For i = 1 , . . . , n , j = 1 , . . . , m , compute Â«i(0 = *j(0*i(0/]Â£Â«*(Â»)â€¢ 4. F o r i â€” 2 , . . . , n, j = 1 , . . . , m , compute m UjfcO') = ajkf(yi,0k)aj(i - l)bk(i)/ Â£ a , j ( n ) . j=i In theory, by performing the E - and M-steps u n t i l some convergence cri terion is met, i t is possible to locate a local m a x i m u m of the l ikel ihood funct ion. However, for many situations the a lgor i thm cannot be implemented as formulated above because a,j(i) converges rapidly to 0 or oo as i increases, thus m a k i n g i t impossible to calculate and store long sequences. Intuit ively ( th inking of the case i n which Yi is discrete), the reason is that aj(i) is the probabi l i ty of a sequence of events, increasing i n length as the calculations proceed. T h e fo l lowing l i m i t result, useful for analyz ing the asymptot ic behaviour of the m a x i m u m l ikel ihood estimator, also has direct relevance here. F o r almost a l l sequences (that is wi th probabi l i ty one), p(yi,...,?/,) = a j(0 1S approximately equal to exp(â€” iH) for large i, where H is a constant called the entropy of the process, and so Yl]Li a j ( z ) converges to 0 or oo at an exponential rate (see Chapter 6). T y p i c a l l y , a l l of ai(i),... ,am(i) are about the same order of magnitude, so the same conclusion applies to each aj(i) ind iv idua l ly . A p p l y i n g the logar i thm would produce a sequence which "blows u p " (converges to Â± o o ) at a more manageable (linear) rate, but this is not helpful , as the logar i thm does not lead to any s impli f icat ion of aj(i). 66 Thus some adjustment to the E-step is required i n order to make i t generally useful. Before we give our adjustment, we briefly discuss other proposals f rom the field of speech recognition. Levinson et al. (1983), on heuristic grounds, suggested repeated scaling of a,j(i) and bj(i) as the recursive calculations are performed, either at each step or periodical ly whenever it is necessary. Devi jver (1985) showed this scaling amounts to the recursive calculat ion of quantities related to aj(i) and bj(i); i n part icular , the scaling replaces aj(i) w i t h a,j(i) = P{Xi = j \ yi,... ,yi}. A n o t h e r related a lgor i thm is given i n A s k a r and D e r i n (1981) and also discussed i n Devi jver (1985). O u r s imple procedure for solving the computat ional problem consists of m u l t i p l y i n g aj(i) and b3(i) by a constant so they stay w i t h i n a prescribed range. A c c o r d i n g to the result quoted above, the constant eH would work; since J2j aj(0 exp(iH) â€”â€¢ 1, the problem wo uld be solved by m u l t i p l y i n g aj(i) by eH at each i terat ion. T h i s procedure cannot be implemented of course, since H is u n k n o w n . T h u s , for each i, we determine the order of magnitude of Â£ j aj(i) (the integer p for which 1CT P J2j hes between 0.1 and 1.0) and m u l t i p l y aj(i), j â€” 1 , . . . , m , by 1 0 _ p . In the end, a3(i) can be reconstructed for the purpose of comput ing v,j(i) and Vjk(i). To make this clear, the E-step is repeated below w i t h these adjustments. E-step of the E M a l g o r i t h m (adjusted) : 1. Set a j ( l ) = cSpf(yi,6j), j = l , . . . , m , and A ( l ) = 0. For i = 2,...,n, do the fol lowing: (i) C o m p u t e Oj(t') = Â£ â„¢ = 1 ak(i - l)akjf(yÂ» 9j), j = 1 , . . . , m . (ii) Set p equal to the order of magnitude of Ej aj(0? ^(0 = M} ~ 1) + P; a n ( i aj(i) = 10~paj(i), j = 1 , . . . , m . 67 2. Set bj(n) = 1, j = 1 , , . . , r a , and I?(n) = 0. For i = n â€” 1 , . . . , 1, do the fol lowing: (i) C o m p u t e bj(i) = Ylk=i Ujkf(yi+\,Qk)bk{i + 1), i = 1 , . . . , m. (ii) Set j> equal to the order of magnitude of J2jbj{i), B(i) = B(i + 1) + p, and b3(i) = l O - P f t ^ t ) . 3 = 1, â€¢â€¢â€¢,"Â». 3. F o r i = 1 , . . . , n compute e(i) = A ( i ) + B(i) â€” A ( n ) and &,â€¢(*) = 1 0 e W a j ( i ) 6 , - ( i ) / Â£ o j ( n ) , i = l , . . . , m . 4. F o r i = 2 , . . . , n compute e(i) = A(i â€” 1) + 5 ( i ) - A ( n ) and m Vj j t ( i ) = I0e^ajkf(yi,ek)aj(i - l)bk(i)/ Â£ a.j(n), j, k = 1 , . . . , m . 5.3 Starting values for the E M algorithm T h e choice of s tart ing values is part icular ly important for h idden M a r k o v models because of the large number of parameters considered ( m 2 for m Poisson components) and the tendency for there to exist m a n y local m a x i m a of the l ikel ihood funct ion. W e w i l l present a procedure for automatical ly generating " g o o d " s tar t ing values. In our experience, this procedure has performed well i n the sense that the s tar t ing values generated have seemed to be sufficient to locate the global m a x i m u m of the l ikel ihood (addi t ional explorat ion of the l ikel ihood surface d i d not find a larger l ike l ihood value). T h e strategy is basically the same as the one used for the generation of s tar t ing values for mix ture distr ibutions i n Chapter 4. We enter the E M a lgor i thm at the M - s t e p , w i t h a set of values for Vjk(i) (notice Uj(i) is then determined). E a c h par t i t ion of the observed dist inct counts i n yi,...,yn yields the " e m p i r i c a l " values Vjk(i) = 1 i f yi-i is i n the jth 68 component of the partition and yi is in the kth, and Vjk(i) = 0 otherwise. However, these values tend to generate many zero transition probabilities, which the E M algorithm pre-serves; consequently, only a subspace of the parameter space will be explored. Another set of starting values to be considered results from assuming an independence model and setting Vjk(i) â€” ctjoik, where ctj is the proportion of observations in the j t h partition compo-nent. However, these starting values also confine the search to a subspace because the E M algorithm also preserves independence. We generate three sets of starting values from each partition by setting vjk(i) to be convex combinations of the above two values. Different values of the weight in the convex combination (.1, .5, and .9) produce starting values with very different features, thus aUowing a large region of the parameter space to be covered. Fortran code for the implementation of the E M algorithm which we have described in this chapter is given in Appendix B . 5.4 A p p l i c a t i o n to fetal m o v e m e n t d a t a Hidden Markov models provide a means of fitting a model to the sequence of counts of fetal movements which takes into account the belief that observations in close time intervals will be related in some way. In fact, these models seem ideally suited for the interpretation of these data. A t any given time, we hypothesize, the fetus exists in one of several discrete states with different rates of activity (that is, movement). A one-state model indicates a totally random occurrence of movements in time, as in a Poisson process. A model with more than one state specifies a mixed Poisson distribution for the marginal distribution of the number of movements in a fixed time interval, but allows dependence between these numbers. Thus hidden Markov models provide a rich class of fairly realistic models for the pattern of movements; the application of mixture models to these data and their use in 69 studies of fetal development are discussed i n Section 4.3. There appear to be no strong arguments i n favor of a part icular number of components, so we consider a l l values of ro up to four, the number which gave the m a x i m u m l ikel ihood value in the mixture model of Section 4.3. T h e implementat ion of the E M a lgor i thm de-scribed i n the previous sections produced the results i n the table below. T h e cr i ter ia A I C and B I C might be used as a rough guide for selecting the number of components and are included below, but i t seems reasonable to also use interpretive ab i l i ty as the cr i ter ia for model selection. Est imates log-ro rate transi t ion probabilit ies l ikel ihood A I C B I C 1 0.3583 -174 .26 -175 .26 -177 .00 2 0.2555 3.0766 0.9883 0.3110 0.0117 0.6890 -151 .38 -155 .38 -162 .34 3 0.0398 0.4937 3.4106 0.9487 0.0400 0.1848 0.0409 0.9600 0. 0.0104 0. 0.8152 -140 .08 -149 .08 -164 .74 4 0. 0.1790 0.5641 3.3714 0.9759 0. 0.0242 0. 0.0241 0.9671 0. 0.1907 0. 0.0200 0.9758 0. 0. 0.0128 0. 0.8093 -136 .24 -152 .24 -180 .09 T h e computations required 514 E M iterations and 3 minutes of C P U t ime for two compo-nents, 2310 iterations and 28 minutes for three components, and 8679 iterations and 170 minutes for four components. T h e movement rates of the two-component estimate differ by more than a factor of ten and so might represent a n o r m a l state, i n which movements occur at a background rate, and an excited state i n which movements occur at a m u c h larger rate. T h e transi t ion probabilit ies support this interpretat ion, w i t h the average wai t ing times of approximately 85 t ime units (a unit is five seconds) i n the normal state and 3 units i n the excited state. 70 T h e three-component estimate yields a s imilar interpretat ion, w i t h a h igh rate about the same as for two components, a moderate, and a low rate. Transit ions between the states w i t h the two largest rates do not occur, so these states might be interpreted as two excited states. W h e n the background-rate state is exited, a t ransi t ion to the moderate-rate state occurs w i t h 4 times the frequency as the high-rate state. T h e average wai t ing times are 5 units for the high-rate state and 19 and 25 for the other two. B o t h the two- and three-component estimates have rates which are quite close to those of the estimated two- and three-component mixture distr ibutions obtained i n Section 4.3. T h e stat ionary distr ibutions of the estimated stochastic m a t r i x , namely, (.964, .036) for m = 2 and (.481, .492, .027) for m = 3, are also quite s imilar to the corresponding m i x i n g proport ions. T h e hidden M a r k o v models provide a much improved fit over the corresponding independence models, as indicated by the log-l ikel ihood values; out of a l l of the models considered here and i n Section 4.3, A I C selects the hidden M a r k o v model w i t h 3 components and B I C selects the hidden M a r k o v model w i t h 2 components. 71 Chapter 6 C o n s i s t e n c y o f m a x i m u m l i k e l i h o o d e s t i m a t o r s f o r h i d d e n M a r k o v m o d e l s The previous chapter described a means of finding the maximum likelihood estimate for hidden Markov models; in this chapter the large sample behaviour of the estimator is examined. Limit theorems and other results are developed, which may be of independent interest. The only known previous work on this problem is Baum and Petrie (1966) and Petrie (1969), who consider the case that the observed process has finitely many values. Lindgren (1978) proved that the maximum likelihood estimators for the model which assumes the Y,-are independent with a common finite mixture distribution are consistent, under the hidden Markov model, for the stationary distribution of {Yi}. For an illustration of this theorem, refer back to Section 5.4, where the estimate of the stationary distribution obtained from the hidden Markov model parameter estimates was found to be quite similar to the estimate of the mixing distribution obtained in Section 4.3. Assume {Yi} follows a hidden Markov model with a stationary irreducible hidden Markov chain {Xi}, so that {Y,} is a stationary process. Stationarity can be imposed either by assuming the initial distribution of {Xi} is a stationary distribution or by postulating doubly infinite processes {X,-}^, and { Y , } ^ , . 72 A s i n Chapter 5, {Yi} is a sequence of condit ional ly independent random variables, given a real ization {x^ of the M a r k o v chain, and the condit ional density of Yi is f(-,0Xi); Xi takes values i n { 1 , . . . , m} and {/(â€¢, 0) : 9 Â£ 0 } is a fami ly of densities on the real line w i t h respect to a measure a. T h e extension to mul t i -d imens ional observations is easily made, but for notat ional convenience, we assume Yi is real-valued. W e parameterize the characteristics of the hidden M a r k o v model using a parameter <f>, which belongs to a param-eter space <fr, a subset of Rp; thus we have transi t ion probabil i t ies ajk(<f>), j, k = 1 , . . . , m, and component parameters 9j(<l>) 6 Â©, j,â€” l , . . . , m . T h e i n i t i a l probabi l i ty d is t r ibu-t ion used is any probabi l i ty vector { a ^ } w i t h > 0, not necessarily the stationary probabi l i ty d is t r ibut ion corresponding to the stochastic m a t r i x [ctjk(4>))- T h e usual case is 4> â€” ( < * i i , a i 2 , â€¢ â€¢ â€¢ , Â« m m ^ i i â€¢ â€¢ â€¢ ,#771)5 where otjk(-) and 6j(-) are coordinate projections. T h e asymptot ic properties of m a x i m u m l ikel ihood estimators for the general parameter wiU be developed. T h e l ikel ihood funct ion for observations y\,..., yn is Pn(yi, â€¢ â€¢ â€¢,Vn]4>) = Â£ â€¢ â€¢ â€¢ Â£ a ^ f ( y i , 0 X l O ) ) JJ aXi_liXi(cj>)f(yi,0Xi(4>)), and a m a x i m u m l ike l ihood estimate is defined to be a point <j>n at w h i c h pn achieves its m a x i m u m value over <&. Not ice that the i n i t i a l probabi l i ty d is t r ibut ion used i n the l ikel ihood funct ion is not necessarily the stationary probabi l i ty d is t r ibut ion ; the definit ion of 4>n is the one used i n Chapter 5, where the i n i t i a l probabil i t ies were fixed i n advance before the estimates were computed. It turns out that m a x i m u m l ikel ihood estimators have the same asymptot ic behaviour for any choice of (positive) i n i t i a l probabil i t ies . A n outl ine of the rest of this chapter follows. In Section 6.1 the conditions to be used are collected together and the parameter space is compactif ied. In Section 6.2, an equivalence relation is defined on the parameter space and an ident i f iabi l i ty property for 73 the parameters of the general hidden M a r k o v model is given. O u r m a i n tools for establishing l i m i t theorems w i l l be the ergodic theorem (see Bi l l ingsley, 1965, or K a r l i n and Taylor , 1975) and the (more general) subaddit ive ergodic theorem of K i n g m a n (1976); these theorems make use of the ergodic property, the subject of Section 6.3. We also rely on the cross-entropy funct ion , w h i c h is analogous to the K u l l b a c k - L e i b l e r divergence used i n Chapter 3 for mix ture distr ibutions. One of the ingredients of the cross-entropy is the entropy of the process {Yn}', this is defined in Section 6.4, where an extension of the S h a n n o n - M c M i l l a n -B r e i m a n Theorem (Bil l ingsley, 1965) to hidden M a r k o v models is proved. (This part icular result is not needed for the further developments i n succeeding sections.) In Section 6.5, the cross-entropy is introduced and a result more general than the S h a n n o n - M c M i l l a n - B r e i m a n Theorem is proved. T h e cr i t ica l property of the cross-entropy, namely, that i t is posit ive for two different parameter points , is proved i n Section 6.6. F i n a l l y , the m a i n result on the consistency of a sequence of m a x i m u m l ikel ihood estimators is presented i n Section 6.7. 6.1 Conditions W e w i l l assume the stochastic matr ix [ajk(<f>o)], for the true parameter value >^o> is irre-ducible . T h i s assumption excludes the possibi l i ty that { X , } enters a transient state, i n which case informat ion on parameters associated w i t h other states would stop accumulat-i n g . T h e entire set of conditions we w i l l use i n this chapter follows: 1. T h e stochastic matr ix [c<jk(<f>o)] is irreducible. 2- {f(y,6): & G 0} is m-identif iable (see Section 2.3). 3. F o r each y, / ( y , â€¢) is continuous and vanishes at inf ini ty (see C o r o l l a r y 2.1). 4. For each j, k, a^ ( - ) and Oj(-) are continuous. 74 5. ^ 0 [ | l o g / ( y 1 , ^ 0 ) ) | ] < o o , j = l,...,m. 6. For every 9 Â£ 0 , Â£ 0 j s u p | | e , _ e | | < 5 ( l o g / ( Y i , 0 ' ) ) + ] < Â° Â° , for some 6 > 0, (|| â€¢ || is E u c l i d e a n distance and ar + = max{a;,0}). E x a m p l e L e t {f(y,0): 9 Â£ 0 } be a s tandard exponential family , that is , \ogf(y,9) = y9-b{9). W e consider the conditions 2, 3, 5, and 6, which do not depend on the part icular param-eterization. For the common examples (Poisson, b inomia l , normal , exponential) , f(y,-) is continuous and vanishes at inf inity, and the integrabi l i ty conditions hold . Identif iabil i ty holds for any m for the first three listed examples and for some values of m for the b inomia l (see Section 2.3). Before developing the required probabil ist ic tools for the proof of consistency, we "com-pact i fy" the parameter space $ by adding to i t l imits of C a u c h y sequences and denote the resulting space $ c (see Kiefer and Wol fowi tz , 1956, where this device was first used in the context of m a x i m u m l ikel ihood estimation). T o explain this concept, we expl ic i t ly describe the new parameter space i n the case that $ = {{au,ai2, â€¢ â€¢ â€¢ ,ctmm,9i,... ,9m): a j k > 0, Â£ a j f c = 1, 9j Â£ 0 } . k Denote by 0 C the one-point compactif ication of 0 , obtained by at taching to 0 a point denoted oo, and extend f(y,-) to 0 C by defining / ( y , o o ) = 0 (for example, i f / ( y , - ) is the Poisson density w i t h mean 9, then 0 C = [0,oo]). T h e compactif ied space $ c is then { ( a n , "12, . . . , a m m , 9m): a } k > 0, Â£ a j f c = 1, 9j Â£ 0 C } . (6.1) k 75 For the general parameterizat ion, (an(<f>), a 1 2(0), . . . , a f f l m ( f t ^ i ( f t . . . , 0m(<f>)) w i l l s t i l l belong to the set i n (6.1) for a l l parameter values after compacti f icat ion. C o n d i t i o n 3 ensures that f(y, â€¢) is continuous on a l l of 0 C ; also, the cont inui ty of #(â€¢) and ctjk(-) extends to $ c . 6.2 Identifiability It is clear f rom the s i tuat ion for mixture distr ibutions that identi f iabi l i ty of the parameters is too much to demand, at least for the parameterizat ion <j> = ( a n , a i 2 , . . â€¢ , c t m m , 0 i , . . . ,6m). Just as for m i x t u r e distr ibut ions, the component indices are arbi trary; by permut ing the rows of the stochastic m a t r i x [ajk] and a p p l y i n g the same permutat ion to . . , 0 m , the law of {Oxi} remains unchanged and hence so also does the law of {Y,} (see Section 2.4 for other ways i n which the parameters of mixture distr ibutions can fa i l to be identifiable). Define an equivalence relation ~ on $ c , whereby fa ~ fa i f and only i f fa and fa define the same law for {0x{}- Let <f> denote the equivalence class to which <f> belongs. Not ice that the law of {0Xi} is determined by the i n i t i a l d is t r ibut ion of {Xi} and that there may be more t h a n one i n i t i a l d i s t r ibut ion for w h i c h the process {Xi} is stationary. T o accommodate such parameters, we extend the definit ion of equivalence to allow somewhat arbi t rary choices of i n i t i a l distr ibutions for X; more precisely, fa ~ fa i f and only i f there are i n i t i a l probabi l i ty distr ibutions a^1 and a ^ 2 such that the fol lowing holds: (i) {0Xi{<f>i)} is a stationary process, where {Xi} has transi t ion probabilit ies ajk(fa) and i n i t i a l d is t r ibut ion a^1; (ii) {Qxi(fa)} is a stationary process, where {Xi} has transi t ion probabil i t ies atjk(fa) and i n i t i a l d i s t r ibut ion a^2; (i i i) the processes {&Xi(fa)} and {Oxi(fa)} i n (i) and (ii) have the same laws. 76 For example, a l l parameters fa w i t h 6(fa) = (A, A ) T are in the same equivalence class, since 0Xi(<f>i) is identical ly equal to A i n this case. A n o t h e r parameter i n the same equivalence class is given by If Oi((j)),..., 0m((j)) are dist inct and [ctjk(fa] is irreducible and aperiodic , and so has a unique stat ionary d is t r ibut ion ( K a r l i n and Tay lor , 1975, Theorem 1.3 and R e m a r k 3.1), then 4> only contains points obtained by the permutations described i n the first paragraph of this section; this corresponds to a finite mixture d is t r ibut ion w i t h dist inct support points and posit ive m i x i n g proport ions. B a u m and Petr ie (1966) and Petr ie (1969) consider the identif iabi l i ty question for the finite state hidden M a r k o v model f rom the point of view of identi f iabi l i ty for functions of a finite state M a r k o v chain ( G i l b e r t , 1959). T h e fol lowing l e m m a shows that the equivalence classes are identif iable, i n the sense that two parameter values i n different equivalence classes produce different stat ionary laws for the process {Yi}. In Section 6.7, we establish the consistency of the equivalence class of the m a x i m u m l ikel ihood estimator. L e m m a 6.1 If condition 2 holds, then fa and fa define the same stationary law for the process {Yi} if and only if fa ~ fa. Proof : If fa and fa define the same stationary law for the process {Yi}, then, i n part icular , the jo int d is t r ibut ion of Y\ and Y2 is the same under fa and fa. N o w these jo int distributions have densities of the fol lowing form: m m Â£ Â£ aJ(faa3k{faf(y1,ej{<i>))f(y2,ek{fa), j=l k=l namely, finite mixtures of products of two densities f rom the fami ly {f(y, 6): 9 Â£ 0 } . Thus , by condit ion 1 and L e m m a 2.1, fa and fa define the same m i x i n g d is t r ibut ion , that is, the 77 same dis t r ibut ion for { 0 x 1 , 8 x 2 ) i a n d hence the same law for {Oxi}- â€¢ 6.3 The ergodic property In order to define the ergodic property we introduce the set y of doubly infinite sequences (... 2/0, Vi, â€¢ â€¢ â€¢) of real numbers. Let B(jik) be the a-f ield of cylinder sets depending on yj,... ,yk, namely, sets of the form A â€” {y â‚¬ y : (yj,...,yk) â‚¬ B} where B is a B o r e l subset of i Â£ f c - J ' + 1 ; also let B be the u-field generated by U^._0 0S(_^. ifc) (B is the B o r e l cr-field on y, that is , the cr-field generated by the open sets under the product topology). N o w define the shift operator T on y by T{yi] = {y[} where y\ = yi+i; that is , T shifts each element of a sequence back one posit ion. A set A i n B is defined to be invariant i f it satisfies the fol lowing property : y G A i f and only if Ty Â£ A. A stat ionary process Y = { Y j / ^ o is defined to be ergodic i f P{Y Â£ A} is either zero or one for every shift invariant set A. T h e most impor tant examples of invariant sets are defined i n terms of l imits of sequences; thus the ergodic property can be used to establish that a random variable defined as the l i m i t of an ergodic process is degenerate. T h e next result establishes the ergodic property for hidden M a r k o v models. L e m m a 6.2 If {X{} is stationary and irreducible, then {Yi}â„¢^ is ergodic. Proof : Let A be an invariant set i n y. A c c o r d i n g to the K o l m o g o r o v extension theorem, there is a subsequence {&'} and cylinder sets Ak' Â£ B(-k',k') s u c n that , for every k > 1, P{Y Â£ AAAk>} < 2~k, (6.2) where Y stands for {Yijf^ and A is the symmetric difference operator (EAA = (En AÂ°) U (Ec n A ) ) . B u t , since Y is stationary and A is invariant , P{Y Â£ AAAk'} = P{T2k'Y Â£ AA A,k} 78 .= P{Y Â£AAT~2k'Ak,} = P{YÂ£AAAk,}, (6.3) where T2k' is the 2knh iterate of T and T~2k' Ak, = {</ : T 2 * ' y G A'}, so i f c - = T~2k> Ak< Â£ #(fc',3fc')- N o w * e t -4 = Mfc'> *-Â°-} = nfc>i u j > * A'- T h e n .4Â° n A - Ac D {Ak>, i.o.} = {Ac n Aw, i.o.} and An Ac C An {A%,, i.o.} = {An A%,, i.o.}, so (6.2), (6.3), and the B o r e l - C a n t e l l i l e m m a (Bil l ingsley, 1979, Theorem 4.3) i m p l y P{Y G A A A} = 0. T h u s i t suffices to show P{Y G A} is either zero or one. N o w A G n fc>i^(A;,oo)5 the " t a i l " <7-field; because the Yi are condit ional ly independent given a realization x = {a;,} of the under ly ing M a r k o v chain , the zero-one law (Bil l ingsley, 1979, Theorem 22.1) implies that P{Y G A | x} is either zero or one. Le t E = {x : P{Y G A\x} = 1}, so P{Y Â£ A} = E[P{Y G A \ X}] = P{X Â£ E}. N o w E is an invariant set, since P{Y Â£ A | x} = P{TY Â£ A | Tx} = P{Y Â£ A | Tx} (A is invariant) . B u t a finite irreducible M a r k o v chain is ergodic ( K a r l i n and Taylor , 1975, p. 533, or Bi l l ingsley, 1965, p. 31) and therefore P{X Â£ E} is either zero or one; this completes the proof. â€¢ 6.4 Entropy In this section we define the entropy for stationary hidden M a r k o v models and show that the conclusion of the S h a n n o n - M c M i l l a n - B r e i m a n Theorem, which concerns finite-state pro-cesses, also holds for the general hidden M a r k o v model . T h i s result is relatively simple to 79 prove and anticipates the more general result of the next section, but none of the develop-ment in this section is necessary for anything in the sequel. The entropy of the stationary process {Y,} under the parameter <j>o is defined by the following expression: H(<f>0) = - Â£ 0 o [ l o g p ( * o I y - i , Y_2,.. .;4>0)}. (6.4) In order for this definition to have meaning, the conditional density p(Yo | Y _ i , Y _ 2 , . . . \ 4>o) must be shown to exist. We will construct the conditional density by considering limits of the conditional densities which depend on a finite number of past values of the process, and then allow this number to grow arbitrarily large. The term entropy is used because the above definition of H(4>Q) is a generalization of the well-known entropy for a random variable Y with density p, namely E[â€” logp(Y)]. In order to define the conditional density of Yo given the infinite past, consider the following representation for the conditional densities which depend only on a finite number of past observations: m Pi(yo I y_i, . . . , jM+i;^o) = Â£ P ^ 0 { X 0 = j | . . . ,y_i+i}/(yoÂ» 0j(&>))-(The conditional probabilities are denned using the stationary distributions of {Xi} and {Y{} and hence are not the same as the conditional probabilities of the E M algorithm in Chapter 5, where an arbitrary initial distribution was used.) A classical martingale convergence result says that, if Z is an integrable random variable and {Qi} is an increasing sequence of a-fields, then l im^oo E[Z | Gi] = E[Z | G<x>], with probability one, where Goo is the cr-field generated by U;Â£,' (Billingsley, 1979, Theorem 35.5). Applying this result with Z equal to the indicator of the event {XQ = j} and Gi equal to 80 the cr-field generated by Y _ i , . . . ,Y-i+i gives lim p^{x0 = j | y _ ! , . . .,Y.i+1} = p^{x0 = j | y _ i , y _ 2 , â€¢..}, (6.5) Â»â€”t-oo with probability one. Therefore we can define the conditional density depending on the infinite past by m p(Y0 I Y^,Y_2,...;4>0) = J2P*o{Xo = j I Y-1,Y-2,...}f{Yo,0i(<h)), (6-6) and we have l im Pi(Y0 | . . . , y _ , - + 1 ; <Â£0) = P^o I ^ -1,^-2, â€¢ â€¢ â€¢ \</>o), (6.7) Â«â€”+oo with probability one. Theorem 6.1 If conditions 1 and 5 hold, then H(4>Q) - E^0[- logp(Yo | y_i,y_ 2,...;<j>o)} is finite and (i) h m n ijE7^,[logpâ€ž(yi,.. .,Yn;<f>0)] = -H(<f>0); (ii) l i m n i - l o g p n ( Y i , . . . ,Yn]<f>o) = â€”H(<j>o), with probability one, under 4>Q. Proof: By condition 5, {logp^Yo | Y L i , . . . ,Y_,-+i;0o)} is a uniformly integrable sequence of random variables, since mmf(Y0,63{4>o)) < Pi{Y0 | Y-1,...,Y-i+1;<j>o) < m a x / ( y 0 , ^ ( 0 O ) ) , 3 3 and hence (6.7) implies H{ct>0) = l i m ^ J - logp,-(y0 | y _ i , . . . , y _ i + 1 ; therefore, H(<f>o) is finite. Using 1 1 n - i o g p n ( Y i , . . . , y n ; < Â£ 0 ) = - iogTTpj (y t -1 y,-_i,...,yi;</Â»o) n n â€¢ , = i ^ i o g p ^ y i i y . - i , . . . , ^ ; ^ ) , n .-=i 81 we conclude that i ^ 0 [ l o g p n ( y 1 , . . . , Y B ; ^ 0 ) ] = lj2E^ogPi(Yi\ Y - i , . . . , Y i ; ^ 0 ) ] Tt Tt = i f > * 0 [ l o g p , - ( Y 0 I Y _ a , . . . , Y _ ! + 1 ; ^ o ) ] Tl 1=1 - -H(<f>0). T o prove ( i i ) , we w i l l approximate l o g p n ( Y i , . . . , Y n ; <f>o)/n by i f > g F ( Y , - | Y - i , Y _ 2 , . . . ; ^ 0 ) , 1 = 1 to w h i c h the ergodic theorem can be appl ied, g iv ing l i m - Â£ log Pi(Yi | Y ; - i , Y;-2,...; <f>0) = [log pt(Y0 | Y _ x , Y _ 2 , . . . ; &,)] , n Tl . _ t=l w i t h probabi l i ty one, under P ^ 0 . T h e arguments we w i l l use to achieve this approximat ion are adapted f rom K a r l i n and Taylor (1975, pp. 498-502). N o w 1 1 n | - l o g p n ( Y i , . . . , Y n ; < / i . o ) Â£ l o g p(Yi | Y _ x , Y i _ 2 , . . . ;<Â£ 0) | t=l < I Â£ | log Yf | Y , _ i , . . . , Y x ; (j>o)/p{Yi | Y ^ , Y t _ 2 , . . . ; <f>0) | Ti 1 = 1 1 " = - Â£ f c , - ( Y t - , Y ; - _ i , . . . ; < / > o ) , t=i say. If HN(Y0, Y _ i , . . . ; (fo) = s u p ^ ^ / i 4 ( Y 0 , Y _ x , . . . ; </>0), then (6.7) implies l i m HN(Y0,Y_u...;<f>0) = 0, Nâ€”t-oo w i t h probabi l i ty one. A l s o , { i f ^ ( Y o , Y _ i , . . . ; < 0^)} is uni formly integrable, by condi t ion 5, since hi(Yo, Y _ i , . . . ; <f>o) < 2 m a x , | log / ( Y 0 , &j((f>o) |. Therefore, l i m ^ 0 [ i r J V ( Y o , y ' - i , . . . ; ^ o ) ] = 0 82 and, for e > 0, there exists JVe such that ifyJ.fljv e(Yo, Y _ i , . . . ;<fo)] < e- By the ergodic theorem ({H~Ne(Yi, Y t - i , . . . ; ^ o ) } ^ - ^ is a stationary ergodic process), i J2 HNe(Yn Yi-i, â€¢ â€¢ â€¢; <t>o) = E^ [HNâ‚¬(Y0, Y.lt...; <f>0)], with probability one. Now choose n Â£ > Ne (depending on {Fi}) such that 1 n -Â£-ffwe(yÂ»,y;--i,...;^ o) < ^ 0[^( yo,y-i,...;^>o)] + â‚¬, n > n e, and 1 ^ - 5 Z ^ f ( y i , y Â» - i , â€¢ â€¢ â€¢ i 4>o) < e, n> nt. n â€” !=1 Then | - \ogPn{Yu. ..,Yn;4>0)--J2 log jXTi I ^ - i, ^ - 2 , â€¢ â€¢ â€¢; <M I t=i j JVÂ« 1 n ^ - E ^ ( y i , y , - i , . . . ; 0 o ) + - Â£ JTive(Yt-,y,-_i,...;0o)<3e, 1=1 t=7Vc+l for n > nt. â€¢ 6.5 Convergence of the log-likelihood function The main result of this section is the limit theorem for the log-likelihood function stated below. This result is similar to Theorem 6.1, with the important difference that the limit of the log-likelihood at points other than 4>o is identified. This limit leads to a definition of cross-entropy, which is studied in the next section. Theorem 6.2 Assume conditions 1, 3, and 6 hold. Then, for <j> Â£ 3>c, there is a constant H(<j>o,<f)) < oo (possibly equal to â€”oo), such that (i) limâ€ž [logpn(Y 1 ,...,Yn;<f>)] = H{<fo,<f>); (ii) \\ran -^log pn(Yi,... ,Yn;<f>) = H((po,<f>), with probability one, undergo. 83 Fix the value of <f> Â£ $; for the rest of this section, the parameters a}k and 6j (and joint densities defined using them) are assumed to be evaluated at this point. Define X2 xn i=3 Pn(yi,---,yn I j ) = / ( y i , 0 j ) Â£ - - - Â£ a ! i , * a / ( y 2 , 0 x 2 ) n Q ! * i - i . * . - / ( w Â» ^ ) (with | j) = f(yi,0j)) and qn{yu â€¢ â€¢ â€¢ ,yn) = maxpn(yi,...,yn \ j). 3 Then the likelihood satisfies pn(yi,---,yn) = Â£Â«S-1WyiÂ»--->!'n | j) 3 > gn(yi,---,yn )minaf ) and hence I / â€¢ W\ / I Pn(yi, â€¢ - â€¢ ,Vn) . n loglmma,- ) < log â€” ; f < 0. i 3 5 n ( Â» l , . . . , l / n ) Therefore l o g p n ( Y i , . . . ,Yn)/n and Â£^ 0 [ logpâ€ž(Yi , . . . , Y n ) ] /n have the same limiting values as l o g q n (Y i , . . . ,Yn)/n and .E^ 0 [ logg n (Y i , . . . ,Y n ) ] /n , respectively, and so the conclusions of the theorem wil l foUow from the corresponding conclusions applied to qn. Notice that qn does not depend on the initial probabilities, provided they are positive, so that the limit of the log-likelihood is valid for any choice. The advantage in working with qn rather than pn is provided by the property given in the following lemma. L e m m a 6.3 For any sequence {yn}, q3+t(yu ... ,ys+t) < qs(yi,... ,ys)qt(ys+1,... ,ys+t), s , t > l . Proof: By the definition of ps+t(' I j)-, Ps+t(yi,---,ys+t | j) 84 = f(Vl, Oj) Â£ â€¢ â€¢ â€¢ Â£ aJ.*a/(Â»2 > ) II a * i - i ' ) 272 i s 3 s+t axs,kf(ys+i,Ok)(*k<xs+2f(ys+2,0xs+2) II a * i _ i , * i / ( w Â» ^ ) = /(l^l. * j ) Â£ â€¢ â€¢ â€¢ Â£ <*j,x2f{V2, #x2) II a * < - i ,*.7(Â»t, 0*,-) X2 zÂ» 3 Â£ a * , , * PÂ«(yÂ«+i , . â€¢â€¢,Â».+* I k) k â€¢ s < f(yi,0j) Â£ â€¢ â€¢ â€¢ Â£ otjiX2f(y2,0*a) n a * . - - i f ( y i > 6 * i ) 9*(y*+i> â€¢ â€¢ â€¢ >yÂ»+0 ^2 iÂ« 3 < ?s (y i , â€¢ â€¢ â€¢ , y s ) ?<(y s+i, â€¢ â€¢ â€¢, â€¢ N o w define the doubly indexed sequence of random variables { W r f } by Wst = \og(qt-s(Ys+1,... ,Yt)), s > t. T h e above l emma says Wst < Wsu + Wut, s<u<t. (6.8) Ergodic theorems for processes satisfying this subaddi t iv i ty property were obtained by K i n g -m a n (1976), so we consider next the other properties w h i c h were used to obta in these theorems. B y the stat ionarity of {Yn}, {Wat} is stat ionary relative to the shift transformation Wst â€”> Wg+ij+i; (6.9) for example, Wst and Ws+\j+i have the same dis t r ibut ion . A l s o , the integrabi l i ty condit ion E^[W+] < oo (6.10) is satisfied under condit ion 6, since l o g g i ( y i ) < log(maxj f(yi,8j))-K i n g m a n (1976, Theorems 1.5 and 1.8) proved that a process satisfying (6.8), (6.9), and (6.10) also satisfies the conclusions of the ergodic theorem, namely (i) limâ€ž W0n/n = W < oo exists w i t h probabi l i ty one; (ii) E[W] â€” l i m n E[W0n/n]; and (i i i) W is degenerate 85 i f the process is ergodic (the cr-field of events invariant under the shift transformation i n (6.9) is t r iv ia l ) . K i n g m a n ' s results actual ly generalize the classical ergodic theorem, w h i c h deals w i t h addit ive rather than subaddit ive processes. A p p l y i n g these results to Won = l o g g n ( Y i , . â€¢ â€¢ ,Yn) gives (the ergodicity carries over f rom the ergodici ty of {Yn}) l i m [log qn(YiYâ€ž)] = H(<f>0, 4>) < oo exists and l i m - l o g 9 â€ž ( Y i , . . . , y n ) = H(<j)a,<l>), n n w i t h probabi l i ty one, under </>o- A s demonstrated above, logpn/n and logqn/n have the same l i m i t i n g behaviour; thus the proof of Theorem 6.2 is complete. 6.6 Cross-entropy T h e cross-entropy between <f>o and <j> is defined as K(<f>o', <p) = H(<j>o,4>o) â€” H{<j>o,<f>), where H((f>o, 4>o) and H(4>o, <f>) are defined i n Theorem 6.2 (notice H(<j>o, <f>o) is the negative entropy, where the entropy H(<J>Q) is defined i n Section 6.4). A s mentioned above, K provides a measure of distance between parameter points ; the definit ion of H(<f>o, 4>) i n Theorem 6.2 shows that K(<t>o, 4>) is the large-sample average K u l l b a c k - L e i b l e r divergence per observation between pn(yi,- â€¢ â€¢ ,yn\<f>o) a n d pn(yi, â€¢ â€¢ â€¢ ,yn] <t>)- J u a n g and Rabiner (1985) use the cross-entropy as a measure of distance between hidden M a r k o v models i n a numerical study of the effects of s tar t ing values and observation sequence length on m a x i m u m l ike l ihood estimates. In this section we prove a result needed for the large sample analysis of m a x i m u m l ike l i -hood estimators, that the cross-entropy between two different points is posit ive. O b t a i n i n g this result is surpris ingly difficult and w i l l lead to another s tudy of the asymptot ic be-haviour of the log- l ikel ihood. K i n g m a n ' s subaddit ive ergodic theorem which was used in 86 the previous section does not include a representation of the hmi t as the expected value of some random variable, as does the classical ergodic theorem. In this section, we w i l l direct ly establish the convergence of the normalized log-l ikel ihood a n d , using the previous section's results to identify the h m i t random variable w i t h the constant H((j>o, <f>), obtain such a representation for H((j>o,<f>). A s i n Section 6.4, we w i l l s tudy the log-l ikel ihood using the relat ion n log pn(yi, â€¢ â€¢ â€¢, yn; 4) = Â£ log Pi(yi | y , _ i , . . . , yi; 4>). However, instead of approx imat ing pi(Yi | Y i _ i , . . . , Y i ; 4>) by a stat ionary process, we define a new probabi l i ty measure (on an augmented probabi l i ty space), under which {pi(Y{ | Y f _ i , . . . , Y i ; <j>)} is itself stationary. T h e quantities derived under this new probabi l -i t y space w i l l then be related back to quantities defined i n terms of the or iginal probabi l i ty space. T h e mot ivat ion for using this approach came from Furstenburg and Kesten (1960), who studied the convergence of products of random matrices and also f rom Petr ie (1969) who used the latter results to obtain the convergence of the log- l ikel ihood for f inite-valued h idden M a r k o v models. There is a connection w i t h K i n g m a n ' s theorems which were used i n the previous section; K i n g m a n applied his results to obta in those of Furstenburg and Kesten (1960). O n the other hand, the hmi t results for {qn} obtained using K i n g m a n ' s theorems could be proved using arguments s imilar to those of Furstenburg and Kesten (1960). T h e approach to be followed requires a careful accounting of the probabi l i ty spaces and measures involved. W e begin w i t h the probabi l i ty measure P$0 defined on (y, B); for notat ion, see Section 6.3. L e t f i be the set of sequences {(yn,u^)}, where the yn are reals and the are ro-dimensional vectors. Let P ^ o ^ be the probabi l i ty measure on f i defined as the image of P ^ 0 on the subset where = ctj(<f>o), the s tat ionary probabil i t ies of the 87 stochastic m a t r i x [OJJA:(^ O)]> and â€¢ r Â» = E ; ' ^ ( ^ , * = i Â» , . = i , 2 . . . . (..ID (0/0 is taken to be 0); {u^n^} is determined by {yn} on this subset, so this definit ion is meaningful . Le t YN and be the coordinate mappings on f i . T h e goal is to define a probabi l i ty measure on f i , under w h i c h {U^} is a stat ionary sequence, while {YN} has the same dis t r ibut ion as it does under P^. Let TQ be the shift transformation on f i , that is , Tu{(yn, u^)} = { ( j / n + 1 , Â« ( n + 1 ) ) } . Le t P^T^k be the prob-a b i l i t y measure on f i which is the inverse image of P$ $ under the kth iterate of T Q , that is , P'MT~k{A) = P^{u e f i : e A}, A e % (So, is the B o r e l a-field of f i , defined i n the same way as for y; see Section 6.3). Define new probabi l i ty measures P^^ = H^oPfo^T"1 /l, for / = 1,2, T h e fol lowing l emma is essentially due to Furstenburg and Kesten (1960). L e m m a 6.4 There is a subsequence {4} and a probability measure P^O )0 such that (i) for every p, the joint distribution of (Yi,U^),..., (YP,U^) under P^]p converges weakly to the corresponding joint distribution under P^0>4>; (ii) {(Yn,U^)} is a stationary process under Pfo^; and (iii) {Yn} has the same distribution under P<f,0l4> as under P ^ 0 . Proof : Let Qpl) be the jo int d is t r ibut ion of ( Y i , tfW),..., (Yp, U^) under P ^ . B y the Hel ley selection theorem (Bil l ingsley, 1979, Theorem 25.9), for each p there is a subsequence <t>) U^) {lk }k=i s u c h that Qpk converges to a measure Qp w i t h mass less than or equal to one; we show Qp has mass one. G i v e n e > 0, let Kt be a compact subset of Rp for which Pj,0(Kt) > 1 â€” e; observe also that the range of is contained i n the compact set [0, l ] m . N o w , because a l l of the P^ $ agree on events defined i n terms of {Yn}, i f = 88 {((YIMl)),...,(yP,uW)): (yi,...,yP) G Kt,(uW,...,uW) e [0,1]â„¢}, then Q)P(KP) = Pfo(Kc) > 1 â€” e for every /. Since is compact , this proves {QP^} is t ight , and i t follows that Qp has mass one (Bi l l ingsley, 1979, Theorem 29.3). B y construct ion, the measures Qp are consistent i n the sense that Qp+i agrees w i t h Qp on sets defined i n terms of the first p coordinates. Therefore, by the K o l m o g o r o v extension theorem, there is a measure -P^0 l </, on ft w i t h pth-OTdeT f inite dimensional d is t r ibut ion Qp for every p. A l s o , since {/jjf4^} can be assumed to be a subsequence of {/j^ }, the diagonal method (Bi l l ingsley, 1979, p . 292) implies that there is a subsequence {lk} such that the pth-ordeT f inite dimensional distr ibutions of P^\ converge to Qp for every p. Since {Yn} is a stationary process under P^0t^, a l l P^o <j)T~k, and consequently a l l P^^ and P<t>0,<t>i agree w i t h P^0<f, on events defined by {Yn}; therefore, {Yn} has the same dis-t r i b u t i o n under P ^ 0 i ^ as under P^ , 0 . T o show {(Yn,U^)} is stat ionary under P ^ 0 l ^ , or i n other words, that p0O )0 is a stationary probabi l i ty measure, we must show P^^A) â€” p<t>oAT~lA)i AÂ£Ba\ t l l i s follows f rom i=0 â€¢ R e m a r k It can further be shown that P$*\ itself converges weakly to -P<^ o,0, but this w i l l not be needed. For the case 4> â€” <j)0, the meaning of the random vector under P^,0^o w i l l now be explained. T h e recursion relations (6.11) for and the i n i t i a l condi t ion = <Xj(<fio) give = P<j>0{Xi = j | Y J _ i , . . . , Y 1 } under P^0i^o; hence, the operat ion of shift ing the 89 t ime scale and tak ing the l i m i t to obta in has the effect of convert ing C/j ' in to a con-di t iona l probabi l i ty depending on inf initely many past values of {Yi} . M o r e precisely, C/j1^ represents P^0{X\ = j \ Y 0 , Y _ 1 ( . . . } i n the sense (to be proved i n L e m m a 6.6 below) that the condit ional density of Y i , . . . , Y ; given under P^ 0 > ( / , 0 , is U^pi(yu..., y{ | j;<t>0). Therefore, the entropy H(<j>o), defined in Section 6.4 by H(4>o) = E<j>0[â€” l o g { Â£ j P^0{Xi = j | Y0fY-1,...}f(Y1,ej)}], is seen to be equal to E M [ - l o g ^ - uf* f(Yx, 9j)}], w i t h the consequence that this representation can be extended to parameters other t h a n <f>Q, as i n the fol lowing result. L e m m a 6.5 Assume conditions 1, 3, and 6 hold. Then, for every <j) 6 $ c , HfaA) = W l o g f E , . uffiY^e ,{<)>))}]. Proof : T h e ergodic theorem implies ^ { ^ ^ E i < Â« { E ^ 0 / ( y . - , W))> = z} = i, i=i j where Z is a random variable w i t h EM[Z) = E^^og^j Uj^f(Xu 8j(<f>))}]-Fi rs t we show E^,0t^[Z] < H(<po,<f>)- T h e recursion relations (6.11) i m p l y Â£ ui2)f{Y2, ek{<t>)) = Â£ Â£ ^ / ( Y x , W ) ) / ( Y 2 , ek{4>))i Â£ u^m, e^)) k j k j = S ^ f t e . ^ I * * ) / Â£ cfW* I 3\4>), (6-12) and i terat ing gives E^ , V (yÂ» ,W)) = J/fW^ i,..., Yi | j; <Â£)/ O-fWiC^, â€¢ â€¢ â€¢, I J'; </>)â€¢ (6-13) Therefore, Â£ l o g { Â£ c f / ( K - , W ) ) } = i o g { Â£ f / f p n ( ^ , - - - ^ n l i ; ^ ) } !=1 j J < l o g g n ( Y i , . . . , Y n ; 0 ) , 90 and , since {Y,} has the same dis t r ibut ion under P ^ 0 as under Pfa^, the proof of Theorem 6.2 gives Z < H(4>0,4>). N e x t we show E^0t^[Z\ > H(4>o, 4>). Assume, wi thout loss of generality, H(<j)o, 0) > - o o . U s i n g the fact that the joint d is t r ibut ion of (Y\, U^) under P ^ J , converges weakly to the corresponding d is t r ibut ion under p0 o ,0 , we get (Bi l l ingsley, 1979, Theorem 25.11) l i m s u p / log{Â£t/f)/(^,W))}^S< / l o g i ^ J / f / m , ^ ^ ) ) } ^ ^ , k J a j J a j where A = {log{Â£i ^ / ( Y i , * ; ( > ) ) } < 0}. A l s o , since (log{Â£jUPf(Yi, W))})+ is uni formly integrable w i t h respect to P$*\ by condi t ion 6, l i m / l o g { Â£ c f V ( ^ i , W))}^i = / l o g i E ^ / m ^ i W ) ) } ^ , * -Â« JQ\A j Jn\A j Therefore, Z = E+0,4\og{Y,uVm,93i<f>))}} > h m s u p ^ ^ l o g i ^ C / f ^ Y ! , ^ ^ ) ) } ] j k i = l i m s u p I Â£ l ^ P o g E ffj'VW, W))}] = l i m s u p l o g { Â£ c f / ( Y â€ž 9M))}] k k !=1 j = l imsupj f3^ 0 [ log{Â£<* j { (h )p i k (Y i , . . . ,Y [ k \ j;<f>)}] k the second last equality follows f rom (6.13) and f/ j 1^ = aj(<f>o) on the support of P^o ^, and the last f r o m the proof of Theorem 6.2, using aj((j>o) > 0 (which follows f rom the i r reducib i l i ty of [c<jk(<t>o)])- 1=1 T h e representation in the lemma is used next to prove that the cross-entropy between two different parameter points is positive. 91 L e m m a 6.6 Assume conditions 1-3, 5, and 6 hold. For every <f> E $c, K(<l>o, <j>) > 0. // <f> rf, <j)0) then K(4>o, <f>) > 0. Proof : T h e first step is a verif ication of the property (described fol lowing L e m m a 6.4) of the jo int d i s t r ibut ion of Y i , . . . ,Yi, IjW under P ^ 0 , ^ 0 , namely that ( Y i , . . . , YJ) has the con-d i t i o n a l density J2j UJ^pi(yi, ...,yi\ j; 4>o), given U^. T h e case i = 2 w i l l be considered; the general case is verified s imilar ly . Le t Q be the dis t r ibut ion of U^> under P ^ 0 , ^ 0 ; then, i f B is a cont inui ty set of Q, P*o,*o{(Yi,Y2)eA,uWâ‚¬B} 1 ^ f f = l i m r Â£ / / E u i J ) P 2 ( y i , i / i + i I i j ^ o J d/i^ O r f M i / i + O d Q K ^ ) = / / Â£ uiP2 ( 2/i,yi+i I J\4>o)dii{yi)du(y2)dQ{-lk\u) k J B J A 3 = j z\2u3P2(yi,yi+i \ 3]<t>o)dÂ»(yi)dp(y2)dQ{u), where Q{ and Q(0 are the distr ibutions of Â£/(') under P ^ ^ and Â£ ( - = 1 ^ 0 fl respec-t ively ; the second equal i ty follows f rom Ijf = P<t,0{Xi = j \ Y , _ i , . . . , Y i } under P ^ o N o w stat ionarity, (6.12), and the above property i m p l y 2H(<t>0,<t>0) = ^ o , 0 o [ l o g { E ^ 1 ) / ( y i , ^ ( ^ o ) ) } + l o g{j : J / j 2 )/(y2 , ^ ( ^ o ) ) } ] 3 3 = E^logiJ^U^MYi^ | j ; ^ o ) } ] 3 = J J J E u i K(!'i^2 | j;0o)log{^ttiP2(yi,W I J;<f>o)}d/J,(yi)dfj,(y2)dQ(u). 3 3 Next we extend the construct ion of L e m m a 6.4 to simultaneously include two sequences, { w h i c h satisfies (6.11) w i t h the parameter value <f>0, and { V W } w h i c h satisfies (6.11) 92 w i t h the parameter value (f>. T h e n , as above, 2#Oo,<Â£) = ^0,41og{X V^p2{Yu Y2 | j; <f>)}] 3 = JjJ ^ Â£ ^ 2 ( ^ 1 , 2 / 2 1 j ; < M l o g { Â£ ^ ^ 3 3 where Q'(-, â€¢) is the d is t r ibut ion of ( { j W , W 1 ).) under P^0l<j,. Since the marg ina l d is t r ibut ion of Q' corresponding to the first coordinate is Q, we have K(<f>o,<t>) = HUE jUjp2(yi,y2 | j; <j>Q) log j ^ y j P 2 ( y i *yl\j;<t>) } dn{y\)dn(y2)dQ'(u, v). Since the inner integral , for f ixed u, v, is the K u l l b a c k - L e i b l e r divergence between two mixture densities, K(<f>o, 4>) > 0 a n d , i f K(<j>o, <f>) = 0, then this K u l l b a c k - L e i b l e r divergence is zero for almost every pair u, v (wi th respect to Q'(-, â€¢))â€¢ La the latter case, L e m m a 2.1 and L e m m a 2.2 i m p l y Y,llu3a3k{<t>oÂ¥(ej(^)M*o)) = Â£ Â£ W * W(*>(*)A(*)) 3 k j k for almost every pair u, v (wi th respect to Q'(-, â€¢)), where 8 denotes the d i s t r ibut ion func-t ion of a point mass. Since has the dis t r ibut ion of P^Xi = j | Y 0 , Y L j , . . . ; <f>0}, E$o,<l>Wj1^] â€” (Xji&o)- Therefore, K(<f>0,<f>) = 0 implies E E a i ( ^ ) a i ^ W ( Â« y ( * o ) A W o ) ) = HYlf v3dQ'(u,v)ajk(<l))S{ej{(l>)M<t>)); 3 k 3 k hence <f> and <f>o define the same stat ionary laws for (#A'I ^ X2) a n ( i s o 4> ~ n 6.7 Consistency of the m a x i m u m likelihood estimator W e can now present the m a i n result, which concerns the consistency of the m a x i m u m l ikel ihood estimator. T h e results of the previous sections al low the appl icat ion of the basic 93 strategy invented by W a l d (1949) and further developed by Kiefer and Wol fowitz (1956). Consistency must be stated i n terms of convergence of the equivalence class of the max-i m u m l ike l ihood estimate 4>n (see Section 6.2). W e w i l l obtain convergence i n the quotient topology defined relative to the equivalence relation ~ . Redner (1981) used convergence i n this sense for estimators of the parameters of a finite mix ture d is t r ibut ion (see Section 2.4). Consistency i n the sense of the quotient topology s imply means that any open subset of the parameter space $ c w h i c h contains the equivalence class (fo of the true parameter must , for large n, contain the equivalence class of < n^. Theorem 6.3 Assume conditions 1-6 hold. Let <f>o be the true parameter value and let 4>n be a maximum likelihood estimator. Then (f>n converges to <fro in the quotient topology, with probability one. Proof : Le t logqn{4>) denote l o g g n ( l i , . . . , Yn; <p) and s imilar ly for pn. For <fi ^ <fo, we have limâ€ž Epilog qn(<f>)]/n = II(<f>0,<f>) < H (<j)o, <f>o), by L e m m a 6.6 and the proof of L e m m a 6.2; hence there is an e > 0 and integer n e such that E^0[logqne((j))]/nâ‚¬ < H(<f>o,(f>o) â€” e. N o w , qne{-) is continuous, and , using the integrabi l i ty condi t ion 6, J5^ o [ ( logsup0. e C ) ^ q , n c ( < ^ ' ) ) + ] < oo, for a smal l enough neighbourhood 0$ of <j>\ therefore, there is an open neighbourhood Oj, for w h i c h E<f,0 [log s u p ^ ; g 0 < ^ qnc (<f>')}/ne < E^log qnâ‚¬(<t>)]/ne + e/2 < H{<f>0, <t>0) - e/2. A s i n the proof of L e m m a 6.2, 0 > log{ sup pn((f>')/ sup qn{4>')} > log m i n a 1 ; 1 * , and so log{sup^/ e C ) ^ Pn(,<t>')}/n and l o g { s u p ^ e C ) ^ qn(4>')}/n have the same l i m i t i n g behaviour as n â€”*â€¢ oo. A l s o , Wst â€” l Â° g s u p 0 / g C ^ qt-9(<f>') satisfies the conditions of K i n g m a n ' s subad-di t ive ergodic theorem (see equations (6.8), (6.9), and (6.10)), and hence def ff(<Â£o,<fo<^) =. l i m f ^ J l o g sup gâ€ž(<Â£ ' ) ] / n 94 exists and l i m l o g { sup qn((/>')}/n = H(<j>0, <f>; O ^ ) , w i t h probabi l i ty one. B y another property of subaddit ive processes ( K i n g m a n , 1976, The-orem 1.1), H(<j>o,<j>;C>4,) = i n f ^ 0 [ l o g sup qn(<j>')]/n, 71 so that H(<f>0,<l>;â‚¬></,)< Epilog sup qM')]/nt. Therefore, w i t h probabihty one, l i m l o g sup pn(<j>')/n = H(4>0,<f>\O4>) < H(<j>0,4>0) - e/2. Let C be a closed subset of $ c , not containing any points of the equivalence class 4>Q. Since $ c is compact , C is compact and so is covered by the union U^ = 1OA , where {</>i,... ,4>d} is a finite subset of C and Oh = 0(f,h. Therefore, w i t h probabi l i ty one, sup(logp n (<Â£) - logp n (<Â£ 0 ) ) = max{ log sup pn(<f>) - l o g p n ( ^ 0 ) } ->â€¢ - o o , 4>ec h <j>e0h w h i c h implies that , for any open subset O of <&c which contains the equivalence class <j>0, 4>n Â£ O for large n. If follows that the m a x i m u m l ike l ihood estimator converges to <f>o i n the quotient topology, w i t h probabihty one. â€¢ 95 Chapter 7 F u t u r e r e s e a r c h This chapter concerns three topics for future research: (i) a class of doubly stochastic Poisson processes which has the features of the hidden Markov model in continuous time; (ii) extensions of the large sample results of Chapter 5 to allow inference based on the maximum likelihood estimators for hidden Markov models; and (iii) hypothesis testing in mixture models for independent observations. We will give suggestions of specific problems and, in some cases, describe some preliminary work addressing these problems. 7.1 D o u b l y stochastic Poisson processes Let X(t) be a continuous time Markov chain with states 1,... , m and, conditional on the entire sample path of X, let N(t) be a non-homogeneous Poisson process with intensity function Ox(t)- Then N(t) is an example of a doubly stochastic Poisson process or Cox process. This class of point processes was proposed by Cox (1955) and has been studied extensively (see GrandeU, 1976, Snyder, 1975, and Karr, 1986). We have begun to investigate the computation of maximum likelihood estimates by the E M algorithm, using an analogy with the hidden Markov model with Poisson components. Conditional probabilities for X(t) given observation of N(t) on an interval can be calculated using a forward-backward algorithm. The recursive equations of Section 5.2 are replaced by 96 differential equations to describe the evolution over t ime intervals between observed events and u p d a t i n g equations to describe the evolut ion at event times. T h e differential equations yield explicit so lut ion, thus a l lowing simple evaluation of the required condit ional means. T h e relevant equations for P{X(t) = j \ N(s) : 0 < s < t}, j â€” 1 , . . . , m , have been given i n studies of f i l ter ing for C o x processes (Yashin , 1970, R u d e m o , 1972, and Snyder, 1972). W e mention that S m i t h and K a r r (1985) discuss m a x i m u m l ike l ihood est imation for the case that the intensity process takes only two values, one of which is zero, and D e m b o and Ze i touni (1986) use the E M algor i thm for a C o x process w i t h an intensity process which is a diffusion. T h e consistency of the m a x i m u m l ikel ihood estimators, as the observation period be-comes infinitely long, could perhaps be established using extensions of the arguments of Chapter 6 to the continuous t ime setting. For instance, the not ion of identi f iabi l i ty seems to carry over immediate ly w i t h an inf initesimal m a t r i x of transi t ion rates replacing the transi t ion probabihty matr ix . T h e subaddit ive ergodic theorem of K i n g m a n admits certain extensions to continuous t ime (see the discussion of this point i n K i n g m a n , 1976). Some of the other constructions, the cross-entropy, for instance, require a l i t t le more work. T h e questions of asymptot ic normal i ty of the estimators, x2 approximations of l ikel ihood rat io test statistics, and computat ion of the observed informat ion arise. T h e computat ion of the observed informat ion matr ix can be achieved by apply ing the general formula of Louis (1982). These questions should be addressed because of the great importance attached to C o x processes i n the field of point process model l ing. A l s o , the test of the one-component model against models containing more than one component provides a test of the hypothesis of a Poisson process which is sensitive to clustering of events. Ogata (1978) and Konecny (1986) develop asymptot ic theory of m a x i m u m l ikel ihood estimators for C o x processes, but 97 their results do not immediate ly apply to the model considered here. 7.2 Inference for hidden M a r k o v models It seems reasonable to believe that the m a x i m u m l ike l ihood estimators for the hidden M a r k o v model of Chapters 5 and 6 are asymptot ica l ly normal , w i t h large sample variances obtained f rom the informat ion m a t r i x i n the usual way; some restrict ion on the true param-eter value s imilar to that required for finite mixture distr ibutions w i l l l ikely be necessary. (Louis 's method for calculat ing informat ion can also be applied i n this model.) A potent ia l ly impor tant appl icat ion of the large sample results is to testing independence of a sequence of r a n d o m variables; since the independent mix ture model is nested w i t h i n the h idden M a r k o v model , the l ikel ihood rat io test could be used. T h e test statistic based on the m-component h idden M a r k o v model might have the usual large sample x2-distribution under the right conditions on the n u l l value of the parameter (for example, that it defines a mix ture d is t r ibut ion wi th exactly m components). 7.3 Hypothesis testing in mixture models In Section 2.4, we explained why the usual large sample x2 approximat ion for the l ikel ihood rat io test statist ic does not hold (at least cannot be obtained i n the usual way) for a n u l l hypothesis specifying a parametric fami ly of densities, w i t h the alternative specifying a class of mixtures f rom that family. H a r t i g a n (1985) showed, for a normal mixture family, that the x2 approx imat ion certainly does not ho ld , as the test statist ic converges to oo w i t h the sample size, under the n u l l hypothesis. G h o s h and Sen (1985) proved that a l ikel ihood rat io test stat ist ic , obtained under some constraints on the parameters, possesses a l i m i t i n g d is t r ibut ion expressed i n terms of a funct ional of a Gaussian process; this result appears to 98 have l i m i t e d use because of the diff iculty of determining the d is t r ibut ion of this funct ional . A s i d e f r o m the importance of mix ture models, tests of mixture hypotheses can be applied to hypothesis test ing situations not direct ly related to mixtures . For instance, the l ikel ihood rat io test of a pure Poisson dis t r ibut ion against a Poisson mixture yields a Poisson goodness of fit test w i t h the potent ia l to be sensitive to over-dispersion. W e undertook a smal l s imulat ion s tudy to compare the power of this test w i t h the test based on the index of dispersion. T h i s s tudy indicated that the usual x2 approximat ion does not hold for the l ikel ihood ratio test statistic and also that the potent ia l for improvement i n power over the dispersion test is rather l i m i t e d . 99 References A s k a r , M . and D e r i n , H . (1981). A recursive a lgor i thm for the Bayes solution of the smooth-i n g problem. IEEE Trans. Automatic Control, 26, 558-560. B a u m , L . E . and E a g o n , J . A . (1967). A n inequality w i t h applications to statist ical estima-t ion for probabi l is t ic functions of M a r k o v processes and to a model for ecology. Bull. Amer. Math. Soc, 73, 360-363. B a u m , L . E . and Petr ie , T . A . (1966). Stat is t ical inference for probabi l is t ic functions of finite state M a r k o v chains. Ann. Math. Statist., 37, 1554-1563. B a u m , L . E . , Petr ie , T . , Soules, G . , and Weiss, N . (1970). A maximiza t ion technique occur-r i n g i n the stat ist ical analysis of probabil ist ic functions of M a r k o v chains. Ann. Math. Statist, 41, 164-171. B b h n i n g , D . (1982). Convergence of Simar's a lgor i thm for f inding the M L E of a compound Poisson process. Ann. Statist., 10, 1006-1008. B o h n i n g , D . (1985). N u m e r i c a l est imation of a probabi l i ty measure. J. Stat. Plan. Inf., 11, 57-69. Bi l l ingsley, P . (1965). Ergodic Theory and Information. W i l e y , N e w Y o r k . Bi l l ingsley, P . (1979). Probability and Measure. W i l e y , New Y o r k . B u t l e r , R . W . (1986). Predic t ive l ike l ihood inference w i t h applications ( w i t h Discussion). J.R.S.S. B 48, 1-38. C h u r c h i l l , G . A . (1989). Stochastic models for heterogeneous D N A sequences. Bull. Math. Biol, 51, 79-94. C o x , D . R . (1955). Some stat ist ical models connected w i t h series of events. J.R.S.S. B 17, 129-164. D e m b o , A . and Ze i touni , O . (1986). Parameter est imation of part ia l ly observed continuous t ime stochastic processes v i a the E M algor i thm. Stoch. Proc. Appl., 23, 91-113. Dempster , A . P . , L a i r d , N . M . , and R u b i n , D . B . (1977). M a x i m u m l ikel ihood est imation f rom incomplete data v i a the E M algor i thm (wi th discussion). J.R.S.S. B , 39, 1-38. Devi jver , P . A . (1985). B a u m ' s forward-backward a lgor i thm revisited. Pattern Recognition Letters 3, 369-373. E d e l m a n , D . (1988). E s t i m a t i o n of the m i x i n g d is t r ibut ion for a n o r m a l mean w i t h appl i -cations to the compound decision problem. Ann. Stat. 16, 1609-1622. Feller, W . (1943). O n a general class of contagious distr ibutions. Ann. Math. Stat., 14, 389-399. Furstenberg, H . and Kes ten , H . (1960). Products of random matrices. Ann. Math. Stat., 31, 457-469. Geisser, S. (1975). T h e predict ive sample reuse method w i t h applicat ions. J.A.S.A., 70, 320-328. 100 G h o s h , J . and Sen, P . K . (1985). O n the asymptot ic performance of the log l ikel ihood ratio statistic for the mixture model and related results. Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, V o l . II ( L . M . L e C a m and R . A . Olshen, eds.), p p . 789-806. G i l b e r t , E . J . (1959). O n the identi f iabi l i ty problem for functions of finite M a r k o v chains. Ann. Math. Stat, 30, 688-697. G r a n d e l l , J . (1976). Doubly Stochastic Poisson Processes. Springer-Verlag, B e r l i n . H a r t i g a n , J . A . (1985). A failure of l ikel ihood asymptotics for normal mixtures . Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, V o l . II, ( L . M . Le C a m and R . A . Olshen, eds.), pp . 807-810. Hathaway, R . J . (1985). A constrained formulat ion of maximum-l ike l ihood est imation for N o r m a l mixture dis tr ibut ions . Ann. Stat. 13, 795-800. H e c k m a n , J . and Singer, B . (1984). A method for m i n i m i z i n g the impact of d is t r ibut ional assumptions i n econometric models for durat ion data . Econometrica, 52, 271-320. H e n n a , J . (1985). O n est imating the number of constituents of a finite mixture of continuous dis tr ibut ions . Ann. Inst. Statist. Math. 37, P a r t A , 235-240. Jeffreys, H . (1932). A n alternative to the refection of observations. Proc. Royal. Soc. Lon-don A 137, 78-87. Jewel l , N . P . (1982). M i x t u r e s of exponential distr ibutions. Ann. Statist., 10, 479-484. Juang , B . - H . and Rabiner , L . R . (1985). A probabil ist ic distance measure for hidden M a r k o v models. AT&T Technical Journal, 64, 391-408. K a r l i n , S. and Taylor , H . M . (1975). A First Course in Stochastic Processes. Academic Press, N e w Y o r k . K a r r , A . F . (1986). Point processes and their statistical inference. Dekker , N e w Y o r k . Kie fer , J . , and Wol fowi tz , J . (1956). Consistency of the m a x i m u m l ikel ihood estimator i n the presence of infinitely m a n y nuisance parameters. Ann. Math. Statist., 27, 887-906. K i n g m a n , J . F . C . (1976). Subaddit ive processes. In Ecole d'Ete de Probabilites de Saint-Flour V-1976, E d : P . L . Hennequin , Lecture Notes i n Mathemat ics 539, pp . 167-223, Springer-Verlag, B e r l i n . K i t a g a w a , G . (1987). N o n - G a u s s i a n state-space model ing of nonstationary t ime series. J.A.S.A., 82, 1032-1041. K o h n , R . and Ansley , C F . (1987). Comment on K i t a g a w a (1987). J.A.S.A., 82, 1041-1044. Konecny , F . (1986). M a x i m u m l ikel ihood est imation for doubly stochastic Poisson processes w i t h par t ia l observations. Stochastics, 16, 51-63. L a i r d , N . M . (1978). Nonparametr ic m a x i m u m l ikel ihood est imation of a m i x i n g dis tr ibu-t ion . J.A.S.A., 73, 805-811. 101 Levinson , S . E . , Rab iner , L . R . , and Sondhi , M . M . (1983). A n introduct ion to the applica-t ion of the theory of probabil ist ic functions of a M a r k o v process i n automatic speech recognit ion. Bell System Technical Journal, 62, 1035-1074. L i n d g r e n , G . (1978). M a r k o v regime models for mixed distr ibutions and switching regres-sions. Scand. J. Statist., 5, 81-91. L indsay , B . G . (1981). Properties of the m a x i m u m l ikel ihood estimator of a m i x i n g dis tr i -b u t i o n . In Statistical Distributions in Scientific Work (Eds . C . Ta i l l ie et al.), V o l . 5, 95-110. L indsay , B . G . (1983a). T h e geometry of m i x i n g l ikel ihoods: a general theory. Ann. Statist., 11, 86-94. L indsay , B . G . (1983b). T h e geometry of m i x i n g l ikel ihoods, part II: the exponential family. Ann. Statist, 11, 783-792. L i n h a r t , H . and Z u c c h i n i , W . (1986). Model Selection. W i l e y , N e w Y o r k . Loeve, M . (1977). Probability Theory I, 4 th edit ion. Springer-Verlag, N e w Y o r k . L o u i s , T . A . (1982). F i n d i n g the observed information matr ix when using the E M algor i thm. J.R.S.S. B 44, 226-233. M a c d o n a l d , P . D . M . (1975). E s t i m a t i o n of finite d is t r ibut ion mixtures . In Applied Statistics, R . P . G u p t a ( E d . ) . N o r t h - H o l l a n d , A m s t e r d a m , pp . 231-245. M c L a c h l a n , G . J . and Bas ford , K . E . (1988). Mixture Models. Dekker , N e w Y o r k . Newcomb, S. (1886). A generalized theory of the combinat ion of observations so as to obtain the best result. Amer. J. Math., 8, 343-366. Ni jenhuis , A . and W i l f , H . S . (1978). Combinatorial Algorithms, 2nd ed. Academic Press, N e w Y o r k . O g a t a , Y . (1978). T h e asymptot ic behaviour of m a x i m u m l ikel ihood estimators for station-ary point processes. Ann. Inst. Statist. Math., 30, P a r t A , 243-261. O r c h a r d , T . and W o o d b u r y , M . A . (1972). A missing informat ion principle : theory and ap-plicat ions. Proc. of the Sixth Berkeley Symp. on Math. Statist, and Prob. 1, 697-715. Petr ie , T . (1969). Probabi l i s t i c functions of finite state M a r k o v chains. Ann. Math. Statist., 40, 97-115. P f a n z a g l , J . (1988). Consistency of m a x i m u m l ikel ihood estimators for certain nonparamet-ric families, in part icular : mixtures . J. Stat. Plan. Inf., 19, 137-158. Phelps , R . R . (1966). Lectures on Choquet's Theorem. V a n N o s t r a n d , Pr ince ton . Redner , R . (1981). Note on the consistency of the m a x i m u m l ikel ihood estimate for non-identif iable distr ibutions. Ann. Stat., 9, 225-228. Redner , R . A . and Walker , H . F . (1984). M i x t u r e densities, m a x i m u m l ikel ihood and the E M a lgor i thm. SI AM Rev., 26, 195-239. 102 Robbins , H . (1956). A n empir ica l Bayes approach to statistics. Proc. of the Third Berkeley Symp. on Math. Statist, and Prob. 1, 157-163. Roberts , A . W . and Varberg , D . E . (1973). Convex Functions. Academic Press, New Y o r k . R u d e m o , M . (1972). D o u b l y stochastic Poisson processes and process control . Adv. Appl. Prob. 4, 318-338. Schwarz, G . (1978). E s t i m a t i n g the dimension of a model . Ann. Statist., 6, 461-464. Silvey, S . D . (1980). Optimal Design. C h a p m a n and H a l l , L o n d o n . S imar , L . (1976). M a x i m u m l ikel ihood est imation of a compound Poisson process. Ann. Statist, 4, 1200-1209. S m i t h , J . and K a r r , A . (1985). Stat is t ica l inference for point process models of rainfal l . Water Resour. Res., 21, 73-79. Snyder , D . L . (1972). F i l t e r i n g and detection for doubly stochastic Poisson processes. IEEE Trans. Info. Thy., IT-18, 91-102. Snyder , D . L . (1975). Random Point Processes. W i l e y , N e w Y o r k . Stone, M . (1974). Cross-val idatory choice and assessment of stat ist ical predictions (wi th Discussion) . J.R.S.S. B , 36, 111-147. Stone, M . (1977). A n asymptot ic equivalence of choice of model by cross-validation and Akaike 's cr i ter ion. J.R.S.S. B , 39, 44-47. Sundberg, R . (1974). M a x i m u m l ikel ihood theory for incomplete data f r o m an exponential family. Scand. J. Statist. 1, 49-58. Sundberg, R . (1976). A n iterative method for solution of the l ike l ihood equations for i n -complete data f r o m exponential families. Comm. Stat. B 5, 55-64. Teicher, H . (1960). O n the mixture of distr ibutions. Ann. Math. Statist., 31, 55-73. Teicher, H . (1961). Identif iabil i ty of mixtures . Ann. Math. Statist, 32, 244-248. Teicher, H . (1963). Identif iabil i ty of finite mixtures . Ann. Math. Statist, 34, 1265-1269. Teicher, H . (1967). Identif iabil i ty of mixtures of product measures. Ann. Math. Statist., 38, 1300-1302. T i t t e r i n g t o n , D . M . , S m i t h , A . F . M . and M a k o v , U . E . (1985). Statistical Analysis of Finite Mixture Distributions. W i l e y , N e w Y o r k . Varadara jan , V . S . (1958). O n the convergence of sample probabi l i ty distr ibutions. Sankhya 19, 23-26. W a l d , A . (1949). Note on the consistency of the m a x i m u m l ikel ihood estimate. Ann. Math. Statist, 20, 595-601. West , M . , H a r r i s o n , P . J . , and M i g o n , H . S . (1985). D y n a m i c generalized l inear models and Bayesian forecasting. J.A.S.A., 80, 73-83. 103 W i t t m a n n , B . K . , R u r a k , D . W . , and Taylor , S. (1984). Real - t ime ultrasound observation of breathing and body movements i n fetal lambs f rom 55 days gestation to term. Abstract presented at the XI Annual Conference, Society for the Study of Fetal Physiology, O x f o r d . Y a s h i n , A . I . (1970). F i l t e r i n g of j u m p processes. E n g l , trans, of Avtomatika i Telemekhanika 5, 52-58. 104 Appendix A Fortran program for computing maximum penal-ized likelihood estimates C This program calculates maximum likelihood estimates for mixtures of C Poisson distributions. For each value of m = # of components, C the constrained mle i s found using the EM algorithm C with starting values obtained from partitions of the data values. real lambda(6),lmax(6),p(6),pmax(6) real 11, llprev, Umax, aic, bic, t o l , mean integer f(8),x(8),group(8),fpos(8),t,h,r(99).total,rtotal integer m,k,j,flag,npos,conv,n,max,it,dim,totit,maxit C Read in the data, maximum # of iterations and stopping criterion. read(l,120) (f(j),j=l,8) 120 format(8i5) read(1,130) maxit,tol 130 format(i6,f9.6) max=l do 10 j=l,8 10 if(f(j).gt.0) max=j n=f(l) do 15 j=2,max 15 n=n+f(j) write(7,705)(f(j),j=l,max) 705 formatC Frequency distribution: ',15i5/) write(7,706)maxit,tol 706 format(' max. # i t e r . : ' , i 6 , ' stopping criterion:',g8.1) mean=f(2) do 20 j=3,max 20 mean=mean+f(j)*(j-l) mean=mean/n if(max.gt.2) goto 25 write(7,710)mean 710 format(' Only 0,1 counts so mle has 1 component: ',f8.4) stop C Output results for 1 component. 25 m=l llprev=(mean*alog(mean)-mean) *n dim=l aic=llprev-dim bic=llprev-0.5*dim*alog(float(n)) write(7,718) 718 formatC/' ',23x,' # EM ',' log-') write(7,719) 719 formatC ',23x,'iterations ', + 'likelihood' ,10x, 'AIC ,12x, 'BIC') write(7,720) llprev,aic,bic 720 formatC/' 1 component:',21x,3gl5.6) write(7,730) mean 730 formatC Estimate of the mean: ',fll.5) C Replace the frequencies with non-zero frequencies and corresponding values. npos=0 do 30 j=l,max if(f(j).eq.0) goto 30 npos=npos+l 105 fpos(npos)=f(j) x(npos)=j-l 30 continue C Iterate over the number of components. 98 m=m+l totit=0 dim=2*m-l flag=0 llmax=llprev C Enumerate a l l partitions of the data into m groups. For each c a l l the mle C procedure with starting values calculated from the partition. total=npos-m r(l)=total t=total h=0 do 911 k=2,m 911 r(k)=0 915 rtotal=0 do 40 k=l,m lambda(k)=0.0 p(k)=0.0 do 42 j=rtotal+l,rtotal+r(k)+l group(x(j)+l)=k p(k)=p(k)+fpos(j) 42 lambda(k)=lambda(k)+x(j)*fpos(j) if(p(k).lt.l.0) goto 900 lambda(k)=lambda(k)/p(k) p(k)=float(p(k))/n 40 rtotal=rtotal+r(k)+l write(8,840)(x(j),j =1,npos) 840 formatC count:', 15i4) write(8,845)(group(x(j)+l),j=l,npos) 845 formatO group:15i4) conv=0 c a l l em(fpos,x,npos,p,lambda,n,m,maxit,tol,conv,it,11) totit=totit+it if(11.It.Umax) goto 900 flag=l llmax=ll convmax=conv do 50 k=l,m pmax(k)=p(k) 50 lmax(k)=lambda(k) C Here we generate the next partition of the data. 900 if(r(m).eq.total) goto 999 i f ( t . g t . l ) h=0 h=h+l t=r(h) r(h)=0 r(l)=t-l r(h+l)=r(h+l)+l goto 915 C A l l partitions have been generated. 999 if(flag.eq.l) goto 199 write(7,740) m 740 format(' No increase in likelihood for',i2,' components') stop 199 aic=llmax-dim bic=llmax-0.5*dim*alog(float(n) ) 106 ifCconvmax.eq.O) goto 101 write(7,770) m,totit,Umax,aic,bic 770 formatC/' ' , i l , ' components:',llx,i6,3x,3gl5.6) goto 102 101 write(7,771) m,totit,Umax,aic,bic 771 formatC/' ' , i l , ' components:',' (no conv.)',i6,3x,3gl5.6) 102 write(7,769) 769 formatC Estimates: probability mean') write(7,790)(pmax(k).ImaxCk),k=l,m) 790 formatC ', 19x,f7.5,f 9.5) if(llmax-llprev.lt.1.0e-6*abs(llprev)) stop llprev=llmax if(m.lt.npos) goto 98 stop end C This subroutine computes log-likelihood and posterior prob's of E-step. subroutine llik(f,x,max,p,lambda,m,loglik,cp) real lambda(6),p(6),wsum,cp(8,6),loglik integer f(8),x(8),j,max,m,k loglik=0.0 do 31 j=l,max wsum=0.0 do 32 k=l,m ifCxCj).eq.0) goto 37 if(lambda(k).It.1.0e-9) goto 38 cp(j,k)=p(k)*exp(x(j)*alog(lambda(k))-lambda(k)) goto 32 38 cp(j,k)=0.0 goto 32 37 cp(j ,k)=p(k)*exp(-lambda(k)) 32 wsum=wsum+cp(j,k) do 33 k=l,m 33 cp(j ,k)=cp(j ,k)/wsum 31 loglik=loglik+f(j)*alog(wsum) return end C This subroutine performs the EM algorithm with given starting values. subroutine em(f,x,max,p,lambda,n,m,maxit,tol,conv,iter,loglik) real lambda(6),p(6),lambnew(6),pnew(6),cp(8,6).error,loglik,tol integer f(8),x(8),j,n,k,m,iter,maxit,max,conv iter=0 1 iter=iter+l C Compute posterior probabilities cp(j,k) and log-likelihood, c a l l llik(f,x,max,p,lambda,m,loglik,cp) i f ( i t e r . g t . l ) goto 199 write(8,870) loglik 870 formatC Starting values: (loglik= ',gl8.7,')') write(8,871) 871 formatC' probability mean') write(8,872) CpCk).lambdaCk),k=l,m) 872 formatC \fll.6,f9.6) C Now compute the updated estimates, lambnewCk).pnewCk). 199 do 40 k=l,m pnewCk)=0.0 lambnewCk)=0.0 do 45 j=l,max pnewCk)=pnewCk)+fCj)*cpCj,k) 45 lambnewCk)=lambnewCk)+xCj)*fCj)*cpCj,k) 1ambne w C k)=1ambnewCk)/pnewCk) 107 40 pnew(k)=pnew(k)/n C Check whether anything has changed much since last iteration, if(iter.eq.1) goto 55 do 50 k=l,m error=pne w ( k) -p ( k) if(pnew(k).lt.l.0e-7) goto 51 if(abs(error)/pnew(k).gt.tol) goto 55 51 error=lambnew(k)-lambda(k) if(lambnew(k).lt.l.0e-7) goto 50 if(abs(error)/lambnew(k).gt.tol) goto 55 50 continue conv=l goto 99 C If anything has changed, update the values of p(k), lambda(k). 55 do 56 k=l,m lambda(k)=lambneo(k) 56 p(k)=pnew(k) if(iter.It.maxit) goto 1 99 write(8,880) loglik 880 formatC Final values: (loglik= ',gl8.7,')') write(8,872) (p(k).lambda(k),k=l,m) if(conv.eq.0) goto 849 write(8,873) iter 873 formatC Convergence achieved after ',i5,' iterations.'/) goto 850 849 write(8,874) iter 874 formatC A l l ',i5, + ' iterations performed - convergence not achieved.'/) 850 return end Appendix B Fortran program for computing maximum likeli-hood estimates for hidden Markov models C This program computes the maximum likelihood estimate for hidden Markov models C with Poisson components and given # of components, using the EM algorithm, C with starting values obtained from partitions of the data values. real lambda(6),lmax(6),p(6,6),pmax(6,6),u(6,240),v(6,6,240) real alpha(6) ,11,Umax,aic.bic,mean,tol.epsilon integer y(240),f(8),x(8),group(8),r(99),n,max,m,dim,npos integer j,k,i,conv,cwarn,maxit,flag,t,h,total,totit,it,npart C Read i n data, # components, maximum # iterations, and stopping criterion. read(1,500) n 500 format(i3) read(l,510)(y(i),i=l,n) 510 format(30i2) read(l,520) maxit,tol.m.epsilon 520 format(i6,lx,f8.6,lx,il,lx,f8.6) C Calculate the frequency distribution and mean, do 5 j=l,8 5 f(j)=0 max=l 108 do 10 i=l,n j=y(i)+l if(j.gt.max) max=j 10 f(j)=f(j)+l mean=f(2) do 15 j=3,max 15 mean=mean+f(j)*(j-l) mean=mean/n write(7,700) n 700 formatC Data: (n=',i4,')') write(7,710)(y(i),i=l,n) 710 format(30i2) write(7,720) (f(j),j=l.max) 720 formatC frequency distribution:', 15i4) write(7,730) epsilon.tol 730 formatC epsilon:' ,f8.6,' stopping criterion: ',f8.6) llmax=(mean*alog(mean)-mean)*n aic=llmax-l bic=llmax-0.5*alog(float(n)) write(7,740) 740 format(/' ',33x,' log-') write(7,750) 750 formatC ',33x,'likelihood',9x,'AIC,12x,'BIC') orite(7,760) llmax.aic,bic 760 format(/' 1 component:',20x,3gl5.6) write(7,770) mean 770 formatC Estimated rate:' ,f9.5) if(m.le.npos) goto 100 write(7,780) 780 formatC m must be <= number of distinct data values. ') stop C Replace the frequencies with non-zero frequencies and corresponding values. 100 npos=0 do 20 j=l,max if(f(j).eq.0) goto 20 npos=npos+l x(npos)=j 20 continue totit=0 cwarn=l flag=0 dim=m*m C Enumerate a l l partitions of the observed counts into m groups. Each C partition yields i n i t i a l values of v and em is entered at the M-step. npart=0 total=npos-m r(l)=total t=total h=0 do 900 i=2,m 900 r(i)=0 910 npart=npart+l rtotal=0 do 25 j=l,m alpha(j)=0.0 do 27 i=rtotal+l,rtotal+r(j)+l alpha(j)=alpha(j)+f(x(i)) 27 group(x(i))=j alpha (j ) =alpha (j )/n 109 25 rtotal=rtotal+r(j)+l write(8,800) ( x ( j ) - l , j=l ,npos) 800 formatC count:',8i4) write(8,810) (group(x(j)),j=l,npos) 810 formatC group: ',8i4) do 29 i=l,n do 30 j=l,m do 31 k=l,m 31 v(j,k,i)=epsilon*alpha(j)*alpha(k) 30 u(j,i)=epsilon*alpha(j) 29 continue do 33 i=2,n j=group(y(i-l)+l) k=group(y(i)+l) v(j ,k,i)=(l .0-epsilon)+v(j , k, i) 33 u(j,i-l)=(1.0-epsilon)+epsilon*alpha(j) j=group(y(n)+l) u(j ,n)=(1.0-epsilon)+epsilon*alpha(j) conv=0 c a l l em(n,y,max,m,p,lambda,u,v,maxit,tol,conv,it,11) if(conv.eq.0) cwarn=0 totit=totit+it if(11.It.Umax) goto 50 flag=l llmax=ll do 45 j=l,m do 47 k=l,m 47 pmax(j ,k)=p(j ,k) 45 lmax(j)=lambda(j) C Here we generate the next partition of the data. 50 if(r(m).eq.total) goto 999 i f ( t . g t . l ) h=0 h=h+l t=r(h) r(h)=0 r(l)=t-l r(h+l)=r(h+l)+l goto 910 C A l l partitions have been generated. 999 write(7,791) npart,totit 791 formatC ',i3,' partitions,' ,i7,' EM iterations') if(cwarn.eq.l) goto 60 write(7,788) 788 formatC WARNING: not a l l starting values gave convergence.') 60 if(flag.eq.l) goto 55 write(7,789) Umax 789 formatC No estimate found with likelihood >',gl5.6) stop 55 aic=llmax-dim bic=llmax-0.5*dim*alog(float(n)) write(7,790) m,Umax,aic,bic 790 format(/' ' , i l , ' components:',19x,3gl5.6) write(7,794) 794 formatC Estimates:'/' rates trans, probs.') do 80 j=l,m 80 write(7,796)lmax(j),(pmax(j,k),k=l,m) 796 formatC ' ,f9.6,2x,7f9.6) stop end 110 C This subroutine performs the EM algorithm. subroutine em(n,y,max,m,p,lambda,u,v,maxit,tol,conv,it,loglik) integer y(240),n,m,i,j,k,maxit,it,conv,likexp,max real p(6,6) , lambda(6),pnew(6,6) , lambneH(6),pdf(8,6) real u(6,240),v(6,6,240),vsuml(6),vsum2,usum real error,tol,loglik,lik.logten logten=alog(10.0) it=0 goto 5 1 it=it+l C Compute Poisson pdf, conditional prob.s, log-lik. for current estimates, do 2 j=l,m pdf(1,j)=exp(-lambda(j)) do 3 i=2,max 3 pdf(i,j)=pdf(i-1,j)*lambda(j) 2 continue c a l l posterior(n,y,m,lambda,p,pdf,u,v,lik,likexp) loglik=alog(lik)+likexp*logten i f ( i t . g t . l ) goto 5 write(8,820) loglik 820 formatC Starting values: (log-lik=',gl4.6,')'/ +' rates trans, prob.s') do 10 j=l,m 10 write(8,830) lambda(j),(p(j,k),k=l,m) 830 formatC ' ,f9.6,2x,7f9.6) C Compute updated estimates lambnew(j), pnew(j.k). 5 do 20 j=l,m usum=0.0 lambnew(j)=0.0 do 30 i=l,n lambnew(j)=lambnew(j)+u(j,i)*y(i) 30 usum=usum+u(j,i) 1ambnew(j)=lambnew(j)/usurn vsum2=0.0 do 40 k=l,m vsuml(k)=0.0 do 50 i=2,n 50 vsuml(k)=vsuml(k)+v(j,k,i) 40 vsum2=vsum2+vsuml(k) do 60 k=l,m 60 pnew(j ,k)=vsuml(k)/vsum2 20 continue C Check whether anything has changed much since last iteration. if(it.eq.O) goto 130 do 110 j=l,m do 120 k=l,m error=pnew(j,k)-p(j,k) if(pnew(j,k).lt.l.0e-7) goto 120 if(abs(error)/pnew(j,k).gt.tol) goto 130 120 continue error=lambnew(j)-lambda(j ) if(lambnew(j).lt.l.0e-7) goto 110 if(abs(error)/lambnew(j).gt.tol) goto 130 110 continue conv=l goto 160 C If anything has changed, update the values of p(m) and lambda(m). 130 do 140 j=l,m do 150 k=l,m 111 150 p(j ,k)=pnew(j ,k) 140 lambda(j)=lambnew(j) if(it.It.maxit) goto 1 160 write(8,840) loglik 840 formatC Final estimates: (log-lik=' ,gl4.6,') ') do 170 j=l,m 170 write(8,830) lambda(j),(p(j,k),k=l,m) if(conv.eq.0) goto 180 write(8,850) i t 850 formatC Convergence achieved after ',i5,' iterations.'/) goto 190 180 write(8,860) i t 860 formatC A l l ',i5,' iterations performed - no convergence.'/) 190 return end C This subroutine computes the conditional probabilities in the E-step. subroutine posterior(n,y,m,lambda,p,pdf,u,v,lik,likexp) integer y(240),ascale(240),bscale(240),e(240) integer n,m,i,j,k,likexp,asc,bsc real a(6,240),b(6,240),u(6,240),v(6,6,240),p(6,6),lambda(6) real asum,bsum,lik,pdf(8,6) ascale(l)=0 do 10 j = l,m 10 a(j,l) = pdf(y(l)+l,j)/float(m) do 20 i = 2,n asum =0.0 do 30 j = l,m a(j,i) = 0.0 do 40 k = l , i 40 a(j,i) = a(j,i) + a(k,i-l)*p(k,j)*pdf(y(i)+l,j) 30 asum=asura+a(j,i) asc=0 19 if(asum.gt.1.0e-l) goto 18 asc=asc+l asum=asum*10.0 goto 19 18 do 50 j = l,io 50 a(j ,i)=a(j ,i)*10.0**asc ascale(i)=ascale(i-1)+asc 20 continue do 60 j = l,m 60 b(j,n) = 1 bscale(n)=0 do 70 i = n-1,1,-1 bsum = 0.0 do 80 j = l,m b(j,i) = 0.0 do 90 k = l,m 90 b(j,i) =b(j,i) + b(k,i+l)*p(j,k)*pdf(y(i+l)+l,k) 80 bsum=bsum+b(j , i) bsc=0 69 if(bsum.gt.1.0e-l) goto 68 bsc=bsc+l bsum=bsum*10.0 goto 69 68 do 100 j = l.m 100 b(j,i)=b(j,i)*10.0**bsc bscale(i)=bscale(i+1)+bsc 70 continue 112 lik=0.0 do 110 j=l,m 110 lik=lik + a(j,n) do 120 i=l,n e(i)=ascale(n)-ascale(i)-bscale(i) do 130 j=l,m 130 u(j,i)=a(j,i)*b(j,i)*10.0**e(i)/lik 120 continue do 140 i=2,n e(i)=ascale(n)-ascale(i-1)-bscale(i) do 150 j=l,m do 160 k=l,m 160 v(j,k,i) = + p(j,k)*a(j,i-l)*pdi(y(i)+l,k)*b(k,i)*10.0**e(i)/lik 150 continue 140 continue likexp=-ascale(n) return end 113
You may notice some images loading slow across the Open Collections website. Thank you for your patience as we rebuild the cache to make images load faster.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Maximum likelihood estimation for mixture distributions...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Maximum likelihood estimation for mixture distributions and hidden Markov models Leroux, Brian
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Maximum likelihood estimation for mixture distributions and hidden Markov models |
Creator |
Leroux, Brian |
Publisher | University of British Columbia |
Date Issued | 1989 |
Description | This thesis deals with computational and theoretical aspects of maximum likelihood estimation for data from a mixture model and a hidden Markov model. A maximum penalized likelihood method is proposed for estimating the number of components in a mixture distribution. This method produces a consistent estimator of the unknown mixing distribution, in the sense of weak convergence of distribution functions. The proof of this result consists of establishing consistency results concerning maximum likelihood estimators (which have unrestricted number of components) and constrained maximum likelihood estimators (which assume a fixed finite number of components). In particular, a new proof of the consistency of maximum likelihood estimators is given. Also, the large sample limits of a sequence of constrained maximum likelihood estimators are identified as those distributions minimizing Kullback-Leibler divergence from the true distribution. If the number of components of the true mixture distribution is not greater than the assumed number, the constrained maximum likelihood estimator is consistent in the sense of weak convergence. If the assumed number of components is exactly correct, the estimators of the parameters which define the mixing distribution are also consistent (in a certain sense). An algorithm for computation of maximum likelihood estimates (and the maximum penalized likelihood estimate) is given. The EM algorithm is used to locate local maxima of the likelihood function and a method of automatically generating "good" starting values for each possible number of components is incorporated. The estimation of a Poisson mixture distribution is illustrated using a distribution of traffic accidents in a population and a sequence of observations of fetal movements. One way of looking at the finite mixture model is as a random sample of "states" from a mixing distribution and a sequence of conditionally independent observed variables with distributions determined by the states. In the hidden Markov model considered here, the sequence of states is modelled by a Markov chain. The use of the EM algorithm for finding local maxima of the likelihood function for the hidden Markov model is described. Problems arising in the implementation of the algorithm are discussed, including the automatic generation of starting values and a necessary adjustment to the forward-backward equations. The algorithm is applied, with Poisson component distributions, to the sequence of observations of fetal movements. The consistency of the maximum likelihood estimator for the hidden Markov model is proved. The proof requires the consideration of identifiability, ergodicity, entropy, cross-entropy, and convergence of the log-likelihood function. For instance, the conclusion of the Shannon-McMillan-Breiman theorem on entropy convergence is established for hidden Markov models. A class of doubly stochastic Poisson processes which corresponds to a continuous time version of the hidden Markov model is also considered. We discuss some preliminary work on the extension of the EM algorithm to these processes, and also the possibility of applying our method of proof of consistency of maximum likelihood estimators. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-10-14 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0098216 |
URI | http://hdl.handle.net/2429/29176 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-UBC_1989_A1 L47.pdf [ 5.84MB ]
- Metadata
- JSON: 831-1.0098216.json
- JSON-LD: 831-1.0098216-ld.json
- RDF/XML (Pretty): 831-1.0098216-rdf.xml
- RDF/JSON: 831-1.0098216-rdf.json
- Turtle: 831-1.0098216-turtle.txt
- N-Triples: 831-1.0098216-rdf-ntriples.txt
- Original Record: 831-1.0098216-source.json
- Full Text
- 831-1.0098216-fulltext.txt
- Citation
- 831-1.0098216.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0098216/manifest