Inference of Rates across Sites via anExpectation Maximization AlgorithmbyTingting ZhaoB.Sc., Zhejiang University, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2013c? Tingting Zhao 2013AbstractThe rates of nucleotide substitution can be different from genes to genes.Moreover, different regions of the same gene can have different rates of mu-tation as well. Many attempts have been tried to allow for the variable ratesacross different nucleotide sites. A rate factor coming from the continuousdistribution has been introduced to deal with the problem. However, forcomputation reasons, this method can only scale to less than a dozen se-quences. Later studies use a discrete gamma distribution to approximatethe gamma distribution.The main contribution of our work is that we propose a discrete distri-bution over the rate factor which is more flexible while preserving attractivecomputational properties. We make inference about the rate factor and itsdistribution via an Expectation Maximization (EM) algorithm. We evaluateour method by both simulations and a real dataset. From the real dataset, itreflects that the method is useful for large phylogenies with even thousandsof sequences. We analyze the identifiability of our model for a pair of DNAsequences under certain conditions. We also prove for certain types of ratematrices, this model is non-identifiable.iiPrefaceThis dissertation is original, unpublished, independent work by the au-thor, Tingting Zhao.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Introduction to common DNA Evolution Models . . . . . 52.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Models and Data Structure . . . . . . . . . . . . . . . . . . . 62.2.1 Data Structure . . . . . . . . . . . . . . . . . . . . . . 62.2.2 Model of DNA Evolution in the framework of CTMC 62.2.3 Rate Matrices of Different Models . . . . . . . . . . . 7ivTable of Contents2.2.4 Models considering Rate Variation between DifferentNucleotide Sites . . . . . . . . . . . . . . . . . . . . . 93 Likelihood Methods . . . . . . . . . . . . . . . . . . . . . . . . 133.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Likelihood for a pair of DNA sequences . . . . . . . . . . . . 133.2.1 Without Rate Heterogeneity . . . . . . . . . . . . . . 133.2.2 With Rate Heterogeneity . . . . . . . . . . . . . . . . 143.3 Likelihood for Trees . . . . . . . . . . . . . . . . . . . . . . . 153.3.1 Without Rate Heterogeneity . . . . . . . . . . . . . . 153.3.2 With Rate Heterogeneity . . . . . . . . . . . . . . . . 174 Estimating the Rate Factor ? via an EM Algorithm . . . 214.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 Learn ? and f via an EM Algorithm for a pair of DNA se-quences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2.1 The likelihood for a pair of DNA sequences . . . . . . 214.2.2 Learn ? and f via an EM Algorithm . . . . . . . . . 234.3 Likelihood for a tree with multiple DNA sequences . . . . . 284.3.1 Complete Likelihood for Multiple DNA sequences . . 304.4 Expectation Maximization to learn ? and f for Trees . . . . . 374.4.1 E step . . . . . . . . . . . . . . . . . . . . . . . . . . 374.4.2 M step . . . . . . . . . . . . . . . . . . . . . . . . . . 385 Identifiability of the Model . . . . . . . . . . . . . . . . . . . . 425.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 425.2 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435.3 Identifiability of ? with two categories . . . . . . . . . . . . . 45vTable of Contents5.4 Non-identifiability of F81 Model . . . . . . . . . . . . . . . . 475.5 Identifiability of Models with at least 3 different eigenvalues 515.5.1 F84 Model . . . . . . . . . . . . . . . . . . . . . . . . 525.5.2 HKY85 Model . . . . . . . . . . . . . . . . . . . . . . 535.6 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 566 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . 616.2 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . 636.3 Data Analysis via an EM Algorithm . . . . . . . . . . . . . . 657 Conclusions and Future Work . . . . . . . . . . . . . . . . . . 68AppendicesA First Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . 69B Second Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . 71viList of Tables2.1 Numbers of nucleotide substitutions per site (K) between cowand goat ?? and ??globin genes and between cow and goat??globin pseudogenes cited from ? . . . . . . . . . . . . . . 105.1 Estimating ? and f with ?initial and finitial . . . . . . . . . . 575.2 Average of nllk of the generated dataset given different estimates 586.1 Estimating ? and f with different ?initial and finitial with partof alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . 666.2 AIC and BIC for different Models . . . . . . . . . . . . . . . 67viiList of Figures3.1 Tree Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.1 Tree Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . 295.1 negative log-likelihood given ? by grid search . . . . . . . . . 595.2 level plot of negative log-likelihood given ? by grid search . . 596.1 percentage of sites with different numbers of missing data . . 626.2 log-likelihood w.r.t ? at site ?1? . . . . . . . . . . . . . . . . 646.3 log-likelihood w.r.t ? at site ?2? . . . . . . . . . . . . . . . . 646.4 log-likelihood w.r.t ? at site ?3? . . . . . . . . . . . . . . . . 656.5 Histogram of ??MLE by grid search . . . . . . . . . . . . . . . 66viiiAcknowledgementsFirst and foremost I would like to express my deepest gratitude to mysupervisor Dr. Alexandre Bouchard-Co?te? for his guidance, insights and en-couragement for the last two years. I feel lucky to work with him since hehas taken me to start a wonderful journey in the research world. I am in-fluenced by his enthusiasm and attitude for research and work. I would liketo thank him for his dedication in revising the thesis. I am also grateful toDr. Lang Wu for being my second reader and his precious suggestions andcomments.Second, I would like to thank my peers including Yanling(Tara) Cai,Yongliang(Vincent) Zhai, Yi Huang, Yunlong Nie, Seong-Hwan Jun, Tingt-ing Yu and Andy Leung for sharing their research experiences and discus-sions with me.Last, I would like to thank my parents for bringing me to this world andalways trying their best to help me realize my dreams. This thesis will serveas a gift to them.ixChapter 1Introduction1.1 MotivationPhylogenetics is the study of the evolutionary relationships among groupsof organisms. Continuous time Markov chains (CTMC) are at the core ofmodern phylogenetic methods to model the evolutionary process of DNAsequences. Several different DNA models have been proposed such as theJC69 Model by ?, the K80 Model by ? and so on. If DNA sequences areavailable for several species, we can make inference about the phylogeny ofthese species via the maximum likelihood method and obtain estimates ofthe parameters such as the topology of the tree, the transition matrix andthe branch lengths.If there is no rate variation, all sites share the same rate matrixQ describ-ing the instantaneous rate of different kinds of substitutions with differentbases. However, it has been discovered that the mutation rates of differentregions of the same gene can be different as shown by ?. To account forthe rate variation over sites, several approaches have been proposed. ? pro-posed a rate factor coming from the gamma distribution. A rate factor ? isa parameter assigned to each site to adjust the rate variation by multiplyingit to the rate matrix Q for this site. However, for computational reasons,this method can only scale to less than a dozen sequences. To simplify the11.1. Motivationcomputation, ? uses the ?discrete gamma distribution? to approximate thecontinuous gamma distribution, but he points out there is overestimationor underestimation of the shape parameter of the gamma distribution givendifferent tree topologies. Moreover, there is no direct biological reasons tofavour the gamma distributions of the rate factor. We propose a new modelto allow the rate variation which is more flexible while preserving attractivecomputational properties.We assume a discrete distribution over the rate factor without other re-strictions. We make inference about the rates and their distribution via anEM algorithm for both pairs of sequences as well as trees with large phy-logenies. We evaluate the method with a real dataset. It reflects that themethod is practical for large phylogenies with even thousands of sequences.When doing simulations to check whether the EM algorithm can recoverthe rate factor and its distribution used to generate the dataset, we find thatdifferent rate factors and distributions can give a similar likelihood of thesame generated dataset. We are motivated to study the identifiability of themodel. In the context of phylogenetics, a model is non-identifiable if differentset of parameters including tree topologies, branch lengths and evolutionaryparameters can produce the same likelihood. Many people contributed toinvestigate the identifiability of models with the rate factor coming from acontinuous distribution like the gamma distribution with mean one. ? usedthe F81 Model and discovered that the shape parameter of the gamma dis-tribution and the topology of the tree are not identifiable. ? proved theidentifiability of general time reversible (GTR) models. ? proved that thefour-state GTR + ? model is identifiable given the joint distribution of atleast triples of taxa. However, few attempts are done in the literature tostudy the identifiability of the rate scalar from a discrete distribution.21.2. OutlineThe reason leads to that is the difficulty in obtaining the solution of therate factor and its distribution of the inverse moment generating function.For the gamma distribution with mean one, its inverse moment generatingfunction is only determined by the shape parameter. In our work, we provethe non-identifiability of the JC69 Model and the F81 Model of a pair ofDNA sequences in the context of four distinct category modes. We alsoprove the non-identifiability of other DNA evolution models with two orthree distinct eigenvalues under certain conditions. The identifiability ofthe models under tree structures is still an open question.1.2 OutlineIn Chapter 2, we review some popular Markov models of DNA sequenceevolution in the framework of CTMC. After introducing the rate matricesfor different DNA evolution models, we illustrate how to deal with unequalevolution rates of different sites.In Chapter 3, we introduce how to calculate the likelihood for a pairof homologous DNA sequences without rate variations and also with rateheterogeneity. Moreover, we also provide how to calculate the likelihood ofa given tree in those two situations.In Chapter 4, we propose a discrete distribution of the rate factor todeal with rate heterogeneity. We explain how to make inference of the ratefactor and its distribution via an EM algorithm for a pair of homologousDNA sequences and generalize it to a tree.In Chapter 5, we summarize the previous work and results about theidentifiability of models assuming the rate factor coming from a gammadistribution. We also analyze the identifiability of our model assuming a31.2. Outlinediscrete distribution. We prove the non-identifiability of the JC69 Modeland the F81 Model of a pair of DNA sequences in the context of four dis-tinct category modes. We also prove the non-identifiability of other DNAevolution models under certain conditions. The identifiability of the modelsunder tree structures is still an open question for future work.In Chapter 6, we evaluate our EM algorithm with a real dataset with1028 DNA sequences. It reflects that our algorithm is computationally at-tractive for trees with large phylogenies.In the conclusion and future work part, we summarize the results of thethesis and discuss possible future work.4Chapter 2Introduction to commonDNA Evolution Models2.1 IntroductionIn this chapter, we review some popular Markov models of DNA sequenceevolution in the framework of continuous time Markov chains (CTMC).Since the time of divergence between different pairs of homologous DNAsequences descending from a common ancestral sequence can widely vary,different branch lengths are introduced to represent the expected number ofnucleotide substitutions between sequences. Time homogeneity is assumedin the continuous time Markov chains model so that the instantaneous ratematrix can be used to describe the substitution process. The difference be-tween these models lies in the parameters describing the rates of differentsubstitutions. The JC69 Model assumes equal transition rates, for example,the probability of a nucleotide A to change into other nucleotides G, T orC is the same while the K80 Model considers that the probability betweenpurines such as a nucleotide A to change into G is larger than the changesbetween purines and pyrimidines such as a nucleotide A to change into T orC because of the similarity in the structure of A and G.After introducing the rate matrices for different models, we also intro-52.2. Models and Data Structureduce how to deal with rates variations of different sites by introducing arate factor multiplying the rate matrix of each site proposed by ?. To sim-plify the calculation, ? uses the ?discrete gamma distribution? model toapproximate the continuous gamma distribution.2.2 Models and Data Structure2.2.1 Data StructureThe data comes from DNA sequences from homologous regions for speciesi ? {1, 2, . . . , n1}. Let X = {xij} be the aligned nucleotide sequences, wherei ? {1, 2, . . . , n1}, j ? {1, 2, ..., n2}. Then n2 is the number of nucleotidesper sequence. Each column of the data matrix xj = {x1,j , ..., xn1,j} spec-ifies the nucleotides for the n1 sequences at the jth site. The site is theposition of a nucleotide in DNA sequences. Each row of the data matrixxi = {xi,1, ..., xi,n1} represents all the nucleotides of the ith DNA sequence.X =????????x1,1 x1,2 ? ? ? x1,n2x2,1 x2,2 ? ? ? x2,n2....... . ....xn1,1 xn1,2 ? ? ? xn1,n2????????2.2.2 Model of DNA Evolution in the framework of CTMCIn the framework of CTMC, ? = {A,G, T,C} represents the state spaceconsisting four kinds of nucleotides in DNA sequences. Each individualentry refers to the probability that the state i will change into the state j,where i, j ? ?. P (t) is the transition matrix, where t is the branch lengthrepresenting the expected number of nucleotide substitution for a pair of62.2. Models and Data Structurehomologous DNA sequences descending from a common ancestral sequence.P (t) =????????pAA(t) pAG(t) pAC(t) pAT (t)pGA(t) pGG(t) pGC(t) pGT (t)pCA(t) pCG(t) pCC(t) pCT (t)pTA(t) pTG(t) pTC(t) pTT (t)????????The instantaneous rates of change from one state to another is reflectedby Q where Q = dP (t)dt with P0 = I. In turn, Pt = etQ =?j(tQ)jj! ,Q =????????? ?AG ?AT ?AC?GA ? ?GT ?GC?TA ?TG ? ?TC?CA ?CG ?CT ?????????.The diagonal elements are specified to make sure the sum of each rowin the Q matrix is zero. The Q matrix will also be constrained by multiply-ing each element of the matrix by a same factor ? = ?1/?i{A,C,G,T} piiQii,where pii is the stationary distribution of the rate matrix. This normalizationensures a branch length of one yields one expected change per nucleotide.2.2.3 Rate Matrices of Different ModelsJC69 Model proposed by ?Q =????????? ?4?4?4?4 ??4?4?4?4 ??4?4?4?4 ?????????72.2. Models and Data StructureIn this model, the stationary distribution is pi = (14 ,14 ,14 ,14) and ? is thestandardization factor, where ? = ?43 .K80 Model proposed by ?Q =????????? ? 1 1? ? 1 11 1 ? ?1 1 ? ?????????In this model, the standardization factor ? = 1/4 (?+ 2). The ? in the ma-trix is the ratio of transition and transversion which are two types of DNAsubstitution mutations. Transitions are interchanges between both purinesincluding A and G or both pyrimidines including C and T. Transversions areinterchanges between purines and pyrimidines. Purines are two-ring struc-ture while pyrimidines are one-ring structures. As a result, it is typicallyassumed that transitions are more likely to happen than transversions i.e.? > 1.F81 Model proposed by ?Q =????????? piC piA piGpiT ? piA piGpiT piC ? piGpiT piC piA ?????????This model allows for different base frequencies for four different states A,G, T and C. The stationary distribution is {piA, piG, piT , piC}, where the stan-dardization factor ? is 1/(1? pi2A ? pi2C ? pi2G ? pi2T).82.2. Models and Data StructureHKY85 Model proposed by ?Q =????????? ?piC piA piG?piT ? piA piGpiT piC ? ?piGpiT piC ?piA ?????????The model does not assume equal base frequencies for the four differentstates and accounts for the difference between transitions and transversionswith one parameter ? in the rate matrix Q, where the normalization constant? is 1/(2(piA + piG)(piC + piT ) + 2?(piApiG + piCpiT )).In this thesis, we are using the HKY85 Model in the simulation studysince it can incorporate rate variations considering different base frequenciesand the bias in transitions over transversions shown by many genes. But inthe real dataset, we are using the K80 model since it is more simple thanHKY85 model and we assume a uniform distribution of the four states of{A,G, T,C}.2.2.4 Models considering Rate Variation between DifferentNucleotide SitesAccording to ?, the rate of nucleotide substitution ? is defined as thenumber of substitution per site per year. If no rate variation across nu-cleotide sites is introduced, the rates of substitution are the same for allsites on the DNA sequences according to the Q matrix. For example, as-suming there are 1000 sites on a pair of DNA sequences, then on all thesesites, the rate for a nucleotide to change from A to G is the same. However,this assumption may not hold. We cite a table from ? as Table 2.1 to illus-92.2. Models and Data Structuretrate that the numbers of nucleotide substitutions per site (K) on regions ofgenes can be different.Table 2.1: Numbers of nucleotide substitutions per site (K) between cowand goat ?? and ??globin genes and between cow and goat ??globin pseu-dogenes cited from ?Region Ka5? flanking region 5.3 ? 1.25? untranslated region 4.0 ? 2.0Fourfold degenerate sites 8.6 ? 2.5Introns 8.1 ? 0.73? untranslated region 8.8 ? 0.2Pseudogenes 9.1 ? 0.9The rates in the table are in units of substitutions per site per 109 years.From the table, it reflects that the rates of nucleotide substitution are dif-ferent in different regions. These regions are classified according to differentfunctions that they perform during transcription and translation. Tran-scription is the synthesis of a RNA molecule based on a DNA template andtranslation is the process of the produced RNA during transcription con-veying information to the ribosomes to create proteins.If the rates of change are different for distinct sites, then some sitesevolve more quickly than others. In this situation, ? proposed a rate factor?i for the ith site where ?i comes from the gamma distribution. Then the102.2. Models and Data Structurerate matrix for the ith site isQi = ?i ?????????? ?piC piA piG?piT ? piA piGpiT piC ? ?piGpiT piC ?piA ?????????However, instead of making inference of ?i for each site,? proposed thatall possible rates could be integrated out for each site when calculating thelikelihood which will be covered in the next chapter.It has been noted that using the gamma distribution is very computa-tionally expensive and in order to improve that, ? proposed the ?discretegamma distribution? to approximate the continuous gamma distribution.This project relaxes the assumption of equal rates of substitution for allsites by multiplying Q with one category of a rate factor ? = (?1, . . . , ?k).However, we only assume ? comes from a discrete distribution.Definition 1. We define the rate factor ? = (?1, . . . , ?k) with a discretedistribution f = (f1, f2, . . . , fk) as a parameter assigned to each site to adjustthe rate variation by multiplying one category of ? to the rate matrix Q forthis site, where fk denotes the probability for this site to take the kth categoryof ?.Then for a specific site, the rate matrix for this site is ?j ? Q, wherej ? {1, . . . , k}. For different sites, different elements of ? are taken so thatthe rates are variable on diffferent sites. For example, if for the first 50 siteson a pair of homologous DNA sequences, the rate of substitution is slowerthan the next 50 sites on the same pair of sequences. Then we consider therate matrix for the first 50 sites is ?1?Q and ?2?Q for the second 50 siteswhere ?1 6= ?2. Then the probability for a nucleotide with state A to change112.2. Models and Data Structureinto G is exp(?1Qt){A?G} for a site within the first 50 sites. The probabilityfor a nucleotide with state A to change into G is exp(?2Qt){A?G} for a sitewithin the second 50 sites.12Chapter 3Likelihood Methods3.1 IntroductionIn this chapter, we introduce how to calculate the likelihood of a pair ofDNA sequences under two situations. Identical rates are considered first andrate heterogeneity is introduced later. The computation of the likelihood fora pair of sequences serves as a basis for more complicated situations. Afterthat, we explain how to calculate the likelihood of a given tree. However, inreal cases, we need to make inference of the topology of the tree first. Thedetails of how to infer the topology of the tree and the branch lengths of thetree are introduced by ?.3.2 Likelihood for a pair of DNA sequences3.2.1 Without Rate HeterogeneityConsidering two homologous DNA sequences represented by xT = (x1, x2,. . . , xn) and yT = (y1, y2, ..., yn), we first study how to calculate the likeli-hood for this pair of sequences. The parameters are ? = (Q, t), where Q isthe instantaneous transition rate matrix and t is the branch length betweenthis pair of sequences.133.2. Likelihood for a pair of DNA sequencesFor a single site i on both sequences x and y, the likelihood isL(?) = P (Xi = xi)(eQt)xi?yi ,where xi, yi ? {A,G, T,C}, P (Xi = xi) is the base frequency pi = (piA, piC , piG, piT )which can be derived from the eigenvector of QT . If xi = A, then P (Xi =xi) = piA.For n sites on the sequences, under the assumption of independenceacross different sites, the likelihood isL(?) =n?i=1P (Xi = xi)(eQt)xi?yiThe incomplete log likelihood for n sites on one sequence isl(?) =n?i=1log(P (Xi = xi)(eQt)xi?yi).3.2.2 With Rate HeterogeneityIf different sites evolve at different rates, a rate factor ? is introduced.Recall that in Section 2.2.4 in Chapter 2, we have defined that the ratematrix for the ith site is ?i ?Q, where ?i is the rate factor for the ith site.In the literature, most authors assume the rate factor ? follows a continuousdistribution like the gamma distribution or log normal distribution. Theparameters are ? = (Q, t, ?). Assuming g(?) is the prior density function for?, the likelihood for this pair of sequences isL(?) =n?i=1? ?0P (Xi = xi)(e?iQt)xi?yig(?i)d(?i). (3.1)In this case, we do not estimate the rate at each site, if a single rate isassumed for each site, then the number of unknowns increases quickly asthe number of sites increases. As a result, as shown in Equation 3.1, we143.3. Likelihood for Treesintegrate out all the possible rates for each site and take it as the wholecontribution of this site to the whole likelihood.3.3 Likelihood for Trees3.3.1 Without Rate Heterogeneity? has introduced the ?pruning? method to economize the computationof the likelihood of the tree. In the field of computer science, this methodis known under the name of ?dynamic programming?. The details of howto implement this method to a tree is covered in Chapter 16 of ?. We willintroduce this method briefly.Figure 3.1: Tree TopologyWe first illustrate how to calculate the likelihood for the tree in Figure3.1 for only one site. Assuming for the ith site, the states of DNA at the153.3. Likelihood for Treesthree tips X4, X1 and X2 are A, G and C respectively. From the tips tothe root of the tree, the algorithm calculates the conditional likelihood ofa subtree recursively given the state of this node in the current generationby combining the conditional likelihood of its immediate child in the leftlineages and also the one in the right lineages of the same node.Using the small tree, we illustrate how this algorithm works. The depthof the tree is two. Let Lixk(A) denote the probability of observing all the tipsgiven the state of the node xk has state A at the ith site. As assumed, at thetips, the state ofX2 in the ith site is C. Then (Lix2(A), Lix2(C), Lix2(G), Lix2(T ))= (0, 1, 0, 0), (Lix2(A), Lix2(C), Lix2(G), Lix2(T )) = (0, 0, 1, 0), (Lix4(A), Lix4(C),Lix4(G), Lix4(T )) = (1, 0, 0, 0). For an internal node X3, we illustrate how tocalculate Lix3(A) as an example.Lix3(A) =(?xProb(X1 = x|A, t1)Lix1(x))?(?yProb(X2 = y|A, t2)Lix2(y))=(?xexp(t1Q)A?xLix1(x))?(?yexp(t2Q)A?yLix1(y))163.3. Likelihood for TreesAs a result, Lix3(C), Lix3(G), Lix3(T )) can be calculated in the same way.Lix5(r) =(?sProb(X3 = s|r, t1)Lix3(s))?(?qProb(X4 = q|r, t4)Lix4(q))=(?sexp(t3Q)r?sLix3(s))?(?qexp(t4Q)r?qLix4(q))Assuming the stationary distribution for the four states is pi = (piA, piC , piG, piT ),then the likelihood of this tree for the ith site isLi =?rpirLix5(r).Denote the log-likelihood for the ith site as li = log(Li). By assumingindependence of different sites, the log-likelihood for this tree is l =?ni=1 li.3.3.2 With Rate HeterogeneityHowever, it seems impractical that each site evolves at the same rate.? proposes the rate factor coming from a gamma distribution on each site.Below we provide how to calculate the likelihood of the previous tree inFigure 3.1 with the assumption that the rate follows the gamma distributionwith mean one. Denote the prior density function for the rate ? as g(?). Thedifference of computing the likelihood when considering rate heterogeneitylies in its impossibility of ?dynamic programming?. Same as in Section 3.3.1,we explain how to calculate the likelihood of the tree in Figure 3.1 for onlyone site first. It is assumed the states of X4, X1 and X2 are A, G and C173.3. Likelihood for Treesrespectively. The likelihood of the tree in Figure 3.1 for this site is?p?qpipProb(A|X5 = p, t4)Prob(X3 = q|p)Prob(G|q, t1)Prob(C|q, t2)=?p?qpip? ?0pipg(?) exp(t4Q?)p?A exp(t3Q?)p?q exp(t1Q?)q?Gexp(t2Q?)q?Cd? (3.2)From Equation 3.2, we can see that if the number of the internal nodesis w, the computation complexity is 4w since ?dynamic programming? cannot be applied. The computation time of this method increases explosivelyas the number of species increases. As mentioned in the original paper, thismethod can only deal with tree topologies with no more than four specieswith a microcomputer.In order to deal with the intense computation, ? proposed the ?discretegamma distribution? to simplify the problem. By assuming equal probabil-ity of the rate in each category, ? uses the mean or median in each categoryto represent all the rates in the same category. Assuming we use k categoriesof ?discrete gamma distribution? to replace the continuous gamma distri-bution, where (?1, ?2, . . . , ?k) is the mean of each category. We use Lix5(r)to illustrate how to use the ?discrete gamma distribution? to approximate183.3. Likelihood for Treesthe continuous case, where r is the state of X5.Lix5(r) =(?sProb(X3 = s|r, t1)Lix3(s))?(?qProb(X4 = q|r, t4)Lix4(q))=k?j=11k(?sexp(?jt3Q)r?sLix3(s))?(?qexp(?jt4Q)r?qLix4(q))As to how to calculate the mean of each category, readers can refer tothe paper of ? for details.However, in both the continuous and the discrete cases, the author re-stricts the mean of the gamma distribution to one, then only the shapeparameter ? of the gamma distribution needs to be estimated. Since withthe fixed number of species at the tips, a tree can have several differenttopologies. The process of maximizing the likelihood of a tree with a giventopology is repeated for each of the possible topology until a maximum treeis found. By maximizing the likelihood of the tree over the branch lengthsand ?, an estimate of ? can be obtained.Both the continuous and discrete gamma distributions of the rate factorhave their own limitations. ? reviewed that by assuming the continuousgamma distribution of the rate factor, the algorithm is practical for no morethan six sequences. By using the discrete gamma distribution, ? has pointedout that ?? can be very different based on different tree topologies where ??is the estimate of the shape parameter ? of the gamma distribution. Forexample, ?? tends to be larger given the maximum likelihood tree while ??seems the smallest from a star tree most of the times. The difference gets193.3. Likelihood for Treeslarger especially when there are many species. As a result, we are moti-vated to relax the restriction of the specific distribution of ? because of theintense computation of the continuous gamma distribution and the over-estimation or underestimation of the shape parameter ? from the discretegamma distribution given different tree structures. Finally, we propose adiscrete distribution of the rate factor without any other restrictions whichwill be covered by the Chapter 4.20Chapter 4Estimating the Rate Factor ?via an EM Algorithm4.1 IntroductionIn this chapter, we first introduce how to use an EM (Expectation Max-imization) algorithm to estimate the rate factor and its distribution. Wefirst introduce how to implement the method to a pair of DNA sequences.Then we explain how to estimate ? and its distribution for a tree.4.2 Learn ? and f via an EM Algorithm for apair of DNA sequences4.2.1 The likelihood for a pair of DNA sequencesWhen having two DNA sequences represented by xT = (x1, x2, . . . , xn)and yT = (y1, y2, . . . , yn) which are the observed data, we will first studyhow to calculate the likelihood for this pair of sequences while the rate fac-tor following a discrete distribution is introduced. Assuming that differentnucleotide sites are independent within each DNA sequence, the rate factoris denoted as ?T = (?1, ?2, . . . , ?k) which is used to scale the original ratematrix Q and a latent variable Z is introduced to imply which category the214.2. Learn ? and f via an EM Algorithm for a pair of DNA sequencesrate for a particular site belongs to. The latent variable Z is the missingdata. The complete data includes the observed sequences x, y and z whichis the realization of the latent variable Z. The prior distribution of Z isf = (f1, f2, . . . , fk). The parameters to be estimated are ? = (?, f) giventhe data from pairs of DNA sequences.For a single site i on both sequences x and y, the likelihood isL(?) =k?j=1P (Zi = j)(e?jQt)xi?yiP (Xi = xi),where xi, yi ? {A,G, T,C}, i ? {1, 2, . . . , n} and P (Xi = xi) is the basefrequency which can be derived from the eigenvector of QT .For n sites on the sequences, under the assumption of independenceacross different sites, the likelihood isL(?) =n?i=1k?j=1P (Zi = j)(e?jQt)xi?yiP (Xi = xi).The incomplete log-likelihood for n sites on one sequence isl(?) =n?i=1log??k?j=1P (Zi = j)(e?jQt)xi?yiP (Xi = xi)?? .Since this incomplete log likelihood is hard to deal with, an EM algorithmis introduced. In order to utilize the EM algorithm, we need to calculatethe complete log likelihood for this pair of DNA sequences first.For the ith site on the DNA sequence, zi represents which category therate ? takes. The complete likelihood for a pair of sequences with n sitesgiven ? isL(?;x, y, z) =n?i=1k?j=1f1(zi=j)j (e?jQt)1(zi=j)xi?yi P (Xi = xi).224.2. Learn ? and f via an EM Algorithm for a pair of DNA sequencesThe complete log-likelihood given the parameters ? will bel(?;x, y, z) =n?i=1k?j=11(zi = j) log(fj(e?jQt)xi?yiP (Xi = xi)).Denote P (X = x) = pix as the prior probability, the complete log-likelihood will bel(?;x, y, z) =n?i=1k?j=11(zi = j) log(fjpixi(e?jQt)xi?yi).The expected complete log-likelihood isE(l(?;x, y, Z)|X = x, Y = y) =n?i=1k?j=1E(1(Zi = j)|Xi = xi, Yi = yi)log(fjpixi(e?jQt)xi?yi)When we have N pairs of DNA sequences, the expected complete log-likelihood isE(l(?;x, y, Z)|X = x, Y = y)=N?m=1nm?i=1k?j=1E(1(Zm,i = j)|Xm,i = xm,i, Ym,i = ym,i) log(fjpixm,i(e?jQt)xm,i?ym,i)The number of sites in different pairs of sequences can be different whichis denoted by nm.4.2.2 Learn ? and f via an EM AlgorithmIn this section, we explain how to learn the parameters ? = (?1, . . . , ?k)and update the posterior distribution f = (f1, . . . , fk) for the latent variableZ given the DNA sequence data. When there are more than one pair ofsequences, it needs to sum over all the sites on all pairs of DNA sequences234.2. Learn ? and f via an EM Algorithm for a pair of DNA sequencesto obtain the expected complete log-likelihood. For simplicity, we assumethere is one pair of DNA sequences here.E step of EM algorithmE?t?1(l(?;x, y, Z)|X = x, Y = y)=n?i=1k?j=1E?t?1(1(Zi = j)|Xi = xi, Yi = yi) log(fjpixi(e?jQt)xi?yi)=n?i=1k?j=1P?t?1(Zi = j|Xi = xi, Yi = yi) log(fjpixi(e?jQt)xi?yi)=n?i=1k?j=1P?t?1(Zi = j,Xi = xi, Yi = yi)P?t?1(Xi = xi, Yi = yi)log(fjpixi(e?jQt)xi?yi)Assuming ?t?1 = (?t?1, f t?1) has been learned from (t?1)th step, where?t?1 = (?t?11 , . . . , ?t?1k ) and ft?1 = (f t?11 , . . . , ft?1k ). Then in the tth step,both ?j and fj will be updated, where j = 1, 2, . . . , k.The exectation term E(1(Zi = j)|Xi = xi, Yi = yi) is taken with respectto the old parameters ?t?1 = (?t?11 , . . . , ?t?1k ) and the observed data xi, yi.The goal of the E step is to compute E(l(?;x, y, Z)|X = x, Y = y). In the Mstep, since there is no analytical solution of ?, E(l(?;x, y, Z)|X = x, Y = y)is optimized with respect to ?E?t?1(l(?;x, y, Z)|X = x, Y = y)=n?i=1k?j=1e(?t?1j Qt)xi?yif(t?1)j?kj=1 e(?t?1j Qt)xi?yif(t?1)jlog(fjpixi(e?jQt)xi?yi). (4.1)M step of the EM algorithmUpdating fjThe posterior distribution f = (f1, . . . , fk) of latent variable Z is updated244.2. Learn ? and f via an EM Algorithm for a pair of DNA sequencesin the following way in the tth step assuming f (t?1) and ?(t?1) have beenobtained.DenoteAt?1j =e(?t?1j Qt)f (t?1)j?kj=1 e(?t?1j Qt)f (t?1)j(4.2)as a 4 ? 4 matrix, j = 1, 2, . . . , k. The element in the ith row and kthcolumn of At?1j represents the posterior probability of Z = j meaning that? takes the jth category given the site changes from state i to state k of twohomologous sequences from a common ancestor in the (t? 1)th iteration.There are four states A,G, T,C in the DNA sequence. As a result thereare 16 kinds of transitions between these states including A?A, A?G, A?T and so on. For a pair of sequences with n sites, the number of eachkind of transition will be counted and these numbers form a 4 ? 4 matrixB. The element in matrix B records the number of this kind of transitioncorresponding to the element appearing in the Q matrix. That means therow and column order of {A, G, T, C} in B is the same as rate matrix Qand At?1j . For example, if the element in the first row and second columnin Q denotes the rate for a state to change from A to C, then the elementin the first row and second column in B denotes the number of transitionsfrom A to C for n sites on this pair of DNA sequences.Let B ?At?1j denote the dot product of two matrix, where (l,m) denotethe dimension of matrix B ? At?1j . Then fj , j = 1, 2 . . . , k is updated as fora pair of sequences with n sites as followsf (t)j =?l,m(B ?At?1j )l,m/n. (4.3)The sum of all elements in B ?At?1j is the expected weighted number ofsites for ? to take the jth category, j = 1, . . . , k.254.2. Learn ? and f via an EM Algorithm for a pair of DNA sequencesUpdating ?To update ? in the tth step, we choose ? such that a local maximum ofEquation 4.1 can be obtained when the gradient of ? is a zero vector andthe Hessian matrix of ? is negative definite.In order to deal with getting the gradient of Equation 4.1 with respectto ?, we first introduce Theorem 1 to illustrate how to get the derivative ofa particular element of matrix exponential.Theorem 1.limtn?t(etnQ)i,j ? (etQ)i,jtn ? t= (QetQ)i,jTheorem 1 states that if we get derivative with respect to t for the el-ement in the ith row and jth column of the matrix (etA) is equivalent togetting derivative with respect to t for the whole matrix and then take theelement in the ith row and jth column. This result is obtained by ? and weprovide a standard proof of Theorem 1 in Appendix A.With Theorem 1, we can get the gradient and Hessian matrix with re-spect to ?.Review in Equation 4.1 and Equation 4.2, we have thatE?t?1(l(?;x, y, Z)|X = x, Y = y)=n?i=1k?j=1(e(?t?1j Qt))xi?yif(t?1)j?kj=1 (e(?t?1j Qt))xi?yif(t?1)jlog(fjpixi(e?jQt)xi?yi),At?1j =e(?t?1j Qt)f (t?1)j?kj=1 e(?t?1j Qt)f (t?1)j,then,E?t?1(l(?;x, y, Z)|X = x, Y = y) =n?i=1k?j=1(At?1j )xi?yi log(fjpixi(e?jQt)xi?yi).(4.4)264.2. Learn ? and f via an EM Algorithm for a pair of DNA sequencesIn order to find the local maximum of Equation 4.4, the gradient ofEquation 4.4 with respect to ? is desired. Denote E(l(?;x, y, Z)|X = x, Y =y) as W (?), where ? = (?1, ?2, . . . , ?k) and getting the derivative of W (?)is only with respect to ?. Assume ?m is one element of ?, m = 1, 2, . . . , k.Under the assumption that ?i and ?j are independent when i 6= j, it can beobtained that?W (?)??m=?ni=1(At?1j )xi?yifm(tQ?exp(?mtQ))xi?yi(e(?mQt))xi?yifm=?ni=1(At?1j )xi?yi(tQ?exp(?mtQ))xi?yi(e(?mQt))xi?yi. (4.5)Finally we have,?W (?)??=(?W (?)??1,?W (?)??2. . . ,?W (?)??k)T. (4.6)How to calculate each element of ?W (?)?? is defined in Equation 4.5.Next, the Hessian matrix of W (?) with respect to ? is deducted. Underthe assumption that ?i and ?j are independent when i 6= j, the Hessianmatrix is an diagonal matrix. For any ?m ? ?, where m ? 1, 2, . . . , k, wehave?2W (?)?2?m=n?i=1(At?1j )xi?yi( ???m[tQ exp(?mtQ)]xi?yi(exp(?mtQ))xi?yiexp(?mtQ)2xi?yi?[tQ ? exp(?mtQ)]xi?yi???m(exp(?mtQ))xi?yiexp(?mtQ)2xi?yi)=n?i=1(At?1j )xi?yi([tQ ? tQ ? exp(?mtQ)]xi?yi exp(?ktQ)xi?yiexp(?mtQ)2xi?yi?[tQ ? exp(?mtQ)]xi?yi [tQ ? exp(?mtQ)]xi?yiexp(?mtQ)2xi?yi). (4.7)The Hessian matrix is an k ? k diagonal matrix with the mth diagonalelement defined in Equation 4.7.274.3. Likelihood for a tree with multiple DNA sequencesBased on the gradient and Hessian matrix, ? satisfies that the gradientat this vector is zero and the Hessian matrix is negative positive. This canbe obtained by using optim() function in R to minimize the negative loglikelihood and specifying the gradient and Hessian matrix.When there are multiple sequences, for the expected log likelihood, gra-dient and Hessian matrix, we just sum over all the sites on all pairs ofsequences.The iteration to update ? and fj ends when ?t?1 and ?t are quite close,for example, ||?t ? ?t?1||? 10?5.4.3 Likelihood for a tree with multiple DNAsequencesWe have covered how to obtain the likelihood for a pair of DNA sequenceswith multiple sites. Now we concentrate on how to get the likelihood of atree. We begin with the simplest tree with three DNA sequences. There arethree tips and two internal nodes in this situation. The topology of the treeis shown in Figure 4.1.In this tree, X1, X2 and X4 are three DNA sequences at tips of thetree representing three different species. For any single site on X1, X2or X4, we have already known the state of this site. The state space is? = {A,G, T,C}, X3 and X5 are annotated since they are internal nodes.For any single site on an internal node, it can take A, G, T or C which is notknown to us. The notations of t1, t2, t3 and t4 represent the branch lengthsin the tree.First we show how to get the likelihood of this tree given the DNA se-quence data at X1, X2 and X4. Independence among different nucleotide284.3. Likelihood for a tree with multiple DNA sequencesFigure 4.1: Tree Topologysites is assumed within each DNA sequence. The rate factor ?T = (?1, ?2, . . . , ?k)is defined the same as in the previous section. A latent variable Z is intro-duced to imply which category of the rate factor a particular site takes.The distribution of Z is f = (f1, f2, . . . , fk). The stationary distribution for{A, G, T, C} is pi = (piA, piG, piT , piC). The parameters to be estimated are? = (?, f) given the DNA sequences from X1, X2, and X4 at the tips.For a single site i on both the ancestral and descendent sequences, thelikelihood isL(?) =k?z=1?p??x5=p?q??x3=qfzpip exp(?zQt4)p?x4 ? exp(?zQt3)p?qexp(?zQt1)q?x1 exp(?zQt2)q?x2 ,where ? = {A,G, T,C} and pip is the stationary frequency which can bederived from the eigenvector of QT .294.3. Likelihood for a tree with multiple DNA sequencesFor n sites on the sequences, under the assumption of independenceacross different sites, the likelihood isL(?) =n?i=1k?z=1?p??x5=p?q??x3=qfzpip exp(?zQt4)p?x4 ? exp(?zQt3)p?qexp(?zQt1)q?x1 exp(?zQt2)q?x2The incomplete log-likelihood for n sites of this tree isl(?;x1, x2, x4)=n?i=1log( k?z=1?p??x5i=p?q??x3i=qfzpip exp(?zQt4)p?x4i ? exp(?zQt3)p?qexp(?zQt1)q?x1i exp(?zQt2)q?x2i)This incomplete log-likelihood is hard to deal with since we need tomarginalize all the internal nodes and latent variables, an EM algorithm isintroduced for trees with multiple DNA sequences as well.4.3.1 Complete Likelihood for Multiple DNA sequencesThe complete likelihood for this tree with n nucleotide sites isL(?, z;x1, x2, x4)=n?i=1k?j=1?x5i???x3i??(fjpip exp(?jQt4)x5i?x4i ? exp(?jQt3)x5i?x3iexp(?jQt1)x3i?x1i exp(?jQt2)x3i?x2i)(1(zi=j,X5i=p,X3i=q))(4.8)304.3. Likelihood for a tree with multiple DNA sequencesThen the complete log-likelihood isl(?, z;x1, x2, x4)=n?i=1k?j=1?x5i???x3i??1(zi = j,X5i = p,X3i = q)? log(fjpipexp(?jQt4)p?x4i ? exp(?jQt3)p?q exp(?jQt1)p?x1iexp(?jQt2)q?x2i) (4.9)Denote E(l(?, Z;x1, x2, x4)|X1 = x1, X2 = x2, X4 = x2) as W (?), thenthe expected complete log-likelihood isW (?)=n?i=1k?j=1?x5i???x3i??E(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(fjpip exp(?jQt4)p?x4i exp(?jQt3)p?q exp(?jQt1)q?x1i exp(?jQt2)q?x2i)(4.10)In order to simplify the calculation of function W (?), we separate eachterm oflog(fjpip exp(?jQt4)p?x4i exp(?jQt3)p?q exp(?jQt1)q?x1i exp(?jQt2)q?x2i)intolog(fj), log(pip), log(exp(?jQt4)p?x4i), log(exp(?jQt3)p?q), log(exp(?jQt1)q?x1i),log(exp(?jQt2)q?x2i)).The following proposition reveals the details of how to calculate Equa-tion 4.10 efficiently.314.3. Likelihood for a tree with multiple DNA sequencesProposition 1. Assume there are n latent variables X1, X2, . . . , Xn, v1, v2,. . . , vn are functions that v1 only depends on X1, v2 only depends on X2, . . .and vnonly depends on Xn. The denote a subset {1, 2, . . . , n1} of {1, 2, . . . , n}as S. Then?x1?x2. . .?xnE (1(X1 = x1, X2 = x2, . . . , Xn = xn)) log(v1v2 . . . vn1)=?SE (1(X1 = x1, X2 = x2, . . . , Xn = xn1)) log(v1v2 . . . vn1) (4.11)Proof. The proof of Proposition 1 is simple. Denote S0 = {1, 2, . . . , n}.Since S = {1, 2, . . . , n1}, S1 = S0\S, then?x1?x2. . .?xnE (1(X1 = x1, X2 = x2, . . . , Xn = xn)) log(v1v2 . . . vn1)=?S?S1E (1(X1 = x1, X2 = x2, . . . , Xn = xn)) log(v1v2 . . . vn1)=?SE (1(X1 = x1, X2 = x2, . . . , Xn1 = xn1)) log(v1v2 . . . vn1)In order to calculate W (?), we separate the expected complete log-likelihood into several parts by partition the term in the product insidethe log function and utilize Proposition 1 to calculate each of the followingterm and then sum them together.Some of the latent variables can be summed over in the previous term324.3. Likelihood for a tree with multiple DNA sequencesto simplify the calculation.n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(fj)=k?j=1n?i=1E(1(Zi = j)|X1i = x1i, X2i = x2i, X4i = x4i)? log(fj) (4.12)n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(pip)=k?j=1n?i=1?p??x5i=pE(1(Zi = j,X5i = p)|X1i = x1i, X2i = x2i, X4i = x4i) log(pip)(4.13)n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(exp(?jQt4)p?x4i)=k?j=1n?i=1?p??x5i=pE(1(Zi = j,X5i = p,X4i = x4i)|X1i = x1i, X2i = x2i, X4i = x4i)log(exp(?jQt4)p?x4i) (4.14)334.3. Likelihood for a tree with multiple DNA sequencesn?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(exp(?jQt3)p?q)=k?j=1n?i=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)log(exp(?jQt3)p?q) (4.15)n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(exp(?jQt3)q?x1i)=k?j=1n?i=1?q??x3i=qE(1(Zi = j,X3i = q,X1i = x1i)|X1i = x1i, X2i = x2i, X4i = x4i)log(exp(?jQt1)q?x1i) (4.16)n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(exp(?jQt3)q?x2i)=k?j=1n?i=1?q??x3i=qE(1(Zi = j,X3i = q,X2i = x2i)|X1i = x1i, X2i = x2i, X4i = x4i)log(exp(?ZiQt2)q?x2i (4.17)Then W (?) is the sum of Equation 4.12, Equation 4.13, . . . and Equation4.17.344.3. Likelihood for a tree with multiple DNA sequencesEquation 4.17 reflects that the indicator function only depends on thestates of the ith site of the ancestral and its direct descendant sequences.For example,n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)log(exp(?jQt2)q?x2i)=k?j=1n?i=1?q??x3i=qE(1(Zi = j,X3i = q,X2i = x2i)|X1i = x1i, X2i = x2i, X4i = x4i)log(exp(?jQt2)q?x2i)When we are interested in calculating?kj=1?ni=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i =x2i, X4i = x4i) log(exp(?jQt2)q?x2i), where X3 is the ancestral sequenceand X2 is X3?s descendant sequence. The states of internal nodes X5 canbe summed over so that X5 does not appear in?kj=1?ni=1?q??x3i=qE(1(Zi =j,X3i = q,X2i = x2i)|X1i = x1i, X2i = x2i, X4i = x4i) log(exp(?jQt2)q?x2i).After separating the expected complete log-likelihood into several partsby partition each term in the product inside the log function, the indicatorfunction only depends on a pair of ancestral and descendant node of interestwhich are X3i and X2i since all unknown states of other internal nodescan be summed over. Moreover, 1(Zi = j,X3i = q,X2i = x2i) dependsonly on Zi, X3i, X2i which appears in the term that multiplies it which islog(exp(?jQt2)x3i?x2i).Then to calculate?kj=1?ni=1?p???q??E(1(Zi = j,X5i = p,X3i =q)|X1i = x1i, X2i = x2i, X4i = x4i) log(exp(?jQt3)p?q), we define a matrixE and a matrix L for computation convenience because we would like to354.3. Likelihood for a tree with multiple DNA sequencesdenote this term as a sum of the dot product of each row in E and thecorresponding row in L.E =??????ni=1E(1(Zi = 1, xai = A, xdi = A)|data) . . .?ni=1E(1(Zi = 1, xai = G, xdi = G)|data).........?ni=1E(1(Zi = k, xai = A, xdi = A)|data) . . .?ni=1E(1(Zi = k, xai = G, xdi = G)|data)?????,where xai denotes the ith site on the ancestral sequence and xdi denotes theith site on the descendant sequence. We can just arrangelog(exp(?ziQt3))into a k ? 16 matrix L.L =?????log(exp(?1Qt)A?A) . . . log(exp(?1Qt)G?G).... . ....log(exp(?kQt)A?A) . . . log(exp(?kQt)G?G)?????As a result, to calculate the expected complete likelihood of the tree inFigure 1, we can divide it into three parts. The first part isk?j=1n?i=1E(1(Zi = j)|x1i, x2i, x4i) log(fj),which can be obtained by summing over all the internal nodes.The second part isk?j=1n?i=1?p??E(1(Zi = j,X5i = p)|x1i, x2i, x4i) log(pip).To calculate it, we need to know the assumed location of the root ofthe tree and which is the root?s direct child. By summing over the possiblestates of the descendant node, we can obtain this term.364.4. Expectation Maximization to learn ? and f for TreesIn the third part, we compute the expected complete log-likelihood be-tween each pair of ancestral and its direct desendant sequences and thensum them together. Now we introduce how to learn the parameters of ?and its distribution f .When we perform the real data analysis, we add a penalty term whichis the sum of ?2 to the complete log-likelihood. The reason is that if anycategory of ? is so large then exp(?Qt) will go to the stationary distribution.In that case, even if ? increases, the likelihood will not increase dramatically.As a result, we add the penalty term to constrain it.4.4 Expectation Maximization to learn ? and ffor TreesAccording to Equation 4.10, we have the expression of how to calculatethe expected complete log-likelihood.4.4.1 E stepE(l(?, Z;x1, x2, x4)|X1 = x1, X2 = x2, X4 = x4)=n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(fjpip exp(?jQt4)p?x4i exp(?jQt3)p?q exp(?jQt1)q?x1i exp(?jQt2)q?x2i)(4.18)374.4. Expectation Maximization to learn ? and f for TreesE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i))=P (Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)?kj=1?p???q?? P (Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)(4.19)P (Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)= fjpip exp(?jQt4)p?t4 exp(?jQt3)p?q exp(?jQt1)q?x1 exp(?jQt2)q?x2(4.20)E?t?1 (1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)=P?t?1(Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)?kj=1?p???q?? P?t?1(Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)whereP?t?1(Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)= f t?1j pip exp(?t?1j Qt4)p?t4 exp(?t?1j Qt3)p?q exp(?t?1j Qt1)q?x1 exp(?t?1j Qt2)q?x2E?t?1(l(?, Z;x1, x2, x4)|X1 = x1, X2 = x2, X4 = x4)=n?i=1k?j=1?p??x5i=p?q??x3i=qP?t?1(Zi = j,X5i = p,X3i = q,X1 = x1, X2 = x2, X4 = x4)?kj=1?p???q?? P?t?1(Zi = j,X5i = p,X3i = q,X1 = x1, X2 = x2, X4 = x4)log(fjpip exp(?jQt4)p?t4 exp(?jQt3)p?q exp(?jQt1)q?x1 exp(?jQt2)q?x2)(4.21)4.4.2 M stepUpdating ?In the M step, E?t?1(l(?;x1, x2, x4, Z)|X1, X2, X4) is optimized with re-spect to ?. Both ? and fj are updated.384.4. Expectation Maximization to learn ? and f for TreesTo learn parameters ? in the tth iteration, a local maximum with respectto ? can be obtained when the gradient of ? is a zero vector and the Hessianmatrix of ? is negative definite. In order to get the gradient of ?, we needto know how to get derivatives with respect to ?j in the following term:exp(?jQt4)p?x4 exp(?jQt3)p?q exp(?jQt1)q?x1 exp(?jQt2)q?x2 .To achieve that, we can prove the following theoremTheorem 2.???(exp(?Q1)m?n exp(?Q2)a?b)= [???(exp(?Q1))]m?n exp(?Q2)a?b + exp(?Q1)m?n[???(exp(?Q2))]a?b= (Q1 exp(?Q1))m?n exp(?Q2)a?b + exp(?Q1)(Q2 exp(?Q2))a?bWe provide a standard proof of this theorem in the Appendix.In order to get the gradient with respect to ?j , we first defineAi,j,p,q= E(1(Zi = j,X5i = p,X3i = q)|X1i, X2i, X4i)=P (Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i)?kj=1?p???q?? P (Zi = j,X5i = p,X3i = q,X1i = x1i, X2i = x2i, X4i = x4i).(4.22)394.4. Expectation Maximization to learn ? and f for TreesE(l(?, Z;x1, x2, x4)|X1 = x1, X2 = x2, X4 = x4)=n?i=1k?j=1?p??x5i=p?q??x3i=qE(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(fjpip exp(?jQt4)p?x4i exp(?jQt3)p?x3i exp(?jQt1)q?x1i exp(?jQt2)q?x2i)=n?i=1k?j=1?p???q??Ai,j,p,q ? log(fjpip exp(?jQt4)p?x4i exp(?jQt3)q?q exp(?jQt1)q?x1iexp(?jQt2)q?x2i)E?t?1(l(?, Z;x1, x2, x4)|X1 = x1, X2 = x2, X4 = x4)=n?i=1k?j=1?p??x5i=p?q??x3i=qE?t?1(1(Zi = j,X5i = p,X3i = q)|X1i = x1i, X2i = x2i, X4i = x4i)? log(fjpiq exp(?jQt4)p?x4i exp(?jQt3)p?q exp(?jQt1)q?x1i exp(?jQt2)q?x2i)=n?i=1k?j=1?p??x5i=p?q??x3i=qA(t?1)i,j,p,q ? log(fjpip exp(?jQt4)p?x4i exp(?jQt3)p?q exp(?jQt1)q?x1iexp(?jQt2)q?x2i)When we get derivatives with ?, denoteW?t?1(?) as E?t?1(l(?;x1, x2, x4)|X1 =404.4. Expectation Maximization to learn ? and f for Treesx1, X2 = x2, X4 = x4).?W?t?1(?)??j=???jn?i=1k?j=1?p??x5i=p?q??x3i=qA(t?1)j,p,q ? log(fjpip exp(?jQt4)p?x4iexp(?jQt3)p?q exp(?jQt1)q?x1i exp(?jQt2)q?x2i)=n?i=1?p??x5i=p?q??x3i=qA(t?1)i,j,p,q ????jlog(fjpip exp(?jQt4)p?x4iexp(?jQt3)p?q exp(?jQt1)q?x1i exp(?jQt2)q?x2i)=n?i=1?p???q??A(t?1)i,j,p,q(t4Q exp(?jQt4)p?x4exp(?jQt4)p?x4+t3Q exp(?jQt3)p?qexp(?jQt4)p?q+t1Q exp(?jQt1)q?x1exp(?jQt1)q?x1+t2Q exp(?jQt2)q?x2exp(?jQt2)q?x2)(4.23)Recall that for a pair of sequence with n nucleotide sites, xi, yi ? {A,G, T,C}?W?t?1(?)??j=?ni=1(At?1j )xi?yi(tQ?exp(?jtQ))xi?yi(exp(?jQt))xi?yi. (4.24)When we have a tree,?W?t?1 (?)??jis a sum over all pairs of the ancestorand its direct child sequences over all sites multiplying its correspondingposterior distribution A(t?1)i,j,p,q from the E step.41Chapter 5Identifiability of the Model5.1 IntroductionIn the context of phylogenetics, a model is non-identifiable if differentset of parameters including tree topologies, branch lengths and evolutionaryparameters can produce the same likelihood.Section 5.2 is a review of the previous work on the identifiability ofmodels with the rate factor following the gamma distribution. We discussthe relationship of the previous work and our work.We illustrate how the identifiability of ? = (?, f) is transformed into theproblem of determining the uniqueness of the solution to a set of nonlinearequations in Section 5.3. The number of equations is equal to the numberof different eigenvalues of Q. In Section 5.4, we prove that if the rate matrixcomes from the F81 family, then the model is unidentifiable which carriesover to the case where the rate factor follows a gamma distribution. InSection 5.5, by applying ??s result, we prove the non-identifiability of ? andf under certain conditions. Finally, we have shown some simulation resultsto provide empirical support of the non-identifiability of HKY85 model.425.2. Review5.2 Review? have reviewed both the identifiability of transition matrices for inde-pendent and identically distributed sites and models with rate variation byintroducing a rate factor ?. The definition of the rate factor can be referredto Section 2.2.4. They review that if different sites evolve at the same rate,pairwise comparisons of sequences are not able to reconstruct the transitionmatrices and that the distribution of at least triples of sites is sufficientto reconstruct the transition matrices. Moreover, for models allowing ratevariations, the topology and the transition rate matrices are identifiable un-der certain conditions. They can be identified when the distribution of ? iscompletely known. Assuming the rate factor follows a gamma distribution,? uses the F81 Model and discovers that the shape parameter of the gammadistribution and the topology of the tree are unidentifiable. Different shapeparameters and tree topologies can produce the same pairwise distributionbetween all pairs of taxa. However, this is not a general case. ? proves thatthe four-state GTR + ? model is identifiable given the joint distributionof at least triples of taxa by assuming the rate factor follows the gammadistribution. The reason why gamma distribution is favoured is its simpleform of the inverse of moment generating function. It is still of interestwhether pairwise distributions are sufficient to discover the identifiability ofthe model. ? proves that several different conditions when the parameterscan be identified. As long as the rate matrix Q has at least two differentnon-zero eigenvalues and two non-zero pairwise distances, the rate matrix Q,pairwise distances and shape parameter ? can be identified simultaneously.This conclusion applies to a general time reversible (GTR) model since ithas three non-zero distinct eigenvalues. It also explains why the F81 model435.2. Reviewis non-identifiable since it only has one non-zero eigenvalue. Moreover, ?provides a theorem stating that the rate matrix, pairwise distance and anyarbitrary distribution of the rate factor are identifiable if the rate matrixhas at least two distinct non-zero eigenvalues, the expectation of the ratefactor is one and the pairwise distribution is available for any distribution.This theorem also serves as a basis for our model when analyzing the iden-tifiability for pairwise sequences when the rate matrix Q has at least twonon-zero distinct eigenvalues.Instead of relying on the number of distinct eigenvalues of the rate ma-trix Q, ? have developed a new technique to study the identifiability oflarge trees. Without the assumption of the gamma distribution of the ratefactor, they imply large phylogenies are identifiable with the constraint thatthe expectation of the rate factor is one. They achieve that by binning siteswith similar rates into groups and using a bin with abundant sites to esti-mate the distance between any two leaves in order to recover the true tree.In our case, we introduce the rate factor ? = (?1, ?2, ..., ?k) and the la-tent categorical variable Z to specify which category ? takes. We assumea discrete distribution over ? which is f = (f1, f2, . . . , fk). The parametersin our model include ? = (?1, ?2, . . . , ?k) and f = (f1, f2, . . . , fk). The iden-tifiability of ? = (?, f) is of particular interest when the rate matrix Q isknown as well as the topology of the tree or one pairwise distance.In this chapter, we have studied the identifiability of ? = (?, f) for apair of sequences. We prove that for the F81 model and the JC69 model,? = (?, f) is unidentifiable due to the only one non-zero eigenvalues of therate matrix Q. We apply ??s conclusions to our situation. We relax theconstraint that the expectation of the rate factor is one. We prove that theeigenvalues of the rate matrix are identified if it is unknown but the pairwise445.3. Identifiability of ? with two categoriesdistance and the distribution of the rate scalar are unidentifiable for GTRmodels.5.3 Identifiability of ? with two categoriesFor simplicity, we first study the identifiability for only one pair of se-quences. We derive the log-likelihood for a pair of sequences with n sitesisl(?) =n?i=1log??k?j=1P (Zi = j)(e?jQt)xi?yiP (X = xi)??=n?i=1log??k?j=1fj(e?jQt)xi?yipixi?? (5.1)with the assumption of independence across different sites. A latent cat-egorical variable Z is introduced to imply which category the rate for aparticular site belongs to. The distribution of Z is f = (f1, f2, . . . , fk). Theparameters to be estimated are ? = (?, f) given the data from pairs of DNAsequences.To further simplify the problem, consider k = 2. Assume ? = (?1, ?2)where ?1 < ?2 and f = (f1, f2), which is ? = (?1, ?2, f1, f2) and l(?) = l0.If a model for a pair of sequences is identifiable then there does not exist?? = (??1 , ??2 , f?1 , f?2 ) (where ??1 < ??2) other than ? such that l(??) = l0.Assuming ?? = (??1 , ??2 , f?1 , f?2 ) exists to make l(??) = l(?0), for anyxi, yi ? {A,G, T,C}, thenpixif1 exp(?1Qt)xi?yi + pixif2exp(?2Qt)xi?yi= pixif?1 exp(??1Qt)xi?yi + pixif?2 exp(??2Qt)xi?yi , (5.2)455.3. Identifiability of ? with two categorieswhere pixi can be cancelled on both sides of the equation. Then it is simpli-fied asf1 exp(?1Qt)xi?yi + f2 exp(?2Qt)xi?yi= f?1 exp(??1Qt)xi?yi + f?2 exp(??2Qt)xi?yi . (5.3)Suppose eQ = Xdiag(ed1 , . . . , edp)X?1, where d1, . . . , dp are the p dis-tinct eigenvalues of Q and p ? 4. It is also known thate?Qt = Xdiag(e?d1t, . . . , e?dpt)X?1.In order to make Equation 5.3 to hold, we only need to make sure thatM1 = M?1 ,whereM1 = X????????f1ed1?1t + f2ed1?2tf1ed2?1t + f2ed2?2t. . .f1edp?1t + f2edp?2t????????X?1,M?1 = X????????f?1 ed1??1 t + f?2 ed1??2 tf?1 ed2??1 t + f?2 ed2??2 t. . .f?1 edp??1 t + f?2 edp??2 t????????X?1.To analyze the identifiability of the model, we are interested to knowwhether there is ?? = (??1 , ??2 , f?1 , f?2 ) 6= ? = (?1, ?2, f1, f2) such that the465.4. Non-identifiability of F81 Modelfollowing system of nonlinear equations holds.?????????????????f1ed1?1t + f2ed1?2t = f?1 ed1??1 t + f?2 ed1??2 tf1ed2?1t + f2ed2?2t = f?1 ed2??1 t + f?2 ed2??2 t? ? ?f1edp?1t + f2edp?2t = f?1 edp??1 t + f?2 edp??2 t(5.4)(5.5)(5.6)Theorem 3. Denote (d1, d2, . . . , dp) as p distinct eigenvalues of the ratematrix Q and t as the branch length. For a pair of DNA sequences, if thereis a unique solution ? = (?, f) to the system of Equations meaning that (?? =?, f? = f), then the model is identifiable, otherwise, it is unidentifiable.In the previous system of equations, there are p equations in total wherep is the number of distinct eigenvalues of rate matrix Q satisfying p ? 4.It reflects that the identifiability of the model depends on the number ofdifferent eigenvalues of the rate matrix Q.5.4 Non-identifiability of F81 ModelProposition 1. F81 family of rate matrices have one eigenvalue 0 withalgebraic multiplicity 1 and the other eigenvalue -1 with algebraic multiplicity3.Proof. By denoting the the diagonal elements as * to make sure the sum of475.4. Non-identifiability of F81 Modelthe each row in the Q matrix is zero, whereQ =????????? piC piA piGpiT ? piA piGpiT piC ? piGpiT piC piA ?????????,the characteristic polynomial of ?(?) of the 4? 4 matrix is????????????? ? ? piC piA piGpiT ? ? ? piA piGpiT piC ? ? ? piGpiT piC piA ? ? ?????????????.By doing elementary transformations, ?(?) can be transformed into?(?) =????????????? ? ? piC 0 piGpiT ? ? ? 0 piGpiT piC ?(1 + ?) piGpiT piC piA +piApiG(1 + ?? piG) ? ? ?????????????.Similarly, we get?(?) =?????????????(1 + ?) piC 0 piGpiT +piTpiC(1 + ?? piC) ? ? ? 0 piG0 piC ?(1 + ?) piG0 piC piA +piApiG(1 + ?? piG) ? ? ?????????????.485.4. Non-identifiability of F81 ModelBy Laplace expansion along the first column, we get?(?) = ?(1 + ?)??????????(1 + ?? pic) 0 piG(1 + 1+??piCpiC)piC ?(1 + ?) 0piC piA(1 + 1+??piGpiG)?(1 + ?)??????????piT(1 +1 + ?? piCpiC)?????????piC 0 0piC ?(1 + ?) 0piC piA(1 + 1+??piGpiG)?(1 + ?)?????????= ?(1 + ?)(?(1 + ?? piC)(1 + ?)2 + piA(1 + ?)2 + piG(1 + ?)2)?piT (1 + ?)3= ?(1 + ?)3 (piA + piG + piC ? (1 + ?))? piT (1 + ?)3= ?(1 + ?)3 (1? piT ? (1 + ?))? piT (1 + ?)3= (1 + ?)3 (?+ piT )? piT (1 + ?)3= ? (1 + ?)3 . (5.7)As a result, ?(?) has two distinct eigenvalues: ?1 = 0 with algebraic multi-plicity 1 and ?2 = ?1 with algebraic multiplicity 3.Theorem 4. Let X and Y be two DNA sequences, Q be any F81 rate matrixof the following formQ =????????? piC piA piGpiT ? piA piGpiT piC ? piGpiT piC piA ?????????.Denote ? = (?1, ?2) as the rate factor, where ?1 < ?2 and f = (f1, f2) as thedistribution for the latent variable Z specifying which category ? takes. The495.4. Non-identifiability of F81 Modelbranch length t for this pair of sequence is assumed to be 1. Assuming Q isknown, the parameter ? = (?, f) is unidentifiable.Proof. Assuming the model is identifiable, ?? = (??1 , ??2 , f?1 , f?2 ) must beequal to ? = (?1, ?2, f1, f2). For F81 models, we have d1 = 0, d2 = ?1.???f1e0??1t + f2e0??2t = f?1 e0???1 t + f?2 e0???2 tf1e(?1)??1t + f2e(?1)?2t = f?1 e(?1)???1 t + f?2 e(?1)???2 t(5.8)(5.9)which is???f1 + f2 = f?1 + f?2f1e??1t + f2e??2t = f?1 e???1 t + f?2 e???2 t(5.10)(5.11)By setting f1 = f?1 = f2 = f?2 = 0.5, ??1 =12?1, t=1, we get ??2 =? log(e??1 + e??2 ? e?0.5?1). This means there is a distinct solution ?? =(12?1,? log(e??1 + e??2 ? e?0.5?1), f1, f2) which contradicts our assumptionthat no ?? exists. We conclude that the model is unidentifiable.Example 1. To illustrate Theorem 4, we provide one rate matrix from F81family and assume piA = 0.4, piC = 0.3, piT = 0.2 and piG = 0.1, f1 = 0.5,f2 = 0.5, ?1 = 1, ?2 = 2, t = 1 andQ =????????? piC piA piGpiT ? piA piGpiT piC ? piGpiT piC piA ?????????=?????????0.8 0.3 0.4 0.10.2 ?0.7 0.4 0.10.2 0.3 ?0.6 0.10.2 0.3 0.4 ?0.9????????.The eigenvalues of Q are equal to -1 with algebraic multiplicity 3 and 0with algebraic multiplicity 1. After setting f?1 = 0.5, ??1 = 0.5, f?2 = 0.5,505.5. Identifiability of Models with at least 3 different eigenvalues??2 = ? log(e?1 + e?2 ? e?0.5), we get?????????f1 + f2 = f?1 + f?2 = 10.5e?1 + 0.5e?2 = 0.5?e?0.5 + 0.5elog(e?1+e?2?e0.5)= 0.5?e?1 + 0.5e?2.(5.12)(5.13)Following from Equation 5.4 to Equation 5.6, this model is unidentifiablesince ? = (1, 2, 0.5, 0.5) and ?? = (0.5,? log(e?1 +e?2?e?0.5), 0.5, 0.5) havethe same likelihood for a pair of sequences.Corollary 1. Let X and Y be two DNA sequences, Q be any JC69 ratematrix of the following form,Q =????????? ?4?4?4?4 ??4?4?4?4 ??4?4?4?4 ?????????.Assuming ? = (?1, ?2) (where ?1 < ?2) is the rate scalar and f = (f1, f2)is its distribution, the rate matrix Q and the branch length are known, theparameter ? = (?, f) is unidentifiable.Proof. JC69 rate matrices are special cases of F81 rate matrices by settingpiA = piG = piT = piC =?4 , the result follows from Theorem 4.5.5 Identifiability of Models with at least 3different eigenvaluesIn this section, from the F81 model, ? = (?, f) is unidentifiable sincethere are only two equations and there are four unknown quantities (??1 , ??2 , f?1 ,515.5. Identifiability of Models with at least 3 different eigenvaluesf?2 ). As a result, we can find multiple ?? = (??1 , ??2 , f?1 , f?2 ) to satisfy the con-ditions. However, it is of interest to explore whether the model is identifiablewhen the rate matrix Q has three distinct eigenvalues or four eigenvalues.When Q has three different eigenvalues, there will be three equations andfour unknown quantities (??1 , ??2 , f?1 , f?2 ) so that the number of equations isstill smaller than the number of unknown quantities. When Q has four dif-ferent eigenvalues, there will be four equations and four unknown quantities(??1 , ??2 , f?1 , f?2 ) so that the number of equations is equal to the number ofunknown quantities. We prove the non-identifiability of the GTR modelunder certain conditions on the basis of ??s work.5.5.1 F84 ModelRecall that in the F1984 model, the rate matrix is given byQ =????????? (1 + k/piY )piC piA ?piG(1 + k/piY )piT ? piA piGpiT piC ? (1 + k/piR)piGpiT piC (1 + k/piR)piA ?????????.The three eigenvalues are ?1 = 0, ?2 = ??, ?3 = ?4 = ?(1 + k)?, where ?is 1/(4piTpiC(1 + k/piY ) + 4piApiG(1 + k/piY ) + 4piY piR).To analyze the identifiability of the F1984 model, we are interested infinding whether there is ?? = (??1 , ??2 , f?1 , f?2 ) 6= ? = (?1, ?2, f1, f2) such thatthe following nonlinear equations hold. If ?? exists, then the model is uniden-tifiable.525.5. Identifiability of Models with at least 3 different eigenvalues?????????f1 + f2 = f?1 + f?2f1e???1t + f2e???2t = f?1 e????1 t + f?2 e????2 tf?1 e?(1+k)???1 t + f?2 e?(1+k)???2 t = f?1 e?(1+k)???1 t + f?2 e?(1+k)???2 t(5.14)(5.15)(5.16)5.5.2 HKY85 Model? prove that in HKY85 model, Q has four distinct eigenvalues. If Q is asbelow, the ratio of ? and ? is the ratio of transition and tranversion. Thereare four distinct eigenvalues of the rate matrix????????? ?piC ?piA ?piG?piT ? ?piA ?piG?piT ?piC ? ?piG?piT ?piC ?piA ?????????,where ?1 = 0, ?2 = ??, ?3 = ?(piY ? + piR?), ?4 = (piY ?+ piR?).The identifiability of HKY85 model lies in whether we are able to find?? = (??1 , ??2 , f?1 , f?2 ) 6= ? = (?1, ?2, f1, f2) such that the following nonlinearequations hold.?????????????????f1 + f2 = f?1 + f?2f1e??2?2t + f2e??2?2t = f?1 e??2??1 t + f?2 e??2??2 tf?1 e??3??1 t + f?2 e??3??2 t = f?1 e??3??1 t + f?2 e??3??2 tf1e??4?1t + f2e??4?2t = f?1 e??4??1 t + f?2 e??4??2 t(5.17)(5.18)(5.19)(5.20)These conditions also apply to GTR models since they also have fourdistinct eigenvalues as HKY85 model.We cite ??s proved Theorem 5 as below to address the identifiability issuefor our model.535.5. Identifiability of Models with at least 3 different eigenvaluesTheorem 5. Consider the GTR model with unknown rate matrix Q, whichhas an unknown non-zero stationary frequencies. The rate factor ? is de-scribed by an arbitrary distribution ?, E?(?) = 1. If pairwise distributionsare available for any distance, then the rate distribution, the pairwise dis-tance d and the rate matrix Q are identifiable.The details of proof for Theorem 5 are provided by ?. In our case,the rate factor ? comes from an arbitrary discrete distribution without theconstraint of E?(?) = 1 so we generalize Theorem 5 into Theorem 6.Theorem 6. For any GTR model with unknown rate matrix Q, which has atleast two distinct eigenvalues and unknown non-zero stationary frequencies.The rate factor ? is described by an arbitrary distribution ?. If pairwisedistributions are available for any distance, the eigenvalues of the rate matrixQ are identifiable but the rate distribution and the pairwise distance areunidentifiable. However, if two choices of distance, rate distribution (d, ?)and (d?, ??) can lead to the same pairwise distribution, the following equationholdsd? E?(?) = d?? E??(??).Proof. The proof can be obtained using the same technique when ? provedtheir ?Theorem 2? in the original paper. For reader?s convenience, we usedthe same notations as the original paper. Assuming the rate factor ? has adistribution ?, then the moment generating function isM(t) = E?(et?)for any t ? 0. Assuming (d, ?,Q) and (d?, ??, Q?) are two different choices ofthe pairwise distance, distribution of the rate factor and rate matrix. As545.5. Identifiability of Models with at least 3 different eigenvaluesdenoted before, M and M? are the two moment generating functions corre-sponding to ? and ??. The rate matrix Q has n distinct eigenvalues ?i wherei ? {1, 2, . . . , n}. Moreover, Q can be represented asXdiag(?1, ?2, . . . , ?n)X?1and Q? can be represented as X?diag(??1, ??2, . . . , ??n)X??1. Let c1 denoteE?(?), c2 denote E??(??) and c denote the ratio of c1 and c2. From theproperties of moment generating functions, we haveM ?(0)M? ?(0)=E?(?)E??(??)=c1c2= c.It follows from the proof of Theorem 2 in the original paper of ? that forany v ? 0, we have M(v) = M?(cv). When the pairwise distribution is thesame for two choices of (d, ?,Q) and (d?, ??, Q?), thend? = ??i?1M??1(M(?id)).Since M(v) = M?(cv), thend? = ??i?1M??1(M(?id)) = c?i??id,??i?i= c?dd?= c? k.From the proof of Theorem 1(a) by ?, since the unknown stationarydistribution ? can be obtained using the 1-taxa marginalization and therate matrices are scaled so that trace(?Q) = ?1, then,?1 = trace(?X?diag(??1, ??2, . . . , ??p)X??1)= trace(?Xdiag(?1, ?2, . . . , ?p)X?1) = ?ck,so thatck =E?(?)E??(?)?dd?= 1,where ?i = ??i for any i ? {1, 2, . . . , n}.555.6. Simulation Results5.6 Simulation ResultsTo study the identifiability of ? and its distribution f , we select HKY85model since it has three distinct nonzero eigenvalues. We are interested tosee if the rate matrix is known, whether it is possible to recover the distri-bution of parameters ? using a pair of sequences.Using the HKY85 model, the rate matrix isQ = ? ?????????? piC ?piG piTpiA ? piG ?piT?piA piC ? piTpiA ?piC piG ?????????.The normalization constant is ? = 1/{?i{A,C,G,T} piiQii}. After set-ting ? = 5 and the stationary distribution as pi = (piA, piC , piG, piT ) =(0.4, 0.3, 0.2, 0.1), we obtain the rate matrix as belowQ0 =????????A C G TA ?0.89 0.19 0.63 0.06C 0.25 ?0.70 0.13 0.32G 1.27 0.19 ?1.52 0.06T 0.25 0.95 0.13 ?1.33????????.We generate a pair of sequences with 50000 sites from the previousHKY85 model with Q0 as the rate matrix and the branch length is 0.2.The rate factor is ? = (0, 1) and its distribution is f = (0.2, 0.8). Thismeans that among 50000 sites, approximately 20 percent of the sites stayunchanged and 80 percent of the sites change according to the rate matrixQ0. Using the EM algorithm to learn ? and f for a pair of sequences, as-565.6. Simulation Resultssuming we know Q0, we are interested to see whether we can recover thereference held-out parameters ? = (0, 1) and its distribution f = (0.2, 0.8)using the generated dataset. If the estimate of ? and f is close to ? andf , it is promising that ? and its distribution f are identifiable providedthat the rate matrix is known. If the estimate of ?? and f? are very dif-ferent from ? and f , but these two choices of parameters can produce thesame or very similar likelihood, this will provide some empirical support forthe non-identifiability for HKY85 model. We replicate simulating sequenceswith ? = (0, 1) and f = (0.2, 0.8), then make inference about ? and f withdifferent initial values of ? and f denoted as ?initial and finitial. For 500times, the average of ?? and f? are summarized as below.Table 5.1: Estimating ? and f with ?initial and finitial?initial finitial ?? f? SD(??) SD(f?)(0, 1) (0.2, 0.8) (0, 0.96) (0.178, 0.822) (0, 0.069) (0.048, 0.048)(0, 0.5) (0.5, 0.5) (0, 1.01) (0.208, 0.792) (0, 0.093) (0.059, 0.059)(0.4, 0.5) (0.5, 0.5) (0.546, 1.011) (0.497, 0.503) (0.176, 0.201) (0.002, 0.002)(0.2, 0.8) (0.5, 0.5) (0.510, 1.04) (0.485, 0.515) (0.179, 0.197) (0.007, 0.007)(0.2, 0.8) (0.3, 0.7) (0.417, 0.935) (0.293, 0.707) (0.260, 0.136) (0.008, 0.008)When the initial values ?initial and finitial are equal to the reference held-out parameters ? and f , the estimated parameters ?? and f? are close to theparameters we use to simulate the data. The standard error of ?? and f?represents the variation of ?? and f? out of 500 replications. When only one575.6. Simulation Resultscategory of ?initial is the same as ? which is zero, ?? and f? are still close to? and f . When ?initial and finitial are very different from ? and f , ?? and f?are no longer close to ? and f . Moreover, f? highly depends on finitial whichis reflected from the last three rows of Table 5.1.To investigate the identifiability of the problem, we are interested tocheck whether the negative log-likelihood given ?? and f? are the same as thenegative log-likelihood given ? and f .Table 5.2: Average of nllk of the generated dataset given different estimates?? f? ?nllk<??,f?>?nllk<?,f> Range Times(0.417, 0.935) (0.29, 0.71) 88839 88841 (-7.30, 0.03) 494(0.510, 1.04) (0.49, 0.51) 88855 88856 (-8.70, 0.06) 492(0.546, 1.011) (0.50, 0.50) 88850 88852 (-6.99, 0.05) 491(0, 1.011) (0.21, 0.79) 88825 88826 (-6.75, 1.70) 421(0, 0.96) (0.18, 0.82) 88839 88834 (-6.28, 2.42) 411In Table 5.2, ?Average of nllk? represents the average of negative log-likelihood out of 500 replications. ?Times? denotes the number of times thatthe negative log-likelihood of the generated dataset given ?? and f? is smallerthan the one given ? and f used to generate the dataset. ?Range? denotesthe range of the difference between the negative log-likelihood given (??, f?)and (?, f).Based on these simulations, we conjecture that ? and f are not identifi-able in this setup. We also produce two pictures to help visualize that. Wegenerate a pair of sequences with ? = (0.6, 1) and f = (0.2, 0.8).585.6. Simulation ResultsFigure 5.1: negative log-likelihood given ? by grid searchgamma1gamma20.51.01.52.02.50.5 1.0 1.5 2.0 2.592000940009600098000100000102000Figure 5.2: level plot of negative log-likelihood given ? by grid search595.6. Simulation ResultsFigure 5.1 shows the negative log-likelihood given different rate factors.In this figure, ?gamma1? denotes the first category of ? and ?gamma2?denotes the second category of ?. It shows on the bottom of the plot, theplane is flat which is coloured by ?red?. To check this, we produce Figure5.2 by projecting the 3-d plot to a 2-d plane. From the curve in the plot, itreflects that different combinations of the two categories of ? could producesimilar negative log-likelihood. This also serves as the empirical supportthat the HKY85 model for a pair of sequences might not be identifiable.60Chapter 6Data Analysis6.1 Data DescriptionThe dataset comes from the original paper by ?. There are 1028 ribo-somal RNA sequences. Each sequence has 4907 sites of 16S gene and issampled from eucarya mitochondria. The topology and branch lengths ofthe tree are first estimated by maximum likelihood methods.Let X = {xij} be the aligned nucleotide sequences, where i ? {1, 2, . . . ,1028}, j ? {1, 2, ..., 4907}. Each column of the data matrix xj = {x1,j , ...,x1028,j} specifies the nucleotides for the 1028 sequences at the jth site. Eachrow of the data matrix xi = {xi,1, ..., xi,4907} represents all the nucleotidesof the ith DNA sequence.X =????????x1,1 x1,2 ? ? ? x1,4907x2,1 x2,2 ? ? ? x2,4907....... . ....x1028,1 x1028,2 ? ? ? x1028,4907????????However, a considerable fraction of xij are considered missing. In phyloge-netics, missing data are not usually due to missing measurements, but arerather artificially introduced as a preprocessing step called sequence align-ment. In order to match the sequences with the same length, part of themissing data denoted as ?-? in some sequences are ?deletions or ?insertions?616.1. Data Descriptionin which nucleotides are deleted or inserted. Below we show a histogram ofthe percentage of the sites with different numbers of missing data.lenemptyPercent of Total02040600 200 400 600 800 1000Figure 6.1: percentage of sites with different numbers of missing dataFor sixty percent of the sites, the states of more than 1000 sequenceson each of this site is missing. For example, when we check the data, onthe 4907th site, only the state of one sequence is known and the states ofall other sequences at this site are missing. This kind of sites are not usedwhen analyzing the data.626.2. Exploratory Data Analysis6.2 Exploratory Data AnalysisHere we explore whether the rates at different sites are the same or not.We use the sites with the least missing data in the original dataset. In total,there are 44 sites at which only the states of two sequences out of 1028sequences are missing.In this chapter, we use the K80 model. It is assumed that the stationaryfrequencies of the nucleotides pix are uniform and the ratio of transition overtransversion ? is equal to 1.5. Then the rate matrix Q used in this datasetisQ =????????? piC ?piG piTpiA ? piG ?piT?piA piC ? piTpiA ?piC piG ?????????=?????????0.875 0.25 0.375 0.250.25 ?0.875 0.25 0.3750.375 0.25 ?0.875 0.250.25 0.375 0.25 ?0.875??????????,where ? = 1.142857.We would like to explore whether the rates at these sites are the same ornot. We calculate the likelihood for each of these sites under different ratefactors with one category ranging from 0.01 to 1 with step size 0.01. Wedenote the rate factor as ? and show three figures of the log-likelihood withrespect to the rate factor at three different sites.The patterns of the three figures are similar since they are all unimodaland the log-likelihood increases as ? increases first and decreases after acertain point. We provide a histogram of the estimate of ? by grid searchfor the selected sites. The histogram reflects that the distribution of ?? atwhich the maximum likelihood is achieved by grid search method. Note thatmore than sixty percent of the sites evolve slowly relative to the other ones.636.2. Exploratory Data Analysis0.0 0.2 0.4 0.6 0.8 1.0-220-200-180-160-140-120gammalikelihooddatagamma=0.11Figure 6.2: log-likelihood w.r.t ? at site ?1?0.0 0.2 0.4 0.6 0.8 1.0-500-450-400-350gammalikelihooddatagamma=0.38Figure 6.3: log-likelihood w.r.t ? at site ?2?646.3. Data Analysis via an EM Algorithm0.0 0.2 0.4 0.6 0.8 1.0-1000-900-800-700-600gammalikelihooddatagamma=0.92Figure 6.4: log-likelihood w.r.t ? at site ?3?6.3 Data Analysis via an EM AlgorithmThe fixed quantities include the topology of the tree and the branchlengths. The parameters to be estimated are the rate factor ? and its dis-tribution f . Without introducing the rate factor, the likelihood of the treeis -381560.755. Using an EM algorithm with a penalty term which is thesum of ?2, we provide the results as below. As to the details of the penaltyterms, please refer to Chapter 4. We get the estimates by removing the siteswhere the states of 1027 sequences are missing for a single site since thosesites only have the state of one sequence so that it does not provide muchinformation.656.3. Data Analysis via an EM AlgorithmratePercent of Total02040600.0 0.2 0.4 0.6 0.8 1.0Figure 6.5: Histogram of ??MLE by grid searchTable 6.1: Estimating ? and f with different ?initial and finitial with part ofalignments?initial finitial ?? f? nllk{??,f?}0.5 1 0.381 1 3213741 1 0.381 1 321374(0.4, 1.4) (0.5, 0.5) (0.236, 3.531) (0.488, 0.512) 303971(0.16, 0.6, 1.2) (13 ,13 ,13) (0.116, 0.926,2.602)(0.238, 0.420,0.342)290146(0.12, 0.15,0.6, 0.9)(14 ,14 ,14 ,14) (0.038, 0.160,0.591, 3.148)(0.087, 0.125,0.380, 0.408)287676(0.12, 0.15,0.6, 0.9, 1.2)(15 ,15 ,15 ,15 ,15) (0.389, 0.152,0.393,0.961,2.727)(0.087, 0.104,0.166, 0.271,0.372)285772666.3. Data Analysis via an EM AlgorithmIn order to perform the model selection, we calculate the the Akaike in-formation criterion (AIC) and Bayesian information criterion (BIC) for thefive models.Table 6.2: AIC and BIC for different Models#of categories AIC BIC1 642756 6427822 607948 6079883 580302 5803684 575366 5754135 571562 571681Since both the AIC and BIC keep increasing for the five models as the numberof parameters increases, we suspect it under-penalize model complexity. The reasonis that the dependence between sequences violates the assumption of AIC and BICso that a new model selection method is needed. To find a new model selectionmethod is part of our future work.67Chapter 7Conclusions and FutureWorkIn this thesis, we propose a new computationally attractive model. Using anEM algorithm, we can model the rates over sites on DNA sequences with a discretedistribution for thousands of sequences. We also analyze the identifiability forthe rate factor coming from a discrete distribution for the first time. We provea general condition of the identifiability of our model. Based on that, we provethe non-identifiability of F81 model. We also prove the non-identifiability of GTRmodel under certain conditions.For future work, we are interested to explore new model selection methods toevaluate our models. Moreover, we would like to propose new methods to estimatethe rate matrix Q and the rate factor ? and its distribution simultaneously. Finally,we are interested to analyze the identifiability of large phylogenies.68Appendix AFirst AppendixTheorem 1.limtn?t(etnQ)i,j ? (etQ)i,jtn ? t= (QetQ)i,jProof. If A = [aij ] is a p?pmatrix, the matrix exponential of A is eA =??j=0Aj/j!.Assume A has p distinct eigenvalues d1, d2, . . . , dp so that X is the p ? p matrixwhere the jth column is a right eigenvector of unit length corresponding to dj .Since A = XDX?1,etA = Xdiag(ed1t, . . . , edpt)X?1etA =????????x11ed1t x12ed2t . . . x1pedptx21ed1t x22ed2t . . . x2pedpt....... . ....xp1ed1t xp2ed2t . . . xppedpt????????????????x?11 x?12 . . . x?1px?21 x?22 . . . x?2p....... . ....x?p1 x?p2 . . . x?pp????????whereX?1 =????????x?11 x?12 . . . x?1px?21 x?22 . . . x?2p....... . ....x?p1 x?p2 . . . x?pp????????X =????????x11 x12 . . . x1px21 x22 . . . x2p....... . ....xp1 xp2 . . . xpp????????(etA)i,j =p?k=1xikedktx?ki =p?k=1xikx?kiedktlimtn?t(etnQ)i,j ? (etQ)i,jtn ? t=?(etA)i,j?t=?(?pk=1 xikx?kiedkt)?t=p?k=1xikx?kidkedkt69Appendix A. First AppendixAetA = Xdiag(d1, d2, . . . , dp)X?1Xdiag(ed1t, . . . , edpt)X?1= Xdiag(d1, . . . , dp)diag(ed1t, . . . , edpt)X?1=????????x11d1 x12d2 . . . x1pdpx21d1 x22d2 . . . x2pdp....... . ....xp1d1 xp2d2 . . . xppdp????????????????ed1t 0 . . . 00 ed2t . . . 0....... . ....0 0 . . . ednt????????????????x?11 x?12 . . . x?1px?21 x?22 . . . x?2p....... . ....x?p1 x?p2 . . . x?pp????????=????????x11d1ed1t x12d2ed2t . . . x1pdpedptx21d1ed1t x22d2ed2t . . . x2pdpedpt....... . ....xp1d1ed1t xp2d2ed2t . . . xppdpedpt????????????????x?11 x?12 . . . x?1px?21 x?22 . . . x?2p....... . ....x?p1 x?p2 . . . x?pp????????Then(AetA)ij =p?k=1xikx?kidkedkt =?(?pk=1 xikx?kiedkt)?t= limtn?t(etnQ)i,j ? (etQ)i,jtn ? t70Appendix BSecond AppendixTheorem 1.???(exp(?Q1)m?n exp(?Q2)a?b)= [???(exp(?Q1))]m?n exp(?Q2)a?b + exp(?Q1)m?n[???(exp(?Q2))]a?b= (Q1 exp(?Q1))m?n exp(?Q2)a?b + exp(?Q1)(Q2 exp(?Q2))a?bThe proof of this theorem is provided in the Appendix.Proof. It has been proved that,limtn?t(etnQ)i,j ? (etQ)i,jtn ? t= (QetQ)i,jWhen we get derivative wrt t for the element in the ith row and jth column of thematrix (etA)ij is equivalent to getting derivative wrt t for the matrix first and thentake the element in the ith row and jth column of the matrix. Then???(exp(?Q1)m?n exp(?Q2)a?b)= (???exp(?Q1)m?n) exp(?Q2)a?b + exp(?Q1)m?n(???exp(?Q2)a?b)= (???exp(?Q1))m?n exp(?Q2)a?b + exp(?Q1)m?n(???exp(?Q2))a?b (B.1)71
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Inference of rates across sites via an expectation...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Inference of rates across sites via an expectation maximization algorithm Zhao, Tingting 2013
pdf
Page Metadata
Item Metadata
Title | Inference of rates across sites via an expectation maximization algorithm |
Creator |
Zhao, Tingting |
Publisher | University of British Columbia |
Date Issued | 2013 |
Description | The rates of nucleotide substitution can be different from genes to genes. Moreover, different regions of the same gene can have different rates of mutation as well. Many attempts have been tried to allow for the variable rates across different nucleotide sites. A rate factor coming from the continuous distribution has been introduced to deal with the problem. However, for computation reasons, this method can only scale to less than a dozen sequences. Later studies use a discrete gamma distribution to approximate the gamma distribution. The main contribution of our work is that we propose a discrete distribution over the rate factor which is more flexible while preserving attractive computational properties. We make inference about the rate factor and its distribution via an Expectation Maximization (EM) algorithm. We evaluate our method by both simulations and a real dataset. From the real dataset, it reflects that the method is useful for large phylogenies with even thousands of sequences. We analyze the identifiability of our model for a pair of DNA sequences under certain conditions. We also prove for certain types of rate matrices, this model is non-identifiable. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-08-29 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0074174 |
URI | http://hdl.handle.net/2429/44930 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2013-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2013_fall_zhao_tingting.pdf [ 1.11MB ]
- Metadata
- JSON: 24-1.0074174.json
- JSON-LD: 24-1.0074174-ld.json
- RDF/XML (Pretty): 24-1.0074174-rdf.xml
- RDF/JSON: 24-1.0074174-rdf.json
- Turtle: 24-1.0074174-turtle.txt
- N-Triples: 24-1.0074174-rdf-ntriples.txt
- Original Record: 24-1.0074174-source.json
- Full Text
- 24-1.0074174-fulltext.txt
- Citation
- 24-1.0074174.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0074174/manifest