RESEARCH Open AccessEfficient algorithms for training the parametersof hidden Markov models using stochasticexpectation maximization (EM) training andViterbi trainingTin Y Lam, Irmtraud M Meyer*AbstractBackground: Hidden Markov models are widely employed by numerous bioinformatics programs used today.Applications range widely from comparative gene prediction to time-series analyses of micro-array data. Theparameters of the underlying models need to be adjusted for specific data sets, for example the genome of aparticular species, in order to maximize the prediction accuracy. Computationally efficient algorithms for parametertraining are thus key to maximizing the usability of a wide range of bioinformatics applications.Results: We introduce two computationally efficient training algorithms, one for Viterbi training and one forstochastic expectation maximization (EM) training, which render the memory requirements independent of thesequence length. Unlike the existing algorithms for Viterbi and stochastic EM training which require a two-stepprocedure, our two new algorithms require only one step and scan the input sequence in only one direction. Wealso implement these two new algorithms and the already published linear-memory algorithm for EM training intothe hidden Markov model compiler HMM-CONVERTER and examine their respective practical merits for three smallexample models.Conclusions: Bioinformatics applications employing hidden Markov models can use the two algorithms in order tomake Viterbi training and stochastic EM training more computationally efficient. Using these algorithms, parametertraining can thus be attempted for more complex models and longer training sequences. The two new algorithmshave the added advantage of being easier to implement than the corresponding default algorithms for Viterbitraining and stochastic EM training.BackgroundHidden Markov models (HMMs) and their variants arewidely used for analyzing biological sequence data.Bioinformatics applications range from methods forcomparative gene prediction (e.g. [1,2]) to methods formodeling promoter grammars (e.g. [3]), identifying pro-tein domains (e.g. [4]), predicting protein interfaces (e.g.[5]), the topology of transmembrane proteins (e.g. [6])and residue-residue contacts in protein structures (e.g.[7]), querying pathways in protein interaction networks(e.g. [8]), predicting the occupancy of transcriptionfactors (e.g. [9]) as well as inference models for genome-wide association studies (e.g. [10]) and disease associa-tion tests for inferring ancestral haplotypes (e.g. [11]).Most of these bioinformatics applications have beenset up for a specific type of analysis and a specific biolo-gical data set, at least initially. The states of the underly-ing HMM and the implemented prediction algorithmsdetermine which type of data analysis can be performed,whereas the parameter values of the HMM are chosenfor a particular data set in order to optimize the corre-sponding prediction accuracy. If we want to apply thesame method to a new data set, e.g. predict genes in adifferent genome, we need to adjust the parametervalues in order to make sure the performance accuracyis optimal.* Correspondence: irmtraud.meyer@cantab.netCentre for High-Throughput Biology, Department of Computer Science andDepartment of Medical Genetics, 2366 Main Mall, University of BritishColumbia, Vancouver V6T 1Z4, CanadaLam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38© 2010 Lam and Meyer; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the CreativeCommons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly cited.Manually adjusting the parameters of an HMM inorder to get a high prediction accuracy can be a verytime consuming task which is also not guaranteed toimprove the performance accuracy. A variety of trainingalgorithms have therefore been devised in order toaddress this challenge. These training algorithms requireas input and starting point a so-called training set of(typically partly annotated) data. Starting with a set of(typically user-chosen) initial parameter values, thetraining algorithm employs an iterative procedure whichsubsequently derives new, more refined parametervalues. The iterations are stopped when a terminationcriterion is met, e.g. when a maximum number of itera-tions have been completed or when the change of thelog-likelihood from one iteration to the next becomesufficiently small. The model with the final set of para-meters is then used to test if the performance accuracyhas been improved. This is typically done by analyzing atest set of annotated data which has no overlap with thetraining set by comparing the predicted to the knownannotation.Of the training algorithms used in bioinformaticsapplications, the Viterbi training algorithm [12,13] isprobably the most commonly used, see e.g. [14-16]. Thisis due to the fact that it is easy to implement if theViterbi algorithm [17] is used for generating predictions.In each iteration of Viterbi training, a new set of para-meter values j is derived from the counts of emissionsand transitions in the Viterbi paths Π* for the set oftraining sequences . Because the new parameters arecompletely determined by the Viterbi paths, Viterbitraining converges as soon as the Viterbi paths nolonger change or, alternatively, if a fixed number ofiterations have been completed. Viterbi training finds atbest a local optimum of the likelihood P( , Π*|j), i.e.it derives parameter values j that maximize the contri-bution from the set of Viterbi paths Π* to the likelihood.There already exist a number of algorithms that canmake Viterbi decoding computationally more efficient.Keibler et al. [18] introduce two heuristic algorithms forViterbi decoding which they implement into the gene-prediction program TWINSCAN/N-SCAN, called“Treeterbi” and “Parallel Treeterbi”, which have thesame worst case asymptotic memory and time require-ments as the standard Viterbi algorithm, but which inpractice work in a significantly more memory efficientway. Sramek et al. [19] present a new algorithm, called“on-line Viterbi algorithm” which renders Viterbi decod-ing more memory efficient without significantly increas-ing the time requirement. The most recent contributionis from Lifshits et al. [20] who propose more efficientalgorithms for Viterbi decoding and Viterbi training.These new algorithms exploit repetitions in the inputsequences (in five different ways) in order to acceleratethe default algorithm.Another well-known training algorithm for HMMs isBaum-Welch training [21] which is an expectation max-imization (EM) algorithm [22]. In each iteration, a newset of parameter values is derived from the estimatednumber of counts of emissions and transitions by con-sidering all possible state paths (rather than only a sin-gle Viterbi path) for every training sequence. Theiterations are typically stopped after a fixed number ofiterations or as soon as the change in the log-likelihoodis sufficiently small. For Baum-Welch training, the likeli-hood P( |j) [13] can be shown to converge (undersome conditions) to a stationary point which is either alocal optimum or a saddle point. Baum-Welch trainingusing the traditional combination of forward and back-ward algorithm [13] is, for example, implemented intothe prokaryotic gene prediction method EASYGENE[23] and the HMM-compiler HMMoC [15]. As forViterbi training, the outcome of Baum-Welch trainingmay strongly depend on the chosen set of initial para-meter values. As Jensen [24] and Khreich et al. [25]describe, computationally more efficient algorithms forBaum-Welch training which render the memoryrequirement independent of the sequence length havebeen proposed, first in the communication field by[26-28] and later, independently, in bioinformatics byMiklós and Meyer [29], see also [30]. The advantage ofthis linear-memory memory algorithm is that it is com-paratively easy to implement as it requires only a one-rather than a two-step procedure and as it scans thesequence in a uni- rather than bi-directional way. Thisalgorithm was employed by Hobolth and Jensen [31] forcomparative gene prediction and has also been imple-mented, albeit in a modified version, by Churbanov andWinters-Hilt [30] who also compare it to other imple-mentations of Viterbi and Baum-Welch training includ-ing checkpointing implementations.Stochastic expectation maximization (EM) training orMonte Carlo EM training [32] is another iterative proce-dure for training the parameters of HMMs. Instead ofconsidering only a single Viterbi state path for a giventraining sequence as in Viterbi training or all statepaths as in Baum-Welch training, stochastic EM trainingconsiders a fixed-number of K state paths Πs which aresampled from the posterior distribution P(Π|X) forevery training sequence X in every iteration. Sampledstate paths have already been used in several bioinfor-matics applications for sequence decoding, see e.g.[2,33] where sampled state paths are used in the contextof gene prediction to detect alternative splice variants.All three above training algorithms, i.e. Viterbi train-ing, Baum-Welch training and stochastic EM training,Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 2 of 16can be combined with the traditional check-pointingalgorithm [34-36] in order to trade time for memoryrequirements.We here introduce two new algorithms that makeViterbi training and stochastic EM training computa-tionally more efficient. Both algorithms have the signifi-cant advantage of rendering the memory requirementindependent of the sequence length for HMMs whilekeeping the time requirement the same (for Viterbitraining) or modifying it by a factor of M K/(M + K), i.e. decreasing it when only one state path K = 1 issampled for a model of M states (for stochastic EMtraining). Both algorithms are inspired by the linear-memory algorithm for Baum-Welch training whichrequires only a uni-directional rather than bi-directionalmovement along the input sequence and which has theadded advantage of being considerably easier to imple-ment. We present a detailed description of the two newalgorithms for Viterbi training and stochastic EM train-ing. In addition, we implement all three algorithms, i.e.the new algorithms for Viterbi training and stochasticEM training and the previously published linear-memoryalgorithm for Baum-Welch training, into our HMM-compiler HMM-CONVERTER[37] and examine thepractical features of these these three algorithms forthree small example HMMs.Methods and ResultsDefinitions and notationIn order to simplify the notation in the following, wewill assume without loss of generality that we are deal-ing with a 1st-order HMM where the Start state andthe End state are the only silent states. Our descriptionof the existing and the new algorithms easily generalizeto higher-order HMMs, HMMs with more silent states(provided there exists no circular path in the HMMinvolving only silent states) and n-HMMs, i.e. HMMswhich read n un-aligned input sequences rather than asingle input sequence at a time. An HMM is defined by● a set of states = {0, 1, ... , M}, where state 0denotes the start and state M denotes the End stateand where all other states are non-silent,● a set of transition probabilities = {ti,j |i, j Î },where ti,j denotes the transition probability to gofrom state i to state j and t i jj S ,∈∑ = 1 for every statei Î and● a set of emission probabilities ℰ = {ei(y)|i Î , yÎ }, where ei(y) denotes the emission probabilityof state i for symbol y and e yiy ( )∈∑ = 1 for everynon-silent state i Î and denotes the alphabetfrom which the symbols in the input sequences arederived, e.g. = {A, C, G, T} when dealing withDNA sequences.We also define:● Tmax is the maximum number of states that anystate in the model is connected to, also called themodel’s connectivity.● = {X1, X2, ... , XN} denotes the training set of Nsequences, where each particular training sequenceXi of length Li is denoted X x x xi i iLii= …( , , , )1 2 . Inthe following and to simplify the notation, we pickone particular training sequence X Î of length Las representative which we denote X = (x1, x2, ... , xL).We write Xn = (x1, x2, ... , xn), n Î {1, ... , L}, to denotethe sub-sequence of X which finishes at sequenceposition n.● Π = (π0, π1, ... , πL+1) denotes a state path in theHMM for an input sequence X of length L, i.e. stateπi is assigned to sequence position xi. Π* denotes aViterbi path and Πs a state path that has beensampled from the posterior distribution P(Π|X ) ofthe corresponding sequence X.A linear-memory algorithm for Viterbi trainingOf the HMM-based methods that provide automaticalgorithms for parameter training, Viterbi training [13]is the most popular. This is primarily due to the factthat Viterbi training is readily implemented if theViterbi algorithm is used to generate predictions. Similarto Baum-Welch training [21,22], Viterbi training is aniterative training procedure. Unlike Baum-Welch train-ing, however, which considers all state paths for a giventraining sequence in each iteration, Viterbi training onlyconsiders a single state path, namely a Viterbi path,when deriving new sets of parameters. In each iteration,a new set of parameter values is derived from the countsof emissions and transitions in the Viterbi paths [17] ofthe training sequences. The iterations are terminated assoon as the Viterbi paths of the training sequences nolonger change.In the following,● let E y X Xiq( , , ( ))*Π denote the number of timesthat state i reads symbol y from input sequence X inViterbi path Π*(X) given the HMM with parametersfrom the q-th iteration,● in particular let E y X X miqk k k( , , ( , ))* *Π = denotethe number of times that state i reads symbol yLam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 3 of 16from input sequence X in the partial Viterbi pathΠ* * * * *( , ) ( , , , )X m mk k k k = = … =−0 1 which finishesat sequence position k in state m, and● let T X Xi jq,*( , ( ))Π denote the number of timesthat a transition from state i to state j is used inViterbi path Π*(X) for sequence X given the HMMwith parameters from the q-th iteration,● in particular let T X X mi jqk k k,* *( , ( , ))Π = denotethe number of times that a transition from state i tostate j is used in the partial Viterbi pathΠ* * * * *( , ) ( , , , )X m mk k k k = = … =−0 1 which finishesat sequence position k in state m.In the following, the superscript q will indicate fromwhich iteration the underlying parameters derive. If weconsider all N sequences of a training set = {X1, ...XN} and a Viterbi path Π*(Xn) for each sequence Xn inthe training set, the recursion which updates the valuesof the transition and emission probabilities reads:tT X XT X Xi jq i jq n nnNi jq n nnNjM,,*,*( , ( ))( , ( ))+ ====∑∑∑1 111ΠΠ(1)e yE y X XE y X Xiq iq n nnNiq n nnNy+ ==∈=∑∑∑1 11( )( , , ( ))( ’, , ( ))**’ΠΠ(2)These equations assume that we know the values ofT X Xi jq n n,*( , ( ))Π and E y X Xiq n n( , , ( ))*Π , i.e. how ofteneach transition and emission is used in the Viterbi pathΠ*(Xn) for training sequence Xn.One straightforward way to determine T X Xi jq n n,*( , ( ))Πand E y X Xiq n n( , , ( ))*Π is to first calculate the two-dimensional Viterbi matrix for every training sequenceXn, to then derive a Viterbi state path Π*(Xn) from eachViterbi matrix using the well-known traceback procedure[17] and to then simply count how often each transitionand each emission was used. Using this strategy, everyiteration in the Viterbi training algorithm wouldrequire (M maxi{Li} + maxi{Li}) memory and( )MT L Lmax iNiNi i= =∑ ∑+1 1 time, where LiiN=∑ 1 is the sum ofthe N sequence lengths in the training set and maxi{Li} the length of the longest sequence in training set .However, for many bioinformatics applications where thenumber of states in the model M is large, the connectiv-ity Tmax of the model high or the training sequences arelong, these memory and time requirements are too largeto allow automatic parameter training using thisalgorithm.A linear-memory version of the Viterbi algorithm,called the Hirschberg algorithm [38], has been knownsince 1975. It can be used to derive Viterbi paths inmemory that is linearized with respect to the length ofone of the input sequences while increasing the timerequirement by at most a factor of two. The Hirschbergalgorithm, however, only applies to n-HMMs with n ≥ 2,i.e. HMMs which read two or more un-aligned inputsequences at a time. One significant disadvantage of theHirschberg algorithm is that it is considerably more diffi-cult to implement than the Viterbi algorithm. Only fewHMM-based applications in bioinformatics actuallyemploy it, see e.g. [1,37,39]. We will see in the followinghow we can devise a linear-memory algorithm for Viterbitraining that does not involve the Hirschberg algorithmand that can be applied to all n-HMMs including n = 1.We now introduce a linear-memory algorithm forViterbi training. The idea for this algorithm stems fromthe following observations:(V1) If we consider the description of the Viterbi algo-rithm [17], in particular the recursion, we realize thatthe calculation of the Viterbi values can be continued byretaining only the values for the previous sequenceposition.(V2) If we have a close look at the description of thetraceback procedure [17], we realize that we only haveto remember the Viterbi matrix elements at the previoussequence position in order to deduce the state fromwhich the Viterbi matrix element at the currentsequence position and state was derived.(V3) If we want to derive the Viterbi path Π from theViterbi matrix, we have to start at the end of thesequence in the End state M.Observations (V1) and (V2) imply that local informa-tion suffices to continue the calculation of the Viterbimatrix elements (V1) and to derive a previous state (V2)if we already are in a particular state and sequence posi-tion, whereas observation (V3) reminds us that in orderto derive the Viterbi path, we have to start at the end ofthe training sequence. Given these three observations, itis not obvious how we can come up with a computa-tionally more efficient algorithm for training withViterbi paths. In order to realize that a more efficientalgorithm exists, one also has to also note that:(V4) While calculating the Viterbi matrix elements inthe memory-efficient way outlined in (V1), we cansimultaneously keep track of the previous state fromwhich the Viterbi matrix element at every current stateand sequence position was derived. This is possiblebecause of observation (V2) above.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 4 of 16(V5) In every iteration q of the training procedure, weonly need to know the values of T X Xi jq,*( , ( ))Π andE y X Xiq( , , ( ))*Π , i.e. how often each transition andemission was used in each Viterbi state path Π*(X) forevery training sequence X , but not where in the Viterbimatrix each transition and emission was used.Given all observations (V1) to (V5), we can now for-mally write down an algorithm which calculatesT X Xi jq,*( , ( ))Π and E y X Xiq( , , ( ))*Π in a computation-ally efficient way which linearizes the memory require-ment with respect to the sequence length and which isalso easy to implement. In order to simplify the nota-tion, we describe the following algorithm for one parti-cular training sequence X and omit the superscript forthe iteration q, as both remain the same throughout thealgorithm. In the following,● Ti,j (k, m) denotes the number of times the transi-tion from state i to state j is used in a Viterbi statepath that finishes at sequence position k in state m,● Ei(y, k, m) denotes the number of times that state ireads symbol y in a Viterbi state path that finishes atsequence position k in state m,● vi(k) denotes the Viterbi matrix element for state iand sequence position k, i.e. vi(k) is the probabilityof the Viterbi state path, i.e. the state path with thehighest overall probability, that starts at the begin-ning of the sequence in the Start state and finishesin state i as sequence position k,● i, j, n Î , y Î and l Î denotes the pre-vious state from which the current Viterbi matrixelement vm(k) was derived, and● δi,j is the delta-function with δi,j = 1 for i = j andδi,j = 0 else.Initialization: at the start of training sequence X =(x1, ... , xL) and for all m Î , setvmmT mE y mmi ji( )( , )( , , ),0000 00 010==≠⎧⎨⎩==Recursion: loop over all positions k from 1 to L in thetraining sequence X and loop, for each such sequenceposition k, over all states m Î \{0} = {1, ... , M } andsetv k e x v k tT k m T k lm m knn n mi j i j l( ) ( ) { ( ) }( , ) ( , ),, , ,= ⋅ − ⋅= − +∈max11 i m ji i m i y xE y k m E y k l k⋅= − + ⋅ ,, ,( , , ) ( , , )1where l denotes the state at the previous sequenceposition k − 1 from which the Viterbi matrix elementvm(k) for state m and sequence position k derives, i.e.l v k tn S n n m= − ⋅∈argmax { ( ) },1 .Termination: at the end of the input sequence, i.e. fork = L and for m = M the silent End state, setv L v L tT L M T L lE yMnn n Mi j i j l i M ji( ) { ( ) }( , ) ( , )( ,,, , , ,= ⋅= + ⋅∈max L M E y L li, ) ( , , )=where l denotes the state at the sequence position Lfrom which the Viterbi matrix element vM (L) for theEnd state M and sequence position L derives, i.e.l v L tn S n n M= ⋅∈argmax { ( ) }, .The above algorithm yields T L M T X Xi j i jq, ,*( , ) ( , ( ))= Πand E y L M E y X Xi iq( , , ) ( , , ( ))*= Π (and vM(L) =Pq(X, Π*(X))), i.e. we know how often a transition from state i tostate j was used and how often symbol y was read bystate i in Viterbi state path Π*(X) in iteration q.Theorem 1: The above algorithm yields T L Mi j, ( , ) =T X Xi jq,*( , ( ))Π and E y L M E y X Xi iq( , , ) ( , , ( ))*= Π .Proof: We will prove these statements via inductionwith respect to the sequence position k.(1) Induction start at k = 0: This corresponds to theinitialization step in the algorithm. Ti,j (0, m) = 0 and Ei(y, 0, m) = 0 for all m Î as any zero-length Viterbipath finishing in state m at sequence position 0 has zerotransitions from state i to j and has not read anysequence symbol.(2) Induction step k − 1 ® k for k Î {1, ... L − 1}if the state at sequence position k = L is not theEnd state M : This case corresponds to the recursionin the algorithm. We assume thatT k m T X X mi j i jqk k k, ,* *( , ) ( , ( , ))− = =− − −1 1 1 1Π andE y k m E y X X mi iqk k k( , , ) ( , , ( , ))* *− = =− − −1 1 1 1Π .We need to distinguish two cases (a) and (b). Let ldenote the state at sequence position k − 1 from whichthe Viterbi matrix element vm(k) for state m andsequence position k derives, i.e.l v k tn S n n m= − ⋅∈argmax { ( ) },1 .● Case (a):Emissions (i): m = i and y = xk : In this case, Ei(y, k,m) = Ei(y, k − 1, l) + 1. As we know that Ei(y, k − 1, l)is the number of times that state i reads symbol y in aViterbi path ending in state l at sequence positionk − 1, we need to add 1 count for reading symbolLam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 5 of 16y = xk by state m = i at the next sequence position kin order to obtain Ei(y, k, m).Transitions (ii): l = i and m = j: In this case, Ti,j(k, m) = Ti,j(k − 1, l) + 1. As we know that Ti,j (k − 1, l)is the number of times that a transition from state i tostate j is used in a Viterbi path ending in state l atsequence position k − 1, we need to add 1 count forthe transition from state l = i to state m = j whichbrings us from sequence position k − 1 to k in order toget Ti,j (k, m).● Case (b):Emissions (i): m ≠ i or y ≠ xk : In this case, Ei(y, k,m) = Ei(y, k − 1, l). We know that Ei(y, k − 1, l) isthe number of times that state i reads symbol y in aViterbi path ending in state l at sequence position k− 1. If we go from state l at position k − 1 to statem at position k and read symbol xk and if m ≠ i or y≠ xk , we do not need to modify the number ofcounts as we know that state i at position k does notread symbol y, i.e. Ei(y, k, m) = Ei(y, k − 1, l).Transitions (ii): l ≠ i or m ≠ j: In this case, Ti,j(k, m) = Ti,j (k − 1, l). We know that Ti,j (k − 1, l) isthe number of times that a transition from state i tostate j is used in a Viterbi path ending in state l atsequence position k − 1. If we make a transitionfrom state l at position k − 1 to state m at position kand if l ≠ i or m ≠ j, we do not need to modify thenumber of counts as we know this is not a transitionfrom state i to state j, i.e. Ti,j (k, m) = Ti,j (k − 1, l).(3) If the state at sequence position k = L is theEnd state M : This case corresponds to the terminationstep in the algorithm. As in (2), we need to distinguishtwo cases (a) and (b), but now only for the transitioncounts. Let l denote the state at sequence position Lfrom which the Viterbi matrix element vM (L) for theEnd state M and sequence position L derives, i.e.l v L tn S n n m= ⋅∈argmax { ( ) }, .Emissions (i): In this case, Ei(y, L, M) = Ei(y, L, l). Aswe know that Ei(y, L, l) is the number of times thatstate i reads symbol y in a Viterbi path ending in state lat sequence position L, we do not need to modify thisnumber of counts when going to the silent End state atthe same sequence position L as silent states do notread any symbols from the input sequence. As we arenow at the end of the input sequence X and the Viterbipath Π*(X), we have E y L M E y X Xi iq( , , ) ( , , ( ))*= Π .● Case (a):Transitions (i): l = i and M = j: In this case, Ti,j (L,M) = Ti,j (L, l) + 1. As we know that Ti,j (L, l) is thenumber of times that a transition from state i tostate j is used in a Viterbi path ending in state l atsequence position L, we need to add 1 count for thetransition from state l = i to the End state M = j atsequence position L. Note that this transition ofstate does not incur a change of sequence positionas the End state is a silent state. As we are now atthe end of the input sequence X and the Viterbipath Π*(X), we have T L M T X Xi j i jq, ,*( , ) ( , ( ))= Π .● Case (b):Transitions (i): l ≠ i or M ≠ j: In this case, Ti,j (L, M)= Ti,j (L, l). We know that Ti,j (L, l) is the numberof times that a transition from state i to state j isused in a Viterbi path ending in state l at sequenceposition L. If we make a transition from state l atposition L to the End state M at sequence positionL and if l ≠ i or M ≠ j, we do not make a transitionfrom state i to state j and thus do not need to mod-ify the number of counts, i.e. Ti,j (L, M) = Ti,j (L, l).Also in case (a), we are now at the end of the inputsequence X and the Viterbi path Π*(X ) and thushave T L M T X Xi j i jq, ,*( , ) ( , ( )).= ΠEnd of proof.As is clear from the above description of the algo-rithm, the calculation of the vm, Ti,j and Ei values forsequence position k requires only the respective valuesfor the previous sequence position k − 1, i.e. the mem-ory requirement can be linearized with respect to thesequence length.For an HMM with M states and a training sequenceof length L and for every free parameter of the HMMthat we want to train, we thus need in every iteration (M ) memory to store the vm values and (M) mem-ory to store the cumulative counts for the free para-meter itself, e.g. the Ti,j values for a particular transitionfrom state i to state j. For an HMM, the memoryrequirement of the training using the new algorithm isthus independent of the length of the training sequence.For training one free parameter in the HMM with theabove algorithm, each iteration requires (MTmaxL)time to calculate the vm values and to calculate thecumulative counts. If Q is the total number of free para-meters in the model and if we choose P of these para-meters to be trained in parallel, i.e. P Î {1, ... Q} andQ/P Î N, the memory requirement increases slightlyto (MP ) and the time requirement becomes( )MT L QPmax. This algorithm can therefore be readilyadjusted to trade memory and time requirements, e.g. tomaximize speed by using the maximum amount of avail-able memory. This can be directly compared to theLam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 6 of 16default algorithm for Viterbi training described abovewith first calculates the entire Viterbi matrix and whichrequires (M L) memory and (TmaxLM) time toachieve the same. Our new algorithm thus has the sig-nificant advantage of linearizing the memory require-ment with respect to the sequence length while keepingthe time requirement the same, see Table 1 for adetailed overview. Our new algorithm is thus as memoryefficient as Viterbi training using the Hirschberg algo-rithm, while being more time efficient, significantlyeasier to implement and applicable to all n-HMMs,including the case n = 1.A linear-memory algorithm for stochastic EM trainingOne alternative to Viterbi training is Baum-Welch train-ing [21], which is an expectation maximization (EM)algorithm [22]. As Viterbi training, Baum-Welch train-ing is an iterative procedure. In each iteration of Baum-Welch training, the estimated number of counts foreach transition and emission is derived by consideringall possible state paths for a given training sequence inthe model rather than only the single Viterbi path. Asdiscussed in the introduction, there already exists anefficient algorithm for Baum-Welch training which line-arizes the memory requirement with respect to thesequence length and which is also relatively easy toimplement.One variant of Baum-Welch training is called stochas-tic EM algorithm [32]. Unlike Viterbi training whichconsiders only a single state path and unlike Baum-Welch training which considers all possible state pathsfor every training sequence, the stochastic EM algorithmderives new parameter values from a fixed number of Kstate paths (each of which is denoted Πs(X)) that aresampled for each training sequence from the posteriordistribution P(Π|X). Similar to Viterbi and Baum-Welchtraining, the stochastic EM algorithm employs an itera-tive procedure. As for Baum-Welch training, the itera-tions are stopped once a maximum number of iterationshave been reached or once the change in the log-likeli-hood is sufficiently small.In strict analogy to the notation we introduced forViterbi training, E y X Xiq s( , , ( ))Π denotes the number oftimes that state i reads symbol y from input sequence Xin a sampled state path Πs(X) given the HMM with para-meters from the q-th iteration. Similarly, T X Xi jq s, ( , ( ))Πdenotes the number of times that a transition from statei to state j is used in a sampled state path Πs(X) forsequence X given the HMM with parameters from the q-th iteration.As usual, the superscript q indicates from which itera-tion the underlying parameters of the HMM derive. If weconsider all N sequences of the training set = {X1, ...XN} and sample K state paths Π ks nX( ) , k Î {1, ... K},Table 1 Theoretical computational requirementstraining one parameter at a timetype of training algorithm time memory referenceViterbi Viterbi (TmaxLM) (ML) [17]Lam-Meyer (TmaxLM) (M) this paperBaum-Welch Baum-Welch (TmaxLM) (ML) [13]checkpointing (TmaxLM log(L)) (M log(L)) [34]linear-memory (TmaxLM) (M) [29]stochastic EM forward & back-tracing (TmaxL(M + K)) (ML) [32]Lam-Meyer (TmaxLMK) (MK + Tmax) this papertraining P of Q parameters at the same time with P Î {1, ... , Q} and Q/P Î NViterbi Viterbi (TmaxLMQ/P) (ML) [17]Lam-Meyer (TmaxLMQ/P) (MP) this paperBaum-Welch Baum-Welch (TmaxLMQ/P) (ML + P) [13]checkpointing (TmaxLMQ log(L/P)) (M log(L)) [34]linear-memory (TmaxLM Q/P) (M) [29]stochastic EM forward & back-tracing (TmaxL(M + K)Q/P ) (ML) [32]Lam-Meyer (TmaxLMKQ/P ) (MKP + Tmax) this paperOverview of the theoretical time and memory requirements for Viterbi training, Baum-Welch training and stochastic EM training for an HMM with M states, aconnectivity of Tmax and Q free parameters. K denotes the number of state paths sampled in each iteration for every training sequence for stochastic EM training.The time and memory requirements below are the requirements per iteration for a single training sequence of length L. It is up to the user to decide whether totrain the Q free parameters of the model sequentially, i.e. one at a time, or in parallel in groups. The two tables below cover all possibilities.In the general case we are dealing with a training set = {X1, X2, ... , XN} of N sequences, where the length of training sequence Xi is Li. If training involves theentire training set, i.e. all training sequences simultaneously, L in the formulae below needs to be replaced by LiiN=∑ 1 for the memory requirements and by maxi{Li} for the time requirements. If, on the other hand, training is done by considering by one training sequence at a time, L in the formulae below needs to bereplaced by LiiN=∑ 1 for the time requirements and by maxi{Li} for the memory requirements.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 7 of 16for each sequence Xn in the training set, the step whichupdates the values of the transition and emission prob-abilities can be written as:tT X XT X Xi jq i jq nks nkKnNi jq nks nkK,,, ’( , ( ))( , ( ))+ ====∑∑∑1 111ΠΠnNjMiq iq nks nkKnNiq ne yy X Xy XEE==+ ==∑∑∑∑=111 11’( )( , , ( ))( , ,Π′ Π ks nkKnNyX( ))==∈ ∑∑∑ 11′ These expressions are strictly analogous to equations1 and 2 that we introduced for Viterbi training.As before, these assume that we know the valuesof T X Xi jq nks n, ( , ( ))Π and E y X Xiq nks n( , , ( ))Π , i.e. howoften each transition and emission is used in eachsampled state path Π ks nX( ) for every training sequenceXn.Obtaining the counts from the forward algorithm andstochastic back-tracingIt is well-known that we can obtain the above counts Ti,j(X, Πs(X)) and Ei(y, X, Πs(X)) for a given trainingsequence X, iteration q and a sampled state path Πs(X)by using a combination of the forward algorithm andstochastic back-tracing [13,32]. For this, we first calcu-late all values in the two-dimensional forward matrixusing the forward algorithm and then invoke the sto-chastic back-tracing procedure to sample a state-path Πs(X) from the posterior distribution P(Π|X).We will now explain these two algorithms in detail inorder to facilitate the introduction of our new algorithm.In the following,● fi(k) denotes the sum of probabilities of all statepaths that have read training sequence X up to andincluding sequence position k and that end in statei, i.e. fi(k) = P(x1, ... , xk , s(xk ) = i), where s(xk)denotes the state that reads sequence position xkfrom input sequence X. We call fi(k) the forwardprobability for sequence position k and state i.● pi(k, m) denotes the probability of selecting statem as the previous state while being in state i atsequence position k (i.e. sequence position k hasalready been read by state i), i.e. pi(k, m) = P(πk−1 =m|πk = i). For a given sequence position k and statei, pi(k, m) defines a probability distribution over pre-vious states as p k mim ( , ) =∑ 1 .The forward matrix is calculated using the forwardalgorithm [13]:Initialization: at the start of the input sequence, con-sider all states m Î in the model and setfmmm( )01 00 0==≠⎧⎨⎩Recursion: loop over all positions k from 1 to L in theinput sequence and loop, for each such sequence posi-tion k, over all states m Î \{0} = {1, ... , M} and setf k e x f k tm m k nnMn m( ) ( ) ( ) ,= ⋅ − ⋅=∑01 (3)Termination: at the end of the input sequence, i.e. fork = L and m = M the End state, setP X f L f L tM nnMx n M( ) ( ) ( ) ,= = ⋅=∑0Once we have calculated all forward probabilities fi(k)in the two-dimensional forward matrix, i.e. for all statesi in the model and all positions k in the given trainingsequence X, we can then use the stochastic back-tracingprocedure [13] to sample a state path from the posteriordistribution P(Π|X).The stochastic back-tracing starts at the end of theinput sequence, i.e. at sequence position k = L, in theEnd state, i.e. i = M , and selects state m as the previousstate with probability:p k mf k e x tf kif k tim k m iimi( , )( ) ( )( )( ),=− ⋅ ⋅⋅1if state is not silentm iif ki,( )if state is silent⎧⎨⎪⎪⎩⎪⎪(4)This procedure is continued until we reach the start ofthe sequence and the Start state. The resulting succes-sion of chosen previous states corresponds to one statepath Πs(X) that was sampled from the posterior distribu-tion P(Π|X ).The denominator in equation (4) corresponds to thesum of probabilities of all state paths that finish in statei at sequence position k, whereas the nominator corre-sponds to the sum of probabilities of all state paths thatfinish in state i at sequence position k and that havestate m as the previous state.When being in state i at sequence position k, we cantherefore use this ratio to sample which previous statem we should have come from.As this stochastic back-tracing procedure requires theentire matrix of forward values for all states and allsequence positions, the above algorithm for sampling astate path requires (ML) memory and (MTmaxL)Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 8 of 16time in order to first calculate the matrix of forwardvalues and then (L) memory and (LTmax) time forsampling a single state path from the matrix. Note thatadditional state paths can be sampled without having torecalculate the matrix of forward values. For sampling Kstate paths for the same sequence in a given iteration,we thus need ((M + K)TmaxL) time and (ML)memory, if we do not to store the sampled state pathsthemselves.If our computer has enough memory to use the for-ward algorithm and the stochastic back-tracing proce-dure described above, each iteration in the trainingalgorithm would require (M maxi{Li} + K maxi{Li})memory and ( )MT L K Lmax iiNiiN= =∑ ∑+1 1 time, whereLiiN=∑ 1 is the sum of the N sequence lengths in thetraining set and maxi{Li} the length of the longestsequence in training set . As we do not have to keepthe K sampled state paths in memory, the memoryrequirement can be reduced to (M maxi{Li}).For many bioinformatics applications, however, wherethe number of states in the model M is large, the con-nectivity Tmax of the model high or the trainingsequences are long, these memory and time require-ments are too large to allow automatic parameter train-ing using stochastic EM training.Obtaining the counts in a more efficient wayOur previous observations (V1) to (V5) that led to thelinear-memory algorithm for Viterbi training can bereplaced by similar observations for stochastic EMtraining:(S1) If we consider the description of the forwardalgorithm above, in particular the recursion in Equation(3), we realize that the calculation of the forward valuescan be continued by retaining only the values for theprevious sequence position.(S2) If we have a close look at the description of thestochastic back-tracing algorithm, in particular the sam-pling step in Equation (4), we observe that the samplingof a previous state only requires the forward values forthe current and the previous sequence position. So, pro-vided we are at a particular sequence position and in aparticular state, we can sample the state at the previoussequence position, if we know all forward values for theprevious sequence position.(S3) If we want to sample a state path Πs(X) from theposterior distribution P(Π|X), we have to start at theend of the sequence in the End state, see the descriptionabove and Equation (4) above. (The only valid alterna-tive for sampling state paths from the posterior distribu-tion would be to use the backward algorithm [13]instead of the forward algorithm and to then start thestochastic back-tracing procedure at the start of thesequence in the Start state.)Observations (S1) and (S2) above imply that localinformation suffices to continue the calculation of theforward values (S1) and to sample a previous state (S2)if we already are in a particular state and sequence posi-tion, whereas observation (S3) reminds us that in orderto sample from the correct probability distribution, wehave to start the sampling at the end of the trainingsequence. Given these three observations, it is – asbefore for Viterbi training – not obvious how we cancome up with a computationally more efficient algo-rithm. In order to realize that a more efficient algorithmdoes exist, one also has to note that:(S4) While calculating the forward values in the mem-ory-efficient way outlined in (S1) above, we can simulta-neously sample a previous state for every combination ofa state and a sequence position that we encounter in thecalculating of the forward values. This is possiblebecause of observation (S2) above.(S5) In every iteration q of the training procedure, weonly need to know the values of T X Xi jq s, ( , ( ))Π andE y X Xiq s( , , ( ))Π , i.e. how often each transition andemission appears in each sampled state path Πs(X) forevery training sequence X , but not where in the matrixof forward values the transition or emission was used.Given all observations (S1) to (S5) above, we can nowformally write down a new algorithm which calculatesT X Xi jq s, ( , ( ))Π and E y X Xiq s( , , ( ))Π in a computationallymore efficient way. In order to simplify the notation, weconsider one particular training sequence X = (x1, ... xL)of length L and omit the superscript for the iteration q,as both remain the same throughout the following algo-rithm. In the following, Ti,j (k, m) denotes the number oftimes the transition from state i to state j is used in asampled state path that finishes at sequence position k instate m and Ei(y, k, m) denotes the number of times statei read symbol y in a sampled state path that finishes atsequence position k in state m. As defined earlier, fi(k)denotes the forward probability for sequence position kand state i, pi(k, m) is the probability of selecting state mas the previous state while being in state i at sequenceposition k, i, j, n Î and y Î .Initialization: at the start of the training sequence Xand for all states m Î , setfmmT mE y mmi ji( )( , )( , , ),01 00 00 00 0==≠⎧⎨⎩==Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 9 of 16Recursion: loop over all positions k from 1 to L in thetraining sequence X and loop, for each such sequenceposition k, over all states m Î \{0} = {1, ... , M}and setf k e x f k tp k ne x f k tfm m k nnMn mmm k n n m( ) ( ) ( )( , )( ) ( ),,= ⋅ − ⋅=⋅ − ⋅=∑011mi j i j l i m ji ikT k m T k lE y k m E y k l( )( , ) ( , )( , , ) ( , , ), , , ,= − + ⋅= − +11 m i y xk, ,⋅where l denotes the state at previous sequence posi-tion k − 1 that was sampled from the probability distri-bution pm(k, n), n Î S, while being in state m atsequence position k.Termination: at the end of the input sequence, i.e. fork = L and m = M the End state, setf L f L tp L nf L tf LT L M TM nnMx n MMn n MMi j i( ) ( )( , )( )( )( , ),,, ,= ⋅=⋅==∑0j l i M ji iL lE y L M E y L l( , )( , , ) ( , , ), ,+ ⋅= where l now denotes the state at sequence position Lthat was sampled from the probability distribution pM(L,n), n Î , while being in the End state M at sequenceposition L, i.e. at the end of the training sequence.The above algorithm yields T L M T X Xi j i jq s, ,( , ) ( , ( ))= Π , andE y L M E y X Xi iq s( , , ) ( , , ( ))= Π (and f L P XMq( ) ( ))= , i.e. weknow how often a transition from state i to state j wasused and how often symbol y was read by state i in astate path ΠS(X) sampled from the posterior distributionP(X|Π) in iteration q for sequence X.Theorem 2: The above algorithm yieldsT L M T X Xi j i jq s, ,( , ) ( , ( ))= Π and E y L M E y X Xi iq s( , , ) ( , , ( ))= Π .Proof: The proof for this theorem is very similar tothe proof of theorem 1 for Viterbi training and thereforeomitted. The key differences are, first, that l here corre-sponds to the state at the previous sequence positionthat is sampled from a probability distribution ratherthan deterministically determined and, second, that Πshere corresponds to a sampled state path rather than adeterministically derived Viterbi path Π*.End of proof.As is clear from the above algorithm, the calculationof the fm, pm, Ti,j and Ei values for sequence position krequires only the respective values for the previoussequence position k − 1, i.e. the memory requirementcan be linearized with respect to the sequence length.For an HMM with M states, a training sequence oflength L and for every free parameter to be trained, wethus need (M) memory to store the fm values, (Tmax)memory to store the pm values and (M) memory tostore the cumulative counts for the free parameter itself inevery iteration, e.g. the Ti,j values for a particular transitionfrom state i to state j. If we sample K state paths, we haveto store the cumulative counts from different state pathsseparately, i.e. we need K times more memory to store thecumulative counts for each free parameter, but the mem-ory for storing the fm and the pm values remains the same.Overall, if K state paths are being sampled in each itera-tion, we thus need (M) memory to store the fm values, (Tmax) memory to store the pm values and (MK)memory to store the cumulative counts for the free para-meter itself in every iteration. For an HMM, the memoryrequirement of the new training algorithm is thus inde-pendent of the length of the training sequence.For training one free parameter in the HMM with theabove algorithm, each iterations requires (MTmaxL)time to calculate the fm and the pm values and to calcu-late the cumulative counts for one training sequence. IfK state paths are being sampled in each iteration, thetime required to calculate the cumulative countsincreases to (MTmaxLK), but the time requirementsfor calculating the fm and pm values remains the same.For sampling K state paths for the same inputsequence and training one free parameter, we thus need (MK + Tmax) memory and (MTmaxLK) time forevery iteration. If the model has Q parameters and if Pof these parameters are to be trained in parallel, i.e. P Î{1, ... Q} and Q/P Î N, the memory requirementincreases slightly to (MKP + Tmax) and the timerequirement becomes ( )MT LK QPmax. As for Viterbitraining, the linear-memory algorithm for stochastic EMtraining can therefore be readily used to trade memoryand time requirements, e.g. to maximize speed by usingthe maximum amount of available memory, see Table 1for a detailed overview.This can be directly compared to the algorithmdescribed in 2.1 with requires (ML) memory and (TmaxL(M + K)) time to do the same. Our new algo-rithm thus has the significant advantage of linearizing thememory requirement and making it independent of thesequence length for HMMs while increasing the timerequirement only by a factor of MKM K+, i.e. decreasing itwhen only one state path K = 1 is sampled.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 10 of 16ExamplesThe algorithms that we introduce here can be used totrain any HMM. The previous sections discuss the theo-retical properties of the different parameter trainingmethods in detail which are summarized in Table 1.Even though the theoretical properties of the respec-tive algorithms are independent of any particular HMM,the outcome of the different types of parameter trainingin terms of prediction accuracy and parameter conver-gence may very well depend on the features of a parti-cular HMM. This is because the quantities that can beshown to be (locally) optimized by some training algo-rithms do not necessarily translate into an optimizedprediction accuracy as defined by us here.In order to investigate how well the different methodsdo in practice in terms of prediction accuracy and para-meter convergence, we implemented Viterbi training,Baum-Welch training and stochastic EM training forthree small example HMMs. For each model, we imple-mented the linear-memory algorithm for Baum-Welchtraining published earlier as well as the linear-memoryalgorithms for Viterbi training and stochastic EM train-ing presented here.In the first step, we use each model with the originalparameter values to generate the sequences of the data set.We then randomly choose initial parameter values to initi-alize the HMM for parameter training. Each type of para-meter training is performed three times using 2/3 of theun-annotated data set as training set and the remaining 1/3 of the data set for performance evaluation, i.e. we per-form three cross-evaluation experiments for each model.Example 1: The dishonest casinoAs first case, we consider the well-known example ofthe dishonest casino [13], see Figure 1. This casino con-sists of a fair (state F) and a loaded dice (state L).The fair dice generates numbers from = {1, 2, 3, 4,5, 6} with equal probability, whereas the loaded dicegenerates the same numbers in a biased way. The prop-erties of the dishonest casino are readily captured in afour-state HMM with 8 transition and 12 emissionprobabilities, six each for each non-silent state F and L.Parameterizing the emission and transition probabilitiesof this HMM results in two independent transitionprobabilities and 10 independent emission probabilities,i.e. altogether 12 values to be trained. In order to avoidpremature termination of parameter training, we usepseudo-counts of 1 for every parameter to be trained.The data set for this model consists of 300 sequencesof 5000 bp length each. The results of the trainingexperiments are shown in Figures 2 and 3.Example 2: The extended dishonest casinoIn order to investigate a HMM with a more complicatedregular grammar, we extended the above example of thedishonest casino so it can now use the loaded dice(state L) only in multiples of two and the fair dice(state F) only in multiples of three, see Figure 4.This extended HMM has seven states, the silent Startand End states, two F states and three L states, 11 tran-sition probabilities and 30 emission probabilities. Para-meterizing the HMM’s probabilities yields twoindependent transition probabilities and 10 independentFLStart EndFigure 1 HMM of the dishonest casino. Symbolic representationof the HMM of the dishonest casino. States are shown as circles,transitions are shown as directed arrows. Please refer to the text formore details.number of iterationsperformance0 15 30 45 60 75 90 105 120 135 1500.00.20.40.60.81.0Baum−WelchStochastic EM 1Stochastic EM 3Stochastic EM 5ViterbiFigure 2 Performance for the dishonest casino. The averageperformance as function of the number of iterations for eachtraining algorithm. The performance is defined as the product ofthe sensitivity and specificity and the average is the average ofthree cross-evaluation experiments. For stochastic EM training, afixed number of state paths were sampled for each trainingsequence in each iteration (stochastic EM 1: one sampled statepath, stochastic EM 3: three sampled state paths, stochastic EM 5:five sampled state paths). The error bars correspond to the standarddeviation of the performance from the three cross-evaluationexperiments. Please refer to the text for more information.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 11 of 16emission probabilities to be trained, i.e. 12 parametervalues. In order to avoid premature termination of para-meter training, we use pseudo-counts of 1 for everyparameter to be trained.The data set for this model consists of 300 sequencesof 5000 bp length each. The results for this extendedmodel are shown in Figures 5 and 6.Example 3: The CpG island modelIn order to study the features for the different trainingalgorithms for a bioinformatics application, we alsoinvestigate an HMM that can be used to detect CpGislands in sequences of genomic DNA [13], see Figure 7.The model consists of 10 states, the silent Start and Endstates, four non-silent states to model regions inside CpGislands (states A+, C+, G+ and T+) and four non-silentstates to model regions outside CpG islands (states A−, C−, G− and T−). The emission probabilities for each of theeight non-silent states is a delta-function so that any par-ticular state (say A+ or A−) has an emission probability of1 for reading the corresponding DNA nucleotide (in thiscase A) and a probability of zero for all other nucleotides,i.e. eX + (Y) = eX− (Y) = δX,Y for X, Y Î {A, C, G, T}. Thisimplies that none of the emission probabilities of thismodel thus requires training. With a total of 80 transi-tion probabilities the model is, however, highly con-nected as any non-silent state is connected in bothdirections to any other non-silent state. ParameterizingFigure 3 Parameter convergence for the dishonest casino. Average differences of the trained and known parameter values as function ofthe number of iterations for each training algorithm. For a given number of iterations, we first calculate the average value of the absolutedifferences between the trained and known value of each emission parameter (left figure) or transition parameter (right figure) and then takethe average over the three experiments from the three-fold cross-evaluation. The error bars correspond to the standard deviation from the threecross-evaluation experiments. The algorithms have the same meaning as in Figure 2. Please refer to the text for more information.FStartF FEndL LFigure 4 HMM of the extended dishonest casino. Symbolicrepresentation of the HMM of the extended dishonest casino. Statesare shown as circles, transitions are shown as directed arrows.Please refer to the text for more details.number of iterationsperformance0 15 30 45 60 75 90 105 120 135 1500.00.20.40.60.81.0Baum−WelchStochastic EM 1Stochastic EM 3Stochastic EM 5ViterbiFigure 5 Performance for the extended dishonest casino. Theaverage performance as function of the number of iterations foreach training algorithm. The performance is defined as the productof the sensitivity and specificity and the average is the average ofthree cross-evaluation experiments. For stochastic EM training, afixed number of state paths were sampled for each trainingsequence in each iteration (stochastic EM 1: one sampled statepath, stochastic EM 3: three sampled state paths, stochastic EM 5:five sampled state paths). The error bars correspond to the standarddeviation of the performance from the three cross-evaluationexperiments. Please refer to the text for more information.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 12 of 16these transition probabilities results in 33 parameters, 32of which were determined in training (the transitionprobability to go to the End state was fixed). In order toavoid premature termination of parameter training, weuse pseudo-counts of 1 for every parameter to be trained.The data set for this model consists of 180 sequencesof 5000 bp length each. Figures 8 and 9 show the result-ing performance.Prediction accuracy and parameter convergence Ourprimary goal is to investigate how the prediction accu-racy of the different training algorithms varies as func-tion of the number of iterations. The predictionaccuracy or performance is defined as the product ofthe sensitivity and specificity. Figures 2, 5 and 8 showthe prediction accuracy as function of the number ofiterations for all three training methods for the respec-tive model.Another important goal of parameter training is torecover the original parameter values of the correspond-ing model. We therefore also investigate how well thetrained parameter values converge to the original para-meter values, see Figures 3, 6 and 9 show the averagedifferences between the trained and known parameterFigure 6 Parameter convergence for the extended dishonest casino. Average differences of the trained and known parameter values asfunction of the number of iterations for each training algorithm. For a given number of iterations, we first calculate the average value of theabsolute differences between the trained and known value of each emission parameter (left figure) or transition parameter (right figure) andthen take the average over the three cross-evaluation experiments. The error bars correspond to the standard deviation from the three cross-evaluation experiments. The algorithms have the same meaning as in Figure 5. Please refer to the text for more information.G+ T+StartC+A+A− C− G− T−EndFigure 7 CpG island HMM. Symbolic representation of the CpGisland HMM. States are shown as circles, transitions are shown asdirected arrows. Every non-silent state can be reached from theStart state and has a transition to the End state. In addition, everynon-silent state is connected in both directions to all non-silentstates. For clarity, we here only show the transitions from theperspective of the A+ state. Please refer to the text for more details.number of iterationsperformance0 15 30 45 60 75 90 105 120 135 1500.00.20.40.60.81.0Baum−WelchStochastic EM 1Stochastic EM 3Stochastic EM 5ViterbiFigure 8 Performance for the CpG island model. The averageperformance as function of the number of iterations for eachtraining algorithm. The performance is defined as the product ofthe sensitivity and specificity and the average is the average ofthree cross-evaluation experiments. For stochastic EM training, afixed number of state paths were sampled for each trainingsequence in each iteration (stochastic EM 1: one sampled statepath, stochastic EM 3: three sampled state paths, stochastic EM 5:five sampled state paths). The error bars correspond to the standarddeviation of the performance from the three cross-evaluationexperiments. Please refer to the text for more information.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 13 of 16values as function of the number of iterations for eachtraining algorithm and the respective model. Every datapoint is calculated by first determining the average valueof the absolute differences between the trained andknown value of each emission parameter (left figures) ortransition parameter (right figures) and then taking theaverage over the three experiments from the three-foldcross-evaluation.For the dishonest casino and the extended dishonestcasino, stochastic EM training performs best, both interms of performance and parameter convergence. It isinteresting to note that the results for sampling one,three or five state paths per training sequence and periteration are essentially the same within error bars. Forthese two models, Viterbi training converges fastest, i.e.the Viterbi paths remain the same from one iteration tothe next, but the point of convergence is sub-optimal interms of performance and in particular in terms of para-meter convergence. Baum-Welch training does betterthan Viterbi training for these two models, but not aswell as stochastic BM training as it requires more itera-tions to reach a lower prediction accuracy and worseparameter convergence and as it exhibits the largest var-iation with respect to the three cross-evaluation experi-ments. The latter is due to many high-scoring, sub-optimal state paths. For the CpG island model, all train-ing algorithms do almost equally well, with Viterbitraining converging fastest. Table 2 summarizes theCPU time per iteration for the different trainingalgorithms and models. For all three models, stochasticEM training is faster than Baum-Welch training for one,three or five sampled state paths per training sequence.Viterbi training is even a bit more time efficient thanstochastic EM training when sampling one state pathper training sequence.Based on the results from these three small examplemodels, we would thus recommend using stochastic EMtraining for parameter training.Conclusion and discussionA wide range of bioinformatics applications are basedon hidden Markov models. Having computationally effi-cient algorithms for training the free parameters ofthese models is key to optimizing the performance ofthese models and to adapting the models to new datasets, e.g. biological data sets from a different organism.We here introduce two new algorithms which renderthe memory requirements for Viterbi training and sto-chastic EM training independent of the sequence length.This is achieved by replacing the usual bi-directionaltwo-step procedure (which involves first calculating theViterbi matrix and then retrieving the Viterbi path (incase of Viterbi training) or first calculating the forwardmatrix and the backward matrix before estimating counts(in case of Baum-Welch training)) by a one-step proce-dure which scans each training sequence only in a one-directional way. For an HMM with M states and a con-nectivity of Tmax, a training sequence of length L and oneiteration, our new algorithm reduces the memoryrequirement of Viterbi training from (ML) to (M )while keeping the time requirement of (MTmaxL)unchanged, see Table 1 for details. For stochastic EMtraining where K is the number of state paths sampledfor every training sequence in every iteration, the mem-ory requirements are (as, typically, L ≫ K + 1 ≥ K +Tmax/M ) reduced from (ML) to (MK + Tmax) whilethe time requirement per iteration changes fromnumber of iterationstprob convergence0 15 30 45 60 75 90 105 120 135 1500.00.10.20.30.40.5Baum−WelchStochastic EM 1Stochastic EM 3Stochastic EM 5ViterbiFigure 9 Parameter convergence for the CpG island model.Average differences of the trained and known parameter values asfunction of the number of iterations for each training algorithm. Fora given number of iterations, we first calculate the average value ofthe absolute differences between the trained and known value ofeach transition parameter (this model does not have any emissionparameters that require training) and then take the average overthe three cross-evaluation experiments. The error bars correspond tothe standard deviation from the three cross-evaluation experiments.The algorithms have the same meaning as in Figure 8. Please referto the text for more information.Table 2 CPU time use for different modelsCPU time (sec) periterationdishonest extendeddishonestCpGislandCasino Casino ModelBaum-Welch training 8.85 5.94 22.22stochastic EM training K = 1 5.12 3.42 5.42stochastic EM training K = 3 6.02 4.42 10.30stochastic EM training K = 5 7.06 5.38 14.84Viterbi training 4.42 2.84 5.00Overview of the CPU time usage in seconds per iteration for Viterbi training,Baum-Welch training and stochastic EM training for the three differentmodels. For each model, we implemented each of the three training methodsusing the linear-memory algorithms for Baum-Welch training, Viterbi trainingand stochastic EM training. The number of state paths that are sampled foreach iteration and each training sequence in stochastic EM training isdenoted K.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 14 of 16(TmaxL(M + K)) to (TmaxLMK) depending on the user-chosen value of K. An added advantage of our two newalgorithms is they are easier to implement than the corre-sponding default algorithms for Viterbi training and sto-chastic EM training. In addition to introducing the twonew algorithms for Viterbi training and stochastic EMtraining, we also examine their practical merits for threesmall example models by comparing them to the linear-memory algorithm for Baum-Welch training which wasintroduced earlier. Based on our results from these three(non-representative) models, we would recommend usingstochastic EM training for parameter training.We have implemented the new algorithms for Viterbitraining and stochastic EM training as well as the lin-ear-memory algorithm for Baum-Welch training intoour HMM-compiler HMMCONVERTER[37] which canbe used to set up a variety of HMM-based applicationsand which is freely available under the GNU General Pub-lic License version 3 (GPLv3). Please see http://people.cs.ubc.ca/~irmtraud/training for more information and thesource code.We hope that the new parameter training algorithmsintroduced here will make parameter training forHMM-based applications easier, in particular those inbioinformatics.AcknowledgementsBoth authors would like to thank the anonymous referees for providinguseful comments. We would also like to thank Anne Condon for giving ushelpful feedback on our manuscript. Both authors gratefully acknowledgesupport by a Discovery Grant of the Natural Sciences and EngineeringResearch Council, Canada, and by a Leaders Opportunity Fund of theCanada Foundation for Innovation to I.M.M.Authors’ contributionsTYL and IMM devised the new algorithms, TYL implemented them, TYL andIMM conducted the experiments, evaluated the experiments and wrote themanuscript. All authors read and approved the final manuscript.Competing interestsThe authors declare that they have no competing interests.Received: 21 June 2010 Accepted: 9 December 2010Published: 9 December 2010References1. Meyer I, Durbin R: Gene structure conservation aids similarity based geneprediction. Nucleic Acids Research 2004, 32(2):776-783.2. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B:AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic AcidsResearch 2006, 34:W435-W439.3. Won K, Sandelin A, Marstrand T, Krogh A: Modeling promoter grammarswith evolving hidden Markov models. Bioinformatics 2008,24(15):1669-1675.4. Finn R, Tate J, Mistry J, Coggill P, Sammut S, Hotz H, Ceric G, Forslund K,Eddy S, Sonnhammer E, Bateman A: The Pfam protein families database.Nucleic Acids Research 2008, 36:281-288.5. Nguyen C, Gardiner K, Cios K: A hidden Markov model for predictingprotein interfaces. Journal of Bioinformatics and Computational Biology2007, 5(3):739-753.6. Krogh A, Larsson B, von Heijne G, Sonnhammer E: Predictingtransmembrane protein topology with a hidden Markov model:application to complete genomes. Journal of Molecular Biology 2001,305(3):567-580.7. Bjöorkholm P, Daniluk P, Kryshtafovych A, Fidelis K, Andersson R, Hvidsten T:Using multi-data hidden Markov models trained on local neighborhoodsof protein structure to predict residue-residue contacts. Bioinformatics2009, 25(10):1264-1270.8. Qian X, Sze S, Yoon B: Querying pathways in protein interaction networksbased on hidden Markov models. Journal of Computational Biology 2009,16(2):145-157.9. Drawid A, Gupta N, Nagaraj V, Gélinas C, Sengupta A: OHMM: a HiddenMarkov Model accurately predicting the occupancy of a transcriptionfactor with a self-overlapping binding motif. BMC Bioinformatics 2009,10:208.10. king F, Sterne J, Smith G, Green P: Inference from genome-wideassociation studies using a novel Markov model. Genetic Epidemiology2008, 32(6):497-504.11. Su S, Balding D, Coin L: Disease association tests by inferring ancestralhaplotypes using a hidden markov model. Bioinformatics 2008,24(7):972-978.12. Juang B, Rabiner L: A segmental k-means algorithm for estimatingparameters of hidden Markov models. IEEE Transactions on Acoustics,Speech, and Signal Processing 1990, 38(9):1639-1641.13. Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis:Probabilistic models of proteins and nucleic acids Cambridge: CambridgeUniversity Press; 1998.14. Besemer J, Lomsazde A, Borodovsky M: GeneMarkS: a self-training methodfor prediction of gene starts in microbial genomes. Implications forfinding sequence motifs in regulatory regions. Nucleic Acids Research2001, 29(12):2607-2618.15. Lunter G: HMMoC – a compiler for hidden Markov models. Bioinformatics2007, 23(18):2485-2487.16. Ter-Hovhannisyan V, Lomsadze A, Cherno Y, Borodovsky M: Geneprediction in novel fungal genomes using an ab initio algorithm withunsupervised training. Genome Research 2008, 18:1979-1990.17. Viterbi A: Error bounds for convolutional codes and an assymptoticallyoptimum decoding algorithm. IEEE Trans Infor Theor 1967, 260-269.18. Keibler E, Arumugam M, Brent MR: The Treeterbi and Parallel Treeterbialgorithms: efficient, optimal decoding for ordinary, generalized and pairHMMs. Bioinformatics 2007, 23(5):545-554.19. Sramek R, Brejova B, Vinar T: On-line Viterbi algorithm for analysis of longbiological sequences. Algorithms in Bioinformatics, Lecture Notes inBioinformatics 2007, 4645:240-251.20. Lifshits Y, Mozes S, Weimann O, Ziv-Ukelson M: Speeding Up HMMDecoding and Training by Exploiting Sequence Repetitions. Algorithmica2009, 54(3):379-399.21. Baum L: An equality and associated maximization technique in statisticalestimation for probabilistic functions of Markov processes. Inequalities1972, 3:1-8.22. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete datavia the EM algorithm. J Roy Stat Soc B 1977, 39:1-38.23. Larsen T, Krogh A: EasyGene - a prokaryotic gene finder that ranks ORFsby statistical significance. BMC Bioinformatics 2003, 4:21.24. Jensen JL: A Note on the Linear Memory Baum-Welch Algorithm. Journalof Computational Biology 2009, 16(9):1209-1210.25. Khreich W, Granger E, Miri A, Sabourin R: On the memory complexity ofthe forward-backward algorithm. Pattern Recognition Letters 2010,31(2):91-99.26. Elliott RJ, Aggoun L, Moon JB: Hidden Markov Models. Estimation andControl Berlin, Germany: Springer-Verlag; 1995.27. Sivaprakasam S, Shanmugan SK: A forward-only recursion based hmm formodeling burst errors in digital channels. IEEE Global TelecommunicationsConference 1995, 2:1054-1058.28. Turin W: Unidirectional and parallel Baum-Welch algorithms. IEEE TransSpeech Audio Process 1998, , 6: 516-523.29. Miklós I, Meyer I: A linear memory algorithm for Baum-Welch training.BMC Bioinformatics 2005, 6:231.30. Churbanov A, Winters-Hilt S: Implementing EM and Viterbi algorithms forHidden Markov Model in linear memory. BMC Bioinformatics 2008, 9:224.Lam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 15 of 1631. Hobolth A, Jensen JL: Applications of hidden Markov models forcharacterization of homologous DNA sequences with common genes.Journal of Computational Biology 2005, 12:186-203.32. Bishop CM: Pattern Recognition and Machine Learning Berlin, Germany:Springer-Verlag; 2006, chap. 11.1.6.33. Cawley SL, Pachter L: HMM sampling and applications to gene findingand alternative splicing. Bioinformatics 2003, 19(2):ii36-ii41.34. Grice JA, Hughey R, Speck D: Reduced space sequence alignment.Computer Applications in the Biosciences 1997, 13:45-53.35. Tarnas C, Hughey R: Reduced space hidden Markov model training.Bioinformatics 1998, 14(5):401-406.36. Wheeler R, Hughey R: Optimizing reduced-space sequence analysis.Bioinformatics 2000, 16(12):1082-1090.37. Lam TY, Meyer I: HMMConverter 1.0: a toolbox for hidden Markovmodels. Nucleic Acids Research 2009, 37(21):e139.38. Hirschberg D: A linear space algorithm for computing maximal commonsubsequences. Commun ACM 1975, 18:341-343.39. Meyer IM, Durbin R: Comparative ab initio prediction of gene structuresusing pair HMMs. Bioinformatics 2002, 18(10):1309-1318.doi:10.1186/1748-7188-5-38Cite this article as: Lam and Meyer: Efficient algorithms for training theparameters of hidden Markov models using stochastic expectationmaximization (EM) training and Viterbi training. Algorithms for MolecularBiology 2010 5:38.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistributionSubmit your manuscript at www.biomedcentral.com/submitLam and Meyer Algorithms for Molecular Biology 2010, 5:38http://www.almob.org/content/5/1/38Page 16 of 16
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- Efficient algorithms for training the parameters of...
Open Collections
UBC Faculty Research and Publications
Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation… Lam, Tin Y; Meyer, Irmtraud M Dec 9, 2010
pdf
Page Metadata
Item Metadata
Title | Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation maximization (EM) training and Viterbi training |
Creator |
Lam, Tin Y Meyer, Irmtraud M |
Publisher | BioMed Central |
Date Issued | 2010-12-09 |
Description | Background: Hidden Markov models are widely employed by numerous bioinformatics programs used today. Applications range widely from comparative gene prediction to time-series analyses of micro-array data. The parameters of the underlying models need to be adjusted for specific data sets, for example the genome of a particular species, in order to maximize the prediction accuracy. Computationally efficient algorithms for parameter training are thus key to maximizing the usability of a wide range of bioinformatics applications. Results: We introduce two computationally efficient training algorithms, one for Viterbi training and one for stochastic expectation maximization (EM) training, which render the memory requirements independent of the sequence length. Unlike the existing algorithms for Viterbi and stochastic EM training which require a two-step procedure, our two new algorithms require only one step and scan the input sequence in only one direction. We also implement these two new algorithms and the already published linear-memory algorithm for EM training into the hidden Markov model compiler HMM- CONVERTER and examine their respective practical merits for three small example models. Conclusions: Bioinformatics applications employing hidden Markov models can use the two algorithms in order to make Viterbi training and stochastic EM training more computationally efficient. Using these algorithms, parameter training can thus be attempted for more complex models and longer training sequences. The two new algorithms have the added advantage of being easier to implement than the corresponding default algorithms for Viterbi training and stochastic EM training. |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2016-01-07 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0223143 |
URI | http://hdl.handle.net/2429/56259 |
Affiliation |
Computer Science, Department of Medical Genetics, Department of Medicine, Faculty of Science, Faculty of Other UBC |
Citation | Algorithms for Molecular Biology. 2010 Dec 09;5(1):38 |
Publisher DOI | 10.1186/1748-7188-5-38 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Lam and Meyer. |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52383-13015_2010_Article_111.pdf [ 840.75kB ]
- Metadata
- JSON: 52383-1.0223143.json
- JSON-LD: 52383-1.0223143-ld.json
- RDF/XML (Pretty): 52383-1.0223143-rdf.xml
- RDF/JSON: 52383-1.0223143-rdf.json
- Turtle: 52383-1.0223143-turtle.txt
- N-Triples: 52383-1.0223143-rdf-ntriples.txt
- Original Record: 52383-1.0223143-source.json
- Full Text
- 52383-1.0223143-fulltext.txt
- Citation
- 52383-1.0223143.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0223143/manifest