"Science, Faculty of"@en . "Computer Science, Department of"@en . "DSpace"@en . "UBCV"@en . "Lam, Tin Yin"@en . "2009-03-09T20:40:19Z"@en . "2008"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "Hidden Markov models (HMMs) are powerful statistical tools for biological sequence analysis.\nMany recently developed Bioinformatics applications employ variants of HMMs to analyze\ndiverse types of biological data. It is typically fairly easy to design the states and the topological\nstructure of an HMM. However, it can be difficult to estimate parameter values which yield a\ngood prediction performance. As many HMM-based applications employ similar algorithms\nfor generating predictions, it is also time-consuming and error-prone to have to re-implement\nthese algorithms whenever a new HMM-based application is to be designed.\nThis thesis addresses these challenges by introducing a tool-box, called HMMC0NvERTER,\nwhich only requires an XML-input file to define an HMM and to use it for sequence decoding\nand parameter training. The package not only allows for rapid proto-typing of HMM-based\napplications, but also incorporates several algorithms for sequence decoding and parameter\ntraining, including two new, linear memory algorithms for parameter training. Using this\nsoftware package, even users without programming knowledge can quickly set up sophisticated\nHMMs and pair-HMMs and use them with efficient algorithms for parameter training and\nsequence analyses. We use HMMCONVERTER to construct a new comparative gene prediction\nprogram, called ANNOTAID, which can predict pairs of orthologous genes by integrating prior\ninformation about each input sequence probabilistically into the gene prediction process and\ninto parameter training. ANNOTAID can thus be readily adapted to predict orthologous gene\npairs in newly sequenced genomes."@en . "https://circle.library.ubc.ca/rest/handle/2429/5786?expand=metadata"@en . "2755022 bytes"@en . "application/pdf"@en . "HMMCONVERTER A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms by TIN YIN LAM B.Sc., The Chinese University of Hong Kong, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) October, 2008 \u00C2\u00A9 TIN YIN LAM 2008 Abstract Hidden Markov models (HMMs) are powerful statistical tools for biological sequence analysis. Many recently developed Bioinformatics applications employ variants of HMMs to analyze di verse types of biological data. It is typically fairly easy to design the states and the topological structure of an HMM. However, it can be difficult to estimate parameter values which yield a good prediction performance. As many HMM-based applications employ similar algorithms for generating predictions, it is also time-consuming and error-prone to have to re-implement these algorithms whenever a new HMM-based application is to be designed. This thesis addresses these challenges by introducing a tool-box, called HMMC0NvERTER, which only requires an XML-input file to define an HMM and to use it for sequence decoding and parameter training. The package not only allows for rapid proto-typing of HMM-based applications, but also incorporates several algorithms for sequence decoding and parameter training, including two new, linear memory algorithms for parameter training. 
Using this software package, even users without programming knowledge can quickly set up sophisticated HMMs and pair-HMMs and use them with efficient algorithms for parameter training and sequence analyses. We use HMMCONVERTER to construct a new comparative gene prediction program, called ANNOTAID, which can predict pairs of orthologous genes by integrating prior information about each input sequence probabilistically into the gene prediction process and into parameter training. ANNOTAID can thus be readily adapted to predict orthologous gene pairs in newly sequenced genomes. 11 Table of Contents Abstract ii Table of Contents iii List of Figures vii Acknowledgements ix 1 HMMs in Bioinformatics applications 1 1.1 Introduction 1 1.2 Introduction to hidden Markov models 3 1.2.1 Theoretical background 3 1.2.2 Variants of HMMs for Bioinformatics applications 8 1.3 Challenges in designing and implementing HMMs 13 1.4 Discussion and conclusion 16 2 Decoding and parameter training algorithms for HMMs 18 2.1 Introduction and motivation 18 2.2 Decoding algorithms 19 2.2.1 The Viterbi algorithm 20 2.2.2 The Hirschberg algorithm 23 111 Table of Contents 2.3 Parameter training 27 2.3.1 The Baum-Welch algorithm 28 2.3.2 The Viterbi training algorithm 31 2.4 Discussion and conclusion 32 3 HMMCONVERTER 34 3.1 Motivation 34 3.2 Main features 36 3.2.1 Input and output files 36 3.2.2 Special features 39 3.2.3 Sequence decoding algorithms 51 3.2.4 Training algorithms 53 3.3 Training using sampled state paths 58 3.3.1 Sampling a state path from posterior distribution 58 3.3.2 A new, linear memory algorithm for parameter training using sampled state paths 61 3.3.3 A new linear memory algorithm for Viterbi training 68 3.3.4 Pseudo-counts and pseudo-probabilities 72 3.3.5 Training free parameters 73 3.4 Availability 76 3.5 Discussion and conclusion 77 4 Specifying an HMM or pair-HMM with HMMConverter 78 4.1 Introduction 78 4.2 The XML-input file 79 4.2.1 The component of the XML file 79 iv Table of Contents 4.2.2 The component of the XML file 91 4.3 Other text input files 99 4.3.1 The input sequence file 99 4.3.2 The free emission parameter file 101 4.3.3 The free transition parameter file 103 4.3.4 The tube file 104 4.3.5 The prior information file 105 4.4 The output files of HMMC0NvERTER 107 4.5 Examples 111 5 Implementation of HMMConverter 112 5.1 Introduction 112 5.2 Converting XML into C++ code 113 5.2.1 The model_parameters class 113 5.2.2 Other classes 115 5.3 Decoding and training algorithms 120 5.3.1 The Viterbi and Hirschberg algorithms 120 5.3.2 The parameter training algorithms 122 6 Annotaid 124 6.1 Introduction and Motivation 124 6.2 Main features of Annotaid 126 6.2.1 The pair-HMM of ANNOTAID 126 6.2.2 Banding and sequence decoding algorithms 127 6.2.3 Integrating prior information along the input sequences 127 6.2.4 Parameter training in ANNOTAID 130 v Table of Contents 6.3 Availability 133 6.4 Discussion and Future Work 134 Bibliography 136 vi List of Figures 1.1 A Markov model for the CpG-Island process 5 1.2 An HMM for the CpG-Island 5 1.3 A 5-state simple pair-HMM 9 1.4 A simple pair-HMM for predicting related protein coding regions in two input DNA sequences 10 2.1 A simple generalized pair-HMM for predicting related protein coding regions in two DNA sequences 19 2.2 Illustration of the Viterbi algorithm 23 2.3 Illustration of the Hirschberg algorithm (1) 25 2.4 Illustration of the Hirschberg algorithm (2) 26 3.1 The XML-input file for defining an HMM in HMMC0NvERTER 37 3.2 Using BLAST to 
generate a tube for HMMC0NvERTER 44 3.3 Defining a tube for HMMC0NvERTER explicitly 45 3.4 A state which carries labels from two different annotation label sets 47 3.5 Illustration of the special emission feature in HMMC0NvERTER 49 3.6 Using the Hirschberg algorithm inside a tube 53 3.7 Illustration of the forward algorithm 63 vi\u00E2\u0080\u0099 List of Figures 3.8 Sampling a state path from the posterior distribution 63 3.9 Illustration of the novel, linear memory algorithm for parameter training using sampled state paths 69 3.10 A part of the pair-HMM of DOUBLESCAN 75 4.1 Defining a state of an HMM or pair-HMM in HMMCONvERTER 83 4.2 A part of the pair-HMM of DOUBLESCAN 98 4.3 Specifying a tube explicitly in a flat text file 105 4.4 A simple 5-state pair-HMM 108 6.1 Combing the Hirschberg algorithm with a tube 128 6.2 Special emissions in PROJECTOR and ANNOTAID 129 6.3 Combining the parameter training with a tube 132 vu\u00E2\u0080\u0099 Acknowledgements I would like to thank my supervisor, Prof. Irmtraud Meyer for her advice, support and encour agement during the course of this project. A special thank you goes to Prof. Anne Condon for agreeing to be the second reader of my thesis. I thank Nick Wiebe, Rodrigo Goya, Chris Thachuk and all other members of the BETA lab who have made my life on campus enjoyable as they are always being helpful and nice. ix Chapter 1 HMMs in Bioinformatics applications 1.1 Introduction Hidden Markov models (HMMs) are flexible and powerful statistical tools for many different kinds of sequence analysis. HMMs were first introduced in a series of papers by Leonard E. Baum in the 1960s, and an early application of HMMs is speech recognition in the mid 1970s. Today, HMM-based methods are widely used in many different fields such as speech recognition and speech synthesis [42, 45], natural language processing [32], hand-writing recog nition [27] and Bioinformatics [11]. Among the above applications, HMMs are especially popular for sequence analysis in Bioinformatics. In Bioinformatics, HMMs are used for sequence annotation such as gene prediction [8, 23], sequence analysis such as transcription binding site recognition [17] and peptide identification [22], sequence alignment such as protein alignment [21, 24] and other applications such as protein-protein interaction prediction [41]. Many variations of HMMs have been developed in the recent two decades for different Bioinformatics applications. In addition, new algorithms for sequence decoding and parameter training have been introduced. Although HMMs are a powerful and popular concept, several practical challenges often have 1 1.1. Introduction to be overcome in practice. It is not easy to design an HMM, especially to assign values to its parameters. The time and memory requirements for analyzing long input sequences can be prohibitive. Finally, it can be very time consuming and error-prone to implement an HMM and the corresponding prediction and parameter training algorithms. This thesis presents an HMM generation tool called HMMCONVERTER, which only re quires an XML-file and simple text input files for defining an HMM and for using the HMM for predictions and for training its parameters. Users of HMMC0NvERTER are not required to have programming skills. The tool provides the most commonly used Viterbi algorithm [47], and the more memory efficient Hirschberg algorithm [161 for sequence decoding. 
For parameter training, HMMCONVERTER offers not only a new, linear memory algorithm for Baum-Welch training, but also a novel linear memory variation of the Viterbi training algorithm as well as a new and efficient algorithm for training with sampled state paths. Furthermore, HMMCONVERTER supports many special features for constructing various types of HMMs, such as higher order HMMs, pair-HMMs as well as HMMs which take external information on the input sequences into account for parameter training and generating predictions.

In this chapter, we give some theoretical background on HMMs, describe variants of HMMs and their applications in Bioinformatics, and further discuss the challenges of designing and implementing an HMM. Existing sequence decoding and parameter training algorithms for HMMs are introduced in chapter 2. Chapter 3 describes the main features of HMMCONVERTER, and includes a detailed explanation of the novel parameter training algorithms. The specification of an HMM within HMMCONVERTER and the input and output formats for the tool are described in chapter 4, and some implementation details of HMMCONVERTER are provided in chapter 5. Finally, we use this tool to implement a novel comparative gene prediction pair-HMM called ANNOTAID which can take various types of prior information on each of the two input sequences into account in order to give more accurate predictions. The features and applications of this new model are described in chapter 6.

1.2 Introduction to hidden Markov models

1.2.1 Theoretical background

A hidden Markov model (HMM) is a variation of a Markov model, where the state sequence of a Markov chain is hidden from the observation. We first explain what a Markov chain is and then use a simple biological example to introduce HMMs.

Markov model

A Markov chain is a random process with the Markov property. In brief, a random process is a process for which the next state is determined in a probabilistic way. The Markov property implies that given the current state of the random process, the future state is independent of the past states. Let X_i be a random variable for all positive integers i, representing the state of the process at time i. The sequence X_1, X_2, ..., X_i, ... is a Markov chain if:

P(X_i = x_i | X_{i-1} = x_{i-1}, ..., X_1 = x_1) = P(X_i = x_i | X_{i-1} = x_{i-1})

A time-homogeneous Markov chain means that the next state of the process does not depend on the time, but only on the current state:

P(X_i = x | X_{i-1} = y) = P(X_2 = x | X_1 = y) for all i

One can also define higher order Markov chains. A Markov chain of n-th order is a Markov chain for which the next state depends on the current state and the previous n - 1 states. Note that when we say Markov chain, we refer to the simplest case, i.e. a 1st order Markov chain. A Markov chain can be represented by a finite state machine [11].

CpG-Island

Now, we use the CpG-Island [4] example from the book Biological Sequence Analysis [11] to introduce hidden Markov models. In a genomic DNA sequence, the dinucleotide CG (i.e. a neighboring C and G on the same strand) appears more frequently around the transcription start sites of genes than in other regions of the genome. We call regions which have an increased frequency of CG pairs a CpG-Island. They are usually a few hundred to a few thousand bases long. Suppose we want to identify CpG-Islands in a long genomic sequence. We can do this by defining a Markov chain.
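To make the Markov property and the factorisation it implies concrete, here is a minimal, self-contained C++ sketch of a first-order, time-homogeneous Markov chain over the DNA alphabet. It is purely illustrative: the transition values are invented and the code is not part of HMMCONVERTER.

```cpp
// Minimal illustration of a first-order, time-homogeneous Markov chain over
// the DNA alphabet {A,C,G,T}. All probability values below are invented for
// illustration only; they are not taken from the thesis or from HMMConverter.
#include <array>
#include <cmath>
#include <iostream>
#include <string>

static int index_of(char c) {            // map nucleotide -> row/column index
    switch (c) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; case 'T': return 3; }
    return -1;
}

int main() {
    // t[i][j] = P(X_k = j | X_{k-1} = i); each row sums to 1.
    const std::array<std::array<double, 4>, 4> t = {{
        {0.30, 0.20, 0.30, 0.20},
        {0.25, 0.25, 0.30, 0.20},
        {0.25, 0.25, 0.25, 0.25},
        {0.20, 0.25, 0.30, 0.25}
    }};
    const std::array<double, 4> start = {0.25, 0.25, 0.25, 0.25};  // P(X_1)

    const std::string seq = "CGCGTA";
    // Because of the Markov property, the joint probability factorises into
    // P(X_1) and one transition term per adjacent pair of positions.
    double log_p = std::log(start[index_of(seq[0])]);
    for (std::size_t k = 1; k < seq.size(); ++k)
        log_p += std::log(t[index_of(seq[k - 1])][index_of(seq[k])]);

    std::cout << "log P(" << seq << ") = " << log_p << std::endl;
    return 0;
}
```

The only point of the sketch is that adjacent positions alone enter the product, which is exactly what the Markov property states.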
Each state of the model in figure 1.2.1 represents a DNA nucleotide in a CpG-Island or not in a CpG-Island. The Markov property holds for this model as the probability of going from a given state to the next state only depends on the current state, but not the previous states. Figure 1.2.1 shows the Markov model of the CpG-Island \u00E2\u0080\u009Cprocess\u00E2\u0080\u009D, but this model is lacking something as we cannot distinguish whether an observed nucleotide is from a CpG-Island(+) or not (-). We can modify this model a little, see figure 1.2.1. In this new model, each observed nucleotide can be read by two states, one state with label \u00E2\u0080\u009C+\u00E2\u0080\u009C representing a nucleotide from a CpG-Island and one state with \u00E2\u0080\u009C-\u00E2\u0080\u009C representing a nucleotide outside a CpG-Island. This is a hidden Markov model for a CpG-Island process, where the \u00E2\u0080\u009Chidden feature\u00E2\u0080\u009D is the underlying \u00E2\u0080\u009C+\u00E2\u0080\u009C or \u00E2\u0080\u009C-\u00E2\u0080\u009C state associated to nucleotides in the observed sequence. The transition probability from the C+ state to the G+ state should be larger than the transition probability from the C- state to the G- state. The start state and the end state in the two models are both silent states which do not read any character from the input sequence. The model always start with the start state 4 1.2. Introduction to hidden Markov models e Figure 1.1: This is a Markov model for annotation problem. A Markov model for the CpG-Island process the CpG-Island process, but it cannot solve the CpG-Island B Figure 1.2: An HMM for the CpG-Island This figure shows an HMM which can model CpG-Islands. The states of the model are all connected to each other, i.e. it is a complete graph(K8)where each state also has a transition back to itself. 5 1.2. Introduction to hidden Markov models before reading any character from the input sequence. On the other hand, every state path ends at the end state. In the above example, the start state is connected to all the other states (except the end state), and all the states except the start state are connected to the end state. Now we are ready to give a more formal definition of HMM, an HMM is unambiguously defined by the following components: \u00E2\u0080\u00A2 A set of N states S = {0,... , N \u00E2\u0080\u0094 1}, where 0 is the start state and N \u00E2\u0080\u0094 1 is the end state. \u00E2\u0080\u00A2 An alphabet A which contains the symbols of observations of the HMM (i.e. {A,C,G,T} is the DNA alphabet for the CpG-Island 11MM in the above example). \u00E2\u0080\u00A2 A set of transition probabilities T = i, j e S}, where tj E [0, 1] is the transition probability of going from state i in the model to state j and JI tj,j\u00E2\u0080\u0099 = 1 for all states i inS. \u00E2\u0080\u00A2 A set of emission probabilities E = {ejQy)Ii 5,7 E A}, where ej(7) e [0,11 is the emission probability of state i for symbol \u00E2\u0080\u0098y and 7\u00E2\u0080\u0099EA e (Wy\u00E2\u0080\u0099) = 1 for all states i in S. An HMM as defined above corresponds to a probabilistic model. It can be used in a number of ways to generate predictions. In the following, \u00E2\u0080\u00A2 let M denote an 11MM, \u00E2\u0080\u00A2 X = (xi,... , XL) an input sequence of length L where x is the i-th letter of the sequence and \u00E2\u0080\u00A2 Tr = , lrj) a state path of length 1 + 1 in the model M where ir is the state at the k-th position in the state path, i.e. 
iro is always the start state and Trj is always the 6 1.2. Introduction to hidden Markov models end state. We can use M in a number of ways to generate predictions: \u00E2\u0080\u00A2 Sequence decoding: For a given input sequence X and model M, we can for example calculate the state path with the highest overall probability in the model, i.e. the state path ir which maximizes P(irIX, M). In that case, model M analyzes the sequence X and we will, in all of the following, say that the states of model M read the symbols of observations of input sequence X. \u00E2\u0080\u00A2 Parameter training: For a given set of training sequences X and a given model M, we can train the set of free parameters 0 of M, for example by using training algorithms which maximize P(XI0, M). \u00E2\u0080\u00A2 Generating sequences of observations: As any HMM defines a probability dis tribution over sequences of observations, we can use any given HMM M to generate sequences of observations. In that case, we say that the states of model M emit sym bols of observations. In this thesis, this way of using an HMM is not employed. We will, in all of the following, therefore refer to a state as reading one or more symbols from an input sequence. The CpG-Island finding problem can be solved with a decoding algorithm, which computes the most probable state path it for an input DNA sequence X and model M. Before we can use an HMM for decoding, we first have have to determine good values for its parameters which may require parameter training. Algorithms for decoding and parameter training will be discussed in more detail in this chapter and in chapter 2. In the following section, we first describe variants of HMMs that have been developed recently for Bioinformatics applications. 7 1.2. Introduction to hidden Markov models 1.2.2 Variants of HMMs for Bioinformatics applications Pair-HMMs and n-HMMs A pair-HMM is a simple extension of an HMM that reads two rather than only one sequence at a time. One of the main applications of pair-FIMMs is to generate sequence alignments, so pair-HMMs are particularly prevalent in Bioinformatics [39]. For example, a pair-HMM called WABA [21] can be used to identify pairs of protein coding exons in two orthologous input DNA sequences. This pair-HMM generates a global alignment between the two input sequences unlike TBLAsTX which generates local alignments [1]. Pair-HMMs are also used for comparative gene prediction [26, 34, 36]. These pair-HMMs take a pair of homologous DNA sequences as input and predict pairs of homologous gene structures. Comparative gene prediction methods using pair-HMMs have been shown [26, 36] to perform better than non-comparative approaches [8]. Figure 1.2.2 shows a simple 5-state pair-HMM. In an n HMM, each state can read letters from up to ri input sequences. n-HMMs can be used for multiple sequence alignment, but because of their high memory and time requirements, they are typically used only for short sequences, such as protein sequences. Generalized HMMs Generalized HMMs [8, 48] allow states to read more than one letter at a time from an input sequence. In Bioinformatics applications, a state of an HMM often corresponds to a biological entity, e.g. a codon consisting of 3 nucleotides, such a state would read 3 letters from the input sequence at a time. Figure 1.2.2 shows a simple generalized pair-HMM for finding related protein-coding exons in two input DNA sequences. In particular, any HMM used for gene prediction should be a generalized HMM. 8 1.2. 
Introduction to hidden Markov models Figure 1.3: A 5-state simple pair-HMM This is a 5-state simple pair-HMM, where the Match state reads one character from each of the two input sequences and aligns them according to the emission probability of the combination of the two characters read by the state. The Emit_X state reads one character from input sequence X and the Emit.Y state does the same thing as the Emit....X state, but it reads one character from the input sequence Y at a time. 9 1.2. Introduction to hidden Markov models Figure 1.4: A simple pair-HMM for predicting related protein coding regions in two input DNA sequences This simple pair-HMM can be used for predicting related protein coding regions in two input DNA sequences. There are 3 states to model the protein coding part and 3 states to model the non-coding part of the two input sequences. The states which model the protein coding part read one codon at a time, e.g. the Match_Codon state reads 3 nucleotides from each input sequence at a time. 10 1.2. Introduction to hidden Markov models Explicit state duration HMMs Explicit state duration HMMs extend the concept of HMMs by keeping track of the state\u00E2\u0080\u0099s duration, i.e. the \u00E2\u0080\u009Ctime\u00E2\u0080\u009D that the state path has stayed in the same state is also remembered in the decoding process. In this type of HMM, a set of parameters or functions for modeling the state duration is needed in addition to the set of transition and emission probabilities, and the probability of the next state is conditional on both the current state and the state duration. GENSCAN [8] and SNAP [25] are two ab initio gene prediction models which are based on an explicit state duration HMM. Evolutionary HMMs For many types of biological sequence analysis, it is important to know the evolutionary cor relations between the sequences. [24] describe a method that uses evolutionary information to derive the transition probabilities of the insertion and deletion states of sequence align ment pair-HMMs. EVOGENE [40] is an evolutionary HMM (EHMM), which incorporates the evolutionary relationship between the input sequences into the sequence decoding process. EVOGENE consists of an HMM and a set of evolutionary models. In the model, each state of the HMM consists of a set of alphabets over fixed alignment columns and a state-specific evolutionary model, and the emission probabilities of each state are calculated using the corresponding evolutionary model for each column in the input alignment. The Felsenstein algorithm [13] is used to calculate the likelihood of an alignment column given an evolutionary input tree and an evolutionary model. 11 1.2. Introduction to hidden Markov models HMMs which take external evidence into account Because of significant improvements in biological sequencing technology, large amounts of genome-wide data such as cDNAs, ESTs and protein sequences [3, 18, 37, 38] are currently being generated. It is desirable to incorporate these types of additional evidence into com putational sequence analysis methods to get a better performance. In recent years, many HMMs that can incorporate prior information on the input sequences into the prediction process have been developed. They are especially popular in gene prediction because of the highly available data. TWINSCAN [26] is one of the first HMMs that incorporates external information into the gene prediction process. 
It extends the classical gene prediction HMM GENSCAN [8] by incorporating matches to several homologous genomic sequences to the input DNA sequence into the predictions. TWINSCAN first generates a \u00E2\u0080\u009Cconserved sequence\u00E2\u0080\u009D from the matches between homologous DNA sequences and the DNA sequence of interest. This conserved sequence is then used to bias the emission probabilities of the underlying GEN SCAN HMM. DOUBLESCAN [34] is a comparative gene prediction pair-HMM which performs ab initio gene prediction and sequence alignment simultaneously by taking two un-annotated and unaligned homologous genomic sequences as input. It incorporates splice site informa tion along each input sequence using the predictions of STRATASPLICE [29], which generates log-odd scores and posterior probabilities for potential splice site. These scores are then inte grated into the prediction process, and greatly improve the prediction accuracy of splice sites. PROJECTOR [36] extends DOUBLESCAN by using the known genes in one of the two input sequences to predict the gene structures in the other un-annotated DNA input sequence. It uses so-called special emissions, which integrate the information on the annotated sequence into the prediction process by biasing the emission probabilities. In its published version, PROJECTOR is only capable of incorporating prior probabilities with discrete values (i.e. ei 12 1.3. Challenges in designing and implementing HMMs ther 0 or 1). More recently, several gene prediction HMMs have been developed for integrating various types of prior information into the prediction process [7, 15, 19, 44]. These programs (they all employ HMMs but not pair-HMMs) all allow taking prior information with different levels of confidence in account. For example, in JIGSAW [19], a feature vector is used to store prior information on different features from various sources in probabilities. Simple multipli cation rules are then used to incorporate the vector into the nominal emission probabilities. EXONHUNTER [7] calls a piece of external evidence an \u00E2\u0080\u009Cadvisor\u00E2\u0080\u009D, different \u00E2\u0080\u009Cadvisors\u00E2\u0080\u009D are combined to form a \u00E2\u0080\u009Csuperadvisor\u00E2\u0080\u009D by quadratic programming (it is a problem of minimiz ing or maximizing a quadratic function of several variables subject to linear constraints on these variables), which is then incorporated into the prediction process. All of these HMMs are capable of integrating various prior information with different confidence scores into the prediction process, but they all have some limitations. For example, they cannot deal with contradicting pieces of information, and one state of the model can only incorporate one type of prior information. 1.3 Challenges in designing and implementing HMMs As discussed previously, HMMs are a widely used powerful concept in many applications. However, it is often not easy to design and implement an HMM. Various algorithms exist for generating predictions with the model and for training the model\u00E2\u0080\u0099s parameters. The main challenges are: \u00E2\u0080\u00A2 Long input sequences lead to large memory and time requirements for predictions and parameter training. \u00E2\u0080\u00A2 It is difficult to set up the transition and emission parameters of the model that yield a good performance. 13 1.3. 
Challenges in designing and implementing HMMs \u00E2\u0080\u00A2 It is time consuming and tricky to implement the model and to test several HMMs against each other (\u00E2\u0080\u009CProto-typing\u00E2\u0080\u009D). Dealing with long input sequences In Bioinformatics, the length of input sequences for HMMs varies between applications. For example, the average gene size (before splicing) of eukaryotic genes varies from around 1kb (yeast) to 50kb (human) or longer. Before discussing the sequence decoding and parameter training algorithms in detail, we first specify the time and memory requirements of these algorithms to give readers a better impression of the challenges. For a pair-HMM with N states, the Viterbi algorithm [47], which is the most widely used algorithm for sequence decoding, requires O(N2L)time and O(NL2) memory for analyzing a pair of sequences of length L each. For a gene prediction pair-HMM consisting of 10 states, it would thus require between 108 and 1010 time units and between io and i09 many units of memory to analyze typical input sequences which contain only a single gene. These constraints are even more serious for parameter training, as most training algorithms are iterative processes which have to consider many input sequences. These challenges for parameter estimation of HMMs are described below. New efficient algorithms are needed to make decoding and parameter training practical even for long input sequences. Parameter training The set of transition probabilities and emission probabilities constitutes the parameters of an HMM. There are additional parameters for explicit state duration HMMs and evolutionary HMMs. As the performance of an HMM critically depends on the choice of the parameter values, it is important to find good parameter values. As the states of an HMM closely reflect the biological problem that is being addressed, it is typically fairly easy to design the states of 14 1.3. Challenges in designing and implementing HMMs the model. However, it can be difficult to manually derive good parameter values. There are two common ways to define the parameters of an HMM: (1) Users can manually choose the parameters. This is time consuming and also requires a very good understanding of the data. Also, this strategy may not result in values that optimize the performance. (2) An other strategy is to use an automatic parameter training algorithm. This strategy works best if we have a large and diverse set of annotated training sequences. Being able to use automatic parameter training for a pair-HMM, for comparative gene prediction means that it can be readily trained to analyze different pairs of genornes. However, as we explain now, parameter training is not an easy task. There are two common types of parameter training strategies for HMMs, maximum like lihood methods (ML) and expectation maximization methods (EM) [10]. The corresponding training algorithms are usually iterative processes, whose convergence depends on the set of initial values and the amount of training data. Depending on the training algorithm, the out come of the training may depend on the initial parameter values and may also not necessarily maximize the resulting prediction performance. Pseudo-counts for the parameters may be needed to avoid over-fitting, especially if the training data is sparse or biased. We have discussed the importance of parameter training for setting up an HMM. 
Because of the difficulties of implementing training algorithms that can be used efficiently on realistic data sets, parameter training algorithms are not implemented in most HMM applications (i.e. the parameters are set up manually for one particular data set). An exception are some HMMs for gene prediction [25, 30, 46]. This is because (1) the set of parameters are very important for the gene prediction performance, (2) parameters need to be adapted to predict genes for different organisms, (3) due to the great improvement in biological sequencing techniques, there are large enough sets of known genes to allow parameter training. For gene prediction methods, the main challenge is to implement efficient parameter training algorithms. The 15 1.4. Discussion and conclusion existing parameter training algorithms are described in chapter 2. Proto-typing Designing an HMM for a new Bioinformatics application is fairly easy given a good under standings of the underlying biological problem, whereas the implementation of several possible HMMs can be a tedious task. Proto-typing several HMMs and implementing the sequence decoding and parameter training algorithms is time-consuming and error-prone. Although there are many variations of HMMs, they all use the same prediction and training algorithms. It is particularly inefficient if the same algorithms have to be implemented repeatedly when ever a new HMM application is created. If there was a software tool for generating an 11MM with the corresponding algorithms, then users would be able to focus on designing the 11MM, i.e. the states of the model, which is the only task that requires human expertise and insight into the biological problem. 1.4 Discussion and conclusion HMMs are a very powerful statistical concept which allows many different kinds of sequence analysis applications. In this chapter, we have introduced the theoretical background of HMMs and some popular variants of HMMs and also described the popularity of these models in Bioinformatics. As we explained, it is typically not easy to design and implement an HMM in practice. We have discussed several challenges, such as long input sequences which lead to large memory and time requirements, difficulties with setting up the parameters of the model, and the time consuming and error-prone task of implementing efficient algorithms. Even though marty different HMMs have been developed for different applications, they use similar prediction and parameter training algorithms. It is therefore not efficient if users 16 1.4. Discussion and conclusion spend most of their time implementing and re-implementing these algorithms for different HMMs, when they could be focusing on the design and parameterization of the HMM. We here propose an HMM-generating tool HMMC0NvERTER, which takes an XML input file for defining an HMM, its states and parameters and the algorithms to be used for gener ating predictions and training the parameters. Using this tool, a user is not required to have computer programming knowledge to set up an HMM and to use it for data analyses and parameter training. This tool provides several widely used sequence decoding and parameter training algorithms, including two novel parameter training algorithms. 
Furthermore, HMM CONVERTER supports many special features, such as incorporating external information on each input sequence with confidence scores, and several heuristic algorithms for reducing the run time and memory requirement greatly while keeping the performance essentially un changed. These features are especially useful for Bioinformatics applications. In the following chapter, we introduce several widely used existing sequence decoding and parameter training algorithms. The 11MM generating tOol-HMMCONVERTER is described in chapter 3, including details on all special features and novel algorithms. Chapter 4 explains how to specify an HMM using HMMCONVERTER and chapter 5 describes some implemen tation details of HMMCONVERTER. Chapter 6 describes a novel pair-HMM for comparative gene prediction called ANNOTAID, which is constructed using the HMMCONVERTER frame work. 17 Chapter 2 Decoding and parameter training algorithms for HMMs 2.1 Introduction and motivation In biological sequence analysis or biological sequence annotations, we are interested in finding good annotations for the raw biological sequences. This is the sequence decoding problem discussed in chapter 1. Every HMM assigns an overall probability to each state path which is equal to the product of the encountered transition and emission probabilities, and every state path corresponds to an annotation of the input sequence. For an HMM with well-chosen parameters, high-quality annotations should correspond to high-probability state paths, so the task for sequence decoding is to find a state path with high probability. The Viterbi algorithm [47] is one of the most widely used sequence decoding algorithms which calculates the most probable state path for a given input sequence and HMM. For many Bioinformatics applications, it is often the case that several state paths that yield the same annotation. For example, in a simple pair-HMM which generates global align ments of protein coding regions in two input DNA sequences (see figure 2.1), there are states Emit..Codonx and Match..Codon, which both assign the annotation label Codon to the letters they read from sequence X. In this pair-HMM, several different state paths that cor 18 2.2. Decoding algorithms respond to different global alignments between the two sequences could yield the same gene annotation for the two input sequences. From this example, we can see that it is always possible to translate a state path in the model into an annotation of the input sequence. Figure 2.1: A simple generalized pair-HMM for predicting related protein coding regions in two DNA sequences This simple generalized pair-HMM has been presented in chapter 1. In particular, the MatchCodon state, the EmitCodonX state and the EmitCodonY state all assign the annotation label Codon to the letters they read from the input sequences. The three other states (except the start and the end state) assign the annotation label non_coding. 2.2 Decoding algorithms In this section, we describe two existing sequence decoding algorithms for HMMs: The most commonly used Viterbi algorithm [47] and the Hirschberg algorithm [16]. The Hirschberg 19 2.2. Decoding algorithms algorithm can be regarded as a more memory efficient version of the Viterbi algorithm. For a given HMM and an input sequence, both algorithms derive the state path with the highest overall probability. 2.2.1 The Viterbi algorithm The Viterbi algorithm is one of the most widely used sequence decoding algorithm for HMMs. 
It first calculates the maximum probability that a state path in a given model M can have for a given input sequence, and then retrieves the corresponding optimal state path by a back-tracking process. In this section, we introduce the algorithm for an HMM, i.e. a single-tape HMM where only the start and end states are silent and all other states read one letter from the input sequence at a time. The algorithm consists of two parts:

• Part 1: Calculate the optimum probability of any state path using dynamic programming.

• Part 2: Retrieve the corresponding optimal state path by a back-tracking process.

We first introduce some notation before explaining the algorithm with pseudo-code; some of these notations have already been introduced in chapter 1.

• X = (x_1, ..., x_L) denotes an input sequence of length L, where x_i is the i-th letter of the sequence.

• S = {0, ..., N - 1} denotes the set of N states in the HMM, where 0 is the start state and N - 1 is the end state.

• T = {t_{i,j} | i, j ∈ S}, where t_{i,j} is the transition probability from state i to state j.

• E = {e_i(γ) | i ∈ S, γ ∈ A}, where e_i(γ) is the emission probability of state i to read symbol γ from alphabet A.

• v denotes the two-dimensional Viterbi matrix, where v_s(i) is the probability of the most probable path that ends in state s and that has read the input sequence up to and including position i.

• ptr denotes the two-dimensional pointer matrix, where ptr_i(s) is the previous state from which the maximum probability at state s and sequence position i was derived.

• π = (π_0, ..., π_{L+1}) represents a state path in the model for a given input sequence of length L, where π_i is the i-th state of the state path. In particular, π_0 is the start state and π_{L+1} is the end state.

• π* denotes the optimal state path, i.e. the state path with the highest overall probability on sequence X, π* = argmax_π P(π | X, M). We also call this state path the Viterbi path.

In order to calculate the optimum probability among all state paths for a given HMM and a given input sequence, the Viterbi algorithm uses the following dynamic programming approach:

Part 1:

Initialization before reading the first letter from the input sequence:

v_s(0) = 1 if s = start, and v_s(0) = 0 if s ≠ start

Recursion:

For i = 1, ..., L
  For s' = 1, ..., N - 2
    v_{s'}(i) = e_{s'}(x_i) · max_{s ∈ S} { v_s(i - 1) · t_{s,s'} }     (2.1)
    ptr_i(s') = argmax_{s ∈ S} { v_s(i - 1) · t_{s,s'} }               (2.2)

Termination at the end of the sequence and in the end state:

v_{N-1}(L) = max_{s ∈ S} { v_s(L) · t_{s,end} }
ptr_L(N - 1) = argmax_{s ∈ S} { v_s(L) · t_{s,end} }

From the definition of v, we know that the probability of the optimal state path is P(X, π*) = v_{N-1}(L). The corresponding optimal state path can be obtained by a back-tracking procedure.

Part 2: Back-tracking starts at the end of the sequence in the end state and finishes at the start of the sequence in the start state. It is defined by the recursion:

π*_{i-1} = ptr_i(π*_i)

Figure 2.2 illustrates the first part of the Viterbi algorithm. The above algorithm can be easily generalized for pair-HMMs by using a three-dimensional matrix and looping over both the X and Y dimensions in the recursion.
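As a concrete reference, the following is a compact log-space C++ sketch of the two parts described above (single-tape HMM, silent start and end states, one letter read per state). It follows the recursion in equations 2.1 and 2.2, but it is an illustrative sketch under these assumptions, not HMMCONVERTER's actual implementation; working in log space simply avoids numerical underflow for long sequences.

```cpp
// Illustrative log-space Viterbi for a single-tape HMM with silent start/end
// states. State 0 is the start state, state N-1 the end state. 'obs' holds
// the input sequence as indices into the alphabet.
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
const double NEG_INF = -std::numeric_limits<double>::infinity();

std::vector<int> viterbi(const Matrix& trans,   // trans[s][s'] = t_{s,s'}
                         const Matrix& emis,    // emis[s][g]   = e_s(g)
                         const std::vector<int>& obs) {
    const int N = static_cast<int>(trans.size());
    const int L = static_cast<int>(obs.size());

    // v[i][s] = log of the best path probability ending in state s after
    // reading the first i letters; ptr[i][s] is the back-tracking pointer.
    Matrix v(L + 1, std::vector<double>(N, NEG_INF));
    std::vector<std::vector<int>> ptr(L + 1, std::vector<int>(N, -1));
    v[0][0] = 0.0;                                    // start state, log(1)

    for (int i = 1; i <= L; ++i) {                    // recursion (2.1)/(2.2)
        for (int sp = 1; sp <= N - 2; ++sp) {         // all non-silent states
            for (int s = 0; s < N; ++s) {
                double cand = v[i - 1][s] + std::log(trans[s][sp]);
                if (cand > v[i][sp]) { v[i][sp] = cand; ptr[i][sp] = s; }
            }
            v[i][sp] += std::log(emis[sp][obs[i - 1]]);
        }
    }

    // termination: best transition into the end state
    double best = NEG_INF; int best_s = -1;
    for (int s = 1; s <= N - 2; ++s) {
        double cand = v[L][s] + std::log(trans[s][N - 1]);
        if (cand > best) { best = cand; best_s = s; }
    }
    std::printf("log P(X, pi*) = %f\n", best);

    // back-tracking: recover the Viterbi path (start/end states omitted)
    std::vector<int> path(L);
    for (int i = L; i >= 1; --i) { path[i - 1] = best_s; best_s = ptr[i][best_s]; }
    return path;
}
```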
For a generalized HMMs where states can read more than one character at a time from the input sequence, the entries of the Viterbi matrix are calculated by: v3(i) = e8\u00E2\u0080\u0099(i)max(v(i \u00E2\u0080\u0094 where (s\u00E2\u0080\u0099) is the number of characters read by state s\u00E2\u0080\u0099. 22 2.2. Decoding algorithms S State State Figure 2.2: Illustration of the Viterbi algorithm The left part shows how the Viterbi algorithm calculates the Viterbi matrix from left to right in the recursion. The algorithm first ioops along the input sequence X and then loops over the states in the model. It terminates in the top right corner of the matrix at the end of the input sequence and in the end state. The right part demonstrates the calculation of entry v3(i), where the matrix element for state s\u00E2\u0080\u0099 and sequence position i is derived from state s at sequence position (i \u00E2\u0080\u0094 1), i.e. v8(i) =e8i(i) . v8(i \u00E2\u0080\u0094 1)t,3 = e3i(i) . maxSEs{vS(i \u00E2\u0080\u0094 1)t8,i}. Let Tmax be the maximum number of transitions into any state in the model. The Viterbi algorithm takes O(NTmaxL) operations and O(NL1)memory for an n-HMMs which reads n input sequences at a time. These time and memory requirements impose serious constraints when employing this algorithm with n-HMM and long input sequences. 2.2.2 The Hirschberg algorithm From equation 2.1, we can see that the Viterbi algorithm can be continued if we keep the Viterbi elements for the previous sequence position in memory. This means that only O(NL\u00E2\u0080\u0099) memory is required for an n-HMM with N states and n input sequences of identi / / / / /-+ / / / / / s, I: i-i I 23 2.2. Decoding algorithms cal length L if we only want to calculate the probability of the Viterbi path. However, in order to obtain the Viterbi path, O(NL) memory is required for storing the back-tracking pointers (see equation 2.2) for the back-tracking process. The memory needed for the back-tracking pointers thus dominates the memory requirements for the Viterbi algorithm. The Hirschberg algorithm is a more memory efficient version of the Viterbi algorithm. It requires O(NL\u00E2\u0080\u009D1)memory instead of O(NL) for an n-HMM. Furthermore, the time requirement of the Hirschberg algorithm is of the same order as for the Viterbi algorithm. It uses the same recurrences as the Viterbi algorithm, but searches the Viterbi matrix in a smarter way. We now describe the Hirschberg algorithm for a pair-HMM. Instead of filling the Viterbi matrix in a single direction along one of the two input sequences, the Hirschberg algorithm traverses the matrix from both ends using two Hirschberg strips of O(NL) memory each (see figure 2.2.2). We call them the forward strip and the backward strip, respectively. In order to perform the backward traversal, a \u00E2\u0080\u009Cmirror model\u00E2\u0080\u009D of the original HMM has to be set up by reversing the transitions and exchanging the end state and start state with respect to the original HMM. This mirror model does not necessarily define an HMM because the sum of the transition probabilities from a state may no longer be 1. The forward strip calculated the 3-dimensional matrix in the same direction as the Viterbi algorithm, while the backward strip uses the mirror model to move in the opposite direction, i.e. it starts at the end state and at the end of one sequence. 
The matrix element (i, j, s) in the backward strip stores the probability of the optimal state path that has read the input sequences from their ends up to and including position i of sequence X and position j of sequence Y and that finishes in state s. Figure 2.2.2 shows how the two Hirschberg strips move in opposite directions and meet at the middle position of one of two input sequences (i.e. sequence X in 24 2.2. Decoding algorithms this example). i 1+1 Figure 2.3: Illustration of the Hirschberg algorithm (1) This figure shows the projection of the 3-dimensional Viterbi matrix onto the X\u00E2\u0080\u0094Y plane. The two Hirschberg strips traverse the matrix from both ends of sequence X and meet at the middle of the sequence X. For a generalized HMM, the two strips have a width ofmax3es(Z(s)) + 1, so they will overlap instead of \u00E2\u0080\u009Ctouching\u00E2\u0080\u009D each other. The additional columns of the strip are used to store the values calculated for the previous sequence position which are needed in order to continue the calculation. Suppose the two strips meet at position i of sequence X, where the forward strip stores the Viterbi matrix elements for position i of sequence X and the backward strip stores the corresponding probabilities for position i + 1 of sequence X. At that sequence position in X, the probability of the optimal state path P(X, YIlr*) can be calculated by finding the transition probabilities from states in the forward strip to states in the backward strip, such that the product of this transition probability and the two entries in the strip is maximum. Using this criterion, we can identify the Viterbi matrix element (i, j, s) through which the optimal state path lr* goes. Y 7 / / / / / / / / /Y 25 2.2. Decoding algorithms In summary, when the two Hirschberg strips meet, the optimum probability and a coor dinate on the optimal state path can be determined. The process only continues on the left sub-matrix and the right sub-matrix which corresponds to only half of the search space of the previous iteration (see figure 2.2.2). From the above example, the position i of sequence X defines the right boundary of the \u00E2\u0080\u0098eft sub-matrix, and the coordinate (i, j, s) defines the end position, this coordinate also defines the start position of the right sub-matrix. Figiire / / / * / _______ / / I_ ___ - / / (i,j,s) / * Y / / / _____ _ ______ ____ I Figure 2.4: Illustration of the Hirschberg algorithm (2) The coordinate (i,j,s) on the optimal state path was determined in the first iteration of the Hirschberg algorithm (see figure 2.2.2). The matrix is then split into left and right sub- matrices and the next round of iterations is started for each of the two sub-matrices. This recursive procedure continues, until the complete state path is determined. 2.2.2 shows that two smaller Hirschberg strips search in each of the two sub-matrices. The recursive process continues until the optimal state path is completely determined. The figure also shows that in each iteration, the total search space for the Hirschberg strips is reduced by a factor of two. So the number of operations performed by the Hirschberg algorithm is given by: t + + \u00E2\u0080\u00A2. <2t, where t denotes the number of operations required for searching 26 2.3. Parameter training the entire matrix, i.e. the number of operations in the Viterbi algorithm. 
2.3 Parameter training In order to calculate the most probable state path for a given input sequence or sequences, a set of parameters must be provided. The parameters are the set of transition probabilities and emission probabilities. HMMs of the same topology but different parameter values can be used for different applications. For example, the average length of an exonic stretch in human genes is different from that in mouse genes, so the transition probabilities from the exon state to intergenic state in the HMM should not be the same. The values of these parameters determine the prediction result and the performance of the HMM. In general, the set of parameters of the HMM should be adjusted according to applications, so the parameter estimation is very crucial. However, it is often a difficult task to obtain good parameter estimation, because the parameter training process is expensive to run and training data is usually inadequate. Parameter training is essential in order to set up HMMs for novel biological data sets and to optimize their prediction performance. First of all, a set of training sequences is required for the training process. If the state paths for all the training sequences are known, then the parameter estimation becomes a simple counting process where the performance is optimized via a maximum likelihood approach. Suppose there are M training sequences. Let: \u00E2\u0080\u00A2 X = {Xi,. .. , XM} denote the set of M training sequences. \u00E2\u0080\u00A2 0 denote the set of parameters of the HMM. This set refers to the set of transition and emission probabilities, i.e. 0 = {T, }. 27 2.3. Parameter training \u00E2\u0080\u00A2 r8,\u00E2\u0080\u0099 the pseudo-count for the number of times the transition s \u00E2\u0080\u0094f S is used. \u00E2\u0080\u00A2 r8(-y) the pseudo-count for the number of times states s reads symbol 7. We can then derive the parameters that maximize the likelihood, P(XI), by setting: t.,8\u00E2\u0080\u0099 = V.\u00E2\u0080\u0099 -, where (2.3) Ls T3,8 = number of transitions from s to s\u00E2\u0080\u0099 in all the state paths including any pseudo-counts r3,\u00E2\u0080\u0099. Similarly, e8(7) E8(-y) , , where (2.4)Z7\u00E2\u0080\u0099 (7) E8(7) = number of times state s reads letter 7 in all the state paths including any pseudo- countsr8Qy) Often, the state paths that correspond to a known annotation are not known or too numerous. In that case, parameter estimation becomes more difficult. In this situation, we can employ two frequently used parameter training algorithms: the Baum-Welch algorithm [2] and the Viterbi training [11]. 2.3.1 The Baum-Welch algorithm The Baum-Welch algorithm is a special case of an expectation maximization (EM) algorithm, which is a general method for probabilistic parameter estimation. The Baum-Welch algorithm iteratively calculates new estimates for all parameters and the overall log likelihood of the HMM can be shown to converge to a local maximum. We introduce a few more notations, 28 2.3. Parameter training and then describe the Baum-Welch algorithm: \u00E2\u0080\u00A2 denotes the letter of the j-th position of sequence X in the set of training sequences x. \u00E2\u0080\u00A2 f(i, Xk) denotes the forward probability for state s and sequence position i in iteration q of the Baum-Welch algorithm. It is equal to the sum of probabilities of all state paths that have read the input sequence Xk up to and including sequence position i and are in states, i.e.f8(i,Xk) := P(x,... = s). 
• b_s(i, X_k) denotes the backward probability for state s and sequence position i in iteration q of the Baum-Welch algorithm. It is equal to the sum of probabilities of all state paths that have read the input sequence from the end up to and including sequence position i + 1, given that the state at sequence position i is s, i.e. b_s(i, X_k) := P(x_{i+1}, ..., x_L | π_i = s).

• t^q_{s,s'} is the transition probability of going from state s to state s' in iteration q of the Baum-Welch algorithm.

• e^q_s(γ) is the emission probability of reading symbol γ in state s in iteration q of the Baum-Welch algorithm.

• θ^q is the estimated set of parameters, i.e. the set of transition and emission probabilities of the model in iteration q of the Baum-Welch algorithm.

So, in each iteration, we derive a new value for every transition probability from the expected number of times that this transition is used in any state path:

t^{q+1}_{s,s'} = ( Σ_{k=1}^{M} T_{s,s'}(X_k) ) / ( Σ_{k=1}^{M} Σ_{ŝ ∈ S} T_{s,ŝ}(X_k) ),
where T_{s,s'}(X_k) := ( Σ_i f^q_s(i, X_k) · t^q_{s,s'} · e^q_{s'}(x^k_{i+1}) · b^q_{s'}(i + 1, X_k) ) / P(X_k)     (2.5)

Similarly for the emission probabilities:

e^{q+1}_s(γ) = ( Σ_{k=1}^{M} E_s(γ, X_k) ) / ( Σ_{k=1}^{M} Σ_{γ' ∈ A} E_s(γ', X_k) ),
where E_s(γ, X_k) := ( Σ_i δ_{x^k_i, γ} · f^q_s(i, X_k) · b^q_s(i, X_k) ) / P(X_k)     (2.6)

and δ is the delta function such that δ_{i,j} = 1 if i = j and δ_{i,j} = 0 if i ≠ j. P(X_k) is the probability of the sequence X_k in the HMM of iteration q. From the definition of the forward probability, we can see that P(X_k) = f_end(L_{X_k}, X_k). This value can be calculated by the forward algorithm [11].

In equation 2.5, the value f^q_s(i, X_k) · t^q_{s,s'} · e^q_{s'}(x^k_{i+1}) · b^q_{s'}(i + 1, X_k) / P(X_k) is the probability that a transition from state s to state s' is used at sequence position i given the set of parameters θ^q. So, T_{s,s'}(X_k) is the expected number of times that a transition from state s to state s' is used in sequence X_k given the set of parameters θ^q. It can be calculated by summing over all sequence positions of sequence X_k. Similarly, the value f^q_s(i, X_k) · b^q_s(i, X_k) / P(X_k) in equation 2.6 is the probability that state s reads symbol γ at position i in sequence X_k given the set of parameters θ^q. So, E_s(γ, X_k) is the expected number of times that state s reads symbol γ in sequence X_k given the set of parameters θ^q. It can be calculated by summing over all sequence positions of the sequence X_k.

In Baum-Welch training, the iterations in 2.5 and 2.6 are continued until a stopping criterion is reached. There are typically two stopping criteria for Baum-Welch training: (1) the difference between the estimated parameters in the q-th iteration and the estimated parameters in the (q - 1)-th iteration is smaller than a pre-defined threshold; this difference can, for example, be measured by a log-odds ratio. (2) The number of iterations exceeds a maximum allowed number of iterations. Suppose that c is the number of parameters to train; each iteration of the Baum-Welch training algorithm then requires O(N · L^n) memory and O(c · M · N · T_max · L^n) operations for an n-HMM with N states and M training sequences of identical length L.
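For reference, below is a short C++ sketch of the forward algorithm that computes P(X_k) = f_end(L_{X_k}, X_k), the normalising quantity in equations 2.5 and 2.6. It uses the same state conventions as the Viterbi sketch in section 2.2.1 (state 0 is the silent start state, state N−1 the silent end state) and is illustrative only; a practical implementation would rescale the forward values or work in log space to avoid underflow.

```cpp
// Illustrative forward algorithm computing P(X) for a single-tape HMM with
// silent start (state 0) and end (state N-1) states. 'obs' holds the input
// sequence as alphabet indices. No rescaling is done here, so very long
// sequences would underflow in practice.
#include <vector>

using Matrix = std::vector<std::vector<double>>;

double forward_probability(const Matrix& trans,   // trans[s][s'] = t_{s,s'}
                           const Matrix& emis,    // emis[s][g]   = e_s(g)
                           const std::vector<int>& obs) {
    const int N = static_cast<int>(trans.size());
    const int L = static_cast<int>(obs.size());

    // f[i][s] = sum of probabilities of all state paths that read the first
    // i letters and end in state s, i.e. the forward probability f_s(i).
    Matrix f(L + 1, std::vector<double>(N, 0.0));
    f[0][0] = 1.0;                                   // start state

    for (int i = 1; i <= L; ++i)
        for (int sp = 1; sp <= N - 2; ++sp) {        // non-silent states
            double sum = 0.0;
            for (int s = 0; s < N; ++s)
                sum += f[i - 1][s] * trans[s][sp];
            f[i][sp] = emis[sp][obs[i - 1]] * sum;
        }

    double p_x = 0.0;                                // transition into end state
    for (int s = 1; s <= N - 2; ++s)
        p_x += f[L][s] * trans[s][N - 1];
    return p_x;                                      // = P(X) = f_end(L, X)
}
```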
2.3.2 The Viterbi training algorithm The Viterbi training algorithm is another popular training algorithm which uses the Viterbi state paths to derive parameter estimates in an iterative way. Given a set of initial parameters, the Viterbi algorithm is first used to obtain the optimal state paths for all the training sequences. Then the counting procedure (equation 2.3 and equation 2.4) described at the beginning of this section is performed to calculate new parameter values. This procedure is iterated until the Viterbi paths of all training sequences remain unchanged. The Viterbi training is popular because (1) the Viterbi algorithm is usually implemented for sequence decoding in many HMM applications, so only the simple counting procedures in equation 2.3 and 2.4 needs to implement in addition. (2) As the Viterbi algorithm is usually used for sequence decoding, it can be argued that Viterbi training results in a better set of parameters for decoding with the Viterbi algorithm. The Viterbi training consists of the following steps: (i) The optimal state paths for all the training sequences in the training set are obtained 31 2.4. Discussion and conclusion by using the Viterbi algorithm and the 11MM with its parameters 0q of iteration q. (ii) Suppose T, and E(-y) have same meaning as after equation 2.3 and equation 2.4, with the Viterbi paths assigned to all the training sequences. The values of these pa rameters can be calculated using equation 2.3 and 2.4, i.e. \u00E2\u0080\u0094 ________ q+1 \u00E2\u0080\u0094 E(-y) \u00E2\u0080\u0094 e3 (\u00E2\u0080\u0098y) \u00E2\u0080\u0094 7\u00E2\u0080\u0099EA(7) The above two steps are repeated until the Viterbi state paths for all the training sequences remain unchanged. This can always be achieved as the state paths correspond to discrete values. The Viterbi training is not an EM algorithm and it cannot be shown to always maximize the likelihood of the model. However, it maximizes the likelihood for both parameters and the optimal state paths for all the training sequences, i.e. P(X1,... , X1O, *(xi),... , For each iteration, the Viterbi training algorithm requires O(NL) memory and O(c. M. NTmaxL\u00E2\u0080\u0099) operations for an m-HMM with N states and M training sequences of identical length L, which are the same requirements as for the Baum-Welch algorithm. 2.4 Discussion and conclusion We have discussed the roles of sequence decoding and parameter estimation for HMM-based applications, and have introduced several commonly used algorithms. For sequence decod ing, the Viterbi algorithm calculates the optimal state path for each input sequence. One disadvantage is that it requires O(NL) memory for an n-HMM, and this imposes a serious constraint if the input sequences are long. The Hirschberg algorithm reduces the memory 32 2.4. Discussion and conclusion usage by a factor of L compared to the Viterbi algorithm, and while increasing the run time by at most a factor of two. For parameter estimation, both the Baum-Welch algorithm and the Viterbi training have the same memory and time requirements for each iteration. The implementation of the Baum-Welch algorithm is more complicated as both the forward and backward values have to be calculated. 
2.4 Discussion and conclusion

We have discussed the roles of sequence decoding and parameter estimation for HMM-based applications, and have introduced several commonly used algorithms. For sequence decoding, the Viterbi algorithm calculates the optimal state path for each input sequence. One disadvantage is that it requires $O(N L^n)$ memory for an $n$-HMM, which imposes a serious constraint if the input sequences are long. The Hirschberg algorithm reduces the memory usage by a factor of $L$ compared to the Viterbi algorithm, while increasing the run time by at most a factor of two. For parameter estimation, both the Baum-Welch algorithm and Viterbi training have the same memory and time requirements per iteration. The implementation of the Baum-Welch algorithm is more complicated, as both the forward and the backward values have to be calculated. However, it is guaranteed to converge to a local maximum of the likelihood and typically produces better parameter estimates than Viterbi training, as it considers all state paths and weighs them according to their probabilities when estimating new parameters, whereas Viterbi training only considers a single state path for every training sequence (the Viterbi path).

Chapter 3

HMMCONVERTER

3.1 Motivation

We have already introduced the challenges of designing and implementing HMMs in chapter 1. It is both time-consuming and error-prone to implement these models and efficient algorithms from scratch. As HMMs are used in many Bioinformatics applications, it would be extremely useful to be able to reuse the same source code and efficient algorithms for quickly setting up different kinds of HMMs and data analyses. One early effort is the DYNAMITE software package [5], a dynamic-programming code generator which can be used to implement Viterbi-like sequence analysis algorithms. The tool is a compiler which produces highly efficient C code. As it is a general tool for dynamic programming, it is not especially convenient for constructing HMMs for different applications: users still have to implement the HMM themselves, and it is not flexible enough to incorporate special features into the provided dynamic-programming framework. There are several powerful profile-HMM tools, such as HMMER [12] and SAM [20], which provide sequence decoding and parameter training algorithms. But these tools restrict the models that can be implemented, as all the algorithms provided are for profile HMMs only; this type of HMM is mainly used for analyzing fixed multiple sequence alignments. MAMOT [43] and HMMoC [31] are two recently published HMM-generating tools. MAMOT takes a simple text file as input for defining the HMM and provides the Viterbi and Baum-Welch training algorithms, but MAMOT can only implement HMMs of order 1.
With these limitations, it is not practical for most Bioinformatics sequence analyses. HMMoC is a more comprehensive HMM-generating tool which supports n-HMMs, i.e. HMMs with any number of input sequences, as well as higher-order HMMs. The tool also incorporates special features such as position-dependent emission and transition probabilities, as well as banding, which allows a user to define a smaller search space in the Viterbi matrix. However, it is not easy to define an HMM within HMMoC, as it requires an XML description and user-supplied programming code to define an HMM. For example, C code fragments are required for setting up the transition and emission probabilities, the position-dependent probabilities and the banding features. This requires users to have programming knowledge.

We here present an HMM-generator software tool called HMMCONVERTER, which takes an XML file and simple text files as input and translates them into C++ code. The converter supports generalized pair-HMMs with many special features, such as banding, which allows users to specify a sub-search space for sequence decoding. It is much more flexible and easier to use than HMMoC: the feature of special emissions allows prior information along each input sequence to be integrated into the prediction process, and the parameterization of transition and emission probabilities makes it possible to efficiently train a large number of transition and emission probabilities from a small set of free parameters. Four unique features of our software package are the memory-efficient Hirschberg algorithm [16] for decoding, a linear-memory version of the Baum-Welch algorithm [35], a novel linear-memory version of Viterbi training, and a novel and computationally efficient algorithm for training with sampled state paths. All of these decoding and training algorithms can be combined with banding and special emissions, thereby making HMMCONVERTER an extremely powerful and useful software package that can be used to swiftly set up sophisticated HMMs for real-life data sets.

In this chapter, we focus on introducing the features of HMMCONVERTER. Section 3.2 introduces the main features of HMMCONVERTER, including the special features mentioned above and the different algorithms. Section 3.3 introduces the two novel training algorithms and other special features for parameter training with HMMCONVERTER. The specifications and formats of the input and output files are described in chapter 4, and some implementation details of HMMCONVERTER are discussed in chapter 5.

3.2 Main features

3.2.1 Input and output files

In order to build an HMM with HMMCONVERTER, users have to supply several input files for describing the HMM and setting up special features. The formats of these input files are designed to be simple, easy to construct, and easy to read, even by users who may not have any programming experience. Depending on the type of analysis requested, HMMCONVERTER generates output files for storing the results of sequence decoding and parameter training. The format of the output files is also designed to be easily readable by users. In particular, the output files that store the sequence decoding results are designed to be straightforward and readable and can easily be converted into other popular formats such as GTF (Gene Transfer Format, a file format used to store information about gene structures). In this section, we briefly describe the input and output files of HMMCONVERTER; the details of their formats are described in chapter 4.

Input files

Depending on the analysis to be performed, HMMCONVERTER takes up to 5 input files:

• XML input file (compulsory): This file defines the different components and the topological structure of the HMM; it specifies the algorithms to be used for sequence decoding or parameter training, as well as any free parameters required by these algorithms. The parameters for setting up the set of free transition parameters and the set of free emission probabilities are also defined in this file, while the values of the free transition parameters and the free emission probabilities are specified in separate files. These parameters include the id and the size of the set of free transition parameters and the set of free emission probabilities, and the name of the file that stores the values of these free transition parameters and free emission probabilities. This is a pure XML-based file and users are not required to have knowledge of any computer programming language. The names of all the other compulsory or optional input files are also specified in this XML input file, see figure 3.1.

Figure 3.1: The XML-input file for defining an HMM in HMMCONVERTER. The XML-input file references all other flat text files which store parameters for the HMM or specify special features in HMMCONVERTER.
The dashed lines correspond to optional input files which are required only if the corresponding feature is used.

• free emission parameter file (compulsory): This is a flat text file which specifies the set of free emission probabilities of the HMM. We call a group of emission probabilities stored in this file a free emission parameter if it defines the emission probabilities of a state in the HMM. We use a separate text file to specify the free emission parameters in order to keep the XML input file short and readable. For example, for a higher-order pair-HMM with a state that reads 6 characters from the two input sequences, we would need to specify $4^6$ emission probabilities for a DNA alphabet for this state alone. The file with the free emission parameters can easily be generated using a small script.

• free transition parameter file (optional): This file stores the values of the free transition parameters. It is required only if the transition probabilities are parameterized.

• tube file (optional): This file specifies the restricted search space along the sequence dimensions. It is required if banding is to be used.

• prior-information file (optional): This file stores all prior information on the input sequences. It is required if special emissions are to be used.

The last three files are optional input files which are only required if the corresponding feature of HMMCONVERTER is to be used in the algorithm for decoding or parameter training. The concepts and theoretical details of these features are explained in the next section.

Output files

HMMCONVERTER generates up to 5 output files for storing the results of sequence decoding and parameter training:

• Two output files for sequence decoding:

- The first file contains the optimal state path for all the input sequences. In this file, each line specifies one state of the state path with the corresponding positions in the input sequence(s), as well as information on the state.

- If annotation label sets are defined for the HMM, a second file is generated. This file shows the resulting annotation of the input sequence(s) which corresponds to the optimal state path.

• Three output files for parameter training:

- XML output file: This file has the same format as the XML input file, but contains the trained values of the transition probabilities. Given this file, it is easy for users to compare the HMM before and after training.

- output file with trained free emission parameters: This file stores the trained free emission parameters of the HMM. It has the same format as the input file with the free emission parameters.

- output file with free transition parameters: This file stores the trained free transition parameters of the HMM. It has the same format as the input file with the free transition parameters.

3.2.2 Special features

Three special features of HMMCONVERTER are introduced in this section: parameterization of transition and emission probabilities, banding, and special emissions. These features are particularly useful for constructing complex pair-HMMs.
Special emissions can be used to integrate prior information on each input sequence probabilistically into all prediction and parameter training algorithms, which is a very desirable feature for many recent Bioinformatics applications.

Parameterization of transition and emission probabilities

Complex HMMs with a large number of states may have hundreds of transition and emission probabilities. For many applications, the transition or emission probabilities of different states are correlated; in that case, specifying all the probabilities as independent parameters is not desirable. Furthermore, the large number of parameters would entail serious constraints for parameter training, as enormous amounts of training data would be needed to avoid over-fitting. In a particular case, two states may share the same set of emission probabilities. If the training process views them as independent parameters, this would not only double the time for training, but may also result in different parameter values. In order to solve this problem, HMMCONVERTER allows the parameterization of transition and emission probabilities. Besides specifying the free parameters, users also have to define the derivation rules which express the probabilities as functions of the free parameters (required for decoding and training), as well as derivation rules which express the free parameters as functions of the probabilities (required for training only). The rules for deriving transition probabilities and emission probabilities differ from each other, and we describe them separately in the following.

Parameterization of transition probabilities

We first illustrate the power of parameterization by looking at the complex generalized pair-HMM of DOUBLESCAN [33, 34] for comparative gene prediction. It consists of 153 transitions, but can be efficiently parameterized using only 19 free transition parameters. In HMMCONVERTER, a transition probability can be defined by an arithmetic formula. The formula may use any of the basic arithmetic operators '+', '-', '*', '/' and brackets, with the ids of the free transition parameters and numerical values as operands. The free transition parameters need not correspond to probabilities, i.e. their values are not restricted to the range [0, 1]. However, users have to ensure that the parameterization of all transition probabilities implies that $\sum_j t_{ij} = 1$ for all states $i$ in the HMM. For example, for a state $i$ with three transitions, $i \to j_1$, $i \to j_2$ and $i \to j_3$, and the derivation formulas (functions of the free parameters) $t_{i j_1} = f_1$, $t_{i j_2} = f_2$ and $t_{i j_3} = f_3$, respectively, the sum of the three transition probabilities as a function of the free parameters always has to be 1 (i.e. $f_1 + f_2 + f_3 = 1$). HMMCONVERTER applies consistency checks to ensure that $\sum_j t_{ij} = 1$ and $\sum_{\gamma} e_i(\gamma) = 1$ for all states $i$ in the HMM, and reports error messages and terminates the program otherwise.
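To make the idea concrete, here is a minimal sketch (not HMMConverter's actual code) of how transition probabilities derived from free parameters can be checked for normalization. The parameter names and formulas are hypothetical and only mirror the three-transition example above.

```cpp
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

// Hypothetical free transition parameters (not restricted to [0,1]).
struct FreeParams {
    double f1 = 0.2;
    double f2 = 0.5;
};

// Each transition probability is given as an arithmetic formula of the
// free parameters, here encoded as small functions.  For a state with
// three outgoing transitions the formulas might be:
//   t1 = f1,  t2 = f2,  t3 = 1 - f1 - f2
std::vector<std::function<double(const FreeParams&)>> outgoing = {
    [](const FreeParams& p) { return p.f1; },
    [](const FreeParams& p) { return p.f2; },
    [](const FreeParams& p) { return 1.0 - p.f1 - p.f2; },
};

// Consistency check: the derived transition probabilities of a state
// must sum to 1; otherwise an error is reported.
bool check_normalisation(const FreeParams& p, double tolerance = 1e-9) {
    double sum = 0.0;
    for (const auto& formula : outgoing) sum += formula(p);
    if (std::fabs(sum - 1.0) > tolerance) {
        std::fprintf(stderr, "transition probabilities sum to %g, not 1\n", sum);
        return false;
    }
    return true;
}

int main() { FreeParams p; return check_normalisation(p) ? 0 : 1; }
```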
Derivation of emission probabilities

For the emission probabilities, we only allow free parameters that correspond to probabilities, i.e. values in [0, 1]. In HMMCONVERTER, a group of emission probabilities that defines the set of emission probabilities for a state is regarded as one emission parameter. All the free emission parameters are specified in the free emission parameter file. HMMCONVERTER provides two functions, SumOver and Product, for deriving emission probabilities. Consistency checks on the correct dimensions of the parameters and on the normalization are enforced by HMMCONVERTER. There are four ways of defining the emission probabilities of a state:

(i) Directly defining the emission probability: An emission parameter can be used directly by several states. This means that the emission probabilities of a state can be defined by pointing directly to an emission parameter in the free emission parameter file.

(ii) SumOver: This function can be used to derive a lower-dimensional emission probability from a higher-dimensional emission probability. The dimension of an emission probability is the sequence dimension, i.e. the total number of letters read by the corresponding state at a time. Consider as an example three states, match_xy, emit_x and emit_y, of dimension 6, 3 and 3, respectively. The emission probabilities of emit_x and emit_y can be derived from those of match_xy by summing over several sequence dimensions:

$$P_{\mathrm{emit\_x}}(x_1, x_2, x_3) = \sum_{(y_1, y_2, y_3)} P_{\mathrm{match\_xy}}(x_1, x_2, x_3, y_1, y_2, y_3)$$
$$P_{\mathrm{emit\_y}}(y_1, y_2, y_3) = \sum_{(x_1, x_2, x_3)} P_{\mathrm{match\_xy}}(x_1, x_2, x_3, y_1, y_2, y_3)$$

(iii) Product: This function allows the user to express a high-dimensional emission probability as the product of several lower-dimensional emission probabilities. Consider as an example two states, triple and single, of dimension 3 and 1, respectively. The emission probabilities of triple can be expressed as a function of those of single in the following way:

$$P_{\mathrm{triple}}(x_1, x_2, x_3) = P_{\mathrm{single}}(x_1) \cdot P_{\mathrm{single}}(x_2) \cdot P_{\mathrm{single}}(x_3)$$

(iv) SumOver & Product: The two functions can be combined to form more complex rules, for example for three states of dimension 3, 1 and 4, respectively:

$$P_{\mathrm{3\text{-}tuple}}(x_1, x_2, x_3) = \Big( \sum_{(y_1, y_2)} P_{\mathrm{4\text{-}tuple}}(x_1, x_2, y_1, y_2) \Big) \cdot P_{\mathrm{single}}(x_3)$$

The four rules above are easy to apply and flexible enough for many applications. It is easy to prove mathematically that these rules derive a normalized set of emission probabilities which fulfil $\sum_{\gamma} e_i(\gamma) = 1$ for all states $i$ in the HMM.
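As an illustration of these derivation rules, the following sketch shows SumOver and Product for DNA emissions. The flat-array layout is an assumption made for this example, not HMMConverter's internal representation.

```cpp
#include <array>
#include <vector>

// DNA alphabet of size 4; a k-dimensional emission parameter is stored as
// a flat vector of length 4^k, indexed in row-major order (illustrative only).
constexpr int ALPHABET = 4;

// SumOver: derive a 1-dimensional emission distribution from a
// 2-dimensional one by summing over the second sequence dimension,
// e.g. P_emit_x(x) = sum_y P_match_xy(x, y).
std::array<double, ALPHABET> sum_over_second(const std::vector<double>& pairProb) {
    std::array<double, ALPHABET> marginal{};
    for (int x = 0; x < ALPHABET; ++x)
        for (int y = 0; y < ALPHABET; ++y)
            marginal[x] += pairProb[x * ALPHABET + y];
    return marginal;   // still normalised: it is the marginal distribution
}

// Product: derive a 2-dimensional emission distribution as the product of
// two 1-dimensional ones, e.g. P_pair(x, y) = P_single(x) * P_single(y).
std::vector<double> product(const std::array<double, ALPHABET>& single) {
    std::vector<double> pairProb(ALPHABET * ALPHABET, 0.0);
    for (int x = 0; x < ALPHABET; ++x)
        for (int y = 0; y < ALPHABET; ++y)
            pairProb[x * ALPHABET + y] = single[x] * single[y];
    return pairProb;   // sums to 1 because each factor sums to 1
}
```

Both operations preserve normalization (the SumOver result is a marginal distribution, and the Product result sums to 1 because each factor does), which is why derived emission probabilities automatically satisfy the consistency checks described above.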
Banding

As discussed previously, it takes $O(N^2 L_1 L_2)$ time and $O(N L_1 L_2)$ memory to run the Viterbi algorithm on a pair-HMM with $N$ states and two input sequences of lengths $L_1$ and $L_2$. These requirements are often too high for Bioinformatics applications where the input sequences are very long. Banding is a feature of HMMCONVERTER for addressing this problem. The feature is particularly important for pair-HMMs and allows sequence decoding and parameter training algorithms to be run on a restricted search space along the sequence dimensions. We call the restricted search space a "tube". If the tube is well chosen, the memory and time requirements of the algorithms can be greatly improved while keeping the prediction performance essentially unchanged. In HMMCONVERTER, there are two ways to construct a tube: (i) derive a tube from local matches generated by TBLASTX or BLASTN [1], or (ii) specify the restricted search space explicitly in the tube file.

(i) In order to derive a tube from BLAST matches, the user has to specify a parameter called radius which defines the width of the tube around the selected matches. As many local BLAST matches can be generated for the same pair of input sequences, HMMCONVERTER uses a dynamic-programming routine to derive a highest-scoring subset of mutually compatible matches from the original set of BLAST matches. Only this subset of matches is used to construct the tube. Figure 3.2 illustrates a tube derived from two matches.

(ii) We now consider specifying the tube explicitly in the tube file. HMMCONVERTER defines a tube via a set of user-defined 4-tuples $\{(x_1, y_1, x_2, y_2) \mid 1 \le x_1 \le x_2 \le L_X,\ 1 \le y_1 \le y_2 \le L_Y\}$.

3.3.4 Pseudo-counts and pseudo-probabilities

After the pseudo-counts have been added, the transition probabilities are re-normalized so that $\sum_j t'_{ij} = 1$ for all states $i$ in the HMM. The same strategy can be applied for training emission probabilities. For the free transition parameters, which need not correspond to probabilities, pseudo-counts can also be used; they are added to the free parameter after each training iteration. Note that the pseudo-count itself can be any value $\geq 0$. For pseudo-counts of free parameters, no normalization process is needed, as the normalization is implicitly imposed by the derivation rules.

3.3.5 Training free parameters

HMMCONVERTER supports the parameterization of transition and emission probabilities in order to avoid over-fitting during parameter training. The training algorithms therefore have to be capable of training the free parameters. Recall that HMMCONVERTER allows different kinds of parameterizations for transition and emission probabilities; in the following, the training of free transition parameters and of free emission parameters is therefore explained in separate subsections.

Training of free transition parameters

For the training of free transition parameters, users have to supply a reverse formula which expresses each free transition parameter as a function of transition probabilities. These formulas correspond to the inverses of the functions which express each transition probability as a function of the free transition parameters. In each iteration, the normal training process is performed for the transitions involved in the inverse functions, and the values of the free parameters are then derived using the corresponding inverse functions with the updated transition probabilities.

In complicated cases, a free transition parameter cannot be expressed by an arithmetic function. Figure 3.10 shows a part of the pair-HMM of DOUBLESCAN [34]. The states shown in figure 3.10 can model pairs of aligned codons (match_exon state) in a pair of orthologous exons which is either followed by a pair of orthologous introns in one of 3 possible phases (states 14 to 16), or followed by an intron in one of the two input sequences only (states 17 to 19 model an intron in X only, states 20 to 22 model an intron in Y only). Two particular parameters are involved in these transitions, namely Phase0 and Phase1; they represent the relative probabilities of starting an intron or intron pair after codon position 3 (Phase 0), 1 (Phase 1) and 2 (Phase 2). Phase2 can be expressed as 1 - Phase0 - Phase1. We now consider training the parameter Phase0. From figure 3.10, the value of Phase0 should correspond to the fraction of transitions $3 \to 14$, $3 \to 17$ and $3 \to 20$, normalized by the transitions from state 3 to any of the states 14 to 22. As state 3 also has other transitions, we cannot easily write Phase0 as an arithmetic function of transition probabilities. To solve this problem, HMMCONVERTER supports so-called "group transitions".
This allows users to define a group of transitions which should be regarded as the same transition during training. In this example, the parameter Phase0 would be trained by calculating the weight of the group of transitions $3 \to 14$, $3 \to 17$ and $3 \to 20$, normalized by the weight of the group of transitions from state 3 to all the states 14 to 22. Chapter 4 explains how "group transitions" are specified in HMMCONVERTER.

Figure 3.10: A part of the pair-HMM of DOUBLESCAN. In order to train the free transition parameter Phase0, transitions from state 3 to all the states representing splice-site phase 0 should be counted. They include $3 \to 14$, $3 \to 17$ and $3 \to 20$.

Training of free emission parameters

We now describe the training of free emission parameters. As previously discussed, every free emission parameter has to be a group of emission probabilities that defines the emission probabilities for some states. Furthermore, an emission parameter specified in the free emission parameter file can be shared by several states and can also be used for expressing the emission probabilities of other states. In the most general case, it is therefore not trivial to define rules for deriving the free emission parameters from the emission probabilities of the states. Because of this, HMMCONVERTER supports one simple way of training free emission parameters: for each free parameter, the states which share emission probabilities directly with the parameter are determined by HMMCONVERTER, and the parameter is trained by taking contributions from all those states into account. This strategy allows efficient and successful training even for complex models, as long as their number of free emission parameters is still small.

3.4 Availability

HMMCONVERTER is available under the GNU Public License. The software comprising HMMCONVERTER will be available for download, free of charge, from http://www.cs.ubc.ca/irmtraud/HMMConverter once the paper submission is accepted. The package includes a detailed manual, the software itself and the examples described in the last section of chapter 4.

3.5 Discussion and conclusion

We have discussed the main features of HMMCONVERTER in this chapter; many of them are novel among existing HMM-generating packages. Among the novel features, using banding together with the Hirschberg algorithm is both memory and time efficient for sequence decoding. The special emission feature is desirable for many Bioinformatics applications where prior information on the input sequences has to be integrated into the decoding algorithms. The linear-memory Baum-Welch algorithm and the new linear-memory algorithm for training with sampled state paths are both memory efficient; banding can also be applied to parameter training, which can greatly accelerate the process.
Furthermore, special emissions can be used to integrate prior information about the training sequences into the training process, and together with the capability of training free parameters, the overall performance of the parameter estimation process can be greatly improved. The specifications for constructing an HMM with HMMCONVERTER are described in detail in chapter 4. Some implementation details of the important classes and the interactions between the classes are discussed in chapter 5.

Chapter 4

Specifying an HMM or pair-HMM with HMMConverter

4.1 Introduction

HMMCONVERTER takes up to six input files if all special features are used. We first give a brief description of these files:

• XML-input file: This is a compulsory XML file which describes the states and connections of an HMM or pair-HMM.

• Free emission parameter file: This is a compulsory flat text file which specifies the emission probabilities of the HMM.

• Free transition parameter file: This file is only required if the transition probabilities are parameterized.

• Tube file: This file is only required if banding is used with a user-defined tube.

• Prior information file: This file is only required if special emissions are switched on.

• Input sequence file: This compulsory file stores the input sequences.

In this chapter, the formats of all the input files and of the output files generated by HMMCONVERTER are described in detail. In the last section of this chapter, we describe several HMMCONVERTER applications included in the software package. Their main purpose is to demonstrate how to set up different HMMs with HMMCONVERTER.

4.2 The XML-input file

HMMCONVERTER takes an XML-input file and translates it into efficient C++ code. The XML-input file consists of two large components, each defined by its own top-level tag. In this section, we explain each XML tag and describe its attributes. The term "boolean attribute" used in this section refers to an attribute that only accepts the two discrete values "0" and "1", where "0" means "No" and "1" means "Yes".

4.2.1 The first component of the XML file

This component consists of five compulsory tags for specifying the HMM, and two optional tags for using special features.

[i] This tag defines some general parameters of the HMM. Its attributes are:

• name: This compulsory attribute defines the name of the model.

• pair: This optional boolean attribute defines whether the model is a pair-HMM or an HMM. The default value is "0" (HMM).

• SpecialEmission: This optional boolean attribute defines whether the model uses special emissions or not. The default value of this attribute is "0" (no special emissions).

[ii] This tag defines the set of symbols, i.e. the alphabet, that the HMM can read from the input sequences. Its attributes are:

• set: This compulsory attribute defines the set of symbols, i.e. the alphabet, that the HMM can read.
• cases: This optional boolean attribute defines whether the model is case-sensitive ("1") or not ("0"). The default value of this attribute is "0".

[iii] This tag defines the attributes of the set of free emission parameters on which the emission probabilities of the HMM depend. The free emission parameters themselves are defined in a separate input file, which we call the free emission parameter file. Its attributes are:

• id: This compulsory attribute defines the id of the set of free emission parameters. This id is used to find the corresponding emission parameters in the free emission parameter file.

• file: The name of the free emission parameter file.

• size: Compulsory attribute which specifies the number of free emission parameters contained in the free emission parameter file.

For example, if "FEP" is the id of a set of 7 free emission parameters (size="7"), then the ids of the free emission parameters specified in the free emission parameter file would be "FEP.0", "FEP.1", ..., "FEP.6".

[iv] This tag defines the attributes of all states in the HMM. For an HMM with N states, the start state should be defined with id S.0 and the end state should be defined with id S.(N-1). Its attributes are:

• id: This compulsory attribute takes a string value. The string starts with "S.", followed by an integer in the set {0, 1, ..., N-1}, where N is the number of states in the model.

• name: This compulsory attribute defines the name of the state.

• xdim: This optional attribute defines the number of characters that this state reads from input sequence x. The default value is "0".

• ydim: This optional attribute defines the number of characters that this state reads from input sequence y; the default value is "0". This attribute is only required for states in pair-HMMs.

• special: This optional boolean attribute defines whether the special emission of this state is switched on ("1") or not ("0").

As an example, figure 4.1 shows a state named "sample_state". This state reads five characters, two from input sequence X and three from input sequence Y; we call this a state of dimension five. For the annotation label set "Set1" defined for the HMM, only one label, "Set1.2" (the id of an annotation label defined in this annotation label set), is specified, which means that all five sequence positions of the state have the same annotation label "Set1.2" from this set. For "Set2" (another annotation label set defined for the HMM), different labels are specified for each of the five sequence positions (x1, x2, y1, y2, y3), i.e.
the annotation label in "Set2" for y2 is "Set2.1". The name of each annotation label tag within a state definition has to match the name of the corresponding annotation label set defined for the HMM. The annotation labels defined within a state tag only specify the annotation labels for this particular state; the complete annotation label sets for the HMM, and the annotation labels defined in each annotation label set, are specified in a separate tag. That tag is an optional tag inside this component of the XML file and will be explained after describing all the compulsory tags.

Figure 4.1: Defining a state of an HMM or pair-HMM in HMMCONVERTER. The figure shows the example state "sample_state" described above. This state reads five letters at a time, two from input sequence X and three from input sequence Y. Each of the sequence positions is assigned two annotation labels, one from the annotation label set "Set1" and one from "Set2". In this example, all five sequence positions are assigned the same annotation label "Set1.2" from the annotation label set "Set1", while different annotation labels from the annotation label set "Set2" are assigned to the individual sequence positions. For example, the sequence position y2 is assigned "Set1.2" from annotation label set "Set1" and "Set2.2" from annotation label set "Set2".

The attribute of the