Customizable Objective Function for Hidden Markov Models

Nawar Malhis
The Centre for High-Throughput Biology
The University of British Columbia
2125 East Mall, Vancouver, BC V6T 1Z4
Email: nmalhis@chibi.ubc.ca

Abstract - Identifying sequences with frequent patterns is a major data-mining problem in computational biology. In this work, our focus is on utilizing an HMM-like model for extracting sequences with interesting frequent patterns from a set of unlabeled sequences in a heavy-noise environment. First we show that the likelihood objective function for HMMs is very sensitive to noise, which limits its use to labeled sequences. Then we introduce an alternative model that we call the Hidden States Model, HSM, which is an HMM with a customizable objective function, OF. We show empirically (on synthetic data, to facilitate precise performance evaluation) how OFs can be customized to target specific information such as pattern frequency and size. Results show that HSM vastly outperformed HMM in extracting sequences with interesting patterns for both unmutated and mutated data.

Keywords: Hidden Markov Models; Hidden States Model; data mining; computational biology; discrete sequence; frequent pattern; unlabeled sequences.

1 Introduction

Identifying sequences with frequent patterns is a common problem in biological data mining, arising in gene finding [1], predicting gene regulatory motifs in DNA sequences [2], and detecting candidate regulatory relations in micro-array time series data [3]. Different tools are used, including Support Vector Machines (SVMs) and Hidden Markov Models (HMMs). In this work, our focus is on utilizing HMMs for data mining unlabeled sequences.

Consider a dataset of discrete sequences drawn from a distribution D, which consists of an infrequent noise background plus a number of frequent short subsequences that constitute patterns or motifs. Some of these motifs are considered "interesting"
and every other subsequence (frequent pattern or infrequent background) is considered "noise". Now consider DI and DN such that D = DI ∪ DN: interesting patterns and noise are in DI, and only noise is in DN. Sequences in DI have a higher percentage of frequent patterns than those in DN; see Figure 1 for an example. Given a sequence y ∈ D, we might want to guess whether y ∈ DI or y ∈ DN. That is, does y contain interesting patterns, or only noise? (This problem is an abstraction of many problems that arise in bioinformatics.) A natural machine-learning approach is to train a statistical model M on a training set of sequences DT ⊂ DI, and then to evaluate the likelihood P(y|M). If this value is above some threshold (indicating a high percentage of frequent patterns in y, and therefore that y is likely to contain interesting patterns), we say y ∈ DI (interesting); otherwise we say y ∈ DN (noise).

An obvious choice of statistical model is an HMM. In this case, we can train the HMM on DT using the Baum-Welch Expectation-Maximization (EM) algorithm to maximize the likelihood of DT, and then calculate the likelihood of y using the forward algorithm:

P(y|M) = P(y_1|M) ∏_{t=2}^{T} P(y_t | y_{1:t-1}, M) = ∏_{t=1}^{T} P_t^M    (1)

where T is the length of the test sequence y and

P_1^M ≝ P(y_1|M),    P_t^M ≝ P(y_t | y_{1:t-1}, M)    (2)

P_t^M is the probability of y_t under the posterior predictive distribution, given past data y_{1:t-1} and model M. In general, we can think of the HMM likelihood as just a scoring function, or objective function, OF, which we want to maximize; we denote this by S_HMM(y) = P(y|HMM). One problem with the HMM likelihood as a scoring function is that it focuses on rare events, which often just correspond to noise. To see this, note that if any P_t^M is

Figure 1. Two sequences drawn from distribution D = DI ∪ DN. Black blocks (A and B) represent frequent "interesting" patterns, shaded blocks (C, D, and F) represent frequent "noise" patterns, and white represents infrequent background noise.
We assume there is noise (background and frequent patterns) in every sequence; interesting patterns are only in DI sequences. (a) A sequence from distribution DI. (b) A sequence from DN.

small, then the whole product is small. As a result, any EM training approach, in order to increase the score of the training set, must set the HMM parameters such that rare background-noise events score higher, in proportion to their counts, than frequent events. Furthermore, since the training set DT is usually small and is assumed to represent the frequent patterns in the target set, background noise is not likely to be appropriately represented; the likelihood P(y|HMM) is therefore controlled to a large degree by those rare background-noise events in y that are unrepresented or under-represented in the training set (with very low P_t^M). This is reflected in low HMM reliability; as a result, HMMs are almost always used in applications with labeled sequences, such as [4][5][6][7][8][9][10][11]. To overcome this problem, an alternative OF that relies on the sum of the P_t^M values instead of their product may be used. Another problem with the likelihood is that, in scoring, it significantly favors shorter sequences; when dealing with variable-size sequences, sequence size needs to be factored out.

In this paper, we propose an alternative model that we call the Hidden States Model, HSM, which is just an HMM with a customizable OF. The idea is to let users build an HMM-like model whose OF is assembled from the P_t^M values to best fit their problem requirements. Note that discriminative training of HMMs is quite common on labeled sequences [12]. Similarly, maximizing per-symbol performance has been proposed as an objective for CRFs [13], but again, that relies on labeled data.
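The per-symbol terms P_t^M of Eq. (1) fall out of the scaled forward algorithm: the scaling factor at each step is exactly the posterior predictive probability P(y_t | y_{1:t-1}, M). A minimal sketch (the function name and the toy two-state parameters are our own, for illustration only):

```python
import numpy as np

def forward_loglik(y, pi, A, B):
    """Scaled forward algorithm for a discrete HMM.
    pi: initial state distribution, A: transition matrix,
    B: emission matrix (rows = states, columns = symbols).
    The per-step scaling factors are the posterior predictive terms
    P_t^M = P(y_t | y_{1:t-1}, M), so log P(y|M) from Eq. (1) is the
    sum of their logs."""
    alpha = pi * B[:, y[0]]
    p = [alpha.sum()]               # P_1^M = P(y_1 | M)
    alpha = alpha / p[0]
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]
        p.append(alpha.sum())       # P_t^M
        alpha = alpha / p[-1]
    return float(np.sum(np.log(p))), np.array(p)

# Toy two-state model: with uniform parameters every P_t^M is 0.5.
pi = np.array([0.5, 0.5])
A = np.full((2, 2), 0.5)
B = np.full((2, 2), 0.5)
loglik, p_t = forward_loglik([0, 1, 1, 0], pi, A, B)
```

Working in log space this way avoids the underflow that the raw product of Eq. (1) would suffer on long sequences.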
The main disadvantage of HSM is optimization: unlike the HMM likelihood objective function, which can be optimized using the Baum-Welch algorithm, a customized HSM scoring function is harder to optimize. In Section 2, we introduce a randomized local-search approach that optimizes HSM parameters for any OF to maximize the score S_HSM(DT). Then, to illustrate the use of HSM and OF customization, we present examples of constructing several HSM OFs to extract sequences with interesting frequent patterns using synthetic datasets in Section 3.

2 Optimizing Hidden States Models

In this section we look for a general optimization method that maximizes the training-data score, S_HSM(DT), for an HSM independent of its OF. Given an HSM with NS states and a dataset with NO observation symbols, the number of variables to be optimized is NV = NS × NO + NS × (NS + 1). Taking into consideration such a large number of optimization variables, the dependencies between these variables, and an unknown OF (an unconstrained optimization surface), we propose the following general random-search approach for training:

• At first, all parameters are initialized randomly and S_HSM(DT) is computed.
• Random perturbations with a noise level L are performed on a randomly selected subset of parameters of size NSS (starting with NSS = NV), and any move that increases the score is accepted (stochastic parameters are normalized appropriately after each step to ensure the sum-to-one constraints are satisfied).
• If we fail to make any progress after a certain number of trials, NT, we decrease the noise level L by LS and the number of parameters in play by one, and then try again.
• Optimization ends once the number of parameters in play reaches zero.

The values of NT, L, and LS are used to control the optimization process.
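The steps above can be sketched as follows. This is our own illustrative implementation, not the authors' code: the function name, the parameter layout (pi, A, B), and the max_trials safety cap are all assumptions added for the sketch.

```python
import numpy as np

def train_hsm(score_fn, NS, NO, NT=50, L=0.5, LS=0.05,
              max_trials=5000, seed=0):
    """Randomized local search over HSM parameters (pi, A, B).
    score_fn(pi, A, B) -> float is an arbitrary objective function.
    A random subset of NSS parameters is perturbed with noise level L;
    only score-increasing moves are kept.  After NT trials with no
    progress, L shrinks by LS and NSS by one; training stops when NSS
    reaches zero.  max_trials is a safety cap we added for the sketch."""
    rng = np.random.default_rng(seed)

    def normalize(M):  # re-impose the sum-to-one constraints
        M = np.abs(M)
        return M / M.sum(axis=-1, keepdims=True)

    params = [normalize(rng.random(NS)),        # pi: initial state dist.
              normalize(rng.random((NS, NS))),  # A: transitions
              normalize(rng.random((NS, NO)))]  # B: emissions
    best = score_fn(*params)
    sizes = [NS, NS * NS, NS * NO]
    NV = sum(sizes)            # = NS*NO + NS*(NS+1) free parameters
    NSS, trials = NV, 0        # parameters "in play", total trial count
    while NSS > 0 and trials < max_trials:
        improved = False
        for _ in range(NT):
            trials += 1
            flat = np.concatenate([q.ravel() for q in params])
            idx = rng.choice(NV, size=NSS, replace=False)
            flat[idx] += rng.normal(0.0, L, size=NSS)
            cand, off = [], 0
            for q, n in zip(params, sizes):
                cand.append(normalize(flat[off:off + n].reshape(q.shape)))
                off += n
            s = score_fn(*cand)
            if s > best:       # accept only improving moves
                best, params, improved = s, cand, True
        if not improved:       # no progress in NT trials
            L = max(L - LS, 1e-3)
            NSS -= 1
    return params, best
```

For example, train_hsm(lambda pi, A, B: float(B[0, 0]), NS=2, NO=2) drives the first emission probability toward 1 while keeping every row of pi, A, and B normalized.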
The rationale of this training is to have the early iterations assign values to the HSM parameters such that S_HSM(DT) is in the neighborhood of the training surface's global maximum (to the degree possible), while the final iterations are expected to tune in on that maximum.

3 Customizing HSM Objective Functions, a Walkthrough Example

In this section we test several HSM OFs for extracting sequences with "interesting" frequent patterns from a set of sequences. We train HSMs on a training set, use these trained HSMs to score test sequences, and evaluate the outcome using the area under the Receiver Operating Characteristic curve (AUC). We walk through and compare several solutions (different OFs) for the same problem. Our goals are to:

• analyze and evaluate HSMs with different OFs,
• quantify how data characteristics are reflected in each OF's mining performance, and
• compare the different outcomes, including that of the HMM.

Therefore, we use two synthetic datasets, DSa and DSb, to facilitate precise performance evaluation.

3.1. Synthetic Data

DSa: a synthetic dataset of discrete sequences with an alphabet size of 20 (to simulate amino acids). First, five random short sequences (3 to 5 symbols each) are generated to be used as frequent patterns; two we call interesting and three we call noise. DSa consists of three files (FI "interesting", FN "noise", and FT "training"). Each of these files holds sequences of between 50 and 60 symbols each, over a uniformly distributed background (infrequent noise):

• FI "interesting": consists of 2000 sequences; each sequence is infected with both interesting and noise frequent patterns (for an example see Figure 1(a)). FI is used for testing.
• FN "noise": consists of 2000 sequences; each sequence is infected with noise frequent patterns only (for an example see Figure 1(b)). FN is used for testing.
• FT "training": consists of 50 sequences.
Sequences in FT are drawn from the same distribution as FI, which includes both interesting and noise frequent patterns. FT is used for training. Frequent patterns are added to sequences using the pattern-infection algorithm presented in Figure 2.

DSb: similar to DSa, except that both interesting and noise frequent patterns are randomly mutated before being added to sequences; more than half of the frequent patterns in DSb are mutated by altering up to two symbols in every pattern, see Figure 2(c).

Figure 2. (a) This algorithm inserts five short patterns in a sequence seq and is used to generate data files FI and FT. When the dataset is DSb, the algorithm calls the mutate function on every pattern before inserting it. (b) This algorithm inserts three frequent short patterns in a sequence seq and is used to generate data file FN. When the dataset is DSb, the algorithm calls the mutate function on every pattern before inserting it. (c) This mutate function is called on every pattern p before inserting it in DSb sequences.

for (i = 0; i < 5; i++) {
    if (i < 3) p = select noise pattern at random;
    else p = select interesting pattern at random;
    if (data set == DSb) mutate(p);
    insert p in seq at a random location;
}
(a)

for (i = 0; i < 3; i++) {
    p = select noise pattern at random;
    if (data set == DSb) mutate(p);
    insert p in seq at a random location;
}
(b)

mutate(p) {
    mCount = p.SIZE() - 3;
    for (j = 0; j < mCount; j++)
        if (RND(1.0) < 0.6)
            p[RND_INT(p.SIZE())] += RND_INT(5) - 2;
}
(c)

3.2. Model Description

We use a randomly initialized left-to-right [9] eight-state model M. The Baum-Welch training algorithm is used for training M with the regular HMM likelihood OF, and the HSM training described in Section 2 is used with the customized OFs.
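A runnable Python translation of the Figure 2 pseudocode might look as follows. The wrap-around modulo 20 when mutating a symbol is our assumption (the pseudocode does not say how out-of-alphabet values are handled), and the function signatures are invented for the sketch:

```python
import random

ALPHABET = 20  # symbols 0..19, simulating amino acids

def mutate(p):
    """Figure 2(c): alter up to len(p) - 3 symbols of pattern p.
    Wrapping the shifted symbol modulo ALPHABET is our assumption."""
    p = list(p)
    for _ in range(len(p) - 3):
        if random.random() < 0.6:
            j = random.randrange(len(p))
            p[j] = (p[j] + random.randint(0, 4) - 2) % ALPHABET
    return p

def infect(seq, interesting, noise, dataset="DSa", with_interesting=True):
    """Figure 2(a)/(b): insert three noise patterns and, for FI/FT
    sequences, two interesting patterns at random positions in seq."""
    seq = list(seq)
    picks = [random.choice(noise) for _ in range(3)]
    if with_interesting:                       # Figure 2(a); skip for FN
        picks += [random.choice(interesting) for _ in range(2)]
    for p in picks:
        if dataset == "DSb":
            p = mutate(p)
        pos = random.randrange(len(seq) + 1)
        seq[pos:pos] = p                       # splice pattern into seq
    return seq
```

Note that mutate never touches a 3-symbol pattern (mCount = 0) and alters at most two symbols of a 5-symbol pattern, matching the "up to two symbols" description of DSb.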
Although in principle, for any model M, both training the model and scoring sequences should use the same OF, we will show that, to more effectively target frequent patterns, one can use two different but compatible OFs: a training objective function OFtr and a scoring objective function OFsc.

3.3. Evaluating Different OFs

For each solution (a pair OFtr, OFsc), an HSM is trained on FT (using OFtr), the test sequences (FI + FN) are then scored on the trained HSM (using OFsc), and the AUC (Area Under the Receiver Operating Characteristic Curve) is used for evaluation. Since the training algorithms are non-deterministic, that is, different instances of training on the same FT are likely to produce different outcomes, fifty tests are conducted and AUC = average ± variance is used for evaluation. To facilitate understanding the differences between the OFtr choices, each possible adjacent pair¹ of symbols (used as a short pattern), AB, is given a score S_T^{AB} equal to the average P_t^M of the last symbol, B, over all occurrences of AB in FT across all 50 trained HSM instances. Similarly, S_IN^{AB} is computed for pairs of symbols in FI + FN across all 50 trained HSM instances. Scatter plots, PFS plots, are used to visualize the relationship between pattern (symbol pair AB) frequencies in the training data FT and their scores S_T^{AB} and S_IN^{AB} for each OFtr. We define a pattern's frequency in a set of sequences as its number of occurrences in that set divided by the number of sequences in the set.

3.4. Results

Starting with the un-mutated dataset DSa: for each suggested solution, fifty HSMs are trained on the same FT, sequences in (FN + FI) are scored on each model, and AUC values are computed. Each solution's result is presented in the form average AUC ± variance.

SOLUTION 1:
• OFtr = OFsc =
∏_{t=1}^{T} P_t^M (HMM likelihood).

Results and analysis: From the PFS plot and the bar chart (Figure 3) we see that most P_t^M values are in the range 0.04 to 0.06; some frequent patterns (frequency > 0.5) have values around 0.08, so the P_t^M values for frequent patterns are only slightly higher than those of infrequent patterns. Running 50 tests resulted in an AUC = 0.533 ± 0.0008.

SOLUTION 2:
• OFtr = ∏_{t=1}^{T} P_t^M (HMM likelihood),
• OFsc = (∏_{t=1}^{T} P_t^M)^{1/T} (geometric mean).

Rationale: Since the scored sequences are of different sizes and the HMM likelihood OF favors shorter sequences, we score using the geometric mean.

Results and analysis: A mild improvement, to an AUC = 0.581 ± 0.0039.

Figure 3. OFtr = HMM likelihood: (a) PFS plot, (b) bar chart showing average P_t^M values for frequency groups.

¹ In this analysis we choose short patterns of only two symbols; combined with our alphabet size of 20, the total number of possible patterns, 20² = 400, makes our data (training and testing) large enough to support this analysis. If one were using a smaller alphabet, say nucleotide sequences with an alphabet of 4, the length of the short patterns should be 4 or 5.

SOLUTION 3:
• OFtr = OFsc = (1/T) ∑_{t=1}^{T} P_t^M (arithmetic mean).

Rationale: An OF based on the sum of the P_t^M values (arithmetic mean), instead of their product (geometric mean), is used to reduce the effect of noise events (small P_t^M values) on the final score.

Results and analysis: The PFS plot in Figure 4 shows that P_t^M values span a wide range, from 0.0 to 0.8 (high variation within each frequency group); this OFtr focuses on maximizing the overall P_t^M value of all patterns in FT. The corresponding bar chart shows that the average P_t^M values for frequent patterns are significantly higher than those of infrequent patterns. It is also clear that the P_t^M values for the training data FT are higher than those for the test data (FI + FN); this is due to over-fitting.
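The three scoring functions compared in Solutions 1-3, together with the pairwise formulation of the AUC used for evaluation, can be sketched as follows. The per-symbol scores in the example vectors are invented illustrative values, not measurements from the paper:

```python
import numpy as np

def score_product(p):       # Solution 1: HMM likelihood, product of P_t^M
    return float(np.prod(p))

def score_geomean(p):       # Solution 2: factors out sequence length T
    return float(np.prod(p) ** (1.0 / len(p)))

def score_mean(p):          # Solution 3: arithmetic mean, robust to
    return float(np.mean(p))  # rare low-P_t^M background events

def auc(pos, neg):
    """AUC as the probability that a random FI ("interesting") score
    beats a random FN ("noise") score; ties count one half."""
    wins = sum((x > y) + 0.5 * (x == y) for x in pos for y in neg)
    return wins / (len(pos) * len(neg))

# One rare background event (0.002) collapses the product score by a
# factor of 40 but shifts the arithmetic mean by only about 20%.
clean = np.full(5, 0.08)
noisy = np.array([0.08, 0.08, 0.08, 0.08, 0.002])
```

This is the mechanism behind the jump from Solution 1 to Solution 3: the product hands a single under-represented background symbol veto power over the whole score, while the mean lets the frequent patterns dominate.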
Results show an improvement over the product-based OFs (Solutions 1 and 2), to an AUC = 0.68 ± 0.0067.

SOLUTION 4:
• OFtr = Average(P_t^M) / Variance(P_t^M),
• OFsc = (1/T) ∑_{t=1}^{T} P_t^M (average).

Figure 4. OFtr = (1/T) ∑_{t=1}^{T} P_t^M: (a) PFS plot, (b) bar chart showing average P_t^M values for frequency groups.

Rationale: Even though patterns with higher frequency yield, on average, higher P_t^M values, the PFS plot of Solution 3 (Figure 4) shows that the correlation between frequency and P_t^M is low, as P_t^M values within any frequency range are inconsistent. This inconsistency between frequency and P_t^M has a negative effect on the AUC value and results in a high variance in AUC (low reliability). To reduce this noise, we replaced the OFtr of Solution 3 with one that evaluates sequences using the arithmetic mean of P_t^M divided by its variance.

Results and analysis: Dividing by the variance of P_t^M over the entire FT data (frequent and infrequent) directed the training process to reduce the range of P_t^M values by:
• increasing low values slightly, and
• significantly decreasing those that are high.

This resulted in a higher and more reliable AUC = 0.74 ± 0.0004.

SOLUTION 5:
• OFtr =