Customizable Objective Function for Hidden Markov Models 1Customizable Objective Function for Hidden Markov Models Nawar Malhis The Centre for High-Throughput Biology The University of British Columbia 2125 East Mall Vancouver, BC V6T 1Z4 Email: nmalhis@chibi.ubc.ca Abstract - Identifying sequences with frequent patterns is a major data-mining problem in computational biology, in this work, our focus is on utilizing a HMM like model for extracting sequences with interesting frequent patterns from a set of unlabeled sequences in a heavy noise environment. First we show that the likelihood objective function for HMMs is very sensitive to noise which limits its use to labeled sequences. Then we introduce an alternative model that we call Hidden States Model, HSM, which is a HMM with customizable objective function, OF. We show empirically (on synthetic data to facilitate precise performance evaluation) how OFs can be customized to target specific information such as patterns frequency and size. Results Show HSM vastly outperformed HMM in extracting sequences with interesting patterns for both unmutated and mutated data. Keywords: Hidden Markov Models; Hidden States Model; data mining; computational biology; discrete sequence; frequent pattern; unlabeled sequences. 1 Introduction Identifying sequences with frequent patterns is a common problem in biological data mining such as gene finding [1], predicting gene regulatory motifs in DNA sequences [2], and detecting candidate regulatory relations in micro-array time series data [3]. Different tools are used including Support Vector Machines (SVMs), and Hidden Markov Models (HMMs). In this work, our focus is in utilizing HMMs for data mining unlabeled sequences. Let?s have a data of discrete sequences that are drawn from a distribution D, which consists of an infrequent noise background plus a number of frequent short sub sequences that constitute patterns or motifs. Some of these motifs are considered ?interesting? and every other sub sequence (frequent pattern or infrequent background) is considered ?noise?. Now consider DI and DN such that D = DI ? DN. Interesting patterns and noise are in DI and only noise in DN. Sequences in DI have higher percentage of frequent patterns than those in DN. See Figure 1 for an example. Given a sequence y ? D, we might want to guess if y ? DI or y ? DN. That is, does y contain interesting patterns, or only noise? (This problem is an abstraction of many problems that arises in bioinformatics). A machine learning natural approach is to train a statistical model M based on a training set of sequences DT ? DI, and then to evaluate the likelihood P(y|M). If this value is above some threshold (indicating high percentage of frequent patterns in y and therefore it is likely to contain interesting patterns), we say y ? DI (interesting), otherwise we say y ? DN (noise). An obvious choice of statistical model is a HMM. In this case, we can train the HMM on DT using Baum-Welch Expectation-maximization EM algorithm to maximize the likelihood of DT and then calculate the likelihood of y using the forwards algorithm: ?? == ? == Tt MtMdefTt tt PPMyyPMyPMyP 212 1:11 ),|()|()|( (1) Where T is the length of the test sequence y and )|( 11 MyPPdefM = & ),|( 1:1 MyyPP ttdefMt ?= (2) MtP is the probability of yt under the posterior predictive probability, given past data y1:t-1 and model M. In general, we can think of HMM likelihood as just a scoring function or an objective function, OF, which we want to maximize; we denote this by SHMM(y) = P(y|HMM). One problem with HMM likelihood as a scoring function is that it focuses on rare events, which often just correspond to noise. To see this, note that if any MtP is Figure 1. Two sequences drawn from distribution D=DI ? DN. Black blocks (A and B) represent frequent ?interesting? patterns, shaded blocks (C, D, and F) represent frequent ?noise? patterns, and white represents infrequent background noise. We assume there is noise (background and frequent patterns) in every sequence; interesting patterns are only in DIsequences. (a) A sequence from distribution DI. (b) A sequence from DN. (a) (b) Customizable Objective Function for Hidden Markov Models 2small, then the product is small. As a result, any EM training approach, in order to increase the score of the training set, it needs to setup the HMM parameters such that it scores, as proportion of their count, rare background noise events higher than frequent events. Furthermore, as the training set DT is usually small, it assumed to represents frequent patterns in the target set, background noise is not likely to be appropriately represented, so, the likelihood P(y|HMM) is to a large degree controlled by those rare background noise events in y that are not represented or under represented in the training set (with very low MtP ). This is reflected in a low HMM reliability, as a result, HMM is almost always used in applications of labeled sequences such as in [4][5][6][7][8][9][10][11]. To overcome this problem, an alternative OF that relies on the sum of MtP instead of its multiplication might be used. Another problem in likelihood is that, in scoring, it significantly favors shorter sequences, so, when dealing with variable size sequences, sequence size needs to be factored out. In this paper, we propose an alternative model that we call Hidden States Model, HSM, which is just a HMM with a customizable OF. The idea is to enable users to use a HMM like model with a customizable OF that is assembled using MtP values to best fit their problem requirements. Note that discriminative training of HMMs is quite common on labeled sequences [12]. Similarly, maximizing per-symbol performance has been proposed as an objective for CRFs [13], but again, that relies on labeled data. The main disadvantage of HSM is optimization, unlike HMM likelihood objective function which can be optimized using Baum-Welch algorithm, HSM customized scoring function makes it harder to optimize. In section 2, we introduce a randomized local search approach that optimizes HSM parameters for any OF to maximize the score SHSM(DT). Then, to illustrate the use of HSM and OFs customization, we present examples of constructing several HSM OFs to extract sequences with interesting frequent patterns using synthetic datasets in section 3. 2 Optimizing Hidden States Models In this section we are looking for a general optimization method that maximize the training data score, SHSM(DT), for a HSM independent of it?s OF. Lets have a HSM with NS states and a data set with NO observation symbols, then the number of variables to be optimized: NV = NS * NO + NS * (NS + 1), taking in consideration such a large number of optimization variables, the dependencies between these variables, and an unknown OF (unconstrained optimization surface), we propose this general random search approach for training: ? At first, all parameters are initiated randomly and SHSM(DT) is computed. ? Random perturbations with a noise level L are to be performed to a randomly selected subset of parameters size NSS (starting with NSS = NV) and accept any moves that increase the score (stochastic parameters are normalized appropriately after each step to ensure the sum-to-one constraints are satisfied). ? If we fail to make any progress after a certain number of trials, NT, we decrease the noise level L by LS and the number of parameters in play by one and then try again. ? Optimization ends once the number of parameters in play reaches zero. The values of NT, L, and LS are used to control the optimization process. The rational of this training is to have early iterations assigning values to HSM parameters such that SHSM(DT) is in the neighborhood of the training surface global maximum (to the degree possible), and final iterations are expected to tune in on that maximum. 3 Customizing HSM Objective Functions, a Walkthrough Example In this section we are testing several HSM OFs for extracting sequences with ?interesting? frequent patterns from a set of sequences. We will train HSMs on a training set, use these trained HSMs to score test sequences, and evaluate the outcome using the Receiver Operating Characteristic AUC. We will walk through and compare several solutions (different OFs) for the same problem. Our goals are: ? Analyze and evaluate HSMs with different OFs, ? Quantify how data characteristics is reflected on each OF mining performance, and ? Compare different outcomes including that of HMM. Therefore, we are using two synthetic data-sets, DSa and DSb, to facilitate precise performance evaluation. 3.1. Synthetic Data DSa: a synthetic dataset of discrete sequences with an alphabet size 20 (to simulate amino acids); first, five randomly generated short sequences (3 to 5 symbols each) -to be used as frequent patterns- are generated, two patterns we call interesting and three we call noise. DSa consist of three files (FI ?interesting?, FN ?noise?, and FT ?training?). Each one of these files holds sequences between 50 and 60 symbol each, with a uniform distribution background (infrequent noise): ? FI ?interesting?: consist of 2000 sequences; each sequence is infected with both interesting and noise frequent patterns, for an example see figure 1(a). FI is used for testing. ? FN ?noise?: consist of 2000 sequences; each sequence is infected with noise frequent patterns, for an example see figure 1(b). FN is used for testing. ? FT ?training?: is of 50 sequences. Sequences in FT are drawn from the same distribution of FI which includes Customizable Objective Function for Hidden Markov Models 3both interesting and noise frequent patterns. FT is used for training. Frequent patterns are added to sequences using patterns infection algorithm presented in Figure 2. DSb: is similar to DSa, except that both interesting and noise frequent patterns are randomly mutated before been added to sequences, more than half frequent patterns are mutated by altering up to two symbols in every pattern in DSb, see figure 2.(c). Figure 2. (a) This algorithm inserts five short patterns in a sequence seq and is used to generate data files FI and FT. When the dataset is DSb, the algorithm calls mutate function on every pattern before inserting it. (b) This algorithm inserts three frequent short patterns in a sequence seq and is used to generate data file FN. When the dataset is DSb, the algorithm calls mutate function on every pattern before inserting it. (c) This mutate function is called on every pattern p before inserting it in DSb sequences. for(i = 0; i<5; i++) { if (i < 3) p = select noise pattern at random; else p = select interesting pattern at random; if(data set == DSb) mutate(p); insert p in seq at a random location; } (a) for(i=0; i<3; i++) { p = select noise pattern at random; if(data set == DSb) mutate(p); insert p in seq at a random location; } (b) mutate(p) { mCount = p.SIZE() ? 3; for (j = 0; j < mCount; j++) if(RND(1.0) < 0.6) p[RND_INT(p.SIZE()) += RND_INT(5)?2; } (c) 3.2. Model Description We are using a randomly initiated left-to-right [9] eight states model M. Baum-Welch training algorithm is used for training M with the regular HMM likelihood OF, and the HSM training described in section 2 is used with customized OFs. Although in principle, for any model M, both training the model and then scoring sequences should be done using the same OF, we will show that to more effectively target frequent patterns one can use two different, but compatible, OFs, a training objective function OFtr, and a scoring objective function OFsc. 3.3. Evaluating Different OFs For each solution (a pair of OFtr, OFsc), a HSM is trained on FT (using OFtr), test sequences (FI + FN) are then scored on the trained HSM (using OFsc), and an AUC (Area Under the Receiver Operating Characteristic Curve) is used for evaluation. Since training algorithms are non deterministic, that is different instances of training using the same FT are likely to produce different outcomes, fifty tests are conducted and AUC = average?variance is used for evaluation. To facilitate understanding differences between OFtr, each possible adjacent pair1 of symbols (used as a short pattern), AB, will be given a score ABTS equals to the last symbol, B, average MtP over all AB occurrences in FT in all 50 trained HSM instants. Similarly, ABINS is computed for pairs of symbols in FI + FN in all 50 trained HSM instants. Scattered plots, PFS plots, are used to visualize the relationship between patterns (pairs of symbols, AB) frequencies in the training data FT and their scores ABTS and ABINS for each OFtr. We define a pattern frequency in a set of sequences as its number of occurrences in that set divided by the number of sequences in the set. 3.4. Results Starting with the un-mutated data-set DSa, for each suggested solution, fifty HSMs are trained on the same FT, sequences in (FN + FI) are scored on each model and AUC values are computed. Each solution result is presented in the form of average AUC ? variance: SOLUTION 1: ? OFtr = OFsc =? =Tt MtP1 (HMM likelihood): Results and analysis: From the PFS plot and the bar chart (figure 3) we see that most MtP values are in the range 0.04 to 0.06, some frequent patterns (freq>0.5) have values around 0.08, MtP values for frequent patterns are just slightly higher that those of infrequent patterns. Running 50 tests resulted an AUC=0.533?0.0008. SOLUTION 2: ? OFtr =? =Tt MtP1 (HMM likelihood), ? OFsc = T TtMtP? =1 (geometric mean): 1 In this analysis we choose to use short patterns of only two symbols, combined with our alphabet size of 20, the total number of possible patterns 202 = 400, makes our data (training and testing) large enough to support this analysis. If one is using a smaller alphabet size, let?s say nucleotide sequences, with an alphabet of 4, the length of short patterns should be 4 or 5. Customizable Objective Function for Hidden Markov Models 4Rationale: Since scored sequences are of different sizes and the HMMs likelihood OF scores favors shorter sequences, we scored using geometric mean. Results and analysis: a mild improvement to an AUC=0.581?0.0039. Figure 3. OFtr = HMM likelihood: (a) PFS plot, (b) Bar chart show average MtP values for frequency groups. (a) (b) SOLUTION 3: ? OFtr = OFsc = ? =Tt MtPT 11 (average): Rationale: OFs that are based on the sum of MtP (arithmetic mean), instead of its multiplication (geometric mean), is used to reduce the effect of noise events (small MtP values) on the final score. Results and analysis: PFS plot in Figure 4 shows that MtP values are in a wide range from 0.0 to 0.8 (high variation within each frequency group), this OFtr focus on maximizing the overall MtP value of all patterns in FT. The corresponding bar chart shows that the average MtP values for frequent patterns are significantly higher than those of infrequent patterns. It is also clear that MtP values for the training data FT is higher than those of the test data (FI + FN), this is due to over-fitting. Result show an improvement over multiplication based OFs (solutions 1 and 2) to an AUC=0.68?0.0067. SOLUTION 4: ? OFtr =)()(MtMtPVariancePAverage , ? OFsc = ? =Tt MtPT 11 (average) Figure 4. OFtr = ? =Tt MtPT 11 : (a) PFS plot, (b) Bar chart show average MtP values for frequency groups. (a) (b) Rationale: Even that patterns with higher frequency, on average, yield higher MtP values, PFS plot of solution 3, figure 4, show that the correlation between frequency and MtP is low as MtP values at any frequency range are inconsistent. This inconsistency between frequency and MtP has a negative effect on AUC value and resulted high variance in AUC (low reliability), to reduce noise, we replaced the OFtr in solution 3 with one that evaluate sequences using the arithmetic mean of MtP divided by its variance. Results and analysis: Dividing by the variance of MtP over the entire FT data (frequent and infrequent) directed the training process to reduce the range of MtP values by: ? Increasing low values slightly and ? Significantly decreasing those that are high. Resulting a higher and more reliable AUC=0.74?0.0004. SOLUTION 5: ? OFtr = ? = ????<Tt MtMtPelsePfreqpatifT 1 }{}){5.0_(1 , ? OFsc = ? =Tt MtPT 11 : Rationale: The training process of HSM adjust its parameters to maximize the score SHSM(FT). For any pattern in FT, the higher its frequency in FT, the higher its contribution to SHSM(FT), therefore, while the training process attempt to increase MtP for all patterns, its main emphasis is on increasing MtP values for higher frequency patterns (especially true when utilizing an arithmetic mean Customizable Objective Function for Hidden Markov Models 5OFtr). Ideally, one would like training to only increase MtP for frequent patterns (with frequency higher than some threshold value) and suppress MtP for infrequent background noise. Figure 5. OFtr =)()(MtMtPVariancePAverage : (a) PFS plot, (b) Bar chart show average MtP values for frequency groups (a) (b) To achieve that, we first created a frequency table for every possible pairs of symbols, pattern, by counting its occurrences in FT and dividing that count with the number of sequences in FT, then an ?arithmetic mean like? OFtr is used except that MtP for each symbol in FT is assigned positive sign when the frequency of the pattern pat_freq ending with the symbol in FT is in some desirable range of frequency (higher than some threshold (=0.5) value), otherwise MtP is assigned a negative sign. Therefore, in order for the training process to maximize the score SHSM(FT) (to maximize the sum of all MtP ) it maximizes positive MtP values (frequent patterns) and minimizes those MtP values with negative sign (infrequent patterns). Results and analysis: the resulting PFS plot and corresponding bar chart (figure 6) show significantly higher separation of MtP values between higher and lower frequency patterns, resulting an AUC = 0.774?0.022. SOLUTION 6: ? OFtr = ? = ????<?Tt MtMtPelsePfreqpatifT 1 }{}){5.0_25.0(1 , ? OFsc = ? =Tt MtPT 11 , with an altered training set. Figure 6. OFtr = ? = ????<Tt MtMtPelsePfreqpatifT 1 }{}){5.0_(1 : (a) PFS plot, (b) Bar chart show average MtP values for frequency groups. (a) (b) Rationale: The PFS plot and the bar chart in solution 5 showed that assigning negative sign to MtP values of patterns with undesirable frequency range directed the training process to suppress these values significantly, leaving OFsc scores mainly controlled by MtP values of patterns with desirable frequency range(s). Therefore, if one can have a training data with the frequency range of frequent interesting patterns separate of other patterns frequencies, a HSM can be trained to increase MtP values only for the frequencies of interesting patterns and suppress MtP values for every other frequency. Obtaining such training data is case dependent, one way is to increasing the frequency of noise frequent patterns with respect to interesting patterns by mixing some test data with FT, which include higher level of noise patterns compared to training sequences, or simply to train on a subset of the test sequences, FI + FN. The key issue is to have a broad estimate of different patterns frequencies. In this example, we added 100 sequences that are drawn from a distribution similar to that of FN to the 50 originally in FT and the new training set has 150 sequences but only one third (50) are infected with interesting patterns. The OFtr is setup to assign a positive sign to MtP values of those patterns with frequency higher than 0.25 and lower than 0.50. Results and analysis: PFS plot shows that MtP values for interesting patterns significantly higher than those of infrequent noise, and MtP values of frequent noise are about zero. However, using the same scoring OFsc of solution 5, there is no improvement on the average AUC result and significant step back regarding the variance (reliability), AUC=0.764?0.0631. Looking at the bar chart in figure 7, with such high separation of MtP values for Customizable Objective Function for Hidden Markov Models 6interesting and noise patterns, one would have predicted near optimal AUC outcome, and slightly less optimistic outcome can be anticipated from its PFS plots due to the clear inconsistency of MtP values at any frequency range especially those of interesting patterns (frequency range [0.25 .. 0.5],). This unanticipated low reliability and lake of progress in AUC is likely to be a result of the inconsistency in MtP values. Note that, for the training data set, each pattern (of two symbols) is represented with only one dot in PFS plots (and one dot for the test data set) by averaging the MtP values of all the occurrences of that pattern in that data set over all 50 tests, therefore the actual inconsistency is much higher than what is presented in PTF plots. Figure 7. OFtr = ? = ????<?Tt MtMtPelsePfreqpatifT 1 }{}){5.0_25.0(1 : (a) PFS plot, (b) Bar chart show average MtP values for frequency groups. (a) (b) Overcoming inconsistency: To overcome this inconsistency ( MtP values sensitivity to minor details of the data), we replaced MtP in OFsc with NMtP . In scoring any sequence, we define the NMtP value of any symbol is the average of its MtP values computed on all N trained HSMs. ?==NiMtNMtiPNP11 (3) So, when using 50 HSMs, MtP50 is the average of MtP values computed on all 50 trained HSMs. Again, the idea is to reduce inconsistency in MtP (with respect to frequency) by averaging each symbol MtP values over large enough number of HSMs. We scored our test data using MtP50 which resulted an AUC50 = 0.999. Sliding window: Even that MtP50 values for interesting patterns are significantly higher than those of frequent and infrequent noise, the consistency in these MtP50 values (also improved over those of MtP ) is not enough to completely separate FN sequences from those of FI. Since frequent patterns are of 3 to 5 symbols each (2 to 4 pairs of symbols), to reduce the effect of inconsistent values over the entire sequence length, OFsc is modified to use a sliding window of size 4, such that a sequence score is the maximum score of any of its 4 adjacent symbols. This resulted an AUC50 = 1.0. We computed the AUC50 values (utilizing MtP50 ) for solutions 1 to 6 using the same set of trained HSMs instances used to compute AUC (utilizing MtP ), results in table 1 show that utilizing MtP50 values lead to significant improvement over MtP with two exceptions: in solution one, pure HMM, a sequence score was mainly controlled by its length not by its content, therefore both AUC and AUC50 are about 0.5 or random picks. And the other exception is solution 4, having an OFtr that divide average by variance reduced extreme values and the effect of inconsistency was limited. Regarding the computation of NMtP , we proposed in equation 3 to averaged MtP values generated using all instances of trained HSM for simplicity and because we are using large enough number of HSMs, one might use some statistical tool to exclude extreme values before averaging. More complicated cumulative objective functions can also be used with HSMs, for example, two OFtr, one to amplify MtP values of interesting patterns based on some aspect (e.g. frequency range) by assigning positive sign to these MtP values and a negative sign to all other values, and the other OFtr suppress these values (reverse), then some OFsc can be used that give high values only for those symbols with high NMtP values in the first OFtr and low in the second (this to reduce inconsistency). Table 1. Result of all 6 solutions for the unmutated Data set DSa Solution AUC AUC50 1 0.533?0.0008 0.535 2 0.581?0.0039 0.746 3 0.680?0.0067 0.768 4 0.740?0.0004 0.775 5 0.774?0.0220 0.894 6 0.764?0.0631 0.999 6 with sliding window size 4 1.0 One major use of HMMs, and therefore HSMs, is to detect altered or mutated patterns in which some symbols in Customizable Objective Function for Hidden Markov Models 7frequent patters are different at different instances. We generated our second data-set, DSb, such that about 60% of frequent patterns sizes 4 and 5 are mutated at one or two random locations. We repeated all runs (solutions 1 to 6) on DSb, results are in table 2. Table 2. Result of all 6 solutions for the mutated Data set DSb Solution AUC AUC50 1 0.515?0.0010 0.512 2 0.529?0.0045 0.556 3 0.556?0.0029 0.581 4 0.636?0.0001 0.649 5 0.650?0.0231 0.798 6 0.640?0.0529 0.985 6 with sliding window size 4 0.986 4 Applications with labeled sequences HMMs are being very successfully used (with labeled data) to predict hidden states of sequences, e.g. in gene annotation and other applications. By manually designing the model states transition matrix and limiting the degrees of freedom of its states, the model is only assigning states to sections of sequences based on the average symbols occurrences in each state, i.e. it is no longer learning hidden ?interesting? frequent patterns. With HSM one can use a fully connected number of state to represent one label in the training data which enable the model to identify parallel labels based on patterns frequencies. 5 Conclusions We presented a new HMM like statistical model that we called Hidden States Model, HSM, which is a HMM with a customizable objective function. Using synthetic data-sets to facilitate precise performance evaluation, we demonstrated how HSMs can extract sequences with some interesting patterns from a set of sequences in a high noise environment. First we analyzed HMM likelihood objective function and explained when and why it can generate poor performance. Then we suggested some alternative OFs to improve on the HMM likelihood OF; and we presented a general optimization approach that can train HSMs with any user defined OF. And finally, we used examples to demonstrated how HSM OFs can be constructed to extract sequences with frequent interesting patterns based on some of these patterns features such as frequency and size, and we also showed that two different OFs, one for training and one for scoring, can provide more capabilities in targeting these features. We utilized scattered plots to understand (for each OF) the relationship between patterns frequency and scores, and then to update these OFs to improve the extraction process. Results Show HSM outperformed HMM in extracting sequences with interesting patterns on both unmutated and mutated data (about 60% of frequent patterns sizes 4 and 5 are mutated at one or two random locations in mutated data). On unmutated data HMM AUC was as low as 0.533 (almost random pick) compared to an optimal HSM AUC of 1.0, and on mutated data HMM AUC was 0.515 compared to a HSM AUC of 0.986. 6 References 1. Cawley, S., & Pachter, L. Hmm sampling and applications to gene finding and alternative splicing, Bioinformatics 2003, pp. II36-II41. 2. Sinha, S., Nimwegen, E., & Siggia, E. A probabilistic method to detect regulatory modules, Bioinformatics 2003, 19, pp. i292-i301. 3. Malhis, N., & Ruttan, A.. Detecting gene regulation relations from microarray time series data, Proceedings of the 2006 International Conference on Machine Learning; Models, Technologies & Applications MLMTA'06, June 26-29, Las Vegas, Nevada, USA. 4. Bishop, M., & Thompson, E. Maximum likelihood alignment of DNA sequences, Journal of Molecular Biology 1986, 190, pp. 159-165. 5. Kulp, D., Haussler, D., Reese, M., & Eeckman, F. A generalized hidden markov model for the recognition of human genes in DNA, ISMB 1996. 6. Burge, C., & Karlin, S. Prediction of complete gene structures in human genomic DNA, J. Mol. Biol. 1997, pp. 78-94. 7. Lukashin, A.V. & Borodovsky, M. Genemark.hmm: new solutions for gene nding, Nucleic Acids Research 1998, 26, pp. 1107-1115. 8. Reese, M. G., Eeckman, F., Kulp, D., & Haussler, D. Improved splice site detection in genie, J. COMPUT. BIOL 1997, 4, pp. 311-323. 9. Rabiner, R. A tutorial on hidden markov models and selected applications in speech recognition, IEEE 1989. 10. Fox, E., Sudderth, E., Jordan, M. and Willsky, A. The sticky HDP-HMM: Bayesian nonparametric Hidden Markov Models with persistent states. Tech. Rep 2009. P-2777, MIT LIDS. 11. Krogh, A., Larsson, B., Heijne, G.V., & Sonnhammer, E. L. L.. Predicting transmembrane protein topology with a hidden markov model: Application to complete genomes, J. Mol. Biol. 2001, 305, pp. 565-580. 12. Bengio, Y.. Markovian models for sequential data, Neural Computing Surveys 1999, 2, pp. 129-162. 13. Kakade, S., Teh, Y., & Roweis, S.. An alternate objective function for Monrovian fields, Intl. Conf. on Machine Learning 2002.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Community, Partners, and Alumni Publications /
- Customizable objective function for hidden Markov Models
Open Collections
UBC Community, Partners, and Alumni Publications
Customizable objective function for hidden Markov Models Malhis, Nawar 2013
pdf
Page Metadata
Item Metadata
Title | Customizable objective function for hidden Markov Models |
Creator |
Malhis, Nawar |
Contributor | University of British Columbia. Centre for High-Throughput Biology Michael Smith Laboratories |
Date Issued | 2013 |
Description | Identifying sequences with frequent patterns is a major data-mining problem in computational biology, in this work, our focus is on utilizing a HMM like model for extracting sequences with interesting frequent patterns from a set of unlabeled sequences in a heavy noise environment. First we show that the likelihood objective function for HMMs is very sensitive to noise which limits its use to labeled sequences. Then we introduce an alternative model that we call Hidden States Model, HSM, which is a HMM with customizable objective function, OF. We show empirically (on synthetic data to facilitate precise performance evaluation) how OFs can be customized to target specific information such as patterns frequency and size. Results Show HSM vastly outperformed HMM in extracting sequences with interesting patterns for both unmutated and mutated data. |
Subject |
Hidden Markov Models Hidden States computational biology data mining discrete sequence frequent pattern unlabeled sequences |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2013-12-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0075987 |
URI | http://hdl.handle.net/2429/45582 |
Affiliation |
Other UBC |
Peer Review Status | Unreviewed |
Scholarly Level | Researcher |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52387-Malhis_Nawar_Customizable_Objective_2013.pdf [ 97.75kB ]
- Metadata
- JSON: 52387-1.0075987.json
- JSON-LD: 52387-1.0075987-ld.json
- RDF/XML (Pretty): 52387-1.0075987-rdf.xml
- RDF/JSON: 52387-1.0075987-rdf.json
- Turtle: 52387-1.0075987-turtle.txt
- N-Triples: 52387-1.0075987-rdf-ntriples.txt
- Original Record: 52387-1.0075987-source.json
- Full Text
- 52387-1.0075987-fulltext.txt
- Citation
- 52387-1.0075987.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.52387.1-0075987/manifest