Co RE TH mputer A THESI QUIREME E UNIVE Modeli Con S SUBMIT NTS FOR Irving K. RSITY OF ng of M tent Se Curt TED IN P THE DE SC Barber Sc BRITISH Ma ©Curtis olecula nsitive by is Stodgel ARTIAL GREE OF IENCE in hool of Ar COLUM rch 2007 Stodgell, r Genet Analysi l FULLFIL HONOU ts and Sci BIA OKA 2007 ic Even s MENT OF RS IN CO ences NAGAN ts using THE MPUTER CAMPUS 1 Abstract: The problem of motif finding plays an important role in understanding the development, function and evolution of organisms. The Planted (l, d)-Motif Problem first introduced by Pevzner and Sze [13] is a variant model of motif finding that has been widely studied. Nevertheless, despite the many different algorithms constructed to solve this problem, it is far from being solved entirely [5]. We analyze a number of algorithms including: the Basic Voting Algorithm, the Winnower and the Closest String. One thing that has become ubiquitous among these algorithms is the tradeoff between efficiency and accuracy. We formulate the motif-finding problem as a constraint satisfaction problem and introduce a special form of constraint consistency. By using a fixed-parameter algorithm for the closest string problem [4], we develop a propagation algorithm that enforces the new constraint consistency with the same worst-case time complexity as that of the standard path consistency algorithm. Experiments on randomly generated sequences indicate that our approach is effective and efficient. 2 Table of Contents Abstract .......................................................................................................................................1 Table of Contents .......................................................................................................................2 1. Introduction ...........................................................................................................................3 2. Planted Motif Problem ...........................................................................................................5 1. Definitions and Notation ..................................................................................................5 2. Winnowing Approach: Locating Signals via Winnowing ...............................................6 3. A Fixed Parameter Algorithm for the Closest String .......................................................9 4. Basic Voting Algorithm .................................................................................................11 3. Overview of Content Sensitive Consistency .......................................................................13 1. Constructing the Constraint Graph ................................................................................14 2. Arc Consistency .............................................................................................................16 1. Arc Consistency Implementation (AC6) ...........................................................17 3. Path Consistency ............................................................................................................18 4. k-consistency..................................................................................................................19 5. Content Sensitive Analysis (CSC) .................................................................................20 1. CSC Implementation (PC6+CS) ........................................................................20 4. Experimentation and Results ..............................................................................................22 1. Generation of Samples ...................................................................................................22 2. Results ............................................................................................................................22 5. Discussion and Conclusion ..................................................................................................26 Appendix A ...............................................................................................................................27 References .................................................................................................................................30 3 1 Introduction: Transcription factors are proteins that bind to DNA in order to regulate the expression of genes through the activation or inhibition of transcription mechanisms. Locating these sites is an integral part of understanding the regulatory process. Computational tools are an invaluable method used to locate these particular sites and this general problem has been termed motif finding. Essentially the motif finding problem is the problem of finding common patterns among a set of sequences. A simple model for the problem is as follows: Given a sample of random sequences can we find a common unknown pattern (motif) that is hidden at random locations in the sequences? If the motif was not subject to mutations it would be relatively easy to locate using a basic brute force algorithm; however, in biology the sequences being analyzed are subject to mutations and therefore the motif cannot be assumed to be exact. As such, it makes sense to construct a model that will allow the patterns we are searching for to have an arbitrary number of mutations [13]. Buher and Tompa [8, 7] found limitations for this particular problem. They found that when the number of sequences, the length of sequences and the length of the pattern are fixed, if the number of mutations in the pattern is larger than a particular threshold then it is highly improbable that any algorithm would be able to find such a pattern due to an abundance of random patterns [8, 7]. However, there have been many algorithms suggested for instances of this problem that do not exceed this threshold [8]. In this thesis we analyze a number of different algorithms that attempt to solve the problem in several different ways. These include A voting algorithm by Leung and Chin [8] that locates common motifs by enumerating all neighbors of each l length substrings in a given set of sequences, as well as a combinatorial approach by Pevzner and Sze [13] that essentially reduces the motif finding problem to locating cliques in multi-partite graphs. Closely related to the planted-motif problem are the closest string problem and the closest substring problem. Both of these problems are NP-hard, but the closest string problem can be solved in linear time assuming that the number of mutations is a fixed constant [9]. In the Winnower Pevzner and Sze designed an algorithm that prunes spurious similarities, which enables us to easily locate the signal we are searching for. They did so with an algorithm that has a time complexity of O(Nk+1) and found that the algorithm was accurate for 4 k=3. Our contribution includes a variation of the Winnower combined with the closest string. We have designed an algorithm that essentially runs as the Winnower for k=2 along with the closest string algorithm to construct an algorithm with a time complexity of O(N3 + dd) where d is the number of mutations and constant. This thesis is organized as follows: In chapter 2, we will discuss numerous algorithms that approach The Planted (l, d)-Motif Problem in various ways, each algorithm achieving a different level of both accuracy and efficiency. In chapter 3, we present our approach to the problem, including the CSP formulation, the propagation algorithm for the stronger notion of path consistency and its implementation. In chapter 4 we present our results and allegorize the pruning power and efficiency of our approach. In chapter 5, we will conclude our results and analyses, as well as propose future challenges. 2 Plan In this ch variety o 2.1 Defi In order t definition define no T partially Definitio mathema strings di they diffe In a string s Definitio problem makes th specific n ted Moti apter we int f existing alg nitions an o understan based on a tation we w he methods by comparin n 1 (Hamm tics termed ffer. For ex r at indices this paper t and s’. n 2 (Plante of motif find e problem d umber (refe f Problem roduce form orithms des d Notatio d how the m variant mod ill use throu we will be u g the Hamm ing Distanc information ample given 1 & 3; there he notation d (l, d)-Mot ing deals w ifficult is th rred to as m ally the Pla igned to sol n otif finding el of the pro ghout this th sing to solv ing distanc e): Hammi theory. It is two strings fore, the ha DH(s, s’) wi if Problem) ith locating e fact that th utations). F nted (l, d)-M ve this prob problem can blem prese esis. e this partic e between tw ng distance defined sim “ATAGTC mming dista ll be used to : A motif i common pa is pattern is ormally we otif Problem lem. be solved nted by Pev ular problem o strings. is a simple c ply as the n A” and “TT nce between denote the s a single or tterns amon not exact an can describ [13] as we we will give zner and Sze will usuall oncept in an umber of in TGTCA” w these two s hamming di repeated pa g a set of se d is allowed e the proble ll as discuss a formal [13], as we y be execute area of dices that tw e can see th trings is 2 stance betw ttern. The quences. W to differ by m as follow 5 a ll as d o at een hat a s: 6 Assume there exists a motif m of length l; given t number sequences each of length n, the problem is to find a planted motif m’ of length l in each of the t sequences such that m’ differs in at most d positions from m. A more general problem is the closest substring problem which has been shown to be NP-hard [9]. The difference between the planted motif problem and the closest string problem is that for the planted motif problem, it is guaranteed that there exists at least one common substring that occurs in each of the sequences. Whereas for the closest string problem, it is not guaranteed that there will be such a common substring among each of the sequences. The parameters l and d in the Planted (l, d)-Motif Problem have a significant impact on the practical difficulty of the problem. Some particularly difficult Planted (l, d)-Motif problem instances proven to be probabilistically challenging are (9,2), (11, 3) and (15, 4) [1]. These problems are considered challenging because given any two instances of a mutated (l, d)-signal there is a high probability that non-signals (spurious similarities) will be randomly generated such that they very closely resemble the proper signal. Many techniques have been designed to solve these challenging problems which will be looked at below. 2.2 Winnowing Approach: Locating Signals via Winnowing. Pavel A. Pevzner and Sing-Hoi Sze [4] take a combinatorial approach to the motif finding problem by reducing it to a clique finding problem. Pevzner et al. recognize that in each sequence there are spurious signals that occur at random that tend to hide the real signal. Most algorithms tend to focus directly on the process of locating the real signal; however, the winnower takes an opposite approach. It focuses on the spurious signals (rather than the signal itself) and removes these spurious similarities incrementally until the signal we are looking for is no longer difficult to find. This elimination process is, however, non-trivial and time consuming. It is important to note that the Winnower algorithm only considers signals with Hamming distance as criterion for an edge and not insertions or deletions (signals will all be the same length). This is done for simplicity; however the algorithm could easily be modified to account for these constraints [8]. T in a mult paramete G=(V, E) V the set S represent vertices. T to substri A by an edg if F conta also be th sequence T been rece generated problem T remove e the winno that a ver We defin he Winnowe i-partite grap r l (length o in the follo ertices will = {s1, s2, … the substrin here is an ed ngs from se clique in a e. Given an ins a clique e case that t s (partites) f he clique fin nt interest i graph [6]. in that the u he Winnowe dges that ar wer remov tex u is a ne e a clique to r is set up t h. Given a f the signal) wing way: correspond ,st} and for g sij of leng ge between parate seque graph G = ( y graph F = of at least s his clique c orming an e ding proble n studying th The planted nderlying gr r algorithm e proven to es edges Pev ighbor of a be extenda o reduce the set of t sequ and a numb to substrings each index j th l from ind vi and vj if a nces. V,E) is a sub (V, E), con ize k, can be ontains the o xtendable c m is a graph e problem motif-findi aph is a mu uses the pri not be part o zner and Sz clique C = { ble if it has Planted (l,d ences S = { er d (numbe of each seq in sequence ex j to (n-1) nd only if D set C such t structed as d asked. On riginal sign lique can be -theoretical of finding hi ng problem lti-partite gr nciple of ex f the clique e use the no v1,…,vk} if at least one n )-Motif prob s1, s2, … , st r of mismat uence, that si we const + l; thus th H (vi, vj) ≤ 2 hat any two escribed by ce such a cli al being sou seen in the NP-comple dden clique is a special aph. ploring thes we are sear tation of an {v1, … , vk, u eighbor in lem to find } each of len ches), const is, for every ruct a vertex ere is a total d and vi an vertices in C the Winnow que has bee ght after. A figure below te problem [ s in an other case of the h e particular ching for. T extendable } is a clique every part o ing large cli gth n, a ruct a graph sequence si v V that of t((n-1) + d vj correspo are connec er, the que n found, it w n example o : 6]. There h wise random idden cliqu graphs and w o describe h clique. “We in the grap f the multip 7 ques of will l) nd ted stion ill f 4 as ly e ill ow say h. artite graph G. k. One w edges are 3]. The e increases T Pevzner edges tha observati enough s T and a num vertex u w is an edg multipart S4); howe F will pass is no such be delete We define ay to impos deleted afte ssential idea ; in doing so his is also th [13, 15] for t are not par ons indicate purious edg he winnowe ber k = 1, ill pass the e in each si) ite graph wi ver, vertex or k = 2 the the test if th vertex vi, e d; however, an edge to b e increasing r winnowin of the winn spurious e e approach k = 2. How t of an exten that only w es in such a r removes e the algorith test if there , if vertex u th four parti v12 will be d algorithm w ere is a vert dge (u, w) w edge (v12, v e spurious i ly strict con g and only e ower is to a dges are effe that was tak ever it was o dable cliqu hen k = 3 do way that the dges in the f m checks ev exists at lea fails this tes te sets, verte eleted (v12 h ill now chec ex vi for eac ill be delet 21) will be d f it does not ditions as k xtendable c llow only ex ctively rem en by Vingr bserved by e is too wea es the meth clique we’r ollowing wa ery vertex i st one neigh t, it will be d x v11 will n as an edge i k each edge h partite si s ed. In the ill eleted. belong to a increases is liques rema tendable cli oved. on & Agros Pevzner & S k for prunin od become e searching y: Given a n G and app bor vi in ea eleted. In t ot be deleted n S2 and S3 b (u, w) in th uch that {u, ustration bel ny extendab to ensure th in in the fina ques of size [13, 14] and ze that whe g spurious e effective at r becomes ev multipartite lies the follo ch partite se he illustratio (v11 has as ut no edge e graph G. vi, w} is a t ow, edge (v le clique of at all spurio l graph.” [3 k to remain Vingron & n k ≤ 2, filte dges [13]. T emoving ident. graph G = ( wing test. A t of G (ei: (u n below, fo edge in S2, S in S4). The edge (u riangle. If th 11, v22) will n 8 size us , pg. as k ring heir V,E) , vi) r a 3, , w) ere ot S continues similar te vertex vi x) & (u, x W locating e Winnowe power of complexi the Plant meaningf result it b 2.3 A F The Clos biology, given the o, as k incre in this fash st for the ot in each part ) will be rem innower is xtendable c r takes O(N Winnower ty of the Wi ed-(l, d) Mo ul. Howeve ecomes ver ixed-Para est String pr coding theor following g ases the con ion for k = 3 her values o ite set si if { oved from an iterative liques of siz 2) time, whe increases as nnower is O tif Problem r the Winno y slow for la meter Alg oblem is an y and conse eneral defin ditions for e ; the algorit f k. This tria u, vi, w, x} i the graph. algorithm th e k become re N is the l k increases, (Nk+1). Pev k = 3 is suff wer also re rge samples orithm fo important p nsus word a ition: dge remova hm selects e ngle will pa s a clique of at removes a trivial pro ength of all but so does zner and Sz icient for fin quires a larg [13]. r the Clos roblem in m nalysis [4][ l become m ach triangle ss the test if size 4. Othe inconsistent blem. The c sequences i the running e point out t ding extend e amount of est String any areas su 9]. The Clo uch stronger (u, w, x) to there exists rwise the ed edges from onstruction n the set S. T time since t hat for many able cliques time and m Problem ch as comp sest String p . The algori apply the at least one ges (u, w), a graph unt of the graph he pruning he running instances o which are emory; as a . utational roblem can 9 thm (w, il for time f be Definitio and a num C applicatio present a particular small (ra Definitio character string in there is n central st G strings. T We say a character matrix ha n 3 (Closes ber d the p losest string ns of comp fixed-param ly importan nge of 1 ~ 7 n 4 (Closes at position p the set is of o string s’ s ring s with iven a set of he term col particular c s in the colu s only 3 dir t String Pro roblem is to is an NP-co utational bio eter algorit t for the mo ). t string not in s, wher length l, we uch that DH DH (s, si) = 2 k strings of umns of a C olumn of th mn. For the ty columns. blem): Giv find a cente mplete prob logy d is qu hm with an tif finding p ation): For e 1 ≤ p ≤ l. say a string (s’, si) < DH for all si length l, we losest String is matrix is a example be en a set S = r string s’ s lem; howev ite small. Je exponential roblem cons a given strin Given a set s is an optim (s, si) for al {s1, s2, s3, s4 can create matrix are dirty colum low the S m {s1, s2, … , uch that DH er, when we ns Gram & growth com idering the g s of lengt of strings S al closest s l si S. The , s5, s6}. a k x l chara in reference n if there ar atrix has 8 d sk} of t strin (s’, si) ≤ d fo are dealing Rolf Nieder plexity of O value d is co h l, s[p] will = {s1, s2, … tring for S i example b cter matrix to this part e two or mo irty column gs of length r all si S. with meirer [4] (dd). This i nstant and q denote the , sk}, where f and only if elow shows from this se icular matrix re differing s while the 10 n, s uite each a t of . T B algorithm recursive if such a In more tha center str arbitrary another s the candi = (h-1). find a sol when car is O(dd), 2.4 Bas This sect which is the BVR G sequence (n-1) – l. each vari regardles variants i T O(n(3l)d ounded sear s for a varie procedure p string exists itially, befo n kd dirty co ing can be f string si S tring sj S ( date string a The algorith ution to the eful sub cas where d is c ic Voting ion will desc a variant of using the sa iven a set of si S and g For each l- ant generate s of the fact n this way, he BVR trad + nt), in com ch tree is on ty of proble resented by for a given re calling th lumns, of th ound for giv as the cand where si ≠ s nd changed m continues problem (a es of this rec onstant. Algorithm ribe the Ba an algorithm me notation sequences enerate l-le length subst d. Any vari that an iden it will be the es a lower t parison to a e of the bas ms. The alg Gramm & N set of string e procedure e Closest St en value of idate string j), where DH to match th in this fash center string ursion are s sic Voting A first propo as defined S and two n ngth substrin ring the algo ant generate tical variant case that m ime comple brute force ic technique orithm for C iedermeire s [4]. , Gramm & ring matrix k & d). Oth s and an inte (s, sj) = h an at of string s ion until eit s’ is found elected the lgorithm (B sed by Ware by the Plant umbers l and gs from eac rithm will g d will get o is generate otif m will h xity O(nt(3l algorithm w s used in de losest Strin r that is pro Niedermier S, then the i erwise the m ger value d d h > d, the j at the same her s moves ). Gramm & upper limit o SV) created mann et al. ed (l,d)-Mot d, the BVA h position i enerate all d ne and only d at a later s ave t numb )d) for an inc hich has co signing fixe g in Append ven to find a point out th nstance is re ethod is cal as paramete n a position position su to far away Niedermie n the size o by Leung a [8, 16]. We if Problem. will iterate n si from ind variants an one vote, pe tep. When c er of votes. reased spac mplexities O d-parameter ix A is a central stri at if there ar jected (no led with any rs. If there is selected ch that DH(s from si or w r point out t f this search nd Chin [8] will describ through ea ex 0 to inde d cast a vot r sequence, asting votes e complexit (nt4l) and 11 ng s e is in , sj) e hat tree e ch x e on on y 12 O(nt) respectively [8]. Leung et al. show the BVR to be efficient for solving many Planted (l, d) Motif Problem instances [8]. 13 3 Overview of Content Sensitive Consistency (CSC) The problem of motif finding can be formulated as a constraint satisfaction problem (CSP). Formulating in this way allows us to detect and remove spurious similarities via the use of constraint propagation techniques that enforces local consistency. This process works similar to the methods used in the winnower approach with equivalent time complexities. A CSP is a combinatorial search problem – well studied in the area of Artificial Intelligence. Essentially a CSP can be defined as: given a set of variables one must assign each variable a value that will satisfy a predefined number of constraints. Often a CSP requires a heuristic and combinatorial search method to be solved in reasonable time. Definition 6 (CSP): A CSP consists of: A set of variables X = {X1, X2, … , Xn} and a set of constraints C = {C1, C2, … , Cm}. Each variable Xi has a set of i possible domains D = {D1, D2, … , Di} where for each value j, Dj D defines a possible value. Each constraint is defined over a subset of variables and specifies the allowable combinations of values for that subset. Each state of this problem is defined by the assignment of values to a partial or a complete set of variables. We say an assignment is consistent when a variable takes on a value that does not violate any constraint Ci C for all of i. An assignment is complete when all variables have a value assigned (not necessarily satisfying all constraints). A solution to a CSP occurs when an assignment is both consistent and complete. A general way to solve a CSP is to incrementally make assignments to variables in a consistent way until a complete assignment is achieved. There are many methods available to perform this operation; however, because the CSP is NP-hard, most algorithms have a time complexity that is exponential. There are however, many algorithms that can achieve lower running times when based on particular problem specific features. The CSP can be used to formulate many different existing problems. We show how to formulate the clique finding problem as a CSP. Definition 7 (Motif finding as a binary CSP): Given a set of sequences S for an instance of the motif finding problem, we can define an instance of a CSP whose variables X correspond to S. Each var domain v constrain the two v A variable a that the H less than winnowe 3.1 Con Our grap of length input, a g T correspon index j in from inde F would be iable xi X alue will co t requiring t ariables, mu binary CSP nd each arc amming di 2d. This co r. structing h constructi n), a param raph G=(V,E he algorithm d to substri sequence s x j to (n-1) or example split into th will have a rrespond to hat the two st have a H can be repr represents c stance betwe nstraint grap the Const on algorithm eter l (length ) will be co iterates thr ngs of each i we constru + l; for a tot given any ar e vertices co set of possib each l-lengt substrings c amming dist esented as a onstraint be en the two h is constru raint Gra takes in as of signal) a nstructed in ough each s sequence, th ct a vertex v al of t((n-1) bitrary sequ rresponding le domains h substring. orrespondin ance less th constraint g tween two v l-length sub cted in muc ph: input a set o nd a numbe the followi equence in t at is, for ev V that wi + l) vertice ence si the f to substrin D = {D1, D2 For each pa g to the two an or equal raph where ariables. T strings (relat h the same w f t sequence r d (number ng way: he set S and ery sequenc ll represent s. igure below gs. , …, D(n-1)-l} ir of variab domain valu to 2d. each node r he constrain ive to the no ay as the g s S = {s1, s2 of mismatc constructs v e si of the se the substring illustrates h where each les, there is es, assigne epresents a t for an arc des) must b raph in the , … , st} (ea hes). Given ertices that t S and for e sij of lengt ow the sequ 14 a d to is e ch this ach h l ence 15 After running this procedure across all t sequences we would achieve a multi-partite graph similar to figure below: After this vertex constructing process the algorithm iterates through each vertex vi in each jth partite Sj and compares vi against every other vertex in each remaining partite. An edge between two vertices will correspond to vertices that have a hamming distance less than 2d, that is, we connect vertices vij and vab by an edge if and only if DH (vij, vab) ≤ 2d where vij and vab are from separate sequences. Constructing a graph in this way will encapsulate the signal in a clique; if we select any arbitrary edge between any two vertices, corresponding to a signal, that edge will have a maximum hamming distance of at most 2d [13]. However constructing the graph in this manner also gives rise to many spurious edges that are not part of the clique we want to find. In order to prune these edges we use constraint propagation. The main idea of constraint propagation is to repeatedly reduce the domain of each variable in order to be consistent with its arcs. Reducing the domains in this manner is what makes constraint propagation an effective pruning mechanism. Applying constraint propagation is what is known as achieving local consistency. The requirement of local consistency conditions are to take a CSP problem P and map that problem into another problem P’ without altering the original problem’s solutions. This mapping is referred to as constraint propagation. The main idea here is to create a problem that is less complex by reducing the domains of variables by applying constraints or adding new ones. There are several such local consistency conditions that exist; the ones we will concern with here are arc consistency, path consistency and the general case k-consistency. 16 3.2 Arc Consistency Arc consistency is one of the simplest forms of local consistency and is a process equivalent to the Winnower for k=1. To achieve arc consistency we apply an algorithm that will remove all unsupported values from the domains of individual variables to return a reduced CSP that is easier to solve (or find the problem to have no solution). Generally we say a CSP P is arc consistent if each of a variable’s admissible values in P is consistent with every other variable’s admissible values in P. Formally: Definition 7 (Formal Arc Consistency): Given a set of variables X = {x1, x2, …, xn} we say variable xi X is arc-consistent with another variable xj X (where xi ≠ xj) if for every value ai in the domain xi (there exists) a value bj in the domain of xj such that (ai, bj) satisfies each constraint between xi and xj. We will call such a variable bj a support of ai. Graphically we can illustrate how arc consistency works with the following binary network: We see S1 is arc consistent with S2 & S3. S3 is arc consistent with S1 & S2. S2 is arc consistent with S3 however not with S1. This is because the assignment S2 = 1 does not correspond to any value for S1 (no supporting variable for S2=1). We can therefore remove 1 from the domain of S2. Now S3 is no longer arc consistent with S2, because S3 = 2 does not correspond to any value for S2. We therefore remove 2 from the domain of S3. We continue to remove values until we end up with a new problem which is arc consistent: W construct partites, w partite of have the edge. An removed proposed 3.2.1 Ar There are Mackwor and O(er to be opt and a spa Cordier [ optimal r GAC3 (O T (such as t each part and, for e e can use th ed by our gr e simply en G. That is, property suc y vertex tha along with a by Bessiere c Consist several wa th [2, 10] is ) space (whe imal is the A ce complex 3] that comp un time com (ed)). he main ide he GAC3 d ite. Upon in ach vertex v is idea of ar aph constru sure that fo any given v h that there t does not h ny edges th and Cordie ency Imp ys to implem a non optim re r is the g C4 propose ity of O(ed2) romises bet plexity of th a of the AC6 oes), but rat itialization AC6 attem c consistenc ction algorit r every vert ertex v that r exists a sup ave this pro at are conne r [3]. lementati ent arc con al algorithm reatest rank d by Mohr a [2, 11, 12] ween the G e AC4 (O(e is not to co her to ensur AC6 increm pts to find a y to prune a hm) G=(V,E ex v in G the emains afte porting vert perty is not cted to it. T on (AC6) sistency. Th that achiev among cons nd Henders . AC6 is an AC3 and the d2)) with a unt how ma e each vertex entally itera supporting way spurio ) which is a re exists a s r arc consist ex v in each arc consisten he algorithm e GAC3 alg es arc consi traints) [2]. on, with a ti algorithm p AC4 insofa lower space ny supportin has at leas tes through variable u in us edges. G multi-parti upporting v ency is perf partite - me t and theref we use in orithm prop stency in O An algorith me complex roposed by B r that it mai complexity g vertices e t one suppor each vertex each partit iven a graph te graph wit ariable u in ormed on G aning (u, v) ore will be CSC is AC6 osed by (er3dr+1) tim m that is sh ity of O(ed2 essiere and ntains the closer to th ach vertex h ting vertex in the graph e for vertex 17 (as h t each , will is an as e own ) at of as in v. If 18 such a vertex u is found in each partite, then v will remain, otherwise it will be deleted. Briefly the AC6 algorithm (Appendix A) works as follows: The AC6 maintains a list S where S[xj, vj] stores all supporting values for which vj is the supporting variable for xj. The algorithm checks that there is a support vj for each variable xj in graph G. If there is no such variable vj, then xj will be removed and stored in list Q for future propagation. The propagation loop in AC6 checks the consequences of removing the variables that are stored in Q. When a vertex ai is chosen from Q, AC6 searches list S to check if ai was a support for some other vertex xj. It does so by checking list S for S[xj, aj]. If it was a supporting variable then a new support must be found for xj, if none is found xj will be removed and in turn be stored in list Q. If there is a support then this new support will replace aj and stored in list S. The algorithm continues in this manner until the list Q is empty at which point every remaining vertex in graph G will have supporting variables in each partite of graph G, and thus arc consistent. Arc consistency is an effective pruning measure for simple instances of the planted motif problem; however, a stronger consistency is required for more challenging instances [13]. 3.3 Path Consistency Path consistency works much like arc consistency, but considers pairs of vertices rather than a single vertex which is essentially the same pruning mechanism used in the winnower for k=2. We define path consistency formally as: Definition 8 (Formal Path Consistency): Given a set of variables X = {x1, x2, …, xn} we say pair (xi, xj) X are path consistent with another variable xk X (where xi ≠ xj) if for every pair of values (a, b) that satisfies the constraint between (xi, xj) (there exists) a value c in the domain of xk such that (a, c) and (b, c) satisfies each constraint between (xi, xk) and (xj, xk). We will call such a variable xk a supporting variable of pair (xi, xj). Graphically the figure below show’s an example of enforced path consistency: 19 The figure on the left is an example of a graph that is arc consistent but not path consistent. The dotted edges in the figure on the right indicate edges that would be removed if the graph was to be made path consistent. We use this idea of path consistency to prune away spurious edges. The same algorithm (AC6) is used with a slight modification. Instead of search for supporting variables for a single vertex xi we will search for supporting variables for edges (xi, xj). That is, for every edge (xi, xj) we need to search each partite for a support vj such that (xi, xj, vj) forms a triangle. Path consistency is a more powerful pruning mechanism than arc consistency; however, it is still insufficient for difficult instances of the planted motif problem [13]. One way to increase the pruning power is to increase to 4-consistency (k-consistency) which is equivalent to the winnower method for k=3. 3.4 k-consistency K-consistency generalizes the notion of arc consistency and path consistency to sets of variables: Definition 8 (k-consistency): Given a set of variables X = {x1, x2, …, xn} we say a sub set of variables X’ X of size k-1 are k-consistent if for every set of values A = (a1, a2, … , at) that satisfies the constraint between all variables in X’ there exists a value c in a variable uk such that no constraint is violated for (a1, a2, … , at, c). Such a variable uk is a supporting variable of X’. Given this formal definition, arc consistency is equivalent to 2-consistency and path consistency is equivalent to 3-consistency. For the winnower, Pevzner et al. show that k=3 is 20 sufficient as an effective pruning mechanism, which would be equivalent to 4-consistency for our CSP. The tradeoff for this high degree of accuracy is a rather high time complexity of O(N4). We will now look at a new approach which combines path consistency along with the notion of closest string for a time complexity of O(N3 + dd). 3.5 Content Sensitive Analysis Instead of increasing to 4-consistency we add a new constraint to our path consistency. Not only do we enforce path consistency but for every pair of nodes that have support we also ensure there is a center string among these three nodes we are analyzing. The figure below shows a graph that is path consistent on the left. The figure on the right shows that a center string c must also exist if the edges in the graph on the left are to remain. How the algorithm works is described below. 3.5.1 Content Sensitive Analysis Implementation (PC6+CS) The PC6+CS maintains an array of lists S where S[vj] stores all a list for which vj is the supporting variable for all pairs of nodes (xi, xj). The algorithm checks that there is a support vj for each edge (xi, xj) in graph G. If there is no such variable vj, then edge (xi, xj) will be removed and stored in list Q for future propagation. If there is a support vj found the algorithm then uses the CS algorithm to find if there exists a center string c for nodes (xi, xj, vj). If such a string exists (xi, xj) is stored in list S under otherwise (xi, xj) is removed from G and (xi, xj) is stored in list Q for future propagation. 21 The propagation loop in PC6+CS checks the consequences of removing the variable pairs that are stored in Q. When a pair (xi, xj) is chosen from Q, PC6+CS searches list S to check if (xi, xj) was a supporting edge for some other pair of vertices (xi, a) or (xi, a). It does so by checking list S[xi] for (xj, a), and S[xj] for pair (xi, a). If it was a supporting edge then a new support must be found for both (xj, a) and (xi, a), if none is found (xj, a) and (xi, a), will be removed and in turn be stored in list Q. If there is a support vj then, as before, a center string c will be sought after. If such a string c is found then vj will replace aj and stored in list S. The algorithm continues in this manner until the list Q is empty at which point every remaining vertex in graph G will have supporting variables in each partite of graph G, and that satisfies our constraints. 22 4 Experimentation and Results. In this section we will describe how we generated samples as well as describe our results of running AC6, PC6 and PC6+CS on this these samples. All analysis was based on implementations in Java. 4.1 Sample Generation To test the effectiveness of our algorithm it is important to be able to generate random samples. Our sample generation algorithm works as follows: The algorithm takes in as input numbers t, n, l and d. Given these numbers the algorithm will generate set of sequences S = {s1, s2, …, st} where each si is of length n and each position in si is a nucleotide selected at random from alphabet ‘A’, ‘C’, ‘T’, or ‘G’. A string s of length l will be randomly generated using this same alphabet. A set of strings M = {m1, m2, …, mt} are generated where each mi is a d-variant of s (i.e: DH(s, mi) = d for all i). The algorithm then randomly selects a position j in each sequence si and plants string mi at this position replacing all nucleotides in sequences si from j to j+l. 4.2 Results We test the performance and limitations of AC6, PC6 and PC6+CS on several instances of the Planted (l, d)-Motif Problem. First we will represent our results based on the (3, 15) problem. We run tests for each algorithm based on parameters t=20, l=15 and d=3. In instances of the (3,15) problem all three algorithms pruned away sufficient edges to expose the signal we are searching for. The chart below is based on the average results obtained from running each algorithm independently on 50 different samples each. The ‘-‘ indicates a time that is greater than 1 hour. The grap locating t F Though t running t PC6+CS A vg Ti m e (s ) h below illu he signal. or this (3, 15 he AC6 has ime in pract the conditio 0.01 0.1 1 10 100 1000 10000 50 A vg . T im e (s ) Se strates avera ) problem t the lowest t ice. An exp n for edge r 100 150 AC quence Leng n 50 100 150 200 250 300 350 400 450 500 550 600 ge time (in he graph illu ime comple lanation for emoval is m 200 250 300 6, PC6 th A 1 1 10 31 72 13 219 seconds) as strates an in xity, of all th this interest uch stricter, 350 400 4n & PC6+C T C6 .805 9.73 1.601 7.01 0.453 20.43 0.797 ‐ ‐ ‐ ‐ ‐ a function o teresting ph ree algorith ing phenom therefore ed 50 500 550 S vs. n ime (s) PC6 0.063 0.179 0.369 0.666 1.045 1.552 2.095 2.76 3.621 4.64 5.792 7.058 f sequence l enomenon t ms, it has th enon is that ges are rem 600 PC6 PC6 AC6 PC6+CS 0.331 0.482 0.687 0.989 1.393 1.902 2.569 3.431 4.168 5.188 6.398 7.848 ength n, for hat occurs. e largest for PC6 and oved more (with CS) (no CS) 23 24 quickly resulting in less iterations and thus a shorter running. A similar observation was also made by Pevzner et al. [8]. AC6 tends to become impractical for n > 350 in comparison to PC6 and PC6+CS. The PC6 and PC6+CS find the signal under 10 seconds for relatively long sequence length (n=600). For this instance they both have roughly the same running time. Secondly we run tests for the (4, 15) problem. For this problem each algorithm is executed based on parameters t=20, l=15 and d=4. We first compare the AC6 with the PC6+CS. We randomly generated 50 samples for every value of n; for each sample we run the AC6 and then pass this arc consistent graph to the PC6+CS to assess how many edges were not pruned by the AC6. The chart below is based on the average results obtained from running both algorithms sequentially. We see that AC6 is accurate in finding the clique we are searching for n =50 however AC6 is insufficient for n > 50, and will be non effective for removing any edges at all for n ≥ 400. Sequence Length TIME (s) EDGES n AC6 PC6 + CS Initial Edges AC6 prune PC6 + CS Prune 50 1.512 0.401 31520 31140 0 100 9.612 0.597 163346 151251 11709 150 2.013 4.441 401516 6753 394374 200 0.81 9.523 748116 996 746725 250 0.736 23.255 1202603 132 1202066 300 1.164 63.515 1763251 23 1762820 350 1.639 190.054 2433053 12 2432635 400 2.515 834.385 3211088 0 3210672 We then executed another 50 tests on the (4, 15) problem comparing accuracy and efficiency between PC6 and PC6+CS. First we run PC6 on each sample, once PC6 is finish the resulting path consistent graph is sent to PC6+CS to examine how many edges PC6+CS will prune from an already path consistent sample. Interestingly the PC6+CS is slightly more accurate and becomes increasingly accurate as n increases, removing a trivial number of edges after the graph has been made path consistent. The running time for of the PC6 become impractical (over 4 hours) for n > 300. Finally w implemen The grap function mechanis Sequence Length n 50 100 150 200 250 300 e run a sepa tation of th h below com of n and illu m than the P 1 10 100 A vg . T im e (s ) Se TIM PC6 0.432 1.604 3.666 9.421 46.869 941.822 rate test on e PC6+CS, pares the ru strates that i C6. 0.1 1 10 00 00 00 50 10 quence Leng n 50 100 150 200 250 300 350 400 450 E (s) PC6 + CS 0.393 0.451 0.503 0.593 0.725 0.852 the (4, 15) p the running nning times n practice, P 0 150 200 Tim th TIME ( PC6 + C 0.898 2.209 4.562 9.497 22.534 608.37 188.15 834.38 3754.39 Initial Edges 31353 163793 402334 749111 1202661 1764138 roblem usin time becom of the both C6+CS is a 250 300 3 n e (s) vs s) S Initial Ed 3157 16344 40226 74884 12018 17645 6 24330 5 32110 6 40929 EDGES PC6 prune 30970 163403 401941 748711 1202250 1763728 g only the P es impractic PC6 and PC slightly mo 50 400 450 n EDGES ges PC6 + 0 3 4 1 5 4 5 74 33 12 60 17 53 24 88 32 68 40 PC6 + CS 1 1 2 4 8 7 C6+CS. Fo al for n> 45 6+CS with re efficient PC6 PC6 CS Prune 1188 63057 01872 8450 01438 64160 32647 10672 92541 Prune r our 0. time as a pruning +CS 25 26 5 Discussion and Conclusion The problem of motif finding plays a very important role in understanding gene regulatory mechanisms. An area of bioinformatics is concerned with modeling this problem in a way that will allow computational algorithms to lend aid in locating these motifs. One such model includes the Planted (l, d)-Motif Problem. In this thesis we have analyzed the efficiency and accuracy of several algorithms designed to solve this particular Planted (l, d)-Motif Problem, including: the Basic Voting Algorithm, the Winnower and the Closest String. We have presented a new formulation that composes the Planted (l, d)-Motif Problem as a CSP as well as introduced a new form of center string constraint consistency. Experimental results on a Java implementation, based on this formulation, locate signals accurately and efficiently for simulated data. However, using this implementation, we were unable to conclude the relative strength of this center string consistency when comparing it to the notion of Hamming distance. The impractical running times for this implementation did not allow us to test for large values of n where the PC6 is known to fail. This may be due to either a non optimal implementation or the relatively slow performance of Java. Future work may include an implementation in C so that we could achieve a practical running time with a large enough value of n such as n≥700 for the (4, 15) where the PC6 is known to fail [8]. Running the PC6+CS for these values of n is required to analyze the relative pruning strength of this new notion of consistency. 27 Appendex A Algorithm 1: AC6 [2, pg. 20] X ← 2D constraint graph Q ← ø S[xj , vj] = 0, vj D(xj ), xj X foreach xi X, cij C, vi D(xi) do vj ← smallest value in D(xj) s.t. (vi, vj) cij if vj exists then add (xi, vi) to S[xj , vj] else remove vi from D(xi) and add (xi, vi) to Q if D(xi) = ø then return false while Q ≠; do select and remove (xj , vj) from Q; foreach (xi, vi) S[xj, vj ] do if vi D(xi) then v’j ← smallest value in D(xj) greater than vj s.t. (vi, vj) cij if v’j exists then add (xi, vi) to S[xj, v’j] else remove vi from D(xi); add (xi, vi) to Q; if D(xi) = ø then return false return true Algorithm 2: Closest String [4, pg. 6] recursive procedure: CSd(s, ∆d) Global variables: Set of strings S = {s1, s2, . . . , sk}, integer d. Input: Candidate string s and integer ∆d. Output: A strings with maxi=1,...,kdH(s, si) ≤ d and dH(s, s) ≤ ∆d if it exists, and “not found,” otherwise. if (∆d < 0), then return “not found”; if (dH(s, si) > d + ∆d) for some i {1, . . . , k} then return “not found”; if (dH(s, si) ≤ d) for all i = 1, . . . , k then return s; Choose some i {1, . . . , k} such that dH(s, si) > d: P := { p | s[p] si[p] }; Choose any P’ P with |P’| = d + 1; for all p P do s’:= s; s’[p] := si[p]; sret:= CSd(s , ∆d − 1); If sret “not found” then return sret return “not found” 28 Algorithm 3: PC6+CS // initialization C ← a 2D constraint graph Q ← ø // Q list stores all deleted pairs of variables S[n] ← ø // where s[i] stores all pairs of variables supported by i. D ← all pairs of variables that have not yet violated any constraint // begin foreach pair (xai, xbj) do // where a and b indicate partites which x is a member foreach partite c where c ≠ a and c ≠ b do vc == checkForSupport(xai, xbj, c)) //return ‐1 if no support found if (vc ≠ ‐1) and (centerString(xai, xbj, vc)) then S[Vc] ← (xai, xbj) else C[xai][xbj] = 0 Q ← (xai, xbj) remove (xai, xbj) from D if D empty return false // no feasible solution // propagation while Q ≠ ø do select and remove (xai, xbj) from Q S’ ← each triple (yai, ybi, vc) of variables for which edge (xai, xbj) was a support foreach (yai, ybj) S’ do if (yai, ybj) D then vc == checkForSupport(yai, ybj, vc)) //starting at index vc in c if (vc ≠ ‐1) and (centerString(yai, ybj, vc)) then S[Vc] ← (yai, ybj) else C[xai][xbj] = 0 Q ← (yai, ybj) remove (yai, ybj) from D if D empty return false // no feasible solution return true 29 Algorithm 4: Basic Voting Algorithm [8, pg. 4] Create two hash tables V and R and set the value of each entry to be 0 {Table V keeps the number of votes received by each length-l sequence s. Each length-l sequence, received t votes is a candidate for motif. Hash table R ensures that each length-l sequence receives at most one vote from each input sequence. Si{j} is the j-th character in the i- th input sequence Si and H(s) is the hash value of a length-l sequence s.} C ← ø for i ← 1 to t do for j ← 1 to n – l + 1 do for each length‐l sequence is N(Si[j … j + l – 1],d) do if R[H(s)] <> i then V[H(s)] ← V[H(s)] + 1 R[H(s)] ← i for j ← to n – l + 1 do for each length‐l sequence s in N(Si[j … j + l – 1],d) do if V{H(s)]= t then insert s into C 30 References [1] S. Balla, J. Davila and S. Rajasekaran. On the Challenging Instances of the Planted Motif Problem. http://www.engr.uconn.edu/becat/reports/BECAT-CSE-TR-07-2.pdf [2] C. Bessiere. Constraint Propagation. Handbook of Constraint Programming. Ch. 3, Pg. 29 – 85 ISBN 0-444-52726-5 [3] C. Bessiere, M.O. Cordier. Arc-consistency and arc-consistency again. In Proceedings AAAI’93, pages 108–113, Washington D.C., 1993. [4] J. Gramm R. Niedermeier. Fixed-Paramter Algorithms for Closest String and Related Problems. Algorithmica [0178-4617] Gramm yr:2003 vol:37 iss:1 pg:25 [5] J. Hu, B Li, D. Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33(15):4899-4913. [6] R. M. Karp. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, Proc. Sympos. IBM Thomas J. Watson Res. Center, Yorktown Heights, N.Y.. New York: Plenum, p.85-103. 1972. [7] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald and J. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 1993. [8] H. C.M. Leung, F.Y.L. Chin. Algorithms for challenging motif problems. Journal of bioinformatics and computational biology [0219-7200] Leung yr: 2006 vol: 4 iss: 1 pg: 43 -58 [9] M. Li, B. Ma, L. Wang. 2002. On the Closest String and Substring Problems. Journal of the ACM, Vol. 49, No. 2, March 2002, pp. 157-171. 31 [10] A.K. Mackworth. Consistency in networks of relations. Technical Report 75-3, Dept. of Computer Science, Univ. of B.C. Vancouver, 1975. (also in Artificial Intelligence 8, 99-118, 1977). [11] R. Mohr and T.C. Henderson. Arc and path consistency revisited. Arti- ficial Intelligence, 28:225–233, 1986. [12] R. Mohr and G. Masini. Good old discrete relaxation. In ProceedingsECAI’88, pages 651– 656, Munchen, FRG, 1988. [13] P.A. Pevzner, S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 269–278, AAAI Press, 2000. [14] M. Vingron, and P. Argos, 1991. Motif recognitionand alignment for many sequences by comparison of dot-matrices. Journal of Molecular Biology 218:33-43. [15] M.Vingron, and P. Pevzner, 1995. Multiple sequence comparison and consistency on multipartite graphs. Advances in Applied Mathematics 16:1-22. [16] M.S. Waterman, R. Arratia and D.J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515-527. 1984.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Undergraduate Research /
- Computer Modeling of Molecular Genetic Events using...
Open Collections
UBC Undergraduate Research
Computer Modeling of Molecular Genetic Events using Content Sensitive Analysis Stodgell, Curtis 2007
pdf
Page Metadata
Item Metadata
Title | Computer Modeling of Molecular Genetic Events using Content Sensitive Analysis |
Creator |
Stodgell, Curtis |
Date Issued | 2007 |
Description | The problem of motif finding plays an important role in understanding the development, function and evolution of organisms. The Planted (l, d)-Motif Problem first introduced by Pevzner and Sze [13] is a variant model of motif finding that has been widely studied. Nevertheless, despite the many different algorithms constructed to solve this problem, it is far from being solved entirely [5]. We analyze a number of algorithms including: the Basic Voting Algorithm, the Winnower and the Closest String. One thing that has become ubiquitous among these algorithms is the tradeoff between efficiency and accuracy. We formulate the motif-finding problem as a constraint satisfaction problem and introduce a special form of constraint consistency. By using a fixed-parameter algorithm for the closest string problem [4], we develop a propagation algorithm that enforces the new constraint consistency with the same worst-case time complexity as that of the standard path consistency algorithm. Experiments on randomly generated sequences indicate that our approach is effective and efficient. |
Type |
Text |
Language | eng |
Series | University of British Columbia. COSC 449 |
Date Available | 2010-07-30 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0052231 |
URI | http://hdl.handle.net/2429/27050 |
Affiliation |
Irving K. Barber School of Arts and Sciences (Okanagan) Computer Science, Mathematics, Physics and Statistics, Department of (Okanagan) |
Peer Review Status | Unreviewed |
Scholarly Level | Undergraduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52966-Thesis Curtis Stodgell.pdf [ 647.54kB ]
- Metadata
- JSON: 52966-1.0052231.json
- JSON-LD: 52966-1.0052231-ld.json
- RDF/XML (Pretty): 52966-1.0052231-rdf.xml
- RDF/JSON: 52966-1.0052231-rdf.json
- Turtle: 52966-1.0052231-turtle.txt
- N-Triples: 52966-1.0052231-rdf-ntriples.txt
- Original Record: 52966-1.0052231-source.json
- Full Text
- 52966-1.0052231-fulltext.txt
- Citation
- 52966-1.0052231.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52966.1-0052231/manifest