CoRETH mputerA THESIQUIREMEE UNIVE ModeliConS SUBMITNTS FORIrving K.RSITY OFng of Mtent SeCurtTED IN P THE DESC Barber Sc BRITISHMa©Curtis oleculansitive by is Stodgel ARTIALGREE OFIENCE in hool of Ar COLUMrch 2007Stodgell, r GenetAnalysil FULLFIL HONOUts and Sci BIA OKA2007 ic Evens MENT OFRS IN COences NAGAN ts using THE MPUTERCAMPUS 1 Abstract: The problem of motif finding plays an important role in understanding the development, function and evolution of organisms. The Planted (l, d)-Motif Problem first introduced by Pevzner and Sze [13] is a variant model of motif finding that has been widely studied. Nevertheless, despite the many different algorithms constructed to solve this problem, it is far from being solved entirely [5]. We analyze a number of algorithms including: the Basic Voting Algorithm, the Winnower and the Closest String. One thing that has become ubiquitous among these algorithms is the tradeoff between efficiency and accuracy. We formulate the motif-finding problem as a constraint satisfaction problem and introduce a special form of constraint consistency. By using a fixed-parameter algorithm for the closest string problem [4], we develop a propagation algorithm that enforces the new constraint consistency with the same worst-case time complexity as that of the standard path consistency algorithm. Experiments on randomly generated sequences indicate that our approach is effective and efficient. 2 Table of Contents Abstract .......................................................................................................................................1 Table of Contents .......................................................................................................................2 1. Introduction ...........................................................................................................................3 2. Planted Motif Problem ...........................................................................................................5 1. Definitions and Notation ..................................................................................................5 2. Winnowing Approach: Locating Signals via Winnowing ...............................................6 3. A Fixed Parameter Algorithm for the Closest String .......................................................9 4. Basic Voting Algorithm .................................................................................................11 3. Overview of Content Sensitive Consistency .......................................................................13 1. Constructing the Constraint Graph ................................................................................14 2. Arc Consistency .............................................................................................................16 1. Arc Consistency Implementation (AC6) ...........................................................17 3. Path Consistency ............................................................................................................18 4. k-consistency..................................................................................................................19 5. Content Sensitive Analysis (CSC) .................................................................................20 1. CSC Implementation (PC6+CS) ........................................................................20 4. Experimentation and Results ..............................................................................................22 1. Generation of Samples ...................................................................................................22 2. Results ............................................................................................................................22 5. Discussion and Conclusion ..................................................................................................26 Appendix A ...............................................................................................................................27 References .................................................................................................................................30 3 1 Introduction: Transcription factors are proteins that bind to DNA in order to regulate the expression of genes through the activation or inhibition of transcription mechanisms. Locating these sites is an integral part of understanding the regulatory process. Computational tools are an invaluable method used to locate these particular sites and this general problem has been termed motif finding. Essentially the motif finding problem is the problem of finding common patterns among a set of sequences. A simple model for the problem is as follows: Given a sample of random sequences can we find a common unknown pattern (motif) that is hidden at random locations in the sequences? If the motif was not subject to mutations it would be relatively easy to locate using a basic brute force algorithm; however, in biology the sequences being analyzed are subject to mutations and therefore the motif cannot be assumed to be exact. As such, it makes sense to construct a model that will allow the patterns we are searching for to have an arbitrary number of mutations [13]. Buher and Tompa [8, 7] found limitations for this particular problem. They found that when the number of sequences, the length of sequences and the length of the pattern are fixed, if the number of mutations in the pattern is larger than a particular threshold then it is highly improbable that any algorithm would be able to find such a pattern due to an abundance of random patterns [8, 7]. However, there have been many algorithms suggested for instances of this problem that do not exceed this threshold [8]. In this thesis we analyze a number of different algorithms that attempt to solve the problem in several different ways. These include A voting algorithm by Leung and Chin [8] that locates common motifs by enumerating all neighbors of each l length substrings in a given set of sequences, as well as a combinatorial approach by Pevzner and Sze [13] that essentially reduces the motif finding problem to locating cliques in multi-partite graphs. Closely related to the planted-motif problem are the closest string problem and the closest substring problem. Both of these problems are NP-hard, but the closest string problem can be solved in linear time assuming that the number of mutations is a fixed constant [9]. In the Winnower Pevzner and Sze designed an algorithm that prunes spurious similarities, which enables us to easily locate the signal we are searching for. They did so with an algorithm that has a time complexity of O(Nk+1) and found that the algorithm was accurate for 4 k=3. Our contribution includes a variation of the Winnower combined with the closest string. We have designed an algorithm that essentially runs as the Winnower for k=2 along with the closest string algorithm to construct an algorithm with a time complexity of O(N3 + dd) where d is the number of mutations and constant. This thesis is organized as follows: In chapter 2, we will discuss numerous algorithms that approach The Planted (l, d)-Motif Problem in various ways, each algorithm achieving a different level of both accuracy and efficiency. In chapter 3, we present our approach to the problem, including the CSP formulation, the propagation algorithm for the stronger notion of path consistency and its implementation. In chapter 4 we present our results and allegorize the pruning power and efficiency of our approach. In chapter 5, we will conclude our results and analyses, as well as propose future challenges. 2 PlanIn this chvariety o 2.1 DefiIn order tdefinitiondefine noTpartially Definitiomathemastrings dithey diffe Ina string s Definitioproblem makes thspecific nted Motiapter we intf existing algnitions ano understan based on atation we whe methods by comparinn 1 (Hammtics termed ffer. For exr at indices this paper t and s’. n 2 (Planteof motif finde problem dumber (refef Problemroduce formorithms desd Notatiod how the m variant modill use throuwe will be ug the Hamming Distancinformation ample given1 & 3; therehe notation d (l, d)-Moting deals wifficult is thrred to as m ally the Plaigned to soln otif findingel of the proghout this thsing to solving distance): Hammitheory. It is two stringsfore, the haDH(s, s’) wiif Problem)ith locating e fact that thutations). Fnted (l, d)-Mve this prob problem canblem preseesis. e this partice between twng distance defined sim “ATAGTCmming dista ll be used to: A motif icommon pais pattern isormally weotif Problemlem. be solved nted by Pevular problemo strings. is a simple cply as the nA” and “TTnce between denote the s a single ortterns amon not exact an can describ [13] as wewe will givezner and Sze will usualloncept in anumber of inTGTCA” w these two shamming di repeated pag a set of sed is allowede the problell as discuss a formal [13], as wey be execute area of dices that twe can see thtrings is 2stance betwttern. The quences. W to differ bym as follow5 a ll as d o at een hat a s: 6 Assume there exists a motif m of length l; given t number sequences each of length n, the problem is to find a planted motif m’ of length l in each of the t sequences such that m’ differs in at most d positions from m. A more general problem is the closest substring problem which has been shown to be NP-hard [9]. The difference between the planted motif problem and the closest string problem is that for the planted motif problem, it is guaranteed that there exists at least one common substring that occurs in each of the sequences. Whereas for the closest string problem, it is not guaranteed that there will be such a common substring among each of the sequences. The parameters l and d in the Planted (l, d)-Motif Problem have a significant impact on the practical difficulty of the problem. Some particularly difficult Planted (l, d)-Motif problem instances proven to be probabilistically challenging are (9,2), (11, 3) and (15, 4) [1]. These problems are considered challenging because given any two instances of a mutated (l, d)-signal there is a high probability that non-signals (spurious similarities) will be randomly generated such that they very closely resemble the proper signal. Many techniques have been designed to solve these challenging problems which will be looked at below. 2.2 Winnowing Approach: Locating Signals via Winnowing. Pavel A. Pevzner and Sing-Hoi Sze [4] take a combinatorial approach to the motif finding problem by reducing it to a clique finding problem. Pevzner et al. recognize that in each sequence there are spurious signals that occur at random that tend to hide the real signal. Most algorithms tend to focus directly on the process of locating the real signal; however, the winnower takes an opposite approach. It focuses on the spurious signals (rather than the signal itself) and removes these spurious similarities incrementally until the signal we are looking for is no longer difficult to find. This elimination process is, however, non-trivial and time consuming. It is important to note that the Winnower algorithm only considers signals with Hamming distance as criterion for an edge and not insertions or deletions (signals will all be the same length). This is done for simplicity; however the algorithm could easily be modified to account for these constraints [8]. Tin a multparameteG=(V, E)Vthe set S representvertices. Tto substriAby an edgif F contaalso be thsequenceTbeen recegeneratedproblem Tremove ethe winnothat a verWe definhe Winnowei-partite grapr l (length o in the folloertices will = {s1, s2, … the substrin here is an edngs from se clique in a e. Given anins a cliquee case that ts (partites) fhe clique finnt interest i graph [6]. in that the uhe Winnowedges that arwer removtex u is a nee a clique tor is set up th. Given af the signal)wing way: correspond ,st} and for g sij of lengge betweenparate sequegraph G = (y graph F = of at least shis clique corming an eding problen studying th The plantednderlying grr algorithme proven to es edges Pevighbor of a be extendao reduce the set of t sequ and a numbto substringseach index jth l from ind vi and vj if ances. V,E) is a sub (V, E), conize k, can beontains the oxtendable cm is a graphe problem motif-findiaph is a mu uses the prinot be part ozner and Szclique C = {ble if it has Planted (l,dences S = {er d (numbe of each seq in sequenceex j to (n-1)nd only if Dset C such tstructed as d asked. Onriginal signlique can be-theoreticalof finding hing problem lti-partite grnciple of exf the clique e use the nov1,…,vk} if at least one n)-Motif probs1, s2, … , str of mismatuence, that si we const + l; thus thH (vi, vj) ≤ 2hat any twoescribed byce such a clial being sou seen in the NP-compledden cliqueis a special aph. ploring theswe are seartation of an{v1, … , vk, ueighbor in lem to find} each of lenches), constis, for everyruct a vertexere is a totald and vi an vertices in C the Winnowque has beeght after. Afigure below te problem [s in an othercase of the he particular ching for. T extendable } is a cliqueevery part oing large cligth n, a ruct a graph sequence si v V that of t((n-1) +d vj correspo are connecer, the quen found, it wn example o: 6]. There hwise randomidden cliqugraphs and wo describe hclique. “We in the grapf the multip7 ques of will l) nd ted stion ill f 4 as ly e ill ow say h. artite graph G. k. One wedges are3]. The eincreasesTPevzner edges thaobservatienough sTand a numvertex u wis an edgmultipartS4); howe Fwill passis no suchbe delete We define ay to impos deleted aftessential idea; in doing sohis is also th[13, 15] for t are not parons indicatepurious edghe winnoweber k = 1, ill pass thee in each si)ite graph wiver, vertex or k = 2 the the test if th vertex vi, ed; however,an edge to be increasingr winnowin of the winn spurious ee approach k = 2. Howt of an exten that only wes in such a r removes ethe algorith test if there, if vertex u th four partiv12 will be dalgorithm were is a vertdge (u, w) w edge (v12, ve spurious ily strict cong and only eower is to adges are effethat was takever it was odable cliquhen k = 3 doway that thedges in the fm checks ev exists at leafails this teste sets, verteeleted (v12 hill now checex vi for eacill be delet21) will be df it does notditions as kxtendable cllow only exctively remen by Vingrbserved by e is too weaes the meth clique we’rollowing waery vertex ist one neight, it will be dx v11 will nas an edge ik each edgeh partite si sed. In the illeleted. belong to aincreases isliques rematendable clioved. on & AgrosPevzner & Sk for pruninod become e searchingy: Given a n G and appbor vi in eaeleted. In tot be deletedn S2 and S3 b (u, w) in thuch that {u,ustration belny extendab to ensure thin in the finaques of size[13, 14] andze that wheg spurious eeffective at r becomes evmultipartite lies the folloch partite sehe illustratio (v11 has as ut no edge e graph G. vi, w} is a tow, edge (vle clique of at all spuriol graph.” [3 k to remain Vingron &n k ≤ 2, filtedges [13]. Temoving ident. graph G = (wing test. At of G (ei: (un below, foedge in S2, Sin S4). The edge (uriangle. If th11, v22) will n8 size us , pg. as k ring heir V,E) , vi) r a 3, , w) ere ot Scontinuessimilar tevertex vi x) & (u, xWlocating eWinnowepower ofcomplexithe Plantmeaningfresult it b 2.3 A FThe Closbiology, given the o, as k incre in this fashst for the otin each part) will be reminnower is xtendable cer takes O(N Winnower ty of the Wied-(l, d) Moul. Howeveecomes verixed-Paraest String prcoding theor following gases the conion for k = 3her values oite set si if {oved froman iterative liques of siz2) time, wheincreases as nnower is Otif Problem r the Winnoy slow for lameter Algoblem is any and conseeneral definditions for e; the algoritf k. This triau, vi, w, x} i the graph. algorithm the k become re N is the l k increases,(Nk+1). Pevk = 3 is suffwer also rerge samplesorithm fo important pnsus word aition: dge removahm selects engle will pas a clique ofat removes a trivial proength of all but so doeszner and Szicient for finquires a larg [13]. r the Closroblem in mnalysis [4][l become mach triangless the test if size 4. Otheinconsistentblem. The csequences i the runninge point out tding extende amount ofest Stringany areas su9]. The Clouch stronger (u, w, x) to there existsrwise the ed edges fromonstruction n the set S. T time since that for manyable cliques time and m Problemch as compsest String p. The algori apply the at least oneges (u, w), a graph untof the graphhe pruninghe running instances o which are emory; as a . utational roblem can 9 thm (w, il for time f be Definitioand a num Capplicatiopresent aparticularsmall (ra Definitiocharacterstring in there is ncentral st Gstrings. TWe say acharactermatrix ha n 3 (Closesber d the plosest stringns of comp fixed-paramly importannge of 1 ~ 7n 4 (Closes at position pthe set is of o string s’ sring s with iven a set ofhe term col particular cs in the colus only 3 dirt String Proroblem is to is an NP-coutational bioeter algoritt for the mo). t string not in s, wherlength l, weuch that DH DH (s, si) = 2 k strings ofumns of a Column of thmn. For thety columns. blem): Giv find a centemplete problogy d is quhm with an tif finding pation): For e 1 ≤ p ≤ l. say a string(s’, si) < DH for all si length l, welosest Stringis matrix is a example been a set S =r string s’ slem; howevite small. Jeexponential roblem consa given strinGiven a set s is an optim (s, si) for al{s1, s2, s3, s4 can create matrix are dirty columlow the S m {s1, s2, … , uch that DHer, when wens Gram & growth comidering the g s of lengtof strings S al closest sl si S. The, s5, s6}. a k x l chara in referencen if there aratrix has 8 dsk} of t strin(s’, si) ≤ d fo are dealingRolf Niederplexity of Ovalue d is coh l, s[p] will= {s1, s2, … tring for S i example b cter matrix to this parte two or moirty columngs of lengthr all si S. with meirer [4] (dd). This instant and q denote the , sk}, where f and only ifelow shows from this seicular matrixre differings while the 10 n, s uite each a t of . T Balgorithmrecursiveif such a Inmore thacenter strarbitrary another sthe candi= (h-1). find a solwhen caris O(dd), 2.4 BasThis sectwhich is the BVR Gsequence(n-1) – l.each variregardlesvariants iTO(n(3l)d ounded sears for a varie procedure pstring existsitially, befon kd dirty coing can be fstring si String sj S (date string aThe algorithution to the eful sub caswhere d is cic Voting ion will desca variant of using the saiven a set of si S and g For each l-ant generates of the factn this way, he BVR trad+ nt), in comch tree is onty of probleresented by for a given re calling thlumns, of thound for giv as the candwhere si ≠ snd changedm continuesproblem (a es of this reconstant. Algorithmribe the Baan algorithmme notation sequences enerate l-lelength substd. Any vari that an idenit will be thees a lower tparison to ae of the basms. The alg Gramm & Nset of stringe proceduree Closest Sten value of idate string j), where DH to match th in this fashcenter stringursion are s sic Voting A first propo as defined S and two nngth substrinring the algoant generatetical variant case that mime comple brute forceic techniqueorithm for Ciedermeires [4]. , Gramm & ring matrix k & d). Oths and an inte(s, sj) = h anat of string sion until eit s’ is foundelected the lgorithm (Bsed by Wareby the Plantumbers l andgs from eacrithm will gd will get o is generateotif m will hxity O(nt(3l algorithm ws used in delosest Strinr that is proNiedermier S, then the ierwise the mger value dd h > d, thej at the sameher s moves). Gramm &upper limit oSV) createdmann et al.ed (l,d)-Mot d, the BVAh position ienerate all dne and only d at a later save t numb)d) for an inchich has co signing fixeg in Appendven to find apoint out thnstance is reethod is cal as parameten a position position su to far away Niedermien the size o by Leung a [8, 16]. Weif Problem. will iteraten si from ind variants anone vote, petep. When cer of votes. reased spacmplexities Oed-parameterix A is a central striat if there arjected (no led with anyrs. If there is selected ch that DH(sfrom si or wr point out tf this searchnd Chin [8] will describ through eaex 0 to inded cast a votr sequence,asting votes e complexit(nt4l) and 11 ng s e is in , sj) e hat tree e ch x e on on y 12 O(nt) respectively [8]. Leung et al. show the BVR to be efficient for solving many Planted (l, d) Motif Problem instances [8]. 13 3 Overview of Content Sensitive Consistency (CSC) The problem of motif finding can be formulated as a constraint satisfaction problem (CSP). Formulating in this way allows us to detect and remove spurious similarities via the use of constraint propagation techniques that enforces local consistency. This process works similar to the methods used in the winnower approach with equivalent time complexities. A CSP is a combinatorial search problem – well studied in the area of Artificial Intelligence. Essentially a CSP can be defined as: given a set of variables one must assign each variable a value that will satisfy a predefined number of constraints. Often a CSP requires a heuristic and combinatorial search method to be solved in reasonable time. Definition 6 (CSP): A CSP consists of: A set of variables X = {X1, X2, … , Xn} and a set of constraints C = {C1, C2, … , Cm}. Each variable Xi has a set of i possible domains D = {D1, D2, … , Di} where for each value j, Dj D defines a possible value. Each constraint is defined over a subset of variables and specifies the allowable combinations of values for that subset. Each state of this problem is defined by the assignment of values to a partial or a complete set of variables. We say an assignment is consistent when a variable takes on a value that does not violate any constraint Ci C for all of i. An assignment is complete when all variables have a value assigned (not necessarily satisfying all constraints). A solution to a CSP occurs when an assignment is both consistent and complete. A general way to solve a CSP is to incrementally make assignments to variables in a consistent way until a complete assignment is achieved. There are many methods available to perform this operation; however, because the CSP is NP-hard, most algorithms have a time complexity that is exponential. There are however, many algorithms that can achieve lower running times when based on particular problem specific features. The CSP can be used to formulate many different existing problems. We show how to formulate the clique finding problem as a CSP. Definition 7 (Motif finding as a binary CSP): Given a set of sequences S for an instance of the motif finding problem, we can define an instance of a CSP whose variables X correspond to S. Each vardomain vconstrainthe two v Avariable athat the Hless than winnowe 3.1 ConOur grapof lengthinput, a gTcorresponindex j infrom inde Fwould be iable xi X alue will cot requiring tariables, mu binary CSPnd each arcamming di2d. This cor. structingh constructi n), a paramraph G=(V,Ehe algorithmd to substri sequence sx j to (n-1) or example split into thwill have a rrespond to hat the two st have a H can be repr represents cstance betwenstraint grap the Conston algorithmeter l (length) will be co iterates thrngs of each i we constru+ l; for a totgiven any are vertices coset of possibeach l-lengtsubstrings camming distesented as aonstraint been the two h is construraint Gra takes in as of signal) anstructed inough each ssequence, thct a vertex val of t((n-1)bitrary sequrrespondingle domains h substring. orrespondinance less th constraint gtween two vl-length subcted in mucph: input a set ond a numbe the followiequence in tat is, for ev V that wi + l) verticeence si the f to substrinD = {D1, D2 For each pag to the twoan or equal raph whereariables. Tstrings (relath the same wf t sequencer d (numberng way: he set S andery sequencll represent s. igure belowgs. , …, D(n-1)-l}ir of variab domain valuto 2d. each node rhe constrainive to the noay as the gs S = {s1, s2 of mismatc constructs ve si of the sethe substring illustrates h where eachles, there is es, assigneepresents a t for an arc des) must braph in the , … , st} (eahes). Givenertices thatt S and for e sij of lengtow the sequ14 a d to is e ch this ach h l ence 15 After running this procedure across all t sequences we would achieve a multi-partite graph similar to figure below: After this vertex constructing process the algorithm iterates through each vertex vi in each jth partite Sj and compares vi against every other vertex in each remaining partite. An edge between two vertices will correspond to vertices that have a hamming distance less than 2d, that is, we connect vertices vij and vab by an edge if and only if DH (vij, vab) ≤ 2d where vij and vab are from separate sequences. Constructing a graph in this way will encapsulate the signal in a clique; if we select any arbitrary edge between any two vertices, corresponding to a signal, that edge will have a maximum hamming distance of at most 2d [13]. However constructing the graph in this manner also gives rise to many spurious edges that are not part of the clique we want to find. In order to prune these edges we use constraint propagation. The main idea of constraint propagation is to repeatedly reduce the domain of each variable in order to be consistent with its arcs. Reducing the domains in this manner is what makes constraint propagation an effective pruning mechanism. Applying constraint propagation is what is known as achieving local consistency. The requirement of local consistency conditions are to take a CSP problem P and map that problem into another problem P’ without altering the original problem’s solutions. This mapping is referred to as constraint propagation. The main idea here is to create a problem that is less complex by reducing the domains of variables by applying constraints or adding new ones. There are several such local consistency conditions that exist; the ones we will concern with here are arc consistency, path consistency and the general case k-consistency. 16 3.2 Arc Consistency Arc consistency is one of the simplest forms of local consistency and is a process equivalent to the Winnower for k=1. To achieve arc consistency we apply an algorithm that will remove all unsupported values from the domains of individual variables to return a reduced CSP that is easier to solve (or find the problem to have no solution). Generally we say a CSP P is arc consistent if each of a variable’s admissible values in P is consistent with every other variable’s admissible values in P. Formally: Definition 7 (Formal Arc Consistency): Given a set of variables X = {x1, x2, …, xn} we say variable xi X is arc-consistent with another variable xj X (where xi ≠ xj) if for every value ai in the domain xi (there exists) a value bj in the domain of xj such that (ai, bj) satisfies each constraint between xi and xj. We will call such a variable bj a support of ai. Graphically we can illustrate how arc consistency works with the following binary network: We see S1 is arc consistent with S2 & S3. S3 is arc consistent with S1 & S2. S2 is arc consistent with S3 however not with S1. This is because the assignment S2 = 1 does not correspond to any value for S1 (no supporting variable for S2=1). We can therefore remove 1 from the domain of S2. Now S3 is no longer arc consistent with S2, because S3 = 2 does not correspond to any value for S2. We therefore remove 2 from the domain of S3. We continue to remove values until we end up with a new problem which is arc consistent: Wconstructpartites, wpartite ofhave the edge. Anremoved proposed 3.2.1 ArThere areMackworand O(erto be optand a spaCordier [optimal rGAC3 (OT(such as teach partand, for ee can use thed by our gre simply en G. That is, property sucy vertex thaalong with a by Bessierec Consist several wath [2, 10] is) space (wheimal is the Ace complex3] that compun time com(ed)). he main idehe GAC3 dite. Upon inach vertex vis idea of araph construsure that foany given vh that theret does not hny edges th and Cordieency Impys to implem a non optimre r is the gC4 proposeity of O(ed2)romises betplexity of tha of the AC6oes), but ratitialization AC6 attemc consistencction algoritr every vertertex v that r exists a supave this proat are conner [3]. lementatient arc conal algorithmreatest rankd by Mohr a [2, 11, 12]ween the Ge AC4 (O(e is not to coher to ensurAC6 incrempts to find ay to prune ahm) G=(V,Eex v in G theemains afteporting vertperty is not cted to it. Ton (AC6)sistency. Th that achiev among consnd Henders. AC6 is an AC3 and thed2)) with a unt how mae each vertexentally itera supporting way spurio) which is are exists a sr arc consistex v in eacharc consistenhe algorithme GAC3 alges arc consitraints) [2].on, with a tialgorithm p AC4 insofalower spaceny supportin has at leastes through variable u inus edges. G multi-partiupporting vency is perf partite - met and theref we use in orithm propstency in O An algorithme complexroposed by Br that it mai complexityg vertices et one supporeach vertex each partitiven a graphte graph witariable u in ormed on Ganing (u, v) ore will be CSC is AC6osed by (er3dr+1) timm that is shity of O(ed2essiere andntains the closer to thach vertex hting vertex in the graphe for vertex 17 (as h t each , will is an as e own ) at of as in v. If 18 such a vertex u is found in each partite, then v will remain, otherwise it will be deleted. Briefly the AC6 algorithm (Appendix A) works as follows: The AC6 maintains a list S where S[xj, vj] stores all supporting values for which vj is the supporting variable for xj. The algorithm checks that there is a support vj for each variable xj in graph G. If there is no such variable vj, then xj will be removed and stored in list Q for future propagation. The propagation loop in AC6 checks the consequences of removing the variables that are stored in Q. When a vertex ai is chosen from Q, AC6 searches list S to check if ai was a support for some other vertex xj. It does so by checking list S for S[xj, aj]. If it was a supporting variable then a new support must be found for xj, if none is found xj will be removed and in turn be stored in list Q. If there is a support then this new support will replace aj and stored in list S. The algorithm continues in this manner until the list Q is empty at which point every remaining vertex in graph G will have supporting variables in each partite of graph G, and thus arc consistent. Arc consistency is an effective pruning measure for simple instances of the planted motif problem; however, a stronger consistency is required for more challenging instances [13]. 3.3 Path Consistency Path consistency works much like arc consistency, but considers pairs of vertices rather than a single vertex which is essentially the same pruning mechanism used in the winnower for k=2. We define path consistency formally as: Definition 8 (Formal Path Consistency): Given a set of variables X = {x1, x2, …, xn} we say pair (xi, xj) X are path consistent with another variable xk X (where xi ≠ xj) if for every pair of values (a, b) that satisfies the constraint between (xi, xj) (there exists) a value c in the domain of xk such that (a, c) and (b, c) satisfies each constraint between (xi, xk) and (xj, xk). We will call such a variable xk a supporting variable of pair (xi, xj). Graphically the figure below show’s an example of enforced path consistency: 19 The figure on the left is an example of a graph that is arc consistent but not path consistent. The dotted edges in the figure on the right indicate edges that would be removed if the graph was to be made path consistent. We use this idea of path consistency to prune away spurious edges. The same algorithm (AC6) is used with a slight modification. Instead of search for supporting variables for a single vertex xi we will search for supporting variables for edges (xi, xj). That is, for every edge (xi, xj) we need to search each partite for a support vj such that (xi, xj, vj) forms a triangle. Path consistency is a more powerful pruning mechanism than arc consistency; however, it is still insufficient for difficult instances of the planted motif problem [13]. One way to increase the pruning power is to increase to 4-consistency (k-consistency) which is equivalent to the winnower method for k=3. 3.4 k-consistency K-consistency generalizes the notion of arc consistency and path consistency to sets of variables: Definition 8 (k-consistency): Given a set of variables X = {x1, x2, …, xn} we say a sub set of variables X’ X of size k-1 are k-consistent if for every set of values A = (a1, a2, … , at) that satisfies the constraint between all variables in X’ there exists a value c in a variable uk such that no constraint is violated for (a1, a2, … , at, c). Such a variable uk is a supporting variable of X’. Given this formal definition, arc consistency is equivalent to 2-consistency and path consistency is equivalent to 3-consistency. For the winnower, Pevzner et al. show that k=3 is 20 sufficient as an effective pruning mechanism, which would be equivalent to 4-consistency for our CSP. The tradeoff for this high degree of accuracy is a rather high time complexity of O(N4). We will now look at a new approach which combines path consistency along with the notion of closest string for a time complexity of O(N3 + dd). 3.5 Content Sensitive Analysis Instead of increasing to 4-consistency we add a new constraint to our path consistency. Not only do we enforce path consistency but for every pair of nodes that have support we also ensure there is a center string among these three nodes we are analyzing. The figure below shows a graph that is path consistent on the left. The figure on the right shows that a center string c must also exist if the edges in the graph on the left are to remain. How the algorithm works is described below. 3.5.1 Content Sensitive Analysis Implementation (PC6+CS) The PC6+CS maintains an array of lists S where S[vj] stores all a list for which vj is the supporting variable for all pairs of nodes (xi, xj). The algorithm checks that there is a support vj for each edge (xi, xj) in graph G. If there is no such variable vj, then edge (xi, xj) will be removed and stored in list Q for future propagation. If there is a support vj found the algorithm then uses the CS algorithm to find if there exists a center string c for nodes (xi, xj, vj). If such a string exists (xi, xj) is stored in list S under otherwise (xi, xj) is removed from G and (xi, xj) is stored in list Q for future propagation. 21 The propagation loop in PC6+CS checks the consequences of removing the variable pairs that are stored in Q. When a pair (xi, xj) is chosen from Q, PC6+CS searches list S to check if (xi, xj) was a supporting edge for some other pair of vertices (xi, a) or (xi, a). It does so by checking list S[xi] for (xj, a), and S[xj] for pair (xi, a). If it was a supporting edge then a new support must be found for both (xj, a) and (xi, a), if none is found (xj, a) and (xi, a), will be removed and in turn be stored in list Q. If there is a support vj then, as before, a center string c will be sought after. If such a string c is found then vj will replace aj and stored in list S. The algorithm continues in this manner until the list Q is empty at which point every remaining vertex in graph G will have supporting variables in each partite of graph G, and that satisfies our constraints. 22 4 Experimentation and Results. In this section we will describe how we generated samples as well as describe our results of running AC6, PC6 and PC6+CS on this these samples. All analysis was based on implementations in Java. 4.1 Sample Generation To test the effectiveness of our algorithm it is important to be able to generate random samples. Our sample generation algorithm works as follows: The algorithm takes in as input numbers t, n, l and d. Given these numbers the algorithm will generate set of sequences S = {s1, s2, …, st} where each si is of length n and each position in si is a nucleotide selected at random from alphabet ‘A’, ‘C’, ‘T’, or ‘G’. A string s of length l will be randomly generated using this same alphabet. A set of strings M = {m1, m2, …, mt} are generated where each mi is a d-variant of s (i.e: DH(s, mi) = d for all i). The algorithm then randomly selects a position j in each sequence si and plants string mi at this position replacing all nucleotides in sequences si from j to j+l. 4.2 Results We test the performance and limitations of AC6, PC6 and PC6+CS on several instances of the Planted (l, d)-Motif Problem. First we will represent our results based on the (3, 15) problem. We run tests for each algorithm based on parameters t=20, l=15 and d=3. In instances of the (3,15) problem all three algorithms pruned away sufficient edges to expose the signal we are searching for. The chart below is based on the average results obtained from running each algorithm independently on 50 different samples each. The ‘-‘ indicates a time that is greater than 1 hour. The graplocating t FThough trunning tPC6+CS AvgTime(s)h below illuhe signal. or this (3, 15he AC6 hasime in practthe conditio0.010.111010010001000050Avg. Time (s)Sestrates avera) problem t the lowest tice. An expn for edge r100 150ACquence Lengn 50 100 150 200 250 300 350 400 450 500 550 600 ge time (in he graph illuime complelanation foremoval is m200 250 3006, PC6 th A1110317213219seconds) as strates an inxity, of all th this interestuch stricter,350 400 4n& PC6+CTC6.8059.731.6017.010.45320.430.797‐‐‐‐‐a function oteresting phree algorithing phenom therefore ed50 500 550S vs. nime (s)PC60.0630.1790.3690.6661.0451.5522.0952.763.6214.645.7927.058f sequence lenomenon tms, it has thenon is thatges are rem600PC6PC6AC6PC6+CS 0.331 0.482 0.687 0.989 1.393 1.902 2.569 3.431 4.168 5.188 6.398 7.848 ength n, for hat occurs. e largest for PC6 andoved more (with CS) (no CS) 23 24 quickly resulting in less iterations and thus a shorter running. A similar observation was also made by Pevzner et al. [8]. AC6 tends to become impractical for n > 350 in comparison to PC6 and PC6+CS. The PC6 and PC6+CS find the signal under 10 seconds for relatively long sequence length (n=600). For this instance they both have roughly the same running time. Secondly we run tests for the (4, 15) problem. For this problem each algorithm is executed based on parameters t=20, l=15 and d=4. We first compare the AC6 with the PC6+CS. We randomly generated 50 samples for every value of n; for each sample we run the AC6 and then pass this arc consistent graph to the PC6+CS to assess how many edges were not pruned by the AC6. The chart below is based on the average results obtained from running both algorithms sequentially. We see that AC6 is accurate in finding the clique we are searching for n =50 however AC6 is insufficient for n > 50, and will be non effective for removing any edges at all for n ≥ 400. Sequence Length TIME (s) EDGESn AC6 PC6 + CS Initial Edges AC6 prune PC6 + CS Prune50 1.512 0.401 31520 31140 0 100 9.612 0.597 163346 151251 11709 150 2.013 4.441 401516 6753 394374 200 0.81 9.523 748116 996 746725 250 0.736 23.255 1202603 132 1202066 300 1.164 63.515 1763251 23 1762820 350 1.639 190.054 2433053 12 2432635 400 2.515 834.385 3211088 0 3210672 We then executed another 50 tests on the (4, 15) problem comparing accuracy and efficiency between PC6 and PC6+CS. First we run PC6 on each sample, once PC6 is finish the resulting path consistent graph is sent to PC6+CS to examine how many edges PC6+CS will prune from an already path consistent sample. Interestingly the PC6+CS is slightly more accurate and becomes increasingly accurate as n increases, removing a trivial number of edges after the graph has been made path consistent. The running time for of the PC6 become impractical (over 4 hours) for n > 300. Finally wimplemen The grapfunction mechanisSequence Length n 50 100 150 200 250 300 e run a sepatation of thh below comof n and illum than the P110100Avg. Time (s)SeTIMPC6 0.432 1.604 3.666 9.421 46.869 941.822rate test on e PC6+CS, pares the rustrates that iC6. 0.111000000050 10quence Lengn 50 100 150 200 250 300 350 400 450 E (s)PC6 + CS0.3930.4510.5030.5930.725 0.852the (4, 15) pthe running nning timesn practice, P0 150 200Timth TIME (PC6 + C0.8982.2094.5629.49722.534608.37188.15834.383754.39Initial Edges3135316379340233474911112026611764138roblem usintime becom of the bothC6+CS is a250 300 3ne (s) vs s)S Initial Ed315716344402267488412018176456 243305 321106 40929EDGESPC6 prune3097016340340194174871112022501763728g only the Pes impractic PC6 and PC slightly mo50 400 450nEDGESges PC6 +0 34 15 45 7433 1260 1753 2488 3268 40PC6 + CS 1 1 2 4 8 7 C6+CS. Foal for n> 456+CS withre efficient PC6PC6 CS Prune 11886305701872845001438 64160 32647 10672 92541 Prune r our 0. time as a pruning +CS25 26 5 Discussion and Conclusion The problem of motif finding plays a very important role in understanding gene regulatory mechanisms. An area of bioinformatics is concerned with modeling this problem in a way that will allow computational algorithms to lend aid in locating these motifs. One such model includes the Planted (l, d)-Motif Problem. In this thesis we have analyzed the efficiency and accuracy of several algorithms designed to solve this particular Planted (l, d)-Motif Problem, including: the Basic Voting Algorithm, the Winnower and the Closest String. We have presented a new formulation that composes the Planted (l, d)-Motif Problem as a CSP as well as introduced a new form of center string constraint consistency. Experimental results on a Java implementation, based on this formulation, locate signals accurately and efficiently for simulated data. However, using this implementation, we were unable to conclude the relative strength of this center string consistency when comparing it to the notion of Hamming distance. The impractical running times for this implementation did not allow us to test for large values of n where the PC6 is known to fail. This may be due to either a non optimal implementation or the relatively slow performance of Java. Future work may include an implementation in C so that we could achieve a practical running time with a large enough value of n such as n≥700 for the (4, 15) where the PC6 is known to fail [8]. Running the PC6+CS for these values of n is required to analyze the relative pruning strength of this new notion of consistency. 27 Appendex A Algorithm 1: AC6 [2, pg. 20] X ← 2D constraint graph Q ← ø S[xj , vj] = 0, vj D(xj ), xj X foreach xi X, cij C, vi D(xi) do vj ← smallest value in D(xj) s.t. (vi, vj) cij if vj exists then add (xi, vi) to S[xj , vj] else remove vi from D(xi) and add (xi, vi) to Q if D(xi) = ø then return false while Q ≠; do select and remove (xj , vj) from Q; foreach (xi, vi) S[xj, vj ] do if vi D(xi) then v’j ← smallest value in D(xj) greater than vj s.t. (vi, vj) cij if v’j exists then add (xi, vi) to S[xj, v’j] else remove vi from D(xi); add (xi, vi) to Q; if D(xi) = ø then return false return true Algorithm 2: Closest String [4, pg. 6] recursive procedure: CSd(s, ∆d) Global variables: Set of strings S = {s1, s2, . . . , sk}, integer d. Input: Candidate string s and integer ∆d. Output: A strings with maxi=1,...,kdH(s, si) ≤ d and dH(s, s) ≤ ∆d if it exists, and “not found,” otherwise. if (∆d < 0), then return “not found”; if (dH(s, si) > d + ∆d) for some i {1, . . . , k} then return “not found”; if (dH(s, si) ≤ d) for all i = 1, . . . , k then return s; Choose some i {1, . . . , k} such that dH(s, si) > d: P := { p | s[p] si[p] }; Choose any P’ P with |P’| = d + 1; for all p P do s’:= s; s’[p] := si[p]; sret:= CSd(s , ∆d − 1); If sret “not found” then return sret return “not found” 28 Algorithm 3: PC6+CS // initialization C ← a 2D constraint graph Q ← ø // Q list stores all deleted pairs of variables S[n] ← ø // where s[i] stores all pairs of variables supported by i. D ← all pairs of variables that have not yet violated any constraint // begin foreach pair (xai, xbj) do // where a and b indicate partites which x is a member foreach partite c where c ≠ a and c ≠ b do vc == checkForSupport(xai, xbj, c)) //return ‐1 if no support found if (vc ≠ ‐1) and (centerString(xai, xbj, vc)) then S[Vc] ← (xai, xbj) else C[xai][xbj] = 0 Q ← (xai, xbj) remove (xai, xbj) from D if D empty return false // no feasible solution // propagation while Q ≠ ø do select and remove (xai, xbj) from Q S’ ← each triple (yai, ybi, vc) of variables for which edge (xai, xbj) was a support foreach (yai, ybj) S’ do if (yai, ybj) D then vc == checkForSupport(yai, ybj, vc)) //starting at index vc in c if (vc ≠ ‐1) and (centerString(yai, ybj, vc)) then S[Vc] ← (yai, ybj) else C[xai][xbj] = 0 Q ← (yai, ybj) remove (yai, ybj) from D if D empty return false // no feasible solution return true 29 Algorithm 4: Basic Voting Algorithm [8, pg. 4] Create two hash tables V and R and set the value of each entry to be 0 {Table V keeps the number of votes received by each length-l sequence s. Each length-l sequence, received t votes is a candidate for motif. Hash table R ensures that each length-l sequence receives at most one vote from each input sequence. Si{j} is the j-th character in the i-th input sequence Si and H(s) is the hash value of a length-l sequence s.} C ← ø for i ← 1 to t do for j ← 1 to n – l + 1 do for each length‐l sequence is N(Si[j … j + l – 1],d) do if R[H(s)] <> i then V[H(s)] ← V[H(s)] + 1 R[H(s)] ← i for j ← to n – l + 1 do for each length‐l sequence s in N(Si[j … j + l – 1],d) do if V{H(s)]= t then insert s into C 30 References [1] S. Balla, J. Davila and S. Rajasekaran. On the Challenging Instances of the Planted Motif Problem. http://www.engr.uconn.edu/becat/reports/BECAT-CSE-TR-07-2.pdf [2] C. Bessiere. Constraint Propagation. Handbook of Constraint Programming. Ch. 3, Pg. 29 – 85 ISBN 0-444-52726-5 [3] C. Bessiere, M.O. Cordier. Arc-consistency and arc-consistency again. In Proceedings AAAI’93, pages 108–113, Washington D.C., 1993. [4] J. Gramm R. Niedermeier. Fixed-Paramter Algorithms for Closest String and Related Problems. Algorithmica [0178-4617] Gramm yr:2003 vol:37 iss:1 pg:25 [5] J. Hu, B Li, D. Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33(15):4899-4913. [6] R. M. Karp. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, Proc. Sympos. IBM Thomas J. Watson Res. Center, Yorktown Heights, N.Y.. New York: Plenum, p.85-103. 1972. [7] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald and J. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 1993. [8] H. C.M. Leung, F.Y.L. Chin. Algorithms for challenging motif problems. Journal of bioinformatics and computational biology [0219-7200] Leung yr: 2006 vol: 4 iss: 1 pg: 43 -58 [9] M. Li, B. Ma, L. Wang. 2002. On the Closest String and Substring Problems. Journal of the ACM, Vol. 49, No. 2, March 2002, pp. 157-171. 31 [10] A.K. Mackworth. Consistency in networks of relations. Technical Report 75-3, Dept. of Computer Science, Univ. of B.C. Vancouver, 1975. (also in Artificial Intelligence 8, 99-118, 1977). [11] R. Mohr and T.C. Henderson. Arc and path consistency revisited. Arti- ficial Intelligence, 28:225–233, 1986. [12] R. Mohr and G. Masini. Good old discrete relaxation. In ProceedingsECAI’88, pages 651–656, Munchen, FRG, 1988. [13] P.A. Pevzner, S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 269–278, AAAI Press, 2000. [14] M. Vingron, and P. Argos, 1991. Motif recognitionand alignment for many sequences by comparison of dot-matrices. Journal of Molecular Biology 218:33-43. [15] M.Vingron, and P. Pevzner, 1995. Multiple sequence comparison and consistency on multipartite graphs. Advances in Applied Mathematics 16:1-22. [16] M.S. Waterman, R. Arratia and D.J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515-527. 1984.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Undergraduate Research /
- Computer Modeling of Molecular Genetic Events using...
Open Collections
UBC Undergraduate Research
Computer Modeling of Molecular Genetic Events using Content Sensitive Analysis Stodgell, Curtis 2007
pdf
Page Metadata
Item Metadata
Title | Computer Modeling of Molecular Genetic Events using Content Sensitive Analysis |
Creator |
Stodgell, Curtis |
Date Issued | 2016-05-01 |
Description | The problem of motif finding plays an important role in understanding the development, function and evolution of organisms. The Planted (l, d)-Motif Problem first introduced by Pevzner and Sze [13] is a variant model of motif finding that has been widely studied. Nevertheless, despite the many different algorithms constructed to solve this problem, it is far from being solved entirely [5]. We analyze a number of algorithms including: the Basic Voting Algorithm, the Winnower and the Closest String. One thing that has become ubiquitous among these algorithms is the tradeoff between efficiency and accuracy. We formulate the motif-finding problem as a constraint satisfaction problem and introduce a special form of constraint consistency. By using a fixed-parameter algorithm for the closest string problem [4], we develop a propagation algorithm that enforces the new constraint consistency with the same worst-case time complexity as that of the standard path consistency algorithm. Experiments on randomly generated sequences indicate that our approach is effective and efficient. |
Genre |
Other |
Type |
Text |
Language | Eng |
Collection |
Computer Science Undergraduate Honours Essays (Okanagan Campus) |
Series | University of British Columbia, Okanagan campus, Computer Science Undergraduate Honours Essays |
Date Available | 2010-07-30 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0052231 |
Affiliation |
Irving K. Barber School of Arts and Sciences (Okanagan) |
Peer Review Status | Unreviewed |
Scholarly Level | Undergraduate |
URI | http://hdl.handle.net/2429/27050 |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- Thesis Curtis Stodgell.pdf [ 647.54kB ]
- [if-you-see-this-DO-NOT-CLICK]
- Metadata
- JSON: 1.0052231.json
- JSON-LD: 1.0052231+ld.json
- RDF/XML (Pretty): 1.0052231.xml
- RDF/JSON: 1.0052231+rdf.json
- Turtle: 1.0052231+rdf-turtle.txt
- N-Triples: 1.0052231+rdf-ntriples.txt
- Original Record: 1.0052231 +original-record.json
- Full Text
- 1.0052231.txt
- Citation
- 1.0052231.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
United States | 20 | 1 |
China | 13 | 11 |
Canada | 3 | 0 |
Germany | 3 | 0 |
Sweden | 2 | 0 |
City | Views | Downloads |
---|---|---|
Ashburn | 13 | 0 |
Beijing | 13 | 0 |
Unknown | 3 | 0 |
San Francisco | 3 | 0 |
Stockholm | 2 | 0 |
Mountain View | 2 | 1 |
Vancouver | 2 | 0 |
Redmond | 1 | 0 |
Ottawa | 1 | 0 |
Seattle | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.42497.1-0052231/manifest
Comment
Related Items
Admin Tools
To re-ingest this item use button below, on average re-ingesting will take 5 minutes per item.
ReingestTo clear this item from the cache, please use the button below;
Clear Item cache