Coomputerr Modeliing of Molecula M ar Genettic Even nts usingg Con ntent Seensitive Analysiis by Curttis Stodgelll A THESIS SUBMIT TTED IN PARTIAL P LMENT OF F THE FULLFIL REQUIREME ENTS FOR R THE DE EGREE OF F HONOURS IN CO OMPUTER R SC CIENCE in Irving K.. Barber Scchool of Arrts and Sciiences HE UNIVERSITY OF F BRITISH H COLUM MBIA OKA ANAGAN CAMPUS TH Maarch 2007 ©Curtis Stodgell, 2007 Abstract: The problem of motif finding plays an important role in understanding the development, function and evolution of organisms. The Planted (l, d)-Motif Problem first introduced by Pevzner and Sze [13] is a variant model of motif finding that has been widely studied. Nevertheless, despite the many different algorithms constructed to solve this problem, it is far from being solved entirely [5]. We analyze a number of algorithms including: the Basic Voting Algorithm, the Winnower and the Closest String. One thing that has become ubiquitous among these algorithms is the tradeoff between efficiency and accuracy. We formulate the motif-finding problem as a constraint satisfaction problem and introduce a special form of constraint consistency. By using a fixed-parameter algorithm for the closest string problem [4], we develop a propagation algorithm that enforces the new constraint consistency with the same worst-case time complexity as that of the standard path consistency algorithm. Experiments on randomly generated sequences indicate that our approach is effective and efficient. 1 Table of Contents Abstract .......................................................................................................................................1 Table of Contents .......................................................................................................................2 1. Introduction ...........................................................................................................................3 2. Planted Motif Problem...........................................................................................................5 1. Definitions and Notation ..................................................................................................5 2. Winnowing Approach: Locating Signals via Winnowing ...............................................6 3. A Fixed Parameter Algorithm for the Closest String.......................................................9 4. Basic Voting Algorithm .................................................................................................11 3. Overview of Content Sensitive Consistency .......................................................................13 1. Constructing the Constraint Graph ................................................................................14 2. Arc Consistency .............................................................................................................16 1. Arc Consistency Implementation (AC6) ...........................................................17 3. Path Consistency ............................................................................................................18 4. k-consistency..................................................................................................................19 5. Content Sensitive Analysis (CSC) .................................................................................20 1. CSC Implementation (PC6+CS) ........................................................................20 4. Experimentation and Results ..............................................................................................22 1. Generation of Samples ...................................................................................................22 2. Results ............................................................................................................................22 5. Discussion and Conclusion ..................................................................................................26 Appendix A ...............................................................................................................................27 References .................................................................................................................................30 2 1 Introduction: Transcription factors are proteins that bind to DNA in order to regulate the expression of genes through the activation or inhibition of transcription mechanisms. Locating these sites is an integral part of understanding the regulatory process. Computational tools are an invaluable method used to locate these particular sites and this general problem has been termed motif finding. Essentially the motif finding problem is the problem of finding common patterns among a set of sequences. A simple model for the problem is as follows: Given a sample of random sequences can we find a common unknown pattern (motif) that is hidden at random locations in the sequences? If the motif was not subject to mutations it would be relatively easy to locate using a basic brute force algorithm; however, in biology the sequences being analyzed are subject to mutations and therefore the motif cannot be assumed to be exact. As such, it makes sense to construct a model that will allow the patterns we are searching for to have an arbitrary number of mutations [13]. Buher and Tompa [8, 7] found limitations for this particular problem. They found that when the number of sequences, the length of sequences and the length of the pattern are fixed, if the number of mutations in the pattern is larger than a particular threshold then it is highly improbable that any algorithm would be able to find such a pattern due to an abundance of random patterns [8, 7]. However, there have been many algorithms suggested for instances of this problem that do not exceed this threshold [8]. In this thesis we analyze a number of different algorithms that attempt to solve the problem in several different ways. These include A voting algorithm by Leung and Chin [8] that locates common motifs by enumerating all neighbors of each l length substrings in a given set of sequences, as well as a combinatorial approach by Pevzner and Sze [13] that essentially reduces the motif finding problem to locating cliques in multi-partite graphs. Closely related to the planted-motif problem are the closest string problem and the closest substring problem. Both of these problems are NP-hard, but the closest string problem can be solved in linear time assuming that the number of mutations is a fixed constant [9]. In the Winnower Pevzner and Sze designed an algorithm that prunes spurious similarities, which enables us to easily locate the signal we are searching for. They did so with an algorithm that has a time complexity of O(Nk+1) and found that the algorithm was accurate for 3 k=3. Our contribution includes a variation of the Winnower combined with the closest string. We have designed an algorithm that essentially runs as the Winnower for k=2 along with the closest string algorithm to construct an algorithm with a time complexity of O(N3 + dd) where d is the number of mutations and constant. This thesis is organized as follows: In chapter 2, we will discuss numerous algorithms that approach The Planted (l, d)-Motif Problem in various ways, each algorithm achieving a different level of both accuracy and efficiency. In chapter 3, we present our approach to the problem, including the CSP formulation, the propagation algorithm for the stronger notion of path consistency and its implementation. In chapter 4 we present our results and allegorize the pruning power and efficiency of our approach. In chapter 5, we will conclude our results and analyses, as well as propose future challenges. 4 2 Plan nted Motiif Problem m In this chhapter we inttroduce form mally the Plaanted (l, d)-M Motif Problem m [13] as weell as discusss a variety of existing alg gorithms dessigned to sollve this probblem. 2.1 Defi finitions an nd Notatioon In order to t understan nd how the motif m findingg problem cann be solved we will givee a formal definitionn based on a variant moddel of the prooblem preseented by Pevzner and Szee [13], as weell as define nootation we will w use throuughout this thhesis. T methods we will be using The u to solvve this particcular problem m will usuallly be executeed partially by comparin ng the Hamm ming distancce between tw wo strings. Definitioon 1 (Hamm ming Distancce): Hammiing distance is a simple concept c in ann area of mathemaatics termed information theory. It iss defined sim mply as the number n of inndices that tw wo strings diiffer. For ex xample givenn two stringss “ATAGTC CA” and “TT TTGTCA” we w can see thhat they diffeer at indices 1 & 3; thereefore, the hamming distaance betweenn these two strings s is 2 t notation DH(s, s’) wiill be used too denote the hamming diistance betw ween Inn this paper the a string s and s’. Definitioon 2 (Planteed (l, d)-Mottif Problem)): A motif is a single orr repeated paattern. The problem of motif find ding deals with w locating common paatterns amonng a set of sequences. What W makes thhe problem difficult d is the fact that thhis pattern is not exact annd is allowedd to differ byy a specific number n (refeerred to as mutations). m F Formally wee can describbe the problem as follows: 5 Assume there exists a motif m of length l; given t number sequences each of length n, the problem is to find a planted motif m’ of length l in each of the t sequences such that m’ differs in at most d positions from m. A more general problem is the closest substring problem which has been shown to be NP-hard [9]. The difference between the planted motif problem and the closest string problem is that for the planted motif problem, it is guaranteed that there exists at least one common substring that occurs in each of the sequences. Whereas for the closest string problem, it is not guaranteed that there will be such a common substring among each of the sequences. The parameters l and d in the Planted (l, d)-Motif Problem have a significant impact on the practical difficulty of the problem. Some particularly difficult Planted (l, d)-Motif problem instances proven to be probabilistically challenging are (9,2), (11, 3) and (15, 4) [1]. These problems are considered challenging because given any two instances of a mutated (l, d)-signal there is a high probability that non-signals (spurious similarities) will be randomly generated such that they very closely resemble the proper signal. Many techniques have been designed to solve these challenging problems which will be looked at below. 2.2 Winnowing Approach: Locating Signals via Winnowing. Pavel A. Pevzner and Sing-Hoi Sze [4] take a combinatorial approach to the motif finding problem by reducing it to a clique finding problem. Pevzner et al. recognize that in each sequence there are spurious signals that occur at random that tend to hide the real signal. Most algorithms tend to focus directly on the process of locating the real signal; however, the winnower takes an opposite approach. It focuses on the spurious signals (rather than the signal itself) and removes these spurious similarities incrementally until the signal we are looking for is no longer difficult to find. This elimination process is, however, non-trivial and time consuming. It is important to note that the Winnower algorithm only considers signals with Hamming distance as criterion for an edge and not insertions or deletions (signals will all be the same length). This is done for simplicity; however the algorithm could easily be modified to account for these constraints [8]. 6 T Winnoweer is set up to The t reduce thee Planted (l,dd)-Motif probblem to findding large cliiques in a multti-partite grap ph. Given a set of t sequuences S = {s { 1, s2, … , st} each of lenngth n, a parameteer l (length of o the signal)) and a numbber d (numbeer of mismattches), consttruct a graphh G=(V, E)) in the following way: V Vertices will correspond to substringss of each seqquence, that is, for everyy sequence si of the set S = {s1, s2, … ,st} and for each index j in sequencee si we consttruct a vertexx v V that will ng sij of lenggth l from inddex j to (n-1)) + l; thus thhere is a totall of t((n-1) + l) representt the substrin vertices. T There is an ed dge betweenn vi and vj if and a only if DH (vi, vj) ≤ 2d 2 and vi and vj correspoond to substriings from seeparate sequeences. A clique in a graph G = (V,E) is a subbset C such that t any twoo vertices in C are conneccted by an edgge. Given an ny graph F = (V, E), connstructed as described d byy the Winnow wer, the question if F contaains a cliquee of at least size s k, can bee asked. Onnce such a cliique has beeen found, it will w also be thhe case that this t clique contains the original o signnal being souught after. An example of o 4 sequencees (partites) forming f an extendable e clique can bee seen in the figure below w: T clique fin The nding probleem is a graphh-theoreticall NP-compleete problem [6]. [ There has h been receent interest in studying thhe problem of finding hiidden cliquees in an otherrwise random mly generatedd graph [6]. The plantedd motif-findiing problem is a special case of the hidden h cliquue problem in that the underlying u grraph is a multi-partite grraph. T Winnoweer algorithm The m uses the priinciple of exxploring thesse particular graphs and will w remove edges e that arre proven to not be part of o the clique we are searching for. To T describe how h the winnoower removes edges Pevvzner and Szze use the nootation of an extendable clique. “Wee say that a verrtex u is a neeighbor of a clique C = {v { 1,…,vk} if {v1, … , vk, u} u is a cliquee in the grapph. We definne a clique to o be extendaable if it has at least one neighbor n in every part of o the multipartite 7 graph G. We define an edge to be b spurious if i it does nott belong to any extendabble clique of size w to imposse increasinggly strict connditions as k increases is to ensure thhat all spurioous k. One way edges aree deleted afteer winnowinng and only extendable e c cliques remain in the finaal graph.” [33, pg. 3]. The essential e ideaa of the winnnower is to allow a only exxtendable cliiques of sizee k to remainn as k increasess; in doing so o spurious edges are effeectively rem moved. T is also th This he approach that was takken by Vingrron & Agross[13, 14] andd Vingron & Pevzner [13, 15] for k = 2. However it was observed o by Pevzner & Sze S that wheen k ≤ 2, filteering edges thaat are not parrt of an extenndable cliquue is too weaak for pruninng spurious edges e [13]. Their T observatiions indicatee that only when w k = 3 dooes the methhod become effective at removing r enough spurious s edges in such a way that thee clique we’rre searching becomes evvident. T winnoweer removes edges in the following The f waay: Given a multipartite graph G = (V,E) ( and a num mber k = 1, the algorithhm checks evvery vertex in G and appplies the folloowing test. A vertex u will w pass thee test if theree exists at leaast one neighhbor vi in eaach partite seet of G (ei: (uu, vi) is an edgge in each si), if vertex u fails this tesst, it will be deleted. d In the t illustratioon below, foor a multiparttite graph wiith four partiite sets, verteex v11 will not n be deletedd (v11 has as edge in S2, S3, S4); howeever, vertex v12 will be deleted d (v12 has h an edge in i S2 and S3 but b no edge in S4). w now checck each edgee (u, w) in thhe graph G. The edge (uu, w) For k = 2 the algorithm will will pass the test if th here is a verttex vi for eacch partite si such s that {u,, vi, w} is a triangle. t If thhere e (u, w) will w be deleted. In the illlustration bellow, edge (vv11, v22) will not n is no suchh vertex vi, edge be deleteed; however, edge (v12, v21) will be deleted. d 8 e removaal become much strongerr. The algoriithm So, as k increeases the connditions for edge hion for k = 3; 3 the algoritthm selects each e trianglee (u, w, x) too apply the continuess in this fash similar teest for the otther values of o k. This triaangle will paass the test iff there existss at least onee vertex vi in each partite set si if {u, vi, w, x} is a clique off size 4. Otheerwise the eddges (u, w), (w, x) & (u, x) x will be rem moved from m the graph. W Winnower is an iterative algorithm thhat removes inconsistentt edges from m a graph until locating extendable e cliques c of sizze k become a trivial prooblem. The construction c of the graphh for Winnoweer takes O(N N2) time, wheere N is the length l of all sequences in the set S. The T pruningg power off Winnower increases as k increases,, but so doess the runningg time since the t running time complexiity of the Wiinnower is O(N O k+1). Pevvzner and Szze point out that t for manyy instances of o the Plantted-(l, d) Mo otif Problem k = 3 is suffficient for finnding extenddable cliquess which are meaningfful. Howeveer the Winnoower also requires a largge amount off time and memory; m as a result it becomes b verry slow for laarge sampless [13]. 2.3 A Fixed-Para F ameter Alggorithm foor the Clossest Stringg Problem m. The Clossest String prroblem is ann important problem p in many m areas suuch as compputational biology, coding theorry and conseensus word analysis a [4][9]. The Cloosest String problem p can be given thee following general g definnition: 9 Definitioon 3 (Closesst String Prooblem): Givven a set S = {s1, s2, … , sk} of t strinngs of lengthh n, and a num mber d the problem p is too find a centeer string s’ such s that DH(s’, si) ≤ d foor all si S. C Closest string g is an NP-coomplete probblem; howevver, when wee are dealingg with applicatioons of comp putational bioology d is quuite small. Jeens Gram & Rolf Niederrmeirer [4] present a fixed-param meter algoritthm with an exponential growth com mplexity of O(d O d). This is i particularrly importan nt for the mootif finding problem p conssidering the value d is coonstant and quite q small (raange of 1 ~ 7). Definitioon 4 (Closesst string notation): For a given strinng s of lengtth l, s[p] willl denote the characterr at position p in s, where 1 ≤ p ≤ l. Given a set of strings S = {s1, s2, … , sk}, where each string in the set is of length l, we say a stringg s is an optim mal closest string s for S if i and only iff there is no n string s’ such that DH (s’, si) < DH (s, si) for alll si central sttring s with DH (s, si) = 2 for all si S. Thee example below shows a {s1, s2, s3, s4, s5, s6}. G Given a set off k strings off length l, wee can create a k x l charaacter matrix from this set of strings. The T term collumns of a Closest C Stringg matrix aree in referencee to this partticular matrixx. We say a particular column c of this matrix is a dirty colum mn if there arre two or moore differingg characterrs in the colu umn. For thee example beelow the S matrix m has 8 dirty d columnns while the T matrix haas only 3 dirrty columns. 10 B Bounded searrch tree is onne of the bassic techniquees used in deesigning fixeed-parameterr algorithm ms for a varieety of probleems. The alggorithm for Closest C Strinng in Appenddix A is a recursivee procedure presented p byy Gramm & Niedermeire N er that is prooven to find a central striing s if such a string existss for a given set of stringgs [4]. ore calling thhe proceduree, Gramm & Niedermier point out thhat if there arre Innitially, befo more thaan kd dirty co olumns, of thhe Closest Sttring matrix S, then the instance i is reejected (no center strring can be found f for givven value off k & d). Othherwise the method m is callled with anyy arbitrary string si S as the canddidate string s and an inteeger value d as parameteers. If there is another string s sj S (where ( si ≠ sj), where DH(s, sj) = h annd h > d, theen a positionn is selected in the candiidate string and a changedd to match thhat of string sj at the samee position suuch that DH(ss, sj) = (h-1). The algorith hm continuess in this fashhion until either s moves to far away from si or we w find a sollution to the problem (a center stringg s’ is found). Gramm & Niedermieer point out that t when carreful sub casses of this reccursion are selected s the upper limit on o the size of o this searchh tree is O(dd), where d is constant. c 2.4 Basic Voting Algorithm m This secttion will desccribe the Baasic Voting Algorithm A (B BSV) createdd by Leung and a Chin [8]] which is a variant of an algorithm m first propoosed by Wareemann et al. [8, 16]. We will describbe the BVR R using the saame notationn as defined by the Plantted (l,d)-Mottif Problem. G Given a set off sequences S and two numbers n l andd d, the BVA A will iteratee through eaach sequencee si S and generate g l-length substrinngs from eacch position in i si from inddex 0 to indeex (n-1) – l. For each l--length substtring the algoorithm will generate g all d variants annd cast a votte on each variiant generateed. Any variiant generateed will get one o and only one vote, peer sequence, regardlesss of the factt that an idenntical variantt is generated at a later step. s When casting c votess on variants in i this way, it will be thee case that motif m m will have h t numbber of votes. T BVR trad The des a lower time t compleexity O(nt(3ll)d) for an inccreased spacce complexitty l O(n(3l)d + nt), in com mparison to a brute forcee algorithm which w has coomplexities O(nt4 O ) and 11 O(nt) respectively [8]. Leung et al. show the BVR to be efficient for solving many Planted (l, d) Motif Problem instances [8]. 12 3 Overview of Content Sensitive Consistency (CSC) The problem of motif finding can be formulated as a constraint satisfaction problem (CSP). Formulating in this way allows us to detect and remove spurious similarities via the use of constraint propagation techniques that enforces local consistency. This process works similar to the methods used in the winnower approach with equivalent time complexities. A CSP is a combinatorial search problem – well studied in the area of Artificial Intelligence. Essentially a CSP can be defined as: given a set of variables one must assign each variable a value that will satisfy a predefined number of constraints. Often a CSP requires a heuristic and combinatorial search method to be solved in reasonable time. Definition 6 (CSP): A CSP consists of: A set of variables X = {X1, X2, … , Xn} and a set of constraints C = {C1, C2, … , Cm}. Each variable Xi has a set of i possible domains D = {D1, D2, … , Di} where for each value j, Dj D defines a possible value. Each constraint is defined over a subset of variables and specifies the allowable combinations of values for that subset. Each state of this problem is defined by the assignment of values to a partial or a complete set of variables. We say an assignment is consistent when a variable takes on a value that does not violate any constraint Ci C for all of i. An assignment is complete when all variables have a value assigned (not necessarily satisfying all constraints). A solution to a CSP occurs when an assignment is both consistent and complete. A general way to solve a CSP is to incrementally make assignments to variables in a consistent way until a complete assignment is achieved. There are many methods available to perform this operation; however, because the CSP is NP-hard, most algorithms have a time complexity that is exponential. There are however, many algorithms that can achieve lower running times when based on particular problem specific features. The CSP can be used to formulate many different existing problems. We show how to formulate the clique finding problem as a CSP. Definition 7 (Motif finding as a binary CSP): Given a set of sequences S for an instance of the motif finding problem, we can define an instance of a CSP whose variables X correspond to S. 13 Each varriable xi X will have a set of possibble domains D = {D1, D2, …, D(n-1)-l} where eachh domain value v will co orrespond to each l-lengtth substring. For each paair of variabbles, there is a constrainnt requiring that t the two substrings correspondinng to the two domain valuues, assigned to the two variables, v mu ust have a Hamming H disttance less thhan or equal to 2d. A binary CSP P can be reprresented as a constraint graph g wheree each node represents r a variable and a each arcc represents constraint c beetween two variables. v T constrainnt for an arc is The that the Hamming H distance betweeen the two l-length substrings (relattive to the noodes) must be b less than 2d. This co onstraint grapph is construucted in mucch the same way w as the graph g in the winnoweer. 3.1 Con nstructing the Consttraint Graaph: Our grapph construction algorithm m takes in as input a set of o t sequencees S = {s1, s2, … , st} (eaach of lengthh n), a param meter l (lengthh of signal) and a a numbeer d (numberr of mismatcches). Givenn this input, a graph g G=(V,E E) will be coonstructed inn the followiing way: T algorithm The m iterates thrrough each sequence s in the t set S andd constructs vertices v thatt corresponnd to substriings of each sequence, thhat is, for evvery sequence si of the seet S and for each e index j inn sequence si we construuct a vertex v V that wiill represent the substringg sij of lengtth l from indeex j to (n-1) + l; for a tottal of t((n-1)) + l) vertices. For example given any arrbitrary sequuence si the figure f below w illustrates how h the sequuence would bee split into th he vertices coorrespondingg to substrinngs. 14 After running this procedure across all t sequences we would achieve a multi-partite graph similar to figure below: After this vertex constructing process the algorithm iterates through each vertex vi in each jth partite Sj and compares vi against every other vertex in each remaining partite. An edge between two vertices will correspond to vertices that have a hamming distance less than 2d, that is, we connect vertices vij and vab by an edge if and only if DH (vij, vab) ≤ 2d where vij and vab are from separate sequences. Constructing a graph in this way will encapsulate the signal in a clique; if we select any arbitrary edge between any two vertices, corresponding to a signal, that edge will have a maximum hamming distance of at most 2d [13]. However constructing the graph in this manner also gives rise to many spurious edges that are not part of the clique we want to find. In order to prune these edges we use constraint propagation. The main idea of constraint propagation is to repeatedly reduce the domain of each variable in order to be consistent with its arcs. Reducing the domains in this manner is what makes constraint propagation an effective pruning mechanism. Applying constraint propagation is what is known as achieving local consistency. The requirement of local consistency conditions are to take a CSP problem P and map that problem into another problem P’ without altering the original problem’s solutions. This mapping is referred to as constraint propagation. The main idea here is to create a problem that is less complex by reducing the domains of variables by applying constraints or adding new ones. There are several such local consistency conditions that exist; the ones we will concern with here are arc consistency, path consistency and the general case k-consistency. 15 3.2 Arc Consistency Arc consistency is one of the simplest forms of local consistency and is a process equivalent to the Winnower for k=1. To achieve arc consistency we apply an algorithm that will remove all unsupported values from the domains of individual variables to return a reduced CSP that is easier to solve (or find the problem to have no solution). Generally we say a CSP P is arc consistent if each of a variable’s admissible values in P is consistent with every other variable’s admissible values in P. Formally: Definition 7 (Formal Arc Consistency): Given a set of variables X = {x1, x2, …, xn} we say variable xi X is arc-consistent with another variable xj X (where xi ≠ xj) if for every value ai in the domain xi (there exists) a value bj in the domain of xj such that (ai, bj) satisfies each constraint between xi and xj. We will call such a variable bj a support of ai. Graphically we can illustrate how arc consistency works with the following binary network: We see S1 is arc consistent with S2 & S3. S3 is arc consistent with S1 & S2. S2 is arc consistent with S3 however not with S1. This is because the assignment S2 = 1 does not correspond to any value for S1 (no supporting variable for S2=1). We can therefore remove 1 from the domain of S2. Now S3 is no longer arc consistent with S2, because S3 = 2 does not correspond to any value for S2. We therefore remove 2 from the domain of S3. We continue to remove values until we end up with a new problem which is arc consistent: 16 W can use th We his idea of arrc consistenccy to prune away a spurious edges. Given G a graphh (as constructted by our grraph construuction algoritthm) G=(V,E E) which is a multi-partite graph witth t partites, we w simply en nsure that foor every verttex v in G theere exists a supporting s v variable u in each partite off G. That is, any given vertex v that remains r afteer arc consisttency is perfformed on G, G will have the property succh that there exists a suppporting verttex v in eachh partite - meeaning (u, v) is an edge. Anny vertex thaat does not have h this prooperty is not arc consistennt and thereffore will be removed along with any a edges thhat are conneected to it. The T algorithm m we use in CSC is AC66 as proposedd by Bessieree and Cordieer [3]. 3.2.1 Arrc Consisttency Imp plementation (AC6) There aree several waays to implem ment arc connsistency. Thhe GAC3 alggorithm propposed by Mackworrth [2, 10] iss a non optim mal algorithm m that achievves arc consiistency in O(er3dr+1) tim me and O(err) space (wheere r is the greatest g rankk among consstraints) [2]. An algorithhm that is shhown to be optimal is the AC4 A proposeed by Mohr and a Hendersson, with a tiime complexxity of O(ed2) p by Bessiere B andd and a spaace complexity of O(ed2) [2, 11, 12]. AC6 is an algorithm proposed Cordier [3] [ that comp promises bettween the GAC3 and thee AC4 insofa far that it maiintains the optimal run r time com mplexity of thhe AC4 (O(eed2)) with a lower spacee complexityy closer to that of GAC3 (O O(ed)). T main idea of the AC66 is not to coount how maany supportinng vertices each The e vertex has h (such as the t GAC3 does), d but ratther to ensure each vertexx has at leasst one supporrting vertex in each parttite. Upon in nitialization AC6 increm mentally iteraates through each vertex in the graphh and, for each e vertex v AC6 attem mpts to find a supporting variable u inn each partitte for vertex v. If 17 such a vertex u is found in each partite, then v will remain, otherwise it will be deleted. Briefly the AC6 algorithm (Appendix A) works as follows: The AC6 maintains a list S where S[xj, vj] stores all supporting values for which vj is the supporting variable for xj. The algorithm checks that there is a support vj for each variable xj in graph G. If there is no such variable vj, then xj will be removed and stored in list Q for future propagation. The propagation loop in AC6 checks the consequences of removing the variables that are stored in Q. When a vertex ai is chosen from Q, AC6 searches list S to check if ai was a support for some other vertex xj. It does so by checking list S for S[xj, aj]. If it was a supporting variable then a new support must be found for xj, if none is found xj will be removed and in turn be stored in list Q. If there is a support then this new support will replace aj and stored in list S. The algorithm continues in this manner until the list Q is empty at which point every remaining vertex in graph G will have supporting variables in each partite of graph G, and thus arc consistent. Arc consistency is an effective pruning measure for simple instances of the planted motif problem; however, a stronger consistency is required for more challenging instances [13]. 3.3 Path Consistency Path consistency works much like arc consistency, but considers pairs of vertices rather than a single vertex which is essentially the same pruning mechanism used in the winnower for k=2. We define path consistency formally as: Definition 8 (Formal Path Consistency): Given a set of variables X = {x1, x2, …, xn} we say pair (xi, xj) X are path consistent with another variable xk X (where xi ≠ xj) if for every pair of values (a, b) that satisfies the constraint between (xi, xj) (there exists) a value c in the domain of xk such that (a, c) and (b, c) satisfies each constraint between (xi, xk) and (xj, xk). We will call such a variable xk a supporting variable of pair (xi, xj). Graphically the figure below show’s an example of enforced path consistency: 18 The figure on the left is an example of a graph that is arc consistent but not path consistent. The dotted edges in the figure on the right indicate edges that would be removed if the graph was to be made path consistent. We use this idea of path consistency to prune away spurious edges. The same algorithm (AC6) is used with a slight modification. Instead of search for supporting variables for a single vertex xi we will search for supporting variables for edges (xi, xj). That is, for every edge (xi, xj) we need to search each partite for a support vj such that (xi, xj, vj) forms a triangle. Path consistency is a more powerful pruning mechanism than arc consistency; however, it is still insufficient for difficult instances of the planted motif problem [13]. One way to increase the pruning power is to increase to 4-consistency (k-consistency) which is equivalent to the winnower method for k=3. 3.4 k-consistency K-consistency generalizes the notion of arc consistency and path consistency to sets of variables: Definition 8 (k-consistency): Given a set of variables X = {x1, x2, …, xn} we say a sub set of variables X’ X of size k-1 are k-consistent if for every set of values A = (a1, a2, … , at) that satisfies the constraint between all variables in X’ there exists a value c in a variable uk such that no constraint is violated for (a1, a2, … , at, c). Such a variable uk is a supporting variable of X’. Given this formal definition, arc consistency is equivalent to 2-consistency and path consistency is equivalent to 3-consistency. For the winnower, Pevzner et al. show that k=3 is 19 sufficient as an effective pruning mechanism, which would be equivalent to 4-consistency for our CSP. The tradeoff for this high degree of accuracy is a rather high time complexity of O(N4). We will now look at a new approach which combines path consistency along with the notion of closest string for a time complexity of O(N3 + dd). 3.5 Content Sensitive Analysis Instead of increasing to 4-consistency we add a new constraint to our path consistency. Not only do we enforce path consistency but for every pair of nodes that have support we also ensure there is a center string among these three nodes we are analyzing. The figure below shows a graph that is path consistent on the left. The figure on the right shows that a center string c must also exist if the edges in the graph on the left are to remain. How the algorithm works is described below. 3.5.1 Content Sensitive Analysis Implementation (PC6+CS) The PC6+CS maintains an array of lists S where S[vj] stores all a list for which vj is the supporting variable for all pairs of nodes (xi, xj). The algorithm checks that there is a support vj for each edge (xi, xj) in graph G. If there is no such variable vj, then edge (xi, xj) will be removed and stored in list Q for future propagation. If there is a support vj found the algorithm then uses the CS algorithm to find if there exists a center string c for nodes (xi, xj, vj). If such a string exists (xi, xj) is stored in list S under otherwise (xi, xj) is removed from G and (xi, xj) is stored in list Q for future propagation. 20 The propagation loop in PC6+CS checks the consequences of removing the variable pairs that are stored in Q. When a pair (xi, xj) is chosen from Q, PC6+CS searches list S to check if (xi, xj) was a supporting edge for some other pair of vertices (xi, a) or (xi, a). It does so by checking list S[xi] for (xj, a), and S[xj] for pair (xi, a). If it was a supporting edge then a new support must be found for both (xj, a) and (xi, a), if none is found (xj, a) and (xi, a), will be removed and in turn be stored in list Q. If there is a support vj then, as before, a center string c will be sought after. If such a string c is found then vj will replace aj and stored in list S. The algorithm continues in this manner until the list Q is empty at which point every remaining vertex in graph G will have supporting variables in each partite of graph G, and that satisfies our constraints. 21 4 Experimentation and Results. In this section we will describe how we generated samples as well as describe our results of running AC6, PC6 and PC6+CS on this these samples. All analysis was based on implementations in Java. 4.1 Sample Generation To test the effectiveness of our algorithm it is important to be able to generate random samples. Our sample generation algorithm works as follows: The algorithm takes in as input numbers t, n, l and d. Given these numbers the algorithm will generate set of sequences S = {s1, s2, …, st} where each si is of length n and each position in si is a nucleotide selected at random from alphabet ‘A’, ‘C’, ‘T’, or ‘G’. A string s of length l will be randomly generated using this same alphabet. A set of strings M = {m1, m2, …, mt} are generated where each mi is a d-variant of s (i.e: DH(s, mi) = d for all i). The algorithm then randomly selects a position j in each sequence si and plants string mi at this position replacing all nucleotides in sequences si from j to j+l. 4.2 Results We test the performance and limitations of AC6, PC6 and PC6+CS on several instances of the Planted (l, d)-Motif Problem. First we will represent our results based on the (3, 15) problem. We run tests for each algorithm based on parameters t=20, l=15 and d=3. In instances of the (3,15) problem all three algorithms pruned away sufficient edges to expose the signal we are searching for. The chart below is based on the average results obtained from running each algorithm independently on 50 different samples each. The ‘-‘ indicates a time that is greater than 1 hour. 22 Sequence Length n 50 100 150 200 250 300 350 400 450 500 550 600 T Time (s) AC6 A 1.805 19.73 10 01.601 31 17.01 720.453 1320.43 219 90.797 ‐ ‐ ‐ ‐ ‐ PC6 0.063 0.179 0.369 0.666 1.045 1.552 2.095 2.76 3.621 4.64 5.792 7.058 PC6+CS 0.331 0.482 0.687 0.989 1.393 1.902 2.569 3.431 4.168 5.188 6.398 7.848 ustrates averaage time (in seconds) as a function of o sequence length l n, forr The grapph below illu locating the t signal. AC C6, PC6 & PC6+C CS vs. n 10000 Avg Time (s) Avg. Time (s) 1000 100 PC6 (with CS) 10 PC6 (no CS) 1 AC6 6 0.1 0.01 50 0 100 150 200 250 300 0 n 350 400 450 500 550 600 For this (3, 15 5) problem the t graph illuustrates an innteresting phhenomenon that t occurs. t AC6 has the lowest time t complexity, of all thhree algorithhms, it has thhe largest Though the running time t in practtice. An expplanation forr this interestting phenom menon is thatt for PC6 andd PC6+CS the conditio on for edge removal r is much m stricter,, therefore eddges are rem moved more 23 quickly resulting in less iterations and thus a shorter running. A similar observation was also made by Pevzner et al. [8]. AC6 tends to become impractical for n > 350 in comparison to PC6 and PC6+CS. The PC6 and PC6+CS find the signal under 10 seconds for relatively long sequence length (n=600). For this instance they both have roughly the same running time. Secondly we run tests for the (4, 15) problem. For this problem each algorithm is executed based on parameters t=20, l=15 and d=4. We first compare the AC6 with the PC6+CS. We randomly generated 50 samples for every value of n; for each sample we run the AC6 and then pass this arc consistent graph to the PC6+CS to assess how many edges were not pruned by the AC6. The chart below is based on the average results obtained from running both algorithms sequentially. We see that AC6 is accurate in finding the clique we are searching for n =50 however AC6 is insufficient for n > 50, and will be non effective for removing any edges at all for n ≥ 400. Sequence Length n 50 100 150 200 250 300 350 400 TIME (s) AC6 1.512 9.612 2.013 0.81 0.736 1.164 1.639 2.515 PC6 + CS 0.401 0.597 4.441 9.523 23.255 63.515 190.054 834.385 EDGES Initial Edges 31520 163346 401516 748116 1202603 1763251 2433053 3211088 AC6 prune 31140 151251 6753 996 132 23 12 0 PC6 + CS Prune 0 11709 394374 746725 1202066 1762820 2432635 3210672 We then executed another 50 tests on the (4, 15) problem comparing accuracy and efficiency between PC6 and PC6+CS. First we run PC6 on each sample, once PC6 is finish the resulting path consistent graph is sent to PC6+CS to examine how many edges PC6+CS will prune from an already path consistent sample. Interestingly the PC6+CS is slightly more accurate and becomes increasingly accurate as n increases, removing a trivial number of edges after the graph has been made path consistent. The running time for of the PC6 become impractical (over 4 hours) for n > 300. 24 Sequence Length TIM ME (s) n 50 100 150 200 250 300 PC6 PC6 + CS 0.432 0.393 1.604 0.451 3.666 0.503 9.421 0.593 46.869 0.725 941.822 2 0.852 EDGESS Initial Edgess 31353 163793 402334 749111 1202661 1764138 PC6 prunee 30970 163403 401941 748711 1202250 0 1763728 8 PC6 + CS Prune 1 1 2 4 8 7 Finally we w run a sepaarate test on the (4, 15) problem p usinng only the PC6+CS. P Foor our implemenntation of th he PC6+CS, the running time becom mes impracticcal for n> 450. Se equence Lenggth TIME (s) n 50 100 150 200 250 300 350 400 450 PC6 + C CS 0.898 8 2.209 9 4.562 2 9.497 7 22.534 4 608.37 7 188.15 56 834.38 85 3754.39 96 EDGES Initial Ed dges 31570 16344 44 40226 65 74884 45 1201833 1764560 24330 053 32110 088 4092968 PC6 ++ CS Prune 3 31188 163057 401872 74 48450 12 201438 17 764160 24 432647 32 210672 40 092541 mpares the ruunning timess of the bothh PC6 and PC C6+CS withh time as a The grapph below com function of n and illu ustrates that in i practice, PC6+CS P is a slightly moore efficient pruning P mechanissm than the PC6. Tim me (s) vs n 100 000 Avg. Time (s) 10 000 1 100 PC6 6 10 PC6 6+CS 1 0.1 50 00 150 200 250 300 350 3 400 450 0 10 n 25 5 Discussion and Conclusion The problem of motif finding plays a very important role in understanding gene regulatory mechanisms. An area of bioinformatics is concerned with modeling this problem in a way that will allow computational algorithms to lend aid in locating these motifs. One such model includes the Planted (l, d)-Motif Problem. In this thesis we have analyzed the efficiency and accuracy of several algorithms designed to solve this particular Planted (l, d)-Motif Problem, including: the Basic Voting Algorithm, the Winnower and the Closest String. We have presented a new formulation that composes the Planted (l, d)-Motif Problem as a CSP as well as introduced a new form of center string constraint consistency. Experimental results on a Java implementation, based on this formulation, locate signals accurately and efficiently for simulated data. However, using this implementation, we were unable to conclude the relative strength of this center string consistency when comparing it to the notion of Hamming distance. The impractical running times for this implementation did not allow us to test for large values of n where the PC6 is known to fail. This may be due to either a non optimal implementation or the relatively slow performance of Java. Future work may include an implementation in C so that we could achieve a practical running time with a large enough value of n such as n≥700 for the (4, 15) where the PC6 is known to fail [8]. Running the PC6+CS for these values of n is required to analyze the relative pruning strength of this new notion of consistency. 26 Appendex A Algorithm 1: AC6 [2, pg. 20] X ← 2D constraint graph Q ← ø S[xj , vj] = 0, vj D(xj ), xj X foreach xi X, cij C, vi D(xi) do vj ← smallest value in D(xj) s.t. (vi, vj) cij if vj exists then add (xi, vi) to S[xj , vj] else remove vi from D(xi) and add (xi, vi) to Q if D(xi) = ø then return false while Q ≠; do select and remove (xj , vj) from Q; foreach (xi, vi) S[xj, vj ] do if vi D(xi) then v’j ← smallest value in D(xj) greater than vj s.t. (vi, vj) cij if v’j exists then add (xi, vi) to S[xj, v’j] else remove vi from D(xi); add (xi, vi) to Q; if D(xi) = ø then return false return true Algorithm 2: Closest String [4, pg. 6] recursive procedure: CSd(s, ∆d) Global variables: Set of strings S = {s1, s2, . . . , sk}, integer d. Input: Candidate string s and integer ∆d. Output: A strings with maxi=1,...,kdH(s, si) ≤ d and dH(s, s) ≤ ∆d if it exists, and “not found,” otherwise. if (∆d < 0), then return “not found”; if (dH(s, si) > d + ∆d) for some i {1, . . . , k} then return “not found”; if (dH(s, si) ≤ d) for all i = 1, . . . , k then return s; Choose some i {1, . . . , k} such that dH(s, si) > d: P := { p | s[p] si[p] }; Choose any P’ P with |P’| = d + 1; for all p P do s’:= s; s’[p] := si[p]; sret:= CSd(s , ∆d − 1); If sret “not found” then return sret return “not found” 27 Algorithm 3: PC6+CS // initialization C ← a 2D constraint graph Q ← ø // Q list stores all deleted pairs of variables S[n] ← ø // where s[i] stores all pairs of variables supported by i. D ← all pairs of variables that have not yet violated any constraint // begin foreach pair (xai, xbj) do // where a and b indicate partites which x is a member foreach partite c where c ≠ a and c ≠ b do vc == checkForSupport(xai, xbj, c)) //return ‐1 if no support found if (vc ≠ ‐1) and (centerString(xai, xbj, vc)) then S[Vc] ← (xai, xbj) else C[xai][xbj] = 0 Q ← (xai, xbj) remove (xai, xbj) from D if D empty return false // no feasible solution // propagation while Q ≠ ø do select and remove (xai, xbj) from Q S’ ← each triple (yai, ybi, vc) of variables for which edge (xai, xbj) was a support foreach (yai, ybj) S’ do if (yai, ybj) D then vc == checkForSupport(yai, ybj, vc)) //starting at index vc in c if (vc ≠ ‐1) and (centerString(yai, ybj, vc)) then S[Vc] ← (yai, ybj) else C[xai][xbj] = 0 Q ← (yai, ybj) remove (yai, ybj) from D if D empty return false // no feasible solution return true 28 Algorithm 4: Basic Voting Algorithm [8, pg. 4] Create two hash tables V and R and set the value of each entry to be 0 {Table V keeps the number of votes received by each length-l sequence s. Each length-l sequence, received t votes is a candidate for motif. Hash table R ensures that each length-l sequence receives at most one vote from each input sequence. Si{j} is the j-th character in the ith input sequence Si and H(s) is the hash value of a length-l sequence s.} C ← ø for i ← 1 to t do for j ← 1 to n – l + 1 do for each length‐l sequence is N(Si[j … j + l – 1],d) do if R[H(s)] <> i then V[H(s)] ← V[H(s)] + 1 R[H(s)] ← i for j ← to n – l + 1 do for each length‐l sequence s in N(Si[j … j + l – 1],d) do if V{H(s)]= t then insert s into C 29 References [1] S. Balla, J. Davila and S. Rajasekaran. On the Challenging Instances of the Planted Motif Problem. http://www.engr.uconn.edu/becat/reports/BECAT-CSE-TR-07-2.pdf [2] C. Bessiere. Constraint Propagation. Handbook of Constraint Programming. Ch. 3, Pg. 29 – 85 ISBN 0-444-52726-5 [3] C. Bessiere, M.O. Cordier. Arc-consistency and arc-consistency again. In Proceedings AAAI’93, pages 108–113, Washington D.C., 1993. [4] J. Gramm R. Niedermeier. Fixed-Paramter Algorithms for Closest String and Related Problems. Algorithmica [0178-4617] Gramm yr:2003 vol:37 iss:1 pg:25 [5] J. Hu, B Li, D. Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33(15):4899-4913. [6] R. M. Karp. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, Proc. Sympos. IBM Thomas J. Watson Res. Center, Yorktown Heights, N.Y.. New York: Plenum, p.85-103. 1972. [7] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald and J. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 1993. [8] H. C.M. Leung, F.Y.L. Chin. Algorithms for challenging motif problems. Journal of bioinformatics and computational biology [0219-7200] Leung yr: 2006 vol: 4 iss: 1 pg: 43 -58 [9] M. Li, B. Ma, L. Wang. 2002. On the Closest String and Substring Problems. Journal of the ACM, Vol. 49, No. 2, March 2002, pp. 157-171. 30 [10] A.K. Mackworth. Consistency in networks of relations. Technical Report 75-3, Dept. of Computer Science, Univ. of B.C. Vancouver, 1975. (also in Artificial Intelligence 8, 99-118, 1977). [11] R. Mohr and T.C. Henderson. Arc and path consistency revisited. Artificial Intelligence, 28:225–233, 1986. [12] R. Mohr and G. Masini. Good old discrete relaxation. In ProceedingsECAI’88, pages 651– 656, Munchen, FRG, 1988. [13] P.A. Pevzner, S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 269–278, AAAI Press, 2000. [14] M. Vingron, and P. Argos, 1991. Motif recognitionand alignment for many sequences by comparison of dot-matrices. Journal of Molecular Biology 218:33-43. [15] M.Vingron, and P. Pevzner, 1995. Multiple sequence comparison and consistency on multipartite graphs. Advances in Applied Mathematics 16:1-22. [16] M.S. Waterman, R. Arratia and D.J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515-527. 1984. 31
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Undergraduate Research /
- Computer Modeling of Molecular Genetic Events using...
Open Collections
UBC Undergraduate Research
Computer Modeling of Molecular Genetic Events using Content Sensitive Analysis Stodgell, Curtis 2007
pdf
Page Metadata
Item Metadata
Title | Computer Modeling of Molecular Genetic Events using Content Sensitive Analysis |
Creator |
Stodgell, Curtis |
Date Issued | 2007 |
Description | The problem of motif finding plays an important role in understanding the development, function and evolution of organisms. The Planted (l, d)-Motif Problem first introduced by Pevzner and Sze [13] is a variant model of motif finding that has been widely studied. Nevertheless, despite the many different algorithms constructed to solve this problem, it is far from being solved entirely [5]. We analyze a number of algorithms including: the Basic Voting Algorithm, the Winnower and the Closest String. One thing that has become ubiquitous among these algorithms is the tradeoff between efficiency and accuracy. We formulate the motif-finding problem as a constraint satisfaction problem and introduce a special form of constraint consistency. By using a fixed-parameter algorithm for the closest string problem [4], we develop a propagation algorithm that enforces the new constraint consistency with the same worst-case time complexity as that of the standard path consistency algorithm. Experiments on randomly generated sequences indicate that our approach is effective and efficient. |
Genre |
Graduating Project |
Type |
Text |
Language | eng |
Series | University of British Columbia. COSC 449 |
Date Available | 2010-07-30 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0052231 |
URI | http://hdl.handle.net/2429/27050 |
Affiliation |
Irving K. Barber School of Arts and Sciences (Okanagan) Computer Science, Department of (Okanagan) |
Peer Review Status | Unreviewed |
Scholarly Level | Undergraduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 52966-Thesis Curtis Stodgell.pdf [ 647.54kB ]
- Metadata
- JSON: 52966-1.0052231.json
- JSON-LD: 52966-1.0052231-ld.json
- RDF/XML (Pretty): 52966-1.0052231-rdf.xml
- RDF/JSON: 52966-1.0052231-rdf.json
- Turtle: 52966-1.0052231-turtle.txt
- N-Triples: 52966-1.0052231-rdf-ntriples.txt
- Original Record: 52966-1.0052231-source.json
- Full Text
- 52966-1.0052231-fulltext.txt
- Citation
- 52966-1.0052231.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52966.1-0052231/manifest