PROCEEDINGS Open AccessLinearization of ancestral multichromosomalgenomesJán Maňuch1,2, Murray Patterson3,4*, Roland Wittler5, Cedric Chauve1, Eric Tannier3,4*From Tenth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Com-parative GenomicsNiterói, Brazil. 17-19 October 2012AbstractBackground: Recovering the structure of ancestral genomes can be formalized in terms of properties of binarymatrices such as the Consecutive-Ones Property (C1P). The Linearization Problem asks to extract, from a givenbinary matrix, a maximum weight subset of rows that satisfies such a property. This problem is in generalintractable, and in particular if the ancestral genome is expected to contain only linear chromosomes or a uniquecircular chromosome. In the present work, we consider a relaxation of this problem, which allows ancestralgenomes that can contain several chromosomes, each either linear or circular.Result: We show that, when restricted to binary matrices of degree two, which correspond to adjacencies, thegenomic characters used in most ancestral genome reconstruction methods, this relaxed version of theLinearization Problem is polynomially solvable using a reduction to a matching problem. This result holds in themore general case where columns have bounded multiplicity, which models possibly duplicated ancestral genes.We also prove that for matrices with rows of degrees 2 and 3, without multiplicity and without weights on therows, the problem is NP-complete, thus tracing sharp tractability boundaries.Conclusion: As it happened for the breakpoint median problem, also used in ancestral genome reconstruction,relaxing the definition of a genome turns an intractable problem into a tractable one. The relaxation is adapted tosome biological contexts, such as bacterial genomes with several replicons, possibly partially assembled. Algorithmscan also be used as heuristics for hard variants. More generally, this work opens a way to better understandlinearization results for ancestral genome structure inference.IntroductionGenomes, meant as the linear organization of genes alongchromosomes, have been successively modelled by severalmathematical objects. Sturtevant and Tan [1] first intro-duced permutations to study the evolution of genomestructure. Starting in the 1980’s [2], a large body of workfocused on the mathematical and algorithmic properties ofsuch models, including linear and circular genomes [3].Multichromosomal linear genomes have been defined asgeneralizations of permutations: they are permutations cutin several pieces [4]. In this framework, hardness results ofalgorithmic complexity were ubiquitous as soon as threegenomes were compared [5,6]. Even worse, if strings wereused to model duplications and heterogeneous gene con-tent, then even the basic problem of comparing two gen-omes proved to be hard [7].In order to scale up and handle the dozens of availablegenomes, another model was needed. Bergeron, Mixtackiand Stoye [8] proposed to use a graph matching betweengene extremities to define a genome. It simplified thepresentation of the Double-Cut and Join (DCJ) theory [9]at the expense of relaxing the model of chromosomalstructure as genomes could contain both linear and cir-cular chromosomes. This can be seen as an unrealisticrelaxation, as genomes are mostly either linear multi-chromosomal (eukaryotic nuclear genomes) or circularunichromosomal (bacterial or organelle genomes). But* Correspondence: Murray.Patterson@inria.fr; Eric.Tannier@inria.fr3INRIA Rhône-Alpes, 655 avenue de I’Europe, F-38344 Montbonnot, FranceFull list of author information is available at the end of the articleMaňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11© 2012 Maňuch et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.eukaryotes with organelles, or prokaryotes with severalreplicons, which have not yet been handled explicitly by aformal comparative genomics approach, arguably fit sucha model. An unexpected consequence of this relaxation isthat the comparison of three genomes with the break-point distance proved to be tractable, as an exact optimalmedian can be computed by solving a maximum weightperfect matching problem [10]. Moreover, the small par-simony problem, i.e., reconstructing the minimum num-ber of evolutionary events along a given phylogeny, canbe solved for any number of genomes for the Single-Cutand Join (SCJ) distance by Fitch’s parsimony algorithmon binary characters [11]. This opened the way to scal-able methods at the level of large multispecies datasets.An additional relaxation consists in allowing any graph,and not only a matching, to model genomes. Ancestralgenome reconstruction methods often first compute setsof ancestral adjacencies (neighborhood relations betweentwo genes) [12-14], intervals (neighborhood relationsbetween an arbitrary number of genes) [15-18], whichresult in non-linear structures. This, while also unrealisticat a first glance, allows computational breakthroughs, likeincorporating duplications and heterogeneous gene con-tent in the framework [19,20] with polynomial exactmethods. Beyond the significant computational speedup,nonlinear genomes may help to understand the amountof error in the data [20].Nevertheless, biological applications in general requirelinear genomes, which raises the question of linearizing acollection of adjacencies or intervals. The LinearizationProblem is, given a set of weighted intervals (the weightindicates a confidence value based on phylogenetic conser-vation of intervals), to find a maximum weight subsetwhich is compatible with a linear structure.According to the definition of a linear structure, thiscan be described by some variant of the Consecutive-Ones property (C1P) of binary matrices [10,15,17]: A bin-ary matrix has the C1P if its columns can be orderedsuch that in each row, there is no 0 entry between two 1entries. Here, each column is a gene or a gene extremityand each row is an interval. Adjacencies are a particularcase of intervals of size two: In that case, the matrix,which has degree 2, can be identified with a graph (ver-tices are columns and edges are rows). In the case ofadjacencies, the Linearization Problem translates to theMaximum Weight Vertex-Disjoint Path Cover Problem,so it is NP-complete. A variant handles genomes with asingle circular chromosome: A binary matrix has the cir-cular C1P (Ci1P) if its columns can be ordered such thatin each row, either there is no 0 entry between two 1entries, or there is no 1 entry between two 0 entries. Foradjacencies, the Linearization Problem contains the Max-imum Weight Hamiltonian Cycle Problem, so it is alsoNP-complete.To the best of our knowledge, there is currently notractability result known for the Linearization Problem.Currently all methods [12-14,19] rely on heuristic orexternal Traveling Salesman Problem solvers, or branchand bound techniques [10,15-18]. Moreover, none of thepreviously published methods is able to infer multichro-mosomal genomes, possibly with circular chromosomes,which is the natural model for bacterial genomes withplasmids.In the present paper, we prove that the LinearizationProblem for weighted adjacencies, when ancestral gen-omes can have several circular and linear chromosomes,is tractable. We prove this in a more general case, wheremultiple copies of columns are allowed. Here, instead ofa permutation of the columns, one asks for a sequenceon the alphabet of columns, containing at most m(c)occurrences of a column c. In the context of genomereconstruction, this allows to model genes with multiplecopies in an ancestral genome [21] or to include telomeremarkers [22].We show that this corresponds to finding a maximumweight f-matching, which, in turn, is reducible to findinga maximum weight matching. Also, following the com-plexity pattern already observed with the model of theC1P with multiplicities [21], we further show that theLinearization Problem for matrices with rows of degrees2 and 3 is NP-complete, even if all rows have the sameweight and multiplicity one. We discuss the possibilitiesthat our tractability result opens for ancestral genomereconstruction.ResultsA few definitions are needed to prove the two main resultsof this paper: (1) a polynomial algorithm for the lineariza-tion of degree 2 matrices with columns with multiplicityand weighted rows; and (2) an NP-completeness proof forthe linearization of matrices with rows of degrees 2 and 3,even if all multiplicities and row weights are equal to one.The degree of a row of a binary matrix M (over {0, 1})is the number of 1 entries in that row. The degree of Mis the maximum degree over all its rows. In genomics,the columns of M are the genes, and its rows are theintervals of genes. If a row has degree 2, the interval iscalled an adjacency. A degree 2 matrix M can be identi-fied with a graph, whose vertices are the columns of M,and edges are the adjacencies. We suppose that all rowsare different (and in consequence the graph is simple: ithas no multi-edges).A binary matrix (or submatrix) M has the ConsecutiveOnes Property (C1P) if its columns can be ordered suchthat in each row, the 1 entries are consecutive (there is no0 entry between two 1 entries). If M has degree 2, it hasthe C1P if and only if the corresponding graph is a collec-tion of vertex-disjoint paths. A matrix (or submatrix) MMaňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 2 of 11has the Circular Ones Property (Ci1P) if its columns canbe ordered such that in each row, either the 1 entries areconsecutive (there is no 0 entry between two 1 entries), orthe 0 entries are consecutive (there is no 1 entry betweentwo 0 entries); in other words the 1 entries are consecutivewhen the order of columns is viewed as a circle. If M hasdegree 2, it has the Ci1P if and only if the graph is a cycleor a collection of vertex-disjoint paths.Given maximum copy numbers m(c) for each columnc, M satisfies the Consecutive Ones Property with multi-plicities (mC1P), if there is a sequence S of columns, con-taining at most m(c) occurrences of column c, and foreach row r, the columns containing a 1 in r appear conse-cutively in S. The mCi1P is defined analogously.The MAX-ROW-C1P problem takes a binary matrixwith weighted rows as input and asks for a subset of rowsof maximum cumulative weight with the C1P. For graphsit is equivalent to the Maximum Weight Vertex-DisjointPath Cover Problem and thus the MAX-ROW-C1P isNP-complete [23]. The MAX-ROW-Ci1P problem takes amatrix with weighted rows as input and asks for a subsetof rows of maximum cumulative weight with the Ci1P.For graphs it can solve the Traveling Salesman Problemand thus the MAX-ROW-Ci1P is NP-complete [23].These two problems are classical and have been definedindependently from comparative genomics, but modelwell the linearization of genomes with linear chromo-somes, or a single circular chromosome, respectively. Butthe general case would better be modelled by the follow-ing. A matrix is component-mCi1P if there is a collectionof cyclic sequences of columns that satisfy the followingtwo conditions: (i) for each row r, the columns contain-ing a 1 in r appear consecutively in at least one of thecyclic sequences; and (ii) the total number of occurrencesof each column c in all cyclic sequences is at least oneand at most m(c). In the particular case where m(c) = 1for every column c, a matrix is component-mCi1P if itscolumns can be partitioned such that a row has 1s onlyin one part and each part is Ci1P. Here chromosomes aresequences, which mean possible ancestral gene orders. Itis then the matter of solving the following problem.MAX-ROW-component-mCi1PInput. A matrix with maximum copy numbers assignedto all columns and weighted rows;Output. A subset of rows of maximum cumulativeweight such that the obtained submatrix is component-mCi1P.Note that it is equivalent if some sequences are notrequired to be circular, so it handles well the case whereboth circular and linear chromosomes are allowed. It is arelaxation of the previous problems, so the NP-hardnessdoes not follow from them. And in fact, the problem fordegree 2 matrices (adjacencies) happens to be polynomial,as we now show in the next subsection.A solution for matrices of degree two with weighted rowsand multiplicitesFor a degree 2 matrix M, let GM be the correspondinggraph with a node for each column and a weighted edgefor each (weighted) row. Let m : V(GM)® N be the func-tion specifying the maximum copy number of each col-umn, i.e., the multiplicity limit for each vertex of GM. Wesay that matrix M (resp., the corresponding graph GM) iscomponent-mCi1P for m if there exists a collection of cyc-lic walks (resp., corresponding cyclic sequences) that satis-fies the following two conditions: (i) GM is a subgraph ofthe union of cyclic walks; and (ii) the total number ofoccurrences of each vertex v in all cyclic walks is at mostm(v).A 2m-matching of a graph G is a spanning subgraph ofG such that the degree of each vertex v Î V(G) is at most2m(v). The following lemma shows the correspondencebetween spanning subgraphs of G that are component-mCi1P for m and 2m-matchings of G.Lemma 1 A spanning subgraph of a graph G is com-ponent-mCi1P for m if and only if it is a 2m-matchingof G.We give a sketch of the proof. For more details, we referthe reader to [21], where a similar proof was given.Proof. First, assume a spanning subgraph G’ of G is com-ponent-mCi1P for m. Then there is a collection of cyclicwalks satisfying conditions (i) and (ii). Since each vertex vappears at most m(v) times in these cyclic walks and eachoccurrence has only two neighbors, the degree of v in G’ isat most 2m(v). Hence, G’ is a 2m-matching of G.Conversely, assume G’ is a 2m-matching of G. If degG’(v) < 2m(v) for some v Î V(G’), then we add a new vertexv0 and for each v such that degG’ (v) < 2m(v), we add anew edge {v0, v} with multiplicity 2m(v) − degG’ (v) to G’.Since now every vertex of G’ has even degree, each com-ponent C of G’ is Eulerian, i.e., there is a cyclic walk whichcontains all edges of C, and each v Î V(C) appears exactlym(v) times in the walk. If C does not contain v0 then thiscyclic walk satisfies conditions (i) and (ii) for vertices in V(C). If C contains v0, then after omitting all occurrences ofv0 we obtain a cyclic walk satisfying conditions (i) and (ii)for vertices in V(C). Hence, G’ is component-mCi1Pfor m. QEDIt follows that solutions to the MAX-ROW-component-mCi1P for matrix M and m correspond to maximumweight 2m-matchings of GM. Next, we give an algorithmfor finding a maximum weight f-matching of G with run-ning time (O((|V(G)| + |E(G)|)3/2)), where f : V(G) ® N.We will use a more general form of Tutte’s reduction forreducing the maximum weight f-matching problem to theMaňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 3 of 11maximum weight matching problem similar to the onespresented in [24,25].Given an edge weighted graph G and function f, con-struct G’ in the following way: For all x in V(G), let x1,x2, ..., xf(x) be in V(G’); and for all e = {x, y}in E(G), let exand ey be in V(G’). Now, for all e = {x, y} in E(G), let {x1,ex}, ..., {xf(x), ex}, {ex, ey}, {y1, ey}, ...,{yf(y), ey} be edges ofG’, and all these edges have weight w(e). This reductionis illustrated in Figure 1.Property 1 There is an f-matching in G with weight wif and only if there is a matching in G’ with weight w +W, where W =∑e∈E(G) w(e).An unweighted version of this property was shown in[25]. The weighted version can be shown in the sameway, and hence, we omit the proof.Since a maximum weight matching can be found intime O(√|V(G′)| · |E(G′)|)[26], we have polynomial O((|V(G)| + |E(G)|)3/2) algorithms for the maximumweight f-matching problem and for the MAX-ROW-component-mCi1P problem with multiplicities onmatrices of degree 2.Intractability for matrices of degree larger than twoThe tractability does not generalize to matrices, that is,the MAX-ROW-component-Ci1P is already NP-com-plete for unweighted matrices with rows of degrees 2and 3. Note that the result for unweighted matricesimplies NP-completeness also for the cases when rowsare weighted and/or columns have multiplicities.We will first show that the following hypergraph coveringproblem is NP-complete. Here we say that a hypergraphH = (V, E) is 2, 3-uniform when all of its hyperedges areeither 2-edges or 3-edges, that is, hyperedges that containexactly two or three vertices. We will also denote 2,3-uni-form hypergraphs H = (V, E) as H = (V, E2, E3), where E2(resp., E3) is its set of 2-edges (resp., 3-edges). We denotethe power set of a set S with P(S) (also known as 2S).Definition 1 A graph covering of a 2, 3-uniformhypergraph H = (V, E2, E3) is a graph G = (V, E’) suchthat there exist a map c : E2 ∪ E3 → P(E′), satisfying thefollowing for every h Î E2 ∪ E3:(a) for every h Î E, and for every e Î c(h), e ⊆ h;(b) |c(h)| = 1 if h Î E2 and |c(h)| = 2 if h Î E3; and(c) ∪h∈E2∪E3c(h) = E’.Here, we say that each set of edges c(h) covers thehyperedge h.Informally, a graph covering of a 2,3-uniform hyper-graph is a graph constructed by picking an edge fromeach 2-edge, and a pair of edges from each 3-edge.Problem 1 (The 2,3-Uniform Hypergraph Coveringby Cycles and Paths by Edge Removal Problem(23UCR Problem)) Given a 2, 3-uniform hypergraphH = (V, E) and an integer k, is there a graph covering ofH that consists of a collection of disjoint cycles andpaths after removing at most k hyperedges from E?Here we will show that Problem 1, the 23UCR Pro-blem, is NP-complete. Later in this section we will showthat this implies that the MAX-ROW-component-Ci1PProblem is NP-complete for matrices with rows ofdegrees 2 and 3. First, we must define the following NP-complete version of 3SAT, which we will use to showNP-completeness of Problem 1.Problem 2 (The 3SAT(2,3) Problem) Given a CNFformula φ with the following three properties, is φsatisfiable?(a) Formula φ has only 2-clauses and 3-clauses.(b) Each variable x of φ has exactly two positive occur-rences and one negative occurrence in the clauses.(c) Exactly one positive occurrence of x appears in the3-clauses, while the other two occurrences appear in the2-clauses.We show that this version of 3SAT is NP-completeusing a very similar proof to the one in [27], by reduc-tion from 3SAT.Theorem 1 The 3SAT(2,3) Problem is NP-complete.Proof. Clearly, the problem is in NP. We now show thatit is NP-hard by reduction from 3SAT by transforming agiven formula φ that is an instance of 3SAT to a formulaφ′ that is an instance of 3SAT(2,3) that is satisfiable ifFigure 1 Reduction used to transform the maximum weight f-matching problem to the maximum weight matching problem. Edgeweights are all one, unless otherwise indicated, and f is given by the white dots inside the nodes. The total edge weight in G is 8. The solidedges show a maximum weight f-matching in G (w = 6), and a corresponding maximum weight matching in G’ (of weight 6 + 8 = 14).Maňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 4 of 11and only if φ is satisfiable. For each variable x of φ thathas k occurrences, we first replace its k occurrences withx1, x2,..., xk, i.e., replace the i-th occurrence of x (as literalx or ¬x) with xi. We then add the following 2-clauses:x˜i ⇒ x˜i+1(i.e.,¬x˜i ∨ x˜i+1) for i = 1, ..., k - 1 and alsox˜k ⇒ x˜1, where for each i,x˜i ={xi if the i - th occurrence of variable x is positive, and¬xi otherwise.This “cycle” of implications (2-clauses) on x1,..., xk,ensures that for any truth assignment to the variables ofφ′, the values of x˜1, . . . , x˜k are either all set to true or allset to false. In the first case, the xi’s corresponding tothe positive occurrences of x, are set to true and the xi’scorresponding to the negated occurrence of x, are set tofalse. In the second case, the situation is reversed.Hence, any satisfying truth assignment to the variablesof φ′ can be translated into a satisfying truth assignmentto the variables of φ, and vice versa, i.e., φ′ is satisfiableif and only if φ is satisfiable. Since it is easy to verifythat this transformation can be done in polynomialtime, and that φ′ is indeed an instance of 3SAT(2,3), itfollows that the 3SAT(2,3) Problem is NP-complete.QEDWe now show that the 23UCR Problem is NP-com-plete by reduction from 3SAT(2,3).Theorem 2 The 23UCR Problem is NP-complete.Proof. Clearly, the problem is in NP. We will showthat it is also NP-hard by reduction from 3SAT(2,3).Given a 3SAT(2,3) formula φ with variables X = {x1, ...,xn} and set C2 ={c1, . . . , cm2}of 2-clauses (resp., setC3 ={c1, . . . , cm3}of 3-clauses), we construct a 2,3-uni-form hypergraph Hj = (V, E). Hypergraph Hj is com-posed of variable gadgets and clause gadgets whichcontains, among other vertices, a vertex for each literal ofφ (what we will refer to as literal vertices: there are 3n =2m2 + 3m3 such vertices). The design of Hφ is such thatthere is a graph covering G of Hφ that consists of a col-lection of disjoint cycles and paths after removing atmost m2 + n edges from E if and only if φ is satisfiable.For this proof, we call such a G a valid covering of H.Note that a valid covering does not contain any vertex ofdegree 3 or more.Figure 2a depicts the variable gadget for variable x ÎX with its two positive occurrences, labeled as x1 and x2,and its one negative occurrence ¬x in the clauses. Wecall the 3-edge {x’’, x’’’, x’’’’ } the auxiliary hyperedge,while the other two, {x1, x2, x’} and {¬x, x’, x’’}, are calledthe main hyperedges of the variable gadget.Figure 2b (resp., 2c) depicts the clause gadget for the 2-clause containing literals p, q (resp., and also r for a 3-clause). For the 2-clause gadget, we call the 2-edge {c, c’}the auxiliary hyperedge. We will refer to literals of j andthe literal vertices of the gadgets of Hφ interchangeablywhen the context is clear. We have the following claim.Lemma 2 Formula φ has a satisfying assignment ifand only if Hφ has a valid covering.Proof. “⇒” We first show that a satisfying assignmentof φ can be used to construct a valid covering of Hφ.For the variable gadget corresponding to x Î X, wefirst remove the main hyperedge that contains the literal(s) that is satisfied in the assignment, and then cover theremaining two edges as depicted in Figure 3: Figure 3a(resp., 3b) depicts how to cover the clause gadget whenx is false (resp., true) in the assignment. In this figure(as in all remaining figures of this paper), hyperedgesdrawn with dashed lines are removed, while the straightlines are edges picked in the covering.For a 2-clause (resp., 3-clause) c containing literals p,q (resp., and also r for a 3-clause), without loss of gener-ality let p be a literal that is satisfied in c (there has tobe such a literal since it is a satisfying assignment). If cis a 2-clause (resp., 3-clause), we cover the correspond-ing gadget as depicted in Figure 4a (resp., 4b).In the above covering, since exactly m2 + n hyper-edges were removed, and since it is easy to verify thateach vertex has degree at most 2, it follows that it is avalid covering of Hφ.“⇐” Now we show that if Hφ has a valid covering thenφ is satisfiable.For hypergraph Hφ = (V,E), we say that a graph G =(V, E’) selects a literal vertex for x Î X of Hφ if x is adja-cent to two edges of G in some clause gadget of Hφ.Obviously, selected vertices of G correspond to a satisfy-ing truth assignment of φ if and only if(i) in every clause gadget, at least one literal vertex isselected, and(ii) for every x Î X, at most one of x and ¬x isselected.We call a graph G = (V, E’) an expected behavior cov-ering of Hφ = (V,E) when each variable (resp., clause)gadget of Hφ is covered in a way depicted in Figure 3(resp., 4). It is easy to verify the following observation.Observation 1 If a valid covering G = (V, E’) ofHφ = (V,E) is also an expected behavior covering of Hφ,then G corresponds to a satisfying truth assignment of j.In the remainder of this lemma, we will give a set oftransformations that converts a valid covering into anexpected behavior covering while preserving the validityof the covering at each step. Assume that we have avalid covering of Hj.We say that a variable gadget is undecided in a validcovering of Hj if neither of its two main hyperedges isremoved. We first show that we can assume that thereare no undecided variable gadgets.Maňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 5 of 11Claim 1 We can transform a valid covering of Hj intoa valid covering that contains no undecided variablegadgets.Proof. To prove this claim we do a case analysis onthe possible configurations that an undecided variablegadget can have in a valid covering of Hj, and showhow we can locally transform each one to a decidedconfiguration without affecting the validity of the cover-ing of Hj.First, assume that the auxiliary hyperedge is removedin an undecided variable gadget. The set of possibleconfigurations that the gadget can be in is depicted onthe left in Figure 5. In this figure (as in all remainingfigures) double-headed arrows pointing to two verticesin a 3-edge represent the two coverings of this 3-edgeas explained in Figure 6.We can transform any configuration of Figure 5 to thedecided configuration on the right. It is easy to see thatin the transformed configuration, the number of hyper-edges removed is the same as in any initial configura-tion, and that vertices x’, x’’, x’’’ and x’’’’ (which do notintersect any vertex outside this variable gadget) havedegree at most 2. Finally, since each initial configurationof Figure 5 is part of a valid covering of Hj, and thedegree of any literal vertex (x1, x2 and ¬x) affected bythe transformation has only decreased or remained thesame, it follows that the covering of Hj remains validafter this local transformation.Hence, we can assume that the auxiliary hyperedge ispresent in any undecided variable gadget. Without lossof generality we can then assume that any configurationof the undecided variable gadget must be in one of thetwo forms depicted on the left in Figure 7. In each case,we can perform the corresponding transformationshown in Figure 7. Again it is easy to see that the num-ber of hyperedges removed has not increased, that ver-tices x’, x’’, x’’’, x’’’’, c and c’ (which do not intersect anyvertex outside of what is shown here) have degree atmost 2, and that degree of any involved literal vertexhas not increased. Hence, the covering remains valid.QEDWe have the following claim.Figure 2 (a) The variable gadget for variable x with literal vertices x1, x2 and ¬x, as well as the four auxiliary vertices x’, x’’, x’’’, x’’’’that do not appear in any other hyperedge of Hφ. (b) 2-clause gadget with literal vertices p and q, as well as the two auxiliary vertices cand c’ that do not appear in any other hyperedge of Hj. (c) 3-clause gadget with literal vertices p, q and r.Figure 3 Two coverings of the variable gadget for x, when x is set to: (a) false, or (b) true in the assignment.Maňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 6 of 11Claim 2 In any valid covering of Hj, at least onehyperedge is removed from each 2-clause gadget.Proof. If no hyperedge is removed from the 2-clausegadget (i.e., all hyperedges are covered) in a valid cover-ing of Hj, then vertex c (see Figure 2b) has degree 3,which contradicts the fact that this 2-clause gadget ispart of a valid covering of Hj. QEDBy Claims 1 and 2, at least n + m2 hyperedges havebeen removed from the variable and 2-clause gadgets,and since in any valid covering this is the maximumnumber of hyperedges which can be removed, we havethe following corollary.Corollary 1 We can transform a valid covering of Hjinto a valid covering where:(a) exactly one hyperedge is removed from each vari-able gadget and each 2-clause gadget of Hj, and(b) no hyperedge is removed from any 3-clause gadget.We have the following claim.Claim 3 We can transform a valid covering of Hj intoan expected behavior valid covering.Proof. Firstly, in the valid covering of Hj we can assume,by Claim 1 and Corollary 1, that exactly one main hyper-edge is removed and the auxiliary hyperedge is notremoved from each variable gadget. However, this doesnot imply expected behavior. All possible configurations ofa decided variable gadget without expected behavior andtheir corresponding transformations to the expected beha-vior covering are shown in Figure 8. Analogous to theproof of Claim 1, these local transformations do not affectvalidity of the covering.Secondly, in the valid covering of Hj we can assume,by Corollary 1, that exactly one hyperedge is removedfrom each 2-clause gadget. Assume now that a 2-clausegadget is not expected behavior covered. The only possi-ble such configuration can be transformed to theexpected behavior covering as shown in Figure 9. Again,these local transformations do not affect validity of thecovering.Thirdly, in the valid covering of Hj we can assume,again by Corollary 1, that the 3-clause gadget is covered,and hence it is also expected behavior covered (see Fig-ure 4b). Since all gadgets are expected behavior coveredin this valid covering of Hj, the claim holds. QEDIt follows by Observation 1 and Claim 3, that if Hjhas a valid covering, then j is satisfiable. This completesthe proof of the lemma. QEDFinally, since the construction of Hj is polynomial,then by Lemma 2 it follows that the 23UCR Problem isNP-complete. QEDLet the component-Ci1P by Row Removal Problem bethe corresponding decision version of the MAX-ROW-component-Ci1P Problem as follows.Figure 4 A covering of the (a) 2-clause gadget; and (b) 3-clause gadget, where literal p is satisfied.Figure 5 The transformation in the case when the auxiliary hyperedge of the variable gadget is removed.Maňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 7 of 11Figure 6 (a) a 3-edge with a double arrow pointing to two vertices. (b)-(c) the two configurations that are represented by (a).Figure 7 Two sets of possible configurations of an undecided variable gadget and the corresponding transformation of the covering.(Note that if edge {c, c’} is also missing in either initial configuration on the left, that the corresponding configuration on the right still applies,since c and c’ do not intersect any vertex outside this variable gadget, and the number of hyperedges removed has not increased.)Maňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 8 of 11Figure 8 Two sets of possible configurations of a decided variable gadget without expected behavior and the correspondingtransformations.Figure 9 The only possible configuration of a 2-clause gadget without expected behavior and the corresponding transformation.Maňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 9 of 11Problem 3 (The component-Ci1P by Row RemovalProblem) Given a binary matrix M and an integer k,can we obtain a submatrix that is component-mCi1P byremoving at most k rows from M?We now show that the component-Ci1P by RowRemoval Problem is NP-complete for matrices withrows of degrees 2 and 3.The following lemma shows the correspondencebetween the component-Ci1P by Row Removal Problemfor matrices with rows of degrees 2 and 3 and the23UCR Problem. A 2,3-uniform hypergraph H = (V, E)can be represented by a binary matrix BH with |V| col-umns and |E| rows, where for each hyperedge h Î E, weadd a row with 1’s in the columns corresponding to thevertices in h and 0’s everywhere else. Obviously, there isa one-to-one correspondence between 2,3-uniformhypergraphs and such matrices.Lemma 3 A 2,3-uniform hypergraph H = (V, E) can becovered by a collection of disjoint cycles and paths afterremoving at most k hyperedges from E if and only if matrixBH has the component-Ci1P after removing at most k rows.Proof. Assume first that H has a covering G that consistsof a collection of disjoint cycles and paths after removingat most k hyperedges from E. Remove the (at most k) rowsfrom BH that correspond to the hyperedges removed fromE. Each path (resp., cycle) O of G defines a cyclic order onits set of vertices. Consider the cyclic ordering of the col-umns of each component of BH corresponding to O. It iseasy to see that each such cyclic ordering is a Ci1P order-ing of its corresponding component, and hence BH has thecomponent-Ci1P after removing at most k rows.Conversely, assume that each component C = {v1,...,v|C|} of the submatrix of BH obtained by removing at mostk rows is Ci1P with respect to cyclic orderπ = vi1 , . . . , vi |C | of its columns. Consider the followingcovering G of H, after removing the (at most k) hyper-edges from E that correspond to the rows removed fromBH: for every hyperedge, pick the edge between two adja-cent columns/vertices in π. Note that every picked edgeis {vij , vij+1 } for some j, or {vi|C| , vi1}. Hence, G consists of acollection of disjoint cycles and paths. QEDBy Theorem 2 and Lemma 3 it follows that the com-ponent-Ci1P by Row Removal Problem is NP-completefor matrices with rows of degrees 2 and 3. Since thisdecision problem is NP-complete, it follows that theMAX-ROW-component-Ci1P Problem is also NP-com-plete for matrices with rows of degrees 2 and 3.Theorem 3 The MAX-ROW-component-Ci1P Problemis NP-complete.Discussion/ConclusionThere are exact optimization [12,19,20] or empirical[13-15] fast methods to construct ancestral adjacencieswhich do not necessarily form a linear signal. But todate, all linearization methods were heuristics or calls toTraveling Salesman Problem solvers [12-14,19]. More-over, no method is currently adapted to reconstruct bac-terial ancestral genome with plasmid(s), while thissituation is common in the living world.We report here two results: (1) a polynomial variant ofthe Linearization Problem, when the output allows pathsand cycles and a maximum number of copies per gene, inthe case of degree 2 matrices with weighted rows; and (2)an NP-completeness proof of the same problem formatrices with rows of degrees 2 and 3, even when multi-plicities and weights are equal to one.It is not the first time that a slight change in the formu-lation of a problem dramatically changes its computationalstatus [10]. Even if such a relaxation is less realistic in cer-tain contexts, solving the relaxation can also help toapproach efficiently the constrained problem, like for DCJand inversions/translocations for example [28,29]. Moregenerally, 2-factors (spanning subgraphs composed of col-lections of vertex-disjoint cycles) have been used toapproximate Traveling Salesman solutions [24], so gen-omes composed of several circular chromosomes can be away to approximate solutions for linear ones.Moreover, considering genomes composed of linear andcircular segments is appropriate for bacterial genomeswhere linear segments can be seen as segments of nottotally recovered circular chromosomes. Currently noancestral genome reconstruction method is able to handlebacterial genomes with plasmids, but rather they arerestricted to eukaryotes or bacterial chromosomes with asingle circular chromosome. For example, Darling et al.[30] reconstruct the ancestral genomes of Yersinia pestisstrains but are limited to the main chromosome by theirmethod, while there are 3 plasmids in most current spe-cies, and they are of capital importance since they are sus-pected to have provoked the pathogenicity of the plagueagent. So it is crucial to include them in evolutionary stu-dies, which justifies our model for future biological studies.Furthermore, genes are often duplicated in genomes,and in the absence of a precise and efficient phyloge-netic context, which is still absent for bacteria (noancestral genome reconstruction method is able to han-dle horizontal transfers for example), a multi-copyfamily translates into a multiplicity in the problemstatements.The ability to obtain such genomes in polynomial timefrom adjacencies also opens interesting perspectives forphylogenetic scaffolding of extant bacterial genomes[31] or more generally bacterial communities [32].These applications are left as a future work.AcknowledgementsWe thank Jens Stoye for useful discussions. JM and CC are funded by NSERCDiscovery Grants. MP is funded by a Marie Curie Fellowship from the AlainMaňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 10 of 11Bensoussan program of ERCIM. ET is funded by the Agence Nationale pourla Recherche, Ancestrome project ANR-10-BINF-01-01.This article has been published as part of BMC Bioinformatics Volume 13Supplement 19, 2012: Proceedings of the Tenth Annual Research inComputational Molecular Biology (RECOMB) Satellite Workshop onComparative Genomics. The full contents of the supplement are availableonline at URL.Author details1Department of Mathematics, Simon Fraser University, Burnaby BC, V5A1S6,Canada. 2Department of Computer Science, University of British Columbia,Vancouver BC, V6T1Z4, Canada. 3INRIA Rhône-Alpes, 655 avenue de I’Europe,F-38344 Montbonnot, France. 4Laboratoire de Biométrie et BiologieÉvolutive, CNRS and Université de Lyon 1, 43 boulevard du 11 novembre1918, F-69622 Villeurbanne, France. 5Genome Informatics, Faculty ofTechnology and Institute for Bioinformatics, Center for Biotechnology(CeBiTec), Bielefeld University, 33594 Bielefeld, Germany.Authors’ contributionsJM, MP, RW, CC and ET formalized and solved the linearization problemsand wrote the paper.Competing interestsThe authors declare that they have no competing interests.Published: 19 December 2012References1. Sturtevant A, Tan C: The comparative genetics of DrosophilaPseudoobscura and Drosophila Melanogaster. Journal of Genetics 1937,34:415-432.2. Watterson GA, Ewens WJ, Hall TE, Morgan A: The chromosome inversionproblem. Journal of Theoretical Biology 1982, 99:1-7.3. Fertin G, Labarre A, Rusu I, Tannier E, Vialette S: Combinatorics of genomerearrangements. MIT press 2009.4. Hannenhalli S, Pevzner P: Transforming men into mice (polynomialalgorithm for genomic distance problem). 36th Annual Symposium onFoundations of Computer Science, IEEE Comput. Soc. Press, Los Alamitos, CA1995, 581-592.5. Bryant D: The complexity of the breakpoint median problem. Tech RepCRM-2579 Centre de Recherches Mathématiques, Université de Montréal;1998.6. Caprara A: The reversal median problem. INFORMS Journal on Computing2003, 15:93-113.7. Blin G, Chauve C, Fertin G, Rizzi R, Vialette S: Comparing Genomes withDuplications: a Computational Complexity Point of View. ACM/IEEETransactions on Computational Biology and Bioinformatics 2007, 4:523-534.8. Bergeron A, Mixtacki J, Stoye J: A Unifying View of GenomeRearrangements. Algorithms in Bioinformatics, Proceedings of WABI’06,Volume 4175 of Lecture Notes in Computer Science 2006, 163-173.9. Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomicpermutations by translocation, inversion and block interchange.Bioinformatics 2005, 21:3340-3346.10. Tannier E: Yeast Ancestral Genome reconstruction: the possibilities ofcomputational methods. Comparative Genomics, Proceedings of RECOMB-CG’09, Volume 5817 of Lecture Notes in Computer Science 2009, 1-12.11. Feijao P, Meidanis J: SCJ: a breakpoint-like distance that simplifies severalrearrangement problems. IEEE/ACM Transactions on Computational Biologyand Bioinformatics 2011, 8:1318-1329.12. Ma J, Zhang L, Suh B, Raney B, Burhans R, Kent W, Blanchette M,Haussler D, Miller W: Reconstructing contiguous regions of an ancestralgenome. Genome Research 2006, 16:1557-1565.13. Bertrand D, Gagnon Y, Blanchette M, El-Mabrouk N: Reconstruction ofAncestral Genome subject to Whole Genome Duplication, Speciation,Rearrangement and Loss. Algorithms in Bioinformatics, Proceedings ofWABI’10, Volume 6293 of Lecture Notes in Bioinformatics 2010, 78-89.14. Muffato M, Louis A, Poisnel CE, Crollius HR: Genomicus: a database and abrowser to study gene synteny in modern and ancestral genomes.Bioinformatics 2010, 26:1119-1121.15. Chauve C, Tannier E: A methodological framework for the reconstructionof contiguous regions of ancestral genomes and its application tomammalian genomes. PLoS Computational Biology 2008, 4:e1000234.16. Stoye J, Wittler R: A Unified Approach for Reconstructing Ancient GeneClusters. IEEE/ACM Trans Comput Biol Bioinf 2009, 6(3):387-400.17. Chauve C, Gavranović H, Ouangraoua A, Tannier E: Yeast ancestralgenome reconstructions: the possibilities of computational methods II.Journal of Computational Biology 2010, 17:1097-1112.18. Jones BR, Rajaraman A, Tannier E, Chauve C: ANGES: ReconstructingANcestral GEnomeS maps. Bioinformatics 2012, 18.19. Ma J, Ratan A, Raney BJ, Suh BB, Zhang L, Miller W, Haussler D: DUPCAR:reconstructing contiguous ancestral regions with duplications. Journal ofComputational Biology 2008, 15:1007-1027.20. Bérard S, Gallien C, Boussau B, Szollosi G, Daubin V, E T: Evolution of geneneighborhood within reconciled phylogenies. Bioinformatics 2012.21. Wittler R, Maňuch J, Patterson M, Stoye J: Consistency of sequence-basedgene clusters. Journal of Computational Biology 2011, 18(9):1023-1039.22. Chauve C, Maňuch J, Patterson M, Wittler R: Tractability results for theConsecutive-Ones Property with multiplicity. Combinatorial PatternMatching, Proceedings of CPM’11, Volume 6661 of Lecture Notes in ComputerScience 2011, 90-103.23. Garey M, Johnson D: Computers and Intractability: A Guide to the Theory ofNP-completeness W. H. Freeman & Co; 1979.24. Lovasz L, Plummer MD: Matching Theory, Volume 29 of Annals of DiscreteMathematics North Holland; 1986.25. Dessmark A, Lingas A, Garrido O: On Parallel Complexity of Maximumf-matching and the Degree Sequence Problem. Proceedings of the 19thInternational Symposium on Mathematical Foundations of Computer Science1994 MFCS ‘94, Springer-Verlag; 1994, 316-325.26. Micali S, Vazirani VV: An OV⋅E algorithm for finding maximum matchingin general graphs. Proceedings of FOCS’80 1980, 17-27.27. Papadimitriou C: Computational Complexity Addison Wesley; 1994.28. Miklós I, Tannier E: Bayesian sampling of genomic rearrangementscenarios via double cut and join. Bioinformatics 2010, 26:3012-3019.29. Miklós I, Tannier E: Approximating the number of double cut-an-joinscenarios. Theoretical Computer Science 2011, 439:30-40.30. Darling AE, Miklós I, Ragan MA: Dynamics of genome rearrangement inbacterial populations. PLoS Genetics 2008, 4:e1000128.31. Husemann P, Stoye J: Phylogenetic Comparative Assembly. Algorithms forMolecular Biology 2010, 5:3.32. Pop M: Genome assembly reborn: recent computational challenges.Briefings in Bioinformatics 2009, 10:354-366.doi:10.1186/1471-2105-13-S19-S11Cite this article as: Maňuch et al.: Linearization of ancestralmultichromosomal genomes. BMC Bioinformatics 2012 13(Suppl 19):S11.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistributionSubmit your manuscript at www.biomedcentral.com/submitMaňuch et al. BMC Bioinformatics 2012, 13(Suppl 19):S11http://www.biomedcentral.com/1471-2105/13/S19/S11Page 11 of 11
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- Linearization of ancestral multichromosomal genomes
Open Collections
UBC Faculty Research and Publications
Linearization of ancestral multichromosomal genomes Maňuch, Ján; Patterson, Murray; Wittler, Roland; Chauve, Cedric; Tannier, Eric Dec 19, 2012
pdf
Page Metadata
Item Metadata
Title | Linearization of ancestral multichromosomal genomes |
Creator |
Maňuch, Ján Patterson, Murray Wittler, Roland Chauve, Cedric Tannier, Eric |
Publisher | BioMed Central |
Date Issued | 2012-12-19 |
Description | Background: Recovering the structure of ancestral genomes can be formalized in terms of properties of binary matrices such as the Consecutive-Ones Property (C1P). The Linearization Problem asks to extract, from a given binary matrix, a maximum weight subset of rows that satisfies such a property. This problem is in general intractable, and in particular if the ancestral genome is expected to contain only linear chromosomes or a unique circular chromosome. In the present work, we consider a relaxation of this problem, which allows ancestral genomes that can contain several chromosomes, each either linear or circular. Result: We show that, when restricted to binary matrices of degree two, which correspond to adjacencies, the genomic characters used in most ancestral genome reconstruction methods, this relaxed version of the Linearization Problem is polynomially solvable using a reduction to a matching problem. This result holds in the more general case where columns have bounded multiplicity, which models possibly duplicated ancestral genes. We also prove that for matrices with rows of degrees 2 and 3, without multiplicity and without weights on the rows, the problem is NP-complete, thus tracing sharp tractability boundaries. Conclusion: As it happened for the breakpoint median problem, also used in ancestral genome reconstruction, relaxing the definition of a genome turns an intractable problem into a tractable one. The relaxation is adapted to some biological contexts, such as bacterial genomes with several replicons, possibly partially assembled. Algorithms can also be used as heuristics for hard variants. More generally, this work opens a way to better understand linearization results for ancestral genome structure inference. |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2016-01-27 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0223772 |
URI | http://hdl.handle.net/2429/56726 |
Affiliation |
Computer Science, Department of Science, Faculty of Non UBC |
Citation | BMC Bioinformatics. 2012 Dec 19;13(Suppl 19):S11 |
Publisher DOI | 10.1186/1471-2105-13-S19-S11 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Maňuch et al.; licensee BioMed Central Ltd. |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 52383-12859_2012_Article_5520.pdf [ 1.38MB ]
- Metadata
- JSON: 52383-1.0223772.json
- JSON-LD: 52383-1.0223772-ld.json
- RDF/XML (Pretty): 52383-1.0223772-rdf.xml
- RDF/JSON: 52383-1.0223772-rdf.json
- Turtle: 52383-1.0223772-turtle.txt
- N-Triples: 52383-1.0223772-rdf-ntriples.txt
- Original Record: 52383-1.0223772-source.json
- Full Text
- 52383-1.0223772-fulltext.txt
- Citation
- 52383-1.0223772.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0223772/manifest