Variants of the Consecutive-Ones Property Motivated by the Reconstruction of Ancestral Species

by

Murray Patterson

BSc. (Honours) Computer Science, Acadia University, 2003
MSc. Computing Science, Simon Fraser University, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES
(Computer Science)

The University Of British Columbia
(Vancouver)

January 2012

© Murray Patterson, 2012

Abstract

The polynomial-time decidable Consecutive-Ones Property (C1P) of binary matrices, formally introduced in 1965 by Fulkerson and Gross [52], has since found applications in many areas. In this thesis, we propose and study several variants of this property that are motivated by the reconstruction of ancestral species.

We first propose the Gapped C1P, or the (k, δ)-Consecutive-Ones Property ((k, δ)-C1P): a binary matrix M has the (k, δ)-C1P for integers k and δ if the columns of M can be permuted such that each row contains at most k blocks of 1’s and no two neighboring blocks of 1’s are separated by a gap of more than δ 0’s. The C1P is equivalent to the (1, 0)-C1P. We show that for every bounded and unbounded k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete [55]. We also provide an algorithm for a relevant case of the (2,1)-C1P.

We then study the (k, δ)-C1P with a bound d on the maximum number of 1’s in any row (the maximum degree) of M. We show that the (d, k, δ)-Consecutive-Ones Property ((d, k, δ)-C1P) is polynomial-time decidable when all three parameters are fixed constants. Since fixing d also fixes k (k ≤ d), the only case left to consider is the (d, k, ∞)-C1P (when δ is unbounded). We show that for every d > k ≥ 2, deciding the (d, k, ∞)-C1P is NP-complete.

We also study the Consecutive-Ones Property with Multiplicity (mC1P), introduced by Wittler and Stoye [151]: a binary matrix M on columns S = {1, . . .
, n} has the mC1P for multiplicity vector m : S → N if there is a sequence σ on S such that (i) σ contains each s ∈ S at most m(s) times, and (ii) for each row r of M, the set of columns that have entry 1 in r forms at least one subsequence of σ. We show that deciding the mC1P, and two restricted variants thereof, is NP-complete, for M having maximum degree 3 (6 for one of the variants), and for m(s) ≤ 2 for all s ∈ S. We also give a tractability result for the mC1P that is motivated by handling telomeres in the reconstruction of ancestral species.

Finally, we study the Generalized Cladistic Character Compatibility (GCCC) Problem, a generalization of the Perfect Phylogeny Problem [137] introduced by Benham et al. [12]. We use the structure of the PQ-tree [21] associated with the C1P to give algorithms for several cases of the GCCC Problem.

Preface

This thesis is structured into six chapters. The first chapter gives a general overview of the C1P and the motivation for considering the four variants that we propose and study here. This chapter was written by me specifically for the thesis, with help from Cedric Chauve in structuring the content. Each of the four subsequent chapters is then dedicated to a particular variant. These four chapters form the results of this thesis, which have been published in several co-authored publications, as detailed below. The sixth chapter concludes this thesis with open questions and future work.

In Chapter 2, Cedric Chauve identified the (k, δ)-C1P and the motivation for studying this variant. The results of Sections 2.2 and 2.3 were found by Ján Maňuch and me, while Ján Maňuch wrote most of Section 2.2 and I wrote Section 2.3. The ideas of Section 2.4, with the exception of Condition 8, were found by me, and this section was also written by me. Finally, the idea of the construction of Section 2.5 was mine, and I wrote most of this section with some help from Ján Maňuch.
All the results, with the exception of Section 2.4, appear in our work Maňuch et al. [101]. Preliminary results on this appear in our published work Chauve et al. [29].

In Chapter 3, Cedric Chauve came up with the idea for the algorithm of Section 3.1, and also wrote most of this section, which was later expanded by me. The results of Section 3.2 were then found by Ján Maňuch and me. Ján Maňuch came up with the idea of using a hypergraph covering problem to show NP-completeness of deciding the (3, 2, ∞)-C1P, and wrote this up as well (Sections 3.2.1, 3.2.2 and 3.2.4). The generalization of this construction (Section 3.2.3) was then found by Ján Maňuch and me; I wrote it up and Ján Maňuch supplied the figures. The result of Section 3.1 can be found in our work Maňuch et al. [101] (and in our work Chauve et al. [29]). The results of Section 3.2 are the subject of our published work Maňuch and Patterson [100], while preliminary results on this appear in our published work Maňuch and Patterson [99].

The notion of the mC1P studied in Chapter 4 was formally defined by Wittler and Stoye [151], who also proposed the two variants of Section 4.2. All of the results of Sections 4.1 and 4.2 were found by Ján Maňuch and me, with some help from Roland Wittler. The ideas and work for the tractability result of Section 4.3 were then shared among Cedric Chauve, Ján Maňuch, Roland Wittler and me. In particular, Cedric Chauve and Roland Wittler worked on and wrote the subsection titled “The Case of a Single Multicolumn”, while Ján Maňuch and I worked on and wrote the subsection titled “Completing the Proof of Theorem 51”. All of the results of Sections 4.1 and 4.2 appear in our published work Wittler et al. [152], while the tractability result of Section 4.3 is the subject of our published work [31].

The work of Chapter 5 was an equal contribution of Ján Maňuch and me. The GCCC (at least its form) was first proposed in Benham et al. [12].
The results of Section 5.2 were then found and written by Ján Maňuch and me. The algorithm of Subsection 5.3.1 was found by me, and written with help from Ján Maňuch. In Subsection 5.3.2, Ján Maňuch came up with the idea of Lemma 63, while I came up with the idea of the structure based on PQ-trees [21, 106] for Lemma 65. Subsection 5.3.2 was then written by me. Ján Maňuch and I then came up with the idea of Subsection 5.3.3, and Ján Maňuch wrote this. The results of Section 5.4 were then found and written by Ján Maňuch and me.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
  1.1 The Consecutive-Ones Property
    1.1.1 An Introduction of the Consecutive-Ones Property
    1.1.2 Background: Deciding the Consecutive-Ones Property
    1.1.3 Applications of the Consecutive-Ones Property
  1.2 The Reconstruction of Ancestral Gene Orders
    1.2.1 A Basic Overview of the Reconstruction of Ancestral Gene Orders
    1.2.2 Previous Approaches to Reconstructing Ancestral Gene Orders
    1.2.3 Binary Matrices, the C1P and the Reconstruction of Ancestral Gene Orders (AGOs)
  1.3 Computational Solutions for non-C1P Matrices
    1.3.1 Transforming the Matrix to a C1P Matrix
    1.3.2 Relaxing the C1P
    1.3.3 Matrices of Bounded Degree
    1.3.4 Matrices with Columns of Multiplicity
  1.4 The Generalized Cladistic Character Compatibility Problem
2 The Gapped Consecutive-Ones Property
  2.1 Notation and Conventions
  2.2 Fixing the Order of Selected Columns in a Matrix
  2.3 The Complexity of Deciding the (k, δ)-C1P
    2.3.1 The Complexity of Deciding the (k, δ)-C1P for every k, δ ≥ 2
    2.3.2 The Complexity of Deciding the (k, 1)-C1P for every k ≥ 3
  2.4 The (2,1)-C1P
    2.4.1 The Algorithm
  2.5 The Complexity of Deciding the (∞, δ)-C1P
    2.5.1 The 3SAT(L:2,R:2) Problem
    2.5.2 The Complexity of Deciding the (∞, 1)-C1P
    2.5.3 The Complexity of Deciding the (∞, δ)-C1P
3 The Gapped Consecutive-Ones Property for Matrices of Bounded Maximum Degree
  3.1 An Algorithm for Matrices of Bounded Maximum Degree
  3.2 The (d, k, ∞)-C1P
    3.2.1 A Hypergraph Covering Problem
    3.2.2 The 3-Uniform Hypergraph 1-Covering by Paths Problem
    3.2.3 The d-Uniform Hypergraph p-Covering by Paths Problem
    3.2.4 The Complexity of Deciding the (d, k, ∞)-C1P
4 The Consecutive-Ones Property with Multiplicity
  4.1 The Consecutive-Ones Property with Multiplicity (mC1P)
  4.2 Two Variants of the mC1P
    4.2.1 The Consecutive-Ones Property with Multiplicity for Framed Rows (mC1P(fr)) Variant
    4.2.2 The Consecutive-Ones Property with Multiplicity for Nested Rows (mC1P(ne)) Variant
  4.3 A Tractability Result for the Consecutive-Ones Property with Multiplicity
    4.3.1 Preliminaries
    4.3.2 A Tractable Case of Deciding the mC1P
    4.3.3 Building a PQ-tree which Describes All Sequences that Satisfy the Consecutivity Requirement
5 The Generalized Cladistic Character Compatibility Problem
  5.1 The Generalized Cladistic Character Compatibility (GCCC) Problem
  5.2 Ordering Problems
  5.3 Tractability Results
    5.3.1 An Algorithm for Cases of the Single-Branch GCCC Problem
    5.3.2 The Benham-Kannan-Warnow (BKW) Case of the Single-Branch GCCC-NB (SB-GCCC-NB) Problem is Polynomial-Time Solvable
    5.3.3 The {{1}, {2}, {0, 2}}-Path GCCC-NB (P-GCCC-NB) Problem
  5.4 Hardness Results
6 Conclusion
  6.1 Chapter 2: The (k, δ)-C1P
  6.2 Chapter 3: The (d, k, δ)-C1P
  6.3 Chapter 4: The mC1P
  6.4 Chapter 5: The GCCC Problem
Bibliography

List of Tables

Table 5.1 Complexity of all cases of the GCCC Problem for the character tree 0 → 1 → 2 and set of states chosen from the set Q ⊆ {{0}, {1}, {2}, {0, 2}, {0, 1, 2}}. The BKW Case is marked with *.

List of Figures

Figure 1.1 (a) A binary matrix M that has the C1P. (b) A consecutive-ones (C1) order of M. (c) A binary matrix that does not have the C1P [145].

Figure 1.2 (a) A binary matrix M that does not have the C1P. (b) The bipartite graph GM corresponding to M where the black (resp., white) vertices correspond to the columns (resp., rows) of M. Graph GM contains the asteroidal triple g, h, j.

Figure 1.3 (a) A binary C1P matrix M. (b) The PQ-tree TM for M. Here, TM has internal (circular) P-nodes and (rectangular) Q-nodes, and leaf nodes for the set of columns of M. A leaf order of TM obtained by taking any arbitrary (resp., forward or reverse) permutation of the children of a P-node (resp., Q-node) represents a C1 order of M. Note that the current configuration of TM represents C1 order bdaecgf of M. (c) Another configuration of PQ-tree TM representing C1 order dgcaebf of M. Note that TM has 4! · 2 · 2 = 96 configurations, and hence M has 96 C1 orders.

Figure 1.4 (a) A binary matrix M that does not have the C1P. Note that M is the matrix of Figure 1.3a with a fourth row added, causing this matrix to not have the C1P. (Note also that M contains the forbidden submatrix of Figure 1.2a.) (b) The PQR-tree TM for M. Here, TM has the additional third type of internal (diamond-shaped) R-node. An R-node represents a part of M which does not have the C1P (contains a conflicting set of columns of M). Note that the PQR-tree of the matrix formed by the first three rows of M is equivalent to that shown in Figure 1.3b.
Figure 1.5 A Sequence Tagged Site (STS) physical map of the kallikrein gene region. The positions of the markers are depicted along the top, and the clones are shown as horizontal lines. The markers were developed from clone insert ends (red) and kallikrein genes (blue). The unfilled squares on clones 338F22 and 003F08 show markers not analysed. (source: http://westnilevirus.okstate.edu/research/2004rr/13/13.htm)

Figure 1.6 The alignment of a human genome against several mammals and a chicken genome. Here, the regions of each genome that code for the Apolipoprotein A1 gene (a gene that has an important role in lipid metabolism) are highly conserved, and hence similar, for all mammals. (source: http://www.lbl.gov/tt/techs/lbnl1690.html)

Figure 1.7 An illustration of the inference of syntenies for the ancestor common to set S = {human, mouse, dog} of species with outgroup species chicken. A synteny is a group of markers that appears together in at least two species whose path goes through the considered ancestor. Here, the first synteny appears in the human and the dog, and the second is inferred from the chicken and the mouse, while the fourth one appears in all three species of S. These syntenies can be weighted according to how often they appear in the existing species, i.e., this fourth synteny would be weighted more heavily than the first two. (animal skeleton reproduced with permission from www.bigstock.com)

Figure 1.8 (a) Human-mouse nets [85] with human as the reference. Four mouse intervals are depicted, as ordered and oriented by the orthologous human segments. The second and third mouse intervals are adjacent (and appropriately oriented) on a mouse chromosome, and the intervening bases, if any, do not align to human, and are depicted by a thin line connecting these intervals. (b) The human-mouse, human-rat and human-dog nets for a segment of the human sequence, which illustrates the construction of orthology blocks (OB).
(c) The construction of conserved segments (CS) from the fusion of runs of consecutive orthology blocks whenever the order and orientation of these blocks are conserved in each of the existing genomes. (source: Ma et al. [96])

Figure 1.9 The set of Contiguous Ancestral Regions (CARs) for the Boreoeutherian ancestral genome (of human, rat, mouse and dog) constructed from the experiments of Ma et al. [96]. Numbers above bars indicate the corresponding human chromosomes. (source: Ma et al. [96])

Figure 1.10 The binary matrix M corresponding to the set of syntenies inferred in Figure 1.7 and the PQ-tree TM for M. Note that each CAR is a child of the root P-node r of TM.

Figure 2.1 Possible positions of columns 2δ + 2 and 2δ + 3.

Figure 2.2 The structure of Mφ and the five rows encoding clause c2 = {v2 ∨ ¬v3 ∨ v1}.

Figure 2.3 The structure of the construction for a 3CNF formula φ on the set V of variables and C of clauses, along with the 3 rows encoding the clause cj = {v1 ∨ v2 ∨ ¬v3}. The blocks b1, . . . , b|V| correspond to the variables of φ in exactly the same way as in the construction of Subsection 2.3.1. The blocks D1, . . . , D|C| correspond to the clauses. Here, for i ∈ {1, 2, 3}, r̂i is row ri restricted to the columns of Bt, and Pt1, Pt3 (resp., Pt2) are sets of permutations that do not place any 0 to the left (resp., right) of any 1 in Bt in rows r1, r3 (resp., r2). It follows that all truth assignments to the literals of cj are (2,1)-C1 orders except for the case when all 3 literals are false (cj is not satisfied), since Pt1 ∩ Pt2 ∩ Pt3 = ∅. Note that for each i ∈ {1, . . . , |V|}, rows can be added to force the copy of variable block bi on the left and right of the clause blocks to encode the same truth value.
Figure 2.4 The structure of matrix Mφ.

Figure 3.1 (a) A simple dependency on 1-coverings of two touching hyperedges enforced by a copy of D (depicted as a diamond). (b) The 2-clause and (c) 3-clause gadgets for clause ci.

Figure 3.2 (a) The variable gadget for variable with positive occurrences $c_i^p$ and $c_j^q$ and negated occurrence $c_k^r$ in the clauses. The dashed edge is always picked in any valid 1-covering. (b) Grey edges are picked when this variable is set to false in a satisfying assignment of φ. (c) Grey edges are picked when the variable is set to true.

Figure 3.3 Hypergraph $D_{d,p}$: only one of the $\binom{|S|}{d-p}$ hyperedges is shown.

Figure 3.4 The path G′ through vertex set S ∪ P that alternates between subpaths completely in S and completely in P. Some of the shown edges may be virtual.

Figure 3.5 Hyperedge h of $D_{d,p}$ which contains less than p edges from G′ depicted in Figure 3.4.

Figure 3.6 A valid p-covering of $D_{d,p}$ in which vertex v has degree 1.

Figure 3.7 Vertices and hyperedges added to $\bar{H}$ to simulate the 3-edge h = {a, b, c}. The grayed diamonds depict copies of $D_{d,p}$.

Figure 4.1 Graphical representations of the (a) 2-clause gadget and (b) 3-clause gadget for clause ci. The multiplicity of the columns (resp., vertices) is indicated by the number of dots. Rows are depicted by ellipses surrounding two vertices or triangles surrounding three vertices, respectively.

Figure 4.2 Graphical representation of the variable gadget for variable $x_\ell$ with positive occurrences $c_i^{\alpha}$ and $c_j^{\beta}$ and negated occurrence $c_k^{\gamma}$ in the clauses.

Figure 4.3 Graphical representations of the (a) 2-clause gadget and (b) 3-clause gadget for clause ci in the mC1P(ne) case.
Figure 4.4 Graphical representation of the variable gadget for variable $x_\ell$ with positive occurrences $c_i^{\alpha}$ and $c_j^{\beta}$ and negated occurrence $c_k^{\gamma}$ in the clauses in the mC1P(ne) case.

Figure 4.5 (a) Binary matrix M, with matched multirows. Let m(1) = · · · = m(5) = 1 and m(a) = m(b) = 2: a and b are multicolumns and r1, r3 and r4 are multirows. Row r3 is not minimal, because it contains r4. (b) The corresponding matrix $\hat{M}$. Since in $\hat{M}$, by definition $\hat{r}_i = r_i$ for all multirows ri, the matched multirows are discarded.

Figure 4.6 (a) Binary matrix M, with matched multirows. Let m(c′) = 2. (b) PQ-tree belonging to the equivalence class $PQ_{\hat{M}}$. P-nodes are represented by circular nodes and Q-nodes by rectangular nodes. An example of a valid C1 order with multiplicity is c′ 1 2 3 4 c′ 7 8 9 5 6, which is obtained by taking the equivalent PQ-tree with frontier 1 2 3 4 7 8 9 5 6 and inserting two copies of c′ into the corresponding positions. Notice that inserting c′ between 2 and 3 would break row r2. Illustration of Algorithm 2. LCA($\hat{r}_1$) and the respective segments of LCA($\hat{r}_{3,4}$) are highlighted in gray and the respective paths are depicted by dashed lines. The upper left edge is contained in two paths. Here, K1 = 1 and K2 = 1, thus K = 2 ≤ m(c′) = 2.

Figure 4.7 Augmented PQ-tree T′ for the matrix given in Figure 4.6. (In fact, to get an augmented PQ-tree from the original PQ-tree shown in Figure 4.6, no modifications are necessary other than attaching leaf nodes labeled c′ at appropriate locations.) Only the trees in the equivalence class of T′ where the left side of the right Q-node is placed adjacent to the left Q-node have shortened frontiers that meet the multiplicity constraint (m(c′) = 2), for example, c′ 1 2 3 4 c′ 7 8 9 5 6.
Figure 4.8 Transformation rules for the LCAs to construct an augmented PQ-tree. An LCA and its parent node are replaced by the nodes shown on the right. The LCA (or the segment of an LCA, respectively) is highlighted in gray.

Figure 4.9 Transformation rules for bottom-up iteration to construct an augmented PQ-tree. A newly created Q-node and its parent node are replaced by the nodes shown on the right.

Figure 4.10 Special transformation rules for bottom-up iteration to construct an augmented PQ-tree. A newly created Q-node two levels below the root node and its parent node are replaced by the nodes shown on the right.

Figure 5.1 (a) A matrix M with entries from set {0, 1, 0−, 0+}. (b) PQ-tree PQM for M where the labels of the special zeros (0− and 0+) have been “forgotten”.

Glossary

C1P  Consecutive-Ones Property, a property of binary matrices
AGO  Ancestral Gene Order
DNA  Deoxyribonucleic Acid, a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms (with the exception of RNA viruses)
RNA  Ribonucleic Acid, one of the three major macromolecules (along with DNA and proteins) that are essential for all known forms of life
STS  Sequence Tagged Site mapping, a type of physical mapping of DNA [108, 119]
CAR  Contiguous Ancestral Region, a set of genes that remain together in some (reconstructed) ancestral genome [96]
k-C1P  k-Consecutive-Ones Property
(k, δ)-C1P  (k, δ)-Consecutive-Ones Property
(d, k, δ)-C1P  (d, k, δ)-Consecutive-Ones Property
d-UH-p-CP  d-Uniform Hypergraph p-Covering by Paths Problem
mC1P  Consecutive-Ones Property with Multiplicity
mC1P(fr)  Consecutive-Ones Property with Multiplicity for Framed Rows, a restricted variant of the mC1P
mC1P(ne)  Consecutive-Ones Property with Multiplicity for Nested Rows, another restricted variant of the mC1P
GCCC  Generalized Cladistic Character Compatibility Problem, a generalization of the Perfect Phylogeny Problem [137]
GCCC-NB  GCCC with non-branching character trees Problem, a special case of the GCCC Problem in which character trees have a single branch, i.e., each character tree Tα is 0 → 1 → · · · → |Tα| − 1
SB-GCCC-NB  Single-Branch GCCC-NB Problem, the case of the GCCC with non-branching character trees (GCCC-NB) Problem where we restrict the solution (a phylogeny tree) to have only one branch starting at the root
P-GCCC-NB  Path GCCC-NB Problem, the case of the GCCC-NB Problem where we restrict the solution (a phylogeny tree) to have only two branches starting at the root
BKW  Benham-Kannan-Warnow Case, a case of the GCCC Problem that is of particular interest to the biological setting that motivated this problem
FPT  Fixed Parameter Tractable
PTC  Path Triple Consistency Problem
LEF-PTC  Left Element Fixed Path Triple Consistency Problem
REF-PTC  Right Element Fixed Path Triple Consistency Problem
OEF-TO  One Element Fixed Total Ordering Problem
QC  Quartet Consistency Problem
TO  Total Ordering Problem
NAE-3SAT  Not-All-Equal-3SAT
E-C1P  Extended Consecutive-Ones Property, a property of matrices with entries from set {0, 1, 0−, 0+}

Acknowledgments

First, I would like to thank the members of my supervisory committee, Ján Maňuch, Cedric Chauve, Arvind Gupta and Anne Condon.

I am very grateful to Ján Maňuch for spending the time to regularly meet and discuss or work on problems, as well as to help proofread and improve my writing. Ján’s dedication and great ideas have provided great motivation and direction through tough problems, where a solution seemed nowhere in sight. Indeed, most of what I know about doing scientific research has come through working with Ján, and without him as a mentor, I would not have been able to write this work, or even obtain its results.
I would like to thank Cedric Chauve for providing a wealth of ideas, and ultimately, problems that are relevant to the area of computational biology. Because Cedric is always on the frontier of important research in computational biology, this work contains results on many problems that are not only interesting, but also relevant to this area. Cedric’s involvement in the research community has also provided great networking opportunities, one of which has led to the position that I plan to hold after this degree.

I am grateful to Arvind Gupta for securing the majority of the funding for this research, as well as pointing me to viable career, research and funding opportunities. Of the members of my committee, I have been working with Arvind the longest, and during this time he has opened up many opportunities. I hope he considers it as good an investment as I have.

Finally, of my supervisory committee, I would like to thank Anne Condon for helping to familiarize me with the Beta Lab and community at the University of British Columbia (UBC) after my transfer to UBC with my senior supervisor and his other students halfway through this degree. Anne has also provided very helpful feedback on my thesis.

I also wish to thank my external examiner Binhai Zhu, as well as university examiners William Evans and Paul Pavlidis for their helpful feedback.

I gratefully acknowledge funding from the following sources. First, I would like to thank the Natural Sciences and Engineering Research Council (NSERC) of Canada for providing three years of funding with a PGS-D3 Scholarship. For this, I also thank Arvind Gupta and Eugenia Ternovska (a supervisor of my masters thesis), who wrote letters that no doubt contributed greatly to the success of my application for this award.
I would like to thank the School of Computing Science at Simon Fraser University (SFU), the university where I obtained my masters and then began this degree, for its NSERC award top-up, as well as for several graduate fellowships. I also thank Canadian Liquid Air Ltd. for a graduate scholarship. I would like to thank UBC for the remainder of its NSERC award top-up when I transferred there, as well as the tuition scholarship that comes with holding an NSERC award.

Two co-authors whom I have not mentioned above, but who were important to several results in this work, are Roland Wittler and Jens Stoye, who introduced the concept of the Consecutive-Ones Property (C1P) with Multiplicity (Chapter 4 of this thesis). I thank them for this, as well as for helpful discussions about research and career opportunities.

I would like to thank my friends and colleagues at, or affiliated with, the Beta Lab at UBC, in particular Chris Thachuk for helping to familiarize me with UBC when I transferred here, as well as for many insightful discussions about research and career opportunities. I would also like to thank Bonnie Kirkpatrick, Jeff Sember and Frank Hutter at the Beta Lab for the helpful discussions, and for making the lab a lively and engaging place to work.

I would also like to thank the many friends and colleagues I have made at SFU, where I did my masters thesis, as well as a good amount of this degree. In particular, I would like to thank Phuong Dao for the many discussions about research and career opportunities as well as for all the social times we had, when we wanted to take a break from all of the work. I would like to thank George Ma for the encouragement and help during the low points of this degree, as well as for the helpful discussions about career and life. I would also like to thank Osama Saleh and Jingyun Chen for the lively and insightful discussions on many subjects; their senses of humour through tough times have been a good morale booster.
I would also like to thank Fereydoun Hormozdiari, Iman Hajirasouliha, as well as the many other friends and colleagues at SFU whose discussions helped to shape my research and career plans, and who helped to encourage and support a community where such ideas can be discussed.

Finally, I am grateful to the many friends I have met in Vancouver, which does not exclude those mentioned above, who made the time enjoyable as I did this degree. I especially thank my family and friends back home in Nova Scotia for sticking with me and giving me the encouragement to undertake this challenging venture for such a long time, and in such a far away and different place.

Dedication

To my father, Kenzie Patterson, who always pushed for higher education. Since a PhD is the highest form of education one can obtain, I hope that he would be proud.

Chapter 1

Introduction

This thesis concerns variants of the Consecutive-Ones Property (C1P) of binary matrices. In particular, we define and study here four ways of generalizing the C1P in order to better model various scenarios of the reconstruction of ancestral species from a computational point of view. The first three of these are motivated by the reconstruction of Ancestral Gene Orders (AGOs) [27], while the last is motivated by the Generalized Cladistic Character Compatibility (GCCC) Problem [12].

First we give an overview of the C1P, some historical background of the property, and its applications. We then give a detailed overview of the reconstruction of AGOs and show its relation to the C1P. We then show that, as in many other applications, the reconstruction of AGOs often involves handling matrices that do not have the C1P. This then leads to the first contribution of this thesis: to offer three ways of generalizing the C1P in order to address this problem raised in the reconstruction of AGOs. Finally we introduce and motivate the GCCC Problem.
We then propose our fourth and final variant of the C1P that leads to an algorithm for a case of this problem.

      a b c d e          c a b d e          f g h i j
      1 1 0 1 0          0 1 1 1 0          1 1 0 1 0
      1 0 1 0 0          1 1 0 0 0          1 0 0 0 1
      0 1 0 1 1          0 0 1 1 1          1 0 1 1 0
         (a)                (b)                (c)

Figure 1.1: (a) A binary matrix M that has the C1P. (b) A C1 order of M. (c) A binary matrix that does not have the C1P [145].

1.1 The Consecutive-Ones Property

1.1.1 An Introduction of the Consecutive-Ones Property

Let M be a binary (0,1)-matrix with m rows and n columns. A block in a row of M is a maximal sequence of consecutive entries containing 1. A gap is a sequence of consecutive 0’s that separates two blocks, where the size of the gap is the length of this sequence of 0’s. The degree of a row of M is the number of 1’s in the row. The degree of a matrix M is the largest degree over all rows of M. In the first row of the matrix M of Figure 1.1a, the blocks are ab and d, while a gap of size one separates these two blocks. The degree of the second row of M in Figure 1.1a is 2, while the degree of M is 3.

A matrix M is said to have the C1P (for rows) if its columns can be permuted such that each row contains only one block (there are no gaps in this case). We call a permutation π of the columns of M that witnesses this property a consecutive-ones (C1) order of M; we say that the matrix M′ resulting from this permutation is consecutive, or that it is consecutive with respect to π; and we say that M has the C1P, or is C1P. Further, we call the problem of deciding whether or not a binary matrix has the C1P the C1P Problem. Observe that the matrix M of Figure 1.1a has the C1P, and that permutation cabde of its columns is a C1 order of M (cf. Figure 1.1b), while the matrix of Figure 1.1c does not have the C1P.

According to Kendall [84], this property was first mentioned by Petrie, an archaeologist, in 1899. In 1951, Robinson [129], also an archaeologist, proposed several heuristic methods for the problem.
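The definitions of this subsection (blocks, gaps, degree and C1 orders) can be made concrete with a small sketch. The Python code below is illustrative only: all function names are hypothetical, the two matrices are those of Figures 1.1a and 1.1c, and the search tries every column permutation, so it takes exponential time and is not one of the polynomial-time algorithms discussed in this thesis.

```python
from itertools import permutations

# Matrix of Figure 1.1a (columns a, b, c, d, e) and of Figure 1.1c (columns f, g, h, i, j).
M = [[1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 1]]
N = [[1, 1, 0, 1, 0],
     [1, 0, 0, 0, 1],
     [1, 0, 1, 1, 0]]

def blocks(row):
    """Maximal runs of consecutive 1's, as (start, end) index pairs with end inclusive."""
    runs, start = [], None
    for i, x in enumerate(row):
        if x == 1 and start is None:
            start = i
        elif x == 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs

def gap_sizes(row):
    """Sizes of the gaps (runs of 0's) that separate neighbouring blocks."""
    b = blocks(row)
    return [b[i + 1][0] - b[i][1] - 1 for i in range(len(b) - 1)]

def degree(matrix):
    """Largest number of 1's in any row."""
    return max(sum(row) for row in matrix)

def permute_columns(matrix, order):
    """Matrix whose j-th column is column order[j] of the input."""
    return [[row[j] for j in order] for row in matrix]

def is_consecutive(matrix):
    """True if every row has at most one block in the current column order."""
    return all(len(blocks(row)) <= 1 for row in matrix)

def find_c1_order(matrix):
    """Brute force over all n! column orders; returns a C1 order or None."""
    for order in permutations(range(len(matrix[0]))):
        if is_consecutive(permute_columns(matrix, order)):
            return order
    return None
```

For instance, `blocks(M[0])` gives the two blocks ab and d of the first row, and `permute_columns(M, [2, 0, 1, 3, 4])` produces the order cabde of Figure 1.1b.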
The first polynomial-time algorithm for deciding the C1P was then introduced by Fulkerson and Gross [52] in 1965. Interestingly, it was a problem in genetics, cf. Benzer [13], that motivated these authors to study the C1P.

1.1.2 Background: Deciding the Consecutive-Ones Property

The early attempts at deciding the C1P started with the algorithm of Fulkerson and Gross [52]. In this work, Fulkerson and Gross [52] first compute the overlap graph for the set of rows of the binary matrix M. For each component (a tree, otherwise M does not have the C1P) of this graph, they then give a quadratic-time algorithm to incrementally build a permuted form of this component (which corresponds to a set of rows) that has the C1P. Following this, in 1969, Ryser [132] studied this problem and provided a generalization of the result of Fulkerson and Gross [52] for a class of matrices that have the circular-ones property (footnote 1).

In 1972, Tucker [145] then presented a forbidden submatrix characterization of binary C1P matrices. In this work, Tucker [145] shows that a binary matrix M has the C1P if and only if the bipartite graph GM corresponding to M contains no asteroidal triple among its column vertices. Here, GM = (V1, V2, E), where V1 (resp., V2) is the set of columns (resp., rows) of M, and (v1, v2) ∈ E if and only if column v1 contains a 1 in row v2 (cf. Figure 1.2). An asteroidal triple of a graph is a set of three vertices such that there is a path between any two of these vertices which avoids the neighborhood of the third vertex, cf. Figure 1.2b. The set of forbidden submatrices then comes directly from the set of bipartite (sub)graphs which contain an asteroidal triple. For example, the matrix of Figure 1.1c that does not have the C1P contains the submatrix obtained by removing column i, shown in Figure 1.2a, which is a forbidden submatrix because its corresponding bipartite graph, shown in Figure 1.2b, has an asteroidal triple.
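To make the notion of an asteroidal triple concrete, the following Python sketch builds the bipartite graph GM of the matrix of Figure 1.2a and checks that the columns g, h and j form an asteroidal triple. It is a minimal illustration under stated assumptions: the function names are hypothetical, row vertices are encoded as `("row", i)` tuples, and each path test is a plain breadth-first search after deleting the third vertex together with its neighbours. It is not the forbidden submatrix search of Tucker [145] or of later works.

```python
from collections import deque
from itertools import combinations

def bipartite_graph(matrix, columns):
    """Adjacency sets of G_M: column vertices are labels, row vertices ('row', i)."""
    adj = {c: set() for c in columns}
    for i, row in enumerate(matrix):
        r = ("row", i)
        adj[r] = set()
        for c, x in zip(columns, row):
            if x == 1:
                adj[r].add(c)
                adj[c].add(r)
    return adj

def connected_avoiding(adj, u, v, w):
    """True if some u-v path avoids w and every neighbour of w."""
    banned = adj[w] | {w}
    if u in banned or v in banned:
        return False
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj[x] - banned - seen:
            seen.add(y)
            queue.append(y)
    return False

def is_asteroidal_triple(adj, a, b, c):
    """Each pair must be joined by a path avoiding the third vertex's neighbourhood."""
    return (connected_avoiding(adj, a, b, c)
            and connected_avoiding(adj, a, c, b)
            and connected_avoiding(adj, b, c, a))

def column_asteroidal_triples(adj, columns):
    """All asteroidal triples made of column vertices only (cubic enumeration)."""
    return [t for t in combinations(columns, 3) if is_asteroidal_triple(adj, *t)]

# Matrix of Figure 1.2a: rows {f, g}, {f, j}, {f, h} over columns f, g, h, j.
M_12A = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 0, 1, 0]]
G_12A = bipartite_graph(M_12A, "fghj")
```

Here `column_asteroidal_triples(G_12A, "fghj")` finds only the triple (g, h, j): column f is adjacent to every row, so removing its neighbourhood isolates all other columns.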
Until recently [18, 30, 39], the forbidden submatrix approach of Tucker was not seen as computationally useful, which is why researchers followed other approaches. In 1976, Booth and Lueker [21] introduced the first linear-time algorithm for deciding this property. Booth and Lueker [21] also introduced a data structure called the PQ-tree, a linear-time constructible structure that encodes all C1 orders of a binary C1P matrix. See Figure 1.3 for an example of a PQ-tree: in Figure 1.3a we have a binary C1P matrix, while Figures 1.3b and 1.3c give two configurations for the PQ-tree of this matrix. This work of Booth and Lueker [21] was a significant achievement in the history of deciding the C1P. In particular, the PQ-tree has since served as a useful tool in using the C1P for modelling problems in many settings. In this thesis, we use the structure of the PQ-tree to obtain several of our algorithmic results.

Footnote 1: While we focus here on generalizations of the C1P other than the circular-ones property, refer to Dom [37] for details on this property.

    (a)  f g h j
         1 1 0 0
         1 0 0 1
         1 0 1 0

Figure 1.2: (a) A binary matrix M that does not have the C1P. (b) The bipartite graph GM corresponding to M where the black (resp., white) vertices correspond to the columns (resp., rows) of M. Graph GM contains the asteroidal triple g, h, j.

There would then be a break in research on deciding the C1P for more than ten years after this milestone result of Booth and Lueker [21]. The year 1989 saw a renewed interest in research on deciding the C1P with the result of Korte and Möhring [87]. Indeed, while the structure of the PQ-tree is very elegant and simple, the algorithm in Booth and Lueker [21] for constructing it is quite complicated. This motivated Korte and Möhring [87] to introduce MPQ-trees (modified PQ-trees), where the internal (P- and Q-) nodes contain some additional information, which makes these trees simpler to construct.
In 1992, Hsu [73] ([76]) also presented a linear-time algorithm to test for the C1P without using PQ-trees; however, its implementation is still quite complicated. In 1998, Meidanis et al. [106] proposed a new theory of the C1P which formalizes many concepts alluded to in other works, such as orthogonality of two rows in a binary matrix [21, 52, 73, 76, 113]. In addition to this new theory, Meidanis et al. [106] also introduce a new structure called the PQR-tree, which exists for any instance of a binary matrix; it generalizes the PQ-tree in that a PQR-tree for a C1P matrix is a PQ-tree. See Figure 1.4 for an example of a PQR-tree.

    (a)  a b c d e f g
         1 0 0 0 1 0 0
         1 0 1 0 1 0 0
         0 0 1 0 0 0 1

Figure 1.3: (a) A binary C1P matrix M. (b) The PQ-tree TM for M. Here, TM has internal (circular) P-nodes and (rectangular) Q-nodes, and leaf nodes for the set of columns of M. A leaf order of TM obtained by taking any arbitrary (resp., forward or reverse) permutation of the children of a P-node (resp., Q-node) represents a C1 order of M. Note that the current configuration of TM represents C1 order bdaecgf of M. (c) Another configuration of PQ-tree TM representing C1 order dgcaebf of M. Note that TM has 4! · 2 · 2 = 96 configurations, and hence M has 96 C1 orders.

In 2000, Habib et al. [64] gave a very simple algorithm for deciding the C1P which is based on partition refinement. In 2003, Hsu and McConnell [78] introduced a remarkable simplification for building PQ-trees. Here, these authors introduced PC-trees, a structure that is much more straightforward to construct, but which encodes all circular-ones orders of a binary matrix that has the circular-ones property. However, a binary matrix M that has the C1P also has the circular-ones property, and moreover, there is an easy way to modify the PC-tree for M so that it yields the PQ-tree for M [37].
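The count of 96 C1 orders stated in the caption of Figure 1.3 can be confirmed by exhaustive enumeration. The sketch below is an illustration only, with hypothetical function names: it does not build a PQ-tree, it simply tests all 7! = 5040 column orders of the matrix of Figure 1.3a, with each row given as the set of columns that contain a 1. It also checks the matrix of Figure 1.4a, which adds the row {b, c}.

```python
from itertools import permutations

COLUMNS = "abcdefg"
# Rows of the matrix of Figure 1.3a, as sets of columns holding a 1.
ROWS_13A = [{"a", "e"}, {"a", "c", "e"}, {"c", "g"}]
# Figure 1.4a: the same matrix with a fourth row {b, c} added.
ROWS_14A = ROWS_13A + [{"b", "c"}]

def is_c1_order(order, rows):
    """True if every row's columns occupy consecutive positions in `order`."""
    pos = {c: i for i, c in enumerate(order)}
    for row in rows:
        if not row:
            continue  # an all-zero row imposes no constraint
        p = [pos[c] for c in row]
        if max(p) - min(p) + 1 != len(p):
            return False
    return True

def count_c1_orders(columns, rows):
    """Count C1 orders by brute force over all column permutations."""
    return sum(1 for order in permutations(columns) if is_c1_order(order, rows))
```

With these definitions, `count_c1_orders(COLUMNS, ROWS_13A)` agrees with the 4! · 2 · 2 = 96 configurations of the PQ-tree, both orders named in the caption (bdaecgf and dgcaebf) pass `is_c1_order`, and the matrix of Figure 1.4a has no C1 order at all.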
In 2004, McConnell [102] proposed the first linear-time certifying algorithm for deciding the C1P, that is, if a matrix M is not C1P, the algorithm outputs a certificate of size linear in M that verifies this (footnote 2). McConnell [102] also provided a slightly different type of structure than the PQR-tree, based on partitive families, called the Generalized PQ-tree, which exists for any instance of a binary matrix; again, a generalized PQ-tree is a PQ-tree for a C1P matrix. Most recently, in 2010, Blin et al. [18] developed a faster algorithm for finding the forbidden submatrices of Tucker [145]. Refer to the works of Michael Dom [36, 37] for a nice survey of the C1P and its algorithmic aspects, respectively.

Footnote 2: Refer to Kratsch et al. [90] for more details on such certificates.

    (a)  a b c d e f g
         1 0 0 0 1 0 0
         1 0 1 0 1 0 0
         0 0 1 0 0 0 1
         0 1 1 0 0 0 0

Figure 1.4: (a) A binary matrix M that does not have the C1P. Note that M is the matrix of Figure 1.3a with a fourth row added, causing this matrix to not have the C1P. (Note also that M contains the forbidden submatrix of Figure 1.2a.) (b) The PQR-tree TM for M. Here, TM has the additional third type of internal (diamond-shaped) R-node. An R-node represents a part of M which does not have the C1P (it contains a conflicting set of columns of M). Note that the PQR-tree of the matrix formed by the first three rows of M is equivalent to that shown in Figure 1.3b.

1.1.3 Applications of the Consecutive-Ones Property

The Consecutive-Ones Property has had a rich set of applications since its introduction. Indeed, according to Kendall [84], Petrie's interest in this property in 1899 was motivated by the application of the seriation of archaeological data [72, 84, 129]. The C1P appears in many other practical applications, such as scheduling [67, 70, 89, 147], information retrieval [54, 88] and circuit and railway design/optimization [46, 104, 105, 131].
Essentially, the C1P finds applications in any problem where one needs to linearly arrange a set of objects subject to the constraint that objects in a given subset must appear consecutively in this order. Since binary matrices can be represented as graphs and vice versa, the C1P has close connections to graph theory, in particular to interval graphs and their recognition [34, 74, 77, 90]. Indeed, much of the progress in deciding the C1P was a result of research on interval graphs [21, 52, 64, 87]. The C1P also plays an important role in the area of solving (integer) linear programs, in terms of both its direct application to practical linear programming problems [6, 70, 71, 147], and how it relates to linear programming from a more theoretical point of view [115, 116, 136]. From a complexity theoretic point of view, there are many problems on matrices that are in general NP-hard that become polynomial-time solvable when the input has the C1P [33, 112], such as problems in railway optimization and scheduling [105, 147]. This has also been shown in the study of covering problems such as set cover, as well as geometric covering problems such as rectangle stabbing [36, 43, 104, 131].

The C1P has also found applications in quite a few areas of (computational) molecular biology as different technologies developed over time. Since this application is the subject of this thesis, we illustrate this in more detail in the next few paragraphs. One of its first applications to molecular biology was in the study of the composition of genes [13, 52, 92]. By 1926, it was already known from Morgan [109] that genes are arranged linearly on a chromosome. By 1959, genetic analysis technology was advanced enough [124] that Benzer [13] was able to perform a series of experiments aimed at verifying whether a gene is also a linear arrangement of its components.
While the primitive genetic maps of the set of components that Benzer [13] produced did not altogether exclude nonlinear arrangements, the assumption of a linear arrangement seemed to be the most probable fit given this data. This would be the first, very crude, form of physical mapping. Six years later, in their study of this problem, Fulkerson and Gross [52] formulated the set of these components from the experiments of Benzer [13] as a binary matrix M, where each component is represented by a row of M. The set of components then has a linear arrangement exactly when matrix M has the C1P. This led Fulkerson and Gross [52] to formally define the notion of the C1P and to introduce the first polynomial-time algorithm for deciding it. Of course, today it is common knowledge that a gene is a linear arrangement of its components, but in Benzer [13], this was an exciting result that provided the first insights into the finer structure of genes.

More recently, when the technology allowed scientists to begin constructing, en masse, highly accurate physical maps [93] of hybridization data, with the aim of sequencing specific DNA strands, it introduced new computational challenges [7, 8], some of which have been overcome by very applied approaches [32, 58, 94], while several theoretical works exist on the subject [4, 5, 55, 149]. Since a DNA strand is too long to study in its entirety (e.g., the human chromosome contains about 10^8 base pairs [5]), it is broken into fragments, or clones, and the goal of physical mapping is to reconstruct the DNA strand given a collection of overlapping clones of the strand. A popular approach of the time was Sequence Tagged Site (STS) mapping [108, 119]. In this approach, relatively short substrings called markers (or probes) are extracted from the DNA strand itself; each marker is nevertheless sufficiently long that it is highly unlikely to occur twice on the same strand.
Given the information as to which clones contain which markers, the goal is then to find an order of markers in such a way that subsets of markers that appear on the same clone appear consecutively in this order, i.e., one possible reconstruction of this DNA strand. See Figure 1.5 for an example of an STS physical map. Consider the binary matrix M where we have a column for each marker, and a row for each subset of markers that appear on the same clone (i.e., a row with a 1 in the column corresponding to each marker in this subset). It follows that we can find an order of markers satisfying the above condition if and only if M has the C1P.

Figure 1.5: An STS physical map of the kallikrein gene region. The positions of the markers are depicted along the top, and the clones are shown as horizontal lines. The markers were developed from clone insert ends (red) and kallikrein genes (blue). The unfilled squares on clones 338F22 and 003F08 show markers not analysed. (source: http://westnilevirus.okstate.edu/research/2004rr/13/13.htm)

In the next section, we introduce in detail an application in the area of molecular biology, namely the reconstruction of AGOs, the application that has motivated the definition and study of the several relaxed versions of the C1P that are the subject of this thesis.

1.2 The Reconstruction of Ancestral Gene Orders

1.2.1 A Basic Overview of the Reconstruction of Ancestral Gene Orders

The area of comparative genomics concerns the relationship between the structure and function of genomes across sets of different species. This involves the analysis of the information provided by the signatures of selection in an attempt to understand the evolutionary processes that act on these genomes. Studies in this area have shown that conserved regions between the genomes of a set of species often contain functionally or evolutionarily associated genes [35, 118]. See Figure 1.6 for an example.

Figure 1.6: The alignment of a human genome against several mammals and a chicken genome. Here, the regions of each genome that code for the Apolipoprotein A1 gene (a gene that has an important role in lipid metabolism) are highly conserved, and hence similar, for all mammals. (source: http://www.lbl.gov/tt/techs/lbnl1690.html)

From this discipline, and the existing data that has been generated, comes the natural question of inferring the structure of ancestral genomes, or Ancestral Gene Orders (AGOs). A set of closely related species, such as the mammals in Figure 1.6, has many regions that are common, or at least similar. We can use this commonality to reconstruct the AGOs for this set of species. Given the genomes for a set S of existing species and a set of genomic markers (such as markers obtained from STS physical mapping, for example, genes), the reconstruction of AGOs is to infer possible orders of these markers in the chromosomes of some ancestor common to S. This assumes that a phylogenetic tree T is given, with the existing species S at the leaves of this tree, and the common ancestor is the extinct (unsequenced) species at the internal node of T that is common to set S. Note that T may contain some less closely related outgroup species (leaves that are not in S), and, in fact, this is a good practice, as the information they provide helps to produce more accurate reconstructions [27, 96]. As an auxiliary step to reconstructing AGOs, we first infer a set of syntenies, borrowing the terminology of Chauve and Tannier [27], i.e., groups of markers that are believed to appear together in this ancestor, cf. Figure 1.7 for an illustration of this. An AGO is then any order of the markers such that each group of markers in a synteny appears together in this order. The value of reconstructing AGOs is that it can give us insights into the biology, ecology, and evolution of extinct species [26, 56].
Experimentally, at least for proteins, the reconstruction of ancestral proteins has led to the discovery of new biochemical functions that have been lost in modern proteins [80, 133]. Since the input to this problem is a phylogeny tree T, this area is closely related to phylogenetics [45] (constructing a phylogeny tree for a set of species, etc.). There are also studies of reconstructing phylogenies for a set of existing species given AGO data, as well as of computing both simultaneously [1].

Figure 1.7: An illustration of the inference of syntenies for the ancestor common to set S = {human, mouse, dog} of species with outgroup species chicken. A synteny is a group of markers that appears together in at least two species whose path goes through the considered ancestor. Here, the first synteny appears in the human and the dog, and the second is inferred from the chicken and the mouse, while the fourth one appears in all three species of S. These syntenies can be weighted according to how often they appear in the existing species, i.e., this fourth synteny would be weighted more heavily than the first two. (animal skeleton reproduced with permission from www.bigstock.com)

1.2.2 Previous Approaches to Reconstructing Ancestral Gene Orders

While the problem of reconstructing AGOs has been studied even as early as 1936 for simpler organisms such as insects [140], cytogenetics technology such as chromosome painting allowed scientists to start reconstructing more complex organisms such as mammals [50, 127, 141, 142, 150, 153] in the early to mid 2000's. At roughly the same time, because physical maps for different species became available [5, 108, 119], many bioinformatics methods for reconstructing AGOs from physical mapping data also began to appear [22–24, 111]. The benefit of bioinformatics methods over cytogenetics methods is that they produce AGOs at a much higher resolution.
However, since physical mapping is still a young field, there are fewer such existing genome sequences available [44, 110, 126]. Since physical maps continue to be generated at an explosive rate (one reason being the drop in the cost of next generation sequencing technology), it is expected that bioinformatics methods will be the dominating technology for reconstructing AGOs. These bioinformatics methods use various differing approaches in processing data from physical maps. However, scientists started to notice a divergence between the results of some of the bioinformatics methods that use a parsimony approach in terms of evolutionary events (reversals, translocations, fusions and fissions), in particular the works [22, 111], and those of cytogenetics studies [51]. In 2006, however, Ma et al. [96] presented the first bioinformatics approach to this problem that, when applied to mammalian genomes, gave results that were more in agreement with cytogenetics methods, while exhibiting few points of divergence [130]. We present this important result in more detail in the next paragraph.

Given the genomes for a set S of existing species (in their experiments, S consists of human, mouse, rat and dog, while they use the two outgroup species chicken and opossum), and phylogeny tree T containing S, the approach of Ma et al. [96] is to first segment the multispecies alignment of S with the human genome as a reference (or more precisely, nets, cf. Kent et al. [85]) to build a set of orthology blocks [96]. Orthology blocks are essentially regions that are common (regions that are of some minimum size, here 50kb [96], that meet a certain similarity threshold) among all species in S. From these orthology blocks, Ma et al. [96] then compute conserved segments, that is, sequences of orthology blocks that remain together and in the same order in all species in S, see Figure 1.8.
Finally, from the set of pairs of conserved segments, where each pair appears adjacent in some species of S, they extract a maximal unambiguous subset of adjacencies to construct Contiguous Ancestral Regions (CARs). In order to do this, they employ a method analogous to Fitch [48] to find the most parsimonious scenario for each of these adjacencies. This has the effect of assigning each adjacency a weight between 0 and 1, where the weight is the measure of confidence that this adjacency appears also in the ancestor. The set of outgroup species (here, chicken and opossum) is used to improve the accuracy of this step. Consider now the graph G = (V, E) (footnote 3), where the vertex set V is the set of conserved segments, and set E of (weighted) edges is this set of weighted adjacencies. Since the goal is to infer a set of AGOs, they construct a graph G′ incrementally by selecting edges from E in order of decreasing weight, skipping over any edge in this order that creates either (a) a vertex of degree larger than two, or (b) a cycle, in the current G′. At the end of this process, G′ should be a union of disjoint paths, where any layout of these paths on a line represents a potential AGO for this set S of species. Here, it is each disjoint path, or rather its set of conserved segments, that represents a CAR. Figure 1.9 represents the set of CARs constructed in the experiments of Ma et al. [96]. The mapping of these CARs (cf. Figure 1.9) onto the chromosomes of the human shows quite a similarity, which is expected, as these CARs essentially represent ancestral chromosomal segments. While this approach of Ma et al. [96] uses a parsimony method to weight each adjacency, there are no assumptions on any evolutionary events, nor is each CAR even guaranteed to be an ancestral whole chromosome; rather, their approach is model-free, borrowing the terminology of Adam et al. [1].
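The greedy selection of adjacencies summarized above can be sketched in a few lines of Python. This is a minimal illustration of the edge-selection rule as described in this section, not the implementation of Ma et al. [96]: the graph is taken to be undirected (as in our summary), the function name `greedy_cars`, the union-find cycle test, the tie-breaking by input order and the example data are all assumptions made for the sketch.

```python
def greedy_cars(segments, weighted_adjacencies):
    """Build disjoint paths (CARs) from (weight, u, v) adjacencies, greedily."""
    parent = {s: s for s in segments}      # union-find forest for cycle detection
    degree = {s: 0 for s in segments}
    neighbours = {s: [] for s in segments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Consider edges in order of decreasing weight.
    for _, u, v in sorted(weighted_adjacencies, key=lambda e: -e[0]):
        if degree[u] == 2 or degree[v] == 2:
            continue                       # (a) would create a vertex of degree > 2
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                       # (b) would create a cycle
        parent[ru] = rv
        degree[u] += 1
        degree[v] += 1
        neighbours[u].append(v)
        neighbours[v].append(u)

    # Read off each path by walking from one of its endpoints.
    cars, seen = [], set()
    for s in segments:
        if s in seen or degree[s] == 2:
            continue
        path, prev, cur = [], None, s
        while cur is not None:
            path.append(cur)
            seen.add(cur)
            nxt = [x for x in neighbours[cur] if x != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
        cars.append(path)
    return cars

# Hypothetical conserved segments A..E with hypothetical confidence weights.
EXAMPLE_ADJACENCIES = [(0.9, "A", "B"), (0.8, "B", "C"), (0.7, "A", "C"),
                       (0.6, "B", "D"), (0.5, "D", "E")]
```

In this example the adjacency A-C is skipped because it would close a cycle, and B-D is skipped because B already has degree two, leaving the two paths A, B, C and D, E.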
Indeed, the model-free approach avoids computing any global parsimony in terms of evolutionary events such as reversals, translocations, fusions and fissions, which is what all the methods whose results diverge from those of cytogenetics studies [51] rely on. This, and the fact that Ma et al. [96] is the first bioinformatics method to agree well with cytogenetics methods [130], suggests that a model-free approach is a step in the right direction.

In the next subsection, we present a model-free framework for reconstructing AGOs based on the C1P of binary matrices. Note that the model of adjacencies used in Ma et al. [96] is the special case of degree 2 binary matrices. Indeed, with the method of Ma et al. [96], the link between the C1P and the reconstruction of AGOs started to become explicit. The approach we propose generalizes this method of Ma et al. [96] (in one sense, in that it concerns matrices of degree larger than 2), and is the state of the art in terms of methodologies for reconstructing AGOs.

Footnote 3: Note that in Ma et al. [96], they consider a directed graph, however the principle is the same. This detail is left out to ease the summary of this method.

Figure 1.8: (a) Human-mouse nets [85] with human as the reference. Four mouse intervals are depicted, as ordered and oriented by the orthologous human segments. The second and third mouse intervals are adjacent (and appropriately oriented) on a mouse chromosome, and the intervening bases, if any, do not align to human, and are depicted by a thin line connecting these intervals. (b) The human-mouse, human-rat and human-dog nets for a segment of the human sequence, which illustrates the construction of orthology blocks (OB). (c) The construction of conserved segments (CS) from the fusion of runs of consecutive orthology blocks whenever the order and orientation of these blocks are conserved in each of the existing genomes. (source: Ma et al. [96])

Figure 1.9: The set of CARs for the Boreoeutherian ancestral genome (of human, rat, mouse and dog) constructed from the experiments of Ma et al. [96]. Numbers above bars indicate the corresponding human chromosomes. (source: Ma et al. [96])

1.2.3 Binary Matrices, the C1P and the Reconstruction of AGOs

We now outline the approach for reconstructing AGOs based on the C1P of binary matrices that formalizes and generalizes the principles used in several computational studies [1, 96] as well as cytogenetics studies [127, 150, 153]. This approach can be broken down into the following two steps. The first is a data acquisition phase, in which we compute from the alignments of these genomes a set (or alphabet) of genomic markers L = {1, ..., n}. From this set L of genomic markers we then compute the groups of markers (syntenies) that are believed to be contiguous in the ancestral genome. Here, we represent the set of syntenies with a binary matrix M on the set of columns L where for each synteny X ⊆ L, we have a row in M with a 1 in every column of X, and 0's everywhere else. In general each synteny (row of M) can also be weighted according to the confidence that it appears in the ancestral genome. The second step of this approach consists of transforming this matrix M into a C1P matrix. It is this second step that we concentrate on in this thesis; however, we will see later that the way to approach this second phase depends very much upon the data acquisition phase.

Indeed, from a computational point of view, this approach is closely related to physical mapping: if M has (or can be transformed into a matrix that has) the C1P, then we can find an order of markers that represents an AGO. Because syntenies of markers are naturally represented by binary matrices in this way, it also follows that there can be many AGOs that are consistent with M. This set of AGOs can be encoded in a compact way with some uncertainty by the PQ-tree for (the possibly transformed) M, which is another benefit of the C1P-based approach.
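The representation of syntenies as a binary matrix described above can be illustrated with a short Python sketch. The marker set and syntenies below are hypothetical, as are the function names; the sketch only shows how each synteny X ⊆ L becomes a row of M, and how a candidate order of the markers can be checked against M.

```python
def synteny_matrix(markers, syntenies):
    """One row per synteny X, with a 1 in every column of X and 0's elsewhere."""
    index = {m: j for j, m in enumerate(markers)}
    matrix = []
    for synteny in syntenies:
        row = [0] * len(markers)
        for m in synteny:
            row[index[m]] = 1
        matrix.append(row)
    return matrix

def is_c1_order(matrix, markers, order):
    """True if, with columns arranged as `order`, every row has a single block."""
    index = {m: j for j, m in enumerate(markers)}
    for row in matrix:
        permuted = [row[index[m]] for m in order]
        ones = [i for i, x in enumerate(permuted) if x == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

# Hypothetical data: markers L = {1, ..., 5} and three syntenies.
MARKERS = [1, 2, 3, 4, 5]
SYNTENIES = [{1, 2}, {2, 3, 4}, {4, 5}]
M = synteny_matrix(MARKERS, SYNTENIES)
```

For this toy input, the order 1, 2, 3, 4, 5 is a C1 order of M and hence a candidate AGO, whereas 1, 3, 2, 4, 5 separates the markers of the first synteny and is rejected.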
The first work to represent AGOs with PQ-trees appeared in 2004. Bergeron et al. [15] used a Fitch-like [48] approach to find a most parsimonious scenario for the set of intervals (syntenies) defined by this PQ-tree. This work was quite preliminary, however, and the experiments were performed on fairly basic chloroplast genome data. A year later, in 2005, Landau et al. [91] also used PQ-trees for ancestral genomes, again in a parsimony context. Here, Landau et al. [91] also suggest a way of representing duplicated genes (genes with multiplicity, which we will cover later in this thesis), but show only how this approach works on some experimental data. Following this, in 2006, Parida [121] improved on the result of Landau et al. [91] by using a PQ-tree where some of the internal nodes are oriented, to help to uniquely construct the orders it encodes, as well as a branch-and-bound scheme for outputting all solutions, rather than just the most parsimonious solution. Again, while the concept of Parida [121] is on the right track, only preliminary experimental results are given to test this concept.

In 2007, the work of Adam et al. [1] also considered representing AGOs with PQ-trees. Here, they are concerned with computing the phylogeny and the AGOs, where they frame it as solving the Steiner Tree Problem. While they perform experiments only on fairly basic chloroplast genome data as well, this is the paper that introduces the model-free approach to using bioinformatics methods for reconstructing AGOs. Note that, while this is not made explicit, Ma et al. [96] also represent AGOs with PQ-trees. In Ma et al. [96], L is their set of conserved segments, and M stores the set of adjacencies (i.e., M has degree 2). This union of disjoint paths that they build is then equivalent to a PQ-tree with a P-node as the root r, where each child of r, containing only Q-nodes (since M has degree 2), corresponds to a path (or CAR).
Next we detail the work of Chauve and Tannier [27], where this two-step approach for reconstructing AGOs based on the C1P and PQ-trees was first developed. While this approach generalizes the approach of Ma et al. [96] (for one thing, Chauve and Tannier [27] consider matrices of degree larger than 2), these approaches are very similar in spirit. Here, given the markers for set S of species (and possibly some outgroup species) with phylogenetic tree T on S (and the outgroup species), they first compute the set L of markers. Here they mention that markers can be genes from whole genome alignment methods, orthologous genes, or various others (from comparative maps [111] or virtual hybridization [10] for example). From the input representation of S, Chauve and Tannier [27] compute (maximal) sets of markers, i.e., syntenies, that appear consecutively (footnote 4) in at least two species from S, where the path in T between these two species goes through the node for the ancestor which we wish to reconstruct. These syntenies are weighted using the same principle as in Ma et al. [96] for weighting adjacencies, and outgroup species are also used to improve this step. In fact, the set of syntenies inferred in Figure 1.7 is exactly what the method of Chauve and Tannier [27] would obtain. Note that, since an adjacency is a synteny of size two, this method is more general than that of Ma et al. [96]. One reason for considering these more general syntenies is that it is closer to the methods [127, 150, 153] on cytogenetics data.

Footnote 4: Note that, more precisely, Chauve and Tannier [27] compute a set of gene teams [9, 95]: syntenies, as we have defined them here, are gene teams for δ = 1 [27]. We leave these details out to ease the explanation of the principles of this approach of Chauve and Tannier [27].
Indeed, the inference of syntenies in Chauve and Tannier [27] is a bioinformatics version of the hybridization used by cytogeneticists, which explains the convergence between these two approaches. Chauve and Tannier [27] represent the set of syntenies with a binary matrix M on the set of columns L where for each synteny X ⊆ L , they have a row in M with a 1 in every column of X , and 0’s everywhere else. We now outline the second step of the approach of Chauve and Tannier [27], transforming M into a C1P matrix. Given binary matrix M, constructing an AGO for S then corresponds to finding a linear order of L , such that each X appears consecutively in this order, i.e., a C1 order of M. In fact, all AGOs for S can be represented by building the PQ-tree TM for M. Here, the set of CARs for S will be the children of the root node r of TM (as it was in Ma et al. [96], however they can contain also P-nodes now, as each row of M has degree larger than 2 in general). It is here that the C1P plays an important role in the reconstruction of AGOs of Chauve and Tannier [27], i.e., that they can represent sets of CARs with a PQ-tree [21]. Indeed, for the set of syntenies inferred in Figure 1.7, the matrix (which is C1P) for this set is given along with the PQ-tree TM for M in Figure 1.10. However, M rarely has the C1P as we will see later, and so Chauve and Tannier [27] do the following to build this PQ-tree (implicitly transforming M into a C1P matrix). At this point, they could employ the greedy heuristic of Ma et al. [96] of incrementally building a PQ-tree by selecting syntenies in order of decreasing weight, and skipping over any synteny that creates a conflicting set in the collection of currently selected syntenies. 
Rather than doing this, however, they first build a generalized PQ-tree (a PQR-tree [106], or the generalized PQ-tree from McConnell [102]), and then find a subset of syntenies (rows of M) of maximum cumulative weight, such that the matrix M′ of this subset has the C1P, i.e., the generalized PQ-tree for M′ is a PQ-tree. While this approach is not greedy, it amounts to solving the combinatorial optimization problem known as the Consecutive-Ones Submatrix Problem. Here, Chauve and Tannier [27] use the structure of this generalized PQ-tree for M to design an efficient branch-and-bound algorithm for this problem.

Figure 1.10: The binary matrix M corresponding to the set of syntenies inferred in Figure 1.7 and the PQ-tree TM for M. Note that each CAR is a child of the root P-node r of TM.

In experiments, the method of Chauve and Tannier [27] agrees well with all of the cytogenetics studies [50, 127, 141, 142, 150, 153] as well as with the work of Ma et al. [96], while disagreeing with the same approaches (that are not model-free) that Ma et al. [96] disagrees with. However, different experiments (from data at different levels of resolution, or variations on the input phylogeny T) show that the approach of Chauve and Tannier [27] is more stable in general than that of Ma et al. [96]. One reason for this is that, while CARs from syntenies are less well-defined than those of adjacencies (they have degree larger than two), they are better supported because every computed synteny appears in at least two existing species whose path in T goes through the considered ancestor. Another reason is likely that in certain cases, the optimization phase of Chauve and Tannier [27] can do much better than the greedy approach of Ma et al. [96].

While both the greedy approach of Ma et al.
[96] and the optimization approach of Chauve and Tannier [27] tend to work well in practice (these are the state of the art in bioinformatics methods for reconstructing AGOs), there is much more work to be done in the area of handling a matrix that does not have the C1P. The first step in this effort is to study why matrix M does not have the C1P. Previous works [27, 96] point this out, and we go into more detail on it in the next paragraph. Recall that the second step of this two-step approach to reconstructing AGOs based on the C1P of binary matrices is to transform binary matrix M into one that has the C1P. Ideally, if each synteny were a true positive ancestral synteny, then M would be C1P; however, matrices from real data are rarely C1P. Rather, some of the syntenies are false positives, i.e., not contiguous in the true ancestral genome. The reason for, and the nature of, these false positives depend highly on the data acquisition method. Depending on the method used, the reasons can be errors in constructing the set of markers L, for example errors in the assembly from the whole genome alignments, or paralogs being mistaken for orthologs in the construction of orthology blocks [96]. Other reasons come from the construction of incomplete syntenies due to the convergent loss of markers, and from two syntenies joining together (creating a “chimeric” synteny) due to the convergent fusion of chromosomal segments in several lineages. For example, this second case of chimeric syntenies might happen especially in genomes of yeasts, where we generally see many translocations [81, 128]. It is therefore unavoidable that we must deal with matrices that do not have the C1P. This is what motivates the work in this thesis. Why these matrices do not have the C1P depends on the nature of the errors in the data acquisition phase.
In the next section, we illustrate several open problems on such matrices raised by these different types of errors, some of which are mentioned in Chauve and Tannier [27], and then propose several relaxations of the C1P to address these problems, which is the contribution of this thesis. In some cases, solving these generalizations is NP-complete, and in other cases, there are algorithms for finding a solution.

1.3 Computational Solutions for non-C1P Matrices

1.3.1 Transforming the Matrix to a C1P Matrix

The first and most direct approach, taken in previous works [27, 96], is to transform the binary matrix M into one that has the C1P. Indeed, because of the assumptions made on the nature of the errors expected in their datasets (that the markers, i.e., columns, were inferred correctly), Chauve and Tannier [27] consider all computed syntenies, and extract a maximum subset of rows such that the submatrix M′ of M defined by this set of rows is C1P. However, one could also remove columns from M if one were less confident in the correctness of the markers, for example, or flip some entries in M from 0 to 1, or from 1 to 0, to account for approximate syntenies. However, all of the corresponding optimization problems are NP-complete [36, 38, 66], even for sparse matrices [143]. For the case of extracting from M a maximum subset of rows or columns that is C1P, it has been shown in Dom [36] that this is also APX-hard and W[1]-hard. Aside from the work of Chauve and Tannier [27] and the reconstruction of AGOs in general, this problem of transforming a matrix M into one that has the C1P, while minimizing the modifications to M, can be found in other applications [8, 143], as well as in physical mapping [7, 55, 94, 149]. The latter comes as no surprise, since, from a computational point of view, physical mapping also amounts to determining the C1P of a binary matrix in the presence of errors (in assembly, computing markers, etc.).
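To make the row-selection variant of this transformation concrete, the following is a minimal brute-force sketch, not the branch-and-bound method of Chauve and Tannier [27]. It assumes rows are given as sets of column labels with one weight per row; the names `has_c1p` and `max_weight_c1p_rows` are illustrative only, and the exhaustive search is exponential, so it is usable only on toy inputs.

```python
from itertools import combinations, permutations


def has_c1p(rows, columns):
    """Brute-force C1P test: try every column order (toy sizes only)."""
    for order in permutations(columns):
        pos = {c: i for i, c in enumerate(order)}
        # A non-empty row is consecutive iff its 1's span exactly len(row) positions.
        if all(
            max(pos[c] for c in r) - min(pos[c] for c in r) + 1 == len(r)
            for r in rows
            if r
        ):
            return True
    return False


def max_weight_c1p_rows(rows, weights, columns):
    """Return (indices, weight) of a maximum-weight subset of rows that is C1P.

    Exhaustive search over all row subsets; a hypothetical helper for
    illustration, exponential in the number of rows.
    """
    best, best_w = (), 0
    for size in range(1, len(rows) + 1):
        for idx in combinations(range(len(rows)), size):
            w = sum(weights[i] for i in idx)
            if w > best_w and has_c1p([rows[i] for i in idx], columns):
                best, best_w = idx, w
    return best, best_w


if __name__ == "__main__":
    # Three pairwise-overlapping syntenies on three markers: not C1P together.
    rows = [{1, 2}, {2, 3}, {1, 3}]
    print(max_weight_c1p_rows(rows, [3, 2, 1], [1, 2, 3]))
```

On the toy input in the `__main__` block, the lightest of the three conflicting rows is the one discarded.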
We now introduce the contribution of this thesis: in the next three subsections, we outline three variants of the C1P, motivated by this problem of reconstructing AGOs, that we have proposed and/or studied here.

1.3.2 Relaxing the C1P

Another approach for handling a binary matrix M that does not have the C1P is, instead of transforming M, to relax the notion of the C1P, and then decide whether M has this relaxed property. A natural relaxation of the C1P is to allow gaps in each row of this “relaxed” C1 order of M. Indeed, Chauve and Tannier [27] claim that in their reconstructions, certain syntenic features are not captured because of the strict nature of the C1P. Rather, if some number of gaps were allowed [16, 122], a significantly larger number of syntenies would be detected. However, allowing gaps could radically change the combinatorial nature of this problem, which means that we can no longer rely on PQ-trees to encode all solutions, a powerful tool in any approach based on the C1P for reconstructing AGOs. Indeed, a relaxed form of the C1P with gaps was considered in 1995, motivated by problems in the area of physical mapping [55]. Here Goldberg et al. [55] introduce the notion of the k-Consecutive-Ones Property (k-C1P). A binary matrix M has the k-C1P when its set of columns can be permuted such that each row contains at most k blocks of 1’s. This is a fairly general form of relaxing the C1P, as it does not put any restriction on the size of the gaps between blocks. Goldberg et al. proposed this relaxation of the C1P to handle the case in physical mapping of chimeric clones: when sets of markers from two distant clones appear as the same clone, an artifact of hybridization [149]. Interestingly, from a computational standpoint, this is identical to the case in Chauve and Tannier [27] where two syntenies join together (creating a “chimeric” synteny) due to the convergent fusion of chromosomal segments in several lineages.
The k-C1P models this case well, as there is no restriction on the distance between the two syntenies that join together to form the chimeric clone. However, this relaxation is indeed radically different in combinatorial nature, as Goldberg et al. [55] show that deciding if a binary matrix M has the k-C1P is NP-complete. Chauve and Tannier [27] state, however, that a decision problem of “consecutive-ones with allowed gaps” is still open, i.e., each row of the matrix must have consecutive ones, except that between each pair of ones, a fixed number of zeros is allowed. So, in this setting, it makes sense to consider a limit on the maximum size of any allowed gap. This idea has been motivated in other works as well. Indeed, Pasek et al. [122] consider an arbitrary number of fixed-sized gaps and are able to capture interesting conserved syntenic features. Further, Ouangraoua et al. [117], in work on double-conserved syntenies, show that when trying to transform their obtained matrix M into a C1P matrix, they must discard a large number of syntenies, and conclude that the C1P is not the proper model here, and that gaps are needed. The first variant of this thesis is a relaxation of the C1P with a limit on the maximum size of any gap. Here we define the Gapped C1P, or the (k, δ)-Consecutive-Ones Property.

Property 1 ((k, δ)-Consecutive-Ones Property ((k, δ)-C1P)). A binary matrix M has the (k, δ)-C1P for the two integers k and δ if the columns of M can be ordered such that each row contains at most k blocks of 1’s, and no two neighboring blocks of 1’s are separated by a gap of size more than δ.

Notice that the classical C1P is equivalent to the (1, 0)-C1P. If either of the two parameters is unbounded, we replace k or δ with ∞. For instance, the k-C1P is equivalent to the (k, ∞)-C1P. Note also, then, that Pasek et al. [122] consider precisely the (∞, δ)-C1P.
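Property 1 can be checked directly for a given column order. The sketch below is an illustration only, under the assumption that each row is represented by the set of columns holding a 1; the function names are hypothetical, `float("inf")` stands in for an unbounded parameter, and the exhaustive search over orders is exponential (unavoidably so in general, given the hardness results discussed in this chapter).

```python
from itertools import permutations


def blocks_and_gaps(row, order):
    """Return (number of blocks of 1's, sizes of the gaps between them)."""
    positions = sorted(i for i, c in enumerate(order) if c in row)
    # A gap is a maximal run of 0's between two neighboring 1's of the row.
    gaps = [q - p - 1 for p, q in zip(positions, positions[1:]) if q - p > 1]
    blocks = len(gaps) + 1 if positions else 0
    return blocks, gaps


def is_kd_c1_order(rows, order, k, delta):
    """True iff `order` witnesses the (k, delta)-C1P for `rows`."""
    for row in rows:
        blocks, gaps = blocks_and_gaps(row, order)
        if blocks > k or any(g > delta for g in gaps):
            return False
    return True


def has_kd_c1p(rows, columns, k, delta):
    """Brute-force decision by trying every column order (toy sizes only)."""
    return any(is_kd_c1_order(rows, p, k, delta) for p in permutations(columns))


if __name__ == "__main__":
    rows = [{1, 2}, {2, 3}, {1, 3}]
    print(has_kd_c1p(rows, [1, 2, 3], 1, 0))  # classical C1P, i.e. (1, 0)-C1P
    print(has_kd_c1p(rows, [1, 2, 3], 2, 1))  # one gap of size 1 allowed
```

The three-row example does not have the classical C1P, but it does have the (2, 1)-C1P, since in the order 1, 2, 3 only the row {1, 3} is split, into two blocks separated by a single 0.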
Here, we call a permutation π of the columns of M that witnesses the (k, δ)-C1P a (k, δ)-consecutive-ones ((k, δ)-C1) order of M; we say that the matrix resulting from this permutation is (k, δ)-consecutive, or that it is (k, δ)-consecutive with respect to π; and that M is (k, δ)-C1P, or has the (k, δ)-C1P. Note that, for small k and δ, this is a stricter model than the ones considered before, such as the k-C1P [55] or that of Pasek et al. [122]. A model that is even more strict would be to consider the number of 0’s in gaps in the entire matrix (in addition to the constraints k and δ) as a third parameter. This remains an interesting open question. Although the (k, δ)-C1P is stricter than previous models, we show in this thesis that deciding this property is nevertheless computationally hard for the most part. We give our first set of results, the complexity of deciding the (k, δ)-C1P, in Chapter 2 of this thesis. In Section 2.3, we show that for every k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete, leaving open only the case of the complexity of the (2, 1)-C1P. We show that this remains NP-complete even if one of the two parameters is unbounded: (i) for every k ≥ 2, deciding the (k, ∞)-C1P is just the problem of deciding if matrix M has the k-C1P, and is thus NP-complete by Goldberg et al. [55], and (ii) for every δ ≥ 1, deciding the (∞, δ)-C1P is NP-complete (Section 2.5). While the complexity of the (2, 1)-C1P remains open, we do provide an algorithmic result for a relevant case of the (2, 1)-C1P in Section 2.4. We now mention several other versions of the C1P with gaps considered in other works. Another slightly different version of the C1P with gaps was considered in Haddadi [65], where it is shown that finding an order of the columns that minimizes the number of gaps in the entire matrix M is NP-complete, even if each row of M has degree at most two.
While the works [75, 94] do not deal with the C1P with gaps, they do propose algorithms for recognizing matrices that are “close” to having the C1P in some sense. Aside from this, Dom [36, 37] presents an approximation algorithm as well as a fixed-parameter algorithm for instances of the Set Cover Problem that are “close” to having the C1P, which basically means either that the input matrices have been generated by starting with a matrix that has the C1P and randomly replacing a certain percentage of the 1’s by 0’s [104], that the average number of blocks of 1’s per row is much smaller than the number of columns of the matrix [131], or that the maximum number of blocks of 1’s per row is small [105]. In light of this, approximation schemes remain to be considered for the (k, δ)-C1P, as well as any natural parameter that could lead to a Fixed Parameter Tractable (FPT) result. In the next subsection, we consider the (k, δ)-C1P for matrices of bounded degree, which is the second variant of this thesis.

1.3.3 Matrices of Bounded Degree

The NP-completeness results on deciding the (k, δ)-C1P of Chapter 2 involve constructions with many rows of large degree. After examining some data from the experiments of Chauve and Tannier [27], however, we found that this is not always realistic. We considered here the ancestral syntenies dataset for the boreoeutherian ancestor of Chauve and Tannier [27] at a resolution of 200kb, with 1651 markers (i.e., columns) and 2515 syntenies (rows).⁵ In this dataset, we observed that 90% of the syntenies have small degree (less than or equal to 16, which is less than 1% of the number of columns of this matrix). In addition to this, each of the remaining 10% of the syntenies (with degrees 17 to 99) contains between 16 and 144 of these syntenies of degree less than or equal to 16. Indeed this makes sense, as a long common interval that does not contain any other common interval would not be realistic.
Hence, if the syntenies with large degree (10%) are discarded, the majority of the information is preserved. Indeed, this has already been shown in Chauve and Tannier [27]: when considering only adjacencies (matrices of degree 2), they obtain only slightly more CARs than in the general case of syntenies. This illustrates again that most of the signal is captured in small common intervals. In light of these two analyses, it makes sense to consider versions of the (k, δ)-C1P where the degree is bounded, especially if this could result in algorithms for these versions. Note that this would apply to chimeric syntenies: we would expect that the individual syntenies that compose them will be detected as well, and then we just need to remove the row corresponding to a chimeric synteny.

⁵ This dataset can be found at http://www.cecm.sfu.ca/∼ cchauve/SUPP/ANCESTOR08/BOREO 200 u/index.html .

To take into account the above observations, we consider here the case of the (k, δ)-C1P for matrices of bounded degree. This forms the second result of this thesis, given in Chapter 3. Formally, we define the (d, k, δ)-Consecutive-Ones Property.

Property 2 ((d, k, δ)-Consecutive-Ones Property ((d, k, δ)-C1P)). A binary matrix M has the (d, k, δ)-C1P when the maximum degree of any row of M is at most d, and M has the (k, δ)-C1P.

We call a permutation π of the columns of M that witnesses this property a (d, k, δ)-consecutive-ones ((d, k, δ)-C1) order; we say that the matrix M′ resulting from this permutation is (d, k, δ)-consecutive, or that it is (d, k, δ)-consecutive with respect to π; and that M is (d, k, δ)-C1P, or has the (d, k, δ)-C1P. In Chapter 3, Section 3.1, we first show that if all three parameters are fixed, deciding the (d, k, δ)-C1P is related to deciding the bandwidth of a graph, and can be done in polynomial time by slightly modifying an algorithm of Saxe [135] for recognizing graphs with a fixed constant bandwidth.
While this algorithm is only practical for small values of the parameters, this is usually the case in practice (cf. Chauve and Tannier [27] and the discussion in previous paragraphs). Currently, an implementation of this algorithm on biological data is in a preliminary stage. We point out that for the case where d = 2, we can also take advantage of the faster linear-time algorithm of Caprara et al. [25] for the bandwidth 2 case. An interesting open question here is whether or not the techniques used in Caprara et al. [25] can be extended to matrices of degree (and graphs of bandwidth) larger than two. After obtaining this algorithmic result for the case of deciding the (d, k, δ)-C1P when all three parameters are fixed, we began to study the complexity of deciding this property when one or more of these parameters is unbounded. The case with d unbounded is just the (k, δ)-C1P, and hence the complexity of deciding everything except for the (∞, 2, 1)-C1P, i.e., the (2, 1)-C1P, is known. Since fixing d also fixes k (k ≤ d), the only case that remains for us to consider is the case when δ is unbounded, i.e., the (d, k, ∞)-C1P. The practical motivation for considering this case is that it concerns chimeric syntenies (the gap size is unbounded), where we assume that we do not lose too much information by considering only syntenies with low degree, as argued above. Here, in Chapter 3, Section 3.2.4, we show that in every non-trivial case, deciding this property is NP-complete, i.e., for every d > k ≥ 2, deciding the (d, k, ∞)-C1P is NP-complete. Note that if k = 1, then this becomes the C1P, and if d ≤ k, then any order of the columns of M is a valid solution, since no row can have more than d blocks of 1’s. This case is also of importance to physical mapping, since chimerism is a phenomenon that happens there as well.
In particular, since the setting when clones are short and there is limited coverage of the sequence by the clones is likely to be more realistic (similar to how it is in the reconstruction of AGOs), Goldberg et al. [55] pose the question of deciding the 2-C1P when the number of ones per row and per column is bounded. Interestingly, the construction we use in Section 3.2.4 of Chapter 3 happens also to use a bounded number of ones per column, and hence we answer the above question posed by Goldberg et al. [55]. In the next subsection, we present the third variant of the C1P of binary matrices that we study in this thesis.

1.3.4 Matrices with Columns of Multiplicity

Here, we present the third variant of the C1P of binary matrices that we study in this thesis, namely to allow columns to appear multiple times in a C1 order. While this is technically another relaxation of the C1P, it is very different from the ones considered previously. It also models a very different phenomenon in the reconstruction of AGOs, namely duplicated (or indistinguishable) markers. Indeed, a preliminary approach for handling this was mentioned in Landau et al. [91]. Alternative ways of handling duplicated markers were also a line of future research posed in Chauve and Tannier [27]. The input to the C1P-based approach mentioned above for reconstructing AGOs is a set of pairwise distinct markers L = {1, . . . , n}. This assumption is needed for the use of the C1P and, in particular, PQ-trees for the reconstruction of AGOs (the columns of the binary matrix M that is the input to deciding the C1P are pairwise distinct). In order to cope with datasets containing duplicate markers (among other things like missing or overlapping markers, which are beyond the scope of this discussion), Chauve and Tannier [27] use approximate intervals of markers in the detection phase.
That is, a set of markers need only be approximately similar (e.g., 80% similar) between two species from S, where the path in T between these two species goes through the ancestor, for it to be considered a synteny (a row in M). This approach in some sense allows the existence of duplications by relaxing the detection of syntenies. An alternate approach suggested in Chauve and Tannier [27] would be to infer some pre-duplication AGO, which has been considered in some rearrangement-based works such as [3, 41, 134].⁶ Chauve and Tannier [27] also mention that there exist algorithms for computing syntenies between pairs of genomes with duplicate markers [16], or with duplicate segments followed by losses in both copies [146]. However, because these algorithms account for duplicates, the input is no longer assumed to be a set of pairwise distinct markers, and hence one cannot use the C1P to model AGOs here. In 2009, a year after the important result of Chauve and Tannier [27], Stoye and Wittler [139] presented a parsimony approach for reconstructing AGOs⁷ that uses PQ-trees. Here, they propose a framework based on Bergeron et al. [15], which is also what Chauve and Tannier [27] is based on, and provide an efficient method for finding a most parsimonious AGO, which they show works well in practice. In this work, Stoye and Wittler [139] propose extending their models to allow markers to appear multiple times (to account for duplications). A year after this, Wittler and Stoye [151] formally define a model that incorporates markers with multiplicity. This model is equivalent to deciding the following property of binary matrices.

⁶ Refer to Ma et al. [97] for a solution to handling duplicate markers in the case of physical mapping.

Property 3 (Consecutive-Ones Property with Multiplicity (mC1P)). Given a binary matrix M on columns S = {1, . . . , n} and a function m : S → N, is there a
sequence σ over alphabet S such that (i) σ contains each column s ∈ S at most m(s) times, and (ii) for each row r of M, the set of columns that have entry 1 in r forms at least one subsequence of σ?

⁷ More accurately, their work concerns models of gene clusters, of which a set of syntenies used to reconstruct an AGO is one such model. We only discuss their work in the scope of reconstructing AGOs to remain within the subject of this thesis.

Note that deciding this property becomes trivial if we allow any column to have arbitrary multiplicity, i.e., we could take σ to be the concatenation of all rows of M. Of course, such a long AGO would be dubious, and hence a threshold on the multiplicity of each marker is reasonable. This is why Wittler and Stoye [151] introduce the multiplicity constraint (i). This property generalizes the C1P: indeed the C1P is the case when m(s) = 1 for all s ∈ S, i.e., there simply is a permutation π over the alphabet S such that (ii) holds. Here, we call this property the mC1P. Of course, now that this problem has moved outside the domain of permutations into sequences, the classical C1P and the associated PQ-tree no longer apply. A natural question to ask then is the complexity of deciding the mC1P. Wittler and Stoye [151] show that deciding the mC1P can be done in polynomial time if each row of M has degree at most 2 (which is the model of adjacencies) by showing that this problem is equivalent to deciding if a graph is Eulerian. They also show that if each row has degree at most 5, then deciding the mC1P, as well as two restricted variants motivated by biological settings, is NP-complete. We mention that one of these restricted variants, the case of framed common intervals on permutations, was the first model used to formally state the problem of reconstructing AGOs using PQ-trees [15].
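As an illustration of Property 3, the sketch below verifies a candidate sequence σ against a multiplicity function m. It is a minimal sketch under a stated assumption: condition (ii) is read as requiring some contiguous window of σ whose set of symbols is exactly the set of columns of the row, which reduces to the classical C1P when m(s) = 1 for all s. The name `satisfies_mc1p` is hypothetical, and the function only checks a given σ; it does not search for one.

```python
from collections import Counter


def satisfies_mc1p(rows, m, sigma):
    """Check that `sigma` witnesses the mC1P for `rows` under multiplicities `m`.

    rows  : list of sets of columns (the 1-entries of each row)
    m     : dict mapping each column to its maximum number of occurrences
    sigma : candidate sequence of columns
    Assumption: a row is satisfied when some contiguous window of sigma has
    exactly the row's columns as its set of symbols.
    """
    counts = Counter(sigma)
    # Condition (i): no column occurs more often than its multiplicity allows.
    if any(counts[s] > m.get(s, 0) for s in counts):
        return False
    # Condition (ii): every row appears as a window of sigma.
    n = len(sigma)
    for row in rows:
        found = any(
            set(sigma[i:j]) == row
            for i in range(n)
            for j in range(i + 1, n + 1)
        )
        if not found:
            return False
    return True


if __name__ == "__main__":
    rows = [{1, 2}, {2, 3}, {1, 3}]
    # Allowing column 1 to appear twice resolves the conflict among the rows.
    print(satisfies_mc1p(rows, {1: 2, 2: 1, 3: 1}, [1, 2, 3, 1]))
    # With all multiplicities equal to 1 this is the classical C1P.
    print(satisfies_mc1p(rows, {1: 1, 2: 1, 3: 1}, [1, 2, 3]))
```

In the toy example, duplicating a single marker is enough to accommodate three rows that conflict under the classical C1P.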
In this thesis, we improve these NP-completeness results to each row having degree at most 3 (resp., at most 6 in the case of the framed common intervals variant), while m(s) ≤ 2 for each s ∈ S, where S is the set of columns of M. We give these results in Sections 4.1 and 4.2 of Chapter 4. The techniques used here to improve these NP-completeness results are based on those introduced in Chapter 3 for showing NP-completeness of deciding the (d, k, ∞)-Consecutive-Ones Property ((d, k, ∞)-C1P). Finally, in Section 4.3 of Chapter 4, we present a tractability result, which is motivated as follows. The C1P-based approach for reconstructing AGOs introduced here (for example, by Chauve and Tannier [27]) involves computing a set of ancestral syntenies represented by binary matrix M, and then building a PQ-tree for M (by possibly transforming M to a C1P matrix). Here, each subtree rooted at a child of the root of this PQ-tree represents a CAR. A CAR is an ancestral chromosomal segment, but it is not guaranteed to be a complete ancestral chromosome. In fact, it is common that the number of CARs obtained is larger than the expected number of ancestral chromosomes. This raises the following natural question: which CARs are believed to form complete ancestral chromosomes, or more generally, to contain an extremity of an ancestral chromosome (an ancestral telomere)? Indeed, a CAR with two ancestral telomeres is in fact a complete ancestral chromosome. Moreover, when CARs are grouped into syntenic sets, that is, sets of CARs that are believed to belong to the same ancestral chromosome, each such syntenic set of CARs can contain only two ancestral telomeres. We address this question as follows. A column c′ with multiplicity (bounded, for example, by twice the maximum expected number of ancestral chromosomes, or more generally with infinite multiplicity) can then be used to represent telomeres, that is, virtual extremities of ancestral chromosomes.
Then any ancestral synteny that putatively contains a marker that is an extremity of an ancestral chromosome (for example because the ancestral synteny is telomeric in two existing descendants of the considered ancestor) can be represented by two rows in M: a row representing the ancestral synteny, plus a copy of this row with an additional entry 1 in column c′. This structure ensures that if M has the mC1P, then the occurrences of c′ are located at the extremities of the CARs. Otherwise (M does not have the mC1P), some rows can be discarded to result in a matrix M′ that has the mC1P, with the same property. This assumption on the structure of M is fundamental to leave open the possibility for any ancestral synteny to be at the extremity of a CAR or to be embedded inside a CAR. It follows that the tractable family of matrices considered here meets precisely this assumption. Formally, in Section 4.3 of Chapter 4, we present a tractability result for a family of matrices where every row of M has (i) at most one entry 1 in columns with multiplicity greater than one, or (ii) exactly two entries 1 in columns with multiplicity greater than one and no other entries. Our proofs rely on the two classical concepts of PQ-trees and Eulerian graphs. The final section of this thesis outlines our study of the GCCC Problem, where we present our fourth and final variant of the C1P that we use to develop an algorithm for a special case of this problem.

1.4 The Generalized Cladistic Character Compatibility Problem

In Chapter 5 we present our fourth and final variant of the C1P in order to develop an algorithm for a case of a phylogeny problem that we consider here. We now briefly motivate our study of this type of phylogeny problem. Here we study the problem of constructing a phylogenetic tree for a set of species [45]. A qualitative character assigns to each species a state from a set of states, e.g., “is a vertebrate”, or “number of legs”.
When the evolution of the states of the character is known, e.g., evolution from invertebrate to vertebrate is only forward, the character is called cladistic. This evolution of the states is usually represented by a rooted tree, called a character tree, on the set of states. The Qualitative Character Compatibility Problem, or Perfect Phylogeny Problem, is NP-complete [20, 138], while it is polynomial-time solvable when any of the associated parameters is fixed [2, 82, 83, 103]. When characters are cladistic, the problem, called the Cladistic Character Compatibility Problem, is the problem of finding a perfect phylogeny tree on the set of species such that it can be contracted to a subtree of each character tree. This problem is polynomial-time solvable [42, 62, 148]. Experimental research in molecular biology [47, 79, 86, 144] shows that traits can disappear and then reappear during the evolution of a species, suggesting that genes contain information about traits that are not always expressed. In Benham et al. [11, 12], the authors argue that a new model for characters is needed in order for the resultant phylogenetic trees to capture this phenomenon. The authors thus devise the generalized character, which assigns to each species a subset of a set of states, where we only know that the expressed trait (state) is in this subset. The GCCC Problem is then the Cladistic Character Compatibility Problem on a set of species with generalized characters where we first have to pick one state from the subset for each character. Interestingly, generalized characters also capture the case of qualitative characters with missing data (the “Incomplete Perfect Phylogeny” Problem). Here, missing data can be replaced by a “wildcard” generalized state containing all possible states of the character. This problem was shown in [63] to be NP-complete even if the number of states is constant.
In Chapter 5 we study the complexity of several cases of the GCCC Problem that are motivated by the previous works of Benham et al. [11, 12]. In Subsection 5.3.2, we introduce a variant of the C1P which gives us an algorithm for a case of this problem.

Chapter 2 The Gapped Consecutive-Ones Property

In this chapter we show that for every bounded and unbounded k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete. Section 2.1 outlines the notation used in this chapter. Section 2.2 provides a theorem that is central to the results of Section 2.3: that for every bounded k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete. In Section 2.4, we then give an algorithm for a case of the (2, 1)-C1P that is motivated by the type of construction used to obtain the results of Section 2.3. In the final Section 2.5 of this chapter we show that for every δ ≥ 1, deciding the (∞, δ)-C1P is NP-complete.

2.1 Notation and Conventions

First we introduce all notation and conventions used throughout this chapter. Given integers a, b, where a ≤ b, ⟦a, b⟧ denotes the set {a, a + 1, . . . , b}. Let M be a binary m × n matrix (on 0’s and 1’s) with columns labelled by ⟦1, n⟧. In the constructions used to show NP-completeness of deciding the (k, δ)-C1P, we will divide columns of M into ordered sequences of blocks B_1, . . . , B_p by designing rows enforcing the columns of each block to appear together and the blocks to appear in the order B_1, . . . , B_p (resp., in the reversed order), i.e., for any i < j, column c ∈ B_i and d ∈ B_j, c appears before (resp., after) d in any (k, δ)-C1 order of M. The columns of a block B_i will be denoted B_i^1, . . . , B_i^{|B_i|} and B_i^{a,b} = {B_i^a, B_i^{a+1}, . . . , B_i^b}, where a ≤ b.

Figure 2.1: Possible positions of columns 2δ + 2 and 2δ + 3.
To specify a row in a binary matrix M, we use the convention of listing in square brackets only the columns that contain a 1 in this row. For example, [1, 8, 5] represents a row with 1’s in columns 1, 5, and 8, and 0’s everywhere else. We will also use blocks of M to specify columns in the block, for example, if B_1 = {1, 2, 3, 4, 5}, then [B_1, 7] would mean [1, 2, 3, 4, 5, 7], [B_1 \ {B_1^2}, 6, 7] would mean [1, 3, 4, 5, 6, 7], and [B_1^{2,4}, 6] would mean [2, 3, 4, 6].

2.2 Fixing the Order of Selected Columns in a Matrix

For every k ≥ 2, δ ≥ 1, we have the following important property of matrices that have the (k, δ)-C1P. Note that the following construction does not depend on k as it uses only two ones per row.

Theorem 4. For every k ≥ 2 or k = ∞, δ ≥ 1 and s ≥ 2δ + 3, given binary matrix M on n ≥ s columns, s + δ + 1 rows can be added to M to force s selected columns to appear together and in fixed order (or the reverse order) in any (k, δ)-C1 order of M.

Proof. Let k ≥ 2 (or k = ∞), δ ≥ 1, s ≥ 2δ + 3 and n ≥ s. Without loss of generality, let S = {1, . . . , s} be the subset of s columns that we want to force to appear together and in this order (or the reverse order) in any (k, δ)-C1 order of M. We will show by induction on s that there are s + δ + 1 rows of the type [c, d], where 1 ≤ c < d ≤ s and |c − d| ≤ δ + 1, which force this order. For the base case, let us assume that s = 2δ + 3. We will show the base case by induction on δ. If δ = 1, then s = 2 · 1 + 3 = 5, and we add to M the following 7 rows: [1, 2], [2, 3], [3, 4], [4, 5], [1, 3], [2, 4], and [3, 5]. It is easy to check that the claim holds and that the number of rows used is exactly s + δ + 1. Now assume that the claim holds for δ = δ_0 and s = s_0 = 2δ_0 + 3, where δ_0 ≥ 1. We will show that it holds also for δ = δ_0 + 1 and s = 2δ + 3 = 2δ_0 + 5. Using the induction hypothesis, there are s_0 + δ_0 + 1 = s − 2 + δ − 1 + 1 = s + δ − 2 rows, which will force the correct order for columns 1, . .
. , 2δ + 1. Note that all of these rows [c, d] satisfy the condition |c − d| ≤ δ + 1, and hence, they can be added to M for parameters δ = δ_0 + 1 and s = 2δ_0 + 5. In addition, we add to M three new rows: [δ + 1, 2δ + 2], [δ + 2, 2δ + 3] and [2δ + 2, 2δ + 3]. The total number of rows added to M is now s + δ + 1. Figure 2.1 shows the possible positions of columns 2δ + 2 and 2δ + 3 forced by rows [δ + 1, 2δ + 2] and [δ + 2, 2δ + 3] if we assume that columns 1, . . . , 2δ + 1 appear in the correct order. It is easy to see that the row [2δ + 2, 2δ + 3] is (k, δ)-consecutive only if columns 2δ + 2 and 2δ + 3 appear in the correct positions as well. This completes the induction on δ and we have that the claim holds for any δ ≥ 1 and s = 2δ + 3, i.e., the base case for the induction on s. Now, assuming that the claim holds for s − 1, where s − 1 ≥ 2δ + 3, we show that it holds also for s columns. By the induction hypothesis, there are s + δ rows which will force columns 1, . . . , s − 1 to appear in the correct order. We add one new row: [s − δ − 1, s]. Since s − δ − 1 ≥ δ + 3, there is only one position where column s can appear: next to s − 1, i.e., all columns in S appear in the correct order. The number of rows used is exactly s + δ + 1. This completes the induction on s, and the claim follows.

2.3 The Complexity of Deciding the (k, δ)-C1P

In this section we will show that for every k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete.

2.3.1 The Complexity of Deciding the (k, δ)-C1P for every k, δ ≥ 2

For every k, δ ≥ 2, we use Theorem 4 in a reduction from 3SAT to the problem of deciding the (k, δ)-C1P to show that this problem is NP-complete.

Theorem 5. For every k, δ ≥ 2, deciding the (k, δ)-C1P is NP-complete.

Proof. Consider k, δ ≥ 2. Let φ be a 3CNF formula over the n variables {v_1, . . . , v_n}, with m clauses {c_1, . . . , c_m}.
We construct a matrix $M_\phi$ with 2n + d + 6m columns and n + 7m + d + δ + 1 rows, where d = max{2k − 1, 2δ + 3}, such that $M_\phi$ has the (k, δ)-C1P if and only if φ is satisfiable. Goldberg et al. [55] show that for every k ≥ 2, given a 3CNF formula φ, they can construct a matrix $M_\phi$ that has the k-C1P if and only if φ is satisfiable. Our construction is based on theirs.

In our construction, we associate the first 2n columns 1, ..., 2n of $M_\phi$ with the variables $\{v_1, \ldots, v_n\}$. In particular, we associate variable $v_i$ with the pair of columns $b_i = \{2i - 1, 2i\}$, for i ∈ {1, ..., n}. Variable $v_i$ equal to true represents the statement about the order of the columns: "2i − 1 is before 2i" ($v_i$ equal to false represents the statement: "2i − 1 is after 2i"). Since a truth assignment to the formula φ represents a statement about a permutation of the columns of $M_\phi$, we want to relate $M_\phi$ to the clauses $\{c_1, \ldots, c_m\}$ of φ in such a way that only the permutations of $M_\phi$ that are (k, δ)-consecutive correspond to truth assignments that satisfy φ, and vice versa. This construction involves associating the last 6m columns 2n + d + 1, ..., 2n + d + 6m with the clauses $\{c_1, \ldots, c_m\}$. In particular, we associate clause $c_j$ with the block of five columns $B_j = \{2n + d + 6j - 4, \ldots, 2n + d + 6j\}$, while each block $B_j$ is preceded by a column $a_j = \{2n + d + 6j - 5\}$. Finally, the set {2n + 1, ..., 2n + d} of columns in the middle will be used to ensure that the construction works for parameters k and δ. The details are as follows.

The base of our construction is a subset of the columns of $M_\phi$ that we force to be together and in fixed order in any (k, δ)-C1 order of $M_\phi$, and then we will build off of this base a construction similar to that of Goldberg et al. [55]. In particular, we impose this fixed order on the subset {2n + 1, ..., 2n + d} of the columns in the middle of $M_\phi$ by adding d + δ + 1 rows to $M_\phi$ according to Theorem 4.
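To make the role of Theorem 4 concrete, the following Python sketch generates the rows from its proof in the special case δ = 1 (the base case s = 5 plus one row [t − 2, t] per additional column t) and checks by brute force which column orders keep every added row (k, δ)-consecutive. The function names `is_k_delta_consecutive`, `theorem4_rows_delta1` and `forced_orders` are hypothetical, and the exhaustive search is only meant for very small n; it is an illustration under these assumptions, not part of the reduction itself.

```python
from itertools import permutations


def is_k_delta_consecutive(ones, order, k, delta):
    """Return True if the row whose 1's are in the columns `ones` has at most
    k blocks of 1's, with every gap between neighbouring blocks containing at
    most delta 0's, when the columns are arranged as in `order`."""
    pos = sorted(order.index(c) for c in ones)
    # A gap is a run of 0's strictly between two consecutive 1's.
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:]) if b - a > 1]
    return len(gaps) + 1 <= k and all(g <= delta for g in gaps)


def theorem4_rows_delta1(s):
    """Rows of the proof of Theorem 4 for delta = 1 and columns 1..s (s >= 5)."""
    # Base case s = 2*delta + 3 = 5: seven rows.
    rows = [[1, 2], [2, 3], [3, 4], [4, 5], [1, 3], [2, 4], [3, 5]]
    # Induction on s: one row [t - delta - 1, t] for each further column t.
    rows += [[t - 2, t] for t in range(6, s + 1)]
    return rows


def forced_orders(s, n, k=2, delta=1):
    """All orders of columns 1..n in which every added row is (k, delta)-consecutive."""
    rows = theorem4_rows_delta1(s)
    return [order for order in permutations(range(1, n + 1))
            if all(is_k_delta_consecutive(r, order, k, delta) for r in rows)]


if __name__ == "__main__":
    # Only the identity and its reverse survive for s = n = 5.
    print(forced_orders(5, 5))  # [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1)]
```

With one unconstrained extra column (s = 5, n = 6), the same search leaves only the four orders in which columns 1 to 5 stay contiguous and in order or in reverse order, which matches the statement of the theorem for this small case.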
While these d columns must be together and in fixed order (or the reverse) in any (k, δ)-C1 order, we assume the former without loss of generality. We now build the remaining construction off of this block of d columns.

To force the blocks $b_1, \ldots, b_n$ to appear together and in this order, and before the set {2n + 1, ..., 2n + d} of d columns in $M_\phi$, we add the n rows $[b_i, b_{i+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 3, 2n + 2k - 1]$ to $M_\phi$, for i ∈ {1, ..., n}. Observe that, if block $b_n$ is not immediately to the left of the d columns, then there are more than k − 1 gaps in the row $[b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 3, 2n + 2k - 1]$, while, for each i ∈ {1, ..., n − 1}, if block $b_i$ is not immediately to the left of $b_{i+1}$, then there are more than k − 1 gaps in the row $[b_i, b_{i+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 3, 2n + 2k - 1]$. Next, to force the blocks $a_1, B_1, \ldots, a_m, B_m$ to appear together and in this order, and after the set {2n + 1, ..., 2n + d} of d columns in $M_\phi$, we add the 2m rows $[2n + d - (2k - 2), 2n + d - (2k - 4), \ldots, 2n + d - 4, 2n + d - 2, 2n + d, a_1, B_1, \ldots, a_{j-1}, B_{j-1}, a_j]$ and $[2n + d - (2k - 2), 2n + d - (2k - 4), \ldots, 2n + d - 4, 2n + d - 2, 2n + d, a_1, B_1, \ldots, a_j, B_j]$ to $M_\phi$, for j ∈ {1, ..., m}.

Now the blocks of columns in any (k, δ)-C1 order of the matrix $M_\phi$ are ordered as follows: the blocks $b_1, \ldots, b_n$ associated with the variables of φ, followed by the d columns 2n + 1, ..., 2n + d, followed by the blocks $a_1, B_1, \ldots, a_m, B_m$, where the blocks $B_1, \ldots, B_m$ are associated with the clauses of φ. Since the restrictions placed on the variable blocks $\{b_1, \ldots, b_n\}$ and the clause blocks $\{B_1, \ldots, B_m\}$ are the same as in Goldberg et al. [55], we simply have to add rows, similar to those in Goldberg et al. [55], to $M_\phi$ to associate each clause to its three variables to properly simulate 3SAT. The difference between our construction and that of Goldberg et al. [55] is what values the row takes within the segment {2n + 1, ..., 2n + d} of d columns and the m columns $a_1, \ldots, a_m$. We now present the details.

Suppose that clause $c_j$ contains the literal $v_\alpha$. We add the following (corresponding) row to $M_\phi$: $[b_\alpha^2, b_{\alpha+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 7, 2n + 2k - 5, 2n + 2k - 3, 2n + d, a_1, B_1, \ldots, a_j, B_j^1]$. If $v_\alpha$ is false, this forces $B_j^1$ to be the first column of block $B_j$ in any (k, δ)-C1 order of $M_\phi$. Any other order of the columns of $B_j$ would introduce a k-th gap in this row. If $v_\alpha$ appears negated in $c_j$, then we add the row $[b_\alpha^1, b_{\alpha+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 7, 2n + 2k - 5, 2n + 2k - 3, 2n + d, a_1, B_1, \ldots, a_j, B_j^1]$ instead.

Suppose another literal in $c_j$ is $v_\beta$. We add the row $[b_\beta^2, b_{\beta+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 7, 2n + 2k - 5, 2n + 2k - 3, 2n + d, a_1, B_1, \ldots, a_j, B_j^{1,4}]$. If $v_\beta$ is false, this forces $B_j^5$ to be the last column of block $B_j$.

Suppose the third literal of $c_j$ is $v_\gamma$. We add the rows $[b_\gamma^2, b_{\gamma+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 7, 2n + 2k - 5, 2n + 2k - 3, 2n + d, a_1, B_1, \ldots, a_j, B_j^{1,2}]$ and $[b_\gamma^2, b_{\gamma+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 7, 2n + 2k - 5, 2n + 2k - 3, 2n + d, a_1, B_1, \ldots, a_j, B_j^{1,3}]$ to $M_\phi$. If $v_\gamma$ is false, this forces $B_j^3$ to be the middle column of block $B_j$.

[Figure 2.2: The structure of $M_\phi$ and the five rows encoding clause $c_2 = \{v_2 \vee \neg v_3 \vee v_1\}$.]
Finally, we add the row $[2n + 3, 2n + 5, \ldots, 2n + 2k - 7, 2n + 2k - 5, 2n + 2k - 3, 2n + d, a_1, B_1, \ldots, a_{j-1}, B_{j-1}, B_j^1, B_j^3, B_j^5]$ to $M_\phi$. This last row is not (k, δ)-consecutive exactly when $B_j^1$, $B_j^3$, and $B_j^5$ are the first, middle and last columns of block $B_j$, as it contains k gaps then. This fifth row enforces the constraint that not all three literals of $c_j$ can be false. Figure 2.2 illustrates the structure of matrix $M_\phi$, along with these five rows that would be added to $M_\phi$ for clause $c_2 = \{v_2 \vee \neg v_3 \vee v_1\}$.

It remains to show that if any literal in $c_j$ is true, then there is some order of the columns of block $B_j$ such that these five rows are (k, δ)-consecutive. If $v_\alpha$ (resp., $v_\beta$) is true, we can order the columns $B_j^2, B_j^1, B_j^3, B_j^4, B_j^5$ (resp., $B_j^1, B_j^2, B_j^3, B_j^5, B_j^4$). If $v_\gamma$ is true, the columns can be in any order that places $B_j^1$ (resp., $B_j^5$) in the first (resp., last) position, while placing $B_j^2, B_j^3, B_j^4$ in any of the four orders that avoid placing $B_j^3$ in the middle (as this fifth row would have k gaps in this case). Note that these orders work even when the corresponding variable is the only one that is true, and that in all of these orders, no row has a gap of size larger than two. Finally, we remark that if $v_\gamma$ is the only variable that satisfies clause $c_j$, for example, then in all of the four (possible) orders of the columns where these five rows are (k, δ)-consecutive, there is a gap of size two in the fifth row. Hence this construction does not work for δ = 1.

Since, for every k, δ ≥ 2, deciding the (k, δ)-C1P is clearly in NP, by the above reduction from 3SAT, it follows that for every k, δ ≥ 2, deciding the (k, δ)-C1P is NP-complete.

2.3.2 The Complexity of Deciding the (k, 1)-C1P for every k ≥ 3

We slightly modify the reduction from 3SAT in the proof of Theorem 5 to show that, for every k ≥ 3, deciding the (k, 1)-C1P is NP-complete.

Theorem 6. For every k ≥ 3, deciding the (k, 1)-C1P is NP-complete.
Proof. Consider k ≥ 3. Let φ be a 3CNF formula over the n variables $\{v_1, \ldots, v_n\}$, with m clauses $\{c_1, \ldots, c_m\}$. We construct a matrix $M_\phi$ with 2n + d + 4m columns and n + 4m + d + 2 rows, where d = 2k − 1, such that $M_\phi$ has the (k, 1)-C1P if and only if φ is satisfiable. We do this as follows.

We again use Theorem 4 to force the columns 2n + 1, ..., 2n + d to appear together and in fixed order in any (k, 1)-C1 order of $M_\phi$, and build a construction off of this block. We again associate columns 1, ..., 2n with the variables of φ, and associate each clause $c_j$ with block $B_j$. However, $B_j$ now has four columns rather than five, that is, $B_j = \{2n + d + 4j - 3, \ldots, 2n + d + 4j\}$. Note also that we do not have the blocks $a_j$ in this construction. We again add the appropriate rows to $M_\phi$ so that the columns of any (k, 1)-C1 order of the matrix $M_\phi$ are ordered $b_1, \ldots, b_n$, followed by 2n + 1, ..., 2n + d, followed by $B_1, \ldots, B_m$. The only major difference between this reduction and that of Theorem 5 is the manner in which we associate the clauses to their variables to properly simulate 3SAT. The details are as follows.

We need to introduce only three more rows to associate the clauses to their variables to properly simulate 3SAT. Suppose that clause $c_j$ contains literals $v_\alpha$, $v_\beta$ and $v_\gamma$. We add the row $[b_\alpha^2, b_{\alpha+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 9, 2n + 2k - 7, 2n + 2k - 5, 2n + d, B_1, \ldots, B_{j-1}, B_j^{1,2}]$ to $M_\phi$. If $v_\alpha$ is false, this forces $B_j^1$ and $B_j^2$ to be among the first three columns of block $B_j$ in any (k, 1)-C1 order of $M_\phi$. Note that any other order of the columns of $B_j$ would introduce either a gap of size 2, or a k-th gap in this row. Similarly, we add the rows $[b_\beta^2, b_{\beta+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 9, 2n + 2k - 7, 2n + 2k - 5, 2n + d, B_1, \ldots, B_{j-1}, B_j^1, B_j^3]$ and $[b_\gamma^2, b_{\gamma+1}, \ldots, b_n, 2n + 1, 2n + 3, \ldots, 2n + 2k - 9, 2n + 2k - 7, 2n + 2k - 5, 2n + d, B_1, \ldots, B_{j-1}, B_j^1, B_j^4]$ to $M_\phi$.
If $v_\beta$ is false, this forces $B_j^1$ and $B_j^3$ to be among the first three columns of block $B_j$, and if $v_\gamma$ is false, this forces $B_j^1$ and $B_j^4$ to be among the first three columns of block $B_j$. Finally, since $B_j^1, B_j^2, B_j^3, B_j^4$ cannot simultaneously be among the first three columns of block $B_j$, we have that not all three literals of $c_j$ can be false in any (k, 1)-C1 order of $M_\phi$.

It remains to show that if any literal in $c_j$ is true, then there is some order of the columns of block $B_j$ such that these rows are (k, 1)-consecutive. If $v_\alpha$ (resp., $v_\beta$, and $v_\gamma$) is true, we can order the columns $B_j^3, B_j^1, B_j^4, B_j^2$ (resp., $B_j^2, B_j^1, B_j^4, B_j^3$, and $B_j^2, B_j^1, B_j^3, B_j^4$). Note that these orders work even when the corresponding variable is the only one that is true.

Since, for every k ≥ 3, deciding the (k, 1)-C1P is clearly in NP, by the above reduction from 3SAT, it follows that for every k ≥ 3, deciding the (k, 1)-C1P is NP-complete.

In summary, by Theorem 5 and Theorem 6, it follows that for every k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete. The only open question that remains is the complexity of deciding the (2,1)-C1P. In the next section we give a result for a special case of the (2,1)-C1P.

2.4 The (2,1)-C1P

All of the constructions in this chapter used to show NP-completeness of deciding the (k, δ)-C1P for k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1) divided the columns of a binary matrix M into an ordered sequence of blocks $B_1, \ldots, B_p$ by designing rows which force the columns of each block to appear together and the blocks to appear in the order $B_1, \ldots, B_p$ (or in the reversed order), i.e., for any t < u, columns $d \in B_t$ and $e \in B_u$, d appears before e in any (k, δ)-C1 order of M. We will call any permutation of the columns of M that meets this condition a $\{B_1, \ldots, B_p\}$-block-structured order. Given any 3CNF formula φ, we then represented each variable and each clause with a block from $\{B_1, \ldots, B_p\}$, where the permutations of the columns within this block correspond to the configurations (a) true and false, if it is a variable block, or (b) which of its literals is set to true and false, if it is a clause block. The above restriction of columns into blocks then provided enough structure so that for each clause c, we could add some rows to M that introduce dependencies only between the permutations of columns of the block for c and the 3 blocks corresponding to each of c's literals, such that c is satisfied in φ if and only if these rows have the (k, δ)-C1P, and no other dependencies, i.e., that φ is satisfiable if and only if M has a (k, δ)-C1 $\{B_1, \ldots, B_p\}$-block-structured order.

Here we provide a polynomial-time $O(m^2 n(\ell + 1)! + n\ell!\,2^{3\ell})$ and space $O(m^2 n\ell! + 2^\ell)$ algorithm, given a binary matrix M on m rows and n columns whose columns are divided into an ordered sequence of blocks $B_1, \ldots, B_p$, where each block contains at most some fixed constant number ℓ of columns ($|B_t| \le \ell$ for all t ∈ {1, ..., p}), which either

(a) decides if it has a $\{B_1, \ldots, B_p\}$-block-structured (2,1)-C1 order, or

(b) finds a proof that deciding the (2,1)-C1P is NP-complete.

Note that we can force any (2,1)-C1 order to be a $\{B_1, \ldots, B_p\}$-block-structured order by adding rows to the matrix in a way similar to what was done in Theorems 5 and 6. One observation is that this algorithm is FPT in parameter ℓ. Another motivation for this result is that if the (2,1)-C1P is NP-complete even in this $\{B_1, \ldots, B_p\}$-block-structured case, then this algorithm provides an automated tool which could be used to prove this: with some instance of the problem, it would find the proof that deciding the (2,1)-C1P is NP-complete. We now give the algorithm in the next subsection.

2.4.1 The Algorithm

Let M be a binary matrix on m rows and n columns and $\mathcal{B} = \{B_1, \ldots, B_p\}$ be sets of columns of M where $|B_t| \le \ell$ for all t ∈ {1, ..., p}. The basic idea of the algorithm is as follows.
First, it does some preprocessing on M to check if it has some necessary properties for it to have a $\mathcal{B}$-block-structured order that is also a (2,1)-C1 order. If this succeeds, it then checks another condition of matrix M. If this condition holds, it generates a set of 2-clauses of polynomial size that is satisfiable if and only if M has such an order. If this condition does not hold, it is able to find, in polynomial time, a proof that deciding the (2,1)-C1P is NP-complete.

Given M and $\mathcal{B} = \{B_1, \ldots, B_p\}$, for t ∈ {1, ..., p}, let $U_t$ denote the set of permutations of $B_t$ that are (2,1)-C1P with respect to M, for some order of the columns outside of $B_t$. We will explain later exactly how to compute $U_t$, but for now, we observe the following property:

Property 7. The set of $\{B_1, \ldots, B_p\}$-block-structured (2,1)-C1 orders of M is a subset of $U = U_1 \times \cdots \times U_p$.

Let $r_i$, for i ∈ {1, ..., m}, be a row of M. For a set B of columns of M, we use σ(B, i) to denote the subset of columns of B that contain a 1 in row $r_i$. Let $s_i$ (resp., $e_i$) ∈ {1, ..., p} be the index of the first (resp., last) block of M such that $\sigma(B_{s_i}, i)$ (resp., $\sigma(B_{e_i}, i)$) $\neq \emptyset$, and let $\mathcal{B}_i = B_{s_i+1}, \ldots, B_{e_i-1}$ be the sequence of blocks between $B_{s_i}$ and $B_{e_i}$ (we can assume that M does not contain any row of only 0's). Note that $\mathcal{B}_i$ may be empty, i.e., in the case when $s_i = e_i$ or $e_i = s_i + 1$. Since Property 7 holds, it follows that

(i) if $\mathcal{B}_i$ in row $r_i$ contains two or more 0's, then $r_i$, and hence M, is not (2,1)-C1P; and

(ii) if $\mathcal{B}_i$ in row $r_i$ contains exactly one 0, then, in $B_{s_i}$ (resp., $B_{e_i}$) in $r_i$, there cannot be a single 0 to the right (resp., left) of any 1 in any (2,1)-C1 order of row $r_i$, and hence, of M. However, this effectively splits the block $B_{s_i}$ (resp., $B_{e_i}$) into the two blocks $B'_{s_i} = \sigma(B_{s_i}, i)$ (resp., $B'_{e_i} = \sigma(B_{e_i}, i)$) and $B''_{s_i} = B_{s_i} \setminus \sigma(B_{s_i}, i)$ (resp., $B''_{e_i} = B_{e_i} \setminus \sigma(B_{e_i}, i)$). We can then replace the set $\mathcal{B} = \{B_1, \ldots, B_{s_i}, \ldots, B_{e_i}, \ldots, B_p\}$ of blocks with the new set $\mathcal{B}' = \{B_1, \ldots, B''_{s_i}, B'_{s_i}, \ldots, B'_{e_i}, B''_{e_i}, \ldots, B_p\}$ of blocks, and the set of $\mathcal{B}'$-block-structured (2,1)-C1 orders of matrix M with row $r_i$ removed (as this row is always (2,1)-C1P in any $\mathcal{B}'$-block-structured order) will be the same as the set of $\mathcal{B}$-block-structured (2,1)-C1 orders of M.

Since both cases (i) and (ii) for row $r_i$ can be determined in time O(n) and space O(1), and hence, in overall time O(mn) and space O(1) for M, we can assume that, in M, these cases do not apply, i.e., that $\sigma(\mathcal{B}_i, i) = \mathcal{B}_i$.

We now explain how to compute $U_t$ for each t ∈ {1, ..., p}. Since we ruled out cases (i) and (ii) in the previous paragraph, it follows that for each t ∈ {1, ..., p} and row $r_i$ for i ∈ {1, ..., m}, we have the following set of disjoint cases:

(1) $\sigma(B_t, i) = \emptyset$, i.e., $t \notin \{s_i, \ldots, e_i\}$;

(2) $\sigma(B_t, i) = B_t$, i.e., $t \in \{s_i, \ldots, e_i\}$; or

(3) neither (1) nor (2), in which case $t = s_i$ or $t = e_i$, and $B_t$ contains some 0's and some 1's in row $r_i$, and

(a) $s_i < e_i$, or

(b) $s_i = e_i$, i.e., $B_t$ is the only block of M where $\sigma(B_t, i) \neq \emptyset$.

For t ∈ {1, ..., p}, we denote by $S_t^i$ the set of permutations of $B_t$ that are (2,1)-C1P with respect to row $r_i$, for some order of the columns outside of $B_t$. If either case (1) or (2) holds, then any permutation of the columns of $B_t$ is in $S_t^i$. In case (3a), when $t = s_i$ (resp., $e_i$), then, in row $r_i$, any permutation which does not place more than one 0 to the right (resp., left) of any 1 in $B_t$ is in $S_t^i$. In case (3b), in row $r_i$, any (2,1)-C1 order of $B_t$ is in $S_t^i$. Since $|B_t| \le \ell$, determining if a permutation is in $S_t^i$ takes time O(ℓ), and since there are at most ℓ! such permutations, computing $S_t^i$ takes time O((ℓ + 1)!) and space O(ℓ!). Then, set

$$U_t = \bigcap_{i \in \{1, \ldots, m\}} S_t^i, \tag{2.1}$$

and computing $U_t$ for a given t ∈ {1, ..., p} takes time O(mℓ!) and space O(ℓ!). Since p is O(n), computing $U_t$ for all t ∈ {1, ..., p} takes time O(mnℓ!) and space O(nℓ!) overall.
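The computation of the sets $S_t^i$ and of $U_t$ (Equation 2.1) described above can be sketched in Python as follows. This is a minimal brute-force illustration under stated assumptions: the helper names (`S_t_i`, `U_t`, `zeros_after_first_one`, `is_21_consecutive`) and the input encoding (a block as a tuple of column labels, a row as the set of columns holding a 1 together with its indices $s_i$ and $e_i$) are hypothetical, and the preprocessing for cases (i) and (ii) is assumed to have been applied already.

```python
from itertools import permutations


def zeros_after_first_one(bits):
    """Number of 0's to the right of the leftmost 1 (0 if there is no 1)."""
    return bits[bits.index(1):].count(0) if 1 in bits else 0


def is_21_consecutive(bits):
    """At most two blocks of 1's, separated by a gap of at most one 0."""
    pos = [p for p, b in enumerate(bits) if b == 1]
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:]) if b - a > 1]
    return len(gaps) <= 1 and all(g <= 1 for g in gaps)


def S_t_i(block, ones, t, s_i, e_i):
    """Permutations of `block` (block number t) allowed by one row, where
    `ones` is the set of columns holding a 1 in that row."""
    inside = set(block) & set(ones)
    result = set()
    for perm in permutations(block):
        bits = [1 if c in inside else 0 for c in perm]
        if not inside or inside == set(block):   # cases (1) and (2)
            ok = True
        elif s_i == e_i:                         # case (3b)
            ok = is_21_consecutive(bits)
        elif t == s_i:                           # case (3a), first block
            ok = zeros_after_first_one(bits) <= 1
        else:                                    # case (3a), last block
            ok = zeros_after_first_one(bits[::-1]) <= 1
        if ok:
            result.add(perm)
    return result


def U_t(block, t, rows):
    """Intersection of S_t^i over all rows; `rows` holds (ones, s_i, e_i)."""
    result = set(permutations(block))
    for ones, s_i, e_i in rows:
        result &= S_t_i(block, ones, t, s_i, e_i)
    return result
```

For instance, with block (1, 2, 3) as block 2, a row with 1's in columns {3, 4, 5} that starts in block 2 and ends in block 3, and a row with 1's in columns {0, 1} that starts in block 1 and ends in block 2, `U_t` returns the three permutations (1, 2, 3), (1, 3, 2) and (2, 1, 3).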
Note that if $U_t = \emptyset$ for some t, then $U = \emptyset$, and hence, by Property 7, M does not have the (2,1)-C1P. We remark that if only one block $B_1$ is associated with M, then $U_1$ is simply the set of (2,1)-C1 orders of M. This completes the details of the preprocessing phase on M to check if it has some necessary properties for it to have a $\{B_1, \ldots, B_p\}$-block-structured (2,1)-C1 order.

Up to this point, we have ruled out the trivial cases (i) and (ii) when M does not have such an order and we have computed $U_t$ for all t ∈ {1, ..., p}, and we can assume that $U_t \neq \emptyset$. The set of $\mathcal{B}$-block-structured (2,1)-C1 orders of M is a subset of U (Property 7); however, it may be the case that it is not equal to U, as a choice of one permutation in some $U_i$ and another in some $U_j$ might lead to an order which is not (2,1)-consecutive. In particular, for any row $r_i$ for i ∈ {1, ..., m} where $s_i < e_i$, $B_{s_i}$ and $B_{e_i}$ (or neither) can be permuted such that exactly one 0 is to the right (resp., left) of any 1 in row $r_i$, but both blocks cannot be in this state if $r_i$ is to be (2,1)-consecutive. We will express this dependency on the permutations of $B_{s_i}$ and $B_{e_i}$ with a disjunction on two Boolean variables, defined below.

For t ∈ {1, ..., p}, let $P_t^i$ be the set of permutations $\pi_t \in U_t$ that do not place any 0 to the right (resp., left) of any 1 in $B_t$ in row $r_i$, in the case (3a) when $t = s_i$ (resp., $e_i$). So that $P_t^i$ is defined for every block/row pair, we let $P_t^i = U_t$ in cases (1), (2) and (3b) for t ∈ {1, ..., p}, i ∈ {1, ..., m}. Note that, for a given t and i, $P_t^i$ (like $S_t^i$) can also be constructed in time O((ℓ + 1)!) and space O(ℓ!), for overall time O(mn(ℓ + 1)!) and space O(mnℓ!) to compute $P_t^i$ for all t ∈ {1, ..., p} and i ∈ {1, ..., m}. Let Boolean variable $X_{t,i}$ represent $\pi_t \in P_t^i$, for t ∈ {1, ..., p}, i ∈ {1, ..., m}. It follows that $r_i$ is (2,1)-consecutive if and only if it satisfies

$$X_{s_i,i} \vee X_{e_i,i}. \tag{2.2}$$

Note that Equation 2.2 is defined for all rows $r_i$ for i ∈ {1, ..., m} (in the case that $s_i = e_i$, Equation 2.2 is a tautology by the fact that $U_t \neq \emptyset$ for all t ∈ {1, ..., p}).

In the previous paragraph, we saw that, for any given row $r_i$, there is a one-to-one correspondence between satisfying truth assignments to Equation 2.2 and (2,1)-C1 orders of $r_i$. However, the same correspondence between (2,1)-C1 orders of M and satisfying truth assignments of

$$\bigwedge_{i \in \{1, \ldots, m\}} \left( X_{s_i,i} \vee X_{e_i,i} \right) \tag{2.3}$$

does not hold in general. This is due to the fact that when a set A ⊆ {1, ..., m} of two or more rows is involved, one of $\{s_i, e_i\}$ for each $r_i$, i ∈ A, can coincide on the single block $B_t$, and $\bigcap_{i \in A} P_t^i = \emptyset$. This means that a truth assignment τ to the pairs of variables $X_{t,i}$ corresponding to the rows of A may not be valid, where we say that a truth assignment τ is valid when there is a $\mathcal{B}$-block-structured order $\pi = \pi_1, \ldots, \pi_p$ of the columns of M such that $\tau(X_{t,i})$ = true if and only if $\pi_t \in P_t^i$ for all t ∈ {1, ..., p}, i ∈ {1, ..., m}. So, in addition to τ satisfying Equation 2.3, we must also ensure that τ is valid. This can be done simply by ensuring, for any such set of rows A where $\bigcap_{i \in A} P_t^i = \emptyset$ for some t ∈ {1, ..., p}, that not all $X_{t,i}$, i ∈ A, are set to true, which can be encoded by

$$\bigwedge_{t \in \{1, \ldots, p\}} \; \bigwedge_{\substack{A \subseteq \{1, \ldots, m\} \\ \bigcap_{i \in A} P_t^i = \emptyset}} \; \bigvee_{i \in A} \neg X_{t,i}. \tag{2.4}$$

While the (2,1)-C1 orders of M correspond to the satisfying assignments of Equation 2.4, this SAT formulation can have clauses of size as large as m, since A can be as large as m. We now give the following condition of M, which can be checked in polynomial time $O(mn\ell + n\ell!\,2^{3\ell})$. If the condition holds, then Equation 2.4 can be replaced by an equivalent set of 2-clauses, and if the condition does not hold, then an NP-completeness proof can be constructed.

Condition 8. For every t ∈ {1, ..., p} and A ⊆ {1, ..., m}, $\bigcap_{i \in A} P_t^i = \emptyset$ implies that there exist i, j ∈ A such that $P_t^i \cap P_t^j = \emptyset$.

It follows that if this condition holds, then for every t ∈ {1, ..., p}, it is sufficient to forbid $X_{t,i}$ and $X_{t,j}$ from both being true in τ for every pair i, j ∈ {1, ..., m} such that $P_t^i \cap P_t^j = \emptyset$. We can hence replace Equation 2.4 with

$$\bigwedge_{t \in \{1, \ldots, p\}} \; \bigwedge_{\substack{i, j \in \{1, \ldots, m\} \\ P_t^i \cap P_t^j = \emptyset}} \left( \neg X_{t,i} \vee \neg X_{t,j} \right). \tag{2.5}$$

Clearly this is a 2SAT formulation of polynomial size $O(m^2 n)$. Note that Equation 2.5 can be constructed in polynomial time $O(m^2 n\ell!)$. We now show how to perform the polynomial-time check to see if this condition holds for the particular instance M, and how to construct an NP-completeness proof if Condition 8 does not hold.

To check this condition, we have to check for every t ∈ {1, ..., p} if there is an A ⊆ {1, ..., m} with |A| > 2 such that $\bigcap_{i \in A} P_t^i = \emptyset$. For a given t ∈ {1, ..., p}, $|B_t| \le \ell$, and since M is a binary matrix, there are only $2^\ell$ unique rows in $B_t$. It takes time $O(m\ell + \ell 2^\ell)$ to find this set of unique rows and space $O(2^\ell)$ to store (or index) this set. From this set, there are $2^{2^\ell}$ choices for A. For each choice of A, we have to compute $\bigcap_{i \in A} P_t^i$, which takes time $O(\ell!\,2^\ell)$. For each of these intersections, we have to compute $P_t^i \cap P_t^j$ for each i, j ∈ A. Since computing $P_t^i \cap P_t^j$ takes time O(ℓ!), and there are $2^{2\ell}$ pairs i, j, this step takes time $O(\ell!\,2^{2\ell})$. Hence, the time of this check for a given t ∈ {1, ..., p} is $O(m\ell + \ell 2^\ell + 2^\ell \cdot (\ell!\,2^\ell + \ell!\,2^{2\ell}))$, which simplifies to $O(m\ell + \ell!\,2^{3\ell})$. Again, since p is O(n), this check takes overall time $O(mn\ell + n\ell!\,2^{3\ell})$. Since $P_t^i$ for each t ∈ {1, ..., p} and i ∈ {1, ..., m} has already been computed previously, the only additional space used is $O(2^\ell)$ to store the set of unique rows for the current $B_t$, and O(1) for the pair i, j, and hence this check uses space $O(2^\ell)$.

Now, suppose that we find a set A ⊆ {1, ..., m} with |A| > 2 such that $\bigcap_{i \in A} P_t^i = \emptyset$. For simplicity, let A be the set of rows $\{r_1, r_2, r_3\}$. If this is the case, it follows that for some t ∈ {1, ..., p}, $P_t^1 \cap P_t^2 \cap P_t^3 = \emptyset$, while for any pair {i, j} ⊂ {1, 2, 3} where i ≠ j, $P_t^i \cap P_t^j \neq \emptyset$. This property allows us to use A to build a 3-clause gadget, for a reduction from 3SAT, similar to that of Subsection 2.3.1, to the problem of deciding if M has a $\mathcal{B}$-block-structured (2,1)-C1 order. Figure 2.3 illustrates the structure of this construction along with the 3 rows that would be added for the clause $c_j = \{v_1 \vee v_2 \vee \neg v_3\}$. Note that if |A| > 3, then an |A|-clause gadget can be built in a similar fashion. Since $|A| \le 2^\ell$, this construction is of size polynomial in M, and hence deciding the (2,1)-C1P would be NP-complete if Condition 8 does not hold for some M.

[Figure 2.3: The structure of the construction for a 3CNF formula φ on the set V of variables and C of clauses, along with the 3 rows encoding the clause $c_j = \{v_1 \vee v_2 \vee \neg v_3\}$. The blocks $b_1, \ldots, b_{|V|}$ correspond to the variables of φ in exactly the same way as in the construction of Subsection 2.3.1. The blocks $D_1, \ldots, D_{|C|}$ correspond to the clauses. Here, for i ∈ {1, 2, 3}, $\hat{r}_i$ is row $r_i$ restricted to the columns of $B_t$, and $P_t^1$, $P_t^3$ (resp., $P_t^2$) are sets of permutations that do not place any 0 to the left (resp., right) of any 1 in $B_t$ in rows $r_1$, $r_3$ (resp., $r_2$). It follows that all truth assignments to the literals of $c_j$ are (2,1)-C1 orders except for the case when all 3 literals are false ($c_j$ is not satisfied), since $P_t^1 \cap P_t^2 \cap P_t^3 = \emptyset$. Note that for each i ∈ {1, ..., |V|}, rows can be added to force the copy of variable block $b_i$ on the left and right of the clause blocks to encode the same truth value.]

Finally, we summarize the time and space complexity of this algorithm. The preprocessing phase, i.e., checking cases (i) and (ii) for each row of M, takes time O(mn) and space O(1).
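Returning to the check of Condition 8 and the 2-clauses of Equation 2.5 discussed above, the following Python sketch shows one way these two steps could look for sets $P_t^i$ that have already been computed. It is a plain brute force over subsets of the distinct sets, so it does not achieve the running time stated in the text; the function names and the dictionary encoding (`P[t][i]` holds $P_t^i$ as a set of hashable items) are assumptions made for illustration only.

```python
from itertools import combinations


def condition8_holds(P_t):
    """Check Condition 8 for one block t, where P_t maps a row index i to the
    set P_t^i. The condition fails if some family of sets has an empty common
    intersection although every two of its members intersect."""
    distinct = list({frozenset(s) for s in P_t.values()})
    for size in range(3, len(distinct) + 1):
        for family in combinations(distinct, size):
            empty_overall = not frozenset.intersection(*family)
            pairwise_meet = all(x & y for x, y in combinations(family, 2))
            if empty_overall and pairwise_meet:
                return False
    return True


def two_clauses(P):
    """Clauses of Equation 2.5: for every block t and every pair of rows i, j
    with disjoint P_t^i and P_t^j, forbid X_{t,i} and X_{t,j} both being true.
    A literal ("not", t, i) stands for the negation of X_{t,i}."""
    clauses = []
    for t, P_t in P.items():
        for i, j in combinations(sorted(P_t), 2):
            if not (set(P_t[i]) & set(P_t[j])):
                clauses.append((("not", t, i), ("not", t, j)))
    return clauses
```

Restricting the search to distinct sets is safe here because repeating a set in A changes neither the overall intersection nor the pairwise intersections with other sets. The resulting clauses, together with those of Equation 2.3, could then be handed to any 2SAT procedure.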
For each t ∈ {1, ..., p} and i ∈ {1, ..., m}, computing $S_t^i$ (and $P_t^i$) takes time O((ℓ + 1)!) and space O(ℓ!), for overall time O(mn(ℓ + 1)!) and space O(mnℓ!) for this step. Computing $U_t$ for all t ∈ {1, ..., p} takes time O(mnℓ!) and space O(nℓ!) overall. Performing the check for Condition 8 takes time $O(mn\ell + n\ell!\,2^{3\ell})$ and space $O(2^\ell)$. If the condition holds, then the algorithm computes Equation 2.5, which takes time $O(m^2 n\ell!)$, generating a 2SAT formulation of size $O(m^2 n)$. In summary, it follows that this algorithm runs in time $O(m^2 n(\ell + 1)! + n\ell!\,2^{3\ell})$ and space $O(m^2 n\ell! + 2^\ell)$.

Theorem 9. Given a binary matrix M on n columns and m rows and a collection $\mathcal{B} = \{B_1, \ldots, B_p\}$ of sets of columns of M where $|B_t| \le \ell$ for all t ∈ {1, ..., p} for some fixed constant number ℓ, there is an algorithm that runs in polynomial time $O(m^2 n(\ell + 1)! + n\ell!\,2^{3\ell})$ and space $O(m^2 n\ell! + 2^\ell)$ which either

(a) decides if there is a $\mathcal{B}$-block-structured order of M that is also a (2,1)-C1 order, or

(b) finds a proof that deciding the (2,1)-C1P is NP-complete.

While this algorithm checks Condition 8 for the particular instance M, we conjecture that Condition 8 holds for all binary matrices. If this is the case, as a corollary of Theorem 9, we could omit the check of this condition for a faster algorithm.

Corollary 10. If Condition 8 holds for all binary matrices, then given a binary matrix M on n columns and m rows and a collection $\mathcal{B} = \{B_1, \ldots, B_p\}$ of sets of columns of M where $|B_t| \le \ell$ for all t ∈ {1, ..., p} for some fixed constant number ℓ, there is an algorithm that runs in polynomial time $O(m^2 n(\ell + 1)!)$ and space $O(m^2 n\ell!)$ which decides if there is a $\mathcal{B}$-block-structured order of M that is also a (2,1)-C1 order.

2.5 The Complexity of Deciding the (∞, δ)-C1P

Here we show that for every δ ≥ 1, deciding the (∞, δ)-C1P is NP-complete.
The first step is to reduce 3SAT(3), the version of the 3SAT Problem where no variable appears more than twice positively and more than once negatively, to an auxiliary version of the 3SAT Problem. We then reduce this auxiliary version to the problem of deciding the (∞, δ)-C1P for the result.

2.5.1 The 3SAT(L:2,R:2) Problem

First we reduce from 3SAT(3), the version of the 3SAT Problem with 2-clauses and 3-clauses, and where no variable appears more than twice positively and more than once negatively [120, p. 183, Prop. 9.3], to an auxiliary version of the 3SAT Problem, namely 3SAT(L:2,R:2): the version of the 3SAT Problem with 2-clauses and 3-clauses, where each clause is assigned the label L or R (for left or right) such that for each label, no variable appears more than once positively and more than once negatively in the corresponding set of clauses.¹

Lemma 11. The 3SAT(L:2,R:2) Problem is NP-complete.

Proof. We are given an instance of the 3SAT(3) Problem: a set V of variables and a set C of 2- and 3-clauses, such that for each v ∈ V, v appears no more than twice in C and ¬v appears no more than once in C. For each v ∈ V with two positive occurrences, we replace one of the occurrences of v with the new variable v′. We then label all the clauses of this new instance with L. Note that in this set of clauses labelled with L, no variable appears more than once positively and once negatively. Now, for each appearance of v′, we add the two new clauses $c_v^1 = v' \vee \neg v$ and $c_v^2 = v \vee \neg v'$, and label them both with R. These two clauses enforce the constraint that v = v′ in any satisfying assignment to this new instance of the 3SAT Problem; thus this new instance is satisfiable if and only if the original 3SAT Problem instance is satisfiable. This new instance of the 3SAT Problem has 2- and 3-clauses, and for each of the labels L and R, no variable appears more than once positively and once negatively.
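The transformation used in this proof can be sketched in a few lines of Python. The encoding is an assumption made for illustration: a clause is a list of nonzero integers (v for a variable, −v for its negation), the input is assumed to already satisfy the 3SAT(3) occurrence bounds, and the names `to_lr_instance`, `respects_labels` and `satisfiable` are hypothetical. The brute-force `satisfiable` helper is only there to compare small instances.

```python
from itertools import product


def to_lr_instance(clauses):
    """Turn a 3SAT(3) instance into a labelled 3SAT(L:2,R:2) instance.
    Returns a list of (label, clause) pairs."""
    next_var = max(abs(lit) for c in clauses for lit in c) + 1
    seen_positive = set()
    left, right = [], []
    for clause in clauses:
        new_clause = []
        for lit in clause:
            if lit > 0 and lit in seen_positive:
                # Second positive occurrence of v: replace it by a fresh v'.
                v_prime, next_var = next_var, next_var + 1
                new_clause.append(v_prime)
                # The two R-clauses (v' or not v) and (v or not v') force v = v'.
                right.append(("R", [v_prime, -lit]))
                right.append(("R", [lit, -v_prime]))
            else:
                if lit > 0:
                    seen_positive.add(lit)
                new_clause.append(lit)
        left.append(("L", new_clause))
    return left + right


def respects_labels(instance):
    """Per label, every literal (positive or negative) occurs at most once."""
    seen = set()
    for label, clause in instance:
        for lit in clause:
            if (label, lit) in seen:
                return False
            seen.add((label, lit))
    return True


def satisfiable(clauses):
    """Exhaustive satisfiability test, for tiny instances only."""
    n = max(abs(lit) for c in clauses for lit in c)
    return any(all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses)
               for bits in product([False, True], repeat=n))
```

For example, the clauses [[1, 2, -3], [1, 3], [-1, -2]] become the L-clauses [1, 2, -3], [4, 3], [-1, -2] and the R-clauses [4, -1], [1, -4], where variable 4 plays the role of v′ for v = 1.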
Thus we have transformed in polynomial time the instance of the 3SAT(3) Problem to an instance of the 3SAT(L:2,R:2) Problem that is satisfiable if and only if the original 3SAT(3) instance is satisfiable. Since the 3SAT(L:2,R:2) Problem is clearly in NP, it follows that the 3SAT(L:2,R:2) Problem is NP-complete. 2.5.2 The Complexity of Deciding the (∞, 1)-C1P We now show that the problem of deciding the (∞, 1)-C1P is NP-Complete by giving a reduction from 3SAT(L:2,R:2). We will later generalize this reduction to show that for every δ ≥ 1, deciding the (∞, δ )-C1P is NP-Complete. Theorem 12. Deciding the (∞, 1)-C1P is NP-complete. 1 We remark that the exact formulation of 3SAT(3) in Papadimitriou [120] allows also variables with one positive and two negated occurrences, however these can easily be converted to the other type of variables by replacing them with their negations in all clauses. Clearly, this does not affect the complexity of the problem. 47 Proof. We are given an instance φ of the 3SAT(L:2,R:2) Problem: a set V of variables and the sets CL and CR of 2- and 3-clauses, such that for each v ∈ V , v and ¬v each appear no more than once in CS , for S ∈ {L, R}. We use φ to build a matrix Mφ such that φ is satisfiable if and only if Mφ has the (∞, 1)-C1P. The idea of the construction is that for each variable vi ∈ V = {v1 , . . . , vn }, the matrix Mφ will have the block of columns bi , called the variable block, to represent the value of this variable. Matrix Mφ will also contain the blocks of columns b0,1 , . . . , bn,n+1 of dummy blocks that will interleave the variable blocks. We will add some rows to Mφ to force the individual columns of each of the variable and dummy blocks to appear together and in fixed order, or the reverse order. The direction of block bi will represent the value of the variable vi . We will then add some rows to Mφ to force only the order b0,1 , b1 , b1,2 , . . . 
, bn−1,n , bn , bn,n+1 (or the reverse order) of these blocks, while the individual variable blocks may switch direction relative to this order. If variable block bi is in the same order relative to this order of all of the blocks then its corresponding variable vi has value true, otherwise it has value false. The matrix Mφ will also have an additional 2n free columns. To each clause c ∈ C = {CL ∪ CR } we associate a unique empty free column fc . This is possible since for every S ∈ {L, R}, each variable appears no more than once positively and once negatively in CS , and each c ∈ CS contains at least 2 variables, and hence |CS | ≤ 2n/2 = n. Thus |CL | + |CR | ≤ 2n. We then add some rows to Mφ to force these 2n free columns to fall (in any order) between the 2n pairs of adjacent bi−1,i , bi and bi , bi,i+1 blocks, for i ∈ 1, n , such that there is one free column for each hole. For a clause c ∈ CL (resp., CR ) where c contains variables vα , vβ (and vγ for a 3-clause), we assign this clause to column fc of the 2n free columns, and we add a row to Mφ that forces the column fc to be to the left (resp., right) of either block bα , bβ (or bγ for a 3-clause). However, column fc can only go to the left (resp., right) of the block of a variable when its corresponding literal is set to the value that satisfies clause c. Note by the construction that each variable can satisfy at most one left and one right clause, which is sufficient because each literal appears at most once in a right (resp., left) clause. These properties will imply that only when, for every c ∈ CL (resp., CR ), column fc can be placed to the left (resp., right) of a bi , for i ∈ 1, n , for a vi that is set to a value that satisfies c, i.e., φ is satisfied, 48 free columns b0,1 b1 b1,2 bn−1,n ... bn,n+1 bn Figure 2.4: The structure of matrix Mφ . is there a (∞, 1)-C1 order of Mφ , and vice versa. We now give the full details of the construction in what follows. For each variable vi ∈ V = {v1 , . . . 
, v_n}, we add the set of columns b_i = {b_i^1, …, b_i^5} to M_φ. In addition, for every i ∈ {1, …, n+1}, we add the set of columns b_{i−1,i} = {b_{i−1,i}^1, …, b_{i−1,i}^5} to M_φ. For each set of columns b_i for i ∈ {1, …, n}, and b_{i−1,i} for i ∈ {1, …, n+1}, we add to M_φ the rows according to Theorem 4 to force the columns of each set to appear together and in a fixed order (or the reverse) in any (∞, 1)-C1 order of M_φ, i.e., in any (∞, 1)-C1 order of M_φ, set b_i will appear either as the sequence b_i^1, …, b_i^5 or as b_i^5, …, b_i^1 of consecutive columns, and similarly for the columns in the sets b_{i−1,i}. We will refer to the b_i as variable blocks and the b_{i−1,i} as dummy blocks. Note that Theorem 4 requires a set of columns to have size 2δ + 3 before such an order can be enforced on it; this is why each block is of size five. In addition, we add 2n free columns to M_φ.

Now, for each pair of blocks b_{i−1,i}, b_i and b_i, b_{i,i+1} for i ∈ {1, …, n}, we add the rows [(b_{i−1,i} \ {b_{i−1,i}^1}) ∪ b_i] and [b_i ∪ (b_{i,i+1} \ {b_{i,i+1}^5})] to force these pairs to be together with at most one free column in between them. This enforces that the blocks appear in the order b_{0,1}, b_1, b_{1,2}, …, b_{n−1,n}, b_n, b_{n,n+1} (or the reverse) in any (∞, 1)-C1 order of M_φ. The first (resp., last) column of the dummy blocks is omitted to fix their direction (relative to the order of the blocks) under the assumption that there is a free column between each pair of neighboring blocks, which we will now enforce with the following row. We add to M_φ the row [B ∪ F], where B = b_{0,1}^{2,4} ∪ b_1^{2,4} ∪ ··· ∪ b_n^{2,4} ∪ b_{n,n+1}^{2,4}, and F is the set of 2n free columns. It now follows that between each b_{i−1,i}, b_i and b_i, b_{i,i+1} pair for i ∈ {1, …, n}, there must lie at least one column from F, in any (∞, 1)-C1 order of M_φ. Since there are exactly 2n such pairs and exactly 2n free columns, between each pair there must be exactly one. Figure 2.4 depicts all (∞, 1)-C1 orders of the current matrix M_φ.
Note that the columns in each variable block can be oriented either in the same direction as the order of all of the blocks, or in the reverse direction. If variable block b_i is oriented in the same direction as the order of all of the blocks, this corresponds to setting the variable v_i to true, while the reverse direction corresponds to v_i being false.

Now it remains to add rows to M_φ to force the free column associated with each clause to fall next to only the blocks of variables that are set to a value that satisfies the clause. Let c ∈ C_L (resp., C_R) contain the variables v_α, v_β (and v_γ for a 3-clause), and let f_c ∈ F be the free column associated with clause c. We add the row [B ∪ (F \ {f_c}) ∪ S_c] to M_φ, where S_c is defined as follows. If c ∈ C_L, then for each j ∈ {α, β} (j ∈ {α, β, γ} for a 3-clause), if v_j appears positively (resp., negatively) in c, set S_c contains the columns {b_{j−1,j}^5, b_j^1} (resp., {b_{j−1,j}^5, b_j^5}). Otherwise, if c ∈ C_R, then for each j, if v_j appears positively (resp., negatively) in c, set S_c contains the columns {b_j^5, b_{j,j+1}^1} (resp., {b_j^1, b_{j,j+1}^1}).

Adding these extra ones around the variable blocks b_j for each j forces f_c to fall only to the immediate left (resp., right) of one of these b_j in any (∞, 1)-C1 order of M_φ. Furthermore, f_c can only fall to the immediate left (resp., right) of a b_j if that block is oriented in a direction such that the corresponding variable v_j is set to a value that sets its literal to true, i.e., if v_j satisfies c. Hence, the satisfying assignments of any individual clause c correspond to the (∞, 1)-C1 orders of the submatrix of M_φ consisting of the row added for clause c, and all of the rows previously added to M_φ for the blocks b_i for i ∈ {1, …, n}, and b_{i−1,i} for i ∈ {1, …, n+1}.
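The definition of S_c above can be written down mechanically. The following is a minimal sketch for the case δ = 1 (blocks of size five); the function name and the tuple labels used for columns are hypothetical and chosen only for illustration, they are not notation from the thesis.

```python
def clause_columns(clause, side):
    """Return the set S_c for a clause, for delta = 1 (blocks of size five).

    clause: list of (j, positive) pairs, one per variable v_j in the clause,
            where positive is True if v_j appears positively.
    side:   "L" if the clause is in C_L, "R" if it is in C_R.

    Columns are labelled with hypothetical tuples:
      ("var", j, t)      stands for b_j^t
      ("dummy", i, j, t) stands for b_{i,j}^t
    """
    columns = set()
    for j, positive in clause:
        if side == "L":
            # last column of the dummy block to the left of b_j
            columns.add(("dummy", j - 1, j, 5))
            # b_j^1 for a positive occurrence, b_j^5 for a negated one
            columns.add(("var", j, 1 if positive else 5))
        elif side == "R":
            # b_j^5 for a positive occurrence, b_j^1 for a negated one
            columns.add(("var", j, 5 if positive else 1))
            # first column of the dummy block to the right of b_j
            columns.add(("dummy", j, j + 1, 1))
        else:
            raise ValueError("side must be 'L' or 'R'")
    return columns
```

For example, under these labels a left clause (v_2 ∨ ¬v_3) yields the columns b_{1,2}^5, b_2^1, b_{2,3}^5 and b_3^5.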
After adding the row for every clause c ∈ C_L ∪ C_R, the remaining (∞, 1)-C1 orders of M_φ (if any exist) correspond to the cases where for every clause c ∈ C_L (resp., C_R), its corresponding column f_c is placed to the immediate left (resp., right) of a block of a variable that is set to a value (true or false) that satisfies c, that is, to satisfying assignments of φ. Conversely, if φ has a satisfying assignment, then we can assign each c ∈ C_L (resp., C_R) to a unique v ∈ V that satisfies c, in the sense that either v or ¬v satisfies c, i.e., each v ∈ V will satisfy at most one clause from C_L and at most one clause from C_R. We can make this claim because v and ¬v each appear no more than once in C_L (resp., C_R), and at most one of v and ¬v satisfies a given clause c. Thus we can assign each column f_c of M_φ to a unique slot to the immediate left (resp., right) of block b_i for i ∈ {1, …, n}, for the corresponding v_i that satisfies the clause c. Thus M_φ has an (∞, 1)-C1 order. Hence, φ is satisfiable if and only if M_φ has the (∞, 1)-C1P.

In summary, given a 3SAT(L:2,R:2) formula φ with n variables and m ≤ 2n clauses, we have constructed a matrix M_φ with 12n + 5 columns and 16n + m + 8 rows such that M_φ has the (∞, 1)-C1P if and only if φ is satisfiable. Given that deciding the (∞, 1)-C1P is clearly in NP, and Lemma 11, it follows that deciding the (∞, 1)-C1P is NP-complete.

2.5.3 The Complexity of Deciding the (∞, δ)-C1P

We now generalize the construction given in Subsection 2.5.2 to show that for every δ ≥ 1, the problem of deciding the (∞, δ)-C1P is NP-complete by reduction from 3SAT(L:2,R:2).

Theorem 13. For every δ ≥ 1, deciding the (∞, δ)-C1P is NP-complete.

Proof. Consider δ ≥ 1. Here, given an instance φ of 3SAT(L:2,R:2), we build a matrix M_φ such that φ is satisfiable if and only if M_φ has the (∞, δ)-C1P.
The idea of the construction is the same as that of the proof of Theorem 12: it will again have the blocks b_i for i ∈ {1, …, n}, and b_{i−1,i} for i ∈ {1, …, n+1}, as well as 2n free columns for the clauses; only the blocks will need more columns, and we will need to add more rows to M_φ in order for it to behave in the same way for arbitrary δ.

For each block b_i for i ∈ {1, …, n}, and b_{i−1,i} for i ∈ {1, …, n+1}, we again add to M_φ the rows according to Theorem 4 to force each individual block to be in a fixed order (or the reverse) in any (∞, δ)-C1 order of M_φ. Thus, each block will contain 2δ + 3 columns. In order to force each pair of blocks b_{i−1,i}, b_i and b_i, b_{i,i+1} for i ∈ {1, …, n} to be together, with at most one free column in between them, thus enforcing a total order on the blocks, we add the rows [b_{i−1,i}^{δ+1,δ+4} ∪ b_i] and [b_i ∪ b_{i,i+1}^{δ,δ+3}]. Note here that the first (resp., last) δ columns of the dummy blocks are omitted to fix their direction (relative to the order of the blocks) under the assumption that there is a free column between each pair of neighboring blocks, which we enforce by adding to M_φ the row [B ∪ F], where B = b_{0,1}^{δ+1,δ+3} ∪ b_1^{δ+1,δ+3} ∪ ··· ∪ b_n^{δ+1,δ+3} ∪ b_{n,n+1}^{δ+1,δ+3}, and F is a set of 2n free columns. Now M_φ again has the desired structure, as depicted in Figure 2.4.

Now it remains to add rows to M_φ for the clauses. Let c ∈ C_L (resp., C_R) contain the variables v_α, v_β (and v_γ for a 3-clause), and let f_c ∈ F be the free column associated with clause c. We add the row [B ∪ (F \ {f_c}) ∪ S_c] to M_φ, where S_c is defined as follows. If c ∈ C_L, then for each j ∈ {α, β} (j ∈ {α, β, γ} for a 3-clause), if v_j appears positively (resp., negatively) in c, set S_c contains the columns {b_{j−1,j}^{2δ+3}, b_j^1} (resp., {b_{j−1,j}^{2δ+3}, b_j^{2δ+3}}). Otherwise, if c ∈ C_R, then for each j, if v_j appears positively (resp., negatively) in c, set S_c contains the columns {b_j^{2δ+3}, b_{j,j+1}^1} (resp., {b_j^1, b_{j,j+1}^1}).
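The number of columns of this construction can be tallied directly from the description above. The following worked computation is a sketch under the assumption that the only columns of M_φ are those of the n variable blocks, the n + 1 dummy blocks (each of size 2δ + 3) and the 2n free columns, i.e., that the rows added according to Theorem 4 introduce no further columns.

```latex
\begin{align*}
\#\mathrm{columns}(M_\varphi)
  &= \underbrace{n(2\delta+3)}_{\text{variable blocks } b_i}
   + \underbrace{(n+1)(2\delta+3)}_{\text{dummy blocks } b_{i-1,i}}
   + \underbrace{2n}_{\text{free columns}} \\
  &= (2n+1)(2\delta+3) + 2n \\
  &= (4\delta+8)\,n + 2\delta + 3 .
\end{align*}
```

Setting δ = 1 gives 12n + 5, which agrees with the column count stated for the construction in the proof of Theorem 12.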
Now this matrix M_φ will have the same behavior as in the proof of Theorem 12; hence φ is satisfiable if and only if M_φ has the (∞, δ)-C1P.

In summary, for every δ ≥ 1, given a 3SAT(L:2,R:2) formula φ with n variables and m ≤ 2n clauses, we have constructed a matrix M_φ with (4δ + 8)n + 2δ + 3 columns and (6δ + 10)n + m + 3δ + 4 rows such that M_φ has the (∞, δ)-C1P if and only if φ is satisfiable. Given that for every δ ≥ 1, deciding the (∞, δ)-C1P is clearly in NP, and Lemma 11, it follows that for every δ ≥ 1, deciding the (∞, δ)-C1P is NP-complete.

Chapter 3

The Gapped Consecutive-Ones Property for Matrices of Bounded Maximum Degree

In this chapter, we study the (k, δ)-C1P with a third parameter d, the bound on the maximum degree of M. In Section 3.1 we first provide an algorithm for the case of the (d, k, δ)-C1P when all three parameters are fixed constants. In Section 3.2, we show, in four subsections, that deciding the (d, k, ∞)-C1P for every d > k ≥ 2 is NP-complete. First, in Subsection 3.2.1, we give the definition of a type of hypergraph covering problem. In Subsection 3.2.2 we show that a special case of this hypergraph covering problem is NP-complete, and then in Subsection 3.2.3 we generalize this construction to show that the general case of this hypergraph covering problem is NP-complete. Finally, in Subsection 3.2.4 we show a direct correspondence between the general case of this hypergraph covering problem and deciding the (d, k, ∞)-C1P for every d > k ≥ 2, which yields the result of Section 3.2.

3.1 An Algorithm for Matrices of Bounded Maximum Degree

A binary matrix M has maximum degree d if every row contains at most d entries 1. We now show that, when d and δ are constant (which implies that k is also constant, since k ≤ d), deciding the (k, δ)-C1P is tractable. We rely on a connection to graph bandwidth, and an algorithm of Saxe [135] for deciding graph bandwidth.
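To make the parameters d, k and δ concrete, the following minimal sketch checks whether one given column order is a (k, δ)-C1 order of a matrix, and computes the maximum degree. It only verifies a fixed order and is not the decision algorithm developed in this section; the function names are hypothetical, and matrices are assumed to be lists of 0/1 rows with columns indexed from 0.

```python
def row_blocks(row):
    """Return the maximal blocks of consecutive 1's in a row as (start, end) pairs."""
    blocks = []
    start = None
    for idx, entry in enumerate(row):
        if entry == 1 and start is None:
            start = idx
        elif entry != 1 and start is not None:
            blocks.append((start, idx - 1))
            start = None
    if start is not None:
        blocks.append((start, len(row) - 1))
    return blocks


def is_gapped_c1_order(matrix, order, k, delta):
    """Check that permuting the columns by `order` leaves every row with at most
    k blocks of 1's, neighboring blocks being separated by at most delta 0's."""
    for row in matrix:
        permuted = [row[c] for c in order]
        blocks = row_blocks(permuted)
        if len(blocks) > k:
            return False
        for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
            if right_start - left_end - 1 > delta:
                return False
    return True


def max_degree(matrix):
    """Maximum number of 1's in any row."""
    return max((sum(row) for row in matrix), default=0)
```

Passing `float("inf")` for `delta` corresponds to the unbounded case δ = ∞, and `k = 1, delta = 0` to the classical C1P.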
We now give several definitions and theorems, and eventually the algorithm, from Saxe [135], together with our extensions to these, to give an algorithm for deciding the (d, k, δ)-C1P. We first define a layout of a graph, that is, a mapping of its vertices to distinct positive integers, and then the bandwidth of a layout.

Definition 14 (Layout of a Graph). Saxe [135] Let G = (V, E) be a graph with |V| = n. A layout of G is a one-to-one mapping f : V → {1, …, n}.

Definition 15 (Bandwidth of a Layout). Saxe [135] Given graph G = (V, E) with |V| = n, and a layout f of G, the bandwidth of f is defined as the maximum distance between the images under f of any two vertices that are connected by an edge in G. That is, bandwidth(f) = max{|f(u) − f(v)| : {u, v} ∈ E}.

The bandwidth of a graph is then the smallest bandwidth of any of its layouts.

Definition 16 (Bandwidth of a Graph). Saxe [135] Given graph G = (V, E) with |V| = n, bandwidth(G) = min{bandwidth(f) : f is a layout of G}.

We now show the connection of graph bandwidth to the (d, k, δ)-C1P. Let M be an m × n binary matrix and G_M = (V_M, E_M) be the undirected graph defined as follows: V_M = {1, …, n} (each vertex of G_M represents a column of M), and there is an edge {i, j} ∈ E_M if and only if there is a row of M with entries 1 in columns i and j. The following property then follows immediately from this definition:

Property 17. If M has maximum degree d and M has the (k, δ)-C1P, then bandwidth(G_M) is at most d + (k − 1)δ − 1.

We hence define a layout of a binary matrix M to be a layout f of its graph G_M, and the bandwidth of such a layout to be the bandwidth of f (the domain of this layout of M is its set of columns {1, …, n}, which corresponds to the vertices of G_M that form the domain of f). We then define the bandwidth of a binary matrix M, denoted bandwidth(M), to be bandwidth(G_M). The main result of Saxe [135] is an algorithm for deciding, for a given graph G, whether bandwidth(G) ≤ b for some fixed constant b.
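The graph G_M and the bandwidth definitions above can be sketched as follows. This is an illustrative brute-force computation for very small inputs only (it enumerates all layouts), not the algorithm of Saxe; the function names are hypothetical, and columns are indexed from 0 rather than 1.

```python
from itertools import combinations, permutations


def column_graph(matrix):
    """Edge set of G_M: an edge (i, j), i < j, whenever some row of the
    matrix has entries 1 in both columns i and j."""
    edges = set()
    for row in matrix:
        ones = [j for j, entry in enumerate(row) if entry == 1]
        edges.update(combinations(ones, 2))
    return edges


def layout_bandwidth(edges, layout):
    """Bandwidth of a layout, given as a dict vertex -> position:
    the maximum |f(u) - f(v)| over all edges {u, v}."""
    return max((abs(layout[u] - layout[v]) for u, v in edges), default=0)


def graph_bandwidth_bruteforce(n, edges):
    """Minimum bandwidth over all n! layouts of the vertices 0..n-1."""
    best = None
    for perm in permutations(range(n)):
        layout = {vertex: pos + 1 for pos, vertex in enumerate(perm)}
        bw = layout_bandwidth(edges, layout)
        if best is None or bw < best:
            best = bw
    return best
```

For instance, a matrix whose rows are [1,1,0,0], [0,1,1,0], [0,0,1,1] has d = 2 and the (1, 0)-C1P, and its graph G_M is a path of bandwidth 1 = d + (k − 1)δ − 1, consistent with Property 17.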
In the following, we give the details of this algorithm, and how it can be extended to give an algorithm for deciding the (d, k, δ)-C1P. This relies, of course, on Property 17 above.

We first need to give some of the preliminary assumptions, definitions and theorems (and their extensions for our purposes) of Saxe [135]. First, we note that if graph G = (V, E) with |V| = n is not connected, then G has a layout of bandwidth ≤ b if and only if each of its components has such a layout. Also, it is clearly impossible for G to have such a layout if G has any vertex of degree greater than 2b. We therefore assume that G is (i) connected and (ii) has no vertex of degree greater than 2b. Since b is a fixed constant, we can determine (ii) in linear time O(n), and, given that (ii) holds, we can determine (i) in linear time as well [135]. Similarly, we assume that any matrix M given as input to deciding the (d, k, δ)-C1P yields a graph G_M that has properties (i) and (ii).

We now introduce the key notion from Saxe [135] of a partial layout, some related definitions with respect to deciding if a graph has bandwidth ≤ b where b is a fixed constant, and our extensions of some of these definitions so that we can later extend the algorithm of Saxe [135] to obtain an algorithm for deciding if a binary matrix has the (d, k, δ)-C1P.

Definition 18 (Partial Layout of a Graph). Saxe [135] Let G = (V, E) be a graph with |V| = n. A partial layout of G is a one-to-one mapping f : U → {1, …, p}, where U ⊆ V and |U| = p, i.e., 0 ≤ p ≤ n.

Definition 19 (Feasible Partial Layout). Saxe [135] We say that a partial layout f of a graph G is feasible if it can be extended to a (total) layout g, such that bandwidth(g) ≤ b.

Definition 20 (Bandwidth of a Partial Layout). Saxe [135] The bandwidth of a partial layout f of a graph G is the maximum distance between the images of any two edge-connected vertices of G which are in the domain of f.

Definition 21 (Edge Dangling from a Partial Layout).
Saxe [135] Given partial layout f of a graph G = (V, E), if {u, v} ∈ E and u is in the domain of f and v is not, then edge {u, v} is said to be dangling from f.

Here we define a partial layout of a binary matrix M to be a partial layout of its graph G_M. Note that this induces a submatrix M′ of M on the columns U. The remainder of these definitions carry over directly to matrices M, i.e., in terms of G_M, with the exception of feasibility, which is a bit more complicated.

Definition 22 (Feasible Partial Layout of a Binary Matrix). We say that a partial layout f of a binary matrix M is feasible if it is a feasible partial layout of G_M, and if it can be extended to a (total) layout g, such that bandwidth(g) ≤ b, and the order of the columns of M given by g^{-1}(1), …, g^{-1}(n) is a (k, δ)-C1 order.

We now introduce the notions from Saxe [135] of a plausible partial layout, and the active region of a partial layout.

Definition 23 (Plausible Partial Layout of a Graph). Saxe [135] Given partial layout f of a graph G = (V, E), where f is of size p, it is clear that f cannot be feasible unless
(1) bandwidth(f) ≤ b, and
(2) whenever u and v are vertices of G such that f(u) < p − b and {u, v} ∈ E, then v is also in the domain of f.
If f satisfies both of these conditions, then f is said to be a plausible partial layout.

In order to extend the above definition so that it also holds for binary matrices, we have to add to it the following third and fourth properties:
(3) submatrix M′ given by f has the (k, δ)-C1P, and
(4) for each row r of M, if the degree of r in M′ is less than its degree in M, then
(a) r in M′ has the (k − 1, δ)-C1P, and
(b) the rightmost 1 in r of M′ is followed by at most δ 0's.
Note that G = G_M in (2) of the above definition.

Finally, we give the following definition of active region, which carries over directly to the case of binary matrices.

Definition 24 (Active Region of a Partial Layout).
Saxe [135] Given partial layout f of a graph G, where f is of size p, the sequence (f^{-1}(max(p − b + 1, 1)), …, f^{-1}(p)) taken together with the set of dangling edges of f is called the active region of f.

We now present the theorem of Saxe [135] on which Saxe's principal algorithm depends.

Theorem 25. Saxe [135] Let f and g be two plausible partial layouts of G having identical active regions. Then, (1) f and g have identical domains, and (2) f is feasible if and only if g is feasible.

Proof. Since G is connected, the domains of f and g must each consist precisely of those vertices which are path-connected to vertices in the active region by paths not including any dangling edges. Thus, (1) holds. To see that (2) holds, we need only note that any assignment of the remaining vertices which extends either f or g to a total layout of bandwidth ≤ b must also extend the other to such a layout.

Note that, since we have defined the active region and what it means for a partial layout of a binary matrix to be feasible and plausible, Theorem 25 carries over to the case of a binary matrix M as well, where G = G_M and, in the proof of this theorem, the assignment that extends either f or g to a total layout of bandwidth ≤ b also gives a (k, δ)-C1 order.

Finally, we present the notions from Saxe [135] of a successor and predecessor of a plausible partial layout.

Definition 26 (Successor of a Plausible Partial Layout). Saxe [135] Let f be a plausible partial layout of G. Then a successor of f is a plausible partial layout g which extends f by precisely one element. In this case, the active region of g is also said to be a successor of the active region of f.

Definition 27 (Predecessor of a Plausible Partial Layout). Saxe [135] If plausible partial layout g is a successor of plausible partial layout f, then (the active region of) f is a predecessor of (the active region of) g.
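The notions of dangling edge and active region can be illustrated with a short sketch. It assumes, purely for illustration, that a partial layout of size p is represented as a list `placed` whose entry at index i is the vertex mapped to position i + 1; the function names are hypothetical. Plausibility and feasibility are not checked here.

```python
def dangling_edges(edges, placed):
    """Edges with exactly one endpoint in the domain of the partial layout."""
    domain = set(placed)
    return frozenset(
        frozenset((u, v)) for u, v in edges
        if (u in domain) != (v in domain)
    )


def active_region(edges, placed, b):
    """Active region of the partial layout: the vertices at positions
    max(p - b + 1, 1), ..., p (1-indexed), together with the dangling edges."""
    p = len(placed)
    # positions max(p-b+1, 1)..p correspond to list indices max(p-b, 0)..p-1
    tail = tuple(placed[max(p - b, 0):])
    return tail, dangling_edges(edges, placed)
```

Since the returned pair is hashable, it could serve as a dictionary key when grouping plausible partial layouts by identical active regions, in the spirit of Theorem 25.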
Again, because we have defined all of these notions in the case of binary matrices, the notions of successor and predecessor also carry over directly to the case of binary matrices.

As in Saxe [135], Theorem 25 allows us to say that two plausible partial layouts of a binary matrix M are equivalent if they have identical active regions. We can now easily extend the algorithm of Saxe to obtain a breadth-first search over the space of all induced equivalence classes of plausible partial layouts, i.e., the active regions. Here, again, since each active region consists of at most b vertices and each vertex has no more than 2b edges, each of which may or may not be dangling, the number of equivalence classes is bounded above by

\[ \sum_{0 \le i \le b} \binom{n}{i}\,(i!)\,(2^{2b})^{i} = O(n^{b}). \tag{3.1} \]

We are now ready to present our extension of the algorithm of Saxe [135] for deciding if a binary matrix M has the (d, k, δ)-C1P. Here, we need only extend Saxe's algorithm with a data structure that stores a submatrix M′ corresponding to each active region, and some procedures associated with this submatrix to test for the (k, δ)-C1P of this active region. Note that since we assume that M has bounded degree d, by Property 17 we can test here for bandwidth ≤ b, where b = d + (k − 1)δ − 1. Note also that, by the definition of G_M, it follows that b is at least the distance between the leftmost 1 and rightmost 1 in any row of any (k, δ)-C1 order of M. Hence, it is sufficient to test for the (k, δ)-C1P of only the submatrix M′ corresponding to each active region (of size b), and not all of M.

The algorithm uses the following two data structures:
(1) A (FIFO) queue Q whose elements are active regions.
(2) An array A which contains one element for each possible active region.
Each element A[r] of A consists of a Boolean flag A[r].examined, telling whether the active region r has already been considered in the search, and a list A[r].unplaced of vertices which is intended to list all vertices not in the domain of each plausible partial layout with active region r. Here, we extend each element A[r] so that it also contains an m × b (sub-)matrix A[r].M′ which stores the submatrix of M that corresponds (in this order) to the columns (f^{-1}(max(p − b + 1, 1)), …, f^{-1}(p)) of active region r.

At the start of the algorithm, Q is initialized to contain the single element representing the active region (henceforth denoted Φ) of the empty partial layout ∅. The flag A[Φ].examined is set to true and A[Φ].unplaced is initialized to list all the elements of V. The remaining A[r].examined are initially false, and the remaining A[r].unplaced are uninitialized. Each A[r].M′ is also set to the matrix on zero columns (the empty matrix). The algorithm now proceeds as follows:

Algorithm 1 Algorithm of Saxe [135] for testing the bandwidth of a graph.
1. Extract an active region r from the head of Q.
2. From A[r].unplaced, determine the successors of r. To determine if s, the active region obtained by extending r with some c ∈ A[r].unplaced, is a successor, we first check (as is done in Saxe [135]) whether s is the active region of a plausible partial layout of G_M. In addition, we compute A[s].M′ by adding column c to the end of A[r].M′. This new active region s can only be a successor if A[s].M′ satisfies properties (3) and (4) that extend Definition 23.
3. For each successor s of r such that A[s].examined is false, perform the following steps:
   a. Set A[s].examined to true.
   b. Compute A[s].unplaced by deleting the last vertex of s from A[r].unplaced.
   c. If A[s].unplaced is the empty set, then halt, asserting that bandwidth(G) ≤ b.
   d. Insert s at the end of Q.
4. If Q is empty, then halt, asserting that bandwidth(G) > b.
Otherwise, go to Step 1.

We now analyze the time and space complexity of the algorithm. Since there are O(n^b) active regions r, and with each r we associate the O(n) elements of A[r].unplaced, the m × b matrix A[r].M′, which is of size O(mn), and some constant-size flags, the space required by the algorithm is O(mn^{b+2}). For the running time, we first note that (as in Saxe's algorithm) Steps 1 through 4 will be executed O(n^b) times. Again, each individual execution of Steps 1 and 4 takes constant time, so their contribution to the total running time is O(n^b). In each execution of Step 2, A[r].unplaced can have O(n) elements, and the test that A[s].M′, for each potential successor s obtained from A[r].unplaced, satisfies properties (3) and (4) that extend Definition 23 takes time O(mn), since A[s].M′ is of size O(mn). Hence Step 2 contributes O(mn^{b+2}) to the total execution time. In each execution of Step 3, Steps 3.a through 3.d may again be executed as many as n times, and Step 3.b takes time O(n), so Step 3 contributes O(n^{b+2}) to the total execution time (which is already less than that of Step 2, so we leave out the analysis of Saxe [135] for bringing down this upper bound). We hence have our version of the following theorem from Saxe [135] for deciding the bandwidth of a binary matrix.

Theorem 28. Let b be any positive integer. Then, given some binary matrix M, there is an algorithm which decides if bandwidth(M) ≤ b using time and space O(mn^{b+2}).

Proof. To test the bandwidth of M, we first perform a time O(n) depth-first search which either (1) determines that G_M has some vertex of degree greater than 2b, or (2) partitions G_M into connected components, none of which have any vertex of degree greater than 2b. In case (1), we know immediately that bandwidth(M) > b. In case (2), we apply Algorithm 1 to the submatrices of M that correspond to the connected components of G_M.

By the above Theorem 28 and Property 17, we have the following theorem which gives us the result.

Theorem 29.
Let M be an m × n binary matrix such that every row has at most d entries 1. Deciding if M has the (k, δ)-C1P can be done in time and space O(mn^{d+(k−1)δ+1}).

3.2 The (d, k, ∞)-C1P

Here, we show that deciding the (d, k, ∞)-C1P for every d > k ≥ 2 is NP-complete. This proof is broken down into the following subsections. In Subsection 3.2.1 we first define a hypergraph covering problem that will be used later to show NP-completeness of this case. In Subsection 3.2.2 we then show that a special case of this covering problem for 3-uniform hypergraphs is NP-complete. In Subsection 3.2.3 we use the NP-completeness construction of Subsection 3.2.2 to show that the covering problem defined in Subsection 3.2.1 is NP-complete in general. Finally, in Subsection 3.2.4, we give a correspondence between the covering problem of Subsection 3.2.1 and the problem of deciding the (d, k, ∞)-C1P, which yields the result of this section.

3.2.1 A Hypergraph Covering Problem

We first define the following hypergraph covering problem. In the sections that follow, we will show that this problem is NP-complete, and that it corresponds exactly to the problem of deciding the (d, k, ∞)-C1P for the hardness result of this chapter. Note that a hypergraph H = (V, E) is d-uniform when all its hyperedges are d-edges, that is, hyperedges that contain exactly d vertices.

Definition 30 (p-Covering of a d-Uniform Hypergraph). Given a d-uniform hypergraph H = (V, E) and an integer p, let K_{|V|} be a complete graph on V and let P_p be the set of all subsets of E(K_{|V|}) with exactly p edges. A p-covering of H is a graph G = (V, E′) such that there exists a map c : E → P_p such that
(a) for every h ∈ E, and for every e ∈ c(h), e ⊆ h; and
(b) E′ = ⋃_{h∈E} c(h).
Here, we say that set c(h) p-covers the hyperedge h and that G p-covers H.

Informally, a p-covering of a d-uniform hypergraph is a graph constructed by picking p edges from each hyperedge.

Problem 31 (d-Uniform Hypergraph p-Covering by Paths (d-UH-p-CP)).
Given a d-uniform hypergraph H = (V, E) and an integer p < d, is there a p-covering of H which consists only of disjoint paths?

Variations of this problem were defined in previous works [59–61]. The first variation allowed the hypergraph to have only 2-, 3- and 4-edges, where 2- and 3-edges were covered by picking one edge, while 4-edges were covered by two parallel edges, and required that the covering contain only disjoint edges and vertices. This variation was shown to be polynomial-time solvable, which provided an algorithm for a special version of the haplotyping problem via galled-tree networks [59]. The second variation allowed only 3-uniform hypergraphs, and required all connected components of the covering to be paths of length at most 3. This variation was shown to be NP-complete [61]. A slightly more complex version of this was then used to show that, in general, the haplotyping problem via galled-tree networks is NP-complete [60].

In the next section, we show that a special case of this problem, namely the 3-UH-1-CP Problem, is NP-complete, which is then generalized in Section 3.2.3 to show NP-completeness of the d-UH-p-CP Problem for every d − 2 ≥ p ≥ 1.

3.2.2 The 3-Uniform Hypergraph 1-Covering by Paths Problem

We now show that the 3-Uniform Hypergraph 1-Covering by Paths (3-UH-1-CP) Problem is NP-complete.

Theorem 32. The 3-UH-1-CP Problem is NP-complete.

Proof. Clearly, the problem is in NP. We will show it is also NP-hard by reduction from 3SAT(3), a restricted version of 3SAT, proved NP-complete by Papadimitriou [120], in which every variable has exactly two positive and one negative occurrence in the clauses (see Footnote 1). We will call a p-covering of a hypergraph valid if it consists only of disjoint paths. Note that a valid p-covering does not contain vertices of degree 3 or more and does not contain cycles.

Given 3SAT(3) formula φ with variables X = {x_1, …, x_n} and clauses C = {c_1, …
, c_m}, we now construct a 3-uniform hypergraph H_φ on at most 12n + 15m hyperedges which contains, among other vertices, a vertex for each literal of φ (there are 3n such vertices), such that H_φ has a valid 1-covering if and only if φ is satisfiable.

Footnote 1: We remark that the exact formulation of 3SAT(3) in Papadimitriou [120] also allows variables with one positive and two negated occurrences, but these can easily be converted to the other type of variable by replacing them with their negations in all clauses. Clearly, this does not affect the complexity of the problem.

[Figure 3.1: (a) A simple dependency on 1-coverings of two touching hyperedges h_1 and h_2 enforced by a copy of D (depicted as a diamond). (b) The 2-clause and (c) 3-clause gadgets for clause c_i.]

First we give an important building block that is used throughout this construction: the complete 3-uniform hypergraph D on 4 vertices. In any valid 1-covering of D, there is no isolated vertex. Indeed, assume for contradiction that v is an isolated vertex in a valid covering G of D. Let u_1, u_2, u_3 be the remaining three vertices. Then there is a pair u_i, u_j such that {u_i, u_j} is not an edge in G, since otherwise G would contain a cycle. However, then no edge of G 1-covers the hyperedge {v, u_i, u_j}, a contradiction. We will use several copies of D in the construction to introduce a dependency on 1-coverings of touching hyperedges, and depict them as diamonds in the figures. For instance, consider the hypergraph in Figure 3.1a. Since in any valid 1-covering G of this hypergraph, v is a member of an edge in D, at most one of the hyperedges h_1 and h_2 can "pick" an edge involving v; otherwise vertex v would have degree 3 or more.

Now to the main construction. Consider the instance φ of 3SAT(3) with variables X = {x_1, …, x_n} and clauses C = {c_1, …, c_m}.
In the construction, any valid covering selects a set of literals (more precisely, the vertices corresponding to these literals), i.e., positive and negative occurrences of variables. If this selection satisfies the following two properties:
(1) every clause selects at least one literal, and
(2) for every x ∈ X, at most one of x and ¬x is selected,
then this selection can be used to build a satisfying truth assignment for φ as follows: for every i ∈ {1, …, n}, if x_i (resp., ¬x_i) is in the selection, set the value of x_i to true (resp., false). If neither x_i nor ¬x_i is in the selection, pick the value arbitrarily. We design a hypergraph H_φ composed of clause gadgets, which will guarantee the first condition, and variable gadgets, which will ensure the second condition.

[Figure 3.2: (a) The variable gadget for a variable with positive occurrences c_i^p and c_j^q and negated occurrence c_k^r in the clauses. The dashed edge is always picked in any valid 1-covering. (b) Grey edges are picked when this variable is set to false in a satisfying assignment of φ. (c) Grey edges are picked when the variable is set to true.]

Figures 3.1b and 3.1c depict the 2-clause and 3-clause gadgets, respectively. Given a valid 1-covering G of the clause gadget for clause c_i with literals c_i^1, c_i^2 (and c_i^3 for a 3-clause), we say that a literal vertex c_i^j is selected in G if c_i^j is contained in two edges of the covering G. Note that in both clause gadgets at least one of the literal vertices is selected in any valid covering. This is obvious for the 2-clause gadget. For the 3-clause gadget, if none of the literal vertices is selected in a valid 1-covering of this gadget, then in the three hyperedges in Figure 3.1c, no picked edge involves c_i^1, c_i^2 or c_i^3. But this creates a cycle, a contradiction.

Now, each literal vertex c_i^j will also appear in exactly one variable gadget, described in the next paragraph.
If a literal vertex c_i^j is selected in a valid covering then it cannot be contained in any edge that covers the hyperedges of the variable gadget; otherwise c_i^j has degree 3 or more in this covering. The variable gadget for each x ∈ X will use this property to ensure that literal vertices x and ¬x are not selected at the same time. Figure 3.2a depicts the variable gadget for variable x ∈ X with the two positive occurrences c_i^p and c_j^q, and one negated occurrence c_k^r of this variable x in the clauses. Note that if both a positive and the negated literal vertex of x are selected by a clause gadget in a valid 1-covering of H_φ, then this forces a cycle in the variable gadget of x, a contradiction. It follows that if H_φ has a valid 1-covering then φ is satisfiable.

Conversely, if φ has a satisfying assignment τ, let us pick, for each clause, one literal which makes it satisfied in τ, and build the 1-covering of H_φ as follows. In each clause gadget, (i) in each hyperedge of this clause gadget that contains a literal vertex, pick an edge containing the literal vertex if this literal was selected for the corresponding clause, and (ii) for each diamond, choose any of the 3 valid 1-coverings of this diamond that consist of 2 parallel edges. In the variable gadgets, pick the edges as depicted in Figure 3.2b if the variable has value false in τ, and otherwise pick the edges as depicted in Figure 3.2c. By selecting edges in this fashion, every hyperedge of H_φ is 1-covered by an edge, and each literal vertex is incident to at most two edges in the 1-covering, one of them lying in the diamond. Hence, there is no vertex of degree 3 and there are no cycles in this 1-covering, i.e., this 1-covering is valid.

Since the number of hyperedges used in the construction is at most 12n + 15m, i.e., linear in the size of φ, this construction can be built in polynomial time, and hence, the 3-UH-1-CP Problem is NP-hard.
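The claim used in the proof above, that no valid 1-covering of the diamond D leaves a vertex isolated, can be checked exhaustively for this small hypergraph. The following is a brute-force sketch, exponential in the number of hyperedges and meant only for tiny instances; the function names are hypothetical, and vertices are assumed to be sortable (here, integers).

```python
from itertools import combinations, product


def is_disjoint_paths(edges):
    """True if the graph on these edges has maximum degree at most 2 and no
    cycle, i.e., every connected component is a path."""
    degree = {}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if degree[u] > 2 or degree[v] > 2:
            return False
        root_u, root_v = find(u), find(v)
        if root_u == root_v:  # this edge would close a cycle
            return False
        parent[root_u] = root_v
    return True


def valid_p_coverings(hyperedges, p):
    """Yield the edge set E' of every valid p-covering: choose exactly p
    edges inside each hyperedge (the map c of Definition 30), take the
    union, and keep it if it consists only of disjoint paths."""
    choices = [
        list(combinations(list(combinations(sorted(h), 2)), p))
        for h in hyperedges
    ]
    for pick in product(*choices):
        edge_set = set().union(*map(set, pick))
        if is_disjoint_paths(edge_set):
            yield edge_set


# The diamond D: the complete 3-uniform hypergraph on 4 vertices.
diamond = [frozenset(h) for h in combinations(range(4), 3)]
coverings = list(valid_p_coverings(diamond, 1))
# Every valid 1-covering touches all four vertices.
no_isolated = all({x for e in cov for x in e} == set(range(4)) for cov in coverings)
```

Because the same edge set can arise from several choices of the map c, `coverings` may contain repeated edge sets; converting each to a `frozenset` and collecting them in a set removes the repeats.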
In the following section, we generalize this construction to show that for every $d - 2 \ge p \ge 1$, the d-UH-p-CP Problem is NP-complete.

3.2.3 The d-Uniform Hypergraph p-Covering by Paths Problem

We now show how the construction of Section 3.2.2 can be generalized to show that for every $d - 2 \ge p \ge 1$, the d-Uniform Hypergraph p-Covering by Paths (d-UH-p-CP) Problem is NP-complete. The main building block in this new construction is the following d-uniform hypergraph that generalizes the hypergraph $D$ (the diamond) from the previous construction of Section 3.2.2.

Lemma 33. For any $d - 2 \ge p \ge 1$, there exists a d-uniform hypergraph $D_{d,p} = (V, E)$ with a distinguished vertex $v \in V$ that has the following properties:
1. $|V| = 2d - p - 1$ and $|E| = \binom{2(d-p)-1}{d-p}$;
2. in any valid p-covering of $D_{d,p}$, $v$ is not isolated; and
3. hypergraph $D_{d,p}$ has a valid p-covering in which $v$ has degree 1.

Figure 3.3: Hypergraph $D_{d,p}$: only one of the $\binom{|S|}{d-p}$ hyperedges is shown.

Proof. Let $D_{d,p} = (V, E)$ be the d-uniform hypergraph on the vertex set $V = S \cup P \cup \{v\}$ where $|S| = 2(d - p) - 1$, $|P| = p - 1$, and $v$ is the single distinguished vertex. For every subset $S' \subseteq S$ of size $d - p$, we add a hyperedge on the $d$ vertices $S' \cup P \cup \{v\}$ to $E$, i.e., $|E| = \binom{2(d-p)-1}{d-p}$. Hypergraph $D_{d,p}$ is depicted in Figure 3.3. We now show that this hypergraph satisfies conditions 2 and 3 of the lemma. Here, again, we call a graph p-covering of a hypergraph valid if it consists only of disjoint paths.

Assume, for contradiction, that $v$ is isolated in a valid p-covering $G$ of $D_{d,p}$. Since $G$ is some collection of paths on the vertex set $S \cup P$, virtual edges can always be added to $G$ to extend this collection to a single path $G'$ on this set. In what follows, we will find a hyperedge in $D_{d,p}$ and show that it contains fewer than $p$ edges in $G'$, and hence, fewer than $p$ edges in $G$, and thus, is not covered by $G$.
Path $G'$ defines a total order on its vertex set $S \cup P$ (there are two such total orders, but we can choose either one, without loss of generality). Let $t : S \cup P \to \mathbb{N}$ be this total order. If we follow the vertices of path $G'$ according to $t$, it starts at some vertex in one of $S$ or $P$, alternates between the two sets, and then terminates in one of these sets. Hence, the subgraph $G'_S$ (resp., $G'_P$) of $G'$ induced on vertex set $S$ (resp., $P$) is some collection of paths on $S$ (resp., $P$), say $S_1, \ldots, S_r$ (resp., $P_1, \ldots, P_\ell$), where for any $i < j$, vertex $u \in S_i$ (resp., $P_i$) and $u' \in S_j$ (resp., $P_j$), $t(u) < t(u')$, cf. Figure 3.4.

Figure 3.4: The path $G'$ through vertex set $S \cup P$ that alternates between subpaths completely in $S$ and completely in $P$. Some of the shown edges may be virtual.

Figure 3.5: Hyperedge $h$ of $D_{d,p}$ which contains fewer than $p$ edges from $G'$ depicted in Figure 3.4.

Let us order the elements of $S$ according to total order $t$ and let $S'$ be the odd-numbered elements of $S$ according to this order. Since $|S| = 2(d - p) - 1$, $|S'| = d - p$. Now, consider the d-edge $h = S' \cup P \cup \{v\}$ of $D_{d,p}$. Hyperedge $h$ is indeed an edge in $D_{d,p}$ since it contains $P \cup \{v\}$ and a subset, namely $S'$, of size $d - p$ of $S$. Hyperedge $h$ for the example of Figure 3.4 is depicted in Figure 3.5. We will show that this hyperedge contains fewer than $p$ edges from $G'$.

Let us count the number of edges of $G'$ that are contained in $h$. Each path $P_i$, $i = 1, \ldots, \ell$, is completely contained in $h$, and thus contributes to $h$ the $|P_i| - 1$ edges that connect the vertices of this path. On the other hand, since $S'$ is the set of odd-numbered elements of $S$ according to total order $t$, none of the edges in $S_j$, $j = 1, \ldots, r$, is contained in $h$. Finally, we need to consider edges of the path $G'$ crossing between the sets $S$ and $P$. We will show that for each $i = 1, \ldots, \ell$, there is at most one crossing edge starting at a vertex of $P_i$ and ending in $S$ that is contained in $h$. There are at most two edges starting at a vertex of $P_i$ and ending in some vertex of $S$. If the number of these edges is less than two, the claim holds. Assume there are two such edges. They must start at the endpoints of $P_i$ and end in consecutive elements of $S$ (according to $t$). Hence, at most one of them ends in an odd-numbered element of $S$, i.e., is contained in $h$. It follows that the number of crossing edges contained in $h$ is at most $\ell$. Hence, $h$ contains at most $\ell + \sum_{i=1}^{\ell} (|P_i| - 1) = |P| = p - 1$ edges of $G'$, and hence, at most $p - 1$ edges of $G$, thus it is not p-covered by $G$, a contradiction. We can conclude that in any valid p-covering of $D_{d,p}$, vertex $v$ has degree at least one.

Finally, we show that $D_{d,p}$ has a valid p-covering in which vertex $v$ has degree 1. Consider the path $G$ that starts at $v$ and then visits all vertices in $P$ and then all vertices in $S$, cf. Figure 3.6. Consider any hyperedge $h = S' \cup P \cup \{v\}$ where $S'$ is some subset of $S$ of size $d - p$. The hyperedge $h$ contains $P \cup \{v\}$, and thus the subpath of $G$ induced by these vertices. This subpath has $p - 1$ edges. Consider the subgraph of $G$ induced by $S'$. If this subgraph contains at least one edge, we pick this edge for $h$, and hence, $h$ is p-covered by $G$. Otherwise, $S'$ must consist only of odd-numbered elements of the subpath of $G$ induced by $S$, and thus it contains the first vertex of this subpath. Hence, $h$ contains the edge of $G$ connecting sets $S$ and $P$, and we pick this edge for $h$, i.e., it is p-covered by $G$.

Figure 3.6: A valid p-covering of $D_{d,p}$ in which vertex $v$ has degree 1.

Figure 3.7: Vertices and hyperedges added to $\bar{H}$ to simulate the 3-edge $h = \{a, b, c\}$. The grayed diamonds depict copies of $D_{d,p}$.

In the following theorem we will use many copies of $D_{d,p}$ to simulate the behavior of a 3-edge in the 1-covering problem with a d-edge in the p-covering problem.

Theorem 34.
For every $d - 2 \ge p \ge 1$, the d-UH-p-CP Problem is NP-complete.

Proof. Clearly, this problem is in NP. We will show that it is also NP-hard by reduction from the 3-UH-1-CP Problem that was shown to be NP-complete in Section 3.2.2. Given a 3-uniform hypergraph $H = (V, E)$, we will construct a d-uniform hypergraph $\bar{H}$ that has a valid p-covering if and only if $H$ has a valid 1-covering. For each 3-edge $h = \{a, b, c\} \in E$ we add the corresponding d-edge $\bar{h} = \{a, b, c, h_1, \ldots, h_{d-3}\}$ to $\bar{H}$. To simulate in $\bar{H}$ the behavior of $h$, we then add $2(d - p - 2)$ copies of $D_{d,p}$ to $\bar{H}$, where the distinguished vertex $v$ of each copy is identified with one of the vertices $h_p, \ldots, h_{d-3}$ such that each of them is used exactly twice. Figure 3.7 illustrates all vertices and hyperedges added to $\bar{H}$ for this 3-edge $h$ in $H$. We note that all vertices other than $a, b, c$ added to $\bar{H}$ for $h$ are disjoint from all other vertices.

Now, assume that there is a valid p-covering $\bar{G}$ of $\bar{H}$. We will construct a 1-covering $G$ of $H$ as follows. For each $h \in E$, consider the subgraph $\bar{G}_{\bar{h}}$ of $\bar{G}$ induced by the vertices in $\bar{h}$. It must have at least $p$ edges. By Lemma 33, vertices $h_p, \ldots, h_{d-3}$ are incident to some edges of $\bar{G}$ in two different diamonds, and since there is no vertex of degree 3 in $\bar{G}$, they are isolated vertices in $\bar{G}_{\bar{h}}$. Hence, we have $p$ edges in the $(p + 2)$-element set $\{a, b, c, h_1, \ldots, h_{p-1}\}$ which cannot create a cycle. It hence follows that these vertices must form at most two components. Therefore, at least one pair of the vertices $a, b, c$ must lie in the same component. If there is only one such pair, we add it to $G$ as an edge. If all three vertices $a, b, c$ are connected, we add to $G$ a pair which remains connected after removing the third vertex. As a consequence of this choice, each edge $\{u, v\}$ in $G$ covering a hyperedge $h$ in $H$ corresponds to a path in $\bar{G}$ connecting $u$ and $v$.
In addition, all internal vertices of these paths are not in $V$, and since hyperedges in $\bar{H}$ share only vertices in $V$, they are pairwise internally vertex disjoint.

The graph $G$ constructed above is obviously a 1-covering of $H$. Let us check that it is also valid. First, if there is a vertex $u \in V$ with degree 3 or more, then there are three internally disjoint paths starting at $u$ in $\bar{G}$, i.e., $u$ would have degree at least 3 in $\bar{G}$, a contradiction. Second, if there is a cycle $u_1, u_2, \ldots, u_k, u_1$ in $G$, then for each edge $\{u_i, u_{i+1}\}$ in $G$, we have a path connecting $u_i$ and $u_{i+1}$ in $\bar{G}$. Since these paths are internally vertex disjoint, they create a cycle in $\bar{G}$, a contradiction.

Conversely, assume there is a valid 1-covering $G$ of $H$. We construct a p-covering $\bar{G}$ of $\bar{H}$ as follows. Cover each copy of $D_{d,p}$ such that the distinguished vertex has degree 1 (this is possible by Lemma 33). For each hyperedge $h = \{a, b, c\} \in E$, without loss of generality, let $\{a, b\}$ be the edge that covers $h$ in $G$. Then cover hyperedge $\bar{h}$ by a path starting at $a$, visiting all vertices $h_1, \ldots, h_{p-1}$ and ending at $b$, while the vertex $c$ is an isolated vertex. This is a p-covering of $\bar{H}$ and it is easy to verify that it is also valid.

Finally, let us check that the construction is polynomial. The number of vertices of $\bar{H}$ is $|V| + |E|\,[d - 3 + 2(d - p - 2)(2d - p - 2)]$ and the number of edges is $|E|\left[1 + 2(d - p - 2)\binom{2(d-p)-1}{d-p}\right]$. Since $d$ and $p$ are assumed to be constants, the reduction is polynomial.

3.2.4 The Complexity of Deciding the (d, k, ∞)-C1P

We now show that for every $d > k \ge 2$, deciding the (d, k, ∞)-C1P is NP-complete, by showing the correspondence of this problem to the d-UH-(d − k)-CP Problem. A d-uniform hypergraph $H = (V, E)$ can be represented as a binary matrix $B_H$ with $|V|$ columns and $|E|$ rows, where for each hyperedge $h \in E$, we add a row with 1's in the columns corresponding to vertices in $h$ and 0's everywhere else.
Obviously, the degree of every row of $B_H$ is $d$ and there is a one-to-one correspondence between d-uniform hypergraphs and such matrices.

Lemma 35. A d-uniform hypergraph $H = (V, E)$ can be (d − k)-covered by disjoint paths if and only if matrix $B_H$ has the (d, k, ∞)-C1P.

Proof. Assume first that $H$ has a valid covering $G$. Since $G$ consists of disjoint paths, there is a Hamiltonian path $P$ on $V$ containing all edges of $G$. This path defines an order on the vertices in $V$. Consider the order of the columns of matrix $B_H$ based on this order ($V$ is the set of columns of $B_H$). We will show that this order is (d, k, ∞)-consecutive. Since each row of $B_H$ contains exactly $d$ 1's, it is enough to show that at least $d - k$ pairs of these $d$ columns are adjacent in this order. The $d$ columns containing 1's in each row form a hyperedge in $H$. Since $G$ is a valid (d − k)-covering, there are edges between at least $d - k$ pairs of these $d$ columns in $G$. Since $P$ contains all edges of $G$, it also contains these $d - k$ edges and hence, each of the corresponding $d - k$ pairs of columns is adjacent in the order. It follows that the order of $B_H$ is (d, k, ∞)-consecutive.

Conversely, assume that matrix $B_H$ is (d, k, ∞)-consecutive. Let $\pi = v_{i_1}, \ldots, v_{i_n}$ be the order of the columns in a (d, k, ∞)-consecutive order of $B_H$. Now, for any hyperedge $h = \{v_{j_1}, v_{j_2}, \ldots, v_{j_d}\}$ of $H$, there is a row in $B_H$ with 1's in these $d$ columns, hence, at least $d - k$ pairs of the columns in $h$ must be adjacent in the order $\pi$. Consider the following covering $G$ of $H$: for every hyperedge pick the edge between each pair of adjacent columns/vertices. Note that every edge in $G$ is $\{v_{i_j}, v_{i_{j+1}}\}$ for some $j$. Hence, $G$ has no vertex of degree 3 or higher, nor any cycle, thus $G$ is a collection of disjoint paths, i.e., a valid (d − k)-covering of $H$.

By Theorem 34 and Lemma 35 it follows that for every $d > k \ge 2$, deciding the (d, k, ∞)-C1P is NP-complete.

Theorem 36. For every $d > k \ge 2$, deciding the (d, k, ∞)-C1P is NP-complete.
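The correspondence of Lemma 35 can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than part of the proof: the function names `hypergraph_to_matrix`, `count_blocks` and `is_k_consecutive_order` are hypothetical, a column order is given as a list of column indices, and, because δ is unbounded, only the number of blocks of 1's per row is checked.

```python
def hypergraph_to_matrix(vertices, hyperedges):
    """Build the matrix B_H: one column per vertex, one row per hyperedge."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = []
    for h in hyperedges:
        row = [0] * len(vertices)
        for v in h:
            row[index[v]] = 1
        matrix.append(row)
    return matrix


def count_blocks(row, order):
    """Count maximal runs of 1's in `row` when columns are read in `order`."""
    blocks, previous = 0, 0
    for col in order:
        if row[col] == 1 and previous == 0:
            blocks += 1
        previous = row[col]
    return blocks


def is_k_consecutive_order(matrix, order, k):
    """True if every row has at most k blocks of 1's under `order`."""
    return all(count_blocks(row, order) <= k for row in matrix)
```

A row with d ones split into at most k blocks has at least d − k adjacent pairs of ones, which is exactly the quantity counted in the proof. For example, for the 3-uniform hypergraph with hyperedges {a, b, c} and {c, d, e}, the column order a, b, c, d, e places each row in a single block, whereas the order a, d, b, e, c splits the first row into three blocks and is therefore not (3, 2, ∞)-consecutive.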
We remark that the (d, k, ∞)-C1P is the k-C1P [55] for matrices of bounded degree d. Goldberg et al. [55] posed the open question of the complexity of deciding the 2-C1P for matrices with a limit ℓ on the number of ones per row and per column. This is motivated by a typical setting in physical mapping, where a clone will only contain a small number of probes, and there is only limited coverage of the entire sequence by the clones (cf. Chapter 1 for details on physical mapping). Our construction for Theorem 32 in Subsection 3.2.2, which implies that deciding the 2-C1P for matrices of bounded degree 3 is NP-complete, also uses at most 7 ones per column. We therefore have the following corollary, which closes this open question of Goldberg et al. [55].

Corollary 37. Deciding the 2-C1P with a limit 3 on the number of ones per row and 7 on the number of ones per column is NP-complete.

Chapter 4

The Consecutive-Ones Property with Multiplicity

In this chapter we show in Section 4.1 that deciding the mC1P is NP-complete for matrices with degree at most 3 and m(s) ≤ 2 for each s ∈ S, where S is the set of columns of M. We then present in Section 4.2 the two restricted variants of the mC1P given in Wittler and Stoye [151], namely the Consecutive-Ones Property with Multiplicity for Framed Rows (mC1P(fr)) and the Consecutive-Ones Property with Multiplicity for Nested Rows (mC1P(ne)). In Subsection 4.2.1 (resp., Subsection 4.2.2) we detail the mC1P(fr) (resp., mC1P(ne)) variant, its biological motivation, and show that deciding the mC1P(fr) (resp., mC1P(ne)) is NP-complete for matrices with degree at most 6 (resp., 3) and m(s) ≤ 2 for each s ∈ S. Then, in Section 4.3 we give a tractability result for a case of the mC1P, motivated by handling ancestral telomeres in the reconstruction of AGO.
4.1 The Consecutive-Ones Property with Multiplicity (mC1P)

Here, we show that deciding the mC1P is NP-complete for matrices with degree at most 3 and m(s) ≤ 2 for each s ∈ S, where S is the set of columns of M.

Theorem 38. Given a degree 3 matrix M on set S of columns, deciding the mC1P for M is NP-complete for a multiplicity vector m where m(s) ≤ 2 for each s ∈ S.

Before giving the proof, we would like to emphasize that this is the strongest possible result. If the maximum multiplicity were one, this would be just an instance of the classical C1P. If the degree of M is restricted to 2, then this corresponds to the model of adjacencies, and can hence be solved using the method based on Eulerian graphs given in Wittler and Stoye [151].

Proof. One can easily formulate an algorithm that verifies a given solution, i.e., a C1 order with multiplicity, in polynomial time, which shows that the problem belongs to the complexity class NP. We will show NP-hardness of deciding the mC1P by reduction from 3SAT(3), which has been proven to be NP-complete by Papadimitriou [120]. 3SAT(3) is a restricted version of 3SAT in which every variable has exactly two positive and one negative occurrence in the clauses.¹ Here, we again reduce from a type of hypergraph covering problem as we did in Chapter 3 to show NP-hardness of deciding the (d, k, δ)-C1P.

Given a 3SAT(3) formula $\phi$ with variables $X = \{x_1, \ldots, x_n\}$ and clauses $C = \{c_1, \ldots, c_m\}$, we construct a matrix $M_\phi$ consisting of at most $5n + 2m$ columns of multiplicity at most two and at most $4n + 2m$ rows of degree three or less for which a C1 order $\sigma$ with multiplicity exists if and only if $\phi$ is satisfiable. For this instance $\phi$ of 3SAT(3), we say that a clause selects one of its literals in a truth assignment of $\phi$ if this literal has value true in this assignment.
Obviously, a truth assignment of $\phi$ is a satisfying truth assignment if and only if every clause selects at least one literal and for every $x \in X$, at most one of $x$ and $\neg x$ is selected. We design an instance $M_\phi$ composed of clause gadgets, which will guarantee the first condition, and variable gadgets, which will ensure the second condition.

For each 2-clause $c_i$ with literals $c_i^1$ and $c_i^2$, we add to $M_\phi$ the two columns $c_i^1$ and $c_i^2$, each of multiplicity 2, and the two columns $c_i^*$ and $c_i^{**}$, each of multiplicity 1, and the rows $S_i^1 = [c_i^1, c_i^2, c_i^*]$ and $S_i^2 = [c_i^*, c_i^{**}]$. This is referred to as the 2-clause gadget. For each 3-clause $c_i$ with literals $c_i^1$, $c_i^2$ and $c_i^3$, we add to $M_\phi$ the three columns $c_i^1$, $c_i^2$ and $c_i^3$, each with multiplicity 2, and the row $S_i = [c_i^1, c_i^2, c_i^3]$. This is referred to as the 3-clause gadget.

¹ We remark that the exact formulation of 3SAT(3) in Papadimitriou [120] also allows variables with two negated and one positive occurrence, but these can easily be converted to the other type of variables by replacing them with their negations in all clauses. Clearly, this does not affect the complexity of the problem.

Figure 4.1: Graphical representations of the (a) 2-clause gadget and (b) 3-clause gadget for clause $c_i$. The multiplicity of the columns (resp., vertices) is indicated by the number of dots. Rows are depicted by ellipses surrounding two vertices or triangles surrounding three vertices, respectively.

Figure 4.1 shows graphical representations of these gadgets, which also highlights that $M_\phi$ can be viewed as a hypergraph with a vertex for each column and a hyperedge for each row. A C1 order with multiplicity of $M_\phi$ is then a collection of walks on this hypergraph that covers each hyperedge (for each hyperedge $e$ there is a connected subwalk containing all vertices in $e$) such that no vertex $v$ is visited more than $m(v)$ times.
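Membership in NP rests on the fact that a candidate string σ can be verified directly. The following Python sketch illustrates such a verifier under stated assumptions: the names `row_occurs` and `is_c1_order_with_multiplicity` are hypothetical, rows are given as sets of column names, and a row is taken to occur in σ when some contiguous substring of σ has exactly the row as its character set (our reading of the definition, with repeated characters inside the substring allowed).

```python
from collections import Counter


def row_occurs(row, sigma):
    """True if some contiguous substring of sigma has character set == row."""
    row = set(row)
    n = len(sigma)
    for start in range(n):
        seen = set()
        for end in range(start, n):
            if sigma[end] not in row:
                break  # the substring left the row's columns
            seen.add(sigma[end])
            if seen == row:
                return True
    return False


def is_c1_order_with_multiplicity(rows, sigma, m):
    """Check the multiplicity bound m and the occurrence of every row."""
    counts = Counter(sigma)
    if any(counts[s] > m.get(s, 0) for s in counts):
        return False
    return all(row_occurs(r, sigma) for r in rows)
```

For the 2-clause gadget, for example, the string $c_i^1 c_i^2 c_i^* c_i^{**}$ passes this check, while any string that separates $c_i^*$ from $c_i^{**}$ or uses $c_i^*$ twice fails it.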
We say that in string $\sigma$, a clause gadget selects a literal column $c_i^j$ if, in $\sigma$, $c_i^j$ is enclosed on both the left and the right side by at least one column of this gadget. Note that in both clause gadgets, at least one of the literal columns is selected in any valid string $\sigma$. For the 2-clause gadget, string $\sigma$ has to contain one of the substrings $c_i^1 c_i^2 c_i^* c_i^{**}$ or $c_i^2 c_i^1 c_i^* c_i^{**}$, or one of their reversals. For the 3-clause gadget, string $\sigma$ has to contain one of the substrings $c_i^1 c_i^2 c_i^3$, $c_i^2 c_i^1 c_i^3$ or $c_i^1 c_i^3 c_i^2$, or one of their reversals. Clearly a literal column is always selected in each of these gadgets for any string $\sigma$ that is a C1 order with multiplicity of $M_\phi$.

Now, all $3n$ literal columns $c_i^j$ from the set of clause gadgets for $C$ will appear in the variable gadgets, where the variable gadget selects this column $c_i^j$ if $c_i^j$ is again enclosed on both the left and the right side by at least one column of the gadget in $\sigma$. So if a literal column $c_i^j$ is selected by a clause gadget, then it cannot be selected by this variable gadget, since $\sigma$ is a string and thus $c_i^j$ can be framed by at most two other columns of $\sigma$. The variable gadget for each $x \in X$ will use this property to ensure that literal vertices $x$ and $\neg x$ are not selected at the same time.

Figure 4.2: Graphical representation of the variable gadget for variable $x_\ell$ with positive occurrences $c_i^\alpha$ and $c_j^\beta$ and negated occurrence $c_k^\gamma$ in the clauses.

For each variable $x_\ell$ with the two positive occurrences $c_i^\alpha$ and $c_j^\beta$ and the negated occurrence $c_k^\gamma$, we already added to $M_\phi$ the columns $c_i^\alpha$, $c_j^\beta$ and $c_k^\gamma$, each of multiplicity two, in the corresponding clause gadgets for the clauses containing $x_\ell$ and $\neg x_\ell$. We further add to $M_\phi$ the two columns $x'_\ell$ and $x''_\ell$, each of multiplicity one, and the four rows $P_\ell^1 = [c_i^\alpha, c_k^\gamma, x'_\ell]$, $P_\ell^2 = [x'_\ell, c_j^\beta]$, $P_\ell^3 = [c_j^\beta, c_k^\gamma]$, $P_\ell^4 = [c_i^\alpha, x''_\ell]$. This is referred to as the variable gadget for $x_\ell$, depicted in Figure 4.2.
Now, consider a C1 order $\sigma$ with multiplicity for $M_\phi$ where the literal $c_k^\gamma$ is selected by some clause gadget. Since one copy of $c_k^\gamma$ is used up by this clause gadget, $\sigma$ must contain the substring $c_j^\beta c_k^\gamma c_i^\alpha x'_\ell$ or its reversal, because this is the only way to ensure consistency of $\sigma$ with rows $P_\ell^1$ and $P_\ell^3$ of $M_\phi$ using the one remaining copy of $c_k^\gamma$. If literal $c_j^\beta$ is also selected by some clause gadget, then there is no way that $\sigma$ can be consistent with $P_\ell^2$. Likewise, if literal $c_i^\alpha$ is also selected by some clause gadget, then there is no way that $\sigma$ can be consistent with $P_\ell^4$. Either case contradicts the fact that $\sigma$ is a C1 order with multiplicity for $M_\phi$. It follows that if $M_\phi$ has a C1 order $\sigma$ with multiplicity, then $\phi$ is satisfiable.

We now show that the converse holds, namely if $\phi$ has a satisfying truth assignment $\tau$, then $M_\phi$ has a C1 order $\sigma$ with multiplicity. Given $\tau$, we construct $\sigma$ as follows. For any variable $x_\ell$ with the two positive occurrences $c_i^\alpha$ and $c_j^\beta$ and the negative occurrence $c_k^\gamma$, one of the following two cases must hold in $\tau$:

1. $c_i^\alpha$ and $c_j^\beta$ are false and $c_k^\gamma$ is true: In this case, we create the substring $c_k^\mu c_k^\gamma c_k^\nu$, satisfying $S_k$, or $c_k^\mu c_k^\gamma c_k^* c_k^{**}$, satisfying $S_k^1$ and $S_k^2$, depending on whether $c_k^\gamma$ is part of a 3- or a 2-clause. Further, we create the substrings $c_j^\beta c_k^\gamma c_i^\alpha x'_\ell c_j^\beta$ and $c_i^\alpha x''_\ell$, fulfilling all $P_\ell^{1,2,3,4}$.

2. $c_i^\alpha$ and $c_j^\beta$ are true and $c_k^\gamma$ is false: In this case, we create the substrings $c_i^\mu c_i^\alpha c_i^\nu$, satisfying $S_i$ (resp., $c_i^\mu c_i^\alpha c_i^* c_i^{**}$, satisfying $S_i^1$ and $S_i^2$) and $c_j^\mu c_j^\beta c_j^\nu$, satisfying $S_j$ (resp., $c_j^\mu c_j^\beta c_j^* c_j^{**}$, satisfying $S_j^1$ and $S_j^2$). Further, we create the substring $x''_\ell c_i^\alpha c_k^\gamma x'_\ell c_j^\beta c_k^\gamma$, fulfilling all $P_\ell^{1,2,3,4}$.

The requirements for $\sigma$ imposed by all given rows of $M_\phi$ are fulfilled. It remains to be shown that the multiplicity constraint is met as well.
None of the columns $c_k^\gamma$ (resp., $c_k^*$, $c_k^{**}$), $x'_\ell$ and $x''_\ell$ used in the first case are affected by the second case for any other variable, and none of the columns $c_i^\alpha$ (resp., $c_i^*$, $c_i^{**}$), $c_j^\beta$ (resp., $c_j^*$, $c_j^{**}$), $x'_\ell$ and $x''_\ell$ are affected by the first case for any other variable. For all of these columns, the multiplicity constraint is met.

The column $c_i^\alpha$ is used twice in case one. The same column will occur as $c_i^\mu$ or $c_i^\nu$ in the second case for some other variable. But since in both of the corresponding substrings, the column is the first or last element, they can be merged into one substring using only two copies of $c_i^\alpha$. Here we might have to reverse the substring $c_i^\alpha x''_\ell$ to $x''_\ell c_i^\alpha$, still fulfilling $P_\ell^4$. The same argument holds for $c_j^\beta$.

Analogously, the column $c_k^\gamma$ is already used twice in case two. The same column will occur as $c_k^\mu$ or $c_k^\nu$ in the first case for some other variable. But since in both of the corresponding substrings, the column is the first or last element, they can be merged into one substring using only two copies of $c_k^\gamma$. Here we might have to reverse one of the substrings, still fulfilling the restrictions imposed by the rows of $M_\phi$.

Since each column only occurs in one row $S_i$ (resp., $S_i^1$) of $M_\phi$, each substring induced by rows $P_\ell^{1,2,3,4}$ has to be merged on one side, i.e., no cycles are created in the set of walks covering the corresponding hypergraph. Eventually, any concatenation of the constructed substrings yields a string $\sigma$ that is a C1 order with multiplicity of $M_\phi$. Thus if $\phi$ has a satisfying assignment $\tau$, then $M_\phi$ has a C1 order $\sigma$ with multiplicity.

Since the number of columns used in the construction is at most $5n + 2m$, the number of rows is at most $4n + 2m$, and each row is of degree at most 3, the construction is linear in the size of $\phi$ and can be built in linear time. Hence, deciding the mC1P is NP-hard for matrices with degree at most 3, even when no column has multiplicity greater than two.
4.2 Two Variants of the mC1P

Besides its classical definition, there are different generalizations of the mC1P discussed in the literature, such as r-windows [40, 49], max-gap clusters [68, 69, 122], and approximate gene clusters [19, 125]. Since deciding the mC1P is NP-complete, any generalization is NP-hard as well. In contrast to generalizations, there are also restricted variants of the mC1P that are relevant to settings in the reconstruction of AGOs. In the following, we will discuss such models, in particular the Consecutive-Ones Property with Multiplicity for Framed Rows (mC1P(fr)) and Consecutive-Ones Property with Multiplicity for Nested Rows (mC1P(ne)).

4.2.1 The mC1P(fr) Variant

The C1P of binary matrices where each row is framed by two columns, or the model of common intervals framed by two genes (whose orientations have to be conserved also), was first introduced as conserved intervals on permutations in Bergeron and Stoye [14]. In the reconstruction of AGOs, framed rows on permutations was the first model to formally state the problem of finding putative AGOs [15]. Here, we define the mC1P(fr), which models framed rows on sequences, to account for duplicate markers, for example.

Definition 39 (Framed Row of a Matrix). Let M be a binary matrix on the set of columns S = {1, . . . , n}. A framed row (for r ⊆ S) of M is denoted [a r b], where a, b ∈ S are its two extremities (or framing columns). We sometimes refer to the columns of r as the inner columns of this framed row. A framed row [a r b] is contained in a sequence σ on S if, somewhere in σ, a and b appear with the set of characters of the substring between a and b taken only from r.

Definition 40 (Consecutive-Ones Property with Multiplicity for Framed Rows (mC1P(fr))). A binary matrix M on the set of columns S = {1, . . . , n} with framed rows has the mC1P(fr) if there is a sequence σ that contains each framed row of M.
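Containment of a framed row can be tested by a direct scan of σ. The Python sketch below is only an illustration of Definition 39 under assumptions that we state explicitly: the function name `contains_framed_row` and its `exact` flag are hypothetical; with `exact=True` the characters between a and b must form exactly the set r (the reading under which the inner columns occur contiguously), while `exact=False` only requires them to be taken from r, as the definition is literally worded; and we assume that an occurrence may be read in either direction, i.e., as a … b or as b … a.

```python
def contains_framed_row(sigma, a, inner, b, exact=True):
    """Test whether the framed row [a inner b] is contained in sigma.

    Hypothetical helper. Both reading directions of sigma are tried
    (an assumption), and `exact` selects between the two readings of
    'taken only from r' described in the lead-in.
    """
    inner = set(inner)
    for seq in (list(sigma), list(reversed(sigma))):
        for i, x in enumerate(seq):
            if x != a:
                continue
            seen = set()
            for j in range(i + 1, len(seq)):
                y = seq[j]
                if y == b and (seen == inner or not exact):
                    return True
                if y not in inner:
                    break  # a foreign column interrupts the framed row
                seen.add(y)
    return False
```

In the sequence 5 1 2 3 2 4 6, for example, the framed row [1 {2, 3} 4] is found, while [1 {2} 4] is not, because column 3 appears between the two extremities.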
The obvious relationship of the mC1P and the mC1P(fr) allows us to infer an important correlation of these properties: any instance of the mC1P can be reduced to an instance of the mC1P(fr). Based on this, we can deduce the following statement.

Theorem 41. Given a degree 6 matrix M on set of columns S, deciding the mC1P(fr) is NP-complete for a multiplicity vector m where m(s) ≤ 2 for each s ∈ S.

Proof. Again, one can easily formulate an algorithm that verifies a given solution for correctness in polynomial time, which shows that the problem belongs to the complexity class NP. NP-hardness is shown by reducing the case for the mC1P used in the proof of Theorem 38 to the mC1P(fr).

The basic idea is to replace each row $B = \{e_1, \ldots, e_m\}$ by a framed row $\mathbf{B} = [\tilde{B}\ \{e_1, \ldots, e_m, \ldots\}\ \bar{B}]$ containing, besides others, the columns of $B$ as inner columns. Then, if this new instance allows for a valid sequence $\sigma$, there is a sequence $\sigma'$ for the original instance of the mC1P, obtained by simply removing all newly introduced columns from $\sigma$ such that only the columns contained in the rows $B$ are left in $\sigma'$. Because the inner columns of all framed rows have to occur contiguously in $\sigma$, the columns of the original rows occur contiguously in $\sigma'$.

Since the rows of the matrix used in the proof of Theorem 38 overlap, the framing columns have to be included into the set of inner columns of overlapping framed rows. However, no row is included in another. This allows us to use the following technique which ensures that, if there is a valid sequence for the original matrix, there is a valid sequence for the constructed set of framed rows. Together with the argument in the previous paragraph, this will yield equivalence of the two instances of the C1P with multiplicity.

For each row $B = \{e_1, \ldots, e_m\}$ overlapping with rows $B_1, \ldots, B_k$, we create a framed row $\mathbf{B} = [\tilde{B}\ \{e_1, \ldots, e_m, \tilde{B}_1, \ldots, \tilde{B}_k, \bar{B}_1, \ldots, \bar{B}_k\}\ \bar{B}]$ containing the framing columns of $\mathbf{B}_1, \ldots, \mathbf{B}_k$, the framed rows constructed for $B_1, \ldots, B_k$. Note that this means that the framing columns $\tilde{B}$ and $\bar{B}$ also appear as inner columns to $\mathbf{B}_1, \ldots, \mathbf{B}_k$. All rows used in the proof of Theorem 38 have the property that, for a valid sequence (a C1 order with multiplicity), an occurrence of a given row can overlap with the occurrence of only one other row on each side. Assume the occurrence of some $B$ overlaps with $B_l$ in $e_1, \ldots, e_l$ on one side and with $B_r$ in $e_r, \ldots, e_m$ on the other side. Then, we can extend the substring that fulfills $B$ to $\tilde{B}, e_1, \ldots, e_l, \bar{B}_l, \ldots, \tilde{B}_r, e_r, \ldots, e_m, \bar{B}$. Between $\bar{B}_l$ and $\tilde{B}_r$ we include all remaining inner columns of $\mathbf{B}$ in an arbitrary order. The resulting substring fulfills $\mathbf{B}$ and also allows a realization of the framed rows created for $B_l$ and $B_r$. This extension can be performed for all rows such that, finally, all framed rows are contained in the extended, overall string.

What remains to be shown is that such a construction is possible using at most six inner columns in each framed row, as well as that a maximum multiplicity of two is sufficient. To minimize the number of inner columns, we do not always add both framing columns to all overlapping rows. The structure of the rows used in the gadgets of the proof of Theorem 38 restricts the possible overlaps of their occurrences in a valid sequence.

As can be seen in the proof of Theorem 38, if there is a valid sequence, we can construct one using the following orders (or their reversals) of row occurrences within the gadgets: $P_\ell^3, P_\ell^1, P_\ell^2$ or $P_\ell^4, P_\ell^1, P_\ell^2, P_\ell^3$, and $S_i^1, S_i^2$. Within the gadget, $P_\ell^3$ can only be followed by $P_\ell^1$. We thus add $\bar{P}_\ell^3$ (but not $\tilde{P}_\ell^3$) to the inner columns of $\mathbf{P}_\ell^1$, and $\tilde{P}_\ell^1$ (but not $\bar{P}_\ell^1$) to those of $\mathbf{P}_\ell^3$. Analogously, we add $\bar{P}_\ell^1$ to $\mathbf{P}_\ell^2$ and $\tilde{P}_\ell^2$ to $\mathbf{P}_\ell^1$, $\bar{P}_\ell^4$ to $\mathbf{P}_\ell^1$ and $\tilde{P}_\ell^1$ to $\mathbf{P}_\ell^4$, $\bar{P}_\ell^2$ to $\mathbf{P}_\ell^3$ and $\tilde{P}_\ell^3$ to $\mathbf{P}_\ell^2$, and $\bar{S}_i^1$ to $\mathbf{S}_i^2$ and $\tilde{S}_i^2$ to $\mathbf{S}_i^1$.
A 2-clause gadget overlaps the gadgets of two variables, say $x_j$ and $x_k$. As can be seen in the proof of Theorem 38, if there is a valid sequence, we can construct one with one of the following orders (or their reversals) of row occurrences: $P_j^p, S_i^1, S_i^2$, or $P_k^q, S_i^1, S_i^2$, where $p, q \in \{2, 3, 4\}$, depending on where it overlaps the variable gadgets. Thus, we add $\tilde{S}_i^1$ to the inner columns of $\mathbf{P}_j^p$ and $\mathbf{P}_k^q$.

A 3-clause gadget overlaps the gadgets of three variables, say $x_j$, $x_k$ and $x_\ell$, in the columns $c_i^1$, $c_i^2$ and $c_i^3$, respectively. As can be seen in the proof of Theorem 38, if there is a valid sequence, we can construct one with one of the following orders (or their reversals) of row occurrences: $P_j^p, S_i, P_k^q$ or $P_j^p, S_i, P_\ell^r$ or $P_k^q, S_i, P_\ell^r$, where $p, q, r \in \{2, 3, 4\}$, depending on where the variable gadgets are overlapped by $S_i$. We add $\tilde{S}_i$ to the inner columns of $\mathbf{P}_j^p$, $\bar{S}_i$ to the inner columns of $\mathbf{P}_k^q$, and $\tilde{S}_i$ and $\bar{S}_i$ to the inner columns of $\mathbf{P}_\ell^r$. This way, in any of the three cases, there is at least one copy of each framing column of $S_i$ available on both sides.

In summary, we reduce a given set of rows as used in the proof of Theorem 38 to a set of framed rows with at most six inner columns as follows. For the rows $P_\ell^{1,2,3,4}$ used in the variable gadget for $x_\ell$, we create the framed rows

$\mathbf{P}_\ell^1 = [\tilde{P}_\ell^1\ \{c_i^\alpha, c_k^\gamma, x'_\ell, \tilde{P}_\ell^2, \bar{P}_\ell^3, \bar{P}_\ell^4\}\ \bar{P}_\ell^1]$,
$\mathbf{P}_\ell^2 = [\tilde{P}_\ell^2\ \{x'_\ell, c_j^\beta, \bar{P}_\ell^1, \tilde{P}_\ell^3\} \cup I_j^\beta\ \bar{P}_\ell^2]$,
$\mathbf{P}_\ell^3 = [\tilde{P}_\ell^3\ \{c_j^\beta, c_k^\gamma, \tilde{P}_\ell^1, \bar{P}_\ell^2\} \cup I_k^\gamma\ \bar{P}_\ell^3]$ and
$\mathbf{P}_\ell^4 = [\tilde{P}_\ell^4\ \{c_i^\alpha, x''_\ell, \tilde{P}_\ell^1\} \cup I_i^\alpha\ \bar{P}_\ell^4]$,

where

$$I_t^\mu = \begin{cases} \{\tilde{S}_t\} \text{ (or } \{\tilde{S}_t^1\} \text{ if } c_t^\alpha \text{ appears in a 2-clause)} & \text{if } \mu = \alpha, \\ \{\bar{S}_t\} \text{ (or } \{\tilde{S}_t^1\}\text{)} & \text{if } \mu = \beta, \\ \{\tilde{S}_t, \bar{S}_t\} \text{ (or } \{\tilde{S}_t^1\}\text{)} & \text{if } \mu = \gamma. \end{cases}$$
For the rows $S_i^{1,2}$ used in the 2-clause gadget for $c_i$, we create the framed rows

$\mathbf{S}_i^1 = [\tilde{S}_i^1\ \{c_i^1, c_i^2, c_i^*, \tilde{S}_i^2, \tilde{P}(c_i^1), \tilde{P}(c_i^2)\}\ \bar{S}_i^1]$ and
$\mathbf{S}_i^2 = [\tilde{S}_i^2\ \{c_i^*, c_i^{**}, \bar{S}_i^1\}\ \bar{S}_i^2]$,

where we define $\bar{P}(c_i^j)$ to be the right framing column of the (unique) $\mathbf{P}_\ell^m$ that contains $c_i^1$ and $c_i^2$ as inner columns. For the row $S_i$ used in the 3-clause gadget for $c_i$, we create the framed row

$\mathbf{S}_i = [\tilde{S}_i\ \{c_i^1, c_i^2, c_i^3, \tilde{P}(c_i^1), \tilde{P}(c_i^2), \tilde{P}(c_i^3)\}\ \bar{S}_i]$,

where we define $\tilde{P}(c_i^j)$ to be the left framing column of the (unique) $\mathbf{P}_\ell^m$ that contains $c_i^1$, and $c_i^2$ or $c_i^3$ as inner columns.

It remains to be shown that a maximum multiplicity of two for all newly added columns suffices. This is true, because each new column is included in the inner columns of at most two framed rows. In fact, we can assign a multiplicity of one to some of these columns. We define: $m(\bar{P}^{1,2,4}) = m(\tilde{S}_i^2) = m(\bar{S}_i^2) = 1$ and $m(\tilde{P}^{1,2,3,4}) = m(\bar{P}^3) = m(\tilde{S}_i) = m(\bar{S}_i) = 2$.

Since the number of columns used in the construction is at most $6n + 8m$, the number of framed rows is at most $4n + 2m$, and each framed row contains at most six inner columns, the construction is linear in the size of $\phi$ and can be built in linear time. Hence, deciding the mC1P(fr) is NP-hard for matrices with degree at most 6, even when no column has multiplicity greater than two.

Please note that, again, for a maximum multiplicity of one, polynomial solutions exist. Framed rows with no inner columns are equivalent to adjacencies, for which there is an efficient solution [151]. However, there is a gap left for framed rows with one to five inner columns. For these, the complexity is still open.

4.2.2 The mC1P(ne) Variant

Hoberman and Durand [69] discussed nestedness as a desired property of gene clusters (ancestral syntenies in our case) and proposed a first algorithm to identify respective clusters. Recently, Blin et al.
[17] formally defined and studied nested common intervals, and gave efficient algorithms to detect them in genomes modeled both as permutations and as sequences. Here, we define a notion of the mC1P for nested rows.

Definition 42 (Nested Row of a Matrix). Let M be a binary matrix on the set of columns S = {1, . . . , n}. The structure of a nested row of M is defined recursively. A nested row in M is either (i) a row {a, b} of degree 2, or (ii) a tuple (c, a) of a nested row c and a column a. A nested row (c, a) (resp., {a, b}) is contained in a sequence σ on S if a is adjacent to a substring σ′ of σ such that the character set of σ′ is the character set of c, and c is contained in σ′ (resp., a and b are adjacent in σ). Here, the character set CS of a nested row is defined recursively as (i) CS({a, b}) = {a, b}, and (ii) CS((c, a)) = CS(c) ∪ {a}.

Example 43. Consider the sequence σ = 5421236. The nested row (({2, 3}, 1), 4) is contained in σ: 4 is adjacent to the substring 2123, whose character set is {1, 2, 3}, and within this substring 1 is adjacent to the substring 23. In contrast, (({1, 3}, 2), 4) is not contained in σ since, although 4 is adjacent to a substring with character set {1, 2, 3} in σ, none of the occurrences of 2 is adjacent to a substring with character set {1, 3}. Note that row ({2, 3}, 3) is not contained in σ, because 3 is not adjacent to a substring with character set {2, 3}, whereas row ({1, 2}, 2) is contained in σ: the first occurrence of 2 is adjacent to the substring 12.

Definition 44 (Consecutive-Ones Property with Multiplicity for Nested Rows (mC1P(ne))). A binary matrix M on the set of columns S = {1, . . . , n} with nested rows has the mC1P(ne) if there is a sequence σ on S that contains each nested row of M.

We show now that even the strict assumption of nestedness is not strong enough to allow an efficient verification of this variant. In fact, similar to deciding the mC1P, there is no gap left for fixed-parameter tractability in the considered parameters.

Theorem 45.
Given a degree 3 matrix M on the set of columns S, deciding the mC1P(ne) for M is NP-complete for a multiplicity vector m where m(s) ≤ 2 for each s ∈ S.

Proof. NP-hardness is proven by reduction from 3SAT(3) using a construction very similar to that of Theorem 38. Given a 3SAT(3) formula φ, we will again design an instance $M_\phi$ of the matrix on nested rows comprising clause gadgets and a variable gadget, and then argue why they simulate (or rather that deciding the mC1P(ne) for this instance simulates) exactly this instance φ.

For each 2-clause $c_i$ with literals $c_i^1$ and $c_i^2$, we add to $M_\phi$ the two columns $c_i^1$ and $c_i^2$, each of multiplicity two, and the column $c_i^*$ of multiplicity one, and the nested row $S_i^1 = (\{c_i^1, c_i^2\}, c_i^*)$. The 2-clause gadget is depicted in Figure 4.3a. For each 3-clause $c_i$ with literals $c_i^1$, $c_i^2$ and $c_i^3$, we add to $M_\phi$ the three columns $c_i^1$, $c_i^2$ and $c_i^3$, each with multiplicity two, the three columns $\tilde{c}_i^1$, $\tilde{c}_i^2$ and $\tilde{c}_i^3$, each with multiplicity one, the three columns $\bar{c}_i^1$, $\bar{c}_i^2$ and $\bar{c}_i^3$, each with multiplicity two, and the six nested rows $S_i^1 = (\{c_i^1, \bar{c}_i^1\}, \tilde{c}_i^1)$, $S_i^2 = (\{c_i^2, \bar{c}_i^2\}, \tilde{c}_i^2)$, $S_i^3 = (\{c_i^3, \bar{c}_i^3\}, \tilde{c}_i^3)$, $S_i^4 = \{\bar{c}_i^1, \bar{c}_i^2\}$, $S_i^5 = \{\bar{c}_i^2, \bar{c}_i^3\}$, $S_i^6 = \{\bar{c}_i^3, \bar{c}_i^1\}$. The 3-clause gadget is depicted in Figure 4.3b.

Figure 4.3: Graphical representations of the (a) 2-clause gadget and (b) 3-clause gadget for clause $c_i$ in the mC1P(ne) case.

Note again that in both clause gadgets, at least one of the literal columns is selected in any valid string σ. For the 2-clause gadget, string σ has to contain one of the substrings $c_i^1, c_i^2, c_i^*$ or $c_i^2, c_i^1, c_i^*$, or one of their reversals, thus a literal column is always selected in this case.
In the 3-clause gadget, if no literal column is selected in string σ, i.e., σ contains substrings $\tilde{c}_i^q, \bar{c}_i^q, c_i^q$ (or their reversals) for q ∈ {1, 2, 3}, there is only one remaining copy of $\bar{c}_i^q$ for q ∈ {1, 2, 3} and hence there is no way that σ can be consistent with all of $S_i^{\{4,5,6\}}$ simultaneously, which is a contradiction. Therefore at least one literal column is selected in this case as well.

For each variable $x_\ell$ with the two positive occurrences $c_i^\alpha$ and $c_j^\beta$ and the negated occurrence $c_k^\gamma$, we already added to $M_\phi$ the columns $c_i^\alpha$, $c_j^\beta$ and $c_k^\gamma$, each of multiplicity two, in the corresponding clause gadgets for the clauses containing $x_\ell$ and $\neg x_\ell$. We further add to $M_\phi$ the two columns $x'_\ell$ and $x''_\ell$, each of multiplicity one, and the four nested rows $P_\ell^1 = (\{c_i^\alpha, c_k^\gamma\}, x'_\ell)$, $P_\ell^2 = \{x'_\ell, c_j^\beta\}$, $P_\ell^3 = \{c_j^\beta, c_k^\gamma\}$, $P_\ell^4 = \{c_i^\alpha, x''_\ell\}$. The variable gadget is depicted in Figure 4.4.

Figure 4.4: Graphical representation of the variable gadget for variable $x_\ell$ with positive occurrences $c_i^\alpha$ and $c_j^\beta$ and negated occurrence $c_k^\gamma$ in the clauses in the mC1P(ne) case.

Now, consider a valid string σ where the literal $c_k^\gamma$ is selected by some clause gadget. Since one copy of $c_k^\gamma$ is used up by this clause gadget, σ must contain the substring $c_j^\beta, c_k^\gamma, c_i^\alpha, x'_\ell$ or its reversal because it is the only way to ensure consistency for nested rows $P_\ell^1$ and $P_\ell^3$ with the one remaining copy of $c_k^\gamma$. If literal $c_j^\beta$ is also selected by some clause gadget, then there is no way that σ can be consistent with $P_\ell^2$. Likewise, if literal $c_i^\alpha$ is also selected by some clause gadget, then there is no way that σ can be consistent with $P_\ell^4$. Either case contradicts the fact that σ is valid. It follows that if $M_\phi$ has a valid string σ, then φ is satisfiable.

We now show that the converse holds, namely if φ has a satisfying truth assignment τ, then $M_\phi$ has a valid string σ. Given τ, we construct σ as follows.
For any variable $x_\ell$ with the two positive occurrences $c_i^\alpha$ and $c_j^\beta$ and the negative occurrence $c_k^\gamma$, in τ, either of the two cases must hold:

1. $c_i^\alpha$ and $c_j^\beta$ are false and $c_k^\gamma$ is true: In this case, we create the substring $\tilde{c}_k^\gamma, c_k^\gamma, \bar{c}_k^\gamma, \bar{c}_k^\mu, c_k^\mu, \tilde{c}_k^\mu, \bar{c}_k^\mu, \bar{c}_k^\nu, c_k^\nu, \tilde{c}_k^\nu, \bar{c}_k^\nu, \bar{c}_k^\gamma$ satisfying $S_k^{\{1,\dots,6\}}$, or $c_k^\gamma, c_k^\mu, c_k^*$ satisfying $S_k^1$, depending on whether $c_k^\gamma$ is part of a 3- or 2-clause. Further, we again create substrings $c_j^\beta, c_k^\gamma, c_i^\alpha, x'_\ell, c_j^\beta$ and $c_i^\alpha, x''_\ell$, fulfilling all $P_\ell^{\{1,2,3,4\}}$.

2. $c_i^\alpha$ and $c_j^\beta$ are true and $c_k^\gamma$ is false: In this case, we create the substrings $\tilde{c}_k^\alpha, c_k^\alpha, \bar{c}_k^\alpha, \bar{c}_k^\mu, c_k^\mu, \tilde{c}_k^\mu, \bar{c}_k^\mu, \bar{c}_k^\nu, c_k^\nu, \tilde{c}_k^\nu, \bar{c}_k^\nu, \bar{c}_k^\alpha$ satisfying $S_k^{\{1,\dots,6\}}$ (resp., $c_i^\mu, c_i^\alpha, c_i^*$ satisfying $S_i^1$) and $\tilde{c}_k^\beta, c_k^\beta, \bar{c}_k^\beta, \bar{c}_k^\mu, c_k^\mu, \tilde{c}_k^\mu, \bar{c}_k^\mu, \bar{c}_k^\nu, c_k^\nu, \tilde{c}_k^\nu, \bar{c}_k^\nu, \bar{c}_k^\beta$ satisfying $S_k^{\{1,\dots,6\}}$ (resp., $c_j^\mu, c_j^\beta, c_j^*$ satisfying $S_j^1$). Further, we again create the substring $x''_\ell, c_i^\alpha, c_k^\gamma, x'_\ell, c_j^\beta, c_k^\gamma$, fulfilling all $P_\ell^{\{1,2,3,4\}}$.

The requirements imposed by all given nested rows are fulfilled. Since none of the columns used in the first (resp., second) case are affected by the second (resp., first) case for any other variable (this time, $c_i^\alpha$, $c_j^\beta$ and $c_k^\gamma$ appear only once in either of the two cases), the multiplicity constraint is met as well. Finally, again, any concatenation of the constructed substrings yields a string σ that is valid w.r.t. $M_\phi$. Thus, if φ has a satisfying assignment τ, then $M_\phi$ has a valid string σ.

The number of columns used in this construction is at most 5n + 6m, the number of nested rows is at most 5n + 9m, and each nested row is of size at most three. The construction is therefore linear in the size of φ and can be built in linear time. Hence, deciding the mC1P(ne) is NP-hard for matrices with degree at most 3 in which no column has multiplicity greater than two.
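Returning to Definition 42, the containment test for nested rows is easy to state but slightly subtle to apply by hand. The following is a minimal sketch, not part of the original construction, of one way to check containment. The representation (a `frozenset` for a degree-2 row, a Python tuple `(c, a)` for a nested row) and the function names are assumptions made for illustration, and the sketch assumes that the substring adjacent to a does not include that occurrence of a itself.

```python
def character_set(row):
    """CS({a, b}) = {a, b}; CS((c, a)) = CS(c) ∪ {a}."""
    if isinstance(row, frozenset):
        return set(row)
    inner, col = row
    return character_set(inner) | {col}


def contains(seq, row):
    """Return True if the nested row `row` is contained in the sequence `seq`."""
    seq = list(seq)
    if isinstance(row, frozenset):
        # Base case: a and b must be adjacent somewhere in seq.
        return any({seq[i], seq[i + 1]} == set(row) for i in range(len(seq) - 1))
    inner, col = row
    target = character_set(inner)
    for p, x in enumerate(seq):
        if x != col:
            continue
        # Substrings that end immediately to the left of position p.
        for start in range(p):
            sub = seq[start:p]
            if set(sub) == target and contains(sub, inner):
                return True
        # Substrings that start immediately to the right of position p.
        for end in range(p + 2, len(seq) + 1):
            sub = seq[p + 1:end]
            if set(sub) == target and contains(sub, inner):
                return True
    return False


sigma = [5, 4, 2, 1, 2, 3, 6]  # the sequence of Example 43
print(contains(sigma, ((frozenset({2, 3}), 1), 4)))
print(contains(sigma, ((frozenset({1, 3}), 2), 4)))
```

On the sequence of Example 43 this sketch agrees with the four containment claims made there. It only verifies a given sequence; finding one is the hard problem of Theorem 45.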
Indeed, deciding the mC1P is a hard problem, since even two restricted versions of it are hard. In the next section, however, we present a class of instances for which deciding the mC1P is tractable, motivated by handling telomeres in the reconstruction of AGOs.

4.3 A Tractability Result for the Consecutive-Ones Property with Multiplicity

In this section, we present a tractability result for a family of matrices where every row of M has (i) at most one entry 1 in columns with multiplicity greater than one, or (ii) exactly two entries 1 in columns with multiplicity greater than one and no other entries. Our proofs rely on the two classical concepts of PQ-trees and Eulerian graphs. We first give the following technical preliminaries.

4.3.1 Preliminaries

Let M be a binary matrix, with rows R = {$r_1$, . . . , $r_m$}, columns S = {$s_1$, . . . , $s_n$} and ℓ entries 1. We represent a row r of M as a subset of S, defined as the set of $s_i$ such that M[r, $s_i$] = 1. A column s with multiplicity m(s) > 1 is called a multicolumn and a row r containing a multicolumn (i.e., M[r, s] = 1 for some column s with m(s) > 1) is called a multirow. A multirow that does not contain any other multirow is called minimal.

(a) M
            1  2  3  4  5  a  b
$r_1$       1  1  0  0  0  1  1
$\hat{r}_1$ 1  1  0  0  0  0  0
$r_2$       1  1  1  0  0  0  0
$r_3$       0  0  1  1  1  0  1
$\hat{r}_3$ 0  0  1  1  1  0  0
$r_4$       0  0  0  1  1  0  1
$\hat{r}_4$ 0  0  0  1  1  0  0
$r_5$       1  0  0  1  1  0  0

(b) $\hat{M}$
            1  2  3  4  5
$r_1$       1  1  0  0  0
$r_2$       1  1  1  0  0
$r_3$       0  0  1  1  1
$r_4$       0  0  0  1  1
$r_5$       1  0  0  1  1

Figure 4.5: (a) Binary matrix M, with matched multirows. Let m(1) = · · · = m(5) = 1 and m(a) = m(b) = 2: a and b are multicolumns and $r_1$, $r_3$ and $r_4$ are multirows. Row $r_3$ is not minimal, because it contains $r_4$. (b) The corresponding matrix $\hat{M}$. Since in $\hat{M}$, by definition $\hat{r}_i = r_i$ for all multirows $r_i$, the matched multirows are discarded.

We say a binary matrix M with multiplicity vector m : S → N
has matched multirows if, for every multirow r ⊆ S that contains at least two entries 1 in non-multicolumns, there exists a row $\hat{r}$ which is a copy of r where all entries in multicolumns have been discarded (i.e., switched from 1 to 0). We denote by $\hat{M}$ the binary matrix obtained from M by discarding all multicolumns. In this work, we assume that all matrices we deal with have matched multirows unless otherwise stated. Figure 4.5 illustrates the above definitions. We now have the important lemma about the mC1P of matrices with matched multirows, which leads to this tractability result.

Lemma 46. Every C1 order with multiplicity of M with multiplicity vector m contains a C1 order of $\hat{M}$ as a subsequence. As a consequence, if a binary matrix M has the mC1P, then $\hat{M}$ has the C1P.

This lemma suggests that, to decide if M has the mC1P for a given multiplicity vector m, we can first check if $\hat{M}$ has the C1P, and then extend a C1 order of $\hat{M}$ into a C1 order with multiplicity of M by adding copies of multicolumns. Note that the matrix $\hat{M}$ in Figure 4.5 does not have the C1P, and hence, M does not have the mC1P. However, if we omit row $r_5$, then 12345 is a C1 order of $\hat{M}$, which can be extended to the following C1 order with multiplicity of M: ab12345b. To account for the fact that there can be an exponential number of C1 orders of $\hat{M}$, we use PQ-trees, a linear size structure that can describe all C1 orders of $\hat{M}$, defined below. For a more complete treatment of PQ-trees, we refer the reader to Booth and Lueker [21] or Meidanis et al. [106].

The frontier F(T) of a PQ-tree T of a matrix M on columns S is the sequence of S obtained by reading the labels of its leaves from left to right. The frontier of an internal (P- or Q-) node N in T is the frontier of the subtree rooted at N. Let {F(N)} be the set of elements appearing in the sequence F(N).
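To make the notions of frontier F(N) and frontier set {F(N)} concrete, here is a minimal sketch of a PQ-tree data structure with a frontier function. The class names `Leaf` and `Node` and the example tree are assumptions made for illustration. The tree was written down by hand for the rows {1, 2}, {1, 2, 3}, {3, 4, 5}, {4, 5} (the matrix of Figure 4.5(b) without row r5) and is not produced by a PQ-tree construction algorithm.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    kind: str                              # "P" or "Q"
    children: List[Union["Node", Leaf]]


def frontier(t):
    """F(N): labels of the leaves below t, read from left to right."""
    if isinstance(t, Leaf):
        return [t.label]
    out = []
    for child in t.children:
        out.extend(frontier(child))
    return out


def frontier_set(t):
    """{F(N)}: the set of elements appearing in F(N)."""
    return set(frontier(t))


# Hand-built example tree (an assumption, see the lead-in above).
tree = Node("Q", [
    Node("P", [Leaf("1"), Leaf("2")]),
    Leaf("3"),
    Node("P", [Leaf("4"), Leaf("5")]),
])

print("".join(frontier(tree)))
print(frontier_set(tree.children[0]))
```

Reading the leaves of this tree from left to right gives the C1 order 12345 mentioned above.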
Two PQ-trees are equivalent if one can be obtained from the other by applying a sequence of the following transformation rules: (RP) arbitrarily permute the children of a P-node; (RQ) reverse the order of the children of a Q-node.

Theorem 47 (Booth and Lueker [21]). If a binary matrix M has the C1P, there exists a unique equivalence class $PQ_M$ of PQ-trees with the property that there is a one-to-one correspondence between the C1 orders of M and the frontiers of the PQ-trees of $PQ_M$, and a PQ-tree belonging to $PQ_M$ can be constructed in linear time.

Each PQ-tree in the equivalence class $PQ_M$ satisfies the following properties (that are implicitly given in Booth and Lueker [21] and McConnell [102]) which we will use in this section.

Property 48. Let M be a binary matrix that has the C1P with rows R and T a PQ-tree in the equivalence class $PQ_M$. Then
1. for every row r ∈ R, there is a node N in T such that either {F(N)} = r, if N is a P-node, or r is consecutive in F(N), if N is a Q-node;
2. for every node N different from the root of T, there is a row r ∈ R such that {F(N)} ⊆ r; and
3. for every Q-node N, and every two consecutive children $N_1$ and $N_2$ of N, there is a row r ∈ R such that {F($N_1$)} ∪ {F($N_2$)} ⊆ r.

Finally, we recall briefly the technique used to prove that matrices with two entries 1 per row (usually called matrices of degree 2) form a class of tractable instances for deciding the mC1P, as we will use it to prove our main result. Such matrices can be naturally represented as a collection of adjacency constraints $\mathcal{A} = \{\{a_i, b_i\}\}_{i=1}^{m}$ on the set S, where $a_i \neq b_i$ and the collection is a set (no duplicate elements). Collection $\mathcal{A}$ is consistent with respect to m if there is a sequence σ on S such that each adjacency is consecutive in σ. We will refer to this sequence as a consistency sequence of $\mathcal{A}$ and m.
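As a small illustration of the definition just given, the following sketch checks whether a candidate sequence is a consistency sequence for a collection of adjacencies. It reads the definition as also requiring that each column s occurs at most m(s) times, as in the definition of the mC1P. The function name and the example data are assumptions made for illustration.

```python
from collections import Counter


def is_consistency_sequence(adjacencies, m, sigma):
    """Check that sigma respects the multiplicities in m and that every
    adjacency {a, b} (with a != b) occurs as two consecutive symbols."""
    counts = Counter(sigma)
    # Multiplicity constraint: each column s occurs at most m(s) times;
    # symbols that are not columns of S are rejected.
    if any(counts[s] > m.get(s, 0) for s in counts):
        return False
    # All unordered pairs of neighbouring symbols in sigma.
    neighbours = {frozenset(pair) for pair in zip(sigma, sigma[1:])}
    return all(frozenset(adj) in neighbours for adj in adjacencies)


# Hypothetical instance: a triangle a-b-c plus the adjacency {a, d}.
adjacencies = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
m = {"a": 2, "b": 1, "c": 1, "d": 1}
print(is_consistency_sequence(adjacencies, m, "dabca"))
```

The sketch only verifies a given sequence and does not search for one.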
Note that a C1 order with multiplicity of M is a consistency sequence of the corresponding collection $\mathcal{A}$ and m, and vice versa, and hence, M has the mC1P for m if and only if $\mathcal{A}$ is consistent with respect to m. Given a collection of adjacencies $\mathcal{A}$, we define the graph $G_{\mathcal{A}}$ with vertex set S and edges given by adjacencies.

Theorem 49 (Wittler and Stoye [151]). A collection of adjacencies $\mathcal{A}$ is consistent with respect to a multiplicity vector m if and only if for all $s_i$ ∈ S, $\mathrm{degree}_{G_{\mathcal{A}}}(s_i) \le 2m(s_i)$ and for each connected component B ⊆ S of $G_{\mathcal{A}}$, for at least one $s_i$ ∈ B, $\mathrm{degree}_{G_{\mathcal{A}}}(s_i) < 2m(s_i)$.

The above theorem relies on the fact that the graph $G_{\mathcal{A}}$ satisfying the above conditions can be extended to a multigraph on S ∪ {$s_0$} that has an Eulerian cycle. It can be easily seen that the proof presented in Wittler and Stoye [151] applies to generalized adjacencies, where we allow $a_i = b_i$ and the collection to be a multiset, and we require that each adjacency in $\mathcal{A}$ appears in σ in a unique position. Note that $G_{\mathcal{A}}$ is now a multigraph with self-loops. We have the following corollary.

Corollary 50. A collection of generalized adjacencies $\mathcal{A}$ is consistent with respect to a multiplicity vector m if and only if for all $s_i$ ∈ S, $\mathrm{degree}_{G_{\mathcal{A}}}(s_i) \le 2m(s_i)$ and for each connected component B ⊆ S of $G_{\mathcal{A}}$, for at least one $s_i$ ∈ B, $\mathrm{degree}_{G_{\mathcal{A}}}(s_i) < 2m(s_i)$.

4.3.2 A Tractable Case of Deciding the mC1P

Our main result is that deciding the mC1P is tractable for a large family of matrices with constraints on the maximum number of entries 1 in multicolumns a row can have. The motivation for studying this particular family of matrices arises from incorporating information on telomeres in ancestral gene order reconstruction (cf. Chapter 1).

Theorem 51. Let M be a binary matrix and m a multiplicity vector such that
(1) M has matched multirows, and
(2) each row contains either (i) at most one entry 1 in multicolumns, or (ii) two entries 1 in multicolumns and no other entries.
Deciding if M has the mC1P for m can be done in polynomial time and space.

We split the proof into two parts. First, we consider the case (2i) where M with multiplicity vector m contains a single multicolumn, and we show that deciding if M has the mC1P for m can be done efficiently using PQ-trees. Then we show how to handle the general case using Corollary 50 which relies on Eulerian cycles. Finally, in Section 4.3.3, we give an algorithm for building a PQ-tree which describes all sequences that satisfy the consecutivity requirement (condition (i) of Property 3 defined in Chapter 1).

The Case of a Single Multicolumn

We assume that the multiplicity vector m defines only one multicolumn denoted by $c'$. According to Lemma 46, M satisfies the mC1P only if $\hat{M}$ has the C1P, which can be checked in linear time (Theorem 47). Assume that $\hat{M}$ has the C1P and let T be a PQ-tree from the equivalence class $PQ_{\hat{M}}$. We then aim at finding a PQ-tree from $PQ_{\hat{M}}$ (by applying operations (RP) and (RQ) on T) whose frontier can be extended to a valid C1 order with multiplicity by inserting copies of $c'$. We say that inserting a copy of $c'$ into F(T) breaks a row r of $\hat{M}$ if r is not consecutive in the resulting sequence. An example is given in Figure 4.6. Recall that rows are subsets of S. As M has matched multirows, all rows in $\hat{M}$ are also rows in M. Since the consecutivity of the 1's in each row of $\hat{M}$ in the frontier F(T) has to be maintained when inserting copies of $c'$, no $c'$ can be inserted into a position where it breaks any row of $\hat{M}$. Lemma 52 below is a consequence of this observation.

Lemma 52. Let M be a binary matrix with matched multirows, and m be a multiplicity vector defining exactly one multicolumn $c'$. Assume that M has the mC1P, and let T be a PQ-tree from $PQ_{\hat{M}}$ and $T'$ an extension of T whose frontier F($T'$) is an mC1 order of M.
1.
If the root of T is a P-node, then, for each child node N of the root, $c'$ can only appear as the first or last element of the frontier F(N) in $T'$.
2. If the root of T is a Q-node, the copies of $c'$ in $T'$ can only appear as the first and/or last element of the frontier F($T'$).

Proof. It follows by Property 48.2 that for every child N of the root of T, any pair of consecutive leaves in F(N) belongs to a row of $\hat{M}$, and hence, inserting $c'$ between these leaves breaks this row. In addition, if the root of T is a Q-node, then by Property 48.3, for any two consecutive children $N_1$ and $N_2$ of the root, there is a row of $\hat{M}$ that contains elements of F($N_1$) and of F($N_2$). This prevents the insertion of $c'$ at the root between $N_1$ and $N_2$, as this would break such a row. Hence $c'$ can appear only at the extremities of F($T'$).

(a) M
            1  2  3  4  5  6  7  8  9  $c'$
$r_1$       1  1  0  0  0  0  0  0  0  1
$\hat{r}_1$ 1  1  0  0  0  0  0  0  0  0
$r_2$       1  1  1  0  0  0  0  0  0  0
$r_3$       0  0  1  1  0  0  0  0  0  1
$\hat{r}_3$ 0  0  1  1  0  0  0  0  0  0
$r_4$       0  0  0  0  0  0  1  1  0  1
$\hat{r}_4$ 0  0  0  0  0  0  1  1  0  0
$r_5$       0  0  0  0  0  0  0  1  1  0
$r_6$       0  0  0  0  1  1  0  0  0  0

Figure 4.6: (a) Binary matrix M, with matched multirows. Let m($c'$) = 2. (b) PQ-tree belonging to the equivalence class $PQ_{\hat{M}}$. P-nodes are represented by circular nodes and Q-nodes by rectangular nodes. An example of a valid C1 order with multiplicity is $c'$ 1 2 3 4 $c'$ 7 8 9 5 6, which is obtained by taking the equivalent PQ-tree with frontier 1 2 3 4 7 8 9 5 6 and inserting two copies of $c'$ into the corresponding positions. Notice that inserting $c'$ between 2 and 3 would break row $r_2$. Illustration of Algorithm 2: LCA($\hat{r}_1$) and the respective segments of LCA($\hat{r}_{3,4}$) are highlighted in gray and the respective paths are depicted by dashed lines. The upper left edge is contained in two paths. Here, $K_1 = 1$ and $K_2 = 1$, thus $K = 2 \le m(c') = 2$.
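The notion of an insertion that breaks a row can be checked mechanically. The sketch below, with function names chosen for illustration, lists the gaps of a frontier at which a copy of the multicolumn can be inserted without breaking any row of the matrix without multicolumns. The rows and the frontier 1 2 3 4 7 8 9 5 6 are taken from Figure 4.6; gap g denotes the position just before the g-th element (counting from 0), and gap 9 is the end of the frontier.

```python
def is_consecutive(seq, row):
    """True if the elements of `row` form one contiguous block of `seq`.
    Each element of `row` is assumed to occur exactly once in `seq`."""
    positions = [i for i, x in enumerate(seq) if x in row]
    if len(positions) != len(row):
        return False
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def safe_gaps(frontier, rows_hat, marker="c'"):
    """Gaps of `frontier` where inserting `marker` breaks no row of rows_hat."""
    safe = []
    for g in range(len(frontier) + 1):
        extended = frontier[:g] + [marker] + frontier[g:]
        if all(is_consecutive(extended, r) for r in rows_hat):
            safe.append(g)
    return safe


# Rows of the matrix of Figure 4.6 after discarding the multicolumn c'.
rows_hat = [{1, 2}, {1, 2, 3}, {3, 4}, {7, 8}, {8, 9}, {5, 6}]
frontier = [1, 2, 3, 4, 7, 8, 9, 5, 6]
print(safe_gaps(frontier, rows_hat))
```

For this frontier the only safe gaps are at the extremities of the blocks 1 2 3 4, 7 8 9 and 5 6, in line with the first statement of Lemma 52 for a root that is a P-node.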
Lemma 52 rules out many positions in F(T) where copies of $c'$ could be inserted: indeed, copies of $c'$ can only be inserted at extremities of the subsequences of F(T) formed by children of the root (and only at the extremities of F(T), if the root is a Q-node). On the other hand, each multirow specifies a position where a copy of $c'$ must be inserted. These two constraints give rise to a polynomial algorithm which we describe in the following.

Algorithm 2 starts with a PQ-tree for $\hat{M}$ and works in two stages. First (Step 3), based on Lemma 52, it checks if there is a way to permute nodes in the subtrees rooted at each child of the root such that for each multirow $r = \hat{r} \cup \{c'\}$, the columns in $\hat{r}$ appear as a prefix or a suffix of the frontier of some child. To satisfy the consecutivity requirement for each multirow r it is enough to add copies of $c'$ to F(T) before or after the frontier of the child of the root containing $\hat{r}$. To satisfy the multiplicity constraint imposed by m, we need to permute the children of the root and possibly reverse the order of the frontier of some children. The basic idea is that we can save one copy of $c'$ if a child requiring a copy of $c'$ on the right is followed by a child requiring a copy of $c'$ on the left. Whether enough copies of $c'$ can be saved to satisfy the multiplicity constraint is checked in Steps 4–5.

Let $r = \hat{r} \cup \{c'\}$ be a multirow. By Property 48.1, there is in T either a P-node that contains exactly the columns in $\hat{r}$ in its subtree, or a Q-node with a segment of two or more consecutive children which together contain exactly the columns in $\hat{r}$ in their subtrees. This node is the least common ancestor in T of the columns in $\hat{r}$, and hence, will be denoted by LCA($\hat{r}$).

We now argue that Algorithm 2 is correct. If condition 3.c.i applies, r would require the insertion of a copy of $c'$ within F(U) in any PQ-tree of $PQ_{\hat{M}}$, which contradicts Lemma 52.
The paths indicate positions where copies of $c'$ have to be added to the frontier so that the consecutivity requirement is satisfied. Following Lemma 52, we have to verify whether we can transform T such that all paths lie on the outside of the subtree of a child of the root of T. If conditions 3.c.ii–3.c.iv apply, there are two or more competing multirows, and we cannot transform T such that all of the corresponding paths lie on the outside of the subtree of a child of the root of T. Paths that are sub-paths of one another are excluded by not considering any multirow $r = \hat{r} \cup \{c'\}$ which contains another multirow $r' = \hat{r}' \cup \{c'\}$ (Step 3). These rows do not need to be considered at this stage, because in any order with $c'$ adjacent to the elements in $\hat{r}'$, since $\hat{r}' \subseteq \hat{r}$, $c'$ is also adjacent to the elements in $\hat{r}$. If the root of T is a P-node, we have to consider the children of the root node separately: we could insert a copy of $c'$ on both sides of a frontier of a child of the root, i.e., at most two paths can join above such a child node. In levels below the root, only one path can be moved to the border of the subtree, i.e., no two edges can join.

Algorithm 2 Deciding the mC1P for a matrix M with matched multirows and a multiplicity vector m defining a single multicolumn $c'$.
1. Check if $\hat{M}$ has the C1P.
2. If not, return false, else let T be a PQ-tree from $PQ_{\hat{M}}$.
3. For each minimal multirow $r = \hat{r} \cup \{c'\}$ in M do
   a. Locate N := LCA($\hat{r}$).
   b. Let $P_r$ be the path from N to the root of T.
   c. For each edge e = {U, V} in $P_r$, where U is the parent of V do
      i. If U is a Q-node and V is neither its first nor its last child, return false;
      ii. If the root of T is a Q-node and e also belongs to the path $P_{r'}$ defined by another minimal multirow $r'$, return false;
      iii. If U is not the root of T and e also belongs to the path defined by another minimal multirow, return false;
      iv. If U is the root of T and e also belongs to the paths defined by at least two other minimal multirows, return false.
4. If the root of T is a Q-node, return true.
5. If the root of T is a P-node:
   a. Let $K_1$ and $K_2$ be the number of children of the root of T belonging to exactly one or two paths defined by minimal multirows, respectively.
   b. $K := \lceil K_1/2 \rceil + K_2 + \delta$, where $\delta = 1$ if $K_1 = 0$ and $K_2 > 0$, and $\delta = 0$ otherwise.
   c. Return $K \le m(c')$.

If conditions 3.c.i–iv do not apply for a multirow r, there is a way to transform T (with rules (RP) and (RQ)) in the nodes on the path $P_r$ (excluding the root) so that the frontier of N = LCA($\hat{r}$) appears as a prefix or suffix of the frontier of $N'$, where $N'$ is a child of the root lying on the path $P_r$. Next, we will show that all these transformations can be performed simultaneously without any conflict. Obviously, conflicts could only occur if the paths $P_r$ share vertices other than the root. Condition 3.c.iv guarantees that there are never three or more minimal multirows in the same subtree rooted at a child $N'$ of the root. Condition 3.c.iii guarantees that if there are two minimal multirows in the same subtree rooted at a child $N'$ of the root, their paths must meet only in $N'$, and hence, one can appear as a prefix and one as a suffix of the frontier of $N'$. However, if the root is a Q-node, by Lemma 52, column $c'$ can be attached only on one side of the frontier of $N'$, and hence, only one minimal multirow can appear in the subtree rooted at $N'$, which is checked in condition 3.c.ii. Hence, if Step 3 succeeds for all rows, there is a PQ-tree in $PQ_{\hat{M}}$ from which we can obtain a sequence of the columns fulfilling the consecutivity requirement of M by inserting copies of $c'$ into its frontier at positions indicated by the paths of multirows.

Steps 4–5 check if the multiplicity constraint imposed by m can be satisfied. Note that if the root of T is a Q-node (Step 4), then the multiplicity constraint is satisfied since $m(c') \ge 2$.
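The count in Step 5 of Algorithm 2 can be written down directly. The sketch below reads Step 5.b as K = ⌈K1/2⌉ + K2, plus one extra copy when K1 = 0 and K2 > 0. This reading is an assumption, chosen because it matches the counting argument given for Step 5 and the values K1 = 1, K2 = 1, K = 2 reported in Figure 4.6. The function names are illustrative.

```python
def copies_needed(k1: int, k2: int) -> int:
    """Number K of copies of the multicolumn needed at a P-node root.

    k1: children of the root lying on exactly one path,
    k2: children of the root lying on exactly two paths.
    """
    # One-path children are paired so that each pair shares one copy.
    paired = (k1 + 1) // 2          # ceil(k1 / 2)
    # Chaining only two-path children costs one extra copy at the end.
    extra = 1 if (k1 == 0 and k2 > 0) else 0
    return paired + k2 + extra


def step5(k1: int, k2: int, multiplicity: int) -> bool:
    """Step 5.c: accept iff K does not exceed m(c')."""
    return copies_needed(k1, k2) <= multiplicity


print(copies_needed(1, 1), step5(1, 1, 2))
```

With K1 = 1 and K2 = 1 this gives K = 2, so an instance with m(c') = 2 is accepted, as in Figure 4.6.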
In Step 5, we count the number of copies of $c'$ required to satisfy all multirows. The positions at which to insert these copies are given by the paths. Since the root of T is a P-node, we can rearrange the children of the root such that one copy of $c'$ coincides with two paths (from neighboring children). For instance, we can greedily pair nodes with one path each, using $\lceil K_1/2 \rceil$ copies, and then include nodes with two paths (one path on each side) in-between, requiring one further copy each, $K_2$ in total. If $K_1 = 0$ and $K_2 > 0$, chaining the two-path nodes results in $K_2 + 1$ copies of $c'$. It is easy to see that this joining process is optimal. If the number of required copies of $c'$ does not exceed the given maximum multiplicity $m(c')$, the given matrix M with multiplicity vector m has the mC1P.

Finally, to complete the proof of the correctness of the algorithm, we only need to notice that the result of Algorithm 2 does not depend on the choice of the PQ-tree T of $PQ_{\hat{M}}$, as the LCAs and paths are invariant under the transformation rules (RQ) and (RP).

The analysis of the time and space complexity of Algorithm 2 is as follows. First, Steps 1 and 2 can be completed in O(m + n + ℓ) time and space using the algorithm described in McConnell [102]; note that T can then be encoded in O(n) space. Next, Step 3 is composed of at most m iterations, each of them requiring time O(n), the maximum length of a path from N to the root of T, as each path is processed in time linear in its length. This gives an O(mn) time complexity for Step 3. For similar reasons, Step 4 can be achieved in time O(mn), which gives an overall worst-case time complexity of O(mn). This completes the proof of the case of a single multicolumn in Theorem 51.

Completing the Proof of Theorem 51

Proof of Theorem 51. Given matrix M with multiplicity vector m and having matched multirows, let $S'$ be its set of multicolumns. A multirow containing multicolumn $c' \in S'$ will be called a $c'$-multirow.
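The general case, treated next, reduces to the degree conditions of Corollary 50. The following is a minimal sketch of that degree test for a multiset of generalized adjacencies. The function name and the input encoding (a list of pairs, where repetitions and pairs with equal entries are allowed) are assumptions made for illustration. It uses a small union-find structure to identify connected components, and a self-loop is counted as contributing 2 to the degree of its vertex.

```python
from collections import defaultdict


def is_consistent(adjacencies, m):
    """Degree test of Corollary 50 for generalized adjacencies."""
    degree = defaultdict(int)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in adjacencies:
        degree[a] += 1
        degree[b] += 1              # a self-loop (a == b) adds 2 to degree[a]
        parent[find(a)] = find(b)   # merge the components of a and b

    # Condition 1: degree(s) <= 2 m(s) for every vertex.
    if any(degree[s] > 2 * m[s] for s in degree):
        return False
    # Condition 2: every component has a vertex with degree(s) < 2 m(s).
    has_slack = defaultdict(bool)
    for s in degree:
        if degree[s] < 2 * m[s]:
            has_slack[find(s)] = True
    return all(has_slack[find(s)] for s in degree)


triangle = [("a", "b"), ("b", "c"), ("c", "a")]
print(is_consistent(triangle, {"a": 1, "b": 1, "c": 1}))
print(is_consistent(triangle, {"a": 2, "b": 1, "c": 1}))
```

In the example, a triangle of adjacencies is inconsistent when every multiplicity is one, and becomes consistent once one vertex may occur twice (for instance in the sequence abca). Vertices of degree zero are ignored here, since they satisfy both conditions trivially.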
Algorithm 3 works in the same two stages as Algorithm 2. However, the second stage is more complex. It requires building the collection of generalized adjacencies $\mathcal{A}$ on set $S' \cup \{s_0\}$ by replacing each child of the root of the PQ-tree T for $\hat{M}$ by an adjacency and then applying Corollary 50.

Algorithm 3 Deciding the mC1P for a matrix M with matched multirows and a multiplicity vector m.
1. Run the first 4 steps of Algorithm 2, where $c'$ is any element of $S'$.
2. Construct a multiset of generalized adjacencies $\mathcal{A}$ on set $S' \cup \{s_0\}$ as follows. For every child N of the root of T do
   a. If N belongs to exactly one path defined by multirows, say by a $c'$-multirow, add adjacency $\{c', s_0\}$ to $\mathcal{A}$;
   b. If N belongs to two paths defined by multirows, say by a $c'$-multirow and a $d'$-multirow ($c'$ and $d'$ may be equal), add adjacency $\{c', d'\}$ to $\mathcal{A}$.
3. Report if $\mathcal{A}$ is consistent with respect to m (use Corollary 50).

Correctness of Step 1 follows from the correctness of the first stage of Algorithm 2. If Step 1 succeeds, we can assume that the root of T is a P-node (the case when the root is a Q-node is handled in Step 1), and hence, it is enough to satisfy the multiplicity constraint by permuting the children of the root and possibly reversing the order of the frontiers of some children. Let π be this order of children of the root. In Step 2, the algorithm constructs the multiset of generalized adjacencies $\mathcal{A}$ whose consistency sequence (produced in Step 3) describes the way to do this as follows. Children that belong to zero paths defined by multirows will not introduce any adjacency constraints and can be placed at the end of π in any order and orientation. For any other child of the root, we have a unique position in the consistency sequence, hence we can order and orient these children based on these positions. Next, we insert copies of multicolumns as follows.
For each subsequence $c_1 c_2 c_3$ of the consistency sequence, where adjacency $\{c_1, c_2\}$ corresponds to child $N_1$ and $\{c_2, c_3\}$ to $N_2$, if $c_2 \neq s_0$, we insert a copy of $c_2$ between the frontiers of $N_1$ and $N_2$ in F(T). Hence, the number of copies of a multicolumn $c' \in S'$ is equal to the number of its occurrences in the consistency sequence. Therefore, the frontier F(T) with all required copies of multicolumns inserted satisfies the multiplicity constraint given by m. It is easy to see that if there is an mC1 order of M, then we can extract from it an order of the children of the root which gives this consistency sequence.

The analysis of the time complexity is as follows. The first stage of the algorithm is a subroutine of Algorithm 2, and hence, has a time and space complexity of order O(mn). Since the number of children of the root of T that belong to at least one path defined by multirows is at most m, the number of adjacencies in $\mathcal{A}$ is at most m, and hence, building $\mathcal{A}$ takes time O(m). Finally, checking the degree conditions (applying Corollary 50) takes time O(n). Hence, the total time and space complexity of the algorithm is O(mn).

Finally, Algorithm 3 can also be easily extended to the case when the matrix also contains rows of degree 2 containing two multicolumns, as follows. First, we run Steps 1 and 2 where we ignore multirows containing two multicolumns. Then, we also add to $\mathcal{A}$ an adjacency for every such multirow. Finally, we run Step 3 of the algorithm on this new collection $\mathcal{A}$. It is easy to see that the time complexity of this new algorithm is still O(mn). Hence, the theorem holds.

Figure 4.7: Augmented PQ-tree $T'$ for the matrix given in Figure 4.6. (In fact, to get an augmented PQ-tree from the original PQ-tree shown in Figure 4.6, no modifications are necessary other than attaching leaf nodes labeled $c'$ at appropriate locations.)
Only the trees in the equivalence class of $T'$ where the left side of the right Q-node is placed adjacent to the left Q-node have shortened frontiers that meet the multiplicity constraint ($m(c') = 2$), for example, $c'$ 1 2 3 4 $c'$ 7 8 9 5 6.

4.3.3 Building a PQ-tree which Describes All Sequences that Satisfy the Consecutivity Requirement

Here, we describe how a given PQ-tree $T \in PQ_{\hat{M}}$ can be augmented to a PQ-tree $T'$ which represents the set of all sequences σ, up to “pumping” occurrences of multicolumns, that satisfy the consecutivity requirement (condition (i) of Property 3 in Chapter 1) in that the frontier of any tree in the equivalence class of $T'$ is such a sequence σ. However, not all frontiers meet the multiplicity constraint (condition (ii) of Property 3). For some trees in the equivalence class of $T'$, the respective frontier contains pairs of adjacent occurrences of a multicolumn $c'$, each of which can be replaced by one occurrence of $c'$ without breaking any row of M (violating the consecutivity requirement). This reduces the number of used copies of the multicolumns. Only such shortened frontiers which meet the multiplicity constraint are valid mC1 orders, and, in fact, the set of such shortened frontiers is exactly the set of valid mC1 orders of M. Figure 4.7 shows an example.

To construct an augmented PQ-tree $T'$, we process the original tree T in a bottom-up fashion along the paths $P_r$ defined in Algorithm 2, starting with the LCAs. We replace an LCA by a new Q-node which has a copy of its corresponding multicolumn $c'$ as its first child and further children, depending on whether the LCA itself and its parent are P- or Q-nodes. These intuitive transformation rules are detailed in Figure 4.8. Then, any parent node of a newly obtained Q-node is refined to a new Q-node, moving up the copy of $c'$, as shown in Figure 4.9. This process is iterated until we reach the root node. Since a node that is a child of the root can be contained in two paths, separate (but similar) rules are required, illustrated in Figure 4.10. Further specific rules which apply if an LCA is a child of the root of T or if the root node is a Q-node are straightforward. In some cases, after generating the tree as described above, simplifications can be carried out, such as replacing a P-node with a single child by a direct edge or substituting a Q-node with two children by a P-node.

Analogously to Algorithm 2, which only checks if a matrix has the mC1P, the above construction of an augmented PQ-tree $T'$ can be carried out in O(mn) time.

Figure 4.8: Transformation rules for the LCAs to construct an augmented PQ-tree. An LCA and its parent node are replaced by the nodes shown on the right. The LCA (or the segment of an LCA, respectively) is highlighted in gray.

Figure 4.9: Transformation rules for bottom-up iteration to construct an augmented PQ-tree. A newly created Q-node and its parent node are replaced by the nodes shown on the right.

Figure 4.10: Special transformation rules for bottom-up iteration to construct an augmented PQ-tree. A newly created Q-node two levels below the root node and its parent node are replaced by the nodes shown on the right.

Chapter 5

The Generalized Cladistic Character Compatibility Problem

The authors of Benham et al. [11, 12] give a polynomial-time algorithm for the case of the GCCC Problem where for each character, the set of states of each species forms a directed path in its character tree. It thus follows that if the character trees are non-branching, then the Incomplete Cladistic Character Compatibility Problem can be solved in polynomial time. The complexity of this case when each character has at most two states was further improved in Pe'er et al. [123]. In Benham et al. [11, 12], it was shown that the GCCC Problem is NP-complete using a construction involving character trees that are branching.
However, the authors argued that in this setting the situation when a trait becomes hidden and then reappears does not happen; hence, in Benham et al. [12] they posed an open case of the GCCC Problem where each character tree has one branch 0 → 1 → 2 and the collection of sets of states for each species is {{0}, {1}, {2}, {0, 2}}. We call this the Benham-Kannan-Warnow (BKW) Case. They then showed in Benham et al. [11] that if a “wildcard” set {0, 1, 2} is added to the collection, the problem is NP-complete.

Here, we study the complexity of cases of the GCCC Problem for non-branching character trees with 3 states and set of states chosen from the set {{0}, {1}, {2}, {0, 2}, {0, 1, 2}} when the phylogeny tree that is a solution to this problem is restricted to be (a) a single-branch tree, (b) a path or (c) a tree, cf. Table 5.1. In Gramm et al. [57], the authors state that searching for path phylogenies is strongly motivated by the characteristics of human genotype data: 70% of real instances that admit a tree phylogeny also admit a path phylogeny.

This chapter is structured as follows, with the results summarized in Table 5.1. In Section 5.1 we formally define the Generalized Cladistic Character Compatibility Problem. In Section 5.2 we study several types of ordering problems, some being polynomial-time solvable, while others are NP-complete; some of these are then used to determine the complexity of several cases in Table 5.1. Section 5.3 contains the tractability results of this chapter. Subsection 5.3.1 gives a polynomial-time algorithm, based on that of Benham et al. [11, 12], for the case of the GCCC Problem for (a) where for each character, the set of states of each species forms a directed path in its character tree, giving entries (3a) and (7a) of Table 5.1. In Subsection 5.3.2, we first show that (5a–b) of Table 5.1 are equivalent to deciding the C1P.
We then show that the BKW Case is polynomial-time solvable by giving an algorithm based on PQ-trees [21, 106] associated with the C1P, giving also the entries (6a), (8a) and (9a) of Table 5.1. In Subsection 5.3.3 we show that case (8b) is polynomial-time solvable by showing that any instance of this case can be reduced to solving an instance of the polynomial case (8a). Section 5.4 contains the intractability results of this chapter. Here we show that cases (10a–b) are NP-complete by reduction from the Path Triple Consistency (PTC) Problem of Section 5.2, and then show how NP-completeness of a case of the GCCC Problem for (a) carries over to certain cases of the same problem for (b) and (c). Finally, we show that cases (3b), (6b), (7b) and (9b) are NP-complete by reduction from the Left Element Fixed Path Triple Consistency (LEF-PTC) Problem of Section 5.2. Note that this last result includes the fact that case (9b), the BKW Case of the GCCC Problem for (b), is NP-complete.

5.1 The Generalized Cladistic Character Compatibility (GCCC) Problem

Let S be a set of species. A generalized (cladistic) character [11, 12] on S is a pair α̂ = (α, T_α), such that:

(a) α is a function α : S → 2^{Q_α}, where Q_α denotes the set of states of α̂.

(b) T_α = (V(T_α), E) is a rooted character tree with nodes bijectively labelled by the elements of Q_α.

       Q \ soln                                                        (a) branch      (b) path        (c) tree
(1)    Q ⊆ {{0}, {1}, {2}}                                             P [2]           P [2]           P [2]
(2)    {{0, 1, 2}} ⊆ Q ⊆ {{0}, {1}, {2}, {0, 1, 2}}; |Q| ≤ 2           trivial         trivial         trivial
(3)    {{0, 1, 2}} ⊆ Q ⊆ {{0}, {1}, {2}, {0, 1, 2}}; |Q| ≥ 3           P (Th. 62)      NP-c (Th. 72)   P [11, 12]
(4)    Q ⊆ {{0}, {0, 2}, {0, 1, 2}} or Q ⊆ {{2}, {0, 2}, {0, 1, 2}}    trivial         trivial         trivial
(5)    {{1}, {0, 2}}                                                   P (Lem. 63)     P (Lem. 63)     ?
(6)    {{0}, {1}, {0, 2}}                                              P (Th. 66)      NP-c (Th. 72)   ?
(7)    {{0}, {2}, {0, 2}} (∪ {{0, 1, 2}})                              P (Th. 62)      NP-c (Th. 72)   P [11, 12]
(8)    {{1}, {2}, {0, 2}}                                              P (Th. 66)      P (Cor. 69)     ?
(9)    {{0}, {1}, {2}, {0, 2}} *                                       P (Th. 66)      NP-c (Th. 72)   ?
(10)   {{1}, {0, 2}, {0, 1, 2}} ⊆ Q                                    NP-c (Th. 70)   NP-c (Th. 70)   NP-c [11]

Table 5.1: Complexity of all cases of the GCCC Problem for the character tree 0 → 1 → 2 and set of states chosen from the set Q ⊆ {{0}, {1}, {2}, {0, 2}, {0, 1, 2}}. The BKW Case is marked with *.

The GCCC Problem is to find a perfect phylogeny [20] of a set of species with generalized characters:

Problem 53 (Generalized Cladistic Character Compatibility (GCCC) Problem). Given a set S of species and a set C of generalized characters on S, is there a rooted tree T = (V_T, E_T) and a “state-choosing” function c : V_T × C → ⋃_{α̂ ∈ C} Q_α such that the following holds:

(1) For each species s ∈ S there is a vertex v_s in T such that for each α̂ ∈ C, c(v_s, α̂) ∈ α(s).

(2) For every α̂ ∈ C and i ∈ Q_α, the set {v ∈ V_T | c(v, α̂) = i} is a connected component of T.

(3) For every α̂ ∈ C, the tree T(α) is an induced subtree of T_α, where T(α) is the tree obtained from T by labelling the nodes of T only with their α-states (as chosen by c), and then contracting edges having the same α-state at their endpoints.

Essentially, the first condition is that each species is represented somewhere in the tree T, and the second condition is that the set of nodes labelled by a given state of a given character form a connected subtree of T, just as with the Character Compatibility Problem. Finally, condition three is that the state transitions for each character α̂ must respect its character tree T_α.

The GCCC Problem is NP-complete [11, 12]; however, it is polynomial-time solvable for many special cases of the problem [11, 12, 98]. In particular, in Benham et al. [11] it was shown to be NP-complete for a case where for each species s and character α̂, α(s) ∈ {{1}, {0, 2}, {0, 1, 2}}, and T_α is 0 → 1 → 2. It was also shown to be polynomial-time solvable in the case where for each species s ∈ S, α(s) is a directed path in T_α for each α̂ = (α, T_α) ∈ C [12].
We will consider the following variants of the GCCC Problem. The GCCC with non-branching character trees (GCCC-NB) Problem is a special case of the GCCC Problem in which character trees have a single branch, i.e., each character tree T_α is 0 → 1 → · · · → |T_α| − 1. If we restrict the solution of the GCCC-NB Problem (a phylogeny tree) to have only one branch, or two branches, starting at the root, we call the resulting problem the Single-Branch GCCC-NB (SB-GCCC-NB) Problem, and the Path GCCC-NB (P-GCCC-NB) Problem, respectively. In addition, if in any of these problems, say in problem X, we restrict the set of states to be from the set Q, we will call this problem the Q-X Problem. Table 5.1 summarizes the cases studied here.

5.2 Ordering Problems

In this section, we discuss several different types of ordering problems. These problems are related to the SB-GCCC-NB and P-GCCC-NB Problems. We will use one of these variants to obtain a hardness result in Section 5.4.

The Path Triple Consistency (PTC) Problem is a simplified version of the extensively studied Quartet Consistency (QC) Problem [138]. In the QC Problem, given a set S and a collection of quartets (a_i, b_i : c_i, d_i), where a_i, b_i, c_i, d_i ∈ S, the task is to construct a tree T containing vertices S such that for each quartet there is an edge of T whose removal separates vertices {a_i, b_i} from vertices {c_i, d_i}. This problem was shown to be NP-complete in Steel [138]. Here, we show that the problem remains NP-complete when we restrict the tree to be a path. In this case it is easy to see that (i) we can assume the path contains only vertices in S and (ii) each quartet (a_i, b_i : c_i, d_i) can be replaced with the three triples (a_i, b_i : c_i), (a_i, b_i : d_i) and (c_i, d_i : a_i). The PTC Problem can be viewed as the Total Ordering (TO) Problem with negative constraints c_i ∉ [a_i, b_i], where [a_i, b_i] is the set of all elements between a_i and b_i in the total order.
The TO Problem with positive constraints c_i ∈ [a_i, b_i] was shown to be NP-complete in Opatrny [114]. The formal definition of the PTC Problem is as follows.

Problem 54 (Path Triple Consistency (PTC) Problem). Given a set S = {1, . . . , n}, and a set of triples {(a_i, b_i : c_i) | i = 1, . . . , k}, where a_i, b_i, c_i ∈ S for every i = 1, . . . , k, is there a path (order) P on vertices S such that for each i = 1, . . . , k, there is an edge e_i of P whose removal separates vertices {a_i, b_i} from vertex c_i?

Lemma 55. The PTC Problem is NP-complete.

Proof. The PTC Problem is actually complementary to the TO Problem, which was shown to be NP-hard by Opatrny in 1979 [114]. The TO Problem is, given a set Q = {1, . . . , n} and a set of triples {(a_i, b_i, c_i) | i = 1, . . . , k}, where for i = 1, . . . , k, a_i, b_i, c_i ∈ Q, is there a path (order) on Q such that for each i = 1, . . . , k, either a_i < b_i < c_i or c_i < b_i < a_i. It is easy to see that the NP-completeness of the TO Problem implies the NP-completeness of the PTC Problem. Given an instance Q = {1, . . . , n} and {(a_i, b_i, c_i) | i = 1, . . . , k} of the TO Problem, for the corresponding instance of the PTC Problem we let S = Q, and for each triple (a, b, c) of the instance of the TO Problem, we introduce the triples (a, b : c) and (c, b : a). Indeed, b lies between a and c if and only if c does not lie between a and b and a does not lie between c and b.

Now, we study two subclasses of the PTC Problem and one subclass of the TO Problem in which one element of each constraint is fixed.

Problem 56 (Left Element Fixed Path Triple Consistency (LEF-PTC) Problem). Given a set S = {1, . . . , n}, an element r ∈ S, and a set of triples {(a_i, r : c_i)}_{i=1}^{k} where a_i, c_i ∈ S for every i ∈ {1, . . . , k}, is there a path (an order) P on vertices S ∪ {r} such that for each i ∈ {1, . . . , k}, there is an edge of P whose removal separates {r, a_i} from c_i?

Problem 57 (Right Element Fixed Path Triple Consistency (REF-PTC) Problem). Given a set S = {1, . . . , n}, an element r ∈ S, and a set of triples {(a_i, b_i : r)}_{i=1}^{k} where a_i, b_i ∈ S for every i ∈ {1, . . . , k}, is there a path (an order) P on vertices S ∪ {r} such that for each i ∈ {1, . . . , k}, there is an edge of P whose removal separates {a_i, b_i} from r?

Problem 58 (One Element Fixed Total Ordering (OEF-TO) Problem). Given a set S = {1, . . . , n}, an element r ∈ S, and a set of triples {(a_i, b_i, c_i)}_{i=1}^{k} where for every i ∈ {1, . . . , k}, either a_i, c_i ∈ S and b_i = r, or a_i, b_i ∈ S and c_i = r, is there a path (a total ordering) P on vertices S ∪ {r} such that for each i ∈ {1, . . . , k}, b_i appears between a_i and c_i on P?

In what follows, we will show that the first problem, LEF-PTC, is NP-complete, while the other two problems, REF-PTC and OEF-TO, are solvable in polynomial time. Thus, the LEF-PTC Problem seems to be the simplest version of the problem which is still intractable.

Lemma 59. The LEF-PTC Problem is NP-complete.

Proof. Here, we give a reduction from the Not-All-Equal-3SAT (NAE-3SAT) Problem [53]. The NAE-3SAT Problem is: given a set of Boolean variables X = {x_1, . . . , x_n} and a set of clauses C = {C_1, . . . , C_m}, where each clause contains three literals, is there a truth assignment to the set of variables such that in no clause, its three literals are all true or all false. Given an instance of NAE-3SAT, let S be the union of variable symbols {x_1, x̄_1, . . . , x_n, x̄_n} and literal symbols {ℓ^1_1, ℓ^2_1, ℓ^3_1, . . . , ℓ^1_m, ℓ^2_m, ℓ^3_m}.

The basic principle of the reduction is the following observation. The triple (a_i, r : c_i) is equivalent to the following condition on the elements in S ∪ {r}:

    r < c_i ⇔ a_i < c_i.    (5.1)

The Boolean value of predicate r < x_i will represent the value of variable x_i, for i ∈ {1, . . . , n}. First, we introduce the triples (x_i, r : x̄_i) and (x̄_i, r : x_i), for i ∈ {1, . . . , n}. These triples are equivalent to the following logical statement:

    r < x̄_i ⇔ x_i < x̄_i ⇔ x_i < r.
Hence, they enforce x̄_i < r iff r < x_i, and hence the Boolean value of predicate r < x̄_i represents the value of ¬x_i.

Now, let clause C_j contain variables x_{k_1}, x_{k_2} and x_{k_3}. We will use symbols ℓ^1_j, ℓ^2_j, ℓ^3_j to represent the values of the three literals of C_j: the Boolean value of the i-th literal of C_j will be equal to the value of predicate r < ℓ^i_j. To achieve this, we will reuse the above constraints. For each variable x_{k_i} with positive occurrence in C_j, we introduce the triples (ℓ^i_j, r : x̄_{k_i}) and (x̄_{k_i}, r : ℓ^i_j), and for each variable x_{k_i} with a negated occurrence in C_j, triples (ℓ^i_j, r : x_{k_i}) and (x_{k_i}, r : ℓ^i_j). These triples will guarantee that predicate r < ℓ^i_j represents the Boolean value of the i-th literal of C_j.

The reason why we have a symbol for each literal is that the position of the literal symbol ℓ^i_j and the position of the variable symbol x_{k_i} (or x̄_{k_i}) are only very weakly dependent: one is smaller than r if and only if the other is, but otherwise they are independent. This is important, since the clause gadgets introduced in the next paragraph might put some ordering restrictions on their literal symbols, and hence if we used the variable symbols x_{k_i} (or x̄_{k_i}) in several clause gadgets, the ordering restrictions from different clause gadgets might not be compatible.

The clause gadget for clause C_j will contain the three triples (ℓ^1_j, r : ℓ^2_j), (ℓ^2_j, r : ℓ^3_j) and (ℓ^3_j, r : ℓ^1_j). The purpose of these constraints is to guarantee that in any order at least one and not all literals in the clause C_j are true. For instance, assume that all literals are true, i.e., r < ℓ^i_j for i ∈ {1, 2, 3}. By (5.1), this is equivalent to ℓ^1_j < ℓ^2_j, ℓ^2_j < ℓ^3_j and ℓ^3_j < ℓ^1_j, which leads to a contradiction. Similarly, if all literals are false in the order, all three inequalities will reverse their direction, and we get a contradiction again.
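The behaviour of the clause gadget can be checked by exhaustive enumeration. The following is a minimal sketch in Python, not part of the original construction: the function names and the string labels "r", "l1", "l2", "l3" (standing for r and the three literal symbols of one clause) are our own illustrative choices. It reads a triple (a, r : c) as "c does not lie strictly between a and r", and lists every order of the four symbols that satisfies the three gadget triples.

```python
from itertools import permutations

def triple_ok(order, a, r, c):
    """True if some edge of the path separates {a, r} from c,
    i.e. c does not lie strictly between a and r in the order."""
    pos = {x: i for i, x in enumerate(order)}
    lo, hi = sorted((pos[a], pos[r]))
    return not (lo < pos[c] < hi)

def clause_gadget_orders():
    """All orders of (r, l1, l2, l3) satisfying the three clause triples."""
    gadget = [("l1", "r", "l2"), ("l2", "r", "l3"), ("l3", "r", "l1")]
    return [order for order in permutations(("r", "l1", "l2", "l3"))
            if all(triple_ok(order, a, r, c) for a, r, c in gadget)]

def same_side(order):
    """True if all three literal symbols lie on the same side of r."""
    return order.index("r") in (0, 3)

if __name__ == "__main__":
    for order in clause_gadget_orders():
        print(order, "all on one side:", same_side(order))
```

Under this reading, no surviving order places all three literal symbols on the same side of r, which is the not-all-equal behaviour the gadget is meant to enforce.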
Hence, each clause is satisfied and predicates r < v_i define a solution to the instance of NAE-3SAT.

Now, assume that the instance of NAE-3SAT has a solution ψ : X → {false, true}. Consider the order of elements of S ∪ {r} satisfying the following conditions:

(a) for each v_i ∈ {v_1, . . . , v_n}, v_i appears to the right of r, i.e., r < v_i in the order, if and only if ψ(x_i) = true for the x_i corresponding to v_i;

(b) for each clause C_j, the relative order of the literal symbols ℓ^1_j, ℓ^2_j, ℓ^3_j and r is one of the following: (ℓ^1_j, r, ℓ^2_j, ℓ^3_j), (ℓ^3_j, ℓ^2_j, r, ℓ^1_j), (ℓ^2_j, r, ℓ^3_j, ℓ^1_j), (ℓ^1_j, ℓ^3_j, r, ℓ^2_j), (ℓ^3_j, r, ℓ^1_j, ℓ^2_j) and (ℓ^2_j, ℓ^1_j, r, ℓ^3_j).

Note that for any valid combination of truth assignments to the literals of C_j, there is one order in the list above. This order imposes a restriction on the relative order of the two literal symbols appearing on the same side of r; this is the reason why we created the literal symbols. It is easy to see that for each s ∈ S, other than on which side of r the element s appears, there is at most one constraint specifying its relative order to another element. Hence, it is always possible to find an order satisfying the above conditions.

Let us verify that this order satisfies all triple constraints. The constraints (x_i, r : x̄_i) and (x̄_i, r : x_i) (respectively, (ℓ^i_j, r : x_{k_i}) and (x_{k_i}, r : ℓ^i_j); (ℓ^i_j, r : x̄_{k_i}) and (x̄_{k_i}, r : ℓ^i_j)) are satisfied just by the placement of symbols to the correct sides of r. For instance, if r < x_i then the relative order of x_i, x̄_i, r is x̄_i, r, x_i and this order satisfies both triples. For the constraints for clause C_j, only the relative order of elements ℓ^1_j, ℓ^2_j, ℓ^3_j and r is important. It is easy to check that any of the six orders of these elements listed above satisfies all three triples for C_j. Hence, the constructed order is a solution to the corresponding instance of the LEF-PTC Problem.

Lemma 60. Any instance of the REF-PTC Problem always has a solution, and thus the problem is solvable in constant time.

Proof. Consider any order of S ∪ {r} with r as the first (resp., last) element. Then the first (resp., last) edge separates r from any pair of elements in S. Thus, such an order is a solution to any instance of the REF-PTC Problem.

Lemma 61. The OEF-TO Problem can be solved in linear time.

Proof. The algorithm will work in two stages. In the first stage the elements will be clustered into parts, each appearing on a different side of r. In the second stage, we will determine the ordering of the elements in each part.

Constraint (a_i, r, c_i) is satisfied if and only if a_i and c_i appear on opposite sides of r. Constraint (a_i, b_i, r) is satisfied if and only if (i) a_i and b_i appear on the same side of r, and (ii) b_i is closer to r than a_i, which we write as b_i ≺ a_i. Consider the graph with vertex set S and edges between any two vertices u, v ∈ S such that u and v appear together in some triple (a_i, b_i, c_i). Let C be a connected component of this graph. It is easy to see that once we fix the side of one element in the component, the side of all elements in the component will be determined. Hence, we can uniquely partition C into two (paired) clusters such that all edges from constraints of type (a_i, r, c_i) are between two clusters and all edges from constraints of type (a_i, b_i, r) are inside one of the two clusters. Now, pick one cluster from each pair and place all its elements on one side of r and all other clusters on the other side. Note that there can be many ways to do this: the number of ways is exponential in the number of pairs of clusters.

It remains to satisfy the precedence conditions. These conditions (b_i ≺ a_i) define a partial order on each side of r. Any total order compatible with these partial orders will form a solution to the problem. Such an order can be found in time O(n + k).
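The two stages of the proof of Lemma 61 can be sketched as follows. This is an illustrative Python sketch under our own assumptions rather than the thesis's implementation: the names `solve_oef_to` and `satisfies` are hypothetical, r is assumed not to be listed among `elements`, and the function returns `None` when the side assignment or the precedence conditions turn out to be inconsistent (a case the proof leaves implicit).

```python
from collections import defaultdict, deque

def solve_oef_to(elements, r, triples):
    """Return an order of elements plus r satisfying all triples, or None.
    Each triple (a, b, c) asks that b lie between a and c, with b == r or c == r."""
    adj = defaultdict(list)      # side graph: (neighbour, 1 if opposite sides else 0)
    farther = defaultdict(list)  # farther[b] lists every a with b closer to r than a
    indeg = {x: 0 for x in elements}
    for a, b, c in triples:
        if b == r:               # a and c must lie on opposite sides of r
            adj[a].append((c, 1))
            adj[c].append((a, 1))
        elif c == r:             # a and b on the same side, b closer to r
            adj[a].append((b, 0))
            adj[b].append((a, 0))
            farther[b].append(a)
            indeg[a] += 1
        else:
            raise ValueError("r must be the middle or the last element")
    # Stage 1: fix a side (0 = left, 1 = right) per connected component.
    side = {}
    for start in elements:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, flip in adj[u]:
                want = side[u] ^ flip
                if v not in side:
                    side[v] = want
                    queue.append(v)
                elif side[v] != want:
                    return None  # contradictory side requirements
    # Stage 2: topological sort of the "closer to r" precedence conditions.
    queue = deque(x for x in elements if indeg[x] == 0)
    by_distance = []             # elements listed from closest to farthest
    while queue:
        u = queue.popleft()
        by_distance.append(u)
        for v in farther[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(by_distance) != len(elements):
        return None              # cyclic precedence conditions
    left = [x for x in reversed(by_distance) if side[x] == 0]
    right = [x for x in by_distance if side[x] == 1]
    return left + [r] + right

def satisfies(order, triples):
    """Check that b lies strictly between a and c for every triple."""
    pos = {x: i for i, x in enumerate(order)}
    return all(min(pos[a], pos[c]) < pos[b] < max(pos[a], pos[c])
               for a, b, c in triples)
```

Both stages visit each element and each triple a constant number of times, in line with the O(n + k) bound stated above.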
5.3 Tractability Results

5.3.1 An Algorithm for Cases of the Single-Branch GCCC Problem

Here we show that when each α(s) induces a directed path in T_α, for each α̂ ∈ C, s ∈ S, the Single-Branch GCCC (SB-GCCC) Problem is polynomial-time solvable. The algorithm we use, while much simpler, is based on the algorithm given in Benham et al. [11].

Theorem 62. The SB-GCCC Problem is solvable in time O(|S| ∑_{α̂ ∈ C} |Q_α|), if each α(s) induces a directed path in T_α, for each α̂ ∈ C, s ∈ S.

Proof. Consider an instance (S, C) of the SB-GCCC Problem with the required property. Let start_α(s) and end_α(s) be the first and the last node on the directed path induced by α(s). We define the partial order ⪯_α on the nodes of T_α by saying v ⪯_α w if the directed path from the root r_α of T_α to w passes through v. Similarly, for each solution (T, c) we define the partial order ⪯_T on S based on T. Since T has a single branch, ⪯_T is a total order, i.e., for every s1, s2 ∈ S, s1 and s2 are comparable by ⪯_T. Hence, for every α̂ ∈ C, c(s1, α̂) and c(s2, α̂) are comparable by ⪯_α. Therefore, for all s ∈ S, c(s, α̂) lie on a single branch (directed path starting in the root) P_α of T_α. Since start_α(s1) ⪯_α c(s1, α̂) and start_α(s2) ⪯_α c(s2, α̂), we can assume that for all s ∈ S, start_α(s) lie on a subpath P′_α of P_α starting in the root r_α of T_α and ending in start_α(ℓ_α), where ℓ_α ∈ S and start_α(s) ⪯_α start_α(ℓ_α) for every s ∈ S. If that is not the case, there is no solution. This can be checked in time O(|S||Q_α|) for each α̂ ∈ C.

Next, we will argue that it is enough to consider only solutions in which c maps all elements in S to P′_α. Consider a solution (T, c). Any c(s, α̂) ∉ P′_α must lie on the subpath of P_α starting at vertex start_α(ℓ_α). Since start_α(s) ⪯_α start_α(ℓ_α), we can remap c(s, α̂) to start_α(ℓ_α). It is easy to check that conditions (1)–(3) of the GCCC Problem remain satisfied after mapping all such c(s, α̂) to start_α(ℓ_α).
Hence, we can assume that c(s, α̂) ∈ α′(s) = α(s) ∩ P′_α, for each α̂ ∈ C and s ∈ S. Note that for all s ∈ S, α′(s) induce directed subpaths of P′_α.

Now, we are ready to present the algorithm for solving the SB-GCCC Problem with the required property. First, we will build a set 𝒞 of constraints on the ordering of the nodes of T which have to be satisfied in any solution (T, c). If for s1, s2 ∈ S and α̂ ∈ C, the paths induced by α′(s1) and α′(s2) are disjoint, and the path induced by α′(s1) is closer to the root r_α, then we must have s1 ≺_T s2. Therefore, we add this constraint to the set 𝒞. Let T be a single-branch tree that satisfies all the constraints in 𝒞 and let s_1 ≺_T s_2 ≺_T · · · ≺_T s_|S| be the elements of S ordered according to this tree. (If such a tree does not exist, there is no solution.) For each character α̂ ∈ C, we will map c(s_i, α̂) to α′(s_i) using Algorithm 4, where max(a, b) is the element (a or b) farther from the root if a and b are comparable, and undefined otherwise.

Algorithm 4 Iterative algorithm that assigns to each species a state.
    c(s_1, α̂) ← start_α(s_1)
    for i = 2 up to |S| do
        c(s_i, α̂) ← max(start_α(s_i), c(s_{i−1}, α̂))
    end for

Let us verify that (T, c) is indeed a solution. First, note that since all start_α(s_i) lie on the path P′_α, the arguments of the max function are always comparable. Furthermore, it is easy to see that all c(s_i, α̂) are assigned to the set {start_α(s); s ∈ S}, and that c(s_1, α̂) ⪯_α c(s_2, α̂) ⪯_α . . . ⪯_α c(s_|S|, α̂). It remains to show that for each i, c(s_i, α̂) ∈ α′(s_i). Let i be the smallest index for which c(s_i, α̂) ∉ α′(s_i). We must have that end_α(s_i) ≺_α c(s_i, α̂). Since c(s_i, α̂) = start_α(s_j) for some j < i, the subpath of P′_α induced by α′(s_i) is closer to the root than the subpath induced by α′(s_j). Hence, 𝒞 must contain the constraint s_i ≺_T s_j, which contradicts the fact that T satisfies all these constraints.
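For a single character, Algorithm 4 amounts to a running maximum. A minimal Python sketch follows, under assumptions of our own: since all start nodes lie on the path P′_α, each node is identified with its depth (distance from the root along that path), and the hypothetical function `assign_states` receives the depths of start_α(s_i) and end_α(s_i) for the species listed in branch order. The explicit check against the end depth is an addition that reports orders violating the constraints instead of assuming they hold.

```python
def assign_states(starts, ends):
    """Running-maximum state assignment for one character.

    starts[i], ends[i]: depths of the first and last node of the path
    induced by the i-th species, with species given in branch order.
    Returns the chosen depths, or None if some chosen state falls
    beyond the end of a species' path."""
    chosen = []
    current = None
    for start, end in zip(starts, ends):
        # Line 1 of Algorithm 4 for the first species, line 3 afterwards.
        current = start if current is None else max(start, current)
        if current > end:
            return None  # the given order violates an ordering constraint
        chosen.append(current)
    return chosen
```

For instance, with start depths 0, 2, 1 and end depths 0, 2, 1, the second and third species have disjoint paths in the wrong order, and the sketch returns `None`.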
It follows that (T, c) is a solution.

Finally, let us analyze the running time of the algorithm. We can verify whether this set 𝒞 of constraints defines a partial order and find a total order T compatible with this partial order in time O(|S| + m), where m is the number of constraints. For each α̂ ∈ C, we can have at most |Q_α| disjoint induced paths, and it is enough to consider the constraint between the neighbouring disjoint induced paths only. Hence, m = O(∑_{α̂ ∈ C} |Q_α|).

We remark that this type of theorem does not hold for the case of path phylogeny, cf. Table 5.1.

5.3.2 The BKW Case of the SB-GCCC-NB Problem is Polynomial-Time Solvable

First, we show that the {{1}, {0, 2}}-SB-GCCC-NB and {{1}, {0, 2}}-P-GCCC-NB Problems are polynomial-time solvable, by showing that they are equivalent to deciding the C1P. We then build on the algorithm for constructing a PQ-tree for a binary C1P matrix [21] to show that the {{0}, {1}, {2}, {0, 2}}-SB-GCCC-NB Problem (the BKW Case of the SB-GCCC-NB Problem) is also polynomial-time solvable.

Lemma 63. The {{1}, {0, 2}}-SB-GCCC-NB and {{1}, {0, 2}}-P-GCCC-NB Problems are polynomial-time solvable.

Proof. The solutions to the {{1}, {0, 2}}-SB-GCCC-NB and {{1}, {0, 2}}-P-GCCC-NB Problems must fall on a single-branch tree and path, respectively. Because T_α is 0 → 1 → 2 for any character α̂, all species where α̂ has state 1 must appear consecutively in this single-branch tree (resp., path), otherwise there would be more than one transition from 0 to 1 in the phylogeny, for some character α̂. In this case of the SB-GCCC-NB Problem, all other species can appear before (resp., after) this consecutive set of ones, because the “state-choosing” function c can map these species to 0 (resp., 2). Hence, this problem is exactly the problem of determining whether or not a binary (0/1-) matrix has the C1P, where each species is a column in this matrix.
In this case of the P-GCCC-NB Problem, if there does exist a solution P, then there is always a “state-choosing” function c′ that reflects the fact that the corresponding matrix has the C1P. Therefore these cases are polynomial-time solvable.

We now consider the {{0}, {1}, {2}, {0, 2}}-SB-GCCC-NB Problem, the BKW Case of the SB-GCCC-NB Problem. Here, for any character α̂, a species s with α(s) = {0, 2} can still appear before or after the consecutive set of ones (on this single-branch tree); however, a species s with α(s) = {0} has to appear before this set, while a species s with α(s) = {2} has to appear after this set. So essentially, this is again the problem of determining whether a binary (0/1-) matrix has the C1P; however, the matrix, in addition to containing zeros and ones, contains some special zeros, which we call 0− (0+), that must appear before (resp., after) the set of consecutive ones of its row, in any C1 order. Hence, this case is equivalent to deciding the following generalized version of the C1P.

Property 64 (Extended Consecutive-Ones Property (E-C1P)). A matrix M on m rows and n columns with entries from set {0, 1, 0−, 0+} has the E-C1P if there is an order of the n columns such that, for any row, the set of columns that have entry 1 in that row are consecutive in the order, and any column that has entry 0− (resp., 0+) in that row appears before (resp., after) this consecutive set of ones.

Lemma 65. The E-C1P can be decided in polynomial time.

Proof. We prove this by showing that a structure that encodes all extended consecutive-ones (E-C1) orders of a matrix with entries from set E = {0, 1, 0−, 0+} can be constructed in polynomial time.

Given matrix M on m rows and n columns with entries from set E, we first construct PQ-tree PQ_M for matrix M, where we have “forgotten” the labels of the special zeros (we treat 0− and 0+ simply as 0). This can be done in time O(m + n) [21]. It is clear that PQ_M encodes a superset of the E-C1 orders of M.
We then associate to each P-node the empty partial order on its children, and to each Q-node the set of directions {left, right}. Next, we obtain a list of order constraints imposed by the special zeros of M, by processing each pair (0−, 1), (0+, 1) and (0−, 0+). For instance, if column i has 0− and j has 1 in some row r, then we add constraint i < j to the list. We now update the sets that are associated with each P- and Q-node, one-by-one from the list, to incorporate these ordering constraints. The idea is that these sets will restrict the configurations each node in PQ_M can have to the set of E-C1 orders of M.

When adding constraint i < j from the list to PQ_M, we find the least common ancestor a_{i,j} of i and j in PQ_M, which takes O(n) steps. For a_{i,j}, one of the two cases holds:

1. a_{i,j} is a Q-node. Then we eliminate from the set at this Q-node the direction that places j before i. If the set of directions is now empty, then the algorithm halts, outputting that M does not have an E-C1 order. This can be done in constant time.

2. a_{i,j} is a P-node. This P-node stores some partial order on its children {v_1, . . . , v_k} = V. First, we find the children v_x and v_y of a_{i,j} such that the subtrees rooted at them contain i and j, respectively. We add the constraint v_x < v_y to the existing partial order at this P-node. If this constraint is not consistent with the existing partial order then the algorithm halts, outputting that M does not have an E-C1 order. This partial order can be updated in time O(k^2). Hence, this step takes time O(k^2) ⊆ O(n^2).

Since there are O(mn^2) order constraints, and it takes time O(n^2) to process each constraint, the algorithm takes time O(mn^4).

        a    b    c    d    e    f    g    h
row 1   0    0−   1    1    0    0+   0+   0
row 2   0+   0    1    1    1    0    0    0
row 3   0    0    0    0    1    1    0    0

(a)

[PQ-tree drawing with leaves a, . . . , h and root children v_1, . . . , v_5]

(b)

Figure 5.1: (a) A matrix M with entries from set {0, 1, 0−, 0+}. (b) PQ-tree PQ_M for M where the labels of the special zeros (0− and 0+) have been “forgotten”.
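The step that collects order constraints from the special zeros of one row can be sketched as follows. This is a minimal Python sketch with an encoding of our own choosing: a row is a dictionary from column names to the strings "0", "1", "0-" and "0+", and the hypothetical function `order_constraints` returns pairs (i, j) meaning i < j. The example row is the first row of the matrix in Figure 5.1a.

```python
def order_constraints(row):
    """Order constraints imposed by the special zeros of one row.

    row maps each column to "0", "1", "0-" or "0+".
    Returns a set of pairs (i, j), each meaning column i precedes column j."""
    minus = [c for c, e in row.items() if e == "0-"]
    ones = [c for c, e in row.items() if e == "1"]
    plus = [c for c, e in row.items() if e == "0+"]
    pairs = set()
    pairs.update((i, j) for i in minus for j in ones)   # pairs (0-, 1)
    pairs.update((i, j) for i in ones for j in plus)    # pairs (0+, 1)
    pairs.update((i, j) for i in minus for j in plus)   # pairs (0-, 0+)
    return pairs

if __name__ == "__main__":
    first_row = {"a": "0", "b": "0-", "c": "1", "d": "1",
                 "e": "0", "f": "0+", "g": "0+", "h": "0"}
    print(sorted(order_constraints(first_row)))
```

A row with p special zeros and q ones yields O((p + q)^2) pairs, which is where the O(mn^2) bound on the number of order constraints comes from.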
Furthermore, since the tree has O(n) internal nodes, and each one stores O(mn^2) information, this structure is of size O(mn^3).

For example, let M be the matrix with entries from set E = {0, 1, 0−, 0+} shown in Figure 5.1a. The PQ-tree PQ_M for M, where we have “forgotten” the labels of the special zeros, is given in Figure 5.1b. The special zeros of M then give rise to the following order constraints. In the first row of M, for example, since column b has entry 0− and columns c and g have entries 1 and 0+ respectively, this introduces the order constraints b < c and b < g. The list of order constraints given by the first row of M is {b < c, b < d, b < f, b < g, c < f, c < g, d < f, d < g}, while the list given by the second row is {a > c, a > d, a > e}. The third row introduces no order constraints. After adding the above order constraints to PQ_M of Figure 5.1b, the P-node that is the root of this tree, with children {v_1, . . . , v_5}, stores the partial order {v_2 < v_3, v_2 < v_4, v_3 < v_4, v_1 > v_3} on its children. The only Q-node of PQ_M has associated with it the set {right}, and the other P-node stores the empty partial order on its children {c, d}.

Here, PQ_M, with the above sets (resp., partial orders) associated with its Q- (resp., P-) nodes, encodes all E-C1 orders of M. For example, in the P-node that is the root of PQ_M, order constraint v_2 < v_3 guarantees that b < d, and it is also necessary. The Q-node has associated with it set {right} to enforce c < f, or even d < f. Note that if, instead of constraint d < f, we had d > f, then this matrix would not have an E-C1 order. Finally, the P-node of PQ_M with children {c, d} stores the empty partial order because there are no constraints involving c and d.

Theorem 66. The {{0}, {1}, {2}, {0, 2}}-SB-GCCC-NB Problem is polynomial-time solvable.

Proof. This follows from equivalence to deciding the E-C1P and Lemma 65.
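The example of Figure 5.1 is small enough to be checked by brute force. The following Python sketch is an illustration only and deliberately does not use PQ-trees: it tests Property 64 directly on every permutation of the eight columns. The names `ROWS`, `is_ec1_order` and `all_ec1_orders` are our own, and a row without ones is assumed to impose no condition.

```python
from itertools import permutations

COLUMNS = "abcdefgh"
# Rows of the matrix in Figure 5.1a, one dictionary per row.
ROWS = [
    dict(zip(COLUMNS, ["0", "0-", "1", "1", "0", "0+", "0+", "0"])),
    dict(zip(COLUMNS, ["0+", "0", "1", "1", "1", "0", "0", "0"])),
    dict(zip(COLUMNS, ["0", "0", "0", "0", "1", "1", "0", "0"])),
]

def is_ec1_order(order, rows):
    """Check Property 64 for one column order."""
    pos = {c: i for i, c in enumerate(order)}
    for row in rows:
        ones = sorted(pos[c] for c, e in row.items() if e == "1")
        if not ones:
            continue  # assumption: a row without ones imposes nothing
        if ones[-1] - ones[0] + 1 != len(ones):
            return False  # ones are not consecutive
        if any(pos[c] > ones[0] for c, e in row.items() if e == "0-"):
            return False  # a 0- column is not before the block of ones
        if any(pos[c] < ones[-1] for c, e in row.items() if e == "0+"):
            return False  # a 0+ column is not after the block of ones
    return True

def all_ec1_orders(columns, rows):
    """Every E-C1 order of the columns, as strings."""
    return ["".join(p) for p in permutations(columns) if is_ec1_order(p, rows)]

if __name__ == "__main__":
    orders = all_ec1_orders(COLUMNS, ROWS)
    print(len(orders), orders[:3])
```

Every order found this way places c before f, in agreement with the single direction {right} kept at the Q-node in the example.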
Since the {{0}, {1}, {2}, {0, 2}}-SB-GCCC-NB Problem is the BKW Case of the SB-GCCC-NB Problem, we have the following corollary.

Corollary 67. The BKW Case of the SB-GCCC-NB Problem is polynomial-time solvable.

Note that the constructed structure of Theorem 66 encodes all solutions to the problem, even if there are exponentially many of them.

5.3.3 The {{1}, {2}, {0, 2}}-P-GCCC-NB Problem

We will show that if there is a solution to an instance (S, C) of the Q∗-P-GCCC-NB Problem then there is a solution to the instance (S, C) of the Q∗-SB-GCCC-NB Problem, and vice versa, where Q∗ = {{1}, {2}, {0, 2}}. Since the single-branch version of this problem can be solved in polynomial time by Theorem 66, it follows that the path version is also polynomial-time solvable.

Lemma 68. An instance (S, C) of the {{1}, {2}, {0, 2}}-P-GCCC-NB Problem has a solution if and only if the instance (S, C) of the {{1}, {2}, {0, 2}}-SB-GCCC-NB Problem has a solution.

Proof. Let Q∗ = {{1}, {2}, {0, 2}}. Obviously, a solution to the instance (S, C) of the Q∗-SB-GCCC-NB Problem is also a solution to the instance (S, C) of the Q∗-P-GCCC-NB Problem.

Now, assume that (T, c) is a solution to the instance (S, C) of the Q∗-P-GCCC-NB Problem. Let P1 and P2 be the two branches of T starting at the root r. Let T′ be the tree obtained by attaching P2 to the last vertex of P1. To define the state-choosing function c′ we only need to determine the values of c′(s, α̂) when α(s) = {0, 2}. Consider s ∈ S and α̂ ∈ C such that α(s) = {0, 2}. If there is a species s′ ≺_T s such that α(s′) = {1} then we set c′(s, α̂) = 2, otherwise we set c′(s, α̂) = 0. We will show that (T′, c′) is a solution to the instance (S, C) of the Q∗-SB-GCCC-NB Problem.

For each α̂ ∈ C, the set of species S_{α̂,{1}} = {s ∈ S | α(s) = {1}} must induce a connected component in T. Since α(r) = 0, this component lies entirely in P1 or in P2. Hence, the set S_{α̂,{1}} induces a connected component K in T′ as well.
By the definition of c′, all species that lie below K in T′ are assigned value 2 and all species s such that α(s) = {0, 2} that lie above K in T′ are assigned value 0. Hence, the only possible violation is if there is a species s such that α(s) = {2} that lies above K in T′. This species s either lies above K in T or lies in the branch that does not contain K in T. In either case, (T, c) cannot be a solution to the instance (S,C) of the Q∗-P-GCCC-NB Problem, a contradiction.
Corollary 69. The {{1}, {2}, {0, 2}}-P-GCCC-NB Problem is polynomial-time solvable.
5.4 Hardness Results
We first show that the {{1}, {0, 2}, {0, 1, 2}}-SB-GCCC-NB and {{1}, {0, 2}, {0, 1, 2}}-P-GCCC-NB Problems are NP-complete by reduction from the PTC Problem (Lemma 55).
Theorem 70. The {{1}, {0, 2}, {0, 1, 2}}-SB-GCCC-NB and {{1}, {0, 2}, {0, 1, 2}}-P-GCCC-NB Problems are NP-complete.
Proof. Let Q△ = {{1}, {0, 2}, {0, 1, 2}}. Let S and the triples (ai, bi : ci), for i = 1, . . . , k, be an instance of the PTC Problem. We will construct an instance of the Q△-SB-GCCC-NB (resp., Q△-P-GCCC-NB) Problem as follows. Let S be the set of species and C = {α̂1, . . . , α̂k} the set of characters. For every α̂i ∈ C, we let αi(ai) = αi(bi) = {1}, αi(ci) = {0, 2} and for all s ∈ S \ {ai, bi, ci}, αi(s) = {0, 1, 2}. We will show that the instance of the PTC Problem has a solution if and only if the constructed instance of the Q△-SB-GCCC-NB (resp., Q△-P-GCCC-NB) Problem has a solution. First, consider a single-branch tree (resp., path) P containing vertices S which is a solution to the constructed instance. Consider the order of elements in S as they occur on P starting from the root (resp., leaf on one branch) of P and ending with the leaf (resp., leaf on the other branch). For every i ∈ {1, . . . , k}, all elements in [ai, bi] must have state 1 for character α̂i, hence, ci ∉ [ai, bi], i.e., this order is a solution to the PTC Problem.
On the other hand, let order O be a solution to the PTC Problem. Consider a tree T with a single branch consisting of the all-zero root followed by the vertices in S ordered by O. Note that, for every i ∈ {1, . . . , k}, ci appears either above both ai and bi, or below them. The state-choosing function is defined as follows. For every node in S, we choose for character α̂i state 0 if the node is above both ai and bi, state 1 if it is between ai and bi, and state 2 otherwise. Clearly, this tree is compatible with all character trees and it is easy to see that each c(s, α̂) ∈ α(s), i.e., T is a solution to the Q△-SB-GCCC-NB (resp., Q△-P-GCCC-NB) Problem.
Next, we show that if for Q ⊆ 2^{0,...,m}, the Q-SB-GCCC-NB Problem is NP-complete, then the Q ∪ {{m}}-(P-)GCCC-NB Problems are NP-complete.
Theorem 71. If for Q ⊆ 2^{0,...,m}, the Q-SB-GCCC-NB Problem is NP-complete, then the Q ∪ {{m}}-P-GCCC-NB and Q ∪ {{m}}-GCCC-NB Problems are NP-complete.
Proof. We will prove the claim by reduction from the Q-SB-GCCC-NB Problem. An instance of the SB-GCCC-NB Problem can be considered as an instance of the (P-)GCCC-NB Problem, provided that we can force all species to be on a single branch. This can be done easily by adding the extra species x that has state set {m} on all characters, and showing that all other species must have x as a descendant, which forces any solution to this instance of the (P-)GCCC-NB Problem to be a single-branch tree. We omit the details.
As a corollary, we have that the {{1}, {2}, {0, 2}, {0, 1, 2}}-(P-)GCCC-NB Problem is NP-complete. However, the complexity of the BKW Case posed in Benham et al. [12] remains open.
Finally, we show that the {{0}, {1}, {0, 1}}-P-GCCC-NB Problem is NP-complete by reduction from the LEF-PTC Problem (Lemma 59).
Theorem 72. The {{0}, {1}, {0, 1}}-P-GCCC-NB Problem is NP-complete.
Proof. Given an instance of the LEF-PTC Problem S = {1, . . .
, n}, r, and the set of k triples (ai, r : ci), let S be the set of species, and C = {α̂1, . . . , α̂k} be the set of characters. For each α̂i ∈ C, we let αi(ai) = {0} and αi(ci) = {1}, while for all other s ∈ S \ {ai, ci} we let αi(s) = {0, 1}. Let path phylogeny T be a solution to this instance of the {{0}, {1}, {0, 1}}-P-GCCC-NB Problem. Let r be the root of T, i.e., r is the all-zero vertex. Consider the ordering of elements in S ∪ {r} based on the ordering of vertices on path T starting in the leaf of one branch and ending in the leaf of the other branch. Assume the triple (ai, r : ci) is not valid, i.e., ci appears between ai and r. However, this is not possible since vertex ai is then below ci in T and we have a transition from 1 to 0 somewhere on the path from ci to ai for character α̂i. Hence, the order is a solution to the LEF-PTC Problem.
Conversely, let path/order P be a solution to the LEF-PTC Problem. Consider the path phylogeny obtained from P by rooting it at r and the state-choosing function assigning 1 to ci and all nodes below ci and 0 to all other nodes for character α̂i. Clearly, this tree is compatible with all character trees. The state-choosing function could only fail if ai is below ci, in which case c(ai, α̂i) = 1, but αi(ai) = {0}. However, this is not possible as then ci would be between r and ai on P, which violates the constraint (ai, r : ci). The claim follows by Lemma 59.
Note that Theorem 72 implies NP-completeness of several cases of the P-GCCC-NB Problem. In fact, any case of the problem in which set Q contains two distinct state singletons {a} and {b}, and a set containing states 0, c and d such that a α c α b and b α d in Tα is NP-complete. For instance, for a = c = 0, b = 1 and d = 2, we have that the {{0}, {1}, {0, 2}}-P-GCCC-NB Problem is NP-complete ((6b) in Table 5.1).
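The construction in the proof of Theorem 70 can be made concrete with a small sketch. It builds one character per PTC triple and then applies the state-choosing function from the proof to a given order on a single branch. The function names and the encoding of a triple (ai, bi : ci) as a Python tuple (a, b, c) are hypothetical, and the all-zero root is left implicit; this is an illustration of the reduction, not code from the thesis.

```python
def build_characters(species, triples):
    """One character per triple (a, b, c): a and b get {1}, c gets {0, 2},
    and every other species gets {0, 1, 2}."""
    characters = []
    for a, b, c in triples:
        alpha = {s: {0, 1, 2} for s in species}
        alpha[a] = {1}
        alpha[b] = {1}
        alpha[c] = {0, 2}
        characters.append(alpha)
    return characters

def choose_states(order, triple):
    """State-choosing function from the proof, for one character.

    Species above both a and b get state 0, species between a and b
    (inclusive) get state 1, and all remaining species get state 2.
    """
    a, b, _ = triple
    pos = {s: i for i, s in enumerate(order)}
    lo, hi = sorted((pos[a], pos[b]))
    return {s: 0 if pos[s] < lo else 1 if pos[s] <= hi else 2 for s in order}

def is_solution(order, triples):
    """Check that every chosen state lies in the allowed state set.

    The states along the branch read 0...0 1...1 2...2 by construction,
    so compatibility with the character tree 0 -> 1 -> 2 is automatic.
    """
    characters = build_characters(order, triples)
    for alpha, triple in zip(characters, triples):
        states = choose_states(order, triple)
        if any(states[s] not in alpha[s] for s in order):
            return False
    return True
```

For instance, with the single triple ("a", "b", "c"), the order c, a, x, b passes the check, while the order a, c, b fails because c would be forced into state 1, which is not in {0, 2}.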
Chapter 6
Conclusion
In this thesis, we have defined and studied several variants of the Consecutive-Ones Property (C1P) in order to model or solve several problems that arise in the reconstruction of ancestral species. We first define in Chapter 2 a way of relaxing the C1P of binary matrices, namely the (k, δ)-C1P, to model the problem of reconstructing AGOs in the presence of small errors [27, 96]. We show that for most values of k and δ, deciding the (k, δ)-C1P is NP-complete, and give a tractability result for a relevant case of the (2,1)-C1P. In light of this result, and the fact that matrices from real data generally have low degree [27], in Chapter 3 we then consider the (k, δ)-C1P for matrices of bounded degree d (the (d, k, δ)-C1P). We show that the (d, k, δ)-C1P is polynomial-time solvable when all three parameters are fixed constants, while other cases are NP-complete. In Chapter 4, we then study a slightly different way to relax the C1P: allowing columns to appear multiple times in an order, that is, the mC1P, which was first introduced in Wittler and Stoye [151]. We improve upon the hardness results of Wittler and Stoye to show that this problem is NP-complete in most cases, while also finding a tractable case of interest to handling telomeres in the reconstruction of AGOs. Finally, in Chapter 5, we use the C1P, or more specifically, its associated data structure, the PQ-tree, to develop algorithms for several cases of the Generalized Cladistic Character Compatibility (GCCC) Problem. We now summarize our results for these four chapters in more detail, along with relevant future work.
6.1 Chapter 2: The (k, δ)-C1P
In Section 2.3 of this chapter, we show that for every k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1), deciding the (k, δ)-C1P is NP-complete by first showing in Subsection 2.3.1 that for every k, δ ≥ 2, deciding the (k, δ)-C1P is NP-complete, and then in Subsection 2.3.2 that for every k ≥ 3, deciding the (k, 1)-C1P is NP-complete.
Note that this leaves open the case of the (2,1)-C1P, one that is interesting for real applications such as the reconstruction of AGOs [27]. In Section 2.4, we give an algorithm that, given a binary matrix M, either (a) decides if M has the (2,1)-C1P when the orders of the columns of M are restricted according to the block construction with blocks of fixed constant size, of the same type as the two above-mentioned constructions, or (b) finds a proof that deciding the (2,1)-C1P is NP-complete. In fact, this algorithm is FPT in the maximum size of any block. We then show in Section 2.5 that for every δ ≥ 1, deciding the (∞, δ)-C1P is NP-complete. We note that deciding the k-C1P, or equivalently, the (k, ∞)-C1P for k ≥ 2 has been proved NP-complete in Goldberg et al. [55]. This set of results implies that deciding the (k, δ)-C1P is NP-complete for all bounded and unbounded values of k and δ except for (k, δ) = (2, 1).
The above study of this particular gapped C1P of binary matrices, namely the (k, δ)-C1P, immediately raises some open questions about closely related properties. A more restricted version would be the (k, δ)-C1P where the number of gaps in the entire matrix M is bounded by some K ≤ m(k − 1), where m is the number of rows of M. Is such a property polynomial-time decidable? The (k, δ)-C1P is known to be NP-complete for all values of k and δ except for (k, δ) = (2, 1): are there any natural parameters such that the (k, δ)-C1P is FPT? One drawback of the (k, δ)-C1P (and the k-C1P, for that matter) is that it has the rigid limit of k − 1 gaps per row. What if we allowed rows to “share a pool” of gaps, in the sense that if one row has only k − 2 gaps, then another may have k gaps? A more general version of the (k, δ)-C1P is to bound the total number of 0’s in the gaps in all of M.
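For a fixed column order, the quantities these variants refer to (the gaps of each row, the total number of gaps K, and the total number of 0’s in gaps N) are straightforward to compute; the hardness lies in finding a suitable order. A minimal sketch follows, assuming a matrix is given as a list of sets of column names (the columns holding a 1 in each row); all function names are hypothetical.

```python
def gap_profile(row, order):
    """Sizes of the gaps of one row under a column order.

    `row` is the set of columns holding a 1 and `order` lists all columns.
    A gap is a maximal run of 0's lying between two blocks of 1's.
    """
    positions = [i for i, col in enumerate(order) if col in row]
    return [q - p - 1 for p, q in zip(positions, positions[1:]) if q - p > 1]

def is_kd_c1_order(matrix, order, k, delta):
    """True if every row has at most k blocks (k - 1 gaps) and no gap
    contains more than delta 0's, i.e. `order` witnesses the (k, delta)-C1P."""
    for row in matrix:
        gaps = gap_profile(row, order)
        if len(gaps) > k - 1 or any(g > delta for g in gaps):
            return False
    return True

def gap_totals(matrix, order):
    """Return (K, N): total number of gaps and total number of 0's in gaps."""
    profiles = [gap_profile(row, order) for row in matrix]
    return sum(len(p) for p in profiles), sum(sum(p) for p in profiles)

matrix = [{"a", "c"}, {"b", "c", "d"}, {"a", "d"}]
order = ["a", "b", "c", "d"]
```

With this matrix and order, the three rows have gap profiles [1], [] and [2], so the order witnesses the (2,2)-C1P but not the (2,1)-C1P, and gap_totals returns K = 2 and N = 3. A pooled variant of the kind asked about here would compare K or N against a single global bound instead of checking each row separately.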
For example, given a matrix M with m rows, at most N ≤ m(k − 1)δ 0’s can be in the gaps of M (an average of one gap per row when (k, δ) = (2, 1), which, in a way, generalizes the (2,1)-C1P). Is this problem FPT for some natural parameter?
From a purely combinatorial point of view, there has been a renewed interest in the characterization of matrices that do not have the C1P in terms of forbidden submatrices introduced by Tucker [145]. It has recently been shown that this characterization could be used in the design of algorithms related to the C1P [18, 28, 36]. This then raises the following natural question: is there a nice characterization of matrices that do not have the (k, δ)-C1P in terms of forbidden submatrices? This is of particular interest to the open (2,1)-C1P case: if it is indeed polynomial-time decidable, trying to find a forbidden submatrix characterization may lead to an algorithm for this case. If such a characterization does not exist, given a matrix that is not C1P, can the (k, δ)-C1P be quickly determined if the set of all Tucker patterns is known?
Finally, it is also natural to ask if there exists a structure that can represent all orders that satisfy some gap conditions related to the C1P. Such a structure exists for the C1P with no gaps: for a matrix that has the C1P, its PQ-tree represents all its C1 orders, and can be computed in linear time [21]. This has even been extended to matrices that do not have the C1P through the notion of the PQR-tree [106, 107], or the Generalized PQ-tree of McConnell [102]. Although the existence of such a structure with nice algorithmic properties is ruled out by the hardness of deciding the (k, δ)-C1P (except for maybe the (2,1)-C1P), it remains open to find classes of matrices such that testing for this property is tractable, and in such a case, to represent all possible orders in a compact way.
Here again, this question is motivated both by theoretical considerations (for example, representing all possible layouts of a graph of bandwidth 2) and by problems in computational genomics, such as the reconstruction of AGOs [27, 96].
Recall that the approach of Chauve and Tannier [27] discards the minimum number of rows of a given matrix M using a branch-and-bound procedure, until the remaining matrix has the C1P. In Chauve and Tannier’s experiments, the number of rows discarded is generally a very small fraction of the number of rows of M. This motivates the following question. Given a PQ-tree T and a set of rows R of bounded size, is there a permutation π that is generated by T to which all r ∈ R map with at most k gaps of size δ? A variant of this would be to try and map the set R onto π while trying to minimize the number of gaps, or the number of 0’s in the gaps (cf. a previous paragraph). In either case, perhaps there is a way to refine a PQ-tree T, as in Section 5.3 of Chapter 5, or even to partially refine T, as in Section 4.3 of Chapter 4, to come up with a new structure T′ that encodes (or partially encodes) all permutations π that meet the gap constraints of R. Either one would be a weak notion of a structure that encodes (k, δ)-C1 orders of a matrix that has the (k, δ)-C1P.
6.2 Chapter 3: The (d, k, δ)-C1P
In this chapter we study the (k, δ)-C1P for matrices of bounded degree d, or the (d, k, δ)-C1P. This is motivated by the fact that we have observed that matrices from experimental data of the reconstruction of AGOs [27] tend to have low degree. In Section 3.1 we show that when all three parameters are fixed constants, the (d, k, δ)-C1P is related to the classical Graph Bandwidth Problem, and can hence be solved in polynomial time using a variant of a relatively brute-force algorithm of Saxe [135].
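The bandwidth value d + (k − 1)δ − 1 that appears in connection with Saxe's algorithm in this chapter summary can be motivated by a short counting argument. The following is only a sketch under stated assumptions: π is taken to be a (d, k, δ)-C1 order viewed as a map from columns to positions, and the argument uses nothing beyond the definition of the (d, k, δ)-C1P. The exact definition of the graph used in Section 3.1 is not reproduced here.

```latex
% A row r has at most d ones, split into at most k blocks, hence at most
% k-1 gaps, each containing at most \delta zeros.
\[
  \underbrace{d}_{\text{1's of } r}
  \;+\; \underbrace{(k-1)\,\delta}_{\text{0's in the gaps of } r}
  \;=\; d + (k-1)\delta
  \quad \text{positions spanned by } r \text{ at most,}
\]
\[
  \text{so} \quad |\pi(x) - \pi(y)| \;\le\; d + (k-1)\delta - 1
  \quad \text{for all columns } x, y \text{ with a 1 in } r .
\]
% Example: (d, k, \delta) = (3, 2, 1) gives a span of at most 4 positions
% and a distance of at most 3 between any two 1's of the same row.
```

Writing b = d + (k − 1)δ − 1, a running time of the form O(n^{b+1}) becomes O(n^{d+(k−1)δ}), provided the variant used has the same running time as Saxe's original algorithm, which is an assumption here.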
Then, in Section 3.2.4 we show that, for every d > k ≥ 2, deciding the (d, k, ∞)-C1P is NP-complete, by reducing from a hypergraph covering problem which is defined in Section 3.2.1 and shown to be NP-complete in Sections 3.2.2 and 3.2.3. We comment that here we have studied the weakest formulation of the C1P with gaps: indeed, in the (d, d − 1, ∞)-C1P case, it is required that only two of the d 1’s in each row are adjacent in any order, while the other 1’s can end up arbitrarily far away from this pair. It is thus surprising that deciding this property is still NP-complete for any d ≥ 3, as implied by the general result above. This chapter closes the case of the complexity of deciding the (d, k, δ)-C1P, with the exception of the (∞, 2, 1)-C1P case, or just the (2,1)-C1P case (cf. Chapter 2), which remains open.
We comment here that Goldberg et al. [55] pose the open question of the complexity of deciding the 2-C1P for sparse matrices (matrices where there is a limit on the number of ones per row and per column). The (d, 2, ∞)-C1P limits the number of 1’s per row only, that is, it is equivalent to the 2-C1P for bounded degree matrices. If we could determine the complexity of deciding the (d, 2, ∞)-C1P for matrices with a bounded number of 1’s per column, we could close this open question of Goldberg et al. We do show, as a corollary of Theorem 32, that deciding the (d, 2, ∞)-C1P is NP-complete for matrices with at most 7 1’s per column, closing this open question of Goldberg et al. [55].
There are several open questions and directions we would like to follow in future work, some of them being parallel to open questions posed in the context of just the (k, δ)-C1P. One such question: is it possible to find a nice characterization of matrices that do not have the (d, k, δ)-C1P in terms of forbidden structures, such as Tucker submatrices [145], especially for small values of d?
Can the (d, k, δ)-C1P be quickly determined if the set of all Tucker patterns is known?
When all three parameters are fixed, the (d, k, δ)-C1P is related to the classical Graph Bandwidth Problem, and can hence be solved in polynomial time [29] using a variant of a relatively brute-force algorithm of Saxe [135] for deciding if a graph has bandwidth d + (k − 1)δ − 1. This algorithm of Saxe decides if a given graph has bandwidth b in time O(n^{b+1}). Caprara et al. [25] provide a linear time algorithm for the special case of deciding if a graph G has bandwidth 2. In this algorithm, Caprara et al. first reduce G to a skeleton (called an auxiliary graph) that all bandwidth 2 layouts must contain. The bandwidth 2 layouts of each component of this auxiliary graph, irreducible subgraphs of G that are independent of each other, then determine the set of bandwidth 2 layouts of G. Indeed, this auxiliary graph somewhat resembles a PQ-tree: given a C1P matrix M, and its graph GM as defined in Section 3.1, how does the PQ-tree for M relate to the auxiliary graph for GM? If M does not have the C1P, how does the auxiliary graph relate to the set of (d, k, δ)-C1 orders of M? How does the auxiliary graph relate to the active regions computed in the algorithm of Saxe? Indeed, since Caprara et al.’s algorithm is linear for graphs of bandwidth 2, perhaps improvements can be made in the general bandwidth b case (this is one of the open questions posed by Saxe). Even for small values of b, this would be useful in applications involving the reconstruction of AGOs [27]. Can the auxiliary graph be extended to a structure that all bandwidth b layouts must contain, even if computing it involves a large time overhead? This could lead to a weak notion of a PQ-tree for all (d, k, δ)-C1 orders of a matrix that has the (d, k, δ)-C1P.
Finally, assuming that k is close to d, for each row there are many orders of columns which make this row (d, k, ∞)-consecutive.
Hence, for a small number of rows, random instances of matrices have the (d, k, ∞)-C1P almost always. Conversely, for a large number of rows, random instances of matrices that have the (d, k, ∞)-C1P would have very few column orders that witness this property. We would like to investigate the ratios between the number of rows and columns for which one or the other type of instance occurs, with the goal of developing heuristics for both of these types of instances.
6.3 Chapter 4: The mC1P
In Section 4.1 of this chapter, we have shown that deciding the mC1P is NP-complete for matrices with degree at most 3 and m(s) ≤ 2 for each s ∈ S, where S is the set of columns of M. In Section 4.2 we then show that the two restricted variants of the mC1P given in Wittler and Stoye [151], namely the mC1P(fr) and the mC1P(ne), are NP-complete for matrices with degree at most 3 (6 for the mC1P(fr) case) and m(s) ≤ 2 for each s ∈ S. In Section 4.3, we have shown that, given a matrix M and a multiplicity vector m such that (1) M has matched multirows, and (2) each row contains either (i) at most one entry 1 in multicolumns, or (ii) two entries 1 in multicolumns and no other entries, deciding if M has the mC1P for m can be done in polynomial time and space (cf. Theorem 51).
With the result of Section 4.3, we extend the domain of tractable instances of deciding the mC1P for binary matrices. This approach relies on previously used techniques to decide the C1P and simpler instances of the mC1P, and answers a natural problem in reconstructing ancestral gene orders. Several questions remain open. Naturally, one can ask to relax the condition that M has matched multirows, which is crucial in our proofs. It seems however that the problem becomes hard in this case, and some less rigid constraints on M would then have to be introduced to recover tractability.
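For reference, verifying that a given sequence σ witnesses the mC1P is simple; the difficulty discussed in this chapter lies in finding such a σ. The sketch below rests on one loudly stated assumption about the definition: a row is satisfied when some contiguous window of σ has exactly the row's columns as its set of distinct elements. The function name, the encoding of rows as sets, and the default multiplicity of 1 for unlisted columns are all hypothetical choices made for illustration.

```python
from collections import Counter

def is_mc1p_witness(rows, sigma, mult):
    """Check a candidate sequence `sigma` against multiplicities and rows.

    `rows` is a list of non-empty sets of columns, `sigma` a list of columns,
    and `mult` a dict giving the maximum number of occurrences of a column
    (columns missing from `mult` are assumed to have multiplicity 1).
    """
    counts = Counter(sigma)
    if any(counts[s] > mult.get(s, 1) for s in counts):
        return False
    n = len(sigma)
    for row in rows:
        # Assumed reading of the consecutivity requirement: some contiguous
        # window of sigma contains exactly the columns of this row.
        if not any(set(sigma[i:j]) == row
                   for i in range(n) for j in range(i + 1, n + 1)):
            return False
    return True

rows = [{1, 2}, {2, 3}, {1, 3}]
```

The three rows above form a cycle on the columns, so no sequence using each column once satisfies all of them, whereas allowing column 1 to appear twice makes the sequence 1, 2, 3, 1 a witness. The brute-force window search is quadratic in the length of σ per row and is meant only to make the definition tangible.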
Also it is open to exhibit an extension of the notion of the PQ-tree that could encode all mC1P orders of a binary matrix that satisfies this property. Even for the case of a matrix with matched multirows, our techniques lead to a data structure which only captures the consecutivity requirement (cf. Section 4.3) but not the multiplicity requirement. From an algorithmic complexity point of view, our algorithm has an O(mn) time complexity, and it remains open to see if this case can be solved in O(m + n + ℓ) time, where ℓ is the number of 1’s in the entire matrix M.
The problem of covering hypergraphs with a collection of paths played a key role in the hardness results of Chapter 3, and, with slightly different conditions on this collection of paths, played a key role in the hardness results of this chapter. Other variants of hypergraph covering were also used to show both hardness and algorithmic results for the haplotyping problem via galled tree networks [59–61]. Perhaps considering other conditions on the covering could give rise to other new and interesting problems. In fact, one could do a systematic study of the covering of hypergraphs with graphs to see which conditions (on both hypergraph and graph) lead to interesting results.
6.4 Chapter 5: The GCCC Problem
In Section 5.2 we show that the PTC and the LEF-PTC Problems are NP-complete, while the OEF-TO Problem is polynomial-time solvable, and the REF-PTC Problem always has a solution. In Section 5.3, we present some tractable cases of the GCCC Problem, while in Section 5.4 we present some hardness results.
Here, we have characterized the complexity of cases of the Q-SB-GCCC-NB and Q-P-GCCC-NB Problems for Q ⊆ {{0}, {1}, {2}, {0, 2}, {0, 1, 2}}. This leaves open, however, some interesting cases of the GCCC-NB Problem. Here we show that when Q′ = {{1}, {0, 2}}, the input corresponds to a binary matrix M, hence the Q′-SB-GCCC-NB Problem is equivalent to the C1P Problem.
That is, the Q′-SB-GCCC-NB (resp., Q′-GCCC-NB) Problem is to find a single-branch path (resp., tree) with vertex set containing the columns of M (and possibly other columns) such that for each row of M, the set of vertices labelled 1 by this row forms a connected subpath (resp., subtree), i.e., M has the C1P (resp., a “connected-ones property” of trees). Note also that for a tree to have this connected-ones property, the sets of vertices labelled 0 by any row must form at most 2 connected subtrees, so that this tree can be contracted to 0 → 1 → 2 for each row (this is automatically enforced in the case of the C1P, since the set of vertices labelled by 1 in each row is a path). If we can determine in polynomial time that this connected-ones property holds (like we can for the C1P), it might provide an answer to the BKW Case. Preliminary study has shown that the set of such matrices corresponds to a special class of chordal graphs: deeper study into this connection could be useful. Finally, it would be interesting to systematically study these problems for all subsets of 2^{0,1,2}, as it would complete the study for all possible inputs to the GCCC-NB Problem when character trees are 0 → 1 → 2.
Bibliography
[1] Z. Adam, M. Turmel, C. Lemieux, and D. Sankoff. Common intervals and symmetric difference in a model-free phylogenomics, with an application to streptophyte evolution. Journal of Computational Biology, 14:436–445, 2007. → pages 10, 13, 16, 17
[2] R. Agarwala and D. Fernandez-Baca. A polynomial-time algorithm for the perfect phylogeny problem when the number of character states is fixed. SIAM Journal on Computing, 26(6):1216–1224, 1994. → pages 30, 102
[3] M. Alekseyev and P. Pevzner. Colored de Bruijn graphs and the genome halving problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4:98–107, 2007. → pages 27
[4] F. Alizadeh, R. Karp, L. Newberg, and D. Weisser.
Physical mapping of chromosomes: A combinatorial problem in molecular biology. In Proceedings of the 4th ACM-SIAM Symposium on Discrete Algorithms (SODA), 1991. → pages 7
[5] F. Alizadeh, R. Karp, D. Weisser, and G. Zweig. Physical mapping of chromosomes using unique probes. J. Comput. Biol., 2(2):159–184, 1995. → pages 7, 11
[6] E. Althaus, S. Canzar, M. Emmett, A. Karrenbauer, A. Marshall, A. Meyer-Baese, and H.-M. Zhang. Computing H/D-exchange speeds of single residues from data of peptic fragments. In Proceedings of the 23rd ACM Symposium on Applied Computing (SAC 2008), pages 1273–1277. ACM Press, 2008. → pages 6
[7] J. Atkins and M. Middendorf. On physical mapping and the consecutive ones property for sparse matrices. Discrete Applied Mathematics, 71(1-3):23–40, 1996. → pages 7, 21
[8] J. Atkins, E. Boman, and B. Hendrickson. A spectral algorithm for seriation and the consecutive ones problem. SIAM Journal on Computing, 28(1):297–310, 1998. → pages 7, 21
[9] M. Beal, A. Bergeron, S. Corteel, and M. Raffinot. An algorithmic view of gene teams. Theoretical Computer Science, 320:395–418, 2004. → pages 17
[10] M. Belcaid, A. Bergeron, A. Chateau, C. Chauve, Y. Gingras, G. Poisson, and M. Vendette. Exploring genome rearrangements using virtual hybridization. In D. Sankoff, L. Wang, and F. Chin, editors, Proceedings of the 5th Asia-Pacific Bioinformatics Conference (APBC), volume 5 of Advances in Bioinformatics and Computational Biology, pages 205–214. Imperial College Press, 2007. → pages 17
[11] C. Benham, S. Kannan, M. Paterson, and T. Warnow. Hen’s teeth and whale’s feet: Generalized characters and their compatibility. Computational Biology, 2(4):515–525, 1995. → pages 30, 31, 100, 101, 102, 103, 108
[12] C. Benham, S. Kannan, and T. Warnow. Of chicken teeth and mouse eyes, or generalized character compatibility. Combinatorial Pattern Matching, pages 17–26, 1995. → pages iii, v, 1, 30, 31, 100, 101, 102, 103, 116
[13] S. Benzer.
On the topology of genetic fine structure. In Proceedings of the National Academy of Sciences, volume 45, pages 1607–1620, U.S.A., 1959. → pages 3, 7
[14] A. Bergeron and J. Stoye. On the similarity of sets of permutations and its applications to genome comparison. Journal of Computational Biology, 13(7):1340–1354, 2006. → pages 78
[15] A. Bergeron, M. Blanchette, A. Chateau, and C. Chauve. Reconstructing ancestral genomes using conserved intervals. In I. Jonassen and J. Kim, editors, Proceedings of the 4th International Workshop on Algorithms in Bioinformatics, volume 3240 of Lecture Notes in Bioinformatics, pages 14–25, 2004. → pages 16, 27, 28, 78
[16] A. Bergeron, Y. Gingras, and C. Chauve. Bioinformatics Algorithms: Techniques and Applications, chapter 8 Formal Models of Gene Clusters, pages 177–202. 2008. → pages 21, 27
[17] G. Blin, D. Faye, and J. Stoye. Finding nested common intervals efficiently. Journal of Computational Biology, 17(9):1183–1194, 2010. → pages 82
[18] G. Blin, R. Rizzi, and S. Vialette. A faster algorithm for finding minimum Tucker submatrices. In F. Ferreira, B. Löwe, E. Mayordomo, and L. Gomes, editors, Proceedings of Programs, Proofs, Processes, the Sixth Conference on Computability in Europe (CiE), volume 6158 of LNCS, pages 69–77. Springer, 2010. → pages 3, 5, 120
[19] S. Böcker, K. Jahn, J. Mixtacki, and J. Stoye. Computation of median gene clusters. Journal of Computational Biology, 16(8):1085–1099, 2009. → pages 78
[20] H. Bodlaender, M. Fellows, and T. Warnow. Two strikes against perfect phylogeny. In ICALP, pages 273–283, 1992. → pages 30, 102
[21] K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. Journal of Computer and System Sciences, 13(3):335–379, 1976. → pages iii, v, 3, 4, 6, 18, 88, 101, 110, 111, 120
[22] G. Bourque and P. Pevzner. Genome-scale evolution: reconstructing gene orders in the ancestral species.
Genome Research, 12:26–36, 2002. → pages 11, 12
[23] G. Bourque, P. Pevzner, and G. Tesler. Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse and rat genomes. Genome Research, 14:507–516, 2004. → pages
[24] G. Bourque, E. Zdobnov, P. Bork, P. Pevzner, and G. Tesler. Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages. Genome Research, 15:98–110, 2005. → pages 11
[25] A. Caprara, F. Malucelli, and D. Petrolani. On bandwidth-2 graphs. Discrete Applied Mathematics, 34:477–495, 2002. → pages 25, 122
[26] B. Chang, K. Jönsson, M. Kazmi, M. Donoghue, and T. Sakmar. Recreating a functional ancestral archosaur visual pigment. Molecular Biology and Evolution, 19(9):1483–1489, 2002. → pages 10
[27] C. Chauve and E. Tannier. A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput. Biol., 4(e1000234), 2008. → pages 1, 9, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 118, 119, 120, 121, 122
[28] C. Chauve, U.-W. Haus, T. Stephen, and V. You. Minimal conflicting sets for the consecutive-ones property in ancestral genome reconstruction. In Proc. of RECOMB-CG, volume 5817 of LNBI, pages 48–58, 2009. → pages 120
[29] C. Chauve, J. Maňuch, and M. Patterson. On the gapped consecutive-ones property. In Proc. of European Conference on Combinatorics, Graph Theory and Applications (EUROCOMB), volume 34 of ENDM, pages 121–125, 2009. → pages iv, 122
[30] C. Chauve, U.-W. Haus, T. Stephen, and V. You. Minimal conflicting sets for the consecutive-ones property in ancestral genome reconstruction. Journal of Computational Biology, 17(9):1167–1181, 2010. → pages 3
[31] C. Chauve, J. Maňuch, M. Patterson, and R. Wittler. Tractability results for the consecutive-ones property with multiplicity.
In Proceedings of the 22nd Annual Symposium on Combinatorial Pattern Matching (CPM), volume 6661 of Lecture Notes in Computer Science, pages 90–103. Springer, 2011. → pages v
[32] T. Christof, M. Jünger, J. Kececioglu, P. Mutzel, and G. Reinelt. A branch-and-cut approach to physical mapping of chromosomes by unique end-probes. Journal of Computational Biology, 4:433–447, 1997. → pages 7
[33] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001. → pages 7
[34] D. Corneil, S. Olariu, and L. Stewart. The ultimate interval graph recognition algorithm? In Proceedings of the 9th Symposium on Discrete Algorithms (SODA), pages 175–180. ACM/SIAM, 1998. → pages 6
[35] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: A fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23(9):324–328, 1998. → pages 8
[36] M. Dom. Recognition, generation, and application of binary matrices with the consecutive-ones property. PhD thesis, Institut für Informatik, Friedrich-Schiller-Universität, Jena, 2008. → pages 6, 7, 21, 24, 120
[37] M. Dom. Algorithmic aspects of the consecutive-ones property. Bulletin of the European Association of Theoretical Computer Science (EATCS), 98 (2759), 2009. → pages 3, 5, 6, 24
[38] M. Dom, J. Guo, and R. Niedermeier. Approximability and parameterized complexity of the consecutive ones submatrix problems. In Proceedings of the 4th International Conference on the Theory and Applications of Models of Computation (TAMC), volume 4484 of LNCS, pages 680–691. Springer-Verlag, 2007. → pages 21
[39] M. Dom, J. Guo, and R. Niedermeier. Approximation and fixed-parameter algorithms for consecutive ones submatrix problems. Journal of Computer and System Sciences, 2009. → pages 3
[40] D. Durand and D. Sankoff. Tests for gene clustering. pages 144–154. ACM Press, 2002. → pages 78
[41] N. El-Mabrouk and D. Sankoff. The reconstruction of doubled genomes.
SIAM Journal on Computing, 32:754–792, 2003. → pages 27
[42] G. Estabrook and F. McMorris. When is one estimate of evolutionary relationships a refinement of another? J. Math. Biosci., 10:327–373, 1980. → pages 30
[43] G. Even, R. Levi, D. Rawitz, B. Schieber, S. Shahar, and M. Sviridenko. Algorithms for capacitated rectangle stabbing and lot sizing with joint set-up costs. ACM Transactions on Algorithms, 4(3), 2008. Article 34. → pages 7
[44] T. Faraut. Addressing chromosome evolution in the whole-genome sequence era. Chromosome Research, 16:5–16, 2008. → pages 12
[45] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003. → pages 10, 30
[46] A. Ferreira and S. Song. Achieving optimality for gate matrix layout and PLA folding: a graph theoretic approach. In I. Simon, editor, Proceedings of the 1st Latin American Symposium on Theoretical Informatics (LATIN), volume 583 of LNCS, pages 139–153, São Paulo, Brasil, 1992. Springer-Verlag. → pages 6
[47] L. Figuera, M. Pandolfo, P. Dunne, J. Cantu, and P. Patel. Mapping the congenital generalized hypertrichosis locus to chromosome Xq24-q27.1. Nature (London), 10:202–207, 1995. → pages 30
[48] W. Fitch. Towards defining the course of evolution: Minimum change for a specific tree topology. Systematic Zoology, 20:406–416, 1971. → pages 12, 16
[49] R. Friedman and A. Hughes. Gene duplication and the structure of eukaryotic genomes. Genome Research, 11(3):373–381, 2001. → pages 78
[50] L. Froenicke, J. Wienberg, G. Stone, L. Adams, and R. Stanyon. Towards the delineation of the ancestral eutherian genome organization: comparative genome maps of human and the African elephant (Loxodonta africana) generated by chromosome painting. In Proceedings of the Royal Society B Biological Sciences, volume 270, pages 1331–1340, 2003. → pages 11, 19
[51] L. Froenicke, M. Caldés, A. Graphodatsky, S. Müller, L. Lyons, T. Robinson, M. Volleth, F. Yang, and J. Wienberg.
Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Research, 16:306–310, 2006. → pages 12, 13 [52] D. Fulkerson and O. Gross. Incidence matrices and interval graphs. Pacific Journal of Mathematics, 15:835–855, 1965. → pages ii, 2, 3, 4, 6, 7 [53] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. → pages 105 [54] S. Ghosh. File organization: the consecutive retrieval property. Communcations of the ACM, 15(9):802–808, 1972. → pages 6 [55] P. Goldberg, M. Golumbic, H. Kaplan, and R. Shamir. Four strikes against physical mapping of DNA. J. Comput. Biol., 2(1):139–152, 1995. → pages ii, 7, 21, 22, 23, 26, 35, 36, 71, 72, 119, 121, 122 [56] J. Gordon, K. Byrne, and K. Wolfe. Additions, losses, and rearrangements on the evolutionary route from a reconstructed ancestor to the modern Saccharomyces cerevisiae genome. PLoS Genetics, 5(5), 2009. → pages 10 [57] J. Gramm, T. Nierhoff, R. Sharan, and T. Tantau. Haplotyping with missing data via perfect path phylogenies. Discrete Applied Mathematics, 155: 788–805, 2007. → pages 101 [58] D. Greenberg and S. Istrail. Physical mapping by STS hybridization: algorithmic strategies and the challenge of software evaluation. Journal of Computational Biology, 2(2):219–274, Summer 1995. → pages 7 [59] A. Gupta, J. Maˇnuch, L. Stacho, and X. Zhao. Algorithm for haplotype inferring via galled-tree networks with simple galls. In Proc. of Int. Symposium on Bioinformatics Research and Applications (ISBRA), volume 4463 of LNBI, pages 121–132, 2007. → pages 62, 124 131 [60] A. Gupta, J. Maˇnuch, L. Stacho, and X. Zhao. Haplotype inferring via galled-tree networks is NP-complete. In Proc. of Annual Int. Computing and Combinatorics Conference (COCOON), volume 5092 of LNCS, pages 287–298, 2008. → pages 62 [61] A. Gupta, J. Maˇnuch, L. Stacho, and X. Zhao. 
Haplotype inferring via galled-tree networks using a hypergraph covering problem for special genotype matrices. Discr. Appl. Math., 157(10):2310–2324, 2009. → pages 62, 124 [62] D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19–28, 1991. → pages 30 [63] D. Gusfield. The multi-state perfect phylogeny problem with missing and removable data: Solutions via integer-programming and chordal graph theory. In Proc. of RECOMB 2009, volume 5541 of LNCS, pages 294–310, 2009. → pages 31 [64] M. Habib, R. M. McConnell, C. Paul, and L. Viennot. Lex-BFS and partition refinement, with applications to transitive orientation, interval graph recognition and consecutive ones testing. Theoretical Computer Science, 234(1-2):59–84, 2000. → pages 5, 6 [65] S. Haddadi. A note on the NP-hardness of the consecutive block minimization problem. International Transactions on Operational Research, 9(6):775–777, 2002. → pages 24 [66] M. T. Hajiaghayi and Y. Ganjali. A note on the consecutive ones submatrix problem. Information Processing Letters, 83(3):163–166, 2002. → pages 21 [67] R. Hassin and M. Megiddo. Approximation algorithms for hitting objects with straight lines. Discrete Applied Mathematics, 30:29–42, 1991. → pages 6 [68] X. He and M. Goldwasser. Identifying conserved gene clusters in the presence of homology families. Journal of Computational Biology, 12(6): 638–656, 2005. → pages 78 [69] R. Hoberman and D. Durand. The incompatible desiderata of gene cluster properties. In Proceedings of RECOMB Comparitive Genomics, volume 3678 of Lecture Notes in Bioinformatics, pages 73–87. Springer Verlag, 2005. → pages 78, 82 132 [70] D. Hochbaum and A. Levin. Cyclical scheduling and multi-shift scheduling: Complexity and approximation algorithms. Discrete Optimization, 3(4):327–340, 2006. → pages 6 [71] D. Hochbaum and P. Tucker. Minimax problems with bitonic matrices. Networks, 40(3):113–124, 2002. → pages 6 [72] F. Hole and M. Shaw. 
Computer analysis of chronological seriation, volume 53. 1967. → pages 6 [73] W.-L. Hsu. A simple test for the consecutive ones property. In T. Ibaraki, Y. Inagaki, and K. Iwama, editors, ISAAC, volume 650 of LNCS, pages 459–468, 1992. → pages 4 [74] W.-L. Hsu. A simple test for interval graphs. In Proceedings of the 18th International Workshop on Graph-Theoretic Concepts in Computer Science (WG), volume 657 of LNCS, pages 11–16. Springer, 1992. → pages 6 [75] W.-L. Hsu. On physical mapping algorithms – an error-tolerant test for the consecutive-ones property. volume 1276 of Lecture Notes in Computer Science, pages 242–250. Springer, 1997. → pages 24 [76] W.-L. Hsu. A simple test for the consecutive ones property. Journal of Algorithms, 43(1):1–16, 2002. → pages 4 [77] W.-L. Hsu and T.-H. Ma. Fast and simple algorithms for recognizing chordal comparability graphs and interval graphs. SIAM Journal on Computing, 28(3):1004–1020, 1999. → pages 6 [78] W.-L. Hsu and R. McConnell. PC tress and circular-ones arragements. Theoretical Computer Science, 296(1):99–116, 2003. → pages 5 [79] C. Janis. The sabertooth’s repeat performances. Natural History, 103: 78–82, 1994. → pages 30 [80] T. Jermann, J. Opitz, J. Stackhouse, and S. Benner. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature, 374(6517):57–59, 1995. → pages 10 [81] S. Jinks-Robertson and T. Petes. Chromosomal translocations generated by high-frequency meiotics recombination between repeated yeast genes. Genetics, 114(3):731–752, 1986. → pages 20 133 [82] S. Kannan and T. Warnow. Inferring evolutionary history from DNA sequences. SIAM Journal on Computing, 23(4):713–737, 1994. → pages 30 [83] S. Kannan and T. Warnow. A fast algorithm for the computation and enumeration of perfect phylogenies. In SODA, pages 595–603, 1995. → pages 30 [84] D. Kendall. Incidence matrices, interval graphs and seriation in archaeology. Pacific Journal of Mathematics, 2(28):219–274, 1995. 
→ pages 2, 6 [85] W. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. Evolutions’s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. In Proceedings of the National Academy of Sciences USA, volume 100, pages 11484–11489. → pages xii, 12, 14 [86] E. Kollar and C. Fisher. Tooth induction in chick epithelium: Expression of quiescent genes for enamel synthesis. Science, 207:993–995, 1980. → pages 30 [87] N. Korte and R. M¨ohring. An incremental linear-time algorithm for recognizing interval graphs. SIAM Journal on Computing, 18(1):68–81, 1989. → pages 4, 6 [88] L. Kou. Polynomial complete consecutive information retrieval problems. SIAM Journal on Computing, 6(1):67–75, 1977. → pages 6 [89] S. Kovaleva and F. Spieksma. Approximation of a geometric set covering problem. In Proceedings of the 12th International Society for Analysis, Applications and Computation (ISAAC), volume 2223 of LNCS, pages 493–501. Springer, 2001. → pages 6 [90] D. Kratsch, R. McConnell, K. Mehlhorn, and J. Spinrad. Certifying algorithms for recognizing interval graphs and permutation graphs. SIAM Journal on Computing, 36(2):326–353, 2006. → pages 5, 6 [91] G. Landau, L. Parida, and O. Weimann. Gene proximity analysis across whole genomes via PQ trees. Journal of Computational Biology, 12(10): 1289–1306, 2005. → pages 16, 17, 26 [92] C. Lekkerkerker and J. Boland. Representation of a finite graph by a set of intervals on the real line. Fundamentals of Mathematics, 51:45–64, 1962. → pages 7 134 [93] H. Lewin, D. Larkin, J. Pontius, and S. O’Brien. Every genome sequence needs a good map. Genome Research, 19:1925–1928, 2009. → pages 7 [94] W.-F. Lu and W.-L. Hsu. A test for the consecutive ones property on noisy data – application to physical mapping and sequence assembly. Journal of Computational Biology, 10(5):709–735, 2003. → pages 7, 21, 24 [95] N. Luc, J. Risler, A. Bergeron, and M. Raffinot. 
Gene teams: A new formalization of gene clusters for comparative genomics. Computational Biology and Chemistry, 27:59–67, 2003. → pages 17 [96] J. Ma, L. Zhang, B. Suh, B. Raney, R. Burhans, W. Kent, M. Blanchette, D. Haussler, and W. Miller. Reconstructing contiguous regions of an ancestral genome. Genome Research, 16(12):1557–1565, 2006. → pages xii, xvii, 9, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 118, 120 [97] J. Ma, A. Ratan, B. Raney, B. Suh, L. Zhang, W. Miller, and D. Haussler. DUPCAR: Reconstructing contiguous ancestral regions with duplications. Journal of Computational Biology, 15:1007–1027, 2008. → pages 27 [98] J. Maˇnuch, M. Patterson, and A. Gupta. On the generalised character compatibility problem for non-branching character trees. In H. Q. Ngo, editor, Proceedings of the 15th Annual International Conference on Computing and Combinatorics (COCOON), pages 268–276, 2009. → pages 103 [99] J. Maˇnuch and M. Patterson. The complexity of the gapped consecutive-ones property problem for matrices of bounded maximum degree. In Proceedings of the 8th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG), volume 6398 of Lecture Notes in Bioinformatics, pages 278–289. Springer, 2010. → pages v [100] J. Maˇnuch and M. Patterson. The complexity of the gapped consecutive-ones property problem for matrices of bounded maximum degree. Journal of Computational Biology, 18(9):1243–1253, 2011. → pages v [101] J. Maˇnuch, M. Patterson, and C. Chauve. Hardness results for the gapped consecutive-ones property. Discrete Applied Mathematics, 2011. to appear. → pages iv [102] R. McConnell. A certifying algorithm for the consecutive-ones property. In Proc. of the Fifth Annual Symposium on Discrete Algorithms (SODA), pages 761–770. SIAM, 2004. → pages 5, 18, 88, 94, 120 135 [103] F. McMorris, T. Warnow, and T. Wimer. Triangulating vertex colored graphs. SIAM Journal of Discrete Mathematics, 7(2):296–306, 1994. → pages 30 [104] S. Mecke and D. Wagner. 
Solving geometric covering problems by data reduction. In Proceedings of the 12th European Symposium on Algorithms (ESA), volume 3221 of LNCS, pages 760–771. Springer, 2004. → pages 6, 7, 24 [105] S. Mecke, A. Sch¨obel, and D. Wagner. Station location – complexity and approximation. In Proceedings of the 5th Algorithmic Methods and Models for Optimization of Railways (ATMOS), IBFI. Dagstuhl, Germany, 2005. → pages 6, 7, 24 [106] J. Meidanis, O. Porto, and G. Telles. On the consecutive ones property. Discrete Applied Mathematics, 88(1-3):325–354, 1998. → pages v, 4, 18, 88, 101, 120 [107] J. Meidanis, O. Porto, and G. P. Telles. On the consecutive ones property. Discrete Applied Mathematics, 155:788–805, 2007. → pages 120 [108] T. Mizukami, W. Chang, I. Garkavtsev, N. Kaplan, D. Lombardi, T. Matsumoto, O. Niwa, A. K. andM. Yanagida, T. Marr, and D. Beach. A 13kb resolution cosmid map of the 14mb fission yeast genome by nonrandom sequence-tagged site mapping. Cell, 73:121–132, 1993. → pages xvii, 8, 11 [109] T. Morgan. The Theory of the Gene. Yale University Press, New Haven, 1926. → pages 7 [110] M. Muffato and H. Roest-Crollius. Paleogenomics, or the recovery of lost genomes from the mist of times. Bioessays, 30:122–134, 2008. → pages 12 [111] W. Murphy, D. Larkin, A. E. van der Wind, G. Bourque, G. Tesler, L. Auvil, J. Beever, B. Chowdhary, F. Galibert, L. Gatzke, C. Hitte, S. Meyers, D. Milan, E. Ostrander, G. Pape, H. Parker, T. Raudsepp, M. Rogatcheva, L. Schook, L. Skow, M. Welge, J. Womack, S. O’Brien, P. Pevzner, and H. Lewin. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science, 309(5734): 613–617, July 2005. → pages 11, 12, 17 [112] G. Nemhauser and L. Wolsey. Integer and Combinatorial Optimization. Discrete Mathematics and Optimization. Wiley, 1988. → pages 7 136 [113] M. Novick. Generalized pq-trees. Technical Report 89-1074, Cornell University, 1989. → pages 4 [114] J. Opatrny. Total ordering problem. 
SIAM Journal of Computing, 8(1): 111–114, 1979. → pages 104 [115] M. Oswald and G. Reinelt. Polyhedral aspects of the consecutive ones problem. In Proceedings of the 6th Annual International Computing and Combinatorics Conference (COCOON), volume 1858 of LNCS, pages 373–382. Springer, 2000. → pages 6 [116] M. Oswald and G. Reinelt. Constructing new facets of the consecutive ones polytope. In Proceedings of the 5th International Workshop on Combinatorial Optimization–”Eureka, You Shrink!”, volume 2570 of LNCS, pages 147–157. Springer, 2003. → pages 6 [117] A. Ouangraoua, F. Boyer, A. McPherson, E. Tannier, and C. Chauve. Prediction of contiguous regions in the amniote ancestral genome. In Proceedings of the International Symposium on Bioinformatics Research and Applications (ISBRA), Lecture Notes in Computer Science, pages 173–185. Springer, 2009. → pages 22 [118] R. Overbeek, M. Fonstein, M. D’Souza, G. Pusch, and M. Maltsev. The use of gene clusters to infer functional coupling. In Proceedings of the National Academy of Sciences USA, volume 96, pages 2896–2901, 1999. → pages 8 [119] M. Palazzolo, S. Sawyer, C. Martin, D. Smoller, and D. Hartl. Optimized strategies for sequence-tagged-site selection in genome mapping. In Proceedings of the National Academy of Sciences, volume 88, pages 8034–8038, U.S.A., 1991. → pages xvii, 8, 11 [120] C. Papadimitriou. Computational Complexity. Addison Wesley, 1994. → pages 46, 47, 62, 74 [121] L. Parida. Using PQ structures for genomic rearrangement phylogeny. Journal of Computational Biology, 13(10):1685–1700, 2006. → pages 17 [122] S. Pasek, A. Bergeron, J. Risler, A. Louis, E. Ollivier, and M. Raffinot. Identification of genomic features using microsyntenies of domains: domain teams. Genome Research, 15(6):867–874, 2005. → pages 21, 22, 23, 78 [123] I. Pe’er, T. Pupko, R. Shamir, and R. Sharan. Incomplete directed perfect phylogeny. SIAM J. Computing, 33:590–607, 2004. → pages 100 137 [124] G. Pontecorvo. 
Trends in Genetic Analysis. Columbia University Press, New York, 1958. → pages 7 [125] S. Rahmann and G. Klau. Integer linear programs for discovering approximate gene clusters. In Proceedings of the Workshop on Algorithms in Bioinformatics (WABI), volume 4175 of Lecture Notes in Bioinformatics, pages 298–306. Springer Verlag, 2006. → pages 78 [126] V. Rascol, P. Pontarotti, and A. Levasseur. Ancestral animal genomes reconstruction. Current Opinions in Immunology, 19(5):542–546, 2007. → pages 12 [127] F. Richard, M. Lombard, and B. Dutrillaux. Reconstruction of the ancestral karyotype of eutherian mammals. Chromosome Research, 11:605–618, 2002. → pages 11, 16, 18, 19 [128] C. Richardson and M. Jasin. Frequent chromosomal translocations induced by DNA double-strand breaks. Nature, 405:697–700, 2000. → pages 20 [129] W. Robinson. A method for chronologically ordering archaeological deposits. American Antiquity, 16:293–301, 1951. → pages 2, 6 [130] M. Rocchi, N. Archidiacono, and R. Stanyon. Ancestral genome reconstruction: an integrated, multi-disciplinary approach is needed. Genome Research, 16:1441–1444, 2006. → pages 12, 13 [131] A. Ruf and A. Sch¨obel. Set covering with almost consecutive ones property. Discrete Optimization, 1(2):215–228, 2004. → pages 6, 7, 24 [132] H. Ryser. Combinatorial configurations. SIAM Journal on Applied Mathematics, 17(3):593–602, 1969. → pages 3 [133] M. Sadqi, E. de Alba, R. P´erez-Jim´enez, J. Sanchez-Ruiz, and B. M. noz. A designed protein as experimental model of primordial folding. In Proceedings of the National Academy of Sciences USA, volume 106, pages 4127–4132, 2009. → pages 10 [134] D. Sankoff, C. Zheng, and Q. Zhu. Polyploids, genome halving and phylogeny. Bioinformatics, 23:433–439, 2007. → pages 27 [135] J. B. Saxe. Dynamic-programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Alg. and Discr. Meth., 1(4):363–369, 1980. → pages 25, 54, 55, 56, 57, 58, 59, 60, 121, 122 138 [136] A. 
Schrijver. Theory of Linear and Integer Programming. Wiley, 1986. → pages 6 [137] C. Semple and M. Steel. Phylogenetics. Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, 2003. → pages iii, xviii [138] M. Steel. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9:91–116, 1992. → pages 30, 103 [139] J. Stoye and R. Wittler. A unified approach for reconstructing ancient gene clusters. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2009. → pages 27 [140] A. Sturtevant and T. Dobzhansky. Inversions in the third chromosome of wild races of drosophilia pseudoobsura, and their use in the study of the history of the species. In Proceedings of the National Academy of Sciences, number 22, pages 448–450, 1936. → pages 11 [141] M. Svartman, G. Stone, J. Page, and R. Stanyon. A chromosome painting test of the basal eutherian karyotype. Chromosome Research, 12:45–53, 2004. → pages 11, 19 [142] M. Svartman, G. Stone, and R. Stanyon. The ancestral eutherian karyotype is present in xenarthra. PLoS Genetics 2, (e109), 2006. → pages 11, 19 [143] J. Tang and L. Zhang. The consecutive ones submatrix problem for sparse matrices. Algorithmica, 48:287–299, 2007. → pages 21 [144] J. Trowsdale. Genomic structure and function in the MHC. Trends Genet., 9:117–122, 1993. → pages 30 [145] A. C. Tucker. A structure theorem for the consecutive 1’s property. J. of Comb. Theory, Series B, 12:153–162, 1972. → pages x, 2, 3, 6, 120, 122 [146] Y. van de Peer. Computational approaches to unveiling ancient genome duplications. Nature Reviews, 5:752–763, 2004. → pages 27 [147] A. Veinott and H. Wagner. Optimal capacity scheduling. Operational Research, 10:518–547, 1962. → pages 6, 7 [148] T. Warnow. Tree compatibility and inferring evolutionary history. J. Algorithms, 16:388–407, 1994. → pages 30 139 [149] S. Weis and R. Reischuk. The complexity of physical mapping with strict chimerism. 
In Proceedings of the Sixth Annual COCOON, volume 1858 of LNCS, pages 383–395. Springer, 2000. → pages 7, 21, 22 [150] J. Wienberg. The evolution of eutherian chromosomes. Current Opinion in Genetics and Development, (6):657–666, 2004. → pages 11, 16, 18, 19 [151] R. Wittler and J. Stoye. Consistency of sequence-based gene clusters. In Proceedings of RECOMB Comparitive Genomics, volume 6398 of Lecture Note in Bioinformatics, pages 252–263. Springer, 2010. → pages ii, v, 27, 28, 73, 74, 82, 89, 118, 123 [152] R. Wittler, J. Maˇnuch, M. Patterson, and J. Stoye. Consistency of sequence-based gene clusters. Journal of Computational Biology, 18(9): 1023–1039, 2011. → pages v [153] F. Yang, E. Alkalaeva, P. Perelman, A. Pardini, W. Harrison, P. O’Brien, B. Fu, A. Graphodasky, M. Ferguson-Smith, and T. Robinson. Reciprocal chromosome painting among human, aardvark, and elephant (superorder afrotheria) reveals the likely eutherian ancestral karyotype. In Proceedings of the National Academy of Sciences, volume 100, pages 1062–1066, U.S.A., 2003. → pages 11, 16, 18, 19 140
Variants of the Consecutive-Ones Property motivated by the reconstruction of ancestral species Patterson, Murray 2012
Item Metadata
Title | Variants of the Consecutive-Ones Property motivated by the reconstruction of ancestral species |
Creator | Patterson, Murray |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | The polynomial-time decidable Consecutive-Ones Property (C1P) of binary matrices, formally introduced in 1965 by Fulkerson and Gross, has since found applications in many areas. In this thesis, we propose and study several variants of this property that are motivated by the reconstruction of ancestral species. We first propose the Gapped C1P, or the (k,delta)-C1P: a binary matrix M has the (k,delta)-C1P for integers k and delta if the columns of M can be permuted such that each row contains at most k blocks of 1's and no two neighboring blocks of 1's are separated by a gap of more than delta 0's. The C1P is equivalent to the (1,0)-C1P. We show that for every bounded and unbounded k ≥ 2, delta ≥ 1, (k,delta) ≠ (2,1), deciding the (k,delta)-C1P is NP-complete [Goldberg et al., 1995]. We also provide an algorithm for a relevant case of the (2,1)-C1P. We then study the (k,delta)-C1P with a bound d on the maximum number of 1's in any row (the maximum degree) of M. We show that the (d,k,delta)-C1P is polynomial-time decidable when all three parameters are fixed constants. Since fixing d also fixes k (k ≤ d), the only case left to consider is the (d,k,infinity)-C1P (when delta is unbounded). We show that for every d > k ≥ 2, deciding the (d,k,infinity)-C1P is NP-complete. We also study the C1P with Multiplicity (mC1P), introduced by Wittler and Stoye [2010]: a binary matrix M on columns S = {1,...,n} has the mC1P for multiplicity vector m:S→ ℕ if there is a sequence sigma on S such that (i) sigma contains each s ∈ S at most m(s) times, and (ii) for each row r of M, the set of columns that have entry 1 in r form at least one subsequence of sigma. We show that deciding the mC1P, and two restricted variants thereof, are NP-complete, for M having maximum degree 3 (6 for one of the variants), and for m(s) ≤ 2 for all s ∈ S. We also give a tractability result for the mC1P that is motivated by handling telomeres in the reconstruction of ancestral species.
Finally, we study the Generalized Cladistic Character Compatibility (GCCC) Problem, a generalization of the Perfect Phylogeny Problem [Semple and Steel, 2003] introduced by Benham et al. [1995]. We use the structure of the PQ-tree [Booth and Lueker, 1976] associated with the C1P to give algorithms for several cases of the GCCC Problem. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2012-01-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-ShareAlike 3.0 Unported |
DOI | 10.14288/1.0052159 |
URI | http://hdl.handle.net/2429/40128 |
Degree | Doctor of Philosophy - PhD |
Program | Computer Science |
Affiliation | Science, Faculty of; Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2012-05 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-sa/3.0/ |
AggregatedSourceRepository | DSpace |
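The (k,delta)-C1P definition in the Description field above can be illustrated with a small brute-force check. This is only a sketch for tiny matrices, not an algorithm from the thesis: it tries every column permutation, which takes exponential time, and the names `row_blocks_and_gaps` and `has_gapped_c1p` are hypothetical, chosen for this example.

```python
from itertools import permutations


def row_blocks_and_gaps(row):
    """Return (number of blocks of 1's, gap lengths between neighboring blocks)."""
    ones = [i for i, value in enumerate(row) if value == 1]
    if not ones:
        return 0, []
    # A gap exists wherever two successive 1's are not adjacent.
    gaps = [b - a - 1 for a, b in zip(ones, ones[1:]) if b - a > 1]
    return len(gaps) + 1, gaps


def has_gapped_c1p(matrix, k, delta):
    """Return a column order witnessing the (k,delta)-C1P, or None if none exists.

    Assumption for this sketch: `matrix` is a list of equal-length rows of 0/1.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        satisfied = True
        for row in matrix:
            permuted = [row[c] for c in order]
            blocks, gaps = row_blocks_and_gaps(permuted)
            if blocks > k or any(gap > delta for gap in gaps):
                satisfied = False
                break
        if satisfied:
            return order
    return None


# Rows {a,b}, {a,c}, {a,d} on columns a, b, c, d: column a would need three
# neighbors, so the matrix lacks the C1P, i.e. the (1,0)-C1P.
example = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
print(has_gapped_c1p(example, 1, 0))
print(has_gapped_c1p(example, 2, 1))
```

With k = 1 and delta = 0 the check reduces to the ordinary C1P. The example matrix fails that test but passes once each row may have two blocks separated by a single 0, as in the column order b, a, c, d.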